Prof. Gheith Abandah from the University of Jordan (UJ) is spending his 2025/2026 sabbatical year at the American University of Sharjah (AUS), where he is leading an ambitious research project on Arabic language technology. Hosted by the Department of Computer Science and Engineering and supported by the AUS High-Performance Computing Center, Prof. Abandah is working on advancing automatic diacritization of Arabic poetry using state-of-the-art AI methods.
The project, titled "Curriculum-Guided Transfer Learning and LLM Fine-Tuning for Near-Perfect Diacritization of Arabic Poetry," addresses a long-standing challenge in computational linguistics. Diacritizing Arabic poetry is significantly more complex than prose because of meter-driven syntax, rare morphological patterns, and the limited availability of fully vowelized poetic corpora. These factors have kept error rates for even the best existing systems above 3%.
Building on earlier work that achieved a 42% reduction in diacritization error rate (DER), the current project aims to push DER below 2%. The research adopts a curriculum-guided, multi-phase methodology. It plans to leverage pretrained Arabic-centric large language models (LLMs) and lightweight recurrent architectures to extract rhythmic and contextual cues from undiacritized verses. These models will then be incrementally fine-tuned to predict short vowels accurately.
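For readers unfamiliar with the metric, DER is commonly computed as the share of letters whose predicted diacritics differ from the reference. The sketch below illustrates that idea under stated assumptions and is not the project's evaluation script: the function names `split_diacritics` and `der` are hypothetical, only the marks in the Unicode range U+064B to U+0652 are treated as diacritics, whitespace is ignored, and every letter is counted (published conventions differ, for example on whether undiacritized letters or case endings are included).

```python
# Illustrative DER sketch; names and counting convention are assumptions.
# Arabic diacritics U+064B..U+0652: tanween forms, fatha, damma, kasra,
# shadda, sukun.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Return a list of (base_letter, diacritic_marks) pairs."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding letter
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        elif not ch.isspace():
            pairs.append((ch, ""))
    return pairs


def der(reference, prediction):
    """Fraction of letters whose diacritics differ from the reference."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(prediction)
    if [base for base, _ in ref] != [base for base, _ in hyp]:
        raise ValueError("reference and prediction must share base letters")
    if not ref:
        return 0.0
    # Sort the marks so that shadda+fatha equals fatha+shadda.
    errors = sum(sorted(r) != sorted(h) for (_, r), (_, h) in zip(ref, hyp))
    return errors / len(ref)


# Three letters, each with fatha, versus a prediction with kasra on the
# second letter: one error out of three letters.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
prediction = "\u0643\u064E\u062A\u0650\u0628\u064E"
score = der(reference, prediction)
```

Under this convention, a DER below 2% means fewer than one letter in fifty carries a wrong diacritic.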
Methodologically, the project introduces a new loss function capable of handling partially diacritized training data, which is common in real-world resources. It also explores adaptive transfer-learning schedules to benefit from large prose corpora, while carefully avoiding overfitting to non-poetic styles. Evaluation will not be limited to DER; it will also examine gains in reading fluency and performance on downstream tasks such as automatic meter classification.
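The article does not describe the new loss function, so the following is only a sketch of one common way to train on partially diacritized text: a cross-entropy averaged over the letters that actually carry a label, with unlabeled letters skipped. The sentinel `UNLABELED`, the function name `masked_cross_entropy`, and the toy probabilities are all assumptions made for illustration; the project's own loss may work quite differently.

```python
import math

# Hypothetical sentinel for letters with no diacritic annotation.
UNLABELED = -1


def masked_cross_entropy(probs, labels):
    """Mean negative log-likelihood over labeled positions only.

    probs  -- one probability distribution over diacritic classes per letter
    labels -- one class index per letter, or UNLABELED where the training
              text leaves the letter undiacritized
    """
    total = 0.0
    count = 0
    for dist, label in zip(probs, labels):
        if label == UNLABELED:
            continue  # unlabeled letters contribute no loss
        total += -math.log(max(dist[label], 1e-12))  # clamp to avoid log(0)
        count += 1
    return total / count if count else 0.0


# Toy example: three letters, two classes, middle letter unlabeled.
probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
labels = [0, UNLABELED, 1]
loss = masked_cross_entropy(probs, labels)
```

Averaging over labeled positions only keeps sparsely annotated verses from being penalized for letters whose correct diacritics are unknown.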
The project is a collaborative effort involving Dr. Mohammad Abdel-Majeed and Eng. Asma Abdelkareem from UJ, Prof. Nuha Alshaar from AUS, and Eng. Rabie Otoum from industry. All resulting models, code, and evaluation scripts are planned for release as open-source tools. The release is intended to contribute to the broader ecosystems of computational linguistics, Arabic NLP, and digital humanities, and to support future research on classical and modern Arabic poetry.