Conference

Automatic Diacritization of Arabic Text and Poetry Using Pretrained Byte-to-Byte Language Models and Multiphase Training

Presented in: 2025 1st International Conference on Computational Intelligence Approaches and Applications (ICCIAA)

Abstract: Manual diacritization demands extensive time and a profound comprehension of Arabic grammar and morphology, while current automatic systems often suffer from accuracy and efficiency issues and require substantial training data and computational resources. The byte-to-byte model (ByT5), a variant of text-to-text transfer transformers, emerges as a promising solution for automating Arabic text and poetry diacritization within natural language processing (NLP). This research aims to address these challenges, focusing particularly on poetry, where undiacritized or partially diacritized text poses a significant hurdle. Leveraging the model's byte-level processing and auto-tokenization features, along with a multi-phase training strategy, is essential to overcome the scarcity of training data. The proposed solution achieves a diacritic error rate (DER) of 0.86% and a word error rate (WER) of 2.64% for Arabic prose diacritization. Arabic poetry diacritization presents additional challenges, and on poetry the solution reaches a DER of 3.28% and a WER of 10.74%.