Conference

Accurate Jordanian Arabic Dialect Text Recognition Using the ByT5 Pre-Trained Language Model

Presented in: 2025 1st Int’l Conf. on Computational Intelligence Approaches and Applications (ICCIAA)

Abstract: Jordanian Arabic, a prominent Levantine dialect, increasingly appears in written form on social media, which calls for robust NLP tools that can recognize it accurately. This paper presents a ByT5-based approach to identifying the Jordanian dialect among various Arabic dialects and Modern Standard Arabic. ByT5, a token-free, pre-trained transformer model, processes UTF-8 text as raw bytes, reducing preprocessing overhead and boosting robustness. Our dataset includes approximately 20K Jordanian texts from the Arabic Online Commentary (AOC), the Shami Dialect Corpus (SDC), and the MADAR project. Experiments yield an 88.7% F1 score in detecting the Jordanian dialect. Expanding the dataset with 19K additional sequences (JODA) improves performance to 95.3%. These results highlight ByT5's effectiveness in Arabic dialect recognition, offering a promising solution for social media analysis, machine translation, and other NLP tasks.
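
To illustrate what "processes UTF-8 text as raw bytes" means in practice, the following is a minimal sketch of byte-level encoding in the style of ByT5's tokenizer, where each UTF-8 byte is shifted by 3 to leave room for the pad, end-of-sequence, and unknown ids. It is not the authors' pipeline: the function names `byt5_style_ids` and `decode_ids` are hypothetical, and the sample phrase is an illustrative Levantine greeting, not an item from the paper's dataset.

```python
# Sketch of ByT5-style byte-level encoding (assumption: ids 0, 1, 2 are
# reserved for pad, end-of-sequence, and unknown, so each byte is shifted by 3).

BYTE_OFFSET = 3  # number of reserved special ids before the byte range
EOS_ID = 1       # end-of-sequence id appended to every input


def byt5_style_ids(text: str) -> list[int]:
    """Encode text as UTF-8 bytes, shift each byte by the offset, append EOS."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_ids(ids: list[int]) -> str:
    """Invert the mapping: drop special ids, shift back, decode as UTF-8."""
    raw = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return raw.decode("utf-8", errors="ignore")


# Illustrative phrase only (hypothetical example, not from the paper's data).
sample = "شو اخبارك"
ids = byt5_style_ids(sample)

# Each Arabic letter takes two UTF-8 bytes, so the id sequence is longer
# than the character count; no vocabulary or word segmentation is needed.
print(len(sample), "characters ->", len(ids), "ids")
print(decode_ids(ids) == sample)
```

Because the mapping depends only on bytes, misspellings, mixed scripts, and emoji common in social media text never fall outside the vocabulary; the trade-off is longer input sequences than a subword tokenizer would produce.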