Conference

Efficient Stochastic Error Injection for Optimizing Large Language Models in Arabic Spelling Correction

Presented in: 2025 International Conference on New Trends in Computing Sciences (ICTCS)

Abstract: Spelling mistakes in written communication among Arabic speakers have become more common, especially with the rise of social media. Correcting these errors requires effective natural language processing (NLP) tools. This paper introduces a ByT5-based approach to address two types of spelling mistakes: directed and general. ByT5, a token-free, pre-trained transformer model, processes UTF-8 text as raw bytes, which minimizes preprocessing and enhances robustness. We perform experiments with varying error injection rates to identify the optimal rate for correction. Our results are evaluated on two test sets, Test200 and TSMTS, which contain real Arabic sentences with actual spelling mistakes. The findings demonstrate ByT5's strong performance in spelling error correction, reducing the character error rate (CER) from 5% to 1.37% on the Test200 set and from 5% to 1.77% on the TSMTS set. Overall, our results show that this approach offers a promising solution for spelling error correction and contributes to generating effective synthetic datasets for training large language models.
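To illustrate the idea of stochastic error injection at a controlled rate, the sketch below corrupts a clean Arabic sentence by applying random character-level edits (substitution, insertion, or deletion) with a fixed per-character probability. This is a minimal illustrative sketch, not the paper's actual injection procedure; the choice of edit operations, the `ARABIC_LETTERS` alphabet, and the uniform sampling are all assumptions made here for demonstration.

```python
import random

# Hypothetical alphabet for sampling replacement/inserted characters;
# the paper's actual character inventory is not specified here.
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهويءأإآةىؤئ"


def inject_errors(text, rate=0.05, rng=None):
    """Corrupt roughly `rate` of the non-space characters in `text`
    with a random edit (substitute, insert, or delete).

    This is an illustrative sketch of stochastic error injection,
    assuming uniform sampling over three edit types.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch != " " and rng.random() < rate:
            op = rng.choice(("sub", "ins", "del"))
            if op == "sub":
                out.append(rng.choice(ARABIC_LETTERS))  # replace character
            elif op == "ins":
                out.append(ch)
                out.append(rng.choice(ARABIC_LETTERS))  # insert after it
            # "del": append nothing, character is dropped
        else:
            out.append(ch)
    return "".join(out)
```

Pairing each corrupted sentence with its clean original yields (noisy, clean) training pairs; sweeping `rate` over several values is one way to search for the injection rate that best matches real-world error distributions, as the paper's experiments vary.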