A Unified Model for Arabizi Detection and Transliteration using Sequence-to-Sequence Models

Ali Shazal, Aiza Usman, Nizar Habash


Abstract
While online Arabic is primarily written using the Arabic script, a Roman-script variety called Arabizi is often seen on social media. Although this representation captures the phonology of the language, it is not a one-to-one mapping with the Arabic script version. This issue is exacerbated by the fact that Arabizi on social media is Dialectal Arabic, which does not have a standard orthography. Furthermore, Arabizi tends to include a lot of code mixing between Arabic and English (or French). To map Arabizi text to Arabic script in the context of complete utterances, previously published efforts have split Arabizi detection and Arabic script target into two separate tasks. In this paper, we present the first effort on a unified model for Arabizi detection and transliteration into a code-mixed output with consistent Arabic spelling conventions, using a sequence-to-sequence deep learning model. Our best system achieves 80.6% word accuracy and 58.7% BLEU on a blind test set.
Anthology ID:
2020.wanlp-1.15
Volume:
Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
167–177
URL:
https://aclanthology.org/2020.wanlp-1.15
Cite (ACL):
Ali Shazal, Aiza Usman, and Nizar Habash. 2020. A Unified Model for Arabizi Detection and Transliteration using Sequence-to-Sequence Models. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 167–177, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
A Unified Model for Arabizi Detection and Transliteration using Sequence-to-Sequence Models (Shazal et al., WANLP 2020)
PDF:
https://aclanthology.org/2020.wanlp-1.15.pdf