Exploiting Arabic Diacritization for High Quality Automatic Annotation

Nizar Habash, Anas Shahrour, Muhamed Al-Khalil


Abstract
We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of a quality comparable to that of human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.
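As a rough illustration of the basic approach described in the abstract, the following Python sketch narrows an analyzer's candidate analyses to those consistent with whatever diacritics appear on the input word. It is not the authors' implementation: the Buckwalter-style ASCII transliteration, the function names (`segment`, `is_consistent`, `filter_analyses`), and the toy candidate list are all assumptions made for illustration.

```python
# Assumption: words are in Buckwalter-style transliteration, where these
# characters stand for short vowels, sukun, shadda, and tanween.
DIACRITICS = set("aiuo~FKN")


def segment(word):
    """Split a transliterated word into (letter, diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            # Attach the diacritic to the preceding letter.
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def is_consistent(observed, candidate):
    """True if a (fully or partially) diacritized input word is compatible
    with a fully diacritized candidate: same letters, and every diacritic
    written on the input also appears on the same letter of the candidate."""
    obs, cand = segment(observed), segment(candidate)
    if len(obs) != len(cand):
        return False
    return all(
        o_letter == c_letter and set(o_marks) <= set(c_marks)
        for (o_letter, o_marks), (c_letter, c_marks) in zip(obs, cand)
    )


def filter_analyses(observed, analyses):
    """Keep only the candidate analyses compatible with the input diacritics."""
    return [a for a in analyses if is_consistent(observed, a["diac"])]


# Hypothetical candidates an analyzer might return for the bare word "ktb".
ANALYSES = [
    {"diac": "kataba", "lemma": "katab", "pos": "verb"},   # "he wrote"
    {"diac": "kutub", "lemma": "kitAb", "pos": "noun"},    # "books"
    {"diac": "kutiba", "lemma": "katab", "pos": "verb"},   # "it was written"
]

if __name__ == "__main__":
    for word in ("ktb", "kutb", "kutub"):
        kept = [a["diac"] for a in filter_analyses(word, ANALYSES)]
        print(word, "->", kept)
```

In this toy setting, an undiacritized input leaves all candidates in play, a partially diacritized input removes some of them, and a fully diacritized input leaves a single analysis. Any candidates that survive the filter would still need to be ranked by the analyzer's usual disambiguation.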
Anthology ID:
L16-1681
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4298–4304
URL:
https://aclanthology.org/L16-1681
Cite (ACL):
Nizar Habash, Anas Shahrour, and Muhamed Al-Khalil. 2016. Exploiting Arabic Diacritization for High Quality Automatic Annotation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4298–4304, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Exploiting Arabic Diacritization for High Quality Automatic Annotation (Habash et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1681.pdf