Finite State Machine Pattern-Root Arabic Morphological Generator, Analyzer and Diacritizer

Maha Alkhairy, Afshan Jafri, David Smith


Abstract
We describe and evaluate the Finite-State Arabic Morphologizer (FSAM), a concatenative (prefix-stem-suffix) and templatic (root-pattern) morphologizer that generates and analyzes undiacritized Modern Standard Arabic (MSA) words and diacritizes them. Our bidirectional, unified-architecture finite state machine (FSM) is based on morphotactic MSA grammatical rules. The FSM models the root-pattern structure related to semantics and syntax, which makes it readily scalable, unlike the stem tabulations in prevailing systems. We evaluate the coverage and accuracy of our model, where coverage is the percentage of words in Tashkeela (a large corpus) that can be analyzed. Accuracy is computed against a gold standard, comprising words and their properties, created from the intersection of the UD PADT treebank and Tashkeela. Coverage of analysis (extraction of root and properties from a word) is 82%. Accuracy results are: root computed from a word (92%), word generation from a root (100%), non-root properties of a word (97%), and diacritization (84%). FSAM’s non-root results match or surpass MADAMIRA’s; root results are not compared because of the concatenative nature of publicly available morphologizers.
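The two evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the `toy_analyze` lookup, and the transliterated toy words are all hypothetical, and counting an unanalyzed gold word as an error is an assumption made here for simplicity.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical analyzer type: returns a root string, or None when no analysis exists.
Analyzer = Callable[[str], Optional[str]]


def coverage(words: Iterable[str], analyze: Analyzer) -> float:
    """Percentage of words for which the analyzer returns some analysis."""
    word_list = list(words)
    if not word_list:
        return 0.0
    analyzed = sum(1 for w in word_list if analyze(w) is not None)
    return 100.0 * analyzed / len(word_list)


def root_accuracy(gold: Dict[str, str], analyze: Analyzer) -> float:
    """Percentage of gold words whose predicted root equals the gold root.

    Assumption: a word with no analysis counts as incorrect.
    """
    if not gold:
        return 0.0
    correct = sum(1 for word, root in gold.items() if analyze(word) == root)
    return 100.0 * correct / len(gold)


# Toy stand-in for a real analyzer: a lookup table of invented, transliterated entries.
TOY_ROOTS = {"kataba": "ktb", "kitAb": "ktb", "darasa": "drs"}


def toy_analyze(word: str) -> Optional[str]:
    return TOY_ROOTS.get(word)


corpus = ["kataba", "kitAb", "darasa", "xyz"]   # "xyz" has no analysis
gold = {"kataba": "ktb", "darasa": "drs"}        # invented gold roots

print(coverage(corpus, toy_analyze))       # 3 of 4 words analyzed
print(root_accuracy(gold, toy_analyze))    # 2 of 2 gold roots matched
```

In this sketch coverage is measured over a raw word list, while accuracy is measured only over words that have a gold annotation, mirroring the abstract's distinction between the Tashkeela corpus and the gold standard built from its intersection with UD PADT.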
Anthology ID:
2020.lrec-1.473
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3834–3841
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.473
Cite (ACL):
Maha Alkhairy, Afshan Jafri, and David Smith. 2020. Finite State Machine Pattern-Root Arabic Morphological Generator, Analyzer and Diacritizer. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3834–3841, Marseille, France. European Language Resources Association.
Cite (Informal):
Finite State Machine Pattern-Root Arabic Morphological Generator, Analyzer and Diacritizer (Alkhairy et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.473.pdf