Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation

Mohamed Maamouri, Seth Kulick, Ann Bies


Abstract
The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both “vocalized” and “unvocalized” forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the basis that it more closely represents the “real-world” situation. We point out some problems with this usage of the unvocalized data and explain why the unvocalized form does not in fact represent “real-world” data. This is due to some aspects of the treebank annotation that to our knowledge have never before been published.
Anthology ID:
L08-1361
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/706_paper.pdf
Cite (ACL):
Mohamed Maamouri, Seth Kulick, and Ann Bies. 2008. Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation (Maamouri et al., LREC 2008)