A Basic Language Resource Kit for Persian

Mojgan Seraji, Beáta Megyesi, Joakim Nivre


Abstract
Persian, with roughly 100,000,000 speakers worldwide, belongs to the group of languages with less developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Our goal is therefore to develop open source resources, such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exploring the reusability of existing resources and adapting state-of-the-art methods for linguistic annotation. We present fully functional tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and parsing. As for resources, we describe the Uppsala PErsian Corpus (UPEC), a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization modified for more appropriate syntactic annotation. The corpus consists of 2,782,109 tokens and is annotated with parts of speech and morphological features. A treebank is derived from UPEC with an annotation scheme based on Stanford Typed Dependencies; it is planned to consist of 10,000 sentences, of which 215 have already been annotated.
Keywords: BLARK for Persian, PoS tagged corpus, Persian treebank
Anthology ID:
L12-1162
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2245–2252
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/338_Paper.pdf
Cite (ACL):
Mojgan Seraji, Beáta Megyesi, and Joakim Nivre. 2012. A Basic Language Resource Kit for Persian. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2245–2252, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
A Basic Language Resource Kit for Persian (Seraji et al., LREC 2012)