The CMU METAL Farsi NLP Approach

Weston Feely, Mehdi Manshadi, Robert Frederking, Lori Levin


Abstract
While many high-quality tools are available for analyzing major languages such as English, equivalent freely available tools for important but lower-resourced languages such as Farsi are more difficult to acquire and integrate into a useful NLP front end. We report here on an accurate and efficient Farsi analysis front end that we have assembled, which may be useful to others who wish to work with written Farsi. The pre-existing components and resources that we incorporated include the Carnegie Mellon TurboParser and TurboTagger (Martins et al., 2010) trained on the Dadegan Treebank (Rasooli et al., 2013), the Uppsala Farsi text normalizer PrePer (Seraji, 2013), the Uppsala Farsi tokenizer (Seraji et al., 2012a), and Jon Dehdari’s PerStem (Jadidinejad et al., 2010). Combined with additional normalization and tokenization modules that we have developed and made available, this set of tools achieves a dependency parsing labeled attachment score of 89.49%, an unlabeled attachment score of 92.19%, and a label accuracy score of 91.38% on a held-out parsing test data set. All of the components and resources used are freely available. In addition to describing the components and resources, we also explain the rationale for our choices.
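The three scores quoted in the abstract follow the usual dependency-parsing definitions: a token counts toward the unlabeled attachment score (UAS) if its predicted head is correct, toward label accuracy (LA) if its predicted relation label is correct, and toward the labeled attachment score (LAS) only if both are correct. The sketch below illustrates that arithmetic and is not the authors' evaluation script. The function name `attachment_scores`, the `(head, label)` token representation, and the toy labels are all hypothetical, and it assumes every token is scored, since the abstract does not say how punctuation was handled.

```python
from typing import List, Tuple

# Hypothetical token representation: (head_index, relation_label).
# Head index 0 stands for the artificial root node.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float, float]:
    """Return (LAS, UAS, LA) as percentages over token-aligned analyses."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty analysis")

    n = len(gold)
    # UAS: the predicted head matches the gold head.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LA: the predicted label matches the gold label.
    label_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # LAS: both head and label match.
    both_ok = sum(g == p for g, p in zip(gold, pred))

    return 100 * both_ok / n, 100 * head_ok / n, 100 * label_ok / n


# Toy four-token sentence with made-up labels (not the Dadegan label set).
gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (3, "MOD")]
pred = [(2, "SBJ"), (0, "ROOT"), (2, "MOD"), (2, "MOD")]

las, uas, la = attachment_scores(gold, pred)
print(f"LAS={las:.2f}%  UAS={uas:.2f}%  LA={la:.2f}%")
```

In the toy example, the third token has the right head but the wrong label and the fourth has the right label but the wrong head, so LAS is lower than both UAS and LA. The same ordering appears in the reported figures (89.49% LAS against 92.19% UAS and 91.38% LA).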
Anthology ID:
L14-1481
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4052–4055
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/596_Paper.pdf
Cite (ACL):
Weston Feely, Mehdi Manshadi, Robert Frederking, and Lori Levin. 2014. The CMU METAL Farsi NLP Approach. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4052–4055, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
The CMU METAL Farsi NLP Approach (Feely et al., LREC 2014)