Low-Density Language Bootstrapping: the Case of Tajiki Persian

Karine Megerdoomian, Dan Parvaz


Abstract
Low-density languages pose difficulties for standard approaches to natural language processing that depend on large online corpora. Using Persian as a case study, we propose a novel method for bootstrapping MT capability for a low-density language that is closely related to a higher-density variant. Tajiki Persian is a low-density language written in the Cyrillic alphabet, while Iranian Persian (Farsi) is written in an extended version of the Arabic script and has many computational resources available. Despite the orthographic differences, the literary written forms of the two languages are almost identical. The paper describes the development of a comprehensive finite-state transducer that converts Tajiki text to Farsi script; the resulting transliterated document is then run through an existing Persian-to-English MT system. Because of divergences in mapping between the two writing systems, as well as phonological and lexical distinctions, the system uses contextual cues (such as the position of a phoneme in a word) together with available Farsi resources (such as a morphological analyzer to handle differences in affixal structure and a lexicon to disambiguate the analyses) to control the potential combinatorial explosion. The results point to a valuable strategy for the rapid prototyping of MT packages for language pairs with a similarly uneven density of resources.
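The abstract's core idea, generating several Farsi-script candidates for each Tajiki word and then pruning them with positional cues and a lexicon, can be illustrated with a small sketch. This is not the authors' finite-state transducer: it is plain Python, the mapping table `CANDIDATES` covers only a handful of letters, the word-initial rule for "о" is a single simplified positional cue, and `LEXICON` is a hypothetical four-word stand-in for a real Farsi lexicon. No morphological analysis is attempted.

```python
from itertools import product

# Hypothetical, deliberately tiny mapping table (an assumption for this
# sketch, not the paper's rule set). One Cyrillic letter can correspond to
# several Perso-Arabic letters, and a short vowel may be left unwritten ("").
CANDIDATES = {
    "к": ["ک"],
    "т": ["ت", "ط"],
    "б": ["ب"],
    "с": ["س", "ص", "ث"],
    "ҳ": ["ه", "ح"],
    "н": ["ن"],
    "м": ["م"],
    "и": ["", "ی"],
}

def letter_options(ch, position):
    """Return the possible Farsi-script spellings of one Cyrillic letter."""
    # Positional cue (simplified): the long vowel written "о" is spelled
    # with alef-madda word-initially and with plain alef elsewhere.
    if ch == "о":
        return ["آ"] if position == 0 else ["ا"]
    # Letters missing from the toy table are passed through unchanged.
    return CANDIDATES.get(ch, [ch])

def candidates(word):
    """Enumerate every Farsi-script spelling the toy table allows."""
    options = [letter_options(ch, i) for i, ch in enumerate(word)]
    return ["".join(combo) for combo in product(*options)]

# Hypothetical stand-in for a Farsi lexicon.
LEXICON = {"کتاب", "صاحب", "آب", "نام"}

def transliterate(word, lexicon=LEXICON):
    """Keep only the candidate spellings attested in the lexicon."""
    return [c for c in candidates(word) if c in lexicon]

if __name__ == "__main__":
    for tajiki in ["китоб", "соҳиб", "об", "ном"]:
        print(tajiki, len(candidates(tajiki)), transliterate(tajiki))
```

Even in this toy setting, "соҳиб" expands to twelve candidate spellings before the lexicon lookup leaves a single one, which hints at why unconstrained mapping would grow combinatorially on real text.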
Anthology ID: L08-1591
Volume: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month: May
Year: 2008
Address: Marrakech, Morocco
Editors: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/827_paper.pdf
Cite (ACL): Karine Megerdoomian and Dan Parvaz. 2008. Low-Density Language Bootstrapping: the Case of Tajiki Persian. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal): Low-Density Language Bootstrapping: the Case of Tajiki Persian (Megerdoomian & Parvaz, LREC 2008)