Tajik-Farsi Persian Transliteration Using Statistical Machine Translation

Chris Irwin Davis


Abstract
Tajik Persian is a dialect of Persian spoken primarily in Tajikistan and written with a modified Cyrillic alphabet. Iranian Persian, or Farsi, as it is natively called, is the lingua franca of Iran and is written with the Persian alphabet, a modified Arabic script. Although spoken Tajik and Farsi are mutually intelligible to educated speakers of both, the difference between the writing systems is a barrier to text compatibility between them. This paper presents a system to transliterate text between these two Persian dialects, which use incompatible writing systems. The system also serves as a mechanism for sharing computational linguistic resources between the two, which is relevant because of the disparity in resources for Tajik versus Farsi.
Anthology ID:
L12-1601
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3988–3995
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/1012_Paper.pdf
Cite (ACL):
Chris Irwin Davis. 2012. Tajik-Farsi Persian Transliteration Using Statistical Machine Translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3988–3995, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Tajik-Farsi Persian Transliteration Using Statistical Machine Translation (Davis, LREC 2012)