New language resources for the Pashto language

Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, Karim Boudahmane


Abstract
This paper reports on the development of new language resources for Pashto, a very low-resource language spoken in Afghanistan and Pakistan. Within the scope of a multilingual data collection project, three large corpora are collected for Pashto. First, a monolingual text corpus of 100 million words is produced. Second, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million words is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation.
Anthology ID:
L12-1490
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2917–2922
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/824_Paper.pdf
Cite (ACL):
Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, and Karim Boudahmane. 2012. New language resources for the Pashto language. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2917–2922, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
New language resources for the Pashto language (Mostefa et al., LREC 2012)