TLAXCALA: a multilingual corpus of independent news

Antonio Toral


Abstract
We acquire corpora in the domain of independent news from the Tlaxcala website. We build monolingual corpora for 15 languages and parallel corpora for all combinations of those 15 languages. These corpora cover languages for which very few such resources exist (e.g. Tamazight). We describe the acquisition process in detail and report statistics on the resulting corpora, mainly quantitative measures such as corpus size per language (for the monolingual corpora) and per language pair (for the parallel corpora). To the best of our knowledge, these are the first publicly available parallel and monolingual corpora for the domain of independent news. We also create models for unsupervised sentence splitting for all the languages in the study.
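As a rough illustration of the scale the abstract implies, the sketch below counts how many parallel corpora "all combinations" of 15 languages would give. It assumes one corpus per unordered language pair; the abstract does not say whether translation direction is counted separately. The language labels are hypothetical placeholders, since the abstract names only Tamazight.

```python
from itertools import combinations

# Placeholder labels: the abstract does not list the 15 languages.
languages = [f"lang{i:02d}" for i in range(1, 16)]

# Assumption: one parallel corpus per unordered pair, regardless of direction.
pairs = list(combinations(languages, 2))

print(len(languages))  # number of monolingual corpora: 15
print(len(pairs))      # number of parallel corpora: 105
```

If each direction were counted separately, the figure would double to 210.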
Anthology ID:
L14-1093
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3689–3692
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1134_Paper.pdf
Cite (ACL):
Antonio Toral. 2014. TLAXCALA: a multilingual corpus of independent news. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3689–3692, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
TLAXCALA: a multilingual corpus of independent news (Toral, LREC 2014)