IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus

Septina Dian Larasati


Abstract
This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to ensure its quality. We also apply language-specific text processing, such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two formats: ‘plain’, stored in text format, and ‘morphologically enriched’, stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.
Anthology ID:
L12-1374
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
902–906
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/644_Paper.pdf
Cite (ACL):
Septina Dian Larasati. 2012. IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 902–906, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus (Larasati, LREC 2012)