ROMBAC: The Romanian Balanced Annotated Corpus

Radu Ion, Elena Irimia, Dan Ştefănescu, Dan Tufiș


Abstract
This article describes the collection, processing and validation of a large balanced corpus for Romanian. The annotation types and structure of the corpus are briefly reviewed. The corpus was constructed at the Research Institute for Artificial Intelligence of the Romanian Academy in the context of an international project (METANET4U). The processing covers tokenization, POS-tagging, lemmatization and chunking. The corpus is in XML format, generated by our in-house annotation tools; the corpus encoding schema is XCES compliant and the metadata specification conforms to the METANET recommendations. To the best of our knowledge, this is the first large and richly annotated corpus for Romanian. ROMBAC is intended to be the foundation of a linguistic environment containing a reference corpus for contemporary Romanian and a comprehensive collection of interoperable processing tools.
Anthology ID:
L12-1074
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
339–344
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/218_Paper.pdf
Cite (ACL):
Radu Ion, Elena Irimia, Dan Ştefănescu, and Dan Tufiș. 2012. ROMBAC: The Romanian Balanced Annotated Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 339–344, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
ROMBAC: The Romanian Balanced Annotated Corpus (Ion et al., LREC 2012)