CoRoLa — The Reference Corpus of Contemporary Romanian Language

Verginica Barbu Mititelu, Elena Irimia, Dan Tufiș


Abstract
We present the project of creating CoRoLa, a reference corpus of contemporary Romanian (from 1945 onwards). Internationally, the project takes its place among initiatives that gather very large text collections, pre-process and annotate them at several levels, and document them with metadata (CMDI). It is a joint effort of two institutes of the Romanian Academy. We foresee a corpus of more than 500 million word forms covering all functional styles of the language. Although the vast majority of texts will be written, we also target about 300 hours of oral texts, each with an associated transcript. Most of the texts will come from books; the rest will be harvested from newspapers, booklets, technical reports, etc. Pre-processing includes cleaning the data, harmonising the diacritics, sentence splitting and tokenization. Annotation will be done at the morphological level in a first stage, followed by lemmatization, with the possibility of adding syntactic, semantic and discourse annotation at a later stage. A core of CoRoLa is described in the article. The target users of the corpus are researchers in linguistics and language processing, teachers of Romanian, and students.
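The pre-processing steps named in the abstract (diacritic harmonisation, sentence splitting, tokenization) can be illustrated with a minimal Python sketch. This is not the CoRoLa pipeline. The function names are hypothetical, and the sketch assumes that harmonising diacritics means mapping the legacy cedilla letters (ş, ţ) to the standard Romanian comma-below letters (ș, ț). The splitting and tokenization rules are deliberately naive.

```python
import re
import unicodedata
from typing import List

# Assumption: legacy cedilla forms are mapped to the comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})

def harmonise_diacritics(text: str) -> str:
    """Compose characters (NFC), then replace cedilla letters."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)

def split_sentences(text: str) -> List[str]:
    """Naive rule: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence: str) -> List[str]:
    """Words (hyphenated forms kept whole) or single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

if __name__ == "__main__":
    raw = "Aceasta este o \u0163ar\u0103. Am citit-o!"
    for sentence in split_sentences(harmonise_diacritics(raw)):
        print(tokenize(sentence))
```

A production tokenizer for Romanian would also have to handle abbreviations and clitics (for example, splitting "citit-o" into its parts), which this sketch leaves out.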
Anthology ID:
L14-1311
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1235–1239
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/360_Paper.pdf
Cite (ACL):
Verginica Barbu Mititelu, Elena Irimia, and Dan Tufiș. 2014. CoRoLa — The Reference Corpus of Contemporary Romanian Language. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1235–1239, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
CoRoLa — The Reference Corpus of Contemporary Romanian Language (Mititelu et al., LREC 2014)