Design and compilation of a specialized Spanish-German parallel corpus

Carla Parra Escartín


Abstract
This paper discusses the design and compilation of the TRIS corpus, a specialized parallel corpus of Spanish and German texts intended for phraseological research aimed at improving statistical machine translation. The corpus is based on the European database of the Technical Regulations Information System (TRIS), containing 995 original documents written in German and Spanish and their translations into Spanish and German, respectively. The corpus is still under development: the first version, with 97 aligned file pairs, was released in the first META-NORD upload of metadata and resources in November 2011. The second version, described in this paper, contains 205 file pairs that have been completely aligned at sentence level, amounting to approximately 1,563,000 words and 70,648 aligned sentence pairs.
Anthology ID:
L12-1326
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2199–2206
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/577_Paper.pdf
Cite (ACL):
Carla Parra Escartín. 2012. Design and compilation of a specialized Spanish-German parallel corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2199–2206, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Design and compilation of a specialized Spanish-German parallel corpus (Escartín, LREC 2012)