Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair

Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas, Filip Klubička


Abstract
This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. To gather linguistically relevant data from top-level domains, we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain “.hr” and the Slovene top-level domain “.si”, and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. Finally, we present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.
Anthology ID:
L16-1471
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2949–2956
URL:
https://aclanthology.org/L16-1471
Cite (ACL):
Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas, and Filip Klubička. 2016. Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2949–2956, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair (Ljubešić et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1471.pdf