Bifixer and Bicleaner: two open-source tools to clean your parallel data

Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón, Sergio Ortiz Rojas


Abstract
This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner. Both tools have already been used to clean highly noisy parallel content from crawled multilingual websites. Here we evaluate their performance in a different scenario: cleaning publicly available corpora commonly used to train machine translation systems. We choose four English–Portuguese corpora which we plan to use internally to compute paraphrases at a later stage. We clean the four corpora using both tools, which are described in detail, and analyse the effect of some of the cleaning steps on them. We then compare machine translation training times and quality before and after cleaning these corpora, showing a positive impact, particularly for the noisiest ones.
Anthology ID:
2020.eamt-1.31
Volume:
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
Month:
November
Year:
2020
Address:
Lisboa, Portugal
Editors:
André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, Mikel L. Forcada
Venue:
EAMT
Publisher:
European Association for Machine Translation
Pages:
291–298
URL:
https://aclanthology.org/2020.eamt-1.31
Cite (ACL):
Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón, and Sergio Ortiz Rojas. 2020. Bifixer and Bicleaner: two open-source tools to clean your parallel data. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 291–298, Lisboa, Portugal. European Association for Machine Translation.
Cite (Informal):
Bifixer and Bicleaner: two open-source tools to clean your parallel data (Ramírez-Sánchez et al., EAMT 2020)
PDF:
https://aclanthology.org/2020.eamt-1.31.pdf
Code:
bitextor/bicleaner
Data:
OpenSubtitles, WikiMatrix