Data Filtering using Cross-Lingual Word Embeddings

Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, Hermann Ney


Abstract
Data filtering for machine translation (MT) is the task of selecting a subset of a given, possibly noisy corpus with the aim of maximizing the performance of an MT system trained on the selected data. Over the years, many different filtering approaches have been proposed. However, varying task definitions and data conditions make it difficult to draw a meaningful comparison. In the present work, we aim for a more systematic approach to the task at hand. First, we analyze the performance of language identification, a tool commonly used for data filtering in the MT community, and identify specific weaknesses. Based on our findings, we then propose several novel data filtering methods based on cross-lingual word embeddings. We compare our approaches to one of the winning methods from the WMT 2018 shared task on parallel corpus filtering on three real-life, high-resource MT tasks. We find that this method, which performed very strongly in the WMT shared task, does not perform well under our more realistic task conditions. While our approaches come out on top on all three tasks, different variants perform best on different tasks. Further experiments on the WMT 2020 shared task on parallel corpus filtering show that our methods achieve results comparable to the strongest submissions of this campaign.
Anthology ID:
2021.naacl-main.15
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
162–172
URL:
https://aclanthology.org/2021.naacl-main.15
DOI:
10.18653/v1/2021.naacl-main.15
Cite (ACL):
Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, and Hermann Ney. 2021. Data Filtering using Cross-Lingual Word Embeddings. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–172, Online. Association for Computational Linguistics.
Cite (Informal):
Data Filtering using Cross-Lingual Word Embeddings (Herold et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.15.pdf
Video:
https://aclanthology.org/2021.naacl-main.15.mp4