Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation

Xinyi Wang, Graham Neubig


Abstract
To improve low-resource Neural Machine Translation (NMT) with a multilingual corpus, training on only the most related high-resource language is generally more effective than using all available data (Neubig and Hu, 2018). However, it remains an open question whether a smart data selection strategy can further improve low-resource NMT with data from other auxiliary languages. In this paper, we seek to construct a sampling distribution over all multilingual data that minimizes the training loss of the low-resource language. Based on this formulation, we propose an efficient algorithm, Target Conditioned Sampling (TCS), which first samples a target sentence and then conditionally samples its source sentence. Experiments show that TCS brings significant gains of up to 2 BLEU on three of the four languages we test, with minimal training overhead.
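The two-stage sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_pair`, the toy corpus layout, the uniform choice of target sentence, and the per-language weights in `lang_scores` are all assumptions made for illustration, and the abstract does not specify how either distribution is actually defined.

```python
import random

def sample_pair(corpus, lang_scores, rng):
    """Draw one (lang, source, target) triple in two stages.

    corpus: dict mapping a target sentence to a list of (lang, source) tuples.
    lang_scores: hypothetical non-negative weight per auxiliary language.
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Stage 1: sample a target sentence (uniformly here, which is an assumption).
    target = rng.choice(sorted(corpus))
    # Stage 2: sample a source sentence conditioned on that target,
    # weighting each candidate by the score of its language.
    candidates = corpus[target]
    weights = [lang_scores[lang] for lang, _ in candidates]
    lang, source = rng.choices(candidates, weights=weights, k=1)[0]
    return lang, source, target

# Toy data: language codes and sentences are made up for the example.
corpus = {
    "hello": [("tur", "merhaba"), ("rus", "privet")],
    "thanks": [("tur", "tesekkurler")],
}
lang_scores = {"tur": 1.0, "rus": 0.0}
rng = random.Random(0)
draws = [sample_pair(corpus, lang_scores, rng) for _ in range(200)]
```

With a weight of zero for one language, its sources are never drawn, while every target sentence can still be selected through the remaining languages.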
Anthology ID:
P19-1583
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5823–5828
URL:
https://aclanthology.org/P19-1583
DOI:
10.18653/v1/P19-1583
Cite (ACL):
Xinyi Wang and Graham Neubig. 2019. Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5823–5828, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation (Wang & Neubig, ACL 2019)
PDF:
https://aclanthology.org/P19-1583.pdf
Video:
https://aclanthology.org/P19-1583.mp4