Better OOV Translation with Bilingual Terminology Mining

Matthias Huck, Viktor Hangya, Alexander Fraser


Abstract
Unseen words, also called out-of-vocabulary words (OOVs), are difficult for machine translation. In neural machine translation (NMT), byte-pair encoding can be used to represent OOVs, but they are still often translated incorrectly. We improve the translation of OOVs in NMT using easy-to-obtain monolingual data. We look for OOVs in the text to be translated and translate them using simple-to-construct bilingual word embeddings (BWEs). In our MT experiments, we take the 5-best candidates, a choice motivated by intrinsic mining experiments. Using all five proposed target-language words as queries, we mine target-language sentences. We then back-translate these sentences, forcing the back-translation of each of the five proposed target-language translation candidates to be the original source-language OOV. We show that fine-tuning our system on this synthetic data can dramatically improve the translation of OOVs. In our experiments, we use a system trained on Europarl and mine sentences containing medical terms from monolingual data.
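The candidate-generation and mining steps described in the abstract can be illustrated with a minimal sketch. It assumes that source and target words already live in one shared BWE space, that candidates are ranked by cosine similarity, and that mining is an exact token match. The function names, the toy three-dimensional vectors, and the example words are all hypothetical and are not taken from the paper. The forced back-translation step needs a trained NMT system and is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def n_best_translations(src_vec, tgt_embeddings, n=5):
    """Return the n target words closest to src_vec in the shared space."""
    ranked = sorted(
        tgt_embeddings.items(),
        key=lambda item: cosine(src_vec, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:n]]

def mine_sentences(candidates, tgt_sentences):
    """Keep target sentences that contain at least one candidate token."""
    wanted = set(candidates)
    mined = []
    for sentence in tgt_sentences:
        hits = wanted & set(sentence.split())
        if hits:
            mined.append((sentence, sorted(hits)))
    return mined

# Toy data (hypothetical): the embedding of one source-language OOV
# and a tiny target-language vocabulary in the same space.
oov_vector = [0.9, 0.1, 0.0]
target_embeddings = {
    "Nephritis": [0.88, 0.12, 0.05],
    "Niere": [0.7, 0.3, 0.1],
    "Zystitis": [0.6, 0.4, 0.0],
    "Haus": [0.1, 0.9, 0.2],
    "Parlament": [0.0, 0.1, 0.95],
}
target_sentences = [
    "Die Nephritis ist eine Erkrankung der Niere .",
    "Das Parlament tagt heute .",
]

candidates = n_best_translations(oov_vector, target_embeddings, n=2)
mined = mine_sentences(candidates, target_sentences)
```

In this toy setup, each mined sentence would then be back-translated, with the matched candidate constrained to come out as the original OOV, to produce synthetic parallel data for fine-tuning.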
Anthology ID:
P19-1581
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5809–5815
URL:
https://aclanthology.org/P19-1581
DOI:
10.18653/v1/P19-1581
Cite (ACL):
Matthias Huck, Viktor Hangya, and Alexander Fraser. 2019. Better OOV Translation with Bilingual Terminology Mining. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5809–5815, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Better OOV Translation with Bilingual Terminology Mining (Huck et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1581.pdf
Video:
https://aclanthology.org/P19-1581.mp4