Using Confidential Data for Domain Adaptation of Neural Machine Translation

Sohyung Kim, Arianna Bisazza, Fatih Turkmen


Abstract
We study the problem of domain adaptation in Neural Machine Translation (NMT) when domain-specific data cannot be shared due to confidentiality or copyright issues. As a first step, we propose to fragment the data into phrase pairs and fine-tune a generic NMT model on a random sample of these fragments rather than on full sentences. Despite the loss of long segments for the sake of confidentiality protection, we find that NMT quality can benefit considerably from this adaptation, and that further gains can be obtained with a simple tagging technique.
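The adaptation recipe in the abstract — sample phrase pairs from the confidential corpus and mark them with a tag before fine-tuning — can be sketched as below. This is an illustrative sketch only: the phrase pairs are assumed to have been extracted beforehand (the abstract does not specify the extraction method), and the tag token `<dom>` and function name are hypothetical, not from the paper.

```python
import random

def prepare_finetuning_data(phrase_pairs, sample_size, domain_tag="<dom>", seed=13):
    """Sample phrase pairs and prepend a domain tag to each source side.

    phrase_pairs: list of (source_phrase, target_phrase) strings, assumed to
    have been extracted from the confidential in-domain corpus already.
    The tag token and sampling scheme here are illustrative choices.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sample = rng.sample(phrase_pairs, min(sample_size, len(phrase_pairs)))
    # Prepend the domain tag to the source side of each sampled pair.
    return [(f"{domain_tag} {src}", tgt) for src, tgt in sample]

# Toy example with made-up English-Dutch medical phrase pairs:
pairs = [("patient history", "anamnese"),
         ("blood pressure", "bloeddruk"),
         ("follow-up visit", "controleafspraak")]
tagged = prepare_finetuning_data(pairs, sample_size=2)
```

The tagged pairs would then replace full in-domain sentences as fine-tuning data for the generic NMT model, so that no long confidential segments leave the data owner.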
Anthology ID:
2021.privatenlp-1.6
Volume:
Proceedings of the Third Workshop on Privacy in Natural Language Processing
Month:
June
Year:
2021
Address:
Online
Editors:
Oluwaseyi Feyisetan, Sepideh Ghanavati, Shervin Malmasi, Patricia Thaine
Venue:
PrivateNLP
Publisher:
Association for Computational Linguistics
Pages:
46–52
URL:
https://aclanthology.org/2021.privatenlp-1.6
DOI:
10.18653/v1/2021.privatenlp-1.6
Cite (ACL):
Sohyung Kim, Arianna Bisazza, and Fatih Turkmen. 2021. Using Confidential Data for Domain Adaptation of Neural Machine Translation. In Proceedings of the Third Workshop on Privacy in Natural Language Processing, pages 46–52, Online. Association for Computational Linguistics.
Cite (Informal):
Using Confidential Data for Domain Adaptation of Neural Machine Translation (Kim et al., PrivateNLP 2021)
PDF:
https://aclanthology.org/2021.privatenlp-1.6.pdf
Optional supplementary data:
 2021.privatenlp-1.6.OptionalSupplementaryData.zip
Code
 sohyo/using-confidential-data-for-nmt