An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation

In this paper, we propose a novel domain adaptation method named “mixed fine tuning” for neural machine translation (NMT). It combines two existing approaches, namely fine tuning and multi domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus which is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against fine tuning and multi domain methods and discuss its benefits and shortcomings.


Introduction
One of the most attractive features of neural machine translation (NMT) (Bahdanau et al., 2015; Cho et al., 2014; Sutskever et al., 2014) is that it is possible to train an end to end system without the need to deal with word alignments, translation rules and complicated decoding algorithms, which are characteristic of statistical machine translation (SMT) systems. However, it is reported that NMT works better than SMT only when there is an abundance of parallel corpora. In the case of low resource domains, vanilla NMT is either worse than or comparable to SMT (Zoph et al., 2016).
Domain adaptation has been shown to be effective for low resource NMT. The conventional domain adaptation method is fine tuning, in which an out-of-domain model is further trained on in-domain data (Luong and Manning, 2015; Sennrich et al., 2016b; Servan et al., 2016; Freitag and Al-Onaizan, 2016). However, fine tuning tends to overfit quickly due to the small size of the in-domain data. On the other hand, multi domain NMT (Kobus et al., 2016) involves training a single NMT model for multiple domains. This method adds tags "<2domain>" to the source sentences in the parallel corpora to indicate domains without any modifications to the NMT system architecture. However, this method has not been studied for domain adaptation in particular.
* This work was done when the first author was a researcher of Japan Science and Technology Agency.
Motivated by these two lines of studies, we propose a new domain adaptation method called "mixed fine tuning," where we first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus that is a mix of the in-domain and out-of-domain corpora. Fine tuning on the mixed corpus instead of the in-domain corpus can address the overfitting problem. All corpora are augmented with artificial tags to indicate specific domains. We tried two different corpora settings on two different language pairs:
• Manually created resource poor corpus (Chinese-to-English translation): Using the NTCIR data (patent domain; resource rich) (Goto et al., 2013) to improve the translation quality for the IWSLT data (TED talks; resource poor) (Cettolo et al., 2015).
• Automatically extracted resource poor corpus (Chinese-to-Japanese translation): Using the ASPEC data (scientific domain; resource rich) to improve the translation quality for the Wiki data (resource poor). The parallel corpus of the latter domain was automatically extracted (Chu et al., 2016a).
We observed that "mixed fine tuning" works significantly better than methods that use fine tuning and domain tag based approaches separately. Our contributions are twofold:
• We propose a novel method that combines the best of existing approaches and show that it is effective.
• To the best of our knowledge, this is the first work on an empirical comparison of various domain adaptation methods.

Related Work
Fine tuning has also been explored for various NLP tasks using neural networks, such as sentiment analysis and paraphrase detection (Mou et al., 2016). Tag based NMT has also been shown to be effective for controlling the politeness of translations (Sennrich et al., 2016a) and for multilingual NMT (Johnson et al., 2016). Besides fine tuning and multi domain NMT using tags, another direction of domain adaptation for NMT is using in-domain monolingual data. Both training an in-domain recurrent neural network (RNN) language model for the NMT decoder (Gülçehre et al., 2015) and generating synthetic data by back translating target in-domain monolingual data (Sennrich et al., 2016b) have been studied.

Methods for Comparison
All the methods that we compare are simple and do not need any modifications to the NMT system.

Fine Tuning
Fine tuning is the conventional way for domain adaptation, and thus serves as a baseline in this study. In this method, we first train an NMT system on a resource rich out-of-domain corpus till convergence, and then fine tune its parameters on a resource poor in-domain corpus (Figure 1).

Multi Domain
The multi domain method is originally motivated by Sennrich et al. (2016a), which uses tags to control the politeness of NMT translations. The overview of this method is shown in the dotted section in Figure 2. In this method, we simply concatenate the corpora of multiple domains with two small modifications:
• Appending the domain tag "<2domain>" to the source sentences of the respective corpora. This primes the NMT decoder to generate sentences for the specific domain.
• Oversampling the smaller corpus so that the training procedure pays equal attention to each domain.
We can further fine tune the multi domain model on the in-domain data, which we call "multi domain + fine tuning."
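The two modifications above can be illustrated with a minimal sketch. It assumes corpora are held in memory as lists of (source, target) string pairs; the function names, the domain names, and the choice to place the tag at the end of the source sentence are assumptions made for illustration and are not taken from the experiments.

```python
import math
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def add_domain_tag(corpus: List[Pair], domain: str) -> List[Pair]:
    """Attach a "<2domain>" tag to every source sentence.

    Assumption: the tag is placed at the end of the source sentence.
    """
    tag = f"<2{domain}>"
    return [(f"{src} {tag}", tgt) for src, tgt in corpus]


def oversample(corpus: List[Pair], target_size: int) -> List[Pair]:
    """Repeat a corpus until it holds exactly target_size pairs."""
    if not corpus:
        return []
    repeats = math.ceil(target_size / len(corpus))
    return (corpus * repeats)[:target_size]


def build_multi_domain_corpus(in_domain: List[Pair], out_domain: List[Pair],
                              in_name: str, out_name: str) -> List[Pair]:
    """Tag both corpora, balance their sizes, and concatenate them."""
    tagged_in = add_domain_tag(in_domain, in_name)
    tagged_out = add_domain_tag(out_domain, out_name)
    # Oversample whichever corpus is smaller so both domains are equally represented.
    if len(tagged_in) < len(tagged_out):
        tagged_in = oversample(tagged_in, len(tagged_out))
    else:
        tagged_out = oversample(tagged_out, len(tagged_in))
    return tagged_in + tagged_out
```

In practice the concatenated corpus would also be shuffled before training; that step is omitted here to keep the sketch deterministic.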

Mixed Fine Tuning
The proposed mixed fine tuning method is a combination of the above methods (shown in Figure 2). The training procedure is as follows:
1. Train an NMT model on out-of-domain data till convergence.
2. Resume training the NMT model from step 1 on a mix of in-domain and out-of-domain data (by oversampling the in-domain data) till convergence.
By default, we utilize domain tags, but we also consider settings where we do not use them (i.e., "w/o tags"). We can further fine tune the model from step 2 on the in-domain data, which we call "mixed fine tuning + fine tuning." Note that in the "fine tuning" method, the vocabulary obtained from the out-of-domain data is used for the in-domain data, while for the "multi domain" and "mixed fine tuning" methods, we use a vocabulary obtained from the mixed in-domain and out-of-domain data for all the training stages. Regarding development data: for fine tuning, an out-of-domain development set is first used for training the out-of-domain NMT model, and then an in-domain development set is used for fine tuning; for multi domain, a mix of in-domain and out-of-domain development sets is used; for mixed fine tuning, an out-of-domain development set is first used for training the out-of-domain NMT model, and then a mix of in-domain and out-of-domain development sets is used for mixed fine tuning.
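The training schedule of mixed fine tuning can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `train` is a hypothetical callback that trains (or resumes training) a model on a corpus until convergence, the domain names "in" and "out" are placeholders, and the tag is assumed to be placed at the end of the source sentence.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]   # (source sentence, target sentence)
Model = dict             # stand-in type for a trained NMT model
TrainFn = Callable[[Optional[Model], List[Pair]], Model]  # hypothetical trainer


def tag(corpus: List[Pair], domain: str, use_tags: bool) -> List[Pair]:
    """Optionally attach a "<2domain>" tag to each source sentence."""
    if not use_tags:
        return list(corpus)
    return [(f"{src} <2{domain}>", tgt) for src, tgt in corpus]


def mix(in_domain: List[Pair], out_domain: List[Pair]) -> List[Pair]:
    """Oversample the in-domain data to the out-of-domain size, then concatenate."""
    if not in_domain:
        return list(out_domain)
    repeats = max(1, -(-len(out_domain) // len(in_domain)))  # ceiling division
    size = max(len(in_domain), len(out_domain))
    return list(out_domain) + (list(in_domain) * repeats)[:size]


def mixed_fine_tuning(train: TrainFn, out_domain: List[Pair], in_domain: List[Pair],
                      use_tags: bool = True, final_fine_tune: bool = False) -> Model:
    out_tagged = tag(out_domain, "out", use_tags)
    in_tagged = tag(in_domain, "in", use_tags)
    # Step 1: train on out-of-domain data till convergence.
    model = train(None, out_tagged)
    # Step 2: resume training on the mix of in-domain and out-of-domain data.
    model = train(model, mix(in_tagged, out_tagged))
    # Optional "mixed fine tuning + fine tuning" variant.
    if final_fine_tune:
        model = train(model, in_tagged)
    return model
```

Setting `use_tags=False` corresponds to the "w/o tags" setting. Vocabulary construction and development-set handling are deliberately left out of the sketch.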

Experimental Settings
We conducted NMT domain adaptation experiments in two different settings as follows:

High Quality In-domain Corpus Setting
Chinese-to-English translation was the focus of the high quality in-domain corpus setting. We utilized the resource rich patent out-of-domain data to augment the resource poor spoken language in-domain data. The patent domain MT was conducted on the Chinese-English subtask (NTCIR-CE) of the patent MT task at the NTCIR-10 workshop (Goto et al., 2013). The NTCIR-CE task uses 1M, 2k, and 2k sentences for training, development, and testing, respectively. The spoken domain MT was conducted on the Chinese-English subtask (IWSLT-CE) of the TED talk MT task at the IWSLT 2015 workshop (Cettolo et al., 2015).

MT Systems
For NMT, we used the KyotoNMT system (Cromieres et al., 2016). The NMT settings were the same as in Cromieres et al. (2016), except that we used a vocabulary size of 32k for all the experiments and did not ensemble independently trained parameters. The sizes of the source and target vocabularies, the source and target side embeddings, the hidden states, the attention mechanism hidden states, and the deep softmax output with a 2-maxout layer were set to 32,000, 620, 1000, 1000, and 500, respectively. We used 2-layer LSTMs for both the source and target sides. ADAM was used as the learning algorithm, with a dropout rate of 20% for the inter-layer dropout, and L2 regularization with a weight decay coefficient of 1e-6. The mini batch size was 64, and sentences longer than 80 tokens were discarded. We stopped training early once we observed that the BLEU score on the development set had converged. For testing, we ensembled three sets of parameters from a single training run: those with the best development loss, those with the best development BLEU, and the final parameters. Beam size was set to 100. The maximum length of the translation was set to 2 and 1.5 times the length of the source sentence for Chinese-to-English and Chinese-to-Japanese, respectively.
For performance comparison, we also conducted experiments on phrase based SMT (PBSMT). We used the Moses PBSMT system (Koehn et al., 2007) for all of our MT experiments. For each task, we trained a 5-gram language model on the target side of the training data using the KenLM toolkit 6 with interpolated Kneser-Ney discounting. In all of our experiments, we used the GIZA++ toolkit 7 for word alignment; tuning was performed by minimum error rate training (Och, 2003), and it was re-run for every experiment.
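For reference, the NMT settings reported above can be collected in a small configuration sketch. The values are those stated in the text; the class, field, and function names are hypothetical and do not correspond to actual KyotoNMT options. How the 80-token limit is applied to a sentence pair and how fractional length limits are rounded are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NMTSettings:
    """Values reported in the text; the field names are hypothetical."""
    vocab_size: int = 32_000            # source and target vocabularies
    embedding_size: int = 620           # source and target side embeddings
    hidden_size: int = 1000             # hidden states
    attention_hidden_size: int = 1000   # attention mechanism hidden states
    maxout_output_size: int = 500       # deep softmax output with a 2-maxout layer
    lstm_layers: int = 2
    dropout: float = 0.2                # inter-layer dropout
    weight_decay: float = 1e-6          # L2 regularization coefficient
    batch_size: int = 64
    max_train_length: int = 80          # longer sentences are discarded
    beam_size: int = 100


# Maximum translation length as a multiple of the source length.
MAX_LENGTH_RATIO = {"zh-en": 2.0, "zh-ja": 1.5}


def keep_for_training(src: Sequence[str], tgt: Sequence[str],
                      settings: NMTSettings) -> bool:
    # Assumption: the length limit is applied to both sides of a sentence pair.
    limit = settings.max_train_length
    return len(src) <= limit and len(tgt) <= limit


def max_output_length(src: Sequence[str], language_pair: str) -> int:
    # Assumption: fractional limits are rounded up.
    return math.ceil(MAX_LENGTH_RATIO[language_pair] * len(src))
```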
For both MT systems, we preprocessed the data as follows. For Chinese, we used KyotoMorph 8 for segmentation, which was trained on the CTB version 5 (CTB5) and SCTB (Chu et al., 2016b). For English, we tokenized and lowercased the sentences using the tokenizer.perl script in Moses. Japanese was segmented using JUMAN 9 (Kurohashi et al., 1994).
For NMT, we further split the words into subwords using byte pair encoding (BPE) (Sennrich et al., 2016c), which has been shown to be effective for the rare word problem in NMT. Another motivation of using sub-words is making the different domains share more vocabulary, which is important especially for the resource poor domain. For the Chinese-to-English tasks, we trained two BPE models on the Chinese and English vocabularies, respectively. For the Chinese-to-Japanese tasks, we trained a joint BPE model on both of the Chinese and Japanese vocabularies, because Chinese and Japanese could share some vocabularies of Chinese characters. The number of merge operations was set to 30k for all the tasks.
6 https://github.com/kpu/kenlm/
7 http://code.google.com/p/giza-pp
8 https://bitbucket.org/msmoshen/kyotomorph-beta
9 http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
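To make the notion of BPE merge operations concrete, the following is a toy re-implementation of merge learning in the style of Sennrich et al. (2016c). It is a sketch for illustration only and is not the tool used in the experiments; the function names and the tie-breaking rule between equally frequent pairs are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Symbols = Tuple[str, ...]


def merge_pair(symbols: Symbols, pair: Tuple[str, str]) -> Symbols:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words: Iterable[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges merge operations from a list of word tokens."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols: Symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Under this sketch, separate models correspond to calling `learn_bpe` once per language, whereas a joint model corresponds to a single call on the concatenation of the word lists of both languages, so that shared character sequences receive the same merges.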

Results
Tables 1 and 2 show the translation results on the Chinese-to-English and Chinese-to-Japanese tasks, respectively. The entries with SMT and NMT are the PBSMT and NMT systems, respectively; others are the different methods described in Section 3. In both tables, the numbers in bold indicate the best system and all systems that were not significantly different from the best system. The significance tests were performed using the bootstrap resampling method (Koehn, 2004) at p < 0.05.
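The significance test mentioned above can be sketched as paired bootstrap resampling in the spirit of Koehn (2004): test sentences are resampled with replacement, and we count how often one system scores higher than the other. The function names, the default number of samples, and the seed are assumptions, and `exact_match` is only a stand-in for a corpus-level metric such as BLEU.

```python
import random
from typing import Callable, Sequence

Metric = Callable[[Sequence[str], Sequence[str]], float]


def paired_bootstrap(hyp_a: Sequence[str], hyp_b: Sequence[str],
                     refs: Sequence[str], metric: Metric,
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A scores strictly higher than B."""
    if not (len(hyp_a) == len(hyp_b) == len(refs)):
        raise ValueError("all inputs must have the same number of sentences")
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; both systems share the sample.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyp_a[i] for i in idx], sample_refs)
        score_b = metric([hyp_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins += 1
    return wins / n_samples


def exact_match(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Toy corpus-level metric: share of hypotheses identical to their reference."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)
```

Under the usual reading of this test, system A would be considered significantly better than system B at p < 0.05 if the returned fraction exceeds 0.95.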
We can see that without domain adaptation, the SMT systems perform significantly better than the NMT system on the resource poor domains, i.e., IWSLT-CE and WIKI-CJ; while on the resource rich domains, i.e., NTCIR-CE and ASPEC-CJ, NMT outperforms SMT. Directly using the SMT/NMT models trained on the out-of-domain data to translate the in-domain data performs poorly. With our proposed "Mixed fine tuning" domain adaptation method, NMT significantly outperforms SMT on the in-domain tasks.
Comparing different domain adaptation methods, "Mixed fine tuning" shows the best performance. We believe the reason for this is that "Mixed fine tuning" can address the overfitting problem of "Fine tuning." We observed that both fine tuning and mixed fine tuning tend to converge after 1 epoch of training, and thus we early stopped training soon after 1 epoch. After 1 epoch of training, fine tuning overfitted very quickly, while mixed fine tuning did not overfit. In addition, "Mixed fine tuning" does not worsen the quality of out-of-domain translations, while "Fine tuning" and "Multi domain" do. One shortcoming of "Mixed fine tuning" is that compared to "Fine tuning," it took a longer time for the fine tuning process, as the time until convergence is essentially proportional to the size of the data used for fine tuning. Note that training as long as "Mixed fine tuning" is not helpful for "Fine tuning" due to overfitting.
"Multi domain" performs either as well as (IWSLT-CE) or worse than (WIKI-CJ) "Fine tuning," but "Mixed fine tuning" performs either significantly better than (IWSLT-CE) or is comparable to (WIKI-CJ) "Fine tuning." We believe the performance difference between the two tasks is due to their unique characteristics. As WIKI-CJ data is of relatively poorer quality, mixing it with out-of-domain data does not have the same level of positive effects as those obtained by the IWSLT-CE data.
The domain tags are helpful for both "Multi domain" and "Mixed fine tuning." Essentially, further fine tuning on in-domain data does not help either "Multi domain" or "Mixed fine tuning." We believe that there are two reasons for this. Firstly, the "Multi domain" and "Mixed fine tuning" methods already utilize the in-domain data used for fine tuning. Secondly, fine tuning on the small in-domain data overfits very quickly. Indeed, we observed that adding fine tuning on top of both "Multi domain" and "Mixed fine tuning" overfits at the beginning of training.
Mixed fine tuning performs significantly better on the out-of-domain NTCIR-CE test set without tags as compared to with tags (39.67 vs. 37.01). We believe the reason for this is that without tags the IWSLT-CE in-domain data can contribute more to the out-of-domain NTCIR-CE data. With tags, the NMT training tends to learn a model that pays equal attention to each domain. Without tags, the NMT training pays more attention to the NTCIR-CE data as it contains much longer sentences, although we oversampled the IWSLT-CE data. As the IWSLT-CE data is TED talks, there could be some vocabulary and content overlaps between the IWSLT-CE and the NTCIR-CE data, and thus appending the IWSLT-CE data to the NTCIR-CE data can benefit NTCIR-CE translation. In the case of WIKI-CJ and ASPEC-CJ, due to the low quality of WIKI-CJ, appending WIKI-CJ to ASPEC-CJ does not improve the ASPEC-CJ translation.

Conclusion
In this paper, we proposed a novel domain adaptation method named "mixed fine tuning" for NMT. We empirically compared our proposed method against fine tuning and multi domain methods, and have shown that it is effective but is sensitive to the quality of the in-domain data used. The presented methods are language and domain independent, and thus we believe that the general observations also hold on other languages and domains. Furthermore, we believe the contribution in this paper can be helpful for domain adaptation of other NN based natural language processing tasks.
In the future, we plan to incorporate an RNN model into our architecture to leverage abundant in-domain monolingual corpora. We also plan on exploring the effects of synthetic data by back translating large in-domain monolingual corpora.