Instance Weighting for Neural Machine Translation Domain Adaptation

Instance weighting has been widely applied to phrase-based machine translation domain adaptation. However, it is challenging to apply to Neural Machine Translation (NMT) directly, because NMT is not a linear model. In this paper, two instance weighting techniques, i.e., sentence weighting and domain weighting with a dynamic weight learning strategy, are proposed for NMT domain adaptation. Empirical results on the IWSLT English-German/French tasks show that the proposed methods can substantially improve NMT performance by up to 2.7-6.7 BLEU points, outperforming the existing baselines by up to 1.6-3.6 BLEU points.


Introduction
In Statistical Machine Translation (SMT), unrelated additional corpora, known as out-of-domain corpora, have been shown not to benefit some domains and tasks, such as TED-talks and IWSLT tasks (Axelrod et al., 2011). Several Phrase-based SMT (PBSMT) domain adaptation methods have been proposed to overcome the lack of substantial data in some specific domains and languages: i) Data selection. The main idea is to score the out-of-domain data using models trained on the in-domain and out-of-domain data, respectively, and then select training data according to these scores (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Hoang and Sima'an, 2014a,b; Durrani et al., 2015; Chen et al., 2016).
ii) Model Linear Interpolation. Several PBSMT models, such as language models, translation models, and reordering models, are trained individually on each corpus. These models are then linearly combined to achieve the best performance (Sennrich, 2012; Sennrich et al., 2013; Durrani et al., 2015, 2016; Imamura and Sumita, 2016).
iii) Instance Weighting. Instance weighting has been applied to several NLP domain adaptation tasks (Jiang and Zhai, 2007), such as POS tagging, entity type classification, and especially PBSMT (Matsoukas et al., 2009; Shah et al., 2010; Foster et al., 2010; Rousseau et al., 2011; Zhou et al., 2015; Wang et al., 2016; Imamura and Sumita, 2016). These methods first score each instance or domain as a weight, using rules or statistical methods, and then train PBSMT models with these weights.
For Neural Machine Translation (NMT) domain adaptation, sentence selection can also be used (Chen et al., 2016; Wang et al., 2017). Meanwhile, model linear interpolation cannot easily be applied to NMT directly, because NMT is not a linear model. There are two methods for model combination in NMT: i) the in-domain and out-of-domain models can be ensembled (Jean et al., 2015); ii) NMT further training (fine-tuning) (Luong and Manning, 2015), in which training is performed in two steps: first, the NMT system is trained using out-of-domain data, and then further trained using in-domain data. Recently, Chu et al. (2017) made an empirical comparison of NMT further training (Luong and Manning, 2015) and domain control (Kobus et al., 2016), which applies word-level domain features to the word embedding layer. These approaches provide natural baselines for comparison.
To the best of our knowledge, there is no existing work concerning instance weighting in NMT. The main challenge is that NMT is neither a linear model nor a combination of linear models into which instance weights can be integrated directly. To overcome this difficulty, we integrate the instance weight into the NMT objective function.
Two techniques, i.e., sentence weighting and domain weighting, are proposed to apply instance weighting to NMT. In addition, we propose a dynamic weight learning strategy to tune the proposed domain weights.

NMT Background
An attention-based NMT system is a neural network that directly models the conditional probability $p(y|x)$ of translating a source sentence $x = \{x_1, ..., x_n\}$ into a target sentence $y = \{y_1, ..., y_m\}$:

$$p(y \mid x) = \prod_{j=1}^{m} \mathrm{softmax}\big(g(y_j \mid y_{j-1}, s_j, c_j)\big), \quad (1)$$

with $g$ being the transformation function that outputs a vocabulary-sized vector, $s_j$ being the RNN hidden unit, and $c_j$ being the weighted sum of source annotations $H_x$. The NMT training objective (to maximize) is formulated as

$$J = \sum_{(x,y) \in D} \log p(y \mid x), \quad (2)$$

where $D$ is the parallel training corpus.
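For concreteness, the following is a minimal Python sketch (ours, not part of Nematus) of how Eqs. (1)-(2) combine per-token softmax probabilities into the corpus-level objective; the token probabilities below are hypothetical stand-ins for the decoder's softmax outputs.

```python
import numpy as np

def sentence_log_prob(token_probs):
    # Log-probability of one target sentence: the sum of
    # log p(y_j | y_<j, x) over its tokens; each factor is the
    # softmax output of Eq. (1) for the reference token.
    return float(np.sum(np.log(token_probs)))

def corpus_objective(per_sentence_token_probs):
    # Training objective J of Eq. (2): the sum of sentence
    # log-probabilities over the parallel corpus D.
    return sum(sentence_log_prob(p) for p in per_sentence_token_probs)

# Toy usage with hypothetical per-token model probabilities
# for two target sentences.
J = corpus_objective([np.array([0.4, 0.7, 0.9]),
                      np.array([0.8, 0.6])])
print(J)
```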

Instance weighting for NMT
In this paper, we integrate the instance weight into the NMT objective function. Our main hypothesis is that the in-domain data should have higher weights in the NMT objective function than the out-of-domain data. The training corpus $D$ can be divided into an in-domain part $D_{in}$ and an out-of-domain part $D_{out}$, so Eq. (2) can be rewritten as

$$J = \sum_{(x,y) \in D_{in}} \log p(y \mid x) + \sum_{(x,y) \in D_{out}} \log p(y \mid x), \quad (3)$$

where $(x, y)$ is a parallel sentence pair.

Sentence Weighting
A general method is to give each sentence a weight. As Axelrod et al. (2011) mentioned, there is some pseudo in-domain data in the out-of-domain data, which is close to the in-domain data. We can apply their bilingual cross-entropy method to score each pair $(x_i, y_i)$ with a weight $\lambda_i$, the higher the better:

$$\lambda_i = \delta\big( (H_{out}(x_i) - H_{in}(x_i)) + (H_{out}(y_i) - H_{in}(y_i)) \big), \quad (4)$$

where $H_{in}(x_i)$, for example, indicates the cross-entropy between sentence $x_i$ and the in-domain language model (Axelrod et al., 2011), and $\delta$ is min-max normalization (Priddy and Keller, 2005), which rescales the scores to $[0, 1]$. The $\lambda$ for in-domain data is set to one directly.
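As an illustration, here is a minimal Python sketch of Eq. (4); H_in_src, H_out_src, H_in_tgt, and H_out_tgt are assumed callables returning a sentence's per-word cross-entropy under in-domain and out-of-domain language models (the names and interface are ours, for illustration only).

```python
def min_max_normalize(scores):
    # Min-max normalization delta (Priddy and Keller, 2005):
    # linearly rescale the raw scores to [0, 1].
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in scores]

def sentence_weights(pairs, H_in_src, H_out_src, H_in_tgt, H_out_tgt):
    # Bilingual cross-entropy difference (Axelrod et al., 2011):
    # a pair that looks more in-domain than out-of-domain gets a
    # larger raw score, hence a larger weight lambda_i after scaling.
    raw = [(H_out_src(x) - H_in_src(x)) + (H_out_tgt(y) - H_in_tgt(y))
           for x, y in pairs]
    return min_max_normalize(raw)
```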
The updated objective function with sentence weighting ($J_{sw}$) can be written as

$$J_{sw} = \sum_{(x_i, y_i) \in D} \lambda_i \log p(y_i \mid x_i). \quad (5)$$
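A minimal sketch of Eq. (5) under the same assumptions as above, with each sentence represented by its per-token model probabilities and its weight $\lambda_i$:

```python
import numpy as np

def sentence_weighted_objective(per_sentence_token_probs, weights):
    # J_sw of Eq. (5): each sentence's log-probability (sum of
    # log p(y_j | y_<j, x) over its tokens) is scaled by its weight
    # lambda_i; lambda_i = 1 for in-domain sentences.
    return sum(w * float(np.sum(np.log(p)))
               for p, w in zip(per_sentence_token_probs, weights))
```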

Domain Weighting
An alternative way is to modify the weight of each domain in the objective function. For this, we design a weight parameter $\lambda_{in}$ for the in-domain data. The updated objective function with domain weighting ($J_{dw}$) can be estimated as

$$J_{dw} = \lambda_{in} \sum_{(x,y) \in D_{in}} \log p(y \mid x) + \sum_{(x,y) \in D_{out}} \log p(y \mid x). \quad (6)$$

3.2.1 Batch Weighting

A straightforward domain weighting implementation is to modify the ratio between in-domain and out-of-domain data in each NMT mini-batch. That is, we can increase the in-domain weight by increasing the number of in-domain sentences included in a mini-batch. The in-domain data ratio $R_{in}$ in each NMT mini-batch can be calculated as

$$R_{in} = \frac{|D_{in}|}{|D_{in}| + |D_{out}|}, \quad (7)$$

where $|D_{in}|$ and $|D_{out}|$ are the numbers of sentences from the in-domain and out-of-domain data in each mini-batch, respectively.
Taking the IWSLT EN-DE corpus in Table 1 as an example, the original ratio $R_{in}$ between the in-domain data and all of the data is around 1:20. That is, an 80-sized mini-batch would include around four sentences from the in-domain data and 76 from the out-of-domain data on average. For batch weighting, we can set the ratio $R_{in}$ to 1:2 manually; that is, we load 40 in-domain and 40 out-of-domain sentences into each mini-batch.
In practice, we create two data iterators, one for the in-domain data and one for the out-of-domain data. Both data sets are randomly shuffled and then loaded into the corresponding data iterators before each epoch. For each mini-batch, the amount of data drawn from these two iterators is determined by the ratio $R_{in}$. Because the out-of-domain data is much larger than the in-domain data, the in-domain data is loaded and trained for several epochs while the out-of-domain data is trained for only one epoch.
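The following Python sketch illustrates this two-iterator scheme, assuming d_in and d_out are lists of sentence pairs; the function names are ours and do not reflect Nematus's actual API.

```python
import random

def endless_iterator(corpus):
    # Reshuffle the corpus and restart whenever it is exhausted, so
    # the small in-domain set can run through several epochs while
    # the large out-of-domain set completes one.
    while True:
        random.shuffle(corpus)
        for pair in corpus:
            yield pair

def weighted_batches(d_in, d_out, batch_size=80, r_in=0.5):
    # Fill each mini-batch according to the in-domain ratio R_in
    # of Eq. (7), drawing from the two separate data iterators.
    it_in = endless_iterator(list(d_in))
    it_out = endless_iterator(list(d_out))
    n_in = int(batch_size * r_in)
    while True:
        batch = [next(it_in) for _ in range(n_in)]
        batch += [next(it_out) for _ in range(batch_size - n_in)]
        random.shuffle(batch)
        yield batch
```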

Dynamic Weight Tuning
For batch weight tuning, one way is to fix the weights of several systems and select the system that performs best on the development data. Besides this, we also tried to learn the batch weight dynamically. That is, the initial in-domain data ratio in the mini-batch is set to 0%. We then increase the in-domain ratio by 10% whenever the training cost does not decrease for ten consecutive evaluations (the training cost is evaluated on the development data set every 1K training batches).
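A sketch of this schedule, assuming one development-cost value per 1K-batch evaluation; in the real system the ratio update would be interleaved with training rather than driven by a precomputed list.

```python
def dynamic_in_domain_ratio(dev_costs, patience=10, step=0.1):
    # Dynamic weight tuning sketch: start from R_in = 0 and raise it
    # by 10% whenever the development cost has not improved for ten
    # consecutive evaluations (one evaluation per 1K batches).
    r_in, best, stale = 0.0, float("inf"), 0
    for cost in dev_costs:  # one entry per evaluation point
        if cost < best:
            best, stale = cost, 0
        else:
            stale += 1
            if stale >= patience and r_in < 1.0:
                r_in = min(1.0, r_in + step)
                stale = 0
        yield r_in
```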

Data Sets
The proposed methods were evaluated by adapting WMT corpora to IWSLT corpora (which mainly contain TED talks). Statistics on the data sets are shown in Table 1.
• The IWSLT 2015 English (EN) to German (DE) training corpus (Cettolo et al., 2015) was used as the in-domain training data. The out-of-domain corpora contained the WMT 2014 English-German corpora. These adaptation corpus settings were the same as those used in (Luong and Manning, 2015).

NMT Systems
We implemented the proposed methods in Nematus (Sennrich et al., 2017), one of the state-of-the-art NMT frameworks; our implementation is available online. The default settings of Nematus were applied to all NMT systems (both baselines and the proposed methods): the word embedding dimension was 620, the hidden layer size was 1000, the batch size was 80, the maximum sequence length was 50, and the beam size for decoding was 10. A 30K-sized vocabulary, created using both the in-domain and out-of-domain data, was applied to all of the systems. Default dropout was applied. Each NMT model was trained for 500K batches using the ADADELTA optimizer (Zeiler, 2012). Training was conducted on a single Tesla P100 GPU, taking 7-10 days. We observed that all of the systems converged before 500K batches of training.
Regarding implementation cost, we only add the two data iterators mentioned in Section 3.2.1. Regarding training cost, batch weighting can actually accelerate model convergence on the development data in our experiments, because the development data is also in-domain. Overall, the overhead is small.

Results and Analysis
In Tables 2 and 3, SMT indicates standard PBSMT (Koehn et al., 2007) models trained on the corresponding corpora (in, out, and in+out). The in, out, and in+out settings indicate that the in-domain data, the out-of-domain data, and their mixture, respectively, were used as the NMT training corpora.
For the related NMT domain adaptation baselines, "ensemble" indicates that the in and out models were ensembled in decoding, and "sampler" indicates that we sampled duplicated in-domain data into the training data to manually make the in/out ratio 1:1, as sketched below. If the mini-batch size were as large as the whole corpus, the sampling method and the batch weighting method would be the same. Batch weighting makes the data more balanced within each single mini-batch; however, since the mini-batch size is limited, these two methods are different.
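For illustration, a minimal sketch of this "sampler" baseline (the function name and rounding choice are ours):

```python
import random

def oversample_in_domain(d_in, d_out):
    # "Sampler" baseline sketch: duplicate the in-domain corpus until
    # the in/out ratio is roughly 1:1, then shuffle the mixture; unlike
    # batch weighting, individual mini-batches may still be unbalanced.
    copies = max(1, round(len(d_out) / len(d_in)))
    mixed = list(d_out) + list(d_in) * copies
    random.shuffle(mixed)
    return mixed
```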
We also compared with Axelrod et al. (2011)'s sentence selection and Kobus et al. (2016)'s domain control method, which adds a word feature (in or out) to each word in the training corpora. For all of the baselines, we tried our best to re-implement their methods. Translation performance was measured by case-insensitive BLEU (Papineni et al., 2002), with the paired bootstrap re-sampling test (Koehn, 2004). The marks in Tables 2 and 3 indicate whether the proposed methods were significantly better than the best-performing baselines shown in bold ("++": better at significance level α = 0.01; "+": better at α = 0.05).
From Tables 2 and 3, we observed the following:

• Adding out-of-domain data to in-domain data, or directly using out-of-domain data, degraded NMT performance.
• The proposed instance weighting methods substantially improved NMT performance over in by up to 2.7-6.7 BLEU points, and outperformed the best existing baselines by up to 1.6-3.6 BLEU points.
• Among the proposed methods, batch weighting performed the best, although it was the simplest one. The reasons may be that: a) batch weighting directly balances the in-domain data ratio in each mini-batch, overcoming the in-domain data sparseness problem; and b) the batch weight can be tuned on development data, whereas the sentence weights are learned and fixed before NMT training.
• The dynamic weight tuning strategy outperformed the fixed weight tuning strategy by 0.6-2.4 BLEU points.

Figure 1 shows the batch weight tuning experiments on the development data of IWSLT EN-DE, where the horizontal axis indicates the in-domain ratio $R_{in}$ in Eq. (7). "Fix" indicates that several systems were trained with fixed weights and the best-performing system was selected. "Dynamic" indicates that only one system was trained and the domain weight was learned dynamically, as mentioned in Section 3.2.2. As shown in Figure 1, fixed weight learning reached the highest BLEU on the development data at around 50%, and dynamic learning at around 60%. If we kept training with dynamic learning after 100% in-domain data was reached, the performance tended to become similar to using only in-domain data from the beginning.

Further Training
Further training (Luong and Manning, 2015) can be viewed as a special case of the proposed batch weighting method: it trains the NMT model using 0% in-domain data at first and then using 100% in-domain data. In comparison, our batch weighting keeps some ratio of out-of-domain data during the whole training process. In addition, further training can work together with batch weighting: NMT is trained with 0% in-domain data at first and then further trained with the batch weighting method ("Luong + bw" in Table 4). $R_{in}$ was tuned on the development data; as mentioned in Section 3.2.2, "bw + dynamic tuning" indicates that the batch weight was learned dynamically. Table 4 shows that batch weighting worked synergistically with Luong's further training method and slightly improved NMT performance; the "bw + dynamic tuning" method outperformed both of them. We observed that the original further training overfitted quickly, after around one epoch of training. Keeping some out-of-domain data prevents further training from overfitting.
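A hypothetical sketch of the resulting "Luong + bw" ratio schedule (parameter names are ours; pure further training is the special case r_in_after = 1.0):

```python
def r_in_schedule(batch_step, switch_step, r_in_after=0.5):
    # "Luong + bw" sketch: plain out-of-domain training (R_in = 0)
    # up to switch_step, then batch-weighted further training that
    # keeps some out-of-domain data to resist overfitting.
    return 0.0 if batch_step < switch_step else r_in_after
```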

Conclusion and Future Work
In this paper, we proposed two straightforward instance weighting methods, together with a dynamic weight learning strategy, for NMT domain adaptation. Empirical results on the IWSLT EN-DE/FR tasks showed that the proposed methods can substantially improve NMT performance and outperform state-of-the-art NMT adaptation methods.
The current sentence weighting method is a simple adaptation of existing PBSMT weighting methods. In the future, we will study sentence weighting methods designed specifically for NMT domain adaptation.