Sentence Embedding for Neural Machine Translation Domain Adaptation

Although new corpora are becoming increasingly available for machine translation, only those that belong to the same or similar domains are typically able to improve translation performance. Recently Neural Machine Translation (NMT) has become prominent in the field. However, most of the existing domain adaptation methods only focus on phrase-based machine translation. In this paper, we exploit the NMT’s internal embedding of the source sentence and use the sentence embedding similarity to select the sentences which are close to in-domain data. The empirical adaptation results on the IWSLT English-French and NIST Chinese-English tasks show that the proposed methods can substantially improve NMT performance by 2.4-9.0 BLEU points, outperforming the existing state-of-the-art baseline by 2.3-4.5 BLEU points.


Introduction
Recently, Neural Machine Translation (NMT) has set new state-of-the-art benchmarks on many translation tasks (Cho et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Tu et al., 2016; Mi et al., 2016). An ever-increasing amount of data is becoming available for NMT training. However, only in-domain or related-domain corpora tend to have a positive impact on NMT performance. Unrelated additional corpora, known as out-of-domain corpora, have been shown not to benefit some domains and tasks for NMT, such as TED talks and IWSLT tasks (Luong and Manning, 2015).
To the best of our knowledge, only a few works address NMT adaptation (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016). Most traditional adaptation methods focus on Phrase-Based Statistical Machine Translation (PBSMT), and they fall broadly into two main categories, model adaptation and data selection, as follows.
For model adaptation, several PBSMT models, such as language models, translation models and reordering models, are trained individually, one per corpus. These models are then combined to achieve the best performance (Sennrich, 2012; Sennrich et al., 2013). Since these methods focus on the internal models within a PBSMT system, they are not applicable to NMT adaptation. Recently, an NMT adaptation method (Luong and Manning, 2015) was proposed. The training is performed in two steps: first the NMT system is trained using out-of-domain data, and then it is further trained using in-domain data. Empirical results show that this method can improve NMT performance, and the approach provides a natural baseline.
For adaptation through data selection, the main idea is to score the out-of-domain data using models trained from the in-domain and out-of-domain data, and to select training data from the out-of-domain data using a cut-off threshold on the resulting scores. A language model can be used to score sentences (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Wang et al., 2015), as can joint models (Hoang and Sima'an, 2014a,b) and, more recently, Convolutional Neural Network (CNN) models (Chen et al., 2016). These methods select useful sentences from the whole corpus, so they can be directly applied to NMT. However, they are specifically designed for PBSMT, and nearly all of them use models or criteria that do not have a direct relationship with the neural translation process.
For NMT sentence selection, our hypothesis is that the NMT system itself can be used to score each sentence in the training data. Specifically, an NMT system embeds the source sentence into a vector representation, and we can use these vectors to measure a sentence pair's similarity to the in-domain corpus. In comparison with the CNN or other sentence embedding methods, this method directly makes use of information induced by the NMT system itself. In addition, the proposed sentence selection method can be used in conjunction with the NMT further training method (Luong and Manning, 2015).

NMT Background
An attention-based NMT system uses a Bidirectional RNN (BiRNN) as an encoder and a decoder that emulates searching through a source sentence during decoding (Bahdanau et al., 2015). The encoder's BiRNN consists of forward and backward RNNs. Each word $x_i$ is represented by concatenating the forward hidden state $\overrightarrow{h}_i$ and the backward one $\overleftarrow{h}_i$:
$$h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i].$$
In this way, the source sentence $X = \{x_1, ..., x_{T_x}\}$ can be represented as annotations $H = \{h_1, ..., h_{T_x}\}$. In the decoder, an RNN hidden state $s_j$ for time $j$ is computed by:
$$s_j = f(s_{j-1}, y_{j-1}, c_j),$$
where $f$ is a nonlinear recurrent function and $y_{j-1}$ is the previously generated target word. The context vector $c_j$ is then computed as a weighted sum of these annotations $H$, using alignment weights $\alpha_{ji}$:
$$c_j = \sum_{i=1}^{T_x} \alpha_{ji} h_i.$$

Sentence Embedding and Selection

Sentence Embedding
A source sentence can be represented as the annotations $H$. However, the length of $H$ depends on the sentence length $T_x$. To represent a sentence as a fixed-length vector, we adopt the initial hidden layer state $s_{init}$ for the decoder as this vector:
$$s_{init} = \tanh\left(W \frac{\sum_{i=1}^{T_x} h_i}{T_x} + b\right),$$
where an average pooling layer averages the annotations $h_i$ of the source words into a fixed-length source sentence vector, and a nonlinear transition layer (whose weights $W$ and bias $b$ are jointly trained with all the other components of the NMT system) transforms this embedded source sentence vector into the initial hidden state $s_{init}$ for the decoder (Bahdanau et al., 2015).
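The pooling-plus-transition step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the Groundhog implementation: the function name `sentence_embedding`, the toy dimensions, the random weights, and the choice of tanh as the nonlinearity are all assumptions made for the example.

```python
import numpy as np

def sentence_embedding(annotations, W, b):
    """Map variable-length annotations H (T_x, d_h) to a fixed-length vector.

    Assumption: the nonlinear transition layer is tanh(W @ mean(H) + b).
    """
    pooled = annotations.mean(axis=0)   # average pooling over source words
    return np.tanh(W @ pooled + b)      # nonlinear transition to s_init

# Toy example: two sentences of different lengths, same embedding size.
rng = np.random.default_rng(0)
d_h, d_s = 8, 6                         # hypothetical annotation / state sizes
W = rng.normal(size=(d_s, d_h))         # stands in for the trained weights
b = np.zeros(d_s)                       # stands in for the trained bias

H_short = rng.normal(size=(5, d_h))     # annotations of a 5-word sentence
H_long = rng.normal(size=(12, d_h))     # annotations of a 12-word sentence

v_short = sentence_embedding(H_short, W, b)
v_long = sentence_embedding(H_long, W, b)
print(v_short.shape, v_long.shape)
```

Because the pooling removes the dependence on $T_x$, both sentences end up with vectors of the same dimensionality, which is what makes distance comparisons between sentences possible.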

Sentence Selection
We employ a data selection method inspired by Moore and Lewis (2010), Axelrod et al. (2011) and Duh et al. (2013). As Axelrod et al. (2011) mentioned, out-of-domain data contain some pseudo in-domain data that are close to the in-domain data. Our intuition is to select the sentences whose embeddings are similar to the average in-domain ones, while being dissimilar to the average out-of-domain ones:
• 1) We train a French-to-English NMT system $N_{FE}$ using the in-domain and out-of-domain data together as training data.
• 2) Each sentence $f$ in the training data $F$ (both in-domain $F_{in}$ and out-of-domain $F_{out}$) is embedded as a vector $v_f = s_{init}(f)$ by using $N_{FE}$.
• 3) The sentence pairs $(f, e)$ in the out-of-domain corpus $F_{out}$ are classified into two sets: the sentences close to in-domain sentences, and those that are distant.
That is, we first calculate the vector centers of the in-domain and out-of-domain corpora, $C_{F_{in}}$ and $C_{F_{out}}$, respectively:
$$C_{F_{in}} = \frac{\sum_{f \in F_{in}} v_f}{|F_{in}|}, \qquad C_{F_{out}} = \frac{\sum_{f \in F_{out}} v_f}{|F_{out}|}.$$
Then we measure the Euclidean distance $d$ between each sentence vector $v_f$ and the in-domain vector center, $d(v_f, C_{F_{in}})$, and the out-of-domain vector center, $d(v_f, C_{F_{out}})$. We use the difference $\delta_f$ of these two distances to classify each sentence:
$$\delta_f = d(v_f, C_{F_{in}}) - d(v_f, C_{F_{out}}).$$
By using an English-to-French NMT system $N_{EF}$, we can obtain a target sentence embedding $v_e$, the in-domain target vector center $C_{E_{in}}$ and the out-of-domain target vector center $C_{E_{out}}$. The corresponding distance difference $\delta_e$ is
$$\delta_e = d(v_e, C_{E_{in}}) - d(v_e, C_{E_{out}}).$$
$\delta_f$, $\delta_e$ and $\delta_{fe} = \delta_f + \delta_e$ can each be used to select sentences. That is, the sentence pairs $(f, e)$ with $\delta_f$ (or $\delta_e$, $\delta_{fe}$) less than a threshold form the newly selected in-domain corpus. This threshold is tuned on the development data.
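The selection criterion can be illustrated with a minimal NumPy sketch. It assumes the sentence embeddings have already been extracted into arrays; the function names `distance_difference` and `select_pseudo_in_domain` and the two-dimensional toy vectors are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def distance_difference(vectors, in_center, out_center):
    """delta = d(v, C_in) - d(v, C_out), with d the Euclidean distance."""
    d_in = np.linalg.norm(vectors - in_center, axis=1)
    d_out = np.linalg.norm(vectors - out_center, axis=1)
    return d_in - d_out

def select_pseudo_in_domain(v_in, v_out, threshold):
    """Return indices of out-of-domain sentences with delta below threshold."""
    c_in = v_in.mean(axis=0)      # in-domain vector center
    c_out = v_out.mean(axis=0)    # out-of-domain vector center
    delta = distance_difference(v_out, c_in, c_out)
    return np.flatnonzero(delta < threshold), delta

# Toy embeddings: the in-domain center is (1, 0), the out-of-domain center (8, 0).
v_in = np.array([[0.0, 0.0], [2.0, 0.0]])
v_out = np.array([[1.0, 0.0], [9.0, 0.0], [10.0, 0.0], [12.0, 0.0]])

selected, delta = select_pseudo_in_domain(v_in, v_out, threshold=0.0)
print(selected)  # → [0]
print(delta)     # → [-7.  7.  7.  7.]
```

Only the first out-of-domain sentence lies closer to the in-domain center than to the out-of-domain center, so it alone is selected. For the bilingual criterion, the same function would be applied to the target-side embeddings and the two delta arrays summed element-wise before thresholding.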

Data sets
The proposed methods were evaluated on two data sets as shown in Table 1.
• The IWSLT 2014 English (EN) to French (FR) corpus was used as in-domain training data, and dev2010 and test2010/2011 (Cettolo et al., 2014) were selected as development (dev) and test data, respectively. The out-of-domain corpora contained the Common Crawl, Europarl v7, News Commentary v10 and United Nations (UN) EN-FR parallel corpora.
• The NIST 2006 Chinese (ZH) to English corpus was used as the in-domain training corpus, following the settings of Wang et al. (2014). The Chinese-to-English UN data set (LDC2013T06) and the NTCIR-9 (Goto et al., 2011) patent data set were used as out-of-domain data. NIST MT 2002-2004 and NIST MT 2005/2006 were used as the development and test data, respectively. We are aware that there are additional NIST corpora in a similar domain, but because this task was for domain adaptation, we only selected a small subset, which is mainly focused on news and blog texts.
The statistics of the data sets are shown in Table 1.
These adaptation corpora settings were nearly the same as those used in previous work. The differences were:
• For IWSLT, the previous work chose the FR-EN translation task, which is popular in PBSMT. We chose EN-FR, which is more popular in NMT;
• For NIST, the previous work chose 02-05 as the dev set, and we chose 02-04.

NMT System
We implemented the proposed method in Groundhog (Bahdanau et al., 2015), which is one of the state-of-the-art NMT frameworks. The default settings of Groundhog were applied for all NMT systems: the word embedding dimension was 620, the size of a hidden layer was 1000, the batch size was 64, the source and target side vocabulary sizes were 30K, the maximum sequence length was 50, and the beam size for decoding was 10. Default dropout was applied. We used mini-batch Stochastic Gradient Descent (SGD) together with the ADADELTA optimizer (Zeiler, 2012). Training was conducted on a single Tesla K80 GPU. Each NMT model was trained for 500K batches, taking 7-10 days. Sentence embedding and selection took only several hours to process all of the sentences in the training data, because decoding was not necessary.

Baselines
Along with the standard NMT baseline system, we also compared the proposed methods to the recent state-of-the-art NMT adaptation method of Luong and Manning (2015) (footnote 7), as described in the Introduction. Two typical sentence selection methods for PBSMT were also used as baselines: Axelrod et al. (2011) used the language model-based cross-entropy difference as the criterion; Chen et al. (2016) used a CNN to classify the sentences as either in-domain or out-of-domain. In addition, we randomly sampled out-of-domain data to create a corpus of the same size as that used for the best performing proposed system. We tried our best to re-implement the baseline methods using the same basic NMT setting as the proposed method.

Results and Analyses
In Tables 2 and 3, in, out and in + out indicate that the in-domain data, the out-of-domain data and their mixture, respectively, were used as the NMT training corpora. $\delta_f$, $\delta_e$ and $\delta_{fe}$ indicate that the corresponding criterion was used to select sentences, and that these selected sentences were added to the in-domain corpus to construct the new training corpora. +fur indicates that the selected sentences were used to train an initial NMT system, and that this initial system was then further trained on in-domain data (Luong and Manning, 2015). The threshold for the sentence selection method was selected on development data. That is, we selected the top ranked 10%, 20%, ..., 90% of the out-of-domain data to be added to the in-domain data, and the best performing models on development data were used in the evaluation on test data. The vocabulary was built by using the selected corpus and the in-domain corpus (footnote 8). Translation performance was measured by case-insensitive BLEU (Papineni et al., 2002). Since the proposed method is a sentence selection approach, we can also show the effect on standard PBSMT (Koehn et al., 2007).
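The threshold sweep described above can be sketched as follows. This is a hedged illustration rather than the authors' code: `top_fraction` and `pick_best_fraction` are hypothetical names, and `dev_score` is a stand-in callback that, in a real experiment, would train an NMT system on the in-domain data plus the selected sentences and return its BLEU score on the development set.

```python
import numpy as np

def top_fraction(delta, fraction):
    """Indices of the given fraction of sentences with the smallest delta."""
    k = int(round(fraction * len(delta)))
    order = np.argsort(delta, kind="stable")   # most in-domain-like first
    return order[:k]

def pick_best_fraction(delta, dev_score, fractions=None):
    """Try each candidate fraction and keep the one with the best dev score."""
    if fractions is None:
        fractions = [i / 10 for i in range(1, 10)]   # 10%, 20%, ..., 90%
    scored = [(dev_score(top_fraction(delta, f)), f) for f in fractions]
    _, best_f = max(scored)
    return best_f, top_fraction(delta, best_f)

# Toy delta scores for ten out-of-domain sentences.
delta = np.array([0.5, -2.0, 1.5, -0.1, 3.0, -1.0, 0.0, 2.0, -3.0, 1.0])

# Stub standing in for "train and measure dev BLEU"; it peaks at 3 sentences.
def fake_dev_score(indices):
    return -abs(len(indices) - 3)

best_f, best_idx = pick_best_fraction(delta, fake_dev_score)
print(best_f, best_idx.tolist())  # → 0.3 [8, 1, 5]
```

Ranking by ascending delta and cutting at a percentage is equivalent to choosing a threshold on delta, so sweeping the percentage is one simple way to tune that threshold on held-out data.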
Footnote 7: The method of Freitag and Al-Onaizan (2016) is quite similar to that of Luong and Manning (2015), so we did not compare against it.
Footnote 8: According to our empirical comparison, the performance did not significantly change if we used in + out to build the vocabulary for all of the systems.
In the IWSLT task, the observations were as follows:
• Adding out-of-domain to in-domain data, or directly using out-of-domain data, degraded PBSMT and NMT performance.
• Adding data selected by $\delta_f$, $\delta_e$ and $\delta_{fe}$ substantially improved NMT performance (3.9 to 6.6 BLEU points), and gave rise to a modest improvement in PBSMT performance (0.4 to 3.1 BLEU points). This method also outperformed the best existing baselines by up to 1.1 BLEU points for NMT and 0.8 BLEU points for PBSMT.
• The proposed method worked synergistically with Luong's further training method, and the combination was able to add up to an additional 2-3 BLEU points, indicating that the proposed method and Luong's method are essentially orthogonal.
• Using the sentence embeddings of both sides, $\delta_{fe}$, performed slightly better than using the monolingual sentence embeddings $\delta_f$ and $\delta_e$.
In the NIST task, the observations were similar to the IWSLT task, except:
• Adding out-of-domain data slightly improved PBSMT and NMT performance.
• The proposed method improved both PBSMT and NMT performance, but not as substantially as in IWSLT.
These observations suggest that the out-of-domain data in the NIST task was closer to the in-domain data than in IWSLT.

Selected Size Effect
Figure 1 shows experimental results for varying the size of the additional data selected from the out-of-domain data set. The proposed method $\delta_{fe}$ reached its highest performance on the dev set when the top 30% of the out-of-domain sentences were selected as pseudo in-domain data. $\delta_{fe}$ outperformed the other methods in most cases on the development data.

Training Time Effect
We also show the relationship between BLEU and the number of training batches in Figure 2. Most of the methods (without further training) converged after a similar number of training batches. Specifically, the in system reached its highest BLEU performance on dev faster than the other methods (without further training), then decreased and finally converged.
The further training methods, which first trained the models using out-of-domain data and then in-domain data, converged very soon after the in-domain data were introduced. In further training, the system trained on out-of-domain data can be considered a pre-trained NMT system. The in-domain training then helped the NMT system overfit to the in-domain data and yielded an improvement of around two BLEU points.

Conclusion and Future Work
In this paper, we proposed a straightforward sentence selection method for NMT domain adaptation.
Instead of the existing external selection criteria, we applied the internal NMT sentence embedding similarity as the criterion. Empirical results on the IWSLT and NIST tasks showed that the proposed method can substantially improve NMT performance and outperform existing state-of-the-art adaptation methods in NMT (and even PBSMT) performance.
In addition, we found that the combination of sentence selection and further training yields an additional improvement, with fast convergence. In future work, we will investigate the effect of training data order and batch data selection on NMT training.