Transfer Learning for Low-Resource Neural Machine Translation

The encoder-decoder framework for neural machine translation (NMT) has been shown effective in large data scenarios, but is much less effective for low-resource languages. We present a transfer learning method that significantly improves Bleu scores across a range of low-resource languages. Our key idea is to first train a high-resource language pair (the parent model), then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training. Using our transfer learning method we improve baseline NMT models by an average of 5.6 Bleu on four low-resource language pairs. Ensembling and unknown word replacement add another 2 Bleu which brings the NMT performance on low-resource machine translation close to a strong syntax based machine translation (SBMT) system, exceeding its performance on one language pair. Additionally, using the transfer learning model for re-scoring, we can improve the SBMT system by an average of 1.3 Bleu, improving the state-of-the-art on low-resource machine translation.


Introduction
Neural machine translation (NMT)  is a promising paradigm for extracting translation knowledge from parallel text. NMT systems have achieved competitive accuracy rates under large-data training conditions for language pairs This work was carried out while all authors were at USC's Information Sciences Institute. *This author is currently at Google Brain. such as English-French. However, neural methods are data-hungry and learn poorly from low-count events. This behavior makes vanilla NMT a poor choice for low-resource languages, where parallel data is scarce. Table 1 shows that for 4 low-resource languages, a standard string-to-tree statistical MT system (SBMT) (Galley et al., 2004;Galley et al., 2006) strongly outperforms NMT, even when NMT uses the state-of-the-art local attention plus feedinput techniques from Luong et al. (2015a).
In this paper, we describe a method for substantially improving NMT results on these languages. Our key idea is to first train a high-resource language pair, then use the resulting trained network (the parent model) to initialize and constrain training for our low-resource language pair (the child model). We find that we can optimize our results by fixing certain parameters of the parent model and letting the rest be fine-tuned by the child model. We report NMT improvements from transfer learning of 5.6 BLEU on average, and we provide an analysis of why the method works. The final NMT system approaches strong SBMT baselines in all four language pairs, and exceeds SBMT performance in one of them. Furthermore, we show that NMT is an exceptional re-scorer of 'traditional' MT output; even NMT that on its own is worse than SBMT is consistently able to improve upon SBMT system output when incorporated as a re-scoring model.
We provide a brief description of our NMT model in Section 2. Section 3 gives some background on transfer learning and explains how we use it to improve machine translation performance. Our main experiments translating Hausa, Turkish, Uzbek, and Urdu into English with the help of a French-English parent model are presented in Section 4. Section 5 explores alternatives to our model to enhance understanding. We find that the choice of parent language pair affects performance, and provide an empirical upper bound on transfer performance using an artificial language. We experiment with English-only language models, copy models, and word-sorting models to show that what we transfer goes beyond monolingual information and that using a translation model trained on bilingual corpora as a parent is essential. We show the effects of freezing, finetuning, and smarter initialization of different components of the attention-based NMT system during transfer. We compare the learning curves of transfer and no-transfer models, showing that transfer solves an overfitting problem, not a search problem. We summarize our contributions in Section 6.

NMT Background
In the neural encoder-decoder framework for MT (Neco and Forcada, 1997;Castaño and Casacuberta, 1997;Bahdanau et al., 2014;Luong et al., 2015a), we use a recurrent neural network (encoder) to convert a source sentence into a dense, fixed-length vector. We then use another recurrent network (decoder) to convert that vector to a target sentence. In this paper, we use a two-layer encoder-decoder system ( Figure 1) with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). The models were trained to optimize maximum likelihood (via a softmax layer) with back-propagation through time (Werbos, 1990). Additionally, we use an attention mechanism that allows the target decoder to look back at the source encoder, specifically the local attention model from Luong et al. (2015a). In our model we also use the feed-input input connection from Luong et al. (2015a) where at each timestep on the decoder we feed in the top layer's hidden state into the lowest layer of the next timestep.

Transfer Learning
Transfer learning uses knowledge from a learned task to improve the performance on a related task, typically reducing the amount of required training data (Torrey and Shavlik, 2009;Pan and Yang, 2010). In natural language processing, transfer learning methods have been successfully applied to speech recognition, document classification and sentiment analysis (Wang and Zheng, 2015). Deep learning models discover multiple levels of representation, some of which may be useful across tasks, which makes them particularly suited to transfer learning (Bengio, 2012). For example, Cireşan et al. (2012) use a convolutional neural network to recognize handwritten characters and show positive effects of transfer between models for Latin and Chinese characters. Ours is the first study to apply transfer learning to neural machine translation.
There has also been work on using data from multiple language pairs in NMT to improve performance. Recently, Dong et al. (2015) showed that sharing a source encoder for one language helps performance when using different target decoders  for different languages. In that paper the authors showed that using this framework improves performance for low-resource languages by incorporating a mix of low-resource and high-resource languages. Firat et al. (2016) used a similar approach, employing a separate encoder for each source language, a separate decoder for each target language, and a shared attention mechanism across all languages. They then trained these components jointly across multiple different language pairs to show improvements in a lower-resource setting.
There are a few key differences between our work and theirs. One is that we are working with truly small amounts of training data. Dong et al. (2015) used a training corpus of about 8m English words for the low-resource experiments, and Firat et al. (2016) used from 2m to 4m words, while we have at most 1.8m words, and as few as 0.2m. Additionally, the aforementioned previous work used the same domain for both low-resource and high-resource languages, while in our case the datasets come from vastly different domains, which makes the task much harder and more realistic. Our approach only requires using one additional high-resource language, while the other papers used many. Our approach also allows for easy training of new lowresource languages, while Dong et al. (2015) and Firat et al. (2016) do not specify how a new language should be added to their pipeline once the models are trained. Finally, Dong et al. (2015) observe an average BLEU gain on their low-resource experiments of +1.16, and Firat et al. (2016) obtain BLEU gains of +1.8, while we see a +5.6 BLEU gain.
The transfer learning approach we use is simple and effective. We first train an NMT model on a  large corpus of parallel data (e.g., French-English).
We call this the parent model. Next, we initialize an NMT model with the already-trained parent model. This new model is then trained on a very small parallel corpus (e.g., Uzbek-English). We call this the child model. Rather than starting from a random position, the child model is initialized with the weights from the parent model. A justification for this approach is that in scenarios where we have limited training data, we need a strong prior distribution over models. The parent model trained on a large amount of bilingual data can be considered an anchor point, the peak of our prior distribution in model space. When we train the child model initialized with the parent model, we fix parameters likely to be useful across tasks so that they will not be changed during child model training. In the French-English to Uzbek-English example, as a result of the initialization, the English word embeddings from the parent model are copied, but the Uzbek words are initially mapped to random French embeddings. The parameters of the English embeddings are then frozen, while the Uzbek embeddings' parameters are allowed to be modified, i.e. fine-tuned, during training of the child model. Freezing certain transferred parameters and fine tuning others can be considered a hard approximation to a tight prior or strong regularization applied to some of the parameter space. We also experiment with ordinary L2 regularization, but find it does not significantly improve over the parameter freezing described above.
Our method results in large BLEU increases for a variety of low resource languages. In one of the Language Pair

Role Train Dev Test
Size Size Size Spanish-English child 2.5m 58k 59k French-English parent 53m 58k 59k German-English parent 53m 58k 59k four language pairs our NMT system using transfer beats a strong SBMT baseline. Not only do these transfer models do well on their own, they also give large gains when used for re-scoring n-best lists (n = 1000) from the SBMT system. Section 4 details these results.

Experiments
To evaluate how well our transfer method works we apply it to a variety of low-resource languages, both stand-alone and for re-scoring a strong SBMT baseline. We report large BLEU increases across the board with our transfer method.
For all of our experiments with low-resource languages we use French as the parent source language and for child source languages we use Hausa, Turkish, Uzbek, and Urdu. The target language is always English. Table 1 shows parallel training data set sizes for the child languages, where the language with the most data has only 1.8m English tokens. For comparison, our parent French-English model uses a training set with 300 million English tokens and achieves 26 BLEU on the development set. Table 1 also shows the SBMT system scores along with the NMT baselines that do not use transfer. There is a large gap between the SBMT and NMT systems when our transfer method is not used.
The SBMT system used in this paper is a stringto-tree statistical machine translation system (Galley et al., 2006;Galley et al., 2004). In this system there are two count-based 5-gram language models. One is trained on the English side of the WMT 2015 English-French dataset and the other is trained on the English side of the low-resource bitext. Additionally, the SBMT models use thousands of sparsely-occurring, lexicalized syntactic features (Chiang et al., 2009  ing rate, parameter initialization range, dropout rate, and hidden state size for all the experiments. For training we use a minibatch size of 128, hidden state size of 1000, a target vocabulary size of 15K, and a source vocabulary size of 30K. The child models are trained with a dropout probability of 0.5, as in Zaremba et al. (2014). The common parent model is trained with a dropout probability of 0.2. The learning rate used for both child and parent models is 0.5 with a decay rate of 0.9 when the development perplexity does not improve. The child models are all trained for 100 epochs. We re-scale the gradient when the gradient norm of all parameters is greater than 5. The initial parameter range is [-0.08, +0.08]. We also initialize our forget-gate biases to 1 as specified by Józefowicz et al. (2015) and Gers et al. (2000). For decoding we use a beam search of width 12.

Transfer Results
The results for our transfer learning method applied to the four languages above are in Table 2. The parent models were trained on the WMT 2015 (Bojar et al., 2015) French-English corpus for 5 epochs. Our baseline NMT systems ('NMT' row) all receive a large BLEU improvement when using the transfer method (the 'Xfer' row) with an average BLEU improvement of 5.6. Additionally, when we use unknown word replacement from Luong et al. (2015b) and ensemble together 8 models (the 'Final' row) we further improve upon our BLEU scores, bringing the average BLEU improvement to 7.5. Overall our method allows the NMT system to reach competitive scores and outperform the SBMT system in one of the four language pairs. During transfer learning, we expect the source-language related blocks to change more than the target-language related blocks.  French , identical to French except for word spellings. We observe that French transfer learning helps French (13.3→20.0) more than it helps Uzbek (10.7→15.0).

Re-scoring Results
We also use the NMT model with transfer learning as a feature when re-scoring output n-best lists (n = 1000) from the SBMT system. Table 3 shows the results of re-scoring. We compare re-scoring with transfer NMT to re-scoring with baseline (i.e. non-transfer) NMT and to re-scoring with a neural language model. The neural language model is an LSTM RNN with 2 layers and 1000 hidden states. It has a target vocabulary of 100K and is trained using noise-contrastive estimation (Mnih and Teh, 2012;Vaswani et al., 2013;Baltescu and Blunsom, 2015;Williams et al., 2015). Additionally, it is trained using dropout with a dropout probability of 0.2 as suggested by Zaremba et al. (2014). Re-scoring with the transfer NMT model yields an improvement of 1.1-1.6 BLEU points above the strong SBMT system; we find that transfer NMT is a better re-scoring feature than baseline NMT or neural language models.
In the next section, we describe a number of additional experiments designed to help us understand the contribution of the various components of our transfer model.

Analysis
We analyze the effects of using different parent models, regularizing different parts of the child model, and trying different regularization techniques.

Different Parent Languages
In the above experiments we use French-English as the parent language pair. Here, we experiment with different parent languages. In this set of experiments we use Spanish-English as the child language pair. A description of the data used in this section is presented in Table 4. Our experimental results are shown in Table 5, where we use French and German as parent languages. If we just train a model with no transfer on a small Spanish-English training set we get a BLEU score of 16.4. When using our transfer method we get Spanish-English BLEU scores of 31.0 and 29.8 via French and German parent languages, respectively. As expected, French is a better parent than German for Spanish, which could be the result of the parent language being more similar to the child language. We suspect using closely-related parent language pairs would improve overall quality.

Effects of having Similar Parent Language
Next, we look at a best-case scenario in which the parent language is as similar as possible to the child language.
Here we devise a synthetic child language (called French ) which is exactly like French, except its vocabulary is shuffled randomly. (e.g., "internationale" is now "pomme," etc). This language, which looks unintelligible to human eyes, nevertheless has the same distributional and relational properties as actual French, i.e. the word that, prior to vocabulary reassignment, was 'roi' (king) is likely to share distributional characteristics, and hence embedding similarity, to the word that, prior to reassignment, was 'reine' (queen). French should be the ideal parent model for French .
The results of this experiment are shown in Table 6. We get a 4.3 BLEU improvement with an unrelated parent (i.e. French-parent and Uzbekchild), but we get a 6.7 BLEU improvement with a 'closely related' parent (i.e. French-parent and French -child). We conclude that the choice of parent model can have a strong impact on transfer models, and choosing better parents for our low-resource languages (if data for such parents can be obtained) could improve the final results.

Ablation Analysis
In all the above experiments, only the target input and output embeddings are fixed during training. In this section we analyze what happens when different Uzbek/French word-type mapping without help. parts of the model are fixed, in order to determine the scenario that yields optimal performance. Figure 2 shows a diagram of the components of a sequenceto-sequence model. Table 7 shows the effects of allowing various components of the child NMT model to be trained. We find that the optimal setting for transferring from French-English to Uzbek-English in terms of BLEU performance is to allow all of the components of the child model to be trained except for the input and output target embeddings.
Even though we use this setting for our main experiments, the optimum setting is likely to be language-and corpus-dependent. For Turkish, experiments show that freezing attention parameters as well gives slightly better results. For parent-child models with closely related languages we expect freezing, or strongly regularizing, more components of the model to give better results.

Learning Curve
In Figure 3 we plot learning curves for both a transfer and a non-transfer model on training and development sets. We see that the final training set perplexities for both the transfer and non-transfer model are very similar, but the development set perplexity for the transfer model is much better.
The fact that the two models start from and converge to very different points, yet have similar training set performances, indicates that our architecture  and training algorithm are able to reach a good minimum of the training objective regardless of the initialization. However, the training objective seems to have a large basin of models with similar performance and not all of them generalize well to the development set. The transfer model, starting with and staying close to a point known to perform well on a related task, is guided to a final point in the weight space that generalizes to the development set much better.

Dictionary Initialization
Using the transfer method, we always initialize input language embeddings for the child model with randomly-assigned embeddings from the parent (which has a different input language). A smarter method might be to initialize child embeddings with similar parent embeddings, where similarity is measured by word-to-word t-table probabilities. To get these probabilities we compose Uzbek-English and English-French t-tables obtained from the Berkeley Aligner (Liang et al., 2006). We see from Figure 4 that this dictionary-based assignment results in faster improvement in the early part of the training. However the final performance is similar to our standard model, indicating that the training is able to untangle the dictionary permutation introduced by randomly-assigned embeddings.

Different Parent Models
In the above experiments, we use a parent model trained on a large French-English corpus. One might hypothesize that our gains come from exploit-  ing the English half of the corpus as an additional language model resource. Therefore, we explore transfer learning for the child model with parent models that only use the English side of the French-English corpus. We consider the following parent models in our ablative transfer learning scenarios: • A true translation model (French-English Parent) • A word-for-word English copying model (English-English Parent) • A model that unpermutes scrambled English (EngPerm-English Parent) • (The parameters of) an RNN language model (LM Xfer) The results, in Table 8, show that transfer learning does not simply import an English language model, but makes use of translation parameters learned from the parent's large bilingual text.

Conclusion
Overall, our transfer method improves NMT scores on low-resource languages by a large margin and allows our transfer NMT system to come close to the performance of a very strong SBMT system, even exceeding its performance on Hausa-English. In addition, we consistently and significantly improve state-of-the-art SBMT systems on low-resource languages when the transfer NMT system is used for rescoring. Our experiments suggest that there is still room for improvement in selecting parent languages that are more similar to child languages, provided data for such parents can be found.