Trivial Transfer Learning for Low-Resource Neural Machine Translation

Transfer learning has proven to be an effective technique for neural machine translation under low-resource conditions. Existing methods require a common target language, language relatedness, or specific training tricks and regimes. We present a simple transfer learning method, where we first train a “parent” model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus. This “child” model performs significantly better than the baseline trained on the low-resource pair only. We are the first to show this for targeting different languages, and we observe the improvements even for unrelated languages with different alphabets.


Introduction
Neural machine translation (NMT) has made a big leap in performance and has become the unquestionable winning approach in the past few years (Bahdanau et al., 2014; Sutskever et al., 2014; Sennrich et al., 2017; Vaswani et al., 2017). The main reason behind the success of NMT in realistic conditions was the ability to handle large vocabularies (Sennrich et al., 2016b) and to utilize large monolingual data (Sennrich et al., 2016a). However, NMT still struggles if the parallel data is insufficient (e.g. fewer than 1M parallel sentences), producing fluent output unrelated to the source and performing much worse than phrase-based machine translation (Koehn and Knowles, 2017).
Many strategies have been used in MT in the past for employing resources from additional languages, see e.g. Wu and Wang (2007), Nakov and Ng (2012), El Kholy et al. (2013), or Hoang and . For NMT, a particularly promising approach is transfer learning or "domain adaptation" where the "domains" are the different languages.
For example, Zoph et al. (2016) train a "parent" model in a high-resource language pair, then use some of the trained weights as the initialization for a "child" model and further train it on the low-resource language pair. In Zoph et al. (2016), the parent and child pairs shared the target language (English) and a number of modifications of the training process were needed to achieve an improvement in translation from Hausa, Turkish, and Uzbek into English with the help of French-English data.
Nguyen and Chiang (2017) explore a related scenario where the parent language pair is also low-resource but it is related to the child language pair. They improved the previous approach by using a shared vocabulary of subword units (BPE, Sennrich et al., 2016b). Additionally, they used transliteration to improve their results.
In this paper, we contribute empirical evidence that transfer learning for NMT can be simplified even further. We leave out the restriction on relatedness of the languages and extend the experiments to parent-child pairs where the target language changes. Moreover, we do not utilize any special modifications to the training regime or data preprocessing.
In contrast to previous work, we test the method with the Transformer model (Vaswani et al., 2017), instead of the recurrent approaches (Bahdanau et al., 2014).
As documented in e.g. Popel and Bojar (2018) and anticipated in WMT18, the Transformer model seems superior to other NMT approaches.

Method Description
The proposed method is extremely simple: We train the parent language pair for a number of iterations and switch the training corpus to the child language pair for the rest of the training, without resetting any of the training (hyper)parameters.
As such, this method is similar to the transfer learning proposed by Zoph et al. (2016) but uses the shared vocabulary as in Nguyen and Chiang (2017). The novelty is that we are removing the restriction about relatedness of the language pairs, and in contrast to the previous papers, we show that this simple style of transfer learning can be used on both sides (i.e. either the source or the target language), not only with the target language common to both parent and child model. In fact, the method is effective also for fully unrelated language pairs.
Our method does not need any modification of existing NMT frameworks. The only requirement is to use a shared vocabulary of subword units (we use wordpieces, Johnson et al., 2017) across both language pairs. This is achieved by learning wordpiece segmentation from the concatenated source and target sides of both the parent and child language pairs. All other parameters of the model stay the same as for the standard NMT training.
During the training we first train the NMT model for the high-resource language pair until convergence. This model is called "parent". After that, we train the child model without any restart, i.e. only by changing the training corpora to the low-resource language pair.
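The corpus switch described above can be sketched as a single batch stream that feeds one uninterrupted training run. This is only an illustration under assumptions: `transfer_batches`, `parent_steps` and `total_steps` are hypothetical names, and the placeholder sentence pairs stand in for real training batches of whatever NMT framework is used.

```python
from itertools import cycle


def transfer_batches(parent_batches, child_batches, parent_steps, total_steps):
    """Yield (global_step, stage, batch) for one uninterrupted training run.

    The first `parent_steps` batches come from the parent corpus, the rest
    from the child corpus. The global step keeps counting across the switch,
    so anything driven by it (e.g. a learning-rate schedule) is not reset.
    """
    parent_iter = cycle(parent_batches)  # loop over the corpus as epochs pass
    child_iter = cycle(child_batches)
    for step in range(1, total_steps + 1):
        if step <= parent_steps:
            yield step, "parent", next(parent_iter)
        else:
            yield step, "child", next(child_iter)


# Toy usage: placeholder sentence pairs instead of real batches.
parent = [("en sentence 1", "fi sentence 1"), ("en sentence 2", "fi sentence 2")]
child = [("en sentence A", "et sentence A")]
schedule = list(transfer_batches(parent, child, parent_steps=3, total_steps=5))
```

The model, optimizer state and step counter are deliberately absent from the sketch: the only thing that changes at the switch is where the batches come from.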

Details on Shared Vocabulary
Current NMT systems use vocabularies of subword units instead of whole words. Using subword units gives a balance between the flexibility of separate characters and efficiency of whole words. It solves the out-of-vocabulary words problem and reduces the vocabulary size. The majority of NMT systems use either the byte pair encoding (Sennrich et al., 2016b) or wordpieces (Wu et al., 2016). Given a training corpus and the desired maximal vocabulary size, either method produces deterministic rules for word segmentation to achieve the fewest possible splits.
Our method requires the vocabulary to be shared across both the parent (translating from language XX to YY) and the child model (translating from AA to BB). This is obtained by concatenating both training corpora into one corpus of sentences in languages AA, BB, XX and YY. Due to our focus on low-resource language pairs, we decided to generate the vocabulary in a balanced way by selecting the same number of sentences from both language pairs. We thus use the same number of sentence pairs of the parent corpus as there are in the child corpus.
We did not experiment with any other balancing of the vocabulary. Future research could also investigate the impact of using only the child corpus for vocabulary generation or various amounts of used sentences.
We generated vocabularies aiming at 32k subword types. The exact size of the vocabulary varies from 26.1k to 34.8k. All experiments of a given language set use the same vocabulary. Vocabulary overlap in each language set is further studied in Section 6.1.
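A minimal sketch of the balanced corpus selection for vocabulary learning follows. It assumes the parent subset is drawn by uniform random sampling, which the text does not specify; `balanced_vocab_corpus` and `seed` are hypothetical names, and the returned sentences would then be passed to whatever wordpiece or BPE learner is in use.

```python
import random


def balanced_vocab_corpus(parent_pairs, child_pairs, seed=0):
    """Return monolingual sentences for learning one shared subword vocabulary.

    All child sentence pairs are used, together with an equally sized sample
    of parent sentence pairs; source and target sides are concatenated.
    Random sampling of the parent subset is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n = min(len(child_pairs), len(parent_pairs))
    parent_sample = rng.sample(list(parent_pairs), n)
    sentences = []
    for src, tgt in list(child_pairs) + parent_sample:
        sentences.append(src)
        sentences.append(tgt)
    return sentences


# Toy usage: 10 parent pairs (XX-YY) and 3 child pairs (AA-BB).
parent_pairs = [(f"xx {i}", f"yy {i}") for i in range(10)]
child_pairs = [(f"aa {i}", f"bb {i}") for i in range(3)]
vocab_corpus = balanced_vocab_corpus(parent_pairs, child_pairs)
```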

Model Description
We use the Transformer sequence-to-sequence model (Vaswani et al., 2017) as implemented in Tensor2Tensor (Vaswani et al., 2018) version 1.4.2. Our models are based on the "big single GPU" configuration as defined in the paper. To fit the model to our GPUs (NVIDIA GeForce GTX 1080 Ti with 11 GB RAM), we set the batch size to 2300 tokens and limit sentence length to 100 wordpieces.
We use exponential learning rate decay with the starting learning rate of 0.2 and 32000 warm-up steps, and the Adam optimizer. In our experiments, we find that it is undesirable to reset the learning rate as it leads to the loss of the performance from the parent model. Therefore the transfer learning is handled only by changing the training corpora and nothing else.
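To illustrate what "not resetting" means, the sketch below treats the learning rate as a function of the global step only. It is not the Tensor2Tensor schedule: the linear warm-up shape and the `decay_rate` and `decay_steps` values are assumptions made for illustration; only the starting rate of 0.2 and the 32000 warm-up steps come from the text.

```python
def learning_rate(step, base_lr=0.2, warmup_steps=32000,
                  decay_rate=0.5, decay_steps=100000):
    """Illustrative warm-up followed by exponential decay.

    decay_rate, decay_steps and the linear warm-up are hypothetical choices,
    not the values or the exact formula used in the experiments.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * decay_rate ** ((step - warmup_steps) / decay_steps)


# The child continues at the parent's global step instead of restarting at 1.
parent_steps = 500000  # hypothetical switch point
lr_continued = learning_rate(parent_steps + 1)  # schedule carries on decaying
lr_reset = learning_rate(1)                     # a reset would re-enter warm-up
```

With a continued step counter the child simply sees the next value of the same decaying curve, whereas a reset would replay the warm-up phase from the beginning.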
Decoding uses the beam size of 8 and the length normalization penalty is set to 1.
The models were trained for 1M steps (approx. 140 hours), which was sufficient for the models to converge to their best performance. We selected the model with the best performance on the development set for the final evaluation on the testset.

Datasets
In our experiments, we compare low-resource and high-resource language pairs spanning two orders of magnitude of training data sizes. We consider Estonian (ET) and Slovak (SK) as low-resource languages compared to the Finnish (FI) and Czech (CS) counterparts. The choice of languages was closely related to the languages in this year's WMT 2018 shared tasks. In particular, Estonian and Finnish (paired with English) were suggested as the main focus for their relatedness. We added Czech and Slovak as another closely related language pair. Russian (RU) for the parent model was chosen for two reasons: (1) written in Cyrillic, there will be hardly any intersection in the shared vocabulary with the child language pairs, and (2) previous work uses transliteration to handle Russian, which is a nice contrast to our work. Finally, we added Arabic (AR), French (FR) and Spanish (ES) for experiments with unrelated languages.
The sizes of the training datasets are in Table 1.
If not specified otherwise, we use training, development and test sets from WMT. Sentence pairs with fewer than 4 words or more than 75 words on either the source or the target side are removed to allow for a speedup of Transformer by capping the maximal length and allowing a bigger batch size. The reduction of training data is small and, based on our experiments, it does not change the performance of the translation model.
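The length filter can be sketched as follows. Counting words by whitespace splitting is an assumption, since the text does not say how words were counted, and `keep_pair` is a hypothetical helper name.

```python
def keep_pair(src, tgt, min_words=4, max_words=75):
    """Return True if both sides have between min_words and max_words words.

    Words are counted by whitespace splitting (an assumption of this sketch).
    """
    return all(min_words <= len(side.split()) <= max_words for side in (src, tgt))


# Toy usage: the second pair is dropped because its target has only 2 words.
pairs = [
    ("this is a short sentence", "see on lühike lause siin"),
    ("this one has enough words", "liiga lühike"),
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
```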
We use the Europarl and Rapid corpora for Estonian-English. We disregard Paracrawl due to its noisiness. The development and test sets are from WMT news 2018 (http://www.statmt.org/wmt18/).
The Finnish-English was prepared as in Ostling et al. (2017), removing Wikipedia headlines. The dev and test sets are from WMT news 2015.
For English-Czech, we use all parallel data allowed in WMT2018 except Paracrawl. The main resource is CzEng 1.7 (the filtered version). The devset is WMT newstest2011 and the testset is WMT newstest2017. Slovak-English uses corpora from Galuščáková and Bojar (2012), detokenized by Moses. WMT newstest2011 serves as the devset and testset.
The Russian-English training set was created from News Commentary, Yandex and UN Corpus. As the devset, we use WMT newstest 2012.
The language pairs Arabic-Russian, French-Russian, Spanish-French and Spanish-Russian were selected from UN corpus (Ziemski et al., 2016), which provides over 10 million multiparallel sentences in 6 languages.

Results
In this section, we present results of our approach. Statistical significance of the winner (marked with ‡) is tested by paired bootstrap resampling against the baseline (child-only) setup (1000 samples, conf. level 0.05; Koehn, 2004).
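For readers unfamiliar with paired bootstrap resampling, a minimal sketch is given below. It takes any corpus-level metric as a function; `exact_match` is only a toy stand-in so that the example is self-contained and is not BLEU. All function and parameter names are hypothetical.

```python
import random


def paired_bootstrap(metric, refs, sys_a, sys_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    `metric(hypotheses, references)` must return a corpus-level score. Both
    systems are always scored on the same resampled sentence indices.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sampled_refs = [refs[i] for i in idx]
        score_a = metric([sys_a[i] for i in idx], sampled_refs)
        score_b = metric([sys_b[i] for i in idx], sampled_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples


def exact_match(hyps, refs):
    """Toy stand-in metric: share of exactly matching sentences (not BLEU)."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


# Toy usage: system A reproduces every reference, system B none.
refs = [f"sentence {i}" for i in range(20)]
good = list(refs)
bad = ["wrong"] * len(refs)
win_rate = paired_bootstrap(exact_match, refs, good, bad, n_samples=200)
```

Under the criterion of Koehn (2004), system A would be called significantly better at the 0.05 level if it wins on at least 95% of the resampled test sets.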
As customary, we label the models with the pair of the source and target language codes, for example the English-to-Estonian translation model is denoted by ENET.
The vocabularies are generated as described in Section 2.1 separately for each tested combination of parent and child. The same vocabulary is used whenever the parent and child use the same set of languages, i.e. disregarding the translation direction and model stage (parent or child). Table 2 summarizes our results for various combinations of high-resource parent and low-resource child language pairs when English is shared between the child and parent either in the encoder or in the decoder.

English as the Common Language
We confirm that sharing the target language improves performance as previously shown (Zoph et al., 2016; Nguyen and Chiang, 2017), here for the ETEN child with the FIEN parent. Using only the parent (FIEN) model to translate the child (ETEN) test set gives a very poor performance, confirming the need for transfer learning or "finetuning". A novel result is that the method also works when the source language is shared, improving ENET by up to 2.71 BLEU thanks to the ENFI parent.
Table 2: Transfer learning with English reused either in source (encoder) or target (decoder). The column "Transfer" is our method, baselines correspond to training on one of the corpora only. Scores (BLEU) are always for the child language pair and they are comparable only within lines or when the child language pair is the same. "Unrelated" language pairs in bold. Upper part: parent larger, lower part: child larger. ("EN" lowercased just to stand out.)
Furthermore, the improvement is not restricted only to related languages such as Estonian and Finnish, as shown in previous works. Unrelated language pairs (shown in bold in Table 2) like Czech and Estonian work too, and in some cases even better than the related ones. We reach an improvement of 3.38 BLEU for ENET when the parent model was ENCS, compared to an improvement of 2.71 from the ENFI parent. This statistically significant improvement contradicts Dabre et al. (2017), who concluded that the more related the languages are, the better transfer learning works. We see it as an indication that the size of the parent training set is more important than relatedness of languages.
The results with the Russian parent for the Estonian child (both directions) show that transliteration is also not necessary. Because there is no vocabulary sharing between Russian Cyrillic and Estonian Latin (except numbers and punctuation, see Section 6.1 for further details), the improvement could be attributed to a better coverage of English; an effect similar to domain adaptation.
On the other hand, this transfer learning works well only when the parent has more training data than the child. As presented in the bottom part of Table 2, low-resource parents do not generally improve the performance of better-resourced children and sometimes they even (significantly) decrease it. This is another indication that the most important factor is the size of the parent corpus compared to the child one.
The baselines are either models trained purely on the child parallel data or only on the parent data. The second baseline only indicates the relatedness of languages because it is only tested but never trained on the child language pair. Also, we do not add any language tag as in Johnson et al. (2017). This also highlights that the improvement of our method cannot be directly attributed to the relatedness of languages: e.g. Czech and Slovak are much more similar than Czech and Estonian (Parent Only BLEU of translation out of English is 6.51 compared to 1.42) and yet the gain from transfer learning is larger for Estonian (+3.38) than for Slovak (+1.62).

Simulated Very Low Resources
In Table 3, we simulate very low-resource settings by downscaling the data for the child model. It is common knowledge that gains from transfer learning are more pronounced for smaller children. The point of Table 3 is to illustrate that our approach is applicable even to extremely small child setups, with as few as 10k sentence pairs. Our transfer learning ("start with a model for whatever parent pair") may thus resolve the issue of applicability of NMT for low-resource languages as pointed out by Koehn and Knowles (2017).
[...] Therefore, it is better to use a parent model that already converged and reached its best performance.
Table 4: Results of child following a parent with swapped direction. "Baseline" is child-only training. "Aligned" is the more natural setup with English appearing on the "correct" side of the parent, the numbers in this column thus correspond to those in Table 2.

Direction Swap in Parent and Child
Relaxing the setup in Section 5.1, we now allow a mismatch in translation direction of the parent and child. The parent XX-EN is thus followed by an EN-YY child or vice versa. It is important to note that Transformer shares word embeddings for the source and target side. The gain can thus be due to better English word embeddings, but definitely not due to a better English language model. It would be interesting to study the effect of not sharing the embeddings but we leave it for future work. The results in Table 4 document that an improvement can be reached even when none of the involved languages is reused on the same side. This interesting result should be studied in more detail. Firat et al. (2016) hinted at possible gains even when both languages are distinct from the low-resource languages but in a multilingual setting. Not surprisingly, the improvements are better when the common language is aligned. The bottom part of Table 4 shows a particularly interesting trick: the parent is not any high-resource pair but the very same EN-ET corpus with source and target swapped. We see gains in both directions, although not always statistically significant. Future work should investigate if this performance boost is possible even for high-resource languages. Similar behavior has been shown in Niu et al. (2018), where in contrast to our work they mixed the data together and added an artificial token indicating the target language.

No Language in Common
Our final set of experiments examines the performance of the ETEN child trained off parents in totally unrelated language pairs. Without any common language, the gains cannot be attributed, e.g., to the shared English word embeddings. The vocabulary overlap is mostly due to short n-grams or numbers and punctuation.
We see gains from transfer learning in all cases, mostly significant. The only non-significant gain is from Arabic-Russian, which does not share a script with the Latin-script child at all. (Sharing of punctuation and numbers is possible across all the tested scripts.) The gains are quite similar (+0.49 to +0.78 BLEU), supporting our assumption that the main factor is the size of the parent (here, all have 10M sentence pairs) rather than language relatedness.

Analysis
Here we provide a rather initial analysis of the sources of the gains.
Table 6: Breakdown of subword vocabulary of experiments involving ET, EN and RU.

Vocabulary Overlap
Our method relies on the vocabulary estimated jointly from the child and parent model. In Transformer, the vocabulary is even shared across encoder and decoder. With a large overlap, we could expect a lot of "information reuse" between the parent and the child.
Since the subword vocabulary depends on the training corpora, a little clarification is needed. We take the vocabulary of subword units as created e.g. for ENRU-ENET experiments, see Section 2.1. This vocabulary contains 28.2k subwords in total. We then process the training corpora for each of the languages with this shared vocabulary, ignore all subwords that appear fewer than 10 times in each of the languages (these subwords will have little to no impact on the result of the training) and break down the total 28.2k subwords into classes depending on the languages in which the particular subword was observed, see Table 6.
We see that the vocabulary is reasonably balanced, with each language having 20-30% of subwords unique to it. English and Estonian share 10% subwords not seen in Russian while Russian shares only 0-1.39% of subwords with each of the other languages. Overall 8.89% of subwords are seen in all three languages.
A particularly interesting subset is the one where parent languages help the child model, in other words subwords appearing anywhere in English and also tokens common to Estonian and Russian. For this set of languages, this amounts to 20.69+10.06+1.39+0.0+8.89 = 41.03%. We list this number on a separate line in Table 6, "From parent". These subwords get their embeddings trained better thanks to the parent model.
[...] portion is shared by all the languages and what portion of subwords benefits from the parent training. We see a similar picture across the board; only AR-RU-ET-EN stands out with the very low number of subwords (6.2%) available already in the parent. The parent AR-RU thus offered very little word knowledge to the child and yet led to a gain in BLEU.
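The "From parent" figure can be reproduced with a few lines of code. The sketch reads the definition as "classes observed in at least one parent language and at least one child language", which is an interpretation of the text above. The assignment of 1.39 and 0.0 to the EN-RU and ET-RU classes is also an assumption based on the order of the sum; it does not affect the total. `from_parent_share` is a hypothetical helper.

```python
def from_parent_share(class_shares, parent_langs, child_langs):
    """Sum the shares of classes seen in both a parent and a child language.

    `class_shares` maps a frozenset of language codes to the percentage of
    subwords observed in exactly those languages.
    """
    return sum(
        share
        for langs, share in class_shares.items()
        if langs & parent_langs and langs & child_langs
    )


# Only the five shares quoted in the text are listed; the ET-only and RU-only
# classes are omitted because they can never satisfy the condition.
shares = {
    frozenset({"EN"}): 20.69,
    frozenset({"EN", "ET"}): 10.06,
    frozenset({"EN", "RU"}): 1.39,   # assumed class assignment
    frozenset({"ET", "RU"}): 0.0,    # assumed class assignment
    frozenset({"EN", "ET", "RU"}): 8.89,
}
total = from_parent_share(shares, parent_langs={"EN", "RU"}, child_langs={"EN", "ET"})
print(round(total, 2))
```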

Output Analysis
Since we rely on automatic analysis, we need to prevent some potential overestimations of translation quality due to BLEU. For this, we took a closer look at the baseline ENET model (BLEU of 17.03 in Table 2) and two ENET children derived from the ENCS parent (BLEU of 20.41) and the ENRU parent (BLEU of 20.09). Table 8 confirms the improvements are not an artifact of uncased BLEU. The gains are apparent with several (now cased) automatic scores.
As documented in Table 9, the improved outputs are considerably longer. In the table, we show also individual n-gram precisions and brevity penalty (BP) of BLEU. The longer output clearly helps to reduce the incurred BP but the improvements are also apparent in n-gram precisions. In other words, the observed gain cannot be attributed solely to producing longer outputs.
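The two BLEU components discussed above can be sketched for a single hypothesis and a single reference as follows. This is a simplified illustration of the standard definitions (clipped n-gram precision and brevity penalty), not the evaluation script used for the reported scores, which aggregates counts over the whole test set. The helper names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of one hypothesis against one reference."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty: 1 for long enough output, exp(1 - r/c) otherwise."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)


# Toy usage: a short output with perfect precision is still penalized by BP,
# while a longer output avoids the penalty even with one wrong token.
ref = "a b c d e f".split()
short_hyp = "a b c".split()
long_hyp = "a b c x e f".split()
short_stats = (modified_precision(short_hyp, ref, 1), brevity_penalty(len(short_hyp), len(ref)))
long_stats = (modified_precision(long_hyp, ref, 1), brevity_penalty(len(long_hyp), len(ref)))
```

The two factors are independent, which is why the text checks n-gram precisions separately from the brevity penalty before attributing the gain to output length.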
Table 10 explains the gains in unigram precisions by checking which tokens in the improved outputs (the parent followed by the child) were present also in the baseline (child-only, denoted "b" in Table 10) and/or confirmed by the reference (denoted "r"). We see that about 44+20% of tokens of improved outputs can be seen as "unchanged" compared to the baseline because they appear already in the baseline output ("b"). (The 44% "rb" tokens are actually confirmed by the reference.) The differing tokens are more interesting: "-" denotes the cases when the improved system produced something different from the baseline and also from the reference. Gains in BLEU are due to "r" tokens, i.e. tokens only in the improved outputs and the reference but not the baseline "b". For both parent setups, there are about 9-9.7% of such tokens. We looked at these 3.2k and 3.5k tokens and we have to conclude that these are regular Estonian words; no Czech or Russian leaks to the output and the gains are not due to simple token types common to all the languages (punctuation, numbers or named entities). We see identical BLEU gains even if we remove all such simple tokens from the candidates and references. A better explanation of the gains thus still has to be sought.

Related Work
Firat et al. (2016) propose multi-way multi-lingual systems, with the main goal of reducing the total number of parameters needed to cater for multiple source and target languages. To keep all the language pairs "active" in the model, a special training schedule is needed. Otherwise, catastrophic forgetting would remove the ability to translate among the languages trained earlier.
Johnson et al. (2017) is another multi-lingual approach: all translation pairs are simply used at once and the desired target language is indicated with a special token at the end of the source side. The model implicitly learns translation between many languages and it can even translate among language pairs never seen together.
Lack of parallel data can be tackled by unsupervised translation (Artetxe et al., 2018; Lample et al., 2018). The general idea is to mix monolingual training of autoencoders for the source and target languages with translation trained on data translated by the previous iteration of the system.
When no parallel data are available, the training set of a closely related high-resource pair can be used with a transliteration approach as described in Karakanta et al. (2018).
Aside from the common back-translation (Sennrich et al., 2016a; Kocmi et al., 2018), simple copying of target monolingual data back to source has also been shown to improve translation quality in low-data conditions.
Similar to transfer learning is also curriculum learning (Bengio et al., 2009;Kocmi and Bojar, 2017), where the training data are ordered from foreign out-of-domain to the in-domain training examples.

Conclusion
We presented a simple method for transfer learning in neural machine translation based on training a parent high-resource pair followed by a low-resource language pair dataset. The method works for shared source or target side as well as for language pairs that do not share any of the translation sides. We observe gains also from totally unrelated language pairs, although not always significant.
One interesting trick we propose for low-resource languages is to start training in the opposite direction and swap to the main one afterwards.
The reasons for the gains are yet to be explained in detail but our observations indicate that the key factor is the size of the parent corpus rather than e.g. vocabulary overlaps.