Iterative Back-Translation for Neural Machine Translation

We present iterative back-translation, a method for generating increasingly better synthetic parallel data from monolingual data to train neural machine translation systems. Our proposed method is very simple yet effective and highly applicable in practice. We demonstrate improvements in neural machine translation quality in both high- and low-resource scenarios, including the best reported BLEU scores for the WMT 2017 German↔English tasks.


Introduction
The exploitation of monolingual training data for neural machine translation is an open challenge. One successful method is back-translation (Sennrich et al., 2016b), whereby an NMT system is trained in the reverse translation direction (target-to-source), and is then used to translate target-side monolingual data back into the source language (in the backward direction, hence the name back-translation). The resulting sentence pairs constitute a synthetic parallel corpus that can be added to the existing training data to learn a source-to-target model. Figure 1 illustrates this idea.
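The back-translation procedure described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: `back_translate` and the toy stand-in "model" (which merely reverses word order instead of translating) are hypothetical names introduced here.

```python
def back_translate(reverse_model, target_mono):
    """Translate target-side monolingual sentences back into the source
    language, yielding synthetic (source, target) pairs."""
    return [(reverse_model(t), t) for t in target_mono]

def build_training_corpus(parallel, synthetic):
    # The final source-to-target system trains on real + synthetic data.
    return parallel + synthetic

# Toy stand-in "model": reverses word order instead of translating.
toy_reverse_model = lambda sent: " ".join(reversed(sent.split()))

parallel = [("ein Haus", "a house")]
mono_en = ["a small house"]
synthetic = back_translate(toy_reverse_model, mono_en)
corpus = build_training_corpus(parallel, synthetic)
```

In practice the reverse model would be a trained target-to-source NMT system; the point of the sketch is only the data flow: monolingual target text in, synthetic parallel pairs out.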
In this paper, we show that the quality of back-translation matters and propose iterative back-translation, where back-translated data is used to build better translation systems in the forward and backward directions, which in turn are used to re-back-translate monolingual data. This process can be "iterated" several times. This is a form of co-training (Blum and Mitchell, 1998), where the two models over both translation directions are used to train one another. We show that iterative back-translation leads to improved results over simple back-translation, under both high- and low-resource conditions, improving over the state of the art.

Figure 1: Creating a synthetic parallel corpus through back-translation. First, a system in the reverse direction is trained and then used to translate monolingual data from the target side backward into the source side, to be used in the final system.

Related Work
The idea of back-translation dates back at least to statistical machine translation, where it has been used for semi-supervised learning (Bojar and Tamchyna, 2011) and self-training (Goutte et al., 2009, ch. 12, p. 237). In modern NMT research, back-translation has been reported to yield significant gains on the WMT and IWSLT shared tasks; even simply duplicating the monolingual target data into the source was sufficient to realise some benefits. Similar findings have been reported for low-resource conditions, showing that even poor translations can be beneficial. Gwinnup et al. (2017) mention iteratively applying back-translation in their system description, but did not report successful experiments with their implementation of this approach, achieving no gains.
An alternative way to make use of monolingual data is the integration of a separately trained language model into the neural machine translation architecture (Gülçehre et al., 2015), but this has not yet proven to be as successful as back-translation. Lample et al. (2018) explore the use of back-translated data generated by neural and statistical machine translation systems, aided by denoising with a language model trained on the target side.

Impact of Back-Translation Quality
Our work is inspired by the intuition that a better back-translation system will lead to a better synthetic corpus, and hence a better final system. To empirically validate this hypothesis and measure the correlation between back-translation system quality and final system quality, we use a set of machine translation systems of differing quality (trained in the reverse "back-translation" direction), and check how this affects the final system quality.
We carried out experiments on the high-resource WMT German↔English news translation tasks (Bojar et al., 2017). For these tasks, large parallel corpora are available from related domains. In addition, in-domain monolingual news corpora are provided as well, in much larger quantities. We sub-sampled the 2016 news corpus (see Table 1) to be about twice as large as the parallel training corpus.
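The sub-sampling step above can be sketched as follows. This is a minimal illustration with made-up data, not the authors' preprocessing script; `subsample_monolingual` is a hypothetical helper, and the fixed seed is only for reproducibility of the example.

```python
import random

def subsample_monolingual(mono_corpus, parallel_size, ratio=2.0, seed=13):
    """Randomly sample about ratio * parallel_size sentences from a larger
    monolingual corpus (ratio=2 matches the roughly 2x setup described)."""
    k = min(len(mono_corpus), int(ratio * parallel_size))
    rng = random.Random(seed)
    return rng.sample(mono_corpus, k)  # sampling without replacement

mono = [f"sentence {i}" for i in range(1000)]
sample = subsample_monolingual(mono, parallel_size=100)
```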
Following Sennrich et al. (2016b), a synthetic parallel corpus is created from the in-domain news monolingual data, in equal amounts to the existing real parallel corpus. The systems used to translate the monolingual data are canonical attentional neural machine translation systems (Bahdanau et al., 2015). Our setup is very similar to Edinburgh's submission to the WMT 2016 evaluation campaign (Sennrich et al., 2016a), but uses the fast Marian toolkit (Junczys-Dowmunt et al., 2018) for training. We trained 3 different back-translation systems, namely:

10k iterations: training a neural translation model on the parallel corpus, but stopping after 0.15 epochs;
100k iterations: as above, but stopping after 1½ epochs; and
convergence: as above, but training until convergence (10 epochs, 3 GPU days).

Given these three different systems, we create three synthetic parallel corpora of different quality and train systems on each. Table 2 shows the quality of the final systems.

Table 2: WMT News Translation Task English↔German, reporting cased BLEU on newstest2017, evaluating the impact of the quality of the back-translation system on the final system. Note that the back-translation systems run in the opposite direction and are not comparable to the numbers in the same row.

Figure 2: Re-Back-Translation: Taking the idea of back-translation one step further. After training a system with back-translated data (back system 2 above), it is used to create a synthetic parallel corpus for the final system.

For both directions, the quality of the back-translation systems differs vastly. The 10k iteration systems perform poorly, and their synthetic parallel corpus provides no benefit over a baseline that does not use any back-translated data.
The longer-trained systems have much better translation quality, and their synthetic parallel corpora prove to be beneficial. The back-translation system trained for 100k iterations already provides tangible benefits (+1.5 BLEU for both directions), while the converged system yields even bigger improvements (+2.9 for German-English, and +2.2 for English-German). These results indicate that the quality of the back-translation system is a significant factor in the success of the approach.

Iterative Back-Translation
We now take the idea of back-translation one step further. If we can build a better system with the back-translated data, then we can continue repeating this process: use this better system to back-translate the data, and use this data to build an even better system. See Figure 2 for an illustration of this re-back-translation (repeated back-translation) process, and Algorithm 1 for the details of the iterated back-translation process. The final system benefits from monolingual data in both the source and target languages.
We do not have to stop at one iteration of repeated back-translation. We can iterate training the two back-translation systems multiple times. We refer to this process as iterative back-translation.
In our experiments, we validate our approach under both high-resource and low-resource conditions. Under high-resource conditions, we improve the state of the art with re-back-translation. Under low-resource conditions, we demonstrate the effectiveness of iterative back-translation.

Algorithm 1 Iterative Back-Translation
Input: parallel data D_p, monolingual source text D_s, and monolingual target text D_t
1: Let T→ = T← = D_p
2: repeat
3:   Train source-to-target model Θ→ on T→
4:   Train target-to-source model Θ← on T←
5:   Use Θ← to create S' = {(s,t)}, for t ∈ D_t
6:   Let T→ = D_p ∪ S'
7:   Use Θ→ to create S = {(s,t)}, for s ∈ D_s
8:   Let T← = D_p ∪ S
9: until convergence condition reached
Output: newly-updated models Θ← and Θ→
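The iterated back-translation loop can be sketched in Python as follows. This is a schematic sketch, not the authors' implementation: `train` and `translate` are hypothetical placeholders (`train` fits a model on (input, output) sentence pairs; `translate(model, sents)` decodes a list of sentences), standing in for full NMT training and decoding.

```python
def iterative_back_translation(parallel, mono_src, mono_tgt,
                               train, translate, n_iters=2):
    """Sketch of iterative back-translation with hypothetical
    `train`/`translate` helpers, not tied to any particular toolkit."""
    corpus_fwd = list(parallel)  # T->, trains the source-to-target model
    corpus_bwd = list(parallel)  # T<-, trains the target-to-source model
    model_fwd = model_bwd = None
    for _ in range(n_iters):
        # Train both directions on the current (real + synthetic) corpora.
        model_fwd = train(corpus_fwd)
        model_bwd = train([(t, s) for (s, t) in corpus_bwd])
        # Back-translate monolingual data with each model ...
        synth_fwd = [(translate(model_bwd, [t])[0], t) for t in mono_tgt]
        synth_bwd = [(s, translate(model_fwd, [s])[0]) for s in mono_src]
        # ... and rebuild each training corpus as real + fresh synthetic.
        corpus_fwd = list(parallel) + synth_fwd
        corpus_bwd = list(parallel) + synth_bwd
    return model_fwd, model_bwd
```

Each pass regenerates both synthetic corpora with the latest models, so better back-translation systems produce better synthetic data for the next round.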

Experiments on High Resource Scenario
In §3 we demonstrated that the quality of the back-translation system has a significant impact on the effectiveness of the back-translation approach under high-resource data conditions such as WMT 2017 German-English. Here we ask: how much additional benefit can be realised by repeating this process? Also, do the gains still apply to state-of-the-art systems that use deeper models, i.e., more layers in the encoder and decoder?
We evaluate on German-English and English-German, under the same data conditions as in Section 3. We experiment with both shallow and deep stacked-layer encoder/decoder architectures. The base translation system is trained on the parallel data only. We train a shallow system using 4-checkpoint ensembling (Chen et al., 2017). The system is used to translate the monolingual data with a beam size of 2. The first back-translation system is trained on the parallel data and the synthetic data generated by the base translation system. For better performance, we train a deep model with 8-checkpoint ensembling; again we use a beam size of 2. The final back-translation systems were trained using several different systems: a shallow architecture, a deep architecture, and an ensemble of 4 independent training runs. Across the board, the final systems with re-back-translation outperform the final systems with simple back-translation, by a margin of 0.5-1.1 BLEU.
Notably, the final deep systems trained by re-back-translation outperform the state-of-the-art established at the WMT 2017 evaluation campaign for these language pairs, by a margin of about 1 BLEU point. These are the best published results for this dataset, to the best of our knowledge.
Experimental settings For the experiments in the German-English high-resource scenario, we used the Marian toolkit (Junczys-Dowmunt et al., 2018) for training and for back-translation. The shallow systems (also used for the back-translation step) match the setup of Edinburgh's WMT 2016 system (Sennrich et al., 2016a): an attentional RNN (default Marian settings) with dropout of 0.2 for the RNN parameters, and 0.1 otherwise. Training is smoothed with a moving average and takes about 2-4 days.
The deep system matches the setup of Edinburgh's WMT 2017 system. It uses 4 encoder and 4 decoder layers (Marian setting best-deep) with LSTM cells.
Dropout settings are the same as above. Decoding at test time is done with a beam size of 12, while back-translation uses a beam size of only 2. This difference is reflected in the reported BLEU score for the deep system after back-translation (35.0 for German-English, 28.3 for English-German) and the score reported for the quality of the back-translation system (34.8 (-0.2) and 27.9 (-0.4), respectively) in Table 3.
For all experiments, the true-casing model and the list of BPE operations are kept constant. Both were learned from the original parallel training corpus.

Experiments on Low Resource Scenario
NMT is a data-hungry approach, requiring a large amount of parallel data to reach reasonable performance (Koehn and Knowles, 2017). In a low-resource setting, only a small amount of parallel data exists. Previous work has attempted to incorporate prior or external knowledge to compensate for the lack of parallel data, e.g., injecting inductive bias via linguistic constraints or linguistic factors. However, it is much cheaper and easier to obtain monolingual data in either the source or target language. An interesting question is whether (iterative) back-translation can compensate for the lack of parallel data in such low-resource settings.
To explore this question, we conducted experiments on two datasets: A simulated low-resource setting with English-French, and a more realistic setting with English-Farsi. For the English-French dataset, we used the original WMT dataset, sub-sampled to create smaller sets of 100K and 1M parallel sentence pairs. For English-Farsi, we used the available datasets from LDC and TED Talks, totaling about 100K sentence pairs. For detailed statistics see Table 1.
Following the same experimental setup as in the high-resource setting, we obtain similar patterns of improvement in translation quality (Table 4).

Back-Translation
Generally, we expect the back-translation approach to improve translation accuracy in all language pairs under low-resource settings. In the English-French experiments, large improvements over the baseline are observed in both directions, with +3.5 BLEU; an exception is the setting where the domain of the monolingual data is entirely news, leading to much lower quality than for the other datasets. Measuring the impact of iteratively back-translated data in relation to varying domain mismatch between parallel and monolingual data is a very interesting problem which we will explore in future work, but it is out of scope for this paper.
Balance of real and synthetic parallel data In all our experiments with back-translation, in order to create synthetic parallel data, a small amount of monolingual data is randomly sampled from the large monolingual corpus (Table 1). As pointed out by Sennrich et al. (2016b), the balance between the real and synthetic parallel data matters. However, there is no obvious evidence about the effect of the sample size, so we studied this further by varying the ratio between real and synthetic parallel data, e.g., 1 (real) : 2 (synthetic) and 1 (real) : 3 (synthetic), in our experiments. All the scores are statistically significant with p < 0.01. Our results in Table 5 show that more synthetic parallel data seems to be useful (though the effect is not large), e.g., gains from 16.7 to 16.9 BLEU in English to Farsi and from 22.1 to 22.4 in Farsi to English.
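Building a training corpus with a chosen real-to-synthetic ratio can be sketched as below. This is an illustrative sketch with made-up data; `mix_real_and_synthetic` is a hypothetical helper, not the authors' data-preparation code.

```python
import random

def mix_real_and_synthetic(real, synthetic_pool, ratio=2, seed=7):
    """Combine real parallel data with `ratio` times as much synthetic
    data, sampled from a larger pool of back-translated pairs."""
    k = min(len(synthetic_pool), ratio * len(real))
    rng = random.Random(seed)
    return list(real) + rng.sample(synthetic_pool, k)

real = [("src%d" % i, "tgt%d" % i) for i in range(100)]
pool = [("bt%d" % i, "mono%d" % i) for i in range(1000)]
mixed = mix_real_and_synthetic(real, pool, ratio=3)  # 1:3 setting
```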
Iterative back-translation For iterative back-translation, we obtained results consistent with the earlier findings from §4.1. In the English-French tasks, we see gains of more than +1 BLEU from a further iteration of back-translation, with little difference between 1 or 2 additional iterations. However, in the English-Farsi tasks, gains are much smaller.

Comparison to back-translation with Moses
We now consider the utility of creating synthetic parallel data from different sources, e.g., from phrase-based SMT models produced by Moses (Koehn et al., 2007), a considerably faster and more scalable system than modern NMT techniques. As can be seen in Table 4, this has mixed results: better for English-French, and worse for English-Farsi, than using neural models, although in all cases the results are not far apart.
Quality of the sampled monolingual data Back-translation depends heavily on the quality of the back-translated synthetic data. In our paper, repeating the back-translation process 2-3 times can lead to improved translation. However, this can differ in other language pairs and domains. Also, in our work, we sampled the monolingual data uniformly at random, so sentences may be used more than once in subsequent rounds. It is quite likely that other techniques for data sampling and selection, e.g., non-uniform sampling such as transductive selection or active learning, which could diversify the quality and quantity of the monolingual data, would lead to further improvements in translation performance. We leave this for future work.
Efficiency of iterative back-translation The efficiency of the NMT toolkits we used (sockeye, marian-nmt) is excellent. Both support batch decoding for fast translation; e.g., with a batch size of 200 (beam size 5), marian-nmt can achieve over 5,000 words per second on one GPU (less than 1 day for translating 4M sentences), and this scales linearly with the number of GPUs. Alternatively, we can split the monolingual data into smaller parts and distribute these parts over different GPUs, which greatly speeds up the back-translation process. This leaves the cost of training the model in each iteration, which we do 2-3 times. Overall, the computational cost is manageable (even with larger datasets), and iterative back-translation is quite feasible on modern GPU servers.
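The splitting strategy mentioned above can be sketched as follows; `shard` and `merge` are hypothetical helpers for distributing decoding over GPUs and restoring sentence order afterwards, not part of any toolkit.

```python
def shard(sentences, n_shards):
    """Split monolingual data into n_shards roughly equal parts, so each
    part can be back-translated on a separate GPU in parallel."""
    return [sentences[i::n_shards] for i in range(n_shards)]

def merge(shards):
    """Interleave per-shard outputs back into the original order."""
    out = [None] * sum(len(s) for s in shards)
    for i, part in enumerate(shards):
        out[i::len(shards)] = part
    return out
```

Round-robin sharding (rather than contiguous chunks) keeps shard sizes balanced; the merge step simply inverts the same striding.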

Conclusion
We presented a simple but effective extension of the back-translation approach to training neural machine translation systems. We empirically showed that the quality of the back-translation system matters for synthetic corpus creation, and that neural machine translation performance can be improved by iterative back-translation in both high-resource and low-resource scenarios. The method is simple and highly applicable in practice.
An important avenue for future work is to unify the various approaches to learning, including back-translation (Sennrich et al., 2016b), iterative