Improved Zero-shot Neural Machine Translation via Ignoring Spurious Correlations

Zero-shot translation, translating between language pairs on which a Neural Machine Translation (NMT) system has never been trained, is an emergent property when training the system in multilingual settings. However, naive training for zero-shot NMT easily fails, and is sensitive to hyper-parameter setting. The performance typically lags far behind the more conventional pivot-based approach which translates twice using a third language as a pivot. In this work, we address the degeneracy problem due to capturing spurious correlations by quantitatively analyzing the mutual information between language IDs of the source and decoded sentences. Inspired by this analysis, we propose to use two simple but effective approaches: (1) decoder pre-training; (2) back-translation. These methods show significant improvement (4 22 BLEU points) over the vanilla zero-shot translation on three challenging multilingual datasets, and achieve similar or better results than the pivot-based approach.


Introduction
Despite the recent domination of neural networkbased models (Sutskever et al., 2014;Bahdanau et al., 2014;Vaswani et al., 2017) in the field of machine translation, which have fewer pipelined components and significantly outperform phrasebased systems (Koehn et al., 2003), Neural Machine Translation (NMT) still works poorly when the available number of training examples is limited.Research on low-resource languages is drawing increasing attention, and it has been found promising to train a multilingual NMT (Firat et al., 2016a) model for high-and row-resource languages to deal with low-resource translation (Gu et al., 2018b).As an extreme in terms of the number of supervised examples, prior works dug into * Equal contribution.
translation with zero-resource (Firat et al., 2016b;Chen et al., 2017;Lample et al., 2018a,b) where the language pairs in interest do not have any parallel corpora between them.In particular, Johnson et al. (2017) observed an emergent property of zero-shot translation where a trained multilingual NMT model is able to automatically do translation on unseen language pairs; we refer to this setting as zero-shot NMT from here on.
In this work, we start with a typical degeneracy issue of zero-shot NMT, reported in several recent works (Arivazhagan et al., 2018;Sestorain et al., 2018), that zero-shot NMT is sensitive to training conditions, and the translation quality usually lags behind the pivot-based methods which use a shared language as a bridge for translation (Utiyama and Isahara, 2007;Cheng et al., 2016;Chen et al., 2017).We first quantitatively show that this degeneracy issue of zeroshot NMT is a consequence of capturing spurious correlation in the data.Then, two approaches are proposed to help the model ignore such correlation: language model pre-training and backtranslation.We extensively evaluate the effectiveness of the proposed strategies on four languages from Europarl, five languages from IWSLT and four languages from MultiUN.Our experiments demonstrate that the proposed approaches significantly improve the baseline zero-shot NMT performance and outperforms the pivot-based translation in some language pairs by 2 ∼ 3 BLEU points.
where special tokens y 0 ( bos ) and y T +1 ( eos ) are used to represent the beginning and the end of a target sentence.The conditional probability is parameterized using a neural network, typically, an encoder-decoder architecture based on either RNNs (Sutskever et al., 2014;Cho et al., 2014;Bahdanau et al., 2014), CNNs (Gehring et al., 2017) or the Transformers (Vaswani et al., 2017).
Multilingual NMT We start with a many-tomany multilingual system similar to Johnson et al. (2017) which leverages the knowledge from translation between multiple languages.It has an identical model architecture as the single pair translation model, but translates between multiple languages.For a different notation, we use (x i , y j ) where i, j ∈ {0, ..., K} to represent a pair of sentences translating from a source language i to a target language j.K + 1 languages are considered in total.A multilingual model is usually trained by maximizing the likelihood over training sets D i,j of all available language pairs S. That is: where we denote L j θ (x i , y j ) = log p(y j |x i , j; θ).Specifically, the target language ID j is given to the model so that it knows to which language it translates, and this can be readily implemented by setting the initial token y 0 = j for the target sentence to start with. 1 The multilingual NMT model shares a single representation space across multiple languages, which has been found to facilitate translating low-resource language pairs (Firat et al., 2016a;Lee et al., 2016;Gu et al., 2018b,c).
Pivot-based NMT In practise, it is almost impossible for the training set to contain all K × (K+1) combinations of translation pairs to learn a multilingual model.Often only one (e.g.English) or a few out of the K + 1 languages have parallel sentence pairs with the remaining languages.For instance, we may only have parallel pairs between English & French, and Spanish & English, but not between French & Spanish.What happens if we evaluate on an unseen direction e.g.Spanish to French?A simple but commonly used solution is pivoting: we first translate from Spanish to English, and then from English to French with two separately trained single-pair models or a single multilingual model.However, it comes with two drawbacks: (1) at least 2× higher latency than that of a comparable direct translation model; (2) the models used in pivot-based translation are not trained taking into account the new language pair, making it difficult, especially for the second model, to cope with errors created by the first model.
Zero-shot NMT Johnson et al. (2017) showed that a trained multilingual NMT system could automatically translate between unseen pairs without any direct supervision, as long as both source and target languages were included in training.In other words, a model trained for instance on English & French and Spanish & English is able to directly translate from Spanish to French.Such an emergent property of a multilingual system is called zero-shot translation.It is conjectured that zero-shot NMT is possible because the optimization encourages different languages to be encoded into a shared space so that the decoder is detached from the source languages.As an evidence, Arivazhagan et al. (2018) measured the "cosine distance" between the encoder's pooled outputs of each sentence pair, and found that the distance decreased during the multilingual training.

Degeneracy Issue of Zero-shot NMT
Despite the nice property of the emergent zeroshot NMT compared to other approaches such as pivot-based methods, prior works (Johnson et al., 2017;Firat et al., 2016b;Arivazhagan et al., 2018), however, have shown that the quality of zero-shot NMT significantly lags behind pivot-based translation.In this section, we investigate an underlying cause behind this particular degeneracy issue.

Zero-shot NMT is Sensitive to Training Conditions
Preliminary Experiments Before drawing any conclusions, we first experimented with a variety of hyper-parameters to train multilingual systems and evaluated them on zero-shot situations, which refer to language pairs without parallel resource.We performed the preliminary experiments on Europarl2 with the following languages: English (En), French (Fr), Spanish (Es) and German (De) with no parallel sentences between any two of Fr,  Es and De.We used newstest20103 as the validation set which contains all six directions.The corpus was preprocessed with 40, 000 BPE operations across all the languages.We chose Transformer (Vaswani et al., 2017) -the state-of-the-art NMT architecture on a variety of languages -with the parameters of d model = 512, d hidden = 2048, n heads = 8, n layers = 6.Multiple copies of this network were trained on data with all parallel directions for {De,Es,Fr} & En, while we varied other hyper-parameters.As the baseline, six single-pair models were trained to produce the pivot results.

Results
The partial results are shown in Fig. 1 including five out of many conditions on which we have tested.The default uses the exact Transformer architecture with xavier uniform (Glorot and Bengio, 2010) initialization for all layers, and is trained with lr max = 0.005, t warmup = 4000, dropout = 0.1, n batch = 2400 tokens/direction.For the other variants compared to the default setting, large-bs uses a bigger batch-size of 9,600; attn-drop has an additional dropout (0.1) on each attention head (Vaswani et al., 2017); we use the Pytorch's default method4 to initialize all the weights for pytorch-init; we also try to change the conventional architecture with a layer-wise attention (Gu et al., 2018a) between the encoder and decoder, and it is denoted as layerwise-attn.All results are evaluated on the validation set using greedy decoding.
From Fig. 1, we can observe that the translation quality of zero-shot NMT is highly sensitive to the hyper-parameters (e.g.layerwise-attn completely fails on zero-shot pairs) while almost all the models achieve the same level as the baseline does on parallel directions.Also, even with the stable setting (default), the translation quality of zero-shot NMT is still far below that of pivotbased translation on some pairs such as Fr-De.

Performance Degeneracy is Due to Capturing Spurious Correlation
We look into this problem with some quantitative analysis by re-thinking the multilingual training in Eq. ( 4).For convenience, we model the decoder's output y j as a combination of two factors: the output language ID z ∈ {0, . . ., K}, and languageinvariant semantics s (see Fig. 2 for a graphical illustration.).In this work, both z and s are unobserved variables before the y j was generated.Note that z is not necessarily equal to the language id j.
The best practise for zero-shot NMT is to make z and s conditionally independent given the source sentence.That is to say, z is controlled by j and s is controlled by x i .This allows us to change the target language by setting j to a desired language, and is equivalent to ignoring the correlation between x i and z.That is, the mutual information between the source language ID i and the output language ID z -I(i; z) -is 0. However, the conventional multilingual training on an imbalanced dataset makes zero-shot NMT problematic because the MLE objective will try to capture all possible correlations in the data including the spurious dependency between i and z.For instance, consider training a multilingual NMT model for Es as input only with En as the target language.Although it is undesirable for the model to capture the dependency between i (Es) and z (En), MLE does not have a mechanism to prevent it (i.e., I(i; z) > 0) from happening.In other words, we cannot explicitly control the trade off between I(i; z) and I(j; z) with MLE training.When I(i; z) increases as opposed to I(j; z), the Dec Enc y j < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > x i < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > j < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > z < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > s < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > Figure 2: A conceptual illustration of decoupling the output translation (y j ) into two latent factors (language type and the semantics) where the undesired spurious correlation (in red) will be wrongly captured if i is always translated to j during training.decoder ignores j, which makes it impossible for the trained model to perform zero-shot NMT, as the decoder cannot output a translation in a language that was not trained before.
Quantitative Analysis We performed the quantitative analysis on the estimated mutual information I(i; z) as well as the translation quality of zero-shot translation on the validation set.As an example, we show the results of large-bs setting in Fig. 3 where the I(i; z) is estimated by: where the summation is over all possible language pairs, and p(•) represents frequency.The latent language identity z = φ(y j ) is estimated by an external language identification tool given the actual output (Lui and Baldwin, 2012).In Fig. 3, the trend of zero-shot performance is inversely proportional to I(i; z), which indicates that the degeneracy is from the spurious correlation.
The analysis of the mutual information also explains the sensitivity issue of zero-shot NMT during training.As a side effect of learning translation, I(i; z) tends to increase more when the training conditions make MT training easier (e.g.large batch-size).The performance of zero-shot NMT becomes more unstable and fails to produce translation in the desired language (j).

Approaches
In this section, we present two existing, however, not investigated in the scenario of zero-shot NMT approaches -decoder pre-training and backtranslation -to address this degeneracy issue.Figure 3: The learning curves of the mutual information between input and output language IDs as well as the averaged BLEU scores of all zero-shot directions on the validation sets for the large-bs setting.

Language Model Pre-training
The first approach strengthens the decoder language model (LM) prior to MT training.Learning the decoder language model increases I(j; z) which facilitates zero-shot translation.Once the model captures the correct dependency that guides the model to output the desired language, it is more likely for the model to ignore the spurious correlation during standard NMT training.That is, we pre-train the decoder as a multilingual language model.Similar to Eq. ( 4): where Lj θ (y j ) = log p(y j |0, j; θ), which represents that pre-training can be implemented by simply replacing all the source representations by zero vectors during standard NMT training (Sennrich et al., 2016).In Transformer, it is equivalent to ignoring the attention modules between the encoder and decoder.
The proposed LM pre-training can be seen as a rough approximation of marginalizing all possible source sentences, while empirically we found it worked well.After a few gradient descent steps, the pre-trained model continues with MT training.In this work, we only consider using the same parallel data for pre-training.We summarize the pros and cons as follows: Pros: Efficient (a few LM training steps + NMT training); no additional data needed; Cons: The LM pre-training objective does not necessarily align with the NMT objective.

Back-Translation
In order to apply language model training along with the NMT objective, we have to take the encoder into account.We use back-translation (BT, Sennrich et al., 2016), but in particular for multilingual training.Unlike the original purpose of using BT for semi-supervised learning, we utilize BT to generate synthetic parallel sentences for all zero-shot directions (Firat et al., 2016b), and train the multilingual model from scratch on the merged datasets of both real and synthetic sentences.By doing so, every language is forced to translate to all the other languages.Thus, I(i; z) is effectively close to 0 from the beginning, preventing the model from capturing the spurious correlation between i and z.
Generating the synthetic corpus requires at least a reasonable starting point that translates on zeroshot pairs which can be chosen either through a pivot language (denoted as BTTP) or the current zero-shot NMT trained without BT (denoted BTZS).For instance, in previous examples, to generate synthetic pairs for Es-Fr given the training set of En-Fr, BTTP translates every En sentence to Es with a pre-trained En-Es model (used in pivot-based MT), while BTZS uses the pretrained zero-shot NMT to directly translate all Fr sentences to Es. Next, we pair the generated sentences in the reverse direction Es-Fr and merge them to the training set.The same multilingual training is applied after creating synthetic corpus for all translation pairs.Similar methods have also been explored by Firat et al. (2016b) Cons: BT is computationally more expensive as we need to create synthetic parallel corpora for all language pairs (up to O(K 2 )) to train a multilingual model for K languages; both the performance of BTTP and BTZS might be affected by the quality of the pre-trained models.Training Conditions For all non-IWSLT experiments, we use the same architecture as the preliminary experiments with the training conditions of default, which is the most stable setting for zero-shot NMT in Sec.3.1.Since IWSLT is much smaller compared to the other two datasets, we find that the same hyper-parameters except with t warmup = 8000, dropout = 0.2 works better.

Experiments
Models As the baseline, two pivot-based translation are considered: • PIV-S (through two single-pair NMT models trained on each pair;) • PIV-M (through a single multilingual NMT model trained on all available directions;) Moreover, we directly use the multilingual system that produce PIV-M results for the vanilla zeroshot NMT baseline.
As described in Sec. 4, both the LM pre-training and BT use the same datasets as that in MT training.By default, we take the checkpoint of 20, 000 steps LM pre-training to initialize the NMT model as our preliminary exploration implied that further increasing the pre-training steps would not be helpful for zero-shot NMT.For BTTP, we choose either PIV-S or PIV-M to generate the synthetic corpus based on the average BLEU scores on parallel data.On the other hand, we always select the best zero-shot model with LM pre-training for BTZS by assuming that pre-training consistently improves the translation quality of zero-shot NMT.

Model Selection for Zero-shot NMT
In principle, zero-shot translation assumes we cannot access any parallel resource for the zero-shot pairs during training, including cross-validation for selecting the best model.However, according to Fig. 1, the performance of zero-shot NMT tends to drop while the parallel directions are still improving which indicates that simply selecting the best model based on the validation set of parallel directions is sub-optimal for zero-shot pairs.In this work, we propose to select the best model by maximizing the likelihood over all available validation set Di,j of parallel directions together with a language model score from a fully trained language model θ (Eq.( 4)).That is, where ŷk is the greedy decoding output generated from the current model p(•|x i , k; θ) by forcing it to translate x i to language k that has no parallel data with i during training.The first term measures the learning progress of machine translation, and the second term shows the level of degeneracy in zero-shot NMT.Therefore, when the spurious correlation between the input and decoded languages is wrongly captured by the model, the desired language model score will decrease accordingly.

Results and Analysis
Overall Performance Comparison We show the translation quality of zero-shot NMT on the three datasets in Table 2.All the results (including pivot-based approaches) are generated using beam-search with beam size = 4 and length penalty α = 0.6 (Vaswani et al., 2017).Experimental results in Table 2 demonstrate that both our proposed approaches achieve significant improvement in zero-shot translation for both directions in all the language pairs.Only with LM pretraining, the zero-shot NMT has already closed the gap between the performance and that of the strong pivot-based baseline for datasets.For pairs which are lexically more similar compared to the pivot language (e.g.Es-Fr v.s.En), ZS+LM achieved much better performance than its pivotbased counterpart.Depending on which languages we consider, zero-shot NMT with the help of BTTP & BTZS can achieve a significant improvement around 4 ∼ 22 BLEU points compared to the naïve approach.For a fair comparison, we also re-implement the alignment method proposed by Arivazhagan et al. (2018) based on cosine distance and the results are shown as ZS+Align in Table .2, which is on average 1.5 BLEU points lower than our proposed ZS+LM approach indicating that our approaches might fix the degeneracy issue better.
As a reference of upper bound, we also include the results with a fully supervised setting, where all the language pairs are provided for training.Table 2 shows that the proposed BTTP & BTZS are competitive and even very close to this upper bound, and BTZS is often slightly better than BTTP across different languages.
No Pivots We conduct experiments on the setting without available pivot languages.Shown in Table 2: Overall BLEU scores including parallel and zero-shot directions on the test sets of three multilingual datasets.In (a) (c) (e), En is used as the pivot-language; no language is available as the pivot for (b); we also present partial results in (d) where a chain of pivot languages are used.For all columns, the highest two scores are marked in bold for all models except for the fully-supervised "upper bound".
merged dataset.As shown in Table 2(a) and (b), although the setting of no pivot pairs performs slightly worse than that with pivot languages, both our approaches (LM, BTZS) substantially improve the vanilla model and achieve competitive performance compared to the fully supervised setting.
Case Study We also show a randomly selected example for Ru → Zh from the validation set of MultiUN dataset in Fig. 5.We can see that at the beginning, the output sentence of ZS+LM is fluent while ZS learns translation faster than ZS+LM.
Then, En tokens starts to appear in the output sentence of ZS, and it totally shifts to En eventually.

Related Works
Zero-shot Neural Machine Translation Zeroshot NMT has received increasingly more interest in recent years.Platanios et al. (2018) introduced the contextual parameter generator, which generated the parameters of the system and performed zero-shot translation.Arivazhagan et al. (2018) conjectured the solution towards the degeneracy in zero-shot NMT was to guide an NMT encoder to learn language agnostic representations.Sestorain et al. (2018) combined dual learning to improve zero-shot NMT.However, unlike our work, none of these prior works performed quantitative investigation of the underlying cause.
Zero Resource Translation This work is also closely related to zero-resource translation which is a general task to translate between languages without parallel resources.Possible solutions include pivot-based translation, multilingual or unsupervised NMT.For instance, there have been attempts to train a single-pair model with a pivotlanguage (Cheng et al., 2016;Chen et al., 2017) or a pivot-image (Lee et al., 2017;Chen et al., 2018).
Unsupervised Translation Unlike the focus of this work, unsupervised translation usually refers to a zero-resource problem where many monolingual corpora are available.Lample et al. (2018a); Artetxe et al. (2018) proposed to enforce a shared latent space to improve unsupervised translation quality which was shown not necessary by Lample et al. (2018b) in which a more effective initialization method for related languages was proposed.
Neural Machine Translation Pre-training As a standard transfer learning approach, pre-training significantly improves the translation quality of low resource languages by fine-tuning the parameters trained on high-resource languages (Zoph et al., 2016;Gu et al., 2018c;Lample and Conneau, 2019).Our proposed LM pre-training can also be included in the same scope while following a different motivation.

Conclusion
In this paper, we analyzed the issue of zero-shot translation quantitatively and successfully close the gap of the performance of between zero-shot translation and pivot-based zero-resource translation.We proposed two simple and effective strategies for zero-shot translation.Experiments on the Europarl, IWSLT and MultiUN corpora show that our proposed methods significantly improve the vanilla zero-shot NMT and consistently outperform the pivot-based methods.

A Additional Experiments
A.1 Trade-off between decoding speed and translation quality In Table .3, we empirically tested the decoding speed by using either pivot-based methods or zeroshot NMT.The overhead of switching models in pivot-based translation has been ignored.All the speed are measured as "ms/sentence" and tested in parallel on 8 V100 GPUs using beam-search with a beam size 4. Vanilla zero-shot NMT is faster but performs worse than pivot-based methods.There exists a trade-off between the decoding speed and the translation quality where we also present a fast pivoting method where we found that using greedy-decoding for the pivot language only affects the translation quality by a small margin.

Model
However, both our proposed approaches significantly improve the zero-shot NMT and outperforms the pivot-based translation with shorter decoding time, making such trade-off meaningless.

A.2 Effect of Using Multi-way Data
Prior research (Cheng et al., 2016) also reported that the original Europarl dataset contains a large proportion of multi-way translations.To investigate the affects, we followed the same process in (Cheng et al., 2016;Chen et al., 2017) to exclude all multi-way translation sentences, which means there are no overlaps in pairwise language pairs.The statistics of this modified dataset (Europarlc) compared to the original Europarl dataset are shown in Table 4.Although we observed a performance drop by using data without multi-way sentences, the results in Table 5 show that the proposed LM pre-training is not affected by obtaining multi-way data and consistently improves the vanilla zero-shot NMT.We conjecture that the performance drop is mainly because of the size of the dataset.Also our methods can easily beat (Chen et al., 2017)

Figure 1 :
Figure 1: Partial results on zero-shot and parallel directions on Europarl dataset with variant multilingual training conditions (blue: default, red: large-bs, orange: pytorch-init, green: attn-drop, purple: layerwise-attn).The dashed lines are the pivot-based or direct translation results from baseline models.
; Zheng et al. (2017); Sestorain et al. (2018), but have not been studied or used in the context of zero-shot NMT.Pros: BT explicitly avoids the spurious correlation.Also, BTZS potentially improves further by utilizing the zero-shot NMT model augmented with LM pre-training.

Figure 4 :
Figure 4: Learning curves of the two proposed approaches (LM, BTZS) and the vanilla ZS on Europarl Fr→De with two conditions (default, large-bs).The red dashed line is the pivot-based baseline.

Figure 5 :
Figure 5: Zero-shot translation performance on Ru → Zh from MultiUN dataset.(↑) An example randomly selected from the validation set, is translated by both the vanilla zero-shot NMT and that with LM pre-training at four checkpoints.Translation in an incorrect language (English) is marked in pink color.(←) We showed the two learning curves for the averaged zero-shot BLEU scores on validation set of Multi-UN with the corresponded checkpoints marked.

Table 2 (
b), our training sets only contain Es-En and De-Fr.Then if we want to translate from De to Fr, pivot-based methods will not work.However, we can still perform zero-shot NMT by simply training a multilingual model on the

Table 3 :
Decoding speed and the translation quality (average BLEU scores) of the zero-shot pairs on Europarl dataset.

Table 5 :
Effects of multi-way data on Europarl."Yes" means with multi-way translation, and "No" means the opposite.