Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation

Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. In this paper, we explore ways to improve them. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and overcome this bottleneck via language-specific components and deepening NMT architectures. We identify the off-target translation issue (i.e. translating into a wrong target language) as the major source of the inferior zero-shot performance, and propose random online backtranslation to enforce the translation of unseen training language pairs. Experiments on OPUS-100 (a novel multilingual dataset with 100 languages) show that our approach substantially narrows the performance gap with bilingual models in both one-to-many and many-to-many settings, and improves zero-shot performance by ∼10 BLEU, approaching conventional pivot-based methods.

Zero-Shot: Les Member States have been consultedés and have approved this proposal.
Table 1: French→German zero-shot translations with a multilingual NMT model. Our baseline multilingual NMT model often translates into the wrong language for zero-shot language pairs, such as copying the source sentence or translating into English rather than German.

[...] and enable zero-shot translation (i.e. direct translation between a language pair never seen in training) (Firat et al., 2016b; Johnson et al., 2017; Al-Shedivat and Parikh, 2019; Gu et al., 2019). Despite these potential benefits, multilingual NMT tends to underperform its bilingual counterparts (Johnson et al., 2017; Arivazhagan et al., 2019b) and results in considerably worse translation performance when many languages are accommodated (Aharoni et al., 2019). Since multilingual NMT must distribute its modeling capacity between different translation directions, we ascribe this deteriorated performance to the deficient capacity of single NMT models and seek solutions that are capable of overcoming this capacity bottleneck. We propose language-aware layer normalization and linear transformation to relax the representation constraint in multilingual NMT models. The linear transformation is inserted between the encoder and the decoder so as to facilitate the induction of language-specific translation correspondences. We also investigate deep NMT architectures (Wang et al., 2019a; Zhang et al., 2019), aiming at further reducing the performance gap with bilingual methods.
Another pitfall of massively multilingual NMT is its poor zero-shot performance, particularly compared to pivot-based models. Without access to parallel training data for zero-shot language pairs, multilingual models easily fall into the trap of off-target translation, where a model ignores the given target information and translates into a wrong language, as shown in Table 1. To avoid such a trap, we propose the random online backtranslation (ROBT) algorithm. ROBT finetunes a pretrained multilingual NMT model for unseen training language pairs with pseudo parallel batches generated by back-translating the target-side training data.² We perform backtranslation (Sennrich et al., 2016a) into randomly picked intermediate languages to ensure good coverage of ∼10,000 zero-shot directions. Although backtranslation has been successfully applied to zero-shot translation (Firat et al., 2016b; Gu et al., 2019; Lakew et al., 2019), whether it works in the massively multilingual set-up remained an open question, and we investigate it in our work.
For experiments, we collect OPUS-100, a massively multilingual dataset sampled from OPUS (Tiedemann, 2012). OPUS-100 consists of 55M English-centric sentence pairs covering 100 languages. As far as we know, no similar dataset is publicly available.³ We have released OPUS-100 to facilitate future research.⁴ We adopt the Transformer model (Vaswani et al., 2017) and evaluate our approach under one-to-many and many-to-many translation settings. Our main findings are summarized as follows:
• Increasing the capacity of multilingual NMT yields large improvements and narrows the performance gap with bilingual models. Low-resource translation benefits more from the increased capacity.
• Language-specific modeling and deep NMT architectures can slightly improve zero-shot translation, but fail to alleviate the off-target translation issue.
• Finetuning multilingual NMT with ROBT substantially reduces the proportion of off-target translations (by ∼50%) and delivers an improvement of ∼10 BLEU in zero-shot settings, approaching the conventional pivot-based method. We show that finetuning with ROBT converges within a few thousand steps.

² Note that backtranslation actually converts the zero-shot problem into a zero-resource problem. We follow previous work and continue referring to zero-shot translation, even when using synthetic training data.
³ Previous studies (Aharoni et al., 2019; Arivazhagan et al., 2019b) adopt in-house data which was not released.
⁴ https://github.com/EdinburghNLP/opus-100-corpus

Related Work
Pioneering work on multilingual NMT began with multitask learning, which shared the encoder for one-to-many translation (Dong et al., 2015) or the attention mechanism for many-to-many translation (Firat et al., 2016a). These methods required a dedicated encoder or decoder for each language, limiting their scalability. By contrast, Lee et al. (2017) exploited character-level inputs and adopted a shared encoder for many-to-one translation. Ha et al. (2016) and Johnson et al. (2017) further successfully trained a single NMT model for multilingual translation with a target language symbol guiding the translation direction. This approach serves as our baseline. Still, this paradigm forces different languages into one joint representation space, neglecting their linguistic diversity. Several subsequent studies have explored different strategies to mitigate this representation bottleneck, ranging from reorganizing parameter sharing (Blackwood et al., 2018; Wang et al., 2019c; Vázquez et al., 2019), designing language-specific parameter generators (Platanios et al., 2018), and decoupling multilingual word encodings (Wang et al., 2019b) to language clustering (Tan et al., 2019). Our language-specific modeling continues in this direction, but with a special focus on broadening normalization layers and encoder outputs.

Multilingual NMT allows us to perform zero-shot translation, although the quality is not guaranteed (Firat et al., 2016b; Johnson et al., 2017). We observe that multilingual NMT often translates into the wrong target language on zero-shot directions (Table 1), resonating with the 'missing ingredient problem' (Arivazhagan et al., 2019a) and the spurious correlation issue (Gu et al., 2019).
Approaches to improve zero-shot performance fall into two categories: 1) developing novel cross-lingual regularizers, such as the alignment regularizer (Arivazhagan et al., 2019a) and the consistency regularizer (Al-Shedivat and Parikh, 2019); and 2) generating artificial parallel data with backtranslation (Firat et al., 2016b; Gu et al., 2019; Lakew et al., 2019) or pivot-based translation (Currey and Heafield, 2019). The proposed ROBT algorithm belongs to the second category. In contrast to Gu et al. (2019) and Lakew et al. (2019), however, we perform online backtranslation for each training step with randomly selected intermediate languages. ROBT avoids decoding the whole training set for each zero-shot language pair and can therefore scale to massively multilingual settings.
Our work belongs to a line of research on massively multilingual translation (Aharoni et al., 2019; Arivazhagan et al., 2019b). Aharoni et al. (2019) demonstrated the feasibility of massively multilingual NMT and reported encouraging results. We continue in this direction by developing approaches that improve both multilingual and zero-shot performance. Independently from our work, Arivazhagan et al. (2019b) also find that increasing model capacity with deep architectures (Wang et al., 2019a; Zhang et al., 2019) substantially improves multilingual performance. A concurrent related work is that of Bapna and Firat (2019), who introduce task-specific and lightweight adaptors for fast and scalable model adaptation. Compared to these adaptors, our language-aware layers are jointly trained with the whole NMT model from scratch without relying on any pretraining.

Multilingual NMT
We briefly review the multilingual approach (Ha et al., 2016; Johnson et al., 2017) and the Transformer model (Vaswani et al., 2017), which are used as our baseline. Johnson et al. (2017) rely on prepending tokens specifying the target language to each source sentence. In that way, a single NMT model can be trained on the modified multilingual dataset and used to perform multilingual translation. Given a source sentence $x = (x_1, x_2, \ldots, x_{|x|})$, its target reference $y = (y_1, y_2, \ldots, y_{|y|})$ and the target language token $t$, multilingual NMT translates under the encoder-decoder framework (Bahdanau et al., 2015):

$$H = \mathrm{Encoder}([t, x]), \tag{1}$$
$$S = \mathrm{Decoder}(y, H), \tag{2}$$

where $H \in \mathbb{R}^{|x| \times d}$ and $S \in \mathbb{R}^{|y| \times d}$ denote the encoder and decoder outputs, respectively, and $d$ is the model dimension.
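As an illustration of the target-token mechanism reviewed above, the following minimal Python sketch prepends a language token to a tokenized source sentence. The token format `<2xx>` and the function name `add_target_token` are assumptions made for illustration; the text only states that a token specifying the target language is prepended.

```python
def add_target_token(source_tokens, target_lang):
    """Return the encoder input [t, x] for one source sentence.

    source_tokens: list of (sub)word strings for the source sentence x.
    target_lang: language code of the desired output language.
    The "<2xx>" token spelling is a hypothetical choice for this sketch.
    """
    target_token = f"<2{target_lang}>"
    return [target_token] + list(source_tokens)


# The same source sentence can be routed to different target languages.
to_german = add_target_token(["Hello", "world"], "de")
to_french = add_target_token(["Hello", "world"], "fr")
```

Because the translation direction is encoded in the input itself, no architectural change is needed to train one model on many language pairs.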
We employ the Transformer (Vaswani et al., 2017) as the backbone NMT model due to its superior multilingual performance (Lakew et al., 2018). The encoder is a stack of L = 6 identical layers, each containing a self-attention sublayer and a point-wise feedforward sublayer. The decoder follows a similar structure, except for an extra cross-attention sublayer used to condition the decoder on the source sentence. Each sublayer is equipped with a residual connection, followed by layer normalization (Ba et al., 2016, LN(·)):

$$\bar{a} = \mathrm{LN}(a \mid g, b) = \frac{a - \mu}{\sigma} \odot g + b,$$

where $\odot$ denotes element-wise multiplication, and $\mu$ and $\sigma$ are the mean and standard deviation of the input vector $a \in \mathbb{R}^d$, respectively. $g \in \mathbb{R}^d$ and $b \in \mathbb{R}^d$ are model parameters that control the sharpness and location of the regularized layer output $\bar{a}$. Layer normalization has proven effective in accelerating model convergence (Ba et al., 2016).
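The layer normalization described above can be sketched in plain Python as follows. The small constant `eps`, added for numerical stability, is an assumption not mentioned in the text, and the population (rather than sample) standard deviation is assumed.

```python
import math


def layer_norm(a, g, b, eps=1e-6):
    """Normalize vector a, then rescale by gain g and shift by bias b.

    a, g, b: lists of floats with the same length d.
    eps: stability constant (an assumption of this sketch).
    """
    d = len(a)
    mu = sum(a) / d                               # mean of the input
    var = sum((x - mu) ** 2 for x in a) / d       # population variance
    sigma = math.sqrt(var + eps)                  # standard deviation
    # Element-wise: (a - mu) / sigma * g + b
    return [(x - mu) / sigma * gi + bi for x, gi, bi in zip(a, g, b)]


normalized = layer_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

With unit gain and zero bias the output has approximately zero mean and unit variance; the gain and bias then let the model move away from that default.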

Approach
Despite its success, multilingual NMT still suffers from 1) insufficient modeling capacity, where including more languages results in a reduction in translation quality (Aharoni et al., 2019); and 2) off-target translation, where models translate into a wrong target language on zero-shot directions (Arivazhagan et al., 2019a). These drawbacks become severe in massively multilingual settings and we explore approaches to alleviate them. We hypothesize that the vanilla Transformer has insufficient capacity and search for model-level strategies such as deepening the Transformer and devising language-specific components. By contrast, we regard the lack of parallel data as the reason behind the off-target issue. We resort to a data-level strategy by creating, in online fashion, artificial parallel training data for each zero-shot language pair in order to encourage its translation.
Deep Transformer One natural way to improve the capacity is to increase model depth. Deeper neural models are often capable of inducing more generalizable ('abstract') representations and capturing more complex dependencies and have shown encouraging performance on bilingual translation (Bapna et al., 2018;Zhang et al., 2019;Wang et al., 2019a). We adopt the depth-scaled initialization method (Zhang et al., 2019) to train a deep Transformer for multilingual translation.
Language-aware Layer Normalization Regardless of linguistic differences, layer normalization in multilingual NMT simply constrains all languages into one joint Gaussian space, which makes learning more difficult. We propose to relax this restriction by conditioning the normalization on the given target language token $t$ (LALN for short) as follows:

$$\bar{a} = \mathrm{LN}(a \mid g_t, b_t),$$

where $g_t$ and $b_t$ are parameters specific to the target language $t$. We apply this formula to all normalization layers, and leave the study of conditioning on source language information for the future.
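A minimal sketch of how normalization could be conditioned on the target language: one gain/bias pair is stored per target language and selected by the token t. The class name, the dictionary-based lookup, the `eps` constant, and the initialization (gain 1, bias 0) are illustrative assumptions, not the authors' implementation.

```python
import math


def layer_norm(a, g, b, eps=1e-6):
    """Standard layer normalization with gain g and bias b (eps assumed)."""
    d = len(a)
    mu = sum(a) / d
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / d + eps)
    return [(x - mu) / sigma * gi + bi for x, gi, bi in zip(a, g, b)]


class LanguageAwareLayerNorm:
    """Layer normalization with a separate (gain, bias) per target language."""

    def __init__(self, d, target_langs):
        # Hypothetical initialization: identity gain and zero bias.
        self.gain = {t: [1.0] * d for t in target_langs}
        self.bias = {t: [0.0] * d for t in target_langs}

    def __call__(self, a, t):
        # Select the parameters that belong to target language t.
        return layer_norm(a, self.gain[t], self.bias[t])


laln = LanguageAwareLayerNorm(3, ["de", "fr"])
laln.bias["fr"] = [1.0, 1.0, 1.0]  # pretend training moved the French bias
out_de = laln([1.0, 2.0, 3.0], "de")
out_fr = laln([1.0, 2.0, 3.0], "fr")
```

The same input vector is mapped to different outputs depending on the target language, which is the relaxation of the single joint space that the paragraph argues for.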
Language-aware Linear Transformation Different language pairs have different translation correspondences or word alignments (Koehn, 2010). In addition to LALN, we introduce a target-language-aware linear transformation (LALT for short) between the encoder and the decoder to enhance the freedom of multilingual NMT in expressing flexible translation relationships. We adapt Eq. (2) as follows:

$$S = \mathrm{Decoder}(y, W_t H),$$

where $W_t \in \mathbb{R}^{d \times d}$ denotes model parameters.
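The target-language-aware linear transformation can be sketched as a per-language matrix applied to the encoder output before it is passed to the decoder. In this sketch the matrix is applied to each encoder state (each row of H) independently; that position-wise reading, the function names, and the dictionary lookup are assumptions made for illustration.

```python
def matvec(W, h):
    """Multiply a d x d matrix W (list of rows) with a length-d vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def lalt(H, W_by_lang, t):
    """Transform every encoder state in H with the matrix of language t.

    H: list of |x| encoder states, each a list of d numbers.
    W_by_lang: dict mapping a target language to its d x d matrix.
    """
    W_t = W_by_lang[t]
    return [matvec(W_t, h) for h in H]


# Toy parameters: German doubles each state, French swaps the two dimensions.
W_by_lang = {
    "de": [[2, 0], [0, 2]],
    "fr": [[0, 1], [1, 0]],
}
H = [[1, 2], [3, 4]]
H_de = lalt(H, W_by_lang, "de")
H_fr = lalt(H, W_by_lang, "fr")
```

Supporting a further target language only requires one more entry in `W_by_lang`, i.e. one additional d x d matrix.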
Note that adding one more target language in LALT brings in only one weight matrix. Compared to existing work (Firat et al., 2016b), LALT reaches a better trade-off between expressivity and scalability.

Random Online Backtranslation Prior studies on backtranslation for zero-shot translation decode the whole training set for each zero-shot language pair (Gu et al., 2019; Lakew et al., 2019), and their scalability to massively multilingual translation is questionable: in our setting, the number of zero-shot translation directions is 9702. We address scalability by performing online backtranslation paired with randomly sampled intermediate languages. Algorithm 1 shows the detail of ROBT, where for each training instance $(x_k, y_k, t_k)$, we uniformly sample an intermediate language $t'_k$ ($t'_k \neq t_k$), back-translate $y_k$ into $t'_k$ to obtain $x'_k$, and train on the new instance $(x'_k, y_k, t_k)$. Although $x'_k$ may be poor initially (translations are produced online by the model being trained), ROBT still benefits from the translation signal of $t'_k \to t_k$. To reduce the computational cost, we implement batch-based greedy decoding for line 7.
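The sampling and back-translation step of ROBT can be sketched as below. This is not a reproduction of Algorithm 1: the `translate` callable is a hypothetical stand-in for greedy decoding with the model being trained, and how the pseudo instances are mixed with the original batch is left open. The helper `zero_shot_directions` only reproduces the count of 9702 directions for 100 languages in an English-centric corpus.

```python
import random


def robt_instances(batch, target_langs, translate, rng=random):
    """Create pseudo-parallel instances for one training batch.

    batch: list of (x, y, t) with source x, target y, target language t.
    target_langs: candidate intermediate languages.
    translate: hypothetical callable (sentence, language) -> sentence.
    rng: object with a choice() method, e.g. random.Random(seed).
    """
    pseudo = []
    for _x, y, t in batch:
        # Uniformly sample an intermediate language different from t.
        t_prime = rng.choice([lang for lang in target_langs if lang != t])
        # Back-translate the target sentence into the intermediate language.
        x_prime = translate(y, t_prime)
        # Pair the pseudo source with the original target and its language.
        pseudo.append((x_prime, y, t))
    return pseudo


def zero_shot_directions(num_languages):
    """Ordered pairs of distinct non-English languages."""
    non_english = num_languages - 1
    return non_english * (non_english - 1)


# Stub translator so the sketch runs without a model.
def fake_translate(sentence, lang):
    return f"[{lang}] {sentence}"


pseudo_batch = robt_instances(
    [("hello", "hallo", "de")], ["de", "fr", "en"], fake_translate,
    rng=random.Random(0),
)
```

Because only the current batch is decoded, the cost per step does not grow with the number of zero-shot pairs, in contrast to decoding the whole training set for each pair.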

OPUS-100
Recent work has scaled up multilingual NMT from a handful of languages to tens or hundreds, with many-to-many systems being capable of translation in thousands of directions. Following Aharoni et al. (2019), we created an English-centric dataset, meaning that all training pairs include English on either the source or target side. Translation for any language pair that does not include English is zero-shot or must be pivoted through English.
We created OPUS-100 by sampling data from the OPUS collection (Tiedemann, 2012). OPUS-100 is at a similar scale to Aharoni et al. (2019)'s, with 100 languages (including English) on both sides and up to 1M training pairs for each language pair. We selected the languages based on the volume of parallel data available in OPUS.
The OPUS collection is comprised of multiple corpora, ranging from movie subtitles to GNOME [...]

Table 2: Test BLEU for one-to-many translation on OPUS-100 (100 languages). "Bilingual": bilingual NMT, "L": model depth (for both encoder and decoder), "#Param": parameter number, "WR": win ratio (%) compared to ref ([...])

To evaluate zero-shot translation, we also sampled 2000 sentence pairs of test data for each of the 15 pairings of Arabic, Chinese, Dutch, French, German, and Russian. Filtering was used to exclude sentences already in OPUS-100.

Setup
We perform one-to-many (English-X) and many-to-many (English-X ∪ X-English) translation on OPUS-100 (|T| is 100). We apply byte pair encoding (BPE) (Sennrich et al., 2016b; Kudo and Richardson, 2018) to handle multilingual words with a joint vocabulary size of 64k. We randomly shuffle the training set to mix instances of different language pairs. We adopt BLEU (Papineni et al., 2002) for translation evaluation with the toolkit SacreBLEU (Post, 2018). We employ the langdetect library to detect the language of translations, and measure the translation-language accuracy for zero-shot cases. Rather than providing numbers for each language pair, we report average BLEU over all 94 language pairs with test sets (BLEU$_{94}$). We also show the win ratio (WR), counting the proportion where our approach outperforms its baseline.

⁷ For efficiency, we only use 200 sentences per language pair for validation in our multilingual experiments.
Apart from multilingual NMT, our baselines also involve bilingual NMT and pivot-based translation (only for zero-shot comparison). We select four typologically different target languages (German/De, Chinese/Zh, Breton/Br, Telugu/Te) with varied training data size for comparison to bilingual models, as applying bilingual NMT to each language pair is resource-consuming. We report average BLEU over these four languages as BLEU$_4$. We reuse the multilingual BPE vocabulary for bilingual NMT.
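The two auxiliary metrics used in this setup, translation-language accuracy and win ratio, can be sketched as follows. The language detector is passed in as a callable so the sketch stays self-contained; in practice the `detect` function of the langdetect library could fill that role. Counting only strict improvements as wins and returning 0 for an empty input are assumptions of this sketch.

```python
def language_accuracy(translations, target_lang, detect):
    """Percentage of translations detected as being in target_lang.

    detect: callable mapping a sentence to a language code.
    """
    if not translations:
        return 0.0
    hits = sum(1 for s in translations if detect(s) == target_lang)
    return 100.0 * hits / len(translations)


def win_ratio(ours, baseline):
    """Percentage of language pairs where our BLEU beats the baseline.

    ours, baseline: dicts mapping a language pair to a BLEU score.
    Ties are not counted as wins (an assumption).
    """
    wins = sum(1 for pair in ours if ours[pair] > baseline[pair])
    return 100.0 * wins / len(ours)


# Toy detector and toy scores, purely for illustration.
def toy_detect(sentence):
    return "de" if "der" in sentence.split() else "en"


acc = language_accuracy(["der Hund", "the dog"], "de", toy_detect)
wr = win_ratio(
    {"en-de": 30.0, "en-zh": 20.0, "en-br": 10.0, "en-te": 5.0},
    {"en-de": 29.0, "en-zh": 21.0, "en-br": 9.0, "en-te": 5.0},
)
```

A low accuracy value signals that many outputs are in the wrong language, which is how the off-target problem is quantified in the zero-shot experiments.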

Results on One-to-Many Translation
Deepening the Transformer also improves the modeling capacity (+1.88 BLEU$_{94}$, ④→⑧). Although the deep Transformer performs worse than LALN+LALT under a similar number of model parameters in terms of BLEU (-1.49 BLEU$_{94}$, ⑦→⑧), it shows more consistent improvements across different language pairs (+6.4% WR). We obtain better performance when integrating all approaches (⑨). By increasing the model depth to 24 (⑩), the Transformer with our approach yields a score of 29.60 BLEU$_{94}$ and 21.23 BLEU$_4$, beating the baseline (③) on 92.6% of tasks and outperforming the base bilingual model (①) by 0.33 BLEU$_4$. Our approach significantly narrows the performance gap between multilingual NMT and bilingual NMT (20.90 BLEU$_4$ → 21.23 BLEU$_4$, ①→⑩), although similarly deepening bilingual models surpasses our approach by 1.52 BLEU$_4$ (⑩→②).

Results on Many-to-Many Translation
We train many-to-many NMT models on the concatenation of the one-to-many dataset (English→X) and its reversed version (X→English), and evaluate the zero-shot performance on X→X language pairs. Table 3 and Table 4 show the translation results for English→X and X→English, respectively. We focus on the translation performance w/o ROBT in this subsection.
Compared to the one-to-many translation, the many-to-many translation must accommodate twice as many translation directions. We observe that many-to-many NMT models suffer more se[...]

Table 6: Test BLEU and translation-language accuracy for zero-shot translation in many-to-many setting on OPUS-100 (100 languages). "BLEU$_{zero}$/ACC$_{zero}$": average BLEU/accuracy over all zero-shot translation directions in test set, "Pivot": the pivot-based translation that first translates one source sentence into English (X→English NMT), and then into the target language (English→X NMT). Lower accuracy indicates severe off-target translation. The average Pearson correlation coefficient between language accuracy and the corresponding BLEU is 0.93 (significant at p < 0.01).
We find that the overall quality of English→X translation (19.50/23.96 BLEU$_{94}$, ②/⑦, Table 3) lags far behind that of its X→English counterpart (27.60/31.36 BLEU$_{94}$, ②/⑫, Table 4), regardless of the modeling capacity. We ascribe this to the highly skewed training data distribution, where half of the training set uses English as the target. This strengthens the ability of the decoder to translate into English, and also encourages knowledge transfer for X→English language pairs. LALN and LALT show the largest benefit for English→X (+2.9 BLEU$_{94}$, ③→④, Table 3), and only a small benefit for X→English (+0.6 BLEU$_{94}$, ③→④, Table 4). This makes sense considering that LALN and LALT are specific to the target language, so capacity is mainly increased for English→X. Deepening the Transformer yields benefits in both directions (+2.57 BLEU$_{94}$ for English→X, +3.86 BLEU$_{94}$ for X→English; ④→⑦, Tables 3 and 4).

Effect of Training Corpus Size
Our multilingual training data is distributed unevenly across different language pairs, which could affect the knowledge transfer delivered by language-aware modeling and deep Transformer in multilingual translation. We investigate this effect by grouping different language pairs in OPUS-100 into three categories according to their training data size: High (≥ 0.9M, 45), Low (< 0.1M, 18) and Medium (others, 31). Table 5 shows the results.
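The grouping rule above can be written down directly. The function name and the use of raw sentence-pair counts as input are assumptions of this sketch; the thresholds (0.9M and 0.1M) and the group sizes (45, 18, 31) are taken from the text.

```python
def resource_group(num_sentence_pairs):
    """Assign a language pair to High, Medium, or Low by training data size."""
    if num_sentence_pairs >= 900_000:   # High: at least 0.9M pairs
        return "High"
    if num_sentence_pairs < 100_000:    # Low: fewer than 0.1M pairs
        return "Low"
    return "Medium"                     # everything in between


# The reported group sizes add up to the 94 evaluated language pairs.
group_sizes = {"High": 45, "Low": 18, "Medium": 31}
total_pairs = sum(group_sizes.values())
```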
Language-aware modeling benefits low-resource language pairs the most on English→X translation (+5.82 BLEU, Low versus +1.37/+3.11 BLEU, High/Med, ②→③), but has marginal impact on X→English translation as analyzed in Section 6.3. By contrast, deep Transformers yield similar benefits across different data scales (+2.38 average BLEU, English→X and +2.31 average BLEU, X→English, ②→④). We obtain the best performance by integrating both (①→⑥) with a clear positive transfer to low-resource language pairs.

Results on Zero-Shot Translation
Previous work shows that a well-trained multilingual model can do zero-shot X→Y translation directly (Firat et al., 2016b; Johnson et al., 2017). Our results in Table 6 reveal that the translation quality is rather poor (3.97 BLEU$_{zero}$, ② w/o ROBT) compared to the pivot-based bilingual baseline (12.98 BLEU$_{zero}$, ①) under the massively multilingual setting (Aharoni et al., 2019), although translations into different target languages show varied performance. The marginal gain by the deep Transformer with LALN + LALT (+1.44 BLEU$_{zero}$, ②→⑥, w/o ROBT) suggests that weak model capacity is not the major cause of this inferior performance.
In a manual analysis of the zero-shot NMT outputs, we found many instances of off-target translation (Table 1). We use translation-language accuracy to measure the proportion of translations that are in the correct target language. Results in Table 6 show that there is a huge accuracy gap between the multilingual and the pivot-based method (-48.83% ACC$_{zero}$, ①→②, w/o ROBT), from which we conclude that the off-target translation issue is one source of the poor zero-shot performance.
We apply ROBT to multilingual models by finetuning them for an extra 100k steps with the same batch size as for training. Table 6 shows that ROBT substantially improves ACC$_{zero}$ by 35%∼50%, reaching 85%∼87% under different model settings. The multilingual Transformer with ROBT achieves a translation improvement of up to 10.11 BLEU$_{zero}$ (② w/o ROBT → ⑦ w/ ROBT), outperforming the bilingual baseline by 1.1 BLEU$_{zero}$ (① w/o ROBT → ⑦ w/ ROBT) and approaching the pivot-based multilingual baseline (-0.63 BLEU$_{zero}$, ⑧ w/o ROBT → ⑦ w/ ROBT).¹¹ The strong Pearson correlation between the accuracy and BLEU (0.92 on average, significant at p < 0.01) suggests that the improvement on the off-target translation issue explains the increased translation performance to a large extent.
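For readers who want to reproduce this kind of analysis, the Pearson correlation between per-direction language accuracy and BLEU can be computed as in the sketch below. The function is a textbook implementation; the numbers in the usage lines are synthetic placeholders chosen to be perfectly linear, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalized standard deviations; the 1/n factors cancel in the ratio.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Synthetic, perfectly linear toy values (not data from the experiments).
toy_accuracy = [10.0, 20.0, 30.0]
toy_bleu = [1.0, 2.0, 3.0]
r_up = pearson(toy_accuracy, toy_bleu)
r_down = pearson(toy_accuracy, list(reversed(toy_bleu)))
```

A coefficient close to 1 means that directions with higher language accuracy also tend to have higher BLEU, which is the pattern the paragraph reports.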
Results in Tables 3 and 4 show that ROBT's success on zero-shot translation comes at the cost of sacrificing ∼0.50 BLEU$_{94}$ and ∼4% WR on English→X and X→English translation. We also note that models with more capacity yield higher language accuracy (+7.78%/+13.81% ACC$_{zero}$, ③→⑤/③→④, w/o ROBT) and deliver better zero-shot performance before (+1.22/+0.53 BLEU$_{zero}$, ③→⑤/③→④, w/o ROBT) and after ROBT (+2.20/+1.56 BLEU$_{zero}$, ③→⑤/③→④, w/ ROBT). In other words, increasing the modeling capacity benefits zero-shot translation and improves robustness.

¹¹ Note that ROBT improves all zero-shot directions due to its randomness in sampling the intermediate languages. We do not bias ROBT to the given zero-shot test set.

Setting       BLEUzero
6-to-6        11.98
100-to-100    11.23

Table 7: Zero-shot translation quality for ROBT under different settings. "100-to-100": the setting used in the above experiments; we set T to all target languages. "6-to-6": T only includes the zero-shot languages in the test set. We employ the 6-layer Transformer with LALN and LALT for experiments.

Convergence of ROBT. Unlike prior studies (Gu et al., 2019; Lakew et al., 2019), we resort to an online method for backtranslation. The curve in Figure 1 shows that ROBT is very effective, and takes only a few thousand steps to converge, suggesting that it is unnecessary to decode the whole training set for each zero-shot language pair. We leave it to future work to explore whether different back-translation strategies (other than greedy decoding) will deliver larger and continued benefits with ROBT.
Impact of T on ROBT. ROBT heavily relies on T, the set of target languages considered, to distribute the modeling capacity on zero-shot directions. To study its impact, we provide a comparison by constraining T to the 6 languages in the zero-shot test set. Results in Table 7 show that the biased ROBT outperforms the baseline by 0.75 BLEU$_{zero}$. By narrowing T, more capacity is scheduled to the focused languages, which results in performance improvements. But the small scale of this improvement suggests that the number of zero-shot directions is not ROBT's biggest bottleneck.

Conclusion and Future Work
This paper explores approaches to improve massively multilingual NMT, especially on zero-shot translation. We show that multilingual NMT suffers from weak capacity, and propose to enhance it by deepening the Transformer and devising language-aware neural models. We find that multilingual NMT often generates off-target translations on zero-shot directions, and propose to correct it with a random online backtranslation algorithm. We empirically demonstrate the feasibility of backtranslation in massively multilingual settings to allow for massively zero-shot translation for the first time. We release OPUS-100, a multilingual dataset from OPUS including 100 languages with around 55M sentence pairs for future study. Our experiments on this dataset show that the proposed approaches substantially increase translation performance, narrowing the performance gap with bilingual NMT models and pivot-based methods.
In the future, we will develop lightweight alternatives to LALT to reduce the number of model parameters. We will also exploit novel strategies to break the upper bound of ROBT and obtain larger zero-shot improvements, such as generative modeling (Zhang et al., 2016; Su et al., 2018; García et al., 2020; Zheng et al., 2020).

Table 8 lists the languages (other than English) and numbers of sentence pairs in the English-centric multilingual dataset.

B Model Settings
We optimize model parameters using Adam (β 1 = 0.9, β 2 = 0.98) (Kingma and Ba, 2015) with label smoothing of 0.1 and scheduled learning rate (warmup step 4k). We set the initial learning rate to 1.0 for bilingual models, but use 0.5 for multilingual models in order to stabilize training. We apply dropout to residual layers and attention weights, with a rate of 0.1/0.1 for 6-layer Transformer models and 0.3/0.2 for deeper ones. We group sentence