Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation

Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 33 BLEU on ro-en translation without any parallel data or back-translation.


Introduction
Recent work has demonstrated the efficacy of multilingual neural machine translation (multilingual NMT) in improving the translation quality of low-resource languages (Firat et al., 2016; Aharoni et al., 2019) as well as zero-shot translation (Ha et al., 2016; Johnson et al., 2017; Arivazhagan et al., 2019b). The success of multilingual NMT on low-resource languages relies heavily on transfer learning from high-resource languages, for which copious amounts of parallel data are easily accessible. However, existing multilingual NMT approaches often do not effectively utilize the abundance of monolingual data, especially in low-resource languages. On the other end of the spectrum, self-supervised learning methods, consuming
In this work, we propose to combine the beneficial effects of multilingual NMT with the self-supervision from monolingual data. Compared with multilingual models trained without any monolingual data, our approach shows consistent improvements in the translation quality of all languages, with improvements of more than 10 BLEU points on certain low-resource languages. We further demonstrate improvements in zero-shot translation, where our method has almost on-par quality with pivoting-based approaches, without using any alignment or adversarial losses. The most interesting aspect of this work, however, is that we introduce a path towards effectively adding new unseen languages to a multilingual NMT model, showing strong translation quality on several language pairs by leveraging only monolingual data with self-supervised learning, without the need for any parallel data for the new languages.

Method
We propose a co-training mechanism that combines supervised multilingual NMT with monolingual data and self-supervised learning. While several pre-training based approaches have been studied in the context of NMT (Dai and Le, 2015; Conneau and Lample, 2019; Song et al., 2019), we proceed with Masked Sequence-to-Sequence (MASS) (Song et al., 2019) given its success on unsupervised and low-resource NMT, and adapt it to the multilingual setting.

Adapting MASS for multilingual models
MASS adapts the masked de-noising objective (Devlin et al., 2019; Raffel et al., 2019) for sequence-to-sequence models by masking the input to the encoder and training the decoder to generate the masked portion of the input. To utilize this objective function for unsupervised NMT, Song et al. (2019) enhance their model with additional improvements, including language embeddings, target language-specific attention context projections, shared target embeddings and softmax parameters, and high-variance uniform initialization for target attention projection matrices. We use the same set of hyper-parameters for self-supervised training as described in Song et al. (2019). However, while the success of MASS relies on the architectural modifications described above, we find that our multilingual NMT experiments are stable even in the absence of these techniques, thanks to the smoothing effect of multilingual joint training. We also forego the separate source and target language embeddings in favour of prepending the source sentences with a <2xx> token (Johnson et al., 2017).
We train our models simultaneously on supervised parallel data using the translation objective and on monolingual data using the MASS objective. To denote the target language in multilingual NMT models, we prepend the source sentence with the <2xx> token for that language.
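To make the two kinds of training example concrete, the following is a minimal sketch rather than our actual data pipeline. It assumes a single contiguous masked span covering about half of the sentence (the setting of Song et al. (2019), not restated above), a hypothetical `[MASK]` symbol, and that monolingual examples are tagged with their own language; the helper names are illustrative only.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def add_target_tag(tokens, target_lang):
    """Prepend the <2xx> token that names the target language."""
    return [f"<2{target_lang}>"] + list(tokens)


def mass_example(tokens, lang, mask_ratio=0.5, rng=random):
    """Build one self-supervised example: mask a contiguous span of the
    encoder input and use the masked tokens as the decoder target.
    The span length (mask_ratio) is an assumption of this sketch."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot mask an empty sentence")
    span = min(n, max(1, int(round(n * mask_ratio))))
    start = rng.randint(0, n - span)  # inclusive on both ends
    masked = list(tokens)
    masked[start:start + span] = [MASK] * span
    # Assumption: the monolingual example is tagged with its own language.
    return add_target_tag(masked, lang), list(tokens[start:start + span])


def translation_example(src_tokens, tgt_tokens, tgt_lang):
    """Build one supervised example: tagged source, unchanged target."""
    return add_target_tag(src_tokens, tgt_lang), list(tgt_tokens)
```

Both functions return an (encoder input, decoder target) pair in the same format, which is what allows a single model to be trained on the two objectives at once.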

Datasets
We use the parallel and monolingual training data provided with the WMT corpus for 15 languages to and from English. The amount of parallel data available ranges from more than 60 million sentence pairs for En-Cs to roughly 10k sentence pairs for En-Gu. We also collect additional monolingual data from the WMT news-crawl, news-commentary, common-crawl, europarl-v9, news-discussions and wikidump datasets in all 16 languages, including English. The amount of monolingual data varies from 2 million sentences in Zh to 270 million in De. The distribution of our parallel and monolingual data is depicted in Figure 1.

Data Sampling
Given the data imbalance across languages in our datasets, we use a temperature-based data balancing strategy to over-sample low-resource languages in our multilingual models (Arivazhagan et al., 2019b). We use a temperature of T = 5 to balance our parallel training data. When applicable, we sample monolingual data uniformly across languages since this distribution is not as skewed. For experiments that use both monolingual and parallel data, we mix the two sources at an equal ratio (50% monolingual data with self-supervision and 50% parallel data).
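The sampling scheme can be sketched as follows. This is an illustration under stated assumptions, not our training code: we assume the common form of temperature-based balancing in which a language pair with D sentence pairs is sampled with probability proportional to (D / sum of all D)^(1/T), and the function names are hypothetical. Only the En-Cs and En-Gu sizes in the usage note come from the Datasets section.

```python
import random


def temperature_probs(sizes, temperature=5.0):
    """Map dataset sizes to sampling probabilities proportional to
    (size / total) ** (1 / temperature). T = 1 keeps the raw data
    distribution; larger T moves it towards uniform."""
    total = sum(sizes.values())
    scaled = {k: (n / total) ** (1.0 / temperature) for k, n in sizes.items()}
    norm = sum(scaled.values())
    return {k: v / norm for k, v in scaled.items()}


def sample_source(parallel_sizes, mono_langs, rng,
                  temperature=5.0, mono_fraction=0.5):
    """Pick the source of the next training example: monolingual data
    (uniform over languages) with probability mono_fraction, otherwise
    a temperature-sampled parallel language pair."""
    if rng.random() < mono_fraction:
        return ("mass", rng.choice(sorted(mono_langs)))
    probs = temperature_probs(parallel_sizes, temperature)
    pairs = sorted(probs)
    weights = [probs[p] for p in pairs]
    return ("translate", rng.choices(pairs, weights=weights, k=1)[0])
```

Considering only En-Cs (60 million pairs) and En-Gu (10k pairs), the raw sampling ratio is 6000 to 1; under this formula with T = 5 it shrinks to roughly 5.7 to 1, which is the over-sampling effect described above.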

Architecture and Optimization
All experiments are performed with the Transformer architecture (Vaswani et al., 2017) using the open-source Tensorflow-Lingvo implementation (Shen et al., 2019). Specifically, we use the Transformer Big model containing 375M parameters (6 layers, 16 heads, 8192 hidden dimension) (Chen et al., 2018) and a shared source-target SentencePiece model (SPM) (Kudo and Richardson, 2018). We use a vocabulary size of 32k for the bilingual models and 64k for the multilingual models.

Using Monolingual Data for Multilingual NMT
We evaluate the performance of the models using SacreBLEU (Post, 2018) on standard WMT validation and test sets (Papineni et al., 2002). The performance of our bilingual baselines for all 30 English-centric language pairs is reported in Table 1. We compare the performance of bilingual models, multilingual models trained with just supervised data for 30 language pairs (15 languages to and from English) and multilingual models trained with a combination of supervised and monolingual data in Figure 2.
High-Resource Translation. Our results suggest that a single multilingual model is able to match the quality of individual bilingual models with a gap of less than 2 BLEU points for most high-resource languages, with the exception of Chinese (Zh). The slight quality regression is not surprising, given the large number of languages competing for capacity within the same model (Arivazhagan et al., 2019b). We find that adding monolingual data improves the multilingual model quality across the board, even for high-resource language pairs.
Low-Resource Translation. From Figure 2, we observe that our supervised multilingual NMT model significantly improves the translation quality for most low- and medium-resource languages compared with the bilingual baselines. Adding monolingual data leads to a further improvement of 1-2 BLEU for most medium-resource languages. For the lowest-resource languages like Kazakh (kk), Turkish (tr) and Gujarati (gu), we can see that multilingual NMT alone is not sufficient to reach high translation quality. The addition of monolingual data has a large positive impact on very low-resource languages, significantly improving quality over the supervised multilingual model. These improvements range from 3-5 BLEU in the en→xx direction to more than 5 BLEU for the xx→en translation.
Zero-Shot Translation. We next evaluate the effect of training on additional monolingual data on zero-shot translation in multilingual models. Table 2 demonstrates the zero-shot performance of our multilingual model that is trained on 30 language pairs, and evaluated on French (fr)-German (de) and German (de)-Czech (cs), when trained with and without monolingual data. To compare with the existing work on zero-shot translation, we also evaluate the performance of multilingual models trained on just the relevant languages (en-fr-de for fr-de translation, en-cs-de for cs-de translation). We observe that the additional monolingual data significantly improves the quality of zero-shot translation, often resulting in a 3-6 BLEU increase on all zero-shot directions compared to our multilingual baseline. We hypothesize that the additional monolingual data seen during the self-supervised training process helps better align representations across languages, akin to the smoothing effect in semi-supervised learning (Chapelle et al., 2010). We leave further exploration of this intriguing phenomenon to future work.

Table 3: Translation quality of the new language added to Multilingual NMT using just monolingual data. Multilingual NMT here is a multilingual model with 30 language pairs; Mono. Only is a bilingual model used as a baseline, trained with only monolingual data with self-supervised learning; Multilingual NMT-xx is a multilingual model trained on 28 language pairs (xx is the language not present in the model); Multilingual NMT-xx + Mono. is a multilingual model with 28 language pairs but only monolingual data for xx.

Adding New Languages to Multilingual NMT
Inspired by the effectiveness of monolingual data in boosting low-resource language translation quality, we continue with a stress-test in which we completely remove the available parallel data from our multilingual model, one language at a time, in order to observe the unsupervised machine translation quality for the missing language.
Results of this set of experiments are detailed in Table 3. We find that simply adding monolingual data for a new language to the training procedure of a multilingual model is sufficient to obtain strong translation quality for several languages, often attaining within a few BLEU points of the fully supervised multilingual baseline, without the need for iterative back-translation. We also notice significant quality improvements over models trained with just self-supervised learning using monolingual data for a variety of languages. On WMT ro-en, the performance of our model exceeds XLM (Conneau and Lample, 2019) by over 1.5 BLEU and matches bilingual MASS (Song et al., 2019), without utilizing any back-translation. This suggests that jumpstarting the iterative back-translation process from multilingual models might be a promising avenue to supporting new languages.
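The configurations compared in Table 3 differ only in which training tasks they include. The sketch below enumerates those tasks under our reading of the setup; `build_tasks` and the tuple layout are hypothetical names introduced for illustration, and the language list in the usage note is only a subset of the languages named in this paper.

```python
def build_tasks(languages, held_out=None, use_mono=True):
    """List the (objective, source, target) tasks a model trains on.

    languages: the non-English languages in the multilingual model.
    held_out:  a language whose parallel data is removed (xx in Table 3).
    use_mono:  whether to add a self-supervised (MASS) task per language,
               including English and the held-out language.
    """
    tasks = []
    for lang in languages:
        if lang == held_out:
            continue  # no parallel data for the held-out language
        tasks.append(("translate", "en", lang))
        tasks.append(("translate", lang, "en"))
    if use_mono:
        for lang in ["en"] + list(languages):
            tasks.append(("mass", lang, lang))
    return tasks
```

For example, `build_tasks(["cs", "de", "fr", "ro"], held_out="ro")` keeps the self-supervised task for ro but drops en-ro and ro-en, which corresponds to the Multilingual NMT-xx + Mono. setting; passing `use_mono=False` gives Multilingual NMT-xx.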

Related Work
Our work builds on several recently proposed techniques for multilingual NMT and self-supervised representation learning. While massively multilingual models have obtained impressive quality improvements for low-resource languages as well as zero-shot scenarios (Aharoni et al., 2019; Arivazhagan et al., 2019a), it has not yet been shown how these massively multilingual models could be extended to unseen languages, beyond the pipelined approaches (Currey and Heafield, 2019; Lakew et al., 2019). On the other hand, self-supervised learning approaches have excelled at down-stream cross-lingual transfer (Devlin et al., 2019; Raffel et al., 2019; Conneau et al., 2019), but their success for unsupervised NMT (Conneau and Lample, 2019; Song et al., 2019) currently lacks robustness when languages are distant or monolingual data domains are mismatched (Neubig and Hu, 2018; Vulić et al., 2019). We observe that these two lines of research can be quite complementary and can compensate for each other's deficiencies.

Conclusion and Future Directions
We present a simple framework to combine multilingual NMT with self-supervised learning, in an effort to jointly exploit the learning signals from multilingual parallel data and monolingual data. We demonstrate that combining multilingual NMT with monolingual data and self-supervision (i) improves the translation quality for both low- and high-resource languages in a multilingual setting, (ii) leads to on-par zero-shot capability compared with competitive bridging-based approaches and (iii) is an effective way to extend multilingual models to new unseen languages.
Future work should explore techniques like iterative back-translation (Hoang et al., 2018) for further improvement and scaling to larger model capacities and more languages (Arivazhagan et al., 2019b; Huang et al., 2019) to maximize transfer across languages and across data sources.

Table 6: Absolute BLEU scores for results in Figure 2 in the paper.