Cross-lingual Supervision Improves Unsupervised Neural Machine Translation

We propose to improve unsupervised neural machine translation with cross-lingual supervision (CUNMT), which utilizes supervision signals from high-resource language pairs to improve the translation of zero-resource languages. Specifically, to train an En-Ro system without a parallel corpus, we can leverage corpora from En-Fr and En-De to collectively train the translation from one language into many languages under one model. CUNMT is based on multilingual models and requires no changes to standard unsupervised NMT. Simple and effective, CUNMT significantly improves translation quality by a large margin on benchmark unsupervised translation tasks, and even achieves performance comparable to supervised NMT. In particular, on the WMT'14 En-Fr tasks CUNMT achieves 37.6 and 35.18 BLEU, which is very close to the large-scale supervised setting, and on the WMT'16 En-Ro tasks it achieves 35.09 BLEU, which is even better than the supervised Transformer baseline.


Introduction
Neural machine translation (NMT) has achieved great success and reached satisfactory translation performance for several language pairs (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017). Such breakthroughs heavily depend on the availability of colossal amounts of bilingual sentence pairs, such as the some 40 million parallel sentence pairs used in training for the WMT14 English-French task. As bilingual sentence pairs are costly to collect, the success of NMT has not been fully duplicated in the vast majority of language pairs, especially for zero-resource languages. Recently, (Artetxe et al., 2018b; Lample et al., 2018a) tackled this challenge by training unsupervised neural machine translation (UNMT) models using only monolingual data, which achieves considerably high accuracy, but still not on par with state-of-the-art supervised models. Most previous work focused on improving UNMT through parameter sharing or proper initialization. We argue that the drawback of UNMT mainly stems from the lack of supervised signals, and that it is beneficial to transfer multilingual information across languages. In this paper, we take a step towards practical unsupervised NMT with cross-lingual supervision (CUNMT), making the most of the signals from other languages. We investigate two variants of multilingual supervision for UNMT. a) CUNMT w/o Para.: a general setting where unrelated monolingual data can be introduced; for example, using monolingual Fr data to help the training of En-De (Figure 1(c)). b) CUNMT w/ Para.: a relatively strict setting where other bilingual language pairs can be introduced; for example, we can naturally leverage parallel En-Fr data to facilitate unsupervised En-De translation (Figure 1(d)).
We introduce cross-lingual supervision, which aims at modeling explicit translation probabilities across languages. Taking three languages as an example, suppose the target unsupervised direction is En→De and the auxiliary language is Fr. Our target is to model the translation probability p(De|En) with the support of p(Fr|En) and p(De|Fr). For forward cross-lingual supervision, the system NMT_Fr→De serves as a teacher, translating the Fr part of parallel data (En, Fr) into De. The resulting synthetic data (En, Fr, De) can be used to improve our target system NMT_En→De. For backward cross-lingual supervision, we translate monolingual De into Fr with NMT_De→Fr, and then translate the Fr into En with NMT_Fr→En. The resulting synthetic bilingual data (De, En) can be used for NMT_En→De as well.
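The two supervision modes above can be sketched as simple data-generation routines. The stub word-for-word "translators" below are purely illustrative stand-ins for the jointly trained NMT directions; none of these function names come from the paper.

```python
# Hedged sketch of forward and backward cross-lingual supervision: the
# translate_* callables stand in for trained NMT directions.

def make_forward_data(parallel_en_fr, translate_fr_de):
    """Forward: a Fr->De 'teacher' turns (En, Fr) pairs into (En, De) pairs."""
    return [(en, translate_fr_de(fr)) for en, fr in parallel_en_fr]

def make_backward_data(mono_de, translate_de_fr, translate_fr_en):
    """Backward: pivot De -> Fr -> En, yielding synthetic (En, De) pairs."""
    return [(translate_fr_en(translate_de_fr(de)), de) for de in mono_de]

# Toy single-word "translators" used purely for illustration.
fr2de = lambda s: s.replace("bonjour", "hallo")
de2fr = lambda s: s.replace("hallo", "bonjour")
fr2en = lambda s: s.replace("bonjour", "hello")

forward = make_forward_data([("hello", "bonjour")], fr2de)   # (En, De) pairs
backward = make_backward_data(["hallo"], de2fr, fr2en)       # (En, De) pairs
```

Both routines produce synthetic (En, De) pairs that can be mixed into the training data of the target NMT_En→De system.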
Our contributions can be summarized as follows: a) Empirical evaluation of CUNMT on six benchmarks verifies that it surpasses individual MT models by a large margin of more than 3.0 BLEU points on average, and also bests several strong competitors. In particular, on the WMT'16 En-Ro tasks, CUNMT surpasses the supervised baseline by 0.7 BLEU, showing the great potential of UNMT. b) CUNMT is very effective in its use of additional training data. mBART and MASS introduce billions of sentences, while CUNMT introduces only tens of millions of sentences and achieves superior or comparable results. This shows the importance of introducing explicit supervision.

The Proposed CUNMT
CUNMT is based on a multilingual machine translation model that combines supervised and unsupervised methods within a triangular training structure. The original unsupervised NMT depends only on monolingual corpora, so the performance of those translation directions cannot be guaranteed.
Formally, given n different languages L_i, x_i denotes a sentence in language L_i. D_i denotes a monolingual dataset of L_i, and D_{i,j} denotes a parallel dataset of (L_i, L_j). We use E to indicate the set of all translation directions with parallel data and W to indicate the set of all unsupervised translation directions, respectively. The goal of CUNMT is to minimize the total loss over both unsupervised and supervised directions:

L_CUNMT = Σ_{(i,j)∈W} L^U_{i→j} + Σ_{(i,j)∈E} L^S_{i→j} + Σ_{(i,j)} L̃_{i→j},

where L^U_{i→j} is the direct unsupervised loss, L^S_{i→j} is the direct supervised loss, and L̃_{i→j} is the indirect (cross-lingual) supervision loss.

Direct & Cross-lingual Supervision
Direct supervision We first introduce the notion of the direct supervision loss, which only considers the translation probability between two different languages.
For supervised machine translation models, given a parallel dataset D_{s,t} with source language L_s and target language L_t, we use L^S_{s→t} to denote the supervised training loss from L_s to L_t. The training loss for a single sentence pair (x_s, x_t) can be defined as:

L^S_{s→t} = -log P(x_t | x_s; θ).

For unsupervised machine translation models, only monolingual datasets D_s and D_t are given. We use L^U_{s→t} to denote the unsupervised training loss from L_s to L_t, which relies on back translation: monolingual sentences are first translated into the other language, and the resulting synthetic pairs are then used to train the model in the supervised fashion. We use B_{s→t} to denote this back-translation procedure. The losses of the dual structure are:

L^U_{s→t} = -log P(x_t | g_{t→s}(x_t); θ),  L^U_{t→s} = -log P(x_s | g_{s→t}(x_s); θ),

where g_{s→t}(x_s) translates the sentence x_s from language L_s into L_t, that is, the back translation of x_s. The total loss of an unsupervised machine translation model is then:

L^U = L^U_{s→t} + L^U_{t→s}.

Cross-lingual supervision When extended to the multilingual scenario, it is natural to introduce indirect supervision across languages. Given n different languages, for each language pair (L_i, L_j), we can easily obtain the translation probability P(x_i | x_j) through the direct supervised model L^S or unsupervised model L^U. We use L̃_{s→t} to indicate the indirect supervision loss through a pivot language L_j, which can be defined as:

L̃_{s→t} = λ L̃_{s→j→t},

where λ is a coefficient. Due to the lack of triple data (L_i, L_k, L_j), it is difficult to directly estimate the cross translation loss L̃_{s→j→t}. We therefore propose backward and forward indirect supervision to calculate the cross loss:

L̃^B_{s→t} = -log P(x_t | g_{t→j→s}(x_t); θ),  L̃^F_{s→t} = -log P(f_{s→j→t}(x_s) | x_s; θ),

where g_{t→j→s}(x_t) is the indirect backward translation, which translates x_t into language L_s via L_j, and f_{s→j→t}(x_s) is the indirect forward translation, which translates x_s into language L_t via L_j.
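As a numeric sanity check of the direct losses above, the sketch below evaluates the supervised and back-translation negative log-likelihoods with a toy sentence-level probability model. The constant probability and stub back-translation function are illustrative assumptions, not part of the paper.

```python
import math

# Toy numeric sketch of the direct losses: model_p stands in for a
# sentence-level translation probability P(target | source), and g_t2s
# for the learned back-translation function g_{t->s}.

def supervised_loss(model_p, x_s, x_t):
    """L^S_{s->t}: negative log-likelihood on a real parallel pair."""
    return -math.log(model_p(x_s, x_t))

def backtranslation_loss(model_p, g_t2s, x_t):
    """L^U_{s->t}: train on the synthetic pair (g_{t->s}(x_t), x_t)."""
    return -math.log(model_p(g_t2s(x_t), x_t))

toy_p = lambda src, tgt: 0.5          # constant toy probability
g = lambda x_t: "<bt> " + x_t         # toy back-translation

loss_sup = supervised_loss(toy_p, "ein satz", "a sentence")
loss_unsup = backtranslation_loss(toy_p, g, "a sentence")
```

With the constant toy model, both losses reduce to -log 0.5 = log 2, which makes the shared structure of the two terms easy to see.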

Training Procedure of CUNMT
The procedure of CUNMT includes two main steps: multi-lingual pre-training and iterative multi-lingual training.
Multi-lingual Pre-training Due to the ill-posed nature of the problem, it is important to find a good initialization that associates the source-side and target-side languages. We propose a multi-lingual pre-training approach, which jointly trains the unsupervised auto-encoder and supervised machine translation. Intuitively, multi-lingual joint pre-training can take advantage of transfer learning and thus benefit low-resource languages. Apart from monolingual data, pre-training can also leverage bilingual parallel data. We suggest that the supervised data provides a strong signal to optimize the network, which also benefits the unrelated unsupervised NMT pre-training. For example, it is beneficial to use the supervised En-Fr model to initialize the unsupervised De-Fr model.
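A common ingredient of such auto-encoder pre-training is a word-level corruption of the monolingual input. The drop-and-local-shuffle noise below is a minimal sketch of that idea; the exact noise model and parameter values here are assumptions for illustration, not taken from the paper.

```python
import random

# Illustrative corruption for denoising auto-encoder pre-training: drop each
# token with probability p_drop, then shuffle locally so that tokens move at
# most about k positions. p_drop and k are example values.

def add_noise(tokens, p_drop=0.1, k=3, seed=0):
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p_drop]
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]  # jittered order
    return [t for _, t in sorted(zip(keys, kept))]

noisy = add_noise("the quick brown fox jumps".split())
```

The auto-encoder is then trained to reconstruct the clean sentence from its noisy version, forcing the encoder to learn language structure rather than a copy operation.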

Indirect Supervised Training
The goal is to train a single system that minimizes the joint loss function L_CUNMT. Generally, CUNMT can be applied in a restricted unsupervised scenario where only monolingual data is provided, and can also be extended to an unrestricted scenario where parallel data is introduced. For the sake of simplicity, we describe our method on three language pairs; it can easily be extended to more. Suppose the three languages are denoted as the triad (En, Fr, De), and we have monolingual data for all three languages as well as bilingual data for En-Fr. The target is to train an unsupervised En→De system. The detailed method is as follows: for indirect and direct supervision, we follow Equation (6), which adopts one-step forward translation if parallel data is provided. Since we train all directions in one model, the pseudo data covers all directions; in this setting, it contains En↔Fr, En↔De, and Fr↔De, with both direct and indirect directions.
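Schematically, one CUNMT training step sums the three kinds of loss terms over every direction of the shared model. The loss callables below are placeholders for the terms defined earlier; this is a sketch of the control flow, not the actual implementation.

```python
# Schematic CUNMT step: each direction contributes a direct term (supervised
# if parallel data exists, otherwise unsupervised back-translation) plus an
# indirect cross-lingual term. All callables are illustrative placeholders.

def cunmt_step(directions, supervised, direct_sup, direct_unsup, cross):
    total = 0.0
    for s, t in directions:
        if (s, t) in supervised:
            total += direct_sup(s, t)      # L^S_{s->t}
        else:
            total += direct_unsup(s, t)    # L^U_{s->t}
        total += cross(s, t)               # indirect loss term
    return total

langs = ["en", "fr", "de"]
dirs = [(s, t) for s in langs for t in langs if s != t]   # all 6 directions
total = cunmt_step(dirs, {("en", "fr"), ("fr", "en")},
                   lambda s, t: 1.0, lambda s, t: 1.0, lambda s, t: 1.0)
```

With unit placeholder losses, the six directions each contribute one direct and one indirect term, so the step returns 12.0; in training, `total` would be backpropagated through the single shared model.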

Datasets and Settings
We conduct experiments on the language triads (De, En, Fr), (Fr, En, De), and (Ro, En, Fr). For monolingual data of English, French, and German, 20 million sentences were randomly selected from the available WMT monolingual News Crawl datasets. For Romanian, all available Romanian sentences from the News Crawl dataset were used, supplemented with WMT16 monolingual data, yielding a total of 2.9 million sentences. For parallel data, we use the standard WMT 2014 English-French dataset of about 36M sentence pairs and the standard WMT 2014 English-German dataset of about 4.5M sentence pairs. For analyses, we also introduce the standard WMT 2017 English-Chinese dataset of 20M sentence pairs. Consistent with previous work, we report results on newstest 2014 for the English-French pair, and on newstest 2016 for English-German and English-Romanian.
In the experiments, CUNMT is built upon Transformer models. We use a Transformer with 6 layers, 1024 hidden units, and 16 heads. We train our models with the Adam optimizer, a linear warm-up, and learning rates varying from 10^-4 to 5 × 10^-4. The model is trained on 8 NVIDIA V100 GPUs. We implement all our models in PyTorch based on the code of (Lample and Conneau, 2019). All results are evaluated with BLEU using the Moses scripts, consistent with previous studies.
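The linear warm-up mentioned above can be sketched as a simple schedule function. The peak value and warm-up length below are illustrative choices in the quoted learning-rate range, not the paper's exact hyper-parameters.

```python
# Hypothetical linear warm-up schedule: the learning rate ramps linearly to
# a peak over the first warmup_steps updates, then stays constant. The
# values are examples within the range quoted above (1e-4 to 5e-4).

def warmup_lr(step, peak=5e-4, warmup_steps=4000):
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak

lrs = [warmup_lr(s) for s in (0, 2000, 4000, 8000)]
```

Such a schedule is commonly plugged into PyTorch via `torch.optim.lr_scheduler.LambdaLR`, with the function returning a multiplier on the base learning rate.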

Main Results
The main results on similar language pairs are shown in Table 1. We compare against three strong unsupervised methods. CUNMT is very efficient in its use of multi-lingual data: while the pretrained language models are obtained from monolingual or cross-lingual corpora several hundred times larger, CUNMT achieves superior or comparable results at much lower cost. The model is improved by using synthetic data from cross translation based on the jointly trained model. The results of "CUNMT + Forward" are from a model tuned for only 1 epoch with about 100K sentences; this method is fast and its performance is surprisingly effective. "CUNMT + Forward + Backward" denotes that, besides forward translation, we also use monolingual data and cross-translate it into the source language. This method yields the best performance, outperforming "CUNMT w/o Para." by more than 3 BLEU points on average. The improvements show the great potential of introducing indirect cross-lingual supervision for unsupervised NMT.
Compared with supervised approaches, CUNMT shows very promising performance. On the large-scale WMT14 En-Fr tasks, the gap between CUNMT and the supervised baseline narrows to 3.4 BLEU points. And on the medium-scale WMT16 En-Ro task, CUNMT performs even better than the supervised approach.

Analyses
In this part, we conduct several studies on CUNMT to better understand its settings. Backward or Forward We explore the effects of cross-lingual backward supervision and cross-lingual forward supervision, and plot the performance curves along the training procedure in Figure 3. The comparison system is CUNMT trained only with monolingual data.
To make a fair comparison, we use "CUNMT w/ Para." as the baseline model and fine-tune it with only indirect forward supervision or only indirect backward supervision. We conduct experiments on the WMT16 En-De and En-Ro tasks. Clearly, forward supervision outperforms backward supervision by a large margin, which shows the importance of introducing forward supervision for multilingual UNMT. It is still interesting that introducing only indirect backward translation achieves better results than the unsupervised baseline. We suppose the reasons for the performance gap are that: a) the UNMT baseline already includes traditional direct back translation, so the information gain from indirect backward translation is limited compared to forward translation; b) indirect forward translation provides a more direct way to model the relation across different languages. The results are consistent with previous research showing that pivot translation can help low-resource language translation.

We also study the effect of the parallel data scale. The results dovetail with the unsupervised En-Fr experiments in Table 1: the smaller parallel data of En-De was able to significantly improve the performance of unsupervised En-Fr translation. We then reduce the scale of the En-De parallel data and, surprisingly, find that even with only 25% of the supervised data, CUNMT still works well. The experiments demonstrate that CUNMT is robust and has great potential to be applied in practical systems.

Importance of the Auxiliary Language
Table 3 shows the effects of the auxiliary language. We first switch the parallel data from En-Fr to En-De; the performance is almost unchanged. We then switch the parallel data to En-Zh, where Zh is dissimilar to Ro, and the performance decreases. This is in line with our expectation that similar languages make transfer learning easier. Finally, we extend the parallel data to both En-De and En-Fr and achieve further gains. We suggest that language similarity matters more than the scale of the auxiliary data.

Benefits as All in One Model The performance of the supervised directions of CUNMT is slightly lower than that of their state-of-the-art counterparts. Also, some techniques such as model averaging are not applied, and two directions are trained in one model. In CUNMT, the performance of the supervised directions drops a little, but in exchange, the performance of the zero-shot directions is greatly improved, and the model conveniently serves multiple translation directions.

Strategies of Synthetic Data Generation
For synthetic data generation, the reported results use greedy decoding for time efficiency. We compared the effects of sampling strategies in the language setting (Ro, En, De), where En-De is the supervised direction. The results based on beam-search generation are 34.86 BLEU for En → Ro and 33.18 for En → Fr. Compared with greedy decoding, beam search is slightly inferior. A possible reason is that beam search biases the synthetic data further towards the learned patterns. The results suggest that CUNMT is exceedingly robust to the sampling strategy used for forward and backward cross translation.
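To make the two generation strategies concrete, here is a toy contrast between greedy and beam-search decoding over a hand-made bigram table. The table and its probabilities are invented for illustration only; this does not reproduce the paper's BLEU comparison.

```python
# Toy bigram "model": P(next token | previous token).
model = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"x": 0.5, "y": 0.5},
    "b":   {"x": 1.0},
    "x":   {"</s>": 1.0},
    "y":   {"</s>": 1.0},
}

def greedy(prev="<s>"):
    """Pick the locally most probable token at every step."""
    out = []
    while prev != "</s>":
        prev = max(model[prev], key=model[prev].get)
        if prev != "</s>":
            out.append(prev)
    return out

def beam(width=2):
    """Keep the `width` most probable partial sequences at every step."""
    beams = [([], "<s>", 1.0)]
    done = []
    while beams:
        nxt = []
        for seq, prev, p in beams:
            for tok, q in model[prev].items():
                if tok == "</s>":
                    done.append((seq, p * q))
                else:
                    nxt.append((seq + [tok], tok, p * q))
        beams = sorted(nxt, key=lambda b: -b[2])[:width]
    return max(done, key=lambda d: d[1])[0]
```

Here greedy decoding commits to the locally best first token ("a", total probability 0.3), while beam search recovers the globally more probable sequence through "b" (total probability 0.4), illustrating why the two strategies can yield different synthetic data.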

Related Work
Multilingual NMT It has been shown that low-resource machine translation can utilize rich-resource data to develop a better system. Such methods include multilingual translation systems (Firat et al., 2016; Johnson et al., 2017), teacher-student frameworks, and others (Zheng et al., 2017). Apart from parallel data as an entry point, many attempts have been made to explore the usefulness of monolingual data, including semi-supervised methods and unsupervised methods in which only monolingual data is used. Much work has also attempted to marry monolingual data with supervised data, including using small amounts of parallel data and augmenting the system with monolingual data (Sennrich et al., 2016; He et al., 2016; Wang et al., 2018; Gu et al., 2018; Edunov et al., 2018; Yang et al., 2020). Others utilize parallel data of rich-resource language pairs together with monolingual data (Ren et al., 2018; Al-Shedivat and Parikh, 2019; Lin et al., 2020). (Ren et al., 2018) also proposed a triangular architecture, but their work still relied on parallel data of the low-resource language pairs. With the joint support of parallel and monolingual data, the performance of a low-resource system can be improved.
Unsupervised NMT In 2017, purely unsupervised machine translation with only monolingual data was shown to be feasible. On the basis of embedding alignment (Artetxe et al., 2017; Lample et al., 2018b), (Lample et al., 2018a) and (Artetxe et al., 2018b) devised similar methods for fully unsupervised machine translation. Considerable work has since improved unsupervised machine translation systems through statistical machine translation (Lample et al., 2018c; Artetxe et al., 2018a; Ren et al., 2019; Artetxe et al., 2019), pretrained models (Lample and Conneau, 2019; Song et al., 2019), and other methods (Wu et al., 2019), all of which greatly improve the performance of unsupervised machine translation. Our work utilizes both monolingual and parallel data, combining unsupervised and supervised machine translation through a multilingual translation method into a single model, CUNMT, to achieve better performance on unsupervised language pairs.

Conclusion
In this work, we propose CUNMT, a multilingual machine translation framework incorporating cross-lingual supervision to tackle the challenge of unsupervised translation. By mixing different training schemes into one model and utilizing unrelated bilingual corpora, we greatly improve the performance of the unsupervised NMT directions. Through joint training, CUNMT can serve all translation directions in one model. Empirically, CUNMT delivers substantial improvements over several strong UNMT competitors and even achieves performance comparable to supervised NMT. In the future, we plan to build a universal CUNMT system applicable to a wide span of languages.