Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation

Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs. However, it can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time; research on multilingual UNMT has thus been limited. In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder, making use of multilingual data to improve UNMT for all language pairs. On the basis of the empirical findings, we propose two knowledge distillation methods to further enhance multilingual UNMT performance. Our experiments on a dataset with English translated to and from twelve other languages (covering three language families and six language branches) show remarkable results, surpassing strong unsupervised individual baselines, achieving promising performance between non-English language pairs in zero-shot translation scenarios, and alleviating poor performance on low-resource language pairs.


Introduction
Recently, neural machine translation (NMT) has been adapted to the unsupervised scenario, in which NMT is trained without any bilingual data. Unsupervised NMT (UNMT) (Artetxe et al., 2018; Lample et al., 2018a) requires only monolingual corpora. UNMT achieves remarkable results by using a combination of diverse mechanisms (Lample et al., 2018b) such as initialization with bilingual word embeddings, denoising auto-encoder (Vincent et al., 2010), back-translation (Sennrich et al., 2016a), and shared latent representations. More recently, Lample and Conneau (2019) achieved better UNMT performance by introducing a pretrained language model. However, conventional UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.

* Haipeng Sun was an internship research fellow at NICT when conducting this work.
Multilingual UNMT (MUNMT), which translates multiple languages at the same time, can save substantial training time and resources. Moreover, similar languages in MUNMT can benefit each other. Research on MUNMT has been limited, with only a few pioneering studies. For example, Sen et al. (2019) proposed a multilingual scheme that jointly trains multiple languages with multiple decoders. However, the performance of their MUNMT is much worse than our re-implemented individual baselines (shown in Tables 2 and 3), and the scale of their study is modest (i.e., 4-5 languages).
In this paper, we empirically introduce a unified framework to translate among thirteen languages (covering three language families and six language branches) using a single encoder and a single decoder, making use of multilingual data to improve UNMT for all languages. On the basis of these empirical findings, we propose two knowledge distillation methods, i.e., self-knowledge distillation and language branch knowledge distillation, to further enhance MUNMT performance. Our experiments on a dataset with English translated to and from twelve other languages show remarkable results, surpassing strong unsupervised individual baselines.

This paper primarily makes the following contributions:

• We propose a unified MUNMT framework to translate between thirteen languages using a single encoder and a single decoder. This paper is the first step toward multilingual UNMT training on a large set of European languages.
• We propose two knowledge distillation methods for MUNMT; our proposed methods take linguistic knowledge into account in the specific translation task.
• Our proposed MUNMT system achieves state-of-the-art performance on the thirteen languages. It also achieves promising performance in zero-shot translation scenarios and alleviates poor performance on low-resource language pairs.

Background of UNMT
UNMT can be decomposed into four components: cross-lingual language model pretraining, denoising auto-encoder, back-translation, and shared latent representations. For UNMT, two monolingual corpora X^1 = {X_i^1} and X^2 = {X_i^2} in two languages L_1 and L_2 are given. |X^1| and |X^2| are the numbers of sentences in the monolingual corpora {X_i^1} and {X_i^2}, respectively.

Cross-lingual Language Model Pretraining
A cross-lingual masked language model, which can encode two monolingual sentences into a shared latent space, is first trained. The pretrained cross-lingual encoder is then used to initialize the whole UNMT model (Lample and Conneau, 2019). Compared with previous bilingual embedding pretraining (Artetxe et al., 2018; Lample et al., 2018a; Yang et al., 2018; Lample et al., 2018b), this pretraining can provide much more cross-lingual information, enabling the UNMT model to achieve better performance and faster convergence.

Denoising Auto-encoder
Noise, obtained by randomly performing local substitutions and word reorderings (Vincent et al., 2010; Hill et al., 2016; He et al., 2016), is added to the input sentences to improve model learning ability and regularization. Consequently, the input data are continuously modified and differ at each epoch. The denoising auto-encoder objective is minimized by encoding a noisy sentence and reconstructing it with the decoder in the same language:

L_D = Σ_i −log P_{L_1→L_1}(X_i^1 | C(X_i^1)) + Σ_i −log P_{L_2→L_2}(X_i^2 | C(X_i^2)),

where {C(X_i^1)} and {C(X_i^2)} are noisy sentences, and P_{L_1→L_1} and P_{L_2→L_2} denote the reconstruction probabilities in languages L_1 and L_2, respectively.
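The corruption function C(·) can be sketched as follows. This is a minimal Python illustration assuming word-level dropout plus a local-shuffle window; the dropout rate and window size are illustrative defaults, not the paper's exact settings.

```python
import random

def add_noise(words, p_drop=0.1, k=3, seed=0):
    """Sketch of the denoising auto-encoder's corruption C(x):
    randomly drop words, then locally reorder the remainder so that
    each word moves by at most (k - 1) positions."""
    rng = random.Random(seed)
    # word dropout (keep at least one word)
    kept = [w for w in words if rng.random() > p_drop] or words[:1]
    # local reordering: perturb each position by a random offset in [0, k)
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept), key=lambda t: t[0])]
```

With k = 1 the perturbed keys remain strictly increasing, so the sentence order is preserved; larger k allows more aggressive reordering.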

Back-translation
Back-translation (Sennrich et al., 2016a) plays a key role in achieving unsupervised translation that relies only on monolingual corpora in each language. The pseudo-parallel sentence pairs {(M^2(X_i^1), X_i^1)} and {(M^1(X_i^2), X_i^2)} produced by the model in the previous iteration are used to train the new translation model. Therefore, the back-translation objective function can be optimized by minimizing:

L_B = Σ_i −log P_{L_2→L_1}(X_i^1 | M^2(X_i^1)) + Σ_i −log P_{L_1→L_2}(X_i^2 | M^1(X_i^2)),

where P_{L_1→L_2} and P_{L_2→L_1} denote the translation probabilities across the two languages.
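A single back-translation round can be sketched as follows; `toy_translate` is a hypothetical stand-in for the previous model M^2 (here a toy word-by-word lexicon), used only to show how pseudo-parallel pairs are formed.

```python
def make_pseudo_pairs(mono_sentences, translate):
    """Back-translation: translate each monolingual sentence with the
    previous model, then pair (translation, original) so the original
    sentence serves as the reference for the new model."""
    return [(translate(x), x) for x in mono_sentences]

# toy stand-in for M^2 (L1 -> L2); a real system uses the previous NMT model
toy_lexicon = {"hello": "bonjour", "world": "monde"}

def toy_translate(sentence):
    return " ".join(toy_lexicon.get(w, w) for w in sentence.split())

pairs = make_pseudo_pairs(["hello world"], toy_translate)
# pairs == [("bonjour monde", "hello world")]
```

The new model is then trained in the L_2 → L_1 direction on these pairs, with the clean original as the target side.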

Sharing Latent Representations
Encoders and decoders are (partially) shared between L_1 and L_2. Therefore, L_1 and L_2 must use the same vocabulary. The entire training of UNMT needs to consider back-translation between the two languages and their respective denoising processes. In summary, the entire UNMT model can be optimized by minimizing:

L = L_D + L_B,

where L_D and L_B denote the denoising auto-encoder and back-translation objectives described above.


Multilingual UNMT (MUNMT)

Multilingual Pretraining
Motivated by Lample and Conneau (2019), we construct a multilingual masked language model using a single encoder. For each language, the language model is trained by encoding the masked input and reconstructing it with this encoder. This pretrained multilingual language model is used to initialize the full set of parameters of MUNMT.
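The masked-input construction can be sketched as follows, assuming the common 15% masking rate used by masked language models; the paper's exact masking scheme may differ.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_sym="[MASK]", seed=0):
    """Sketch of masked-LM input creation: hide a fraction of tokens
    and record their positions/values as prediction targets."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]   # the model must recover this token
        masked[i] = mask_sym
    return masked, targets
```

In multilingual pretraining, the same encoder processes masked inputs from every language, which encourages a shared latent space.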

Multilingual UNMT Training
We have established a MUNMT model on N languages with a single encoder and a single decoder. We denote a sentence in language L_j as X_i^j. For example, L_1 indicates English. |X^j| is the number of sentences in the corpus X^j = {X_i^j}.

Figure 1: MUNMT architecture. We take the L_1 ↔ L_j time-step as an example. The grey symbols indicate that the corresponding data are not used or generated during this time-step.
As Figure 1 shows, the entire training process of the MUNMT model is performed through the denoising and back-translation mechanisms, between English and non-English language pairs, by minimizing:

L_M = L_{MD} + L_{MB},

where L_{MD} denotes the denoising objective and L_{MB} denotes the back-translation objective.
In the denoising training, noise (in the form of random token deletion and swapping) is introduced into the input sentences for any language L_j. The denoising auto-encoder, which encodes a noisy version and reconstructs it with the decoder in the same language, is optimized by minimizing:

L_{MD} = Σ_{j=1}^{N} Σ_i −log P_{L_j→L_j}(X_i^j | C(X_i^j)),

where {C(X_i^j)} is a set of noisy sentences for language L_j and P_{L_j→L_j} denotes the reconstruction probability in L_j.
In this paper, we primarily focus on translation from English to other languages and from other languages to English, because most test datasets contain English. In the back-translation training, we only conduct back-translation from language L_1 (English) to the other languages and from the other languages to L_1. For any non-English language L_j, the pseudo-parallel sentence pairs {(M^j(X_i^1), X_i^1)} and {(M^1(X_i^j), X_i^j)} are obtained by the previous model in the L_1 → L_j and L_j → L_1 directions, respectively. Therefore, the back-translation objective function can be optimized on these pseudo-parallel sentence pairs by minimizing:

L_{MB} = Σ_{j=2}^{N} Σ_i [ −log P_{L_j→L_1}(X_i^1 | M^j(X_i^1)) − log P_{L_1→L_j}(X_i^j | M^1(X_i^j)) ],

where P_{L_1→L_j} and P_{L_j→L_1} denote the translation probabilities, in each direction, between any non-English language and English.

Algorithm 1: The SKD algorithm. Input: monolingual training data X^1, X^2, ..., X^N; the pretrained model θ_0; the number of steps K. Initialize θ ← θ_0, then update the model with the SKD objective until step q reaches the maximum step K.
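The English-centric restriction keeps the number of trained directions linear in the number of languages, as a quick count illustrates (a sketch; language codes as in the experiments):

```python
def trained_directions(langs, pivot="En"):
    """English-centric back-translation: only pivot <-> L_j directions
    are trained, giving 2*(N-1) directions instead of all N*(N-1)."""
    others = [l for l in langs if l != pivot]
    return [(pivot, l) for l in others] + [(l, pivot) for l in others]

langs = ["En", "Cs", "De", "Es", "Et", "Fi", "Fr",
         "Hu", "It", "Lt", "Lv", "Ro", "Tr"]
dirs = trained_directions(langs)
# 24 trained directions, versus 13 * 12 = 156 possible ordered pairs
```

All remaining non-English directions are covered only via zero-shot translation, which Section "Zero-shot Translation Analysis" evaluates.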

Knowledge Distillation for MUNMT
To further enhance the performance of our proposed MUNMT model, we propose two knowledge distillation methods: self-knowledge distillation (Algorithm 1) and language branch knowledge distillation (Algorithm 2). With knowledge distillation, the MUNMT objective function can be reformulated as follows:

L_M = L_{MD} + (1 − α) L_{MB} + α L_{KD},

where α is a hyper-parameter that adjusts the weight of the two loss functions during back-translation, L_{KD} is the distillation loss defined by each method below, and T denotes the temperature used on the softmax layer when computing soft probability outputs. If the temperature is higher, the obtained probability distribution is softer (Hinton et al., 2015).
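A minimal sketch of the temperature-softened distillation term, assuming the standard formulation of Hinton et al. (2015); the paper's exact weighting may differ in detail.

```python
import math

def softmax_T(logits, T=1.0):
    """Softmax with temperature T; T > 1 flattens (softens) the distribution."""
    m = max(x / T for x in logits)                    # for numerical stability
    exps = [math.exp(x / T - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, gold_index, alpha=0.1, T=2.0):
    """(1 - alpha) * cross-entropy against the ground-truth token
    + alpha * KL(teacher || student), both softened by temperature T."""
    p_student = softmax_T(student_logits, T)
    p_teacher = softmax_T(teacher_logits, T)
    ce = -math.log(softmax_T(student_logits, 1.0)[gold_index])
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    return (1 - alpha) * ce + alpha * kl
```

With identical student and teacher logits the KL term vanishes, and the loss reduces to the weighted cross-entropy, matching the intuition that distillation only acts when the two distributions disagree.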

Self-knowledge Distillation
On the basis of the existing architecture of MUNMT, we introduce self-knowledge distillation (SKD) (Hahn and Choi, 2019) during back-translation to enhance the generalization ability of the MUNMT model, as shown in Figure 2(a). Unlike Hahn and Choi (2019)'s method, which uses two soft target probabilities based on the word embedding space, we make full use of multilingual information via self-knowledge distillation. During back-translation, only language L_j sentences M^j(X_i^1) are generated before training the MUNMT model in the L_j → L_1 direction. However, other languages, which carry substantial multilingual information, are not used during this training. Motivated by this, we propose to introduce another language L_z (randomly chosen but distinct from L_1 and L_j) during this training. We argue that translations of the source sentences through different paths, L_1 → L_j → L_1 and L_1 → L_z → L_1, should be similar. The MUNMT model matches not only the ground-truth output of language L_j sentences M^j(X_i^1), but also the soft probability output of language L_z sentences M^z(X_i^1). The opposite direction is analogous. Therefore, this MUNMT model is optimized by minimizing the objective function:

L_{KD} = Σ_{j=2}^{N} Σ_i [ KL( X^1(M^j(X_i^1)) ∥ X^1(M^z(X_i^1)) ) + KL( X^j(M^1(X_i^j)) ∥ X^j(M^z(X_i^j)) ) ],

where KL(·∥·) denotes the KL divergence, computed over full output distributions to keep the two probability distributions similar. For any language L_j, X^1(M^j(X_i^1)) and X^1(M^z(X_i^1)) denote the softened L_1 sentence probability distributions after encoding M^j(X_i^1) and M^z(X_i^1), respectively. M^j(X_i^1) and M^z(X_i^1) were generated by the previous model in the L_1 → L_j and L_1 → L_z directions, respectively.
X^j(M^1(X_i^j)) and X^j(M^z(X_i^j)) denote the softened L_j sentence probability distributions after encoding M^1(X_i^j) and M^z(X_i^j), respectively. M^1(X_i^j) and M^z(X_i^j) were generated by the previous model in the L_j → L_1 and L_j → L_z directions, respectively. Note that zero-shot translation was used to translate language L_j to language L_z; the direction L_j → L_z was not trained during MUNMT training.

Algorithm 2: The LBKD algorithm. Input: monolingual training data X^1, X^2, ..., X^N; LBUNMT models θ_{LB_1}, θ_{LB_2}, ..., θ_{LB_M}; the pretrained model θ_0; the number of steps K. Initialize θ ← θ_0, then update the model with the LBKD objective until step q reaches the maximum step K.
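The random choice of the third language L_z can be sketched as follows; `pick_third_language` is a hypothetical helper name used only for illustration, with English fixed at index 0 as in the paper.

```python
import random

def pick_third_language(n_langs, j, seed=None):
    """SKD: pick a random language index z distinct from English
    (index 0) and from the current non-English language j."""
    rng = random.Random(seed)
    candidates = [z for z in range(n_langs) if z not in (0, j)]
    return rng.choice(candidates)
```

Sampling z afresh at each training step exposes every L_1 ↔ L_j pair to many different auxiliary languages over the course of training.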
Language Branch Knowledge Distillation

As shown in Figure 2(b), we propose knowledge distillation within a language branch (LBKD) to improve MUNMT performance through existing teacher models. To the best of our knowledge, this is the first proposal to distill knowledge within a language branch. As the number of languages increases, the cost in training time and resources of training an individual model on every two languages grows rapidly. Knowledge distillation within a language branch avoids this prohibitive computational cost. Because languages in the same language branch are similar, we first train small multilingual models across all languages in the same language branch (LBUNMT) before training MUNMT. The LBUNMT model trained in the same language branch performed better than the single model because similar languages have a positive interaction during the training process, as shown in Tables 2 and 3. Therefore, the distilled information of LBUNMT is used to guide the MUNMT model during back-translation. The MUNMT model matches both the ground-truth output and the soft probability output of LBUNMT. Therefore, this MUNMT model is optimized by minimizing the objective function:

L_{KD} = Σ_{j=2}^{N} Σ_i [ KL( LB^1(M^j(X_i^1)) ∥ X^1(M^j(X_i^1)) ) + KL( LB^j(M^1(X_i^j)) ∥ X^j(M^1(X_i^j)) ) ],

where X^1(M^j(X_i^1)) and LB^1(M^j(X_i^1)) denote the softened L_1 sentence probability distributions of the MUNMT and LBUNMT models, respectively, after encoding M^j(X_i^1) generated by the previous MUNMT model in the L_1 → L_j direction. X^j(M^1(X_i^j)) and LB^j(M^1(X_i^j)) denote the softened L_j sentence probability distributions of the MUNMT and LBUNMT models, respectively, after encoding M^1(X_i^j) generated by the previous MUNMT model in the L_j → L_1 direction.
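The teacher lookup in LBKD can be sketched as follows. The branch grouping below is an assumption based on standard language classification of the thirteen languages, not necessarily the paper's exact grouping.

```python
# Hypothetical branch grouping of the 13 experimental languages,
# assumed for illustration (the paper's exact grouping may differ).
BRANCHES = {
    "Germanic": ["En", "De"],
    "Romance":  ["Es", "Fr", "It", "Ro"],
    "Slavic":   ["Cs"],
    "Baltic":   ["Lt", "Lv"],
    "Finnic":   ["Et", "Fi"],
    "Ugric":    ["Hu"],
    "Turkic":   ["Tr"],
}

def teacher_for(lang):
    """LBKD: the teacher of a language is the LBUNMT model trained on
    all languages of its branch; return the branch name as its key."""
    for branch, langs in BRANCHES.items():
        if lang in langs:
            return branch
    raise KeyError(lang)
```

Each branch teacher is a small multilingual model, so only one teacher per branch needs to be trained rather than one per language pair.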

Datasets
To establish an MUNMT system, we considered 13 languages from the WMT monolingual news crawl datasets: Cs, De, En, Es, Et, Fi, Fr, Hu, It, Lt, Lv, Ro, and Tr. For preprocessing, we used the Moses tokenizer (Koehn et al., 2007). For cleaning, we only applied the Moses script clean-corpus-n.perl to remove lines in the monolingual data containing more than 50 words. We then used a shared vocabulary for all languages, with 80,000 sub-word tokens based on BPE (Sennrich et al., 2016b). The statistics of the data are presented in Table 1. For Cs, De, and En, we randomly extracted 50M monolingual news crawl sentences after cleaning; for the other languages, we used all news crawl data after cleaning, as shown in Table 1.

We report the results on WMT newstest2013 for Cs-En, De-En, Es-En, and Fr-En. We can also evaluate the translation performance between pairs of non-English languages because newstest2013 includes these five languages parallel to each other. For the other language pairs, we chose the newest WMT newstest set: WMT newstest2019 for Fi-En and Lt-En; WMT newstest2018 for Et-En and Tr-En; WMT newstest2017 for Lv-En; WMT newstest2016 for Ro-En; and WMT newstest2009 for Hu-En and It-En. Note that the versions of newstest2019 for Fi/Lt → En and En → Fi/Lt are different; we chose the corresponding newstest2019 for each direction.

Language Model and UNMT Settings
We used a transformer-based XLM toolkit to train a multilingual masked language model and followed the settings used in Lample and Conneau (2019): six layers were used for the encoder. The dimension of hidden layers was set to 1024. The Adam optimizer (Kingma and Ba, 2015) was used to optimize the model parameters. The initial learning rate was 0.0001, β 1 = 0.9, and β 2 = 0.98.
We used the same toolkit and followed the UNMT settings used in Lample and Conneau (2019): six layers were used for the encoder and decoder. The batch size was set to 2000 tokens. The other parameters were the same as those used for training the language model. For our proposed knowledge distillation methods, α was set to 0.1 and T was set to 2 (these parameters were selected empirically through small-scale experiments, and most settings achieved good results). The cross-lingual language model was used to pretrain the encoder and decoder of the whole UNMT model. All monolingual data, described in Table 1, were used in the pretraining and MUNMT training phases. The parameters of the multilingual and single models were the same.
For evaluation, we used case-sensitive BLEU scores computed by the Moses script multi-bleu.perl. We trained a single model (two languages) for 60,000 iterations, a small multilingual model (three to five languages) for 30,000 iterations, and a large multilingual model (13 languages) for 15,000 iterations. Eight V100 GPUs were used to train all UNMT models. The single model was trained for approximately two days; the multilingual model (13 languages) took approximately six days, since 13 languages participated in the training.

Main Results
Tables 2 and 3 present the detailed BLEU scores of all systems on the English and non-English language pairs, in each direction.¹

Table 3: BLEU scores of all models on the non-English to English language pairs.

Our observations are as follows:

1) Our proposed LBUNMT model trained in the same language branch performed better than the single model (SM) because similar languages have a positive interaction during the training process. Moreover, SM performed very poorly on low-resource language pairs such as En-Lt and En-Lv in the Baltic language branch.
2) Our proposed MUNMT model trained on all languages significantly outperformed the previous work (Sen et al., 2019) by 4∼12 BLEU scores. Moreover, the MUNMT model could alleviate the poor performance on low-resource language pairs, such as En-Lt and En-Lv. However, the performance of MUNMT is slightly worse than SM for some language pairs.

¹ Initialized with the same parameters of the pretrained language model (just an encoder).
3) Our proposed knowledge distillation methods outperformed the original MUNMT model by approximately 1 BLEU score. Moreover, our proposed MUNMT with knowledge distillation performed better than SM on all language pairs with fewer training iterations. Of our two proposed methods, LBKD achieved better performance because it obtains much more knowledge distilled from the LBUNMT model.

4) There is a gap between the performance of our proposed MUNMT model and that of supervised NMT systems. How to bridge this gap while relying solely on monolingual training data is worth studying in the future.

Zero-shot Translation Analysis
We also studied the zero-shot translation accuracy of the MUNMT model. Although MUNMT could be trained on all translation directions (ordered language pairs), this would require an extremely long training time. Our proposed MUNMT model was trained in 24 translation directions (all English and non-English language pairs, in each direction), whereas 156 translation directions exist. As the number of languages increases, the number of translation directions increases quadratically. Therefore, zero-shot translation accuracy is important for the MUNMT model.

Table 4 shows the performance of translation between non-English language pairs in the zero-shot translation scenario. Note that one of the compared systems reports the results of direct translation between the two languages, not the results of zero-shot translation. Compared with previous works, our MUNMT model outperformed the previous systems in almost all translation directions, particularly the directly trained translation results. Compared with the original MUNMT model, our proposed knowledge distillation methods further improved the performance of zero-shot translation. Of our two proposed methods, SKD significantly outperformed LBKD by approximately 3 BLEU scores, since a third language is introduced during SKD training for every language pair, providing much more cross-lingual knowledge.

Further Training (Fine-tuning) Analysis
To better assess the effectiveness of our proposed MUNMT model, we further trained the MUNMT and LBKD models individually on each language pair for 15,000 iterations. As shown in Tables 5 and 6, after further training, the model outperformed the original single model on each language pair by approximately 4 BLEU scores. In fact, the number of iterations of the whole process (including training the MUNMT model) is half that of the original single model. This demonstrates that our proposed MUNMT model is a robust system that contains substantial cross-lingual information, which can improve translation performance.

Related Work
Multilingual NMT has attracted much attention in the machine translation community. Dong et al. (2015) first extended NMT from the translation of a single language pair to multiple language pairs, using a shared encoder and multiple decoders, while later work (2017) proposed a simpler method that uses one encoder and one decoder to translate between multiple languages. Recently, many methods (Lakew et al., 2018; Platanios et al., 2018; Blackwood et al., 2018; Lu et al., 2018; Wang et al., 2019a; Aharoni et al., 2019; Wang et al., 2019b; Wang and Neubig, 2019) have been proposed to boost multilingual NMT performance. In particular, Tan et al. proposed a knowledge distillation method (Tan et al., 2019b) and a language clustering method (Tan et al., 2019a) to improve the performance of multilingual NMT. Ren et al. (2018) proposed a triangular architecture to tackle low-resource pair translation by introducing another rich-resource language.
To further tackle the problem of low-resource pair translation, UNMT (Artetxe et al., 2018; Lample et al., 2018a) has been proposed, using a combination of diverse mechanisms such as initialization with bilingual word embeddings, denoising auto-encoder (Vincent et al., 2010), back-translation (Sennrich et al., 2016a), and shared latent representations. Lample et al. (2018b) concatenated two bilingual corpora as one monolingual corpus and used monolingual embedding pretraining in the initialization step to achieve remarkable results for some similar language pairs. Lample and Conneau (2019) achieved better UNMT performance by introducing a pretrained language model. Training UNMT with cross-lingual language representation agreement has also been proposed to further improve UNMT performance. Moreover, the unsupervised translation task evaluated in the WMT19 news translation task (Barrault et al., 2019) attracted many researchers to participate (Marie et al., 2019; Li et al., 2019).
For multilingual UNMT, one line of work exploited multiple auxiliary languages to jointly boost UNMT models via the Polygon-Net framework. Sen et al. (2019) proposed an MUNMT scheme that jointly trains multiple languages with a shared encoder and multiple decoders. In contrast with their use of multiple decoders, we have constructed a simpler MUNMT model with one encoder and one decoder. Further, we have extended the four or five languages used in their work to thirteen languages for training our MUNMT model.

Conclusion and Future Work
In this paper, we have introduced a unified framework, using a single encoder and decoder, for MUNMT training on a large set of European languages. To further enhance MUNMT performance, we have proposed two knowledge distillation methods. Our extensive experiments and analysis demonstrate the effectiveness of our proposed methods. In the future, we intend to extend the work to other language families, such as Asian languages. We will also introduce other effective methods to improve zero-shot translation quality.