Explicit Cross-lingual Pre-training for Unsupervised Machine Translation

Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios. However, the cross-lingual information obtained from shared BPE spaces is inexplicit and limited. In this paper, we propose a novel cross-lingual pre-training method for unsupervised machine translation by incorporating explicit cross-lingual training signals. Specifically, we first calculate cross-lingual n-gram embeddings and infer an n-gram translation table from them. With those n-gram translation pairs, we propose a new pre-training model called Cross-lingual Masked Language Model (CMLM), which randomly chooses source n-grams in the input text stream and predicts their translation candidates at each time step. Experiments show that our method can incorporate beneficial cross-lingual information into pre-trained models. Taking pre-trained CMLM models as the encoder and decoder, we significantly improve the performance of unsupervised machine translation.


Introduction
Unsupervised machine translation has become an emerging research interest in recent years (Artetxe et al., 2017; Lample et al., 2017, 2018; Artetxe et al., 2018b; Marie and Fujita, 2018; Ren et al., 2019; Lample and Conneau, 2019). The common framework of unsupervised machine translation first builds two initial translation models (i.e., source to target and target to source), and then performs iterative back-translation (Sennrich et al., 2016a; Zhang et al., 2018) with the two models using pseudo data generated by each other. The initialization process is crucial to the final translation performance, as pointed out in Lample et al. (2018), Artetxe et al. (2018b) and Ren et al. (2019).

* Contribution during internship at MSRA.
Previous approaches benefit mostly from cross-lingual n-gram embeddings, but recent work shows that cross-lingual language model pre-training can be a more effective way to build initial unsupervised machine translation models (Lample and Conneau, 2019). However, in their method, the cross-lingual information is mostly obtained from shared Byte Pair Encoding (BPE) (Sennrich et al., 2016b) spaces during pre-training, which is inexplicit and limited. Firstly, although the same BPE pieces from different languages may share the same semantic space, the semantic information of n-grams or sentences in different languages may not be shared properly. However, n-gram-level cross-lingual information is crucial to the initialization of unsupervised machine translation models (Lample et al., 2018; Artetxe et al., 2018b), and it is absent in the current pre-training method. Secondly, BPE sharing is limited to languages that share much of their alphabet. For language pairs where that is not the case, the above pre-training method may not provide much useful cross-lingual information.
In this paper, by incorporating explicit cross-lingual training signals, we propose a novel cross-lingual pre-training method based on BERT (Devlin et al., 2018) for unsupervised machine translation. Our method starts from unsupervised cross-lingual n-gram embeddings, from which we infer n-gram translation pairs. Then, we propose a new pre-training objective called Cross-lingual Masked Language Model (CMLM), which masks the input n-grams randomly and predicts their corresponding n-gram translation candidates inferred above. To solve the mismatch between the different lengths of the masked source and predicted target n-grams, IBM models (Brown et al., 1993) are introduced to derive the training loss at each time step. In this way, we can guide the model with more explicit and strong cross-lingual training signals, and meanwhile leverage the potential of BERT to model context information. We then use two pre-trained cross-lingual language models as the encoder and decoder respectively to build the desired machine translation models. Our method can be performed iteratively, with the n-gram translation table updated by downstream tasks. Experiments show that our method produces better cross-lingual representations and significantly improves the performance of unsupervised machine translation. Our contributions are listed as follows.

Figure 1: Masked Language Model (MLM) for BERT training. For a given sentence, this task is to predict randomly masked tokens, i.e., "contrast", "held" and "opinion". In practice, it is implemented based on BPE.
• We propose a novel cross-lingual pre-training method to incorporate explicit cross-lingual information into pre-trained models, which significantly improves the performance of unsupervised machine translation.
• We introduce IBM models to calculate the step-wise training loss for CMLM, which breaks the limitation that masked n-grams and predicted ones have to be the same length during BERT training.
• We produce strong context-aware crosslingual representations with our pre-training method, which helps in word alignment and cross-lingual classification tasks.

Background
BERT

BERT (Devlin et al., 2018), short for Bidirectional Encoder Representations from Transformers, is a powerful pre-training method for natural language processing that broke the records of many NLP tasks after corresponding fine-tuning. The core idea of BERT is pre-training a deep bidirectional Transformer encoder (Vaswani et al., 2017) with two training tasks. The first one is the Masked Language Model (MLM), following the Cloze task (Taylor, 1953), which takes the straightforward approach of masking some percentage of the input tokens at random and then predicting them with the corresponding Transformer hidden states, as shown in Figure 1. The second one is Next Sentence Prediction, which predicts whether two sentences are adjacent or not. This task is designed for tasks that need to model the relationship between two sentences, such as Question Answering (QA) and Natural Language Inference (NLI).

XLM
Based on BERT, Lample and Conneau (2019) propose a cross-lingual version called XLM and reach state-of-the-art performance on several cross-lingual NLP tasks including cross-lingual classification (Conneau et al., 2018), machine translation, and unsupervised cross-lingual word embedding. The basic points of XLM are mainly two-fold. The first one is to use a shared vocabulary of BPE (Sennrich et al., 2016b) created by processing the monolingual corpora of both languages together. The second one is to pre-train the model with the MLM objective on text streams of both languages.

Overview
Our method can be divided into three steps, as shown in Figure 2. Given two languages X and Y, we first get unsupervised cross-lingual n-gram embeddings of them, from which we infer n-gram translation tables (source-to-target and target-to-source). The n-gram translation pairs inferred in this way have proven to be instructive for initial unsupervised machine translation models (Artetxe et al., 2018b; Lample et al., 2018; Marie and Fujita, 2018; Ren et al., 2019). Then, we pre-train cross-lingual BERT language models with our proposed Cross-lingual Masked Language Model (CMLM) objective. Specifically, we randomly choose n-grams in the monolingual sentences and predict their corresponding translation candidates in the n-gram translation table inferred in the first step.
In this way, we can guide the model with explicit and strong cross-lingual training signals. Finally, the two pre-trained cross-lingual language models are used to initialize the encoder and decoder respectively, based on which denoising auto-encoder and iterative back-translation are leveraged to fine-tune the unsupervised machine translation models.
The above process is repeated by updating the n-gram table with the n-gram translation pairs extracted from the pseudo data generated by the translation models. In the following subsections, we give the details of each step.

N-gram Translation Table Inferring
Following previous work on unsupervised machine translation (Artetxe et al., 2018b; Lample et al., 2018; Ren et al., 2019), given two languages X and Y, we build our n-gram translation tables as follows. First, we obtain monolingual n-gram embeddings using fastText (Bojanowski et al., 2017) and then get cross-lingual n-gram embeddings using vecmap (Artetxe et al., 2018a) in a fully unsupervised way. Based on that, we calculate the similarity score of n-grams x and y from the two languages with the margin-based scoring method (Conneau et al., 2017; Artetxe and Schwenk, 2018). Specifically, given the cross-lingual embeddings of x and y, denoted as e_x and e_y, the similarity score is calculated as:

sim(x, y) = margin(cos(e_x, e_y), \sum_{z \in NN_n(x)} \frac{cos(e_x, e_z)}{2n} + \sum_{z \in NN_n(y)} \frac{cos(e_y, e_z)}{2n})   (1)

where margin(a, b) is a marginal scoring function and NN_n(x) denotes x's n nearest neighbors in the other language. In our experiments, n is 5 and margin(a, b) = a/b. We take the above similarity scores as the translation probabilities between x and y in the n-gram table. For each top-frequent n-gram in the source language, we retrieve the top-k n-gram translation candidates in the target language.
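The scoring above can be sketched in a few lines of numpy. This is a minimal illustration under assumed inputs (row-matrices of cross-lingual embeddings for the source and target n-gram vocabularies); the function names and the top-k retrieval loop are ours, not from a released implementation:

```python
import numpy as np

def cosine_matrix(src_emb, tgt_emb):
    """Pairwise cosine similarities between two sets of row vectors."""
    a = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    b = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return a @ b.T

def margin_scores(src_emb, tgt_emb, n=5):
    """Ratio margin of Eq. (1): cos(x, y) divided by the averaged
    cosine of each side's n nearest neighbors in the other language."""
    cos = cosine_matrix(src_emb, tgt_emb)
    # mean cosine of the n nearest target neighbors of each source n-gram
    src_nn = np.sort(cos, axis=1)[:, -n:].mean(axis=1)
    # mean cosine of the n nearest source neighbors of each target n-gram
    tgt_nn = np.sort(cos, axis=0)[-n:, :].mean(axis=0)
    denom = (src_nn[:, None] + tgt_nn[None, :]) / 2.0
    return cos / denom

def build_table(src_emb, tgt_emb, k=3, n=5):
    """Top-k translation candidates (index, score) for each source n-gram."""
    sim = margin_scores(src_emb, tgt_emb, n=n)
    topk = np.argsort(-sim, axis=1)[:, :k]
    return [[(int(j), float(sim[i, j])) for j in row] for i, row in enumerate(topk)]
```

The retrieved scores then serve directly as the translation probabilities stored in the n-gram table.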

Cross-lingual Masked Language Model
In this section, we introduce our proposed method for pre-training cross-lingual language models based on BERT. Unlike the masked language model (MLM) described in Section 2, which masks several tokens in the input stream and predicts those tokens themselves, we randomly select some percentage of the n-grams in the input source sentence and predict their translation candidates retrieved as in Section 3.2. We call our proposed pre-training objective "Cross-lingual Masked Language Model" (CMLM), as shown in Figure 2. The difficulty for BERT in predicting target phrases during training is that the lengths of the translation candidates sometimes differ from those of the source n-grams. To deal with this problem, we turn to IBM Model 2 (Brown et al., 1993) to calculate the training loss at each time step. Our proposed method breaks the limitation that masked n-grams and predicted ones have to be the same length during BERT training.
Specifically, according to IBM Model 2, given a source n-gram x_1^l and a target one y_1^m, where l and m are the numbers of tokens in the source and target n-grams respectively, the translation probability from x_1^l to y_1^m is calculated as:

P(y_1^m | x_1^l) = \epsilon \prod_{j=1}^{m} \sum_{i=1}^{l} a(i|j, l, m) \, p(y_j | x_i)   (2)

where \epsilon = p(m | x_1^l), i.e., the probability that the translation of x_1^l consists of m tokens; a(i|j, l, m) is the probability that the j-th target token is aligned with the i-th source token conditioned on the lengths l and m, and p(y_j | x_i) is the translation probability from the source token x_i to the target token y_j. Based on IBM Model 2, the loss function of our CMLM is defined as:

L_cmlm = -\log P(y_1^m | x_1^l) = -\log \epsilon - \sum_{j=1}^{m} \log \sum_{i=1}^{l} a(i|j, l, m) \, p(y_j | x_i)   (3)

The derived gradient w.r.t. the model parameters \theta at the j-th time step can be calculated as follows:

\frac{\partial L_j}{\partial \theta} = - \frac{\sum_{i=1}^{l} a(i|j, l, m) \, \partial p(y_j | x_i; \theta) / \partial \theta}{\sum_{i=1}^{l} a(i|j, l, m) \, p(y_j | x_i; \theta)}   (4)

Since the target n-gram y_1^m is predicted with our modified BERT, in practice the source word x_i in Eq. (4) is replaced with its context-sensitive embedding C(x_1^l), which is the corresponding hidden state of the top Transformer layer. The alignment probability a(i|j, l, m) cannot be learned during training because of the absence of a bilingual corpus. Therefore, cross-lingual BPE embeddings are leveraged to calculate the normalized sim(x_i, y_j) to approximate a(i|j, l, m). p(y_j | x_i) is the model prediction in the softmax outputs. For each source n-gram, all of the retrieved k translation candidates are used to calculate the cross-entropy loss, weighted by their translation probabilities in the n-gram table.
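To make the loss concrete, here is a small numpy sketch of the candidate-weighted negative log-likelihood in Eqs. (2)-(3), assuming the model's softmax outputs and the similarity-based alignment probabilities have already been arranged into per-candidate matrices; the length term \epsilon is omitted for brevity and all names are hypothetical:

```python
import numpy as np

def cmlm_loss(pred_probs, align_probs, cand_weights):
    """
    pred_probs: list over the k candidates; each entry is an (l, m_k) array
        whose [i, j] cell approximates p(y_j | x_i) from the model's softmax.
    align_probs: matching list of (l, m_k) arrays holding a(i|j, l, m), with
        each column summing to 1 (approximated from BPE-embedding similarity).
    cand_weights: translation probabilities of the k candidates in the table.
    Returns the candidate-weighted negative log-likelihood of Eq. (3),
    without the -log(eps) length term.
    """
    total = 0.0
    for p, a, w in zip(pred_probs, align_probs, cand_weights):
        step = (a * p).sum(axis=0)          # sum_i a(i|j) p(y_j|x_i), per step j
        total += w * -np.log(step + 1e-9).sum()
    return total / sum(cand_weights)
```

As a sanity check, the loss is lower when the model puts probability mass on the token that the alignment distribution actually points at.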
Given a language pair X−Y, we process both languages with the same shared BPE vocabulary, learned from their monolingual sentences together, during pre-training. Following Devlin et al. (2018) and Lample and Conneau (2019), in our CMLM objective we randomly sample 15% of the BPE n-grams from the text streams and replace them with [MASK] tokens 70% of the time. During pre-training, in each iteration, a batch is composed of sentences sampled from the same language, and we alternate between the MLM and CMLM objectives. Different from the original MLM in BERT, in half of the MLM time we randomly choose some source n-grams in the input text stream and replace them with their translation candidates to construct code-switching sentences. Our final pre-training loss is defined as:

L_pre = L_mlm + L_cmlm   (5)
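The n-gram selection step can be sketched as follows. This is our own illustrative reading of the description above (candidate n-grams up to length 3, disjoint spans, a 15% budget, 70% [MASK] replacement); the real implementation operates on BPE text streams and differs in detail:

```python
import random

MASK = "[MASK]"

def make_cmlm_example(tokens, ngram_table, rng, mask_ratio=0.15, mask_prob=0.7):
    """Choose n-grams whose translation candidates will be predicted; replace
    each chosen span with [MASK] tokens 70% of the time (one [MASK] per source
    token), otherwise leave the surface form in place."""
    inp, targets = list(tokens), []
    # candidate spans: n-grams (n <= 3) that appear in the translation table
    spans = [(i, n) for i in range(len(tokens)) for n in (1, 2, 3)
             if i + n <= len(tokens) and " ".join(tokens[i:i + n]) in ngram_table]
    rng.shuffle(spans)
    budget = max(1, int(mask_ratio * len(tokens)))
    covered = set()
    for i, n in spans:
        if len(covered) >= budget:
            break
        if any(p in covered for p in range(i, i + n)):
            continue                          # keep chosen spans disjoint
        covered.update(range(i, i + n))
        src = " ".join(tokens[i:i + n])
        targets.append((i, ngram_table[src]))  # predict these candidates here
        if rng.random() < mask_prob:
            for p in range(i, i + n):
                inp[p] = MASK
    return inp, targets
```

The returned targets pair each chosen position with its list of table candidates, which is what the IBM-Model-2-weighted loss consumes.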

Unsupervised Machine Translation
Taking two cross-lingual language models pre-trained with the above method as the encoder and decoder, we build an initial unsupervised neural machine translation model. Then, we train the model with monolingual data until convergence via denoising auto-encoder and iterative back-translation, as described in Artetxe et al. (2017) and Lample et al. (2017, 2018). Different from them, we go a step further and make another iteration with updated n-gram translation tables. Specifically, we translate the monolingual sentences with our latest translation model and run GIZA++ (Och and Ney, 2003) on the generated pseudo-parallel data to extract updated n-gram translation pairs, which are used to tune the encoder as in Section 3.3, together with the back-translation, within a multi-task learning framework. Experimental results show that running another iteration can further improve the translation performance.
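The overall iteration can be summarized as a driver loop. The heavy components (CMLM pre-training, UMT fine-tuning, GIZA++-style pair extraction) are passed in as callables here, so this is only a sketch of the control flow of Section 3, not of the systems themselves:

```python
def unsupervised_mt_pipeline(mono_x, mono_y, infer_table, pretrain_cmlm,
                             train_umt, extract_ngram_pairs, iterations=2):
    """Driver for the three-step method; all heavy lifting is injected as
    callables so only the control flow is shown."""
    table = infer_table(mono_x, mono_y)                  # step 1: n-gram table
    model = None
    for _ in range(iterations):
        encoder, decoder = pretrain_cmlm(mono_x, mono_y, table)   # step 2: CMLM
        model = train_umt(encoder, decoder, mono_x, mono_y)       # step 3: UMT
        pseudo = [(s, model(s)) for s in mono_x + mono_y]         # back-translate
        table = extract_ngram_pairs(pseudo)              # update table (GIZA++)
    return model, table
```

With toy stand-ins for the callables, the loop visibly refreshes the table from the pseudo-parallel data on each iteration.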
It is also interesting to explore the usage of pre-trained decoders in the translation model. Pre-training the decoder seems to have a smaller effect on the final performance than pre-training the encoder (Lample and Conneau, 2019); one reason could be that the encoder-to-decoder attention is not pre-trained, so the parameters of the decoder need to be re-adjusted substantially during the subsequent tuning process for the MT task. In our experiments, we explore another usage of pre-trained decoders, i.e., we use the pre-trained decoder as a feature extractor and feed its outputs into a new decoder consisting of several Transformer layers with attention to the encoder. We find this method improves the performance for some translation directions.

Experiments
In this section, we conduct experiments to evaluate our proposed pre-training method. In Section 4.1, we introduce the setup of our experiments, followed by the overall results of the final unsupervised MT models. Then, in Section 4.3, we discuss another usage of pre-trained decoders for translation models. To evaluate the cross-lingual modeling capacity of our pre-trained encoders, in Section 4.4 we conduct experiments on word alignment and cross-lingual classification tasks. Finally, we conduct an ablation study to check the performance contribution of each component of our proposed method.

Data and Preprocess
In our experiments, we consider three language pairs: English-French (en-fr), English-German (en-de) and English-Romanian (en-ro). For each language, we use all the available sentences in the NewsCrawl monolingual datasets from WMT up to 2018. The NewsCrawl data are used both in pre-training and in the following unsupervised NMT iteration process. Our CMLM is optimized based on the pre-trained models released by Lample and Conneau (2019), which are trained on Wikipedia dumps. For fair comparison, we use newstest 2014 as the test set for en-fr, and newstest 2016 for en-de and en-ro.
We use the Moses scripts for tokenization, and use fastBPE to split words into subword units with the BPE codes released by Lample and Conneau (2019). The number of shared BPE codes for each language pair is 60,000.

Implementation Details
Our implementation is based on the released code of XLM, which is built on PyTorch (Paszke et al., 2017). Specifically, we use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations (Hendrycks and Gimpel, 2016), and a dropout rate of 0.1. The models are trained with the Adam optimizer (Kingma and Ba, 2014) and a linear warmup (Vaswani et al., 2017), with learning rates varying from 10^-4 to 5 × 10^-4.
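For illustration, a linear warmup schedule in its simplest form (ramp to a peak rate, then hold); this is only a sketch of the idea, and the actual schedule in the XLM codebase differs in its post-warmup behavior:

```python
def warmup_lr(step, warmup_steps, peak_lr):
    """Linearly ramp the learning rate from 0 to peak_lr over warmup_steps,
    then hold it constant (decay variants exist; this is the minimal form)."""
    return peak_lr * min(1.0, step / warmup_steps)
```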
For both the MLM and CMLM objectives, we use streams of 256 tokens and mini-batches of size 64. We use the averaged perplexity over languages as the stopping criterion for training. For machine translation, we use 6 Transformer layers and create mini-batches of 2000 tokens.
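The stream construction can be sketched as follows; this is an assumed reading of "streams of 256 tokens" (fixed-length slices of a concatenated token-id corpus, with the tail dropped), not the exact XLM batching code:

```python
def make_streams(token_ids, stream_len=256, batch_size=64):
    """Cut a corpus of token ids into fixed-length streams and group them
    into mini-batches, dropping any tail that does not fill a stream."""
    streams = [token_ids[i:i + stream_len]
               for i in range(0, len(token_ids) - stream_len + 1, stream_len)]
    return [streams[i:i + batch_size]
            for i in range(0, len(streams), batch_size)]
```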

Baselines
Our method is compared with six baseline unsupervised MT systems, listed in the upper part of Table 1. The first two baselines (Artetxe et al., 2017; Lample et al., 2017) use a shared encoder and separate decoders for the two languages, trained with denoising auto-encoder and iterative back-translation. The third baseline (Artetxe et al., 2018b) is an unsupervised PBSMT model, which builds initial PBSMT models from language models and n-gram translation tables inferred from cross-lingual embeddings, followed by iterative back-translation. The fourth baseline (Lample et al., 2018) is a hybrid of unsupervised NMT and PBSMT, which combines the pseudo data generated by the PBSMT models into the final iteration of NMT. The fifth baseline (Ren et al., 2019) is also a hybrid of NMT and PBSMT, but different from Lample et al. (2018), they leverage PBSMT as posterior regularization in each NMT iteration. The last baseline is XLM, described in Section 2.2.

Overall Results
The overall comparison results of unsupervised machine translation are shown in Table 1. From the table, we find that our proposed method significantly outperforms previous methods on all language pairs by an average BLEU score of 1.7, and the improvements for en2fr and ro2en are both over 2 BLEU points. The results indicate that the explicit cross-lingual information incorporated by our proposed CMLM is beneficial to the unsupervised machine translation task. Notice that by doing another iteration ("Iter 2") with updated n-gram tables as described in Section 3.4, we further improve the performance for most translation directions, with the improvements for en2fr and ro2en larger than 0.5 BLEU points. This confirms that fine-tuned machine translation models contain more beneficial cross-lingual information than the initial n-gram translation tables, which can be used to enhance the pre-trained model iteratively.
The improvement made by Lample and Conneau (2019) over the first five baselines shows that cross-lingual pre-training is important for unsupervised MT. However, the cross-lingual information learned with their method during pre-training comes mostly from the shared subword space, which is inexplicit and not strong enough. Our proposed method gives the model more explicit and strong cross-lingual training signals, so that the pre-trained model contains much more beneficial cross-lingual information for unsupervised machine translation. As a result, we further improve the translation performance significantly compared with Lample and Conneau (2019) (with a significance level of p<0.01).

Usage of Pre-trained Decoder
As mentioned in Section 3.4, it is interesting to explore different usages of pre-trained decoders in the MT task. Intuitively, directly using the pre-trained model as the decoder may not work well because the parameters of the decoder need substantial adjustment due to the attention part between the encoder and the decoder. Therefore, we treat the pre-trained decoder as a feature extractor and add several Transformer layers with encoder-to-decoder attention on top of it. We also try fixing the pre-trained decoder and fine-tuning only the encoder and the added decoder part. The experiments are conducted based on "Iter 1", with the results reported in Table 2.
From this table, we can see that directly using the pre-trained model as the decoder is the best choice in most cases, with the exceptions of en2fr and ro2en. Adding 6 Transformer layers on top of the original pre-trained decoder achieves higher performance for en2fr and ro2en, but the gains are not significant. The reason could be that it is difficult to train the additional Transformer layers from scratch in the unsupervised scenario. There is also an interesting phenomenon: if we fix the pre-trained part of the decoder and only tune the added Transformer layers, the final performance drops drastically, which indicates that the representation space of the decoder requires substantial adaptation, even though the pre-trained models already contain cross-lingual information. We think that deeper research on decoder initialization could be a necessary and interesting topic in the future.

Evaluation of Cross-lingual Pre-trained Encoder

Word Alignment
To evaluate the cross-lingual modeling capacity of our pre-trained models, we first conduct experiments on the English-French (en-fr) dataset of the HLT/NAACL 2003 alignment shared task (Mihalcea and Pedersen, 2003). Given two parallel sentences in English and French, we feed each sentence into the pre-trained cross-lingual encoder and get its respective outputs. Then, we calculate the similarities between the outputs of the two sentences and choose the target words with the maximum similarity scores as the alignments of the corresponding source words. In Table 3, we compare the context-unaware method (i.e., directly calculating the similarity scores between the unsupervised cross-lingual embeddings (Artetxe et al., 2018a) of source and target words), XLM (Lample and Conneau, 2019), and our proposed CMLM pre-training method. In this experiment, we leave out all the OOV words and those torn apart by the BPE operations.

Table 3: Results of word alignment tasks using different cross-lingual word embeddings. In this table, "P" means precision, "R" means recall, "F" means F-measure and "AER" means alignment error rate.
From this table, we find that, based on BERT, both XLM and our method can model cross-lingual context information, indicating that context information can greatly enhance the cross-lingual mapping between source and target words. By leveraging explicit cross-lingual information during model training, our CMLM significantly outperforms XLM. This confirms that our CMLM better connects the source and target representation spaces, and using it for the pre-trained models improves the performance of unsupervised NMT.
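The greedy alignment step described above can be sketched directly: each source word is aligned to the target word whose encoder output state is most similar. A minimal numpy version, assuming the hidden states have already been extracted as row vectors:

```python
import numpy as np

def align_words(src_states, tgt_states):
    """For each source word, pick the target word with the maximum cosine
    similarity between the encoder output states of the parallel sentences."""
    s = src_states / np.linalg.norm(src_states, axis=1, keepdims=True)
    t = tgt_states / np.linalg.norm(tgt_states, axis=1, keepdims=True)
    return np.argmax(s @ t.T, axis=1)     # one target index per source word
```

On synthetic states where each target vector duplicates one source vector, the function recovers the underlying permutation exactly.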

Cross-lingual Classification
We also conduct experiments on the cross-lingual classification task (Conneau et al., 2018) using the cross-lingual natural language inference (XNLI) dataset (Conneau et al., 2018). Specifically, we add a linear classification layer on top of the first hidden state of our pre-trained model and fine-tune its parameters on the English NLI dataset. Without using any labeled data for French (fr) and German (de), we report the zero-shot classification results for them in Table 4. Our method reaches a new record on the zero-shot cross-lingual classification task for French and German, which confirms again that our CMLM models cross-lingual information better than previous methods in the unsupervised scenario.
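The classification head is just a linear layer and a softmax over the three NLI labels, applied to the first hidden state. A numpy sketch under that reading (the weight shapes and names are ours, not from the released code):

```python
import numpy as np

def xnli_head(first_hidden, W, b):
    """Linear layer + softmax over {entailment, neutral, contradiction},
    applied to the first hidden state (the [CLS]-like position) of the
    pre-trained encoder. The head is fine-tuned on English NLI only and
    then applied zero-shot to French/German encoder outputs."""
    logits = first_hidden @ W + b
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)
```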

Ablation Study
In this section, we discuss different settings of our method. Firstly, the training loss of our pre-training method contains two parts, i.e., CMLM and MLM, as Eq. (5) shows. To study the respective influence of these two parts, we remove the MLM loss and compare the performance on the en-fr and en-de translation tasks.
Our CMLM task differs from XLM in two aspects during pre-training. The first is that we randomly choose n-grams to mask in the input text stream rather than BPE tokens, and the second is that we predict the translation candidates of a source n-gram rather than the source n-gram itself. Although the first has proven beneficial for pre-training on some NLP tasks, we want to check how much it influences our final translation performance. Therefore, we disable these two modifications in CMLM one by one and report the translation results. Our experiments are conducted based on "Iter 1", with the results in Table 5.
From Table 5, we can find that the combination of CMLM and MLM improves the translation performance by about 0.6 to 0.7 BLEU compared with either one alone. This confirms the monolingual context modeling capacity of the MLM, which is quite useful for unsupervised machine translation. By combining CMLM and MLM, we can enforce the model to learn both explicit cross-lingual information and monolingual context information.

Related Work

The pioneering NMT-based methods for unsupervised machine translation (Artetxe et al., 2017; Lample et al., 2017; Yang et al., 2018) are mainly modifications of the encoder-decoder structure. The core idea is to constrain the outputs of the encoders of the two languages into the same latent space with a weight-sharing mechanism, such as using a shared encoder. Denoising auto-encoder (Vincent et al., 2010) and adversarial training methods are also leveraged. Besides, they apply iterative back-translation to generate pseudo data for cross-lingual training. In addition to NMT methods for unsupervised machine translation, some following work shows that SMT methods and hybrids of NMT and SMT can be more effective (Artetxe et al., 2018b; Lample et al., 2018; Marie and Fujita, 2018; Ren et al., 2019). They all build unsupervised PBSMT systems whose models are initialized with language models and phrase tables inferred from cross-lingual word or n-gram embeddings, and then use the initial PBSMT models to do iterative back-translation. More recently, Lample and Conneau (2019) reach new state-of-the-art performance on unsupervised en-fr and en-de translation tasks. They propose a cross-lingual language model pre-training method based on BERT (Devlin et al., 2018), and then use two cross-lingual language models as the encoder and decoder to perform the translation. Leveraging much more monolingual data from Wikipedia, their work shows the big potential of pre-training for unsupervised machine translation. However, the cross-lingual information in their pre-training method is obtained mostly from the shared BPE space, which is inexplicit and limited. Therefore, we develop a new pre-training method that gives the model more explicit and stronger cross-lingual information.
In the recent work of Song et al. (2019), they also mask several consecutive tokens in the source sentence, but jointly pre-train the encoder and decoder by making the decoder predict those masked tokens on both the source and target sides. Their method is a good case of pre-training for sequence-to-sequence tasks, but the cross-lingual information incorporated by their method comes from BPE sharing, which is also implicit. Our proposed method could be combined with theirs within a multi-task framework, which we leave for future work.

Conclusion
In this paper, we propose a novel cross-lingual pre-training method for unsupervised machine translation. In our method, we leverage the Cross-lingual Masked Language Model (CMLM) to incorporate explicit and strong cross-lingual information into pre-trained models. Experimental results on the en-fr, en-de, and en-ro language pairs demonstrate the effectiveness of our proposed method.
In the future, we may apply our pre-training method to other language pairs and investigate the performance of the pre-trained encoders on other NLP tasks, such as Named Entity Recognition.

Figure 2 :
Figure 2: Method overview. Our method consists of three steps. The first one is the n-gram translation table inferring. The second one is pre-training with our proposed objective, Cross-lingual Masked Language Model (CMLM), which is to predict the translation candidates of randomly masked n-grams. The last step is to leverage the pre-trained cross-lingual language models as the encoder and decoder of a neural machine translation model and fine-tune the translation model iteratively.

Table 1 :
Comparison of the final unsupervised MT performance (BLEU).In this table, "Iter 2" means we do the whole process with another iteration as described in Section 3.4.

Table 2 :
Test BLEU scores with different usage of the pre-trained decoder.

Table 4 :
Results of zero-shot cross-lingual classification (on XNLI test sets).

Table 5 :
Ablation study. "CMLM + MLM" means we use L_pre as the pre-training loss; "CMLM" means we only use L_cmlm as the pre-training loss; "- translation prediction" means we predict the masked n-grams themselves rather than their translation candidates during pre-training; "- - n-gram mask" means we randomly mask BPE tokens rather than n-grams based on "- translation prediction" during pre-training, which degrades our method to XLM.
Unsupervised machine translation dates back to Klementiev et al. (2012) and Nuhn et al. (2012), but it has become a hot research topic in recent years. Among the hybrid systems, Lample et al. (2018) combine the best pseudo data generated by SMT models into the training of the NMT model, while Ren et al. (2019) alternately train SMT and NMT models within a posterior regularization framework.