Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information

We investigate the following question for machine translation (MT): can we develop a single universal MT model to serve as the common seed and obtain derivative and improved models on arbitrary language pairs? We propose mRASP, an approach to pre-train a universal multilingual neural machine translation model. Our key idea in mRASP is its novel technique of random aligned substitution, which brings words and phrases with similar meanings across multiple languages closer in the representation space. We pre-train an mRASP model on 32 language pairs jointly, using only public datasets. The model is then fine-tuned on downstream language pairs to obtain specialized MT models. We carry out extensive experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs. Experimental results demonstrate that mRASP achieves significant performance improvements compared to models trained directly on those target pairs. This is the first work to verify that multiple low-resource language pairs can be utilized to improve rich-resource MT. Surprisingly, mRASP is even able to improve translation quality on exotic languages that never occur in the pre-training corpus. Code, data, and pre-trained models are available at https://github.com/linzehui/mRASP.


Introduction
Pre-trained language models such as BERT have been highly effective for NLP tasks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Conneau and Lample, 2019; Yang et al., 2019). Pre-training followed by fine-tuning has been a successful paradigm. It is intriguing to discover a "BERT" equivalent, i.e., a pre-trained model, for machine translation. In this paper, we study the following question: can we develop a single universal MT model and derive specialized models by fine-tuning on an arbitrary pair of languages?
While pre-training techniques work very well for NLP tasks, there are still several limitations for machine translation. First, pre-trained language models such as BERT are not easy to fine-tune directly unless sophisticated techniques are used (Yang et al., 2020). Second, there is a discrepancy between existing pre-training objectives and the downstream MT objective. Existing pre-training approaches such as MASS (Song et al., 2019) and mBART (Liu et al., 2020) rely on auto-encoding objectives to pre-train the models, which differ from translation. Therefore, their fine-tuned MT models still do not achieve adequate improvement. Third, existing MT pre-training approaches focus on using multilingual models to improve MT for low- or medium-resource languages. There has not been a pre-trained MT model that can improve any pair of languages, even in rich-resource settings such as English-French.
In this paper, we propose multilingual Random Aligned Substitution Pre-training (mRASP), a method to pre-train an MT model for many languages, which can be used as a common initial model to fine-tune on arbitrary language pairs. mRASP then improves the translation performance compared to MT models directly trained on the downstream parallel data. In our method, we ensure that the pre-training on many languages and the downstream fine-tuning share the same model architecture and training objective; this consistency leads to large translation performance gains. Considering that many languages differ lexically but are closely related at the semantic level, we start by training a large-scale multilingual NMT model across different translation directions, and then fine-tune the model on a specific direction.
Further, to close the representation gap across different languages and make full use of multilingual knowledge, we explicitly introduce additional loss based on random and aligned substitution of the words in the source and target sentences. Substituted sentences are trained jointly with the same translation loss as the original multilingual parallel corpus. In this way, the model is able to bridge closer the representation space across different languages.
We carry out extensive experiments in different scenarios, including translation tasks at different dataset scales as well as zero-shot translation tasks. For extremely low-resource pairs (<100k), mRASP obtains gains of up to +29 BLEU points compared to models trained directly on the downstream language pairs, and the gains remain consistent as the dataset size increases. Remarkably, even for rich-resource pairs (>10M, e.g. English-French), mRASP still achieves substantial improvements. Surprisingly, even when mRASP is fine-tuned on two exotic languages that never occur in the pre-training corpus, the resulting MT model is still much better than the directly trained one (+3.3 to +14.1 BLEU). We finally conduct extensive analytical experiments to examine the factors within mRASP that contribute to the performance gains.
We highlight our contributions as follows: • We propose mRASP, an effective pre-training method that can be utilized to fine-tune on any language pairs in NMT. It is very efficient in the use of parallel data in multiple languages. While other pre-trained language models are obtained through hundreds of billions of monolingual or cross-lingual sentences, mRASP only introduces several hundred million bilingual pairs. We suggest that the consistent objectives of pre-training and fine-tuning lead to better model performance.
• We explicitly introduce a random aligned substitution technique into the pre-training strategy, and find that such a technique can bridge the semantic space between different languages and thus improve the final translation performance.
• We conduct extensive experiments on 42 translation directions across different scenarios, demonstrating that mRASP can significantly boost performance on various translation tasks. mRASP achieves 14.1 BLEU with only 12k pairs of Dutch and Portuguese sentences, even though neither language appears in the pre-training data. mRASP also achieves 44.3 BLEU on WMT14 English-French translation. Note that our pre-trained model only uses parallel corpora in 32 languages, unlike other methods that also use much larger monolingual corpora.

Methodology
In this section, we introduce our proposed mRASP and the training details.

mRASP
Architecture We adopt a standard Transformer-large architecture (Vaswani et al., 2017) with a 6-layer encoder and a 6-layer decoder. The model dimension is 1,024 with 16 attention heads. We replace ReLU with GeLU (Hendrycks and Gimpel, 2016) as the activation function in the feed-forward networks. We also use learned positional embeddings.
Methodology A multilingual neural machine translation model learns a many-to-many mapping function f to translate from one language to another. More formally, define L = {L_1, ..., L_M} as the collection of languages involved in the pre-training phase. D_{i,j} denotes a parallel dataset of (L_i, L_j), and E denotes the set of parallel datasets {D_1, ..., D_N}, where N is the number of bilingual pairs. The training loss is then defined as:

$\mathcal{L}^{pre} = \sum_{D_{i,j} \in E} \mathbb{E}_{(x^i, x^j) \sim D_{i,j}} \left[ -\log P_\theta\!\left(x^j \mid C(x^i)\right) \right]$ (1)

where x^i represents a sentence in language L_i, θ is the parameter set of mRASP, and C(x^i) is our proposed alignment function, which randomly replaces words in x^i with words of a different language. In the pre-training phase, the model jointly learns all the translation pairs.

Language Indicator Inspired by Johnson et al. (2017) and Ha et al. (2016), to distinguish between different translation pairs, we simply add two artificial language tokens to indicate the languages at the source and target side. For instance, the En→Fr sentence pair "How are you? -> Comment vas tu?" is transformed to "<en> How are you? -> <fr> Comment vas tu?".

Figure 1: The proposed mRASP method. "Tok" denotes token embedding while "Pos" denotes position embedding.
During the pre-training phase, parallel sentence pairs in many languages are trained using translation loss, together with their substituted ones. We randomly substitute words with the same meanings in the source and target sides. During the fine-tuning phase, we further train the model on the downstream language pairs to obtain specialized MT models.
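The language-indicator scheme described above is a simple preprocessing step. The sketch below illustrates it; the token format (`<en>`, `<fr>`) follows the example in the text, while the function name is our own:

```python
def add_language_tokens(src: str, tgt: str, src_lang: str, tgt_lang: str):
    """Prepend artificial language tokens so that a single many-to-many
    model can tell translation directions apart."""
    return f"<{src_lang}> {src}", f"<{tgt_lang}> {tgt}"

src, tgt = add_language_tokens("How are you?", "Comment vas tu?", "en", "fr")
print(src)  # <en> How are you?
print(tgt)  # <fr> Comment vas tu?
```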
Multilingual Pre-training via RAS Recent work shows that cross-lingual language model pre-training can be a more effective approach to representation learning (Conneau and Lample, 2019; Huang et al., 2019). However, the cross-lingual information is mostly obtained from a shared subword vocabulary during pre-training, which is limited in several aspects:
• The vocabulary sharing space is sparse in most cases, especially for dissimilar language pairs such as English and Hindi, whose morphologies are entirely different.
• The same subword in different languages may not share the same semantic meaning.
• The parameter-sharing approach lacks explicit supervision to guide words with the same meaning in different languages toward the same semantic space.
Inspired by contrastive learning, we propose to bridge the semantic gap among different languages through Random Aligned Substitution (RAS). Given a parallel sentence pair (x^i, x^j), we randomly replace a source word x^i_t with its translation in a different, randomly chosen language L_k, where t is the word index. We adopt the unsupervised word alignment method MUSE (Lample et al., 2018b) to build bilingual dictionaries, and denote by d_{i,k} the dictionary function translating a word from L_i to L_k. With the dictionary replacement, the original bilingual pair yields a code-switched sentence pair (C(x^i), x^j). Thanks to random sampling, the words in the translation set $\{d_{i,k}(x^i_t)\}_{k=1}^{M}$ can potentially appear in the same context. Since word representations depend on context, words with similar meanings across different languages can thus share similar representations. Figure 1 shows our alignment methodology.
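A minimal sketch of the RAS substitution step, assuming a toy word-level dictionary in place of the MUSE-induced ones. The substitution probability and the random choice among candidate translations (which also sidesteps polysemy) mirror the training details described later; the function name and dictionary entries are ours:

```python
import random

def random_aligned_substitution(tokens, dictionaries, p=0.3, rng=None):
    """Randomly replace source words with translations in other languages.

    `dictionaries` maps a word to its candidate translations across
    languages; one candidate is sampled uniformly at random, so no single
    word sense is committed to.
    """
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        candidates = dictionaries.get(tok)
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))  # code-switch this word
        else:
            out.append(tok)                     # keep the original word
    return out

# Toy En->{Fr, De} dictionary; entries are illustrative, not from MUSE.
toy_dict = {"how": ["comment", "wie"], "you": ["tu", "du"]}
print(random_aligned_substitution(["how", "are", "you"], toy_dict, p=1.0))
```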

Pre-training Data
We collect 32 English-centric language pairs, resulting in 64 directed translation pairs in total. English serves as an anchor language bridging all other languages. The parallel corpora come from various sources: TED, WMT, Europarl, ParaCrawl, OpenSubtitles, and QED. We refer to our pre-training data as PC32 (Parallel Corpus 32). PC32 contains a total of 197M sentence pairs. Detailed descriptions and a summary of the datasets can be found in the Appendix.
For RAS, we utilize ground-truth En-X bilingual dictionaries, where X denotes the languages involved in PC32. Since not all languages in PC32 have ground-truth dictionaries, we only use the available ones.

Pre-training Details
We use a learned joint vocabulary. We learn shared BPE (Sennrich et al., 2016b) merge operations (32k merges) across all the training data, supplemented with monolingual data (limited to 1M sentences). To balance the vocabulary across languages whose resources differ drastically in size, we over-sample the corpus of each language relative to the volume of the largest language corpus when learning BPE. We keep tokens occurring more than 20 times, which results in a subword vocabulary of 64,808 tokens. In the pre-training phase, we train our model on the full parallel corpus. Following the Transformer training setup, we use the Adam optimizer with ε = 1e-8 and β₂ = 0.98. A warm-up and linear decay schedule with 4,000 warm-up steps is used. We pre-train the model for a total of 150,000 steps.
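The learning-rate schedule above (linear warm-up, then linear decay) can be sketched as a small function. The warm-up steps (4,000) and total steps (150,000) are from the text, but the peak learning rate is an assumed value the paper does not state here:

```python
def lr_schedule(step, peak_lr=7e-4, warmup_steps=4000, total_steps=150000):
    """Linear warm-up from 0 to `peak_lr`, then linear decay to 0.

    `peak_lr` is an assumption for illustration; the paper specifies only
    the warm-up steps and the total number of pre-training steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lr_schedule(0))       # 0.0
print(lr_schedule(4000))    # 0.0007 (peak)
print(lr_schedule(150000))  # 0.0
```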
For RAS, we use the top 1000 words in dictionaries and only substitute words in source sentences. Each word is replaced with a probability of 30% according to the En-X bilingual dictionaries. To address polysemy, we randomly select one substitution from all candidates.

Experiments
This section shows that mRASP obtains consistent performance gains in different scenarios. We also compare our method with existing pre-training methods and show that it outperforms these baselines on the En→Ro dataset; performance is further boosted by combining the back-translation technique (Sennrich et al., 2016a). Unless otherwise stated, for all experiments we use the pre-trained model as initialization and fine-tune it on the downstream parallel corpus.

Experiment Settings
Datasets We collect 14 pairs of parallel corpora to simulate different scenarios. Most of the En-X parallel datasets are from the pre-training phase to avoid introducing new information. Most pairs for fine-tuning are from previous years of WMT and IWSLT. Specifically, we use WMT14 for En-De and En-Fr, and WMT16 for En-Ro. For pairs like Nl (Dutch)-Pt (Portuguese) that are not available in WMT or IWSLT, we use news-commentary instead. For a detailed description, please refer to the Appendix.
Based on the volume of parallel bi-texts, we divide the datasets into four categories: extremely low resource (<100K), low resource (>100K and <1M), medium resource (>1M and <10M), and rich resource (>10M). For back-translation, we include the 2014-2018 news-crawl data for the target side, En. The total size of the monolingual data is 3M sentences.
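The four resource categories can be expressed as a small helper; the thresholds come from the text, and the function name and example pairings are ours (the example sizes match the datasets used later in the analysis):

```python
def resource_category(num_pairs: int) -> str:
    """Map a parallel-corpus size (number of sentence pairs) to the
    paper's four resource categories."""
    if num_pairs < 100_000:
        return "extremely low"
    if num_pairs < 1_000_000:
        return "low"
    if num_pairs < 10_000_000:
        return "medium"
    return "rich"

print(resource_category(41_000))      # extremely low (e.g. En-Af)
print(resource_category(600_000))     # low           (e.g. En-Ro)
print(resource_category(4_500_000))   # medium        (e.g. En-De)
print(resource_category(40_000_000))  # rich          (e.g. En-Fr)
```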
Baseline To better quantify the effectiveness of the proposed pre-training model, we build two baselines.
mRASP w/o RAS. To measure the effect of alignment information, we pre-train a model on the same PC32 but without incorporating alignment information.
Direct. We train randomly initialized models directly on the downstream bilingual parallel corpora as a comparison with the pre-trained models.
Fine-tuning We fine-tune the obtained mRASP model on the target language pairs. We apply a dropout rate of 0.3 for all pairs, except for rich-resource pairs such as En-Zh and En-Fr, where we use 0.1. We carefully tune the model, setting different learning rates and warm-up steps for different data scales. For inference, we use beam search with a beam size of 5 for all directions. In most cases, we measure case-sensitive tokenized BLEU. We also report de-tokenized BLEU with SacreBLEU (Post, 2018) for a fair comparison with previous works.

Main Results
We first conduct experiments on the (extremely) low-resource and medium-resource datasets, where multilingual translation usually obtains significant improvements. As illustrated in Table 1, we obtain significant gains on all datasets. For extremely low-resource settings such as En-Be (Belarusian), where the amount of data is insufficient to train an NMT model properly, utilizing the pre-trained model substantially boosts performance.
We also obtain consistent improvements on low- and medium-resource datasets. Not surprisingly, we observe that as the scale of the dataset increases, the gap between the randomly initialized baseline and the pre-trained model narrows. It is worth noting that, on the En→De benchmark, we obtain a gain of 1.0 BLEU points.
To verify that mRASP can further boost performance on rich-resource datasets, we also conduct experiments on En→Zh and En→Fr. We compare our results with strong baselines reported in previous work (Ott et al., 2018). As shown in Table 2, surprisingly, even when large parallel datasets are provided, fine-tuning still benefits from the pre-trained model. On En→Fr, we obtain a gain of 1.1 BLEU points.

Comparing to other Pre-training Approaches
We compare our mRASP to recently proposed multilingual pre-training models. Following previous work, we conduct experiments on En-Ro, the only pair with established results. To make a fair comparison, we report de-tokenized BLEU.
As illustrated in Table 4, our model reaches comparable performance on both En→Ro and Ro→En. We also combine back-translation (Sennrich et al., 2016a) with our pre-trained model, observing a performance boost of up to 2 BLEU points, which suggests that mRASP is complementary to BT. It should be noted that the competing methods introduce much more pre-training data.
mBART conducted experiments on extensive language pairs. To illustrate the superiority of mRASP, we also compare our results with mBART, using the same test sets. As illustrated in Table 5, mRASP outperforms mBART on most language pairs by a large margin. Note that while mBART underperforms the baseline on the En-De and En-Fr benchmarks, mRASP obtains 4.3 and 2.9 BLEU gains over the baseline, respectively.

Table 3: Fine-tuning MT performance on exotic language corpora. For a translation direction A→B: exotic pair: A and B occur in the pre-training corpus, but no sentence pairs of (A, B) occur; exotic full: no sentences in either A or B occur in the pre-training; exotic source: sentences from the target side B occur in the pre-training but not from the source side A; exotic target: sentences from the source side A occur in the pre-training but not from the target side B. Notice that pre-training with mRASP and fine-tuning on these exotic languages consistently obtains significant MT performance improvements in each category.


Generalization to Exotic Corpus
To illustrate the generalization of mRASP, we also conduct experiments on exotic corpora, which are not included in our pre-training phase. We divide the exotic corpora into four categories with respect to the source and target sides.
• Exotic Pair Both source and target languages are individually pre-trained, but they have not been seen together as a bilingual pair.
• Exotic Source Only the target language is pre-trained; the source language is not.
• Exotic Target Only the source language is pre-trained; the target language is not.
• Exotic Full Neither the source nor the target language is pre-trained.
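The four exotic settings above amount to a simple classification of what the pre-training corpus contained for a fine-tuning direction A→B; a sketch (function and flag names ours):

```python
def exotic_category(src_seen: bool, tgt_seen: bool, pair_seen: bool) -> str:
    """Classify a fine-tuning direction A->B by what the pre-training
    corpus contained: each language individually, and/or the pair."""
    if pair_seen:
        return "not exotic"
    if src_seen and tgt_seen:
        return "exotic pair"    # both languages seen, never as a pair
    if tgt_seen:
        return "exotic source"  # only the target language was seen
    if src_seen:
        return "exotic target"  # only the source language was seen
    return "exotic full"        # neither language was seen

# Nl-Pt: neither Dutch nor Portuguese appears in the pre-training data.
print(exotic_category(src_seen=False, tgt_seen=False, pair_seen=False))  # exotic full
```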
For each category, we select language pairs of different scales. The results are shown in Table 3. mRASP obtains significant gains in each category across different dataset scales, indicating that even when fine-tuned on exotic languages, the model initialized by pre-training still works reasonably well. Note that in the most challenging case, Exotic Full, where the model has no knowledge of either side and only 11K parallel pairs of Nl (Dutch)-Pt (Portuguese) are available, the pre-trained model still reaches reasonable performance while the baseline fails to train properly. This suggests that the pre-trained model does learn language-universal knowledge and transfers easily to exotic languages.

Analysis
In this section, we conduct a set of analytical experiments to better understand what contributes to the performance gains. Three aspects are studied. First, we study whether the main contribution comes from pre-training or fine-tuning by comparing performance with and without fine-tuning. The results suggest that the gains mainly come from pre-training, while fine-tuning boosts performance further. Second, we analyze the difference between pre-training with and without RAS. The findings show that incorporating alignment information helps bridge different languages and yields additional gains. Lastly, we study the effect of data volume in the fine-tuning phase.
The effect of fine-tuning. In the pre-training phase, the model jointly learns from different language pairs. To verify whether the gains come from pre-training or fine-tuning, we directly measure the performance without any fine-tuning, which is, in essence, a zero-shot translation task.
We select datasets covering different scales. Specifically, En-Af (41k) from extremely low resource, En-Ro (600k) from low resource, En-De (4.5M) from medium resource, and En-Fr (40M) from rich resource are selected.
As shown in Table 6, we find that the model without fine-tuning works surprisingly well on all datasets; in the low-resource setting, it even outperforms the randomly initialized baseline. This suggests that the model already learns well in the pre-training phase, and fine-tuning obtains further gains. We suspect that at the fine-tuning phase the model mainly tunes the embeddings of the specific languages while keeping the other model parameters mostly unchanged. Further analytical experiments could verify this hypothesis.
Note that we also report the pre-trained model without RAS (NA-mRASP). For comparison, we do not apply fine-tuning to NA-mRASP either. mRASP consistently obtains better performance than NA-mRASP, implying that injecting alignment information at the pre-training phase does improve performance.
The effectiveness of the RAS technique. In the pre-training phase, we explicitly incorporate RAS. To verify its effectiveness, we first compare the performance of mRASP and mRASP without RAS.
As illustrated in Table 7, we find that applying RAS in the pre-training phase consistently improves performance on datasets of different scales, with gains of up to 2.5+ BLEU points. To quantitatively verify whether the semantic spaces of different languages draw closer after adding alignment information, we calculate the average cosine similarity of words with the same meaning in different languages. We choose the 1,000 most frequent words according to the MUSE dictionary. Since words are split into subwords through BPE, we simply sum all subword embeddings constituting a word. As illustrated in Figure 3, the similarity increases after applying RAS for all pairs in the figure.

Table 6: MT performance of mRASP with and without the RAS technique and the fine-tuning strategy. mRASP includes both the RAS technique and the fine-tuning strategy. "w/o ft" denotes "without fine-tuning". We also report mRASP without RAS and without fine-tuning for comparison with mRASP without fine-tuning. Both RAS and fine-tuning prove effective and essential for mRASP.

To further illustrate the effect of RAS on the semantic space, we use PCA (Principal Component Analysis) to visualize the word embedding space. We plot En-Zh as the representative of dissimilar pairs and En-Af of similar pairs. More figures can be found in the Appendix.
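The similarity measurement described above (summing BPE subword embeddings into a word vector, then computing cosine similarity across languages) can be sketched with toy embeddings; the subword vectors and "@@" continuation markers are illustrative, not from a trained model:

```python
import math

def word_vector(subword_embs, subwords):
    """A word split by BPE is represented as the sum of its subword embeddings."""
    dim = len(next(iter(subword_embs.values())))
    vec = [0.0] * dim
    for sw in subwords:
        for k, v in enumerate(subword_embs[sw]):
            vec[k] += v
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 2-d subword embeddings (illustrative values only).
embs = {"lo@@": [1.0, 0.0], "ve": [0.0, 1.0], "aim@@": [0.9, 0.1], "er": [0.1, 0.8]}
en = word_vector(embs, ["lo@@", "ve"])    # English "love"
fr = word_vector(embs, ["aim@@", "er"])   # French "aimer"
print(round(cosine(en, fr), 3))  # 0.999
```
Averaging this cosine over the top-frequency dictionary pairs, before and after RAS, yields the measurement plotted in Figure 3.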
As illustrated in Figure 2, we find that for both the similar pair and the dissimilar pair, the overall word embedding distributions become closer after RAS.
For En-Zh, as the dashed lines illustrate, the angle between the two word embedding spaces becomes smaller after RAS. For En-Af, we observe that the overlap between the two spaces becomes larger. We also randomly plot the positions of three pairs of words, where each pair has the same meaning in different languages.

Fine-tuning Volume To study the effect of data volume in the fine-tuning phase, we randomly sample 1K, 5K, 10K, 50K, 100K, 500K, and 1M pairs from the full En-De corpus (4.5M) and fine-tune the model on each sampled dataset. Figure 4 illustrates the trend of BLEU as the data volume increases. With only 1K parallel pairs, the pre-trained model works surprisingly well, reaching 24.46 BLEU, whereas the randomly initialized model fails in this extremely low-resource setting. With only 1M pairs, mRASP reaches results comparable to the baseline trained on the full 4.5M pairs. As the dataset grows, the performance of the pre-trained model consistently increases, while the baseline shows no improvement until the data volume reaches 50K. These results confirm the remarkable boost mRASP provides on low-resource datasets.

Related Works
Multilingual NMT aims at taking advantage of multilingual data to improve NMT for all languages involved, and has been extensively studied in a number of papers, such as Johnson et al. (2017), including recent work that performs extensive experiments in training massively multilingual NMT models. These works show that multilingual many-to-many models are effective in low-resource settings. Inspired by them, we believe that the translation quality of low-resource language pairs may improve when they are trained together with rich-resource ones. However, we differ in at least two aspects: a) our goal is to find the best practice for a single language pair with multilingual pre-training, whereas multilingual NMT usually achieves inferior accuracy compared with its counterpart that trains an individual model for each language pair when there are dozens of language pairs; b) unlike multilingual NMT, mRASP obtains improvements even on rich-resource language pairs, such as English-French.
Unsupervised pre-training has significantly improved the state of the art in natural language understanding, from word embeddings (Mikolov et al., 2013b; Pennington et al., 2014) to pre-trained contextualized representations (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2019) and sequence-to-sequence pre-training (Song et al., 2019). It is widely accepted that one of the most important factors in the success of unsupervised pre-training is the scale of the data. The most successful efforts, such as RoBERTa, GPT, and BERT, highlight the importance of scaling the amount of data. Following their spirit, we show that with massively multilingual pre-training on more than 110 million sentence pairs, mRASP can significantly boost the performance of downstream NMT tasks.
In parallel, there is a bulk of work on unsupervised cross-lingual representation. Most traditional studies show that cross-lingual representations can be used to improve the quality of monolingual representations. Mikolov et al. (2013a) first introduced dictionaries to align word representations from different languages. A series of follow-up studies focus on aligning word representations across languages (Xing et al., 2015; Ammar et al., 2016; Smith et al., 2017; Lample et al., 2018b). Inspired by the success of BERT, Conneau and Lample (2019) introduced XLM, masked language models trained on multiple languages, as a way to leverage parallel data, and obtained impressive empirical results on the cross-lingual natural language inference (XNLI) benchmark and unsupervised NMT (Sennrich et al., 2016a; Lample et al., 2018a; Garcia et al., 2020). Huang et al. (2019) extended XLM with multi-task learning and proposed a universal language encoder. Different from these works, a) mRASP is a multilingual sequence-to-sequence model, which is more desirable for NMT pre-training; and b) mRASP introduces alignment regularization to bridge the sentence representations across languages.

Conclusion
In this paper, we propose a multilingual neural machine translation pre-training model (mRASP). To bridge the semantic space between different languages, we incorporate word alignment into the pre-training model. Extensive experiments are conducted in different scenarios, including low/medium/rich-resource and exotic corpora, demonstrating the efficacy of mRASP. We also conduct a set of analytical experiments, showing that the alignment information does bridge the gap between languages as well as boost performance. In future work, we will explore different alignment approaches and pre-train on larger corpora to further boost performance.