Microsoft Research Asia’s Systems for WMT19

We, Microsoft Research Asia, made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation, and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).


Introduction
We participated in the WMT19 shared news translation task in 11 translation directions. We achieved first place in 8 directions: German↔English, German↔French, Chinese→English, English→Lithuanian, English→Finnish, and Russian→English. The other three directions, Lithuanian→English, Finnish→English, and English→Kazakh, were placed second (ranked by teams).
Our basic systems are based on Transformer, back translation, and knowledge distillation. We experimented with several techniques we proposed recently. In brief, the innovations we introduced are:
Multi-agent dual learning (MADL). The core idea of dual learning is to leverage the duality between the primal task (mapping from domain $\mathcal{X}$ to domain $\mathcal{Y}$) and the dual task (mapping from $\mathcal{Y}$ to $\mathcal{X}$) to boost the performance of both tasks. MADL (Wang et al., 2019) extends the dual learning framework (He et al., 2016; Xia et al., 2017a) by introducing multiple primal and dual models. It was integrated into our submitted systems for German↔English and German↔French translations.
*Corresponding author. This work was conducted at Microsoft Research Asia.
Masked sequence-to-sequence pre-training (MASS). Pre-training and fine-tuning have achieved great success in language understanding. MASS (Song et al., 2019), a pre-training method designed for language generation, adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and its decoder tries to predict this masked fragment. It was integrated into our submitted systems for Chinese→English and English→Lithuanian translations.
Neural architecture optimization (NAO). As is well known, the evolution of neural network architectures plays a key role in advancing neural machine translation. Neural architecture optimization (NAO), our newly proposed method (Luo et al., 2018), uses a gradient-based method to conduct optimization and to guide the creation of better neural architectures in a continuous and more compact space, given the historically observed architectures and their performance. It was applied to English↔Finnish translations in our submitted systems.

Soft contextual data augmentation (SCA). While data augmentation is an important trick to boost the accuracy of deep learning methods in computer vision tasks, its study in natural language tasks is relatively limited. SCA (Zhu et al., 2019) softly augments a randomly chosen word in a sentence by a contextual mixture of multiple related words, i.e., it replaces the one-hot representation of a word by a distribution over the vocabulary provided by a language model. It was applied to Russian→English translation in our submitted systems.

Multi-agent dual learning (MADL)
MADL is an enhanced version of dual learning (He et al., 2016; Wang et al., 2018). It leverages $N$ primal translation models $f_i$ and $N$ dual translation models $g_j$ for training, and eventually outputs one $f_0$ and one $g_0$ for inference, where [...]. All these models are pre-trained on bilingual data. The $i$-th primal model $f_i$ has a non-negative weight $\alpha_i$ and the $j$-th dual model $g_j$ has a non-negative weight $\beta_j$. All the $\alpha$'s and $\beta$'s are hyperparameters. Let $F_\alpha$ denote a combined translation model from $\mathcal{X}$ to $\mathcal{Y}$, and $G_\beta$ a combined translation model from $\mathcal{Y}$ to $\mathcal{X}$:
$$F_\alpha = \sum_{i=0}^{N-1} \alpha_i f_i, \qquad G_\beta = \sum_{j=0}^{N-1} \beta_j g_j. \tag{1}$$
$F_\alpha$ and $G_\beta$ work as follows: for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, [...]. Let $\mathcal{B}$ denote the bilingual dataset, and let $\mathcal{M}_x$ and $\mathcal{M}_y$ denote the monolingual data of $\mathcal{X}$ and $\mathcal{Y}$. The training objective function of MADL can be written as follows: [...] (2)
Note that $f_{>0}$ and $g_{>0}$ are not optimized during training, and we eventually output $f_0$ and $g_0$ for translation. More details can be found in Wang et al. (2019).
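To make the idea of a weighted combination of models concrete, the following is a minimal sketch that mixes the output distributions of several models with non-negative weights. The function name `combine_distributions`, the toy three-word vocabulary, and the choice to mix probabilities directly (with weights normalized to sum to one) are assumptions made for illustration; the exact combination rule used by MADL is the one given in Wang et al. (2019).

```python
from typing import List, Sequence


def combine_distributions(dists: Sequence[Sequence[float]],
                          weights: Sequence[float]) -> List[float]:
    """Mix per-model probability distributions with non-negative weights.

    dists[i] is the distribution of the i-th model over a shared vocabulary
    and weights[i] plays the role of alpha_i. Weights are normalized here
    (an assumption) so that the result is again a distribution.
    """
    if len(dists) != len(weights):
        raise ValueError("one weight per model is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    vocab_size = len(dists[0])
    mixed = [0.0] * vocab_size
    for dist, w in zip(dists, weights):
        for k in range(vocab_size):
            mixed[k] += (w / total) * dist[k]
    return mixed


# Three toy agents (standing in for f_0 and two fixed pre-trained models)
# combined with equal weights.
f0 = [0.7, 0.2, 0.1]
f1 = [0.5, 0.3, 0.2]
f2 = [0.6, 0.1, 0.3]
mixed = combine_distributions([f0, f1, f2], [1.0, 1.0, 1.0])
```

With equal weights the mixture is simply the average of the three distributions; in the training described in this paper only the agent with index 0 would receive gradient updates.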

Masked sequence-to-sequence pre-training (MASS)
MASS is a pre-training method for language generation. For machine translation, it can leverage monolingual data in two languages to pre-train a translation model. Given a sentence $x \in \mathcal{X}$, we denote by $x^{\setminus u:v}$ a modified version of $x$ in which the fragment from position $u$ to $v$ is masked, where $0 < u < v < m$ and $m$ is the number of tokens in sentence $x$. We denote by $k = v - u + 1$ the number of tokens being masked from position $u$ to $v$.
We replace each masked token by a special symbol [M], so the length of the masked sentence is unchanged. $x^{u:v}$ denotes the sentence fragment of $x$ from $u$ to $v$. MASS pre-trains a sequence-to-sequence model by predicting the sentence fragment $x^{u:v}$, taking the masked sequence $x^{\setminus u:v}$ as input. We use the log likelihood as the objective function:
$$L(\theta; (\mathcal{X}, \mathcal{Y})) = \sum_{x \in \mathcal{X}} \log P(x^{u:v} \mid x^{\setminus u:v}; \theta) + \sum_{y \in \mathcal{Y}} \log P(y^{u:v} \mid y^{\setminus u:v}; \theta), \tag{3}$$
where $\mathcal{X}$, $\mathcal{Y}$ denote the source and target domain.
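In addition to the zero/low-resource setting (Leng et al., 2019), we also extend MASS to the supervised setting, where bilingual sentence pairs $(x, y) \in (\mathcal{X}, \mathcal{Y})$ can be leveraged for pre-training. The log likelihood in the supervised setting is as follows:
$$\begin{aligned} L(\theta; (\mathcal{X}, \mathcal{Y})) = \sum_{(x, y) \in (\mathcal{X}, \mathcal{Y})} \big( & \log P(x^{u:v} \mid [x^{\setminus u:v}; y^{\setminus u:v}]; \theta) + \log P(y^{u:v} \mid [x^{\setminus u:v}; y^{\setminus u:v}]; \theta) + \log P(y \mid x^{\setminus u:v}; \theta) \\ & + \log P(x \mid y^{\setminus u:v}; \theta) + \log P(y^{u:v} \mid x^{\setminus u:v}; \theta) + \log P(x^{u:v} \mid y^{\setminus u:v}; \theta) \big), \end{aligned} \tag{4}$$
where $[\cdot; \cdot]$ represents the concatenation operation. $P(y \mid x^{\setminus u:v}; \theta)$ and $P(x \mid y^{\setminus u:v}; \theta)$ denote the probability of translating a masked sequence to another language, which encourages the encoder to extract meaningful representations of unmasked input tokens in order to predict the masked output sequence. $P(x^{u:v} \mid [x^{\setminus u:v}; y^{\setminus u:v}]; \theta)$ and $P(y^{u:v} \mid [x^{\setminus u:v}; y^{\setminus u:v}]; \theta)$ denote the probability of generating the masked source/target segment given both the masked source and target sequences, which encourages the model to extract cross-lingual information. $P(y^{u:v} \mid x^{\setminus u:v}; \theta)$ and $P(x^{u:v} \mid y^{\setminus u:v}; \theta)$ denote the probability of generating the masked fragment given only the masked sequence in the other language. More details about MASS can be found in Song et al. (2019).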
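The masking operation itself is simple to sketch. The snippet below builds the masked sentence (the encoder input) and the masked fragment (the decoder target) from a token list. The helper name `mass_mask` and the convention that positions are 1-based and inclusive are assumptions for illustration; the actual MASS implementation is described in Song et al. (2019).

```python
from typing import List, Tuple

MASK = "[M]"  # special symbol that replaces each masked token


def mass_mask(tokens: List[str], u: int, v: int) -> Tuple[List[str], List[str]]:
    """Return (masked sentence, masked fragment) for positions u..v.

    Positions are assumed to be 1-based and inclusive, with 0 < u < v < m
    where m is the number of tokens, following the text.
    """
    m = len(tokens)
    if not (0 < u < v < m):
        raise ValueError("positions must satisfy 0 < u < v < m")
    fragment = tokens[u - 1:v]                      # tokens to be predicted
    k = v - u + 1                                   # number of masked tokens
    masked = tokens[:u - 1] + [MASK] * k + tokens[v:]
    return masked, fragment


tokens = "we love machine translation very much".split()
masked, fragment = mass_mask(tokens, u=3, v=4)
```

Because each masked token is replaced by exactly one `[M]`, the masked sentence keeps the length of the original sentence.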

Neural architecture optimization (NAO)
NAO (Luo et al., 2018) is a gradient-based neural architecture search (NAS) method. It contains three key components: an encoder, an accuracy predictor, and a decoder, and it optimizes a network architecture as follows. (1) The encoder maps a network architecture $x$ to an embedding vector $e_x$ in a continuous space $\mathcal{E}$. (2) The predictor, a function $f$, takes $e_x \in \mathcal{E}$ as input and predicts the dev set accuracy of the architecture $x$. We perform a gradient ascent step, i.e., we move $e_x$ along the direction specified by the gradient $\partial f / \partial e_x$, and get a new embedding vector $e_{x'}$:
$$e_{x'} = e_x + \eta \frac{\partial f}{\partial e_x},$$
where $\eta$ is the step size.
(3) The decoder is used to map $e_{x'}$ back to the corresponding architecture $x'$.
The new architecture $x'$ is assumed to perform better than the original architecture $x$ owing to the property of gradient ascent. NAO repeats the above three steps and sequentially generates better and better architectures.
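The gradient ascent step in the embedding space can be illustrated with a toy predictor. In the sketch below the predictor is a simple concave function with a known gradient; the names `predict`, `grad_predict`, `ascent_step`, and the constant `TARGET` are hypothetical stand-ins, since the real NAO predictor is a learned network (Luo et al., 2018).

```python
from typing import List

TARGET = [1.0, -2.0]  # hypothetical optimum of the toy predictor


def predict(e: List[float]) -> float:
    """Toy stand-in for the accuracy predictor f: higher is better."""
    return -sum((a - b) ** 2 for a, b in zip(e, TARGET))


def grad_predict(e: List[float]) -> List[float]:
    """Analytic gradient of the toy predictor with respect to e."""
    return [-2.0 * (a - b) for a, b in zip(e, TARGET)]


def ascent_step(e: List[float], eta: float) -> List[float]:
    """One update e' = e + eta * df/de in the embedding space."""
    return [a + eta * g for a, g in zip(e, grad_predict(e))]


e_x = [0.0, 0.0]                    # embedding of the current architecture
e_new = ascent_step(e_x, eta=0.1)   # embedding passed on to the decoder
```

For a sufficiently small step size the predicted accuracy of the new embedding is higher than that of the old one, which is the property the method relies on.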
To learn a high-quality encoder, decoder, and performance prediction function, it is essential to have a large quantity of paired training data in the form of $(x, y)$, where $y$ is the dev set accuracy of the architecture $x$. To reduce computational cost, we share weights among different architectures (Pham et al., 2018) to aid the generation of such paired training data.
We use NAO to search for powerful neural sequence-to-sequence architectures. The search space is illustrated in Fig. 1. Specifically, each network is composed of $N$ encoder layers and $N$ decoder layers. We set $N = 6$ in our experiments. Each encoder layer contains 2 nodes and each decoder layer contains 3 nodes. Each node has two branches, each of which takes the output of another node as input and applies a particular operator (OP), for example identity, self-attention, or convolution, to generate its output. The outputs of the two branches are added together as the output of the node. For each layer, we search: 1) the operator at each branch of every node (for a comprehensive list of OPs, please refer to the Appendix of this paper); 2) the topology of the connections between nodes within each layer. In the middle part of Fig. 1, we plot the possible connections among the nodes of a layer as specified by all candidate architectures, with the connections of Transformer (Vaswani et al., 2017) highlighted.
To construct the final network, we do not adopt the commonly used approach of stacking the same layer multiple times. Instead, we assume that the layers in the encoder/decoder can have different architectures and directly search for such a personalized architecture for each layer. We found that this design significantly improves performance owing to the added flexibility.

Soft contextual data augmentation (SCA)
SCA is a data augmentation technique for NMT (Zhu et al., 2019), which replaces a randomly chosen word in a sentence with its soft version. For any word $w \in V$, its soft version is a distribution over the vocabulary of $|V|$ words: $P(w) = (p_1(w), p_2(w), \ldots, p_{|V|}(w))$, where $p_j(w) \ge 0$ and $\sum_{j=1}^{|V|} p_j(w) = 1$. Given the distribution $P(w)$, one may simply sample a word from this distribution to replace the original word $w$. Different from this method, we directly use this distribution vector to replace the randomly chosen word $w$ in the original sentence. Suppose $E$ is the embedding matrix of all the $|V|$ words. The embedding of the soft version of $w$ is
$$e_w = P(w) E = \sum_{j=1}^{|V|} p_j(w) E_j,$$
which is the expectation of word embeddings over the distribution.
In our systems, we leverage a pre-trained language model to compute $P(w)$, conditioning on all the words preceding $w$. That is, for the $t$-th word $x_t$ in a sentence, we have
$$p_j(x_t) = LM(v_j \mid x_{<t}),$$
where $LM(v_j \mid x_{<t})$ denotes the probability of the $j$-th word $v_j$ in the vocabulary appearing after the sequence $x_1, x_2, \cdots, x_{t-1}$. The language model is pre-trained using the monolingual data.
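The expectation over word embeddings can be sketched in a few lines. The function name `soft_embedding` and the tiny three-word, two-dimensional embedding matrix are made up for illustration; in the actual system the distribution comes from a pre-trained language model rather than being written by hand.

```python
from typing import List


def soft_embedding(probs: List[float], embedding: List[List[float]]) -> List[float]:
    """Expected embedding sum_j p_j(w) * E_j over the vocabulary."""
    if len(probs) != len(embedding):
        raise ValueError("need one probability per vocabulary entry")
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must be a probability distribution")
    dim = len(embedding[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding))
            for d in range(dim)]


# Toy embedding matrix E with |V| = 3 words and dimension 2.
E = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
p = [0.5, 0.25, 0.25]               # hypothetical language-model distribution
soft = soft_embedding(p, E)         # replaces the one-hot lookup of the word
hard = soft_embedding([0.0, 1.0, 0.0], E)
```

A one-hot distribution recovers the ordinary embedding lookup, so the soft version is a strict generalization of the standard input representation.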
3 Submitted Systems

English↔German
We submit constrained systems for both English→German and German→English translation, using the same techniques.
Dataset. We concatenate "Europarl v9", "News Commentary v14", "Common Crawl corpus", and "Document-split Rapid corpus" as the basic bilingual dataset (denoted as $\mathcal{B}_0$). Since the "Paracrawl" data is noisy, we select 20M bilingual sentence pairs from this corpus using the script filter_interactive.py¹. The two parts of bilingual data are concatenated together (denoted as $\mathcal{B}_1$). We clean $\mathcal{B}_1$ by normalizing the sentences, removing non-printable characters, and tokenizing. We share a vocabulary between the two languages and apply BPE for word segmentation with 35000 merge operations. (We tried different numbers of BPE merge operations but found no significant differences.) For monolingual data, we use 120M English sentences (denoted as $\mathcal{M}_{en}$) and 120M German sentences (denoted as $\mathcal{M}_{de}$) from Newscrawl, and preprocess them in the same way as the bilingual data.
We use newstest 2016 as the validation set and newstest 2018 as the test set.

Model Configuration
We use the PyTorch implementation of Transformer². We choose the Transformer big setting, in which both the encoder and the decoder have six layers. The dropout rate is fixed at 0.2. We set the batch size to 4096 and the parameter --update-freq to 16. We apply the Adam optimizer (Kingma and Ba, 2015) with a learning rate of $5 \times 10^{-4}$.
Training Pipeline. The pipeline consists of three steps:
1. Pre-train two English→German translation models (denoted as $\bar{f}_1$ and $\bar{f}_2$) and two German→English translation models (denoted as $\bar{g}_1$ and $\bar{g}_2$) on $\mathcal{B}_1$; pre-train another English→German model (denoted as $\bar{f}_3$) and another German→English model (denoted as $\bar{g}_3$) on $\mathcal{B}_0$.
2. Apply back translation following Sennrich et al. (2016a) and Edunov et al. (2018). We back-translate $\mathcal{M}_{en}$ and $\mathcal{M}_{de}$ using $\bar{f}_3$ and $\bar{g}_3$ with beam search, add noise to the translated sentences (Edunov et al., 2018), merge the synthetic data with $\mathcal{B}_1$, and train one English→German model $f_0$ and one German→English model $g_0$ for seven days on eight V100 GPUs.
3. Apply MADL to $f_0$ and $g_0$. That is, the $F_\alpha$ in Eqn. (2) is specified as the combination of $f_0$, $\bar{f}_1$, $\bar{f}_2$ with equal weights, and $G_\beta$ consists of $g_0$, $\bar{g}_1$, $\bar{g}_2$. During training, we only update $f_0$ and $g_0$. To speed up training, we randomly select 20M monolingual English and German sentences from $\mathcal{M}_{en}$ and $\mathcal{M}_{de}$ respectively, instead of using all monolingual sentences. The eventual output models are denoted as $f_1$ and $g_1$ respectively. This step takes 3 days on four P40 GPUs.
For the final submission, we accumulate many translation models (trained using bitext, back translation, and MADL, with different random seeds) and perform knowledge distillation on the source sentences from the WMT14 to WMT19 test sets. Take English→German translation as an example. Denote the English inputs as $\mathcal{T} = \{s_i\}_{i=1}^{N_T}$, where $N_T$ is the size of the test set. For each $s$ in $\mathcal{T}$, we translate $s$ to German using $M$ English→German models and eventually obtain
$$\mathcal{E} = \{(s, f^{(j)}(s)) \mid s \in \mathcal{T},\ j \in \{1, 2, \ldots, M\}\},$$
where $f^{(j)}$ is the $j$-th translation model we accumulated and $\mathcal{T}$ is the combination of inputs from WMT14 to WMT19. After obtaining $\mathcal{E}$, we randomly select $N_T M$ bitext pairs (denoted as $\mathcal{B}_2$) from $\mathcal{B}_1$ and fine-tune model $f_1$ on $\mathcal{B}_2 \cup \mathcal{E}$. We stop tuning when the BLEU score on WMT16 (i.e., the validation set) drops.
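The construction of the distillation set from test-set sources can be sketched as follows. The function name `build_distillation_set` and the two toy "models" are hypothetical placeholders used only so the snippet runs; in the real pipeline each model would be a trained translation system decoding with beam search.

```python
from typing import Callable, List, Sequence, Tuple

Translator = Callable[[str], str]


def build_distillation_set(sources: Sequence[str],
                           models: Sequence[Translator]) -> List[Tuple[str, str]]:
    """Pair every source sentence with the output of every model."""
    return [(s, translate(s)) for s in sources for translate in models]


# Toy stand-ins for M = 2 translation models (not real translators).
def toy_model_a(sentence: str) -> str:
    return sentence.upper()


def toy_model_b(sentence: str) -> str:
    return " ".join(reversed(sentence.split()))


test_sources = ["good morning", "thank you"]          # N_T = 2
pairs = build_distillation_set(test_sources, [toy_model_a, toy_model_b])
```

The resulting set has $N_T M$ pairs, which is why the same number of genuine bitext pairs is sampled to balance the fine-tuning data.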
We eventually obtain 44.9 BLEU score for English→German and 42.8 for German→English on WMT19 test sets and are ranked in the first place in these two translation tasks.

German↔French
For German↔French translation, we follow a process similar to the one used for the English↔German tasks introduced in Section 3.1. We merge "commoncrawl", "europarl-v7", and the part of "de-fr.bicleaner07" selected by filter_interactive.py as the bilingual data. We collect 20M monolingual sentences for French and 20M for German from newscrawl. The data pre-processing rules and training procedure are the same as those used in Section 3.1. We split 9k sentences from "dev08_14" as the validation set and use the remaining ones as the test set.
The results of German↔French translation on the test set are summarized in Table 2. Again, our method achieves significant improvement over the baselines.
Specifically, MADL boosts the baseline of German→French and French→German by 2 and 1.5 points respectively.
Our submitted German→French system is a single model trained with MADL, achieving 37.3 BLEU on WMT19. The French→German system is an ensemble of three independently trained models, achieving 35.0 BLEU. Our systems are ranked first for both German→French and French→German on the leaderboard.

Chinese→English
Dataset. For Chinese→English translation, we use all the bilingual and monolingual data provided by the WMT official website, as well as extra bilingual and monolingual data crawled from the web.
We filter the 24M bilingual pairs from WMT using the script filter_interactive.py as described in Section 3.1 and get 18M sentence pairs. We use the Chinese monolingual data from the XMU monolingual corpus⁴ and English monolingual data from News Crawl, as well as the English sentences from all English-XX language pairs in WMT. We use 100M additional parallel sentences drawn from UN data, Open Subtitles, and web-crawled data, which are filtered using the same filter rule described above, as well as fast_align and an in/out-of-domain filter. Finally we get 38M bilingual pairs. We also crawled 80M additional Chinese monolingual sentences from Sougou, China News, Xinhua News, Sina News, and Ifeng News, and 2M English monolingual sentences from China News and Reuters. We use newstest2017 and newstest2018 on Chinese-English as development datasets.
We normalize the Chinese sentences from SBC case to DBC case, remove non-printable characters, and tokenize with both Jieba⁵ and PKUSeg⁶ to increase diversity. For English sentences, we remove non-printable characters and tokenize with the Moses tokenizer⁷. We follow previous practice (Hassan et al., 2018) and apply Byte-Pair Encoding (BPE) (Sennrich et al., 2016b) separately for Chinese and English, each with a 40K vocabulary.
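As an illustration of the character normalization step, the sketch below converts full-width characters to their half-width forms and drops non-printable characters. It assumes that "SBC case to DBC case" refers to the usual full-width to half-width conversion; the function names are hypothetical and the actual preprocessing scripts may differ.

```python
def fullwidth_to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                  # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:      # full-width forms of '!' .. '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


def remove_non_printable(text: str) -> str:
    """Drop characters that str.isprintable() reports as non-printable."""
    return "".join(ch for ch in text if ch.isprintable())


raw = "\uff37\uff2d\uff34\uff11\uff19\uff0c\u3000ok\uff01\u0007"
clean = remove_non_printable(fullwidth_to_halfwidth(raw))
```

Chinese characters fall outside the converted range and are left untouched, so only punctuation, digits, and Latin letters are affected.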

MASS Pre-training
We pre-train MASS (Transformer big) with both monolingual and bilingual data. We use 100M Chinese and 300M English monolingual sentences for the unsupervised setting (Equation 3), and a total of 18M and 56M bilingual sentence pairs for the supervised setting (Equation 4). We share the encoder and decoder for all the losses in Equations 3 and 4. We then fine-tune the MASS pre-trained model on both the 18M and 56M bilingual sentence pairs to get the baseline translation models for both Chinese→English and English→Chinese.

Back Translation and Knowledge Distillation
We randomly choose 40M monolingual sentences each for Chinese and English for back translation (Sennrich et al., 2016a; He et al., 2016) and knowledge distillation (Kim and Rush, 2016; Tan et al., 2019). We iterate back translation and knowledge distillation multiple times to gradually boost the performance of the model.

Results
The results on newstest2017 and newstest2018 are shown in Table 3. We list two baseline Transformer big systems, which use 18M bilingual data (constrained) and 56M bilingual data (unconstrained) respectively. The pre-trained model achieves about 1 BLEU point improvement after fine-tuning on both the 18M and 56M bilingual data. After iterative back translation (BT) and knowledge distillation (KD), as well as re-ranking, our system achieves 30.8 and 30.9 BLEU points on newstest2017 and newstest2018 respectively.

WMT19 Submission
For the WMT19 submission, we conduct fine-tuning and speculation to further boost the accuracy by using the source sentences in the WMT19 test set.We first filter the bilingual as well as pseudo-generated data according to the relevance to the source sentences.We use the filter method in Deng et al. (2018) and continue to train the model on the filtered data.Second, we conduct speculation on the test source sentences following the practice in Deng et al. (2018).The final BLEU score of our submission is 39.3, ranked in the first place in the leaderboard.

English↔Lithuanian
For English↔Lithuanian translation, we follow a process similar to that for the Chinese→English task introduced in Section 3.3. We use all the WMT bilingual data, which amounts to 2.24M sentence pairs after filtering. We use the same English monolingual data as in Chinese-English. We select 100M Lithuanian monolingual sentences from the official commoncrawl data and use all the wiki and news Lithuanian monolingual data provided by WMT. In addition, we crawl 5M Lithuanian news sentences from the LRT website⁸. We share the BPE vocabulary between English and Lithuanian, and the vocabulary size is 65K.
All the bilingual and monolingual data are used for MASS pre-training, and all the bilingual data are used for fine-tuning. For iterative back translation and knowledge distillation, we split 24M English monolingual data as well as 12M Lithuanian monolingual data into 5 parts through sampling with replacement, so as to train different models independently and increase diversity in re-ranking/ensembling. Each model uses 8M English monolingual data and 6M Lithuanian monolingual data. For our WMT19 submission, unlike Chinese→English, the speculation technique is not used.
The BLEU scores on newsdev19 are shown in Table 4.
Our final submissions for WMT19 achieve 20.1 BLEU points for English→Lithuanian translation (ranked first) and 35.6 for Lithuanian→English translation (ranked second).

English↔Finnish
Preprocess. We use the official English-Finnish data from WMT19, including both bilingual data and monolingual data. The bilingual data contains 8.8M aligned sentence pairs. We share the vocabulary between English and Finnish with 46k BPE units. We use the WMT17 and WMT18 English-Finnish test sets as two development datasets, and tune hyper-parameters based on their concatenation.
Architecture search. We use NAO to search sequence-to-sequence architectures for the English-Finnish translation tasks, as introduced in subsection 2.3. We use PyTorch for our implementation. Due to time limitations, we do not aim to find neural architectures better than Transformer; instead we aim for models with performance comparable to Transformer that provide diversity in the re-ranking process. The whole search process takes 2.5 days on 16 P40 GPU cards, and the discovered neural architecture, named NAONet, is visualized in the Appendix.
Train single models. The final system for English-Finnish is obtained through re-ranking of three strong model checkpoints: the Transformer model decoding from left to right (L2R Transformer), the Transformer model decoding from right to left (R2L Transformer), and NAONet decoding from left to right. All the models have 6-6 layers in the encoder/decoder and are obtained using the same process, which is detailed below.
Step 1: Base models. Train two models $P_1(x|y)$ and $P_1(y|x)$ on all the bilingual data (8.8M), for English→Finnish and Finnish→English translation respectively.
Step 2: Back translation. Perform standard back translation (Sennrich et al., 2016a; He et al., 2016) using $P_1(x|y)$ and $P_1(y|x)$. Specifically, we choose 10M monolingual English sentences, use $P_1(y|x)$ to generate 10M pseudo bitext pairs with beam search (beam size 5), and mix them with the bilingual data to continue the training of $P_1(x|y)$. The mixing ratio is set to 1 : 1 through up-sampling. The model obtained through this process is denoted as $P_2(x|y)$. The same process is applied to the opposite direction, yielding the new model $P_2(y|x)$.
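The 1 : 1 mixing through up-sampling can be sketched as below. The helper names `upsample_to` and `mix_one_to_one`, and the choice to repeat the smaller corpus cyclically and truncate it to the size of the larger one, are assumptions for illustration; the text does not specify how the up-sampling was implemented.

```python
import math
from typing import List, Tuple

Pair = Tuple[str, str]


def upsample_to(corpus: List[Pair], size: int) -> List[Pair]:
    """Repeat a corpus cyclically until it has exactly `size` pairs."""
    if not corpus:
        raise ValueError("cannot up-sample an empty corpus")
    repeats = math.ceil(size / len(corpus))
    return (corpus * repeats)[:size]


def mix_one_to_one(bitext: List[Pair], pseudo: List[Pair]) -> List[Pair]:
    """Combine genuine and pseudo bitext so that each contributes equally."""
    size = max(len(bitext), len(pseudo))
    return upsample_to(bitext, size) + upsample_to(pseudo, size)


bitext = [("a", "A"), ("b", "B")]                       # toy genuine pairs
pseudo = [(str(i), str(i) * 2) for i in range(5)]       # toy pseudo pairs
mixed = mix_one_to_one(bitext, pseudo)
```

In practice the combined list would also be shuffled before training so that genuine and pseudo pairs are interleaved.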
Step 3: Back translation + knowledge distillation. In this step, we generate more pseudo bitext by sequence-level knowledge distillation (Kim and Rush, 2016) in addition to back translation. Concretely, similar to Step 2, we first choose 15M monolingual English and Finnish sentences and generate their translations using $P_2(y|x)$ and $P_2(x|y)$, respectively. The resulting pseudo bitext is denoted as $D_{x \to y}$ and $D_{y \to x}$ respectively. Then we concatenate all the bilingual data, $D_{x \to y}$, and $D_{y \to x}$, and use the whole corpus to train a new English-Finnish model from scratch. The resulting model is denoted as $P_3(y|x)$.
Step 4: Finetune. In this step, we try a very simple data selection method to handle the domain mismatch problem in WMT. We remove all the bilingual corpus from Paracrawl, which is generally assumed to be quite noisy (Junczys-Dowmunt, 2018) [...]
To investigate the effects of the four steps, we record the resulting BLEU scores on the WMT17 and WMT18 test sets in Table 5, taking the L2R Transformer model as an example. Furthermore, we report the final BLEU scores of the three models after the four steps in Table 6. All the results are obtained with beam size 5 and length penalty 1.0.
Similar results for Finnish-English translation are shown in Table 7.
Re-ranking. We use n-best re-ranking to deliver the final translation results using the three model checkpoints introduced in the last subsection. The beam size is set to 12. The weights of the three models, as well as the length penalty in generation, are tuned on the WMT18 test set. The results are shown in the second row of Table 8. We would also like to investigate the influence of NAONet on the re-ranking results. To this end, in re-ranking we replace NAONet with another L2R Transformer model, trained with the same process in subsection 3.5 and differing only in random seed, while keeping the other two models unchanged. The results are shown in the last row of Table 8. From the comparison of the two rows in Table 8, we can see that the new architecture NAONet discovered via NAO brings more diversity into the re-ranking, thus leading to better results. We also report similar results for the Finnish-English task in Table 9.
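The n-best re-ranking described above can be sketched as a weighted sum of per-model scores with a length penalty. The `Hypothesis` class, the `rerank` function, the example scores, and the particular normalization (dividing the combined log-probability by the length raised to the penalty) are assumptions for illustration; the text only states that model weights and the length penalty are tuned on a development set.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hypothesis:
    tokens: List[str]
    model_scores: List[float]   # one log-probability per scoring model


def rerank(nbest: Sequence[Hypothesis],
           weights: Sequence[float],
           length_penalty: float) -> Hypothesis:
    """Pick the hypothesis with the best weighted, length-normalized score."""
    def score(h: Hypothesis) -> float:
        if len(h.model_scores) != len(weights):
            raise ValueError("need one weight per model score")
        combined = sum(w * s for w, s in zip(weights, h.model_scores))
        return combined / (max(len(h.tokens), 1) ** length_penalty)
    return max(nbest, key=score)


# Toy 2-best list scored by three models (e.g. L2R, R2L, and a third model).
nbest = [
    Hypothesis("this is a test".split(), [-2.0, -2.4, -2.2]),
    Hypothesis("this is only a test".split(), [-2.1, -2.0, -2.3]),
]
best = rerank(nbest, weights=[0.4, 0.3, 0.3], length_penalty=1.0)
```

Replacing one of the three scorers with a model that differs only by random seed changes the `model_scores` entries but not the procedure, which is how the diversity comparison in the text can be set up.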
Our systems achieve 27.4 BLEU for English→Finnish and 31.9 BLEU for Finnish→English, ranked first and second (by teams), respectively.

Russian→English
Our system. Our final system for Russian→English translation is a combination of the Transformer network (Vaswani et al., 2017), back translation (Sennrich et al., 2016a), knowledge distillation (Kim and Rush, 2016), soft contextual data augmentation (Zhu et al., 2019), and model ensembling. We use Transformer big as the network architecture. We first train two models, English→Russian and Russian→English respectively, on bilingual pairs as baseline models. Based on these two models, we perform back translation and knowledge distillation on monolingual data, generating 40M synthetic sentence pairs. Combining both bilingual and synthetic data, we get a large training corpus with 56M pairs in total. We up-sample the bilingual pairs and shuffle the combined corpus to ensure a balance between bilingual and synthetic data. Finally, we train the Russian→English model from scratch. During training, we also use soft contextual data augmentation to further enhance the model. Following the above procedure, 5 different models are trained and ensembled for the final submission.
Results. Our final submission achieves 40.1 BLEU, ranked first on the leaderboard. Table 10 reports the results of our system on the development set.

English→Kazakh
Dataset. We notice that most of the parallel data are out of domain. Therefore, we crawl some external data:
(1) We crawl all news articles from inform.kz, a Kazakh-English news website. Then we match each English news article to a Kazakh one by matching their images with image hashing. In this way, we find 10K pairs of bilingual news articles. We use their titles as additional parallel data. These data are in-domain and useful in training.
(2) We crawl 140K parallel sentence pairs from glosbe.com. Although most of these sentences are out-of-domain, they significantly extend the size of our parallel dataset and lead to better results.
Because most of our parallel training data are noisy, we filter these data with the following rules:
(1) For the KazakhTV dataset, we remove any sentence pair with an alignment score less than 0.05.
(2) For the Wiki Titles dataset, we remove any sentence pair that starts with User or NGC.
(3) For all datasets, we remove any sentence pair in which the English sentence contains no lowercase letters.
(4) For all datasets, we remove any sentence pair whose length ratio is greater than 2.5:1.
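Rules (3) and (4) above are simple enough to sketch directly. The function name `keep_pair`, the use of whitespace tokens to measure length, and the symmetric treatment of the length ratio are assumptions for illustration; the text does not say how length was measured.

```python
import re

_LOWERCASE = re.compile(r"[a-z]")


def keep_pair(english: str, kazakh: str, max_ratio: float = 2.5) -> bool:
    """Return True if a sentence pair passes rules (3) and (4)."""
    # Rule (3): the English side must contain at least one lowercase letter.
    if not _LOWERCASE.search(english):
        return False
    # Rule (4): reject pairs whose length ratio exceeds max_ratio.
    len_en = len(english.split())
    len_kk = len(kazakh.split())
    if len_en == 0 or len_kk == 0:
        return False
    ratio = max(len_en, len_kk) / min(len_en, len_kk)
    return ratio <= max_ratio


examples = [
    ("Hello world", "Сәлем әлем"),
    ("NGC 1234", "NGC 1234"),
    ("a b c d e f", "бір"),
]
kept = [pair for pair in examples if keep_pair(*pair)]
```

Rules (1) and (2) depend on dataset-specific alignment scores and title prefixes and are therefore left out of the sketch.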
We tokenize all our data using the tokenizer from the Moses decoder. We learn a shared BPE (Sennrich et al., 2016b) from all our data (including all WMT19 parallel data, WMT19 monolingual data⁹, glosbe, inform.kz news titles, and inform.kz news contents) and get a shared vocabulary of 49,152 tokens. Finally, our dataset consists of 300K bilingual sentence pairs, 700K Kazakh monolingual sentences, and many English monolingual sentences.
Our system. Our model is based on the Transformer (Vaswani et al., 2017). We vary the hyper-parameters to increase the diversity of our models. Our models usually have 6 encoder layers, 6 or 7 decoder layers, ReLU or GELU (Hendrycks and Gimpel, 2016) activation functions, and an embedding dimension of 640.
We train 4 English-Kazakh models and 4 Kazakh-English models with different random seeds and hyper-parameters. Then we apply back translation (Edunov et al., 2018) and knowledge distillation (Kim and Rush, 2016) for 6 rounds. In each round, we
1. Sample 4M sentences from the English monolingual data and back-translate them to Kazakh with the best EN-KK model (on the dev set) from the previous round.
2. Back-translate all Kazakh monolingual data to English with the best KK-EN model from the previous round.
3. Sample 200K sentences from the English monolingual data and translate them to Kazakh using the ensemble of all EN-KK models from the previous round.
4. Train 4 English-Kazakh models with BT data from step 2 and KD data from step 3. We up-sample bilingual sentence pairs by 2x.
5. Train 4 Kazakh-English models with BT data from step 1. We up-sample bilingual sentence pairs by 3x.
Result. Our final submission achieves 10.6 BLEU, ranked second by teams on the leaderboard.

Conclusions
This paper describes Microsoft Research Asia's neural machine translation systems for the WMT19 shared news translation tasks. Our systems are built on Transformer, back translation, and knowledge distillation, enhanced with our recently proposed techniques: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA). Due to time and GPU limitations, we only apply each technique to a subset of the translation tasks. We believe combining them will further improve translation accuracy, and we will conduct such experiments in the future. Furthermore, other techniques such as deliberation learning (Xia et al., 2017b), adversarial learning (Wu et al., 2018b), and reinforcement learning (He et al., 2017; Wu et al., 2018a) could also help and are worthy of exploration.

Figure 1: Visualization of different levels of the search space, from the network, to the layer, to the node. For each of the different layers, we search its unique layer space. The lines in the middle part denote all possible connections between the three nodes (constituting the layer space) as specified by each architecture, while among them the deep black lines indicate the particular connections in Transformer. The right part similarly contains the two branches used in Node2 of Transformer.

Table 1: Results of English↔German by sacreBLEU.
Results. The results are summarized in Table 1, which are evaluated by sacreBLEU³. The baseline is the average accuracy of models using only bitext, i.e., $\bar{f}_1$ and $\bar{f}_2$ for English→German translation and $\bar{g}_1$ and $\bar{g}_2$ for German→English, and BT is the accuracy of the model after back-translation training. As can be seen, back translation improves accuracy. For example, back translation boosts the BLEU score from 45.6 to 47.4 on news18 English→German translation, a 1.8-point improvement. MADL further boosts BLEU to 50.4, obtaining another 3-point improvement and demonstrating the effectiveness of our method.

Table 2: Results of German↔French by sacreBLEU.

Table 3: BLEU scores on Chinese→English test sets.
⁷ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
After de-duplicating, the [...]
[...] and use the remaining bilingual corpus (4.5M) to finetune $P_3(y|x)$ for one epoch. The resulting model is denoted as $P_4(y|x)$, which is set as the final model checkpoint.

Table 6: The final BLEU scores on English→Finnish test sets, for the three models: L2R Transformer, R2L Transformer and NAONet, after the four steps of training.

Table 8: English→Finnish BLEU scores of re-ranking using the three models. "news" is short for "newstest".

Table 9: Finnish→English BLEU scores of re-ranking using the three models.