The NiuTrans Machine Translation Systems for WMT19

This paper describes the NiuTrans neural machine translation systems for the WMT 2019 news translation tasks. We participated in 13 translation directions, including 11 supervised tasks, namely EN↔{ZH, DE, RU, KK, LT} and GU→EN, and the unsupervised DE↔CS sub-track. Our systems were built on deep Transformer architectures and several back-translation methods. Iterative knowledge distillation and ensemble+reranking were also employed to obtain stronger models. Our unsupervised submissions were based on NMT enhanced by SMT. As a result, we achieved the highest BLEU scores in the {KK↔EN, GU→EN} directions, ranked 2nd in {RU→EN, DE↔CS} and 3rd in {ZH→EN, LT→EN, EN→RU, EN↔DE} among all constrained submissions.


Introduction
Our NiuTrans team participated in 13 WMT19 shared news translation tasks, covering 11 supervised and 2 unsupervised sub-tracks. We reused several effective approaches from our WMT18 submissions, including back-translation by beam search (Sennrich et al., 2016b) and BPE (Sennrich et al., 2016c), and further strengthened our systems with several new techniques this year.
For our supervised task submissions, all language pairs shared similar model architectures and training flow. We adopted four deep Transformer architectures based on Wang et al. (2019) as our baselines, which significantly outperformed the standard Transformer-Big in terms of both translation quality and convergence speed.
On the data augmentation side, we experimented with several back-translation methods (Sennrich et al., 2016b), including beam search, unrestricted sampling and the top-K sampling proposed by Edunov et al. (2018), to leverage the target-side monolingual data. We also applied iterative knowledge distillation (Freitag et al., 2017) to leverage the source-side monolingual data.
Our systems also employed conventional combination methods, including ensemble decoding and feature-based re-ranking, to further improve translation quality. We proposed a simple greedy search algorithm to find the best ensemble combination effectively and efficiently. Hypothesis combination (Hassan et al., 2018) was also adopted to generate more diverse hypotheses for better re-ranking.
For the unsupervised tasks, we mainly investigated unsupervised SMT (Artetxe et al., 2019) and unsupervised NMT (Lample and Conneau, 2019) to build our baselines, then applied a joint training strategy on top of these baselines to boost their performance.
This paper is structured as follows: Section 2 describes the details of our deep Transformer architectures; Section 3 presents an overview of our universal training flow for all supervised language pairs as well as the unsupervised methods; the experimental settings and main results are reported in Section 4.

Deep Transformer
Neural machine translation models based on multi-layer self-attention (Vaswani et al., 2017) have shown strong results on several large-scale tasks. Enlarging the model capacity is an effective way to obtain stronger networks, either by widening the hidden representations or by deepening the model. Bapna et al. (2018) showed that learning deeper networks is not easy for the vanilla Transformer due to the gradient vanishing/exploding problem.

Figure 1: Examples of the pre-norm residual unit and the post-norm residual unit. F = sub-layer, and LN = layer normalization.

Wang et al. (2019) emphasized that the location of layer normalization plays a vital role when training a deep Transformer. In early versions of the Transformer (Vaswani et al., 2017), layer normalization was placed after the element-wise residual addition (post-norm, see Figure 1(a)), while in recent implementations it is applied to the input of every sub-layer (pre-norm, see Figure 1(b)), which provides a direct path for passing error gradients from the top layers to the bottom ones. As a result, the pre-norm Transformer is easier to train than the post-norm (vanilla) Transformer when the model goes deeper. Remarkably, a dynamic linear combination of previous layers can further improve translation quality. Note that we built all our deep self-attentional models in the pre-norm fashion by default. The details of our deep architectures are described below:

Pre-Norm Transformer: In recent Tensor2Tensor implementations, layer normalization (Lei Ba et al., 2016) is applied to the input of every sub-layer, so the computation sequence can be expressed as: normalize → transform → dropout → residual-add. In this way we successfully trained a deep pre-norm Transformer with performance comparable to or even better than Transformer-Big, at only one fourth of the training cost.
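The two residual-unit layouts in Figure 1 can be sketched as follows. This is a minimal pure-Python illustration, not our actual implementation: a toy element-wise sub-layer stands in for attention/FFN, and the learnable gain and bias of layer normalization are omitted.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_unit(x, sublayer):
    """Vanilla (post-norm) Transformer: LN is applied AFTER the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_unit(x, sublayer):
    """Pre-norm Transformer: LN is applied to the sub-layer INPUT, so the
    residual path x -> x + F(LN(x)) passes error gradients through unchanged."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# Toy sub-layer F: element-wise doubling stands in for attention or FFN.
f = lambda x: [2.0 * v for v in x]
x = [1.0, 2.0, 3.0]
print(post_norm_unit(x, f))
print(pre_norm_unit(x, f))
```

The pre-norm unit keeps the identity path x → x + F(LN(x)) free of normalization, which is why gradients reach the lower layers directly when many such units are stacked.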
Pre-Norm Transformer-RPR: We found that Transformer-RPR (Shaw et al., 2018), which incorporates relative position information alongside sinusoidal position encodings, could outperform the pre-norm Transformer at the same encoder depth when built in the pre-norm style. We used a clipping distance k = 20 with unique edge representations per layer and head.
Pre-Norm Transformer-DLCL: Transformer-DLCL employs direct links to all previous layers and offers efficient access to lower-level representations in a deep stack. An additional weight matrix W ∈ R^{(L+1)×(L+1)} is used, whose row W^{(l+1)} weighs each incoming layer in a linear manner. The input to layer l+1 can be formulated as:

Ψ(y_0, y_1, ..., y_l) = Σ_{k=0}^{l} W_k^{(l+1)} · LN(y_k)    (1)

Eq. 1 provides a way to learn a preference over layers at different levels of the stack, where Ψ(y_0, y_1, ..., y_l) is the linear combination of the previous layers' representations. Furthermore, this method is independent of the model architecture, so we can integrate it with either the pre-norm Transformer or the pre-norm Transformer-RPR for further enhancement. Details can be found in Wang et al. (2019).
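The layer combination of Eq. 1 can be sketched as below. This is an illustrative pure-Python version: the layer outputs and the weight row are toy values, and `layer_norm` again omits the learnable gain and bias.

```python
import math

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dlcl_combine(layer_outputs, weights):
    """Input to layer l+1 = sum_k W_k^{(l+1)} * LN(y_k) over all previous
    layers, as in Eq. 1. `weights` is one row of the learned matrix W."""
    dim = len(layer_outputs[0])
    combined = [0.0] * dim
    for w, y in zip(weights, layer_outputs):
        ny = layer_norm(y)
        combined = [c + w * v for c, v in zip(combined, ny)]
    return combined

# Toy stack: embedding output y0 and two encoder-layer outputs y1, y2.
ys = [[1.0, 2.0, 3.0], [0.5, 1.5, 1.0], [3.0, 2.0, 1.0]]
w_row = [0.2, 0.3, 0.5]  # hypothetical learned weights for layer 3's input
print(dlcl_combine(ys, w_row))
```

Because each row of W is learned separately, a layer high in the stack can weight lower-level representations directly instead of relying only on its immediate predecessor.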

Data Filter
Previous work (Junczys-Dowmunt, 2018; Stahlberg et al., 2018) indicated that a rigorous data filtering scheme is crucial; without it, noisy data leads to a catastrophic loss in quality, especially for EN↔DE and EN↔RU. For most language pairs, we filtered the bilingual training corpus with the following rules:
• Normalize punctuation with the Moses scripts, except for the ZH↔EN language pair.
• Filter out sentences longer than 100 words or containing a word longer than 40 characters.
• Filter out sentences that contain HTML tags or duplicated translations.
• Filter out sentence pairs whose source and target sides are in the same language.
• Filter out sentence pairs whose alignment scores obtained by fast-align are lower than -6.
• Filter out sentence pairs whose source-to-target word ratio exceeds 1:3 or 3:1.
The same filtering strategy was applied again after the data augmentation steps that leverage monolingual data to further boost translation quality.
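The rules above can be sketched as a single predicate over sentence pairs. This is a simplified illustration: language identification and fast-align scoring are external tools, so here the alignment score is passed in precomputed and the identical-language test is approximated by exact string equality.

```python
import re

MAX_LEN, MAX_WORD_CHARS = 100, 40
HTML_TAG = re.compile(r"<[^>]+>")

def keep_pair(src, tgt, align_score=None):
    """Return True if the sentence pair passes all filtering rules."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False
    if len(s) > MAX_LEN or len(t) > MAX_LEN:          # length limit
        return False
    if any(len(w) > MAX_WORD_CHARS for w in s + t):   # over-long word
        return False
    if HTML_TAG.search(src) or HTML_TAG.search(tgt):  # HTML remnants
        return False
    if src == tgt:                                    # source copied to target
        return False
    ratio = len(s) / len(t)
    if ratio > 3.0 or ratio < 1.0 / 3.0:              # word-ratio limit
        return False
    if align_score is not None and align_score < -6.0:  # fast-align score
        return False
    return True

print(keep_pair("ein kleiner Test", "a small test", align_score=-1.2))
print(keep_pair("<b>bold</b>", "bold", align_score=-1.2))
```

A real pipeline would additionally run Moses punctuation normalization before this check and deduplicate translations corpus-wide.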

Back Translation
Back-translation (Sennrich et al., 2016b) is an essential method for integrating target-side monolingual knowledge when building a state-of-the-art NMT system. It is especially indispensable for low-resource tasks, where the target-side lexicon coverage of the parallel data is insufficient: EN↔{KK, GU} provide only 0.11M and 0.5M bilingual sentence pairs respectively, so we augmented the training data by mixing the pseudo corpus with the parallel part.
Selecting appropriate sentences from the abundant monolingual data is a crucial issue given our hardware limitations and the huge time overhead. We trained a 5-gram language model on the mixture of the development set and the target side of the bilingual data to score the monolingual sentences. In addition, considering the impact of sequence length, we kept only sentences of 10 to 50 words.
Recent work (Edunov et al., 2018) has shown that different methods of generating the pseudo corpus influence translation performance differently: sampling or noisy synthetic data provides a much stronger training signal than data generated by beam or greedy search. This year we attempted the following data augmentation methods:
• Beam search: generate the target translation by beam search with beam size 4.
• Sampling: select a word randomly from the whole distribution at each step, which increases the diversity of the pseudo corpus compared with beam search, at the cost of lower precision.
• Top-K sampling: select a word in a restricted way such that only the top-K words (we set K to 10) can be chosen.
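A single decoding step under the three generation strategies can be sketched as follows, with the model's output distribution reduced to a toy word-to-probability dictionary. Beam search is simplified here to its greedy argmax core, since full beam bookkeeping is orthogonal to the sampling choice.

```python
import random

def next_word(probs, strategy="beam", k=10, rng=random):
    """One decoding step: `probs` maps candidate words to model probabilities."""
    if strategy == "beam":          # greedy proxy for beam search: the argmax word
        return max(probs, key=probs.get)
    if strategy == "sampling":      # draw from the full distribution
        words, ps = zip(*probs.items())
        return rng.choices(words, weights=ps, k=1)[0]
    if strategy == "topk":          # restrict sampling to the k most likely words
        top = sorted(probs, key=probs.get, reverse=True)[:k]
        return rng.choices(top, weights=[probs[w] for w in top], k=1)[0]
    raise ValueError(strategy)

dist = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xyzzy": 0.05}
print(next_word(dist, "beam"))        # always "cat"
print(next_word(dist, "topk", k=2))   # "cat" or "dog" only
```

Unrestricted sampling occasionally emits low-probability words such as "xyzzy" here, which is exactly the extra noise that Edunov et al. (2018) found to strengthen the back-translation training signal; top-K sampling trades some of that diversity for precision.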
It is worth noting that the experimental results were inconsistent across language pairs: sampling was more helpful for low-resource pairs such as Kazakh, Gujarati and Lithuanian. In contrast, we observed that language pairs with abundant parallel corpora, such as ZH↔EN, were insensitive to the sampling method and gained only slightly from restricted sampling over the top-10 candidates. We therefore used a different strategy to leverage the monolingual resources for each task, as detailed in Section 4.

Greedy Based Ensemble
Ensemble decoding is an effective system combination method that boosts machine translation quality by integrating the predictions of several single models at each decoding step. It has proved effective in the past few years' WMT tasks (Deng et al., 2018; Junczys-Dowmunt, 2018; Sennrich et al., 2016a). We strengthened the single models by employing deep self-attentional architectures. Note that the improvement from ensembling is small if the single models are already strong, and there is no significant benefit from simply increasing the number of participants. It is therefore necessary to make full use of the models by searching for a better combination on the development set. We adopted the following easily operable greedy strategy:

Algorithm 1: Greedy ensemble search
1: for each 4-model combination of the candidate set Ω_cand do
2:   run ensemble decoding to get its development-set score
3: end for
4: choose the best 4-model combination as the initial Φ_final
5: repeat
6:   move the single model from the rest of Ω_cand that performs best when combined with Φ_final into Φ_final
7: until the improvement becomes negligible as the model number increases

To ensure diversity among the candidate models, we constructed single models from several perspectives, such as different initialization seeds, training epochs, model sizes and the network architectures described in Section 2. On the development set, this algorithm consistently improved over the best single model by 1-1.5 BLEU across all tasks in which we participated.
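The greedy search above can be sketched as follows. The BLEU scorer is a toy stand-in (each model contributes a fixed gain with a mild diminishing-returns penalty); in practice every call to `dev_bleu` is a full ensemble decoding run on the development set.

```python
from itertools import combinations

def greedy_ensemble(models, dev_bleu, init_size=4, eps=0.05):
    """Start from the best `init_size`-model combination, then repeatedly add
    the remaining model that most improves dev-set BLEU; stop when the gain
    drops below `eps` (the "tiny improvement" criterion)."""
    best = max(combinations(models, init_size), key=dev_bleu)
    chosen = list(best)
    rest = [m for m in models if m not in best]
    score = dev_bleu(tuple(chosen))
    while rest:
        cand = max(rest, key=lambda m: dev_bleu(tuple(chosen + [m])))
        new = dev_bleu(tuple(chosen + [cand]))
        if new - score < eps:
            break
        chosen.append(cand)
        rest.remove(cand)
        score = new
    return chosen, score

# Toy scorer: per-model gains with a small per-model ensembling penalty.
gains = {"m1": 1.0, "m2": 0.8, "m3": 0.6, "m4": 0.5, "m5": 0.4, "m6": 0.01}
bleu = lambda combo: sum(gains[m] for m in combo) * (1 - 0.02 * len(combo))
selected, score = greedy_ensemble(list(gains), bleu)
print(selected)
```

Under this toy scorer the search adds m5 (a clear gain) but stops before m6, whose tiny contribution no longer outweighs the ensembling penalty.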

Iterative Knowledge Distillation
A natural idea to further boost the performance of the ensemble model obtained in Section 3.3 is to alternate knowledge distillation (Hinton et al., 2015; Freitag et al., 2017) and ensembling iteratively. The naive approach starts with a list of single-model candidates as the students and the best 4-model combination retrieved by Algorithm 1 as the teacher. Sequence-level knowledge distillation (Kim and Rush, 2016) is then applied to fine-tune each student model with additional source-side data. With these enhanced student models, a stronger 4-model combination can be produced through Algorithm 1. We iterated this process until the improvement on the validation set fell below 0.1 BLEU.

Figure 2: A simple example of iterative knowledge distillation with 5 students, 2 teachers and 2 iterations.
However, in preliminary experiments we found that such iteration did not yield the results we expected. We attributed this to a deficiency of model diversity: all students collapsed to a similar optimum induced by the same teacher they learned from, which limited the potential gain from iteration. To avoid this, in each step of the iteration we split the candidates into 4 subsets randomly and assigned each subset a distinct teacher model sampled from the top 4-model combinations, then fine-tuned each model with its subset's corresponding teacher. Moreover, we added an additional 2M source-side monolingual sentences in each step to better preserve model diversity. Figure 2 shows an example.
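The diversity-preserving student-teacher assignment can be sketched as a random partition. All model names below are placeholders; the actual fine-tuning of each subset with its teacher's distilled output is omitted.

```python
import random

def assign_teachers(students, teachers, seed=0):
    """Randomly partition the students into len(teachers) subsets and pair
    each subset with a distinct teacher, so no two subsets are pulled
    toward the same optimum."""
    rng = random.Random(seed)
    shuffled = students[:]
    rng.shuffle(shuffled)
    n = len(teachers)
    subsets = [shuffled[i::n] for i in range(n)]  # round-robin split
    return {t: sub for t, sub in zip(teachers, subsets)}

students = [f"student{i}" for i in range(8)]
teachers = ["teacher_a", "teacher_b", "teacher_c", "teacher_d"]
plan = assign_teachers(students, teachers)
for t, sub in plan.items():
    print(t, "->", sub)
```

In each outer iteration the teacher pool itself is refreshed from the newly ensembled top models, so the assignment above is re-drawn per iteration.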

Feature Reranking
This year we adopted a hypothesis combination strategy to pick a potentially better translation from an N-best list consisting of several different ensemble outputs. For example, we generated 96 hypotheses from 8 different ensemble systems with beam size 12 during decoding, instead of obtaining all 96 outputs from a single best ensemble model. The oracle computed by a sentence-level BLEU script on the development set indicated that hypothesis combination was 5 BLEU points higher than the single ensemble output. Our re-ranking features fall into the following groups:

Right-to-Left Models: NMT models generate target translations in a left-to-right fashion, so models that generate target sentences in reverse order are naturally complementary (Stahlberg et al., 2018). We trained four deep Transformer-DLCL models with different hyper-parameter settings on reversed target-side sentences, followed by the ensemble knowledge distillation method to enhance single-model performance. Experimental results showed that the accuracy of the reverse models was critical; weak reverse models could even degrade the results.
Target-to-Source Models: we re-scored each hypothesis against the source input with target-to-source systems. In addition, target-to-source right-to-left models were also included.
Language Model: we used both a 5-gram language model and a deep self-attention language model trained on the target-side monolingual data.
Cross-lingual Sentence Similarity: we mixed the source-to-target and target-to-source training data at a ratio of about 1:1 to train a cross-lingual translation model, and computed the cosine similarity between the sentence-level vectors of each n-best hypothesis and the source (Hassan et al., 2018).
Sentence-Align Score: we used the fast-align tool to evaluate the alignment probability between the source and the target.
Translation Coverage: we used an SMT phrase table to obtain the top-50 translations for each source-to-target word pair. The translation coverage score is then computed from the bidirectional hits in this dictionary, with length normalization.
We re-scored the 96-best outputs generated by the ensemble systems with a re-scoring model over the features above, tuned by the k-best batch MIRA algorithm (Cherry and Foster, 2012), which is widely used in Moses.
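At decoding time, the tuned re-ranker reduces to a linear model over the feature values of each hypothesis. The sketch below shows that final scoring step with toy feature values and hand-set weights; in our systems the weights are tuned with k-best batch MIRA on the development set, and the feature set is the 27-feature one described above.

```python
def rerank(nbest, weights):
    """Pick the hypothesis maximizing score(h) = sum_i w_i * f_i(h).
    Each hypothesis carries a dict of feature values (forward model score,
    R2L score, LM score, alignment score, ...)."""
    def score(h):
        return sum(weights[name] * val for name, val in h["features"].items())
    return max(nbest, key=score)

# Toy 2-best list; feature names are illustrative, values are log-scores.
nbest = [
    {"text": "hyp A", "features": {"nmt": -2.0, "lm": -4.0, "align": -5.0}},
    {"text": "hyp B", "features": {"nmt": -2.5, "lm": -2.0, "align": -4.0}},
]
weights = {"nmt": 1.0, "lm": 0.5, "align": 0.2}
print(rerank(nbest, weights)["text"])
```

Note that the NMT-preferred "hyp A" loses here once the language-model and alignment features are weighted in, which is exactly the behavior re-ranking is meant to enable.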

Unsupervised NMT
We also participated in the unsupervised translation tasks using only the monolingual data provided by the WMT organizers. We attempted both unsupervised SMT and unsupervised NMT, then combined them for better results. For the SMT models, unsupervised tuning (Artetxe et al., 2019) was applied to further enhance the system: a small pseudo corpus generated by the target-to-source system was used to adjust the weights of the source-to-target system. We also followed Artetxe et al. (2019) in exploiting subword information in the unsupervised SMT system, adding two additional weights to the initial phrase table. These new features employ a character-level similarity function instead of word translation probabilities and are analogous to the lexical weightings.
For unsupervised NMT, our techniques were based on the recently proposed method for unsupervised machine translation (Lample and Conneau, 2019), including proper initialization, leveraging a strong language model and iterative back-translation (Lample et al., 2018). Our systems were initialized by a cross-lingual masked language model, which brought significant improvement over the cross-lingual embedding method. After that, a standard NMT architecture was trained on monolingual data only, combining denoising auto-encoding and iterative back-translation. We adopted two training strategies that combine the NMT and SMT models to further enhance our unsupervised systems:
• Generate the pseudo corpus with SMT and warm up the NMT models on it for the first 1,000 training steps only; afterwards, use the pseudo corpus generated by the NMT systems for the remainder of training.
• Mix the pseudo corpora produced by NMT and SMT at a 1:1 ratio at the beginning, then iteratively increase the proportion of the NMT pseudo corpus until there is no significant improvement on the validation set.
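The second strategy's per-round corpus construction can be sketched as follows. The sentence pairs are placeholders; only the ratio schedule is the point of the example.

```python
import random

def mix_pseudo(smt_pairs, nmt_pairs, nmt_ratio, size, seed=0):
    """Build one training round's pseudo corpus: draw `nmt_ratio` of the
    `size` pairs from NMT back-translations and the rest from SMT ones."""
    rng = random.Random(seed)
    n_nmt = int(size * nmt_ratio)
    corpus = rng.sample(nmt_pairs, n_nmt) + rng.sample(smt_pairs, size - n_nmt)
    rng.shuffle(corpus)
    return corpus

# Placeholder sentence pairs tagged by their generating system.
smt = [("de_s%d" % i, "cs_s%d" % i) for i in range(100)]
nmt = [("de_n%d" % i, "cs_n%d" % i) for i in range(100)]
round1 = mix_pseudo(smt, nmt, nmt_ratio=0.5, size=40)   # 1:1 mix to start
round2 = mix_pseudo(smt, nmt, nmt_ratio=0.75, size=40)  # NMT share raised
print(len(round1), len(round2))
```

Between rounds, the ratio is raised (e.g. 0.5 → 0.75 → ...) and training continues until validation BLEU stops improving, at which point the corpus is effectively all-NMT.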

Experiments and Results
For all supervised tasks, we used deep self-attentional models as our baselines, and we also experimented with shallow and wide counterparts trained on the same corpus to verify their effectiveness. Preliminary experiments indicated that our deep models outperform the standard Transformer-Big by 0.7-1.3 BLEU on different language pairs. All of our experiments employed 25 or 30 encoder layers and 6 decoder layers; both the embedding and hidden sizes were 512, with 8 heads for the self-attention and encoder-decoder attention mechanisms. We shared the target-side embedding and softmax matrices. All BLEU scores were computed with mteval-v13a.pl. Details for the different language pairs follow in the subsections below.

Experiment setting
We implemented our deep models on top of Tensor2Tensor; all models were trained on eight 1080Ti GPUs. We used the Adam optimizer with β1 = 0.97, β2 = 0.997 and ε = 10^-6, together with gradient accumulation due to the high GPU memory consumption. The training data was reshuffled after each epoch, and sentence pairs were batched by target-side length with 8,192 tokens per GPU. A large learning rate and long warmup were chosen for faster convergence: we set the maximum learning rate to 0.002 with 8,000 warmup steps for most language pairs, including EN↔{ZH, RU, KK, LT}; for the EN↔DE task, 16,000 warmup steps achieved better results. During training, we employed label smoothing with a confidence of 0.9, and all dropout probabilities were set to 0.1. Furthermore, we averaged the last 15 checkpoints of each single training run for all language pairs. Models were saved and validated every 20 minutes.
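A warmup-then-decay schedule with these settings can be sketched as below. This is the common Transformer-style schedule (linear warmup to the peak rate, then inverse-square-root decay); the exact decay curve used in our training may differ.

```python
def learning_rate(step, max_lr=0.002, warmup=8000):
    """Linear warmup to `max_lr` over `warmup` steps, then inverse-sqrt
    decay scaled so that lr(warmup) == max_lr."""
    if step <= warmup:
        return max_lr * step / warmup
    return max_lr * (warmup / step) ** 0.5

print(learning_rate(4000))   # mid-warmup: 0.001
print(learning_rate(8000))   # peak: 0.002
print(learning_rate(32000))  # decayed back to 0.001
```

The large peak rate is only safe because of the long warmup; with pre-norm deep models the schedule tolerates a higher peak than post-norm training would.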

English ↔ Chinese
For the ZH↔EN systems, our parallel corpus included CWMT, wikititles-v1, NewsCommentary-v14, and 30% of the UN corpus sampled at random. All parallel data were segmented by the NiuTrans (Xiao et al., 2012) word segmentation toolkit. After preprocessing, we trained BPE (Sennrich et al., 2016c) models with 32,000 merge operations for each side. For back-translation, we trained 25-layer Transformer models on the WMT18 training data for both directions. We selected 10M NewsCrawl2018 monolingual sentences for ZH→EN and the combination of XinHua and XMU data for EN→ZH. Experimental results in Table 1 showed that generating the pseudo corpus by beam search brought significant improvement on newstest2018 for ZH↔EN. Meanwhile, for the EN→ZH system, an additional pseudo corpus generated by sampling-top10 gained +0.7 BLEU on newstest2018 but had a negative impact on newstest2019.
The best performance on our development set, newstest2018, was +1.6 BLEU over Transformer-Base and even +0.7 BLEU higher than Transformer-Big. Iterative knowledge distillation with 4 teachers, 3 iterations and 1 epoch per iteration gave a further +1.6 BLEU over the best single model. In total, an improvement of almost +4 BLEU was observed on newstest2019. Through the greedy ensemble algorithm, we selected the best 8-model combination on newstest2018 and boosted our system by another +0.8 BLEU. Our re-ranking model contained 27 features, including 4 L2R ensembles, 4 R2L ensembles, 4 T2S ensembles, 4 T2S-R2L ensembles and the other features mentioned in Section 3.5.
For EN→ZH, we used the same training settings to obtain our best system. The results after applying each component are reported in Table 1. Surprisingly, adding the pseudo corpus hindered improvement on newstest2019 yet gained +3.7 BLEU on newstest2018. One possible explanation is that the construction of this year's test set differs from those of previous years. (For the sampling-top10 pseudo corpus, we mixed it with the parallel corpus when fine-tuning each single model.)

English ↔ German
Table 2 presents the BLEU scores on newstest2018 and newstest2019 for the EN↔DE tasks. All released parallel training data were used, and we adopted the dual conditional cross-entropy method (Junczys-Dowmunt, 2018) to filter out noisy data in the ParaCrawl corpus, resulting in 10M bilingual sentence pairs. A joint BPE model with 32,000 merge operations was applied in both directions, together with a shared vocabulary for both language pairs. The target-side monolingual data played an important role in the success of this language pair. We back-translated 10M in-domain monolingual sentences selected from NewsCrawl2016-2018 by XenC (Rousseau, 2013). We observed that generating the pseudo corpus via random sampling is much more effective than beam search with the same volume of monolingual sentences, yielding +2.5/+3.7 BLEU on newstest2018 for EN→DE and DE→EN respectively. Transformer-DLCL with 25 encoder layers and 4096 filters obtained a further +2.5/+1.7 BLEU. Iterative knowledge distillation and an 8-model combination yielded another +0.8/+1.4 BLEU. Unfortunately, we did not identify any significant improvement from re-ranking in terms of validation BLEU; perhaps the features we used were not strong enough to score the n-best lists properly. It is worth noting that re-normalizing the quotes in German brought an additional +1.8 BLEU on EN→DE.

English ↔ Russian
For EN↔RU, we used the following resources provided by WMT: News Commentary-v14, ParaCrawl-v3, CommonCrawl and the Yandex corpus. The parallel corpus comprised 7.66M sentence pairs after removing bad cases as described in Section 3.1. Inspired by the morphological richness of Russian, we experimented with BPE code sizes ranging from 30,000 to 80,000; considering both efficiency and performance, we finally chose 50,000 for both directions. We used the same data selection strategy as for EN↔DE and retained 16M monolingual sentences from NewsCrawl2015-2018. The selected sentences were divided into two equal parts: we generated a pseudo corpus from the first part with beam search of size 4 and trained our NMT models on it together with the parallel data, while the other 8M sentences were back-translated by random sampling and used to fine-tune each model.
Our final EN→RU submission consisted of four deep Transformer models strengthened by knowledge distillation: DLCL25, DLCL30, DLCL25RPR and DLCL30RPR. The reverse direction contained DLCL25, DLCL25RPR with 4096 filters, DLCL30RPR and DLCL30 with 4096 filters. The overall results of our systems were reported in Table 3. We observed the same phenomenon as in EN→ZH, where back-translation yielded better results on newstest2018 but inferior ones on newstest2019.

English ↔ Kazakh
The overall results were shown in Table 4. We used Russian as the pivot language to construct an additional EN↔KK bilingual corpus from the crawled RU↔KK corpus as well as the RU↔EN one provided by the WMT organizers, resulting in 3.78M high-quality bilingual sentence pairs. For back-translation, we generated the pseudo corpus via random sampling from 2M monolingual sentences selected by XenC from the collection of Common Crawl, News Commentary, News Crawl and Wiki dumps. This pseudo corpus was extremely effective for our system.
For the KK→EN system, we adopted the same training procedure, except that we chose 4M English monolingual sentences from News Crawl 2015-2018 instead, consisting of 2M in-domain sentences selected by XenC and 2M sampled randomly. The detailed experimental results can be seen in Table 4.

English ↔ Lithuanian
We used the same approach as in the KK→EN task to select the pseudo corpus, but applied different generation strategies. Our pseudo corpus consisted of two parts: 2M pseudo sentence pairs generated by beam search (with alpha 1.2 and beam size 10) and another 1M generated by random sampling. From Table 6 we found that data quantity was the key factor for translation quality in this task, and the deep DLCL25RPR took full advantage of its deep encoder layers to extract more expressive representations.

German ↔ Czech
This section presents our unsupervised results on DE↔CS; Table 7 shows the BLEU scores on newstest2013 and newstest2019. We removed duplicated sentences and sentences with exceptional length ratios, resulting in 24.38M Czech and 24.36M German monolingual sentences from News Crawl 2007-2018 for the two directions. All texts were segmented with the scripts provided by Moses, and 60,000 BPE merge operations were applied. We used the Transformer architecture described in Lample and Conneau (2019), a revised Transformer-Big with 8 attention heads, learned positional embeddings and GELU activation functions. From Table 7 we observed that, through the techniques above, the unsupervised SMT gained significant improvement on newstest2013 and newstest2019. Moreover, leveraging the pseudo corpus generated by the unsupervised SMT system brought further enhancement, even though the unsupervised SMT was inferior to the NMT system. We experimented with both training strategies mentioned in Section 3.6, and the iterative mixing method was more efficient. We fused only two single models during decoding, which gave no significant improvement on either the validation or the test set. Note that we fixed the quotes in both directions.

Conclusion
This paper described the NiuTrans systems for all 13 tasks of the WMT19 shared news translation campaign, covering both supervised and unsupervised sub-tracks, and showed that a universal training strategy can achieve promising results. We built our final submissions along two main lines:
• Improving the neural architecture by employing several deep self-attentional models.
• Taking full advantage of additional source-side and target-side monolingual data via knowledge distillation and back-translation, respectively.
In addition, a greedy-based ensemble algorithm helped to search for a robust combination of models, and we adopted a hypothesis combination strategy for more diverse re-ranking. Our systems performed strongly among all constrained submissions: we ranked 1st in EN→KK, KK→EN and GU→EN, and stayed in the top 3 for the remaining language pairs.