NICT’s Neural and Statistical Machine Translation Systems for the WMT18 News Translation Task

This paper presents NICT's participation in the WMT18 shared news translation task. We participated in the eight translation directions of four language pairs: Estonian-English, Finnish-English, Turkish-English and Chinese-English. For each translation direction, we prepared state-of-the-art statistical (SMT) and neural (NMT) machine translation systems. Our NMT systems were trained with the transformer architecture using the provided parallel data augmented with a large quantity of back-translated monolingual data that we generated with a new incremental training framework. Our primary submissions to the task are the result of a simple combination of our SMT and NMT systems. Our systems are ranked first for the Estonian-English and Finnish-English language pairs (constrained) according to BLEU-cased.


Introduction
This paper describes the neural (NMT) and statistical (SMT) machine translation systems built for the participation of the National Institute of Information and Communications Technology (NICT) in the WMT18 shared News Translation Task (Bojar et al., 2018). We participated in four language pairs (eight translation directions): Estonian-English (Et-En), Finnish-English (Fi-En), Turkish-English (Tr-En), and Chinese-English (Zh-En). We chose these language pairs since they appear to be among the most challenging: they involve distant languages and, for Finnish, Estonian, and Turkish, the organizers provided less training data than for Russian, German, and Czech. All our systems are constrained, i.e., we used only the parallel and monolingual data provided by the organizers to train and tune them. For all the translation directions, we trained NMT and SMT systems, and combined them through n-best list reranking using different informative features as proposed by Marie and Fujita (2018). This simple combination method, together with the exploitation of large back-translated monolingual data, ranked among the best MT systems at WMT18, especially for the competitive Et-En and Fi-En translation tasks, for which our submissions are ranked first according to the BLEU-cased metric (henceforth BLEU). Our systems for Et-En, Fi-En, and Tr-En were trained using exactly the same procedures, without any specific linguistic treatment. For Zh-En, on the other hand, we used a specific tokenizer and slightly different training parameters due to the much larger quantity of training data.
The remainder of this paper is organized as follows. In Section 2, we introduce the data preprocessing. In Section 3, we describe the details of our NMT and SMT systems. The back-translation of monolingual data using our new incremental training framework for NMT is described in Section 4. Then, the combination of NMT and SMT is described in Section 5. Empirical results produced with our systems are shown and analyzed in Section 6, and Section 7 concludes this paper.

Data
As parallel data to train our systems, we used all the available data for all our targeted translation directions, except the "Wiki Headlines" corpus for Fi-En. As English monolingual data, we used all the available data except the "Common Crawl" and "News Discussions" corpora. For all other languages, we used all the available monolingual corpora, except for Turkish, for which we used only 100 million sentences randomly extracted from "Common Crawl." To tune/validate and evaluate our systems, we used Newstest2016 and Newstest2017 for Fi-En and Tr-En, Newsdev2017 and Newstest2017 for Zh-En, and Newsdev2018 for Et-En.

Tokenization, Truecasing and Cleaning
We used the Moses tokenizer (Koehn et al., 2007) and truecaser for English, Estonian, Finnish, and Turkish. The truecaser was trained on one million tokenized lines extracted randomly from the monolingual data. Truecasing was then performed on all the tokenized data. For Chinese, we used Jieba for tokenization but did not perform truecasing. For cleaning, we only applied the Moses script clean-corpus-n.perl to remove lines in the parallel data containing more than 80 tokens, and replaced characters forbidden by Moses. Note that we did not perform any punctuation normalization. Tables 1 and 2 present the statistics of the parallel and monolingual data, respectively, after preprocessing.
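The length-based cleaning step can be illustrated with a minimal Python sketch. This is not the Moses script itself: `filter_parallel` and `max_tokens` are hypothetical names, and the rule shown (drop a pair when either side is empty or longer than 80 whitespace-separated tokens) is a simplification that ignores any other checks the script may perform.

```python
from typing import Iterable, Iterator, Tuple


def filter_parallel(
    pairs: Iterable[Tuple[str, str]], max_tokens: int = 80
) -> Iterator[Tuple[str, str]]:
    """Yield only sentence pairs whose two sides are non-empty and at most max_tokens long."""
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Both sides must pass the length check for the pair to be kept.
        if 1 <= n_src <= max_tokens and 1 <= n_tgt <= max_tokens:
            yield src, tgt
```

Filtering pairs rather than individual sentences keeps the two sides of the corpus aligned line by line.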
For Zh-En, we did not use --dropout-src 0.1 --dropout-trg 0.1 since the training data are much larger. We performed NMT decoding with an ensemble of six models in total: the models achieving the best BLEU (Papineni et al., 2002) and the best perplexity scores in each of three independent training runs.

SMT
We also trained SMT systems using Moses. Word alignments and phrase tables were trained on the tokenized parallel data using mgiza. Source-to-target and target-to-source word alignments were symmetrized with the grow-diag-final-and heuristic. We trained hierarchical SMT models for Et-En and Fi-En since they provided better results than regular phrase-based models on our development data for these language pairs. 8 We expected the same to hold for Tr-En and Zh-En. However, we were unable to exploit hierarchical models for Tr-En, 9 while hierarchical models for Zh-En were extremely large due to the size of our training data. Consequently, for Tr-En and Zh-En we simply trained regular phrase-based models using MSLR (monotone, swap, discontinuous-left, discontinuous-right) lexicalized reordering models and used the default distortion limit of 6. We trained two 4-gram language models using lmplz (Heafield et al., 2013): one on the entire monolingual data concatenated to the target side of the parallel data, and another one on the in-domain "News Crawl" corpora only. For English, all singletons were pruned due to the large size of the monolingual data. To tune the SMT model weights, we used KB-MIRA (Cherry and Foster, 2012) and selected the weights giving the best BLEU score on the development data after 15 decoding runs.
8 Between 0.5 and 1 BLEU points of improvement.
9 Moses consistently crashed (segmentation fault) during the decoding of the development data.

Incremental Back-Translation for Et-En, Fi-En, and Tr-En
We introduced an incremental training framework for NMT that aims to iteratively increase the quality and quantity of the synthetic parallel data used for training. In this framework, we first simultaneously but independently train a source-to-target and a target-to-source NMT system using the same original parallel data. Then, we back-translate source and target monolingual data using the source-to-target and the target-to-source NMT systems, respectively, and obtain two sets of synthetic parallel data. Next, a new source-to-target and a new target-to-source NMT system are trained, from scratch, on their respective new training data, which mix the original parallel data with the synthetic parallel data whose source side is back-translated from the target side. Up to this stage, we simply follow what is usually done in previous work (Sennrich et al., 2016a).
As illustrated in Figure 1, we continue this procedure iteratively. Using source-to-target and target-to-source NMT systems trained on the mixture of the synthetic and original parallel data, we back-translate a larger number of monolingual sentences, including the same sentences back-translated at the first iteration. Since we have better NMT systems than those at the first iteration, we can expect the back-translation to be of better quality. We mix this new synthetic parallel data with the original data and again train from scratch a source-to-target and a target-to-source NMT system to obtain further improved translation models. Note that this procedure is partially similar to the work proposed by Zhang et al. (2018), but differs in that we incrementally increase the amount of back-translated data.
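The iterative procedure can be sketched as follows. This is only an illustrative outline, not our training pipeline: `train` stands for any routine that builds a translation function from (input, output) sentence pairs, and `incremental_back_translation`, `sizes`, and the type aliases are hypothetical names. The sketch assumes that the monolingual lists are already shuffled, so that taking a longer prefix at each iteration includes the sentences used at earlier iterations.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]                        # (input sentence, output sentence)
Translator = Callable[[str], str]
Trainer = Callable[[List[Pair]], Translator]  # hypothetical: trains a system from scratch


def incremental_back_translation(
    parallel: List[Pair],
    mono_src: Sequence[str],
    mono_tgt: Sequence[str],
    sizes: Sequence[int],
    train: Trainer,
) -> Tuple[Translator, Translator]:
    """Return (source-to-target, target-to-source) systems after len(sizes) iterations."""
    flipped = [(t, s) for s, t in parallel]
    # Initial systems are trained on the original parallel data only.
    s2t, t2s = train(parallel), train(flipped)
    for k in sizes:
        # Training data for s2t: target monolingual sentences paired with
        # their back-translation into the source language.
        synth_s2t = [(t2s(t), t) for t in mono_tgt[:k]]
        # Training data for t2s: source monolingual sentences paired with
        # their back-translation into the target language.
        synth_t2s = [(s2t(s), s) for s in mono_src[:k]]
        # Both systems are retrained from scratch on original + synthetic data.
        s2t, t2s = train(parallel + synth_s2t), train(flipped + synth_t2s)
    return s2t, t2s
```

Both synthetic sets are built before either system is replaced, so each iteration only uses the systems obtained at the previous iteration.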
Given the number of sentences used in the first iteration, $k_1$, and an expansion factor, $r$, we determine $k_i$, the number of monolingual sentences back-translated at iteration $i$, as follows:
$$k_i = r \times k_{i-1}$$
The parameters used for the given language pairs are listed in Table 3; for each language pair, the same parameters were used for both translation directions. The monolingual sentences to be back-translated were randomly extracted from the News Crawl corpora. For Et-En and Fi-En, we stopped the incremental training after 2 iterations, back-translating up to 2M sentences. For Tr-En, we observed improvements for both translation directions until the fourth iteration, which back-translated 1.6M sentences (approximately 8 times the size of the original parallel data). In our preliminary experiments, we found that setting $r = 2$ and $k_1$ very close to, or smaller than, the size of the original parallel data consistently gives good results across language pairs. Fine-tuning $r$ and $k_1$ would result in a better translation quality but at a greater cost.
In our preliminary experiments, we also found that incremental training significantly improves the translation quality over an NMT system trained directly on the same amount of back-translated sentences. For instance, we observed an improvement of 0.6 BLEU points for Tr→En over a system trained on 1.6M sentences back-translated by a system trained on the original parallel data (as in Sennrich et al. (2016a)).
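As a small worked example of how the amount of back-translated data grows, the sketch below assumes a geometric schedule in which each iteration multiplies the previous size by the expansion factor. The function name `back_translation_sizes` is hypothetical, and the values of the first-iteration size in the usage note are inferred from the totals reported above under the assumption that the expansion factor is 2; they are not taken from Table 3.

```python
from typing import List


def back_translation_sizes(k1: int, r: float, iterations: int) -> List[int]:
    """Return [k_1, ..., k_n], assuming k_i = r * k_{i-1}."""
    if iterations < 1:
        return []
    sizes = [k1]
    for _ in range(iterations - 1):
        # Each iteration back-translates r times as many sentences as the previous one.
        sizes.append(int(round(sizes[-1] * r)))
    return sizes
```

Under these assumptions, `back_translation_sizes(200_000, 2, 4)` ends at 1.6M sentences, matching the fourth Tr-En iteration, and `back_translation_sizes(1_000_000, 2, 2)` ends at 2M sentences, matching the Et-En and Fi-En setting.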

Setting for Zh-En
For the Zh-En language pair, since much larger parallel data were provided to train the system, we did not perform the incremental back-translation described in Section 4.1. For En→Zh, we back-translated the entire XMU Chinese monolingual corpus containing 5.4M sentences as the source to produce synthetic English data. For Zh→En, we empirically compared the impact of back-translating different sizes of English monolingual data, using the first 10M, 20M, and 40M lines of the concatenation of News Crawl-2016 and News Crawl-2017 English corpora to produce synthetic Chinese data. As shown in Table 4, exploiting as many as 40M back-translated lines makes no significant difference compared to using only 10M lines. Therefore, we selected the first 10M lines of the News Crawl-2016 English corpus to produce synthetic Chinese data.

Combination of NMT and SMT
Although we can expect SMT to perform very poorly for all the language pairs we considered, 10 our primary submissions for WMT18 are the results of a simple combination of NMT and SMT. Indeed, as demonstrated by Marie and Fujita (2018), and despite the simplicity of the method used, combining NMT and SMT makes MT more robust and can significantly improve translation quality, even when SMT greatly underperforms NMT. Following Marie and Fujita (2018), our combination of NMT and SMT works as follows.
10 Especially due to the rich morphology of the languages involved and the long-distance reorderings to perform in order to produce a translation of good quality.

Generation of n-best Lists
We first produced the 100-best translation hypotheses with our NMT and SMT systems, independently. 11 Unlike Moses, Marian must use a beam of size k to produce a k-best list during decoding. However, using a larger beam size during decoding for NMT may worsen translation quality (Koehn and Knowles, 2017). 12 Consequently, we also produced with Marian 10-best lists for Zh-En, and 12-best lists for the other language pairs, and merged them with Marian's 100-best lists to obtain lists containing up to 110 or 112 hypotheses. 13 In this way, we make sure that we still have hypotheses of good quality in the lists despite using a larger beam size. 14 Then, we merged the lists produced by Marian and Moses. We rescored all the hypotheses in the resulting lists with a reranking framework using features to better model the fluency and the adequacy of each hypothesis. This method can find a better hypothesis in these merged n-best lists than the one-best hypothesis produced by either Moses or Marian.
11 We used the option distinct in Moses to avoid duplicated hypotheses, i.e., with the same content but obtained from different word alignments, and consequently to increase diversity in the generated n-best lists.
12 For Zh-En, the decoding of the test data with k=100 resulted in a drop of 0.4 BLEU points compared to a decoding with k=10. However, for the other language pairs we did not observe such a quality drop but instead a consistent and slight improvement of BLEU scores.
13 Note that we did not remove duplicated hypotheses that may appear, for instance, in both 10-best and 100-best lists.
14 Note that we could have also generated many individual smaller n-best lists, for instance using all our NMT models independently, and merged them to increase the diversity of the hypotheses list to rerank and therefore obtain better results. However, we decided to leave the exploration of this possibility for future work.

Feature       Description
L2R (6)       Scores given by each of the 6 left-to-right Marian models
R2L (2)       Scores given by each of the 2 (or 4 for Tr-En) right-to-left Marian models
LEX (4)       Sentence-level translation probabilities, for both translation directions
LM (2)        Scores given by the two language models used by the Moses baseline systems
WPP (2)       Averaged word posterior probability
LEN (2)       Difference between the length of the source sentence and the length of the translation hypothesis, and its absolute value
SYS (1)       System flag, 1 if the hypothesis comes from Moses n-best list or 0 otherwise
MBR (2)       For Tr-En only: MBR decoding using sBLEU and chrF++
PBFD (1)      For Tr-En only: The phrase-based forced decoding score
L2R-bwd (6)   Scores given by each of the 6 left-to-right Marian models for the backward translation direction
R2L-bwd (2)   Scores given by each of the 2 (or 4 for Tr-En) right-to-left Marian models for the backward translation direction
Table 5: Set of features used by our reranking systems. The column "Feature" refers to the same feature name used in Marie and Fujita (2018). Note that the last two feature sets, "L2R-bwd" and "R2L-bwd," were not explored in Marie and Fujita (2018). The numbers between parentheses indicate the number of scores in each feature set.

Table 6: Detokenized BLEU-cased scores for our MT systems on the Newstest2018 test set. "NMT-reranked" denotes the reranking of the Moses 100-best hypotheses using all our NMT models (left-to-right and right-to-left, for both translation directions, trained with back-translated data) as features. "backtr" denotes the use or not of back-translated monolingual data. "Moses + Marian" denotes our combination of best NMT (#5) and SMT (#1) systems described in Section 5.
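The list merging described in this section can be sketched as below. The function `merge_nbest` and the dictionary keys are hypothetical names chosen for illustration; the sketch assumes that the n-best lists for one source sentence are available as plain lists of strings, and it keeps duplicates between the small-beam and large-beam Marian lists, as stated above.

```python
from typing import Dict, List, Union

Hypothesis = Dict[str, Union[str, int]]


def merge_nbest(
    marian_small: List[str], marian_large: List[str], moses: List[str]
) -> List[Hypothesis]:
    """Concatenate the n-best lists of one source sentence, tagging each hypothesis with its origin."""
    merged: List[Hypothesis] = []
    # Marian lists are concatenated as they are: duplicates are not removed.
    for hyp in marian_small + marian_large:
        merged.append({"text": hyp, "sys": 0})
    # System flag: 1 if the hypothesis comes from the Moses n-best list, 0 otherwise.
    for hyp in moses:
        merged.append({"text": hyp, "sys": 1})
    return merged
```

Keeping the origin of each hypothesis at merge time makes the system flag available later as a reranking feature.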

Reranking Framework and Features
We chose KB-MIRA as a rescoring framework and used a subset of the features proposed in Marie and Fujita (2018). As listed in Table 5, it includes the scores given by the 6 left-to-right NMT models used to perform ensemble decoding (see Section 3.1). We also used as features the scores given by right-to-left NMT models that we trained for each translation direction with the same parameters as the left-to-right NMT models. We selected the two right-to-left NMT models achieving the best BLEU and the best perplexity scores on the development data, giving us two other features for each translation direction. Since the Tr-En training parallel data are much smaller, we were able to perform one more right-to-left training run for Tr→En and En→Tr. We also experimented with the use of the scores computed from the NMT models trained for the backward translation direction. In total, we thus have 16 features, or 20 for Tr-En, computed from NMT models. All the following features are described in detail by Marie and Fujita (2018). We computed sentence-level translation probabilities using the lexical translation probabilities learned by mgiza during the training of our SMT systems. The two language models trained for SMT for each translation direction were also used to score the n-best translation hypotheses. To account for hypothesis length, we added the difference, and its absolute value, between the number of tokens in the translation hypothesis and the source sentence. As a consensus-based feature, we used the word posterior probabilities. For the Tr-En language pair only, we were also able to compute a phrase-based forced decoding score (Zhang et al., 2017) thanks to the small size of the phrase table learned for this language pair. Also for this language pair only, we computed the scores given to each hypothesis by the so-called minimum Bayes risk (MBR) decoding for n-best lists using two metrics: sBLEU and chrF++ (Popović, 2017).
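To make the consensus idea behind the MBR feature more concrete, here is a minimal sketch under strong simplifying assumptions. It scores each hypothesis by its average similarity to the other hypotheses in the list, with uniform weights, and it uses a bag-of-tokens F1 overlap (`token_f1`, a hypothetical helper) as a stand-in for sBLEU or chrF++. It does not reproduce the exact scores used in our systems.

```python
from collections import Counter
from typing import Callable, List


def token_f1(hyp: str, ref: str) -> float:
    """Bag-of-tokens F1 overlap; a simple stand-in for sBLEU or chrF++."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_scores(
    nbest: List[str], similarity: Callable[[str, str], float] = token_f1
) -> List[float]:
    """Average similarity of each hypothesis to all the others (uniform weights assumed)."""
    n = len(nbest)
    if n < 2:
        return [0.0] * n
    return [
        sum(similarity(h, other) for j, other in enumerate(nbest) if j != i) / (n - 1)
        for i, h in enumerate(nbest)
    ]
```

A hypothesis that shares many tokens with the rest of the list receives a high score, which is what makes this kind of feature consensus-based.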
The reranking framework was trained on n-best lists produced by decoding the same development data that we used to validate the training of our NMT systems and to tune the SMT model weights.
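Once feature weights are available, reranking amounts to selecting the hypothesis with the highest weighted sum of feature scores. The sketch below illustrates this with hand-set weights; in our systems the weights are learned with KB-MIRA, which is not shown here. The names `length_features` and `rerank` are hypothetical, and the sign convention of the length difference (hypothesis minus source) is an assumption.

```python
from typing import List, Sequence


def length_features(source: str, hypothesis: str) -> List[float]:
    """Length features: token-count difference (hypothesis minus source) and its absolute value."""
    diff = len(hypothesis.split()) - len(source.split())
    return [float(diff), float(abs(diff))]


def rerank(
    hypotheses: Sequence[str],
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> str:
    """Return the hypothesis with the highest weighted sum of its feature scores."""
    def score(vector: Sequence[float]) -> float:
        return sum(w * f for w, f in zip(weights, vector))

    best = max(range(len(hypotheses)), key=lambda i: score(features[i]))
    return hypotheses[best]
```

For example, with one model score followed by the two length features, a negative weight on the absolute length difference penalizes hypotheses whose length deviates from that of the source sentence.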

Results
The results of our systems computed for the Newstest2018 test set are presented in Table 6.
As expected, SMT systems greatly underperformed our best NMT systems, with differences in BLEU points ranging from 6.6 (En→Fi) to 13.7 (Tr→En). Reranking Moses 100-best hypotheses using NMT models (NMT-reranked) significantly improved the translation quality for all the translation directions. For Fi→En, Moses NMT-reranked performed only 0.1 BLEU points worse than Marian single (w/o backtr). This result demonstrates the ability of SMT to produce better translation hypotheses than its one-best hypothesis. Indeed, a better translation can easily be retrieved from the 100-best lists with the help of NMT models. Using back-translated data during training was very effective for Et-En, Fi-En, and Tr-En, with improvements ranging from 3.6 to 5.8 BLEU points. Improvements were smaller for Zh-En, especially for Zh→En with an improvement of only 1.0 BLEU points. This may be explained by the much larger parallel data already used to train systems for Zh-En. Another interesting finding is the relative inefficiency of using an ensemble of 3 models for NMT decoding with the transformer architecture over using a single model, as opposed to what was reported by most participants at WMT17 (Bojar et al., 2017) using RNN. For instance, for En→Et and En→Tr, ensemble decoding improved the translation quality by only 0.3 BLEU points.
Our combination of SMT and NMT significantly outperformed all our NMT systems for all translation directions. For instance, this combination brought improvements of 1.6 and 1.8 BLEU points for Et→En and En→Zh, respectively, over our best NMT systems.

Conclusion
We participated in eight translation directions, and for all of them we conducted experiments to compare the performance of SMT and NMT. While SMT significantly underperforms NMT, we showed that a simple combination of both approaches delivers the best results.