The University of Cambridge’s Machine Translation Systems for WMT18

The University of Cambridge submission to the WMT18 news translation task focuses on the combination of diverse models of translation. We compare recurrent, convolutional, and self-attention-based neural models on German-English, English-German, and Chinese-English. Our final system combines all neural models together with a phrase-based SMT system in an MBR-based scheme. We report small but consistent gains on top of strong Transformer ensembles.


Introduction
Encoder-decoder networks (Pollack, 1990; Chrisman, 1991; Forcada and Ñeco, 1997; Kalchbrenner and Blunsom, 2013) are the current prevailing architecture for neural machine translation (NMT). Various architectures have been used in the general framework of encoder and decoder networks, such as recursive auto-encoders (Pollack, 1990; Socher et al., 2011; Li et al., 2013), (attentional) recurrent models (Bahdanau et al., 2015; Luong et al., 2015; Chen et al., 2018), convolutional models (Kalchbrenner and Blunsom, 2013; Kaiser et al., 2017; Gehring et al., 2017), and, most recently, purely (self-)attention-based models (Vaswani et al., 2017; Ahmed et al., 2017; Shaw et al., 2018). In the spirit of Chen et al. (2018), we devoted our WMT18 submission to exploring the three most commonly used architectures: recurrent, convolutional, and self-attention-based models like the Transformer (Vaswani et al., 2017). Our experiments suggest that self-attention is the superior architecture on the tested language pairs, but it can still benefit from model combination with the other two. We show that using large batch sizes is crucial to Transformer training, and that the delayed SGD updates technique is useful to increase the batch size on limited GPU hardware. Furthermore, we also report gains from MBR-based combination with a phrase-based SMT system. We found this particularly striking as the SMT baselines are often more than 10 BLEU points below our strongest neural models. Our final submission ranks second in terms of BLEU score in the WMT18 evaluation campaign on English-German and German-English, and outperforms all other systems on a variety of linguistic phenomena on German-English (Avramidis et al., 2018).

System Combination
Stahlberg et al. (2017a) combined SMT and NMT in a hybrid system with a minimum Bayes-risk (MBR) formulation which has proven useful even for practical industry-level MT. Our system combination scheme is a generalization of this approach to more than two systems.
Suppose we want to combine $q$ models $M_1, \dots, M_q$. We first divide the models into two groups by selecting a $p$ with $1 \le p \le q$. We refer to scores from the first group $M_1, \dots, M_p$ as full posterior scores and from the second group $M_{p+1}, \dots, M_q$ as MBR-based scores. Full posterior models contribute to the combined score with their complete posterior of the full translation. In contrast, models in the second group only provide the evidence space for estimating the probability of n-grams occurring in the translation. Full-posterior models need to assign scores via the standard left-to-right factorization of neural sequence models:

$$\log P(y|x, M_i) = \sum_{t=1}^{T} \log P(y_t | y_1^{t-1}, x, M_i) \qquad (1)$$

for a target sentence $y = y_1^T$ of length $T$ given a source sentence $x$, for all $i \le p$. For example, all left-to-right neural models in this work can be used as full posterior models, but the right-to-left models (Sec. 3) and SMT cannot. We combine full-posterior scores log-linearly, and bias the combined score $S(y|x)$ towards low-risk hypotheses with respect to the MBR-based group as suggested by Stahlberg et al. (2017a, Eq. 4):

$$S(y|x) = \sum_{t=1}^{T} \Big( \underbrace{\sum_{i=1}^{p} \lambda_i \log P(y_t | y_1^{t-1}, x, M_i)}_{\text{Full posterior scores}} + \underbrace{\sum_{j=p+1}^{q} \lambda_j \sum_{n} P(y_{t-n}^{t} | x, M_j)}_{\text{MBR-based n-gram scores}} \Big) \qquad (2)$$

where $\lambda_1, \dots, \lambda_q$ are interpolation weights. Eq. 2 also describes how to use beam search in this framework, as hypotheses can be built up from left to right due to the outer sum over time steps. The MBR-based models contribute via the probability $P(y_{t-n}^{t} | x, M_j)$ of an n-gram $y_{t-n}^{t}$ given the source sentence $x$. Posteriors in this form are commonly used for MBR decoding in SMT (Kumar and Byrne, 2004; Tromble et al., 2008), and can be extracted efficiently from translation lattices using counting transducers (Blackwood et al., 2010). For our neural models we run beam search with beam size 15 and compute posteriors over the 15-best list. We smooth all n-gram posteriors as suggested by Stahlberg et al. (2017a).
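The combination of log-linear full-posterior scores with MBR-based n-gram scores can be illustrated with a minimal Python sketch. It is not the implementation used for the submission: the function names, the softmax normalization of n-best scores, and the default `max_order=4` are assumptions made for illustration, and the posterior smoothing step is omitted.

```python
import math
from collections import defaultdict


def ngram_posteriors(nbest, max_order=4):
    """Estimate n-gram posteriors from an n-best list.

    nbest: list of (tokens, log_score) pairs.
    Assumption: hypothesis scores are softmax-normalized over the list, and
    the posterior of an n-gram is the total probability of all hypotheses
    that contain it at least once.
    """
    best = max(score for _, score in nbest)
    norm = sum(math.exp(score - best) for _, score in nbest)
    posteriors = defaultdict(float)
    for tokens, score in nbest:
        prob = math.exp(score - best) / norm
        seen = set()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        for ngram in seen:
            posteriors[ngram] += prob
    return dict(posteriors)


def combined_score(tokens, step_logprobs, ngram_posts,
                   lambdas_full, lambdas_mbr, max_order=4):
    """Combined score of one hypothesis, accumulated over time steps.

    step_logprobs[i][t]: log P(y_t | y_<t, x, M_i) of full-posterior model i.
    ngram_posts[j]: n-gram posterior dict of MBR-based model j.
    """
    total = 0.0
    for t in range(len(tokens)):
        # Log-linear combination of the full-posterior models.
        for lam, logprobs in zip(lambdas_full, step_logprobs):
            total += lam * logprobs[t]
        # Posteriors of all n-grams that end at position t.
        for lam, post in zip(lambdas_mbr, ngram_posts):
            for n in range(1, max_order + 1):
                if t - n + 1 >= 0:
                    ngram = tuple(tokens[t - n + 1:t + 1])
                    total += lam * post.get(ngram, 0.0)
    return total
```

Because the score is a sum over time steps, the same per-step terms could be added incrementally to partial hypotheses during beam search.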
Note that our generalization to more than two systems can still be seen as an instance of the original scheme from Stahlberg et al. (2017a) by viewing the first group $M_1, \dots, M_p$ as an ensemble and the evidence space from the second group $M_{p+1}, \dots, M_q$ as a mixture model.
The performance of our system combinations depends on the correct calibration of the interpolation weights $\lambda_1, \dots, \lambda_q$. We first tried to use n-best or lattice MERT (Och, 2003) to find interpolation weights, but these techniques were not effective in our setting, possibly due to the lack of diversity and depth in n-best lists from standard beam search. Therefore, we tune on the first-best translation using Powell's method (Powell, 1964) with a line search algorithm similar to golden-section search (Kiefer, 1953).
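To make the tuning idea concrete, the following Python sketch runs a golden-section line search along one weight at a time. It is a simplified stand-in and not the procedure used for the submission: full Powell's method also updates its set of search directions, whereas this sketch only sweeps the coordinate axes. The names `golden_section_min` and `tune_weights`, the search interval, and the toy loss are hypothetical; in practice the loss would be something like negative BLEU of the first-best translations on the development set, which is neither smooth nor guaranteed to be unimodal.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimize a unimodal 1-D function f on the interval [lo, hi]."""
    while hi - lo > tol:
        c = hi - INV_PHI * (hi - lo)
        d = lo + INV_PHI * (hi - lo)
        if f(c) < f(d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
    return (lo + hi) / 2


def tune_weights(loss, weights, lo=0.0, hi=2.0, sweeps=5):
    """Coordinate-wise line search over the interpolation weights."""
    w = list(weights)
    for _ in range(sweeps):
        for k in range(len(w)):
            def along_k(value, k=k):
                # Loss as a function of weight k with all others fixed.
                return loss(w[:k] + [value] + w[k + 1:])
            w[k] = golden_section_min(along_k, lo, hi)
    return w


# Toy quadratic loss standing in for a development-set objective.
toy_loss = lambda w: (w[0] - 0.3) ** 2 + (w[1] - 1.2) ** 2
tuned = tune_weights(toy_loss, [1.0, 1.0])
```

For clarity this version re-evaluates the loss at both interior points in every iteration; a production line search would cache one of the two evaluations, since each one requires a full decoding pass.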

Right-to-left Translation Models
Standard NMT models generate the translation from left to right on the target side. Recent work has shown that incorporating models which generate the target sentence in reverse order (i.e. from right to left) can improve translation quality (Liu et al., 2016; Li et al., 2017; Sennrich et al., 2017; Hassan et al., 2018). Right-to-left models are often used to rescore n-best lists from left-to-right models. However, we could not find improvements from rescoring in our setting. Instead, we extract n-gram posteriors from the R2L model, reverse them, and use them for system combination as described in Sec. 2.
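Reversing the n-gram posteriors amounts to reversing the token order of every n-gram while keeping its probability. A minimal sketch, assuming the posteriors are stored as a dictionary from token tuples to probabilities (the representation and the function name are illustrative, not taken from the actual toolchain):

```python
def reverse_ngram_posteriors(r2l_posteriors):
    """Map n-grams from a right-to-left model into left-to-right order."""
    return {tuple(reversed(ngram)): prob
            for ngram, prob in r2l_posteriors.items()}


# The R2L model sees "house the" where the L2R output reads "the house".
r2l = {("house", "the"): 0.8, ("house",): 0.9}
l2r = reverse_ngram_posteriors(r2l)
```

After this step the R2L posteriors can be treated like those of any other MBR-based model in the second group.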

Data Selection
We ran language detection (Nakatani, 2010) and gentle length filtering based on the number of characters and words in a sentence on all available monolingual and parallel data in English, German, and Chinese. Due to the high level of noise in the ParaCrawl corpus and its large size compared to the rest of the English-German data, we additionally filtered ParaCrawl more aggressively with the following rules:
• No words contain more than 40 characters.
• Sentences must not contain HTML tags.
• The minimum sentence length is 4 words.
• The character ratio between source and target must not exceed 1:3 or 3:1.
• Source and target sentences must be equal after stripping out non-numerical characters.
• Sentences must end with punctuation marks.
This additional filtering reduced the size of ParaCrawl from originally 36M sentences to 19M sentences after language detection, and to 11M sentences after applying the more aggressive rules.
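The aggressive ParaCrawl rules can be sketched as a single predicate over a sentence pair. This is an illustration under stated assumptions rather than the filter actually used: we assume the per-sentence rules are applied to both source and target, that whitespace splitting defines words, that the ratio is measured in characters of the raw strings, and that the set of accepted final punctuation marks and the HTML tag pattern below are reasonable approximations.

```python
import re

HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")  # assumed tag pattern
NON_DIGIT = re.compile(r"\D")
END_PUNCT = (".", "!", "?", '"', "'", ")")   # assumed punctuation set


def keep_pair(src, tgt):
    """Return True if the sentence pair passes all filtering rules."""
    for sent in (src, tgt):
        words = sent.split()
        if len(words) < 4:                    # minimum length of 4 words
            return False
        if any(len(w) > 40 for w in words):   # no word longer than 40 chars
            return False
        if HTML_TAG.search(sent):             # no HTML tags
            return False
        if not sent.rstrip().endswith(END_PUNCT):  # must end with punctuation
            return False
    # Character ratio must stay within 1:3 and 3:1.
    if len(src) > 3 * len(tgt) or len(tgt) > 3 * len(src):
        return False
    # Digits must agree after stripping all non-numerical characters.
    if NON_DIGIT.sub("", src) != NON_DIGIT.sub("", tgt):
        return False
    return True
```

The digit check is the strictest of these rules in practice, since it rejects any pair whose numbers differ in value, order, or formatting.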
For backtranslation (Sennrich et al., 2016a) we selected 20M sentences from News Crawl 2017. We used a single Transformer (Vaswani et al., 2017) for generating the synthetic source sentences. We over-sampled (Sennrich et al., 2017) WMT data by a factor of 2, except the ParaCrawl data and the UN data on Chinese-English, to roughly match the size of the synthetic data. Tabs. 1 and 2 summarize the sizes of our final training corpora.

Preprocessing
We preprocess our English and German data with Moses tokenization, punctuation normalization, and truecasing. On Chinese we first used the WMT tokenizeChinese.py script and separated segments of Chinese and Latin text from each other. Then, we removed whitespace between Chinese characters and tokenized Chinese segments with Jieba and the rest with mteval-v13a.pl. For our neural models we apply byte-pair encoding (Sennrich et al., 2016b, BPE) with 32K merge operations. We use joint BPE vocabularies on English-German and German-English and separate source/target encodings on Chinese-English.

Model Hyper-Parameters
We use 1024-dimensional embedding and output projection layers in all architectures. The embeddings are shared between encoder and decoder on English-German and German-English, but not on Chinese-English.
LSTM: For our recurrent models we adapted the TensorFlow seq2seq tutorial code base (Luong et al., 2017) for use inside the Tensor2Tensor library. We roughly followed the UEdin WMT17 submission (Sennrich et al., 2017)

Training
We train vanilla phrase-based SMT systems and extract 1000-best lists of unique translation candidates, from which n-gram posteriors are calculated.
We make extensive use of the delayed SGD updates technique that we already applied successfully to syntax-based NMT. Delaying SGD updates allows us to choose the effective batch size arbitrarily, even on limited GPU hardware. Large batch training has received some attention in recent research (Smith et al., 2017; Neishi et al., 2017) and has been shown to be particularly useful for training the Transformer architecture with the Tensor2Tensor framework (Popel and Bojar, 2018). We support these findings in Tab. 4. Our technical infrastructure allows us to train on four P100 GPUs simultaneously, which limits the number of physical GPUs to g = 4 and, due to the GPU memory, the batch size to b = 2048. Thus, the maximum possible effective batch size without delaying SGD updates is g · b = 8192. Training with delay factor d accumulates gradients over d batches and applies the optimizer update rule on the accumulated gradients. This allows us to scale up the effective number of GPUs to 16 and improve the BLEU score significantly (29.5 vs. 30.3). Note that training regimens are equivalent if their effective batch size is the same, i.e. training on 4 physical GPUs with d = 4 is mathematically equivalent to training on 16 GPUs without delaying SGD updates. Tab. 5 lists our training setups for the neural architectures used in this work. These training hyper-parameters were chosen empirically. In particular, we did not find improvements by increasing the number of effective GPUs for SliceNet or by longer LSTM training.
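The equivalence between delayed updates and a larger batch can be checked on a toy problem. The sketch below is not our training code: it uses plain SGD on a one-parameter least-squares model, assumes equally sized batches, and assumes that the accumulated gradients are averaged before the update; all function names are illustrative.

```python
def batch_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def delayed_sgd_step(w, batches, lr):
    """Accumulate gradients over d = len(batches) batches, then update once."""
    accumulated = 0.0
    for batch in batches:
        accumulated += batch_gradient(w, batch)
    # Assumption: average over the d batches so that the update matches
    # a single batch that is d times larger.
    return w - lr * accumulated / len(batches)


def effective_gpus(g, d):
    """Effective number of GPUs with g physical GPUs and delay factor d."""
    return g * d


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.2), (8.0, 15.9)]
small_batches = [data[i:i + 2] for i in range(0, len(data), 2)]  # d = 4

w_delayed = delayed_sgd_step(0.5, small_batches, lr=0.01)
w_large = delayed_sgd_step(0.5, [data], lr=0.01)  # one batch, d = 1
```

Up to floating-point rounding, `w_delayed` and `w_large` coincide, which mirrors the statement that 4 physical GPUs with d = 4 behave like 16 GPUs without delay.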
We use news-test2017 as the development set on all language pairs to tune the model interpolation weights λ (Eq. 2) and the scaling factor for length normalization.

Decoding
We use the beam search strategy with beam size 8 of the SGNMT decoder (Stahlberg et al., 2017b) in all our experiments. We apply length normalization (Bahdanau et al., 2015) on German-English and Chinese-English but not on English-German. As outlined in Sec. 2, we either use full posteriors or MBR-style n-gram posteriors from our individual models. SMT n-gram scores are extracted as described by Blackwood et al. (2010) using HiFST's lmbr tool. We use SGNMT's ngram output format to extract n-gram scores from our neural models.

Results
On English-German and German-English news-test2014 we compute cased BLEU scores with Moses' multi-bleu.pl script on tokenized output to be comparable with prior work (Kaiser et al., 2017; Gehring et al., 2017; Vaswani et al., 2017; Chen et al., 2018). On all other test sets we use mteval-v13a.pl to be comparable to the official cased WMT scores. First, we will discuss our experiments with a single architecture, i.e. single systems and ensembles of two systems with the same architecture. Tab. 6 compares the architectures on all test sets. PBMT as a single system is clearly inferior to all neural systems. Ensembling neural systems helps for all architectures across the board. LSTM is usually slightly better than the convolutional SliceNet, but is much slower to train and decode (cf. Tab. 3). Note that our LSTM 2-ensemble is on par with the best BLEU score in WMT17 (Sennrich et al., 2017), which was also based on recurrent models. Transformer architectures outperform LSTMs and SliceNets on all test sets. The right-to-left Transformer is usually slightly worse, and the Transformer with relative positioning slightly better, than the standard Transformer setup.
Tab. 7 summarizes our system combination results with multiple architectures. Adding LSTM and SliceNet as full-posterior models to an ensemble of a Transformer and a Relative Transformer does not improve the BLEU score (rows 6 vs. 7). We see very slight improvements when we use these models to extract n-gram scores instead (rows 6 vs. 8). We report further gains by using MBR-based n-gram scores from the right-to-left Transformer and the PBMT system. The improvements from adding PBMT are rather small, but we still found them surprising given that the PBMT baseline is usually more than 10 BLEU points worse than our best single neural model. We list the performance of our submitted systems on all test sets in Tab. 8.


Related Work
There is a large body of research comparing NMT and SMT (Schnober et al., 2016; Toral and Sánchez-Cartagena, 2017; Koehn and Knowles, 2017; Menacer et al., 2017; Dowling et al., 2018; Bentivogli et al., 2016, 2018). Most studies have found superior overall translation quality of NMT models in most settings, but complementary strengths of both paradigms. Therefore, the literature about hybrid NMT-SMT systems is also vast, ranging from rescoring and reranking methods (Neubig et al., 2015; Stahlberg et al., 2016; Khayrallah et al., 2017; Grundkiewicz and Junczys-Dowmunt, 2018; Avramidis et al., 2016; Marie and Fujita, 2018) and MBR-based formalisms (Stahlberg et al., 2017a) to NMT assisting SMT (Junczys-Dowmunt et al., 2016b; Du and Way, 2017) and SMT assisting NMT (Niehues et al., 2016; He et al., 2016; Long et al., 2016; Wang et al., 2017; Dahlmann et al., 2017; Zhou et al., 2017). We confirm the potential of hybrid systems by reporting gains on top of very strong neural ensembles. Ensembling is a well-known technique in NMT to improve system performance. However, ensembles usually consist of multiple models of the same architecture. In this paper, we compare and combine three very different architectures (recurrent, convolutional, and self-attention-based) in two different ways (full posterior and MBR-based), and find that combination with MBR-based n-gram scores is superior.

Conclusion
We have described our WMT18 submission, which achieves very competitive BLEU scores on all three language pairs (English-German, German-English, and Chinese-English) and significantly higher accuracies in a variety of linguistic phenomena compared to other submissions (Avramidis et al., 2018). Our system combines three different neural architectures with a traditional PBMT system. We showed that our MBR-based scheme is effective for combining these diverse models of translation, and that adding the PBMT system to the mix of neural models still yields gains although it is much worse as a stand-alone system.