The University of Edinburgh’s Submissions to the WMT18 News Translation Task

The University of Edinburgh made submissions to all 14 language pairs in the news translation task, with strong performances in most pairs. We introduce a new RNN variant, mixed RNN/Transformer ensembles, data selection and weighting, and extensions to back-translation.


Introduction
For the WMT18 news translation task, we were the only team to make submissions to all 14 language pairs. Our submissions built on our strong results in the WMT16 and WMT17 tasks (Sennrich et al., 2016a), in that we used neural machine translation (NMT) with byte-pair encoding (BPE) (Sennrich et al., 2016c), back-translation (Sennrich et al., 2016b) and deep RNNs. For this year's submissions we experimented with new architectures and new ways of handling data. In brief, the innovations that we introduced this year are:
Architecture This year we experimented with the Transformer architecture (Vaswani et al., 2017), as implemented in Marian, as well as introducing a new variant of the deep RNN architecture (Section 2.3).
Data selection and weighting For some language pairs, we experimented with different data selection schemes, motivated by the introduction of the noisy ParaCrawl corpora to the task (Section 2.1). We also applied weighting of different corpora to most language pairs, particularly DE↔EN (Section 3.5).
Extensions to Back-translation For TR↔EN (Section 3.7) we used copied monolingual data (Currey et al., 2017a) and iterative back-translation.
In-domain Fine-tuning For RU↔EN (Section 3.6) we fine-tuned using a specially constructed "in-domain" data set.

System Details
In this section we describe the general properties of our systems, as well as some novel approaches that we tried this year such as data selection and a variant on the GRU-based RNN architecture. The specifics of our submissions for each language pair are described in Section 3.

Data and Selection
All our systems were constrained in the sense that they only used the supplied parallel data (including ParaCrawl) for training. We also used the monolingual news crawls to create extra synthetic parallel data by back-translation, for all language pairs, and by copying monolingual data for TR↔EN. During training we generally used newsdev2016 or newstest2016 for validation, and newstest2017 for development testing (i.e. model selection), except for ZH↔EN and ET↔EN, where we used the recent newsdev sets instead.
All parallel data contains a certain amount of noise, and the problem was exacerbated this year since the organisers provided ParaCrawl corpora (https://paracrawl.eu) as additional training data for most language pairs (ParaCrawl was not available for EN↔TR and EN↔ZH). On inspection, we could see that these crawled corpora were quite noisy, containing mis-aligned sentence pairs, text in the wrong language, and garbled encodings. In early experiments, we saw increases in BLEU from including ParaCrawl in the training data for ET→EN and FI→EN, but we decided to see if we could improve performance further by applying data filtering. We experimented with different filtering methods, described below.
Language Identifier Filtering This was applied to the CS↔EN and DE↔EN corpora, based on observations that CzEng and ParaCrawl both contain sentence pairs in the "wrong" language. For CS↔EN we applied langid (Lui and Baldwin, 2012) to both sides of the data, removing any sentence pairs whose English side was not labelled as English, or whose Czech side was not labelled as Czech, Slovak or Slovenian. For DE↔EN, we applied langid to ParaCrawl only, retaining just those pairs where each side was identified as the 'correct' language. This reduced the size of the ParaCrawl corpus from about 36 million sentence pairs to ca. 18 million.
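A minimal sketch of this kind of filtering. The `classify` function below is a self-contained toy stand-in (it just reads a language marker from the text) so the example runs anywhere; the submitted systems used `langid.classify` from the langid package, which returns a `(language, confidence)` pair with the same shape.

```python
# Sketch of language-identifier filtering for a parallel corpus.
# `classify` is a toy stand-in for langid.classify; replace it with
# the real classifier in practice.

def classify(text):
    # Hypothetical placeholder: "detect" the language from a marker
    # prefix ("en:...") so the sketch is self-contained.
    code, _, _ = text.partition(":")
    return code, 1.0

# For CS<->EN, the Czech side was kept if labelled cs, sk or sl.
ACCEPT_SRC = {"en"}
ACCEPT_TGT = {"cs", "sk", "sl"}

def filter_pairs(pairs):
    kept = []
    for en_side, cs_side in pairs:
        if (classify(en_side)[0] in ACCEPT_SRC
                and classify(cs_side)[0] in ACCEPT_TGT):
            kept.append((en_side, cs_side))
    return kept

pairs = [("en:Hello .", "cs:Ahoj ."),
         ("en:Hello .", "de:Hallo .")]   # wrong-language pair, dropped
print(filter_pairs(pairs))
```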

Data Selection with Translation Perplexity
We applied this to ET↔EN and FI↔EN. To perform the filtering, we first trained shallow RNN models in both directions, using all the permitted parallel data except ParaCrawl. We then used these models to score the ParaCrawl sentence pairs, normalising by target sentence length, and adding the scores from the forward and reverse models. We then ranked the ParaCrawl sentence pairs by this score, and performed a grid search across different thresholds (from 0% to 100% in 10-point intervals) of the ParaCrawl data, in addition to the other parallel data. We trained a shallow RNN system using the data selected at each of these thresholds, and tested it on newstest2017 (for FI→EN) or half of newsdev2018 (for ET→EN).
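The scoring and selection step can be sketched as follows. The model scorers are stubbed out with a toy length-mismatch penalty here; in the actual setup the scores came from the shallow RNN translation models in each direction.

```python
def rank_by_dual_score(pairs, fwd_logprob, bwd_logprob):
    """Rank sentence pairs by length-normalised forward + backward
    translation-model scores (higher = more plausible pair).
    fwd_logprob / bwd_logprob are stand-ins for the RNN scorers."""
    scored = []
    for src, tgt in pairs:
        f = fwd_logprob(src, tgt) / max(len(tgt.split()), 1)
        b = bwd_logprob(tgt, src) / max(len(src.split()), 1)
        scored.append((f + b, (src, tgt)))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored

def select_top(scored, fraction):
    """Keep the top `fraction` (0-1) of ranked pairs; the grid search
    in the text sweeps this threshold in 10-point steps."""
    n = int(len(scored) * fraction)
    return [pair for _, pair in scored[:n]]

# Toy scorers: penalise length mismatch (purely illustrative).
toy = lambda s, t: -abs(len(s.split()) - len(t.split()))
scored = rank_by_dual_score([("a b", "x y"), ("a b c d", "x")], toy, toy)
print(select_top(scored, 0.5))  # the length-matched pair survives
```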
The results of the filtering are shown in Figure 1. Based on these results, we chose a threshold of 0.3 for ET↔EN (which gives us +0.8 BLEU), but used the whole of ParaCrawl for FI→EN.

Alignment-based Filtering
We applied this to the DE→EN parallel data, after langid filtering. We word-aligned all pre-cleaned parallel data with fastalign (Dyer et al., 2013) and computed the geometric mean of forward and backward alignment probabilities as a coarse estimate of how good a translation pair the respective sentence pair is.
All parallel data was sorted in descending order of this "plausible translation" score, and a neural system was trained on this data, in this order. In order to determine a threshold for data filtering, we monitored the performance on a validation set (newstest2016) and observed the point where translation quality started to deteriorate. We used the translation plausibility score at this point as the threshold for selecting data for training the final systems.
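Assuming per-sentence forward and backward alignment log-probabilities have already been extracted (the extraction from fastalign output is not shown here), the plausibility score and the ranking can be sketched as:

```python
import math

def plausibility(fwd_logprob, bwd_logprob):
    """Geometric mean of forward and backward alignment probabilities,
    sqrt(p_f * p_b), computed in log space for numerical stability."""
    return math.exp(0.5 * (fwd_logprob + bwd_logprob))

def sort_by_plausibility(scored_pairs):
    """scored_pairs: [((src, tgt), fwd_lp, bwd_lp), ...].
    Returns pairs in descending order of the plausibility score, the
    order in which the threshold is then chosen on a validation set."""
    return sorted(scored_pairs,
                  key=lambda x: plausibility(x[1], x[2]),
                  reverse=True)
```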

Preprocessing
For most language pairs, our preprocessing setup consisted of the Moses pipeline (Koehn et al., 2007) of normalisation, tokenisation and truecasing, followed by byte-pair encoding (BPE) (Sennrich et al., 2016c). We generally applied joint BPE, with the number of merge operations set on a per-pair basis, detailed in Section 3. Different pipelines were used for processing the two languages written in non-Latin scripts (i.e. Chinese and Russian), also explained in Section 3. For some language pairs (those including Czech, Estonian, Finnish and German) we used the preprocessed data provided by the organisers (which is preprocessed up to truecasing), whilst for the others we started with the raw data.
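To illustrate the BPE step, here is a minimal merge-learning loop in the style of Sennrich et al. (2016c). It is greatly simplified (word-level frequencies, a handful of merges); the real systems learned joint BPE over both languages with tens of thousands of merge operations using the subword-nmt tooling.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a word->frequency dict.
    Words are represented as tuples of symbols ending in '</w>'."""
    vocab = Counter({tuple(w) + ("</w>",): c for w, c in word_freqs.items()})
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

print(learn_bpe({"low": 5, "lower": 2, "lowest": 3}, 2))  # -> [('l', 'o'), ('lo', 'w')]
```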

Model Architecture
For this submission we considered two types of sequence-to-sequence architectures: a Transformer (Vaswani et al., 2017) and a deep RNN, specifically the BiDeep GRU encoder-decoder (Miceli Barone et al., 2017). Both architectures are implemented in the Marian open-source neural machine translation framework. For the Transformer architecture we used the "wmt2017-transformer" setup from the Marian example collection.
We extended the RNN with multi-head and multi-hop attention. Our multi-head attention is similar to that of the Transformer (Vaswani et al., 2017), but uses an MLP attention mechanism with a single tanh hidden layer followed by one softmax layer for each attention head. We further include an optional projection layer, with layer normalisation, on the attended context in order to avoid increasing the total size of the attended context.
Let C ∈ R^{N_s × d_e} be the input sentence representation produced by the encoder, where N_s is the source sentence length and d_e is the top-level bidirectional encoder state dimension. Let s ∈ R^{d_d} be an internal decoder state at some step. Then for source sentence position i we compute a vector of M attention weights, where M is the number of attention heads:

α_i = exp(e_i) / Σ_{j=1}^{N_s} exp(e_j),   e_i = W_2 tanh(W_1 s + W_0 C_i) ∈ R^M

where we assume that exponentiation is applied element-wise. Then we compute the attended context vector as:

c = CAT_{r=1}^{M} PROJ_r ( Σ_{i=1}^{N_s} α_{i,r} C_i )

where CAT_{r=1}^{M} is vector concatenation over the attention heads and each PROJ_r is either the identity function or a trainable linear layer followed by layer normalization.
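The shape of the multi-head attention described above can be shown in a small pure-Python sketch. The per-position energy function `score` is left as a callable (in the text it is an MLP with one tanh hidden layer); PROJ_r is taken as the identity, so the context is simply the concatenation of the per-head weighted sums.

```python
import math

def multi_head_attention(C, s, score):
    """C: list of N_s encoder states (each a list of floats).
    s: decoder state (passed through to `score`).
    score(s, c): returns a list of M per-head energies for one
    source position (stand-in for the MLP attention network).
    Returns the attended context: one softmax-weighted sum of C per
    head, concatenated over heads (PROJ_r = identity)."""
    energies = [score(s, c) for c in C]          # N_s x M
    num_heads = len(energies[0])
    context = []
    for r in range(num_heads):                   # one softmax per head
        exps = [math.exp(e[r]) for e in energies]
        z = sum(exps)
        alphas = [x / z for x in exps]
        head = [sum(a * c[d] for a, c in zip(alphas, C))
                for d in range(len(C[0]))]
        context.extend(head)
    return context
```

With uniform energies every head attends equally to all positions, so each head's context is the mean of the encoder states.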
Multi-hop attention is similar to Gehring et al. (2017), except that we do not use convolutional layers, but instead we introduce additional attention hops between the layers of the deep transition GRU in the decoder. In our implementation multi-head and multi-hop attention can be combined, in which case each attention hop is a separate multi-head attention mechanism.
Let L_t ≥ 2 be the decoder base recurrence depth and H < L_t be the number of attention hops. Then the base level of the decoder is defined as:

s_{t,1} = GRU_1(y_{t-1}, s_{t-1,L_t})
s_{t,k} = GRU_k(ATT_k(C, s_{t,k-1}), s_{t,k-1})   for 2 ≤ k ≤ H+1
s_{t,k} = GRU_k(0, s_{t,k-1})   for H+1 < k ≤ L_t

where ATT_k is the (multi-head) attention mechanism of hop k.

Submitted Systems

Chinese ↔ English
For ZH↔EN we preprocessed the parallel data, which consists of NewsCommentary v13, UN data and CWMT, as follows. We first desegmented all the Chinese data and resegmented it using Jieba. We then removed any sentences that did not contain Chinese characters on the Chinese side, or contained only Chinese characters on the English side. We also removed all sentences containing links, sentences longer than 50 words, and sentences where the number of tokens on either side was more than 1.3 times that on the other side, following Hassan et al. (2018). After preprocessing the corpus size was 23.6M sentences. We then applied BPE using 18,000 merge operations and used the top 18,000 BPE segments as vocabulary. We augmented our data with back-translated data, consisting of 8.6M sentences for EN→ZH and 19.7M for ZH→EN. We trained using the BiDeep architecture with multi-head attention with 1 hop and 3 heads. We decoded using an ensemble of 5 L2R systems with a beam size of 12 for EN→ZH, and an ensemble of 6 L2R systems with a beam size of 12 for ZH→EN. Due to time constraints, we were not able to train any of the systems to convergence.

Czech ↔ English
After preprocessing, language filtering (see Sections 2.1 and 2.2), and removing any parallel sentences where neither side contains an ASCII letter, we were left with around 50M sentence pairs. We then learned a joint BPE model over the source and target corpora, with 89,500 merge operations, and applied it using a vocabulary threshold of 50.
For back-translation, we trained shallow RNN models in both directions without ParaCrawl or the langid-based corpus cleaning, and used them to decode with a beam size of 5. We back-translated the English 2017 news-crawl, and the Czech news-crawls from 2016 and 2017, removing lines with more than 50 tokens, to create additional corpora of approximately 26.5M sentences for CS→EN and 13M for EN→CS. Initially we tried simply concatenating each of these corpora with the natural parallel data, but this gave poor results for CS→EN, so we over-sampled the synthetic data 2 times for that pair to give approximately equal amounts of synthetic and natural data. For EN→CS, we did not see any benefit from equalising the synthetic/natural ratio, so we stuck with simple concatenation.
For the submitted systems, we trained BiDeep RNN models using Marian. In addition to the default Marian settings, we used layer normalisation, tied embeddings, label smoothing (0.1), exponential smoothing and no dropout, together with multi-head/multi-hop attention with 2 heads and 3 hops. We trained on 4 GPUs with a working memory of 4000MB on each, validating every 2,500 updates, and took the final exponentially smoothed model. We trained 4 left-right (L2R) and 4 right-left (R2L) models for each language pair; due to time constraints we did not train to convergence, stopping each run after about 250k-350k updates. We decoded using an ensemble of the 4 L2R systems with a beam size of 50, then reranked with the 4 R2L systems. For both language pairs we normalised probabilities by target length, raising the length to a power of 0.8 for CS→EN.
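The length normalisation used when ranking hypotheses amounts to dividing a hypothesis's log-probability by its token length raised to a power α (α = 0.8 for CS→EN above). A minimal sketch, with function names that are ours rather than Marian's:

```python
def normalized_score(logprob, target_len, alpha=1.0):
    """Length-normalised hypothesis score: logprob / |y|**alpha.
    alpha=1 is plain per-token normalisation; smaller alpha
    penalises long hypotheses less."""
    return logprob / (target_len ** alpha)

def rerank(hyps, alpha):
    """hyps: [(hypothesis_tokens, logprob), ...], e.g. from an L2R
    ensemble (R2L scores can be added into logprob beforehand).
    Returns hypotheses best-first under the normalised score."""
    return sorted(hyps,
                  key=lambda h: normalized_score(h[1], len(h[0]), alpha),
                  reverse=True)
```

Note how α changes the winner: a long hypothesis with a slightly worse per-token score can overtake a short one as α shrinks.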

Estonian ↔ English
As explained in Section 2.1, we used a filtered ParaCrawl for this pair, and, as for CS↔EN, we removed any sentence pairs where either side contained no ASCII letters. We trained and applied a BPE model with 89,500 merge operations and a vocabulary threshold of 50. We split newsdev2018 randomly and used one half for validation and the other half for development testing.
The models used for back-translation were shallow RNNs trained on the parallel data without ParaCrawl. We translated the 2017 English news-crawl to Estonian, and all the Estonian news-crawls to English. We also experimented with the BigEst Estonian corpus, but did not see any improvement when using it to produce synthetic data, nor when we selected 50% of it using Moore-Lewis selection (Moore and Lewis, 2010) with the news-crawl data as in-domain. Our final natural parallel corpus contains approximately 1.2M sentences, and the synthetic corpora are about 2.9M sentences for EN→ET and 26.5M for ET→EN. To create the final training corpora, we combined natural and synthetic data, over-sampling the natural data 3 times for EN→ET and 23 times for ET→EN. Again we applied BPE, trained on the Europarl, Rapid and selected ParaCrawl corpora, with the same parameters as before.
Our submitted system was an ensemble of 4 left-right systems, reranked with 4 right-left systems, with each ensemble consisting of 2 BiDeep RNNs and 2 Transformers. The RNNs used layer normalisation, tied embeddings, label smoothing (0.1), exponential smoothing, RNN dropout (0.2), source and target word dropout (0.1) and multi-head/multi-hop attention with 2 heads and 3 hops. We trained on 4 GPUs with a working memory of 4000MB on each, validating every 2,500 updates. The RNNs were not trained to convergence (due to time constraints) but stopped after between 300k and 500k steps. The Transformer models used the settings from the Marian examples, without layer normalisation, with a working memory of 9500MB (on each of 4 GPUs), validating every 2,500 updates, and detecting convergence with a patience of 10. We also applied source and target word dropout to the Transformer models. They generally converged in under 200k updates. As for CS↔EN, we used exponentially smoothed models. Decoding was the same as for CS↔EN, with normalisation by target length.

Finnish ↔ English
For FI↔EN, after pre-processing we removed sentence pairs where either side contained no ASCII characters, then trained and applied a BPE model with 89,500 merge operations and a vocabulary threshold of 50. As reported in Section 2.1, we used the whole of ParaCrawl in our system.
For back-translation, we trained shallow RNN models in each direction, without ParaCrawl. We back-translated with a beam size of 5, translating the English 2017 news-crawl to Finnish, and the Finnish 2014-2017 news-crawls to English. Before back-translation, we removed any sentences longer than 50 tokens. For EN→FI, we combined 3.2M natural parallel sentence pairs, over-sampled 5 times, with 14.6M sentences of synthetic data. For FI→EN, we combined the same natural corpus (over-sampled 8 times) with a 26.5M-sentence corpus of synthetic parallel data.
We created the submitted systems in the same way as the ET↔EN systems (Section 3.3), and again we were not able to train the deep RNNs to convergence. The only difference is that for EN→FI we normalised by the target length raised to a power of 0.5, after running a grid search over different normalisations on the development set.

German ↔ English
Our efforts focussed on extracting the most useful data from ParaCrawl. After preprocessing and selection (see Section 2.1), we trained and applied joint BPE models with 35,000 merge operations and a threshold of 50. To balance the data, we blended it in the mix shown in Table 1, by randomly sampling from each corpus (without replacement), resetting (i.e., replacing all items at once) each corpus when it became exhausted, for a total of 40 million sentence pairs.
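The blending scheme above, sampling without replacement within each corpus and resetting a corpus once it is exhausted, can be sketched as follows (the per-corpus weights stand in for the proportions of Table 1):

```python
import random

def blend(corpora, weights, total, seed=0):
    """Draw `total` sentence pairs. At each step a corpus is chosen
    according to `weights`; within a corpus we sample without
    replacement, reshuffling (resetting) it once exhausted."""
    rng = random.Random(seed)
    pools = []
    for corpus in corpora:
        pool = list(corpus)
        rng.shuffle(pool)
        pools.append(pool)
    out = []
    for _ in range(total):
        i = rng.choices(range(len(corpora)), weights=weights)[0]
        if not pools[i]:                      # exhausted: reset corpus
            pools[i] = list(corpora[i])
            rng.shuffle(pools[i])
        out.append(pools[i].pop())
    return out
```

Because exhausted corpora are reset rather than dropped, small corpora are effectively oversampled to match the requested blend.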
Our system was based on the transformer in Marian examples, and initially we trained several left-right and right-left systems with tied target embeddings (but separate source embeddings). We used these systems to create ensembles.
For the translation direction EN→DE, we also trained a single model with a set-up more closely reflecting the wmt2017-transformer Marian example set-up. For this single decoder, we tied all embeddings and pooled the top-ranked 7.5 million sentence pairs from ParaCrawl (according to the translation plausibility score) with the other training data. Below, this system is referred to as the single transformer.
For the single transformer we used a mix of approximately 4.6 million parallel sentence pairs from latest versions of Europarl, CommonCrawl and News-commentary, oversampled twice, the 7.5 million parallel sentence pairs from ParaCrawl, filtered as described above, and 10 million backtranslated sentences from NewsCrawl 2016. We trained a Marian transformer model with standard settings.
We also ran preliminary experiments with multi-head and multi-hop GRU architectures on the same training data except ParaCrawl, but we found that these models tended to underperform the transformer by 0.6 to 1.0 BLEU points, so we did not use them for our submission.
As the results in Table 2 show, the single transformer produces better results than our ensembles. Even re-ranking the single transformer's output degrades the results, which we attribute to the lower quality of the models used for ensembling and reranking. At this point we do not know whether the differences in model quality are due to differences in the tying of parameters, different choices of other hyperparameters, differences in the training data used, or a combination of any of these potential causes.

Russian ↔ English
After preprocessing, we trained a joint BPE model with 90,000 merge operations, using the same Latin-Cyrillic transliteration trick as Sennrich et al. (2016c). In order to maximize the performance of our submission systems, we created a pseudo "in-domain" fine-tuning corpus designed to be more representative of the targeted news domain than the full parallel corpus. For that purpose, we concatenated pre-processed sentence pairs from NewsCommentary v13, CommonCrawl, and the Yandex Corpus, excluding the noisy ParaCrawl data as well as data from the UN Parallel Corpus V1.0, which has little overlap with our target domain. To ensure that the assembled corpus was as free of noise as possible, we furthermore filtered out sentence pairs in which the Russian side was not predominantly composed of Cyrillic characters or the English side was dominated by non-Latin characters. Lastly, we combined the resulting "in-domain" corpus with an equal amount of back-translated news data, giving two datasets of 2.1M sentence pairs each.
Our final submission included both deep RNN models (using multi-head and multi-hop attention with 3 heads and 2 hops) and Transformer models similar to the Transformer-Base of Vaswani et al. (2017). For the RNNs, we applied layer normalisation, label smoothing (0.1), dropout between recurrent layers (0.1) and exponential smoothing, and tied all embeddings. We applied similar options to our Transformer models.
We trained our models in two stages: 1) training on the full parallel corpus and 2) fine-tuning on the "in-domain" corpus with a reduced learning rate. Each of the submitted models was optimized using the Adam algorithm, with β1 set to 0.9 and β2 set to 0.98. The learning rate was set to 0.0003 during the training stage and lowered to 0.00003 during the fine-tuning stage. Throughout training, the learning rate was increased linearly over the initial 16,000 update steps up to the specified value, and gradually decayed thereafter.
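The warmup-then-decay schedule can be written as a small function. The linear warmup over 16,000 steps and the peak values come from the text; the inverse-square-root decay shape is an assumption (it is the usual choice in Marian-style transformer training, but the exact decay is not specified here).

```python
def learning_rate(step, peak=3e-4, warmup=16000):
    """Linear warmup to `peak` over `warmup` updates, then
    inverse-square-root decay (assumed decay shape)."""
    if step < warmup:
        return peak * step / warmup
    return peak * (warmup / step) ** 0.5
```

For the fine-tuning stage the same schedule would be reused with `peak=3e-5`, i.e. the learning rate lowered by a factor of ten.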
Model validation was performed every 5,000 steps, and we terminated training if no BLEU improvement was observed over five consecutive validations. For fine-tuning, we initialized our models with the parameters that achieved the highest validation BLEU on the full corpus and trained until convergence, as indicated by early stopping, on the "in-domain" fine-tuning set. Due to time constraints, convergence could not be reached for several of the ensembled models. Our final submissions consisted of an ensemble of 4 deep RNNs for EN→RU and a mixed ensemble of 2 RNNs and 2 Transformers for RU→EN. All these models were trained independently and fine-tuned on the "in-domain" set. The improvements obtained from the fine-tuning step are detailed in Table 3. While our original intention was to use mixed ensembles for both directions, our Transformer models under-performed on the EN→RU translation task, which we assume is due to our hyper-parameter choices. We re-ranked the translations obtained by our left-right ensemble with a right-left ensemble of identical design. It should be noted, however, that we were unable to identify any significant improvement in validation BLEU as a result of the re-ranking. We also fine-tuned the beam-size and length-penalty hyper-parameters of our ensemble systems on the corresponding validation sets, observing a small increase in validation BLEU. Accordingly, we set the beam size to 20 and the length normalisation parameter to 0.4 for our EN→RU ensemble, and to 28 and 1.2 respectively for RU→EN.

Turkish ↔ English
After preprocessing, we trained and applied a joint BPE model with 36,000 merge operations, discarding any sentences longer than 120 tokens. To produce back-translations we built systems in two steps: first we trained back-translation systems in both directions using the parallel data only, and then we re-trained them on data sets containing an additional 800K back-translated sentences. The back-translation systems were trained as the deep RNN models described below. The final training sets consist of 2.5M synthetic parallel sentence pairs created from the English or Turkish NewsCrawl data sets, plus the SETIMES2 data oversampled 5 times (Table 4). We also experimented with copying monolingual data (Currey et al., 2017b), adding an additional 1M examples with source sentences identical to target sentences randomly selected from the monolingual data.
Our RNN models used the BiDeep architecture, and we augmented the models with layer normalisation, skip connections, and parameter tying between all embeddings and the output layer. The RNN hidden state size was set to 1024, and the embedding size to 512.
The architecture of the Transformer models was close to the Transformer-Base proposed by Vaswani et al. (2017): the encoder and decoder were each composed of 6 layers and employed 8-head self-attention. We used dropout between transformer layers (0.2) as well as in attention (0.05) and feed-forward layers (0.05). The rest of the parameters remained the same as in the RNN models.
Optimization used 4 GPUs with synchronous training and a mini-batch size fitted to 9.5GB of GPU memory. The learning rate was increased linearly to 0.0004, reaching this value after the first 18,000 updates, and then decayed in proportion to the inverse square root of the number of updates, starting at update 24,000. As a stopping criterion we used early stopping with a patience of 10, based on word-level cross-entropy on the newsdev2016 data set, which served as a development set. The model was validated every 5,000 updates, and we kept the best models according to cross-entropy and BLEU score.
We evaluated systems using the models with the highest BLEU score on the development set. Decoding was performed by beam search with a beam size of 12, with length normalisation values of 0.2 for EN→TR and 1.2 for TR→EN, based on a grid search on the development set. Additionally, as Turkish is not supported by the Moses tokenizer, which falls back to general English tokenization rules and so produces suboptimal detokenization, we postprocessed translated Turkish texts by merging words that contain an apostrophe.
We report results on newstest2017 and newstest2018 in Table 5. Our first submitted TR↔EN systems were ensembles of 6 independently trained models, reranked with 3 right-left systems (Ensemble ×6 +Rerank R2L ×3). The ensembles consist of four models trained on corpus B and one model trained on corpora A and C, while each right-left model is trained on a different corpus from A-C. Our final systems extended the previous ensemble with 6 additional models from the same training runs that achieved the best cross-entropy (instead of best BLEU) on the development set, utilizing 12 left-right models in total (Ensemble ×6×2). For comparison, we report results for single systems trained on the different corpora; there is no significant performance difference among them.

Overall Performance of Submissions
In Table 6 we show the BLEU scores of our systems compared to the top-scoring constrained systems, giving the BLEU scores from the matrix and the EN-TR human evaluation from the findings paper (Bojar et al., 2018).
In terms of the clustering provided by the organisers, we were in the top constrained cluster (i.e. no significant difference was observed between ours and the best constrained system) for EN→CS, DE→EN, ET→EN, FI→EN, TR→EN and EN→TR, i.e. 6/14 language pairs. Nevertheless, Table 6 shows that our systems generally lag behind the best submitted systems. This is in contrast to the 2017 shared task, where we achieved the highest scores in most of the language pairs for which we submitted systems. We hypothesise that other groups have taken fuller advantage of the transformer architecture, and also of data weighting and selection. We also suggest that covering all 14 language pairs meant that we had insufficient time for experimentation on some pairs, and in fact we were not able to train all models to convergence.

Post-Submission Experiments
In this section we present the results of some post-submission experiments, which attempt to provide more insight into the contribution of different features of our systems. We were especially interested in understanding why our systems tended to lag behind the performance of the best systems (in BLEU, at least). Most of the experiments were conducted on EN↔{CS,ET,FI}.
The results are given on newstest2017 (devtest) and newstest2018 (test), except for ET↔EN, where devtest is half of newsdev2018.

Effect of Multihead/Multihop Attention
In the deep RNN models in our submissions, we used the BiDeep architecture, with multihead/multihop attention, setting the number of hops to 3 and heads to 2. In Table 7, we show the effect of this on 3 different language pairs (both directions). For these experiments, we use the same training sets and data preparation as in our system submissions, but train the deep RNNs with a working memory of 10GB, validating every 1,000 steps, and testing for convergence with a patience of 10. We use exponential smoothing and show the results on a single smoothed model.
From the results in Table 7 we see that the multihead/hop extension has a small positive effect on BLEU in most language pairs.

Effect of Vocabulary Size
After looking at the submission results, we wondered whether smaller vocabularies would have given better results, especially for the transformer models. A smaller vocabulary means that the model has fewer parameters, and also allows more words to be fitted into each training mini-batch.
To create a model with a smaller vocabulary, we follow the preparation steps used for our submissions (in EN↔{CS,ET,FI}), but use 30,000 BPE merges instead of 89,500. We show the effect both on the deep RNN model and on the Transformer model, and additionally we show the effect of tying all embeddings (i.e. source, target input and target output) on the Transformer model. The submitted models for these language pairs only have the target input and output embeddings tied. As in Section 4.1 we set the working memory for the deep RNN to 10GB, and we set the working memory for transformer training to 9.5GB. We used layer normalisation for the transformer models (although this appeared to make little if any difference to the results). In Table 8 we show the comparison for RNNs, and in Table 9 we show the same comparison for Transformer models.
Examining the results in Tables 8 and 9, we can see that the effect of vocabulary size reduction on RNN models is mixed, whereas the transformer models have a preference (in BLEU, at least) for smaller vocabularies. Tying all embeddings does not seem to help. Further investigation of the vocabulary size question is needed, though, as the relationship between BPE hyper-parameters and BLEU is unclear. We note that changes in the vocabulary size could have a disproportionate effect on the translation of rare words (including proper nouns), which would not necessarily be detected by BLEU.

Mixed Ensembles
For our submitted FI↔EN and ET↔EN systems we used mixed ensembles consisting of two deep RNNs and two Transformer models. In this section we examine whether the mix of architectures in the ensemble is beneficial, comparing the mixed ensemble with an ensemble of four deep RNNs.
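Ensembling heterogeneous architectures works because the combination happens at the level of per-token distributions, which both RNNs and Transformers produce at each decoding step. Schematically (averaging in probability space is one common choice; summing log-probabilities is another, and we do not claim this is exactly what the toolkit does):

```python
import math

def ensemble_step(distributions):
    """Combine per-token probability distributions from several models
    (e.g. 2 RNNs + 2 Transformers) by averaging in probability space,
    returning log-probabilities for use in beam search."""
    n = len(distributions)
    vocab_size = len(distributions[0])
    return [math.log(sum(d[v] for d in distributions) / n)
            for v in range(vocab_size)]
```

Because only the output distributions are combined, the member models are free to differ in architecture, as in the mixed ensembles above, as long as they share a vocabulary.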
Table 10 shows the results. We show the mean BLEU score of the models in each ensemble, together with the overall ensemble score. For clarity, we just show scores on our test set (newstest2018). In all cases the gain in BLEU from ensembling (over the mean BLEU) is slightly higher for the mixed ensemble than the corresponding gain for the uniform ensemble.

Conclusions
We have described Edinburgh's systems for all 14 language pairs, showing that we can gain improvements by augmenting a GRU-based RNN with multi-head and multi-hop attention, by using mixed ensembles of deep RNNs and transformers, and by selecting data from the noisy ParaCrawl corpora. Our systems perform strongly in most language pairs, except where we did not manage to train to convergence.