FBK’s Participation to the English-to-German News Translation Task of WMT 2017

In this paper we report on FBK’s participation in the English-to-German news translation task of the Second Conference on Machine Translation (WMT’17). The submitted system is based on Neural Machine Translation and uses byte-pair encoding segmentation on both the source and the target language for open-vocabulary translation. Back-translations of monolingual news data are used to improve translation fluency on the in-domain data. With respect to last year’s evaluation, our baseline outperforms the baseline of the best 2016 system on the 2015 and 2016 test sets. However, in our set-up back-translations produced a smaller improvement than expected. The final submission is the combination of 7 systems, including a system trained only on true parallel data and two right-to-left systems, and it improves over our best single system by 1.5 BLEU points.


Introduction
FBK's participation in the news translation shared task of WMT17 focused this year on the English-German language direction. Our purpose was to explore the state of the art and build a competitive neural machine translation [3] system in order to gain practical knowledge of the available tools. With respect to our participation in the IWSLT 2016 evaluation campaign, we switched from the Nematus-Theano framework to the OpenNMT-Torch framework [16]. The reasons were twofold: higher baseline performance and significantly faster training. In our primary submission we used back-translations [22], BPE encoding [23] and system combination [11]. In this paper, we report on the tools used for the submitted system and on the choices we made in terms of hyperparameters and data. The presentation is structured as follows: in Section 2 we briefly introduce the theoretical background of NMT. In Section 3 we describe our baseline system. In Sections 4 and 5 we describe the details of the back-translations and of the system combination used for our final submission. Evaluation results are discussed in Section 6, while Section 7 is devoted to discussion and conclusions.

Neural Machine Translation
Neural machine translation [25] has represented the state of the art for machine translation since the outstanding results obtained at IWSLT2015 [17], IWSLT2016 [1,7] and WMT16 [24,5], where neural models greatly outperformed phrase-based systems. NMT is based on the encoder-decoder-attention architecture [3], which jointly learns the translation and the alignment model with a sequence-to-sequence learning model. The words $f_1, f_2, \ldots, f_m$ of a source sentence are used to index an embedding lookup table and retrieve the vectors $x_1, x_2, \ldots, x_m$ that represent them. The embeddings are processed by a bidirectional RNN, whose forward and backward outputs are combined by a function merge, such as vector concatenation or point-wise sum; the recurrent function g is the LSTM [13] or the GRU [8]. The sequence of vectors produced by the bidirectional RNN is the encoded representation of the source sentence. The decoder takes as input the encoder outputs (or states) and produces a sequence of target words $e_1, e_2, \ldots, e_l$. It works by progressively predicting the probability of the next target word $e_i$ given the previously generated target words and the source context vector $c_i$. At each step, the decoder computes the word embedding $y_{i-1}$ of the previous target word and applies one or more recurrent layers, an attention model and a softmax layer. The recurrent layers produce a hidden state $s_i$, where g can be computed with one or more LSTM or GRU layers. The output of the RNN is then used by the attention model to weight the source vectors according to their similarity with it.
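The weighting step, together with the weighted average it feeds, can be illustrated with a minimal sketch. It assumes a plain dot product as the similarity score and a softmax normalisation; the text does not specify either, and the function names and toy vectors are hypothetical.

```python
import math


def attention_weights(decoder_state, encoder_outputs):
    """Softmax-normalised dot-product similarities between the decoder
    state and each encoder output (the dot product is an assumed score)."""
    scores = [sum(s * h for s, h in zip(decoder_state, h_j))
              for h_j in encoder_outputs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def context_vector(weights, encoder_outputs):
    """Weighted average of the encoder outputs."""
    dim = len(encoder_outputs[0])
    return [sum(w * h_j[k] for w, h_j in zip(weights, encoder_outputs))
            for k in range(dim)]


# Toy example: three source positions, two-dimensional states.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights = attention_weights(decoder_state, encoder_outputs)
context = context_vector(weights, encoder_outputs)
```

In this toy case the first and third source positions are equally similar to the decoder state, so they receive equal weights, and both outweigh the second position.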
The weights are used to compute a weighted average of the encoder outputs, which represents the source context. The source context vector is then combined with the output of the last RNN layer into a new vector $z_i$, which is passed as input to the softmax layer to compute the probability of each word in the vocabulary being the next word; in the softmax expression, $e^{\top}$ denotes the transpose of the one-hot vector representation of word e. Let Θ be the set of all the network parameters; the objective of training is then to find the parameter values that maximize the likelihood of the training set S.
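The model described in this section can be written compactly as follows. This is one standard formulation, given under stated assumptions rather than as the authors' exact equations: the source index j, the arrows for the two encoder directions, the unspecified score function, the matrices $W_z$ and $W_o$, the tanh non-linearity, the vocabulary $V$ and the bold sentence pairs are all notation introduced here. The objective is shown as a log-likelihood, which has the same maximizer as the likelihood.

```latex
\begin{align}
\overrightarrow{h}_j &= g(x_j, \overrightarrow{h}_{j-1}), \quad
\overleftarrow{h}_j = g(x_j, \overleftarrow{h}_{j+1}), \quad
h_j = \mathrm{merge}(\overrightarrow{h}_j, \overleftarrow{h}_j) \\
s_i &= g(y_{i-1}, s_{i-1}) \\
\alpha_{ij} &= \frac{\exp(\mathrm{score}(s_i, h_j))}{\sum_{k=1}^{m} \exp(\mathrm{score}(s_i, h_k))}, \quad
c_i = \sum_{j=1}^{m} \alpha_{ij} h_j \\
z_i &= \tanh(W_z [s_i; c_i]) \\
p(e_i = e \mid e_1, \ldots, e_{i-1}, c_i) &= \frac{\exp(e^{\top} W_o z_i)}{\sum_{w \in V} \exp(w^{\top} W_o z_i)} \\
\hat{\Theta} &= \arg\max_{\Theta} \sum_{(\mathbf{f}, \mathbf{e}) \in S} \log p(\mathbf{e} \mid \mathbf{f}; \Theta)
\end{align}
```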

Baseline
Our baseline is a neural machine translation system trained on the four parallel corpora released for the task. Our preprocessing pipeline involved punctuation normalization, de-escaping of special characters, tokenization and truecasing for both English and German. We also filtered out sentence pairs with a source or target length greater than 50, or with a length ratio in either direction of more than 1:9. In Table 1 we report the number of sentences before and after the cleaning step. The last step of the preprocessing is the BPE segmentation [23]. We trained 45,000 BPE merge rules over the joint parallel data, which resulted in vocabulary sizes of 43,853 words for English and 47,465 for German.
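The cleaning step can be sketched as a simple filter. The thresholds (50 tokens, ratio 1:9) come from the text; measuring length in whitespace-separated tokens and discarding empty sides are assumptions, and `keep_pair` is a hypothetical helper rather than part of any toolkit.

```python
MAX_LENGTH = 50   # maximum number of tokens per side (from the text)
MAX_RATIO = 9.0   # maximum length ratio in either direction (1:9)


def keep_pair(source, target, max_length=MAX_LENGTH, max_ratio=MAX_RATIO):
    """Return True if a tokenized sentence pair passes the cleaning step."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False  # assumption: pairs with an empty side are discarded
    if src_len > max_length or tgt_len > max_length:
        return False
    longer, shorter = max(src_len, tgt_len), min(src_len, tgt_len)
    return longer / shorter <= max_ratio


pairs = [
    ("a small house", "ein kleines Haus"),
    ("yes", " ".join(["ja"] * 10)),           # ratio 1:10, dropped
    (" ".join(["word"] * 51), "ein Satz"),    # source longer than 50, dropped
]
cleaned = [pair for pair in pairs if keep_pair(*pair)]
```

Only the first pair survives; the other two violate the ratio and the length limit respectively.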
The NMT architecture consists of 2 LSTM layers in both the encoder and the decoder. We used LSTM RNNs instead of GRU RNNs, as they performed better in our preliminary experiments. Our result is hence consistent with what is reported in [6]. The word embedding size and the number of hidden units of each LSTM layer are fixed to 500. The encoder is a bidirectional LSTM [21] with 500 hidden units equally divided between the two directions. The optimizer of choice is SGD [20] with exponential decay. In preliminary experiments, using different and smaller datasets, this optimizer outperformed Adagrad [10] and Adam [15]. Figure 1 shows the validation scores after each epoch on the validation sets with the different optimizers. In [7] Adagrad led to better results on the IWSLT En-Fr validation set, so we argue that the choice of the optimizer depends on the dataset and on the NMT implementation. The latter is not considered in studies comparing different optimizers [2]. We set the initial learning rate to 1.0 and the exponential decay to 0.9. The decay starts from epoch 9. The results of the baseline are reported in the first row of
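The learning-rate schedule can be written out as a short sketch. The initial rate (1.0), the decay factor (0.9) and the starting epoch (9) come from the text; the assumption here is that the rate is multiplied by the decay once per epoch from epoch 9 onwards, so that epoch 9 itself already runs at 0.9. The toolkit may instead apply the first decay only after epoch 9.

```python
def learning_rate(epoch, initial=1.0, decay=0.9, start_decay_epoch=9):
    """Learning rate used during the given 1-based epoch.

    Assumption: the rate stays at `initial` through epoch 8 and is
    multiplied by `decay` once per epoch from epoch 9 onwards.
    """
    if epoch < 1:
        raise ValueError("epochs are numbered from 1")
    decay_steps = max(0, epoch - start_decay_epoch + 1)
    return initial * decay ** decay_steps


# Learning rate for each of the first 12 epochs.
schedule = {epoch: learning_rate(epoch) for epoch in range(1, 13)}
```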

Monolingual Data
In order to leverage monolingual data we followed the state-of-the-art practice of using back-translated data. A German-to-English MT system was used to translate the monolingual news sentences. As we did not plan to participate in the opposite direction, we decided to use a phrase-based MT system to perform the back-translations. The system of choice was MMT (ModernMT) [4], an open-source PBMT system for industrial use, which was trained using all available parallel data. The language model was trained on sentences randomly sampled from the English monolingual newscrawl data, for a total of 1B words. The log-linear model weights were tuned on 1000 sentences sampled from newstest2013 and newstest2014. After tuning, the system obtained a BLEU score of 25.04 on newstest2015 (de-en). With ModernMT we were able to translate 250,000 sentences per day on a single CPU.
In total, we translated about 30M newscrawl sentences from 2013 to 2016. In a first experiment we trained a model until convergence on this large synthetic parallel corpus and then fine-tuned it on the true parallel data. In this setting, the system trained on the synthetic data converged before finishing the first epoch, and the subsequent fine-tuning reached only 23 BLEU points on newstest2015, so we decided not to use this model for the final submission. Our best single system instead continued the training of the baseline on a new dataset consisting of both the parallel sentences and 5M back-translated sentence pairs randomly sampled from the 30M set. As we describe in the following section, we also used monolingual data for the system combination.
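The construction of the mixed training set can be sketched as below, with toy lists standing in for the parallel corpus and the 30M back-translated pairs. The helper name, the fixed seed and the final shuffle are assumptions; the text only states that 5M pairs were randomly sampled and added to the parallel data.

```python
import random


def build_training_set(parallel, back_translated, sample_size, seed=0):
    """Mix true parallel pairs with a random sample of synthetic pairs."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(back_translated))
    synthetic = rng.sample(back_translated, sample_size)  # no replacement
    mixed = list(parallel) + synthetic
    rng.shuffle(mixed)  # assumption: the two sources are interleaved
    return mixed


# Toy stand-ins: (source, target) pairs; synthetic sources are MT output.
parallel = [("src %d" % i, "tgt %d" % i) for i in range(4)]
back_translated = [("mt %d" % i, "mono %d" % i) for i in range(30)]
mixed = build_training_set(parallel, back_translated, sample_size=5)
```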

System Combination
Our primary submission has been produced by merging the outputs of different systems with Jane's system combination tool [11].
For a system combination of m systems we build m confusion networks, which are then merged to form a single confusion network. In each of the small networks, one of the systems is chosen as the primary system, that is, the system that decides the word order. The sentences from every secondary system are then aligned to the primary. We perform word alignment using METEOR [9], a tool that uses four criteria for aligning words: 1) exact match; 2) stem, which matches two words if their stems computed with the Snowball Stemmer [19] are the same; 3) synonym, which uses the WordNet [18] synsets database; 4) paraphrase, which matches phrases if they are in an internal paraphrase table. When no criterion is matched, the word is aligned to the empty string. The confusion networks are initialized with the primary system sentences, then the words from the secondary hypotheses are added to the network according to the alignment. The final confusion network is obtained as the union of the m networks. The output sentence is produced from the confusion network by majority voting. Each hypothesis receives a system weight, and the weights are optimized on a development set. In our case the development set is newstest2015 and the validation set is newstest2016.

The systems involved in the combination include:
3. The tuning of the baseline for 7 more epochs on parallel + synthetic data
4. The baseline system

For each system, with the exception of the baseline, we used the weights of the last two epochs. This gave us an improvement of 0.5 BLEU points on the validation set. We further improved the system combination by adding a 5-gram language model with modified Kneser-Ney smoothing [14] and no pruning, trained on ~500M tokens with KenLM [12]. This improved the result by another +0.6 BLEU on the validation set. In Table 2 we present the results of the single systems on newstest2015 and newstest2016.
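The voting step over a confusion network can be illustrated with a deliberately simplified sketch. It is not Jane's implementation: it assumes a single network whose alignment is already given (one word per system and slot, with the empty string where a system has no aligned word), fixed system weights instead of tuned ones, and no language model. All names and the example hypotheses are hypothetical.

```python
from collections import defaultdict

EPSILON = ""  # empty word: the system has nothing aligned to this slot


def vote(slots, system_weights):
    """Pick the highest-weighted word in every slot of a confusion network.

    `slots[k][n]` is the word proposed by system n at position k.
    """
    output = []
    for slot in slots:
        totals = defaultdict(float)
        for word, weight in zip(slot, system_weights):
            totals[word] += weight
        # Ties are broken in favour of the earliest system (primary first).
        best = max(slot, key=lambda word: totals[word])
        if best != EPSILON:
            output.append(best)
    return " ".join(output)


# Hypothetical, already-aligned outputs of three systems (primary first).
slots = [
    ("das", "das", "dieses"),
    ("ist", "ist", "ist"),
    ("ein", EPSILON, "ein"),
    ("Haus", "Heim", "Haus"),
]
uniform = vote(slots, [1.0, 1.0, 1.0])
skewed = vote(slots, [0.2, 1.0, 0.3])
```

With uniform weights the result is a plain majority vote; with skewed weights the second system can override the other two, which is why the weights are worth tuning on a development set.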
As expected, the systems also differ considerably in performance, especially on newstest2016, so we anticipated significant improvements from their combination. Surprisingly, we found that our system trained from scratch on back-translated data performed worse than the baseline, while the right-to-left system trained on the same data is slightly better on newstest2015 and 1 BLEU point better on newstest2016. The best system is the one that was trained in two phases: first only on true parallel data and then, after 21 epochs, on true plus synthetic parallel sentences.

Results
In Table 3 we report the results in terms of BLEU scores for the test sets from 2015 to 2017. On newstest2015 the baseline was already on par with last year's best single system [24], and the improvement obtained with back-translations is only +0.4 BLEU. The improvement given by back-translations is larger on newstest2016, where our system was quite weak compared with last year's best single system and improved by +1.6 BLEU. The improvement is also small on newstest2017, where it amounts to +0.6. The last row of the table reports the results of the system combination. On newstest2015 we get an improvement of +2.4, but the weights are optimized on this dataset. A similar improvement is obtained on newstest2016, where we gain +2.6 BLEU. The improvement is considerable, but the best single system does not reach state-of-the-art results on this dataset. On newstest2017 the improvement over our best single system is +1.5 BLEU, giving a final score of 26.30, with which the submission ranked 8th out of 21 systems. From Tables 2 and 3 we can see that back-translations yielded only a small improvement for our systems, especially when there had been no previous training on true parallel data alone (sys1 in Table 2). This is likely related to the number of back-translated sentences, which may have been too high relative to the number of parallel sentences. Another possible factor is the quality of the back-translations, which were produced with a PBMT system that underperforms a state-of-the-art NMT system.

Conclusions
In this paper we have reported on our submission to the English-German news translation task of WMT17. We developed several NMT systems with the OpenNMT open-source tool that were trained over real and synthetic parallel data. We used BPE segmentation for open-vocabulary translation and back-translations to create additional synthetic translations. The best single system, trained on true parallel data and afterwards on true and synthetic parallel sentence pairs, obtained state-of-the-art results on newstest2015 but not on newstest2016 and newstest2017. Additional data created via back-translations did not pay off as hoped. The outputs of 4 different systems, including a right-to-left system, were combined using system combination, producing an improvement of +1.5 BLEU on this year's test set.