The Karlsruhe Institute of Technology Systems for the News Translation Task in WMT 2017

We present our experiments in the scope of the news translation task at WMT 2017, in three directions: German → English, English → German and English → Latvian. The core of our systems is formed by encoder-decoder based neural machine translation models, enhanced with various modeling features, additional source-side augmentation and output rescoring. We also experiment with various methods for data selection and adaptation.


Introduction
We participate in the WMT 17 shared task on news translation in three directions: English-German, German-English and English-Latvian. The core of our submissions is the neural attentional encoder-decoder model, which we enhanced with different features such as context gates for more efficient attention and a coverage vector for maintaining attentional information during translation. Several techniques for integrating additional information into the source text have been investigated: pre-translation with statistical systems, monolingual data and phrase-table entries. Finally, we combined different models using n-best list reranking.

Data
This section describes the preprocessing steps for the parallel and monolingual corpora for the language pairs involved in the systems as well as the data selection methods investigated.

German↔English
As parallel data for our German↔English systems, we used Europarl v7 (EPPS), News Commentary v12 (NC), the Rapid corpus of EU press releases, the Common Crawl corpus, and synthetic data. Except for the Common Crawl corpus, no special preprocessing was applied beyond tokenization and true-casing. For the Common Crawl corpus, we applied noise filtering using an SVM as described in Mediani et al. (2011). Around 900K sentence pairs were filtered out using this technique.
The use of synthetic data is motivated by Sennrich et al. (2015a). In order to exploit the monolingual data, we used the back-translation technique: we randomly select as many sentences from the monolingual data as our parallel data contains, and translate them with an inverse NMT system from the target to the source language. We use this synthetic data as additional parallel training data. Summing all corpora, the preprocessed and noise-filtered parallel data reaches 8.3M sentences for each language.
For German monolingual data, we use the News Crawl data. For English, we use the News Crawl and News Discussions corpora. As with the parallel data, only tokenization and true-casing are applied.
Once the data is preprocessed, we applied byte-pair encoding (BPE) (Sennrich et al., 2015b) to the corpus. In this work, we deploy two different operation sizes, 40K and 80K.
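As an illustration, the core of the BPE learning procedure can be sketched as follows (a minimal sketch of the algorithm of Sennrich et al. (2015b); the function name and the toy word-frequency input are ours):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # Learn BPE merge operations: repeatedly merge the most frequent
    # adjacent symbol pair in the vocabulary (Sennrich et al., 2015b).
    vocab = {tuple(w) + ('</w>',): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```

In practice one learns 40K or 80K merges on the full corpus and applies them greedily at translation time; rare words then decompose into frequent subword units.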

Monolingual data selection
We experimented with domain adaptation techniques to select monolingual data for back-translation. In particular, we concatenated all news-test sets up until 2013 to form our in-domain corpus, and used news-shuffle as background data. We used the method of Axelrod et al. (2015), a class-based extension of the widely used cross-entropy difference data selection method of Moore and Lewis (2010). For word clustering, we used ClusterCat (Dehdari et al., 2016) with 20 classes. We selected an amount of data equal to the available bilingual training data. Back-translation was done as in Sennrich et al. (2015a). We attempted this approach for both systems with English and German as the target language. However, we did not observe any improvements over selecting monolingual data at random, and did not employ this method in our final system.
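The underlying cross-entropy difference criterion of Moore and Lewis (2010) can be sketched as follows (a minimal sketch using add-one smoothed unigram language models rather than the class-based models used in practice; all function names and parameter values are ours):

```python
import math
from collections import Counter

def unigram_logprob(counts, total, sent, alpha=1.0, vsize=10000):
    # Add-alpha smoothed unigram log-probability, normalised by length.
    lp = sum(math.log((counts[w] + alpha) / (total + alpha * vsize)) for w in sent)
    return lp / max(len(sent), 1)

def moore_lewis_select(in_domain, background, pool, k):
    # Cross-entropy difference: prefer pool sentences that look like the
    # in-domain corpus but unlike the general background corpus.
    c_in = Counter(w for s in in_domain for w in s)
    c_bg = Counter(w for s in background for w in s)
    t_in, t_bg = sum(c_in.values()), sum(c_bg.values())
    scored = [(unigram_logprob(c_in, t_in, s) - unigram_logprob(c_bg, t_bg, s), s)
              for s in pool]
    scored.sort(key=lambda x: -x[0])  # highest difference = most in-domain-like
    return [s for _, s in scored[:k]]
```

The class-based extension of Axelrod et al. (2015) applies the same difference score over word-class sequences instead of surface words.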

Parallel data selection
From previous MT evaluation campaigns (Cho et al., 2016), we know that NMT systems work well when we fine-tune on in-domain data after training our models on out-of-domain data. Since a clear in-domain corpus is not available in this task, we conducted parallel data selection experiments to build one.
We followed the approach described in Peris et al. (2016) to extract an in-domain data set from the News Commentary corpus. More specifically, an LSTM-based neural network was used to classify every sentence in the general corpus as to whether it should be included in the in-domain corpus or not. The network is trained using a "golden" corpus as the in-domain data. We took the WMT development sets from 2008 to 2013, ca. 16K sentence pairs, as the golden corpus for this training. The final in-domain corpus is the union of the development sets and the selected sentences from News Commentary, amounting to ca. 100K sentence pairs.

English→Latvian
The English-Latvian parallel corpus contains 2.9 million sentences, which were preprocessed by Tilde (www.tilde.com) with language-specific tokenizers. The Latvian text is only true-cased on the first letter of the sentence. We further clean the data using the language detection library of Shuyo (2010) and remove the lines whose target sentences cannot be recognized as Latvian by the tool, resulting in about 25K sentences removed. Aside from the main data provided by the organizers, we exploit the synthetically translated monolingual data (only the news2016 part) provided by the University of Edinburgh, produced with a Moses phrase-based system. The training data used for the final system consists of 5 million sentences in total. For validation, we use the first 2,000 sentences of the Leta corpus (the rest is included in the training data), and we use the newsdev2017 set (2,003 sentences) for testing. We train a BPE (Sennrich et al., 2015b) model on the training data (including the back-translated part) with 40K operations, which is potentially helpful for a morphologically rich target language.

NMT Frameworks
Our systems consist of multiple neural encoder-decoder models trained using two different toolkits.

Nematus
We initially used the Nematus toolkit, for which we chose hyperparameters following previous work (Sennrich et al., 2017): minibatch size of 80, maximum sentence length of 50, word embedding size of 650, a one-layer GRU of size 1,024 in the encoder and a conditional GRU decoder with hidden layer size 1,024. The gradients are rescaled when their norm exceeds 1.0, and the gradient update method is Adam (Kingma and Ba, 2014) with learning rate 0.0001. Models are trained until the BLEU score on the validation set stops increasing. Checkpoints are saved every 20K iterations.

OpenNMT
We also employed the Torch-based (Collobert et al., 2011) toolkit OpenNMT (Klein et al., 2017). All models trained with this toolkit have two LSTM layers of 1,024 units each, and we also use the input-feeding method described in Luong et al. (2015). For optimization, the gradients are clipped at a norm of 5, and we experimentally use Adam with a high learning rate of 0.001, which is then reduced to 0.0005 when the perplexity of the model no longer decreases. Checkpoints are saved every epoch (i.e., after all sentences have been seen). We also enhanced the toolkit with different features, namely the context gate for the attentional model (Tu et al., 2016a) and the use of coverage information during translation (Tu et al., 2016b; Sankaran et al., 2016).

Context gates for machine translation
In conditional language models such as neural machine translation, the decoder makes predictions based on two sources of input: the decoder input at the current time step and the context vector queried by the attentional model. As analysed by Tu et al. (2016a), it can be beneficial for the translation model to control the influence of each prediction source. Concretely, inadequate translation can happen due to a bias towards the current decoder input. We followed the authors in integrating a soft gating mechanism to alleviate this problem. Specifically, in our neural translation model, given the target hidden state h_t and the source context vector c_t, an attentional hidden state is formed by concatenation (Luong et al., 2015).
Alternatively, we use h_t and c_t to learn a soft context gate that balances the contribution of the two states. Both states are then masked with the learned gates and concatenated before being fed into the final linear output layer.
Note that the authors (Tu et al., 2016a) built their model on top of the conditional GRU network of Bahdanau et al. (2014), while ours is essentially a multi-layer LSTM decoder with an additional attention layer. This difference leads to minor changes in the implementation, which may not replicate the same improvement as the original work.
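A simplified sketch of such a gating mechanism is given below (a toy, element-wise variant with per-dimension scalar weights; Tu et al. (2016a) use full weight matrices, and all names here are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def context_gate(h, c, w_h, w_c, b):
    # z_t = sigmoid(w_h * h_t + w_c * c_t + b), one gate value per dimension.
    # The gate scales the decoder state h_t, its complement scales the
    # source context c_t, so the model controls each prediction source.
    z = [sigmoid(wh * hi + wc * ci + b)
         for hi, ci, wh, wc in zip(h, c, w_h, w_c)]
    gated_h = [zi * hi for zi, hi in zip(z, h)]
    gated_c = [(1 - zi) * ci for zi, ci in zip(z, c)]
    return gated_h + gated_c  # concatenated before the final output layer
```

With zero weights the gate is 0.5 everywhere, i.e. both sources contribute equally; training moves the gate toward whichever source is more reliable at each step.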

Coverage mechanism for attention model
Various works have pointed out that the attentional neural machine translation model can benefit from constraining the attentional process to adequately cover the source words (Sankaran et al., 2016; Tu et al., 2016b; Mi et al., 2016; Luong et al., 2015). These proposals share a similar idea, which is to incorporate alignment information from previous time steps into the attentional neural network. Our experiments inherit the neural fertility model of Tu et al. (2016b), which uses an explicit vector to keep track of the alignment information. At every time step, the network makes an attentional decision with the help of the coverage vector, which is in turn updated from the alignment vector and the source context with a simple Gated Recurrent Unit (GRU).
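The effect of coverage on attention can be illustrated with the following toy sketch, in which the coverage vector simply accumulates past alignment weights and penalizes already-covered source positions (a simplified stand-in for the GRU-based update of Tu et al. (2016b); all names are ours):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attend_with_coverage(scores, coverage, penalty=1.0):
    # Bias raw attention scores downward for source positions that have
    # already received attention, then update the coverage vector with
    # the new alignment weights.
    biased = [s - penalty * c for s, c in zip(scores, coverage)]
    align = softmax(biased)
    new_coverage = [c + a for c, a in zip(coverage, align)]
    return align, new_coverage
```

In the actual model the update is learned (a small GRU over the alignment vector and source context) rather than a fixed penalty, but the intent is the same: discourage re-attending to covered words and so reduce over- and under-translation.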

Integration of Additional Resources
In this section, we describe several techniques we applied in order to integrate additional resources into the translation. First, we integrate monolingual information using a multilingual NMT approach. In addition, we extract information from PBMT systems.

Monolingual Data
When the encoder of an NMT system with a well-chosen architecture considers words across different languages, the model is expected to learn a good representation of the source words in a joint embedding space, in which words carrying similar meanings lie close to each other. In turn, the shared information across source languages can help improve the choice of words on the target side. For example, the word Flussufer in German and the word bank in English should be projected into close proximity in the joint embedding space. This information might help to choose the French word rive over banque.
To extend an attentional NMT system for a single language pair into a multilingual NMT system sharing a common semantic space, Ha et al. (2016b) suggested language-specific coding. Basically, language codes are appended to every word in the source and target sentences to indicate the original language of the word. This information is then passed to the training process of the NMT system. For example, the English-German sentence pair excuse me and entschuldigen Sie becomes en excuse en me and de entschuldigen de Sie. By doing so, we can train a single multilingual system that translates from several source languages into one or several target languages. When we have n English-German sentence pairs and m French-German sentence pairs, for example, we can train a single NMT system on a parallel corpus of n + m sentence pairs. We can then use the trained model to translate into German either from English or from French.
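The language-specific coding step can be sketched as follows (the exact code format, here an underscore-joined prefix per token, is our assumption; the paper renders the codes as separate markers):

```python
def add_language_codes(sentence, lang):
    # Prefix every token with its language code so a shared vocabulary
    # can distinguish, e.g., English "die" from German "die".
    return " ".join(f"{lang}_{tok}" for tok in sentence.split())
```

Applying this to both sides of every corpus lets heterogeneous data (English-German, French-German, or German-German monolingual pairs) be concatenated into one training set for a single model.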
The aforementioned multilingual NMT can also be used as a novel way to utilize monolingual data, which is not a trivial task in NMT systems. In particular, if we want to translate from English to German, we can use a German monolingual corpus as additional German-German data, in the same way we utilize the French-German parallel corpus. Thus, the encoder is shared between the source and target languages (English and German), and the attention is also shared across languages to help the decoder select better German words on the target side. The system implementing this idea is referred to as a mix-source system.
For this evaluation, we apply this multilingual NMT approach in the English-German direction in order to make use of the German monolingual corpus and gain additional improvements.

Pre-translation
One of the main problems of current NMT systems is their limited vocabulary (Luong et al., 2014), which causes difficulties when translating rare words. While the overall performance of NMT is significantly better than that of SMT on many tasks (Bojar et al., 2016), the translation of words seen only a few times is often incorrect. In contrast, PBMT is able to memorize a translation it has observed only once in the training data. Therefore, we tried to combine the advantages of NMT and PBMT using pre-translation as described in Niehues et al. (2016).
In the first step, we translate the source sentence f using the PBMT system, generating a translation e_SMT. Then we use the NMT system to find the most probable translation e* given the source sentence f and the PBMT translation e_SMT. To this end, we create a mixed input for the NMT system consisting of both sentences by concatenating them. This scheme, however, may lead to errors when the source and target languages share the same surface form for different words, e.g. die is a verb in English, while it is an article in German. In order to prevent such errors, we use a separate vocabulary for each language. By applying BPE (Sennrich et al., 2015b) to the input, we are able to encode any input word as well as any translation of the PBMT system. Thereby, the NMT system is able to learn to copy translations of the PBMT system to the target side. The pre-translation method is applied in the German → English direction.
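The construction of the mixed input can be sketched as follows (the separator token and function name are our assumptions; in practice both sides are BPE-encoded and use separate, language-specific vocabularies):

```python
def mixed_input(source_bpe, pbmt_bpe, sep="<CONCAT>"):
    # Concatenate the (BPE-encoded) source sentence f and its PBMT
    # pre-translation e_SMT into a single NMT input sequence.
    # Language-specific vocabularies keep surface-identical words
    # (e.g. English "die" vs. German "die") apart.
    return f"{source_bpe} {sep} {pbmt_bpe}"
```

The NMT encoder then sees both the original source and a draft translation, and can learn when to copy the PBMT output and when to override it.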

Integration of Selected Phrase Pairs
One main drawback of the aforementioned approach is that all training data as well as the test data have to be translated by a phrase-based MT system, which makes the approach time-consuming.
In a second approach to integrating information for rare words from the phrase-based MT system, we rely only on the phrase table. Using this technique, we annotate rare words with their possible translations according to the phrase table. In the first step, we need to identify the words for which we want to provide a possible translation. Then we need to select a translation from the phrase table and, finally, we need a method to offer this translation to the NMT system as an optional cue.
In our approach, we consider all words that were split into several subwords by the byte-pair encoding as rare words. For these words, we search for possible translations in the phrase table. We take the phrase pair with the longest source phrase that covers the word. If there are several translation options for this source phrase, we select the one with the highest log-sum of all four probabilities in the phrase table.
We integrate this information into the source sentence by appending the source phrase and its translation from the phrase table. We also mark the beginning and end of the phrase with a special character. Given the source sentence Obama empfän@@ gt Netanyahu and a phrase pair empfän@@ gt → receives in the phrase table, we generate the following input for the NMT system: Obama # empfän@@ gt ## receives # Netanyahu
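This annotation step can be sketched as follows (a minimal sketch: longest-match lookup over the phrase table and detection of rare words via the BPE "@@" marker follow the description above, while the function name and dictionary-based phrase table are ours):

```python
def annotate_rare_words(tokens, phrase_table):
    # Wrap a rare (BPE-split) source phrase and its phrase-table
    # translation in '#'/'##' markers, as in the paper's example.
    out, i = [], 0
    while i < len(tokens):
        matched = False
        # Longest-match lookup: try the longest source phrase first.
        for j in range(len(tokens), i, -1):
            src = " ".join(tokens[i:j])
            if src in phrase_table and any("@@" in t for t in tokens[i:j]):
                out += ["#"] + tokens[i:j] + ["##", phrase_table[src], "#"]
                i = j
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

Because the annotation is purely a preprocessing step on the source side, it avoids translating the whole corpus with a PBMT system.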

System Combination
Combining different neural networks often leads to better performance, as shown in various applications of neural networks and in previous NMT submissions to evaluation campaigns (Bojar et al., 2016). A successful technique is to ensemble different checkpoints of a model, or models with different random initializations. While this is a very helpful technique, it has the drawback that it can only be performed easily for models using the same input and output representations.
In order to further extend the variety of models, we combine the outputs of several ensemble models by n-best list combination. A first approach is to generate an n-best list from all or several of the models. Afterwards, we combine the n-best lists into a single one by taking their union. Since every model only generated a subset of the joint list, we rescore the joint list with each model. Finally, we use a combination of all scores to select the best entry for every source sentence. In previous work (Cho et al., 2017), it was shown that it is often sufficient to use the n-best list of the best model and rescore this n-best list with the different models. In our experiments, we used n = 50 for the n-best list size.
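The union-and-rescore combination with equally weighted score summation can be sketched as follows (models are represented as hypothetical scoring callables; names are ours):

```python
def combine_nbest(nbest_lists, models):
    # Union the n-best lists, rescore every hypothesis with every model,
    # and pick the candidate with the highest summed score.
    union = sorted({hyp for nbest in nbest_lists for hyp in nbest})
    best_score, best_hyp = max(
        (sum(model(hyp) for model in models), hyp) for hyp in union
    )
    return best_hyp
```

With equal weights this already approximates an ensemble; weighted combination (next section) replaces the plain sum with learned model weights.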
As systems to be combined, we use the NMT systems built with different frameworks (described in Section 3), as well as the pre-translation and multilingual systems (described in Section 4). We also combine systems using different BPE sizes. In addition, we use a system that generates the target sentence in reversed order (Sennrich et al., 2015a; Liu et al., 2016; Huck et al., 2016).
After joining the n-best lists and rescoring them with the different systems, we have k scores for every entry in the n-best lists. In our experiments, we use two different techniques to combine the scores. The first method is to use the sum of all scores. Especially if the performance of the different models is similar, we do not need to weight the different models; similar to the ensemble methods, we can reach good performance using equal weights. In a second approach, we use the ListNet algorithm (Cao et al., 2007; Niehues et al., 2015) to find the optimal weights for the individual models.

ListNet-based Rescoring
In order to find the optimal weights for the different models, we use the ListNet algorithm (Cao et al., 2007; Niehues et al., 2015). This technique defines one probability distribution over the permutations of the list based on the scores of the individual models, and another based on a reference metric. In this set of experiments, we use the BLEU+1 score introduced by Liang et al. (2006). We then use the cross-entropy between both distributions as the loss function for our training. We trained the weights for the different models on the same validation set used during training of the NMT systems.
Using this loss function, we can compute the gradient and apply stochastic gradient descent. We use batch updates with ten samples and tune the learning rate on the development data.
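The ListNet loss can be sketched as follows (using the common top-one approximation of the permutation probabilities; the feature and target values in the usage below are toy inputs, and all names are ours):

```python
import math

def top_one(scores):
    # Top-one probability of each list entry: a softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_loss(weights, feature_lists, targets):
    # feature_lists: per hypothesis, one score per model;
    # targets: per hypothesis, a reference metric score (e.g. BLEU+1).
    combined = [sum(w * f for w, f in zip(weights, feats))
                for feats in feature_lists]
    p = top_one(combined)  # distribution induced by the weighted models
    q = top_one(targets)   # distribution induced by the reference metric
    # Cross-entropy between the metric- and model-induced distributions.
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))
```

Weights that rank hypotheses in agreement with the metric yield a lower loss, so stochastic gradient descent on this objective learns the interpolation weights.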
The ranges of the scores of the different toolkits may differ greatly. Therefore, we rescaled all scores observed on the development data to the range [−1, 1] prior to rescoring.
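The rescaling step can be sketched as follows (a simple min-max normalization; in our setup the minimum and maximum are taken from the development data, and the function name is ours):

```python
def rescale(scores, lo=-1.0, hi=1.0):
    # Min-max rescaling of one model's scores to [lo, hi], so that
    # models with very different score ranges can be combined fairly.
    mn, mx = min(scores), max(scores)
    if mx == mn:
        return [0.0 for _ in scores]
    return [lo + (s - mn) * (hi - lo) / (mx - mn) for s in scores]
```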

Results
In this section, we describe the systems used to generate the final hypotheses for the official test sets. We participated in the German→English, English→German, and English→Latvian translation tasks.

German→English
All German to English translation systems are trained on the parallel data as well as on back-translated data (Sennrich et al., 2015a) randomly selected from the monolingual news data. We use newstest2013 as validation data. Using this data, we trained our initial system with the Nematus toolkit and a byte-pair encoding size of 40K operations (Nematus 40K). The translations for all Nematus-based systems are generated with an ensemble of different checkpoints. Although we also attempted to select the data for back-translation as described in Section 2.1, initial experiments did not show improvements in translation quality. Therefore, we use the randomly selected data for the remaining experiments.
In addition, we built a system with reversed target order (R2L) (Liu et al., 2016) and one using pre-translation. The pre-translation was generated by the PBMT system used in WMT 2016 (Ha et al., 2016a). Both performed slightly better than the baseline system.
When increasing the number of BPE operations to 80K, we observe an improvement in translation quality of 1.4 BLEU points.
In addition to Nematus, we also used the OpenNMT framework to build a network. For this language pair, we used the context gate, but not the coverage model. In contrast to the Nematus-based systems, we did not ensemble different checkpoints, since with OpenNMT this technique did not yield an improvement in translation performance. When OpenNMT is trained using 40K BPE units (single system), we reach a BLEU score of 38.39. The default architecture of OpenNMT, which utilizes two hidden layers, is presumably one reason for its strong performance.
In addition, we built a system using rare words annotated with their translations. In contrast to the baseline OpenNMT system, this configuration uses only half the hidden size. For comparison, a baseline system with this hidden size achieved a BLEU score of 36.91 on newstest2016. Although this system did not improve over the baseline, it proved beneficial in the combination.
Finally, we generated an n-best list using the best-performing system, OpenNMT 40K. We then used all the other models to rescore this n-best list. The scores are combined linearly, with weights optimized using the ListNet algorithm on newstest2015. This resulted in the best performance of 39.10 BLEU. The combination of all models improves the translation performance by another 0.7 BLEU points.

English→German
Table 2 shows the results of the English→German translation task. The scores are reported in BLEU and evaluated on test2016. We used the OpenNMT framework on the preprocessed data (parallel, sampled, back-translated as in Section 2.1).
For all experiments, we used 40K BPE operations. The systems differ in the training method and the architectures. In the first series of experiments, Forward, training sentences are seen in their natural direction (left-to-right in this case). For this type of experiment, we trained two architectures: a normal one and one with context gates. The context gate system achieved an improvement over the normal one. The two architectures share the same vocabularies, and ensembling them yielded further improvements. In the second series of experiments, R2L, the target sentences were reversed in order (right-to-left). The third type is the mix-source system described in Section 4.1 and in (Ha et al., 2016b). In addition, we also used a pre-translation system. These systems have different vocabularies, and they were eventually combined using our ListNet-based rescoring (Section 5.1).
For each type of experiment, we performed fine-tuning on the small in-domain corpus mentioned in Section 2.1.2, and the best adapted model, based on its BLEU score on test2015, was picked for ensembling and/or rescoring. In all systems except for pre-translation, we observed considerable improvements of around 1 BLEU point when applying fine-tuning (cf. the Adapted column).
Finally, we rescored and combined four adapted systems (Forward Ensembled, R2L, Mix-source and Pre-translation) to obtain our submission to the campaign. It achieved 33.17 BLEU points on test2016, 0.9 BLEU points better than the Forward Ensembled system and 1.6 BLEU points better than our best single system (R2L).

English→Latvian

Regarding the two enhancement features mentioned above, the simple context gate improved the scores by 0.2 and 0.6 BLEU on the two sets respectively, while integrating the coverage mechanism into the attention model yielded further gains of 1.1 and 0.5 BLEU. The decoder recurrent network already receives previous context information through input-feeding; surprisingly, the coverage vector still manages to improve model performance. We assume the gain comes from a stronger attention network, which has more parameters than the cosine similarity between the hidden state and the context, and from the fact that the coverage vector can maintain attentional information over a longer past than input-feeding. It is notable that, even though improvements were observed, they are not consistent across the sets. One possible explanation is the difference between the development set (from Leta) and the test set (from news) in terms of domain and difficulty.
Regarding the consistency between BLEU score and perplexity, the model with the higher BLEU score does not necessarily have the lower perplexity (across different settings, for example baseline vs. coverage), even though we choose the model with the best perplexity for reporting BLEU scores. This is the case even when these models share the same vocabulary. We conclude that perplexity is a good measure for choosing models within a single run, even though it is not informative for comparing models with different network topologies.
By ensembling the three models, we managed to improve the translation performance by 1.3 BLEU points. For our final submission, we used another model, trained with reversed target sentences, to rescore the n-best list (n = 20) generated by the ensembled system, which improves the score by about 0.4 BLEU points.

Conclusion
In conclusion, we have described our experiments for the news translation task at WMT 2017, in which we tried out several techniques across different language pairs. Model-wise modifications such as the context gate and the coverage mechanism provided slight improvements, while NMT models benefit greatly from adaptation and pre-translation. As observed in previous work, the most consistent gains come from system ensembling/combination and reranking.

Table 3: Experiments for English→Latvian