The RWTH Aachen University English-German and German-English Machine Translation System for WMT 2017

This paper describes the statistical machine translation system developed at RWTH Aachen University for the English → German and German → English translation tasks of the EMNLP 2017 Second Conference on Machine Translation (WMT 2017). We use ensembles of attention-based neural machine translation systems for both directions. We use the provided parallel and synthetic data to train the models. In addition, we create a phrasal system using joint translation and reordering models in decoding and neural models in rescoring.


Introduction
We describe the statistical machine translation (SMT) systems developed by RWTH Aachen University for the German→English and English→German language pairs of the WMT 2017 evaluation campaign. After testing multiple systems and system combinations, we submitted an ensemble of multiple NMT networks, since it outperformed every tested system combination.
This paper is organized as follows. In Section 2 we describe our data preprocessing. Section 3 describes the generation of synthetic data. Our translation software and baseline setups are explained in Section 4, including the attention-based recurrent neural network ensemble in Subsection 4.1 and the phrasal joint translation and reordering (JTR) system in Subsection 4.2. Our experiments for each track are summarized in Section 5.

Preprocessing
We compared two different preprocessing pipelines for German→English for the attention-based recurrent neural network (NMT) system. The first is similar to the preprocessing used in our WMT 2015 submission (Peter et al., 2015), which was optimized for phrase-based translation (PBT).
The second is a simplified version which uses tokenization, frequent casing, and simple categories only. Note that the changes in preprocessing have a huge negative impact on the PBT system, while slightly improving the NMT system (Table 1). We therefore use the simplified version for all pure NMT experiments and the old preprocessing for all other systems. The phrasal JTR system uses the preprocessing technique that is optimized for PBT, as it relies on phrases as translation candidates. The preprocessing is similar to the one used in the WMT 2015 submission, but without any pre-ordering of source words. The English→German NMT system utilizes only the simplified preprocessing.

Synthetic Source Sentences
To increase the amount of usable parallel training data for the phrase-based and the neural machine translation systems, we translate a subset of the monolingual training data back to English in a similar way as described by (Bertoldi and Federico, 2009) and (Sennrich et al., 2016b).
We create a baseline German→English NMT system as described in Section 4.1 which is trained with all parallel data to translate 6.9M English sentences into German. For the other direction we use this newly created synthetic data and the parallel corpus to train a baseline English→German system, which in turn is used to translate an additional 4.4M sentences from English to German.
Further, we append the synthetic data created by (Sennrich et al., 2016a). This results in an additional 4.2M sentences for the German→English system and 3.6M for the opposite direction.

SMT Systems
For the WMT 2017 evaluation campaign, we have employed two different translation system architectures for the German→English direction:
• phrasal joint translation and reordering
• attention-based neural network ensemble
The word alignments required by some models are obtained with GIZA++ (Och and Ney, 2003). We use mteval from the Moses toolkit (Koehn et al., 2007) and TERCom to evaluate our systems on the BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) measures. Additionally, we use BEER (Stanojević and Sima'an, 2014) and CTER (Wang et al., 2016). All reported scores are case-sensitive and normalized.

Attention-Based Recurrent Neural Network
The best performing system provided by the RWTH is an attention-based recurrent neural network (NMT) similar to (Bahdanau et al., 2015).
The encoder and decoder word embeddings are of size 620. The encoder consists of a bidirectional layer with 1000 LSTMs with peephole connections (Hochreiter and Schmidhuber, 1997a) to encode the source side. Additionally, we ran experiments with two layers using 1000 LSTM nodes each, where we optionally connect all internal states of the first LSTM layer to the second. The data is converted into subword units using byte pair encoding with 20000 operations (Sennrich et al., 2016c).
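As a rough illustration of how byte pair encoding merge operations are learned, the following toy sketch repeatedly merges the most frequent adjacent symbol pair. It is not the tool used for the submission: the function name `learn_bpe`, the end-of-word marker and the tie-breaking rule are assumptions made for this example, and the systems described here use 20000 merge operations rather than the handful a toy corpus allows.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merge operations from a {word: count} dictionary (toy sketch)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically largest pair,
        # only to keep this sketch deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the chosen pair by the merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab
```

Applying the learned merges in the same order to unseen words yields the subword segmentation that the network is trained on.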
During training a batch size of 50 is used. The applied gradient algorithm is Adam (Kingma and Ba, 2014) with a learning rate of 0.001, and the four best models are averaged as described in the beginning of (Junczys-Dowmunt et al., 2016). Later experiments are done using Adam followed by an annealing scheme for learning rate reduction for SGD, as described in (Bahar et al., 2017).
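The averaging of the best models can be pictured as an element-wise mean over parameter values. The sketch below assumes that each checkpoint is a dictionary from parameter names to flat lists of floats and that checkpoints are ranked by a development-set score; both the data layout and the helper names `select_best` and `average_checkpoints` are hypothetical.

```python
def select_best(checkpoints_with_score, k=4):
    """Return the parameters of the k checkpoints with the highest score."""
    ranked = sorted(checkpoints_with_score, key=lambda item: item[1], reverse=True)
    return [params for params, _ in ranked[:k]]


def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint maps a parameter name to a flat list of floats
    (an assumed layout chosen to keep the sketch dependency-free).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged
```

The averaged parameters form a single model, so decoding costs the same as for one network, unlike an ensemble.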
The network is trained with 30% dropout for up to 500K iterations and evaluated every 10000 iterations on newstest2015. Decoding is done using a beam search with a beam size of 12.
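For readers unfamiliar with beam search, a minimal generic sketch follows. The `step_fn` callback, which returns log-probabilities for the next token given a prefix, stands in for the neural decoder and is an assumption of this example; the sketch also omits length normalization and other refinements that a production decoder may apply.

```python
def beam_search(step_fn, start_token, end_token, beam_size=12, max_len=50):
    """Generic beam search.

    step_fn(prefix) must return a dict {next_token: log_probability}.
    Returns the best (tokens, score) pair found.
    """
    beam = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every open hypothesis by every possible next token.
        candidates = []
        for prefix, score in beam:
            for token, log_prob in step_fn(prefix).items():
                candidates.append((prefix + [token], score + log_prob))
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beam.append((prefix, score))
        if not beam:
            break
    pool = finished if finished else beam
    return max(pool, key=lambda item: item[1])
```

With `beam_size=1` this reduces to greedy decoding; a larger beam can recover hypotheses whose first token is not the locally best choice.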
If the neural network creates a special number token, the corresponding source number with the highest attention weight is copied to the target side. The synthetic training data is created and used as described in Section 3.
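The number copying step can be sketched as follows. The placeholder string `$number`, the `is_number` predicate and the layout of the attention matrix (one row per target position, one column per source position) are assumptions made for illustration, not details taken from the actual system.

```python
NUMBER_TOKEN = "$number"  # hypothetical name of the special number token


def copy_numbers(source_tokens, target_tokens, attention, is_number):
    """Replace each number token in the target by a source number.

    attention[i][j] is the attention weight of target position i on
    source position j. For every number token in the target, the source
    number with the highest attention weight is copied.
    """
    number_positions = [j for j, tok in enumerate(source_tokens) if is_number(tok)]
    output = []
    for i, token in enumerate(target_tokens):
        if token == NUMBER_TOKEN and number_positions:
            best_j = max(number_positions, key=lambda j: attention[i][j])
            output.append(source_tokens[best_j])
        else:
            output.append(token)
    return output
```

Restricting the search to source positions that hold numbers avoids copying a non-numeric word when the attention is diffuse.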
In addition, we tested methods to provide the alignment computation with supplementary information, comparable to (Tu et al., 2016; Cohn et al., 2016). We model the word fertility and feed back the information of the last alignment points using a conventional layer with a window size of 5.
The final system was an ensemble of multiple systems, each trained with slightly different settings, as shown in Tables 2 and 4.

Phrasal Joint Translation and Reordering System
The phrasal Joint Translation and Reordering (JTR) decoder is based on the implementation of the source cardinality synchronous search (SCSS) procedure described in (Zens and Ney, 2008).
The system combines the flexibility of word-level models with the search accuracy of phrase candidates. It incorporates the JTR model (Guta et al., 2015), a language model (LM), a word class language model (wcLM) (Wuebker et al., 2013), phrasal translation probabilities, conditional JTR probabilities on phrase level and additional lexical models for smoothing purposes. The phrases are annotated with word alignments to allow for the application of word-level models.
A more detailed description of the translation candidate generation and the search procedure is given in (Peter et al., 2016). The phrase extraction and the estimation of the translation models are performed on all bilingual data excluding the rapid2016 corpus, the newstest2008-2013 and newssyscom2009 corpora and the first part of the synthetic data (Section 3). The non-synthetic data was filtered to contain only sentences with 4 unaligned words at most. In total, this results in 3.57M parallel and 6.94M synthetic sentences.
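The filter on unaligned words can be sketched as below. Whether the limit of 4 applies to source and target words together or to each side separately is not stated here; this sketch assumes the sum over both sides, and the function names are hypothetical.

```python
def count_unaligned(src_len, tgt_len, alignment):
    """Count words on either side that take part in no alignment link.

    alignment is an iterable of (source_index, target_index) pairs.
    """
    aligned_src = {j for j, _ in alignment}
    aligned_tgt = {i for _, i in alignment}
    return (src_len - len(aligned_src)) + (tgt_len - len(aligned_tgt))


def keep_pair(src_tokens, tgt_tokens, alignment, max_unaligned=4):
    """Keep a sentence pair only if it has at most max_unaligned unaligned words.

    Assumption: unaligned source and target words are counted together.
    """
    return count_unaligned(len(src_tokens), len(tgt_tokens), alignment) <= max_unaligned
```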

JTR Model
A JTR sequence $(\tilde{f}, \tilde{e})_1^{\tilde{I}}$ is an interpretation of a bilingual sentence pair $(f_1^J, e_1^I)$ and its word alignment $b_1^I$. The joint probability $p(f_1^J, e_1^I, b_1^I)$ can be modeled as an n-gram model over the JTR tokens:

$$p(f_1^J, e_1^I, b_1^I) = \prod_{i=1}^{\tilde{I}} p\big((\tilde{f}, \tilde{e})_i \mid (\tilde{f}, \tilde{e})_{i-n+1}^{i-1}\big)$$

The Viterbi alignments for both translation directions are obtained using GIZA++ (Och and Ney, 2003), merged and then used to convert the bilingual sentence pairs into JTR sequences. A 7-gram JTR joint model (Guta et al., 2015), which is responsible for estimating the translation and reordering probabilities, is trained on those. It is estimated with interpolated modified Kneser-Ney smoothing (Chen and Goodman, 1998) using the KenLM toolkit (Heafield et al., 2013).

Language Models
The phrase-based translation system uses two language models (LM) that are estimated with the KenLM toolkit (Heafield et al., 2013): a conventional LM and a word class language model (wcLM). The technique described in (Wuebker et al., 2013) is applied to estimate the wcLM on the same data as the conventional LM.
Both models are trained on all monolingual corpora, except the commoncrawl corpus, and the target side of the bilingual data (Section 4.2), which sums up to 365.44M sentences and 7230.15M running words.

Log-Linear Features in Decoding
In addition to the JTR model and the language models, JTR conditional models for both directions (Peter et al., 2016) are included in the log-linear framework. They are computed offline on the phrase level. Moreover, the system incorporates phrase translation models estimated as relative frequencies for both directions.
Because the JTR models are trained on Viterbi aligned word pairs, they are limited to the context provided by the aligned word pairs and sensitive to the quality of the word alignments. To overcome this issue, we incorporate IBM 1 lexical models for both directions. The models are trained on all available bilingual data and the synthetic data, see Section 3.
The heuristic features used by the decoder are an enhanced low frequency penalty (Chen et al., 2011), a penalty for unaligned source words and a symmetric word-level distortion penalty. Thus, different phrasal segmentations have the same reordering costs if they are equal in their word alignments. An additional word bonus helps to control the length of the hypothesized translation by counteracting the language model, which prefers translations to be rather short.
The decoder also incorporates a gap distance penalty (Durrani et al., 2011). All parameter weights are optimized using MERT (Och, 2003) towards the BLEU metric.
An attention-based recurrent neural model is applied as an additional feature in rescoring 1000-best lists, see Section 4.2.4.
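To make the log-linear combination and the n-best rescoring concrete, the following sketch scores each hypothesis as a weighted sum of feature values and returns the best one. The feature names `decoder` and `nmt` and the weights in the usage note are invented for illustration; in the actual system the weights come from MERT.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (typically log-probabilities or penalties)."""
    return sum(weights[name] * value for name, value in features.items())


def rescore_nbest(nbest, weights):
    """Pick the hypothesis with the highest log-linear score.

    nbest is a list of (hypothesis, {feature_name: value}) pairs, for
    example the 1000-best list of a decoder extended by a neural model score.
    """
    return max(nbest, key=lambda item: loglinear_score(item[1], weights))[0]
```

For example, with `weights = {"decoder": 1.0, "nmt": 0.5}` a hypothesis that the decoder ranks second can still win if the added neural feature prefers it strongly enough.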

Attention-based Recurrent Neural Network in Re-Ranking
An attention-based recurrent neural network similar to those in Subsection 4.1 is used within the log-linear framework for rescoring 1000-best lists generated by the phrasal JTR decoder. The model is trained on 6.96M sentences of the synthetic data.
The network uses the 30K most frequent words as source and target vocabulary, respectively. The decoder and encoder word embeddings are of size 500; the encoder uses a bidirectional LSTM layer with 1K units (Hochreiter and Schmidhuber, 1997b) to encode the source side. An LSTM layer with 1K units is used by the decoder.
Training is performed for up to 300K iterations with a batch size of 50, and Adam (Kingma and Ba, 2014) is used as the optimization algorithm. The parameters of the best four networks on newstest2015 with regard to BLEU score are averaged to produce the final model used in reranking.

Alignment-based Recurrent Neural Network in Re-Ranking
Besides the attention-based model, we apply recurrent alignment-based neural networks in 1000-best rescoring. These networks are similar to the ones used in rescoring in (Alkhouli et al., 2016).
We use a bidirectional alignment model that has a bidirectional encoder (2 LSTM layers), a unidirectional target encoder (1 LSTM layer), and an additional decoder LSTM layer. The model pairs each target state computed at target position i − 1 with its aligned bidirectional source state. The alignment information is obtained using GIZA++ in training, and from the 1000-best lists during rescoring. The paired states are fed into the decoder layer. The model predicts the discrete jump from the previous to the current source position. The model is described in (Alkhouli and Ney, 2017).
We also use a bidirectional lexical model to score word translation. It uses an architecture similar to that of the alignment model, with the exception that pairing is done using the source states aligned to the target position i instead of i − 1. We also add weighted residual connections connecting the target states and the decoder states in the lexical model. We train two variants of this model, one including the target state, and one dropping it completely.
All models use four 200-node LSTM layers with the exception of the lexical model that includes the target state, which uses 350 nodes per layer. We use a class-factored output layer of 2000 classes, where 1000 classes are dedicated to the most frequent words, while the remaining 1000 classes are shared. This enables handling large vocabularies. The target vocabulary is reduced to 269K words, while the source vocabulary is reduced to 317K words.
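A class-factored output layer computes a word probability as the product of a class probability and a within-class word probability, so that only one class-sized softmax and one softmax over the class members are needed per word. The sketch below illustrates this factorization with plain score lists; the data structures and the function name `word_probability` are assumptions, and real scores would come from the network's output layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_probability(word, word_to_class, class_members, class_scores, word_scores):
    """p(word) = p(class of word) * p(word | class).

    class_scores: one score per class (index = class id).
    word_scores:  dict word -> score, normalized only within the word's class.
    """
    cls = word_to_class[word]
    p_class = softmax(class_scores)[cls]
    members = class_members[cls]
    p_in_class = softmax([word_scores[w] for w in members])[members.index(word)]
    return p_class * p_in_class
```

A frequent word that has a class of its own gets a within-class probability of 1, so its probability is determined by the class softmax alone; rarer words share classes and need the second factor.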

System Combination
System combination is applied to produce consensus translations from multiple hypotheses obtained from different translation approaches. The consensus translations typically outperform the individual hypotheses in terms of translation quality. A system combination implementation developed at RWTH Aachen University (Freitag et al., 2014) is used to combine the outputs of the different engines.
The first step in system combination is the generation of confusion networks (CN) from I input translation hypotheses. We need pairwise alignments between the input hypotheses. The alignments are obtained by METEOR (Banerjee and Lavie, 2005). The hypotheses are then reordered to match a selected skeleton hypothesis regarding the order of words. We generate I different CNs, each having one of the input systems as the skeleton hypothesis. The final lattice is the union of all I-many generated CNs.
The decoding of a confusion network consists of finding the shortest path in the network. Each arc is assigned a score of a linear model combination of M different models, which includes a word penalty, a 3-gram LM trained on the input hypotheses, a binary primary system feature that marks the primary hypothesis and a binary voting feature for each system. The binary voting feature for a system outputs 1 if the decoded word originates from that system and 0 otherwise. The model weights for the system combination are trained with MERT.
Table 3: Results of the individual systems for the German→English task. The system combination contains the systems in lines 3, 6, and 7.
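The confusion network decoding described in this section can be illustrated with a strongly simplified sketch that uses only the voting features and the word penalty. Without the LM and the primary system feature, arc costs are independent of each other, so the shortest path through a single confusion network is found by choosing the cheapest arc in every slot. The slot representation, the empty-string epsilon arc and the function names are assumptions of this example.

```python
EPSILON = ""  # hypothetical representation of an empty (skip) arc


def arc_cost(word, voters, system_weights, word_penalty):
    """Cost of one arc: negative weighted votes plus a penalty per real word."""
    cost = -sum(system_weights[system] for system in voters)
    if word != EPSILON:
        cost += word_penalty
    return cost


def decode_confusion_network(network, system_weights, word_penalty=0.0):
    """Return (words, total_cost) of the cheapest path.

    network is a list of slots; each slot maps a word (or EPSILON) to the
    set of system ids whose hypothesis contains that word at this slot.
    """
    output, total = [], 0.0
    for slot in network:
        word, voters = min(
            slot.items(),
            key=lambda item: arc_cost(item[0], item[1], system_weights, word_penalty),
        )
        total += arc_cost(word, voters, system_weights, word_penalty)
        if word != EPSILON:
            output.append(word)
    return output, total
```

Once an n-gram LM is added, arc costs depend on the preceding words and a proper shortest-path search over the lattice is required instead of this slot-wise choice.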

Experimental Evaluation
We have mainly focused on building a strong German→English system and run most experiments on this task. We used newstest2015 as the development set.
After switching the preprocessing as described in Section 2, we have added the word fertility, which improves the baseline system by about 0.8 BLEU on newstest2016 as shown in Table 2. Adding the synthetic data as described in Section 3 gives a gain of 3.8 BLEU on newstest2016. Changing the number of layers in the decoder from one to two improves the performance by an additional 0.8 BLEU. Filtering the rapid data corpus by scoring all bilingual sentences with an NMT system trained on all parallel data and removing the sentences with the worst scores improves the system on newstest2016 by 0.4 BLEU, but yields only a small improvement on newstest2015. Surprisingly, it even decreases the performance on newstest2017, as observed at a later point in time. Switching from merging the 4 best networks in a training run to continuing the training with an annealing scheme for learning rate reduction for SGD, as described in (Bahar et al., 2017), has barely changed the performance on newstest2016. Nevertheless, we have decided to keep on using it, since it slightly helped on newstest2015.
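The corpus filtering step can be sketched as follows, assuming that each sentence pair already has a model score in which higher is better (for example a log-probability from an NMT system). The fraction of removed sentences is not given here, so `drop_fraction` is a free parameter of this illustration, and the function name is hypothetical.

```python
def filter_worst(pairs, scores, drop_fraction):
    """Remove the lowest-scoring fraction of sentence pairs, keeping corpus order.

    pairs:  list of sentence pairs (any objects).
    scores: one model score per pair, higher is better.
    """
    n_drop = int(len(pairs) * drop_fraction)
    # Indices sorted from worst to best score.
    order = sorted(range(len(pairs)), key=lambda k: scores[k])
    dropped = set(order[:n_drop])
    return [pair for k, pair in enumerate(pairs) if k not in dropped]
```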
We have used this, without the word fertility, as a base setup to train multiple systems with slightly different settings for an ensemble. In the first setting we use all LSTM states of the first decoder layer as input for the second decoder layer. This actually hurts the performance. Adding the word fertility or the alignment feedback as additional information does not have a large impact. Note that the word fertility helps when it is added to the baseline system; we are not sure why the effect disappears. Combining systems in one ensemble improves the system again by 1.1 BLEU on newstest2016.
We also combined the NMT system with the strongest phrasal JTR system and a few other combinations as well, but none of them has been able to improve over the NMT ensemble (Table 3). We therefore used the NMT system as our final submission. In the table, we can see that using three alignment-based models is comparable to using a single attention-based model. Note, however, that these models have relatively small LSTM layers of 200 and 350 nodes per layer. Meanwhile, the attention model uses 1000-node LSTM layers. When added on top of the alignment-based mix, the attention model only improves the mix slightly.
For the English→German system we have simply used the three best working NMT systems from the German→English setup and combined them in an ensemble. The word fertility and alignment feedback extensions also did not improve the performance, but the ensemble increased the overall performance by 1 BLEU on newstest2016. Due to computation time limitations, we did not succeed in building a phrasal JTR system in time.

Conclusion
RWTH Aachen University has participated with a neural machine translation ensemble for the German→English and English→German WMT 2017 evaluation campaign. All networks are trained using all given parallel data and back-translated synthetic data, with two LSTM layers in the decoder. The rapid corpus has been filtered to remove the most unlikely sentences. Adam followed by an annealing scheme of learning rate reduction is used for optimization. Four networks are combined for the German→English ensemble and three for the English→German ensemble. In addition, we have submitted a phrasal JTR system, which has come close to the performance of a single neural machine translation network for newstest2017. Using system combination has not improved the performance of the best neural ensemble.

Table 2: Results of the individual systems for the German→English task. The base system contains synthetic data, two decoder layers, filtered rapid data, and was trained with annealing learning rate instead of merging. Details are explained in Section 4.1.
The two language models are integrated into the decoder as separate models in the log-linear combination: a 5-gram LM and a 7-gram word-class language model (wcLM). Both use interpolated modified Kneser-Ney smoothing. For the word-class LM, we train 200 word classes on the target side of the bilingual training data using an in-house tool (Botros et al., 2015) similar to mkcls (Och, 2000). We have not tuned the number of word classes, but simply used 200, as it has proved to work well in previous systems. With these class definitions, we apply the technique described in (Wuebker et al., 2013).

Table 4: Results of the individual systems for the English→German task.