The QT21/HimL Combined Machine Translation System

This paper describes the joint submission of the QT21 and HimL projects to the WMT 2016 shared task for machine translation of news: a combined system built from twelve individual English→Romanian translation engines contributed by the project partners.


Introduction
Quality Translation 21 (QT21) is a European machine translation research project with the aim of substantially improving statistical and machine learning based translation models for challenging languages and low-resource scenarios.
Health in my Language (HimL) aims to make public health information available in a wider variety of languages, using fully automatic machine translation that combines the statistical paradigm with deep linguistic techniques.
In order to achieve high-quality machine translation from English into Romanian, members of the QT21 and HimL projects have jointly built a combined statistical machine translation system. We participated with the QT21/HimL combined machine translation system in the WMT 2016 shared task for machine translation of news.1 Core components of the QT21/HimL combined system are twelve individual English→Romanian translation engines which have been set up by different QT21 or HimL project partners. The outputs of all these individual engines are combined using the system combination approach as implemented in Jane, RWTH's open source statistical machine translation toolkit (Freitag et al., 2014a). The Jane system combination is a mature implementation which has previously been employed successfully in other collaborative projects and for different language pairs (Freitag et al., 2013; Freitag et al., 2014b; Freitag et al., 2014c).
In the remainder of the paper, we present the technical details of the QT21/HimL combined machine translation system and the experimental results obtained with it. The paper is structured as follows: We describe the common preprocessing used for most of the individual engines in Section 2. Section 3 covers the characteristics of the different individual engines, followed by a brief overview of our system combination approach (Section 4). We then summarize our empirical results in Section 5, showing that we achieve better translation quality than with any individual engine. Finally, in Section 6, we provide a statistical analysis of certain linguistic phenomena, specifically the prediction precision on morphological attributes. We conclude the paper with Section 7.

Preprocessing
The data provided for the task was preprocessed once, by LIMSI, and shared with all the participants, in order to ensure consistency between systems. On the English side, preprocessing consists of tokenizing and truecasing using the Moses toolkit (Koehn et al., 2007).
On the Romanian side, the data is tokenized using LIMSI's tokro (Allauzen et al., 2016), a rule-based tokenizer that mainly normalizes diacritics and splits punctuation and clitics. This data is truecased in the same way as the English side. In addition, the Romanian sentences are also tagged, lemmatized, and chunked using the TTL tagger (Tufiş et al., 2008).
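The kind of normalization such a rule-based tokenizer performs can be illustrated with a toy sketch (a hypothetical re-implementation for illustration only; the actual tokro tool has a much richer rule set). The standard Romanian diacritic fix replaces the cedilla code points with the correct comma-below forms:

```python
import re

# Toy rule-based Romanian preprocessing sketch: normalize cedilla
# diacritics (ş, ţ) to the standard comma-below forms (ș, ț), then
# split punctuation and a few common clitics off word tokens.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219", "\u015e": "\u0218",  # ş/Ş -> ș/Ș
    "\u0163": "\u021b", "\u0162": "\u021a",  # ţ/Ţ -> ț/Ț
})

def tokenize_ro(sentence: str) -> list[str]:
    s = sentence.translate(CEDILLA_TO_COMMA)
    s = re.sub(r"([.,!?;:()\"])", r" \1 ", s)           # split punctuation
    s = re.sub(r"(\w)-(ul|l|i|le|s)\b", r"\1 -\2", s)   # split some clitics
    return s.split()
```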

Translation Systems
Each group contributed one or more systems. In this section, the systems are presented in alphabetical order.

KIT
The KIT system consists of a phrase-based machine translation system using additional models in rescoring. The phrase-based system is trained on all available parallel training data. The phrase table is adapted to the SETimes2 corpus (Niehues and Waibel, 2012). The system uses a pre-reordering technique (Rottmann and Vogel, 2007) in combination with lexical reordering. It uses two word-based n-gram language models and three additional non-word language models. Two of them are automatic word class-based language models (Och, 1999), using 100 and 1,000 word classes. In addition, we use a POS-based language model. During decoding, we use a discriminative word lexicon as well.
We rescore the system output using a 300-best list. The weights are optimized on the concatenation of the development data and the SETimes2 dev set using the ListNet algorithm. In rescoring, we add source discriminative word lexica (Herrmann et al., 2015) as well as neural network language and translation models. These models use a factored word representation of the source and the target. On the source side, we use the word surface form and two automatic word classes with 100 and 1,000 classes. On the Romanian side, we add POS information as an additional word factor.
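The rescoring step can be sketched as log-linear reranking of an n-best list: each hypothesis carries the decoder's model scores plus the scores added in rescoring, and the hypothesis with the best weighted sum wins. Feature names, scores, and weights below are purely illustrative:

```python
# Minimal log-linear n-best rescoring sketch (illustrative names only):
# combine per-hypothesis feature scores with tuned weights, pick the best.
def rescore(nbest, weights):
    """nbest: list of (hypothesis, {feature: score}) pairs."""
    def total(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(nbest, key=lambda hyp: total(hyp[1]))

nbest = [
    ("casa este mare", {"tm": -2.1, "lm": -5.0, "nnlm": -4.2}),
    ("casa e mare",    {"tm": -2.4, "lm": -4.1, "nnlm": -3.0}),
]
weights = {"tm": 1.0, "lm": 0.5, "nnlm": 0.8}
best, _ = rescore(nbest, weights)
```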

LIMSI
The LIMSI system uses NCODE (Crego et al., 2011), which implements the bilingual n-gram approach to SMT (Casacuberta and Vidal, 2004), which is closely related to the standard phrase-based approach (Zens et al., 2002). In this framework, translation is divided into two steps. To translate a source sentence into a target sentence, the source sentence is first reordered according to a set of rewriting rules so as to reproduce the target word order. This generates a word lattice containing the most promising source permutations, which is then translated. Since the translation step is monotonic, this approach is able to rely on the n-gram assumption to decompose the joint probability of a sentence pair into a sequence of bilingual units called tuples.
We train three Romanian 4-gram language models, pruning all singletons, with KenLM (Heafield, 2011). We use the in-domain monolingual corpus, the Romanian side of the parallel corpora, and a subset of the (out-of-domain) Common Crawl corpus as training data. We select in-domain sentences from the latter using the Moore-Lewis filtering method (Moore and Lewis, 2010), more specifically its implementation in XenC (Rousseau, 2013). As a result, one third of the initial corpus is removed. Finally, we linearly interpolate these models using the SRILM toolkit (Stolcke, 2002).
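The Moore-Lewis criterion can be sketched as follows, here with toy add-one-smoothed unigram models standing in for the n-gram language models that XenC actually estimates: each out-of-domain sentence is scored by its per-word cross-entropy difference between the in-domain and out-of-domain models, and the lowest-scoring (most in-domain-like) fraction is kept.

```python
import math
from collections import Counter

# Moore-Lewis data selection sketch with toy unigram LMs.
def unigram_lm(corpus):
    counts = Counter(w for sent in corpus for w in sent.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: math.log((counts[w] + 1) / (total + vocab))  # add-one

def moore_lewis_select(in_domain, out_domain, keep_ratio):
    lm_in, lm_out = unigram_lm(in_domain), unigram_lm(out_domain)
    def score(sent):  # = H_in(s) - H_out(s); lower is more in-domain-like
        words = sent.split()
        return sum(lm_out(w) - lm_in(w) for w in words) / len(words)
    ranked = sorted(out_domain, key=score)
    return ranked[: int(len(ranked) * keep_ratio)]

in_domain = ["health information for patients", "public health advice"]
out_domain = ["health advice for patients", "stock market prices fell"]
selected = moore_lewis_select(in_domain, out_domain, keep_ratio=0.5)
```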

LMU-CUNI
The LMU-CUNI contribution is a constrained Moses phrase-based system. It uses a simple factored setting: our phrase table produces not only the target surface form but also its lemma and morphological tag. On the input, we include lemmas, POS tags and information from dependency parses (lemma of the parent node and syntactic relation), all encoded as additional factors.
The main difference from a standard phrase-based setup is the addition of a feature-rich discriminative translation model which is conditioned on both source- and target-side context (Tamchyna et al., 2016). The motivation for using this model is to better condition lexical choices by using the source context and to improve morphological and topical coherence by modeling the (limited left-hand side) target context.
We also take advantage of the target factors by using a 7-gram language model trained on sequences of Romanian morphological tags. Finally, our system also uses a standard lexicalized reordering model.

LMU
The LMU system integrates a discriminative rule selection model into a hierarchical SMT system, as described by Tamchyna et al. (2014). The rule selection model is implemented using the high-speed classifier Vowpal Wabbit,2 which is fully integrated into Moses' hierarchical decoder. During decoding, the rule selection model is called at each rule application with syntactic context information as feature templates. The features are the same as used by Braune et al. (2015) in their string-to-tree system, including both lexical and soft source syntax features. The translation model features comprise the standard hierarchical features (Chiang, 2005) with an additional feature for the rule selection model (Braune et al., 2016).
Before training, we reduce the number of translation rules using significance testing (Johnson et al., 2007). To extract the features of the rule selection model, we parse the English part of our training data using the Berkeley parser (Petrov et al., 2006). For model prediction during tuning and decoding, we use parsed versions of the development and test sets. We train the rule selection model using VW and tune the weights of the translation model using batch MIRA (Cherry and Foster, 2012). The 5-gram language model is trained using KenLM (Heafield et al., 2013) on the Romanian part of the Common Crawl corpus concatenated with the Romanian part of the training data.
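The significance filtering step can be sketched as follows (counts and the threshold below are illustrative): a phrase pair is kept only if Fisher's exact test indicates that its cooccurrence count is unlikely under independence, which in particular discards singleton pairs.

```python
import math

# Sketch of significance-based phrase-pair filtering in the spirit of
# Johnson et al. (2007). N: number of sentence pairs; cs/ct: sentences
# containing the source/target phrase; cst: sentences containing both.
def keep_phrase_pair(N, cs, ct, cst, threshold):
    # One-sided Fisher's exact test: probability of a joint count >= cst
    # under independence (hypergeometric tail).
    p = sum(math.comb(cs, k) * math.comb(N - cs, ct - k)
            for k in range(cst, min(cs, ct) + 1)) / math.comb(N, ct)
    return -math.log(p) > threshold
```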

RWTH Aachen University: Hierarchical Phrase-based System
The RWTH hierarchical setup uses the open source translation toolkit Jane 2.3 (Vilar et al., 2010). Hierarchical phrase-based translation (HPBT) (Chiang, 2007) induces a weighted synchronous context-free grammar from parallel text. In addition to the contiguous lexical phrases, as used in phrase-based translation (PBT), hierarchical phrases with up to two gaps are also extracted. Our baseline model contains models with phrase translation probabilities and lexical smoothing probabilities in both translation directions, word and phrase penalty, and enhanced low frequency features (Chen et al., 2011). It also contains binary features to distinguish between hierarchical and non-hierarchical phrases, the glue rule, and rules with non-terminals at the boundaries. We use the cube pruning algorithm (Huang and Chiang, 2007) for decoding.
The system uses three backoff language models (LMs) that are estimated with the KenLM toolkit (Heafield et al., 2013) and are integrated into the decoder as separate models in the log-linear combination: a full 4-gram LM (trained on all data), a limited 5-gram LM (trained only on in-domain data), and a 7-gram word class language model (wcLM) trained on all data with an output vocabulary of 143K words.
The system produces 1000-best lists which are reranked using an LSTM-based (Hochreiter and Schmidhuber, 1997; Gers et al., 2000; Gers et al., 2003) language model (Sundermeyer et al., 2012) and an LSTM-based bidirectional joined model (BJM) (Sundermeyer et al., 2014a). The models have a class-factored output layer (Goodman, 2001; Morin and Bengio, 2005) to speed up training and evaluation. The language model uses 3 stacked LSTM layers, with 350 nodes each. The BJM has a projection layer, and computes a forward recurrent state encoding the source and target history, a backward recurrent state encoding the source future, and a third LSTM layer to combine them. All layers have 350 nodes. The neural networks are implemented using an extension of the RWTHLM toolkit (Sundermeyer et al., 2014b). The parameter weights are optimized with MERT (Och, 2003) towards the BLEU metric.

RWTH Neural System
The second system provided by RWTH is an attention-based recurrent neural network translation system. The implementation is based on Blocks (van Merriënboer et al., 2015) and Theano (Bergstra et al., 2010; Bastien et al., 2012).
The network uses the 30K most frequent words on the source and target side as its vocabulary. The decoder and encoder word embeddings are of size 620. The encoder uses a bidirectional layer with 1024 GRUs (Cho et al., 2014) to encode the source side, while the decoder uses a single GRU layer with 1024 units.
The network is trained for up to 300K updates with a minibatch size of 80 using Adadelta (Zeiler, 2012). The network is evaluated on BLEU every 10,000 updates, and the network performing best on the newsdev2016/1 dev set is selected as the final network.
The monolingual News Crawl 2015 corpus is translated into English with a simple phrase-based translation system to create additional parallel training data. The new data is weighted by using the News Crawl 2015 corpus (2.3M sentences) once, the Europarl corpus (0.4M sentences) twice and the SETimes2 corpus (0.2M sentences) three times. The final system is an ensemble of 4 networks, all with the same configuration and training settings.
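The corpus weighting described above amounts to repeating each corpus an integer number of times before concatenation. A minimal sketch (the toy sentence pairs stand in for the real corpora):

```python
# Oversample corpora by integer weights before concatenating training data.
def weighted_concat(corpora):
    """corpora: list of (sentence_pairs, integer_weight)."""
    data = []
    for pairs, weight in corpora:
        data.extend(pairs * weight)
    return data

news_crawl = [("back-translated src", "ro tgt")]  # weight 1 (2.3M pairs)
europarl   = [("europarl src", "europarl tgt")]   # weight 2 (0.4M pairs)
setimes2   = [("setimes src", "setimes tgt")]     # weight 3 (0.2M pairs)
training_data = weighted_concat([(news_crawl, 1), (europarl, 2), (setimes2, 3)])
```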

Tilde
The Tilde system is a phrase-based machine translation system built on the LetsMT infrastructure (Vasiļjevs et al., 2012) that features language-specific data filtering and cleaning modules. Tilde's system was trained on all available parallel data. Two language models are trained using KenLM (Heafield, 2011): 1) a 5-gram model using the Europarl and SETimes2 corpora, and 2) a 3-gram model using the Common Crawl corpus. We also apply a custom tokenization tool that takes into account specifics of the Romanian language and handles non-translatable entities (e.g., file paths, URLs, e-mail addresses, etc.). During translation, a rule-based localisation feature is applied.

Edinburgh/LMU Hierarchical System
The UEDIN-LMU HPBT system is a hierarchical phrase-based machine translation system (Chiang, 2005) built jointly by the University of Edinburgh and LMU Munich. The system is based on the open source Moses implementation of the hierarchical phrase-based paradigm (Hoang et al., 2009). In addition to a set of standard features in a log-linear combination, a number of non-standard enhancements are employed to achieve improved translation quality.
Specifically, we integrate individual language models trained over the separate corpora (News Crawl 2015, Europarl, SETimes2) directly into the log-linear combination of the system and let MIRA (Cherry and Foster, 2012) optimize their weights along with all other features in tuning, rather than relying on a single linearly interpolated language model. We add another background language model estimated over a concatenation of all Romanian corpora including Common Crawl. All language models are unpruned.
For hierarchical rule extraction, we impose less strict extraction constraints than the Moses defaults. We extract more hierarchical rules by allowing for a maximum of ten symbols on the source side, a maximum span of twenty words, and no lower limit to the amount of words covered by right-hand side non-terminals at extraction time. We discard rules with non-terminals on their right-hand side if they are singletons in the training data.
In order to promote better reordering decisions, we implemented a feature in Moses that resembles the phrase orientation model for hierarchical machine translation and extended our system with it. The model scores orientation classes (monotone, swap, discontinuous) for each rule application in decoding.
We finally follow the approach outlined by Huck et al. (2011) for lightly-supervised training of hierarchical systems. We automatically translate parts (1.2M sentences) of the monolingual Romanian News Crawl 2015 corpus to English with a Romanian→English phrase-based statistical machine translation system (Williams et al., 2016). The foreground phrase table extracted from the human-generated parallel data is filled up with entries from a background phrase table extracted from the automatically produced News Crawl 2015 parallel data. A more in-depth description of the Edinburgh/LMU hierarchical machine translation system, along with detailed experimental results, is provided in a separate system description.

Edinburgh Neural System
Edinburgh's neural machine translation system is an attentional encoder-decoder network, which we train with nematus.3 We use byte-pair encoding (BPE) to achieve open-vocabulary translation with a fixed vocabulary of subword symbols (Sennrich et al., 2016c). We produce additional parallel training data by automatically translating the monolingual Romanian News Crawl 2015 corpus into English (Sennrich et al., 2016b), which we combine with the original parallel data in a 1-to-1 ratio. We use minibatches of size 80, a maximum sentence length of 50, word embeddings of size 500, and hidden layers of size 1024. We apply dropout to all layers (Gal, 2015), with dropout probability 0.2, and also drop out full words with probability 0.1. We clip the gradient norm to 1.0 (Pascanu et al., 2013). We train the models with Adadelta (Zeiler, 2012), reshuffling the training corpus between epochs. We validate the model every 10,000 minibatches via BLEU on a validation set, and perform early stopping on BLEU. Decoding is performed with beam search with a beam size of 12.
A more detailed description of the system, and more experimental results, can be found in Sennrich et al. (2016a).
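The BPE merge-learning loop can be illustrated with a hypothetical minimal sketch (greatly simplified relative to the actual subword implementation): starting from character sequences, repeatedly merge the most frequent adjacent symbol pair across the vocabulary.

```python
from collections import Counter

# Toy byte-pair encoding sketch: learn merge operations from word frequencies.
def learn_bpe(word_freqs, num_merges):
    vocab = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```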

Edinburgh Phrase-based System
Edinburgh's phrase-based system is built using the Moses toolkit, with fast align (Dyer et al., 2013) for word alignment, and KenLM (Heafield et al., 2013) for language model training. In our Moses setup, we use hierarchical lexicalized reordering (Galley and Manning, 2008), an operation sequence model, domain indicator features, and binned phrase count features. We use all available parallel data for the translation model, and all available Romanian text for the language model. We use two different 5-gram language models: one built from all the monolingual target text concatenated, without pruning, and one built from only News Crawl 2015, with singletons of order 3 and above pruned out. The weights of all these features and models are tuned with k-best MIRA (Cherry and Foster, 2012) on the first half of newsdev2016. In decoding, we use MBR (Kumar and Byrne, 2004), cube-pruning (Huang and Chiang, 2007) with a pop-limit of 5000, and the Moses "monotone at punctuation" switch (to prevent reordering across punctuation) (Koehn and Haddow, 2009).

USFD Phrase-based System
USFD's phrase-based system is built using the Moses toolkit, with MGIZA (Gao and Vogel, 2008) for word alignment and KenLM (Heafield et al., 2013) for language model training. We use all available parallel data for the translation model. A single 5-gram language model is built using the entire target side of the parallel data and a subset of the monolingual Romanian corpora selected with XenC-v2 (Rousseau, 2013). For the latter, we use all the parallel data as in-domain data and the first half of newsdev2016 as the development set. The feature weights are tuned with MERT (Och, 2003) on the first half of newsdev2016.
The system produces distinct 1000-best lists, for which we extend the feature set with the 17 baseline black-box features from sentence-level Quality Estimation (QE) produced with Quest++4 (Specia et al., 2015). The 1000-best lists are then reranked and the top hypothesis extracted using the n-best rescorer available within the Moses toolkit.

UvA
We use a phrase-based machine translation system (Moses) with a distortion limit of 6 and lexicalized reordering. Before translation, the English source side is preordered using the neural preordering model of de Gispert et al. (2015). The preordering model is trained for 30 iterations on the full MGIZA-aligned training data. We use two language models, built using KenLM. The first is a 5-gram language model trained on all available data; words in the Common Crawl dataset that appear fewer than 500 times are replaced by UNK, and all singleton n-grams of order 3 or higher are pruned. We also use a 7-gram class-based language model trained on the same data, with 512 word classes generated using the method of Green et al. (2014).

System Combination
System combination produces consensus translations from multiple hypotheses which are obtained from different translation approaches, i.e., the systems described in the previous section. A system combination implementation developed at RWTH Aachen University (Freitag et al., 2014a) is used to combine the outputs of the different engines. The consensus translations outperform the individual hypotheses in terms of translation quality.
The first step in system combination is the generation of confusion networks (CNs) from I input translation hypotheses. We need pairwise alignments between the input hypotheses, which are obtained from METEOR (Banerjee and Lavie, 2005). The hypotheses are then reordered to match a selected skeleton hypothesis in terms of word ordering. We generate I different CNs, each having one of the input systems as the skeleton hypothesis, and the final lattice is the union of all I generated CNs. Figure 1 depicts an example confusion network with I = 4 input translations. Decoding a confusion network amounts to finding the best path through the network. Each arc is assigned a score from a linear combination of M models, which include a word penalty, a 3-gram language model trained on the input hypotheses, a binary primary-system feature that marks the primary hypothesis, and a binary voting feature for each system. The binary voting feature for a system is 1 if and only if the decoded word is from that system, and 0 otherwise. The model weights for system combination are trained with MERT (Och, 2003).
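The voting idea behind confusion-network decoding can be illustrated with a heavily simplified sketch. It assumes the hypotheses have already been aligned into network slots (in the real system this alignment comes from METEOR and the lattice is a union of per-skeleton networks, decoded with the full log-linear model rather than plain majority voting):

```python
from collections import Counter

# Simplified consensus over pre-aligned confusion-network slots:
# each slot holds one candidate word per system (None = epsilon arc),
# and a word is emitted only if a majority of systems vote for it.
def consensus(slots):
    output = []
    for candidates in slots:
        votes = Counter(w for w in candidates if w is not None)
        if not votes:
            continue
        word, count = votes.most_common(1)[0]
        if count > len(candidates) - count:  # strict majority
            output.append(word)
    return output

slots = [
    ["casa", "casa", "casa", "casă"],
    ["este", "e", "este", "este"],
    [None, "foarte", None, None],      # insertion by one system only
    ["mare", "mare", "mare", "mare"],
]
```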

Experimental Evaluation
Since only one development set was provided, we split the given development set into two parts: newsdev2016/1 and newsdev2016/2. The first part was used as the development set, while the second part served as our internal test set. Additionally, we extracted 2,000 sentences from the Europarl and SETimes2 data to create two additional development and test sets. Most single systems were optimized on newsdev2016/1 and/or the SETimes2 test set. The system combination was optimized on newsdev2016/1.
The single system scores in Table 1 show clearly that the UEDIN NMT system is the strongest single system by a large margin. The other standalone attention-based neural network contribution, RWTH NMT, follows, only a small margin ahead of the phrase-based contributions. The combination of all systems improved the strongest system by another 1.9 BLEU points on our internal test set, newsdev2016/2, and by 1 BLEU point on the official test set, newstest2016.
Removing the strongest system from our system combination shows a large degradation of the results. The combination is still slightly stronger than the UEDIN NMT system on newsdev2016/2, but lags behind on newstest2016. Removing the individually weakest system shows a slight degradation on newsdev2016/2 and newstest2016, hinting that it still provides valuable information. Table 2 shows a comparison between all systems by scoring the translation outputs against each other in TER and BLEU. We see that the neural network outputs differ the most from all the other systems.

Morphology Prediction Precision
In order to assess how well the different system outputs predict the right morphology, we compute a precision rate for each Romanian morphological attribute that occurs with nouns, pronouns, adjectives, determiners, and verbs (Table 3). For this purpose, we use the METEOR toolkit (Banerjee and Lavie, 2005) to obtain word alignments between each system translation and the reference translation for newstest2016. The reference and hypotheses are tagged with TTL (Tufiş et al., 2008).5 Each word in the reference that is assigned a POS tag of interest (noun, pronoun, adjective, determiner, or verb) is then compared to the word it is aligned to in the system output. When, for a given morphological attribute, the output and the reference have the same value (e.g. Number=Singular), we consider the prediction correct. The prediction is considered wrong in every other case.

Table 3: Precision of each system on morphological attribute prediction, computed over the reference translation using METEOR alignments. The last row shows the ratio of reference words for which METEOR managed to find an alignment in the hypothesis.
The last row in Table 3 shows the ratio of reference words for which METEOR found an alignment in the hypothesis. We observe a high correlation between this ratio and the quality of the morphological predictions, showing that the accuracy is highly dependent on the alignments. We nevertheless observe that the predictions made by UEDIN NMT are all strictly better than those of UEDIN PBT, although the latter has slightly more alignments to the reference. The system combination makes the most accurate predictions for almost every attribute. The difference in precision from the best single system (UEDIN NMT) can be substantial (2.3% for definiteness and 1.4% for tense), showing that the combination managed to effectively identify the strong points of each translation system.
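The precision computation described above can be sketched as follows (the data structures are illustrative; in practice the attribute values come from the TTL morphological analyses of METEOR-aligned word pairs):

```python
# Per-attribute morphology precision over aligned reference/hypothesis words.
def attribute_precision(aligned_pairs, attribute):
    """aligned_pairs: list of (ref_analysis, hyp_analysis) dicts mapping
    attribute names (e.g. 'Number', 'Case') to values."""
    correct = total = 0
    for ref, hyp in aligned_pairs:
        if attribute not in ref:     # attribute not marked on reference word
            continue
        total += 1
        if hyp.get(attribute) == ref[attribute]:
            correct += 1
    return correct / total if total else 0.0

pairs = [
    ({"Number": "Sing", "Case": "Nom"}, {"Number": "Sing"}),
    ({"Number": "Plur"},                {"Number": "Sing"}),
    ({"Case": "Acc"},                   {"Case": "Acc"}),
]
```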

Conclusion
Our combined effort shows that even with an extremely strong single best system, we still manage to improve the final result by one BLEU point by combining it with the other systems of all participating research groups.
The joint submission for English→Romanian is the best submission measured in terms of BLEU, as presented on the WMT submission page. 6