DFKI’s experimental hybrid MT system for WMT 2015

DFKI participated in the shared translation task of WMT 2015 with the German-English language pair in both translation directions. The submissions were generated using an experimental hybrid system based on three systems: a statistical Moses system, a commercial rule-based system, and a serial coupling of the two, where the output of the rule-based system is further translated by a Moses system trained on parallel text consisting of the rule-based output and the original target language. The outputs of the three systems are combined using two methods: (a) an empirical selection mechanism based on grammatical features (primary submission) and (b) IBM1 models based on POS 4-grams (contrastive submission).


Introduction
The system architecture we will describe has been developed within the QTLEAP project. The goal of the project is to explore different combinations of shallow and deep processing for improving MT quality. The system presented in this paper is the first of a series of MT system prototypes developed in the project. Figure 1 shows the overall architecture, which includes:
• a statistical Moses system,
• the commercial transfer-based system Lucy,
• their serial combination ("LucyMoses"), and
• an informed selection mechanism ("ranker").
The components of this hybrid system will be detailed in the sections below.

Translation systems

Moses
Our statistical machine translation system was a vanilla phrase-based system built with Moses (Koehn et al., 2007), trained on the corpora Europarl ver. 7, News Commentary ver. 9 (Bojar et al., 2014), Commoncrawl (Smith et al., 2013) and MultiUN. Language models of order 5 were built and interpolated with SRILM (Stolcke, 2002) and KenLM (Heafield, 2011). For German to English, we also experimented with pre-ordering the source side based on the target-side grammar (Popović and Ney, 2006). As a tuning set we used newstest 2013.
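The interpolation step can be sketched as follows. This is a toy illustration with made-up probabilities and a hypothetical `interpolate` helper; the actual systems rely on SRILM and KenLM for model building and interpolation.

```python
# Toy sketch of linear language-model interpolation (illustrative
# only; probabilities and the helper below are hypothetical).

def interpolate(p_models, weights):
    """Linearly interpolate word probabilities from several LMs."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, p_models))

# Probabilities of the same word in context from two 5-gram models,
# e.g. one trained on Europarl and one on News Commentary:
p_europarl, p_news = 0.012, 0.030
p_mix = interpolate([p_europarl, p_news], [0.6, 0.4])
```

The interpolation weights would in practice be tuned to minimize perplexity on a held-out development set.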

Lucy
The transfer-based Lucy system (Alonso and Thurmair, 2003) embodies the results of long linguistic efforts over recent decades and has been used in previous projects including EUROMATRIX, EUROMATRIX+ and QTLAUNCHPAD, while related hybrid systems have been submitted to WMT (Chen et al., 2007; Federmann et al., 2010; Hunsicker et al., 2012). The transfer-based approach has shown good results that compete with pure statistical systems, while focusing on translating according to linguistic structures. Its functionality is based on hand-written linguistic rules and there are no major empirical components. Translations are processed in three phases:
• the analysis phase, where the source-language text is parsed and a tree of the source language is constructed;
• the transfer phase, where canonical forms and categories of the source, taken from the analysis tree, are transferred into similar representations of the target language;
• the generation phase, where the target sentence is formed out of the transferred representations by applying inflection and agreement rules.
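The three phases can be sketched as a simple pipeline; the function arguments below are placeholders standing in for Lucy's hand-written rule components, not its actual API:

```python
# Sketch of the analysis-transfer-generation pipeline. `analyze`,
# `transfer` and `generate` are placeholders for the rule-based
# components of a transfer system such as Lucy.

def transfer_translate(sentence, analyze, transfer, generate):
    src_tree = analyze(sentence)    # analysis: parse the source text
    tgt_tree = transfer(src_tree)   # transfer: map canonical forms/categories
    return generate(tgt_tree)       # generation: inflection and agreement
```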

LucyMoses
As an alternative way of automatically post-editing the transfer-based system, a serial transfer+SMT system combination is used, as described in (Simard et al., 2007). To build it, the first stage is the translation of the source-language part of the training corpus by the transfer-based system. In the second stage, an SMT system is trained using the transfer-based translation output as the source language and the target-language part as the target language. At translation time, the test set is first translated by the transfer-based system, and the obtained translation is then translated by the SMT system. In previous experiments, however, the method on its own could not outperform Moses trained on a large parallel corpus. The example in Figure 1 (taken from the QTLEAP corpus used in the project) nicely illustrates how the serial coupling operates. While the SMT output uses the right terminology ("Menü Einfügen" - "insert menu"), the instruction is not formulated in a very polite manner. In contrast, the output of the transfer-based system is formulated politely, yet mistranslates the menu type.
The serial system combination produces a perfect translation. In this particular case, the machine translation is even better than the human reference ("Wählen Sie im Einfügen Menü die Tabelle aus."), as the latter introduces a determiner for "table" which is not justified by the source.
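Schematically, the serial coupling works as follows; the function names are placeholders, not the actual interfaces of Lucy or Moses:

```python
# Sketch of the serial Lucy+Moses coupling. `lucy_translate`,
# `train_smt` and the returned decoder are placeholders for the real
# rule-based and statistical components.

def train_serial_system(src_corpus, tgt_corpus, lucy_translate, train_smt):
    """Train the second-stage SMT system on (Lucy output, reference) pairs."""
    lucy_out = [lucy_translate(s) for s in src_corpus]
    # The SMT system learns to map rule-based output to fluent target text.
    return train_smt(list(zip(lucy_out, tgt_corpus)))

def translate_serial(sentence, lucy_translate, smt_decode):
    """Two-stage translation: rule-based transfer first, then SMT."""
    return smt_decode(lucy_translate(sentence))
```

The key design point is that the second-stage SMT system never sees the original source language; it is trained purely as a "monolingual translator" from rule-based output into fluent target text.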

Sentence level selection
We present two methods for performing sentence-level selection: one based on a pairwise classifier and one based on POS 4-gram IBM1 models.
2.1.1 Empirical machine learning classifier (primary submission)
The machine learning (ML) selection mechanism builds on encouraging results from previous projects including EUROMATRIX+ (Federmann and Hunsicker, 2011), META-NET (Federmann, 2012) and QTLAUNCHPAD (Avramidis, 2013). It has been extended to include several features that can only be generated on the sentence level and would otherwise greatly increase the complexity of the transfer or decoding algorithm. In the architecture at hand, automatic syntactic and dependency analysis is employed on the sentence level in order to choose the sentence that fulfills the basic quality aspects of the translation: (a) assert the fluency of the generated sentence by analyzing the quality of its syntax, and (b) ensure its adequacy by comparing the structures of the source with the structures of the generated sentence.
All produced features are used to build a machine-learned ranking mechanism (ranker) against training preference labels. Preference labels are part of the training data and rank different system outputs for a given source sentence based on the translation quality. Preference labels are generated either by automatic reference-based metrics, or derived from human preferences. The ranker was a result of experimenting with various combinations of feature sets and machine learning algorithms and choosing the one that performs best on the development corpus.
The implementation of the selection mechanism is based on the "Qualitative" toolkit that was presented at the MT Marathon, as an open-source contribution by QTLEAP (Avramidis et al., 2014).
Feature sets
We experimented with feature sets that performed well in previous experiments. In particular:
• Basic syntax-based feature set: unknown words, count of tokens, count of alternative parse trees, count of verb phrases, and the PCFG parse log-likelihood. The parsing was performed with the Berkeley Parser (Petrov and Klein, 2007) and features were extracted from both source and target. This feature set has performed well as a metric in the WMT-11 metrics task.
• Basic feature set + 17 QuEst baseline features: this feature set combines the basic syntax-based feature set described above with the baseline feature set of the QuEst toolkit as used in WMT-13 (Bojar et al., 2013). This feature set combination achieved the best result in the WMT-13 quality estimation task (Avramidis and Popović, 2013). The 17-feature set includes shallow features such as the number of tokens, LM probabilities, the number of occurrences of the target word within the target hypothesis, the average number of translations per source word in the sentence, the percentages of unigrams, bigrams and trigrams in quartiles 1 and 4 of the frequency of source words in a source-language corpus, and the count of punctuation marks.
Machine Learning
As explained above, the core of the selection mechanism is a ranker which reproduces a ranking by aggregating pairwise decisions of a binary classifier (Avramidis, 2013). Such a classifier is trained on binary comparisons in order to select the best of two different MT outputs given one source sentence at a time. As training material, we used the evaluation datasets of the WMT shared tasks (years 2008-2014), where each source sentence was translated by many systems and their outputs were subsequently ranked by human annotators. These preference labels provided the binary pairwise comparisons for training the classifiers. In addition to the human labels, we also experimented with training the classifiers against automatically generated preference labels, obtained by ranking the outputs with METEOR (Banerjee and Lavie, 2005). In each translation direction, we chose the label type (human vs. METEOR) which maximizes, if possible, all automatic scores on our development set, including document-level BLEU. We exhaustively tested all suggested feature sets with many machine learning methods, including Support Vector Machines (with both RBF and linear kernels), Logistic Regression, Extra/Decision Trees, k-nearest neighbors, Gaussian Naive Bayes, Linear and Quadratic Discriminant Analysis, Random Forest, and an AdaBoost ensemble over Decision Trees. The binary classifiers were wrapped into rankers using soft pairwise recomposition (Avramidis, 2013) to avoid ties between the systems. When ties occurred, the system was selected based on a predefined system priority (Lucy, Moses, LucyMoses). The priority was defined manually based on preliminary observations, in order to favor the transfer-based system due to its tendency to achieve better grammaticality. Further analysis of this aspect may be required.
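The recomposition step can be sketched as follows; `compare` stands for the trained binary classifier, and all names here are illustrative rather than the toolkit's actual API:

```python
from itertools import combinations

# Sketch of turning pairwise classifier decisions into a full ranking
# ("soft pairwise recomposition"). `compare(a, b)` is a placeholder for
# the trained binary classifier and returns P(output a is better than b).

def rank(outputs, compare, priority=("Lucy", "Moses", "LucyMoses")):
    scores = {name: 0.0 for name in outputs}
    for a, b in combinations(outputs, 2):
        p = compare(outputs[a], outputs[b])
        scores[a] += p          # soft vote in favor of system a
        scores[b] += 1.0 - p    # complementary vote for system b
    # Sort by accumulated score; remaining ties fall back to the
    # predefined system priority.
    return sorted(outputs, key=lambda n: (-scores[n], priority.index(n)))
```

Because the votes are accumulated as probabilities rather than hard 0/1 decisions, exact ties become rare, which is the point of the soft recomposition.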
Best combination
The optimal systems use:
1. the basic feature set + 17 QuEst baseline features for German→English, trained with Support Vector Machines (Basak et al., 2007) against human ranking labels;
2. the basic syntax-based feature set for English→German, trained with Support Vector Machines against METEOR scores. METEOR was chosen since, for this language pair, the empirical mechanism trained on human judgments had very low performance in terms of correlation with humans.
2.1.2 POS 4-gram IBM1 models (contrastive submission)
Using the IBM1 scores (Brown et al., 1993) for automatic evaluation of MT outputs without reference translations has been proposed in previous work, where the best variant in terms of correlation with human rankings was the target-from-source direction based on POS 4-grams. Therefore, we investigated this variant for our sentence selection, and we submitted the obtained translation outputs as contrastive.
The IBM1 scores are defined in the following way:

P(h|s) = \prod_{i=1}^{H} \frac{1}{S+1} \sum_{j=0}^{S} p(h_i|s_j)

where s_j are the POS 4-grams of the source-language sentence (with s_0 denoting the empty NULL token), S is the POS 4-gram length of this sentence, h_i are the POS 4-grams of the target-language translation output (hypothesis), and H is the POS 4-gram length of this hypothesis.
A parallel bilingual corpus for the desired language pair and a tool for training the IBM1 model are required in order to obtain the IBM1 probabilities p(h_i|s_j). For the POS n-gram scores, appropriate POS taggers for each of the languages are necessary. The POS tags cannot be only basic; they must include all details (e.g. verb tense, case, number, gender, etc.).
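Under these definitions, the score computation can be sketched as follows. This is a minimal illustration: `p` stands for the learned IBM1 lexicon p(h_i|s_j), and the NULL token covers the empty alignment.

```python
import math

# Sketch of the target-from-source IBM1 score over POS 4-grams.
# `p(h, s)` is a placeholder for the lexicon probabilities learned by
# IBM model 1 training on the parallel corpus.

def ibm1_log_score(src_4grams, hyp_4grams, p):
    S = len(src_4grams)
    total = 0.0
    for h in hyp_4grams:
        # Each hypothesis 4-gram may align to any source 4-gram or to NULL.
        inner = sum(p(h, s) for s in ["NULL"] + src_4grams)
        total += math.log(inner / (S + 1))
    return total  # log P(hypothesis POS 4-grams | source POS 4-grams)
```

In the selection setting, this score is computed for each system output of a given source sentence, and the output with the highest score is chosen.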
The bilingual IBM1 probabilities used in our experiments are learned from the German-English part of the WMT 2010 News Commentary bilingual corpora. Both German and English POS tags were produced using TreeTagger (Schmid, 1994).


Experimental results

Table 1 shows the BLEU scores (Papineni et al., 2002), word F-scores and POS F-scores (Popović, 2011) for all individual systems and system combinations in both translation directions. The following interesting tendencies can be observed:
• German→English:
- Moses and LucyMoses are comparable on the word level (BLEU and WORDF)
- LucyMoses is best on the syntactic (POS) level
- LucyMoses achieves better scores than both its components
- using all three systems with a selection mechanism is the best option
• English→German:
- Lucy is comparable with Moses on the word level and better on the syntactic level
- LucyMoses improves all scores
- LucyMoses+Moses (LM+M) is the best combination for word-level scores
- Lucy+LucyMoses (L+LM) is comparable with the combination of all three systems (L+LM+M) for the syntax-oriented POSF score
We submitted the combination of all three systems for both selection mechanisms and both translation directions. It should be noted that the ML classifier is used for the project's first official prototype, whereas the IBM1 classifier has been investigated only recently in the framework of the project; therefore the primary submission for the shared task is the ML classifier, although it yielded lower automatic scores than the IBM1 classifier.
In order to estimate the limits of the classifiers for the given three MT systems, upper bound scores are presented in the last two rows, where the selection criteria were the WORDF and POSF scores themselves. It can be seen that there is room for improvement for both selection methods. Further investigation, tuning and extension of the selection mechanisms will provide more insights and offers potential for future improvements of the selection itself as well as of the MT systems.
Preliminary results concerning the analysis of differences between the systems and the behaviour of the classifiers are shown in the following section.

Analysis of the results
A first step towards better understanding the selection mechanisms is to investigate the contribution of each individual system to the final translation output. The results are presented in Table 2 as the percentage of sentences selected from each system. It is notable that:
• the ML classifier mostly favors the transfer-based output;
• for the English→German translation, the same holds for the IBM1 classifier; for the other translation direction, Lucy is selected very rarely, for less than 2% of the sentences;
• upper bound selection yields a more or less uniform distribution; however, WORDF is clearly biased towards LucyMoses and POSF towards Lucy.
A first indication is that the deep features of the ML classifier are active and therefore this classifier has a bias towards the transfer-based output. Furthermore, the system contributions of the upper bound selection methods indicate that the transfer-based outputs are more grammatical and thus favored by the syntax-oriented POSF score, whereas the LucyMoses system, which can be seen as a lexical repair of a grammatical output, is favored by the lexical WORDF score. Nevertheless, these first hypotheses need to be confirmed by further studies, which are planned.

Table 3 shows examples of differences between the selection methods as well as between the three individual MT systems. The sentences are taken from the WMT-15 test set. The first column denotes the selection method which chose the particular translation output. Sentence 1 illustrates the differences between the two classifiers as well as between the two F-scores: the POSF score and the ML classifier opt for the transfer-based translation, whereas IBM1 chooses Moses and the WORDF score prefers LucyMoses. Sentences 2-4 show the discrepancy between the ML classifier and the automatic scores; the IBM1 score selection differs from the upper bound selections only for sentence 4. Such sentences are the most probable reason for the lower overall ML classifier performance in terms of automatic scores. The last sentence shows an example where both classifiers agree, but they disagree with both F-scores.

Table 2: Percentage of selected sentences from each individual system.
The table also illustrates the advantages of the serial LucyMoses system: it produces the best translation output for all presented sentences except sentence 3.

Summary and outlook
We described a hybrid MT system based on three different individual systems, where the final translation output is produced by a sentence-level selection mechanism with the possibility to include deep linguistic and grammatical features. Preliminary analysis suggests that various improvements are possible, ranging from improvements to the transfer-based system (handling of lexical items such as terminology, MWEs and OOVs, and robustness of parsing) and the serial combination (e.g., improved disambiguation), to a more detailed analysis, testing, and improvement of the selection mechanism (e.g., integrating more "deep" information from external parsing).