DFKI’s system for WMT16 IT-domain task, including analysis of systematic errors

We are presenting a hybrid MT approach in the WMT2016 Shared Translation Task for the IT-Domain. Our work consists of several translation components based on rule-based and statistical approaches that feed into an informed selection mechanism. Additions to last year’s submission include a WSD component, a syntactically-enhanced component and several improvements to the rule-based component, relevant to the particular domain. We also present detailed human evaluation on the output of all translation components, focusing on particular systematic errors.


Introduction
We are presenting extensions on our hybrid MT approach from the WMT 2015 translation task in the generic-domain (Avramidis et al., 2015). The system combines several SMT and RBMT components that feed into an informed selection mechanism. For WMT 2016, several new system components have been submitted to the IT-task that are described in more detail in this paper.
In our work, detailed evaluation of translation quality using a wide variety of methods from automatic scores to human error annotation is an active part of the MT development process. Already in previous work (Popović et al., 2014), we have argued for an approach to MT research and development (R&D) that makes a more direct use of the knowledge and expertise of language professionals.
One of the reasons is that it is difficult to build hybrid architectures (that take advantage of the fact that different engines make different errors) solely based on the rough feedback provided by automatic scores. As scores like BLEU (Papineni et al., 2002) are not suitable for comparison across different types of engines like Statistical Machine Translation (SMT) and Rule-based Machine Translation (RBMT), we have included human feedback by a language professional in the development of the components reported in this paper.
To this end, we complement our system development with specific manual analysis. We have identified and manually inspected phenomena in the given domain that frequently lead to errors in our engines.
We are using the insights gained from this detailed analysis to guide further improvements of our engines and selection mechanism, some of which are detailed below. Therefore, the components developed follow the direction of addressing some of the most observed systematic issues. Nevertheless, the systems submitted to this task are only a stage in the continuous development effort.
This short paper is structured as follows: Section 2 describes the individual components and the hybridization mechanism, Section 3 presents a detailed manual evaluation focusing on systematic errors, and Section 4 gives conclusions and ideas for further work.

System components
We hereby present the systems that appear in our submissions and our hybrid system:

Phrase-based SMT baseline
The baseline system consists of a basic phrase-based SMT model, trained with state-of-the-art settings on both the generic and technical data. The translation table was trained on a concatenation of generic and technical data, filtering out the sentences longer than 80 words. Batch 1 was used as a tuning set for MERT (Och, 2003).
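As a minimal sketch of the length filter mentioned above, assuming whitespace-separated tokens and assuming that the 80-word limit applies to both sides of a sentence pair (the text does not state which side is measured), the filtering step could look as follows. The function name is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

MAX_WORDS = 80  # limit stated in the text


def filter_long_pairs(
    pairs: Iterable[Tuple[str, str]], max_words: int = MAX_WORDS
) -> Iterator[Tuple[str, str]]:
    """Yield only sentence pairs whose two sides have at most max_words tokens.

    Assumptions: tokens are whitespace-separated and the limit is applied
    to the source and the target side alike.
    """
    for source, target in pairs:
        if len(source.split()) <= max_words and len(target.split()) <= max_words:
            yield source, target
```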
One language model (monolingual) of order 5 was trained on the target side from both the [...] (Callison-Burch et al., 2007). All language models were interpolated on the tuning set (Schwenk and Koehn, 2008). The size of the training data is shown in Table 1. The text was tokenized and truecased prior to training and decoding, and de-tokenized and de-truecased afterwards. A few regular expressions were added to the tokenizer, so that URLs are not tokenized before being translated. Normalization of punctuation was also included, mainly in order to fix several issues with variable typography on quotes.
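The URL protection and quote normalization can be illustrated with a small sketch. The regular expressions, the quote table and the simplified word/punctuation tokenizer below are assumptions made for illustration; they are not the expressions actually added to the tokenizer used for the submission.

```python
import re
from typing import List

# Assumed URL shape: starts with http(s):// or www. and runs until whitespace,
# a quote or an angle bracket. Trailing sentence punctuation is not handled.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s\"'<>]+")
# Simplified stand-in for a real tokenizer: word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Typographic quote variants mapped to plain ASCII quotes.
QUOTE_TABLE = str.maketrans(
    {"“": '"', "”": '"', "„": '"', "«": '"', "»": '"', "‘": "'", "’": "'"}
)


def normalize_quotes(text: str) -> str:
    """Replace typographic quotes with their ASCII counterparts."""
    return text.translate(QUOTE_TABLE)


def tokenize(text: str) -> List[str]:
    """Tokenize text while keeping every URL as one unsplit token."""
    tokens: List[str] = []
    position = 0
    for match in URL_RE.finditer(text):
        # Tokenize the text before the URL, then keep the URL whole.
        tokens.extend(TOKEN_RE.findall(text[position:match.start()]))
        tokens.append(match.group())
        position = match.end()
    tokens.extend(TOKEN_RE.findall(text[position:]))
    return tokens
```

Calling `tokenize(normalize_quotes(line))` applies both steps in the order described in the text.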
All statistical systems presented below are extensions of this system, also based on the same data and settings, unless stated otherwise.

SMT with Word Sense Disambiguation
The word-sense-disambiguated SMT system is a factored phrase-based statistical system with two decoding paths, one basic and one alternative. In the basic path, all nouns of the source language (English) have been annotated with a WSD system (Weissenborn et al., 2015) that assigns BabelNet senses to nouns and has recently shown improvements over state-of-the-art results on several corpora. The sense labels are estimated based on the disambiguation analysis on the sentence level by choosing the best ranked sense out of the ones provided by the WSD system. Each produced WSD label replaces the respective base word form of the noun. In the alternative path, non-annotated input is used. The alternative path allows for decoding phrases when there are no WSD labels or the decoder cannot form a translation with a good probability.
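The preparation of the two decoder inputs can be sketched as follows. The data structure for the WSD output (token index mapped to scored sense candidates) and the placeholder sense labels are assumptions for illustration; the actual output format of the WSD system and the BabelNet identifiers are not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical WSD output: token index -> list of (sense label, score) candidates.
RankedSenses = Dict[int, List[Tuple[str, float]]]


def build_decoder_inputs(
    tokens: List[str], ranked_senses: RankedSenses
) -> Tuple[List[str], List[str]]:
    """Return (basic_path_input, alternative_path_input) for one sentence.

    In the basic path every disambiguated noun is replaced by its
    best-ranked sense label; the alternative path keeps the plain tokens.
    """
    annotated: List[str] = []
    for index, token in enumerate(tokens):
        candidates = ranked_senses.get(index, [])
        if candidates:
            best_label, _ = max(candidates, key=lambda candidate: candidate[1])
            annotated.append(best_label)
        else:
            # No sense label available: keep the surface form.
            annotated.append(token)
    return annotated, list(tokens)
```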
Due to the high computational demands of the WSD annotation, this model was trained on less data than the respective phrase-based models, using the first 1.1M sentences of Europarl and omitting the entire Commoncrawl. We experimented with four different settings concerning the translation path. These settings with the corresponding automatic scores are depicted in Table 2, which includes the results on the development set 2. On this set, WSD does not show a positive effect over the baseline in terms of automatic scores.

Syntax-enhanced SMT
Motivated by the importance of grammar in the translation between English and German, we developed a syntax-enhanced SMT system. The process is similar to that of our baseline, but this version includes syntax-aware phrase extraction. Phrase pairs in the baseline SMT system were augmented with linguistically-motivated phrase pairs. These phrases were extracted by generating constituency and dependency parse trees for both the source and target languages, followed by node-aligning the parallel parse trees using a statistical tree aligner (Zhechev, 2009). The syntax-aware phrase extraction algorithm obtains surface-level chunks (syntax-aware) from the aligned subtrees (Srivastava and Way, 2009).
Intermediate experiments were conducted by using either constituency parsing or dependency parsing. Although each parsing model contributed unique phrase pairs (around 28%), no statistically significant difference was observed in the MT system performance. We therefore present the version that uses both of them by concatenating all phrase pairs in one table in an attempt to benefit from multiple knowledge sources. Additionally, informed by the manual inspection in Section 3, we performed a pseudo-Named Entity Recognition (words and phrases tagged as nouns) in order to identify in-domain terminology and translate it separately in a post-decoding automatic post-editing framework.
For the constituency and dependency parsing we employed the Berkeley Parser (Petrov and Klein, 2007) and the Stanford Dependency Parser (Klein and Manning, 2003) respectively.

Rule-based component
The rule-based system Lucy (Alonso and Thurmair, 2003) is also part of our experiment, due to its state-of-the-art performance in the previous years. Additionally, manual inspection on the development set has shown that it provides better handling of complex grammatical phenomena particularly when translating into German, due to the fact that it operates based on transfer rules from the source to the target syntax tree.
This year's work on RBMT focuses on issues revealed through manual inspection of its performance on the development set:
• Separate menu items: The rule-based system was observed to be incapable of handling menu items properly, mostly when they were separated by the ">" symbol, as they often ended up as compounds. We identified the menu items by searching for consecutive title-cased chunks before and after each separator. These items were translated separately from the rest of the sentence, to avoid them being bundled as compounds. The rule-based system was then forced to treat the pre-translated menu items as chunks that should not be translated.
• Menu items by SMT: Additionally, we used the method above to check whether menu items could be translated with the baseline SMT system instead of Lucy.
• Unknown words by SMT: Since Lucy flags unknown words, we translated these individually with the baseline SMT system.
Finally, we experimented with normalization of the punctuation (which was previously included in the pre-processing steps of SMT but not in RBMT), addition of quotes on the menu items and some additional automatic source pre-processing in order to remove redundant phrases such as "where it says".
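The detection of menu items described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions (whitespace tokens, a token counts as title-cased if its first character is upper-case, all helper names are hypothetical), not the implementation used for the submission.

```python
from typing import List

SEPARATOR = ">"
PUNCTUATION = ".,;:!?\"'()"


def _is_title_cased(token: str) -> bool:
    core = token.strip(PUNCTUATION)
    return bool(core) and core[0].isupper()


def _leading_run(tokens: List[str]) -> List[str]:
    """Longest prefix of consecutive title-cased tokens."""
    run: List[str] = []
    for token in tokens:
        if not _is_title_cased(token):
            break
        run.append(token)
    return run


def _trailing_run(tokens: List[str]) -> List[str]:
    """Longest suffix of consecutive title-cased tokens."""
    return list(reversed(_leading_run(list(reversed(tokens)))))


def find_menu_items(sentence: str) -> List[str]:
    """Collect title-cased chunks directly before and after each separator."""
    segments = [segment.split() for segment in sentence.split(SEPARATOR)]
    if len(segments) < 2:
        return []
    items: List[str] = []
    last = len(segments) - 1
    for position, tokens in enumerate(segments):
        if position == 0:
            chunks = [_trailing_run(tokens)]
        elif position == last:
            chunks = [_leading_run(tokens)]
        else:
            lead = _leading_run(tokens)
            # A segment between two separators is one item if fully title-cased,
            # otherwise it ends one menu path and starts another.
            chunks = [tokens] if len(lead) == len(tokens) else [lead, _trailing_run(tokens)]
        for chunk in chunks:
            if chunk:
                items.append(" ".join(chunk).strip(PUNCTUATION))
    return items
```

For "Please click File > Save As > Options to continue." the sketch yields the three items File, Save As and Options. For "Click File > Save" it yields "Click File" as the first item, because the sentence-initial verb is title-cased as well; a heuristic of this kind therefore needs an additional check for sentence-initial words.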
We ran an exhaustive search over all possible combinations of the modifications above; the most indicative automatic scores are shown in Table 3. Although automatic scores have in the past shown low performance when evaluating RBMT systems, our proposed modifications have a lexical impact that can be adequately measured with n-gram-based metrics. Our investigation and discussion is performed on Batch 2. The best combination of the suggested modifications achieves an overall improvement of 0.51 points BLEU and 0.68 points METEOR over the baseline. In particular:
• Adding quotes around menu items resulted in a significant drop of the automatic scores, so it was not used; this needs to be further evaluated, as references do not use quotes for menu items either. Nevertheless, quotes were not always useful due to an occasional erroneous identification of menu item boundaries.
• Separate translation of the menu items (sepMenus) gives a positive result of about 0.46 BLEU and 0.63 METEOR.
• Normalizing punctuation (normPunct) has a slightly positive effect when the menu items are translated separately by Lucy.
• Passing only RBMT's unknown words (unk) to SMT results in a loss of 0.4 BLEU.
• Translating the RBMT's menus with SMT (SMTmenus) also deteriorates the scores.
• Translating both menu items and unknown words with SMT (unk+SMTmenus) has a positive effect against the baseline and seems to be comparable with the best system without SMT (sepMenus+normPunct).
The phrase "where it says" appears in 7% of the sentences in Batch 2 and 2% of the sentences in Batch 1. Although the removal of "where it says" from the source sentence seems to slightly lower the automatic scores, the difference does not seem significant, and manual inspection raised the concern that this may be because of the way this phrase has been translated in the references. We therefore conducted manual sentence selection on 38 (out of the 69) sentences where this phrase appeared, and in 84.2% of the cases its removal made the translation preferable. We therefore decided to select this variation, despite the slightly lower scores.
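A source pre-processing step of this kind can be sketched with a single regular expression. The pattern and the function name are assumptions for illustration; the exact rule applied to the source sentences is not specified in the text.

```python
import re

# Assumed rule: drop the phrase wherever it occurs, ignoring case,
# and leave exactly one space in its place.
REDUNDANT_PHRASE_RE = re.compile(r"\s*\bwhere it says\b\s*", re.IGNORECASE)


def remove_redundant_phrase(sentence: str) -> str:
    """Remove every occurrence of the phrase "where it says"."""
    return REDUNDANT_PHRASE_RE.sub(" ", sentence).strip()
```

Sentences that do not contain the phrase pass through unchanged.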

Serial RBMT post-editing with SMT
As an alternative to automatic post-editing of the RBMT system, a serial RBMT+SMT system combination is used, as described by Simard et al. (2007). To build it, the source-language part of the training corpus is first translated by the RBMT system. In the second stage, an SMT system is trained using the RBMT output as the source language and the target-language part of the corpus as the target language. At test time, the test set is first translated by the RBMT system, and the obtained translation is then translated by the SMT system.
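The data flow of the serial combination can be sketched as follows. Both translators are represented by hypothetical callables; the training of the second-stage SMT system itself is outside the scope of this sketch.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: a translator maps one sentence to one sentence.
Translator = Callable[[str], str]


def build_second_stage_corpus(
    rbmt: Translator, parallel_corpus: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Pair the RBMT output of each source sentence with its reference target.

    The resulting (RBMT output, target) pairs are the training data of the
    second-stage SMT system.
    """
    return [(rbmt(source), target) for source, target in parallel_corpus]


def translate_serially(rbmt: Translator, smt: Translator, sentence: str) -> str:
    """Translate with RBMT first, then pass the result through the SMT system."""
    return smt(rbmt(sentence))
```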

Selection mechanism
The selection mechanism aims to combine various systems by selecting the best MT output for every sentence. The architecture of the system is illustrated in Figure 1. The core of the selection mechanism is a ranker which reproduces ranking by aggregating pairwise decisions by a binary classifier (Avramidis, 2013). Such a classifier is trained on binary comparisons in order to select the best one out of two different MT outputs given one source sentence at a time. As training material, we used the test sets of the WMT evaluation tasks (2008-2014). The rank labels for the training are automatically generated, after ordering the given MT outputs based on their sentence-level METEOR (Lavie and Agarwal, 2007) against the references. We have previously experimented with training on rankings provided by users, but experiments showed that for this task, ranks derived from sentence-level METEOR maximize all automatic scores on our development set, including document-level ones such as BLEU.
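The generation of pairwise training labels from sentence-level scores can be sketched as follows. The label encoding (1 and -1) and the decision to skip pairs with equal scores are assumptions made for this illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_labels(scores: Dict[str, float]) -> List[Tuple[str, str, int]]:
    """Turn per-system sentence-level scores into pairwise comparisons.

    Returns (system_a, system_b, label) triples for one source sentence,
    where label is 1 if system_a scored higher and -1 if system_b did.
    Pairs with equal scores are skipped (an assumption of this sketch).
    """
    instances: List[Tuple[str, str, int]] = []
    for system_a, system_b in combinations(scores, 2):
        if scores[system_a] > scores[system_b]:
            instances.append((system_a, system_b, 1))
        elif scores[system_a] < scores[system_b]:
            instances.append((system_a, system_b, -1))
    return instances
```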
We exhaustively tested the available feature sets with many machine learning methods and Support Vector Machines seemed to give the best performance. The binary classifiers were wrapped into rankers using the soft pairwise recomposition (Avramidis, 2013) to reduce ties between the systems. Due to technical reasons, the version of the selection mechanism that is submitted to this task is only a pilot version that includes WSD-SMT (section 2.2), baseline RBMT (section 2.4) and RBMT→SMT (section 2.5). When ties occurred, despite the soft recomposition, the system was selected based on a predefined system priority (WSD-SMT, RBMT, RBMT→SMT). The pre-defined order of the systems needs to be further confirmed as part of the future work.
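The recomposition of pairwise decisions into a selection, including the priority-based tie-breaking, can be sketched as follows. The sketch assumes that soft recomposition sums the classifier's win probabilities per system instead of counting hard wins; the input format and the function name are hypothetical, and the details in (Avramidis, 2013) may differ.

```python
import math
from typing import Dict, List, Tuple


def select_system(
    win_probabilities: Dict[Tuple[str, str], float], priority: List[str]
) -> str:
    """Select one system from soft pairwise decisions.

    win_probabilities maps a pair (a, b) to the classifier's probability
    that a is better than b. Each system accumulates the probability mass
    of its wins; remaining ties are resolved by the predefined priority.
    """
    totals = {system: 0.0 for system in priority}
    for (system_a, system_b), probability in win_probabilities.items():
        totals[system_a] += probability
        totals[system_b] += 1.0 - probability
    best = max(totals.values())
    # The first system in priority order that reaches the best total wins.
    for system in priority:
        if math.isclose(totals[system], best, abs_tol=1e-9):
            return system
    raise ValueError("priority list must not be empty")
```

With the priority order given in the text, a call would pass `["WSD-SMT", "RBMT", "RBMT→SMT"]` as the second argument.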

Manual evaluation
Apart from the automatic evaluation scores, we include manual evaluation performed by a professional German linguist.

Manual evaluation methodology
The manual evaluation was performed in four phases:
• The annotator reads through the development set translated by all systems and identifies the phenomena where errors often occur.
• For each one of the prominent linguistic phenomena, the annotator selects 100 source segments including the respective phenomenon that is prone to MT errors.
• The total occurrences of each phenomenon in all source segments are counted (each phenomenon may occur more than once in a segment, and each segment may contain more than one sentence).
• Subsequently, the annotator counts the times each phenomenon has been translated correctly. For a translation to be correct it does not have to be identical with the reference translation. This is repeated for the output of every MT system. The accuracy is calculated as the number of correct translations of the phenomenon divided by the number of occurrences of the phenomenon in the source.
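The accuracy computation from the last phase can be written down as a short sketch. The counts used in the usage note are invented for illustration only and are not taken from the evaluation.

```python
from typing import Dict


def phenomenon_accuracy(correct: int, occurrences: int) -> float:
    """Accuracy of one system on one phenomenon: correct / occurrences."""
    if occurrences <= 0:
        raise ValueError("the phenomenon must occur at least once in the source")
    if not 0 <= correct <= occurrences:
        raise ValueError("correct must lie between 0 and occurrences")
    return correct / occurrences


def accuracies_per_phenomenon(
    correct_counts: Dict[str, int], occurrence_counts: Dict[str, int]
) -> Dict[str, float]:
    """Apply the ratio to every phenomenon counted for one MT system."""
    return {
        phenomenon: phenomenon_accuracy(correct_counts[phenomenon], total)
        for phenomenon, total in occurrence_counts.items()
    }
```

For example, a system that translates 45 of 60 hypothetical occurrences of a phenomenon correctly has an accuracy of 0.75 on it.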

Manual evaluation results
The most prominent error categories were found to be imperatives, compounds, quotation marks, menu item sequences (separated by ">"), missing verbs, phrasal verbs and terminology. In these 7 categories, 657 source segments were chosen from development set Batch 2 to demonstrate the phenomena bound to the frequent errors. Many segments contained multiple instances of the respective phenomena, resulting in 2104 instances of phenomena overall. The results appear in Table 4. The two baseline systems SMT and RBMT seem to have complementary behavior regarding the investigated phenomena. SMT performs well on terminology, menu items and quotation marks, but seems to suffer on imperatives, missing verbs, phrasal verbs and generation of compounds. On the contrary, RBMT does relatively well with imperatives, compounds, verbs and phrasal verbs, whereas it has issues with menu items and is relatively worse with terminology. The linear combination system RBMT→SMT manages to successfully combine the performance of the two systems regarding imperatives and maintains almost the same performance on verbs and terminology, whereas all other phenomena deteriorate, despite achieving higher automatic scores overall.
The SMT-syntax and the SMT-WSD systems seem to have relatively lower performance in all categories. Since the performance of the WSD analyzer has already been confirmed, the failure of the SMT-WSD system to achieve a good performance on terminology and high n-gram-based automatic scores may be an indication that the current data setting does not face ambiguity issues and the senses probably only add additional complexity.
The selection mechanism (which in its current version only included SMT-WSD, RBMT and RBMT→SMT) performs better with the terminology and the quotation marks, whereas it maintains the good performance of its components on verbs and menu items. Performance on phrasal verbs nevertheless suffers. Additionally, it achieves the highest accuracy on the selected phenomena, with 2% fewer errors than its best component, the baseline RBMT system.
The two improved versions of the RBMT system appear to have solved the problems they were developed for, namely the compounded menu items, and one of them also does better with the quotation marks. The performance on imperatives, verbs and terminology remains the same, but the deterioration on phrasal verbs is obvious. A post-mortem analysis attributes this loss to a logical bug in the menu item detection, which often erroneously included title-cased verbs at the beginning of the sentence, preventing them from being translated as an active part of the sentence.

Discussion and further work
In our shared task submission we included: (i) the SMT and RBMT baseline systems, (ii) the syntax-enhanced system (DFKI-syntax), (iii) the RBMT system with separate menu items, normalization of punctuation and removal of "where it says" (previously appearing as sepMenus+normPunct-WhereItSays, submitted as qtl-RBMT-menus), (iv) the RBMT system with removal of "where it says", passing menu items and unknown words to SMT (previously appearing as unk+SMTmenus-WhereItSays, submitted as qtl-RBMT-SMTmenus) and (v) the selection mechanism which includes the systems SMT-WSD, RBMT and RBMT→SMT.
The results of the official evaluation campaign for our systems appear in Table 5. RBMTmenus appears to be slightly better than all the other systems we developed, but the difference with the other RBMT systems is not statistically significant. Nevertheless, it is our only system that competes with another competitor system for the 2nd position. Additionally, it is worth noting the failure of BLEU to correlate with the human preferences, mainly for the systems that relate to RBMT, in line with past observations (Callison-Burch et al., 2006).
In future work, we intend to continue this line of development by including all the individual components in the selection mechanism. Additionally, we will focus on solving issues with the particular phenomena by employing specialized methods. Finally, we plan to perform a more in-depth evaluation of the selection mechanism and study how the insights gained from the manual inspection of errors can be translated into features that improve the selection.