What a Transfer-Based System Brings to the Combination with PBMT

We present a thorough analysis of a combination of a statistical and a transfer-based system for English → Czech translation, Moses and TectoMT. We describe several techniques for inspecting such a system combination which are based both on automatic and manual evaluation. While TectoMT often produces bad translations, Moses is still able to select the good parts of them. In many cases, TectoMT provides useful novel translations which are otherwise simply unavailable to the statistical component, despite the very large training data. Our analyses confirm the expected behaviour that TectoMT helps with preserving grammatical agreements and valency requirements, but that it also improves a very diverse set of other phenomena. Interestingly, including the outputs of the transfer-based system in the phrase-based search seems to have a positive effect on the search space. Overall, we find that the components of this combination are complementary and the final system produces significantly better translations than either component by itself.


Introduction
Chimera (Bojar et al., 2013b; Tamchyna et al., 2014) is a hybrid English-to-Czech MT system which has repeatedly won in the WMT shared translation task (Bojar et al., 2013a;. It combines a statistical phrase-based system (Moses, in a factored setting), a deep-transfer hybrid system TectoMT (Popel and Žabokrtský, 2010) and a rule-based post-editing tool Depfix (Rosa et al., 2012).
Empirical results show that each of the components contributes significantly to the translation quality, together setting the state of the art for English→Czech translation. While the effects of Depfix have been thoroughly analyzed in Bojar et al. (2013b), the interplay between the two translation systems (Moses and TectoMT) has not been examined so far.
In this paper, we show how exactly a deep transfer-based system helps in statistical MT. We believe that our findings are not limited to our exact setting but rather provide a general picture that applies also to other hybrid MT systems and other translation pairs with rich target-side morphology.
The paper is organized as follows: Section 2 briefly describes the architecture of Chimera and summarizes its results in the WMT shared tasks. In Section 3, we analyze what the individual components of Chimera contribute to translation quality. Section 4 describes how the components complement each other, Section 5 outlines some of the problems still present in Chimera, and Section 6 concludes the paper.

Chimera Overview
Chimera is a system combination of a phrase-based Moses system (Koehn et al., 2007) with TectoMT (Popel and Žabokrtský, 2010), finally processed with Depfix (Rosa et al., 2012), a tool for automatic correction of morphological and some semantic errors (such as reversed negation). Chimera thus does not quite fit in the classification of hybrid MT systems suggested by Costa-jussà and Fonollosa (2015). Figure 1 provides a graphical summary of the simple system combination technique dubbed "poor man's", as introduced by Bojar et al. (2013b). The system combination does not need any dedicated tool, e.g. those by Matusov et al. (2008), Barrault (2010), or Heafield and Lavie (2010). Instead, it directly includes the output of the transfer-based system into the main phrase-based search.
Figure 1: "Poor man's system combination".
At its core, Chimera is a (factored) Moses system with two phrase tables. The first is a standard phrase table extracted from English-Czech parallel data. The second phrase table is tailored to the input data and comes from a synthetic parallel corpus provided by TectoMT: the source sides of the dev and test sets are first translated with CU-TECTOMT. Following standard word alignment of the source side and its translation, phrases are extracted from this synthetic corpus and added as a separate phrase table to the combined system (CH1). The relative importance of this phrase table is estimated in standard MERT (Och, 2003).
The final translation of the test set is produced by Moses (enriched with this additional phrase table) and additionally post-processed by Depfix.
Note that all components of this combination have direct access to the source side which prevents the cumulation of errors.
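The phrase extraction step described above can be illustrated with a minimal sketch. It assumes the textbook notion of alignment-consistent phrase pairs and, as a simplification, does not extend phrases over unaligned boundary words; the function name, the length limit and the toy alignment are illustrative assumptions, not the actual Moses training pipeline.

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """Extract phrase pairs consistent with a word alignment.

    src, tgt: lists of tokens; alignment: set of (src_index, tgt_index) links.
    Simplification (assumption): target spans are the minimal spans covering
    the linked words; unaligned boundary words are not added.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word of the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: no target word inside the span may be linked
            # to a source word outside the span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Toy synthetic sentence pair: English source and a machine-produced target.
source = ["the", "young", "couple"]
target = ["mladého", "páru"]
links = {(1, 0), (2, 1)}  # "the" is left unaligned
phrases = extract_phrases(source, target, links)
```

On this toy pair the sketch yields five phrase pairs, including ("young couple", "mladého páru"); in the real setting the pairs would additionally be scored before being added as the second phrase table.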
For brevity, we will use the following names: CH0 to denote the plain Moses, CH1 to denote the Moses combining the two phrase tables (one from CH0 and one from CU-TECTOMT), and CH2 to denote the final CHIMERA.
In this paper, we focus on the first two components, leaving CH2 aside. The rest of this section summarizes Chimera's results in the last three years of the WMT translation task and adds two technical details: the language models used in 2015 and the effects of the default low phrase table limit.

Chimera and its Components in WMT
TectoMT by itself does not perform well compared to other systems in the task; it consistently achieves low BLEU scores and low manual ranking. Moses by itself (CH0) achieves quite a high BLEU score but still significantly lower than CH1 (the combination of the "poor" TectoMT and plain Moses). Depfix seems to make almost no difference in the automatic scores (once it even slightly worsened the BLEU score) but CH2 has been consistently significantly better in manual evaluation. In 2014, Chimera would have lost to Edinburgh's submission if it were not for Depfix.
An illustration of the complementary utility is given in Table 3. Both CH0 and CU-TECTOMT produce translations with major errors. CH1 is able to pick the best of both and produce a grammatical and adequate output, very similar to the reference translation. CH1 can also produce words which were not present in either output.

Language Models
In 2015, CHIMERA in all its stages used four language models (LMs), as summarized in Table 2.
Two of the language models ("big" and "long") are trained on surface forms ("stc" refers to supervised truecasing, where the casing is determined by the lemmatizer) and two on morphological tags. Since tags are much less sparse than word forms, we can use a higher LM order. The new "long morphological" LM, dubbed "longm", was aimed at capturing common sentential morphosyntactic patterns.

Table 3:
source       the living zone with the dining room and kitchen section in the household of the young couple .
reference    obývací zóna s jídelní a kuchyňskou částí v domácnosti mladého páru .
             living zone with dining and kitchen section in household young_gen couple_gen .
CU-TECTOMT   živá zóna pokoje s jídelnou a s kuchyňským oddílem v domácnosti mladého páru .
             alive zone room_gen with dining room and with kitchen section in household young_gen couple_gen .
CH1          obývací prostor s jídelnou a kuchyní v domácnosti mladého páru .
             living space with dining room and kitchen in household young_gen couple_gen .

Phrase Table Limit
Until recently, we did not pay much attention to the maximum number of different translation options considered per source phrase (the parameter table-limit), assuming that the good phrase pairs are scored high and will be present in the list. This year, we set table-limit to 100 instead of the default 20 and found that while it indeed made little or no difference in CH0, it affected the system combination in CH1.
It is known that multiple phrase tables clutter the search space with different derivations of the same output (Bojar and Tamchyna, 2011), demanding a relaxation of pruning during the search (e.g. stack-limit or the various limits of cube pruning). From this point of view, increasing the table-limit actually makes the situation worse by bringing in more options. We leave the search pruning limits at their default values, increase only the table-limit, and yet observe a gain. Table 4 shows the average test-set BLEU score (incl. the standard deviation) obtained in three independent runs of MERT when setting the table-limit to 20 or 100 for one or both phrase tables.

Contribution of Individual Components
Table 5 breaks n-grams from the reference of the WMT14 test set into classes according to which Chimera components produced them. The first column considers unigram tokens, the subsequent columns report n-gram types. We see that 44.7 % of the unigram tokens needed by the reference were available in all (DDD) components, i.e. in CU-TECTOMT, in CH0, and surviving in the combination CH1. On the other hand, 32.9 % of tokens were not available in any of these single-best outputs. For Czech as a morphologically rich target language, it is well known that a large portion of the output is not confirmed by the reference (and vice versa) despite not containing any errors (Bojar et al., 2010). The poor man's system combination method is essentially phrase-based, so it is not surprising that about twice as many unigrams come from CH0 as from CU-TECTOMT (8.6 vs 4.5 %).
This bias towards PBMT gets more pronounced with longer n-grams (5.1 vs 1.5 % for 4-grams). The number of n-grams needed by the reference and coming from either of the individual systems but not appearing in the combination (-
It is good news that we gain ∼1.5 % of n-grams as a side-effect: neither of the systems suggested them on its own but they appeared in the combination (--D). Note that we see this positive effect also for unigrams, suggesting that our "poor man's" system combination could in principle outperform more advanced techniques. The output of the secondary system(s) can help the main search to come up with better translation options.
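The three-letter classes used in this breakdown (e.g. DDD, D-D, --D) can be computed along the following lines. This is only a sketch under stated assumptions: it works on a single tokenized sentence, counts n-gram types for every n (whereas the table counts tokens for unigrams), and uses the position order TectoMT, plain Moses, combination; all function and variable names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def classify_reference_ngrams(reference, tecto, moses, combo, n):
    """Count reference n-gram types per availability pattern.

    Each pattern has three positions (TectoMT, plain Moses, combination);
    'D' means the system's output contains the n-gram, '-' that it does not.
    """
    produced_by = [set(ngrams(output, n)) for output in (tecto, moses, combo)]
    counts = Counter()
    for gram in set(ngrams(reference, n)):
        pattern = "".join("D" if gram in produced else "-"
                          for produced in produced_by)
        counts[pattern] += 1
    return counts


# Toy example with made-up tokens.
ref = "a b c d".split()
unigram_classes = classify_reference_ngrams(
    ref, "a b x d".split(), "a y c d".split(), "a b c z".split(), 1)
bigram_classes = classify_reference_ngrams(
    ref, "a b x d".split(), "a y c d".split(), "a b c z".split(), 2)
```

In the toy example, the bigram "b c" falls into the class --D: neither individual output contains it, yet it appears in the combination.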
In the following, we refine the analysis of contributions of the individual components by finding where they apply and what they improve.

Sources of Used Phrase Pairs
In a separate analysis, we look at the translation of the WMT13 test set and the phrases used to produce it. Table 6 shows both phrase counts and average (source) phrase lengths (in words) broken down according to the phrase source. The test set was translated using 31961 phrases in total ("phrase tokens"); 21106 unique phrase pairs were used ("phrase types"). Many phrase pairs were available in both phrase tables.
The TectoMT phrase table provided 11706 phrase types in total; 3503 of these were unique, i.e. not present in the phrase table extracted from the parallel data. (See Section 4.1 below for the reachability of such phrases on the WMT14 test set.) Given the total number of phrase types, this is a small minority (roughly 17 %). However, these phrases correspond directly to our test set and the benefit is visible right away: the average length of these unique phrases is much higher (3.73), which allows the decoder to cover longer parts of the input by a single phrase. We believe that such phrases help preserve local (morphological) agreement and overall consistency of the translation. As expected, the average length of the shared phrase pairs (present in both phrase tables) is short and this is even more prominent when we look at tokens (phrase occurrences), where the average length is only 1.56. Again, phrase tokens provided by TectoMT are significantly longer, 3.68 words on average.
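The distinction between phrase tokens and phrase types, and the average source lengths per phrase source, can be made concrete with a small sketch. It assumes that the decoder's trace has already been reduced to (source phrase, target phrase, origin) triples; the origin labels "parallel", "tectomt" and "both" and the function name are hypothetical.

```python
from collections import defaultdict


def phrase_stats(used_phrases):
    """Summarize used phrases per origin: token/type counts and average source length.

    used_phrases: iterable of (source_phrase, target_phrase, origin) triples,
    one triple per phrase application in the 1-best translations.
    """
    token_lengths = defaultdict(list)   # origin -> source length of each occurrence
    types = defaultdict(set)            # origin -> distinct (source, target) pairs
    for src, tgt, origin in used_phrases:
        token_lengths[origin].append(len(src.split()))
        types[origin].add((src, tgt))

    stats = {}
    for origin, lengths in token_lengths.items():
        type_lengths = [len(src.split()) for src, _ in types[origin]]
        stats[origin] = {
            "tokens": len(lengths),
            "types": len(types[origin]),
            "avg_len_tokens": sum(lengths) / len(lengths),
            "avg_len_types": sum(type_lengths) / len(type_lengths),
        }
    return stats


# Toy trace: one short shared phrase used twice, one longer TectoMT-only phrase.
trace = [
    ("the", "ten", "both"),
    ("the", "ten", "both"),
    ("young couple", "mladého páru", "tectomt"),
]
stats = phrase_stats(trace)
```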

CU-TECTOMT
Phrase-based MT relies on phrase pairs automatically extracted from parallel data. This process uses imperfect word alignment and several heuristics and therefore, phrase tables often contain spurious translation pairs. Moreover, phrases extracted from synthetic data (where the target side was produced automatically) can contain errors made by the translation system.
In this analysis, our basic aim was to compare the quality of phrases extracted from parallel data and phrases provided by TectoMT. This analysis was done manually on data samples by two independent annotators. We looked at the percentage of such bad phrase pairs in two settings:
• phrase pairs contained in the phrase table
• phrase pairs used in the 1-best translation
We can assume that most of the noisy phrase pairs in the phrase tables are never used in practice (they are improbable according to the data or they apply to some very uncommon source phrase). That is why we also looked at phrase pairs actually used in producing the 1-best translation of the WMT13 test set.
For each of the two settings, we took a random sample of 100 phrase pairs from each source of data and had two annotators evaluate them. The basic annotation instruction was: "Mark a phrase pair as correct if you can imagine at least some context where it could provide a valid translation." In other words, we are checking if a phrase pair introduces an error already on its own.
Table 7 shows the results of the annotation. As expected, the percentage of inadmissible phrase pairs is much higher in the first setting (random samples from phrase tables), 17.5-26.3 % compared to 7.5-9.0 %. Most phrase pairs which contributed to the final translations were valid translations (87.5-89.0 %).
The phrase table extracted from TectoMT translations was worse in both settings. However, while only 66 % of its phrase pairs were considered correct in the random selection, about 87 % of its phrases that were actually used were correct. This shows that the final decoder is able to pick the correct suggestions quite successfully.
Interestingly, despite the rather vague task description, inter-annotator agreement was quite high: 80.5 % on average in the first setting and 90.5 % in the second one.
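For illustration, agreement figures of this kind can be obtained as simple percentage agreement between the two annotators. The text does not state which agreement measure was used, so the raw percentage below is an assumption, and the function name is illustrative.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * same / len(labels_a)


# Toy example: True = "correct phrase pair", False = "incorrect".
annotator_1 = [True, True, False, True]
annotator_2 = [True, False, False, True]
agreement = percent_agreement(annotator_1, annotator_2)
```

Raw agreement does not correct for chance; a chance-corrected coefficient would be a natural alternative for a binary judgment like this one.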

Automatic Analysis of Errors in Morphology
We were interested to see whether we can find a pattern in the types of morphological errors fixed by adding the TectoMT phrase table. We translated the WMT14 test set using CH0, CH1 and CH2. We aligned each translation to the reference using an HMM monolingual aligner (Zeman et al., 2011) on lemmas. We focused on cases where both the translation and the reference contain the same (aligned) lemma but the surface forms differ.
Table 8 shows summary statistics along with the distribution of errors among Czech parts of speech. We omitted prepositions, adverbs, conjunctions and punctuation from the table, as these POSes do not really inflect in Czech. The number of successfully matched lemmas (in the HMM alignment phase) is lowest for CH0; this is expected as this system also got a lower BLEU score. Both other systems matched roughly 400 more lemmas within the test set (this also means 400 more opportunities for making morphological errors, i.e. CH1 and CH2 have a more difficult position than CH0 in this evaluation). The good news is that CH1 and CH2 show a significantly lower number of errors in morphology: the total number of errors was reduced by almost 500 from the 6065 made by CH0.
Overall, the number of errors per part of speech (POS) is naturally affected by the frequency of the individual POS in Czech text. We see that CH1 (and CH2) reduce the number of errors across all POSes. However, the most prominent improvement can be observed with nouns (N) and adjectives (A). We can roughly say that they account for 407 errors out of the 491 fixed by CH1. When we look at the morphological tags for each of the 407 errors, we find that the vast majority (393 errors) only differ in morphological case. TectoMT therefore seems to improve target-side morphological coherence and in particular valency and noun-adjective agreement. This is further supported by the manual analysis in Section 3.4.
This analysis does not provide a good picture of the effect of adding Depfix. The difference in error numbers is negligible and inconsistent across POSes (adjectives seemingly got mildly worse while nouns were somewhat improved). Depfix rules generally prefer precision over recall, so they do not change the output considerably. Moreover, valid corrections may not be confirmed by the single reference that we have available. The accuracy of the individual Depfix rules was already evaluated by Bojar et al. (2013b). Depfix significantly improves translation quality according to human evaluation, as evidenced by Table 1.
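The error counting described in this section can be sketched as follows. The sketch assumes that the monolingual alignment has already been computed and is given as pairs of (hypothesis token, reference token), and that tags are Prague-style positional tags in which the first character encodes the part of speech and the fifth character encodes case; the data structure and function names are illustrative assumptions, not the tooling actually used.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str  # positional tag; tag[0] = POS, tag[4] = case (assumed layout)


# POS codes left out of the table: prepositions, adverbs, conjunctions, punctuation.
SKIPPED_POS = {"R", "D", "J", "Z"}


def morphology_errors(aligned_pairs):
    """Return (number of matched lemmas, Counter of form mismatches per POS)."""
    matched = 0
    errors = Counter()
    for hyp, ref in aligned_pairs:
        if hyp.lemma != ref.lemma:
            continue  # only aligned tokens sharing a lemma are considered
        matched += 1
        pos = ref.tag[0]
        if pos not in SKIPPED_POS and hyp.form != ref.form:
            errors[pos] += 1
    return matched, errors


def differs_only_in_case(hyp_tag, ref_tag, case_index=4):
    """True if two equally long tags differ exactly at the case position."""
    if len(hyp_tag) != len(ref_tag):
        return False
    diffs = [i for i, (h, r) in enumerate(zip(hyp_tag, ref_tag)) if h != r]
    return diffs == [case_index]


# Toy example: the hypothesis has the right lemma but the wrong case.
pairs = [
    (Token("pár", "pár", "NNIS4-----A----"), Token("páru", "pár", "NNIS2-----A----")),
    (Token("v", "v", "RR--6----------"), Token("v", "v", "RR--6----------")),
]
matched, errors = morphology_errors(pairs)
```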

Manual Analysis of TectoMT n-Grams
In order to check what phenomena are improved by TectoMT, we manually analyzed a small sample of n-grams needed by the reference and provided specifically by TectoMT, i.e. n-grams produced by CU-TECTOMT but not by CH0 and surviving in the final CH1 output. These come from the 1.5 % of D-D 4-grams in Table 5.
The results are presented in Table 9. For each of the examined 4-grams, the annotator started by checking the corresponding part of the CH0 output. In 31.1 % of cases, the CH0 output was an equally acceptable translation. (Other parts of the sentence were not considered.) The false positive 4-grams are fortunately rather rare: 3 % of these 4-grams produced by CH1 and confirmed by the reference are actually worse than the proposal by CH0 ("Worsened") and a further 1.5 % of cases are bad in both the CH0 and the CH1 output ("Bad Anyway").
Table 9: Small manual analysis of 4-grams confirmed by the reference and coming from CU-TECTOMT (not produced by CH0, only by CH1).
Overall, the most frequent improvements thanks to CU-TECTOMT are related to Czech morphology, be it better choice of preposition and/or case for noun phrases dependent on verbs or other nouns ("Valency"), better preservation of case, number and/or gender within NPs or between the subject and the verb ("Agreements"), or morphological properties of verbs ("Properties of Verbs"). Another prominent class of tackled errors is related to syntax of complex noun phrases which often surface as garbled word order ("Word Order, esp. Syntax of Complex NPs"). CU-TECTOMT also helps with translating clause structure (incl. avoiding the comma used in English after topicalized elements, "Avoided Superfluous Comma"), with lexical choice, possessive constructions or the reflexive particle.
Overall, the range of improvements is rather broad, with each type receiving only a small share. The row "Other" includes diverse phenomena like better Noun-Verb-Adj disambiguation, morphological properties of nouns coming from the source, phrasal verbs, translation of numerical expressions incl. units, negation, pro-drop, or translation of named entities.

Complementary Utility
This section contains some observations on how the individual components of Chimera complement each other and to what extent one can substitute another. Unlike the previous section, we are not interested in why the components help but instead in what happens when they are not available.

Reachability of TectoMT Outputs for Plain Moses
In order to determine whether Moses itself could have produced the translations acquired by combining it with TectoMT, we ran a forced (constrained) decoding experiment (with table-limit set to 100): we ran CH0 on the WMT14 test set and targeted the translations produced by CH1. We first put aside the 338 sentences where the outputs of both systems are identical. Out of the 2665 remaining sentences, Moses was able to produce 1741 sentences (i.e., roughly two thirds). This shows that TectoMT indeed provides many novel translations. This fact is particularly interesting when we consider the amount of data available to Moses: this year, its translation model was trained using over 52 million parallel sentences. Still, many necessary word forms are apparently missing in the phrase table (when limited to 100 options per source span).
For the reachable sentences, we compared their model scores according to CH0. On average, the score of the original CH0 translation was slightly higher (by 1.11) than the score of the forced translation; in 1601 cases, Moses produced a better-scoring translation. We can attribute this difference to modelling errors: when we compare the BLEU scores of CH0 and CH1 on these 1601 sentences, CH1 obtains a significantly better result, 24.78 vs. 23.03 (even though the model score according to CH0 is lower).
In 140 sentences, the model score of the forced translation was higher than the score of the translation actually produced. Apparently, the quality of CH0's output was also harmed by search errors.
For completeness, we ran another variant of the forced decoding setting. We collected all phrases that were provided by the TectoMT phrase table and used by CH1 when translating the test set. We then ran constrained decoding for CH0 with these phrases as input sentences. Our question was how many of TectoMT's phrases CH0 can in principle create by itself. Out of the 15607 TectoMT phrases used for translating the test set, CH0 was able to create 14057. We looked at the roughly 10 % of phrases which were unreachable and found that some of them contained named entities or unusual formulations (not necessarily correct); however, most were valid translations. Note that even if 90 % of the phrases are reachable, they can still be overly costly (esp. when built from multiple pieces), so Moses might prefer a segmentation with fewer phrases even though they fit together less well.
Table 11 illustrates the impact of the phrase table limit on the reachability of phrases in this setting. The difference in coverage is significant between the limits 20 (the default value for Moses) and 100, which confirms our observations in Section 2.3. It is somewhat surprising that even between the 100th and 1000th best phrase translation, there are still phrases that can improve the coverage.
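The bookkeeping behind the forced decoding experiment can be summarized in a short sketch. It assumes that, for every sentence, we already have the plain Moses output with its model score, the combined system's output, and the model score of the forced derivation (or None when the target translation is unreachable). All names are illustrative, and ties between scores are counted as modelling errors here, which is an assumption.

```python
from collections import Counter
from typing import Optional


def classify_sentence(plain_output: str, combined_output: str,
                      free_score: float, forced_score: Optional[float]) -> str:
    """Classify one sentence of the forced decoding experiment."""
    if plain_output == combined_output:
        return "identical"          # put aside, nothing to force
    if forced_score is None:
        return "unreachable"        # plain Moses cannot build the target at all
    if forced_score > free_score:
        return "search_error"       # a better-scoring derivation existed but was missed
    return "modelling_error"        # the model itself prefers its own output


def summarize(records):
    """records: iterable of (plain_output, combined_output, free_score, forced_score)."""
    return Counter(classify_sentence(*record) for record in records)


# Toy records with made-up scores (higher is better).
toy = [
    ("a b", "a b", -3.0, None),
    ("a b", "a c", -3.0, None),
    ("a b", "a c", -3.0, -4.1),
    ("a b", "a c", -3.0, -2.5),
]
summary = summarize(toy)
```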

Long or Morphological LMs vs. TectoMT
In order to learn more about the interplay between the TectoMT phrase table and our language models (LMs), we carried out an experiment where we evaluated all (sensible) subsets of the LMs. For each subset, we reran tuning (MERT) and evaluated the system using BLEU.
As shown above, a significant part of the contribution of TectoMT lies in improving morphological coherence. Since the strong LMs (especially the ones trained on morphological tags) should have a similar effect, we were interested to see whether they complement each other or whether they are mutually replaceable.
In Table 12, we provide results obtained on the WMT14 test set, sorted in ascending order by the BLEU score with TectoMT included. It is immediately apparent that LMs cannot replace the contribution of TectoMT: the best result in the first column (22.69) is noticeably worse than the weakest result obtained with TectoMT included (22.93).
Concerning the usefulness of LMs, it seems that their effects are also complementary: we get the best results by using all of them. It seems that "big" and "long" capture different aspects of the language: "big" provides very reliable statistics on short n-grams while "long" models common long sequences (patterns). The morphological LMs do seem correlated though. When adding "longm", our aim was to also capture long common patterns in sentential structure. However, it seems that the n-gram order 10 already serves this purpose quite well and extending the range provides only modest improvement.
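The layout of the subset experiment in this section can be sketched as below. The sketch enumerates all non-empty LM subsets (the notion of a "sensible" subset is not specified further, so this is an assumption) and delegates tuning and scoring to a caller-supplied function; the label "morph" for the fourth LM and the placeholder scoring function are hypothetical.

```python
from itertools import combinations


def lm_subsets(lms, min_size=1):
    """Yield every subset of the LM names with at least min_size members."""
    for size in range(min_size, len(lms) + 1):
        for subset in combinations(lms, size):
            yield subset


def run_grid(lms, tune_and_score):
    """Evaluate each LM subset without and with the TectoMT phrase table.

    tune_and_score(subset, with_tectomt) stands for re-running tuning and
    computing BLEU; it is supplied by the caller. Rows are sorted in
    ascending order by the score with TectoMT included.
    """
    rows = [(subset, tune_and_score(subset, False), tune_and_score(subset, True))
            for subset in lm_subsets(lms)]
    return sorted(rows, key=lambda row: row[2])


# Placeholder scorer for demonstration only; real scores come from tuning runs.
def fake_score(subset, with_tectomt):
    return len(subset) + (10 if with_tectomt else 0)


rows = run_grid(["big", "long", "morph", "longm"], fake_score)
```

With four LMs this gives 15 subsets and therefore 30 tuning runs, which is why the experiment is only feasible for a small number of models.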

Outstanding Issues
The current combination is quite complex and as such, it results in non-trivial interactions between the components which are hard to identify and describe. We would like to simplify the architecture somehow, striving for a clean, principled design. However, as we have shown, we cannot simply remove any of the components without a significant loss of translation quality, so this remains an open question for further research.

Weaknesses of CH0
On many occasions, we were surprised by the low quality of CH0's translations. We considered this system a rather strong baseline, given the LMs trained on billions of tokens and the factored scheme, which specifically targets morphological coherence. Yet we observed many obvious errors both in lexical choice and morphological agreement, which were well within the scope of the phrase length limit and n-gram order. We believe that more sophisticated statistical models, such as discriminative classifiers which take source context into account (Carpuat and Wu, 2007) or operation sequence models (Durrani et al., 2011), could be applied to further improve CH0.

Practical Considerations
As we have shown, our approach to system combination has some unique properties and can certainly be an interesting alternative. Yet it can be viewed as impractical: the models (the TectoMT phrase table, specifically) actually require the input to be known in advance. In this section, we outline a possible solution which would allow for using the system in an on-line setting.
The synthetic parallel data consist of the dev set and test set. Our development data can be fixed in advance so re-tuning the system parameters is not required for new inputs.
The only remaining issue is ensuring that the second phrase table contains the TectoMT translation of the input. We propose to first translate the input sentence using TectoMT. Then for word alignment, we can either use the alignment information directly from TectoMT or apply a pretrained word-alignment model, provided e.g. by MGiza (Gao and Vogel, 2008). Phrase extraction and scoring can be done quickly on the fly.
Phrase scores should ideally be combined with the dev-set part of the phrase table. Moses has support for dynamic updating of its phrase tables (Bertoldi, 2014), so changing the scores or adding new phrase pairs is possible at very little cost.
With pre-trained word alignment and dynamic updating of the phrase table, we believe that our approach could be readily deployed in practice.

Conclusion
We have carefully analyzed the system combination Chimera which consists of a statistical system Moses (CH0), a deep-syntactic transfer-based system TectoMT and a rule-based post-processing tool Depfix. We focused on the interaction between CH0 and CU-TECTOMT. We described several techniques for inspecting this combination, based on both automatic and manual evaluation.
We have found that the transfer-based component provides a mix of useful, correct translations and noise. Many of its translations are unavailable to the statistical component, so its generalization power is in fact essential. Moses is able to select the useful translations quite successfully thanks to strong language models, which are trained both on surface forms and morphological tags.
Our experiment with forced decoding further showed that translations which are reachable for Moses are often not chosen due to modelling errors. It is the extra prominence these translations get thanks to CU-TECTOMT that helps to overcome these errors.
We show that our approach to system combination (using translations from the transfer-based system as additional training data) has several advantageous properties and that it might be an interesting alternative to standard techniques. We outline a solution to the issue of the practical applicability of our method.
Overall, we find that by adding the transfer-based system, we obtain novel translations and improved morphological coherence. The final translation quality is improved significantly over both CH0 and CU-TECTOMT alone, setting the state of the art for English→Czech translation for several years in a row.