CUNI in WMT15: Chimera Strikes Again

This paper describes our WMT15 system submission for the translation task, a hybrid system for English-to-Czech translation. We repeat the successful setup from the previous two years.


Introduction
CHIMERA (Bojar et al., 2013; Tamchyna et al., 2014) is our English-to-Czech MT system designed as a combination of three very different components:
• TectoMT (Popel and Žabokrtský, 2010), a deep-syntactic transfer-based system,
• Moses (Koehn et al., 2007), where we use a factored phrase-based setup with large language models,
• Depfix (Rosa et al., 2012), an automatic post-editing system aimed mainly at correcting errors in morphological agreement but also successful at semantic corrections, esp. the recovery of lost negation.
The overall setup as well as the details on each of the components have been described in the past. We nevertheless briefly review it here, to make the paper self-contained.
This year, our submission differed mainly in the additional data we were able to collect. We thus evaluate how much the additional data help in contrast with an identical setup using WMT15 training data only. 1 For the manual evaluation in WMT15, we submitted the non-constrained system, and even the "constrained" setup might not strictly qualify as such, since it is a system combination and both TectoMT and Depfix rely on hand-crafted rules to some extent.
In the following, we provide various details of the setup. We leave Depfix aside, since we simply applied it as a post-processing step and the relevant analysis of its rules was published previously (Bojar et al., 2013).

Factored Setup
We use our established setup, translating in a single translation step from the English word form to the Czech word form plus morphological tag. This allows us to use language models over morphological tags; see §2.5 below.
Our word forms are truecased, i.e. words at sentence beginnings are lowercased unless they are names. We rely on Czech and English lemmatizers 2 to select the true case.
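The lemma-based truecasing decision can be sketched as follows. This is a minimal illustration, not the actual tool: we assume a lemmatizer that keeps proper-name lemmas capitalized, and the function name and interface are hypothetical.

```python
def truecase_initial(tokens, lemmas):
    """Lowercase the sentence-initial token unless its lemma says it is a name.

    A token whose lemma is itself capitalized (e.g. "Prague" -> "Prague")
    is treated as a proper name and keeps its case; otherwise the
    uppercase letter is assumed to come from sentence casing only.
    """
    if not tokens:
        return tokens
    first, lemma = tokens[0], lemmas[0]
    if not lemma[:1].isupper():  # lemma is lowercase -> not a name
        tokens = [first.lower()] + tokens[1:]
    return tokens

# "The" is capitalized only by sentence casing; "Prague" is a name.
print(truecase_initial(["The", "Prague", "castle"], ["the", "Prague", "castle"]))
# -> ['the', 'Prague', 'castle']
```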
Otherwise, our setup is fairly standard. We do not use any models of reordering, relying on basic distortion penalty.

Our System Combination
The first two components of CHIMERA, TectoMT (which appears in WMT evaluations as CU-TECTOMT) and Moses, are independent MT systems in their own right. CHIMERA combines them in a way remotely similar to standard system combination techniques (Matusov et al., 2008) and adds the third component, Depfix, for automatic correction of some grammatical and semantic errors. For clarity, we will use the abbreviation CH0 to refer to the basic Moses setup without CU-TECTOMT, CH1 for the first stage where CU-TECTOMT has been added, and CH2 for the complete combination.
To obtain the output of CH1 from CH0 and CU-TECTOMT, we could have used one of the standard system combination tools, e.g. Barrault (2010) or Heafield and Lavie (2010). Instead, we simply use Moses to do the job. Figure 1 provides a graphical summary of the technique. To obtain the combined system CH1, we add one additional phrase table to the primary phrase-based system CH0. This new phrase table is "synthetic": its source side comes from the input text and its target side comes from the output of CU-TECTOMT. The process to construct this phrase table is straightforward: we translate the source side of the development sets and the test set with CU-TECTOMT and treat the result as a standard parallel corpus. We align it with GIZA++, using lemmas instead of word forms, but aligning only this relatively small corpus, not the main parallel training data. After symmetrization (grow-diag-final-and), we extract phrases without any smoothing. Moses is set up to use the two phrase tables simultaneously, the CH0 one and the new one from CU-TECTOMT, as two alternative decoding paths.
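The extraction step applied to this synthetic corpus follows the standard consistency rule: a phrase pair is kept only if no alignment link connects it to words outside the pair. The toy sketch below illustrates that rule; it is not the authors' code (the Moses extractor additionally handles unaligned words, phrase scoring, and reordering information).

```python
def extract_phrases(alignment, src_len, tgt_len, max_len=7):
    """Toy consistency-based phrase-pair extraction.

    `alignment` is a set of (src_idx, tgt_idx) word-alignment points.
    A phrase pair (source span, target span) is kept iff every alignment
    point touching either span lies inside both spans.
    """
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            # Target positions aligned to anything in the source span.
            tgt = [t for s, t in alignment if i1 <= s <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            # Consistency: no point links the target span outside the source span.
            if all(i1 <= s <= i2 for s, t in alignment if j1 <= t <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs

# Hypothetical 3-word sentence pair with alignment 0-0 1-2 2-1:
print(extract_phrases({(0, 0), (1, 2), (2, 1)}, 3, 3))
```

Note how the crossing links 1-2 and 2-1 block the source span (0, 1) from being extracted, while the spans that close over the crossing survive.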
The main and only trick is to include the development set(s) and the test set in this phrase table.
Covering the development set ensures that MERT will correctly assess the relative importance of the two tables. And covering the test set is essential in the main run.
We dub the approach a "poor man's" system combination, but we have recently found that it has surprising benefits over the standard approaches. It allows the combined system CH1 to react to (usually longer) phrases coming from CU-TECTOMT and to use words and phrases from the standard CH0 phrase table that were not previously selected for CH0's single-best output but make the sentence more fluent overall. See Tamchyna and Bojar (2015) for a detailed analysis.
This year, we translated the source sides of all WMT news test sets from 2007 through 2015 with CU-TECTOMT, contributing to the phrase table. MERT is tuned only on WMT newstest2013. We used newstest2014 to decide which exact configuration to submit, and the final WMT results are obviously based on newstest2015.


Parallel Data and Phrase Tables
Table 1 summarizes the parallel data used in our experiments. We use the CzEng 1.0 corpus and Europarl in both the constrained and unconstrained settings.
Our full system additionally uses the OpenSubtitles datasets from OPUS. 3 We downloaded all three corpora (2011, 2012, 2013) and ran context-aware de-duplication on the whole dataset. (A sentence is removed only if it was already seen in the context of one preceding and one following sentence; the same sentence can thus appear in the corpus many times if its context differs.) For DGT Acquis, we do not rely on OPUS. Instead, we downloaded the corpus from the official website, aligned the sentences using HunAlign (Varga et al., 2005), and de-duplicated them.
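The context-aware de-duplication rule can be sketched as follows. This is a simplified illustration over a single stream of sentences (contexts are read off the original stream, not recomputed after removals), and the function name is our own.

```python
def context_dedup(sentences):
    """Context-aware de-duplication: drop a sentence only when the
    (previous, current, next) triple has been seen before, so the same
    sentence may survive several times if it appears in different contexts."""
    seen = set()
    kept = []
    prevs = [None] + sentences[:-1]
    nexts = sentences[1:] + [None]
    for prev, cur, nxt in zip(prevs, sentences, nexts):
        triple = (prev, cur, nxt)
        if triple not in seen:
            seen.add(triple)
            kept.append(cur)
    return kept

# The repeated run "A B C" is dropped, but "B" survives again in the
# new context "A _ D".
print(context_dedup(["A", "B", "C", "A", "B", "C", "A", "B", "D"]))
# -> ['A', 'B', 'C', 'A', 'B', 'D']
```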
We also use the small translation memories from ECDC 4 and EAC. 5
[Table 1: parallel data sources with sentence counts and English/Czech token counts.]

Monolingual Data
Table 2 summarizes the monolingual data that we use in the full and in the constrained setup. Czech Press is a very large collection of news texts acquired in 2012. From CzEng 1.0, we use only the news section. CWC stands for the Czech Web Corpus collected at our department from various web sites; here, we restrict it to articles (as opposed to discussion fora). RSS is our own collection of news from six Czech web news sites, and WMT are the standard monolingual data collected by the WMT organizers in the years 2007-2014. Only the CzEng and WMT data are allowed in the constrained runs. Note that several of the resources are likely to overlap: e.g., our RSS collection probably follows the same sources as the WMT data, and the Czech Web Corpus is also likely to be gathered from similar websites.
Except for CWC, all the LM texts are strictly from the news domain. In other words, while we use as many and as diverse parallel texts as possible, we keep our LMs in domain. We believe that at our current order of data size, preserving the domain is more important than simply using more monolingual data.

Language Models
As detailed in Table 2, we build several separate language models from the data. The constrained setup uses three LMs and the full setup uses four:
Long is a 7-gram model based on our truecased word forms. While the remaining LMs are trained directly with KenLM (Heafield, 2011), this 7-gram LM is interpolated with SRILM from separate (KenLM) ARPA files estimated from each of the years separately. The lambdas for the interpolation are set to optimize the perplexity on WMT newstest2012. This approach allows us to use the relatively high order of the model and probably also serves as a kind of smoothing, distributing more probability mass to n-grams that are important across several years.
Big is a 4-gram LM based on our truecased word forms. It uses all our data, and as such, it cannot be included in the constrained setup. The motivation for using both "big" and "long" models is to cover long sequences as well as to have as precise statistics for shorter sequences as possible. We would not be able to train a 7-gram model using all our data.
Morph is a 10-gram LM based on Czech morphological tags. There are only around 4,000 distinct morphological tags, so we can afford to train such a high-order LM.
LongMorph is a 15-gram variation of "morph". We were hoping that, given again some more training data this year, the morphological tags would be dense enough to capture sentence patterns within 15-grams. As it turns out, standard n-gram modelling techniques were not able to reach this goal.
Table 3 lists the BLEU scores (newstest2014) for all sensible (non-constrained) combinations of the LMs in CH1. We see that the LMs indeed have some complementary effect. The absolute differences in BLEU scores are rather small (and most of them are probably not statistically significant), but arguably using "big", "long" and one of the morphological LMs is the most beneficial setup.
Table 4 shows (tokenized) BLEU scores on the WMT14 test set, comparing CH0 (i.e. the plain factored phrase-based Moses setup) and CH1 (i.e. the combination with CU-TECTOMT), in the constrained and full-data runs. The BLEU scores are case-sensitive. The scores indicate that adding CU-TECTOMT is more important than the additional training data. With more data, the benefit of CU-TECTOMT slightly decreases, but it still remains rather high, at 1.65 BLEU points absolute.
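Returning to the perplexity-optimized lambdas of the "long" model: the weight search is what SRILM's compute-best-mix script implements. A toy version of the underlying EM procedure might look like this, with small hypothetical per-token probabilities standing in for the per-year ARPA files (illustrative only, not the SRILM implementation):

```python
import math

def em_interpolation_weights(model_probs, iters=50):
    """Estimate linear-interpolation weights that minimize held-out perplexity.

    `model_probs[k][i]` is model k's probability of the i-th held-out token.
    Standard EM for a mixture: compute each model's responsibility for each
    token under the current weights, then renormalize.
    """
    k = len(model_probs)
    n = len(model_probs[0])
    lam = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for i in range(n):
            mix = sum(lam[j] * model_probs[j][i] for j in range(k))
            for j in range(k):
                counts[j] += lam[j] * model_probs[j][i] / mix
        lam = [c / n for c in counts]
    return lam

# Two hypothetical "per-year" models scored on four held-out tokens;
# the second model fits the data better and should get the larger weight.
probs_a = [0.1, 0.2, 0.1, 0.1]
probs_b = [0.3, 0.3, 0.4, 0.2]
lam = em_interpolation_weights([probs_a, probs_b])
ppl = math.exp(-sum(math.log(lam[0] * p + lam[1] * q)
                    for p, q in zip(probs_a, probs_b)) / len(probs_a))
print(lam, ppl)
```

Each EM iteration can only decrease (or keep) the held-out perplexity, which is why a few dozen iterations suffice in practice.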

Results
In Table 5, we list scores of different variants of CHIMERA and competing MT systems for WMT15. Our system ranked first according to both automatic and manual evaluation. Some of the gains are due to the large training data (other academic submissions were constrained systems). On the other hand, we also outperform Google Translate, which likely uses all available data.

Conclusion
We briefly described our submission to the WMT15 translation shared task. Our setup is fairly standard with the exception of our language model suite and the system combination with a transfer-based system. We showed that we benefit both from the large training data and from the system combination. Our submission ranked first according to both automatic and manual evaluation.