Results of the WMT15 Tuning Shared Task

This paper presents the results of the WMT15 Tuning Shared Task. We provided the participants of this task with a complete machine translation system and asked them to tune its internal parameters (feature weights). The tuned systems were used to translate the test set and the outputs were manually ranked for translation quality. We received 4 submissions in the English-Czech and 6 in the Czech-English translation direction. In addition, we ran 3 baseline setups, tuning the parameters with standard optimizers for BLEU score.


Introduction
Almost all modern statistical machine translation (SMT) systems internally consider translation candidates from several aspects. Some of these aspects can be very simple and one parameter is sufficient to capture them, such as the word penalty incurred for every word produced or the phrase penalty controlling whether the sentence should be translated in fewer or more independent phrases, leading to more or less word-for-word translation. Other aspects try to assess e.g. the fidelity of the translation, the fluency of the output or the amount of reordering. These are far more complex and formally captured in a model such as the translation model or language model. Both the simple penalties as well as the scores from the more complex models are called features and need to be combined to a single score to allow for ranking of translation candidates. This is usually done using a linear combination of the scores:
$$\mathrm{score}(e, f) = \sum_{m=1}^{M} \lambda_m h_m(e, f)$$
where $e$ and $f$ are the candidate translation and the source, respectively, and $h_m(\cdot, \cdot)$ is one of the $M$ penalties or models. The tuned parameters are $\lambda_m \in \mathbb{R}$, called feature weights.
Feature weights have a tremendous effect on the final translation quality. For instance, the system can produce extremely long outputs, fabulating words just in order to satisfy a negatively weighted word penalty, i.e. a bonus for each word produced. An inherent part of the preparation of MT systems is thus some optimization of the weight settings.
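The effect of a badly set word penalty can be illustrated with a minimal sketch of the linear model. The feature names, the feature values and the convention that the word penalty feature equals the negated word count are assumptions made for this toy example only; they are not taken from the distributed systems.

```python
from typing import Dict, List


def model_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear combination: sum over m of lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())


def make_candidate(text: str, lm_logprob: float, tm_logprob: float) -> Dict:
    # Toy convention (assumption): the word penalty feature is the negated
    # number of words, so a positive weight penalizes long outputs and a
    # negative weight turns the penalty into a bonus per word.
    word_penalty = -float(len(text.split()))
    return {
        "text": text,
        "features": {"LM": lm_logprob, "TM": tm_logprob, "WordPenalty": word_penalty},
    }


def best_candidate(candidates: List[Dict], weights: Dict[str, float]) -> str:
    """Return the text of the candidate with the highest model score."""
    return max(candidates, key=lambda c: model_score(c["features"], weights))["text"]


# Hypothetical candidates with invented log-probabilities.
candidates = [
    make_candidate("the cat sleeps", -4.0, -3.0),
    make_candidate("the cat is sleeping on the the the mat mat", -9.0, -6.0),
]

balanced = {"LM": 1.0, "TM": 1.0, "WordPenalty": 0.5}
runaway = {"LM": 1.0, "TM": 1.0, "WordPenalty": -3.0}

print(best_candidate(candidates, balanced))
print(best_candidate(candidates, runaway))
```

With the balanced weights the short candidate wins; with the negative word penalty weight the per-word bonus outweighs the worse model scores and the long, repetitive candidate is ranked first.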
If we had to set the weights manually, we would have to try a few configurations and pick one that leads to reasonable outputs. The common practice is to use an optimization algorithm that examines many settings, evaluating the produced translations automatically against reference translations using some evaluation measure (traditionally called "metric" in the MT field). In short, the optimizer tunes model weights so that the final combined model score correlates with the metric score.
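The idea of an optimizer that examines many weight settings and scores each one with an automatic metric can be sketched as follows. This is a deliberately naive random search over fixed n-best lists, not MERT, MIRA or any method used in the task; the feature vectors and per-candidate metric scores are invented, and a real optimizer would re-decode with the new weights instead of only reranking.

```python
import random
from typing import List, Tuple

# Each candidate is (feature vector, metric score against the reference).
Candidate = Tuple[List[float], float]


def rerank_metric(nbest_lists: List[List[Candidate]], weights: List[float]) -> float:
    """Average metric score of the candidates that the model ranks first."""
    total = 0.0
    for nbest in nbest_lists:
        top = max(nbest, key=lambda c: sum(w * h for w, h in zip(weights, c[0])))
        total += top[1]
    return total / len(nbest_lists)


def random_search(nbest_lists: List[List[Candidate]], start: List[float],
                  trials: int = 200, seed: int = 0) -> Tuple[List[float], float]:
    """Perturb the weights at random and keep a setting only if the metric improves."""
    rng = random.Random(seed)
    best_w, best_s = list(start), rerank_metric(nbest_lists, start)
    for _ in range(trials):
        proposal = [w + rng.gauss(0.0, 1.0) for w in best_w]
        score = rerank_metric(nbest_lists, proposal)
        if score > best_s:
            best_w, best_s = proposal, score
    return best_w, best_s


# Hypothetical n-best lists for two source sentences, features = [LM, WordPenalty].
nbest_lists = [
    [([-2.0, -3.0], 0.9), ([-1.0, -8.0], 0.2)],
    [([-3.0, -4.0], 0.8), ([-2.5, -9.0], 0.1)],
]

start_weights = [1.0, -1.0]
start_score = rerank_metric(nbest_lists, start_weights)
tuned_weights, tuned_score = random_search(nbest_lists, start_weights)
print(start_score, tuned_score)
```

Because a proposal is accepted only when the metric improves, the tuned score can never fall below the starting score on the data used for tuning; whether it generalizes to unseen data is a separate question.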
The metric score, in turn, is designed to correlate well with human judgements of translation quality; see the papers summarizing the WMT metrics tasks. However, a metric that correlates well with humans on final output quality may not be usable in weight optimization for various technical reasons. BLEU (Papineni et al., 2002) was shown to be very hard to surpass (Cer et al., 2010) and this is also confirmed by the results of the invitation-only WMT11 Tunable Metrics Task (Callison-Burch et al., 2010). Note, however, that some metrics have been successfully used for system tuning (Liu et al., 2011; Beloucif et al., 2014).
The aim of the WMT15 Tuning Task is to attract attention to the exploration of all three aspects of model optimization: (1) the set of features in the model, (2) the optimization algorithm, and (3) the evaluation metric used in optimization. For (1), we provide a fixed set of "dense" features and also allow participants to add additional "sparse" features. For (2), the optimization algorithm, task participants are free to use one of the available algorithms for direct loss optimization (Och, 2003; Zhao and Chen, 2009), which are usually capable of optimizing only about a dozen features, or one of the optimizers that also handle very large sets of features (Cherry and Foster, 2012; Hopkins and May, 2011), or a custom algorithm. And finally for (3), participants can use any established evaluation metric or a custom one.

Tuning Task Assignment
Tuning task participants were given a complete model for the hierarchical variant of the machine translation system Moses (Hoang et al., 2009) and the development set (newstest2014), i.e. the source and reference translations. No "dev test" set was provided, since we expected that participants would internally evaluate various variants of their method by manually judging MT outputs. In fact, we offered to evaluate a certain number of translations into Czech for free to ease participation for teams without any access to speakers of Czech; only one team used this service once.
A complete model consists of a rule table extracted from the parallel corpus, the default glue grammar and the language model extracted from the monolingual data. As such, this defines a fixed set of dense features. The participants were allowed to add any sparse features implemented in Moses Release 3.0 (corresponding to GitHub commit 5244a7b607) and/or to use any optimization algorithm and evaluation metric. Fully manual optimization was also not excluded, but nobody seemed to take this approach.
Each submission in the tuning task consisted of the configuration of the MT system, i.e. the additional sparse features (if any) and the values of all the feature weights, $\lambda_m$.

Details of Systems Tuned
The systems that were distributed for tuning are based on the Moses (Hoang et al., 2009) implementation of the hierarchical phrase-based model (Chiang, 2005). The language models were 5-gram models with Kneser-Ney smoothing (Kneser and Ney, 1995) built using KenLM (Heafield et al., 2013). For word alignments, we used Mgiza++ (Gao and Vogel, 2008).
The parallel data used for training translation models consisted of Europarl v7, News Commentary data (parallel-nc-v9) and CommonCrawl, as released for WMT14. We excluded CzEng because we wanted to keep the task small and accessible to more groups.
Since the test set (newstest2015) and the development set (newstest2014) are in the news domain, we opted to exclude Europarl from the language model data. We did not add any monolingual news on top of News Commentary, which is quite close to the news domain. In retrospect, we should also have added some of the monolingual news data as released by WMT, especially since we used a 5-gram LM.
Before any further processing, the data was tokenized (using the Moses tokenizer) and lowercased. We also removed sentences longer than 60 words or shorter than 4 words. Table 1 summarizes the final dataset sizes and Table 2 provides details on out-of-vocabulary items.
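The length filtering described above can be sketched in a few lines. Whitespace splitting stands in for the Moses tokenizer here, and dropping a sentence pair when either side violates the limits is an assumption; the text does not say how the two sides of the parallel data were handled.

```python
from typing import Iterable, List, Tuple

MIN_WORDS = 4   # sentences shorter than this are removed
MAX_WORDS = 60  # sentences longer than this are removed


def preprocess(sentence: str) -> str:
    # Stand-in for the Moses tokenizer (assumption): lowercase and
    # normalize whitespace only.
    return " ".join(sentence.lower().split())


def keep(sentence: str) -> bool:
    """True if the sentence length lies within the allowed range."""
    n_words = len(sentence.split())
    return MIN_WORDS <= n_words <= MAX_WORDS


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep a pair only if both sides pass the length filter (assumption)."""
    kept = []
    for src, tgt in pairs:
        src, tgt = preprocess(src), preprocess(tgt)
        if keep(src) and keep(tgt):
            kept.append((src, tgt))
    return kept


pairs = [
    ("Hello world", "Ahoj světe"),
    ("This is a short test .", "Toto je krátký test ."),
    (" ".join(["word"] * 61), " ".join(["slovo"] * 61)),
]
print(filter_pairs(pairs))
```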
Aside from the dev set provided, the participants were free to use any other data for tuning (making their submission "unconstrained"), but no participant decided to do that. All tuning task submissions are therefore also constrained. We left all decoder settings (n-best list size, pruning limits, etc.) at their default values. While the participants may have used different limits during tuning, the final test run was performed at our site with the default values. It is indeed only the feature weights that differ.

Tuning Task Participants
The list of participants and the names of the submitted systems are shown in Table 3, along with references to the details of each method.
USAAR-TUNA by Liling Tan and Mihaela Vela has no accompanying paper, so we sketch it here. The method sets each weight as the harmonic mean ($\frac{2xy}{x+y}$) of the weights proposed by batch MIRA and by MERT. Batch MIRA and MERT are run side by side and the harmonic mean is taken and used in moses.ini at every iteration. The optimization stops when the averaged weights change only very little, which happened around iteration 17 or 18 in this case (Liling Tan, personal communication).
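The weight combination in USAAR-TUNA can be sketched as follows. Only the harmonic mean formula and the "stop when the weights barely change" criterion come from the description above; the function names, the tolerance and the handling of weights that sum to zero are assumptions for illustration, and the sketch does not run MIRA or MERT itself.

```python
from typing import List


def harmonic_mean(x: float, y: float) -> float:
    """Harmonic mean 2xy / (x + y) of two weights."""
    if x + y == 0.0:
        # Assumption: the source does not say how this case is handled.
        return 0.0
    return 2.0 * x * y / (x + y)


def combine(mira_weights: List[float], mert_weights: List[float]) -> List[float]:
    """Combine the two proposed weight vectors feature by feature."""
    return [harmonic_mean(a, b) for a, b in zip(mira_weights, mert_weights)]


def converged(previous: List[float], current: List[float], tol: float = 1e-3) -> bool:
    """Hypothetical stopping test: largest per-weight change below a tolerance."""
    return max(abs(p - c) for p, c in zip(previous, current)) < tol


# Invented weight proposals for two features.
combined = combine([0.2, 0.5], [0.4, 0.5])
print(combined)
```

Note that the harmonic mean behaves poorly when the two proposals have opposite signs or nearly cancel; how the submission dealt with such cases is not described.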
ILLC-UVA (Stanojević and Sima'an, 2015) was tuned using KBMIRA with a modified version of the BEER evaluation metric. The authors claim that standard trained evaluation metrics learn to give too much importance to recall and thus lead to overly long translations in tuning. For that reason they modify the training of BEER to value recall and precision equally. This modified version of BEER is used to tune the MT system.
DCU (Li et al., 2015) is tuned with RED, an evaluation metric based on matching of dependency n-grams. The authors tried tuning with both MERT and KBMIRA and found that KBMIRA gives better results, so the submitted system uses KBMIRA.
HKUST (Lo et al., 2015) is tuned with an improved version of MEANT. MEANT is an evaluation metric that pays more attention to the semantic aspects of translation. Better correlation on the sentence level was achieved by integrating distributional semantics into MEANT and handling failures of the underlying semantic parser. The submission of HKUST contained a bug that was discovered after the human evaluation period, so the corrected submission HKUST-LATE is evaluated only with BLEU.
METEOR-CMU (Denkowski and Lavie, 2011) is a system tuned for an adapted version of Meteor. Meteor's parameters are set to give equal importance to precision and recall.
AFRL (Erdmann and Gwinnup, 2015) is the only submission tuned with a new tuning algorithm, "Drem", instead of the standard MERT or KBMIRA. Drem uses scaled derivative-free trust-region optimization instead of line search or (sub)gradient approximations. For weight settings that were not yet tested in the decoder, it interpolates the decoder output using the information of which settings produced which translations. The optimized metric is a weighted combination of NIST, Meteor and Kendall's τ.
In addition to the systems submitted, we provided three baselines.
Since all the submissions including the baselines were subject to manual evaluation, we did not run the MERT or MIRA optimizations more than once (as is the common practice for estimating variance due to optimizer instability). We simply used the default settings and stopping criteria and picked the weights that performed best on the dev set according to BLEU.
Of all the submissions, only the submission METEOR-CMU used sparse features. For a more interesting comparison, we set our baseline (BLEU-MIRA-SPARSE) to use the very same set of sparse features. These features are automatically constructed using Moses' feature templates named PhraseLengthFeature0, SourceWordDeletionFeature0, TargetWordInsertionFeature0 and WordTranslationFeature0. They were made for the 50 most frequent words in the training data. For both language pairs these feature templates produce around 1000 features.

Results
We used the submitted moses.ini and (optionally) sparse weights files to translate the test set. The test set was not available to the participants at the time of their submission (not even the source side). We used the Moses recaser trained on the target side of the parallel corpus to recase the outputs of all the models.
Finally, the recased outputs were manually evaluated, jointly with regular translation task submissions of WMT. This was not enough to reliably separate tuning systems in the Czech-to-English direction, so we asked task participants to provide some further rankings.
The resulting human rankings were used to compute the overall manual score using the TrueSkill method, the same as for the main translation task. We report two variants of the score: one is based on manual judgements related to tuning systems only and one is based on all judgements. Note that the actual ranking tasks shown to the annotators were identical, mixing tuning systems with regular submissions.
Tables 4 and 5 contain the results of the submitted systems sorted by their manual scores.
The horizontal lines represent separation between clusters of systems that perform similarly. Cluster boundaries are established by the same method as for the main translation task. Interestingly, cluster boundaries for Czech-to-English vary as we change the set of judgements.
Some systems do not have the TrueSkill score because they were either submitted after the deadline (HKUST-LATE) or served as additional baselines and performed similarly to our baselines (USAAR-BASELINE-MIRA and USAAR-BASELINE-MERT).

Discussion
There are a few interesting observations that can be made about the baseline results. Various details, including the scores and weights of the systems, are given in Table 6.

Dense vs. Sparse Features
It is surprising how well the baseline based on KBMIRA and BLEU tuning (BLEU-MIRA-DENSE) performs on both language pairs. On Czech-English, it is better than all the other submitted systems, while on English-Czech only one system outperforms it (staying in the same performance cluster anyway). Using KBMIRA, as in BLEU-MIRA-DENSE, for tuning dense features is becoming more common in the MT community, compared to the previous standard of using MERT. Our results support this practice. Preferring KBMIRA to MERT is often motivated by the possibility to include sparse features, but we see that even with dense features only, KBMIRA is better than MERT.
The sparse models, BLEU-MIRA-SPARSE and METEOR-CMU, however, perform rather poorly even though they were trained with KBMIRA. Both of the sparse submissions use the same set of features and the same tuning algorithm, although the optimization was run at different sites. The only difference is the metric they optimize. Tuning for Meteor (Denkowski and Lavie, 2011) gives better results than tuning for BLEU (Papineni et al., 2002). Unfortunately, we had no system with [...]
It is not clear why the sparse methods perform badly. One explanation could be the relatively small development set or some pruning settings. In any case, we find it unfortunate that sparse features in the hierarchical model harm performance in the default configuration (footnote 4).

Some Observations on Weight Settings
We tried to find some patterns in the weight settings and the performance of the system, but admittedly, it is difficult to make much sense of the few points in the 8-dimensional space.
For English-to-Czech, we can see a gist of a bell-like shape when normalizing the weights with the L2 norm and plotting the word penalty and the manual score, see Figure 1. The middle values seemed to be a good setting. For the other translation direction or other weights, no such clear relation is apparent.
Table 6: Detailed scores and weights of Czech-to-English (left) and English-to-Czech (right) systems (column headers: Type, Manual Score, Test BLEU, Dev BLEU).
We also tried to interpret the weight settings using principal component analysis (PCA), despite the low number of observations. (Ideally, we would like to have at least 40-80 systems; we have 7 or 9.) Before running PCA, we normalized the weights with the L2 norm. The Cattell scree test indicated that two components would be appropriate to summarize the dataset. To make the components more interpretable, we applied varimax rotation. Figure 2 plots the two principal components of the set of systems for English-to-Czech. We see that the first component (PC1) explains the performance almost completely, with middle values being the best. Looking at the loadings (correlations of components with the original feature function dimensions) in Table 7, we learn that PC1 primarily accounts for the first two weights of the translation model (TM 0 and TM 1, which correspond to phrase and lexically-weighted inverse probabilities, respectively), the word penalty (WrdPen) and the language model weight (LM0). Knowing that in almost all systems the weight of the word penalty is several times bigger than the weights of TM 0, TM 1 and LM0, we conclude that tuning of the word penalty (in balance with the LM weight) was the most apparent decisive factor of the English-Czech tuning task. The second component (PC2) primarily covers the weights of the remaining features, that is, the direct translation probabilities and the phrase penalty. Unfortunately, PC2 is not very informative about the final quality of the translation.
Footnote 4: MERT and two MIRA runs reached BLEU of not more than +0.02 points higher when the size of the n-best list was increased from 100 to 200. So n-best list size does not seem to be the problem.
The Czech-to-English results in Figure 3 do not seem to lend themselves to any simple conclusion. Based on closeness of systems in the PCA plots, we can say that for English-Czech, two out of three best systems (BLEU-MIRA-DENSE and DCU) found similar settings while AFRL stands out. Czech-English results show that systems of very similar weight settings give translations of very different quality. Again, AFRL stands out while leading to very good outputs.
Table 7: Loadings (correlations) of each component with each feature function for English-Czech.
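The analysis pipeline described in this section (L2 normalization of each system's weights, PCA, inspection of explained variance as in a scree plot, and loadings as correlations) can be sketched with NumPy. The weight matrix below is random stand-in data for 9 systems with 8 dense weights, not the actual submissions, and the varimax rotation step is omitted for brevity, so the numbers bear no relation to Table 7.

```python
import numpy as np


def l2_normalize(weights: np.ndarray) -> np.ndarray:
    """Scale each system's weight vector (one row) to unit L2 norm."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / norms


def pca(data: np.ndarray, n_components: int = 2):
    """PCA via SVD of the centered data; returns scores and explained variance ratios."""
    centered = data - data.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    scores = centered @ vt[:n_components].T
    return scores, explained


def loadings(data: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Correlation of each original weight dimension with each component."""
    n_features, n_components = data.shape[1], scores.shape[1]
    result = np.empty((n_features, n_components))
    for i in range(n_features):
        for j in range(n_components):
            result[i, j] = np.corrcoef(data[:, i], scores[:, j])[0, 1]
    return result


# Stand-in data (assumption): 9 systems, 8 dense feature weights each.
rng = np.random.default_rng(0)
raw_weights = rng.normal(size=(9, 8))

normalized = l2_normalize(raw_weights)
scores, explained = pca(normalized, n_components=2)
component_loadings = loadings(normalized, scores)

print(explained)           # values to inspect for a scree-style elbow
print(component_loadings)  # one row per weight, one column per component
```

With so few observations, as noted above, any pattern in such an analysis should be read with caution.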

Conclusion
This paper presented the WMT shared task in optimizing parameters of a given hierarchical phrase-based system (WMT Tuning Task) when translating from English to Czech and vice versa. The underlying system was intentionally restricted to a small data setting and, somewhat unusually, the data for the language model were smaller than for the translation model.
Overall, six teams took part in one or both directions, sticking to the constrained setting, with only METEOR-CMU and our baseline BLEU-MIRA-SPARSE using sparse features.
The submitted configurations were manually evaluated jointly with the systems of the main WMT translation task. Given the small data setting, we did not expect the tuning task systems to perform competitively with other submissions in the WMT translation task.
The results confirm that KBMIRA with the standard (dense) features optimized towards BLEU should be preferred over MERT. Two other systems (DCU and AFRL) performed equally well in English-to-Czech translation. The two systems using sparse features (METEOR-CMU and BLEU-MIRA-SPARSE) performed poorly, but the sample is too small to draw any conclusions from this. Overall, the variance in translation quality obtained using various weight settings is apparent and justifies the efforts put into optimization techniques.
Since the task attracted a good number of submissions and was generally considered interesting and useful by our colleagues, we plan to run the task again for WMT in 2016. Next year's underlying systems will use all data available in the WMT constrained setting, to test the tuning methods in the range where state-of-the-art systems operate.