Results of the WMT13 Metrics Shared Task

This paper presents the results of the WMT13 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT13 Shared Translation Task. We collected scores of 16 metrics from 8 research groups. In addition, we computed scores of 5 standard metrics such as BLEU, WER and PER as baselines. Collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT13 official human scores) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence).


Introduction
Automatic machine translation metrics play a very important role in the development of MT systems and their evaluation. There are many different metrics of diverse nature and one would like to assess their quality. For this reason, the Metrics Shared Task is held annually at the Workshop on Statistical Machine Translation (Callison-Burch et al., 2012). This year, the Metrics Task was run by different organizers, but the only visible change is, hopefully, that the results of the task are presented in a separate paper instead of the main WMT overview paper.
In this task, we asked metrics developers to score the outputs of the WMT13 Shared Translation Task (Bojar et al., 2013). We have collected the computed metrics' scores and use them to evaluate the quality of the metrics.
The systems' outputs, human judgements and evaluated metrics are described in Section 2. The quality of the metrics in terms of system level correlation is reported in Section 3. Segment level correlation is reported in Section 4.

Data
We used the translations of MT systems involved in WMT13 Shared Translation Task together with reference translations as the test set for the Metrics Task. This dataset consists of 135 systems' outputs and 6 reference translations in 10 translation directions (5 into English and 5 out of English). Each system's output and the reference translation contain 3000 sentences. For more details please see the WMT13 main overview paper (Bojar et al., 2013).

Manual MT Quality Judgements
During the WMT13 Translation Task a large scale manual annotation was conducted to compare the systems. We used these collected human judgements for evaluating the automatic metrics.
The participants in the manual annotation were asked to evaluate system outputs by ranking translated sentences relative to each other. For each source segment that was included in the procedure, the annotator was shown the outputs of five systems to which he or she was supposed to assign ranks. Ties were allowed. Only sentences with 30 or fewer words were ranked by humans.
These collected rank labels were then used to assign each system a score that reflects how high that system was usually ranked by the annotators. Please see the WMT13 main overview paper for details on how this score is computed. You can also find inter- and intra-annotator agreement estimates there.

Participants of the Shared Task
Table 1 lists the participants of WMT13 Shared Metrics Task, along with their metrics. We have collected 16 metrics from a total of 8 research groups.

Table 1: Participants of WMT13 Shared Metrics Task and their metrics.
Metrics | Participant
METEOR | Carnegie Mellon University (Denkowski and Lavie, 2011)
LEPOR, NLEPOR | University of Macau (Han et al., 2013)
ACTA, ACTA5+6 | Idiap Research Institute (Hajlaoui, 2013) (Hajlaoui and Popescu-Belis, 2013)
DEPREF-{ALIGN,EXACT} | Dublin City University
SIMPBLEU-{RECALL,PREC} | University of Sheffield (Song et al., 2013)
MEANT, UMEANT | Hong Kong University of Science and Technology (Lo and Wu, 2013)
TERRORCAT | German Research Center for Artificial Intelligence (Fishel, 2013)
LOGREGFSS, LOGREGNORM | DFKI (Avramidis and Popović, 2013)

In addition to that we have computed the following two groups of standard metrics as baselines:
Metrics BLEU (Papineni et al., 2002), TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006) were computed using the Moses scorer which is used in Moses model optimization. To tokenize the sentences we used the standard tokenizer script as available in the Moses Toolkit. In this paper we use the suffix *-MOSES to label these metrics.
Metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were computed using the script mteval-v13a.pl, which is used in the OpenMT Evaluation Campaign and includes its own tokenization. We use the *-MTEVAL suffix to label these metrics. By default, mteval assumes the text is in ASCII, causing poor tokenization around curly quotes.
We therefore ran mteval both in the default setting and with the flag --international-tokenization (marked *-INTL).
We have normalized all metrics' scores such that better translations get higher scores.

System-Level Metric Analysis
We measured the quality of system-level metrics' scores using Spearman's rank correlation coefficient ρ. For each direction of translation we converted the official human scores into ranks. For each metric, we converted the metric's scores of systems in a given direction into ranks. Since there were no ties in the rankings, we used the simplified formula to compute Spearman's ρ:

\[ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} \tag{1} \]

where d_i is the difference between the human rank and the metric's rank for system i and n is the number of systems. The possible values of ρ range between 1 (where all systems are ranked in the same order) and -1 (where the systems are ranked in the reverse order). A good metric produces rankings of systems similar to human rankings. Since we have normalized all metrics such that better translations get higher scores, we consider metrics with values of Spearman's ρ closer to 1 as better.
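As a minimal sketch of this computation, the tie-free form ρ = 1 − 6 Σ d_i² / (n(n² − 1)) can be written in a few lines of Python. The function names and the toy scores below are illustrative assumptions, not part of the task's tooling or data.

```python
def ranks_desc(scores):
    """Rank systems so that the highest score gets rank 1 (assumes no ties)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {system: position for position, system in enumerate(order, start=1)}


def spearman_rho(human_scores, metric_scores):
    """Simplified Spearman's rho; valid only without ties and for n >= 2."""
    human_ranks = ranks_desc(human_scores)
    metric_ranks = ranks_desc(metric_scores)
    n = len(human_ranks)
    sum_d2 = sum((human_ranks[s] - metric_ranks[s]) ** 2 for s in human_ranks)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical scores for four systems (not taken from the WMT13 data).
human = {"sysA": 0.62, "sysB": 0.55, "sysC": 0.48, "sysD": 0.31}
metric = {"sysA": 27.1, "sysB": 28.4, "sysC": 22.0, "sysD": 19.5}
print(spearman_rho(human, metric))
```

In this toy example the metric swaps only the two best systems, so the sum of squared rank differences is 2 and ρ comes out at 0.8.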
We also computed empirical confidences of Spearman's ρ using bootstrap resampling. Since we did not have direct access to participants' metrics (we received only metrics' scores for the complete test sets without the ability to run them on new sampled test sets), we varied the "golden truth" by sampling from human judgments. We have bootstrapped 1000 new sets and used 95 % confidence level to compute confidence intervals.
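The resampling idea can be sketched as follows. Several details here are assumptions made only for illustration: the resampling unit (individual pairwise judgements), the stand-in human score (fraction of comparisons won, not the official WMT13 score), the percentile interval, and the use of tie-aware ranks, since a resampled human ranking may contain ties.

```python
import random


def average_ranks(values):
    """Ranks (1 = largest value); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; 0.0 is an arbitrary choice
    return cov / (var_x * var_y) ** 0.5


def win_ratio_scores(judgements, systems):
    """Stand-in human score (an assumption): fraction of comparisons won."""
    wins = {s: 0 for s in systems}
    total = {s: 0 for s in systems}
    for winner, loser in judgements:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return [wins[s] / total[s] if total[s] else 0.0 for s in systems]


def bootstrap_interval(judgements, metric_scores, samples=1000, level=0.95, seed=0):
    """Percentile confidence interval for rho, varying only the human side."""
    rng = random.Random(seed)
    systems = sorted(metric_scores)
    metric_ranks = average_ranks([metric_scores[s] for s in systems])
    rhos = []
    for _ in range(samples):
        resampled = rng.choices(judgements, k=len(judgements))
        human_ranks = average_ranks(win_ratio_scores(resampled, systems))
        rhos.append(pearson(human_ranks, metric_ranks))  # Spearman via ranks
    rhos.sort()
    lower = rhos[round((1 - level) / 2 * samples)]
    upper = rhos[round((1 + level) / 2 * samples) - 1]
    return lower, upper


# Hypothetical (winner, loser) judgements and metric scores for three systems.
judgements = ([("sysA", "sysB")] * 20 + [("sysB", "sysC")] * 20
              + [("sysA", "sysC")] * 20 + [("sysB", "sysA")] * 5
              + [("sysC", "sysB")] * 5)
metric_scores = {"sysA": 27.1, "sysB": 25.3, "sysC": 22.0}
print(bootstrap_interval(judgements, metric_scores))
```

The metric's scores stay fixed across all resamples, which mirrors the constraint that only the scores for the complete test sets were available.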
Spearman's ρ correlation coefficient is sometimes too harsh: if a metric disagrees with humans in ranking two systems of very similar quality, the ρ coefficient penalizes this as much as if the systems were very distant in quality. Aware of how uncertain the golden ranks are in general, we do not find the method very fair. We thus also computed the following three correlation coefficients besides Spearman's ρ:
• Pearson's correlation coefficient. This coefficient measures the strength of the linear relationship between a metric's scores and human scores. In fact, Spearman's ρ is Pearson's correlation coefficient applied to ranks.
• Correlation with systems' clusters. In the Translation Task (Bojar et al., 2013), the manual scores are also presented as clusters of systems that can no longer be significantly distinguished from one another given the available judgements. (Please see the WMT13 Overview paper for more details).
We take this cluster information as a "rank with ties" for each system and calculate its Pearson's correlation coefficient with each metric's scores.
• Correlation with systems' fuzzy ranks. For a given system the fuzzy rank is computed as an average of ranks of all systems which are not significantly better or worse than the given system. The Pearson's correlation coefficient of a metric's scores and systems' fuzzy ranks is then computed.
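The fuzzy-rank variant described in the last item can be sketched as below. The representation of "not significantly better or worse" as a set of unordered system pairs, the inclusion of the system itself in the average, and the negation of ranks (so that agreement gives a positive coefficient when rank 1 is best) are all assumptions of this sketch; the data are invented.

```python
def fuzzy_ranks(ranks, indistinguishable):
    """Average the ranks of all systems not significantly different from each one.

    ranks: {system: rank}; indistinguishable: set of frozenset({s, t}) pairs.
    """
    result = {}
    for system in ranks:
        group = [other for other in ranks
                 if other == system or frozenset((system, other)) in indistinguishable]
        result[system] = sum(ranks[other] for other in group) / len(group)
    return result


def pearson(xs, ys):
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical human ranks (1 = best) and significance information.
ranks = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
indistinguishable = {frozenset(("sysA", "sysB")), frozenset(("sysB", "sysC"))}
metric_scores = {"sysA": 0.31, "sysB": 0.29, "sysC": 0.30, "sysD": 0.20}

fuzzy = fuzzy_ranks(ranks, indistinguishable)
systems = sorted(ranks)
# Ranks are negated so that "higher is better" holds on both sides.
correlation = pearson([metric_scores[s] for s in systems],
                      [-fuzzy[s] for s in systems])
print(fuzzy, correlation)
```

Here sysB is indistinguishable from both neighbours, so its fuzzy rank is the mean of ranks 1, 2 and 3, and the metric's swap of sysB and sysC costs less than it would under plain ranks.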
You can find the system-level correlations for translations into English in Table 2 and for translations out of English in Table 3. Each row in the tables contains correlations of a metric in each of the examined translation directions. The metrics are sorted by average Spearman's ρ correlation across translation directions. The best results in each direction are in bold.
As in previous years, many metrics outperformed BLEU in system-level correlation. The metric which has on average the strongest correlation in directions into English is METEOR. For the out-of-English directions, SIMPBLEU-RECALL has the highest system-level correlation. TERRORCAT achieved an even higher average correlation but it did not participate in all language pairs. The implementation of BLEU in mteval is slightly better than the one in the Moses scorer (BLEU-MOSES). This confirms the known truth that tokenization and other minor implementation details can considerably influence a metric's performance.

Segment-Level Metric Analysis
We measured the quality of metrics' segment-level scores using Kendall's τ rank correlation coefficient. For this we did not use the official WMT13 human scores but worked with raw human judgements: for each translation direction we extracted all pairwise comparisons where one system's translation of a particular segment was judged to be (strictly) better than the other system's translation. Formally, this is a list of pairs (a, b) where a segment translation a was ranked better than translation b:

\[ \mathrm{Pairs} := \{(a, b) \mid r(a) < r(b)\} \tag{2} \]

where r(·) is the human rank (a lower rank is better). For a given metric m(·), we then counted all concordant pairwise comparisons and all discordant pairwise comparisons. A concordant pair is a pair of two translations of the same segment in which the comparison of human ranks agrees with the comparison of the metric's scores. A discordant pair is a pair in which the comparison of human ranks disagrees with the metric's comparison. Note that we totally ignore pairs where human ranks or the metric's scores are tied. Formally:

\[ \mathrm{Con} := \{(a, b) \in \mathrm{Pairs} \mid m(a) > m(b)\} \tag{3} \]
\[ \mathrm{Dis} := \{(a, b) \in \mathrm{Pairs} \mid m(a) < m(b)\} \tag{4} \]

Finally, Kendall's τ is computed using the following formula:

\[ \tau = \frac{|\mathrm{Con}| - |\mathrm{Dis}|}{|\mathrm{Con}| + |\mathrm{Dis}|} \tag{5} \]

The possible values of τ range between -1 (a metric always predicted a different order than humans did) and 1 (a metric always predicted the same order as humans). Metrics with higher τ are better.
The final Kendall's τ values are shown in Table 4 for directions into English and in Table 5 for directions out of English. Each row in the tables contains correlations of a metric in given directions. The metrics are sorted by average correlation across the translation directions. Metrics which did not compute scores for systems in all directions are at the bottom of the tables.
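The procedure above can be sketched in Python for a single ranked segment. The data layout (one dictionary of human ranks per annotated segment, with lower ranks being better) and all names and values are assumptions made for illustration.

```python
from itertools import combinations


def extract_pairs(ranking):
    """Turn one human ranking {translation: rank} into (better, worse) pairs.

    Lower rank means better; human ties produce no pair.
    """
    pairs = []
    for a, b in combinations(ranking, 2):
        if ranking[a] < ranking[b]:
            pairs.append((a, b))
        elif ranking[b] < ranking[a]:
            pairs.append((b, a))
    return pairs


def kendall_tau_like(pairs, metric_scores):
    """(|Con| - |Dis|) / (|Con| + |Dis|); pairs with tied metric scores are ignored."""
    concordant = discordant = 0
    for better, worse in pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        elif metric_scores[better] < metric_scores[worse]:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("no comparable pairs")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ranking of four outputs for one segment and their metric scores.
ranking = {"out1": 1, "out2": 2, "out3": 2, "out4": 4}
metric_scores = {"out1": 0.7, "out2": 0.5, "out3": 0.8, "out4": 0.5}

pairs = extract_pairs(ranking)
tau = kendall_tau_like(pairs, metric_scores)
print(len(pairs), tau)
```

In this example the human tie between out2 and out3 yields no pair, the metric tie between out2 and out4 is skipped, and the remaining three concordant and one discordant pairs give τ = 0.5. In practice the pairs from all segments of a translation direction would be pooled before computing τ.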
You can see that in both categories, into and out of English, the most strongly correlated segment-level metric is SIMPBLEU-RECALL.

Details on Kendall's τ
Table 5: Segment-level Kendall's τ correlations of automatic evaluation metrics and the official WMT human judgements when translating out of English.

The computation of Kendall's τ has slightly changed this year. In the WMT12 Metrics Task (Callison-Burch et al., 2012), the concordant pairs were defined exactly as we do (Equation 3) but the discordant pairs were defined differently: pairs in which one system was ranked better by the human annotator but in which the metric predicted a tie were also considered discordant:

\[ \mathrm{Dis}_{\mathrm{WMT12}} := \{(a, b) \mid r(a) < r(b),\ m(a) \le m(b)\} \]

We feel that for two translations a and b of a segment, where a is ranked better by humans, a metric which produces equal scores for both translations should not be penalized as much as a metric which strongly disagrees with humans. The method we used this year does not harm metrics which often estimate two segments as equally good.
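The effect of the two definitions of discordant pairs can be illustrated with a small sketch. The flag name and the toy pairs are assumptions; each pair lists the human-preferred translation first.

```python
def tau_variant(pairs, metric_scores, ties_are_discordant):
    """Kendall's tau-like score with metric ties either penalized or ignored."""
    concordant = sum(1 for a, b in pairs if metric_scores[a] > metric_scores[b])
    discordant = sum(1 for a, b in pairs if metric_scores[a] < metric_scores[b])
    ties = len(pairs) - concordant - discordant
    if ties_are_discordant:  # WMT12-style: a metric tie counts against the metric
        discordant += ties
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical data: one agreement, one disagreement and two metric ties.
pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
metric_scores = {"a1": 0.9, "b1": 0.4, "a2": 0.6, "b2": 0.6,
                 "a3": 0.5, "b3": 0.5, "a4": 0.3, "b4": 0.7}

this_year = tau_variant(pairs, metric_scores, ties_are_discordant=False)
wmt12_style = tau_variant(pairs, metric_scores, ties_are_discordant=True)
print(this_year, wmt12_style)
```

With ties ignored the score is 0.0, while counting the two ties as discordant lowers it to -0.5, which shows how the earlier definition penalizes a metric that frequently assigns equal scores.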

Conclusion
We carried out the WMT13 Metrics Shared Task, in which we assessed the quality of various automatic machine translation metrics. We used the human judgements as collected for the WMT13 Translation Task to compute system-level and segment-level correlations with human scores. While most of the metrics correlate very well at the system level, the segment-level correlations are still rather poor. It was shown again this year that many metrics outperform BLEU; hopefully one of them will attract wider use at last.