Results of the WMT15 Metrics Shared Task

This paper presents the results of the WMT15 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11 research groups. In addition to that, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT15 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence).


Introduction
Automatic machine translation metrics play a very important role in the development of MT systems and their evaluation. There are many different metrics of diverse nature and one would like to assess their quality. For this reason, the Metrics Shared Task is held annually at the Workshop on Statistical Machine Translation (http://www.statmt.org/wmt15), starting with Koehn and Monz (2006) and following up to Macháček and Bojar (2014).
The systems' outputs, human judgements and evaluated metrics are described in Section 2. The quality of the metrics in terms of system level correlation is reported in Section 3. Section 4 is devoted to segment level correlation.

Data
We used the translations of MT systems involved in the WMT15 Shared Translation Task (Bojar et al., 2015) together with reference translations as the test set for the Metrics Task. This dataset consists of 87 systems' outputs and 10 reference translations in 10 translation directions (English from and into Czech, Finnish, French, German and Russian). The number of sentences in system and reference translations varies among language pairs, ranging from 1370 for Finnish-English to 2818 for Russian-English. For more details, please see the WMT15 overview paper.

Manual MT Quality Judgements
During the WMT15 Translation Task, a large scale manual annotation was conducted to compare the translation quality of participating systems. We used these collected human judgements for the evaluation of the automatic metrics.
The participants in the manual annotation were asked to evaluate system outputs by ranking translated sentences relative to each other. For each source segment that was included in the procedure, the annotator was shown five different outputs to which he or she was supposed to assign ranks. Ties were allowed.
These collected rank labels for each five-tuple of outputs were then interpreted as pairwise comparisons of systems and used to assign each system a score that reflects how high that system was usually ranked by the annotators. Several methods have been tested in the past for the exact score calculation and WMT15 has adopted TrueSkill as the official one. Please see the WMT15 overview paper for details on how this score is computed.
For the metrics task in 2014, we were still using the "Pre-TrueSkill" method called "> Others", see Bojar et al. (2011). Since we are now moving to the golden truth calculated by TrueSkill, we also report the average "Pre-TrueSkill" score in the relevant tables for comparison.

Participants of the Metrics Shared Task

Table 1 lists the participants of the WMT15 Shared Metrics Task, along with their metrics. We have collected 46 metrics from a total of 11 research groups.

Here we give a short description of each metric that performed the best on at least one language pair.

BEER and BEER TREEPEL
BEER is a trained metric, a linear model that combines features capturing character n-grams and permutation trees. BEER participated last year in sentence-level evaluation. The main additions this year are corpus-level aggregation of sentence-level scores and a syntactic version called BEER TREEPEL. BEER TREEPEL includes features checking the match of each type of arc in the dependency trees of the hypothesis and the reference.
BEER was the best for en-de and en-ru at the system level and en-fi and en-ru at the sentence level. BEER TREEPEL was the best for system-level evaluation of ru-en.

BS
The metric BS has no corresponding paper, so we include a summary by Mark Fishel here: the BS metric was an attempt to move in a different direction than most state-of-the-art metrics and to reduce complexity and language resource dependence to the minimum. The score is obtained from the number and lengths of "bad segments": continuous subsequences of words that are present only in the hypothesis or the reference, but not in both. To account for morphologically complex languages and to smooth the score for sparse word forms, a poor man's lemmatization is added: the floor of one third of each word's characters is removed from the word's end. The final score is either the log-sum of the bad segment lengths (BS) or a simple sum (TOTAL-BS).
BS and DPMF were the best for system-level English-French evaluation.
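The description above can be turned into a short sketch. This is our own reading, not the authors' implementation: we recover "bad segments" from a difflib alignment of the lemmatized word sequences, and we use log(1 + n) as one plausible interpretation of the "log-sum" of segment lengths; all function names are ours.

```python
from difflib import SequenceMatcher
from math import log

def poor_mans_lemma(word):
    """Drop the floor of one third of the characters from the word's end."""
    return word[: len(word) - len(word) // 3]

def bad_segments(hyp_words, ref_words):
    """Lengths of maximal word spans present in only one of the two sides."""
    h = [poor_mans_lemma(w) for w in hyp_words]
    r = [poor_mans_lemma(w) for w in ref_words]
    lengths = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=h, b=r, autojunk=False).get_opcodes():
        if tag != 'equal':  # words unmatched on either side form one bad segment
            lengths.append((i2 - i1) + (j2 - j1))
    return lengths

def bs_score(hyp_words, ref_words):
    """Log-sum variant (BS); a plain sum would give TOTAL-BS. Lower is better."""
    return sum(log(1 + n) for n in bad_segments(hyp_words, ref_words))
```

An identical hypothesis and reference yield no bad segments and a score of zero; every mismatched span increases the score.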

CHRF3
CHRF3 calculates a simple F-score combination of the precision and recall of character n-grams of length 6. The F-score is calculated with β = 3, giving triple the weight to recall. CHRF3 was the best for en-fi and en-cs at the system level and en-cs at the sentence level.
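Since CHRF3 needs no external resources, it fits in a few lines. The sketch below is our own simplification: we average the F-score over character n-gram orders 1 through 6 (the exact aggregation in the released implementation may differ), ignore spaces, and weight recall with β = 3 as described.

```python
from collections import Counter

def chrf(hyp, ref, beta=3.0, max_n=6):
    """Character n-gram F-score; recall gets beta^2 times the weight of precision."""
    hyp = hyp.replace(' ', '')
    ref = ref.replace(' ', '')
    scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # sentence shorter than n characters
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        prec = overlap / sum(hyp_ngrams.values())
        rec = overlap / sum(ref_ngrams.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0
```

With β = 3, a hypothesis missing reference material is penalized far more heavily than one containing extra material.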

DPMF and DPMFCOMB
DPMF is a syntax-based metric, but unlike many syntax-based metrics, it does not compute its score on substructures of the tree returned by a syntactic parser. Instead, DPMF parses the reference translation with a standard parser and trains a new parser on the tree of the reference translation. This new parser is then used for scoring the hypothesis. Additionally, DPMF uses the F-score of unigrams in combination with the syntactic score.
DPMFCOMB is a combination of DPMF with several other metrics available in the evaluation tool Asiya.
DPMF and BS were the best for system-level evaluation of English-French. DPMF also tied for the best place with UOW-LSTM for French-English. DPMFCOMB was the best for fi-en, de-en and cs-en at the sentence level.

DREEM
DREEM uses distributed word and sentence representations of three different kinds: one-hot representation, a distributed representation learned with a neural network and a distributed sentence representation learned with a recursive autoencoder. The final score is the cosine similarity of the representation of the hypothesis and the reference, multiplied with a length penalty.
DREEM was the best for fi-en system-level evaluation.
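The final scoring step of DREEM can be sketched as follows. The cosine similarity is as described; the multiplicative length penalty is only characterized as "a length penalty" in the description, so the exponential form and the alpha parameter below are our assumptions, as are all function names.

```python
from math import sqrt, exp

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def dreem_style_score(hyp_vec, ref_vec, hyp_len, ref_len, alpha=0.1):
    """Cosine of the sentence representations times a hypothetical length penalty."""
    penalty = exp(-alpha * abs(hyp_len - ref_len) / ref_len)
    return cosine(hyp_vec, ref_vec) * penalty
```

Identical representations of equal-length sentences score 1.0; any length mismatch scales the similarity down.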
LEBLEU-OPTIMIZED

LEBLEU is a relaxation of the strict word n-gram matching used in standard BLEU. Unlike other similar relaxations, LEBLEU uses fuzzy matching of longer chunks of text, which allows, for example, matching two independent words with a compound. LEBLEU-OPTIMIZED applies a fuzzy match threshold and n-gram length optimized for each language pair. LEBLEU-OPTIMIZED was the best for en-de at the sentence level.

RATATOUILLE
RATATOUILLE is a metric combination of BLEU, BEER, Meteor and a few more metrics, out of which METEOR-WSD is a novel contribution. METEOR-WSD is an extension of Meteor that includes synonym mappings to languages other than English based on alignments and rewards semantically adequate translations in context. RATATOUILLE was the best for sentence-level evaluation of French in both directions.

UOW-LSTM
UOW-LSTM uses a dependency-tree recursive neural network to represent both the hypothesis and the reference with a dense vector. The final score is obtained from a neural network trained on judgements from previous years converted to similarity scores, taking into account both the distance and the angle of the two representations.
UOW-LSTM tied for the best place in fr-en system-level evaluation with DPMF.

UPF-COBALT
UPF-COBALT pays increased attention to syntactic context (for example arguments, complements, modifiers etc.), both in aligning the words of the hypothesis and reference and in scoring the matched words. It relies on additional resources including stemmers, WordNet synsets, paraphrase databases and distributed word representations. The UPF-COBALT system-level score was calculated as the ratio of sentences in which each system from a set of competitors was assigned the highest sentence-level score. UPF-COBALT was the best in system-level evaluation for de-en and, together with VERTA-70ADEQ30FLU, for cs-en.

VERTA-70ADEQ30FLU
VERTA-70ADEQ30FLU aims at a combination of adequacy and fluency features that use many sources of different linguistic information: synonyms, lemmas, PoS tags, dependency parses and language models. In previous work, VERTA's combination of linguistic features was set depending on whether adequacy or fluency was evaluated. VERTA-70ADEQ30FLU is a weighted combination of the VERTA setups for adequacy (0.70) and fluency (0.30).
VERTA-70ADEQ30FLU was, together with UPF-COBALT, the best on cs-en on system level.

Baseline Metrics
In addition to the submitted metrics, we have computed the following two groups of standard metrics as baselines for the system level:

• Mteval. The metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were computed using the script mteval-v13a.pl, which is used in the OpenMT Evaluation Campaign and includes its own tokenization. We ran mteval with the flag --international-tokenization since it performs slightly better.

• Moses Scorer. The metrics TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006) were computed using the Moses scorer, which is used in Moses model optimization. To tokenize the sentences, we used the standard tokenizer script as available in the Moses toolkit.

For the segment-level baseline, we have used the following modified version of BLEU:

• SentBLEU. The metric SentBLEU is computed using the script sentence-bleu, part of the Moses toolkit. It is a smoothed version of BLEU that correlates better with human judgements at the segment level.
We have normalized all metrics' scores such that better translations get higher scores.
For computing the scores, we used the same script as in last year's metrics task.

System-Level Results
As last year, we used the Pearson correlation coefficient as the main measure of system-level metric correlation. We use the following formula to compute Pearson's r for each metric and translation direction:

r = \frac{\sum_{i=1}^{n} (H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (H_i - \bar{H})^2} \sqrt{\sum_{i=1}^{n} (M_i - \bar{M})^2}}   (1)

where H is the vector of human scores of all systems translating in the given direction, M is the vector of the corresponding scores as predicted by the given metric, and \bar{H} and \bar{M} are their respective means.
Since we have normalized all metrics such that better translations get higher score, we consider metrics with values of Pearson's r closer to 1 as better.
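The Pearson computation in Equation 1 can be sketched directly in Python (a minimal illustration, not the official evaluation script; the function name is ours):

```python
def pearson_r(human, metric):
    """Pearson correlation between human scores and metric scores."""
    n = len(human)
    mean_h = sum(human) / n
    mean_m = sum(metric) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, metric))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in metric)
    return cov / (var_h ** 0.5 * var_m ** 0.5)
```

A metric whose scores are a positive linear function of the human scores gets r = 1 regardless of scale, which is why the normalization direction (higher is better) matters but the absolute score range does not.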
You can find the system-level correlations for translations into English in Table 2 and for translations out of English in Table 3. Each row in the tables contains correlations of a metric in each of the examined translation directions. The upper part of each table lists metrics that participated in all language pairs and it is sorted by average Pearson correlation coefficient across translation directions. The lower part contains metrics limited to a subset of the language pairs, so the average correlation cannot be directly compared with other metrics any more. The best results in each direction are in bold. The reported empirical confidence intervals of system level correlations were obtained through bootstrap resampling of 1000 samples (confidence level of 95%).
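The reported confidence intervals can be reproduced with a generic percentile bootstrap. The sketch below is our own, not the task's script: it resamples the paired scores with replacement and takes the empirical 2.5th and 97.5th percentiles of the resampled statistic.

```python
import random

def bootstrap_ci(human, metric, score_fn, n_samples=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a paired-score statistic."""
    n = len(human)
    stats = []
    for _ in range(n_samples):
        idx = [random.randrange(n) for _ in range(n)]  # resample with replacement
        h = [human[i] for i in idx]
        m = [metric[i] for i in idx]
        try:
            stats.append(score_fn(h, m))
        except ZeroDivisionError:  # degenerate resample with zero variance
            continue
    stats.sort()
    lo = stats[int(len(stats) * alpha / 2)]
    hi = stats[int(len(stats) * (1 - alpha / 2)) - 1]
    return lo, hi
```

Any statistic can be plugged in as score_fn, e.g. a Pearson correlation function, yielding the (lo, hi) interval at the requested confidence level.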
The move to TrueSkill golden truth slightly increased the correlations and changed the ranking of the metrics a little, but the general patterns hold. (The correlation between "Average" and "Pre-TrueSkill Average" is .999 for both directions.) Both tables also include the average Spearman's rank correlation, which used to be the evaluation measure in the past. Spearman's rank correlation considers only the ranking of the systems and not the distances between them. It is thus more susceptible to instability if several systems have similar scores.

System-Level Discussion
As in the previous years, many metrics outperform BLEU both into as well as out of English. Note that the original BLEU was designed to work with 4 references and WMT provides just one; see  for details on BLEU correlation with varying number of references, up to several thousands. This year, BLEU with one reference reaches the average correlation of .92 into English or .78 out of English. The best performing metrics get up to .98 into English and .92 out of English. CDER is the best of the baselines, reaching .94 into English and .81 out of English.
The winning metric for each language pair is different, with interesting outliers: DREEM performed best when evaluating English translations from Finnish but on average, 12 other metrics into English performed better and DREEM appears to be among the worst metrics out of English. RATATOUILLE is fifth to tenth when evaluated by average Pearson but wins in both directions in average Spearman's rank correlation.
Two metrics confirm the effectiveness of character-level measures, especially the winners for evaluation out of English: CHRF3 and BEER. The metric CHRF3 is particularly interesting because it does not require any resources whatsoever. It is defined as a simple F-measure of character-level 6-grams (spaces are ignored), with recall weighted 3 times more than precision. The balance between precision and recall seems important depending on the morphological richness of the target language: for evaluations into English, CHRF (equal weights) performs better than CHRF3.
As we already observed in the past, the winning metrics are trained on previous years of WMT. This holds for DPMFCOMB, UOW-LSTM and BEER, including BEER TREEPEL. DPMF and UPF-COBALT are not combined or trained metrics of any kind: DPMF is based on dependency analysis of the candidate and reference sentences, and UPF-COBALT uses contextual information of compared words in the candidate and the reference.
We see an interesting difference in the performance of UOW-LSTM. It is the second metric in system-level correlation but falls among the worst ones in segment-level correlation, see Table 4 below. Gupta et al. (2015b) suggest that the discrepancy in performance could be caused by low inter-annotator agreement and by Kendall's τ not reflecting the distances in translation quality between candidates, an issue similar to what we see with Pearson vs. Spearman's rank correlations.
Another dense-representation metric, DREEM, seems to suffer a similar discrepancy when evaluating into English. Out of English, DREEM did not perform very well.
An untested speculation is that the dense sentence-level representation present in some form in both UOW-LSTM as well as in DREEM confuses the metrics in their judgements of individual sentences.

Comparison with BLEU
In Appendix A, we provide two correlation plots for each language pair. The first plot visualizes the correlation of BLEU and manual judgements, the second plot shows the correlation for the best performing metric for that pair.
The BLEU plots include grey ellipses to indicate the confidence intervals of both BLEU and the manual judgements. The ellipses are tilted only to indicate that BLEU and the manual score are dependent variables. Only the width and height of each ellipse represent a value, namely the confidence interval in each direction. The same vertical confidence intervals hold for the plots in the right-hand column, but since we do not have any confidence estimates for the individual metrics, we omit them.
Czech-English plots indicate that UPF-COBALT was able to account for the very different behaviour of the transfer-based deep-syntactic system CU-TECTO. It was also able to appreciate the higher translation quality of MONTREAL, UEDIN-* and ONLINE-B. The big cluster of systems labelled TT-* are submissions to the WMT15 Tuning Task.
For English-Czech, we see that UEDIN-JHU and MONTREAL are overfit towards BLEU. In terms of BLEU, they are very close to the winning system CU-CHIMERA (a combination of CU-TECTO and phrase-based Moses, followed by automatic post-editing). CHRF3 is able to recognize the overfitting for MONTREAL, a neural-network based system, but not for UEDIN-JHU. CHRF3 also better recognizes the distance in quality between the larger systems (from COMMERCIAL1 above) and the small-data tuning task systems.
For German-English, we see the same overfit of UEDIN-JHU towards BLEU. While neither UPF-COBALT nor CHRF3 could recognize this for translations involving Czech, the issue is spotted by UPF-COBALT for systems involving German. Syntax-based systems like UEDIN-SYNTAX for English-German and (presumably) ONLINE-B for German-English are among those where the correlation improved most over BLEU.
The French dataset was in a different domain, which may explain why the best performing metric, DPMF, does not actually improve much over BLEU. DPMF uses a syntactic parser on the reference, and the performance of parsers on discussions is likely to be lower than on the generally used news domain.
In the Finnish results, we again see UEDIN-JHU and ABUMATRAN (Rubino et al., 2015) overvalued by BLEU. DREEM, based on distributed representations of words and sentences, is able to recognize this for translation into English, but it falls among the worst metrics in the other direction. For translation into Finnish, the character-based n-grams of CHRF3 are much more reliable. Variants of ABUMATRAN were again those most overvalued by BLEU. ABUMATRAN uses several types of morphological segmentation and reconstructs Finnish words from the segments by concatenation. ABUMATRAN is loaded with many other features, like web-crawled data and domain handling, and system combination of several approaches. The optimization towards BLEU (unreliable for Finnish, as we have learned in this task) could be among the main reasons behind the comparably lower manual scores.
For Russian, BEER is the best metric, in its syntax-aware variant BEER TREEPEL when evaluating translations into English. Compared to BLEU, the improvement in correlation is not that striking for Russian-English. (It would be interesting to know whether ONLINE-G is better than ONLINE-B because of English syntax or because it addresses source-side morphology better. BEER TREEPEL captures both aspects.) In the other direction, targeting Russian, BLEU was effectively unable to rank the systems at all. It is probably the character-level features in BEER that allow it to reach a very good correlation, .97.

Table 3: System-level correlations of automatic evaluation metrics and the official WMT human scores when translating out of English. The symbol "≀" indicates where the average is out of sequence compared to the main Pearson average.

Segment-Level Results
We measure the quality of metrics' segment-level scores using Kendall's τ rank correlation coefficient. In this type of evaluation, a metric is expected to predict the result of the manual pairwise comparison of two systems. Note that the golden truth is obtained from a compact annotation of five systems at once, while an experiment with text-to-speech evaluation techniques by Vazquez-Alvarez and Huckvale (2002) suggests that a genuine pairwise comparison is likely to lead to more stable results. The basic formula for Kendall's τ is:

τ = \frac{|Concordant| - |Discordant|}{|Concordant| + |Discordant|}

where Concordant is the set of all human comparisons for which a given metric suggests the same order and Discordant is the set of all human comparisons for which a given metric disagrees. The formula is not specific with respect to ties, i.e. cases where the annotation says that the two outputs are equally good.
The way in which ties (both in human and metric judgment) were incorporated in computing Kendall τ changed each year of WMT metric tasks. Here we adopt the version from WMT14. For a detailed discussion on other options, see Macháček and Bojar (2014).
The method is formally described using the following matrix C_{h,m}, where h ∈ {<, =, >} is the human comparison and m ∈ {<, =, >} the metric comparison:

              metric <   metric =   metric >
  human <        1          0         -1
  human =        X          X          X
  human >       -1          0          1

Given such a matrix C_{h,m} and a metric, we compute the Kendall's τ for the metric the following way: we insert each extracted human pairwise comparison into exactly one of the nine sets S_{h,m} according to the human and metric ranks. For example, the set S_{<,>} contains all comparisons where the left-hand system was ranked better than the right-hand system by humans and the other way round by the metric in question.

To compute the numerator of Kendall's τ, we take the coefficients from the matrix C_{h,m}, use them to multiply the sizes of the corresponding sets S_{h,m} and then sum them up, not including the sets for which the value of C_{h,m} is X. To compute the denominator, we simply sum the sizes of all the sets S_{h,m} except those where C_{h,m} = X. Formally:

τ = \frac{\sum_{C_{h,m} \neq X} C_{h,m} |S_{h,m}|}{\sum_{C_{h,m} \neq X} |S_{h,m}|}

To summarize, the WMT14 matrix specifies to:
• exclude all human ties,
• count the metric's ties only in the denominator of Kendall's τ (thus giving no credit for giving a tie),
• count all cases of disagreement between human and metric judgements as Discordant,
• count all cases of agreement between human and metric judgements as Concordant.
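The tie-handling rules above can be condensed into a few lines. The encoding of each pairwise comparison as a ('<' | '=' | '>') pair and the use of None for the excluded X cells are our own representation choices:

```python
# Coefficient matrix C[h][m]: human ties are excluded entirely (X -> None),
# agreements count +1, disagreements -1, metric ties 0 (denominator only).
X = None
C = {
    '<': {'<': 1,  '=': 0, '>': -1},
    '=': {'<': X,  '=': X, '>': X},
    '>': {'<': -1, '=': 0, '>': 1},
}

def wmt14_kendall_tau(comparisons):
    """comparisons: iterable of (human, metric) pairs, each in {'<', '=', '>'}."""
    num = den = 0
    for h, m in comparisons:
        c = C[h][m]
        if c is X:  # human tie: discarded from numerator and denominator
            continue
        num += c    # +1 concordant, -1 discordant, 0 metric tie
        den += 1    # metric ties still count here, so ties earn no credit
    return num / den
```

A metric that ties on every pair thus scores 0, and one that always agrees with the (non-tied) human comparisons scores 1.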
You can find the segment-level correlations for translations into English in Table 4 and for translations out of English in Table 5. Again, the upper part of each table contains metrics participating in all language pairs, sorted by average τ across translation directions. The lower part contains metrics limited to a subset of the language pairs, so the average cannot be directly compared with other metrics any more.

Segment-Level Discussion
As usual, segment-level correlations are significantly lower than system-level ones. The highest correlation is reached by DPMFCOMB on Czech-to-English: .495 of Kendall's τ. The correlations reach on average .447 into English and .400 out of English.
DPMFCOMB is the clear winner into English, followed by BEER TREEPEL, both of which consider syntactic structure of the sentence, combined with several other independent features or metrics.
RATATOUILLE, also a combined metric, is the best option for evaluation to and from French.
Metrics considering character-level n-grams (BEER and CHRF3) are particularly good for evaluation out of English.

Only two segment-level metrics took part in both 2014 and 2015: BEER, in a slightly improved implementation (with some small effect on the scores), and SENTBLEU, in exactly the same implementation. Table 6 documents that this year, the scores are on average slightly higher. The main reason probably lies in the test set, which may be somewhat easier this year. French is different: the correlations decreased somewhat this year, which can be easily explained by the domain change, news in 2014 and discussions in 2015. The increase should not be caused by the redundancy cleanup of WMT manual rankings, since the collapsed systems get a tie after expanding and our implementation ignores all tied manual comparisons.

Conclusion
In this paper, we summarized the results of the WMT15 Metrics Shared Task, which assesses the quality of various automatic machine translation metrics. As in previous years, human judgements collected in WMT15 serve as the golden truth and we check how well the metrics predict the judgements at the level of individual sentences as well as at the level of the whole test set (system-level).
Across the two types of evaluation and the 10 language pairs, we saw great performance of trained and combined metrics (DPMFCOMB, BEER, RATATOUILLE and others). Neural networks for continuous word and sentence representations have also shown their generalization power, with an interesting discrepancy in system- vs. segment-level performance of UOW-LSTM and, to a smaller degree, of DREEM.
We value the metric CHRF (or CHRF3) highly for its extreme simplicity and very good performance at both the system and segment level, especially out of English. We are curious to see if CHRF3 has the potential of becoming "the BLEU for the next five years". It would be very interesting to test its usability in system tuning. It is known that in tuning, metrics that put too much weight on recall can be easily tricked, but perhaps a careful setting of CHRF's β will be sufficient.
The WMT Metrics Task again attracted a good number of participants and the majority of submitted metrics are actually new ones. This is good news, indicating that MT metrics are an active field of research. Most, if not all, metrics come with source code, so it should be relatively easy to use them in one's own experiments. Still, we would expect much wider adoption of the metrics if they made it, for example, into the standard Moses scorer or at least into the Asiya toolkit.

A System-Level Correlation Plots
The following figures plot the system-level results of BLEU (left-hand plots) and the best performing metric for the given language pair (right-hand plots) against manual score. See the discussion in Section 3.2.