Results of the WMT16 Metrics Shared Task

This paper presents the results of the WMT16 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT16 Shared Translation Task. We collected scores of 16 metrics from 9 research groups. In addition, we computed scores of 9 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT16 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence). This year there are several additions to the setup: a large number of language pairs (18 in total), datasets from different domains (news, IT and medical), and different kinds of judgments: relative ranking (RR), direct assessment (DA) and HUME manual semantic judgments. Finally, the generation of a large number of hybrid systems was trialed to provide more conclusive system-level metric rankings.


Introduction
Automatic evaluation of machine translation quality is essential in the development and selection of machine translation systems. Many different automatic MT quality metrics are available and the Metrics Shared Task (http://www.statmt.org/wmt16/metrics-task/) is held annually at WMT to assess their quality, a series starting with Koehn and Monz (2006). Metrics participating in the metrics task rely on the existence of reference translations with which MT outputs are compared, and the metrics task itself then needs manual judgments of translation quality in order to check the extent to which the automatic metrics can approximate those judgments. A related WMT task on quality estimation assesses the performance of methods where no reference translations are needed, requiring only the manual quality judgments (Bojar et al., 2016b).
This year, we keep the two main types of metric evaluation: system-level, where a metric is expected to provide a quality score for the whole translated document, and segment-level, where the score is needed for every individual sentence.
We experiment with several novelties. Specifically, test sets this year come from three domains: news, IT and medical/health-related texts.
The added domains bring in an extended set of languages. In sum, the metrics task this year includes 18 language pairs, English paired with Basque, Bulgarian, Czech, Dutch, Finnish, German, Polish, Portuguese, Romanian, Russian, Spanish, and Turkish, in one or both directions.
On the evaluation side, we rely on three golden truths of manual judgment:

• Relative Ranking (RR) of up to 5 different translation candidates at a time, as collected in WMT in the past,

• Direct Assessment (DA) evaluating the adequacy of a translation candidate on an absolute scale in isolation from other translations,

• HUME, a composite segment-level score aggregated over manual judgments of translation quality of semantic units of the source sentence.
Additional changes to the task evaluation include a change in the way we compute confidence intervals for metric correlations with human assessment, resulting in more reliable conclusions as to which metrics outperform others. The official method of evaluation remains unchanged, relying on RR in both the system-level (TrueSkill) and segment-level (Kendall's τ) metrics; see below for details and references.

Table 1: Overview of "tracks" of the WMT16 metrics task. "•" indicates language pairs covered in the evaluation, "·" are language pairs planned but abandoned due to difficulties in obtaining human judgments.
Our datasets are described in Section 2. This includes the test sets, system outputs, human judgments of translation quality as well as participating metrics across the tasks. Results of system-level metric evaluation are provided in Section 3.1 and Section 3.2, and the results of the segment-level evaluation are provided in Section 3.3. Table 1 provides the complete picture of the golden truths, test sets, translation systems and language pairs involved in the metrics task this year. For simplicity, we call each of these setups a "track", indicating the underlying type of golden truth (RR/DA/HUME), system- or segment-level evaluation (sys/seg) and the particular test set.

Data
While the set of setups is much larger this year, the participants of the task were affected rather minimally. Participants were only required to run metrics on the additional test sets and with an additional large set of hybrid systems in the system-level evaluation. As in previous years, participants were allowed to take part in any subset of language pairs and setups.

Test Sets
We use the following test sets: newstest2016 is the main test set. It is the test set used in WMT16 News Translation Task (Bojar et al., 2016b), with approximately 3,000 sentences for each translation direction (with the exception of Romanian which only has 1,999 sentences). The set includes a single reference translation for each direction, except English→Finnish with two reference translations.
it-test2016 is a set of 1,000 sentences translated from English into seven other European languages. The IT test sentences typically contain instructions for operating commonly used software like web browsers, mail clients or image editors, e.g.: "In message box click on More > Archived."

himl2015 is part of the official test set created by the EU project HimL (http://www.himl.eu/test-sets). These are health-related texts from Cochrane summaries and NHS 24 online content. The texts originated in English and the target languages consist of Czech, German, Polish and Romanian versions created by post-editing of phrase-based MT output. From the full set of about 3,000 sentences, 800 were given as input to the participants of the metrics task and in the end about 340 sentences per language pair were used for evaluation, as those sentences have manual scores suitable to employ as the golden truth for metric evaluation.
The sentences of NHS 24 tend to be shorter and simpler translations, e.g. "Choose lower fat options such as semi-skimmed milk and low fat yogurt.", while Cochrane summaries are longer and often contain specific terminology, e.g. "The purpose of this research was to determine how good the TEG and ROTEM assessments are at diagnosing TIC in adult trauma patients who are bleeding."

Translation Systems
Characteristics of the underlying translation task MT systems are likely an important factor affecting the difficulty of the metrics task. For instance, if all of the systems perform similarly, it will be more difficult, even for humans, to distinguish between the quality of translations. If the task includes a wide range of systems of varying quality, however, or systems quite different in nature, this could make the task easier for metrics, with metrics that are more sensitive to certain aspects of MT output performing better.
The MT systems included in evaluation of metrics are as follows: News Task Systems are all MT systems participating in the WMT16 News Translation Task (Bojar et al., 2016b). These systems differ widely in nature (standard phrase-based, syntax-based, transfer-based or even rulebased systems, also with a large number of neural MT systems), with the precise set of systems and system types also depending on specific language pair.
Tuning Task Systems are all Moses phrase-based systems run by the organizers of the WMT16 Tuning Task (Jawaid et al., 2016). All of these systems share the same phrase tables and language models, they are trained on relatively large volumes of data, and they differ only in the model weights as provided by the participants of the tuning task. The tuning task was limited to Czech↔English language pairs.

IT Task Systems are participants of the WMT16 IT-domain Translation Task (Bojar et al., 2016b), translating only from English to seven other European languages. This is generally a smaller set of systems and the number of covered system architectures is also smaller. As far as we know, no neural system was involved in this task.
HimL Year 1 Systems are MT systems released in the first year of the EU project HimL 3 . They are all Moses-based and trained on available data in the medical or health-related domain.
Hybrid Systems were created by combining the output of two newstest2016 translation task systems, with the aim of providing a larger set of systems for more conclusive system-level metric rankings.

Excluding the hybrid systems, we ended up with 171 system outputs across 18 language pairs and 3 test sets.

Manual MT Quality Judgments
There are three distinct "golden truths" employed to evaluate metrics this year: Relative Ranking (RR, as in previous year), Direct Assessment (DA) and HUME, a semantic-based manual metric.
The details of the methods are provided in this section, separately for system-level evaluation (Section 2.3.1, using RR and DA) and segment-level evaluation (Section 2.3.2, using RR, DA and HUME).
The RR manual judgments were provided by MT researchers taking part in WMT tasks, as in recent years of the campaign, after it was empirically established that judgments of RR collected through crowd-sourcing platforms were not reliable. DA judgments are more robust in this respect and while the original plan was to collect DA from both researchers and crowd-sourced non-experts, only the latter ultimately took place due to time constraints.

System-level Manual Quality Judgments
In system-level evaluation, the goal is to assess the quality of translation of an MT system for the whole document. Both our manual scoring methods, RR and DA, nevertheless proceed sentence by sentence, aggregating the sentence scores into a final document-level score.
Relative Ranking (RR) As in previous WMT shared tasks, human assessors of MT output (only researchers this year) were presented with the source language input, the target language reference translation and five distinct MT output translations. Human assessors were required to rank the five translations from best to worst, with ties allowed. As introduced in WMT15, identical translations from distinct systems were collapsed into a single translation before running the human evaluation to increase the overall efficiency of RR human assessment. Each five-tuple relative ranking was employed to produce 10 pairwise assessments, later combined into a score for each MT system that reflects the frequency with which the output of that system was preferred to the output of other systems. Several methods have been tested in the past for the exact score calculation and WMT16 has again adopted TrueSkill as the official ranking approach. Please see the WMT16 overview paper for details on how this score is computed.
To increase annotator efficiency, a maximum sentence length of 30 words was applied to RR human assessment.
Direct Assessment (DA) In addition to the standard relative ranking (RR) manual evaluation employed to yield official system rankings in the WMT16 translation task, this year the translation task also trialed a new method of human evaluation, monolingual direct assessment (DA) of translation fluency (Graham et al., 2013) and adequacy. For investigatory purposes, therefore, we also include evaluation of metrics with reference to this newly trialed human assessment method.
Since sufficient levels of agreement in human assessment of translation quality are difficult to achieve, the DA setup simplifies the task of translation assessment (conventionally a bilingual task) into a simpler monolingual assessment for both fluency and adequacy. Furthermore, DA avoids bias that has been problematic in previous evaluations introduced by simultaneous assessment of several alternate translations of a given single source language input, where scores of systems for which translations were often compared to high or low quality translations resulted in an unfair advantage or disadvantage (Bojar et al., 2011). DA achieves this by assessment of individual translations in isolation from other outputs of the same source input.
Translation adequacy is structured as a monolingual assessment of similarity of meaning where the target language reference translation and the MT output are displayed to the human assessor. Human assessors rate a given translation by how adequately it expresses the meaning of the reference translation on an analogue scale corresponding to an underlying 0-100 rating scale. Fluency assessment is similar to adequacy except that no reference is displayed and assessors are asked to rate how much they agree that a given translation is fluent target language text.
Large numbers of DA human assessments of translations for seven language pairs (targeting English and Russian) were collected on Amazon's Mechanical Turk, via sets of 100-translation HITs to ensure sufficient repeat items per worker, before application of strict quality control measures to filter out assessments from poorly performing workers.
In order to iron out differences in scoring strategies of distinct workers, human assessment scores for translations were standardized according to each individual worker's overall mean and standard deviation. Mean standardized scores for translation task participating systems were computed by first taking the average of scores for individual translations in the test set (since some were assessed more than once), before combining all scores for translations attributed to a given MT system into its overall adequacy or fluency score.
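The standardization described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual WMT pipeline; the input layout (a dictionary from worker ID to scored translations) is a hypothetical convenience.

```python
from statistics import mean, stdev

def standardize_scores(raw):
    """Turn each worker's raw 0-100 scores into z-scores using that worker's
    own mean and standard deviation, then average repeat assessments per
    translation and finally per system.

    raw: {worker_id: [(system_id, segment_id, score), ...]}
    returns: {system_id: mean standardized score}
    """
    z = []  # (system_id, segment_id, z_score)
    for worker, triples in raw.items():
        scores = [s for _, _, s in triples]
        mu, sigma = mean(scores), stdev(scores)
        for sys_id, seg_id, s in triples:
            z.append((sys_id, seg_id, (s - mu) / sigma))
    # average repeat assessments of the same translation first ...
    per_seg = {}
    for sys_id, seg_id, zs in z:
        per_seg.setdefault((sys_id, seg_id), []).append(zs)
    seg_means = {k: mean(v) for k, v in per_seg.items()}
    # ... then combine the segment means into one score per system
    per_sys = {}
    for (sys_id, _), m in seg_means.items():
        per_sys.setdefault(sys_id, []).append(m)
    return {sys_id: mean(v) for sys_id, v in per_sys.items()}
```

Standardizing per worker means a consistently generous and a consistently harsh worker contribute comparable information to the system score.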
Although the WMT16 Translation Task included both fluency and adequacy DA human assessment, the metrics task this year employed only DA adequacy scores. We hope to incorporate DA fluency into future metric evaluations, however.
Finally, although it is common to apply a sentence length restriction in WMT human evaluation, the simplified DA setup does not require restriction of the evaluation in this respect and no sentence length restriction was applied in DA WMT16.

Segment-level Manual Quality Judgments
Segment-level metrics have been evaluated against the pairwise judgments implied by the 5-way relative ranking annotation. This year, we add two new variants of human assessment: segment-level DA and HUME.
Segment-level DA Adequacy assessments were collected for translations sampled from the output of systems participating in the WMT16 translation task for seven language pairs (Graham et al., 2015). Since the actual MT system is not important for segment-level assessment, we sampled 500 translations per language pair at random.

Table 2: Participants of the WMT16 Metrics Shared Task.

• BEER: ILLC, University of Amsterdam (Stanojević and Sima'an, 2015)
• CHARACTER: RWTH Aachen University
• CHRF1,2,3, WORDF1,2,3: Humboldt University of Berlin (Popović, 2016)
• DEPCHECK: Charles University (no corresponding paper)
• DPMFCOMB-WITHOUT-RED: Chinese Academy of Sciences and Dublin City University (Yu et al., 2015)
• MPEDA: Jiangxi Normal University (Zhang et al., 2016)
• UOW.REVAL: University of Wolverhampton (Gupta et al., 2015b)
• UPF-COBALT, COBALTF, METRICSF: Universitat Pompeu Fabra (Fomicheva et al., 2016)
• DTED: University of St Andrews (McCaffery and Nederhof, 2016)

Segment-level DA adequacy scores were collected as in system-level DA, described in Section 2.3.1, again with strict quality control and score standardization applied. To achieve accurate segment-level scores for translations, a human assessment of each translation was collected from 15 distinct human assessors before combination into a mean adequacy score for each individual translation. Although agreement in human assessment of MT has in general been difficult to achieve, segment-level DA scores employing a minimum of 15 repeat assessments have been shown to be almost perfectly replicable: in repeat experiments, for all tested language pairs, a correlation of above 0.9 was observed between (a) segment-level DA scores for translations collected in an initial experiment run and (b) the same scores collected in a repeat evaluation of the same translations, combining assessments of a minimum of 15 human assessors (Graham et al., 2015).
A distinction between DA and RR is that while RR works off a single set of human assessments for evaluation of both system-level and segment-level metrics, DA additionally includes a variant of its methodology designed specifically for evaluation of segment-level metrics.

HUME The HUME metric (Birch et al., 2016) is a novel human evaluation measure that decomposes over UCCA semantic units. UCCA (Abend and Rappoport, 2013) is an appealing candidate for semantic analysis, due to its cross-linguistic applicability, support for rapid annotation, and coverage of many fundamental semantic phenomena, such as verbal, nominal and adjectival argument structures and their interrelations. HUME operates by aggregating human assessments of the translation quality of individual semantic units in the source sentence. We thus avoid the semantic annotation of machine-generated text, which is often garbled or semantically unclear. This also allows the re-use of the source semantic annotation for measuring the quality of different translations of the same source sentence, and avoids reliance on possibly suboptimal reference translations. HUME shows good inter-annotator agreement, and reasonable correlation with Direct Assessment (Graham et al., 2015).

Table 2 lists the participants of the WMT16 Shared Metrics Task, along with their metrics. We have collected 16 metrics from a total of 9 research groups.

Participants of the Metrics Shared Task
The following subsections provide a brief summary of all the metrics that participated. The list is concluded by our baseline metrics in Section 2.4.10.

BEER
BEER (Stanojević and Sima'an, 2015) is a trained evaluation metric with a linear model that combines features capturing character n-grams and permutation trees. BEER has participated in previous years of the metrics task. This year the learning algorithm is improved (a linear SVM instead of logistic regression) and some features that are relatively slow to compute are removed (paraphrasing, syntax and permutation trees), which resulted in a very large speed-up. BEER is usually trained for ranking but in this case there was a compromise: the initial model is trained for ranking (RR) with a ranking SVM and the output of the SVM is then scaled using a trained regression model to approximate absolute judgments (DA).

CHARACTER
CHARACTER is a novel character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CHARACTER calculates the character-level edit distance while performing the shift edit on the word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word and can be shifted if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is then computed on the character level. In addition, the length of the hypothesis sequence instead of the reference sequence is used for normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower TER.
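The core normalization idea can be sketched as follows. This minimal version omits CHARACTER's word-level shift edits and threshold-based fuzzy matching; it only shows the character Levenshtein distance divided by the hypothesis length.

```python
def char_edit_rate(hyp, ref):
    """Character-level Levenshtein distance between hypothesis and reference,
    normalized by the *hypothesis* length, so that short translations are not
    rewarded the way they can be under reference-normalized TER.
    Sketch only: CHARACTER's word-level shift edits are omitted."""
    h, r = list(hyp), list(ref)
    # standard dynamic-programming edit distance, row by row
    prev = list(range(len(r) + 1))
    for i, hc in enumerate(h, 1):
        cur = [i]
        for j, rc in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (hc != rc)))   # substitution
        prev = cur
    return prev[-1] / max(len(h), 1)
```

Dividing by the hypothesis length means an overly short hypothesis pays for every reference character it fails to produce.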

CHRF and WORDF
WORDF1,2,3 (Popović, 2016) calculate a simple F-score combination of the precision and recall of word n-grams of maximal length 4, with different settings for the β parameter (β = 1, 2, or 3). The precision and recall used in the computation of the F-score are arithmetic averages of the precisions and recalls, respectively, over the different n-gram orders. CHRF1,2,3 calculate the F-score of character n-grams of maximal length 6. The β parameter gives β times as much weight to recall as to precision: β = 1 implies equal weights for precision and recall.

DEPCHECK
DEPCHECK is based on the automatic post-editing tool Depfix (Rosa, 2014). For each sentence, DEPCHECK computes the percentage of nodes post-edited by Depfix, obtaining a "relative depcheck error rate" (RDER). The value of the DEPCHECK metric is then defined as 1 − RDER. DEPCHECK does not distinguish the error types or whether more than one Depfix rule was applied to a node. It is suggested that a future version of DEPCHECK assign a weight (either by hand, or trained from some golden data) to each rule that was applied to the MT output.

DPMFCOMB-WITHOUT-RED
The authors of DPMFCOMB-WITHOUT-RED follow the work on last year's metric DPMFCOMB (Yu et al., 2015), but modify it in two main ways. Firstly, they use the 'case insensitive' instead of the 'case sensitive' option when using Asiya. Secondly, REDP is not used. Thus, DPMFCOMB-WITHOUT-RED is a combined metric including 57 single metrics. Weights of the individual metrics are trained with SVM-rank, using training data from the English-targeted language pairs from WMT12 to WMT14. In the results, DPMFCOMB-WITHOUT-RED is referred to as DPMFCOMB for brevity.

DTED
DTED (McCaffery and Nederhof, 2016) is based on Tree Edit Distance. The scoring is done over the dependency parse tree of the output, where the number of edit operations (insert, delete or substitute) needed to convert it to the correct (reference) dependency tree is used as an indicator of translation quality. Unlike the majority of metrics, which evaluate many aspects of translation, DTED evaluates only word order.

MPEDA
MPEDA (Zhang et al., 2016) is developed on the basis of the METEOR metric. In order to accurately match words or phrases with the same or similar meaning, it extracts a domain-specific paraphrase table from a monolingual corpus and applies that paraphrase table in the METEOR metric in place of the general one. Unlike traditional paraphrase extraction approaches, it first filters a domain-specific sub-corpus out of a large general monolingual corpus and then extracts a domain-specific paraphrase table from the sub-corpus using a Markov Network model. Since the proposed paraphrase extraction approach can be used for all languages, MPEDA is language-independent.

UOW.REVAL
UOW.REVAL (Gupta et al., 2015b) uses a dependency-tree Long Short-Term Memory (LSTM) network to represent both the hypothesis and the reference with a dense vector. Training is performed using the judgments from WMT13, converted to similarity scores. The final score at the system level is obtained by averaging the segment-level scores obtained from a neural network which takes into account both the distance and the Hadamard product of the two representations.

UPF-COBALT, COBALTF and METRICSF
UPF-COBALT (Fomicheva et al., 2016) is an alignment-based metric that examines the syntactic contexts of lexically similar candidate and reference words in order to distinguish meaning-preserving variations from the differences indicative of MT errors. This year the metric was improved by explicitly addressing MT fluency. The new version of the metric, COBALTF, combines various components of UPF-COBALT with a number of fine-grained features intended to capture the number and extent of disfluent fragments contained in MT sentences. METRICSF is a combination of three evaluation systems, BLEU, METEOR and UPF-COBALT, with the fluency-oriented features.

Baseline Metrics
As mentioned by Bojar et al. (2016a), the metrics task occasionally suffers from "loss of knowledge" when successful metrics participate in only one year. We attempt to avoid this by regularly evaluating also a range of "baseline metrics". The metrics MTEVALBLEU (Papineni et al., 2002) and MTEVAL-NIST (Doddington, 2002) were computed using the script mteval-v13a.pl (http://www.itl.nist.gov/iad/mig/tools/), which is used in the OpenMT Evaluation Campaign and includes its own tokenization.
We run mteval with the flag --international-tokenization since it performs slightly better (Macháček and Bojar, 2013).
The metrics MOSES-BLEU, MOSESTER (Snover et al., 2006), MOSESWER, MOSESPER and MOSESCDER (Leusch et al., 2006) were produced by the Moses scorer, which is used in Moses model optimization. To tokenize the sentences, we used the standard tokenizer script as available in the Moses toolkit. Since the Moses scorer is versioned on GitHub, we strongly encourage authors of high-performing metrics to add them to the Moses scorer, as this will ensure that their metric can be included in future tasks.
As for segment-level baselines, we employ the following modified version of BLEU:

• SENTBLEU. The metric SENTBLEU is computed using the script sentence-bleu, part of the Moses toolkit. It is a smoothed version of BLEU that correlates better with human judgments at the segment level.
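To illustrate why smoothing matters at the segment level: plain BLEU is zero whenever any n-gram order has no match, which happens constantly on single sentences. The sketch below uses generic add-one smoothing; the exact smoothing in Moses' sentence-bleu script differs in its details.

```python
import math
from collections import Counter

def smoothed_sentence_bleu(hyp, ref, max_n=4):
    """Add-one smoothed sentence-level BLEU sketch. Not the exact Moses
    sentence-bleu smoothing; it only illustrates the principle that no
    n-gram order should zero out the whole score."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((h & r).values())
        # add-one smoothing keeps zero-match orders from zeroing the score
        log_prec += math.log((overlap + 1) / (max(sum(h.values()), 1) + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)
```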
For computing system-level scores, the same script was employed as in last year's metrics task. New scripts have been added for system-level hybrids and segment-level evaluation.


Results

Table 3 provides an overview of all the tables and figures in the rest of the paper. We discuss system-level results for news task systems (including tuning task systems) in Section 3.1. The system-level results for the IT domain are discussed in Section 3.2. The segment-level results are in Section 3.3. We end with a discussion in Section 3.4.

System-Level Results for News Task
As in previous years, we employ the Pearson correlation coefficient (r) as the main evaluation measure for system-level metrics, computed as follows:

r = \frac{\sum_{i=1}^{n} (H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (H_i - \bar{H})^2}\,\sqrt{\sum_{i=1}^{n} (M_i - \bar{M})^2}}

where H are the human assessment scores of all systems in a given translation direction, M are the corresponding scores as predicted by a given metric, and H̄ and M̄ are their respective means.
Since some metrics, such as BLEU, aim to achieve a strong positive correlation with human assessment, while error metrics, such as TER, aim for a strong negative correlation, after computing r we compare metrics via the absolute value of their correlation with human assessment.
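The computation, including the absolute-value comparison, is straightforward:

```python
import math

def pearson_r(H, M):
    """Pearson correlation between human scores H and metric scores M."""
    n = len(H)
    mh, mm = sum(H) / n, sum(M) / n
    cov = sum((h - mh) * (m - mm) for h, m in zip(H, M))
    var_h = sum((h - mh) ** 2 for h in H)
    var_m = sum((m - mm) ** 2 for m in M)
    return cov / math.sqrt(var_h * var_m)

def metric_score(H, M):
    """Error metrics (TER, WER, ...) correlate negatively, so metrics are
    compared on the absolute value of r."""
    return abs(pearson_r(H, M))
```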
Table 4 includes results for system-level into-English metrics for evaluation of systems participating in the main translation task (newstest2016), evaluated against RR and DA human assessment variants, while Table 5 includes the same for the newstest2016 out-of-English language pairs (only Russian has the DA judgments). Tuning systems were excluded from Tables 4 and 5 and they are covered by Table 6 that shows correlations achieved by metrics with RR when the set of systems additionally includes tuning task systems.
In previous years, we reported empirical confidence intervals of system-level correlations obtained by bootstrap resampling the human assessment data and computing confidence intervals for individual correlations with human assessment. Such confidence intervals reflect the variance due to the particular sentences and assessors involved in the evaluation, but lead to over-estimation of significant differences if employed to conclude which metrics outperform others. This year, as recommended previously, we instead employ the Williams significance test (Williams, 1959). The Williams test is a test of the significance of a difference between dependent correlations and is therefore suitable for the evaluation of metrics. Correlations not significantly outperformed by any other are highlighted in bold in Tables 4 and 5. Since RR is the official method of evaluation for this year's metrics task, bolded correlations under RR comprise official winners of the news domain portion of the system-level metrics task. DA results are included for comparison and are investigatory only.
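The dependence arises because both metrics are correlated with the same human scores. A sketch of one standard formulation of the Williams test statistic follows, where r12 and r13 are the two metrics' correlations with the human scores, r23 is the correlation between the two metrics' own scores, and n is the number of systems; the resulting statistic follows a t distribution with n − 3 degrees of freedom.

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams test statistic for the difference between two dependent
    correlations r12 and r13 that share variable 1 (the human scores).
    Sketch of a standard formulation; the p-value is obtained from the
    t distribution with n - 3 degrees of freedom."""
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    num = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    den = math.sqrt(2 * K * (n - 1) / (n - 3) + rbar**2 * (1 - r23)**3)
    return num / den
```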
With regard to which individual metrics may or may not outperform others, such as the important comparison of which metrics significantly outperform the most widely employed metric BLEU (in its mteval or Moses scorer implementation), Figures 1, 2, 3, 4, 5 and 6 include significance test results for every competing pair of metrics, including our baseline metrics. In the heatmaps in these figures, the column labelled "MTEVALBLEU" or "MOSESBLEU" can be used to quickly observe which metrics achieve a significant increase in correlation with human assessment over that of BLEU, where a green cell in the column denotes outperformance of BLEU by the metric in that row.

Table 6: Absolute Pearson correlation of cs-en and en-cs system-level metric scores with human assessment variant RR + TT, i.e. standard WMT relative ranking including tuning task systems.
For investigatory purposes only, we also include hybrid-supersample (Graham and Liu, 2016) results for system-level metrics. 10K hybrid systems were created per language pair by sampling pairs of systems from the WMT16 translation task and combining translations from the two systems into a new hybrid output test set document, each with a corresponding DA human assessment score. Not all metrics participating in the system-level metrics shared task submitted metric scores for the large set of hybrid systems, possibly due to the increased time required to run metrics on 10K systems. In this respect, DA hybrid results may provide some indication of which metrics are likely to be feasible to employ out-of-the-box for tuning purposes in MT systems. Due to time constraints, this year it was only possible to include hybrid-supersampling results for language pairs evaluated by the DA human assessment variant.
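The sampling procedure can be sketched roughly as below. This is an illustrative reading of the description, not the procedure of Graham and Liu (2016) itself; the per-segment DA scores and the data layout are assumptions made for the sketch.

```python
import random

def hybrid_systems(system_outputs, human_scores, k=10000, seed=1):
    """Create k hybrid systems. Each hybrid pairs two real systems and picks
    every segment's translation at random from one of the pair; the hybrid's
    score is assembled from the per-segment human scores the same way.

    system_outputs: {sys_id: [segment translations]}
    human_scores:   {sys_id: [per-segment DA scores]} (assumed available)
    """
    rng = random.Random(seed)
    ids = list(system_outputs)
    hybrids = []
    for _ in range(k):
        a, b = rng.sample(ids, 2)
        # per segment, draw the translation from one of the two parents
        choice = [rng.choice((a, b)) for _ in system_outputs[a]]
        out = [system_outputs[s][i] for i, s in enumerate(choice)]
        score = sum(human_scores[s][i] for i, s in enumerate(choice)) / len(choice)
        hybrids.append((out, score))
    return hybrids
```

The point of the supersample is to give the Pearson correlation many more (system, score) points, which tightens the comparison between metrics.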
Correlations of metric scores with human assessment of the large set of hybrid systems are shown in Table 7, where again metrics not significantly outperformed by any other are highlighted in bold. Results are for investigatory purposes only and do not indicate official winners, however. Figures 1, 2 and 3 also include significance test results for hybrid super-sampled correlations for all pairs of competing metrics for a given language pair.

Figure 1: German-to-English (de-en), Finnish-to-English (fi-en) and Romanian-to-English (ro-en) system-level metric significance test results for human assessment variants; green cells denote a significant increase in correlation with human assessment for the metric in a given row over the metric in a given column according to the Williams test; RR = standard WMT relative ranking for translation task systems only; DA = direct assessment of translation adequacy; DA Hybrids = direct assessment with hybrid super-sampling.

Figure 2: Russian-to-English (ru-en), Turkish-to-English (tr-en) and English-to-Russian (en-ru) system-level metric significance test results for human assessment variants; green cells and variant abbreviations as in Figure 1.

Figure 3: Czech-to-English (cs-en) system-level metric significance test results for human assessment variants; green cells and variant abbreviations as in Figure 1; RR + TT = standard WMT relative ranking for all cs-en newstest2016 systems.

Figure 4: English-to-Czech (en-cs) system-level metric significance test results; green cells as in Figure 1; RR = standard WMT relative ranking; RR + TT = standard WMT relative ranking for translation and tuning task systems.

Figure 5: English-to-German (en-de) system-level metric significance test results; green cells as in Figure 1; RR = standard WMT relative ranking.

Figure 6: English-to-Finnish (en-fi), English-to-Romanian (en-ro) and English-to-Turkish (en-tr) system-level metric significance test results; green cells as in Figure 1; RR = standard WMT relative ranking.
In Appendix A, correlation plots for each language pair are also provided. The left-hand plot visualizes the correlation of MTEVALBLEU with manual judgments, while the right-hand plot shows the correlation for the best-performing metrics for that pair according to both standard RR and DA, as per Tables 4, 5 and 7.

System-Level Results for IT Task
Since systems participating in the IT domain translation task were manually evaluated with RR, we also evaluate metrics on translation of this specific domain. Results of all metrics evaluated on the IT domain MT systems are shown in Table 8, where the official winning metrics for this domain are identified as those not significantly outperformed by any other metric according to the Williams test; their correlations are highlighted in bold. Full pairwise significance test results for every pair of competing metrics evaluated on IT domain systems are shown in Figure 7 for Spanish, Dutch and Portuguese, in Figure 5 for German, and in Figure 4 for Czech. No significance tests are provided for IT domain Bulgarian and Basque, as all metrics achieved equal correlations.
We see from Table 8 and also from Figure 7 that MOSESBLEU is not among the winners for several target languages (Czech, German and Dutch), but overall, metrics are hard to distinguish on this specific test set.

Segment-Level Results
In WMT16, the official method for segment-level metric evaluation remains unchanged: a Kendall's Tau-like formulation of a given metric's agreement with pairwise human assessments of translations, collected through 5-way relative ranking (RR). However, we also trial evaluation of segment-level metrics against segment-level DA human assessment (for the main translation task data set) and against semantics-based manual judgments, HUME (for the himl2015 data set).

Figure 7: System-level metric significance test results on ittest2016 for differences in metric correlation with human assessment for the remaining out-of-English language pairs evaluated with relative ranking (RR).
Segment-level DA Evaluation Segment-level DA adequacy scores, as described in Section 2.3.2, are employed as the gold standard human scores for translations. Since DA segment-level scores are absolute judgments, in their raw (non-standardized) form corresponding simply to a percentage expressing the absolute adequacy of a given translation, metric evaluation simply takes the form of computing the Pearson correlation coefficient between metric and DA scores for translations. Significance of differences in metric performance, as in system-level DA metric evaluation, is tested with the Williams test for the significance of a difference in dependent correlations (Williams, 1959; Graham et al., 2015).
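As a concrete illustration, this segment-level evaluation reduces to a plain Pearson correlation between two score vectors. A minimal sketch in Python; the metric and DA score values below are invented for illustration only:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical metric scores and raw DA adequacy percentages for 5 segments:
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
da_scores = [30.0, 55.0, 40.0, 85.0, 70.0]
r = pearson(metric_scores, da_scores)
```

In the real evaluation the vectors would span all assessed segments of a language pair, and the resulting coefficients are then compared across metrics.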
Segment-level HUME Evaluation The evaluation of segment-level metrics with reference to HUME scores operates in a similar way to DA: we compute the Pearson correlation of HUME scores for individual translations with metric scores. The Williams test is again applied to test for significant differences in metric performance.
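The Williams test used above can be computed directly from the three pairwise correlations involved: the correlations of the human scores with each of the two metrics, and the correlation between the two metrics themselves. A hedged sketch (the numbers in the example are invented); the resulting statistic is compared against a t-distribution with n − 3 degrees of freedom:

```python
from math import sqrt

def williams_t(r12, r13, r23, n):
    """Williams (1959) t statistic for the difference between two dependent
    correlations that share variable 1 (here: the human scores).
    r12, r13: correlation of human scores with metric 1 and metric 2;
    r23: correlation between the two metrics; n: number of scored items.
    Compare the result against a t-distribution with n - 3 df."""
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    num = (r12 - r13) * sqrt((n - 1) * (1 + r23))
    den = sqrt(2 * K * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3)
    return num / den

# Invented example: metric 1 correlates 0.9 with humans, metric 2 only 0.8,
# the metrics correlate 0.7 with each other, over n = 50 items.
t = williams_t(0.9, 0.8, 0.7, 50)
```

Because the two metric-human correlations are themselves correlated (both metrics score the same translations), this dependent-correlations test is more appropriate than treating the two coefficients as independent.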

Kendall's Tau-like Formulation
We measure the quality of metrics' segment-level scores using a Kendall's Tau-like formulation, which is an adaptation of the conventional Kendall's Tau coefficient. Since we do not have a total-order ranking of all the translations used to evaluate metrics, it is not possible to apply conventional Kendall's Tau under the current RR human evaluation setup (Graham et al., 2015). Vazquez-Alvarez and Huckvale (2002) also note that a genuine pairwise comparison is likely to lead to more stable results for segment-level metric evaluation. Our Kendall's Tau-like formulation, τ, for segment-level evaluation is as follows:

τ = (|Concordant| − |Discordant|) / (|Concordant| + |Discordant|)

where Concordant is the set of all human comparisons for which a given metric suggests the same order and Discordant is the set of all human comparisons for which the metric disagrees. The formula is not specific with respect to ties, i.e. cases where the annotation says that the two outputs are equally good.
The way in which ties (in both human and metric judgments) were incorporated in computing Kendall's τ has changed across the years of the WMT metrics task. Here we adopt the version from WMT14 and WMT15. For a detailed discussion of other options, see Macháček and Bojar (2014).
The method is formally described using the following coefficient matrix:

                 Metric
              <     =     >
  Human   <   1     0    −1
          =   X     X     X
          >  −1     0     1

Given such a matrix C_{h,m}, where h, m ∈ {<, =, >}, and a metric, we compute the metric's Kendall's τ in the following way. We insert each extracted human pairwise comparison into exactly one of the nine sets S_{h,m} according to the human and metric ranks. For example, the set S_{<,>} contains all comparisons where the left-hand system was ranked better than the right-hand system by humans but the other way round by the metric in question.
To compute the numerator of our Kendall's τ formulation, we take the coefficients from the matrix C_{h,m}, multiply them by the sizes of the corresponding sets S_{h,m}, and sum the results. We do not include sets for which the value of C_{h,m} is X. To compute the denominator, we simply sum the sizes of all the sets S_{h,m} except those for which C_{h,m} = X.
To summarize, the WMT16 matrix specifies that we:
• exclude all human ties,
• count the metric's ties only in the denominator (thus giving no credit for a tie),
• count all cases of disagreement between human and metric judgments as Discordant,
• count all cases of agreement between human and metric judgments as Concordant.
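The bullet points above translate directly into code. A minimal sketch, encoding the WMT16 coefficient matrix as a dictionary ('X' marks human ties, which are excluded entirely); the example comparisons are invented:

```python
# Coefficient matrix C[h][m]: rows = human relation, columns = metric relation.
C = {
    '<': {'<': 1, '=': 0, '>': -1},
    '=': {'<': 'X', '=': 'X', '>': 'X'},   # human ties are excluded
    '>': {'<': -1, '=': 0, '>': 1},
}

def kendall_tau_like(comparisons):
    """comparisons: list of (human, metric) relations, each in {'<', '=', '>'},
    one per extracted pairwise comparison."""
    num = den = 0
    for h, m in comparisons:
        c = C[h][m]
        if c == 'X':      # skip human ties entirely
            continue
        num += c          # +1 concordant, -1 discordant, 0 for a metric tie
        den += 1          # metric ties still count in the denominator
    return num / den

# Four usable comparisons (one human tie is discarded):
tau = kendall_tau_like([('<', '<'), ('<', '>'), ('>', '>'),
                        ('<', '='), ('=', '<')])
```

With two concordant pairs, one discordant pair and one metric tie, the sketch yields τ = (2 − 1) / 4 = 0.25.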
In previous years, we reported confidence intervals for this Kendall's Tau formulation; see the previous years' metrics task reports for details. However, since the formulation of Kendall's Tau is not computed in the standard way (we do not have a single overall ranking of translations, but rather rankings of sets of 5 translations), the accuracy of confidence intervals computed in this way is difficult to verify. To avoid the risk of drawing incorrect conclusions of significant differences in metric performance, we do not include confidence intervals with this year's Kendall's Tau formulation results.

Figure 9: Direct Assessment (DA) segment-level metric significance test results for English to Russian (newstest2016); green cells denote a significant win for the metric in a given row over the metric in a given column according to the Williams test for difference in dependent correlations.
Results of the segment-level human evaluation for translations sampled from the main translation task are shown in Tables 9 and 10, where metric correlations (for the DA human assessment variant only) not significantly outperformed by any other metric are highlighted in bold. Since Kendall's Tau scores are traditionally employed to determine task winners, while we currently lack a known reliable method of identifying significant differences between metrics under this formulation, we postpone the announcement of official winning segment-level metrics until further research has established a reliable method in this respect.
DA human assessment pairwise significance test results for differences in metric performance are included for investigatory purposes only in Figures 8 and 9. Results of the segment-level metrics task evaluated with HUME on the himl2015 data set are shown in Table 11, where metrics not significantly outperformed by any other in a given language pair are highlighted in bold; these metrics are the official winners of the himl2015 segment-level metric evaluation. Full pairwise significance test results for all metrics are shown in Figure 10.

Table 11: Pearson correlation of segment-level metric scores with the HUME human assessment variant.

Discussion
During the task, DA evaluation, in addition to being more principled and discerning, proved more reliable for crowd-sourced human evaluation of MT.
It should be noted that DA requires distinct variants for system-level and segment-level evaluation. We do not see this as a negative, however, but rather as evidence that DA provides a new method of human evaluation devised specifically for accurate evaluation of segment-level metrics.
Although this year DA was carried out through crowd-sourcing while RR was completed by researchers, DA is not restricted to crowd-sourcing: it could be carried out as-is by researchers, or with a slight modification that removes the overhead of the translation assessments included in DA for quality control. With any method of human evaluation that relies on crowd-sourcing, we must keep in mind that for some languages it is difficult to obtain workers, as reflected in the fact that this year's WMT collected crowd-sourced assessments only for English and Russian as target languages. Although we employed a minimum of 15 human assessors per segment for segment-level evaluation of metrics, preliminary empirical evaluation has shown that these 15 assessments need not come from distinct workers: when repeat assessments are allowed from the same worker, the resulting scores still correlate above 0.9 with assessments collected from strictly distinct workers. In other words, DA should be technically viable for all language pairs if we employ researchers instead of crowd-sourced assessors (who may not be available for the language) and allow repeated assessments of the same segment by the same person.
Hybrid super-sampling is a novel way of carrying out meta-evaluation of metric performance, and it provided more conclusive results. Although we carried out hybrid super-sampling for DA human evaluation only, the method is not DA-specific, and it would be interesting to trial it with RR in the future.
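To make the idea concrete, here is a minimal sketch of how hybrid systems can be generated; this is a simplification for illustration, not the exact sampling scheme used in the task. Each hybrid picks, per segment, the output of a randomly chosen real system, so a small pool of real systems yields an arbitrarily large pool of plausible hybrids to correlate against:

```python
import random

def sample_hybrid(system_outputs, rng):
    """system_outputs: {system_name: [segment_1, ..., segment_n]}.
    For each segment position, take the translation of a randomly chosen
    real system, yielding one hybrid 'system' output document."""
    systems = list(system_outputs)
    n = len(next(iter(system_outputs.values())))
    return [system_outputs[rng.choice(systems)][i] for i in range(n)]

# Toy pool of two real systems with three segments each:
rng = random.Random(0)
outs = {'sysA': ['a1', 'a2', 'a3'], 'sysB': ['b1', 'b2', 'b3']}
hybrid = sample_hybrid(outs, rng)
```

Each hybrid is then scored both by the metric and by the human assessment (aggregated from the per-segment judgments of its constituent translations), densifying the set of points over which the system-level correlation is computed.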
Character-level metrics again gave very good results at both the system and segment level. The trend that started at WMT14 with BEER, and continued at WMT15 with BEER and CHRF, now continues with BEER, CHRF and CHARACTER. This growing number of character-level metrics suggests that the community (at least the part of it that develops metrics) has started to adopt character-level matching as an important component of evaluation.
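To illustrate what character-level matching involves, here is a simplified character n-gram F-score in the spirit of CHRF. This is an illustrative sketch, not the official CHRF implementation; the choice of maximum n-gram order and of beta (weighting recall more heavily than precision) are assumptions made for the example:

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_fscore(hyp, ref, max_n=4, beta=2.0):
    """Simplified character n-gram F-score: average precision and recall of
    character n-grams up to max_n, with recall weighted beta^2 times
    precision, as in F-beta."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())      # clipped n-gram matches
        if h and r:
            precs.append(overlap / sum(h.values()))
            recs.append(overlap / sum(r.values()))
    if not precs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because it matches characters rather than tokens, such a score gives partial credit for morphological variants (e.g. a correct stem with a wrong inflectional ending), which is one plausible reason character-level metrics do well on morphologically rich target languages.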
Just like in previous years, metrics that train their parameters achieve very high correlation with human judgment, as exemplified by BEER and UOW.REVAL. This year's edition of the metrics task introduced different types of golden truths, which opens the question of which golden truth metrics should be trained towards: should it be RR, using learning-to-rank algorithms, DA, using regression algorithms, or some combination of the two?
The results this year again include surprises. For instance, the English-to-Czech evaluation suggests that WORDF, BLEU and NIST outperform CHRF under evaluation against RR, both with and without tuning systems (Figure 4), on the news domain, whereas we saw the exact opposite last year. The IT domain for English-to-Czech stays in line with last year's observations. BLEU (and especially its Moses implementation) has been clearly outperformed by many metrics. That again raises the question of why almost all MT systems are still optimized for BLEU. Optimization towards BLEU has driven system development and certainly achieved results in the past, but the relatively low correlation with human judgment is a sign that alternative metrics should be considered. For this reason, we encourage metric developers to add their metrics to the Moses scorer so that the MT community can more easily experiment with employing them as optimization objectives. An additional motivation is that valuable development work on metrics would not be lost: if added to the Moses scorer, future metrics tasks could easily run these metrics as baselines, even if their authors are not participating in the task that year. That way, well-performing metrics will live on, and the results of the metrics task will be more comparable across years.

Conclusion
In this paper, we summarized the results of the WMT16 Metrics Shared Task, which assesses the quality of various automatic machine translation metrics. As in previous years, human judgments collected in WMT16 serve as the golden truth and we check how well the metrics predict the judgments at the level of individual sentences as well as at the level of the whole test set (system-level).
The more extensive meta-evaluation in this year's task, which involved a large number of language pairs, different types of judgments and better significance measurements, will hopefully shed more light on the qualities of the different metrics.
The patterns observable in the results are that character-level metrics perform very well and that their number is growing over the years. Also, trained metrics on average perform better than non-trained metrics, especially for into-English language pairs.

A System-Level Correlation Plots
The following figures plot the system-level results of MTEVALBLEU (left-hand plots) and of the best-performing metrics for the given language pair according to RR and DA (right-hand plots; see Tables 4, 5 and 7) against the manual score.