Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges

This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT19 News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less “metrics” and constitute submissions to the joint task with the WMT19 Quality Estimation Task, “QE as a Metric”. In addition, we computed 11 baseline metrics, with 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, and chrF) and 3 reimplementations (chrF+, sacreBLEU-BLEU, and sacreBLEU-chrF). Metrics were evaluated on the system level, i.e. how well a given metric correlates with the WMT19 official manual ranking, and on the segment level, i.e. how well the metric correlates with human judgements of segment quality. This year, we use direct assessment (DA) as our only form of manual evaluation.


Introduction
To determine system performance in machine translation (MT), it is often more practical to use an automatic evaluation rather than a manual one. Manual/human evaluation can be costly and time consuming, and so an automatic evaluation metric, provided that it sufficiently correlates with manual evaluation, can be useful in development cycles. In studies involving hyperparameter tuning or architecture search, automatic metrics are necessary as the amount of human effort involved in manual evaluation is generally prohibitively large. As objective, reproducible quantities, metrics can also facilitate cross-paper comparisons. The WMT Metrics Shared Task annually serves as a venue to validate the use of existing metrics (including baselines such as BLEU), and to develop new ones; see Koehn and Monz (2006) through Ma et al. (2018).
In the setup of our Metrics Shared Task, an automatic metric compares an MT system's output translations with manual reference translations to produce either (a) a system-level score, i.e. a single overall score for the given MT system, or (b) segment-level scores for each of the output translations, or both.
This year we teamed up with the organizers of the QE Task and hosted "QE as a Metric" as a joint task. In the setup of the Quality Estimation Task (Fonseca et al., 2019), no human-produced translations are provided to estimate the quality of output translations. Quality estimation (QE) methods are built to assess MT output based on the source or based on the translation itself. In this task, QE developers were invited to perform the same scoring as standard metrics participants, with the exception that they refrain from using a reference translation in the production of their scores. We then evaluate the QE submissions in exactly the same way as regular metrics are evaluated, see below. From the point of view of correlation with manual judgements, there is no difference between metrics using and not using references.
The source, reference texts, and MT system outputs for the Metrics task come from the News Translation Task (Barrault et al., 2019, which we denote as Findings 2019). The texts were drawn from the news domain and involve translations of English (en) to/from Czech (cs), German (de), Finnish (fi), Gujarati (gu), Kazakh (kk), Lithuanian (lt), Russian (ru), and Chinese (zh), but excluding cs-en (15 language pairs). Three other language pairs not including English were also manually evaluated as part of the News Translation Task: German→Czech and German↔French. In total, metrics could participate in 18 language pairs, with 10 target languages.
In the following, we first give an overview of the task (Section 2) and summarize the baseline (Section 3) and submitted (Section 4) metrics. The results for system- and segment-level evaluation are provided in Sections 5.1 and 5.2, respectively, followed by a joint discussion in Section 6.

Task Setup
This year, we provided task participants with one test set for each examined language pair, i.e. a set of source texts (which are commonly ignored by MT metrics), corresponding MT outputs (these are the key inputs to be scored) and a reference translation (held out for the participants of "QE as a Metric" track).
At the system level, metrics aim to correlate with a system's score, which is an average over many human judgments of segment translation quality produced by the given system. At the segment level, metrics aim to produce scores that correlate best with a human ranking judgment of two output translations for a given source segment (more on the manual quality assessment in Section 2.3). Participants were free to choose which language pairs and tracks (system/segment and reference-based/reference-free) they wanted to take part in.

Source and Reference Texts
The source and reference texts we use are newstest2019 from this year's WMT News Translation Task (see Findings 2019). This set contains approximately 2,000 sentences for each translation direction (except Gujarati, Kazakh and Lithuanian which have approximately 1,000 sentences each, and German to/from French which has 1701 sentences).
The reference translations provided in newstest2019 were created in the same direction as the MT systems were translating. The exceptions are German→Czech, where both sides are translations from English, and German↔French, which followed last year's practice. Last year and the years before, the dataset consisted of two halves, one originating in the source language and one in the target language. This however led to adverse artifacts in MT evaluation.

System Outputs
The results of the Metrics Task are affected by the actual set of MT systems participating in a given translation direction. On one hand, if all systems are very close in their translation quality, then even humans will struggle to rank them. This in turn will make the task for MT metrics very hard. On the other hand, if the task includes a wide range of systems of varying quality, correlating with humans should be generally easier, see Section 6.1 for a discussion on this. One can also expect that if the evaluated systems are of different types, they will exhibit different error patterns and various MT metrics can be differently sensitive to these patterns.
This year, all MT systems included in the Metrics Task come from the News Translation Task (see Findings 2019). There are however still noticeable differences among the various language pairs.
• Unsupervised MT Systems. The German→Czech research systems were trained in an unsupervised fashion, i.e. without access to parallel Czech-German texts (except for a couple of thousand sentences used primarily for validation). We thus expect the research German-Czech systems to be "more creative" and depart further away from the references. The online systems in this language direction are however standard MT systems, so the German-Czech evaluation could be to some extent bimodal.
• EU Election. The French↔German translation was focused on a sub-domain of news, namely texts related to the EU Election. Various MT system developers may have invested more or less time in the domain adaptation.
• Regular News Tasks Systems. These are all the other MT systems in the evaluation; differing in whether they are trained only on WMT provided data ("Constrained", or "Unconstrained") as in the previous years. All the freely available web services (online MT systems) are deemed unconstrained.
Overall, the results are based on 233 systems across 18 language pairs. 2

Manual Quality Assessment
Direct Assessment (DA; Graham et al., 2013, 2014a) was employed as the source of the "golden truth" to evaluate metrics again this year. The details of this method of human evaluation are provided in Findings 2019. The basis of DA is to collect a large number of quality assessments (a number on a scale of 1-100, i.e. effectively a continuous scale) for the outputs of all MT systems. These scores are then standardized per annotator.
In the past years, the underlying manual scores were reference-based (human judges had access to the same reference translation as the MT quality metric). This year, the official WMT19 scores are reference-based (or "monolingual") for some language pairs and reference-free (or "bilingual") for others. 3 Due to these different types of golden truth collection, reference-based language pairs are in a closer match with the standard reference-based metrics, while the reference-free language pairs are a better fit for the "QE as a metric" subtask.
Note that system-level manual scores are different from those of the segment level. Since collecting enough DA judgements for each segment is infeasible for segment-level evaluation, we resort to converting DA judgements to golden truth expressed as relative rankings, see Section 2.3.2.
2 This year, we do not use the artificially constructed "hybrid systems" (Graham and Liu, 2016) because the confidence on the ranking of system-level metrics is sufficient even without hybrids.
3 Specifically, the reference-based language pairs were those where the anticipated translation quality was lower or where the manual judgements were obtained with the help of anonymous crowdsourcing. Most of these cases were translations into English (fi-en, gu-en, kk-en, lt-en, ru-en and zh-en) and then the language pairs not involving English (de-cs, de-fr and fr-de). The reference-less (bilingual) evaluations were those where mainly MT researchers themselves were involved in the annotations: en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh.
The exact methods used to calculate correlations of participating metrics with the golden truth are described below, in the two sections for system-level evaluation (Section 5.1) and segment-level evaluation (Section 5.2).

System-level Golden Truth: DA
For the system-level evaluation, the collected continuous DA scores, standardized for each annotator, are averaged across all assessed segments for each MT system to produce a scalar rating for the system's performance.
The underlying set of assessed segments is different for each system. Thanks to the fact that the system-level DA score is an average over many judgments, mean scores are consistent and have been found to be reproducible (Graham et al., 2013). For more details see Findings 2019.
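The system-level golden truth described above can be sketched in a few lines. This is a minimal illustration, not the official WMT tooling: the record layout `(annotator, system, raw_score)`, the function names, and the choice of the population standard deviation for the per-annotator standardization are all assumptions made for the example, and the scores are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_annotator(judgements):
    """Turn raw DA scores into z-scores computed separately for each annotator.

    `judgements` is a list of (annotator, system, raw_score) tuples;
    this layout is hypothetical and chosen only for the sketch.
    """
    by_annotator = defaultdict(list)
    for annotator, _system, score in judgements:
        by_annotator[annotator].append(score)
    # Assumption: population standard deviation; the official scripts may differ.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    standardized = []
    for annotator, system, score in judgements:
        mu, sigma = stats[annotator]
        z = (score - mu) / sigma if sigma > 0 else 0.0
        standardized.append((system, z))
    return standardized


def system_level_scores(judgements):
    """Average the standardized scores over all assessed segments of each system."""
    per_system = defaultdict(list)
    for system, z in standardize_per_annotator(judgements):
        per_system[system].append(z)
    return {system: mean(zs) for system, zs in per_system.items()}


# Made-up judgements from two annotators with different scoring habits.
example = [
    ("a1", "sysA", 80), ("a1", "sysB", 60),
    ("a2", "sysA", 50), ("a2", "sysB", 30),
]
scores = system_level_scores(example)
```

Standardizing before averaging means that a strict annotator and a lenient annotator who agree on the relative quality of two systems contribute the same evidence.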

Segment-level Golden Truth: daRR
Starting from Bojar et al. (2017), when WMT fully switched to DA, we had to come up with a solid golden standard for segment-level judgements. Standard DA scores are reliable only when averaged over a sufficient number of judgments. Fortunately, when we have at least two DA scores for translations of the same source input, it is possible to convert those DA scores into a relative ranking judgement, if the difference in DA scores allows the conclusion that one translation is better than the other. In the following, we denote these re-interpreted DA judgements as "daRR", to distinguish them clearly from the relative ranking ("RR") golden truth used in the past years.
Table 1: Number of judgements for DA converted to daRR data; "DA>1" is the number of source input sentences in the manual evaluation where at least two translations of that same source input segment received a DA judgement; "Ave" is the average number of translations with at least one DA judgement available for the same source input sentence; "DA pairs" is the number of all possible pairs of translations of the same source input resulting from "DA>1"; and "daRR" is the number of DA pairs with an absolute difference in DA scores greater than the 25 percentage point margin.
From the complete set of human assessments collected for the News Translation Task, all possible pairs of DA judgements attributed to distinct translations of the same source were converted into daRR better/worse judgements. Distinct translations of the same source input whose DA scores fell within 25 percentage points (which could have been deemed equal quality) were omitted from the evaluation of segment-level metrics. Conversion of scores in this way produced a large set of daRR judgements for all language pairs, shown in Table 1, thanks to the combinatorial advantage of extracting daRR judgements from all possible pairs of translations of the same source input. We see that only German-French and especially French-German can suffer from an insufficient number of these simulated pairwise comparisons.
The daRR judgements rely on judgements collected from known-reliable volunteers and crowd-sourced workers who passed DA's quality control mechanism. Any inconsistency that could arise from reliance on DA judgements collected from low quality crowd-sourcing is thus prevented.
The daRR judgements serve as the golden standard for segment-level evaluation in WMT19.
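The conversion from DA scores to daRR judgements can be illustrated as follows. This is a sketch under stated assumptions: the nested dictionary layout, the function name `da_to_darr`, and the example scores are hypothetical, and we assume one DA score on the 0-100 scale per translation. Only the 25-point margin and the "strictly greater than the margin" rule come from the text above.

```python
from itertools import combinations

MARGIN = 25.0  # percentage points, as stated in the text


def da_to_darr(da_scores, margin=MARGIN):
    """Convert DA scores into daRR better/worse judgements.

    `da_scores` maps a source segment id to a dict {system: DA score}.
    Returns a list of (segment_id, better_system, worse_system) tuples.
    Pairs whose scores differ by at most `margin` are dropped, so no
    human ties remain in the output.
    """
    judgements = []
    for segment_id, by_system in da_scores.items():
        # All possible pairs of distinct translations of the same source.
        for (sys_a, score_a), (sys_b, score_b) in combinations(sorted(by_system.items()), 2):
            if abs(score_a - score_b) <= margin:
                continue  # could be deemed equal quality: omitted
            if score_a > score_b:
                judgements.append((segment_id, sys_a, sys_b))
            else:
                judgements.append((segment_id, sys_b, sys_a))
    return judgements


# Made-up scores for three translations of one source segment.
example = {"seg1": {"A": 90.0, "B": 60.0, "C": 70.0}}
darr = da_to_darr(example)
```

With three scored translations there are three DA pairs, but only the A/B pair exceeds the margin, which shows why the "daRR" column of Table 1 is smaller than the "DA pairs" column.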

Baseline Metrics
In addition to validating popular metrics, including baseline metrics serves as a point of comparison and prevents "loss of knowledge" as mentioned by Bojar et al. (2016).
Moses scorer is one of the MT evaluation tools that has aggregated several useful metrics over time. Since Macháček and Bojar (2013), we have been using Moses scorer to provide most of the baseline metrics and kept encouraging authors of well-performing MT metrics to include them in Moses scorer. The baselines we report are:
BLEU and NIST. The metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were computed using mteval-v13a.pl from the OpenMT Evaluation Campaign. The tool includes its own tokenization. We run mteval with the flag --international-tokenization.
TER, WER, PER and CDER. The metrics TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006) were produced by the Moses scorer, which is used in Moses model optimization. We used the standard tokenizer script as available in Moses toolkit for tokenization.
sentBLEU. The metric sentBLEU is computed using the script sentence-bleu, a part of the Moses toolkit. It is a smoothed version of BLEU for scoring at the segment-level. We used the standard tokenizer script as available in Moses toolkit for tokenization.
Table 2: Participants of WMT19 Metrics Shared Task. "•" denotes that the metric took part in (some of the language pairs) of the segment- and/or system-level evaluation. "⊘" indicates that the system-level scores are implied, simply taking the arithmetic (macro-)average of segment-level scores. "−" indicates that the metric didn't participate in the track (Seg/Sys-level). A metric is learned if it is trained on a QE or metric evaluation dataset (i.e. pretraining or parsers don't count, but training on WMT 2017 metrics task data does). For the baseline metrics available in the Moses toolkit, paths are relative to http://github.com/moses-smt/mosesdecoder/.
The baselines serve in system and segment-level evaluations as customary: BLEU, TER, WER, PER, CDER, sacreBLEU-BLEU and sacreBLEU-chrF for system-level only; sentBLEU for segment-level only and chrF for both.
Chinese word segmentation is unfortunately not supported by the tokenization scripts mentioned above. For scoring Chinese with baseline metrics, we thus pre-processed MT outputs and reference translations with the script tokenizeChinese.py by Shujian Huang, which separates Chinese characters from each other and also from non-Chinese parts.

Submitted Metrics
Table 2 lists the participants of the WMT19 Shared Metrics Task, along with their metrics and links to the source code where available. We have collected 24 metrics from a total of 13 research groups, with 10 reference-less "metrics" submitted to the joint task "QE as a Metric" with the WMT19 Quality Estimation Task.
The rest of this section provides a brief summary of all the metrics that participated.

BEER
BEER (Stanojević and Sima'an, 2015) is a trained evaluation metric with a linear model that combines sub-word feature indicators (character n-grams) and global word order features (skip bigrams) to achieve a language agnostic and fast to compute evaluation metric. BEER has participated in previous years of the evaluation task.

BERTr
BERTr (Mathur et al., 2019) uses contextual word embeddings to compare the MT output with the reference translation.
The BERTr score of a translation is the average recall score over all tokens, using a relaxed version of token matching based on BERT embeddings: namely, computing the maximum cosine similarity between the embedding of a reference token against any token in the MT output. BERTr uses bert_base_uncased embeddings for the to-English language pairs, and bert_base_multilingual_cased embeddings for all other language pairs.
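The relaxed recall described above can be sketched independently of any particular embedding model. The snippet below is an illustration only, not the BERTr implementation: it assumes the contextual embeddings have already been extracted and are passed in as plain lists of floats, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_recall(reference_vecs, hypothesis_vecs):
    """Average, over reference tokens, of the best cosine match in the MT output."""
    if not reference_vecs or not hypothesis_vecs:
        return 0.0
    best_matches = [
        max(cosine(ref, hyp) for hyp in hypothesis_vecs)
        for ref in reference_vecs
    ]
    return sum(best_matches) / len(best_matches)


# Toy 2-dimensional "embeddings": the MT output covers only one of the
# two reference tokens.
reference = [[1.0, 0.0], [0.0, 1.0]]
hypothesis = [[1.0, 0.0]]
recall = embedding_recall(reference, hypothesis)
```

Because the maximum is taken per reference token, a reference token with no close counterpart in the MT output pulls the average down, which is what makes the score recall-oriented.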

CharacTER
CharacTER (Wang et al., 2016b,a), identical to the 2016 setup, is a character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CharacTER calculates the character-level edit distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word and could be shifted if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower TER.
Similarly to other character-level metrics, CharacTER is generally applied to non-tokenized outputs and references, which also holds for this year's submission with one exception. This year tokenization was carried out for en-ru hypotheses and references before calculating the scores, since this results in large improvements in terms of correlations. For other language pairs, no tokenizer was used for pre-processing.

EED
EED (Stanchev et al., 2019) is a character-based metric, which builds upon CDER. It is defined as the minimum number of operations of an extension to the conventional edit distance containing a "jump" operation. The edit distance operations (insertions, deletions and substitutions) are performed at the character level and jumps are performed when a blank space is reached. Furthermore, the coverage of multiple characters in the hypothesis is penalised by the introduction of a coverage penalty. The sum of the length of the reference and the coverage penalty is used as the normalisation term.

ESIM
Enhanced Sequential Inference Model (ESIM; Chen et al., 2017; Mathur et al., 2019) is a neural model proposed for Natural Language Inference that has been adapted for MT evaluation. It uses cross-sentence attention and sentence matching heuristics to generate a representation of the translation and the reference, which is fed to a feedforward regressor. The metric is trained on singly-annotated Direct Assessment data that has been collected for evaluating WMT systems: all WMT 2018 to-English data for the to-English language pairs, and all WMT 2018 data for all other language pairs.

hLEPORb_baseline, hLEPORa_baseline
The submitted metric hLEPOR_baseline is a metric based on the factor combination of length penalty, precision, recall, and position difference penalty. The weighted harmonic mean is applied to group the factors together with tunable weight parameters. The system-level score is calculated with the same formula but with each factor weighted using weights estimated at system-level and not at segment-level.
In this submitted baseline version, hLEPOR_baseline was not tuned for each language pair separately but the default weights were applied across all submitted language pairs. Further improvements can be achieved by tuning the weights according to the development data, adding morphological information and applying n-gram factor scores into it (e.g. part-of-speech, n-gram precision and n-gram recall that were added into LEPOR in WMT13). The basic model factors and further development with parameter settings were described in Han et al. (2012) and Han et al. (2013).
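The way a weighted harmonic mean groups several factors can be shown with a small generic helper. This is only an illustration of the combination step, not the hLEPOR implementation: the function name, the factor values and the weights below are made up, and the actual factor definitions and default weights are those given in Han et al. (2012, 2013).

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i).

    Returns 0.0 as soon as any factor is 0, since a zero factor would
    otherwise cause a division by zero and should dominate the score.
    """
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and of equal length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    if any(x <= 0 for x in values):
        return 0.0
    return sum(weights) / sum(w / x for w, x in zip(weights, values))


# Hypothetical factor values for one sentence, each in (0, 1]:
# a length penalty, a position difference penalty, and a
# precision/recall factor. The weights are illustrative, not the
# defaults used in the submission.
factors = [0.9, 0.8, 0.6]
weights = [2.0, 1.0, 7.0]
sentence_score = weighted_harmonic_mean(factors, weights)
```

A harmonic mean is pulled towards its smallest factor, so a translation cannot compensate for, say, very poor precision and recall with a perfect length ratio.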
For sentence-level score, only hLEPORa_baseline was submitted with scores calculated as the weighted harmonic mean of all the designed factors using default parameters.
For system-level score, both hLEPORa_baseline and hLEPORb_baseline were submitted, where hLEPORa_baseline is the average score of all sentence-level scores, and hLEPORb_baseline is calculated via the same sentence-level hLEPOR equation but replacing each factor value with its system-level counterpart.
Bannard and Callison-Burch, 2005) and integrates it into Meteor-based metrics.

PReP
PReP (Yoshimura et al., 2019) is a method for filtering pseudo-references to achieve a good match with a gold reference. At the beginning, the source sentence is translated with some off-the-shelf MT systems to create a set of pseudo-references. (Here the MT systems were Google Translate and Microsoft Bing Translator.) The pseudo-references are then filtered using BERT (Devlin et al., 2019) fine-tuned on the MRPC corpus (Dolan and Brockett, 2005), estimating the probability of the paraphrase between gold reference and pseudo-references. Thanks to the high quality of the underlying MT systems, a large portion of their outputs is indeed considered as a valid paraphrase.
The final metric score is calculated simply with SentBLEU with these multiple references.

WMDO
WMDO (Chow et al., 2019b) is a metric based on distance between distributions in the semantic vector space. Matching in the semantic space has been investigated for translation evaluation, but the constraints of a translation's word order have not been fully explored. Building on the Word Mover's Distance metric and various word embeddings, WMDO introduces a fragmentation penalty to account for fluency of a translation. This word order extension is shown to perform better than standard WMD, with promising results against other types of metrics.

YiSi-0, YiSi-1, YiSi-1_srl, YiSi-2, YiSi-2_srl
YiSi-1 is an MT evaluation metric that measures the semantic similarity between a machine translation and human references by aggregating the idf-weighted lexical semantic similarities based on the contextual embeddings extracted from BERT and optionally incorporating shallow semantic structures (denoted as YiSi-1_srl).
YiSi-0 is the degenerate version of YiSi-1 that is ready-to-deploy to any language. It uses longest common character substring to measure the lexical similarity.
YiSi-2 is the bilingual, reference-less version for MT quality estimation, which uses the contextual embeddings extracted from BERT to evaluate the crosslingual lexical semantic similarity between the input and MT output. Like YiSi-1, YiSi-2 can exploit shallow semantic structures as well (denoted as YiSi-2_srl).

QE Systems
In addition to the submitted standard metrics, 10 quality estimation systems were submitted to the "QE as a Metric" track. The submitted QE systems are evaluated in the same settings as metrics to facilitate comparison. Their descriptions can be found in the Findings of the WMT 2019 Shared Task on Quality Estimation (Fonseca et al., 2019).

Results
We discuss system-level results for news task systems in Section 5.1. The segment-level results are in Section 5.2.

System-Level Evaluation
As in previous years, we employ the Pearson correlation (r) as the main evaluation measure for system-level metrics. The Pearson correlation is as follows:

$$ r = \frac{\sum_{i=1}^{n} (H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (H_i - \bar{H})^2}\,\sqrt{\sum_{i=1}^{n} (M_i - \bar{M})^2}} $$

where $H_i$ are human assessment scores of all systems in a given translation direction, $M_i$ are the corresponding scores as predicted by a given metric, and $\bar{H}$ and $\bar{M}$ are their means, respectively.
Since some metrics, such as BLEU, aim to achieve a strong positive correlation with human assessment, while error metrics, such as TER, aim for a strong negative correlation, we compare metrics via the absolute value |r| of a given metric's correlation with human assessment.
Figure 1: System-level metric significance test results for DA human assessment for into English and out-of English language pairs (newstest2019): Green cells denote a statistically significant increase in correlation with human assessment for the metric in a given row over the metric in a given column according to Williams test.
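As a concrete illustration of the system-level measure, the following sketch computes the Pearson correlation between human scores and metric scores and takes its absolute value so that error metrics and similarity metrics are comparable. The function name and the scores are made up for the example; it is not the official evaluation script.

```python
import math


def pearson_r(human, metric):
    """Pearson correlation between human scores H_i and metric scores M_i."""
    n = len(human)
    if n != len(metric) or n < 2:
        raise ValueError("need two equally long lists with at least two systems")
    h_mean = sum(human) / n
    m_mean = sum(metric) / n
    covariance = sum((h - h_mean) * (m - m_mean) for h, m in zip(human, metric))
    h_var = sum((h - h_mean) ** 2 for h in human)
    m_var = sum((m - m_mean) ** 2 for m in metric)
    if h_var == 0.0 or m_var == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(h_var * m_var)


# Made-up system-level scores for three MT systems.
human_scores = [0.30, 0.10, -0.20]      # standardized DA averages
bleu_like = [38.0, 35.0, 30.5]          # higher is better
ter_like = [0.45, 0.50, 0.575]          # error rate: lower is better

r_bleu = pearson_r(human_scores, bleu_like)
r_ter = pearson_r(human_scores, ter_like)
abs_r_ter = abs(r_ter)  # the quantity compared across metrics
```

In this toy data both metrics are exact linear functions of the human scores, so the similarity metric correlates at +1 and the error metric at -1, and both receive the same |r|.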

System-Level Results
Tables 3, 4 and 5 provide the system-level correlations of metrics evaluating translation of newstest2019. The underlying texts are part of the WMT19 News Translation test set (newstest2019) and the underlying MT systems are all MT systems participating in the WMT19 News Translation Task. As recommended by Graham and Baldwin (2014), we employ the Williams significance test (Williams, 1959) to identify differences in correlation that are statistically significant. The Williams test is a test of significance of a difference in dependent correlations and is therefore suitable for the evaluation of metrics. Correlations not significantly outperformed by any other metric for the given language pair are highlighted in bold in Tables 3, 4 and 5.
Since pairwise comparisons of metrics may be also of interest, e.g. to learn which metrics significantly outperform the most widely employed metric BLEU, we include significance test results for every competing pair of metrics including our baseline metrics in Figure 1 and Figure 2.
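For readers unfamiliar with the Williams test, the sketch below computes its test statistic in the form commonly given for comparing two dependent correlations that share one variable (here: the human scores). The formula is written from the standard formulation as we understand it and should be checked against Williams (1959) or Graham and Baldwin (2014) before being relied upon; the function and argument names are our own. The statistic is compared against a t-distribution with n - 3 degrees of freedom, a step omitted here to keep the example free of external dependencies.

```python
import math


def williams_t(r_human_a, r_human_b, r_a_b, n):
    """Test statistic for H0: corr(human, A) == corr(human, B).

    r_human_a: correlation of metric A with human scores
    r_human_b: correlation of metric B with human scores
    r_a_b:     correlation between the two metrics themselves
    n:         number of MT systems the correlations are computed over
    """
    if n <= 3:
        raise ValueError("the test needs more than three systems")
    # Determinant of the 3x3 correlation matrix of (human, A, B).
    k = (1.0 - r_human_a ** 2 - r_human_b ** 2 - r_a_b ** 2
         + 2.0 * r_human_a * r_human_b * r_a_b)
    numerator = (r_human_a - r_human_b) * math.sqrt((n - 1) * (1.0 + r_a_b))
    denominator = math.sqrt(
        2.0 * k * (n - 1) / (n - 3)
        + ((r_human_a + r_human_b) ** 2 / 4.0) * (1.0 - r_a_b) ** 3
    )
    if denominator == 0.0:
        raise ValueError("degenerate correlations: statistic is undefined")
    return numerator / denominator


# Made-up correlations for two metrics evaluated on 20 systems.
t_stat = williams_t(0.90, 0.80, 0.70, n=20)
```

The statistic grows with the gap between the two correlations and with the correlation between the metrics themselves, which is why two highly inter-correlated metrics can be separated with relatively few systems.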
This year, the increased number of systems participating in the news tasks has provided a larger sample of system scores for testing metrics. Since we already have sufficiently conclusive results on genuine MT systems, we do not need to generate hybrid system results as in Graham and Liu (2016) and past metrics tasks.

Segment-Level Evaluation
Segment-level evaluation relies on the manual judgements collected in the News Translation Task evaluation. This year, we were again unable to follow the methodology outlined in Graham et al. (2015) for evaluation of segment-level metrics because the sampling of sentences did not provide a sufficient number of assessments of the same segment. We therefore convert pairs of DA scores for competing translations to daRR better/worse preferences as described in Section 2.3.2.
We measure the quality of metrics' segment-level scores against the daRR golden truth using a Kendall's Tau-like formulation, which is an adaptation of the conventional Kendall's Tau coefficient. Since we do not have a total order ranking of all translations, it is not possible to apply conventional Kendall's Tau (Graham et al., 2015).
Our Kendall's Tau-like formulation, τ, is as follows:

$$ \tau = \frac{|Concordant| - |Discordant|}{|Concordant| + |Discordant|} $$

where Concordant is the set of all human comparisons for which a given metric suggests the same order and Discordant is the set of all human comparisons for which a given metric disagrees. The formula is not specific with respect to ties, i.e. cases where the annotation says that the two outputs are equally good. The way in which ties (both in human and metric judgement) were incorporated in computing Kendall τ has changed across the years of WMT Metrics Tasks. Here we adopt the version used in WMT17 daRR evaluation. For a detailed discussion on other options, see also Macháček and Bojar (2014).
Whether or not a given comparison of a pair of distinct translations of the same source input, s1 and s2, is counted as a concordant (Conc) or discordant (Disc) pair is defined by the following matrix:

                        Metric
                   s1 < s2   s1 = s2   s1 > s2
Human   s1 < s2    Conc      Disc      Disc
        s1 = s2    -         -         -
        s1 > s2    Disc      Disc      Conc

In the notation of Macháček and Bojar (2014), this corresponds to the setup used in WMT12 (with a different underlying method of manual judgements, RR).
The key differences between the evaluation used in WMT14-WMT16 and the evaluation used in WMT17-WMT19 were (1) the move from RR to daRR and (2) the treatment of ties. In the years 2014-2016, ties in metrics scores were not penalized. With the move to daRR, where the quality of the two candidate translations is deemed substantially different and no ties in human judgements arise, it makes sense to penalize ties in metrics' predictions in order to promote discerning metrics.
Figure 2: System-level metric significance test results for DA human assessment in newstest2019 for German to Czech, German to French and French to German; green cells denote a statistically significant increase in correlation with human assessment for the metric in a given row over the metric in a given column according to Williams test.
Table 6: Segment-level metric results for to-English language pairs in newstest2019: absolute Kendall's Tau formulation of segment-level metric scores with DA scores; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
Table 7: Segment-level metric results for out-of-English language pairs in newstest2019: absolute Kendall's Tau formulation of segment-level metric scores with DA scores; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
Table 8: Segment-level metric results for language pairs not involving English in newstest2019: absolute Kendall's Tau formulation of segment-level metric scores with DA scores; correlations of metrics not significantly outperformed by any other for that language pair are highlighted in bold.
Note that the penalization of ties makes our evaluation asymmetric, dependent on whether the metric predicted the tie for a pair where humans predicted <, or >. It is now important to interpret the meaning of the comparison identically for humans and metrics. For error metrics, we thus reverse the sign of the metric score prior to the comparison with human scores: higher scores have to indicate better translation quality. In WMT19, the original authors did this for CharacTER.
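The tie handling and the sign reversal for error metrics can be made concrete with a short sketch. It assumes daRR judgements are available as `(segment_id, better_system, worse_system)` tuples and metric scores as a dictionary keyed by `(segment_id, system)`; these layouts, the function name and the scores are hypothetical and serve only the illustration.

```python
def kendall_tau_like(darr_judgements, metric_scores, higher_is_better=True):
    """Kendall's Tau-like agreement of a metric with daRR judgements.

    A pair is Concordant only if the metric strictly prefers the
    translation that humans preferred; metric ties and reversed
    orderings both count as Discordant. Human ties never occur because
    daRR excludes them by construction.
    """
    # Error metrics (lower is better) are flipped so that a higher
    # value always means better translation quality.
    sign = 1.0 if higher_is_better else -1.0
    concordant = 0
    discordant = 0
    for segment_id, better, worse in darr_judgements:
        better_score = sign * metric_scores[(segment_id, better)]
        worse_score = sign * metric_scores[(segment_id, worse)]
        if better_score > worse_score:
            concordant += 1
        else:
            discordant += 1  # includes metric ties
    total = concordant + discordant
    if total == 0:
        raise ValueError("no daRR judgements to evaluate against")
    return (concordant - discordant) / total


# Made-up example: humans preferred system A over B on three segments.
judgements = [("s1", "A", "B"), ("s2", "A", "B"), ("s3", "A", "B")]
scores = {
    ("s1", "A"): 0.9, ("s1", "B"): 0.5,  # agrees with humans
    ("s2", "A"): 0.4, ("s2", "B"): 0.4,  # metric tie: penalized
    ("s3", "A"): 0.2, ("s3", "B"): 0.7,  # disagrees with humans
}
tau = kendall_tau_like(judgements, scores)
```

Here one pair is Concordant and two are Discordant, so τ = (1 - 2) / 3; a metric that outputs identical scores for every translation would receive τ = -1 rather than 0.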
To summarize, the WMT19 Metrics Task for segment-level evaluation:
• ensures that error metrics are first converted to the same orientation as the human judgements, i.e. higher score indicating higher translation quality,
• excludes all human ties (this is already implied by the construction of daRR from DA judgements),
• counts metric ties as Discordant pairs.
Figure 3: daRR segment-level metric significance test results for into English and out-of English language pairs (newstest2019): Green cells denote a significant win for the metric in a given row over the metric in a given column according to bootstrap resampling.
Figure 4: daRR segment-level metric significance test results for German to Czech, German to French and French to German (newstest2019): Green cells denote a significant win for the metric in a given row over the metric in a given column according to bootstrap resampling.
We employ bootstrap resampling (Koehn, 2004;Graham et al., 2014b) to estimate confidence intervals for our Kendall's Tau formulation, and metrics with non-overlapping 95% confidence intervals are identified as having statistically significant difference in performance.
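A minimal sketch of this procedure is given below. It assumes a simple percentile bootstrap over the per-pair Concordant/Discordant outcomes of one metric; the exact resampling scheme, number of resamples and interval construction used for the official results may differ, and all names and defaults here are illustrative.

```python
import random


def tau_from_outcomes(outcomes):
    """Tau-like score from a list of booleans (True = Concordant pair)."""
    concordant = sum(1 for o in outcomes if o)
    discordant = len(outcomes) - concordant
    return (concordant - discordant) / len(outcomes)


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Tau-like score.

    Resamples the daRR pairs with replacement; `n_resamples`, `alpha`
    and `seed` are illustrative defaults, not official settings.
    """
    if not outcomes:
        raise ValueError("need at least one daRR pair")
    rng = random.Random(seed)
    n = len(outcomes)
    taus = sorted(
        tau_from_outcomes([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2.0) * n_resamples)
    upper_index = min(n_resamples - 1, int((1.0 - alpha / 2.0) * n_resamples))
    return taus[lower_index], taus[upper_index]


def significantly_different(ci_a, ci_b):
    """True if the two confidence intervals do not overlap."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]


# Made-up outcomes for two metrics on the same 200 daRR pairs.
metric_a = [True] * 170 + [False] * 30
metric_b = [True] * 100 + [False] * 100
ci_a = bootstrap_ci(metric_a)
ci_b = bootstrap_ci(metric_b)
different = significantly_different(ci_a, ci_b)
```

Requiring non-overlapping intervals is a conservative criterion: two metrics whose intervals overlap slightly may still differ, but are not reported as significantly different.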

Segment-Level Results
Results of the segment-level human evaluation for translations sampled from the News Translation Task are shown in Tables 6, 7 and 8.

Discussion
This year, human data was collected from reference-based evaluations (or "monolingual") and reference-free evaluations (or "bilingual").
The reference-based (monolingual) evaluations were obtained with the help of anonymous crowdsourcing, while the reference-less (bilingual) evaluations were mainly from MT researchers who contributed their time to the manual evaluation for each submitted system.

Stability across MT Systems
The observed performance of metrics depends on the underlying texts and systems that participate in the News Translation Task (see Section 2). For the strongest MT systems, distinguishing which system outputs are better is hard, even for human assessors. On the other hand, if the systems are spread across a wide performance range, it will be easier for metrics to correlate with human judgements.
To provide a more reliable view, we created plots of Pearson correlation when the underlying set of MT systems is reduced to the top n ones. One such sample plot is in Figure 5; plots for all language pairs and most of the metrics are in Appendix A.
As the plot documents, the official correlations reported in Tables 3 to 5 can lead to wrong conclusions. sacreBLEU-BLEU correlates at .969 when all systems are considered, but as we start considering only the top n systems, the correlation falls relatively quickly. With 10 systems, we are below .5 and when only the top 6 or 4 systems are considered, the correlation falls even to the negave values. Note that correlations point estimates (the value in the y-axis) become noiser with the decreasing number of the underlying MT systems. Figure 6 explains the situation and illus- Top 8 Top 10 Top 12 Top 15 All systems Figure 6 trates the sensitivity of the observed correlations to the exact set of systems. On the full set of systems, the single outlier (the worstperforming system called en_de_task) helps to achieve a great positive correlation. The majority of MT systems however form a cloud with Pearson correlation around .5 and the top 4 systems actually exhibit a negative correlation of the human score and sacreBLEU-BLEU.
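The top-n analysis can be sketched as follows. The sketch assumes that systems are ranked by their human score and that Pearson correlation is recomputed on the n best ones. All numbers are invented toy values chosen to mimic the outlier effect described above; they are not WMT19 data, and the function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_n_pearson(human, metric, n):
    """Pearson correlation restricted to the n systems with the best human score."""
    best = sorted(range(len(human)), key=lambda i: human[i], reverse=True)[:n]
    return pearson([human[i] for i in best], [metric[i] for i in best])


# Hypothetical toy systems: four close competitors plus one weak outlier.
toy_human = [0.40, 0.35, 0.30, 0.25, -1.50]
toy_metric = [40.0, 42.0, 41.0, 43.0, 10.0]

r_all = top_n_pearson(toy_human, toy_metric, 5)   # outlier included
r_top4 = top_n_pearson(toy_human, toy_metric, 4)  # outlier removed
```

With the outlier included, the toy correlation is strongly positive; once it is removed, the remaining four systems correlate negatively, which is the same qualitative pattern as in Figure 6.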
In Appendix A, baseline metrics are plotted in grey in all the plots, so that their trends can be observed jointly. In general, most baselines have similar correlations, as most baselines use similar features (n-gram or word-level features, with the exception of chrF). In a number of language pairs (de-en, de-fr, en-de, en-kk, lt-en, ru-en, zh-en), baseline correlations tend towards 0 (no correlation) or even negative Pearson correlation. For a widely applied metric such as sacreBLEU-BLEU, our analysis reveals weak correlation in comparing top state-of-the-art systems in these language pairs, especially in en-de, de-en, ru-en, and zh-en.
We will restrict our analysis to those language pairs where the baseline metrics have an obvious downward trend (de-en, de-fr, en-de, en-kk, lt-en, ru-en, zh-en). Examining the top-n correlation in the submitted metrics (not including QE systems), most metrics show the same degradation in correlation as the baselines. We note BERTr as the one exception, consistently degrading less and retaining positive correlation compared to other submitted metrics and baselines, in the language pairs where it participated.
For QE systems, we noticed that in some instances, QE systems have upward correlation trends when other metrics and baselines have downward trends. This is the case, for instance, for LP, UNI, and UNI+ in the de-en language pair, YiSi-2 in en-kk, and UNI and UNI+ in ru-en. These results suggest that QE systems such as UNI and UNI+ perform worse on judging systems of widely ranging quality, but better for top performing systems, or perhaps for systems closer in quality.
If our method of human assessment is sound, we should believe that BLEU, a widely applied metric, is no longer a reliable metric for judging our best systems. Future investigations are needed to understand when BLEU applies well, and why BLEU is not effective for output from our state of the art models.
Metrics and QE systems such as BERTr, ESIM, YiSi that perform well at judging our best systems often use more semantic features compared to our n-gram/char-gram based baselines. Future metrics may want to explore a) whether semantic features such as contextual word embeddings are achieving semantic understanding and b) whether semantic understanding is the true source of a metric's performance gains.
It should be noted that some language pairs do not show the strong degrading pattern with top-n systems this year, for instance en-cs, en-gu, en-ru, or kk-en. English-Chinese is particularly interesting because we see a clear trend towards better correlations as we reduce the set of underlying systems to the top scoring ones.

System-Level Evaluation
In system-level evaluation, the YiSi series of metrics achieves the highest correlations in several language pairs and is not significantly outperformed by any other metric (denoted as a "win" in the following) for almost all language pairs.
The new metric ESIM performs best on 5 language pairs (out of 18 language pairs) and obtains 11 "wins" out of 16 language pairs in which ESIM participated.
The metric EED performs better for language pairs out of English and excluding English than for into-English language pairs, achieving 7 out of 11 "wins" there.

Segment-Level Evaluation
For segment-level evaluation, most language pairs are quite discerning, with only one or two metrics taking the "winner" position (of not being significantly surpassed by others). Only French-German differs, with all metrics performing similarly except the significantly worse sentBLEU.
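The notion of a "winner" can be sketched in a few lines. The sketch assumes that each metric comes with a 95% confidence interval for its Kendall's Tau-like score and that metric A significantly surpasses metric B when the interval of A lies entirely above the interval of B (non-overlapping intervals). The metric names and interval values are hypothetical.

```python
def winners(intervals):
    """Return the metrics that no other metric significantly surpasses.

    intervals: dict mapping metric name -> (lower, upper) confidence bounds.
    """
    result = []
    for name, (_, upper) in intervals.items():
        surpassed = any(
            other_lower > upper
            for other, (other_lower, _) in intervals.items()
            if other != name
        )
        if not surpassed:
            result.append(name)
    return sorted(result)


# Hypothetical confidence intervals for three metrics on one language pair.
toy_intervals = {
    "metric_a": (0.30, 0.36),
    "metric_b": (0.34, 0.40),
    "metric_c": (0.10, 0.16),
}
toy_winners = winners(toy_intervals)
```

Here metric_a and metric_b have overlapping intervals, so both count as winners, while metric_c lies entirely below both of them.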
YiSi-1_srl stands out as the "winner" for all language pairs in which it participated. It was probably excluded from the remaining language pairs due to the lack of semantic information required by YiSi-1_srl. YiSi-1 participated in all language pairs and its correlations are comparable with those of YiSi-1_srl. ESIM is a "winner" in 6 out of all 18 language pairs.
Both YiSi and ESIM are based on neural networks (YiSi via word and phrase embeddings, as well as other types of available resources, ESIM via sentence embeddings). This is a confirmation of a trend observed last year.

QE Systems as Metrics
Generally, correlations for the standard reference-based metrics are clearly better than those in the "QE as a Metric" track, both when using monolingual and bilingual gold truth.
In system-level evaluation, correlations for "QE as a Metric" range from 0.028 to 0.947 across all language pairs and all metrics but they are very unstable. Even for a single metric, take UNI for example, the correlations range from 0.028 to 0.930 across language pairs.
In segment-level evaluation, correlations for QE metrics range from -0.153 to 0.351 across all language pairs and show the same instability across language pairs for a given metric.
In either case, we do not see any pattern that could explain the behaviour, e.g. whether the manual evaluation was monolingual or bilingual, or the characteristics of the given language pair.

Dependence on Implementation
As in previous years, we had multiple implementations for some metrics, BLEU and chrF in particular.
The detailed configurations of BLEU and sacreBLEU-BLEU differ and hence their scores and correlation results are different.
chrF and sacreBLEU-chrF use the same parameters and should thus deliver the same scores, but we still observe some differences, leading to different correlations. For instance, for German-French Pearson correlation, chrF obtains 0.931 (no win) but sacreBLEU-chrF reaches 0.952, tying for a win with other metrics.
We thus fully support the call for clarity by Post (2018b) and invite authors of metrics to include their implementations either in Moses scorer or sacreBLEU to achieve a long-term assessment of their metric.

Conclusion
This paper summarizes the results of the WMT19 shared task in machine translation evaluation, the Metrics Shared Task. Participating metrics were evaluated in terms of their correlation with human judgement at the level of the whole test set (system-level evaluation), as well as at the level of individual sentences (segment-level evaluation).
We reported scores for standard metrics requiring the reference as well as quality estimation systems which took part in the track "QE as a metric", joint with the Quality Estimation task.
At the system level, the best metrics reach Pearson correlations of over 0.95 across several language pairs. As expected, QE systems are visibly worse in all language pairs but they can also reach high system-level correlations, up to .947 (Chinese-English) or .936 (English-German) by YiSi-1_srl or over .9 for multiple language pairs by UNI.
An important caveat is that the correlations are heavily affected by the underlying set of MT systems. We explored this by reducing the set of systems to the top-n ones for various values of n and found that for many language pairs, system-level correlations are much worse when based on only the better performing systems. With both good and bad MT systems participating in the news task, the metrics results can be overly optimistic compared to what we get when evaluating state-of-the-art systems.
In terms of segment-level Kendall's τ results, the correlations of standard metrics varied between 0.03 and 0.59, and QE systems obtained even negative correlations.
The results confirm the observation from last year, namely that metrics based on word or sentence-level embeddings (YiSi and ESIM) achieve the highest performance.