Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance

This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics. We collected scores for 10 metrics from 8 research groups. In addition to that, we computed scores of 8 standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with the WMT18 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in judging the quality of a particular sentence relative to alternate outputs). This year, we employ a single kind of manual evaluation: direct assessment (DA).


Introduction
Accurate machine translation (MT) evaluation is important for measuring improvements in system performance. Human evaluation can be costly and time consuming, and it is not always available for the language pair of interest. Automatic metrics can be employed as a substitute for human evaluation in such cases; they aim to measure improvements to systems quickly and at no cost to developers. In the usual set-up, an automatic metric carries out a comparison of MT system output translations and human-produced reference translations to produce a single overall score for the system. Since there exists a large number of possible approaches to producing quality scores for translations, it is sensible to carry out a meta-evaluation of metrics with the aim of estimating their accuracy as a substitute for human assessment of translation quality. The Metrics Shared Task of WMT annually evaluates the performance of automatic machine translation metrics in their ability to provide a substitute for human assessment of translation quality.
Again, we keep the two main types of metric evaluation unchanged from the previous years.
In system-level evaluation, each metric provides a quality score for the whole translated test set (usually a set of documents, in fact). In segment-level evaluation, a score is assigned by a given metric to every individual sentence.
The underlying texts and MT systems come from the News Translation Task (Bojar et al., 2018, denoted as Findings 2018 in the following). The texts were drawn from the news domain and involve translations to/from Chinese (zh), Czech (cs), German (de), Estonian (et), Finnish (fi), Russian (ru), and Turkish (tr), each paired with English, making a total of 14 language pairs. A single form of golden truth of translation quality judgement is used this year:
• In Direct Assessment (DA), humans assess the quality of a given MT output translation by comparison with a reference translation (as opposed to the source and reference). DA is the new standard used in WMT News Translation Task evaluation, requiring only monolingual evaluators.
As in last year's evaluation, the official method of manual evaluation of MT outputs is no longer "relative ranking" (RR, evaluating up to five system outputs on an annotation screen relative to each other) as this was changed in 2017 to DA. For system-level evaluation, we thus use the Pearson correlation r of automatic metrics with DA scores. For segment-level evaluation, we re-interpret DA judgements as relative comparisons and use Kendall's τ as a substitute, see below for details and references.
Section 2 describes our datasets, i.e. the sets of underlying sentences, system outputs, human judgements of translation quality and also participating metrics. Sections 3.1 and 3.2 then provide the results of system and segment-level metric evaluation, respectively. We discuss the results in Section 4.

Data
This year, we provided the task participants with one test set along with reference translations and outputs of MT systems. Participants were free to choose which language pairs they wanted to participate in and whether they reported system-level scores, segment-level scores or both.

Test Sets
We use the following test set, i.e. a set of source sentences and reference translations: newstest2018 is the test set used in WMT18 News Translation Task (see Findings 2018), with approximately 3,000 sentences for each translation direction (except Chinese and Estonian which have 3,981 and 2,000 sentences, resp.). newstest2018 includes a single reference translation for each direction.

Translation Systems
The results of the Metrics Task are likely affected by the actual set of MT systems participating in a given translation direction. For instance, if all of the systems perform similarly, it will be more difficult, even for humans, to distinguish between the quality of translations. If the task includes a wide range of systems of varying quality, however, or systems are quite different in nature, this could in some way make the task easier for metrics, with metrics that are more sensitive to certain aspects of MT output performing better. This year, the MT systems included in the Metrics Task were:
News Task Systems are machine translation systems participating in the WMT18 News Translation Task (see Findings 2018).
Hybrid Systems are created automatically with the aim of providing a larger set of systems against which to evaluate metrics, as in Graham and Liu (2016). Hybrid systems were created for newstest2018 by randomly selecting a pair of MT systems from all systems taking part in that language pair and producing a single output document by randomly selecting sentences from either of the two systems. In short, we create 10K hybrid MT systems for each language pair.
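The hybrid construction described above can be illustrated with a minimal Python sketch. It assumes each system's output is a list of segments aligned by position; the function name `make_hybrid`, the toy system names and the fixed seed are hypothetical and are not taken from the task's actual scripts.

```python
import random


def make_hybrid(outputs, rng):
    """Build one hybrid system from a dict: system name -> list of segments.

    Assumes all systems translated the same test set, so lists are aligned.
    """
    # Randomly pick two distinct systems for this hybrid.
    sys_a, sys_b = rng.sample(sorted(outputs), 2)
    n_segments = len(outputs[sys_a])
    # For every segment position, randomly take the output of one of the two.
    choices = [rng.choice((sys_a, sys_b)) for _ in range(n_segments)]
    segments = [outputs[name][i] for i, name in enumerate(choices)]
    return (sys_a, sys_b), choices, segments


rng = random.Random(0)  # fixed seed only to make the toy run repeatable
outputs = {
    "sysA": ["a1", "a2", "a3", "a4"],
    "sysB": ["b1", "b2", "b3", "b4"],
    "sysC": ["c1", "c2", "c3", "c4"],
}
pair, choices, hybrid = make_hybrid(outputs, rng)
```

Repeating the call 10,000 times per language pair would give a sample of the size mentioned in the text; keeping `choices` makes it possible to assemble the matching human score for each hybrid from the same segments.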
Excluding the hybrid systems, we ended up with 149 systems across 14 language pairs.

Manual MT Quality Judgments
Direct Assessment (DA) was employed as the "golden truth" to evaluate metrics again this year. The details of this method of human evaluation are provided in two sections, for system-level evaluation (Section 2.3.1) and segment-level evaluation (Section 2.3.2).
The DA manual judgements were provided by MT researchers taking part in WMT tasks, a number of in-house human evaluators at Amazon and crowd-sourced workers on Amazon Mechanical Turk. Only judgements from workers who passed DA's quality control mechanism were included in the final datasets used to compute system and segment-level scores employed as a gold standard in the Metrics Task.

System-level Manual Quality Judgments
In the system-level evaluation, the goal is to assess the quality of translation of an MT system for the whole test set. Our manual scoring method, DA, nevertheless proceeds sentence by sentence, aggregating the final score as described below.
Direct Assessment (DA) This year the translation task employed monolingual direct assessment (DA) of translation adequacy (Graham et al., 2013; Graham et al., 2014a). Since sufficient levels of agreement in human assessment of translation quality are difficult to achieve, the DA setup simplifies the task of translation assessment (conventionally a bilingual task) into a simpler monolingual assessment. In addition, DA avoids a bias that was problematic in previous evaluations, introduced by the assessment of several alternate translations on a single screen, where scores for translations had been unfairly penalized if often compared to high quality translations (Bojar et al., 2011). DA therefore employs assessment of individual translations in isolation from other outputs. Translation adequacy is structured as a monolingual assessment of similarity of meaning where the target language reference translation and the MT output are displayed to the human assessor. Assessors rate a given translation by how adequately it expresses the meaning of the reference translation on an analogue scale corresponding to an underlying 0-100 rating scale. Large numbers of DA human assessments of translations for all 14 language pairs included in the News Translation Task were collected from researchers and from workers on Amazon's Mechanical Turk, via sets of 100-translation HITs to ensure sufficient repeat assessments per worker, before application of strict quality control measures to filter out assessments from poor performers.
In order to iron out differences in scoring strategies attributed to distinct human assessors, human assessment scores for translations were standardized according to an individual judge's overall mean and standard deviation score. Final scores for MT systems were computed by firstly taking the average of scores for individual translations in the test set (since some were assessed more than once), before combining all scores for translations attributed to a given MT system into its overall adequacy score. The gold standard for system-level DA evaluation is thus what is denoted "Ave z" in Findings 2018 (Bojar et al., 2018).
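The standardization and aggregation steps can be sketched as follows. This is an illustration under assumptions rather than the official scoring code: the record layout `(judge, system, segment_id, raw_score)` is hypothetical, the population standard deviation is used (the text does not say which estimator was applied), and a judge with zero variance is mapped to 0.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize(judgements):
    """Convert raw DA scores to z-scores using each judge's own mean and std."""
    by_judge = defaultdict(list)
    for judge, _, _, score in judgements:
        by_judge[judge].append(score)
    stats = {j: (mean(s), pstdev(s)) for j, s in by_judge.items()}
    standardized = []
    for judge, system, seg, score in judgements:
        mu, sd = stats[judge]
        z = (score - mu) / sd if sd > 0 else 0.0  # assumption for constant judges
        standardized.append((system, seg, z))
    return standardized


def system_scores(z_judgements):
    """Average repeat assessments per translation, then average per system."""
    per_translation = defaultdict(list)
    for system, seg, z in z_judgements:
        per_translation[(system, seg)].append(z)
    per_system = defaultdict(list)
    for (system, _), zs in per_translation.items():
        per_system[system].append(mean(zs))
    return {system: mean(values) for system, values in per_system.items()}


judgements = [
    ("j1", "A", 0, 80), ("j1", "B", 0, 60),
    ("j1", "A", 1, 70), ("j1", "B", 1, 50),
    ("j2", "A", 0, 90), ("j2", "B", 0, 30),
]
z_scores = standardize(judgements)
scores = system_scores(z_scores)  # plays the role of "Ave z" per system
```

In this toy data judge j2 uses a much wider range of the scale than j1; after standardization both judges contribute on a comparable scale.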
Finally, although it was necessary to apply a sentence length restriction in WMT human evaluation prior to the introduction of DA, the simplified DA setup does not require restriction of the evaluation in this respect and no sentence length restriction was applied in DA WMT18.

Segment-level Manual Quality Judgments
Segment-level metrics have been evaluated against DA annotations for the newstest2018 test set. This year, a standard segment-level DA evaluation of metrics, where each translation is assessed a minimum of 15 times, was unfortunately not possible due to an insufficient number of judgements collected. DA judgements were therefore converted to relative ranking judgements (daRR) to produce results. This is the same strategy as carried out for some out-of-English language pairs in last year's evaluation.
daRR When we have at least two DA scores for translations of the same source input, it is possible to convert those DA scores into a relative ranking judgement, if the difference in DA scores allows the conclusion that one translation is better than the other. In the following, we will denote these re-interpreted DA judgements as "daRR", to distinguish them clearly from the "RR" golden truth used in the past years.
Since the analogue rating scale employed by DA is marked at the 0-25-50-75-100 points, the difference in DA scores we employ to distinguish translations that are better/worse than one another is 25 points. Note that we rely on judgements collected from known-reliable volunteers and crowd-sourced workers who passed DA's quality control mechanism. Any inconsistency that could arise from reliance on DA judgements collected from low quality crowd-sourcing, for example, is thus prevented.
Table 1: Number of judgements for DA converted to daRR data; "DA>1" is the number of source input sentences in the manual evaluation where at least two translations of that same source input segment received a DA judgement; "Ave" is the average number of translations with at least one DA judgement available for the same source input sentence; "DA pairs" is the number of all possible pairs of translations of the same source input resulting from "DA>1"; and "daRR" is the number of DA pairs with an absolute difference in DA scores greater than the 25 percentage point margin.
From the complete set of human assessments collected for the News Translation Task, all possible pairs of DA judgements attributed to distinct translations of the same source were converted into daRR better/worse judgements. Distinct translations of the same source input whose DA scores fell within 25 percentage points (which could have been deemed equal quality) were omitted from the evaluation of segment-level metrics. Conversion of scores in this way produced a large set of daRR judgements for all language pairs, shown in Table 1, thanks to the combinatorial advantage of extracting daRR judgements from all possible pairs of translations of the same source input.
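A minimal sketch of the DA-to-daRR conversion follows. It assumes one DA score per translation on the 0-100 scale and a strict "greater than 25" threshold, as stated in the Table 1 description; the function name `da_to_darr` and the nested-dict input layout are hypothetical.

```python
from itertools import combinations


def da_to_darr(da_scores, margin=25.0):
    """Turn DA scores into better/worse pairs.

    da_scores: dict source_id -> dict system -> DA score (0-100).
    Returns a list of (source_id, better_system, worse_system).
    """
    pairs = []
    for src, by_system in da_scores.items():
        # All possible pairs of distinct translations of the same source.
        for a, b in combinations(sorted(by_system), 2):
            diff = by_system[a] - by_system[b]
            if abs(diff) > margin:  # pairs within the margin are dropped
                pairs.append((src, a, b) if diff > 0 else (src, b, a))
    return pairs


da = {
    "s1": {"A": 90, "B": 60, "C": 70},  # only A vs. B differs by more than 25
    "s2": {"A": 40, "B": 65},           # exactly 25 apart: omitted
}
darr = da_to_darr(da)
```

With k judged translations of one source sentence, up to k(k-1)/2 pairs are examined, which is the combinatorial advantage mentioned above.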

Kendall's Tau-like Formulation for daRR
We measure the quality of metrics' segment-level scores against the daRR golden truth using a Kendall's Tau-like formulation, which is an adaptation of the conventional Kendall's Tau coefficient. Since we do not have a total order ranking of all translations we use to evaluate metrics, it is not possible to apply conventional Kendall's Tau given the current daRR human evaluation setup (Graham et al., 2015). Vazquez-Alvarez and Huckvale (2002) also note that a genuine pairwise comparison is likely to lead to more stable results for segment-level metric evaluation.
Our Kendall's Tau-like formulation, τ, is as follows:
\[ \tau = \frac{|\mathit{Concordant}| - |\mathit{Discordant}|}{|\mathit{Concordant}| + |\mathit{Discordant}|} \]
where Concordant is the set of all human comparisons for which a given metric suggests the same order and Discordant is the set of all human comparisons for which a given metric disagrees. The formula is not specific with respect to ties, i.e. cases where the annotation says that the two outputs are equally good. The way in which ties (both in human and metric judgement) were incorporated in computing Kendall τ has changed across the years of WMT Metrics Tasks. Here we adopt the version used in last year's WMT17 daRR evaluation (but not earlier). For a detailed discussion on other options, see also Macháček and Bojar (2014).
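Whether or not a given comparison of a pair of distinct translations of the same source input, $s_1$ and $s_2$, is counted as a concordant (Conc) or discordant (Disc) pair is defined by the following matrix:
\[
\begin{array}{cc|ccc}
 & & \multicolumn{3}{c}{\text{Metric}} \\
 & & s_1 < s_2 & s_1 = s_2 & s_1 > s_2 \\ \hline
 & s_1 < s_2 & \text{Conc} & \text{Disc} & \text{Disc} \\
\text{Human} & s_1 = s_2 & - & - & - \\
 & s_1 > s_2 & \text{Disc} & \text{Disc} & \text{Conc}
\end{array}
\]
In the notation of Macháček and Bojar (2014), this corresponds to the setup used in WMT12 (with a different underlying method of manual judgements, RR):
\[
\begin{array}{cc|ccc}
 & & \multicolumn{3}{c}{\text{Metric}} \\
 & & < & = & > \\ \hline
 & < & 1 & -1 & -1 \\
\text{Human} & = & X & X & X \\
 & > & -1 & -1 & 1
\end{array}
\]
The key differences between the evaluation used in WMT14-WMT16 and evaluation used in WMT17 and WMT18 are (1) the move from RR to daRR and (2) the treatment of ties. In the years 2014-2016, ties in metrics scores were not penalized. With the move to daRR, where the quality of the two candidate translations is deemed substantially different and no ties in human judgements arise, it makes sense to penalize ties in metrics' predictions in order to promote discerning metrics.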
Note that the penalization of ties makes our evaluation asymmetric, dependent on whether the metric predicted the tie for a pair where humans predicted < or >. It is now important to interpret the meaning of the comparison identically for humans and metrics. For error metrics, we thus reverse the sign of the metric score prior to the comparison with human scores: higher scores have to indicate better translation quality. In WMT18, we did this for ITER and the original authors did this for CharacTER.
To summarize, the WMT18 Metrics Task for segment-level evaluation:
• excludes all human ties (this is already implied by the construction of daRR from DA judgements),
• counts metric ties as Discordant pairs,
• ensures that error metrics are first converted to the same orientation as the human judgements, i.e. higher score indicating higher translation quality.
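The three rules summarized above can be put together in a short sketch. It assumes τ is computed as (|Concordant| − |Discordant|) / (|Concordant| + |Discordant|) and that daRR judgements arrive as (source, better, worse) triples; the function name, the `higher_is_better` flag and the toy scores are hypothetical.

```python
def kendall_tau_like(darr, metric_scores, higher_is_better=True):
    """Kendall's Tau-like agreement between a metric and daRR judgements.

    darr: list of (source_id, better_system, worse_system); human ties are
          already absent by construction.
    metric_scores: dict (source_id, system) -> segment-level metric score.
    """
    # Error metrics are flipped so that higher always means better.
    sign = 1.0 if higher_is_better else -1.0
    concordant = discordant = 0
    for src, better, worse in darr:
        m_better = sign * metric_scores[(src, better)]
        m_worse = sign * metric_scores[(src, worse)]
        if m_better > m_worse:
            concordant += 1
        else:
            # Opposite order or a metric tie: both count as discordant.
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


darr = [("s1", "A", "B"), ("s1", "A", "C"), ("s2", "B", "C"), ("s3", "A", "B")]
quality = {
    ("s1", "A"): 0.9, ("s1", "B"): 0.5, ("s1", "C"): 0.9,  # A and C tie
    ("s2", "B"): 0.3, ("s2", "C"): 0.6,                    # wrong order
    ("s3", "A"): 0.7, ("s3", "B"): 0.2,
}
tau = kendall_tau_like(darr, quality)  # 2 concordant, 2 discordant

# The same judgements expressed as an error rate (lower is better).
errors = {key: 1.0 - value for key, value in quality.items()}
tau_err = kendall_tau_like(darr, errors, higher_is_better=False)
```

Because the metric tie on ("s1", "A", "C") is penalized, a metric that outputs the same score for every segment would obtain τ = −1 under this scheme.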
We employ bootstrap resampling (Koehn, 2004;Graham et al., 2014b) to estimate confidence intervals for our Kendall's Tau formulation, and metrics with non-overlapping 95% confidence intervals are identified as having statistically significant difference in performance. Table 2 lists the participants of the WMT18 Shared Metrics Task, along with their metrics. We have collected 10 metrics from a total of 8 research groups.
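As an illustration of the bootstrap confidence intervals for τ mentioned above, the sketch below resamples daRR outcomes with replacement and reads off a percentile interval. The resampling unit (individual daRR pairs), the number of resamples, the percentile method and the function names are assumptions made for this example and may differ from the procedure actually used.

```python
import random


def tau_like(outcomes):
    """outcomes: list of booleans, True = concordant, False = discordant."""
    concordant = sum(outcomes)
    discordant = len(outcomes) - concordant
    return (concordant - discordant) / len(outcomes)


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Tau-like statistic."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        tau_like([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: a metric that agrees with humans on 70 of 100 daRR pairs.
outcomes = [True] * 70 + [False] * 30
point = tau_like(outcomes)
low, high = bootstrap_ci(outcomes)
```

Two metrics would then be reported as significantly different when their intervals do not overlap, as described in the text.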

Participants of the Metrics Shared Task
The following subsections provide a brief summary of all the metrics that participated. The list is concluded by our baseline metrics in Section 2.4.9.
As in last year's task, we asked participants whose metrics are publicly available to provide links to where the code can be accessed. Table 3 provides links for metrics that participated in WMT18 that are publicly available for download.

BEER
BEER (Stanojević and Sima'an, 2015) is a trained evaluation metric with a linear model that combines sub-word feature indicators (character n-grams) and global word order features (skip bigrams) to obtain a language-agnostic and fast-to-compute evaluation metric. BEER has participated in previous years of the evaluation task.

Blend
Blend incorporates existing metrics to form an effective combined metric, employing SVM regression for training and DA scores as the gold standard. For to-English language pairs, incorporated metrics include 25 lexical-based metrics and 4 other metrics. Since some lexical-based metrics are simply different variants of the same metric, there are only 9 kinds of lexical-based metrics, namely BLEU, NIST, GTM, METEOR, ROUGE, Ol, WER, TER and PER. The 4 other metrics are CharacTER, BEER, DPMF and ENTF.
Blend participated in the Metrics Task in WMT17. This year, Blend follows its setup from WMT17, but enlarges the training data, since some data from WMT17 are now available. For to-English language pairs, there are 9280 sentences of training data, while there are 1620 sentences for English-Russian (en-ru). Experiments show that the performance of Blend can be improved if the training data increases.
Blend can be applied to any language pair, provided that the incorporated metrics support the specific language pair and DA scores are available.
Table 2: Participants of WMT18 Metrics Shared Task. "•" denotes that the metric took part in (some of the language pairs) of the segment- and/or system-level evaluation and whether hybrid systems were also scored. "⊘" indicates that the system-level and hybrids are implied, simply taking arithmetic average of segment-level scores. "⋆" indicates that the original ITER system-level scores should be calculated as the micro-average of segment-level scores but we calculate them as simple macro-averaged for the hybrid systems. See the ITER paper for more details.

CharacTer
CharacTer (Wang et al., 2016b; Wang et al., 2016a), identical to the 2016 setup, is a character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis, until it completely matches the reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character-level edit distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word and could be shifted, if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower TER. Similarly to other character-level metrics, CharacTer is applied to non-tokenized outputs and references, which also holds for this year's submission.
This year tokenization was carried out for en-ru hypotheses and references before calculating the scores, since this results in large improvements in terms of correlations. For other language pairs a tokenizer was not used for pre-processing. A Python library was used for calculating the Levenshtein distance, so that the metric is now about 7 times faster than before.

ITER
ITER (Panja and Naskar, 2018) is an improved Translation Edit/Error Rate (TER) metric. In addition to the basic edit operations in TER (insertion, deletion, substitution and shift), ITER also allows stem matching and uses optimizable edit costs and better normalization.
Note that for segment-level evaluation, we reverse the sign of the score, so that better translations get higher scores. For system-level confidence, we calculate the system-level scores for hybrid systems slightly differently than the original ITER definition would require. We use the unweighted arithmetic average of segment-level scores (macro-average) whereas ITER would use the micro-average.
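The difference between the two averaging schemes can be made concrete with a small sketch. It assumes, purely for illustration, that a segment-level error rate is a ratio of an edit count to a normalizing length; this decomposition and the function names are hypothetical and are not taken from the ITER definition.

```python
def macro_average(segments):
    """Unweighted mean of per-segment ratios; every segment counts equally."""
    return sum(edits / length for edits, length in segments) / len(segments)


def micro_average(segments):
    """Ratio of totals; longer segments carry more weight."""
    total_edits = sum(edits for edits, _ in segments)
    total_length = sum(length for _, length in segments)
    return total_edits / total_length


# (edit count, normalizing length) for a short and a long segment
segments = [(1, 2), (2, 20)]
macro = macro_average(segments)  # mean of 0.5 and 0.1
micro = micro_average(segments)  # 3 edits over 22 units
```

The two values diverge most when segment lengths vary widely, which is why the choice matters for system-level scores derived from segment-level ones.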

meteor++
meteor++ (Guo et al., 2018) is a metric based on Meteor (Denkowski and Lavie, 2014), adding explicit treatment of "copy-words", i.e. words that are likely to be preserved across all paraphrases of a sentence in a given language.

RUSE
RUSE (Shimanaka et al., 2018) is a perceptron regressor based on three types of sentence embeddings: Infersent, Quick-Thought and Universal Sentence Encoder, designed with the aim to utilize global sentence information that cannot be captured by local features based on character or word n-grams. The sentence embeddings come from pre-trained models and the regression itself is trained on past manual judgements in WMT shared tasks.

UHH_TSKM
UHH_TSKM (Duma and Menzel, 2017) is a non-trained metric utilizing kernel functions, i.e. methods for efficient calculation of overlap of substructures between the candidate and the reference translations. The metric uses both sequence kernels, applied on the tokenized input data, together with tree kernels, that exploit the syntactic structure of the sentences. Optionally, the match can also be performed for the candidate and a pseudoreference (i.e. a translation by another MT system) or for the source sentence and the candidate back-translated into the source language.

YiSi
YiSi-1 also successfully served in the parallel corpus filtering task. Some details are provided in the system description paper (?).
YiSi-1 measures the relative lexical semantic similarity (weighted word embeddings cosine similarity aggregated into n-grams similarity) of the candidate and reference translations, optionally taking the shallow semantic structure ("srl") into account. YiSi-0 is a degenerate resource-free version using the longest common character substring, instead of word embeddings cosine similarity, to measure the word similarity of the candidate and reference translations.

Baseline Metrics
As mentioned by Bojar et al. (2016), Metrics Task occasionally suffers from "loss of knowledge" when successful metrics participate only in one year.
We attempt to avoid this by regularly evaluating also a range of "baseline metrics":
• Mteval. The metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were computed using the script mteval-v13a.pl that is used in the OpenMT Evaluation Campaign and includes its own tokenization. We run mteval with the flag --international-tokenization since it performs slightly better (Macháček and Bojar, 2013).
• Moses Scorer. The metrics TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006) were produced by the Moses scorer, which is used in Moses model optimization. To tokenize the sentences, we used the standard tokenizer script as available in Moses toolkit. When tokenizing, we also convert all outputs to lowercase. Since Moses scorer is versioned on GitHub, we strongly encourage authors of high-performing metrics to add them to Moses scorer, as this will ensure that their metric can be easily included in future tasks.
• SentBLEU. The metric sentBLEU is computed using the script sentence-bleu, a part of the Moses toolkit. It is a smoothed version of BLEU that correlates better with human judgements at the segment level. Standard Moses tokenizer is used for tokenization.
• chrF. We run chrF++.py with the parameters -nw 0 -b 3 to obtain the chrF score and with -nw 0 -b 1 to obtain the chrF+ score. Note that chrF intentionally removes all spaces before matching the n-grams, detokenizing the segments but also concatenating words. We originally planned to use the chrF implementation which was recently made available in Moses Scorer but it mishandles Unicode characters for now.
The baselines serve in system and segment-level evaluations as customary: BLEU, TER, WER, PER and CDER for system-level only; sentBLEU for segment-level only and chrF for both.
Chinese word segmentation is unfortunately not supported by the tokenization scripts mentioned above. For scoring Chinese with baseline metrics, we thus preprocessed MT outputs and reference translations with the script tokenizeChinese.py by Shujian Huang, which separates Chinese characters from each other and also from non-Chinese parts.
For computing system-level and segment-level scores, as well as for generating hybrid systems from the given hybrid descriptions, the same scripts were employed as in last year's Metrics Task.

Results
We discuss system-level results for news task systems in Section 3.1. The segment-level results are in Section 3.2.

System-Level Results
As in previous years, we employ the Pearson correlation (r) as the main evaluation measure for system-level metrics. The Pearson correlation is as follows:
\[ r = \frac{\sum_{i} (H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i} (H_i - \bar{H})^2}\,\sqrt{\sum_{i} (M_i - \bar{M})^2}} \]
where $H_i$ are the human assessment scores of all systems in a given translation direction and $M_i$ are the corresponding scores as predicted by a given metric. $\bar{H}$ and $\bar{M}$ are their respective means.
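For concreteness, the system-level Pearson correlation can be computed directly from the standard definition, as in the sketch below. The toy scores are invented for illustration and do not come from the task data.

```python
from math import sqrt


def pearson_r(human, metric):
    """Pearson correlation between human scores H_i and metric scores M_i."""
    n = len(human)
    h_bar = sum(human) / n
    m_bar = sum(metric) / n
    covariance = sum((h - h_bar) * (m - m_bar) for h, m in zip(human, metric))
    var_h = sum((h - h_bar) ** 2 for h in human)
    var_m = sum((m - m_bar) ** 2 for m in metric)
    return covariance / (sqrt(var_h) * sqrt(var_m))


# Invented scores for four systems: standardized human scores, a
# quality-oriented metric and an error-oriented metric.
human = [0.30, 0.10, -0.05, -0.35]
quality_metric = [31.0, 28.5, 27.0, 22.0]
error_metric = [0.52, 0.55, 0.58, 0.66]

r_quality = pearson_r(human, quality_metric)  # positive
r_error = pearson_r(human, error_metric)      # negative for an error metric
```

An error metric that tracks human judgement well yields a strongly negative r, so its magnitude rather than its sign is the informative part.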
Since some metrics, such as BLEU, for example, aim to achieve a strong positive correlation with human assessment, while error metrics, such as TER aim for a strong negative correlation, after computation of r for metrics, we compare metrics via the absolute value of a given metric's correlation with human assessment. Table 4 provides the system-level correlations of metrics evaluating translation of newstest2018 into English while Table 5 provides the same for out-of-English language pairs. The underlying texts are part of the WMT18 News Translation test set (newstest2018) and the underlying MT systems are all MT systems participating in the WMT18 News Translation Task with the exception of a single tr-en system not included in the initial human evaluation run.
As recommended by Graham and Baldwin (2014), we employ Williams significance test (Williams, 1959) to identify differences in correlation that are statistically significant. Williams test is a test of significance of a difference in dependent correlations and therefore suitable for evaluation of metrics. Correlations not significantly outperformed by any other metric for the given language pair are highlighted in bold in Tables 4 and 5.
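The Williams test statistic can be written down compactly. The sketch below uses the textbook form of the statistic for two dependent correlations that share one variable; the argument names are hypothetical, and the p-value step is left out to keep the example free of third-party dependencies.

```python
from math import sqrt


def williams_t(r12, r13, r23, n):
    """Williams test statistic for the difference between r12 and r13.

    r12: correlation of human scores with metric A
    r13: correlation of human scores with metric B
    r23: correlation between metric A and metric B
    n:   number of MT systems; the statistic has n - 3 degrees of freedom
    """
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * sqrt((n - 1) * (1 + r23))
    denominator = sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator


# Same gap between the two metrics, different inter-metric correlation.
t_loose = williams_t(0.95, 0.90, 0.80, n=10)
t_tight = williams_t(0.95, 0.90, 0.95, n=10)
```

A one-sided p-value can then be obtained from a Student t distribution with n − 3 degrees of freedom, for instance with `scipy.stats.t.sf`. In this toy comparison the statistic grows when the two metrics correlate more strongly with each other, even though the gap between r12 and r13 is unchanged.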
Since pairwise comparisons of metrics may be also of interest, e.g. to learn which metrics significantly outperform the most widely employed metric BLEU, we include significance test results for every competing pair of metrics including our baseline metrics in Figure 1. The sample of systems we employ to evaluate metrics is often small, as few as five MT systems for cs-en, for example. This can lead to inconclusive results, as identification of significant differences in correlations of metrics is unlikely at such a small sample size. Furthermore, Williams test takes into account the correlation between each pair of metrics, in addition to the correlation between the metric scores themselves, and this latter correlation increases the likelihood of a significant difference being identified.
Figure 1: System-level metric significance test results for DA human assessment in newstest2018; green cells denote a statistically significant increase in correlation with human assessment for the metric in a given row over the metric in a given column according to Williams test.
Figure 2: System-level metric significance test results for 10K hybrid systems (DA human evaluation) from newstest2018; green cells denote a statistically significant increase in correlation with human assessment for the metric in a given row over the metric in a given column according to Williams test.

To strengthen the conclusions of our evaluation, we include significance test results for large hybrid-super-samples of systems (Graham and Liu, 2016). 10K hybrid systems were created per language pair, with corresponding DA human assessment scores, by sampling pairs of systems from the WMT18 News Translation Task and creating hybrid systems by randomly selecting each candidate translation from one of the two selected systems. Similar to last year, not all metrics participating in the system-level evaluation submitted metric scores for the large set of hybrid systems. Fortunately, taking a simple average of segment-level scores is the proper aggregation method for almost all metrics this year, so where needed, we provided scores for hybrids ourselves, see Table 2.
Correlations of metric scores with human assessment of the large set of hybrid systems are shown in Tables 6 and 7, where again metrics not significantly outperformed by any other are highlighted in bold. Figure 2 then provides significance test results for hybrid supersampled correlations for all pairs of competing metrics for a given language pair.

Segment-Level Results
Segment-level evaluation relies on the manual judgements collected in the News Translation Task evaluation. This year, we were unable to follow the methodology outlined in Graham et al. (2015) for evaluation of segment-level metrics because the sampling of sentences did not provide a sufficient number of assessments of the same segment. We therefore convert pairs of DA scores for competing translations to daRR better/worse preferences and employ a Kendall's Tau formulation as described in Section 2.3.2. Results of the segment-level human evaluation for translations sampled from the News Translation Task are shown in Tables 8 and 9, where metric correlations not significantly outperformed by any other metric are highlighted in bold. Head-to-head significance test results for differences in metric performance are included in Figure 3.

Obtaining Human Judgements
Human data was collected in the usual way, a portion via crowd-sourcing and the remainder from researchers who mainly committed their time to the manual evaluation because they had submitted a system in that language pair. Evaluation of translations employed the DA set-up and it again successfully acquired sufficient judgments to evaluate systems. As in the previous years, hybrid supersampling proved very effective and allowed us to obtain conclusive results of system-level evaluation even for language pairs where as few as 5 MT systems participated. We should however note that hybrid systems are constructed by randomly mixing sentences coming from different MT systems. As soon as document-level evaluation becomes relevant (which we anticipate in the next evaluation campaign already), this style of hybridization is susceptible to breaking cross-sentence references in MT outputs and may no longer be applicable.
In the case of segment-level evaluation, the optimal human evaluation data was unfortunately not available due to resource constraints. Conversion of document-level data served as a substitute for segment-level DA scores. These scores are however not optimal for evaluation of segment-level metrics and we would like to return to DA's standard segment-level evaluation in future, where a minimum of 15 human judgments of translation quality are collected per translation and combined to get highly accurate scores for translations.
Figure 3: daRR segment-level metric significance test results for all language pairs (newstest2018): Green cells denote a significant win for the metric in a given row over the metric in a given column according to bootstrap resampling.

Overall Metric Performance
As always, the observed performance of metrics depends on the underlying texts and systems that participate in the News Translation Task. Two new metrics, RUSE and YiSi, stand out as metrics that achieve the highest correlation in the system-level evaluation in more than one language pair according to the hybrid evaluation, and perform very well across all their language pairs on average. ITER also performs very well in en-et, en-fi, zh-en and several other languages but fails for en-ru and en-cs, which drags its overall performance down. Both YiSi and RUSE are based on neural networks (YiSi via word and phrase embeddings, RUSE via sentence embeddings). This is a new trend compared to last year's evaluation, where the best performance was reached by the character-level (not deep) metrics BEER, chrF (and its variants) and CharacTer.
It is however important to note that the results of performance aggregated over language pairs are not particularly stable across years. In last year's evaluation, NIST seemed worse than TER. The overall result is the opposite this year and NIST even ranks slightly better than RUSE in terms of average system-level correlation across languages.
Overall, the reported figures confirm the observation from the past years that system-level metrics can achieve correlations above 0.9 but even the best ones can fall to 0.7 or 0.8 for some language pairs. Kendall's Tau values achieved by segment-level metrics are generally lower, in the range of 0.25-0.4. The best metrics in their best language pairs can reach segment-level correlations with humans of up to 0.69. This capping could possibly be attributed in part to the sub-optimal human evaluation data, i.e. DA judgements converted to relative ranking.
Two metrics that stand out as performing consistently well are RUSE for evaluation of into-English translation and YiSi-1* for out-of-English. Overall, YiSi*, BEER, CharacTER, RUSE, and BLEND (in this order) outperform sentBLEU.
All of the "winners" in this year's campaign are publicly available, which is very good for their prospective wider adoption. If participants could put in the additional effort of adding their code to Moses scorer, this would guarantee their long-term inclusion in the Metrics Task.

Conclusion
This paper summarizes the results of the WMT18 shared task in machine translation evaluation, the Metrics Shared Task. Participating metrics were evaluated in terms of their correlation with human judgment at the level of the whole test set (system-level evaluation), as well as at the level of individual sentences (segment-level evaluation). For the former, the best metrics reach Pearson correlations of over 0.95 across several language pairs. For the latter, Kendall's τ results varied more than usual, between 0.2 and 0.7.