Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.


Introduction
Automatic metrics are an indispensable part of machine translation (MT) evaluation, serving as a proxy to human evaluation which is considerably more expensive and time-consuming. They provide immediate feedback during MT system development and serve as the primary metric to report the quality of MT systems. Accordingly, the reliability of metrics is critical to progress in MT research.
A particularly worrying finding was made in the most recent Conference on Machine Translation (WMT), as part of its annual exercise to benchmark progress in translation and translation evaluation. WMT has established a method based on Pearson's correlation coefficient for measuring how well automatic metrics match human judgements of translation quality, which is used to rank metrics and to justify their widespread use in lieu of human evaluation. Their findings (Ma et al., 2019) showed that if the correlation is computed for metrics using a large cohort of translation systems, typically very high correlations were found between leading metrics and humans (as high as r = 0.9). However, when considering only the few best systems, the correlation reduced markedly. This is in contrast to findings at sentence-level evaluation, where metrics are better at distinguishing between high-quality translations compared to low-quality translations (Fomicheva and Specia, 2019).
When considering only the four best systems, the automatic metrics were shown to exhibit negative correlations in some instances. It would appear that metrics can only be relied upon for making coarse distinctions between poor and good translation outputs, but not for assessing similar quality outputs, i.e., the most common application faced when assessing incremental empirical improvements.
Overall, these findings raise important questions as to the reliability of the accepted best practices for ranking metrics and, more fundamentally, cast doubt over these metrics' utility for tuning high-quality systems, and for making architecture choices or publication decisions for empirical research.
In this paper, we take a closer look into this problem, using the metrics data from recent years of WMT to answer the following questions: 1. Are the above problems identified with Pearson's correlation evident in other settings besides small collections of strong MT systems? To test this we consider a range of system quality levels, including random samples of systems, and show that the problem is widely apparent.
2. What is the effect of outlier systems in the reported correlations? Systems that are considerably worse than all others can have a disproportionate effect on the computed correlation, despite offering very little insight into the evaluation problem. We identify a robust method for identifying outliers, and demonstrate their effect on correlation, which for some metrics can result in radically different conclusions about their utility.
3. Given these questions about metrics' utility, can they be relied upon for comparing two systems? More concretely, we seek to quantify the extent of improvement required under an automatic metric such that the ranking reliably reflects human assessment. In doing so, we consider both type I and II errors, which correspond to accepting negative or insignificant differences as judged by humans, versus rejecting human significant differences; both types of errors have the potential to stunt progress in the field.
Overall we find that current metric evaluation methodology can lend false confidence to the utility of a metric, and that leading metrics require either untenably large improvements to serve a gatekeeping role, or overly permissive usage to ensure good ideas are not rejected out of hand. Perhaps unsurprisingly, we conclude that metrics are inadequate as a substitute for human evaluations in MT research.

Related work
Since 2007, the Conference on Machine Translation (WMT) has organized an annual shared task on automatic metrics, where metrics are evaluated based on correlation with human judgements over a range of MT systems that were submitted to the translation task. Methods for both human evaluation and meta evaluation of metrics have evolved over the years.
In early iterations, the official evaluation measure was the Spearman's rank correlation of metric scores with human scores (Callison-Burch and Osborne, 2006). However, many MT system pairs have very small score differences, and evaluating with Spearman's correlation harshly penalises metrics that have a different ordering for these systems. This was replaced by the Pearson correlation in 2014 (Bojar et al., 2014). To test whether the difference in the performance of two metrics is statistically significant, the Williams test for dependent correlations is used (Graham and Baldwin, 2014), which takes into account the correlation between the two metrics. Metrics that are not outperformed by any other metric are declared the winners for that language pair.
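To make this meta-evaluation step concrete, the following is a minimal sketch of the Williams test statistic for comparing two dependent correlations that share one variable (here, variable 1 is the human scores; r12 and r13 are the two metric-human correlations and r23 the metric-metric correlation). The function name is our own; the formulation follows the standard presentation, and significance is then read off a t-distribution with n − 3 degrees of freedom:

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams test statistic for H0: r12 == r13, where the two
    correlations are dependent because they share variable 1 (the
    human scores). n is the number of observations (MT systems)."""
    # Determinant-like term of the 3x3 correlation matrix.
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * K * (n - 1) / (n - 3)
        + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator
```

When the two metrics correlate equally with humans (r12 == r13), the statistic is exactly zero; the more the metric-metric correlation r23 rises, the smaller a difference in correlation needs to be to reach significance.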
Pearson's r is highly sensitive to outliers (Osborne and Overbay, 2004): even a single outlier can have a drastic impact on the value of the correlation coefficient; in the extreme case, outliers can give the illusion of a strong correlation when there is none, or mask the presence of a true relationship. More generally, very different underlying relationships between two variables can yield the same value of the correlation coefficient (Anscombe, 1973).

The correlation of metrics with human scores is highly dependent on the underlying systems used. BLEU (Papineni et al., 2002a) has remained mostly unchanged since it was proposed in 2002, but its correlation with human scores has changed each year over ten years of evaluation (2006 to 2016) on the English-German and German-English language pairs at WMT (Reiter, 2018). The low correlation for most of 2006-2012 is possibly due to the presence of strong rule-based systems that tend to receive low BLEU scores (Callison-Burch and Osborne, 2006). By 2016, however, there were only a few submissions of rule-based systems, and these were mostly outperformed by statistical systems according to human judgements (Bojar et al., 2016). The majority of the systems in the last three years have been neural models, for which most metrics have a high correlation with human judgements.

BLEU has been surpassed by various other metrics at every iteration of the WMT metrics shared task. Despite this, and extensive analytical evidence of the limitations of BLEU in particular and automatic metrics in general (Stent et al., 2005; Callison-Burch and Osborne, 2006; Smith et al., 2016), the metric remains the de facto standard for evaluating research hypotheses.

In WMT's human evaluation (Direct Assessment), annotators are asked to rate the adequacy of a set of translations compared to the corresponding source/reference sentence on a slider which maps to a continuous scale between 0 and 100.
Bad quality annotations are filtered out based on quality control items included in the annotation task. Each annotator's scores are standardised to account for different scales. The score of an MT system is computed as the mean of the standardised score of all its translations. In WMT 19, typically around 1500-2500 annotations were collected per system for language pairs where annotator availability was not a problem. To assess whether the difference in scores between two systems is not just chance, the Wilcoxon rank-sum test is used to test for statistical significance.

Metrics
Automatic metrics compute the quality of an MT output (or set of translations) by comparing it with a reference translation produced by a human translator. For the WMT 2019 metrics task, participants were also invited to submit metrics that rely on the source instead of the reference ("QE as a metric"). In this paper, we focus on the following metrics that were included in the evaluation at the WMT 2019 metrics task:

Baseline metrics
• BLEU (Papineni et al., 2002b) is the precision of n-grams of the MT output compared to the reference, weighted by a brevity penalty to punish overly short translations. BLEU has high variance across different hyper-parameters and pre-processing strategies, in response to which sacreBLEU (Post, 2018) was introduced to provide a standard implementation for all researchers to use; we use this version in our analysis.
• TER (Snover et al., 2006) measures the number of edits (insertions, deletions, shifts and substitutions) required to transform the MT output into the reference.
• CHRF (Popović, 2015) uses character n-grams instead of word n-grams to compare the MT output with the reference. This helps with matching morphological variants of words.
Best metrics across language pairs
• YISI-1 (Lo, 2019) computes the semantic similarity of phrases in the MT output with the reference, using contextual word embeddings (BERT: Devlin et al. (2019)).
• ESIM (Chen et al., 2017; Mathur et al., 2019) is a trained neural model that first computes sentence representations from BERT embeddings, then computes the similarity between the two strings.

Source-based metric
• YISI-2 (Lo, 2019) is the same as YISI-1, except that it uses cross-lingual embeddings to compute the similarity of the MT output with the source.

The baseline metrics, particularly BLEU, were designed to use multiple references. However, in practice, they have only been used with a single reference in recent years.
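To make the character n-gram idea behind CHRF concrete, here is a simplified sketch (our own toy re-implementation, not the official one: the real metric handles whitespace, n-gram order weighting, and the β recall weight more carefully):

```python
from collections import Counter

def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Toy chrF: F-score over character n-grams, averaged over
    n = 1..max_n, with recall weighted beta times as much as precision.
    Whitespace is stripped as a rough simplification."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # sentence shorter than n characters
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        precision = overlap / sum(hyp_ngrams.values())
        recall = overlap / sum(ref_ngrams.values())
        if precision + recall == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * precision * recall
                            / (beta**2 * precision + recall))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Because matching happens at the character level, a pair like "cats" and "cat" still shares most of its n-grams and scores well above zero, which is how the metric credits morphological variants that word-level n-grams would miss entirely.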

Re-examining conclusions of the Metrics Task 2019

Are metrics unreliable when evaluating high-quality MT systems?
In general, the correlation of reference-based metrics with human scores is greater than r = 0.8 for all language pairs. However, the correlation is dependent on the systems that are being evaluated, and as the quality of MT increases, we want to be sure that the metrics evaluating these systems stay reliable.
To estimate the validity of the metrics for highquality MT systems, Ma et al. (2019) sorted the systems based on their Direct Assessment scores, and plotted the correlation of the top N systems, with N ranging from all systems to the best four systems. They found that for seven out of 18 language pairs, the correlation between metric and human scores decreases as we decrease N , and tends towards zero or even negative when N = 4.
There are four language pairs (German-English, English-German, English-Russian, and English-Chinese) where the quality of the best MT systems is close to human performance (Barrault et al., 2019). If metrics are unreliable for strong MT systems, we would expect to see a sharp degradation in correlation for these language pairs. But as we look at the top N systems, the correlation decreases for German-English and English-German, stays the same for English-Russian, and actually increases for English-Chinese. On the other hand, we observe this phenomenon with English-Kazakh, where the top systems are far from the quality of human translation.
Is there another explanation for these results? Pearson's r between metrics and DA scores is unstable for small samples, particularly when the systems are very close in terms of quality. The low correlation over top-N systems (when N is small) could be an artefact of this instability. To understand this effect, we instead visualise the correlation of a rolling window of systems, starting with the worst N systems, and moving forward by one system until we reach the top N systems. The number of systems stays constant for all points in these graphs, which makes for a more valid comparison than the original setting where the sample size varies. If the metrics are indeed less reliable for strong systems, we should see the same pattern as with the top N systems.
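A minimal sketch of this rolling-window analysis (function names and data layout are our own; it assumes parallel lists of per-system human and metric scores):

```python
def pearson(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

def rolling_correlations(human, metric, window=4):
    """Pearson's r over a window of `window` systems sorted from worst
    to best by human score, sliding forward one system at a time.
    Unlike the top-N analysis, the sample size is constant at every
    point, making the points comparable with one another."""
    order = sorted(range(len(human)), key=lambda i: human[i])
    h = [human[i] for i in order]
    m = [metric[i] for i in order]
    return [pearson(h[i:i + window], m[i:i + window])
            for i in range(len(h) - window + 1)]
```

Plotting the resulting list from the worst-system window to the best-system window is what lets us check whether instability is specific to strong systems or appears for any small set of similar-quality systems.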
For the German-English language pair (Figure 1b), the correlation of most metrics is very unstable when N = 4. Both BLEU and CHRF perfectly correlate with human scores for systems ranked 2-5, which then drops to −1 for the top 4 systems. On the other hand, ESIM exhibits the opposite behaviour, even though it shows an upward trend when looking at the top-N systems.
Even worse, for English-German, YISI-2 obtains a perfect correlation at some values of N , when in fact its correlation with human scores is negligible once outliers are removed (Section 4.2).
We observe similar behaviour across all language pairs: the correlation is more stable as N increases, but there is no consistent trend in the correlation that depends on the quality of the systems in the sample. If we are to trust Pearson's r at small sample sizes, then the reliability of metrics does not really depend on the quality of the MT systems. Given that the sample size is small to begin with (typically 10-15 MT systems per language pair), we believe that we do not have enough data to use this method to assess whether metric reliability decreases with the quality of MT systems.
A possible explanation for the low correlation of subsets of MT systems is that it depends on how close these systems are in terms of quality. In the extreme case, the difference between the DA scores of all the systems in the subset can be statistically insignificant, so metric correlation over these systems can be attributed to chance.

How do outliers affect the correlation of MT evaluation metrics?
An outlier is defined as "an observation (or subset of observations) which appears to be inconsistent with the remainder of the dataset" (Barnett and Lewis, 1974). Pearson's r is particularly sensitive to outliers in the observations. When there are systems that are generally much worse (or much better) than the rest of the systems, metrics are usually able to correctly assign low (or high) scores to these systems. In this case, the Pearson correlation can over-estimate metric reliability, irrespective of the relationship between human and metric scores of other systems. Based on a visual inspection, we can see there are two outlier systems in the English-German language pair. To illustrate the influence of these systems on Pearson's r, we repeatedly subsample ten systems from the 22 system submissions (see Figure 2). When the most extreme outlier (en-de-task) is present in the sample, the correlation of all metrics is greater than 0.97. The selection of systems has a higher influence on the correlation when neither outlier is present, and we can see that YISI-1 and ESIM usually correlate much higher than BLEU.
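A toy example (synthetic numbers, not WMT scores) makes the danger concrete: nine systems whose metric scores carry no information about the human ranking, plus one clearly worse system that the metric correctly scores low, yield a very strong overall correlation:

```python
def pearson(xs, ys):
    """Pearson's r computed directly from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Nine similar-quality systems: the metric is essentially noise here,
# so the correlation is near zero.
human  = [0.50, 0.52, 0.54, 0.56, 0.58, 0.60, 0.62, 0.64, 0.66]
metric = [0.31, 0.35, 0.29, 0.36, 0.30, 0.34, 0.28, 0.33, 0.32]
r_without = pearson(human, metric)           # essentially no association

# Add one system far worse than the rest, which the metric correctly
# scores very low: the correlation becomes strongly positive.
r_with = pearson(human + [-2.0], metric + [0.02])  # strong apparent correlation
```

The outlier contributes a single point far from the cluster, and Pearson's r rewards the metric merely for placing that point on the right side, regardless of how it ranks the remaining systems.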
One method of dealing with outliers is to calculate the correlation over the remaining points (called the skipped correlation: Wilcox (2004)). Most such methods detect multivariate outliers in the joint distribution of the two variables: the metric and human scores in our case. However, multivariate outliers could be system pairs that indicate metric errors, and should not be removed, because they provide important data about the metric. Thus, we only look towards detecting univariate outliers based on human ratings. One common method is to simply standardise the scores, and remove systems whose scores are too high or too low. However, standardising depends on the mean and standard deviation, which are themselves affected by outliers. Instead, we use the median and the Median Absolute Deviation (MAD), which are more robust (Iglewicz and Hoaglin, 1993; Rousseeuw and Hubert, 2011; Leys et al., 2013).

Figure 2: Pearson's r for metrics when subsampling systems from the English-German language pair. We group the samples by the presence of the two outliers ("en-de-task" and "Online-X"), and when neither is present.
For MT systems with human scores s, we use the following steps to detect outlier systems:
1. Compute the MAD, which is the median of all absolute deviations from the median: MAD = median(|s − median(s)|).
2. Compute robust scores: z = 0.6745 (s − median(s)) / MAD.
3. Discard systems where the magnitude of z exceeds a cutoff (we use 2.5).

Tables 1 and 2 show Pearson's r with and without outliers for the language pairs that contain outliers. Some interesting observations are as follows:

Table 1: Correlation of metrics with and without outliers ("All" and "−out", resp.) for the to-English language pairs that contain outlier systems.

Table 2: Correlation of metrics with and without outliers ("All" and "−out", resp.) for the language pairs into languages other than English that contain outlier systems.
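The three outlier-detection steps above can be sketched as follows (our own helper; 0.6745 is the usual consistency constant that puts the MAD on the same scale as the standard deviation under normality):

```python
from statistics import median

def find_outliers(human_scores, cutoff=2.5):
    """Flag outlier MT systems from their human (DA) scores using the
    median and the Median Absolute Deviation (MAD), which, unlike the
    mean and standard deviation, are not themselves distorted by the
    outliers we are trying to find."""
    med = median(human_scores)
    mad = median(abs(s - med) for s in human_scores)
    if mad == 0:
        # Degenerate case: at least half the scores are identical.
        return [False] * len(human_scores)
    # Robust z-score; systems beyond the cutoff are treated as outliers.
    return [abs(0.6745 * (s - med) / mad) > cutoff for s in human_scores]
```

With this rule, a system scoring far below a tight cluster of competitors is flagged, while systems inside the cluster are kept regardless of where the cluster's own mean sits.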
• For language pairs like Lithuanian-English and English-Finnish, the correlation between the reference-based metrics and DA is high irrespective of the presence of the outlier;
• the correlation of BLEU with DA drops sharply from 0.85 to 0.58 for English-Kazakh when outliers are removed;
• for English-German, the correlation of BLEU and TER appears to be almost as high as that of YISI-1 and ESIM. However, when we remove the two outliers, there is a much wider gap between the metrics;
• if metrics wrongly assign a higher score to an outlier (e.g. most metrics in Gujarati-English), removing these systems increases correlation, so reporting only the skipped correlation is not ideal.

To illustrate the severity of the problem, we show examples from the metrics task data where outliers present the illusion of high correlation when the metric scores are actually independent of the human scores without the outlier. For English-German, the source-based metric YISI-2 correctly assigns a low score to the outlier en-de-task. When this system is removed, the correlation is near zero. At the other extreme, YISI-2 incorrectly assigns a very high score to a low-quality outlier in the English-Russian language pair, resulting in a strongly negative correlation. When we remove this system, we find there is no association between metric and human scores.
The results for all metrics that participated in the WMT 19 metrics task are presented in Tables 3, 4 and 5 in the appendix.

Beyond correlation: metric decisions for system pairs
In practice, researchers use metric scores to compare pairs of MT systems, for instance when claiming a new state of the art, evaluating different model architectures, or even in deciding whether to publish. Basing these judgements on metric score alone runs the risk of making wrong decisions with respect to the true gold standard of human judgements. That is, while a change may result in a significant improvement in BLEU, this may not be judged to be an improvement by human assessors. Thus, we examine whether metrics agree with DA on all the MT systems pairs across all languages used in WMT 19.
Following standard practice, we use statistical significance tests to detect whether the difference in scores (human or metric) between two systems (S1 and S2) can just be attributed to chance.

Figure 4: The colours indicate pairs judged by humans to be insignificantly different (cyan/light gray), significantly worse (red/dark gray on the left) and significantly better (green/dark gray on the right).
For human scores, we apply the Wilcoxon rank-sum test, which is used by WMT when ranking systems. We use the bootstrap method (Koehn, 2004) to test for statistical significance of the difference in BLEU between two systems. YISI-1 and ESIM compute the system score as the average of sentence scores, so we use the paired t-test to compute significance. Although CHRF is technically the macro-average of n-gram statistics over the entire test set, we treat it as a micro-average when computing significance, such that we can use the more powerful paired t-test over sentence scores.

Figure 4 visualises the agreement between metric score differences and differences in human DA scores. Ideally, only differences judged as truly significant would give rise to significant and large magnitude differences under the metrics; and when metrics judge differences to be insignificant, ideally very few instances would be truly significant. However, this is not the case: there are substantial numbers of insignificant differences even for very high metric differences (cyan, for higher range bins); moreover, the "NS" category, denoting an insignificant difference in metric score, includes many human-significant pairs (red and green, top bin).
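The paired bootstrap test described above can be sketched as follows (a simplification that assumes a sentence-decomposable metric; for BLEU proper, corpus-level BLEU would be recomputed on each resample rather than summing sentence scores):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Koehn (2004)-style paired bootstrap: resample test-set sentences
    with replacement and count how often system A's total beats system
    B's. scores_a and scores_b are per-sentence scores of the two
    systems on the same test set (hence "paired"). Returns the fraction
    of resamples won by A; values near 1.0 (or 0.0) suggest the
    difference is unlikely to be chance."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / n_resamples
```

Pairing matters: both systems are scored on the same resampled sentences, so sentence difficulty cancels out and only the systematic difference between the systems drives the result.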
Considering BLEU (top plot in Figure 4), for insignificant BLEU differences, humans judge one system to be better than the other for half of these system pairs. This corresponds to a Type II error. It is of concern that BLEU cannot detect these differences. Worse, the difference in human scores has a very wide range. Conversely, when the BLEU score is significant but in the range 0-3, more than half of these systems are judged to be insignificantly different in quality (corresponding to a Type I error). For higher BLEU deltas these errors diminish; however, even for a BLEU difference between 3 and 5 points, about a quarter of these system pairs are of similar quality. This paints a dour picture for the utility of BLEU as a tool for gatekeeping (i.e., to define a 'minimum publishable unit' in deciding paper acceptance on empirical grounds, through bounding the risk of Type I errors), as the unit would need to be implausibly large to ensure only meaningful improvements are accepted. Were we instead to minimise Type II errors in the interests of nurturing good ideas, the threshold would need to be so low as to be meaningless, effectively below the level required for acceptance of the bootstrap significance test.

The systems evaluated consist of a mix of systems submitted by researchers (mostly neural models) and anonymous online systems (where the MT system type is unknown). Even when we restrict the set of systems to only neural models submitted by researchers, the patterns of Type I and Type II errors remain the same (figure omitted for space reasons).
TER makes similar errors: TER scores can wrongly show that a system is much better than another when humans have judged them similar, or even worse, drawn the opposite conclusion.
CHRF, YISI-1 and ESIM have fewer errors compared to BLEU and TER. When these metrics mistakenly fail to detect a difference between systems, the human score difference is considerably lower than for BLEU. Accordingly, they should be used in place of BLEU. However, the above argument is likely to still hold true as to their utility for gatekeeping or nurturing progress, in that the thresholds would still be particularly punitive or permissive for the two roles, respectively.
Finally, Figure 5 looks at agreement between metric decisions when comparing MT systems. As expected, when BLEU or TER disagree with CHRF, ESIM, or YISI-1, the former are more likely to be wrong. BLEU and TER have an 80% overlap in errors. The decisions of ESIM, a trained neural model, diverge a little more from the other metrics. Overall, despite the variety of approaches towards the task, all five metrics have common biases: over half of all erroneous decisions made by a particular metric are made in common with all other metrics.

Conclusion
In this paper, we revisited the findings of the metrics task at WMT 2019, which flagged potential problems in the current best practices for the assessment of evaluation metrics.
Pearson's correlation coefficient is known to be unstable for small sample sizes, particularly when the systems in consideration are very close in quality. This goes some way to explaining the findings whereby strong correlations between metric scores and human judgements evaporate when considering small numbers of strong systems. We show that the same can be true for any small set of similar quality systems, not just the top systems. This effect can partly be attributed to noise due to the small sample size, rather than true shortcomings in the metrics themselves. We need better methods to empirically test whether our metrics are less reliable when evaluating high quality MT systems.
A more serious problem, however, is outlier systems, i.e. those systems whose quality is much higher or lower than that of the rest of the systems. We found that such systems can have a disproportionate effect on the computed correlation of metrics. The resulting high values of correlation can then lead to false confidence in the reliability of metrics. Once the outliers are removed, the gap between the correlation of BLEU and that of other metrics (e.g. CHRF, YISI-1 and ESIM) becomes wider. In the worst case scenario, outliers introduce a high correlation when there is no association between metric and human scores for the rest of the systems. Thus, future evaluations should also measure correlations after removing outlier systems.
Finally, the same value of the correlation coefficient can describe very different patterns of errors. No single number is adequate to describe the data, and visualising metric scores against human scores is the best way to gain insight into metric reliability. This could be done with scatter plots (e.g. Figure 3a) for each language pair, or with Figure 5, which compresses this information into one graph.
Metrics are commonly used to compare two systems, and accordingly we have also investigated what a difference in metric score really means in terms of human judgements of the two systems. Most published work reports BLEU differences of 1-2 points; however, we show that differences of this magnitude correspond to true improvements in quality, as judged by humans, only about half the time. Although our analysis assumes the Direct Assessment human evaluation method to be a gold standard despite its shortcomings, it does suggest that the current rule of thumb for publishing empirical improvements based on small BLEU differences has little meaning.
Overall, this paper adds to the case for retiring BLEU as the de facto standard metric, and instead using other metrics such as CHRF, YISI-1, or ESIM, which are more powerful in assessing empirical improvements. However, human evaluation must always be the gold standard: for continued improvement in translation, and to establish significant improvements over prior work, all automatic metrics are inadequate substitutes.
To summarise, our key recommendations are:
• When evaluating metrics, use the technique outlined in Section 4.2 to remove outliers before computing Pearson's r;
• when evaluating MT systems, stop using BLEU or TER, and instead use CHRF, YISI-1, or ESIM;
• stop using small changes in evaluation metrics as the sole basis for drawing important empirical conclusions, and make sure these are supported by manual evaluation.
Table 4: Pearson correlation of metrics for the to-English language pairs. For language pairs that contain outlier systems, we also show correlation after removing outlier systems.
Correlations of metrics not significantly outperformed by any other metric for that language pair are highlighted in bold.

Table 5: Correlation of metrics for the from-English language pairs. For language pairs that contain outlier systems, we also show correlation after removing outlier systems. Values in bold indicate that the metric is not significantly outperformed by any other metric under the Williams test.