A Call for Clarity in Reporting BLEU Scores

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to “the” BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SACREBLEU, to facilitate this.


Introduction
Science is the process of formulating hypotheses, making predictions, and measuring their outcomes. In machine translation research, the predictions are made by models whose development is the focus of the research, and the measurement, more often than not, is done via BLEU (Papineni et al., 2002). BLEU's relative language independence, its ease of computation, and its reasonable correlation with human judgments have led to its adoption as the dominant metric for machine translation research. On the whole, it has been a boon to the community, providing a fast and cheap way for researchers to gauge the performance of their models. Together with larger-scale controlled manual evaluations, BLEU has shep-herded the field through a decade and a half of quality improvements (Graham et al., 2014). This is of course not to claim there are no problems with BLEU. Its weaknesses abound, and much has been written about them (cf. Callison-Burch et al. (2006); Reiter (2018)). This paper is not, however, concerned with the shortcomings of BLEU as a proxy for human evaluation of quality; instead, our goal is to bring attention to the relatively narrower problem of the reporting of BLEU scores. This problem can be summarized as follows: • BLEU is not a single metric, but requires a number of parameters ( §2.1).
• Preprocessing schemes have a large effect on scores ( §2.2). Importantly, BLEU scores computed against differently-processed references are not comparable.
• Papers vary in the hidden parameters and schemes they use, yet often do not report them ( §2.3). Even when they do, it can be hard to discover the details.
Together, these issues make it difficult to evaluate and compare BLEU scores across papers, which impedes comparison and replication. I quantify these issues and show that they are serious, with variances bigger than many reported gains. After introducing the notion of userversus metricsupplied tokenization, I identify user-supplied reference tokenization as the main cause of this incompatibility. In response, I suggest the community use only metric-supplied reference tokenization when sharing scores, 2 following the annual Conference on Machine Translation (Bojar et al., 2017, WMT). In support of this, I release a Python package, SACREBLEU, 3 which automatically downloads and stores references for common test sets, thus introducing a "protective layer" between them and the user. It also provides a number of other features, such as reporting a version string which records the parameters used and which can be included in published papers.
2 Problem Description 2.1 Problem: BLEU is underspecified "BLEU" does not signify a single concrete method, but a constellation of parameterized methods. Among these parameters are: • The number of references used; • for multi-reference settings, the computation of the length penalty; • the maximum n-gram length; and • smoothing applied to 0-count n-grams.
Many of these are not common problems in practice. Most often, there is only one reference, and the length penalty calculation is therefore moot. The maximum n-gram length is virtually always set to four, and since BLEU is corpus level, it is rare that there are any zero counts. But it is also true that people use BLEU scores as very rough guides to MT performance across test sets and languages (comparing, for example, translation performance into English from German and Chinese). Apart from the wide intralanguage scores between test sets, the number of references included with a test set has a large effect that is often not given enough attention. For example, WMT 2017 includes two references for English-Finnish. Scoring the online-B system with one reference produces a BLEU score of 22.04, and with two, 25.25. As another example, the NIST OpenMT Arabic-English and Chinese-English test sets 4 provided four references and consequently yielded BLEU scores in the high 40s (and now, low 50s). Since these numbers are all gathered together under the label "BLEU", over time, they leave an impression in people's minds of very high BLEU scores for some language pairs or test sets relative to others, but without this critical distinguishing detail. 3 pip3 install sacrebleu 4 https://catalog.ldc.upenn.edu/ LDC2010T21 2.2 Problem: Different reference preprocessings cannot be compared The first problem dealt with parameters used in BLEU scores, and was more theoretical. A second problem, that of preprocessing, exists in practice. Preprocessing includes input text modifications such as normalization (e.g., collapsing punctuation, removing special characters), tokenization (e.g., splitting off punctuation), compoundsplitting, the removal of case, and so on. Its general goal is to deliver meaningful white-space delimited tokens to the MT system. Of these, tokenization is one of the most important and central. This is because BLEU is a precision metric, and changing the reference processing changes the set of n-grams against which system n-gram precision is computed. Rehbein and Genabith (2007) showed that the analogous use in the parsing community of F 1 scores as rough estimates of crosslingual parsing difficulty were unreliable, for this exact reason. BLEU scores are often reported as being tokenized or detokenized. But for computing BLEU, both the system output and reference are always tokenized; what this distinction refers to is whether the reference preprocessing is usersupplied or metric-internal (i.e., handled by the code implementing the metric), respectively. And since BLEU scores can only be compared when the reference processing is the same, user-supplied preprocessing is error-prone and inadequate for comparing across papers. Table 1 demonstrates the effect of computing BLEU scores with different reference tokenizations. This table presents BLEU scores where a single WMT 2017 system (online-B) and the reference translation were both processed in the following ways: • basic. User-supplied preprocessing with the MOSES tokenizer (Koehn et al., 2007). 5 • split. Splitting compounds, as in Luong et al. (2015a): 6 e.g., rich-text → rich -text.
• unk. All word types not appearing at least twice in the target side of the WMT training data (with "basic" tokenization) are mapped to UNK. This hypothetical scenario could 5 Arguments -q -no-escape -protected basic-protected-patterns -l LANG.
6 Their use of compound splitting is not mentioned in the paper, but only here: http://nlp.stanford.edu/ projects/nmt. Each column varies the processing of the "online-B" system output and its references. basic denotes basic user-supplied tokenization, split adds compound splitting, unk replaces words not appearing at least twice in the training data with UNK, and metric denotes the metric-supplied tokenization used by WMT. The range row lists the difference between the smallest and largest scores, excluding unk.
easily happen if this common user-supplied preprocessing were inadvertently applied to the reference.
• metric. Only the metric-internal tokenization of the official WMT scoring script, mteval-v13a.pl, is applied. 7 The changes in each column show the effect these different schemes have, as high as 1.8 for one arc, and averaging around 1.0. The biggest is the treatment of case, which is well known, yet many papers are not clear about whether they report cased or case-insensitive BLEU.
Allowing the user to handle pre-processing of the reference has other traps. For example, many systems (particularly before sub-word splitting (Sennrich et al., 2016) was proposed) limited the vocabulary in their attempt to deal with unknown words. It's possible that these papers applied this same unknown-word masking to the references, too, which would artificially inflate BLEU scores. Such mistakes are easy to introduce in researcher pipelines. 8

Problem: Details are hard to come by
User-supplied reference processing precludes direct comparison of published numbers, but if enough detail is specified in the paper, it is at 7 https://github.com/moses-smt/ mosesdecoder/blob/master/scripts/ generic/mteval-v13a.pl 8 This paper's observations stem in part from an early version of a research workflow I was using, which applied preprocessing to the reference, affecting scores by half a point.  least possible to reconstruct comparable numbers. Unfortunately, this is not the trend, and even for meticulous researchers, it is often unwieldy to include this level of technical detail. In any case, it creates uncertainty and work for the reader. One has to read the experiments section, scour the footnotes, and look for other clues which are sometimes scattered throughout the paper. Figuring out what another team did is not easy. The variations in Table 1 are only some of the possible configurations, since there is no limit to the preprocessing that a group could apply. But assuming these represent common, concrete configurations, one might wonder how easy it is to determine which of them was used by a particular paper. Table 2 presents an attempt to recover this information from a handful of influential papers in the literature. Not only are systems not comparable due to different schemes, in many cases, no easy determination can be made. should not touch the reference, while the metric applies its own processing to the system output and reference.

Problem: Dataset specification
Other tricky details exist in the management of datasets. It has been common over the past few years to report results on the English→German arc of the WMT'14 dataset. It is unfortunate, therefore, that for this track (and this track alone), there are actually two such datasets. One of them, released for the evaluation, has only 2,737 sentences, having removed about 10% of the original data after problems were discovered during the evaluation. The second, released after the evaluation, restores this missing data (after correcting the problem) and has 3,004 sentences. Many researchers are unaware of this fact, and do not specify which version they use when reporting, which itself contributes to variance. Figure 1 depicts the ideal process for computing sharable scores. Reference tokenization must identical in order for scores to be comparable. The widespread use of user-supplied reference preprocessing prevents this, needlessly complicating comparisons. The lack of details about preprocessing pipelines exacerbates this problem. This situation should be fixed.

The example of PARSEVAL
An instructive comparison comes from the evaluation of English parsing scores, where numbers have been safely compared across papers for decades using the PARSEVAL metric (Black et al., 1991). PARSEVAL works by taking labeled spans of the form (N, i, j) representing a nonterminal N spanning a constituent from word i to word j. These are extracted from the parser output and used to compute precision and recall against the gold-standard set taken from the correct parse tree. Precision and recall are then combined to compute the F 1 metric that is commonly reported and compared across parsing papers. Computing parser F 1 comes with its own set of hidden parameters and edge cases. Should one count the TOP (ROOT) node? What about -NONE-nodes? Punctuation? Should any labels be considered equivalent? These boundary cases are resolved by that community's adoption of a standard codebase, evalb, 9 which included a parameters file that answers each of these questions. 10 This has facilitated almost thirty years of comparisons on treebanks such as the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993).

Existing scripts
MOSES 11 has a number of scoring scripts. Unfortunately, each of them has problems. Moses' multi-bleu.perl cannot be used because it requires user-supplied preprocessing. The same is true of another evaluation framework, MultEval (Clark et al., 2011), which explicitly advocates for user-supplied tokenization. 12 A good candidate is Moses' mteval-v13a.pl, which makes use of metric-internal preprocessing and is used in the annual WMT evaluations. However, this script inconveniently requires the data to be wrapped into XML. Nematus (Sennrich et al., 2017) contains a version (multi-bleu-detok.perl) that removes the XML requirement. This is a good idea, but it still requires the user to manually handle the reference translations. A better approach is to keep the reference away from the user entirely.

SACREBLEU
SACREBLEU is a Python script that aims to treat BLEU with a bit more reverence: • It expects detokenized outputs, applying its own metric-internal preprocessing, and produces the same values as WMT; • it automatically downloads and stores WMT (2008WMT ( -2018 and IWSLT 2017 (Cettolo et al., 2017) test sets, obviating the need for the user to handle the references at all; and • it produces a short version string that documents the settings used.
SACREBLEU can be installed via the Python package management system: SACREBLEU is open source software released under the Apache 2.0 license. 13 The CHRF metric is also available via the -m flag.

Summary
Research in machine translation benefits from the regular introduction of test sets for many different language arcs, from academic, government, and industry sources. It is a shame, therefore, that we are in a situation where it is difficult to directly compare scores across these test sets. One might be tempted to shrug this off as an unimportant detail, but as was shown here, these differences are in fact quite important, resulting in large variances in the score that are often much higher than the gains reported by a new method.
Fixing the problem is relatively simple. Research groups should only report BLEU computed using a metric-internal tokenization and preprocessing scheme for the reference, and they should be explicit about the BLEU parameterization they use. With this, scores can be directly compared. For backwards compatibility with WMT results, I recommend the processing scheme used by WMT, and provide a new tool that makes it easy to do so.