Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatically evaluate their models by demonstrating important failure cases on multiple datasets, language pairs and tasks. Our experiments show that metrics (i) usually prefer system outputs to human-authored texts, (ii) can be insensitive to correct translations of rare words, (iii) can yield surprisingly high scores when given a single sentence as system output for the entire test set.


Introduction
Human assessment is the best practice at hand for evaluating language generation tasks such as machine translation (MT), dialogue systems, visual captioning and abstractive summarisation. In practice, however, we rely on automatic metrics which compare system outputs to human-authored references. Initially proposed for MT evaluation, metrics such as BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014) are increasingly used for other tasks, along with more task-oriented metrics such as ROUGE (Lin, 2004) for summarisation and CIDER  for visual captioning. Reiter and Belz (2009) remark that it is not sufficient to conclude on the usefulness of a natural language generation system's output by solely relying on metrics that quantify the similarity of the output to human-authored texts. Previous criticisms concerning automatic metrics corroborate this perspective to some degree. To cite a few, in the context of MT, Callison-Burch et al. (2006) state that human judgments may not correlate with BLEU and an increase in BLEU does not always indicate an improvement in quality. Further, Mathur et al. (2020) challenge the stability of the common practices measuring correlation between metrics and human judgments, the standard approach in the MT community, and show that they may be severely impacted by outlier systems and the sample size. In the context of image captioning, Wang and Chan (2019) claim that the consensus-based evaluation protocol of CIDER actually penalises output diversity. Similar problems have also been discussed in the area of automatic summarisation with respect to ROUGE (Schluter, 2017). Nevertheless, automatic metrics like these are a necessity and remain popular, especially given the increasing number of open evaluation challenges.
In this paper, we further probe the BLEU, METEOR, CIDER-D and ROUGE-L metrics (§ 2) that are commonly used to quantify progress in language generation tasks. We also probe a recently proposed contextualised embeddings-based metric called BERTSCORE (Zhang et al., 2020). We first conduct leave-one-out average scoring with multiple references and show that, counter-intuitively, metrics tend to reward system outputs more than human-authored references (§ 3.1). A system output perturbation experiment further highlights how metrics penalise errors in extremely frequent n-grams (§ 3.2.1) while they are quite insensitive to errors in rare words (§ 3.2.2). The latter makes it hard to understand whether a model is better than another in its ability to handle rare linguistic phenomena correctly. Finally, we design (§ 3.3) an adversary that seeks to demonstrate metrics' preference for frequent n-grams by using a single training set sentence as system output for the entire test set. We observe strikingly high scores, such as the sentence "A man is playing" obtaining a BLEU score of 30.6, compared to 47.9 of a strong model, on a captioning task. We hope that our observations in this paper will lead the community towards formulation of better metrics and evaluation protocols in the future (§ 4).

Automatic Metrics
In this section, we briefly describe the metrics we use in our experiments. To compute BLEU, METEOR, CIDER and ROUGE-L, we provide pre-tokenised hypotheses and references to the coco-caption utility. For BERTSCORE, we use its official release.
• Initially proposed for MT evaluation, BLEU is a prevalent metric based on n-gram matches between the candidate and reference sentences. The final score is the geometric mean of n-gram precisions with a brevity penalty to penalise outputs shorter than references.
• In contrast, METEOR is a recall-based metric which uses explicit word alignments between candidate and references, allowing for exact matching, synonymy, and stemming. A fragmentation penalty rewards longer and fewer chunks of contiguously aligned tokens. The final score is computed between the best scoring reference and the candidate sentence.
• Another recall-based metric is ROUGE-L which measures longest common sub-sequences between the candidate and references, i.e. a set of shared words with similar order even if not contiguous.
• CIDER is a consensus-based captioning metric which uses term frequency inverse document frequency (TF-IDF) for n-gram weighting over all references. The final score is the average cosine similarity between the candidate sentence and references. We use the popular variant CIDER-D, which integrates a length-based Gaussian penalty and clipping.
• Finally, BERTSCORE computes a token level similarity for each candidate token against each token in the reference sentence, using contextual BERT embeddings (Zhang et al., 2020). We report the Fscore, i.e. the harmonic mean of the precision and the recall of maximal per-token cosine similarities between the candidate and the reference(s).
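To make the n-gram and sub-sequence definitions above concrete, the following is a minimal Python sketch of a sentence-level BLEU and of the longest common sub-sequence recall underlying ROUGE-L. It is an illustration only and not the coco-caption implementation: the function names are hypothetical, no smoothing is applied, BLEU is computed per sentence rather than over a corpus, and the ROUGE-L part returns plain recall instead of a weighted F-measure.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        if not cand_counts:
            return 0.0  # candidate too short for this n-gram order
        # Clip each candidate n-gram by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if matched == 0:
            return 0.0  # unsmoothed: one zero precision zeroes the score
        log_precisions.append(math.log(matched / sum(cand_counts.values())))
    # Brevity penalty against the reference whose length is closest to the candidate.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) sub-sequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference)
```

In this sketch, a candidate that reproduces only the frequent prefix of a reference keeps perfect n-gram precision and is penalised solely through the brevity penalty.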

Experiments
This section describes our probing experiments and results for different tasks and language pairs.

Machine vs. human-authored texts
We first explore the extent to which metrics reward human-authored texts. We hypothesise that metrics should prefer human-authored texts to machine-produced texts, as the former are considered to be the 'ground truth' references. For a given multi-reference metric $M$, the REF-VS-REF leave-one-out average $L_M$ is computed across $C$ human-authored reference corpora $\{R_1, \dots, R_C\}$. At each iteration $i$, the reference $R_i$ is taken as the candidate and is evaluated against the held-out $C-1$ references:

$$L_M = \frac{1}{C} \sum_{i=1}^{C} M\big(\mathrm{SYS} = R_i,\ \mathrm{REF} = \{R_1, \dots, R_C\} \setminus \{R_i\}\big)$$

The same principle holds when evaluating a trained system, except that SYS would now represent the system outputs. The metric computations inside the sum would still use $C-1$ references at each iteration.

Tasks. We explore two visual captioning tasks, namely, COCO English image captioning (Chen et al., 2015) and the bilingual VATEX video captioning dataset. For the former, we use a state-of-the-art captioning model (Lu et al., 2018), and evaluate the 5k sentences in the test split (Karpathy and Fei-Fei, 2015). For English and Chinese VATEX, we train neural baselines and evaluate on the dev set which contains 3k videos. Finally, we also experiment with the Chinese→English NIST 2008 (MT08) test set which contains 1357 sentences, and translate it using Google Translate.
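The leave-one-out averaging described above can be sketched in Python as follows. This is a hedged illustration rather than the evaluation code used for the experiments: `toy_unigram_precision` is a hypothetical stand-in for a real multi-reference metric, and each corpus is assumed to be a list of pre-tokenised, space-separated sentences aligned by index.

```python
def toy_unigram_precision(candidate, references):
    """Hypothetical stand-in metric: mean fraction of candidate tokens
    found in any of the references for the same segment."""
    scores = []
    for k, sentence in enumerate(candidate):
        tokens = sentence.split()
        ref_vocab = set()
        for ref in references:
            ref_vocab.update(ref[k].split())
        scores.append(sum(t in ref_vocab for t in tokens) / len(tokens))
    return sum(scores) / len(scores)


def leave_one_out_refs(references, metric):
    """REF-VS-REF: each reference corpus is scored against the other C-1 corpora."""
    C = len(references)
    total = 0.0
    for i in range(C):
        held_out = references[:i] + references[i + 1:]
        total += metric(references[i], held_out)
    return total / C


def leave_one_out_system(system, references, metric):
    """System outputs are scored against the same C-1 reference subsets."""
    C = len(references)
    total = 0.0
    for i in range(C):
        held_out = references[:i] + references[i + 1:]
        total += metric(system, held_out)
    return total / C


# Three single-sentence reference corpora and one system output (toy data).
refs = [["a man plays guitar"], ["a man is playing"], ["someone plays a guitar"]]
system = ["a man plays a guitar"]
human_score = leave_one_out_refs(refs, toy_unigram_precision)
system_score = leave_one_out_system(system, refs, toy_unigram_precision)
```

Scoring the system against C−1 references at every iteration keeps the two averages comparable, since both candidates see exactly the same reference subsets.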
Results. Table 1 shows that human-authored references generally obtain worse scores than system outputs. BLEU exhibits the biggest differences, with almost 11 points for COCO and 3.5 points for the NIST MT experiment. The results are consistent across datasets and languages, except for English VATEX and NIST-2008, where the human references score slightly better than system outputs in METEOR and CIDER. Overall, the results suggest that the metrics do not necessarily reflect the quality of the generated captions or translations, as scores for human-authored texts are far from the upper bounds of the metrics (e.g. 100 for BLEU). Therefore, we suggest that this type of evaluation should not be used to draw conclusions on 'human parity', as previously done in some studies (Xu et al., 2015; Vinyals et al., 2017).

N-gram perturbations
In this section, we propose two perturbation experiments to investigate the sensitivity of metrics to frequent and infrequent phenomena. In both experiments, the first reference corpus R_1 of the COCO test set is considered to be the output of a hypothetical system from which perturbed versions are created. Each version is then evaluated against the four remaining references {R_2, . . . , R_5} of the COCO test set. Although we target unigrams for perturbation, they implicitly affect higher-order n-grams as well.

Frequent unigram perturbation
We create five independent variants of R 1 by substituting the most common words 'people', 'standing', 'sitting', 'man' and 'a' with the unknown token UNK. Table 2 shows that metrics react quite conservatively to this attack, until very high substitution percentages are reached. For example, substituting 'people' yields a drop of less than 0.4 points for BLEU, METEOR and ROUGE, a potentially uninteresting drop which can easily be overlooked during model development.
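A perturbed variant of this kind could be produced with a sketch like the one below. It is an assumption-laden illustration: the helper name `substitute_word` is hypothetical, sentences are assumed to be pre-tokenised and space-separated, matching is exact and case-sensitive, and every occurrence of the target word is replaced.

```python
def substitute_word(corpus, target, unk="UNK"):
    """Replace every occurrence of `target` with `unk`.

    Returns the perturbed corpus and the fraction of all tokens replaced.
    """
    perturbed = []
    replaced = 0
    total = 0
    for sentence in corpus:
        tokens = sentence.split()
        replaced += sum(t == target for t in tokens)
        total += len(tokens)
        perturbed.append(" ".join(unk if t == target else t for t in tokens))
    rate = replaced / total if total else 0.0
    return perturbed, rate


# Toy corpus standing in for the first reference corpus.
corpus = ["a man is riding a horse", "people standing near a bus"]
variant, rate = substitute_word(corpus, "a")
```

The returned rate corresponds to the share of unigrams affected by the substitution, which is one way to relate a metric drop to how frequent the substituted word is.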
Although repeatedly missing indefinite articles 'a' is semantically much less critical than missing 'people', all metrics aggressively penalise the former, simply because the function word 'a' is extremely frequent: with 53.8% relative drop, BLEU is the most affected metric, whereas METEOR and BERTSCORE are more robust with relative drops of 17.9% and 12.9%, respectively. To find out more, we randomly substitute content words from R 1 (RANDOM in   .7) show substantial increases when compared to dropping 'a', supporting our semantic conjecture above. However, the decreases in METEOR (20.03→19.58) and BERTSCORE (0.480→0.413) highlight the sensitivity of these metrics to semantics rather than pure frequencies. The continuous and contextual nature of BERTSCORE seems to be an advantage here.

Insensitivity to infrequent constructions
We now approach the problem from the other end to explore how sensitive metrics are to a set of rarely occurring words. We conjecture that this will provide insight into how much the metrics reward a hypothetical model that systematically translates a set of rare words correctly. Specifically, we create four variants of R_1, with each one substituting a particular set of words by UNK, based on training set frequencies. Table 3 shows that even the most aggressively short-listed hypothetical model (T = 30) obtains only marginally worse scores than the full vocabulary model (T = 1). As a concrete example, the last row shows the non-substantial impact of systematically replacing each occurrence of 'woman' (0.54% of all unigrams) with 'man' (1.21% of all unigrams).
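The frequency-based variants can be sketched as below. The exact role of the threshold T is not spelled out in the text; this sketch assumes that a word is replaced by UNK when its training set count is below T, so that T = 1 keeps every word seen in training. The helper names are hypothetical.

```python
from collections import Counter


def train_counts(train_corpus):
    """Unigram counts over a pre-tokenised, space-separated training corpus."""
    return Counter(tok for sentence in train_corpus for tok in sentence.split())


def shortlist(corpus, counts, min_count, unk="UNK"):
    """Replace words whose training count is below `min_count` (assumed meaning of T)."""
    return [
        " ".join(tok if counts[tok] >= min_count else unk for tok in sentence.split())
        for sentence in corpus
    ]


def swap_word(corpus, source, target):
    """Systematically replace `source` with `target`, e.g. 'woman' with 'man'."""
    return [
        " ".join(target if tok == source else tok for tok in sentence.split())
        for sentence in corpus
    ]


# Toy training corpus and hypothetical system output.
train = ["a man rides a horse", "a man walks", "a zebra grazes"]
counts = train_counts(train)
variant = shortlist(["a man rides a zebra"], counts, min_count=2)
swapped = swap_word(["a woman rides a horse"], "woman", "man")
```

Each variant would then be scored against the remaining references in the same way as the unperturbed corpus.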
Discussion. Although it is hard to interpret the magnitude of differences observed in both experiments, our empirical findings highlight interesting cases. Overall, the conclusions of our perturbation experiments are in line with the concurrent work of Mathur et al. (2020) which suggests that important conclusions, such as comparative judgments about systems, should not be drawn based only on small changes in automatic metrics.

Single representative sentence
Following the observations from the previous experiments, we search over the training set for a single representative sentence (SS) which maximises test set BLEU when used as a system output for every test set instance. We explore tasks and datasets which include the ones previously introduced (§ 3.1). For visual captioning, we add MSVD (Chen and Dolan, 2011), a widely known dataset of 1,970 videos, with up to 41 English captions per video. For image captioning, we use English Flickr30k (Young et al., 2014), and the STAIR (Yoshikawa et al., 2017) dataset which provides Japanese captions for COCO images. We also explore the multi-turn dialogue dataset DailyDialog (Li et al., 2017) which contains conversations that cover 10 different daily life topics, and its multi-reference test set (Gupta et al., 2019). Table 4 summarises the statistics about the datasets explored for this experiment. Table 5 draws a comparison between the scores obtained for the retrieved single sentences, baselines and state-of-the-art systems when available. We observe that: (i) MSVD exhibits the highest scores with 30.6 BLEU and 23.4 METEOR, the latter being very close to a strong captioning baseline (Venugopalan et al., 2015), (ii) the SS scores for DAILYDIALOG are surprisingly on par with a recent baseline and a state-of-the-art system, and (iii) CIDER is more robust against the single sentence adversary as more references seem to affect its internal consensus-based scoring. Although BERTSCORE does not explicitly rely on n-gram statistics, it also exhibits surprisingly high scores for the SS setting. This contradicts the salient claims regarding the utility of the metric for model selection (Zhang et al., 2020).

  (Lu et al., 2018)    75.2  34.4  26.5  1.06  55.5  0.628
FLICKR-EN
  "A man in a blue shirt standing in front of a building"    46.2  12.2  11.6  0.07  32.7  -
  Hard Attention (Xu et al., 2015)    66.9  19.9  18.5  -  -  -
  Neural Baby Talk (Lu et al., 2018)    69.
Table 5: Comparison of single sentence scores to several baselines and state-of-the-art. Single representative sentences are in italics. English translations are provided for non-English tasks.
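The single sentence search can be illustrated with the exhaustive sketch below. It is not the procedure used for the reported numbers: `toy_corpus_score` is a hypothetical clipped unigram precision standing in for corpus-level BLEU, and a brute-force loop over all unique training sentences is assumed to be affordable.

```python
from collections import Counter


def toy_corpus_score(hypotheses, references):
    """Hypothetical stand-in for corpus BLEU: clipped unigram precision
    aggregated over the whole test set."""
    matched = 0
    total = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_counts = Counter(hyp.split())
        max_ref = Counter()
        for ref in refs:
            for tok, count in Counter(ref.split()).items():
                max_ref[tok] = max(max_ref[tok], count)
        matched += sum(min(count, max_ref[tok]) for tok, count in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0


def best_single_sentence(train_sentences, test_references, score=toy_corpus_score):
    """Return the training sentence that maximises the corpus score when it is
    used as the system output for every test instance."""
    best_sentence, best_score = None, float("-inf")
    for candidate in dict.fromkeys(train_sentences):  # unique, order-preserving
        hypotheses = [candidate] * len(test_references)
        current = score(hypotheses, test_references)
        if current > best_score:
            best_sentence, best_score = candidate, current
    return best_sentence, best_score


# Toy training sentences and multi-reference test set.
train = ["a man is playing", "a dog runs on the beach", "two women are cooking"]
test_refs = [
    ["a man is playing a guitar", "a man plays guitar"],
    ["a woman is playing the piano"],
    ["a man is playing football", "men playing soccer"],
]
sentence, value = best_single_sentence(train, test_refs)
```

On this toy data the search selects the sentence made of the most frequent test set unigrams, which mirrors the preference for frequent n-grams that the experiment is designed to expose.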

Discussion
In this work, we explore cases where commonly used language generation evaluation metrics exhibit counter-intuitive behaviour. Although the main goal in language generation tasks is to generate 'human-quality' texts, our analysis in § 3.1 shows that metrics have a preference towards machine-generated texts rather than human references. Our perturbation experiments in § 3.2.2 highlight potential insensitivity of metrics to lexical changes in infrequent n-grams. This is a major concern for tasks such as multimodal machine translation (Specia et al., 2016) or pronoun resolution in MT (Guillou et al., 2016), where the metrics are expected to capture lexical changes which are due to rarely occurring linguistic ambiguities. We believe that targeted probes (Isabelle et al., 2017) are much more reliable than sentence or corpus level metrics for such tasks. Finally, we reveal that metrics tend to produce unexpectedly high scores when each test set hypothesis is set to a particular training set sentence, which can be thought of as finding the sweet spot of the corpus (§ 3.3). Therefore, we note that a high correlation between metrics and human judgments is not sufficient to characterise the reliability of a metric. The latter probably requires a thorough exploration and mitigation of adversarial cases such as the proposed single sentence baseline.