The price of debiasing automatic metrics in natural language evaluation

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7-13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks---the automatic metric and the prompt shown to human evaluators---both of which need to be improved to obtain greater cost savings.


Introduction
In recent years, there has been an increasing interest in tasks that require generating natural language, including abstractive summarization (Nallapati et al., 2016), open-response question answering (Nguyen et al., 2016; Kočisky et al., 2017), image captioning (Lin et al., 2014), and open-domain dialogue (Lowe et al., 2017b). Unfortunately, the evaluation of these systems remains a thorny issue because of the diversity of possible correct responses. As the gold standard of performing human evaluation is often too expensive, there has been a large effort developing automatic metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004), METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2014) and CiDER (Vedantam et al., 2015). However, these have been shown to be biased, correlating poorly with human metrics across different datasets and systems (Liu et al., 2016b).
Can we combine automatic metrics and human evaluation to obtain an unbiased estimate at lower cost than human evaluation alone? In this paper, we propose a simple estimator based on control variates (Ripley, 2009), where we average differences between human judgments and automatic metrics rather than averaging the human judgments alone. Provided the two are correlated, our estimator will have lower variance and thus reduce cost.
* Authors contributed equally.
We prove that our estimator is optimal in the sense that no unbiased estimator using the same automatic metric can have lower variance. We also analyze its data efficiency (equivalently, cost savings), i.e. the factor reduction in the number of human judgments needed to obtain the same accuracy versus naive human evaluation, and show that it depends solely on two factors: (a) the annotator variance (which is a function of the human evaluation prompt) and (b) the correlation between human judgments and the automatic metric. This factorization allows us to calculate typical and best-case data efficiencies and accordingly refine the evaluation prompt or automatic metric.
Finally, we evaluate our estimator on state-of-the-art systems from two tasks, summarization on the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) and open-response question answering on the MS MARCO v1.0 dataset (Nguyen et al., 2016). To study our estimators offline, we preemptively collected 10,000 human judgments which cover several tasks and systems.¹ As predicted by the theory, we find that the data efficiency depends not only on the correlation between the human and automatic metrics, but also on the evaluation prompt. If the automatic metric had perfect correlation, our data efficiency would be around 3, while if we had noiseless human judgments, our data efficiency would be about 1.5. In reality, the reduction in cost we obtained was only about 10%, suggesting that improvements in both automatic metric and evaluation prompt are needed. As one case study in improving the latter, we show that, when compared to a Likert survey, measuring the amount of post-editing needed to fix a generated sentence reduced the annotator variance three-fold.
Figure 1: (a) At the system level, automatic metrics (ROUGE-L) and human judgment correlate well, but (b) the instance-level correlation plot (where each point is a system prediction) shows that the instance-level correlation is quite low (ρ = 0.31). As a consequence, if we try to locally improve systems to produce better answers (marked in (a)), they do not significantly improve ROUGE scores, and vice versa.

Bias in automatic evaluation
It is well understood that current automatic metrics tend to correlate poorly with human judgment at the instance level. For example, prior work reports correlations less than 0.3 for a large suite of word-based and grammar-based evaluation methods on a generation task. Similarly, Liu et al. (2016b) find correlations less than 0.35 for automatic metrics on a dialog generation task in one domain, but find that correlations with the same metric drop significantly, to less than 0.16, when used in another domain. Still, somewhat surprisingly, several automatic metrics have been found to have high system-level correlations. What, then, are the implications of having a low instance-level correlation?
¹ An anonymized version of this data and the annotation interfaces used can be found at https://bit.ly/price-of-debiasing.
As a case study, consider the task of open-response question answering: here, a system receives a human-generated question and must generate an answer from some given context, e.g. a document or several webpages. We collected the responses of several systems on the MS MARCO v1 dataset (Nguyen et al., 2016) and crowdsourced human evaluations of the system output (see Section 4 for details).
The instance-level correlation (Figure 1b) is only ρ = 0.31. A closer look at the instance-level correlation reveals that while ROUGE is able to correctly assign low scores to bad examples (lower left), it is bad at judging good examples and often assigns them low ROUGE scores (lower right); see Table 1 for examples. This observation agrees with the previously reported finding that automatic metrics correlate better with human judgments on bad examples than on average or good examples.
Thus, as Figure 1(a) shows, we can improve low-scoring ROUGE examples without improving their human judgment and vice versa. Indeed, Conroy and Dang (2008) report that summarization systems were optimized for ROUGE during the DUC challenge (Dang, 2006) until they were indistinguishable from the ROUGE scores of human-generated summaries, but the systems had hardly improved on human evaluation. Hill-climbing on ROUGE can also lead to a system that does worse on human scores, e.g. in machine translation (Wu et al., 2016). Conversely, genuine quality improvements might not be reflected in improvements in ROUGE. This bias also appears in pool-based evaluation for knowledge base population (Chaganty et al., 2017). Thus the problems with automatic metrics clearly motivate the need for human evaluation, but can we still use the automatic metrics somehow to save costs?
Table 1: Examples highlighting the different modes in which the automatic metric and human judgments may agree or disagree. On the MS MARCO task, a majority of responses from systems were actually correct but poorly scored according to ROUGE-L. On the CNN/Daily Mail task, a significant number of examples which are scored highly by VecSim are poorly rated by humans, and likewise many examples scored poorly by VecSim are highly rated by humans.
(b) CNN/Daily Mail. Human judgment scores used are post-edit distance (Edit) (lower is better) and the automatic metric used is sentence vector similarity with the reference (higher is better).
Bhullar is set to sign a -day contract with the Kings. The -year-old will become the NBA's first player of Indian descent. Bhullar will be on the roster when the Kings host New Orleans Pelicans.
Bhullar and The Kings are signing Bhullar to a -day contract. The -year-old will be on the roster on friday when David Wear's -season contract expires thursday. Bhullar is set to become the NBA's first player of Indian descent.
The show has never had high ratings but is considered one of the great TV series. It's unknown what will happen to characters, but we can always guess.
'This's "Mad Men" is the end of a series of an era', This he says. Stores have created fashion lines inspired by the show."The Sopranos". The in the Kent State shootings in may or Richard Nixon's re-election.. (ml+rl; 0.95 / 0.24)

Statistical estimation for unbiased evaluation
We will now formalize the problem of combining human evaluation with an automatic metric. Let X be a set of inputs (e.g., articles), and let S be the system (e.g. for summarization), which takes x ∈ X and returns output S(x) (e.g. a summary).
Let $Z = \{(x, S(x)) : x \in X\}$ be the set of system predictions. Let $Y(z)$ be the random variable representing the human judgment according to some evaluation prompt (e.g. grammaticality or correctness), and define $f(z) = E[Y(z)]$ to be the (unknown) human metric corresponding to averaging over an infinite number of human judgments. Our goal is to estimate the average across all examples,
$$\mu \overset{\text{def}}{=} E_z[f(z)] = \frac{1}{|Z|} \sum_{z \in Z} f(z), \tag{1}$$
with as few queries to $Y$ as possible.
Let g be an automatic metric (e.g. ROUGE), which maps z to a real number. We assume evaluating g(z) is free. The central question is how to use g in conjunction with calls to Y to produce an unbiased estimate $\hat{\mu}$ (that is, $E[\hat{\mu}] = \mu$). In this section, we will construct a simple estimator based on control variates (Ripley, 2009), and prove that it is minimax optimal.

Sample mean
We warm up with the most basic unbiased estimate, the sample mean. We sample $z^{(1)}, \ldots, z^{(n)}$ independently with replacement from $Z$. Then, we sample each human judgment $y^{(i)} = Y(z^{(i)})$ independently.² Define the estimator to be $\hat{\mu}_{\text{mean}} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}$. Note that $\hat{\mu}_{\text{mean}}$ is unbiased ($E[\hat{\mu}_{\text{mean}}] = \mu$).
We can define $\sigma_f^2 \overset{\text{def}}{=} \mathrm{Var}(f(z))$ as the variance of the human metric and $\sigma_a^2 \overset{\text{def}}{=} E_z[\mathrm{Var}(Y(z))]$ as the variance of human judgment averaged over $Z$. By the law of total variance, the variance of our estimator is
$$\mathrm{Var}(\hat{\mu}_{\text{mean}}) = \frac{1}{n} \left( \sigma_f^2 + \sigma_a^2 \right). \tag{2}$$
² Note that this independence assumption isn't quite true in practice since we do not control who annotates our data.
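As a small sanity check of this variance decomposition, the sketch below computes $\sigma_f^2$ and $\sigma_a^2$ exactly for a toy set of outputs and confirms that they add up to the variance of a single judgment. It assumes, purely for illustration, that each human judgment is a Bernoulli draw with success probability $f(z)$; the function names and the example probabilities are hypothetical and are not taken from the paper's data.

```python
from statistics import fmean


def variance_components(f_values):
    """Return (mu, sigma_f2, sigma_a2) for binary judgments Y(z) ~ Bernoulli(f(z)).

    f_values holds f(z) = E[Y(z)] for every z in a small, fully enumerated Z
    (an illustrative assumption; in practice f is unknown).
    """
    mu = fmean(f_values)
    # Variance of the human metric f(z) over the outputs.
    sigma_f2 = fmean([(f - mu) ** 2 for f in f_values])
    # Annotator variance: Var(Y(z)) = f(z) * (1 - f(z)) for a Bernoulli,
    # averaged over z.
    sigma_a2 = fmean([f * (1.0 - f) for f in f_values])
    return mu, sigma_f2, sigma_a2


def sample_mean_variance(f_values, n):
    """Variance of the sample mean over n independent (z, y) draws."""
    _, sigma_f2, sigma_a2 = variance_components(f_values)
    return (sigma_f2 + sigma_a2) / n


if __name__ == "__main__":
    toy_f = [0.9, 0.5, 0.1, 0.7]  # hypothetical per-output correctness rates
    mu, sigma_f2, sigma_a2 = variance_components(toy_f)
    # A single draw of Y is Bernoulli(mu), so its variance is mu * (1 - mu);
    # the law of total variance says this equals sigma_f2 + sigma_a2.
    print(mu * (1.0 - mu), sigma_f2 + sigma_a2)
    print(sample_mean_variance(toy_f, n=100))
```

The two printed quantities on the first line should agree up to floating-point error, which is the single-sample version of the decomposition used for the sample mean.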

Control variates estimator
Now let us see how an automatic metric g can reduce variance. If there is no annotator variance ($\sigma_a^2 = 0$) so that $Y(z) = f(z)$, we should expect the variance of $f(z) - g(z)$ to be lower than the variance of $f(z)$, assuming g is correlated with f; see Figure 2 for an illustration.
The actual control variates estimator needs to handle noisy $Y(z)$ (i.e. $\sigma_a^2 > 0$) and guard against a $g(z)$ with low correlation. Let us standardize g to have zero mean and unit variance, which we can do because we have assumed it is free to evaluate. As before, let $z^{(1)}, \ldots, z^{(n)}$ be independent samples from $Z$ and draw $y^{(i)} = Y(z^{(i)})$ independently as well. We define the control variates estimator as
$$\hat{\mu}_{\text{cv}} = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \alpha\, g(z^{(i)}) \right), \tag{3}$$
where
$$\alpha = \mathrm{Cov}(f(z), g(z)). \tag{4}$$
Intuitively, we have averaged over $y^{(i)}$ to handle the noise introduced by $Y(z)$, and scaled $g(z)$ to prevent an uncorrelated automatic metric from introducing too much noise. An important quantity governing the quality of an automatic metric g is the correlation between $f(z)$ and $g(z)$ (recall that g has unit variance):
$$\rho = \frac{\alpha}{\sigma_f}. \tag{5}$$
We can show that among all distributions with fixed $\sigma_f^2$, $\sigma_a^2$, and $\alpha$ (equivalently $\rho$), this estimator is minimax optimal, i.e. it has the least variance among all unbiased estimators:
Theorem 3.1. Among all unbiased estimators that are functions of $y^{(i)}$ and $g(z^{(i)})$, and for all distributions with a given $\sigma_f^2$, $\sigma_a^2$, and $\alpha$,
$$\mathrm{Var}(\hat{\mu}_{\text{cv}}) = \frac{1}{n} \left( \sigma_f^2 (1 - \rho^2) + \sigma_a^2 \right), \tag{6}$$
and no other estimator has a lower worst-case variance.
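The variance of the control variates estimator can be checked by hand. The following derivation is a sketch that uses only the quantities defined in this section: g is standardized to zero mean and unit variance, $\alpha = \mathrm{Cov}(f(z), g(z))$, $\rho = \alpha / \sigma_f$, and the n samples are independent. It covers the variance of one term $y - \alpha g(z)$ with the true $\alpha$; it does not reproduce the minimax lower bound.

```latex
\begin{align*}
\mathrm{Var}\big(Y(z) - \alpha\, g(z)\big)
  &= \mathrm{Var}_z\big(f(z) - \alpha\, g(z)\big)
     + E_z\big[\mathrm{Var}(Y(z) \mid z)\big]
     && \text{(law of total variance)} \\
  &= \sigma_f^2 - 2\alpha\, \mathrm{Cov}(f(z), g(z))
     + \alpha^2\, \mathrm{Var}(g(z)) + \sigma_a^2 \\
  &= \sigma_f^2 - 2\alpha^2 + \alpha^2 + \sigma_a^2
     && \text{(since } \mathrm{Var}(g(z)) = 1\text{)} \\
  &= \sigma_f^2 (1 - \rho^2) + \sigma_a^2
     && \text{(since } \alpha = \rho\, \sigma_f\text{)}.
\end{align*}
```

Averaging n independent terms divides this by n. Unbiasedness follows from $E[g(z)] = 0$, so subtracting $\alpha\, g(z)$ does not shift the mean.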
Figure 2: The samples from $f(z)$ have a higher variance than the samples from $f(z) - g(z)$ but the same mean. This is the key idea behind using control variates to reduce variance.
Comparing the variances of the two estimators ((2) and (6)), we define the data efficiency as the ratio of the variances:
$$\mathrm{DE} \overset{\text{def}}{=} \frac{\mathrm{Var}(\hat{\mu}_{\text{mean}})}{\mathrm{Var}(\hat{\mu}_{\text{cv}})} = \frac{1 + \gamma}{1 - \rho^2 + \gamma}, \tag{7}$$
where $\gamma \overset{\text{def}}{=} \sigma_a^2 / \sigma_f^2$ is the normalized annotator variance. Data efficiency is the key quantity in this paper: it is the multiplicative reduction in the number of samples required when using the control variates estimator $\hat{\mu}_{\text{cv}}$ versus the sample mean $\hat{\mu}_{\text{mean}}$. Figure 3 shows the inverse data efficiency contours as a function of the correlation $\rho$ and $\gamma$.
When there is no correlation between human and automatic metrics ($\rho = 0$), the data efficiency is naturally 1 (no gain). In order to achieve a data efficiency of 2 (half the labeling cost), we need $|\rho| \geq \sqrt{2}/2 \approx 0.707$. Interestingly, even for an automatic metric with perfect correlation ($\rho = 1$), the data efficiency is still capped by $\frac{1 + \gamma}{\gamma}$: unless $\gamma \to 0$ the data efficiency cannot increase unboundedly. Intuitively, even if we knew that $\rho = 1$, $f(z)$ would be undetermined up to a constant additive shift and just estimating the shift would incur a variance of $\frac{1}{n} \sigma_a^2$.
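These special cases are easy to reproduce numerically. The sketch below evaluates the ratio of the two estimator variances, $(1 + \gamma) / (1 - \rho^2 + \gamma)$, which follows from the variances of the sample mean and of the control variates estimator. The function names are illustrative, and converting a data efficiency into a fractional cost reduction as $1 - 1/\mathrm{DE}$ is an assumption about how the reported percentages are computed.

```python
import math


def data_efficiency(rho, gamma):
    """Ratio Var(sample mean) / Var(control variates) for correlation rho and
    normalized annotator variance gamma = sigma_a^2 / sigma_f^2."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    if gamma < 0.0:
        raise ValueError("gamma must be non-negative")
    denominator = 1.0 - rho ** 2 + gamma
    if denominator == 0.0:
        # Perfect metric and noiseless annotators: unbounded efficiency.
        return math.inf
    return (1.0 + gamma) / denominator


def cost_reduction(efficiency):
    """Fraction of human judgments saved at equal variance (assumed 1 - 1/DE)."""
    return 1.0 - 1.0 / efficiency


if __name__ == "__main__":
    print(data_efficiency(0.0, 0.5))             # uncorrelated metric
    print(data_efficiency(math.sqrt(0.5), 0.0))  # |rho| = sqrt(2)/2, no noise
    print(data_efficiency(1.0, 0.5))             # perfect metric, capped
    print(cost_reduction(1.08), cost_reduction(1.15))
```

With $\gamma = 0.5$, a perfectly correlated metric gives $(1 + 0.5)/0.5 = 3$, which illustrates the cap described above.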

Using the control variates estimator
The control variates estimator can be easily integrated into an existing evaluation: we run human evaluation on a random sample of system outputs, automatic evaluation on all the system outputs, and plug these results into Algorithm 1.
It is vital that we are able to evaluate the automatic metric on a significantly larger set of examples than those with human evaluations to reliably normalize $g(z)$: without these additional examples, it can be shown that the optimal minimax estimator for $\mu$ is simply the naive estimate $\hat{\mu}_{\text{mean}}$. Intuitively, this is because estimating the mean of $g(z)$ incurs an equally large variance as estimating $\mu$. In other words, $g(z)$ is only useful if we have additional information about g beyond the samples $\{z^{(i)}\}$.
Algorithm 1 shows the estimator. In practice, we do not know $\alpha = \mathrm{Cov}(f(z), g(z))$, so we use a plug-in estimate $\hat{\alpha}$ in line 3 to compute the estimate $\hat{\mu}$ in line 4. We note that estimating $\alpha$ from data does introduce an $O(1/n)$ bias, but when compared to the standard deviation, which decays as $\Theta(1/\sqrt{n})$, this bias quickly goes to 0.
Algorithm 1 Control variates estimator
1: Input: n human evaluations $y^{(i)}$ on system outputs $z^{(i)}$, normalized automatic metric g
2: $\bar{y} = \frac{1}{n} \sum_i y^{(i)}$
3: $\hat{\alpha} = \frac{1}{n} \sum_i (y^{(i)} - \bar{y})\, g(z^{(i)})$
4: $\hat{\mu} = \frac{1}{n} \sum_i \left( y^{(i)} - \hat{\alpha}\, g(z^{(i)}) \right)$
5: return $\hat{\mu}$
An additional question that arises when applying Algorithm 1 is figuring out how many samples n to use. Given a target variance, the number of samples can be estimated using (6) with conservative estimates of $\sigma_f^2$, $\sigma_a^2$ and $\rho$. Alternatively, our estimator can be combined with a dynamic stopping rule (Mnih et al., 2008) to stop data collection once we reach a target confidence interval.
Table 2: A summary of the key statistics, human metric variance ($\sigma_f^2$) and annotator variance ($\sigma_a^2$), for different datasets, CNN/Daily Mail (CDM) and MS MARCO, in our evaluation benchmark. We observe that the relative variance ($\gamma$) is fairly high for most evaluation prompts, upper bounding the data efficiency on these tasks. A notable exception is the Edit prompt wherein systems are compared on the number of post-edits required to improve their quality.
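Algorithm 1 and the sample-size calculation translate almost line for line into code. The following is a minimal sketch using only the Python standard library, not the released implementation: the function names are hypothetical, the metric is standardized with the population mean and standard deviation computed over all system outputs (as the text requires), and `samples_needed` simply inverts the variance of the control variates estimator, $\frac{1}{n}(\sigma_f^2 (1 - \rho^2) + \sigma_a^2)$, for n.

```python
import math
from statistics import fmean, pstdev


def standardize(g_sample, g_all):
    """Standardize metric values for the human-labeled sample using the mean
    and standard deviation of the metric over ALL system outputs."""
    mean, std = fmean(g_all), pstdev(g_all)
    if std == 0.0:
        raise ValueError("automatic metric is constant; it carries no signal")
    return [(value - mean) / std for value in g_sample]


def control_variates_estimate(y, g):
    """Algorithm 1: y are human judgments, g the standardized metric values
    on the same outputs. Returns (mu_hat, alpha_hat)."""
    if len(y) != len(g) or not y:
        raise ValueError("y and g must be non-empty and of equal length")
    y_bar = fmean(y)                                                # line 2
    alpha_hat = fmean([(yi - y_bar) * gi for yi, gi in zip(y, g)])  # line 3
    mu_hat = fmean([yi - alpha_hat * gi for yi, gi in zip(y, g)])   # line 4
    return mu_hat, alpha_hat


def samples_needed(target_variance, sigma_f2, sigma_a2, rho):
    """Smallest n whose estimator variance is at most target_variance, given
    (ideally conservative) guesses of sigma_f^2, sigma_a^2 and rho."""
    per_sample_variance = sigma_f2 * (1.0 - rho ** 2) + sigma_a2
    return math.ceil(per_sample_variance / target_variance)


if __name__ == "__main__":
    # Toy numbers, for illustration only.
    mu_hat, alpha_hat = control_variates_estimate(
        [1.0, 2.0, 3.0, 4.0], [-1.0, 0.0, 1.0, 2.0])
    print(mu_hat, alpha_hat)
    print(samples_needed(0.001, sigma_f2=0.0875, sigma_a2=0.16, rho=0.5))
```

Note that the correction in line 4 only has an effect because the standardized metric need not average to zero on the labeled sample, even though it does over all outputs.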

Discussion of assumptions
We will soon see that empirical instantiations of γ and ρ lead to rather underwhelming data efficiencies in practice. In light of our optimality result, does this mean there is no hope for gains? Let us probe our assumptions. We assumed that the human judgments are uncorrelated across different system outputs; it is possible that a more accurate model of human annotators (e.g. Passonneau and Carpenter (2014)) could offer improvements. Perhaps with additional information about g(z) such as calibrated confidence estimates, we would be able to sample more adaptively. Of course the most direct routes to improvement involve increasing the correlation of g with human judgments and reducing annotator variance, which we will discuss more later.

Tasks and datasets
In order to compare different approaches to evaluating systems, we first collected human judgments for the output of several automatic summarization and open-response question answering systems using Amazon Mechanical Turk. Details of instructions provided and quality assurance steps taken are provided in Appendix A of the supplementary material. In this section, we'll briefly describe how we collected this data.
Evaluating language quality in automatic summarization. In automatic summarization, systems must generate a short (on average two or three sentence) summary of an article: for our study, we chose articles from the CNN/Daily Mail (CDM) dataset (Hermann et al., 2015;Nallapati et al., 2016) which come paired with reference summaries in the form of story highlights. We focus on the language quality of summaries and leave evaluating content selection to future work.
For each summary, we collected human judgments on a scale from 1-3 (Figure 4a) for fluency, (lack of) redundancy, and overall quality of the summary using guidelines from the DUC summarization challenge (Dang, 2006). As an alternate human metric, we also asked workers to post-edit the system's summary to improve its quality, similar to the post-editing step in MT evaluations (Snover et al., 2006). Obtaining judgments costs about $0.15 per summary and this cost rises to about $0.40 per summary for post-editing.
We collected judgments on the summaries generated by the seq2seq and pointer models of See et al. (2017), the ml and ml+rl models of Paulus et al. (2018), and the reference summaries. Before presenting the summaries to human annotators, we performed some minimal post-processing: we true-cased and de-tokenized the output of seq2seq and pointer using Stanford CoreNLP (Manning et al., 2014) and replaced "unknown" tokens in each system with a special symbol.
Evaluating answer correctness. Next, we look at evaluating the correctness of system outputs in question answering using the MS MARCO question answering dataset (Nguyen et al., 2016). Here, each system is provided with a question and up to 10 paragraphs of context. The system generates open-response answers that do not need to be tied to a span in any paragraph.
We first ask annotators to judge if the output is even plausible for the question, and if yes, ask them to identify if it is correct according to each context paragraph. We found that requiring annotators to highlight regions in the text that support their decision substantially improved the quality of the output without increasing costs. Annotations cost $0.
While our goal is to evaluate the correctness of the provided answer, we found that there are often answers which may be correct or incorrect depending on the context. For example, the question "what is a pothole" is typically understood to refer to a hole in a roadway, but also refers to a geological feature (Figure 4b). This is reflected when annotators mark one context paragraph to support the given answer but mark another to contradict it. We evaluated systems based on both the average correctness (AvgCorrect) of their answers across all paragraphs as well as whether their answer is correct according to any paragraph (AnyCorrect).
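The two correctness metrics can be sketched as follows. This is an illustration under stated assumptions rather than the annotation pipeline itself: the data layout (one boolean per context paragraph) and the function names are hypothetical, and scoring an implausible answer as 0 on both metrics is an assumption that the text does not spell out.

```python
from statistics import fmean


def answer_scores(paragraph_labels, plausible=True):
    """Return (AvgCorrect, AnyCorrect) for one answer.

    paragraph_labels: one boolean per context paragraph, True if the answer
    is correct according to that paragraph.
    plausible: the first-stage judgment; implausible answers are assumed to
    score 0 on both metrics.
    """
    if not plausible or not paragraph_labels:
        return 0.0, 0.0
    avg_correct = fmean([1.0 if label else 0.0 for label in paragraph_labels])
    any_correct = 1.0 if any(paragraph_labels) else 0.0
    return avg_correct, any_correct


def system_scores(annotations):
    """Average both metrics over a system's answers.

    annotations: list of (paragraph_labels, plausible) pairs, one per answer.
    """
    per_answer = [answer_scores(labels, plausible)
                  for labels, plausible in annotations]
    return (fmean([avg for avg, _ in per_answer]),
            fmean([any_ for _, any_ in per_answer]))


if __name__ == "__main__":
    # An answer supported by one of four paragraphs ("pothole" as a hole in a
    # roadway) and an answer judged implausible.
    toy = [([True, False, False, False], True), ([False, False], False)]
    print(system_scores(toy))
```

The toy input shows why the two metrics can diverge: an answer supported by a single paragraph counts fully under AnyCorrect but only fractionally under AvgCorrect.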
We collected annotations on the answers generated by the fastqa and fastqa ext systems from Weissenborn et al. (2017) and the snet and snet.ens(emble) models from Tan et al. (2018), along with reference answers. The answers generated by the systems were used without any postprocessing. Surprisingly, we found that the correctness of the reference answers (according to the AnyCorrect metric) was only 73.5%, only 2% above that of the leading system (snet.ens). We manually inspected 30 reference answers which were annotated as incorrect and found that of those, about 95% were indeed incorrect. However, 62% are actually answerable from some paragraph, indicating that the real ceiling performance on this dataset is around 90% and that there is still room for improvement on this task.

Experimental results
We are now ready to evaluate the performance of our control variates estimator proposed in Section 3 using the datasets presented in Section 4.
Recall that our primary quantity of interest is data efficiency: the ratio of the number of human judgments the sample mean requires to the number the control variates estimator requires in order to estimate the overall human evaluation score with the same accuracy. We'll briefly review the automatic metrics used in our evaluation before analyzing the results.
Automatic metrics. We consider the following frequently used automatic word-overlap based metrics in our work: BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004) and METEOR (Lavie and Denkowski, 2009). Following prior work, including Liu et al. (2016b), we also use a vector-based sentence similarity computed with sent2vec (Pagliardini et al., 2017) to compare sentences (VecSim). Figure 5 shows how each of these metrics is correlated with human judgment for the systems being evaluated. Unsurprisingly, the correlation varies considerably across systems, with token-based metrics correlating more strongly for systems that are more extractive in nature (fastqa and fastqa ext).
Results. In Section 3 we proved that the control variates estimator is not only unbiased but also has the least variance among unbiased estimators. Figure 6 plots the width of the 80% confidence interval, estimated using bootstrap, measured as a function of the number of samples collected for different tasks and prompts. As expected, the control variates estimator reduces the width of the confidence interval. We measure data efficiency by averaging the ratio of squared confidence intervals between the human baseline and control variates estimates. We observe that the data efficiency depends on the task, prompt and system, ranging from about 1.08 (a 7% cost reduction) to 1.15 (a 13% cost reduction) using current automatic metrics.
Figure 5: Certain systems are more correlated with certain automatic metrics than others, but overall the correlation is low to moderate for most systems and metrics.
As we showed in Section 3, further gains are fundamentally limited by the quality of the evaluation prompts and automatic metrics. Figures 6a and 6b show how improving the quality of the evaluation prompt from a Likert-scale prompt for quality (Overall) to using post-editing (Edit) noticeably decreases variance and hence allows better automatic metrics to increase data efficiency. Likewise, Figure 6c shows how using a better automatic metric (ROUGE-L instead of VecSim) also reduces variance.
Figure 6 also shows the conjectured confidence intervals if we were able to eliminate noise in human judgments (noiseless humans) or have an automatic metric that correlated perfectly with average human judgment (perfect metric). In particular, we use the mean of all (2-3) humans on each z for the perfect g(z) and use the mean of all humans on each z for the "noiseless" Y (z).
In both cases, we are able to significantly increase data efficiency (i.e. decrease estimator variance). With zero annotator variance and using existing automatic metrics, the data efficiency ranges from 1.42 to 1.69. With automatic metrics with perfect correlation and current variance of human judgments, it ranges from 2.38 to 7.25. Thus, we conclude that it is important not only to improve our automatic metrics but also the evaluation prompts we use during human evaluation.
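The measurement procedure described in this section can be sketched as follows: resample the labeled examples with replacement, recompute each estimator, take the width of the central 80% interval, and compare squared widths. This is a simplified illustration, not the experimental code. It computes the ratio at a single sample size rather than averaging over sample sizes, uses a simple percentile bootstrap, and all function names, defaults and the synthetic data are assumptions.

```python
import random
from statistics import fmean


def sample_mean(y, g):
    """Baseline estimator: ignores the automatic metric."""
    return fmean(y)


def control_variates(y, g):
    """Plug-in control variates estimate; g must already be standardized."""
    y_bar = fmean(y)
    alpha_hat = fmean([(yi - y_bar) * gi for yi, gi in zip(y, g)])
    return fmean([yi - alpha_hat * gi for yi, gi in zip(y, g)])


def bootstrap_ci_width(estimator, y, g, level=0.80, n_boot=1000, seed=0):
    """Width of the percentile bootstrap confidence interval of an estimator."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        # Resample (y, g) pairs jointly, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(estimator([y[i] for i in idx], [g[i] for i in idx]))
    estimates.sort()
    low = estimates[int((1.0 - level) / 2.0 * (n_boot - 1))]
    high = estimates[int((1.0 + level) / 2.0 * (n_boot - 1))]
    return high - low


def empirical_data_efficiency(y, g, **kwargs):
    """Ratio of squared interval widths: sample mean over control variates."""
    width_mean = bootstrap_ci_width(sample_mean, y, g, **kwargs)
    width_cv = bootstrap_ci_width(control_variates, y, g, **kwargs)
    return (width_mean / width_cv) ** 2


if __name__ == "__main__":
    # Synthetic data in which the metric is strongly correlated with y.
    rng = random.Random(1)
    g = [rng.gauss(0.0, 1.0) for _ in range(200)]
    y = [gi + rng.gauss(0.0, 0.5) for gi in g]
    print(empirical_data_efficiency(y, g))
```

Both estimators are evaluated on the same resamples because the same seed is used for each, which reduces the noise in the ratio.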

Related work
In this work, we focus on using existing automatic metrics to decrease the cost of human evaluations. There has been much work on improving the quality of automatic metrics. In particular, there is interest in learning models (Lowe et al., 2017a;Dusek et al., 2017) that are able to optimize for improved correlations with human judgment. However, in our experience, we have found that these learned automatic metrics have trouble generalizing to different systems. The framework we provide allows us to safely incorporate such models into evaluation, exploiting them when their correlation is high but also not introducing bias when it is low.
Our key technical tool is control variates, a standard statistical technique used to reduce the variance of Monte Carlo estimates (Ripley, 2009). The technique has also been used in machine learning and reinforcement learning to lower variance estimates of gradients (Greensmith et al., 2004;Paisley et al., 2012;Ranganath et al., 2014). To the best of our knowledge, we are the first to apply this technique in the context of language evaluation.
Our work also highlights the importance of human evaluation. Chaganty et al. (2017) identified a similar problem of systematic bias in evaluation metrics in the setting of knowledge base population and also proposed statistical estimators that rely on human evaluation to correct bias. Unfortunately, their technique relies on having a structured output (relation triples) that is shared between systems and does not apply to evaluating natural language generation. In a similar vein, Chang et al. (2017) dynamically collect human feedback to learn better dialog policies.
Figure 6: 80% bootstrap confidence interval length as a function of the number of human judgments used when evaluating the indicated systems on their respective datasets and prompts. (a) We see a modest reduction in variance (and hence cost) relative to human evaluation by using the VecSim automatic metric with the proposed control variates estimator to estimate Overall scores on the CNN/Daily Mail task; the data efficiency (DE) is 1.06. (b) By improving the evaluation prompt to use Edits instead, it is possible to further reduce variance relative to humans (DE is 1.15). (c) Another way to reduce variance relative to humans is to improve the automatic metric evaluation; here using ROUGE-1 instead of VecSim improves the DE from 1.03 to 1.16.

Discussion
Prior work has shown that existing automatic metrics have poor instance-level correlation with mean human judgment and that they score many good quality responses poorly. As a result, the evaluation is systematically biased against genuine system improvements that would lead to higher human evaluation scores but not improve automatic metrics. In this paper, we have explored using an automatic metric to decrease the cost of human evaluation without introducing bias. In practice, we find that with current automatic metrics and evaluation prompts, data efficiencies are only 1.08-1.15 (7-13% cost reduction). Our theory shows that further improvements are only possible by improving the correlation of the automatic metric and reducing the annotator variance of the evaluation prompt. As an example of how evaluation prompts could be improved, we found that using post-edits of summaries decreased normalized annotator variance by a factor of three relative to using a Likert scale survey. It should be noted that changing the evaluation prompt also changes the underlying ground truth f (z): it is up to us to find a prompt that still captures the essence of what we want to measure.
Without making stronger assumptions, the control variates estimator we proposed outlines the limitations of unbiased estimation. Where do we go from here? Certainly, we can try to improve the automatic metric (which is potentially as difficult as solving the task) and brainstorm alternative ways of soliciting evaluation (which has been less explored). Alternatively, we could give up on measuring absolute scores, and seek instead to find techniques to stably rank methods and thus improve them. As the NLP community tackles increasingly difficult tasks, human evaluation will only become more important. We hope our work provides some clarity on how to make it more cost effective.

Reproducibility
All code, data, and experiments for this paper are available on the CodaLab platform at https://bit.ly/price-of-debiasing.