Re-evaluating Evaluation in Text Summarization

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.


Introduction
In text summarization, manual evaluation, as exemplified by the Pyramid method (Nenkova and Passonneau, 2004), is the gold standard in evaluation. However, due to the time required and relatively high cost of annotation, the great majority of research papers on summarization use exclusively automatic evaluation metrics, such as ROUGE (Lin, 2004), JS-2 (Louis and Nenkova, 2013), S3 (Peyrard et al., 2017), BERTScore (Zhang et al., 2020), MoverScore (Zhao et al., 2019), etc. Among these metrics, ROUGE is by far the most popular, and there is relatively little discussion of how ROUGE may deviate from human judgment and the potential for this deviation to change conclusions drawn regarding the relative merit of baseline and proposed methods. To characterize the relative goodness of evaluation metrics, it is necessary to perform meta-evaluation (Graham, 2015; Lin and Och, 2004), where a dataset annotated with human judgments (e.g. TAC-2008 (Dang and Owczarzak, 2008)) is used to test the degree to which automatic metrics correlate therewith.
However, the classic TAC meta-evaluation datasets are now 6-12 years old, and it is not clear whether conclusions found there will hold with modern systems and summarization tasks. Two earlier works exemplify this disconnect: (1) Peyrard (2019) observed that the human-annotated summaries in the TAC dataset are mostly of lower quality than those produced by modern systems, and that various automated evaluation metrics strongly disagree in the higher-scoring range in which current systems now operate. (2) Rankel et al. (2013) observed that the correlation between ROUGE and human judgments in the TAC dataset decreases when looking at the best systems only, even for systems from eight years ago, which are far from today's state-of-the-art.
Constrained by few existing human judgment datasets, it remains unknown how existing metrics behave on current top-scoring summarization systems. In this paper, we ask the question: does the rapid progress of model development in summarization require us to re-evaluate the evaluation process used for text summarization? To this end, we create and release a large benchmark for meta-evaluating summarization metrics including:
• Outputs from 25 top-scoring extractive and abstractive summarization systems on the CNN/DailyMail dataset.
• Automatic evaluations from several evaluation metrics including traditional metrics (e.g. ROUGE) and modern semantic matching metrics (e.g. BERTScore, MoverScore).
• Manual evaluations using the lightweight pyramids method (Shapira et al., 2019), which we use as a gold standard to evaluate summarization systems as well as automated metrics.
Among our observations: (1) ROUGE metrics outperform all other metrics. (2) For extractive summaries, most metrics are better at evaluating summaries than systems; for abstractive summaries, some metrics are better at the summary level, others are better at the system level.
Using this benchmark, we perform an extensive analysis, which indicates the need to re-examine our assumptions about the evaluation of automatic summarization systems. Specifically, we conduct four experiments analyzing the correspondence between various metrics and human evaluation. Somewhat surprisingly, we find that many of the previously attested properties of metrics found on the TAC dataset demonstrate different trends on our newly collected CNNDM dataset, as shown in Tab. 1. For example, MoverScore is the best-performing metric for evaluating summaries on the TAC dataset, but it is significantly worse than ROUGE-2 on our collected CNNDM set. Additionally, many previous works (Novikova et al., 2017; Peyrard et al., 2017; Chaganty et al., 2018) show that metrics have much lower correlations when comparing summaries than when comparing systems. For extractive summaries on CNNDM, however, most metrics are better at comparing summaries than systems.
Calls for Future Research These observations demonstrate the limitations of our current best-performing metrics, highlighting (1) the need for future meta-evaluation to (i) be conducted across multiple datasets and (ii) evaluate metrics in different application scenarios, e.g. summary level vs. system level; (2) the need for more systematic meta-evaluation of summarization metrics that updates with our ever-evolving systems and datasets; and (3) the potential benefit to the summarization community of a shared task similar to the WMT Metrics Task in Machine Translation (http://www.statmt.org/wmt20/), where systems and metrics co-evolve.

Preliminaries
In this section we describe the datasets, systems, metrics, and meta-evaluation methods used below.
Datasets TAC-2008 and TAC-2009 (Dang and Owczarzak, 2008, 2009) are multi-document, multi-reference summarization datasets. Human judgments are available for the system summaries submitted during the TAC-2008 and TAC-2009 shared tasks. CNN/DailyMail (CNNDM) (Hermann et al., 2015; Nallapati et al., 2016) is a commonly used summarization dataset that contains news articles and associated highlights as summaries. We use the version without entities anonymized.

Evaluation Metrics
We examine eight metrics that measure the agreement between two texts, in our case, between the system summary and the reference summary. BERTScore (BScore) measures soft overlap between contextual BERT embeddings of the tokens of the two texts (Zhang et al., 2020). MoverScore (MScore) applies a distance measure to contextualized BERT and ELMo word embeddings (Zhao et al., 2019). Sentence Mover Similarity (SMS) applies minimum-distance matching between texts based on sentence embeddings (Clark et al., 2019). Word Mover Similarity (WMS) measures similarity using minimum-distance matching between texts represented as bags of word embeddings (Kusner et al., 2015). JS divergence (JS-2) measures the Jensen-Shannon divergence between the two texts' bigram distributions (Lin et al., 2006). ROUGE-1 and ROUGE-2 measure overlap of unigrams and bigrams respectively (Lin, 2004). ROUGE-L measures overlap of the longest common subsequence between two texts (Lin, 2004). We use the recall variant of all metrics (since the Pyramid method of human evaluation is inherently recall-based), except MScore, which has no specific recall variant.
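As a concrete illustration, the recall variant of ROUGE-n can be sketched as clipped n-gram overlap normalized by the number of reference n-grams. This is a minimal sketch only; the official ROUGE toolkit additionally supports stemming, stopword removal, and multi-reference aggregation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of n-grams for a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system, reference, n=2):
    """ROUGE-n recall: clipped n-gram overlap divided by reference n-gram count."""
    sys_ngrams = ngrams(system.lower().split(), n)
    ref_ngrams = ngrams(reference.lower().split(), n)
    if not ref_ngrams:
        return 0.0
    # Clip each n-gram's match count at its frequency in the system summary.
    overlap = sum(min(count, sys_ngrams[gram]) for gram, count in ref_ngrams.items())
    return overlap / sum(ref_ngrams.values())
```

For example, a system summary covering half of the reference's unigrams receives a ROUGE-1 recall of 0.5 under this sketch.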

Correlation Measures
Pearson Correlation is a measure of linear correlation between two variables and is popular in meta-evaluating metrics at the system level (Lee Rodgers, 1988). We use the implementation given by Virtanen et al. (2020). Williams' Significance Test is a means of calculating the statistical significance of differences in correlations for dependent variables (Williams, 1959; Graham and Baldwin, 2014). This is useful for us since metrics evaluated on the same dataset are not independent of each other.
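A minimal sketch of the one-tailed Williams test for dependent correlations, following the formulation popularized in MT metric meta-evaluation (Graham and Baldwin, 2014); the function name and interface here are our own:

```python
import numpy as np
from scipy import stats

def williams_test(r12, r13, r23, n):
    """One-tailed Williams test: is corr(metric1, human) = r12 significantly
    greater than corr(metric2, human) = r13, given corr(metric1, metric2) = r23
    over n paired observations? Returns the p-value."""
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    num = (r12 - r13) * np.sqrt((n - 1) * (1 + r23))
    den = np.sqrt(2 * k * (n - 1) / (n - 3) + ((r12 + r13) / 2) ** 2 * (1 - r23) ** 3)
    t = num / den  # follows a t-distribution with n - 3 degrees of freedom
    return 1 - stats.t.cdf(t, df=n - 3)
```

A small p-value indicates that the first metric's correlation advantage is unlikely to be due to chance, given that the two metrics are themselves correlated.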

Meta Evaluation Strategies
There are two broad meta-evaluation strategies: summary-level and system-level.
Setup: For each document $d_i$, $i \in \{1 \ldots n\}$ in a dataset $D$, we have $J$ system outputs, where the outputs can come from (1) extractive systems (Ext), (2) abstractive systems (Abs), or (3) a union of both (Mix). Let $s_{ij}$, $j \in \{1 \ldots J\}$ be the $j$-th summary of the $i$-th document, $m$ be a specific metric, and $K$ be a correlation measure.

Summary Level
Summary-level correlation is calculated as follows:

$$K_{\text{summ}}(m, h) = \frac{1}{n} \sum_{i=1}^{n} K\big([m(s_{i1}), \ldots, m(s_{iJ})],\, [h(s_{i1}), \ldots, h(s_{iJ})]\big) \quad (1)$$

where $h(\cdot)$ denotes the human score of a summary. Here, correlation is calculated for each document, among the different system outputs of that document, and the mean value is reported.

System Level
System-level correlation is calculated as follows:

$$K_{\text{sys}}(m, h) = K\Big(\Big[\tfrac{1}{n}\textstyle\sum_{i=1}^{n} m(s_{i1}), \ldots, \tfrac{1}{n}\sum_{i=1}^{n} m(s_{iJ})\Big],\, \Big[\tfrac{1}{n}\sum_{i=1}^{n} h(s_{i1}), \ldots, \tfrac{1}{n}\sum_{i=1}^{n} h(s_{iJ})\Big]\Big) \quad (2)$$

Additionally, the "quality" of a system $sys_j$ is defined as the mean human score it receives, i.e.

$$q(sys_j) = \frac{1}{n} \sum_{i=1}^{n} h(s_{ij}) \quad (3)$$
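The two settings can be sketched as follows (illustrative code, our own; rows of the score matrices are documents and columns are systems):

```python
import numpy as np
from scipy.stats import pearsonr

def summary_level(metric, human):
    """metric, human: (n_docs, n_systems) score matrices.
    Correlate scores per document across systems, then average (Eq. 1)."""
    return np.mean([pearsonr(m_row, h_row)[0]
                    for m_row, h_row in zip(metric, human)])

def system_level(metric, human):
    """Correlate per-system mean scores across documents (Eq. 2)."""
    return pearsonr(metric.mean(axis=0), human.mean(axis=0))[0]
```

Note that a metric can rank the summaries of each individual document well (high summary-level correlation) while still misordering systems by their mean scores, which is why the two settings are reported separately.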

Collection of Human Judgments
We follow a 3-step process to collect human judgments: (1) we collect system-generated summaries on the most commonly used summarization dataset, CNNDM; (2) we select representative test samples from CNNDM; and (3) we manually evaluate system-generated summaries of the above-selected test samples.

System-Generated Summary Collection
We collect the system-generated summaries from 25 top-scoring systems, covering 11 extractive and 14 abstractive systems (Sec. 2.2), on the CNNDM dataset. We organize our collected generated summaries into three groups based on system type: • CNNDM Abs denotes collected output summaries from abstractive systems. • CNNDM Ext denotes collected output summaries from extractive systems. • CNNDM Mix is the union of the two.

Representative Sample Selection
Since collecting human annotations is costly, we sample 100 documents from the CNNDM test set (11,490 samples) and evaluate system-generated summaries of these 100 documents. We aim to include documents of varying difficulty in the representative sample. As a proxy for the difficulty of summarizing a document, we use the mean score received by the system-generated summaries for that document. Based on this, we partition the CNNDM test set into 5 equal-sized bins and sample 4 documents from each bin. We repeat this process for 5 metrics (BERTScore, MoverScore, R-1, R-2, R-L), obtaining a sample of 100 documents. This methodology is detailed in Alg. 1 in Sec. A.1.
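The per-metric binning step can be sketched as follows (a simplified, hypothetical rendering of Alg. 1; the full procedure repeats this for each of the 5 metrics and merges the per-metric samples to obtain 100 documents):

```python
import numpy as np

def stratified_sample(doc_scores, n_bins=5, per_bin=4, seed=0):
    """doc_scores: 1-D array of per-document mean system scores under one
    metric. Partition documents into equal-sized bins by score and draw
    per_bin documents from each bin without replacement."""
    rng = np.random.default_rng(seed)
    order = np.argsort(doc_scores)          # documents sorted by difficulty proxy
    bins = np.array_split(order, n_bins)    # equal-sized score bins
    return [int(d) for b in bins for d in rng.choice(b, per_bin, replace=False)]
```

Stratifying by the difficulty proxy ensures the annotated sample covers both easy and hard documents rather than only the middle of the score distribution.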

Human Evaluation
In text summarization, a "good" summary should represent as much relevant content from the input document as possible, within the acceptable length limits. Many human evaluation methods have been proposed to capture this desideratum (Nenkova and Passonneau, 2004; Chaganty et al., 2018; Fan et al., 2018; Shapira et al., 2019). Among these, Pyramid (Nenkova and Passonneau, 2004) is a reliable and widely used method that evaluates content selection by (1) exhaustively obtaining Semantic Content Units (SCUs) from reference summaries, (2) weighting them based on the number of times they are mentioned, and (3) scoring a system summary based on which SCUs can be inferred from it.
Recently, Shapira et al. (2019) extended Pyramid to a lightweight, crowdsourceable method, LitePyramids, which uses Amazon Mechanical Turk (AMT) for gathering human annotations. LitePyramids simplifies Pyramid by (1) allowing crowd workers to extract a subset of all possible SCUs and (2) eliminating the difficult task of merging duplicate SCUs from different reference summaries, instead using SCU sampling to simulate frequency-based weighting.
Both Pyramid and LitePyramid rely on the presence of multiple references per document to assign importance weights to SCUs. However, in the CNNDM dataset there is only one reference summary per document. We therefore adapt the LitePyramid method for the single-reference setting as follows.

SCU Extraction The LitePyramids annotation instructions define a Semantic Content Unit (SCU) as a sentence containing a single fact, written as briefly and clearly as possible. Instead, we focus on shorter, more fine-grained SCUs that contain at most 2-3 entities. This allows for partial content overlap between a generated and reference summary, and also makes the task easier for workers. Tab. 2 gives an example. We exhaustively extract (up to 16) SCUs from each reference summary. Requiring the set of SCUs to be exhaustive increases the complexity of the SCU generation task, and hence instead of relying on crowd workers, we create SCUs from reference summaries ourselves. In the end, we obtained nearly 10.5 SCUs on average from each reference summary.

System Evaluation During system evaluation, the full set of SCUs is presented to crowd workers. Workers are paid similarly to Shapira et al. (2019), scaling the rates for fewer SCUs and shorter summary texts. For abstractive systems, we pay $0.20 per summary, and for extractive systems, we pay $0.15 per summary, since extractive summaries are more readable and might precisely overlap with SCUs. We post-process system output summaries before presenting them to annotators by true-casing the text using Stanford CoreNLP (Manning et al., 2014) and replacing "unknown" tokens with a special symbol "2" (Chaganty et al., 2018).
Tab. 2 depicts an example reference summary, system summary, SCUs extracted from the reference summary, and annotations obtained in evaluating the system summary:

(a) Reference Summary: Bayern Munich beat Porto 6-1 in the Champions League on Tuesday. Pep Guardiola's side progressed 7-4 on aggregate to reach semi-finals. Thomas Muller scored 27th Champions League goal to pass Mario Gomez. Muller is now the leading German scorer in the competition. After game Muller led the celebrations with supporters using a megaphone.

(b) System Summary (BART, Lewis et al. (2019)): Bayern Munich beat Porto 6-1 at the Allianz Arena on Tuesday night. Thomas Muller scored his 27th Champions League goal. The 25-year-old became the highest-scoring German since the tournament took its current shape in 1992. Bayern players remained on the pitch for some time as they celebrated with supporters.

Annotation Scoring For robustness (Shapira et al., 2019), each system summary is evaluated by 4 crowd workers. Each worker annotates up to 16 SCUs, marking an SCU "present" if it can be inferred from the system summary or "not present" otherwise. We obtain a total of 10,000 human annotations (100 documents × 25 systems × 4 workers).
For each document, we identify a "noisy" worker as the one who disagrees with the majority (i.e. marks an SCU as "present" when the majority thinks "not present", or vice-versa) on the largest number of SCUs. We remove the annotations of noisy workers and retain 7,742 of the 10,000 annotations. After this filtering, we obtain an average inter-annotator agreement (Krippendorff's alpha (Krippendorff, 2011)) of 0.66. Finally, we use the majority vote to mark the presence of an SCU in a system summary, breaking ties in favor of "not present".
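The filtering and voting procedure can be sketched as follows (illustrative code; the input format, a list of per-worker 0/1 label lists for one summary, is our own assumption):

```python
def majority_vote(labels):
    """Majority over 0/1 labels; ties resolved as 0 ('not present')."""
    return 1 if sum(labels) * 2 > len(labels) else 0

def filter_noisy_and_score(annotations):
    """annotations: list of per-worker label lists (one 0/1 label per SCU).
    Drop the worker who disagrees with the all-worker majority on the most
    SCUs, then return the per-SCU majority labels over the remaining workers."""
    n_scus = len(annotations[0])
    majorities = [majority_vote([w[i] for w in annotations]) for i in range(n_scus)]
    disagreements = [sum(w[i] != majorities[i] for i in range(n_scus))
                     for w in annotations]
    noisy = disagreements.index(max(disagreements))
    kept = [w for j, w in enumerate(annotations) if j != noisy]
    return [majority_vote([w[i] for w in kept]) for i in range(n_scus)]
```

With 4 workers, dropping the noisiest leaves 3 annotations per summary, so the final majority vote over the kept workers can no longer tie.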

Experiments
Motivated by the central research question, "does the rapid progress of model development in summarization models require us to re-evaluate the evaluation process used for text summarization?", we use the collected human judgments to meta-evaluate current metrics from four diverse viewpoints, measuring the ability of metrics to: (1) evaluate all systems; (2) evaluate the top-k strongest systems; (3) compare two systems; (4) evaluate individual summaries. We find that many previously attested properties of metrics observed on TAC exhibit different trends on the new CNNDM dataset.

Exp-I: Evaluating All Systems
Automatic metrics are widely used to determine where a new system may rank against existing state-of-the-art systems. Thus, in meta-evaluation studies, calculating the correlation of automatic metrics with human judgments at the system level is a commonly used setting (Novikova et al., 2017; Bojar et al., 2016; Graham, 2015). We follow this setting and ask two specific questions. Can metrics reliably compare different systems? To answer this, we observe the Pearson correlation between different metrics and human judgments in Fig. 2, finding that: (1) MoverScore and JS-2, which were the best performing metrics on TAC, have poor correlations with humans in comparing CNNDM Ext systems.
(2) Most metrics have high correlations on the TAC-2008 dataset but many suffer on TAC-2009, especially ROUGE-based metrics. However, ROUGE metrics consistently perform well on the collected CNNDM datasets. Are some metrics significantly better than others in comparing systems? Since automated metrics calculated on the same data are not independent, we must perform Williams' test (Williams, 1959) to establish whether the difference in correlations between metrics is statistically significant (Graham and Baldwin, 2014). In Fig. 1 we report the p-values of Williams' test. We find that (1) MoverScore and JS-2 are significantly better than other metrics in correlating with human judgments on the TAC datasets.
(2) However, on CNNDM Abs and CNNDM Mix, R-2 significantly outperforms all others whereas on CNNDM Ext none of the metrics show significant improvements over others. Takeaway: These results suggest that metrics run the risk of overfitting to some datasets, highlighting the need to meta-evaluate metrics for modern datasets and systems. Additionally, there is no one-size-fits-all metric that can outperform others on all datasets. This suggests the utility of using different metrics for different datasets to evaluate systems e.g. MoverScore on TAC-2008, JS-2 on TAC-2009 and R-2 on CNNDM datasets.

Exp-II: Evaluating Top-k Systems
Most papers that propose a new state-of-the-art system use automatic metrics as a proxy for human judgments to compare their proposed method against other top-scoring systems. However, can metrics reliably quantify the improvements that one high-quality system makes over other competitive systems? To answer this, instead of focusing on all of the collected systems, we evaluate the correlation between automatic metrics and human judgments in comparing the top-k systems, where the top-k are chosen based on a system's mean human score (Eqn. 3). Our observations are presented in Fig. 3. We find that: (1) As k becomes smaller, metrics de-correlate with humans on the TAC-2008 and CNNDM Mix datasets, even showing negative correlations for small values of k (Fig. 3a, 3c). Interestingly, SMS, R-1, R-2 and R-L improve in performance as k becomes smaller on CNNDM Ext.
(2) R-2 had negative correlations with human judgments on TAC-2009 for k < 50; however, it remains highly correlated with human judgments on CNNDM Abs for all values of k. Takeaway: Metrics cannot reliably quantify the improvements made by one system over others, especially for the top few systems, across all datasets. Some metrics, however, are well suited for specific datasets, e.g. JS-2 and R-2 are reliable indicators of improvements on TAC-2009 and CNNDM Abs respectively.
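Restricting the system-level correlation to the top-k systems can be sketched as follows (a hypothetical helper of our own, assuming per-system mean metric and human scores are precomputed):

```python
import numpy as np
from scipy.stats import pearsonr

def topk_system_correlation(metric_means, human_means, k):
    """Pearson correlation between metric and human mean scores,
    restricted to the k systems ranked highest by mean human score."""
    human_means = np.asarray(human_means)
    top = np.argsort(human_means)[-k:]  # indices of the k best systems
    return pearsonr(np.asarray(metric_means)[top], human_means[top])[0]
```

Sweeping k from J down to small values, as in Fig. 3, reveals whether a metric that ranks all systems well can still distinguish among the strongest few.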

Exp-III: Comparing Two Systems
Instead of comparing many systems (Sec. 4.1, 4.2), ranking two systems aims to test the discriminative power of a metric, i.e., the degree to which the metric can capture statistically significant differences between two summarization systems. We analyze the reliability of metrics along a useful dimension: can metrics reliably say if one system is significantly better than another? Since we only have 100 annotated summaries to compare any two systems, $sys_1$ and $sys_2$, we use paired bootstrap resampling to test with statistical significance whether $sys_1$ is better than $sys_2$ according to metric $m$ (Koehn, 2004; Dror et al., 2018). We take all $\binom{J}{2}$ pairs of systems and compare their mean human scores (Eqn. 3) using paired bootstrap resampling. We assign a label $y_{true} = 1$ if $sys_1$ is better than $sys_2$ with 95% confidence, $y_{true} = 2$ for the reverse, and $y_{true} = 0$ if the confidence is below 95%. We treat this as the ground-truth label of the pair $(sys_1, sys_2)$. This process is then repeated for all metrics to get a "prediction" $y^m_{pred}$ from each metric $m$ for the same $\binom{J}{2}$ pairs. If $m$ is a good proxy for human judgments, the F1 score (Goutte and Gaussier, 2005) between $y^m_{pred}$ and $y_{true}$ should be high. We calculate the weighted macro F1 score for all metrics and report the results in Fig. 4.
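The paired bootstrap labeling step can be sketched as follows (our own simplified implementation; document-level scores from one scorer, human or metric, are resampled with replacement):

```python
import numpy as np

def paired_bootstrap_label(scores1, scores2, n_boot=1000, conf=0.95, seed=0):
    """Return 1 if sys1 beats sys2 with >= conf confidence, 2 for the
    reverse, and 0 otherwise, using paired bootstrap over documents."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.asarray(scores1, float), np.asarray(scores2, float)
    # Resample document indices; each row is one bootstrap replicate.
    idx = rng.integers(0, len(s1), size=(n_boot, len(s1)))
    diff = s1[idx].mean(axis=1) - s2[idx].mean(axis=1)
    if np.mean(diff > 0) >= conf:
        return 1
    if np.mean(diff < 0) >= conf:
        return 2
    return 0
```

Running this once on human scores yields the ground-truth label for a system pair, and once per metric yields that metric's prediction for the same pair.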
We find that ROUGE based metrics perform moderately well in this task. R-2 performs the best on CNNDM datasets. While on the TAC 2009 dataset, JS-2 achieves the highest F1 score, its performance is low on CNNDM Ext. Takeaway: Different metrics are better suited for different datasets. For example, on the CNNDM datasets, we recommend using R-2 while, on the TAC datasets, we recommend using JS-2.

Exp-IV: Evaluating Summaries
In addition to comparing systems, real-world application scenarios also require metrics to reliably compare multiple summaries of a document. For example, top-scoring reinforcement-learning-based summarization systems (Böhm et al., 2019) and the current state-of-the-art extractive system (Zhong et al., 2020) use automatic metric scores to guide the optimization process.
In this experiment, we ask the question: how well do different metrics perform at the summary level, i.e. in comparing system summaries generated from the same document? We use Eq. 1 to calculate Pearson correlation between different metrics and human judgments for different datasets and collected system outputs. Our observations are summarized in Fig. 5. We find that: (1) As compared to semantic matching metrics, R-1, R-2 and R-L have lower correlations on the TAC datasets but are strong indicators of good summaries especially for extractive summaries on the CNNDM dataset.
(2) Notably, BERTScore, WMS, R-1 and R-L have negative correlations on TAC-2009 but perform moderately well on other datasets including CNNDM.
(3) Previous meta-evaluation studies (Novikova et al., 2017;Peyrard et al., 2017;Chaganty et al., 2018) conclude that automatic metrics tend to correlate well with humans at the system level but have poor correlations at the instance (here summary) level. We find this observation only holds on TAC-2008. Some metrics' summary-level correlations can outperform system-level on the CNNDM dataset as shown in Fig. 5b (bins below y = 0). Notably, MoverScore has a correlation of only 0.05 on CNNDM Ext at the system level but 0.74 at the summary level. Takeaway: Meta-evaluations of metrics on the old TAC datasets show significantly different trends than meta-evaluation on modern systems and datasets. Even though some metrics might be good at comparing summaries, they may point in the wrong direction when comparing systems. Moreover, some metrics show poor generalization ability to different datasets (e.g. BERTScore on TAC-2009 vs other datasets). This highlights the need for empirically testing the efficacy of different automatic metrics in evaluating summaries on multiple datasets.

Related Work
This work is connected to the following threads of topics in text summarization. Human Judgment Collection Despite many approaches to the acquisition of human judgment (Chaganty et al., 2018;Nenkova and Passonneau, 2004;Shapira et al., 2019;Fan et al., 2018), Pyramid (Nenkova and Passonneau, 2004) has been a mainstream method to meta-evaluate various automatic metrics. Specifically, Pyramid provides a robust technique for evaluating content selection by exhaustively obtaining a set of Semantic Content Units (SCUs) from a set of references, and then scoring system summaries on how many SCUs can be inferred from them. Recently, Shapira et al. (2019) proposed a lightweight and crowdsourceable version of the original Pyramid, and demonstrated it on the DUC 2005 (Dang, 2005) and 2006 (Dang, 2006) multi-document summarization datasets. In this paper, our human evaluation methodology is based on the Pyramid (Nenkova and Passonneau, 2004) and LitePyramids (Shapira et al., 2019) techniques. Chaganty et al. (2018) also obtain human evaluations on system summaries on the CNNDM dataset, but with a focus on language quality of summaries. In comparison, our work is focused on evaluating content selection. Our work also covers more systems than their study (11 extractive + 14 abstractive vs. 4 abstractive).

Meta-evaluation with Human Judgment
The effectiveness of different automatic metrics -- ROUGE-2 (Lin, 2004), ROUGE-L (Lin, 2004), ROUGE-WE (Ng and Abrecht, 2015), JS-2 (Louis and Nenkova, 2013) and S3 (Peyrard et al., 2017) -- is commonly evaluated based on their correlation with human judgments (e.g., on the TAC-2008 (Dang and Owczarzak, 2008) and TAC-2009 (Dang and Owczarzak, 2009) datasets). As an important supplementary technique for meta-evaluation, Graham (2015) advocates for the use of a significance test, Williams' test (Williams, 1959), to measure the improved correlation of a metric with human scores, and shows that the popular variant of ROUGE (mean ROUGE-2 score) is sub-optimal. Unlike these works, instead of proposing a new metric, in this paper we upgrade the meta-evaluation environment by introducing a sizeable human judgment dataset covering current top-scoring systems and mainstream datasets. We then re-evaluate diverse metrics in both system-level and summary-level settings. Novikova et al. (2017) also analyze existing metrics, but focus only on dialog generation.

Implications and Future Directions
Our work not only diagnoses the limitations of current metrics but also highlights the importance of upgrading the existing meta-evaluation testbed, keeping it up to date with the rapid development of systems and datasets. In closing, we highlight some potential future directions: (1) The choice of metrics depends not only on different tasks (e.g., summarization, translation) but also on different datasets (e.g., TAC, CNNDM) and application scenarios (e.g., system-level, summary-level). Future work on meta-evaluation should investigate the effect of these settings on the performance of metrics.
(2) Metrics easily overfit on limited datasets. Multi-dataset meta-evaluation can help us better understand each metric's peculiarities, and therefore achieve a better choice of metrics under diverse scenarios. (3) Our collected human judgments can be used as supervision to instantiate the recently proposed pretrain-then-finetune framework (originally for machine translation) (Sellam et al., 2020), learning a robust metric for text summarization.

A.3 Exp-II using Kendall's tau correlation
Please see Figure 8 for the system-level Kendall's tau correlation on top-k systems, between different metrics and human judgments.

A.4 Exp IV using Kendall's tau correlation
Please see Figure 7 for the summary-level Kendall's tau correlation between different metrics and human judgments.