Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics

In text summarization, evaluating the efficacy of automatic metrics without human judgments has recently become popular. One exemplar work (Peyrard, 2019) concludes that automatic metrics strongly disagree when ranking high-scoring summaries. In this paper, we revisit their experiments and find that their observations stem from the fact that metrics disagree in ranking summaries from any narrow scoring range. We hypothesize that this may be because summaries are similar to each other in a narrow scoring range and are thus difficult to rank. Apart from the width of the scoring range of summaries, we analyze three other properties that impact inter-metric agreement: Ease of Summarization, Abstractiveness, and Coverage.


Introduction
Automatic metrics play a significant role in summarization evaluation, profoundly affecting the direction of system optimization. Due to its importance, evaluating the quality of evaluation metrics, also known as meta-evaluation, has been a crucial step. Generally, there are two meta-evaluation strategies: (i) assessing how well each metric correlates with human judgments (Lin, 2004; Ng and Abrecht, 2015; Louis and Nenkova, 2013; Peyrard et al., 2017; Bhandari et al., 2020), which requires procuring manual annotations that are expensive and time-consuming, and (ii) measuring the correlation between different metrics (Peyrard, 2019), which is a human judgment-free method. In this work, we focus on the latter and ask two research questions: RQ1: How do automated metrics correlate when ranking summaries in different scoring ranges (low, average, and high)? We revisit the experiments of Peyrard (2019), which conclude that automated metrics strongly disagree when ranking high-scoring summaries. We find that the scoring range itself has little effect on the correlation of metrics; it is rather the width of the scoring range that affects inter-metric correlation. Specifically, we observe that metrics agree in ranking summaries from the full scoring range but disagree in ranking summaries from the low, average, and high scoring ranges when taken separately.
RQ2: Which other factors affect the correlations of metrics? In addition to the width of the scoring range, we analyze the effect of three properties of a reference summary on inter-metric correlation: Ease of Summarization, Abstractiveness, and Coverage. Overall, we find that for highly extractive document-reference summary pairs, inter-metric correlation is high, whereas metrics disagree when ranking summaries of abstractive document-reference summary pairs. We summarize our contributions as follows: (1) We extend the analysis of Peyrard (2019) and find that not only do metrics disagree in the high scoring range, they also disagree in the low and medium scoring ranges.
(2) We perform our analysis on the popular CNN/DailyMail dataset using traditional lexical matching metrics like ROUGE as well as recently popular semantic matching metrics like BERTScore and MoverScore. (3) Apart from the width of the scoring range, we analyze three linguistic properties of reference summaries which affect inter-metric correlations.

Preliminaries
Datasets
TAC-2008 and TAC-2009 (Dang and Owczarzak, 2008; Dang and Owczarzak, 2009) are multi-document, multi-reference summarization datasets used during the TAC-2008 and TAC-2009 shared tasks. Following Peyrard (2019), we combine the two and refer to the joined dataset as TAC. CNN/DailyMail (CNNDM) (Hermann et al., 2015) is a commonly used summarization dataset, modified by Nallapati et al. (2016), which contains news articles and their associated highlights as summaries. We use the non-anonymized version.

Evaluation Metrics
We examine six metrics that measure the semantic equivalence between two texts, in our case, between the system-generated summary and the reference summary. BERTScore (BScore) measures soft overlap between contextual BERT embeddings of the tokens of the two texts (Zhang et al., 2020). MoverScore (MS) applies a distance measure to contextualized BERT and ELMo word embeddings (Zhao et al., 2019). JS divergence (JS-2) measures the Jensen-Shannon divergence between the two texts' bigram distributions (Lin et al., 2006). ROUGE-1 (R1) and ROUGE-2 (R2) measure the overlap of unigrams and bigrams respectively (Lin, 2004). ROUGE-L measures the overlap of the longest common subsequence between two texts (Lin, 2004). We use the recall variant of all metrics except MoverScore, which has no specific recall variant.
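For concreteness, the following is a minimal sketch of the JS-2 computation, the least standardized of the six metrics. The whitespace-level tokenization and the exact handling of the distributions are our own assumptions, not necessarily those of the original implementation.

```python
from collections import Counter
import math

def bigram_dist(tokens):
    """Empirical bigram distribution of a token list (needs >= 2 tokens)."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def js2(summary_tokens, reference_tokens):
    """Jensen-Shannon divergence between the two bigram distributions;
    lower values indicate more similar texts."""
    p, q = bigram_dist(summary_tokens), bigram_dist(reference_tokens)
    m = {bg: 0.5 * (p.get(bg, 0.0) + q.get(bg, 0.0)) for bg in set(p) | set(q)}

    def kl(a):  # KL(a || m), restricted to a's support
        return sum(prob * math.log2(prob / m[bg]) for bg, prob in a.items())

    return 0.5 * kl(p) + 0.5 * kl(q)
```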

Correlation Measure
Kendall's τ is a measure of the rank correlation between any two measured quantities (in our case, scores given by evaluation metrics) and is popular for summary-level meta-evaluation of metrics (Peyrard, 2019). We use the implementation given by Virtanen et al. (2020).
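Since Virtanen et al. (2020) refers to SciPy, the computation presumably reduces to scipy.stats.kendalltau. A minimal usage sketch (the toy scores are illustrative only):

```python
from scipy.stats import kendalltau

# Scores that two metrics assign to the same five candidate summaries
# of one document (toy values).
rouge1_recall = [0.42, 0.35, 0.51, 0.29, 0.47]
bertscore_recall = [0.60, 0.58, 0.63, 0.55, 0.59]

tau, p_value = kendalltau(rouge1_recall, bertscore_recall)
if p_value < 0.05:  # we only keep statistically significant values
    print(f"Kendall's tau = {tau:.3f}")
```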

Summary Generation
To simulate the full scoring range of summaries that are possible for a document, we follow Peyrard (2019) and use a genetic algorithm (Peyrard and Eckle-Kohler, 2016) to generate extractive summaries. We optimize for five metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and MoverScore, generating 100 summaries per metric for each of the nearly 11K documents in the CNNDM test set, resulting in 500 summaries per document. After de-duplication, we are left with about 419 summaries per document on average. For the TAC dataset, we randomly sample 500 summaries for each document from the nearly 2,000 output summaries provided by Peyrard (2019).
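We do not reproduce the genetic algorithm of Peyrard and Eckle-Kohler (2016) here; the following is a heavily simplified sketch of the general idea (a population of sentence subsets evolved against a metric score). All names, hyperparameters, and operator choices are our own assumptions, and the document is assumed to have at least max_sents sentences.

```python
import random

def evolve_summaries(sentences, fitness, pop_size=100, generations=30,
                     max_sents=3, keep_top=0.5):
    """Very simplified genetic search over extractive summaries.
    An individual is a sorted tuple of sentence indices; `fitness`
    scores the corresponding text against the reference, e.g. with
    ROUGE-1 recall. Returns every distinct individual encountered,
    which spans the scoring range from random to near-optimal."""
    def rand_ind():
        k = random.randint(1, max_sents)
        return tuple(sorted(random.sample(range(len(sentences)), k)))

    def breed(a, b):
        pool = sorted(set(a) | set(b))          # crossover: merge parents
        k = random.randint(1, min(max_sents, len(pool)))
        child = set(random.sample(pool, k))
        if random.random() < 0.2:               # mutation: add a sentence
            child.add(random.randrange(len(sentences)))
        return tuple(sorted(child))[:max_sents]

    population = [rand_ind() for _ in range(pop_size)]
    seen = set(population)
    for _ in range(generations):
        population.sort(key=lambda ind: fitness([sentences[i] for i in ind]),
                        reverse=True)
        parents = population[: int(pop_size * keep_top)]
        population = parents + [breed(random.choice(parents),
                                      random.choice(parents))
                                for _ in range(pop_size - len(parents))]
        seen.update(population)
    return [[sentences[i] for i in ind] for ind in seen]
```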

Width of Scoring Range
In this experiment, we aim to re-examine the results of Peyrard (2019) and answer our first research question RQ1: how do different automated metrics correlate when ranking summaries in different scoring ranges? We approach this as follows: for each summary $s_{ij}$ of document $d_i$, we first calculate its mean score across all metrics after normalizing each metric to be between 0 and 1. We use this mean to partition the scoring range of each document into three parts: low scoring (L), medium scoring (M), and top scoring (T), which are the bottom third, the middle third, and the top third of the scoring range respectively. We then analyze the summaries falling into these bins in two different ways (a sketch of the computation is given after the list):
1. Cumulative: In this setting, we analyze the average inter-metric correlation on summaries belonging to the cumulative scoring ranges L, L+M, and L+M+T, as shown in the left side of Tab. 1-2. Note that here the width of the scoring range is different for each row.
2. Non-cumulative: In this setting, we analyze the average inter-metric correlation on summaries belonging to each scoring bin (L, M, and T) separately, as shown in the right side of Tab. 1-2. We advocate for the use of this setting as (1) it controls for the width of the scoring range and (2) it allows for a more fine-grained analysis of the scoring range.
Note that, for each bin, the correlation is calculated over the summaries generated for each document and then averaged over all documents. We only consider statistically significant (p < 0.05) Kendall's τ values.
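A minimal sketch of the non-cumulative analysis for a single document, assuming scores maps each metric name to an array of that metric's scores over the document's candidate summaries (names and details are ours):

```python
from itertools import combinations
import numpy as np
from scipy.stats import kendalltau

def per_bin_correlation(scores, alpha=0.05):
    """scores: dict mapping a metric name to a np.array with one raw
    score per candidate summary of a single document. Returns the mean
    pairwise Kendall's tau within each third of the scoring range."""
    # Min-max normalize each metric to [0, 1], then average across metrics.
    norm = {m: (s - s.min()) / (s.max() - s.min() + 1e-12)
            for m, s in scores.items()}
    mean_score = np.mean(list(norm.values()), axis=0)

    # Partition the scoring *range* (not the ranks) into equal thirds.
    edges = np.linspace(mean_score.min(), mean_score.max(), 4)
    result = {}
    for name, a, b in zip("LMT", edges, edges[1:]):
        mask = (mean_score >= a) & (mean_score <= b)
        taus = []
        if mask.sum() > 1:
            for m1, m2 in combinations(norm, 2):
                tau, p = kendalltau(norm[m1][mask], norm[m2][mask])
                if p < alpha:  # keep only significant correlations
                    taus.append(tau)
        result[name] = np.mean(taus) if taus else float("nan")
    return result
```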
Observations & Discussion: Our observations on the TAC and CNNDM datasets are shown in Tab. 1 and 2 respectively. In the cumulative setting, we observe the same trend reported by Peyrard (2019): inter-metric agreement decreases as the average score increases and is lowest in the top scoring range (T). However, in the non-cumulative setting, where metrics rank summaries from a narrow scoring range, we observe that (i) metrics have low correlations in all three scoring ranges (low, medium, and top) and (ii) there is no clear trend in correlations across the bins. Comparing the cumulative and non-cumulative settings, one can see that decreasing the width of the scoring range reduces the inter-metric correlations. This suggests that rather than the scoring range itself, it is the width of the scoring range that has a strong impact on the correlation between metrics. This may be because summaries from a narrow scoring range are similar to each other, and thus difficult for different metrics to rank consistently.

Factors affecting Inter-metric Correlation
In this experiment, we aim to answer the second research question RQ2: apart from the width of the scoring range, which factors affect inter-metric correlations? Specifically, we identify three factors which affect the correlation of metrics: (1) Ease of Summarization, (2) Abstractiveness, and (3) Coverage (a sketch computing these properties is given after the definitions).
1. Ease of Summarization: We define the Ease of Summarization (EoS) of a document $d_i$ with reference $r_i$ as
$$\mathrm{EoS}(d_i) = \frac{1}{|M|} \sum_{m_k \in M} \max_{j} \, m_k(s_{ij}, r_i)$$
Here, $m_k \in M$ is a metric function normalized to be between 0 and 1. Thus, EoS is the average over all metrics of the maximum score that any summary received. A higher EoS score for a document implies that for that document, we can generate higher scoring extractive summaries according to many metrics.
2. Abstractiveness: We define the abstractiveness of a document $d_i$ with reference $r_i$ as
$$1 - \frac{|\mathrm{Voc}(d_i) \cap \mathrm{Voc}(r_i)|}{|\mathrm{Voc}(r_i)|}$$
where $\mathrm{Voc}(x)$ is the set of unique tokens of any text $x$. Abstractiveness thus measures the fraction of the reference summary's vocabulary that does not appear in the document: higher values indicate lower vocabulary overlap between the document and its reference summary.
3. Coverage: We use the definition of Coverage provided by Grusky et al. (2018), i.e., "the percentage of words in the summary that are part of an extractive fragment with the article". We refer the reader to Grusky et al. (2018) for a detailed description of Coverage.
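A minimal sketch of the three properties, assuming naive whitespace tokenization; the greedy fragment matching is a simplified approximation of the extractive fragments of Grusky et al. (2018), not their exact algorithm:

```python
def voc(text):
    """Unique tokens of a text (naive whitespace tokenization)."""
    return set(text.lower().split())

def ease_of_summarization(max_scores):
    """max_scores: for each metric, the maximum normalized score that
    any generated summary of the document achieved."""
    return sum(max_scores) / len(max_scores)

def abstractiveness(document, reference):
    """Fraction of the reference vocabulary absent from the document."""
    d, r = voc(document), voc(reference)
    return 1 - len(d & r) / len(r)

def coverage(document, reference):
    """Share of reference-summary tokens inside greedily matched shared
    fragments; a simplified stand-in for Grusky et al.'s definition."""
    art, summ = document.lower().split(), reference.lower().split()
    covered, i = 0, 0
    while i < len(summ):
        best = 0  # longest article match starting at summary position i
        for j in range(len(art)):
            k = 0
            while (i + k < len(summ) and j + k < len(art)
                   and summ[i + k] == art[j + k]):
                k += 1
            best = max(best, k)
        if best:
            covered += best
            i += best
        else:
            i += 1
    return covered / len(summ)
```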
Observations: Our observations are summarized in Fig. 1. Each point in the graph represents a document-reference summary pair, with the corresponding property on the x-axis and the inter-metric correlation of its summaries on the y-axis. We find that (1) metrics agree with each other more as documents become easier to summarize; (2) as document-reference pairs become more abstractive, the correlation between metrics decreases; and (3) as coverage increases, the correlation between metrics increases. These observations suggest that automatic evaluation metrics correlate more strongly for easier-to-summarize and more extractive (lower abstractiveness, higher coverage) document-reference summary pairs.

Implications and Future Directions
In this work, we revisit the conclusions of Peyrard (2019) and show that instead of disagreeing solely in the high scoring range, metrics disagree when ranking summaries from all three scoring ranges: low, medium, and top. This highlights the need to collect human judgments to identify trustworthy metrics. Moreover, future meta-evaluations should use uniform-width bins when comparing correlations to ensure a more robust analysis. Additionally, we analyze three linguistic properties of reference summaries and their effect on inter-metric correlations. Our observation that metrics de-correlate as references become more abstractive suggests that we need to exercise caution when using automatic metrics to compare summarization systems on abstractive datasets like XSUM (Narayan et al., 2018). Finally, future work proposing new evaluation metrics can analyze them along these properties to gain more insight into their behavior.

Acknowledgments
We would like to thank Maxime Peyrard for sharing the code and data used in Peyrard (2019) and for his useful feedback about our experiments. We would also like to thank Graham Neubig for his feedback and for providing the computational resources needed for this work.

A Disagreement
In addition to Kendall's τ between metrics, Peyrard (2019) analyzes the disagreement between metrics and shows higher inter-metric disagreement in the higher scoring range. To analyze disagreement, they randomly sample 100,000 pairs of summaries (say $s_a$ and $s_b$ with corresponding references $r_a$, $r_b$) for each pair of metrics (say $m_1$ and $m_2$) and bin them into 15 cumulative bins according to the average score under one metric, i.e., according to $\frac{1}{2}(m_1(s_a, r_a) + m_1(s_b, r_b))$. The disagreement for each bin is then calculated as the percentage of summary pairs for which $m_1(s_a, r_a) > m_1(s_b, r_b)$ but $m_2(s_a, r_a) < m_2(s_b, r_b)$, or vice versa.
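A minimal sketch of this disagreement computation for one metric pair; the equal-width cumulative cutoffs and all variable names are our assumptions:

```python
import random

def disagreement_curve(m1, m2, n_pairs=100_000, n_bins=15, seed=0):
    """m1, m2: parallel score lists; entry i holds the two metrics'
    scores for the same (summary, reference) pair. Returns, for each
    cumulative bin over the m1 average score, the percentage of
    sampled summary pairs that the two metrics rank in opposite orders."""
    rng = random.Random(seed)
    n = len(m1)
    pairs = [(rng.randrange(n), rng.randrange(n)) for _ in range(n_pairs)]
    avg = [(m1[a] + m1[b]) / 2 for a, b in pairs]  # binning key
    lo, hi = min(avg), max(avg)
    curve = []
    for k in range(1, n_bins + 1):
        cutoff = lo + (hi - lo) * k / n_bins  # cumulative upper bound
        subset = [p for p, s in zip(pairs, avg) if s <= cutoff]
        flips = sum((m1[a] - m1[b]) * (m2[a] - m2[b]) < 0 for a, b in subset)
        curve.append(100 * flips / len(subset))
    return curve
```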
The use of cumulative bins suffers from the same phenomenon described in Section 4.1, i.e., the width of the bin may play a role in the agreement of metrics. In Fig. 2 we replicate the cumulative disagreement plot for the TAC and CNNDM datasets and show the corresponding non-cumulative versions. We observe that when we control for the width of the scoring range, inter-metric disagreement is high even in the low scoring range.

B F/N Ratio
For each summary s, Peyrard (2019) also computes the ratio F/N, i.e., out of all the N summaries ranked better than a summary by one metric, F is the number that are ranked better by all the metrics. As shown in Fig. 3a, on the TAC dataset, as the average score of s (averaged across all metrics) increases, F/N decreases. This may suggest that as summary quality improves, different metrics do not agree on which summaries are of better quality. However, this quantity is misleading: as the average score of s increases, the numerator F will naturally decrease (because for a higher scoring s, fewer summaries are better than s), while the denominator N may remain large even if one metric is misaligned with the others. To verify this hypothesis, we first replicate the measure for the TAC and CNNDM datasets in Fig. 3a and 3b. Next, instead of the real metric scores, we assign each summary a random number sampled from Uniform(0, 1). In Fig. 3c we see the same trend for random scores as for real metric scores. This shows that the decreasing trend is indeed a property of the ratio F/N rather than a property specific to real evaluation metrics.
Moreover, one can construct a modified ratio F′/N′ which measures "out of all the summaries that are ranked worse than a summary by one metric, how many are ranked worse by all metrics". If metrics truly de-correlated only in the higher scoring range, one would expect the same decreasing trend for F′/N′. However, as is clear from Fig. 4, the trend is reversed for real as well as random metric scores: F′/N′ increases as the average score of s increases. This is because, similar to F/N, this measure is misleading and sensitive to its numerator F′, which always increases as the average score of s increases.
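A minimal sketch of the F/N and F′/N′ computation together with the random-score simulation; beyond the definitions given above, all details are our assumptions:

```python
import random

def fn_ratio(scores, s_idx):
    """scores: list of per-metric score lists, scores[k][i] = score of
    summary i under metric k. Returns (F/N, F'/N') for summary s_idx."""
    n_metrics, n_sums = len(scores), len(scores[0])
    better_any = better_all = worse_any = worse_all = 0
    for i in range(n_sums):
        if i == s_idx:
            continue
        wins = sum(scores[k][i] > scores[k][s_idx] for k in range(n_metrics))
        better_any += wins >= 1            # ranked better by some metric
        better_all += wins == n_metrics    # ranked better by all metrics
        worse_any += wins < n_metrics      # ranked worse by some metric
        worse_all += wins == 0             # ranked worse by all metrics
    fn = better_all / better_any if better_any else float("nan")
    fn_prime = worse_all / worse_any if worse_any else float("nan")
    return fn, fn_prime

# Random scores reproduce the trends the paper reports for real metrics:
# F/N falls and F'/N' rises with the average score, so the trends are
# properties of the ratios themselves.
random.seed(0)
scores = [[random.random() for _ in range(500)] for _ in range(6)]
avg = [sum(scores[k][i] for k in range(6)) / 6 for i in range(500)]
ranked = sorted(range(500), key=avg.__getitem__)
for i in (ranked[10], ranked[250], ranked[490]):  # low / mid / high summary
    print(round(avg[i], 2), fn_ratio(scores, i))
```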
C Factors affecting inter-metric correlation

C.1 Ease of Summarization
Please see Fig. 5, 6 for Ease of Summarization vs Kendall's τ for all metric pairs.

C.2 Abstractiveness
Please see Fig. 7, 8 for Abstractiveness vs Kendall's τ for all metric pairs.

C.3 Coverage
Please see Fig. 9, 10 for Coverage vs Kendall's τ for all metric pairs.