Intrinsic Evaluation of Summarization Datasets

High quality data forms the bedrock for building meaningful statistical models in NLP. Consequently, data quality must be evaluated either during dataset construction or post hoc. Almost all popular summarization datasets are drawn from natural sources and do not come with inherent quality assurance guarantees. In spite of this, data quality has gone largely unquestioned for many recent summarization datasets. We perform the first large-scale evaluation of summarization datasets by introducing 5 intrinsic metrics and applying them to 10 popular datasets. We find that data usage in recent summarization research is sometimes inconsistent with the underlying properties of the datasets employed. Further, we discover that our metrics can serve the additional purpose of being inexpensive heuristics for detecting generically low quality examples.


Introduction
Data understanding is fundamentally important in natural language processing (NLP); for data-driven learning-based methods (e.g. neural networks), the quality of the training data bounds the quality of models learned using it. Therefore, understanding this data is necessary in order to ensure that models learn to perform a given task correctly.
Understanding data is a multidimensional problem. One line of inquiry has demonstrated why prominent datasets are insufficiently challenging: many data examples can be solved by alternative heuristics that do not encode an approach that is faithful to the task (McCoy et al., 2019). From the perspective of datasets, several works have shown that standard datasets in areas such as visual question answering (Zhang et al., 2016; Kafle and Kanan, 2017), natural language inference (Gururangan et al., 2018; Poliak et al., 2018), and reading comprehension (Kaushik and Lipton, 2018) contain annotation artifacts that often give rise to these spurious correlations or reasoning shortcuts. Data understanding can also inform scientific and ethical decision-making (Bender and Friedman, 2018; Gebru et al., 2018; Mitchell et al., 2019), with recent work studying how social biases encoded in training data propagate to learned models (Zhao et al., 2019; Tan and Celis, 2019).
In this work, we extend these efforts to the setting of summarization. We find this to be particularly timely since several summarization datasets have been released in recent years with little discussion of data quality. While prior work on evaluating NLP datasets has focused on their difficulty, transparency, or bias, we consider broadly the overall quality of the dataset, in our case, for the task of summarization. Our central insight is that desirable properties of a summary can be readily estimated by adapting and applying existing NLP methods. With this in mind, we present a multi-aspect large-scale study of summarization datasets that dissects summarization into 5 properties that are evaluated across 10 datasets spanning multiple summarization domains. Our analysis reveals that our metrics can serve as lightweight detectors of generically low quality examples. Most strikingly, we show that quantifiable aspects of summarization datasets are inconsistent with their use by the NLP community in several instances.

Motivation
Quality assurance for data. Nuanced understanding of data is requisite for drawing sound scientific conclusions. In particular, without evaluating for the quality and accuracy of data used to test models, it is impossible to be certain that progress is being made and that successive iterations of models truly make progress on the underlying task or linguistic phenomena of interest.
Within NLP, iconic datasets such as the Penn Treebank (Marcus et al., 1993) have sustained subareas such as language modelling, part-of-speech tagging, and syntactic parsing for years due to the painstaking annotation efforts put into making these high-fidelity resources. And in the context of summarization, initial datasets, such as those produced during the Document Understanding Conference (DUC) and Text Analysis Conference (TAC) evaluations, implemented fine-grained verification of data quality.[2] In part due to the emergence of data-hungry modelling techniques, the demands for larger datasets often render quality assurance procedures of this standard impractical. Nonetheless, several recent natural language understanding datasets (Bowman et al., 2015; Rajpurkar et al., 2016; Suhr et al., 2017) institute explicit quality-control procedures in crowd-sourcing dataset construction (Zaidan and Callison-Burch, 2011; Yan et al., 2014; Callison-Burch et al., 2015), such as using additional annotators to validate annotations (cf. Geva et al., 2019). In the sibling subfield of machine translation, which often shares similar modelling challenges and evaluation regimes with summarization since both are sequence-to-sequence natural language generation tasks, the annual WMT conference[3] consistently furnishes high quality data. In summary, ensuring data quality is both crucial and challenging. And in comparison with other subareas of NLP, we argue that summarization has lagged behind in rigorously ensuring the quality of widely-used datasets.
Relating data quality and model quality. The correctness and quality of data inherently bounds what can be learned from the data about the task of interest. From an information-theoretic perspective, this can be made fully formal as follows:[4]

I(S; M) ≤ I(S; T) + I(S; P) + I(S; A)

Here, I denotes the mutual information, S denotes understanding of the underlying summarization task, and M denotes a model learned using summarization training data T, additional pretraining data P, and the model's architecture A.

[2] DUC 2003 annotation guidelines: https://duc.nist.gov/duc2003/tasks.html and DUC 2002 quality assessment questions: https://duc.nist.gov/duc2003/quality.html
[3] http://www.statmt.org/wmt20/
[4] Proof deferred to Appendix D.
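One way the bound could be argued (a sketch under independence assumptions; the paper's actual proof is deferred to Appendix D and is not reproduced here):

```latex
% Since M is a (possibly stochastic) function of (T, P, A), the data
% processing inequality gives
I(S; M) \le I\bigl(S; (T, P, A)\bigr)
% and by the chain rule for mutual information
I\bigl(S; (T, P, A)\bigr) = I(S; T) + I(S; P \mid T) + I(S; A \mid T, P).
% If the conditional terms are bounded by their unconditional counterparts
% (which requires suitable independence assumptions on T, P, A), this yields
I(S; M) \le I(S; T) + I(S; P) + I(S; A).
```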
For fully learning-based methods, especially those with weak/minimal inductive biases such as neural networks, I(S; A) is approximately zero. While I(S; P) may be greater than zero (e.g. language modelling pretraining provides statistical information that may help a model avoid a priori unlikely summaries), standard pretraining regimes such as large-scale language modelling over generic text corpora (Devlin et al., 2019; Raffel et al., 2019) are likely insufficient to meaningfully learn to summarize. Under these assumptions, the mutual information between S and M is critically upper-bounded in terms of I(S; T). We hypothesize that the quality of the training dataset T is highly correlated with its mutual information with respect to the summarization task S, I(S; T).

One size does not fit all. Spärck Jones (1999) famously argued that summarization systems should be understood conditional on the context in which they will be used. In recent years, the field has significantly departed from this perspective and primarily studied "general-purpose summarization" (Kryscinski et al., 2019), which she denounced as ignis fatuus. With our work, we adopt the perspective that it is strictly preferable to have all properties quantified for every dataset; it is the responsibility of practitioners building summarization systems to accurately weight different metrics based on their ultimate goals and use cases. As such, we refrain from providing prescriptive domain-agnostic or context-agnostic notions of summarization.

Metrics
In this work, we evaluate the quality of a dataset by aggregating scores for each example in the dataset. We conjecture that for many NLP tasks, estimating the quality of a particular data example is of similar complexity as correctly performing the task on the example.5 Nevertheless, for summarization, our insight is that various aspects of a summarization example (a document-summary pair) can be reliably estimated by re-purposing existing NLP methods. We are guided by pioneering work (Luhn, 1958; Edmundson, 1969; Mani, 1999) that defined core properties of summarization systems and influential subsequent work (Radev et al., 2002; Nenkova, 2006; Nenkova and McKeown, 2012; Peyrard, 2019a) that refined and extended these properties. From this literature, we specifically study compression, topic similarity, abstractivity, redundancy, and semantic coherence as these properties are of recurring and sustained interest.[6] For each abstract property, numerous concrete methods can be proposed to quantify it. In Appendix A, we describe alternatives we considered and detail how we decided which methods performed best. We restrict discussion to the best-performing approaches in the main paper.

Notation. Our metrics assume indexed sets D, S such that summary S_i ∈ S summarizes document D_i ∈ D. The length in words of a sequence s is |s| and the length in sentences is ‖s‖. Each metric assigns a value x ∈ [0, 1] to every (D_i, S_i), where 1 is the maximal score; example-level scores are averaged to yield a dataset-level score.

Compression. We quantify compression at the word (w) and sentence (s) levels:

CMP_w(D_i, S_i) = 1 - |S_i| / |D_i|
CMP_s(D_i, S_i) = 1 - ‖S_i‖ / ‖D_i‖

Topic Similarity. We learn a topic model M on training corpus T with k topics using LDA (Blei et al., 2003) and quantify topic similarity by comparing the inferred topic distributions θ_{D_i|M}, θ_{S_i|M} for a given summary and document:

TS(D_i, S_i) = 1 - JS(θ_{D_i|M}, θ_{S_i|M})

where JS is the Jensen-Shannon distance. We set k = 20 and T = D.

Abstractivity. Grusky et al. (2018) introduced fragments F(D_i, S_i), which are greedily-matched spans shared between D_i and S_i. We quantify abstractivity as a normalized function of the aggregate fragment length; our definition generalizes that of Grusky et al. (2018):

ABS_p(D_i, S_i) = 1 - (Σ_{f ∈ F(D_i, S_i)} |f|^p) / |S_i|^p

We set p = 1.
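As a concrete illustration, the compression and abstractivity definitions above can be sketched in a few lines of Python. This is our own sketch, not the authors' released code; the function names and the greedy fragment matcher below reflect our reading of the definitions.

```python
def cmp_w(doc_words, sum_words):
    """Word-level compression: 1 - |S_i| / |D_i|."""
    return 1.0 - len(sum_words) / len(doc_words)

def fragments(doc_words, sum_words):
    """Greedily match maximal shared spans between summary and document
    (after Grusky et al., 2018); returns the list of fragment lengths."""
    frags, i = [], 0
    while i < len(sum_words):
        best = 0
        for j in range(len(doc_words)):
            k = 0
            while (i + k < len(sum_words) and j + k < len(doc_words)
                   and sum_words[i + k] == doc_words[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            frags.append(best)
            i += best        # consume the matched span
        else:
            i += 1           # novel word: advance one position
    return frags

def abs_p(doc_words, sum_words, p=1):
    """Abstractivity: 1 - sum_f |f|^p / |S_i|^p."""
    total = sum(f ** p for f in fragments(doc_words, sum_words))
    return 1.0 - total / len(sum_words) ** p

doc = "the cat sat on the mat while the dog slept".split()
summ = "the cat sat on the mat".split()
print(cmp_w(doc, summ))   # 6-word summary of a 10-word document -> 0.4
print(abs_p(doc, summ))   # fully extractive -> 0.0
```

A fully novel summary (no shared spans) receives the maximal abstractivity score of 1.0 under this definition.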
Redundancy. ROUGE (Lin, 2004) implicitly penalizes redundancy but underestimates its detrimental impacts (Chaganty et al., 2018). However, we find that ROUGE is effective for identifying redundancy given its definitional focus on overlapping spans. We quantify redundancy as the average ROUGE-L F-score over all pairs of distinct sentences in the summary.

[6] Different names and interpretations have been given for these properties in the literature. We revisit this in Appendix A in discussing alternate metrics.
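The redundancy metric can likewise be sketched with a minimal ROUGE-L (longest-common-subsequence F-score) implementation. This sketch assumes whitespace tokenization and is not the official ROUGE toolkit:

```python
from itertools import combinations

def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f(a, b):
    """ROUGE-L F-score between two token lists."""
    l = lcs_len(a, b)
    if l == 0:
        return 0.0
    p, r = l / len(b), l / len(a)
    return 2 * p * r / (p + r)

def redundancy(summary_sentences):
    """Average ROUGE-L F over all distinct sentence pairs in the summary."""
    toks = [s.split() for s in summary_sentences]
    pairs = list(combinations(toks, 2))
    if not pairs:
        return 0.0  # single-sentence summaries have no pairs to compare
    return sum(rouge_l_f(a, b) for a, b in pairs) / len(pairs)

print(redundancy(["the deal was signed today",
                  "the deal was signed in paris today"]))  # high overlap
```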
Semantic Coherence. We evaluate the semantic coherence of multi-sentence summaries by predicting the probability of each successive sentence conditioned on the previous one using a powerful language model, BERT (Devlin et al., 2019), which is pretrained with precisely this next-sentence prediction objective.
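A sketch of the coherence computation, with the next-sentence scorer abstracted as a callable so the example stays self-contained. In practice this callable would wrap BERT's next-sentence prediction head (e.g. `BertForNextSentencePrediction` in HuggingFace Transformers); the `toy_nsp` word-overlap stand-in below is purely illustrative.

```python
def semantic_coherence(sentences, next_sent_prob):
    """Average probability that sentence t+1 follows sentence t, as judged
    by `next_sent_prob` (in the paper, BERT's next-sentence prediction)."""
    if len(sentences) < 2:
        return 1.0  # single-sentence summaries are trivially coherent
    probs = [next_sent_prob(a, b) for a, b in zip(sentences, sentences[1:])]
    return sum(probs) / len(probs)

def toy_nsp(a, b):
    """Toy stand-in scorer: Jaccard word overlap, NOT a real NSP model."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

print(semantic_coherence(["the model was trained on news",
                          "the model was then evaluated"], toy_nsp))
```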

Data
We study the following 10 summarization datasets that have been frequently used in recent years. 7

Results and Analysis
Compression scores quantitatively disambiguate summarization tasks. Concretely, we observe GW has the lowest compression scores; while GW is sometimes described as a summarization dataset (Rush et al., 2015; Chopra et al., 2016), it is better seen as a headline generation dataset more in the style of sentence compression (as is suggested by ‖S_i‖ = ‖D_i‖ = 1). Conversely, AMI and MovieScript achieve the highest scores by a substantial margin and are long-document summarization datasets. Classifying new summarization datasets accurately may prove useful given that successful methods from one domain often do not extend to another, and this shortcoming in generalization can be attributed to differences in compression requirements (Cohan et al., 2018).

Given the goals stated in the XSum dataset paper, TL;DR may be a better choice than XSum. In particular, Narayan et al. (2018) introduce XSum as a large dataset that legitimately requires abstraction. While XSum is more abstractive than other News datasets (barring GW) and is relatively large, TL;DR displays greater abstractivity, similar-length summaries, and is 15 times larger. That said, Narayan et al. (2018) explore topic-oriented strategies in their work and such methods may be better suited to XSum given the TS scores.

CNN-DM and NYT are suboptimal for studying abstractive/extractive systems, respectively. Several recent works (See et al., 2017; Paulus et al., 2018; Li et al., 2018) have used CNN-DM to build and evaluate abstractive systems. Conversely, NYT has been used to build extractive systems (Hong and Nenkova, 2014; Li et al., 2016). Given our findings, we find both of these trends to be inconsistent with dataset properties and suboptimal given other preferable datasets for these purposes: CNN-DM is one of the least abstractive datasets, and there are larger and more extractive alternatives to NYT such as NWS. Especially in the case of CNN-DM, we note that training learning-based systems (e.g. neural methods) using data with limited abstractivity implies the resulting summarizers will be limited in their ability to generate genuinely abstractive text. This is validated by empirical findings, as both See et al. (2017) and Zhang et al. (2018) observe limited abstractivity in abstractive systems trained on CNN-DM. In light of this, we argue systems should be characterized as abstractive or not based on their empirical behavior rather than their theoretical capability.9

CNN-DM is not a representative benchmark for summarization as a whole. Recent work (Kryscinski et al., 2019; Raffel et al., 2019) has explicitly portrayed CNN-DM as the benchmark dataset for summarization; the field has implicitly done this for several years (Kryscinski et al., 2019). While there is clear value in evaluating pretrained representations on summarization datasets, we caution against using CNN-DM as a stand-in for the entire summarization subfield. Instead, we suggest using a diverse group of datasets and not reducing a highly heterogeneous subfield to a single dataset. While this adds additional overhead, this cost is necessary to draw meaningful conclusions about the impact of advances on summarization broadly, given the pronounced diversity in summarization datasets (Table 1).

Post-processing methods for mitigating redundancy may be needed for practical systems. While evaluation on standard datasets using ROUGE may not penalize for this, redundancy is clearly undesirable (Carbonell and Goldstein, 1998; Peyrard, 2019a), and existing datasets (and thereby systems learned using that data) display significant amounts of redundancy in their gold-standard summaries (exceptions are datasets with short summaries, where cross-sentence redundancy is constrained to be low). Specifically, Nenkova (2006) argues that redundancy is a clear inhibitor for practical application of summarization systems.
Consequently, post hoc methods that reduce redundancy after initial evaluation may be useful in generating summaries that are suitable for human users.

Semantic coherence captures observable variation in summary coherence. We observe that the Scientific summaries (which are abstracts of published papers) are clearly more coherent than the author-generated summaries in TL;DR, the fragmented summaries in AMI, and the concatenated bullet-point summaries in CNN-DM. We find that this distinction is captured by the SC measure using BERT. Quantifying semantic coherence is especially important given that the coherence of reference summaries will inform the coherence of system summaries, especially for learning-based approaches. Akin to what we discuss for abstractivity, See et al. (2017) and Paulus et al. (2018) both demonstrate that neural summarizers generate incoherent summaries despite achieving high ROUGE scores.

Pairwise Correlations
While the properties we evaluate do not exhaust all aspects of summarization that may be of interest, it is unclear to what extent the different measures overlap in judgments. To quantify this, in Table 2 we report pairwise correlations for every pair of metrics. In each case, the value reported is the Spearman rank correlation coefficient ρ computed between the length-10 vectors containing the scores for each dataset.10 ρ = 1 indicates perfect positive correlation (which is why we see this for all diagonal entries) and ρ < 0 indicates the metrics are anti-correlated.

Unsurprisingly, the compression metrics are strongly correlated with each other. We further observe that redundancy and topic similarity are correlated, whereas abstractivity is anti-correlated with both. In particular, when summaries are considerably redundant, we qualitatively observe that the repeated content in the summary was both important and repeated in the context of the reference document. This may explain why redundancy and abstractivity are anti-correlated, as it suggests that highly redundant summaries are highly extractive. Additionally, since we measure topic similarity using LDA and unigram count statistics, it is not surprising that extraction may correlate with high topic similarity. In part, this may suggest a deficiency of our measure of topic similarity in accurately considering references to the same topic using substantially different words.
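The correlation computation itself is straightforward; a minimal Spearman implementation is sketched below, assuming no tied scores (a library routine such as `scipy.stats.spearmanr` handles ties via average ranks). The score vectors here are hypothetical, not values from Table 2.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (assuming no ties): computed via the
    classic formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)) over rank differences."""
    n = len(xs)
    rank = lambda v: sorted(range(n), key=lambda i: v[i])
    rx, ry = [0] * n, [0] * n
    for r, i in enumerate(rank(xs)):
        rx[i] = r
    for r, i in enumerate(rank(ys)):
        ry[i] = r
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical dataset-level scores for two metrics across five datasets.
cmp_scores = [0.91, 0.85, 0.60, 0.73, 0.95]
red_scores = [0.10, 0.15, 0.40, 0.25, 0.05]
print(spearman_rho(cmp_scores, red_scores))  # perfectly anti-correlated: -1.0
```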
We also observe that semantic coherence patterns similarly to redundancy. In particular, while we find the semantic coherence scores are appropriate for most examples we manually inspected, this suggests that BERT relies upon word-level overlaps in making next-sentence judgments (similar to behaviors seen in other sentence-pair tasks such as natural language inference, cf. Gururangan et al., 2018).

Detecting Low Quality Examples

Since manually considering all of the 1080 examples was not feasible, we began by examining the sampled examples for topic similarity, redundancy, and semantic coherence. Our hypothesis was that example quality would positively correlate with coherence and topic similarity and negatively correlate with redundancy. We found this hypothesis to be validated by our observations: examples with low coherence, low topic similarity, or high redundancy scores were generally low quality. Every example which we judged to be low quality demonstrated at least one of the following defects:

• The summary contains critical disfluencies that severely hinder accurate processing.[11]

• The summary excludes unambiguously critical information from the reference document.
• Crucial information in the summary does not appear in the reference document and is not general knowledge.
• Substantial fractions of the summary involve entities, relations, or events that are ambiguous and that we could not resolve from the summary alone. In particular, accurate interpretation of the summary would require also reading the reference document to resolve various coreferring expressions; the summary is not self-contained.[12]

• The summary is entirely inappropriate as a summary of the reference document. For example, the summary only discusses an event with no obvious relationship to the contents of the reference document.

[11] We invoked this condition fairly judiciously, as we observed that the domain of summaries could also influence the fluency of summaries in terms of grammaticality. In particular, we unsurprisingly found that academic papers in the Science domain generally have highly grammatical summaries, whereas the bullet-point summaries in CNN-DM and the author-written summaries in TL;DR often were ungrammatical but still sufficiently clear to be interpreted correctly.
• The summary includes an entire sentence or long phrase describing something that appears in the main document but that is clearly an auxiliary detail. We flagged examples as low quality due to this condition quite conservatively, only using it when we could find no basis for why the sentence/phrase should appear in the summary.
On the other hand, we did not find any systematic defects in examples with high coherence, high topic similarity, or low redundancy scores. Instead, almost all of these examples were satisfactory. For the remaining two properties (compression measured by CMP_w, abstractivity measured by ABS_1), we analyzed all of the associated 400 examples. We observed that many of these examples tended to be generically low quality, and we quantify this in Table 3. Since this analysis may be difficult to replicate and involves subjective decisions about example quality, we comprehensively enumerate all example IDs we use in Table 8. Table 4 shows a representative subset of the low quality examples we found in our analysis. We provide further examples in Appendix C and Figures 1-9.

Compression. Minimally compressed summaries in NYT, NWS, TL;DR, and PubMed often are supplementary information to the document rather than a summary of it; in some cases, we believe this is due to alignment errors in dataset construction/release. On the other hand, heavily compressed summaries in NWS and XSum often are just category labels (e.g. Sports), in TL;DR are usually attention-grabbers, and in NYT are near-exact duplicates of reference documents, which themselves are letters to the editor.

[12] Many summaries drawn from the News domain have references that could be resolved by world knowledge or that could be reasonably understood using common sense knowledge. In these cases, while the summary is not fully self-contained, we did not judge them to be low quality. However, we expect that systems trained using these datasets would require knowledge beyond what is afforded by the reference document to accurately generate summaries of this type.
Abstractivity. Manual inspection reveals highly abstractive summaries in NYT and NWS generally are exceedingly vague or are entirely unrelated to the original document. Highly abstractive summaries in PeerRead are often translated to English from the reference document's language and discuss results that do not appear in the introduction but likely appear later in the paper. Conversely, extremely extractive summaries in NWS and NYT often are just the lede and cannot be understood without the reference document. However, in most other instances, the lede is an effective summary for examples drawn from the News domain.
Within the context of our sample of examples, we find that eight of the ten summarization datasets (all but AMI, MovieScript) contain at least 8% low quality examples, the majority contain at least 14% low quality examples, and that these low quality examples can be detected using our compression and abstractivity metrics. For the worst-offending TL;DR dataset, we conservatively estimate at least 20% of examples are of substantially subpar quality. In general, we find that the low quality TL;DR "summaries" we detect often serve a different rhetorical purpose than summarization (e.g. attention grabbing, responding to a previous post that is not available in the dataset, sarcasm/humor).
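One way this kind of detection could be operationalized is a simple threshold filter over the metric scores. The sketch below uses illustrative cutoffs and directions; the paper inspects sampled extremes rather than fixed thresholds, so both the function and the threshold values are our own.

```python
def flag_low_quality(scores, thresholds):
    """Flag an example if any metric falls on the 'bad' side of a cutoff.
    Directions follow the analysis above: low topic similarity (TS), low
    semantic coherence (SC), and high redundancy (RED) are suspect.
    Thresholds are illustrative, not values from the paper."""
    ts_min, sc_min, red_max = thresholds
    reasons = []
    if scores["TS"] < ts_min:
        reasons.append("low topic similarity")
    if scores["SC"] < sc_min:
        reasons.append("low coherence")
    if scores["RED"] > red_max:
        reasons.append("high redundancy")
    return reasons

example = {"TS": 0.12, "SC": 0.85, "RED": 0.9}
print(flag_low_quality(example, (0.3, 0.5, 0.6)))
# -> ['low topic similarity', 'high redundancy']
```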

Related Work
Dataset Analysis. As an alternative to automated evaluation, Chen et al. (2016) and Yatskar (2019) conduct human evaluations of standard datasets in reading comprehension and question answering. In some cases, dataset creators perform manual analyses of the data they introduce (e.g. Sandhaus (2008) and Grusky et al. (2018) for the NYT and Newsroom corpora, respectively). Automated and human evaluation provide complementary benefits with respect to their scalability and reliability. Even in the context of human evaluations, we advocate that automatic metrics can be useful in guiding the exploration of data and informing subsampling procedures that provide fine-grained insights.

Quality Estimation. Our work bears resemblance both in name and structure to work on quality estimation. Quality estimation, often centered on natural language generation, is the task of measuring system-generated output quality (Paetzold and Specia, 2016; Yuan and Sharoff, 2020). It is closely related to work on unsupervised or reference-free evaluation (Napoles et al., 2016; Ethayarajh and Sadigh, 2020). Within the context of summarization, the special case of quality estimation regarding factual consistency/faithfulness has been of recent interest (Wang et al., 2020; Maynez et al., 2020; Durmus et al., 2020) since neural abstractive summarizers have been shown to hallucinate/misrepresent facts (See et al., 2017). In comparison to these settings, our metrics make no use of labelled data (even in training) and are entirely intrinsic/unsupervised.

Two recent works are highly related to our own. Kryscinski et al. (2019) provide a critical reevaluation of summarization research. Most relevant to our work, they show that web-scraped datasets, specifically CNN-DM and NWS, contain a nontrivial fraction of examples (approx. 3.5%) with HTML artifacts (which can be easily detected/removed). Jung et al. (2019) provide an aspect-level evaluation of both summarization datasets and systems.
In their work, the dataset analyses center on biases in the data (e.g. positional biases, which are often seen in news summarization), which is reminiscent of the annotation artifacts seen in other NLP tasks (Gururangan et al., 2018; Niven and Kao, 2019).

Discussion
Open Problems and Future Directions. Our results demonstrate that a sizeable fraction of examples in most summarization datasets are low quality. However, it remains open whether modellers should simply prune these examples, manually/automatically attempt to correct them, or model them without change. We do note that research in the machine learning and learning theory communities shows that models both theoretically and empirically do substantially worse when trained using low quality examples, even when the examples are not strictly adversarially chosen (Klivans et al., 2009;Biggio et al., 2012;Koh et al., 2018). These concerns are further compounded by the evidence of Belinkov and Bisk (2018) that neural models for natural language generation are not robust to naturally noisy data.
Our metrics may be repurposed to rank examples in designing curricula for curriculum learning approaches (Bengio et al., 2009). Alternatively, they can serve as additional metrics for the (possibly unsupervised) evaluation of summarization systems, potentially mitigating deficiencies in standard metrics, such as ROUGE, by directly penalizing redundancy and semantic incoherence.
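A hypothetical sketch of the curriculum use case (not an approach evaluated in this work): examples are ordered by any one of the metric scores before training.

```python
def curriculum_order(examples, score_fn, easy_first=True):
    """Order training examples by a metric score for curriculum learning,
    e.g. score_fn could return semantic coherence so that clean, coherent
    examples are presented first. Illustrative, not from the paper."""
    return sorted(examples, key=score_fn, reverse=not easy_first)

# Hypothetical (example_id, coherence_score) pairs.
data = [("ex1", 0.4), ("ex2", 0.9), ("ex3", 0.7)]
ordered = curriculum_order(data, score_fn=lambda e: e[1], easy_first=False)
print([name for name, _ in ordered])  # -> ['ex2', 'ex3', 'ex1']
```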
Limitations. In this work, we restrict ourselves to single-document, single-reference, English-language summarization datasets. While the datasets we study constitute a considerable fraction of dataset usage in the summarization community, several multi-document summarization datasets have been introduced (e.g. Fabbri et al., 2019; Antognini and Faltings, 2020) and multi-reference summarization datasets have often been argued to be desirable due to the under-constrained nature of the summarization task (Kryscinski et al., 2019) and the ideal evaluation paradigm for ROUGE (Lin, 2004). Beyond English, both large summarization datasets (Nguyen and Daumé III, 2019; Varab and Schluter, 2020) and more general language resources/technologies (Joshi et al., 2020) are less available, which may heighten the need for data quality assurance.
More broadly, the measures that we introduce are automated, and therefore non-human, judgments of the quality of summarization data. Therefore, we only envision these measures being useful as inexpensive first-order approximations of aspect-level summary quality rather than bona fide replacements for human evaluation. Additionally, since we principally envision applying these metrics to datasets, we make no effort to make these metrics robust to adversarially-crafted data and they are likely quite susceptible to adversarial attack.

Conclusion
In this work, we demonstrate that various aspects of summarization datasets can be intrinsically evaluated. We specifically show this for 5 properties across 10 popular datasets, uncovering that dataset use is sometimes incongruous with the attributes of the underlying data. We also find that some aspect-level estimators may be surprisingly effective at detecting low quality dataset examples. Our findings suggest that more intentional and deliberate decisions should be made in selecting summarization datasets for downstream modelling research and that further scrutiny should be placed upon summarization datasets released in the future.

Reproducibility
All code is made publicly available. 13 Exhaustive reproducibility details, including how to access all datasets, are provided in Appendix B. We fully adhere to the EMNLP 2020 Reproducibility guidelines, addressing all relevant checklist items.

A.1 Compression
For compression, we found sentence-level compression to be a naturally motivated metric given that many extractive systems are constrained to extract sentence-length sequences. We also considered byte-level compression as an alternative to word-level compression (as computational length constraints have sometimes been used in evaluation instead of word-length constraints). We found the results to be highly correlated with word-level compression and not further revealing (and bytes may be inherently less interpretable for NLP when compared with words). We also considered restricting to content words, motivated by literature in topic modelling (Schofield et al., 2017) that has considered removing stopwords and other such lexical categories. These results were also highly correlated with the original word-level compression results and we did not find any discerning trends in looking at individual examples.

A.2 Topic Similarity
In the main paper, we compute topic similarity using the Jensen-Shannon distance. We initially considered the Kullback-Leibler (KL) divergence. While the JS distance and/or divergence has been more frequently used in the context of similarity in topic modelling, the KL divergence is also frequently considered. Intuitively and under some interpretations, the asymmetry of the KL divergence may be desirable, as the extent to which a summary is topically similar to a document may not be the same as the extent to which a document is topically similar to a summary. In spite of this, in viewing the results using KL, we found that the measure lacked discriminative power in disambiguating examples we believed were more topically similar than others. We qualitatively found the judgments via the JS distance to be accurate. That said, the judgments between the measures tended to be highly correlated, as the Spearman rank correlation coefficient was ρ ≥ 0.7 for all topic modelling settings and in most cases exceeded 0.8.

We also considered a topic model learned using both the documents and summaries D ∪ S versus just the documents D. Both are natural choices, with using the documents being more general in some sense, as the topic similarity of a summary should be assignable without requiring the summary collection. We further considered several choices for the number of topics as well. In Table 5, we report the full results for all pairs (training corpus T, # topics k) for all (T, k) ∈ {D ∪ S, D} × {10, 20, 50, 100}. In all cases, the number of training examples is truncated to 20000 (hence 10000 summaries and 10000 documents when using the training corpus of D ∪ S). We fix the number of training documents across datasets to attempt to control for the confound of larger datasets inducing higher quality topic models. We did not observe significant changes in the results by relaxing this (i.e. using the full datasets instead of just 20000 examples).
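The JS-versus-KL comparison can be illustrated directly on a pair of toy topic distributions. In this sketch (our own, with hypothetical distributions), `kl` uses natural logs while the JS distance is normalized with base-2 logs so that it lies in [0, 1], matching the range used by the TS metric.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); note the asymmetry."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the JS divergence, normalized to
    base-2 logs so the value lies in [0, 1]; symmetric in p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(jsd / math.log(2))

# Hypothetical inferred topic distributions for a document and its summary.
theta_doc = [0.8, 0.1, 0.1]
theta_sum = [0.4, 0.4, 0.2]
print(kl(theta_doc, theta_sum), kl(theta_sum, theta_doc))  # asymmetric
print(1 - js_distance(theta_doc, theta_sum))               # the TS score
```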
We find that there is significant variation in cross-dataset rankings with respect to these two parameters. We chose to report the results corresponding to k = 20, T = D. We chose the value for k based on qualitative judgments about topic quality for CNN-DM, PeerRead, and AMI, as we considered these to be a diverse subset of all 10 datasets. The topics we observed were highly disjoint and reasonably aligned with our intuitions about what sensible topics should be. We chose the value for T based on the generality referenced previously. While the results are substantially different for D versus D ∪ S, we did not find any consistent and interpretable discriminative properties between the two.

A.3 Abstractivity
Our general framework for quantifying abstractivity is derived from Grusky et al. (2018). We initially considered p ∈ {1, 2, 3, 4} and found p = 1 to be the most informative regarding abstractivity. In particular, we find that for increasing p, useful conclusions about abstractivity are inherently masked by the dominance of the |S_i|^p denominator in the definition. We report the scores for ABS_2 in Table 6.
We also considered the natural extensions to ABS_3 and ABS_4, but we found that the normalization dominates any deviation in the scores and all datasets essentially receive a score of 1. We also considered other forms of normalization (i.e. normalizing ABS_2 in the style of the L_2 norm or generalized p-norms) in initial experiments but found no substantial differences.
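A sketch of this computation is below. The fragment extraction follows the greedy longest-match procedure of Grusky et al. (2018); the ABS_p normalization is written in an assumed form (fragment lengths raised to the power p, normalized by |S|^p) that reproduces the saturation behavior described above, with the precise definition given in the main paper.

```python
def extractive_fragments(article_toks, summary_toks):
    """Greedy longest-match extractive fragments (Grusky et al., 2018)."""
    frags, i = [], 0
    while i < len(summary_toks):
        best = 0
        for j in range(len(article_toks)):
            if summary_toks[i] == article_toks[j]:
                k = 0
                while (i + k < len(summary_toks) and j + k < len(article_toks)
                       and summary_toks[i + k] == article_toks[j + k]):
                    k += 1
                best = max(best, k)
        if best > 0:
            frags.append(best)
            i += best
        else:
            i += 1
    return frags

def abs_p(article_toks, summary_toks, p=1):
    # Assumed form: 1 minus fragment mass (sum of |f|^p) over |S|^p.
    # For large p the denominator dominates and the score saturates at 1.
    frags = extractive_fragments(article_toks, summary_toks)
    s_len = len(summary_toks)
    return 1.0 - sum(f ** p for f in frags) / (s_len ** p)
```

For a fully extractive summary the score is 0 at p = 1, and as p grows all summaries drift toward a score of 1, matching the degenerate behavior observed for ABS_3 and ABS_4.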

A.4 Redundancy
In the main paper, we compute redundancy scores for each distinct sentence pair using the ROUGE-L F-measure and then average these individual values to get a score for the entire summary. Alternatively, we considered other ROUGE scores (specifically ROUGE-1 and ROUGE-2) as well as max pooling the sentence pair scores. We report these results in Table 7. We do not observe significant changes with the specific ROUGE metric considered (i.e. a Spearman ρ of 1.0, indicating a perfect correlation across the ROUGE variants in the case of max pooling). We do see substantial differences between averaging and max pooling; we find that max pooling turns out to correlate precisely (ρ = 1.0) with the average summary length. This is somewhat expected, given that the max-pooled redundancy estimate does not inherently control for summary length. We therefore chose to report redundancy scores using averaging, as we also qualitatively found them to be more useful and characteristic, especially for datasets such as AMI and the Scientific datasets, where max pooling was overly aggressive. While the nuances of the specific ROUGE variant did not significantly impact trends in redundancy scores, we chose to report the ROUGE-L scores in the main paper as we (highly subjectively) found the values to be most interpretable and consistent with values we would have assigned.
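The pairwise computation can be sketched as follows; this is a simplified stand-in using a plain LCS-based ROUGE-L F1 rather than the easy-rouge implementation referenced in Appendix B.5, with mean versus max pooling exposed as a parameter.

```python
from itertools import combinations
from statistics import mean

def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f(ref, hyp):
    # ROUGE-L F-measure between two token sequences (balanced F1 here).
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hyp), lcs / len(ref)
    return 2 * p * r / (p + r)

def redundancy(sentences, pool=mean):
    # Score every distinct sentence pair, then pool (mean in the main paper).
    scores = [rouge_l_f(a, b) for a, b in combinations(sentences, 2)]
    return pool(scores)
```

Swapping `pool=mean` for `pool=max` reproduces the max-pooling variant discussed above.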

A.5 Semantic Coherence
We evaluate semantic coherence between successive pairs of sentences, exploiting the auxiliary training objective of BERT beyond its masked language modeling objective. We were especially interested in this given that many systems are designed with explicit handling of sentence boundaries (e.g. more extractive systems first rank sentences and then order a thresholded subset) and that datasets such as CNN-DM, which are artificially concatenated, may not be inherently coherent across sentence boundaries. Our observations regarding the measure of coherence provided by BERT's next-sentence predictions seem to contradict existing findings. In particular, Liu et al. (2019) introduce RoBERTa as a direct follow-up study to BERT and find that the next-sentence prediction objective is not an effective pretraining objective for improving representations for natural language understanding; Yang et al. (2019) provide similar evidence. However, our findings do not contest these conclusions but instead suggest that, nonetheless, BERT is a strong next-sentence predictor and that its predictions are still useful for measuring coherence across sentences. While we considered word- or subword-level measures of coherence, we did not consider alternative pretrained models trained on other objectives related to inter-sentence coherence, such as ALBERT (Lan et al., 2020). Given the findings of Lan et al. (2020, §4.6), it seems likely that the sentence order prediction task they use may be more effective for measuring semantic coherence. Concurrent work by Prabhumoye et al. (2020) also substantiates the usefulness of BERT-based next-sentence prediction for measuring coherence and ranking sentence orders.
That said, semantic coherence could also be evaluated using (neural) language models, especially in light of results suggesting they may be consistent with human judgments regarding grammaticality and acceptability (Chowdhury and Zamparelli, 2018; Warstadt et al., 2019). We did consider this and found language modeling scores (e.g. surprisal) assigned via a pretrained high-quality causal language model (GPT-2) to be inconsistent with our human judgments. We believe language modeling scores in this sense are likely highly sensitive to the domain (and even to within-domain effects, e.g. lexical variation, which is fairly limited for XSum given that all articles are sourced from the BBC, whereas for Newsroom the variation is greater given the heterogeneous group of publishers with more diversified writing styles).

B Reproducibility Details
We provide precise and comprehensive details regarding all data, preprocessing, and modelling decisions. All code will be made publicly available as noted in the main paper.

B.1 Dataset Sources
We use the versions of the GW and CNN-DM datasets released by Gehrmann et al. (2018). 14 Sentence boundary tokens inserted by Gehrmann et al. (2018) to improve summarization quality were removed to ensure a fair comparison in our work. An important distinction in the use of the CNN-DM dataset for modeling is whether the entity-anonymized or non-anonymized version is used; this copy is non-anonymized, and it is important to consider the stability of our metrics under this anonymization. We use the released version of the NYT dataset directly as it was released via the LDC. 15

Table 7: Alternative methods for estimating redundancy. Results in the main paper are equivalent to those in the row corresponding to mean and ROUGE-L.
We use the released version of the TL;DR dataset provided by the authors of Völske et al. (2017) (https://tldr.webis.de/). We use a version of the NWS dataset that was released via private communication with the authors of Grusky et al. (2018); we have verified with the authors that the data can be requested via the platform they released in their original work (https://summari.es/). For all remaining datasets, we use the versions released by Jung et al. (2019) (http://biassum.com/). All of our conventions in using these five datasets follow their work.

B.2 Data Preprocessing
All datasets were first filtered to remove examples where either the document or summary was empty. We found that only examples in CNN-DM failed this criterion, and this constituted less than 0.1% of the dataset (114 of 287227 examples). All results were then reported on the standard training set when we were aware of a standard split used consistently in the summarization system literature. Splits for the datasets sourced from the work of Jung et al. (2019) followed their work. In all cases, the training set was at least 80% of the full data collection, so we expect results to generalize to the portions of the collection that were not considered, assuming splits were constructed by sampling uniformly at random (we did not verify this).

B.3 Topic Similarity
We lowercase all terms, remove stopwords using the list specified in NLTK (Loper and Bird, 2002), and lemmatize using SpaCy (Honnibal and Montani, 2017). We only retain words tagged by the SpaCy POS tagger with a POS category in {NOUN, ADJ, VERB, ADV}. We use LDA (Blei et al., 2003) to learn all topic models and rely on the implementation in Gensim (Řehůřek and Sojka, 2010) based on the specification of Hoffman et al. (2010). All hyperparameters are set to their defaults; we discussed the number of topics k and the training corpus T in §A.2, with the results in the main paper using k = 20 and T = D, where T is truncated to at most 20000 documents. We compute the Jensen-Shannon distance using SciPy (Virtanen et al., 2020).

B.4 Abstractivity
Fragments (Grusky et al., 2018) were computed using the scripts released in that work for the purposes of estimating abstractivity. In the case of the NWS dataset, the authors already provide fragment-related scores, which we use without recomputing them.

B.5 Redundancy
We make use of the native Python reimplementation of ROUGE (Lin, 2004), easy-rouge. 19 All scores reported in the main paper use ROUGE-L and the computed F-measure score.

B.6 Semantic Coherence
We compute semantic coherence by predicting the probability of a sentence conditioned on the preceding sentence using BERT. BERT was pretrained with exactly this objective (beyond its masked language modeling objective), and we use the released model as-is with no further fine-tuning. We use the bert-base-uncased model along with the associated tokenizer as implemented in PyTorch (Paszke et al., 2017) by HuggingFace in the transformers repository. 20

B.7 Efficiency
All metrics reported in the main paper can be computed over all datasets in less than ten hours on a single CPU. The only model with a nontrivial number of parameters used in this work is the bert-base-uncased model we use in measuring semantic coherence. We refer readers to Devlin et al. (2019) for more details and to the HuggingFace implementation referenced previously.

C Detecting Low-Quality Examples
In the main paper, we briefly discuss how several of our metrics can serve the dual purpose of detecting generally low-quality examples, namely examples that achieve extreme scores. Figures 1 through 9 show several examples we found to be representative of the general structure of low-quality examples for a given metric. In some cases the trends are highly dataset-specific, whereas in others they are more general. To facilitate reproducibility efforts, we provide all example IDs we studied for each (dataset, metric) pair in Table 8.
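One simple realization of such a detector is percentile thresholding on a metric's scores; the thresholds and synthetic scores below are illustrative, not the ones we used.

```python
import numpy as np

def flag_extremes(scores, low_pct=0.5, high_pct=99.5):
    """Indices of examples whose metric score lies in either extreme tail."""
    lo, hi = np.percentile(scores, [low_pct, high_pct])
    return [i for i, s in enumerate(scores) if s < lo or s > hi]

# Synthetic metric scores with two planted outliers at indices 1000 and 1001.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.5, 0.1, 1000), [0.0, 1.0]])
flagged = flag_extremes(scores)
```

Examples surfaced this way are candidates for manual inspection of the kind shown in the figures below, not automatic removal.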
Original Text (truncated): Let us, in the beginning, give a word of cordial praise to the American publishers of these splendid volumes. The undertaking, in the first place, was an intellectual compliment to the country. It was based on the faith that there is in this country enough of philosophy and scholarship to justify a new and complete edition of . . .

Summary: Let us, in the beginning, give a word of cordial praise to the American publishers of these splendid volumes. The undertaking, in the first place, was an intellectual compliment to the country.

Figure 1: Dataset: NWS. This summary is simply the lede, and we do not find it to be a useful summary for readers not familiar with the full context of the article. We hypothesize that such a summary may have been useful for members of a newsroom communicating information about the article to one another (given their intimate familiarity with the article), but it is likely inappropriate as a summary in most settings.

Original Text (truncated): A FULL-SERVICE hotel and conference center is to go up in the Lafayette Yard area of Trenton, giving the city a hotel for the first time since the 1980's and bringing to an end its unenviable distinction as the only state capital without lodging for visitors . . .

Summary: Acquest

Detector: Extremely Low Abstraction

Figure 2: Dataset: NYT. This summary conveys no useful information to someone who has not also read the reference document; it is simply a word copied from the source document. It appears to be a label rather than a summary.

Summary (truncated): logic is the science of correct inferences and a logical system is a tool to prove assertions in a certain logic in a correct way . . .

Original Text (truncated): BOSTON RED SOX -- Named Dale Sveum third base coach.

Summary: Sports transactions

Detector: Extremely High Abstraction

Figure 5: Dataset: NYT. This summary is unlikely to be informative to someone who has not read the reference document and is more of a categorization/label than a summary. This is similar to the previous NYT example.

Figure 7: Dataset: NYT. Similar to the previous example, this summary has a negative compression score, which in this case seems to indicate that the summaries and documents were created/aligned incorrectly in Sandhaus (2008).

Original Text (truncated): Brodie (the dog) was neglected, and ended up with serious anger and health issues concerning his skin and allergies. My boyfriend adopted him . . .

Summary: Onions.

Detector: Extremely High Compression

Figure 8: Dataset: TL;DR. We observe this trend quite frequently in TL;DR. Specifically, since authors on the social discussion platform Reddit choose to provide these summaries at their discretion, we often find the "summaries" are attention-grabbing and serve a starkly different rhetorical purpose from how summaries are generally conceived.

Original Text (truncated): these are external links and will open in a new window1908 -king carlos and eldest son assassinated in lisbon. second son manuel becomes king. 1910 -king manuel ii abdicates amid revolution . . .

Summary: a chronology of key events :

Detector: Extremely High Compression

Figure 9: Dataset: XSum. We observe this trend quite frequently in XSum. For articles that are essentially timelines or other types of chronologies discussing historical events diachronically (which form a small but distinctive section of the BBC writing style in our analysis), the summary extracted to accompany the article is generally this string or a slightly altered version. We argue this summary is fairly unhelpful (and is likely fairly uninteresting to test models on; simple rule-based filtering may be preferable to avoid overestimating performance on this dataset because of these examples).

D Mutual Information Bounds

The entropy of a random variable X is defined as:

H(X) = − Σ_x p(x) log₂ p(x)

The conditional entropy of X given Y is defined as:

H(X | Y) = − Σ_y p(y) Σ_x p(x | y) log₂ p(x | y)

The mutual information between random variables X and Y is defined as:

I(X; Y) = H(X) − H(X | Y)

The entropy measures the uncertainty in the probability mass/density function of a random variable. As such, the mutual information measures how much the entropy of X is reduced (on average) due to the observation of Y.
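These definitions can be checked numerically; the sketch below uses a small hypothetical joint distribution over two binary variables and derives H(X | Y) via the chain rule.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

H_x = entropy(p_x)                 # marginal entropy H(X)
H_xy = entropy(joint.flatten())    # joint entropy H(X, Y)
H_x_given_y = H_xy - entropy(p_y)  # chain rule: H(X | Y) = H(X, Y) - H(Y)
I_xy = H_x - H_x_given_y           # I(X; Y) = H(X) - H(X | Y)
```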
In the main paper, we state the following inequality:

I(S; M) ≤ I(S; T) + I(S; P) + I(S; A)

where I denotes the mutual information, S denotes understanding of the underlying summarization task, and M denotes a model learned using summarization training data T, additional pretraining data P, and the model's architecture A.
Intuitively, the claim is that the uncertainty about the summarization task that is reduced by the model (which is uniquely determined by its training data, pretraining data, and architecture) is at most what can be cumulatively reduced by the training data, pretraining data, and inductive biases encoded in the model's architecture.
Our hypothesis is that I(S; A) is small for learning-based models with minimal inductive biases, such as neural networks. Further, we hypothesize that while I(S; P ) is likely nontrivial for popular pretraining regimes, the dominant term on the right-hand side is likely I(S; T ). We do note that this second hypothesis may be false given the partial evidence of GPT-3 (Brown et al., 2020) and the successes it enjoys in few-shot learning due to pretraining at unprecedented scale. However, no evaluation is conducted on summarization data in that work.

E Additional Statistics
In the main paper, we report the average score for each metric on each dataset. To complement reporting the mean, we report the standard deviation for each metric on each dataset in Table 9.

Table 9: Aspect-level standard deviations for each dataset. Redundancy and semantic coherence are not reported for datasets with > 95% single-sentence summaries.