Neural Text Summarization: A Critical Evaluation

Text summarization aims at compressing long documents into a shorter form that conveys the most important parts of the original document. Despite increased interest in the community and notable research effort, progress on benchmark datasets has stagnated. We critically evaluate key ingredients of the current research setup: datasets, evaluation metrics, and models, and highlight three primary shortcomings: 1) automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation, 2) current evaluation protocol is weakly correlated with human judgment and does not account for important characteristics such as factual correctness, 3) models overfit to layout biases of current datasets and offer limited diversity in their outputs.


Introduction
Text summarization aims at compressing long textual documents into a short, human readable form that contains the most important information from the source. Two strategies of generating summaries are extractive (Dorr et al., 2003;Nallapati et al., 2017), where salient fragments of the source document are identified and directly copied into the summary, and abstractive (Rush et al., 2015;See et al., 2017), where the salient parts are detected and paraphrased to form the final output.
The number of summarization models introduced every year has been increasing rapidly. Advancements in neural network architectures (Sutskever et al., 2014;Bahdanau et al., 2015;Vinyals et al., 2015;Vaswani et al., 2017) and the availability of large scale data (Sandhaus, 2008;Nallapati et al., 2016a;Grusky et al., 2018) enabled the transition from systems based on expert knowledge and heuristics to data-driven approaches powered by end-to-end deep neural models. Current approaches to text summarization utilize advanced attention and copying mechanisms (See et al., 2017;Tan et al., 2017;Cohan et al., 2018), multi-task and multi-reward training techniques (Guo et al., 2018;Kryściński et al., 2018), reinforcement learning strategies (Paulus et al., 2017;Narayan et al., 2018b;Dong et al., 2018;Wu and Hu, 2018), and hybrid extractive-abstractive models Hsu et al., 2018;Gehrmann et al., 2018;Chen and Bansal, 2018). Many of the introduced models are trained on the CNN/DailyMail (Nallapati et al., 2016a) news corpus, a popular benchmark for the field, and are evaluated based on n-gram overlap between the generated and target summaries with the ROUGE package (Lin, 2004).
Despite substantial research effort, the progress on these benchmarks has stagnated. State-of-theart models only slightly outperform the Lead-3 baseline, which generates summaries by extracting the first three sentences of the source document. We argue that this stagnation can be partially attributed to the current research setup, which involves uncurated, automatically collected datasets and non-informative evaluations protocols. We critically evaluate our hypothesis, and support our claims by analyzing three key components of the experimental setting: datasets, evaluation metrics, and model outputs. Our motivation is to shift the focus of the research community into developing a more robust research setup for text summarization.
2 Related Work

Datasets
To accommodate the requirements of modern data-driven approaches, several large-scale datasets have been proposed. The majority of available corpora come from the news domain. Gigaword (Graff and Cieri, 2003) is a set of articles and corresponding titles that was originally used for headline generation (Takase et al., 2016), but it has also been adapted to single-sentence summarization (Rush et al., 2015;Chopra et al., 2016). NYT (Sandhaus, 2008) is a collection of articles from the New York Times magazine with abstracts written by library scientists. It has been primarily used for extractive summarization (Hong and Nenkova, 2014; and phraseimportance prediction (Yang and Nenkova, 2014;Nye and Nenkova, 2015). The CNN/DailyMail (Nallapati et al., 2016a) dataset consists of articles with summaries composed of highlights from the article written by the authors themselves. It is commonly used for both abstractive (See et al., 2017;Paulus et al., 2017;Kryściński et al., 2018) and extractive (Dong et al., 2018;Wu and Hu, 2018;Zhou et al., 2018) neural summarization. The collection was originally introduced as a Cloze-style QA dataset by Hermann et al. (2015). XSum (Narayan et al., 2018a) is a collection of articles associated with one, singlesentence summary targeted at abstractive models. Newsroom (Grusky et al., 2018) is a diverse collection of articles sourced from 38 major online news outlets. This dataset was released together with a leaderboard and held-out testing split.
Outside of the news domain, several datasets were collected from open discussion boards and other portals offering structure information. Reddit TIFU  is a collection of posts scraped from Reddit where users post their daily stories and each post is required to contain a Too Long; Didn't Read (TL;DR) summary. Wiki-How (Koupaee and Wang, 2018) is a collection of articles from the WikiHow knowledge base, where each article contains instructions for performing procedural, multi-step tasks covering various areas, including: arts, finance, travel, and health.

Evaluation Metrics
Manual and semi-automatic (Nenkova and Passonneau, 2004;Passonneau et al., 2013) evaluation of large-scale summarization models is costly and cumbersome. Much effort has been made to develop automatic metrics that would allow for fast and cheap evaluation of models.
The ROUGE package (Lin, 2004) offers a set of automatic metrics based on the lexical over-lap between candidate and reference summaries. Overlap can be computed between consecutive (n-grams) and non-consecutive (skip-grams) subsequences of tokens. ROUGE scores are based on exact token matches, meaning that computing overlap between synonymous phrases is not supported.
Many approaches have extended ROUGE with support for synonyms and paraphrasing. ParaEval (Zhou et al., 2006) uses a three-step comparison strategy, where the first two steps perform optimal and greedy paraphrase matching based on paraphrase tables before reverting to exact token overlap. ROUGE-WE (Ng and Abrecht, 2015) replaces exact lexical matches with a soft semantic similarity measure approximated with the cosine distances between distributed representations of tokens. ROUGE 2.0 (Ganesan, 2018) leverages synonym dictionaries, such as WordNet, and considers all synonyms of matched words when computing token overlap. ROUGE-G (ShafieiBavani et al., 2018) combines lexical and semantic matching by applying graph analysis algorithms to the WordNet semantic network. Despite being a step in the direction of a more comprehensive evaluation protocol, none of these metrics gained sufficient traction in the research community, leaving ROUGE as the default automatic evaluation toolkit for text summarization.

Models
Existing summarization models fall into three categories: abstractive, extractive, and hybrid.
Extractive models select spans of text from the input and copy them directly into the summary. Non-neural approaches (Neto et al., 2002;Dorr et al., 2003;Filippova and Altun, 2013;Colmenares et al., 2015) utilized domain expertise to develop heuristics for summary content selection, whereas more recent, neural techniques allow for end-to-end training. In the most common case, models are trained as word-or sentencelevel classifiers that predict whether a fragment should be included in the summary (Nallapati et al., 2016b(Nallapati et al., , 2017Narayan et al., 2017;Liu et al., 2019;Xu and Durrett, 2019). Other approaches apply reinforcement learning training strategies to directly optimize the model on task-specific, nondifferentiable reward functions (Narayan et al., 2018b;Dong et al., 2018;Wu and Hu, 2018) .
Abstractive models paraphrase the source doc-uments and create summaries with novel phrases not present in the source document. A common approach in abstractive summarization is to use attention and copying mechanisms (See et al., 2017;Tan et al., 2017;Cohan et al., 2018). Other approaches include using multi-task and multireward training (Paulus et al., 2017;Jiang and Bansal, 2018;Guo et al., 2018;Kryściński et al., 2018), and unsupervised training strategies (Chu and Liu, 2018;Schumann, 2018).
Hybrid models (Hsu et al., 2018;Gehrmann et al., 2018;Chen and Bansal, 2018) include both extractive and abstractive modules and allow to separate the summarization process into two phases -content selection and paraphrasing.
For the sake of brevity we do not describe details of different models, we refer interested readers to the original papers.

Analysis and Critique
Most summarization research revolves around new architectures and training strategies that improve the state of the art on benchmark problems. However, it is also important to analyze and question the current methods and research settings. Zhang et al. (2018) conducted a quantitative study of the level of abstraction in abstractive summarization models and showed that wordlevel, copy-only extractive models achieve comparable results to fully abstractive models in the measured dimension. Kedzie et al. (2018) offered a thorough analysis of how neural models perform content selection across different data domains, and exposed data biases that dominate the learning signal in the news domain and architectural limitations of current approaches in learning robust sentence-level representations. Liu and Liu (2010) examine the correlation between ROUGE scores and human judgments when evaluating meeting summarization data and show that the correlation strength is low, but can be improved by leveraging unique meeting characteristics, such as available speaker information. Owczarzak et al. (2012) inspect how inconsistencies in human annotator judgments affect the ranking of summaries and correlations with automatic evaluation metrics. The results showed that systemlevel rankings, considering all summaries, were stable despite inconsistencies in judgments, how-ever, summary-level rankings and automatic metric correlations benefit from improving annotator consistency. Graham (2015) compare the fitness of the BLEU metric (Papineni et al., 2002) and a number of different ROUGE variants for evaluating summarization outputs. The study reveals superior variants of ROUGE that are different from the commonly used recommendations and shows that the BLEU metric achieves strong correlations with human assessments of generated summaries. Schulman et al. (2015) study the problems related to using ROUGE as an evaluation metric with respect to finding optimal solutions and provide proof of NP-hardness of global optimization with respect to ROUGE.

Underconstrained task
The task of summarization is to compress long documents by identifying and extracting the most important information from the source documents. However, assessing the importance of information is a difficult task in itself, that highly depends on the expectations and prior knowledge of the target reader.
We show that the current setting in which models are simply given a document with one associated reference summary and no additional information, leaves the task of summarization underconstrained and thus too ambiguous to be solved by end-to-end models.
To quantify this effect, we conducted a human study which measured the agreement between different annotators in selecting important sentences

Article
The glowing blue letters that once lit the Bronx from above Yankee stadium failed to find a buyer at an auction at Sotheby's on Wednesday. While the 13 letters were expected to bring in anywhere from $300,000 to $600,000, the only person who raised a paddle -for $260,000 -was a Sotheby's employee trying to jump start the bidding. The current owner of the signage is Yankee hall-of-famer Reggie Jackson, who purchased the 10-feet-tall letters for an undisclosed amount after the stadium saw its final game in 2008. No love: 13 letters that hung over Yankee stadium were estimated to bring in anywhere from $300,000 to $600,000, but received no bids at a Sotheby's auction Wednesday. The 68-year-old Yankee said he wanted 'a new generation to own and enjoy this icon of the Yankees and of New York City.', The letters had beamed from atop Yankee stadium near grand concourse in the Bronx since 1976, the year before Jackson joined the team. (...)

Summary Questions
When was the auction at Sotheby's? Who is the owner of the signage? When had the letters been installed on the stadium?

Constrained Summary A Unconstrained Summary A
Glowing letters that had been hanging above the Yankee stadium from 1976 to 2008 were placed for auction at Sotheby's on Wednesday, but were not sold, The current owner of the sign is Reggie Jackson, a Yankee hall-of-famer.
There was not a single buyer at the auction at Sotheby's on Wednesday for the glowing blue letters that once lit the Bronx's Yankee Stadium. Not a single non-employee raised their paddle to bid. Jackson, the owner of the letters, was surprised by the lack of results. The venue is also auctioning off other items like Mets memorabilia.

Constrained Summary B Unconstrained Summary B
An auction for the lights from Yankee Stadium failed to produce any bids on Wednesday at Sotheby's. The lights, currently owned by former Yankees player Reggie Jackson, lit the stadium from 1976 until 2008.
The once iconic and attractive pack of 13 letters that was placed at the Yankee stadium in 1976 and later removed in 2008 was unexpectedly not favorably considered at the Sotheby's auction when the 68 year old owner of the letters attempted to transfer its ownership to a member the younger populace. Thus, when the minimum estimate of $300,000 was not met, a further attempt was made by a former player of the Yankees to personally visit the new owner as an Table 1: Example summaries collected from human annotators in the constrained (left) and unconstrained (right) task. In the unconstrained setting, annotators were given a news article and asked to write a summary covering the parts they considered most important. In the constrained setting, annotators were given a news article with three associated questions and asked to write a summary that contained the answers to the given questions.
from a fragment of text. We asked workers to write summaries of news articles and highlight sentences from the source documents that they based their summaries on. The experiment was conducted in two settings: unconstrained, where the annotators were instructed to summarize the content that they considered most important, and constrained, where annotators were instructed to write summaries that would contain answers to three questions associated with each article. This is similar to the construction of the TAC 2008 Opinion Summarization Task 1 . The questions associated with each article where collected from human workers through a separate assignment. Experiments were conducted on 100 randomly sampled articles, further details of the human study can be found in Appendix A.1. Table 2 shows the average number of sentences, per-article, that annotators agreed were important. The rows show how the average changes with the human vote threshold needed to reach consensus about the importance of any sentence. For example, if we require that three or more human votes are necessary to consider a sentence important, annotators agreed on average on the importance of 0.627 and 1.392 sentences per article in the unconstrained and constrained settings respectively. The average length (in sentences) of sampled articles was 16.59, with a standard deviation of 5.39. The study demonstrates the difficulty and ambiguity of content selection in text summarization.
We also conducted a qualitative study of summaries written by annotators. Examples comparing summaries written in the constrained and unconstrained setting are shown in Table 1. We noticed that in both cases the annotators correctly identified the main topic and important fragments of the source article. However, constrained summaries were more succinct and targeted, without sacrificing the natural flow of sentences. Un-  constrained writers tended to write more verbose summaries that did not add information. The study also highlights the abstractive nature of human written summaries in that similar content can be described in unique ways. News articles adhere to a writing structure known in journalism as the "Inverted Pyramid" (PurdueOWL, 2019). In this form, initial paragraphs contain the most newsworthy information, which is followed by details and background information.

Layout bias in news data
To quantify how strongly articles in the CNN/DM corpus follow this pattern we conducted a human study that measured the importance of different sections of the article. Annotators read news articles and selected sentences they found most important. Experiments were conducted on 100 randomly sampled articles, further details of the human study are described in Appendix A.3. Figure 1 presents how annotator selections were distributed over the length of the article. The distribution is skewed towards the first quarter of the length of articles. The cumulative plot shows that nearly 60% of the important information was present in the first third of the article, and approximately 25% and 15% of selections pointing to the second and last third, respectively.
It has become standard practice to exploit such biases during training to increase performance of models (See et al., 2017;Paulus et al., 2017;Kryściński et al., 2018;Gehrmann et al., 2018;Jiang and Bansal, 2018;, but the importance of these heuristics has been accepted without being quantified. These same heuristics would not apply to books or legal documents, which lack the Inverted Pyramid layout so common in the news domain, so it is important that these heuristics be part of ablation studies rather than accepted as default pre-processing step.

Noise in scraped datasets
Given the data requirements of deep neural networks and the vast amounts of diverse resources available online, automatically scraping web content is a convenient way of collecting data for new corpora. However, adapting scraped content to the needs of end-to-end models is problematic. Given that manual inspection of data is infeasible and human annotators are expensive, data curation is usually limited to removing any markup structure and applying simple heuristics to discard obviously flawed examples. This, in turn, makes the quality of the datasets heavily dependent on how well the scraped content adheres to the assumptions made by the authors about its underlying structure.
This issue suggests that available summarization datasets would be filled with noisy examples. Manual inspection of the data, particularly the reference summaries, revealed easily detectable, consistent patterns of flawed examples Many such examples can be isolated using simple regular expressions and heuristics, which allows approximation of how widespread these flaws are in the dataset.
We investigated this issue in two large summarization corpora scraped from the internet:

CNN/DM -Links to other articles
Michael Carrick has helped Manchester United win their last six games. Carrick should be selected alongside Gary Cahill for England. Carrick has been overlooked too many times by his country. READ : Carrick and Man United team-mates enjoy second Christmas party.

Newsroom -Links to news sources
Get Washington DC, Virginia, Maryland and national news. Get the latest/breaking news, featuring national security, science and courts.
Read news headlines from the nation and from The Washington Post. Visit www.washingtonpost.com/nation today.

Summary -Factually incorrect
Brady Olson, a Washington High School teacher at North Thurston High, opened fire on Monday morning. No one was injured after the boy shot twice toward the ceiling in the school commons before classes began at North Thurston High School in Lacey (...) Table 4: Example of a factually incorrect summary generated by an abstractive model. Top: ground-truth article. Bottom: summary generated by model. CNN/DM (Nallapati et al., 2016a) and the Newsroom (Grusky et al., 2018). The problem of noisy data affects 0.47%, 5.92%, and 4.19% of the training, validation, and test split of the CNN/DM dataset, and 3.21%, 3.22%, and 3.17% of the respective splits of the Newsroom dataset. Examples of noisy summaries are shown in Table 3. Flawed examples contained links to other articles and news sources, placeholder texts, unparsed HTML code, and non-informative passages in the reference summaries.

Weak correlation with human judgment
The effectiveness of ROUGE was previously evaluated (Lin, 2004;Graham, 2015) through statistical correlations with human judgment on the DUC datasets (Over and Yen, 2001. However, their setting was substantially different from the current environment in which summarization models are developed and evaluated. To investigate the robustness of ROUGE in the setting in which it is currently used, we evaluate how its scores correlate with the judgment of an average English-speaker using examples from the CNN/DM dataset. Following the human evaluation protocol from Gehrmann et al. (2018), we asked annotators to rate summaries across four dimensions: relevance (selection of important content from the source), consistency (factual alignment between the summary and the source), fluency (quality of individual sentences), and coherence (collective quality of all sentences). Each summary was rated by 5 distinct judges with the final score obtained by averaging the individual scores. Experiments were conducted on 100 randomly sampled articles with the outputs of 13 summarization systems provided by the original authors. Correlations were computed between all pairs of Human-, ROUGE-scores, for all systems. Additional summaries were collected from annotators to inspect the effect of using multiple ground-truth labels on the correlation with automatic metrics. Further details of the human study can be found in Appendix A.2.
Results are shown in Table 5. The left section of the table presents Pearson's correlation coefficients and the right section presents Kendall rank correlation coefficients. In terms of Pearsons's coefficients, the study showed minimal correlation with any of the annotated dimensions for both abstractive and extractive models together and for abstractive models individually. Weak correlation was discovered for extractive models primarily with the fluency and coherence dimensions.
We hypothesized that the noise contained in the fine-grained scores generated by both human annotators and ROUGE might have affected the correlation scores. We evaluated the relation on a higher level of granularity by means of correlation between rankings of models that were obtained from the fine-grained scores. The study showed weak correlation with all measured dimensions, when evaluated for both abstractive and extractive models together and for abstractive models individually. Moderate correlation was found for extractive models across all dimensions. A surprising result was that correlations grew weaker with the increase of ground truth references.
Our results align with the observations from Liu and Liu (2010) who also evaluated ROUGE outside of its original setting. The study highlights the limited utility in measuring progress of the field Pearson correlation Kendall rank correlation 1 Reference 5 References 10 References 1 Reference 5 References 10 References  solely by means of ROUGE scores.

Insufficient evaluation protocol
The goal of text summarization is to automatically generate succinct, fluent, relevant, and factually consistent summaries. The current evaluation protocol depends primarily on the exact lexical overlap between reference and candidate summaries measured by ROUGE. In certain cases, ROUGE scores are complemented with human studies where annotators rate the relevance and fluency of generated summaries. Neither of the methods explicitly examines the factual consistency of summaries, leaving this important dimension unchecked.
To evaluate the factual consistency of existing models, we manually inspected randomly sampled articles with summaries coming from randomly chosen, abstractive models. We focused exclusively on factual incorrectness and ignored any other issues, such as low fluency. Out of 200 article-summary pairs that were reviewed manually, we found that 60 (30%) contained consistency issues. Table 4 shows examples of discovered inconsistencies. Some of the discovered inconsistencies, despite being factually incorrect, could be rationalized by humans. However, in many cases, the errors were substantial and could have severe repercussions if presented as-is to target readers.

Layout bias in news data
We revisit the problem of layout bias in news data from the perspective of models. Kedzie et al. (2018) showed that in the case of news articles, the layout bias dominates the learning signal for neural models. In this section, we approximate the degree with which generated summaries rely on the leading sentences of news articles.
We computed ROUGE scores for collected models in two settings: first using the CNN/DM reference summaries as the ground-truth, and second where the leading three sentences of the source article were used as the ground-truth, i.e. the Lead-3 baseline. We present the results in Table 6.
For all examined models we noticed a substantial increase of overlap across all ROUGE variants. Results suggest that performance of current models is strongly affected by the layout bias of news corpora. Lead-3 is a strong baseline that exploits the described layout bias. However, there is still a large gap between its performance and an upper bound for extractive models (extractive oracle).

Diversity of model outputs
Models analyzed in this paper are considerably different from each other in terms of architectures, training strategies, and underlying approaches. We inspected how the diversity in approaches translates into the diversity of model outputs.
We computed ROUGE-1 and ROUGE-4 scores between pairs of model outputs to compare them by means of token and phrase overlap. Results are visualized in Figure 2, where the values above and below the diagonal are ROUGE-1 and -4 scores accordingly, and model names (M-) follow the order from Table 6.  We notice that the ROUGE-1 scores vary considerably less than ROUGE-4 scores. This suggests that the models share a large part of the vocabulary on the token level, but differ on how they organize the tokens into longer phrases.
Comparing results with the n-gram overlap between models and reference summaries (Table 6) shows a substantially higher overlap between any model pair than between the models and reference summaries. This might imply that the training data contains easy to pick up patterns that all models overfit to, or that the information in the training signal is too weak to connect the content of the source articles with the reference summaries.

Conclusions
This critique has highlighted the weak points of the current research setup in text summarization. We showed that text summarization datasets require additional constraints to have well-formed summaries, current state-of-the-art methods learn to rely too heavily on layout bias associated with the particular domain of the text being summarized, and the current evaluation protocol reflects human judgments only weakly while also failing to evaluate critical features (e.g. factual correctness) of text summarization.
We hope that this critique provides the summarization community with practical insights for future research directions that include the construction of datasets, models less fit to a particular do-  Table 6. main bias, and evaluation that goes beyond current metrics to capture the most important features of summarization.