Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization

Despite the recent developments in neural summarization systems, the underlying logic behind the improvements from the systems and its corpus-dependency remain largely unexplored. The position of sentences in the original text, for example, is a well-known bias for news summarization. Following in the spirit of the claim that summarization is a combination of sub-functions, we define three sub-aspects of summarization (position, importance, and diversity) and conduct an extensive analysis of the biases of each sub-aspect with respect to the domain of nine different summarization corpora (e.g., news, academic papers, meeting minutes, movie scripts, books, posts). We find that while position exhibits substantial bias in news articles, this is not the case, for example, with academic papers and meeting minutes. Furthermore, our empirical study shows that different types of summarization systems (e.g., neural-based) are composed of different degrees of the sub-aspects. Our study provides useful lessons regarding consideration of underlying sub-aspects when collecting a new summarization dataset or developing a new system.


Introduction
Despite numerous recent developments in neural summarization systems (Narayan et al., 2018b; Nallapati et al., 2016; See et al., 2017; Kedzie et al., 2018; Gehrmann et al., 2018; Paulus et al., 2017), the underlying rationales behind the improvements and their dependence on the training corpus remain largely unexplored. Edmundson (1969) put forth the position hypothesis: important sentences appear in preferred positions in the document. Lin and Hovy (1997) provide a method to empirically identify such positions. Later, Hong and Nenkova (2014) showed an intentional lead bias in news writing, suggesting that sentences appearing early in news articles are more important for summarization tasks. More generally, it is well known that recent state-of-the-art models (Nallapati et al., 2016; See et al., 2017) are often marginally better than the first-k baseline on single-document news summarization.
* Equal contribution, name order decided by coin flip.
In order to address the position bias of news articles, Narayan et al. (2018a) collected a new dataset called XSum to create single sentence summaries that include material from multiple positions in the source document. Kedzie et al. (2018) showed that the position bias in news articles is not the same across other domains such as meeting minutes (Carletta et al., 2005).
In addition to position, Lin and Bilmes (2012) defined other sub-aspect functions of summarization including coverage, diversity, and information. Lin and Bilmes (2011) claim that many existing summarization systems are instances of mixtures of such sub-aspect functions; for example, maximum marginal relevance (MMR) (Carbonell and Goldstein, 1998) can be seen as a combination of diversity and importance functions.
Following the sub-aspect theory, we explore three important aspects of summarization (§3): position for choosing sentences by their position, importance for choosing relevant contents, and diversity for ensuring minimal redundancy between summary sentences.
We then conduct an in-depth analysis of these aspects over nine different domains of summarization corpora (§5) including news articles, meeting minutes, books, movie scripts, academic papers, and personal posts. For each corpus, we investigate which aspects are most important and develop a notion of corpus bias (§6). We also provide empirical results showing which sub-aspect factors current summarization systems are composed of, which we call system bias (§7). Finally, we summarize our actionable messages for future summarization research (§8).
Figure 1: The portion of each aspect used for each corpus and each system. The portion is measured by calculating ROUGE score between (a) summaries obtained from each aspect and target summaries or (b) summaries obtained from each aspect and each system.
We summarize some notable findings as follows:
• Summarization of personal posts and news articles, except for XSum (Narayan et al., 2018a), is biased to the position aspect, while academic papers are well balanced among the three aspects (see Figure 1 (a)). Summarizing long documents (e.g., books and movie scripts) and conversations (e.g., meeting minutes) are extremely difficult tasks that require multiple aspects together.
• Biases do exist in current summarization systems (Figure 1 (b)). Simple ensembling of multiple aspects of systems shows comparable performance with simple single-aspect systems.
• Reference summaries in current corpora include less than 15% of new words that do not appear in the source document, except for abstract text of academic papers.
• Semantic volume (Yogatama et al., 2015) overlap between the reference and model summaries is not correlated with hard evaluation metrics such as ROUGE (Lin, 2004).

Related Work
We provide here a brief review of prior work on summarization biases. Lin and Hovy (1997) studied the position hypothesis, which holds especially in news article writing (Hong and Nenkova, 2014; Narayan et al., 2018a) but not in other domains such as conversations (Kedzie et al., 2018). Narayan et al. (2018a) collected a new corpus to address the bias by compressing multiple contents of the source document into a single target summary. In the bias analysis of systems, Lin and Bilmes (2012, 2011) studied the sub-aspect hypothesis of summarization systems. Our study extends the hypothesis to various corpora as well as systems. With a specific focus on the importance aspect, a recent work (Peyrard, 2019a) divided it into three subcategories (redundancy, relevance, and informativeness) and provided quantities to measure each. Compared to this, our work provides a broader-scale sub-aspect analysis across various corpora and systems. We analyze the sub-aspects on different domains of summarization corpora: news articles (Nallapati et al., 2016; Grusky et al., 2018; Narayan et al., 2018a), academic papers or journals (Kang et al., 2018; Kedzie et al., 2018), movie scripts (Gorinski and Lapata, 2015), books (Mihalcea and Ceylan, 2007), personal posts (Ouyang et al., 2017), and meeting minutes (Carletta et al., 2005), as described further in §5.
With large-scale corpora available for training, neural network based systems have recently been developed. In abstractive systems, Rush et al. (2015) proposed a local attention-based sequence-to-sequence model. On top of the seq2seq framework, many other variants have been studied using convolutional networks (Cheng and Lapata, 2016; Allamanis et al., 2016), pointer networks (See et al., 2017), scheduled sampling (Bengio et al., 2015), and reinforcement learning (Paulus et al., 2017). In extractive systems, different types of encoders (Cheng and Lapata, 2016; Nallapati et al., 2017; Kedzie et al., 2018) and optimization techniques (Narayan et al., 2018b) have been developed. Our goal is to explore which types of systems learn which sub-aspects of summarization.

Sub-aspects of Summarization
We focus on three crucial aspects: Position, Diversity, and Importance. For each aspect, we use different extractive algorithms to capture how much of the aspect is used in the oracle extractive summaries (see §4 for our oracle set construction). For each algorithm, the goal is to select k extractive summary sentences (equal to the number of sentences in the target summaries for each sample) out of N sentences appearing in the original source. The chosen sentences or their indices will be used to calculate the various evaluation metrics described in §4.
For some algorithms below, we use vector representations of sentences. We parse a document x into a sequence of sentences $x = x_1..x_N$ where each sentence consists of a sequence of words $x_i = w_1..w_s$. Each sentence is then encoded:
$$E(x_i) = \mathrm{BERT}(w_1, \ldots, w_s) \qquad (1)$$
where BERT (Devlin et al., 2018) is a pre-trained bidirectional encoder from transformers (Vaswani et al., 2017). We use the last layer from BERT as a representation of each token, and then average them to get the final representation of a sentence. All tokens are lower cased.

Position
Position of sentences in the source has been suggested as a good indicator for choosing summary sentences, especially in news articles (Lin and Hovy, 1997;Hong and Nenkova, 2014;See et al., 2017). We compare three position-based algorithms: First, Last, and Middle, by simply choosing k number of sentences in the source document from these positions.
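The three position-based selectors can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `select_by_position` is hypothetical, and since the text does not specify how Middle is located, the sketch assumes a window of k sentences centered in the document.

```python
def select_by_position(n_sentences, k, strategy="first"):
    """Return the indices of k sentences chosen purely by their position.

    Assumption: "middle" means a contiguous window centered in the document;
    the paper does not spell out the exact rule.
    """
    k = min(k, n_sentences)
    if strategy == "first":
        start = 0
    elif strategy == "last":
        start = n_sentences - k
    elif strategy == "middle":
        start = (n_sentences - k) // 2
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return list(range(start, start + k))
```

For a 10-sentence document and k = 3, the three strategies pick the first three, the last three, and the three central sentence indices respectively.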

Diversity
Yogatama et al. (2015) assume that extractive summary sentences which maximize the semantic volume in a distributed semantic space are the most diverse and least redundant sentences. Motivated by this notion, our goal is to find a set of k sentences that maximizes their volume in a continuous embedding space such as the BERT representations in Eq 1. Our objective is to find the optimal search function S that maximizes the volume size V of the searched sentences:
$$\arg\max_{1..k} V\big(S_{1..k}(E(x_1), \ldots, E(x_N))\big)$$
If k = N, we use every sentence from the source document (Figure 2 (a)). However, the resulting volume space is not guaranteed to maximize the volume size, because the polygon formed by the points may be non-convex. In order to find a convex maximum volume, we consider two different algorithms described below.
Heuristic. Yogatama et al. (2015) heuristically choose a set of summary sentences using a greedy algorithm: it first chooses the sentence whose vector representation is farthest from the centroid of all source sentences, and then repeatedly finds the sentence whose representation is farthest from the centroid of the vector representations of the chosen sentences. Unlike the original algorithm of Yogatama et al. (2015), which restricts the number of words, we constrain the total number of selected sentences to k. This heuristic algorithm can fail to find the maximum volume depending on its starting point and/or the farthest distance detected between two points (Figure 2 (b)).
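The greedy procedure as described above can be sketched as follows. This is an illustrative sketch under stated assumptions: sentence embeddings are given as plain lists of floats (in the paper they would come from the encoder in Eq 1), distance is Euclidean, and the names `centroid` and `greedy_farthest` are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def greedy_farthest(embeddings, k):
    """Greedily pick k sentence indices by distance from a running centroid."""
    k = min(k, len(embeddings))
    if k == 0:
        return []
    # Step 1: the sentence farthest from the centroid of all source sentences.
    center = centroid(embeddings)
    chosen = [max(range(len(embeddings)),
                  key=lambda i: math.dist(embeddings[i], center))]
    # Step 2: repeatedly add the sentence farthest from the centroid
    # of the sentences chosen so far.
    while len(chosen) < k:
        center = centroid([embeddings[i] for i in chosen])
        remaining = [i for i in range(len(embeddings)) if i not in chosen]
        chosen.append(max(remaining,
                          key=lambda i: math.dist(embeddings[i], center)))
    return chosen
```

Because each step only looks at the current centroid, the result depends on the first pick, which is the failure mode noted in the text.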
ConvexFall. Here we first find the convex hull using Quickhull (Barber et al., 1996), implemented by the Qhull library. It guarantees the maximum volume size of the selected points with the minimum number of points (Figure 2 (c)). However, it does not reduce redundancy between the points on the convex hull, and usually chooses a larger number of sentences than k. Marcu (1999) presents an interesting study regarding the importance of sentences: given a document, if one repeatedly deletes the least central sentence from the source text, then at some point the similarity with the reference text suddenly drops, which is called the waterfall phenomenon. Motivated by this study, we similarly prune redundant sentences from the set chosen by convex-hull search. At each turn, the sentence with the lowest volume reduction ratio is pruned, until the number of remaining sentences equals k.
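The hull-then-prune idea can be illustrated in two dimensions. The sketch below is not the Qhull-based implementation: it assumes the sentence vectors have already been projected to 2D points, swaps in Andrew's monotone chain for Quickhull, and uses polygon area as the "volume". The names `convex_hull`, `hull_area`, and `convex_fall` are hypothetical.

```python
def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(indices, points):
    """Andrew's monotone chain: hull vertex indices in counter-clockwise order."""
    order = sorted(set(indices), key=lambda i: (points[i][0], points[i][1]))
    if len(order) < 3:
        return order

    def half(seq):
        chain = []
        for i in seq:
            while len(chain) >= 2 and _cross(points[chain[-2]],
                                             points[chain[-1]],
                                             points[i]) <= 0:
                chain.pop()
            chain.append(i)
        return chain[:-1]

    return half(order) + half(reversed(order))


def hull_area(indices, points):
    """Area (2D 'volume') of the convex hull of the given point indices."""
    hull = convex_hull(indices, points)
    total = 0.0
    for a, b in zip(hull, hull[1:] + hull[:1]):
        total += points[a][0] * points[b][1] - points[b][0] * points[a][1]
    return abs(total) / 2.0


def convex_fall(points, k):
    """Start from the hull vertices, then prune down to k sentences."""
    selected = convex_hull(range(len(points)), points)
    while len(selected) > k:
        # Removing the point that leaves the largest area is the same as
        # removing the point with the lowest volume reduction ratio.
        victim = max(selected, key=lambda i: hull_area(
            [j for j in selected if j != i], points))
        selected.remove(victim)
    return sorted(selected)
```

In this toy setting, a point that barely bulges out of the hull is pruned first because dropping it costs almost no area.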

Importance
We assume that contents that repeatedly occur in one document contain important information. We find sentences that are nearest to their neighbour sentences using two distance measures. N-Nearest calculates, for each source sentence, the averaged Pearson correlation between its vector representation and those of all the other source sentences. The k sentences having the highest averaged correlation are selected as the final extractive summary. K-Nearest, on the other hand, chooses the K nearest sentences for each sentence, and then averages the distances between each nearest sentence and the selected one. The sentence that has the lowest averaged distance is chosen. This calculation is repeated k times, and the selected sentences are removed from the remaining pool.
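As a rough illustration of N-Nearest as described above, the sketch below scores each sentence by its mean Pearson correlation with every other sentence and keeps the top k. It assumes embeddings are plain lists of floats; the helper names `pearson` and `n_nearest` are hypothetical, and returning 0 for a constant vector is a choice made here, not something the text specifies.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cov / (norm_u * norm_v)


def n_nearest(embeddings, k):
    """Indices of the k sentences with the highest mean correlation to all others."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        others = [pearson(embeddings[i], embeddings[j])
                  for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

A sentence whose embedding runs against the rest of the document receives a low average correlation and is left out.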

Metrics
In order to determine the aspects most crucial to the summarization task, we use three evaluation metrics:
ROUGE, the Recall-Oriented Understudy for Gisting Evaluation (Lin and Hovy, 2000), is a standard metric for evaluating summarization systems. We use ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) F-measure scores, which correspond to unigrams, bigrams, and longest common subsequences, respectively, and their averaged score (R).
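To make the averaged score R concrete, here is a simplified sketch of ROUGE-1, ROUGE-2, and ROUGE-L F-measures over pre-tokenized text. It is an approximation for illustration only: it skips the stemming, stopword, and multi-sentence handling of the official ROUGE toolkit, and all function names are hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f_measure(match, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if match == 0:
        return 0.0
    precision = match / candidate_total
    recall = match / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    cand, ref = _ngrams(candidate, n), _ngrams(reference, n)
    match = sum((cand & ref).values())  # clipped n-gram matches
    return _f_measure(match, sum(cand.values()), sum(ref.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    return _f_measure(_lcs_length(candidate, reference),
                      len(candidate), len(reference))


def averaged_rouge(candidate, reference):
    """R: the mean of the R1, R2, and RL F-measures."""
    return (rouge_n(candidate, reference, 1)
            + rouge_n(candidate, reference, 2)
            + rouge_l(candidate, reference)) / 3
```

For published numbers, an established ROUGE implementation should be used instead of this sketch.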
Volume Overlap (VO) ratio. Hard metrics like ROUGE often ignore semantic similarities between sentences. Based on the volume assumption of Yogatama et al. (2015), we measure the overlap ratio of two semantic volumes calculated from the model and target summaries. For the i-th document, we obtain a set of vector representations $E(\hat{Y}_i)$ of the reference summary sentences $\hat{Y}_i$ and a set $E(Y_i^{algo})$ of the model summary sentences $Y_i^{algo}$ predicted by any algorithm algo in §3. Each volume V is then calculated using the convex-hull algorithm, and their overlap ($\cap$) is calculated using the shapely package. The final VO is then:
$$\mathrm{VO} = \frac{1}{N} \sum_{i=1}^{N} \frac{V(E(\hat{Y}_i)) \cap V(E(Y_i^{algo}))}{V(E(\hat{Y}_i))}$$
where N is the total number of input documents, E is the BERT sentence encoder in Eq 1, and $E(\hat{Y}_i)$ and $E(Y_i^{algo})$ are the sets of vector representations of the reference and model summary sentences, respectively. The volume overlap indicates how much two summaries overlap semantically in a continuous embedding space.
Sentence Overlap (SO) ratio. Even though ROUGE provides a recall-oriented lexical overlap, it does not tell us the upper bound on performance (called the oracle) of extractive summarization. We extract the oracle extractive sentences (i.e., a set of input sentences) which maximize the ROUGE-L F-measure score with the reference summary. We then measure sentence overlap (SO), which determines how many extractive sentences from our algorithms are in the oracle summary. The SO is:
$$\mathrm{SO} = \frac{1}{N} \sum_{i=1}^{N} \frac{C(Y_i^{algo} \cap O_i)}{C(O_i)}$$
where $Y_i^{algo}$ is the set of sentences selected by the algorithm for the i-th document, $O_i$ is the corresponding set of oracle sentences, and C is a function for counting the number of elements in a set. The sentence overlap indicates how well the algorithm finds the oracle summaries for extractive summarization.
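The sentence overlap ratio can be computed directly from sentence indices. The sketch below assumes that each document contributes the fraction of oracle sentences recovered by the algorithm and that these fractions are averaged over documents; the normalization by the oracle set size and the name `sentence_overlap` are assumptions made for illustration.

```python
def sentence_overlap(selected_per_doc, oracle_per_doc):
    """Mean fraction of oracle sentence indices recovered, over all documents.

    Assumption: each document's overlap is normalized by its oracle set size;
    documents with an empty oracle set are skipped.
    """
    ratios = []
    for selected, oracle in zip(selected_per_doc, oracle_per_doc):
        oracle_set = set(oracle)
        if not oracle_set:
            continue
        ratios.append(len(set(selected) & oracle_set) / len(oracle_set))
    return sum(ratios) / len(ratios) if ratios else 0.0
```

With two documents where the algorithm recovers 2 of 3 and 1 of 2 oracle sentences, the score is the mean of 2/3 and 1/2.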

Summarization Corpora
We use various domains of summarization datasets to conduct the bias analysis across corpora and systems. Each dataset has source documents and corresponding abstractive target summaries. We provide a list of the datasets used along with a brief description and our pre-processing scheme:
• CNNDM (Nallapati et al., 2016): contains 300K online news articles. It has multiple sentences (4.0 on average) as a summary.
• Newsroom (Grusky et al., 2018): contains 1.3M news articles and summaries written by authors and editors from 1998 to 2017. It has both extractive and abstractive summaries.
• XSum (Narayan et al., 2018a): has news articles and their single, abstractive sentence summaries, mostly written by the original author.
• PeerRead (Kang et al., 2018): consists of scientific paper drafts in top-tier computer science venues as well as arxiv.org. We use the full text of the introduction section as the source document and the abstract section as the target summary.
• PubMed (Kedzie et al., 2018): consists of 25,000 medical journal papers from the PubMed Open Access Subset. Unlike PeerRead, the full paper except for the abstract is used as the source document.
• MScript (Gorinski and Lapata, 2015): is a collection of movie scripts from the ScriptBase corpus and their corresponding user summaries of the movies.
• BookSum (Mihalcea and Ceylan, 2007): is a dataset of classic books paired with summaries from Grade Saver and Cliffs Notes. Due to the large number of sentences, we only choose the first 1K sentences for the source document and the first 50 sentences for the target summary.
• Reddit (Ouyang et al., 2017): is a collection of personal posts from reddit.com. We use a single abstractive summary per post. The same data split as in Kedzie et al. (2018) is used.
• AMI (Carletta et al., 2005): consists of documented meeting minutes from a hundred hours of recordings and their abstractive summaries.
Table 1 summarizes the characteristics of each dataset.
We note that Gigaword (Graff et al., 2003), New York Times, and Document Understanding Conference (DUC) are also popular datasets commonly used in summarization analyses, though here we exclude them as they represent only additional collections of news articles, showing similar tendencies to the other news datasets such as CNNDM.

Analysis on Corpus Bias
We conduct different analyses of how each corpus is biased with respect to the sub-aspects. We highlight some key findings for each sub-section. Table 2 shows a comparison of the three aspects for each corpus where we include random selection and the oracle set. For each dataset metrics are calculated on a test set except for BookSum and AMI where we use train+test due to the smaller sample size.

Multi-aspect analysis
Earlier isn't always better. Sentences selected early in the source show high ROUGE and SO on CNNDM, Newsroom, Reddit, and BookSum, but not in other domains such as medical journals and meeting minutes, nor in the condensed news summaries (XSum). For summarization of movie scripts in particular, the last sentences seem to provide more important summaries.
XSum requires much more importance than other corpora. Interestingly, the most powerful algorithm for XSum is N-Nearest. This shows that summaries in XSum are indeed collected by abstracting multiple important contents into a single sentence, avoiding the position bias.
First, ConvexFall, and N-Nearest tend to work better than the other algorithms for each aspect. First is better than Last or Middle in news articles (except for XSum) and personal posts, but not in academic papers (i.e., PeerRead, PubMed) and meeting minutes. ConvexFall finds the set of sentences that maximizes the semantic volume overlap with the target sentences better than the heuristic one.
ROUGE and SO show similar behavior, while VO does not. In most evaluations, ROUGE scores are linearly related to SO ratios, as expected. However, VO has high variance across algorithms and aspects. This is mainly because the semantic volume assumption maximizes the semantic diversity, but sacrifices other aspects like importance by choosing the outlier sentences over the convex hull.

Social posts and news articles are biased to the position aspect, while the other two aspects appear less relevant (Figure 1 (a)). However, XSum requires all aspects equally, though each aspect is relatively less relevant than in the other news corpora.
Paper summarization is a well-balanced task. The variance of SO across the three aspects in PeerRead and PubMed is relatively smaller than in other corpora. This indicates that the abstract summary of the input paper requires the three aspects at the same time. PeerRead has relatively higher SO than PubMed because it only summarizes text in the Introduction section, while PubMed summarizes the whole paper text, which is much more difficult (almost random performance).
Conversation, movie script and book summarization are very challenging. Conversation in spoken meeting minutes repeatedly includes a lot of witty replies (e.g., 'okay.', 'mm-hmm.', 'yeah.'), causing importance and diversity measures to suffer. MScript and BookSum, which include very long input documents, seem to be extremely difficult tasks, showing almost random performance.

Intersection between the sub-aspects
Averaged ratios across the sub-aspects do not capture how the actual summaries overlap with each other. Figure 3 shows Venn diagrams of how the sets of summary sentences chosen by different sub-aspects overlap with each other on average.
XSum, BookSum, and AMI have high Oracle Recall. If we develop a mixture model of the three aspects, the Oracle Recall is its upper bound, meaning that another sub-aspect should be considered regardless of the mixture model. This indicates that existing procedures are not enough to cover the Oracle sentences. For example, AMI and BookSum have a lot of repeated noisy sentences, some of which could likely be removed without a significant loss of pertinent information.
Importance and Diversity are less overlapped with each other. This means that important sentences are not always diverse sentences, indicating that they should be considered together.
Figure 4 shows two-dimensional PCA projections of a document in CNNDM on the embedding space. Source sentences are clustered on the convex-hull border, not in the middle. We conjecture that sentences are not uniformly distributed in the embedding space but that their positions gradually move over the convex hull. Target summaries reflect different sub-aspects according to the sample and corpora. For example, many target sentences in CNNDM are near the First-k sentences.

Single-aspect analysis
We calculate the frequency of source sentences overlapped with the oracle summary where the source sentences are ranked differently according to the algorithm of each aspect (See Figure 5). Heavily skewed histograms indicate that oracle sentences are positively (right-skewed) or negatively (left-skewed) related to the sub-aspect.
In most cases, some oracle sentences overlap with the first part of the source sentences. Even though their degrees are different, oracle summaries from many corpora (i.e., CNNDM, Newsroom, PeerRead, BookSum, MScript) are highly related to the position. Compared to the other corpora, PubMed and AMI contain more top-ranked important sentences in their oracle summaries. News articles and papers tend to find oracle sentences without diversity (i.e., right-skewed), meaning that non-diverse sentences are frequently selected as part of the oracle.
We also measure how many new words occur in abstractive target summaries, by comparing the overlap between oracle summaries and document sentences (Table 3). One thing to note is that XSum and AMI have fewer new words in their target summaries. On the other hand, paper datasets (i.e., PeerRead and PubMed) include a lot, indicating that abstract text in academic papers is indeed "abstract".
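The proportion of target n-grams that never occur in the source can be sketched as below. This is an illustration under assumptions: inputs are pre-tokenized, n-grams are counted as distinct types rather than occurrences (the text does not say which), and the names `ngram_set` and `novel_ngram_ratio` are hypothetical.

```python
def ngram_set(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(target_tokens, source_tokens, n=1):
    """Fraction of distinct target n-grams that never occur in the source."""
    target = ngram_set(target_tokens, n)
    if not target:
        return 0.0
    return len(target - ngram_set(source_tokens, n)) / len(target)
```

For example, with the source "the cat sat on the mat" and the target "a cat sat quietly", two of the four target unigrams are absent from the source.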

Analysis on System Bias
We study how current summarization systems are biased with respect to three sub-aspects. In addition, we show that a simple ensemble of systems shows comparable performance to the single-aspect systems.
Table 3: O∩T shows N-gram overlap between oracle and target summaries. The higher the more overlapped words in between. T\S is a proportion of N-grams in target summaries not occurred in source document. The lower the more abstractive (i.e., new words) target summaries.
Proposed ensemble systems. Motivated by the sub-aspect theory (Lin and Bilmes, 2012, 2011), we combine different types of systems together from two different pools of extractive systems: asp, from the three best algorithms from each aspect, and ext, from all extractive systems. For each combination, we choose the summary sentences randomly among the union set of the predicted sentences (rand) or choose the most frequent unique sentences (topk).
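The two combination rules can be sketched over sentence indices. This is a minimal sketch, not the authors' code: the function name `ensemble`, the fixed random seed, and breaking frequency ties by the lower sentence index are all assumptions made for illustration.

```python
import random
from collections import Counter


def ensemble(predictions, k, mode="topk", seed=0):
    """Combine the sentence indices predicted by several systems.

    predictions: one list of selected sentence indices per system.
    mode "topk": keep the k indices chosen by the most systems
                 (ties broken by lower index, an assumption).
    mode "rand": sample k indices uniformly from the union of all predictions.
    """
    if mode == "topk":
        votes = Counter(i for pred in predictions for i in set(pred))
        ranked = sorted(votes, key=lambda i: (-votes[i], i))
        return sorted(ranked[:k])
    if mode == "rand":
        union = sorted(set().union(*predictions))
        rng = random.Random(seed)
        return sorted(rng.sample(union, min(k, len(union))))
    raise ValueError(f"unknown mode: {mode}")
```

Passing the outputs of the three best single-aspect algorithms corresponds to the asp pool, while passing the outputs of all extractive systems corresponds to the ext pool.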
Results. Table 4 shows a comparison of existing and proposed summarization systems on the set of corpora in §5 except for Newsroom. Neural extractive systems such as CL, SumRun and S2SExt outperform the others in general. LexRank is highly biased toward the position aspect. On the other hand, MMR is extremely biased to the importance aspect on XSum and Reddit. Interestingly, neural extractive systems are somewhat balanced compared to the others. Ensemble systems seem to have the three sub-aspects in balance, compared to the neural extractive systems. They also outperform the others (in either ROUGE or SO) on five out of eight datasets.
Table 4: Comparison of different systems using the averaged ROUGE scores (1/2/L) with target summaries (R) and averaged oracle overlap ratios (SO, only for extractive systems). We calculate R between systems and selected summary sentences from each sub-aspect (R(P/D/I)), where each aspect uses the best algorithm: First, ConvexFall and N-Nearest. R(P/D/I) is rounded to the nearest integer. "-" indicates that the corpus has too few samples to train the neural systems. "x" indicates that SO is not applicable because abstractive systems have no sentence indices. The best score for each corpus is shown in bold with different colors.

Conclusion and Future Directions
We define three sub-aspects of text summarization: position, diversity, and importance. We analyze how different domains of summarization datasets are biased to these aspects. We observe that news articles strongly reflect the position aspect, while the others do not. In addition, we investigate how current summarization systems reflect these three sub-aspects in balance. Each type of approach has its own bias, while neural systems rarely do. Simple ensembling of the systems shows more balanced and comparable performance than single ones. We summarize actionable messages for future summarization research:
• Different domains of datasets except for news articles pose new challenges to the appropriate design of summarization systems. For example, summarization of conversations (e.g., AMI) or dialogues (e.g., MScript) needs to filter out repeated, rhetorical utterances. Book summarization (e.g., BookSum) is very challenging due to its extremely large document size. Here current neural encoders suffer from computation limits.
• Summarization systems to be developed should clearly state their computational limits as well as effectiveness in each aspect and in each corpus domain. A good summarization system should reflect different kinds of the sub-aspects harmoniously, regardless of corpus bias. Developing such bias-free or robust models can be very important for future directions.
• Nobody has clearly defined the deeper nature of meaning abstraction yet. A more theoretical study of summarization, and the various aspects, is required. A recent notable example is Peyrard (2019a)'s attempt to theoretically define different quantities of the importance aspect, and demonstrate the potential of the framework on an existing summarization system. Similar studies can be applied to other aspects and their combinations in various systems and different domains of corpora.
• One can repeat our bias study on evaluation metrics. Peyrard (2019b) showed that widely used evaluation metrics (e.g., ROUGE, Jensen-Shannon divergence) are strongly mismatched in scoring summary results. One can compare different measures (e.g., n-gram recall, sentence overlaps, embedding similarities, word connectedness, centrality, importance reflected by discourse structures), and study the bias of each with respect to systems and corpora.