Scoring Sentence Singletons and Pairs for Abstractive Summarization

When writing a summary, humans tend to choose content from one or two sentences and merge them into a single summary sentence. However, the mechanisms behind the selection of one or multiple source sentences remain poorly understood. Sentence fusion assumes multi-sentence input; yet sentence selection methods only work with single sentences and not combinations of them. There is thus a crucial gap between sentence selection and fusion to support summarizing by both compressing single sentences and fusing pairs. This paper attempts to bridge the gap by ranking sentence singletons and pairs together in a unified space. Our proposed framework attempts to model human methodology by selecting either a single sentence or a pair of sentences, then compressing or fusing the sentence(s) to produce a summary sentence. We conduct extensive experiments on both single- and multi-document summarization datasets and report findings on sentence selection and abstraction.


Introduction
Abstractive summarization aims at presenting the main points of an article in a succinct and coherent manner. To achieve this goal, a proficient editor can rewrite a source sentence into a more succinct form by dropping inessential sentence elements such as prepositional phrases and adjectives. She can also choose to fuse multiple source sentences into one by reorganizing the points in a coherent manner. In fact, it appears to be common practice to summarize by either compressing single sentences or fusing multiple sentences. We investigate this hypothesis by analyzing human-written abstracts contained in three large datasets: DUC-04 (Over and Yen, 2004), CNN/Daily Mail (Hermann et al., 2015), and XSum (Narayan et al., 2018). For every summary sentence, we find its ground-truth set containing one or more source sentences that exhibit a high degree of similarity with the summary sentence (details in §4). As shown in Figure 1, across the three datasets, 60-85% of summary sentences are generated from one or two source sentences.
Figure 1: Portions of summary sentences generated by compression (content is drawn from one source sentence) and fusion (content is drawn from two or more source sentences). Humans often grab content from 1 or 2 document sentences when writing a summary sentence.
Selecting summary-worthy sentences has been studied in the literature, but a mechanism to weigh sentence singletons and pairs in a unified space is lacking. Extractive methods focus on selecting sentence singletons using greedy (Carbonell and Goldstein, 1998), optimization-based (Gillick and Favre, 2009; Kulesza and Taskar, 2011; Cho et al., 2019), and (non-)autoregressive methods (Cheng and Lapata, 2016; Kedzie et al., 2018). In contrast, existing sentence fusion studies tend to assume ground sets of source sentences are already provided, and the system fuses each set of sentences into a single one (Daumé III and Marcu, 2004; Filippova, 2010; Thadani and McKeown, 2013). There is thus a crucial gap between sentence selection and fusion to support summarizing by both compressing single sentences and fusing pairs. This paper attempts to bridge the gap by ranking singletons and pairs together by their likelihoods of producing summary sentences.
The selection of sentence singletons and pairs can bring benefit to neural abstractive summarization, as a number of studies seek to separate content selection from summary generation (Chen and Bansal, 2018; Hsu et al., 2018; Gehrmann et al., 2018). Content selection draws on domain knowledge to identify relevant content, while summary generation weaves together selected source and vocabulary words to form a coherent summary. Despite having local coherence, system summaries can sometimes contain erroneous details (See et al., 2017) and forged content (Cao et al., 2018b). Separating the two tasks of content selection and summary generation allows us to closely examine the compressing and fusing mechanisms of an abstractive summarizer. In this paper we propose a method to learn to select sentence singletons and pairs, which then serve as the basis for an abstractive summarizer to compose a summary sentence-by-sentence, where singletons are shortened (i.e., compressed) and pairs are merged (i.e., fused). We exploit state-of-the-art neural representations and traditional vector space models to characterize singletons and pairs; we then provide suggestions on the types of representations useful for summarization. Experiments are performed on both single- and multi-document summarization datasets, where we demonstrate the efficacy of selecting sentence singletons and pairs as well as its utility to abstractive summarization. Our research contributions can be summarized as follows:
• the present study fills an important gap by selecting sentence singletons and pairs jointly, assuming a summary sentence can be created by either shortening a singleton or merging a pair. Compared to abstractive summarizers that perform content selection implicitly, our method is flexible and can be extended to multi-document summarization where training data is limited;
• we investigate the factors involved in representing sentence singletons and pairs. We perform extensive experiments and report findings on sentence selection and abstraction. 1

Related Work
Content selection is integral to any summarization system. Neural approaches to abstractive summarization often perform content selection jointly with surface realization using an encoder-decoder architecture (Rush et al., 2015; Nallapati et al., 2016; Chen et al., 2016b; Tan et al., 2017; See et al., 2017; Paulus et al., 2017; Celikyilmaz et al., 2018; Narayan et al., 2018). Training these models end-to-end means learning to perform both tasks simultaneously and can require a massive amount of data that is unavailable or unaffordable for many summarization tasks.
1 We make our code and models publicly available at https: com/ucfnlp/summarization-sing-pair-mix
Recent approaches emphasize the importance of separating content selection from summary generation for abstractive summarization. Studies exploit extractive methods to identify content words and sentences that should be part of the summary and use them to guide the generation of abstracts (Chen and Bansal, 2018; Gehrmann et al., 2018). On the other hand, surface lexical features have been shown to be effective in identifying pertinent content (Carenini et al., 2006; Wong et al., 2008; Galanis et al., 2012). Examples include sentence length, position, centrality, word frequency, whether a sentence contains topic words, and others. The surface cues can also be customized for new domains relatively easily. This paper represents a step forward in this direction, where we focus on developing lightweight models to select summary-worthy sentence singletons and pairs and use them as the basis for summary generation.
A succinct sentence can be generated by shortening or rewriting a lengthy source text. Recent studies have leveraged neural encoder-decoder models to rewrite the first sentence of an article to a title-like summary (Nallapati et al., 2016; Zhou et al., 2017; Li et al., 2017; Guo et al., 2018; Cao et al., 2018a). Compressive summaries can be generated in a similar vein by selecting important source sentences and then dropping inessential sentence elements such as prepositional phrases. Before the era of deep neural networks, this was an active area of research, where sentence selection and compression were accomplished using a pipeline or a joint model (Daumé III and Marcu, 2002; Zajic et al., 2007; Gillick and Favre, 2009; Wang et al., 2013; Li et al., 2013, 2014; Filippova et al., 2015). A majority of these studies focus on selecting and compressing sentence singletons only.
A sentence can also be generated through fusing multiple source sentences. However, many aspects of this approach are largely underinvestigated, such as determining the set of source sentences to be fused, handling its large cardinality, and identifying the sentence relationships for performing fusion. Previous studies assume a set of similar source sentences can be gathered by clustering sentences or by comparing to a reference summary sentence (Barzilay and McKeown, 2005; Filippova, 2010; Shen and Li, 2010; Chenal and Cheung, 2016; Liao et al., 2018); but these methods can be suboptimal. Joint models for sentence selection and fusion implicitly perform content planning (Martins and Smith, 2009; Berg-Kirkpatrick et al., 2011; Bing et al., 2015; Durrett et al., 2016) and there is limited control over which sentences are merged and how. In contrast, this work attempts to teach the system to determine if a sentence singleton or a pair should be selected to produce a summary sentence. A sentence pair (A, B) is preferred over its constituent sentences if they carry complementary content. Table 1 shows an example. Sentence B contains a reference ("the attack") and A contains a more complete description for it ("bombing that killed 58"). Sentences A and B each contain certain valuable information, and an appropriate way to merge them exists. As a result, a sentence pair can be scored higher than a singleton given the content it carries and the compatibility of its constituent sentences. In the following we describe methods to represent singletons and pairs in a unified framework and to score them for summarization.

Table 1: Example sentence pair and sentence singleton.
Sentence Pair: (B) Wajid Shamsul Hasan, Pakistan's high commissioner to Britain, and Hamid Gul, former head of the ISI, firmly denied the agency's involvement in the attack.
Merged Sentence: Pakistan denies its spy agency helped plan bombing that killed 58.
Sentence Singleton: (A) Pakistani Maj. Gen. Athar Abbas said the report "unfounded and malicious" and an "effort to malign the ISI," -Pakistan's directorate of inter-services intelligence.
Compressed Sentence: Maj. Gen. Athar Abbas said the report was an "effort to malign the ISI."

Our Model
We present the first attempt to transform sentence singletons and pairs to real-valued vector representations capturing semantic salience so that they can be measured against each other (§3.1). This is a nontrivial task, as it requires a direct comparison of texts of varying length: a pair of sentences is almost certainly longer than a single sentence. For sentence pairs, the representations are expected to further encode sentential semantic compatibility. In §3.2, we describe our method for feeding the highest-scoring singletons and pairs to a neural abstractive summarizer to generate summaries.

Scoring Sentence Singletons and Pairs
Given a document or set of documents, we create a set D of singletons and pairs by gathering all single sentences and arbitrary pairs of them. We refer to a singleton or pair in the set as an instance. The sentences in a pair are arranged in order of their appearance in the document or by date of documents. Let N be the number of single sentences in the input document(s); a complete set of singletons and pairs then contains $|D| = \frac{N(N-1)}{2} + N$ instances. Our goal is to score each instance based on the amount of summary-worthy content it conveys. Despite their length difference, a singleton can be scored higher than a pair if it contains a significant amount of salient content. Conversely, a pair can outweigh a singleton if its component sentences are salient and compatible with each other.
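The construction of the instance set can be illustrated with a minimal sketch. The function name `build_instances` and the use of sentence indices in place of sentence text are assumptions made for illustration; they are not taken from the released code.

```python
from itertools import combinations


def build_instances(num_sentences):
    """Return every singleton and every pair over sentence indices 0..N-1."""
    indices = range(num_sentences)
    singletons = [(i,) for i in indices]
    # combinations() yields (i, j) with i < j, so each pair keeps document order.
    pairs = list(combinations(indices, 2))
    return singletons + pairs


instances = build_instances(4)
# For N = 4 there are 4 singletons and 6 pairs, i.e. N(N-1)/2 + N = 10 instances.
print(len(instances))
```

The quadratic growth in the number of pairs is visible here: doubling N roughly quadruples the number of instances to be scored.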
Building effective representations for singletons and pairs is therefore of utmost importance. We attempt to build a vector representation for each instance. The representation should be invariant to the instance type, i.e., a singleton or pair. In this paper we exploit the BERT architecture (Devlin et al., 2018) to learn instance representations. The representations are fine-tuned for a classification task predicting whether a given instance contains content used in human-written summary sentences (details for ground-truth creation in §4).
BERT. BERT supports our goal of encoding singletons and pairs indiscriminately. It introduces two pretraining tasks to build deep contextual representations for words and sequences. A sequence can be a single sentence (A) or pair of sentences (A+B). The first task predicts missing words in the input sequence. The second task predicts if B is the next sentence following A. It requires the vector representation for (A+B) to capture the coherence of two sentences. As coherent sentences can often be fused together, we conjecture that the second task is particularly suited for our goal.
Concretely, BERT constructs an input sequence by prepending a singleton or pair with a "[CLS]" symbol and delimiting the two sentences of a pair with "[SEP]". The representation learned for the [CLS] symbol is used as an aggregate sequence representation for the later classification task. We show an example input sequence in Eq. (1). In the case of a singleton, $w^B_i$ are padding tokens.
Each token $w_i$ is characterized by an input embedding $e_i$, calculated as the element-wise sum of the following embeddings (Eq. (2)):
$$e_i = e_w(w_i) + e_{sgmt}(w_i) + e_{wpos}(w_i) + e_{spos}(w_i) \quad (2)$$
• $e_w(w_i)$ is a token embedding;
• $e_{sgmt}(w_i)$ is a segment embedding, signifying whether $w_i$ comes from sentence A or B;
• $e_{wpos}(w_i)$ is a word position embedding indicating the index of $w_i$ in the input sequence;
• we introduce $e_{spos}(w_i)$ to be a sentence position embedding, indicating the position in the document of the sentence that contains $w_i$.
Intuitively, these embeddings mean that the extent to which a word contributes to the sequence (A+B) representation depends on these factors: (i) word salience, (ii) importance of sentences A and B, (iii) word position in the sequence, and (iv) sentence position in the document. These factors coincide with heuristics used in the summarization literature (Nenkova and McKeown, 2011), where leading sentences of a document and the first few words of a sentence are more likely to be included in the summary.
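The input construction described above can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the helper names `build_input` and `sum_embeddings`, the trailing "[SEP]" (borrowed from standard BERT inputs), the padding length `max_len_b`, and the sentence position assigned to padding tokens are all hypothetical choices.

```python
def build_input(sent_a, sent_b=None, pos_a=0, pos_b=0, max_len_b=3):
    """Build the token sequence and the index streams behind the four embeddings.

    sent_a / sent_b are token lists; pos_a / pos_b are sentence indices in the
    document. For a singleton, sentence B is replaced by padding tokens.
    """
    if sent_b is None:
        sent_b = ["[PAD]"] * max_len_b  # assumption: fixed-length padding for B
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    len_a = len(sent_a) + 2  # [CLS], sentence A, first [SEP]
    len_b = len(sent_b) + 1  # sentence B, final [SEP]
    segment_ids = [0] * len_a + [1] * len_b             # A vs. B
    word_positions = list(range(len(tokens)))           # index in the sequence
    sentence_positions = [pos_a] * len_a + [pos_b] * len_b  # index in the document
    return tokens, segment_ids, word_positions, sentence_positions


def sum_embeddings(vectors):
    """Element-wise sum of equally sized embedding vectors for one token."""
    return [sum(values) for values in zip(*vectors)]


tokens, seg, wpos, spos = build_input(["a", "b"], ["c"], pos_a=2, pos_b=7)
print(tokens)
print(seg, wpos, spos)
```

In a real model each of the four index streams would be looked up in its own learned embedding table, and `sum_embeddings` would be applied to the four resulting vectors of each token.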
The input embeddings are then fed to a multi-layer and multi-head attention architecture to build deep contextual representations for tokens. Each layer employs a Transformer block (Vaswani et al., 2017), which introduces a self-attention mechanism that allows each hidden state $h^l_i$ to be compared with every other hidden state of the same layer $[h^l_1, h^l_2, \ldots, h^l_N]$ using a parallelizable, multi-head attention mechanism (Eq. (3-4)).
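To make the comparison of each hidden state with every other one concrete, here is a deliberately simplified sketch of scaled dot-product self-attention. It assumes a single head and omits the learned query, key, and value projections, the residual connections, and the feed-forward sublayer of a full Transformer block, so it illustrates the mechanism rather than reproducing Eq. (3-4).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(states[0])
    outputs = []
    for query in states:
        # Compare this hidden state with every hidden state of the same layer.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in states]
        weights = softmax(scores)
        # The new state is the attention-weighted average of all states.
        outputs.append([sum(w * state[j] for w, state in zip(weights, states))
                        for j in range(dim)])
    return outputs


print(self_attention([[1.0, 0.0], [0.0, 1.0]]))
```

Because every row of the output depends only on dot products against the same set of states, all rows can be computed in parallel, which is the property the text refers to.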
The representation at the final layer L for the [CLS] symbol, $h^L_{[CLS]}$, is used as the sequence representation. The representations can be fine-tuned with an additional output layer to generate state-of-the-art results on a wide range of tasks including reading comprehension and natural language inference. We use the pretrained BERT base model and fine-tune it on our specific task of predicting whether an instance (a singleton or pair) is an appropriate one, i.e., belongs to the ground-truth set of summary instances for a given document; the predicted probability is $p_{inst} = \sigma(w^\top h^L_{[CLS]})$. At test time, the architecture indiscriminately encodes a mixed collection of sentence singletons/pairs. We then obtain a likelihood score for each instance. This framework is thus a first effort to build semantic representations for singletons and pairs capturing informativeness and semantic compatibility of two sentences.
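The output layer amounts to a logistic regression over the [CLS] representation. The sketch below assumes that w is a weight vector of the same size as the [CLS] vector and uses toy numbers; the function name `instance_probability` is hypothetical.

```python
import math


def instance_probability(weights, h_cls):
    """Sigmoid of the dot product between a weight vector and the [CLS] vector."""
    logit = sum(w * h for w, h in zip(weights, h_cls))
    # Branch on the sign so that math.exp never receives a large positive argument.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


# Toy weight vector and [CLS] vectors for two instances (illustrative values only).
w = [0.5, -1.0, 2.0]
scores = {
    "singleton": instance_probability(w, [0.2, 0.1, 0.4]),
    "pair": instance_probability(w, [0.1, 0.9, 0.0]),
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because singletons and pairs pass through the same encoder and output layer, their probabilities are directly comparable and can be sorted into a single ranking.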
VSM. We are interested in contrasting BERT with the traditional vector space model (Manning et al., 2008) for representing singletons and pairs. BERT learns instance representations by attending to important content words, where the importance is signaled by word and position embeddings as well as pairwise word relationships. Nonetheless, it remains an open question whether BERT can successfully weave the meaning of topically important words into representations. A word "border" is topically important if the input document discusses border security. A topic word is likely to be repeatedly mentioned in the input document but less frequently elsewhere. Because sentences containing topical words are often deemed summary-worthy (Hong and Nenkova, 2014), it is desirable to represent sentence singletons and pairs based on the amount of topical content they convey.
VSM represents each sentence as a sparse vector. Each dimension of the vector corresponds to an n-gram weighted by its TF-IDF score. A high TF-IDF score suggests the n-gram is important to the topic of discussion. We further strengthen the sentence vector with position and centrality information, i.e., the sentence position in the document and the cosine similarity between the sentence and document vector. We obtain a document vector by averaging over its sentence vectors, and we similarly obtain a vector for a pair of sentences. We use VSM representations as a baseline to contrast their performance with distributed representations from BERT. To score singletons and pairs, we use the LambdaMART model, which has demonstrated success on related NLP tasks (Chen et al., 2016a); it also fits our requirements of ranking singletons and pairs indiscriminately.
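The sparse representation and the centrality feature can be sketched in a few lines. This is a simplified illustration: it uses unigrams rather than general n-grams, a smoothed IDF formula chosen here for convenience, and hypothetical background document frequencies; none of these details are confirmed by the text.

```python
import math


def tfidf_vector(tokens, doc_freq, num_docs):
    """Sparse TF-IDF vector (term -> weight) for one tokenized sentence."""
    vector = {}
    for term in tokens:
        # Smoothed IDF (an assumption) keeps weights positive for common terms.
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vector[term] = vector.get(term, 0.0) + idf  # accumulates tf * idf
    return vector


def average(vectors):
    """Average sparse vectors; used for document and sentence-pair vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical document frequencies over a background collection of 100 documents.
doc_freq = {"border": 2, "security": 5, "the": 100, "talks": 10, "weather": 30}
sentences = [["the", "border", "security", "talks"],
             ["the", "border", "patrol"],
             ["the", "weather"]]
sent_vecs = [tfidf_vector(s, doc_freq, 100) for s in sentences]
doc_vec = average(sent_vecs)
pair_vec = average([sent_vecs[0], sent_vecs[1]])       # vector for the pair (0, 1)
centrality = [cosine(v, doc_vec) for v in sent_vecs]   # centrality feature
print(centrality)
```

In this toy document the sentences about the border are closer to the document vector than the sentence about the weather, which is the behavior the centrality feature is meant to capture. The resulting features would then be passed to a learning-to-rank model such as LambdaMART, which is not sketched here.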

Generating Summaries
We proceed by performing a preliminary investigation of summary generation from singletons and pairs; they are collectively referred to as instances.
In the previous section, a set of summary instances is selected from a document. These instances are treated as "raw materials" for a summary; they are fed to a neural abstractive summarizer which processes them into summary sentences via fusion and compression. This strategy allows us to separately evaluate the contributions from instance selection and summary composition.
We employ the MMR principle (Carbonell and Goldstein, 1998) to select a set of highest-scoring and non-redundant instances. The method adds an instance $\hat{P}$ to the summary S iteratively per Eq. (5) until a length threshold has been reached. Each instance is weighted by a linear combination of its importance score $I(P_k)$, obtained by BERT or VSM, and its redundancy score $R(P_k)$, computed as the cosine similarity between the instance and the partial summary. λ is a balancing factor between importance and redundancy. 4 Essentially, MMR prevents the system from selecting instances that are too similar to ones already selected.
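The greedy selection loop can be sketched as below. The sketch assumes the standard MMR form, scoring each candidate as λ·I(P_k) − (1 − λ)·R(P_k); it replaces the cosine similarity with a word-level Jaccard overlap to stay dependency-free, and it stops after a fixed number of instances rather than at a word-length threshold. The names `mmr_select` and `jaccard_redundancy` and the toy data are hypothetical.

```python
def mmr_select(candidates, importance, redundancy, lam=0.6, max_instances=2):
    """Greedily pick instances that are important and not redundant."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < max_instances:
        best = max(
            remaining,
            key=lambda p: lam * importance[p] - (1 - lam) * redundancy(p, selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def jaccard_redundancy(texts):
    """Stand-in for cosine similarity between an instance and the partial summary."""
    def redundancy(instance, selected):
        if not selected:
            return 0.0
        summary_words = set()
        for chosen in selected:
            summary_words.update(texts[chosen].split())
        words = set(texts[instance].split())
        return len(words & summary_words) / len(words | summary_words)
    return redundancy


texts = {
    "a": "pakistan denies spy agency role",
    "b": "pakistan denies spy agency involvement",
    "c": "bombing killed 58 people",
}
importance = {"a": 0.9, "b": 0.85, "c": 0.5}
picked = mmr_select(["a", "b", "c"], importance, jaccard_redundancy(texts))
print(picked)
```

With λ = 0.6, instance "b" is skipped in the second step even though it has a higher importance score than "c", because it overlaps heavily with the already selected "a".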
Composing a summary from selected instances is a non-trivial task. As a preliminary investigation of summary composition, we make use of pointer-generator (PG) networks (See et al., 2017) to compress/fuse sentences into summary sentences. PG is a sequence-to-sequence model that has achieved state-of-the-art performance in abstractive summarization by having the ability to either copy tokens from the document or generate new tokens from the vocabulary. When trained on document-summary pairs, the model has been shown to remove unnecessary content from sentences and can merge multiple sentences together.
In this work, rather than training on document-summary pairs, we train PG exclusively on ground-truth instances. This removes most of the responsibility of content selection, and allows it to focus its efforts on merging the sentences. We use instances derived from human summaries (§4) to train the network; each training example includes a sentence singleton or pair along with the ground-truth compressed/merged sentence. At test time, the network receives an instance from BERT or VSM and outputs a summary sentence, then repeats this process to generate several sentences. In Figure 2 we present an illustration of the system architecture.
4 We use a coefficient λ of 0.6.
Figure 2: System architecture. A sentence pair is chosen (red) and then merged to generate the first summary sentence. Next, a sentence singleton is selected (blue) and compressed for the second summary sentence.

Data
Our method does not require a massive amount of annotated data. We thus report results on single- and multi-document summarization datasets. We experiment with (i) XSum (Narayan et al., 2018), a new dataset created for extreme, abstractive summarization. The task is to reduce a news article to a short, one-sentence summary. Both source articles and reference summaries are gathered from the BBC website. The training set contains about 204k article-summary pairs and the test set contains 11k pairs. (ii) CNN/DM (Hermann et al., 2015), an abstractive summarization dataset frequently exploited by recent studies. The task is to reduce a news article to a multi-sentence summary (4 sentences on average). The training set contains about 287k article-summary pairs and the test set contains 11k pairs. We use the non-anonymized version of the dataset. (iii) DUC-04 (Over and Yen, 2004), a benchmark multi-document summarization dataset. The task is to create an abstractive summary (5 sentences on average) from a set of 10 documents discussing a given topic. The dataset contains 50 sets of documents used for testing purposes only. Each document set is associated with four human reference summaries.
We build a training set for both tasks of content selection and summary generation. This is done by creating ground-truth sets of instances based on document-summary pairs. Each document and summary pair (D, S) is a collection of sentences $D = \{d_1, d_2, \ldots, d_M\}$ and $S = \{s_1, s_2, \ldots, s_N\}$. We wish to associate each summary sentence $s_n$ with a subset of the document sentences $\tilde{D} \subseteq D$, which are the sentences that are merged to form $s_n$. Our method chooses multiple sentences that work together to capture the most overlap with summary sentence $s_n$, in the following way.
We use averaged ROUGE-1, -2, -L scores (Lin, 2004) to represent sentence similarity. The source sentence most similar to $s_n$ is chosen, which we call $\tilde{d}_1$. All shared words are then removed from $s_n$ to create $s'_n$, effectively removing all information already captured by $\tilde{d}_1$. A second source sentence $\tilde{d}_2$ is selected that is most similar to the remaining summary sentence $s'_n$, and shared words are again removed from $s'_n$ to create $s''_n$. This process of sentence selection and overlap removal is repeated until no remaining source sentence has at least two overlapping content words (words that are neither stopwords nor punctuation) with the remaining summary sentence. The result is referred to as a ground-truth set $(s_n, \tilde{D})$ where $\tilde{D} = \{\tilde{d}_1, \tilde{d}_2, \ldots, \tilde{d}_{|\tilde{D}|}\}$. To train the models, $\tilde{D}$ is limited to one or two sentences because this captures the large majority of cases. All empty ground-truth sets are removed, and only the first two sentences are chosen for all ground-truth sets with more than two sentences. A small number of summary sentences have empty ground-truth sets, corresponding to 2.85%, 9.87%, and 5.61% of summary sentences in the CNN/DM, XSum, and DUC-04 datasets, respectively. A detailed plot of the ground-truth set size is illustrated in Figure 1, and samples of the ground truth are found in the supplementary.
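The greedy matching procedure above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it substitutes a unigram F1 over content words for the averaged ROUGE-1, -2, -L similarity, uses a tiny hypothetical stopword list, and works on toy sentences invented for the example.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are", "that", "its"}


def content_words(tokens):
    """Keep tokens that are neither stopwords nor punctuation."""
    return {t for t in tokens if t.isalnum() and t not in STOPWORDS}


def unigram_f1(source_words, target_words):
    """Stand-in similarity (assumption): unigram F1 instead of averaged ROUGE."""
    overlap = len(source_words & target_words)
    if overlap == 0:
        return 0.0
    precision = overlap / len(source_words)
    recall = overlap / len(target_words)
    return 2 * precision * recall / (precision + recall)


def ground_truth_set(summary_tokens, doc_sentences, min_overlap=2, max_size=2):
    """Greedily pick source sentences, removing covered words after each pick."""
    remaining = content_words(summary_tokens)
    chosen = []
    while True:
        candidates = [
            i for i, sent in enumerate(doc_sentences)
            if i not in chosen and len(content_words(sent) & remaining) >= min_overlap
        ]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda i: unigram_f1(content_words(doc_sentences[i]), remaining))
        chosen.append(best)
        remaining = remaining - content_words(doc_sentences[best])
    return chosen[:max_size]  # keep only the first one or two sentences


summary = "pakistan denies its spy agency helped plan bombing that killed 58".split()
document = [
    "the bombing killed 58 people".split(),
    "officials firmly denied the spy agency involvement in the attack".split(),
    "the weather was sunny".split(),
]
print(ground_truth_set(summary, document))
```

Removing the already covered words before the second pick is what steers the procedure toward a complementary sentence rather than a near-duplicate of the first one.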
We use the standard train/validation/test splits for both CNN/Daily Mail and XSum. We train our models on ground-truth sets of instances created from the training sets and tune hyperparameters using instances from the validation sets. DUC-04 is a test-only dataset, so we use the models trained on CNN/Daily Mail to evaluate DUC-04. Because the input is in the form of multiple documents, we select the first 20 sentences from each document and concatenate them together into a single mega-document. For the sentence position feature, we keep the sentence positions from the original documents. This handling of sentence position, along with other features that are invariant to the input type, allows us to effectively train on single-document inputs and transfer to the multi-document setting.
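A minimal sketch of the mega-document construction is given below. The function name `build_mega_document` and the dictionary layout are assumptions; the point is only that each sentence keeps the position it had in its original document.

```python
def build_mega_document(documents, max_sentences=20):
    """Concatenate the leading sentences of each document into one mega-document.

    Each entry keeps the sentence position from its original document, so the
    position feature behaves as it does for single-document inputs.
    """
    mega = []
    for doc_id, sentences in enumerate(documents):
        for position, text in enumerate(sentences[:max_sentences]):
            mega.append({"doc": doc_id, "position": position, "text": text})
    return mega


# Two toy documents: one longer than the cutoff, one shorter.
docs = [[f"doc0 sentence {i}" for i in range(25)],
        [f"doc1 sentence {i}" for i in range(3)]]
mega = build_mega_document(docs)
print(len(mega), mega[20])
```

Here the first sentence of the second document is the 21st entry of the mega-document but retains position 0, so a model that has learned to favor leading sentences can still recognize it as one.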

Results
Evaluation Setup. In this section we evaluate our proposed methods on identifying summary-worthy instances including singletons and pairs. We compare this scheme with traditional methods extracting only singletons, then introduce novel evaluation strategies to compare results. We exploit several strong extractive baselines: (i) SumBasic (Vanderwende et al., 2007) extracts sentences by assuming words occurring frequently in a document have higher chances of being included in the summary; (ii) KL-Sum (Haghighi and Vanderwende, 2009) greedily adds sentences to the summary to minimize KL divergence; (iii) LexRank (Erkan and Radev, 2004) estimates sentence importance based on eigenvector centrality in a document graph representation. Further, we include the LEAD method that selects the first N sentences from each document. We then require all systems to extract N instances, i.e., either singletons or pairs, from the input document(s). 5 We compare system-identified instances with ground-truth instances, and in particular, we compare against the primary, secondary, and full set of ground-truth sentences. A primary sentence is defined as a ground-truth singleton or a sentence in a ground-truth pair that has the highest similarity to the reference summary sentence; the other sentence in the pair is considered secondary, which provides complementary information to the primary sentence. E.g., let S* = {(1, 2), 5, (8, 4), 10} be a ground-truth set of instances, where numbers are sentence indices and the first sentence of each pair is primary. Our ground-truth primary set thus contains {1, 5, 8, 10}; the secondary set contains {2, 4}; and the full set of ground-truth sentences contains {1, 2, 5, 8, 4, 10}. Assume S = {(1, 2), 3, (4, 10), 15} are system-selected instances.
We uncollapse all pairs to obtain a set of single sentences S = {1, 2, 3, 4, 10, 15}, then compare them against the primary, secondary, and full set of ground-truth sentences to calculate precision, recall, and F1-measure scores. This evaluation scheme allows a fair comparison of a variety of systems for instance selection, and assesses their performance on identifying primary and secondary sentences respectively for summary generation.
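The worked example above can be reproduced with a short sketch. The helper names `uncollapse` and `precision_recall_f1` are hypothetical; the sets are the ones given in the text.

```python
def uncollapse(instances):
    """Flatten singletons (ints) and pairs (tuples) into a set of sentence indices."""
    sentences = set()
    for instance in instances:
        if isinstance(instance, tuple):
            sentences.update(instance)
        else:
            sentences.add(instance)
    return sentences


def precision_recall_f1(system, gold):
    """Set-based precision, recall, and F1 over sentence indices."""
    hits = len(system & gold)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


ground_truth = [(1, 2), 5, (8, 4), 10]  # first sentence of each pair is primary
primary = {inst[0] if isinstance(inst, tuple) else inst for inst in ground_truth}
secondary = {inst[1] for inst in ground_truth if isinstance(inst, tuple)}
full = uncollapse(ground_truth)

system = uncollapse([(1, 2), 3, (4, 10), 15])
for name, gold in [("primary", primary), ("secondary", secondary), ("full", full)]:
    print(name, precision_recall_f1(system, gold))
```

For this example the system recovers both secondary sentences (recall 1.0) but only half of the primary ones, which shows why the three reference sets are reported separately.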

Extraction Results
In Table 2 we present instance selection results for the CNN/DM, XSum, and DUC-04 datasets. Our method builds representations for instances using either BERT or VSM (§3.1). To ensure a thorough comparison, we experiment with selecting a mixed set of singletons and pairs ("SingPairMix") as well as selecting singletons only ("SingOnly"). On the CNN/DM and XSum datasets, we observe that selecting a mixed set of singletons and pairs based on BERT representations (BERT+SingPairMix) demonstrates the most competitive results. It outperforms a number of strong baselines when evaluated on a full set of ground-truth sentences. The method also performs better on identifying secondary sentences. For example, it increases recall scores for identifying secondary sentences from 33.8% to 69.8% (CNN/DM) and from 16.7% to 65.3% (XSum). Our method is able to achieve strong performance on instance selection owing to BERT's capability of building effective representations for both singletons and pairs. It learns to identify salient source content based on token and position embeddings and it encodes sentential semantic compatibility using the pretraining task of predicting the next sentence; both are valuable additions to summary instance selection.
5 We use N=4/1/5 respectively for the CNN/DM, XSum, and DUC-04 datasets. N is selected as the average number of sentences in reference summaries.

Table 2 (partial): Instance selection results (P / R / F, in %) against the primary, secondary, and full sets of ground-truth sentences.
Dataset | System | Primary P / R / F | Secondary P / R / F | All P / R / F
XSum | LEAD-Baseline | 8.5 / 9.4 / 8.9 | 5.3 / 9.5 / 6.8 | 13.8 / 9.4 / 11.2
XSum | SumBasic (Vanderwende et al., 2007) | 8.7 / 9.7 / 9.2 | 5.0 / 8.9 / 6.4 | 13.7 / 9.4 / 11.1
XSum | KL-Summ (Haghighi et al., 2009) | 9.2 / 10.2 / 9.7 | 5.0 / 8.9 / 6.4 | 14.2 / 9.7 / 11.5
DUC-04 | SumBasic (Vanderwende et al., 2007) | 4.2 / 3.2 / 3.6 | 3.0 / 3.8 / 3.3 | 7.2 / 3.4 / 4.6
DUC-04 | KL-Summ (Haghighi et al., 2009) | 5.6 / 4.5 / 5.0 | 2.8 / 3.8 / 3.2 | 8.0 / 4.2 / 5.5
Further, we observe that identifying summary-worthy singletons and pairs from multi-document inputs (DUC-04) appears to be more challenging than from single-document inputs (XSum and CNN/DM). This distinction is not surprising given that for multi-document inputs, the system has a large and diverse search space where candidate singletons and pairs are gathered from a set of documents written by different authors. We find that the BERT model performs consistently on identifying secondary sentences, and VSM yields considerable performance gain on selecting primary sentences. Both BERT and VSM models are trained on the CNN/DM dataset and applied to DUC-04 as the latter data are only used for testing. Our findings suggest that the TF-IDF features of the VSM model are effective for multi-document inputs, as important topic words are usually repeated across documents and TF-IDF scores can reflect topical importance of words. This analysis further reveals that extending BERT to incorporate topical salience of words can be a valuable line of research for future work.

Summarization Results
We present summarization results in Table 3, where we assess both extractive and abstractive summaries generated by BERT-SingPairMix. We omit VSM results as they are not as competitive as BERT on instance selection for the mixed set of singletons and pairs. The extractive summaries "BERT-Extr" are formed by concatenating selected singletons and pairs for each document, whereas "GT-SingPairMix" concatenates ground-truth singletons and pairs; it provides an upper bound for any system generating a set of singletons and pairs as the summary. To ensure a fair comparison, we limit all extractive summaries to contain up to 100 words (40 words for XSum) for ROUGE evaluation, where R-1, R-2, R-L, and R-SU4 are variants used to measure the overlap of unigrams, bigrams, longest common subsequences, and skip bigrams (with a maximum distance of 4) between system and reference summaries (Lin, 2004). The abstractive summaries are generated from the same singletons and pairs used to form system extracts. "BERT-Abs-PG" generates an abstract by iteratively encoding singletons or pairs and decoding summary sentences using pointer-generator networks (§3.2).
Our BERT summarization systems achieve results largely on par with those of prior work. It is interesting to observe that the extractive variant (BERT-Extr) can outperform its abstractive counterparts on the DUC-04 and CNN/DM datasets, and vice versa on XSum. A close examination of the results reveals that whether abstractive summaries outperform appears to be related to the number of sentence pairs selected by "BERT-SingPairMix." Selecting more pairs than singletons seems to hurt the abstractor. For example, BERT selects 100% and 76.90% sentence pairs for DUC-04 and CNN/DM respectively, and 28.02% for XSum. These results suggest that existing abstractors using encoder-decoder models may need to improve on sentence fusion. These models are trained to generate fluent sentences more than to preserve salient source content, leading to important content words being skipped in generating summary sentences. Our work intends to separate the tasks of sentence selection and summary generation, thus holding promise for improving compression and merging in the future. We present example system summaries in the supplementary.
Further analysis. In this section we perform a series of analyses to understand where summary-worthy content is located in a document and how humans order it into a summary. Figure 3 shows the position of ground-truth singletons and pairs in a document. We observe that singletons of CNN/DM and DUC-04 tend to occur at the beginning of a document, whereas singletons of XSum can occur anywhere. We also find that the first and second sentence of a pair can appear far apart for XSum, but are closer for CNN/DM. These findings suggest that selecting singletons and pairs for XSum can be more challenging than for the other datasets, as indicated by the name "extreme" summarization. Figure 4 illustrates how humans choose to organize content into a summary. Interestingly, we observe that a sentence's position in a human summary affects whether or not it is created by compression or fusion. The first sentence of a human-written summary is more likely than the following sentences to be a fusion of multiple source sentences. This is the case across all three datasets. We conjecture that the first sentence of a summary is expected to give an overview of the document and needs to consolidate information from different parts. Other sentences of a human summary can be generated by simply shortening singletons. Our statistics reveal that DUC-04 and XSum summaries involve more fusion operations, exhibiting a higher level of abstraction than CNN/DM.
Figure 4: A sentence's position in a human summary can affect whether or not it is created by compression or fusion.

Conclusion
We present an investigation into the feasibility of scoring singletons and pairs according to their likelihoods of producing summary sentences. Our framework is founded on the human process of selecting one or two sentences to merge together and it has the potential to bridge the gap between compression and fusion studies. Our method provides a promising avenue for domain-specific summarization where content selection and summary generation are only loosely connected to reduce the costs of obtaining massive annotated data.

A Ground-truth Sets of Instances
We performed a manual inspection over a subset of our ground-truth sets of singletons and pairs. Each sentence from a human-written summary is matched with one or two source sentences based on average ROUGE similarity (details in Section 4 of the paper). Tables 4, 5, and 6 present randomly selected examples from CNN/Daily Mail, XSum, and DUC-04, respectively. Colored text represents overlapping tokens between sentences. Darker colors represent content from primary sentences, while lighter colors represent content from secondary sentences. Best viewed in color. Table 7 presents example system summaries and human-written abstracts from CNN/Daily Mail. Each Human Abstract sentence is matched with a sentence singleton or pair from the source document; these singletons/pairs make up the GT-SingPairMix summary. Similarly, each sentence from BERT-Abs is created by compressing a singleton or merging a pair selected by BERT-Extr.
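As a rough illustration of matching a summary sentence with one or two source sentences, consider the sketch below. It is not the procedure from Section 4: unigram F1 stands in for average ROUGE similarity, the greedy two-step selection is one plausible reading only, and `min_score` is a hypothetical threshold rather than a value from the paper.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Stand-in similarity: unigram F1 between two token lists."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def match_summary_sentence(summary, sources, similarity=unigram_f1, min_score=0.3):
    """Return a tuple of one or two source indices, or () if nothing matches.

    summary: token list; sources: list of token lists.
    min_score is a hypothetical cutoff chosen for illustration.
    """
    if not sources:
        return ()
    scores = [similarity(src, summary) for src in sources]
    primary = max(range(len(sources)), key=scores.__getitem__)
    if scores[primary] < min_score:
        return ()
    # Summary tokens not yet covered by the primary sentence.
    covered = set(sources[primary])
    remaining = [tok for tok in summary if tok not in covered]
    best, best_score = None, min_score
    for idx, src in enumerate(sources):
        if idx == primary or not remaining:
            continue
        score = similarity(src, remaining)
        if score >= best_score:
            best, best_score = idx, score
    return (primary,) if best is None else tuple(sorted((primary, best)))
```

A one-element result corresponds to a singleton (compression), and a two-element result to a pair (fusion).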

Selected Source Sentence(s): an inmate housed on the " forgotten floor , " where many mentally ill inmates are housed in miami before trial .
Human Summary Sentence: mentally ill inmates in miami are housed on the " forgotten floor "

Selected Source Sentence(s): most often , they face drug charges or charges of assaulting an officer -charges that judge steven leifman says are usually " avoidable felonies . "
Human Summary Sentence: judge steven leifman says most are there as a result of " avoidable felonies "

Selected Source Sentence(s): " i am the son of the president . miami , florida -lrb-cnn -rrb--the ninth floor of the miami-dade pretrial detention facility is dubbed the " forgotten floor . "
Human Summary Sentence: while cnn tours facility , patient shouts : " i am the son of the president "

Selected Source Sentence(s): it 's brutally unjust , in his mind , and he has become a strong advocate for changing things in miami . so , he says , the sheer volume is overwhelming the system , and the result is what we see on the ninth floor .
Human Summary Sentence: leifman says the system is unjust and he 's fighting for change .

Selected Source Sentence(s): the average surface temperature has warmed one degree fahrenheit -lrb-0.6 degrees celsius -rrb-during the last century , according to the national research council .
Human Summary Sentence: earth has warmed one degree in past 100 years .

Selected Source Sentence(s): the reason most cited -by scientists and scientific organizations -for the current warming trend is an increase in the concentrations of greenhouse gases , which are in the atmosphere naturally and help keep the planet 's temperature at a comfortable level . in the worst-case scenario , experts say oceans could rise to overwhelming and catastrophic levels , flooding cities and altering seashores .
Human Summary Sentence: majority of scientists say greenhouse gases are causing temperatures to rise .

Selected Source Sentence(s): a change in the earth 's orbit or the intensity of the sun 's radiation could change , triggering warming or cooling . other scientists and observers , a minority compared to those who believe the warming trend is something ominous , say it is simply the latest shift in the cyclical patterns of a planet 's life .
Human Summary Sentence: some critics say planets often in periods of warming or cooling .

Selected Source Sentence(s): nev edwards scored an early try for sale , before castres ' florian vialelle went over , but julien dumora 's penalty put the hosts 10-7 ahead at the break .
Human Summary Sentence: a late penalty try gave sale victory over castres at stade pierre-antoine in their european challenge cup clash .

Selected Source Sentence(s): speaking in the dil , sinn fin leader gerry adams also called for a commission of investigation and said his party had " little confidence the government is protecting the public interest " . last year , nama sold its entire 850-property loan portfolio in northern ireland to the new york investment firm cerberus for more than # 1bn .
Human Summary Sentence: the irish government has rejected calls to set up a commission of investigation into the sale of nama 's portfolio of loans in northern ireland .

Human Summary Sentence: cambodian elections , fraudulent according to opposition parties , gave the cpp of hun sen a scant majority but not enough to form its own government .

Selected Source Sentence(s): opposition leaders prince norodom ranariddh and sam rainsy , citing hun sen 's threats to arrest opposition figures after two alleged attempts on his life , said they could not negotiate freely in cambodia and called for talks at sihanouk 's residence in beijing . cambodian leader hun sen has guaranteed the safety and political freedom of all politicians , trying to ease the fears of his rivals that they will be arrested or killed if they return to the country .
Human Summary Sentence: opposition leaders fearing arrest , or worse , fled and asked for talks outside the country .

Selected Source Sentence(s): the cambodian people 's party criticized a non-binding resolution passed earlier this month by the u.s. house of representatives calling for an investigation into violations of international humanitarian law allegedly committed by hun sen .
Human Summary Sentence: the un found evidence of rights violations by hun sen prompting the us house to call for an investigation .

Selected Source Sentence(s): cambodian politicians expressed hope monday that a new partnership between the parties of strongman hun sen and his rival , prince norodom ranariddh , in a coalition government would not end in more violence .
Human Summary Sentence: the three-month governmental deadlock ended with han sen and his chief rival , prince norodom ranariddh sharing power .

Selected Source Sentence(s): citing hun sen 's threats to arrest opposition politicians following two alleged attempts on his life , ranariddh and sam rainsy have said they do not feel safe negotiating inside the country and asked the king to chair the summit at gis residence in beijing . after a meeting between hun sen and the new french ambassador to cambodia , hun sen aide prak sokhonn said the cambodian leader had repeated calls for the opposition to return , but expressed concern that the international community may be asked for security guarantees .
Human Summary Sentence: han sen guaranteed safe return to cambodia for all opponents but his strongest critic , sam rainsy , remained wary .

Selected Source Sentence(s): diplomatic efforts to revive the stalled talks appeared to bear fruit monday as japanese foreign affairs secretary of state nobutaka machimura said king norodom sihanouk has called on ranariddh and sam rainsy to return to cambodia . king norodom sihanouk on tuesday praised agreements by cambodia 's top two political parties -previously bitter rivals -to form a coalition government led by strongman hun sen .
Human Summary Sentence: chief of state king norodom sihanouk praised the agreement .

Table 6: Sample of our ground-truth labels for singleton/pair instances from DUC-04, a multi-document dataset. Ground-truth sentences are widely dispersed among all ten documents.

Extractive Upper Bound
• She's a high school freshman with Down syndrome. • Trey -a star on Eastern High School's basketball team in Louisville, Kentucky, who's headed to play college ball next year at Ball State -was originally going to take his girlfriend to Eastern's prom.
• Trina Helson, a teacher at Eastern, alerted the school's newspaper staff to the prom-posal and posted photos of Trey and Ellie on Twitter that have gone viral.

BERT-Extractive
• But all that changed Thursday when Trey asked Ellie to be his prom date. • Trey -a star on Eastern High School's basketball team in Louisville, Kentucky, who's headed to play college ball next year at Ball State -was originally going to take his girlfriend to Eastern's prom.
• Trina Helson, a teacher at Eastern, alerted the school's newspaper staff to the prom-posal and posted photos of Trey and Ellie on Twitter that have gone viral.
• (CNN) He's a blue chip college basketball recruit. • She's a high school freshman with Down syndrome.

Human Abstract
• College-bound basketball star asks girl with Down syndrome to high school prom.
• Pictures of the two during the "prom-posal" have gone viral.

BERT-Abstractive
• Trey asked Ellie to be his prom date.
• Trina Helson, a teacher at Eastern, alerted the school's newspaper staff.
• He's a high school student with Down syndrome.

Extractive Upper Bound
• Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." • Reichelt told "Erin Burnett: outfront" that he had watched the video and stood by the report, saying Bild and Paris Match are "very confident" that the clip is real.
• Lubitz told his Lufthansa flight training school in 2009 that he had a "previous episode of severe depression," the airline said Tuesday.

BERT-Extractive
• Marseille, France (CNN) -the French prosecutor leading an investigation into the crash of Germanwings flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. • Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." • Robin's comments follow claims by two magazines, German Daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings flight 9525 as it crashed into the French Alps. • The two publications described the supposed video, but did not post it on their websites.

Human Abstract
• Marseille prosecutor says "so far no videos were used in the crash investigation" despite media reports.
• Journalists at Bild and Paris Match are "very confident" the video clip is real, an editor says.
• Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says.

BERT-Abstractive
• New : French prosecutor says he was not aware of video footage from on board the plane.
• Two magazines, including German Daily Bild, have been described as the video.

Table 7: Example system summaries and human-written abstracts. Each Human Abstract sentence is lined up horizontally with its corresponding ground-truth instance, which is found in Extractive Upper Bound summary. Similarly, each sentence from BERT-Abstractive is lined up horizontally with its corresponding instance selected by BERT-Extractive. The sentences are manually de-tokenized for readability.