Extractive Summarization as Text Matching

This paper creates a paradigm shift in the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) are matched in a semantic space. Notably, this shift to a semantic matching framework is well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors based on the properties of the dataset. Besides, even when instantiating the framework with a simple matching model, we have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1). Experiments on five other datasets also show the effectiveness of the matching framework. We believe the power of this matching-based summarization framework has not been fully exploited. To encourage more instantiations in the future, we have released our code, processed datasets, and generated summaries at https://github.com/maszhongming/MatchSum.


Introduction
The task of automatic text summarization aims to compress a textual document into a shorter highlight while keeping the salient information of the original text. In this paper, we focus on extractive summarization since it usually generates semantically and grammatically correct sentences (Dong et al., 2018; Nallapati et al., 2017) and computes faster.
Currently, most neural extractive summarization systems score and extract sentences (or smaller semantic units) one by one from the original text, model the relationship between the sentences, and then select several sentences to form a summary. Cheng and Lapata (2016) and Nallapati et al. (2017) formulate the extractive summarization task as a sequence labeling problem and solve it with an encoder-decoder framework. These models make independent binary decisions for each sentence, resulting in high redundancy. A natural way to address this problem is to introduce an auto-regressive decoder (Chen and Bansal, 2018; Jadhav and Rajan, 2018; Zhou et al., 2018), allowing the scoring operations of different sentences to influence each other. Trigram Blocking (Paulus et al., 2017; Liu and Lapata, 2019), a more popular method recently, has the same motivation. When selecting sentences to form a summary, it skips any sentence that has a trigram overlapping with the previously selected sentences. Surprisingly, this simple method of removing duplication brings a remarkable performance improvement on CNN/DailyMail.
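To make the Trigram Blocking heuristic concrete, the following is a minimal sketch and not the implementation used by the cited systems. It assumes whitespace tokenization and that the sentences arrive already sorted by model score; the function names are hypothetical.

```python
from typing import List, Set, Tuple


def trigrams(sentence: str) -> Set[Tuple[str, ...]]:
    """Return the set of word trigrams of a sentence (lower-cased, whitespace tokens)."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(ranked_sentences: List[str],
                                 max_sentences: int = 3) -> List[str]:
    """Greedily pick sentences in score order, skipping any sentence that
    shares a trigram with the sentences selected so far."""
    selected: List[str] = []
    seen: Set[Tuple[str, ...]] = set()
    for sentence in ranked_sentences:
        grams = trigrams(sentence)
        if grams & seen:
            continue  # overlaps with an already selected sentence
        selected.append(sentence)
        seen |= grams
        if len(selected) == max_sentences:
            break
    return selected
```

Note that the decision for each sentence depends only on surface overlap with earlier picks, which is why this remains a sentence-level procedure.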
The above systems, which model the relationship between sentences, are essentially sentence-level extractors and do not consider the semantics of the entire summary. This makes them more inclined to select highly generalized sentences while ignoring the coupling of multiple sentences. Narayan et al. (2018b) and Bae et al. (2019) utilize reinforcement learning (RL) to achieve summary-level scoring, but are still limited to the architecture of sentence-level summarizers.
To better understand the advantages and limitations of sentence-level and summary-level approaches, we conduct an analysis on six benchmark datasets (in Section 3) to explore the characteristics of these two methods. We find that there is indeed an inherent gap between the two approaches across these datasets, which motivates us to propose the following summary-level method.
In this paper, we propose a novel summary-level framework (MATCHSUM, Figure 1) and conceptualize extractive summarization as a semantic text matching problem. The principal idea is that a good summary should be more semantically similar as a whole to the source document than the unqualified summaries. Semantic text matching is an important research problem that estimates the semantic similarity between a source and a target text fragment, and it has been applied in many fields, such as information retrieval (Mitra et al., 2017), question answering (Yih et al., 2013; Severyn and Moschitti, 2015), and natural language inference (Wang and Jiang, 2016; Wang et al., 2017). One of the most conventional approaches to semantic text matching is to learn a vector representation for each text fragment, and then apply typical similarity metrics to compute the matching scores.
Specific to extractive summarization, we propose a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary. Siamese BERT leverages the pre-trained BERT (Devlin et al., 2019) in a Siamese network structure (Bromley et al., 1994;Hoffer and Ailon, 2015;Reimers and Gurevych, 2019) to derive semantically meaningful text embeddings that can be compared using cosine-similarity. A good summary has the highest similarity among a set of candidate summaries.
We evaluate the proposed matching framework and perform significance testing on a range of benchmark datasets. Our model outperforms strong baselines significantly in all cases and improves the state-of-the-art extractive result on CNN/DailyMail. Besides, we design experiments to observe the gains brought by our framework.
We summarize our contributions as follows: 1) Instead of scoring and extracting sentences one by one to form a summary, we formulate extractive summarization as a semantic text matching problem and propose a novel summary-level framework. Our approach bypasses the difficulty of summary-level optimization by contrastive learning, that is, a good summary should be more semantically similar to the source document than the unqualified summaries.
2) We conduct an analysis to investigate whether extractive models must do summary-level extraction based on the properties of the dataset, and attempt to quantify the inherent gap between sentence-level and summary-level methods.
3) Our proposed framework has achieved superior performance compared with strong baselines on six benchmark datasets. Notably, we obtain a state-of-the-art extractive result on CNN/DailyMail (44.41 in ROUGE-1) by only using the base version of BERT. Moreover, we seek to observe where the performance gain of our model comes from.

Extractive Summarization
Recent research on extractive summarization spans a large range of approaches. These works usually instantiate their encoder-decoder framework by choosing an RNN (Zhou et al., 2018), Transformer (Zhong et al., 2019b; Wang et al., 2019) or GNN (Wang et al., 2020) as the encoder, and non-auto-regressive (Narayan et al., 2018b; Arumae and Liu, 2018) or auto-regressive decoders (Jadhav and Rajan, 2018; Liu and Lapata, 2019). Despite their effectiveness, these models are essentially sentence-level extractors whose individual scoring process favors the highest-scoring sentence, which is probably not the optimal one to form a summary.
The application of RL provides a means of summary-level scoring and brings improvement (Narayan et al., 2018b; Bae et al., 2019). However, these efforts are still limited to auto-regressive or non-auto-regressive architectures. Besides, among non-neural approaches, the Integer Linear Programming (ILP) method can also be used for summary-level scoring (Wan et al., 2015).
In addition, some work prior to this paper approaches extractive summarization from a semantic perspective, such as concept coverage (Gillick and Favre, 2009), reconstruction (Miao and Blunsom, 2016) and semantic volume maximization (Yogatama et al., 2015).

Two-stage Summarization
Recent studies (Alyguliyev, 2009; Galanis and Androutsopoulos, 2010; Zhang et al., 2019a) have attempted to build two-stage document summarization systems. Specific to extractive summarization, the first stage usually extracts some fragments of the original text, and the second stage selects from or modifies these fragments. Chen and Bansal (2018) and Bae et al. (2019) follow a hybrid extract-then-rewrite architecture, with policy-based RL to bridge the two networks together. Lebanoff et al. (2019), Xu and Durrett (2019) and Mendes et al. (2019) focus on the extract-then-compress learning paradigm, which first trains an extractor for content selection. Our model can be viewed as an extract-then-match framework, which also employs a sentence extractor to prune unnecessary information.

Sentence-Level or Summary-Level? A Dataset-dependent Analysis
Although previous work has pointed out the weakness of sentence-level extractors, there is no systematic analysis of the following questions: 1) For extractive summarization, is the summary-level extractor better than the sentence-level extractor? 2) Given a dataset, which extractor should we choose based on the characteristics of the data, and what is the inherent gap between these two extractors?
In this section, we investigate the gap between sentence-level and summary-level methods on six benchmark datasets, which can guide us in searching for an effective learning framework. It is worth noting that the sentence-level extractor we use here does not include a redundancy removal process, so that we can estimate the effect of the summary-level extractor on redundancy elimination. Notably, the analysis method presented in this section for estimating the theoretical effectiveness is general and applicable to any summary-level approach.

Definition
We refer to $D = \{s_1, \cdots, s_n\}$ as a single document consisting of $n$ sentences, and $C = \{s_1, \cdots, s_k \mid s_i \in D\}$ as a candidate summary including $k$ ($k \le n$) sentences extracted from a document. Given a document $D$ with its gold summary $C^*$, we measure a candidate summary $C$ by calculating the ROUGE (Lin and Hovy, 2003) value between $C$ and $C^*$ at two levels:
1) Sentence-Level Score:
$$g^{sen}(C) = \frac{1}{|C|} \sum_{s \in C} R(s, C^*), \qquad (1)$$
where $s$ is a sentence in $C$ and $|C|$ represents the number of sentences. $R(\cdot)$ denotes the average ROUGE score. Thus, $g^{sen}(C)$ indicates the average overlap between each sentence in $C$ and the gold summary $C^*$.
2) Summary-Level Score:
$$g^{sum}(C) = R(C, C^*), \qquad (2)$$
where $g^{sum}(C)$ considers the sentences in $C$ as a whole and then calculates the ROUGE score with the gold summary $C^*$.
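The difference between the two scores can be illustrated with a small sketch. It is only an approximation of the setup above: `unigram_f1` is a simplified stand-in for $R(\cdot)$ (a unigram-overlap F1, not the official ROUGE toolkit), and all function names are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified stand-in for R(.): F1 over unigram overlap (not official ROUGE)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def g_sen(candidate: List[str], gold: str) -> float:
    """Sentence-level score: mean of the per-sentence scores against the gold summary."""
    return sum(unigram_f1(s, gold) for s in candidate) / len(candidate)


def g_sum(candidate: List[str], gold: str) -> float:
    """Summary-level score: score of the whole candidate against the gold summary."""
    return unigram_f1(" ".join(candidate), gold)
```

With a gold summary `"a b c d"`, the candidates `["a b", "c d"]` and `["a b", "a b"]` receive the same sentence-level score, but the redundant second candidate has a clearly lower summary-level score under this stand-in metric.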

Pearl-Summary
We define the pearl-summary to be the summary that has a lower sentence-level score but a higher summary-level score.
Definition 1 A candidate summary $C$ is defined as a pearl-summary if there exists another candidate summary $C'$ that satisfies the inequality: $g^{sen}(C') > g^{sen}(C)$ while $g^{sum}(C') < g^{sum}(C)$.
Clearly, if a candidate summary is a pearl-summary, it is challenging for sentence-level summarizers to extract it.
Best-Summary The best-summary refers to the summary that has the highest summary-level score among all the candidate summaries.
Definition 2 A summary $\hat{C}$ is defined as the best-summary when it satisfies:
$$\hat{C} = \arg\max_{C \in \mathcal{C}} g^{sum}(C),$$
where $\mathcal{C}$ denotes all the candidate summaries of the document.

Ranking of Best-Summary
For each document, we sort all candidate summaries in descending order based on the sentence-level score, and then define $z$ as the rank index of the best-summary $\hat{C}$. Intuitively, 1) if $z = 1$ ($\hat{C}$ comes first), the best-summary is composed of the sentences with the highest scores; 2) if $z > 1$, the best-summary is a pearl-summary. As $z$ increases ($\hat{C}$ gets a lower ranking), we can find more candidate summaries whose sentence-level score is higher than that of the best-summary, which makes learning more difficult for sentence-level extractors.
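The rank index z can be computed directly from the two scores of each candidate. The sketch below assumes the scores have already been computed and are passed in as pairs; the function name and the tie-breaking behavior (first candidate wins) are assumptions made for illustration.

```python
from typing import List, Tuple


def best_summary_rank(scores: List[Tuple[float, float]]) -> int:
    """Return the 1-based rank z of the best-summary.

    scores[i] holds (sentence-level score, summary-level score) of candidate i.
    The best-summary is the candidate with the highest summary-level score;
    z is its position when candidates are sorted by sentence-level score,
    highest first. Ties are broken in favor of the earlier candidate.
    """
    best = max(range(len(scores)), key=lambda i: scores[i][1])
    order = sorted(range(len(scores)), key=lambda i: scores[i][0], reverse=True)
    return order.index(best) + 1
```

A return value of 1 means the best-summary is also the top candidate at the sentence level; any larger value means it is a pearl-summary.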
Since the appearance of pearl-summaries brings challenges to sentence-level extractors, we investigate the proportion of pearl-summaries on six benchmark datasets. A detailed description of these datasets is displayed in Table 1.
As demonstrated in Figure 2, we can observe that for all datasets, most of the best-summaries are not made up of the highest-scoring sentences. Specifically, for CNN/DM, only 18.9% of best-summaries are not pearl-summaries, indicating that sentence-level extractors will easily fall into a local optimum and miss better candidate summaries.
Different from CNN/DM, PubMed is the most suitable for sentence-level summarizers, because most of its best-summaries are not pearl-summaries. Additionally, it is challenging to achieve good performance on WikiHow and Multi-News without a summary-level learning process, as z is most evenly distributed on these two datasets; that is, the appearance of pearl-summaries makes the selection of the best-summary more complicated.
In conclusion, the proportion of pearl-summaries among all the best-summaries is a property that characterizes a dataset, and it will affect our choice of summarization extractor.

Inherent Gap between Sentence-Level and Summary-Level Extractors
The above analysis has shown that the summary-level method is better than the sentence-level method because it can pick out pearl-summaries, but how much improvement can it bring on a specific dataset? Based on the definitions in Eq. (1) and (2), we can characterize the upper bounds of sentence-level and summary-level summarization systems for a document $D$ as:
$$\alpha^{sen}(D) = \max_{C \in \mathcal{C}_D} g^{sen}(C), \qquad \alpha^{sum}(D) = \max_{C \in \mathcal{C}_D} g^{sum}(C),$$
where $\mathcal{C}_D$ is the set of candidate summaries extracted from $D$. Then, we quantify the potential gain for a document $D$ by calculating the difference between $\alpha^{sen}(D)$ and $\alpha^{sum}(D)$:
$$\Delta(D) = \alpha^{sum}(D) - \alpha^{sen}(D).$$
Finally, a dataset-level potential gain can be obtained as:
$$\Delta(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{D \in \mathcal{D}} \Delta(D),$$
where $\mathcal{D}$ represents a specific dataset and $|\mathcal{D}|$ is the number of documents in this dataset. As Figure 3 shows, the performance gain of the summary-level method varies with the dataset and reaches a maximum of 4.7 on CNN/DM. From Figure 3 and Table 1, we find that the performance gain is related to the length of the reference summary in each dataset. In the case of short summaries (Reddit and XSum), the perfect identification of pearl-summaries does not lead to much improvement. Similarly, the multiple sentences in a long summary (PubMed and Multi-News) already have a large degree of semantic overlap, making the improvement of the summary-level method relatively small. But for medium-length summaries (CNN/DM and WikiHow, about 60 words), the summary-level learning process is rewarding. We will discuss this performance gain with specific models in Section 5.4.
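The potential gain can be sketched as follows. This assumes that the two upper bounds are the maxima of the sentence-level and summary-level scores over the candidate set of a document, and that the dataset-level gain is their mean difference over documents; the function names and the input layout are hypothetical.

```python
from typing import List, Tuple

# One document = list of (sentence-level score, summary-level score) per candidate.
Document = List[Tuple[float, float]]


def potential_gain(doc: Document) -> float:
    """Delta(D): summary-level upper bound minus sentence-level upper bound."""
    alpha_sen = max(sen for sen, _ in doc)
    alpha_sum = max(summ for _, summ in doc)
    return alpha_sum - alpha_sen


def dataset_potential_gain(dataset: List[Document]) -> float:
    """Dataset-level gain: mean of Delta(D) over all documents."""
    return sum(potential_gain(doc) for doc in dataset) / len(dataset)
```

For example, two documents with per-document gains of 0.16 and 0.10 yield a dataset-level gain of 0.13.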

Summarization as Matching
The above quantitative analysis suggests that for most of the datasets, sentence-level extractors are inherently unaware of pearl-summaries, so obtaining the best-summary is difficult. To better utilize these characteristics of the data, we propose a summary-level framework that scores and extracts a summary directly.
Specifically, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) are matched in a semantic space. The following section details how we instantiate our proposed matching summarization framework with a simple Siamese-based architecture.

Siamese-BERT
Inspired by the Siamese network structure (Bromley et al., 1994), we construct a Siamese-BERT architecture to match the document D and the candidate summary C. Our Siamese-BERT consists of two BERTs with tied weights and a cosine-similarity layer during the inference phase.
Unlike the modified BERT used in (Liu, 2019; Bae et al., 2019), we directly use the original BERT to derive semantically meaningful embeddings from the document D and the candidate summary C, since we do not need sentence-level representations. Thus, we use the vector of the '[CLS]' token from the top BERT layer as the representation of a document or summary. Let $r_D$ and $r_C$ denote the embeddings of the document D and the candidate summary C. Their similarity score is measured by $f(D, C) = \mathrm{cosine}(r_D, r_C)$.
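The scoring function f(D, C) can be sketched without committing to a particular encoder. In the snippet below, `encode` is a hypothetical callable standing in for the shared BERT encoder that returns the '[CLS]' vector; because the same callable is applied to both inputs, the weights are tied by construction. The function names are assumptions made for illustration.

```python
import math
from typing import Callable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_score(encode: Callable[[str], Sequence[float]],
                document: str, candidate: str) -> float:
    """f(D, C): cosine similarity between the embeddings of document and candidate.

    `encode` is a placeholder for the shared encoder (hypothetical here).
    """
    return cosine(encode(document), encode(candidate))
```

Any function mapping text to a fixed-size vector can be plugged in as `encode` to experiment with the matching idea.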
In order to fine-tune Siamese-BERT, we use a margin-based triplet loss to update the weights. Intuitively, the gold summary $C^*$ should be semantically closest to the source document, which is the first principle our loss should follow:
$$\mathcal{L}_1 = \max(0, f(D, C) - f(D, C^*) + \gamma_1),$$
where $C$ is a candidate summary in $D$ and $\gamma_1$ is a margin value. Besides, we also design a pairwise margin loss for all the candidate summaries. We sort all candidate summaries in descending order of their ROUGE scores with the gold summary. Naturally, a candidate pair with a larger ranking gap should have a larger margin, which is the second principle for designing our loss function:
$$\mathcal{L}_2 = \max(0, f(D, C_j) - f(D, C_i) + (j - i) \cdot \gamma_2) \quad (i < j),$$
where $C_i$ represents the candidate summary ranked $i$ and $\gamma_2$ is a hyperparameter used to distinguish between good and bad candidate summaries. Finally, our margin-based triplet loss can be written as:
$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2.$$
The basic idea is to let the gold summary have the highest matching score and, at the same time, to let a better candidate summary obtain a higher score than an unqualified candidate summary. Figure 1 illustrates this idea.
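The two principles above can be turned into a small numeric sketch. It is one plausible reading of the description, not the released training code: the first term is taken to be a hinge on f(D, C) - f(D, C*) + γ1, the second a hinge on f(D, C_j) - f(D, C_i) + (j - i)·γ2 for i < j, and both are summed over candidates and pairs (the reduction is an assumption). The function name is hypothetical.

```python
from typing import List


def matchsum_loss(gold_score: float, candidate_scores: List[float],
                  gamma1: float = 0.0, gamma2: float = 0.01) -> float:
    """Margin-based loss over matching scores.

    gold_score:        f(D, C*) for the gold summary.
    candidate_scores:  f(D, C_i), ordered by descending ROUGE with the gold
                       summary, so index 0 is the best candidate.
    """
    # Principle 1: the gold summary should score at least as high as any candidate.
    loss_gold = sum(max(0.0, s - gold_score + gamma1) for s in candidate_scores)

    # Principle 2: better-ranked candidates should score higher, with a margin
    # that grows with the ranking gap (j - i).
    loss_pairs = 0.0
    n = len(candidate_scores)
    for i in range(n):
        for j in range(i + 1, n):
            margin = (j - i) * gamma2
            loss_pairs += max(0.0, candidate_scores[j] - candidate_scores[i] + margin)

    return loss_gold + loss_pairs
```

When the matching scores already respect the ROUGE ordering with sufficient margins, both terms vanish and the loss is zero.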
In the inference phase, we formulate extractive summarization as a task to search for the best summary among all the candidates C extracted from the document D.

Candidates Pruning
Curse of Combination The matching idea is intuitive, but it suffers from combinatorial explosion. For example, how should we determine the size of the candidate summary set, and should we score all possible candidates? To alleviate these difficulties, we propose a simple candidate pruning strategy. Concretely, we introduce a content selection module to pre-select salient sentences. The module learns to assign each sentence a salience score and prunes sentences irrelevant to the current document, resulting in a pruned document $D' = \{s'_1, \cdots, s'_{ext} \mid s'_i \in D\}$.
Similar to much previous work on two-stage summarization, our content selection module is a parameterized neural network. In this paper, we use BERTSUM (Liu and Lapata, 2019) without trigram blocking (we call it BERTEXT) to score each sentence. Then, we use a simple rule to obtain the candidates: we generate all combinations of $sel$ sentences from the pruned document and reorder the sentences according to their original positions in the document to form candidate summaries. Therefore, we have a total of $\binom{ext}{sel}$ candidate sets.
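The pruning and candidate generation rule can be sketched as follows, assuming the salience scores have already been produced by a content selection model. The function name and argument layout are hypothetical.

```python
from itertools import combinations
from typing import List


def generate_candidates(sentences: List[str], salience: List[float],
                        ext: int, sel: int) -> List[List[str]]:
    """Keep the `ext` most salient sentences, then return every combination of
    `sel` of them, with sentences in their original document order."""
    # Indices of the ext highest-scoring sentences.
    top = sorted(range(len(sentences)), key=lambda i: salience[i], reverse=True)[:ext]
    # Restore document order; combinations() preserves the order of its input.
    kept = sorted(top)
    return [[sentences[i] for i in combo] for combo in combinations(kept, sel)]
```

With ext = 5 and sel = 3, for instance, this yields 10 candidates per document, so the number of candidates grows quickly with ext and the choice of both values matters for cost.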

Datasets
In order to verify the effectiveness of our framework and obtain more convincing explanations, we perform experiments on six diverse mainstream datasets as follows. CNN/DailyMail (Hermann et al., 2015) is a commonly used news summarization dataset modified by Nallapati et al. (2016). PubMed (Cohan et al., 2018) is collected from scientific papers. We modify this dataset by using the introduction section as the document and the abstract section as the corresponding summary. WikiHow (Koupaee and Wang, 2018) is a diverse dataset extracted from an online knowledge base. XSum (Narayan et al., 2018a) is a one-sentence summary dataset to answer the question "What is the article about?". Multi-News (Fabbri et al., 2019) is a multi-document news summarization dataset; we concatenate the source documents as a single input. Reddit is a highly abstractive dataset collected from a social media platform. We use the TIFU-long version of Reddit.

Implementation Details
We use the base version of BERT to implement our models in all experiments. The Adam optimizer (Kingma and Ba, 2014) with warm-up is used, and our learning rate schedule follows Vaswani et al. (2017):
$$lr = 2 \times 10^{-3} \cdot \min(step^{-0.5}, step \cdot wm^{-1.5}), \qquad (11)$$
where each step is a batch of size 32 and $wm$ denotes the number of warm-up steps, set to 10,000. We choose $\gamma_1 = 0$ and $\gamma_2 = 0.01$. When $\gamma_1 < 0.05$ and $0.005 < \gamma_2 < 0.05$, they have little effect on performance; values outside these ranges cause performance degradation. We use the validation set to save the three best checkpoints during training, and record the performance of the best checkpoints on the test set. Importantly, all the experimental results listed in this paper are the average of three runs. To obtain a Siamese-BERT model on CNN/DM, we use 8 Tesla-V100-16G GPUs for about 30 hours of training.
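The schedule in Eq. (11) can be written as a small function. This is a sketch of the formula only; the function and parameter names are hypothetical, and `step` is assumed to start at 1.

```python
def learning_rate(step: int, warmup: int = 10_000, base: float = 2e-3) -> float:
    """Eq. (11): linear warm-up for `warmup` steps, then inverse square-root decay."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return base * min(step ** -0.5, step * warmup ** -1.5)
```

The two arguments of `min` coincide at `step == warmup`, so the rate rises linearly to its peak at the end of warm-up and decays afterwards.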
For datasets, we remove samples with an empty document or summary and truncate the document to 512 tokens; therefore, ORACLE in this paper is calculated on the truncated datasets. Details of the candidate summaries for the different datasets can be found in Table 2.
Table 3 (excerpt):
(Zhou et al., 2018) 41.59 19.01 37.98
JECS (Xu and Durrett, 2019) 41.70 18.50 37.90
HIBERT (Zhang et al., 2019b) 42.37 19.95 38.83
PNBERT (Zhong et al., 2019a) 42

Experimental Results
Results on CNN/DM As shown in Table 3, we list strong baselines with different learning approaches. The first section contains LEAD, ORACLE and MATCH-ORACLE. Because we prune documents before matching, MATCH-ORACLE is relatively low. As the second section shows, although RL can score the entire summary, it does not lead to much performance improvement. This is probably because it still relies on sentence-level summarizers such as Pointer networks or sequence labeling models, which select sentences one by one rather than distinguishing the semantics of different summaries as a whole. Trigram Blocking is a simple yet effective heuristic on CNN/DM, even better than all redundancy removal methods based on neural models.
Compared with these models, our proposed MATCHSUM outperforms all competitors by a large margin. For example, it beats BERTEXT by 1.51 ROUGE-1 when using BERT-base as the encoder. Additionally, even compared with the baseline with a BERT-large pre-trained encoder, our model MATCHSUM (BERT-base) still performs better. Furthermore, when we change the encoder to RoBERTa-base, the performance can be further improved. We think this improvement arises because the pre-training data of RoBERTa includes 63 million English news articles. The superior performance on this dataset demonstrates the effectiveness of our proposed matching framework.

Results on Datasets with Short Summaries
Reddit and XSum have been heavily evaluated with abstractive summarizers due to their short summaries. Here, we evaluate our model on these two datasets to investigate whether MATCHSUM can achieve an improvement over other typical extractive models when dealing with summaries containing fewer sentences.
When taking just one sentence to match the original document, MATCHSUM degenerates into a re-ranking of sentences. Table 4 illustrates that this degenerate form can still bring a small improvement (compared to BERTEXT (Num = 1), 0.88 ∆R-1 on Reddit, 0.82 ∆R-1 on XSum). However, when the number of sentences increases to two and summary-level semantics need to be taken into account, MATCHSUM can obtain a more remarkable improvement.
In addition, our model maps a candidate summary as a whole into semantic space, so it can flexibly choose any number of sentences, while most other methods can only extract a fixed number of sentences. From Table 4, we can see that this advantage leads to further performance improvement.

Results on Datasets with Long Summaries
When the summary is relatively long, summary-level matching becomes more complicated and is harder to learn. We aim to compare Trigram Blocking and our model when dealing with long summaries. Table 5 shows that although Trigram Blocking works well on CNN/DM, it does not always maintain a stable improvement. N-gram Blocking has little effect on WikiHow and Multi-News, and it causes a large performance drop on PubMed. We think the reason is that N-gram Blocking cannot really understand the semantics of sentences or summaries; it merely restricts multi-word entities to appearing only once, which is obviously not suitable for the scientific domain, where entities may often appear multiple times.
On the contrary, our proposed method does not impose strong constraints but aligns the document with the summary in semantic space. Experimental results show that our model is robust on all domains. On WikiHow in particular, MATCHSUM beats the state-of-the-art model by 1.54 R-1.

Analysis
Our analysis here is driven by two questions: 1) Are the benefits of MATCHSUM consistent with the properties of the datasets analyzed in Section 3?
2) Why has our model achieved different performance gains on diverse datasets?
Dataset Splitting Testing We choose the three datasets (XSum, CNN/DM and WikiHow) with the largest performance gain for this experiment. We split each test set into five parts of roughly equal size according to z described in Section 3.2, and then experiment with each subset. Figure 4 shows that the performance gap between MATCHSUM and BERTEXT is always the smallest when the best-summary is not a pearl-summary (z = 1). This phenomenon is in line with our understanding: in these samples, the ability of the summary-level extractor to discover pearl-summaries does not bring advantages.
As z increases, the performance gap generally tends to increase. Specifically, the benefit of MATCHSUM on CNN/DM is highly consistent with the appearance of pearl-summaries. It brings an improvement of only 0.49 in the subset with the smallest z, but this rises sharply to 1.57 when z reaches its maximum value. WikiHow is similar to CNN/DM: when the best-summary consists entirely of the highest-scoring sentences, the performance gap is clearly smaller than in other samples. XSum is slightly different: although the trend remains the same, our model does not perform well on the samples with the largest z, which needs further improvement and exploration.
From the above comparison, we can see that the performance improvement of MATCHSUM is concentrated in the samples with more pearl-summaries, which illustrates that our semantic-based summary-level model can capture sentences that are not particularly good when viewed individually, thereby forming a better summary.
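The splitting step of this experiment can be sketched as below. The text only states that each test set is split into five parts of roughly equal size according to z, so the exact procedure here (sort by z, then cut into contiguous chunks) is an assumption, as are the function name and the input layout.

```python
from typing import List, Tuple


def split_by_rank(samples: List[Tuple[str, int]], parts: int = 5) -> List[List[Tuple[str, int]]]:
    """Split (doc_id, z) pairs into `parts` contiguous subsets ordered by z.

    Subset sizes differ by at most one; earlier subsets receive the extra items.
    """
    ordered = sorted(samples, key=lambda item: item[1])
    base, extra = divmod(len(ordered), parts)
    subsets, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        subsets.append(ordered[start:start + size])
        start += size
    return subsets
```

The first subset then contains the documents whose best-summary ranks highest at the sentence level, and the last subset contains those with the largest z.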
Figure 4: Dataset splitting experiment. We split test sets into five parts according to z described in Section 3.2. The X-axis from left to right indicates the subsets of the test set with the value of z from small to large, and the Y-axis represents the ROUGE improvement of MATCHSUM over BERTEXT on this subset.
Comparison Across Datasets Intuitively, the improvements brought by the MATCHSUM framework should be related to the inherent gap between sentence-level and summary-level extractors. We measure the improvement on a document $D$ and on a dataset $\mathcal{D}$ as:
$$\Delta(D)^* = g^{sum}(C_{MS}) - g^{sum}(C_{BE}), \qquad \Delta(\mathcal{D})^* = \frac{1}{|\mathcal{D}|} \sum_{D \in \mathcal{D}} \Delta(D)^*,$$
where $C_{MS}$ and $C_{BE}$ represent the candidate summaries selected by MATCHSUM and BERTEXT in the document $D$, respectively. Therefore, $\Delta(\mathcal{D})^*$ indicates the improvement by MATCHSUM over BERTEXT on dataset $\mathcal{D}$. Moreover, relative to the inherent gap between sentence-level and summary-level extractors, we define the ratio that MATCHSUM can learn on dataset $\mathcal{D}$ as:
$$\psi(\mathcal{D}) = \Delta(\mathcal{D})^* / \Delta(\mathcal{D}),$$
where $\Delta(\mathcal{D})$ is the inherent gap between sentence-level and summary-level extractors. As is clear from Figure 5, the value of $\psi(\mathcal{D})$ depends on z (see Figure 2) and the length of the gold summary (see Table 1). As the gold summaries get longer, the upper bound of summary-level approaches becomes more difficult for our model to reach. MATCHSUM can achieve a $\psi(\mathcal{D})$ of 0.64 on XSum (summaries of 23.3 words); however, $\psi(\mathcal{D})$ is less than 0.2 on PubMed and Multi-News, whose summary lengths exceed 200. From another perspective, when the summary lengths are similar, our model performs better on datasets with more pearl-summaries. For instance, z is evenly distributed in Multi-News (see Figure 2), so a higher $\psi(\mathcal{D})$ (0.18) can be obtained than on PubMed (0.09), which has the fewest pearl-summaries.
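The ratio ψ can be sketched numerically. The snippet assumes that the per-document improvement is the difference between the summary-level scores of the summaries selected by MATCHSUM and BERTEXT, that the dataset-level improvement is its mean, and that ψ divides this mean by the inherent gap; the function name and inputs are hypothetical.

```python
from typing import List


def learned_ratio(matchsum_scores: List[float], bertext_scores: List[float],
                  inherent_gap: float) -> float:
    """psi: mean per-document improvement of MATCHSUM over BERTEXT,
    divided by the inherent gap of the dataset.

    matchsum_scores[i] and bertext_scores[i] are the summary-level scores of
    the summaries the two systems selected for document i.
    """
    if len(matchsum_scores) != len(bertext_scores):
        raise ValueError("score lists must have the same length")
    improvements = [ms - be for ms, be in zip(matchsum_scores, bertext_scores)]
    delta_star = sum(improvements) / len(improvements)
    return delta_star / inherent_gap
```

A value close to 1 would mean the model realizes almost all of the gain available to a summary-level extractor, while a value near 0 means most of the gap remains unexploited.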
A better understanding of the dataset allows us to get a clear awareness of the strengths and limitations of our framework, and we also hope that the above analysis could provide useful clues for future research on extractive summarization.

Conclusion
We formulate the extractive summarization task as a semantic text matching problem and propose a novel summary-level framework to match the source document and candidate summaries in the semantic space. We conduct an analysis to show how our model can better fit the characteristics of the data. Experimental results show that MATCHSUM outperforms the current state-of-the-art extractive model on six benchmark datasets, which demonstrates the effectiveness of our method.