Sentence Centrality Revisited for Unsupervised Summarization

Single-document summarization has enjoyed renewed interest in recent years thanks to the popularity of neural network models and the availability of large-scale datasets. In this paper we develop an unsupervised approach, arguing that it is unrealistic to expect large-scale and high-quality training data to be available or created for different types of summaries, domains, or languages. We revisit a popular graph-based ranking algorithm and modify how node (aka sentence) centrality is computed in two ways: (a) we employ BERT, a state-of-the-art neural representation learning model, to better capture sentential meaning, and (b) we build graphs with directed edges, arguing that the contribution of any two nodes to their respective centrality is influenced by their relative position in a document. Experimental results on three news summarization datasets representative of different languages and writing styles show that our approach outperforms strong baselines by a wide margin.


Introduction
Single-document summarization is the task of generating a shorter version of a document while retaining its most important content (Nenkova et al., 2011). Modern neural network-based approaches (Nallapati et al., 2016; Paulus et al., 2018; Nallapati et al., 2017; Cheng and Lapata, 2016; See et al., 2017; Narayan et al., 2018b; Gehrmann et al., 2018) have achieved promising results thanks to the availability of large-scale datasets containing hundreds of thousands of document-summary pairs (Sandhaus, 2008; Hermann et al., 2015b; Grusky et al., 2018). Nevertheless, it is unrealistic to expect that large-scale and high-quality training data will be available or created for different summarization styles (e.g., highlights vs. single-sentence summaries), domains (e.g., user- vs. professionally-written articles), and languages.
It therefore comes as no surprise that unsupervised approaches have been the subject of much previous research (Marcu, 1997; Radev et al., 2000; Lin and Hovy, 2002; Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Wan, 2008; Wan and Yang, 2008; Hirao et al., 2013; Parveen et al., 2015; Yin and Pei, 2015; Li et al., 2017). A very popular algorithm for extractive single-document summarization is TextRank (Mihalcea and Tarau, 2004); it represents document sentences as nodes in a graph with undirected edges whose weights are computed based on sentence similarity. In order to decide which sentence to include in the summary, a node's centrality is often measured using graph-based ranking algorithms such as PageRank (Brin and Page, 1998).
In this paper, we argue that the centrality measure can be improved in two important respects. Firstly, to better capture sentential meaning and compute sentence similarity, we employ BERT (Devlin et al., 2018), a neural representation learning model which has obtained state-of-the-art results on various natural language processing tasks including textual inference, question answering, and sentiment analysis. Secondly, we advocate that edges should be directed, since the contribution induced by two nodes' connection to their respective centrality can in many cases be unequal. For example, the two sentences below are semantically related:

(1) Half of hospitals are letting patients jump NHS queues for cataract surgery if they pay for it themselves, an investigation has revealed.

(2) Clara Eaglen, from the royal national institute of blind people, said: "It's shameful that people are being asked to consider funding their own treatment when they are entitled to it for free, and in a timely manner on the NHS."

Sentence (1) describes a news event while sentence (2) comments on it. Sentence (2) would not make much sense on its own, without the support of the preceding sentence, whose content is more central. Similarity, as an undirected measure, cannot capture this fundamental intuition, which is also grounded in theories of discourse structure (Mann and Thompson, 1988) postulating that discourse units are characterized in terms of their text importance: nuclei denote central segments, whereas satellites denote peripheral ones. We propose a simple yet effective approach for measuring directed centrality for single-document summarization, based on the assumption that the contribution of any two nodes' connection to their respective centrality is influenced by their relative position.
Position information has been frequently used in summarization, especially in the news domain, either as a baseline that creates a summary by selecting the first n sentences of the document (Nenkova, 2005) or as a feature in learning-based systems (Lin and Hovy, 1997;Schilder and Kondadadi, 2008;Ouyang et al., 2010). We transform undirected edges between sentences into directed ones by differentially weighting them according to their orientation. Given a pair of sentences in the same document, one is looking forward (to the sentences following it), and the other is looking backward (to the sentences preceding it). For some types of documents (e.g., news articles) one might further expect sentences occurring early on to be more central and therefore backward-looking edges to have larger weights.
We evaluate the proposed approach on three single-document news summarization datasets representative of different languages, writing conventions (e.g., important information is concentrated in the beginning of the document or distributed more evenly throughout), and summary styles (e.g., verbose or more telegraphic). We experimentally show that position-augmented centrality significantly outperforms strong baselines (including TextRank; Mihalcea and Tarau 2004) across the board. In addition, our best system achieves performance comparable to supervised systems trained on hundreds of thousands of examples (Narayan et al., 2018b; See et al., 2017). We present an alternative to more data-hungry models, which we argue should be used as a standard comparison when assessing the merits of more sophisticated supervised approaches over and above the baseline of extracting the leading sentences (which our model outperforms).
Taken together, our results indicate that directed centrality substantially improves the selection of salient content. Interestingly, its significance for unsupervised summarization has gone largely unnoticed in the research community. For example, gensim (Barrios et al., 2016), a widely used open-source implementation of TextRank, only supports building undirected graphs, even though follow-on work (Mihalcea, 2004) experiments with position-based directed graphs similar to ours. Moreover, our approach highlights the effectiveness of pretrained embeddings for the summarization task, and their promise for the development of unsupervised methods in the future. We are not aware of any previous neural-based approaches to unsupervised single-document summarization, although some effort has gone into developing unsupervised models for multi-document summarization using reconstruction objectives (Ma et al., 2016; Chu and Liu, 2018).

Undirected Text Graph
A prominent class of approaches in unsupervised summarization uses graph-based ranking algorithms to determine a sentence's salience for inclusion in the summary (Mihalcea and Tarau, 2004; Erkan and Radev, 2004). A document (or a cluster of documents) is represented as a graph, in which nodes correspond to sentences and edges between sentences are weighted by their similarity. A node's centrality can be measured by simply computing its degree or by running a ranking algorithm such as PageRank (Brin and Page, 1998).
For single-document summarization, let D denote a document consisting of a sequence of sentences {s_1, s_2, ..., s_n}, and e_ij the similarity score for each pair (s_i, s_j). The degree centrality for sentence s_i can be defined as:

centrality(s_i) = Σ_{j ∈ {1,...,n}, j ≠ i} e_ij    (1)

After obtaining the centrality score for each sentence, sentences are sorted in decreasing order and the top-ranked ones are included in the summary.
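As a minimal sketch (assuming the pairwise similarity scores e_ij have already been computed; function names and the example values are our own, not from the paper), degree centrality and top-k sentence selection can be written as:

```python
import numpy as np

def degree_centrality(sim):
    """Degree centrality (Equation 1): sum each sentence's similarity
    to every other sentence in the document."""
    E = np.array(sim, dtype=float)
    np.fill_diagonal(E, 0.0)  # a sentence's self-similarity is excluded
    return E.sum(axis=1)

def select_summary(sim, k=3):
    """Pick the k most central sentences, restoring document order."""
    scores = degree_centrality(sim)
    return sorted(np.argsort(-scores)[:k].tolist())
```

For instance, with a 3x3 similarity matrix where sentence 1 is most similar to the rest, `select_summary(sim, 2)` returns the two most central sentences in their original document order.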
TextRank (Mihalcea and Tarau, 2004) adopts PageRank (Brin and Page, 1998) to compute node centrality recursively based on a Markov chain model. Whereas degree centrality only takes local connectivity into account, PageRank assigns relative scores to all nodes in the graph based on the recursive principle that connections to nodes having a high score contribute more to the score of the node in question. Compared to degree centrality, PageRank can in theory be better since the global graph structure is considered. However, we only observed marginal differences in our experiments (see Sections 4 and 5 for details).
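A TextRank-style score can be sketched as a standard PageRank power iteration over the row-normalized similarity graph. This is a simplified illustration, not the original implementation; the damping factor and iteration count below are conventional defaults, not the paper's settings:

```python
import numpy as np

def pagerank_centrality(sim, damping=0.85, iters=100):
    """PageRank over a sentence-similarity graph: each node's score
    is recursively determined by the scores of nodes linking to it."""
    A = np.array(sim, dtype=float)
    np.fill_diagonal(A, 0.0)
    rowsum = A.sum(axis=1, keepdims=True)
    rowsum[rowsum == 0.0] = 1.0          # guard against isolated sentences
    P = A / rowsum                       # row-stochastic transition matrix
    n = len(A)
    r = np.full(n, 1.0 / n)              # start from the uniform distribution
    for _ in range(iters):
        r = (1.0 - damping) / n + damping * (P.T @ r)
    return r
```

Unlike degree centrality, the score of a sentence here depends on the scores of the sentences pointing to it, i.e., the global graph structure.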

Directed Text Graph
The idea that textual units vary in terms of their importance or salience has found support in various theories of discourse structure including Rhetorical Structure Theory (RST; Mann and Thompson 1988). RST is a compositional model of discourse structure, in which elementary discourse units are combined into progressively larger discourse units, ultimately covering the entire document. Discourse units are linked to each other by rhetorical relations (e.g., Contrast, Elaboration) and are further characterized in terms of their text importance: nuclei denote central segments, whereas satellites denote peripheral ones. The notion of nuclearity has been leveraged extensively in document summarization (Marcu, 1997, 1998; Hirao et al., 2013) and in our case provides motivation for taking directionality into account when measuring centrality.
We could determine nuclearity with the help of a discourse parser (Li et al. 2016; Feng and Hirst 2014; Joty et al. 2013; Liu and Lapata 2017, inter alia), but problematically such parsers rely on the availability of annotated corpora as well as a wider range of standard NLP tools which might not exist for different domains, languages, or text genres. We instead approximate nuclearity by relative position, in the hope that sentences occurring earlier in a document are more central. Given any two sentences s_i, s_j (i < j) taken from the same document D, we formalize this simple intuition by transforming the undirected edge weighted by the similarity score e_ij between s_i and s_j into two directed ones, differentially weighted by λ_1 e_ij and λ_2 e_ij. We can then refine the centrality score of s_i based on the directed graph as follows:

centrality(s_i) = λ_1 Σ_{j<i} e_ij + λ_2 Σ_{j>i} e_ij    (2)

where λ_1, λ_2 are different weights for forward- and backward-looking directed edges. Note that when λ_1 and λ_2 are both equal to 1, Equation (2) becomes degree centrality. The weights can be tuned experimentally on a validation set consisting of a small number of documents and corresponding summaries, or set manually to reflect prior knowledge about how information flows in a document.
During tuning experiments, we set λ_1 + λ_2 = 1 to control the number of free hyper-parameters. Interestingly, we find that the optimal λ_1 tends to be negative, implying that similarity with previous content actually hurts centrality. This observation contrasts with existing graph-based summarization approaches (Mihalcea and Tarau, 2004; Mihalcea, 2004), where nodes typically have either no edge or edges with positive weights. Although it is possible to use extensions of PageRank (Kerchove and Dooren, 2008) that take negative edges into account, we leave this to future work and only consider the definition of centrality from Equation (2) in this paper.
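Putting the two position-dependent sums together, Equation (2) can be sketched as follows. The default λ values here are purely illustrative (in practice they are tuned on a validation set); the sum over j < i covers edges to preceding sentences and the sum over j > i edges to following ones:

```python
import numpy as np

def directed_centrality(sim, lam1=-0.2, lam2=1.2):
    """Position-augmented centrality (Equation 2): similarity with
    preceding sentences (j < i) is weighted by lam1, similarity with
    following sentences (j > i) by lam2."""
    E = np.array(sim, dtype=float)
    n = len(E)
    return np.array([lam1 * E[i, :i].sum() + lam2 * E[i, i + 1:].sum()
                     for i in range(n)])
```

Setting lam1 = lam2 = 1 recovers plain degree centrality, while a negative lam1 demotes sentences that merely echo earlier content.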

Sentence Similarity Computation
The key question now is how to compute the similarity between two sentences. There are many variations of the similarity function of TextRank (Barrios et al., 2016) based on symbolic sentence representations such as tf-idf. We instead employ a state-of-the-art neural representation learning model. We use BERT (Devlin et al., 2018) as our sentence encoder and fine-tune it based on a type of sentence-level distributional hypothesis (Harris, 1954; Polajnar et al., 2015), which we explain below. Fine-tuned BERT representations are subsequently used to compute the similarity between sentences in a document.

BERT as Sentence Encoder
We use BERT (Bidirectional Encoder Representations from Transformers; Devlin et al. 2018) to map sentences into deep continuous representations. BERT adopts a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017) and uses two unsupervised prediction tasks, i.e., masked language modeling and next sentence prediction, to pre-train the encoder.
The language modeling task aims to predict masked tokens by jointly conditioning on both left and right context, which allows pre-trained representations to fuse both contexts in contrast to conventional uni-directional language models. Sentence prediction aims to model the relationship between two sentences. It is a binary classification task, essentially predicting whether the second sentence in a sentence pair is indeed the next sentence. Pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference. We use BERT to encode sentences for unsupervised summarization.

Sentence-level Distributional Hypothesis
To fine-tune the BERT encoder, we exploit a type of sentence-level distributional hypothesis (Harris, 1954; Polajnar et al., 2015) as a means to define a training objective. In contrast to skip-thought vectors (Kiros et al., 2015), which are learned by reconstructing the surrounding sentences of an encoded sentence, we borrow the idea of negative sampling from word representation learning (Mikolov et al., 2013). Specifically, for a sentence s_i in document D, we take its previous sentence s_{i-1} and its following sentence s_{i+1} to be positive examples, and consider any other sentence in the corpus to be a negative example. The training objective for s_i is defined as:

log σ(ṽ_{s_{i-1}}ᵀ v_{s_i}) + log σ(ṽ_{s_{i+1}}ᵀ v_{s_i}) + E_{s∼P(s)} [ log σ(−ṽ_sᵀ v_{s_i}) ]    (3)

where v_s and ṽ_s are two different representations of sentence s, obtained via two differently parameterized BERT encoders; σ is the sigmoid function; and P(s) is a uniform distribution defined over the sentence space.
The objective in Equation (3) aims to distinguish context sentences from other sentences in the corpus, and the encoder is pushed to capture the meaning of the intended sentence in order to achieve that. We sample five negative examples for each positive one to approximate the expectation. Note that this approach is much more computationally efficient than reconstructing surrounding sentences (Kiros et al., 2015).
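A minimal sketch of the objective for a single training instance might look as follows. Plain vectors stand in for the outputs of the two differently parameterized BERT encoders (which are omitted here); in actual training, gradients of this score would flow back into both encoders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentence_objective(v_i, u_prev, u_next, u_negs):
    """Negative-sampling objective (Equation 3): reward high dot
    products with the adjacent sentences and low dot products with
    randomly sampled negatives. v_i comes from one encoder; the u_*
    context vectors come from the second encoder."""
    v_i, u_prev, u_next = map(np.asarray, (v_i, u_prev, u_next))
    score = np.log(sigmoid(u_prev @ v_i)) + np.log(sigmoid(u_next @ v_i))
    # Monte Carlo estimate of the expectation over negatives drawn from P(s)
    score += sum(np.log(sigmoid(-np.asarray(u) @ v_i)) for u in u_negs)
    return float(score)
```

A sentence vector that aligns with its neighbors and opposes the negatives yields a higher objective than a mismatched one, which is exactly the signal used to fine-tune the encoder.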

Similarity Matrix
Once we obtain representations {v_1, v_2, ..., v_n} for sentences {s_1, s_2, ..., s_n} in document D, we employ the pair-wise dot product to compute an unnormalized similarity matrix Ē:

Ē_ij = v_iᵀ v_j    (4)

(We could also use cosine similarity, but we empirically found that the dot product performs better.) The final normalized similarity matrix E is defined based on Ē:

Ẽ_ij = Ē_ij − [ min(Ē) + β (max(Ē) − min(Ē)) ]
E_ij = Ẽ_ij if Ẽ_ij > 0, and 0 otherwise    (5)

Equation (5) aims to remove the effect of absolute values by emphasizing the relative contribution of different similarity scores. This is particularly important for the adopted sentence representations, which in some cases might assign very high values to all possible sentence pairs. The hyper-parameter β (β ∈ [0, 1]) controls the threshold below which the similarity score is set to 0.
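Equations (4) and (5) amount to a dot-product matrix followed by a shift-and-threshold step. A minimal sketch, assuming the sentence vectors have already been produced by the encoder:

```python
import numpy as np

def similarity_matrix(vectors, beta=0.6):
    """Pairwise dot products (Equation 4), then the shift-and-threshold
    normalization of Equation 5: scores below a beta-dependent fraction
    of the score range are zeroed, so only relatively strong
    similarities survive."""
    V = np.array(vectors, dtype=float)
    E_bar = V @ V.T                          # Equation (4)
    lo, hi = E_bar.min(), E_bar.max()
    E = E_bar - (lo + beta * (hi - lo))      # Equation (5): shift
    E[E < 0.0] = 0.0                         # Equation (5): threshold
    return E
```

With beta = 0 every score is merely shifted to be non-negative, while larger beta values prune progressively more of the weaker edges from the graph.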

Experimental Setup
In this section we present our experimental setup for evaluating our unsupervised summarization approach, which we call PACSUM as a shorthand for Position-Augmented Centrality based Summarization.

Datasets
We performed experiments on three recently released single-document summarization datasets representing different languages, document information distribution, and summary styles. Table 1 presents statistics on these datasets (test set); example summaries are shown in Table 5. The CNN/DailyMail dataset (Hermann et al., 2015a) contains news articles and associated highlights, i.e., a few bullet points giving a brief overview of the article. We followed the standard splits for training, validation, and testing used by supervised systems (90,266/1,220/1,093 CNN documents and 196,961/12,148/10,397 DailyMail documents). We did not anonymize entities.
The LEAD-3 baseline (selecting the first three sentences in each document as the summary) is extremely difficult to beat on CNN/DailyMail (Narayan et al., 2018b,a), which implies that salient information is mostly concentrated in the beginning of a document. NYT writers, in contrast, follow less prescriptive guidelines, and as a result salient information is distributed more evenly over the course of an article (Durrett et al., 2016). We therefore view the NYT annotated corpus (Sandhaus, 2008) as complementary to CNN/DailyMail in terms of evaluating the model's ability to find salient information. We adopted the training, validation, and test splits (589,284/32,736/32,739) widely used for evaluating abstractive summarization systems. However, as noted in Durrett et al. (2016), some summaries are extremely short and formulaic (especially those for obituaries and editorials), and thus not suitable for evaluating extractive summarization systems. Following Durrett et al. (2016), we eliminated documents with summaries shorter than 50 words. As a result, the NYT test set contains longer and more elaborate summary sentences than CNN/Daily Mail (see Table 1).
Finally, to showcase the applicability of our approach across languages, we also evaluated our model on TTNews (Hua et al., 2017), a Chinese news summarization corpus created for the shared summarization task at NLPCC 2017. The corpus contains a large set of news articles and corresponding human-written summaries which were displayed on the Toutiao app (a mobile news app). Because of the limited display space on the mobile phone screen, the summaries are very concise and typically contain just one sentence. There are 50,000 news articles with summaries and 50,000 news articles without summaries in the training set, and 2,000 news articles in the test set.

Implementation Details
For each dataset, we used the documents in the training set to fine-tune the BERT model; hyper-parameters (λ_1, λ_2, β) were tuned on a validation set consisting of 1,000 examples with gold summaries, and model performance was evaluated on the test set.
We used the publicly released BERT model (https://github.com/google-research/bert; Devlin et al., 2018) to initialize our sentence encoder. English and Chinese versions of BERT were respectively used for the English and Chinese corpora. As mentioned in Section 3.2, we fine-tune BERT using negative sampling; we randomly sample five negative examples for every positive one to create a training instance. Each mini-batch included 20 such instances, namely 120 examples. We used Adam (Kingma and Ba, 2014) as our optimizer, with the initial learning rate set to 4e-6.

Automatic Evaluation
We evaluated summarization quality automatically using ROUGE F1 (Lin and Hovy, 2003). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness and the longest common subsequence (ROUGE-L) as a means of assessing fluency. Table 2 summarizes our results on the NYT and CNN/Daily Mail corpora (examples of system output can be found in the Appendix). We forced all extractive approaches to select three summary sentences for fair comparison. The first block in the table includes two state-of-the-art supervised models. REFRESH (Narayan et al., 2018b) is an extractive summarization system trained by globally optimizing the ROUGE metric with reinforcement learning. POINTER-GENERATOR (See et al., 2017) is an abstractive summarization system which can copy words from the source text while retaining the ability to produce novel words. As an upper bound, we also present results with an extractive oracle system. We used a greedy algorithm similar to Nallapati et al. (2017) to generate an oracle summary for each document. The algorithm explores different combinations of sentences and generates an oracle consisting of multiple sentences which maximize the ROUGE score against the gold summary.
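The greedy oracle construction described above can be sketched as follows. A simple unigram-F1 scorer stands in for ROUGE here (the actual oracle maximizes ROUGE against the gold summary); the function names are our own:

```python
def unigram_f1(candidate, reference):
    """Stand-in scorer: unigram overlap F1 between two texts."""
    c, r = set(candidate.split()), set(reference.split())
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    p, rec = overlap / len(c), overlap / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def greedy_oracle(sentences, reference, score_fn=unigram_f1, max_sents=3):
    """Greedy oracle in the spirit of Nallapati et al. (2017): keep
    adding the sentence that most improves the score of the summary
    built so far, stopping when no remaining sentence helps."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        candidates = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            text = " ".join(sentences[j] for j in sorted(selected + [i]))
            candidates.append((score_fn(text, reference), i))
        if not candidates:
            break
        score, idx = max(candidates)
        if score <= best:   # no sentence improves the current score
            break
        best = score
        selected.append(idx)
    return sorted(selected)
```

The search is greedy rather than exhaustive, so it approximates rather than guarantees the maximum-scoring sentence combination.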

NYT and CNN/Daily Mail
The second block in Table 2 presents the results of the LEAD-3 baseline (which simply creates a summary by selecting the first three sentences in a document) as well as various instantiations of TEXTRANK (Mihalcea and Tarau, 2004). Specifically, we experimented with three sentence representations for computing sentence similarity. The first is based on tf-idf, where the value of the corresponding dimension in the vector representation is the number of occurrences of the word in the sentence times the idf (inverse document frequency) of the word; following gensim, we preprocessed sentences by removing function words and stemming. The second is based on the skip-thought model (Kiros et al., 2015), which exploits a type of sentence-level distributional hypothesis to train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded sentence; we used the publicly released skip-thought model (https://github.com/ryankiros/skip-thoughts) to obtain vector representations for our task. The third is based on BERT (Devlin et al., 2018) fine-tuned with the method proposed in this paper. Finally, to determine whether the performance of PageRank and degree centrality varies in practice, we also include a graph-based summarizer with DEGREE centrality and tf-idf representations. The third block in Table 2 reports results with three variants of our model, PACSUM, with sentence representations based on tf-idf, skip-thought vectors, and BERT. Recall that PACSUM uses directed degree centrality to decide which sentence to include in the summary. On both the NYT and CNN/Daily Mail datasets, PACSUM (with BERT representations) achieves the highest ROUGE F1 scores compared to other unsupervised approaches. (The ROUGE scores here on CNN/Daily Mail are higher than those reported in the original paper because we extract 3 sentences in Daily Mail rather than 4.)
This gain is more pronounced on NYT where the gap between our best system and LEAD-3 is approximately 6 absolute ROUGE-1 F1 points. Interestingly, despite limited access to only 1,000 examples for hyperparameter tuning, our best system is comparable to supervised systems trained on hundreds of thousands of examples (see rows REFRESH and POINTER-GENERATOR in the table).
As can be seen in Table 2, DEGREE (tf-idf) is very close to TEXTRANK (tf-idf). Due to space limitations, we only show comparisons between DEGREE and TEXTRANK with tf-idf; however, we observed similar trends across sentence representations. These results indicate that considering global structure does not make a difference when selecting salient sentences for NYT and CNN/Daily Mail, possibly due to the fact that news articles in these datasets are relatively short (see Table 1). The results in Table 2 further show that PACSUM substantially outperforms TEXTRANK across sentence representations, directly confirming our assumption that position information is beneficial for determining sentence centrality in news single-document summarization. In Figure 1 we further show how PACSUM's performance (ROUGE-1 F1) on the NYT validation set varies as λ_1 ranges from -2 to 1 (with λ_2 = 1 and β = 0, 0.3, 0.6). The plot highlights that differentially weighting a connection's contribution (via relative position) has a large impact on performance (ROUGE-1 F1 ranges from 0.30 to 0.40). In addition, the optimal λ_1 is negative, suggesting that similarity with the previous content actually hurts centrality in this case. We also observed that PACSUM improves further when equipped with the BERT encoder. This validates the superiority of BERT-based sentence representations (over tf-idf and skip-thought vectors) in capturing sentence similarity for unsupervised summarization. Interestingly, TEXTRANK performs worse with BERT. We believe this is caused by the problematic centrality definition, which fails to fully exploit the potential of continuous representations. Overall, PACSUM obtains improvements over baselines on both datasets, highlighting the effectiveness of our approach across writing styles (highlights vs. summaries) and narrative conventions. For instance, CNN/Daily Mail articles often follow the inverted pyramid format, starting with the most important information, while NYT articles are less prescriptive, attempting to pull the reader in with an engaging introduction and developing from there to explain a topic.

TTNews

Table 3 summarizes our results on the Chinese TTNews corpus using ROUGE F1 as the evaluation metric (R-1 and R-2 are shorthands for unigram and bigram overlap; R-L is the longest common subsequence). We report results with the variants of TEXTRANK (tf-idf) and PACSUM (BERT) which performed best on NYT and CNN/Daily Mail.
Since summaries in the TTNews corpus are typically one sentence long (see Table 1), we also limit our extractive systems to selecting a single sentence from the document. The LEAD baseline also extracts the first document sentence, while the ORACLE selects the sentence with the maximum ROUGE score against the gold summary in each document. We use the popular POINTER-GENERATOR system of See et al. (2017) as a supervised comparison. The results in Table 3 show that POINTER-GENERATOR is superior to unsupervised methods, and even comes close to the extractive oracle, which indicates that TTNews summaries are more abstractive compared to those in the English corpora. Nevertheless, even in this setting, which disadvantages extractive methods, PACSUM outperforms LEAD and TEXTRANK, showing that our approach is generally portable across different languages and summary styles. Finally, we show examples of system output for the three datasets in the Appendix.

Human Evaluation
In addition to automatic evaluation using ROUGE, we also evaluated system output by eliciting human judgments. Specifically, we assessed the degree to which our model retains key information from the document following a question-answering (QA) paradigm, which has previously been used to evaluate summary quality and document compression (Clarke and Lapata, 2010; Narayan et al., 2018b). We created a set of questions based on the gold summary, under the assumption that it highlights the most important document content. We then examined whether participants were able to answer these questions by reading system summaries alone, without access to the article. The more questions a system can answer, the better it is at summarizing the document.

Table 5: Example gold summaries and questions (answers shown after each question).

NYT Gold Summary: Marine Corps says that V-22 Osprey, hybrid aircraft with troubled past, will be sent to Iraq in September, where it will see combat for first time. The Pentagon has placed so many restrictions on how it can be used in combat that plane - which is able to drop troops into battle like helicopter and then speed away like airplane - could have difficulty fulfilling marines longstanding mission for it. Limitations on V-22, which cost $80 million apiece, mean it can not evade enemy fire with same maneuvers and sharp turns used by helicopter pilots.
Questions:
• Which aircraft will be sent to Iraq? V-22 Osprey
• What are the distinctive features of this type of aircraft? able to drop troops into battle like helicopter and then speed away like airplane
• How much does each V-22 cost? $80 million apiece

CNN/Daily Mail Gold Summary: "We're all equal, and we all deserve the same fair trial," says one juror. The months-long murder trial of Aaron Hernandez brought jurors together. Foreperson: "It's been an incredibly emotional toll on all of us."
Questions:
• Who was on trial? Aaron Hernandez
• Who said: "It's been an incredibly emotional toll on all of us"? Foreperson

TTNews Gold Summary: 皇马今夏清洗名单曝光，三小将租借外出，科恩特朗、伊利亚拉门迪将被永久送出伯纳乌球场. (Real Madrid's clear-out list was exposed this summer; three young players will be loaned out, and Coentrao and Illarramendi will permanently leave the Bernabeu Stadium.)
Question:
• 皇马今夏清洗名单中几人将被外租？ (How many players on Real Madrid's clear-out list will be loaned out this summer?) 三 (three)
For CNN/Daily Mail, we worked on the same 20 documents and associated 71 questions used in Narayan et al. (2018b). For NYT, we randomly selected 18 documents from the test set and created 59 questions in total. For TTNews, we randomly selected 50 documents from the test set and created 50 questions in total. Example questions (and answers) are shown in Table 5.
We compared our best system PACSUM (BERT) against REFRESH, LEAD-3, and ORACLE on CNN/Daily Mail and NYT, and against LEAD-3 and ORACLE on TTNews. Note that we did not include TEXTRANK in this evaluation as it performed worse than LEAD-3 in previous experiments (see Tables 2 and 3). Five participants answered questions for each summary. We used the same scoring mechanism from Narayan et al. (2018b), i.e., a correct answer was marked with a score of one, partially correct answers with a score of 0.5, and zero otherwise. The final score for a system is the average of all its question scores. Answers for English examples were elicited using Amazon's Mechanical Turk crowdsourcing platform while answers for Chinese summaries were assessed by in-house native speakers of Chinese. We uploaded the data in batches (one system at a time) on AMT to ensure that the same participant does not evaluate summaries from different systems on the same set of questions.
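The scoring mechanism reduces to a mean over per-question grades; a minimal sketch (the function name is ours):

```python
def system_score(grades):
    """QA-based evaluation score (scoring scheme of Narayan et al.,
    2018b): each answer is graded 1 (correct), 0.5 (partially correct),
    or 0 (wrong); a system's final score is the mean over all grades."""
    assert all(g in (0.0, 0.5, 1.0) for g in grades), "invalid grade"
    return sum(grades) / len(grades)
```

For example, a system whose summaries let annotators fully answer two of four questions and partially answer a third receives a score of 0.625.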
The results of our QA evaluation are shown in Table 4. ORACLE's performance is below 100, indicating that extracting sentences by maximizing ROUGE fails in many cases to select salient content, capturing surface similarity instead. PACSUM significantly outperforms LEAD but is worse than ORACLE, which suggests there is room for further improvement. Interestingly, PACSUM performs on par with REFRESH (the two systems are not significantly different).

Conclusions
In this paper, we developed an unsupervised summarization system which has very modest data requirements and is portable across different types of summaries, domains, or languages. We revisited a popular graph-based ranking algorithm and refined how node (aka sentence) centrality is computed. We employed BERT to better capture sentence similarity and built graphs with directed edges arguing that the contribution of any two nodes to their respective centrality is influenced by their relative position in a document. Experimental results on three news summarization datasets demonstrated the superiority of our approach against strong baselines. In the future, we would like to investigate whether some of the ideas introduced in this paper can improve the performance of supervised systems as well as sentence selection in multi-document summarization.