Unsupervised document summarization using pre-trained sentence embeddings and graph centrality

This paper describes our submission for the LongSumm task in SDP 2021. We propose a method for incorporating sentence embeddings produced by deep language models into extractive summarization techniques based on graph centrality in an unsupervised manner. The proposed method is simple, fast, can summarize any kind of document of any size, and can satisfy any length constraints for the summaries produced. The method offers competitive performance with more sophisticated supervised methods and can serve as a proxy for abstractive summarization techniques.


Introduction
Automatic text summarization is a long-standing and important task in Natural Language Processing (NLP) that has received continued attention since the creation of the field in the late 1950s (Luhn, 1958), mainly because of the ever-increasing size of text collections. The objective of the task is, given a document, to produce a shorter text with maximum information content, fluency and coherence. Summarization techniques can be classified as extractive or abstractive: in extractive summarization, the summary is composed exclusively of passages present in the original document, while in abstractive summarization the summary may contain words that did not appear in the original document.
Since the creation of the first neural language models (Bengio et al., 2003), vector representations of text that encode meaning (called embeddings) have played a significant role in NLP. They allow the application of statistical and geometrical methods to words, sentences and documents (Pennington et al., 2014; Mikolov et al., 2013; Reimers and Gurevych, 2019), leading to state-of-the-art performance on several NLP tasks like Information Retrieval, Question Answering or Paraphrase Identification. Among these neural language models, very deep pre-trained models like BERT (Devlin et al., 2018), T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020) have shown impressive performance in tasks like language modelling and text generation, and on benchmarks like GLUE (Wang et al., 2018).
An important variation of extractive summarization that goes back as far as the late 90's (Salton et al., 1994, 1997) utilizes graphs, where the nodes represent text units and the links represent some measure of semantic similarity. These early graph-based summarization techniques involved creating a graph where the nodes were the sentences or paragraphs of a document and two nodes were connected if the corresponding text units had a similar vocabulary. After creating the document graph, the system created a summary by starting at the first paragraph and following random walks defined by different algorithms that tried to cover as much of the graph as possible. A more evolved approach was lexical centrality (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wolf and Gibson, 2004), a measure of the importance of a passage in a text where the sentences of the document are connected by the similarity of their vocabularies.
The current state of the art in automatic summarization with graphs is mainly based on algorithms like PageRank (Brin and Page, 1998) enhanced with statistical information about the terms in the document (as in Ramesh et al., 2014), or on Graph Neural Networks (Kipf and Welling, 2016) built on top of deep language models (as in Xu et al., 2019).
Only two systems from the previous Scholarly Document Processing workshop held in 2020 are based on graphs: CIST-BUPT and Monash-Summ.
In CIST-BUPT, they used Recurrent Neural Networks to create sentence embeddings that are used to build a graph, which is then fed into a Graph Convolutional Network (Kipf and Welling, 2016) and a Graph Attention Network (Veličković et al., 2018) to create extractive summaries. To generate abstractive summaries, they used the gap-sentence method of Zhang et al. (2019) to fine-tune T5 (Raffel et al., 2020).
In Monash-Summ (Ju et al., 2020), they propose an unsupervised approach that leverages linguistic knowledge to construct a sentence graph like in SummPip (Zhao et al., 2020). The graph nodes, which represent sentences, are further clustered to control the summary length, while the final abstractive summary is created from the key phrases and discourse from each cluster.
This work focuses on extractive summarization using graphs, leveraging sentence embeddings produced by pre-trained language models. The essential idea is that, while the sentence embeddings produced by SBERT (Reimers and Gurevych, 2019) are not well suited for clustering algorithms like Hierarchical Clustering or DBSCAN (Ester et al., 1996), they produce excellent results in Paraphrase Identification and Semantic Textual Similarity when compared using cosine similarity, which implies that they can be used along with graph centrality methods. The text summarization method proposed in this paper has the following contributions:

• Is unsupervised and can be used as a proxy for more advanced summarization methods.
• Can easily scale to arbitrarily large amounts of text.
• Is fast and easy to implement.
• Can fit any length requirements for the production of summaries.

Methodology
In this section, we describe how the system works.
The system is composed of three main steps: first, we use SBERT to produce sentence embeddings for every sentence in the document to summarize; next, we form a graph by comparing all the pairs of sentence embeddings obtained and finally, we rank the sentences by their degree centrality in this graph. Fig. 1 gives an overview of the whole method.

Sentence tokenization
The first step of our pipeline is to split the input text into a list of sentences.

Figure 1: The complete pipeline of the proposed method: Document → Tokenization → Sentence Embeddings → Graph Generation → Ranking → Selection → Summary. In the first step, we split the input text into sentences by using a regular expression handcrafted specifically for scientific documents. In the second step, we compute the sentence embeddings of the parsed sentences using SBERT. In the third step, we create a graph by comparing all the pairs of sentence embeddings obtained using cosine similarity. In the fourth step, we rank the sentences by their degree centrality in the generated graph. In the fifth and final step, we only keep a certain number of sentences or words to adjust to the length requirements of the summary.

This step is critical because if the sentences are too long, the final summary will have a lot of meaningless content (therefore losing precision). However, if the sentences are too short, there is a risk of not having enough context to produce an accurate sentence embedding, or of extracting meaningless sequences, like data in tables or numbers that lie in the middle of the text. We found that the function sent_tokenize() from the NLTK package (Bird et al., 2009) often failed because of the numbers in tables and abbreviations like "et al.", which are very common in scientific literature. Because of this, we used a regular expression handcrafted specifically to split the text found in scientific documents.
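The paper's exact regular expression is not reproduced here, so the following Python sketch is only an illustration of the idea under our own assumptions (the abbreviation list and the function name split_sentences are ours): protect common scientific abbreviations, split on sentence-ending punctuation followed by whitespace and a capital letter, and restore the abbreviations afterwards.

    import re

    # Illustrative abbreviation list; the authors' actual pattern is not published.
    _ABBREV = ["et al.", "e.g.", "i.e.", "Fig.", "Eq.", "cf.", "vs."]

    def split_sentences(text):
        # Mask the periods inside abbreviations so they do not trigger a split.
        for i, abbr in enumerate(_ABBREV):
            text = text.replace(abbr, abbr.replace(".", f"<DOT{i}>"))
        # Split after ., ! or ? when followed by whitespace and a capital letter;
        # decimal numbers like "3.14" are left intact because no whitespace
        # follows their period.
        parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
        # Restore the masked periods and drop empty fragments.
        sentences = []
        for part in parts:
            for i in range(len(_ABBREV)):
                part = part.replace(f"<DOT{i}>", ".")
            sentences.append(part.strip())
        return [s for s in sentences if s]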

Computing sentence embeddings
After extracting the sentences, the next step is to produce the sentence embedding of each sentence using SBERT (Reimers and Gurevych, 2019), a Transformer-based (Vaswani et al., 2017) model built on top of BERT (Devlin et al., 2018) that takes sentences as input and produces sentence embeddings that can be compared with cosine similarity, given by the following formula: sim(x, y) = (x · y) / (|x| |y|).
As shown in (Reimers and Gurevych, 2019), these sentence embeddings are superior in quality to taking the CLS token of BERT or averaging the embeddings of the words in the sentence produced by BERT, GloVe (Pennington et al., 2014) or Word2Vec (Mikolov et al., 2013). SBERT, like BERT, was pre-trained on a large general text collection to learn good sentence embeddings, but it has to be fine-tuned on a more specific data set according to the task. Since we are working with scientific papers, we picked the "base" version of RoBERTa that was fine-tuned on the MS MARCO data set (Bajaj et al., 2016) for the Information Retrieval task.
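A minimal sketch of this step with the sentence-transformers library follows; the checkpoint name msmarco-roberta-base-v2 is our guess at a public model matching this description (a base RoBERTa fine-tuned on MS MARCO), and the input file name is a placeholder.

    from sentence_transformers import SentenceTransformer

    # The checkpoint name is an assumption; the paper only specifies
    # "base RoBERTa fine-tuned on MS MARCO".
    model = SentenceTransformer("msmarco-roberta-base-v2")

    # split_sentences() is the tokenizer sketched in the previous section.
    sentences = split_sentences(open("paper.txt").read())
    embeddings = model.encode(sentences, convert_to_numpy=True)
    print(embeddings.shape)  # (number of sentences, embedding dimension)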

Generation of the sentence graph
After the sentence embeddings have been produced, the next step is to produce a weighted complete graph with a node for each sentence in the text. Its edges are weighted according to the cosine similarities of the corresponding sentence embeddings. An example graph is depicted in Fig. 2. To build this graph, the first step is to gather all the pairwise cosine similarities in a matrix. Let D = (s_1, s_2, ..., s_n) be a document. Using SBERT, we produce a sequence of vectors (e_1, e_2, ..., e_n), where e_i is the sentence embedding of s_i. Then, we can compute the matrix A, where A[i, j] = 1 − sim(e_i, e_j).
We make the following observations:

• The diagonal of A is composed exclusively of zeros, because A[i, i] = 1 − sim(e_i, e_i) = 0.

• All the entries in A are non-negative, because −1 ≤ sim(e_i, e_j) ≤ 1, and therefore 0 ≤ 1 − sim(e_i, e_j) ≤ 2.

These observations imply that the matrix A can be interpreted as the adjacency matrix of a weighted complete graph G = (V, E), where V = {s_1, s_2, ..., s_n}, E = {(s_i, s_j) | s_i, s_j ∈ V} and the edges are weighted by the function w(s_i, s_j) = 1 − sim(e_i, e_j).
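Since A depends only on pairwise cosine similarities, it can be computed with a single matrix product. A minimal NumPy sketch (the function name is ours):

    import numpy as np

    def adjacency_matrix(embeddings):
        # Normalize each embedding to unit length so that the dot product
        # of two rows equals their cosine similarity.
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sim = unit @ unit.T  # sim[i, j] = sim(e_i, e_j)
        return 1.0 - sim     # A[i, j] = 1 - sim(e_i, e_j)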

Ranking by centrality
The fourth step is to assign a score to each sentence that allows us to sort them by their importance in the document. To this end, we define the importance rank of each sentence as

rank(s_i) = Σ_j A[i, j] = Σ_j (1 − sim(e_i, e_j)),    (1)

where e_i and e_j are the corresponding SBERT sentence embeddings of s_i and s_j.
To motivate this definition, we observe that adding the entries of the matrix A column-wise (or, equivalently, row-wise, since A is symmetric) naturally yields a ranking of the nodes of G that generalizes degree centrality. However, in our ranking, the most "central" sentences (sentences that are similar to many other sentences in the document) have lower scores than the ones that are less "central". To further support this definition, we observe that if G were an undirected, unweighted simple graph (that is, the entries of A are either 0 or 1, A is symmetric and only has zeros on its diagonal), then we would have

rank(v_i) = Σ_j A[i, j] = deg(v_i),    (2)

which is the definition of the degree of node v_i and is clearly a (somewhat crude) measure of the importance of the node in the graph.
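In code, the ranking of Eq. (1) reduces to a row sum of A (equal to the column sum by symmetry); the function name is ours:

    def rank_sentences(A):
        # Eq. (1): the score of sentence i is the sum of its distances to
        # all other sentences; lower scores mean more central sentences.
        scores = A.sum(axis=1)
        order = scores.argsort()  # sentence indices, most central first
        return scores, order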
It is important to note that for scientific papers, which contain around 300 sentences, the proposed method takes around 1 second for the whole process. This implies that there is no obstacle to applying this method to longer documents, since producing the sentence embeddings with the SBERT implementation is very efficient, and the remaining work consists only of comparing all pairs of sentence embeddings, which can be done with highly efficient linear algebra libraries.

Summary selection
The final step in the method is to select the sentences that are going to form the summary. To do this, we can either keep only the sentences in the bottom n-th percentile of scores (as opposed to the top, since in our method a lower score means that the sentence is more important in the document), or concatenate the ranked sentences in ascending order of score (so that the most important sentences come first) and take the first k words to satisfy a word-length constraint on the summaries.
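A sketch of both selection strategies under the definitions above (the function names are ours; the concatenate-and-truncate variant is the one submitted to the task):

    import numpy as np

    def select_by_percentile(sentences, scores, p=2.0):
        # Keep the sentences whose scores fall in the bottom p percent
        # (the most central ones), preserving the original document order.
        threshold = np.percentile(scores, p)
        return [s for s, sc in zip(sentences, scores) if sc <= threshold]

    def select_by_words(sentences, scores, max_words=600):
        # Concatenate the sentences from most to least central, then
        # truncate the output to the first max_words words.
        order = np.argsort(scores)
        words = " ".join(sentences[i] for i in order).split()
        return " ".join(words[:max_words])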

Experimental setup

Data set
Since our method is for unsupervised extractive summarization, we only used the extractive summaries in the TalkSumm data set (Lev et al., 2019) to estimate the appropriate threshold value for the sentence selection phase. As suggested in the task, we used science-parse (AllenAI, 2019) to extract the text of the scientific articles and split it into sections. Given that the objective of the task is to produce long summaries for the documents, we discarded the title and abstract and then took as input for the algorithm the remaining text as a single block.

Evaluation
As is customary in summarization tasks, we used ROUGE (Lin, 2004) in its variations ROUGE-1, ROUGE-2 and ROUGE-L.
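The paper does not state which ROUGE implementation was used; a common choice is Google's rouge-score package, sketched below with placeholder strings for the gold and system summaries.

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    # score(target, prediction); the two strings here are placeholders.
    result = scorer.score("the reference summary", "the generated summary")
    print(result["rougeL"].fmeasure, result["rougeL"].recall)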

Percentile threshold in the selection phase
We tried p ∈ {1, 1.5, 2, 2.5, 5, 10, 15} as the value of the bottom percentile of sentences to keep for the final summary, truncating the output to satisfy the 600-word limit of the task when the summary was longer. It is important to note that the freedom in this parameter allows the system to produce summaries of arbitrary length, depending on the task at hand.
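A sketch of this sweep, reusing the helpers defined in the previous sections; evaluate() stands in for whatever scoring routine is used (for example, the ROUGE call above) and is hypothetical.

    # Hypothetical sweep over the bottom-percentile parameter p.
    for p in [1, 1.5, 2, 2.5, 5, 10, 15]:
        selected = select_by_percentile(sentences, scores, p=p)
        summary = " ".join(selected).split()[:600]  # 600-word task limit
        print(p, evaluate(" ".join(summary)))       # evaluate() is hypothetical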

Results
Overall, we observed that the 600-word constraint of the task prevented our method from performing better, since the best summaries it produces are considerably longer (around 1,000 words or more). Table 1 displays the performance of the method variations that we submitted to the task.

Table 1: Performance of the different variations of the proposed method submitted to the task (columns: Bottom %, R-1 F, R-1 R, R-2 F, R-2 R, R-L F, R-L R). In this setting, the ranked sentences were sorted from most to least central and concatenated to form a preliminary output, which was truncated at 600 words to comply with the task's requirements. The "Bottom %" column displays the percentile used in the sentence selection phase of the method. R-N F stands for the F-measure of ROUGE-N, while R-N R stands for the recall of ROUGE-N.

Conclusion and Future Work
The method introduced in this work displays competitive performance with more sophisticated methods and can be useful when there is not enough labelled data to train a deep neural summarization system, while being fast, simple and efficient. Overall, we observed that the precision component of ROUGE for the proposed method has much room for improvement, as having sentences as the minimal text units prevents it from filtering out the less important phrases. Another important future direction is to reduce the redundancy of the summaries, as it is common to have several versions of the same important sentence scattered across a document, so that all these versions appear in the final summary.