CNLP-NITS @ LongSumm 2021: TextRank Variant for Generating Long Summaries

The huge influx of published papers in the field of machine learning makes the task of summarizing scholarly documents vital, not just to eliminate redundancy but also to provide a complete and satisfying crux of the content. We participated in LongSumm 2021: The 2^{nd} Shared Task on Generating Long Summaries for scientific documents, where the task is to generate long summaries for scientific papers provided by the organizers. This paper discusses our extractive summarization approach to the task. We used the TextRank algorithm with the BM25 score as the similarity function. Despite being a graph-based ranking algorithm that requires no learning, TextRank produced quite decent results with minimal compute power and time. We attained 3^{rd} rank according to ROUGE-1 scores (0.5131 for F-measure and 0.5271 for recall) and performed decently as shown by the ROUGE-2 scores.


Introduction
Text summarization, or condensing large pieces of text into a comparatively small number of words, is a challenging machine learning (ML) task that has gained significant traction in recent years. The applications are immense and diverse, from condensing and comparing legal contractual documents to summarizing medical and clinical texts. The two approaches (Maybury, 1999) commonly adopted for this task are:
• Extractive summarization: Here, those unmodified segments of the original text that play the most significant role in expressing the salient content of the entire text are extracted and concatenated. This technique is mostly used for generating comparatively longer summaries.
• Abstractive summarization: Here an abstract semantic representation of the original content is formed by the model which helps generate novel words/phrases for the summary by text generation and paraphrasing methods. This technique is often useful for generating concise summaries.
Recently, the task of summarizing scholarly documents has grasped the attention of researchers due to the vast quantity of papers published every day, especially in the field of machine learning. This makes it challenging for researchers and professionals to keep up with the latest developments in the field. Thus, the task of summarizing scientific papers aims not just to avoid redundancy in text and generate shorter summaries, but also to cover all the salient information present in the document, which often demands longer summaries. This would aid researchers in grasping the contents of a paper beyond abstract-level information without reading the entire paper.
Prior work on summarization of scientific documents is mostly targeted towards the generation of short summaries, but as mentioned before, longer summaries are required in order to encompass all the important ideas. The LongSumm 2021 shared task, on the other hand, aims to encourage researchers to focus on generating longer-form summaries for scientific papers.
Since, as mentioned before, extractive summarization methods are better suited to generating longer-form summaries than abstractive methods, in this paper we summarize scientific documents using the extractive TextRank (Mihalcea and Tarau, 2004) algorithm. It is a graph-based ranking algorithm that ranks the sentences in a document according to their importance in conveying the information of the document. Different 'similarity' functions can be used while creating the graph, leading to varied results (Barrios et al., 2016); we therefore chose BM25 as the similarity function.

Related Works
Upon scrutinizing various approaches to document summarization, we found several relevant works in the field. One of these is (Christensen et al., 2013), which describes extractive summarization as a joint process of selection and ordering. It uses a graph as its central structure, approximating discourse relatedness via co-reference, deverbal nouns, etc. Similar work is presented by (Li et al., 2011), (Goldstein et al., 2000) and (Barzilay et al., 1999). Other works use the TextRank algorithm (Mihalcea and Tarau, 2004), inspired by Google's PageRank, to order the extracted text. One of these works (Mallick et al., 2019) uses a modified TextRank over a graph structure to extract contextual information: sentences serve as the nodes of the graph and inverse cosine similarity forms the weights of its edges, and this graph is passed as input to the TextRank algorithm, which generates the required summary. A similar approach is followed by (Ashari and Riasetiawan, 2017), which combines TextRank with semantic networks to form extractive summaries that preserve semantic relations.
Some of the works, like (Nallapati et al., 2017) and (Al-Sabahi et al., 2018), use the capabilities of neural networks to semantically extract the information from the document and present it in human-readable form. One of the works (Nallapati et al., 2016) uses a joint framework of classification and selection on the textual data to form summaries: the selector framework selects sentences from the document and places them in the summary, whereas the classifier architecture decides whether a particular sentence in the sequence (as selected by the selector) will be part of the summary or not.
Apart from these, varied approaches were adopted by the participants of the previous edition of the shared task, LongSumm 2020, as mentioned in (Chandrasekaran et al., 2020). For instance, a divide and conquer approach, DANCER, was used in (Gidiotis et al., 2020) to summarize key sections of the paper separately and combine them through a PEGASUS based transformer to generate the final summary. Another team (Ghosh Roy et al., 2020) used a neural extractive summarizer to summarize each section separately. A different team utilized the BERT summarizer as shown in (Sotudeh Gharebagh et al., 2020). The main idea was based on multi-task learning heuristic in which two tasks are optimized, namely the binary classification task of sentence selection and the section prediction of input sentences. They also suggested an abstractive summarizer based on the BART transformer that runs after the extractive summarizer. Other methods were Convolutional Neural Network (CNN) in (Reddy et al., 2020), Graph Convolutional Network (GCN) and Graph Attention Network (GAN) in (Li et al., 2020), and unsupervised clustering in (Mishra et al., 2020) and (Ju et al., 2020).

Description
The LongSumm dataset is distinctive in that it consists of scientific documents containing scientific jargon targeted at a niche audience, unlike other summarization corpora such as news articles written for the general public. For the same reason, it is difficult to find domain-specific scientific documents paired with longer-form summaries that cover all their important details in a concise manner.
The corpus provided by the organizers of LongSumm 2021 includes a training set consisting of 1,705 extractive summaries and 531 abstractive summaries of NLP and machine learning scientific papers. The extractive summaries are based on video talks (Chandrasekaran et al., 2020) from associated conferences, while the abstractive summaries are blog posts created by NLP and ML researchers.
We used TextRank (Mihalcea and Tarau, 2004), a graph-based ranking model for ranking sentences in a document, for extractive summarization; therefore, only the extractive summaries were used as validation data. These summaries are based on the TalkSumm (Lev et al., 2019) dataset, which contains 1,705 automatically generated noisy extractive summaries of scientific papers from the NLP and machine learning domain, based on video talks from associated conferences (such as ACL, NAACL, ICML). URL links to the papers and their summaries can be found in the GitHub repository devoted to this shared task (https://github.com/guyfe/LongSumm/tree/master/extractive_summaries). Each summary provides the top 30 sentences and is, on average, around 990 words long.
Another list of 22 papers was provided as blind test data (listed in the shared task's GitHub repository, https://github.com/guyfe/LongSumm). The summaries generated for these papers were used for evaluation. ROUGE-1, ROUGE-2 and ROUGE-L scores were used to evaluate the performance of the system.

Preprocessing
After retrieving the text from the papers (links to which were provided by the organizers), the sections before 'Introduction' (such as Authors and Abstract) and after 'Conclusion/Results' (such as References and Acknowledgements) were removed, as the text in these sections does not add much value to the summary compared to the remaining sections of the paper. Further, citation indices, hyperlinks, newline and redundant white-space characters were eliminated.
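As a minimal sketch of this cleaning stage (the patterns below, e.g. for bracketed citation indices, are illustrative assumptions rather than the exact code we used):

```python
import re

def preprocess(text: str) -> str:
    """Clean extracted paper text before summarization."""
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)  # citation indices like [3] or [1, 2]
    text = re.sub(r"https?://\S+", "", text)         # hyperlinks
    text = re.sub(r"\s+", " ", text)                 # newlines and redundant white-space
    return text.strip()
```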

System Description
Our approach was essentially to use the TextRank algorithm (Mihalcea and Tarau, 2004) to rank the sentences according to their relevance to the whole text and to use the most significant (highest-ranked) sentences as the summary.

TextRank
TextRank is a graph-based ranking algorithm that has proven quite effective for keyword and sentence extraction from natural language texts.
According to (Mihalcea and Tarau, 2004), for sentence extraction a graph is constructed for the given document in which each vertex represents an entire sentence. The semantic links amongst the vertices are then identified by the "similarity" between the sentences, where "similarity" is measured as a function of their content overlap. Formally, for two sentences $S_a = w^a_1, w^a_2, \ldots, w^a_{N_a}$ with $N_a$ words and $S_b = w^b_1, w^b_2, \ldots, w^b_{N_b}$ with $N_b$ words, the similarity defined in (Mihalcea and Tarau, 2004) is:

$$\mathrm{Similarity}(S_a, S_b) = \frac{\left|\{\, w_k \mid w_k \in S_a \wedge w_k \in S_b \,\}\right|}{\log(|S_a|) + \log(|S_b|)}$$

The text in the document can thus be represented as a weighted graph on which the ranking algorithm is run to sort the vertices (each representing a sentence in the text) in decreasing order of the obtained score, from which the 30 most significant sentences are selected and presented in the same order as they appear in the document.
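This ranking can be sketched with a small, self-contained power-iteration PageRank over the overlap-similarity graph. The function names are ours, and the sketch illustrates the original TextRank formulation rather than the BM25 variant our system actually used:

```python
import math
from itertools import combinations

def overlap_similarity(sa: list[str], sb: list[str]) -> float:
    """Content overlap normalized by sentence lengths (Mihalcea and Tarau, 2004)."""
    common = len(set(sa) & set(sb))
    if common == 0 or len(sa) < 2 or len(sb) < 2:
        return 0.0
    return common / (math.log(len(sa)) + math.log(len(sb)))

def textrank(sentences: list[list[str]], d: float = 0.85, iters: int = 50) -> list[float]:
    """Score each tokenized sentence with weighted PageRank on the similarity graph."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = overlap_similarity(sentences[i], sentences[j])
    out_sum = [sum(row) or 1.0 for row in w]  # guard against isolated vertices
    scores = [1.0] * n
    for _ in range(iters):
        scores = [(1 - d) + d * sum(w[j][i] * scores[j] / out_sum[j]
                                    for j in range(n))
                  for i in range(n)]
    return scores
```

The 30 highest-scoring sentences would then be emitted in their original document order.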

Gensim TextRank Summarizer
Variants of the similarity function can be chosen to obtain improved results; an analysis is presented in (Barrios et al., 2016). Different similarity functions, including LCS (Longest Common Substring), cosine similarity, BM25 (Robertson et al., 1994), BM25+ (Lv and Zhai, 2011) and the original TextRank similarity function, were evaluated there using ROUGE-1, ROUGE-2 and ROUGE-SU4 as metrics, and the best results were obtained using BM25 and BM25+.
The Summarizer module of the Gensim project uses the BM25-TextRank algorithm for summarization; we therefore proceeded with this implementation of TextRank to prepare the summaries. BM25 is a variation of the TF-IDF model based on a probabilistic model. Given two sentences $R$ and $S$, BM25 is defined as:

$$\mathrm{BM25}(R, S) = \sum_{i=1}^{n} \mathrm{IDF}(s_i) \cdot \frac{f(s_i, R) \cdot (k + 1)}{f(s_i, R) + k \cdot \left(1 - b + b \cdot \frac{|R|}{L_{avg}}\right)}$$

where $k$ and $b$ are parameters, $L_{avg}$ is the average length of the sentences in the document, and $f(s_i, R)$ is the term frequency (TF) of term $s_i$ in $R$. IDF is computed with a correction formula that avoids negative values:

$$\mathrm{IDF}(s_i) = \begin{cases} \log\left(\dfrac{N - n(s_i) + 0.5}{n(s_i) + 0.5}\right) & \text{if } n(s_i) \le \frac{N}{2} \\[1ex] \varepsilon \cdot \mathrm{avgIDF} & \text{if } n(s_i) > \frac{N}{2} \end{cases}$$

where $N$ is the number of sentences, $n(s_i)$ is the number of sentences containing $s_i$, $\varepsilon$ takes a value between 0.3 and 0.5, and avgIDF is the average IDF for all terms.
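A from-scratch sketch of this scoring, where the function name and the defaults k = 1.2, b = 0.75, ε = 0.3 are illustrative assumptions (our actual system relied on Gensim's built-in implementation):

```python
import math
from collections import Counter

def bm25(r: list[str], s: list[str], corpus: list[list[str]],
         k: float = 1.2, b: float = 0.75, eps: float = 0.3) -> float:
    """BM25 score of sentence S against sentence R, with sentences as 'documents'."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    df = Counter(t for doc in corpus for t in set(doc))  # per-term document frequency
    idf = {t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    avg_idf = sum(idf.values()) / len(idf)
    tf = Counter(r)
    score = 0.0
    for term in s:
        f = tf[term]
        if f == 0:
            continue
        w = idf.get(term, 0.0)
        if w < 0:  # negative-IDF correction: fall back to eps * avgIDF
            w = eps * avg_idf
        score += w * f * (k + 1) / (f + k * (1 - b + b * len(r) / avg_len))
    return score
```

In practice the Gensim 3.x summarizer exposes this through `gensim.summarization.summarizer.summarize`; note that the `summarization` module was removed in Gensim 4.0.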

Result
The participating systems were evaluated by ROUGE (Lin, 2004) scores, specifically the ROUGE-1, ROUGE-2 and ROUGE-L metrics. Our team's name was CNLP-NITS; on the blind test data of 22 papers, our system using TextRank with BM25 similarity attained 3^{rd} rank according to ROUGE-1 scores (0.5131 F-measure and 0.5271 recall) while remaining competitive on ROUGE-2.
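For reference, ROUGE-N reduces to n-gram overlap between a candidate and a reference summary; a minimal illustrative implementation follows (the helper name is ours, and the official evaluation used the standard ROUGE toolkit):

```python
from collections import Counter

def rouge_n(candidate: list[str], reference: list[str], n: int = 1):
    """Return (recall, precision, F1) for ROUGE-N over tokenized texts."""
    def ngrams(tokens: list[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1
```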

Analysis
Individual ROUGE scores for each paper in the training set were calculated to obtain the average scores. We then compared the predicted and reference summaries for a paper with one of the best ROUGE scores (given in Table 3) and for a paper with one of the worst ROUGE scores (given in Table 4). The complete summaries are available at https://bit.ly/3cWCmjo and https://bit.ly/3f2XO9i.

Conclusion and Future Work
In this paper we targeted our efforts towards the TextRank algorithm in order to generate long extractive summaries of given scientific research papers. Our approach, TextRank with the BM25 similarity function, despite not being a learning algorithm, achieved appreciable ROUGE-1 scores while remaining competitive in ROUGE-2 scores. As TextRank is a graph-based ranking algorithm that ranks the sentences independently for each document, it requires no training and is thus compute- and time-efficient.
Although we approached the task with an algorithm that requires no training and still produced substantial results, there is definitely scope for leveraging the training data: neural network based learning algorithms could gather a general semantic structure from a collection of documents as a whole instead of working on each document independently. This will be our prime focus for future work in extractive text summarization. Nonetheless, through our participation in LongSumm 2021 we optimized the TextRank algorithm, put it to the test against the learning-based approaches of other teams, and achieved significant results with comparatively low machine and time requirements.