Extractive Research Slide Generation Using Windowed Labeling Ranking

Presentation slides generated from original research papers provide an efficient form to present research innovations. Manually generating presentation slides is labor-intensive. We propose a method to automatically generate slides for scientific articles based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures the importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods including SummaRuNNer by a significant margin in terms of ROUGE score.


Introduction
It has become common practice for researchers to use slides as a visual aid in presenting research findings and innovations. Such slides usually contain bullet points that the researchers believe to be important to show. These bullet points serve both as a reminder to the speaker during the presentation and as summaries for the audience. Manually creating a set of high-quality slides from an academic paper is time-consuming. We propose a method that automatically selects salient sentences that could be included in the slides, with the purpose of reducing the time and effort required for slide generation.
The main challenge in solving this problem is to accurately extract the main points from an academic paper. This is due to the limitations of existing methods in fully encoding the semantics of sentences and the implicit relations between sentences. Here, we propose an extractive summarizer that identifies the best sentences in a set of consecutive sentence windows. The selection process depends on the importance and novelty of each sentence, which are modeled by neural networks. The selected sentences and their frequent noun phrases are then structured in a layered format to make the bullet points of the slides.
Presentation slides are usually created with multiple bullet points organized in a multi-level hierarchical structure, typically with phrases summarizing high-level topics at the first level and bullets at the second and lower levels for further clarification or details. Statistical analysis of our training data set shows that more than 92% of the bullets are at the first and second levels and fewer than 8% are at the third level. Therefore, we build our presentations with two levels of bullet points only.
Our contribution is threefold.
• Propose a system that utilizes sentences with high rankings for generating presentation slides for research papers and is used as a starting point in the slide generation process.
• Create and provide PS5K, a corpus of 5000 paper-slide pairs in the field of computer and information science. To the best of our knowledge, this is the largest paper-slide dataset and can be used for training and evaluating slide generation models.
• Propose a novel method to rank sentences within a sentence window, which improved an existing state-of-the-art text-summarization method by a significant margin.

Related Work
Summarizing scholarly articles in presentation slides is different from standard text summarization (Xiao and Carenini, 2019), which focuses on generating a paragraph of free-text summary from a longer document. Automatic slide generation can be done by first extracting salient sentences in a hierarchical order and grouping them into slides that are sequentially aligned with the original paper. PPSGen (Hu and Wan, 2014) is a framework that automatically generates presentation slides from scientific papers. It applies a Support Vector Regressor and Integer Linear Programming (ILP) to rank and select important sentences. Wang et al. (2017) generate slides by extracting phrases from papers and learning the hierarchical relationship between pairs of phrases to build the structure of bullet points. Their model is trained on a small set of 175 paper-slide pairs. The slideSeer (Kan, 2007) project crawled more than 10,000 paper-slide pairs by using the Google APIs to search for the slides of papers with their titles as search queries. The full set of data is not publicly available (only 20 pairs are available). Compared with previous work, our model is trained and tested on a relatively large set of 5000 paper-slide pairs, and the dataset will be publicly available for future work. There has also been some work on the alignment of presentation slides to article sections (Hayama et al., 2005; Kan, 2007; Beamer and Girju, 2009).
SummaRuNNer (Nallapati et al., 2017) is a neural extractive summarizer that treats the summarization task as a sequence labeling problem. SummaRuNNer was evaluated on the CNN/Daily Mail corpus, which contains news articles that are shorter than research papers. We improve upon the SummaRuNNer model for the summarization of scientific papers.

Data
Producing a large dataset for summarization of scientific documents is challenging and requires domain experts to make the summaries. The latest CL-Scisumm 2018 summarization task contains only 40 NLP papers with human-annotated reference summaries. Recently, ScisummNet (Yasunaga et al., 2019) expanded the CL-Scisumm to 1000 scientific articles. TalkSum (Lev et al., 2019) summarizes scientific articles based on the transcripts of the presentation talks at conferences.
Using presentation slides made by the authors is promising for the training of deep neural summarization models as more conferences are providing slides with papers.
We crawled more than 5,000 paper-slide pairs from a manually curated list of websites, e.g., usenix.org and aclweb.org. GROBID (Lopez, 2009) is used to extract metadata and the body text from scientific papers in PDF format. Presentations are transformed from PDF or PPT format to XML by Apache Tika. The Tika XML files are divided into pages and the text is extracted using Optical Character Recognition (OCR) tools. Most venues of papers in our dataset are in computational linguistics, systems, and system security. In our dataset, there are on average 35 slide pages per presentation and 8 lines of text per slide page. The majority (75%) of papers were published between 2013 and 2019. We used this dataset (called PS5K) to train summarization models to identify important parts of the input document at the sentence level.

Method
Generating slides requires identifying important sentences of the input scientific article and consists of three main steps. The first is to label salient sentences in the paper that are literally similar to corresponding slides. The second is to train the model to rank sentences and the final step selects salient sentences based on the predicted scores, size of the summary and the length of the sentences. Afterwards, frequent noun phrases are extracted from the selected sentences to shape the hierarchical structure of the bullet points. The architecture of our model is shown in Figure 1.

Sentence Labeling
The text in manually generated slides may not be directly extracted from the original paper. Instead, text can be truncated, summarized, or rephrased. Therefore, we need to generate extractive labels for sentences of the input document. The sentence labeling process attempts to identify salient sentences that are semantically similar to the corresponding slides. This generates an extractive summary, which will be used as the ground truth for training and evaluation.
The problem is formalized as follows. A research paper is represented as a sequence of $n$ sentences $D = \{s_1, s_2, \ldots, s_n\}$, each having a label $y_i \in \{0, 1\}$. The system predicts $p(y_i = 1)$, the probability of including sentence $i$ in the summary. SummaRuNNer treats the summarization task as a sequence labeling problem: if adding a sentence to the summary improves the ROUGE score, the sentence is labeled 1; otherwise it is labeled 0. This method is suitable for news articles such as CNN/DailyMail (Nallapati et al., 2016), where the first couple of sentences usually cover the main content. Scholarly papers usually have a hierarchical structure of sections. Each section should have its own summary as a part of the summary of the entire paper. Therefore, the labeling process should be adapted to distribute positive labels across all sections of the paper. However, accurately parsing sections of open-domain scholarly papers is non-trivial. We therefore propose a windowed labeling approach, in which ranking is performed only within a series of non-overlapping text windows, each of which contains $w$ consecutive sentences. Within a window, a sentence is labeled 1 if adding it increases the ROUGE-1 score. The best window size is determined empirically by trying different window sizes and calculating the ROUGE score between the selected sentences and the presentation slides. Section 5 elaborates on the experiments performed to select the best window size.
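The windowed labeling idea can be illustrated with a short sketch. It is not the exact labeling code used for PS5K: the helper names (`rouge1_recall`, `windowed_labels`), the use of plain unigram recall as the ROUGE-1 measure, and the in-order greedy scan inside each window are all assumptions made for illustration.

```python
from collections import Counter


def rouge1_recall(candidate_tokens, reference_tokens):
    """Unigram recall: fraction of reference tokens covered by the candidate."""
    ref = Counter(reference_tokens)
    cand = Counter(candidate_tokens)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / max(1, sum(ref.values()))


def windowed_labels(sentences, slide_tokens, w=10):
    """Greedy 0/1 labels computed independently inside each window of w sentences.

    `sentences` is a list of token lists and `slide_tokens` is the token list
    of the reference slides (hypothetical input format).
    """
    labels = [0] * len(sentences)
    for start in range(0, len(sentences), w):
        selected = []   # tokens chosen so far in this window only
        best = 0.0      # best ROUGE-1 recall reached in this window
        for i in range(start, min(start + w, len(sentences))):
            score = rouge1_recall(selected + sentences[i], slide_tokens)
            if score > best:  # keep the sentence only if ROUGE-1 improves
                labels[i] = 1
                selected += sentences[i]
                best = score
    return labels
```

Because the running selection is reset at every window boundary, a sentence late in the paper can still receive a positive label even if an earlier window already covered the same slide content, which is what spreads positive labels across the document.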

Sentence and Document Embedding
The ranking of sentences depends on their salience, novelty, and content similarity to the ground truth. To quantify these characteristics, a document is represented as a vector. We explore two methods to build the embedding for the whole document.
Simple Document Embedding A simple document embedding can be obtained by calculating the average of sentence encodings generated by a Bidirectional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997). A sentence $s_i$ is encoded as $E_{s_i} = [\overrightarrow{h_i}, \overleftarrow{h_i}]$, in which $E_{s_i}$ is a concatenation of the forward ($\overrightarrow{h_i}$) and backward ($\overleftarrow{h_i}$) hidden states of the last token in sentence $s_i$. The embedding for document $D$ with $n$ sentences is the average of all sentence embeddings, passed through a learned layer:
$$E_D = \mathrm{ReLU}\left(W \cdot \frac{1}{n} \sum_{i=1}^{n} E_{s_i} + b\right) \quad (1)$$
in which ReLU is the activation function, and $W$ and $b$ are parameters to be learned.
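As a small worked example, the simple document embedding can be sketched in plain Python. The sketch assumes the form ReLU(W · mean(E_s) + b) implied by the description above; the function name `simple_document_embedding` and the list-based vector representation are hypothetical choices for illustration, and in practice the sentence embeddings would come from the BiLSTM.

```python
def simple_document_embedding(sentence_embeddings, W, b):
    """Compute ReLU(W @ mean(E_s) + b) for vectors stored as Python lists.

    `sentence_embeddings` is a list of n vectors, `W` is a list of rows,
    and `b` is a bias vector with one entry per row of W.
    """
    n = len(sentence_embeddings)
    dim = len(sentence_embeddings[0])
    # Average the sentence embeddings dimension by dimension.
    mean = [sum(e[j] for e in sentence_embeddings) / n for j in range(dim)]
    # Affine projection followed by the ReLU activation.
    projected = [sum(w_rj * m_j for w_rj, m_j in zip(row, mean)) + b_r
                 for row, b_r in zip(W, b)]
    return [max(0.0, v) for v in projected]
```

For instance, with two sentence embeddings `[1, 3]` and `[3, 5]` the mean is `[2, 4]`, and any negative entry after the projection is clipped to zero by the ReLU.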
Hierarchical Self Attention Document Embedding This model embeds a document by applying the attention mechanism at both word and sentence levels (Al-Sabahi et al., 2018;Yang et al., 2016).
Sentence embeddings are obtained by encoding the word-level tokens of a sentence using a BiLSTM and then aggregating the hidden states using an attention mechanism. Formally, for a sentence $s_i$ with $m$ words, the sentence encoding $h_{s_i}$ is the concatenation of all $m$ hidden states of the word-level tokens, $h_{s_i} = [h_1, h_2, \ldots, h_m]$, where $h_{s_i} \in \mathbb{R}^{m \times 2d}$ and $d$ is the embedding dimension for each word. The attention weights $a_{word} \in \mathbb{R}^{k \times m}$ are computed from $h_{s_i}$ with a model matrix $W_{attn} \in \mathbb{R}^{k \times 2d}$ to be learned, and they are applied to $h_{s_i}$ to obtain the sentence embedding $E_{s_i} \in \mathbb{R}^{1 \times 2d}$. Here $k$ is the attention dimension, which is set to 100 in our experiments. Document embeddings ($E_D$) are generated from the sentence embeddings ($E_{s_i}$) built in the previous step: a similar attention layer is applied on top of the sentence embeddings to build the document embedding. The sentence-level attention acts as a set of weights that emphasize important sentences in the document embedding.
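One way such an attention layer could work is sketched below. This is an assumption-laden illustration rather than the exact layer: it assumes the weights are a softmax over the $m$ words for each of the $k$ rows of the attention matrix, and that the $k$ weighted sums are averaged into a single vector. The function names `softmax` and `attention_pool` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, W_attn):
    """Pool m hidden states (m x 2d) into one vector of length 2d.

    Each of the k rows of W_attn scores every word, and a softmax over the
    m words turns the scores into weights. Averaging the k weighted sums
    into a single vector is an assumption made for this sketch.
    """
    k, dim = len(W_attn), len(H[0])
    pooled = [0.0] * dim
    for w_row in W_attn:
        scores = [sum(w * h for w, h in zip(w_row, h_t)) for h_t in H]
        weights = softmax(scores)  # length m, sums to 1
        for a, h_t in zip(weights, H):
            for j in range(dim):
                pooled[j] += a * h_t[j] / k
    return pooled
```

The same function could be applied a second time, with sentence embeddings in place of word hidden states, to obtain a document embedding in the hierarchical setting.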

Sentence Ranking
The rank of a sentence depends on its position in the paper, its content, its salience, and its novelty with respect to the previously selected sentences. These terms are combined through the sigmoid activation function $\sigma$ to give the score $p(y_i = 1)$, where $W_{pos} \in \mathbb{R}^{2d \times 1}$, $W_{content} \in \mathbb{R}^{2d \times 1}$, $W_{salience} \in \mathbb{R}^{2d \times 2d}$, and $W_{novelty} \in \mathbb{R}^{2d \times 2d}$ are parameters to be learned. The position term uses the positional embedding $pos$ of the sentence, which is obtained from an embedding lookup function. The salience term estimates the importance of a sentence. The novelty term represents the novelty of a sentence with respect to the current summary. The summary embedding is the weighted sum of the embeddings of the previous sentences added to the summary up to sentence $i$, each weighted by its predicted probability: the higher the chance of adding a sentence to the summary, the bigger its portion in the summary embedding. Figure 2 shows the architecture for predicting the score of the third sentence in a document. With windowed labeling, the positive labels are sparse. To deal with the imbalanced labels, the following weighted cross-entropy loss is adopted:
$$-\sum_{i=1}^{n} \left[ w_1 \, y_i \log p(y_i = 1) + w_2 \, (1 - y_i) \log\left(1 - p(y_i = 1)\right) \right] \quad (6)$$
The setting of $w_1 = -85$ and $w_2 = -2$ results in the highest ROUGE score.
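The scoring and loss described above can be sketched as follows. The additive form of the score (content + salience - novelty + position + bias, passed through a sigmoid) follows the SummaRuNNer formulation of Nallapati et al. (2017) and is an assumption here, as are the function names, the `params` dictionary keys, and the bias term `b`. The loss sketch uses the magnitudes 85 and 2 as positive class weights in the standard negative log-likelihood form, which is our reading of the sign convention.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bilinear(u, M, v):
    """u^T M v for list-based vectors and a list-of-rows matrix."""
    return sum(u_i * dot(row, v) for u_i, row in zip(u, M))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_sentences(E, E_doc, pos, params):
    """Return p(y_i = 1) for each sentence while updating a running summary."""
    summary = [0.0] * len(E_doc)
    probs = []
    for e_i, pos_i in zip(E, pos):
        squashed = [math.tanh(s) for s in summary]
        logit = (dot(params["W_content"], e_i)
                 + bilinear(e_i, params["W_salience"], E_doc)
                 - bilinear(e_i, params["W_novelty"], squashed)
                 + dot(params["W_pos"], pos_i)
                 + params["b"])
        p = sigmoid(logit)
        probs.append(p)
        # Summary embedding: probability-weighted sum of the sentences so far.
        summary = [s + p * e for s, e in zip(summary, e_i)]
    return probs


def weighted_bce(probs, labels, w_pos=85.0, w_neg=2.0, eps=1e-12):
    """Weighted cross-entropy; positive labels are weighted more heavily."""
    return -sum(w_pos * y * math.log(p + eps)
                + w_neg * (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
```

In this sketch a sentence that repeats content already in the running summary gets a lower score through the novelty term, and a missed positive label costs far more than a false positive, which counteracts the sparsity of positive labels.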

Sentence Selection
To select the sentences for the slides, we tried 1) a greedy approach that sequentially adds the sentences with the highest scores until the maximum length limit is hit and 2) an ILP method that selects sentences by optimizing an objective function under a length constraint, solved with the IBM CPLEX Optimizer.
In the ILP formulation, $p(y_i = 1)$ is the score of sentence $i$ predicted by the model, $x_i$ is a binary variable indicating whether sentence $i$ is selected for the summary, $l_i$ is the length of sentence $i$, which penalizes short sentences, and $maxLen$ is the maximum length of the summary.
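The two selection strategies can be compared with a small sketch. The objective used here, maximizing the sum of $l_i \, p(y_i = 1) \, x_i$ subject to the sum of $l_i x_i$ not exceeding $maxLen$, is our reading of the variable definitions above and should be treated as an assumption. Instead of CPLEX, the sketch solves this 0/1 problem exactly with a dynamic-programming knapsack, which is a swapped-in technique; the function names are hypothetical.

```python
def greedy_select(scores, lengths, max_len):
    """Add sentences in descending score order until the length limit is hit."""
    chosen, used = [], 0
    for i in sorted(range(len(scores)), key=lambda i: -scores[i]):
        if used + lengths[i] > max_len:
            break  # the next best sentence would exceed the limit
        chosen.append(i)
        used += lengths[i]
    return sorted(chosen)


def knapsack_select(scores, lengths, max_len):
    """Maximize sum(l_i * p_i * x_i) subject to sum(l_i * x_i) <= max_len.

    Lengths must be positive integers (for example, word counts).
    """
    # best[c] = (objective value, chosen indices) with total length at most c
    best = [(0.0, [])] * (max_len + 1)
    for i in range(len(scores)):
        gain = lengths[i] * scores[i]
        new_best = list(best)
        for c in range(lengths[i], max_len + 1):
            cand = best[c - lengths[i]][0] + gain
            if cand > new_best[c][0]:
                new_best[c] = (cand, best[c - lengths[i]][1] + [i])
        best = new_best
    return best[max_len][1]
```

With scores `[0.9, 0.8, 0.3]`, lengths `[2, 10, 8]`, and a limit of 10, the greedy approach keeps only the short top-scoring sentence, while the length-weighted objective prefers the long second sentence, which illustrates how the length factor penalizes short sentences.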

Slide Generation
A typical presentation slide includes a limited number of first-level bullet points, which are usually phrases or shortened sentences. Some slides contain second-level bullet points for further breakdowns. Table 2 shows that less than 8% of the content of the presentations in the ground truth corpus is covered in third-level bullets. We therefore generate slides containing up to 2 bullet levels. Table 2 also shows that a slide title contains 4 words on average and that a Level 1 or Level 2 bullet contains 8 words on average. Each slide consists of 36 words in 5 bullets on average, and each Level 1 bullet includes 2 second-level bullets.
The selected sentences are treated as the second-level bullets. The first-level bullets are the noun phrases extracted from these sentences. Noun phrases are removed if they contain more than 10 words or just 1 word. Noun phrases with a document frequency greater than 10 are also excluded (e.g., "the model"). The heading of the section that contains the first sentence of a slide is used as the slide title, truncated to its first 5 tokens. We allow a maximum of 4 sentences per slide. If a topic has more than 4 related sentences, the slide is split into two distinct ones.
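The filtering and layout rules above can be sketched as two small helpers. The names `filter_noun_phrases` and `build_slides`, the dictionary-based slide representation, and whitespace tokenization are hypothetical choices for illustration. The sketch also splits a long topic into as many slides as needed, which generalizes the two-way split described above.

```python
def filter_noun_phrases(phrases, doc_freq, max_df=10):
    """Keep phrases of 2 to 10 words whose document frequency is at most max_df."""
    kept = []
    for phrase in phrases:
        n_words = len(phrase.split())
        if 2 <= n_words <= 10 and doc_freq.get(phrase, 0) <= max_df:
            kept.append(phrase)
    return kept


def build_slides(heading, sentences, max_per_slide=4, max_title_tokens=5):
    """Split a topic's sentences into slides titled with the truncated heading."""
    title = " ".join(heading.split()[:max_title_tokens])
    return [{"title": title, "bullets": sentences[i:i + max_per_slide]}
            for i in range(0, len(sentences), max_per_slide)]
```

For example, a section with five selected sentences yields two slides that share the same truncated title, the first with four bullets and the second with one.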

Experiments and Results
We estimated the parameters of our model on PS5K. We split the dataset into training, validation, and test sets consisting of 4500, 250, and 250 pairs, respectively. We experimented with different window sizes and found that a window size of w = 10 gives the best ROUGE-1 recall (Table 3), so it is adopted for our model. Stanford CoreNLP is used to tokenize and lemmatize sentences and to extract noun phrases. GloVe (Pennington et al., 2014) 50-dimensional vectors are used to initialize the word embeddings. With the AdaDelta optimizer and a learning rate of 0.1, we trained for 50 epochs. The sentences are truncated or padded to 50 tokens (only 8% of sentences consist of more than 50 tokens). Similarly, we adopt a fixed document size of 500 sentences (only 3.5% of documents in our dataset have more than 500 sentences).
We used the standard ROUGE score (Lin, 2004) to evaluate the summaries. The ROUGE scores for summaries are tabulated in Table 1. The summary size cannot exceed 20% of the size of the input document in words. TextRank (Mihalcea and Tarau, 2004) is a graph-based summarizer that applies the Google PageRank (Page et al., 1999) algorithm to rank the sentences. Sefid et al. (2019) rank the sentences by combining surface features with semantic and contextual embeddings. The windowed SummaRuNNer+ILP model outperforms the base SummaRuNNer by at least 3 points in ROUGE-1 recall. Adding the attention layer to the model does not improve the ROUGE score, while it increases the training time considerably because there are more parameters to be trained.
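The fixed input shape of 500 sentences by 50 tokens can be produced with a simple padding and truncation step, sketched below. The padding token `PAD` and the helper names are hypothetical; an actual pipeline would typically map tokens to embedding indices before or after this step.

```python
PAD = "<pad>"  # hypothetical padding token


def pad_or_truncate(items, size, pad_item):
    """Cut `items` to `size` entries, or extend it with `pad_item` up to `size`."""
    return list(items[:size]) + [pad_item] * max(0, size - len(items))


def to_fixed_shape(document, max_sents=500, max_tokens=50):
    """Return a max_sents x max_tokens grid of tokens for one document.

    `document` is a list of sentences, each a list of tokens. Padding
    sentences share one all-PAD list, so treat the result as read-only.
    """
    sents = [pad_or_truncate(s, max_tokens, PAD) for s in document]
    return pad_or_truncate(sents, max_sents, [PAD] * max_tokens)
```

A sentence of 60 tokens is cut to its first 50, and a document with two sentences is extended with 498 all-padding sentences.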

Conclusion
We create and make available PS5K, a large slide-paper dataset consisting of 5,000 scientific articles and their corresponding manually made slides. This dataset can be used for scientific document summarization and slide generation. We used state-of-the-art extractive summarization methods to summarize scientific articles. Our results show that distributing the positive labels across all sections of a scientific paper, in contrast to summarization methods for news articles, considerably improves performance.