Keyphrase Extraction from Scientific Articles via Extractive Summarization

Automatically extracting keyphrases from scholarly documents leads to a valuable, concise representation that humans can understand and machines can process for tasks such as information retrieval, article clustering, and article classification. This paper is concerned with the parts of a scientific article that should be given as input to keyphrase extraction methods. Recent deep learning methods take titles and abstracts as input due to the increased computational complexity of processing long sequences, whereas traditional approaches can also work with full-texts. Titles and abstracts are dense in keyphrases but often miss important aspects of the articles, while full-texts, on the other hand, are richer in keyphrases but much noisier. To address this trade-off, we propose the use of extractive summarization models on the full-texts of scholarly documents. Our empirical study on three article collections using three keyphrase extraction methods shows promising results.


Introduction
Automatic keyphrase extraction is the process of identifying representative phrases in a document that summarize its content. Keyphrases are important pieces of information for many applications, including information retrieval (Ji et al., 2019; Boudin et al., 2020), text classification (Meng et al., 2019), text summarization (Song et al., 2019), entity recognition (Du et al., 2018) and event detection (Hossny et al., 2020).
This work focuses on keyphrase extraction from scholarly documents. In particular, we consider an interesting issue in this domain, which concerns the part of a scientific article that should be given as input to keyphrase extraction methods. Table 1 shows representative supervised and unsupervised keyphrase extraction methods from the most popular categories of the task (deep learning, traditional supervised, graph-based, and statistics-based), along with the parts of academic articles that they consider, among Title+Abstract (TA), Full-text (F) and other Specific Parts (S/P).

Table 1: Types of textual content, i.e., Title+Abstract (TA), Full-text (F), and Specific Parts (S/P) of the document, used by supervised and unsupervised keyphrase extraction approaches in the training and evaluation process. Approaches with an asterisk (*) are evaluated on TAs and Fs in Hasan and Ng (2010).
We can see that recent deep learning keyphrase extraction and generation methods take titles and abstracts as input, due to the complexity of processing longer sequences. Traditional supervised learning methods, as well as unsupervised ones, can handle full-texts, but this does not necessarily lead to better results compared to using just titles and abstracts. Papagiannopoulou and Tsoumakas (2018) show that graph-based methods achieve better accuracy when titles and abstracts are used, while the strong TfIdf baseline works best with full-texts. Florescu and Caragea (2017) and Boudin (2018) show that keyphrases generally occur in positions very close to the beginning of a scholarly document. Nguyen and Luong (2010) show that titles and abstracts have the highest density of keyphrases, followed by the conclusions, introduction and related work sections.
It appears that there is a trade-off between using titles and abstracts versus using full-texts of academic papers as input to keyphrase extraction methods. Full-texts provide richer information, including more keyphrases, but at the same time they are much noisier than titles and abstracts. Motivated by this observation, our scientific question is whether applying automated summarization models to the full-text of a scientific article can yield textual information that is richer than titles and abstracts, yet less noisy than full-texts.
Towards answering this question, we present some first steps employing extractive summarization. Our main goals are to: a) investigate the dynamics of summarization in keyphrase extraction, paving the way for the research community to develop approaches combining techniques from both tasks (e.g., via multi-task learning), and b) provide guidelines to practitioners of the field for better utilization of full-texts. Our empirical study provides strong evidence that full-text extractive summaries manage to capture keyphrases, and in most cases improve the performance (in terms of F1 score) of state-of-the-art supervised and unsupervised keyphrase extraction methods on three datasets, compared to the conventional use of abstracts and full-texts.

Our Approach and Alternatives
We are interested in finding out whether we can improve the signal-to-noise ratio of the input given to keyphrase extraction approaches by applying automated summarization on the full-text of scientific articles. As a first step towards investigating this hypothesis, we focus on extractive summarization models.
We generate extractive summaries from the corresponding full-texts using the pre-trained distilled RoBERTa model distilroberta-base-ext-sum from the TransformerSum 1 library. DistilRoBERTa is a distilled version of RoBERTa (Liu et al., 2019), produced with the DistilBERT procedure (Sanh et al., 2019). It is a lighter, faster, and smaller variant of the original RoBERTa that achieves a time speedup of 50% while retaining 95% of the original model's performance. Furthermore, we investigate the utility of alternative input types, such as the first three paragraphs of the document, which include the title, the abstract, and a part of the document's introduction. We experiment with two different lengths in words, i.e., 220 and 400.
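The construction of the 3P input types is straightforward; a minimal sketch (the function name and the exact truncation rule are our own reading, since the text does not spell out the procedure in full):

```python
def first_paragraphs(title, abstract, body_paragraphs, word_budget):
    """Build a '3P'-style input: title + abstract + as much of the
    leading body paragraphs (introduction) as fits into the budget.

    word_budget would be 220 for 3P_220 and 400 for 3P_400.
    """
    words = (title + " " + abstract).split()
    for para in body_paragraphs:
        if len(words) >= word_budget:
            break
        words.extend(para.split())
    # Truncate to the exact budget.
    return " ".join(words[:word_budget])
```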
Our investigation includes the standard input types, i.e., title+abstract and full-text, too. For deep learning methods, we split full-texts into sentences and paragraphs, as they cannot handle their whole length at once due to memory limitations.
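The splitting itself can be done with simple heuristics; a minimal sketch (a real pipeline would use a proper sentence tokenizer such as NLTK's punkt rather than this regex):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after end punctuation followed by
    whitespace and an uppercase letter."""
    return [s.strip()
            for s in re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
            if s.strip()]

def split_paragraphs(text):
    """Split on blank lines."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
```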
Finally, we explore an ensemble approach to keyphrase extraction, which involves the late fusion of two input types: the standard title plus abstract and the title plus the extractive summary. We apply keyphrase extraction methods to these two input types independently and then consider the union of the extracted keyphrases. Table 2 presents all these approaches along with their abbreviations, which will be used in the rest of our work.
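The late-fusion step amounts to a deduplicated union of the two prediction sets; a minimal sketch (the function name is ours):

```python
def late_fusion(keyphrases_ta, keyphrases_ts):
    """Union of the keyphrases extracted from the two input types
    (title+abstract and title+summary), deduplicated case-insensitively
    while preserving the order of first occurrence."""
    seen, merged = set(), []
    for kp in list(keyphrases_ta) + list(keyphrases_ts):
        key = kp.lower()
        if key not in seen:
            seen.add(key)
            merged.append(kp)
    return merged
```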

Experimental Setup
Our empirical study includes three keyphrase extraction methods: TfIdf, as a baseline method, MultipartiteRank (MR) (Boudin, 2018), as a strong graph-based method, and Bi-LSTM-CRF (BLC) (Alzaidy et al., 2019), as a strong neural model. Due to the lack of publicly available code for a BLC model tailored to keyphrase extraction, we developed our own implementation, which we make publicly available along with all experiments in this paper 2 .
BLC is trained using the train and validation sets from (Meng et al., 2017). Specifically, we trained both models described in (Alzaidy et al., 2019), i.e., BLC_TA on the documents' abstracts and BLC_ABSE on the abstracts' sentences (used only with test datasets whose text is split into sentences, and only for the model comparison). Experiments were performed on a Ryzen 5 3600 CPU with 16GB RAM. Training the model on titles and abstracts takes approximately 24 hours for a total of 5 epochs, while training on titles and abstracts split into sentences takes about 5 hours to complete.
These keyphrase extraction methods are evaluated on three well-known datasets that contain full-text articles from the computer science domain: SemEval (Kim et al., 2010), NUS (Nguyen and Kan, 2007), and ACM (Krapivin et al., 2008). These datasets contain 244, 211, and 2304 documents, respectively (we merged the train and test sets of the SemEval dataset).
We compute F1 (F1@10 for unsupervised methods) according to both the exact (E) and partial (P) (Rousseau and Vazirgiannis, 2015) string match to determine the number of correctly matched phrases against the gold ones for a document. We also apply stemming to the methods' output and the articles' gold phrases as a pre-processing step before the evaluation. We employ the authors' and readers' keyphrases (where available) as the gold standard for all dataset collections.
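A minimal sketch of the exact-match evaluation with stemming (the crude suffix stripper below stands in for the Porter stemmer a real evaluation would use):

```python
def stem(word):
    """Crude suffix stripper, standing in for a proper Porter stemmer."""
    for suf in ("ing", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def normalise(phrase):
    return " ".join(stem(w) for w in phrase.lower().split())

def exact_match_f1(predicted, gold):
    """F1 between predicted and gold keyphrases after stemming,
    counting only exact string matches as correct."""
    pred = {normalise(p) for p in predicted}
    gd = {normalise(g) for g in gold}
    if not pred or not gd:
        return 0.0
    tp = len(pred & gd)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gd)
    return 2 * precision * recall / (precision + recall)
```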
Finally, we use a two-sided Wilcoxon signed-rank test to check the statistical significance of the results in terms of the most popular exact-match evaluation between the proposed input types and the conventional ones, at a significance level of 0.05. We denote with a "*" statistical significance with respect to TA and with a "†" statistical significance with respect to ABSE or F (in cases where there is an improvement).


Results

Table 3 gives the percentage and actual number (in parentheses) of keyphrases that appear inside each textual content type (F, 3P_400, 3P_220, TS, TA) for each of the 3 datasets (SemEval, NUS, ACM). We can see that full-texts contain the highest percentage of keyphrases, as expected. Note that this number is less than 1, as a small percentage of the keyphrases that authors or readers assign to papers do not appear inside the paper's full-text. The percentages of 3P_400 and 3P_220 are high too. Extractive summaries contain fewer keyphrases than the previous content types, but more than titles+abstracts. This is a positive sign, which, combined with a low amount of noise, could lead to improved keyphrase extraction results.

One disadvantage of extractive summaries is that they require an additional pre-processing step compared to the other, pre-existing textual content types. The average time to generate the extractive summary per document on the machine used for the experiments is 2.21, 2.13, and 2.34 seconds for the SemEval, NUS, and ACM datasets, respectively. This is not high for offline applications, while for online ones, higher-scale hardware and/or more efficient architectures could be employed.

Table 4 shows the results of our implementation of the BLC model, along with the ones published in (Alzaidy et al., 2019) for the kp20k test set from (Meng et al., 2017). BLC solves a sequence classification task: for each word, it outputs a binary label indicating whether the word belongs to a keyphrase or not.
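Turning the word-level labels back into phrases can be sketched as follows (a simplified binary version; an actual Bi-LSTM-CRF implementation would decode tag sequences from the CRF layer):

```python
def decode_labels(words, labels):
    """Collect maximal runs of words labelled 1 into keyphrases."""
    phrases, current = [], []
    for word, lab in zip(words, labels):
        if lab == 1:
            current.append(word)
        elif current:
            # A 0 label ends the current phrase.
            phrases.append(" ".join(current))
            current = []
    if current:  # flush a phrase that ends at the sequence boundary
        phrases.append(" ".join(current))
    return phrases
```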
The evaluation of BLC in (Alzaidy et al., 2019) was based on the F1-score of this binary sequence classification task, which we also compute for our implementation. In addition, we report the results of our implementation in terms of the exact and partial evaluation approaches.

Table 4: F1 based on sequence (S), exact (E) and partial (P) evaluation for the original BLC approach and our implementation.

Bi-LSTM-CRF
The results of the two BLC_TA implementations are close to each other. The difference could be attributed to two things: a) the pre-processing of the data, which is not described in detail in (Alzaidy et al., 2019), and b) the fact that Alzaidy et al. (2019) might not have included the title in their experiments, as this is not clear in the paper. For BLC_ABSE, the difference is larger, which might be a result of the above as well as of the selected hyperparameters, which we fine-tuned on BLC_TA.

Table 5 shows the results of BLC with the standard and proposed input types. The results indicate no significant improvement using extractive summaries compared to titles and abstracts, even though TS includes more keyphrases across all datasets (see Table 3). However, this evaluation may be slightly unfair to TS as input to BLC, since the model was trained on the original documents' abstracts; TAs and TSs may differ substantially in syntax, structure, etc. Nevertheless, AS performs better than TA, meaning that TS manages to introduce unseen keyphrases to TA, which seems promising for the potential of extractive summarization.

In addition, our findings show that we achieve higher F1-scores when we predict on the abstracts split into sentences rather than on the entire abstract. This indicates the inability of the model to retain past information from longer text excerpts, which is a common problem for RNNs. Note that for all the results of the experiments in Table 5, we utilize only the BLC_TA model, even on the text excerpts split into sentences, as it showed superior performance compared to BLC_ABSE.
Moreover, FP and 3P_220 seem to be better alternatives to TA, as they constitute richer sources of keyphrases, and the trained BLC_TA model can utilize them properly. Finally, the FS approach fails to detect the full-text's keyphrases due to the combination of noise and the scattering of important context, which results from the extreme fragmentation of long texts into sentences.

Unsupervised methods
Tables 6 and 7 show that the unsupervised methods TfIdf and MR clearly benefit from the extractive summaries (TS), as they outperform the conventional approaches (TA, F), except for MR on NUS, where the TS F1-score is slightly lower than that of F. The 3P_220 and 3P_400 approaches, in most cases, do not improve the corresponding methods' accuracy. Although the introductory parts of a document contain many keyphrases, they are also quite noisy due to general descriptions related to the document's topics.

Table 6: F1@10 based on the exact (E) and partial (P) evaluation approach for TfIdf on 3 different datasets (SemEval, NUS, ACM) using various textual content types as input, i.e., TA, F, TS, AS, 3P_220, 3P_400.
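For reference, the scoring behind the TfIdf baseline can be sketched as follows (a unigram simplification; the actual baseline scores multi-word candidate phrases):

```python
import math
from collections import Counter

def tfidf_top_k(doc_tokens, corpus_token_lists, k=10):
    """Rank a document's terms by tf-idf and return the top k.

    doc_tokens: tokens of the target document (also present in the corpus).
    corpus_token_lists: token lists of all documents, used for idf.
    """
    n_docs = len(corpus_token_lists)
    df = Counter()
    for toks in corpus_token_lists:
        df.update(set(toks))  # document frequency counts each doc once
    tf = Counter(doc_tokens)
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t]}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```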

Conclusions and Future Work
Our work set out to investigate whether using automated summarization as a pre-processing step can lead to improved results in the task of keyphrase extraction from scholarly documents. Our empirical study shows that unsupervised approaches improve their accuracy using extractive summaries as input, highlighting the full-text's useful information for the task and showing a positive relationship between the tasks of extractive summarization and keyphrase extraction. It is worth noting that even though the gains in the exact-match F1-scores seem moderate, this does not necessarily reflect the actual performance gain. Considering that exact-match scores are generally low due to the strict nature of the evaluation, a moderate absolute increase translates into a considerable relative gain over the initial performance.

Table 7: F1@10 based on the exact (E) and partial (P) evaluation approach for MR on 3 different datasets (SemEval, NUS, ACM) using various input types, i.e., TA, F, TS, AS, 3P_220, 3P_400.
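With hypothetical numbers, the relative-gain argument looks as follows:

```python
# Hypothetical exact-match scores, for illustration only:
f1_before = 0.10   # e.g., with titles+abstracts as input
f1_after = 0.12    # e.g., with title+summary as input
relative_gain = (f1_after - f1_before) / f1_before
# A modest 0.02 absolute gain corresponds to a 20% relative improvement.
```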
As future work, an interesting direction would be to experiment with additional summarization methods, including abstractive ones as well as their combination with extractive ones. In addition, we could experiment with additional recent and state-of-the-art keyphrase extraction methods, including methods building on top of contextual embeddings (Sahrawat et al., 2020).