Word centrality constrained representation for keyphrase extraction

To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token and perform much better than their unsupervised counterparts. Unfortunately, these methods fail for short documents where the context is unclear. Moreover, keyphrases, which usually capture the gist of a document, need to reflect its central theme. We propose a new extraction model that introduces a centrality constraint to enrich the word representation of a bidirectional long short-term memory (BiLSTM) network. Performance evaluation on two publicly available datasets demonstrates that our model outperforms existing state-of-the-art approaches.


Introduction
Keyphrase extraction is an important information extraction task that identifies single or multi-word linguistic units that concisely represent a document. Keyphrases can also serve as a brief summary of the document content. They are widely used in a variety of natural language processing tasks such as document summarization (Bharti and Babu, 2017; Sarkar, 2014), query formulation (Jones and Staveley, 1999), text classification (Coenen et al., 2007), clustering (Hammouda et al., 2005), and recommendation systems (Naw and Hlaing, 2013). Keyphrases have become increasingly important for biomedical documents as the literature has grown exponentially, with over 32 million articles indexed by PubMed (NLM). Existing keyphrase extraction methods mainly fall under either a supervised or an unsupervised approach. Common unsupervised approaches use word co-occurrence statistics to build graph-based ranking algorithms. Each word is mapped to a node, and edges connect words that co-occur within a specified window size. Even though unsupervised approaches are desirable for datasets that do not have manually labeled ground truth values, most such methods perform worse than their supervised counterparts.
The supervised approaches use classification to label every token as being part of a keyphrase or not by using features such as part-of-speech tags, term frequency-inverse document frequency (tf-idf), and the position of the token in the document. Recently, supervised methods based on deep learning have been employed for keyphrase extraction. In Thomaidou and Vazirgiannis (2011) and Gollapalli et al. (2017), the authors posed the problem as a sequence labeling task and applied a Long Short-Term Memory network (LSTM) and conditional random fields (CRF) to tag each token in the document as positive (i.e., part of a keyphrase) or negative. While these approaches achieve much better performance, they still suffer from a major limitation when applied to biomedical literature: the task of labelling each token does not consider how central the token is to the document contents. In Figure 1, for example, the main theme of the keyphrases is genes associated with breast cancer. Thus, the document theme can be used as additional information to improve keyphrase extraction performance.
To this end, we propose to address the problem of keyphrase extraction as a sequence labelling task with an additional component to capture the centrality of each token. We design a centrality layer built on top of a bidirectional LSTM (BiLSTM) layer to constrain each token with regard to the central theme of the document. The output dependencies are then modeled using a CRF layer. The contributions of our work are:
• Introducing a centrality constraint layer to better capture the main theme of the document and how strongly each token is related to the main theme.
• Thorough evaluation of the centrality layer using an ablation study on biomedical and general domain abstracts.
The next section presents a brief description of related work. The proposed keyphrase extraction method is introduced in Section 3. Sections 4 and 5 present the experimental results and the conclusion, respectively.

Related Work
Keyphrase extraction methods mainly take either a supervised or an unsupervised approach. Unsupervised approaches generate candidates and rank them using features such as tf-idf and topic proportions (Barker and Cornacchia, 2000; Liu et al., 2009b), graph-based centrality measures (Grineva et al., 2009; Wan and Xiao, 2008), topic modeling (Liu et al., 2009a; Teneva and Cheng, 2017), and the document's citation network (Gollapalli and Caragea, 2014). Unsupervised, graph-based methods build a graph from the input document in which the candidate keyphrases are nodes and the connections between candidates are represented by edges. A graph-based ranking method then determines the weight of each node based on the relatedness between the candidates. Alternatively, topic-based approaches cluster candidate keyphrases into topics so that all the topics in the input document are represented by the selected keyphrases.
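The graph-based ranking idea described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited systems: for simplicity the nodes are single words rather than candidate phrases, the edges are unweighted, and the ranking is a plain PageRank-style power iteration. The function names, the window size, and the damping factor are assumptions chosen for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected graph: words are nodes, and an edge links two
    distinct words that occur within `window` positions of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration over the unweighted graph."""
    scores = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) / len(graph) + damping * incoming
        scores = new_scores
    return scores


# Toy input (hypothetical sentence, whitespace tokenization).
tokens = "breast cancer gene mutation increases breast cancer risk".split()
graph = cooccurrence_graph(tokens, window=2)
scores = rank_nodes(graph)
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
```

On this toy input the two most connected words, "breast" and "cancer", receive the highest scores, which is the behaviour such methods rely on when ranking candidates.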
Recently, Sun et al. (2020) proposed a sentence embedding model named SIFRank that uses an autoregressive pre-trained language model to extract keyphrases from short documents. Yet unsupervised methods often fail to achieve state-of-the-art performance.
Under the supervised approach, the keyphrase extraction problem is treated as a binary classification task (Alzaidy et al., 2019; Turney, 2000, 2002), where learning algorithms such as support vector machines (Witten et al., 2005; Jiang et al., 2009) and maximum entropy (Kim and Kan, 2009; Yih et al., 2006) are used. Supervised keyphrase extraction can also be posed as a ranking problem between candidates (Witten et al., 2005). The candidate keyphrases are extracted using statistical features (tf-idf, number of occurrences, first occurrence of the candidate) and structural features (part-of-speech tags).
Deep learning-based models have also been used for keyphrase extraction. Word embeddings are used to measure the relatedness between words in graph-based models (Wang et al., 2014). Zhang et al. (2016) used a Recurrent Neural Network (RNN) based approach to identify keyphrases in Twitter data. The model addresses the problem as sequence labeling for very short text, where a joint-layer RNN is used to capture the semantic dependencies in the input sequence. Alzaidy et al. (2019) employed an LSTM-CRF architecture to model keyphrase extraction as a sequence labelling task and learn the labels of the entire input sequence. Santosh et al. (2020) extended the LSTM-CRF to utilize a BiLSTM and incorporated an attention mechanism to retrieve additional information from other sentences within the same document. Sahrawat et al. (2020) evaluated the effect of various pre-trained word embeddings for the BiLSTM-CRF architecture in extracting keyphrases from benchmark datasets and found that contextual embeddings offered better performance. While these models offer better performance, they fail to capture the centrality of the keyphrases, which is a salient feature of the document.

Methodology
The keyphrase extraction task is formulated as a sequence labelling task. Given a document $X = (w_1, w_2, \dots, w_t)$, where $w_i$ is the $i$-th word and $t$ is the number of words in the document, we predict the labels $y = (y_1, y_2, \dots, y_t)$, where each label $y_i$ indicates whether word $w_i$ is part of a keyphrase or not.

Word Embedding Layer
Each word in the document is represented by a pre-trained low-dimensional vector. Any pre-trained vector representation can be used, and we experiment with various pre-trained embeddings such as GloVe (Pennington et al., 2014), BERT (Devlin et al., 2019) and BioBERT (Lee et al., 2020). The impact of each embedding type is discussed in the experiments section.

BiLSTM Layer
This layer is used to encode each document to obtain the local contextual representation. Forward and backward LSTMs are used to read the input sequence from left to right and from right to left, respectively. The outputs from the two directions are concatenated and summed for the final hidden state representation of

Centrality Weighting Layer
Sequence labelling is commonly used for other token encoding tasks such as Named Entity Recognition (NER) where the task is to determine whether a token is a named entity or not. However, keyphrase extraction is different from other sequence labelling tasks (for example NER) in that the tokens should capture the main gist of the document. This is in contrast to NER where the importance of the token is irrelevant as long as it is a named entity. To incorporate the idea of centrality, we use the similarity between each token and the document embedding, H, to bias the model towards tokens which are central (i.e., similar) to the document.
For words $\{w_1, w_2, \dots, w_t\}$ in a document $D$, we compute the centrality weights $\alpha_1, \alpha_2, \dots, \alpha_t$, one for each word. Each $\alpha_i$ is calculated as the cosine similarity between the document vector ($H$) and each word ($w_i$). This is then used to weight the document vector when concatenating with each word's representation from the BiLSTM.
The output representation $z_i$ for each word is then the centrality weight $\alpha_i$ multiplied by the output of the BiLSTM. A dense layer is then used to transform the output representation, $k_i = f(z_i)$.
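The centrality weighting can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the document vector is taken to be the mean of the token vectors (the pooling used to obtain H is an assumption here), each centrality weight is the cosine similarity between a token vector and the document vector, and each token vector is scaled by its weight. The function names and the toy vectors are hypothetical, and the subsequent dense layer is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def centrality_weighted(hidden, doc_vector):
    """Return the centrality weights and the token vectors scaled by them."""
    alphas = [cosine(h, doc_vector) for h in hidden]
    weighted = [[alpha * x for x in h] for alpha, h in zip(alphas, hidden)]
    return alphas, weighted


# Toy BiLSTM outputs for a three-token document (hypothetical values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: mean pooling over tokens stands in for the document embedding.
doc_vector = [sum(column) / len(hidden) for column in zip(*hidden)]
alphas, z = centrality_weighted(hidden, doc_vector)
```

In this toy case the third token points in the same direction as the document vector and keeps its full magnitude, while the other two tokens are scaled down, which is the bias towards central tokens that the layer is meant to introduce.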

Conditional Random Fields (CRF)
The obtained contextual representations of each word, $k_i$, are given as the input sequence to a CRF layer. CRFs are widely used to model sequence labeling tasks (Lafferty et al., 2001). Given the input document as a sequence of tokens, the CRF produces a probability distribution over the output label sequence using the dependencies among the labels of the entire input sequence. This formulation considers the correlations between neighboring labels and allows joint decoding of the best sequence of labels for the input sequence, rather than decoding each label independently. Moreover, by utilizing two different labels to denote the beginning ($t_B$) and intermediate part ($t_I$) of a keyphrase, the model can learn multi-token keyphrases. As an example, given a sentence with five tokens ($t_1, t_2, t_3, t_4, t_5$) of which two ($t_2, t_3$) are part of a keyphrase, the labels would be represented as ($t_O, t_B, t_I, t_O, t_O$). Figure 2 illustrates our model architecture with the various layers.
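The difference between joint and independent decoding can be made concrete with a small Viterbi sketch. This is a generic illustration of decoding in a linear-chain model, not the CRF layer used in our implementation; the emission and transition scores are made-up values, and the large negative score that forbids an O to I transition is an assumption added for the example.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y]   : score of label y at position t
    transitions[p][y] : score of moving from label p to label y
    """
    n = len(labels)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for y in range(n):
            # Best previous label for ending in label y at this position.
            best_p = max(range(n), key=lambda p: score[p] + transitions[p][y])
            prev_best.append(best_p)
            new_score.append(score[best_p] + transitions[best_p][y] + emit[y])
        backpointers.append(prev_best)
        score = new_score
    # Follow the back-pointers from the best final label.
    best = max(range(n), key=lambda y: score[y])
    path = [best]
    for prev_best in reversed(backpointers):
        best = prev_best[best]
        path.append(best)
    return [labels[y] for y in reversed(path)]


labels = ["O", "B", "I"]
# Hypothetical transition scores; O -> I is effectively forbidden.
transitions = [
    [0.0, 0.0, -1e4],  # from O
    [0.0, 0.0, 0.0],   # from B
    [0.0, 0.0, 0.0],   # from I
]
# Hypothetical emission scores for five tokens; the second token slightly
# prefers O when considered on its own.
emissions = [
    [2.0, 0.0, 0.0],
    [1.0, 0.8, 0.0],
    [0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]
joint = viterbi(emissions, transitions, labels)
independent = [labels[max(range(len(labels)), key=lambda y: e[y])] for e in emissions]
```

Decoding each token independently yields O, O, I, O, O, which contains an invalid I directly after an O. Joint decoding instead recovers O, B, I, O, O, the pattern from the five-token example above.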

Experiments
Datasets. We ran our experiments on two publicly available keyphrase datasets: PubMed (Gero and Ho, 2019) and INSPEC (Hulth, 2003). PubMed consists of 2532 articles from the PubMed Central Open Access Subset with at least 5 author-provided keyphrases, while INSPEC contains 200 abstracts of scientific journal papers from Computer Science collected between the years 1998 and 2002. Each document in INSPEC has two sets of keywords assigned: the controlled keywords, which are manually assigned keywords that appear in the Inspec thesaurus but may not appear in the document, and the uncontrolled keywords, which are freely assigned by the editors. The union of both sets is considered as the ground truth in this work. Summary statistics for the datasets are shown in Table 1.
Since we use a sequence labeling formulation of the keyphrase extraction problem, the abstract/keyphrase data pairs are prepared such that each document is a sequence of word tokens, each with a positive label if it occurs in a keyphrase ($k_B$, $k_I$), or with a negative label ($k_O$).
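The data preparation step can be sketched as follows. This is a minimal illustration rather than the exact preprocessing used in our experiments: whitespace tokenization, case-insensitive exact matching, and the rule that longer keyphrases are matched first are all assumptions, and the function name `bio_labels` is hypothetical.

```python
def bio_labels(tokens, keyphrases):
    """Assign B/I/O labels to tokens given a list of gold keyphrases."""
    labels = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    # Match longer phrases first so shorter ones cannot block them.
    for phrase in sorted(keyphrases, key=lambda p: -len(p.split())):
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            # Label a span only if it matches and is still unlabeled.
            if lowered[start:start + n] == words and all(labels[i] == "O" for i in span):
                labels[start] = "B"
                for i in span[1:]:
                    labels[i] = "I"
    return labels


# Toy example (hypothetical sentence and keyphrases).
tokens = "Mutations in BRCA1 raise breast cancer risk".split()
labels = bio_labels(tokens, ["breast cancer", "BRCA1"])
```

For this toy sentence the result is O, O, B, O, B, I, O. Keyphrases that do not occur verbatim in the text, such as controlled INSPEC keywords absent from the abstract, simply produce no positive labels under this scheme.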
Experiment Settings. As baseline models, we train BiLSTM and BiLSTM-CRF with 100-dimensional GloVe pre-trained embedding vectors (Pennington et al., 2014). We also train BiLSTM-CRF with two 768-dimensional contextual embeddings, BERT (Devlin et al., 2019) and BioBERT (Lee et al., 2020). DAKE (Santosh et al., 2020), a state-of-the-art baseline, uses a sentence enrich-
The results reported are from three runs using an 80/20/20 split for the train/val/test sets, respectively. The BiLSTM and BiLSTM-CRF models are optimized during training using stochastic gradient descent with a learning rate of 0.0001. Gradient clipping and dropout are used to prevent overflow and overfitting. We select the model with the best F1 score on the validation set over three runs. The final test scores reported are the averages obtained by running the best model on the test sets.
The model was implemented in TensorFlow 2.4.1, and the code is available at https://github.com/ZHgero/keyphrases_centrality.git.

Results
The performance comparisons between the baselines and our model are shown in Table 2. Our model performs significantly better on the PubMed dataset than the existing baselines. In particular, the results show the impact of the centrality layer, as it provides a boost in AUC of 0.02 from BiLSTM-CRF (BioBERT) to our model. The improvement gained from our model is not as large on the INSPEC dataset. We hypothesize that for the centrality constraint to be effective, the input sequence should be relatively long. The sentences in the INSPEC dataset are much shorter, which makes it harder to learn the central theme.
We also compared our model with several state-of-the-art unsupervised approaches, including SingleRank (Litvak and Last, 2008), PositionRank (Florescu and Caragea, 2017), TopicRank (Bougouin et al., 2013), and SIFRank (Sun et al., 2020). Table 3 presents the comparison on the PubMed dataset. Since the unsupervised methods are ranking-based, performance is evaluated in terms of F1-measure when a fixed number of keyphrases is extracted. To convert our model into a ranking model, we compute the probability of each predicted keyphrase by using an independence assumption after calculating the marginal probabilities from the CRF layer. The results illustrate that our model outperforms previous unsupervised methods by a significant margin.
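The conversion to a ranking model and the fixed-cutoff evaluation can be sketched as below. This is an illustration under stated assumptions: each predicted keyphrase is scored by the product of the marginal probabilities of its tokens (the independence assumption), and F1 is computed over the top k phrases by exact string match. The marginal values, the phrases, and the function names are hypothetical and are not taken from our results.

```python
import math


def phrase_probability(token_marginals):
    """Score a phrase as the product of its per-token marginal probabilities."""
    return math.prod(token_marginals)


def f1_at_k(ranked_phrases, gold_phrases, k):
    """F1-measure of the top-k ranked phrases against the gold keyphrases."""
    top_k = ranked_phrases[:k]
    gold = set(gold_phrases)
    correct = sum(1 for phrase in top_k if phrase in gold)
    if correct == 0:
        return 0.0
    precision = correct / len(top_k)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predicted phrases with made-up token marginals.
predicted = {
    "pulmonary hypertension": [0.9, 0.8],
    "anticoagulation": [0.7],
    "high risk": [0.3, 0.5],
}
phrase_scores = {p: phrase_probability(m) for p, m in predicted.items()}
ranked = sorted(phrase_scores, key=phrase_scores.get, reverse=True)
gold = ["pulmonary hypertension", "duration of anticoagulation"]
f1 = f1_at_k(ranked, gold, k=2)
```

One consequence of the product form is that longer phrases tend to receive lower scores than single-token phrases with similar per-token confidence, which is worth keeping in mind when comparing candidates of different lengths.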
In Figure 3, we compare keyphrases tagged by the BioBERT model and our model on a sample abstract. The true positives are colored blue while false negatives are in red. We observe that the BioBERT model fails to identify 'chronic thromboembolic pulmonary hypertension' as an important keyphrase whereas our model correctly identifies it. This may be due to the single occurrence of 'pulmonary hypertension' in the input text. Meanwhile our model leverages the document embedding to 'understand' that pulmonary hypertension is semantically relevant in the context of the entire abstract. We also observe a similar pattern with the keyphrase 'duration of anticoagulation'. Even though both models fail to capture the entire phrase, our model identifies 'anticoagulation' as a strong candidate because of its semantic meaning in the context of the whole abstract.
The figure also illustrates the limitations of the models, as both struggle with common words such as 'post' and 'high' that are attached as prefixes to important keywords. 'High risk', 'duration of' and 'post-' are considered unimportant by both models. This can be explained by the fact that such words usually occur outside a keyphrase boundary and get overlooked even when they appear with important words. The false positives of both models are nonetheless important terms: the phrases are very relevant in the context of the abstract but were not selected by the authors.

Conclusion
In this paper, we proposed a keyphrase extraction method that focuses on identifying words which are central to the document semantics. The problem of keyphrase extraction is posed as a sequence labeling task where each token is tagged as either part of a keyphrase or not. In addition to our novel centrality constraint layer, we used BiLSTM layers to capture the long-term dependencies in the input sequence. Finally, a CRF layer, which is well suited to capturing the dependencies among the output labels, produces the final label sequence. Empirical results on two datasets show that our method achieves a significant improvement on the PubMed dataset while performing slightly better on the INSPEC dataset.