Deep Keyphrase Generation

Keyphrases provide highly summative information that can be effectively used for understanding, organizing, and retrieving text content. Though previous studies have provided many workable solutions for automated keyphrase extraction, they commonly divide the to-be-summarized content into multiple text chunks, then rank and select the most meaningful ones. These approaches can neither identify keyphrases that do not appear in the text, nor capture the real semantic meaning behind the text. We propose a generative model for keyphrase prediction with an encoder-decoder framework, which can effectively overcome the above drawbacks. We name it deep keyphrase generation since it attempts to capture the deep semantic meaning of the content with a deep learning method. Empirical analysis on six datasets demonstrates that our proposed model not only achieves a significant performance boost on extracting keyphrases that appear in the source text, but also can generate absent keyphrases based on the semantic meaning of the text. Code and dataset are available at https://github.com/memray/seq2seq-keyphrase.


Introduction
A keyphrase or keyword is a short piece of text that summarizes the main semantic meaning of a longer text. The typical use of a keyphrase or keyword is in scientific publications to provide the core information of a paper. We use the term "keyphrase" interchangeably with "keyword" in the rest of this paper, as both terms have an implication that they may contain multiple words. High-quality keyphrases can facilitate the understanding, organizing, and accessing of document content. As a result, many studies have focused on ways of automatically extracting keyphrases from textual content (Liu et al., 2009; Medelyan et al., 2009a; Witten et al., 1999). Due to public accessibility, many scientific publication datasets are often used as test beds for keyphrase extraction algorithms. Therefore, this study also focuses on extracting keyphrases from scientific publications.
Automatically extracting keyphrases from a document is called keyphrase extraction, and it has been widely used in many applications, such as information retrieval (Jones and Staveley, 1999), text summarization (Zhang et al., 2004), text categorization (Hulth and Megyesi, 2006), and opinion mining (Berend, 2011). Most of the existing keyphrase extraction algorithms have addressed this problem through two steps (Liu et al., 2009; Tomokiyo and Hurst, 2003). The first step is to acquire a list of keyphrase candidates. Researchers have tried to use n-grams or noun phrases with certain part-of-speech patterns for identifying potential candidates (Hulth, 2003; Le et al., 2016; Liu et al., 2010; Wang et al., 2016). The second step is to rank candidates on their importance to the document, either through supervised or unsupervised machine learning methods with a set of manually-defined features (Frank et al., 1999; Liu et al., 2009, 2010; Kelleher and Luz, 2005; Matsuo and Ishizuka, 2004; Mihalcea and Tarau, 2004; Song et al., 2003; Witten et al., 1999).
arXiv:1704.06879v3 [cs.CL] 31 May 2021
There are two major drawbacks in the above keyphrase extraction approaches. First, these methods can only extract the keyphrases that appear in the source text; they fail at predicting meaningful keyphrases with a slightly different sequential order or those that use synonyms. However, authors of scientific publications commonly assign keyphrases based on their semantic meaning, instead of following the written content in the publication. In this paper, we denote phrases that do not match any contiguous subsequence of source text as absent keyphrases, and the ones that fully match a part of the text as present keyphrases. Table 1 shows the proportion of present and absent keyphrases from the document abstract in four commonly-used datasets, from which we can observe large portions of absent keyphrases in all the datasets. The absent keyphrases cannot be extracted through previous approaches, which further prompts the development of a more powerful keyphrase prediction model.
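The present/absent distinction defined above can be illustrated with a small sketch. It assumes whitespace-tokenized, lowercased text and exact token matching; the function name `is_present` and the example sentence are hypothetical and not taken from the released code.

```python
def is_present(keyphrase_tokens, source_tokens):
    """Return True if the keyphrase matches a contiguous subsequence of the source."""
    n = len(keyphrase_tokens)
    if n == 0:
        return False
    return any(
        source_tokens[i:i + n] == keyphrase_tokens
        for i in range(len(source_tokens) - n + 1)
    )


# Illustrative source text (an assumption, not from any dataset).
source = "we propose a topic model based on latent dirichlet allocation".split()

present = is_present("latent dirichlet allocation".split(), source)
absent = not is_present("topic modeling".split(), source)   # differs in word form
reordered = is_present("model topic".split(), source)       # differs in word order
```

Under this definition, a phrase that differs from the text only in word order or word form counts as absent, which is exactly the case that candidate-ranking extractors cannot cover.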
Second, when ranking phrase candidates, previous approaches often adopted machine learning features such as TF-IDF and PageRank. However, these features only aim to detect the importance of each word in the document based on the statistics of word occurrence and co-occurrence, and are unable to reveal the full semantics that underlie the document content. To overcome the limitations of previous studies, we re-examine the process of keyphrase prediction with a focus on how real human annotators would assign keyphrases. Given a document, human annotators will first read the text to get a basic understanding of the content, then they try to digest its essential content and summarize it into keyphrases. Their generation of keyphrases relies on an understanding of the content, which may not necessarily use the exact words that occur in the source text. For example, when human annotators see "Latent Dirichlet Allocation" in the text, they might write down "topic modeling" and/or "text mining" as possible keyphrases. In addition to the semantic understanding, human annotators might also go back and pick up the most important parts, based on syntactic features. For example, the phrases following "we propose/apply/use" could be important in the text. As a result, a better keyphrase prediction model should understand the semantic meaning of the content, as well as capture the contextual features.
To effectively capture both the semantic and syntactic features, we use recurrent neural networks (RNN) (Cho et al., 2014; Gers and Schmidhuber, 2001) to compress the semantic information in the given text into a dense vector (i.e., semantic understanding). Furthermore, we incorporate a copying mechanism (Gu et al., 2016) to allow our model to find important parts based on positional information. Thus, our model can generate keyphrases based on an understanding of the text, regardless of the presence or absence of keyphrases in the text; at the same time, it does not lose important in-text information.
The contribution of this paper is three-fold. First, we propose to apply an RNN-based generative model to keyphrase prediction, as well as incorporate a copying mechanism in RNN, which enables the model to successfully predict phrases that rarely occur. Second, this is the first work that concerns the problem of absent keyphrase prediction for scientific publications, and our model recalls up to 20% of absent keyphrases. Third, we conducted a comprehensive comparison against six important baselines on a broad range of datasets, and the results show that our proposed model significantly outperforms existing supervised and unsupervised extraction methods.
In the remainder of this paper, we first review the related work in Section 2. Then, we elaborate upon the proposed model in Section 3. After that, we present the experiment setting in Section 4 and results in Section 5, followed by our discussion in Section 6. Section 7 concludes the paper.

Automatic Keyphrase Extraction
A keyphrase provides a succinct and accurate way of describing a subject or a subtopic in a document. A number of extraction algorithms have been proposed, and the process of extracting keyphrases can typically be broken down into two steps.
The first step is to generate a list of phrase candidates with heuristic methods. As these candidates are prepared for further filtering, a considerable number of candidates are produced in this step to increase the possibility that most of the correct keyphrases are kept. The primary ways of extracting candidates include retaining word sequences that match certain part-of-speech tag patterns (e.g., nouns, adjectives) (Liu et al., 2011; Wang et al., 2016; Le et al., 2016), and extracting important n-grams or noun phrases (Hulth, 2003; Medelyan et al., 2008).
The second step is to score each candidate phrase for its likelihood of being a keyphrase in the given document. The top-ranked candidates are returned as keyphrases. Both supervised and unsupervised machine learning methods are widely employed here. For supervised methods, this task is solved as a binary classification problem, and various types of learning methods and features have been explored (Frank et al., 1999; Witten et al., 1999; Hulth, 2003; Medelyan et al., 2009b; Lopez and Romary, 2010; Gollapalli and Caragea, 2014). As for unsupervised approaches, primary ideas include finding the central nodes in a text graph (Mihalcea and Tarau, 2004; Grineva et al., 2009), detecting representative phrases from topical clusters (Liu et al., 2009, 2010), and so on.
Aside from the commonly adopted two-step process, two previous studies realized keyphrase extraction in entirely different ways. Tomokiyo and Hurst (2003) applied two language models to measure the phraseness and informativeness of phrases. Liu et al. (2011) share the most similar ideas to our work. They used a word alignment model, which learns a translation from the documents to the keyphrases. This approach alleviates the problem of vocabulary gaps between source and target to a certain degree. However, this translation model is unable to handle semantic meaning. Additionally, this model was trained with the target of title/summary to enlarge the number of training samples, which may diverge from the real objective of generating keyphrases. Zhang et al. (2016) proposed a joint-layer recurrent neural network model to extract keyphrases from tweets, which is another application of deep neural networks in the context of keyphrase extraction. However, their work focused on sequence labeling, and is therefore not able to predict absent keyphrases.

Encoder-Decoder Model
The RNN Encoder-Decoder model (also referred to as sequence-to-sequence learning) is an end-to-end approach. It was first introduced by Cho et al. (2014) and Sutskever et al. (2014) to solve translation problems. As it provides a powerful tool for modeling variable-length sequences in an end-to-end fashion, it fits many natural language processing tasks and has rapidly achieved great success (Rush et al., 2015; Vinyals et al., 2015; Serban et al., 2016).
Different strategies have been explored to improve the performance of the Encoder-Decoder model. The attention mechanism (Bahdanau et al., 2014) is a soft alignment approach that allows the model to automatically locate the relevant input components. In order to make use of the important information in the source text, some studies sought ways to copy certain parts of content from the source text and paste them into the target text (Allamanis et al., 2016; Gu et al., 2016; Zeng et al., 2016). A discrepancy exists between the optimizing objective during training and the metrics during evaluation. A few studies attempted to eliminate this discrepancy by incorporating new training algorithms (Ranzato et al., 2016) or by modifying the optimizing objectives (Shen et al., 2016).

Methodology
This section will introduce our proposed deep keyphrase generation method in detail. First, the task of keyphrase generation is defined, followed by an overview of how we apply the RNN Encoder-Decoder model. Details of the framework as well as the copying mechanism will be introduced in Sections 3.3 and 3.4.

Problem Definition
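Given a keyphrase dataset that consists of $N$ data samples, the $i$-th data sample $(x^{(i)}, p^{(i)})$ contains one source text $x^{(i)}$ and $M_i$ target keyphrases $p^{(i)} = (p^{(i,1)}, p^{(i,2)}, \ldots, p^{(i,M_i)})$. Both the source text $x^{(i)}$ and keyphrase $p^{(i,j)}$ are sequences of words:
$$x^{(i)} = x^{(i)}_1, x^{(i)}_2, \ldots, x^{(i)}_{L_{x^{(i)}}}$$
$$p^{(i,j)} = y^{(i,j)}_1, y^{(i,j)}_2, \ldots, y^{(i,j)}_{L_{p^{(i,j)}}}$$
where $L_{x^{(i)}}$ and $L_{p^{(i,j)}}$ denote the lengths of the word sequences $x^{(i)}$ and $p^{(i,j)}$, respectively.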
Each data sample contains one source text sequence and multiple target phrase sequences. To apply the RNN Encoder-Decoder model, the data need to be converted into text-keyphrase pairs that contain only one source sequence and one target sequence. We adopt a simple way, which splits the data sample $(x^{(i)}, p^{(i)})$ into $M_i$ pairs: $(x^{(i)}, p^{(i,1)}), (x^{(i)}, p^{(i,2)}), \ldots, (x^{(i)}, p^{(i,M_i)})$. Then the Encoder-Decoder model is ready to be applied to learn the mapping from the source sequence to the target sequence. For the purpose of simplicity, $(x, y)$ is used to denote each data pair in the rest of this section, where $x$ is the word sequence of a source text and $y$ is the word sequence of its keyphrase.
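The splitting step described above can be sketched in a few lines. This is a minimal illustration, assuming each sample is stored as a (source tokens, list of keyphrase token lists) tuple; the function name `to_pairs` and the toy samples are hypothetical.

```python
def to_pairs(samples):
    """Split each (source, [keyphrase_1, ..., keyphrase_M]) sample into M (source, keyphrase) pairs."""
    pairs = []
    for source, keyphrases in samples:
        for phrase in keyphrases:
            pairs.append((source, phrase))
    return pairs


# Toy samples for illustration only.
samples = [
    (["deep", "keyphrase", "generation"],
     [["keyphrase", "generation"], ["deep", "learning"]]),
    (["video", "search"],
     [["video", "retrieval"]]),
]

pairs = to_pairs(samples)
```

Each source text is repeated once per target keyphrase, so a sample with $M_i$ keyphrases contributes $M_i$ training pairs.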

Encoder-Decoder Model
The basic idea of our keyphrase generation model is to compress the content of the source text into a hidden representation with an encoder and to generate corresponding keyphrases with the decoder, based on the representation. Both the encoder and decoder are implemented with recurrent neural networks (RNN).
The encoder RNN converts the variable-length input sequence $x = (x_1, x_2, \ldots, x_T)$ into a set of hidden representations $h = (h_1, h_2, \ldots, h_T)$, by iterating the following equation along time $t$:
$$h_t = f(x_t, h_{t-1}) \qquad (1)$$
where $f$ is a non-linear function. We get the context vector $c$, acting as the representation of the whole input $x$, through a non-linear function $q$:
$$c = q(h_1, h_2, \ldots, h_T) \qquad (2)$$
The decoder is another RNN; it decompresses the context vector and generates a variable-length sequence $y = (y_1, y_2, \ldots, y_{T'})$ word by word, through a conditional language model:
$$s_t = f(y_{t-1}, s_{t-1}, c)$$
$$p(y_t \mid y_{1,\ldots,t-1}, x) = g(y_{t-1}, s_t, c) \qquad (3)$$
where $s_t$ is the hidden state of the decoder RNN at time $t$. The non-linear function $g$ is a softmax classifier, which outputs the probabilities of all the words in the vocabulary. $y_t$ is the predicted word at time $t$, obtained by taking the word with the largest probability after $g(\cdot)$.
The encoder and decoder networks are trained jointly to maximize the conditional probability of the target sequence, given a source sequence. After training, we use beam search to generate phrases, and a max heap is maintained to get the predicted word sequences with the highest probabilities.
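To make the generation step concrete, the following is a minimal beam search sketch. It is an illustration under stated assumptions rather than the released implementation: `toy_next_word_log_probs` is a hypothetical stand-in for the decoder softmax, the probability table is invented, and `heapq.nlargest` plays the role of the max heap that keeps the highest-scoring partial phrases.

```python
import heapq
import math

EOS = "</s>"  # hypothetical end-of-phrase symbol


def toy_next_word_log_probs(prefix):
    """Hypothetical stand-in for the decoder: log-probabilities of the next word given a prefix."""
    table = {
        (): {"video": 0.6, "search": 0.4},
        ("video",): {"retrieval": 0.5, "search": 0.3, EOS: 0.2},
        ("search",): {EOS: 0.9, "engine": 0.1},
        ("video", "retrieval"): {EOS: 1.0},
        ("video", "search"): {EOS: 1.0},
        ("search", "engine"): {EOS: 1.0},
    }
    return {w: math.log(p) for w, p in table[tuple(prefix)].items()}


def beam_search(next_log_probs, beam_size, max_depth):
    """Return finished phrases as (log-probability, words), best first."""
    beams = [(0.0, [])]  # (cumulative log-probability, words so far)
    finished = []
    for _ in range(max_depth):
        candidates = []
        for score, words in beams:
            for word, log_p in next_log_probs(words).items():
                if word == EOS:
                    finished.append((score + log_p, words))
                else:
                    candidates.append((score + log_p, words + [word]))
        if not candidates:
            break
        # Keep only the beam_size most probable partial phrases.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return sorted(finished, key=lambda c: c[0], reverse=True)


results = beam_search(toy_next_word_log_probs, beam_size=2, max_depth=6)
phrases = [" ".join(words) for _, words in results]
```

In this toy table the single word "search" outranks the two-word phrases because every additional word multiplies in another probability below one.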

Details of the Encoder and Decoder
A bidirectional gated recurrent unit (GRU) is applied as our encoder to replace the simple recurrent neural network. Previous studies (Bahdanau et al., 2014; Cho et al., 2014) indicate that it can generally provide better language modeling performance than a simple RNN and has a simpler structure than Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997). As a result, the above non-linear function $f$ is replaced by the GRU function (see Cho et al., 2014).
Another forward GRU is used as the decoder. In addition, an attention mechanism is adopted to improve performance. The attention mechanism was first introduced by Bahdanau et al. (2014) to make the model dynamically focus on the important parts of the input. The context vector $c$ is computed as a weighted sum of the hidden representations $h = (h_1, \ldots, h_T)$:
$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_{k=1}^{T} \exp(a(s_{i-1}, h_k))} \qquad (4)$$
where $a(s_{i-1}, h_j)$ is a soft alignment function that measures the similarity between $s_{i-1}$ and $h_j$; namely, to which degree the inputs around position $j$ and the output at position $i$ match.
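The attention-weighted context vector can be sketched as follows: alignment scores between the previous decoder state and each encoder state are normalized with a softmax, and the context is the resulting weighted sum. The dot-product alignment used here is an assumption made for brevity (Bahdanau et al. use a small feed-forward network), and the vectors are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_context(prev_state, encoder_states, align=dot):
    """Return (weights, context): softmax-normalized alignment scores and the weighted sum of encoder states."""
    scores = [align(prev_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy encoder states h_1..h_3 and previous decoder state s_{i-1} (illustrative values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [2.0, 0.0]
weights, context = attention_context(prev_state, encoder_states)
```

Positions whose encoder state aligns better with the decoder state receive larger weights, so the context vector changes at every decoding step instead of being a single fixed summary of the input.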

Copying Mechanism
To ensure the quality of the learned representation and reduce the size of the vocabulary, the RNN model typically considers a certain number of frequent words (e.g., 30,000 words in Cho et al., 2014), but a large number of long-tail words are simply ignored. Therefore, the RNN is not able to recall any keyphrase that contains out-of-vocabulary words. In fact, important phrases can also be identified by positional and syntactic information in their contexts, even though their exact meanings are not known. The copying mechanism (Gu et al., 2016) is one feasible solution that enables the RNN to predict out-of-vocabulary words by selecting appropriate words from the source text.
By incorporating the copying mechanism, the probability of predicting each new word $y_t$ consists of two parts. The first term is the probability of generating the term (see Equation 3) and the second one is the probability of copying it from the source text:
$$p(y_t \mid y_{1,\ldots,t-1}, x) = p_g(y_t \mid y_{1,\ldots,t-1}, x) + p_c(y_t \mid y_{1,\ldots,t-1}, x) \qquad (5)$$
Similar to the attention mechanism, the copying mechanism weights the importance of each word in the source text with a measure of positional attention. But unlike the generative RNN, which predicts the next word from all the words in the vocabulary, the copying part $p_c(y_t \mid y_{1,\ldots,t-1}, x)$ only considers the words in the source text. Consequently, on the one hand, the RNN with the copying mechanism is able to predict the words that are out of vocabulary but in the source text; on the other hand, the model would potentially give preference to the appearing words, which caters to the fact that most keyphrases tend to appear in the source text.
$$p_c(y_t \mid y_{1,\ldots,t-1}, x) = \frac{1}{Z} \sum_{j: x_j = y_t} \exp(\psi_c(x_j)), \quad y_t \in \chi$$
$$\psi_c(x_j) = \sigma(h_j^{\top} W_c)\, s_t \qquad (6)$$
where $\chi$ is the set of all of the unique words in the source text $x$, $\sigma$ is a non-linear function and $W_c \in \mathbb{R}$ is a learned parameter matrix. $Z$ is the sum of all the scores and is used for normalization. Please see Gu et al. (2016) for more details.
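As a rough illustration of how the generation and copying probabilities combine, the sketch below normalizes vocabulary scores and per-position copy scores with one shared constant Z and sums the two contributions per word. The shared normalizer follows our reading of Gu et al. (2016) and should be treated as an assumption; all scores, words, and the function name `copy_generate_distribution` are hypothetical.

```python
import math
from collections import defaultdict


def copy_generate_distribution(gen_scores, copy_scores, source_words):
    """Combine generation scores (word -> score) and copy scores (one per source position).

    Both parts share the normalizer z, so p_g + p_c sums to one over all words.
    """
    z = (sum(math.exp(s) for s in gen_scores.values())
         + sum(math.exp(s) for s in copy_scores))
    probs = defaultdict(float)
    for word, score in gen_scores.items():
        probs[word] += math.exp(score) / z   # generation part p_g
    for word, score in zip(source_words, copy_scores):
        probs[word] += math.exp(score) / z   # copy part p_c, summed over positions j with x_j == word
    return dict(probs)


# Illustrative scores: "metadata" is out of vocabulary but occurs in the source.
gen_scores = {"video": 1.0, "search": 0.5, "<unk>": 0.0}
source_words = ["video", "metadata", "video"]
copy_scores = [0.2, 1.5, 0.1]

probs = copy_generate_distribution(gen_scores, copy_scores, source_words)
```

The out-of-vocabulary word "metadata" receives probability mass purely from the copy part, while "video" collects mass from the generation part and from both of its source positions.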

Experiment Settings
This section begins by discussing how we designed our evaluation experiments, followed by the description of training and testing datasets. Then, we introduce our evaluation metrics and baselines.

Training Dataset
There are several publicly-available datasets for evaluating keyphrase generation. The largest one came from Krapivin et al. (2008), which contains 2,304 scientific publications. However, this amount of data is insufficient to train a robust recurrent neural network model. In fact, there are millions of scientific papers available online, each of which contains the keyphrases that were assigned by their authors. Therefore, we collected a large amount of high-quality scientific metadata in the computer science domain from various online digital libraries, including ACM Digital Library, ScienceDirect, Wiley, and Web of Science, among others (Han et al., 2013; Rui et al., 2016). In total, we obtained a dataset of 567,830 articles, after removing duplicates and overlaps with testing datasets, which is more than 200 times larger than the one of Krapivin et al. (2008). Note that our model is only trained on 527,830 articles, since 40,000 publications are randomly held out, among which 20,000 articles were used for building a new test dataset KP20k. Another 20,000 articles served as the validation dataset to check the convergence of our model, as well as the training dataset for supervised baselines.

Testing Datasets
For evaluating the proposed model more comprehensively, four widely-adopted scientific publication datasets were used. In addition, since these datasets only contain a few hundred or a few thousand publications, we contribute a new testing dataset KP20k with a much larger number of scientific articles. We take the title and abstract as the source text. Each dataset is described in detail below.
- Inspec (Hulth, 2003): This dataset provides 2,000 paper abstracts. We adopt the 500 testing papers and their corresponding uncontrolled keyphrases for evaluation, and the remaining 1,500 papers are used for training the supervised baseline models.
- Krapivin (Krapivin et al., 2008): This dataset provides 2,304 papers with full-text and author-assigned keyphrases. However, the authors did not mention how to split testing data, so we selected the first 400 papers in alphabetical order as the testing data, and the remaining papers are used to train the supervised baselines.
- NUS (Nguyen and Kan, 2007): We use both author-assigned and reader-assigned keyphrases and treat all 211 papers as the testing data. Since the NUS dataset did not specifically mention the ways of splitting training and testing data, the results of the supervised baseline models are obtained through a five-fold cross-validation.
- SemEval-2010 (Kim et al., 2010): 288 articles were collected from the ACM Digital Library. 100 articles were used for testing and the rest were used for training supervised baselines.
- KP20k: We built a new testing dataset that contains the titles, abstracts, and keyphrases of 20,000 scientific articles in computer science. They were randomly selected from our obtained 567,830 articles. Due to the memory limits of implementation, we were not able to train the supervised baselines on the whole training set. Thus we take the 20,000 articles in the validation set to train the supervised baselines. It is worth noting that we also examined their performance by enlarging the training dataset to 50,000 articles, but no significant improvement was observed.

Implementation Details
In total, there are 2,780,316 (text, keyphrase) pairs for training, in which text refers to the concatenation of the title and abstract of a publication, and keyphrase indicates an author-assigned keyword. The text pre-processing steps, including tokenization, lowercasing, and replacing all digits with the symbol <digit>, are applied. Two encoder-decoder models are trained, one with only the attention mechanism (RNN) and one with both the attention and copying mechanisms enabled (CopyRNN). For both models, we choose the 50,000 most frequent words as our vocabulary, the dimension of embedding is set to 150, the dimension of hidden layers is set to 300, and the word embeddings are randomly initialized with a uniform distribution in [-0.1, 0.1]. Models are optimized using Adam (Kingma and Ba, 2014) with initial learning rate = $10^{-4}$, gradient clipping = 0.1 and dropout rate = 0.5. The max depth of beam search is set to 6, and the beam size is set to 200.
The training is stopped once convergence is determined on the validation dataset (namely early stopping: the cross-entropy loss stops dropping for several iterations).
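The pre-processing steps named above (concatenating title and abstract, tokenization, lowercasing, and digit replacement) could look roughly like the sketch below. The regular-expression tokenizer and the function name `preprocess` are assumptions for illustration; the tokenizer actually used is not specified in this section.

```python
import re

DIGIT_TOKEN = "<digit>"

# Assumed tokenization: runs of letters, numbers (with optional , or . groups), or single symbols.
TOKEN_PATTERN = re.compile(r"[a-z]+|\d+(?:[.,]\d+)*|[^\sa-z\d]")


def preprocess(title, abstract):
    """Concatenate title and abstract, lowercase, tokenize, and map numbers to <digit>."""
    text = f"{title} {abstract}".lower()
    tokens = TOKEN_PATTERN.findall(text)
    return [DIGIT_TOKEN if tok[0].isdigit() else tok for tok in tokens]


tokens = preprocess(
    "Deep Keyphrase Generation",
    "We evaluate on 6 datasets with 20,000 articles.",
)
```

Mapping every number to one shared symbol keeps rare numeric strings from consuming slots in the fixed-size vocabulary.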
In the generation of keyphrases, we find that the model tends to assign higher probabilities for shorter keyphrases, whereas most keyphrases contain more than two words.To resolve this problem, we apply a simple heuristic by preserving only the first single-word phrase (with the highest generating probability) and removing the rest.
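The single-word heuristic can be sketched as a filter over the ranked predictions. This is a minimal illustration that assumes the predictions are already sorted by decreasing generation probability; the function name and example phrases are hypothetical.

```python
def keep_first_single_word(ranked_phrases):
    """Keep all multi-word phrases, but only the highest-ranked single-word phrase."""
    kept = []
    seen_single = False
    for phrase in ranked_phrases:
        if len(phrase.split()) == 1:
            if seen_single:
                continue  # drop every later single-word phrase
            seen_single = True
        kept.append(phrase)
    return kept


# Illustrative ranked output (best first).
ranked = ["search", "video", "video retrieval", "ranking", "video metadata"]
filtered = keep_first_single_word(ranked)
```

The relative order of the surviving phrases is unchanged, so top-k evaluation still reflects the model's ranking.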

Baseline Models
Four unsupervised algorithms (Tf-Idf, TextRank (Mihalcea and Tarau, 2004), SingleRank (Wan and Xiao, 2008), and ExpandRank (Wan and Xiao, 2008)) and two supervised algorithms (KEA (Witten et al., 1999) and Maui (Medelyan et al., 2009a)) are adopted as baselines. We set up the four unsupervised methods following the optimal settings in Hasan and Ng (2010), and the two supervised methods following the default settings as specified in their papers.

Evaluation Metric
Three evaluation metrics, the macro-averaged precision, recall and F-measure ($F_1$), are employed for measuring the algorithm's performance. Following the standard definition, precision is defined as the number of correctly-predicted keyphrases over the number of all predicted keyphrases, and recall is computed by the number of correctly-predicted keyphrases over the total number of data records. Note that, when determining the match of two keyphrases, we use the Porter Stemmer for preprocessing.
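A minimal sketch of the per-document metrics and their macro average follows. It takes "data records" to mean the ground-truth keyphrases of a document, which is our assumption, and it uses lowercasing as a stand-in for Porter stemming; a real stemmer could be passed through the hypothetical `normalize` parameter.

```python
def precision_recall_f1_at_k(predicted, gold, k, normalize=str.lower):
    """Precision, recall and F1 of the top-k predictions for one document."""
    top_k = [normalize(p) for p in predicted[:k]]
    gold_set = {normalize(g) for g in gold}
    correct = sum(1 for p in set(top_k) if p in gold_set)
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_average(per_document_scores):
    """Average each metric over documents, giving every document equal weight."""
    n = len(per_document_scores)
    return tuple(sum(s[i] for s in per_document_scores) / n for i in range(3))


# Two toy documents (illustrative predictions and gold keyphrases).
doc1 = precision_recall_f1_at_k(
    ["Video Retrieval", "search", "video indexing", "ranking"],
    ["video retrieval", "video indexing", "video search"],
    k=4,
)
doc2 = precision_recall_f1_at_k(
    ["topic modeling"],
    ["topic modeling", "text mining"],
    k=4,
)
macro_p, macro_r, macro_f1 = macro_average([doc1, doc2])
```

Macro-averaging computes the metrics per document first and then averages them, so documents with many keyphrases do not dominate the score.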

Results and Analysis
We conduct an empirical study on three different tasks to evaluate our model.

Predicting Present Keyphrases
This is the same as the keyphrase extraction task in prior studies, in which we analyze how well our proposed model performs on a commonly-defined task. To make a fair comparison, we only consider the present keyphrases for evaluation in this task. Table 2 provides the performances of the six baseline models, as well as our proposed models (i.e., RNN and CopyRNN). For each method, the table lists its F-measure at top 5 and top 10 predictions on the five datasets. The best scores are highlighted in bold and the underlines indicate the second best performances.
The results show that the four unsupervised models (Tf-Idf, TextRank, SingleRank and ExpandRank) have a robust performance across different datasets. ExpandRank fails to return any result on the KP20k dataset, due to its high time complexity. The measures on NUS and SemEval here are higher than the ones reported in Hasan and Ng (2010) and Kim et al. (2010), probably because we utilized the paper abstract instead of the full text for training, which may filter out some noisy information. The performance of the two supervised models (i.e., Maui and KEA) was unstable on some datasets, but Maui achieved the best performances on three datasets among all the baseline models.
As for our proposed keyphrase prediction approaches, the RNN model with the attention mech-

Table 2 columns: F1@5 and F1@10 for each of Inspec, Krapivin, NUS, SemEval, and KP20k.
The example in Figure 1(a) shows the present keyphrases predicted by RNN and CopyRNN for an article about video search. We see that both models can generate phrases that relate to the topic of information retrieval and video. However, most of the RNN predictions are high-level terminologies, which are too general to be selected as keyphrases. CopyRNN, on the other hand, predicts more detailed phrases like "video metadata" and "integrated ranking". In an interesting bad case, "rich content" coordinates with the keyphrase "video metadata", and CopyRNN mistakenly includes it in its predictions.

Predicting Absent Keyphrases
As stated, one important motivation for this work is that we are interested in the proposed model's capability for predicting absent keyphrases based on the "understanding" of content. It is worth noting that such prediction is a very challenging task, and, to the best of our knowledge, no existing methods can handle this task. Therefore, we only provide the RNN and CopyRNN performances in the discussion of the results of this task. Here, we evaluate performance using recall at the top 10 and top 50 results, to see how many absent keyphrases can be correctly predicted. We use the absent keyphrases in the testing datasets for evaluation.

Table 3 columns: R@10 and R@50 for RNN and CopyRNN on each dataset.
This indicates that, to some extent, both models can capture the hidden semantics behind the textual content and make reasonable predictions. In addition, with the advantage of features from the source text, the CopyRNN model also outperforms the RNN model in this condition, though it does not show as much improvement as in the present keyphrase extraction task. An example is shown in Figure 1(b), in which we see that two absent keyphrases, "video retrieval" and "video indexing", are correctly recalled by both models. Note that the term "indexing" does not appear in the text, but the models may detect the information "index videos" in the first sentence and paraphrase it to the target phrase. CopyRNN also successfully predicts another two keyphrases by capturing the detailed information from the text (highlighted text segments).

Transferring the Model to the News Domain
RNN and CopyRNN are supervised models, and they are trained on data in a specific domain and writing style. However, with sufficient training on a large-scale dataset, we expect the models to be able to learn universal language features that are also effective in other corpora. Thus in this task, we test our model on another type of text, to see whether the model would work when transferred to a different environment.
We use the popular news article dataset DUC-2001 (Wan and Xiao, 2008) for analysis. The dataset consists of 308 news articles and 2,488 manually annotated keyphrases. The result of this analysis is shown in Table 4, from which we can see that CopyRNN can extract a portion of correct keyphrases from an unfamiliar text. We also report the baseline performance included in Hasan and Ng (2010). The performance of CopyRNN is better than TextRank (Mihalcea and Tarau, 2004) and KeyCluster (Liu et al., 2009), but lags behind the other three baselines. It is worth noting that the hyperparameters of baseline models, such as the number of recalled keyphrases for Tf-Idf and SingleRank, are carefully tuned and may drastically affect the results. However, for CopyRNN we simply report its $F_1$ score of the top 10 predicted phrases ($F_1$@10).
As it is transferred to a corpus of a completely different type and domain, the model encounters more unknown words and has to rely more on the positional and syntactic features within the text. In this experiment, CopyRNN recalls 766 keyphrases. 14.3% of them contain out-of-vocabulary words, and many names of persons and places are correctly predicted.

Discussion
Our experimental results demonstrate that the CopyRNN model not only performs well on predicting present keyphrases, but also has the ability to generate topically relevant keyphrases that are absent in the text. In a broader sense, this model attempts to map a long text (i.e., paper abstract) with representative short text chunks (i.e., keyphrases), which can potentially be applied to improve information retrieval performance by generating high-quality index terms, as well as assisting user browsing by summarizing long documents into short, readable phrases.
Thus far, we have tested our model with scientific publications and news articles, and have demonstrated that our model has the ability to capture universal language patterns and extract key information from unfamiliar texts. We believe that our model has a greater potential to be generalized to other domains and types, like books, online reviews, etc., if it is trained on a larger data corpus. Also, we directly applied our model, which was trained on a publication dataset, to generating keyphrases for news articles without any adaptive training. We believe that with proper training on news data, the model would make further improvement.
Additionally, this work mainly studies the problem of discovering core content from textual materials. Here, the encoder-decoder framework is applied to model language; however, such a framework can also be extended to locate the core information on other data resources, such as summarizing content from images and videos.

Conclusions and Future Work
In this paper, we proposed an RNN-based generative model for predicting keyphrases in scientific text. To the best of our knowledge, this is the first application of the encoder-decoder model to a keyphrase prediction task. Our model sum-
- In this work, we only evaluated the performance of the proposed model by conducting off-line experiments. In the future, we are interested in comparing the model to human annotators and using human judges to evaluate the quality of predicted phrases.
- Our current model does not fully consider correlation among target keyphrases. It would also be interesting to explore the multiple-output optimization aspects of our model.

Figure 1: An example of predicted keyphrases by RNN and CopyRNN. Phrases shown in bold are correct predictions.

Table 1: Proportion of the present keyphrases and absent keyphrases in four public datasets.

Table 2: The performance of predicting present keyphrases of various models on five benchmark datasets.

Table 3: Absent keyphrase prediction performance of RNN and CopyRNN on five datasets.

Table 4: Keyphrase prediction performance of CopyRNN on DUC-2001. The model is trained on scientific publications and evaluated on news.