Neural Keyphrase Generation via Reinforcement Learning with Adaptive Rewards

Generating keyphrases that summarize the main points of a document is a fundamental task in natural language processing. Although existing generative models are capable of predicting multiple keyphrases for an input document as well as determining the number of keyphrases to generate, they still suffer from the problem of generating too few keyphrases. To address this problem, we propose a reinforcement learning (RL) approach for keyphrase generation, with an adaptive reward function that encourages a model to generate both sufficient and accurate keyphrases. Furthermore, we introduce a new evaluation method that incorporates name variations of the ground-truth keyphrases using the Wikipedia knowledge base. Thus, our evaluation method can more robustly evaluate the quality of predicted keyphrases. Extensive experiments on five real-world datasets of different scales demonstrate that our RL approach consistently and significantly improves the performance of the state-of-the-art generative models with both conventional and new evaluation methods.


Introduction
The task of keyphrase generation aims at predicting a set of keyphrases that convey the core ideas of a document. Figure 1 shows a sample document and its keyphrase labels. The keyphrases in red color are present keyphrases that appear in the document, whereas the blue ones are absent keyphrases that do not appear in the input. By distilling the key information of a document into a set of succinct keyphrases, keyphrase generation facilitates a wide variety of downstream applications, including document clustering (Hammouda et al., 2005; Hulth and Megyesi, 2006), opinion mining (Berend, 2011), and summarization (Zhang et al., 2004; Wang and Cardie, 2013).

Figure 1: Sample document with keyphrase labels and predicted keyphrases. We use red (blue) color to highlight present (absent) keyphrases. The underlined phrases are name variations of a keyphrase label. "catSeqD" is a keyphrase generation model from Yuan et al. (2018). "catSeqD-2RF1" denotes the catSeqD model after being trained by our RL approach. The enriched keyphrase labels are based on our new evaluation method.
To produce both present and absent keyphrases, generative methods (Meng et al., 2017; Ye and Wang, 2018; Chen et al., 2018a,b) are designed to apply the attentional encoder-decoder model (Luong et al., 2015) with copy mechanism (Gu et al., 2016; See et al., 2017) to the keyphrase generation task. However, none of the prior models can determine the appropriate number of keyphrases for a document. In reality, the optimal keyphrase count varies and depends on a given document's content. To that end, Yuan et al. (2018) introduced a training setup in which a generative model can learn to decide the number of keyphrases to predict for a given document, and proposed two models. Although they provided a more realistic setup, two drawbacks remain. First, models trained under this setup tend to generate fewer keyphrases than the ground-truth. Our experiments on the largest dataset show that their catSeqD model generates 4.3 keyphrases per document on average, while these documents have 5.3 keyphrase labels on average. Ideally, a model should generate both sufficient and accurate keyphrases. Second, existing evaluation methods rely only on the exact matching of word stems (Porter, 2006) to determine whether a predicted phrase matches a ground-truth phrase. For example, given the document in Figure 1, if a model generates "support vector machine", it will be treated as incorrect because it does not match the word "svm" given by the gold-standard labels. It is therefore desirable for an evaluation method to consider name variations of a ground-truth keyphrase.
To address the first limitation, we design an adaptive reward function, RF1, that encourages a model to generate both sufficient and accurate keyphrases. Concretely, if the number of generated keyphrases is less than that of the ground-truth, we use recall as the reward, which does not penalize the model for generating incorrect predictions. If the model generates sufficient keyphrases, we use the F1 score as the reward, to balance both recall and precision of the predictions. To optimize the model towards this non-differentiable reward function, we formulate the task of keyphrase generation as a reinforcement learning (RL) problem and adopt the self-critical policy gradient method (Rennie et al., 2017) as the training procedure. Our RL approach is flexible and can be applied to any keyphrase generative model with an encoder-decoder structure. In Figure 1, we show a prediction result of the catSeqD model (Yuan et al., 2018) and another prediction result of the catSeqD model after being trained by our RL approach (catSeqD-2RF1). This example illustrates that our RL approach encourages the model to generate more correct keyphrases. Perhaps more importantly, the number of generated keyphrases also increases to five, which is closer to the average ground-truth number (5.3).
Furthermore, we propose a new evaluation method to tackle the second limitation. For each ground-truth keyphrase, we extract its name variations from various sources. If the word stems of a predicted keyphrase match the word stems of any name variation of a ground-truth keyphrase, it is treated as a correct prediction. For instance, in Figure 1, our evaluation method enhances the "svm" ground-truth keyphrase with its name variation, "support vector machine". Thus, the phrase "support vector machine" generated by catSeqD and catSeqD-2RF1 will be considered correct, which demonstrates that our evaluation method is more robust than the existing one.
We conduct extensive experiments to evaluate the performance of our RL approach. Experimental results on five real-world datasets show that our RL approach consistently improves the performance of the state-of-the-art models in terms of F-measures. Moreover, we analyze the sufficiency of the keyphrases generated by different models. We observe that models trained by our RL approach generate more absent keyphrases, whose number is closer to that of the absent ground-truth keyphrases. Finally, we deploy our new evaluation method on the largest keyphrase generation benchmark, where it identifies at least one name variation for 14.1% of the ground-truth keyphrases.
We summarize our contributions as follows: (1) an RL approach with a novel adaptive reward function that explicitly encourages the model to generate both sufficient and accurate keyphrases; (2) a new evaluation method that considers name variations of the keyphrase labels; and (3) new state-of-the-art performance on five real-world datasets in a setting where a model is able to determine the number of keyphrases to generate. This is the first work to study an RL approach to the keyphrase generation problem.

Keyphrase Extraction and Generation
Traditional extractive methods select important phrases from the document as its keyphrase predictions. Most of them adopt a two-step approach. First, they identify keyphrase candidates from the document by heuristic rules. Afterwards, the candidates are either ranked by unsupervised methods (Mihalcea and Tarau, 2004; Wan and Xiao, 2008) or supervised learning algorithms (Medelyan et al., 2009; Witten et al., 1999; Nguyen and Kan, 2007a). Other extractive methods apply sequence tagging models (Luan et al., 2017; Gollapalli et al., 2017) to identify keyphrases. However, extractive methods cannot produce absent keyphrases. To predict both present and absent keyphrases for a document, Meng et al. (2017) proposed a generative model, CopyRNN, which is composed of an attentional encoder-decoder model and a copy mechanism (Gu et al., 2016). Lately, multiple extensions to CopyRNN have been presented. CorrRNN (Chen et al., 2018a) incorporates the correlation among keyphrases. TG-Net (Chen et al., 2018b) exploits the title information to learn a better representation for an input document. Chen et al. (2019) leveraged keyphrase extraction models and external knowledge to improve the performance of keyphrase generation. Ye and Wang (2018) considered a setting where training data is limited, and proposed different semi-supervised methods to enhance the performance. All of the above generative models use beam search to over-generate a large number of keyphrases and select the top-k predicted keyphrases as the final predictions, where k is a fixed number.
Recently, Yuan et al. (2018) introduced a setting where a model has to determine the appropriate number of keyphrases for an input document. They proposed a training setup that empowers a generative model to generate variable numbers of keyphrases for different documents. Two new models, catSeq and catSeqD, were described. Our work considers the same setting and proposes an RL approach, which is equipped with adaptive rewards to generate sufficient and accurate keyphrases. To the best of our knowledge, this is the first time RL has been used for keyphrase generation. Besides, we propose a new evaluation method that considers name variations of the keyphrase labels, which no prior work has done.

Reinforcement Learning for Text Generation
Reinforcement learning has been applied to a wide array of text generation tasks, including machine translation (Wu et al., 2016; Ranzato et al., 2015), text summarization (Paulus et al., 2018), and image/video captioning (Rennie et al., 2017; Pasunuru and Bansal, 2017). These RL approaches lean on the REINFORCE algorithm (Williams, 1992), or its variants, to train a generative model towards a non-differentiable reward by minimizing the policy gradient loss. Different from existing work, our RL approach uses a novel adaptive reward function, which combines the recall and F1 score via a hard gate (an if-else statement).

Problem Definition
We formally define the problem of keyphrase generation as follows. Given a document x, output a set of ground-truth keyphrases Y = {y^1, y^2, . . . , y^{|Y|}}. The document x and each ground-truth keyphrase y^i are sequences of words, i.e., x = (x_1, . . . , x_{l_x}) and y^i = (y^i_1, . . . , y^i_{l_{y^i}}), where l_x and l_{y^i} denote the numbers of words in x and y^i, respectively. A keyphrase that matches any consecutive subsequence of the document is a present keyphrase; otherwise, it is an absent keyphrase. We use Y^p = {y^{p,1}, y^{p,2}, . . . , y^{p,|Y^p|}} and Y^a = {y^{a,1}, y^{a,2}, . . . , y^{a,|Y^a|}} to denote the sets of present and absent ground-truth keyphrases, respectively. Thus, the ground-truth keyphrase set can be expressed as Y = Y^p ∪ Y^a.
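The present/absent distinction above can be made concrete with a short sketch (the function names are illustrative, not from the original implementation):

```python
def is_present(keyphrase_tokens, doc_tokens):
    """A keyphrase is present iff it matches a consecutive subsequence of the document."""
    m = len(keyphrase_tokens)
    return any(doc_tokens[i:i + m] == keyphrase_tokens
               for i in range(len(doc_tokens) - m + 1))

def split_keyphrases(doc_tokens, keyphrases):
    """Partition the ground-truth set Y into present (Y^p) and absent (Y^a) keyphrases."""
    present, absent = [], []
    for kp in keyphrases:
        (present if is_present(kp, doc_tokens) else absent).append(kp)
    return present, absent
```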

Keyphrase Generation Model
In this section, we describe the attentional encoder-decoder model with copy mechanism (See et al., 2017), which is the backbone of our implementations of the baseline generative models.

Our training setup.
For each document-keyphrases pair (x, Y), we join all the keyphrases in Y into one output sequence, y = y^{p,1} ⟨sep⟩ y^{p,2} ⟨sep⟩ . . . ⟨sep⟩ y^{p,|Y^p|} ⟨peos⟩ y^{a,1} ⟨sep⟩ y^{a,2} ⟨sep⟩ . . . ⟨sep⟩ y^{a,|Y^a|}, where ⟨peos⟩ is a special token that indicates the end of present keyphrases, and ⟨sep⟩ is a delimiter between two consecutive present keyphrases or absent keyphrases. Using such (x, y) samples as training data, the encoder-decoder model can learn to generate all the keyphrases in one output sequence and to determine the number of keyphrases to generate. The only difference from the setup in Yuan et al. (2018) is that we use ⟨peos⟩ to mark the end of present keyphrases, instead of using ⟨sep⟩.
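As a minimal sketch, the target-sequence construction described above might look as follows; the literal strings used for the special tokens are placeholders, not necessarily the exact surface forms in the released code:

```python
SEP, PEOS, EOS = "<sep>", "<peos>", "<eos>"

def build_target(present, absent):
    """Join present keyphrases with <sep>, mark their end with <peos>,
    then join absent keyphrases with <sep>, and terminate with <eos>."""
    tokens = []
    for i, kp in enumerate(present):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(kp)
    tokens.append(PEOS)
    for i, kp in enumerate(absent):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(kp)
    tokens.append(EOS)
    return tokens
```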
Attentional encoder-decoder model. We use a bi-directional Gated Recurrent Unit (GRU) as the encoder. The encoder's i-th hidden state h_i ∈ R^{d_h} is the concatenation of its forward and backward hidden states. A single-layered GRU is adopted as the decoder. At decoding step t, the decoder hidden state is s_t = GRU(e_{t−1}, s_{t−1}) ∈ R^{d_s}, where e_{t−1} is the embedding of the (t−1)-th predicted word. Then we apply an attention layer to compute an attention score a_{t,i} for each word x_i in the document. The attention scores are next used to compute a context vector h*_t for the document. The probability of predicting a word y_t from a predefined vocabulary V is defined as P_V(y_t) = softmax(W_V(W_{V'}[s_t; h*_t])). In this paper, all the W terms represent trainable parameters, and we omit the bias terms for brevity.
Pointer-generator network. To alleviate the out-of-vocabulary (OOV) problem, we adopt the copy mechanism from See et al. (2017). For each document x, we build a dynamic vocabulary V_x by merging the predefined vocabulary V and all the words that appear in x. Then, the probability of predicting a word y_t from the dynamic vocabulary V_x is P_{V_x}(y_t) = g_t P_V(y_t) + (1 − g_t) Σ_{i: x_i = y_t} a_{t,i}, where g_t ∈ [0, 1] is a soft gate to select between generating a word from the vocabulary V and copying a word from the document.
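The mixing of the generation and copy distributions can be sketched in a few lines; here `g` plays the role of the soft gate g_t, and attention mass is accumulated over the source positions holding each word id (a simplified illustration of the pointer-generator idea, not the full See et al. (2017) implementation):

```python
def final_distribution(p_vocab, attn, src_ids, g):
    """P_Vx(w) = g * P_V(w) + (1 - g) * sum of attention on source positions holding w.
    p_vocab: probabilities over the predefined vocabulary.
    attn: attention scores over source positions (sums to 1).
    src_ids: word id of each source position; OOV words get ids >= len(p_vocab)."""
    size = max(len(p_vocab), max(src_ids) + 1)  # extended vocabulary size
    out = [0.0] * size
    for w, p in enumerate(p_vocab):
        out[w] = g * p                     # generation part
    for a, wid in zip(attn, src_ids):
        out[wid] += (1 - g) * a            # copy part
    return out
```

Note that an OOV word (id beyond the predefined vocabulary) can still receive probability mass through the copy part, which is exactly how the mechanism alleviates the OOV problem.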
Maximum likelihood training. We use θ to denote all model parameters and y_{1:t−1} to denote the sequence (y_1, . . . , y_{t−1}). Previous work learns the parameters by maximizing the log-likelihood of generating the ground-truth output sequence y, defined as follows:

L(θ) = Σ_{t=1}^{l_y} log P_{V_x}(y_t | y_{1:t−1}, x; θ). (1)

Reinforcement Learning Formulation
We formulate the task of keyphrase generation as a reinforcement learning problem, in which an agent interacts with an environment in discrete time steps. At each time step t = 1, . . . , T, the agent produces an action (word) ŷ_t sampled from the policy π(ŷ_t | ŷ_{1:t−1}, x; θ), where ŷ_{1:t−1} denotes the sequence generated by the agent from step 1 to t−1. After that, the environment gives a reward r_t(ŷ_{1:t}, Y) to the agent and transits to the next step t+1 with a new state ŝ_{t+1} = (ŷ_{1:t}, x, Y). The policy of the agent is a keyphrase generation model, i.e., π(. | ŷ_{1:t−1}, x; θ) = P_{V_x}(. | ŷ_{1:t−1}, x; θ).
To improve the sufficiency and accuracy of both present and absent keyphrases generated by the agent, we give separate reward signals to present keyphrase predictions and absent keyphrase predictions. Hence, we divide our RL problem into two different stages. In the first stage, we evaluate the agent's performance on extracting present keyphrases. Once the agent generates the ⟨peos⟩ token, denoting the current time step as T_p, the environment computes a reward using our adaptive reward function RF1 by comparing the generated keyphrases in ŷ_{1:T_p} with the ground-truth present keyphrases Y^p, i.e., r_{T_p}(ŷ_{1:T_p}, Y) = RF1(ŷ_{1:T_p}, Y^p). Then we enter the second stage, where we evaluate the agent's performance on generating absent keyphrases. Upon generating the EOS token at the final step T, the environment compares the generated keyphrases in ŷ_{T_p+1:T} with the ground-truth absent keyphrases Y^a and computes a reward r_T(ŷ, Y) = RF1(ŷ_{T_p+1:T}, Y^a). After that, the whole process terminates. The reward to the agent is 0 for all other time steps, i.e., r_t(ŷ_{1:t}, Y) = 0 for t ∉ {T_p, T}. The return R_t is the sum of the rewards from step t onwards, where ŷ denotes the complete sequence generated by the agent, i.e., ŷ = ŷ_{1:T}. We then simplify the expression of the return into:

R_t(ŷ, Y) = RF1(ŷ_{1:T_p}, Y^p) + RF1(ŷ_{T_p+1:T}, Y^a) for 1 ≤ t ≤ T_p, and R_t(ŷ, Y) = RF1(ŷ_{T_p+1:T}, Y^a) for T_p < t ≤ T.

The goal of the agent is to maximize the expected initial return E_{ŷ∼π(.|x;θ)}[R_1(ŷ, Y)].

Adaptive reward function.
To encourage the model to generate sufficient and accurate keyphrases, we define our adaptive reward function RF1 as follows. First, let N be the number of predicted keyphrases and G be the number of ground-truth keyphrases; then

RF1 = recall if N < G; otherwise, RF1 = F1. (2)

If the model generates an insufficient number of keyphrases, the reward is the recall of the predictions. Since generating incorrect keyphrases does not decrease the recall, the model is encouraged to produce more keyphrases to boost the reward. If the model generates a sufficient number of keyphrases, the model should be discouraged from over-generating incorrect keyphrases; thus the F1 score is used as the reward, which incorporates the precision of the predicted keyphrases.

REINFORCE. To maximize the expected initial return, we define the following loss function:

L(θ) = −E_{ŷ∼π(.|x;θ)}[R_1(ŷ, Y)]. (3)

According to the REINFORCE learning rule in Williams (1992), the expected gradient of the initial return can be expressed as

∇_θ L(θ) = −E_{ŷ∼π(.|x;θ)}[Σ_{t=1}^{T} ∇_θ log π(ŷ_t | ŷ_{1:t−1}, x; θ) R_t]. (4)

In practice, we approximate the above expectation using a sample ŷ ∼ π(. | x; θ). Moreover, we subtract the return R_t by a baseline B_t, which is a standard technique in RL to reduce the variance of the gradient estimator (Sutton and Barto, 1998). In theory, the baseline can be any function that is independent of the current action ŷ_t. The gradient ∇_θ L is then estimated by:

∇_θ L(θ) ≈ −Σ_{t=1}^{T} ∇_θ log π(ŷ_t | ŷ_{1:t−1}, x; θ)(R_t − B_t). (5)

Intuitively, the above gradient estimator increases the generation probability of a word ŷ_t if its return R_t is higher than the baseline (R_t − B_t > 0).

Self-critical sequence training. The main idea of self-critical sequence training (Rennie et al., 2017) is to produce another sequence ȳ from the current model using the greedy search algorithm, and then use the initial return obtained by ȳ as the baseline. The interpretation is that the gradient estimator increases the probability of a word if it has an advantage over the greedily decoded sequence.
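A minimal sketch of the adaptive reward RF1, treating keyphrases as hashable items and using exact set matching for simplicity (in practice, matching is done on stemmed phrases):

```python
def rf1(predictions, ground_truth):
    """Adaptive reward: recall while predictions are insufficient, F1 once sufficient."""
    pred, gold = set(predictions), set(ground_truth)
    if not gold:
        return 0.0
    correct = len(pred & gold)
    recall = correct / len(gold)
    if len(pred) < len(gold):   # insufficient: reward recall only (no precision penalty)
        return recall
    precision = correct / len(pred) if pred else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Generating an extra wrong keyphrase never lowers the reward while N < G, so the model is pushed toward producing more keyphrases; once N ≥ G, precision enters through F1 and over-generation is penalized.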
We apply this idea to our RL problem, which has two different stages. When in the present (absent) keyphrase prediction stage, we want the baseline B_t to be the initial return obtained by the greedy sequence ȳ in its present (absent) keyphrase prediction stage. Thus, we first let T̄_p and T̄ be the decoding steps where the greedy search algorithm generates the ⟨peos⟩ token and the EOS token, respectively. We then define the baseline¹ as:

B_t = RF1(ȳ_{1:T̄_p}, Y^p) + RF1(ȳ_{T̄_p+1:T̄}, Y^a) for 1 ≤ t ≤ T_p, and B_t = RF1(ȳ_{T̄_p+1:T̄}, Y^a) for T_p < t ≤ T. (6)

With Eqs. (5) and (6), we can simply perform gradient descent to train a generative model.

¹ The value of B_t only depends on whether ⟨peos⟩ exists in ŷ_{1:t−1}; hence it does not depend on the current action ŷ_t.
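Putting Eqs. (5) and (6) together, the per-sequence self-critical loss can be sketched as follows; the stage rewards of the sampled and greedy sequences are assumed to be precomputed with RF1, and scalar log-probabilities stand in for the model's outputs (a toy illustration, not the training code):

```python
def self_critical_loss(log_probs, t_peos, r_p, r_a, b_p, b_a):
    """log_probs: log pi(y_t | ...) for each step t of the sampled sequence.
    t_peos: 0-based index of the step that emitted the end-of-present token.
    r_p, r_a: RF1 rewards of the sampled sequence's present / absent stages.
    b_p, b_a: the same two rewards for the greedily decoded baseline sequence."""
    loss = 0.0
    for t, lp in enumerate(log_probs):
        if t <= t_peos:   # present stage: return includes both stage rewards
            advantage = (r_p + r_a) - (b_p + b_a)
        else:             # absent stage: only the absent-stage reward remains
            advantage = r_a - b_a
        loss -= lp * advantage   # minimize -log pi * (R_t - B_t)
    return loss
```

If the sampled sequence beats the greedy baseline (positive advantage), minimizing this loss raises the probability of the sampled words; otherwise, it lowers them.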

New Evaluation Method
Our new evaluation method maintains a set of name variations ỹ^i for each ground-truth keyphrase y^i of x. If a predicted keyphrase ŷ^i matches any name variation of a ground-truth keyphrase, then ŷ^i is considered a correct prediction. A ground-truth keyphrase is also its own name variation. If multiple ground-truth keyphrases of x have the same name variation set, we only keep one of them. In our evaluation method, the name variation set of a ground-truth keyphrase may contain both present phrases and absent phrases. In such a case, a ground-truth keyphrase can be matched by either a present predicted keyphrase or an absent predicted keyphrase. Thus, this ground-truth keyphrase should be treated as both a present ground-truth keyphrase and an absent ground-truth keyphrase, as shown in the following definition.
Definition 5.1. Present (Absent) ground-truth keyphrase. If the name variation set ỹ^i of a ground-truth keyphrase y^i only consists of present (absent) keyphrases, then y^i is a present (absent) ground-truth keyphrase. Otherwise, y^i is both a present ground-truth keyphrase and an absent ground-truth keyphrase, i.e., y^i ∈ Y^p and y^i ∈ Y^a.
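The matching rule can be sketched as follows: a prediction counts as correct if its stemmed form equals the stemmed form of any name variation. A crude suffix-stripping function stands in for the Porter stemmer here, purely for illustration:

```python
def crude_stem(word):
    """Hypothetical stand-in for the Porter stemmer: strips a couple of common suffixes."""
    for suf in ("ing", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def stems(phrase):
    return tuple(crude_stem(w) for w in phrase.lower().split())

def is_correct(prediction, name_variations):
    """A prediction is correct if its stems match any name variation's stems."""
    return stems(prediction) in {stems(v) for v in name_variations}
```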

Name Variation Extraction
We extract name variations of a ground-truth keyphrase from the following sources: acronyms in the ground-truths, Wikipedia disambiguation pages, and Wikipedia entity titles. The latter two sources have also been adopted by entity linking methods (Zhang et al., 2010, 2011) to find name variations. Some examples of extracted name variations are shown in Table 1.

Acronyms in the ground-truths. We found that some of the ground-truth keyphrases include an acronym at the end of the string, e.g., "principal component analysis (pca)". Thus, we adopt the following simple rule to extract an acronym from a ground-truth keyphrase: if a ground-truth keyphrase ends with a pair of parentheses, we extract the phrase inside the pair, e.g., "pca", as one of the name variations.

Wikipedia entity titles. An entity page in Wikipedia provides the information of an entity, and the page title represents an unambiguous name variation of that entity. For example, a search for "solid state disk" on Wikipedia will be redirected to the entity page of "solid state drive". In this case, the title "solid state drive" is a name variation of "solid state disk".

Wikipedia disambiguation pages. A disambiguation page helps users find the correct entity page when the input query refers to more than one entity in Wikipedia. It contains a list of entity pages that the query may refer to. For example, the keyphrase "ssd" may refer to the entity "solid state drive" or "sterol-sensing domain" in Wikipedia. To find the correct entity page for a keyphrase, we iterate through this list of possible entities. If an entity title is present in the document, we assume it is the entity that the keyphrase refers to. For example, if a document x contains "solid state drive", we assume that the keyphrase "ssd" refers to this entity.
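The acronym rule above can be sketched with a single regular expression (an illustrative implementation, not necessarily the authors' exact rule):

```python
import re

def acronym_variation(keyphrase):
    """If a ground-truth keyphrase ends with a parenthesized phrase,
    e.g. 'principal component analysis (pca)', return that phrase as a name variation."""
    m = re.search(r"\(([^()]+)\)\s*$", keyphrase)
    return m.group(1).strip() if m else None
```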

Experiments
We first report the performance of different models using the conventional evaluation method. Afterwards, we present the results based on our new evaluation method. All experiments are repeated three times using different random seeds, and the averaged results are reported. The source code and the enriched evaluation set are released to the public². Sample output is shown in Figure 1.

Datasets
We conduct experiments on five scientific article datasets: KP20k (Meng et al., 2017), Inspec (Hulth, 2003), Krapivin (Krapivin et al., 2009), NUS (Nguyen and Kan, 2007b), and SemEval (Kim et al., 2010). Each sample from these datasets consists of the title, abstract, and keyphrases of a scientific article. We concatenate the title and abstract as an input document, and use the assigned keyphrases as keyphrase labels. Following the setup in Meng et al. (2017), Yuan et al. (2018), and Chen et al. (2018b), we use the training set of the largest dataset, KP20k, for model training, and the testing sets of all five datasets to evaluate the performance of a generative model. From the training set of KP20k, we remove all articles that are duplicated within the training set itself, in the KP20k validation set, or in any of the five testing sets. After the cleanup, the KP20k dataset contains 509,818 training samples, 20,000 validation samples, and 20,000 testing samples.

Evaluation Metrics
The performance of a model is typically evaluated by comparing the top k predicted keyphrases with the ground-truth keyphrases. The evaluation cutoff k can be either a fixed number or a variable. Most previous work (Meng et al., 2017; Ye and Wang, 2018; Chen et al., 2018a,b) adopted evaluation metrics with fixed evaluation cutoffs, e.g., F1@5. Recently, Yuan et al. (2018) proposed a new evaluation metric, F1@M, which has a variable evaluation cutoff. F1@M compares all the keyphrases predicted by the model with the ground-truth to compute an F1 score, i.e., k = number of predictions. It can also be interpreted as the original F1 score with no evaluation cutoff.
We evaluate the performance of a model using a metric with a variable cutoff and a metric with a fixed cutoff, namely F1@M and F1@5. Macro averaging is used to aggregate the evaluation scores over all testing samples. We apply the Porter Stemmer before determining whether two phrases match. Our implementation of F1@5 is different from that of Yuan et al. (2018). Specifically, when computing F1@5, if a model generates fewer than five predictions, we append random wrong answers to the predictions until there are five³. The rationale is to avoid producing similar F1@5 and F1@M when a model (e.g., catSeq) generates fewer than five keyphrases, as shown in Table 2 of Yuan et al. (2018).
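The two metrics can be sketched as one function: with `k=None` it computes F1@M over all predictions, and with a fixed `k`, padding with guaranteed-wrong answers is equivalent to fixing the precision denominator at k (illustrative code, not the released implementation):

```python
def f1_at_k(predictions, ground_truth, k=None):
    """F1@M when k is None (use all predictions); F1@k otherwise."""
    gold = set(ground_truth)
    if k is None:
        pred = predictions
        denom = len(pred)
    else:
        pred = predictions[:k]
        denom = k   # padding with wrong answers fixes the denominator at k
    correct = sum(1 for p in pred if p in gold)
    precision = correct / denom if denom else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With two correct predictions against three labels, F1@M rewards the model for stopping early (precision 1.0), while the padded F1@5 penalizes the shortfall (precision 0.4), which is exactly the distinction the padding is meant to preserve.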

Baseline and Deep Reinforced Models
We train four baseline generative models using maximum-likelihood loss: catSeq, catSeqD, catSeqCorr, and catSeqTG. Each baseline follows the training setup in Section 3.2 and joins all the keyphrases of a document into one output sequence. With this setup, all baselines can determine the number of keyphrases to generate. The catSeq and catSeqD models are those of Yuan et al. (2018), while the catSeqCorr and catSeqTG models are the CorrRNN (Chen et al., 2018a) and TG-Net (Chen et al., 2018b) models trained under this setup, respectively. For the reinforced models, we follow the method in Section 3.2 to concatenate keyphrases. We first pre-train each baseline model using maximum-likelihood loss, and then apply our RL approach to train each of them. We use the suffix "-2RF1" to indicate that a generative model is fine-tuned by our RL algorithm, e.g., catSeq-2RF1.

Implementation Details
Following Yuan et al. (2018), we use greedy search (beam search with beam width 1) as the decoding algorithm during testing. We do not apply the Porter Stemmer to the keyphrase labels in the SemEval testing dataset because they have already been stemmed. We remove all duplicated keyphrases from the predictions before computing an evaluation score. The following steps are applied to preprocess all the datasets. We lowercase all characters, replace all digits with a special token ⟨digit⟩, and perform tokenization. Following Yuan et al. (2018), for each document, we sort all the present keyphrase labels according to the order of their first occurrence in the document. The absent keyphrase labels are then appended after the present keyphrase labels. We do not rearrange the order among the absent keyphrases.
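The preprocessing steps described above might be sketched as follows; the literal string `<digit>` is a placeholder for the special digit token, and the whitespace/word tokenization is a simplification:

```python
import re

def preprocess(text):
    """Lowercase, replace digit runs with a special token, and tokenize."""
    text = text.lower()
    text = re.sub(r"\d+", "<digit>", text)
    # match the special token first so it survives tokenization intact
    return re.findall(r"<digit>|\w+", text)
```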
The vocabulary V is defined as the 50,002 most frequent words, i.e., |V| = 50,002. We train all the word embeddings from scratch with an embedding size of 100. The hidden size of the encoder d_h and the hidden size of the decoder d_s are both set to 300. The dimensions of the W terms are: W_V ∈ R^{|V|×d_s}, W_{V'} ∈ R^{d_s×(d_h+d_s)}, and W_g ∈ R^{1×(d_h+d_s+100)}. The encoder bi-GRU has only one layer. The initial state of the decoder GRU is derived from the final hidden states of the encoder. For all other model parameters of the baseline models, we follow the dimensions specified in their corresponding papers (Yuan et al., 2018; Chen et al., 2018a,b). We initialize all model parameters using a uniform distribution over the interval [−0.1, 0.1]. During training, we use a dropout rate of 0.1 and gradient clipping of 1.0.
For maximum-likelihood training (as well as pre-training), we use the Adam optimization algorithm (Kingma and Ba, 2014) with a batch size of 12 and an initial learning rate of 0.001. We evaluate the validation perplexity of a model every 4,000 iterations. We halve the learning rate if the validation perplexity (ppl) stops dropping for one check-point, and stop the training when the validation ppl stops dropping for three contiguous check-points. We also use teacher forcing during training.
For RL training, we use the Adam optimization algorithm (Kingma and Ba, 2014) with a batch size of 32 and an initial learning rate of 0.00005. We evaluate the validation initial return of a model every 4,000 iterations. We stop the training when the validation initial return stops increasing for three contiguous check-points. If the model generates more than one ⟨peos⟩ segmenter, we only keep the first one and remove the duplicates. If the model does not generate a ⟨peos⟩ segmenter, we manually insert one at the first position of the generated sequence.

Main Results
In this section, we evaluate the performance of present keyphrase prediction and absent keyphrase prediction separately. The evaluation results of different models on predicting present keyphrases are shown in Table 2. We observe that our reinforcement learning algorithm consistently improves the keyphrase extraction ability of all baseline generative models by a large margin. On the largest dataset, KP20k, all reinforced models obtain significantly higher F1@5 and F1@M (p < 0.02, t-test) than the baseline models. We then evaluate the performance of different models on predicting absent keyphrases. Table 3 suggests that our RL algorithm enhances the performance of all baseline generative models on most datasets, and maintains the performance of the baseline methods on the SemEval dataset. Note that predicting absent keyphrases for a document is an extremely challenging task (Yuan et al., 2018), hence the significantly lower scores compared to present keyphrase prediction.

Number of Generated Keyphrases
We analyze the abilities of different models to predict the appropriate number of keyphrases. All duplicated keyphrases are removed during preprocessing. We first measure the mean absolute error (MAE) between the number of generated keyphrases and the number of ground-truth keyphrases over all documents in the KP20k dataset. We also report the average number of generated keyphrases per document, denoted as "Avg. #". The results are shown in Table 4, where oracle is a model that always generates the ground-truth keyphrases.

Table 5: Ablation study on the KP20k dataset. Suffix "-2RF1" denotes our full RL approach. Suffix "-2F1" denotes that we replace our adaptive RF1 reward function in the full approach by an F1 reward function. Suffix "-RF1" denotes that we replace the two separate RF1 reward signals in our full approach with only one RF1 reward signal for all the generated keyphrases.

The resultant MAEs demonstrate that our deep reinforced models notably outperform the baselines on predicting the number of absent keyphrases and slightly outperform the baselines on predicting the number of present keyphrases. Moreover, our deep reinforced models generate significantly more absent keyphrases than the baselines (p < 0.02, t-test). The main reason is that the baseline models can only generate very few absent keyphrases, whereas our RL approach uses recall as the reward and encourages the model to generate more absent keyphrases. Besides, the baseline models and our reinforced models generate similar numbers of present keyphrases, while our reinforced models achieve notably higher F-measures, implying that our methods generate present keyphrases more accurately than the baselines.

Ablation Study
We conduct an ablation study to further analyze our reinforcement learning algorithm. The results are reported in Table 5.

Single reward signal. We train the catSeq model using another RL algorithm which only gives one reward for all generated keyphrases, without distinguishing present keyphrases from absent keyphrases. We use "catSeq-RF1" to denote such a method. As seen in Table 5, although the performance of catSeq-RF1 is competitive with catSeq-2RF1 on predicting present keyphrases, it yields extremely poor performance on absent keyphrase prediction. We analyze the cause as follows. During the training of catSeq-RF1, generating a correct present keyphrase or a correct absent keyphrase leads to the same improvement in the return at every time step.
Since producing a correct present keyphrase is an easier task, the model tends to generate present keyphrases only.
Alternative reward function. We implement a variant of our RL algorithm by replacing the adaptive RF1 reward function with an F1 score reward function (indicated with the suffix "-2F1" in the result table). Comparing the last two rows of Table 5, we observe that our RF1 reward function slightly outperforms the F1 reward function.

Analysis of New Evaluation Method
We extract name variations for all keyphrase labels in the testing set of the KP20k dataset, following the methodology in Section 5. Our method extracts at least one additional name variation for 14.1% of the ground-truth keyphrases. For these enriched keyphrases, the average number of name variations extracted is 1.01. Among all extracted name variations, 14.1% come from acronyms in the ground-truths, 28.2% from Wikipedia disambiguation pages, and 61.6% from Wikipedia entity page titles. We use our new evaluation method to evaluate the performance of different keyphrase generation models, and compare with the existing evaluation method. Table 6 shows that for all generative models, the evaluation scores computed by our method are higher than those computed by the prior method. This demonstrates that our proposed evaluation method successfully captures name variations of ground-truth keyphrases generated by different models, and can therefore evaluate the quality of generated keyphrases in a more robust manner.

Conclusion and Future Work
In this work, we propose the first RL approach to the task of keyphrase generation. In our RL approach, we introduce an adaptive reward function, RF1, which encourages the model to generate both sufficient and accurate keyphrases. Empirical studies on real data demonstrate that our deep reinforced models consistently outperform the current state-of-the-art models. In addition, we propose a novel evaluation method that incorporates name variations of the ground-truth keyphrases. As a result, it can more robustly evaluate the quality of generated keyphrases. One potential future direction is to investigate the performance of other encoder-decoder architectures on keyphrase generation, such as the Transformer (Vaswani et al., 2017) with its multi-head attention module. Another interesting direction is to apply our RL approach to the microblog hashtag annotation problem.