Diverse Keyphrase Generation with Neural Unlikelihood Training

In this paper, we study sequence-to-sequence (S2S) keyphrase generation models from the perspective of diversity. Recent advances in neural natural language generation have made possible remarkable progress on the task of keyphrase generation, demonstrated through improvements on quality metrics such as F1-score. However, the importance of diversity in keyphrase generation has been largely ignored. We first analyze the extent of information redundancy present in the outputs generated by a baseline model trained using maximum likelihood estimation (MLE). Our findings show that repetition of keyphrases is a major issue with MLE training. To alleviate this issue, we adopt the neural unlikelihood (UL) objective for training the S2S model. Our version of UL training operates at (1) the target token level, to discourage the generation of repeating tokens, and (2) the copy token level, to avoid copying repetitive tokens from the source text. Further, to encourage better model planning during the decoding process, we incorporate a K-step ahead token prediction objective that computes both MLE and UL losses on future tokens as well. Through extensive experiments on datasets from three different domains, we demonstrate that the proposed approach attains considerably large diversity gains, while maintaining competitive output quality.

Table 1: Comparison of sample outputs generated by our model (DivKGen) vs. an MLE baseline. The repeating keyphrases are shown in red.
In this paper, we take a principled direction towards addressing the information redundancy issue in keyphrase generation models. We propose to tackle this problem directly during the training stage, rather than applying ad hoc post-processing at inference time. Specifically, we adopt the neural unlikelihood (UL) training objective (Welleck et al., 2020), whereby the decoder is penalized for generating undesirable tokens. Welleck et al. (2020) introduce unlikelihood training in a language model setting. Since we work with a S2S setup, our version of the UL loss consists of two components: (1) a target token level UL loss, based on the target vocabulary, which penalizes the model for generating repeating tokens; (2) a copy token level UL loss, based on the dynamic vocabulary of source tokens required for the copy mechanism (Gu et al., 2016; See et al., 2017), which penalizes the model for copying repetitive tokens.
S2S models trained with maximum likelihood estimation (MLE) are usually tasked with the next token prediction objective. However, this does not necessarily incentivize the model to plan for future token prediction ahead of time. We observe such a lack of planning capability in our initial experiments with MLE models, and to overcome this issue we propose to use K-step ahead token prediction. This modified training objective encourages the model to learn to correctly predict not just the current token, but also tokens up to K steps ahead in the future. We then naturally incorporate UL training into the K-step ahead token prediction task.
We summarize our contributions as follows: (1) To improve the diversity of generated keyphrases in a principled manner during training, we adopt the unlikelihood objective for the S2S setting and propose a novel copy token unlikelihood loss. (2) In order to incentivize model planning, we augment our training objective function to incorporate K-step ahead token prediction. Additionally, we introduce the K-step ahead unlikelihood losses. (3) We propose new metrics for benchmarking keyphrase generation models on the diversity criterion. We carry out experiments on datasets from three different domains (scientific articles, news and community QA) and validate the effectiveness of our approach. We observe substantial gains in diversity while maintaining competitive output quality.

Problem Definition
The task of keyphrase generation can be formulated in the following manner. Given a source document x, we are required to generate a set of keyphrases Y = {y^1, y^2, . . . , y^|Y|} that best describe the input.
The source document is denoted as a sequence of S words: x = (x_1, x_2, . . . , x_S). Each target keyphrase y^i = (y^i_1, y^i_2, . . . , y^i_{T_i}) is also a word sequence, of length T_i.
We follow the modelling setup adopted in previous work on keyphrase generation (Ye and Wang, 2018; Chan et al., 2019). Given a document-keyphrases pair (x, Y), we concatenate all the ground truth keyphrases into a single linearized output sequence y = y^1 ⋄ y^2 ⋄ . . . ⋄ y^|Y|, where ⋄ denotes a special delimiter token that is inserted in between consecutive keyphrases. The training data now consists of (x, y) pairs and one can conveniently use a sequence-to-sequence (S2S) modelling architecture to learn the mapping from x to y.
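As a concrete illustration, the linearization step described above can be sketched in a few lines of Python (a minimal sketch; the `<SEP>` delimiter string follows the implementation details given later in this paper):

```python
# Sketch of the target-sequence linearization step.
SEP = "<SEP>"  # special delimiter token inserted between keyphrases

def linearize_keyphrases(keyphrases):
    """Join the ground-truth keyphrases into one target token sequence,
    inserting the delimiter between consecutive keyphrases."""
    tokens = []
    for i, kp in enumerate(keyphrases):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(kp.split())
    return tokens

print(linearize_keyphrases(["image segmentation", "region merging"]))
# ['image', 'segmentation', '<SEP>', 'region', 'merging']
```

The S2S model is then trained on the resulting (source tokens, target tokens) pairs.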

Sequence Encoder-Decoder
A bi-directional LSTM encoder (Hochreiter and Schmidhuber, 1997) reads the variable length source sequence x = (x_1, . . . , x_S) and produces a sequence of hidden state representations h = (h_1, . . . , h_S), where h_i is the concatenation of the forward and backward LSTM states at position i. For the decoder, we use a uni-directional LSTM which computes a hidden state s_t ∈ R^{d_s} at each decoding time step, based on a non-linear function defined as s_t = f_dec(e_{t−1}, s_{t−1}). At training time, e_{t−1} is the embedding of the ground truth previous word; at inference time, it is the embedding of the word predicted at the previous time step.

Attention Guided Decoding
By incorporating a global attention mechanism (Bahdanau et al., 2015) into the basic S2S architecture, it is possible to dynamically align source information with the target hidden states during the decoding process. This is achieved by computing an alignment score between the decoder hidden state s_t and each of the encoder hidden representations {h_i}, i = 1, . . . , S. At decoding time step t, this corresponds to:

e_ti = s_t^T W_a h_i,   α_ti = exp(e_ti) / Σ_{j=1}^{S} exp(e_tj)    (1)

where α_ti is referred to as the attention probability score and W_a is a learnable attention weight matrix. Next, we compute the attention context vector as a weighted summation across the source hidden states:

c_t = Σ_{i=1}^{S} α_ti h_i    (2)
Finally, the probability distribution over a predefined vocabulary V_Target of target tokens is obtained as:

P_gen(y_t | y_<t, x) = softmax(W_v (W_u [s_t ⊕ c_t]))    (3)

where ⊕ refers to the concatenation operator. Note that W_u and W_v are trainable decoder parameters and y_t ∈ V_Target. For notational brevity, we omit the bias terms.
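The attention and output layers above can be sketched with NumPy as follows (a minimal illustration; the matrix shapes and the omission of bias terms follow the description above, while the exact parameterization of the score function is an assumption):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_t, H, W_a):
    """Alignment scores e_ti = s_t^T W_a h_i, attention probabilities
    alpha_ti, and the context vector c_t as a weighted sum of the
    source hidden states H (shape S x d_h). W_a has shape (d_h, d_s)."""
    scores = H @ (W_a @ s_t)   # one alignment score per source position
    alpha = softmax(scores)    # attention probabilities, sum to 1
    c_t = alpha @ H            # context vector, shape (d_h,)
    return alpha, c_t

def vocab_distribution(s_t, c_t, W_u, W_v):
    """P_gen(y_t | .) = softmax(W_v (W_u [s_t ; c_t])), bias terms omitted."""
    return softmax(W_v @ (W_u @ np.concatenate([s_t, c_t])))
```

A full implementation would batch these operations over time steps and sequences; the sketch shows a single decoding step.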

Copy Mechanism
We incorporate a copy mechanism (Gu et al., 2016) to alleviate the out-of-vocabulary issue during generation, by allowing the decoder to selectively copy tokens from the source document. Specifically, we employ a learnable switching parameter p_gen = sigmoid(W_c [s_t ; c_t ; e_{t−1}]), which refers to the probability of generating a token from the target vocabulary V_Target. Thus, (1 − p_gen) corresponds to the probability of copying a token present on the source side, whose dynamic vocabulary is denoted by V_x. The generation probability and the copy probability at time step t are then combined to predict the next token as follows:

P(y_t | y_<t, x) = p_gen · P_gen(y_t | y_<t, x) + (1 − p_gen) · P_copy(y_t)    (4)

where y_t ∈ V_Target ∪ V_x and P_copy(y_t) = Σ_{i : x_i = y_t} α_ti is the copy probability of token y_t, defined as the sum of its attention weights across all its occurrences in the source text.
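The mixture of generation and copy probabilities can be illustrated with a small sketch (dictionaries stand in for the vocabulary-sized tensors of a real implementation):

```python
def final_distribution(p_gen, p_vocab, alpha, source_tokens):
    """Mix the generation and copy distributions over the extended vocabulary:
    P(y_t) = p_gen * P_gen(y_t) + (1 - p_gen) * P_copy(y_t),
    where P_copy(y_t) sums the attention weights over every occurrence
    of y_t in the source text. p_vocab maps token -> generation probability;
    alpha is aligned position-by-position with source_tokens."""
    p = {w: p_gen * q for w, q in p_vocab.items()}
    for a, w in zip(alpha, source_tokens):
        # copy probability accumulates over repeated source occurrences
        p[w] = p.get(w, 0.0) + (1.0 - p_gen) * a
    return p

# e.g. with p_gen = 0.5, P_gen = {a: 0.5, b: 0.5}, attention [0.3, 0.7]
# over source tokens [b, c], the mixture is {a: 0.25, b: 0.40, c: 0.35}
```

Note how the out-of-vocabulary source token ("c" above) receives non-zero probability only through the copy path.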

Maximum Likelihood Training
Encoder-decoder models for sequence generation are typically trained using Maximum Likelihood Estimation (MLE). Concretely, for a given instance in the training data, the MLE objective corresponds to learning the model parameters θ that minimize the negative log-likelihood loss defined as follows:

L_MLE = − Σ_{t=1}^{L} log p_θ(y_t | y_<t, x)    (5)

where y_t is the t-th token in the ground truth output sequence y, whose total length is L tokens. We begin with a setup where the S2S model for keyphrase generation is trained using MLE. We carry out preliminary experiments analyzing the diversity of the generation process and demonstrate the shortcomings of MLE-based training (Section 2.6), which paves the way for the proposed approach (Section 3).
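A minimal sketch of the per-sequence MLE loss (scalar probabilities stand in for the decoder's softmax outputs):

```python
import math

def mle_loss(step_probs):
    """Negative log-likelihood summed over the L target tokens, where
    step_probs[t] is the model probability p(y_t | y_<t, x) assigned to
    the ground-truth token at step t."""
    return -sum(math.log(p) for p in step_probs)

# a 3-token target with per-step probability 0.5 each gives loss 3*log(2)
```

The loss is zero only when every ground-truth token receives probability 1, and grows without bound as any step probability approaches 0.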

Lack of Diversity Issue
We conduct a pilot study using the KP20k dataset (Meng et al., 2017), a corpus of scientific articles. Each article consists of a title, an abstract and a set of associated keyphrases. Table 1 shows one such example, along with outputs from two systems: a S2S model trained purely with the MLE objective and our proposed model, which is trained with a combination of unlikelihood training and future token prediction. It can be observed that with the MLE objective alone, the S2S model tends to generate the same keyphrase over and over again. On the other hand, the output keyphrases from the proposed model summarize the abstract of the scientific article without any repetitions. Furthermore, in Table 2 we quantify this lack of diversity using two simple metrics: the percentage of duplicate keyphrases and the percentage of duplicate tokens. On average, for an MLE model, about 27% of the generated KPs and 36% of the generated tokens are duplicates. These values are much higher than the percentage of repetitions present in the ground truth data. This implies that significant computational effort is spent on generating redundant information. Moreover, additional post-processing pipelines are required in order to get rid of these repetitions. From a user experience point of view, the developed system should generate high quality KPs that describe the main ideas in the source text, without any information redundancy. We design our system keeping this objective in mind.

Proposed Approach
Rather than addressing the information redundancy issue through post-processing, we take a principled approach in this direction during training itself. Firstly, we adopt neural unlikelihood training (Welleck et al., 2020) for sequence-to-sequence setting by directly penalizing the decoder for either generating or copying repeating tokens. Secondly, we improve the planning capability of the decoder by incorporating a K-step ahead token prediction loss. This is achieved by using the same decoder hidden state but different attention mechanisms to decide which source tokens to be attended to, for predicting the target at the current time step, 1-step ahead and so on. An illustration of our approach is presented in Figure 1.

Target Token Unlikelihood Loss
The goal of unlikelihood training is to suppress the model's tendency to assign high probabilities to unnecessary tokens. During decoding, say at time step t, we maintain a negative candidate list C^t_Target that consists of tokens that should ideally be assigned a low probability for the current time step prediction. Formally, given C^t_Target = {c_1, . . . , c_m} where c_j ∈ V_Target, we define the unlikelihood loss based on the target vocabulary across all time steps as follows:

L_TargetUL = − Σ_{t=1}^{L} Σ_{c ∈ C^t_Target} log(1 − p_θ(c | y_<t, x))    (6)

Intuitively, assigning a high probability to a negative candidate token leads to a larger loss. Following Welleck et al. (2020), our negative candidate list for L_TargetUL consists of the ground truth context tokens from the previous time steps, i.e., C^t_Target = {y_1, . . . , y_{t−1}} \ {y_t}. In this manner, we effectively discourage the model from repeatedly generating tokens that are already present in the previous context.
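The loss can be sketched as follows (a toy version operating on per-step probability dictionaries rather than tensors):

```python
import math

def target_ul_loss(distributions, targets):
    """Target token unlikelihood loss. distributions[t] maps token ->
    model probability at step t; the negative candidates at step t are
    the previous ground-truth tokens, excluding the current target."""
    loss = 0.0
    for t, dist in enumerate(distributions):
        negatives = set(targets[:t]) - {targets[t]}  # C_t
        for c in negatives:
            # high probability on a repeat -> log(1 - p) -> large penalty
            loss += -math.log(1.0 - dist.get(c, 0.0))
    return loss
```

In a real implementation the inner sum is a masked tensor operation over the softmax output, but the penalty structure is exactly this.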

Copy Token Unlikelihood Loss
In contrast to Welleck et al. (2020), who introduce UL training in a language model setting, our application employs this method for a S2S task. As described in Section 2.4, our decoder utilizes a copy mechanism that dynamically creates an extended vocabulary during generation based on the source tokens (V_x). An undesirable side-effect of copying is that the model might repeatedly attend to (and copy) the same set of source tokens over multiple decoding time steps, leading to repetitions in the output. To circumvent this issue, we propose an approach that we refer to as the copy token unlikelihood loss.

Figure 1: (a) Illustration of unlikelihood training. In the above example, at decoding time step t = 6, the previous tokens from the target context form the negative candidate list denoted by C^{t=6}_Target. The Target UL loss is computed using the probabilities assigned to these tokens. Similarly, the Copy UL loss discourages the model from copying the words displayed in red from the source document at t = 6. Ideally, we would like the model to copy the word 'collection'. (b) Depiction of K-step ahead prediction with K = 2. Different attention matrices are used to compute the corresponding attention context vectors for k = 0, 1, 2. These are then individually fed to the final softmax layer along with the shared decoder hidden state to predict the token at the respective k. The copy mechanism is omitted from Figure 1(b) for simplicity.
For penalizing unnecessary copying, our negative candidate list at each time step is composed of ground truth context tokens from previous time steps that also appear in the source text (and thus can be copied):

L_CopyUL = − Σ_{t=1}^{L} Σ_{c ∈ C^t_Copy} log(1 − P_copy(c | ·))    (7)

where C^t_Copy = {y_i | y_i ∈ {y_1, . . . , y_{t−1}} \ {y_t} and y_i ∈ V_x}, and P_copy(c | ·) refers to the probability of copying a given token c, determined by the attention mechanism over the source tokens (Section 2.4).
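A sketch of the copy UL loss, mirroring the target UL loss but restricting the negative candidates to tokens that actually appear in the source (and can therefore be copied):

```python
import math

def copy_ul_loss(copy_dists, targets, source_tokens):
    """Copy token unlikelihood loss. copy_dists[t] maps source token ->
    copy probability P_copy at step t; negative candidates are previous
    ground-truth tokens that also appear in the source text."""
    src = set(source_tokens)
    loss = 0.0
    for t, dist in enumerate(copy_dists):
        negatives = (set(targets[:t]) - {targets[t]}) & src  # C_t^Copy
        for c in negatives:
            loss += -math.log(1.0 - dist.get(c, 0.0))
    return loss
```

Since P_copy is a sum of attention weights, minimizing this loss directly discourages the attention mechanism from revisiting already-copied source positions.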

K-Step Ahead Token Prediction Loss
Keyphrases are made up of one or more tokens. The decoder in S2S models is usually tasked with simply predicting the next token given the context so far. This greedy approach does not incentivize the model to plan for upcoming future tokens ahead of time. We mitigate this issue by directly incorporating the prediction of tokens up to K steps ahead of the current time step into our training objective. To do so, we start with Equation 5, the MLE-based objective for next token prediction at time step t. This can be generalized to the prediction of up to K tokens ahead in time as follows:

L_MLE-K = Σ_{k=0}^{K} γ_k L^(k)_MLE,   where   L^(k)_MLE = − Σ_{t=1}^{L−k} log p_θ(y_{t+k} | y_<t, x)    (8)

where γ_k refers to the coefficient of the k-th step ahead token prediction loss. Note that the next token prediction MLE objective in Equation 5 is a special case of Equation 8 with K = 0 and γ_0 = 1.0. One can think of the K-step ahead losses as a way to reward the model for planning the surface realization of the output sequence ahead of time. Intuitively, it makes sense to assign a high weight to current token prediction (i.e., k = 0) and to relatively downweight the losses incurred from future token predictions. We accomplish this by decaying the coefficient γ_k using the formula γ_k = 1.0 / (k + 1).

For K-step ahead prediction, we consider two implementation choices: (1) For each k, learn a different transformation W^k_v (in Equation 3) from the hidden representation to the logits over the vocabulary V_Target. However, this increases the number of model parameters by K × d_s × |V_Target|, where d_s is the decoder hidden size. (2) Alternatively, for each k, a different attention weight matrix W^k_a is learnt, while keeping a shared output transformation layer based on W_v. More specifically, Equations 1 and 2 can be re-written as:

e^k_ti = s_t^T W^k_a h_i,   α^k_ti = exp(e^k_ti) / Σ_{j=1}^{S} exp(e^k_tj)    (9)

c^k_t = Σ_{i=1}^{S} α^k_ti h_i    (10)

The intuition behind this formulation is that the different attention mechanisms (for different k's) learn different weighting schemes over the source tokens that enable the prediction of the future token at time step t + k. Moreover, this choice is much more parameter efficient, because the number of extra parameters introduced into the model is only K × d_s × d_h, which is far smaller than the vocabulary-sized alternative. Hence, we adopt the second implementation choice in our experiments.
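The decay schedule and the weighted combination of the k-step ahead losses can be sketched as:

```python
def gamma(k):
    """Decayed loss coefficient gamma_k = 1.0 / (k + 1): full weight on the
    current-token loss, progressively less on future-token losses."""
    return 1.0 / (k + 1)

def k_step_loss(per_step_losses):
    """Weighted sum over k = 0..K of the k-step ahead prediction losses,
    where per_step_losses[k] is the loss for predicting y_{t+k}."""
    return sum(gamma(k) * l for k, l in enumerate(per_step_losses))

# with K = 2 and unit per-step losses: 1.0 + 0.5 + 1/3
```

Setting K = 0 recovers the plain next-token objective, since gamma(0) = 1.0.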

K-Step Ahead Unlikelihood Loss
In Section 3.3, we introduce an MLE-based loss for the task of K-step ahead token prediction. This idea can be naturally extended to the unlikelihood setting. Concretely, we impose the target and copy unlikelihood losses on the K-step ahead token prediction task as follows:

L^(k)_TargetUL = − Σ_t Σ_{c ∈ C^{t+k}_Target} log(1 − p_θ(c | y_<t, x)),   L^(k)_CopyUL = − Σ_t Σ_{c ∈ C^{t+k}_Copy} log(1 − P_copy(c | ·))    (11)

where the negative candidate lists are C^{t+k}_Target = {y_1, . . . , y_{t+k−1}} \ {y_{t+k}} and C^{t+k}_Copy = {y_i | y_i ∈ {y_1, . . . , y_{t+k−1}} \ {y_{t+k}} and y_i ∈ V_x}. Penalizing the model for future repetitions through the K-step ahead unlikelihood losses should further enhance the overall diversity of its outputs.

Overall Training Objective
To summarize, our S2S model is trained with a combination of likelihood and unlikelihood losses on the current (k = 0) and future (k = 1, . . . , K) token prediction tasks. The overall loss function is given by:

L = Σ_{k=0}^{K} γ_k ( L^(k)_MLE + λ_T L^(k)_TargetUL + λ_C L^(k)_CopyUL )    (12)

where λ_T and λ_C are hyperparameters that control the weight of the target and copy UL losses respectively.
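A sketch of the overall objective (how exactly γ_k interacts with the UL terms is an assumption here; the grouping below applies γ_k to all three losses at each k):

```python
def overall_loss(mle, target_ul, copy_ul, lam_t, lam_c):
    """Total training objective: for each step-ahead k, combine the MLE
    loss with the target and copy UL losses (weighted by lam_t and lam_c),
    and downweight future steps with gamma_k = 1 / (k + 1). The three
    arguments mle, target_ul, copy_ul are lists indexed by k = 0..K."""
    total = 0.0
    for k in range(len(mle)):
        g = 1.0 / (k + 1)
        total += g * (mle[k] + lam_t * target_ul[k] + lam_c * copy_ul[k])
    return total
```

With K = 0 and lam_t = lam_c = 0, this reduces to the plain MLE objective of Equation 5.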

Experiment Setup
Evaluation Metrics. To measure the quality of generated keyphrases, i.e., their relevance with respect to the source document, we compare the generated set of KPs to the KPs in the corresponding ground truth data. To this end, we report F_1@M, where M refers to the number of model-predicted keyphrases. We also include the corresponding precision and recall metrics. As justified in previous work (Chan et al., 2019; Yuan et al., 2020), F_1@M captures the ability of abstractive S2S models to generate a variable number of KPs depending on the source document, in comparison to traditional extractive methods where one is required to specify a cutoff in order to output the top-k keyphrases. However, different from previous work, we report the overall F_1@M score rather than separately computing this score for keyphrases present vs. absent in the source text. This is because our goal in this work is to overcome the lack of diversity issue in keyphrase generation models, and not necessarily to generate more absent keyphrases. In order to evaluate the model outputs on the criterion of diversity, we define the following metrics:

% Duplicate KPs = (1 − Number of Unique Keyphrases / Total Number of Generated Keyphrases) × 100

% Duplicate Tokens = (1 − Number of Unique Tokens / Total Number of Generated Tokens) × 100

# KPs: We report the number of keyphrases generated. Ideally, the model should generate the same number of keyphrases as present in the ground truth target sequence.
The next three metrics measure the inter-keyphrase similarity among the generated set of keyphrases -a lower value indicates fewer repetitions and thus more diversity in the output.
Self-BLEU: We use Self-BLEU (Zhu et al., 2018) which computes pairwise BLEU score (Papineni et al., 2002) between generated KPs. This metric captures word level surface overlap. All reported metrics are computed for each test set output, followed by averaging across all records.
Datasets. We carry out experiments on datasets from three domains: scientific articles, news, and community QA.

Results and Analysis
We report quality and diversity metrics on five baselines and three variants of the proposed approach (Table 3).

Table 3: KP generation results on datasets from 3 domains, evaluated on both quality and diversity criteria.

The MLE-trained baselines produce a considerably higher percentage of repetitions. This is also evident from the inter-keyphrase pairwise similarity metrics Self-BLEU, EditDist and EmbSim. Surprisingly, the previous best performing model catSeqTG-2RF1, which uses RL to improve the F_1 score, does worse than all MLE baselines in terms of diversity. In contrast, DivKGen, our proposed approach, achieves much better diversity than all baselines. The repetition percentages are lowered and are relatively closer to the ground truth. There is a large boost from simply adding the target and copy UL losses to the baseline MLE model. For the KP20K dataset, we obtain small additional diversity gains through the incorporation of the K-step ahead losses, whereas for the other two datasets this does not result in an improvement. A possible explanation is that the base DivKGen (UL) variant itself steers the diversity statistics to be quite close to those of the ground truth on these datasets. As a result, it becomes increasingly difficult to achieve a further reduction in this gap through any additional model changes.
With regard to the quality evaluation metrics, it can be observed that the DivKGen models have slightly lower scores. This can be explained from a quality-diversity trade-off viewpoint. As the model attempts to explore the output space through the generation of more interesting KPs, it may output new KPs that are not present in the ground truth, thus resulting in lower precision. DivKGen also generates shorter sequences than the baselines (and hence may not produce all the KPs present in the ground truth), which could explain the lower recall.
Quality-Diversity Trade-off. We train different versions of the DivKGen (UL) model on the KP20K dataset by varying λ_T, the UL loss coefficient (see Equation 12); for simplicity, we set λ_T = λ_C to limit the number of variable hyperparameters in this analysis. As depicted in Figure 2, there is an obvious quality-diversity trade-off: for higher values of λ_T, we achieve a higher level of diversity (more unique KPs) at the cost of quality, and vice versa. Similar behaviour has been reported previously in the text generation literature (Bahuleyan et al., 2018; Gao et al., 2019). Hence, we recommend tuning the hyperparameters λ_T and λ_C to achieve a desired level of diversity.
Ablation Study. We conduct an ablation analysis to investigate the effect of the different losses. We start with the MLE baseline and add loss components one by one, as presented in Table 4. It is evident that the best diversity is obtained when using the full model (last row). Interestingly, each individual loss component by itself (i.e., TargetUL, CopyUL and K-StepMLE) is not as effective as their combination. This suggests that the losses contribute in a synergistic manner to maximize diversity gains.

Related Work
Keyphrase Generation and Extraction. Traditional KP extraction methods such as TextRank (Mihalcea and Tarau, 2004) and TopicRank (Bougouin et al., 2013) first select candidate phrases from the source document using heuristics and then rank these candidates based on some measure of relevance or importance. Meng et al. (2017) formulate keyphrase generation as a S2S learning problem, with the advantage over previous extractive methods that it can generate relevant KPs even when they are absent from the source text. A limitation of their approach is that one is still required to rank the top-k KPs. This was addressed in the works of Ye and Wang (2018) and Yuan et al. (2020), which generate a variable number of KPs depending on the input. We adopt a similar setup, but carry out a comprehensive analysis of such models in terms of their output diversity, which has been largely ignored in previous work.
Diversity in Language Generation. Diversity promoting objectives for text generation have been previously explored in the literature (Niu and Bansal, 2020; Jiang et al., 2020). However, these studies examine overall corpus level diversity; for instance, a dialogue system may lack diversity because the model keeps generating frequently seen responses from the training set. In our case, we address a different kind of diversity issue, arising as a result of repetitions occurring within individual outputs. Thus, neural unlikelihood training (Welleck et al., 2020) is well suited to our problem. Test time decoding strategies to improve diversity, such as top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020) and diverse beam search (Vijayakumar et al., 2018), are orthogonal to our approach and can naturally be incorporated.

Conclusion and Future Work
In this work, we first point out the shortcomings of MLE-based training for keyphrase generation. We specifically address the lack of output diversity via the unlikelihood training objective. We adopt a target token level unlikelihood loss and propose a novel copy token unlikelihood loss, the combination of which provides large diversity gains. In addition, K-step ahead MLE and UL objectives are incorporated into training. Through extensive experiments on datasets from three different domains, we demonstrate the effectiveness of our model for diverse keyphrase generation. For future work, we plan to explore directions that would enable us to simultaneously optimize for quality and diversity metrics.

B Implementation Details
We use the AllenNLP package (Gardner et al., 2018), which is built on the PyTorch framework (Paszke et al., 2019), for implementing our models. We provide the concatenated title and abstract as input to the model. Following Yuan et al. (2020), the ground truth target keyphrases are arranged as a sequence in which the absent KPs follow the present KPs. The sizes of the source and target vocabularies are set to 50k and 10k respectively. The delimiter token inserted between target keyphrases is denoted as <SEP>. Both the LSTM encoder and decoder have a hidden size of 100. Word embeddings on both the source and target side are also 100-dimensional and randomly initialized. We use the Adam optimizer (Kingma and Ba, 2015) with default parameters to train the model. The batch size is set to 64 and we apply early stopping based on validation F_1 score. Regarding the loss term coefficients for the UL losses and the K-step ahead loss, we set λ_T = 15.0, λ_C = 18.0 and γ_0 = 1.0, obtained via grid search hyperparameter optimization on the validation set. The hyperparameter tuning is carried out on the KP20K dataset and the best values are adopted for the other datasets too. The value of K is set to 2, which corresponds to up to 2-step ahead prediction.
For test time decoding, unlike previous work (Ye and Wang, 2018; Chen et al., 2019a; Yuan et al., 2020), we do not apply exhaustive decoding with large beam sizes followed by pruning and de-duplication of the output. This is because our model is trained to generate outputs without repetitions; as such, we do not require any ad hoc post-processing strategies to improve diversity. We thus adopt greedy decoding at test time as well, similar to Chan et al. (2019). For quality evaluation, we use the evaluation scripts provided by Chan et al. (2019). Note that Porter stemming is applied on the outputs for the purpose of quality evaluation.

C Results on Evaluation-Only Datasets
We present additional results in Table 6 and Table 7 on the following datasets: INSPEC (Hulth and Megyesi, 2006), KRAPIVIN (Krapivin et al., 2009), NUS (Nguyen and Kan, 2007b), SEMEVAL (Kim et al., 2010) and DUC (Wan and Xiao, 2008). These datasets are smaller in size and hence, similar to previous work we only use them as test sets. DUC is a dataset with news articles and their associated keyphrases. Hence, we use the models trained on KPTimes for evaluation on DUC. Since the remaining datasets are from the domain of scientific articles, we test them using the best checkpoints obtained from training on KP20K dataset.

D Qualitative Results
In Tables 8, 9 and 10, we present qualitative results from the three domains, i.e., scientific articles, news and community QA forums. The input to each model is the title and the abstract, and the expected output is displayed as the ground truth. In these case study examples, it can be observed that both the MLE and RL baselines tend to generate numerous repetitions in their output sequences. Our base DivKGen (UL) variant achieves good diversity, although it occasionally generates a few repetitions. However, we are able to avoid duplicates with the DivKGen (Full) model, which additionally incorporates the K-step ahead losses. We attribute this to the enhanced planning capability that DivKGen (Full) acquires by learning what the future tokens should or should not be.

Dataset : KP20K
Title automatic image segmentation by dynamic region merging .
Abstract this paper addresses the automatic image segmentation problem in a region merging style . with an initially oversegmented image , in which many regions or superpixels with homogeneous color are detected , an image segmentation is performed by iteratively merging the regions according to a statistical test . there are two essential issues in a region merging algorithm order of merging and the stopping criterion . in the proposed algorithm , these two issues are solved by a novel predicate , which is defined by the sequential probability ratio test and the minimal cost criterion . starting from an oversegmented image , neighboring regions are progressively merged if there is an evidence for merging according to this predicate . we show that the merging order follows the principle of dynamic programming . this formulates the image segmentation as an inference problem , where the final segmentation is established based on the observed image . we also prove that the produced segmentation satisfies certain global properties . in addition , a faster algorithm is developed to accelerate the region merging process , which maintains a nearest neighbor graph in each iteration . experiments on real natural images are conducted to demonstrate the performance of the proposed dynamic region merging algorithm .

Dataset : KPTimes
Title n.f.l . said to be closer to testing for h.g.h .
Abstract the n.f.l . owners and players have figured out how to divide up their money , and have spent a busy week reconstituting rosters and renewing rivalries . but there is still unfinished business in their labor standoff , and the most important issue remaining could be the question of drug testing . the n.f.l . , whose new collective bargaining agreement is expected to be completed and ratified by thursday , could begin blood testing for human growth hormone as soon as september , according to a person briefed on the negotiations who was not authorized to speak publicly , making it the first major north american sports league to conduct such testing on its top players with the union consent . players had long resisted blood testing under the former union president gene upshaw , and negotiators are still determining ways to make the program acceptable to current players . details to be worked out include how many players will be tested for performance enhancing drugs and how they would be randomly selected when drug testing resumes . there was no drug testing of any kind conducted during the lockout . but commissioner roger goodell and demaurice smith , the players union executive director , were said by people briefed on negotiations to have long seen the need for growth hormone testing and to want to cast the n.f.l . as a leader in combating drugs in major sports . they have pointed to the joint actions of upshaw and the former commissioner paul tagliabue , who moved to start the steroid testing program in the late . i think both sides have a commitment to being leaders in this area and to having the best