Exclusive Hierarchical Decoding for Deep Keyphrase Generation

Keyphrase generation (KG) aims to summarize the main ideas of a document into a set of keyphrases. A new setting has recently been introduced into this problem: given a document, the model needs to predict a set of keyphrases and simultaneously determine the appropriate number of keyphrases to produce. Previous work in this setting employs a sequential decoding process to generate keyphrases. However, such a decoding method ignores the intrinsic hierarchical compositionality existing in the keyphrase set of a document. Moreover, previous work tends to generate duplicated keyphrases, which wastes time and computing resources. To overcome these limitations, we propose an exclusive hierarchical decoding framework that includes a hierarchical decoding process and either a soft or a hard exclusion mechanism. The hierarchical decoding process explicitly models the hierarchical compositionality of a keyphrase set. Both the soft and the hard exclusion mechanisms keep track of previously-predicted keyphrases within a window to enhance the diversity of the generated keyphrases. Extensive experiments on multiple KG benchmark datasets demonstrate the effectiveness of our method in generating fewer duplicated and more accurate keyphrases.


Introduction
Keyphrases are short phrases that indicate the core information of a document. As shown in Figure 1, the keyphrase generation (KG) problem focuses on automatically producing a keyphrase set (a set of keyphrases) for a given document. Because of their condensed expression, keyphrases can benefit various downstream applications, including opinion mining (Berend, 2011; Wilson et al., 2005), among others.

Figure 1: An example of an input document and its expected keyphrase output for the keyphrase generation problem. Present keyphrases that appear in the document are underlined.
Keyphrases of a document can be categorized into two groups: present keyphrases that appear in the document and absent keyphrases that do not. Recent generative methods for KG apply the attentional encoder-decoder framework (Luong et al., 2015; Bahdanau et al., 2014) with a copy mechanism (Gu et al., 2016; See et al., 2017) to predict both present and absent keyphrases. To generate multiple keyphrases for an input document, these methods first use beam search to generate a huge number of keyphrases (e.g., 200) and then pick the top N ranked keyphrases as the final prediction. In other words, these methods can only predict a fixed number of keyphrases for all documents.
However, in practical situations, the appropriate number of keyphrases varies according to the content of the input document. To simultaneously predict keyphrases and determine their suitable number, Yuan et al. (2018) adopt a sequential decoding method with greedy search that generates one sequence consisting of the predicted keyphrases and separators. For example, the produced sequence may be "hemodynamics [sep] erectile dysfunction [sep] ...", where "[sep]" is the separator. After producing an ending token, the decoding process terminates. The final keyphrase predictions are obtained by splitting the sequence at the separators. However, there are two drawbacks to this method. First, the sequential decoding method ignores the hierarchical compositionality existing in a keyphrase set (a keyphrase set is composed of multiple keyphrases, and each keyphrase consists of multiple words). In this work, we examine the hypothesis that a generative model can predict more accurate keyphrases by incorporating knowledge of this hierarchical compositionality into the decoder architecture. Second, the sequential decoding method tends to generate duplicated keyphrases. It is simple to design specific post-processing rules to remove the repeated keyphrases, but generating and then removing repeated keyphrases wastes time and computing resources. To address these two limitations, we propose a novel exclusive hierarchical decoding framework for KG, which includes a hierarchical decoding process and an exclusion mechanism.
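Concretely, the sequential-decoding baseline emits one flat token sequence and recovers the keyphrase predictions by splitting on the separator. A minimal sketch (the function name is ours):

```python
def split_keyphrases(seq, sep="[sep]"):
    """Recover the keyphrase predictions from a sequential-decoding output string."""
    return [kp.strip() for kp in seq.split(sep) if kp.strip()]
```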
Our hierarchical decoding process is designed to explicitly model the hierarchical compositionality of a keyphrase set. It is composed of phrase-level decoding (PD) and word-level decoding (WD). A PD step determines which aspect of the document to summarize based on both the document content and the aspects summarized by previously-generated keyphrases. The hidden representation of the captured aspect is employed to initialize the WD process. Then, a new WD process is conducted under the PD step to generate a new keyphrase word by word. Both PD and WD repeat until their stop conditions are met. In our method, both PD and WD attend to the document content to gather contextual information. Moreover, the attention score of each WD step is rescaled by the corresponding PD attention score. The purpose of the attention rescaling is to indicate which aspect is focused on by the current PD step.
We also propose two kinds of exclusion mechanisms (i.e., a soft one and a hard one) to avoid generating duplicated keyphrases. Either the soft one or the hard one is used in the WD process of our hierarchical decoding. Both of them collect the previously-generated K keyphrases, where K is a predefined window size. The soft exclusion mechanism is incorporated in the training stage, where an exclusive loss is employed to encourage the model to generate a first word for the current keyphrase that differs from the first words of the collected K keyphrases. In contrast, the hard exclusion mechanism is used in the inference stage, where an exclusive search forces WD to produce a first word different from the first words of the collected K keyphrases. Our motivation comes from the statistical observation that in 85% of the documents on the largest KG benchmark, the keyphrases of each individual document have different first words. Moreover, since a keyphrase is usually composed of only two or three words, the predicted first word significantly affects the prediction of the following keyphrase words. Thus, our exclusion mechanisms can boost the diversity of the generated keyphrases. In addition, generating fewer duplications also improves the chance of producing correct keyphrases that have not been predicted yet.
We conduct extensive experiments on four popular real-world benchmarks. Empirical results demonstrate the effectiveness of our hierarchical decoding process. Besides, both the soft and the hard exclusion mechanisms significantly reduce the number of duplicated keyphrases. Furthermore, after employing the hard exclusion mechanism, our model consistently outperforms all the state-of-the-art sequential decoding baselines on the four benchmarks.
We summarize our main contributions as follows: (1) to the best of our knowledge, we are the first to design a hierarchical decoding process for the keyphrase generation problem; (2) we propose two novel exclusion mechanisms to avoid generating duplicated keyphrases and improve generation accuracy; and (3) our method consistently outperforms all the state-of-the-art sequential decoding methods on multiple benchmarks under the new setting.

Keyphrase Extraction
Most traditional extractive methods (Witten et al., 1999; Mihalcea and Tarau, 2004) focus on extracting present keyphrases from the input document and follow a two-step framework. They first extract plenty of keyphrase candidates using handcrafted rules (Medelyan et al., 2009). Then, they score and rank these candidates based on either unsupervised methods (Mihalcea and Tarau, 2004) or supervised learning methods (Nguyen and Kan, 2007; Hulth, 2003). Recently, neural-based sequence labeling methods (Gollapalli et al., 2017; Luan et al., 2017; Zhang et al., 2016) have also been explored for the keyphrase extraction problem. However, these extractive methods cannot predict absent keyphrases, which are also an essential part of a keyphrase set.

Keyphrase Generation
To produce both present and absent keyphrases, Meng et al. (2017) introduced a generative model, CopyRNN, which is based on an attentional encoder-decoder framework (Bahdanau et al., 2014) incorporating a copy mechanism (Gu et al., 2016). A wide range of extensions of CopyRNN have recently been proposed (Chen et al., 2018, 2019b; Ye and Wang, 2018; Chen et al., 2019a; Zhao and Zhang, 2019). All of them rely on beam search with a large beam size to over-generate lots of keyphrases and then select the top N (e.g., five or ten) ranked ones as the final prediction. That means these over-generation methods always predict N keyphrases for any input document. Nevertheless, in a real situation, the keyphrase number should be determined by the document content and may vary among different documents.
To this end, Yuan et al. (2018) introduced a new setting in which the KG model should predict multiple keyphrases and simultaneously decide the suitable keyphrase number for the given document. Two models with a sequential decoding process, catSeq and catSeqD, are proposed in Yuan et al. (2018). The catSeq is also an attentional encoder-decoder model (Bahdanau et al., 2014) with copy mechanism (See et al., 2017), but adopts new training and inference setups to fit the new setting. The catSeqD is an extension of catSeq with orthogonal regularization (Bousmalis et al., 2016) and target encoding. Lately, Chan et al. (2019) proposed a reinforcement learning based fine-tuning method, which fine-tunes pre-trained models with adaptive rewards to generate more sufficient and accurate keyphrases. We follow the same setting as Yuan et al. (2018) and propose an exclusive hierarchical decoding method for the KG problem. To the best of our knowledge, this is the first time hierarchical decoding has been explored for the KG problem. Different from hierarchical decoding in other areas (Fan et al., 2018; Yarats and Lewis, 2018; Tan et al., 2017; Chen and Zhuge, 2018), we rescale the attention score of each WD step with the corresponding PD attention score to provide aspect guidance when generating keyphrases. Moreover, either a soft or a hard exclusion mechanism is innovatively incorporated in the decoding process to improve generation diversity.

Notations and Problem Definition
We denote vectors and matrices with bold lowercase and uppercase letters respectively. Sets are denoted with calligraphic letters. We use W to represent a parameter matrix.
We define the keyphrase generation problem as follows. The input is a document x, and the output is a keyphrase set Y = {y^i}_{i=1,...,|Y|}, where |Y| is the number of keyphrases of x. Both x and each y^i are sequences of words, i.e., x = [x_1, ..., x_{l_x}] and y^i = [y^i_1, ..., y^i_{l_{y^i}}], where l_x and l_{y^i} are the numbers of words in x and y^i correspondingly.

Our Methodology
We first encode each word of the document into a hidden state and then employ our exclusive hierarchical decoding, shown in Figure 2, to produce keyphrases for the given document. Our hierarchical decoding process consists of phrase-level decoding (PD) and word-level decoding (WD). Each PD step decides an appropriate aspect to summarize based on both the context of the document and the aspects summarized by previous PD steps. Then, the hidden representation of the captured aspect is employed to initialize the WD process, which generates a new keyphrase word by word. The WD process terminates when it produces a "[eowd]" token. If the WD process outputs a "[eopd]" token, the whole hierarchical decoding process stops. Both PD and WD attend to the document content. The PD attention score is used to re-weight the WD attention score to provide aspect guidance. To improve the diversity of the predicted keyphrases, we incorporate either an exclusive loss during training (i.e., the soft exclusion mechanism) or an exclusive search mechanism at inference (i.e., the hard exclusion mechanism).
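The control flow described above can be sketched as a two-level loop. This is a sketch of the stop conditions only; `pd_step` and `wd_step` stand in for the phrase-level and word-level decoders and are our own stubs, not the paper's interfaces:

```python
def hierarchical_decode(pd_step, wd_step, max_pd_steps=20, max_wd_steps=6):
    """Two-level decoding: each PD step opens a WD process that emits one keyphrase."""
    keyphrases = []
    pd_state = None
    for i in range(max_pd_steps):
        pd_state = pd_step(pd_state)           # capture the next aspect to summarize
        wd_state, words = pd_state, []
        for j in range(max_wd_steps):
            token, wd_state = wd_step(wd_state, i, j)
            if j == 0 and token == "[eopd]":   # 0-th WD step: whole decoding ends
                return keyphrases
            if token == "[eowd]":              # current keyphrase ends
                break
            words.append(token)
        keyphrases.append(" ".join(words))
    return keyphrases
```

With scripted stubs that emit two keyphrases and then "[eopd]", the loop returns exactly those two keyphrases and stops.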

Sequential Encoder
To obtain the context-aware representation of each document word, we employ a two-layered bidirectional GRU (Cho et al., 2014) as the document encoder: [m_1, ..., m_{l_x}] = BiGRU([e_{x_1}, ..., e_{x_{l_x}}]), where e_{x_k} (k = 1, 2, ..., l_x) is the embedding vector of x_k and m_k is the encoded context-aware representation of x_k.

Hierarchical Decoder
Our hierarchical decoding process is controlled by the hierarchical decoder, which utilizes a phrase-level decoder and a word-level decoder to handle the PD process and the WD process respectively. We present our hierarchical decoder first and then introduce the exclusion mechanisms. In our decoders, all hidden states and attentional vectors are d-dimensional vectors.

Phrase-level Decoder
We adopt a unidirectional GRU layer as our phrase-level decoder. After the WD process under the last PD step finishes, the phrase-level decoder updates its hidden state as h_i = GRU(ĥ_{i-1,end}, h_{i-1}), where ĥ_{i-1,end} is the attentional vector of the ending WD step under the (i-1)-th PD step (e.g., ĥ_{2,2} in Figure 2(b)). h_i is regarded as the hidden representation of the captured aspect at the i-th PD step. h_0 is initialized as the document representation, and ĥ_{0,end} is initialized with zeros. In the PD-Attention process, the PD attention score β_i = [β_{i,1}, β_{i,2}, ..., β_{i,l_x}] is computed by an attention mechanism that employs h_i as the query vector over the encoded states [m_1, ..., m_{l_x}].

Word-level Decoder
We choose another unidirectional GRU layer to conduct word-level decoding. Under the i-th PD step, the word-level decoder first updates its hidden state as h_{i,j} = GRU([ĥ_{i,j-1}; e_{y^i_{j-1}}], h_{i,j-1}), where ĥ_{i,j-1} is the WD attentional vector of the (j-1)-th WD step and e_{y^i_{j-1}} is the d_e-dimensional embedding vector of the token y^i_{j-1}. We define h_{i,0} = h_i, ĥ_{i,0} = 0, and e_{y^i_0} = e_s, where h_i is the current hidden state of the phrase-level decoder, 0 is a zero vector, and e_s is the embedding of the start token. Then, the WD attentional vector ĥ_{i,j} is computed with rescaled attention scores α̂_{(i,j),k} ∝ β_{i,k} · α_{(i,j),k}, where α_{(i,j),k} is the original WD attention score, computed similarly to β_{i,k} except that a new parameter matrix is used and h_{i,j} is employed as the query vector. The purpose of the rescaling operation in Eq. (7) is to indicate the focused aspect of the current PD step for each WD step. Finally, ĥ_{i,j} is utilized to predict the probability distribution of the current keyword with the copy mechanism (See et al., 2017), which combines a generation distribution over the vocabulary V with a copying probability distribution over X, the set of all words that appear in the document. P^i_j ∈ R^{|V∪X|} is the final predicted probability distribution. Finally, greedy search is applied to produce the current token.
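The rescaling can be sketched as an elementwise product of the PD and WD attention scores over the same source positions; renormalizing the product to sum to one is our assumption, since Eq. (7) itself is not reproduced in this text:

```python
def rescale_wd_attention(beta_i, alpha_ij):
    """Re-weight WD attention scores by the PD attention scores (elementwise),
    then renormalize so the rescaled scores form a distribution (assumption)."""
    rescaled = [b * a for b, a in zip(beta_i, alpha_ij)]
    z = sum(rescaled)
    return [r / z for r in rescaled] if z > 0 else rescaled
```

Intuitively, source positions the current PD step attends to strongly keep their WD attention, while positions outside the current aspect are suppressed.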
The WD process terminates when producing a "[eowd]" token.The whole hierarchical decoding process ends if the word-level decoder produces a "[eopd]" token at the 0-th step, i.e., y i 0 is predicted as "[eopd]".

Soft and Hard Exclusion Mechanisms
To alleviate the duplicated-generation problem, we propose a soft and a hard exclusion mechanism. Either of them can be incorporated into our hierarchical decoding process to form one kind of exclusive hierarchical decoding method.

Soft Exclusion Mechanism

An exclusive loss (EL) is introduced in the training stage, as shown in Algorithm 1.

Algorithm 1 Training with Exclusive Loss
Require: The window size K_EL. The predicted probability distribution P^i_j for the j-th WD step under the i-th PD step, where i = 1, ..., |Ȳ| and j = 0, 1, ..., l_{ȳ^i}.
1: Firstly, the exclusive loss of the j-th WD step under the i-th PD step is computed as follows.
2: K_EL ← min{K_EL, i − 1}
3: if K_EL > 0 and j == 1 then
4:   compute L^{i,j}_EL from the probabilities that P^i_j assigns to the first words of the previous K_EL keyphrases
5: else
6:   L^{i,j}_EL = 0.0
7: end if
8: Secondly, the exclusive loss for the whole decoding process is calculated as L_EL = Σ_{i,j} L^{i,j}_EL.
9: Finally, the joint loss L = L_g + L_EL is employed to train the model.

"j == 1" in line 3 of Algorithm 1 means the current WD step is predicting the first word of a keyphrase. In short, the exclusive loss punishes the model for the tendency to generate the same first word for the current keyphrase as the first words of the previously-generated keyphrases within the window size K_EL.

Hard Exclusion Mechanism

An exclusive search (ES) is introduced in the inference stage, as shown in Algorithm 2, which requires the window size K_ES and the first words of the previously-predicted keyphrases [y^1_1, ..., y^{i-1}_1]. The exclusive search mechanism forces the word-level decoding to predict a first word different from the first words of the previously-predicted keyphrases within the window size K_ES.
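At inference, the hard exclusion amounts to masking the forbidden first words before the greedy argmax. A minimal sketch (the dict-based interface is ours, not the paper's):

```python
def exclusive_greedy_step(probs, forbidden_first_words, is_first_word):
    """Greedy WD step; when predicting a keyphrase's first word, exclude the
    first words of the previously-predicted keyphrases inside the window."""
    if is_first_word:
        probs = {w: p for w, p in probs.items() if w not in forbidden_first_words}
    return max(probs, key=probs.get)
```

Only the first WD step of each keyphrase is constrained; later words are decoded greedily as usual.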
Since a keyphrase usually has only two or three words, the first word significantly affects the prediction of the following words.Therefore, both the soft and the hard exclusion mechanisms can improve the diversity of generated keyphrases.
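For the soft exclusion mechanism, one plausible instantiation of the per-step exclusive loss is to penalize the probability mass that the first-word distribution assigns to the first words of the previous min(K_EL, i−1) keyphrases. The log form below is our assumption; the paper's exact formula is not reproduced in this text:

```python
import math

def exclusive_loss(p_first, prev_first_words, k_el, i):
    """Hypothetical exclusive loss for the first WD step of the i-th keyphrase:
    the more probability a recently-used first word gets, the larger the loss."""
    k = min(k_el, i - 1)
    if k <= 0:
        return 0.0
    window = set(prev_first_words[-k:])   # first words inside the window
    return -sum(math.log(1.0 - p_first.get(w, 0.0)) for w in window)
```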

Experiment Setup
Our model implementations are based on the OpenNMT system (Klein et al., 2017) using PyTorch (Paszke et al., 2017). Experiments for all models are repeated with three different random seeds, and the averaged results are reported.

Datasets
We employ four scientific article benchmark datasets to evaluate our models, including KP20k (Meng et al., 2017), Inspec (Hulth, 2003), Krapivin (Krapivin et al., 2009), and SemEval (Kim et al., 2010). Following previous work (Yuan et al., 2018; Chen et al., 2019a), we use the training set of KP20k to train all the models. After removing duplicated data, we maintain 509,818 data samples in the training set, 20,000 in the validation set, and 20,000 in the testing set.
After training, we test all the models on the testing sets of these four benchmarks. The dataset statistics are shown in Table 1.

Baselines
We focus on comparisons with state-of-the-art decoding methods and choose the following generation models under the new setting as our baselines:
• catSeq (Yuan et al., 2018). An attentional encoder-decoder model (Bahdanau et al., 2014) with copy mechanism (See et al., 2017), adopting new training and inference setups to fit the new setting.
• Transformer (Vaswani et al., 2017). A Transformer-based sequence-to-sequence model incorporating the copy mechanism.
• catSeqD (Yuan et al., 2018).An extension of catSeq which incorporates orthogonal regularization (Bousmalis et al., 2016) and target encoding into the sequential decoding process to improve the generation diversity and accuracy.
• catSeqCorr (Chan et al., 2019). Another extension of catSeq, which incorporates the sequential decoding process with coverage (See et al., 2017) and review mechanisms to boost generation diversity and accuracy. This method is adapted from Chen et al. (2018) to fit the new setting.
In this paper, we propose two novel models, denoted as follows:
• ExHiRD-s. Our Exclusive HieRarchical Decoding model with the soft exclusion mechanism. In experiments, the window size K_EL is set to 4 after tuning on the KP20k validation dataset.
• ExHiRD-h. Our Exclusive HieRarchical Decoding model with the hard exclusion mechanism. In experiments, the window size K_ES is set to 4, 1, 1, and 1 for Inspec, Krapivin, SemEval, and KP20k respectively after tuning on the corresponding validation datasets.
We choose the bilinear attention from Luong et al. (2015) and the copy mechanism from See et al. (2017) for all the models.

Evaluation Metrics
We adopt F1@M, recently proposed in Yuan et al. (2018), as one of our evaluation metrics. F1@M compares all the keyphrases predicted by the model with the ground-truth keyphrases, which means it does not use a fixed cutoff for the predictions. Therefore, it takes the number of predictions into account.
We also use F1@5 as another evaluation metric. When the number of predictions is less than five, we randomly append incorrect keyphrases until there are five predictions, instead of directly using the original predictions. Without such an appending operation, F1@5 would be identical to F1@M whenever the prediction number is less than five.
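Under this padding rule, the appended keyphrases are always incorrect, so F1@5 reduces to using a fixed precision denominator of five whenever fewer than five keyphrases are predicted. A sketch (stemming and deduplication are assumed to have been applied already):

```python
def f1_at_5(predictions, ground_truth):
    """F1@5 with padding: fewer than five predictions are padded with incorrect
    keyphrases, which fixes the precision denominator at five."""
    preds = predictions[:5]                        # keep at most five predictions
    correct = len(set(preds) & set(ground_truth))
    precision = correct / 5
    recall = correct / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```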
The macro-averaged F1@M and F1@5 scores are reported. When determining whether two keyphrases are identical, all keyphrases are stemmed first. Besides, all duplicated keyphrases are removed after stemming.

Implementation Details
Following previous work (Meng et al., 2017; Yuan et al., 2018; Chen et al., 2019a; Chan et al., 2019), we lowercase the characters, tokenize the sequences, and replace digits with the "<digit>" token. Similar to Yuan et al. (2018), during training the present keyphrase targets are sorted according to the order of their first occurrences in the document. Then, the absent keyphrase targets are appended after the sorted present keyphrase targets. We use "<p start>" and "<a start>" as the start tokens of the present and absent keyphrase target sequences respectively.
The vocabulary of 50,000 tokens is shared between the encoder and decoder. We set d_e to 100 and d to 300. The hidden states of the encoder layers are initialized as zeros. In the training stage, we randomly initialize all trainable parameters, including the embeddings, with a uniform distribution in [−0.1, 0.1]. We set the batch size to 10, the max gradient norm to 1.0, and the initial learning rate to 0.001. We do not use dropout. Adam (Kingma and Ba, 2014) is used as our optimizer. The learning rate decays to half if the perplexity on the KP20k validation set stops decreasing. Early stopping is applied during training. At inference, we set the minimum number of phrase-level decoding steps to 1 and the maximum to 20.
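The learning-rate schedule described above can be sketched as follows (the function name and the per-check interface are ours; the actual schedule lives inside the training loop):

```python
def maybe_decay_lr(lr, val_ppl_history):
    """Halve the learning rate when the validation perplexity stops decreasing."""
    if len(val_ppl_history) >= 2 and val_ppl_history[-1] >= val_ppl_history[-2]:
        return lr / 2.0
    return lr
```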
Results and Analysis

Present and Absent Keyphrase Predictions
We show the present and absent keyphrase prediction results in Table 2 and Table 3 respectively.

Duplication Ratio of Predicted Keyphrases
In this section, we study the models' capability of avoiding duplicated keyphrases. The duplication ratio, denoted "DupRatio", is defined as DupRatio = #duplicated keyphrases / #predicted keyphrases, where # means "the number of". For instance, the DupRatio is 0.5 (3/6) for [A, A, B, B, A, C]. We report the average DupRatio per document in Table 4. From this table, we observe that our ExHiRD-s and ExHiRD-h consistently and significantly reduce the duplication ratios on all datasets. Moreover, we also find that our ExHiRD-h model achieves the lowest duplication ratios on all datasets.
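The DupRatio can be computed as follows, where a keyphrase counts as a duplicate each time it repeats an earlier prediction:

```python
def dup_ratio(predictions):
    """Fraction of predicted keyphrases that duplicate an earlier prediction."""
    seen, dups = set(), 0
    for kp in predictions:
        if kp in seen:
            dups += 1
        seen.add(kp)
    return dups / len(predictions) if predictions else 0.0
```

On the example above, [A, A, B, B, A, C], the second A, the second B, and the third A are duplicates, giving 3/6 = 0.5.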

Number of Predicted Keyphrases
We also study the average number of unique keyphrase predictions per document. Duplicated keyphrases are removed. The results are shown in Table 5. One main finding is that all the models generate an insufficient number of unique keyphrases on most datasets, especially for absent keyphrases. We also observe that our methods increase the number of unique keyphrases by a large margin, which is extremely beneficial for alleviating insufficient generation. Correspondingly, it also leads to over-generating more keyphrases than the ground truth in cases that do not have this problem, such as the present keyphrase predictions on the Krapivin and KP20k datasets. We leave solving the over-generation of present keyphrases on Krapivin and KP20k as future work.

ExHiRD-h: Ablation Study
Since our ExHiRD-h model achieves the best performance on almost all of the metrics, we select it as our final model and probe it more subtly in the following sections. In order to understand the effects of each component of ExHiRD-h, we conduct an ablation study on it and report the results on the SemEval dataset in Table 6. We observe that both our hierarchical decoding process and our exclusive search mechanism are helpful for generating more accurate present and absent keyphrases. Besides, we also find that the significant performance margins on the duplication ratio and the keyphrase numbers mainly come from the exclusive search mechanism.

ExHiRD-h: Window Size of Exclusive Search
For a more comprehensive understanding of the exclusive search mechanism in our ExHiRD-h model, we also study the effects of the window size K_ES. We conduct the experiments on the KP20k dataset and list the results in Table 7.
We note that a larger window size K_ES leads to a lower DupRatio, as we anticipated. This is because the exclusive search can observe more previously-generated keyphrases to avoid generating duplicated keyphrases when K_ES is larger. When K_ES is "all", the DupRatio is not exactly zero because we stem keyphrases when determining whether they are duplicated. Besides, we also find that a larger K_ES leads to better F1@5 scores. The reason is that for F1@5, we append incorrect keyphrases to obtain five predictions when the number of predictions is less than five. A larger K_ES leads to predicting more unique keyphrases, appending fewer certainly-incorrect keyphrases, and improving the chance of outputting more accurate keyphrases. However, generating more unique keyphrases may also lead to more incorrect predictions, which degrades the F1@M scores since F1@M considers all the unique predictions without a fixed cutoff.

ExHiRD-h: Incorporate Baselines with Exclusive Search
Our exclusive search is a general method that can easily be applied to other models. In this section, we study the effects of our exclusive search on the baseline models. We show the experimental results on the KP20k dataset in Table 8.
From this table, we note that the effects of exclusive search on the baselines are similar to its effects on our hierarchical decoding. We also see that our ExHiRD-h still achieves the best performance on most of the metrics even when the baselines are incorporated with exclusive search, which again exhibits the superiority of our hierarchical decoding.

ExHiRD-h: Case Study
We display a prediction example in Figure 3. Our ExHiRD-h model generates more accurate keyphrases for the document compared with the four baselines. Besides, we also observe that far fewer repeated keyphrases are generated by our ExHiRD-h. For instance, all the baselines produce the keyphrase "debugging" at least three times. However, our ExHiRD-h generates it only once, which demonstrates that our proposed method is more powerful in avoiding duplicated keyphrases.

Conclusion and Future Work
In this paper, we propose an exclusive hierarchical decoding framework for keyphrase generation. Unlike previous sequential decoding methods, our hierarchical decoding consists of a phrase-level decoding process that captures the current aspect to summarize and a word-level decoding process that generates keyphrases based on the captured aspect. Besides, we also propose a soft and a hard exclusion mechanism to enhance the diversity of the generated keyphrases. Extensive experimental results demonstrate the effectiveness of our methods. One interesting future direction is to explore whether beam search is helpful to our model.

Figure 3: An example of keyphrases generated by the baselines and our ExHiRD-h for a document about an SOC HW/SW co-verification based debugging technique. The correct predictions are bold and the present keyphrases are underlined. The digit in parentheses represents the number of times the corresponding keyphrase is generated by the model (e.g., "debugging (3)" means the keyphrase "debugging" is generated three times).
Figure 2: Illustration of our exclusive hierarchical decoding: (a) the simplified framework of our exclusive hierarchical decoding; (b) an intermediate prediction step. h_i is the hidden state of the i-th PD step. h_{i,j} is the corresponding j-th WD hidden state. The "[neopd]" token means PD does not end. The "[eowd]" token means WD terminates. The "[eopd]" token means PD ends and the whole decoding process finishes. [m_1, ..., m_{l_x}] represents the encoded hidden states of the document. "PD-Attention" and "WD-Attention" are the attention mechanisms in PD and WD respectively. β_i is the PD attention score at the i-th step. ĥ_{i,j} is the WD attentional vector. "EL/ES" indicates that either the exclusive loss or the exclusive search is incorporated.

Table 1: The statistics of the validation and testing datasets.

Table 3: Absent keyphrase prediction results of all models on all datasets. The best results are bold.
As indicated in Tables 2 and 3, both the ExHiRD-s model and the ExHiRD-h model outperform the state-of-the-art baselines on most of the metrics, which demonstrates the effectiveness of our exclusive hierarchical decoding methods. Besides, the ExHiRD-h model consistently achieves the best results on both present and absent keyphrase prediction on all datasets.

Table 4: The average DupRatio of predicted keyphrases on all datasets. The lower the score, the better the performance.

Table 7: Results of ExHiRD-h on KP20k with different window sizes K_ES. When K_ES = 0, ExHiRD-h is equivalent to "w/o ES". "all" means taking the first words of all previously-predicted keyphrases into consideration. "DupRatio" is the average DupRatio per document. The average numbers of ground-truth keyphrases are shown in the "Oracle" row.