SGG: Learning to Select, Guide, and Generate for Keyphrase Generation

Keyphrases, which concisely summarize the high-level topics discussed in a document, can be categorized into present keyphrases, which explicitly appear in the source text, and absent keyphrases, which do not match any contiguous subsequence of the source text but are highly semantically related to it. Most existing keyphrase generation approaches generate present and absent keyphrases synchronously without explicitly distinguishing these two categories. In this paper, a Select-Guide-Generate (SGG) approach is proposed to handle present and absent keyphrase generation separately with different mechanisms. Specifically, SGG is a hierarchical neural network consisting of a pointing-based selector at the low layer concentrating on present keyphrase generation, a selection-guided generator at the high layer dedicated to absent keyphrase generation, and a guider in the middle to transfer information from the selector to the generator. Experimental results on four keyphrase generation benchmarks demonstrate the effectiveness of our model, which significantly outperforms strong baselines for both present and absent keyphrase generation. Furthermore, we extend SGG to a title generation task, which indicates its extensibility to other natural language generation tasks.


Introduction
Automatic keyphrase prediction recommends a set of representative phrases that are related to the main topics discussed in a document (Liu et al., 2009). Since keyphrases can provide a high-level topic description of a document, they are beneficial for a wide range of natural language processing (NLP) tasks, such as information extraction (Wan and Xiao, 2008), text summarization (Wang and Cardie, 2013) and question generation (Subramanian et al., 2018). Existing methods for keyphrase prediction can be categorized into extraction and generation approaches. Specifically, keyphrase extraction methods identify important consecutive words from a given document as keyphrases, which means that the extracted keyphrases (denoted as present keyphrases) must exactly come from the given document. However, some keyphrases (denoted as absent keyphrases) of a given document do not match any contiguous subsequence but are highly semantically related to the source text. Extraction methods fail to predict these absent keyphrases. Therefore, generation methods have been proposed to produce a keyphrase verbatim from a predefined vocabulary, no matter whether the generated keyphrase appears in the source text. Compared with conventional extraction methods, generation methods are able to generate absent keyphrases as well as present keyphrases.
CopyRNN (Meng et al., 2017) is the first to employ the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014) with the copying mechanism (Gu et al., 2016) to generate keyphrases for given documents. Following CopyRNN, several Seq2Seq-based keyphrase generation approaches have been proposed to improve the generation performance (Chen et al., 2018; Ye and Wang, 2018; Chen et al., 2019; Zhao and Zhang, 2019; Wang et al., 2019; Yuan et al., 2020). All these existing methods generate present and absent keyphrases synchronously without explicitly distinguishing these two different categories of keyphrases, which leads to two problems: (1) They complicate the identification of present keyphrases. Specifically, they search over an entire predefined vocabulary containing a vast number of words (e.g., 50,000 words) to generate a present keyphrase verbatim, which is over-parameterized, since a present keyphrase can simply be selected as a contiguous subsequence of the source text, which contains a limited number of words (e.g., fewer than 400).
(2) They weaken the generation of absent keyphrases. Existing models for absent keyphrase generation are usually trained on datasets mixed with a large proportion of present keyphrases. Table 1 shows that nearly half of the training data are present keyphrases, which leads to the extremely low proportion of absent keyphrases generated by such a model, i.e., CopyRNN. This observation demonstrates that these methods are biased towards replicating words from the source text for present keyphrase generation, which inevitably affects their performance on generating absent keyphrases.
To address the aforementioned problems, we propose a Select-Guide-Generate (SGG) approach, which deals with present and absent keyphrase generation separately in different stages based on different mechanisms. Figure 1 illustrates an example of keyphrase prediction by SGG. The motivation is to solve the keyphrase generation problem from selecting to generating, and to use the selected results to guide the generation. Specifically, SGG is implemented as a hierarchical neural network which performs Seq2Seq learning with a multi-task learning strategy. This network consists of a selector at the low layer, a generator at the high layer, and a guider at the middle layer for information transfer. The selector generates present keyphrases through a pointing mechanism (Vinyals et al., 2015), which adopts attention distributions to select a sequence of words from the source text as output. The generator then generates the absent keyphrases through a pointing-generating (PG) mechanism (See et al., 2017). Since present keyphrases have already been generated by the selector, they should not be generated again by the generator. Therefore, a guider is designed to memorize the present keyphrases generated by the selector; it is then fed into the attention module of the generator to constrain it to focus on generating absent keyphrases. We summarize our main contributions as follows:

• We propose the SGG approach, which models present and absent keyphrase generation separately in different stages, i.e., select, guide, and generate, without sacrificing end-to-end training through back-propagation.
• Extensive experiments are conducted to verify the effectiveness of our model, which not only improves present keyphrase generation but also dramatically boosts the performance of absent keyphrase generation.
• Furthermore, we apply SGG to a title generation task, and the experimental results indicate the extensibility and effectiveness of our SGG approach on generation tasks.

Related Work
As mentioned in Section 1, extraction and generation methods are two different research directions in the field of keyphrase prediction. Existing extraction methods can be broadly classified into supervised and unsupervised approaches. The supervised approaches treat keyphrase extraction as a binary classification task, training models on the features of labeled keyphrases to determine whether a candidate phrase is a keyphrase (Witten et al., 1999; Medelyan et al., 2009). In contrast, the unsupervised approaches treat keyphrase extraction as a ranking task, scoring each candidate with different ranking metrics, such as clustering (Liu et al., 2009) or graph-based ranking (Mihalcea and Tarau, 2004; Wang et al., 2014; Gollapalli and Caragea, 2014; Zhang et al., 2017). This work is mainly related to keyphrase generation approaches, which have demonstrated good performance on the keyphrase prediction task. Following CopyRNN (Meng et al., 2017), several extensions have been proposed to boost the generation capability. In CopyRNN, model training relies heavily on a large amount of labeled data, which is often unavailable, especially in new domains. To address this problem, Ye and Wang (2018) proposed a semi-supervised keyphrase generation model that utilizes both abundant unlabeled data and limited labeled data. CopyRNN uses the concatenation of the article title and abstract as input, ignoring the leading role of the title. To address this deficiency, Chen et al. (2019) proposed a title-guided Seq2Seq network to fully utilize the already-summarized information in the title. In addition, some research attempts to introduce external knowledge into keyphrase generation, such as syntactic constraints (Zhao and Zhang, 2019) and latent topics.
These approaches do not consider the one-to-many relationship between the input text and the target keyphrases, and thus fail to model the correlation among the multiple target keyphrases. To overcome this drawback, Chen et al. (2018) incorporated a review mechanism into keyphrase generation and proposed CorrRNN, a model with correlation constraints. Similarly, SGG separately models the one-to-many relationships between the input text and the present and absent keyphrases. To avoid generating duplicate keyphrases, Chen et al. (2020) proposed an exclusive hierarchical decoding framework that includes a hierarchical decoding process and either a soft or a hard exclusion mechanism. For the same purpose, our method deploys a guider to prevent the generator from generating duplicate present keyphrases. Last but most importantly, none of these methods considers the difference between present and absent keyphrases. We are the first to treat present and absent keyphrases separately in the keyphrase generation task.

Problem Definition
Given a dataset including K data samples, the j-th data item (x^{(j)}, y^{(j,p)}, y^{(j,a)}) consists of a source text x^{(j)}, a set of present keyphrases y^{(j,p)} and a set of absent keyphrases y^{(j,a)}. Different from CopyRNN (Meng et al., 2017), which splits each data item into multiple training examples, each containing only one keyphrase as target, we regard each data item as one training example by concatenating its present keyphrases as one target and its absent keyphrases as another. Specifically, assuming that the j-th data item consists of m present keyphrases {y^{(j,p)}_1, ..., y^{(j,p)}_m} and n absent keyphrases {y^{(j,a)}_1, ..., y^{(j,a)}_n}, the target present keyphrases y^{(j,p)} and target absent keyphrases y^{(j,a)} are represented as:

y^{(j,p)} = y^{(j,p)}_1 || y^{(j,p)}_2 || ... || y^{(j,p)}_m
y^{(j,a)} = y^{(j,a)}_1 || y^{(j,a)}_2 || ... || y^{(j,a)}_n

where || is a special splitter to separate the keyphrases. We then treat the source text x^{(j)}, the present keyphrases y^{(j,p)} and the absent keyphrases y^{(j,a)} all as word sequences. Under this setting, our model is capable of generating multiple keyphrases in one sequence as well as capturing the mutual relations between these keyphrases. A keyphrase generation model learns the mapping from the source text x^{(j)} to the target keyphrases (y^{(j,p)}, y^{(j,a)}). For simplicity, (x, y^p, y^a) is used to denote each item in the rest of this paper, where x denotes a source text sequence, y^p its present keyphrase sequence and y^a its absent keyphrase sequence.
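As a concrete illustration, the target construction described above can be sketched in a few lines of Python (a minimal sketch; the literal separator token `<sep>` is our assumption, standing in for the special splitter ||):

```python
SEP = "<sep>"  # stands in for the special splitter "||"; the token name is an assumption

def build_targets(present, absent):
    """Concatenate each keyphrase set into one flat target word sequence,
    with a separator token between consecutive keyphrases."""
    y_p = f" {SEP} ".join(present).split()
    y_a = f" {SEP} ".join(absent).split()
    return y_p, y_a

# Example: two present keyphrases become a single selector target sequence.
y_p, y_a = build_targets(["neural network", "keyphrase generation"],
                         ["text mining"])
# y_p == ["neural", "network", "<sep>", "keyphrase", "generation"]
```

Because each target is one sequence, a single decoder pass can emit several keyphrases and condition later keyphrases on earlier ones.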

Model Overview
The architecture of our proposed Select-Guide-Generate (SGG) approach is illustrated in Figure 2. Our model is an extension of the Seq2Seq framework and consists of a text encoder, a selector, a guider, and a generator. The text encoder converts the source text x into a set of hidden representation vectors {h_i}_{i=1}^{L} with a bidirectional Long Short-Term Memory network (bi-LSTM) (Hochreiter and Schmidhuber, 1997), where L is the length of the source text sequence. The selector is a uni-directional LSTM which predicts the present keyphrase sequence y^p based on the attention distribution over the source words. After the present keyphrases are selected, the guider produces a vector that memorizes the prediction information of the selector, which is then fed into the attention module of the generator to adjust the information it attends to. The selection-guided generator is also implemented as a uni-directional LSTM, which produces the absent keyphrase sequence y^a based on two distributions, over the predefined vocabulary and over the source words, respectively. A soft switch gate p_gen is employed as a trade-off between these two distributions.

Text Encoder
The goal of the text encoder is to provide a series of dense representations {h_i}_{i=1}^{L} of the source text. In our model, the text encoder is implemented as a bi-LSTM (Hochreiter and Schmidhuber, 1997) which reads the input sequence x = (x_1, ..., x_L) in both directions. The final hidden representation h_i of the i-th source word is the concatenation of the forward and backward hidden states, i.e., h_i = [→h_i; ←h_i].

Selector
A selector is designed to generate the present keyphrase sequence through the pointer mechanism (Vinyals et al., 2015), which adopts the attention distribution as a pointer to select words from the source text as output. Specifically, given the source text sequence x and the previously generated words {y^p_1, ..., y^p_{t−1}}, the probability distribution of predicting the next word y^p_t in present keyphrases is:

P(y^p_t | y^p_1, ..., y^p_{t−1}, x) = α^{p,t}
α^{p,t} = softmax(u^{p,t})
u^{p,t}_i = V_p^T tanh(W_p [s^p_t; h_i] + b_p)

where α^{p,t} is the attention (Bahdanau et al., 2015) distribution at decoding time step t, i ∈ {1, ..., L}, and V_p, W_p and b_p are trainable parameters of the model. u^{p,t}_i can be viewed as the degree of matching between the input at position i and the output at position t. s^p_t represents the hidden state at decoding time step t, and is updated by:

s^p_t = LSTM(s^p_{t−1}, [emb(y^p_{t−1}); c^p_{t−1}])

where the context vector c^p_{t−1} = Σ_{i=1}^{L} α^{p,t−1}_i h_i is the weighted sum of the source hidden states.
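A single pointer step can be sketched as follows (a simplified sketch: the learned scoring function V_p^T tanh(W_p [s^p_t; h_i] + b_p) is abstracted into a `score` callable, and greedy selection stands in for beam search):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def pointer_step(s_t, H, score):
    """One selector decoding step: the attention distribution over the L
    source positions is used directly as the output distribution."""
    u = [score(s_t, h_i) for h_i in H]  # matching degree u_i per source word
    alpha = softmax(u)                  # attention = pointer distribution
    best = max(range(len(H)), key=lambda i: alpha[i])
    return alpha, best

# Toy usage with scalar "hidden states" and a dot-product score.
alpha, best = pointer_step(1.0, [1.0, 3.0, 2.0], lambda s, h: s * h)
# best == 1: the second source position gets the highest attention
```

Note that the output vocabulary of the selector is just the L source positions, which is what makes selection much cheaper than decoding over a 50,000-word vocabulary.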

Guider
A guider is designed to fully utilize the attention information of the selector to guide the generator in absent keyphrase generation. The idea is to use a guider vector r to softly indicate which words in the source text have been generated by the selector. This is important for helping the generator to focus on generating the absent keyphrases. Specifically, r is constructed through the accumulation of the attention distributions over all decoding time steps of the selector, computed as:

r = Σ_{t=1}^{M} α^{p,t}

where M is the length of the present keyphrase sequence. r is an unnormalized distribution over the source words. As the attention distribution of the selector is equal to the probability distribution over the source words, r represents the possibility that these words have been generated by the selector. The calculation of the guider is inspired by the coverage vector (Tu et al., 2016), which is sequentially updated during the decoding process. In contrast, the guider here is a static vector which is capable of memorizing global information.
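The guider computation amounts to an element-wise sum of the selector's per-step attention vectors; a minimal sketch:

```python
def build_guider(attn_history):
    """r = sum over all M selector decoding steps of the attention
    distribution alpha^{p,t}. r is an unnormalized distribution over the
    L source positions, so r[i] is large for source words the selector
    has likely already emitted."""
    r = [0.0] * len(attn_history[0])
    for alpha in attn_history:
        for i, a in enumerate(alpha):
            r[i] += a
    return r

# Two decoding steps over a 2-word source text:
r = build_guider([[0.25, 0.75], [0.5, 0.5]])
# r == [0.75, 1.25]
```

Unlike a coverage vector, r is computed once, after the selector finishes, and stays fixed while the generator decodes.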

Selection-Guided Generator
A generator aims to predict the absent keyphrase sequence under the guidance of the selection information from the guider. Unlike present keyphrases, most words in absent keyphrases do not appear in the source text. Therefore, the generator produces absent keyphrases by picking up words from both a predefined large-scale vocabulary and the source text (See et al., 2017; Gu et al., 2016). The probability distribution of predicting the next word y^a_t in absent keyphrases is defined as:

P(y^a_t) = p_gen P_vocab(y^a_t) + (1 − p_gen) Σ_{i: y^a_t = x_i} α^{a,t}_i    (7)

where P_vocab is the probability distribution over the predefined vocabulary, which is zero if y^a_t is an out-of-vocabulary (OOV) word. Similarly, if y^a_t does not appear in the source text, then Σ_{i: y^a_t = x_i} α^{a,t}_i is zero. P_vocab is computed as:

P_vocab = softmax(W [s^a_t; c^a_t] + b)

where W and b are learnable parameters, s^a_t is the hidden state of the generator, and c^a_t is the context vector for generating the absent keyphrase sequence, computed by the following equations:

u^{a,t}_i = V_a^T tanh(W_a [s^a_t; h_i; r_i] + b_a)
α^{a,t} = softmax(u^{a,t})
c^a_t = Σ_{i=1}^{L} α^{a,t}_i h_i

where V_a, W_a and b_a are learnable parameters, and r is the vector produced by the guider. The generation probability p_gen at time step t is computed as:

p_gen = σ(W_gen [s^a_t; c^a_t; emb(y^a_{t−1})] + b_gen)

where W_gen and b_gen are learnable parameters, σ(·) represents the sigmoid function and emb(y^a_{t−1}) is the embedding of y^a_{t−1}. In addition, p_gen in formula (7) is used as a soft switch to choose between generating words over the vocabulary and copying words from the source text based on the distribution α^{a,t}.
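The mixture in formula (7) can be sketched with toy values as follows (an illustrative sketch only; `p_vocab` is a hypothetical word-to-probability map, with OOV words falling back to 0):

```python
def final_dist(p_gen, p_vocab, alpha, src_words, word):
    """P(y) = p_gen * P_vocab(y) + (1 - p_gen) * (attention mass on the
    source positions where `word` occurs, i.e., the copy term)."""
    copy_prob = sum(a for a, w in zip(alpha, src_words) if w == word)
    return p_gen * p_vocab.get(word, 0.0) + (1.0 - p_gen) * copy_prob

# "mining" is both in the vocabulary (prob 0.2) and in the source text,
# so it receives probability mass from both terms of the mixture:
p = final_dist(0.5, {"mining": 0.2}, [0.6, 0.4], ["text", "mining"], "mining")
# p = 0.5 * 0.2 + 0.5 * 0.4 ≈ 0.3
```

A word that is neither in the vocabulary nor in the source text correctly receives probability zero, since both terms vanish.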

Training
Given the set of data pairs {x^{(j)}, y^{(j,p)}, y^{(j,a)}}_{j=1}^{K}, the loss function of keyphrase generation consists of two cross-entropy losses:

L_p(θ) = −Σ_{j=1}^{K} Σ_{t=1}^{M} log P(y^{(j,p)}_t | y^{(j,p)}_1, ..., y^{(j,p)}_{t−1}, x^{(j)}; θ)
L_a(θ) = −Σ_{j=1}^{K} Σ_{t=1}^{N} log P(y^{(j,a)}_t | y^{(j,a)}_1, ..., y^{(j,a)}_{t−1}, x^{(j)}; θ)

where L_p and L_a are the losses of generating present and absent keyphrases, respectively, M and N are the word sequence lengths of the present and absent keyphrases, and θ denotes the parameters of our model. The training objective is to jointly minimize the two losses:

L(θ) = L_p(θ) + L_a(θ)


Experiment

Dataset
We use the dataset collected by Meng et al. (2017) from various online digital libraries, which contains approximately 570K samples, each consisting of the title and abstract of a scientific publication as source text and the author-assigned keywords as target keyphrases. We randomly select the examples that contain at least one present keyphrase to construct the training set. Then, a validation set containing 500 samples is selected from the remaining examples. In order to evaluate our proposed model comprehensively, we test models on four widely used public datasets from the scientific domain, namely Inspec (Hulth and Megyesi, 2006), Krapivin (Krapivin et al., 2009), SemEval-2010 (Kim et al., 2010) and NUS (Nguyen and Kan, 2007), whose statistics are summarized in Table 2.

Baselines and Evaluation Metrics
For present keyphrase prediction, we compare our model with both extraction and generation approaches. The extraction approaches include two unsupervised methods, TF-IDF and TextRank (Mihalcea and Tarau, 2004), and one classic supervised method, KEA (Witten et al., 1999). Among the generation baselines, some models, such as CopyRNN, split each data item into multiple training examples, each of which contains only one keyphrase, while the other models concatenate all keyphrases as the target. For simplicity, the pattern of training a model with only one keyphrase is denoted as one-to-one and with the concatenation of all keyphrases as one-to-many. The generation baselines are the following state-of-the-art encoder-decoder models:

• CopyRNN (one-to-one) (Meng et al., 2017) is an RNN-based encoder-decoder model incorporating the copying mechanism.
• CatSeq (one-to-many) (Yuan et al., 2020) has the same model structure as CopyRNN; the difference is that CatSeq is trained under the one-to-many pattern.
The baseline CopyTrans has not been reported in existing papers and is thus retrained. The implementation of Transformer is based on the open-source tool OpenNMT (https://github.com/OpenNMT/OpenNMT-py). For our experiments on absent keyphrase generation, only generation methods are chosen as baselines. The copying mechanism used in all reimplemented generation models follows the version of See et al. (2017), which is slightly different from the implementations of Meng et al. (2017) and Gu et al. (2016). SGG denotes the full version of our proposed model, which contains a selector, a guider, and a generator. Note that SGG is also trained under the one-to-many pattern. Same as CopyRNN, we adopt top-N macro-averaged F-measure (F1) and recall as our evaluation metrics for present and absent keyphrases, respectively. The choice of a larger N for absent keyphrases (i.e., 50 vs. 5 and 10) is due to the fact that absent keyphrases are more difficult to generate than present keyphrases. For present keyphrase evaluation, exact match is used to determine whether a prediction is correct. For absent keyphrase evaluation, the Porter Stemmer is used to stem all words and remove suffixes before comparison.
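For concreteness, the exact-match F1@N for a single document can be sketched as follows (an illustrative sketch only, ignoring stemming and corpus-level macro-averaging):

```python
def f1_at_n(predictions, gold, n):
    """F1 between the top-N predicted phrases (in ranked order) and the
    gold keyphrases, using exact string match."""
    top_n = predictions[:n]
    correct = len(set(top_n) & set(gold))
    if correct == 0:
        return 0.0
    precision = correct / len(top_n)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# 1 of the top-5 predictions matches one of the 2 gold keyphrases:
score = f1_at_n(["neural net", "svm", "topic model"], ["neural net", "crf"], n=5)
# precision = 1/3, recall = 1/2, F1 = 0.4
```

The corpus-level score is then obtained by averaging this per-document F1 over the test set.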

Implementation Details
We set the maximal length of the source sequence to 400, of the target sequences of the selector and generator to 25, and of the decoders of all generation baselines to 50. We choose the 50,000 most frequently occurring words as our vocabulary. The dimension of the word embeddings is 128, and the dimension of the hidden states in the encoder, selector and generator is 512. The word embeddings are randomly initialized and learned during training. We initialize the model parameters with a uniform distribution in [-0.2, 0.2]. The model is optimized using Adagrad (Duchi et al., 2011) with learning rate 0.15, initial accumulator 0.1 and maximal gradient norm 2. During inference, we use beam search to generate diverse keyphrases, with a beam size of 200, the same as the baselines. All models are trained on a single Tesla P40.

Results and Analysis
In this section, we present the results of present and absent keyphrase generation separately. The results of predicting present keyphrases are shown in Table 3, which gives the F1 at the top-5 and top-10 predictions. We first compare our proposed model with the conventional keyphrase extraction methods. The results show that our model outperforms the extraction methods by a large margin, demonstrating the potential of Seq2Seq-based generation models for the automatic keyphrase extraction task. We then compare our model with the generation baselines, and the results indicate that our model still outperforms these baselines significantly. The better performance of SGG illustrates that the pointing-based selector is sufficient and more effective for generating present keyphrases.
We further analyze the experimental results of absent keyphrase generation. Table 4 presents the recall results of the generation baselines and our model on the four datasets. It can be observed that our model significantly improves the performance of absent keyphrase generation compared to the generation baselines. This is because SGG is equipped with a generator that is not biased towards generating present keyphrases, and the designed guider further guides the generator to focus on generating absent keyphrases. Table 5 shows the proportion of absent keyphrases generated by SGG. The comparison of Tables 1 and 5 demonstrates that our model has the ability to generate a large proportion of absent keyphrases rather than tending to generate present keyphrases.
In addition, an interesting phenomenon can be observed from the results of CopyRNN and CatSeq: under the same model structure, the one-to-one pattern generally performs better than one-to-many in absent keyphrase generation. To explore this phenomenon, we use the same code and training set to retrain CopyRNN under the one-to-one and one-to-many patterns, and the test results confirm that one-to-one boosts the performance of absent keyphrase generation. However, SGG cannot be trained under the one-to-one pattern, as the core of the guider in SGG is to memorize all present keyphrases. Even so, SGG still outperforms CopyRNN: averaged over the four test sets, SGG achieves a 1.6% gain over CopyRNN and a 31.8% gain over the best-performing results of the one-to-many baselines.

SGG for Title Generation
In this section, we explore the extensibility of SGG to other natural language generation (NLG) tasks, i.e., title generation. We adopt the same dataset described in Section 4.1 for title generation, which contains abstracts, present keyphrases, absent keyphrases, and titles. Specifically, a title generation model takes an abstract as input and generates a title as output. To train the SGG model for title generation, the present keyphrases appearing in the titles are used as labels to train the selector, and the titles are used to train the generator. The idea is to utilize present keyphrase generation as an auxiliary task to help the main title generation task. To evaluate SGG on title generation, we choose CopyTrans and the pointer-generator network (PG-Net) (See et al., 2017) as baselines. We use ROUGE-1 (unigram), ROUGE-2 (bigram), ROUGE-L (LCS) and human evaluation as evaluation metrics. For the human evaluation, we randomly select 100 abstracts from each test set and distribute them evenly among four annotators. The evaluation standard is the fluency of the generated title and whether it correctly conveys the core topics of the abstract. As shown in Table 6, SGG achieves better performance than the strong baselines on all datasets, proving that SGG can be directly applied to the title generation task while remaining highly effective.

Ablation Study on Guider
In this section, we further study the effectiveness of our proposed guider module. Table 7 displays the results of SG (a selector and a generator only, with no guider) and its comparison with SGG on the two largest test sets, Inspec and Krapivin, which illustrates that the guider has a remarkable effect on both absent keyphrase generation and title generation.
In more detail, the function of the guider differs between the two tasks, depending on the correlation between the targets of the selector and the generator. In the task of keyphrase generation, the words predicted by the selector should not be generated again by the generator, because the present and absent keyphrases of a given text usually do not share words. In the task of title generation, however, the words selected by the selector should receive more attention from the generator, since they are usually part of the target title. To verify this analysis, we visualize two examples of the attention scores in the generators for the two tasks in Figure 4. For keyphrase generation, SG repeatedly generates "implicit surfaces", which has already been generated by its selector. In contrast, SGG successfully avoids this situation and correctly generates the absent keyphrase "particle constraint". For title generation, the guider helps SGG assign higher attention scores to the words in "seat reservation", which have been generated by the selector.

Figure 4: Visualization of attention scores in the generator for keyphrase generation and title generation. The words marked in red have already been generated by the selector. The words marked in blue are generated by the generator. In these two examples, the phrase "particle constraint" is the correct absent keyphrase for keyphrase generation and "seat reservation problem" is part of the correct title for title generation.

Figure 3 gives the proportion of test examples in which the predictions of the generator overlap with the predictions of the selector. We observe that in keyphrase generation, SG is more likely than SGG to generate words that have already been generated by the selector. In contrast, the results on title generation indicate that SGG is more likely than SG to generate previously selected words.
Through the above analysis, we conjecture that the guider is able to correctly guide the behaviour of the generator in different tasks, i.e., it learns to either encourage or discourage generating previously selected words.

Conclusion
In this paper, a Select-Guide-Generate (SGG) approach is proposed and implemented with a hierarchical neural model for keyphrase generation, which separately deals with the generation of present and absent keyphrases. Comprehensive empirical studies demonstrate the effectiveness of SGG. Furthermore, a title generation task indicates the extensibility of SGG in other generation tasks.