Incorporating Linguistic Constraints into Keyphrase Generation

Keyphrases, which concisely describe the high-level topics discussed in a document, are very useful for a wide range of natural language processing tasks. Although existing keyphrase generation methods have achieved remarkable performance on this task, they generate many overlapping phrases (sub-phrases or super-phrases) of keyphrases. In this paper, we propose a parallel Seq2Seq network with coverage attention to alleviate the overlapping phrase problem. Specifically, we integrate the linguistic constraints of keyphrases into the basic Seq2Seq network on the source side, and employ a multi-task learning framework on the target side. In addition, to avoid generating overlapping phrases that are syntactically well-formed, we introduce a coverage vector that keeps track of the attention history and decides whether parts of the source text have already been covered by previously generated keyphrases. Experimental results show that our method outperforms the state-of-the-art CopyRNN on scientific datasets, and is also more effective in the news domain.


Introduction
Automatic keyphrase prediction recommends a set of representative phrases that are related to the main topics discussed in a document (Liu et al., 2009). Since keyphrases provide a high-level topic description of a document, they are beneficial for a wide range of natural language processing tasks such as information extraction (Wan and Xiao, 2008), text summarization and question answering (Tang et al., 2017). However, the performance of existing methods is still far from satisfactory (Hasan and Ng, 2014). The main reason is that it is very challenging to determine whether a phrase or a set of phrases accurately captures the main topics presented in the document.
Existing approaches for keyphrase prediction can be broadly divided into extraction and generation methods. Conventional extraction methods directly select important consecutive words or phrases from the target document as keyphrases, so the extracted keyphrases must appear in the target document. In contrast, generation methods choose keyphrases from a predefined vocabulary regardless of whether the generated keyphrases appear in the target document. CopyRNN (Meng et al., 2017) is the first to employ the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014) to generate keyphrases for documents. This method is able to predict absent keyphrases, i.e., keyphrases that do not appear in the target document.
Following CopyRNN, a few extensions of the Seq2Seq framework have been proposed to generate keyphrases more accurately. By analyzing the results generated by these approaches, we find many overlapping phrases of correct (author-labeled) keyphrases. For example, in the experimental results of CopyRNN, the author-labeled keyphrases are "Internet" and "Distributed decision", but the predicted phrases are "Internet held" and "Distributed", respectively. Overlapping phrases cause two problems. First, the correct keyphrase is not generated, but its overlapping phrases are predicted as keyphrases. Second, existing generation approaches often predict both the keyphrase and its overlapping phrases as keyphrases, although the overlapping phrases of keyphrases are not keyphrases in most cases. A more precise description of this overlapping problem is given in the next section, including the problem formulation and its seriousness in the experimental results of the state-of-the-art CopyRNN.

[Table 1: Sub-problems of the overlapping phrase problem, with their formulations and their seriousness (top-k, k = 10), measured by $|P_i|/|O_l|$ (%) and by $|P_i^n|/|O_l^n|$ (%) for n = 1, n = 2, n = 3 and n ≥ 4.]

In this paper, we propose a parallel Seq2Seq network (ParaNet) with coverage attention to alleviate the overlapping phrase problem. Specifically, we use two standalone encoders to encode the source text and the syntactic constraints separately on the source side, and then apply a multi-task learning framework to generate the keyphrases and the part-of-speech (POS) tags of the words in keyphrases on the target side. Most keyphrases are noun phrases, which commonly consist of nouns and adjectives. The syntactic constraints therefore help to avoid generating overlapping phrases that are not noun phrases, e.g., "internet held" (which contains a verb). In addition, to avoid generating overlapping phrases that are syntactically well-formed, we introduce the coverage vector (proposed by Tu et al., 2016) to keep track of the attention history and to decide whether parts of the source text have already been covered by previously generated keyphrases.
The remainder of this paper is organized as follows. In the next section, we analyze the overlapping phrase problem in existing generation methods. In Section 3, we summarize related work on keyphrase prediction, especially keyphrase generation. The proposed method is presented in Section 4. Finally, we report the experiments and results before concluding the paper.

Analysis of the Overlapping Problem
In this section, we first formalize the overlapping phrase problem, and then present its seriousness by analyzing statistics obtained from CopyRNN.
Let $p = w_i w_{i+1} \ldots w_{i+m}$ be a phrase of length $m+1$ over a finite word dictionary $D$, i.e., $w_i \in D$. We define the phrase $p_b = w_{i+j} w_{i+j+1} \ldots w_{i+j+k}$ ($j \geq 0$, $j + k \leq m$) as a sub-phrase of $p$. Conversely, we define the phrase $p$ as a super-phrase of $p_b$, and denote a super-phrase of $p$ as $p_u$. Overlapping relations exist between a phrase $p$ and its sub-/super-phrases $p_b$/$p_u$. Let $O_l$ be the set of author-labeled keyphrases, and $O_k$ be the set of keyphrases generated in the top-k predictions, in which each generated phrase may be correct or incorrect. We assume that $p$ is an author-labeled keyphrase, i.e., $p \in O_l$, and that its sub-phrase $p_b$ and super-phrase $p_u$ are not keyphrases, i.e., $p_b \notin O_l$ and $p_u \notin O_l$.

The overlapping phrase problem can be divided into two main problems according to whether $p$ is generated in the top-k results. These two problems are further subdivided into six sub-problems, formulated as shown in Table 1. Formulations No.1-2 in Table 1 mean that the author-labeled keyphrase $p$ is not predicted, and only one of its sub-phrases $p_b$ or super-phrases $p_u$ is generated. Formulations No.3-6 in Table 1 mean that the author-labeled keyphrase $p$ and one of its sub-phrases $p_b$ or super-phrases $p_u$ are both generated. In addition, $Top(p) < Top(p_b/p_u)$ is worse than $Top(p) > Top(p_b/p_u)$. Note that $p$, $p_b$ and $p_u$ are rarely generated simultaneously.
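The sub-phrase and super-phrase relations defined above can be checked mechanically. The following is a minimal sketch, assuming whitespace tokenization and case-insensitive matching (both are illustrative choices, not details taken from the paper); the function names are hypothetical. It only counts proper sub-phrases, i.e., spans strictly shorter than the phrase itself.

```python
def is_sub_phrase(candidate, phrase):
    """True if `candidate` is a contiguous, strictly shorter word span of `phrase`."""
    cand, full = candidate.lower().split(), phrase.lower().split()
    if not cand or len(cand) >= len(full):
        return False
    width = len(cand)
    return any(full[i:i + width] == cand for i in range(len(full) - width + 1))


def is_super_phrase(candidate, phrase):
    """True if `phrase` is a proper sub-phrase of `candidate`."""
    return is_sub_phrase(phrase, candidate)


def overlap_relation(candidate, phrase):
    """Classify `candidate` relative to `phrase` as 'sub', 'super', or None."""
    if is_sub_phrase(candidate, phrase):
        return "sub"
    if is_super_phrase(candidate, phrase):
        return "super"
    return None
```

With the examples from the introduction, `overlap_relation("Distributed", "Distributed decision")` yields `"sub"` and `overlap_relation("Internet held", "Internet")` yields `"super"`.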
We next present the seriousness of this problem by analyzing statistics obtained from the experimental results of CopyRNN on the KP20k dataset. We first calculate the proportion of keyphrases suffering from the $i$-th sub-problem among all correct keyphrases, i.e., $|P_i|/|O_l|$, where $P_i = \{p \mid p \in O_l \wedge p \text{ suffers from the } i\text{-th sub-problem}\}$, and $|P_i|$ and $|O_l|$ are the sizes of $P_i$ and $O_l$, respectively. We select the top-k ($k = 10$) phrases generated by CopyRNN as the final predictions. As the values of $|P_i|/|O_l|$ in Table 1 show, a total of 42.73% of keyphrases suffer from this problem.
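The proportion described above can be computed for a single document as follows. This is a rough sketch under stated assumptions: whitespace tokenization, exact (unstemmed) matching, and a simplified grouping by whether the keyphrase itself was generated and whether the overlapping phrase is a sub- or super-phrase. It does not reproduce the exact six-way numbering of Table 1, and the helper names are hypothetical.

```python
from collections import Counter


def tokens(phrase):
    """Lower-case and split a phrase into words."""
    return phrase.lower().split()


def is_proper_span(short, long):
    """True if `short` is a contiguous, strictly shorter word span of `long`."""
    n, m = len(short), len(long)
    return 0 < n < m and any(long[i:i + n] == short for i in range(m - n + 1))


def overlap_statistics(gold, predicted, k=10):
    """Return (share of gold keyphrases with an overlapping phrase in top-k, counts)."""
    gold_tok = [tokens(g) for g in gold]
    top_k = [tokens(p) for p in predicted[:k]]
    # Overlapping phrases that are themselves author-labeled keyphrases do not count.
    candidates = [q for q in top_k if q not in gold_tok]
    counts = Counter()
    affected = 0
    for p in gold_tok:
        has_sub = any(is_proper_span(q, p) for q in candidates)
        has_super = any(is_proper_span(p, q) for q in candidates)
        generated = p in top_k  # was the keyphrase itself predicted?
        if has_sub:
            counts[(generated, "sub")] += 1
        if has_super:
            counts[(generated, "super")] += 1
        affected += has_sub or has_super
    rate = affected / len(gold_tok) if gold_tok else 0.0
    return rate, counts
```

Averaging the per-document counts over a test set, and additionally splitting by keyphrase length, would give statistics of the kind reported in Table 1.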
In addition, we calculate the proportion of keyphrases of length $n$ that suffer from the $i$-th sub-problem among all correct keyphrases of the same length, i.e., $|P_i^n|/|O_l^n|$, where $P_i^n$ and $O_l^n$ are the subsets of $P_i$ and $O_l$, respectively, in which each keyphrase has length $n$ (i.e., is an n-gram). Table 1 also shows the seriousness of the sub-problems of the overlapping phrase problem for varying $n$. The results show that long keyphrases easily suffer from the sub-phrase problem (i.e., $p_b \in O_k$), and short keyphrases easily suffer from the super-phrase problem (i.e., $p_u \in O_k$ in Table 1).
Although the overlapping problem restricts the performance of existing methods, it also gives us an opportunity to help better generate keyphrases as the overlapping phrases are often very close to the correct keyphrases.

Related Works
As mentioned in Section 1, existing approaches for keyphrase prediction can be broadly divided into extraction and generation methods. The extraction methods can be further classified into supervised and unsupervised approaches. The supervised approaches treat keyphrase extraction as a binary classification task, in which a learning model is trained on the features of labeled keyphrases to determine whether a candidate phrase is a keyphrase (Witten et al., 1999; Medelyan et al., 2009). In contrast, the unsupervised approaches treat keyphrase extraction as a ranking problem, scoring each candidate using techniques such as clustering (Liu et al., 2009) or graph-based ranking (Mihalcea and Tarau, 2004; Wan and Xiao, 2008).
This work is mainly related to keyphrase generation approaches, which have proven effective in the keyphrase prediction task. Following CopyRNN (Meng et al., 2017), which is the first to generate absent keyphrases using the Seq2Seq framework, a few extensions have been proposed to generate keyphrases more accurately.
In CopyRNN, model training relies heavily on massive amounts of labeled data, which is often unavailable, especially for new domains. To solve this problem, Ye and Wang (2018) proposed a semi-supervised keyphrase generation model that leverages both abundant unlabeled data and limited labeled data. CopyRNN also does not model the one-to-many relationship between a document and its keyphrases. Keyphrase generation therefore depends only on the source document and ignores constraints on the correlation among keyphrases. To overcome this drawback, Chen et al. (2018) proposed a Seq2Seq network with correlation constraints for keyphrase generation. Chen et al. (2019) proposed a title-guided Seq2Seq network that uses the title of the source text to improve performance. However, these methods did not consider the linguistic constraints of keyphrases.

Problem Definition
Let $\{(x_i, p_i)\}_{i=1}^{N}$ be the training data, where $x_i$ is a document, $p_i = \{p_{i,j}\}$ is the keyphrase set of $x_i$, and $N$ is the number of documents. Both the document $x_i$ and the keyphrase $p_{i,j}$ are sequences of words, denoted as $x_i = (x_{i,1}, \ldots, x_{i,L_i})$ and $p_{i,j} = (y_{i,j,1}, \ldots, y_{i,j,L_{ij}})$, where $L_i$ and $L_{ij}$ are the lengths of the word sequences of $x_i$ and $p_{i,j}$. The goal of keyphrase generation is to design a model that maps each document $x$ to its keyphrase set $p$.

Figure 1 illustrates the overview of the proposed method. The method consists of two components: the parallel encoders and the parallel decoders. The parallel encoders consist of the word encoder and the syntactic information encoder, which compress the source text and its syntactic information into hidden vectors. The parallel decoders consist of the keyphrase decoder and the POS tag decoder, which are separate decoders used to generate the keyphrases and the POS tags of the words in keyphrases. During training, these two tasks reinforce each other and provide a strong representation of the source text. In addition, we employ coverage attention to reduce the generation of overlapping phrases of keyphrases.

Basic Seq2Seq Model
Our approach is based on a Seq2Seq framework that consists of an encoder and a decoder, both implemented with recurrent neural networks (RNNs). The encoder converts the variable-length source word sequence $x = (x_1, x_2, \ldots, x_L)$ into a set of hidden representation vectors $\{h_i\}_{i=1}^{L}$ by iterating the following equation:

$$h_i = f_e(x_i, h_{i-1}) \tag{1}$$

where $f_e$ is a non-linear function in the encoder. The decoder decompresses the context vector and generates the variable-length target keyphrase $y = (y_1, y_2, \ldots, y_L)$ word by word through the conditional language model:

$$p(y_i \mid y_{<i}, x) = g(y_{i-1}, s_i, c_i) \tag{2}$$

where $g$ is a softmax function and $s_i$ is the decoder hidden vector, calculated as:

$$s_i = f_d(y_{i-1}, s_{i-1}, c_i) \tag{3}$$

where $f_d$ is a non-linear function in the decoder, and $c_i$ is a context vector, calculated as a weighted sum over the source hidden vectors $h$:

$$c_i = \sum_{j=1}^{L} \alpha_{ij} h_j \tag{4}$$

$$\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_{k=1}^{L} \exp(a(s_{i-1}, h_k))} \tag{5}$$

where $a(s_{i-1}, h_j)$ is an alignment function that measures the similarity between $s_{i-1}$ and $h_j$.

A pure generation mode cannot predict keyphrases that contain out-of-vocabulary words. Thus, Meng et al. (2017) introduced a copy mechanism (Gu et al., 2016) to predict out-of-vocabulary words by directly copying them from the source text. Consequently, the probability of generating a target word $y_i$ (i.e., Equ. 2) is modified as:

$$p(y_i \mid y_{<i}, x) = p_g(y_i \mid y_{<i}, x) + p_c(y_i \mid y_{<i}, x) \tag{6}$$

where $y_{<i}$ represents $y_{1,\ldots,i-1}$, $p_g$ is the generation probability, and $p_c$ is the probability of copying, calculated as:

$$p_c(y_i \mid y_{<i}, x) = \frac{1}{Z} \sum_{j: x_j = y_i} \exp\left(\sigma(h_j^{\top} W_c)\, s_i\right), \quad y_i \in X \tag{7}$$

where $\sigma$ is a non-linear function, $X$ is the set of unique words in the source text $x$, $W_c$ is a learned parameter matrix, and $Z$ is the sum used for normalization.
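The attention step described above (softmax-normalized alignment scores followed by a weighted sum of encoder states) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the dot product used as the default alignment function is an assumption, since the section does not fix the form of $a(\cdot,\cdot)$ for the basic model, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot-product alignment; an assumed stand-in for the alignment function a."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(prev_state, encoder_states, align=dot):
    """Return the attention weights over source positions and the context vector."""
    scores = [align(prev_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context
```

A learned alignment function (for example an additive one with trainable matrices) can be passed through the `align` argument without changing the rest of the computation.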

Parallel Seq2Seq Model
Most keyphrases are noun phrases, which commonly consist of nouns and adjectives (Gollapalli and Caragea, 2014). Hence, syntactic information is useful for improving keyphrase generation performance. Although a conventional generation model can implicitly learn syntactic information from the source text, it cannot capture many deep syntactic structural details (Shi et al., 2016). To overcome this shortcoming, we propose a parallel Seq2Seq model that integrates the following additional syntactic information into the basic Seq2Seq model:

• POS tag: Keyphrases are commonly noun phrases with specific part-of-speech (POS) patterns (Hulth, 2003). In supervised approaches to keyphrase extraction, the POS tags assigned to words are an important type of syntactic feature used to train the classifier (Hasan and Ng, 2014). We incorporate the POS tags into the Seq2Seq network to capture the syntactic combinations of keyphrases.

• Phrase tag: The phrase tags assigned to words are also an important type of syntactic feature in supervised extraction approaches, since the words in a keyphrase commonly share the same phrase tag. Therefore, we integrate the phrase tags into the Seq2Seq network to capture the inherent syntactic structure of keyphrases.

We use the Stanford Parser¹ (Finkel et al., 2005) to obtain the 32 POS tags and 16 phrase tags of words. An example with both POS and phrase tags is shown in Table 2, in which the author-labeled keyphrase is highlighted in bold.
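To make the POS-pattern intuition concrete, the sketch below checks whether a pre-tagged phrase consists only of adjectives and nouns and ends in a noun. It is an illustration of the linguistic constraint, not the mechanism used by the proposed model, which learns the constraint through its encoders and decoders. The Penn Treebank tag names and the requirement that the phrase end in a noun are assumptions made for this example, and the input is assumed to be tagged already.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags (assumed tag set)
ADJ_TAGS = {"JJ", "JJR", "JJS"}           # Penn Treebank adjective tags


def looks_like_noun_phrase(tagged_phrase):
    """True if every word is a noun or adjective and the last word is a noun.

    `tagged_phrase` is a list of (word, pos_tag) pairs produced by any POS tagger.
    """
    tags = [tag for _, tag in tagged_phrase]
    if not tags:
        return False
    allowed = NOUN_TAGS | ADJ_TAGS
    return all(tag in allowed for tag in tags) and tags[-1] in NOUN_TAGS
```

Under these assumptions, `[("internet", "NN"), ("held", "VBD")]` is rejected because it contains a verb, while `[("distributed", "JJ"), ("decision", "NN")]` is accepted.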

Parallel Encoders
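The proposed model encodes the word sequence and the tag sequences (POS and phrase tags) in parallel. We use an RNN encoder to produce the set of word hidden vectors $\{h^w\}$ from the source document $x$, and another to produce the set of syntactic tag hidden vectors $\{h^t\}$ from the POS and phrase tags. We create look-up based embedding matrices for words, POS tags and phrase tags, and concatenate the embeddings of the POS tag and the phrase tag into one long vector as the input of the tag encoder. We employ two methods to combine the word and syntactic tag hidden vectors into a unified hidden vector $h$.

The first method is inspired by the Tree-LSTM (Tai et al., 2015), which can selectively incorporate the information from each child node. Following the two-child Tree-LSTM unit, the cell and hidden vectors are calculated by the following transition equations:

$$i_i = \sigma\left(U_w^{(i)} h_i^w + U_t^{(i)} h_i^t + b^{(i)}\right) \tag{8}$$

$$f_i^w = \sigma\left(U_w^{(fw)} h_i^w + U_t^{(fw)} h_i^t + b^{(fw)}\right) \tag{9}$$

$$f_i^t = \sigma\left(U_w^{(ft)} h_i^w + U_t^{(ft)} h_i^t + b^{(ft)}\right) \tag{10}$$

$$o_i = \sigma\left(U_w^{(o)} h_i^w + U_t^{(o)} h_i^t + b^{(o)}\right) \tag{11}$$

$$u_i = \tanh\left(U_w^{(u)} h_i^w + U_t^{(u)} h_i^t + b^{(u)}\right) \tag{12}$$

$$c_i = i_i \odot u_i + f_i^w \odot c_i^w + f_i^t \odot c_i^t \tag{13}$$

$$h_i = o_i \odot \tanh(c_i) \tag{14}$$

where $c_i^w$ and $c_i^t$ are the cell vectors of the word and the tag, $h_i^w$ and $h_i^t$ are the hidden vectors of the word and the tag, $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and the $U$ matrices and $b$ vectors are learned parameters. $i_i$, $f_i^w$, $f_i^t$, $o_i$ and $u_i$ denote an input gate, a forget gate for the word, a forget gate for the syntactic tag, an output gate, and a vector for updating the memory cell, respectively. More details are given in (Tai et al., 2015).

The second method is a linear transformation followed by the hyperbolic tangent function:

$$h_i = \tanh\left(W_w h_i^w + W_t h_i^t + b\right) \tag{15}$$

where $W_w$, $W_t$ and $b$ are learned parameters.

¹ https://nlp.stanford.edu/software/lex-parser.shtml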
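The second combination method (a linear transformation followed by the hyperbolic tangent) can be sketched as below. The parameter names `W_word`, `W_tag` and `bias`, and the use of two separate weight matrices rather than one matrix applied to a concatenated vector, are assumptions made for illustration; the two formulations are mathematically equivalent.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def combine_tanh(h_word, h_tag, W_word, W_tag, bias):
    """Fuse a word hidden vector and a tag hidden vector into one hidden vector."""
    from_word = matvec(W_word, h_word)
    from_tag = matvec(W_tag, h_tag)
    return [math.tanh(a + b + c) for a, b, c in zip(from_word, from_tag, bias)]
```

In a trained model the matrices would be learned jointly with the two encoders; here they are plain nested lists so that the arithmetic is easy to follow.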

Parallel Decoders
The proposed method consists of two parallel decoders: the keyphrase decoder and the POS tag decoder. The keyphrase decoder is used to generate a set of keyphrases for documents. Although the keyphrase decoder can also learn the syntactic structures of keyphrases to some extent, it fails to capture deep syntactic details. To supervise the syntactic combinations of keyphrases, the POS tag decoder is employed to generate the sequence of POS tags of the words in keyphrases. Note that the POS tag decoder serves only as a training aid and is not used at test time. By analogy with Equ. 2, the probability of predicting the POS tag $t_i$ of each word is given as follows:

$$p(t_i \mid t_{<i}, x) = g_t(t_{i-1}, s_i^t, c_i) \tag{16}$$

where $g_t$ is a softmax function and $s_i^t$ is a hidden vector of the POS tag decoder.

Coverage Attention
Repetition is a common problem for Seq2Seq models and is especially serious when generating text sequences, as in machine translation (Tu et al., 2016) and automatic text summarization (See et al., 2017). The reason is that traditional attention mechanisms focus on calculating the attention weights of the current time step and ignore the distribution of weights in the history. Existing Seq2Seq models for keyphrase generation also suffer from this problem, in the form of generating sub-phrases or super-phrases of keyphrases. We employ the coverage model used by Tu et al. (2016) and See et al. (2017) to alleviate this problem. In the coverage model, we maintain a coverage vector $co$ that keeps track of the attention history and helps adjust future attention, calculated as:

$$co_{i,j} = co_{i-1,j} + \alpha_{ij} \tag{17}$$

where $\alpha_{ij}$ is the attention weight assigned to the source word $x_j$ at step $i$, and the coverage vector $co_{i,j}$ measures the degree to which $x_j$ has been covered by attention up to step $i$. More details are given in (Tu et al., 2016; See et al., 2017). Finally, we integrate the coverage vector into the attention mechanism by modifying the alignment function in Equation (5) as:

$$a(s_{i-1}, h_j) = v_c^{\top} \tanh\left(W_s s_{i-1} + W_h h_j + W_{co}\, co_{i-1,j}\right) \tag{18}$$

where $v_c$, $W_s$, $W_h$ and $W_{co}$ are learnable weight parameters.
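The coverage mechanism described above can be sketched as a single attention step that both consumes and updates the coverage values. This is a minimal sketch, assuming an additive alignment score in which the scalar coverage of each source position is scaled by a weight vector `w_co`; all parameter names are placeholders, and the accumulation rule (coverage is the running sum of past attention weights) follows the cited coverage models rather than any detail specific to this paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def coverage_attention_step(prev_state, encoder_states, coverage, v, W_s, W_h, w_co):
    """One attention step with coverage; returns (weights, updated coverage)."""
    from_state = matvec(W_s, prev_state)
    scores = []
    for h_j, co_j in zip(encoder_states, coverage):
        from_source = matvec(W_h, h_j)
        pre = [a + b + w * co_j for a, b, w in zip(from_state, from_source, w_co)]
        scores.append(sum(v_k * math.tanh(p) for v_k, p in zip(v, pre)))
    weights = softmax(scores)
    # Coverage accumulates the attention each source position has received so far.
    new_coverage = [c + a for c, a in zip(coverage, weights)]
    return weights, new_coverage
```

With a negative `w_co`, positions that have already received much attention get lower scores in later steps, which is the behaviour that discourages re-generating phrases from source spans that are already covered.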

Overall Loss Function
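Given the set of data pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the word sequence of the source text, $y_i$ is the word sequence of its keyphrase, and $y_{i,k}$ is the $k$-th word of keyphrase $y_i$, the loss function consists of two parts. The first is the negative log-likelihood of the target words in the keyphrase, calculated as:

$$\mathcal{L}_w(\theta_w) = -\sum_{i=1}^{N} \sum_{k=1}^{L_i} \log p(y_{i,k} \mid y_{i,<k}, x_i; \theta_w) \tag{19}$$

where $L_i$ is the length of keyphrase $y_i$, and $\theta_w$ denotes the parameters of this task. The second is the negative log-likelihood of the POS tags of the words in keyphrases, calculated as follows:

$$\mathcal{L}_t(\theta_t) = -\sum_{i=1}^{N} \sum_{k=1}^{L_i} \log p(t_{i,k} \mid t_{i,<k}, x_i; \theta_t) \tag{20}$$

where $t_{i,k}$ is the POS tag of $y_{i,k}$, and $\theta_t$ denotes the parameters of this task. The final objective (Equ. 21) jointly minimizes the two losses with the Adam optimizer (Kingma and Ba, 2015), weighting them with $\lambda$, a hyper-parameter that tunes the impacts of the two tasks.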
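The two-part loss can be illustrated with a small numeric sketch. The exact way in which $\lambda$ combines the two losses is not recoverable from the text here, so the convex combination below, $(1-\lambda)\,\mathcal{L}_w + \lambda\,\mathcal{L}_t$, is an assumption chosen for illustration; a form such as $\mathcal{L}_w + \lambda\,\mathcal{L}_t$ would be an equally plausible reading. The inputs are the probabilities that the model assigns to the gold words and gold POS tags.

```python
import math


def negative_log_likelihood(gold_probs):
    """Sum of -log p over the probabilities assigned to the gold tokens."""
    return -sum(math.log(p) for p in gold_probs)


def joint_loss(word_probs, tag_probs, lam=0.3):
    """Weighted sum of the keyphrase loss and the POS-tag loss.

    ASSUMPTION: a convex combination controlled by `lam`; the paper's exact
    weighting may differ.
    """
    loss_words = negative_log_likelihood(word_probs)
    loss_tags = negative_log_likelihood(tag_probs)
    return (1.0 - lam) * loss_words + lam * loss_tags
```

Setting `lam=0.0` recovers plain keyphrase training without the auxiliary POS-tag task, which is a convenient sanity check when tuning the weight.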

Datasets
We use the dataset collected by Meng et al. (2017) from various online digital libraries, which contains about 568K articles. Following Meng et al. (2017), we use about 530K articles for training the model, 20k articles for validation, and 20k articles (i.e., KP20k) for testing. Similar to Meng et al. (2017), we also test the model on four widely used public datasets from the computer science domain: Inspec (Hulth and Megyesi, 2006), Krapivin (Krapivin et al., 2009), NUS (Nguyen and Kan, 2007), and SemEval-2010 (Kim et al., 2010). The datasets are summarized in Table 3, which reports, for each collection, the number of present keyphrases (#PKPs), the number of absent keyphrases (#AKPs), the number of articles (#Abs.), and the number of present/absent 1-grams, 2-grams, 3-grams, 4-grams and keyphrases longer than 4-grams (#>4-grams).

Experimental Settings
In the training dataset, the input text is the concatenation of the title and abstract of each scientific article. Following Meng et al. (2017), all numbers in the text are mapped to a special token <digit>. The syntactic tags include 32 POS tags and 16 phrase tags. The size of the word vocabulary is set to 50,000, the size of the word embeddings is set to 150, and the size of the embeddings of the two syntactic tags is set to 50. All embeddings are randomly initialized from a uniform distribution with lower bound -0.1 and are learned during training. The size of the hidden vector is fixed at 300. The weight parameter used to tune the impacts of the two tasks is set to $\lambda = 0.3$. The initial learning rate of the Adam optimizer is set to $10^{-4}$, and the dropout rate is set to 0.5. We use beam search to generate multiple phrases. The maximum depth of the beam search is set to 6, and the beam size is set to 200.

Comparative Methods
We compare our method with extraction and generation approaches. The extraction methods consist of three unsupervised and two supervised methods. The unsupervised extraction methods are TF-IDF, TextRank (Mihalcea and Tarau, 2004) and SingleRank (Wan and Xiao, 2008). The supervised extraction methods are Maui (Medelyan et al., 2009) and KEA (Witten et al., 1999). To present the experimental results clearly, for each dataset we select the best-performing method (BL*) from these extraction baselines, with its best-performing parameters, to compare with our method. The generation baselines are the state-of-the-art CopyRNN (Meng et al., 2017) and ConNet, which feeds the concatenation of the word embeddings and the two syntactic tag embeddings into CopyRNN. The proposed method includes four models: (1) ParaNet_L, which uses the hyperbolic tangent function (i.e., Equ. 15) to combine the hidden vectors of words and syntactic tags generated by the encoders; (2) ParaNet_T, which uses the Tree-LSTM to combine the two hidden vectors; (3) ParaNet_L+CoAtt, ParaNet_L with coverage attention; (4) ParaNet_T+CoAtt, ParaNet_T with coverage attention.
We evaluate the predictions with precision (P), recall (R) and F1-score (F1):

$$P = \frac{\#_c}{\#_p}, \qquad R = \frac{\#_c}{\#_l}, \qquad F1 = \frac{2PR}{P + R}$$

where $\#_c$ is the number of correctly predicted keyphrases, $\#_p$ is the total number of predicted keyphrases, and $\#_l$ is the total number of author-labeled standard keyphrases. Following Meng et al. (2017), we employ the top-N macro-averaged F1-score (F1) for evaluating present keyphrases and recall (R) for evaluating absent keyphrases. We use Porter's stemmer to remove word suffixes before determining whether two keyphrases match.
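The evaluation described above can be sketched as follows. This is an illustrative sketch, not the evaluation script used for the reported numbers: the default `normalize` function only lower-cases and collapses whitespace, whereas the paper stems words with Porter's stemmer. A stemmer such as NLTK's `PorterStemmer` could be plugged in through the `normalize` argument; the function names here are hypothetical.

```python
def _default_normalize(phrase):
    """Lower-case and collapse whitespace (a stand-in for Porter stemming)."""
    return " ".join(phrase.lower().split())


def prf_at_n(predicted, gold, n, normalize=_default_normalize):
    """Precision, recall and F1 of the top-n predictions for one document."""
    gold_set = {normalize(g) for g in gold}
    top_n = [normalize(p) for p in predicted[:n]]
    # Each gold keyphrase is counted at most once, even if predicted twice.
    correct = len(set(top_n) & gold_set)
    precision = correct / len(top_n) if top_n else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1_at_n(documents, n, normalize=_default_normalize):
    """Average the per-document F1 over (predicted, gold) pairs."""
    scores = [prf_at_n(pred, gold, n, normalize)[2] for pred, gold in documents]
    return sum(scores) / len(scores) if scores else 0.0
```

Macro-averaging computes the score per document first and then averages, so documents with many keyphrases do not dominate the result.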

Prediction of Present Keyphrases
The experimental results are shown in Table 4, which reports the F1 at top-5 and top-10 predictions, with the best scores highlighted in bold. We first compare our proposed method with the conventional keyphrase extraction methods, represented by the best-performing extractive method (BL*), which can only extract keyphrases that appear in the source text (i.e., present keyphrases). The results show that even the worst of our models (i.e., ParaNet_L) outperforms the best-performing extraction method (BL*) by a large margin on all test datasets. Secondly, we compare our method with CopyRNN, and the results indicate that our worst model, ParaNet_L, still achieves better performance than CopyRNN. Note that ConNet does not perform as well as expected, and is slightly worse than CopyRNN on most datasets. The main reason may be that directly concatenating the embeddings of the two syntactic tags and the words introduces considerable noise into the encoder, such as the POS tags of verbs.
Finally, we compare our different models. From the results shown in Table 4, we observe that ParaNet_T is more effective than ParaNet_L. This means that, for combining the word and syntactic tag hidden vectors from the encoders, the Tree-LSTM model performs better than the hyperbolic tangent function. The reason may be that the multiple gating functions in the Tree-LSTM help ParaNet_T to select the useful information from each encoder. In addition, we observe that the coverage attention mechanism helps to achieve better performance in generating present keyphrases. Among our proposed models, ParaNet_T+CoAtt achieves the best performance on almost all test datasets.

Prediction of Absent Keyphrase
As noted by Meng et al. (2017), Seq2Seq models can predict absent keyphrases, whereas extraction methods cannot. Therefore, we only compare our method with CopyRNN and ConNet, and evaluate performance by the recall of the top-10 and top-50 results to see how many absent keyphrases are correctly predicted.
The results are shown in Table 5. Our worst model (ParaNet_L) correctly predicts more absent keyphrases than CopyRNN. The main reason may be that the syntactic tags provide useful information for identifying those absent keyphrases that have particular syntactic structures. In addition, we note that ConNet is still slightly worse than CopyRNN in predicting absent keyphrases.
Finally, we compare our four models for generating absent keyphrases. From the results shown in Table 5, we observe that ParaNet_T correctly predicts more absent keyphrases than ParaNet_L on all test datasets. As in present keyphrase generation, the Tree-LSTM model performs better than the hyperbolic tangent function in absent keyphrase generation. In addition, we observe that the coverage attention mechanism helps to correctly predict more absent keyphrases. The reason may be that the coverage vector can capture long-distance dependencies, which helps to generate absent keyphrases that are non-contiguous subsequences of the source text. Among our proposed models, ParaNet_T+CoAtt performs better than the other three models on most test datasets.

Reduction of Overlapping Phrases
As mentioned in Section 1, an important motivation for this work is to reduce the generation of overlapping phrases of keyphrases. Table 6 shows the same statistics as Table 1, compared between the best-performing model ParaNet_T+CoAtt and CopyRNN. From the results, we observe that, compared with CopyRNN, ParaNet_T+CoAtt significantly alleviates the overlapping phrase problem, especially for the sub-phrase problems No.1, No.3 and No.4. For example, the proportion of keyphrases suffering from the overlapping problem among all keyphrases drops from 42.73% to 36.83%. In addition, we investigate the proportion of keyphrases of length $n$ that suffer from the $i$-th sub-problem among all keyphrases of the same length, i.e., $|P_i^n|/|O_l^n|$. We observe that this proportion decreases most for 3-grams ($n = 3$), by up to 15.31%.
In addition to reducing the overlapping phrases on the KP20k dataset, ParaNet_T+CoAtt, compared with CopyRNN, ranks correctly predicted keyphrases higher and the overlapping phrases of keyphrases lower. For example, in sub-problem No.3, ParaNet_T+CoAtt improves the average rank of correctly predicted keyphrases from 6.50 to 5.95 within the top-10 predictions, and pushes the average rank of sub-phrases of keyphrases down from 2.08 to 2.41.

Cross-Domain Testing
CopyRNN and ParaNet are supervised methods, and are trained on a large-scale dataset in a specific scientific domain. Similar to Meng et al. (2017), we expect that our supervised method can learn universal language features that are also effective on other corpora. We thus test our method on a new type of text to see whether it still works when transferred to a different domain. We use the popular news article dataset DUC-2001 (Wan and Xiao, 2008) for our experiments, which consists of 308 news articles and 2,488 manually labeled keyphrases.
The results are shown in Table 7. From these results, we observe that our models generate a certain number of correct keyphrases in the new domain. Although the best model, ParaNet_T+CoAtt, falls behind the unsupervised algorithms TF-IDF and SingleRank, the worst model, ParaNet_L, significantly outperforms TextRank and CopyRNN. In addition, we note that the overlapping phrase problem also exists in the DUC dataset. In this experiment, ParaNet_T+CoAtt reduces the total proportion of keyphrases suffering from the overlapping phrase problem from 21.96% to 19.13%.

Influence of Weight Parameter
In this work, we propose a multi-task Seq2Seq network for keyphrase generation, which jointly learns the dominant task of predicting keyphrases and the auxiliary task of predicting the POS tags of keyphrases. We employ the weight parameter $\lambda$ (in Equ. 21) to tune the impacts of the two tasks. We conduct an experiment to illustrate the influence of $\lambda$ in ParaNet_L, which does not use coverage attention. The results are shown in Figure 2, which reports the F1 at top-10 predictions on six datasets. We observe that the performance of ParaNet_L is influenced by changes in $\lambda$. In general, the performance slowly increases and then slowly decreases on the six datasets as $\lambda$ grows. The best-performing settings are $\lambda = 0.5$ on the news dataset DUC and $\lambda = 0.3$ on the other five scientific datasets; these settings are used to balance the two prediction tasks in the comparison experiments.

Conclusion
In this study, we propose a parallel Seq2Seq network with coverage attention to alleviate the overlapping problem (including the sub-phrase and super-phrase problems) in existing keyphrase generation methods. In particular, we incorporate the linguistic constraints of keyphrases into the basic Seq2Seq network, and employ a multi-task learning framework to enhance generation performance. The experimental results show that the proposed method significantly outperforms the state-of-the-art CopyRNN on scientific datasets, and is also effective in the news domain.