Keyphrase Generation with Correlation Constraints

In this paper, we study automatic keyphrase generation. Although conventional approaches to this task show promising results, they neglect correlation among keyphrases, resulting in duplication and coverage issues. To solve these problems, we propose a new sequence-to-sequence architecture for keyphrase generation named CorrRNN, which captures correlation among multiple keyphrases in two ways. First, we employ a coverage vector to indicate whether each word in the source document has been summarized by previous phrases, which improves the coverage of keyphrases. Second, preceding phrases are taken into account to eliminate duplicate phrases and improve result coherence. Experimental results show that our model significantly outperforms the state-of-the-art method on benchmark datasets in terms of both accuracy and diversity.


Introduction
A keyphrase is a piece of text that is able to summarize a long document, organize contents and highlight important concepts, like "virtual organizations" in Table 1. It provides readers with a rough understanding of a document without going through its content, and has many potential applications, such as information retrieval, text summarization and document classification.
Keyphrases can be categorized into present keyphrases, which appear in the source document, and absent keyphrases, which do not appear in the document. Conventional approaches extract important text spans as candidate phrases and rank them as keyphrases (Hulth, 2003; Medelyan et al., 2008; Liu et al., 2011; Wu et al., 2015; Wang et al., 2016). They show promising results on present keyphrases but cannot handle absent keyphrases.

Table 1: Duplication and coverage issues of the state-of-the-art model. The phrases in red are duplicates, and the underlined parts of the source document are not covered by the predicted results, although they are summarized by "norm conflict" and "norm inconsistency" in the gold list.
Title: Resolving conflict and inconsistency in norm regulated virtual organizations.
Abstract: Norm governed virtual organizations define, govern and facilitate coordinated resource sharing and problem solving in societies of agents. With an explicit account of norms, openness in virtual organizations can be achieved new components, designed by various parties, can be seamlessly accommodated. We focus on virtual organizations realised as multi agent systems ...
Ground truth: virtual organizations; multi agent systems; agent; norm conflict; conflict prohibition; norm inconsistency; ...
Predicted keyphrases: virtual organizations; multi agent systems; artificial intelligence; inter agent; multi agent; action delegation; software agents; resource sharing; grid services; agent systems
To predict absent keyphrases, a generative approach has been proposed. It employs a sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014) with a copy mechanism (Gu et al., 2016) to encourage rare word generation: the encoder compresses the text into a dense vector, and the decoder generates a phrase with a Recurrent Neural Network (RNN) language model, achieving state-of-the-art performance. Since a document corresponds to multiple keyphrases, the approach divides it into multiple document-keyphrase pairs as training instances. This approach, however, neglects the correlation among target keyphrases, since it does not model the one-to-many relationship between the document and its keyphrases. Keyphrase prediction therefore depends only on the source document and ignores the keyphrases that have already been generated. As a consequence, the generated keyphrases suffer from a duplication issue and a coverage issue. A duplication issue arises when at least two phrases express the same meaning, which hinders readers from obtaining more information from the keyphrases. For example, three keyphrases in Table 1 have an identical meaning: "multi agent systems", "multi agent" and "agent systems". A coverage issue means that some key points in the document are not covered by the keyphrases, such as "norm conflict" and "norm inconsistency" in Table 1.
To mitigate such issues, we mimic how a human assigns keyphrases to an arbitrary document. Given the document in Table 1, an annotator will read it and generate keyphrases according to their understanding of the content, such as "virtual organizations" and "multi agent systems". After that, instead of generating duplicate phrases like "agent systems" and "multi agent", the annotator will review the document and the preceding keyphrases, then generate a phrase like "norm conflict" to cover topics that have not been summarized by previous phrases. The iteration does not stop until all of the document's topics are covered by keyphrases.
We propose a new sequence-to-sequence architecture, CorrRNN, capable of capturing correlation among keyphrases. Notably, the correlation constraints in this paper require that keyphrases cover all topics in the source document and differ from each other. Specifically, we employ a coverage mechanism (Tu et al., 2016) to memorize which parts of the source document have been covered by previous phrases. In this way, the document coverage is modeled explicitly, enabling the generated keyphrases to cover more topics. Furthermore, we propose a review mechanism that considers the previous keyphrases in the generation process, in order to avoid repetition in the final results. Concretely, the review mechanism explicitly models the correlation between the keyphrases that have been generated and the keyphrase that is going to be generated with a novel architecture. It extends the existing Seq2Seq model and captures the one-to-many relationship in keyphrase generation. Augmented with the coverage mechanism and the review mechanism, CorrRNN not only inherits the advantages of the Seq2Seq model, but also improves coverage and diversity in the generation process.
We test our model on three benchmark datasets. The results show that our model outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of the correlation constraints. In addition, our model is better than heuristic rules at improving diversity, since it instills the correlation knowledge into the model in an end-to-end fashion.
Our contributions in this paper are three-fold: (1) the proposal of modeling the one-to-many correlation for keyphrase generation, (2) the proposal of a new architecture CorrRNN for keyphrase generation, and (3) empirical verification of the effectiveness of CorrRNN on public datasets.
In the remainder of this paper, we first review related work in Section 2 and elaborate on the proposed model in Section 3. We then list the experiment settings in Section 4, followed by results and discussion in Section 5. Finally, Section 6 presents the conclusion and future work.

Related Work
How to assign keyphrases to a long document is a fundamental task that has been studied intensively in previous works. Existing methods can be categorized into two groups: extraction based and generation based methods.
The former group extracts important keyphrases from a document in two phases. The first phase constructs a set of phrase candidates with heuristic methods, such as extracting important n-grams (Hulth, 2003; Medelyan et al., 2008; Shang et al., 2017) and selecting text chunks with certain part-of-speech tags (Liu et al., 2011; Wang et al., 2016; Le et al., 2016; Liu et al., 2015). The second phase ranks the candidates with machine learning methods. Specifically, some researchers (Hulth, 2003; Medelyan et al., 2009; Gollapalli and Caragea, 2014) formulate keyphrase extraction as a supervised classification problem, while others apply unsupervised approaches (Mihalcea and Tarau, 2004; Grineva et al., 2009; Liu et al., 2009, 2010; Zhang et al., 2013; Bougouin et al., 2013, 2016) to this task. Besides, Tomokiyo and Hurst (2003) employ two statistical language models to measure the informativeness of phrases. Liu et al. (2011) use a word alignment model to learn translation probabilities between the words in documents and the words in keyphrases, which alleviates the problem of vocabulary gaps.
The latter group, generative methods, assigns keyphrases to a document with natural language generation techniques, and is capable of generating absent keyphrases. Owing to the development of neural networks (Bahdanau et al., 2014), recent work applies an encoder-decoder framework (Sutskever et al., 2014) with a copy mechanism (Gu et al., 2016) to this task, achieving state-of-the-art performance.
Our work is a generation based approach. The main difference of our model is that we consider the correlation among keyphrases. Our model proposes a new review mechanism to enhance keyphrase diversity, while employing a coverage mechanism that has proven effective for summarization (See et al., 2017) and machine translation (Tu et al., 2016) to improve keyphrase coverage. Some previous works on keyword extraction have already addressed the correlation problem with a re-ranking strategy (Habibi and Popescu-Belis, 2013; Ni et al., 2012). In contrast, we model the correlation in an end-to-end fashion.

Problem Formalization
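Suppose we have a dataset $D = \{(x^i, p^i)\}_{i=1}^{N}$, where $x^i$ is a source text, $p^i = \{p^{i,j}\}_{j=1}^{M_i}$ is the keyphrase set of $x^i$, and $N$ is the number of documents. Both the source text and the target keyphrase are word sequences, denoted as $x^i = (x^i_1, x^i_2, ..., x^i_T)$ and $p^{i,j} = (y^{i,j}_1, y^{i,j}_2, ..., y^{i,j}_{L_i})$ respectively, where $T$ and $L_i$ are the lengths of the word sequences of $x^i$ and $p^{i,j}$. Prior work aims to maximize the probability $\prod_{i=1}^{N} \prod_{j=1}^{M_i} P(p^{i,j} \mid x^i)$, while our model considers keyphrase correlation to address coverage and duplication issues by maximizing the probability $\prod_{i=1}^{N} \prod_{j=1}^{M_i} P(p^{i,j} \mid x^i, p^{i,1}, ..., p^{i,j-1})$.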

Seq2Seq Model with Copy Mechanism
A Seq2Seq model (Sutskever et al., 2014) is employed as the backbone in this paper. The encoder converts the variable-length input sequence $x = (x_1, x_2, ..., x_T)$ into a set of hidden representations $h = (h_1, h_2, ..., h_T)$ by iterating along time $t$ with the following equation:

$$h_t = f(x_t, h_{t-1}) \tag{1}$$

where $f$ is a non-linear function.
Then the context vector $c_t$ is computed as a weighted sum of the hidden representation set $h$ through an attention mechanism (Bahdanau et al., 2014), and acts as the representation of the whole input $x$ at time step $t$:

$$c_t = \sum_{j=1}^{T} \alpha_{tj} h_j \tag{2}$$

where $\alpha_{tj}$ is a coefficient which measures the degree of match between the inputs around position $j$ and the output at position $t$.
With the context vector $c_t$, the decoder generates a variable-length word sequence step by step, a generative process known as a language model:

$$s_t = f(y_{t-1}, s_{t-1}, c_t) \tag{3}$$

$$p_g(y_t \mid y_{<t}, x) = g(y_{t-1}, s_t, c_t) \tag{4}$$

where $s_t$ denotes the hidden state of the decoder at time $t$, and $y_t$ is the word from the vocabulary with the largest probability after $g(\cdot)$. Unfortunately, the pure generative mode cannot generate any keyphrase (e.g. noun, entity) which contains out-of-vocabulary words. Thus we incorporate a copy mechanism (Gu et al., 2016) into the encoder-decoder model to predict out-of-vocabulary words by selecting appropriate words from the source text. After incorporation, the probability of predicting a new word consists of two parts:

$$p(y_t \mid y_{<t}, x) = p_g(y_t \mid y_{<t}, x) + p_c(y_t \mid y_{<t}, x) \tag{5}$$

$$p_c(y_t \mid y_{<t}, x) = \frac{1}{Z} \sum_{j: x_j = y_t} \exp\left(\sigma(h_j^{\top} W_c)\, s_t\right), \quad y_t \in X \tag{6}$$

where $p_g$ and $p_c$ denote the probabilities of generating and copying, $X$ is the set of unique words in the source sequence $x$, $\sigma$ is a non-linear function, $W_c$ is a learned parameter matrix, and $Z$ is the sum for normalization. For more details, please see Gu et al. (2016).
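To make the combination of generating and copying concrete, the following is a minimal sketch, not the authors' implementation. It assumes that both modes share a single normalizer Z, as in CopyNet-style models, and that the unnormalized scores have already been computed by the network. The function name and its arguments are hypothetical.

```python
import math


def copy_augmented_distribution(gen_scores, copy_scores, source_tokens):
    """Merge generate-mode and copy-mode scores into one word distribution.

    gen_scores:    dict mapping each vocabulary word to an unnormalized score
    copy_scores:   list of unnormalized scores, one per source position
    source_tokens: list of source words aligned with copy_scores
    (All three inputs are illustrative assumptions, not names from the paper.)
    """
    # Shared normalizer Z over vocabulary words and source positions.
    z = sum(math.exp(s) for s in gen_scores.values())
    z += sum(math.exp(s) for s in copy_scores)

    # Generate-mode probability mass for in-vocabulary words.
    probs = {word: math.exp(s) / z for word, s in gen_scores.items()}

    # Copy-mode mass: summed over every source position holding the word,
    # so out-of-vocabulary source words also receive probability.
    for token, s in zip(source_tokens, copy_scores):
        probs[token] = probs.get(token, 0.0) + math.exp(s) / z
    return probs
```

In this sketch a word that appears only in the source text (for instance a rare technical term) still gets non-zero probability, which is the behavior the copy mechanism is meant to provide.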

Model Correlation
Keyphrases should cover as many topics as possible and differ from each other, yet previous work ignores this correlation among multiple keyphrases, resulting in duplication and coverage issues. In this part, we focus on capturing the one-to-many correlation to alleviate these issues. On the one hand, we employ a coverage mechanism (Tu et al., 2016) that diversifies attention distributions to improve the topic coverage of keyphrases. On the other hand, we propose a review mechanism which makes use of contextual information from previous (already generated) phrases to avoid duplicate generation. The overall framework is illustrated in Figure 1 and the detailed process is described in Algorithm 1.

Figure 1: The overall framework structure. Here $p_i$ denotes a keyphrase (e.g. $p_0$ = "neural network") and $s_i$ denotes the set of hidden states of phrase $p_i$. The coverage vector $C$ and the target-side review context $S$ are updated and passed along while decoding multiple keyphrases.

Coverage Mechanism
Multiple keyphrases usually correspond to different positions in the source text (see Figure 1). Because the attention mechanism automatically focuses on important areas of the source text, it tends to return to positions whose words have already been summarized, although these positions should not be attended to again. To overcome the coverage issue, we incorporate a coverage mechanism (Tu et al., 2016) into our model, which diversifies the attention distributions of multiple keyphrases so that more of the important areas in the source document are attended to and summarized into keyphrases.
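Concretely, we maintain a coverage vector $c^t$, which is the sum of the attention distributions over all previous decoder time steps:

$$c^t = \sum_{t'=0}^{t-1} \alpha_{t'} \tag{7}$$

where $\alpha_{t'}$ is the attention distribution over the source positions at step $t'$. Intuitively, $c^t$ represents the degree of coverage that the words in the source text have received from the attention mechanism so far.
Note that $c^0$ is a zero vector, since no word in the source text has been covered at the start. The coverage vector $c^t$ then serves as an extra input to the attention mechanism: the source context set $h$ is read and combined as a weighted average into a contextual representation $c^E_t$ by the attention mechanism with the coverage vector, with Eqn.(2) transforming into Eqn.(8) as follows:

$$c^E_t = \sum_{j=1}^{T} \alpha_{tj} h_j, \quad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}, \quad e_{tj} = v^{\top} \tanh(W_h h_j + W_s s_{t-1} + w_c c^t_j + b) \tag{8}$$

where $E$ denotes the encoder, $c^t_j$ is the coverage of source position $j$, $W_h$, $W_s$ and $b$ are learned attention parameters, and $w_c$ is a learned parameter with the same length as $v$.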
With the coverage vector, the attention mechanism's decision about where in the source text to focus next is informed by a reminder of its previous decisions. This makes it easier for the attention mechanism to avoid repeatedly attending to the same locations in the source text, so the generated phrases cover more topics in the source document.
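The interaction between the coverage vector and attention can be illustrated with a small sketch. It assumes an additive scoring function of the form v · tanh(W_h h_j + W_s s + w_c c_j), takes the two projections as precomputed inputs, and uses hypothetical argument names; it is an illustration of the idea rather than the released implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coverage_attention(enc_feats, dec_feat, coverage, w_c, v):
    """One attention step that is informed by accumulated coverage.

    enc_feats: enc_feats[j] is the projected encoder state for position j
    dec_feat:  projected previous decoder state
    coverage:  coverage[j] is the attention mass position j has received so far
    w_c, v:    learned vectors of the same length (values here are made up)
    Returns the new attention distribution and the updated coverage vector.
    """
    scores = []
    for h_j, c_j in zip(enc_feats, coverage):
        hidden = [math.tanh(h + s + w * c_j)
                  for h, s, w in zip(h_j, dec_feat, w_c)]
        scores.append(sum(v_d * x for v_d, x in zip(v, hidden)))
    alpha = softmax(scores)
    # Coverage is the running sum of attention distributions.
    new_coverage = [c + a for c, a in zip(coverage, alpha)]
    return alpha, new_coverage
```

With a negative w_c (chosen by hand here; in the model it is learned), a position that has already received attention gets a lower score at the next step, which pushes attention toward uncovered parts of the source.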
Algorithm 1 Training procedure of the proposed model.

Review Mechanism
Human annotators review previously assigned phrases to avoid duplicate assignment. Following this behavior, we construct a target-side review context set which contains contextual information of the generated phrases. Through an attention mechanism, the target context makes use of this contextual information to help predict the next phrase, which we call the review mechanism.
Like the source context $c^E_t$ described above, on the target side the target context is defined as $S_t = \{s_1, s_2, ..., s_{t-1}\}$, which is the collection of decoder hidden states before time step $t$. When decoding the word at the $t$-th step, $S_t$ is used to form an extra contextual representation, so that target-side attentive contexts are integrated into $c^D_t$:

$$c^D_t = \sum_{j=1}^{t-1} \beta_{tj} s_j \tag{9}$$

where $\beta_{tj}$ are attention weights over the states in $S_t$. Afterwards, $c^D_t$ is provided as an extra input to derive the hidden state $s_t$ and, later, the probability distribution for choosing the $t$-th word:

$$s_t = f(y_{t-1}, s_{t-1}, c^E_t, c^D_t), \quad p_g(y_t \mid y_{<t}, x) = g(y_{t-1}, s_t, c^E_t, c^D_t) \tag{10}$$

The target context is then updated as $S_{t+1} = S_t \cup \{s_t\}$ as decoding progresses.
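As a rough illustration of the review mechanism, the sketch below attends over the hidden states of words that have already been generated and returns a target-side context vector. For brevity it scores states with a plain dot product against a query vector, which is an assumption of this sketch and not necessarily the scoring function of the model; the function name and arguments are hypothetical.

```python
import math


def review_attention(prev_states, query):
    """Attend over previously produced decoder states (the review context).

    prev_states: list of hidden-state vectors generated so far
    query:       vector used to score those states (assumed: dot product)
    Returns the attention-weighted average of prev_states, or a zero vector
    when nothing has been generated yet.
    """
    dim = len(query)
    if not prev_states:
        return [0.0] * dim
    scores = [sum(q * s for q, s in zip(query, state)) for state in prev_states]
    peak = max(scores)
    weights = [math.exp(x - peak) for x in scores]
    total = sum(weights)
    return [sum(w / total * state[d] for w, state in zip(weights, prev_states))
            for d in range(dim)]


# The review context grows as decoding proceeds across phrases.
review_set = []
for state in ([1.0, 0.0], [0.0, 1.0]):
    context = review_attention(review_set, state)  # extra decoder input
    review_set.append(state)                        # update after each step
```

Because the review set is carried over from one phrase to the next instead of being reset, later phrases are decoded with access to what earlier phrases already expressed.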
With the contextual information of previous phrases, the review mechanism makes the next predicted phrase less likely to duplicate earlier ones and keeps the results topically coherent. In summary, we pass on and update the coverage vector and the review context along the multi-target phrase decoding process to improve the coverage and diversity of keyphrases. We denote our model with coverage only and with review only as CorrRNN_C and CorrRNN_R, and empirically compare them in the experiments. The objective is to minimize the negative log-likelihood of the target words. Given a data sample with source text $x$ and corresponding phrase set $p = \{p_i\}_{i=0}^{M}$, the loss is calculated as follows:

$$\mathcal{L} = -\sum_{i=0}^{M} \sum_{t=1}^{L_i} \log p(y^i_t \mid y^i_{<t}, x, p_{<i})$$

Experiment Settings

Implementation Details
In the preprocessing phase, we follow prior work and preprocess the text with tokenization, lowercasing, and digit replacement to ensure a fair comparison. Each article consists of one source text and multiple corresponding keyphrases, and the source text is the concatenation of its title and abstract. We set the maximum number of target phrases for an article to 10 in consideration of device memory, so articles with more than 10 target phrases are cut into multiple articles. Finally, we have 558,830 articles (text-keyphrases pairs) for training.
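The preprocessing steps above can be sketched as follows. The regular-expression tokenizer and the `<digit>` placeholder are assumptions made for illustration, since the exact tokenizer and replacement symbol are not specified here; only the limit of 10 target phrases comes from the text.

```python
import re

DIGIT_TOKEN = "<digit>"  # assumed placeholder symbol for numbers
MAX_PHRASES = 10         # maximum number of target phrases per article


def preprocess(text):
    """Lowercase, split into word and punctuation tokens, and mask numbers."""
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())
    return [DIGIT_TOKEN if tok.isdigit() else tok for tok in tokens]


def build_articles(title, abstract, keyphrases, max_phrases=MAX_PHRASES):
    """Concatenate title and abstract, then cut long phrase lists into chunks.

    Returns a list of (source_tokens, target_phrases) pairs, each holding at
    most max_phrases target phrases for the same source text.
    """
    source = preprocess(title + " " + abstract)
    targets = [preprocess(phrase) for phrase in keyphrases]
    return [(source, targets[i:i + max_phrases])
            for i in range(0, len(targets), max_phrases)]
```

For example, an article with 23 keyphrases would yield three training articles holding 10, 10 and 3 phrases, all sharing the same source tokens.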
In the training phase, we choose a bidirectional GRU for the encoder and a forward GRU for the decoder. The 50,000 most frequent words are chosen as the vocabulary, the dimension of the word embeddings is set to 150, the embeddings are randomly initialized from a uniform distribution over [-0.1, 0.1], and the dimension of the hidden layers is set to 300. Adam is adopted to optimize the model with an initial learning rate of $10^{-4}$, gradient clipping of 0.1 and a dropout rate of 0.5. Training is stopped once the loss on the validation set stops dropping for several iterations.
In the generation phase, we use beam search to generate multiple phrases. The beam depth is set to 6 and the beam size is set to 200. Source code will be released at https://github.com/nanfeng1101/s2s-kg.
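For readers unfamiliar with beam search, a generic sketch is given below. It is not the released code: `step_fn` is a hypothetical stand-in for the decoder that maps a prefix to next-token probabilities, the end-of-phrase token is assumed to be `</s>`, and "beam depth" is interpreted as the maximum number of decoding steps.

```python
import math


def beam_search(step_fn, start, beam_size, max_depth, end_token="</s>"):
    """Return finished sequences sorted by log-probability (best first).

    step_fn(prefix) -> dict mapping each candidate next token to a probability.
    Sequences that emit end_token are moved to the finished list; the rest
    stay on the beam until max_depth steps have been taken.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_depth):
        candidates = []
        for logp, seq in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((logp + math.log(prob), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_size]:
            if seq[-1] == end_token:
                finished.append((logp, seq))
            else:
                beams.append((logp, seq))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)


def toy_step(prefix):
    """A tiny hand-written next-token table used only for demonstration."""
    table = {
        "<s>": {"multi": 0.6, "norm": 0.4},
        "multi": {"agent": 0.9, "</s>": 0.1},
        "agent": {"</s>": 1.0},
        "norm": {"conflict": 0.5, "</s>": 0.5},
        "conflict": {"</s>": 1.0},
    }
    return table[prefix[-1]]


results = beam_search(toy_step, "<s>", beam_size=10, max_depth=6)
```

A large beam size such as 200 keeps many partial phrases alive at once, which is what allows a single decoding pass to return a long ranked list of candidate keyphrases.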

Datasets
Following prior work, we train our model on the KP20k dataset, which contains articles collected from various online digital libraries. The dataset has 527,830 articles for training and 20,000 articles for validation. We evaluate our model on three benchmark datasets which are widely adopted in previous works.
To demonstrate the effectiveness of end-to-end learning, we compare CorrRNN to CopyRNN with post-processing. In post-processing, we keep only the first appearance of a keyphrase among duplicate predictions, where duplication means that one phrase is a substring of another. This baseline can be seen as a set of heuristic rules for improving diversity, and is denoted as CopyRNN_F.
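The post-processing rule described above can be sketched in a few lines. The sketch assumes a character-level substring test on the phrase strings; whether the original baseline compares characters or tokens is not stated, so this choice and the function name are assumptions.

```python
def drop_substring_duplicates(phrases):
    """Keep the first phrase of every duplicate group, in ranked order.

    Two phrases count as duplicates when one is a substring of the other.
    """
    kept = []
    for phrase in phrases:
        if any(phrase in earlier or earlier in phrase for earlier in kept):
            continue  # a duplicate of something ranked higher
        kept.append(phrase)
    return kept


predicted = [
    "virtual organizations", "multi agent systems", "artificial intelligence",
    "inter agent", "multi agent", "action delegation", "software agents",
    "resource sharing", "grid services", "agent systems",
]
filtered = drop_substring_duplicates(predicted)
```

On the predictions from Table 1 this removes "multi agent" and "agent systems", both of which are substrings of the higher-ranked "multi agent systems". Note that the rule only deletes phrases and never adds new ones.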

Evaluation Metrics
For a fair comparison, we evaluate the experiment results on present keyphrases and absent keyphrases separately, because extractive methods cannot generate absent keyphrases. Following prior work, we employ the F1-measure for present keyphrases and recall for absent keyphrases. Here, we use F1@K and R@K to denote the F1 and recall scores of the top K keyphrases. Note that we apply the Porter Stemmer before determining whether two keyphrases are identical.
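A minimal sketch of F1@K and R@K follows. It assumes that precision is computed over K predictions and that phrases are compared after a normalization function; the default here only lowercases, and a stemming function (for example one built on a Porter stemmer) could be passed in instead. The function names are hypothetical.

```python
def _count_matches(predicted, gold, k, normalize):
    """Count normalized top-k predictions that appear in the gold set."""
    top = {normalize(p) for p in predicted[:k]}
    gold_set = {normalize(g) for g in gold}
    return len(top & gold_set), len(gold_set)


def recall_at_k(predicted, gold, k, normalize=str.lower):
    """Fraction of gold phrases found among the top-k predictions."""
    correct, n_gold = _count_matches(predicted, gold, k, normalize)
    return correct / n_gold if n_gold else 0.0


def f1_at_k(predicted, gold, k, normalize=str.lower):
    """Harmonic mean of precision@k and recall@k (0.0 when nothing matches)."""
    correct, n_gold = _count_matches(predicted, gold, k, normalize)
    if correct == 0:
        return 0.0
    precision = correct / k      # assumption: denominator is K
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)
```

For instance, with 2 correct phrases among the top 5 predictions and 4 gold phrases, precision is 0.4, recall is 0.5, and F1@5 is 4/9.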
Furthermore, α-NDCG, which is widely used to measure diversity in keyphrase generation (Habibi and Popescu-Belis, 2013) and information retrieval (Clarke et al., 2008), is adopted to evaluate the diversity of the generative methods, denoted as N@K. α is a trade-off between relevance and diversity in α-NDCG, and is set to 0.5 (equal weights) according to Habibi and Popescu-Belis (2013). The higher α-NDCG is, the more diverse the results are. We re-implement CopyRNN with the source code provided by the authors in order to evaluate it on the α-NDCG metric.

$$\alpha\text{-NDCG}[k] = \frac{\alpha\text{-DCG}[k]}{\alpha\text{-IDCG}[k]}, \quad \alpha\text{-DCG}[k] = \sum_{j=1}^{k} \frac{\sum_{i=1}^{m} J(d_j, i)\,(1-\alpha)^{r_{i,j-1}}}{\log_2(1+j)}$$

where α is a parameter, $m$ denotes the number of target phrases, $k$ denotes the number of predicted phrases, and α-IDCG[k] is the α-DCG[k] of an ideal ranking. $J(d_k, i)$ = 0 or 1 indicates whether the $k$-th predicted phrase is relevant to the $i$-th target phrase, and $r_{i,k-1}$ indicates how many predicted phrases before the $k$-th one are relevant to the $i$-th target phrase. Note that relevance here is defined as whether the word set of one keyphrase is a subset of that of another keyphrase (e.g. "multi agent" vs "multi agent system").

1 https://github.com/adrien-bougouin/KeyBench
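To show how the redundancy discount behaves, the sketch below computes the unnormalized α-DCG of a ranked list, following the definition of Clarke et al. (2008) and the word-set relevance described above. Normalization by the score of an ideal ranking is deliberately left out, because how that ideal ranking is constructed is not specified here; function names are hypothetical.

```python
import math


def is_relevant(predicted, target):
    """Relevance test: the word set of one phrase is a subset of the other's."""
    a, b = set(predicted.split()), set(target.split())
    return a <= b or b <= a


def alpha_dcg(ranking, targets, alpha=0.5):
    """Unnormalized alpha-DCG of a ranked list of predicted phrases.

    Each time a target phrase is matched again, the gain contributed by that
    target is multiplied by (1 - alpha), so repeated hits are worth less.
    """
    hits = [0] * len(targets)  # relevant predictions seen so far, per target
    score = 0.0
    for rank, phrase in enumerate(ranking, start=1):
        gain = 0.0
        for i, target in enumerate(targets):
            if is_relevant(phrase, target):
                gain += (1 - alpha) ** hits[i]
                hits[i] += 1
        score += gain / math.log2(rank + 1)
    return score


targets = ["multi agent systems", "norm conflict"]
redundant = alpha_dcg(["multi agent systems", "multi agent", "norm conflict"], targets)
diverse = alpha_dcg(["multi agent systems", "norm conflict", "multi agent"], targets)
```

In this toy case the ranking that places "norm conflict" before the near-duplicate "multi agent" scores higher, since the second hit on the same target is discounted by one half.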

Present Phrase Prediction
Present phrase prediction is also known as keyphrase extraction in prior studies. We evaluate how well our model performs on this common task. The results are shown in Tables 2 and 3, which list the performance of the top 5 and top 10 results.
In terms of F1-measure, CorrRNN and CopyRNN outperform the other baselines by a large margin, indicating the effectiveness of an RNN with a copy mechanism. As we consider the correlation among multiple phrases, the overall results of CorrRNN are significantly better than those of CopyRNN (t-test with p < 0.01). This is mainly because CorrRNN alleviates the duplication and coverage issues of existing methods, so that more correct phrases are promoted into the top 10 results. The heuristic baseline CopyRNN_F is even worse than CopyRNN, indicating that the heuristic rules may hurt the performance of generative approaches. It also shows that modeling the correlation among keyphrases in an end-to-end fashion is the better approach.
Regarding α-NDCG, CorrRNN and its variants surpass the other methods, demonstrating that incorporating correlation constraints can improve both relevance and diversity. As the heuristic rules affect the relevance of CopyRNN, CopyRNN_F performs only a little better than CopyRNN on α-NDCG.

Absent Phrase Prediction
We evaluate the performance of the generative methods by the recall of the top 10 results, which is shown in Table 4. We can see that both CopyRNN and CorrRNN outperform RNN, although the improvement is not as large as in present phrase prediction. This indicates that the copy mechanism is very helpful for predicting absent phrases. We can also see that CopyRNN and CorrRNN are comparable in terms of recall, but CorrRNN is better on diversity, suggesting that our model can address the duplication issue in keyphrase generation.

Generalization Ability
As described above, CorrRNN performs well on scientific publications. In this part, we conduct experiments on the news domain to see whether the proposed model works when transferred to a different domain with unfamiliar texts. We adopt the popular news article dataset DUC-2001 (Wan and Xiao, 2008) for our experiments. The dataset contains 308 news articles with 2488 manually assigned keyphrases, and each article consists of 740 words on average, which is completely different from the datasets we used above (see Table 5). We directly apply CorrRNN, which is trained on scientific publications, to predict phrases for news articles without any adaptive adjustment. Results reported by Hasan and Ng (2010) and in prior work, together with our own, are shown in Table 6, from which we can see that the proposed model CorrRNN can extract a considerable portion of keyphrases correctly from unfamiliar texts. It outperforms TextRank (Mihalcea and Tarau, 2004), KeyCluster (Liu et al., 2009), TopicRank (Bougouin et al., 2013) and CopyRNN, but it falls behind the other three baselines because the test domain changes. The model should perform better if it is trained on a news dataset.
When transferring to the news domain, the vocabulary changes a lot, more unknown words occur, and the correlation may no longer be applicable. Nevertheless, the model can still capture positional and syntactic features within the text to predict phrases despite the different text type and length. The experiment verifies the generalization ability of our model, and we therefore have good reason to believe that our model has great potential to be generalized to more domains after sufficient training.

Figure 2: Visualization, deeper shading denotes higher value. Note that yellow shading and green shading indicate coverage vector and review attention respectively.
Source text: a support vector method for optimizing average precision . machine learning is commonly used to improve ranked retrieval systems . due to computational difficulties , few learning techniques have been developed to directly optimize for mean average precision ( map ) , despite its widespread use in evaluating such systems . existing approaches optimizing map either do not find a globally optimal solution , or are computationally expensive . in contrast , we present a general svm learning algorithm that efficiently finds a globally optimal solution to a straightforward relaxation of map .
Keyphrases: ... 3. support vector machines 4. svm 5. ranked retrieval systems

Model Ablation
We investigate the effect of the coverage mechanism and the review mechanism in our model with CorrRNN_C and CorrRNN_R respectively, as shown in Tables 2, 3 and 4. It is clear that both the coverage mechanism and the review mechanism are helpful for improving the coverage and diversity of predicted phrases. We note some inconsistency among the ablated models in our experiments. First, no ablated model achieves the best performance on all of the test datasets: the full model CorrRNN performs better on present phrase prediction, while CorrRNN_C seems better than the others on absent phrase prediction. As present phrases are the majority, the full model CorrRNN can achieve the best overall performance in actual use. Second, the proposed models perform better on the NUS and SemEval datasets than on Krapivin. This may be due to differences in assignment quality among the test datasets, with keyphrase assignments of higher coverage and higher diversity benefiting more from our models.

Visualization
In Figure 2, we visualize the coverage vector and the review attention with an example to further clarify how our model works. Due to space limitations, we only visualize the top 5 phrases in the example, which are already enough to support our analysis. For the coverage vector, we can see that the source attention shifts as the coverage vector changes. At first, only words relevant to "machine learning" are attended to. After that, the coverage vector informs the attention mechanism to attend to other positions instead of repeating its attention, which promotes the generation of later phrases like "average precision". After the last phrase has been generated, it is clear that the coverage vector covers essentially all topics of the source document, including "machine learning", "mean average precision", "svm" and "ranked retrieval systems". As for the review attention, it is clear that the contextual information of all previous phrases is attended to when generating the last phrase "ranked retrieval systems", which supports the claim that the review mechanism helps to alleviate duplication and ensure the coherence of results.

Comparison with Heuristic Rules
We design the baseline CopyRNN_F with post-processing to explore whether heuristic rules can alleviate duplication and coverage issues. It is clear from Tables 2, 3 and 4 that the experiment results are negative. Compared to our model CorrRNN, heuristic rules cannot address duplication and coverage issues fundamentally. We offer two explanations for these observations. Firstly, heuristic rules can only handle phrases which have already been generated in the results, and therefore do nothing to help phrases cover more topics in the source text. Secondly, although duplication can be forcibly reduced by heuristic rules, the remaining phrases are not guaranteed to be correct, which hurts accuracy badly.

Complexity Analysis
According to our observations during the experiments, the CopyRNN model has 78,835,750 network parameters, while CorrRNN has 94,886,750, that is, 16,051,000 (about 20%) more. The coverage mechanism accounts for a few of the additional parameters and the review mechanism for most of them. However, by considering the one-to-many relationship and the correlation constraints among keyphrases, our model CorrRNN not only achieves better performance but also converges faster than CopyRNN.

Case Study
As shown in Table 7, we compare the phrases generated by CorrRNN and CopyRNN on an example article. Compared to CopyRNN, CorrRNN generates one more correct present phrase, "voip", and one more correct absent phrase, "real time audio", which cover two important topics, while CopyRNN loses these key points. Moreover, four "conferencing (noun)" phrases are generated by CopyRNN, including "distributed conferencing", "internet conferencing", "conference conferencing" and "packet conferencing", which hinders readers from obtaining more information, while CorrRNN only has two.

Table 7
Title: Deployment issues of a voip conferencing system in a virtual conferencing environment.
Abstract: Real time services have been supported by and large on circuit switched networks. Recent trends favour services ported on packet switched networks. For audio conferencing, we need to consider many issues scalability, quality of the conference application, floor control and load on the clients servers to name a few. In this paper, we describe an audio service framework designed to provide a virtual conferencing environment (vce). The system is designed to accommodate a large number of end users speaking at the same time and spread across the internet. The framework is based on conference servers DIGIT , which facilitate the audio handling, while we exploit the sip capabilities for signaling purposes. Client selection is based on a recent quantifier called loudness number that helps mimic a physical face to face conference. We deal with deployment issues of the proposed solution both in terms of scalability and interactivity, while explaining the techniques we use to reduce the traffic. We have implemented a conference server (cs) application on a campus wide network at our institute.
Present Phrase, CopyRNN: deployment; virtual conferencing; real time; distributed systems; virtual conferencing environment; client server; conference server; distributed applications; audio conferencing; floor control
Present Phrase, CorrRNN: voip; virtual conferencing; voip conferencing; audio conferencing; audio service; real time services; real time; distributed systems; conference server; virtual conferencing environment
Absent Phrase, CopyRNN: quality of service; distributed conferencing; virtual environments; internet conferencing; conference conferencing; load balancing; packet conferencing; real time systems; distributed computing; virtual server
Absent Phrase, CorrRNN: real time systems; real time voip; voip service; real time audio; wireless networks; conference conferencing; real time communications; packet conferencing; audio communication; quality of service

Conclusion and Future Work
In this paper, we propose a new Seq2Seq architecture that models correlation among multiple keyphrases in an end-to-end fashion by incorporating a coverage mechanism and a review mechanism. Comprehensive empirical studies demonstrate that our model can alleviate duplication and coverage issues effectively and improve diversity and coverage for keyphrase generation. To the best of our knowledge, this is the first use of an encoder-decoder model for keyphrase generation in a one-to-many way. Our future work will focus on two areas: investigation of multi-document keyphrase generation, and incorporation of structure or syntax information in keyphrase generation.