One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases

Different texts naturally correspond to different numbers of keyphrases. This desideratum is largely missing from existing neural keyphrase generation models. In this study, we address the problem from both the modeling and the evaluation perspectives. We first propose a recurrent generative model that generates multiple keyphrases as delimiter-separated sequences. Generation diversity is further enhanced with two novel techniques that manipulate decoder hidden states. In contrast to previous approaches, our model is capable of generating diverse keyphrases and controlling the number of outputs. We further propose two evaluation metrics tailored towards variable-number generation. We also introduce a new dataset, StackEx, that expands beyond the only existing genre (i.e., academic writing) in keyphrase generation tasks. With both previous and new evaluation metrics, our model outperforms strong baselines on all datasets.


Introduction
Keyphrase generation is the task of automatically predicting keyphrases given a source text. Desired keyphrases are often multi-word units that summarize the high-level meaning and highlight certain important topics or information of the source text. Consequently, models that can successfully perform this task should be capable of not only distilling high-level information from a document, but also locating specific, important snippets therein.
To make the problem even more challenging, a keyphrase may or may not be a substring of the source text (i.e., it may be present or absent). Moreover, a given source text is usually associated with a set of multiple keyphrases. Thus, keyphrase generation is an instance of the set generation problem, where both the size of the set and the size (i.e., the number of tokens in a phrase) of each element can vary depending on the source. Similar to summarization, keyphrase generation is formulated as a sequence-to-sequence (Seq2Seq) generation task in most prior studies (Meng et al., 2017; Chen et al., 2018a; Ye and Wang, 2018; Chen et al., 2018b). Conditioned on a source text, Seq2Seq models generate phrases individually or as a longer sequence joined by delimiting tokens. Since standard Seq2Seq models generate only one sequence at a time, a common approach to producing multiple phrases is to over-generate using beam search (Reddy et al., 1977) with a large beam width. Models are then evaluated by taking a fixed number of top predicted phrases (typically 5 or 10) and comparing them against the ground truth keyphrases.

* These authors contributed equally. The order is determined by a fidget spinner.
Though this approach has achieved good empirical results, we argue that it suffers from two major limitations. Firstly, models that use beam search to generate multiple keyphrases generally lack the ability to determine the appropriate number of keyphrases for different source texts. Meanwhile, the parallelism of beam search fails to model the inter-relations among the generated phrases, which often results in diminished diversity in the output. Although certain existing models take output diversity into consideration during training (Chen et al., 2018a; Ye and Wang, 2018), the effort is significantly undermined during decoding due to the reliance on over-generation and phrase ranking with beam search.
Secondly, the current evaluation setup is rather problematic, since existing studies attempt to match a fixed number of outputs against a variable number of ground truth keyphrases. Empirically, the number of keyphrases can vary drastically across source texts, depending on a plethora of factors including the length or genre of the text, the granularity of keyphrase annotation, etc. For the several commonly used keyphrase generation datasets, for example, the average number of keyphrases per data point ranges from 5.3 to 15.7, with variances sometimes as large as 64.6 (Table 1). Therefore, using an arbitrary, fixed number k to evaluate entire datasets is not appropriate. In fact, under this evaluation setup, the F1 score for the oracle model on the KP20K dataset is 0.858 for k = 5 and 0.626 for k = 10, which poses serious normalization issues for these evaluation metrics.
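To make the normalization issue concrete, the following sketch (our own illustration, not from the paper's evaluation code) computes F1@k for a hypothetical oracle model whose predictions exactly equal the ground truth. With 10 ground-truth phrases, F1@5 is capped at roughly 0.667 even though every prediction is correct:

```python
def f1_at_k(predicted, ground_truth, k):
    """Precision/recall/F1 over the top-k predictions (minimal sketch)."""
    top_k = predicted[:k]
    matches = len(set(top_k) & set(ground_truth))
    precision = matches / len(top_k) if top_k else 0.0
    recall = matches / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A hypothetical oracle predicting exactly the 10 ground-truth keyphrases
truth = [f"phrase_{i}" for i in range(10)]
oracle = list(truth)
print(f1_at_k(oracle, truth, k=5))   # P@5 = 1.0, R@5 = 0.5, F1@5 ≈ 0.667
print(f1_at_k(oracle, truth, k=10))  # 1.0
```

Only when k happens to equal the number of ground-truth phrases can the oracle reach an F1 of 1.0, which motivates the variable-k metrics proposed below.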
To overcome these problems, we propose novel decoding strategies and evaluation metrics for the keyphrase generation task. The main contributions of this work are as follows: 1. We propose a Seq2Seq-based keyphrase generation model capable of generating diverse keyphrases and controlling the number of outputs.
2. We propose new metrics based on the commonly used F1 score that assume variable-size model outputs, which results in improved empirical characteristics over previous metrics based on a fixed k.
3. An additional contribution of our study is the introduction of a new dataset for keyphrase generation: with its marked difference in genre, we expect the dataset to bring added heterogeneity to keyphrase generation evaluation.
2 Related Work

Keyphrase Extraction and Generation
Traditional keyphrase extraction has been studied extensively in past decades. In most existing literature, keyphrase extraction is formulated as a two-step process. First, lexical features such as part-of-speech tags are used to determine a list of phrase candidates by heuristic methods (Witten et al., 1999; Liu et al., 2011; Wang et al., 2016; Yang et al., 2017). Second, a ranking algorithm is adopted to rank the candidate list, and the top-ranked candidates are selected as keyphrases. A wide variety of methods have been applied for ranking, such as bagged decision trees (Medelyan et al., 2009; Lopez and Romary, 2010), Multi-Layer Perceptrons and Support Vector Machines (Lopez and Romary, 2010), and PageRank (Mihalcea and Tarau, 2004; Le et al., 2016; Wan and Xiao, 2008). Recently, Zhang et al. (2016), Luan et al. (2017), and Gollapalli et al. (2017) used sequence labeling models to extract keyphrases from text; Subramanian et al. (2017) used Pointer Networks to point to the start and end positions of keyphrases in a source text; Sun et al. (2019) leveraged graph neural networks to extract keyphrases. The main drawback of keyphrase extraction is that keyphrases are sometimes absent from the source text, so an extractive model will fail to predict those keyphrases. Meng et al. (2017) first proposed CopyRNN, a neural model that both generates words from a vocabulary and points to words in the source text. Based on the CopyRNN architecture, Chen et al. (2018a) and Zhao and Zhang (2019) leveraged attention to help reduce duplication and improve coverage. Ye and Wang (2018) proposed semi-supervised methods that leverage both labeled and unlabeled data for training. Chen et al. (2018b) and Ye and Wang (2018) proposed using structural information (e.g., the title of the source text) to improve keyphrase generation performance. Chan et al. (2019) introduced reinforcement learning to the keyphrase generation task. Chen et al. (2019a) retrieved similar documents from the training data to help produce more accurate keyphrases.

Sequence to Sequence Generation
Sequence to Sequence (Seq2Seq) learning was first introduced by Sutskever et al. (2014); together with the soft attention mechanism (Bahdanau et al., 2015), it has been widely used in natural language generation tasks. Gülçehre et al. (2016) and Gu et al. (2016) used a mixture of generation and pointing to overcome the problem of large vocabulary size. Paulus et al. (2017) and Zhou et al. (2017) applied Seq2Seq models to summary generation tasks, while Du et al. (2017) and Yuan et al. (2017) generated questions conditioned on documents and answers from machine comprehension datasets. Seq2Seq has also been applied to neural sentence simplification (Zhang and Lapata, 2017) and paraphrase generation (Xu et al., 2018).

Model Architecture
Given a piece of source text, our objective is to generate a variable number of multi-word phrases. To this end, we opt for the sequence-to-sequence (Seq2Seq) framework as the basis of our model, combined with attention and pointer softmax mechanisms in the decoder.
Since each data example contains one source text sequence and multiple target phrase sequences (dubbed One2Many; each target sequence can itself consist of multiple words), two paradigms can be adopted for training Seq2Seq models. The first (Meng et al., 2017) divides each One2Many data example into multiple One2One examples; the resulting models (e.g., CopyRNN) generate one phrase at a time and must rely on beam search to produce more unique phrases.
To enable models to generate multiple phrases and control the number of outputs, we propose the second training paradigm, One2Seq, in which we concatenate multiple phrases into a single sequence with a delimiter token ⟨sep⟩; this concatenated sequence is then used as the target for sequence generation during training. An overview of the model's structure is shown in Figure 1.

Notations
In the following subsections, we use w to denote input text tokens, x to denote token embeddings, h to denote hidden states, and y to denote output text tokens. Superscripts denote time-steps in a sequence, and subscripts e and d indicate whether a variable resides in the encoder or the decoder, respectively. The absence of a superscript indicates multiplicity in the time dimension. L refers to a linear transformation and L^f refers to one followed by a non-linear activation function f. Angle brackets ⟨·, ·⟩ denote concatenation.

Sequence to Sequence Generation
We develop our model based on the standard Seq2Seq model (Sutskever et al., 2014) with the attention mechanism (Bahdanau et al., 2015) and pointer softmax (Gülçehre et al., 2016). Due to space limits, we describe this basic Seq2Seq model in Appendix A.

Mechanisms for Diverse Generation
There are usually multiple keyphrases for a given source text because each keyphrase represents certain aspects of the text. Keyphrase diversity is therefore desirable in keyphrase generation. Most previous keyphrase generation models produce multiple phrases by over-generation, which is highly prone to yielding similar phrases due to the nature of beam search. Given our objective of generating a variable number of keyphrases, we need to adopt new strategies for achieving better diversity in the output.
Recall that we represent variable numbers of keyphrases as delimiter-separated sequences. One particular issue we observed during error analysis is that the model tends to produce identical tokens following the delimiter token. For example, suppose a target sequence contains n delimiter tokens at time-steps t_1, ..., t_n. During training, the model is rewarded for generating the same delimiter token at these time-steps, which presumably introduces much homogeneity in the corresponding decoder states h_d^{t_1}, ..., h_d^{t_n}. When these states are subsequently used as inputs at the time-steps immediately following the delimiter, the decoder naturally produces highly similar distributions over the following tokens, resulting in identical tokens being decoded. To alleviate this problem, we propose two plug-in components for the sequential generation model.

Semantic Coverage
We propose a mechanism called semantic coverage that focuses on the semantic representations of generated phrases. Specifically, we introduce another uni-directional recurrent model GRU_SC (dubbed target encoder) which encodes decoder-generated tokens y^τ, where τ ∈ [0, t), into hidden states h_SC^t. This state is then taken as an extra input to the decoder GRU, modifying the decoder GRU to:

h_d^t = GRU_d(⟨x_d^t, h_SC^t⟩, h_d^{t−1}).

If the target encoder were updated with the training signal from generation (i.e., back-propagating error from the decoder GRU into the target encoder), the resulting decoder would essentially be a 2-layer GRU with residual connections. Instead, inspired by previous representation learning work (Logeswaran and Lee, 2018; van den Oord et al., 2018; Hjelm et al., 2018), we train the target encoder in a self-supervised fashion (Figure 1). Specifically, due to the autoregressive nature of the RNN-based decoder, we follow Contrastive Predictive Coding (CPC) (van den Oord et al., 2018), where a Noise-Contrastive Estimation (NCE) loss is used to maximize a lower bound on mutual information. That is, we extract the target encoder's final hidden state vector h_SC^M, where M is the length of the target sequence, and use it as a general representation of the target phrases. We train by maximizing the mutual information between these phrase representations and the final state of the source encoder h_e^T as follows.

Figure 1: The architecture of the proposed model for improving keyphrase diversity. A represents the last states of the bi-directional source encoder; B represents the last state of the target encoder; C indicates decoder states where target tokens are either delimiters or end-of-sentence tokens. During orthogonal regularization, all C states are used; during target encoder training, we maximize the mutual information between states A and B. The red dashed arrow indicates a detached path, i.e., no back-propagation through that path.
For each phrase representation vector h_SC^M, we take the encodings H_e^T = {h_{e,1}^T, ..., h_{e,N}^T} of N different source texts, where h_{e,true}^T is the encoder representation of the current source text and the remaining N − 1 are negative samples (sampled at random) from the training data. The target encoder is trained to minimize the classification loss:

L_SC = −log [ exp(h_SC^M B h_{e,true}^T) / Σ_{i=1}^{N} exp(h_SC^M B h_{e,i}^T) ],

where B is a bilinear transformation. The motivation here is to constrain the overall representation of the generated keyphrases to be semantically close to the overall meaning of the source text. With such representations as input to the decoder, the semantic coverage mechanism can potentially provide useful keyphrase information and guide generation.
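The NCE objective above can be sketched in NumPy as follows (an illustrative sketch only: in the model, B is learned jointly with the networks and the encodings come from the GRUs, whereas here they are plain arrays):

```python
import numpy as np

def semantic_coverage_loss(h_sc, h_src, B, true_idx=0):
    """NCE-style loss: classify the true source encoding among N candidates.

    h_sc: (d,) final target-encoder state; h_src: (N, d) source encodings,
    where row `true_idx` is the current source text and the remaining rows
    are negative samples; B: (d, d) bilinear scoring matrix.
    """
    scores = h_src @ (B @ h_sc)                    # (N,) bilinear scores
    log_probs = scores - np.log(np.exp(scores).sum())  # log-softmax over candidates
    return -log_probs[true_idx]                    # cross-entropy on the true index

# With uninformative (all-zero) source encodings, the loss is chance level log(N)
print(semantic_coverage_loss(np.ones(4), np.zeros((4, 4)), np.eye(4)))  # ≈ log(4) ≈ 1.386
```

Minimizing this loss pushes the bilinear score of the true (source, target) pair above those of the negative pairs, which is what maximizes the mutual-information lower bound.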

Orthogonal Regularization
We also propose orthogonal regularization, which explicitly encourages the delimiter-generating decoder states to differ from each other. This is inspired by Bousmalis et al. (2016), who use orthogonal regularization to encourage representations across domains to be as distinct as possible. Specifically, we stack the decoder hidden states corresponding to delimiters together to form the matrix H = ⟨h_d^{t_1}, ..., h_d^{t_n}⟩ and use the following orthogonal regularization loss:

L_OR = ‖ H^T H ⊙ (1 − I_n) ‖_2,

where H^T is the transpose of H, I_n is the identity matrix of rank n, ⊙ indicates element-wise multiplication, and ‖M‖_2 indicates the L2 norm over the elements of a matrix M. This loss function prefers orthogonality among the hidden states h_d^{t_1}, ..., h_d^{t_n} and thus improves diversity in the tokens following the delimiters.
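The regularizer can be sketched as follows (illustrative NumPy code under the loss form above; in the model, H is built from decoder states and the loss is minimized jointly with the other terms):

```python
import numpy as np

def orthogonal_reg_loss(H):
    """L_OR = || H^T H * (1 - I_n) ||_2 over delimiter decoder states.

    H: (d, n) matrix whose columns are the n decoder hidden states taken at
    delimiter positions. Off-diagonal entries of H^T H are the pairwise dot
    products between states; penalizing them prefers mutually orthogonal states.
    """
    n = H.shape[1]
    gram = H.T @ H                        # (n, n) pairwise dot products
    off_diag = gram * (1.0 - np.eye(n))   # mask out diagonal (self-products)
    return np.linalg.norm(off_diag)       # element-wise L2 (Frobenius) norm

# Two orthogonal columns incur zero loss; identical columns are penalized.
print(orthogonal_reg_loss(np.eye(3)[:, :2]))  # 0.0
print(orthogonal_reg_loss(np.ones((3, 2))))   # > 0: identical states
```

Note the diagonal is masked out so that the norms of the individual states are not penalized, only their pairwise overlap.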

Training Loss
We adopt the widely used negative log-likelihood loss in our sequence generation model, denoted L_NLL. The overall loss we use for optimization is:

L = L_NLL + λ_OR · L_OR + λ_SC · L_SC,

where λ_OR and λ_SC are hyper-parameters.

Decoding Strategies
According to different task requirements, various decoding methods can be applied to generate the target sequence y. Prior studies (Meng et al., 2017; Yang et al., 2017) focus on generating an excessive number of phrases by leveraging beam search to proliferate the output phrases. In contrast, models trained under the One2Seq paradigm are capable of determining the proper number of phrases to output. In light of previous research in psychology (Van Zandt and Townsend, 1993; Forster and Bednall, 1976), we name these two decoding/search strategies Exhaustive Decoding and Self-terminating Decoding, respectively, due to their resemblance to the way humans behave in serial memory tasks. Simply speaking, the major difference lies in whether a model is capable of controlling the number of phrases it outputs. We describe the decoding strategies used in this study as follows.

Exhaustive Decoding
As traditional keyphrase tasks evaluate models with a fixed number of top-ranked predictions (say, F1-score @5 and @10), existing keyphrase generation studies have to over-generate phrases by means of beam search (commonly with a large beam size, e.g., 150 and 200 in Chen et al. (2018b) and Meng et al. (2017), respectively), a heuristic search algorithm that returns K approximately optimal sequences. In the One2One setting, each returned sequence is a unique phrase itself. But for One2Seq, each produced sequence contains several phrases, and additional processing (Ye and Wang, 2018) is needed to obtain the final unique (ordered) phrase list. It is worth noting that the time complexity of beam search is O(Bm), where B is the beam width and m is the maximum length of the generated sequences. Exhaustive decoding is therefore generally very computationally expensive, especially in the One2Seq setting, where m is much larger than in One2One. It is also wasteful: we observe that less than 5% of the phrases generated by One2Seq models are unique.

Self-terminating Decoding
An innate characteristic of keyphrase tasks is that the number of keyphrases varies depending on the document and dataset genre; dynamically outputting a variable number of phrases is therefore a desirable property for keyphrase generation models.² Since our model is trained to generate a variable number of phrases as a single sequence joined by delimiters, we can obtain multiple phrases by simply decoding a single sequence for each given source text. The resulting model thus implicitly performs the additional task of dynamically estimating the proper size of the target phrase set: once the model believes that an adequate number of phrases have been generated, it outputs a special token </s> to terminate the decoding process. One notable attribute of the self-terminating decoding strategy is that, by generating a set of phrases in a single sequence, the model conditions its current generation on all previously generated phrases. Compared to the exhaustive strategy (i.e., phrases being generated independently by beam search in parallel), our model captures the dependencies among its outputs in a more explicit fashion. Additionally, since multiple phrases are decoded as a single sequence, decoding can be performed more efficiently than exhaustive decoding by conducting greedy search or beam search on only the top-scored sequence.

² Note this is fundamentally different from other NLG tasks: both the number of keyphrases and the length of each keyphrase are variable.
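The post-processing implied by this strategy can be sketched as follows (our own illustration, not the paper's code; the token names ⟨sep⟩ and </s> follow the notation above, here written as plain strings):

```python
def split_phrases(tokens, sep="<sep>", eos="</s>"):
    """Split a delimiter-joined decoded sequence into unique, ordered phrases.

    Decoding stops at the end-of-sequence token; phrases keep their
    generation order, and exact duplicates are dropped.
    """
    phrases, current, seen = [], [], set()
    for tok in tokens:
        if tok == eos:
            break                          # model self-terminates here
        if tok == sep:
            phrase = " ".join(current)
            if phrase and phrase not in seen:
                seen.add(phrase)
                phrases.append(phrase)
            current = []
        else:
            current.append(tok)
    phrase = " ".join(current)             # flush the last phrase before eos
    if phrase and phrase not in seen:
        phrases.append(phrase)
    return phrases

print(split_phrases(["neural", "network", "<sep>", "deep", "learning",
                     "<sep>", "neural", "network", "</s>", "ignored"]))
# → ['neural network', 'deep learning']
```

The number of phrases returned is thus decided entirely by where the model emits </s>, not by a fixed cutoff k.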

Evaluating Keyphrase Generation
Formally, given a source text, suppose that a model predicts a list of unique keyphrases Ŷ = (ŷ_1, ..., ŷ_m), ordered by the quality of the predictions ŷ_i, and that the ground truth keyphrases for the given source text form the oracle set Y. When only the top k predictions Ŷ_{:k} = (ŷ_1, ..., ŷ_{min(k,m)}) are used for evaluation, precision, recall, and F1 score are consequently conditioned on k and defined as:

P@k = |Ŷ_{:k} ∩ Y| / |Ŷ_{:k}|,  R@k = |Ŷ_{:k} ∩ Y| / |Y|,  F1@k = (2 · P@k · R@k) / (P@k + R@k).
As discussed in Section 1, the number of generated keyphrases used for evaluation can have a critical impact on the quality of the resulting evaluation metrics. Here we compare three choices of k and the implications of each choice for keyphrase evaluation: • F1@k: k is a pre-defined constant (usually 5 or 10). Due to the high variance in the number of ground truth keyphrases, it is often the case that |Ŷ_{:k}| ≤ k < |Y|, and thus the R@k (and in turn F1@k) of an oracle model can be smaller than 1. This undesirable property is unfortunately prevalent in the evaluation metrics adopted by all existing keyphrase generation studies to our knowledge.
A simple remedy is to set k to a variable number specific to each data example. We therefore define two new metrics:

Table 2: Performance (F1-score) of present keyphrase prediction on the scientific publication datasets (KP20K, Inspec, Krapivin, NUS, SemEval; columns @5, @10, @O). The best/second-best performing score in each column is highlighted with bold/underline. We also list results from the literature for models that are not directly comparable (i.e., models that leverage additional data, and purely extractive models). Model names marked with † indicate that their F1@O is computed by us using the keyphrase predictions released by the corresponding works.

• F1@O: O denotes the number of oracle (ground truth) keyphrases. In this case, k = |Y|, which means that for each data example, the number of predicted phrases taken for evaluation is the same as the number of ground truth keyphrases.
• F1@M: M denotes the number of predicted keyphrases. In this case, k = |Ŷ|, and we simply take all the predicted phrases for evaluation without truncation. By extending the constant k to these variable numbers, both F1@O and F1@M reflect the variable number of phrases for each document, and a model can achieve the maximum F1 score of 1.0 if and only if it predicts exactly the same phrases as the ground truth. Another merit of F1@O is that it is independent of model outputs; we can therefore use it to compare existing models.
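The two variable-k metrics can be sketched as follows (our own illustration; it assumes exact-match comparison on de-duplicated phrases and omits the stemming and other normalization used in the actual evaluation):

```python
def f1(matches, n_pred, n_true):
    p = matches / n_pred if n_pred else 0.0
    r = matches / n_true if n_true else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def f1_at_o(predicted, truth):
    """k = |Y|: truncate predictions to the number of ground-truth phrases."""
    top = predicted[:len(truth)]
    return f1(len(set(top) & set(truth)), len(top), len(truth))

def f1_at_m(predicted, truth):
    """k = |Y_hat|: use every predicted phrase, with no truncation."""
    return f1(len(set(predicted) & set(truth)), len(predicted), len(truth))

truth = ["a", "b", "c"]
print(f1_at_o(["a", "b", "c", "d"], truth))  # 1.0: top-3 predictions match exactly
print(f1_at_m(["a", "b", "c", "d"], truth))  # ≈ 0.857: the extra phrase costs precision
```

Under both metrics, an oracle that predicts exactly the ground-truth set scores 1.0, which is precisely the normalization property that fixed-k metrics lack.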

Datasets and Experiments
In this section, we report our experimental results on multiple datasets and compare with existing models. We use CatSeq to refer to the delimiter-concatenated sequence-to-sequence model described in Section 3; CatSeqD refers to the model augmented with orthogonal regularization and the semantic coverage mechanism.
To construct target sequences for training CatSeq and CatSeqD, ground truth keyphrases are sorted by their order of first occurrence in the source text. Keyphrases that do not appear in the source text are appended to the end. This order may guide the attention mechanism to attend to source positions in a smoother way. Implementation details can be found in Appendix D. For pre-processing and evaluation, we follow the same steps as Meng et al. (2017). More details for reproducing our results are provided in Appendix E.
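This target-construction step can be sketched as follows (an illustration only; it uses simple substring matching on the joined text, whereas the actual pre-processing operates on tokenized, stemmed text, and the ⟨sep⟩ delimiter is written as a plain string):

```python
def build_target(source_tokens, keyphrases, sep=" <sep> "):
    """Order ground-truth keyphrases by first occurrence in the source text;
    absent keyphrases are appended at the end in their original order."""
    text = " ".join(source_tokens)
    present, absent = [], []
    for kp in keyphrases:
        pos = text.find(kp)
        (present if pos >= 0 else absent).append((pos, kp))
    present.sort()                       # sort by position of first occurrence
    return sep.join(kp for _, kp in present + absent)

src = "we study neural keyphrase generation with seq2seq models".split()
kps = ["seq2seq models", "neural keyphrase generation", "set generation"]
print(build_target(src, kps))
# → 'neural keyphrase generation <sep> seq2seq models <sep> set generation'
```

The resulting delimiter-joined string is what the decoder is trained to emit as a single sequence.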
We include a set of existing models (Meng et al., 2017; Chen et al., 2018a; Chan et al., 2019; Zhao and Zhang, 2019; Chen et al., 2019a) as baselines; they all share the same abstractive keyphrase generation behavior as our proposed model. Specifically, to compute existing models' scores under our proposed new metrics (F1@O and F1@M), we implemented our own version of CopyRNN (Meng et al., 2017) based on their open-sourced code, denoted CopyRNN*. We also report the scores of the models from Chan et al. and Chen et al. based on their publicly released outputs.
We also include a set of models that use similar strategies but cannot be compared with directly. These include four non-neural extractive models: TfIdf (Hasan and Ng, 2010), TextRank (Mihalcea and Tarau, 2004), KEA (Witten et al., 1999), and Maui (Medelyan et al., 2009); one neural extractive model (Sun et al., 2019); and two neural models that use additional data (e.g., titles) (Ye and Wang, 2018; Chen et al., 2019b). In Section 5.3, we apply the self-terminating decoding strategy. Since no existing model supports this decoding strategy, we only report results from our proposed models; they can serve as points of comparison for future studies.

Experiments on Scientific Publications
Our first collection consists of scientific publication datasets, namely KP20K, Inspec, Krapivin, NUS, and SemEval, which have been widely used in the existing literature (Meng et al., 2017; Chen et al., 2018a; Ye and Wang, 2018; Chen et al., 2018b; Chan et al., 2019; Zhao and Zhang, 2019; Chen et al., 2019a; Sun et al., 2019). KP20K, for example, was introduced by Meng et al. (2017) and comprises more than half a million scientific publications. For each article, the abstract and title are used as the source text, while the author keywords are used as the target. The other four datasets contain far fewer articles and are thus used to test the transferability of our model. We report our model's performance on the present-keyphrase portion of the KP20K dataset in Table 2. To compare with previous works, we compute F1@5 and F1@10 scores. The newly proposed F1@O metric indicates rankings consistent with F1@5/10 in most cases. Due to its sensitivity to the target number of keyphrases, its value is closer to F1@5 on KP20K and Krapivin, where the average number of target keyphrases is smaller, and closer to F1@10 on the other three datasets. From the results we can see that our CatSeqD outperforms existing abstractive models on most of the datasets. Our implemented CopyRNN* achieves better or comparable performance relative to the original model, and on NUS and SemEval the advantage is more salient.

As for the proposed models, both CatSeq and CatSeqD yield results comparable to CopyRNN, indicating that the One2Seq paradigm can work well as an alternative for the keyphrase generation task. CatSeqD outperforms CatSeq on all metrics, suggesting that semantic coverage and orthogonal regularization help the model generate higher-quality keyphrases and achieve better generalizability. To our surprise, on the F1@10 metric for KP20K and Krapivin (where the average number of keyphrases is only 5), where high-recall models like CopyRNN are favored, CatSeqD is still able to outperform the One2One baselines, indicating that the proposed mechanisms for diverse generation are effective.

Experiments on the StackEx Dataset
Inspired by the StackLite tag recommendation task on Kaggle, we build a new benchmark based on the public StackExchange data. We use questions with titles as the source, and user-assigned tags as the target keyphrases. We provide details regarding our data collection in Appendix C.
Since questions on StackExchange often contain less information than scientific publications, there are fewer keyphrases per data point in StackEx (statistics are shown in Table 1). Furthermore, StackExchange uses a tag recommendation system that suggests topic-relevant tags to users as they submit questions; we are therefore more likely to see general terminology such as Linux and Java. This characteristic challenges models' ability to distill the major topics of a question rather than select specific snippets from the text.
We report our models' performance on StackEx in Table 3. The results show that CatSeqD performs best in general; on the absent-keyphrase generation tasks, it outperforms CatSeq by a large margin.

Generating a Variable Number of Keyphrases
One key advantage of our proposed model is its ability to predict the number of keyphrases conditioned on the given source text. We thus conduct a set of experiments on the KP20K and StackEx present-keyphrase generation tasks, shown in Table 4, to study this behavior. We adopt the self-terminating decoding strategy (Section 3.3) and use both F1@O and F1@M (Section 4) for evaluation.
In these experiments, we use beam search as in most Natural Language Generation (NLG) tasks, i.e., we only use the top-ranked predicted sequence as output, and we compare the results with greedy search. Since no existing model is capable of generating a variable number of keyphrases, in this subsection we only report the performance of CatSeq and CatSeqD in this setting.
From Table 4 we observe that in the variable-number generation setting, greedy search consistently outperforms beam search. This may be because beam search tends to generate short and similar sequences. We also see that the resulting F1@O scores are generally lower than the results reported in previous subsections, suggesting that an over-generation decoding strategy may still benefit from achieving higher recall.

Ablation Study
We conduct an ablation experiment to study the effects of orthogonal regularization and the semantic coverage mechanism on CatSeq. As shown in Table 5, semantic coverage provides a significant boost to CatSeq's performance on all datasets. Orthogonal regularization hurts performance when applied to the CatSeq model alone. Interestingly, when both components are enabled (CatSeqD), the model outperforms CatSeq by a noticeable margin on all datasets; this suggests that the two components help keyphrase generation in a synergistic way. One future direction is to apply orthogonal regularization directly to the target encoder, since the regularizer can potentially diversify target representations at the phrase level, which may further encourage diverse keyphrase generation in the decoder.

Visualizing Diversified Generation
To verify our assumption that target encoding and orthogonal regularization help boost the diversity of generated sequences, we use two metrics, one quantitative and one qualitative, to measure the diversity of generation.
First, we simply calculate the average number of unique predicted phrases produced by CatSeq and CatSeqD in the experiments of Section 5.1 (beam size 50). The resulting numbers are 20.38 and 89.70 for CatSeq and CatSeqD, respectively. Second, running the models on the KP20K validation set, we randomly sample 2,000 decoder hidden states at the k-th step following a delimiter (k = 1, 2, 3) and apply t-SNE (van der Maaten and Hinton, 2008) to them. From Figure 2 we can see that the hidden states sampled from CatSeqD are easier to cluster, while the hidden states sampled from CatSeq yield one mass of vectors with no obvious distinct clusters. Results on both metrics suggest that target encoding and orthogonal regularization indeed help diversify our model's generation.

Qualitative Analysis
To illustrate the differences between the predictions of our proposed models, we show an example from the KP20K validation set in Appendix F. This example has 29 ground truth phrases. Neither model generates all of the keyphrases, but the predictions from CatSeq all start with "test", while the predictions from CatSeqD are diverse. This to some extent verifies our assumption that without the target encoder and orthogonal regularization, decoder states following delimiters are less diverse.

Conclusion and Future Work
We propose a recurrent generative model that sequentially generates multiple keyphrases, with two extra modules that enhance generation diversity. We propose new metrics to evaluate keyphrase generation. Our model shows competitive performance on a set of keyphrase generation datasets, including one introduced in this work. In future work, we plan to investigate how target phrase order affects the generation behavior, and further explore set generation in an order invariant fashion.

A.1 The Encoder-Decoder Model
Given a source text consisting of N words w_e^1, ..., w_e^N, the encoder converts their corresponding embeddings x_e^1, ..., x_e^N into a set of N real-valued vectors h_e = (h_e^1, ..., h_e^N) with a bi-directional GRU:

h_{e,fwd}^t = GRU_{e,fwd}(x_e^t, h_{e,fwd}^{t−1}),
h_{e,bwd}^t = GRU_{e,bwd}(x_e^t, h_{e,bwd}^{t+1}),
h_e^t = ⟨h_{e,fwd}^t, h_{e,bwd}^t⟩.
Dropout (Srivastava et al., 2014) is applied to both x_e and h_e for regularization. The decoder is a uni-directional GRU, which generates a new state h_d^t at each time-step t from the word embedding x_d^t and the recurrent state h_d^{t−1}:

h_d^t = GRU_d(x_d^t, h_d^{t−1}).

The initial state h_d^0 is derived from the final encoder state h_e^N by applying a single-layer feed-forward neural network (FNN): h_d^0 = L^f(h_e^N). The abstractive distribution p_a over the target vocabulary is produced by a linear layer L_3 whose output size equals the target vocabulary size, followed by a softmax. The subscript a indicates the abstractive nature of p_a, since it is a distribution over a prescribed vocabulary.

A.3 Pointer Softmax
We employ the pointer softmax mechanism (Gülçehre et al., 2016) to switch between generating a token $y_t$ from a vocabulary and pointing to a token in the source text. Specifically, the pointer softmax module computes a scalar switch $s_t$ at each generation time-step and uses it to interpolate between the abstractive distribution $p_a(y_t)$ over the vocabulary (see Equation 11) and the extractive distribution $p_x(y_t) = \alpha_t$ over the source text tokens:
$$p(y_t) = s_t \, p_a(y_t) + (1 - s_t) \, p_x(y_t),$$
where $s_t$ is conditioned on both the attention-weighted source representation $\sum_i \alpha_{t,i} h^e_i$ and the decoder state $h^d_t$.
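The interpolation step can be sketched as follows. This is a toy illustration of the mixing operation only (the switch network and attention are assumed to be computed elsewhere; all sizes and names are hypothetical):

```python
import torch
import torch.nn.functional as F

def pointer_softmax(p_abstractive, attn_weights, switch, src_ids, vocab_size):
    """Mix abstractive and extractive distributions:
        p(y_t) = s_t * p_a(y_t) + (1 - s_t) * p_x(y_t),
    where p_x places the attention mass alpha_t onto the source token ids."""
    # Scatter attention weights over source positions into vocabulary space;
    # repeated source tokens accumulate their attention mass.
    p_extractive = torch.zeros(p_abstractive.size(0), vocab_size)
    p_extractive.scatter_add_(1, src_ids, attn_weights)
    return switch * p_abstractive + (1.0 - switch) * p_extractive

# Toy example: batch of 1, source length 3, vocabulary of 10.
vocab_size = 10
p_a = F.softmax(torch.randn(1, vocab_size), dim=-1)  # abstractive distribution
alpha = F.softmax(torch.randn(1, 3), dim=-1)         # attention over source tokens
src_ids = torch.tensor([[4, 7, 4]])                  # source token ids
s_t = torch.tensor([[0.6]])                          # switch value in [0, 1]

p = pointer_softmax(p_a, alpha, s_t, src_ids, vocab_size)
print(float(p.sum()))  # ~1.0: the mixture is still a valid distribution
```

Because both input distributions sum to one and the switch is a convex weight, the output remains a proper probability distribution over the vocabulary.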

B Experiment Results on KP20K Absent Subset
Generating absent keyphrases on scientific publication datasets is a challenging problem. Existing studies often achieve seemingly good performance by measuring recall over tens, sometimes hundreds, of keyphrases produced by exhaustive decoding with a large beam size, thus completely ignoring precision. To be in line with previous studies, we report the models' Recall@10/50 scores on the absent portion of five scientific paper datasets in Table 6.
Absent keyphrase prediction strongly favors recall-oriented models, so CopyRNN with a beam size of 200 is inherently well suited to this setting. Nevertheless, we observe that, with the help of exhaustive decoding and the diversity mechanisms, CatSeqD performs comparably to CopyRNN, and generally works better for top predictions. Even though the trend in model performance roughly matches what we observe on the present-keyphrase data, we argue that it is hard to compare models at this scale, and that StackEx is a better testbed for absent keyphrase generation.
Table 6: Performance of absent keyphrase prediction on scientific publication datasets. The best/second-best score in each column is highlighted in bold/underline.

C StackEx Data Collection
We download the public data dump from https://archive.org/details/stackexchange and choose 19 computer-science-related topics from the Oct. 2017 dump. We select computer science forums (CS/AI), using "title" + "body" as the source text and "tags" as the target keyphrases. After removing questions without valid tags, we collect 330,965 questions, from which we randomly select 16,000 for the validation set and another 16,000 for the test set. Note that some questions in StackExchange forums contain large blocks of code, resulting in long texts (sometimes more than 10,000 tokens after tokenization) that are difficult for most neural models to handle. Consequently, we truncate texts to 300 tokens for the training split and 1,000 tokens for the evaluation splits.
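The truncation step can be sketched as follows. This is a simplified stand-in for the actual preprocessing: whitespace splitting is used in place of the paper's tokenizer, and the helper name is hypothetical.

```python
def truncate_example(title, body, max_tokens=300):
    """Join "title" + "body", tokenize, and cut to max_tokens.

    StackExchange questions can exceed 10,000 tokens because of embedded
    code blocks, so training texts are cut to 300 tokens (1,000 for the
    evaluation splits). Whitespace splitting here stands in for real
    tokenization.
    """
    tokens = (title + " " + body).split()
    return tokens[:max_tokens]

# A long toy question body is truncated to the training-split limit.
toks = truncate_example("How to sort a list?", "word " * 500, max_tokens=300)
print(len(toks))  # 300
```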

D Implementation Details
We use Adam (Kingma and Ba, 2014) as the step rule for optimization, with a learning rate of $10^{-3}$. The model is implemented using PyTorch (Paszke et al., 2017) and OpenNMT (Klein et al., 2017).
For exhaustive decoding, we use a beam size of 50 and a maximum sequence length of 40.

Experiment Setting    λ_OR            λ_SC
Table 2               1.0             0.03
Table 5, CatSeqD      Same as Table 2
Table 6               Same as Table 2

Table 7: Semantic coverage and orthogonal regularization coefficients.
Following Meng et al. (2017), lowercase and stemming are performed on both the ground truth and generated keyphrases during evaluation.
We leave out 2,000 examples from both KP20K and StackEx as validation sets and use them to identify optimal checkpoints for testing. All scores reported in this paper are from the checkpoints with the best validation performance (F1@O).
In Section 6.2, we use sklearn's default parameters for t-SNE (learning rate 200.0, number of iterations 1,000).

E Dataset and Evaluation Details
We strictly follow the data pre-processing and evaluation protocols provided by Meng et al. (2017).
We pre-process both document texts and ground-truth keyphrases, including word segmentation, lowercasing, and replacing all digits with the symbol <digit>. Examples with empty ground-truth keyphrases are removed from the datasets.
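The lowercasing and digit-replacement steps can be sketched as follows. This is a simplified illustration: whitespace splitting stands in for word segmentation, and replacing each run of digits with one `<digit>` symbol is an assumption about the exact replacement granularity.

```python
import re

def preprocess(text):
    """Lowercase and replace digits with the <digit> symbol.

    Simplifications: whitespace splitting stands in for real word
    segmentation, and each maximal run of digits maps to one <digit>.
    """
    tokens = [re.sub(r"\d+", "<digit>", tok) for tok in text.lower().split()]
    return tokens

out = preprocess("GRU with 256 hidden units, trained for 10 epochs")
print(out)  # ['gru', 'with', '<digit>', 'hidden', 'units,', 'trained', 'for', '<digit>', 'epochs']
```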
We evaluate models' performance on predicting present and absent phrases separately. Specifically, we first lowercase and stem the text (using the Porter Stemmer), then determine the presence of each ground-truth keyphrase by checking whether it is a sub-string of the source text. To evaluate present-phrase performance, we compute Precision/Recall/F1-score (Equations 14-16) for each document, taking only the present ground-truth keyphrases as targets and ignoring the absent ones:
$$P@k = \frac{\#(\mathrm{correct}@k)}{\min\{k, \#(\mathrm{pred})\}} \quad (14)$$
$$R@k = \frac{\#(\mathrm{correct}@k)}{\#(\mathrm{target})} \quad (15)$$
$$F_1@k = \frac{2 \cdot P@k \cdot R@k}{P@k + R@k} \quad (16)$$
where $\#(\mathrm{pred})$ and $\#(\mathrm{target})$ are the numbers of predicted and ground-truth keyphrases, respectively, and $\#(\mathrm{correct}@k)$ is the number of correct predictions among the first $k$ results.
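The metrics above can be computed per document with a short function. This is a sketch, not the authors' evaluation script; it assumes predictions and targets are already lowercased and stemmed so exact string matching is meaningful.

```python
def precision_recall_f1_at_k(predictions, targets, k):
    """Compute P@k, R@k, and F1@k over keyphrase string lists.

    Assumes both lists are already lowercased and stemmed, so membership
    testing implements the "correct prediction" check.
    """
    topk = predictions[:k]
    correct = sum(1 for phrase in topk if phrase in targets)
    p_at_k = correct / min(k, len(predictions)) if predictions else 0.0
    r_at_k = correct / len(targets) if targets else 0.0
    if p_at_k + r_at_k == 0.0:
        return 0.0, 0.0, 0.0
    f1_at_k = 2 * p_at_k * r_at_k / (p_at_k + r_at_k)
    return p_at_k, r_at_k, f1_at_k

preds = ["neural network", "keyphrase generation", "deep learning"]
golds = ["keyphrase generation", "neural network"]
print(precision_recall_f1_at_k(preds, golds, k=2))  # (1.0, 1.0, 1.0)
```

Per-document scores are then macro-averaged over the documents that have at least one target phrase.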
We report macro-averaged scores over documents that have at least one present ground-truth phrase (corresponding to the column #PreDoc in Table 8); absent-phrase evaluation is handled analogously.

F Examples of KP20K and StackEx with Model Predictions
See Table 9 and Figure 3.
Table 8: Statistics on the number of documents and keyphrases in each test set. #Doc/#KP denotes the number of documents/ground-truth keyphrases in the dataset. #PreKP/#AbsKP denotes the number of present/absent ground-truth keyphrases, and #PreDoc/#AbsDoc denotes the number of documents that contain at least one present/absent ground-truth keyphrase.

Integration of a Voice Recognition System in a Social Robot
Human-robot interaction (HRI) is one of the main fields in the study and research of robotics. Within this field, dialogue systems and interaction by voice play an important role. When speaking about human-robot natural dialogue, we assume that the robot has the capability to accurately recognize what the human wants to transmit verbally, and even its semantic meaning, but this is not always achieved. In this article we describe the steps and requirements that we went through in order to endow the personal social robot Maggie, developed at the University Carlos III of Madrid, with the capability of understanding natural language spoken by any human. We have analyzed the different possibilities offered by current software/hardware alternatives by testing them in real environments. We have obtained accurate data related to speech recognition capabilities in different environments, using the most modern audio acquisition systems and analyzing not-so-typical parameters such as user age, gender, intonation, volume, and language. Finally, we propose a new model to classify recognition results as accepted or rejected, based on a second automatic speech recognition (ASR) opinion. This new approach takes into account the precalculated success rate in noise intervals for each recognition framework, decreasing the rate of false positives and false negatives.

CatSeq: voice recognition system ; social robot ; human robot interaction ; voice recognition ; hri ; speech recognition ; automatic speech recognition ; noise intervals ; noise ; human robot ; automatic speech ; natural language
CatSeqD: human robot interaction ; voice recognition ; social robotics ; social robots ; integration ; speech recognition ; hri ; social robot ; robotics ; voice recognition system ; recognition ; asr ; automatic speech recognition
Ground Truth: asr ; automatic speech recognition ; dialogue ; human robot interaction ; maggie ; social robot ; speech recognition ; voice recognition
Table 9: Example from the KP20K validation set, with predictions generated by the CatSeq and CatSeqD models.