An Empirical Study on Neural Keyphrase Generation

Recent years have seen a flourishing of neural keyphrase generation (KPG) work, including the release of several large-scale datasets and a host of new models to tackle them. Model performance on KPG tasks has improved significantly with advances in deep learning research. However, the literature still lacks a comprehensive comparison among different model designs and a thorough investigation of the factors that may affect a KPG system's generalization performance. In this empirical study, we aim to fill this gap by providing extensive experimental results and analyzing the most crucial factors impacting the generalizability of KPG models. We hope this study can help clarify some of the uncertainties surrounding the KPG task and facilitate future research on this topic.


Introduction
Keyphrases are phrases that summarize and highlight important information in a piece of text. Keyphrase generation (KPG) is the task of automatically predicting such keyphrases given the source text. The task can be (and has often been) easily misunderstood and trivialized as yet another natural language generation task like summarization and translation; this view fails to recognize one key aspect that distinguishes KPG: the multiplicity of generation targets. For each input sequence, a KPG system is expected to output multiple keyphrases, each a mini-sequence of multiple word tokens.
Despite this unique nature, KPG has been essentially "brute-forced" into the sequence-to-sequence (Seq2Seq) (Sutskever et al., 2014) framework in the existing literature (Meng et al., 2017; Chen et al., 2018; Ye and Wang, 2018; Chen et al., 2019b; Yuan et al., 2020; Chan et al., 2019; Zhao and Zhang, 2019; Chen et al., 2019a). The community has approached the unique challenges with much ingenuity in problem formulation, model design, and evaluation. For example, multiple target phrases have been reformulated by either splitting into one phrase per data point or joining into a single sequence with delimiters (Figure 1), both allowing straightforward applications of existing neural techniques such as Seq2Seq. In accordance with the tremendous success and demonstrated effectiveness of neural approaches, steady progress has been made in the past few years, at least empirically, across various domains, including sub-areas that were previously shown to be rather difficult (e.g., generating keyphrases that are not present in the source text).
Meanwhile, with the myriad of KPG's unique challenges comes an ever-growing collection of studies that, albeit novel and practical, may quickly proliferate and overwhelm. We are therefore motivated to present this study as, to the best of our knowledge, the first systematic investigation of these challenges as well as the interplay among their solutions. We hope this study can serve as a practical guide, helping researchers gain a more holistic view of the task and profit from the empirical results of our investigations on a variety of KPG topics, including model design, evaluation, and hyper-parameter selection.
The rest of the paper is organized as follows. We first enumerate specific challenges in KPG due to the multiplicity of its targets and describe the general setup of our experiments. We subsequently present experimental results and discussions to answer three main questions: 1. How well do KPG models generalize to various testing distributions? 2. Does the order of target keyphrases matter when training One2Seq? 3. Is more training data helpful, and how can it best be used?

Unique Challenges in KPG
Due to the multiplicity of the generation targets, KPG is unique compared to other NLG tasks such as summarization and translation. In this section, we start by providing background on the KPG problem setup. We then enumerate the unique aspects of KPG model design and training that we focus on in this work.
Problem Definition Formally, the task of keyphrase generation (KPG) is to generate a set of keyphrases {p_1, ..., p_n} given a source text t (a sequence of words). Semantically, these phrases summarize and highlight important information contained in t; syntactically, each keyphrase may consist of multiple words. A keyphrase is defined as present if it is a sub-string of the source text, and as absent otherwise.
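The present/absent distinction can be sketched with a simple sub-string check (a minimal illustration; `split_present_absent` is our own hypothetical helper, and real pipelines typically match stemmed token sequences rather than raw strings):

```python
def split_present_absent(keyphrases, source_text):
    """Label each keyphrase as present (sub-string of the source) or absent.

    Sketch only: stemming and tokenization are omitted for brevity.
    """
    present, absent = [], []
    source = source_text.lower()
    for phrase in keyphrases:
        if phrase.lower() in source:
            present.append(phrase)
        else:
            absent.append(phrase)
    return present, absent
```

For example, against the source "We study neural networks for text.", the phrase "neural networks" is present while "transfer learning" is absent.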
Training Paradigms To tackle the unique challenge of generating multiple targets, existing neural KPG approaches can be categorized under one of two training paradigms: One2One (Meng et al., 2017) or One2Seq (Yuan et al., 2020), both based on the Seq2Seq framework. Their main difference lies in how target keyphrase multiplicity is handled when constructing data points (Figure 1). Specifically, given multiple target phrases {p_1, ..., p_n}, One2One takes one phrase at a time and pairs it with the source text t to form n data points (t, p_i), i = 1..n. During training, a model learns a one-to-many mapping from t to the p_i's, i.e., the same source string usually has multiple corresponding target strings. In contrast, One2Seq concatenates all ground-truth keyphrases p_i into a single string P = <bos> p_1 <sep> ... <sep> p_n <eos> (i.e., prefixed with <bos>, joined with <sep>, and suffixed with <eos>), thus forming a single data point (t, P). A system is then trained to predict the concatenated sequence P given t. By default, we construct P following the ordering strategy proposed by Yuan et al. (2020): present phrases are sorted by their first occurrences in the source text, and absent keyphrases are appended at the end. This ordering is denoted as PRES-ABS in §4.
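The two data constructions can be sketched as follows (our own illustration using the delimiter tokens named above; the function names are ours):

```python
def make_one2one_examples(source, keyphrases):
    # One2One: one (source, phrase) training pair per target keyphrase.
    return [(source, p) for p in keyphrases]

def make_one2seq_example(source, keyphrases):
    # One2Seq: all targets joined into a single delimited sequence P.
    target = "<bos> " + " <sep> ".join(keyphrases) + " <eos>"
    return (source, target)
```

Given one source with two keyphrases, the first function yields two training pairs, while the second yields a single pair whose target is the delimited concatenation.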
Architecture In this paper, we adopt the architecture used in both Meng et al. (2017) and Yuan et al. (2020), which we denote as RNN. RNN is a GRU-based Seq2Seq model (Cho et al., 2014) with a copy mechanism (Gu et al., 2016) and a coverage mechanism (See et al., 2017). We also consider a more recent architecture, the Transformer (Vaswani et al., 2017), which is widely used in the encoder-decoder language generation literature (Gehrmann et al., 2018). We replace both the encoder GRU and decoder GRU in RNN with Transformer blocks, and denote this architecture variant as TRANS. Both the RNN and TRANS models can be trained with either the One2One or One2Seq paradigm.
In recent years, a host of auxiliary designs and mechanisms have been proposed and developed on top of either One2One or One2Seq (see §6). In this study, however, we focus only on the "vanilla" versions, and we show that given a set of carefully chosen architectures and training strategies, base models can achieve comparable, if not better, performance than state-of-the-art methods. We assume that KPG systems derived from either the One2One or One2Seq model would be affected by these design factors in similar ways.
Decoding Strategies KPG is distinct from other NLG tasks since it expects a set of multi-word phrases (rather than a single sequence) as model predictions. Depending on the preference of potential downstream tasks, a KPG system can utilize different decoding strategies. For applications that favor high recall (e.g., generating indexing terms for retrieval systems), a common practice is to utilize beam search and take predictions from all beams.¹ This is applicable to both One2One- and One2Seq-based models to proliferate the number of predicted phrases at inference time. In this work, we use a beam width of 200 for One2One and 50 for One2Seq. On the contrary, other applications favor high precision and a small number of predictions (e.g., knowledge graph construction); for these, a One2Seq-based model can decode greedily, thanks to its nature of generating multiple keyphrases in a sequential manner.

Table 1: Testing scores across different model architectures, training paradigms, and datasets. D_0: in-distribution; D_1: out-of-distribution; D_2: out-of-domain. We provide the average score over each category.
As an example, we illustrate the two decoding strategies in Figure 1. Specifically, a One2One model typically collects output keyphrases from all beams and uses the top k phrases as the model output (k = 5 in the example). In One2Seq, either beam search or greedy decoding can be applied. For beam search, we use both the order of phrases within a beam and the rankings of beams to rank the outputs. In the example, the top 5 beam search outputs are obtained from the 2 highest-ranked beams. As for greedy decoding, the decoder uses a beam size of 1 and takes all phrases from the single beam as outputs. In this way, a One2Seq model can decide on its own how many phrases to output, conditioned on t.
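How predictions are collected under the two strategies can be sketched as follows (our own hypothetical helpers; beam scoring and cross-beam ranking details are simplified):

```python
def phrases_from_one2one_beams(beams, k):
    # One2One: each beam holds a single phrase; keep the top-k unique ones.
    seen, out = set(), []
    for phrase in beams:  # beams are assumed sorted by model score
        if phrase not in seen:
            seen.add(phrase)
            out.append(phrase)
        if len(out) == k:
            break
    return out

def phrases_from_one2seq_sequence(sequence):
    # One2Seq: split a single decoded string on the delimiter tokens.
    body = sequence.replace("<bos>", "").replace("<eos>", "")
    return [p.strip() for p in body.split("<sep>") if p.strip()]
```

Under greedy decoding, the second function applied to the single decoded beam gives the final output set directly.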
Evaluation Due to the multiplicity of targets in the KPG task, the evaluation protocols are distinct from those of typical NLG tasks. A spectrum of metrics has been used to evaluate KPG systems, including metrics that truncate model outputs at a fixed number, such as F1@5 and F1@10 (Meng et al., 2017); metrics that evaluate a model's ability to generate a varying number of phrases, such as F1@O and F1@M (Yuan et al., 2020); and metrics that evaluate absent keyphrases, such as Recall@50 (R@50). Detailed definitions of the metrics are provided in Appendix A. Due to space limits, we mainly discuss F1@O, F1@10, and R@50 in the main text; complete results with all common metrics are included in Appendix E. We save model checkpoints every 5,000 training steps and report test performance using the checkpoints that produce the best F1@O or R@50 on the KP20K validation set.
Datasets A collection of datasets in the domain of scientific publications (KP20K, INSPEC, KRAPIVIN, NUS, and SEMEVAL) and news articles (DUC) has been widely used to evaluate the KPG task. Following previous work, we train models on the training set of KP20K, since its size is sufficient to support the training of deep neural networks. Evaluation is performed on KP20K's test set as well as all other datasets without fine-tuning. Details of the datasets are given in Appendix B.

Generalizability
In this section, we show and analyze the generalization performance of KPG systems along two dimensions: model architecture and training paradigm. Specifically, we compare the two model architectures (RNN and TRANS) described in §2, and for each architecture we train the KPG model with each of the two training paradigms (One2One and One2Seq), also described in §2.
To better understand the model variants' generalization properties, we categorize the 6 test sets into 3 classes according to their distributional similarity with the training data (KP20K), as shown in Table 1. Concretely, KP20K and KRAPIVIN are in-distribution test sets (denoted D_0), since both contain scientific paper abstracts paired with keyphrases provided by their authors. INSPEC, NUS, and SEMEVAL are out-of-distribution test sets (denoted D_1); they share the same type of source text as D_0 but include additional keyphrases labeled by third-party annotators. DUC is a special test set that uses news articles as source text. Because it shares the least domain knowledge and vocabulary with all the other test sets, we call it the out-of-domain test set (denoted D_2).
Model Architecture: RNN vs TRANS The first thing to notice is that on present KPG, the models show consistent trends between F1@10 and F1@O. We observe that TRANS models significantly outperform RNN models when trained with the One2Seq paradigm on D_0 test sets. However, as the test data distribution shift increases, on D_1 test sets RNN models start to outperform TRANS; eventually, on the D_2 test set, RNN outperforms TRANS by a large margin. For models trained with the One2One paradigm, we observe a similar trend: on D_0 data, TRANS models achieve F1@10 and F1@O scores comparable to RNN, and as the distribution shift increases, RNN models produce better results.
In contrast, for absent KPG, TRANS outperforms RNN by a significant margin in all experimental settings. This is especially obvious when models are trained with the One2Seq paradigm, where RNN models barely generalize to any of the test data, producing an average R@50 of 2.5. In the same setting, TRANS models achieve an average R@50 of 9.8, roughly 4× that of RNN.
To further study the different generation behaviors of RNN and TRANS, we investigate the average number of unique predictions generated by each model. As shown in Figure 12 in Appendix D, comparing results of the PRES-ABS order in sub-figures a/b (RNN) with sub-figures c/d (TRANS), we observe that TRANS consistently generates more unique predictions than RNN, under both greedy decoding (4.5 vs 4.2) and beam search (123.3 vs 96.8). We suspect that generating a more diverse set of keyphrases has a stronger effect on in-distribution test data. The outputs generated at inference time tend to reflect the distribution learned from the training data; when the test data share the same (or a similar) distribution, a larger set of unique predictions leads to higher recall, which further contributes to the F-scores. In contrast, on test sets whose distribution is far from the training distribution, the extra predictions may not be as useful and can even hurt precision.

Training Paradigm: One2One vs One2Seq We observe that on present KPG tasks, models trained with the One2Seq paradigm outperform One2One in most settings; this is particularly clear on the D_1 and D_2 test sets. We believe this is potentially due to the design of the One2Seq training paradigm, where at every generation step the model conditions its decision on all previously generated tokens (phrases). Compared to the One2One paradigm, where multiple phrases can only be generated independently by beam search in parallel, One2Seq can model the dependencies among tokens and among phrases more explicitly. However, on absent KPG, One2One consistently outperforms One2Seq. Furthermore, only when trained with the One2One paradigm can an RNN-based model achieve R@50 scores close to those of TRANS-based models. This may be because a One2Seq model tends to produce more duplicate predictions during beam search inference.
By design, every beam is a string containing multiple phrases concatenated by the delimiter <sep>, and there is no guarantee that a phrase will not appear in multiple beams. In the example shown in Figure 1, "topic tracking" is such a duplicate prediction appearing in multiple beams. In fact, the proportion of duplicates in One2Seq predictions is more than 90%. This is in contrast with beam search on One2One models, where each beam contains only a single keyphrase and thus has a much lower probability of producing duplicates.³
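The duplicate proportion discussed above can be measured as follows (our own sketch, not the paper's exact evaluation script):

```python
def duplicate_rate(beams):
    # Fraction of phrase occurrences across all beams that are duplicates.
    # `beams` is a list of phrase lists, one per decoded beam.
    all_phrases = [p for beam in beams for p in beam]
    if not all_phrases:
        return 0.0
    return 1.0 - len(set(all_phrases)) / len(all_phrases)
```

For instance, two beams ["topic tracking", "nlp"] and ["topic tracking", "ir"] contain 4 phrase occurrences but only 3 unique phrases, giving a duplicate rate of 0.25.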

Does Order Matter in One2Seq?
In the One2One paradigm (as shown in Figure 1), each data example is split into multiple equally weighted pairs, so the model generates phrases without any prior on their order. In contrast, One2Seq training has the unique capability of generating a varying number of keyphrases in a single sequence. This inductive bias enables a model to learn dependencies among keyphrases and to implicitly estimate the number of target phrases conditioned on the source text. However, the One2Seq approach introduces a new complication: during training, the Seq2Seq decoder takes the concatenation of multiple target keyphrases as its target. As pointed out by Vinyals et al. (2016), order matters in sequence modeling tasks; yet the ordering among target keyphrases has not been fully investigated, and its effect on model performance remains unclear. Several studies have noted this problem (Ye and Wang, 2018; Yuan et al., 2020) without further exploration.
Ordering Definition To explore along this direction, we first define nine ordering strategies for concatenating target phrases.
• RANDOM: Randomly shuffle the target phrases. Because of the set-generation nature of KPG, we expect randomly shuffled target sequences to help learn an order-invariant decoder.

³ Due to post-processing such as stemming, a One2One model may still produce duplicates.
• ORI: Keep phrases in their original order in the data (e.g., as provided by the authors of the source texts). This was used by Ye and Wang (2018).
• ORI-REV: Reversed order of ORI.
• S->L: Sort phrases by length (number of tokens), from short to long.
• L->S: Reversed order of S->L.
• ALPHA: Sort phrases in alphabetical order.
• ALPHA-REV: Reversed order of ALPHA.
• PRES-ABS: Sort present phrases by their first occurrences in the source text; absent phrases are shuffled and appended to the end of the present-phrase sequence. This was used by Yuan et al. (2020).
• ABS-PRES: Similar to PRES-ABS, but with absent phrases prepended to the beginning.
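A few of these strategies can be sketched in code (our own illustration; `order_targets` is a hypothetical helper covering only a subset of the nine strategies):

```python
import random

def order_targets(present, absent, strategy, source_tokens=None, rng=None):
    """Arrange target keyphrases under a few of the ordering strategies
    described above (a sketch; strategy names follow the paper)."""
    phrases = present + absent
    rng = rng or random.Random(0)
    if strategy == "random":
        phrases = phrases[:]
        rng.shuffle(phrases)
        return phrases
    if strategy == "alpha":
        return sorted(phrases)
    if strategy == "alpha-rev":
        return sorted(phrases, reverse=True)
    if strategy == "s->l":
        return sorted(phrases, key=lambda p: len(p.split()))
    if strategy == "l->s":
        return sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    if strategy == "pres-abs":
        # Present phrases by first occurrence in the source; absent shuffled last.
        pres = sorted(present, key=lambda p: " ".join(source_tokens).find(p))
        abs_shuffled = absent[:]
        rng.shuffle(abs_shuffled)
        return pres + abs_shuffled
    raise ValueError(strategy)
```

The ordered list is then joined with <sep> to form the One2Seq target P as described in §2.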
Greedy Decoding In Figure 2, we show the RNN and TRANS models' F1@M on the present KPG task with greedy decoding. In this setting, the model simply chooses the token with the highest probability at every step, terminating either upon generating the <eos> token or upon reaching the maximum target length limit (40). This means the model predicts phrases relying solely on the distribution learned from the training data, so this performance may reflect the degree to which the model fits the training distribution and understands the task. Through this set of experiments, we first observe that the relative performance of the ordering strategies is consistent across all six test datasets, indicating that ordering plays a critical role in training One2Seq models when greedy decoding is applied. With the RNN architecture, RANDOM consistently yields a lower F1@M than the other ordering strategies on all datasets, suggesting that a consistent keyphrase order is beneficial. TRANS models, however, show better resistance to randomly shuffled keyphrases and produce mid-tier performance with the RANDOM ordering. Meanwhile, we observe that PRES-ABS outperforms the other ordering strategies by significant margins. A possible explanation is that with this order (by occurrence in the source text), the current target phrase always lies to the right of the previous one, which can serve as an effective prior for the attention mechanism throughout One2Seq decoding. We observe similar trends in the greedy-decoding models' F1@O and F1@10; due to space limits, we refer readers to Figures 9 and 10 in Appendix D.
Beam Search Next, we show results obtained from the same set of models with beam search (beam width 50) in Figure 3 (a/b). Compared with greedy decoding (Figure 10, Appendix D), we clearly observe that the overall F1@10 scores correlate positively with beam width (greedy decoding is the special case where the beam width equals 1). We also observe that, compared to greedy decoding, the pattern among different ordering strategies becomes less clear, with scores distributed more evenly across settings (concretely, the absolute difference between the maximum and minimum average scores is smaller).
We suspect that the uniformity among ordering strategies under beam search may be due to a limitation of the F1@10 metric, which truncates a model's predictions to the 10 top-ranked keyphrases. Upon investigation, we find that during greedy decoding the number of predictions acts as a dominant factor, and this number varies greatly among orderings. With greedy decoding, PRES-ABS generally predicts more phrases than the others, which explains its performance advantage (Figure 13 (a/c), Appendix D). However, as the beam width increases, all models predict more than 10 phrases (Figure 13 (b/d), Appendix D). In this case, F1@10 is determined more by a model's ability to place high-quality keyphrases within its top-10 outputs than by the sheer number of predictions. The performance gap among ordering strategies therefore gradually narrows under beam search. For instance, we observe that the F1@10 difference between PRES-ABS and S->L produced by RNN is 3.5/2.0/1.0/0.2 at beam widths of 1/10/25/50.
To validate our assumption, we further investigate the same models' performance on F1@O, which strictly truncates the generated keyphrase list to the number of ground-truth keyphrases O (where in most cases O < 10). Under this harsher criterion, a model must place more high-quality keyphrases within its top-O outputs. From Figure 3 (c/d), we observe that the scores are less uniformly distributed, indicating larger differences between ordering settings. Among all orders, ORI produces the best average F1@O with RNN, whereas ALPHA-REV and ORI-REV produce the best average F1@O with TRANS.
In our curated list of order settings, there are three pairs of orderings with a reversed relationship (i.e., S->L vs L->S, ALPHA vs ALPHA-REV, ORI vs ORI-REV). Interestingly, we observe that when beam search is applied, these orderings often show a non-negligible score difference from their counterparts. This also suggests that order matters, since a specific model architecture and training paradigm often has its own preference for phrase ordering.
It is also worth mentioning that when we manually inspect the output sequences produced on the test set with the ALPHA ordering, we notice that the model is actually able to retain alphabetical order among the predicted keyphrases, hinting that a Seq2Seq model might be capable of learning simple morphological dependencies even without access to character-level representations.
Ordering in Absent KPG We report the performance of the same set of models on the absent portion of the data in Figure 11, Appendix D. Although R@50 is relatively low in most settings, the scores produced by the various orderings show clear distinctions, and the normalized heat maps suggest that the rankings among orderings tend to be consistent across all test datasets. In general, PRES-ABS produces better absent keyphrases across model architectures. Due to space limits, we encourage readers to consult Appendix D, which provides an exhaustive set of heat maps covering all experimental settings and metrics discussed in this section.

Training with More Data
In this section, we further explore the possibility of improving KPG performance by scaling up the training data. Data size has been shown to be one of the most effective factors in training language models (Raffel et al., 2019; Ott et al., 2018), but it has not yet been discussed in the context of KPG.

MagKP Dataset
We construct a new dataset, MAGKP, on the basis of the Microsoft Academic Graph (Sinha et al., 2015). We filter the original MAG v1 dataset (166 million papers across multiple domains), keeping only Computer Science papers with at least one keyphrase. This results in 2.7 million data points (5× larger than KP20K). The dataset remains noisy despite the stringent filtering criteria, because 1) the data is crawled from the web and 2) some keywords are labeled by automatic systems rather than humans. This noisy nature leads to many interesting observations.
General Observations The first thing we try is to train a KPG model on both KP20K and MAGKP. During training, the two datasets are fed to the model alternately; we denote this data-mixing strategy as ALT. In Figure 4, we compare models trained on both KP20K and MAGKP against models trained solely on KP20K. The extra MAGKP data brings consistent improvements across most model architecture and training paradigm variants, suggesting that the KPG models discussed in this work can benefit from additional training data. Among all settings, the F1@O of TRANS+One2Seq is boosted by 3 points on present KPG; the resulting score outperforms the other variants by a significant margin and even surpasses a host of state-of-the-art models (see the comparison in Appendix E). The same setting also obtains a 2.3-point boost in R@50 on the absent KPG task, making TRANS+One2Seq the setting that benefits most from extra data. In contrast, the extra MAGKP data provides only marginal improvements to RNN-based models; on present KPG, RNN+One2Seq even shows an F1@O drop when trained with more data.
As mentioned in §3, the RNN model is significantly lighter than TRANS. To investigate whether an RNN with more parameters can benefit more from MAGKP, we conduct experiments with a GRU with a much larger hidden size (dubbed BIGRNN). The results (in Appendix E) suggest otherwise: the extra training data has a negative effect with One2One and yields only marginal gains with One2Seq. We thus believe the architectural difference between TRANS and RNN is the likely cause; for instance, the built-in self-attention mechanism may help TRANS models learn from noisy data.
Learning with Noisy Data To further investigate the performance boost brought by the MAGKP dataset on TRANS+One2Seq, we would like to know which portion of the noisy data helps the most. As a natural way to cluster the MAGKP data, we define noisiness by the number of keyphrases per data point. As shown in Figure 5, the distribution of MAGKP (black border) covers a much wider spectrum on the x-axis than KP20K (red). Because KP20K's keyphrase labels are provided by human authors, the majority of its keyphrase counts lie in the range [3, 6]; in contrast, less than 20% of the MAGKP data falls in this range.
We thus break MAGKP down into a set of smaller subsets: 1) MAGKP-LN is a considerably Less Noisy subset containing data points with 3-6 phrases; 2) MAGKP-Nlarge is the Noisy subset in which all data points have more than 10 keyphrases; 3) MAGKP-Nsmall is a randomly sampled subset of MAGKP-Nlarge with the same size as MAGKP-LN.
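The subset construction can be sketched as follows (thresholds follow the description above; the helper name is ours):

```python
import random

def split_by_noise(dataset, rng=None):
    """Split a keyphrase corpus into MAGKP-style subsets.

    `dataset` is a list of (source, keyphrases) pairs; 3-6 phrases is
    treated as less noisy, >10 as noisy, per the paper's thresholds.
    """
    rng = rng or random.Random(0)
    ln = [ex for ex in dataset if 3 <= len(ex[1]) <= 6]      # MAGKP-LN
    nlarge = [ex for ex in dataset if len(ex[1]) > 10]       # MAGKP-Nlarge
    nsmall = rng.sample(nlarge, min(len(ln), len(nlarge)))   # MAGKP-Nsmall
    return ln, nlarge, nsmall
```

Sampling MAGKP-Nsmall to match MAGKP-LN's size keeps the two subsets comparable in a size-controlled way.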
We also define a set of data-mixing strategies to compare against ALT. ONLY: models are trained solely on a single set (or subset) of data. MX: KP20K and MAGKP (or its subset) are split into shards (10k each) that are randomly sampled during training. FT: models are pre-trained on MAGKP (or its subset) and fine-tuned on KP20K.
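The ALT and MX strategies can be sketched as follows (a simplified illustration with hypothetical helpers; FT is simply sequential training on one corpus and then the other, so it is omitted):

```python
import random

def alt_batches(kp20k_batches, magkp_batches):
    # ALT: feed batches from the two corpora alternately.
    for a, b in zip(kp20k_batches, magkp_batches):
        yield a
        yield b

def mx_shards(kp20k, magkp, shard_size=10000, rng=None):
    # MX: cut the combined data into shards and visit them in random order.
    rng = rng or random.Random(0)
    data = list(kp20k) + list(magkp)
    shards = [data[i:i + shard_size] for i in range(0, len(data), shard_size)]
    rng.shuffle(shards)
    return shards
```

In practice the shard size (10k in the paper) trades off mixing granularity against data-loading overhead.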
In Figure 6, we observe that none of the MAGKP subsets can match KP20K's performance in the ONLY setting. Because MAGKP-LN and MAGKP-Nsmall are similar in size to KP20K, this suggests the distributional shift between MAGKP and the six test sets is significant. In the MX setting, where KP20K is mixed with noisy data, we observe a notable performance boost over ONLY (though still lower than ALT); however, we do not see clear patterns among the four MAGKP subsets in this setting. In the FT setting, we observe a surge in scores across all MAGKP subsets. On present KPG, both MAGKP and MAGKP-Nlarge outperform the score achieved in the ALT setting; similarly, on absent KPG, MAGKP, MAGKP-Nlarge, and MAGKP-Nsmall exceed the ALT score. To our surprise, the subsets considered noisy provide a greater performance boost, even though models trained ONLY on these subsets perform poorly.
To sum up, in our investigation of augmenting KP20K with the noisy MAGKP data, we obtain the best performance from a TRANS+One2Seq model pre-trained on MAGKP and then fine-tuned on KP20K; this performance exceeds that of current state-of-the-art models. We conjecture that the gain comes from data diversity, because MAGKP covers a much wider distribution of data than the author-keyword distribution in KP20K. This inspires us to develop data augmentation techniques that exploit the diversity in unlabeled data.

Related Work
Traditional Keyphrase Extraction Keyphrase extraction has been studied extensively for decades. A common approach is to formulate it as a two-step process. Specifically, a system first heuristically selects a set of candidate phrases from the text using pre-defined features (Witten et al., 1999; Liu et al., 2011; Wang et al., 2016; Yang et al., 2017). Subsequently, a ranker selects the top-ranked candidates following various criteria. The ranker can be bagged decision trees (Medelyan et al., 2009; Lopez and Romary, 2010), a Multi-Layer Perceptron, a Support Vector Machine (Lopez and Romary, 2010), or PageRank (Mihalcea and Tarau, 2004; Le et al., 2016; Wan and Xiao, 2008). Compared to newly developed data-driven approaches with deep neural networks, these approaches suffer from poor performance and the need for dataset-specific heuristic design. More recent work (2020) introduces hierarchical decoding and an exclusion mechanism to prevent generating duplicates. Çano and Bojar (2019) also propose to utilize more data, but their goal is to bridge KPG with summarization.

Conclusion and Takeaways
We present an empirical study discussing neural KPG models from various aspects. Through extensive experiments and analysis, we answer the three questions posed in §1. The results suggest that given a carefully chosen architecture and training strategy, a base model can perform comparably with elaborate SOTA models. Further augmented with (noisy) data in the right way, a base model can outperform SOTA models (Appendix E). We strive to provide a guideline for choosing such architectures and training strategies, which we hope will prove valuable and helpful to the community.
We conclude our discussion with the following takeaways:

Contents in Appendices:
• In Appendix A, we provide the formal definitions of all evaluation metrics used in this work.
• In Appendix B, we provide detailed statistics of all datasets used in this work.
• In Appendix C, we provide observations and analysis on additional factors that can affect a KPG system's performance.
• In Appendix D, we show the set of heat maps not included in the main content due to space limits.
• In Appendix E, we provide a complete set of numbers covering all results discussed in this work, compared with a set of SOTA models from the existing literature.
• In Appendix F, we provide implementation details that help to reproduce our experiments.

A Evaluation Metric Definition
In this section, we provide definitions of the metrics used in this work. All metrics are adopted from Meng et al. (2017) and Yuan et al. (2020).
To make the results easy to reproduce, we simply report macro-averaged scores over all data examples in a dataset (rather than removing examples that contain no present/absent phrases). Since some data examples contain no valid present/absent phrase and thus receive zero scores, our results can be lower than previously reported ones. Given a data example consisting of a source text X and a list of target keyphrases Y, suppose that a model predicts a list of unique keyphrases Ŷ = (ŷ_1, ..., ŷ_m) ordered by prediction quality, and that the ground-truth keyphrases for the given source text form the oracle set Y. When only the top k predictions Ŷ_:k = (ŷ_1, ..., ŷ_min(k,m)) are used for evaluation, precision, recall, and F1-score are consequently conditioned on k and defined as:

P@k = |Ŷ_:k ∩ Y| / |Ŷ_:k|,  R@k = |Ŷ_:k ∩ Y| / |Y|,  F1@k = 2 · P@k · R@k / (P@k + R@k).  (1)

Thus the metrics are defined as:
• F1@5: F1@k with k = 5.
• F1@O: O denotes the number of oracle (ground-truth) keyphrases. In this case k = |Y|, i.e., for each data example, the number of predicted phrases used for evaluation equals the number of ground-truth keyphrases.
• F1@M: M denotes the number of predicted keyphrases. In this case k = |Ŷ|, and we simply take all predicted phrases for evaluation without truncation.
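These definitions can be sketched in code (a minimal reimplementation of Eq. (1); passing k=None yields F1@M and k=|Y| yields F1@O):

```python
def f1_at_k(predicted, oracle, k=None):
    # Precision/recall/F1 at cutoff k over unique predictions (Eq. 1).
    topk = predicted if k is None else predicted[:k]
    matches = len(set(topk) & set(oracle))
    p = matches / len(topk) if topk else 0.0
    r = matches / len(oracle) if oracle else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```

Examples with no valid present/absent phrases score zero under this implementation, consistent with the macro-averaging convention described above.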

B Statistics of Datasets
We provide details of the datasets used in this work. We use KP20K-train (Meng et al., 2017) and MAGKP (Sinha et al., 2015) for training keyphrase generation models; both are built from scientific publications in the Computer Science domain. Nevertheless, their distributions are considerably different; e.g., MAGKP contains three times more keyphrases per example on average than KP20K. This is because KP20K is constructed from real author keywords, whereas MAGKP may contain a vast number of keyphrases annotated by automatic systems. Detailed statistics are listed in Table 2. We also leave out a certain number of data points from KP20K for validation and testing (KP20K-VALID and KP20K-TEST). We additionally use five other datasets for evaluation, as shown in Table 3. All except DUC come from scientific publications in the Computer Science domain. KRAPIVIN uses author-provided keywords as targets, the same as KP20K. INSPEC, NUS, and SEMEVAL contain author-assigned keywords plus additional keyphrases provided by third-party annotators. DUC, unlike all of the above, is a keyphrase dataset based on news articles. Since it represents a rather different distribution from the scientific publication datasets, obtaining a decent test score on DUC hypothetically requires extra generalizability.

C Other Model Design Aspects
Besides the findings discussed in the paper, there exist other important factors that affect the general performance of KPG models. We provide two additional empirical results that we think might be of interest to certain readers.

C.1 Effect of Copy Mechanism
Previous works have shown the importance of the copy mechanism with RNN+One2One, but no further comparison has been made. In Figure 7, we present the results of four KPG model variants, equipped with and without the copy mechanism. The results show that the copy mechanism leads to considerable improvements on present KPG, especially for RNN. TRANS benefits less from the copy mechanism, which may be because its multi-head attention behaves similarly to a copy mechanism even without explicit training losses. With regard to the absent KPG results, the copy mechanism only helps RNN+One2One. This suggests that TRANS can achieve consistently better abstractiveness (absent performance) by disabling the copy mechanism, at the cost of weaker extractiveness. This dilemma cautions researchers to use the copy mechanism more judiciously according to specific applications.
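For readers unfamiliar with the mechanism, the mixing step at the core of a copy mechanism (pointer-generator style) can be sketched as follows; the function name and toy numbers are our own illustration, not the paper's actual implementation:

```python
def copy_final_dist(p_vocab, copy_attn, src_token_ids, p_gen):
    """Mix the generation and copy distributions.

    p_vocab: generation distribution over the vocabulary (sums to 1).
    copy_attn: attention weights over source positions (sums to 1).
    src_token_ids: vocabulary id of each source token.
    p_gen: probability of generating from the vocabulary vs. copying.
    """
    p_final = [p_gen * p for p in p_vocab]
    for attn, tok_id in zip(copy_attn, src_token_ids):
        # route attention mass to the vocabulary id of the source token;
        # repeated source tokens accumulate their attention weights
        p_final[tok_id] += (1.0 - p_gen) * attn
    return p_final


# a 4-word toy vocabulary; the source is three tokens with ids [2, 0, 2]
dist = copy_final_dist([0.25, 0.25, 0.25, 0.25],
                       [0.6, 0.3, 0.1], [2, 0, 2], p_gen=0.5)
assert abs(sum(dist) - 1.0) < 1e-9    # still a valid distribution
```

The sketch makes the extractiveness bias visible: the lower p_gen is, the more probability mass is locked onto tokens that literally appear in the source, which is consistent with the present/absent trade-off observed above.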

C.2 Effect of Beam Width
As discussed in §2, one unique challenge of the KPG task is the multiplicity of its target outputs. As a result, a common strategy is to take multiple beams during decoding in order to obtain more phrases (as opposed to greedy decoding). This choice is at times not only practical but in fact necessary: under the One2One paradigm, for example, it is crucial to have multiple beams in order to generate multiple keyphrases for a given input.
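The necessity of multiple beams under One2One can be illustrated with a toy beam search; the bigram scoring table below is a made-up stand-in for a trained decoder, not part of any experiment in this paper:

```python
import heapq
import math

# hypothetical bigram log-probabilities playing the role of a decoder
BIGRAM_LOGPROB = {
    ("<s>", "neural"): math.log(0.6),
    ("<s>", "deep"): math.log(0.4),
    ("neural", "network"): math.log(0.9),
    ("neural", "</s>"): math.log(0.1),
    ("deep", "learning"): math.log(1.0),
    ("network", "</s>"): math.log(1.0),
    ("learning", "</s>"): math.log(1.0),
}

def beam_search(beam_width, max_len=4):
    beams = [(0.0, ["<s>"])]          # (cumulative log-prob, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for (prev, nxt), logp in BIGRAM_LOGPROB.items():
                if prev == seq[-1]:
                    candidates.append((score + logp, seq + [nxt]))
        # keep only the `beam_width` highest-scoring expansions
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        finished.extend(b for b in beams if b[1][-1] == "</s>")
        beams = [b for b in beams if b[1][-1] != "</s>"]
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # each finished beam yields one candidate keyphrase
    return [" ".join(seq[1:-1]) for _, seq in finished]
```

With beam_width=1 (greedy) this toy decoder can only ever produce one phrase, whereas beam_width=2 already yields two distinct keyphrase candidates from the same input.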
Generally speaking, KPG and its evaluation metrics favor higher recall. It is thus not totally unexpected that the high precision scores of greedy decoding are often undermined by notable disadvantages in recall, which in turn leads to losing by large margins in F-scores when compared to the results of beam search (with multiple beams). Empirically, as shown in Figure 8, we observe that beam search can sometimes achieve a relative gain of more than 10% in present phrase generation, and a much larger performance boost in absent phrase generation, over greedy decoding.
We are also interested in seeing whether there exists an optimal beam width. In Figure 8, we show models' testing performance when various beam widths are used. On the present KPG task with One2One (upper left), a beam width of 16 already provides an optimal score; larger beam widths (even 200) do not show any further advantage. Replacing the training paradigm with One2Seq (upper right), we observe a positive correlation between beam width and testing score: larger beam widths lead to marginally better testing scores. However, the improvement (from a beam width of 10 to 50) is not significant.
On the absent KPG task (lower), both the One2One and One2Seq paradigms seem to benefit from larger beam widths: the testing score shows a strong positive correlation with beam width. We observe that this trend is consistent across different model architectures.
Overall, a larger beam width provides better scores in most settings, but the performance gain diminishes quickly toward very large beam widths. In addition, it is worth noting that a larger beam width also comes with greater computational demands, in both space and time. As an example, in Figure 8 (top left), we observe that with the One2One training paradigm, a beam width of 200 does not show a significant advantage over 16; however, in terms of computation, a beam width of 200 takes about 10× the resources of 16. There clearly exists a trade-off between beam width and computational efficiency (e.g., carbon footprint (Strubell et al., 2019)). We thus hope our results can serve as a reference for researchers, to help them choose beam widths more wisely depending on specific tasks.

D Does Order Matter in One2Seq? -Additional Results
In §4, we show the performance of models trained with the One2Seq paradigm using different target ordering strategies. Here we provide the complete set of heat maps.
In Figure 9, we show present KPG testing scores in F1@O, when using either greedy decoding or beam search as the decoding strategy.
In Figure 10, we show present KPG testing scores in F1@10, when using either greedy decoding or beam search as the decoding strategy.
In Figure 11, we show absent KPG testing scores in R@50, when using either greedy decoding or beam search as the decoding strategy.
In addition, we show in Figures 12, 13, and 14 the number of unique predictions on the all/present/absent KPG tasks.

E Complete Results
In this section, we report the full set of our experimental results.
In Table 4, we report all the testing scores on present keyphrase generation tasks. For all experiments, we use F1@5, F1@10, F1@O, and F1@M to evaluate a model's performance. Additionally, we provide an average score for each of the 4 metrics over all datasets (over each row in Table 4).
In Table 5, we report all the testing scores on absent keyphrase generation tasks. For all experiments, we use R@10 and R@50 to evaluate a model's performance. Additionally, we provide an average score for each of the 2 metrics over all datasets (over each row in Table 5).
In Table 7, we report detailed present-phrase testing scores for models trained with the One2Seq paradigm using different ordering strategies. For all experiments, we use F1@5, F1@10, F1@O, and F1@M to evaluate a model's performance.
In Table 8, we report detailed absent-phrase testing scores for models trained with the One2Seq paradigm using different ordering strategies. For all experiments, we use R@10 and R@50 to evaluate a model's performance.
We also provide scores against all ground-truth phrases (without splitting into present/absent) in Tables 6 and 9, to avoid inconsistency in data processing (the present/absent split may vary with the tokenization method).
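As a rough illustration of why the split can vary, the following is a minimal present/absent splitting sketch. It is our own simplification: it only lowercases and splits on whitespace, whereas released implementations typically stem tokens first (e.g., with a Porter stemmer), which is precisely what makes the split tokenization-dependent.

```python
def is_present(keyphrase, source_text):
    """A keyphrase is 'present' if its token sequence occurs
    contiguously in the source text (after lowercasing)."""
    src = source_text.lower().split()
    kp = keyphrase.lower().split()
    return any(src[i:i + len(kp)] == kp
               for i in range(len(src) - len(kp) + 1))

def split_present_absent(keyphrases, source_text):
    present = [k for k in keyphrases if is_present(k, source_text)]
    absent = [k for k in keyphrases if not is_present(k, source_text)]
    return present, absent
```

Under this convention, "beam search" is present for the source "We use beam search for decoding" while "copy mechanism" is absent; with stemming, phrases that differ only by inflection would flip from absent to present.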

F Implementation Details
All the code and data have been released at https://github.com/memray/OpenNMT-kpg-release, including the new MAGKP dataset.
We use the concatenation of title and abstract as the source text. When a training data point contains more than 8 ground-truth keyphrases, we randomly sample 8 of them to build the training target. This prevents out-of-memory issues and speeds up training.
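The sampling-and-joining step for One2Seq targets can be sketched as follows; the delimiter token and function name are illustrative, not necessarily those used in the released code:

```python
import random

MAX_TARGET_PHRASES = 8   # cap described above
SEP = " <sep> "          # delimiter token joining phrases into one sequence

def build_one2seq_target(keyphrases, rng=random):
    """Sample at most 8 gold keyphrases and join them into one target."""
    if len(keyphrases) > MAX_TARGET_PHRASES:
        keyphrases = rng.sample(keyphrases, MAX_TARGET_PHRASES)
    return SEP.join(keyphrases)
```

Capping the target length this way bounds the decoder sequence length per example, which is what keeps memory usage predictable during training.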
We train RNN models for 100k steps and TRANS models for 300k steps; TRANS generally benefits from longer training, especially for absent KPG performance. For the FT setting, we train models for an additional 100k steps.