Semi-Supervised Learning for Neural Keyphrase Generation

We study the problem of generating keyphrases that summarize the key points of a given document. While sequence-to-sequence (seq2seq) models have achieved remarkable performance on this task (Meng et al., 2017), model training often relies on large amounts of labeled data, which are available only in resource-rich domains. In this paper, we propose semi-supervised keyphrase generation methods that leverage both labeled data and large-scale unlabeled samples for learning. Two strategies are proposed. First, unlabeled documents are tagged with synthetic keyphrases obtained from unsupervised keyphrase extraction methods or a self-learning algorithm, and then combined with labeled samples for training. Second, we investigate a multi-task learning framework that jointly learns to generate keyphrases as well as the titles of the articles. Experimental results show that our semi-supervised learning-based methods outperform a state-of-the-art model trained with labeled data only.


Introduction
Keyphrase extraction concerns the task of selecting a set of phrases from a document that can indicate the main ideas expressed in the input (Turney, 2000; Hasan and Ng, 2014). It is an essential task for document understanding because accurate identification of keyphrases can benefit a wide range of downstream natural language processing and information retrieval applications. For instance, keyphrases can be leveraged to improve text summarization systems (Zhang et al., 2004; Liu et al., 2009a; Wang and Cardie, 2013), facilitate sentiment analysis and opinion mining (Wilson et al., 2005; Berend, 2011), and help with document clustering (Hammouda et al., 2005). Though relatively easy to implement, extraction-based approaches fail to generate keyphrases that do not appear in the source document, which are frequently produced by human annotators, as shown in Figure 1. Recently, Meng et al. (2017) proposed to use a sequence-to-sequence model (Sutskever et al., 2014) with a copying mechanism for keyphrase generation, which is able to produce phrases that are not in the input documents.

Document (Figure 1):
In this paper, we consider an enthalpy formulation for a two-phase Stefan problem arising from the solidification of aluminum during process. We solve this free boundary problem in a time varying three-dimensional domain and consider convective heat transfer in the liquid phase. The resulting equations are discretized using a characteristics method in time and a method in space, and we propose a numerical algorithm to solve the obtained nonlinear discretized problem. Finally, numerical results are given which are compared with industrial experimental measurements.

* Work was done while visiting Northeastern University.

While the seq2seq model demonstrates good performance on keyphrase generation (Meng et al., 2017), it relies heavily on massive amounts of labeled data for model training, which are often unavailable for new domains. To overcome this drawback, we investigate semi-supervised learning for keyphrase generation, leveraging abundant unlabeled documents along with limited labeled data. Intuitively, additional documents, though unlabeled, can provide useful knowledge about universal linguistic features and discourse structure, such as context information for keyphrases and the fact that keyphrases are likely to be noun phrases or main verbs. Furthermore, learning with unlabeled data can mitigate overfitting, which is often caused by the small size of the labeled training data, and thus improve model generalizability and enhance keyphrase generation performance on unseen data.
Concretely, two major approaches are proposed for leveraging unlabeled data. In the first, unlabeled documents are tagged with synthetic keyphrases and then mixed with labeled data for model pre-training. Synthetic keyphrases are acquired through existing unsupervised keyphrase extraction methods (e.g., TF-IDF or TextRank (Mihalcea and Tarau, 2004)) or a self-learning algorithm. The pre-trained model is then fine-tuned on the labeled data only. In the second, we propose a multi-task learning (MTL) framework that jointly learns the main task of keyphrase generation on labeled samples and an auxiliary task of title generation (Rush et al., 2015) on unlabeled documents. Here one encoder is shared between the two tasks. Importantly, although we test our proposed methods on a seq2seq framework, we believe they can be easily adapted to other encoder-decoder-based systems.
Extensive experiments are conducted in the scientific paper domain. Results on five different datasets show that all of our semi-supervised learning-based models consistently and significantly outperform a state-of-the-art model (Meng et al., 2017) as well as several competitive unsupervised and supervised keyphrase extraction algorithms, based on F1 and recall scores. We further carry out a cross-domain study on generating keyphrases for news articles, where our models yield better F1 than a model trained on labeled data only. Finally, we show that training with unlabeled samples produces further performance gains even when a large amount of labeled data is available.

Related Work
Keyphrase Extraction and Generation. Early work mostly focuses on the keyphrase extraction task, and a two-step strategy is typically designed. Specifically, a large pool of candidate phrases is first extracted according to predefined syntactic templates (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Liu et al., 2009b, 2011) or their estimated importance scores (Hulth, 2003). In the second step, re-ranking is applied to yield the final keyphrases, based on supervised learning (Hulth, 2003; Lopez and Romary, 2010; Kan, 2009), unsupervised graph algorithms (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Bougouin et al., 2013), or topic modeling (Liu et al., 2009c, 2010). Keyphrase generation is studied in more recent work. For instance, Liu et al. (2011) propose to use a statistical machine translation model to learn word alignments between documents and keyphrases, enabling the model to generate keyphrases which do not appear in the input. Meng et al. (2017) train seq2seq-based generation models (Sutskever et al., 2014) on large-scale labeled corpora, which may not be applicable to a new domain with minimal human labels.
Neural Semi-supervised Learning. As mentioned above, though seq2seq models have achieved significant success in many NLP tasks (Luong et al., 2015; See et al., 2017; Dong and Lapata, 2016; Wang and Ling, 2016; Ye et al., 2018), they often rely on large amounts of labeled data, which are expensive to obtain. To mitigate the problem, semi-supervised learning has been investigated to incorporate unlabeled data for model training (Dai and Le, 2015; Ramachandran et al., 2017). For example, the neural machine translation community has studied the use of source-side or target-side monolingual data to improve translation quality (Gülçehre et al., 2015), where generating synthetic data (Sennrich et al., 2016; Zhang and Zong, 2016), multi-task learning (Zhang and Zong, 2016), and autoencoder-based methods (Cheng et al., 2016) are shown to be effective. Multi-task learning is also examined for sequence labeling tasks (Rei, 2017; Liu et al., 2018).
In our paper, we study semi-supervised learning for keyphrase generation based on seq2seq models, which has not been explored before. Besides, we focus on leveraging source-side unlabeled articles to enhance performance with synthetic keyphrase construction or multi-task learning.

Neural Keyphrase Generation Model
In this section, we describe the neural keyphrase generation model built on a sequence-to-sequence model (Sutskever et al., 2014), as illustrated in Figure 2. We denote the input source document as a sequence $x = x_1 \cdots x_{|x|}$ and its corresponding keyphrase set as $a = \{a_i\}_{i=1}^{|a|}$, with $a_i$ as one keyphrase.
Keyphrase Sequence Formulation. Different from the setup of Meng et al. (2017), where the input article is paired with each keyphrase $a_i$ to form a training sample, we concatenate the keyphrases in $a$ into a keyphrase sequence $y = a_1 ♦ a_2 ♦ \cdots ♦ a_{|a|}$, where ♦ is a segmenter that separates the keyphrases (see footnote 2). With this setup, the seq2seq model is capable of generating all possible keyphrases in one sequence as well as capturing the contextual information between keyphrases from the same sequence.
Figure 2: Neural keyphrase generation model built on the sequence-to-sequence framework. The input is the document, and the output is the keyphrase sequence consisting of phrases present ($key_i$) or absent ($key^n_i$) in the input.
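The formulation above can be illustrated with a short Python sketch. It is a minimal illustration rather than the original implementation: the function names and the example keyphrases are hypothetical, and the ";" separator follows the segmenter the paper reports using.

```python
SEGMENTER = ";"  # the paper reports setting the segmenter to ";"


def to_keyphrase_sequence(keyphrases):
    """Join a document's keyphrases, in corpus order, into one target sequence."""
    return f" {SEGMENTER} ".join(kp.strip() for kp in keyphrases)


def from_keyphrase_sequence(sequence):
    """Split a generated sequence back into keyphrases, dropping empty pieces."""
    return [kp.strip() for kp in sequence.split(SEGMENTER) if kp.strip()]


# Hypothetical keyphrases, used only to show the round trip.
target = to_keyphrase_sequence(["keyphrase generation", "semi-supervised learning"])
recovered = from_keyphrase_sequence(target)
```

One target sequence per document means that each training sample carries all of the document's keyphrases, instead of one sample per keyphrase as in the paired setup.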
Seq2Seq Attentional Model. With source document $x$ and its keyphrase sequence $y$, an encoder encodes $x$ into context vectors, from which a decoder then generates $y$. We set the encoder as a one-layer bi-directional LSTM and the decoder as another one-layer LSTM (Hochreiter and Schmidhuber, 1997). The probability of generating the target sequence, $p(y|x)$, is formulated as:
$$p(y|x) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x) \qquad (1)$$
where $y_{<t} = y_1 \cdots y_{t-1}$.
Let $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ denote the hidden state vector in the encoder at time step $t$, which is the concatenation of the forward hidden vector $\overrightarrow{h}_t$ and the backward hidden vector $\overleftarrow{h}_t$. The decoder hidden state is calculated as [equation missing in source]. We apply the global attention mechanism (Luong et al., 2015) to calculate the context vector: [Equation (2) missing in source] where $\alpha_{t,i}$ is the attention weight and $W_{att}$ contains learnable parameters. In this paper, we omit the bias variables to save space. The probability of predicting $y_t$ in the decoder at time step $t$ is factorized as: [equation missing in source] where $f_{softmax}$ is the softmax function and $W_{d_1}$ and $W_{d_2}$ are learnable parameters.

Footnote 2: We concatenate keyphrases following the original keyphrase order in the corpora, and we set ♦ as ";" in our implementation. The effect of keyphrase ordering will be studied in future work.

Algorithm 1 Keyphrase Ranking
Input: Generated top R keyphrase sequences $S = [y_1, \cdots, y_R]$, ranked by generation probability from high to low with beam search
Output: Ranked keyphrase set $A$ with importance from high to low
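For orientation, one common way to instantiate the quantities named in this paragraph is sketched below, following the general form of global attention in Luong et al. (2015). Every expression here is an assumption: the decoder state symbol $s_t$, the decoder LSTM function $f_{\mathrm{LSTM}_d}$ and its arguments, the bilinear score function, and the placement of $W_{d_1}$ and $W_{d_2}$ are illustrative choices that may differ from the paper's own equations.

```latex
\begin{align*}
s_t &= f_{\mathrm{LSTM}_d}(y_{t-1}, s_{t-1}) \\
c_t &= \sum_{i=1}^{|x|} \alpha_{t,i}\, h_i, \qquad
\alpha_{t,i} = \frac{\exp\!\big(s_t^{\top} W_{att}\, h_i\big)}
                    {\sum_{k=1}^{|x|} \exp\!\big(s_t^{\top} W_{att}\, h_k\big)} \\
p(y_t \mid y_{<t}, x) &= f_{softmax}\!\big(W_{d_1} \tanh(W_{d_2}\,[s_t; c_t])\big)
\end{align*}
```

Under this reading, the attention weights $\alpha_{t,i}$ form a distribution over source positions, and the context vector $c_t$ is the corresponding weighted average of encoder hidden states.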
Pointer-generator Network. Similar to Meng et al. (2017), we utilize a copying mechanism via the pointer-generator network (See et al., 2017) to allow the decoder to directly copy words from the input document, thus mitigating the out-of-vocabulary (OOV) problem. At time step $t$, the generation probability $p_{gen}$ is calculated as: [equation missing in source] where $f_{sigmoid}$ is a sigmoid function; $W_c$, $W_s$ and $W_y$ are learnable parameters. $p_{gen}$ acts as a switch that chooses between generating a word from a fixed vocabulary with probability $p_{vocab}$ and directly copying a word from the source document with the attention distribution $\alpha_t$. With a combination of a fixed vocabulary and the extended source document vocabulary, the probability of predicting $y_t$ is: [equation missing in source] where, if $y_t$ does not appear in the fixed vocabulary, the first term will be zero; and if $y_t$ is outside the source document, the second term will be zero.
Figure 3: Our two semi-supervised learning methods, which are based on (a) synthetic keyphrase construction and (b) multi-task learning. $(X, Y)$ represents a labeled sample. $X^{(1)}$ and $X^{(2)}$ denote unlabeled documents. $Y^{(1)}$ refers to the synthetic keyphrases of $X^{(1)}$, and $Y^{(2)}$ is the title of $X^{(2)}$. For synthetic keyphrase construction, the model is pre-trained on the mixture of gold-standard and synthetic data, then fine-tuned on labeled data. For multi-task learning, the model parameters of the main task and the auxiliary task are jointly updated. Encoder parameters are shared between the two tasks.
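The two-term copy/generate mixture described above can be sketched in plain Python. This assumes the mixture form of See et al. (2017), in which the probability of a word is $p_{gen}$ times its vocabulary probability plus $(1 - p_{gen})$ times the attention mass on its source positions; the function name, argument names, and toy numbers are hypothetical.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix generation and copy probabilities into one distribution.

    p_gen: probability of generating from the fixed vocabulary (0..1).
    p_vocab: dict mapping each fixed-vocabulary word to its probability.
    attention: list of attention weights, one per source token.
    source_tokens: list of source words aligned with `attention`.
    """
    # First term: generate from the fixed vocabulary.
    mixed = {word: p_gen * prob for word, prob in p_vocab.items()}
    # Second term: copy, summing attention over every position of a word.
    for word, weight in zip(source_tokens, attention):
        mixed[word] = mixed.get(word, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Toy values: "enthalpy" is out of vocabulary and can only be copied.
dist = final_distribution(
    p_gen=0.6,
    p_vocab={"model": 0.7, "data": 0.3},
    attention=[0.5, 0.25, 0.25],
    source_tokens=["enthalpy", "model", "enthalpy"],
)
```

In the toy call, "data" receives probability only from the first term (it is absent from the source), "enthalpy" only from the second (it is outside the fixed vocabulary), and "model" from both.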
Supervised Learning. With a labeled dataset $D_p$, the loss function of the seq2seq model is as follows: [equation missing in source] where $\theta$ contains all model parameters.
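A standard choice for such an objective is the negative log-likelihood of the gold keyphrase sequences. The form below is an assumption consistent with the factorization of $p(y|x)$ given earlier, not a recovered equation; $N$ is a hypothetical name for the number of labeled pairs.

```latex
L(\theta) = -\sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)
```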

Keyphrase Inference and Ranking Strategy.
Beam search is utilized for decoding, and the top R keyphrase sequences are used to produce the final keyphrases. Here we use a beam size of 50 and set R to 50. We propose a ranking strategy to collect the final set of keyphrases. Concretely, we collect unique keyphrases sequence by sequence, from the top-ranked beams to the lower-ranked beams, and keyphrases in the same sequence are ordered as in the generation process. Intuitively, higher-ranked sequences are likely to be of better quality. As for keyphrases in the same sequence, we find that more salient keyphrases are usually generated first. The ranking method is presented in Algorithm 1.
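The ranking strategy described in this paragraph can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 1; the function name and the example beams are hypothetical, and the ";" separator is the segmenter the paper reports using.

```python
def rank_keyphrases(sequences, segmenter=";"):
    """Collect unique keyphrases from beam-ranked keyphrase sequences.

    sequences: top-R generated sequences, ordered from most to least probable.
    Returns keyphrases ordered by beam rank, then by position in the sequence.
    """
    ranked, seen = [], set()
    for sequence in sequences:  # higher-ranked beams first
        for phrase in sequence.split(segmenter):
            phrase = phrase.strip()
            if phrase and phrase not in seen:  # keep the first occurrence only
                seen.add(phrase)
                ranked.append(phrase)
    return ranked


# Hypothetical beam output, best beam first.
beams = [
    "stefan problem ; enthalpy formulation",
    "enthalpy formulation ; free boundary problem",
    "stefan problem ; heat transfer",
]
top_keyphrases = rank_keyphrases(beams)
```

Top-N evaluation would then take the first N entries of the returned list.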

Semi-Supervised Learning for Keyphrase Generation
As illustrated in Figure 3, two methods are proposed to leverage abundant unlabeled data. The first provides synthetic keyphrases for unlabeled documents using unsupervised keyphrase extraction methods or a self-learning algorithm, and then mixes them with labeled data for model training, as described in Section 4.1. The second introduces multi-task learning that jointly generates keyphrases and the title of the document (Section 4.2). We denote the large-scale unlabeled documents as $D_u$.

Synthetic Keyphrase Construction
The first proposed technique is to construct synthetic labeled data by assigning keyphrases to unlabeled documents, and then mix the synthetic data with human-labeled data for model training. Intuitively, adding training samples with synthetic keyphrases has two potential benefits: (1) the encoder is exposed to more documents in training, and (2) the decoder also benefits from additional contextual information for identifying keyphrases. We propose and compare two methods to extract synthetic keyphrases.
Unsupervised Learning Methods. Unsupervised methods for keyphrase extraction have long been studied in previous work (Mihalcea and Tarau, 2004; Wan and Xiao, 2008). Here we select two effective and widely used methods, TF-IDF and TextRank (Mihalcea and Tarau, 2004), to select keyphrases on the unlabeled dataset $D_u$. We combine the two methods into a hybrid approach: we first apply each method separately to select the top K keyphrases from the document, and then take the union with duplicate removal. To construct the keyphrase sequence, we concatenate the terms from TF-IDF and then those from TextRank, following the corresponding ranking order. We set K to 5 in our experiments.
Algorithm 2 (excerpt): Set $\theta$ as the best parameters from PRE-TRAIN; update $\theta$ on $D_p$ until convergence.
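The hybrid construction of synthetic keyphrase sequences can be sketched as below. The sketch assumes that the TF-IDF and TextRank rankings have already been computed elsewhere and are passed in as lists sorted from best to worst; the function name and example phrases are hypothetical.

```python
def hybrid_keyphrases(tfidf_ranked, textrank_ranked, k=5):
    """Union of top-k TF-IDF and top-k TextRank phrases, without duplicates.

    Both inputs are assumed to be lists already sorted from best to worst
    by their own extractor; TF-IDF terms come first in the output.
    """
    merged = []
    for phrase in tfidf_ranked[:k] + textrank_ranked[:k]:
        if phrase not in merged:  # duplicate removal keeps the first occurrence
            merged.append(phrase)
    return " ; ".join(merged)


# Hypothetical extractor outputs, with k=2 to keep the example short.
synthetic = hybrid_keyphrases(
    ["stefan problem", "enthalpy", "aluminum"],
    ["enthalpy", "heat transfer", "domain"],
    k=2,
)
```

Because duplicates are removed, the resulting sequence contains between K and 2K phrases.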
Self-learning Algorithm. Inspired by prior work (Zhang and Zong, 2016; Sennrich et al., 2016), we adopt a self-learning algorithm to augment the training data. Concretely, we first build a baseline model by training the seq2seq model on the labeled corpus $D_p$. The trained baseline model is then used to generate a synthetic keyphrase sequence $y'$ given an unlabeled document $x'$. We adopt beam search with a beam size of 10 to generate the synthetic keyphrase sequences, and the top-ranked beam is selected.
Training Procedure. After the synthetic data $D_s$ is obtained by either of the aforementioned methods, we mix the labeled data $D_p$ with $D_s$ to train the seq2seq model. As described in Algorithm 2, we combine $D_p$ with $D_s$ into $D_{p+s}$ and shuffle $D_{p+s}$ randomly; we then pre-train the model on $D_{p+s}$, with no network parameters frozen during training. The best-performing model is selected based on the validation set and then fine-tuned on $D_p$ until convergence.
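The pre-train then fine-tune procedure can be outlined as a small driver function. This is a sketch under stated assumptions: `train_epoch` and `validate` are hypothetical caller-supplied callbacks standing in for the real seq2seq training and validation code, and fixed epoch counts stand in for "until convergence".

```python
import random


def pretrain_then_finetune(labeled, synthetic, train_epoch, validate,
                           pretrain_epochs=3, finetune_epochs=3, seed=0):
    """Pre-train on labeled + synthetic pairs, then fine-tune on labeled pairs.

    train_epoch(params, data) -> params and validate(params) -> loss are
    caller-supplied placeholders for the real seq2seq training code.
    """
    mixed = list(labeled) + list(synthetic)
    random.Random(seed).shuffle(mixed)  # shuffle the combined data D_{p+s}

    params, best, best_loss = None, None, float("inf")
    for _ in range(pretrain_epochs):  # pre-training on D_{p+s}
        params = train_epoch(params, mixed)
        loss = validate(params)
        if loss < best_loss:  # keep the best model on the validation set
            best, best_loss = params, loss

    params = best  # fine-tuning starts from the best pre-trained model
    for _ in range(finetune_epochs):  # fixed count stands in for convergence
        params = train_epoch(params, labeled)
    return params
```

No parameters are frozen in either stage; the only difference between the stages is the data passed to `train_epoch`.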

Multi-task Learning with Auxiliary Task
The second approach to leveraging unlabeled documents is to employ a multi-task learning framework that combines the main task of keyphrase generation with an auxiliary task through a parameter-sharing strategy. Similar to the model structure in Zhang and Zong (2016), our main task and the auxiliary task share an encoder network but have different decoders. Multi-task learning benefits from the source-side information to improve the generality of the encoder.
In most domains, such as scientific papers or news articles, a document usually contains a title that summarizes the core topics or ideas, in a similar spirit to keyphrases. We thus choose title generation as the auxiliary task, which has been studied as a summarization problem (Rush et al., 2015; Colmenares et al., 2015).
With a dataset that pairs the unlabeled documents in $D_u$ with their titles, the loss function of multi-task learning is factorized as: [equation missing in source] where $\theta_e$ indicates the encoder parameters; $\theta_{d_1}$ and $\theta_{d_2}$ are the decoder parameters.
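One plausible form of this factorized objective is the sum of two negative log-likelihood terms that share the encoder parameters. The expression below is an assumption, not a recovered equation: $D_t$ is a hypothetical name for the set of document-title pairs built from $D_u$, $q$ is a hypothetical symbol for a title, and any weighting between the two terms is omitted.

```latex
L(\theta) = -\sum_{(x, y) \in D_p} \log p\big(y \mid x; \theta_e, \theta_{d_1}\big)
            \;-\; \sum_{(x', q) \in D_t} \log p\big(q \mid x'; \theta_e, \theta_{d_2}\big)
```

In this reading, gradients from both tasks update $\theta_e$, while each decoder is updated only by its own task.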
Training Procedure. We adopt a simple alternating training strategy to switch between the main task and the auxiliary task. Specifically, we first estimate parameters on the auxiliary task with $D_u$ for one epoch, then train the model on the main task with $D_p$ (the labeled dataset) for T epochs. We repeat this procedure until the model of our main task converges. We set T to 3.
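The alternating schedule can be written as a small generator. This is only a sketch of the epoch ordering described above; the function name and the `rounds` parameter are hypothetical, and a fixed number of rounds stands in for the convergence check.

```python
def alternating_schedule(rounds, main_epochs_per_round=3):
    """Yield the task to train in each epoch: one auxiliary, then T main."""
    for _ in range(rounds):
        yield "auxiliary"  # title generation on unlabeled documents, one epoch
        for _ in range(main_epochs_per_round):
            yield "main"  # keyphrase generation on labeled data, T epochs


schedule = list(alternating_schedule(rounds=2))
```

A training loop would iterate over this schedule and pick the decoder and dataset that match each yielded task, with the shared encoder updated in every epoch.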

Datasets
Our major experiments are conducted on scientific articles, which have been studied in previous work (Hulth, 2003; Nguyen and Kan, 2007; Meng et al., 2017). We use the dataset from Meng et al. (2017), which is collected from various online digital libraries, e.g., ScienceDirect, ACM Digital Library, Wiley, and other portals. As indicated in Table 1, we construct a relatively small-scale labeled dataset with 40K document-keyphrase pairs and a large-scale dataset of 400K unlabeled documents. Each document contains the abstract and the title of a paper. In the multi-task learning setting, the auxiliary task is to generate the title from the abstract. For the validation set, we collect 5K document-keyphrase pairs for pre-training and fine-tuning in the methods based on synthetic data construction. For multi-task learning, we use the same 5K document-keyphrase pairs for main task training and another 15K document-title pairs for the auxiliary task. We further conduct experiments on a 130K large-scale labeled dataset, which includes the small-scale labeled data. Similar to Meng et al. (2017), we test our model on five widely used public datasets from the scientific domain: INSPEC (Hulth, 2003), NUS (Nguyen and Kan, 2007), KRAPIVIN (Krapivin et al., 2009), SEMEVAL-2010 (Kim et al., 2010), and KP20K (Meng et al., 2017).

Experimental Settings
Data preprocessing is implemented as in Meng et al. (2017). The texts are first tokenized by NLTK (Bird and Loper, 2004) and lowercased, then numbers are replaced with <digit>. We set the maximal length of the source text to 200 and that of the target text to 40. The encoder and decoder both have a vocabulary size of 50K. The word embedding size is set to 128. Embeddings are randomly initialized and learned during training. The size of the hidden vector is 512. The dropout rate is set to 0.3. The maximal gradient norm is 2. Adagrad (Duchi et al., 2011) is adopted to train the model with a learning rate of 0.15 and an initial accumulator value of 0.1.
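The preprocessing steps listed above can be approximated with the standard library alone. This sketch makes two assumptions that differ from the reported setup: a simple regular-expression tokenizer stands in for the NLTK tokenizer, and "numbers" are taken to be integer or decimal tokens; the function name is hypothetical.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"


def preprocess(text, max_len=200):
    """Lowercase, tokenize, map numbers to <digit>, and truncate."""
    # A simple regex tokenizer stands in for NLTK here (assumption).
    tokens = re.findall(rf"{NUMBER}|\w+|[^\w\s]", text.lower())
    tokens = ["<digit>" if re.fullmatch(NUMBER, tok) else tok for tok in tokens]
    return tokens[:max_len]  # 200 for source text; pass 40 for target text


tokens = preprocess("We train on 400 documents with a rate of 0.15.")
```

Mapping every number to a single <digit> token keeps numeric variation from inflating the 50K vocabulary.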
For synthetic data construction, we first use a batch size of 64 for model pre-training and then reduce it to 32 for model fine-tuning. For both training stages, the learning rate is decayed by a factor of 0.5 after 8 epochs. For multi-task learning, the batch size is set to 32 and the learning rate is halved after 20 training epochs. To build the baseline seq2seq model, we set the batch size to 32 and decrease the learning rate after 8 epochs. For the self-learning algorithm, the beam size is set to 10 to generate target sequences for unlabeled data, and the top one is retained.

Comparisons with Baselines
Evaluation Metrics. Following Liu et al. (2011) and Meng et al. (2017), we adopt top-N macro-averaged precision, recall and F-measure (F1) for model evaluation. Precision measures the proportion of the top-N extracted keyphrases that are correct, and recall measures the proportion of target keyphrases that are extracted among the top-N candidates. The Porter Stemmer is applied before comparisons.
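These metrics can be sketched as follows. The code is an illustration under stated assumptions: the `normalize` hook (lowercasing by default) stands in for Porter stemming, precision is divided by the number of predictions actually returned in the top N, and the function names are hypothetical.

```python
def prf_at_n(predicted, gold, n, normalize=str.lower):
    """Precision, recall and F1 of the top-n predictions for one document."""
    top = [normalize(p) for p in predicted[:n]]
    gold_set = {normalize(g) for g in gold}
    correct = sum(1 for p in set(top) if p in gold_set)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_average(per_document_scores):
    """Average each metric over documents (macro-averaging)."""
    count = len(per_document_scores)
    return tuple(sum(scores[i] for scores in per_document_scores) / count
                 for i in range(3))


# Toy document: two of five predictions match two of four gold keyphrases.
p, r, f = prf_at_n(["a", "b", "c", "d", "e"], ["b", "e", "z", "y"], n=5)
```

Macro-averaging computes the scores per document first and then averages them, so every document contributes equally regardless of how many keyphrases it has.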
Results. Here we show results for present and absent keyphrase generation separately. From the results of present keyphrase generation shown in Table 2, although trained on small-scale labeled corpora, our baseline seq2seq model with a copying mechanism still achieves better F1@5 scores than the other baselines. This demonstrates that a baseline seq2seq model has learned the mapping patterns from source text to target keyphrases to some degree. However, small-scale labeled data still hinders the potential of the seq2seq model, as shown by its poor F1@10 performance. By leveraging large-scale unlabeled data, our semi-supervised learning methods achieve significant improvement over the seq2seq baseline and exhibit the best performance in both F1@5 and F1@10 on almost all datasets. We further compare seq2seq-based models on the task of generating keyphrases beyond the input article vocabulary. As illustrated in Table 3, semi-supervised learning significantly improves absent keyphrase generation performance compared to the baseline seq2seq. Among our models, the multi-task learning method is more effective at generating absent keyphrases than the two methods that leverage synthetic data. The main reason may be that synthetic keyphrases potentially introduce noisy annotations, while the decoder in the multi-task learning setting focuses on learning from gold-standard keyphrases. We can also see that the overall performance of all models is low, due to the intrinsic difficulty of absent keyphrase generation (Meng et al., 2017). Moreover, we only employ 40K labeled samples, which is rather limited for training. Besides, we believe better evaluation methods should be used instead of exact match, e.g., by considering paraphrases. This will be studied in future work.

Effect of Synthetic Keyphrase Quality
In this section, we conduct experiments to further study the effect of synthetic keyphrase quality on model performance. Two sets of experiments are undertaken, one for evaluating unsupervised learning and one for self-learning algorithm.
For the self-learning algorithm, we further generate synthetic keyphrases using the following options:
• Beam-size-3: Based on our baseline model trained with labeled data, we use beam search with a smaller beam size of 3 to generate synthetic data.
• Trained-model: We adopt the model that has been trained with the self-learning algorithm on 40K labeled data and 400K unlabeled data to generate the top one keyphrase sequence with a beam size of 10.
For the unsupervised learning method, we originally merge the top-K (K = 5) keyphrases from TF-IDF and TextRank; here we use options where K is set to 1 or 10:
• Top@1: Using TF-IDF and TextRank, we keep only the top 1 extraction from each, then take the union of the two.
• Top@10: Similarly, we keep the top 10 extracted terms from TF-IDF and TextRank, then take the union.
As illustrated in Figure 4, models pre-trained with synthetic keyphrases of better quality perform better: "Trained-model" consistently yields lower perplexity. A similar phenomenon can be observed when "top@5" and "top@10" are applied for extraction in the unsupervised learning setting. Furthermore, after models are pre-trained and then fine-tuned, the results in Figure 5 show that the differences among them become insignificant; the quality of synthetic keyphrases has limited effect on final scores. The reason might be that, although synthetic keyphrases potentially introduce noisy information for decoder training, the encoder is still well trained. In addition, after fine-tuning on labeled data, the decoder acquires additional knowledge, thus leading to better performance and minimal differences among the options.
Figure 5: Effect of synthetic data quality on present keyphrase generation (models are pre-trained and fine-tuned) based on F1@5 (left three columns) and F1@10 (right three columns), on five datasets. The upper panel is for the self-learning algorithm and the bottom panel is for the unsupervised learning method.
Figure 6: Effect of various amounts of unlabeled data for training on present keyphrase generation with F1@10. The upper panel is for the synthetic data construction method with unsupervised learning. The bottom panel is for the multi-task learning algorithm.

Effect of Amount of Unlabeled Data
In this section, we further evaluate whether varying the amount of unlabeled data affects model performance. We conduct experiments based on synthetic data construction with unsupervised learning and on multi-task learning. We carry out experiments with randomly selected 50K (1/8), 100K (1/4), 200K (1/2) and 300K (3/4) unlabeled documents from the pool of 400K unlabeled data. After the models are trained, we adopt beam search with a beam size of 50 to generate keyphrase sequences. We keep the top N keyphrase sequences to yield the final keyphrases using Algorithm 1. F1@10 is adopted to illustrate model performance. N is set to 10 or 50.
Figure 7: Perplexity on the validation set with varying amounts of unlabeled data for training. Left: fine-tuning procedure based on models trained with synthetic data constructed with unsupervised learning. Right: multi-task learning procedure, with performance on the main task. Legend (right panel): Multi-task(50K), Multi-task(100K), Multi-task(200K), Multi-task(300K), Multi-task(400K).
The present keyphrase generation results are shown in Figure 6, from which we can see that model performance improves further as the amount of unlabeled data increases. This is because additional unlabeled data can provide more evidence on linguistic or context features and thus give the model, especially the encoder, better generalizability. This finding echoes the training procedure illustrated in Figure 7, where more unlabeled data uniformly leads to better performance. Therefore, we believe that leveraging more unlabeled data for model training can boost model performance.

A Pilot Study for Cross-Domain Test
Up to now, we have demonstrated the effectiveness of leveraging unlabeled data for in-domain experiments, but is it still effective when tested on a different domain? We thus carry out a pilot cross-domain test on news articles. The widely used DUC dataset (Wan and Xiao, 2008) is utilized, consisting of 308 articles with 2,048 labeled keyphrases.
The experimental results are shown in Table 4, which indicate that: 1) though trained on scientific papers, our models still have the ability to generate keyphrases for news articles, illustrating that our models have learned some universal features shared by the two domains; and 2) semi-supervised learning that leverages unlabeled data further improves generation performance, indicating that our proposed method is reasonably effective when tested on cross-domain data. Though unsupervised methods are still superior, in future work we can leverage unlabeled out-of-domain corpora to improve cross-domain keyphrase generation performance, which could be a promising direction for domain adaptation or transfer learning.
Table 5: Results of present keyphrase generation on large-scale labeled data with F1@5 and F1@10. * indicates significantly better performance than SEQ2SEQ-COPY with p < 0.01 (F-test).

Training on Large-scale Labeled Data
Finally, it would be interesting to study whether unlabeled data can still improve performance when the model is trained on a larger-scale labeled dataset. We conduct experiments on a larger labeled dataset with 130K pairs, along with the 400K unlabeled data. Here the baseline seq2seq model is built on the 130K dataset. From the present keyphrase generation results in Table 5, it can be seen that unlabeled data is still helpful for model training on a large-scale labeled dataset. This implies that we can also leverage unlabeled data to enhance generation performance even in a resource-rich setting. As for the absent keyphrase generation results shown in Table 6, semi-supervised learning also boosts the scores. Table 6 further shows that training on large-scale labeled data significantly improves absent keyphrase generation compared to training on small-scale labeled data (see Table 3).

Conclusion and Future Work
In this paper, we presented a semi-supervised learning framework that leverages unlabeled data for keyphrase generation built upon seq2seq models. We introduced a synthetic keyphrase construction algorithm and multi-task learning to effectively leverage abundant unlabeled documents. Extensive experiments demonstrated the effectiveness of our methods, even in scenarios where large-scale labeled data is available.
For future work, we will 1) leverage unlabeled data to study domain adaptation or transfer learning for keyphrase generation; and 2) investigate novel models to improve absent keyphrase generation when limited labeled data is available based on semi-supervised learning.