Keyphrase Generation for Scientific Document Retrieval

Sequence-to-sequence models have led to significant progress in keyphrase generation, but it remains unknown whether they are reliable enough to be beneficial for document retrieval. This study provides empirical evidence that such models can significantly improve retrieval performance, and introduces a new extrinsic evaluation framework that allows for a better understanding of the limitations of keyphrase generation models. Using this framework, we point out and discuss the difficulties encountered when supplementing documents with keyphrases that are not present in the text, and when generalizing models across domains. Our code is available at https://github.com/boudinfl/ir-using-kg


Introduction
With the exponential growth of the scientific literature (Bornmann and Mutz, 2015), retrieving relevant scientific papers becomes increasingly difficult. Keywords, also referred to as keyphrases, provide an effective way to supplement paper indexing and improve retrieval effectiveness in scientific digital libraries (Barker et al., 1972; Zhai, 1997; Lu and Kipp, 2014). However, only a few documents have assigned keyphrases, and those that do were, for the most part, self-labeled by their authors, thus exhibiting annotation inconsistencies (Strader, 2011; Suzuki et al., 2011). This has motivated an active line of research on automatic keyphrase extraction (see Hasan and Ng (2014) for an overview) and, more recently, keyphrase generation (Meng et al., 2017), where the task is to find a set of words and phrases that represents the main content of a document.
Although models for predicting keyphrases have been extensively evaluated on their ability to reproduce author keywords, it remains unclear whether they can be usefully applied in information retrieval. One reason for this lack of evidence may be that their relatively low performance discouraged attempts at using them for indexing (Liu et al., 2010; Hasan and Ng, 2014). Yet, recently proposed models not only achieve much better performance, but also display a property that may have a significant impact on retrieval effectiveness: the capacity to generate keyphrases that do not appear in the source text. These absent keyphrases do not just highlight the topics that are most relevant, but provide some form of semantic expansion by adding new content (e.g. synonyms, semantically related terms) to the index (Greulich, 2011). The goal of this paper is two-fold: to gather empirical evidence as to whether current keyphrase generation models are good enough to improve scientific document retrieval, and to gain further insights into the performance of these models from an extrinsic perspective. Our contributions are listed as follows:
• We report significant improvements for strong retrieval models on a standard benchmark collection, showing that keyphrases produced by state-of-the-art models are consistently helpful for document retrieval, even, to our surprise, when author keywords are provided.
• We introduce a new extrinsic evaluation framework for keyphrase generation that allows for a deeper understanding of the limitations of current models. Using it, we discuss the difficulties associated with domain generalization and absent keyphrase prediction.

Scientific Document Retrieval
Here, we focus on the task of searching through a collection of scientific papers for relevant documents. All of our experiments are conducted on the NTCIR-2 test collection (Kando, 2001) which is, to our knowledge, the only available benchmark dataset for that task. It contains 322,058 documents (title and abstract pairs) and 49 search topics (queries) with relevance judgments. Most of the documents (98.6%) include author keywords (4.8 per doc. on avg.), which we later use to investigate the performance of keyphrase generation models. Documents cover a broad range of domains from pure science to social sciences and humanities, although half of the documents are about engineering and computer science. Queries are also categorized into one or more research fields (e.g. science, chemistry, engineering), the original intent being to help retrieval models in narrowing down the search space. We follow common practice and use short queries with binary relevance judgments (i.e. without "partially relevant" documents).
We consider two standard ad-hoc retrieval models to rank documents against queries: BM25 and query likelihood (QL), both implemented in the Anserini IR toolkit. These models use unsupervised techniques based on corpus statistics for term weighting, and will therefore be directly affected when keyphrases are added to a document. We further apply a pseudo-relevance feedback method, known as RM3 (Abdul-Jaleel et al., 2004), on top of the models to achieve strong, near state-of-the-art retrieval results (Lin, 2019). For all models, we use Anserini's default parameters.
To verify the effectiveness of the adopted retrieval models, we compared their performance with that of the best participating systems in NTCIR-2. Retrieval performance is measured using mean average precision (MAP) and precision at 10 retrieved documents (P@10). MAP measures the overall ranking quality and P@10 reflects the number of relevant documents on the first page of search results. Documents are indexed with author keywords, as was the case for the participating systems. Results are presented in Table 1. We see that the considered retrieval models achieve strong performance, even outperforming the best participating system by a substantial margin. Note that the two best-performing systems use pseudo-relevance feedback, and that the second-ranked system is based on BM25.
(Murata et al., 2001) (Chen et al., 2001) 26.24 33.88
Table 1: Retrieval effectiveness of the considered models and the best participating systems on NTCIR-2.
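As a minimal sketch of the two measures reported in Table 1, MAP and P@10 under binary relevance could be computed as follows. The function names, the dictionary layout of the run and of the relevance judgments, and the toy document identifiers are assumptions made for illustration; they are not taken from Anserini or from the NTCIR-2 tooling.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list under binary relevance."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def precision_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), qrels[q]) for q in qrels]
    return sum(scores) / len(scores)


# Toy data (hypothetical identifiers, not from NTCIR-2).
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d7"}}
print(mean_average_precision(runs, qrels))
print(precision_at_k(runs["q1"], qrels["q1"], k=10))
```

In practice these measures are usually obtained from a standard evaluation tool rather than reimplemented; the sketch only makes the definitions explicit.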

Keyphrase Generation
Keyphrase generation is the task of producing a set of words and phrases that best summarise a document (Evans and Zhai, 1996). In contrast with most previous work that formulates this task as an extraction problem (a.k.a. keyphrase extraction), which can be seen as ranking phrases extracted from a document, recent neural models for keyphrase generation are based on sequence-to-sequence learning (Sutskever et al., 2014; Bahdanau et al., 2014), thus potentially allowing them to generate any phrase, including those that do not appear verbatim in the text. In this study, we consider the following two neural keyphrase generation models:
seq2seq+copy (Meng et al., 2017) is a sequence-to-sequence model with attention, augmented with a copying mechanism (Gu et al., 2016) to predict phrases that rarely occur. The model is trained with document-keyphrase pairs and uses beam search decoding for inference.
seq2seq+corr (Chen et al., 2018) extends the aforementioned model with correlation constraints. It employs a coverage mechanism (Tu et al., 2016) that diversifies attention distributions to increase topic coverage, and a review mechanism to avoid generating duplicates.
We implemented the models in PyTorch (Paszke et al., 2017) using AllenNLP (Gardner et al., 2018). Models are trained on the KP20k dataset (Meng et al., 2017), which contains 567,830 scientific abstracts with gold-standard, author-assigned keywords (5.3 per doc. on avg.). We use the parameters suggested by the authors for each model.
To validate the effectiveness of our implementations, we conducted an intrinsic evaluation by counting the number of exact matches between predicted and gold keyphrases. We adopt the standard metric and compute the F-measure at top 5, as it corresponds to the average number of keyphrases in KP20k and NTCIR-2, that is, 5.3 and 4.8, respectively. We also examine cross-domain generalization using the KPTimes news dataset (Gallina et al., 2019), and include a state-of-the-art unsupervised keyphrase extraction model (Boudin, 2018, henceforth mp-rank) for comparison purposes. This latter baseline also provides an additional relevance signal based on graph-based ranking whose usefulness in retrieval will be tested in subsequent experiments. Results are reported in Table 2. Overall, our results are consistent with those reported by Meng et al. (2017) and Chen et al. (2018), demonstrating the superiority of well-trained neural models over unsupervised ones, and stressing their lack of robustness across domains. Rather surprisingly, seq2seq+corr is outperformed by seq2seq+copy, which suggests that relevant, yet possibly redundant, keyphrases are filtered out by the added mechanisms for promoting diversity in the output.
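For concreteness, the intrinsic measure (F-measure at top 5 over exact matches) could be computed along the following lines. This is a sketch under stated assumptions: matching is done after lowercasing and whitespace normalization only, duplicate predictions are dropped before truncation, and precision is divided by the number of phrases actually kept. The evaluation scripts used for Table 2 may differ on each of these points, and the example phrases are hypothetical.

```python
from typing import List


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(phrase.lower().split())


def f1_at_k(predicted: List[str], gold: List[str], k: int = 5) -> float:
    """F-measure between the top-k predictions and the gold keyphrases,
    counting exact matches after normalization."""
    top_k: List[str] = []
    for phrase in predicted:
        norm = normalize(phrase)
        if norm not in top_k:  # assumption: duplicates count only once
            top_k.append(norm)
        if len(top_k) == k:
            break
    gold_set = {normalize(g) for g in gold}
    if not top_k or not gold_set:
        return 0.0
    matches = sum(1 for phrase in top_k if phrase in gold_set)
    if matches == 0:
        return 0.0
    precision = matches / len(top_k)
    recall = matches / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions and gold keyphrases for a single document.
predicted = ["grammatical inference", "knowledge acquisition",
             "Natural Language Processing", "logic", "documents"]
gold = ["grammatical inference", "knowledge acquisition",
        "concept acquisition", "inference rules"]
print(f1_at_k(predicted, gold, k=5))
```

The corpus-level score would then be the average of this value over all test documents.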

Extrinsic Evaluation Framework
Our goal is to find out whether the keyphrase generation models described above are reliable enough to be beneficial for document retrieval. To do so, we contrast the performance of the retrieval models with and without automatically predicted keyphrases. Two initial indexing configurations are also examined: title and abstract only (T+A), and title, abstract and author keywords (T+A+K).
The idea here is to investigate whether generated keyphrases simply act as a proxy for author keywords, or instead supplement them. Unless mentioned otherwise, the top-5 predicted keyphrases are used to expand documents, which is in accordance with the average number of author keywords in NTCIR-2. We evaluate retrieval performance in terms of MAP and omit P@10 for brevity. We use Student's paired t-test to assess the statistical significance of our retrieval results at p < 0.05 (Smucker et al., 2007).
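The framework described above can be illustrated with a short sketch that builds the text to be indexed for each configuration and computes the paired t statistic over per-query scores. The helper names, the choice to simply append keyphrases to the document text, and the toy scores are assumptions for illustration and do not describe the released code.

```python
import math
from typing import List, Optional, Sequence, Tuple


def expand_document(title: str, abstract: str,
                    author_keywords: Optional[List[str]] = None,
                    predicted: Optional[List[str]] = None,
                    n: int = 5) -> str:
    """Text to index: T+A, optionally +K (author keywords), optionally
    followed by the top-n predicted keyphrases (assumed to be appended)."""
    parts = [title, abstract]
    if author_keywords:
        parts.extend(author_keywords)
    if predicted:
        parts.extend(predicted[:n])
    return " ".join(parts)


def paired_t_statistic(scores_a: Sequence[float],
                       scores_b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-query scores
    (e.g. average precision) of two systems on the same queries."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0.0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(variance / n), n - 1


# T+A+K document expanded with the top-2 predictions (toy strings).
print(expand_document("T", "A", ["ak"], ["k1", "k2", "k3"], n=2))

# Toy per-query scores for two systems (hypothetical values).
t_stat, df = paired_t_statistic([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
print(t_stat, df)
```

The p-value is then read from Student's t distribution with `df` degrees of freedom, typically through a statistics package, and compared against the 0.05 threshold.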

Results
Results for retrieval models using keyphrase generation are reported in Table 3. We note that indexing keyphrases generated by seq2seq+copy, which performs best in our intrinsic evaluation, significantly improves retrieval effectiveness for all models. More interestingly, gains in effectiveness are also significant when both keyphrases and author keywords are indexed, indicating that they complement each other well. This important finding suggests that predicted keyphrases are consistently helpful for document retrieval, and should be used even when author keywords are provided. Another important observation is that while both keyphrase generation models perform reasonably well in our intrinsic evaluation on NTCIR-2 (cf. Table 2, column 3), their impact on retrieval effectiveness is quite different, as only seq2seq+copy reaches consistent significance. This finding underlines the importance of using document retrieval as an extrinsic evaluation task for keyphrase generation.
Overall, BM25+RM3 achieves the best retrieval effectiveness, confirming previous findings on ad-hoc retrieval in limited data scenarios (Lin, 2019). For clarity and conciseness, we focus on this model in the rest of this paper. Encouraging diversity in keyphrases does not seem to be appropriate for retrieval, as seq2seq+corr consistently gives lower results than seq2seq+copy. It is also interesting to see that the effectiveness gains of query expansion (RM3) and document expansion are additive, suggesting that they provide different but complementary relevance signals. Moreover, our results show that query expansion is more effective, which is in line with past work (Billerbeck and Zobel, 2005).
One hyper-parameter that we have deliberately left untouched so far is the number N of predicted keyphrases, which directly controls the precision-recall trade-off of keyphrase generation models. To understand how this parameter affects retrieval effectiveness, we repeated our experiments by varying N within the range [0,9], and plotted the results in Figure 1. Without author keywords, we observe that all models achieve gains, but only seq2seq+copy yields significant improvements. With author keywords, seq2seq+copy is again the only model that achieves significance, while the others show mixed results, sometimes even degrading scores. One likely explanation for this is that these models produce keyphrases that cause documents to drift away from their original meaning. We note that results are close to optimal for N = 5, supporting our initial setting for this parameter.
From our experiments, it appears that unsupervised keyphrase extraction is not effective enough to significantly improve retrieval effectiveness. The fact that keyphrase generation does so suggests that the ability to predict absent keyphrases may be what enables better performance. Yet, counterintuitively, we found that most of the gains in retrieval effectiveness are due to the high extractive accuracy of keyphrase generation models. Results in Table 4 show that expanding documents with only absent keyphrases is at best useless and at worst harmful, while using only present keyphrases brings significant improvements. We draw two conclusions from this. First, absent keyphrases may not be useful in practice unless they are tied to some form of domain terminology to prevent semantic drift. Second, as generating absent keyphrases does not yield improvements, keyphrase extraction models may be worth further investigation. In particular, supervised models could theoretically provide similar results while being easier to train.
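The present/absent distinction used in this analysis can be sketched as follows. The sketch assumes that a keyphrase is present when its lowercased token sequence occurs contiguously in the document text; the matching actually used for Table 4 (for instance, whether stemming is applied) is not specified here, so this criterion and the example phrases are illustrative assumptions.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercased alphanumeric tokens (assumed tokenization, no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_present(phrase: str, doc_tokens: List[str]) -> bool:
    """True if the phrase tokens occur contiguously in the document."""
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def split_present_absent(keyphrases: List[str],
                         text: str) -> Tuple[List[str], List[str]]:
    """Partition predicted keyphrases into present and absent ones."""
    doc_tokens = tokenize(text)
    present: List[str] = []
    absent: List[str] = []
    for keyphrase in keyphrases:
        if is_present(keyphrase, doc_tokens):
            present.append(keyphrase)
        else:
            absent.append(keyphrase)
    return present, absent


text = ("We propose a grammatical inference system which uses "
        "inference rules based on logic.")
predicted = ["grammatical inference", "machine learning", "inference rules"]
present, absent = split_present_absent(predicted, text)
print(present)
print(absent)
```

Expanding documents with only one of the two lists then gives the present-only and absent-only configurations contrasted above.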

Neural models for keyphrase generation exhibit a limited generalization ability, which means that their performance degrades on documents that differ from the ones encountered during training (cf. Table 2, columns 3 and 4). To quantify how much this affects retrieval effectiveness, we divided the queries into two disjoint sets: in-domain for those that belong to research fields present in KP20k, and out-domain for the others. Results are presented in Table 5. The first thing we notice is the overall lower performance on out-domain queries, which may be explained by the unbalanced distribution of domains in the NTCIR-2 collection. Most importantly, out-domain queries on full indexing (i.e. T+A+K) is the only configuration in which no significant gains in retrieval effectiveness are achieved. This last experiment shows that expanding documents using existing keyphrase generation models may be ineffective in the absence of in-domain training data, and stresses the need for domain adaptation in keyphrase generation.

Conclusion
We presented the first study of the usefulness of keyphrase generation for scientific document retrieval. Our results show that keyphrases can significantly improve retrieval effectiveness, and also highlight the importance of evaluating keyphrase generation models from an extrinsic perspective.
Other retrieval tasks may also benefit from using keyphrase information and we expect our results to serve as a basis for further improvements.

Keyphrase extraction and generation
Identifying keyphrases for a given document is a long-standing task in NLP. Earlier work typically involves two steps: 1) extracting keyphrase candidates; and 2) ranking those candidates by importance. Models mainly differ in how they do the latter, commonly used techniques being supervised learning (Turney, 2003; Nguyen and Kan, 2007; Jiang et al., 2009) and graph-based methods (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Bougouin et al., 2013; Florescu and Caragea, 2017). These models are, however, inherently limited in the sense that they can only output keyphrases that appear in the text. To allow the prediction of keyphrases describing implicit topics or using different wordings, previous work relied on external resources like controlled vocabularies (Witten and Medelyan, 2006; Bougouin et al., 2016), while recent attempts leveraged neural generative models (Meng et al., 2017; Chen et al., 2018; Zhao and Zhang, 2019).

Biomedical indexing
Also related to our work is the research done on biomedical semantic indexing using MeSH, a hierarchically-organized controlled vocabulary. Automated methods for assigning MeSH terms make use of a variety of techniques, such as pattern matching (Aronson et al., 2004) or learning to rank (Liu et al., 2015; Peng et al., 2016).

Document expansion
Our work is similar in nature to previous research on document expansion (Tao et al., 2006; Efron et al., 2012), and is closely related to recent work on document expansion using automatically generated queries (Nogueira et al., 2019).

Table 6 displays the model parameters we use for seq2seq+copy and seq2seq+corr.

Table 7 presents the research fields used for dividing queries into two sets.

A.3 Example
An example document, along with its automatically generated keyphrases, is shown in Table 8.

Research field In Out
Electricity, information and control
Chemistry
Architecture, civil engineering
Biology and agriculture
Science
Engineering
Medicine and dentistry
Cultural and social science
# of queries 27 22

title Grammatical Inference for Concept Acquisition from Documents.
abstract The purpose of this study is to acquire knowledge from large scale natural language documents. There are two types of knowledge in the documents. One is explicitly represented knowledge which is acquired using natural language processing. The other is implicit constrain. In this paper, how to acquire implicit constraint using grammatical inference from the documents is described. We propose a grammatical inference system which uses inference rules based on logic, and show that the system can learn easy pattern of character lists. We also discuss its application to knowledge acquisition from real documents.