Redefining Absent Keyphrases and their Effect on Retrieval Effectiveness

Neural keyphrase generation models have recently attracted much interest due to their ability to output absent keyphrases, that is, keyphrases that do not appear in the source text. In this paper, we discuss the usefulness of absent keyphrases from an Information Retrieval (IR) perspective, and show that the commonly drawn distinction between present and absent keyphrases is not made explicit enough. We introduce a finer-grained categorization scheme that sheds more light on the impact of absent keyphrases on scientific document retrieval. Under this scheme, we find that only a fraction (around 20%) of the words that make up keyphrases actually serves as document expansion, but that this small fraction of words is behind much of the gains observed in retrieval effectiveness. We also discuss how the proposed scheme can offer a new angle to evaluate the output of neural keyphrase generation models.


Introduction
Searching the scholarly literature for documents of interest is becoming frustratingly difficult and time-consuming as the volume of published research grows exponentially. One promising approach to address this problem and improve the retrievability of documents is to supplement paper indexing with automatically generated keyphrases (Zhai, 1997). Traditionally, keyphrases are defined as a short list of terms that represent the main concepts in a document (Turney, 2000). In recent years, this definition was further refined to differentiate between keyphrases that do or do not appear in the source document, and, in turn, models for producing keyphrases were divided into extractive (Florescu and Caragea, 2017; Boudin, 2018; Sun et al., 2019; Wang et al., 2020; Santosh et al., 2020, inter alia) and generative models (Meng et al., 2017; Zhao and Zhang, 2019; Chen et al., 2020; Bahuleyan and El Asri, 2020, inter alia) based on their ability to output absent keyphrases.
Obviously, keyphrases have different effects on retrieval models depending on whether or not they occur in the document: present keyphrases highlight important parts of the input and make term weighting easier, while absent keyphrases add new terms to the input and provide some form of document expansion. Intuitively, assigning absent keyphrases is more appealing since it may alleviate the vocabulary mismatch problem between query terms and relevant documents (Furnas et al., 1987), hence enabling the retrieval of relevant documents that otherwise would have been missed. This is especially true for scholarly collections, in which documents are mostly short texts (i.e. scientific abstracts) due to licensing issues and/or resource limitations (Huang et al., 2019). Yet, the extent to which present and absent keyphrases contribute to improved retrieval effectiveness has not been thoroughly explored. Worse still, there is no unique and rigorous definition of what exactly makes a keyphrase absent.
Although not stated explicitly, many recent studies adopt the definition of Meng et al. (2017), in which keyphrases that do not match any contiguous subsequence of the source text are regarded as absent. From an Information Retrieval (IR) perspective, where stemmed content words are used to index documents, this definition is not sufficiently explicit, as demonstrated by the example shown in Figure 1. We see that, under this definition, some absent keyphrases can have all of their words occurring in the source document, and therefore act no differently from present keyphrases at indexing time. In fact, only a fraction of the words that compose these absent keyphrases genuinely expand the document, which in our example is the set of words {retrieval, behavior, support}. From a keyphrase generation point of view, this definition is not entirely satisfactory either, since training a model to produce absent keyphrases from an output vocabulary, while some of these might actually be reconstructed from the source document, is arguably overkill. Here, we argue that this may be one reason behind the poor performance of current sequence-to-sequence models in generating absent keyphrases.

[Figure 1: Example document, "Study on the Structure of Index Data for Metasearch System" — "This paper proposes a new technique for Metasearch system, which is based on the grouping of both keywords and URLs. This technique enables metasearch systems to share information and to reflect the estimation of users' preference. With this system, users can search not only by their own keywords but by similarity of HTML documents. In this paper, we describe the principle of the grouping technique as well as the summary of the existing search systems." Author-assigned keyphrases are divided into present and absent using token-level matching with stemming. Finer-grained categories for absent keyphrases (i.e. Reordered, Mixed and Unseen) are also outlined.]
In this paper, we advocate for a stricter definition of absent keyphrases and propose a fine-grained categorization scheme that reflects how many new words each keyphrase introduces. Through this scheme, we shed new light on the effect of absent keyphrases on document retrieval effectiveness, and provide insights into why current keyphrase generation models are unable to accurately produce absent keyphrases. As a by-product, we introduce a new benchmark dataset for scientific document retrieval through the task of context-aware citation recommendation, composed of 169 manually extracted queries with relevance judgments and a collection of over 100K documents on topics related to IR.

(Re)defining Absent Keyphrases
Telling absent and present keyphrases apart may seem quite easy at first, but there are actually several intricacies to the process that should be noted. Starting from Meng et al. (2017)'s definition, "we denote phrases that do not match any contiguous subsequence of source text as absent keyphrases, and the ones that fully match a part of the text as present keyphrases", it is apparent that simple string matching between keyphrases and the source document is not acceptable, since it produces false positives (e.g. "supervised learning" matches "unsupervised learning"). Instead, token-level sequence matching should be used, combined with stemming to deal with different inflectional forms of the same word. Using stemming is critical here, since it is carried out as a standard procedure both in indexing documents for IR and in evaluating the precision of keyphrase generation models against gold-standard annotations (Hasan and Ng, 2014).
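To make the distinction concrete, here is a minimal sketch of token-level matching with stemming. The function name `is_present` and the toy suffix-stripping normalizer (standing in for a proper Porter stemmer) are our own illustrative assumptions, not the paper's implementation:

```python
def normalize(token):
    # Toy stand-in for Porter stemming: lowercase and strip a plural "s".
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def is_present(keyphrase, text):
    # Token-level contiguous subsequence matching over stemmed tokens.
    kp = [normalize(t) for t in keyphrase.split()]
    doc = [normalize(t) for t in text.split()]
    return any(doc[i:i + len(kp)] == kp for i in range(len(doc) - len(kp) + 1))

# Naive string matching yields a false positive:
print("supervised learning" in "unsupervised learning")           # True (wrong)
# Token-level matching does not:
print(is_present("supervised learning", "unsupervised learning"))  # False (correct)
```

Note that the stemmed comparison also matches across inflectional variants, e.g. "search systems" against a document containing "search system".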
Looking back at our example in Figure 1, we see that absent keyphrases can be further divided into three sub-categories depending on the proportion of present words they contain. Indeed, some absent keyphrases have some, or even all, of their constituent words (in stemmed form) present in the text, while others are composed entirely of unseen words. Accordingly, we propose the following fine-grained categorization scheme (illustrated with the example from Figure 1 and explained in more depth with pseudo-code in Appendix A):

Present: keyphrases that match contiguous sequences of words in the source document (e.g. "Search System").
Reordered: keyphrases whose constituent words occur in the source document but not as contiguous sequences (e.g. "Information Sharing").
Mixed: keyphrases from which some, but not all, of their constituent words occur in the source document (e.g. "Information Retrieval").
Unseen: keyphrases whose constituent words do not occur in the source document (e.g. "Retrieval Support").
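The four categories can be operationalized roughly as follows. This is a sketch, not the paper's exact procedure (which is given as pseudo-code in Appendix A); the `categorize` helper and the toy stemmer standing in for Porter stemming are our own assumptions:

```python
def stem(token):
    # Toy stand-in for Porter stemming (illustrative assumption).
    t = token.lower()
    return t[:-1] if t.endswith("s") and len(t) > 3 else t

def categorize(keyphrase, text):
    """Assign one of the PRMU categories to a keyphrase w.r.t. a document."""
    kp = [stem(t) for t in keyphrase.split()]
    doc = [stem(t) for t in text.split()]
    # Present: contiguous token-level match over stemmed tokens.
    if any(doc[i:i + len(kp)] == kp for i in range(len(doc) - len(kp) + 1)):
        return "Present"
    n_in_doc = sum(1 for t in kp if t in set(doc))
    if n_in_doc == len(kp):
        return "Reordered"  # all words occur, but not contiguously
    if n_in_doc > 0:
        return "Mixed"      # some, but not all, words occur
    return "Unseen"         # no constituent word occurs in the document
```

For instance, against the document "users can search documents by keywords", the keyphrases "search documents", "document search", "document retrieval" and "neural networks" fall into the Present, Reordered, Mixed and Unseen categories, respectively.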
In contrast to the previously-used binary classification (i.e. present or absent), this finer-grained categorization scheme draws a distinction between keyphrases that expand the document (i.e. Mixed and Unseen) and those that do not (i.e. Present and Reordered). It thus allows us to better understand how keyphrases affect the retrieval process by making it possible to numerically quantify the contribution of each category to the overall retrieval effectiveness. At the same time, this scheme provides a new angle to evaluate the ability of keyphrase generation models to output absent keyphrases, by contrasting their PRMU (Present, Reordered, Mixed, Unseen) distributions against those observed in the gold-standard annotations. In other words, a model has to mimic the distribution of absent keyphrases in manual annotation in order to perform well.

Experiments
Here, we outline our experimental setup ( §3.1), examine the distribution of keyphrases in commonly-used datasets with respect to the proposed categorization scheme ( §3.2), show the influence of each category on the retrieval effectiveness ( §3.3), and explore how these categories fit into the outputs of neural keyphrase generation models ( §3.4).

Experimental settings
Experiments in ad-hoc document retrieval are carried out on the NTCIR-2 test collection (Kando, 2001) which is, to our knowledge, the only available benchmark dataset for that task. It includes 322,058 scientific abstracts in English annotated with author-assigned keyphrases (4.8 per doc. on avg.), and 49 search topics (queries) with relevance judgments. Documents cover a wide range of domains from pure science to humanities, although half of the documents are about computer science.
Given the rather limited size of the NTCIR-2 test collection, we conducted additional experiments in context-aware citation recommendation (He et al., 2010), which is the task of retrieving citations (documents) for a given text (query). Since no publicly available keyphrase-annotated collection exists for that task, we created one by collecting documents (BibTeX entries) from the ACM Digital Library. Our dataset contains 102,411 documents in English on topics related to IR, most of which (69.2%) have author-assigned keyphrases (4.5 per doc. on avg.). We then followed the methodology proposed in (Roy, 2017), and selected 30 open-access scientific papers from which we manually extracted 169 citation contexts (queries) and 481 cited references (relevant documents). The resulting dataset, named ACM-CR, is publicly available.
For both retrieval tasks, we rank documents against queries using the standard BM25 model implemented in the Anserini open-source IR toolkit (Yang et al., 2017), on top of which we apply the RM3 query expansion technique (Abdul-Jaleel et al., 2004) to achieve strong, near state-of-the-art retrieval results (Lin, 2019). For all models, we use Anserini's default parameters. We evaluate retrieval effectiveness in terms of mean average precision (mAP) over the top 1,000 retrieved documents for ad-hoc document retrieval, and in terms of recall at 10 retrieved documents for context-aware citation recommendation, as recommended in (Färber and Jatowt, 2020).

Distribution of gold-standard keyphrases under the PRMU scheme
[Table 1: Proportion of Present, Reordered, Mixed and Unseen keyphrases in datasets, together with the ratio of unique, unseen words in M+U keyphrases (%uw).]

Table 1 shows the proportion of gold-standard, author-assigned keyphrases for each category in the different datasets. We also report results for the KP20k dataset (Meng et al., 2017), which is used as training data by most neural keyphrase generation models. We observe very similar distributions across datasets, with absent keyphrases accounting for about 40% of the total number of keyphrases. Interestingly, most of the absent keyphrases belong to the Mixed and Unseen categories, and should therefore provide some form of semantic expansion.

To get a precise idea of how many new words are actually added when indexing absent keyphrases, we compute the ratio (%uw) of unique words from keyphrases that do not occur in their corresponding documents. We find that only about 20% of the words included in keyphrases contribute to expanding documents. This surprisingly low percentage indicates that absent keyphrases play a much smaller role in document expansion than previously thought. Yet, as we will see next, this small fraction of new words is behind much of the gains observed in retrieval effectiveness.

Influence of keyphrase categories on retrieval effectiveness
Table 2 presents the results of retrieval models on documents supplemented with keyphrases from the PRMU categories. We see that adding keyphrases systematically improves retrieval effectiveness on both datasets, but a closer look reveals that the largest gains are obtained with Mixed and Unseen keyphrases. This observation, combined with the fact that the number of Mixed and Unseen keyphrases per document is comparatively small (less than one on average), demonstrates that expanding documents is more effective than highlighting salient phrases for improving document retrieval performance. The higher scores achieved when combining Mixed and Unseen keyphrases, compared to when combining Present and Reordered keyphrases, further confirm this conclusion. Surprisingly, coupling query expansion (+RM3) with appending keyphrases yields conflicting results, which we attribute to the narrow set of topics (all related to IR) in ACM-CR that limits the vocabulary mismatch problem and makes it sensitive to semantic drift. Another reason may be the incomplete nature of the relevance judgments, which do not include uncited, yet relevant documents. Here, the use of a co-cited probability metric as in (Livne et al., 2014) may bring new insights.
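For illustration, the %uw statistic reported in Table 1 can be computed roughly as follows. This is a sketch; the `expansion_ratio` helper and the toy stemmer standing in for Porter stemming are our own assumptions:

```python
def stem(token):
    # Toy stand-in for Porter stemming (illustrative assumption).
    t = token.lower()
    return t[:-1] if t.endswith("s") and len(t) > 3 else t

def expansion_ratio(keyphrases, text):
    """Ratio of unique stemmed keyphrase words absent from the document,
    i.e. the fraction of keyphrase vocabulary that truly expands it."""
    doc = {stem(t) for t in text.split()}
    kp_words = {stem(t) for kp in keyphrases for t in kp.split()}
    if not kp_words:
        return 0.0
    unseen = kp_words - doc
    return len(unseen) / len(kp_words)

# Two of the four unique keyphrase words are new to the document:
print(expansion_ratio(["search system", "information retrieval"],
                      "a search system for documents"))  # 0.5
```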

Analysis of keyphrase generation outputs under the PRMU scheme
In this last experiment, we explore how the proposed categories fit the outputs of neural keyphrase generation models. Table 3 shows the distributions over PRMU categories for two strong baseline models: s2s+copy, a sequence-to-sequence model with attention and copying mechanisms (Meng et al., 2017), and s2s+corr, which extends the aforementioned model with a coverage mechanism (Chen et al., 2018). We observe that the output distributions are heavily skewed towards the Present category, indicating that the models have trouble producing keyphrases made up of new words. Accordingly, the overall performance of these models is quite poor (about 20% in f-measure), and is mainly capped by the number of present keyphrases in the gold standard. This advocates for more focus on training generative models to expand documents, rather than to imitate author-assigned annotation.
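Comparing a model's output distribution against the gold standard only requires counting category labels. A minimal sketch (the `prmu_distribution` helper and the hypothetical label lists are our own illustrative assumptions, not figures from Table 3):

```python
from collections import Counter

def prmu_distribution(categories):
    """Normalized distribution over the four PRMU categories."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {c: counts.get(c, 0) / total
            for c in ("Present", "Reordered", "Mixed", "Unseen")}

# Hypothetical category labels for a model's output vs. gold annotations:
model_labels = ["Present"] * 8 + ["Reordered", "Mixed"]
gold_labels = ["Present"] * 6 + ["Reordered"] + ["Mixed"] * 2 + ["Unseen"]
print(prmu_distribution(model_labels))  # skewed towards Present
print(prmu_distribution(gold_labels))
```

A model whose distribution diverges sharply from the gold one (e.g. by overproducing Present keyphrases) is, under this view, failing at the document-expansion part of the task regardless of its f-measure.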

Related Work
Until recently, most models for predicting keyphrases did so by extracting the most salient noun phrases from documents (Hasan and Ng, 2014). Keyphrase extraction models are usually divided into supervised models, which cast keyphrase extraction either as a binary classification problem (Turney, 2000; Hulth, 2003; Nguyen and Kan, 2007; Medelyan et al., 2009; Sterckx et al., 2016) or as a sequence labelling problem (Augenstein et al., 2017; Xiong et al., 2019; Alzaidy et al., 2019), and unsupervised models, which rely predominantly on graph-based ranking approaches (Mihalcea and Tarau, 2004; Litvak and Last, 2008; Wan and Xiao, 2008; Bougouin et al., 2013; Tixier et al., 2016; Boudin, 2018). Note that none of these models can produce absent keyphrases. A related line of research focuses on keyphrase assignment, that is, the task of selecting entries from a predefined list of keyphrases (i.e. a controlled vocabulary) (Leung and Kan, 1997; Dumais et al., 1998; Medelyan and Witten, 2006). Here, predicting keyphrases is treated as a multi-class classification task, and models can produce both present and absent keyphrases. Further in that direction, Bougouin et al. (2016) jointly perform keyphrase extraction and assignment using an unsupervised graph-based ranking model.
Also closely related to our work is previous research on document expansion (Tao et al., 2006; Efron et al., 2012), and particularly recent work on supplementing document indexing with automatically generated queries. These latter models augment texts with potential queries that, just like keyphrases, mitigate vocabulary mismatch and reweight existing terms (Lin et al., 2020). On the term weighting side, recent work shows that deep neural language models, in this case BERT (Devlin et al., 2019), can be successfully applied to estimate document-specific term weights (Dai and Callan, 2020).

Conclusion
In this paper, we investigated the usefulness of absent keyphrases for document retrieval. We showed that the commonly accepted definition of absent keyphrases is not sufficiently explicit in the context of IR, and proposed a finer-grained categorization scheme that allows for a better understanding of their impact on retrieval effectiveness. Our code and data are publicly available at https://github.com/boudinfl/redefining-absent-keyphrases.