Exploring Query Expansion for Entity Searches in PubMed

Identifying relevant studies from the entire scientific literature is an important task in biomedical research. Past efforts have incorporated semantically recognized biological entities and medical ontologies into biomedical literature search. However, semantic relations are largely overlooked by biomedical search engines. In this work, we aim to discover synonymous biomedical semantic relations between entities and explore their uses in query (semantics) understanding for improved retrieval performance. Specifically, we discover synonymous semantic relations from PubMed queries and apply them to query expansion and specification. In these two real-world scenarios, better PubMed retrieval effectiveness, in terms of recall and precision, can be achieved, demonstrating the utility of our proposed approach.


Introduction
PubMed is widely used by millions of users on a daily basis for seeking scholarly publications in biology and life sciences. Recent studies show that a significant portion of PubMed queries are entity specific (i.e. entity searches) (Neveol et al., 2011; Huang and Lu, 2016).
Domain-specific search engines, such as PubMed, typically handle queries with domain knowledge in mind. For example, PubMed incorporates Medical Subject Headings (MeSH) to retrieve documents associated with a query's semantic meaning rather than just keyword matches, because in biomedicine it is common for concepts to appear in different forms in user queries and scholarly publications. However, PubMed can still suffer from mismatches between document and query words when an information need involves entity semantic relations (Baumgartner et al., 2007).
Consider the queries chlorthalidone vs hydrochlorothiazide and chlorthalidone versus hydrochlorothiazide. Although the two are semantically equivalent, PubMed returns twice as many relevant documents for the latter, clearly overlooking the shared semantics of the general terms vs and versus during its search. Unfortunately, such performance differences resulting from different query formulations can lead to different levels of user satisfaction and a different user experience with PubMed.
In light of this, we propose a framework in which we first capture the semantics of user queries by discovering synonymous patterns among them (e.g. the patterns CHEMICAL vs CHEMICAL and CHEMICAL versus CHEMICAL) for entity relations of interest. We then apply these learned synonymous patterns in query expansion to improve retrieval effectiveness for entity searches in PubMed.
In this work, we mine synonymous patterns in user queries instead of scholarly publications because queries are generally short (Islamaj Dogan et al., 2009; Wilkinson et al., 1995) and tend to bond entities in proximity. Here we specifically target chemical-chemical and chemical-disease relations such as the chemical-induced-disease relation (Wei et al., 2016). The proposed framework, however, is easily generalizable to other bio-entity relations such as protein-protein interaction (Phizicky and Fields, 1995).
Our work is unique in several aspects. First, PubMed queries are semantically analyzed through context patterns, and synonymous relations or synonymous context patterns are discovered automatically. Second, synonymous patterns are applied to expand entity searches at the pattern level to improve recall of relevant documents. Third, synonymous patterns can also be applied to searches with entities only, where we add additional constraints to improve precision. The overall evaluation points to key directions for future development and improvement of PubMed, and can also shed light on how to effectively search biomedical literature beyond PubMed.

Related Work
Query Expansion (QE) has been an area of active research in Information Retrieval (IR). QE techniques aim to alleviate vocabulary mismatch between query and document words by adding related words to the initial queries, with the goal of improving retrieval effectiveness. Below we discuss three types of QE techniques classified by how they derive related words: ontology-oriented, query-independent data-driven, and query-dependent data-driven techniques.
Query-independent data-driven QE methods identify words similar to those in a query by analyzing a whole document collection rather than documents specific to the query. Hence, they are also known as global corpus-specific QE methods (Carpineto and Romano, 2012). They learn word associations through concept terms (Qiu and Frei, 1993), term clustering (Crouch and Yang, 1992), distributional similarity (Lin 1998; Turney 2001; Chen et al., 2006), and semantic topics (Park and Ramamohanarao, 2007), to name a few.
Query-dependent data-driven techniques, on the other hand, analyze query-specific documents for QE. While relevance feedback uses relevant documents from the initial queries, pseudo-relevance feedback uses top-ranked documents without human intervention (Xu and Croft, 1996). Measures for finding related terms in initially returned documents include Rocchio's weighting (Rocchio, 1971), Chi-square (Doszkocs, 1978), and Kullback-Leibler distance (Carpineto et al., 2001). Recently, Cui et al. (2003) and Riezler et al. (2007) consider user-clicked documents relevant for QE.
In biomedicine, QE studies primarily focus on ontologies and pseudo-relevance feedback. For example, Jalali and Borujerdi (2008) and others expand queries via the MeSH ontology, and Srinivasan (1996), Aronson (1996), and Zhu et al. (2006) expand queries via the Unified Medical Language System (Lindberg et al., 1993). On the other hand, biomedical queries can be reformulated or systematically expanded based on initially retrieved documents, focusing on abbreviations (Bacchin and Melucci, 2005), the controlled vocabulary of MeSH (Thesprasith and Jaruskulchai, 2014), or open vocabulary (Rivas et al., 2014).
In contrast to previous work, we semantically analyze frequently sought general patterns (or relations) in biomedical queries, discover pattern synonyms, and use these automatically learnt synonymous patterns to expand real-world entity searches in PubMed. Such general-phrase, pattern-level semantics understanding, complementary to domain-specific MeSH, later proves useful in QE and beneficial to PubMed literature search in our case studies.
We focus on understanding users' information needs or search semantics when they submit entity searches to PubMed. We discover synonymous patterns or entity relations in user queries (Section 3.1) and exploit them in the following two use scenarios to improve PubMed retrieval effectiveness.
Scenario 1. Consider an entity pair search with explicit relation mention (e.g. the comparison relation between two drugs as in albuterol vs levalbuterol). We expand the query with its synonymous counterparts belonging to the same pattern-level relation (e.g. adding albuterol versus levalbuterol, comparison between albuterol and levalbuterol, etc.).
Note that in Scenario 1 and Scenario 2 we perform PubMed searches under relevance sorting (as opposed to the default chronological sorting), and we use matches in article titles as a proxy for human relevance evaluation (Kim et al., 2016). In other words, to ensure quick turnaround and large-scale evaluation, we assume that all matching titles satisfy users' information needs (i.e. perfect precision), and thus no human relevance judgments are required.

Discovering Synonymous Patterns
We have previously developed an unsupervised approach for identifying synonymous patterns of entity relations in PubMed queries (Huang and Lu, 2016). Due to space limitations, we only briefly outline the major steps below and refer interested readers to Huang and Lu (2016) for details.
First, six months' worth of PubMed queries (35M queries) are stemmed and tagged using entity recognition tools (Leaman et al., 2013; Leaman et al., 2015) for genes/proteins, diseases, and chemicals/drugs. Next, we convert queries into context patterns and focus specifically on discovering synonymous patterns for chemical-chemical (CC) and chemical-disease (CD) relations. For instance, the query skin necrosis associate with warfarin is converted into #D associate with #C, where #C and #D stand for a chemical and a disease entity respectively.
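The pattern formulation step can be illustrated with a minimal sketch. It assumes the query has already been stemmed and tagged; the input representation (a list of text spans paired with an entity type) and the helper name to_pattern are hypothetical choices made for this illustration, not the format of the entity recognition tools cited above.

```python
from typing import List, Optional, Tuple

# Placeholders used in the paper's pattern notation.
PLACEHOLDER = {"chemical": "#C", "disease": "#D"}

def to_pattern(tagged: List[Tuple[str, Optional[str]]]) -> Tuple[str, Tuple[str, ...]]:
    """Replace tagged entity spans with placeholders.

    `tagged` is a hypothetical tagger output: (span text, entity type) pairs,
    with type None for plain words. Returns (context pattern, entities in order).
    """
    parts, entities = [], []
    for text, etype in tagged:
        if etype in PLACEHOLDER:
            parts.append(PLACEHOLDER[etype])
            entities.append(text)
        else:
            parts.append(text)
    return " ".join(parts), tuple(entities)

# The example query from the text, already stemmed and tagged.
tagged_query = [
    ("skin necrosis", "disease"),
    ("associate with", None),
    ("warfarin", "chemical"),
]
pattern, entities = to_pattern(tagged_query)
print(pattern)    # #D associate with #C
print(entities)   # ('skin necrosis', 'warfarin')
```

Keeping the participating entities alongside each pattern is what allows the patterns to be compared by the entities they co-occur with.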
Inspired by distributional similarity (Lin 1998), we then exploit these patterns' participating entity pairs to understand their semantics. In this way, synonymous patterns can be found in an unsupervised fashion, in contrast to pattern recognition work that requires seeds (e.g. Xu and Wang, 2014). Take Figure 1 for example. Our framework will consider the pattern #C induce #D semantically closer to #D due to #C than to #C in #D treatment, since #C induce #D and #D due to #C share more participating entities in user queries: 2 overlapping entities out of 7 entities vs 0 out of 7.
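The intuition of comparing patterns by their participating entities can be sketched as follows. The overlap ratio used here (shared entities over all entities of the two patterns, i.e. a Jaccard ratio) is an assumption chosen to mirror the "2 out of 7" reading above; the exact similarity measure is described in Huang and Lu (2016). The entity sets are invented toy data, not the contents of Figure 1.

```python
from typing import Dict, Set

def overlap(a: Set[str], b: Set[str]) -> float:
    """Shared entities divided by all distinct entities of the two patterns."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy participating entities per pattern (illustrative only).
participants: Dict[str, Set[str]] = {
    "#C induce #D": {"warfarin", "skin necrosis", "cisplatin", "nephrotoxicity"},
    "#D due to #C": {"warfarin", "skin necrosis", "heparin",
                     "thrombocytopenia", "hepatotoxicity"},
    "#C in #D treatment": {"metformin", "diabetes", "aspirin"},
}

target = participants["#C induce #D"]
sim_due_to = overlap(target, participants["#D due to #C"])           # 2 of 7 shared
sim_treatment = overlap(target, participants["#C in #D treatment"])  # 0 of 7 shared
print(sim_due_to > sim_treatment)  # True
```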
To mitigate data sparseness in entity mentions (and hence in their distributional similarity), we further leverage latent semantic analysis, LSA (Rehurek and Sojka, 2010), to find entities' LSA topics, which in turn reduces the space of semantic analysis from the dimension of entity pairs to the much smaller dimension of LSA topics. The benefit of using LSA topics is clear: after LSA transformation, #C induce #D in Figure 1, where circle colors depict LSA topics, shows a stronger semantic connection with #D due to #C than it did without LSA: 2 overlapping LSA topics out of 3 topics.
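The effect of moving from entities to topics can be illustrated with a small sketch. It does not perform LSA itself: the entity-to-topic assignment below is a hypothetical stand-in for the output of a trained LSA model, and assigning each entity a single topic is a simplification made only for this example. The same overlap ratio as before is then computed in topic space.

```python
from typing import Dict, Set

# Hypothetical entity-to-topic assignment standing in for LSA output.
ENTITY_TOPIC: Dict[str, str] = {
    "cisplatin": "t1", "carboplatin": "t1",
    "nephrotoxicity": "t2", "kidney injury": "t2",
    "hearing loss": "t3",
}

def overlap(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by all distinct items of the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def to_topics(entities: Set[str]) -> Set[str]:
    """Map a pattern's participating entities to their (assumed) topics."""
    return {ENTITY_TOPIC[e] for e in entities}

# Two patterns that never share an exact entity mention (toy data).
induce = {"cisplatin", "nephrotoxicity"}
due_to = {"carboplatin", "kidney injury", "hearing loss"}

entity_level = overlap(induce, due_to)                       # no shared mentions
topic_level = overlap(to_topics(induce), to_topics(due_to))  # 2 of 3 topics shared
print(entity_level, round(topic_level, 2))  # 0.0 0.67
```

Under these assumptions, two patterns with no entity in common still show a clear connection once related entities fall into the same topic.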
Our LSA-based approach achieves satisfactory performance in finding semantically similar patterns across entity relations of interest, such as the drug-induced-disease relation and drug-drug interaction, to name a few. We refer interested readers to Huang and Lu (2016) for detailed evaluation results.

Expanding Entity-Relation Searches
Once our method identifies candidates of pattern synonyms, we collect the set of true synonymous patterns and apply them to semantic query expansion as below.
We first order a semantic relation's synonymous patterns according to their frequencies in PubMed queries, which represent user preferences or user intuitions (in searching the target bio-relation between two entities). See patterns in descending order of frequency in the second and fifth columns of Table 2. For example, PubMed users prefer using #C versus #C to #C vs #C or comparison of #C and #C in comparing two drugs. Currently, four common entity relations between drugs and between drugs and diseases are of particular interest to us: drug comparison, drug combination, drug-induced-disease and drug-treats-disease.
Second, for each relation, we assemble its 500 most searched entity pairs from our search logs. For example, <albuterol, levalbuterol> is a popular chemical pair for the drug comparison relation.
For each entity pair (e.g. <albuterol, levalbuterol>) of a semantic relation, we then submit a query with the pair using one of the relational patterns (e.g. albuterol vs levalbuterol) and compare the search result with that of the semantically expanded query that leverages all synonymous patterns (e.g. albuterol versus levalbuterol OR albuterol vs levalbuterol OR …, where the OR syntax combines PubMed retrieval results). Recall that the searches are limited to PubMed titles. Finally, we compute the ratio of the total number of search results obtained via all patterns of the semantic relation to the number obtained via each individual pattern, averaged over the 500 entity pairs. We refer to this ratio as benefit in recall, BiR.
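The expansion and the BiR computation described above can be sketched as follows. The search function is a hypothetical stand-in that returns a set of document identifiers for a query string; here it is backed by a small in-memory dictionary rather than PubMed. Skipping entity pairs for which an individual pattern returns no results is an assumption of this sketch, since the text does not say how such cases are handled.

```python
from typing import Callable, Dict, List, Set, Tuple

def instantiate(pattern: str, pair: Tuple[str, str]) -> str:
    """Fill the two #C slots of a pattern with the entity pair, left to right."""
    out = pattern
    for entity in pair:
        out = out.replace("#C", entity, 1)
    return out

def expand(patterns: List[str], pair: Tuple[str, str]) -> str:
    """Join all synonymous formulations of the pair with OR."""
    return " OR ".join(instantiate(p, pair) for p in patterns)

def benefit_in_recall(patterns: List[str], pairs: List[Tuple[str, str]],
                      search: Callable[[str], Set[int]]) -> Dict[str, float]:
    """Per pattern: mean over pairs of |results of all patterns| / |results of this pattern|."""
    ratios: Dict[str, List[float]] = {p: [] for p in patterns}
    for pair in pairs:
        per_pattern = {p: search(instantiate(p, pair)) for p in patterns}
        combined = set().union(*per_pattern.values())  # what the OR query retrieves
        for p, hits in per_pattern.items():
            if hits:  # assumption: skip pairs with no results for this pattern
                ratios[p].append(len(combined) / len(hits))
    return {p: sum(r) / len(r) for p, r in ratios.items() if r}

# Toy title index with made-up document identifiers.
toy_index: Dict[str, Set[int]] = {
    "albuterol versus levalbuterol": {1, 2, 3},
    "albuterol vs levalbuterol": {3, 4},
}
patterns = ["#C versus #C", "#C vs #C"]
pairs = [("albuterol", "levalbuterol")]

expanded_query = expand(patterns, pairs[0])
bir = benefit_in_recall(patterns, pairs, lambda q: toy_index.get(q, set()))
print(expanded_query)  # albuterol versus levalbuterol OR albuterol vs levalbuterol
print(bir["#C vs #C"])  # 2.0
```

In this toy setting the combined query retrieves four documents, so the pattern that alone finds two of them has a BiR of 2.0.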
In Table 2, a BiR score above 1 means that expanding queries using the collective synonymous patterns of the same semantics improves PubMed recall, i.e. helps PubMed retrieve more relevant documents. Take the drug comparison relation for example. Regardless of the chemical pair of interest, expanded queries can always retrieve more relevant documents than the individual pattern #C versus #C (more than twice as many on average: 2.38). In some cases in Table 2, the improvement in recall is substantial (e.g. 135.65 associated with #C compare #C, 904.2 associated with combine #C and #C, and so on).
The benefit of using our synonymous patterns for query expansion in current PubMed settings can be observed across various types of CC or CD entity-relation searches, i.e. searches with explicit relation mention. Interestingly, the patterns most frequently used by users (or the most intuitive/straightforward search patterns from users' points of view) may not always be the best choice by default: among the drug comparison patterns, comparison of #C and #C is more effective than the most popular #C versus #C in retrieving relevant documents. A semantic framework like ours can balance PubMed retrieval results across different entity-relation expressions in searches with similar meanings.

Expanding Pure Entity Pair Searches
Among PubMed searches, pure entity pair searches, i.e. searches containing only two bio-entities without any explicit relation mention (e.g. midazolam sevoflurane), account for approximately half of the searches involving two bio-entities. We therefore investigate in this subsection how we can improve PubMed user experience by expanding these queries with the help of our synonymous patterns and past user searches. The process is detailed below.
First, we identify pure entity pair searches only sought by PubMed users in a specific relation/context, based on which we expand the searches and impose semantic search constraints. Take the pure entity pair search midazolam sevoflurane for instance. Since it had only been searched with drug comparison relation by PubMed users, we later explicitly constrain that search query in the context of drug comparison relation. This step infers the implicit relation between the entity pair from the wisdom of the crowd (i.e. past search logs). Our hypothesis is that such implicit relation, if explicitly added to the search, may improve retrieval results and in turn user experience.
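The relation inference step can be sketched as a simple filter over past searches. The log format (an entity pair together with the relation of the pattern it was searched with) and the treatment of entity pairs as unordered are assumptions made for this illustration; the log entries are toy data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]

def single_relation_pairs(log: Iterable[Tuple[Pair, str]]) -> Dict[Pair, str]:
    """Return entity pairs that were searched with exactly one relation.

    `log` holds (entity pair, relation) observations taken from past queries
    with explicit relation mention. Pairs are normalised by sorting, which
    treats them as unordered (an assumption of this sketch).
    """
    seen: Dict[Pair, Set[str]] = defaultdict(set)
    for (first, second), relation in log:
        key = (first, second) if first <= second else (second, first)
        seen[key].add(relation)
    return {pair: next(iter(rels)) for pair, rels in seen.items() if len(rels) == 1}

# Toy search-log observations.
toy_log = [
    (("midazolam", "sevoflurane"), "drug comparison"),
    (("sevoflurane", "midazolam"), "drug comparison"),
    (("aspirin", "warfarin"), "drug comparison"),
    (("aspirin", "warfarin"), "drug interaction"),
]
implicit = single_relation_pairs(toy_log)
print(implicit)  # {('midazolam', 'sevoflurane'): 'drug comparison'}
```

A pair such as aspirin and warfarin, seen with more than one relation in this toy log, is left out because its implicit relation is ambiguous.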
In the current experiment, a total of 1,600 unique pure entity-pair queries are collected with CC relation constraints (i.e. drug comparison, drug combination, and drug interaction) and CD relation constraints (i.e. drug-treats-disease, drug-induced-disease, supplement-for-disease, drug-resistance-in-disease).
Similar to the settings in Section 3.2, we submit to PubMed (a) original queries, i.e. pure entity pairs and (b) expanded queries with explicit relation constraints learnt from past user queries. For example, original search midazolam sevoflurane and its semantics-constrained counterpart midazolam versus sevoflurane OR midazolam vs sevoflurane OR … (expanded using our synonymous patterns of the drug comparison relation, in which midazolam sevoflurane had only been sought) will be submitted to PubMed.
Finally, based on the search results from (a) and (b), we compute the retrieval effectiveness of regular PubMed by using (b)'s results as the ground truth. In other words, we assume the expanded queries truly represent users' search intention and their search results truly satisfy users' information needs. Retrieval performance is measured by standard information retrieval (IR) measures: precision (P), mean reciprocal rank (MRR) and nDCG (Jarvelin and Kekalainen, 2002) at rank 20.
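The three measures can be sketched for a single query as follows, treating the result set of the expanded query (b) as a binary ground truth for the ranked list returned by the original query (a). The nDCG variant shown uses the common log2(rank + 1) discount, which is an assumption here since the text does not spell out the exact formulation; the document identifiers are toy data and a small cutoff is used in the example for readability.

```python
import math
from typing import List, Set

def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of the top-k results that are in the ground truth."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int = 20) -> float:
    """1 / rank of the first relevant result in the top k, or 0 if there is none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 20) -> float:
    """Binary-relevance nDCG with a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Toy example: (a) ranked results of the original query, (b) expanded-query results.
ranked_a = ["d1", "d9", "d2", "d8"]
ground_truth_b = {"d1", "d2", "d3"}

p = precision_at_k(ranked_a, ground_truth_b, k=4)
rr = reciprocal_rank(ranked_a, ground_truth_b, k=4)
ndcg = ndcg_at_k(ranked_a, ground_truth_b, k=4)
print(p, rr)  # 0.5 1.0
```

MRR is then the mean of the reciprocal ranks over all queries, and the other two measures are averaged in the same way.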
As we can see in Table 3, the gap between the current MRR or nDCG scores and perfect scores (i.e. a perfect MRR or nDCG equals 1) suggests that there is genuine room for improving retrieval for such searches, i.e. pure entity pair searches, in current PubMed settings. While pure CD searches yield better results than pure CC searches, the potential gain in performance is still substantial for CD queries, and it can be achieved by simply adding semantic constraints and expanding queries accordingly. In some cases (e.g. pure entity pair searches with an implicit drug interaction relation), semantic constraints almost guarantee a more satisfying search performance.
Table 3. Results on pure CC and CD queries with implicit relations.

Summary
We have applied query semantics understanding to PubMed literature search. The proposed framework involves discovering synonymous relational patterns in queries and, based on those, expanding PubMed user queries, specifically entity search queries. Preliminary evaluation shows that such semantic query expansion helps to improve PubMed retrieval effectiveness, and better PubMed performance implies better user experience and less curation effort (Lu and Hirschman, 2012). Incorporating such a general-phrase semantics framework, complementary to domain-specific MeSH, into PubMed, which serves millions of users, is warranted.

Acknowledgements
This work was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. The authors would like to thank anonymous reviewers for their suggestions and comments.