A Neural Autoencoder Approach for Document Ranking and Query Refinement in Pharmacogenomic Information Retrieval

In this study, we investigate learning-to-rank and query refinement approaches for information retrieval in the pharmacogenomic domain. The goal is to improve the information retrieval process of biomedical curators, who manually build knowledge bases for personalized medicine. We study how to exploit the relationships between genes, variants, drugs, diseases and outcomes as features for document ranking and query refinement. For a supervised approach, we are faced with a small amount of annotated data and a large amount of unannotated data. Therefore, we explore ways to use a neural document auto-encoder in a semi-supervised approach. We show that a combination of established algorithms, feature engineering and a neural auto-encoder model yields promising results in this setting.


Introduction
Personalized medicine strives to relate genomic detail to patient phenotypic conditions (such as disease or adverse reactions to treatment) and to assess the effectiveness of available treatment options (Brunicardi et al., 2011). For computer-assisted decision making, knowledge bases need to be compiled from published scientific evidence. They describe biomarker relationships between key entity types: Disease, Protein/Gene, Variant/Mutation, Drug, and Patient Outcome (Outcome) (Manolio, 2010). While automated information extraction has been applied with adequate precision and recall to simple relationships, such as Drug-Drug (Asada et al., 2017) or Protein-Protein (Peng and Lu, 2017; Peng et al., 2015; Li et al., 2017) interactions, clinically actionable biomarkers need to satisfy rigorous quality criteria set by physicians and therefore call for manual data curation by domain experts.
To ascertain the timeliness of information, curators are faced with the labor-intensive task of identifying relevant articles in a steadily growing flow of publications (Lee et al., 2018). In our scenario, curators iteratively refine search queries in an electronic library such as PubMed. The information the curators search for are biomarker facts of the form {Gene(s) - Variant(s) - Drug(s) - Disease(s) - Outcome}. For example, a curator starts with a query consisting of a single gene, e.g. q_1 = {PIK3CA}, and receives a set of documents D_1. After examining D_1, the curator identifies the variants H1047R and E545K, which yields queries q_2 = {PIK3CA, H1047R} and q_3 = {PIK3CA, E545K} that lead to D_2 and D_3. As soon as studies are found that contain the entities in a biomarker relationship, the entities and the studies are entered into the knowledge base. This process is repeated until, theoretically, all published literature regarding the gene PIK3CA has been screened.
Our goal is to reduce the number of documents that domain experts need to examine. To achieve this, an information retrieval system should rank relevant documents high and should facilitate the identification of relevant entities for refining the query.
Classic approaches for document ranking, like tf-idf (Luhn, 1957; Spärck Jones, 1972) or bm25 (Robertson and Zaragoza, 2009), and, for query refinement, the Relevance Model (Lavrenko and Croft, 2001), are established techniques in this setting. They are known to be robust and do not require training data. However, as they are based on a bag-of-words (BOW) model, they cannot represent a semantic relationship between entities in a document. This, for example, yields search results with highly ranked review articles that only list query terms, without the desired relationship between them. Therefore, we investigate approaches that model the semantic relationships between biomarker entities. This can be addressed either by combining BOW with rule-based filtering, or by supervised learning, i.e. learning-to-rank (LTR).
Our goal is to tailor document ranking and query refinement to the task of the curator. This means that a document ranking model should assign a high rank to a document that contains the query entities in a biomarker relationship. A query refinement model should suggest additional query terms, i.e. biomarker entities, that are relevant to the current query. Given the complexity of entity relationships and the high variety of textual realizations, this requires either effective feature engineering or large amounts of training data for a supervised approach. The in-house data set of Molecular Health consists of 5833 labeled biomarker facts and 24 million unlabeled text documents from PubMed. Therefore, a good solution is to exploit the large amount of unlabeled data in a semi-supervised approach. Li et al. (2015) have shown that a neural auto-encoder with LSTMs (Hochreiter and Schmidhuber, 1997) can encode the syntax and semantics of a text in a dense vector representation. We show that this representation can be effectively used as a feature for semi-supervised learning-to-rank and query refinement.
In this paper, we describe a feature engineering approach and a semi-supervised approach. In our experiments we show that the two approaches are almost on par in terms of performance and that they even improve in a joint model. In Section 2 we describe the neural auto-encoder; in Section 3 we describe our models for document ranking and in Section 4 the models for query refinement.

Neural Auto-Encoder
In this study, we use an unsupervised method to encode text into a dense vector representation. Our goal is to investigate if we can use this representation as an encoding of the relations between biomarker entities.
Following Sutskever et al. (2014) and Cho et al. (2014), we implemented a text auto-encoder with a Sequence-to-Sequence approach. In this model an encoder Enc produces a vector representation v = Enc(d) of an input document d = [w_1, w_2, . . . , w_n], with w_i being word embedding representations (Mikolov et al., 2013). This dense representation v is then fed to a decoder Dec that tries to reconstruct the original input, i.e. d̂ = Dec(v). During training we minimize error(d, d̂). After training, we discard the decoder and only use Enc(d) to encode the text. We use the output of the encoder Enc as features for a document ranking model and for a query refinement model.
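The encode-decode structure can be sketched as follows. For brevity this toy sketch uses a plain (vanilla) RNN cell in NumPy instead of the paper's LSTMs, with random untrained weights and tiny dimensions, so it only illustrates the data flow v = Enc(d) and d̂ = Dec(v), not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID = 8, 16          # toy sizes; the paper uses a hidden size of 400

# Randomly initialised toy parameters (a real model learns these).
W_e = rng.normal(0, 0.1, (HID, HID + EMB))   # encoder recurrence
W_d = rng.normal(0, 0.1, (HID, HID + EMB))   # decoder recurrence
W_o = rng.normal(0, 0.1, (EMB, HID))         # decoder output projection

def enc(doc):
    """Encode a sequence of word embeddings into a single dense vector v."""
    h = np.zeros(HID)
    for w in doc:
        h = np.tanh(W_e @ np.concatenate([h, w]))
    return h

def dec(v, n):
    """Reconstruct n embedding-sized outputs from the dense code v."""
    h, w, out = v, np.zeros(EMB), []
    for _ in range(n):
        h = np.tanh(W_d @ np.concatenate([h, w]))
        w = W_o @ h
        out.append(w)
    return np.stack(out)

doc = rng.normal(size=(5, EMB))       # a 5-token toy "document"
v = enc(doc)                          # v = Enc(d), later used as a feature
d_hat = dec(v, len(doc))              # d̂ = Dec(v), only needed during training
reconstruction_error = float(np.mean((doc - d_hat) ** 2))
```

After training, only `enc` would be kept; its output vector is the feature representation used by the ranking and refinement models below.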

Document Ranking
Information retrieval systems rank documents in the order that is estimated to be most useful to a user query by assigning a numeric score to each document. Our pipeline for document ranking is depicted in Figure 1: Given a query q, we first retrieve a set of documents D_q that contain all of the query terms. Then, we compute a representation rep_q(q) for the query q and a representation rep_d(d) for each document d ∈ D_q. Finally, we compute the score with a ranker model score_rank. For rep_d we need a representation that covers an arbitrary number of entity-type combinations, because a fact can consist of, e.g., 3 Genes, 4 Variants, 1 Drug, 0 Diseases and 0 Outcomes. In the following, we describe several settings for rep_q(q), rep_d(d) and the ranker model.

Bag-of-Words Models
We have implemented two commonly used BOW models, tf-idf and bm25. For these models the text representations rep_q(q) and rep_d(d) are vector space model representations.
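As an illustration, a minimal Okapi BM25 implementation over tokenized documents (the standard parameters k1 = 1.2 and b = 0.75 and the toy documents are assumptions; the paper does not state its settings):

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of `doc` for `query`, both given as token lists.

    `docs` is the whole collection, used for document frequencies and
    the average document length."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for t in set(query):
        df = sum(1 for d in docs if t in d)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["pik3ca", "h1047r", "mutation"],
        ["braf", "melanoma", "treatment"],
        ["pik3ca", "e545k", "pik3ca", "pathway"]]
q = ["pik3ca", "h1047r"]
ranked = sorted(docs, key=lambda d: bm25_score(q, d, docs), reverse=True)
```

Unlike tf-idf, the document-length term in the denominator penalizes long documents, which is the property referred to in the results section.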

Learning-to-Rank Models
For the learning-to-rank models, we chose a multilayer perceptron (MLP) as the scoring function score_rank. In the following we explain how rep_q(q) and rep_d(d) are computed.

Feature Engineering Model
We created a set of basic features: the frequency of entity types, distance features between entity types, and context words of entities. In this model, features are query dependent and are computed on demand by a feature function f(q, d).
The algorithm to compute the distance features is as follows: Given a query q with entities e ∈ q and a document d = [w_1, w_2, . . . , w_n], with w being words in the document, let type(e) be the function that yields the entity type, e.g. type(e) = Gene. Then, if e_i, e_j ∈ q and there exist w_k = e_i and w_l = e_j, we add |l − k| to the bucket for the type pair {type(e_i), type(e_j)}. To summarize the collected distances, we compute min(), max(), mean() and std() over all collected distances for each bucket separately.
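The distance-feature computation described above can be sketched as follows (the entity names and the toy document are invented for illustration):

```python
from itertools import combinations
from statistics import mean, pstdev

def distance_features(query_entities, doc_tokens, entity_type):
    """Bucketed distance features between query entities in a document.

    `query_entities`: the entities of the current query; `doc_tokens`:
    the document as a token list; `entity_type`: maps entity -> type.
    Returns {frozenset of two types: (min, max, mean, std)} over all
    token distances collected for that type pair."""
    positions = {e: [k for k, w in enumerate(doc_tokens) if w == e]
                 for e in query_entities}
    buckets = {}
    for ei, ej in combinations(query_entities, 2):
        key = frozenset((entity_type[ei], entity_type[ej]))
        for k in positions[ei]:
            for l in positions[ej]:
                buckets.setdefault(key, []).append(abs(l - k))
    return {key: (min(ds), max(ds), mean(ds), pstdev(ds))
            for key, ds in buckets.items()}

types = {"PIK3CA": "Gene", "H1047R": "Variant"}
doc = ["the", "PIK3CA", "variant", "H1047R", "is", "PIK3CA", "related"]
feats = distance_features(["PIK3CA", "H1047R"], doc, types)
```

Here PIK3CA occurs at positions 1 and 5 and H1047R at position 3, so the Gene-Variant bucket collects the distances 2 and 2.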
For the context words feature, we collected in a prior step a list of the top 20 words for each {type(e_i), type(e_j)} bucket, i.e. we collect the words that occur between w_k = e_i and w_l = e_j if |k − l| < 10. We remove stop words, special characters and numbers from this list and also manually remove words using domain knowledge. The top 20 remaining words for each {type(e_i), type(e_j)} bucket are used as boolean indicator features.
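A sketch of the prior collection step for the context words (the stop list and example documents are illustrative placeholders, and the manual domain-knowledge filtering is omitted):

```python
from collections import Counter

STOP = {"the", "a", "of", "in", "is", "and"}   # tiny illustrative stop list

def context_word_counts(docs, entity_type, window=10):
    """For each pair of entity types, count the words that occur between
    two entity mentions fewer than `window` tokens apart. Stop words,
    numbers and the entities themselves are skipped; the top-k words of
    each bucket later become boolean indicator features."""
    buckets = {}
    for doc in docs:
        ents = [(k, w) for k, w in enumerate(doc) if w in entity_type]
        for (k, ei), (l, ej) in ((a, b) for a in ents for b in ents
                                 if a[0] < b[0]):
            if l - k >= window:
                continue
            key = frozenset((entity_type[ei], entity_type[ej]))
            cnt = buckets.setdefault(key, Counter())
            for w in doc[k + 1:l]:
                if w not in STOP and w not in entity_type and not w.isdigit():
                    cnt[w] += 1
    return buckets

types = {"PIK3CA": "Gene", "H1047R": "Variant"}
docs = [["PIK3CA", "harbors", "the", "activating", "mutation", "H1047R"],
        ["H1047R", "mutation", "in", "PIK3CA"]]
buckets = context_word_counts(docs, types)
top = buckets[frozenset(("Gene", "Variant"))].most_common(2)
```

In this toy collection "mutation" is the most frequent context word for the Gene-Variant pair, which is the kind of signal the boolean indicator features capture.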

Auto-Encoder Model
In this model we use the auto-encoder from Section 2 to encode the query and the document. The input to the score function is the element-wise product, denoted by ⊙, of the query encoding rep_q(q) = Enc(q) = q̂ and the document encoding rep_d(d) = Enc(d) = d̂, i.e. score_rank(q, d) = MLP(q̂ ⊙ d̂). To encode the queries, we compose a pseudo text from the query terms. The input to the auto-encoder Enc is the word embeddings of the pseudo text for rep_q(q) and the word embeddings of the document terms for rep_d(d).
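A minimal sketch of this scoring step, assuming a one-hidden-layer MLP with random untrained weights; `q_hat` and `d_hat` stand in for the auto-encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, HIDDEN = 16, 8

# Toy MLP parameters; in the paper these are trained with a pairwise
# hinge loss (see the Training paragraph of the Experiments section).
W1 = rng.normal(0, 0.1, (HIDDEN, DIM))
W2 = rng.normal(0, 0.1, (1, HIDDEN))

def score_rank(q_vec, d_vec):
    """MLP score of a document for a query, fed the element-wise
    product of the two auto-encoder representations."""
    x = q_vec * d_vec                 # element-wise product  q̂ ⊙ d̂
    h = np.maximum(0.0, W1 @ x)       # ReLU hidden layer
    return float(W2 @ h)

q_hat = rng.normal(size=DIM)          # stands in for Enc(query pseudo text)
d_hat = rng.normal(size=DIM)          # stands in for Enc(document)
s = score_rank(q_hat, d_hat)
```

The element-wise product lets the MLP weight dimensions where the query and document encodings agree, rather than learning a joint embedding from scratch.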

Query Refinement
Query refinement finds additional terms for the initial query q that better describe the information need of the user (Nallapati and Shah, 2006). In our approach we follow Cao et al. (2008), in which the ranked documents D_q are used as pseudo relevance feedback. Our goal is to suggest relevant entities e that are contained in D_q and that are in a biomarker relationship to q. Therefore, we define a scoring function score_ref for ranking candidate entities e for the query q with respect to the retrieved document set D_q. See Figure 2 for a sketch of the query refinement pipeline. In the following, we describe several scoring functions.

Bag-of-Words Models
We implemented two classic query refinement models: the Rocchio algorithm (Rocchio and Salton, 1965) and the Relevance Model (Lavrenko and Croft, 2001).

Auto-Encoder Model
In this model, we also exploit the auto-encoding of the query. The idea is as follows: (i) Given a list of documents and their scores D_q = [(d_1, s_1), (d_2, s_2), . . . , (d_n, s_n)] for a query q from the previous step, we use the ranking scores s_i as pseudo-relevance feedback to create a pseudo document rep_c(D, ŝ) = Σ_{i=1}^{n} d̂_i ŝ_i = d̂. The scores s are normalized so that they are non-negative and Σ_i ŝ_i = 1, see Appendix A.1. (ii) From all entities e_i ∈ D_q \ q we generate a new query encoding rep_q(q ∪ e_i) = Enc(q ∪ e_i) = q̂_{e_i}. (iii) We rank the entities by scoring each q̂_{e_i} against the pseudo document d̂, i.e. we propose those entities as a query refinement that agree with the most relevant documents.
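The three steps can be sketched as follows; the dot-product similarity in step (iii) and the deterministic toy encoder are illustrative assumptions, not the paper's exact choices:

```python
import zlib
import numpy as np

def toy_enc(tokens):
    """Deterministic stand-in for the learned encoder Enc (toy only):
    averages per-token vectors seeded by a CRC32 hash of the token."""
    vecs = [np.random.default_rng(zlib.crc32(t.encode())).normal(size=8)
            for t in tokens]
    return np.mean(vecs, axis=0)

def pseudo_document(doc_vecs, scores):
    """Score-weighted sum of document encodings, d̂ = Σ_i d̂_i ŝ_i.
    `scores` are the normalized (non-negative, sum-to-one) rank scores."""
    s = np.asarray(scores, dtype=float)
    return (np.asarray(doc_vecs) * s[:, None]).sum(axis=0)

def rank_candidates(enc, query, candidates, d_hat):
    """Rank candidate entities e by how well Enc(q ∪ e) agrees with the
    pseudo document; the dot product as similarity is an assumption."""
    scored = [(e, float(enc(query + [e]) @ d_hat)) for e in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)

docs = [toy_enc(["pik3ca", "h1047r"]), toy_enc(["braf", "melanoma"])]
d_hat = pseudo_document(docs, [0.9, 0.1])   # top-ranked document dominates
ranking = rank_candidates(toy_enc, ["pik3ca"], ["h1047r", "melanoma"], d_hat)
```

Because the pseudo document is dominated by the highest-scoring documents, candidates whose expanded-query encoding resembles those documents are proposed first.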

Experiments
In this section, we first explain our evaluation strategy to assess the performance of the respective models for document ranking and query refinement. Subsequently, we describe the settings for the data and the results of the experiments that we have conducted.

Evaluation Protocol
Document ranking models are evaluated by their ability to rank relevant documents higher than irrelevant documents. Query refinement models are evaluated both by their ability to rank relevant query terms high and by the recall of retrieved relevant documents when the query is automatically refined by the 1st, 2nd and 3rd proposed query term. We evaluate our models using mean average precision (MAP) (Manning et al., 2008, Chapter 11) and normalized discounted cumulative gain (nDCG) (Järvelin and Kekäläinen, 2000).

For both the document ranking and the query refinement approach we interpret a biomarker fact as a perfect query and the corresponding papers as the true-positive (or relevant) papers associated with this query. In this way, we use the curated facts as document-level annotation for our approach. Because we want to assist the iterative approach of curators, in which they refine an initially broad query to ever narrower searches, we need to create valid partial queries and associated relevant documents to mimic this procedure. Therefore, we generate sub-queries, which are partial subsets of the facts. We generated two data sets: one for document ranking and one for query refinement. For document ranking, we generated all distinct subsets of the facts. For query refinement, we defined the entities eliminated in the sub-query generation process as true-positive refinement terms. For both data sets, we use all documents associated with the original biomarker fact as true-positive relevant documents.
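The sub-query generation can be sketched as follows, assuming every non-empty proper subset of a fact's entities forms a partial query and the removed entities are its true-positive refinement terms:

```python
from itertools import combinations

def sub_queries(fact_entities):
    """All distinct non-empty proper subsets of a fact's entities.

    Each subset is a partial query; the eliminated entities are the
    true-positive refinement terms for that query. (Treating only
    proper subsets as queries is an assumption for illustration.)"""
    ents = sorted(fact_entities)
    out = []
    for r in range(1, len(ents)):
        for sub in combinations(ents, r):
            out.append((set(sub), set(ents) - set(sub)))
    return out

fact = {"PIK3CA", "H1047R", "Trastuzumab"}
pairs = sub_queries(fact)   # 3 one-entity + 3 two-entity partial queries
```

All documents curated for the original fact serve as the relevant set for every generated partial query.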

Data
Unlabeled Data As unlabeled data we use ∼24 million abstracts from PubMed. To automatically annotate PubMed abstracts with disambiguated biomarker entities, we use a tool set that has been developed together with biomedical curators. It employs ProMiner (Hanisch et al., 2005) for Protein/Gene and Disease entity recognition, regular expression based text scanning with synonyms from DrugBank and PubChem for the identification of Drugs, and manually edited regular expressions relating to the HGVS standard to retrieve Variants. We restricted the PubMed documents to those that include at least one entity each of type Gene, Drug and Disease, leaving us with 2.7 million documents. Additionally, we replaced the text of every disambiguated entity with its ID.
Labeled Data As labeled data we use a knowledge base that contains a set of 5833 hand-curated {Gene(s) - Variant(s) - Drug(s) - Disease(s) - Outcome} biomarker facts and their associated papers, which domain experts extracted from ∼1200 full-text documents. We only keep facts in which the disambiguated entities are fully represented in the available abstracts. This restricted our data set to 1181 distinct facts. The four most frequently curated genes are EGFR (29%), BRAF (13%), KRAS (8%) and PIK3CA (5%).

Cross Validation
To exploit all of our labeled data for training and testing, we perform 4-fold cross-validation. Because in our scenario a curator starts with an initial entity of type Gene, we have generated our validation and test sets based on genes instead of randomly sampling facts. This also guarantees that the same sub-query is never included in the training, validation and test sets. In total we have built 12 different splits of our data set, basing the validation and test set each on a different gene. The respective training sets are built from all remaining facts that do not include the validation and test genes. Statistics can be found in Table 1.

Training of Embeddings and Auto-Encoder
The training data for the embeddings and the auto-encoder are the PubMed abstracts described in the previous Section 5.2. We trained the embeddings with Skip-Gram. For the vocabulary, we keep the 100k most frequent words, while making sure all known entities are included. We use a window size of 10 and train the embeddings for 17 epochs. We normalize digits to "0" and lowercase all words. Tokenization is done by splitting on white space and before and after special characters. For both the encoder and the decoder, we use two LSTM cells per block with a hidden size of 400 each. We skip unknown tokens and feed the words in reverse order for the decoder, following Sutskever et al. (2014). The auto-encoder was trained for 15 epochs using early stopping.

Document Ranking
In this section, we describe the document ranking models, their training and then discuss the results of their evaluation.
Models We evaluate the BOW models (tf-idf, bm25) (Section 3.1) and the two LTR models using feature engineering (feat) and the auto-encoded features (auto-rank) (Section 3.2). We also evaluate an additional set of models to investigate whether simpler solutions can be competitive. See Table 2 for an overview of all ranking models.
(a.) A simpler solution than learning an MLP for the score function is to compute the similarity between q̂ = Enc(q) and the document encoding d̂ = Enc(d). Therefore, we use the cosine similarity between the vector representations q̂ and d̂ as the scoring function (auto/cosine).
(b.) Instead of encoding the documents and queries with the auto-encoder, we encode them by their tf-idf weighted embeddings, i.e. q̂ = Σ_i tfidf(q_i) · w_{q_i}. As in the auto-rank model, the input to the classifier MLP is the element-wise product of the query encoding and the document encoding (emb).
(c.) Due to the promising results of the auto-rank, bm25 and feat models, we also tested combinations of them: the concatenation of the bm25 score with the auto-rank features (auto-rank + bm25) as well as the concatenation of the feat features with the auto-rank features (auto-rank + feat).
Training We train our models with Adam (Kingma and Ba, 2014) and tune the initial learning rate; all other parameters are kept at the TensorFlow defaults. We use a pairwise hinge loss (Chen et al., 2009) that compares relevant documents with irrelevant documents.
The ranking score function is parameterized by an MLP for which the number of layers is a hyperparameter tuned using grid search. The input layer size is based on the number of input features. To limit the total number of parameters, we decrease the layer size with depth, i.e. layer i has size l_i = ((b − i + 1)/b) · |u|, with b being the depth of the network and |u| the number of input features. As activation function between layers we use ReLU (Glorot et al., 2011).
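The layer-size schedule can be sketched as follows (rounding to integers is an assumption; the paper does not specify how fractional sizes are handled):

```python
def layer_sizes(num_features, depth):
    """Hidden-layer sizes l_i = (depth - i + 1)/depth * |u| for
    i = 1..depth, shrinking linearly from the input size."""
    return [max(1, round((depth - i + 1) / depth * num_features))
            for i in range(1, depth + 1)]

sizes = layer_sizes(400, 4)   # e.g. 400 input features, depth 4
```

For 400 input features and depth 4 this yields layers of sizes 400, 300, 200 and 100.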
The BOW models as well as auto/cosine were only computed for the respective validation and test sets.

Results
In Table 3 we list the average MAP and nDCG scores on the test sets. The tf-idf model is outperformed by most of the other models. However, bm25, which additionally takes the length of a document into account, performs very well. Both tf-idf and bm25 have the major benefit of fast computation.
The feat model slightly outperforms the auto-rank model. The cosine similarity of a query and a document (auto/cos) does not yield a good result. This shows that the auto-encoder has learned many features, most of which do not correlate with our task. We also find that emb does not reach the performance of auto-rank. The combination auto-rank + feat is slightly better than auto-rank + bm25, and both achieve the overall best performance. This shows that the auto-encoder learns something orthogonal to term frequency and document length. The best model with respect to document ranking is auto-rank + feat.
In Figure 3 we show the correlation between the different models. Interestingly, bm25 and feat correlate strongly. However, the scores of bm25 do not correlate with the scores of the combination of auto-rank and bm25. This indicates that the combined model does not primarily learn to use the bm25 score but also focuses on the auto-encoded representation. This underlines the hypothesis that the auto-encoder is able to represent latent features of the relationship of the query terms in the document.
Influence of the Data It is interesting to see that the learned models do not perform well on the EGFR set. The reason might be that testing on it reduces the amount of training data substantially, as EGFR is the best-curated gene and thus constitutes the largest split of the data set.
In a manual error analysis we compared the rankings of four of the best models (bm25, feat, auto-rank, and auto-rank + bm25). We observe cases where the auto-rank model is unable to distinguish similar entities, e.g. Neoplasm and Colorectal Neoplasm. In these cases bm25 helps, as it treats different words as different features. However, both the bm25 and the feat model rank reviews high that only list query terms. For example, when executing the query {BRAF, PIK3CA, H1047R}, these models rank survey articles high (e.g. Janku et al., 2012; Chen et al., 2014).
The auto-rank model, on the other hand, ranks those documents high in which the entities appear in a semantic relationship (e.g. Falchook et al., 2013).

Query Refinement
In this section we describe our training approaches for query refinement and discuss our findings. The pseudo-relevance feedback for the query refinement is based on the ranked documents from the previous query. For our experiments we chose the second-best document ranker (auto-rank + bm25) from the previous experiments, because our prototype implementation of auto-rank + feat was computationally too expensive.

Models
We combined the auto-encoder features with the candidate terms of the respective BOW models (auto-ref + rocch + relev). In order to identify whether the good results of this combination are due to the BOW models, or whether the auto-encoded features have an effect, we trained an MLP with the same number of parameters but only the features of the two BOW models as input (rocchio + relev).
Training For training, we use the same settings for query refinement as for document ranking and again use a pairwise hinge loss. Here we compare entities that occur in the facts with randomly sampled entities that occur in the retrieved documents.
Due to time constraints we were only able to test our query refinement models on one validation/test split. We chose the data set split with the genes KRAS and PIK3CA for validation and testing, respectively. We restricted our models to regard only the top 50 ranked documents for refinement.

Results
To evaluate the ranking of entity terms we computed nDCG@10, nDCG@100 and MAP; see Table 4 for the results. We also compute the Recall@k of relevant documents for automatically refined queries using the 1st, 2nd and 3rd ranked entities. The scores can be found in Table 5.
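For reference, minimal implementations of the two ranking metrics with binary relevance (log2 discounting for nDCG is the common choice and an assumption here):

```python
import math

def average_precision(ranked, relevant):
    """AP of a ranked list given the set of relevant items; MAP is the
    mean of this value over all queries."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, 1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / max(1, len(relevant))

def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary relevance and log2 position discounting."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(ranked[:k], 1) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0

ap = average_precision(["e1", "e2", "e3"], {"e1", "e3"})
ndcg = ndcg_at_k(["e1", "e2", "e3"], {"e1", "e3"}, 3)
```

Both metrics reward placing the relevant entity terms near the top of the proposed refinement list.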
Tables 4 and 5 show that the Relevance Model outperforms the Rocchio algorithm in every aspect. Both models outperform the auto-encoder approach (auto-ref). We suspect that summing over the encodings distorts the individual features too much for a correct extraction of relevant entities to be possible.
The combination of all three models (auto-ref + rocchio + relevance) outperforms the other models in most cases. Especially the performance for ranking entity terms is increased by the auto-encoded features. However, it is interesting that the rocchio + relevance model achieves a higher recall for the second and third proposed terms. This indicates that for user-evaluated term suggestions the inclusion of the auto-encoded features is advisable, whereas for automatic query refinement, on average, it is not.

Related Work

Xu et al. (2017) propose using auto-encoders on the vector space model in a supervised setting for information retrieval and show that it improves performance. The quality of biomedical word embeddings was investigated by Th et al. (2015) and Chiu et al. (2016). Dogan et al. (2017) have developed an open source data set to which we would like to adapt our approach. Sheikhshab et al. (2016) have developed a novel approach for tagging genes which we would like to explore. Palangi et al. (2016) use LSTMs to encode the query and the document and use the cosine similarity together with click-through data as features for ranking in a supervised approach. Cao et al. (2008) define a distance-based, feature-engineered supervised learning approach to identify good expansion terms. They investigate whether the selected expansion terms are useful for information retrieval by identifying whether the terms are actually related to the initial query. Nogueira and Cho (2017) introduced a reinforcement learning approach for query refinement using logging data. They learn a representation of the text and the query using RNNs and CNNs and reward the end result based on the recall of a recurrently expanded query. Sordoni et al. (2015) developed a query reformulation model based on sequences of user queries. They used a Sequence-to-Sequence model with RNNs to encode and decode the queries of a user.

Conclusion
We have considered several approaches for document ranking and query refinement by investigating classic models, feature engineering and, due to the large amount of unlabeled data, a semisupervised approach using a neural auto-encoder.
Leveraging the large amount of unlabeled data to learn an auto-encoder on text documents yields semantically descriptive features that make subsequent document ranking and query refinement feasible. The combination with BOW features increases the performance substantially and, in our experiments, yields the best results for both document ranking and query refinement.
We were able to achieve promising results; however, there is a wide range of Sequence-to-Sequence architectures and text encoding strategies, so we expect that there is still room for improvement.

A.1 Normalizing Document Scores for Query Refinement
Because we used a hinge loss instead of a cross-entropy loss in the ranking model, we cannot interpret the scores s of the ranker as logits. While we do not know the magnitude of the ranker scores, we do expect the scores to be positive for relevant documents. If, however, many documents have scores below zero, this should also be taken into account. Based on this, we define the following normalization of the document scores:
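A minimal sketch of one normalization satisfying these constraints (non-negative weights summing to one); shifting by the minimum when negative scores occur, then dividing by the sum, is an assumption and not necessarily the exact definition used:

```python
import numpy as np

def normalize_scores(scores, eps=1e-12):
    """Map raw hinge-loss ranker scores to non-negative weights ŝ with
    Σ_i ŝ_i = 1. Negative scores are handled by shifting the whole score
    vector by its minimum (an illustrative assumption)."""
    s = np.asarray(scores, dtype=float)
    if (s < 0).any():
        s = s - s.min()        # shift so the lowest score becomes zero
    total = s.sum()
    if total < eps:            # all-zero edge case: fall back to uniform
        return np.full_like(s, 1.0 / len(s))
    return s / total

w = normalize_scores([2.0, -1.0, 0.5])
```

The resulting weights can be used directly as the ŝ_i in the pseudo-document construction of the query refinement model.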