Deep Learning for Biomedical Information Retrieval: Learning Textual Relevance from Click Logs

We describe a Deep Learning approach to modeling the relevance of a document’s text to a query, applied to biomedical literature. Instead of mapping each document and query to a common semantic space, we compute a variable-length difference vector between the query and document which is then passed through a deep convolution stage followed by a deep regression network to produce the estimated probability of the document’s relevance to the query. Despite the small amount of training data, this approach produces a more robust predictor than computing similarities between semantic vector representations of the query and document, and also results in significant improvements over traditional IR text factors. In the future, we plan to explore its application in improving PubMed search.


Introduction
The goal of this research was to explore Deep Learning models for learning textual relevance of documents to simple keyword-style queries, as applied to biomedical literature. We wanted to address two main research questions: (1) Without using a curated thesaurus of synonyms and related terms, or an industry ontology like Medical Subject Headings (MeSH®), can a neural network relevance model go beyond measuring the presence of query words in a document, and capture some of the semantics in the rest of the document text? (2) Can a deep learning model demonstrate robust performance despite training on a relatively small amount of labelled data?
We had access to a month of click logs from PubMed®, a biomedical literature search engine serving about 3 million queries a day, 20 results per page (Dogan et al., 2009). Most current users of the system are domain experts looking for the most recent papers by an author, or searching with complex topical boolean query expressions on document aspects. For a small proportion (∼5%) of the searches in PubMed, the retrieved articles are sorted by relevance, instead of the default sort order by date. Usage analysis has shown (ibid.) that topic-based queries are a significant part of the search traffic. Such queries often combine two or more entities (e.g. gene and disease), and while users still use short queries, they are persistent and will frequently reformulate their queries to narrow the search results. So improving the ranking is important to satisfy the needs of PubMed's expanding user base.
Traditional lexical Information Retrieval (IR) factors measure the prominence of query terms in documents treated as bags of words. While factors like Okapi BM25 (Robertson et al., 1994) and Query Likelihood (Miller et al., 1999) are quite effective, there are several cases where they fail. Two that we wanted to target were: (i) the under-specified query problem, where even irrelevant documents have a prominent presence of the query terms, and relevance requires analysis of the topics and semantics not directly specified in the query, and (ii) the term mismatch problem (Furnas et al., 1987), which requires detection of related alternative terms or phrases in the document when the actual query terms are not in the document.

Background
Deep Learning models have been applied to various types of text matching problems. Their common goal is to go beyond the lexical bag-of-words treatment and model text matching as a complex function in a continuous space. An overview of neural retrieval models can be found in (Zhang et al., 2016; Mitra and Craswell, 2017). We review some of this work that motivated our research.
Most text Deep Learning models start with a numeric vector representation of text's lexical units, most commonly terms or words. Ideally these vectors are trained as part of the model, however when training data is limited, many researchers pretrain these word-vectors in an unsupervised manner on a large text corpus, often using one of the word2vec models (Mikolov et al., 2013a,b). We used the SkipGram Hierarchical Softmax method to pre-train our word-vectors on Titles and Abstracts from all documents in PubMed.
Word Mover's Distance (WMD) (Kusner et al., 2015) is an (untrained) model for determining the semantic similarity between two texts by computing the pairwise distances between the words' vectors. It leverages the similarity of vectors of semantically related words. When applied to ad hoc IR, it often successfully tackles the term mismatch problem. We compare our model's performance against WMD, and show that the added complexity produces further improvements in ranking.
Many deep learning text similarity and IR models first project the query and each document to vectors in a common latent semantic space. A second stage then determines the 'match' between the query and document vectors. In the relevance model described in (Huang et al., 2013) the last stage is the cosine similarity function, and in follow-up work (Shen et al., 2014) the authors use a convolutional layer as part of the semantic mapping network, and a feed-forward classification network is trained to compute the similarity. Instead of training word embeddings, their document representation is based on representing each word as a bag of letter tri-grams. Their model is trained on about 30 million labelled query-document pairs extracted from the click logs of a web search engine. The convolution layer is used to capture a word's context and word n-grams. A similar approach is taken in . The ARC-I semantic similarity model of (Hu et al., 2014) uses a stack of interleaving convolution and max-pooling layers to map a sentence to a semantic vector. They argue that stacking convolutions of width 3 or more allows them to capture richer compositional semantics than the recurrent (Mikolov et al., 2010) or recursive (Socher et al., 2011a,b) approaches. However, convolutional architectures do have fixed depths that bound the level of composition. Our use of a vertical stack of convolutional layers without interleaving pooling layers is similar to the successful image recognition models AlexNet (Krizhevsky et al., 2012) and VGGNet (Simonyan and Zisserman, 2015).
Severyn & Moschitti's (2015) model to rank short text pairs is trained on small data (∼50k to 100k samples). Word embeddings are pretrained using word2vec, a convolutional network maps documents to a semantic vector, followed by a difference matrix and a 3-layer classification network to compute the similarity between the input texts. This is much closer to our final approach, and we compare the performance of our relevance model against this model, but using word-embeddings of size 300 rather than 50 to try to capture richer semantics in biomedical literature.
Another approach to text matching first develops 'local interactions' by comparing all possible combinations of words and word sequences between the two texts. Examples are described in (Hu et al., 2014; Lu and Li, 2013). A recent IR model based on this approach is described in (Guo et al., 2016). The authors argue that the local-interaction based approach is better at capturing detail, especially exact query term matches. Our approach simplifies the local interactions by pairing each document word with a single query word, followed by deep convolutions to attempt to capture some related compositional semantics.

The Input
We extracted query-document pairs from one month of PubMed click logs where users selected 'Best Match' (relevance) as the retrieval sort order. For each search resulting in a click, the first page of up to 20 documents was recorded. If the clicked document was not on the first page, it was added to this list. The first click on a PubMed search result takes you to a document summary page. Further clicks to the full text of the document were also recorded. Documents that received clicks were labelled as relevant. This binary notion of relevance was used to train our models, and for model evaluation using precision-based ranking metrics. We also experimented with relevance levels, based on a formula hand-tuned to match human-perceived relevance (see appendix). We report NDCG metrics using these relevance levels.
The queries were restricted to simple text searches of up to seven words, thus eliminating boolean expressions, author searches and queries mentioning document fields. Log extracts were further restricted to queries with at least 21 documents, and at least 3 clicked documents. These filters reduced the logs to about 33,500 queries.
These queries were randomly split to 60% training, and 20% each for validation and testing. The number of documents available for each query was quite skewed. Since the metrics we use (described below) give equal weight to each query, we further sub-sampled the training and validation datasets to pick at most 20 of the most relevant documents and an equal number of non-relevant documents. This helped balance out the significance of the queries without reducing the data size too much, and improved the mean per-query metrics of the trained models. The resulting training dataset consisted of 634,790 samples (query-document pairs).
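The split and the per-query sub-sampling described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: the function names, the `(doc_id, relevance_level)` tuple layout, the fixed seeds, and the choice to draw the non-relevant documents uniformly at random are all assumptions.

```python
import random

def split_queries(query_ids, seed=0):
    """Randomly split query ids into 60% train, 20% validation, 20% test."""
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 6 // 10
    n_valid = len(ids) * 2 // 10
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

def subsample(docs, max_relevant=20, seed=0):
    """Keep at most `max_relevant` of the most relevant documents for one
    query, plus an equal number of non-relevant ones.

    docs: list of (doc_id, relevance_level); level 0 means non-relevant.
    Assumption: non-relevant documents are sampled uniformly at random.
    """
    relevant = sorted((d for d in docs if d[1] > 0),
                      key=lambda d: d[1], reverse=True)[:max_relevant]
    non_relevant = [d for d in docs if d[1] == 0]
    k = min(len(relevant), len(non_relevant))
    return relevant + random.Random(seed).sample(non_relevant, k)
```

Splitting by query rather than by query-document pair keeps all documents of a query in the same partition, which matches the per-query metrics used for evaluation.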

Pre-processing the Input
We used each document's Title and Abstract to form its text. After some experimentation and evaluation on the validation dataset, we found that limiting this to the first 50 words was optimal. Documents shorter than that were padded with 0's, as were queries shorter than 7 words.
We used a simple tokenizer that split words on space and punctuation, while preserving abbreviations and numeric forms, followed by a conversion to lower-case. All punctuation was dropped, which also resulted in a loss of sentence and some grammatical structure, an area to be explored in the future. Numeric forms were collapsed into 7 classes: Integer, Fraction in (0, 1), Real number, year "19xx", year "20xx", Percentage (number followed by "%"), and dollar amount (number preceded by "$"). Removing stop-words from the query and documents did not improve performance of the models.
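A minimal sketch of this style of tokenization is shown below. The regular expressions and the class token names (`<INT>`, `<YEAR19>`, and so on) are hypothetical; the paper lists the seven numeric classes but not how they are encoded, and abbreviation handling is not reproduced here.

```python
import re

# Numeric forms (optionally with "$" prefix or "%" suffix) are matched
# first so they survive as single tokens; everything else is split on
# whitespace and punctuation.
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?%?|\.\d+|\w+")

def numeric_class(token):
    """Map a numeric token to one of seven classes, or None for a word.
    The class labels are illustrative placeholders."""
    if re.fullmatch(r"\$\d+(\.\d+)?", token):
        return "<DOLLAR>"          # number preceded by "$"
    if re.fullmatch(r"\d+(\.\d+)?%", token):
        return "<PERCENT>"         # number followed by "%"
    if re.fullmatch(r"19\d\d", token):
        return "<YEAR19>"          # year "19xx"
    if re.fullmatch(r"20\d\d", token):
        return "<YEAR20>"          # year "20xx"
    if re.fullmatch(r"\d+", token):
        return "<INT>"             # any other integer
    if re.fullmatch(r"\d*\.\d+", token):
        # fraction strictly between 0 and 1, otherwise a general real
        return "<FRACTION>" if 0.0 < float(token) < 1.0 else "<REAL>"
    return None

def tokenize(text):
    """Lower-case, drop punctuation, and collapse numeric forms."""
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        tokens.append(numeric_class(tok) or tok)
    return tokens
```

For example, `tokenize("Costs rose 12.5% in 2015")` yields `['costs', 'rose', '<PERCENT>', 'in', '<YEAR20>']`.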
We leveraged the large PubMed corpus of about 26 million documents to pre-train the word vectors, using the SkipGram Hierarchical Softmax method of word2vec (Mikolov et al., 2013b), with a window size of ±5, a minimum term frequency of 101, and a word-vector size of 300. This resulted in a vocabulary of 207,716 words. Rare words were replaced with the generic "UNK" token.

The Delta Relevance Classifier Model
The components of the Delta Relevance Classifier (figure 1) are described below. Optimal sizes of the various layers were determined by tuning for best accuracy on validation data.

Note on Convolutional Layers
A convolutional operation (LeCun, 1989) is a series of identical transformations on subsequences of the input obtained by a sliding window on the input. The result is called a feature map, and a convolutional layer will usually involve several feature maps. The width of the input subsequence is called the filter width. In our application, the input is a sequence of words, each word represented by a real vector of size $d$. A convolution of filter-width $k$ processes word $k$-grams. The value of the $t$-th element of the $j$-th feature map $c_j$ is computed as follows:

$$c_j[t] = \sigma\left( \sum_{i=1}^{d} \sum_{l=1}^{k} W_j[i,l] \; x[i,\, t-k+l] + b_j \right)$$

Here $\sigma$ is the non-linear activation function, $W_j \in \mathbb{R}^{d \times k}$ and $b_j \in \mathbb{R}$ are the parameters of the $j$-th feature map, and the input is $x \in \mathbb{R}^{d \times m}$. In the models in this paper, the feature maps are applied in full mode, which effectively pads the input on either side with $k - 1$ $d$-sized 0-vectors, so $x[i,s] = 0$ for $s < 1$ or $s > m$, and $t$ ranges from 1 to $m + k - 1$.
Applied to text, a convolutional layer of width 3 will extract features from 3-grams. A second convolutional layer of the same width stacked above then extracts features from 5-grams, and so on.

Table 1: Test Data and its subsets
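To make the full-mode convolution concrete, here is a small pure-Python sketch of a single feature map, together with the receptive-field arithmetic behind the 3-gram and 5-gram statement. The function names, the list-of-lists layout, and the exact window alignment (the window for output position t ends at input column t) are assumptions chosen to give the stated output length of m + k − 1; ReLU is used as the activation.

```python
def conv1d_full(x, W, b):
    """One feature map applied in 'full' mode with a ReLU activation.

    x: input, d rows by m columns (list of d lists of floats)
    W: filter, d rows by k columns
    b: scalar bias
    Returns a list of m + k - 1 activations.
    """
    d, m, k = len(x), len(x[0]), len(W[0])
    out = []
    for t in range(1, m + k):              # t = 1 .. m + k - 1 (1-indexed)
        total = b
        for i in range(d):
            for l in range(1, k + 1):
                col = t - k + l            # 1-indexed input column
                if 1 <= col <= m:          # outside 1..m is zero padding
                    total += W[i][l - 1] * x[i][col - 1]
        out.append(max(0.0, total))        # ReLU
    return out

def receptive_field(filter_widths):
    """Number of input words seen by one output unit of a stack of
    convolutions (stride 1, no pooling between layers)."""
    return 1 + sum(k - 1 for k in filter_widths)
```

With this arithmetic, `receptive_field([3])` is 3, `receptive_field([3, 3])` is 5, and a stack of three width-3 layers covers 7 words.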

Query-Document Overlap Features
Following (Severyn and Moschitti, 2015), we compute some overlap features to aid relevance detection when dealing with exact matches, and with rare words collapsed to the 'UNK' token. We use the following overlap features, the first two of which are taken directly from that paper: (i) proportion of query and document words in common, (ii) IDF-weighted version of (i), (iii) proportion of query words in the document, and (iv) proportion of query bigrams in the document.
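The four overlap features might be computed along the following lines. This is a sketch under stated assumptions: the normalisation of features (i) and (ii) by the union of query and document words is one plausible reading, not a definition given in the text, and the `idf` dictionary and function name are hypothetical.

```python
def overlap_features(query, doc, idf):
    """Return [word_overlap, idf_overlap, query_coverage, bigram_coverage].

    query, doc: lists of tokens; idf: dict mapping token -> IDF weight.
    Assumption: (i) and (ii) are normalised over the union of the
    distinct query and document words.
    """
    q_set, d_set = set(query), set(doc)
    common, union = q_set & d_set, q_set | d_set

    # (i) proportion of query and document words in common
    word_overlap = len(common) / len(union) if union else 0.0

    # (ii) IDF-weighted version of (i)
    idf_total = sum(idf.get(w, 0.0) for w in union)
    idf_overlap = (sum(idf.get(w, 0.0) for w in common) / idf_total
                   if idf_total else 0.0)

    # (iii) proportion of query words found in the document
    query_coverage = len(common) / len(q_set) if q_set else 0.0

    # (iv) proportion of query bigrams found in the document
    q_bigrams = set(zip(query, query[1:]))
    d_bigrams = set(zip(doc, doc[1:]))
    bigram_coverage = (len(q_bigrams & d_bigrams) / len(q_bigrams)
                       if q_bigrams else 0.0)

    return [word_overlap, idf_overlap, query_coverage, bigram_coverage]
```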

Difference Features Stage
Instead of developing all pairwise local interactions between query and document terms, we capture interactions between pairs of closest terms. This simplifies the model, and since queries are short, we are unlikely to lose any useful interactions. The difference features are computed in two steps (algorithm 1). First, for each word in the document of a query-document pair, the closest query word in absolute vector distance is identified (skipping all "UNK" words in the query and document). We then output the difference vector, along with its length and the cosine angle between the two vectors. With word-vectors of size d and a document of T words, the output of this stage is a real matrix of size d × (T + 2). We found T = 50 produced the best results for the Delta models.

Algorithm 1 Query-Doc Difference Features
Input: Query text Q and Document text D.
for each word vector w in D s.t. w ≠ UNK do:
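The steps described in the prose of this section can be sketched in Python as follows. This is an illustrative reading, not the authors' algorithm listing: Euclidean distance is assumed for "absolute vector distance", the sign convention (document vector minus query vector) is an assumption, "UNK" words are represented by `None`, and how the per-word outputs are packed into the stage's output matrix is left open.

```python
import math

def difference_features(query_vecs, doc_vecs):
    """For each non-UNK document word, find the closest non-UNK query
    word and emit (difference vector, its length, cosine similarity).

    query_vecs, doc_vecs: lists of word vectors (lists of floats);
    None marks an "UNK" word and is skipped.
    """
    q = [v for v in query_vecs if v is not None]
    feats = []
    for w in doc_vecs:
        if w is None or not q:
            continue
        # closest query word by Euclidean distance (assumed metric)
        nearest = min(q, key=lambda qv: math.dist(w, qv))
        # assumed sign convention: document word minus query word
        delta = [a - b for a, b in zip(w, nearest)]
        length = math.sqrt(sum(c * c for c in delta))
        norm = math.hypot(*w) * math.hypot(*nearest)
        cosine = sum(a * b for a, b in zip(w, nearest)) / norm if norm else 0.0
        feats.append((delta, length, cosine))
    return feats
```

Because each document word is paired with only one query word, the number of interactions grows with document length rather than with the product of query and document lengths.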

Delta Scanner Stage
The Delta Scanner stage is a vertical stack of three Convolutional layers of 256 filters each, followed by a Dropout layer, and then a Global Max-Pooling layer outputting a fixed-width vector. All feature maps use the ReLU (Rectified Linear Unit) activation function.
The input to the Delta Scanner stage is the d × (T + 2) matrix produced by the Difference Features stage. Documents whose text is fewer than T words are right-padded with 0's, and the Delta Scanner supports a mask input that it uses to ignore the padding. The output of this stage is a vector of size 256, representing the semantic difference between the query and the document in a query-document pair. The remaining hyper-parameters are: Dropout probability, and the L2-regularization coefficient.

Relevance Classifier Stage
This is a deep fully connected feed-forward logistic regression stage. The input to the Relevance Classifier stage is the combined vectors output from the Overlap Features and Delta Scanner stages, with a total width of 260 = (4+256). This data is fed through the following layers: i. a Dropout layer, ii. two feed-forward layers, each of width 260, using the ReLU activation function, iii. another Dropout layer, and iv. a sigmoid-based Classification layer. The Relevance Classifier's output is an estimate of the probability of the input document's relevance to the query. Documents are ranked in reverse order on this estimated probability.
This stage's hyper-parameters are: Dropout probability (same value used for both Dropout layers), and the L2-regularization coefficient.

Loss Function and Sample Weighting
The data labels capture a binary sense of relevance, and our models are binary classifiers, so we used the standard binary cross-entropy loss.
In the default mode, the neural network models were trained without any weighting of the training samples. We trained a second set of models with sample weights derived from the non-binary relevance levels (described above). For each relevance level r, a weight of max[1, log(1 + r)] was used. This damped the relevance levels, while ensuring that each relevant document received at least the same weight as a non-relevant document.
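As a small illustration, the sample weight and a weighted binary cross-entropy can be written as below. The natural logarithm is assumed (the text does not state the base), as is averaging the weighted losses over the number of samples; the function names are hypothetical.

```python
import math

def sample_weight(relevance_level):
    """Weight max[1, log(1 + r)] for relevance level r (natural log assumed).
    Non-relevant documents (r = 0) get weight 1."""
    return max(1.0, math.log(1.0 + relevance_level))

def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean of per-sample weighted binary cross-entropy losses.

    labels: 0/1 targets; probs: predicted relevance probabilities;
    weights: per-sample weights. Probabilities are clipped to avoid log(0).
    """
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

Under the natural-log assumption, a relevance level of 39.0 maps to a weight of about 3.69, while levels up to about 1.7 keep the floor weight of 1.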

Optimization and Implementation Notes
All the neural network models were optimized using Adadelta (Zeiler, 2012), with mini-batches of 256 samples. Mini-batch gradient descent was run for 10 epochs, and the trained values at the end of the epoch producing the best classification accuracy on the Validation dataset were chosen. A greedy search was done in the grid space of the hyper-parameters for the Delta Scanner and Relevance Classifier stages, and the values that produced the best validation accuracy were selected.

Experimental Setup
We compare the performance of the relevance models on the following ranking metrics: NDCG at rank 20, Precision at ranks 5, 10 and 20, and Mean Average Precision (MAP). Scoring ties were resolved by sorting on decreasing document-id.
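For reference, the per-query metrics can be sketched as follows. The NDCG variant shown (gain equal to the relevance level, log2 position discount) is an assumption, since the exact formulation is not given here; the tuple layout and function names are illustrative. MAP is the mean of `average_precision` over all queries.

```python
import math

def rank(docs):
    """Sort (doc_id, score, relevance_level) tuples by decreasing score,
    breaking ties by decreasing document id."""
    return sorted(docs, key=lambda d: (-d[1], -d[0]))

def precision_at(ranked, k):
    """Fraction of the top k documents that are relevant (level > 0)."""
    return sum(1 for d in ranked[:k] if d[2] > 0) / k

def average_precision(ranked):
    """Average of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d[2] > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg_at(ranked, k):
    """NDCG at rank k with linear gain and log2 discount (assumed variant)."""
    def dcg(levels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(levels, start=1))
    ideal = sorted((d[2] for d in ranked), reverse=True)[:k]
    best = dcg(ideal)
    return dcg([d[2] for d in ranked[:k]]) / best if best else 0.0
```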

Methods Compared
We compared the performance of our deep learning model against: BM25; the Unigram Query Likelihood Model (UQLM) with Dirichlet Smoothing (Zhai and Lafferty, 2004); Word Mover's Distance (WMD) that leverages pretrained word-vectors; and a couple of neural network models based on the architecture described in (Severyn and Moschitti, 2015).
We tested BM25 on the document Title, Abstract and Title + Abstract, and found BM25 on Title to give the best ranking performance, with parameters k_1 = 2.0, b = 0.75. Similarly, UQLM applied to the document Title and WMD applied to the document Title after removal of stop-words performed better than the other alternatives.
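A compact sketch of BM25 scoring with these parameter values is given below. The IDF form used (a smoothed, non-negative variant) is an assumption, as the exact BM25 formulation is not spelled out here; `doc_freq`, `num_docs` and `avg_len` are hypothetical corpus statistics supplied by the caller.

```python
import math
from collections import Counter

def bm25_score(query, doc, doc_freq, num_docs, avg_len, k1=2.0, b=0.75):
    """Okapi BM25 score of a tokenized document (e.g. its Title) for a query.

    doc_freq: dict term -> number of documents containing the term
    num_docs: number of documents in the collection
    avg_len:  average document length in tokens
    """
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        f = tf.get(term, 0)
        if f == 0:
            continue                      # absent terms contribute nothing
        n = doc_freq.get(term, 0)
        # assumed IDF variant: smoothed so that it is never negative
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        # term-frequency saturation with document-length normalisation
        norm = f + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * f * (k1 + 1.0) / norm
    return score
```

Because only query terms present in the document contribute, BM25 cannot separate documents that all contain the query terms equally prominently, which is the under-specified query problem discussed in the Introduction.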

Severyn-Moschitti Model
We tested four variants of the relevance classifier described in (Severyn and Moschitti, 2015). All versions used the same input data and word-vectors as used for the Delta model. In the basic version, which we will refer to as "SevMos-C1", the query and document were fed into a single-layer Convolutional stage as described in section 4.1, with 256 feature maps and a filter width of 5. This was followed by a Dropout layer and then Global Max-Pooling. The outputs of the query and document convolutions, along with the overlap features described in section 4.2, were fed into a Classifier stage. This stage computed a difference between the query and document features using a difference matrix, and this value along with the other inputs was fed into a deep classification stage identical to that used in the Delta model (section 4.5), sized to match these inputs.
In the "SevMos-C3" variant of this model, we replaced the single-layer convolution stage with a deeper 3-layer stack of convolutions of filter width 3, followed by global max-pooling, just like the Delta model's 'Delta Scanner' stage.
In addition to training the models on unweighted samples, we also trained separate models on relevance-based weighted samples (see section 4.6), which we refer to below as "SevMos-C1 w" and "SevMos-C3 w".
Optimal values for the L2-regularization and Dropout probability hyper-parameters were determined by doing a greedy grid search, as described for the Delta model.

The Test Data
The test data used to compare performance of the different textual relevance approaches is the held-out 20% split of the data extracted from search logs, as described in section 3.1, without any further sub-sampling. Of the relevant documents ("+ives"), 38.9% did not contain all query terms in the title. Similarly among the non-relevant documents ("-ives"), 59.1% contained all the query terms in the title (see table 1).
In addition to comparing ranking metrics of the different approaches on the test data, we also wanted to explore the main research questions motivating this work: (i) the problems of under-specified queries and term mismatch, and (ii) model robustness. To help answer these questions, we also compare ranking metrics on the following subsets of the test data:
Neg20+: This consists of all queries for which there were at least 20 non-relevant documents that contained all the query words in the title. This helps evaluate performance on under-specified queries.
AllNewWords: A smaller subset of queries all of whose words are new: none of the training or validation queries included these words.
The last two subsets will help us evaluate model robustness. The statistics of the test data and its subsets are summarized in table 1. Table 2 compares the performance of all the above ranking factors and models on the full test data. The first row shows the metrics obtained by ranking all the documents on reverse order of Document ID. We use this as a score tie-breaker for all the other rankers, so it provides a useful baseline performance of an uninformed ranker.

Models trained on Un-weighted Samples
As also seen in (Shen et al., 2014), BM25 on Title slightly outperforms the Unigram Query Likelihood Model. We have seen other cases where UQLM outperforms BM25. We believe the better performance of BM25 here is partly due to it being a strong factor in the relevance ranking from which these click logs were extracted, thus biasing the click data to some extent.
Word Mover's Distance (WMD-Title) is the first factor in the table that takes non-query words into account, and it does show an improvement over BM25. However WMD relies on the word-vectors obtained by unsupervised training, using a simple Euclidean distance on these vectors as the semantic distance between words. This, and its relatively simple computation, limit how well it performs.
The SevMos-C1 model applies a complex nonlinear transformation on the word-vector based text space, in an attempt to better capture comparable semantics of documents. However its NDCG numbers are worse than both WMD and BM25, although its precision numbers, while better than BM25, are about the same as those for WMD. Given that the neural network models in this table were trained on a boolean version of relevance, we expect the main gains to be in the precisionbased metrics, which also use a boolean notion of relevance. The lack of improvement in precision metrics over WMD shows that SevMos-C1's nonlinear transformations are not doing a better job of capturing query and document semantics.
The SevMos-C3 model learns a more complex non-linear transformation than SevMos-C1, by using a stack of three non-linear convolution layers instead of one in the first part of the model. However its metrics are no better (actually somewhat worse) than SevMos-C1 across the board. So increasing the expressive power of the model did not help. Lack of sufficient training data might be limiting the performance of these models.
The main difference between the Delta model and SevMos-C3 is that the Delta model starts by computing a difference vector between the Document and Query's word-vector representations. This local interaction vector is inspired by Word Mover's Distance, and in the Delta model we hope to combine the benefits of the WMD and SevMos approaches, while at the same time reducing the complexity of the input space, and thus allowing us to extract more benefit from the small amount of training data. The performance metrics for the Delta model do indeed show sizeable improvements over both WMD and SevMos-C1 (and thus also over BM25 and UQLM). The relative improvements in the metrics are shown in the last two rows of table 2.
The 'Neg20+' section of the
Model robustness is tested when queries with words not seen during training (i.e. in the training and validation datasets) are encountered. This is explored in sections 'OneNewWord' and 'AllNewWords' of table 3. Both these sub-tables show a consistently better performance by the Delta model over the other approaches compared here. Interestingly, the improvements in the Delta model's NDCG at 20 metrics over the other approaches are quite sizeable, even though for a simple un-weighted relevance classifier, the primary target was precision and not NDCG.

Relevance Weighted Models
In this section we explore the performance of the Delta model trained on relevance-weighted samples against the corresponding weighted versions of the neural network models SevMos-C1 and SevMos-C3. These metrics are shown in table 4. A quick comparison with previous tables shows that all the models turn in better NDCG numbers than their un-weighted versions. In particular, the "Delta w" model continues to show statistically significantly better metrics than the other weighted neural network models "SevMos-C1 w" and "SevMos-C3 w".
Comparing the Delta weighted model against the unweighted Delta model, we see that there is a statistically significant improvement in the NDCG metrics for all the Test subsets (at the 99% confidence level). However, the precision metrics do not show a significant change. So by weighting the samples we have been able to improve the NDCG without hurting the precision.

Concluding Remarks
We have demonstrated a Deep Learning approach for learning textual relevance from a fairly small labelled training dataset. We show that this model is robust and it outperforms both traditional IR factors as well as related shallow (WMD) and deep (SevMos) models based on continuous representations of text, with better results on the underspecified query and term mismatch problems. While the Delta model is comparable to other local-interaction ranking models, we compute fewer and richer interactions. We believe the fewer interactions captured in the difference vector are sufficient for the shorter queries in our data. As a comparison, the model in (Guo et al., 2016) computes a match histogram based on cosine similarity between all document-query word pairs, and also query-term IDF based weighting. We plan to test this model on our data.
The main advantage of the separate semantic vector approach is that document semantic vectors can be pre-computed. Prediction run-time then primarily depends on the complexity of the similarity computation between these semantic vectors. Local-interaction models, including ours, do not allow this pre-computation, significantly increasing the ranker's run-time cost.
We believe the most promising directions for future research include: modeling deeper semantics (see example in appendix), unsupervised training on data auto-generated from the corpus followed by fine-tuning with supervised training, improving extraction of non-binary relevance levels, and using a pair-wise ranking target. Further investigation is also warranted for incorporating these models into PubMed.
with the parameters µ = 0.33, λ = 15, where AbClicks is the number of observed clicks to the document summary page in PubMed, FTClicks is the number of observed clicks to the document's full text, if available, and the value of IsDocWithoutFullText is 1 if the full text for that document is not available, and 0 otherwise. The formula attempts to capture the increased notion of relevance if the user accesses the document's full text, without penalizing documents whose full text is not available. The parameters were hand-tuned to reflect domain experts' relevance judgments.

B Rankings on Some Example Queries
Here are some example queries from the test set showing the titles of the top 3 ranked documents for the Delta weighted model, BM25 and WMD. Relevance levels of the documents are indicated inside parentheses before the titles.

B.1 Query: cryoglobulinemia
This word did not occur in training or validation queries. Delta w ranks the most relevant document at the top despite its use of an alternative spelling. BM25 and WMD seem to prefer shorter titles with exact matches. Number of documents in the test dataset: relevant = 27, non-relevant = 26. Top three relevance levels: 39.0, 11.0, 4.0.
As ranked by Delta w:

B.3 Query: chronic headache and depression review
In this example, both WMD and Delta w are able to leverage word vectors to relate headache to migraine. However both miss the most relevant document, whose title is "Psychological Risk Factors in Headache" (relevance level = 6.0). This example demonstrates the need for deeper semantic modeling. Number of documents in the test dataset: relevant = 23, non-relevant = 18. Top three relevance levels: 6.0, 3.0, 3.0.
As ranked by Delta w: