Using Centroids of Word Embeddings and Word Mover's Distance for Biomedical Document Retrieval in Question Answering

We propose a document retrieval method for question answering that represents documents and questions as weighted centroids of word embeddings and reranks the retrieved documents with a relaxation of Word Mover's Distance. Using biomedical questions and documents from BIOASQ, we show that our method is competitive with PUBMED. With a top-k approximation, our method is fast, and easily portable to other domains and languages.


Introduction
Biomedical experts (e.g., researchers, clinical doctors) routinely need to search the biomedical literature to support research hypotheses, treat rare syndromes, follow best practices, etc. The most widely used biomedical search engine is PUBMED, with more than 24 million biomedical references and abstracts, mostly of journal articles.¹ To improve their performance, biomedical search engines often use large, manually curated ontologies, e.g., to identify biomedical terms and expand queries with related terms.² Biomedical experts, however, report that search engines often miss relevant documents and return many irrelevant ones.³ There is also growing interest in biomedical question answering (QA) systems (Athenikos and Han, 2010; Bauer and Berleant, 2012; Tsatsaronis et al., 2015), which allow their users to specify their information needs more precisely, as natural language questions rather than Boolean queries, and aim to produce more concise answers. Document retrieval is particularly important in biomedical QA, since most of the information sought resides in documents and is essential in later stages.

¹ See http://www.ncbi.nlm.nih.gov/pubmed.
² PUBMED uses UMLS (http://www.nlm.nih.gov/research/umls/). See also the GoPubMed search engine (http://www.gopubmed.com/).
³ Malakasiotis et al. (2014) summarize the findings of interviews that investigated how biomedical experts search.
We propose a new document retrieval method. Instead of representing documents and questions as bags of words, we represent them as the centroids of their word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and retrieve the documents whose centroids are closest to the centroid of the question. This allows retrieving relevant documents that may have no terms in common with the question, without query expansion. Using biomedical questions from the BIOASQ competition (Tsatsaronis et al., 2015), we show that our method combined with a relaxation of the recently proposed Word Mover's Distance (WMD) (Kusner et al., 2015) is competitive with PUBMED. We also show that with a top-k approximation, our method is particularly fast, with no significant decrease in effectiveness. Given that it does not require ontologies, term extractors, or manually labeled training data, our method could be easily ported to other domains (e.g., legal texts) and languages.

The proposed method
The word embeddings and document centroids are pre-computed. For each question, its centroid is computed and the documents with the top-k nearest (in terms of cosine similarity) centroids are retrieved (Fig. 1). The retrieved documents are then optionally reranked using a relaxation of WMD.

Centroids of documents and questions
In the simplest case, the centroid $\vec{t}$ of a text $t$ is the sum of the embeddings of the tokens of $t$, divided by the number of tokens. Performance improves when the IDF scores of the tokens are also taken into account, as follows:

\[
\vec{t} = \frac{\sum_{j=1}^{|V|} \vec{w}_j \cdot \mathrm{TF}(w_j, t) \cdot \mathrm{IDF}(w_j)}{\sum_{j=1}^{|V|} \mathrm{TF}(w_j, t) \cdot \mathrm{IDF}(w_j)} \tag{1}
\]

where $|V|$ is the vocabulary size (approx. 1.7 million words, ignoring stop words), $w_j$ is the $j$-th vocabulary word, $\vec{w}_j$ its embedding, $\mathrm{TF}(w_j, t)$ the term frequency of $w_j$ in $t$, and $\mathrm{IDF}(w_j)$ the inverse document frequency of $w_j$ (Manning et al., 2008). We use the 200-dimensional word embeddings of BIOASQ, obtained by applying WORD2VEC (Mikolov et al., 2013) to approx. 11 million abstracts from PubMed.⁴ The IDF scores are computed on the 11 million abstracts.
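As an illustration of the IDF-weighted centroid, the following minimal sketch computes it for a tokenized text using only the Python standard library. The function name `idf_weighted_centroid`, the dictionary-based inputs, and the toy two-dimensional embeddings are assumptions made for the example; they are not part of the BIOASQ resources. Tokens without an embedding or an IDF score are simply skipped here, which is also an assumption.

```python
from collections import Counter


def idf_weighted_centroid(tokens, embeddings, idf):
    """Return the IDF-weighted centroid of a tokenized text, or None.

    tokens:     list of tokens (stop words assumed already removed)
    embeddings: dict mapping word -> list of floats (hypothetical toy input)
    idf:        dict mapping word -> IDF score (hypothetical toy input)
    """
    # Term frequencies of the tokens that have both an embedding and an IDF score.
    tf = Counter(tok for tok in tokens if tok in embeddings and tok in idf)
    if not tf:
        return None
    dim = len(next(iter(embeddings.values())))
    numerator = [0.0] * dim
    denominator = 0.0
    for word, count in tf.items():
        weight = count * idf[word]  # TF(w, t) * IDF(w)
        denominator += weight
        for i, value in enumerate(embeddings[word]):
            numerator[i] += weight * value
    if denominator == 0.0:
        return None
    return [x / denominator for x in numerator]


# Toy example: "gene" occurs twice (IDF 1.0), "protein" once (IDF 3.0).
embeddings = {"gene": [1.0, 0.0], "protein": [0.0, 1.0]}
idf = {"gene": 1.0, "protein": 3.0}
centroid = idf_weighted_centroid(["gene", "protein", "gene"], embeddings, idf)
```

In the toy example the rarer word ("protein") pulls the centroid towards its own embedding even though it occurs less often, which is the intended effect of the IDF weighting.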

Document retrieval and reranking
Given a question with centroid $\vec{q}$, identifying the documents with the k nearest centroids requires computing the distance between $\vec{q}$ and each document centroid, which is impractical for large document collections. Efficient approximate top-k algorithms, however, exist. They divide the vector space into subspaces and use trees to index the instances in each subspace (Arya et al., 1998; Indyk and Motwani, 1998; Andoni and Indyk, 2006; Muja and Lowe, 2009). We show that with an approximate top-k algorithm, document retrieval is very fast, with no significant decrease in performance. The top-k retrieved documents $d_i$ are ranked by decreasing (cosine) similarity of their centroids to $\vec{q}$. We call this method Cent when the simple (no IDF) centroids are used, and CentIDF when the IDF-weighted centroids (Eq. 1) are used.

⁴ The skipgram model of WORD2VEC was used, with hierarchical softmax, 5-word windows, and default other parameters. See http://participants-area.bioasq.org/info/BioASQword2vec/ for further details.
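The exact (brute-force) variant of this retrieval step can be sketched as follows. This is only an illustration of ranking by cosine similarity; it scans every document centroid, which is precisely what an approximate top-k index avoids on a large collection. The function names and the toy centroids are assumptions made for the example.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_by_cosine(question_centroid, doc_centroids, k):
    """Return the k (doc_id, similarity) pairs with the highest similarity.

    doc_centroids: dict mapping doc_id -> centroid (hypothetical toy input).
    """
    scored = (
        (doc_id, cosine_similarity(question_centroid, centroid))
        for doc_id, centroid in doc_centroids.items()
    )
    # nlargest returns the pairs sorted by decreasing similarity.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy example with two-dimensional centroids.
doc_centroids = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
ranked = top_k_by_cosine([1.0, 0.1], doc_centroids, k=2)
ranked_ids = [doc_id for doc_id, _ in ranked]
```

Because only the ranking matters, any positive rescaling of a centroid leaves the result unchanged.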
The top-k documents are optionally reranked with an approximation of the WMD distance. WMD measures the total distance the word embeddings of two texts (in our case, question and document) have to travel to become identical. In its full form, WMD allows each word embedding to be partially aligned (travel) to multiple word embeddings of the other text, which requires solving a linear program and is too slow for our purposes. Kusner et al. (2015) reported promising results in text classification using WMD as the distance of a k-NN classifier. They also introduced relaxed, much faster WMD versions. In our case, the first relaxation (RWMD-Q) sums the distances the word embeddings $\vec{w}$ of the question $q$ have to travel to the closest word embeddings $\vec{w}'$ of the document $d$:

\[
\text{RWMD-Q}(q, d) = \sum_{w \in q} \min_{w' \in d} \mathrm{dist}(\vec{w}, \vec{w}')
\]

Following Kusner et al., we use the Euclidean distance as $\mathrm{dist}(\vec{w}, \vec{w}')$. Similarly, the second relaxed form (RWMD-D) sums the distances of the word embeddings of $d$ to the closest embeddings of $q$. If we instead set $\mathrm{dist}(\vec{w}, \vec{w}') = 0$ when $w$ and $w'$ are identical and 1 otherwise, RWMD-Q counts the words of $q$ that are absent from $d$ (equivalently, the length of $q$ minus the number of its words that are present in $d$), and RWMD-D counts the words of $d$ that are absent from $q$. Kusner et al. found the maximum of RWMD-Q and RWMD-D (RWMD-MAX) to be the best relaxation of WMD. In our case, where $q$ is much shorter than $d$, RWMD-Q works much better, because $d$ contains many irrelevant words that have no close counterparts in $q$, and their long distances dominate in RWMD-D and RWMD-MAX.⁵ We call CentIDF-RWMD-Q and CentIDF-RWMD-D the CentIDF method with the additional reranking by RWMD-Q or RWMD-D, respectively.
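A minimal sketch of the two relaxations and of the reranking step is given below. It treats a text as a list of word vectors and applies the same function in both directions: question to document for RWMD-Q, document to question for RWMD-D. The function names and toy vectors are assumptions made for the example, and both texts are assumed to be non-empty.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(source_vectors, target_vectors, dist=euclidean):
    """Sum, over the source vectors, of the distance to the closest target vector.

    relaxed_wmd(question, document) corresponds to RWMD-Q,
    relaxed_wmd(document, question) corresponds to RWMD-D.
    Both lists are assumed to be non-empty.
    """
    return sum(min(dist(s, t) for t in target_vectors) for s in source_vectors)


def rerank_by_rwmd_q(question_vectors, docs):
    """Return doc ids sorted by increasing RWMD-Q (closest document first).

    docs: dict mapping doc_id -> list of word vectors (hypothetical toy input).
    """
    return sorted(docs, key=lambda doc_id: relaxed_wmd(question_vectors, docs[doc_id]))


# Toy example: document "a" contains both question words plus an unrelated one.
question = [[0.0, 0.0], [1.0, 0.0]]
docs = {
    "a": [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]],
    "b": [[0.0, 1.0], [3.0, 0.0]],
}
rwmd_q_a = relaxed_wmd(question, docs["a"])  # every question word has an exact match
rwmd_d_a = relaxed_wmd(docs["a"], question)  # the unrelated word [5, 5] dominates
reranked = rerank_by_rwmd_q(question, docs)
```

The toy values show the asymmetry discussed above: document "a" matches the question perfectly under RWMD-Q, yet its single unrelated word gives it a large RWMD-D.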

Data
We used the 1,307 training questions and the gold relevant PUBMED document ids of the fourth year of BIOASQ (Task 4b).⁶ The questions were written by biomedical experts, who also identified the gold relevant documents using PUBMED, and reflect real needs (Tsatsaronis et al., 2015). We pass each question to our methods (after tokenization and stop-word removal) or the PUBMED search engine (hereafter PubMedSE), which performs its own tokenization and query expansion.⁷ The document collection that we search contains approx. 14 million article abstracts and titles from the November 2015 PUBMED dump, which was also used in the fourth year of BIOASQ.⁸ Our methods view each document as a concatenation of the title and abstract of an article.⁹ The titles and abstracts have an average length of approx. 13 and 143 tokens, respectively. When comparing against PubMedSE, we ignore documents returned by PubMedSE that are not in the dump, but this is very rare and does not affect the results.

Figures 2-4 show Mean Interpolated Precision (MIP) at 11 recall levels, Mean Average Interpolated Precision (MAIP), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (nDCG).¹⁰ Roughly speaking, MAIP is the area under the MIP curve, MAP is the same area without interpolation, and nDCG is an alternative to MAIP. Unless otherwise stated, the number of retrieved documents is set to k = 1,000. Figure 2 shows that Cent performs much worse than CentIDF. At low recall, CentIDF is as good as PubMedSE, but PubMedSE outperforms CentIDF at high recall. Reranking the top-k documents of CentIDF by RWMD-Q has a significant impact, leading to a system (CentIDF-RWMD-Q) that performs as well as or better than PubMedSE up to 0.7 recall. Reranking the top-k documents of PubMedSE by RWMD-Q (PubMedSE-RWMD-Q) also improves the performance of PubMedSE. Reranking the top-k documents of CentIDF by RWMD-D (or RWMD-MAX, not shown) leads to much worse results (CentIDF-RWMD-D), for reasons already explained.¹¹ Similar conclusions are reached by examining the MAIP, MAP, and nDCG scores.

⁷ We use relevance ranking (not recency) in PubMedSE.
⁸ The dump is available from https://www.nlm.nih.gov/databases/license/license.html. The 14 million articles do not include approx. 10 million articles for which only titles are provided. There are hardly any title-only gold relevant documents, and PubMedSE very rarely returns title-only documents.
⁹ It is unclear to us if PUBMED also searches the full texts of the articles, which may put our methods at a disadvantage.
¹⁰ All measures are widely used (Manning et al., 2008). We use binary relevance in nDCG, as in the BIOASQ dataset.
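For readers unfamiliar with the evaluation measures, the sketch below computes their per-question versions (average precision, 11-point interpolated precision, and nDCG with binary relevance) following the common textbook definitions; averaging each over all questions gives MAP, the MIP curve, and mean nDCG, and averaging the 11 interpolated values gives the per-question score behind MAIP. These are generic definitions and may differ in details from the BIOASQ evaluation platform; all function names and the toy ranking are assumptions made for the example.

```python
import math


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant documents occur."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def interpolated_precision_11pt(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return [0.0] * 11
    points, hits = [], 0  # (recall, precision) after each rank
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision at any rank whose recall reaches the level.
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]


def ndcg_binary(ranked, relevant):
    """nDCG with binary gains and a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked, start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: relevant documents retrieved at ranks 1 and 3.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)
curve = interpolated_precision_11pt(ranked, relevant)
aip = sum(curve) / len(curve)
ndcg = ndcg_binary(ranked, relevant)
```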
Keyword-based information retrieval may miss relevant documents that use different terms than the question, even with query expansion. PubMedSE retrieves no documents for 35% (460/1307) of our questions.¹² Further experiments (not reported), however, indicate that PubMedSE has higher precision than CentIDF-RWMD-Q, when PubMedSE returns documents, at the expense of lower recall. Hence, there is scope to combine PubMedSE with our methods. As a first, crude step, we tested a method (Hybrid) that returns the documents of CentIDF-RWMD-Q when PubMedSE retrieves no documents, and those of PubMedSE-RWMD-Q otherwise. Hybrid had the best results in our experiments; the only exception was its nDCG@100 score, which was slightly lower than the score of CentIDF-RWMD-Q. Table 1 shows that an approximate top-k algorithm (ANN) in CentIDF-RWMD-Q (ANN-CentIDF-RWMD-Q) dramatically reduces the time to obtain the top-k documents, with a very small decrease in MAIP, MAP, and nDCG scores (Figures 3 and 4).¹³ We also compared against the other participants of the second year of BIOASQ; the participant results of later years are not yet available.¹⁴ The official BIOASQ score is MAP; MIP, MAIP, and nDCG scores are not provided. Our best method was again Hybrid (avg. MAP over the five batches of the second year 16.18%). It performed overall better than the BIOASQ 'baselines' (best avg. MAP 15.60%) and all eight participants, except for the best one (avg. MAP 28.20%).

¹¹ The same holds when the top-k documents of PubMedSE are reranked by RWMD-D or RWMD-MAX (not shown).
¹² The experts that identified the gold relevant documents used simple keyword, Boolean, and advanced PubMedSE queries, whereas we used the English questions as queries.
The best system (Choi and Choi, 2014) used dependency IR models (Metzler and Croft, 2005), combined with UMLS and query expansion heuristics (e.g., adding the titles of the top-k initially retrieved documents to the query). The 'baselines' are actually very competitive; no system beat them in the first year, and only one was better in the second year. They use PubMedSE with BIOASQ-specific heuristics (e.g., instructing PubMedSE to ignore types of articles the experts did not consider). Our system is simpler and does not use heuristics; hence, it can be ported more easily to other domains.

¹³ We use Annoy (https://github.com/spotify/annoy), 100 trees, 1,000 neighbors, search-k = 10 · |trees| · |neighbors|. Times on a server with 4 Intel Xeon E5620 CPUs (16 cores total), at 2.4 GHz, with 128 GB RAM.
¹⁴ We used the evaluation platform of BIOASQ (http://participants-area.bioasq.org/oracle).

Other related work
Kosmopoulos et al. (2016) report that a k-NN classifier that represents articles as IDF-weighted centroids (Eq. 1) of 200-dimensional word embeddings (200 features) is as good at assigning semantic labels (MeSH headings) to biomedical articles as when using millions of bag-of-word features, while significantly reducing the training and classification times. To our knowledge, our work is the first attempt to use IDF-weighted centroids of word embeddings in information retrieval, and the first to use WMD to rerank the retrieved documents. More elaborate methods to encode texts as vectors have been proposed (Le and Mikolov, 2014; Kiros et al., 2015; Hill et al., 2016) and they could be used as alternatives to centroids of word embeddings, though the latter are simpler and faster to compute.
The OHSUMED dataset (Hersh et al., 1994) is often used in biomedical information retrieval experiments. It is much smaller (101 queries, approx. 350K documents) than the BIOASQ dataset that we used, but we plan to experiment with OHSUMED in future work for completeness.

Conclusions and future work
We proposed a new QA-driven document retrieval method that represents documents and questions as IDF-weighted centroids of word embeddings. Combined with a relaxation of the WMD distance, our method is competitive with PUBMED, without ontologies or query expansion. Combined with PUBMED, it performs better than PUBMED on its own. With a top-k approximation, it is fast, and easily portable to other domains and languages.
We plan to consider alternative dense vector encodings of documents and queries, textual entailment (Bowman et al., 2015; Rocktäschel et al., 2016), and full-text documents, where it may be necessary to extend RWMD-Q to take into account the proximity (density) of the words of the (now longer) document the query words are mapped to.