Literature Retrieval for Precision Medicine with Neural Matching and Faceted Summarization

Information retrieval (IR) for precision medicine (PM) often involves looking for multiple pieces of evidence that characterize a patient case. This typically includes at least the name of a condition and a genetic variation that applies to the patient. Other factors such as demographic attributes, comorbidities, and social determinants may also be pertinent. As such, the retrieval problem is often formulated as ad hoc search but with multiple facets (e.g., disease, mutation) that may need to be incorporated. In this paper, we present a document reranking approach that combines neural query-document matching and text summarization toward such retrieval scenarios. Our architecture builds on the basic BERT model with three specific components for reranking: (a) document-query matching, (b) keyword extraction, and (c) facet-conditioned abstractive summarization. The outcomes of (b) and (c) are used to transform a candidate document into a concise summary that can be compared with the query at hand to compute a relevance score. Component (a) directly generates a matching score of a candidate document for a query. The full architecture benefits from the complementary potential of document-query matching and the novel document transformation approach based on summarization along PM facets. Evaluations using NIST's TREC-PM track datasets (2017-2019) show that our model achieves state-of-the-art performance. To foster reproducibility, our code is made available here: https://github.com/bionlproc/text-summ-for-doc-retrieval.

Introduction
The U.S. NIH's precision medicine (PM) initiative (Collins and Varmus, 2015) calls for designing treatment and preventative interventions considering genetic, clinical, social, behavioral, and environmental exposure variability among patients.
The initiative rests on the widely understood finding that considering individual variability is critical in tailoring healthcare interventions to achieve substantial progress in reducing disease burden worldwide. Cancer was chosen as its near term focus with the eventual aim of expanding to other conditions. As the biomedical research enterprise strives to fulfill the initiative's goals, computing needs are also on the rise in drug discovery, predictive modeling for disease onset and progression, and in building NLP tools to curate information from the evidence base being generated.  In a dovetailing move, the U.S. NIST's TREC (Text REtrieval Conference) has been running a PM track since 2017 with a focus on cancer (Roberts et al., 2020). The goal of the TREC-PM task is to identify the most relevant biomedical articles and clinical trials for an input patient case. Each case is composed of (1) a disease name, (2) a gene name and genetic variation type, and (3) demographic information (sex and age). Table 1 shows two example cases from the 2019 track. So the search is ad hoc in the sense that we have a free text input in each facet but the facets themselves highlight the PM related attributes that ought to characterize the retrieved documents. We believe this style of faceted retrieval is going to be more common across medical IR tasks for many conditions as the PM initiative continues its mission.

Vocabulary Mismatch and Neural IR
The vocabulary mismatch problem is a prominent issue in medical IR given the large variation in the expression of medical concepts and events. For example, in the query "What is a potential side effect for Tymlos?" the drug is referred to by its brand name. Relevant scientific literature may contain the generic name Abaloparatide more frequently. Traditional document search engines have clear limitations in resolving such mismatches. The IR community has extensively explored methods to address the vocabulary mismatch problem, including query expansion based on relevance feedback, query term re-weighting, and query reconstruction by optimizing the query syntax.
Several recent studies highlight exploiting neural network models for query refinement in document retrieval (DR) settings. Nogueira and Cho (2017) address this issue by generating a transformed query from the initial query using a neural model. They use reinforcement learning (RL) to train it where an agent (i.e., reformulator) learns to reformulate the initial query to maximize the expected return (i.e., retrieval performance) through actions (i.e., generating a new query from the output probability distribution). In a different approach, Narayan et al. (2018) use RL for sentence ranking for extractive summarization.

Our Contributions
In this paper, building on the BERT architecture (Devlin et al., 2019), we focus on a different hybrid document scoring and reranking setup involving three components: (a) a document relevance classification model, which predicts (and inherently scores) whether a document is relevant to the given query (using a BERT multi-sentence setup); (b) a keyword extraction model, which spots tokens in a document that are likely to be seen in PM related queries; and (c) an abstractive document summarization model that generates a pseudo-query given the document context and a facet type (e.g., genetic variation) via the BERT encoder-decoder setup. The keywords (from (b)) and the pseudo-query (from (c)) are together compared with the original query to generate a score. The scores from all the components are combined to rerank the top k (set to 500) documents returned by a basic Okapi BM25 retriever from a Solr index (Grainger and Potter, 2014) of the corpora.
Our main innovation is in pivoting from the focus on queries by previous methods to emphasis on transforming candidate documents into pseudoqueries via summarization. Additionally, while generating the pseudo-query, we also let the decoder output concept codes from biomedical terminologies that capture disease and gene names. We do this by embedding both words and concepts in a common semantic space before letting the decoder generate summaries that include concepts. Our overall architecture was evaluated using the TREC-PM datasets (2017-2019) with the 2019 dataset used as the test set. The results show an absolute 4% improvement in P@10 compared to prior best approaches while obtaining a small ≈ 1% gain in R-Prec. Qualitative analyses also highlight how the summarization is able to focus on document segments that are highly relevant to patient cases.

Background
The basic reranking architecture we begin with is the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) model. BERT is trained on a masked language modeling objective on a large text corpus such as Wikipedia and BooksCorpus. As a sequence modeling method, it has achieved state-of-the-art results in a wide range of natural language understanding (NLU) tasks, including machine translation (Conneau and Lample, 2019) and text summarization (Liu and Lapata, 2019). With an additional layer on top of a pretrained BERT model, we can fine-tune models for specific NLU tasks. In our study, we utilize this framework in all three components identified in Section 1.3 by starting with a bert-base-uncased pretrained HuggingFace model (Wolf et al., 2019).

Text Summarization
We plan to leverage both extractive and abstractive candidate document summarization in our framework. In terms of learning methodology, we view extractive summarization as a sentence (or token) classification problem. Previously proposed models include the RNN-based sequence model (Nallapati et al., 2017), the attention-based neural encoder-decoder model (Cheng and Lapata, 2016), and the sequence model with a global learning objective (e.g., ROUGE) for ranking sentences optimized via RL (Narayan et al., 2018;Paulus et al., 2018). More recently, graph convolutional neural networks (GCNs) have also been adapted to allow the incorporation of global information in text summarization tasks (Sun et al., 2019;Prasad and Kan, 2019). Abstractive summarization is typically cast as a sequence-to-sequence learning problem. The encoder of the framework reads a document and yields a sequence of continuous representations, and the decoder generates the target summary token-by-token (Rush et al., 2015;Nallapati et al., 2016). Both approaches have their own merits in generating comprehensive and novel summaries; hence most systems leverage these two different models in one framework (See et al., 2017;Liu and Lapata, 2019). We use the extractive component to identify tokens in a candidate document that may be relevant from a PM perspective and use the abstractive component to identify potential terms that may not necessarily be in the document but nevertheless characterize it for PM purposes.

Word and Entity Embeddings
Most of the neural text summarization models, as described in the previous section, adopt the encoder-decoder framework that is popular in machine translation. As such, the vocabulary on the decoding side does not have to be the same as that on the encoding side. We exploit this to design a summarization trick for PM where the decoder outputs both regular English tokens and entity codes from a standardized biomedical terminology that captures semantic concepts discussed in the document. This can be trained easily by converting the textual queries in the training examples to their corresponding entity codes. The trick enhances our ability to handle vocabulary mismatch in a different way (besides the abstractive framing). We created BioMedical Entity Tagged (BMET) embeddings for this purpose. BMET embeddings are trained on biomedical literature abstracts that were annotated with entity codes in the Medical Subject Headings (MeSH) terminology; codes are appended to the associated textual spans in the training examples. Regular tokens and entity codes are thus embedded in the same semantic space via pretraining with the fastText architecture (Bojanowski et al., 2017). Besides regular English tokens, the vocabulary of BMET includes 29,351 MeSH codes and a subset of supplementary concepts. In the dictionary, MeSH codes are differentiated from the regular words by a unique prefix; for example, mesh d000123 for MeSH code D000123. With this, our summarization model can now translate a sequence of regular text tokens into a sequence of biomedical entity codes or vice versa. That is, we use MeSH as a new "semantic" facet besides those already provided by TREC-PM organizers. The expected output for the MeSH facet is the set of codes that capture entities in the disease and gene variation facets.

Models and Reranking
In this effort, toward document reranking, we aim to measure the relevance match between a document and a faceted PM query. Each training instance is a 3-tuple $(d, q, y^{d}_{q})$ where $q$ is a query, $d$ is a candidate document, and $y^{d}_{q}$ is a Boolean human adjudicated outcome: whether $d$ is relevant to $q$. As mentioned in Section 1.3, first, we fine-tune BERT for a query-document relevance matching task modeled as a classification goal to predict $y^{d}_{q}$ (REL). Next, we fine-tune BERT for token-level relevance classification, different from REL, where a token in $d$ is deemed relevant during training if it occurs as part of $q$. We name this model EXT for keyword extraction. Lastly, we train a BERT model in the seq2seq setting where the encoder is initialized with a pretrained EXT model. The encoder reads in $d$, and the decoder attends to the contextualized representations of $d$ to generate a facet-specific pseudo-query sentence $q_d$, which is then compared with the original query $q$. We conceptualize this process as text summarization from a document to query sentences and refer to it as ABS. All three models are used together to rerank a candidate $d$ at test time for a specific input query.

Document Relevance Matching (REL)
Neural text matching has been recently carried out through siamese style networks (Mueller and Thyagarajan, 2016), which also have been adapted to biomedicine (Noh and Kavuluru, 2018). Our approach adapts the BERT architecture for the matching task in the multi-sentence setting as shown in Figure 1. We use BERT's tokenizer on its textual

Abstractive Document Summarization (ABS)
ABS employs a standard seq2seq attention model, similar to that by Nallapati et al. (2016), as shown in Figure 2. We initialize the parameters of the encoder with a pretrained EXT model. The decoder is a 6-layer transformer in which the self-attention layers attend to only the earlier positions in the output sequence, as is typical in auto-regressive language models. At each training step, the decoder takes the previous token from the reference query sentence; during generation, the decoder uses the token predicted one step earlier. The facet signals are the latent variables for which ABS is optimized. Through them, ABS learns not only the thematic aspects of the queries but also meta attributes such as length. The special tokens for facets are listed in Table 2 (the last row indicates a new auxiliary facet we introduce in Section 4.1).
Each faceted query is enclosed by its assigned bos/eos pair, and the decoder of ABS learns $p_{\theta}(x_i \mid x_{<i}, x_0)$, where $x_0$ is the facet signal. As in the encoder and the original transformer architecture (Vaswani et al., 2017), we add the sinusoidal positional embedding $P_t$ and the segment vector $A$ (or $B$) to the token embedding $E_t$. Note that the dimension of the token embeddings used in the encoder (BERT embeddings) is different from that of the decoder (our custom BMET embeddings), which causes a discrepancy in computing context-attentions of the target text across the source document. Hence, we add an additional linear layer to project the constructed decoder embeddings ($E_{n_j} + A + P_i$ in the right hand portion of Figure 2) into the same space as the embeddings of the encoder. These projected embeddings are fed to the decoder's transformer layers. Each transformer layer applies multi-head attention for computing the self- and context-attentions. The attention function reads the input masks to preclude attending to future tokens of the input and any padded tokens (i.e., [PAD]) of the source text. Both attention functions apply a residual connection (He et al., 2016). Lastly, each transformer layer ends with a position-wise feedforward network. Final scores for each token are computed from the linear layer on top of the transformer layers. In training, these scores are consumed by a cross-entropy loss function. During generation, the softmax function is applied over the vocabulary, yielding a probability distribution for sampling the next token.
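The masking described above can be illustrated with a small, framework-free sketch. It is not the authors' implementation (which uses the OpenNMT attention module); the function names `causal_mask`, `padding_mask`, and `masked_softmax` are hypothetical and chosen only for illustration. The sketch shows how future target positions and `[PAD]` source tokens end up with zero attention weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when target position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def padding_mask(source_tokens, pad="[PAD]"):
    """True for real source tokens, False for padded positions."""
    return [tok != pad for tok in source_tokens]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0.

    Assumes at least one position is allowed.
    """
    m = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, allowed)]
    z = sum(exps)
    return [e / z for e in exps]

# Self-attention: the second target position may see positions 0 and 1 only.
self_mask = causal_mask(3)
weights_self = masked_softmax([1.0, 1.0, 5.0], self_mask[1])

# Context-attention: padded source tokens are excluded.
src_mask = padding_mask(["braf", "mutation", "[PAD]"])
weights_ctx = masked_softmax([0.0, 0.0, 9.0], src_mask)
```

In both calls the third position carries the largest raw score, yet it receives no weight because the mask excludes it.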
Finally, to generate the pseudo-query, we use beam search to find the most probable sentence among predicted candidates. The scores are penalized by two measures proposed by Wu et al. (2016, Equation 14): (1) the length penalty $lp(Y) = \frac{(5+|Y|)^{\alpha}}{(5+1)^{\alpha}}$, where $|Y|$ is the current target length and $0 < \alpha < 1$ is the length normalization coefficient; and (2) the coverage penalty $cp(X; Y) = \beta \sum_{i=1}^{|X|} \log\left(\min\left(\sum_{j=1}^{|Y|} p_{i,j},\, 1.0\right)\right)$, where $p_{i,j}$ is the attention score of the $j$-th target word $y_j$ on the $i$-th source word $x_i$, $|X|$ is the source length, and $0 < \beta < 1$ is the coverage normalization coefficient. Intuitively, these functions avoid favoring shorter predictions and yielding duplicate terms. We tune the parameters of the penalty functions ($\alpha = \beta = 0.4$) with grid search on the validation set for TREC-PM.
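As a rough illustration of how the two penalties enter a beam-search score, the sketch below follows the rescoring form of Wu et al. (2016), $s(Y, X) = \log P(Y \mid X) / lp(Y) + cp(X; Y)$. The function names and the small clamp `eps` (used to avoid taking the logarithm of zero when a source word receives no attention) are assumptions for illustration, not details taken from the paper.

```python
import math

def length_penalty(target_len, alpha=0.4):
    """lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha."""
    return ((5 + target_len) ** alpha) / ((5 + 1) ** alpha)

def coverage_penalty(attention, beta=0.4, eps=1e-12):
    """cp(X; Y) = beta * sum_i log(min(sum_j p_ij, 1.0)).

    attention[j][i] is the attention of target word j on source word i.
    eps is an assumed numerical guard, not part of the original formula.
    """
    source_len = len(attention[0])
    total = 0.0
    for i in range(source_len):
        covered = sum(row[i] for row in attention)
        total += math.log(min(max(covered, eps), 1.0))
    return beta * total

def rescore(log_prob, target_len, attention, alpha=0.4, beta=0.4):
    """Length-normalized log-probability plus the coverage penalty."""
    return log_prob / length_penalty(target_len, alpha) + coverage_penalty(attention, beta)

# Two target words that jointly cover both source words incur no coverage penalty.
full_cover = [[0.5, 0.5], [0.5, 0.5]]
# One target word that ignores the second source word is penalized.
partial_cover = [[1.0, 0.0]]
score_full = rescore(-2.0, 2, full_cover)
score_partial = rescore(-2.0, 1, partial_cover)
```

Because log-probabilities are negative, dividing by a larger `lp(Y)` raises the score of longer hypotheses, while uncovered source words pull the score down.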

Reranking with REL, EXT, and ABS
The main purpose of the models designed in the previous subsections is to come up with a combined measure for reranking. For a query $q$, let $d_1, \ldots, d_r$ be the set of top $r$ (set to 500) candidate documents returned by the Solr BM25 eDisMax query. It is straightforward to impose an order on the $d_j$ through REL via the output probability estimates of relevance. Given $q$, for each $d_j$ we generate the pseudo-query (summary) $q_{d_j}$ by concatenating all distinct words in the pseudo-query sentences generated by ABS along with the words selected by EXT. Repeating words and special tokens are removed. Although faceted summaries are generated through ABS, in the end $q_{d_j}$ is essentially the set of all unique terms from ABS and EXT. Each $d_j$ is now scored by comparing $q$ and $q_{d_j}$ via two similarity metrics: the ROUGE-1 recall score, $s_{\mathrm{ROUGE}}$ (Lin, 2004), and a cosine similarity based score computed as $s_{\cos}(q, q_{d_j}) = \frac{1}{|q|} \sum_{w_i \in q} \max_{w_k \in q_{d_j}} \cos(e_i, e_k)$, where $e_i$ denotes the vector representation of term $w_i$ from BMET embeddings (Section 2.2). Overall, we compute four different scores (and hence rankings) of a document: (1) the retrieval score returned by Solr, (2) the document relevance score by REL, (3) the pseudo-query based ROUGE score, and (4) the pseudo-query similarity score $s_{\cos}$. In the end, we merge the rankings with reciprocal rank fusion (Cormack et al., 2009) to obtain the final ranked list of documents. The results are compared against the state-of-the-art models from the 2019 TREC-PM task.
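The scoring and fusion steps can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the released code: ROUGE-1 recall is simplified to set overlap (reasonable here since the pseudo-query is a set of unique terms), the cosine score is taken as an average over query terms of the best match in the pseudo-query, the toy embeddings stand in for BMET vectors, and the fusion constant `k=60` is the value suggested by Cormack et al. (2009) rather than one reported in this paper.

```python
import math

def rouge1_recall(query_terms, pseudo_terms):
    """Fraction of distinct query unigrams that also occur in the pseudo-query."""
    q = set(query_terms)
    return len(q & set(pseudo_terms)) / len(q) if q else 0.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def s_cos(query_terms, pseudo_terms, emb):
    """Average, over query terms, of the best cosine match in the pseudo-query."""
    q = [t for t in query_terms if t in emb]
    p = [t for t in pseudo_terms if t in emb]
    if not q or not p:
        return 0.0
    return sum(max(cosine(emb[t], emb[s]) for s in p) for t in q) / len(q)

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy example with hypothetical two-dimensional embeddings and document ids.
emb = {"braf": [1.0, 0.0], "melanoma": [0.0, 1.0]}
recall = rouge1_recall(["braf", "melanoma"], ["melanoma", "v600e"])
sim = s_cos(["braf", "melanoma"], ["braf"], emb)
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]])
```

Rank fusion only needs the order of each list, so the four scores never have to be put on a common scale.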

Data
Across the 2017-2019 TREC-PM tasks, we have a total of 120 patient cases and 63,387 qrels (document relevance judgments) as shown in Table 3. Table 4 lists the source and target of each model. REL takes a document along with the given query as the source input and predicts document-level relevance. We consider a document with a human judgment score of either 1 (partially relevant) or 2 (totally relevant) as relevant for this study. Note that we do not include MeSH terms in the query sentences for REL. EXT reads in a document as the source input and predicts token-level relevance. During training, a relevant token is one that occurs in the given patient case. A pseudo-query is the output for ABS, which takes in a document and a facet type.
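The way training targets follow from the judgments can be sketched as below. The helper names are hypothetical, and whitespace-level, lowercased token matching is an assumption made for brevity; the actual models operate on WordPiece tokens.

```python
def binarize_judgment(score):
    """Map a TREC-PM judgment (0, 1, or 2) to a REL label: 1 or 2 counts as relevant."""
    return 1 if score >= 1 else 0

def token_labels(doc_tokens, case_tokens):
    """EXT targets: a document token is relevant if it occurs in the patient case."""
    case = {t.lower() for t in case_tokens}
    return [1 if t.lower() in case else 0 for t in doc_tokens]

# Hypothetical patient case and document title, for illustration only.
case = "melanoma BRAF V600E".split()
doc = "BRAF inhibitors in metastatic melanoma".split()
rel_labels = [binarize_judgment(s) for s in (0, 1, 2)]
ext_labels = token_labels(doc, case)
```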

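Table 4: Source and target of each model

Model   Source                    Target
REL     document and query        document-level relevance
EXT     document                  token-level relevance
ABS     document and facet type   pseudo-query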

Implementation Details
For all three models, we begin with the pretrained bert-base-uncased HuggingFace model (Wolf et al., 2019) to encode source texts. We use BERT's WordPiece (Schuster and Nakajima, 2012) tokenizer for the source documents. REL and EXT are trained for 30,000 steps with a batch size of 12. The maximum number of tokens for source texts is limited to 384. As the loss function of these two models, we use weighted binary cross entropy. That is, given the high imbalance with many more irrelevant instances than positive ones, we put different weights on the classes in computing the loss according to the target distributions (proportions of negative examples are 87% for REL and 93% for EXT). The loss is $l(x, y; \theta) = -w_y \left[ y \log p(x) + (1-y) \log(1-p(x)) \right]$, where $w_0 = 13/87 \approx 0.15$, $w_1 = 1$ for REL and $w_0 = 7/93 \approx 0.075$, $w_1 = 1$ for EXT. We used the Adam optimizer with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a starting learning rate of $10^{-5}$, and a fixed weight decay of 0.0. The learning rate is reduced when a metric has stopped improving by using the ReduceLROnPlateau scheduler in PyTorch.
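The class weighting can be checked with a short, framework-free sketch. The function names are hypothetical and the clamp `eps` is an assumed numerical guard; the weights themselves follow the ratios stated above (positives over negatives for the negative class, 1 for the positive class).

```python
import math

def class_weights(neg_fraction):
    """w_0 = (1 - neg_fraction) / neg_fraction for negatives, w_1 = 1 for positives."""
    return {0: (1 - neg_fraction) / neg_fraction, 1: 1.0}

def weighted_bce(p, y, weights, eps=1e-12):
    """l = -w_y * [y * log(p) + (1 - y) * log(1 - p)] for one example."""
    p = min(max(p, eps), 1 - eps)
    return -weights[y] * (y * math.log(p) + (1 - y) * math.log(1 - p))

w_rel = class_weights(0.87)  # REL: 87% negative examples
w_ext = class_weights(0.93)  # EXT: 93% negative examples

# The same prediction error costs far less on a negative than on a positive.
loss_pos = weighted_bce(0.5, 1, w_rel)
loss_neg = weighted_bce(0.5, 0, w_rel)
```

Down-weighting the abundant negative class keeps the summed loss of the two classes roughly balanced despite the skewed label distribution.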
For the decoder of ABS, the multi-head attention module from OpenNMT (Klein et al., 2017) was used. To tokenize target texts, we use the NLTK word tokenizer (https://www.nltk.org/api/nltk.tokenize.html) unlike the one used in the encoder; this is because we use customized word embeddings, the BMET embeddings (Section 2.2), trained with a domain-specific corpus and vocabulary. The vocabulary size is 120,000, which includes the 29,351 MeSH codes. We use six transformer layers in the decoder. The model dimension is 768 and the feed-forward layer size is 2048. We use different initial learning rates for the encoder and decoder, since the encoder is initialized with a pretrained EXT model: $10^{-5}$ (encoder) and $10^{-3}$ (decoder). Negative log-likelihood is the loss function for ABS on the ground-truth faceted query sentences. For beam search in ABS, the beam size is set to 4. At test time, we select the top two predictions and merge them into one query sentence. The maximum length of a target sentence is limited to 50, and a sequence is incrementally generated until ABS outputs the corresponding eos token for each facet. All parameter choices were made based on best practices from prior efforts and experiments to optimize P@10 on validation subsets.

Evaluations and Results
We conducted both quantitative and qualitative evaluations with example outcomes. The final evaluation was done on the 2019 TREC-PM dataset while all hyperparameter tuning was done using a training and validation dataset split of a shuffled combined set of instances from 2017 and 2018 tracks (20% validation and the rest for training).

Quantitative Evaluations
We first discuss the performances of the constituent REL and EXT models that were evaluated using train and validation splits from the 2017 and 2018 tracks. Next we discuss the main results comparing against the top two teams (rows 1-2) in the 2019 track in Table 6. Before we proceed, we want to highlight one crucial evaluation consideration that applies to any TREC track. TREC evaluates systems in the Cranfield paradigm where pooled top documents from all participating teams are judged for relevance by human experts. Because we did not participate in the original TREC-PM 2019 task, our retrieved results are not part of the judged documents. Hence, we may be at a slight disadvantage when comparing our results with those of teams that participated in 2019 TREC-PM. Nevertheless, we believe that at least the top few most relevant documents are typically retrieved by all models. Hence we compare using both P@10 and R-Prec (P@all-relevant-doc-count) measures.
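For reference, the two measures can be computed as in the sketch below. The function names are hypothetical, and documents without a judgment are counted as non-relevant, which is the usual convention and the source of the disadvantage noted above for systems outside the judging pool.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k ranked documents that are judged relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents for the query."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for d in ranked[:r] if d in relevant) / r

# Hypothetical document ids; "u1" stands for an unjudged document.
ranked = ["d1", "u1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"]
relevant = {"d1", "d2", "d9", "d20"}
p10 = precision_at_k(ranked, relevant, k=10)
rprec = r_precision(ranked, relevant)
```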

Model                                R-Prec   P@10
julie-mug (Faessler et al., 2020)    0.3572   0.6525
BITEM PM (Caucheteur et al., 2020)

Table 7: Sample facet-conditioned document summarizations by ABS

The relevance feedback query (row 4) is generated by adding a few "interesting" terms (top TF/IDF terms) from the retrieved documents of the initial eDisMax query. This traditional relevance feedback method decreased performance relative to the baseline and hence has not been used in our reranking methods. All our models (rows 5-7) present stable baseline scores in P@10, and the combined method (+REL+ABS) tops the list with a 4% improvement over the prior best model (Faessler et al., 2020). Baseline with REL does the best in terms of R-Prec. Both prior top teams rely heavily on query expansion through external knowledge bases to add synonyms, hypernyms, and hyponyms of terms found in the original query. Table 7 presents sample pseudo-queries generated by ABS. The summaries of the first document show some novel words, intrahepatic and cholangiocarcinoma, that do not occur in the given document (we only show the title for conciseness, but the abstract also does not contain those words). The model may have learned the close relationship between cholangiocarcinoma and BRAF v600e, the latter being part of the genetic facet of the actual query for which PMID: 28454296 turns out to be relevant. Also, embedding proximity between intrahepatic and cholangiocarcinoma may have introduced both into the pseudo-query, although they are not central to this document's theme. Still, this may be important in retrieving documents that have an indirect (yet relevant) link to the query through the pseudo-query terms. This could be why, although ABS underperforms REL, it still complements it when combined (Table 6). The table also shows that ABS can generate concepts in a domain-specific terminology. For example, the second document yields the following MeSH entity codes, which are strongly related to the topics of the document: D002471 (Cell Transformation, Neoplastic), D017728 (Lymphoma, Large-Cell, Anaplastic), and D000077548 (Anaplastic Lymphoma Kinase).

Qualitative Analysis
For a qualitative exploration of what EXT and different facets of ABS capture, we refer the reader to Appendix A.

Machine Configuration and Runtime
All training and testing was done on a single Nvidia Titan X GPU in a desktop with 64GB RAM. The corpus to be indexed had 30,429,310 biomedical citations (titles and abstracts of biomedical articles). We trained the three models for five epochs, and the training time per epoch (80,319 query-document pairs) is 69 mins for REL, 72 mins for EXT, and 303 mins for ABS. At test time, per query, the Solr eDisMax query returns the top 500 results in 20 ms. Generating pseudo-queries for 500 candidates via EXT and ABS takes 126 seconds, and generating REL scores consumes 16 seconds. So per query, it takes nearly 2.5 mins at test time to return a ranked list of documents. Although this does not facilitate real time retrieval as in commercial search engines, given the complexity of the queries, we believe this is at least near real time, offering a convenient way to launch PM queries. Furthermore, this comes at an affordable configuration for many labs and clinics with a smaller carbon footprint.
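The reported timings can be added up as a quick sanity check. The sketch assumes the three test-time stages run one after another for a query, and that the five epochs apply to each model separately with training done sequentially on the single GPU; neither assumption is stated explicitly in the text.

```python
# Test-time cost per query, in seconds (values from the paragraph above).
SOLR_SECONDS = 0.020           # eDisMax retrieval of the top 500 results
PSEUDO_QUERY_SECONDS = 126.0   # EXT + ABS over 500 candidates
REL_SECONDS = 16.0             # REL scores over 500 candidates

per_query_seconds = SOLR_SECONDS + PSEUDO_QUERY_SECONDS + REL_SECONDS
per_query_minutes = per_query_seconds / 60

# Training cost, in minutes per epoch for each model.
EPOCHS = 5
MINUTES_PER_EPOCH = {"REL": 69, "EXT": 72, "ABS": 303}
total_training_hours = EPOCHS * sum(MINUTES_PER_EPOCH.values()) / 60
```

Under these assumptions a query takes about 2.4 minutes end to end, dominated by pseudo-query generation, and training all three models takes about 37 hours.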

Conclusion
In this paper, we proposed an ensemble document reranking approach for PM queries. It builds on pretrained BERT models to combine strategies from document relevance matching and extractive/abstractive text summarization to arrive at document rankings that are complementary in eventual evaluations. Our experiments also demonstrate that entity embeddings trained on an annotated domain specific corpus can help in document retrieval settings. Both quantitative and qualitative analyses throw light on the strengths of our approach.
One scope for advances lies in improving the summarizer to generate better pseudo-queries so that ABS starts to perform better on its own. At a high level, training data is very hard to generate in large amounts for IR tasks in biomedicine and this holds for the TREC-PM datasets too. To better train ABS, it may be better to adapt other biomedical IR datasets. For example, the TREC clinical decision support (CDS) task that ran from 2014 to 2016 is related to the PM task (Roberts et al., 2016). A future goal is to see if we can apply our neural transfer learning (Rios and Kavuluru, 2019) and domain adaptation (Rios et al., 2018) efforts to repurpose the CDS datasets for the PM task.
Another straightforward idea is to reuse generated pseudo-query sentences in the eDisMax query by Solr, as a form of pseudo relevance feedback. The $s_{\cos}$ expression in Section 3.4 focuses on an asymmetric formulation that starts with a query term and looks for the best match in the pseudo-query. A more symmetric formulation, where we also begin with the pseudo-query terms and average both summands, may provide a better estimate for reranking. Additionally, a thorough exploration of how external biomedical knowledge bases (Wagner et al., 2020) can be incorporated in the neural IR framework for PM is also important (Nguyen et al., 2017).

Figure 3 depicts words highlighted by EXT. Evidently, we see terms related to the regulation of gene expression, proteins, or disease names featuring more prominently. Figure 4 shows how ABS reads the source document differently depending on which facet signal it starts with in the process of query generation; compared to [unused0] (disease facet), the attention heat map by [unused1] (genetic facet) focuses more on the words related to gene regulation.