BioNLP-OST 2019 RDoC Tasks: Multi-grain Neural Relevance Ranking Using Topics and Attention Based Query-Document-Sentence Interactions

This paper presents our system details and results of participation in the RDoC Tasks of BioNLP-OST 2019. Research Domain Criteria (RDoC) is a broad, multi-dimensional framework for describing mental health disorders by combining knowledge from genomics to behaviour. The non-availability of RDoC-labelled datasets and the tedious labelling process hinder the RDoC framework from reaching its full potential in the biomedical research community and the healthcare industry. Therefore, Task-1 aims at retrieval and ranking of PubMed abstracts relevant to a given RDoC construct, and Task-2 aims at extraction of the most relevant sentence from a given PubMed abstract. We investigate (1) an attention-based supervised neural topic model and SVM for retrieval and ranking of PubMed abstracts, further utilizing BM25 and other relevance measures for re-ranking, and (2) supervised and unsupervised sentence ranking models utilizing multi-view representations comprising query-aware attention-based sentence representation (QAR), bag-of-words (BoW) and TF-IDF. Our best systems achieved 1st rank, scoring 0.86 mAP in Task-1 and 0.58 macro average accuracy in Task-2.


Introduction
The scientific research output of the biomedical community is becoming more sub-domain specialized and increasing at a faster pace. Most of the biomedical domain knowledge is in the form of unstructured text data. Natural Language Processing (NLP) techniques such as relation extraction and information retrieval have enabled us to effectively mine relevant information from a large corpus. These techniques have significantly reduced the time and effort required for knowledge mining and information extraction from past scientific studies and electronic health reports (EHR).
*: Equal Contribution
Information Retrieval (IR) is the process of retrieving relevant information from an unstructured text corpus that satisfies a given query/requirement, for example Google search, email search, database search, etc. This is generally achieved by converting the query and the document collection into an external representation which, by preserving the important semantic information, can reduce the IR processing time. This external representation can be generated using either a statistical approach, i.e., word counts, or a distributed semantic approach, i.e., word embeddings. Therefore, there is a motivation to develop an IR system that can understand the specialized sub-domain language and domain-specific jargon of the biomedical domain and assist researchers and medical professionals by effectively and efficiently retrieving the most relevant information given a query.
The RDoC Tasks aim at exploring information retrieval (IR) and information extraction (IE) tasks on selected abstracts from the PubMed dataset. While Task-1 aims to rank abstracts, i.e., coarse granularity, Task-2 aims to rank sentences, i.e., fine granularity, and hence the term multi-grain. An RDoC construct combines information from multiple sources such as genomics, symptoms and behaviour, and is therefore a much broader way of describing mental health disorders than a symptom-based approach. Table 1 shows the association between PubMed abstracts and RDoC constructs depending on the semantic knowledge of the highlighted content words. Both tasks aim at making PubMed abstracts labelled with diverse RDoC constructs easily accessible, so that this information can reach its full potential and be of help to biomedical researchers and healthcare professionals.

Table 1 (columns: PMID, RDoC Construct, PubMed Abstract)

RDoC Construct: Acute Threat Fear
Title: Mother lowers glucocorticoid levels of preweaning rats after acute threat.
Abstract: Exposure to a deadly threat, an adult male rat, induced the release of corticosterone in 14-day-old rat pups. The endocrine stress response was decreased when the pups were reunited with their mother immediately after exposure. These findings demonstrate that social variables can reduce the consequences of an aversive experience.

RDoC Construct: Sleep Wakefulness
Title: Central mechanisms of sleep-wakefulness cycle.
Abstract: Brief anatomical, physiological and neurochemical basics of the regulation of wakefulness, slow wave (NREM) sleep and paradoxical (REM) sleep are regarded as representing by the end of the first decade of the second millennium.

Task Description and Contributions
RDoC-IR Task-1: The task aims at retrieving and ranking the PubMed abstracts (within each of the eight clusters) that are relevant for the RDoC construct (i.e., a query) of the cluster in which the abstract appears. The training data consists of abstracts (title + sentences), each annotated with one or more RDoC constructs. Test data consists of abstracts without annotation, and the goal is to submit a ranked list of relevant articles for each medical domain RDoC construct.
RDoC-IE Task-2: The task aims at extracting the most relevant sentence from each PubMed abstract for the corresponding RDoC construct. The input consists of an abstract (title t and sentences s) for an RDoC construct q. The training data consists of abstracts, each annotated with one RDoC construct and the most relevant sentence. Test data contains abstracts relevant for RDoC constructs, and the goal is to submit a list of the predicted most relevant sentence for each abstract.
Our Contributions: Following are our multifold contributions in this paper:
(1) RDoC-IR Task-1: We perform document (or abstract) ranking in two steps, first using a supervised neural topic model and SVM. Moreover, we have introduced attention in the supervised neural topic model, along with pre-trained word embeddings from several sources. Then, we re-rank documents using BM25 and similarity scores between the query and the query-aware attention-based document representation. Compared with other participating systems in the shared task, our submission is ranked 1st with a mAP score of 0.86.
(2) RDoC-IE Task-2: We have addressed the sentence ranking task by introducing unsupervised and supervised sentence ranking schemes. Moreover, we have employed multi-view representations consisting of bag-of-words, TF-IDF and query-aware attention-based sentence representation via enhanced query-sentence interactions. We have also investigated the relevance of the title to the sentences and coined ways to incorporate both query-sentence and title-sentence relevance scores in ranking sentences within an abstract. Compared with other participating systems in the shared task, our submission is ranked 1st with a macro average accuracy of 0.58. Our code is available at https://github.com/YatinChaudhary/RDoC_Task.

Methodology
In this section, we first describe how a query, sentence and document are represented using local and distributed representation schemes. We further describe enhanced query-document (query-title and query-content) and query-sentence interactions to compute query-aware document or sentence representations for Task-1 and Task-2, respectively. Finally, we discuss the application of supervised neural topic modeling in ranking documents for Task-1 and introduce unsupervised and supervised sentence rankers for Task-2.

Query, Sentence and Document Vectors
In this paper, we deal with texts of different lengths in the form of queries, sentences and documents. In this section, we describe the way we represent these different texts.
Bag-of-words (BoW) and TF-IDF: We use the local representations BoW and TF-IDF (Manning et al., 2008) to compute sentence/document vectors.
Embedding Sum Representation (ESR): Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) have been successfully used in computing distributed representations of text snippets (short or long). In the ESR scheme, we employ the pre-trained word embeddings from FastText (Bojanowski et al., 2017) and word2vec (Mikolov et al., 2013). To represent a text (query, sentence or document), we compute the sum of the (pre-trained) word vectors of each word in the text. E.g., the ESR for a document d with D words can be computed as:
$$\mathrm{ESR}(d) = \tilde{d} = \sum_{i=1}^{D} e(d_i)$$
where $e(d_i) \in \mathbb{R}^{E}$ is the pre-trained embedding vector of dimension E for the word $d_i$.
Query-aware Attention-based Representation (QAR) for Documents and Sentences: Unlike ESR, we reward the maximum matches between a query and a document by computing the density of matches between them, similar to McDonald et al. (2018). In doing so, we introduce a weighted sum of word vectors from pre-trained embeddings and therefore incorporate the importance/attention of certain words in the document (or sentence) that appear in the query text.
For an enhanced query-aware attention-based document (or sentence) representation, we first compute a histogram $a_i(d) \in \mathbb{R}^{D}$ of attention weights for each word k in the document d (or sentence s) relative to the ith query word $q_i$, using cosine similarity:
$$a_i(d) = [a_{i,k}]_{k=1}^{D}, \qquad a_{i,k} = \frac{e(q_i)^{\top} e(d_k)}{\lVert e(q_i) \rVert \, \lVert e(d_k) \rVert}$$
Here, e(w) refers to the embedding vector of the word w.
We then compute a query-aware attention-based representation $\Phi_i(d)$ of document d from the viewpoint of the ith query word by summing the word vectors of the document, weighted by their attention scores $a_i(d)$:
$$\Phi_i(d) = \sum_{k=1}^{D} a_{i,k} \odot e(d_k)$$
where $\odot$ is an element-wise multiplication operator.
Next, we compute the density of matches between several words in the query and the document by summing over the attention histograms $a_i$ for all the query terms i. Therefore, the query-aware document representation for a document (or sentence) relative to all query words in q is given by:
$$\Phi_q(d) = \sum_{i=1}^{|q|} \Phi_i(d)$$
Similarly, a query-aware sentence representation $\Phi_q(s)$ and a query-aware title representation $\Phi_q(t)$ can be computed for the sentence s and document title t, respectively.
For the query representation, we use the ESR scheme: $\tilde{q} = \sum_{i=1}^{|q|} e(w_i)$. Figure 2 illustrates the computation of the query-aware attention-based sentence representation.
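The ESR and QAR computations described above can be illustrated with a minimal pure-Python sketch. The toy embedding table, the helper names (`esr`, `cosine`, `qar`) and the choice to skip out-of-vocabulary words are assumptions made for illustration; they are not taken from the released code.

```python
import math

def esr(words, emb):
    """Embedding Sum Representation: sum the vectors of all words found in `emb`."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for w in words:
        if w in emb:  # assumption: out-of-vocabulary words are skipped
            total = [t + x for t, x in zip(total, emb[w])]
    return total

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def qar(query, text, emb):
    """Phi_q(text) = sum_i sum_k a_{i,k} * e(text_k), a_{i,k} = cos(e(q_i), e(text_k))."""
    dim = len(next(iter(emb.values())))
    phi = [0.0] * dim
    for q in query:
        if q not in emb:
            continue
        for w in text:
            if w not in emb:
                continue
            a = cosine(emb[q], emb[w])
            phi = [p + a * x for p, x in zip(phi, emb[w])]
    return phi

# Toy 2-dimensional embeddings (hypothetical values, not from the paper).
emb = {"fear": [1.0, 0.0], "threat": [0.8, 0.6], "sleep": [0.0, 1.0]}
query = ["fear"]
q_vec = esr(query, emb)
score_a = cosine(qar(query, ["threat", "fear"], emb), q_vec)
score_b = cosine(qar(query, ["sleep"], emb), q_vec)
```

In this toy setting the sentence that shares vocabulary with the query receives the higher similarity, while the unrelated sentence receives a zero attention weight and therefore a zero vector.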

Document Neural Topic Models
Topic models (TMs) (Blei et al., 2003) have been shown to capture thematic structures, i.e., topics appearing within a document collection. Beyond interpretability, topic models can extract a latent document representation that is used to perform document retrieval. Recently, Gupta et al. (2019a) and Gupta et al. (2019b) have shown that neural network-based topic models (NTM) outperform LDA-based topic models (Blei et al., 2003; Srivastava and Sutton, 2017) in terms of generalization, interpretability and document retrieval.
In order to perform document classification and retrieval, we have employed a supervised version of the neural topic model with extra features and further introduced word-level attention in a neural topic model, i.e., in DocNADE (Larochelle and Lauly, 2012; Gupta et al., 2019a).
Supervised NTM (SupDocNADE): Document Neural Autoregressive Distribution Estimator (DocNADE) is a neural network-based topic model that works on a bag-of-words (BoW) representation to model a document collection in a language modeling fashion.
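Consider a document d, represented as $v = [v_1, ..., v_i, ..., v_D]$ of size D, where $v_i \in \{1, ..., Z\}$ is the index of the ith word in the vocabulary and Z is the vocabulary size. DocNADE models the joint distribution p(v) of all words $v_i$ by factorizing it into autoregressive conditionals:
$$p(v) = \prod_{i=1}^{D} p(v_i \mid v_{<i})$$
As shown in Figure 1 (left), DocNADE computes each autoregressive conditional $p(v_i \mid v_{<i})$ using a feed-forward neural network for $i \in \{1, ..., D\}$ as,
$$h_i(v_{<i}) = f\Big(c + \sum_{j<i} W_{:,v_j}\Big), \qquad p(v_i = w \mid v_{<i}) = \frac{\exp\big(b_w + U_{w,:}\, h_i(v_{<i})\big)}{\sum_{w'} \exp\big(b_{w'} + U_{w',:}\, h_i(v_{<i})\big)}$$
where $f(\cdot)$ is a non-linear activation function, $W \in \mathbb{R}^{H \times Z}$ and $U \in \mathbb{R}^{Z \times H}$ are the encoding and decoding matrices, $c \in \mathbb{R}^{H}$ and $b \in \mathbb{R}^{Z}$ are the encoding and decoding biases, and H is the number of units in the latent representation $h_i(v_{<i})$. Here, $h_i(v_{<i})$ contains information about the words preceding the word $v_i$. For a document v, the log-likelihood L(v) and latent representation h(v) are given as,
$$L(v) = \sum_{i=1}^{D} \log p(v_i \mid v_{<i}), \qquad h(v) = f\Big(c + \sum_{i=1}^{D} W_{:,v_i}\Big) \tag{3}$$
Here, L(v) is used to optimize the topic model in an unsupervised fashion and h(v) encodes the topic proportion. See Gupta et al. (2019a) for further details on training unsupervised DocNADE.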
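To make the autoregressive computation concrete, the following is a minimal sketch of the DocNADE forward pass for one document. It assumes tanh as the activation f and plain nested lists as parameters; the function name `docnade_log_likelihood` is hypothetical and the sketch is not the authors' implementation.

```python
import math

def docnade_log_likelihood(v, W, U, b, c):
    """Return (L(v), h(v)) for a document v given as a list of word indices.

    W: H x Z encoding matrix (list of H rows), U: Z x H decoding matrix,
    b: decoding bias (length Z), c: encoding bias (length H).
    Assumption: the non-linearity f is tanh.
    """
    H, Z = len(W), len(b)
    pre = list(c)  # running pre-activation: c + sum_{j<i} W[:, v_j]
    log_lik = 0.0
    for v_i in v:
        h_i = [math.tanh(x) for x in pre]
        logits = [b[w] + sum(U[w][k] * h_i[k] for k in range(H)) for w in range(Z)]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        log_lik += logits[v_i] - log_z
        # Add the current word only after predicting it (autoregressive order).
        pre = [pre[k] + W[k][v_i] for k in range(H)]
    h_v = [math.tanh(x) for x in pre]
    return log_lik, h_v

# Toy check with all-zero parameters: every word is predicted uniformly.
H, Z = 2, 3
W = [[0.0] * Z for _ in range(H)]
U = [[0.0] * H for _ in range(Z)]
ll, h = docnade_log_likelihood([0, 2], W, U, [0.0] * Z, [0.0] * H)
```

Because the pre-activation is updated incrementally, the hidden state for position i reuses the sum computed for position i-1 instead of recomputing it.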
Here, we extend the unsupervised DocNADE to a supervised version with a hybrid cost $L_{hybrid}(v)$, consisting of a (supervised) discriminative training cost p(y = q|v) along with an unsupervised generative cost p(v) for a given query q and associated document v:
$$L_{hybrid}(v) = L_{sup}(v) + \lambda \cdot L_{unsup}(v)$$
where $\lambda \in [0, 1]$, $L_{sup}(v)$ is the supervised cost and $L_{unsup}(v)$ is the unsupervised cost. The supervised cost is given by:
$$L_{sup}(v) = p(y = q \mid v) = \mathrm{softmax}(d + S \cdot h(v))$$
Here, $S \in \mathbb{R}^{L \times H}$ and $d \in \mathbb{R}^{L}$ are the output matrix and bias, and L is the total number of unique RDoC constructs (i.e., unique query labels).
Supervised Attention-based NTM (a-SupDocNADE): Observe in equation 3 that DocNADE computes the document representation h(v) via aggregation of word embedding vectors without considering attention over certain words. However, certain content words carry high importance, especially in a classification task. Therefore, we have introduced attention-based embedding aggregation in supDocNADE (Figure 1, left):
$$h(v) = f\Big(c + \sum_{i=1}^{D} \alpha_i W_{:,v_i}\Big)$$
Here, $\alpha_i$ is the attention score of each word i in the document v, learned via supervised training. Additionally, we incorporate extra word features, such as pre-trained word embeddings from several sources: FastText ($E_{fast}$) (Bojanowski et al., 2017) and word2vec ($E_{word2vec}$) (Mikolov et al., 2013). We introduce these features by concatenating $h_e(v)$ with h(v) in the supervised portion of the a-supDocNADE model. Therefore, the classification portion of a-supDocNADE with additional features is given by:
$$p(y = q \mid v) = \mathrm{softmax}\big(d + S'^{\top} \cdot [h(v); h_e(v)]\big)$$
where $S' \in \mathbb{R}^{H' \times L}$ and $H' = H + E_{fast} + E_{word2vec}$.
Figure 2 (panels: Unsupervised Sentence Ranker; Supervised Sentence Ranker).
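The attention-weighted aggregation and the classification portion with concatenated embedding features can be sketched as follows. The attention scores `alpha` and the feature vector `h_e` are passed in as given values, since how they are learned or computed is not reproduced here; tanh as f and all function names are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def attentive_hidden(v, W, c, alpha):
    """h(v) = f(c + sum_i alpha_i * W[:, v_i]), with f = tanh (assumed)."""
    pre = list(c)
    for a_i, v_i in zip(alpha, v):
        pre = [pre[k] + a_i * W[k][v_i] for k in range(len(c))]
    return [math.tanh(x) for x in pre]

def classify(h, h_e, S, d):
    """p(y | v) = softmax(d + S^T [h; h_e]); S has H' rows and L columns."""
    feats = list(h) + list(h_e)  # concatenation, length H'
    logits = [d[l] + sum(S[k][l] * feats[k] for k in range(len(feats)))
              for l in range(len(d))]
    return softmax(logits)

# Toy example: H = 2 hidden units, Z = 3 vocabulary words, L = 2 labels.
W = [[1.0, 5.0, 0.0],
     [0.0, 5.0, 0.0]]
c = [0.0, 0.0]
h = attentive_hidden([0, 1], W, c, alpha=[1.0, 0.0])  # word 1 gets zero attention
h_e = [1.0]                                            # hypothetical 1-d embedding feature
S = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]               # H' x L with H' = 3
probs = classify(h, h_e, S, d=[0.0, 0.0])
```

With a zero attention score, the second word contributes nothing to h(v), which is the effect the attention mechanism is meant to provide for uninformative words.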

Traditional Methods for IR
BM25: A ranking function proposed by Robertson and Zaragoza (2009) is used to estimate the relevance of a document for a given query.
BM25-Extra: The relevance score of BM25 is combined with four extra features: (1) percentage of query words with an exact match in the document, (2) percentage of query word bigrams matched in the document, (3) IDF-weighted document vector for feature #1, and (4) IDF-weighted document vector for feature #2. Therefore, BM25-Extra returns a vector of 5 scores.
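As a rough illustration, the sketch below computes an Okapi BM25 score together with extra features (1) and (2). It uses a common BM25 variant with a smoothed, non-negative IDF and the conventional defaults k1 = 1.5 and b = 0.75; these constants, the function names and the toy documents are assumptions, and the IDF-weighted features (3) and (4) are not implemented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document in `docs` for `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

def match_fractions(query, doc):
    """Extra features (1) and (2): fraction of query unigrams / bigrams found in doc."""
    doc_words = set(doc)
    doc_bigrams = set(zip(doc, doc[1:]))
    q_bigrams = list(zip(query, query[1:]))
    uni = sum(w in doc_words for w in query) / len(query)
    bi = sum(bg in doc_bigrams for bg in q_bigrams) / len(q_bigrams) if q_bigrams else 0.0
    return uni, bi

docs = [
    "exposure to acute threat induced fear in rat pups".split(),
    "regulation of sleep and wakefulness".split(),
]
query = ["acute", "threat", "fear"]
scores = bm25_scores(query, docs)
features = match_fractions(query, docs[0])
```

Here the first document matches all query words and one of the two query bigrams, while the second document shares no term with the query and scores zero.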

System Description for RDoC Task-1
RDoC Task-1 aims at retrieving and ranking PubMed abstracts (title and content) that are relevant for 8 RDoC constructs. Participants are provided with 8 clusters, each with an RDoC construct label, and are required to rank the abstracts within each cluster based on their relevance to the corresponding cluster label. Each cluster contains abstracts relevant to its RDoC construct, while some (or most) of the abstracts are noisy in the sense that they belong to a different RDoC construct. Ideally, the participants are required to rank the abstracts in each of the clusters by determining their relevance to the RDoC construct of the cluster in which they appear.
To address RDoC Task-1, we learn a mapping function between the latent representation h(v) of a document (i.e., abstract) v and its RDoC construct, i.e., query words q, in a supervised fashion. In doing so, we have employed supervised classifiers, especially the supervised neural topic model a-supDocNADE (section 3.2), for document ranking. We treat q as the label and maximize p(q|v), which leads to maximizing $L_{hybrid}(v)$ in the a-supDocNADE model.
As demonstrated in Figure 1 (right), we perform document ranking in two steps:
(1) Document Relevance Ranking: We build a supervised classifier using all the training documents and their corresponding labels (RDoC constructs) provided with the training set. At test time, we compute the prediction probability score $p(\mathrm{CID} = q \mid v_{test}(\mathrm{CID}))$ of the label CID for each test document $v_{test}(\mathrm{CID})$ in the cluster CID. This prediction probability (or confidence score) is treated as the relevance score of the document for the RDoC construct of the cluster. Figure 1 (right) shows that we perform document ranking using the probability scores (col-2) of the RDoC construct (e.g., loss) within the cluster C1. Observe that test documents with the least confidence for a cluster are ranked lower within the cluster, thus improving mean average precision (mAP). Additionally, we also show the RDoC construct predicted by the supervised classifier in col-1.
(2) Document Relevance Re-ranking: Secondly, we re-rank each document v (title+abstract) within each cluster (with label q) using unsupervised ranking, where the relevance scores are computed as: (a) reRank(BM25-Extra): sum each of the 5 relevance scores to get the final relevance, and (b) reRank(QAR): cosine-similarity(QAR(v), q).
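The effect of the two ranking steps on average precision can be illustrated with a small sketch. The document ids and scores are invented for illustration, and the sketch assumes that the re-ranking score simply replaces the step-1 ordering, which is one possible reading of the procedure above.

```python
def average_precision(ranked_relevance):
    """AP of a ranked list of 0/1 flags (1 = abstract truly belongs to the cluster)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def rank_cluster(doc_ids, scores):
    """Sort the documents of one cluster by descending relevance score."""
    pairs = sorted(zip(doc_ids, scores), key=lambda p: p[1], reverse=True)
    return [d for d, _ in pairs]

# Hypothetical cluster: d3 is an intruder (gold label differs from the cluster label).
gold = {"d1": 1, "d2": 1, "d3": 0}
p_label = {"d1": 0.9, "d2": 0.4, "d3": 0.6}  # step 1: p(q | v) from the classifier
rerank = {"d1": 2.0, "d2": 1.5, "d3": 0.1}   # step 2: e.g. summed BM25-Extra scores

ids = list(gold)
step1 = rank_cluster(ids, [p_label[d] for d in ids])
step2 = rank_cluster(ids, [rerank[d] for d in ids])
ap1 = average_precision([gold[d] for d in step1])
ap2 = average_precision([gold[d] for d in step2])
```

In this toy cluster the intruder sits in second place after step 1 and moves to the bottom after re-ranking, which raises the average precision; mAP is the mean of such values over all clusters.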

System Description for RDoC Task-2
RDoC Task-2 aims at extracting the most relevant sentence from each PubMed abstract for the corresponding RDoC construct. Each abstract consists of a title t and sentences s with an RDoC construct q.
To address RDoC Task-2, we first compute a multi-view representation, comprising BoW, TF-IDF and QAR (i.e., $\Phi_q(s_j)$), for each sentence $s_j$ in an abstract d. On the other hand, we compute the ESR representation for the RDoC construct (query q) and the title t of the abstract d to obtain $\tilde{q}$ and $\tilde{t}$, respectively. Figure 2 and section 3.1 describe the computation of these representations. We then use the representations ($\Phi_q(s_j)$, $\tilde{t}$ and $\tilde{q}$) to compute the relevance score of a sentence $s_j$ relative to q and/or t via unsupervised and supervised ranking schemes, discussed in the following sections.

Unsupervised Sentence Ranker
As shown in Figure 2, we first extract the representations $\Phi_q(s_j)$, $\tilde{t}$ and $\tilde{q}$ for the sentence $s_j$, title t and query q. When ranking sentences within an abstract for the given RDoC construct q, we also consider the title t in computing the relevance score for each sentence relative to q and t. This is inspired by the fact that the title often contains relevant terms (or words) appearing in sentence(s) of the document (or abstract). In addition, we observe that q is a very short and non-descriptive text, leading to minimal text overlap with s.
We compute two relevance scores, $r_q$ and $r_t$, for a sentence $s_j$ with respect to the query q and title t, respectively:
$$r_q = \mathrm{sim}(\tilde{q}, \Phi_q(s_j)) \quad \text{and} \quad r_t = \mathrm{sim}(\tilde{t}, \Phi_q(s_j))$$
Now, we devise two ways to combine the relevance scores $r_q$ and $r_t$ in the unsupervised paradigm:
version1: $r^{unsup}_{1} = r_q \cdot r_q + r_t \cdot r_t$
Observe that each relevance score is weighted by itself. However, Task-2 expects a higher importance for the relevance score $r_q$ than for $r_t$. Therefore, we coin the following weighting scheme, which gives higher importance to $r_q$ only if it is higher than $r_t$; otherwise, we compute a weight factor $r'_t$ for $r_t$.
version2: $r^{unsup}_{2} = r_q \cdot r_q + r'_t \cdot r_t$
The relevance score $r^{unsup}_{2}$ is effective in ranking sentences when a query and sentence do not overlap. In such a scenario, a sentence is scored by the title, penalized by a factor of $|r_t - r_q|$.
At the end, we obtain a final relevance score $r^{unsup}_{f}$ for a sentence $s_j$ by summing the relevance scores of BM25-Extra and $r^{unsup}_{1}$ or $r^{unsup}_{2}$.
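A minimal sketch of the version1 combination and the final unsupervised score is given below. The similarity values and BM25-Extra vectors are hypothetical inputs, and only version1 is shown because the weight factor used by version2 is not specified in this text.

```python
def version1(r_q, r_t):
    """r_unsup1 = r_q * r_q + r_t * r_t (each score is weighted by itself)."""
    return r_q * r_q + r_t * r_t

def final_unsup(r_q, r_t, bm25_extra):
    """Final score: sum of the BM25-Extra scores plus the combined relevance score."""
    return sum(bm25_extra) + version1(r_q, r_t)

# Hypothetical (r_q, r_t, BM25-Extra vector) triples for three sentences of one abstract.
sentences = [
    (0.2, 0.3, [0.0, 0.0, 0.0, 0.0, 0.0]),
    (0.7, 0.6, [1.2, 0.5, 0.0, 0.3, 0.0]),
    (0.4, 0.9, [0.4, 0.5, 0.0, 0.2, 0.0]),
]
scores = [final_unsup(r_q, r_t, extra) for r_q, r_t, extra in sentences]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the top sentence
```

Squaring each score means that a sentence with one strong match (to the query or to the title) is favoured over a sentence with two weak matches of the same total.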

Supervised Sentence Ranker
Beyond unsupervised ranking, we further investigate sentence ranking in the supervised paradigm by introducing a distance metric between the query (or title) and sentence vectors. Figure 2 describes the computation of the relevance score for a sentence $s_j$ using the supervised sentence ranker scheme. Like the unsupervised ranker (section 3.5.1), the supervised ranker also employs the vector representations $\Phi_q(s_j)$, $\tilde{t}$ and $\tilde{q}$. Using the projection matrix G, we then apply a projection to each of the representations to obtain $\Phi^{p}_{q}(s_j)$, $\tilde{t}^{p}$ and $\tilde{q}^{p}$. Here, the operator ⊗ performs concatenation of the projected vector with its input via a residual connection. Next, we apply a Manhattan distance metric to compute similarity (or relevance) scores.

Table 3: RDoC Task-1 results (on development set): Classification accuracy and mean Average Precision (mAP) of a-supDocNADE and SVM models. Each model's classification accuracy and ranking mAP (using prediction probabilities) are shown together. Furthermore, each model's ranked clusters are re-ranked using different re-ranking algorithms. Best mAP score for each model is marked in bold.
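The projection with residual concatenation and the Manhattan-distance similarity of the supervised sentence ranker can be sketched as below. Several parts are assumptions rather than the paper's definitions: the similarity is taken to be exp(-L1 distance), a form commonly used with Siamese networks, the query and title scores are combined as r_q + β · r_t, and G is a fixed toy matrix instead of a learned one.

```python
import math

def project(x, G):
    """Residual-style projection: concatenate G x with the input x itself."""
    gx = [sum(g * xi for g, xi in zip(row, x)) for row in G]
    return gx + list(x)

def manhattan_sim(u, v):
    """exp(-||u - v||_1): lies in (0, 1] and equals 1 for identical vectors (assumed form)."""
    return math.exp(-sum(abs(a - b) for a, b in zip(u, v)))

def supervised_score(phi_s, q, t, G, beta):
    """Assumed combination: similarity to the query plus beta times similarity to the title."""
    s_p, q_p, t_p = project(phi_s, G), project(q, G), project(t, G)
    return manhattan_sim(s_p, q_p) + beta * manhattan_sim(s_p, t_p)

G = [[1.0, 0.0], [0.0, 1.0]]  # toy identity projection (a learned matrix in practice)
phi_s, q_vec, t_vec = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
score_beta0 = supervised_score(phi_s, q_vec, t_vec, G, beta=0)
score_beta1 = supervised_score(phi_s, q_vec, t_vec, G, beta=1)
```

Setting beta to 0 ignores the title entirely, which corresponds to the query-only setting among the two values of β explored in the experiments.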

Data Statistics and Experimental Setup
Dataset Description: The dataset for the RDoC Tasks contains a total of 266 PubMed abstracts labelled with 8 RDoC constructs in a single-label fashion. The number of abstracts for each RDoC construct is described in Table 2, where the first row gives the statistics for all abstracts and the second and third rows show the split of those abstracts into training and development sets, maintaining an 80-20 ratio for each RDoC construct. For Task-1, each PubMed abstract contains its associated title, PubMed ID (PMID) and label (RDoC construct). In addition, for Task-2, each PubMed abstract also contains a list of the most relevant sentences from that abstract. The final evaluation test data for Task-1 and Task-2 contains 999 and 244 abstracts, respectively.
We use "RegexpTokenizer" from NLTK to tokenize abstracts and lower-case all tokens. After this, we remove those tokens which occur in fewer than 3 abstracts and also remove stopwords (using nltk). For computing the BM25-Extra relevance score, we use the unprocessed raw text of sentences and titles.
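The preprocessing steps above can be approximated with the standard library alone, as in the sketch below. A plain regular expression stands in for the tokenizer and a tiny hand-written set stands in for the NLTK stopword list; both substitutions, along with the function names and toy abstracts, are assumptions made to keep the example self-contained.

```python
import re
from collections import Counter

# Tiny stand-in for an English stopword list (assumption, not the NLTK list).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}

def tokenize(text):
    """Lower-case the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

def build_vocab(abstracts, min_df=3):
    """Keep tokens that occur in at least `min_df` abstracts and are not stopwords."""
    df = Counter(tok for text in abstracts for tok in set(tokenize(text)))
    return {tok for tok, n in df.items() if n >= min_df and tok not in STOPWORDS}

def preprocess(text, vocab):
    """Tokenize and drop every token that is outside the vocabulary."""
    return [tok for tok in tokenize(text) if tok in vocab]

abstracts = [
    "The stress response of rat pups.",
    "Stress and sleep in the rat.",
    "Regulation of stress in rat pups.",
]
vocab = build_vocab(abstracts)
tokens = preprocess(abstracts[0], vocab)
```

Document frequency is counted once per abstract (via `set`), so a word repeated many times in a single abstract is still removed if it appears in fewer than three abstracts.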
Experimental Setup: As the training dataset labelled with RDoC constructs is very small, we use an external source of semantic knowledge by incorporating pretrained distributional word embeddings (Zhang et al., 2019) from a FastText model (Bojanowski et al., 2017) trained on the entire corpus of PubMed and MIMIC III Clinical notes (Johnson et al., 2016). Similarly, we also use pretrained word embeddings (Moen and Ananiadou, 2013) from a word2vec model (Mikolov et al., 2013) trained on PubMed and PMC abstracts. We create 3 folds* of train/dev splits for cross-validation.
RDoC Task-1: For DocNADE topic model, we use latent representation of size 50. We use pretrained FastText embeddings of size 300 and pretrained word2vec embeddings of size 200. For SVM, we use Bag-of-words (BoW) representation of abstracts with radial basis kernel function. PubMed abstracts are provided in eight different clusters, one for each RDoC construct, for final test set evaluation.
RDoC Task-2: We use pretrained FastText embeddings to compute the query-aware sentence representation ($\Phi_q(s_j)$), title ($\tilde{t}$) and query ($\tilde{q}$) representations. We also train a Replicated-Siamese-LSTM model with a sentence and query pair, i.e., ($s_j$, q), as input and a label of 1 if $s_j$ is relevant, otherwise 0. We use β ∈ {0, 1}.
* We only report results on fold1 because of its best scores on the partial test dataset.

Results: RDoC Task-1
Table 3 shows the performance of the supervised document ranker models, i.e., a-supDocNADE and SVM, for Task-1. SVM achieves a classification accuracy of 0.947 and a mean average precision (mAP) of 0.992 by ranking the abstracts in their respective clusters using the supervised prediction probabilities (p(q|v)). After that, we use three different relevance scores: (1) reRank(BM25-Extra), (2) reRank(QAR) and (3) reRank(BM25-Extra) + reRank(QAR), for re-ranking the abstracts in their respective clusters. Note that the ranking mAP of the clusters using prediction probabilities is already the best possible, i.e., the intruder abstracts (abstracts with a different label (RDoC construct) than the cluster label) are at the bottom of the ranked clusters. Therefore, re-ranking these clusters would not achieve a better score. Similarly, we train the a-supDocNADE model with three different settings: (1) random weight initialization, (2) incorporating FastText embeddings ($h_e(v)$) and (3) incorporating FastText and word2vec embeddings ($h_e(v)$). By using the pretrained embeddings, the classification accuracy increases from 0.912 to 0.965, which shows that distributional pretrained embeddings carry significant semantic knowledge. Furthermore, re-ranking using reRank(BM25-Extra) and reRank(QAR) further improves the mAP score (0.994 vs 0.983) by shifting the intruder documents to the bottom of each impure cluster.

Analysis: RDoC Task-1
Table 4 shows an impure "Potential Threat Anxiety" cluster of abstracts containing an intruder abstract with the label (RDoC construct) "Loss". When this cluster is ranked on the basis of prediction probabilities (p(q|v)), the "Loss" abstract is ranked third from the bottom, which degrades the mAP score of the retrieval system. But after re-ranking this cluster using the reRank(BM25-Extra) relevance score, the "Loss" abstract is ranked at the bottom, thus maximizing the mAP score. Therefore, re-ranking with BM25-Extra on top of ranking with p(q|v) is, evidently, a robust abstract/document ranking technique.

Table 4: RDoC Task-1 analysis: Ranking of PubMed abstracts within the "Potential Threat Anxiety (PTA)" cluster using supervised prediction probabilities (p(q|v)). It shows that an intruder/noisy abstract (Gold Label: Loss) is assigned a higher probability than the abstracts with the same Gold Label as the cluster. But re-ranking with the BM25-Extra (reRank(BM25-Extra)) relevance score assigns the lowest relevance to the intruder abstract.

Results: RDoC Task-2
Table 5 shows results for Task-2 using three unsupervised and two supervised sentence ranker models.
For the unsupervised model, using the reRank(BM25-Extra) relevance score between a query (q), i.e., the label (RDoC construct) of an abstract, and all the sentences ($s_j$) in the abstract, we get a macro-average accuracy (MAA) of 0.631. However, using the version1 and version2 models (see Fig 2), we achieve MAA scores of 0.701 and 0.526, respectively. The higher accuracy of the version1 model suggests that the title (t) of an abstract also contains essential information regarding the most relevant sentence. For the supervised model, we get MAA scores of 0.772 and 0.737 by setting β = 0 and β = 1, respectively, in the supervised relevance score ($r^{sup}_{f}$) equation in section 3.5.2. Hence, for the supervised sentence ranker model, the title (t) has a negative influence on correctly identifying the relevance ($r^{sup}_{f}$) of different sentences. Furthermore, we combine the knowledge of the unsupervised and supervised sentence rankers by creating multiple ensembles (majority voting) of the predictions from different models. We achieve the highest MAA score of 0.789 by combining the predictions of (1) reRank(BM25-Extra), (2) version1, and (3) $r^{sup}_{f}$ with β = 0. Notice that all the proposed supervised and unsupervised sentence ranking models (except [#3]) outperform traditional ranking models, e.g., reRank(BM25-Extra), in terms of query-document relevance score.

Analysis: RDoC Task-2
Table 7 shows that the most relevant sentence predicted by reRank(BM25-Extra) is actually a non-relevant sentence. But an ensemble of predictions from both unsupervised and supervised ranker models correctly predicts the relevant sentence. This suggests that the complementary knowledge of different models is able to capture the relevance of sentences on different scales, and majority voting among them is, evidently, a robust sentence ranking technique.

... (Table 5) predicts the correct sentence as the most relevant.

Results: RDoC Task 1 & 2 on Test set
Table 6 shows the final evaluation scores of different competing systems for both RDoC Task-1 and Task-2 on the final test set. Observe that our submission (MIC-CIS) achieved a mAP score of 0.86 and an MAA of 0.58 in Task-1 and Task-2, respectively. Notice that we outperform the second-best system by a margin of 20.83% (0.58 vs 0.48) in Task-2.
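The majority-voting ensemble and the macro-average accuracy used in the Task-2 results can be sketched as follows. The sketch assumes that ties are resolved in favour of the earliest model in the list and that MAA is the mean of per-construct accuracies; the abstracts, predictions and gold sets are invented for illustration.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Pick the sentence index predicted by most models (ties: earliest prediction wins)."""
    counts = Counter(predictions)
    return max(counts, key=counts.get)

def macro_average_accuracy(records):
    """Mean over constructs of per-construct accuracy.

    records: iterable of (construct, predicted_index, set_of_gold_indices).
    """
    per_construct = defaultdict(list)
    for construct, pred, gold in records:
        per_construct[construct].append(1.0 if pred in gold else 0.0)
    return sum(sum(v) / len(v) for v in per_construct.values()) / len(per_construct)

# Hypothetical predictions of three rankers for three abstracts.
votes = {"a1": [2, 2, 0], "a2": [1, 3, 3], "a3": [0, 1, 2]}
gold = {
    "a1": ("Loss", {2}),
    "a2": ("Loss", {1}),
    "a3": ("Sleep Wakefulness", {0, 4}),
}
records = [(gold[a][0], majority_vote(v), gold[a][1]) for a, v in votes.items()]
maa = macro_average_accuracy(records)
```

Averaging per construct first keeps a construct with few abstracts from being dominated by a construct with many, which matters when the constructs are unevenly represented.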

Conclusion
In conclusion, both the supervised neural topic model and SVM can effectively rank PubMed abstracts in a given cluster based on the prediction probabilities. However, further re-ranking using BM25-Extra or the query-aware attention-based representation (QAR) has been shown to maximize the mAP score by correctly assigning the lowest relevance score to the intruder abstracts. Also, unsupervised and supervised sentence ranker models using query-title-sentence interactions outperform the traditional BM25-Extra based ranking model by a significant margin. In future, we would like to introduce complementary feature representations via hidden vectors of an LSTM jointly with topic models, and to further investigate the interpretability (Gupta et al., 2015) of the proposed neural ranking models in the sense that one can extract salient patterns determining the relationship between query and text. Another promising direction would be to introduce abstract information, such as part-of-speech and named entity tags (Lample et al., 2016; Gupta et al., 2016), to augment information retrieval (IR).