Lessons from Natural Language Inference in the Clinical Domain

State-of-the-art models using deep neural networks have become very good at learning an accurate mapping from inputs to outputs. However, they still lack generalization capabilities in conditions that differ from the ones encountered during training. This is even more challenging in specialized, knowledge-intensive domains, where training data is limited. To address this gap, we introduce MedNLI, a dataset annotated by doctors for the task of natural language inference (NLI), grounded in the medical history of patients. We present strategies to: 1) leverage transfer learning using datasets from the open domain (e.g. SNLI), and 2) incorporate domain knowledge from external data and lexical sources (e.g. medical terminologies). Our results demonstrate performance gains using both strategies.


Introduction
Natural language inference (NLI) is the task of determining whether a given hypothesis can be inferred from a given premise. This task, formerly known as recognizing textual entailment (RTE) (Dagan et al., 2006), has long been popular among researchers. Moreover, the contribution of datasets from past shared tasks (Dagan et al., 2009) and recent research (Bowman et al., 2015; Williams et al., 2018) has pushed the boundaries of this seemingly simple but challenging problem.
The Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) is a large, high-quality dataset and serves as a benchmark to evaluate NLI systems. However, it is restricted to a single text genre (Flickr image captions) and mostly consists of short and simple sentences. The MultiNLI corpus (Williams et al., 2018), which introduced NLI corpora from multiple genres (e.g. fiction, travel), was a welcome step towards addressing these limitations. MultiNLI offers diversity in linguistic phenomena, which makes it more challenging.
Following these efforts, we explore the problem of NLI in the clinical domain. Language inference in specialized domains such as medicine is extremely complex and remains unexplored by the machine learning community. Moreover, since this domain has a distinct sublanguage (Friedman et al., 2002), clinical text also presents unique challenges (abbreviations, inconsistent punctuation, misspellings, etc.) that differentiate it from open-domain data (Meystre et al., 2008).
In this paper, we address these gaps and make the following contributions: • Introduce MedNLI, a new, publicly available, expert-annotated dataset for NLI in the clinical domain.

The MedNLI dataset
Let us recall the procedure followed for creating the SNLI dataset: annotators were presented with captions of Flickr photos (the premises) without the photos themselves. They were asked to write three sentences (hypotheses): 1) a clearly true description of the photo, 2) a clearly false description, and 3) a description that might be true or false. This procedure produces three training pairs for each premise.

Clinical notes are typically organized into sections such as Chief Complaint, Past Medical History, Physical Exam, Impression, etc. These sections can be easily identified, since the associated section headers are typically formatted as capitalized words followed by a colon. The clinicians on our team suggested Past Medical History to be the most informative section of a clinical note, from which critical inferences can be drawn about the patient.
Therefore, we segmented these notes into sections using a simple rule-based program capturing the formatting of these section headers. We extracted the Past Medical History section and used a sentence splitter trained on biomedical articles (Lingpipe, 2008) to get a pool of candidate premises. We then randomly sampled a subset from these candidates and presented them to the clinicians for annotation. Figure 1 shows the exact prompt shown to the clinicians for the annotation task:

You will be shown a sentence from the Past Medical History section of a de-identified clinical note. Using only this sentence, your knowledge about the field of medicine, and common sense:
• Write one alternate sentence that is definitely a true description of the patient. For example, for the sentence "Patient has type II diabetes" you could write "Patient suffers from a chronic condition".
• Write one alternate sentence that might be a true description of the patient. For example, for the sentence "Patient has type II diabetes" you could write "Patient has hypertension".
• Write one sentence that is definitely a false description of the patient. For example, for the sentence "Patient has type II diabetes" you could write "The patient's insulin levels are normal without any medications."

Figure 1: Annotation prompt shown to clinicians

SNLI annotations are grounded since they are associated with captions of the same image. We seek to achieve the same goal by grounding the annotations in the medical history of the same patient. As discussed earlier, the examples shown in Table 1 depict unique challenges that involve reasoning over domain-specific knowledge. For instance, the first three examples require knowledge of clinical terminology. The fourth example requires awareness of medications, and the last example elicits knowledge about radiology images. We make the MedNLI dataset available 2 through the MIMIC-III derived data repository.
Thus, any individual certified to access MIMIC-III can also access MedNLI.
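The rule-based segmentation described above can be sketched as follows. This is a simplified illustration, not the exact program we used: the header regex and the example note contents are assumptions, and the actual program handled more header variants.

```python
import re

# Simplified sketch of the section segmenter: headers are assumed to be
# lines of capitalized words ending with a colon, e.g. "PAST MEDICAL HISTORY:".
HEADER_RE = re.compile(r"^([A-Z][A-Z /]+):\s*$", re.MULTILINE)

def split_sections(note):
    """Split a clinical note into {header: body} based on header formatting."""
    sections = {}
    matches = list(HEADER_RE.finditer(note))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[m.group(1).strip()] = note[start:end].strip()
    return sections

note = """CHIEF COMPLAINT:
Chest pain.
PAST MEDICAL HISTORY:
Patient has type II diabetes. Hypertension.
"""
pmh = split_sections(note)["PAST MEDICAL HISTORY"]
```

The extracted Past Medical History body is then sentence-split to produce candidate premises.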

Annotation collection
Conclusions in the clinical domain are known to be context-dependent and a source of multiple uncertainties (Han et al., 2011). We had to ensure that such subjective interpretations did not result in annotation conflicts affecting the quality of the dataset. To ensure agreement, we worked with clinicians and generated annotation guidelines for a pilot study. Two board-certified radiologists worked on the annotation task and were each presented with 100 unique premises.
Some premises, often marred by de-identification artifacts, did not contain any information from which useful inferences could be drawn, e.g. This was at the end of [ ** Month (only) 1702 ** ] of this year. Such sentences were deemed invalid for the task and discarded based on clinician judgment. The MIMIC-III dataset contains many de-identification artifacts associated with dates and names (of persons and places), which also makes MedNLI more challenging.
After discarding 16 premises, hypothesis generation resulted in a set of 552 pairs. To calculate agreement, we presented the pairs generated by one clinician to the other clinician, who judged whether the inference was "Definitely true", "Maybe true", or "Definitely false" (Bowman et al., 2015). Comparison of these annotations resulted in a Cohen's kappa of κ = 0.78. While this is substantial, if not perfect, agreement by itself (McHugh, 2012), it is particularly good given the challenging nature of NLI and the complexity of the domain. 3 On reviewing the annotations, we found that labeling differences between "Definitely true" and "Maybe true" were the major source of disagreement. This was primarily because one clinician would consider a scenario that is generally true, while the other would think of assumptions (e.g. the patient might be lying, or the patient might be pregnant) under which it would not be.
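Cohen's kappa compares observed agreement against the agreement expected by chance from each annotator's label distribution. A minimal sketch, using toy label sequences rather than the actual pilot-study annotations:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n             # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example with the three inference labels
ann1 = ["entailment", "neutral", "contradiction", "entailment", "neutral"]
ann2 = ["entailment", "entailment", "contradiction", "entailment", "neutral"]
kappa = cohens_kappa(ann1, ann2)
```

Here p_o = 0.8 and p_e = 0.36, giving κ ≈ 0.69; perfect agreement yields κ = 1.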
A discussion with the clinicians concluded that the annotation guideline was clear and that any person with a formal background in medicine should be able to complete the task successfully. To generate the final dataset, we recruited two additional clinicians, both board certified medical students pursuing their residency programs. Unlike SNLI, we did not collect multiple annotations per sentence pair because of time and funding constraints.

3 Rajpurkar et al. (2017) report F1 < 0.45 for four radiologists when compared among themselves.

Dataset statistics
Together, the four clinicians worked on a total of 4,683 premises over a period of six weeks. The resulting dataset consists of 14,049 unique sentence pairs. Following Bowman et al. (2015), we split the dataset into training, development, and testing subsets and ensured that no premise overlapped between the three subsets. Table 2 presents key statistics of MedNLI.

Models
To establish a baseline performance on MedNLI, we experimented with a feature-based system. To further explore the performance of modern neural network-based systems, we experimented with several models of various degrees of complexity: Bag of Words (BOW), InferSent (Conneau et al., 2017), and ESIM (Chen et al., 2017). Note that our goal here is not to outperform existing models, but to explore the relative gain of the proposed methods and compare them to a baseline. We used the same set of hyperparameters in all models to ensure that any difference in performance is exclusively due to the algorithms.

Feature-based system
We used a gradient boosting classifier incorporating a variety of hand-crafted features. Apart from standard NLP features, we also infused clinical knowledge from the Unified Medical Language System (UMLS) (Bodenreider, 2004). Each terminology in the UMLS can be viewed as a graph where nodes represent medical concepts and edges represent relations between them. These are canonical relationships found in ontologies, such as IS A and SYNONYMY. For instance, diabetes IS A disorder of the endocrine system. The domain-specific features we added to the model represent similarity between UMLS concepts from the premise and the hypothesis, based on how close they appear in the UMLS graph (Pedersen et al., 2007). Following Shivade et al. (2015) and Pedersen et al. (2007), we used the SNOMED-CT terminology in our experiments. The groups below summarize the feature sets used in our model (35 features in total):
1. BLEU score
2. Number of tokens (e.g. min, max, difference)
3. Negations (e.g. keywords such as no, do not)
4. TF-IDF similarity (e.g. cosine, euclidean)
5. Edit distances (e.g. Levenshtein)
6. Embedding similarity (e.g. cosine, euclidean)
7. UMLS similarity features (e.g. shortest path distance between UMLS concepts)

Bag of words
We use a bag-of-words (BOW) model as a simple baseline for the NLI task: the Sum of words model by Bowman et al. (2015) with a small modification. While Bowman et al. (2015) use tanh as the activation function in the model, we use ReLU, since it trained faster and achieved better results (Glorot et al., 2011). In order to represent an input sentence as a single vector, this architecture simply sums up the vectors of individual tokens. The premise and hypothesis vectors are then concatenated and passed through a multi-layer neural network. Recent work shows that even this straightforward approach encodes a non-trivial amount of information about the sentence (Adi et al., 2017).
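The forward pass of the sum-of-words baseline can be sketched as follows. This is a minimal illustration with random, untrained weights; the vocabulary, dimensions, and single hidden layer are illustrative assumptions, not the trained model.

```python
import numpy as np

# Forward pass of the sum-of-words baseline with random (untrained) weights.
rng = np.random.default_rng(0)
d, hidden, n_classes = 50, 32, 3   # entailment / neutral / contradiction

vocab = {w: rng.normal(size=d)
         for w in "patient has type ii diabetes a chronic condition".split()}

def encode(tokens):
    return np.sum([vocab[t] for t in tokens], axis=0)   # sum of word vectors

W1 = rng.normal(scale=0.1, size=(2 * d, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, n_classes)); b2 = np.zeros(n_classes)

def forward(premise, hypothesis):
    x = np.concatenate([encode(premise), encode(hypothesis)])
    h = np.maximum(0.0, x @ W1 + b1)        # ReLU activation
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()                      # softmax over the three labels

probs = forward("patient has type ii diabetes".split(),
                "patient has a chronic condition".split())
```

In the real model the embeddings and weights are learned; the point of the sketch is the sentence representation by summation and the concatenation of premise and hypothesis vectors.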
InferSent
InferSent (Conneau et al., 2017) is a model for sentence representation that demonstrated close to state-of-the-art performance across a number of tasks in NLP (including NLI) and computer vision. The main differences from the BOW model are as follows:
• A bidirectional LSTM encoder of the input sentences and a max-pooling operation over timesteps are used to get a vector for the premise (p) and for the hypothesis (h);
• A more complex scheme of interaction between the vectors p and h is used to get a single vector z that contains all the information needed to produce a decision about the relationship between the input sentences: z = [p; h; |p − h|; p ∗ h].

ESIM
The ESIM model, developed by Chen et al. (2017), is shown in Figure 2. It is a fairly complex model that makes use of two bidirectional LSTM networks. The basic idea of ESIM is as follows:
• The first LSTM produces a sequence of hidden states.
• A pairwise attention matrix e is computed between all tokens in the premise and the hypothesis to produce new sequences of "attended" hidden states, which are then fed into the second LSTM.
• Max and average pooling are performed over the output of the LSTMs.
• The output of the pooling operations is combined in a way similar to the InferSent model.

The three aforementioned models exemplify the architectures that are, perhaps, the most widely used for the NLI task, spanning from simple bag-of-words approaches to complicated models with Bi-LSTMs and inter-sentence attention. We additionally experimented with a plain Bi-LSTM model as well as a GRU (Cho et al., 2014), but since their performance was not remarkable (in the same range as BOW) we do not report it here.
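The InferSent interaction combines the two sentence vectors by concatenating them with their element-wise absolute difference and product, z = [p; h; |p − h|; p ∗ h]. A minimal sketch with toy vectors:

```python
import numpy as np

def interact(p, h):
    """InferSent-style interaction: z = [p; h; |p - h|; p * h]."""
    return np.concatenate([p, h, np.abs(p - h), p * h])

p = np.array([1.0, -2.0, 0.5])   # premise vector (toy values)
h = np.array([0.5, 1.0, 0.5])    # hypothesis vector (toy values)
z = interact(p, h)               # four times the encoder dimension
```

The resulting vector z is what feeds the classifier over the three labels.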

Transfer learning
Given the existence of larger general-domain NLI datasets such as SNLI and MultiNLI, it stands to reason to try to leverage them to improve performance in the clinical domain. Transfer learning has been shown to improve performance on a variety of tasks, such as machine translation for low-resource languages (Zoph et al., 2016), as well as some tasks from the biomedical domain in particular (Sahu and Anand, 2017; Lee et al., 2018). To see if a corresponding boost would be possible for the NLI task, we investigated three common transfer learning techniques on the MedNLI dataset using SNLI and five different genres from MultiNLI.
Direct transfer is the simplest method of transfer learning. After training a model on a large source-domain dataset, the model is directly tested on the target-domain dataset. If the source and target domains are similar to some extent, one can achieve reasonable accuracy by simply applying a model pre-trained on the source domain to the target domain. In our case, the source domain is the general domain of SNLI and the various genres of MultiNLI, and the target domain is clinical.
Sequential transfer is the most widely used technique. After pre-training the model on a large source domain, the model is further fine-tuned using the smaller training data of the target domain.
The assumption is that while the model would learn domain-specific features, it would also learn some domain-independent features that will be useful for the target domain. Furthermore, the fine-tuning process would affect the learned features from the source domain and make them more suitable for the target domain.
Multi-target transfer is a more complex method involving separation of the model into three components (or layers):
• The shared component is trained on both the source and target domains;
• The source-domain component is trained only during the pre-training phase and does not participate in the prediction of the target domain;
• The target-domain component is trained during the fine-tuning stage and produces the predictions together with the shared component.
The motivation for multi-target transfer is that performance should be improved by splitting the deeper layers of the model into domain-specific parts while having a shared block early in the network, where it presumably learns domain-independent features.
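Sequential transfer can be illustrated with a toy logistic-regression stand-in for our models: pre-train on a large synthetic source-domain dataset, then continue training the same weights on a small target-domain set. Everything below (data, labeling rules, model) is illustrative only, not our actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_epochs(w, X, y, lr=0.1, epochs=50):
    """Plain batch gradient descent for logistic regression."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        w = w - lr * X.T @ (p - y) / len(y)  # gradient step on log-loss
    return w

def accuracy(w, X, y):
    return float(((X @ w > 0) == (y > 0.5)).mean())

# Source domain: large synthetic dataset with one labeling rule
Xs = rng.normal(size=(500, 3))
ys = (Xs @ np.array([1.0, -1.0, 0.2]) > 0).astype(float)
# Target domain: small dataset with a related but different rule
Xt = rng.normal(size=(40, 3))
yt = (Xt @ np.array([1.0, -1.2, 0.5]) > 0).astype(float)

w = sgd_epochs(np.zeros(3), Xs, ys)     # pre-training on the source domain
direct = accuracy(w, Xt, yt)            # direct transfer: no target training
w = sgd_epochs(w, Xt, yt, epochs=20)    # sequential transfer: fine-tuning
tuned = accuracy(w, Xt, yt)
```

The pre-trained weights give a reasonable starting point (direct transfer); fine-tuning then adapts them to the target rule.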

Word embeddings
Another way to improve accuracy on the target domain is to use domain-specific word embeddings instead of, or in addition to, open-domain ones. For example, Stanovsky et al. (2017) achieved state-of-the-art results on the task of recognizing Adverse Drug Reactions using graph-based embeddings trained on the "Drugs" and "Diseases" categories from DBpedia (Lehmann et al., 2015), as well as embeddings trained on web pages categorized as "medical domain".
We experimented with the following publicly available general-domain word embeddings:
• GloVe [CC]: GloVe embeddings trained on Common Crawl data;
• fastText [Wiki]: fastText embeddings trained on Wikipedia data.
Furthermore, we trained fastText embeddings on the following domain-specific corpora:
• fastText [BioASQ]: a collection of PubMed abstracts from the BioASQ challenge data (Tsatsaronis et al., 2015). This data includes abstracts of 12,834,585 scientific articles from the biomedical domain;
• fastText [MIMIC-III]: clinical notes from the MIMIC-III database.
Finally, we experimented with initializing word embeddings with pre-trained vectors from the general domain and further training them on a domain-specific corpus:
• GloVe [CC] → fastText [BioASQ]: GloVe embeddings for initialization, and the BioASQ data for fine-tuning;
• GloVe [CC] → fastText [BioASQ] → fastText [MIMIC-III]: GloVe embeddings for initialization, and two consecutive fine-tuning steps using the BioASQ and MIMIC-III data.
Experiments using other approaches to word embeddings, such as word2vec (Mikolov et al., 2013) and CoVe (McCann et al., 2017), did not show any gains. All of the above trained embeddings are available for download.
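Whichever embeddings are chosen, they enter a model through its embedding matrix. A common recipe, sketched below with toy vectors standing in for a downloaded GloVe/fastText file, keeps pre-trained vectors for in-vocabulary words and randomly initializes out-of-vocabulary ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Stand-in for vectors loaded from a pre-trained embedding file
pretrained = {"patient": np.ones(d), "diabetes": np.full(d, 2.0)}

vocab = ["<pad>", "patient", "diabetes", "hyperlipidemia"]  # last word is OOV
emb = np.zeros((len(vocab), d))
oov = []
for i, w in enumerate(vocab):
    if i == 0:
        continue                                # keep <pad> as all zeros
    if w in pretrained:
        emb[i] = pretrained[w]                  # copy the pre-trained vector
    else:
        emb[i] = rng.normal(scale=0.1, size=d)  # random init for OOV words
        oov.append(w)
```

Tracking the OOV list is useful in the clinical domain, where general-domain embedding files miss many medical terms.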

Knowledge integration
Since understanding medical texts requires domain-specific knowledge, we experimented with different ways of incorporating such knowledge into the systems. First, we can modify the input to the system so it carries a portion of clinical information. Second, we can modify the model itself, integrating domain knowledge directly into it.
The UMLS is the largest publicly available and regularly updated database of medical terminologies, concepts, and relationships between them. It can be viewed as a graph where clinical concepts are nodes, connected by edges representing relations such as synonymy, parent-child, etc. Following past work, we restricted our experiments to the SNOMED-CT terminology in the UMLS and experimented with two techniques for incorporating knowledge: retrofitting and attention.
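Treating a terminology as a graph makes path-based similarity straightforward to compute with breadth-first search. The concept names and edges below are illustrative, not actual SNOMED-CT entries:

```python
from collections import deque

# Toy concept graph; edges are treated as undirected for path lengths
edges = [("lung consolidation", "pneumonia"),
         ("pneumonia", "lung disease"),
         ("lung disease", "disorder")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def shortest_path_len(src, dst):
    """BFS shortest path length between two concepts (None if unreachable)."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Path lengths computed this way underlie both the UMLS similarity features of the baseline system and the knowledge-directed attention described below.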

Retrofitting
Retrofitting (Faruqui et al., 2015) modifies pre-trained word embeddings based on an ontology. The basic idea is to bring the representations of concepts that are connected in the ontology closer to one another in vector space. The authors showed that retrofitting using WordNet (Fellbaum, 1998) synsets improves accuracy on several word-level tasks, as well as on sentiment analysis.
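The retrofitting update can be sketched in a few lines: each vector is repeatedly moved toward the average of its ontology neighbors while staying anchored to its original distributional vector. This is a simplified version of the update of Faruqui et al. (2015) with uniform weights; the vectors and the single ontology edge are toy values.

```python
import numpy as np

def retrofit(vectors, neighbors, iterations=10, alpha=1.0):
    """Move each vector toward its ontology neighbors (uniform weights),
    anchored to the original vector by `alpha`."""
    orig = {w: v.copy() for w, v in vectors.items()}
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for w, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            new[w] = (alpha * orig[w] + sum(new[n] for n in nbrs)) / (alpha + len(nbrs))
    return new

# Toy vectors and a toy ontology edge; not actual SNOMED-CT content
vecs = {"diabetes": np.array([1.0, 0.0]),
        "hyperglycemia": np.array([0.0, 1.0]),
        "fracture": np.array([-1.0, -1.0])}
ont = {"diabetes": ["hyperglycemia"], "hyperglycemia": ["diabetes"]}
out = retrofit(vecs, ont)
```

Connected concepts end up closer together, while words absent from the ontology keep their original vectors.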

Knowledge-directed attention
Attention has proved to be a useful technique for many NLP tasks, from machine translation (Bahdanau et al., 2015) to parsing (Vinyals et al., 2015) and NLI itself (Parikh et al., 2016; Rocktäschel et al., 2016). In most models (including the ESIM model that we use in our experiments) attention is learned in an end-to-end fashion. However, if we have knowledge about relationships between concepts, we can leverage it to explicitly tell the model to attend to specific concepts during the processing of the input sentence.
For example, there is an edge in SNOMED-CT from the concept Lung consolidation to Pneumonia. Using this information, during the processing of the sentence pair
• Premise: The patient has pneumonia.
• Hypothesis: The patient has a lung disease.
the model could attend to the token lung while processing pneumonia.
We propose to integrate this knowledge in a way similar to how attention is used in the ESIM model. Specifically, we calculate the attention matrix e ∈ R^{n×m} between all pairs of tokens a_i and b_j in the input sentences, where n is the length of the hypothesis and m is the length of the premise. The value in each cell reflects the length of the shortest path l_ij between the corresponding concepts of the premise and the hypothesis in SNOMED-CT (the shorter the path, the higher the value).
This process can be informally described as follows: each attended token ã_i of the premise is a weighted sum of relevant tokens b_j of the hypothesis, according to the medical ontology, and vice versa. This enables the medical domain knowledge to be integrated directly into the system.
We used the original tokens a_i as well as the attended ã_i inside the model, for both InferSent and ESIM. For InferSent, we simply concatenate them across the time dimension: a = [a_1, a_2, ..., a_n, ã_1, ã_2, ..., ã_n], where n is the length of the input sequence. For the ESIM model, we concatenate a_i and ã_i before passing them to the composition layer (see Figure 2 and Section 3.3 of the original paper (Chen et al., 2017)). This enables the model to learn the relative importance of both the token and the knowledge-directed attention.
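A simplified sketch of the knowledge-directed attention: shortest-path lengths are turned into scores (shorter path, higher score), normalized with a row-wise softmax, and used to form the attended tokens ã. The token vectors and path lengths below are toy values, and the exact score transformation is an assumption of this sketch.

```python
import numpy as np

def knowledge_attention(B, path_len, unreachable=999):
    """B: (m, d) hypothesis token vectors; path_len: (n, m) shortest-path
    lengths between premise token i and hypothesis token j."""
    pl = np.asarray(path_len, dtype=float)
    scores = -pl                                  # shorter path -> higher score
    scores[pl >= unreachable] = -1e9              # mask unrelated token pairs
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ B                            # attended tokens ã

B = np.array([[1.0, 0.0],    # vector for "lung"
              [0.0, 1.0]])   # vector for "disease"
# One premise token ("pneumonia"): path length 1 to "lung", no path to "disease"
attended = knowledge_attention(B, [[1, 999]])
```

The attended vector for "pneumonia" is dominated by the hypothesis token it is ontologically closest to, which is exactly the signal concatenated with the original token representations.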

Results and discussion
We implemented all models using PyTorch 5 and trained them with the Adam optimizer (Kingma and Ba, 2015) until the validation loss showed no improvement for 5 epochs. The epoch with the lowest loss on the validation set was selected for testing. We used the GloVe word embeddings (Pennington et al., 2014) in all experiments, except for subsection 5.3. In all experiments, we report the average result of 6 different runs with the same hyperparameters and different random seeds. Medical concepts in SNOMED-CT were identified in the premise and hypothesis sentences using MetaMap (Aronson and Lang, 2010). The code for all experiments is publicly available. 6

Table 3 shows the baseline results: the performance of each model when trained and tested on the MedNLI dataset. The feature-based system performed the worst. As for the neural network-based systems, the BOW model showed the lowest performance on both the development and test sets. The InferSent model, in contrast, achieved the highest accuracy, despite ESIM outperforming it on SNLI. This could be attributed to the fact that ESIM has twice as many parameters as InferSent, so InferSent overfits less on the smaller MedNLI dataset.
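The early-stopping rule used during training (stop after 5 epochs without validation improvement, keep the best epoch) can be sketched independently of any framework:

```python
def best_epoch(val_losses, patience=5):
    """Return (index, loss) of the best epoch under patience-based stopping."""
    best, best_i, bad = float("inf"), -1, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, bad = loss, i, 0   # new best: reset patience counter
        else:
            bad += 1
            if bad >= patience:
                break                        # no improvement for `patience` epochs
    return best_i, best

# Toy validation-loss curve: improves until epoch 2, then degrades
losses = [0.9, 0.7, 0.65, 0.66, 0.7, 0.71, 0.72, 0.73, 0.74]
epoch, loss = best_epoch(losses)
```

Training halts five epochs after the last improvement, and the checkpoint from the best epoch is the one evaluated on the test set.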

Transfer learning
As expected, Table 4 shows that direct transfer is worse than the baseline but still better than a random baseline of 33.3%. Sequential and multi-target transfer learning, in contrast, yield a considerable gain for all the models. The maximum gain is 2.4%, 0.9%, and 0.3% for the BOW, InferSent, and ESIM models, respectively. Second, note that the biggest source dataset, SNLI, gave the largest boost in only two out of six cases, implying that the size of the source dataset should not be the most important factor in choosing the source domain for transfer learning. The best accuracy for all the models was obtained with the "slate" genre of the MultiNLI corpus with sequential transfer (note, however, that the accuracy of ESIM is actually lower than the baseline accuracy). This is consistent with the observations of Williams et al. (2018). Finally, although some domains are better for particular transfer learning methods with particular models, there is no single combination that works for all cases.

Word embeddings
Table 5 shows that simply using the embeddings trained on the MIMIC-III notes significantly increases the accuracy of all the models. Furthermore, the InferSent model achieves a 3.1% boost with the fastText Wikipedia embeddings fine-tuned on the MIMIC-III data. Note that the results for fastText [Wiki] are worse than the baseline GloVe [CC] for all models, which could be due to the size of the source corpus. However, the results for BioASQ are worse than those for MIMIC-III, despite the significantly larger corpus behind the BioASQ embeddings. Overall, our experiments show the benefit of domain-specific rather than general-domain word embeddings.

Table 5: Absolute gain in accuracy with respect to the baseline (GloVe [CC]) for different word embeddings (BOW / InferSent / ESIM):
[BioASQ]                                                0.2  0.7  1.4
GloVe [CC] → fastText [BioASQ] → fastText [MIMIC-III]   0.9  2.7  1.8
fastText [Wiki] → fastText [MIMIC-III]                  0.1  3.1  1.7

Retrofitting
Table 6 shows that retrofitting only hurts the performance. This is in contrast with the results of the original study, where retrofitting was beneficial not only for word-level tasks but also for tasks such as sentiment analysis (Faruqui et al., 2015).

We hypothesize that although WordNet and UMLS are structurally similar, significant differences in their content (Burgun and Bodenreider, 2001) might be the reason for these results. Retrofitting should be more useful when it is applied to a WordNet-like database, where the main relation is synonymy, and tested on tasks such as word similarity or sentiment analysis. The UMLS semantic network is more complex and contains relations that may not be suitable for retrofitting.

Moreover, retrofitting works only on directly related concepts in a knowledge graph (although it might affect, to some extent, indirectly related concepts by transitivity). However, as Figure 3 shows, few MedNLI training pairs contain concepts that are directly related in the UMLS (namely, pairs with a path of length 1). In contrast, the lengths of the shortest paths in SNLI using WordNet fall close to 1. This suggests that the medical inferences represented in MedNLI require more complex reasoning, typically involving multiple steps. As a sanity check, we applied retrofitting to the GloVe embeddings and tested the InferSent model on the "fiction" genre of the MultiNLI corpus. We used the code and lexicons provided by Faruqui et al. (2015) and confirmed that retrofitting hurts the performance in that case as well.

Knowledge-directed attention
To evaluate the potential of knowledge-directed attention, let us consider its effect on the baseline embeddings (GloVe [CC]) and the fastText embeddings trained on MIMIC-III (fastText [MIMIC-III]) that showed good performance in section 5.3.
Knowledge-directed attention showed a positive effect with the InferSent model on GloVe [CC] (a 0.3 gain), and was not detrimental to ESIM. However, in the case of the fastText [MIMIC-III] embeddings, knowledge-directed attention was beneficial to both models, as shown in Table 7. Note that while retrofitting can use only direct relations during the training process, our method incorporates information about relationships of any length, which is a necessity (as evident from Figure 3).

Figure 3: Distribution of shortest path lengths between concepts of premise-hypothesis pairs (999 denotes unreachable pairs).

Error analysis
The neutral class is the hardest to recognize for all models. The majority of errors stem from confusion between entailment and the neutral class, even with domain-specific embeddings trained on MIMIC-III.

We categorized the errors made by all the models into four broad categories. Table 8 outlines representative errors made by most models in these categories. Numerical reasoning, such as abnormal lab value → disease or abnormal vital sign → finding, is very hard for a model to learn unless it has seen multiple instances of the same numerical value. 7 The first step is to learn which values are abnormal, and the next is to actually perform the inference. Many inferences require world knowledge and could be deemed close to open-domain NLI. While these are very subtle, some are quite domain-specific (e.g. emergency admission → planned visit). Abbreviations are ubiquitous in clinical text. While some are standard and therefore frequent, clinicians tend to use non-standard abbreviations, making inference harder. Finally, many inferences are at the core of reasoning with clinical knowledge. While training on large datasets may be a natural but impractical solution, this remains an open research problem for the community.

Limitations
Unlike SNLI and MultiNLI, each example in the MedNLI dataset was annotated only once. However, this was the best we could do with the limited time and resources available. Very recently, Gururangan et al. (2018) discovered annotation artifacts in NLI datasets. Since we followed the exact same annotation process, we found them to be present in MedNLI as well: the premise-oblivious text classifier that achieves 67.0 F1 on SNLI and 53.9 on MultiNLI achieves 61.9 on MedNLI.

7 The symbol → represents an entailment relationship.

Conclusion
We have presented MedNLI, an expert-annotated, public dataset for natural language inference in the clinical domain. To the best of our knowledge, MedNLI is the first dataset of its kind. Our experiments with several state-of-the-art models provide a strong baseline for this dataset. Our work complements current efforts in NLI by presenting thorough experiments for the specialized and knowledge-intensive field of medicine. We also demonstrated that a simple use of domain-specific word embeddings provides a performance boost. Finally, we presented a method for integrating domain ontologies into the training regime of the models. We hope the released code and dataset with clear benchmarks help advance research in clinical NLP and the NLI task.