UW-BHI at MEDIQA 2019: An Analysis of Representation Methods for Medical Natural Language Inference

Recent advances in distributed language modeling have led to large performance increases on a variety of natural language processing (NLP) tasks. However, it is not well understood how these methods may be augmented by knowledge-based approaches. This paper compares the performance and internal representation of an Enhanced Sequential Inference Model (ESIM) between three experimental conditions based on the representation method: Bidirectional Encoder Representations from Transformers (BERT), Embeddings of Semantic Predications (ESP), or Cui2Vec. The methods were evaluated on the Medical Natural Language Inference (MedNLI) subtask of the MEDIQA 2019 shared task. This task relied heavily on semantic understanding and thus served as a suitable evaluation set for the comparison of these representation methods.


Introduction
This paper describes our approach to the Natural Language Inference (NLI) subtask of the MEDIQA 2019 shared task (Ben Abacha et al., 2019). As it is not yet clear to what extent knowledge-based embeddings provide task-specific improvements over recent advances in contextual embeddings, we analyze the differences in performance between these two methods. It is also not yet clear from the literature to what extent the information stored in contextual embeddings overlaps with that in knowledge-based embeddings, so we provide a preliminary analysis of the attention weights of models that use these two representation methods as input. We compare BERT fine-tuned on MIMIC-III (Johnson et al., 2016) and PubMed to Embeddings of Semantic Predications (ESP) trained on SemMedDB, and to a baseline that uses Cui2Vec embeddings trained on clinical and biomedical text.
Two recent advances in the unsupervised modeling of natural language, Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), have led to drastic improvements across a variety of shared tasks. Both of these methods use transfer learning, whereby a multi-layered language model is first trained on a large unlabeled corpus. The weights of the model are then frozen and used as input to a task-specific model (Devlin et al., 2018; Liu et al., 2019). This method is particularly well-suited for work in the medical domain, where datasets tend to be relatively small due to the high cost of expert annotation.
However, whereas clinical free text is difficult to access and share in bulk due to privacy concerns, the biomedical domain is characterized by a significant number of manually curated, structured knowledge bases. The BioPortal repository currently hosts 773 different biomedical ontologies comprising over 9.4 million classes. SemMedDB is a triple store of over 94 million predications extracted from PubMed by SemRep, a semantic parser for biomedical text (Rindflesch and Fiszman, 2003; Kilicoglu et al., 2012). These available resources make a strong case for the evaluation of knowledge-based methods on the Medical Natural Language Inference (MedNLI) task (Romanov and Shivade, 2018).

Related Work
In this section, we provide a brief overview of methods for distributional and frame-based semantic representation of natural language. For a more detailed synthesis, we refer the reader to the review of Vector Space Models (VSMs) by Turney and Pantel (2010).

Distributional Semantics
The distributed representation of words has a long history in computational linguistics, beginning with latent semantic indexing (LSI) (Deerwester et al., 1990; Hofmann, 1999; Kanerva et al., 2000), maximum entropy methods (Berger et al., 1996), and latent Dirichlet allocation (LDA) (Blei et al., 2003). More recently, neural network methods have been applied to model natural language (Bengio et al., 2003; Weston et al., 2008; Turian et al., 2010). These methods have been broadly applied to improve supervised model performance by learning word-level features from large unlabeled datasets, with more recent work using either Word2Vec (Mikolov et al., 2013; Pavlopoulos et al., 2014) or GloVe (Pennington et al., 2014) embeddings. Recent work has learned a continuous representation of Unified Medical Language System (UMLS) (Aronson, 2006) concepts by applying the Word2Vec method to a large corpus of insurance claims, clinical notes, and biomedical text in which UMLS concepts were replaced with their Concept Unique Identifiers (CUIs) (Beam et al., 2018).
Models that incorporate sub-word information are particularly useful in the medical domain for representing medical terminology and the out-of-vocabulary terms common in clinical notes and consumer health questions (Romanov and Shivade, 2018). Most approaches use a temporal convolution over a sliding window of characters and have been shown to improve performance on a variety of tasks (Kim et al., 2015; Zhang et al., 2015; Seo et al., 2016; Bojanowski et al., 2017).
Embeddings from Language Models (ELMo) computes word representations using a bidirectional language model that consists of a character-level embedding layer followed by a deep bidirectional long short-term memory (LSTM) network. Bidirectional Encoder Representations from Transformers (BERT) replaces the forward and backward LSTMs with a single Transformer that simultaneously computes attention in both the forward and backward directions, and is regarded as the current state-of-the-art method for language representation (Vaswani et al., 2017; Devlin et al., 2018). This method additionally substitutes two new unsupervised training objectives in place of the classical language model: masked language modeling (MLM) and next sentence prediction (NSP). In MLM, a percentage of the words in the corpus are replaced by a [MASK] token, and the task is for the system to predict each masked token. In NSP, the task is, given two sentences s1 and s2 from a document, to determine whether s2 is the sentence that follows s1.
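The masking step of the MLM objective can be illustrated with a short sketch. This is a simplification rather than BERT's actual preprocessing: the real procedure replaces 80% of the selected tokens with [MASK], 10% with random tokens, and leaves 10% unchanged, whereas here every selected token is masked; the helper name mask_tokens is our own.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace a random fraction of tokens with [MASK]; return the
    masked sequence and the gold tokens the model must recover."""
    rng = rng or random.Random(0)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[i] = tok  # prediction target at this position
        else:
            masked.append(tok)
    return masked, labels

tokens = "the patient denies chest pain on exertion".split()
masked, labels = mask_tokens(tokens, rng=random.Random(1))
```

During pretraining, the model receives `masked` as input and is trained to predict the entries of `labels` at the corresponding positions.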
While ELMo has been shown to outperform GloVe and Word2Vec on consumer health question answering (Kearns and Thomas, 2018), BERT has outperformed ELMo on various clinical tasks (Si et al., 2019) and has been fine-tuned and applied to the biomedical literature and clinical notes (Alsentzer et al., 2019; Huang et al., 2019; Si et al., 2019; Lee et al., 2019). BERT supports the transfer of a pretrained general-purpose language model to a task-specific application through fine-tuning. The next sentence prediction objective in the pretraining process suggests this method is inherently suitable for NLI. In addition, BERT uses character-based and WordPiece tokenization (Wu et al., 2016) to learn the morphological patterns among inflections. Subword segmentation, such as ##nea in the word dyspnea, enables it to capture the context of an out-of-vocabulary word, making it a particularly suitable representation for clinical text.
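At inference time, WordPiece segments a word by greedy longest-match-first lookup against its vocabulary. The sketch below illustrates that lookup only; the `wordpiece` function and the toy vocabulary are our own constructions (real vocabularies are learned from the corpus, and the production tokenizer has additional details).

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first segmentation into subword units,
    in the style of WordPiece (Wu et al., 2016)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation-of-word marker
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no valid segmentation exists
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary for illustration; real vocabularies hold ~30k entries.
vocab = {"dys", "##p", "##nea", "pain"}
print(wordpiece("dyspnea", vocab))  # ['dys', '##p', '##nea']
```

Even if "dyspnea" never appeared in the pretraining corpus, its pieces did, so the model can still compose a representation for it.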

Frame-based Semantics
FrameNet is a database of sentence-level frame-based semantics that proposes that human understanding of natural language is the result of frames in which certain roles are expected to be filled (Baker et al., 1998). For example, the predicate "replace" has at least two such roles, the thing being replaced and the new object. A sentence such as "The table was replaced." raises the question "With what was the table replaced?". Frame-based semantics is a popular approach for semantic role labeling (SRL) (Swayamdipta et al., 2018), question answering (QA) (Shen and Lapata, 2007; Roberts and Demner-Fushman, 2016; He, 2015), and dialog systems (Larsson and Traum, 2000; Gupta et al., 2018).
Vector symbolic architectures (VSAs) are an approach that seeks to represent semantic predications by applying binding operators that define a directional transformation between entities (Levy and Gayler, 2008). Early approaches included binary spatter code (BSC) for encoding structured knowledge (Kanerva, 1996, 1997) and Holographic Embeddings, which used circular convolution as a binding operator to improve the scalability of this approach to large knowledge graphs (Plate, 1995). The resurgence of neural network methods has focused attention on extending these methods, as there is growing interest in leveraging continuous representations of structured knowledge to improve performance on downstream applications.
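As an illustration of a binding operator, the following pure-Python sketch uses Plate's circular convolution to bind two random vectors and then approximately unbind one of them. This is a generic holographic-reduced-representation example, not the ESP/EARP implementation.

```python
import random

def circular_convolution(a, b):
    """Plate's binding operator: c[k] = sum_i a[i] * b[(k - i) mod n]."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]

def involution(a):
    """Approximate inverse element under circular convolution."""
    return [a[0]] + a[:0:-1]

def cosine(u, v):
    num = sum(x * y for x, y in zip(u, v))
    den = (sum(x * x for x in u) ** 0.5) * (sum(y * y for y in v) ** 0.5)
    return num / den

rng = random.Random(0)
n = 512
# Random normal vectors with variance 1/n, so their norms are close to 1.
subject = [rng.gauss(0, n ** -0.5) for _ in range(n)]
relation = [rng.gauss(0, n ** -0.5) for _ in range(n)]

bound = circular_convolution(relation, subject)                # bind
recovered = circular_convolution(involution(relation), bound)  # unbind
# `recovered` is a noisy copy of `subject`: far closer to it than to
# an unrelated vector such as `relation`.
```

The recovered vector is only approximately equal to the original, which is why VSA systems typically clean it up by nearest-neighbor search over the known entity vectors.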
Knowledge graph embeddings (KGEs) are one approach that represents entities and their relationships as continuous vectors learned using TransE/R (Bordes and Weston, 2009), RESCAL (Nickel et al., 2011), or Holographic Embeddings (Plate, 1995; Nickel et al., 2015). Stanovsky et al. (2017) showed that RESCAL embeddings pretrained on DbPedia improved performance on the task of adverse drug reaction labeling over a clinical Word2Vec model. RESCAL uses tensor products, whose application to representation learning dates back to work by Smolensky (1986, 1990) that used the inner product and has recently been applied to the bAbI dataset (Smolensky et al., 2016; Weston et al., 2016). Embeddings of Semantic Predications (ESP) are a neural-probabilistic representational approach that uses VSA binding operations to encode structured relationships (Cohen and Widdows, 2017). The Embeddings Augmented by Random Permutations (EARP) used in this paper are a modified ESP approach that applies random permutations to the entity vectors during training; EARP was shown to improve performance on the Bigger Analogy Test Set by up to 8% against a fastText baseline (Cohen and Widdows, 2018).

Methods
In this section, we provide details on the three representation methods used in this study, i.e. BERT, Cui2Vec, and ESP. We continue with a description of the inference model used in each experiment to predict the label for a given hypothesis/premise pair.

Representation Layer
There are many publicly available biomedical BERT embeddings which were initialized from the original BERT Base models. BioBERT was trained on PubMed Abstracts and PubMed Central Full-text articles (Lee et al., 2019). In this study, we applied ClinicalBERT that was initialized from BioBERT and subsequently trained on all MIMIC-III notes (Alsentzer et al., 2019).
For Cui2Vec, we used the publicly available implementation from Beam et al. (2018) that was trained on a corpus consisting of 20 million clinical notes from a research hospital, 1.7 million full-text articles from PubMed, and an insurance claims database with 60 million members.
For ESP, we used a 500-dimensional model trained over SemMedDB using the recent Embeddings Augmented by Random Permutations (EARP) approach, with a 10^-7 sampling threshold for predications and a 10^-5 sampling threshold for concepts, excluding concepts with a frequency greater than 10^6 (Cohen and Widdows, 2018).
To apply Cui2Vec and ESP, we first processed the MedNLI dataset (Romanov and Shivade, 2018) with MetaMap to normalize entities to their concept unique identifiers (CUIs) in the UMLS (Aronson, 2006). MetaMap takes text as input and applies biomedical and clinical entity recognition (ER), followed by word sense disambiguation (WSD) that links entities to their normalized CUIs. Entities that mapped to a UMLS CUI were assigned a representation in Cui2Vec and ESP. Other tokens were assigned vector representations using fastText embeddings trained on MIMIC-III data (Bojanowski et al., 2017; Romanov and Shivade, 2018).
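The fallback scheme can be sketched as follows. The lookup tables, vector values, dimensionality, and the CUI shown are hypothetical stand-ins for the pretrained Cui2Vec/ESP and fastText models and for MetaMap's actual output.

```python
# Hypothetical 3-dimensional lookup tables standing in for the models.
cui_vectors = {"C0013404": [0.2, 0.1, 0.7]}    # a concept embedding
fasttext_vectors = {"severe": [0.5, 0.3, 0.1]}  # a word embedding
UNK = [0.0, 0.0, 0.0]

def embed(tokens, cui_for_token):
    """Assign each token its concept embedding when MetaMap linked it
    to a CUI; otherwise fall back to fastText (else an UNK vector)."""
    vectors = []
    for tok in tokens:
        cui = cui_for_token.get(tok)
        if cui in cui_vectors:
            vectors.append(cui_vectors[cui])
        else:
            vectors.append(fasttext_vectors.get(tok, UNK))
    return vectors

# Suppose MetaMap linked "dyspnea" to the CUI above (hypothetical output).
vecs = embed(["severe", "dyspnea"], {"dyspnea": "C0013404"})
```

This keeps every token represented: clinical concepts draw on the knowledge-based space, while remaining tokens draw on sub-word distributional information.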

Inference Model
For all experiments, we used the AllenNLP implementation of the Enhanced Sequential Inference Model (ESIM) architecture (Chen et al., 2017). This model encodes the premise and hypothesis using a Bidirectional LSTM (BiLSTM), where at each time step the hidden states of the LSTMs are concatenated to represent its context. Local inference between the two sentences is then achieved by aligning the relevant information between words in the premise and hypothesis. This alignment, based on soft attention, is implemented as the inner product between the encoded premise and encoded hypothesis to produce an attention matrix (Figures 1 and 2). These attention values are used to create a weighted representation of both sentences. An enhanced representation of the premise is created by concatenating the encoded premise, the weighted hypothesis, the encoded premise minus the weighted hypothesis, and the element-wise multiplication of the encoded premise and the weighted hypothesis. The enhanced representation of the hypothesis is created similarly. This operation is expected to enhance the local inference information between elements in each sentence. This representation is then projected into the original dimension and fed into a second BiLSTM inference layer in order to capture inference composition sequentially. The resulting vector is then summarized by max and average pooling. These two pooled representations are concatenated and passed through a multi-layered perceptron followed by a sigmoid function to predict probabilities for each of the sentence labels, i.e. entailment, contradiction, and neutral.

Figure 1: An example of a correct BERT prediction demonstrating its general domain coverage and contextual embedding. Premise: "He will be spending time with family and friends who are coming in from around the country to see him." Hypothesis: "his family and friends do not yet have plans to visit."
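The soft-attention alignment and enhancement steps described above can be sketched in a few lines. This toy example uses tiny hand-written "encoded" vectors in place of the BiLSTM outputs and shows only the premise side; ESIM enhances the hypothesis symmetrically.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def enhance(a_seq, b_seq):
    """ESIM local inference: soft-align each encoded token in a_seq
    with b_seq, then build [a; a~; a - a~; a * a~] per position."""
    # Attention matrix: inner products between encoded tokens.
    e = [[dot(a, b) for b in b_seq] for a in a_seq]
    enhanced = []
    for i, a in enumerate(a_seq):
        w = softmax(e[i])  # attention weights over b_seq
        a_tilde = [sum(w[j] * b_seq[j][d] for j in range(len(b_seq)))
                   for d in range(len(a))]  # weighted counterpart
        enhanced.append(a + a_tilde
                        + [x - y for x, y in zip(a, a_tilde)]
                        + [x * y for x, y in zip(a, a_tilde)])
    return enhanced

premise = [[1.0, 0.0], [0.0, 1.0]]  # toy encoded premise: 2 tokens, dim 2
hypothesis = [[1.0, 0.0]]           # toy encoded hypothesis: 1 token
out = enhance(premise, hypothesis)
```

Each output row has four times the encoding dimension; in the full model these enhanced sequences feed the second BiLSTM and the pooling layers.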

Results
The ESIM model achieved an accuracy of 81.2%, 65.2%, and 77.8% for the MedNLI task using BERT, Cui2Vec, and ESP, respectively. Table 1 shows the number of correct predictions by each embedding type. The BERT model has the highest accuracy on predicting entailment and contradiction labels, while the ESP model has the highest accuracy on predicting neutral labels. However, the difference is only significant in the case of entailment.
To evaluate the ability to set a predictive threshold for use in clinical applications, we sought to measure the certainty with which the model made its predictions. To achieve this goal, we examined the predicted probabilities of each embedding type on its respective subset of correct predictions. We found the predicted probability of ESP to be much higher than the others, as depicted in Figure 3. ESP's minimum predicted probability as well as the variance of its distribution is the lowest among all embedding types.

Focus
A total of eleven, non-mutually exclusive hypothesis focus classes were arrived at by consensus of the three authors after an initial blinded round of annotation by two annotators. The remaining data was annotated by one of these annotators. We provide definitions of the classes and their overall counts in Table 2. The classes are: State, Anatomy, Disease, Process, Temporal, Medication, Clinical Finding, Location, Lab/Imaging, Procedure, and Examination.
We then performed Pearson's chi-squared test with Yates' continuity correction on 2x2 contingency tables relating each embedding's sentence-pair predictions (correct or incorrect) to each hypothesis focus (present or absent), using the chisq.test function in R, with results reported in Table 3.
The only significant relationships between hypothesis focus and embedding accuracy were found between BERT and Disease (p-value = 0.01) and Cui2Vec and Disease (p-value = 0.01) through Pearson's Chi-squared test with Yates' continuity correction. Both embeddings achieved higher accuracy on sentence pairs with a hypothesis focus labeled Disease (BERT=90.4%; Cui2Vec=76.6%) than without (BERT=78.5%; Cui2Vec=61.7%).
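For a 2x2 table [[a, b], [c, d]], the Yates-corrected statistic has a closed form that can be computed directly. The counts below are hypothetical illustrations only, not the paper's actual contingency tables.

```python
def chi2_yates(a, b, c, d):
    """Pearson's chi-squared statistic with Yates' continuity
    correction for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Hypothetical counts: correct/incorrect predictions with the focus
# present (85/9) versus absent (230/80).
stat = chi2_yates(85, 9, 230, 80)
```

With one degree of freedom, values above 3.84 are significant at p < 0.05; in practice, R's chisq.test (as used here) or scipy.stats.chi2_contingency reports the p-value directly.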

Tense
Each hypothesis was annotated for tense into one of three mutually exclusive classes, including Past and Current.

Table 2 (excerpt): hypothesis focus classes, definitions, and counts (%):
Temporal: reference to time besides tense or history (e.g. "...initial blood pressure was low"), 51 (12.6)
Medication: any reference to medication, including administration and patient habits (e.g. "antibiotics", "fluids", "oxygen", "IV"), 32 (7.9)
Clinical Finding: results of an exam, lab/image, procedure, or a diagnosis, 28 (6.9)
Location: specific physical location specified (e.g. "...discharged home"), 28 (6.9)
Lab/Imaging: laboratory tests or imaging (e.g. histology, CBC, CT scan), 24 (5.9)
Procedure: physical procedure besides Lab/Image or exam (e.g. "intubation", "surgery", "biopsies"), 14 (3.5)
Examination: physical examination or explicit use of the word exam(ination), 3 (0.7)

The coverage of entities and their associations was characteristic of BERT predictions (Figure 1). BERT associated "spending time" with "plans", in addition to the lexical overlap of the word "family", which is attended to by each experimental condition in this example. All three embeddings identified the contradictory significance of the word "not" in the hypothesis. However, BERT associated it with both spans "will be" and "are coming" in the premise, which led to the correct prediction. Cui2Vec over-attended to the lexical matches of the words "and", "to", and "C0079382", which led to the wrong prediction.
The ESP model recognized hierarchical relationships between entities, e.g. "Advil" and "NSAIDs" (Figure 2). In this example, the ESP approach attends to the daily use of "ASA" (acetyl-salicylic acid), i.e. aspirin, and the patient denying the use of "other NSAIDs". This pattern was recognized multiple times in our analysis and provides a strong example of how continuous representations of biomedical ontologies may be used to augment contextual representations.

Limitations
The results presented in this paper compare a single model for each representation method, fine-tuned to the development set. However, it is well known that the weights of the same model may vary slightly between training runs. Therefore, a more comprehensive approach would be to present the average attention weights across multiple training runs and to examine the weights at each attention layer of the models, which we leave for future work.

Conclusion
We have presented our analysis of representation methods on the MedNLI task as evaluated during the MEDIQA 2019 shared task. We found that BERT embeddings fine-tuned on PubMed and MIMIC-III outperformed both the Cui2Vec and ESP methods. However, we found that ESP had the lowest variance and highest predictive certainty, which may be useful in determining a minimum threshold for clinical decision support systems. Disease was the only hypothesis focus to show a significant positive relationship with embedding prediction accuracy. This association was present for BERT and Cui2Vec embeddings, but not for ESP. Overall, contradiction was the easiest label to predict for all three embeddings, which may be the result of an annotation artifact whereby contradiction pairs had higher lexical overlap, often differentiated by explicit negation. However, overfitting on negation can lead to lower accuracy on the other labels. Further, our preliminary results indicate that recognition of hierarchical relationships is characteristic of ESP, suggesting that it can be used to augment contextual embeddings, which, in turn, would contribute lexical coverage including sub-word information. We propose combining these methods in future work.