Saama Research at MEDIQA 2019: Pre-trained BioBERT with Attention Visualisation for Medical Natural Language Inference

Natural Language Inference (NLI) is the task of identifying the relation between two sentences as entailment, contradiction or neutrality. MedNLI is a biomedical flavour of NLI for the clinical domain. This paper explores the use of Bidirectional Encoder Representations from Transformers (BERT) for solving MedNLI. The proposed model, BERT pre-trained on PMC and PubMed and fine-tuned on MIMIC III v1.4, achieves state-of-the-art results on MedNLI (83.45%) and an accuracy of 78.5% in the MEDIQA challenge. The authors present an analysis of the attention patterns that emerged as a result of training BERT on MedNLI using a visualization tool, bertviz.


Introduction
Natural Language Inference (NLI) is a fundamental task in Natural Language Processing in which the objective is to determine, given a premise, whether a hypothesis is true (entailment), false (contradiction) or undetermined (neutral). Entailment, Contradiction and Neutral (semantic independence) are semantic concepts that represent the relationship between sentences. The ability to infer these relations between sentences or pieces of text is crucial in tasks like Information Retrieval, Semantic Parsing, Commonsense Reasoning, etc. NLI, like most NLP tasks, is challenging due to the ambiguous nature of natural language: a particular meaning can be expressed in multiple linguistic forms. This calls for methods that can capture meaningful semantic concepts from text.
The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) is a collection of sentence pairs labeled for entailment, contradiction, and semantic independence. It contains approximately 550,000 labeled hypothesis/premise pairs. The Multi-Genre Natural Language Inference (Multi-NLI) corpus (Williams et al., 2017) contains 433,000 samples, covering a wide range of genres (10) of written and spoken English. Multi-NLI, in its complexity, is closer to natural language than SNLI.
[...] Annotations were done by two board-certified radiologists and two additional clinicians pursuing their residency programs.
*Equal Contribution: Kamal had sole access to MIMIC and MEDIQA data, focussed on the algorithm development and implementation. Suriyadeepan and Archana focussed on the attention visualisation and writing. Soham and Malaikannan focussed on reviewing.

Dataset Statistics
The MedNLI dataset was annotated by 4 clinicians working on a total of 4,683 premises over a period of 6 weeks, resulting in 14,049 unique sentence pairs. The dataset was then split into training, development, and test sets. The class distribution is even across all classes in the training, development and test sets (Table 2).

MEDIQA Shared Task
MEDIQA (Ben Abacha et al., 2019) is a shared task which is part of BioNLP 2019. It was created by using an annotation technique similar to MedNLI, and it serves as an additional test set for the MedNLI data. It contains 405 premise-hypothesis pairs. These pairs were randomly sampled from records, segmented from the Past Medical History section with a simple rule-based method. The MedNLI train set is used to train the model, and the hyperparameters are tuned based on MedNLI development and test set accuracy. The MedNLI and MEDIQA test sets follow the same label mapping.

Description
Bidirectional Encoder Representations from Transformers (Vaswani et al., 2017) is a language representation model which performs well on a wide range of NLP tasks such as question answering and language inference. The architecture of BERT leverages pre-trained deep bidirectional representations. Existing pre-trained language representations include the feature-based approach (ELMo) (Peters et al., 2018) and the fine-tuning approach (OpenAI GPT) (Radford et al., 2018). However, these models are severely restricted due to their unidirectional nature. BERT uses masked language models to enable pre-trained deep bidirectional representations.
The BERT model the authors experimented with is BERT BASE. The model is composed of 12 transformer blocks with a hidden size of 768 and 12 attention heads. The feed-forward/filter size is 4 times the hidden size. For fine-tuning on MedNLI, a classification layer is added and all the parameters of the final model are fine-tuned jointly as per the original paper (Devlin et al., 2018).
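The sizes quoted above determine a few derived quantities that are easy to check by hand. The following is a minimal sketch: the class name `BertBaseConfig` and its fields are illustrative names rather than part of any library, and the classification layer is assumed to be a single linear layer with one output per MedNLI label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertBaseConfig:
    """Hyperparameters quoted in the text for BERT BASE (names are illustrative)."""
    num_layers: int = 12      # transformer blocks
    hidden_size: int = 768    # width of each token representation
    num_heads: int = 12       # attention heads per block
    ffn_multiplier: int = 4   # feed-forward size relative to the hidden size
    num_labels: int = 3       # entailment, contradiction, neutral

    @property
    def ffn_size(self) -> int:
        """Feed-forward/filter size: 4 times the hidden size."""
        return self.ffn_multiplier * self.hidden_size

    @property
    def head_size(self) -> int:
        """Width of one attention head when the hidden size is split evenly."""
        return self.hidden_size // self.num_heads

    @property
    def classifier_params(self) -> int:
        """Parameters of an assumed single linear classification layer:
        one weight per (label, hidden unit) pair plus one bias per label."""
        return self.num_labels * self.hidden_size + self.num_labels


config = BertBaseConfig()
print(config.ffn_size, config.head_size, config.classifier_params)
```

Under these assumptions the added classification layer is very small compared with the encoder it sits on, so nearly all of the fine-tuned parameters come from the pre-trained model.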

BERT on MedNLI
BERT displays a clear supremacy over contemporary architectures (Radford et al., 2018; Peters et al., 2018) on several NLP tasks. BERT's use of bidirectional encoders is a characteristic feature that separates it from other architectures.
Natural language inference requires learning the relationship between two sentences, which is not supported by naive language models. Thus BERT, which is pre-trained on binarized next sentence prediction, is vital for NLI.
MedNLI is built based on the GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2018). The goal, as before, is to predict how the first sentence is related to the second: entailment, contradiction or neutral. Because MedNLI is a sequence-level task, only a minimal number of new parameters needs to be learned: BERT is used with a single additional output layer.
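To make the sentence-pair setup concrete, the sketch below packs a premise and a hypothesis into one BERT-style input sequence with segment ids. The function name `pack_pair` is hypothetical, whitespace splitting stands in for the real tokenizer, and the trimming rule (shorten the longer sentence first) is an assumption rather than a description of the authors' code.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def pack_pair(premise: List[str], hypothesis: List[str],
              max_len: int = 128) -> Tuple[List[str], List[int]]:
    """Join two token lists into one sequence-level input with segment ids."""
    premise, hypothesis = list(premise), list(hypothesis)  # do not mutate inputs
    # Reserve three positions for [CLS] and the two [SEP] markers, then trim
    # the longer sentence one token at a time until the pair fits.
    while len(premise) + len(hypothesis) > max_len - 3:
        longer = premise if len(premise) > len(hypothesis) else hypothesis
        longer.pop()
    tokens = [CLS] + premise + [SEP] + hypothesis + [SEP]
    # Segment 0 covers [CLS], the premise and the first [SEP];
    # segment 1 covers the hypothesis and the final [SEP].
    segment_ids = [0] * (len(premise) + 2) + [1] * (len(hypothesis) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(
    "reports lack of appetite but no n/v .".split(),
    "the patient denies nausea and vomiting".split(),
)
print(tokens)
print(segment_ids)
```

The representation at the `[CLS]` position is what the additional output layer would consume to predict one of the three labels.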

Experiments
All the experiments in this paper are done with BERT pre-trained on unlabelled biomedical data, BioBERT (Lee et al., 2019). Three pre-trained models are available: one trained only on PubMed articles, one trained only on PMC articles, and one trained on both PubMed and PMC articles.
BioBERT. Fine-tuning on BioBERT was done using TensorFlow with three GeForce GTX 1080Ti GPUs for 2 weeks. On MIMIC III v1.4, the model is trained with a maximum sequence length of 128, a batch size of 32 and a learning rate of 2e-5 for 200,000 steps. The sequence length is limited so that the model fits into GPU memory. The pre-training data from MIMIC III v1.4 is prepared using scripts from the original BERT GitHub repository (Devlin et al., 2018) with the default parameters. Further fine-tuning on the MedNLI task is done with one GeForce GTX 1080Ti GPU with 11 GB of RAM. One epoch on MedNLI takes around 3 minutes on a single GPU. All of the hyperparameter search is done with a fixed random seed of 42. Each iteration took an average of 3-4 minutes. A variant of the Adam optimizer, proposed in the BERT paper (Devlin et al., 2018), which selectively avoids applying weight decay to normalization layers, is used. Only the learning rate is tuned, while the other hyperparameters β1, β2 and L2 weight decay are fixed at 0.9, 0.999 and 0.01 respectively.
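The optimizer described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' or the BERT repository's implementation: the function names are hypothetical, parameters are assumed to be identified by names containing "LayerNorm", "layer_norm" or "bias" when they should skip decay, the epsilon value is an assumption, and Adam's bias-correction terms are omitted for brevity.

```python
import math
from typing import Dict, List, Tuple

# Assumed naming scheme: parameters whose names contain one of these
# markers are excluded from weight decay.
NO_DECAY_MARKERS = ("LayerNorm", "layer_norm", "bias")


def uses_weight_decay(param_name: str) -> bool:
    """Return True unless the parameter belongs to an excluded group."""
    return not any(marker in param_name for marker in NO_DECAY_MARKERS)


def adam_step(params: Dict[str, List[float]],
              grads: Dict[str, List[float]],
              state: Dict[str, Tuple[List[float], List[float]]],
              lr: float,
              beta1: float = 0.9,
              beta2: float = 0.999,
              weight_decay: float = 0.01,
              eps: float = 1e-6) -> None:
    """Update params in place with Adam moments and selective weight decay."""
    for name, values in params.items():
        m, v = state.setdefault(
            name, ([0.0] * len(values), [0.0] * len(values)))
        for i, g in enumerate(grads[name]):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g        # first moment
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g    # second moment
            update = m[i] / (math.sqrt(v[i]) + eps)
            if uses_weight_decay(name):
                # Decay is added to the update itself, not to the gradient.
                update += weight_decay * values[i]
            values[i] -= lr * update


params = {"dense/kernel": [1.0], "LayerNorm/gamma": [1.0]}
grads = {"dense/kernel": [0.0], "LayerNorm/gamma": [0.0]}
state: Dict[str, Tuple[List[float], List[float]]] = {}
adam_step(params, grads, state, lr=0.1)
print(params)
```

With zero gradients only the decay term acts, so the example isolates the selective behaviour: the dense kernel shrinks slightly while the normalization parameter is left untouched.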

Results
The results of the experiments with BERT pre-trained on PubMed and PMC and fine-tuned on MIMIC III v1.4 are tabulated in Table 4.
[...] known for their opaque nature, these intuitions offer a peek behind the curtains. bertviz was subsequently used to visualize BioBERT-MIMIC III v1.4 before and after training on the MedNLI task. In this section, some of the interesting patterns observed by comparing and contrasting attention patterns before and after fine-tuning on the MedNLI task are presented. The distinct patterns that emerge from fine-tuning are heavily dependent on the nature of the task (NLI).
2. Word Similarity. It can be observed that words similar to the source word get more attention. Notice the words negative and no, expressing similar sentiments (negative), connected via attention flow in figure 4. Word-level similarity, although not always, is a good indicator of entailment. Upon encountering sentences with similar words, it is reasonable for a network to be biased towards entailment.
3. Tokenized Words. In BERT, OOV (Out of Vocabulary) words are identified and split into segments. This way, the morphological information is maintained, which comes in handy in tasks such as textual entailment where word-level similarity is an important aspect to notice. Before fine-tuning, the OOV words split into multiple tokens receive weak attention from source tokens, as observed in figure 5. After fine-tuning on MedNLI, a strong attention flow between the tokenized words across the two sentences can be seen. As mentioned above, these connections, as seen in figure 6, help in identifying word-level similarity between sentences.
The authors have presented an error analysis study based on attention patterns in Appendix A. Based on the intuitions gained from error analysis, the authors propose a list of changes that could improve the performance of the model. The limitations of the proposed approach and a list of possible improvements are presented in Appendix B.
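The splitting of out-of-vocabulary words described in item 3 can be illustrated with a greedy longest-match-first procedure in the style of WordPiece. This is a minimal sketch: the function name `wordpiece` and the toy vocabulary are hypothetical, and the pieces produced by the real 30,000-entry BERT vocabulary may differ.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into vocabulary pieces, longest match first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for illustration only.
toy_vocab = {"nausea", "vom", "##iting", "den", "##ies"}
print(wordpiece("vomiting", toy_vocab))
print(wordpiece("nausea", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because the pieces of a split word keep a shared surface form across sentences, they give the attention heads something to align on, which is consistent with the pattern reported in figure 6.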

Conclusion
In this paper, a variant of BERT, fine-tuned on MIMIC III v1.4, is proposed.

A Error Analysis
The authors have studied the misclassified examples in MedNLI (test set) and MEDIQA (task set). 70% of the misclassified examples are falsely labelled as Contradiction. The confusion matrices, consisting of the counts of misclassified examples for both sets, are presented in figures 9 and 10. The common pattern in the misclassified examples is the model's lack of understanding of certain tokens that are crucial for relating the premise to the hypothesis. Consider the example presented below.
Premise: "Reports lack of appetite but no n/v."
Hypothesis: "the patient denies nausea and vomiting"
The abbreviation n/v in the premise expands to nausea and vomiting. The hypothesis contains the expanded form nausea and vomiting. It is clear from observing the attention pattern (figure 7) that the model does not identify n/v and nausea and vomiting as the same concept. When the abbreviation in the premise was expanded to nausea and vomiting, the model identified them as the same concept, which is clearly evident from figure 8. Based [...]
One of the limitations of this work is the lack of text preprocessing. The only preprocessing step followed by the authors is tokenization. In domain-specific tasks like Medical NLI, it would be beneficial to identify and normalize medical concepts which could be represented in more than one form. The other significant limitation is that the sentences are tokenized based on a 30,000 size vocabulary derived from the Wikipedia corpus. Although the fine-tuning is done on PubMed, commonly occurring medical terms are identified as unknown words and split into tokens.
The authors suggest a preprocessing step that identifies and normalizes medical concepts. The vocabulary could be built from the PubMed corpus, which would ensure that the most common medical terms are part of the vocabulary. Along those lines, the authors suggest the use of entity embeddings to learn medical concepts and make use of the information contained in them.
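One simple form such a normalization step could take is a dictionary-based expansion of abbreviations applied before tokenization. The sketch below is only an illustration of the idea: the function name `normalize` and the one-entry table are hypothetical, with the single entry taken from the n/v example in Appendix A. A real system would need a curated clinical abbreviation resource and some handling of ambiguous abbreviations.

```python
import re
from typing import Dict

# Hypothetical lookup table; keys must be lower-case.
ABBREVIATIONS: Dict[str, str] = {"n/v": "nausea and vomiting"}


def normalize(text: str, table: Dict[str, str] = ABBREVIATIONS) -> str:
    """Replace known abbreviations with their expanded forms."""
    if not table:
        return text
    # Try longer keys first so that overlapping abbreviations resolve to the
    # most specific entry; the lookarounds stop matches inside larger tokens.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w/])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w/])",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: table[m.group(1).lower()], text)


print(normalize("Reports lack of appetite but no n/v."))
```

After this step the premise and the hypothesis from the error-analysis example share the words nausea and vomiting, which is the situation in which the model was reported to relate the two concepts (figure 8).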