Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset

Machine reading comprehension has made great progress in recent years owing to large-scale annotated datasets. In the clinical domain, however, creating such datasets is quite difficult due to the domain expertise required for annotation. Recently, Pampari et al. (EMNLP'18) tackled this issue by using expert-annotated question templates and existing i2b2 annotations to create emrQA, the first large-scale dataset for question answering (QA) based on clinical notes. In this paper, we provide an in-depth analysis of this dataset and the clinical reading comprehension (CliniRC) task. From our qualitative analysis, we find that (i) emrQA answers are often incomplete, and (ii) emrQA questions are often answerable without using domain knowledge. From our quantitative experiments, surprising results include that (iii) using a small sampled subset (5%-20%), we can obtain roughly equal performance compared to the model trained on the entire dataset, (iv) this performance is close to human expert's performance, and (v) BERT models do not beat the best performing base model. Following our analysis of the emrQA, we further explore two desired aspects of CliniRC systems: the ability to utilize clinical domain knowledge and to generalize to unseen questions and contexts. We argue that both should be considered when creating future datasets.


Introduction
Medical professionals often query over clinical notes in Electronic Medical Records (EMRs) to find information that can support their decision making (Demner-Fushman et al., 2009; Rosenbloom et al., 2011; Wang et al., 2018). One way to facilitate such information seeking activities is to build a natural language question answering (QA) system that can extract precise answers from clinical notes (Cairns et al., 2011; Cao et al., 2011; Wren, 2011; Demner-Fushman, 2016, 2019).
1 Our code is available at https://github.com/xiangyue9607/CliniRC.
Figure 1 (example QA pairs from the emrQA dataset):
RECORD #992321, Date: 2145-09-22
Context: ... For HTN control, pt was given HCTZ and lopressor which sufficiently controlled his BP. Pt was sent home on HCTZ 25mg daily and atenolol 50mg daily. ... ADDITIONAL COMMENTS: 1.) Take hydrochlorothiazide 25mg daily and atenolol 50mg daily for your blood pressure. You should also take aspirin 81mg daily.
Question: What was the dosage prescribed of hydrochlorothiazide?
Answer: ADDITIONAL COMMENTS: 1.) Take hydrochlorothiazide 25mg daily and atenolol 50mg daily for your
Question: Why has the patient been prescribed hctz?
Answer: For HTN control, pt was given HCTZ and lopressor which sufficiently
Machine reading comprehension (RC) aims to automatically answer questions based on a given document or text corpus and has drawn wide attention in recent years. Many neural models (Cheng et al., 2016; Wang and Jiang, 2017; Seo et al., 2017; Chen et al., 2017; Devlin et al., 2019) have achieved very promising results on this task, owing to large-scale QA datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017; Yang et al., 2018). Unfortunately, clinical reading comprehension (CliniRC) has not seen as much progress due to the lack of such QA datasets.
In order to create QA pairs on clinical texts, annotators must have considerable medical expertise, and data handling must be specifically designed to address ethical issues and privacy concerns. Due to these requirements, using crowdsourcing, as is done in the open domain, to create large-scale clinical QA datasets becomes highly impractical.
Recently, Pampari et al. (2018) found a smart way to tackle this issue and created emrQA, the first large-scale QA dataset on clinical texts. Instead of relying on crowdsourcing, emrQA was semi-automatically generated based on annotated question templates and existing annotations from the n2c2 (previously called i2b2) challenge datasets². Example QA pairs from the dataset are shown in Figure 1.
In this paper, we aim to gain a deep understanding of the CliniRC task and conduct a thorough analysis of the emrQA dataset. We first explore the dataset directly by carrying out a meticulous qualitative analysis on randomly sampled QA pairs and we find that: 1) Many answers in the emrQA dataset are incomplete and hence are hard to read and ineffective for training (§3.1). 2) Many questions are simple: more than 96% of the examples contain the same key phrases in both questions and answers. Though Pampari et al. (2018) claim that 39% of the questions may need knowledge to answer, our error analysis suggests that only a very small portion of the errors (2%) made by a state-of-the-art reader might be due to missing external domain knowledge (§3.2).
Following our qualitative analysis of the emrQA dataset, we conduct a comprehensive quantitative analysis based on state-of-the-art readers and BERT models (BERT-base (Devlin et al., 2019) as well as its biomedical and clinical versions: BioBERT and ClinicalBERT (Alsentzer et al., 2019)) to understand how different systems behave on the emrQA dataset. Surprising results include: 1) Using a small sampled subset (5%-20%), we can obtain roughly equal performance compared to the model trained on the entire dataset, suggesting that many examples in the dataset are redundant (§4.1). 2) The performance of the best base model is close to the human expert's performance³ (§4.2). 3) The performance of BERT models is around 1%-5% worse than the best performing base model (§4.3).
After completing our analysis of the dataset, we explore two potential needs for systems doing CliniRC: 1) the need to represent and use clinical domain knowledge effectively (§5.1) and 2) the need to generalize to unseen questions and contexts (§5.2). To investigate the first one, we analyze sev- [...]
In summary, given our analysis of the emrQA dataset and the task in general, we conclude that future work still needs to create better datasets to advance CliniRC. Such datasets should be not only large-scale, but also less noisy, more diverse, and allow researchers to directly evaluate a system's ability to encode domain knowledge and to generalize to new questions and contexts.
2 https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
3 Obtained by comparing emrQA answers to answers created by our medical experts on sampled QA pairs.

Overview of the emrQA dataset
Similar to the open-domain reading comprehension task, the Clinical Reading Comprehension (CliniRC) task is defined as follows:
Definition 2.1. Given a patient's clinical note (context) $C = \{c_1, \ldots, c_n\}$ and a question $Q = \{t_1, \ldots, t_m\}$, the CliniRC task aims to extract a continuous span $A = \{c_i, c_{i+1}, \ldots, c_{i+k}\}$ $(1 \le i \le i + k \le n)$ from the context as the answer, where $c_i$ and $t_j$ are tokens.
The emrQA dataset (Pampari et al., 2018) was semi-automatically generated from expert-annotated question templates and existing i2b2 annotations. More specifically, clinical question templates were first created by human experts. Then, manual annotations from the medication information extraction, relation learning, and coreference resolution i2b2 challenges were re-framed into answers for the question templates. After linking question templates to i2b2 annotations, the gold annotation entities were used to both replace placeholders in the question templates and extract the sentence around them as answers. An example of this generation process can be seen in Figure 2.
Figure 2: An example to illustrate how emrQA generates QA pairs.
Question Template: Has the patient ever been on |medication|?
Existing i2b2 Annotation: <Medication = "Flagyl", Line Index = 128>
Generated Question: Has the patient ever been on Flagyl?
Generated Answer: Flagyl. By discharge, the patient was afebrile (line 128)
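The generation step described above can be illustrated with a minimal Python sketch. It is not the emrQA generation code: the function name `generate_qa`, the dictionary layout of the annotation, and the assumption that line indices are 1-based are all hypothetical choices made for illustration.

```python
import re

def generate_qa(template, annotation, note_lines):
    """Fill a question template from one annotation and use the annotated line as the answer."""
    # Replace each |placeholder| with the annotated entity of the same type.
    def fill(match):
        slot = match.group(1).strip().lower()
        return annotation[slot]
    question = re.sub(r"\|\s*(\w+)\s*\|", fill, template)
    # Remove the space that the template leaves before the question mark.
    question = re.sub(r"\s+\?", "?", question)
    # Assumption: the annotation stores a 1-based line index into the note.
    answer = note_lines[annotation["line"] - 1]
    return question, answer

# Toy note in which only line 128 has content, mirroring the Figure 2 example.
note_lines = [""] * 127 + ["Flagyl. By discharge, the patient was afebrile"]
annotation = {"medication": "Flagyl", "line": 128}
question, answer = generate_qa("Has the patient ever been on | medication | ?", annotation, note_lines)
print(question)
print(answer)
```

Because the answer is simply the full line that contains the entity, the sketch also makes visible why answers can end up as broken sentences when a sentence spans several lines of the note.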
The emrQA dataset contains 5 subsets: Medication, Relation, Heart Disease, Obesity and Smoking, which were generated from 5 i2b2 challenge datasets respectively. The answer format differs across subsets. For the Obesity and Smoking datasets, answers are categorized into 7 classes and the task is to predict the question's class based on the context. For the Medication, Relation, and Heart Disease datasets, answers are usually short snippets from the text accompanied by a longer span around them, which we refer to as the evidence. The short snippet is a single entity or multiple entities, while the evidence contains the entire line around those entities in the clinical note. For questions that cannot be answered via entities, only the evidence is provided as an answer. Given that some questions do not have short answers and that entire evidence spans are usually important for supporting clinical decision making (Demner-Fushman et al., 2009), we treat the answer evidence⁴ as our answer, as is done by Pampari et al. (2018).
In this work, we mainly focus on the Medication and Relation datasets because (1) they make up 80% of the entire emrQA dataset and (2) their format is consistent with the span extraction task, which is more challenging and meaningful for clinical decision making support. We filter out the answers that are longer than 20 tokens. The detailed statistics of the two datasets are shown in Table 1.
4 For simplicity, we use "answer" directly henceforth.

Table 1: Statistics of the Medication and Relation datasets (columns: Metric, Medication, Relation).

In-depth Qualitative Analysis
In this section, we carry out an in-depth analysis of the emrQA dataset. We aim to examine (1) the quality and (2) the level of difficulty of the generated QA pairs in the emrQA dataset.

How clean are the emrQA answers?
Since the emrQA dataset was created via a generation framework, unlike human-labeled or crowdsourced datasets, the quality of the dataset remains largely unknown. In order to use this dataset to explore the CliniRC task, it is essential to determine whether it is meaningful. To do this, we randomly sample 50 QA pairs from each of the Medication and Relation datasets. Since some questions share the same answer due to automatic generation, we make sure all the samples have different answers.
Since the questions were generated from expert-created templates, most of them are human-readable and unambiguous. We therefore mainly focus on evaluating answer quality. We ask two human experts to score each answer from 1 to 5 depending on the relevance of the answer to the question (1: irrelevant or incorrect; 2: missing key parts; 3: contains key parts but is not human-readable or contains many irrelevant parts; 4: contains key parts and is only missing a few parts or has a few irrelevant extra segments; 5: perfect answer). We also ask human annotators to label the gold answers and then calculate the Exact Match (EM) and F1 score (F1) of the emrQA answers vs. the human gold answers. The answer quality score, EM, and F1 for both datasets are shown in Table 2.
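To make the two metrics concrete, the following is a simplified Python sketch of Exact Match and token-level F1 in the style of the SQuAD v1.1 evaluation (the script family named in Appendix A). It is a re-implementation for illustration, not the official script, and the function names are our own.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# The Figure 2 answer compared with the answer a human expert would give.
emrqa_answer = "Flagyl. By discharge, the patient was afebrile"
human_answer = "Clindamycin was changed to Flagyl"
print(exact_match(emrqa_answer, human_answer), f1_score(emrqa_answer, human_answer))
```

For this pair only two tokens overlap ("flagyl" and "was"), so EM is 0 and F1 is 4/11, which shows how an answer can contain the key entity and still score poorly against a complete human answer.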
The scores of the Medication dataset are low since most of the answers are broken sentences or contain unnecessary segments. For instance, in the Figure 2 example, the correct answer should be "Clindamycin was changed to Flagyl"; however, the emrQA answer misses the important part "Clindamycin was changed to" and contains the irrelevant part "By discharge, the patient was afebrile". These issues are common in the Medication dataset and make it difficult to train a good system. To understand why the generated answers contain such noise, we explored the "i2b2 2009 Medication" challenge dataset, which was used to create these QA pairs. We found that most documents in this dataset contain many complete sentences split into separate lines. Since the i2b2 annotations are token-based and emrQA takes the full lines around the tokens as evidence spans, these lines often end up being broken sentences. We tried to relabel the answers with existing sentence segmentation tools and heuristic measures but found it very challenging to obtain concise and complete text spans as answers. Compared with the Medication dataset, the answer quality of the Relation dataset is much better. In most cases, the answers are complete and meaningful sentences with no unnecessary parts.

How challenging are the emrQA pairs?
Another observation from the 50 samples is that 96% of the answers in the Medication dataset and 100% of the answers in the Relation dataset contain the key phrase in the question. This is due to the generation procedure illustrated in Figure 2. In this example, the key phrase or entity ("Flagyl") in the question is also included in the answer. This undoubtedly makes the answer easier to extract as long as the model can recognize significant words and do "word matching".
To further explore how much clinical language understanding is needed and what kinds of errors the state-of-the-art reader makes, we conduct an error analysis using DocReader (Chen et al., 2017) (also used by Pampari et al. (2018)) on the emrQA dataset. More specifically, we randomly sample 50 questions that are answered incorrectly by the model (based on the exact match metric) from each of the Medication and Relation dev sets⁵. The results are shown in Table 3 (examples for each error type are also given for better understanding).
Since emrQA answers are often incomplete in the dataset, we deem span mismatch errors acceptable as long as the predictions include the key part of the ground truths. Surprisingly, "span mismatch - include key info" errors, along with ambiguous questions, incorrect golds, and false negatives (the prediction is correct but it is not in the emrQA answers), all of which are caused by the dataset itself, account for 90% of total errors, suggesting that the accuracy of these models is even higher than we report.
Another interesting finding from the error analysis is that, to our surprise, only a very small share (2%) of errors may have been caused by a lack of external domain knowledge, while Pampari et al. (2018) claim that 39% of the questions in the emrQA dataset need domain knowledge. This surprising result might be due to: (1) neural models being able to encode relational or associative knowledge from the text corpora, as has also been reported in recent studies (Petroni et al., 2019; Bouraoui et al., 2020), and (2) questions and answers sharing key phrases in many samples (as we mentioned earlier in this section), making it likely that fewer questions need external knowledge to be answered than previously reported.

Comprehensive Quantitative Analysis
In this section, we conduct comprehensive experiments on the emrQA dataset with state-of-the-art readers and the recently dominant BERT models. Full experimental settings are described in Appendix A.

How redundant are the emrQA pairs?
Though there are more than 1 million questions in the emrQA dataset (as shown in Table 1), many questions and their patterns are very similar since they are generated from the same question templates. This observation leads to a natural question: do we really need so many questions to train a CliniRC system? If many questions are similar to each other, it is very likely that a sampled subset can achieve roughly the same performance as the entire dataset.
To verify our hypothesis, we first split the two datasets into train, dev, and test sets with the proportion of 7:1:2 w.r.t. the contexts (full statistics are shown in Appendix Table A1). Then we randomly sample {5%, 10%, 20%, 40%, 60%} and {1%, 3%, 5%, 10%, 15%}⁶ of the QA pairs in each document (context) of the Medication and the Relation training sets respectively. We run DocReader (Chen et al., 2017) on the sampled subsets and evaluate them on the same dev and test sets.
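The splitting and sampling procedure can be sketched in Python as follows. This is an illustration under our own assumptions, not the released experiment code: the function names, the fixed random seed, the rounding rule, and the choice to keep at least one QA pair per non-empty document are all hypothetical.

```python
import random

def split_by_context(doc_ids, seed=0):
    """Shuffle documents and split them 7:1:2 into train/dev/test."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 7 // 10
    n_dev = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

def sample_per_document(qa_by_doc, fraction, seed=0):
    """Keep a random fraction of the QA pairs of every document."""
    rng = random.Random(seed)
    sampled = {}
    for doc_id, pairs in qa_by_doc.items():
        # Assumption: keep at least one pair for every non-empty document.
        k = min(len(pairs), max(1, round(fraction * len(pairs))))
        sampled[doc_id] = rng.sample(pairs, k)
    return sampled

train_ids, dev_ids, test_ids = split_by_context(range(10))
subset = sample_per_document({"note_1": list(range(4)), "note_2": list(range(10))}, 0.5)
print(len(train_ids), len(dev_ids), len(test_ids))
print({doc_id: len(pairs) for doc_id, pairs in subset.items()})
```

Splitting on documents rather than on QA pairs keeps every question about a given clinical note in a single partition, and sampling inside each document keeps every training note represented in the reduced training set.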
As shown in Figure 3, using 20% of the questions in the Medication dataset and 5% of the questions in the Relation dataset achieves roughly the same performance as using the entire training sets. These results verify our hypothesis and show that learning a good and robust reader system based on the emrQA dataset does not require so many question-answer pairs. While deep models are often data-hungry, this does not mean that more data always leads to better performance. In addition to training size, diversity should also be considered as another important criterion for data quality.
6 The sampling percentages for the Relation dataset are smaller than those for the Medication dataset since the former has more QA pairs (roughly 4 times as many).
In the following experiments, we use the sampled subsets (20% for Medication and 5% for Relation) considering the time and memory cost as well as performance.

Little room for improvement
Since the answers in emrQA are often incomplete, the performance of a model is more appropriately reflected by its F1 score. As shown in Table 2, we obtain F1 scores of 74% and 95% on the Medication and Relation datasets respectively when we test human-labeled answers against the emrQA answers on a sampled dataset. We can see from Table 4 that the best performing reader, DocReader, achieves around 70% and 94% F1 on the Medication and Relation test sets respectively, which is very close to the human performance just described. Though designing more complex and advanced models may achieve better scores, such scores are obtained w.r.t. noisy emrQA answers and may not translate meaningfully to real cases.

BERT does not always win
BERT models have achieved very promising results recently in various NLP tasks including RC (Devlin et al., 2019). We follow their experimental setting of BERT for reading comprehension on the SQuAD (Rajpurkar et al., 2016) dataset. To our surprise, as shown in Table 4, BERT models (BERT-base, its biomedical version BioBERT, and its clinical version ClinicalBERT [...] (Zhu et al., 2015) and PubMed articles respectively, both of which may have different vocabularies and use different language expressions from clinical texts. Though ClinicalBERT was pretrained on MIMIC-III (Johnson et al., 2016) clinical texts, the training size of the corpus (∼50M words) is far smaller than that used for BERT (∼3300M words), which may make the model less powerful than it is on open-domain tasks. 2) Longer Contexts. As can be seen from Table 1, the number of tokens in the contexts is commonly larger than in open-domain RC datasets like SQuAD (∼1000 vs. ∼116 on average). We suspect that long contexts might make it more challenging to model sequential information. Sequences that are longer than the max length of the BERT model are truncated into a set of short sequences, which may hinder the model from capturing long dependencies (Dai et al., 2019) and global information in the entire document. 3) Easy Questions. Another possible reason might be that the question patterns are too easy, so that a simpler reader with far fewer parameters can learn the patterns and obtain satisfying performance.
Additionally, to further evaluate the models at a fine-grained level, inspired by Gururangan et al. (2018), we partition the Medication and Relation test sets into Easy and Hard subsets using a base model. The details of the Easy/Hard splits can be found in Appendix C. As can be seen from Table A4, most of the questions in the two datasets are easy, which indicates that the emrQA dataset might not be challenging for current QA models. More difficult datasets are needed to advance the Clinical Reading Comprehension task.
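The splitting of long contexts mentioned under "Longer Contexts" can be pictured with a small Python sketch. It shows one common way to cut a token sequence into overlapping windows; the function name and the default values of `max_len` and `stride` are illustrative assumptions, not settings taken from this paper.

```python
def split_into_windows(tokens, max_len=384, stride=128):
    """Cut a token list into windows of at most max_len tokens, starting every stride tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        # Each entry keeps the window's start offset so spans can be mapped back to the note.
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

# A context of about 1000 tokens, roughly the average length quoted above.
context = list(range(1000))
windows = split_into_windows(context)
print(len(windows), [start for start, _ in windows])
```

Each window is read independently, so an answer whose supporting evidence lies in a different window from the answer span cannot be related to it, which is one way the global information of a long note can be lost.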

Desiderata in Real-World CliniRC
Following our analysis of the emrQA dataset, we further study two aspects of clinical reading comprehension systems that we believe are crucial for their real-world applicability: the need to encode clinical domain knowledge and to generalize to unseen questions and documents.

External domain knowledge is needed
So far, we have shown that domain knowledge may not be very useful for models answering questions in the emrQA dataset; however, we argue that systems in real-world CliniRC need to be able to encode and use clinical domain knowledge effectively.
Clinical text often contains high variability in many domain-specific words due to abbreviations and synonyms. The presence of different aliases in the question and context can make it difficult for a model to represent semantics accurately and choose the correct span. Besides, medical domain-specific relations (e.g., treats, caused by) and hierarchical relations (e.g., isa) between medical concepts are likely to appear. The process followed to generate the current emrQA dataset leads to these problems being largely under-represented, even though they can be very common in real cases. We use the following 3 examples as representatives to illustrate the real cases we may encounter.
Synonym. For the question in Figure 2, "Has this patient ever been on Flagyl?", it is easy for the model to answer since "Flagyl" appears in the context. However, if we change "Flagyl" in the question to its synonym "Metronidazole" (which may not appear in training), it is hard for the reader to extract the correct answer, as the model has no way to capture that "Metronidazole" means the same as "Flagyl".
Clinical Relations. Another example is the question shown in Figure 1, "Why has the patient been prescribed hctz?". Currently, machines can easily find the answer since the keyword "hctz" is mentioned in the answer. However, in a situation where the drug "hctz" does not appear in the local context of "HTN", our model may have a better chance of extracting the correct answer if it stores the relation "(hctz, treats, HTN)".
Hierarchical Relation. For the question "Is there a history of mental illness?", it is more likely that the medical report describes a specific type of psychological condition rather than mentions the general phrase "mental illness", since clinical support requires specifics. To obtain the correct answer in this case, "Depression with previous suicidal ideation.", encoding the relation "(depression, isa, mental illness)" would probably help the model make a correct prediction.
These three cases help illustrate how complex medical relations affect the real CliniRC task. Without leveraging external domain knowledge, it is difficult for models to capture the semantic relations necessary to resolve such cases.
In order to verify our claim quantitatively, we select synonym as a representative relation type and manipulate each question by replacing its entities with plausible synonyms or abbreviations. We then introduce external domain knowledge into current models and compare their performance against base models on these augmented questions.
More specifically, we first detect entities in the questions and link them to a medical knowledge base (KB), UMLS (Bodenreider, 2004), using a biomedical and clinical text NLP pipeline tool, ScispaCy (Neumann et al., 2019). Synonyms of detected entities are then retrieved from UMLS and used to replace the original mentions. We filter out the questions that do not contain entities or that contain entities with no synonyms. We focus on the Relation dataset and only modify the questions in the dev and test sets; the questions in the training set are not modified. Finally, we get 69,912 and 125,338 questions in the dev and test sets respectively.
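The question manipulation can be illustrated with a toy Python sketch. It deliberately replaces the ScispaCy entity linker and the UMLS lookup with a small hand-written dictionary, so `TOY_SYNONYMS` and `replace_entities` are hypothetical stand-ins for illustration rather than the pipeline used in the experiments.

```python
import re

# Toy stand-in for the UMLS synonym lookup; the two entries come from the examples in this paper.
TOY_SYNONYMS = {
    "flagyl": ["metronidazole"],
    "hctz": ["hydrochlorothiazide"],
}

def replace_entities(question, synonyms):
    """Swap every known entity for its first synonym.

    Returns None when no entity with a synonym is found, which corresponds to
    the questions that are filtered out.
    """
    replaced = False

    def swap(match):
        nonlocal replaced
        word = match.group(0)
        options = synonyms.get(word.lower())
        if not options:
            return word
        replaced = True
        return options[0]

    new_question = re.sub(r"[A-Za-z][A-Za-z0-9-]*", swap, question)
    return new_question if replaced else None

print(replace_entities("Has the patient ever been on Flagyl?", TOY_SYNONYMS))
print(replace_entities("Is there a history of mental illness?", TOY_SYNONYMS))
```

A word-level dictionary cannot handle multi-word mentions or ambiguous abbreviations, which is one reason a real entity linker is needed in practice.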
We then introduce a simple Knowledge Incorporation Module (KIM) to evaluate the usefulness of external domain knowledge. Formally, given a question $q = \{w^q_1, w^q_2, \ldots, w^q_l\}$ and its context $c = \{w^c_1, w^c_2, \ldots, w^c_m\}$, where $w^q_i$ and $w^c_j$ are words (tokens), all the words can be mapped to $d_1$-dimensional vectors via a word embedding matrix $E_w \in \mathbb{R}^{d_1 \times |V|}$, where $V$ denotes the word vocabulary. So we have $q: \mathbf{w}^q_1, \ldots, \mathbf{w}^q_l \in \mathbb{R}^{d_1}$ and $c: \mathbf{w}^c_1, \ldots, \mathbf{w}^c_m \in \mathbb{R}^{d_1}$. We then detect entities $\{e^q_1, e^q_2, \ldots, e^q_n\}$ in the question and entities $\{e^c_1, e^c_2, \ldots, e^c_o\}$ in the context and map them to a medical knowledge base (KB), UMLS (Bodenreider, 2004), using ScispaCy (Neumann et al., 2019). Note that $l$ is generally not equal to $n$ and $m$ is generally not equal to $o$, since not every token can be mapped to an entity in the KB. For entities that contain multiple words, we align them to the first token, the same alignment as used by Zhang et al. (2019). We then map the detected entities to $d_2$-dimensional vectors $\{\mathbf{e}^q_1, \mathbf{e}^q_2, \ldots, \mathbf{e}^q_n\}$ and $\{\mathbf{e}^c_1, \mathbf{e}^c_2, \ldots, \mathbf{e}^c_o\}$ via an entity embedding matrix $E_e \in \mathbb{R}^{d_2 \times |U|}$, which is pretrained on the entire UMLS KB using the knowledge embedding method TransE (Bordes et al., 2013). $U$ denotes the entity vocabulary.
We merge the word embeddings with the entity embeddings by feeding them into a Multi-layer Perceptron (MLP):
$$\mathbf{h}^q_i = \sigma(W_c \mathbf{w}^q_i + W_e \mathbf{e}^q_i + b), \qquad \mathbf{h}^c_j = \sigma(W_c \mathbf{w}^c_j + W_e \mathbf{e}^c_j + b),$$
where $\sigma$ is the activation function, $W_c$, $W_e$, and $b$ are trainable parameters, and $\mathbf{h}^q_i$, $\mathbf{h}^c_j$ denote the integrated embeddings that contain information from both the word and the entity aligned to it, in the question and the context respectively. For a word that is not mapped to an entity, the entity embedding is set to 0. The merged embeddings are used as the input to the base reader.
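As a rough illustration of the merging step, the sketch below assumes the MLP is a single layer of the form h = σ(W_c w + W_e e + b) with σ = tanh; both the single-layer form and the choice of tanh are assumptions made here. It uses plain Python lists so that it runs without any dependency, and the tiny matrices are made-up values for demonstration only.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def merge_embeddings(word_vec, entity_vec, W_c, W_e, b):
    """Integrated embedding h = tanh(W_c w + W_e e + b) for one token."""
    word_part = matvec(W_c, word_vec)
    entity_part = matvec(W_e, entity_vec)
    return [math.tanh(x + y + z) for x, y, z in zip(word_part, entity_part, b)]

# Made-up parameters: word dimension 2, entity dimension 1, output dimension 2.
W_c = [[1.0, 0.0], [0.0, 1.0]]
W_e = [[1.0], [1.0]]
b = [0.0, 0.0]

word = [0.5, -0.5]
no_entity = [0.0]      # token that is not linked to any KB entity
with_entity = [1.0]    # token aligned to a KB entity

print(merge_embeddings(word, no_entity, W_c, W_e, b))
print(merge_embeddings(word, with_entity, W_c, W_e, b))
```

With a zero entity vector the output depends on the word embedding alone, so tokens without a linked entity pass through the module with only the word-side transformation applied.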
As shown in Figure 4, by adding a basic Knowledge Incorporation Module to the base model, we obtain around 5% increase of F1 score on the manipulated questions in the test set. This suggests that for questions that involve relations between medical concepts, external domain knowledge may be quite important.
Table 5: Results of models when tested on new questions and unseen clinical notes (not in emrQA, but from MIMIC-III dataset). Performance drops around 40% compared with previously reported on the Relation test set, highlighting generalizability as an essential future direction for CliniRC.

Generalizing to unseen questions and documents
The aim of CliniRC is to build robust QA systems for doctors to retrieve information buried in clinical texts. When deploying a CliniRC system to a new environment (e.g., a new set of clinical records, a new hospital, etc.), it is infeasible to create new QA pairs for training every time. Thus, an ideal CliniRC system is able to generalize to unseen documents and questions after being fully trained.
To test the generalizability of models trained on emrQA (we focus on the Relation dataset here), our medical experts created 50 new questions that were not present in the emrQA dataset and extracted answers from unseen patient notes in the MIMIC-III (Johnson et al., 2016) dataset. This dataset consists of three types of questions: 12 questions were made from emrQA question templates but contain entities which do not appear in the training set (e.g., "How was the diagnosis of acute cholecystitis made?" was created from the template "How was the diagnosis of |problem| made?"). The other 38 questions have different forms from existing question templates: 21 paraphrase existing questions from emrQA (e.g., "Was an edema found in the physical exam?" was paraphrased from "Does he have any evidence of |problem| in |test|?") and 17 are completely semantically different from the ones in the emrQA dataset (e.g., "What chemotherapy drugs are being administered to the patient?").
As could be expected, we see in Table 5 that the more the new questions deviate from the original emrQA, the more the models struggle to answer them. We observe a performance drop of roughly 20% compared to the Relation test set on questions made from emrQA templates using MIMIC-III clinical notes which were not in the original dataset. For questions that are more significantly different, we notice an approximate 40% and 60% loss in F1 score when predicting paraphrased questions and entirely new questions respectively. This steep drop in performance in these new settings, especially for paraphrased and new questions, shows how much work there is to be done on this front and highlights generalizability as an important future direction in CliniRC. We also notice that ClinicalBERT works slightly better than the base model DocReader. The reason might be that ClinicalBERT was pretrained on the MIMIC-III dataset, which might help the model better understand the context.
Summary. Based on these two aspects and our previous thorough analysis of the emrQA dataset, it is clear that better datasets are needed to advance CliniRC. Such datasets should be not only large-scale, but also less noisy and more diverse, and should moreover allow researchers to systematically evaluate a model's ability to encode domain knowledge and to generalize to new questions and contexts.

Related Work
We present a brief overview of open-domain, biomedical and clinical question answering tasks, which are most related to our work. Question Answering (QA) aims to automatically answer questions asked by humans based on external sources, such as the Web (Sun et al., 2016), knowledge bases (Sun et al., 2015) and free text (Chen et al., 2016). As an important type of QA, reading comprehension intends to answer a question after reading the passage (Hirschman et al., 1999). Recently, the release of large-scale RC datasets, such as CNN & Daily Mail (Hermann et al., 2015) and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016, 2018), has made it possible to solve RC tasks by building deep neural models (Hermann et al., 2015; Wang and Jiang, 2017; Seo et al., 2017; Chen et al., 2017).
More recently, contextualized word representations and pretrained language models, such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019), have been demonstrated to be very useful in various NLP tasks including RC. By seeing diverse contexts in large corpora, these pretrained language models can capture rich semantic meaning and produce more accurate and precise representations for words given different contexts. Even a simple classifier or score function built upon these pretrained contextualized word representations performs well in extracting answer spans (Devlin et al., 2019).
Biomedical and Clinical QA. Due to the lack of large-scale annotated biomedical or clinical data, QA and RC systems in these domains are often rule-based and heuristic feature-based (Lee et al., 2006; Niu et al., 2006; Athenikos and Han, 2010).
In recent years, the BioASQ challenges (Tsatsaronis et al., 2012) proposed the Biomedical Semantic QA task, where the participants need to respond to each test question with relevant articles, snippets and exact answers. Šuster and Daelemans (2018) use summary points of clinical case reports to build a large-scale cloze-style dataset (CliCR), which is similar in style to the CNN & Daily Mail dataset. Jin et al. (2019b) present PubMedQA, which extracts question-style titles and their corresponding abstracts as the questions and contexts respectively. A few QA pairs are annotated by human experts, and most of them are annotated with "yes/no/maybe" based on a simple heuristic rule.
Due to the great power of contextualized word representations, pretrained language models have also been introduced to the biomedical and clinical domains, e.g., BioELMo (Jin et al., 2019a)

Conclusion
We study the Clinical Reading Comprehension (CliniRC) task with the recently created emrQA dataset. Our qualitative and quantitative analysis, as well as our exploration of the two desired aspects of CliniRC systems, shows that future clinical QA datasets should not only be large-scale but also less noisy and more diverse. Moreover, questions that involve complex relations and span different domains should be included, so that more advanced external knowledge incorporation methods as well as domain adaptation methods can be carefully designed and systematically evaluated.

A Experimental Set-up
We split the two datasets, Medication and Relation, based on the documents (clinical texts) into train, dev, and test sets with the ratio 7:1:2. The statistics are shown in Table A1. We adopt Exact Match (EM) and F1 score (F1) as our evaluation metrics, the same as in open-domain RC (Rajpurkar et al., 2016). We use the official SQuAD v1.1 evaluation script¹ to evaluate all the models. All the models used in the paper, BiDAF², DocReader³, QANet⁴, BERT⁵, BioBERT⁶, and ClinicalBERT⁷, are run using the implementations listed here, strictly following their instructions.
For reproducibility, we list all the key hyperparameters we use for each method in Table A2.
We implement our Knowledge Incorporation Module based on the DocReader implementation. Entity embeddings are pretrained using TransE (Bordes et al., 2013) with a dimension of 100. The hyperparameters are kept the same as for DocReader. All the models are run on NVIDIA GeForce GTX 1080 GPUs. We save the best model (with the highest EM) on the dev set and use it on the test set.

B Performance on Shorter Contexts
Using the entire clinical record as the context might be too long for models to capture sequential information. We also try to split the entire record into different sections (e.g., "medical history", "family history") based on some heuristic measures. Specifically, we notice that most sections begin with easily identifiable headers. To detect these headers, we use a combination of heuristics such as whether the line contains colons, all-uppercase formatting, or phrases found in a list of clinical headers taken from SecTag (Denny et al., 2009). We then select the section that contains the answer as the context (∼100 words avg). We select DocReader and ClinicalBERT as representative methods and re-run them on the modified shorter context. The results are shown in Table A3. The performance of the two models is improved compared with the performance of models built on the whole record (long context). However, ClinicalBERT still does not outperform DocReader in this setting, indicating either that longer contexts may not explain why BERT models do not win on this dataset or that shortening the context in such a manner might break long dependencies.
Table A3: Performance of the two models on the shorter context setting.
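A minimal Python sketch of header-based section splitting is given below. The way the three cues are combined (a known header phrase, or a colon preceded by all-uppercase text) is our own assumption, and `KNOWN_HEADERS` is a tiny hypothetical list standing in for the SecTag header list. The toy note is adapted from the Figure 1 example.

```python
# Hypothetical header phrases for illustration; the paper takes its list from SecTag.
KNOWN_HEADERS = {"medical history", "family history", "additional comments"}

def is_header(line):
    """Guess whether a line opens a new section."""
    text = line.strip()
    if not text:
        return False
    head = text.split(":", 1)[0].strip()
    letters = [ch for ch in head if ch.isalpha()]
    all_upper = bool(letters) and all(ch.isupper() for ch in letters)
    return head.lower() in KNOWN_HEADERS or (":" in text and all_upper)

def split_sections(note):
    """Group the lines of a note into sections, each starting at a detected header."""
    sections = []
    current = {"header": "", "lines": []}
    for line in note.splitlines():
        if is_header(line):
            if current["header"] or current["lines"]:
                sections.append(current)
            current = {"header": line.split(":", 1)[0].strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    if current["header"] or current["lines"]:
        sections.append(current)
    return sections

def section_containing(sections, answer):
    """Return the first section whose text contains the answer string, or None."""
    for section in sections:
        if answer in "\n".join(section["lines"]):
            return section
    return None

note = (
    "RECORD #992321, Date: 2145-09-22\n"
    "For HTN control, pt was given HCTZ and lopressor.\n"
    "ADDITIONAL COMMENTS: 1.) Take hydrochlorothiazide 25mg daily\n"
    "and atenolol 50mg daily for your blood pressure."
)
sections = split_sections(note)
print([s["header"] for s in sections])
print(section_containing(sections, "atenolol 50mg")["header"])
```

Selecting the section by looking up the gold answer is only possible when the answer is known, so this setting shortens the context for analysis purposes rather than modelling how a deployed system would choose a section.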
This experiment setting may also inspire future research on "Open Clinical Reading Comprehension". Given that patients often have multiple clinical records, it may not be feasible to jointly use all of them as context for one question. Given multiple records for one patient (instead of just one) and a question, the model would first need to retrieve the most relevant paragraphs and do reading comprehension on each of them or find clever ways to merge them. Such a setting would be interesting for future CliniRC datasets to explore.

C Easy/Hard Questions Split
We partition the questions into Easy and Hard. Specifically, we first train a BiLSTM reader and do the prediction on the test set. We obtain the performance of each question template by averaging the performance of all the questions made by this template (such template and question mappings are included in the emrQA dataset). Question templates that obtain higher performance than the overall performance are labeled as "Easy" and "Hard" otherwise. Then we map the difficulty level of question templates back to each question. The reason why we focus on splitting on the question template level is that we can avoid some random noise (e.g., random errors produced by the model on some questions). Also, we release the difficulty level of each question template so that users can easily know which questions are easy or hard and do not need to run a base model to obtain such mappings again. Distributions of easy/hard questions and results of the two selected models are shown in Table A4.
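The template-level labelling can be sketched in a few lines of Python. The sketch assumes that the "overall performance" is the mean score over all test questions and that each question comes with a template identifier and a per-question score from the base reader; the function name and the toy scores are hypothetical.

```python
from collections import defaultdict

def label_templates(results):
    """results: list of (template_id, score) pairs, one per test question."""
    by_template = defaultdict(list)
    for template_id, score in results:
        by_template[template_id].append(score)
    # Assumption: overall performance is the mean over all questions.
    overall = sum(score for _, score in results) / len(results)
    labels = {}
    for template_id, scores in by_template.items():
        mean = sum(scores) / len(scores)
        labels[template_id] = "Easy" if mean > overall else "Hard"
    return labels

# Toy per-question scores from a base reader; t1, t2, t3 are made-up template ids.
results = [("t1", 1.0), ("t1", 1.0), ("t2", 0.0), ("t2", 1.0), ("t3", 0.0)]
labels = label_templates(results)
question_labels = [labels[template_id] for template_id, _ in results]
print(labels)
print(question_labels)
```

The last step maps each template's label back to its questions, so a question the base reader happened to get wrong is still labelled Easy when its template is answered well on average.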