emrQA: A Large Corpus for Question Answering on Electronic Medical Records

We propose a novel methodology to generate domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology in generating a large-scale QA dataset for electronic medical records by leveraging existing expert annotations on clinical notes for various NLP tasks from the community shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question to logical form and question to answer mapping.


Introduction
Automatic question answering (QA) has made big strides with several open-domain and machine comprehension systems built using large-scale annotated datasets (Voorhees et al., 1999; Ferrucci et al., 2010; Rajpurkar et al., 2016; Joshi et al., 2017). However, in the clinical domain this problem remains relatively unexplored. Physicians frequently seek answers to questions from unstructured electronic medical records (EMRs) to support clinical decision-making (Demner-Fushman et al., 2009). But in a significant majority of cases, they are unable to unearth the information they want from EMRs (Tang et al., 1994). Moreover, to date, there is no general system for answering natural language questions asked by physicians on a patient's EMR (Figure 1) due to the lack of large-scale datasets (Raghavan and Patwardhan, 2016).
EMRs are a longitudinal record of a patient's health information in the form of unstructured clinical notes (progress notes, discharge summaries, etc.) and structured vocabularies. Physicians wish to answer questions about medical entities and relations from the EMR, requiring a deeper understanding of clinical notes. While this may be likened to machine comprehension, the longitudinal nature of clinical discourse, little to no redundancy in facts, abundant use of domain-specific terminology, temporal narratives with multiple related diseases, symptoms, and medications that go back and forth in time, and misspellings make it complex and difficult to apply existing NLP tools (Demner-Fushman et al., 2009; Raghavan and Patwardhan, 2016). Moreover, answers may be implicit or explicit and may require domain knowledge and reasoning across clinical notes. Thus, building a credible QA system for patient-specific EMR QA requires large-scale question and answer annotations that sufficiently capture the challenging nature of clinical narratives in the EMR. However, serious privacy concerns about sharing personal health information (Devereaux, 2013; Krumholz et al., 2016), and the tedious nature of assimilating answer annotations from across longitudinal clinical notes, make this task impractical and possibly erroneous to do manually (Lee et al., 2017).
In this work, we address the lack of any publicly available EMR QA corpus by creating a large-scale dataset, emrQA, using a novel generation framework that allows for minimal expert involvement and re-purposes existing annotations available for other clinical NLP tasks (i2b2 challenge datasets (Guo et al., 2006)). The annotations serve as a proxy-expert in generating questions, answers, and logical forms. Logical forms provide a human-comprehensible symbolic representation, linking questions to answers, and help build interpretable models, critical to the medical domain (Davis et al., 1977; Vellido et al., 2012). We analyze the emrQA dataset in terms of question complexity, relations, and the reasoning required to answer questions, and provide neural and heuristic baselines for learning to predict question-logical forms and question-answers.
The main contributions of this work are as follows:
• A novel framework for systematic generation of domain-specific large-scale QA datasets that can be used in any domain where manual annotations are challenging to obtain but limited annotations may be available for other NLP tasks.
• The first accessible patient-specific EMR QA dataset, emrQA*, consisting of 400,000 question-answer pairs and 1 million question-logical form pairs. The logical forms will allow users to train and benchmark interpretable models that justify answers with corresponding logical forms.
• Two new reasoning challenges, namely arithmetic and temporal reasoning, that are absent in open-domain datasets like SQuAD (Rajpurkar et al., 2016).
* https://github.com/panushri25/emrQA, scripts to generate emrQA from i2b2 data. i2b2 data is accessible by everyone subject to a license agreement.

Related Work
Question Answering (QA) datasets are classified into two main categories: (1) machine comprehension (MC) using unstructured documents, and (2) QA using Knowledge Bases (KBs).
MC systems aim to answer any question that could be posed against a reference text. Recent advances in crowd-sourcing and search engines have resulted in an explosion of large-scale (100K) MC datasets for factoid QA, having ample redundant evidence in text (Rajpurkar et al., 2016; Trischler et al., 2016; Joshi et al., 2017; Dhingra et al., 2017). On the other hand, complex domain-specific MC datasets such as MCTest (Richardson et al., 2013), biological process modeling (Berant et al., 2014), BioASQ (Tsatsaronis et al., 2015), InsuranceQA (Feng et al., 2015), etc., have been limited in scale (500-10K) because of the complexity of the task or the need for expert annotations that cannot be crowd-sourced or gathered from the web. In contrast to open-domain data, EMR data cannot be released publicly due to privacy concerns (Šuster et al., 2017). Also, annotating unstructured EMRs requires a medical expert who can understand and interpret clinical text. Thus, very few datasets, such as i2b2 and MIMIC (Johnson et al., 2016) (developed over several years in collaboration with large medical groups and hospitals), share small-scale annotated clinical notes. In this work, we take advantage of the limited expertly annotated resources to generate emrQA.
KB-based QA datasets, used for semantic parsing, are traditionally limited by the requirement of annotated question and logical form (LF) pairs for supervision where the LF are used to retrieve answers from a schema (Cai and Yates, 2013; Lopez et al., 2013; Bordes et al., 2015). Roberts and Demner-Fushman (2016)
Recent advances in QA combine logic-based and neural MC approaches to build hybrid models (Usbeck et al., 2015; Palangi et al., 2018). These models are driven to combine the accuracy of neural approaches (Hermann et al., 2015) and the interpretability of the symbolic representations in logic-based methods (Gao et al.; Chabierski et al., 2017). Building interpretable yet accurate models is extremely important in the medical domain (Shickel et al., 2017). We generate large-scale ground truth annotations (questions, logical forms, and answers) that can provide supervision to learn such hybrid models. Our approach to generating emrQA is in the same spirit as Su et al. (2016), who generate graph queries (logical forms) from a structured KB and use them to collect answers. In contrast, our framework can be applied to generate QA datasets in any domain with minimal expert input using annotations from other NLP tasks.

QA Dataset Generation Framework
Our general framework for generating a large-scale QA corpus given certain resources consists of three steps. (1) We collect questions to capture domain-specific user needs, and then normalize the collected questions to templates by replacing entities (that may be related via binary or composite relations) in the question with placeholders. The entity types replaced in the question are grounded in an ontology like WordNet (Miller, 1995), UMLS (Bodenreider, 2004), or a user-generated schema that defines and relates different entity types. (2) We associate question templates with expert-annotated logical form templates; logical forms are symbolic representations using relations from the ontology/schema to express the relations in the question, and associate the question entity type with an answer entity type. (3) We then proceed to the important step of re-purposing existing NLP annotations to populate question-logical form templates and generate answers. QA is a complex task that requires addressing several fundamental NLP problems before accurately answering a question. Hence, obtaining expert manual annotations in complex domains is infeasible, as it is tedious to expert-annotate answers that may be found across long document collections (e.g., longitudinal EMR) (Lee et al., 2017). Thus, we reverse engineer the process: we reuse expert annotations available in NLP tasks such as entity recognition, coreference, and relation learning, based on the information captured in the logical forms, to populate entity placeholders in templates and generate answers.
Table 2 (paraphrase templates that map to the same logical form):
How was the |problem| managed ?
How was the patient's |problem| treated ?
What was done to correct the patient's |problem| ?
Has the patient ever been treated for a |problem| ?
What treatment has the patient had for his |problem| ?
Has the patient ever received treatment for |problem| ?
What treatments for |problem| has this patient tried ?
Reverse engineering serves as a proxy expert ensuring that the generated QA annotations are credible. The only manual effort is in annotating logical forms, thus significantly reducing expert labor. Moreover, in domain-specific instances such as EMRs, manually annotated logical forms allow the experts to express information essential for natural language understanding such as domain knowledge, temporal relations, and negation (Gao et al.; Chabierski et al., 2017). This knowledge, once captured, can be used to generate QA pairs on new documents, making the framework scalable.

Generating the emrQA Dataset
We apply the proposed framework to generate the emrQA corpus consisting of questions posed by physicians against longitudinal EMRs of a patient, using annotations provided by i2b2 (Figure 2).

Question Collection and Normalization
We collect questions for EMR QA by (1) polling physicians at the Veterans Administration for what they frequently want to know from the EMR (976 questions), (2) using an existing source of 5,696 questions generated by a team of medical experts from 71 patient records (Raghavan et al., 2017), and (3) using 15 prototypical questions from an observational study done by physicians (Tang et al., 1994). To obtain templates, the questions were automatically normalized by identifying medical entities (using MetaMap (Aronson, 2001)) in questions and replacing them with generic placeholders. The resulting ∼2K noisy templates were expert reviewed and corrected (to account for any entity recognition errors by MetaMap). We align our entity types to those defined in the i2b2 concept extraction tasks (Uzuner et al., 2010a, 2011): problem, test, treatment, mode and medication. For example, the question What is the dosage of insulin? from the collection gets converted to the template What is the dosage of |medication|?, as shown in Figure 2. This process resulted in 680 question templates. We do not correct for the usage/spelling errors in these templates, such as usage of "pt" for "patient", or make the templates gender neutral, in order to provide a true representation of physicians' questions. Further, analyzing these templates shows that physicians most frequently ask about test results (11%), medications for problem (9%), and problem existence (8%). The long tail following this includes questions about medication dosage, response to treatment, medication duration, prescription date, etiology, etc. Temporal constraints were frequently imposed on questions related to tests, problem diagnosis and medication start/stop.
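The normalization step can be illustrated with a minimal sketch. It assumes entity mentions and their types are already known; the toy dictionary `ENTITY_LEXICON` and the function `normalize_question` are hypothetical stand-ins for an entity recognizer such as MetaMap, not part of the released emrQA scripts.

```python
import re

# Hypothetical toy lexicon standing in for an entity recognizer such as MetaMap.
ENTITY_LEXICON = {
    "insulin": "medication",
    "hypertension": "problem",
}

def normalize_question(question, lexicon=ENTITY_LEXICON):
    """Replace known entity mentions with |type| placeholders."""
    template = question
    # Replace longer mentions first so that multi-word entities take priority.
    for mention in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(mention) + r"\b"
        placeholder = "|" + lexicon[mention] + "|"
        template = re.sub(pattern, placeholder, template, flags=re.IGNORECASE)
    return template

print(normalize_question("What is the dosage of insulin?"))
```

In practice the recognizer output is noisy, which is why the resulting templates were expert reviewed before use.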

Associating Templates w/ Logical Forms
The 680 question templates were annotated by a physician with their corresponding logical form (LF) templates, which resulted in 94 unique LF templates. Question templates that map to the same LF are considered paraphrases of each other and correspond to a particular question type (Table 2). Logical forms are defined based on an ontology schema designed by medical experts (Figure 3). This schema captures entities in unstructured clinical notes through medical events and their attributes, interconnected through relations. We align the entity and relation types of i2b2 to this schema.
A formal representation of the LF grammar using this schema (Figure 3) is as follows. Medical events are denoted as $ME_i$ (e.g., LabEvent, ConditionEvent) and relations are denoted as $RE_i$ (e.g., conducted/reveals). Now, $ME[a_1, \ldots, a_j, \ldots, \mathrm{oper}(a_n)]$ is a medical event where $a_j$ represents an attribute of the event (such as result in LabEvent). An event may optionally include constraints on attributes captured by an operator ($\mathrm{oper}() \in \{$sort, range, check for null values, compare$\}$). These operators sometimes require values from an external medical KB (indicated by ref, e.g., lab.ref_low/lab.ref_high to indicate the range of reference standards considered healthy in lab results), indicating the need for medical knowledge to answer the question. Using these constructs, a LF can be defined using the following rules,
Advantages of our LF representation include the ability to represent composite relations, define attributes for medical events and constrain the attributes to precisely capture the information need in the question. While these can be achieved using different methods that combine lambda calculus and first order logic (Roberts and Demner-Fushman, 2016), our representation is more human comprehensible. This allows a physician to consider an ontology like Figure 3 and easily define a logical form. Some example question templates with their LF annotations are described in Table 3 using the above notation. The LF representation of the question in Figure 2 is MedicationEvent(|medication|) [dosage=x]. The entities seen in the LF are the entities posed in the question, and the entity marked x indicates the answer entity type.

Template Filling and Answer Extraction
The next step in the process is to populate the question and logical form (QL) templates with existing annotations in the i2b2 clinical datasets and extract answer evidence for the questions. The i2b2 datasets are expert annotated with fine-grained annotations (Guo et al., 2006) that were developed for various shared NLP challenge tasks, including (1) smoking status classification (Uzuner et al., 2008), (2) diagnosis of obesity and its co-morbidities (Uzuner, 2009), extraction of (3) medication concepts (Uzuner et al., 2010a), (4) relations, concepts, assertions (Uzuner et al., 2010b, 2011), (5) co-reference resolution (Uzuner et al., 2012) and (6) heart disease risk factor identification (Stubbs and Uzuner, 2015). In Figure 2, this would correspond to leveraging annotations from the medications challenge between medications and their dosages, such as medication=Nitroglycerin, dosage=40mg, to populate |medication| and generate several instances of the question "What is the dosage of |medication|?" and its corresponding logical form MedicationEvent(|medication|) [dosage=x]. The answer would be derived from the value of the dosage entity in the dataset.
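As a rough illustration of this template-filling step, the sketch below populates the question and logical form templates from the example above. The record format (a list of dictionaries with `medication` and `dosage` keys) and the function name `fill_templates` are assumptions made for illustration; the actual i2b2 annotation files use their own format, and emrQA additionally returns the annotated note line as answer evidence.

```python
QUESTION_TEMPLATE = "What is the dosage of |medication|?"
LF_TEMPLATE = "MedicationEvent(|medication|) [dosage=x]"

def fill_templates(question_template, lf_template, annotations):
    """Return (question, logical_form, answer) triples, one per annotation record."""
    triples = []
    for record in annotations:
        medication = record["medication"]
        question = question_template.replace("|medication|", medication)
        logical_form = lf_template.replace("|medication|", medication)
        # The answer entity may be missing; then only the question-LF pair is usable.
        answer = record.get("dosage")
        triples.append((question, logical_form, answer))
    return triples

# Illustrative records: the first mirrors the example in the text,
# the second is a made-up record without a dosage annotation.
annotations = [
    {"medication": "Nitroglycerin", "dosage": "40mg"},
    {"medication": "aspirin"},
]

for triple in fill_templates(QUESTION_TEMPLATE, LF_TEMPLATE, annotations):
    print(triple)
```

Records without an answer entity still yield a question-logical form pair, which is consistent with the corpus having more QL pairs than QA pairs.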
Preprocessing: The i2b2 entities are preprocessed before using them with our templates to ensure syntactic correctness of the generated questions. The pre-processing steps are designed based on the i2b2 annotations syntax guidelines (Guo et al., 2006). To estimate grammatical correctness, we randomly sampled 500 generated questions and found that <5% had errors. These errors include, among others, incorrect usage of article with the entity and incorrect entity phrasing.
Answer Extraction: The final step in the process is generating answer evidence corresponding to each question. The answers in emrQA are defined differently; instead of a single word or phrase we provide the entire i2b2 annotation line from the clinical note as the answer. This is because the context in which the answer entity or phrase is mentioned is extremely important in clinical decision making (Demner-Fushman et al., 2009). Hence, we call them answer evidence instead of just answers. For example, consider the question Is the patient's hypertension controlled? The answer to this question is not a simple yes/no, since the status of the patient's hypertension can change through the course of treatment. The answer evidence for this question in emrQA consists of multiple lines across the longitudinal notes that reflect this potentially changing status of the patient's condition, e.g., Hypertension-borderline today. Additionally, for questions seeking specific answers we also provide the corresponding answer entities.
The overall process for answer evidence generation was vetted by a physician. Here is a brief overview of how the different i2b2 datasets were used in generating answers. The relations challenge datasets have various event-relation annotations across single/multiple lines in a clinical note. We used a combination of one or more of these, to generate answers for a question; in doing so we used the annotations provided by the i2b2 co-reference datasets. Similarly, the medications challenge dataset has various event-attribute annotations but since this dataset is not provided with co-reference annotations, it is currently not possible to combine all valid answers. The heart disease challenge dataset has longitudinal notes (∼5 per patient) with record dates. The events in this dataset are also provided with time annotations and are rich in quantitative entities. This dataset was primarily used to answer questions that require temporal and arithmetic reasoning on events. The patient records in the smoking and obesity challenge datasets are categorized into classes with no entity annotations. Thus, for questions generated on these datasets, the entire document acts as evidence and the annotated class information (7 classes) needs to be predicted as the answer.
The total questions, LFs and answers generated using this framework are summarized in Table 1. Consider the question How much does the patient smoke?, for which we do not have i2b2 annotations to provide an answer. In cases where the answer entity is empty, we only generate the question and LF, resulting in more question types being used for QL than QA pairs: only 53% of question types have answers.

emrQA Dataset Analysis
We analyze the complexity of emrQA by considering the LFs for question characteristics, variations in paraphrases, and the type of reasoning required for answering questions (Tables 2, 3, and 4).

Question/Logical Form Characteristics
A quantitative and qualitative analysis of emrQA question templates is shown in Table 3, where logical forms help formalize their characteristics (Su et al., 2016). Questions may request specific fine-grained information (attribute values like dosage), may express a more coarse-grained need (event entities like medications, etc.), or a combination of both. 25% of questions require complex operators (e.g., compare(>)) and 12% of questions express the need for external medical knowledge (e.g., lab.ref_high). The questions in emrQA are highly compositional: 47% of question templates have at least one event relation.

Paraphrase Complexity Analysis
Question templates that map to the same LF are considered paraphrases (e.g., Table 2) and correspond to the same question type. In emrQA, an average of 7 paraphrase templates exist per question type. This is representative of FAQ types that are perhaps more important to the physician. Good paraphrases are lexically dissimilar to each other (Chen and Dolan, 2011). In order to understand the lexical variation within our paraphrases, we randomly select a question from the list of paraphrases as a reference and evaluate the others with respect to the reference, and report the average BLEU (0.74 ± 0.06) and Jaccard score (0.72 ± 0.19). The low BLEU and Jaccard scores with large standard deviation indicate the lexical diversity captured by emrQA's paraphrases (Papineni et al., 2002; Niwattanakul et al., 2013).
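The Jaccard part of this comparison can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and lowercasing, neither of which is specified in the text, and it fixes the first paraphrase as the reference rather than sampling one at random so that the result is reproducible.

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Three paraphrase templates of one question type, taken from Table 2.
paraphrases = [
    "How was the |problem| managed ?",
    "How was the patient's |problem| treated ?",
    "What was done to correct the patient's |problem| ?",
]

reference, others = paraphrases[0], paraphrases[1:]
scores = [jaccard(reference, p) for p in others]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

A lower score means fewer shared tokens between a paraphrase and the reference, that is, more lexical variation.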

Answer Evidence Analysis
33% of the questions in emrQA have more than one answer evidence, with the number ranging from 2 to 61. E.g., the question Medications Record? has all medications in the patient's longitudinal record as answer evidence. In order to analyze the reasoning required to answer emrQA questions, we sampled 35 clinical notes from the corpus and analyzed 3 random questions per note by manually labeling them with the categories described in Table 4. Categories are not mutually exclusive: a single example can fall into multiple categories. We compare and contrast this analysis with SQuAD (Rajpurkar et al., 2016), a popular MC dataset generated through crowdsourcing, to show that the framework is capable of generating a corpus as representative and even more complex. Compared to SQuAD, emrQA offers two new reasoning categories, temporal and arithmetic which make up 31% of the dataset. Additionally, over two times as many questions in emrQA require reasoning over multiple sentences. Long and noisy documents make the question answering task more difficult (Joshi et al., 2017). EMRs are inherently noisy and hence 29% have incomplete context and the document length is 27 times more than SQuAD which offers new challenges to existing QA models. Owing to the domain specific nature of the task, 39% of the examples required some form of medical/world knowledge.
As discussed in Section 4.3, 12% of the questions in the emrQA corpus require a class category from the i2b2 smoking and obesity datasets to be predicted. We also found that 6% of the questions had other possible answers that were not included in emrQA; this is because of the lack of co-reference annotations for the medications challenge.

Baseline Methods
We implement baseline models using neural and heuristic methods for question to logical form (Q-L) and question to answer (Q-A) mapping.

Q-L Mapping
Heuristic Models: We use a template-matching approach where we first split the data into train/test sets, and then normalize questions in the test set into templates by replacing entities with placeholders. The templates are then scored against the ground truth templates of the questions in the train set, to find the best match. The placeholders in the LF template corresponding to the best matched question template are then filled with the normalized entities to obtain the predicted LF. To normalize the test questions we use CLiNER (Boag et al., 2015) for emrQA and Jia and Liang (2016)'s work for ATIS and GeoQuery. Scoring and matching is done using two heuristics: (1) HM-1, which computes an identical match, and (2) HM-2, which generates a GloVe vector (Arora et al., 2016) representation of the templates using sentence2vec and then computes pairwise cosine similarity.
Neural Model: We train a sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) with attention (Bahdanau et al., 2014; Luong et al., 2017) as our neural baseline (2 layers, each with 64 hidden units). The same setting when used with GeoQuery and ATIS gives poor results because the parameters are not appropriate for the nature of those datasets. Hence, for comparison with GeoQuery and ATIS, we use the results of a seq2seq model with a single 200 hidden units layer (Jia and Liang, 2016). At test time we automatically balance missing right parentheses.
Table 5: Heuristic (HM) and neural (seq2seq) models performance on question to logical form learning in emrQA. † results from Jia and Liang (2016).
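The two matching heuristics and the parenthesis fix can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `embed` uses bag-of-words counts as a stand-in for the GloVe-based sentence vectors, `LF_TREATMENT_FOR_PROBLEM` is a hypothetical placeholder for a logical form not given in the text, and `balance_parens` assumes that missing right parentheses are simply appended at the end.

```python
import math
from collections import Counter

def embed(template):
    # Stand-in for a GloVe-based sentence vector: bag-of-words counts.
    return Counter(template.lower().split())

def cosine(u, v):
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_hm1(test_template, train_pairs):
    """HM-1: return the LF of an identical training template, or None."""
    for question_template, lf_template in train_pairs:
        if question_template == test_template:
            return lf_template
    return None

def predict_hm2(test_template, train_pairs):
    """HM-2: return the LF of the most cosine-similar training template."""
    test_vec = embed(test_template)
    _, best_lf = max(train_pairs, key=lambda pair: cosine(test_vec, embed(pair[0])))
    return best_lf

def balance_parens(lf):
    """Append any missing right parentheses to a predicted logical form."""
    missing = lf.count("(") - lf.count(")")
    return lf + ")" * max(missing, 0)

train_pairs = [
    ("What is the dosage of |medication|?", "MedicationEvent(|medication|) [dosage=x]"),
    ("How was the |problem| managed ?", "LF_TREATMENT_FOR_PROBLEM"),  # hypothetical LF label
]

print(predict_hm1("What dosage of |medication| ?", train_pairs))
print(predict_hm2("What dosage of |medication| ?", train_pairs))
```

HM-1 fails on any unseen paraphrase, while HM-2 can still recover the logical form when enough tokens overlap, which is why both depend heavily on the quality of question normalization.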

Experimental Setup
We randomly partition the QL pairs in the dataset into train (80%) and test (20%) sets in two ways. (1) In emrQL-1, we first split the paraphrase templates corresponding to a single LF template into train and test, and then generate the instances of QL pairs. (2) In emrQL-2, we first generate the instances of QL pairs from the templates and then distribute them into train and test sets. As a result, emrQL-1 has more lexical variation between train and test distribution compared to emrQL-2, resulting in increased paraphrase complexity. We use accuracy, i.e., the total number of logical forms predicted correctly, as a metric to evaluate our model.
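The difference between the two partitions can be made concrete with a small sketch for the paraphrases of a single LF template. The helper names, the fixed random seed, and all paraphrase templates other than the first are hypothetical and only serve to illustrate the order of the split and generation steps.

```python
import random

def split(items, train_frac=0.8, seed=0):
    """Shuffle deterministically and cut into train/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

def instantiate(template, entities):
    """Return (template, question) pairs, one per entity."""
    return [(template, template.replace("|medication|", e)) for e in entities]

def emrql_1(paraphrases, entities, seed=0):
    """Split paraphrase templates first, then generate instances."""
    train_t, test_t = split(paraphrases, seed=seed)
    train = [pair for t in train_t for pair in instantiate(t, entities)]
    test = [pair for t in test_t for pair in instantiate(t, entities)]
    return train, test

def emrql_2(paraphrases, entities, seed=0):
    """Generate instances first, then split the instances."""
    instances = [pair for t in paraphrases for pair in instantiate(t, entities)]
    return split(instances, seed=seed)

# Hypothetical paraphrases of one question type and made-up entity values.
paraphrases = [
    "What is the dosage of |medication|?",
    "How much |medication| does the patient take?",
    "What dose of |medication| is the pt on?",
    "|medication| dosage?",
    "What is the current dose of |medication|?",
]
entities = ["Nitroglycerin", "insulin", "aspirin", "metformin"]

train_1, test_1 = emrql_1(paraphrases, entities)
train_2, test_2 = emrql_2(paraphrases, entities)
print(len(train_1), len(test_1), len(train_2), len(test_2))
```

Under emrql_1 no test template is ever seen in training, whereas emrql_2 usually places instances of the same template on both sides of the split, which matches the greater paraphrase difficulty reported for emrQL-1.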

Results
The performance of the proposed models is summarized in Table 5. emrQL results are not directly comparable with GeoQuery and ATIS because of the differences in the lexicon and tools available for the domains. However, it helps us establish that QL learning in emrQA is non-trivial and supports significant future work.
Error analysis of heuristic models on emrQL-1 and emrQL-2 showed that 70% of the errors occurred because of incorrect question normalization. In fact, 30% of these questions had not been normalized at all. This shows that the entities added to the templates are complex and diverse and make the inverse process of template generation non-trivial. This makes for a challenging QL corpus that cannot trivially be solved by template-matching approaches.
Errors made by the neural model on both emrQL-1 and emrQL-2 are due to long LFs (20%) and incorrectly identified entities (10%), which are harder for the attention-based model (Jia and Liang, 2016). The increased paraphrase complexity in emrQL-1 compared to emrQL-2 resulted in 20% more structural errors in emrQL-1, where the predicted event/grammar structure deviates significantly from the ground truth. This shows that the model is not adequately capturing the semantics in the questions to generalize to new paraphrases. Therefore, emrQL-1 can be used to benchmark QL models robust to paraphrasing.

Q-A Mapping
Question-answering on emrQA consists of two different tasks, (1) extraction of answer line from the clinical note (machine comprehension (MC)) and (2) prediction of answer class based on the entire clinical note. We provide baseline models to illustrate the complexity in doing both these tasks.
Machine Comprehension: To do extractive QA on EMRs, we use DrQA's document reader, which is a multi-layer RNN based MC model. We use their best performing settings trained for SQuAD data using GloVe vectors (300 dim-840B).
Class Prediction: We build a multi-class logistic regression model for predicting a class as an answer based on the patient's clinical note. Features input to the classifier are TF-IDF vectors of the question and the clinical notes taken from i2b2 smoking and obesity datasets.

Experimental setup
We consider an 80-20 split of the data for train-test. In order to evaluate worst-case performance, we train on question-evidence pairs in a clinical note obtained by using only one random paraphrase for a question instead of all the paraphrases. We use a slightly modified‡ version of the two popularly reported metrics in MC for evaluation since our evidence span is longer: Exact Match (EM) and F1. Wherever the answer entity in an evidence is explicitly known, EM checks if the answer entity is present within the evidence, otherwise it checks if the predicted evidence span lies within ±20 characters of the ground truth evidence. For F1 we construct a bag of tokens for each evidence string and measure the F1 score of the overlap between the two bags of tokens. Since there may be multiple evidence for a given question, we consider only the top 10 predictions and report an average of EM and F1 over the ground truth number of answers. In the class prediction setting, we report the subset accuracy.
‡ using the original definitions, the evaluated values were far less than those obtained in Table 7
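The modified metrics can be sketched as follows. This is one possible reading of the description above rather than the evaluation script itself: the entity check is taken to be a case-insensitive substring test on the predicted evidence, the ±20 character rule is interpreted as a tolerance on the start offsets of the predicted and ground truth spans, and tokens are obtained by whitespace splitting. The function names and example strings are made up.

```python
from collections import Counter

def exact_match(pred_text, pred_start, gold_start, answer_entity=None, window=20):
    """Relaxed EM: entity containment if known, else start offsets within the window."""
    if answer_entity is not None:
        return answer_entity.lower() in pred_text.lower()
    return abs(pred_start - gold_start) <= window

def token_f1(pred_text, gold_text):
    """F1 of the overlap between the bags of tokens of two evidence strings."""
    pred = Counter(pred_text.lower().split())
    gold = Counter(gold_text.lower().split())
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

# Made-up evidence strings for illustration.
print(exact_match("nitroglycerin 40mg daily", 0, 100, answer_entity="40mg"))
print(exact_match("some evidence line", 10, 25))
print(token_f1("Hypertension-borderline today", "Hypertension-borderline today"))
```

Averaging these scores over the top 10 predictions and the ground truth number of answers, as described above, would be a further step on top of these per-pair functions.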

Results
The performance of the proposed models is summarized in Table 7. DrQA is one of the best performing models on SQuAD with an F1 of 78.8 and EM of 69.5. The relatively low performance of the models on emrQA (60.6 F1 and 59.2 EM) shows that QA on EMRs is a complex task and offers new challenges to existing QA models.
To understand model performance, we macro-average the EM across all the questions corresponding to a LF template. We observe that LFs representing temporal and arithmetic§ needs had < 16% EM. LFs expressing the need for medical KB§ performed poorly since we used general GloVe embeddings. An analysis of LFs which had approximately equal number of QA pair representation in the test set revealed an interesting relation between the model performance and LF complexity, as summarized in Table 6. The trend shows that performance is worse on multiple relation questions as compared to single relation and attribute questions, showing that the LFs sufficiently capture the complexity of the questions and give us an ability to do a qualitative model analysis.
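The per-template aggregation described above amounts to grouping question-level EM values by logical form and averaging within each group. A minimal sketch, assuming results are available as (LF template, EM) pairs with EM equal to 0 or 1; the function name and the second LF label are hypothetical.

```python
from collections import defaultdict

def em_per_lf(results):
    """Average EM over all questions that share a logical form template."""
    buckets = defaultdict(list)
    for lf_template, em in results:
        buckets[lf_template].append(em)
    return {lf: sum(values) / len(values) for lf, values in buckets.items()}

# Made-up per-question outcomes for illustration.
results = [
    ("MedicationEvent(|medication|) [dosage=x]", 1),
    ("MedicationEvent(|medication|) [dosage=x]", 0),
    ("HYPOTHETICAL_TEMPORAL_LF", 0),
]
print(em_per_lf(results))
```

Grouping by logical form keeps frequent question types from dominating the picture, so weak spots such as temporal or arithmetic templates remain visible.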
Error analysis on a random sample of 50 questions containing at least one answer entity in an evidence showed that: (1) 38% of the examples required multiple sentence reasoning, of which 16% were due to a missing evidence in a multiple evidence question, (2) 14% were due to syntactic variation, (3) 10% required medical reasoning and (4) in 14%, DrQA predicted an incomplete evidence span missing the answer entity in it.
§ maximum representation of these templates comes from the i2b2 heart disease risk dataset

Discussion
In this section, we describe how our generation framework may also be applied to generate open-domain QA datasets given the availability of other NLP resources. We also discuss possible extensions of the framework to increase the complexity of the generated datasets.
Open domain QA dataset generation: Consider the popularly used SQuAD (Rajpurkar et al., 2016) reading comprehension dataset generated by crowdworkers, where the answer to every question is a segment of text from the corresponding passage in the Wikipedia article. This dataset can easily be generated or extended using our proposed framework with existing NLP annotations on Wikipedia (Auer et al., 2007;Nothman et al., 2008;Ghaddar and Langlais, 2017).
For instance, consider DBPedia (Auer et al., 2007), an existing dataset of entities and their relations extracted from Wikipedia. It also has its own ontology which can serve as the semantic frames schema to define logical forms. Using these resources, our reverse engineering technique for QA dataset generation can be applied as follows. (1) Question templates can be defined for each entity type and relation in DBPedia. For example¶, consider the relation [place, country] field in DBPedia. For this we can define a question template In what country is |place| located?
(2) Every such question template can be annotated with a logical form template using the existing DBPedia ontology. (3) By considering the entity values of DBPedia fields such as [place=Normandy, dbo:country=France], we can automatically generate the question In what country is Normandy located? and its corresponding logical form from the templates. The text span of country=France from the Wikipedia passage is then used as the answer (Daiber et al., 2013). Currently, this QA pair instance is a part of the SQuAD dev set. Using our framework we can generate many more instances like this example from different Wikipedia passages, without crowdsourcing efforts.
¶ example reference: http://dbpedia.org/page/Normandy
Extensions to the framework: The complexity of the generated dataset can be further extended as follows. (1) We can use a coreferred or a lexical variant of the original entity in the question-logical form generation. This can allow for increased lexical variation between the question and answer line entities in the passage. (2) It is possible to combine two or more question templates to make compositional questions with the answers to these questions similarly combined. This can also result in more multiple sentence reasoning questions. (3) We can generate questions with entities not related to the context in the passage. This can increase empty answer questions in the dataset, resulting in increased negative training examples.

Conclusions and Future Work
We propose a novel framework that can generate a large-scale QA dataset using existing resources and minimal expert input. This has the potential to make a huge impact in domains like medicine, where obtaining manual QA annotations is tedious and infeasible. We apply this framework to generate a large-scale EMR QA corpus (emrQA), consisting of 400,000 question-answer pairs and 1 million question-logical form pairs, and analyze the complexity of the dataset to show its non-trivial nature. We show that the logical forms provide a symbolic representation that is very useful for corpus generation and for model analysis. The logical forms also provide an opportunity to build interpretable systems by perhaps jointly (or latently) learning the logical form and answer for a question. In future, this framework may also be applied to re-purpose and integrate other NLP datasets such as MIMIC and generate a more diverse and representative EMR QA corpus (Johnson et al., 2016).