HEAD-QA: A Healthcare Dataset for Complex Reasoning

We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.


Introduction
Recent progress in question answering (QA) has been led by neural models (Seo et al., 2016; Kundu and Ng, 2018), due to their ability to process raw texts. However, some authors (e.g., Kaushik and Lipton, 2018) have discussed the tendency of research to develop datasets and methods that accommodate the data-intensiveness and strengths of current neural methods. This is the case of popular English datasets such as bAbI (Weston et al., 2015) or SQuAD (Rajpurkar et al., 2016, 2018), where some systems achieve near human-level performance (Hu et al., 2018; Xiong et al., 2017) and surface-level knowledge often suffices to answer. To counteract this, Clark et al. (2016) and others have encouraged progress by developing multi-choice datasets that require reasoning. The questions match grade-school science, given the difficulty of collecting specialized questions. With a similar aim, Lai et al. (2017) released 100k questions and 28k passages intended for middle or high school Chinese students, and Zellers et al. (2018) introduced a dataset for common sense reasoning from a spectrum of daily situations. However, this kind of dataset is scarce for complex domains like medicine: while challenges have been proposed in such domains, like textual entailment (Abacha et al., 2015; Abacha and Dina, 2016) or answering questions about specific documents and snippets (Nentidis et al., 2018), we know of no resources that require general reasoning on complex domains. The novelty of this work lies in this direction: we present a multi-choice QA task that combines the need for knowledge and reasoning with complex domains, and which takes humans years of training to answer correctly.
Contribution We present HEAD-QA, a multi-choice testbed of graduate-level questions about medicine, nursing, biology, chemistry, psychology, and pharmacology (see Table 1).

Number of questions per category (total and per split):

Category       Total   Train    Dev    Test
Biology        1,132     452    226     454
Nursing        1,069     384    230     455
Pharmacology   1,139     457    225     457
Medicine       1,149     455    231     463
Psychology     1,134     453    226     455
Chemistry      1,142     456    228     458
Total          6,765   2,657  1,366   2,742

The HEAD-QA corpus
The Ministerio de Sanidad, Consumo y Bienestar Social (https://www.mscbs.gob.es/), part of the Spanish government, announces every year examinations to apply for specialization positions in its public healthcare areas. The applicants must have a bachelor's degree in the corresponding area (from 4 to 6 years) and they prepare for the exam for a year or more, as the vacancies are limited. The exams are used to discriminate among thousands of applicants, who will choose a specialization and location according to their mark (e.g., in medicine, to access a cardiology or gynecology position at a given hospital). We use these examinations (from 2013 to present) to create HEAD-QA. We consider questions involving the following healthcare areas: medicine (aka MIR), pharmacology (FIR), psychology (PIR), nursing (EIR), biology (BIR), and chemistry (QIR). Radiophysics exams are excluded, due to the difficulty of parsing their content (e.g. equations) from the PDF files. Some questions might be considered invalid after the exams; we remove those questions from the final dataset. Exams from 2013 and 2014 are multi-choice tests with five options, while the rest have just four. The questions mainly refer to technical matters, although some of them also consider social issues (e.g. how to deal with patients in stressful situations). A small percentage (∼14%) of the medicine questions refer to images that provide additional information to answer correctly.
These are included as part of the corpus, although we do not exploit them in this work. For clarity, Table 4 shows an example:

Question: Question linked to image no. 21. A 38-year-old bank employee who has been periodically checked by her company is referred to us to assess the chest X-ray. The patient has smoked 20 cigarettes/day since the age of 21. She says that during the last months she has been somewhat more tired than usual. The basic laboratory tests are normal except for an Hb of 11.4 g/dL. An electrocardiogram and forced spirometry are normal. What do you think is the most plausible diagnostic orientation?

We describe in detail the JSON structure of HEAD-QA in Appendix A. We enumerate below the fields for a given sample:
• The question ID and the question's content.
• Path to the image referred to in the question (if any).
• A list with the possible answers. Each answer is composed of the answer ID and its text.
• The ID of the right answer for that question.
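The fields listed above can be pictured with a small sketch. The class and field names below (`Question`, `Answer`, `qid`, `aid`, `right_answer`, `image`) are hypothetical and chosen only for illustration; the actual JSON keys are the ones documented in Appendix A.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    aid: int    # answer ID (hypothetical field name)
    text: str   # answer text


@dataclass
class Question:
    qid: str                      # question ID (hypothetical field name)
    text: str                     # the question's content
    answers: List[Answer]         # the possible answers
    right_answer: int             # ID of the right answer
    image: Optional[str] = None   # path to the referenced image, if any


def gold_answer(question: Question) -> Answer:
    """Return the answer whose ID matches the question's right-answer ID."""
    for answer in question.answers:
        if answer.aid == question.right_answer:
            return answer
    raise ValueError(
        f"question {question.qid} has no answer with ID {question.right_answer}"
    )


# Illustrative sample with made-up content, not taken from the corpus.
sample = Question(
    qid="1",
    text="Placeholder question text",
    answers=[Answer(1, "first option"), Answer(2, "second option")],
    right_answer=2,
)
```

Keeping the image path optional mirrors the description above, where only some questions refer to an image.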
Although all the approaches that we will be testing are unsupervised or distant-supervised, we additionally define official training, development and test splits, so future research with supervised approaches can be compared with the work presented here. For this supervised setting, we choose the 2013 and 2014 exams for the training set, 2015 for the development set, and the rest for testing. The statistics are shown in Tables 2 and 3. It is worth noting that a common practice to divide a dataset is to rely on randomized splits to avoid potential biases in the collected data. We decided not to follow this strategy for two reasons. First, the questions and the number of questions per area are designed by a team of healthcare experts who already try to avoid these biases. Second (and more relevant), random splits would impede comparison against official (and aggregated) human results.
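The year-based split described above can be written as a one-function sketch. The function name `split_for_year` and the split labels are assumptions made for illustration; only the year-to-split mapping comes from the text.

```python
def split_for_year(year: int) -> str:
    """Map an exam year to its split, following the year-based rule in the text."""
    if year < 2013:
        # The corpus is built from the 2013 exams onwards.
        raise ValueError("the corpus starts with the 2013 exams")
    if year in (2013, 2014):
        return "train"
    if year == 2015:
        return "dev"
    # Every later exam goes to the test set.
    return "test"
```

Because the rule depends only on the exam year, each test exam stays intact, which is what allows comparison against the official aggregated human results.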
Finally, we hope to increase the size of HEAD-QA by including questions from future exams.
English version HEAD-QA is in Spanish, but we include a translation to English (HEAD-QA-EN) using the Google API, which we use to perform cross-lingual experiments. We evaluated the quality of the translation using a sample of 60 random questions and their answers. We relied on two fluent Spanish-English speakers to score the adequacy and on one native English speaker for the fluency, following the scale by Koehn and Monz (2006). The average scores for adequacy were 4.35 and 4.71 out of 5, i.e. most of the meaning is captured; and for fluency 4 out of 5, i.e. good. As a side note, the annotators observed that most names of diseases were successfully translated to English. On the negative side, the translator tended to struggle with elements such as molecular formulae, which are relatively common in chemistry questions.

Methods
Notation We represent HEAD-QA as a list of tuples $[(q_0, A_0), \dots, (q_N, A_N)]$, where $q_i$ is a question and $A_i = [a_{i0}, \dots, a_{im}]$ are the possible answers. We use $\tilde{a}_{ik}$ to denote the predicted answer, ignoring indexes when not needed.
Kaushik and Lipton (2018) discuss the need to provide rigorous baselines that help better understand the improvement coming from future models, and also the need to avoid architectural novelty when introducing new datasets. For this reason, our baselines are based on state-of-the-art systems used in open-domain and multi-choice QA (Chen et al., 2017; Kembhavi et al., 2017).

Control methods
Given the complex nature of the task, we include three control methods:
Random Sampling $\tilde{a}$ according to $\phi$, where $\phi$ is a random distribution.
Blind$_x$ $\tilde{a}_{ik} = a_{ix}\ \forall i$. Always choosing the $x$th option. Tests made by the examiners are not totally random (Poundstone, 2014) and right answers tend to occur more often in middle options.
Length Choosing the longest answer. Poundstone (2014) points out that examiners have to make sure that the right answer is totally correct, which might take more space.
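A minimal sketch of the three control methods follows. It makes two assumptions that the text does not settle: the random control samples uniformly over the options, and answer length is measured in characters. The function names are hypothetical, and each function returns the index of the predicted option.

```python
import random
from typing import List


def random_control(answers: List[str], rng: random.Random) -> int:
    """Pick an option at random (assumption: uniform over the options)."""
    return rng.randrange(len(answers))


def blind_control(answers: List[str], x: int) -> int:
    """Always pick the x-th option (0-based index here)."""
    if not 0 <= x < len(answers):
        raise IndexError("option index out of range for this question")
    return x


def length_control(answers: List[str]) -> int:
    """Pick the longest answer (assumption: length in characters).

    Ties are resolved in favour of the earliest option.
    """
    return max(range(len(answers)), key=lambda k: len(answers[k]))
```

Since the 2013 and 2014 exams have five options and later ones have four, the blind control checks that the requested index exists for the question at hand.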

Strong multi-choice methods
We evaluate an information retrieval (IR) model for HEAD-QA and cross-lingual models for HEAD-QA-EN. Following Chen et al. (2017), we use Wikipedia as our source of information ($D$) for all the baselines. We then extract the raw text and remove the elements that add some type of structure (headers, tables, ...).

Spanish information retrieval
Let $(q_i, [a_{i0}, \dots, a_{im}])$ be a question with its possible answers. We first create one query per candidate answer, $m_k = q_i + a_{ik}$, giving the set $[q_i + a_{i0}, \dots, q_i + a_{im}]$; each query is sent separately to a search engine. In particular, we use DrQA's Document Retriever (Chen et al., 2017), which scores the relation between the queries and the articles as TF-IDF weighted bag-of-words vectors, and also takes into account word order and bi-gram counting. The predicted answer is defined as $\tilde{a}_{ik} = \arg\max_k \mathrm{score}(m_k, D)$, i.e. the answer in the query $m_k$ for which we obtained the highest document relevance. This is equivalent to the IR baselines by Clark et al. (2016).
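The selection rule above can be sketched independently of the retriever. In the code below, `ir_predict` implements only the arg max over per-answer queries; `make_toy_scorer` is a deliberately crude stand-in (shared-token counts against a tiny document list) and is not DrQA's TF-IDF Document Retriever. Both names are hypothetical.

```python
from typing import Callable, List


def ir_predict(question: str, answers: List[str],
               score: Callable[[str], float]) -> int:
    """Return the index k maximising score(q + a_k)."""
    queries = [f"{question} {answer}" for answer in answers]
    scores = [score(query) for query in queries]
    return max(range(len(answers)), key=scores.__getitem__)


def make_toy_scorer(documents: List[str]) -> Callable[[str], float]:
    """Toy relevance: the largest number of tokens a query shares with any document.

    This only stands in for a real search engine so the sketch is runnable.
    """
    doc_tokens = [set(doc.lower().split()) for doc in documents]

    def score(query: str) -> float:
        query_tokens = set(query.lower().split())
        return max((len(query_tokens & tokens) for tokens in doc_tokens), default=0)

    return score


# Made-up documents and question, for illustration only.
docs = ["insulin is produced by the pancreas", "the heart pumps blood"]
prediction = ir_predict("which organ produces insulin",
                        ["liver", "pancreas", "kidney"],
                        make_toy_scorer(docs))
```

Passing the scorer as an argument keeps the arg max logic unchanged if the toy scorer is swapped for an actual retrieval system.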

Cross-lingual methods
Although some research on Spanish QA has been done in the last decade (Magnini et al., 2003;Vicedo et al., 2003;Buscaldi and Rosso, 2006;Kamateri et al., 2019), most recent work has been done for English, in part due to the larger availability of resources. On the one hand this is interesting because we hope HEAD-QA will encourage research on multilingual question answering. On the other hand, we want to check how challenging the dataset is for state-of-the-art systems, usually available only for English. To do so, we use HEAD-QA-EN, as the adequacy and the fluency scores of the translation were high.
Cross-lingual Information Retrieval The IR baseline, but applied to HEAD-QA-EN. We also use this baseline as an extrinsic way to evaluate the quality of the translation, expecting to obtain a performance similar to the Spanish IR model.
Multi-choice DrQA (Chen et al., 2017) DrQA first returns the 5 most relevant documents for each question, relying on the information retrieval system described above. It then tries to find the exact span in those documents that contains the right answer, using a document reader. For this, the authors rely on a neural network system inspired by the Attentive Reader (Hermann et al., 2015) that was trained on SQuAD (Rajpurkar et al., 2016). The original DrQA is intended for open-domain QA, focusing on factoid questions. To adapt it to a multi-choice setup, we select $\tilde{a}$ by comparing the selected span against all the answers and choosing the one that shares the largest percentage of tokens. Non-factoid questions (common in HEAD-QA) are not given any special treatment.
Multi-choice BiDAF Similar to the multi-choice DrQA, but using a BiDAF architecture as the document reader (Seo et al., 2016). The way BiDAF is trained is also different: the reader was first trained on SQuAD and then further tuned on science questions, using continued training. This system might select more than one answer as correct. If this happens, we follow a simple approach and select the longest one.
Multi-choice DGEM and Decompatt The models adapt the DGEM (Parikh et al., 2016) and Decompatt entailment systems. They consider a set of hypotheses $h_{ik} = q_i + a_{ik}$, and each $h_{ik}$ is used as a query to retrieve a set of relevant sentences, $S_{ik}$. Then, an entailment score $\mathrm{entailment}(h_{ik}, s)$ is computed for every $h_{ik}$ and $s \in S_{ik}$, where $\tilde{a}$ is the answer inside the $h_{ik}$ that maximizes the score. If multiple answers are selected, we choose the longest one.
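The two answer-selection heuristics used to adapt these readers to a multi-choice setup can be sketched as follows. The text does not specify how the "percentage of tokens" is normalised or how length is measured, so this sketch assumes whitespace tokenisation, the fraction of an answer's tokens that appear in the span, and length in characters. All function names are hypothetical.

```python
from typing import Iterable, List


def overlap_ratio(span: str, answer: str) -> float:
    """Fraction of the answer's tokens that also occur in the span (assumed definition)."""
    span_tokens = set(span.lower().split())
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    return sum(token in span_tokens for token in answer_tokens) / len(answer_tokens)


def select_by_overlap(span: str, answers: List[str]) -> int:
    """Index of the answer sharing the largest fraction of tokens with the span."""
    return max(range(len(answers)), key=lambda k: overlap_ratio(span, answers[k]))


def longest_selected(selected: Iterable[int], answers: List[str]) -> int:
    """When a system selects several answers, keep the longest (in characters)."""
    return max(selected, key=lambda k: len(answers[k]))
```

For instance, with the made-up span "acute myocardial infarction" and the options "pulmonary embolism", "myocardial infarction" and "acute pancreatitis", the overlap ratios are 0, 1 and 0.5, so the second option is selected.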

Experiments
Metrics We use accuracy and a POINTS metric (used in the official exams): a right answer counts 3 points and a wrong one subtracts 1 point.
Results (unsupervised setting) Tables 5 and 6 show the accuracy and POINTS scores for both HEAD-QA and HEAD-QA-EN. The cross-lingual IR model even outperforms the Spanish one. This is another indicator that the translation is good enough to apply cross-lingual approaches. On the negative side, the approaches based on current neural architectures obtain a lower performance.
Table 7: Accuracy on the HEAD-QA and HEAD-QA-EN corpora (supervised setting)
[...] questions and the answers (this was shown in Table 3). This hypothesis is supported by the lower results on the nursing domain (EIR), the category with the second longest questions/answers. In contrast, the categories for which we obtained the best results, such as pharmacology (FIR) or biology (BIR), have shorter questions and answers. While the evaluated models surpass all control methods, their performance is still well behind human performance. We illustrate this in Table 9, comparing the performance (POINTS score) of our best model against a summary of the human results on the 2016 exams (2016 was the annual examination for which we were able to find the most information). Also, the best performing model was a non-machine learning model based on standard information retrieval techniques. This reinforces the need for effective information extraction techniques that can later be used to perform complex reasoning with machine learning models.
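The two metrics defined at the start of this section are simple enough to state in code. The sketch below assumes that every question receives exactly one predicted answer (the treatment of unanswered questions is not covered here); the function names are hypothetical.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions answered correctly."""
    if len(predicted) != len(gold):
        raise ValueError("predictions and gold answers must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def points(predicted: Sequence[int], gold: Sequence[int]) -> int:
    """POINTS score: +3 for each right answer, -1 for each wrong one."""
    if len(predicted) != len(gold):
        raise ValueError("predictions and gold answers must have the same length")
    return sum(3 if p == g else -1 for p, g in zip(predicted, gold))
```

With three right answers and one wrong answer, accuracy is 0.75 and POINTS is 3·3 − 1 = 8. Under this scoring, uniform guessing over four options has an expected score of 0.25·3 − 0.75·1 = 0 per question, so a positive POINTS total indicates better-than-chance answering on four-option exams.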

Conclusion
We presented a complex multi-choice dataset containing questions about medicine, nursing, biology, pharmacology, psychology and chemistry. Such questions correspond to examinations to access specialized positions in the Spanish healthcare system, and require specialized knowledge and reasoning to be answered. To check its complexity, we tested different state-of-the-art models for open-domain and multi-choice questions. We showed that they struggle with the challenge, being clearly surpassed by a non-machine learning model based on information retrieval. We hope this work will encourage research on designing more powerful QA systems that can carry out effective information extraction and reasoning. We also believe there is room for alternative challenges in HEAD-QA. In this work we have used it as a closed QA dataset (the potential answers are used as input to determine the right one). Nothing prevents using the dataset in an open setting, where the system is given no clue about the possible answers. This would also require considering whether widely used metrics such as BLEU (Papineni et al., 2002) or exact match are appropriate for this particular problem.