What Makes Reading Comprehension Questions Easier?

A challenge in creating a dataset for machine reading comprehension (MRC) is to collect questions that require a sophisticated understanding of language to answer beyond using superficial cues. In this work, we investigate what makes questions easier across 12 recent MRC datasets with three question styles (answer extraction, description, and multiple choice). We propose to employ simple heuristics to split each dataset into easy and hard subsets and examine the performance of two baseline models for each of the subsets. We then manually annotate questions sampled from each subset with both validity and requisite reasoning skills to investigate which skills explain the difference between easy and hard questions. From this study, we observed that (i) the baseline performances for the hard subsets remarkably degrade compared to those of entire datasets, (ii) hard questions require knowledge inference and multiple-sentence reasoning in comparison with easy questions, and (iii) multiple-choice questions tend to require a broader range of reasoning skills than answer extraction and description questions. These results suggest that one might overestimate recent advances in MRC.


Introduction
Evaluating natural language understanding (NLU) systems is a long-established problem in AI (Levesque, 2014). One approach to doing so is the machine reading comprehension (MRC) task, in which a system answers questions about given texts (Hirschman et al., 1999). Although recent studies have made advances (Yu et al., 2018), it is still unclear to what precise extent questions require understanding of texts (Jia and Liang, 2017).
In this study, we examine MRC datasets and discuss what is needed to create datasets suitable for the detailed testing of NLU. Our motivation originates from studies that demonstrated unintended biases in the sourcing of other NLU tasks, in which questions contain simple patterns and systems can recognize these patterns to answer them (Gururangan et al., 2018; Mostafazadeh et al., 2017).
Figure 1: Example from the SQuAD dataset (Rajpurkar et al., 2016). The baseline system can answer the token-limited question and, even if there are other candidate answers, it can easily attend to the answer-contained sentence (s1) by watching word overlaps.
We conjecture that a situation similar to this occurs in MRC datasets. Consider the question shown in Figure 1, for example. Although the question, starting with when, requires an answer that is expressed as a moment in time, there is only one such expression (i.e., November 2014) in the given text (we refer to the text as the context). In other words, the question has only a single candidate answer. The system can solve it merely by recognizing the entity type required by when. In addition to this, even if another expression of time appears in other sentences, only one sentence (i.e., s1) appears to be related to the question; thus, the system can easily determine the correct answer by attention, that is, by matching the words appearing both in the context and the question. Therefore, this kind of question does not require a complex understanding of language, e.g., multiple-sentence reasoning, which is known as a more challenging task (Richardson et al., 2013).
In Section 3, we define two heuristics, namely entity-type recognition and attention. We specifically analyze the differences in the performance of baseline systems for the following two configurations: (i) questions answerable or unanswerable with the first k tokens; and (ii) questions whose correct answer appears or does not appear in the context sentence that is most similar to the question (henceforth referred to as the most similar sentence). Although similar heuristics are proposed by Weissenborn et al. (2017), ours are utilized for question filtering, rather than system development. Using these simple heuristics, we split each dataset into easy and hard subsets for further investigation of the baseline performance.
After conducting the experiments, we analyze the following two points in Section 4. First, we consider which questions are valid for testing, i.e., reasonably solvable. Second, we consider what reasoning skills are required and whether this exposes any differences among the subsets. To investigate these two concerns, we manually annotate sample questions from each subset in terms of validity and required reasoning skills, such as word matching, knowledge inference, and multiple sentence reasoning.
We examine 12 recently proposed MRC datasets (Table 1), which include answer extraction, description, and multiple-choice styles. We also observe differences based on these styles. For our baselines, we use two neural-based systems, namely, the Bidirectional Attention Flow (Seo et al., 2017) and the Gated-Attention Reader (Dhingra et al., 2017).
In Section 5, we describe the advantages and disadvantages of different question styles with regard to evaluating NLU systems. We also interpret our heuristics for constructing realistic MRC datasets.
Our contributions are as follows:
• This study is the first large-scale investigation across 12 recent MRC datasets with three question styles.
• We propose to employ simple heuristics to split each dataset into easy and hard subsets and examine the performance of two baseline models for each of the subsets.
We observed the following:
• The baseline performances for the hard subsets remarkably degrade compared to those of entire datasets.
• Our annotation study shows that hard questions require knowledge inference and multiple-sentence reasoning in comparison with easy questions.
• Compared to questions with answer extraction and description styles, multiple-choice questions tend to require a broader range of reasoning skills while exhibiting answerability, multiple answer candidates, and unambiguity.

Baseline Systems
We employed the following two widely used baselines.
Bidirectional Attention Flow (BiDAF) (Seo et al., 2017) was used for the answer extraction and description datasets. BiDAF models bi-directional attention between the context and question. It achieved state-of-the-art performance on the SQuAD dataset.
Gated-Attention Reader (GA) (Dhingra et al., 2017) was used for the multiple-choice datasets. GA has a multi-hop architecture with an attention mechanism. It achieved state-of-the-art performance on the CNN/Daily Mail and Who-did-What datasets.
Why we used different baseline systems: The multiple-choice style can be transformed to answer extraction, as mentioned in . However, in some datasets, many questions have no textual overlap to determine the correct answer span in the context. Therefore, in order to avoid underestimating the baseline performance of those datasets, we used the GA system which is applicable to multiple choice questions.
We scored the performance using exact match (EM)/F1 (Rajpurkar et al., 2016), Rouge-L (Lin, 2004), and accuracy for the answer extraction, description, and multiple-choice datasets, respectively (henceforth, we refer to these collectively as the score, for simplicity). For the description datasets, we determined in advance the answer span of the context that gives the highest Rouge-L score to the human-generated gold answer. We computed the Rouge-L score between the predicted span and the gold answer.
Reproduction of the baseline performance: We used the same architecture as the official baseline systems unless specified otherwise. All systems were trained on the training set and tested on the development/test set of each dataset. We also used different hyperparameters for each dataset according to characteristics such as context length (see Appendix A for details). We show the baseline performance of both the official results and those from our implementations in Tables 2 and 3. Our implementations outperformed or showed comparable performance to the official baseline on most datasets. However, in TriviaQA, MCTest, RACE, and ARC-E, our baseline performance did not reach that of the official baseline, due to differences in architecture or the absence of reported hyperparameters in the literature.
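The span selection for the description datasets described above can be illustrated with a short sketch. It is not the authors' implementation: the brute-force search, the `max_len` cap, the whitespace tokenization implied by the token lists, and the `beta = 1.2` weighting are all assumptions made for illustration. Only the use of an LCS-based Rouge-L score (Lin, 2004) comes from the text.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: List[str], reference: List[str], beta: float = 1.2) -> float:
    """LCS-based F-measure; beta is an assumed weighting, not taken from the paper."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


def best_span(context: List[str], gold: List[str], max_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) of the context span with the highest Rouge-L.

    The span is context[start:end]; max_len is a hypothetical cap that keeps
    the exhaustive search tractable on long contexts.
    """
    best = (0, 0, 0.0)
    for start in range(len(context)):
        for end in range(start + 1, min(len(context), start + max_len) + 1):
            score = rouge_l(context[start:end], gold)
            if score > best[2]:
                best = (start, end, score)
    return best
```

Under these assumptions, the selected span serves as the extractive target for training, while scoring still compares the predicted span with the human-written gold answer.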

Two Filtering Heuristics
The first goal of this paper is to determine whether there are unintended biases of the kind exposed in Figure 1 in MRC datasets. We examined the influence of the two filtering heuristics: (i) entity type recognition (Section 3.1) and (ii) attention (Section 3.2). We then investigated the performance of the baseline systems on the questions filtered by the defined heuristics (Section 3.3).

Entity Type-based Heuristic
The aim of this heuristic was to detect questions that can be solved based on (i) the existence of a single candidate answer that is restricted by expressions such as "wh-" and "how many," and (ii) lexical patterns that appear around the correct answer. Because the query styles are not uniform across datasets (e.g., MARCO uses search engine queries), we could not directly use interrogatives. Instead, we simply provided the first k tokens of questions to the baseline systems. We chose smaller values for k than the (macro) average of the question length across the datasets (= 12.2 tokens). For example, with k = 4, the question "will I qualify for OSAP if I'm new in Canada" (excerpted from MARCO) is reduced to "will I qualify for." Even if the tokens do not have an interrogative, the system may recognize lexical patterns around the correct answer. Questions that can be solved by examining these patterns were also of interest when filtering.
Table 2: Statistics from the answer extraction and description datasets and their baselines. Dev represents a development set. Ans in sim sent refers to questions whose answer appears in the sentence that is most similar to the question. 1 The questions are not complete sentences and may start with more specific words than interrogatives. 2 Verified set. 3 No answer questions were removed. 4 The Passage Ranking model (Nguyen et al., 2016).
Results: Tables 2 and 3 present the results for k = 1, 2, 4. In addition, to know the exact ratio of the questions that are solved rather than the scores for the answer extraction and description styles, we counted questions with k = 2 that achieved a score ≥ 0.5. As k decreased, so too did the baseline performance on all datasets in Table 2 except QAngaroo. By contrast, in QAngaroo and the multiple-choice datasets, the performance did not degrade so strongly. In particular, the difference between the scores on the full and k = 1 questions in QAngaroo was 1.8. Because the questions in QAngaroo are not complete sentences, but rather knowledge-base entries that have a blank, such as country of citizenship Henry VI of England, this result implies that the baseline system can infer the answer merely by the first token of questions, i.e., the type of knowledge-base entry.
In most multiple-choice datasets, the k = 1 scores were significantly higher than random-choice scores. Given that multiple-choice questions offer multiple options that are of valid entity/event types, this gap was not necessarily caused by the limited number of candidate answers, as in the case with the answer extraction datasets. Therefore, we inferred that in the solved questions, incorrect options appeared less often than the correct option, or did not appear at all, in the context (such questions were regarded as solvable exclusively using the word match skill, which we analyzed in Section 4). Remarkably, although we failed to achieve a higher baseline performance, the score for the complete questions in MCTest was lower than that of the k = 1 questions. This result showed that the MCTest questions were sufficiently difficult such that it was not especially useful for the baseline system to consider the entire question statement.
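The question truncation used by this heuristic, and the count of questions reaching a score of at least 0.5, can be sketched as follows. This is a minimal illustration: whitespace tokenization and the function names `first_k_tokens` and `solved_ratio` are assumptions, and the scores would come from a trained baseline that is not shown here.

```python
from typing import Sequence


def first_k_tokens(question: str, k: int) -> str:
    """Keep only the first k whitespace-separated tokens of a question."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return " ".join(question.split()[:k])


def solved_ratio(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of questions whose score reaches the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


# The MARCO example from the text, truncated with k = 4.
truncated = first_k_tokens("will I qualify for OSAP if I'm new in Canada", 4)
```

The truncated question is then fed to the baseline in place of the full question, and the resulting per-question scores are passed to `solved_ratio`.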

Attention-based Heuristic
Next, we examined in each dataset (i) how many questions have their correct answers in the most similar sentence and (ii) whether a performance gap exists for such questions (i.e., whether such questions are easier than the others).
We used uni-gram overlap as a similarity measure. We counted how many times question words appeared in each sentence, where question words were stemmed and stopwords were dropped. We then checked whether the correct answer appeared in the most similar sentence. For the multiple-choice datasets, we selected the text span that provided the highest Rouge-L score with the correct option as the correct answer.
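A minimal sketch of this similarity check is given below. The stopword list, the crude suffix-stripping stemmer, the tokenizer, and the rule of taking the first sentence on ties are all stand-ins chosen to keep the example self-contained; the text does not specify which stemmer, stopword list, or sentence splitter was used.

```python
import re
from typing import List

# Illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {
    "the", "a", "an", "is", "was", "are", "were", "in", "on", "of", "to",
    "by", "it", "when", "what", "who", "where", "which", "how", "did", "do",
}


def crude_stem(word: str) -> str:
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_stems(text: str) -> List[str]:
    """Lowercase, tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def most_similar_sentence(question: str, sentences: List[str]) -> int:
    """Index of the sentence with the highest unigram overlap with the question."""
    q_stems = set(content_stems(question))
    counts = [sum(1 for s in content_stems(sent) if s in q_stems) for sent in sentences]
    return max(range(len(sentences)), key=counts.__getitem__)  # first index wins ties


def answer_in_most_similar(question: str, sentences: List[str], answer: str) -> bool:
    """Check whether the answer string occurs in the most similar sentence."""
    idx = most_similar_sentence(question, sentences)
    return answer.lower() in sentences[idx].lower()
```

For the multiple-choice datasets, `answer` would be the context span with the highest Rouge-L score against the correct option rather than a gold answer string.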
Results: Tables 2 and 3 show the results. Considering the average number of context sentences, most datasets contained a significantly high proportion of questions whose answers were in the most similar sentence.
In the answer extraction and description datasets, except QAngaroo, the baseline performance improved when the correct answer appeared in the most similar sentence, and gaps were found between the performances on these questions and the others. These gaps indicated that the dataset may lack balance for testing NLU. If these questions tend to require the word matching skill exclusively, attending to the other portion is useful in studying a more realistic NLU, e.g., common-sense reasoning and discourse understanding. Therefore, we investigated whether these questions merely require word matching (see Section 4).
Meanwhile, in the first three multiple-choice datasets, the performance differences were marginal or reversed, implying that although the baseline performance was not especially high, the difficulty of these questions for the baseline system was not affected by whether their correct answers appeared in the most similar sentence.
We further analyzed the baseline performance after removing the context and leaving only the most similar sentence. In AddSent and QAngaroo, the scores remarkably improved (>20 F1). From this result, we can infer that on these datasets the baseline systems were distracted by other sentences in the context. This observation was supported by the results from the AddSent dataset (Jia and Liang, 2017), which contains manually injected distracting sentences (i.e., adversarial examples).

Performance on Hard Subsets
In the previous two sections, we observed that in the examined datasets (i) some questions were solved by the baseline systems merely with the first k tokens and/or (ii) the baseline performances increased for questions whose answers were in the most similar sentence. We were concerned that these two factors would become dominant in measuring the baseline performance using the datasets. Hence, we split each development/test set into easy and hard subsets for further investigation.
Hard subsets: A hard subset comprised questions (i) whose score is not positive when k = 2 and (ii) whose correct answer does not appear in the most similar sentence. The easy subsets comprised the remaining questions. We aimed to investigate the gap of the performance values between the easy and hard subsets. If the gap is large, the dataset may be strongly biased toward questions that are solved by recognizing entity types or lexical patterns and may not be suitable for measuring the system's ability for complex reasoning.
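The split defined above can be written as a small filter. The sketch assumes that the two per-question signals have already been computed; the `Question` record and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Question:
    qid: str
    score_k2: float           # baseline score given only the first two question tokens
    answer_in_sim_sent: bool  # whether the answer is in the most similar sentence
    full_score: float         # baseline score given the full question


def split_easy_hard(questions: List[Question]) -> Tuple[List[Question], List[Question]]:
    """Hard: non-positive score at k = 2 AND answer not in the most similar sentence."""
    hard = [q for q in questions if q.score_k2 <= 0 and not q.answer_in_sim_sent]
    easy = [q for q in questions if not (q.score_k2 <= 0 and not q.answer_in_sim_sent)]
    return easy, hard


def mean_score(subset: List[Question]) -> float:
    """Average full-question score of a subset (0.0 for an empty subset)."""
    return sum(q.full_score for q in subset) / len(subset) if subset else 0.0
```

Comparing `mean_score(easy)` with `mean_score(hard)` gives the performance gap discussed in this section. Note that both conditions must hold for a question to be hard, so a question that fails only one of the two heuristics stays in the easy subset.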
Results and clarification: The bottom row of Tables 2 and 3 shows that the baseline performances on the hard subset remarkably decreased in almost all examined datasets. These results suggest that the previously perceived ability of the baseline systems may be overestimated. However, we clarify that our intention is not to remove the questions solved or mitigated by our defined heuristics to create a new hard subset because this may generate new biases as indicated in Gururangan et al. (2018). Rather, we would like to emphasize the importance of the defined heuristics when sourcing questions. Indeed, insufficient attention to these heuristics can lead to unintended biases.

Annotating Question Validity and Required Skills

Annotation Specifications
Objectives: To complement the observations in the previous sections, we annotated sampled questions from each subset of the datasets. Our motivation can be summarized as follows: (i) How many questions are valid in each dataset? That is, the hard questions may not in fact be hard, but just unsolvable, as indicated in Chen et al. (2016). (ii) What kinds of reasoning skills explain the easy/hard questions? (iii) Are there any differences among the datasets and the question styles?
We annotated the minimum skills required to choose the correct answer among other candidates. We assumed that the solver knows what type of entity or event is entailed by the question.
Annotation labels: Our annotation labels (Table 4) were inspired by previous works such as Chen et al. (2016), Trischler et al. (2017), and Lai et al. (2017). The major modifications were twofold: (i) detailed question validity, including a number of reasonable candidate answers and answer ambiguity, and (ii) posing multiple-sentence reasoning as a skill compatible with other skills.
Reasoning types indeed have other classifications. For instance, Lai et al. (2017) defined five reasoning types, including attitude analysis and whole-picture reasoning. We incorporated them into the knowledge and meta/whole classes. Boratko et al. (2018) proposed detailed knowledge and reasoning types, but these were specific to science exams and, thus, omitted from our study.
Independent of the abovementioned reasoning types, we checked whether the question required multiple-sentence reasoning to answer. As another modification, we extended the notion of "sentence" in our annotation and considered a subordinate clause as a sentence. This modification was intended to deal with the internal complexity of a sentence with multiple clauses, which can also render a question difficult.
Validity
1. Unsolvable - the context coupled with the question does not reasonably give the answer.
2. Single candidate - the question does not have multiple candidate answers.
3. Ambiguous - the question does not have a unique, decidable answer, or multiple possible answers are not covered by the gold answers.
Reasoning skill
4. Word matching - matching the context and question words.
5. Paraphrasing - using lexical and grammatical knowledge.
6. Knowledge - inference using commonsense and/or world knowledge.
7. Meta/Whole - understanding meta terms, such as "author" and "writer," and comprehending the general context.
8. Math/Logic - using mathematical and logical knowledge, including multiple-choice questions that ask "which option is not true."
Multiple-sentence reasoning
9. (i) coreference (ii) causal relation (iii) spatial-temporal relations (iv) none - gathering cues from multiple sentences/clauses.
Table 4: Annotation labels. One of the reasoning skills is annotated with the questions that are "no" in all validity labels. Multiple-sentence reasoning is independent of reasoning skills and annotated with all valid questions.
Settings: For each subset of the datasets, 30 questions were annotated. Therefore, we obtained annotations for 30 × 2 × 12 = 720 questions. The annotation was performed by the authors. The annotator was given the context, question, and candidate answers for multiple-choice questions along with the correct answer. To reduce bias, the annotator did not know whether the questions came from the easy or hard subset, and was not told the predictions and scores of the respective baseline systems.

Annotation Results
Tables 5 and 6 show the annotation results.
Validity: TriviaQA, QAngaroo, and ARCs revealed a relatively high unsolvability, which seemed to be caused by the unrelatedness between the questions and their context. For example, QAngaroo's context was gathered from Wikipedia articles that were not necessarily related to the questions.6 The context passages in ARCs were curated from textbooks that may not provide sufficient information to answer the questions.7 Note that it is possible for unsolvable questions to be permitted, and that the system must indicate them in some datasets, such as QA4MRE, NewsQA, MARCO, and SQuAD (v2.0). For the single-candidate label, we found that few questions had only a single candidate answer. Furthermore, there were even fewer single-candidate answers in AddSent than in SQuAD. This result supported the claim that the adversarial examples augmented the number of possible candidate answers, thereby degrading the baseline performance.
6 [...] questions were difficult for the baseline system.
7 Our analysis was not intended to undermine the quality of these questions. We refer readers to .
Table 7: Pearson's correlation coefficients (r) between the annotation labels and the baseline scores with p < 0.05.
In our annotation, ambiguous questions were found to be those with multiple correct spans. Figure 2 shows an example. In this case, several answers aside from "93" were correct. Ambiguity is an important feature because it can lead to unstable scoring in EM/F1.
Figure 2: Example of an ambiguous question from NewsQA (Trischler et al., 2017).
The multiple-choice datasets mostly comprised valid questions, with the exception of the unsolvable questions in the ARC datasets.
Reasoning skills: We can see that word matching was more important in the easy subsets, and knowledge was more pertinent to the hard subsets in 10 of the 12 datasets. These results confirmed that the manner by which we split the subsets was successful at filtering questions that were relatively easy in terms of reasoning skills. However, we did not observe this trend with paraphrasing, which seemed difficult to distinguish from word matching and knowledge. With regard to meta/whole and math/logic, we can see that these skills were needed less in the answer extraction and description datasets. They were more pertinent to the multiple-choice datasets.

Multiple-sentence reasoning: Multiple-sentence reasoning was more correlated with the hard subsets in 10 of the 12 datasets. Although NewsQA showed the inverse tendency for word matching, knowledge, and multiple-sentence reasoning, we suspect that this was caused by annotation variance and filtering a large portion of ambiguous questions. For relational types, we did not see a significant trend in any particular type.

Correlation of labels and baseline scores: Across all examined datasets, we analyzed the correlations between the annotation labels and the scores of each baseline system in Table 7. In spite of the small size of the annotated samples, we derived statistically significant correlations for six labels. These results confirmed that BiDAF performed well for the word matching questions and relatively poorly with the knowledge questions. By contrast, we did not observe this trend in GA.
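The correlation between a binary annotation label and a per-question baseline score can be computed as in the following sketch. It covers only the coefficient r; the significance test behind the reported p < 0.05 is not reproduced here, and the encoding of labels as 0/1 is an assumption. A statistics package such as SciPy (`scipy.stats.pearsonr`) would return both r and a p-value.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: 1 = question annotated as "word matching", 0 = otherwise,
# paired with made-up baseline scores for the same four questions.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2]
r = pearson_r(labels, scores)
```

A positive r for a label means that questions carrying it tend to receive higher baseline scores, which is the pattern reported for BiDAF on word matching questions.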

Discussion
In this section, we discuss the advantages and disadvantages of the question styles. We also interpret the defined heuristics in terms of constructing more realistic MRC datasets.
Differences among the question styles: The biggest advantage to the answer extraction style is its ease in generating questions, which enables us to produce large-scale datasets. In contrast, a disadvantage to this style is that it rarely demands meta/whole and math/logic skills, which can require answers not contained in the context. Moreover, as observed in Section 4, it seems difficult to guarantee that all possible answer spans are given as the correct answers. By contrast, the description and multiple-choice styles have the advantage of having no such restrictions on the appearance of candidate answers (Kočiský et al., 2018; Khashabi et al., 2018). Nonetheless, the description style is difficult to evaluate because the Rouge-L and BLEU scores are insufficient for testing NLU. Whereas it is easy to evaluate the performance on multiple-choice questions, generating multiple reasonable options requires considerable effort.
Interpretation of our heuristics: When we regard the MRC task as recognizing textual entailment (RTE) (Dagan et al., 2006), the task requires the reader to construct one or more premises from the context and form the most reasonable hypothesis from the question and candidate answer (Sachan et al., 2015). Thus, easier questions are those (i) where the reader needs to generate only one hypothesis, and (ii) where the premises directly describe the correct hypothesis. Our two heuristics can also be seen as the formalizations of these criteria. Therefore, to make questions more realistic, we need to create multiple hypotheses that require complex reasoning to be distinguished. Moreover, the integration of premises should be complemented by external knowledge to provide sufficient information to verify the correct hypothesis.
Unintended biases: The MRC task tests a reading process that involves retrieving stored information and performing inferences (Sutcliffe et al., 2013). However, constructing datasets that comprehensively require those skills is difficult. As Levesque (2014) discussed as a desideratum for testing AI, we should avoid creating questions that can be solved by matching patterns, using unintended biases, and selectional restrictions. For the unintended biases, one suggestive example is the Story Cloze Test (Mostafazadeh et al., 2016), in which a system chooses a sentence among candidates to conclude a given paragraph of the story. A recent attempt at this task showed that recognizing superficial features in the correct candidate is critical to achieve the state of the art (Schwartz et al., 2017).
Similarly, in MRC, Weissenborn et al. (2017) proposed a context/type matching heuristic to develop a simple neural system. Min et al. (2018) observed that, in SQuAD, 92% of answerable questions can be answered only using a single context sentence. In visual question answering, Agrawal et al. (2016) analyzed the behavior of models with the variable length of the first question words. Khashabi et al. (2018) more recently proposed a dataset with questions for multi-sentence reasoning.

Evaluation overfitting: The theory behind evaluating AI distinguishes between task- and skill-oriented approaches (Hernández-Orallo, 2017). In the task-oriented approach, we usually develop a system and test it on a specific dataset. The developed system sometimes lacks generality but achieves the state of the art for that specific dataset. Further, it becomes difficult to verify and explain the solution to tasks. The situation in which we are biased to the specific tasks is called evaluation overfitting (Whiteson et al., 2011). By contrast, with the skill-oriented approach, we aim to interpret the relationships between tasks and skills. This orientation can encourage the development of more realistic NLU systems.
Because one of our goals was to investigate whether easy questions are dominant in recent datasets, our study did not necessarily require a detailed classification of reasoning types. Nonetheless, we recognize there are more fine-grained classifications of the required skills for NLU. For example, Weston et al. (2015) defined 20 skills as a set of toy tasks. Sugawara et al. (2017) also organized 10 prerequisite skills for MRC. LoBue and Yates (2011) and Sammons et al. (2010) analyzed entailment phenomena using detailed classifications in RTE. For the ARC dataset, Boratko et al. (2018) proposed knowledge and reasoning types.

Conclusion
This study examined MRC questions from 12 datasets to determine what makes such questions easier to answer. We defined two heuristics that limit candidate answers and thereby mitigate the difficulty of questions. Using these heuristics, the datasets were split into easy and hard subsets. We further annotated the questions with their validity and the reasoning skills needed to answer them. Our experiments revealed that the baseline performance degraded with the hard questions, which required knowledge inference and multiple-sentence reasoning compared to easy questions. These results suggest that one might overestimate the ability of the baseline systems. They also emphasize the importance of analyzing and reporting the properties of new datasets when released. One limitation of this work was the heavy cost of the annotation. In future research, we plan to explore a method for automatically classifying reasoning types. This will enable us to evaluate systems through a detailed organization of the datasets.