A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset

The recent work of Clark et al. (2018) introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset that partitions open domain, complex science questions into easy and challenge sets. That paper includes an analysis of 100 questions with respect to the types of knowledge and reasoning required to answer them; however, it does not include clear definitions of these types, nor does it offer information about the quality of the labels. We propose a comprehensive set of definitions of knowledge and reasoning types necessary for answering the questions in the ARC dataset. Using ten annotators and a sophisticated annotation interface, we analyze the distribution of labels across the challenge set and statistics related to them. Additionally, we demonstrate that although naive information retrieval methods return sentences that are irrelevant to answering the query, sufficient supporting text is often present in the (ARC) corpus. Evaluating with human-selected relevant sentences improves the performance of a neural machine comprehension model by 42 points.


Introduction
The recent work of Clark et al. (2018) introduces the AI2 Reasoning Challenge (ARC; http://data.allenai.org/arc/) and the associated ARC dataset. This dataset contains science questions from standardized tests that are separated into an Easy Set and a Challenge Set. The Challenge Set consists of questions that are answered incorrectly by two solvers based on Pointwise Mutual Information (PMI) and Information Retrieval (IR). In addition to this division, a survey of the various types of knowledge as well as the types of reasoning that are required to answer various questions in the ARC dataset was presented. This survey was based on an analysis of 100 questions chosen at random from the Challenge Set. However, very little detail is provided about the questions chosen, the annotations provided, or the methodology used. These issues go to the very core of the paper, since its main contribution is a dataset that contains complex questions.
Additionally, while their manual analysis suggests that 95% of the questions can be answered using the ARC corpus, Clark et al. (2018) note that the IR system (Elasticsearch) serves as a severe bottleneck. Our annotation process supports this observation, but we also find that simple reformulations of the query can greatly increase the quality of the retrieved sentences.
Contributions: In this work, in order to overcome some of the limitations of Clark et al. (2018) described above, we present a detailed annotation process for the ARC dataset. Specifically, we (a) introduce a novel labeling interface that allows a distributed set of annotators to label the knowledge and reasoning types; and (b) improve upon the knowledge and reasoning type categories provided previously, in order to make the annotation more intuitive and accurate. Following an annotation round involving over ten people at two institutions, we measure and report statistics such as inter-rater agreement, and the distribution of knowledge and reasoning type labels in the dataset. We then (c) clarify the role of knowledge and reasoning within the ARC dataset with a comprehensive set of annotations for both the questions and returned results, demonstrating the efficacy of query refinement to improve existing QA systems. Our annotators were also asked to mark whether individual retrieved sentences were relevant to answering a given question. Our labeling interface logs the reformulated queries issued by each annotator, as well as their relevance annotations. To quantitatively demonstrate the effectiveness of the relevant sentences, we (d) evaluate a subset of questions and the relevant retrieval results with a pre-trained DrQA model (Chen et al., 2017), and find that the performance of the system increases by 42 points.

Related Work
To explore reading comprehension as a research problem, Hirschman et al. (1999) manually created a dataset of 3rd and 6th grade reading comprehension questions with short answers. The techniques that were explored for this dataset included pattern matching, rules, and logistic regression. More such datasets have been created that include natural language questions: for instance, MCTest (Richardson et al., 2013). MCTest is crowdsourced and comprises 660 children's fictional stories, which are the source of questions and multiple choice answers. Questions and answers were constructed with a restrictive vocabulary that a 7-year-old could understand. Half of the questions require the answer to be derived from two sentences, with the motivation being to encourage research in multi-hop (one-hop) reasoning. Recent techniques such as Wang et al. (2015) and Yin et al. (2016) have performed well on this dataset. Currently, SQuAD (Rajpurkar et al., 2016) is one of the most popular datasets for reading comprehension: it uses Wikipedia passages as its source, and question-answer pairs are created using crowdsourcing. While it is stated that SQuAD requires logical reasoning, the complexity of reasoning required is far less than that for the AI2 standardized tests dataset (Clark and Etzioni, 2016; Kembhavi et al., 2017). NewsQA (Trischler et al., 2016) is another dataset that was created using crowdsourcing; it utilizes passages from 10,000 news articles to create questions.
Most of the datasets mentioned above are closed domain, where the answer exists in a given snippet of text. On the other hand, in the open domain setting, the question-answer datasets are constructed to encompass the whole pipeline for question answering, starting with the retrieval of relevant documents. SearchQA (Dunn et al., 2017) is an effort to create such a dataset; it contains 140K question-answer (QA) pairs. While the motivation was to create an open domain dataset, SearchQA provides text that contains 'evidence' (a set of annotated search results) and hence falls short of being a complete open-domain QA dataset. TriviaQA (Joshi et al., 2017) is another reading comprehension dataset that contains 650K QA pairs with evidence.
Datasets created from standardized science tests offer some of the few existing examples of questions that require exploration of complex reasoning techniques to find solutions. A number of science-question focused datasets have been released over the past few years. The AI2 Science Questions dataset was introduced by Clark (2015) along with the Aristo Framework, on which we build. This dataset contains several thousand multiple choice questions from state and federal science tests for elementary and middle school students. Multiple surveys of the knowledge base requirements for accomplishing this task (Clark et al., 2013; Jansen et al., 2016) concluded that advanced inference methods were necessary for many of the questions, as they could not be answered by simple fact-based retrieval. The Worldtree corpus (Jansen et al., 2018) augments more than 2,000 of the AI2 questions with explanation graphs (Jansen, 2017) that aggregate multiple facts from a semi-structured tablestore of knowledge. Khashabi et al. (2017) showed that an IR-based QA system can be improved substantially by focusing on the terms that are essential in the question. They also introduce a dataset of 2,200 questions where the essential terms are annotated. The SciQ Dataset (Welbl et al., 2017) contains 13,679 crowdsourced multiple choice science questions. To construct this dataset, workers were shown a passage and asked to construct a question along with correct and incorrect answer options. The dataset contains both the source passage as well as the question and answer options.

ARC Dataset Annotations
Table 1: Knowledge Label, Instructions, Example Question

Knowledge Label: Definition
Instructions: A question should be labeled definition if it requires you to know the definition of a term and use only that definition to answer the question. Recall is necessary to answer the question. Examples include questions that require direct recall, questions that require picking an exemplar or element from a set.
Example Question: What is a worldwide increase in temperature called? (A) greenhouse effect (B) global warming (C) ozone depletion (D) solar heating

Knowledge Label: Basic Facts
Instructions: A question should be labeled basic facts if you would require basic facts or properties about a term that would not necessarily capture the full textbook definition. Examples include how many earth rotations are in a day and what the advantages of large family sizes are for animals.
Example Question: Which element makes up most of the air we breathe? (A) carbon (B) nitrogen (C) oxygen (D) argon

Knowledge Label: Causes & Processes
Instructions: A question should be labeled causes if answering it requires recognizing an ordered process (two or more sequential, related events) described in the question.
Example Question: What is the first step of the process in the formation of sedimentary rocks?

In previous work (Clark et al., 2018), the standardized test questions under consideration are split into various categories based on the kinds of knowledge and reasoning that are needed to answer those questions. The idea of classifying questions by these two types is central to the notion of standardized testing, which endeavors to test students on various kinds of knowledge, as well as various problem types and solution techniques.
In accordance with this, Clark et al. (2018) provide preliminary definitions for knowledge and reasoning categories that can be employed by a QA system to solve a given question. These categories allow for the classification of questions, which makes it easier to partition them into sets to measure performance and improve solution strategies. In this work, we present an interface (cf. Section 4) and annotation rules that seek to turn this classification of questions into a systematic process. In this section, we first discuss the classification types and associated annotation rules.

Knowledge Types
In most question-answering (QA) scenarios, the knowledge that is present with the system (or the agent) determines whether a given question can be answered. The full list of the revised knowledge labels (types), along with the instructions given to annotators and respective exemplars from the ARC question set, is given in Table 1. The labelers were given the following instructions at the beginning of the annotation process:
You are to answer the question, "In a perfect world given an ideal knowledge source, what types of knowledge would you as a human need to answer this question?" You are allowed to select multiple labels for this type which will be recorded as an ordered list. You are to assign labels in the order of importance to answering the questions at hand.

Table 2: Reasoning Label, Instructions, Example Question

Reasoning Label: Question Logic
Instructions: Questions which only make sense in the context of a multiple-choice question. That is, absent the choices, the question makes no sense; after being provided the choices, the question contains (nearly) all necessary information to answer the question. If taking away the answer options makes it impossible to answer, then it is question logic. Example: pick the element of a set that does not belong, pick one of a grouping, etc.
Example Question: Which item below is not made from a material grown in nature? (A) a cotton shirt (B) a wooden chair (C) a plastic spoon (D) a grass basket

Reasoning Label: Linguistic Matching
Instructions: Any question that requires aligning a question with retrieved results (sentences or facts). This can be paired with multihop and other reasoning types where the facts that are retrieved need to be aligned with the particular questions. This often goes with knowledge types like questions and basic facts in the case that the retrieved results do not perfectly align with the answer choices.
Example Question: Which of the following best describes a mineral? (A) the main nutrient in all foods (B) a type of grain found in cereals (C) a natural substance that makes up rocks (D) the decomposed plant matter found in soil

Reasoning Label: Causal / Explanation
Instructions: Given the facts retrieved from web sentences or another reasonable corpus the answer can be extrapolated from a single fact or element. This category also includes single hop causal processes and scenarios where the question asks in the form "What is the most likely result of X happening?" Example: if you need to explain this to a 10 year old, it would require one statement of fact to explain.

Reasoning Label: Hypothetical / Counterfactual
Instructions: Any question that requires reasoning about or applying abstract facts to a hypothetical/scenario situation that is described in the question. In some cases the hypotheticals are described in the answer options. A hypothetical / counterfactual is an entity/scenario that may not be mentioned as a fact in the corpus, e.g. "..a gray squirrel gave birth to a ...", "When lemon juice is added to water.."

Reasoning Label: Physical Model
Instructions: Any question that refers to a physical / spatial / kinematic relationship between entities and likely requires a model of the physical world in order to be answered.

Additional example questions:
Where will a sidewalk feel hottest on a warm, clear day?
Inside cells, special molecules carry messages from the membrane to the nucleus. Which body system uses a similar process? (A) endocrine system (B) lymphatic system (C) excretory system (D) integumentary system

The wording of the instructions above is quite deliberate. First, we make the non-trivial point that the kind of knowledge that is available determines the reasoning type to be employed, and eventually whether the given question can be answered or not. For example, the question: Giant redwood trees change energy from one form to another. How is energy changed by the trees?
(A) They change chemical energy into kinetic energy.
(B) They change solar energy into chemical energy.
(C) They change wind energy into heat energy.
(D) They change mechanical energy into solar energy.
can be answered using two different kinds of reasoning depending on the knowledge retrieved:
(1) Trees change solar energy into chemical energy: Linguistic Reasoning
(2a) Solar energy is changed into chemical energy by plants; (2b) Trees are classified as plants: Multi-hop Reasoning
In order to level the field among annotators, we include the phrasing about an ideal knowledge source. Additionally, displaying the retrieved search results in the interface provides another way for the annotators to share some common ground with respect to the typical kind of knowledge that is likely to be available (in this case, from the ARC corpus).
In comparison to the knowledge types provided by Clark et al. (2018), we make the following changes. First, and most important, we provide instruction-based definitions for each class, as opposed to the single exemplars provided previously. We believe this greatly simplifies the annotation task for new annotators, since they no longer need to perform a preliminary manual analysis of the QA set in order to understand the distinctions between the classes. Second, we completely eliminate the Structure type: this is a very specific type of knowledge, and we believe it is not represented in any significant percentage in the current ARC QA set. Third, we rename some of the labels to bring them more in line with the specific properties of the knowledge that they are describing; for example, spatial / kinematic is renamed to Physical Model in our table.

Reasoning Types
The analysis of reasoning types with an eye towards annotation follows a similar pattern to the knowledge types described in the previous section. Table 2 shows the reasoning labels and classification rules that we used for labeling the dataset. The annotators were given the following instructions: You are to answer the question, "What types of reasoning or problem solving would a competent student with access to Wikipedia need to answer this question?" You are allowed to select multiple labels for this type which will be recorded as an ordered list. You are to assign labels in the order of importance to answering the questions at hand.
You may use the search results to help differentiate between the linguistic and multi-hop reasoning types. Any label other than these should take precedence if they apply. For example, a question that requires using a mathematical formula along with linguistic matching should be labeled algebraic, linguistic.
Notice that the instructions in this case refer to being able to access a specific knowledge corpus, and allow for the selection of multiple labels in decreasing order of applicability. We also provide specific instructions on the order of precedence as relates to linguistic and multi-hop reasoning types: this is based on our empirical observation that many questions can be classified trivially into these reasoning categories, and we would prefer (for downstream application use) a clean split into as many distinct categories as possible.

Labeling Interface
The labeling interface is shown in Figure 1. The text of the question is displayed at the top of the left side, followed by the answer options. Each of the answer options is preceded by a radio button: each button is initially transparent, but the annotator can click on a button to check whether the corresponding option is the answer to the question. This facility is to help annotators with extra information if it is needed in labeling the question; however, we leave it blank initially to avoid biasing the annotations.
Clicking on a specific answer option runs a search on the ARC corpus, with the query text set to the last sentence of the question appended with the entire text of the clicked answer option. The retrieved search results are shown in the bottom left half of the interface. Annotators have the option of labeling retrieved search results as irrelevant or relevant to answering the question at hand. The query box also accepts free text, and annotators who wish to craft more specific queries are free to do so. We collect all the queries executed, as well as the relevant/irrelevant annotations.
The right hand side of the interface deals with the labeling process itself. There are two boxes for annotating knowledge and reasoning types respectively. The labels are populated from Table 1 and Table 2. The annotator can also provide optional information on the quality of the retrieved search results if they choose to run a query. Finally, the annotator can use the optional field below quality to enter additional notes about the question which are stored and can be retrieved for subsequent discussion and refinement of the labels.
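As a rough illustration of the default query described above, the following sketch builds the query string from a question and a clicked answer option. The function name `default_query` and the regular-expression sentence splitter are assumptions made for illustration; the interface's actual sentence segmentation is not described, and the search call itself is not shown.

```python
import re


def default_query(question: str, option: str) -> str:
    """Last sentence of the question followed by the full option text."""
    # Naive split after '.', '?' or '!' followed by whitespace. This is an
    # assumption; the real interface may segment sentences differently.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.?!])\s+", question.strip())
        if s.strip()
    ]
    last_sentence = sentences[-1] if sentences else ""
    return f"{last_sentence} {option}".strip()


query = default_query(
    "Giant redwood trees change energy from one form to another. "
    "How is energy changed by the trees?",
    "They change solar energy into chemical energy.",
)
print(query)
```

An annotator's free-text reformulation would simply replace this string before the search is run.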

Human Annotated Search Results
In addition to labeling the knowledge and reasoning types systematically, we demonstrate yet another capability of our interface: given a corpus of knowledge, we are able to retrieve and display search results that may be relevant to the question (and its corresponding options) at hand. This is useful because it gives a solution technique an additional signal as it tries to identify the correct answer to a given question.
In open-domain question answering, the retriever plays as important a role as the machine reader (Chen et al., 2017). In the past few years, there has been a lot of effort in designing sophisticated neural architectures for reading a small piece of text (e.g. a paragraph) (Wang and Jiang, 2016; Xiong et al., 2016; Seo et al., 2016; Lee et al., 2016, inter alia). However, most work in open domain settings (Chen et al., 2017; Clark and Gardner, 2017; Wang et al., 2018) only uses a simple retriever (such as one based on TF-IDF). As a result, there is a notable decrease in the performance of the QA system. One roadblock for training a sophisticated retriever is the lack of available training data which annotates the relevance of a retrieved context with respect to the question. We believe our annotated retrieval data can be used to train a better ranker/retriever.
The underlying retriever in our interface is a simple Elasticsearch instance, similar to the one used by Clark et al. (2018). The interface is populated by default with the top ranked sentences that are retrieved with the given question as the input query. However, we noticed that results thus retrieved were often irrelevant to answering the question. To address this, our labeling interface also allows annotators to input their own custom queries. We found that reformulating the initial query significantly improved the quality of the retrieved context (results). While not the main focus of this work, we encouraged the annotators to mark the contexts (results) that they thought were relevant to answering the question at hand.
For example, in Figure 1, the annotator came up with a novel query ('metals are solid at room temperatures') and also marked the relevant sentences which are needed to answer this question. Note that sometimes we need to reason over multiple sentences to arrive at the answer. For example, the question in Figure 1 can be answered by combining the first and third sentences in the 'Relevant Results' tab.
To quantitatively measure the efficacy of the annotated context, we evaluated 47 questions and their respective human-annotated relevant sentences with a pretrained DrQA model (Chen et al., 2017). We compared this to a baseline which only returned the sentences retrieved by using the text of the question plus given options as input queries. Since DrQA returns a span from the input sentences, we picked the multiple choice option that maximally overlapped with the returned answer span. Our baseline answers 7 out of 47 questions correctly (14.9%). With the annotated context, this rises to 27 correctly answered questions (57.4%), an increase of roughly 42 percentage points in accuracy. Encouraged by these results, we posit that the community should devote more attention to improving the retrieval components of QA systems; we expect that annotated context will help in training a better ranker.
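The option-selection step and the accuracy figures above can be sketched as follows. This is a minimal illustration, assuming that "overlap" means shared lowercase word tokens; the exact overlap measure is not specified in the text, and the helper names and the answer span "mostly nitrogen gas" are hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_option(span, options):
    """Return the key of the option sharing the most tokens with the span."""
    span_tokens = tokens(span)
    return max(options, key=lambda key: len(span_tokens & tokens(options[key])))


# Hypothetical answer span returned by a reader for the Table 1 example.
options = {"A": "carbon", "B": "nitrogen", "C": "oxygen", "D": "argon"}
choice = pick_option("mostly nitrogen gas", options)

# Accuracy arithmetic for the reported counts (7 and 27 correct out of 47).
baseline_acc = 7 / 47
annotated_acc = 27 / 47
gain_points = 100 * (annotated_acc - baseline_acc)
print(choice, round(gain_points, 1))
```

The gain computed this way is a difference in percentage points, which matches the "42 points" figure quoted in the abstract.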

Results
Each of the team members was given access to the labeling interface (which includes the question, answers, query search results and more information as described above). Each annotator was shown the questions in a random order, and was allowed to skip or pass any question.
Statistics. We collected labels from at least 3 unique annotators (out of the possible 10) for 192 distinct questions. This labeling process produced 1.42 knowledge type labels and 1.7 reasoning type labels per question. Figure 2 and Figure 3 show the distribution of annotation labels by all raters at any position. While Basic Facts dominates the knowledge type labels, there is no clear-cut consensus for the reasoning type. Indeed, qn logic, linguistic, and explanation occur most frequently.
Inter-Rater Agreement. A comprehensive look at the labels and inter-rater agreement can be found in Table 3 and Table 4. Fleiss' κ is often used to measure inter-rater agreement (Cohen, 1995). Informally, this measures the amount of agreement, beyond chance, based on the number of raters, objects and classes. κ > 0.2 is typically taken to denote good agreement between raters, while a negative value means that there was little to no agreement. Since Fleiss' κ is only defined for a single set of labels, we consider only the first (most important) label for each question in the statistic we report. In addition to Fleiss' κ we also use the Kemeny voting rule (Kemeny, 1959) to measure the consensus by the annotators. The Kemeny voting rule minimizes the Kendall Tau (Kendall, 1938) (flip) distance between the output ordering and the ordering of all annotators. One theory of voting (aggregation) is that there is a true or correct ordering and all voters provide a noisy observation of the ground truth. This method of thinking is largely credited to Condorcet (de Caritat, 1785; Young, 1988) and there is recent work in characterizing other voting rules as maximum likelihood estimators (MLEs) (Conitzer et al., 2009). The Kemeny voting rule is the MLE of the Condorcet Noise Model, in which pairwise inversions of the preference order happen uniformly at random (Young, 1988, 1995). Hence, if we assume all annotators make pairwise errors uniformly at random then Kemeny is the MLE of label orders they report.
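To make the Kemeny rule concrete, here is a small brute-force sketch. It assumes every annotator supplies a complete ranking over the same small label set, which is a simplification: the annotations described here are ordered lists that may differ in length, and how partial orders were handled is not stated. The function names and the toy votes are illustrative only.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Number of label pairs that the two rankings order differently."""
    pos1 = {label: i for i, label in enumerate(r1)}
    pos2 = {label: i for i, label in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def kemeny(rankings):
    """Brute-force Kemeny consensus: ordering with minimal total distance."""
    labels = sorted(rankings[0])
    best = min(
        permutations(labels),
        key=lambda cand: sum(kendall_tau(cand, r) for r in rankings),
    )
    score = sum(kendall_tau(best, r) for r in rankings)
    return list(best), score


# Toy votes from three hypothetical annotators (not data from the paper).
votes = [
    ["basic facts", "causes", "definition"],
    ["basic facts", "definition", "causes"],
    ["causes", "basic facts", "definition"],
]
consensus, score = kemeny(votes)
print(consensus, score)
```

Enumerating all permutations is only practical for a handful of labels, which is the regime here since each question carries very few labels; exact Kemeny aggregation is NP-hard in general.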

Label          Appears   Majority   Consensus
basic facts    125       69         28
algebraic      13        5          2
definition     52        16         5
causes         78        33         15
experiments    35        19         13
purpose        30        13         0
physical       21        3          1
Fleiss' κ = 0.342

Knowledge Labels. We achieve κ = 0.342, which means that our raters did a good job of independently agreeing on the types of knowledge required to answer the questions. The mean Kemeny score of the consensus ranking for each question is 2.57, meaning on average there are fewer than three flips required to get from the consensus ranking to each of the annotators' rankings. The most frequent label in the first position was basic facts, followed by causes. Overall, there was a reasonable amount of consensus between the raters for knowledge type: 64/192 questions had a consensus amongst all the raters. Taken together, our results on knowledge type indicate that most questions deal with basic facts, causes, and definitions; and that labeling can be done reliably.
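For readers unfamiliar with the statistic, the following sketch computes Fleiss' κ from a table of per-question label counts. It assumes every question is rated by the same number of raters, as the standard formula requires; since the questions here have "at least 3" annotators, the authors presumably normalized or subsampled in some way that is not described. The toy count tables are illustrative and are not data from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters who gave item i the label j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumed identical for every item
    n_labels = len(counts[0])
    # Share of all assignments that went to each label.
    p_label = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items          # observed agreement
    p_exp = sum(p * p for p in p_label)    # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


# Toy tables: 4 questions, 3 raters, 2 labels.
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2], [2, 1], [1, 2]])
print(perfect, split)
```

In the first toy table all raters agree on every question, so κ is 1; in the second, every question has a 2-to-1 split and agreement falls below what chance would predict, so κ is negative.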
Reasoning Labels. The reasoning labels tell a very different story from the knowledge labels. The agreement was κ = −0.683, which indicates that raters did not agree above chance on their labels. Strong evidence for this comes from the fact that only 27/192 questions had a consensus label. This may be due to the fact that we allow multiple labels, and the annotators simply disagree on the order of the labels. However, the mean Kemeny score of the consensus ranking for each question is 6.57, which indicates that on average the orderings of the labels are quite far apart. Considering the histogram in Figure 3, we see that qn logic, linguistic, and explanation are the most frequent label types; this may indicate that getting better at understanding the questions themselves could lead to a big boost for reasoners. For Figure 4, we have merged the first and second label (if present) for all annotators. Now, the set of all possible labels is all singletons as well as all pairs of labels. Comparing this histogram to the one in Figure 3, we see that while linguistic and explanation remain somewhat unchanged, the qn logic label becomes very spread out across the types. This is more support for our hypothesis that annotators may be disagreeing on the ordering of the labels, rather than the content itself.
Baseline Performance. Using the Kemeny voting rule to partition the set of questions based on their top reasoning and knowledge labels, we evaluate the performance of several baseline systems: word2vec similarity (Mikolov et al., 2013), Text Search based on the scores assigned by Elasticsearch over the Aristo-mini corpus, and the SemanticILP system that uses constrained search over semantically-motivated graph structures (Khashabi et al., 2018).
Additionally, we also test the three pre-trained neural baselines released by Clark et al. (2018): DecompAttn, the Decomposable Attention natural language inference model; DGEM, the Decomposable Graph Entailment Model; and BiDAF, the Bi-Directional Attention Flow reading comprehension model (Seo et al., 2016). Results are shown in Table 5. While none of the systems approach human performance on any of the categories, they do illustrate some particular difficulties of the ARC Challenge Set that motivate our future work. Specifically, all techniques perform at or below chance on questions that primarily require qn logic reasoning. Unremarkable performance on the popular basic facts and causes knowledge types also illustrates the shortcomings of sophisticated language processing systems that still rely on basic text retrieval. Bridging this gap appears to be a requirement for making progress on this and similar datasets.

Table 5: Accuracy on our subset of ARC Challenge Set questions, partitioned based on the first label from the Kemeny ordering of the reasoning and knowledge type annotations, respectively; 1/k partial credit is given when the correct answer is in the set of k selected answers. The number of questions assigned each primary label is indicated by (#).
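The 1/k partial-credit rule from the Table 5 caption can be sketched as below. The function names and the toy predictions are hypothetical; the sketch only illustrates the scoring rule and does not reproduce any number from Table 5.

```python
def partial_credit(selected, correct):
    """1/k credit if the correct answer is among the k selected, else 0."""
    return 1.0 / len(selected) if correct in selected else 0.0


def accuracy(predictions, gold):
    """Mean partial-credit score over all questions."""
    scores = [partial_credit(sel, ans) for sel, ans in zip(predictions, gold)]
    return sum(scores) / len(scores)


# Toy example: a solver may return several tied answer options per question.
preds = [{"A"}, {"B", "C"}, {"D"}, {"A", "B", "C", "D"}]
gold = ["A", "C", "B", "D"]
acc = accuracy(preds, gold)  # (1 + 0.5 + 0 + 0.25) / 4
print(acc)
```

Under this rule a solver that always selects all four options scores 0.25 on every question, which is the chance level for four-way multiple choice.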

Conclusion & Future Work
In this paper, we introduce a novel annotation interface and define annotation instructions for the knowledge and reasoning type labels that are used for question analysis for standardized tests. We annotate approximately 200 questions from the ARC Challenge Set shared by AI2 with the types of knowledge and reasoning required to answer the respective questions. Each question has at least 3 annotators, with high agreement on the requirements for knowledge type. While standard baselines do not perform significantly better on any of the different subsets of questions (partitioned by type), we offer a preliminary demonstration that search annotations collected through our interface can significantly improve the performance of state-of-the-art systems. In future work, we will leverage the knowledge and reasoning type annotations, as well as the search annotations, to improve the performance of QA systems.