Multi-hop Reading Comprehension through Question Decomposition and Rescoring

Multi-hop Reading Comprehension (RC) requires reasoning and aggregation across several paragraphs. We propose a system for multi-hop RC that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models. Since annotations for such decomposition are expensive, we recast sub-question generation as a span prediction problem and show that our method, trained using only 400 labeled examples, generates sub-questions that are as effective as human-authored sub-questions. We also introduce a new global rescoring approach that considers each decomposition (i.e., the sub-questions and their answers) to select the best final answer, greatly improving overall performance. Our experiments on HotpotQA show that this approach achieves state-of-the-art results, while providing explainable evidence for its decision making in the form of sub-questions.


Introduction
Multi-hop reading comprehension (RC) is challenging because it requires the aggregation of evidence across several paragraphs to answer a question. Table 1 shows an example of multi-hop RC, where the question "Which team does the player named 2015 Diamond Head Classics MVP play for?" requires first finding the player who won MVP from one paragraph, and then finding the team that player plays for from another paragraph.
In this paper, we propose DECOMPRC, a system for multi-hop RC that learns to break compositional multi-hop questions into simpler, single-hop sub-questions using spans from the original question. For example, for the question in Table 1, we can create the sub-questions "Which player named 2015 Diamond Head Classics MVP?" and "Which team does ANS play for?", where the token ANS is replaced by the answer to the first sub-question. The final answer is then the answer to the second sub-question.

Table 1: An example of a multi-hop question from HOTPOTQA. The first cell shows the given question and two of the given paragraphs (the other eight paragraphs are not shown), where the red text is the groundtruth answer. Our system selects a span over the question and writes the two sub-questions shown in the second cell.
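As a concrete illustration, the bridging strategy amounts to two calls to a single-hop RC model, substituting the answer of the first hop for the ANS placeholder before the second hop. The sketch below uses a hypothetical `single_hop_rc` callable and made-up answer strings (`PLAYER_X`, `TEAM_Y`), not our actual model or the dataset's groundtruth:

```python
def answer_bridging(sub_q1, sub_q2_template, paragraphs, single_hop_rc):
    """Answer a bridging decomposition: solve sub-question 1, then
    substitute its answer for the ANS placeholder in sub-question 2."""
    first_answer = single_hop_rc(sub_q1, paragraphs)
    sub_q2 = sub_q2_template.replace("ANS", first_answer)
    return single_hop_rc(sub_q2, paragraphs)

# Toy single-hop model: keyword lookup with invented answers (illustration only).
def toy_rc(question, paragraphs):
    if "Diamond Head Classic" in question:
        return "PLAYER_X"
    if "PLAYER_X" in question:
        return "TEAM_Y"
    return "unknown"

answer = answer_bridging(
    "Which player was named 2015 Diamond Head Classic MVP?",
    "Which team does ANS play for?",
    paragraphs=[], single_hop_rc=toy_rc,
)
# answer == "TEAM_Y"
```

The second hop only becomes answerable once the first hop's answer is spliced in, which is exactly why a single-hop model suffices for each step.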
Recent work on question decomposition relies on distant supervision data created on top of underlying relational logical forms (Talmor and Berant, 2018), making it difficult to generalize to diverse natural language questions such as those in HOTPOTQA (Yang et al., 2018). In contrast, our method simplifies decomposition to span prediction, thus requiring only 400 decomposition examples to train a competitive neural decomposition model. Furthermore, we propose a rescoring approach that obtains answers from different possible decompositions and rescores each decomposition together with its answer to decide on the final answer, rather than committing to a decomposition at the beginning.
Our experiments show that DECOMPRC outperforms other published methods on HOTPOTQA (Yang et al., 2018), while providing explainable evidence in the form of sub-questions. In addition, we evaluate on alternative distractor paragraphs and questions and show that our decomposition-based approach is more robust than an end-to-end BERT baseline (Devlin et al., 2019). Finally, our ablation studies show that our sub-questions, trained with 400 supervised decomposition examples, are as effective as human-written sub-questions, and that our answer-aware rescoring method significantly improves performance.

Related Work
Reading Comprehension. In reading comprehension, a system reads a document and answers questions regarding the content of the document (Richardson et al., 2013). Recently, the availability of large-scale reading comprehension datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Joshi et al., 2017) has led to the development of advanced RC models (Seo et al., 2017; Xiong et al., 2018; Yu et al., 2018; Devlin et al., 2019). Most of the questions in these datasets can be answered within a single sentence (Min et al., 2018), which is a key difference from multi-hop reading comprehension.
Multi-hop Reading Comprehension. In multi-hop reading comprehension, the evidence for answering the question is scattered across multiple paragraphs. Some multi-hop datasets contain questions that are, or are based on, relational queries (Welbl et al., 2017; Talmor and Berant, 2018). In contrast, HOTPOTQA (Yang et al., 2018), on which we evaluate our method, contains more natural, hand-written questions that are not based on relational queries.
Prior methods for multi-hop reading comprehension focus on answering relational queries and emphasize attention models that reason over coreference chains (Dhingra et al., 2018; Zhong et al., 2019; Cao et al., 2019). In contrast, our method focuses on answering natural language questions via question decomposition. By providing decomposed single-hop sub-questions, our method makes the model's decisions explainable.
Our work is most closely related to Talmor and Berant (2018), who answer questions over web snippets via decomposition. There are three key differences between our method and theirs. First, they decompose questions that correspond to relational queries, whereas we focus on natural language questions. Next, they rely on an underlying relational query (SPARQL) to build distant supervision data for training their model, while our method requires only 400 decomposition examples. Finally, they decide on a decomposition operation based exclusively on the question. In contrast, we decompose the question in multiple ways, obtain answers, and determine the best decomposition based on all the given context, which we show is crucial to improving performance.
Semantic Parsing. Semantic parsing is a larger area of work that involves producing logical forms from natural language utterances, which are then typically executed over structured knowledge graphs (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011). Our work is inspired by the idea of compositionality from semantic parsing; however, we focus on answering natural language questions over unstructured text documents.

Overview
In multi-hop reading comprehension, a system answers a question over a collection of paragraphs by combining evidence from multiple paragraphs. In contrast to single-hop reading comprehension, in which a system can obtain good performance using a single sentence (Min et al., 2018), multi-hop reading comprehension typically requires more complex reasoning over how two pieces of evidence relate to each other.
We propose DECOMPRC for multi-hop reading comprehension via question decomposition. DECOMPRC answers questions through a three-step process: 1. First, DECOMPRC decomposes the original multi-hop question into several single-hop sub-questions according to a few reasoning types in parallel, based on span predictions. Figure 1 illustrates an example in which a question is decomposed through four different reasoning types. Section 3.2 details our decomposition approach.
2. Then, for every reasoning type, DECOMPRC leverages a single-hop reading comprehension model to answer each sub-question, and combines the answers according to the reasoning type. Figure 1 shows an example in which bridging produces 'City of New York' as an answer while intersection produces 'Columbia University' as an answer. Section 3.3 details the single-hop reading comprehension procedure.

Figure 1: The overall diagram of how our system works. Given the question, DECOMPRC decomposes the question via all possible reasoning types (Section 3.2). Then, each sub-question interacts with the off-the-shelf RC model and produces an answer (Section 3.3). Lastly, the decomposition scorer decides which answer will be the final answer (Section 3.4). Here, "City of New York", obtained by bridging, is determined as the final answer.
3. Finally, DECOMPRC leverages a decomposition scorer to judge which decomposition is the most suitable, and outputs the answer from that decomposition as the final answer.
In Figure 1, "City of New York", obtained via bridging, is decided as the final answer. Section 3.4 details our rescoring step.
We identify several reasoning types in multi-hop reading comprehension, which we use to decompose the original question and rescore the decompositions: bridging, intersection, and comparison.

Decomposition
The goal of question decomposition is to convert a multi-hop question into simpler, single-hop sub-questions. A key challenge of decomposition is that it is difficult to obtain annotations for how to decompose questions. Moreover, generating a question word-by-word is known to be a difficult task that requires substantial training data and is not straightforward to evaluate (Gatt and Krahmer, 2018; Novikova et al., 2017). Instead, we propose a method to create sub-questions using span prediction over the question.
The key idea is that, in practice, each sub-question can be formed by copying and lightly editing a key span from the original question, with different span extraction and editing required for each reasoning type. For instance, the bridging question in Table 2 requires finding "the player named 2015 Diamond Head Classic MVP" which is easily extracted as a span. Similarly, the intersection question in Table 2 specifies the type of entity to find ("which actor and comedian"), with two conditions ("Stories USA starred" and "from "The Office""), all of which can be extracted. Comparison questions compare two entities using a discrete operation over some properties of the entities, e.g., "which is smaller". When two entities are extracted as spans, the question can be converted into two sub-questions and one discrete operation over the answers of the sub-questions.
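To make the span-to-sub-question mapping concrete, here is a minimal sketch for the bridging case, assuming whitespace tokenization; `bridging_subquestions` is an illustrative helper, not our exact form_subq procedure:

```python
def bridging_subquestions(question_tokens, start, end, wh_word="Which"):
    """Form two sub-questions from one extracted span [start, end):
    sub-question 1 asks about the span itself, and sub-question 2 is
    the original question with the span replaced by the token ANS."""
    span = question_tokens[start:end]
    if span and span[0].lower() in {"the", "a", "an"}:
        span = span[1:]  # drop a leading article when asking about the span
    sub_q1 = " ".join([wh_word] + span) + "?"
    remainder = question_tokens[:start] + ["ANS"] + question_tokens[end:]
    sub_q2 = " ".join(remainder)
    return sub_q1, sub_q2

tokens = ("Which team does the player named 2015 Diamond Head "
          "Classic MVP play for ?").split()
# The predicted span covers "the player named 2015 Diamond Head Classic MVP".
sub_q1, sub_q2 = bridging_subquestions(tokens, 3, 11)
# sub_q1: "Which player named 2015 Diamond Head Classic MVP?"
# sub_q2: "Which team does ANS play for ?"
```

Because both sub-questions are copied from the original wording, no free-form generation model is needed; only the span boundaries must be predicted.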

Span Prediction for Sub-question Generation
Our approach reduces the sub-question generation problem to a span prediction problem that requires little supervision (400 annotations). The annotations are collected by mapping the question to several points that segment the question into spans (details in Section 4.2). We train a model Pointer_c that learns to map a question to c points, which are subsequently used to compose sub-questions for each reasoning type through Algorithm 1.
Pointer_c is a function that points to c indices ind_1, ..., ind_c in an input sequence. 1 Let S = [s_1, ..., s_n] denote the sequence of n words in the input. The model encodes S using BERT (Devlin et al., 2019):

    U = BERT(S) ∈ R^{n×h}    (1)

where h is the output dimension of the encoder. Let W ∈ R^{h×c} denote a trainable parameter matrix. We compute a pointer score matrix

    Y = softmax(UW) ∈ R^{n×c}

where P(i = ind_j) = Y_{ij} denotes the probability that the ith word is the jth index produced by the pointer. At inference, the model extracts the c indices that yield the highest joint probability:

    ind_1, ..., ind_c = argmax_{i_1 ≤ ... ≤ i_c} ∏_{j=1}^{c} Y_{i_j, j}

1 c is a hyperparameter which differs across reasoning types.
Algorithm 1: Sub-question generation using Pointer_c. 2

2 Details of find_op and form_subq are in Appendix B.
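A minimal numpy sketch of the pointer computation above, with random toy matrices standing in for the BERT encoding U and the trained parameter W; the joint argmax over non-decreasing index tuples is done by brute force, which is fine for small c:

```python
import itertools
import numpy as np

def pointer_indices(U, W):
    """Pick c indices maximizing the product of pointer probabilities,
    subject to ind_1 <= ... <= ind_c (brute-force joint argmax)."""
    scores = U @ W                                   # (n, c) pointer logits
    Y = np.exp(scores) / np.exp(scores).sum(axis=0)  # column-wise softmax: Y_ij = P(i = ind_j)
    n, c = Y.shape
    best, best_p = None, -1.0
    for idx in itertools.combinations_with_replacement(range(n), c):
        p = np.prod([Y[i, j] for j, i in enumerate(idx)])
        if p > best_p:
            best, best_p = idx, p
    return best

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 16))   # toy encoding of an 8-token question (h = 16)
W = rng.normal(size=(16, 3))   # c = 3 pointers
inds = pointer_indices(U, W)   # 3 non-decreasing word indices
```

Because the pointers are required to be non-decreasing, the predicted indices always describe a valid segmentation of the question.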

Single-hop Reading Comprehension
Given a decomposition, we use a single-hop RC model to answer each sub-question. Specifically, the goal is to obtain the answer and the evidence, given the sub-question and N paragraphs.
Here, the answer is either a span from one of the paragraphs, yes, or no. The evidence is the one of the N paragraphs on which the answer is based. Any off-the-shelf RC model can be used. In this work, we use the BERT reading comprehension model (Devlin et al., 2019) combined with the paragraph selection approach of Clark and Gardner (2018) to handle multiple paragraphs. Given N paragraphs S_1, ..., S_N, this approach independently computes answer_i and y^none_i from each paragraph S_i, where answer_i denotes the answer candidate from the ith paragraph and y^none_i the score indicating that the ith paragraph does not contain the answer. The final answer is selected from the paragraph with the lowest y^none_i. Although this approach takes a set of multiple paragraphs as input, it is not capable of jointly reasoning across different paragraphs.
For each paragraph S_i, let U_i ∈ R^{n×h} be the BERT encoding of the sub-question concatenated with the paragraph S_i, obtained by Equation (1). We compute four scores, y^span_i, y^yes_i, y^no_i and y^none_i, indicating whether the answer is a phrase in the paragraph, yes, no, or does not exist:

    [y^span_i; y^yes_i; y^no_i; y^none_i] = max(U_i) W_1

where max denotes a max-pooling operation across the input sequence, and W_1 ∈ R^{h×4} denotes a parameter matrix. Additionally, the model computes span_i, which is defined by its start and end points start_i and end_i:

    p^start_i = softmax(U_i W_start),  p^end_i = softmax(U_i W_end)
    start_i, end_i = argmax_{j ≤ k} P_{i,start}(j) P_{i,end}(k)

where P_{i,start}(j) and P_{i,end}(k) indicate the probability that the jth word is the start and the kth word is the end of the answer span, respectively, obtained as the jth element of p^start_i and the kth element of p^end_i. Here, W_start, W_end ∈ R^h are parameter matrices. Finally, answer_i is determined to be span_i, yes, or no depending on which of y^span_i, y^yes_i and y^no_i is the highest. The model is trained on questions that require only single-hop reasoning, obtained from SQUAD (Rajpurkar et al., 2016) and easy examples of HOTPOTQA (Yang et al., 2018) (details in Section 4.2). Once trained, it is used as an off-the-shelf RC model and is never directly trained on multi-hop questions.
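The paragraph-selection rule can be sketched as follows; the (answer_candidate, y_none) pairs here are invented placeholders for the per-paragraph model outputs:

```python
def select_answer(paragraph_outputs):
    """paragraph_outputs: list of (answer_candidate, y_none) pairs,
    computed independently per paragraph. The final answer comes from
    the paragraph most likely to contain it, i.e. the lowest y_none."""
    best = min(paragraph_outputs, key=lambda out: out[1])
    return best[0]

# Toy scores: the second paragraph is most confident it holds the answer.
outputs = [("Toggenburg", 0.9), ("Switzerland", 0.1), ("yes", 0.7)]
final = select_answer(outputs)   # → "Switzerland"
```

Note that this selection never compares evidence across paragraphs, which is exactly why the model cannot perform multi-hop reasoning on its own.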

Decomposition Scorer
Each decomposition consists of sub-questions, their answers, and evidence corresponding to a reasoning type. DECOMPRC scores decompositions and takes the answer of the top-scoring decomposition to be the final answer. The score indicates if a decomposition leads to a correct final answer to the multi-hop question.
Let t be the reasoning type, and let answer_t and evidence_t be the answer and the evidence obtained with reasoning type t. Let x denote the sequence of n words formed by the concatenation of the question, the reasoning type t, the answer answer_t, and the evidence evidence_t. The decomposition scorer encodes this input x using BERT to obtain U_t ∈ R^{n×h}, similar to Equation (1). The score p_t is computed as

    p_t = sigmoid(W_2 max(U_t))

where W_2 ∈ R^h is a trainable matrix and max denotes max-pooling across the input sequence. During inference, the reasoning type is decided as argmax_t p_t, and the answer corresponding to this reasoning type is chosen as the final answer.
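The rescoring decision can be sketched as follows, assuming each reasoning type t has already produced a scalar logit W_2 max(U_t) and a candidate answer; the logits below are toy values chosen to match the Figure 1 example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def choose_decomposition(decompositions):
    """decompositions: {reasoning_type: (logit, answer)}.
    Score each decomposition with a sigmoid over its logit and return
    the reasoning type and answer of the top-scoring decomposition."""
    scored = {t: (sigmoid(logit), ans)
              for t, (logit, ans) in decompositions.items()}
    best_type = max(scored, key=lambda t: scored[t][0])
    return best_type, scored[best_type][1]

decomps = {
    "bridging":     (2.3, "City of New York"),
    "intersection": (0.4, "Columbia University"),
    "comparison":   (-1.0, None),
}
best_type, final_answer = choose_decomposition(decomps)
# best_type == "bridging", final_answer == "City of New York"
```

The key design choice is that the score conditions on the answer and evidence, not just the question, so a decomposition that produced an implausible answer can be demoted.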
Pipeline Approach. An alternative to the decomposition scorer is a pipeline approach, in which the reasoning type is determined in the beginning, before decomposing the question and obtaining the answers to sub-questions. Section 4.6 compares our scoring step with this approach to show the effectiveness of the decomposition scorer. Here, we briefly describe the model used for the pipeline approach.
First, we form a sequence S of n words from the question and obtain S̃ ∈ R^{n×h} from Equation (1). Then, we compute a 4-dimensional vector p by

    p = softmax(max(S̃) W_3)

where W_3 ∈ R^{h×4} is a parameter matrix. Each element of the 4-dimensional vector p indicates whether the reasoning type is bridging, intersection, comparison, or original.

HOTPOTQA
We experiment on HOTPOTQA (Yang et al., 2018), a recently introduced multi-hop RC dataset over Wikipedia articles. There are two types of questions: bridge and comparison. Note that this categorization is based on the data collection process and differs from our categorization (bridging, intersection, and comparison), which is based on the required reasoning type. We evaluate our model on the dev and test sets in two different settings, following prior work.
Distractor setting contains the question and a collection of 10 paragraphs: 2 paragraphs are provided to crowd workers to write a multi-hop question, and 8 distractor paragraphs are collected separately via TF-IDF similarity between the question and the paragraph.

Full wiki setting is an open-domain setting in which only the question is given; paragraphs are retrieved based on TF-IDF similarity between the question and the paragraph, using the Document Retriever from DrQA (Chen et al., 2017).

Training Single-hop RC Model. We train 3 instances with n = 0, 2, 4 for an ensemble, which we use as the single-hop model. To deal with ungrammatical questions generated through our decomposition procedure, we augment the training data with ungrammatical samples. Specifically, we add noise to the question by randomly dropping tokens with a probability of 5%, and replace the wh-word with 'the' with a probability of 5%.
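The noising step can be sketched as follows (a minimal version, assuming whitespace-tokenized questions and a fixed wh-word list):

```python
import random

WH_WORDS = {"which", "what", "who", "where", "when", "why", "how"}

def add_question_noise(tokens, rng, p_drop=0.05, p_wh=0.05):
    """Simulate ungrammatical decomposed questions for training:
    drop each token with probability p_drop, and replace a wh-word
    with 'the' with probability p_wh."""
    noisy = []
    for tok in tokens:
        if rng.random() < p_drop:
            continue                      # randomly drop the token
        if tok.lower() in WH_WORDS and rng.random() < p_wh:
            noisy.append("the")           # replace wh-word with 'the'
        else:
            noisy.append(tok)
    return noisy

rng = random.Random(13)
noisy = add_question_noise("Which team does ANS play for ?".split(), rng)
```

Training on such perturbed questions makes the single-hop model tolerant of the slightly ungrammatical sub-questions that span-based decomposition can produce.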
Training Decomposition Scorer. We create training data by running inference for all reasoning types on HOTPOTQA medium and hard examples. We take the reasoning type that yields the correct answer as the gold reasoning type. Appendix C provides the full details.

Baseline Models
We compare our system DECOMPRC with the state of the art on the HOTPOTQA dataset as well as strong baselines. BiDAF is the state-of-the-art RC model on HOTPOTQA, originally from Seo et al. (2017). BERT is a strong end-to-end baseline (Devlin et al., 2019). DECOMPRC-1hop train is a variant of DECOMPRC trained only on single-hop data; since there is no access to the groundtruth answers of multi-hop questions, a decomposition scorer cannot be trained. Therefore, a final answer is obtained based on the confidence score from the single-hop RC model, without a rescoring procedure.

We also observe that BERT trained on single-hop RC achieves a high F1 score, even though it does not draw inferences across different paragraphs. For further analysis, we split the HOTPOTQA development set into single-hop solvable (Single) and single-hop non-solvable (Multi) subsets. 4 We observe that DECOMPRC outperforms BERT by a large margin on single-hop non-solvable (Multi) examples. This supports our attempt toward more explainable methods for answering multi-hop questions.

Results
Finally, Table 4 shows the F1 scores on the test set for the distractor setting and the full wiki setting on the leaderboard. 5 These include unpublished models concurrent to our work. DECOMPRC achieves the best result among models that report both the distractor and full wiki settings.

Evaluating Robustness
In order to evaluate the robustness of different methods to changes in the data distribution, we set up two adversarial settings in which the trained model remains the same but the evaluation dataset is different.
Modifying Distractor Paragraphs. We collect a new set of distractor paragraphs to evaluate whether the models are robust to a change in distractors. 6 In particular, we follow the same strategy as the original approach (Yang et al., 2018), using TF-IDF similarity between the question and the paragraph, but with no overlap with the original distractor paragraphs. Table 5 compares the F1 scores of DECOMPRC and BERT in the original and modified distractor settings. As expected, the performance of both methods degrades, but DECOMPRC is more robust to the change in distractors. Notably, DECOMPRC-1hop train degrades much less (only 3.41 F1) than the other approaches because it is trained only on single-hop data and therefore does not exploit the data distribution. These results confirm our hypothesis that the end-to-end model is sensitive to changes in the data while our model is more robust.
Adversarial Comparison Questions. We create an adversarial set of comparison questions by altering the original question so that the correct answer is inverted. For example, we change "Who was born earlier, Emma Bull or Virginia Woolf?" to "Who was born later, Emma Bull or Virginia Woolf?" We automatically invert 665 questions (details in Appendix D). We report the joint F1, taken as the minimum of the prediction F1 on the original and the inverted examples. Table 5 shows the joint F1 scores of DECOMPRC and BERT. We find that DECOMPRC is robust to inverted questions, and outperforms BERT by 36.53 F1.
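The joint F1 on original/inverted pairs can be computed as a minimum over each pair, averaged across pairs; this is a sketch with made-up per-pair scores:

```python
def joint_f1(pair_scores):
    """pair_scores: list of (f1_original, f1_inverted) per question pair.
    A model gets credit only when it is right on BOTH the original and
    the inverted question, so answer-flipping shortcuts score zero."""
    mins = [min(orig, inv) for orig, inv in pair_scores]
    return sum(mins) / len(mins)

scores = [(1.0, 1.0), (1.0, 0.0), (0.5, 0.8)]
print(joint_f1(scores))   # → 0.5
```

The min penalizes a model that answers the original question correctly but fails to track the inverted comparison word.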

Ablations
Span-based vs. Free-form Sub-questions. We evaluate the quality of the sub-questions generated by span-based question decomposition. We replace the question decomposition component using Pointer_3 with (i) sub-question decomposition through groundtruth spans, and (ii) sub-question decomposition with free-form, hand-written sub-questions (examples shown in Table 6). Table 7 (left) compares the question answering performance of DECOMPRC with these alternative sub-questions on a sample of 50 bridging questions. 7 There is little difference in performance between span-based and human-written sub-questions, indicating that our span-based sub-questions are as effective as free-form sub-questions. In addition, Pointer_3 trained on 200 or 400 examples comes close to human performance. We think that identifying spans often relies on syntactic information in the question, which BERT has likely learned from language modeling. We use the model trained on 200 examples in DECOMPRC to demonstrate sample efficiency, and expect performance to improve with more annotations.
Ablations in the decomposition decision method. For comparison, we report the F1 score of a confidence-based method, which chooses the decomposition with the maximum confidence score from the single-hop RC model, and of the pipeline approach, which independently selects the reasoning type as described in Section 3.4. In addition, we report an oracle that takes the maximum F1 score across different reasoning types to provide an upper bound. The pipeline method obtains a lower F1 score than the decomposition scorer. This suggests that using more context from the decomposition (e.g., the answer and the evidence) helps avoid cascading errors from the pipeline. Moreover, the gap between DECOMPRC and the oracle (6.2 F1) indicates that there is still room for improvement.
Upper bound of Span-based Sub-questions without a Decomposition Scorer. To measure an upper bound for span-based sub-questions without a decomposition scorer, assuming a human-level RC model, we conduct a human experiment on a sample of 50 bridging questions. 8 In this experiment, humans are given each sub-question from the decomposition annotations and are asked to answer it without access to the original, multi-hop question. They are asked to answer each sub-question with no cross-paragraph reasoning, and to mark it as a failure case if that is impossible. The resulting F1 score, calculated by replacing the RC model with humans, is 72.67 F1. Table 8 reports the breakdown of the fifteen error cases. 53% of these cases are due to an incorrect groundtruth, a partial match with the groundtruth, or human mistakes. 47% are genuine failures of the decomposition. For example, the multi-hop question "Which animal races annually for a national title as part of a post-season NCAA Division I Football Bowl Subdivision college football game?" corresponds to the last category in Table 8. The question can be decomposed into "Which post-season NCAA Division I Football Bowl Subdivision college football game?" and "Which animal races annually for a national title as part of ANS?". However, in the given set of paragraphs, there are multiple games that can answer the first sub-question. Although only one of them is held with the animal racing, it is impossible to get the correct answer given only the first sub-question. We think that incorporating the original question along with the sub-questions can be one solution to this problem, which is partially done by the decomposition scorer in DECOMPRC.

Table 6 (examples):
Q: What country is the Selun located in?
P1: Selun lies between the valley of Toggenburg and Lake Walenstadt in the canton of St. Gallen.
P2: The canton of St. Gallen is a canton of Switzerland.
Q: Which pizza chain has locations in more cities, Round ...
Q: Which magazine had more previous names, Watercolor Artist or The General?
P1: Watercolor Artist, formerly Watercolor Magic, is an American bi-monthly magazine that focuses on ...
P2: The General (magazine): Over the years the magazine was variously called 'The Avalon Hill General', 'Avalon Hill's General', 'The General Magazine', or simply 'General'.
Q1: Watercolor Artist had how many previous names?
Q2: The General had how many previous names?
Limitations. We show the overall limitations of DECOMPRC in Table 9. First, some questions are not compositional but require implicit multi-hop reasoning, and hence cannot be decomposed. Sec-

Conclusion
We proposed DECOMPRC, a system for multi-hop RC that decomposes a multi-hop question into simpler, single-hop sub-questions. We recast sub-question generation as a span prediction problem, allowing the model to be trained on 400 labeled examples to generate high-quality sub-questions. Moreover, DECOMPRC achieved further gains from the decomposition scoring step. DECOMPRC achieved the state of the art on the HOTPOTQA distractor and full wiki settings, while providing explainable evidence for its decision making in the form of sub-questions and being more robust to adversarial settings than strong baselines.

In the appendices, we describe the span annotation collection procedure for bridging and intersection questions.

A Span Annotation
The goal is to collect three points (bridging) or two points (intersection) given a multi-hop question. We design an interface for annotating spans over the question by clicking on words in the question. First, given a question, the annotator is asked to identify which reasoning type out of bridging, intersection, one-hop, and neither is the most appropriate. 9 Since bridging is the most common type, it is checked by default. If the question type is bridging, the annotator is asked to make three clicks for the start of the span, the end of the span, and the head word (top four examples in Figure 2). After all three clicks are made, the annotator can see the heuristically generated sub-questions. If the question type is intersection, the annotator is asked to make two clicks for the start and the end of the second segment out of three segments (bottom three examples in Figure 2). Similarly, the annotator can see the heuristically generated sub-questions after two clicks. If the question type is one-hop or neither, the annotator does not have to make any clicks. If the question can be decomposed in more than one way, the annotator is asked to choose the more natural decomposition. If the question is ambiguous, the annotator is asked to skip the example and to annotate only the clear cases. For quality control, all annotators attend in-person, one-on-one tutorial sessions and are given 100 example annotations for reference.

B Decomposition for Comparison
In this section, we describe the decomposition procedure for comparison, which does not require any extra annotation.
Comparison requires comparing a property of two different entities, usually via discrete operations. We identify 10 discrete operations that sufficiently cover comparison operations, shown in Table 10. Based on these pre-defined discrete operations, we decompose the question through the following three steps.
First, we extract the two entities under comparison. We use Pointer_4 to obtain ind_1, ..., ind_4, where ind_1 and ind_2 indicate the start and the end of the first entity, and ind_3 and ind_4 indicate those of the second entity. We create training data in which each example contains the question and four points, as follows: we filter out bridge questions in HOTPOTQA to leave comparison questions, extract the entities in the question and in the two supporting facts (annotated sentences in the dataset which serve as evidence to answer the question) using the Spacy NER tagger, 10 and match them to find the two entities that appear in one supporting sentence but not in the other.
Then, we identify the suitable discrete operation, following Algorithm 2.
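The quantitative-indicator branch of Algorithm 2 can be sketched as a keyword test. This is a simplification that covers only the numeric operations; the remaining logical/string branches fall through to None here:

```python
# Quantitative indicators from Algorithm 2, split by direction.
GREATER_WORDS = {"more", "most", "later", "last", "latest", "longer",
                 "larger", "younger", "newer", "taller", "higher"}
SMALLER_WORDS = {"less", "earlier", "earliest", "first", "shorter",
                 "smaller", "older", "closer"}

def numeric_operation(question, has_head_entity):
    """Return the numeric discrete operation for a comparison question,
    or None if no quantitative indicator is present (in which case the
    logical/string branches of Algorithm 2 would apply)."""
    words = {w.strip("?,.'\"") for w in question.lower().split()}
    if words & GREATER_WORDS:
        return "Which is greater" if has_head_entity else "Is greater"
    if words & SMALLER_WORDS:
        return "Which is smaller" if has_head_entity else "Is smaller"
    return None

op = numeric_operation(
    "Which magazine had more previous names, Watercolor Artist or The General?",
    has_head_entity=True,
)   # → "Which is greater"
```

With no head entity, "Who was born earlier, Emma Bull or Virginia Woolf?" maps to "Is smaller" by the same test, matching the yes/no form of the numeric operations.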
Finally, we generate sub-questions according to the discrete operation. Two sub-questions are obtained, one for each entity.

Table 10: ANS is the answer of each query, and ENT is the entity corresponding to each query. The answer of each query is shown to the right of →. Given the question and the two entities for comparison, the queries and a discrete operation can be obtained by heuristics.

C Implementation Details
We use PyTorch (Paszke et al., 2017) on top of Hugging Face's BERT implementation. 11 We fine-tune Google's pretrained BERT-BASE (lowercased) model, 12 containing 12 layers of Transformers (Vaswani et al., 2017) and a hidden dimension of 768. We optimize the objective function using Adam (Kingma and Ba, 2015) with learning rate 5 × 10^-5. We lowercase the input and set the maximum sequence length |S| to 300 for models whose input is both the question and the paragraph, and to 50 for models whose input is the question only.

D Creating Inverted Binary Comparison Questions
We identify that comparison questions with 7 of the 10 discrete operations (Is greater, Is smaller, Which is greater, Which is smaller, Which is true, Is equal, Not equal) can be automatically inverted. This yields 665 inverted questions.

11 https://github.com/huggingface/pytorch-pretrained-BERT
12 https://github.com/google-research/bert

E A Set of Samples used for Ablations
A set of samples used for ablations in Section 4.6 is shown in Table 11.
Algorithm 2: Algorithm for identifying the discrete operation. First, given the two entities for comparison, the coordination and the preconjunct or predeterminer are identified. Then, the quantitative indicator and the head entity are identified if they exist, where the set of quantitative indicators is pre-defined. If a quantitative indicator exists, the discrete operation is determined to be one of the numeric operations. Otherwise, the discrete operation is determined to be one of the logical or string operations.
procedure FIND_OPERATION(question, entity1, entity2)
    coordination, preconjunct ← f(question, entity1, entity2)
    determine whether the question is an either-question or a both-question from coordination and preconjunct
    head_entity ← f_head(question, entity1, entity2)
    if more, most, later, last, latest, longer, larger, younger, newer, taller, or higher in question then
        if head_entity exists then discrete_operation ← Which is greater
        else discrete_operation ← Is greater
    else if less, earlier, earliest, first, shorter, smaller, older, or closer in question then
        if head_entity exists then discrete_operation ← Which is smaller
        else discrete_operation ← Is smaller
    else if head_entity exists then
        discrete_operation ← Which is true
    else if question is not a yes/no question and asks for a property in common then
        discrete_operation ← Intersection
    else if question is a yes/no question then
        determine whether the question asks for logical or string comparison
        if logical comparison then
            if either-question then discrete_operation ← Or
            else if both-question then discrete_operation ← And
        else if string comparison then
            if asks whether the same then discrete_operation ← Is equal
            else if asks whether different then discrete_operation ← Not equal
    return discrete_operation

Table 11: Question IDs from the set of samples used for ablations in Section 4.6.
5abce73055429959677d6b34,5a80071f5542992bc0c4a684,5a840a9e5542992ef85e2397,5a7e02cf5542997cc2c474f4,5ac1c9a15542994ab5c67e1c
5a81ea115542995ce29dcc78,5ae7308d5542991e8301cbb8,5ae527945542993aec5ec167,5ae748d1554299572ea547b0,5a71148b5542994082a3e567
5ae531695542990ba0bbb1fb,5a8f5273554299458435d5b1,5ac2db67554299657fa290a6,5ae0c7e755429945ae95944c,5a7150c75542994082a3e7be
5abffc0d5542990832d3a1e2,5a721bbc55429971e9dc9279,5ab57fc4554299488d4d99c0,5abbda84554299642a094b5b,5ae7936d5542997ec27276a7
5ab2d3df554299194fa9352c,5ac279345542990b17b153b0,5ab8179f5542990e739ec817,5ae20cd25542997283cd2376,5ae67def5542991bbc9760f3
5a901b985542995651fb50b0,5a808cbd5542996402f6a54b,5a84574455429933447460e6,5ab9b1fd5542996be202058e,5a7f1ad155429934daa2fce2
5ade03da5542997dc7907120,5a809fe75542996402f6a5ba,5ae28058554299495565da90,5abd09585542996e802b469b,5a7f9cbd5542994857a7677c
5a7b4073554299042af8f733,5ac119335542992a796dede4,5a7e1a2955429965cec5ea5d,5a8febb555429916514e73e4,5a87184a5542991e771816c5
5a86681c5542991e77181644,5abba584554299642a094afa,5add39e75542997545bbbcc4,5a7f354b5542992e7d278c8c,5a89810655429946c8d6e929
5a78c7db55429974737f7882,5a8d0c1b5542994ba4e3dbb3,5a87e5345542993e715abffb,5ae736cb5542991bbc9761c2,5ae057fd55429945ae959328