Compositional Questions Do Not Necessitate Multi-hop Reasoning

Multi-hop reading comprehension (RC) questions are challenging because they require reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. Our analysis is centered on HotpotQA, where we show that single-hop reasoning can solve much more of the dataset than previously thought. We introduce a single-hop BERT-based RC model that achieves 67 F1—comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions. Together with detailed error analysis, these results suggest there should be an increasing focus on the role of evidence in multi-hop reasoning and possibly even a shift towards information retrieval style evaluations with large and diverse evidence collections.


Introduction
Multi-hop reading comprehension (RC) requires reading and aggregating information over multiple pieces of textual evidence (Welbl et al., 2017; Yang et al., 2018; Talmor and Berant, 2018). In this work, we argue that it can be difficult to construct large multi-hop RC datasets. This is because multi-hop reasoning is a characteristic of both the question and the provided evidence; even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. For example, the question in Figure 1 is compositional: a plausible solution is to find "What animal's habitat was the Réserve Naturelle Lomako Yokokala established to protect?", and then answer "What is the former name of that animal?". However, when considering the evidence paragraphs, the question is solvable in a single hop by finding the only paragraph that describes an animal.
Question: What is the former name of the animal whose habitat the Réserve Naturelle Lomako Yokokala was established to protect?
Paragraph 5: The Lomako Forest Reserve is found in Democratic Republic of the Congo. It was established in 1991 especially to protect the habitat of the Bonobo apes.
Paragraph 1: The bonobo ("Pan paniscus"), formerly called the pygmy chimpanzee and less often, the dwarf or gracile chimpanzee, is an endangered great ape and one of the two species making up the genus "Pan".
Figure 1: A HOTPOTQA example designed to require reasoning across two paragraphs. Eight spurious additional paragraphs (not shown) are provided to increase the task difficulty. However, since only one of the ten paragraphs is about an animal, one can immediately locate the answer in Paragraph 1 using one hop. The full example is provided in Appendix A.
* Equal Contribution.
Our analysis is centered on HOTPOTQA (Yang et al., 2018), a dataset of mostly compositional questions. In its RC setting, each question is paired with two gold paragraphs, which should be needed to answer the question, and eight distractor paragraphs, which provide irrelevant evidence or incorrect answers. We show that single-hop reasoning can solve much more of this dataset than previously thought. First, we design a single-hop QA model based on BERT (Devlin et al., 2018), which, despite having no ability to reason across paragraphs, achieves performance competitive with the state of the art. Next, we present an evaluation demonstrating that humans can solve over 80% of questions when we withhold one of the gold paragraphs.
To better understand these results, we present a detailed analysis of why single-hop reasoning works so well. We show that questions include redundant facts which can be ignored when computing the answer, and that the fine-grained entity types present in the provided paragraphs in the RC setting often provide a strong signal for answering the question, e.g., there is only one animal in the given paragraphs in Figure 1, allowing one to immediately locate the answer using one hop.
This analysis shows that more carefully chosen distractor paragraphs would induce questions that require multi-hop reasoning. We thus explore an alternative method for collecting distractors based on adversarial paragraph selection. Although this appears to mitigate the problem, a single-hop model re-trained on these distractors can recover most of the original single-hop accuracy, indicating that these distractors are still insufficient. Another method is to consider very large distractor sets such as all of Wikipedia or the entire Web, as done in open-domain HOTPOTQA and COMPLEXWEBQUESTIONS (Talmor and Berant, 2018). However, this introduces additional computational challenges and/or the need for retrieval systems. Finding a small set of distractors that induce multi-hop reasoning remains an open challenge that is worthy of follow-up work.
Existing multi-hop QA datasets are constructed using knowledge bases, e.g., WIKIHOP (Welbl et al., 2017) and COMPLEXWEBQUESTIONS (Talmor and Berant, 2018), or using crowd workers, e.g., HOTPOTQA (Yang et al., 2018). WIKIHOP questions are posed as triples of a relation and a head entity, and the task is to determine the tail entity of the relationship. COMPLEXWEBQUESTIONS consists of open-domain compositional questions, which are constructed by increasing the complexity of SPARQL queries from WEBQUESTIONS (Berant et al., 2013). We focus on HOTPOTQA, which consists of multi-hop questions written to require reasoning over two paragraphs from Wikipedia.
Parallel research from Chen and Durrett (2019) presents similar findings on HOTPOTQA. Our work differs because we conduct human analysis to understand why questions are solvable using single-hop reasoning. Moreover, we show that selecting distractor paragraphs is difficult using current retrieval methods.

Single-paragraph QA
This section shows the performance of a single-hop model on HOTPOTQA.

Model Description
Our model, single-paragraph BERT, scores and answers each paragraph independently (Figure 2). We then select the answer from the paragraph with the best score, similar to Clark and Gardner (2018). The model receives a question $Q = [q_1, \ldots, q_m]$ and a single paragraph $P = [p_1, \ldots, p_n]$ as input. Following Devlin et al. (2018), $S = [q_1, \ldots, q_m, \texttt{[SEP]}, p_1, \ldots, p_n]$, where [SEP] is a special token, is fed into BERT:

$$S' = \mathrm{BERT}(S) \in \mathbb{R}^{h \times (m+n+1)},$$

where $h$ is the hidden dimension of BERT. Next, a classifier uses max-pooling and learned parameters $W_1 \in \mathbb{R}^{h \times 4}$ to generate four scalars:

$$[y_{\mathrm{span}}; y_{\mathrm{yes}}; y_{\mathrm{no}}; y_{\mathrm{empty}}] = W_1\, \mathrm{maxpool}(S'),$$

where $y_{\mathrm{span}}$, $y_{\mathrm{yes}}$, $y_{\mathrm{no}}$ and $y_{\mathrm{empty}}$ indicate that the answer is a span, yes, no, or no answer, respectively. An extractive paragraph span, span, is obtained separately following Devlin et al. (2018). The final model outputs are the scalar value $y_{\mathrm{empty}}$ and a text that is either span, yes or no, depending on which of $y_{\mathrm{span}}$, $y_{\mathrm{yes}}$, $y_{\mathrm{no}}$ has the largest value. For a particular HOTPOTQA example, we run single-paragraph BERT on each paragraph in parallel and select the answer from the paragraph with the smallest $y_{\mathrm{empty}}$.
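As a rough illustration of the classification head described above, the following Python sketch max-pools per-token hidden vectors, applies a weight matrix of shape h × 4, and picks the answer text for one paragraph. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the token-major layout of `hidden`, and the plain-list arithmetic are hypothetical, and the BERT encoder is replaced by precomputed vectors.

```python
from typing import List, Sequence, Tuple


def max_pool(hidden: Sequence[Sequence[float]]) -> List[float]:
    """Max over token positions.

    `hidden` is assumed to be token-major: one vector of size h per token
    (the transpose of the h x (m+n+1) layout used in the text).
    """
    return [max(column) for column in zip(*hidden)]


def classify(hidden: Sequence[Sequence[float]],
             w1: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Return (y_span, y_yes, y_no, y_empty).

    `w1` is stored as h rows of 4 weights, so each output is the dot
    product of the pooled vector with one column of `w1`.
    """
    pooled = max_pool(hidden)
    return tuple(
        sum(pooled[k] * w1[k][c] for k in range(len(pooled)))
        for c in range(4)
    )


def paragraph_output(hidden, w1, span_text: str) -> Tuple[float, str]:
    """Return y_empty and the answer text (span, 'yes' or 'no') for one paragraph."""
    y_span, y_yes, y_no, y_empty = classify(hidden, w1)
    scores = {"span": y_span, "yes": y_yes, "no": y_no}
    best = max(scores, key=scores.get)
    text = span_text if best == "span" else best
    return y_empty, text
```

Here `span_text` stands for the extractive span that the text says is obtained separately; the sketch takes it as given.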

HOTPOTQA has two settings: a distractor setting and an open-domain setting.

Distractor Setting
The HOTPOTQA distractor setting pairs the two paragraphs the question was written for (gold paragraphs) with eight spurious paragraphs selected using TF-IDF similarity with the question (distractors). Our single-paragraph BERT model achieves 67.08 F1, comparable to the state of the art (Table 1). This indicates the majority of HOTPOTQA questions are answerable in the distractor setting using a single-hop model.

Open-domain Setting
The HOTPOTQA open-domain setting (Fullwiki) does not provide a set of paragraphs; instead, all of Wikipedia is considered. We follow Chen et al. (2017) and retrieve paragraphs using bigram TF-IDF similarity with the question.
We use the single-paragraph BERT model trained in the distractor setting. We also fine-tune the model using incorrect paragraphs selected by the retrieval system. In particular, we retrieve 30 paragraphs and select the eight paragraphs with the lowest $y_{\mathrm{empty}}$ scores predicted by the trained model.
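The retrieval step can be illustrated with a small Python sketch that ranks paragraphs by a TF-IDF score over unigrams and bigrams. This is only a toy stand-in under stated assumptions: it is not the hashed bigram retriever of Chen et al. (2017), and the whitespace tokenizer, the smoothed IDF formula, and the names `ngrams` and `tfidf_rank` are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from typing import List


def ngrams(text: str) -> List[str]:
    """Lowercased unigrams and bigrams from whitespace tokens (assumed tokenizer)."""
    tokens = text.lower().split()
    bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return tokens + bigrams


def tfidf_rank(question: str, paragraphs: List[str], k: int) -> List[int]:
    """Return indices of the k paragraphs with the highest TF-IDF overlap score."""
    docs = [Counter(ngrams(p)) for p in paragraphs]
    n_docs = len(docs)
    # Document frequency: iterating a Counter yields each distinct n-gram once.
    df = Counter(gram for doc in docs for gram in doc)

    def idf(gram: str) -> float:
        return math.log((n_docs + 1) / (df[gram] + 1)) + 1.0

    query = Counter(ngrams(question))
    scored = []
    for index, doc in enumerate(docs):
        score = sum(query[g] * doc[g] * idf(g) ** 2 for g in query if g in doc)
        scored.append((score, index))
    # Highest score first; ties are broken by original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]
```

With such a ranker, the 30 retrieved paragraphs mentioned above would correspond to `tfidf_rank(question, paragraphs, 30)`; the subsequent selection by model score is a separate step.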

Categorizing Bridge Questions
Bridge questions consist of two paragraphs linked by an entity (Yang et al., 2018), e.g., Figure 1. We first investigate single-hop human performance on HOTPOTQA bridge questions in a human study with NLP graduate students. Humans see the paragraph that contains the answer span and the eight distractor paragraphs, but do not see the other gold paragraph. As a baseline, we show a different set of people the same questions in their standard ten-paragraph form. On a sample of 200 bridge questions from the validation set, human accuracy shows marginal degradation when using only one hop: humans obtain 87.37 F1 using all ten paragraphs and 82.06 F1 when using only nine (where they only see a single gold paragraph). This indicates humans, just like models, are capable of solving bridge questions using only one hop.
Next, we manually categorize what enables single-hop answers for 100 bridge validation examples (taking into account the distractor paragraphs), and place questions into four categories (Table 2).
Multi-hop 27% of questions require multi-hop reasoning. The first example of Table 2 requires locating the university where "Ralph Hefferline" was a psychology professor, and multiple universities are provided as distractors. Therefore, the answer cannot be determined in one hop.
Weak Distractors 35% of questions allow single-hop answers in the distractor setting, mostly by entity type matching. Consider the question in the second row of Table 2. To further investigate entity type matching, we reduce the question to the first five tokens starting from the wh-word, following Sugawara et al. (2018). Although most of these reduced questions appear void of critical information, the F1 score of single-paragraph BERT only degrades by about 15 F1, from 67.08 to 52.13.
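The question reduction used in this probe can be sketched in a few lines of Python. This is a hedged illustration only: the wh-word list, the punctuation handling, and the fallback for questions without a wh-word are assumptions, and `reduce_question` is a hypothetical name rather than the procedure of Sugawara et al. (2018).

```python
# Assumed list of wh-words; the original study may use a different set.
WH_WORDS = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}


def reduce_question(question: str, keep: int = 5) -> str:
    """Keep the first `keep` whitespace tokens starting from the first wh-word.

    If no wh-word is found, fall back to the first `keep` tokens (assumption).
    """
    tokens = question.split()
    for index, token in enumerate(tokens):
        if token.lower().strip(",.?\"'") in WH_WORDS:
            return " ".join(tokens[index:index + keep])
    return " ".join(tokens[:keep])
```

Applied to the question in Figure 1, this yields "What is the former name", which no longer mentions the reserve or the animal.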
Redundant Evidence 26% of questions are compositional but are solvable using only part of the question. For instance, in the third example of Table 2 there is only a single founder of "Kaiser Ventures." Thus, one can ignore the condition on "American industrialist" and "father of modern American shipbuilding." This category differs from the weak distractors category because its questions are single-hop regardless of the distractors.
Non-compositional Single-hop 8% of questions are non-compositional and single-hop. In the last example of Table 2, one sentence contains all of the information needed to answer correctly.

Categorizing Comparison Questions
Comparison questions require quantitative or logical comparisons between two quantities or events. We create rules (Appendix C) to group comparison questions into three categories: questions which require multi-hop reasoning (multi-hop), may require multi-hop reasoning (context-dependent), and require single-hop reasoning (single-hop).
Many comparison questions are multi-hop or context-dependent multi-hop, and single-paragraph BERT achieves near chance accuracy on these types of questions (Table 3). This shows that most comparison questions are not solvable by our single-hop model.

Can We Find Better Distractors?
In Section 4.1, we identify that 35% of bridge examples are solvable using single-hop reasoning due to weak distractor paragraphs. Here, we attempt to automatically correct these examples by choosing new distractor paragraphs which are likely to trick our single-paragraph model.

Adversarial Distractors
We select the top-50 first paragraphs of Wikipedia pages using TF-IDF similarity with the question, following the original HOTPOTQA setup. Next, we use single-paragraph BERT to adversarially select the eight distractor paragraphs from these 50 candidates. In particular, we feed each paragraph to the model and select the paragraphs with the lowest $y_{\mathrm{empty}}$ score (i.e., the paragraphs that the model thinks contain the answer). These paragraphs are dissimilar to the original distractors: there is a 9.82% overlap. We report the F1 score of single-paragraph BERT on these new distractors in Table 4: the accuracy declines from 67.08 F1 to 46.84 F1. However, when the same procedure is done on the training set and the model is re-trained, the accuracy increases to 60.10 F1 on the adversarial distractors.
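The adversarial selection step can be sketched as follows. This is a minimal illustration under stated assumptions: `y_empty_fn` is a hypothetical callable standing in for the trained single-paragraph model, gold paragraphs are assumed to be excluded from the candidate pool, and the overlap is assumed to be the fraction of new distractors that also appear among the original ones; the text does not specify these details.

```python
from typing import Callable, Iterable, List, Set


def adversarial_distractors(candidates: Iterable[str],
                            y_empty_fn: Callable[[str], float],
                            gold: Set[str],
                            k: int = 8) -> List[str]:
    """Pick the k non-gold candidates with the lowest y_empty score.

    A low y_empty means the model believes the paragraph contains the answer,
    so these are the paragraphs most likely to mislead it.
    """
    pool = [c for c in candidates if c not in gold]
    return sorted(pool, key=y_empty_fn)[:k]


def overlap(original: Iterable[str], new: Iterable[str]) -> float:
    """Fraction of the new distractors that were also original distractors (assumed definition)."""
    new_set = set(new)
    return len(set(original) & new_set) / len(new_set)
```

In the setup above, `candidates` would hold the 50 TF-IDF candidates and `k` would be eight.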

Type Distractors
We also experiment with filtering the initial list of 50 paragraphs to those whose entity type (e.g., person) matches that of the gold paragraphs. This can help to eliminate the entity type bias described in Section 4.1. As shown in Table 4, the original model's accuracy degrades significantly (drops to 40.73 F1). However, similar to the previous setup, the model trained on the adversarially selected distractors can recover most of its original accuracy (increases to 58.42 F1).
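The type filter can be pictured with a short Python sketch. It is hypothetical throughout: `entity_type` is an assumed mapping from a paragraph identifier to a coarse type label, and a candidate is kept when its label matches the type of either gold paragraph, which is one possible reading of the description above.

```python
from typing import Dict, Iterable, List, Set


def filter_by_type(candidates: Iterable[str],
                   entity_type: Dict[str, str],
                   gold_types: Set[str]) -> List[str]:
    """Keep candidates whose (assumed) entity type matches a gold paragraph's type.

    Candidates with no known type are dropped.
    """
    return [c for c in candidates if entity_type.get(c) in gold_types]
```

For the example in Figure 1, filtering with an animal type would leave only distractors about animals, removing the shortcut of locating the single animal paragraph.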
These results show that single-paragraph BERT can struggle when the distribution of the distractors changes (e.g., using adversarial selection rather than only TF-IDF). Moreover, the model can somewhat recover its original accuracy when re-trained on distractors from the new distribution.

Conclusions
In summary, we demonstrate that question compositionality is not a sufficient condition for multi-hop reasoning. Instead, future datasets must carefully consider what evidence they provide in order to ensure multi-hop reasoning is required. There are at least two different ways to achieve this.

Open-domain Questions
Our single-hop model struggles in the open-domain setting. We largely attribute this to the insufficiencies of standard TF-IDF retrieval for multi-hop questions. For example, we fail to retrieve the paragraph about "Bonobo apes" in Figure 1, because the question does not contain terms about "Bonobo apes." Table 5 shows that the model achieves 39.12 F1 given 500 retrieved paragraphs, but achieves 53.12 F1 when the two gold paragraphs are additionally given, demonstrating the significant effect of failing to retrieve gold paragraphs. In this context, we suggest that future work can explore better retrieval methods for multi-hop questions.

Retrieving Strong Distractors
Another way to ensure multi-hop reasoning is to select strong distractor paragraphs. For example, we found 35% of bridge questions are currently single-hop but may become multi-hop when combined with stronger distractors (Section 4.1). However, as we demonstrate in Section 5, selecting strong distractors for RC questions is non-trivial. We suspect this is also due to the insufficiencies of standard TF-IDF retrieval for multi-hop questions. In particular, Table 5 shows that single-paragraph BERT achieves 53.12 F1 even when using 500 distractors (rather than eight), indicating that 500 distractors are still insufficient. To this end, future multi-hop RC datasets can develop improved methods for distractor collection.

A Example Distractor Question
We present the full example from Figure 1 below. Paragraphs 1 and 5 are the two gold paragraphs.
Question What is the former name of the animal whose habitat the Réserve Naturelle Lomako Yokokala was established to protect?
Answer pygmy chimpanzee
(Gold Paragraph) Paragraph 1 The bonobo (or ; "Pan paniscus"), formerly called the pygmy chimpanzee and less often, the dwarf or gracile chimpanzee, is an endangered great ape and one of the two species making up the genus "Pan"; the other is "Pan troglodytes", or the common chimpanzee. Although the name "chimpanzee" is sometimes used to refer to both species together, it is usually understood as referring to the common chimpanzee, whereas "Pan paniscus" is usually referred to as the bonobo.
Paragraph 2 The Carriére des Nerviens Regional Nature Reserve (in French "Réserve naturelle régionale de la carriére des Nerviens") is a protected area in the Nord-Pas-de-Calais region of northern France. It was established on 25 May 2009 to protect a site containing rare plants and covers just over 3 ha. It is located in the municipalities of Bavay and Saint-Waast in the Nord department.
Paragraph 3 Céreste (Occitan: "Ceirésta") is a commune in the Alpes-de-Haute-Provence department in southeastern France. It is known for its rich fossil beds in fine layers of "Calcaire de Campagne Calavon" limestone, which are now protected by the Parc naturel régional du Luberon and the Réserve naturelle géologique du Luberon.
Paragraph 4 The Grand Cote National Wildlife Refuge (French: "Réserve Naturelle Faunique Nationale du Grand-Cote") was established in 1989 as part of the North American Waterfowl Management Plan. It is a 6000 acre reserve located in Avoyelles Parish, near Marksville, Louisiana, in the United States.
(Gold Paragraph) Paragraph 5 The Lomako Forest Reserve is found in Democratic Republic of the Congo. It was established in 1991 especially to protect the habitat of the Bonobo apes. This site covers 3,601.88 km².
Paragraph 6 Guadeloupe National Park (French: "Parc national de la Guadeloupe") is a national park in Guadeloupe, an overseas department of France located in the Leeward Islands of the eastern Caribbean region. The Grand Cul-de-Sac Marin Nature Reserve (French: "Réserve Naturelle du Grand Cul-de-Sac Marin") is a marine protected area adjacent to the park and administered in conjunction with it. Together, these protected areas comprise the Guadeloupe Archipelago (French: "l'Archipel de la Guadeloupe") biosphere reserve.

B Full Model Details
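Single-paragraph BERT is a pipeline which first retrieves a single paragraph using a classifier and then selects the associated answer. Formally, the model receives a question $Q = [q_1, \ldots, q_m]$ and a single paragraph $P = [p_1, \ldots, p_n]$ as input. The question and paragraph are merged into a single sequence, $S = [q_1, \ldots, q_m, \texttt{[SEP]}, p_1, \ldots, p_n]$, where [SEP] is a special token. $S$ is fed into BERT:

$$S' = \mathrm{BERT}(S) \in \mathbb{R}^{h \times (m+n+1)},$$

where $h$ is the hidden dimension of BERT. Next, a classifier uses max-pooling and learned parameters $W_1 \in \mathbb{R}^{h \times 4}$ to generate four scalars:

$$[y_{\mathrm{span}}; y_{\mathrm{yes}}; y_{\mathrm{no}}; y_{\mathrm{empty}}] = W_1\, \mathrm{maxpool}(S'),$$

where $y_{\mathrm{span}}$, $y_{\mathrm{yes}}$, $y_{\mathrm{no}}$ and $y_{\mathrm{empty}}$ indicate that the answer is a span, yes, no, or no answer, respectively.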
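A candidate answer span is then computed separately from the classifier. We define

$$p_{\mathrm{start}} = \mathrm{Softmax}(W_2 S'), \qquad p_{\mathrm{end}} = \mathrm{Softmax}(W_3 S'),$$

where $W_2, W_3 \in \mathbb{R}^{h}$ are learned parameters. Then, $y_{\mathrm{start}}$ and $y_{\mathrm{end}}$ are obtained:

$$y_{\mathrm{start}}, y_{\mathrm{end}} = \arg\max_{i \le j} \; p^{i}_{\mathrm{start}}\, p^{j}_{\mathrm{end}},$$

where $p^{i}_{\mathrm{start}}$ and $p^{j}_{\mathrm{end}}$ indicate the $i$-th element of $p_{\mathrm{start}}$ and the $j$-th element of $p_{\mathrm{end}}$, respectively.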
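The span search can be illustrated with a brute-force Python sketch. It assumes, in the style of Devlin et al. (2018), that the span is chosen to maximize the product of a start probability and an end probability subject to the start not following the end; the function names and the use of raw logit lists as input are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def best_span(start_logits: Sequence[float],
              end_logits: Sequence[float]) -> Tuple[int, int]:
    """Return (i, j) with i <= j maximizing p_start[i] * p_end[j]."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best, best_score = (0, 0), -1.0
    for i, start_prob in enumerate(p_start):
        for j in range(i, len(p_end)):
            score = start_prob * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

The double loop is quadratic in the sequence length, which is acceptable for sequences of a few hundred tokens; practical implementations often also cap the span length.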
We now have four scalar values $y_{\mathrm{span}}$, $y_{\mathrm{yes}}$, $y_{\mathrm{no}}$, and $y_{\mathrm{empty}}$ and a span from the paragraph, $\mathrm{span} = [S_{y_{\mathrm{start}}}, \ldots, S_{y_{\mathrm{end}}}]$.
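For HOTPOTQA, the input is a question and $N$ context paragraphs. We create a batch of size $N$, where each entry is a question and a single paragraph. Denote the output from the $i$-th entry as $y^{i}_{\mathrm{span}}$, $y^{i}_{\mathrm{yes}}$, $y^{i}_{\mathrm{no}}$, $y^{i}_{\mathrm{empty}}$ and $\mathrm{span}^{i}$. The final answer is selected as:

$$j = \arg\min_{i} \; y^{i}_{\mathrm{empty}}, \qquad y_{\max} = \max\left(y^{j}_{\mathrm{span}}, y^{j}_{\mathrm{yes}}, y^{j}_{\mathrm{no}}\right),$$

$$\mathrm{answer} = \begin{cases} \mathrm{span}^{j} & \text{if } y^{j}_{\mathrm{span}} = y_{\max} \\ \text{yes} & \text{if } y^{j}_{\mathrm{yes}} = y_{\max} \\ \text{no} & \text{if } y^{j}_{\mathrm{no}} = y_{\max} \end{cases}$$

During training, $y^{i}_{\mathrm{empty}}$ is set to 0 for the paragraph which contains the answer span and 1 otherwise.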
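The selection over the batch can be sketched in Python as follows. This is a hedged illustration: the dictionary layout of the per-paragraph outputs is hypothetical, and the training targets use plain substring matching to decide whether a paragraph contains the answer, which is an assumption (the text does not say how containment is tested, nor how yes/no answers are handled).

```python
from typing import Dict, List, Sequence, Union

Output = Dict[str, Union[float, str]]


def select_answer(outputs: Sequence[Output]) -> str:
    """Pick the paragraph with the smallest y_empty, then its best answer type."""
    j = min(range(len(outputs)), key=lambda i: outputs[i]["y_empty"])
    best = outputs[j]
    y_max = max(best["y_span"], best["y_yes"], best["y_no"])
    if best["y_span"] == y_max:
        return best["span"]
    if best["y_yes"] == y_max:
        return "yes"
    return "no"


def empty_targets(paragraphs: Sequence[str], answer: str) -> List[int]:
    """Training target for y_empty: 0 if the paragraph contains the answer, else 1.

    Substring matching is an assumption made for this sketch.
    """
    return [0 if answer in paragraph else 1 for paragraph in paragraphs]
```

Note that only the paragraph with the lowest `y_empty` contributes to the answer, so evidence is never combined across paragraphs; this is what makes the model single-hop.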

Implementation Details
We use PyTorch (Paszke et al., 2017) based on Hugging Face's implementation. We use Adam (Kingma and Ba, 2015) with learning rate $5 \times 10^{-5}$. We lowercase the input and set the maximum sequence length $|S|$ to 300. If a sequence is longer than 300, we split it into multiple sequences and treat them as different examples.
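One way to picture the splitting of long inputs is the Python sketch below. It is an assumption-laden illustration, not the authors' code: it repeats the question in every piece, cuts the paragraph into non-overlapping chunks, and counts the [SEP] token toward the limit of 300; the text does not specify any of these choices, and `split_example` is a hypothetical name.

```python
from typing import List, Sequence


def split_example(question_tokens: Sequence[str],
                  paragraph_tokens: Sequence[str],
                  max_len: int = 300) -> List[List[str]]:
    """Split one (question, paragraph) pair into sequences of at most max_len tokens."""
    budget = max_len - len(question_tokens) - 1  # room left after question and [SEP]
    if budget <= 0:
        raise ValueError("question leaves no room for paragraph tokens")
    chunks = [
        list(paragraph_tokens[start:start + budget])
        for start in range(0, len(paragraph_tokens), budget)
    ] or [[]]  # keep one (empty) chunk for an empty paragraph
    return [list(question_tokens) + ["[SEP]"] + chunk for chunk in chunks]
```

Each returned sequence would then be treated as a separate example, as described above.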

C Categorizing Comparison Questions
This section describes how we categorize comparison questions. We first identify ten question operations that sufficiently cover comparison questions (Table 6). Next, for each question, we extract the two entities under comparison using the spaCy NER tagger on the question and the two HOTPOTQA supporting facts. Using these extracted entities, we identify the suitable question operation following Algorithm 1.

Algorithm 1
    coordination, preconjunct ← f(question, entity1, entity2)
    Determine if the question is an either question or a both question from coordination and preconjunct
    head entity ← f_head(question, entity1, entity2)
    if more, most, later, last, latest, longer, larger, younger, newer, taller, higher in question then
        if head entity exists then discrete operation ← Which is greater
        else discrete operation ← Is greater
    else if less, earlier, earliest, first, shorter, smaller, older, closer in question then
        if head entity exists then discrete operation ← Which is smaller
        else discrete operation ← Is smaller
    else if head entity exists then
        discrete operation ← Which is true
    else if question is not yes/no question and asks for the property in common then
        discrete operation ← Intersection
    else if question is yes/no question then
        Determine if question asks for logical comparison or string comparison
        if question asks for logical comparison then
            if either question then discrete operation ← Or
            else if both question then discrete operation ← And
        else if question asks for string comparison then
            if asks for same? then discrete operation ← Is equal
            else if asks for difference? then discrete operation ← Not equal
    return discrete operation

Based on the identified operation, questions are classified into multi-hop, context-dependent multi-hop, or single-hop. First, numerical questions are always multi-hop (e.g., first example of Table 6).
Next, the operations And, Or, Is equal, and Not equal are context-dependent multi-hop. For instance, in the second example of Table 6, if "Hot Rod" is not a magazine, one can immediately answer No. Finally, the operations Which is true and Intersection are single-hop because they can be answered using one paragraph regardless of the context. For instance, in the third example of Table 6, if Henry Roth's paragraph explains he is from England, one can answer Henry Roth, otherwise, the answer is Robert Erskine Childers.
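The decision rules of Algorithm 1 and the mapping from operations to categories can be sketched in Python. This is a simplified illustration under stated assumptions: the outputs of the parsing functions used in Algorithm 1 are passed in as hypothetical boolean and string flags rather than computed, keyword matching is done on lowercased alphabetic tokens, and treating the four greater/smaller operations as the "numerical" ones is an interpretation of the text.

```python
import re
from typing import Optional

GREATER = {"more", "most", "later", "last", "latest", "longer", "larger",
           "younger", "newer", "taller", "higher"}
SMALLER = {"less", "earlier", "earliest", "first", "shorter", "smaller",
           "older", "closer"}

# Assumed mapping from operation to category, following the description above.
CATEGORY = {
    "Which is greater": "multi-hop", "Is greater": "multi-hop",
    "Which is smaller": "multi-hop", "Is smaller": "multi-hop",
    "And": "context-dependent", "Or": "context-dependent",
    "Is equal": "context-dependent", "Not equal": "context-dependent",
    "Which is true": "single-hop", "Intersection": "single-hop",
}


def discrete_operation(question: str,
                       has_head_entity: bool,
                       is_yes_no: bool,
                       asks_common_property: bool = False,
                       coordination: Optional[str] = None,   # "either" or "both"
                       comparison: Optional[str] = None,     # "logical" or "string"
                       string_test: Optional[str] = None     # "same" or "difference"
                       ) -> Optional[str]:
    """Return the question operation, or None if no rule applies."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    if words & GREATER:
        return "Which is greater" if has_head_entity else "Is greater"
    if words & SMALLER:
        return "Which is smaller" if has_head_entity else "Is smaller"
    if has_head_entity:
        return "Which is true"
    if not is_yes_no and asks_common_property:
        return "Intersection"
    if is_yes_no:
        if comparison == "logical":
            if coordination == "either":
                return "Or"
            if coordination == "both":
                return "And"
        elif comparison == "string":
            if string_test == "same":
                return "Is equal"
            if string_test == "difference":
                return "Not equal"
    return None
```

A question such as "Who is older, A or B?" with a head entity maps to Which is smaller and hence to the multi-hop category, while a yes/no question joined by "both" maps to And, which is context-dependent.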