Exploiting Explicit Paths for Multi-hop Reading Comprehension

We propose a novel, path-based reasoning approach for the multi-hop reading comprehension task where a system needs to combine facts from multiple passages to answer a question. Although inspired by multi-hop reasoning over knowledge graphs, our proposed approach operates directly over unstructured text. It generates potential paths through passages and scores them without any direct path supervision. The proposed model, named PathNet, attempts to extract implicit relations from text through entity pair representations, and compose them to encode each path. To capture additional context, PathNet also composes the passage representations along each path to compute a passage-based representation. Unlike previous approaches, our model is then able to explain its reasoning via these explicit paths through the passages. We show that our approach outperforms prior models on the multi-hop WikiHop dataset, and can also be generalized to the OpenBookQA dataset, matching state-of-the-art performance.


Introduction
Many Reading Comprehension (RC) datasets (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017) have been proposed recently to evaluate a system's ability to understand language based on its ability to answer a question given a passage. However, most of the questions in these datasets can be answered by using only a single sentence or passage. As a result, systems designed for these tasks may not acquire the capability to compose knowledge from multiple passages, a key aspect of natural language understanding. To remedy this, new datasets (Welbl et al., 2018; Khashabi et al., 2018a; Mihaylov et al., 2018) have been proposed recently that require a system to combine information from multiple sentences to arrive at the answer, often referred to as multi-hop reasoning.
Multi-hop reasoning has been studied for question answering over structured knowledge graphs (Lao et al., 2011; Guu et al., 2015; Das et al., 2017), where many of the successful models explicitly identify paths in the knowledge graph that led to the answer. While these models can be highly interpretable due to explicit path-based reasoning, they cannot be directly applied to question answering in the absence of such structure. As a result, most multi-hop RC models over text (Dhingra et al., 2017; Hu et al., 2018) extend standard attention-based models from RC by iteratively updating the attention to "hop" over different parts of the text. Recently, graph-based models (Song et al., 2018; Cao et al., 2018) have been proposed for WikiHop, but these models still only implicitly combine knowledge from all the passages, and hence are unable to provide explicit reasoning paths for the selected answer.
* This work was done during an internship at AI2.
[Figure 1: Question: (always breaking my heart, record label, ?) Supporting Passages: (p1) "Always Breaking My Heart" is the second single from Belinda Carlisle's A Woman and a Man album, released in 1996 (see 1996 in music). It made ...]
We propose a model that explicitly extracts potential paths from text and encodes the knowledge captured by each path. Figure 1 shows how to apply this approach to an example in the WikiHop dataset (Welbl et al., 2018). In this example, we show two sample paths connecting an entity in the question, the song "Always Breaking My Heart", to a candidate answer, "Chrysalis Records", through a singer, "Belinda Carlisle", and an album, "A Woman and a Man". Our model extracts implicit (latent) relations between entity pairs in a passage based on the contextual representations of the entities. For example, our model would try to extract the implicit relation capturing the single from relation between the song and the name of the album in the first passage. Similarly, it extracts the implicit relation capturing the released by relation between the album and the record label in the second passage.
Having extracted these implicit relations, the model learns to compose them such that they map to the main relation in the query, namely, record label. In essence, our goal is to train a model that learns to extract implicit relations from text and valid compositions of these relations, such as: (x, single from, y), (y, released by, z) → (x, record label, z). Apart from focusing on the specific entities in each passage, we also learn to compose the aggregate passage representations in a path to capture more global information i.e. encode(p1), encode(p2) → (x, record label, z).
We make three main contributions: (1) a novel path-based reasoning approach for multi-hop question answering over text; (2) a model that learns to extract implicit relations from text and compose them; and (3) a competitive model on the WikiHop dataset that also produces explanations via reasoning over explicit paths.

Related Work
We present related work from question answering over text, semi-structured knowledge and knowledge graphs.

Multi-hop RC
Recently, multiple datasets such as bAbI, MultiRC (Khashabi et al., 2018a), WikiHop (Welbl et al., 2018), and OpenBookQA (Mihaylov et al., 2018) have been proposed to encourage research in multi-hop question answering over text. Multi-hop models for these tasks can be categorized into state-based and graph-based reasoning models. The state-based reasoning models (Dhingra et al., 2017; Shen et al., 2017; Hu et al., 2018) are closer to standard attention-based reading comprehension models, with an additional "state" representation that is iteratively updated. The changing state representation results in the model focusing on different parts of the passage during each iteration, allowing the model to combine information from different parts of the passage. The graph-based reasoning models (Dhingra et al., 2018; Cao et al., 2018; Song et al., 2018), on the other hand, create graphs over entities within these passages and update the entity representations via recurrent/convolutional networks. In contrast, our approach explicitly identifies paths connecting the entities in the question to the answer choices.

Semi-structured QA
Our model is closer to Integer Linear Programming (ILP)-based methods (Khashabi et al., 2016; Khot et al., 2017; Khashabi et al., 2018b) developed for science question answering. These approaches define an ILP program to find optimal "support graphs" connecting words in the question to the choices through a semi-structured knowledge representation. However, these models require a manually authored and tuned ILP program and need to convert text into a semi-structured representation that can be noisy (such as Open IE tuples (Khot et al., 2017) or SRL frames (Khashabi et al., 2018b)). Our model is trained end-to-end on the target dataset, which allows it to discover the relevant relations from text.

Knowledge Graph QA
Question answering datasets on knowledge graphs such as Freebase (Bollacker et al., 2008) require systems to map queries to a single relation, a path (Guu et al., 2015), or complex structured queries (Berant et al., 2013) over these graphs. While early models (Lao et al., 2011; Gardner and Mitchell, 2015) on this task focused on creating path-based features, recent neural path-based models (Guu et al., 2015; Das et al., 2017; Toutanova et al., 2016) encode the entities and relations along the path and compose them using recurrent networks. However, the input knowledge graphs have entities and relations shared across all examples that the model can exploit during learning (e.g., through learned entity/relation embeddings). When reasoning with text, our model has to learn these representations purely based on their local context.

Approach Overview
In this work, we focus on the multiple-choice RC setting. Given a question and a set of passages as support, the task is to find the correct answer from a predefined set of candidate answer choices.
In the WikiHop dataset, a question $Q$ is given in the form of a tuple $(h_e, r, ?)$, where $h_e$ represents the head entity and $r$ represents the relation between $h_e$ and the unknown tail entity. The task is to select the unknown tail entity from a given set of answer candidates $\{c_1, c_2, \ldots, c_N\}$.
To perform multi-hop reasoning, we extract multiple paths $\mathcal{P}$ (described in the next section) connecting $h_e$ to each $c_k$ from the supporting passages $P = \{p_1, \ldots, p_M\}$. We represent the $j$-th path for candidate $c_k$ as $p_{kj}$. For simplicity, we only consider two-hop paths, i.e., $p_{kj} = h_e \rightarrow e_1 \rightarrow c_k$, where $e_1$ is called an intermediate entity.
Note that, while we only consider up to two hops of reasoning, our approach can be easily extended to more than two hops.

Path Extraction
The first step in our approach is extracting paths from text passages for each candidate given a question. Consider the example in Figure 1. Overall, there are four steps in our path extraction approach:
• We find a passage $p_1$ that contains $h_e$ of the question $Q$. In our example, we would find the first supporting passage that contains always breaking my heart.
• Then, we find all the named entities and noun phrases that appear in the same sentence as $h_e$ or in the subsequent sentence. For instance, we would collect Belinda Carlisle, A Woman and a Man, or album as intermediate entity $e_1$.
• Now, we find a passage $p_2$ that contains any of the intermediate entities found in the previous step. For instance, we find the second passage that contains both Belinda Carlisle and A Woman and a Man.
• Finally, we check if $p_2$ contains any of the candidate answer choices. For instance, $p_2$ contains chrysalis records and emi group in our example.
The extracted paths for candidate chrysalis records can be summarized as a set of entity sequences: (always breaking my heart, Belinda Carlisle, chrysalis records), (always breaking my heart, A Woman and a Man, chrysalis records). Similarly, we can extract paths for the other candidate, emi group. Notably, our path extraction method can be easily extended to three or more hops by repeating steps 2 and 3. Specifically, for $m$-hop reasoning, steps 2 and 3 need to be repeated $(m - 1)$ times. For one-hop reasoning, i.e., when a single passage is sufficient to answer a question, we construct the path with $e_1$ as null. In this case, both $h_e$ and the answer candidate are found in a single passage.
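The four extraction steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the extraction code used for the experiments: mentions are matched by lowercase substring search, `entities_by_passage` is a hypothetical stand-in for the named entities and noun phrases found near the head entity, $p_2$ is assumed to differ from $p_1$, and the second toy passage is invented for illustration rather than taken from the dataset.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, str, str]  # (head entity, intermediate entity, candidate)


def extract_two_hop_paths(
    head: str,
    candidates: List[str],
    passages: List[str],
    entities_by_passage: Dict[int, List[str]],
) -> List[Path]:
    """Enumerate (head, intermediate, candidate) paths over the passages.

    Assumption: entities_by_passage[i] lists the entities/noun phrases found
    near the head entity in passage i (a stand-in for a real tagger).
    """
    lowered = [p.lower() for p in passages]
    paths: List[Path] = []
    for i, p1 in enumerate(lowered):
        if head.lower() not in p1:                   # step 1: p1 mentions the head
            continue
        for e1 in entities_by_passage.get(i, []):    # step 2: intermediate entities
            for j, p2 in enumerate(lowered):
                if j == i or e1.lower() not in p2:   # step 3: p2 mentions e1
                    continue
                for cand in candidates:              # step 4: candidate appears in p2
                    path = (head, e1, cand)
                    if cand.lower() in p2 and path not in paths:
                        paths.append(path)
    return paths


passages = [
    '"Always Breaking My Heart" is the second single from Belinda Carlisle\'s '
    "A Woman and a Man album, released in 1996.",
    # Invented toy text standing in for the second supporting passage.
    "A Woman and a Man is an album by Belinda Carlisle on Chrysalis Records, "
    "later part of EMI Group.",
]
paths = extract_two_hop_paths(
    head="always breaking my heart",
    candidates=["chrysalis records", "emi group"],
    passages=passages,
    entities_by_passage={0: ["Belinda Carlisle", "A Woman and a Man", "album"]},
)
for path in paths:
    print(path)
```

Extending the sketch to $m$ hops would amount to repeating the two inner lookups once per additional intermediate entity.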

Path-based Multi-Hop QA Model
Once we have all the paths collected for the questions and candidates, we feed them to our proposed model. An overview of our proposed model is depicted in Fig. 2. The key component of our model is the path-scorer module that computes the score for each path $p_{kj}$. We normalize these scores across all paths and compute the probability of a candidate by summing the normalized scores of the paths associated with that candidate. In this way, the probability of candidate $c_k$ being an answer can be given as:
$$\mathrm{prob}(c_k) = \sum_{j} \mathrm{score}(p_{kj}) \tag{1}$$
Next we describe three key model components, given the question $Q$, supporting passages $p_1$ and $p_2$, the candidate $c_k$, and the locations of $h_e$, $e_1$, $c_k$ in these passages as input.
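The aggregation from path scores to candidate probabilities can be illustrated with a small sketch. The function name and the toy scores below are hypothetical; the only behavior taken from the text is a softmax over all paths followed by a per-candidate sum.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def candidate_probabilities(path_scores: List[Tuple[str, float]]) -> Dict[str, float]:
    """Softmax over the scores of all paths, then sum the mass per candidate."""
    max_score = max(score for _, score in path_scores)   # for numerical stability
    exps = [(cand, math.exp(score - max_score)) for cand, score in path_scores]
    total = sum(value for _, value in exps)
    probs: Dict[str, float] = defaultdict(float)
    for cand, value in exps:
        probs[cand] += value / total
    return dict(probs)


# Toy unnormalized scores: two paths for one candidate, one path for the other.
toy_scores = [
    ("chrysalis records", 2.0),
    ("chrysalis records", 1.0),
    ("emi group", 0.5),
]
probs = candidate_probabilities(toy_scores)
print(probs)
```

Because the normalization runs over all paths jointly, a candidate supported by several moderately scored paths can outrank one supported by a single path.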

Embedding and Encoding
Here, we describe the text embedding and encoding approaches used in the model. We use the same embedding and contextual encoding for the question, supporting passages, and candidate answer choices. For word embedding, we use pretrained 300d embedding vectors from GloVe (Pennington et al., 2014). For out-of-vocabulary (OOV) words, we use randomly initialized vectors. For contextual encoding, we use bi-directional long short-term memory (BiLSTM) networks (Hochreiter and Schmidhuber, 1997). Let $e^{s_p}_t$ be the $t$-th word embedding vector of the $p$-th supporting passage. To get the contextual representations, we use the concatenation of the forward and backward hidden states of the BiLSTM, i.e., $s_{p,t} = \overrightarrow{h}^{s_p}_t \,\|\, \overleftarrow{h}^{s_p}_t$, where $h^{s_p}_t$ is the LSTM hidden state representation and $s_{p,t}$ is the contextual representation of the $t$-th token in the $p$-th passage.
The final encoded representation for the $p$-th supporting passage can be obtained by stacking these vectors into $S_p \in \mathbb{R}^{T \times H}$, where $H$ is the number of hidden units for the BiLSTMs. Similarly, we obtain the sequence-level encoding for a question, $Q \in \mathbb{R}^{U \times H}$, and for any candidate answer choice, $C_k \in \mathbb{R}^{V \times H}$. $T$, $U$, and $V$ represent the number of tokens in the $p$-th supporting passage, question, and $k$-th candidate answer choice respectively.

Path Encoding
This is the core component of the proposed model architecture. After extracting the paths as discussed in Section 4, they are encoded inside an end-to-end neural network architecture. The path encoder consists of two components: a context-based path encoder and a passage-based path encoder.

Context-based Path Encoding
In context-based path encoding, we aim to implicitly encode the relations between $(h_e, e_1)$ and $(e_1, c_k)$. These implicit relation representations are further composed to encode a path representation for $(h_e, e_1, c_k)$. Note that $(h_e, e_1)$ and $(e_1, c_k)$ are located in different passages, say $p_1$ and $p_2$ respectively. For clarity, we denote $(e_1, c_k)$ as $(e'_1, c_k)$ from now onwards. First, we extract the contextual representations for each of $h_e$, $e_1$, $e'_1$, and $c_k$. Based on the locations of these entities in the corresponding passages, we extract the boundary vectors from the passage encoding representation. For instance, if $h_e$ appears in the $p$-th supporting passage from token $i_1$ to $i_2$ ($i_1 \le i_2$), then the contextual encoding of $h_e$, $g_{h_e} \in \mathbb{R}^{2H}$, can be given as:
$$g_{h_e} = s_{p,i_1} \,\|\, s_{p,i_2} \tag{2}$$
Similarly, we obtain the location encoding vectors $g_{e_1}$, $g_{e'_1}$, and $g_{c_k}$. Note that if an entity appears at multiple locations, we use the mean vector representation over all the locations. Now, we extract the implicit relation between $h_e$ and $e_1$ as $r_{h_e,e_1} \in \mathbb{R}^{H}$ with a simple feed forward layer:
$$r_{h_e,e_1} = \mathrm{FFL}(g_{h_e}, g_{e_1}) \tag{3}$$
where FFL can be described as:
$$\mathrm{FFL}(a, b) = \tanh(a W_a + b W_b) \tag{4}$$
where $a \in \mathbb{R}^{H'}$ and $b \in \mathbb{R}^{H'}$ are input vectors, and $W_a \in \mathbb{R}^{H' \times H}$ and $W_b \in \mathbb{R}^{H' \times H}$ are trainable weight matrices. The bias vectors are not shown here for simplicity. Similarly, we compute the implicit relation between $e'_1$ and $c_k$ as $r_{e'_1,c_k} \in \mathbb{R}^{H}$, using their location encoding vectors $g_{e'_1}$ and $g_{c_k}$. Finally, we compose the two implicit relation vectors with a feed forward layer to obtain a context-based path representation $x_{ctx} \in \mathbb{R}^{H}$:
$$x_{ctx} = \mathrm{FFL}(r_{h_e,e_1}, r_{e'_1,c_k}) \tag{5}$$
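The context-based encoder can be traced end to end with a small pure-Python sketch. Everything numeric here is a toy assumption: the passage encodings and weights are random, the hidden size is 4, the tanh activation is assumed, and sharing one set of weights between the two relation extractors is a choice made for the sketch, not something the text specifies.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def vec_mat(v: Vector, m: Matrix) -> Vector:
    """Row vector (d_in) times matrix (d_in x d_out)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def ffl(a: Vector, b: Vector, w_a: Matrix, w_b: Matrix) -> Vector:
    """Feed forward layer tanh(a W_a + b W_b); biases omitted, tanh assumed."""
    return [math.tanh(x + y) for x, y in zip(vec_mat(a, w_a), vec_mat(b, w_b))]


def location_encoding(passage: Matrix, spans: List[Tuple[int, int]]) -> Vector:
    """Boundary vectors (first token || last token), averaged over all mentions."""
    boundary = [passage[i1] + passage[i2] for i1, i2 in spans]
    return [sum(col) / len(boundary) for col in zip(*boundary)]


rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


H = 4                                            # toy hidden size
S1, S2 = rand_matrix(6, H), rand_matrix(6, H)    # stand-ins for encoded p1 and p2

g_he = location_encoding(S1, [(0, 2)])
g_e1 = location_encoding(S1, [(3, 4), (5, 5)])   # two mentions, so the mean is used
g_e1_prime = location_encoding(S2, [(0, 1)])
g_ck = location_encoding(S2, [(4, 5)])

W_a, W_b = rand_matrix(2 * H, H), rand_matrix(2 * H, H)   # relation extraction
V_a, V_b = rand_matrix(H, H), rand_matrix(H, H)           # relation composition

r_he_e1 = ffl(g_he, g_e1, W_a, W_b)
r_e1_ck = ffl(g_e1_prime, g_ck, W_a, W_b)   # weight sharing is an assumption
x_ctx = ffl(r_he_e1, r_e1_ck, V_a, V_b)
print(len(g_he), len(r_he_e1), len(x_ctx))
```

The shapes are the point of the example: location encodings have size $2H$, while each implicit relation and the composed path vector have size $H$.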

Passage-based Path Encoding
In the passage-based path encoder, we use the whole passages to compute the path representation. Suppose that $(h_e, e_1)$ and $(e'_1, c_k)$ appear in the supporting passages $p_1$ and $p_2$ respectively. We encode both $p_1$ and $p_2$ into single vectors based on the interaction with the question encoding representation $Q$. We first compute a question-weighted representation for the tokens and then compute an aggregate vector for each passage.

Question-weighted Passage Representation:
For the $p$-th passage, we first compute the attention matrix $A_p \in \mathbb{R}^{T \times U}$, which captures the similarity between the passage and question words. Then, we calculate a question-aware passage representation $S^{q_1}_p \in \mathbb{R}^{T \times H}$, where $S^{q_1}_p = A_p Q$. Similarly, we compute a passage-aware question representation $Q_p \in \mathbb{R}^{U \times H}$, where $Q_p = A_p^{\top} S_p$. Now, based on this updated question representation, we compute another passage representation $S^{q_2}_p \in \mathbb{R}^{T \times H}$ from $Q_p$, where $S^{q_2}_p = A_p Q_p$. Intuitively, $S^{q_1}_p$ captures the important passage words based on the question, whereas $S^{q_2}_p$ focuses on the passage-relevant question words. The idea of encoding a passage after interacting with the question multiple times is inspired by the Gated Attention Reader model (Dhingra et al., 2017).
Aggregate Passage Representation: To get a single passage vector, we first concatenate the two passage representations for each token, $S^{q}_p = S^{q_1}_p \,\|\, S^{q_2}_p \in \mathbb{R}^{T \times 2H}$. We then use an attentive pooling mechanism for aggregating the token representations. The aggregated vector $\tilde{s}_p \in \mathbb{R}^{2H}$ for the $p$-th passage can be obtained as:
$$a^{p}_t \propto \exp(s^{q}_{p,t} w^{\top}); \quad \tilde{s}_p = a^{p} S^{q}_p \tag{6}$$
where $s^{q}_{p,t}$ is the $t$-th row of $S^{q}_p$, $a^{p} \in \mathbb{R}^{T}$ is the vector of attention weights, and $w \in \mathbb{R}^{2H}$ is a learned vector. In this way, we obtain the aggregated vector representations for both supporting passages $p_1$ and $p_2$ as $\tilde{s}_{p_1} \in \mathbb{R}^{2H}$ and $\tilde{s}_{p_2} \in \mathbb{R}^{2H}$ respectively.
Path Encoding: Now, we compose the aggregated passage vectors to obtain the passage-based path representation $x_{psg} \in \mathbb{R}^{H}$ by using a simple feed forward network:
$$x_{psg} = \mathrm{FFL}(\tilde{s}_{p_1}, \tilde{s}_{p_2}) \tag{7}$$
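The question-weighted representation and the attentive pooling step can be sketched as follows. This is an illustration under assumptions rather than the model code: the attention matrix is taken to be an unnormalized dot product between passage and question encodings, the intermediate products follow the matrix shapes given in the text, and all inputs are small random toy matrices.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax(xs: Vector) -> Vector:
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def question_weighted(S: Matrix, Q: Matrix) -> Matrix:
    """Return the per-token concatenation S^q1 || S^q2 with shape (T, 2H)."""
    A = matmul(S, transpose(Q))      # (T, U) similarity; dot product is assumed
    S_q1 = matmul(A, Q)              # (T, H) question-aware passage
    Q_p = matmul(transpose(A), S)    # (U, H) passage-aware question
    S_q2 = matmul(A, Q_p)            # (T, H) second passage representation
    return [row1 + row2 for row1, row2 in zip(S_q1, S_q2)]


def attentive_pool(S_q: Matrix, w: Vector) -> Tuple[Vector, Vector]:
    """Softmax attention over tokens, then a weighted sum of the token rows."""
    weights = softmax([sum(x * y for x, y in zip(row, w)) for row in S_q])
    pooled = [sum(a * row[d] for a, row in zip(weights, S_q)) for d in range(len(w))]
    return weights, pooled


rng = random.Random(0)
T, U, H = 5, 3, 4                    # toy sizes
S = [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(T)]
Q = [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(U)]
w = [rng.uniform(-0.5, 0.5) for _ in range(2 * H)]

S_q = question_weighted(S, Q)
weights, s_tilde = attentive_pool(S_q, w)
print(len(S_q), len(S_q[0]), len(s_tilde))
```

Running the same two functions on both passages of a path yields the two vectors that the feed forward layer then composes into $x_{psg}$.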

Path Scoring
Context-based Path Scoring: We score the context-based paths based on the interaction between the question encoding and the context-based path encoding. First, we aggregate the question into a single vector. As the question is in the form $(h_e, r)$, we take the first and last hidden state representations from the question encoding $Q$ to explicitly cover both the head entity and the relation. The aggregated question vector $\tilde{q} \in \mathbb{R}^{H}$ can be given as:
$$\tilde{q} = (q_1 \,\|\, q_U) W_q \tag{8}$$
where $q_1$ and $q_U$ are the first and last rows of $Q$, and $W_q \in \mathbb{R}^{2H \times H}$ is a learnable weight matrix. The combined representation $y_{x_{ctx},q} \in \mathbb{R}^{H}$ of the question and a context-based path can be given as:
$$y_{x_{ctx},q} = \mathrm{FFL}(x_{ctx}, \tilde{q}) \tag{9}$$
Finally, the scores for context-based paths are derived:
$$z_{ctx} = y_{x_{ctx},q} \, w_{ctx}^{\top} \tag{10}$$
where $w_{ctx} \in \mathbb{R}^{H}$ is a learnable vector.
Passage-based Path Scoring: On the other hand, we capture the interaction between the passage-based path encoding vector and the candidate encoding to score the passage-based paths. We aggregate a candidate encoding representation $C_k$ into a single vector $\tilde{c}_k \in \mathbb{R}^{H}$ by applying an attentive pooling operation similar to Eq. 6. Now, the score for a passage-based path is computed as follows:
$$z_{psg} = \tilde{c}_k \, x_{psg}^{\top} \tag{11}$$
Finally, the unnormalized score for a path $p_{kj}$ is:
$$z = z_{ctx} + z_{psg} \tag{12}$$
and the normalized $\mathrm{score}(p_{kj})$ is calculated by applying a softmax over all the paths and candidates.
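The two scoring branches can be combined in a short sketch. It assumes that the context branch passes the path vector and the aggregated question through a tanh feed forward layer before a dot product with a learned vector, that the passage branch is a plain dot product with the pooled candidate vector, and that the two scores are added; all weights and inputs below are hand-picked toy values.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def vec_mat(v: Vector, m: Matrix) -> Vector:
    """Row vector (d_in) times matrix (d_in x d_out)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def aggregate_question(Q: Matrix, W_q: Matrix) -> Vector:
    """Project the concatenated first and last question states down to size H."""
    return vec_mat(Q[0] + Q[-1], W_q)


def path_score(x_ctx: Vector, x_psg: Vector, q_tilde: Vector, c_tilde: Vector,
               W_x: Matrix, W_y: Matrix, w_ctx: Vector) -> float:
    """Unnormalized path score: context-based score plus passage-based score."""
    y = [math.tanh(a + b) for a, b in zip(vec_mat(x_ctx, W_x), vec_mat(q_tilde, W_y))]
    z_ctx = dot(y, w_ctx)            # question vs. context-based path
    z_psg = dot(c_tilde, x_psg)      # candidate vs. passage-based path
    return z_ctx + z_psg


# Toy inputs with H = 2; every value is illustrative only.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

q_tilde = aggregate_question(Q, W_q)
z = path_score(x_ctx=[0.3, -0.2], x_psg=[1.0, 2.0], q_tilde=q_tilde,
               c_tilde=[3.0, 4.0], W_x=identity, W_y=identity, w_ctx=[0.7, 0.7])
print(q_tilde, z)
```

The resulting value is the unnormalized score of one path; the softmax over all paths and candidates is applied afterwards.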

Experiments
We start by describing the experimental setup, which includes the dataset and experimental configuration. Then, we present the results and analysis of our model.

Settings
For experimentation, we used the recently proposed WikiHop dataset (Welbl et al., 2018). In this work, we considered the unmasked version of the dataset. WikiHop is a large-scale multi-hop QA dataset consisting of about 51K questions. Each question is associated with an average of 13.7 supporting passages collected from Wikipedia. Each passage consists of 36.4 tokens on average. We use spaCy for tokenization. For word embedding, we use the 840B 300-dimensional pretrained word vectors from GloVe and we do not update them during training. For simplicity, we do not use any character embedding in our model. The number of hidden units in all LSTMs is 50 (so H = 100 after concatenating both directions). We use dropout (Srivastava et al., 2014) with probability 0.25 for every learnable layer. During training, the minibatch size is fixed at 8. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 and clipnorm of 5. We use cross entropy loss for training. This being a multiple-choice QA task, we use accuracy as the evaluation metric.
[...] (Peters et al., 2018) for embedding, which has proven to be very useful in the recent past in many natural language processing tasks. Table 1 shows that our proposed model outperforms the prior models by a significant margin. (With 5129 dev questions, any gain above 1.3% would be significant at p = 0.05 based on the Wilson score interval (Wilson, 1927).) Additionally, in contrast to our model, the competing models do not possess the capability to identify which particular entity chains are leading to the final predicted answer. Table 2 shows the ablation results on the WikiHop development set. When we do not consider the context-based paths in the model, only the passage-based paths are considered, and vice versa. As we can see, the performance of the model degrades significantly when we ablate either of the two path encoding modules. Also, the information captured by context-based and passage-based paths is complementary to some extent, as evidenced by the larger drop in the model with no paths.
Intuitively, in context-based paths, limited and more fine-grained context is considered due to the use of syntactically matched location-based encoding representations of the entities that are used to construct a path. On the contrary, the passage-based path encoder computes the path representation with semantic similarity-based aggregation of the entire passages.
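The 1.3% significance threshold quoted for 5129 dev questions can be sanity-checked with the Wilson score interval. The sketch below assumes a hypothetical dev accuracy of 0.67, chosen only for illustration, and z = 1.96 for a 95% interval; the function name is likewise an assumption.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_dev = 5129               # number of dev questions, from the text
assumed_accuracy = 0.67    # hypothetical accuracy, for illustration only
low, high = wilson_interval(assumed_accuracy, n_dev)
half_width = (high - low) / 2
print(f"interval: [{low:.4f}, {high:.4f}], half-width: {half_width:.4f}")
```

Under these assumptions the half-width comes out close to 0.013, in line with the threshold stated in the text; the exact figure shifts slightly with the accuracy plugged in.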

Results
One key aspect of our proposed model is that it can indicate the paths that contribute the most towards predicting the answer choice. Table 3 illustrates the top two paths for two example questions that lead to the correctly predicted answer choice. In the first question, the top-2 paths are formed by connecting Zoo Lake to Gauteng through the intermediate entities Johannesburg and South Africa respectively. In the second example, the science fiction novel This Day All Gods Die is connected to the publisher Bantam Books through the author Stephen R. Donaldson in the first path, and through the collection Gap Cycle in the second path.

Conclusion
We present a novel path-based multi-hop reading comprehension model that achieves state-of-the-art results on the WikiHop development set. We also show that our model can explain its reasoning through paths across multiple passages. Our approach can potentially be generalized to longer chains (more than two hops) and longer natural language questions, which we will explore further.