Multi-Hop Paragraph Retrieval for Open-Domain Question Answering

This paper is concerned with the task of multi-hop open-domain Question Answering (QA). This task is particularly challenging since it requires the simultaneous performance of textual reasoning and efficient searching. We present a method for retrieving multiple supporting paragraphs, nested amidst a large knowledge base, which contain the necessary evidence to answer a given question. Our method iteratively retrieves supporting paragraphs by forming a joint vector representation of both a question and a paragraph. The retrieval is performed by considering contextualized sentence-level representations of the paragraphs in the knowledge source. Our method achieves state-of-the-art performance over two well-known datasets, SQuAD-Open and HotpotQA, which serve as our single- and multi-hop open-domain QA benchmarks, respectively.


Introduction
Textual Question Answering (QA) is the task of answering natural language questions given a set of contexts from which the answers to these questions can be inferred. This task, which falls under the domain of natural language understanding, has been attracting massive interest due to extremely promising results that were achieved using deep learning techniques. These results were made possible by the recent creation of a variety of large-scale QA datasets, such as TriviaQA (Joshi et al., 2017) and SQuAD (Rajpurkar et al., 2016). The latest state-of-the-art methods are even capable of outperforming humans on certain tasks (Devlin et al., 2018).
The basic and arguably the most popular task of QA is often referred to as Reading Comprehension (RC), in which each question is paired with a relatively small number of paragraphs (or documents) from which the answer can potentially be inferred. The objective in RC is to extract the correct answer from the given contexts or, in some cases, deem the question unanswerable (Rajpurkar et al., 2018). Most large-scale RC datasets, however, are built in such a way that the answer can be inferred using a single paragraph or document. This kind of reasoning is termed single-hop reasoning, since it requires reasoning over a single piece of evidence. A more challenging task, called multi-hop reasoning, is one that requires combining evidence from multiple sources (Talmor and Berant, 2018; Welbl et al., 2018; Yang et al., 2018). Figure 1 provides an example of a question requiring multi-hop reasoning. To answer the question, one must first infer from the first context that Alex Ferguson is the manager in question, and only then can the answer to the question be inferred with any confidence from the second context.
Another setting for QA is open-domain QA, in which questions are given without any accompanying contexts, and one is required to locate the relevant contexts to the questions from a large knowledge source (e.g., Wikipedia), and then extract the correct answer using an RC component. Interest in this task has recently resurged following the work of Chen et al. (2017), who used a TF-IDF based retriever to find potentially relevant documents, followed by a neural RC component that extracted the most probable answer from the retrieved documents. While this methodology performs reasonably well for questions requiring single-hop reasoning, its performance decreases significantly when used for open-domain multi-hop reasoning.
We propose a new approach to accomplishing this task, called iterative multi-hop retrieval, in which one iteratively retrieves the necessary evidence to answer a question. We believe this iterative framework is essential for answering multi-hop questions, due to the nature of their reasoning requirements.
Question: The football manager who recruited David Beckham managed Manchester United during what timeframe?
Context 1: The 1995-96 season was Manchester United's fourth season in the Premier League ... Their triumph was made all the more remarkable by the fact that Alex Ferguson ... had drafted in young players like Nicky Butt, David Beckham, Paul Scholes and the Neville brothers, Gary and Phil.
Context 2: Sir Alexander Chapman Ferguson, CBE (born 31 December 1941) is a Scottish former football manager and player who managed Manchester United from 1986 to 2013. He is regarded by many players, managers and analysts to be one of the greatest and most successful managers of all time.
Figure 1: An example of a question and its answer contexts from the HotpotQA dataset requiring multi-hop reasoning and retrieval. The first reasoning hop is highlighted in green, the second hop in purple, and the entity connecting the two is highlighted in blue bold italics. In the first reasoning hop, one has to infer that the manager in question is Alex Ferguson. Without this knowledge, the second context cannot possibly be retrieved with confidence, as the question could refer to any of the club's managers throughout its history. Therefore, an iterative retrieval is needed in order to correctly retrieve this context pair.
Our main contributions are the following:
• We propose a novel multi-hop retrieval approach, which we believe is imperative for truly solving the open-domain multi-hop QA task.
• We show the effectiveness of our approach, which achieves state-of-the-art results in both single- and multi-hop open-domain QA benchmarks.
• We also propose using sentence-level representations for retrieval, and show the possible benefits of this approach over paragraph-level representations.
While there are several works that discuss solutions for multi-hop reasoning (Dhingra et al., 2018; Zhong et al., 2019), to the best of our knowledge, this work is the first to propose a viable solution for open-domain multi-hop QA.

Task Definition
We define the open-domain QA task by a triplet $(KS, Q, A)$, where $KS = \{P_1, P_2, \ldots, P_{|KS|}\}$ is a background knowledge source and $P_i = (p_1, p_2, \ldots, p_{l_i})$ is a textual paragraph consisting of $l_i$ tokens, $Q = (q_1, q_2, \ldots, q_m)$ is a textual question consisting of $m$ tokens, and $A = (a_1, a_2, \ldots, a_n)$ is a textual answer consisting of $n$ tokens, typically a span of tokens $p_{j_1}, \ldots, p_{j_n}$ in some $P_i \in KS$, or optionally a choice from a predefined set of possible answers. The objective of this task is to find the answer $A$ to the question $Q$ using the background knowledge source $KS$. Formally speaking, our task is to learn a function $\phi$ such that $A = \phi(Q, KS)$.

Single-Hop Retrieval
In the classic and simplest form of QA, questions are formulated in such a way that the evidence required to answer them may be contained in a single paragraph, or even in a single sentence. Thus, in the open-domain setting, it might be sufficient to retrieve a single relevant paragraph $P_i \in KS$ using the information present in the given question $Q$, and have a reading comprehension model extract the answer $A$ from $P_i$. We call this task variation single-hop retrieval.

Multi-Hop Retrieval
In contrast to the single-hop case, there are types of questions whose answers can only be inferred by using at least two different paragraphs. The ability to reason with information taken from more than one paragraph is known in the literature as multi-hop reasoning (Welbl et al., 2018). In multi-hop reasoning, not only might the evidence be spread across multiple paragraphs, but it is often necessary to first read a subset of these paragraphs in order to extract the useful information from the other paragraphs, which might otherwise be understood as not completely relevant to the question. This situation becomes even more difficult in the open-domain setting, where one must first find an initial evidence paragraph in order to be able to retrieve the rest. This is demonstrated in Figure 1, where one can observe that the second context alone may appear to be irrelevant to the question at hand and the information in the first context is necessary to retrieve the second part of the evidence correctly. We extend the multi-hop reasoning ability to the open-domain setting, referring to it as multi-hop retrieval, in which the evidence paragraphs are retrieved in an iterative fashion. We focus on this task and limit ourselves to the case where two iterations of retrieval are necessary and sufficient.

Methodology
Our solution, which we call MUPPET (multi-hop paragraph retrieval), relies on the following basic scheme consisting of two main components: (a) a paragraph and question encoder, and (b) a paragraph reader. The encoder is trained to encode paragraphs into d-dimensional vectors, and to encode questions into search vectors in the same vector space. Then, a maximum inner product search (MIPS) algorithm is applied to find the most similar paragraphs to a given question. Several algorithms exist for fast (and possibly approximate) MIPS, such as the one proposed by Johnson et al. (2017). The most similar paragraphs are then passed to the paragraph reader, which, in turn, extracts the most probable answer to the question.
It is critical that the paragraph encodings do not depend on the questions. This enables storing precomputed paragraph encodings and executing efficient MIPS when given a new search vector. Without this property, any new question would require the processing of the complete knowledge source (or a significant part of it).
To support multi-hop retrieval, we propose the following extension to the basic scheme. Given a question $Q$, we first obtain its encoding $\mathbf{q} \in \mathbb{R}^d$ using the encoder. Then, we transform it into a search vector $\mathbf{q}^s \in \mathbb{R}^d$, which is used to retrieve the top-$k$ relevant paragraphs $\{P^Q_1, P^Q_2, \ldots, P^Q_k\} \subset KS$ using MIPS. In each subsequent retrieval iteration, we use the paragraphs retrieved in the previous iteration to reformulate the search vector. This produces $k$ new search vectors, $\{\tilde{\mathbf{q}}^s_1, \tilde{\mathbf{q}}^s_2, \ldots, \tilde{\mathbf{q}}^s_k\}$, where $\tilde{\mathbf{q}}^s_i \in \mathbb{R}^d$, which are used in the same manner as in the first iteration to retrieve the next top-$k$ paragraphs, again using MIPS. This method can be seen as performing a beam search of width $k$ in the encoded paragraphs' space. A high-level view of the described solution is given in Figure 2.
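The two-iteration retrieval loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses exact (brute-force) inner-product search in NumPy instead of a fast MIPS library, stores one vector per paragraph for simplicity, and takes a hypothetical `reformulate` callable as a stand-in for the reformulation component. The function names are illustrative assumptions.

```python
import numpy as np

def mips_top_k(search_vec, para_matrix, k):
    """Exact maximum inner product search over the rows of para_matrix.

    Returns the indices of the k highest-scoring rows and their scores.
    """
    scores = para_matrix @ search_vec
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def two_hop_retrieve(q_search, para_matrix, reformulate, k):
    """Two retrieval iterations with beam width k.

    `reformulate(first_idx)` is a hypothetical stand-in for the reformulation
    component: it maps a first-hop paragraph index to a new search vector.
    Returns a list of (first_idx, second_idx, second_hop_score) tuples.
    """
    first_hop, _ = mips_top_k(q_search, para_matrix, k)
    pairs = []
    for i in first_hop:
        new_vec = reformulate(int(i))
        second_hop, second_scores = mips_top_k(new_vec, para_matrix, k)
        for j, s in zip(second_hop, second_scores):
            if j != i:  # a paragraph is not paired with itself
                pairs.append((int(i), int(j), float(s)))
    return pairs
```

Each first-hop paragraph spawns its own reformulated search vector, so the second iteration issues k searches, which is what gives the procedure its beam-search flavor.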

Paragraph and Question Encoder
We define $f$, our encoder model, in the following way. Given a paragraph $P$ consisting of $k$ sentences $(s_1, s_2, \ldots, s_k)$ and $m$ tokens $(t_1, t_2, \ldots, t_m)$, such that $s_i = (t_{i_1}, t_{i_2}, \ldots, t_{i_l})$, where $l$ is the length of the sentence, our encoder generates $k$ respective $d$-dimensional encodings $(\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_k) = f(P)$, one for each sentence. This is in contrast to previous work in paragraph retrieval in which only a single fixed-size representation is used for each paragraph (Lee et al., 2018; Das et al., 2019). The encodings are created by passing $(t_1, t_2, \ldots, t_m)$ through the following layers.
Word Embedding We use the same embedding layer as the one suggested by Clark and Gardner (2018). Each token $t$ is embedded into a vector $\mathbf{t}$ using both character-level and word-level information. The word-level embedding $\mathbf{t}^w$ is obtained via pretrained word embeddings. The character-level embedding of a token $t$ with $l_t$ characters $(t^c_1, t^c_2, \ldots, t^c_{l_t})$ is obtained in the following manner: each character $t^c_i$ is embedded into a fixed-size vector $\mathbf{t}^c_i$. We then pass each token's character embeddings through a one-dimensional convolutional neural network, followed by max-pooling over the filter dimension. This produces a fixed-size character-level representation for each token, $\mathbf{t}^c = \max\big(\mathrm{CNN}(\mathbf{t}^c_1, \mathbf{t}^c_2, \ldots, \mathbf{t}^c_{l_t})\big)$. Finally, we concatenate the word-level and character-level embeddings to form the final word representation, $\mathbf{t} = [\mathbf{t}^w; \mathbf{t}^c]$.
Sentence-wise max-pooling Finally, we chunk the contextualized representations of the paragraph tokens into their corresponding sentence groups, and apply max-pooling over the time dimension of each sentence group to obtain the paragraph's $d$-dimensional sentence representations, $\mathbf{s}_i = \max(\mathbf{c}_{i_1}, \mathbf{c}_{i_2}, \ldots, \mathbf{c}_{i_l})$. A high-level outline of the sentence encoder is shown in Figure 3a, where we can see a series of $m$ tokens being passed through the aforementioned layers, producing $k$ sentence representations.
The encoding $\mathbf{q}$ of a question $Q$ is computed similarly, such that $\mathbf{q} = f(Q)$. Note that we produce a single vector for any given question; thus, the max-pooling operation is applied over all question words at once, disregarding sentence information.
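As a small illustration of the pooling step, the following NumPy sketch takes already-contextualized token vectors and pools them per sentence (for paragraphs) or over all tokens (for questions). The function names and the use of explicit sentence lengths are assumptions made for this example only.

```python
import numpy as np

def sentence_max_pool(token_reprs, sentence_lengths):
    """Max-pool contextualized token vectors within each sentence.

    token_reprs: (m, d) array, one row per paragraph token.
    sentence_lengths: token counts of the k sentences, summing to m.
    Returns a (k, d) array with one vector per sentence.
    """
    if sum(sentence_lengths) != token_reprs.shape[0]:
        raise ValueError("sentence lengths must sum to the number of tokens")
    boundaries = np.cumsum(sentence_lengths)[:-1]
    groups = np.split(token_reprs, boundaries, axis=0)
    return np.stack([g.max(axis=0) for g in groups])

def question_max_pool(token_reprs):
    """Pool over all question tokens at once, ignoring sentence boundaries."""
    return token_reprs.max(axis=0)
```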

Reformulation Component
The reformulation component receives a paragraph $P$ and a question $Q$, and produces a single vector $\tilde{\mathbf{q}}$. First, contextualized word representations are obtained using the same embedding and recurrent layers used for the initial encoding, $(\mathbf{c}^q_1, \mathbf{c}^q_2, \ldots, \mathbf{c}^q_{n_q})$ for $Q$ and $(\mathbf{c}^p_1, \mathbf{c}^p_2, \ldots, \mathbf{c}^p_{n_p})$ for $P$. We then pass the contextualized representations through a bidirectional attention layer, which we adopt from Clark and Gardner (2018). The attention between question word $i$ and paragraph word $j$ is computed as:
$$a_{ij} = \mathbf{w}^a_1 \cdot \mathbf{c}^q_i + \mathbf{w}^a_2 \cdot \mathbf{c}^p_j + \mathbf{w}^a_3 \cdot (\mathbf{c}^q_i \odot \mathbf{c}^p_j),$$
where $\mathbf{w}^a_1, \mathbf{w}^a_2, \mathbf{w}^a_3 \in \mathbb{R}^d$ are learned vectors. For each question word, we compute the vector $\mathbf{a}_i$:
$$\alpha_{ij} = \frac{e^{a_{ij}}}{\sum_{j'=1}^{n_p} e^{a_{ij'}}}, \qquad \mathbf{a}_i = \sum_{j=1}^{n_p} \alpha_{ij}\, \mathbf{c}^p_j.$$
A paragraph-to-question vector $\mathbf{a}^p$ is computed as follows:
$$m_i = \max_{1 \le j \le n_p} a_{ij}, \qquad \beta_i = \frac{e^{m_i}}{\sum_{i'=1}^{n_q} e^{m_{i'}}}, \qquad \mathbf{a}^p = \sum_{i=1}^{n_q} \beta_i\, \mathbf{c}^q_i.$$
We concatenate $\mathbf{c}^q_i$, $\mathbf{a}_i$, $\mathbf{c}^q_i \odot \mathbf{a}_i$ and $\mathbf{a}^p \odot \mathbf{a}_i$ and pass the result through a linear layer with ReLU activations to compute the final bidirectional attention vectors. We also use a residual connection where we process these representations with a bidirectional GRU and another linear layer with ReLU activations. Finally, we sum the outputs of the two linear layers. As before, we derive the $d$-dimensional reformulated question representation $\tilde{\mathbf{q}}$ using a max-pooling layer on the outputs of the residual layer. A high-level outline of the reformulation layer is given in Figure 3b, where $m$ contextualized token representations of the question and $n$ contextualized token representations of the paragraph are passed through the component's layers to produce the reformulated question representation, $\tilde{\mathbf{q}}$.
Figure 4: An example from the SQuAD dataset of a paragraph that acts as the context for two different questions. Question 1 and its evidence (highlighted in purple) have little relation to question 2 and its evidence (highlighted in green). This motivates our method of storing sentence-wise encodings instead of a single representation for an entire paragraph.
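The attention step of the reformulation component can be sketched in NumPy as follows. This is an illustrative reading, assuming the trilinear attention score of Clark and Gardner (2018) with the question tokens attending over the paragraph tokens; the linear layers, the GRU, the residual connection and the final max-pooling are omitted, and all names are assumptions for this example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(cq, cp, w1, w2, w3):
    """cq: (n_q, d) question token vectors; cp: (n_p, d) paragraph token vectors.

    w1, w2, w3: (d,) attention weight vectors (learned in a real model).
    Returns an (n_q, 4d) array of concatenated attention features.
    """
    # a[i, j] = w1 . cq_i + w2 . cp_j + w3 . (cq_i * cp_j)
    a = (cq @ w1)[:, None] + (cp @ w2)[None, :] + (cq * w3) @ cp.T
    attended = softmax(a, axis=1) @ cp      # one attended paragraph vector per question token
    beta = softmax(a.max(axis=1), axis=0)   # weights over question tokens
    a_p = beta @ cq                         # paragraph-to-question vector, shape (d,)
    return np.concatenate(
        [cq, attended, cq * attended, a_p[None, :] * attended], axis=1
    )
```

With all-zero weights the attention is uniform, so each attended vector reduces to the mean of the paragraph token vectors, which is a convenient sanity check.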
Relevance Scores Given the sentence representations $(\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_k)$ of a paragraph $P$, and the question encoding $\mathbf{q}$ for $Q$, the relevance score of $P$ with respect to a question $Q$ is calculated in the following way:
$$\mathrm{rel}(Q, P) = \max_{i=1,\ldots,k} \sigma\!\left( \begin{bmatrix} \mathbf{s}_i \\ \mathbf{s}_i \odot \mathbf{q} \\ \mathbf{s}_i \cdot \mathbf{q} \\ \mathbf{q} \end{bmatrix}^{\top} \begin{bmatrix} \mathbf{w}_1 \\ \mathbf{w}_2 \\ w_3 \\ \mathbf{w}_4 \end{bmatrix} + b \right),$$
where $\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_4 \in \mathbb{R}^d$ and $w_3, b \in \mathbb{R}$ are learned parameters. A similar max-pooling encoding approach, along with the scoring layer's structure, was proposed by Conneau et al. (2017), who showed their efficacy on various sentence-level tasks. We find this sentence-wise formulation to be beneficial because it suffices for one sentence in a paragraph to be relevant to a question for the whole paragraph to be considered relevant. This allows more fine-grained representations for paragraphs and more accurate retrieval. An example of the benefits of using this kind of sentence-level model is given in Figure 4, where we see two questions answered by two different sentences. Our model allows each question to be similar only to parts of the paragraph, and not necessarily to all of it.
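Search Vector Derivation Recall that our retrieval algorithm is based on executing a MIPS in the paragraph encoding space. To derive such a search vector from the question encoding $\mathbf{q}$, we observe that:
$$\mathrm{rel}(Q, P) \propto \max_{i=1,\ldots,k} \mathbf{s}_i^{\top} \left( \mathbf{w}_1 + \mathbf{w}_2 \odot \mathbf{q} + w_3 \cdot \mathbf{q} \right).$$
Therefore, the final search vector of a question $Q$ is $\mathbf{q}^s = \mathbf{w}_1 + \mathbf{w}_2 \odot \mathbf{q} + w_3 \cdot \mathbf{q}$. The same equations apply when predicting the relevance score for the second retrieval iteration, in which case $\mathbf{q}$ is swapped with $\tilde{\mathbf{q}}$.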
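The reasoning behind the search vector can be checked numerically. The sketch below assumes the scoring form implied by the parameter shapes above: a sigmoid over a linear layer applied to the concatenation of a sentence vector, its element-wise product with the question vector, their inner product, and the question vector itself, maximized over sentences. Under that assumption, the terms involving only the question are constant across paragraphs and the sigmoid is monotone, so ranking by the full score matches ranking by the best inner product with the search vector. The function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relevance(sent_vecs, q, w1, w2, w3, w4, b):
    """Assumed scoring form: max over sentences of sigmoid(linear([s; s*q; s.q; q])).

    sent_vecs: (k, d) sentence vectors of one paragraph; q, w1, w2, w4: (d,); w3, b: scalars.
    """
    logits = (sent_vecs @ w1 + (sent_vecs * q) @ w2
              + w3 * (sent_vecs @ q) + q @ w4 + b)
    return float(sigmoid(logits).max())

def search_vector(q, w1, w2, w3):
    """Question-side vector whose inner product with a sentence vector
    reproduces every paragraph-dependent term of the logit."""
    return w1 + w2 * q + w3 * q

def mips_score(sent_vecs, q_s):
    """Best inner product between the search vector and any sentence vector."""
    return float((sent_vecs @ q_s).max())
```

Because only the ordering matters for retrieval, the paragraph side can be indexed once and every new question only requires computing one search vector.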
Training and Loss Functions Each training sample consists of a question and two paragraphs, $(Q, P_1, P_2)$, where $P_1$ corresponds to a paragraph retrieved in the first iteration, and $P_2$ corresponds to a paragraph retrieved in the second iteration using the reformulated vector $\tilde{\mathbf{q}}$. $P_1$ is considered relevant if it constitutes one of the necessary evidence paragraphs to answer the question. $P_2$ is considered relevant only if $P_1$ and $P_2$ together constitute the complete set of evidence paragraphs needed to answer the question. Both iterations have the same form of loss functions, and the model is trained by optimizing the sum of the iterations' losses.
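Our training objective for each iteration is composed of two components: a binary cross-entropy loss function and a ranking loss function. The cross-entropy loss is defined as follows:
$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \mathrm{rel}(Q_i, P_i) + (1 - y_i) \log\big(1 - \mathrm{rel}(Q_i, P_i)\big) \Big],$$
where $y_i \in \{0, 1\}$ is a binary label indicating the true relevance of $P_i$ to $Q_i$ in the iteration in which $\mathrm{rel}(Q_i, P_i)$ is calculated, and $N$ is the number of samples in the current batch. The ranking loss is computed in the following manner. First, for each question $Q_i$ in a given batch, we find the mean of the scores given to its positive and negative paragraphs:
$$q^{pos}_i = \frac{1}{M_1} \sum_{j=1}^{M_1} \mathrm{rel}(Q_i, P_j), \qquad q^{neg}_i = \frac{1}{M_2} \sum_{j=1}^{M_2} \mathrm{rel}(Q_i, P_j),$$
where the first sum runs over the positive paragraphs and the second over the negative paragraphs of $Q_i$, and $M_1$ and $M_2$ are the number of positive and negative samples for $Q_i$, respectively. We then define the margin ranking loss (Socher et al., 2013) as
$$L_{R} = \frac{1}{M} \sum_{i=1}^{M} \max\big(0,\ \gamma - q^{pos}_i + q^{neg}_i\big),$$
where $M$ is the number of distinct questions in the current batch, and $\gamma$ is a hyperparameter. The final objective is the sum of the two losses:
$$L = L_{CE} + \lambda L_{R},$$
where $\lambda$ is a hyperparameter.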
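A minimal NumPy sketch of the per-iteration objective is given below. It assumes the standard forms of the two losses (mean binary cross-entropy, and a hinge on the gap between the mean positive and mean negative scores of each question) and assumes that $\lambda$ weights the ranking term. Skipping questions that lack either positives or negatives in the batch is a choice made for this example, not something stated in the text.

```python
import numpy as np

def bce_loss(rel, y, eps=1e-12):
    """Mean binary cross-entropy between relevance scores and 0/1 labels."""
    rel = np.clip(rel, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(rel) + (1 - y) * np.log(1 - rel)))

def margin_ranking_loss(rel, y, question_ids, gamma):
    """Hinge loss on (mean positive score - mean negative score) per question."""
    losses = []
    for qid in np.unique(question_ids):
        mask = question_ids == qid
        pos = rel[mask & (y == 1)]
        neg = rel[mask & (y == 0)]
        if len(pos) == 0 or len(neg) == 0:
            continue  # assumption: skip questions lacking either kind of sample
        losses.append(max(0.0, gamma - pos.mean() + neg.mean()))
    return float(np.mean(losses)) if losses else 0.0

def retrieval_loss(rel, y, question_ids, gamma, lam):
    """Assumed combination: cross-entropy plus lambda times the ranking loss."""
    return bce_loss(rel, y) + lam * margin_ranking_loss(rel, y, question_ids, gamma)
```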
We note that we found it slightly beneficial to incorporate pretrained ELMo (Peters et al., 2018) embeddings in our model. For further implementation and training details, please refer to Appendix C.

Paragraph Reader
The paragraph reader receives as input a question $Q$ and a paragraph $P$ and extracts the most probable answer span to $Q$ from $P$. We use the S-norm model proposed by Clark and Gardner (2018). A detailed description of the model is given in Appendix A.
Training An input sample for the paragraph reader consists of a question and a single context $(Q, P)$. We optimize the same negative log-likelihood function used in the S-norm model for the span start boundaries:
$$L_{start} = -\log \frac{\sum_{j \in P^Q} \sum_{k \in A_j} e^{s_{kj}}}{\sum_{j \in P^Q} \sum_{i} e^{s_{ij}}},$$
where $P^Q$ is the set of paragraphs paired with the same question $Q$, $A_j$ is the set of tokens that start an answer span in the $j$-th paragraph, and $s_{ij}$ is the score given to the $i$-th token in the $j$-th paragraph. The same formulation is used for the span end boundaries, so that the final objective function is the sum of the two: $L_{span} = L_{start} + L_{end}$.
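The shared normalization across paragraphs can be sketched as below. This is an illustrative NumPy version of a shared-norm start loss in the spirit of Clark and Gardner (2018), assuming that scores from all paragraphs paired with a question are normalized by a single softmax; the function names and input layout are assumptions for this example.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return float(m + np.log(np.sum(np.exp(x - m))))

def shared_norm_start_loss(start_scores, answer_starts):
    """Negative log-likelihood with one softmax shared by all paragraphs.

    start_scores: list of 1-D arrays, one per paragraph paired with the question.
    answer_starts: list of index lists; tokens that start an answer span in each
        paragraph (may be empty for some, but not for all, paragraphs).
    """
    all_scores = np.concatenate(start_scores)
    gold = np.concatenate([
        s[np.asarray(idx, dtype=int)]
        for s, idx in zip(start_scores, answer_starts)
        if len(idx) > 0
    ])
    return logsumexp(all_scores) - logsumexp(gold)
```

The end-boundary loss would use the same function on the end scores, and the two values would be summed.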

Experiments and Results
We test our approach on two datasets, and measure end-to-end QA performance using the standard exact match (EM) and $F_1$ metrics, as well as the metrics proposed by Yang et al. (2018) for the HotpotQA dataset (see Appendix B).

Datasets
HotpotQA Yang et al. (2018) introduced a dataset of Wikipedia-based questions, which require reasoning over multiple paragraphs to find the correct answer. The dataset also includes hard supervision on sentence-level supporting facts, which encourages the model to give explainable answer predictions. Two benchmark settings are available for this dataset: (1) a distractor setting, in which the reader is given a question as well as a set of paragraphs that includes both the supporting facts and irrelevant paragraphs; (2) a full wiki setting, which is an open-domain version of the dataset. We use this dataset as our benchmark for the multi-hop retrieval setting. Several extensions must be added to the reader from Section 3.2 in order for it to be suitable for the HotpotQA dataset. A detailed description of our proposed extensions is given in Appendix B.

Experimental Setup
Search Hyperparameters For our experiments in the multi-hop setting, we used a beam width of 8 in the first retrieval iteration. In all our experiments, unless stated otherwise, the reader is fed the top 45 paragraphs, through which it reasons independently and finds the most probable answers. In addition, we found it beneficial to limit the search space of our MIPS retriever to a subset of the knowledge source, which is determined by a TF-IDF heuristic retriever. We define $n_i$ to be the size of the search space for retrieval iteration $i$. As we will see, there is a trade-off in choosing the value of $n_i$: a large value of $n_i$ offers the possibility of higher recall, whereas a small value of $n_i$ introduces less noise in the form of irrelevant paragraphs.
Knowledge Sources For HotpotQA, our knowledge source is the same Wikipedia version used by Yang et al. (2018). This version is a set of all of the first paragraphs in the entire Wikipedia. For SQuAD-Open, we use the same Wikipedia dump used by Chen et al. (2017). For both knowledge sources, the TF-IDF based retriever we use for search space reduction is the one proposed by Chen et al. (2017), which uses bigram hashing and TF-IDF matching. We note that in the HotpotQA Wikipedia version each document is a single paragraph, while in SQuAD-Open, the full Wikipedia documents are used.
Table 1: … At the bottom, we compare the end-to-end performance on the full wiki setting. TF-IDF + Reader refers to using the TF-IDF based retriever without our MIPS retriever. MUPPET (sentence-level) refers to our approach with sentence-level representations, and MUPPET (paragraph-level) refers to our approach with paragraph-level representations. For both sentence- and paragraph-level results, we set $n_1 = 32$ and $n_2 = 512$.
Tables 1 and 2 show our main results on the HotpotQA and SQuAD-Open datasets, respectively. In the HotpotQA distractor setting, our paragraph reader greatly improves the results of the baseline reader, increasing the joint EM and $F_1$ scores by 17.12 (148%) and 13.22 (32%) points, respectively. In the full wiki setting, we compare three methods of retrieval: (1) TF-IDF, in which only the TF-IDF heuristic is used and the reader is fed all possible paragraph pairs from the top-10 paragraphs; (2) sentence-level, in which we use MUPPET with sentence-level encodings; (3) paragraph-level, in which we use MUPPET with paragraph-level encodings (no sentence information). We can see that both MUPPET variants significantly outperform the naïve TF-IDF retriever, indicating the efficacy of our approach. As of writing this paper, we are placed second in the HotpotQA full wiki setting (test set) leaderboard.
For SQuAD-Open, our sentence-level method established state-of-the-art results, improving the current non-BERT (Devlin et al., 2018) state-of-the-art by 4.6 (13%) and 3.6 (8%) EM and $F_1$ points, respectively. This shows that our encoder can be useful not only for multi-hop questions, but also for single-hop questions.

Retrieval Recall Analysis
We analyze the performance of the TF-IDF retriever for HotpotQA in Figure 5a. We can see that the retriever succeeds in retrieving at least one of the gold paragraphs for most questions (above 90% with the top-32 paragraphs), but often fails to retrieve both gold paragraphs. This demonstrates the necessity of an efficient multi-hop retrieval approach to aid or replace classic information retrieval methods.
Effect of Narrowing the Search Space In Figures 5b and 5c, we show the performance of our method as a function of the size of the search space of the last retrieval iteration. For SQuAD-Open, the TF-IDF retriever initially retrieves a set of documents, which are then split into paragraphs to form the search space. Each search space of top-k paragraphs limits the potential recall of the model to that of the top-k paragraphs retrieved by the TF-IDF retriever. This proves to be suboptimal for very small values of k, as the performance of the TF-IDF retriever is not good enough. Our models, however, fail to benefit from increasing the search space indefinitely, hinting that they are not as robust to noise as we would want them to be.

Effectiveness of Sentence-Level Encodings
Our method proposes using sentence-level encodings for paragraph retrieval. We test the significance of this approach in Figures 5b and 5c. While sentence-level encodings seem to be vital for improving state-of-the-art results on SQuAD-Open, the same cannot be said about HotpotQA. We hypothesize that this is a consequence of the way the datasets were created. In SQuAD, each paragraph serves as the context of several questions, as shown in Figure 4. This leads to questions being asked about facts less essential to the gist of the paragraph, and thus they would not be encapsulated in a single paragraph representation. In HotpotQA, however, most of the paragraphs in the training set serve as the context of at most one question.

Related Work
Chen et al. (2017) first introduced the use of neural methods to the task of open-domain QA using a textual knowledge source. They proposed DrQA, a pipeline approach with two components: a TF-IDF based retriever, and a multi-layer neural network that was trained to find an answer span given a question and a paragraph. In an attempt to improve the retrieval of the TF-IDF based component, many […] that employs a multi-step interaction between a retriever and a reader. This interactive framework is used to refine a question representation in order for the retrieval to be more accurate. Their method is complementary to ours: the interactive framework is used to enhance retrieval performance for single-hop questions, and does not handle the multi-hop domain. Another line of work reminiscent of our method is the one of Memory Networks (Weston et al., 2015). Memory Networks consist of an array of cells, each capable of storing a vector, and four modules (input, update, output and response) that allow the manipulation of the memory for the task at hand. Many variations of Memory Networks have been proposed, such as end-to-end Memory Networks (

Concluding Remarks
We present MUPPET, a novel method for multi-hop paragraph retrieval, and show its efficacy in both single- and multi-hop QA datasets. One difficulty in the open-domain multi-hop setting is the lack of supervision, a difficulty that in the single-hop setting is alleviated to some extent by using distant supervision. We hope to tackle this problem in future work to allow learning more than two retrieval iterations. An interesting improvement to our approach would be to allow the retriever to automatically determine whether or not more retrieval iterations are needed. A promising direction could be a multi-task approach, in which both single- and multi-hop datasets are learned jointly. We leave this for future work.

A Paragraph Reader
In this section we describe in detail the reader mentioned in Section 3.2. The paragraph reader receives as input a question $Q$ and a paragraph $P$ and extracts the most probable answer span to $Q$ from $P$. We use the shared-norm model presented by Clark and Gardner (2018), which we refer to as S-norm. The model's architecture is quite similar to the one we used for the encoder. First, we process $Q$ and $P$ separately to obtain their contextualized token representations, in the same manner as used in the encoder. We then pass the contextualized representations through a bidirectional attention layer similar to the one defined in the reformulation layer of the encoder, with the only difference being that the roles of the question and the paragraph are switched. As before, we further pass the bidirectional attention representations through a residual connection, this time using a self-attention layer between the bidirectional GRU and the linear layer. The self-attention mechanism is similar to the bidirectional attention layer, only now it is between the paragraph and itself. Therefore, question-to-paragraph attention is not used, and we set $a_{ij} = -\infty$ if $i = j$. The summed outputs of the residual connection are passed to the prediction layer. The inputs to the prediction layer are passed through a bidirectional GRU followed by a linear layer that predicts the answer span start scores. The hidden layers of that GRU are concatenated with the input and passed through another bidirectional GRU and linear layer to predict the answer span end scores.
Training An input sample for the paragraph reader consists of a question and a single context $(Q, P)$. We optimize the same negative log-likelihood function used in the S-norm model for the span start boundaries:
$$L_{start} = -\log \frac{\sum_{j \in P^Q} \sum_{k \in A_j} e^{s_{kj}}}{\sum_{j \in P^Q} \sum_{i} e^{s_{ij}}},$$
where $P^Q$ is the set of paragraphs paired with the same question $Q$, $A_j$ is the set of tokens that start an answer span in the $j$-th paragraph, and $s_{ij}$ is the score given to the $i$-th token in the $j$-th paragraph. The same formulation is used for the span end boundaries, so that the final objective function is the sum of the two: $L_{span} = L_{start} + L_{end}$.

B Paragraph Reader Extension for HotpotQA
HotpotQA presents the challenge of not only predicting an answer span, but also yes/no answers. This is a combination of span-based questions and multiple-choice questions. In addition, one is also required to provide explainability to the answer predictions by predicting the supporting facts leading to the answer. We extend the paragraph reader from Section 3.2 to support these predictions in the following manner.

Yes/No Prediction
We argue that one can decide whether the answer to a given question should be span-based or yes/no-based without looking at any context at all. Therefore, we first create a fixed-size vector representing the question using max-pooling over the first bidirectional GRU's states of the question. We pass this representation through a linear layer that predicts whether this is a yes/no-based question or a span-based question. If span-based, we predict the answer span from the context using the original span prediction layer. If yes/no-based, we encode the question-aware context representations to a fixed-size vector by performing max-pooling over the outputs of the residual self-attention layer. As before, we then pass this vector through a linear layer to predict a yes/no answer.

Supporting Fact Prediction
As a context's supporting facts for a question are at the sentence level, we encode the question-aware context representations to fixed-size sentence representations by passing the outputs of the residual self-attention layer through another bidirectional GRU, followed by performing max-pooling over the sentence groups of the GRU's outputs. Each sentence representation is then passed through a multilayer perceptron with a single hidden layer equipped with ReLU activations to predict whether it is indeed a supporting fact or not.
Training An input sample for the paragraph reader consists of a question and a single context, $(Q, P)$. Nevertheless, as HotpotQA requires multiple paragraphs to answer a question, we define $P$ to be the concatenation of these paragraphs.
Our objective function comprises four loss functions, corresponding to the four possible predictions of our model. For the span-based prediction we use $L_{span}$, as before. We use a similar negative log-likelihood loss for the answer type prediction (whether the answer should be span-based or yes/no-based) and for a yes/no answer prediction:
$$L_{type} = -\log \frac{e^{s^{type}_j}}{e^{s^{binary}_j} + e^{s^{span}_j}}, \qquad L_{yes/no} = -\log \frac{e^{s^{yes/no}_j}}{e^{s^{yes}_j} + e^{s^{no}_j}},$$
where $e^{s^{binary}_j}$, $e^{s^{span}_j}$ and $e^{s^{type}_j}$ are the likelihood scores of the $j$-th question-paragraph pair being a binary yes/no-based type, a span-based type, and its true type, respectively. $e^{s^{yes}_j}$, $e^{s^{no}_j}$ and $e^{s^{yes/no}_j}$ are the likelihood scores of the $j$-th question-paragraph pair having the answer 'yes', the answer 'no', and its true answer, respectively. For span-based questions, $L_{yes/no}$ is defined to be zero, and vice versa.
For the supporting fact prediction, we use a binary cross-entropy loss on each sentence, $L_{sp}$. The final loss function is the sum of these four objectives, $L_{hotpot} = L_{span} + L_{type} + L_{yes/no} + L_{sp}$. During inference, the supporting facts prediction is taken only from the paragraph from which the answer is predicted.
Metrics Three sets of metrics were proposed by Yang et al. (2018) to evaluate performance on the HotpotQA dataset. The first set of metrics focuses on evaluating the answer span. For this purpose the exact match (EM) and $F_1$ metrics are used, as suggested by Rajpurkar et al. (2016). The second set of metrics focuses on the explainability of the models, by evaluating the supporting facts directly using the EM and $F_1$ metrics on the set of supporting fact sentences. The final set of metrics combines the evaluation of answer spans and supporting facts as follows. For each example, given its precision and recall on the answer span $(P^{(ans)}, R^{(ans)})$ and the supporting facts $(P^{(sup)}, R^{(sup)})$, respectively, the joint $F_1$ is calculated as
$$P^{(joint)} = P^{(ans)} P^{(sup)}, \qquad R^{(joint)} = R^{(ans)} R^{(sup)}, \qquad F_1^{(joint)} = \frac{2 P^{(joint)} R^{(joint)}}{P^{(joint)} + R^{(joint)}}.$$
The joint EM is 1 only if both tasks achieve an exact match and otherwise 0. Intuitively, these metrics penalize systems that perform poorly on either task. All metrics are evaluated example-by-example, and then averaged over examples in the evaluation set.
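The joint metrics for a single example can be computed as in the short sketch below. It follows the description above; the function names and the convention of returning 0 when precision and recall are both 0 are assumptions made for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def joint_metrics(p_ans, r_ans, em_ans, p_sup, r_sup, em_sup):
    """Joint EM and joint F1 for one example.

    p_*/r_*: precision and recall on the answer span and on the supporting facts.
    em_*: booleans, whether each task achieved an exact match.
    """
    p_joint = p_ans * p_sup
    r_joint = r_ans * r_sup
    return {
        "joint_em": 1.0 if (em_ans and em_sup) else 0.0,
        "joint_f1": f1_score(p_joint, r_joint),
    }
```

Averaging these per-example values over the evaluation set gives the reported joint scores.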

C Implementation Details
We use the Stanford CoreNLP toolkit (Manning et al., 2014) for tokenization. We implement all our models using TensorFlow. For the encoder, we also concatenate ELMo embeddings (Peters et al., 2018), with a dropout rate of 0.5, to the token representations from the output of the embedding layer to form the final token representations, before processing them through the first bidirectional GRU. We use the ELMo weights pretrained on the 5.5B dataset. To speed up computations, we cache the context-independent token representations of all tokens that appear at least once in the titles of the HotpotQA Wikipedia version, or appear at least five times in the entire Wikipedia version. Words not in this vocabulary are given a fixed OOV vector. We use a learned weighted average of all three ELMo layers. Variational dropout (Gal and Ghahramani, 2016), where the same dropout mask is applied at each time step, is applied on the inputs of all recurrent layers with a dropout rate of 0.2. We set the encoding size to be d = 1024.
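The caching rule for context-independent token representations can be illustrated with a short sketch. The helper names `build_cached_vocab` and `lookup` are hypothetical, and the inputs are assumed to be flat lists of already tokenized text; the actual caching code may differ.

```python
from collections import Counter


def build_cached_vocab(title_tokens, corpus_tokens, min_corpus_count=5):
    """Select the tokens whose representations are cached.

    A token is kept if it appears at least once in the titles, or at
    least `min_corpus_count` times in the entire corpus.
    """
    counts = Counter(corpus_tokens)
    vocab = set(title_tokens)
    vocab.update(tok for tok, c in counts.items() if c >= min_corpus_count)
    return vocab


def lookup(token, cache, oov_vector):
    """Return the cached representation, or the fixed OOV vector."""
    return cache.get(token, oov_vector)
```

With `title_tokens=["Paris"]` and a corpus in which "the" occurs five times and "rare" four times, only "Paris" and "the" are cached, and "rare" falls back to the OOV vector.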
For the paragraph reader used for HotpotQA, we use a state size of 150 for the bidirectional GRUs. The size of the hidden layer in the MLP used for supporting fact prediction is set to 150 as well. Here again variational dropout with a dropout rate of 0.2 is applied on the inputs of all recurrent layers and attention mechanisms. The reader used for SQuAD is the shared-norm model trained on the SQuAD dataset by Clark and Gardner (2018).

Training Details: We train all our models using the Adadelta optimizer (Zeiler, 2012) with a learning rate of 1.0 and ρ = 0.95.
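For reference, the Adadelta update rule with the settings above can be sketched for a single scalar parameter. This is an illustration of the standard rule from Zeiler (2012), not the TensorFlow implementation; the function name `adadelta_step`, the state layout, and the value of `eps` are assumptions of this sketch.

```python
import math


def adadelta_step(x, grad, state, lr=1.0, rho=0.95, eps=1e-6):
    """Apply one Adadelta update to a scalar parameter x.

    state holds running averages of squared gradients ("avg_sq_grad")
    and squared updates ("avg_sq_delta"); both start at 0.
    eps is an assumed stabilizer, not a value given in the text.
    """
    state["avg_sq_grad"] = rho * state["avg_sq_grad"] + (1 - rho) * grad ** 2
    # Scale the gradient by the ratio of the two running RMS values.
    delta = -(math.sqrt(state["avg_sq_delta"] + eps)
              / math.sqrt(state["avg_sq_grad"] + eps)) * grad
    state["avg_sq_delta"] = rho * state["avg_sq_delta"] + (1 - rho) * delta ** 2
    return x + lr * delta


# Usage: one step on f(x) = x^2, whose gradient is 2x.
state = {"avg_sq_grad": 0.0, "avg_sq_delta": 0.0}
x = adadelta_step(1.0, 2.0, state)
```

Because the running average of squared updates starts at zero, the first steps are very small and the effective step size grows as updates accumulate.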
SQuAD-Open: The training data is gathered as follows. For each question in the original SQuAD dataset, the original paragraph given as the question's context is considered as the single relevant (positive) paragraph. We gather ∼12 irrelevant (negative) paragraphs for each question in the following manner:
• The three paragraphs with the highest TF-IDF similarity to the question in the same SQuAD document as the relevant paragraph (excluding the relevant paragraph). The same method is applied to retrieve the three paragraphs most similar to the relevant paragraph.
• The two paragraphs with the highest TF-IDF similarity to the question from the set of all first paragraphs in the entire Wikipedia (excluding the relevant paragraph's article). The same method is applied to retrieve the two paragraphs most similar to the relevant paragraph.
• Two randomly sampled paragraphs from the entire Wikipedia.
Questions that contain only stop-words are dropped, as they are most likely too dependent on the original context and not suitable for the open-domain setting. In each epoch, a question appears as a training sample four times: once with the relevant paragraph, and three times with randomly sampled irrelevant paragraphs. We train with a batch size of 45, and do not use the ranking loss by setting λ = 0 in Equation (2). We limit the length of the paragraphs to 600 tokens.
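The per-epoch sampling described above can be sketched as follows. The names `epoch_samples` and `truncate` are hypothetical, paragraphs are assumed to be token lists, and drawing the three negatives without replacement from the gathered pool is an assumption, since the text does not specify it.

```python
import random

MAX_PARAGRAPH_TOKENS = 600


def truncate(tokens, limit=MAX_PARAGRAPH_TOKENS):
    """Keep at most `limit` tokens of a paragraph."""
    return tokens[:limit]


def epoch_samples(question, positive, negatives, rng, num_negatives=3):
    """Build the four (question, paragraph, label) samples for one epoch.

    One sample pairs the question with its relevant paragraph (label 1);
    the others pair it with randomly drawn irrelevant ones (label 0).
    """
    samples = [(question, truncate(positive), 1)]
    for negative in rng.sample(negatives, num_negatives):
        samples.append((question, truncate(negative), 0))
    return samples


# Usage with toy data: 12 gathered negatives, as in the text.
rng = random.Random(0)
negatives = [["neg%d" % i] for i in range(12)]
samples = epoch_samples(["who", "won"], ["p"] * 700, negatives, rng)
```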
HotpotQA: The paragraphs used for training the encoder are the gold and distractor paragraphs supplied in the original HotpotQA training set. As mentioned in Section 3.1, each training sample consists of a question and two paragraphs, $(Q, P_1, P_2)$, where $P_1$ corresponds to a paragraph retrieved in the first iteration, and $P_2$ corresponds to a paragraph retrieved in the second iteration. For each question, we create the following sample types:
1. Gold: The two paragraphs are the two gold paragraphs of the question. Both $P_1$ and $P_2$ are considered positive.
2. First gold, second distractor: $P_1$ is one of the gold paragraphs and considered positive, while $P_2$ can be a random paragraph from the training set, the same as $P_1$, or one of the distractors, with probabilities 0.05, 0.1 and 0.85, respectively. $P_2$ is considered negative.
3. First distractor, second gold: $P_1$ is either one of the distractors or a random paragraph from the training set, with probabilities 0.9 and 0.1, respectively. $P_2$ is one of the gold paragraphs. Both $P_1$ and $P_2$ are considered negative.
4. All distractors: Both $P_1$ and $P_2$ are sampled from the question's distractors, and are considered negative.
5. Gold from another question: A gold paragraph pair taken from another question; both paragraphs are considered negative.
The use of the sample types in the above list is motivated as follows. Sample type 1 is the only one that contains purely positive examples and hence is mandatory. Sample type 2 is necessary to allow the model to learn a valuable reformulation, which does not give a relevant score based solely on the first paragraph. Sample type 3 is complementary to type 2; it allows the model to learn that a paragraph pair is irrelevant if the first paragraph is irrelevant, regardless of the second. Sample type 4 is used for random negative sampling, which is the most common case of all. Sample type 5 is used to guarantee the model does not determine relevancy solely based on the paragraph pair, but also based on the question. In each training batch, we include three samples for each question in the batch: a single gold sample (type 1), and two samples from the other four types, with sample probabilities of 0.35, 0.35, 0.25 and 0.05, respectively.
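The sampling scheme over the five sample types can be sketched as below. This is an illustrative reading of the description, not the original training code: the function names are hypothetical, paragraphs are represented by plain identifiers, the gold pair is kept in its given order, and the two distractors of type 4 are assumed to be distinct.

```python
import random

OTHER_TYPES = (2, 3, 4, 5)
OTHER_TYPE_PROBS = (0.35, 0.35, 0.25, 0.05)


def build_sample(sample_type, gold, distractors, random_pool, other_gold, rng):
    """Return (p1, p2, p1_is_positive, p2_is_positive) for one sample type."""
    if sample_type == 1:  # both gold paragraphs, both positive
        return gold[0], gold[1], True, True
    if sample_type == 2:  # gold first, negative second
        p1 = rng.choice(gold)
        source = rng.choices(("random", "same", "distractor"),
                             weights=(0.05, 0.1, 0.85))[0]
        if source == "random":
            p2 = rng.choice(random_pool)
        elif source == "same":
            p2 = p1
        else:
            p2 = rng.choice(distractors)
        return p1, p2, True, False
    if sample_type == 3:  # irrelevant first paragraph, gold second
        pool = distractors if rng.random() < 0.9 else random_pool
        return rng.choice(pool), rng.choice(gold), False, False
    if sample_type == 4:  # two distractors (assumed distinct)
        p1, p2 = rng.sample(distractors, 2)
        return p1, p2, False, False
    if sample_type == 5:  # gold pair of another question
        return other_gold[0], other_gold[1], False, False
    raise ValueError("unknown sample type: %r" % (sample_type,))


def question_samples(gold, distractors, random_pool, other_gold, rng):
    """One gold sample plus two samples drawn from types 2 to 5."""
    types = [1] + rng.choices(OTHER_TYPES, weights=OTHER_TYPE_PROBS, k=2)
    return [build_sample(t, gold, distractors, random_pool, other_gold, rng)
            for t in types]
```

With 25 unique questions per batch, calling `question_samples` once per question yields the 75 samples of a training batch.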
We use a batch size of 75 (25 unique questions). We set the margin to be γ = 1 in Equation (1) and λ = 1 in Equation (2), for both prediction iterations. We limit the length of the paragraphs to 600 tokens.
HotpotQA Reader: The reader receives a question and a concatenation of a paragraph pair as input. Each training batch consists of three samples with three different paragraph pairs for each question: a single gold pair, which is the two gold paragraphs of the question, and two randomly sampled paragraph pairs from the set of the distractors and one of the gold paragraphs of the question. We label the correct answer spans to be every text span that has an exact match with the ground truth answer, even in the distractor paragraphs. We use a batch size of 75 (25 unique questions), and limit the length of the paragraphs (before concatenation) to 600 tokens.
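The span labeling for the reader can be sketched as follows. The function names are hypothetical, and matching is assumed to be an exact, case-sensitive comparison over tokens; the original preprocessing may normalize text differently.

```python
def find_answer_spans(tokens, answer_tokens):
    """Return inclusive (start, end) indices of every exact match."""
    n = len(answer_tokens)
    if n == 0:
        return []
    return [(start, start + n - 1)
            for start in range(len(tokens) - n + 1)
            if tokens[start:start + n] == answer_tokens]


def reader_input(paragraph_a, paragraph_b, answer_tokens, limit=600):
    """Truncate each paragraph, concatenate the pair, and label spans.

    Note: this simple version would also accept a match that straddles
    the boundary between the two paragraphs.
    """
    tokens = paragraph_a[:limit] + paragraph_b[:limit]
    return tokens, find_answer_spans(tokens, answer_tokens)


# Usage with toy paragraphs: the answer occurs in both of them.
tokens, spans = reader_input("Alex Ferguson managed".split(),
                             "Sir Alex Ferguson".split(),
                             ["Alex", "Ferguson"])
```

Every occurrence is labeled as correct, including occurrences inside a distractor paragraph, which matches the labeling rule stated above.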