What’s Missing: A Knowledge Gap Guided Approach for Multi-hop Question Answering

Multi-hop textual question answering requires combining information from multiple sentences. We focus on a natural setting where, unlike typical reading comprehension, only partial information is provided with each question. The model must retrieve and use additional knowledge to correctly answer the question. To tackle this challenge, we develop a novel approach that explicitly identifies the knowledge gap between a key span in the provided knowledge and the answer choices. The model, GapQA, learns to fill this gap by determining the relationship between the span and an answer choice, based on retrieved knowledge targeting this gap. We propose jointly training a model to simultaneously fill this knowledge gap and compose it with the provided partial knowledge. On the OpenBookQA dataset, given partial knowledge, explicitly identifying what’s missing substantially outperforms previous approaches.


Introduction
Reading Comprehension datasets (Richardson et al., 2013; Rajpurkar et al., 2016; Joshi et al., 2017) have gained interest as benchmarks to evaluate a system's ability to understand a document via question answering (QA). Since many of these early datasets only required a system to understand a single sentence, new datasets were specifically designed to focus on the problem of multi-hop QA, i.e., reasoning across sentences (Khashabi et al., 2018; Welbl et al., 2018). While this led to improved language understanding, the tasks still assume that a system is provided with all knowledge necessary to answer the question. In practice, however, we often only have access to partial knowledge when dealing with such multi-hop questions, and must retrieve additional facts (the knowledge "gaps") based on the question and the provided knowledge. Our goal is to identify such gaps and fill them using an external knowledge source.

Question: Which of these would let the most heat travel through? A) a new pair of jeans. B) a steel spoon in a cafeteria. C) a cotton candy at a store. D) a calvin klein cotton hat.
Core Fact: Metal lets heat travel through.
Knowledge Gap (similar gaps for other choices): steel spoon in a cafeteria metal.
Filled Gap (relation identified using KB): steel spoon in a cafeteria is made of metal.
Figure 1: A sample OpenBookQA question, the identified knowledge gap based on partial information in the core fact, and the relation (is made of) identified from a KB to fill that gap.

The recently introduced challenge of open book question answering highlights this phenomenon. The questions in the corresponding dataset, OpenBookQA, are derived from a science fact in an "open book" of about 1300 facts. To answer these questions, a system must not only identify a relevant "core" science fact from this small book, but then also retrieve additional common knowledge from large external sources in order to successfully apply this core fact to the question. Consider the example in Figure 1. The core science fact metal lets heat travel through points to metal as the correct answer, but it is not one of the 4 answer choices. Given this core fact (the "partial knowledge"), a system must still use broad external knowledge to fill the remaining gap, that is, identify which answer choice contains or is made of metal.
This work focuses on QA under partial knowledge. This turns out to be a surprisingly challenging task in itself; indeed, the previously reported partial knowledge models achieve a score of only 55% on OpenBookQA, far from human performance of 91%. Since this and several recent multi-hop datasets use the multiple-choice setting (Welbl et al., 2018; Khashabi et al., 2018; Lai et al., 2017), we assume access to potential answers to a question. While our current model relies on this for a direct application to span-prediction-based RC datasets, the idea of identifying knowledge gaps can be used to create novel RC-specific models.
We demonstrate that an intuitive approach leads to a strong model: first identify the knowledge gap and then fill this gap, i.e., identify the missing relation using external knowledge. We primarily focus on the OpenBookQA dataset since it is the only dataset currently available that provides partial context. However, we believe such an approach is also applicable to the broader setting of multi-hop RC datasets, where the system could start reasoning with one sentence and fill remaining gap(s) using sentences from other passages.
Our model operates in two steps. First, it predicts a key span in the core fact ("metal" in the above example). Second, it answers the question by identifying the relationship between the key span and answer choices, i.e., by filling the knowledge gap. This second step can be broken down further: (a) retrieve relevant knowledge from resources such as ConceptNet (Speer et al., 2017) and large-scale text corpora; (b) based on this, predict potential relations between the key span and an answer choice; and (c) compose the core fact with this filled gap.
We collect labels for knowledge gaps on ∼30% of the training questions, and train two modules capturing the two main steps above. The first exploits an existing RC model and large-scale dataset to train a span-prediction model. The second uses multi-task learning to train a separate QA model to jointly predict the relation representing the gap, as well as the final answer. For questions without labelled knowledge gaps, the QA model is trained based solely on the predicted answer.
Our model outperforms the previous state-of-the-art partial knowledge models by 6.5% (64.41 vs. 57.93) on a targeted subset of OpenBookQA amenable to gap-based reasoning. Even without missing fact annotations, our model with a simple heuristic to identify missing gaps still outperforms previous models by 3.4% (61.38 vs. 57.93). It also generalizes to questions that were not its target, with 3.6% improvement (59.40 vs. 55.84) on the full OpenBookQA set.
Overall, the contributions of this work are: (1) an analysis and dataset of knowledge gaps for QA under partial knowledge; (2) a novel two-step approach of first identifying and then filling knowledge gaps for multi-hop QA; (3) a model that simultaneously learns to fill a knowledge gap using retrieved external knowledge and compose it with partial knowledge; and (4) new state-of-the-art results on QA with partial knowledge (+6.5% using annotations on only 30% of the questions).
On the other hand, open domain question answering datasets come with no context, and require first retrieving relevant knowledge before reasoning with it. Retrieving this knowledge from noisy textual corpora, while simultaneously solving the reasoning problem, can be challenging, especially when questions require multiple facts. As a result, simple approaches (e.g., word-overlap/PMI-based approaches) that do not heavily rely on the retrieval quality are competitive with more complex reasoning methods that assume clean knowledge (Jansen et al., 2017; Angeli et al., 2016). To mitigate this issue, semi-structured tables (Khashabi et al., 2016; Jansen et al., 2018) have been manually authored targeting a subset of these questions. However, these tables are expensive to create and these questions often need multiple hops (sometimes up to 16 (Jansen et al., 2018)), making reasoning much more complex.
The OpenBookQA dataset was proposed to limit the retrieval problem by providing a set of ∼1300 facts as an 'open book' for the system to use. Every question is based on one of the core facts, and in addition requires basic external knowledge such as hypernymy, definition, and causality. We focus on the task of question answering under partial context, where the core fact for each question is available to the system.
Knowledge-Based QA. Another line of research is answering questions (Bordes et al., 2015; Pasupat and Liang, 2015; Berant et al., 2013) over a structured knowledge base (KB) such as Freebase (Bollacker et al., 2008). Depending on the task, systems map questions to a KB query with varying complexity: from complex semantic parses (Krishnamurthy et al., 2017) to simple relational lookup (Petrochuk and Zettlemoyer, 2018). Our sub-task of filling the knowledge gap can be viewed as a KB QA task with knowledge present in a KB or expected to be inferred from text.
Some RC systems (Mihaylov and Frank, 2018; Kadlec et al., 2016) and Textual Entailment (TE) models (Weissenborn et al., 2017; Inkpen et al., 2018) incorporate external KBs to provide additional context to the model for better language understanding. However, we take a different approach of using this background knowledge in an explicit inference step (i.e., hop) as part of a multi-hop QA model.

Knowledge Gaps
We now take a deeper look at categorizing knowledge gaps into various classes. While grounded in OpenBookQA, this categorization is relevant for other multi-hop question sets as well. We will then discuss how to effectively annotate such gaps.

Understanding Gaps: Categorization
We analyzed the additional facts needed for answering 75 OpenBookQA questions. These facts naturally fall into three classes, based on the knowledge gap they are trying to fill: (1) Question-to-Fact, (2) Fact-to-Answer, and (3) Question-to-Answer(Fact).
Question-to-Fact Gap. This gap exists between concepts in the question and the core fact. For example, in Figure 3, the knowledge that "Kool-aid" is a liquid is needed to even recognize that the fact is relevant.
Fact-to-Answer Gap. This gap captures the relationship between concepts in the core fact and the answer choices. For example, in Figure 4, the knowledge "Heat causes evaporation" is needed to relate "evaporated" in the fact to the correct answer "heat". Note that it is often possible to find relations connecting the fact to even incorrect answer choices. For example, "rainfall" could be connected to the fact using "evaporation leads to rainfall". Thus, identifying the correct relation and knowing if it can be composed with the core fact is critical, i.e., "evaporation causes liquid to disappear" and "evaporation leads to rainfall" do not imply that "rainfall causes liquid to disappear".
Question-to-Answer(Fact) Gap. Finally, some questions need additional knowledge to connect concepts in the question to the answer, based on the core fact. For example, composition questions (Figure 5) use the provided fact to replace parts of the original question with words from the fact.
Notably, Question-to-Fact and Fact-to-Answer gaps are more common in OpenBookQA (44% and 86% respectively), while the Question-to-Answer(Fact) gap is very rare (<20%). While all three gap classes pose important problems, we focus on the Fact-to-Answer gap and assume that the core fact is provided. This is still a challenging problem as one must not only identify and fill the gap, but also learn to compose this filled gap with the input fact.

Annotating Gaps: Data Collection
Due to space constraints, details of our crowdsourcing process for annotating knowledge gaps, including the motivation behind various design choices as well as several examples, are deferred to the Appendix (Section A). Here we briefly summarize the final crowdsourcing design.
Our early pilots revealed that straightforward approaches to annotate knowledge gaps for all OpenBookQA questions lead to noisy labels. To address this, we (a) identified a subset of questions suitable for this annotation task and (b) split Fact-to-Answer gap annotation into two steps: key term identification and relation identification.
Figure 4 (caption): While it is clear how to apply the knowledge, we need to know that "Heat causes evaporation" to identify the right answer.
Table 1: Examples from KGD dataset. Note that the knowledge gap is captured in the form of (Span, Relation, Answer) but not explicitly annotated. −1 is used to indicate that the argument order should be flipped.
Question Subset. First, we identified valid question-fact pairs where the fact supports the correct answer (verified via crowdsourcing) but does not trivially lead to the answer (fact only overlaps with the correct answer). Second, we noticed that the Fact-to-Answer gaps were much noisier for longer answer options, where you could write multiple knowledge gaps or a single complex knowledge gap. So we created OBQA-Short, the subset of OpenBookQA where answer choices have at most two non-stopword tokens. This contains over 50% of the original questions and is also the target set of our approach.
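The OBQA-Short criterion described above can be illustrated with a minimal sketch. The stopword list and the whitespace tokenizer below are placeholders (the text does not specify which ones were used), and the sketch assumes that every answer choice of a question must satisfy the length limit.

```python
# Hypothetical stopword list; the actual list used for OBQA-Short is not given in the text.
STOPWORDS = {"a", "an", "the", "of", "in", "at", "on", "to", "is", "are"}

def non_stopword_tokens(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def is_short_question(choices, max_tokens=2):
    """Assumed rule: keep a question only if every answer choice
    has at most `max_tokens` non-stopword tokens."""
    return all(len(non_stopword_tokens(c)) <= max_tokens for c in choices)
```

Under these assumptions, a question whose choices are "heat", "rainfall", "cold air", and "the wind" would be kept, while one containing the choice "a steel spoon in a cafeteria" would be excluded.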
Two-step Gap Identification. Starting with the above pairs of questions with valid partial knowledge, the second task is to author facts that close the Fact-to-Answer knowledge gap. Again, initial iterations of the task resulted in poor quality, with workers often writing noisy facts that restated part of the provided fact or directly connected the question to the answer (skipping over the provided fact). We noticed that the core fact often contains a key span that hints at the final answer. So we broke the task into two steps: (1) identify key terms (preferably a span) in the core fact that could answer the question, and (2) identify one or more relations that hold between the key terms and the correct answer choice but not the incorrect choices. Table 1 shows example annotations of the gaps obtained through this process.

Knowledge Gap Dataset: KGD
Our Knowledge Gap Dataset (KGD) contains key span and relation label annotations to capture knowledge gaps. To reduce noise, we only use knowledge gap annotations where at least two of three workers found a contiguous span from the core fact and a relation from our list. The final dataset is summarized in Table 2.

Knowledge-Gap Guided QA: GapQA
We first introduce the notation used to describe our QA system. For each question $q$ and fact $f$, the selected span is given by $s$ and the set of valid relations between this span and the correct choice is given by $r$. Borrowing notation from OpenBookQA, we refer to the question without the answer choices $c$ as the stem $q_s$, i.e., $q = (q_s, c)$. We use $\hat{s}$ to indicate the predicted span and $\hat{r}$ for the predicted relations. We use $q_m$ and $f_m$ to represent the number of tokens in the question stem and fact respectively. Following the Turk task, our model first identifies the key span from the fact and then identifies the relation using retrieved knowledge.

Key Span Identification Model
Since the span selected from the fact often tends to be the answer to the question (cf. Table 1), we can use a reading comprehension model to identify this span. The fact serves as the input passage and the question stem as the input question to the reading comprehension model. We used the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2017), an attention-based span prediction model designed for the SQuAD RC dataset (Rajpurkar et al., 2016). We refer the reader to the original paper for details about the model.

Knowledge Retrieval Module
Given the predicted span, we retrieve knowledge from two sources: triples from ConceptNet (Speer et al., 2017) and sentences from the ARC corpus. ConceptNet contains (subject, relation, object) triples with relations such as /r/IsA and /r/PartOf that closely align with the relations in our gaps. Since ConceptNet can be incomplete or vague (e.g., the /r/RelatedTo relation), we also use the ARC corpus of 14M science-relevant sentences to improve our recall.
Tuple Search. To find relevant tuples connecting the predicted span $\hat{s}$ to the answer choice $c_i$, we select tuples where at least one token in the subject matches $\hat{s}$ and at least one token in the object matches $c_i$ (or vice versa). We then score each tuple $t$ using the Jaccard score and pick the top $k$ tuples for each $c_i$ ($k = 5$ in our experiments).
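The tuple search step can be sketched as follows. This is only an illustration under stated assumptions: tokens are obtained by lowercased whitespace splitting, and the Jaccard score is assumed to compare the tokens of the tuple's subject and object against the tokens of the span and the answer choice combined. The exact token matching and scoring details are not recoverable from the text, and `search_tuples` is a hypothetical name.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (assumed tokenization)."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def search_tuples(tuples, span, choice, k=5):
    """Select (subject, relation, object) tuples linking span and choice,
    then return the k highest-scoring ones."""
    span_toks, choice_toks = tokens(span), tokens(choice)
    query = span_toks | choice_toks
    scored = []
    for subj, rel, obj in tuples:
        s_toks, o_toks = tokens(subj), tokens(obj)
        # Subject matches the span and object matches the choice, or vice versa.
        forward = (s_toks & span_toks) and (o_toks & choice_toks)
        backward = (s_toks & choice_toks) and (o_toks & span_toks)
        if forward or backward:
            # Assumed scoring: overlap of the tuple's tokens with the query tokens.
            scored.append((jaccard(s_toks | o_toks, query), (subj, rel, obj)))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]
```

For the span "metal" and the choice "steel spoon", a tuple such as (steel spoon, /r/MadeOf, metal) is retained via the "vice versa" case, while (metal, /r/IsA, material) is discarded because nothing in it matches the choice.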
Text Search. To find the relevant sentences for $\hat{s}$ and $c_i$, we used ElasticSearch with the query: $\hat{s} + c_i$ (refer to Appendix D for more details). Similar to ConceptNet, we pick the top 5 sentences for each answer choice. To ensure a consistent formatting of all knowledge sources, we convert the tuples into sentences using a few hand-defined rules (described in Appendix C). Finally, all the retrieved sentences are combined to produce the input KB for the model, $K$.

Question Answering Model
The question answering model takes as input the question $q_s$, answer choices $c$, fact $f$, predicted span $\hat{s}$, and retrieved knowledge $K$. We use 300-dimensional 840B GloVe embeddings (Pennington et al., 2014) to embed each word in the inputs. We use a Bi-LSTM with 100-dimensional hidden states to compute the contextual encodings for each string, e.g., $E_f \in \mathbb{R}^{f_m \times h}$. The question answering model selects the right answer using two components: (1) a Fact Relevance module and (2) a Relation Prediction module.
Fact Relevance. This module is motivated by the intuition that a relevant fact will often capture a relation between concepts that align with the question and the correct answer (the cyan and magenta regions in Figure 2). To deal with the gaps between these concepts, this module relies purely on word embeddings while the next module will focus on using external knowledge. We compute a question-weighted and answer-weighted representation of the fact to capture the part of the fact that links to the question and answer respectively. We compose these fact-based representations to then identify how well the answer choice is supported by the fact.
To calculate the question-weighted fact representation, we first identify fact words with a high similarity to some question word, $V_{q_s}(f)$, using attention weights similar to the Query-to-Context attention weights in BiDAF. These weights yield the final question-weighted representation $S_{q_s}(f)$ (Eq. 1). We similarly compute the choice-weighted representation of the fact as $S_{c_i}(f)$. We compose these two representations by averaging⁹ these two vectors: $S_{q_s c_i}(f) = (S_{q_s}(f) + S_{c_i}(f))/2$. We finally score the answer choice by comparing this representation with the aggregate fact representation, also obtained by averaging, using the composition $\otimes(x, y) = [x - y;\, x * y] \in \mathbb{R}^{1 \times 2h}$ followed by FF, a feedforward neural network that outputs a scalar score for each answer choice.
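The fact relevance computation can be sketched in plain Python as below. Several details are assumptions rather than the paper's stated equations: the similarity between a fact word and a question word is taken to be a dot product of their encodings, the attention over fact words is a softmax of the maximum similarity to any question word (in the style of BiDAF's Query-to-Context attention), and the final feedforward network FF is left out, so the function returns the feature vector that would be passed to it.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_repr(E_query, E_fact):
    """Attention-weighted fact vector: each fact token is weighted by the
    softmax of its maximum (assumed dot-product) similarity to any query token."""
    max_sim = [max(dot(q, f) for q in E_query) for f in E_fact]
    weights = softmax(max_sim)
    h = len(E_fact[0])
    return [sum(w * f[d] for w, f in zip(weights, E_fact)) for d in range(h)]

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def compose(x, y):
    """The [x - y; x * y] composition, producing a vector of length 2h."""
    return [a - b for a, b in zip(x, y)] + [a * b for a, b in zip(x, y)]

def fact_relevance_features(E_q, E_c, E_f):
    """Features comparing the averaged question/choice-weighted fact
    representation with the aggregate (mean) fact representation."""
    s_q = weighted_repr(E_q, E_f)
    s_c = weighted_repr(E_c, E_f)
    s_qc = [(a + b) / 2 for a, b in zip(s_q, s_c)]
    return compose(s_qc, average(E_f))
```

In a full model these features would be fed to a trained feedforward network to obtain a scalar score per answer choice; the encodings here are plain lists of floats standing in for Bi-LSTM outputs.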
Filling the Gap: Relation Prediction. The relation prediction module uses the retrieved knowledge to focus on the Fact-to-Answer gap by first predicting the relation between $\hat{s}$ and $c_i$ and then composing it with the fact to score the choice. We first compute the span-weighted and choice-weighted representations, $S_{\hat{s}}(k_j)$ and $S_{c_i}(k_j)$ (each in $\mathbb{R}^{1 \times h}$), for each sentence $k_j$ in $K$ using the same operations as above. These representations capture the contextual embeddings of the words in $k_j$ that most closely resemble words in $\hat{s}$ and $c_i$ respectively. We predict the KB-based relation $R_j(\hat{s}, c_i)$ between them based on the composition of these representations. We pool the relation representations from all the KB facts into a single prediction by averaging, i.e., $R(\hat{s}, c_i) = \mathrm{avg}_j\, R_j(\hat{s}, c_i)$.
⁹ We found this simple composition function performed better than other composition operations.
Relation Prediction Score. We first identify the potential relations that can be composed with the fact, given the question; e.g., in Figure 1, we can compose the fact with the (steel spoon; made of; metal) relation but not (metal; made of; ions). We compose an aggregate representation of the question and fact encoding to capture this information. We finally score the answer choice, $\mathrm{score}_r(c_i)$, based on this representation and the relation representation. The final score for each answer choice is computed by summing the fact relevance and relation prediction based scores, i.e., $\mathrm{score}(c_i) = \mathrm{score}_f(c_i) + \mathrm{score}_r(c_i)$. The final architecture of our QA model is shown in Figure 6.

Model Training
We use cross-entropy loss between the predicted answer scores $\hat{c}$ and the gold answer choice $\bar{c}$. Since we also have labels on the true relations between the gold span and the correct answer choice, we introduce an auxiliary loss to ensure the predicted relation $R$ corresponds to the true relation between $s$ and $c_i$. We use a single-layer feedforward network to project $R(\hat{s}, c_i)$ into a vector $\hat{r}_i \in \mathbb{R}^{1 \times l}$ where $l$ is the number of relations. Since multiple relations can be valid, we create an n-hot vector representation $\bar{r} \in \mathbb{R}^{1 \times l}$ where $\bar{r}[k] = 1$ if $r_k$ is a valid relation.
We use binary cross-entropy loss between $\hat{r}_i$ and $\bar{r}$ for the correct answer choice. For the incorrect answer choices, we do not know if any of the unselected relations (i.e., where $\bar{r}[k] = 0$) hold. But we do know that the relations selected by Turkers for the correct answer choice should not hold for the incorrect answer choices. To capture this, we compute the binary cross-entropy loss between $\hat{r}_j$ and $1 - \bar{r}$ for the incorrect answer choices but ignore the unselected relations.
Finally, the loss for each example, assuming $c_i$ is the correct answer, is given as

$$\mathrm{loss} = \mathrm{ce}(\hat{c}, \bar{c}) + \lambda \cdot \Big( \mathrm{bce}(\hat{r}_i, \bar{r}) + \sum_{j \neq i} \mathrm{mbce}(\hat{r}_j, 1 - \bar{r}, \bar{r}) \Big),$$

where ce is cross-entropy loss, bce is binary cross-entropy loss, and mbce is masked binary cross-entropy loss, where unselected relations are masked.
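A minimal sketch of this training loss in plain Python follows. It assumes that the per-choice relation predictions are already probabilities (e.g., sigmoid outputs), that the binary cross-entropy is averaged over the unmasked relations, and that λ scales both relation terms. The function names (`gapqa_loss`, `bce`) are illustrative and not taken from the released code.

```python
import math

def cross_entropy(scores, gold_index):
    """Softmax cross-entropy over answer-choice scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]

def bce(probs, targets, mask=None, eps=1e-12):
    """Mean binary cross-entropy over positions where mask is 1
    (all positions if no mask is given)."""
    if mask is None:
        mask = [1] * len(probs)
    total, count = 0.0, 0
    for p, t, m in zip(probs, targets, mask):
        if m:
            p = min(max(p, eps), 1 - eps)  # clip for numerical safety
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            count += 1
    return total / count if count else 0.0

def gapqa_loss(choice_scores, gold_index, relation_probs, gold_relations, lam=1.0):
    """QA loss plus relation loss: plain BCE for the correct choice, and
    masked BCE against the flipped labels for the incorrect choices."""
    flipped = [1 - r for r in gold_relations]
    rel_loss = 0.0
    for j, probs in enumerate(relation_probs):
        if j == gold_index:
            rel_loss += bce(probs, gold_relations)
        else:
            # Only relations selected for the correct choice are penalized here;
            # unselected relations are masked out.
            rel_loss += bce(probs, flipped, mask=gold_relations)
    return cross_entropy(choice_scores, gold_index) + lam * rel_loss
```

The mask for incorrect choices is the n-hot gold vector itself, so a relation that was not selected for the correct answer contributes nothing to the loss of an incorrect choice, whatever probability the model assigns to it.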
We further augment the training data with questions in the OBQA-Short dataset using the predicted spans and ignoring the relation loss. Also, we assume the labelled core fact in the OpenBookQA dataset provides the partial knowledge needed to answer these questions. Implementation details and parameter settings are deferred to Appendix B. A sample visualization of the attentions and knowledge used in the model is provided in Figure 10 in the Appendix.

Experimental Results
We present results of our proposed model, GapQA, on two question sets: (a) those with short answers,¹⁰ OBQA-Short (290 test questions), and (b) the complete set, OBQA-Full (500 test questions). As mentioned before, the OBQA-Short subset is likely to have Fact-to-Answer gaps that can be targeted by our approach, and we therefore expect larger and more meaningful gains on this subset.

Key Span Identification
We begin by evaluating three training strategies for the key span identification model, using the annotated spans in KGD for training. As seen in Table 3, the BiDAF model trained on the SQuAD dataset (Rajpurkar et al., 2016) performs poorly on our task, likely due to the different question style in OpenBookQA. While training on KGD (from scratch) substantially improves accuracy, we observe that using KGD to fine-tune BiDAF pretrained on SQuAD results in the best F1 (78.55) and EM (63.99) scores on the Dev set. All subsequent experiments use this fine-tuned model.
¹⁰ Answers with at most two non-stopword tokens.

OpenBookQA Results
We compare with three previously reported state-of-the-art models. Two of these are knowledge-free models.
Following OpenBookQA, we train each model five times using different random seeds, and report the average score and standard deviation (without Bessel's correction) on the test set. For simplicity and consistency with prior work, we report one std. dev. from the mean using the µ ± σ notation. We train our model on the combined KGD and OBQA-Short question set with full supervision on examples in KGD and only QA supervision (with predicted spans) on questions in OBQA-Short. We train the baseline approaches on the entire question set as they have worse accuracies on both the sets when trained on the OBQA-Short subset. We do not use our annotations for any of the test evaluations. We use the core fact provided by the original dataset and use the predicted spans from the fine-tuned BiDAF model.
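The reporting convention described above (mean and standard deviation without Bessel's correction over several seeds) corresponds to the population standard deviation. A minimal sketch using Python's standard library follows; the accuracy values in the usage note are made up for illustration and are not results from the paper.

```python
import statistics

def summarize(accuracies):
    """Return (mean, population standard deviation) of per-seed accuracies.
    pstdev divides by N, i.e., it applies no Bessel's correction."""
    return statistics.mean(accuracies), statistics.pstdev(accuracies)

def format_mu_sigma(accuracies):
    """Format the summary in the mu ± sigma style."""
    mu, sigma = summarize(accuracies)
    return f"{mu:.2f} ± {sigma:.2f}"
```

For instance, `summarize([60.0, 62.0, 64.0])` yields a mean of 62.0 and a standard deviation of about 1.63, whereas the sample standard deviation (with Bessel's correction) of the same values would be 2.0.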
We present the test accuracies on the two question sets in Table 4. On the targeted OBQA-Short subset, our proposed GapQA improves statistically significantly over the partial knowledge baselines by 6.5% to 14.4%. Even though the full OpenBookQA dataset contains a wider variety of questions not targeted by GapQA, we still see an improvement of 3+% relative to prior approaches.
Table 4: Test accuracy on the OBQA-Short subset and OBQA-Full dataset assuming the core fact is given. * denotes the results are statistically significantly better than all the baselines (p≤0.05, based on Wilson score intervals (Wilson, 1927)).
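For reference, the Wilson score interval for a binomial proportion can be computed as in the sketch below. The text does not spell out how the intervals were compared to establish significance, so this shows only the standard interval formula; z = 1.96 is the usual value for a 95% interval, and the counts in the usage note are hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval (lower, upper) for the proportion correct / total."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

With 145 correct answers out of 290 questions, the interval is symmetric around 0.5 and spans roughly 0.443 to 0.557, which gives a sense of the uncertainty attached to accuracies measured on a test set of this size.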
It is worth noting that recent large-scale language models (LMs) (Devlin et al., 2019;Radford et al., 2018) have now been applied on this task, leading to improved state-of-the-art results (Sun et al., 2018;Banerjee et al., 2019;Pan et al., 2019). However, our knowledge-gap guided approach to QA is orthogonal to the underlying model. Combining these new LMs with our approach is left to future work.
Effect of input knowledge. Since the baseline models use different knowledge sources as input, we evaluate the performance of our model using the same knowledge as the baselines.¹² Even when our model is given the same knowledge, we see an improvement of 5.9% and 11.3% given only WordNet and OMCS knowledge respectively. This shows that we can use the available knowledge, even if limited, more effectively than previous methods. When provided with the full ConceptNet knowledge and large-scale text corpora, our model is able to exploit this additional knowledge and improve further by 4%.

Ablations
We next evaluate key aspects of our model in an ablation study, with average accuracies in Table 6.
No Annotations (No Anns): We ignore all collected annotations (span, relation, and fact) for training the model. We use the BiDAF(SQuAD) model for span prediction, and only the question answering loss for the QA model trained on the OBQA-Short subset. Due to the noisy spans produced by the out-of-domain BiDAF model, this model performs worse than the full GapQA model by 5.5% (comparable performance to the KER models). This shows that our model does utilize the human annotations to improve on this task.
¹² The baselines do not scale to large-scale corpora and so cannot be evaluated against our knowledge sources.
Table 7 (error examples, partially recovered): Scores the relation for the incorrect answer higher because of the facts connecting "systems" and "ships". Question: Cocoon creation occurs (A) after the caterpillar stage (B) after the chrysalis stage (C) after the eggs are laid (D) after the cocoon emerging stage. Fact: the cocoons being created occurs during the pupa stage in a life cycle. Predicted: after the chrysalis stage. Does not model the complex relation (temporal ordering) between the key span "pupa stage" and "caterpillar stage". Instead it predicts "chrysalis" due to the synonymy with "pupa".
Heuristic Span Annotations: We next ask whether some of the above loss in accuracy can be recovered by heuristically producing the spans for training, a cost-effective alternative to human annotations. We find the longest subsequence of tokens (ignoring stop words) in $f$ that is not mentioned in $q$ and assume this span (including the intermediate stop words) to be the key term. To prevent noisy key terms, we only consider a subset of questions where 60% of the non-stopword stemmed tokens in $f$ are covered by $q$. We fine-tune the BiDAF(SQuAD) model on this subset and then use it to predict the spans on the full set. We train the GapQA model on this dataset without any relation labels (and associated loss). This simple heuristic leads to a 3% drop compared to human annotations, but still outperforms previous approaches on this dataset, showing the value of the gap-based QA approach.
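The span heuristic described above can be sketched as follows. This is one possible reading, not the authors' implementation: "longest subsequence" is taken to mean the longest contiguous run of fact tokens whose non-stopword tokens do not occur in the question, length is measured in non-stopword tokens, and leading and trailing stopwords are trimmed. The stopword list and the `stem` function are crude hypothetical stand-ins for whatever resources were actually used.

```python
# Hypothetical stopword list, chosen only to make the example below work.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "to",
             "which", "these", "would", "most", "through"}

def stem(token):
    """Crude stand-in for a real stemmer: drop a trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def content_stems(text):
    """Stems of the non-stopword tokens in the text."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]

def coverage(fact, question):
    """Fraction of the fact's non-stopword stems that also occur in the question."""
    fact_stems = content_stems(fact)
    q_stems = set(content_stems(question))
    if not fact_stems:
        return 0.0
    return sum(s in q_stems for s in fact_stems) / len(fact_stems)

def heuristic_span(fact, question):
    """Longest run of fact tokens not mentioned in the question,
    keeping stopwords that fall inside the run."""
    q_stems = set(content_stems(question))
    best, best_size, run = [], 0, []
    for tok in fact.lower().split() + [None]:  # None flushes the final run
        if tok is not None and (tok in STOPWORDS or stem(tok) not in q_stems):
            run.append(tok)
            continue
        # The run ended: trim stopwords at both ends before measuring it.
        while run and run[0] in STOPWORDS:
            run.pop(0)
        while run and run[-1] in STOPWORDS:
            run.pop()
        size = sum(t not in STOPWORDS for t in run)
        if size > best_size:
            best, best_size = run, size
        run = []
    return " ".join(best)
```

On the Figure 1 example, with the fact "metal lets heat travel through" and the question stem "which of these would let the most heat travel through", this sketch selects "metal" as the key term, and the coverage of the fact by the question is 0.75, which passes a 60% threshold.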
No Relation Score: We ignore the entire relation-based score ($\mathrm{score}_r$) in the model and only rely on the fact-relevance score. The drop in score by 3.9% shows that the fact alone is not sufficient to answer the question using our model.
No Spans (Model): We ignore the spans in the model, i.e., we use the entire fact to compute the span-based representation $S_{\hat{s}}(k_j)$. In effect, the model is predicting the gap between the entire fact and answer choice.¹⁵ We see a drop of ∼2%, showing the value of spans for gap prediction.
No Spans (IR): Ignoring the span for retrieval, the knowledge is retrieved based on the entire fact (the full GapQA model is used). The drop in accuracy by 2.6% shows the value of targeted knowledge-gap-based retrieval.

Error Analysis
We further analyzed the performance of GapQA on 40 incorrectly answered questions from the dev set in the OBQA-Short dataset. Table 7 shows a few error examples. There were three main classes of errors:
Incorrect predicted spans (25%), often due to complex language in the fact or the Question-to-Fact gap needed to accurately identify the span.
Incorrect relation scores (55%), due to distracting facts for the incorrect answer or not finding good quality relevant facts for the correct answer, leading to an incorrect answer scoring higher.
Out-of-scope gap relations (20%), where the knowledge gap relations are not handled by our model, such as temporal relations or negations (e.g., is not made of).
¹⁵ Retrieval is still based on the span and we ignore the relation prediction loss.
Future work in expanding the dataset, incorporating additional relations, and better retrieval could mitigate these errors.

Conclusion
We focus on the task of question answering under partial knowledge: a novel task that lies in between open-domain QA and reading comprehension. We identify classes of knowledge gaps when reasoning under partial knowledge and collect a dataset targeting one common class of knowledge gaps. We demonstrate that identifying the knowledge gap first and then reasoning by filling this gap outperforms previous approaches on the OpenBookQA task, with and even without additional missing fact annotation. This work opens up the possibility of focusing on other kinds of knowledge gaps and extending this approach to other datasets and tasks (e.g., span prediction).

Figure 10: Visualization of the model's behavior with the predicted span, top predicted relation, and the top fact used by the model. The heat map shows the confidence of the model for all the relations for each input sentence (first five), the ConceptNet sentencized tuple (last but one), and the back-off tuple (last one) to capture the knowledge in the embeddings. Text recovered from the figure: boiling point means temperature above which a liquid boils.

B Implementation Details
We implement all our models in PyTorch (Paszke et al., 2017) using the AllenNLP toolkit. We also used the AllenNLP implementation of the BiDAF model for span prediction. We use 300D 840B GloVe (Pennington et al., 2014) embeddings and use 200-dimensional hidden representations for the BiLSTM shared between all inputs (each direction uses 100-dimensional hidden vectors). We use 100-dimensional representations for the relation prediction, $R_j$. Each feedforward network, FF, is a 2-layer network with ReLU activation, 0.5 dropout (Srivastava et al., 2014), 200 hidden dimensions on the first layer, and no dropout on the output layer with linear activation. We use a variational dropout (Gal and Ghahramani, 2016) of 0.2 in all the BiLSTMs. The relation prediction loss is scaled by $\lambda = 1$. We used the Adam (Kingma and Ba, 2015) optimizer with an initial learning rate of 0.001 and a learning rate scheduler that halves the learning rate after 5 epochs of no change in QA accuracy. We tuned the hyper-parameters and performed early stopping based on question answering accuracy on the validation set. Specifically, we considered {50, 100, 200}-dimensional representations, $\lambda \in \{0.1, 1, 10\}$, retrieving {10, 20} knowledge tuples, and {[x − y; x * y], [x, y]} combination functions during the development of the model. The baseline models were developed for this dataset using hyper-parameter tuning; we do not perform any additional tuning. Our model code and pre-trained models are available at https://github.com/allenai/missing-fact.

C ConceptNet sentences
Given a tuple t = (s, v, o), the sentence form is generated as "s is split(v) o" where split(v) splits the ConceptNet relation v into a phrase based on its camel-case notation. For example, (belt buckle, /r/MadeOf, metal) would be converted into "belt buckle is made of metal".
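The conversion rule above can be sketched as a short Python function. The names `split_relation` and `tuple_to_sentence` are illustrative, and the sketch assumes relation identifiers of the form /r/CamelCase.

```python
import re

def split_relation(relation):
    """Turn a ConceptNet relation such as '/r/MadeOf' into the phrase 'made of'
    by splitting its camel-case name."""
    name = relation.rsplit("/", 1)[-1]                 # '/r/MadeOf' -> 'MadeOf'
    words = re.findall(r"[A-Z][a-z]*|[a-z]+", name)    # ['Made', 'Of']
    return " ".join(w.lower() for w in words)

def tuple_to_sentence(subject, relation, obj):
    """Render (s, v, o) as the sentence 's is split(v) o'."""
    return f"{subject} is {split_relation(relation)} {obj}"
```

Applied to (belt buckle, /r/MadeOf, metal), this returns "belt buckle is made of metal". A relation such as /r/IsA would come out as "is is a" under this single template, which is presumably where additional hand-defined rules are needed.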

D Text retrieval
For each span $\hat{s}$ and answer choice $c_i$, we query an ElasticSearch index on the input text corpus with "$\hat{s} + c_i$" as the query. We also require that the matched sentence contain both the span and the answer choice. We filter out long sentences (>300 characters), sentences with negation, and noisy sentences from the retrieved sentences.
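The post-retrieval filtering can be sketched as below. The ElasticSearch query itself is omitted; the sketch only covers the filters applied to returned sentences. The negation word list and the use of case-insensitive substring matching are assumptions, and the "noisy sentence" filter is not modeled because its definition is not given here.

```python
# Hypothetical negation cues; the actual list is not specified in the text.
NEGATION_WORDS = {"not", "no", "never", "cannot"}

def keep_sentence(sentence, span, choice, max_chars=300):
    """Keep a retrieved sentence only if it is short enough, mentions both
    the span and the answer choice, and contains no negation cue."""
    if len(sentence) > max_chars:
        return False
    text = sentence.lower()
    if span.lower() not in text or choice.lower() not in text:
        return False
    words = set(text.replace(".", " ").replace(",", " ").split())
    return not (words & NEGATION_WORDS) and "n't" not in text

def filter_sentences(sentences, span, choice, top_k=5):
    """Apply the filters in retrieval order and keep the first top_k sentences."""
    return [s for s in sentences if keep_sentence(s, span, choice)][:top_k]
```

For the span "metal" and choice "steel spoon", the sentence "A steel spoon is made of metal." passes, while "A steel spoon is not made of metal." is dropped by the negation filter.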