Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External Commonsense Knowledge

We introduce a neural reading comprehension model that integrates external commonsense knowledge, encoded as a key-value memory, in a cloze-style setting. Instead of relying only on document-to-question interaction or discrete features as in prior work, our model attends to relevant external knowledge and combines this knowledge with the context representation before inferring the answer. This allows the model to attend to and exploit knowledge from an external knowledge source that is not explicitly stated in the text, but that is relevant for inferring the answer. Our model improves results over a very strong baseline on a hard Common Nouns dataset, making it a strong competitor of much more complex models. By including knowledge explicitly, our model can also provide evidence about the background knowledge used in the RC process.


Introduction
Reading comprehension (RC) is a language understanding task similar to question answering, where a system is expected to read a given passage of text and answer questions about it. Cloze-style reading comprehension is a task setting where the question is formed by replacing a token in a sentence of the story with a placeholder (left part of Figure 1).
In contrast to many previous complex models (Weston et al., 2015; Dhingra et al., 2017; Cui et al., 2017; Munkhdalai and Yu, 2016; Sordoni et al., 2016) that perform multi-turn reading of a story and a question before inferring the correct answer, we aim to tackle the cloze-style RC task in a way that resembles how humans solve it: using, in addition, background knowledge. We develop a neural model for RC that can successfully deal with tasks where most of the information to infer answers from is given in the document (story), but where additional information is needed to predict the answer, which can be retrieved from a knowledge base and added to the context representations explicitly. An illustration is given in Figure 1. Such knowledge may be commonsense knowledge or factual background knowledge about entities and events that is not explicitly expressed but can be found in a knowledge base such as ConceptNet (Speer et al., 2017), BabelNet (Navigli and Ponzetto, 2012), Freebase (Tanon et al., 2016) or domain-specific KBs collected with Information Extraction approaches (Fader et al., 2011; Mausam et al., 2012; Bhutani et al., 2016). Thus, we aim to define a neural model that encodes preselected knowledge in a memory, and that learns to include the available knowledge as an enrichment to the context representation.
The main difference of our model to prior state-of-the-art is that instead of relying only on document-to-question interaction or discrete features while performing multiple hops over the document, our model (i) attends to relevant selected external knowledge and (ii) combines this knowledge with the context representation before inferring the answer, in a single hop. This allows the model to explicitly incorporate knowledge that is not stated in the text, but that is relevant for inferring the answer and can be found in an external knowledge source. Moreover, by including knowledge explicitly, our model provides evidence and insight about the knowledge used in the RC process.
Our main contributions are: (i) We develop a method for integrating knowledge into a simple but effective reading comprehension model (AS Reader, Kadlec et al. (2016)) and improve its results significantly, whereas other models require discrete features or multiple hops. (ii) We examine two sources of common knowledge, WordNet (Miller et al., 1990) and ConceptNet (Speer et al., 2017), and show that this type of knowledge is important for answering common nouns questions and also slightly improves performance for named entities. (iii) We show that knowledge facts can be added directly to the text-only representation, enriching the neural context encoding. (iv) We demonstrate the effectiveness of the injected knowledge through case studies and data statistics in a qualitative evaluation study.

Reading Comprehension with Background Knowledge Sources
In this work, we examine the impact of using external knowledge as supporting information for the task of cloze-style reading comprehension. We build a system with two modules. The first, Knowledge Retrieval, performs fact retrieval and selects a number of facts f_1, ..., f_p that might be relevant for connecting story, question and candidate answers. The second, main module, the Knowledgeable Reader, is a knowledge-enhanced neural module. It takes as input the story context tokens d_{1..n}, the question tokens q_{1..m}, the set of answer candidates a_{1..k} and the set of 'relevant' background knowledge facts f_{1..p} in order to select the right answer. To include external knowledge for the RC task, we encode each fact f_{1..p} and use attention to select the most relevant among them for each token in the story and question. We expect that enriching the text with additional knowledge about the mentioned concepts will improve the prediction of correct answers in a strong single-pass system. See Figure 1 for an illustration.

Knowledge Retrieval
In our experiments we use knowledge from the Open Mind Common Sense (OMCS, Singh et al. (2002)) part of ConceptNet, a crowd-sourced resource of commonsense knowledge with a total of ∼630k facts. Each fact f_i is represented as a triple f_i = (subject, relation, object), where subject and object can be multi-word expressions and relation is a relation type. We experiment with three set-ups: (i) using all facts from OMCS that pertain to ConceptNet, referred to as CN5All; (ii) using all facts from CN5All excluding some WordNet relations, referred to as CN5Sel(ected) (see Section 3); and (iii) using facts from OMCS that have their source set to WordNet (CN5WN3).
Retrieving relevant knowledge. For each instance (D, Q, A_{1..10}) we retrieve relevant commonsense background facts. We first retrieve facts that contain lemmas that can be looked up via tokens contained in the D(ocument), Q(uestion) or A(nswer candidates). We add a weight value for each node: 4 if it contains a lemma of a candidate token from A; 3 if it contains a lemma from the tokens of Q; and 2 if it contains a lemma from the tokens of D. The weights are chosen heuristically such that they model relative fact importance in different interactions as A+A > A+Q > A+D > D+Q > D+D. We weight the fact triples that contain these lemmas as nodes by summing the weights of the subject and object arguments, and sort the knowledge triples by this overall weight. To limit the memory of our model, we run experiments with different sizes of the top number of facts P selected from all instance fact candidates, P ∈ {50, 100, 200}. As an additional retrieval limitation, we force the number of facts per answer candidate to be the same, in order to avoid a frequency bias towards answer candidates that appear more often in the knowledge source. Thus, if we select a maximum of 100 facts for each task instance and we have 10 answer candidates a_{i=1..10}, we retrieve the top 10 facts for each candidate a_i that have either a subject or an object lemma matching a token in a_i. If the same fact contains lemmas of two candidates a_i and a_j (j > i), we add the fact once for a_i and do not add it again for a_j. If several facts have the same weight, we take the first in the order of the list, i.e., the order of retrieval from the database. If one candidate has fewer than 10 facts, the overall number of fact candidates for the sample will be less than the maximum (100).
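The retrieval heuristic above can be sketched as follows. This is a minimal illustration, not the authors' code: lemma matching is approximated by lowercase token overlap, and the function name and signature are assumptions.

```python
# Sketch of the heuristic fact weighting and per-candidate fact cap
# described above. Lemma lookup is approximated by lowercase token
# matching; names and shapes are illustrative assumptions.

def weight_facts(facts, doc_tokens, q_tokens, cand_tokens, facts_per_cand=10):
    """facts: list of (subject, relation, object) string triples.
    doc_tokens / q_tokens / cand_tokens: sets of lowercase tokens."""
    def node_weight(node):
        words = set(node.lower().split())
        if words & cand_tokens:   # contains an answer-candidate lemma: 4
            return 4
        if words & q_tokens:      # contains a question lemma: 3
            return 3
        if words & doc_tokens:    # contains a document lemma: 2
            return 2
        return 0

    # Triple weight = subject weight + object weight; Python's stable
    # sort keeps retrieval order for equally weighted facts.
    weighted = sorted(((node_weight(s) + node_weight(o), (s, r, o))
                       for s, r, o in facts), key=lambda x: -x[0])

    # Keep at most facts_per_cand facts per candidate; a fact matching
    # two candidates is counted only once, for the first candidate.
    selected, quota = [], {c: 0 for c in cand_tokens}
    for w, (s, r, o) in weighted:
        words = set((s + " " + o).lower().split())
        for c in sorted(quota):
            if c in words and quota[c] < facts_per_cand:
                quota[c] += 1
                selected.append((s, r, o))
                break
    return selected
```

Only facts whose subject or object matches some answer candidate are kept, which mirrors the per-candidate retrieval described above.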

Neural Model: Extending the Attention Sum Reader with a Knowledge Memory
We implement our Knowledgeable Reader (KnReader) using as a basis the Attention Sum Reader, one of the strongest core models for single-hop RC. We extend it with a knowledge fact memory that is filled with pre-selected facts. Our aim is to examine how adding commonsense knowledge to a simple yet effective model can improve the RC process, and to show some evidence for this by attending over the incorporated knowledge facts. The model architecture is shown in Figure 5.
Base Attention Model. The Attention Sum Reader (Kadlec et al., 2016), our base model for RC, reads the input of story tokens d_{1..n}, question tokens q_{1..m}, and the set of candidates a_{1..10} that occur in the story text. The model calculates the attention between the question representation r_q and the story context encodings of the candidate tokens a_{1..10}, and sums the attention scores for candidates that appear multiple times in the story. The model selects as answer the candidate with the highest attention score.
Word Embeddings Layer. We represent input document and question tokens w_i by looking up their embedding representations e_i = Emb(w_i), where Emb is an embedding lookup function. We apply dropout (Srivastava et al., 2014) with keep probability p = 0.8 to the output of the embedding lookup layer.
Context Representations. To represent the document and question contexts, we first encode the tokens with a Bi-directional GRU (Gated Recurrent Unit) (Chung et al., 2014) to obtain context-encoded representations for the document and the question:

    c^ctx_{d_{1..n}} = BiGRU(e_{d_{1..n}}),    (1)
    c^ctx_{q_{1..m}} = BiGRU(e_{q_{1..m}}),    (2)

where d_i and q_i denote the i-th token of the document d and the question q, respectively, n and m are the lengths of d and q, and h is the output hidden size (256) of a single GRU unit. BiGRU is defined in (3), with e_i a word embedding vector:

    BiGRU(e_i) = [GRU_fw(e_i); GRU_bw(e_i)] ∈ R^{2h}.    (3)

For the question we construct a single vector representation r^ctx_q by retrieving the token representation at the placeholder (XXXX) index pl (cf. Figure 5):

    r^ctx_q = c^ctx_{q_{1..m}}[pl],    (4)

where [pl] is an element pickup operation.
Our question vector representation differs from the original AS Reader, which builds the question representation by concatenating the last states of a forward and a backward layer. We changed the original representation because we observed some very long questions, and in this way aim to prevent the context encoder from 'forgetting' where the placeholder is.
Answer Prediction: Q ctx to D ctx Attention. In order to predict the correct answer to the given question, we rank the given answer candidates a_1..a_L according to the normalized attention sum score between the context (ctx) representation of the question placeholder r^ctx_q and the representations of the candidate tokens in the document:

    score(a_i) = Σ_{j ∈ occ(a_i)} softmax_j(Att(r^ctx_q, c^ctx_{d_j})),    (5)

where j is an index from the list of indices of the occurrences of candidate a_i in the document context representation c^ctx_d.
Att is a dot product.
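As a concrete illustration, the attention-sum scoring of Eq. 5 can be sketched in a few lines of numpy; the function name and input shapes are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of attention-sum answer scoring (Eq. 5): dot-product
# attention between the question placeholder vector and every document
# token is softmax-normalized, then summed over the occurrence
# positions of each answer candidate.

def attention_sum(r_q, c_d, cand_positions):
    """r_q: (2h,) question vector; c_d: (n, 2h) document token encodings;
    cand_positions: dict mapping candidate -> list of token indices."""
    att = c_d @ r_q                      # dot-product attention, shape (n,)
    att = np.exp(att - att.max())
    att = att / att.sum()                # softmax over document positions
    return {cand: float(att[idx].sum())  # sum over all occurrences
            for cand, idx in cand_positions.items()}
```

The predicted answer is then the candidate with the highest score, so a candidate occurring several times in attended positions accumulates evidence.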
Enriching Context Representations with Knowledge (Context+Knowledge).To enhance the representation of the context, we add knowledge, retrieved as a set of knowledge facts.
Knowledge Encoding. For each instance in the dataset, we retrieve a number of relevant facts (cf. Section 2.1). Each retrieved fact is represented as a triple f = (w^subj_{1..L_subj}, w^rel, w^obj_{1..L_obj}), where w^subj_{1..L_subj} and w^obj_{1..L_obj} are multi-word expressions representing the subject and the object with sequence lengths L_subj and L_obj, and w^rel is a word token corresponding to the relation. As a result of fact encoding, we obtain a separate knowledge memory for each instance in the data.
To encode the knowledge we use a BiGRU to encode the triple argument tokens into the following context-encoded representations:

    f^subj = BiGRU(e(w^subj_{1..L_subj})),    (6)
    f^rel  = BiGRU(e(w^rel), f^subj_last),    (7)
    f^obj  = BiGRU(e(w^obj_{1..L_obj}), f^rel_last),    (8)

where f^subj_last, f^rel_last, f^obj_last are the final hidden states of the context encoder BiGRU, which are also used as initial states for encoding the next triple attribute in left-to-right order. See the Supplement for comprehensive visualizations. The motivation behind this encoding is: (i) we encode the knowledge fact attributes in the same vector space as the plain text tokens; (ii) we preserve the triple directionality; (iii) we use the relation type as a way of filtering the subject information to initialize the object.
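The state chaining of Eqs. 6-8 can be illustrated as follows. A toy elementwise recurrent cell stands in for the BiGRU, so the cell itself is an assumption; only the chaining of final states mirrors the description above.

```python
import numpy as np

# Sketch of the chained triple encoding (Eqs. 6-8): the final state of
# each encoded segment initializes the encoder of the next segment,
# preserving triple directionality. The toy tanh cell below is a
# stand-in for the BiGRU, not the paper's encoder.

def rnn_encode(embs, h0):
    """Run a toy recurrent cell over a list of embedding vectors;
    returns (all hidden states, final hidden state)."""
    h, states = h0, []
    for e in embs:
        h = np.tanh(0.5 * h + 0.5 * e)   # stand-in for a GRU step
        states.append(h)
    return states, h

def encode_triple(subj_embs, rel_emb, obj_embs, dim):
    h0 = np.zeros(dim)
    _, subj_last = rnn_encode(subj_embs, h0)         # Eq. 6
    _, rel_last = rnn_encode([rel_emb], subj_last)   # Eq. 7: init with subject state
    _, obj_last = rnn_encode(obj_embs, rel_last)     # Eq. 8: init with relation state
    # subj_last / obj_last can later serve as the memory key / value.
    return subj_last, rel_last, obj_last
```

Because the subject state seeds the relation and object encoders, the object representation depends on the whole triple read left to right.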
Querying the Knowledge Memory. To enrich the context representation of the document and question tokens with the facts collected in the knowledge memory, we select a single sum of weighted fact representations for each token using Key-Value retrieval (Miller et al., 2016). In our model, each of the P retrieved knowledge facts is stored as a memory key M^k_i and a value M^v_i (encodings of the fact arguments, e.g. the subject as key and the object as value). For each context token representation c^ctx_{s_i}, we use an attention function Att between the token and each memory key, scale the scalar attention values with softmax, multiply them with the value representations M^v_i and sum the result into a single vector c^kn_{s_i}:

    c^kn_{s_i} = Σ_{p=1..P} softmax_p(Att(c^ctx_{s_i}, M^k_p)) M^v_p.    (10)

Att is a dot product, but it can be replaced with another attention function. As a result of this operation, the context token representation c^ctx_{s_i} and the corresponding retrieved knowledge c^kn_{s_i} are in the same vector space R^{2h}.
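A minimal numpy sketch of the key-value lookup in Eq. 10, assuming dot-product attention and toy array shapes:

```python
import numpy as np

# Sketch of the key-value memory lookup (Eq. 10): every token
# representation attends over the fact keys, and the softmax-weighted
# sum of the fact values becomes that token's knowledge vector.

def query_memory(c_ctx, mem_keys, mem_values):
    """c_ctx: (T, d) token encodings; mem_keys, mem_values: (P, d)."""
    att = c_ctx @ mem_keys.T                         # (T, P) dot-product scores
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)       # softmax over the P facts
    return att @ mem_values                          # (T, d) knowledge vectors
```

Since the output lives in the same space as the token encodings, the two representations can be mixed directly in the next step.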

Combine Context and Knowledge (ctx+kn).
We combine the original context token representation c^ctx_{s_i} with the acquired knowledge representation c^kn_{s_i} to obtain c^{ctx+kn}_{s_i}:

    c^{ctx+kn}_{s_i} = γ c^ctx_{s_i} + (1 − γ) c^kn_{s_i},    (11)

where γ = 0.5. We keep γ static, but it can be replaced with a gating function.
Answer Prediction: Q ctx(+kn) to D ctx(+kn) Attention. To rank the answer candidates a_1..a_L we use an attention sum similar to Eq. 5 over an attention ensemble α^ensemble_{i,j} that combines the attentions between the context (ctx) and context+knowledge (ctx+kn) representations of the question (r^{ctx(+kn)}_q) and the candidate token occurrences a_{i,j} in the document (c^{ctx(+kn)}_{d_j}):

    α^ensemble_{i,j} = W_1 Att(r^ctx_q, c^ctx_{d_j}) + W_2 Att(r^ctx_q, c^{ctx+kn}_{d_j}) + W_3 Att(r^{ctx+kn}_q, c^ctx_{d_j}) + W_4 Att(r^{ctx+kn}_q, c^{ctx+kn}_{d_j}),    (12)

    score(a_i) = Σ_{j ∈ occ(a_i)} softmax_j(α^ensemble_{i,j}),    (13)

where j is an index from the list of indices of the occurrences of candidate a_i in the document representation c^{ctx(+kn)}_d, and W_{1..4} are scalar weights initialized with 1.0 and optimized during training. We propose the combination of ctx and ctx+kn attentions because our task does not provide supervision on whether the knowledge is needed or not.
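The four-way attention ensemble can be sketched as below; `gamma`, the weight tuple and the toy shapes are illustrative assumptions rather than the trained model.

```python
import numpy as np

# Sketch of the attention ensemble (Eq. 12): four dot-product
# attentions between the ctx / ctx+kn views of the question and the
# document are combined with scalar weights W1..W4 (learnable in the
# model, fixed here), then softmax-normalized over document positions.

def ensemble_scores(r_q_ctx, r_q_kn, c_d_ctx, c_d_kn,
                    W=(1.0, 1.0, 1.0, 1.0), gamma=0.5):
    """r_q_*: (d,) question vectors; c_d_*: (n, d) document encodings."""
    r_q_plus = gamma * r_q_ctx + (1 - gamma) * r_q_kn   # Eq. 11 for the question
    c_d_plus = gamma * c_d_ctx + (1 - gamma) * c_d_kn   # Eq. 11 for the document
    alpha = (W[0] * (c_d_ctx @ r_q_ctx)       # ctx Q    x ctx D
             + W[1] * (c_d_plus @ r_q_ctx)    # ctx Q    x ctx+kn D
             + W[2] * (c_d_ctx @ r_q_plus)    # ctx+kn Q x ctx D
             + W[3] * (c_d_plus @ r_q_plus))  # ctx+kn Q x ctx+kn D
    alpha = np.exp(alpha - alpha.max())
    return alpha / alpha.sum()                # normalized score per position
```

Candidate scores are then obtained by summing these normalized values over each candidate's occurrence positions, exactly as in Eq. 5.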

Data and Task Description
We experiment with knowledge-enhanced cloze-style reading comprehension using the Common Nouns and Named Entities partitions of the Children's Book Test (CBT) dataset (Hill et al., 2015).
In the CBT cloze-style task a system is asked to read a children's story context of 20 sentences. The following 21st sentence contains a placeholder token that the system needs to predict, choosing from a given set of 10 candidate words from the document. An example with suggested external knowledge facts is given in Figure 1. While in its Common Nouns setup the task can be considered a language modeling task, Hill et al. (2015) show that humans can answer the questions without the full context with an accuracy of only 64.4%, and a language model alone reaches 57.7%. By contrast, human performance when given the full context is 81.6%. Since the best neural model (Munkhdalai and Yu, 2016) achieves only 72.0% on the task, we hypothesize that the task itself can benefit from external knowledge. The characteristics of the data are shown in Table 1.
Other popular cloze-style datasets such as CNN/Daily Mail (Hermann et al., 2015) or Who-Did-What (Onishi et al., 2016) mainly focus on finding Named Entities, where the benefit of adding commonsense knowledge (as we show for the NE part of CBT) would be more limited.
Knowledge Source. As a source of commonsense knowledge we use the Open Mind Common Sense part of ConceptNet 5.0, which contains 630k fact triples. We refer to this entire source as CN5All. We conduct experiments with subparts of this data: CN5WN3, the WordNet 3 part of CN5All (213k triples), and CN5Sel, which excludes the following WordNet relations: RelatedTo, IsA, Synonym, SimilarTo, HasContext.
Related Work

Integrating Background Knowledge in Neural Models. Integrating background knowledge in a neural model was proposed in the neural-checklist model by Kiddon et al. (2016) for text generation of recipes. They copy words from a list of ingredients instead of inferring the word from a global vocabulary. Ahn et al. (2016) proposed a language model that copies fact attributes from a topic knowledge memory. The model predicts a fact in the knowledge memory using a gating mechanism and, given this fact, the next word to be selected is copied from the fact attributes. The knowledge facts are encoded using embeddings obtained with TransE (Bordes et al., 2013). Yang et al. (2017) extended a seq2seq model with attention to external facts for dialogue and recipe generation and a co-reference resolution-aware language model. A similar model was adopted by He et al. (2017) for answer generation in dialogue. Incorporating external knowledge in a neural model has proven beneficial for several other tasks: Yang and Mitchell (2017) incorporated knowledge directly into the LSTM cell state to improve event and entity extraction. They used knowledge embeddings trained on WordNet (Miller et al., 1990) and NELL (Mitchell et al., 2015) using the BILINEAR (Yang et al., 2014) model.
Work similar to ours is by Long et al. (2017), who introduced the new task of Rare Entity Prediction. The task is to read a paragraph from WikiLinks (Singh et al., 2012) and to fill a blank field in place of a missing entity. Each missing entity is characterized by a short description derived from Freebase, and the system needs to choose one from a set of pre-selected candidates to fill the field. While the task is superficially similar to cloze-style reading comprehension, it differs considerably: first, when considering the text without the externally provided entity information, it is clearly ambiguous. In fact, the task is more similar to Entity Linking tasks in the Knowledge Base Population (KBP) tracks at TAC 2013-2017, which aim at detecting specific entities from Freebase. Our work, by contrast, examines the impact of injecting external knowledge in a reading comprehension, or NLU, task, where the knowledge is drawn from a commonsense knowledge base, ConceptNet in our case. Another difference is that in their setup the reference knowledge for the candidates is explicitly provided as a single, fixed set of knowledge facts (the entity description), encoded in a single representation. In our work, we retrieve (typically) distinct sets of knowledge facts that might (or might not) be relevant for understanding the story and answering the question. Thus, in our setup, we crucially depend on the ability of the attention mechanism to retrieve relevant pieces of knowledge. Our aim is to examine to what extent commonsense knowledge can contribute to and improve the cloze-style RC task, which in principle is supposed to be solvable without explicitly given additional knowledge. We show that by integrating external commonsense knowledge we achieve clear improvements in reading comprehension performance over a strong baseline, and thus we can speculate that humans, when solving this RC task, similarly use commonsense knowledge as implicitly understood background knowledge.

Weissenborn et al. (2017) is driven by similar intentions. The authors exploit knowledge from ConceptNet to improve the performance of a reading comprehension model, experimenting on the recent SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017) datasets. While the source of the background knowledge is the same, the way of integrating this knowledge into the model and task is different: (i) we use attention to select unordered fact triples using key-value retrieval, and (ii) we integrate the knowledge that is considered relevant explicitly for each token in the context. The model of Weissenborn et al. (2017), by contrast, explicitly reads the acquired additional knowledge sequentially after reading the document and question, but transfers the background knowledge implicitly, by refining the word embeddings of the words in the document and the question with the words from the supporting knowledge that share the same lemma. In contrast to the implicit knowledge transfer of Weissenborn et al. (2017), our explicit attention over external knowledge facts can deliver insights about the used knowledge and how it interacts with specific context tokens (see Section 6).

Experiments and Results
We perform quantitative analysis through experiments. We study the impact of the knowledge used and of the different model components that employ the external knowledge. Some of the experiments below focus only on the Common Nouns (CN) dataset, as it has been shown to be more challenging than Named Entities (NE) in prior work.

Model Parameters
We experiment with different model parameters. Hyper-parameters. For our experiments we use pre-trained GloVe (Pennington et al., 2014) embeddings, a BiGRU with hidden size 256, a batch size of 64 and a learning rate of 0.001, as these settings were shown (Kadlec et al., 2016) to perform well for the AS Reader.

Empirical Results
We perform experiments with the different model parameters described above. We report accuracy on the Dev and Test sets and use the results on the Dev set for pruning the experiments.
Knowledge Sources. We experiment with different configurations of ConceptNet facts (see Section 3). Results on the CBT CN dataset are shown in Table 2. CN5Sel works best on the Dev set but CN5WN3 works much better on Test. Further experiments use the CN5Sel setup.

Number of facts.
We further experiment with different numbers of facts on the Common Nouns dataset (Table 3). The best result on the Dev set is obtained with 50 facts, so we use this setting for further experiments.
Component ablations. We ensemble the attentions from different combinations of the interactions between the question and document context (ctx) representations and context+knowledge (ctx+kn) representations in order to infer the right answer (see Section 2.2, Answer Prediction).
Table 4 shows that the combination of different interactions between ctx and ctx+kn representations leads to clear improvement over the w/o knowledge setup, in particular for the Common Nouns dataset. We also performed ablations for a model with 100 facts (see Supplement). Comparison with prior models (accuracy in %):

Model | NE Dev | NE Test | CN Dev | CN Test
MemNNs (Weston et al., 2015) | 70.4 | 66.6 | 64.2 | 63.0
EpiReader (Trischler et al., 2016) | 74.9 | 69.0 | 71.5 | 67.4
GA Reader (Dhingra et al., 2017) | 77.2 | 71.4 | 71.6 | 68.0
IAA Reader (Sordoni et al., 2016) | | | |

Note that our work focuses on the impact of external knowledge and employs a single interaction (single-hop) between the document context and the question, so we primarily compare to and aim at improving over similar models. KnReader clearly outperforms prior single-hop models on both datasets. While we do not improve over the state of the art, our model stands well among models that perform multiple hops. In the Supplement we also give a comparison to ensemble models and some models that use re-ranking strategies.

Key-Value Selection Strategy.
Discussion and Analysis

Analysis of the Empirical Results
Our experiments examined key parameters of the KnReader. As expected, the injection of background knowledge yields only small improvements over the baseline model for Named Entities. However, on this dataset our single-hop model is competitive with most multi-hop neural architectures.
The integration of knowledge clearly helps for the Common Nouns task. The impact of the knowledge sources (Table 2) differs on the Dev and Test sets, which indicates that either the model or the data subsets are sensitive to different knowledge types and retrieved knowledge. Table 5 shows that attending over the Subj of the knowledge triple is slightly better than over the Obj. This shows that using a Key-Value memory is valuable. A reason for the lower performance of Obj/Obj is that the model picks facts that are similar to the candidate tokens, not adding much new information. From the empirical results we see that training and evaluation with fewer facts is slightly better. We hypothesize that this is related to the lack of supervision on the retrieved and attended knowledge.

Qualitative Data Investigation
We use the attention values of the interactions between D ctx(+kn) and Q ctx(+kn), together with the attentions to facts from each candidate token and from the question placeholder, to interpret how knowledge is employed to make a prediction for a single example.

Method: Interpreting Model Components.
We manually inspect examples from the evaluation sets where KnReader improves the prediction (blue (left) category, Fig. 3) or makes the prediction worse (orange (right) category, Fig. 3). Figure 7 shows the question with placeholder, followed by the answer candidates and their associated attention weights as assigned by the model w/o knowledge. The matrix shows selected facts and their assigned weights for the question and the candidate tokens. Finally, we show the attention weights determined by the knowledge-enhanced D to Q interactions. The attention to the correct answer (head) is low when the model considers the text alone (w/o knowledge). When adding retrieved knowledge to Q only (row ctx, ctx + kn) and to both Q and D (row ctx + kn, ctx + kn) the score improves, while when adding knowledge to D alone (row ctx + kn, ctx) the score remains ambiguous. The combined score Ensemble (see Eq. 13) then takes the final decision for the answer. In this example, the question can be answered without the story. The model tries to find knowledge that is related to eyes. The fact eyes /r/PartOf head is not contained in the retrieved knowledge, but instead the model selects the fact ear /r/PartOf head, which receives the highest attention from Q. The weighted Obj representation (head) is added to the question with the highest weight, together with animal and bird from the next most highly weighted facts. This results in a high score for the Q ctx to D ctx+kn interaction with candidate head. See the Supplement for more details.
Using the method described above, we analyze several example cases (presented in the Supplement) that highlight different aspects of our model. Here we summarize our observations. (i) Answer prediction from Q or Q+D. In both human and machine RC, questions can be answered based on the question alone (Figure 7) or jointly with the story context (Case 2, Suppl.). We show empirically that enriching the question with knowledge is crucial for the first type, while enrichment of both Q and D is required for the second.
(ii) Overcoming frequency bias. We show that when appropriate knowledge is available and selected, the model is able to correct a frequency bias towards an incorrect answer (Cases 1 and 3).
(iii) Providing appropriate knowledge. We observe a lack of knowledge regarding events (e.g. take off vs. put on clothes, Case 2; climb up, Case 5). Nevertheless, relevant knowledge from CN5 can help predict infrequent candidates (Case 2).
(iv) Knowledge, Q and D encoding. The context encoding of facts allows the model to detect knowledge that is semantically related, but not surface-near, to phrases in Q and D (Case 2). The model finds facts for non-trivial paraphrases (e.g. undressed-naked, Case 2).

Conclusion and Future Work
We propose a neural cloze-style reading comprehension model that incorporates external commonsense knowledge, building on a single-turn neural model. Incorporating external knowledge improves its results with a relative error rate reduction of 9% on Common Nouns; thus the model is able to compete with more complex RC models. We show that the types of knowledge contained in ConceptNet are useful. We provide quantitative and qualitative evidence of the effectiveness of our model, which learns to select relevant knowledge to improve RC. The attractiveness of our model lies in its transparency and flexibility: due to the attention mechanism, we can trace and analyze the facts considered in answering specific questions. This opens up deeper investigation and future improvement of RC models in a targeted way, allowing us to investigate what knowledge sources are required for different datasets and domains. Since our model directly integrates background knowledge with the document and question context representations, it can be adapted to very different task settings involving a pair of arguments (i.e., entailment, question answering, etc.). In future work, we will investigate even tighter integration of the attended knowledge and stronger reasoning methods.
"#$"%&'" =  *   are lowercased.Following Kadlec et al. (2016) we use multiple unknown tokens (UNK 1 , UNK 2 , . . ., UNK 100 ).In each example, for each unknown word, we pick randomly an unknown token from the list and use it for all occurrences of the word in the document (story) and question.
Word Embeddings. We use GloVe 100D word embeddings pre-trained on 6B tokens from Wikipedia and Gigaword 5 (available at http://nlp.stanford.edu/data/glove.6B.zip). We initialize the out-of-vocabulary words by sampling from a uniform distribution in the range [−0.1, 0.1]. We optimize all word embeddings during the first 8000 training steps.
Encoder Hidden Size. We use a hidden size of 256 for the GRU encoder states (512 output for our bi-directional encoding). This setting has been shown to perform well for the Attention Sum Reader (Kadlec et al., 2016).
Batching, Learning rate, Sampling. We sort the examples in the training set by document length and create batches of 64 examples. For each training step we pick a batch at random. After every 1000 training steps we evaluate the models on the validation (Dev) set. We train for 60 epochs and pick the model with the highest validation accuracy to make the predictions for Test.
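The batching scheme can be sketched as below; the example format (a dict with a `doc` token list) is an assumption for illustration.

```python
import random

# Sketch of length-sorted batching: examples are sorted by document
# length so each batch of 64 contains documents of similar length
# (reducing padding), and the batches are then drawn in random order.

def make_batches(examples, batch_size=64, rng=random):
    """examples: list of dicts with a 'doc' token list (assumed format)."""
    ordered = sorted(examples, key=lambda ex: len(ex["doc"]))
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    rng.shuffle(batches)                   # random batch order per epoch
    return batches
```

Sorting before batching keeps padding overhead low, while shuffling whole batches preserves randomness across training steps.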
Optimization. We use a cross-entropy loss on the predicted scores for each answer candidate. We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001 and clip the gradients to the range [−10, 10].

B Quantitative Analysis Additional Ablation Experiments
Due to space limitations in the main paper, we present additional results here. In addition to the ablation of model components with 50 facts, we perform experiments with 100 facts as well. The results are shown in Table 7. They show a similar tendency, but in this setting a configuration that omits the knowledge enrichment for one of the components yields the best results for the CN data.

Results for Ensemble Models
For each dataset we combine our best 11 runs and use majority voting to predict the answer for our Ensemble model.
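The majority-voting ensemble can be sketched in a few lines; the tie-breaking by first occurrence is an assumption, since the paper does not specify it.

```python
from collections import Counter

# Sketch of the majority-voting ensemble: each of the 11 runs predicts
# one answer candidate per question, and the most frequent prediction
# wins. Ties are broken by the earliest prediction (an assumption).

def majority_vote(predictions):
    """predictions: list of per-run answers for one question."""
    counts = Counter(predictions)
    return max(counts.items(),
               key=lambda kv: (kv[1], -predictions.index(kv[0])))[0]
```

Applied per question, this aggregates the 11 runs into a single ensemble prediction.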
In Table 8 we show the comparison with multi-hop models. We report accuracy on the Dev and Test sets, rounded to the first decimal point as done in previous work. The AoA Reader (Cui et al., 2017) uses re-ranking as a post-processing step, and the other neural models are not directly comparable.
Case 2 The updated representation combines well with the phrase put on, which is an antonym of undressed (clothes /r/Antonym undressed and clothes /r/Antonym naked). A reason for this could also be the high frequency of clothes in the story. However, the example cannot be answered using the story context alone, as it talks about the imaginary, non-existing (air, nothing) new clothes of the king.
The example also shows what kind of knowledge is missing from our currently used resources: ideally, the question could be answered using information from the question alone, by analyzing the meaning of the phrases take off your clothes and then we will put on the new XXXX. If such knowledge were available, the model could exploit the fact that taking off (clothes) and putting on (clothes) are actions often performed in temporal sequence.
Case 3: In Figure 9 we have an example where the model overcomes the frequency bias of the story (magician occurs 4 times) and selects the more plausible answer (father), using a retrieved fact about father.

Case 4: Figure 10 shows an example where a correct initial prediction obtained without knowledge is reversed and a clearly wrong answer is selected instead. Although a relevant fact is selected (people /r/UsedFor help you), the model apparently misses the information that brothers are people and cannot combine the acquired concept help you with the question context and with their help dragged ..., and thus the correct answer is not selected.

In a further example, the context-only neural representation guesses that a plausible answer is similar to cliff (inlet and shores are usually associated with cliff). Again, we are missing knowledge about actions, e.g., about how climbing is done. In future work we plan to experiment with knowledge sources that offer more information about events.

(The corresponding story excerpts are shown in the case figures: the impostors pretending to sew the emperor's invisible clothes, the magician who gave the seed to the father, the three brothers determined to kill the mischievous hawk, and the woman who wakes to find the rope hanging down from the cliff with the clew and belt beside her.)
For each context-encoded token c ctx s_i (s = d, q; i is the token index), we attend over all knowledge facts.3 (Footnote 3: The 0 in w rel_0 indicates that we encode the relation as a single relation-type word, e.g., /r/IsUsedFor.)
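The attention over knowledge facts can be illustrated with a minimal key-value memory read for a single token (a simplified NumPy sketch with dot-product attention; the actual model uses learned encoders for tokens, keys, and values, and the names here are ours):

```python
import numpy as np

def kv_memory_read(token_vec, keys, values):
    """Single key-value memory read for one context-encoded token.

    token_vec: (d,) context-encoded token representation
    keys:      (n_facts, d) encoded fact keys (e.g. the subjects)
    values:    (n_facts, d) encoded fact values (e.g. the objects)
    Returns the attention-weighted sum of the values, which can then be
    added to the token's context representation (ctx + kn).
    """
    scores = keys @ token_vec                  # dot-product matching
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the facts
    return weights @ values                    # weighted sum of values

d, n_facts = 4, 50
rng = np.random.default_rng(0)
token = rng.normal(size=d)
K = rng.normal(size=(n_facts, d))
V = rng.normal(size=(n_facts, d))
enhanced = token + kv_memory_read(token, K, V)  # ctx + kn representation
print(enhanced.shape)  # (4,)
```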

Figure 3 shows the impact on prediction accuracy of the individual components of the Full model, including the interaction between D and Q with ctx or ctx+kn (w/o ctx-only). The values for each component are obtained from the attention weights, without retraining the model. The difference between the blue (left) and orange (right) values indicates how much a module contributes to the model. Interestingly, the ranking of the contributions (D ctx, Q ctx+kn > D ctx+kn, Q ctx > D ctx+kn, Q ctx+kn) corresponds to the component importance ablation on the Dev set, lines 5-8, Table 4.
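A simplified view of how per-component candidate scores combine into the ensemble decision (illustrative only; `ensemble_score` and the uniform summation are our assumptions, not the model's exact weighting):

```python
import numpy as np

def ensemble_score(component_scores):
    """Sum softmax-normalized candidate scores over the D-Q interaction
    components, e.g. (D ctx, Q ctx+kn), (D ctx+kn, Q ctx), etc."""
    total = np.zeros_like(component_scores[0])
    for scores in component_scores:
        e = np.exp(scores - scores.max())
        total += e / e.sum()                 # per-component softmax
    return total

# Three interaction components scoring four answer candidates:
# the second candidate wins the combined vote.
comps = [np.array([2.0, 1.0, 0.5, 0.1]),
         np.array([0.5, 2.5, 0.2, 0.3]),
         np.array([1.5, 1.2, 0.4, 0.2])]
print(int(np.argmax(ensemble_score(comps))))  # 1
```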

Figure 3 :
Figure 3: Number of items with reversed prediction (± correct) for each combination of (ctx+kn, ctx) for Q and D. We report the number of wrong → correct (blue) and correct → wrong (orange) changes when switching from the score w/o knowledge to the score w/ knowledge. The best model type is Ensemble (Full model w/o D ctx, Q ctx).

Figure 4 :
Figure 4: Interpreting the components of KnReader. Adding knowledge to Q and D increases the score for the correct answer. Results for the top 5 candidates are shown. (Full model, CN data, CN5Sel, Subj/Obj, 50 facts)

Figure 5 :
The Knowledgeable Reader combines the plain context and the enhanced (context + knowledge) representations of D and Q with knowledge retrieved from an explicit memory using the key-value approach.

Figure 6 :

Figure 7 :
Figure 7: Case 1: Interpreting the components of KnReader (Full model). Adding retrieved knowledge to Q and D helps the model increase the score for the correct answer. Results for the top 5 candidates are shown. (Subj/Obj as key-value memory, 50 facts, CN5Sel) (Item #357)

Figure 8 :
Figure 8: Case 2: Interpreting the components of KnReader (Full model). Adding retrieved knowledge to Q and D helps the model increase the score for the correct answer. Results for the top 5 candidates are shown. (Subj/Obj as key-value memory, 50 facts, CN5Sel) (Item #52)

Figure 9 :
Figure 9: Case 3: Interpreting the components of KnReader (Full model). Adding retrieved knowledge to Q and D helps the model increase the score for the correct answer. Results for the top 5 candidates are shown. (Subj/Obj as key-value memory, 50 facts, CN5Sel) (Item #240)

Figure 10 :
(Figure content, retrieved facts: fight /r/Antonym peace; listening to the radio /r/Causes noise; noise /r/Causes headaches to get worse; music /r/Antonym noise; noise /r/Antonym music)

Figure 11 :
Figure 11: Case 5: Interpreting the components of KnReader (Full model). Adding retrieved knowledge to Q and D confuses the model and decreases the score for the correct answer. Results for the top 5 candidates are shown. (Subj/Obj as key-value memory, 50 facts, CN5Sel) (Item #187)

Table 1 :
Characteristics of the Children's Book Test datasets. CN: Common Nouns, NE: Named Entities. The cells for Train, Dev, and Test show the overall number of examples and the average story size in tokens.

Table 2 :
Number of facts. We explore knowledge memories of different sizes, in terms of the number of acquired facts. If not stated otherwise, we use 50 facts per example.

Key-Value Selection Strategy. We use two strategies for defining the key and value (Key/Value): Subj/Obj and Obj/Obj, where Subj and Obj are the subject and object attributes of the fact triples and are selected as Key and Value for the KV memory (see Section 2.2, Querying the Knowledge Memory). If not stated otherwise, we use the Subj/Obj strategy.

Answer Selection Components. If not stated otherwise, we use the ensemble attention α ensemble (combinations of ctx and ctx+kn) to rank the answers. We call this our Full model (see Sec. 2.2).

Results with different knowledge sources, for CBT-CN (Full model, 50 facts).
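The two key-value selection strategies can be sketched at the triple level (illustrative; in the model, keys and values are vector encodings rather than raw tokens, and `build_kv_memory` is our name for this step):

```python
def build_kv_memory(facts, strategy="Subj/Obj"):
    """Build (key, value) pairs from (subject, relation, object) triples.

    Subj/Obj: attend over the fact's subject (Key), return its object (Value).
    Obj/Obj:  attend over and return the object.
    """
    memory = []
    for subj, rel, obj in facts:
        if strategy == "Subj/Obj":
            memory.append((subj, obj))
        elif strategy == "Obj/Obj":
            memory.append((obj, obj))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return memory

facts = [("ear", "/r/PartOf", "head"), ("clothes", "/r/Antonym", "naked")]
print(build_kv_memory(facts))             # [('ear', 'head'), ('clothes', 'naked')]
print(build_kv_memory(facts, "Obj/Obj"))  # [('head', 'head'), ('naked', 'naked')]
```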

Table 5 :
Results for key-value knowledge retrieval and integration (CN5Sel, 50 facts). Subj/Obj means: we attend over the fact subject (Key) and take the weighted fact object as the value (Value).

D repr to Q repr interaction      NE Dev   NE Test   CN Dev   CN Test
D ctx, Q ctx (w/o know)           75.50    70.30     68.20    64.80
D ctx+kn, Q ctx+kn                76.45    69.68     70.85    66.32
D ctx, Q ctx+kn                   77.10    69.72     70.80    66.32
D ctx+kn, Q ctx                   75.65    70.88     71.20    67.96

Table 5 shows that for the NE dataset the two strategies perform equally well on the Dev set, whereas the Subj/Obj strategy works slightly better on the Test set. For Common Nouns, Subj/Obj is better.

Table 6 :
Comparison of KnReader to existing end-to-end neural models on the benchmark datasets.
For NE we use (D ctx+kn , Q ctx ) with 100 facts; for CN, the Full model with 50 facts, both with CN5Sel.