Question Answering over Knowledge Base using Factual Memory Networks

In the task of question answering, Memory Networks have recently been shown to be quite effective for complex reasoning and to scale well, in spite of the limited range of topics covered in training data. In this paper, we introduce the Factual Memory Network, which learns to answer questions by extracting and reasoning over relevant facts from a Knowledge Base (KB). Our system generates distributed representations of questions and the KB in the same word vector space, extracts a subset of initial candidate facts, then tries to find a path to the answer entity using multi-hop reasoning and refinement. Additionally, we improve the run-time efficiency of our model using various computational heuristics.


Introduction
Open-domain question answering (Open QA) is a longstanding problem that has been studied for decades. Early systems took an information retrieval approach, where question answering is reduced to returning passages of text containing an answer as a substring. Recent advances in constructing large-scale knowledge bases (KBs) have enabled new systems that return an exact answer from a KB.
A key challenge in Open QA is to be robust to the high variability found in natural language and the many ways of expressing knowledge in large-scale KBs. Another challenge is to link the natural language of questions with the structured semantics of KBs. In this paper, we present a novel architecture based on memory networks (Bordes et al., 2015) that can be trained end-to-end using (question, answer) pairs as the training set, instead of strong supervision in the form of (question, associated facts in KB) pairs.
The major contributions of this paper are twofold. First, we introduce factual memory networks, which are used to answer questions in natural language (e.g., "Where was Bill Gates born?") using facts stored in the form of (subject, predicate, object) triplets in a KB (e.g., ('Bill Gates', 'place of birth', 'Seattle')). We evaluate our system against current baselines on various benchmark datasets. Second, since KBs can be extremely large, making it computationally inefficient to search over all entities and paths, we aim to increase the efficiency of our model in terms of various performance measures and to provide better coverage of relevant facts by intelligently selecting which nodes to expand.

Related Work
The state-of-the-art methods for QA over a knowledge base can be classified into three classes: semantic parsing, information retrieval and embedding based.
Semantic parsing based approaches (Cai and Yates, 2013; Berant et al., 2013; Kwiatkowski et al., 2013; Berant and Liang, 2014; Fader et al., 2014) aim to learn semantic parsers which parse natural language questions into logical forms and then query the knowledge base to look up answers. Even though these approaches are difficult to train at scale because of the complexity of their inference, they tend to provide a deep interpretation of the question.
Information retrieval based systems retrieve a set of candidate answers and then conduct further analysis to rank them. Their main difference lies in how they select correct answers from the candidate set. Yao and Van Durme (2014) used rules to extract question features from the dependency parse of questions, and used relations and properties in the retrieved topic graph as knowledge base features.
Embedding based approaches (Bordes et al., 2014b; Bordes et al., 2014a) learn low-dimensional vectors for words and knowledge base constituents, and use the sum of these vectors to represent questions and candidate answers. However, simple vector addition ignores word order information and higher order n-grams. For example, the question representations of "who killed A?" and "who A killed?" are the same in the vector addition model. Bordes et al. (2015) used a strong supervision signal in the form of supporting facts for a question during training to improve their performance.

Processing FREEBASE
Freebase: Freebase (Bollacker et al., 2008) is a huge and freely available database of factual information, organized as triplets (subject Entity, Relationship, object Entity). All Freebase entities and relationships are typed, and the lexicon for types and relationships is closed. Each entity has an internal id and a set of alternative names (called aliases, e.g. JFK for John Kennedy) that can refer to that entity in text.
The overall structure of Freebase is a hypergraph, in which more than two entities can be linked together in an n-ary fact. The underlying triple storage involves dummy entities to represent such facts, so that the actual entities involved are linked through paths of length 2 instead of 1. For example, a statement like "A starred as character B in movie C" is represented in Freebase as (A, 'starring', dummy entity), (dummy entity, 'movie', C), (dummy entity, 'character', B), where the dummy entity has the same internal id in all three facts.
To obtain direct links between entities in such cases, we modify these facts by removing the dummy entities and using the second relationship as the relationship for the new condensed facts. In our example, we condense the aforementioned facts into two: (A, 'character', B) and (A, 'movie', C). We also label these condensed facts as siblings of each other. Therefore, (A, 'character', B) is a sibling of (A, 'movie', C) and vice versa. Moving forward, we use the term 'fact' to refer to a triplet in Freebase containing actual entities only (no dummy entities).
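The condensation step can be illustrated with a minimal sketch. The function name `condense`, the `is_dummy` predicate, and the in-memory triple list are assumptions made for illustration; they do not describe how Freebase is actually stored or how our preprocessing is implemented.

```python
from collections import defaultdict

def condense(triples, is_dummy):
    """Collapse n-ary facts routed through dummy entities into direct facts.

    triples:  iterable of (subject, relation, object) tuples.
    is_dummy: hypothetical predicate that tells whether an id is a dummy node.
    Returns (facts, siblings): the direct facts, plus a map from each
    condensed fact to the other facts derived from the same dummy node.
    """
    incoming = defaultdict(list)  # dummy id -> subjects that point at it
    outgoing = defaultdict(list)  # dummy id -> (relation, object) leaving it
    facts = []
    for s, r, o in triples:
        if is_dummy(o):
            incoming[o].append(s)
        elif is_dummy(s):
            outgoing[s].append((r, o))
        else:
            facts.append((s, r, o))

    siblings = {}
    for dummy, subjects in incoming.items():
        for s in subjects:
            # The second relationship becomes the relation of the new fact.
            group = [(s, r, o) for r, o in outgoing[dummy]]
            facts.extend(group)
            for f in group:
                siblings[f] = [g for g in group if g != f]
    return facts, siblings
```

With the running example, the three stored triples for "A starred as character B in movie C" reduce to (A, 'movie', C) and (A, 'character', B), each listed as the sibling of the other.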
After the above preprocessing, we represent each entity and relationship in the KB as a vector in $\mathbb{R}^d$. Each such entity/relationship vector is computed as the average of its word vectors, where each word vector is in $\mathbb{R}^d$. In the case of entities, we also include the word vectors of all its aliases when calculating the average. Such a scheme has the additional advantage that we can benefit from pre-trained unsupervised word vectors (e.g. from Word2Vec), which in general capture some distributional syntactic and semantic information.
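As a small illustration of this averaging scheme, the sketch below pools the words of an entity's name and aliases and averages their vectors. The helper names, the lowercase whitespace tokenisation, and the choice to skip out-of-vocabulary words are assumptions, not details taken from our implementation.

```python
def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    d = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]

def embed_entity(names, word_vectors):
    """Average the word vectors of an entity's name and all of its aliases.

    names:        the entity name followed by its aliases (strings).
    word_vectors: dict mapping a lowercase word to its vector (list of floats).
    Words without a vector are skipped (an assumption); at least one word
    must be known, otherwise there is nothing to average.
    """
    words = [w for name in names for w in name.lower().split()]
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("no word of this entity has a word vector")
    return average(known)
```

A relationship such as 'place of birth' can be embedded with the same function by passing its name as the only element of `names`.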

Processing Question
Let a question be given as a sequence of words $(x_1, x_2, \ldots, x_{|q|})$. Each word $x_i$ is mapped to its word vector. We experimented with two different ways to compose the question embedding $q$ out of these word vectors:
Position Encoding (PE): This heuristic takes into account the order of words in the question. The embedding is of the form $q = \sum_{i=1}^{|q|} l_i \odot x_i$, where $\odot$ is element-wise multiplication. For each $i$, $l_i$ is a column vector in $\mathbb{R}^d$ with the structure $l_{ij} = \min\left(\frac{i \times d}{j \times |q|}, \frac{j \times |q|}{i \times d}\right)$, where $d$ is the dimension of the embedding and $j$ runs from 1 to $d$. This type of function ensures that the initial part of the summed vector is weighted higher for initial word vectors and vice versa. Thus, a statement like 'Who is parent of Darth Vader?' will map to a different embedding than a statement like 'Who is Darth Vader parent of?'.
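The position encoding formula can be checked with a short sketch. The function name `position_encoding` and the use of plain Python lists for vectors are illustrative choices only.

```python
def position_encoding(word_vecs):
    """Compose a question embedding q = sum_i l_i * x_i (element-wise).

    word_vecs: list of |q| word vectors, each a list of d floats.
    The weight l_ij = min(i*d / (j*|q|), j*|q| / (i*d)) uses 1-based
    word index i and 1-based dimension index j, as in the text.
    """
    n = len(word_vecs)
    d = len(word_vecs[0])
    q = [0.0] * d
    for i, x in enumerate(word_vecs, start=1):
        for j in range(1, d + 1):
            ratio = (i * d) / (j * n)
            q[j - 1] += min(ratio, 1.0 / ratio) * x[j - 1]
    return q

# Two toy 2-dimensional word vectors: swapping the word order changes q.
a, b = [1.0, 0.0], [0.0, 1.0]
q_ab = position_encoding([a, b])
q_ba = position_encoding([b, a])
```

In this toy case the weight equals 1 when the relative word position i/|q| matches the relative dimension j/d and decays away from that diagonal, so the two word orders yield different embeddings, which a plain sum of word vectors would not.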

Fact Extraction
To begin, we generate an initial list of facts (called the candidate fact list), which is fed as input to our network. To generate this list, we match all possible n-grams of words of the question to aliases of Freebase entities and discard all n-grams (and their matched entities) that are a subsequence of another matched n-gram. All facts having one of the remaining entities as subject are added to the candidate fact list.
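A minimal sketch of this extraction step is given below. It interprets "subsequence of another n-gram" as a word span contained in a longer matched span, which is an assumption; the function names, the `alias_to_entities` dictionary, and the simple lowercase tokenisation are likewise hypothetical.

```python
def candidate_entities(question, alias_to_entities):
    """Match question n-grams against aliases and keep only maximal matches.

    alias_to_entities: dict mapping a lowercase alias to a set of entity ids.
    """
    words = question.lower().rstrip("?").split()
    matches = []  # (start, end) word spans whose text is a known alias
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            if " ".join(words[start:end]) in alias_to_entities:
                matches.append((start, end))

    entities = set()
    for m in matches:
        # Discard a span that lies inside another matched span.
        contained = any(
            o != m and o[0] <= m[0] and m[1] <= o[1] for o in matches
        )
        if not contained:
            entities |= set(alias_to_entities[" ".join(words[m[0]:m[1]])])
    return entities

def candidate_facts(entities, kb_facts):
    """Keep every (s, R, o) fact whose subject is one of the matched entities."""
    return [f for f in kb_facts if f[0] in entities]
```

For "Where was Bill Gates born?", an alias table containing both "bill" and "bill gates" keeps only the entities matched by the longer span.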

Factual Memory Network
An L-Hop Factual Memory Network consists of $L$ Computation Layers $C_1, \ldots, C_L$ connected one after another in a linear chain, and an Output Function. The initial Computation Layer $C_1$ takes as input the candidate fact list (from the previous step) and the initial question embedding (from Section 3.2). Each subsequent layer $C_i$ takes as input the output of layer $C_{i-1}$. The Output Function takes as input the output of the final Computation Layer $C_L$ and generates a list of answers (in the form of Freebase entities).

Computation Layer
A Computation Layer accesses two inputs: 1) a fact list $F$ (our 'memory'), where each fact $f$ is of the form $(s, R, o)$, with $s$ and $o$ being the entity vectors and $R$ the relationship vector, and 2) a question embedding $q \in \mathbb{R}^d$. For each fact $f \in F$, we calculate an unnormalised score $g(f)$ and a normalized score $h(f)$.
We view a fact $f = (s, R, o)$ as a question-answer pair, with $(s, R)$ forming the question and $o$ the answer. Therefore, $g(f)$ calculates the similarity between the given question embedding and a hypothetical question of the form $q' = (s, R)$.
For example, given a question $q$ = "Where was Bill Gates born?" and a fact $f$ = ('Bill Gates', 'place of birth', 'Seattle'), $g(f)$ will compute the similarity between the question $q$ and a hypothetical question $q'$ of the form "Bill Gates place of birth?".
In case a fact $f$ has some siblings $b_1, b_2, \ldots, b_k$ (Section 3.1), we re-calculate its $g(f)$ as follows: where $g(f)$ on the RHS is calculated using Eq. (1). For each sibling $b_i$, we calculate $g(b_i)$ using Eq. (1), but with $b_i$'s subject replaced by its object in the formula.
Continuing with the example in Section 3.1, if $f$ = (A, 'character', B) and $b_1$ = (A, 'movie', C) is a sibling of $f$, then this kind of processing is helpful in answering questions like "What character does A play in movie C?", where fact $f$ alone would not be enough to answer the question. Thus, the above processing corresponds to a hypothetical question of the form "A character C movie ?" for the fact $f$. The normalized score $h(f)$ for a fact $f$ is calculated using a softmax function over all facts to determine their relative importance.
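The scoring step can be sketched as follows. The sketch assumes that $g(f)$ is the dot product $q^T(s + R)$, mirroring the score $h(f)\,q^T(o + R')$ used for newly added facts in the Fact Addition step, and that sibling scores are simply added to $g(f)$. Both forms, and all function names, are assumptions made for illustration rather than a statement of the exact equations.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def g_score(q, fact, siblings=()):
    """Unnormalised score of a fact (s, R, o) given as three vectors.

    Assumed form: g(f) = q . (s + R). Each sibling contributes the same
    expression with its subject replaced by its object, i.e. q . (o_b + R_b).
    """
    s, r, _ = fact
    score = dot(q, add(s, r))
    for _, r_b, o_b in siblings:
        score += dot(q, add(o_b, r_b))
    return score

def softmax(scores):
    """Normalised scores h(f) over all facts (numerically stable softmax)."""
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Under these assumptions, a fact whose subject and relationship vectors align with the question embedding receives most of the softmax mass, and a sibling whose object is mentioned in the question raises the score further.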
Next we modify the fact list and the question embedding on basis of above calculated scores.
Step 1. Fact Pruning: We choose a threshold value and remove facts from fact list with h(f ) < . We found in our experiments that setting = gives best results. Performing pruning seems to remove non-relevant facts from subsequent computation, significantly improving our response time and allowing us to explore larger search space around the subgraphs of our question entities.
Step 2. The question embedding is modified as follows: Each such modification allows us to incorporate knowledge gathered (in form of hypothetical questions (s, R), weighted by their relevance h) at a particular layer about the question and pass it on to subsequent layer.
Step 3. Fact Addition: For each object entity $o$ belonging to a fact $f$, we find all the facts $(o, R', o')$ connected to it in the KB and assign a score $h(f)\, q^T (o + R')$ to each of them. If the new fact's score is > , it is added to the fact list, effectively increasing the path length of our search space by 1.
The modified fact list along with the new question embedding ($q'$) form the outputs of this layer, which are fed as input to the next Computation Layer or the Output Function.
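The three steps of a Computation Layer can be put together in a rough sketch. Several pieces are assumptions: the score form $g(f) = q^T(s + R)$, the question update $q' = q + \sum_f h(f)(s + R)$ over the facts that survive pruning, a single threshold `eps` passed in by the caller, and the hypothetical `emb` and `kb_out` lookup tables. Sibling handling is omitted and the fact list must be non-empty.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def computation_layer(facts, q, emb, kb_out, eps):
    """One hop over a fact list.

    facts:  list of (s, R, o) id triples (the memory).
    q:      question embedding (list of floats).
    emb:    dict id -> vector for entities and relationships (hypothetical).
    kb_out: dict entity id -> facts having that entity as subject (hypothetical).
    eps:    threshold used for both pruning and fact addition (assumed).
    """
    # Scores: assumed g(f) = q . (s + R), then h = softmax(g).
    sums = [[a + b for a, b in zip(emb[s], emb[r])] for s, r, _ in facts]
    g = [dot(q, v) for v in sums]
    top = max(g)
    exps = [math.exp(x - top) for x in g]
    total = sum(exps)
    h = [e / total for e in exps]

    # Step 1: pruning, drop facts with h(f) below the threshold.
    kept = [(f, v, w) for f, v, w in zip(facts, sums, h) if w >= eps]

    # Step 2: question update with relevance-weighted hypothetical questions.
    q_new = list(q)
    for _, v, w in kept:
        q_new = [qk + w * vk for qk, vk in zip(q_new, v)]

    # Step 3: fact addition, expand the object entities of surviving facts.
    new_facts = [f for f, _, _ in kept]
    for (_, _, o), _, w in kept:
        for nf in kb_out.get(o, []):
            rel = emb[nf[1]]
            score = w * dot(q, [a + b for a, b in zip(emb[o], rel)])
            if score > eps and nf not in new_facts:
                new_facts.append(nf)
    return new_facts, q_new
```

An L-hop network would call this function L times, feeding each call the fact list and question embedding returned by the previous one.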

Output Function
The Output Function takes as input the output of the final Computation Layer $C_L$ (i.e. its output fact list and $q'$) and calculates the scores $h(f)$. The answer set is formed by the object entity of the highest scoring fact as well as the object entities of all those facts that have the same $s$ and $R$ as the highest scoring fact.
This simple heuristic can increase the utility of our model when there are multiple correct answers for a question. For example, a question like 'Who is Anakin Skywalker father of?' has more than one answer entity, i.e. ['Leia Organa', 'Luke Skywalker'], and they are all linked by the same $s$ ('Anakin Skywalker') and $R$ ('children') in Freebase. Of course, this relies on the assumption that all such facts survive until this stage and that at least one of them is the highest scoring fact.
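The answer-set heuristic is simple enough to sketch directly. The function name and the representation of facts as id triples with a parallel list of scores are assumptions for illustration.

```python
def answer_set(facts, scores):
    """Collect the objects of all facts sharing (s, R) with the best fact.

    facts:  list of (s, R, o) id triples from the final layer.
    scores: list of h(f) values, aligned with facts.
    """
    best = max(range(len(facts)), key=lambda i: scores[i])
    s, r, _ = facts[best]
    answers = []
    for fs, fr, fo in facts:
        if fs == s and fr == r and fo not in answers:
            answers.append(fo)
    return answers
```

If only one of several correct facts survives pruning, the returned list contains just that fact's object, which is the failure case noted above.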

Training Objectives
Let the QA training set $D$ be a set of question-answer pairs $(q, A)$, where $q$ is the question with a list of correct answers $A$, e.g. ($q$ = 'Who is Anakin Skywalker father of?', $A$ = ['Leia Organa', 'Luke Skywalker']). To train our parameters, we minimize the following loss function: Here $n$ refers to the $n$-th Computation Layer in our network (with $F_n$ as its input fact list) and $L$ is the total number of Computation Layers (hops) in the network. This loss function measures how close the object entities in the fact list of a given layer are to the given answer list, weighted by $h(f)$, by taking pairwise differences between entities in the answer list and objects in the fact list. We observed that minimizing this function gives higher weights to facts whose object entities are similar to the answer entities, and also allows our network to generate shorter paths from the question to the answers.
Paraphrasing Loss: Following previous work such as (Fader et al., 2013; Bordes et al., 2015), we also use the question paraphrase dataset WikiAnswers (http://knowitall.cs.washington.edu/paralex/) to generalize to words and question patterns that are unseen in the training set of question-answer pairs. We minimize a hinge loss so that a question and its paraphrases map to similar representations. For the paraphrase dataset $P$ of tuples $(p, p')$, where $p$ and $p'$ are paraphrases of each other, the loss is defined as: where $p''$ is a random example in $P$ that is not a paraphrase of $p$ and 0.1 is the margin hyper-parameter.
Backpropagation is used to calculate gradients, while Adagrad is used to perform optimisation with max-norm regularisation. At each time step, a sample is drawn from either $P$ with probability 0.25 or $D$ with probability 0.75. If a sample from $P$ is chosen, the gradient of $L_{PP}$ is calculated. Otherwise, the gradient of $L_{QA}$ is calculated. The only parameters optimised in our model are the word vectors.
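The paraphrase objective and the sampling schedule can be sketched as below. The sketch assumes a standard margin ranking form, $\max(0,\ 0.1 - p^T p' + p^T p'')$, with dot-product similarity between question embeddings; this form and the function names are assumptions. Only the margin of 0.1 and the 0.25/0.75 sampling probabilities come from the text.

```python
import random

def paraphrase_hinge(p, p_pos, p_neg, margin=0.1):
    """Hinge loss for one (question, paraphrase, non-paraphrase) triple.

    Assumed form: max(0, margin - p . p_pos + p . p_neg), so the loss is
    zero once the paraphrase is closer than the negative by the margin.
    """
    sim_pos = sum(a * b for a, b in zip(p, p_pos))
    sim_neg = sum(a * b for a, b in zip(p, p_neg))
    return max(0.0, margin - sim_pos + sim_neg)

def pick_source(rng, p_paraphrase=0.25):
    """Choose which dataset the next training sample is drawn from.

    Returns "P" (paraphrases) with probability 0.25, else "D" (QA pairs).
    """
    return "P" if rng.random() < p_paraphrase else "D"

# Example: decide the source of the next update with a seeded generator.
rng = random.Random(0)
source = pick_source(rng)
```

A training loop would then compute the gradient of the paraphrase loss when the source is "P" and of the QA loss otherwise, updating only the word vectors.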

Baselines
We evaluate our model on the following datasets:
WebQuestions: This dataset, introduced in (Berant et al., 2013), contains 5,810 question-answer pairs where the answer can be a list of entities, similar to the $(q, A)$ pairs described before. It was created by crawling questions through the Google Suggest API, and then obtaining answers using Amazon Mechanical Turk. WebQuestions is built on Freebase since all answers are defined as Freebase entities.
On WebQuestions, we evaluate against the following baselines: (Berant et al., 2013; Berant and Liang, 2014; Yih et al., 2015) (semantic parsing based methods), (Fader et al., 2013) (uses a pattern matching scheme), and (Bordes et al., 2014b; Bordes et al., 2014a; Bordes et al., 2015) (embedding based approaches). Results of the baselines have been extracted from the respective papers, except for (Berant et al., 2013; Berant and Liang, 2014), where we use the code provided by the authors to replicate the results. We compare our system in terms of the F1 score as computed by the official evaluation script (Berant et al., 2013), which is the average, over all test questions, of the F1 score of the set of predicted answers.
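The metric described above can be sketched as follows. This is an illustration of an average per-question F1 over answer sets, not the official evaluation script; the function names and the choice to score an empty overlap as 0 are assumptions.

```python
def f1(predicted, gold):
    """F1 between a predicted answer set and a gold answer set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_f1(predictions, golds):
    """Mean of the per-question F1 scores over all test questions."""
    return sum(f1(p, g) for p, g in zip(predictions, golds)) / len(golds)
```

Returning only one of two correct answers gives a precision of 1 and a recall of 0.5 for that question, which is why the answer-set heuristic of the Output Function matters for this metric.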
SimpleQuestions: The SimpleQuestions dataset, introduced in (Bordes et al., 2015), consists of a total of 108,442 questions written in natural language by human English-speaking annotators, each paired with a corresponding Freebase fact. Our model only uses the answer entity during training, instead of the whole fact. For example, {q = 'Which forest is Fires Creek in?', Fact = '(fires creek, contained by, nantahala national forest)'} could be a data point in SimpleQuestions, but we only use {q = 'Which forest is Fires Creek in?', A = ['nantahala national forest']} for training.
On SimpleQuestions, we evaluate against previous result (Bordes et al., 2015) in terms of path-level accuracy, in which a prediction is correct if the subject and the relationship of highest scoring fact were correctly retrieved by the system.

Experimental Setup
The current dump of Freebase data was downloaded and processed as described before. Our data contained 1.9B triplets. We used the following splits of each evaluation dataset for training, validation and testing, the same as (Bordes et al., 2015):
WebQuestions (WQ): [3000, 778, 2032]
SimpleQuestions (SQ): [75910, 10845, 21687]
We also train on automatic questions generated from the KB, which are essential to learn embeddings for the entities not appearing in either WebQuestions or SimpleQuestions. We generated one training question per fact following the same process as that used in (Bordes et al., 2014a).
The embedding dimension $d$ was set to 64 and the max-norm cutoff to 4, both chosen using the validation dataset. We pre-train our word vectors using the method described by (Wang et al., 2014) to initialize our embeddings.
We experimented with variations of our model on both test sets. Specifically, we analyze the effect of question encoding (PE vs BOW), the number of hops, and the inclusion of pruning/fact-additions (P/FA) in our model. In the subsequent section, the word 'significant' implies that the results were statistically significant (p < 0.05) under a paired t-test.

Results
The results of our experiments are presented in Table 1. It shows that our best model outperforms the considered baselines by about 3% on WebQuestions and is comparable to previous results on SimpleQuestions. Note that the best performing system for SimpleQuestions used strong supervision (questions with supporting facts), while our model used only the answer entities associated with a question for training.

Table 1
Setup                      WQ (F1)   SQ (Acc)
Random Guess               1.9       4.9
(Berant et al., 2013)      31.3      n/a
(Bordes et al., 2014a)     39.2      n/a
(Berant and Liang, 2014)   39.9      n/a
(Yih et al., 2015)         52.5      n/a
(Bordes et al., 2015)      42

We also give the performance for the variations of our model. Position Encoding improves our performance by 7% on WQ and by 5% on SQ, validating our choice of heuristic. Also, most answers in both datasets can be found within a path length of two from the candidate fact list, thus a 3-Hop network shows only a 2% improvement over a 2-Hop network.
In Table 2, we present the top-k results on both datasets. A large majority of questions can be answered from the top two or three candidates. By providing these alternative results (in addition to the top-ranked candidate) to the user, many questions can be answered correctly.

Efficiency and Error Analysis
Efficiency: All experiments for this paper were done on an Intel Core i5 CPU @ 2.60GHz with 8GB RAM and an average HDD access time of 12.3 ms. We calculated the Average Response Time per Query (ART/Q), defined as the average time taken by the system from the input of a query until the generation of the answer list, including both computation time and search-and-access time for the KB. ART/Q for the 1, 2 and 3 Hop networks was 200, 350 and 505 ms respectively. The training time (including the time to search for hyper-parameters) for each of these networks was 740, 1360 and 2480 min respectively.
The major bottleneck in ART/Query for our network was the search-and-access of large amount of KB Data, therefore we implemented efficient search procedures using Prefix Trees for String Matching and pipelined different stages of the model using multi-threading (i.e. Fact Extraction, Computation and Back-Propagation were performed on individual threads) to improve our response and training time.
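As an illustration of prefix-tree matching for entity aliases, the sketch below stores aliases word by word and returns the longest alias starting at a given position in the question. The class name, the word-level (rather than character-level) trie, and the use of `None` as the key holding entity ids are assumptions, not a description of our implementation.

```python
class AliasTrie:
    """Word-level prefix tree mapping aliases to entity ids."""

    _IDS = None  # reserved dict key for ids; never collides with a word string

    def __init__(self):
        self.root = {}

    def insert(self, alias, entity_id):
        node = self.root
        for word in alias.lower().split():
            node = node.setdefault(word, {})
        node.setdefault(self._IDS, set()).add(entity_id)

    def longest_match(self, words, start):
        """Return (end, entity_ids) for the longest alias at words[start:end].

        Returns None when no alias starts at this position.
        """
        node = self.root
        best = None
        for end in range(start, len(words)):
            node = node.get(words[end])
            if node is None:
                break
            if self._IDS in node:
                best = (end + 1, node[self._IDS])
        return best
```

Each lookup walks at most as many nodes as the longest alias has words, so scanning a question costs time proportional to its length times the maximum alias length, independent of the number of aliases stored.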
Effect of Pruning/Fact-Additions: From Table 1, we can see that pruning and fact-additions have significantly improved scores on all datasets. We analyzed 200 random data points from the set of examples that were correctly answered only when P/FA was used (for each test set). In 97.5% of these samples, we observed that pruning allowed our model to remove spurious facts generated during initial fact extraction, making the softmax calculation in Eq. (3) more robust.
We also observed that the set of questions correctly answered by the model without P/FA was a proper subset of the set answered with P/FA on each evaluation dataset, signifying that if relevant facts were scored higher in previous Computation Layers, their scores are not reduced as more facts are added in subsequent layers. Removing pruning alone didn't improve our performance by more than 0.4% on any dataset while exponentially increasing the response time, signifying that pruning itself didn't remove the relevant/correct fact in the majority of examples.
On the computational end, we tried to determine the effect of pruning on our model response time (excluding search and access time). Including pruning improved our ART/Q by approximately 44% during test phase and by 21% during training phase for our 3 Hop network.
Manual Error Analysis: We sampled 100 examples from each test set to identify the major sources of errors made by our model. The following classes of errors were identified:
Complex Questions (55%): These types of questions involved a temporal or superlative qualifier like 'first', 'after', 'largest', etc. This problem occurred in both test sets. We may be able to solve this problem using a small set of rules for comparison on the final answer set, or better semantic representations for numerical values and qualifiers.
Question Ambiguity (20%): This error class contains those questions that may have ambiguity in their interpretation. For example, a question like 'Where is shakira from?' generated the answer for 'place of birth' (Barranquilla) while the ground truth is 'nationality' (Colombia). This occurred mostly in the WebQuestions dataset.
Ground Truth Inconsistency (10%): This type of error occurred when the ground truth differed from the correct entity present in the Freebase KB (even though both are correct in many cases). For example, the question 'Where did eleanor roosevelt died?' has the ground truth 'New York City' whereas the KB delivers the entity 'Manhattan', even though both are entities in Freebase. It occurred only in the WebQuestions dataset.
Miscellaneous (15%): This error class contains bad entity/relationship extraction (for example, mapping Anakin Skywalker to Darth Vader), bad question/answer pairs (e.g. q = "what time does american horror story air?" A = [Tom Selleck]), typos in questions, etc.