Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks

Existing question answering methods infer answers either from a knowledge base or from raw text. While knowledge base (KB) methods are good at answering compositional questions, their performance is often limited by the incompleteness of the KB. In contrast, web text contains millions of facts that are absent from the KB, albeit in an unstructured form. Universal schema can support reasoning on the union of structured KBs and unstructured text by aligning them in a common embedded space. In this paper we extend universal schema to natural language question answering, employing memory networks to attend to the large body of facts in the combination of text and KB. Our models can be trained in an end-to-end fashion on question-answer pairs. Evaluation results on the SPADES fill-in-the-blank question answering dataset show that exploiting universal schema for question answering is better than using either a KB or text alone. This model also outperforms the current state-of-the-art by 8.5 F1 points.


Introduction
Question Answering (QA) has been a longstanding goal of natural language processing. Two main paradigms evolved in solving this problem: 1) answering questions on a knowledge base; and 2) answering questions using text.
Knowledge bases (KBs) contain facts expressed in a fixed schema, facilitating compositional reasoning. They have attracted research ever since the early days of computer science, e.g., BASEBALL (Green Jr et al., 1961). This line of work matured into learning semantic parsers from parallel question and logical form pairs (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005), and more recently into methods that scale to very large KBs like Freebase using question and answer pairs (Berant et al., 2013). However, a major drawback of this paradigm is that KBs are highly incomplete (Dong et al., 2014). It is also an open question whether KB relational structure is expressive enough to represent world knowledge (Stanovsky et al., 2014; Gardner and Krishnamurthy, 2017).
The paradigm of exploiting text for questions started in the early 1990s (Kupiec, 1993). With the advent of the web, access to text resources became abundant and cheap. Initiatives like the TREC QA competitions helped popularize this paradigm (Voorhees et al., 1999). With the recent advances in deep learning and the availability of large public datasets, there has been an explosion of research in a very short time (Rajpurkar et al., 2016; Trischler et al., 2016; Nguyen et al., 2016; Wang and Jiang, 2016; Lee et al., 2016; Xiong et al., 2016; Seo et al., 2016; Choi et al., 2016). Still, text representation is unstructured and does not allow the compositional reasoning that a structured KB supports.
An important but under-explored QA paradigm is one where KB and text are exploited together (Ferrucci et al., 2010). Such a combination is attractive because text contains millions of facts not present in the KB, and a KB's generative capacity represents an infinite number of facts that are never seen in text. However, QA inference on this combination is challenging due to the structural non-uniformity of KB and text. Distant supervision methods (Bunescu and Mooney, 2007; Mintz et al., 2009; Zeng et al., 2015) address this problem partially by aligning text patterns with the KB. But the rich and ambiguous nature of language allows a fact to be expressed in many different forms, which these models fail to capture.
Universal schema (Riedel et al., 2013) avoids the alignment problem by jointly embedding KB facts and text into a uniform structured representation, allowing interleaved propagation of information. Figure 1 shows a universal schema matrix which has pairs of entities as rows, and Freebase and textual relations as columns. Although universal schema has been extensively used for relation extraction, this paper shows its applicability to QA. Consider the question "USA has elected _blank_, our first african-american president" with its answer Barack Obama. While Freebase has a predicate for representing presidents of the USA, it does not have one for 'african-american' presidents. In text, by contrast, we find many sentences describing the presidency of Barack Obama and his ethnicity at the same time. Exploiting both KB and text makes it easier to answer this question than relying on only one of these sources.
Memory networks (MemNN) are a class of neural models which have an external memory component for encoding short and long term context. In this work, we define the memory components as observed cells of the universal schema matrix, and train an end-to-end QA model on question-answer pairs. The contributions of the paper are as follows: (a) we show that the universal schema representation is a better knowledge source for QA than either KB or text alone; (b) on the SPADES dataset (Bisk et al., 2016), which contains real-world fill-in-the-blank questions, we outperform the state-of-the-art semantic parsing baseline by 8.5 F1 points; and (c) our analysis shows how each data source compensates for the weaknesses of the other, thereby improving overall performance.

Background
Problem Definition: Given a question q with words w_1, w_2, ..., w_n, where these words contain one blank and at least one entity, our goal is to fill in this blank with an answer entity q_a using a knowledge base K and text T. A few example question-answer pairs are shown in Table 2.
Universal Schema: Traditionally, universal schema is used for relation extraction in the context of knowledge base population. Rows in the schema are formed by entity pairs, e.g., (USA, NYC), and columns represent the relations between them. A relation can either be a KB relation or a pattern of text that exists between these two entities in a large corpus. The embeddings of entities and relation types are learned by low-rank matrix factorization techniques. Riedel et al. (2013) treat textual patterns as static symbols, whereas recent work by Verga et al. (2016) replaces them with distributed representations of sentences obtained by an RNN. Using distributed representations allows reasoning on sentences that are similar in meaning but different in surface form. We too use this variant to encode our textual relations.
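As a rough illustration of the matrix layout described above, the observed cells can be held in a sparse mapping from entity pairs (rows) to the set of relations (columns) seen for that pair. This is only a sketch: the relation names and the `kb:` / `text:` prefixes are hypothetical labels chosen for the example, not identifiers from Freebase or from the authors' code.

```python
from collections import defaultdict

# Sparse universal schema matrix: row = (subject, object) entity pair,
# column = a KB relation or a textual pattern. Only observed cells are stored.
schema = defaultdict(set)

def observe(subject, relation, obj):
    """Mark the cell (entity pair, relation) as observed."""
    schema[(subject, obj)].add(relation)

# Hypothetical KB column and textual column for the same entity pair.
observe("USA", "kb:president", "Barack Obama")
observe("USA", "text:arg1 has elected arg2 , our first african-american president",
        "Barack Obama")

# Materialize one row as a 0/1 vector over all known columns.
columns = sorted({rel for rels in schema.values() for rel in rels})
row = [int(col in schema[("USA", "Barack Obama")]) for col in columns]
```

Both the KB column and the textual column are filled for the same row, which is what lets information propagate between the two sources once rows and columns are embedded.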
Memory Networks: MemNNs are neural attention models with external and differentiable memory. MemNNs decouple the memory component from the network, thereby allowing it to store external information. Previously, these have been successfully applied to question answering on KBs, where the memory is filled with distributed representations of KB triples, or to reading comprehension (Sukhbaatar et al., 2015; Hill et al., 2016), where the memory consists of distributed representations of the sentences in the passage. Recently, key-value MemNNs were introduced (Miller et al., 2016), where each memory slot consists of a key and a value. The attention weight is computed only by comparing the question with the key memory, whereas the value is used to compute the contextual representation to predict the answer. We use this variant of MemNN for our model. Miller et al. (2016), in their experiments, store either KB triples or sentences as memories, but they do not explicitly model multiple memories containing distinct data sources as we do.

Model
Our model is a MemNN with universal schema as its memory. Figure 1 shows the model architecture.
Memory: Our memory M comprises both KB and textual triples from universal schema. Each memory cell takes the form of a key-value pair. Let (s, r, o) ∈ K represent a KB triple. We represent this fact with a distributed key k ∈ R^{2d} formed by concatenating the embeddings s ∈ R^d and r ∈ R^d of subject entity s and relation r, respectively. The embedding o ∈ R^d of object entity o is treated as its value v.
Let (s, [w_1, ..., arg_1, ..., arg_2, w_n], o) ∈ T represent a textual fact, where arg_1 and arg_2 correspond to the positions of the entities 's' and 'o'. We represent the key as the sequence formed by replacing arg_1 with 's' and arg_2 with a special '_blank_' token, i.e., k = [w_1, ..., s, ..., _blank_, w_n], and the value as just the entity 'o'. We convert k to a distributed representation using a bidirectional LSTM (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005), where k ∈ R^{2d} is formed by concatenating the last states of the forward and backward LSTMs, i.e., k = [\overrightarrow{LSTM}(k); \overleftarrow{LSTM}(k)]. The value v is the embedding of the object entity o. Projecting both KB and textual facts to R^{2d} offers a unified view of the knowledge to reason upon. In Figure 1, each cell in the matrix represents a memory containing the distributed representation of its key and value.
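The two kinds of memory cell can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are random toy vectors with d = 4, the relation name `president_of` is hypothetical, and the bidirectional LSTM that turns the textual key sequence into a vector in R^{2d} is deliberately left out, so only the token-level key construction is shown for textual facts.

```python
import numpy as np

d = 4  # toy embedding size for illustration only
rng = np.random.default_rng(0)

# Toy embedding tables (random stand-ins, not trained embeddings).
entity_emb = {e: rng.normal(size=d) for e in ["USA", "Barack Obama"]}
relation_emb = {"president_of": rng.normal(size=d)}  # hypothetical relation name

def kb_memory_cell(s, r, o):
    """KB fact (s, r, o): key = [s; r] in R^{2d}, value = embedding of o in R^d."""
    key = np.concatenate([entity_emb[s], relation_emb[r]])
    value = entity_emb[o]
    return key, value

def textual_key_tokens(tokens, arg1_pos, arg2_pos, s, blank="_blank_"):
    """Textual fact: put the subject entity at arg1 and a blank token at arg2.

    The resulting sequence would then be encoded by a BiLSTM (not shown).
    """
    key = list(tokens)
    key[arg1_pos] = s
    key[arg2_pos] = blank
    return key

key, value = kb_memory_cell("USA", "president_of", "Barack Obama")
text_key = textual_key_tokens(["arg1", "has", "elected", "arg2"], 0, 3, "USA")
```

Because both key types end up in R^{2d}, a single attention mechanism can score KB and textual cells side by side.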
Question Encoder: A bidirectional LSTM is also used to encode the input question q to a distributed representation q ∈ R^{2d}, similar to the key encoding step above.
Attention over cells: We compute the attention weight of a memory cell by taking the dot product of its key k with a contextual vector c that encodes the most important context in the current iteration. In the first iteration, the contextual vector is the question itself. We only consider the memory cells that contain at least one entity in the question. For example, for the input question in Figure 1, we only consider memory cells containing USA. Using the attention weights and values of memory cells, we compute the context vector c_t for the next iteration t as follows:
c_t = W_t \left( c_{t-1} + W_p \sum_{(k,v) \in M} (c_{t-1} \cdot k) \, v \right)
where c_0 is initialized with the question embedding q, W_p is a projection matrix, and W_t is the weight matrix that combines the context from the previous hop with the values in the current iteration according to their importance (attention weight). This multi-iterative context selection allows multi-hop reasoning without explicitly requiring a symbolic query representation.
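One hop of this attention can be sketched in NumPy. The sketch assumes the update takes the form c_t = W_t (c_{t-1} + W_p Σ (c_{t-1} · k) v) with unnormalized dot-product weights, which is consistent with the definitions of W_p and W_t given here; the function name, argument layout, and shapes are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def memory_hop(c_prev, keys, values, W_p, W_t):
    """One attention hop over the selected memory cells.

    c_prev : (2d,)     context vector from the previous hop (c_0 = question)
    keys   : (m, 2d)   key vectors of the m candidate memory cells
    values : (m, d)    value vectors (object entity embeddings)
    W_p    : (2d, d)   projects the attended value into the context space
    W_t    : (2d, 2d)  hop-specific weight matrix
    """
    weights = keys @ c_prev        # dot product of each key with the context
    attended = weights @ values    # attention-weighted sum of values, shape (d,)
    return W_t @ (c_prev + W_p @ attended)

# Tiny worked example with d = 1 and two memory cells.
c0 = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[2.0], [3.0]])
W_p = np.array([[1.0], [1.0]])
W_t = np.eye(2)
c1 = memory_hop(c0, keys, values, W_p, W_t)
```

Calling `memory_hop` repeatedly, feeding each output back in as `c_prev`, gives the multi-hop behaviour described above; restricting `keys` and `values` to cells that mention a question entity happens before the call.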
Answer Entity Selection: The final contextual vector c_t is used to select the answer entity q_a (among all 1.8M entities in the dataset) which has the highest inner product with it.
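The selection step amounts to an argmax over inner products. The sketch below assumes the candidate entity vectors have already been brought into the same space as the context vector; how the entity and context dimensions are reconciled is not spelled out here, so that part is an assumption, as are the function and variable names.

```python
import numpy as np

def select_answer(c_final, candidate_vectors, entity_names):
    """Return the entity whose vector has the highest inner product with c_final.

    c_final           : (n,)    final context vector
    candidate_vectors : (E, n)  one row per candidate entity (assumed same space)
    entity_names      : list of E entity identifiers
    """
    scores = candidate_vectors @ c_final
    return entity_names[int(np.argmax(scores))]

# Toy example with three hypothetical candidates.
c_final = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
answer = select_answer(c_final, candidates, ["A", "B", "C"])
```

With 1.8M candidate entities the matrix-vector product is a single dense operation, which is why scoring every entity stays tractable.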

Evaluation Dataset
We use Freebase (Bollacker et al., 2008) as our KB, and ClueWeb (Gabrilovich et al., 2013) as our text source to build universal schema. For evaluation, the literature offers two options: 1) datasets for text-based question answering tasks such as answer sentence selection and reading comprehension; and 2) datasets for KB question answering.
Although the text-based question answering datasets are large, e.g., SQuAD (Rajpurkar et al., 2016) has over 100k questions, the answers to these are often not entities but rather sentences, which are not the focus of our work. Moreover, these texts may not contain Freebase entities at all, making these datasets heavily skewed towards text. Turning to the alternative option, WebQuestions (Berant et al., 2013) is widely used for QA on Freebase. This dataset is curated such that all questions can be answered on Freebase alone. But since our goal is to explore the impact of universal schema, testing on a dataset completely answerable on a KB is not ideal. The WikiMovies dataset (Miller et al., 2016) has similar properties. Gardner and Krishnamurthy (2017) created a dataset with motivations similar to ours, but it had not been publicly released at submission time. Instead, we use SPADES (Bisk et al., 2016) as our evaluation data, which contains fill-in-the-blank cloze-style questions created from ClueWeb. This dataset is well suited to testing our hypothesis for the following reasons: 1) it is large, with 93K sentences and 1.8M entities; and 2) since the sentences are collected from the Web, most of them are natural. A limitation of this dataset is that it contains only sentences whose entities are connected by at least one relation in Freebase, making it skewed towards Freebase, as we will see (§ 4.4). We use the standard train, dev and test splits for our experiments. For the text part of universal schema, we use the sentences present in the training set.

Models
We evaluate the following models to measure the impact of different knowledge sources for QA.
ONLYKB: In this model, the MemNN memory contains only the facts from the KB. For each KB triple (e_1, r, e_2), we have two memory slots, one for (e_1, r, e_2) and the other for its inverse (e_2, r^i, e_1).
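Generating the two slots per KB triple is straightforward; a minimal sketch follows. The `_inv` suffix used to name the inverse relation and the relation name in the example are hypothetical conventions for this illustration.

```python
def kb_memory_slots(triples, inverse_suffix="_inv"):
    """Return two memory slots per KB triple: the triple and its inverse."""
    slots = []
    for e1, r, e2 in triples:
        slots.append((e1, r, e2))
        # Inverse direction, so the fact is reachable from either entity.
        slots.append((e2, r + inverse_suffix, e1))
    return slots

slots = kb_memory_slots([("USA", "president", "Barack Obama")])
```

Storing the inverse matters because cells are retrieved by the entities mentioned in the question, and the question may mention either side of the relation.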
ONLYTEXT: SPADES contains sentences with blanks. We replace the blank tokens with the answer entities to create textual facts from the training set. Using every pair of entities, we create a memory cell as in universal schema.
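The conversion from a cloze sentence to textual facts can be sketched as below. Several details are assumptions made for the illustration: entity mentions are given as a position-to-entity mapping, every ordered pair of entities yields one fact, and the blank token is written `_blank_`.

```python
from itertools import permutations

BLANK = "_blank_"

def textual_facts(tokens, entity_positions, answer):
    """Turn one cloze sentence into (subject, key_tokens, object) facts.

    tokens           : sentence tokens containing exactly one BLANK
    entity_positions : {token index: entity} for the other entities
    answer           : the entity that fills the blank
    """
    blank_pos = tokens.index(BLANK)
    filled = list(tokens)
    filled[blank_pos] = answer            # restore the full sentence
    positions = dict(entity_positions)
    positions[blank_pos] = answer
    facts = []
    # Assumption: one fact per ordered pair of entity mentions.
    for (i, s), (j, o) in permutations(sorted(positions.items()), 2):
        key = list(filled)
        key[i] = s                        # subject stays visible in the key
        key[j] = BLANK                    # object is hidden; it is the value
        facts.append((s, key, o))
    return facts

facts = textual_facts(["USA", "has", "elected", BLANK], {0: "USA"}, "Barack_Obama")
```

Each resulting key is the token sequence that the key encoder would consume, with the object entity kept aside as the value.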
ENSEMBLE: This is an ensemble of the above two models. We use a linear model that combines the scores from the individual models.
UNISCHEMA: This is our main model with universal schema as its memory, i.e., it contains memory slots corresponding to both KB and textual facts.

Implementation Details
Table 2: A few questions that ONLYKB fails to answer but UNISCHEMA answers correctly.
Question | Answer
1. USA have elected _blank_, our first african-american president. | Obama
2. Angelina has reportedly been threatening to leave _blank_. | Brad Pitt
3. Spanish is more often a second and weaker language among many _blank_. | Latinos
4. _blank_ is the third largest city in the United States. | Chicago
5. _blank_ was Belshazzar's father. | Nabonidus

The dimensions of word, entity and relation embeddings, and LSTM states were set to d = 50. The word and entity embeddings were initialized with word2vec (Mikolov et al., 2013) trained on 7.5 million ClueWeb sentences containing entities in the Freebase subset of SPADES. The network weights were initialized using Xavier initialization (Glorot and Bengio, 2010). We considered up to a maximum of 5k KB facts and 2.5k textual facts for a question. We used Adam (Kingma and Ba, 2015) with the default hyperparameters (learning rate = 1e-3, β_1 = 0.9, β_2 = 0.999, ε = 1e-8) for optimization. To overcome exploding gradients, we restricted the magnitude of the ℓ2 norm of the gradient to 5. The batch size during training was set to 32.
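Two training details from this section, gradient-norm clipping and the batch-normalization-like rescaling of textual memories used when training UNISCHEMA, can be sketched in NumPy. This is an illustration under assumptions: the clipping is applied to the global ℓ2 norm over all gradients, and the rescaling uses per-dimension statistics over the memories in a minibatch. Neither choice is confirmed by the text, and the function names are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so that their joint l2 norm does not exceed max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

def match_moments(text_mem, kb_mem, eps=1e-5):
    """Normalize textual memories, then scale and shift them to KB statistics.

    text_mem : (n_text, dim) textual memory vectors in the minibatch
    kb_mem   : (n_kb, dim)   KB memory vectors in the minibatch
    """
    t_mean, t_std = text_mem.mean(axis=0), text_mem.std(axis=0)
    k_mean, k_std = kb_mem.mean(axis=0), kb_mem.std(axis=0)
    normalized = (text_mem - t_mean) / (t_std + eps)
    return normalized * k_std + k_mean

rng = np.random.default_rng(0)
clipped = clip_by_global_norm([np.array([6.0, 8.0])])   # norm 10 -> norm 5
text_mem = rng.normal(size=(8, 3)) * 3.0 + 1.0
kb_mem = rng.normal(size=(10, 3))
rescaled = match_moments(text_mem, kb_mem)
```

After `match_moments`, the textual memories have approximately the same mean and spread as the KB memories, so neither source dominates the dot-product attention purely because of scale.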
To train the UNISCHEMA model, we initialized the parameters from a trained ONLYKB model. We found that this is crucial in making UNISCHEMA work. Another caveat is the need to employ a trick similar to batch normalization (Ioffe and Szegedy, 2015). For each minibatch, we normalize the mean and variance of the textual facts and then scale and shift them to match the mean and variance of the KB memory facts. Empirically, this stabilized the training and gave a boost in the final performance.

Results and Discussions
Table 1 shows the main results on SPADES. UNISCHEMA outperforms all our other models, validating our hypothesis that exploiting universal schema for QA is better than using either KB or text alone. Despite the SPADES creation process being friendly to Freebase, exploiting text still provides a significant improvement. Table 2 shows some of the questions which UNISCHEMA answered but ONLYKB failed to answer. These can be broadly classified into (a) relations that are not expressed in Freebase (e.g., african-american presidents in sentence 1); (b) intentional facts, since curated databases only represent concrete facts rather than intentions (e.g., threatening to leave in sentence 2); (c) comparative predicates like first, second, largest, smallest (e.g., sentences 3 and 4); and (d) providing additional type constraints (e.g., in sentence 5, Freebase does not have a special relation for father; it can be expressed using the relation parent along with the type constraint that the answer is of gender male).
Table 3: Detailed results on SPADES.
Both ONLYKB and ONLYTEXT got it correct | 18.5
ONLYKB got it correct and ONLYTEXT did not | 20.6
ONLYTEXT got it correct and ONLYKB did not | 6.80
Both UNISCHEMA and ONLYKB got it correct | 34.6
UNISCHEMA got it correct and ONLYKB did not | 6.42
ONLYKB got it correct and UNISCHEMA did not | 4.47
Both UNISCHEMA and ONLYTEXT got it correct | 19.2
UNISCHEMA got it correct and ONLYTEXT did not | 21.9
ONLYTEXT got it correct and UNISCHEMA did not | 6.09
We have also analyzed the nature of UNISCHEMA attention. In 58.7% of the cases the attention tends to prefer KB facts over text. This is as expected, since KB facts are more concrete and accurate than text. In 34.8% of cases, the memory prefers to attend to text even if the fact is already present in the KB. For the rest (6.5%), the memory distributes attention weight evenly, indicating that for some questions part of the evidence comes from text and part of it from the KB. Table 3 gives a more detailed quantitative analysis of the three models in comparison with each other.
To see how reliable UNISCHEMA is, we gradually increased the coverage of the KB by allowing only a fixed number of randomly chosen KB facts for each entity. As Figure 2 shows, when the KB coverage is less than 16 facts per entity, UNISCHEMA outperforms ONLYKB by a wide margin, indicating that UNISCHEMA is robust even in resource-scarce scenarios, whereas ONLYKB is very sensitive to the coverage. UNISCHEMA also outperforms ENSEMBLE, showing that joint modeling is superior to an ensemble of the individual models. We also achieve the state-of-the-art with a difference of 8.5 F1 points. Bisk et al. (2016) use graph matching techniques to convert natural language to Freebase queries, whereas we outperform them even without an explicit query representation.
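The KB-coverage experiment described above relies on keeping only a fixed number of randomly chosen facts per entity. A minimal sketch of such a subsampling step follows; it assumes facts are grouped by their subject entity and uses a fixed seed for repeatability, both of which are choices made for this illustration.

```python
import random
from collections import defaultdict

def limit_kb_coverage(triples, max_facts_per_entity, seed=0):
    """Keep at most max_facts_per_entity randomly chosen facts per subject entity."""
    rng = random.Random(seed)
    by_entity = defaultdict(list)
    for triple in triples:
        by_entity[triple[0]].append(triple)   # group by subject entity
    kept = []
    for entity in sorted(by_entity):          # sorted for a deterministic order
        facts = by_entity[entity]
        k = min(max_facts_per_entity, len(facts))
        kept.extend(rng.sample(facts, k))
    return kept

# Toy KB: entity "A" has five facts, entity "B" has one.
toy_kb = [("A", "r%d" % i, "X%d" % i) for i in range(5)] + [("B", "r", "Y")]
reduced = limit_kb_coverage(toy_kb, max_facts_per_entity=2)
```

Sweeping `max_facts_per_entity` upward and re-evaluating each model on the reduced KB would trace out a coverage curve of the kind shown in Figure 2.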

Related Work
A majority of the QA literature that focused on exploiting KB and text either improves the inference on the KB using text based features (Krishnamurthy and Mitchell, 2012;Reddy et al., 2014;Joshi et al., 2014;Yao and Van Durme, 2014;Neelakantan et al., 2015b;Guu et al., 2015;Xu et al., 2016b;Choi et al., 2015;Savenkov and Agichtein, 2016) or improves the inference on text using KB (Sun et al., 2015).
Limited work exists on exploiting text and KB jointly for question answering. The closest work to ours is Gardner and Krishnamurthy (2017), who generate an open-vocabulary logical form and rank candidate answers by how likely they are to occur with this logical form in both Freebase and text. Our models are trained on a weaker supervision signal, without requiring the annotation of logical forms.
A few QA methods infer on curated databases combined with OpenIE triples (Fader et al., 2014;Yahya et al., 2016;Xu et al., 2016a). Our work differs from them in two ways: 1) we do not need an explicit database query to retrieve the answers (Neelakantan et al., 2015a;Andreas et al., 2016); and 2) our text-based facts retain complete sentential context unlike the OpenIE triples (Banko et al., 2007;Carlson et al., 2010).

Conclusions
In this work, we showed that universal schema is a more promising knowledge source for QA than KB or text alone. Our results show that, although the KB is preferred over text when it contains the fact of interest, a large portion of queries still attend to text, indicating that the amalgam of text and KB is superior to KB alone.