Chains of Reasoning over Entities, Relations, and Text using Recurrent Neural Networks

Our goal is to combine the rich multi-step inference of symbolic logical reasoning with the generalization capabilities of vector embeddings and neural networks. We are particularly interested in complex reasoning about the entities and relations in knowledge bases. Recently Neelakantan et al. (2015) presented a compelling methodology using recurrent neural networks (RNNs) to compose the meaning of relations in a Horn clause consisting of a connected chain. However, this work has multiple weaknesses: it accounts for relations but not entities; it limits generalization by training many separate models; and it does not combine evidence over multiple paths. In this paper we address all these weaknesses, making key strides towards our goal of rich logical reasoning with neural networks: our RNN leverages and jointly trains both relation and entity type embeddings; we train a single high-capacity RNN to compose Horn clause chains across all predicted relation types; and we demonstrate that pooling evidence across multiple chains can dramatically improve both speed of training and final accuracy. We also explore multi-task training of entity and relation types. On a large dataset from ClueWeb and Freebase, our approach provides a significant increase in mean average precision, from 55.3% to 73.2%.


Introduction
There is a rising interest in extending neural networks to perform more complex reasoning, formerly addressed only by symbolic and logical reasoning systems. So far this work has mostly focused on small or synthetic data (Grefenstette, 2013; Bowman et al., 2014; Rocktäschel and Riedel, 2016). Our interest is primarily in reasoning about large knowledge bases (KBs) with diverse semantics, populated from text. One method for populating a KB from text (and for representing diverse semantics in the KB) is Universal Schema (Verga et al., 2016), which learns vector embeddings of relation types: the union of all input relation types, both from the schemas of multiple structured KBs and from expressions of relations in natural language text.
Footnote 1: The code and data are available at https://rajarshd.github.io/ChainsofReasoning/
Table 1: Several highly probable clauses learnt by our model. The textual relations are shown in quotes and italicized. Our model has the ability to combine textual and schema relations. r^{-1} is the inverse of relation r, i.e. r(a, b) ⇔ r^{-1}(b, a).
i. place.birth(a, b) ← 'was born in'(a, x) ∧ 'commonly known as'(x, b)
ii. location.contains(a, b) ← nationality^{-1}(a, x) ∧ place.birth(x, b)
iii. book.characters(a, b) ← 'aka'(a, x) ∧ (theater.character.plays)^{-1}(x, b)
iv. cause.death(a, b) ← 'contracted'(a, b)
An important reason to populate a KB is to support not only look-up-style question answering, but reasoning on its entities and relations in order to make inferences not directly stored in the KB. KBs are often highly incomplete (Min et al., 2013), and reasoning can fill in these missing facts. The "matrix completion" mechanism that underlies the common implementation of Universal Schema can thus be seen as a simple type of reasoning, as can other work in tensor factorization (Nickel et al., 2011;Bordes et al., 2013;Socher et al., 2013). However these methods can be understood as operating on single pieces of evidence: for example, inferring that Microsoft-located-in-Seattle implies Microsoft-HQ-in-Seattle.
A highly desirable, richer style of reasoning makes inferences from multi-hop paths that connect entities through a chain of relations in the KB (Fig. 1a). Symbolic rules of this form are learned by the Path Ranking Algorithm (PRA) (Lao et al., 2011). Dramatic improvement in generalization can be obtained by reasoning about paths, not in terms of relation symbols, but in terms of Universal Schema style relation vector embeddings. This is done by Neelakantan et al. (2015), where RNNs compose the per-edge relation embeddings along an arbitrary-length path, and output a vector embedding representing the inferred relation between the two entities at the end-points of the path. This approach thus represents a key example of complex reasoning over Horn clause chains using neural networks. However, for multiple reasons detailed below, it is inaccurate and impractical.
This paper presents multiple modeling advances that significantly increase the accuracy and practicality of RNN-based reasoning on Horn clause chains in large-scale KBs. (1) Previous work, including Lao et al. (2011), Neelakantan et al. (2015), and Guu et al. (2015), reasons about chains of relations, but not the entities that form the nodes of the path. In our work, we jointly learn and reason about relation types, entities, and entity types. (2) The same previous work takes only a single path as evidence in inferring new predictions. However, as shown in Figure 1b, multiple paths can provide evidence for a prediction. In our work, we use neural attention mechanisms to reason about multiple paths: we pool path scores with a function that performs soft attention during the gradient step, and find that it works better than relying on a single path. (3) The most problematic impracticality of the above previous work, for application to KBs with broad semantics, is its requirement to train a separate model for each relation type to be predicted. In contrast, we train a single, high-capacity RNN that can predict all relation types.
In addition to efficiency advantages, our approach significantly increases accuracy because the multi-task nature of the training shares strength in the common RNN parameters.
We evaluate our new approach on a large-scale dataset of Freebase entities, relations, and ClueWeb text. In comparison with the previous best on this data, we achieve an error reduction of 25% in mean average precision (MAP). In an experiment specially designed to explore the benefits of sharing strength with a single RNN, we show a 54% error reduction in relations that are available only sparsely at training time. We also evaluate on a second dataset, chains of reasoning in WordNet. In comparison with the previous state of the art (Guu et al., 2015), our model achieves an 84% reduction in error in mean quantile.

Background
In this section, we introduce the compositional model (Path-RNN) of Neelakantan et al. (2015). The Path-RNN model takes as input a path between two entities and infers new relations between them. Reasoning is performed non-atomically about conjunctions of relations in an arbitrary-length path by composing them with a recurrent neural network (RNN). The representation of the path is given by the last hidden state of the RNN obtained after processing all the relations in the path.
Figure 2: At each step, the RNN consumes both entity and relation vectors of the path. The entity representation can be obtained from its types. The path vector $y_\pi$ is the last hidden state. The parameters of the RNN and relation embeddings are shared across all query relations. The dot product between the final representation of the path and the query relation gives a confidence score, with higher scores indicating that the query relation exists between the entity pair.
Let $(e_s, e_t)$ be an entity pair and $S$ denote the set of paths between them. The set $S$ is obtained by doing random walks in the knowledge graph starting from $e_s$ until $e_t$ is reached. Let $\pi = \{e_s, r_1, e_1, r_2, \ldots, r_k, e_t\} \in S$ denote a path between $(e_s, e_t)$. The length of a path is the number of relations in it, hence $\mathrm{len}(\pi) = k$. Let $y_{r_t} \in \mathbb{R}^d$ denote the vector representation of $r_t$. The Path-RNN model combines all the relations in $\pi$ sequentially using an RNN with an intermediate representation $h_t \in \mathbb{R}^h$ at step $t$ given by
$$h_t = f(W^r_{hh} h_{t-1} + W^r_{ih} y^r_{r_t}) \qquad (1)$$
where $W^r_{hh} \in \mathbb{R}^{h \times h}$ and $W^r_{ih} \in \mathbb{R}^{d \times h}$ are the parameters of the RNN. Here $r$ denotes the query relation. Path-RNN has a specialized model for predicting each query relation $r$, with separate parameters $(y^r_{r_t}, W^r_{hh}, W^r_{ih})$ for each $r$. $f$ is the sigmoid function. The vector representation of path $\pi$ ($y_\pi$) is the last hidden state $h_k$. The similarity of $y_\pi$ with the query relation vector $y_r$ is computed as the dot product between them:
$$\mathrm{score}(\pi, r) = y_\pi \cdot y_r \qquad (2)$$
Pairs of entities may have several paths connecting them in the knowledge graph (Figure 1b). Let $\{s_1, s_2, \ldots, s_N\}$ be the similarity scores (Equation 2) for $N$ paths connecting an entity pair $(e_s, e_t)$. Path-RNN computes the probability that the entity pair $(e_s, e_t)$ participates in the query relation $r$ by
$$P(r \mid e_s, e_t) = \sigma(\max(s_1, s_2, \ldots, s_N)) \qquad (3)$$
where $\sigma$ is the sigmoid function.
Path-RNN, like other models such as the Path Ranking Algorithm (PRA) and its extensions (Lao et al., 2011; Lao et al., 2012; Gardner et al., 2013; Gardner et al., 2014), is impractical to use in downstream applications, since it requires training and maintaining a separate model for each relation type. Moreover, parameters are not shared across target relation types, so a large number of parameters must be learned from the training data.
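The Path-RNN scoring described above can be illustrated with a small sketch. It uses only the Python standard library, and the relation names, dimensions, zero initial hidden state, and random initial weights are assumptions made for illustration rather than the authors' implementation. Matrices are stored with one row per input dimension, so each product is computed as W^T v.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec_t(W, v):
    """Return W^T v, where W has len(v) rows and any number of columns."""
    return [sum(W[i][j] * v[i] for i in range(len(v))) for j in range(len(W[0]))]

def path_vector(relations, emb, W_hh, W_ih):
    """Compose the relations of one path; the last hidden state is the path vector."""
    h = [0.0] * len(W_hh)  # assumption: zero initial hidden state
    for r in relations:
        recurrent = matvec_t(W_hh, h)
        projected = matvec_t(W_ih, emb[r])
        h = [sigmoid(a + b) for a, b in zip(recurrent, projected)]
    return h

def path_rnn_probability(paths, query_vec, emb, W_hh, W_ih):
    """Score every path by a dot product with the query vector, then max-pool."""
    scores = []
    for path in paths:
        y_pi = path_vector(path, emb, W_hh, W_ih)
        scores.append(sum(a * b for a, b in zip(y_pi, query_vec)))
    return sigmoid(max(scores)), scores

# Toy setup (hypothetical relation names, d = h = 4).
random.seed(0)
d = h = 4
names = ("locatedIn", "hqIn", "countryOfHQ")
emb = {n: [random.uniform(-0.5, 0.5) for _ in range(d)] for n in names}
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(h)]
W_ih = [[random.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(d)]
paths = [["hqIn", "locatedIn"], ["locatedIn", "locatedIn", "locatedIn"]]
prob, scores = path_rnn_probability(paths, emb["countryOfHQ"], emb, W_hh, W_ih)
print(prob, scores)
```

In the per-relation setting described above, one such set of weights and embeddings would be kept for every query relation, which is the maintenance cost the paragraph criticizes.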
In (3), the Path-RNN model selects the maximum scoring path between an entity pair to make a prediction, possibly ignoring evidence from other important paths. Not only is this a waste of computation (since a forward pass must be computed for all the paths anyway), but the relations in all other paths also receive no gradient updates during training, because the max operation returns zero gradient for every path except the maximum scoring one. This is especially ineffective during the initial stages of training, when the maximum scoring path is essentially random.
The Path-RNN model and other multi-hop relation extraction approaches (such as Guu et al. (2015)) ignore the entities in the path. Consider the following two paths: JFK-locatedIn-NYC-locatedIn-NY and Yankee Stadium-locatedIn-NYC-locatedIn-NY. To predict the airport serves relation, the Path-RNN model assigns the same score to both paths, even though the first path should be ranked higher. This is because the model has no information about the entities and uses only the relations in the path for prediction.

Shared Parameter Architecture
The previous section discussed the problems associated with per-relation modeling approaches. In response, we share the relation type representations and the composition matrices of the RNN across all target relations, which yields fewer parameters for the same training data. We refer to this model as Single-Model. Note that this is just multi-task learning (Caruana, 1997) among predictions of target relation types with an underlying shared parameter architecture. The RNN hidden state in (1) is now given by:
$$h_t = f(W_{hh} h_{t-1} + W_{ih} y_{r_t}) \qquad (4)$$
Note that the parameters here are independent of the target relation $r$.

Model Training
We train the model using existing observed facts (triples) in the KB as positive examples and unobserved facts as negative examples. Let $R = \{\gamma_1, \gamma_2, \ldots, \gamma_n\}$ denote the set of all query relation types that we train for. Let $\Delta^+_R$ and $\Delta^-_R$ denote the sets of positive and negative triples for all the relation types in $R$. The parameters of the model are trained to minimize the negative log-likelihood of the data:
$$L(\Theta, \Delta^+_R, \Delta^-_R) = -\frac{1}{M} \Big( \sum_{(e_s, e_t, r) \in \Delta^+_R} \log P(r \mid e_s, e_t) + \sum_{(e_s, e_t, r) \in \Delta^-_R} \log\big(1 - P(r \mid e_s, e_t)\big) \Big) \qquad (5)$$
Here $M$ is the total number of training examples and $\Theta$ denotes the set of all parameters of the model (the shared lookup table of embeddings and the shared parameters of the RNN). In contrast, the Path-RNN model has a separate loss function for each relation $r \in R$, which depends only on the relevant subset of the data.
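As a rough illustration of this objective, the sketch below computes a mean negative log-likelihood over positive and negative triples. It assumes the usual binary cross-entropy form, in which a positive triple contributes log P and a negative triple contributes log(1 - P); the function name and the probability values are made up for the example.

```python
import math

def negative_log_likelihood(pos_probs, neg_probs):
    """Mean negative log-likelihood over M positive and negative examples.

    pos_probs: predicted P(r | e_s, e_t) for observed (positive) triples.
    neg_probs: predicted P(r | e_s, e_t) for unobserved (negative) triples.
    """
    m = len(pos_probs) + len(neg_probs)
    total = sum(math.log(p) for p in pos_probs)
    total += sum(math.log(1.0 - p) for p in neg_probs)
    return -total / m

# Hypothetical predictions pooled over all query relations (single shared model).
loss = negative_log_likelihood([0.9, 0.8], [0.1, 0.3])
print(loss)
```

Because the shared model pools triples from every relation type into one sum, each gradient step can update the common embeddings and RNN weights using evidence from all relations.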

Score Pooling
In this section, we introduce new methods of score pooling that take into account multiple paths between an entity pair. Let $\{s_1, s_2, \ldots, s_N\}$ be the similarity scores (Equation 2) for $N$ paths connecting an entity pair $(e_s, e_t)$. We consider three pooling functions: Top-(k), average, and LogSumExp (LSE). With LSE pooling, where $\mathrm{LSE}(s_1, \ldots, s_N) = \log \sum_i \exp(s_i)$, the probability for entity pair $(e_s, e_t)$ to participate in relation $r$ is
$$P(r \mid e_s, e_t) = \sigma(\mathrm{LSE}(s_1, s_2, \ldots, s_N))$$
The average and the LSE pooling functions apply non-zero weights to all the paths during inference.
However, only a few paths between an entity pair are predictive of a query relation. LSE has another desirable property, since $\frac{\partial \mathrm{LSE}}{\partial s_i} = \frac{\exp(s_i)}{\sum_j \exp(s_j)}$. This means that during the back-propagation step, every path receives a share of the gradient proportional to the exponential of its score, which acts as a kind of attention during the gradient step. In contrast, for averaging, every path receives an equal ($\frac{1}{N}$) share of the gradient. Top-(k) pooling (similar to max) receives sparse gradients.
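To make the gradient argument concrete, the following sketch compares how LSE, average, and max pooling distribute gradient over path scores. It is a minimal standard-library illustration with made-up scores; the max-shift inside `logsumexp` is a common numerical-stability trick and is an addition of this sketch, not something stated in the text.

```python
import math

def logsumexp(scores):
    """log(sum(exp(s))) computed stably by shifting by the maximum score."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def lse_gradient(scores):
    """d LSE / d s_i = exp(s_i) / sum_j exp(s_j), i.e. a softmax over scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mean_gradient(scores):
    """Average pooling gives every path the same 1/N share."""
    return [1.0 / len(scores)] * len(scores)

def max_gradient(scores):
    """Max pooling passes gradient only to the top-scoring path."""
    best = scores.index(max(scores))
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

scores = [2.0, 0.5, -1.0]  # hypothetical similarity scores for three paths
pooled = logsumexp(scores)
weights = lse_gradient(scores)
print(pooled, weights, mean_gradient(scores), max_gradient(scores))
```

The LSE weights are all non-zero but concentrate on the highest-scoring path, which sits between the uniform weights of averaging and the one-hot weights of max pooling.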

Incorporating Entities
A straightforward way of incorporating entities is to include entity representations (along with relations) as input to the RNN. Learning a separate representation for each entity, however, has some disadvantages. The distribution of entity occurrence is heavy-tailed, and hence it is hard to learn good representations of rarely occurring entities. To alleviate this problem, we use the entity types present in the KB as described below.
Most KBs have annotated types for entities, and each entity can have multiple types. For example, Melinda Gates has types such as CEO, Duke University Alumni, Philanthropist, American Citizen, etc. We obtain the entity representation by a simple addition of the entity type representations. The entity type representations are learned during training. We limit the number of entity types for an entity to the 7 most frequently occurring types in the KB. Let $y_{e_t} \in \mathbb{R}^m$ denote the representation of entity $e_t$; then (4) becomes
$$h_t = f(W_{hh} h_{t-1} + W_{ih} y_{r_t} + W_{eh} y_{e_t}) \qquad (6)$$
where $W_{eh} \in \mathbb{R}^{m \times h}$ is the new parameter matrix for projecting the entity representation. Figure 2 shows our model with an example path between entities (Microsoft, USA) with countryOfHQ (country of headquarters) as the query relation.
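A minimal sketch of this entity-aware step is given below, assuming the entity vector is the sum of at most seven type embeddings and that the type list is already ordered by frequency in the KB. The helper names, the tiny dimensions, the identity weight matrices, and the choice of ReLU for f are illustrative assumptions only.

```python
def entity_vector(type_ids, type_emb, max_types=7):
    """Sum the embeddings of the first max_types types of an entity.

    Assumption: type_ids is already sorted by how frequent each type is in the KB.
    """
    kept = type_ids[:max_types]
    m = len(next(iter(type_emb.values())))
    return [sum(type_emb[t][i] for t in kept) for i in range(m)]

def matvec_t(W, v):
    """Return W^T v, where W has len(v) rows and h columns."""
    return [sum(W[i][j] * v[i] for i in range(len(v))) for j in range(len(W[0]))]

def relu(x):
    return max(0.0, x)

def step(h_prev, y_r, y_e, W_hh, W_ih, W_eh, f=relu):
    """One recurrence step combining hidden state, relation, and entity inputs."""
    parts = zip(matvec_t(W_hh, h_prev), matvec_t(W_ih, y_r), matvec_t(W_eh, y_e))
    return [f(a + b + c) for a, b, c in parts]

# Toy example with h = d = m = 2 and identity projection matrices.
type_emb = {
    "CEO": [1.0, 0.0],
    "Philanthropist": [0.5, 0.5],
    "American Citizen": [0.0, -1.0],
}
identity = [[1.0, 0.0], [0.0, 1.0]]
y_e = entity_vector(["CEO", "Philanthropist", "American Citizen"], type_emb)
h1 = step([0.0, 0.0], [0.2, 0.1], y_e, identity, identity, identity)
print(y_e, h1)
```

Because an unseen entity still has types, this construction yields a usable input vector at test time even when no entity-specific embedding was learned.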

Related Work
Two early works on extracting clauses and reasoning over paths are SHERLOCK (Schoenmackers et al., 2010) and the Path Ranking Algorithm (PRA) (Lao et al., 2011). SHERLOCK extracts purely symbolic clauses by exhaustively exploring relational paths of increasing length. PRA replaces exhaustive search with random walks. Observed paths are used as features for a per-target-relation binary classifier. Lao et al. (2012) extend PRA by augmenting KB-schema relations with observed text patterns. However, these methods do not generalize well to the millions of distinct paths obtained from random exploration of the KB, since each unique path is treated as a singleton and no commonalities between paths are modeled. In response, pre-trained vector representations have been used in PRA to tackle the feature explosion (Gardner et al., 2013; Gardner et al., 2014), but these approaches still rely on a classifier using atomic path features. Yang et al. (2015) also extract Horn rules, but they restrict them to a length of 3, and the literals are restricted to schema types in the knowledge base. Zeng et al. (2016) show improvements in relation extraction by incorporating sentences which contain one entity. Guu et al. (2015) introduce new compositional techniques by modeling additive and multiplicative interactions between relation matrices in the path. However, they model only a single path between an entity pair, in contrast to our ability to consider multiple paths. Toutanova et al. (2016) improve upon them by additionally modeling the intermediate entities in the path and modeling multiple paths. However, their approach has to store scores for intermediate path lengths for all entity pairs, making it prohibitive in our setting, where we have more than 3M entity pairs. They also model entities as just a scalar weight, whereas we learn both entity and type representations. Lastly, it has been shown by Neelakantan et al. (2015) that non-linear composition functions outperform linear functions (as used by them) for relation extraction tasks.
The performance of relation extraction methods has been improved by incorporating entity types for their candidate entities, both in sentence-level (Roth and Yih, 2007; Singh et al., 2013) and KB relation extraction (Chang et al., 2014), and in learning entailment rules (Berant et al., 2011). Serban et al. (2016) use RNNs to generate factoid questions from Freebase.

Data and Experimental Setup
We apply our models to the dataset released by Neelakantan et al. (2015), which is a subset of Freebase enriched with information from ClueWeb. The dataset comprises a set of triples ($e_1$, $r$, $e_2$) and the set of paths connecting each entity pair ($e_1$, $e_2$) in the knowledge graph. The triples extracted from ClueWeb consist of sentences that contain entities linked to Freebase (Orr et al., 2013). The raw text between the two entities in the sentence forms the relation type. To limit the number of textual relations, we retain the two words following the first entity and the two words preceding the second entity. We also collect the entity type information from Freebase. Table 2 summarizes some important statistics. For the PathQA experiment, we use the same train/dev/test split of the WordNet dataset released by Guu et al. (2015), and hence our results are directly comparable to theirs. The WordNet dataset has just 22 relation types and 38194 entities, which is orders of magnitude smaller than the dataset we use for the relation extraction tasks.
The dimension of the relation type representations and the RNN hidden states is $d = h = 250$, and the entity and type embeddings have $m = 50$ dimensions. The Path-RNN model uses sigmoid units as its activation function. However, we found rectifier units (ReLU) to work much better (Le et al., 2015), even when compared to LSTMs (73.2 vs. 72.4 in MAP). For the path-query experiment, the dimension of the entity and relation embeddings and of the hidden units is set to 100 (as used by Guu et al. (2015)). As our evaluation metric, we use the average precision (AP) to score the ranked list of entity pairs. The MAP score is the mean AP across all query relations. AP is a strict metric since it penalizes cases where an incorrect entity is ranked above correct entities. MAP also approximates the area under the precision-recall curve (Manning et al., 2008). We use Adam (Kingma and Ba, 2014) for optimization in all our experiments with the default hyperparameter settings (learning rate $= 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$). Statistical significance of the scores reported in Table 3 was assessed with a paired t-test.
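The evaluation metric can be sketched in a few lines. The functions below compute AP for one ranked list of binary relevance labels and MAP as the mean over query relations; the relation names and label lists are hypothetical, and returning 0.0 for a list with no correct item is a convention chosen for this sketch.

```python
def average_precision(ranked_labels):
    """AP of a ranked list, where 1 marks a correct item and 0 an incorrect one."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this correct item's rank
    return total / hits if hits else 0.0

def mean_average_precision(labels_by_relation):
    """Mean of the per-relation AP values."""
    aps = [average_precision(labels) for labels in labels_by_relation.values()]
    return sum(aps) / len(aps)

# Hypothetical rankings of entity pairs for two query relations.
rankings = {
    "relation_a": [1, 1, 0],  # both correct pairs ranked first: AP = 1.0
    "relation_b": [0, 1],     # an incorrect pair outranks the correct one: AP = 0.5
}
map_score = mean_average_precision(rankings)
print(map_score)
```

The second ranking shows why AP is strict: a single incorrect pair placed above the only correct one halves the score for that relation.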

Effect of Pooling Techniques
Section 1 of Table 3 shows the effect of the various pooling techniques presented in section 3.2. It is encouraging to see that LogSumExp gives the best results. This demonstrates the importance of considering information from all the paths. However, average pooling performs the worst, which shows that it is also important to weigh the path scores according to their values. Figure 3 plots the training loss against the gradient update step. Due to non-zero gradient updates for all the paths, the LogSumExp pooling strategy leads to faster training than max pooling, which has sparse gradients. This is especially relevant during the early stages of training, where the argmax path is essentially a random guess. The difference between the scores of max and LSE pooling is statistically significant (p < 0.02).

Comparison with multi-hop models
We next compare the performance of the Single-Model with two other multi-hop models: Path-RNN and PRA (Lao et al., 2011). Both of these approaches train an individual model for each query relation. We also experiment with another extension of PRA that adds bigram features (PRA + Bigram). Additionally, we run an experiment replacing the max-pooling of Path-RNN with LogSumExp. The results are shown in the second section of Table 3. It is not surprising to see that the Single-Model, which leverages parameter sharing, improves performance. It is also encouraging to see that LogSumExp makes the Path-RNN baseline stronger. The difference between the scores of Path-RNN (with LSE) and Single-Model is statistically significant (p < 0.005).

Effect of Incorporating Entities
Next, we provide quantitative results supporting our claim that modeling the entities along a KB path can improve reasoning performance. The last section of Table 3 lists the performance gain obtained by injecting information about entities. We achieve the best performance when we represent entities as a function of their annotated types in Freebase (Single-Model + Types) (p < 0.005).
In comparison, learning separate representations of entities (Single-Model + Entities) gives slightly worse performance. This is primarily because we encounter many new entities during test time, for which our model does not have a learned representation. However, the relatively limited number of entity types helps us overcome the problem of representing unseen entities. We also extend PRA to include entity type information (PRA + Types), but this did not yield significant improvements.
Table 3: The first section shows the effectiveness of LogSumExp as the score aggregation function. The next section compares performance with existing multi-hop approaches and the last section shows the performance achieved using joint reasoning with entities and types.

Performance in Limited Data Regime
In constructing our dataset, we selected query relations with reasonable amounts of data. However, for many important applications we have very limited data. To simulate this common scenario, we create a new dataset by randomly selecting 23 out of 46 relations and removing all but 1% of the positive and negative triples previously used for training.
Effectively, the difference between Path-RNN and Single-Model is that Single-Model does multitask learning, since it shares parameters across different target relation types. Therefore, we expect it to outperform Path-RNN on this small dataset, since this multitask learning provides additional regularization. We also experiment with an extension of Single-Model in which we introduce an additional task for multitask learning: predicting the annotated types of entities. Here, parameters for the entity type embeddings are shared with the Single-Model. Supervision for this task is provided by the entity type annotations in the KB. We train with the Bayesian Personalized Ranking loss of Rendle et al. (2009). The results are listed in Table 4. With Single-Model there is a clear jump in performance, as we expect. The additional multitask training with types gives only an incremental gain.

Table 4: Model performance when trained with a small fraction of the data (columns: Model, Performance).
Guu et al. (2015) introduce a task of answering questions formulated as path traversals in a KB. Unlike binary fact prediction, to answer a path query, the model needs to find the set of correct target entities 't' that can be reached by starting from an initial entity 's' and then traversing the path 'p'. They model additive and multiplicative interactions of relations in the path. It should be noted that the compositional Trans-E and Bilinear-diag models have a number of parameters comparable to our model, since they also represent relations as vectors, whereas the Bilinear model learns a dense square matrix for each relation and hence has many more parameters. Hence, we compare with the Trans-E and Bilinear-diag models. Bilinear-diag has also been shown to outperform Bilinear models (Yang et al., 2015). Instead of combining relations using simple additions and multiplications, we propose to combine the intermediate hidden representations $h_i$ obtained from an RNN (via (4)) after consuming relation $r_i$ at each step. Let $h$ denote the sum of all intermediate representations $h_i$. The score of a triple $(s, p, t)$ by our model is given by $x_s^\top \mathrm{diag}(h)\, x_t$, where $x_s$ and $x_t$ denote the representations of entities $s$ and $t$ and $\mathrm{diag}(h)$ is the diagonal matrix with vector $h$ as its diagonal elements. We compare to the results reported by Guu et al. (2015) on the WordNet dataset. It should be noted that the dataset is fairly small, with just 22 relation types and an average path length of 3.07. More importantly, there are only a few unseen paths during test time and only one path between an entity pair, suggesting that this dataset is not an ideal test bed for compositional neural models. The results are shown in Table 6. Mean Quantile (MQ) is the fraction of incorrect entities which have been scored lower than the correct entity. Our model achieves an 84% reduction in error when compared to their best model.
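The mean quantile metric described above can be sketched as follows. The function and variable names and the scores are hypothetical, and two conventions are assumptions of this sketch: ties are counted as not lower (strict comparison), and a query with no incorrect candidates receives a quantile of 1.0.

```python
def quantile(correct_score, incorrect_scores):
    """Fraction of incorrect entities scored strictly lower than the correct entity."""
    if not incorrect_scores:
        return 1.0  # assumption: nothing can outrank the correct entity
    lower = sum(1 for s in incorrect_scores if s < correct_score)
    return lower / len(incorrect_scores)

def mean_quantile(queries):
    """Average the quantile over (correct_score, incorrect_scores) pairs."""
    return sum(quantile(c, wrong) for c, wrong in queries) / len(queries)

# Hypothetical model scores for two path queries.
queries = [
    (0.9, [0.1, 0.5, 0.95, 0.2]),  # one incorrect entity outranks the answer: 0.75
    (0.3, [0.1, 0.2]),             # the answer outranks every incorrect entity: 1.0
]
mq = mean_quantile(queries)
print(mq)
```

Under this definition the error is 1 - MQ, so a reduction in error corresponds to moving MQ closer to 1.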

Qualitative Analysis
Entities as Existential Quantifiers: Table 5 shows the body of two Horn clauses. Both clauses are predictive of the fact location.contains(x, b). The first clause is universally true irrespective of the entities present in the chain (transitive property). The second clause, however, holds only for particular instantiations of the entities. The score of the Path-RNN model is independent of the entity values, whereas our model outputs a different score based on the entities in the chain. In column 3 (With Entities), we average the scores across entities which are connected through this path and for which the relation holds.
For the first clause, which is independent of entities, both models predict a high score. For the second clause, however, the model without entity information predicts a lower score, because this path is seen in both positive and negative training examples and the model cannot condition on the entities to learn to discriminate. Our model, in contrast, predicts the true relations with high confidence. This is a first step towards capturing existential quantification for logical inference in vector space.
Length of Clauses: Figure 4 shows the length distribution of top scoring paths in the test set. The distribution peaks at lengths {3, 4, 5}, suggesting that previous approaches (Yang et al., 2015) which restrict the length to 3 might limit performance and generalizability.
Limitation: A major limitation of our model is its inability to handle long textual patterns because of sparsity. Compositional approaches for modeling text (Toutanova et al., 2015; Verga et al., 2016) are a step in the right direction, and we leave this as future work.

Conclusion
This paper introduces a single high-capacity RNN model which allows chains of reasoning across multiple relation types. It leverages information from the intermediate entities present in the path between an entity pair and mitigates the problem of unseen entities by representing them as a function of their annotated types. We also demonstrate that pooling evidence across multiple paths improves both training speed and accuracy. Finally, we address the problem of reasoning about infrequently occurring relations and show significant performance gains via multitask learning.