Lifted Rule Injection for Relation Embeddings

Methods based on representation learning currently hold the state-of-the-art in many natural language processing and knowledge base inference tasks. Yet, a major challenge is how to efficiently incorporate commonsense knowledge into such models. A recent approach regularizes relation and entity representations by propositionalization of first-order logic rules. However, propositionalization does not scale beyond domains with only few entities and rules. In this paper we present a highly efficient method for incorporating implication rules into distributed representations for automated knowledge base construction. We map entity-tuple embeddings into an approximately Boolean space and encourage a partial ordering over relation embeddings based on implication rules mined from WordNet. Surprisingly, we find that the strong restriction of the entity-tuple embedding space does not hurt the expressiveness of the model and even acts as a regularizer that improves generalization. By incorporating few commonsense rules, we achieve an increase of 2 percentage points mean average precision over a matrix factorization baseline, while observing a negligible increase in runtime.


Introduction
Current successful methods for automated knowledge base construction tasks heavily rely on learned distributed vector representations (Nickel et al., 2012;Riedel et al., 2013;Socher et al., 2013;Chang et al., 2014;Neelakantan et al., 2015;Toutanova et al., 2015;Nickel et al., 2015;. Although these models are able to learn robust representations from large amounts of data, they often lack commonsense knowledge. Such knowledge is rarely explicitly stated in texts but can be found in resources like PPDB (Ganitkevitch et al., 2013) or WordNet (Miller, 1995).
Combining neural methods with symbolic commonsense knowledge, for instance in the form of implication rules, is in the focus of current research (Rocktäschel et al., 2014;Wang et al., 2014;Bowman et al., 2015;Wang et al., 2015;Vendrov et al., 2016;Hu et al., 2016;. A recent approach (Rocktäschel et al., 2015) regularizes entity-tuple and relation embeddings via first-order logic rules. To this end, every first-order rule is propositionalized based on observed entity-tuples, and a differentiable loss term is added for every propositional rule. This approach does not scale beyond only a few entity-tuples and rules. For example, propositionalizing the rule ∀x : isMan(x) ⇒ isMortal(x) would result in a very large number of loss terms on a large database.
In this paper, we present a method to incorporate simple rules while maintaining the computational efficiency of only modeling training facts. This is achieved by minimizing an upper bound of the loss that encourages the implication between relations to hold, entirely independent from the number of entity pairs. It only involves representations of the relations that are mentioned in rules, as well as a general rule-independent constraint on the entity-tuple embedding space. In the example given above, if we require that every component of the vector representation of isMan is smaller than the corresponding component of relation isMortal, then we can show that the rule holds for any nonnegative representation of an entity-tuple. Hence our method avoids the need for separate loss terms for every ground atom resulting from propositionalizing rules. In statistical relational learning this type of approach is often referred to as lifted inference or learning (Poole, 2003;Braz, 2007) because it deals with groups of random variables at a first-order level. In this sense our approach is a lifted form of rule injection. This allows for imposing large numbers of rules while learning distributed representations of relations and entity-tuples. Besides drastically lower computation time, an important advantage of our method over Rocktäschel et al. (2015) is that when these constraints are satisfied, the injected rules always hold, even for unseen but inferred facts. While the method presented here only deals with implications and not general first-order rules, it does not rely on the assumption of independence between relations, and is hence more generally applicable.
Our contributions are fourfold: (i) we develop a very efficient way of regularizing relation representations to incorporate first-order logic implications ( §3), (ii) we reveal that, against expectation, mapping entity-tuple embeddings to non-negative space does not hurt but instead improves the generalization ability of our model ( §5.1) (iii) we show improvements on a knowledge base completion task by injecting mined commonsense rules from WordNet ( §5.3), and finally (iv) we give a qualitative analysis of the results, demonstrating that implication constraints are indeed satisfied in an asymmetric way and result in a substantially increased structuring of the relation embedding space ( §5.6).

Background
In this section we revisit the matrix factorization relation extraction model by Riedel et al. (2013) and introduce the notation used throughout the paper. We choose the matrix factorization model for its simplicity as the base on which we develop implication injection. Riedel et al. (2013) represent every relation r ∈ R (selected from Freebase (Bollacker et al., 2008) or extracted as textual surface pattern) by a k-dimensional latent representation r ∈ R k . A particular relation instance or fact is the combination of a relation r and a tuple t of entities that are engaged in that relation, and is written as r, t . We write O as the set of all such input facts available for training. Furthermore, every entity-tuple t ∈ T is represented by a latent vector t ∈ R k (with T the set of all entity-tuples in O).
Model F by Riedel et al. (2013) measures the compatibility between a relation r and an entitytuple t using the dot product r t of their respective vector representations. During training, the representations are learned such that valid facts receive high scores, whereas negative ones receive low scores. Typically no negative evidence is available at training time, and therefore a Bayesian Personalized Ranking (BPR) objective (Rendle et al., 2009) is used. Given a pair of facts f p := r p , t p ∈ O and f q := r q , t q ∈ O, this objective requires that (1) The embeddings can be trained by minimizing a convex loss function R that penalizes violations of that requirement when iterating over the training set. In practice, each positive training fact r, t q is compared with a randomly sampled unobserved fact r, t p for the same relation. The overall loss can hence be written as (2) and measures how well observed valid facts are ranked above unobserved facts, thus reconstructing the ranking of the training data. We will henceforth call L R the reconstruction loss, to make a distinction with the implication loss that we will introduce later. Riedel et al. (2013) use the logistic loss R (s) := − log σ(−s), where σ(s) := (1 + e −x ) −1 denotes the sigmoid function. In order to avoid overfitting, an L 2 regularization term on the r and t embeddings is added to the reconstruction loss. The overall objective to minimize hence is where α is the regularization strength.

Lifted Injection of Implications
In this section, we show how an implication ∀t ∈ T : r p , t ⇒ r q , t , can be imposed independently of the entity-tuples. For simplicity, we abbreviate such implications as r p ⇒ r q (e.g., professorAt ⇒ employeeAt).

Grounded Loss Formulation
The implication rule can be imposed by requiring that every tuple t ∈ T is at least as compatible with relation r p as with r q . Written in terms of the latent representations, eq. (4) therefore becomes If r p , t is a true fact with a high score r p t, and the fact r q , t has an even higher score, it must also be true, but not vice versa. We can therefore inject an implication rule by minimizing a loss term with a separate contribution from every t ∈ T , adding up to the total loss if the corresponding inequality is not satisfied. In order to make the contribution of every tuple t to that loss independent of the magnitude of the tuple embedding, we divide both sides of the above inequality by t 1 . Witht := t/ t 1 , the implication loss for the rule r p ⇒ r q can be written as for an appropriate convex loss function I , similarly to eq. (2). In practice, the summation can be reduced to those tuples that occur in combination with r p or r q in the training data. Still, the propositionalization in terms of training facts leads to a heavy computational cost for imposing a single implication, similar to the technique introduced in Rocktäschel et al. (2015). Moreover, with that simplification there is no guarantee that the implication between both relations would generalize towards inferred facts not seen during training.

Lifted Loss Formulation
The problems mentioned above can be avoided if instead of L I , a tuple-independent upper bound is minimized. Such a bound can be constructed, provided all components of t are restricted to a nonnegative embedding space, i.e., T ⊆ R k,+ . If this holds, Jensen's inequality allows us to transform eq. (6) as follows where 1 i is the unit vector along dimension i in tuple-space. This is allowed because the form convex coefficients (t i > 0, and it i = 1), and I is a convex function. If we define we can write in which β is an upper bound on tt i . One such bound is |T |, but others are conceivable too. In practice we rescale β to a hyper-parameterβ that we use to control the impact of the upper bound to the overall loss. We call L U I the lifted loss, as it no longer depends on any of the entity-tuples; it is grounded over the unit tuples 1 i instead.
The implication r p ⇒ r q can thus be imposed by minimizing the lifted loss L U I . Note that by minimizing L U I , the model is encouraged to satisfy the constraint r p ≤ r q on the relation embeddings, where ≤ denotes the component-wise comparison. In fact, a sufficient condition for eq. (5) to hold, is r p ≤ r q and ∀t ∈ T : t ≥ 0 with 0 the k-dimensional null vector. This corresponds to a single relation-specific loss term, and the general restriction T ⊆ R k,+ on the tupleembedding space.

Approximately Boolean Entity Tuples
In order to impose implications by minimizing a lifted loss L U I , the tuple-embedding space needs to be restricted to R k,+ . We have chosen to restrict the tuple space even more than required, namely to the hypercube t ∈ [0, 1] k , as approximately Boolean embeddings (Kruszewski et al., 2015). The tuple embeddings are constructed from real-valued vectors e, using the component-wise sigmoid function For minimizing the loss, the gradients are hence computed with respect to e, and the L 2 regularization is applied to the components of e instead of t.
Other choices for ensuring the restriction t ≥ 0 in eq. (11) are possible, but we found that our approach works better in practice than those (e.g., the exponential transformation proposed by Demeester et al. (2016)). It can also be observed that the unit tuples over which the implication loss is grounded, form a special case of approximately Boolean embeddings.
In order to investigate the impact of this restriction even when not injecting any rules, we introduce model FS: the original model F, but with sigmoidal entity-tuples: Here, e p and e q are the real-valued representations as in eq. (12), for tuples t p and t q , respectively. With the above choice of a non-negative tupleembedding space we can now state the full lifted rule injection model (FSL): L U I denotes a lifted loss term for every rule in a set I of implication rules that we want to inject.

Convex Implication Loss
The logistic loss R (see §2) is not suited for imposing implications because once the inequality in eq. (11) is satisfied, the components of r p and r q do not need to be separated any further. However, with R this would continue to happen due to the small non-zero gradient. In the reconstruction loss L R this is a desirable effect which further separates the scores for positive from negative examples. However, if an implication is imposed between two relations that are almost equivalent according to the training data, we still want to find almost equivalent embedding vectors. Hence, we propose to use the loss with δ a small positive margin to ensure that the gradient does not disappear before the inequality is actually satisfied. We use δ = 0.01 in all experiments.
The main advantage of the presented approach over earlier methods that impose the rules in a grounded way (Rocktäschel et al., 2015;Wang et al., 2015) is the computational efficiency of imposing the lifted loss. Evaluating L U I or its gradient for one implication rule is comparable to evaluating the reconstruction loss for one pair of training facts. In typical applications there are much fewer rules than training facts and the extra computation time needed to inject these rules is therefore negligible.

Related Work
Recent research on combining rules with learned vector representations has been important for new developments in the field of knowledge base completion. Rocktäschel et al. (2014) and Rocktäschel et al. (2015) provided a framework to jointly maximize the probability of observed facts and propositionalized first-order logic rules. Wang et al. (2015) demonstrated how different types of rules can be incorporated using an Integer Linear Programming approach. Wang and Cohen (2016) learned embeddings for facts and first-order logic rules using matrix factorization. Yet, all of these approaches ground the rules in the training data, limiting their scalability towards large rule sets and KBs with many entities. As argued in the introduction, this forms an important motivation for the lifted rule injection model put forward in this work, which by construction does not suffer from that limitation. Wei et al. (2015) proposed an alternative strategy to tackle the scalability problem by reasoning on a filtered subset of grounded facts. Wu et al. (2015) proposed to use a path ranking approach for capturing long-range interactions between entities, and to add these as an extra loss term, besides the loss that models pairwise relations. Our model FSL differs substantially from their approach, in that we consider tuples instead of separate entities, and we inject a given set of rules. Yet, by cre-ating a partial ordering in the relation embeddings as a result of injecting implication rules, model FSL can also capture interactions beyond direct relations. This will be demonstrated in §5.3 by injecting rules between surface patterns only and still measuring an improvement on predictions for structured Freebase relations.
Combining logic and distributed representations is also an active field of research outside of automated knowledge base completion. Recent advances include the work by Faruqui et al. (2014), who injected ontological knowledge from WordNet into word representations. Furthermore, Vendrov et al. (2016) proposed to enforce a partial ordering in an embeddings space of images and phrases. Our method is related to such order embeddings since we define a partial ordering on relation embeddings. However, to ensure that implications hold for all entity-tuples we also need a restriction on the entitytuple embedding space and derive bounds on the loss. Another important contribution is the recent work by Hu et al. (2016), who proposed a framework for injecting rules into general neural network architectures, by jointly training on the actual targets and on the rule-regularized predictions provided by a teacher network. Although quite different at first sight, their work could offer a way to use our model in various neural network architectures, by integrating the proposed lifted loss into the teacher network.
This paper builds upon our previous workshop paper (Demeester et al., 2016). In that work, we tested different tuple embedding transformations in an ad-hoc manner. We used approximately Boolean representations of relations instead of entity-tuples, strongly reducing the model's degrees of freedom. We now derive the FSL model from a carefully considered mathematical transformation of the grounded loss. The FSL model only restricts the tuple embedding space, whereby relation vectors remain real valued. Furthermore, previous experiments were performed on small-scale artificial datasets, whereas we now test on a real-world relation extraction benchmark.
Finally, we explicitly discuss the main differences with respect to the strongly related work from Rocktäschel et al. (2015). Their method is more general, as they cover a wide range of first-order logic rules, whereas we only discuss implications. Lifted rule injection beyond implications will be studied in future research contributions. However, albeit less general, our model has a number of clear advantages: Scalability -Our proposed model of lifted rule injection scales according to the number of implication rules, instead of the number of rules times the number of observed facts for every relation present in a rule.
Generalizability -Injected implications will hold even for facts not seen during training, because their validity only depends on the order relation imposed on the relation representations. This is not guaranteed when training on rules grounded in training facts by Rocktäschel et al. (2015).
Training Flexibility -Our method can be trained with various loss functions, including the rank-based loss as used in Riedel et al. (2013). This was not possible for the model of Rocktäschel et al. (2015) and already leads to an improved accuracy as seen from the zero-shot learning experiment in §5.2.
Independence Assumption -In Rocktäschel et al. (2015) an implication of the form a p ⇒ a q for two ground atoms a p and a q is modeled by the logical equivalence ¬(a p ∧ ¬a q ), and its probability is approximated in terms of the elementary probabilities π(a p ) and π(a q ) as 1 − π(a p ) 1 − π(a q ) . This assumes the independence of the two atoms a p and a q , which may not hold in practice. Our approach does not rely on that assumption and also works for cases of statistical dependence. For example, the independence assumption does not hold in the trivial case where the relations r p and r q in the two atoms are equivalent, whereas in our model, the constraints r p ≤ r q and r p ≥ r q would simply reduce to r p = r q .

Experiments and Results
We now present our experimental results. We start by describing the experimental setup and hyperparameters. Before turning to the injection of rules, we compare model F with model FS, and show that restricting the tuple embedding space has a regularization effect, rather than limiting the expressiveness of the model ( §5.1). We then demonstrate that model FSL is capable of zero-shot learning ( §5.2), and show that injecting high-quality WordNet rules  Riedel et al. (2013) are denoted as R13-F. leads to an improved precision ( §5.3). We proceed with a visual illustration of the relation embeddings with and without injected rules ( §5.4), provide details on time efficiency of the lifted rule injection method ( §5.5), and show that it correctly captures the asymmetry of implication rules ( §5.6). All models were implemented in Tensor-Flow (Abadi et al., 2015). We use the hyperparameters of Riedel et al. (2013), with k = 100 hidden dimensions and a weight of α = 0.01 for the L 2 regularization loss. We use ADAM (Kingma and Ba, 2014) for optimization with an initial learning rate of 0.005 and a mini-batch size of 8192. The embeddings are initialized by sampling uniformly from [−0.1, 0.1] and we useβ = 0.1 for the implication loss throughout our experiments.

Restricted Embedding Space
Before incorporating external commonsense knowledge into relation representations, we were curious how much we lose by restricting the entity-tuple space to approximately Boolean embeddings. We evaluate our models on the New York Times dataset introduced by Riedel et al. (2013). Surprisingly, we find that the expressiveness of the model does not suffer from this strong restriction. From Table 1 we see that restricting the tuple-embedding space seems to perform slightly better (FS) as opposed to a realvalued tuple-embedding space (F), suggesting that this restriction has a regularization effect that improves generalization. We also provide the original results for model F by Riedel et al. (2013) (denoted as R13-F) for comparison. Due to a different implementation and optimization procedure, the results for our model F and R13-F are not identical.
Inspecting the top relations for a sampled dimension in the embedding space reveals that the relation space of model FS more closely resembles clusters than that of model F (Table 2). We hypothesize that this might be caused by approximately Boolean entity-tuple representations in model FS, resulting in attribute-like entity-tuple vectors that capture which relation clusters they belong to.

Zero-shot Learning
The zero-shot learning experiment performed in Rocktäschel et al. (2015) leads to an important finding: when injecting implications with right-hand sides for Freebase relations for which no or very limited training facts are available, the model should be able to infer the validity of Freebase facts for those relations based on rules and correlations between textual surface patterns.
We inject the same hand-picked relations as used by Rocktäschel et al. (2015), after removing all Freebase training facts. The lifted rule injection (model FSL) reaches a weighted MAP of 0.35, comparable with 0.38 by the Joint model from Rocktäschel et al. (2015) (denoted R15-Joint). Note that for this experiment we initialized the Freebase relations implied by the rules with negative random vectors (sampled uniformly from [−7.9, −8.1]). The reason is that without any negative training facts for these relations, their components can only go up due to the implication loss, and we do not want to get values that are too high before optimization. Figure 1 shows how the relation extraction performance improves when more Freebase relation training facts are added. It effictively measures how well the proposed models, matrix factorization (F), propositionalized rule injection (R15-Joint), and our model (FSL), can make use of the provided rules and correlations between textual surface form pat- Table 2: Top patterns for a randomly sampled dimension in non-restricted and restricted embedding space .

Model F (non-restricted) Model FS (restricted)
nsubj<-represent->dobj rcmod->return->prep->to->pobj appos->member->prep->of->pobj->team->nn nn<-return->prep->to->pobj nsubj<-die->dobj nsubj<-return->prep->to->pobj nsubj<-speak->prep->about->pobj rcmod->leave->dobj appos->champion->poss nsubj<-quit->dobj terns and increased fractions of Freebase training facts. Although FSL starts at a lower performance than R15-Joint when no Freebase training facts are present, it outperforms R15-Joint and a plain matrix factorization model by a substantial margin when provided with more than 7.5% of Freebase training facts. This indicates that, in addition to being much faster than R15-Joint, it can make better use of provided rules and few training facts. We attribute this to the Bayesian personalized ranking loss instead of the logistic loss used in Rocktäschel et al. (2015). The former is compatible with our ruleinjection method, but not with the approach of maximizing the expectation of propositional rules used by R15-Joint.

Injecting Knowledge from WordNet
The main purpose of this work is to be able to incorporate rules from external resources for aid-ing relation extraction. We use WordNet hypernyms to generate rules for the NYT dataset. To this end we iterate over all surface form patterns in the dataset and attempt to replace words in the pattern by their hypernyms. If the resulting pattern is contained in the dataset, we generate the corresponding rule. For instance, we generate a rule appos->diplomat->amod ⇒ appos->official->amod since both patterns are contained in the NYT dataset and we know from WordNet that a diplomat is an official. This leads to 427 rules from WordNet that we subsequently annotate manually to obtain 36 high-quality rules. Note that none of these rules directly imply a Freebase relation. Although the test relations all originate from Freebase, we still hope to see improvements by transitive effects, i.e., better surface form representations that in turn help to predict Freebase facts.
We show results obtained by injecting these WordNet rules in Table 1 (column FSL). The weighted MAP measure increases by 2% with respect to model FS, and 4% compared to our reimplementation of the matrix factorization model F. This demonstrates that imposing a partial ordering based on implication rules can be used to incorporate logical commonsense knowledge and increase the quality of information extraction systems. Note that our evaluation setting guarantees that only indirect effects of the rules are measured, i.e., we do not use any rules directly implying test relations. This shows that injecting such rules influences the relation embedding space beyond only the relations explicitly stated in the rules. For example, injecting the rule appos<-father->appos ⇒ poss<-parent->appos can contribute to improved predictions for the test relation parent/child.

Visualizing Relation Embeddings
We provide a visual inspection of how the structure of the relation embedding space changes when rules are imposed. We select all relations involved in the WordNet rules, and gather them as columns in a single matrix, sorted by increasing 1 norm (values in the 100 dimensions are similarly sorted). Figures 2a  and 2b show the difference between model F (without injected rules) and FSL (with rules). The values of the embeddings in model FSL are more polarized, i.e., we observe stronger negative or positive components than for model F. Furthermore, FSL also reveals a clearer difference between the leftmost (mostly negative, more specific) and right-most (predominantly positive, more general) embeddings (i.e., a clearer separation between positive and negative values in the plot), which results from imposing the order relation in eq. (11) when injecting implications.

Efficiency of Lifted Injection of Rules
In order to get an idea of the time efficiency of injecting rules, we measure the time per epoch when restricting the program execution to a single 2.4GHz CPU core. We measure on average 6.33s per epoch without rules (model FS), against 6.76s and 6.97s when injecting the 36 high-quality WordNet rules and the unfiltered 427 rules (model FSL), respectively. Increasing the amount of injected rules from 36 to 427 leads to an increase of only 3% in computation time, even though in our setup all rule losses are used in every training batch. This confirms the high efficiency of our lifted rule injection method.

Asymmetric Character of Implications
In order to demonstrate that injecting implications conserves their asymmetric nature, we perform the following experiment. After incorporating highquality Wordnet rules r p ⇒ r q into model FSL we select all of the tuples t p that occur with relation r p in a training fact r p , t p . Matching these with relation r q should result in high values for the scores r q t p , if the implication holds. If however the tuples t q are selected from the training facts r q , t q , and matched with relation r p , the scores r p t q should be much lower if the inverse implication does not hold (in other words, if r q and r p are not equivalent). Table 3 lists the averaged results for 5 example rules, and the average over all relations in WordNet rules, both for the case with injected rules (model FSL), and without rules (model FS). For easier comparison, the scores are mapped to the unit interval via the sigmoid function. This quantity σ(r t) is often interpreted as the probability that the corresponding fact holds (Riedel et al., 2013), but because of the BPR-based training, only differences between scores play a role here. After injecting rules, the average scores of facts inferred by these rules (i.e., column σ(r q t p ) for model FSL) are always higher than for facts (incorrectly) inferred by the inverse rules (column σ(r p t q ) for model FSL). In the fourth example, the inverse rule leads to high scores as well (on average 0.79, vs. 0.98 for the actual rule). This is due to the fact that the daily and newspaper relations are more or less equivalent, such that the components of r p are not much below those of r q . For the last example (the ambassador ⇒ diplomat rule), the asymmetry in the implication is maintained, although the absolute scores are rather low for these two relations.
The results for model FS reflect how strongly the implications in either direction are latently present in the training data. We can only conclude that model FS manages to capture the similarity be-rule model FSL model FS r p ⇒ r q σ(r q t p ) σ(r p t q ) σ(r q t p ) σ(r p t q )  Table 3: Average of σ(r q t) over all inferred facts r q , t p for tuples t p from training items for relation r p , and vice versa, for Wordnet implications r p ⇒ r q , and model FSL (injected rules) vs. model FS (no rules).
tween relations, but not the asymmetric character of implications. For example, purely based on the training data, it appears to be more likely that the parent relation implies the father relation, than vice versa. This again demonstrates the importance and added value of injecting external rules capturing commonsense knowledge.

Conclusions
We presented a novel, fast approach for incorporating first-order implication rules into distributed representations of relations. We termed our approach 'lifted rule injection', as it avoids the costly grounding of first-order implication rules and is thus independent of the size of the domain of entities. By construction, these rules are satisfied for any observed or unobserved fact. The presented approach requires a restriction on the entity-tuple embedding space. However, experiments on a real-world dataset show that this does not impair the expressiveness of the learned representations. On the contrary, it appears to have a beneficial regularization effect. By incorporating rules generated from WordNet hypernyms, our model improved over a matrix factorization baseline for knowledge base completion. Especially for domains where annotation is costly and only small amounts of training facts are available, our approach provides a way to leverage external knowledge sources for inferring facts.
In future work, we want to extend the proposed ideas beyond implications towards general firstorder logic rules. We believe that supporting conjunctions, disjunctions and negations would enable to debug and improve representation learning based knowledge base completion. Furthermore, we want to integrate these ideas into neural methods beyond matrix factorization approaches.