Regularizing Relation Representations by First-order Implications

Methods for automated knowledge base construction often rely on trained fixed-length vector representations of relations and entities to predict facts. Recent work showed that such representations can be regularized to inject first-order logic formulae. This makes it possible to incorporate domain knowledge for improved prediction of facts, especially for uncommon relations. However, current approaches rely on propositionalization of formulae and thus do not scale to large sets of formulae or knowledge bases with many facts. Here we propose a method that imposes first-order constraints directly on relation representations, avoiding costly grounding of formulae. We show that our approach works well for implications between pairs of relations on artificial datasets.


Introduction
Many methods for automated knowledge base (KB) construction rely on learned relation and entity vector representations (Nickel et al., 2015). Such representations are hard to learn for relations with only a few supporting facts in KBs. Moreover, inference on KBs such as Freebase (Bollacker et al., 2008) could still benefit from common-sense knowledge contained in ontologies like WordNet (Miller, 1995) or PPDB (Ganitkevitch et al., 2013). It is thus desirable to be able to use various kinds of domain or ontological knowledge, for instance in the form of first-order logic formulae, to help knowledge base inference. Such formulae can make use of the learned representations and, in turn, help to learn better representations.
One way to incorporate logical formulae is to regularize relation and entity-pair representations (Rocktäschel et al., 2015). However, in their method first-order formulae need to be grounded for all entity pairs in the KB. As a result of this propositionalization, the method does not scale to large KBs or many formulae. Another recent method is based on imposing rules as constraints in an integer linear program (Wang et al., 2015). This approach suffers from a similar scalability problem, since every rule is imposed for all occurrences of facts in the training data.
To alleviate this computational bottleneck, we propose a method to incorporate first-order implications directly (and only) into relation representations. The idea is to map relation and entity-pair representations into a well-chosen subspace in which formulae can be expressed as direct regularizers of relation representations without imposing them on entity representations too. As such, the proposed method is suited for problems with large numbers of rules and facts.
Our approach is based on the concept of order-embeddings, introduced by Vendrov et al. (2016). Order-embeddings capture partial orderings, such as textual entailment, directly in vector representations. This idea can be extended to relation representations in KBs. In particular, we show how to construct order-embeddings that capture implications between relations, such that these implications hold for any possible entity pair.
The model presented here is also related to Kruszewski et al. (2015). They demonstrate that textual entailment can be captured by mapping real-valued vectors into (approximately) Boolean-valued vectors. This is achieved by requiring that the Boolean vector representations of more specific words or sentences are included in those of more general ones. Furthermore, these representations may be useful for modeling other types of logical relationships, such as negation or conjunction. Our goal is to extend the approach towards arbitrary first-order formulae between relations. As a first step, we therefore investigate whether restricting the relation embedding space to approximate Boolean vectors still allows us to reconstruct training facts and imposed implications.
The rest of the paper is organized as follows. We first revisit matrix factorization for KB construction, before introducing a factorization model that regularizes approximately Boolean relation representations to incorporate first-order implications (§2). Finally, we show empirical results on synthetic knowledge bases: we explore how enforcing restrictions on representations influences the ability to model the observed data, analyze the learned relation representations qualitatively, and investigate the impact of injecting implications (§3).

Model
Before introducing first-order regularization of relation representations, we revisit one possible model that uses relation (and entity-pair) representations to estimate the probability of a fact: the universal schema matrix factorization proposed by Riedel et al. (2013). Let R be a set of relations r and P a set of entity pairs (e_i, e_j) (which we will write as e from now on). We can represent facts, i.e., possible combinations of entity pairs and relations, as a binary matrix of size |P| × |R|. The probability that a particular relation and entity-pair combination is a valid fact can be modeled by the sigmoid of the dot product of the relation's vector representation v(r) and the entity pair's vector representation v(e):

    p = P(z = 1 | r, e) = σ(v(r)^T v(e)),    (1)

with the binary target variable z indicating the validity of the considered fact, and v(r), v(e) ∈ R^k. The representations v(r) and v(e) can be found by minimizing the negative log-likelihood of the given training facts (together with a set of negative facts) using stochastic gradient descent. The contribution to this loss from relation r and entity pair e takes the form

    L_F(r, e) = −z log p − (1 − z) log(1 − p),    (2)

with p short-hand for the probability in eq. (1).
In this paper we propose various forms of v(r) and v(e). When the representations are chosen to be unrestricted real-valued vectors, i.e., v(r) = ρ ∈ R^k and v(e) = e ∈ R^k for some fixed embedding length k, we recover the latent feature Model F of Riedel et al. (2013).
Note that often no explicit negative instances are available for training, in which case unobserved facts can be randomly sampled and assumed to be negative.
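This factorization model can be sketched in a few lines. The sizes, variable names (rho, e), and random initialization below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of Model F: unrestricted real-valued embeddings,
# rho for relations and e for entity pairs, of fixed length k.
k, n_relations, n_pairs = 10, 20, 50
rho = rng.normal(size=(n_relations, k))   # v(r) = rho
e = rng.normal(size=(n_pairs, k))         # v(e) = e

def fact_probability(r, p):
    """Probability that relation r holds for entity pair p, as in eq. (1)."""
    return sigmoid(rho[r] @ e[p])

def fact_loss(r, p, z):
    """Negative log-likelihood contribution of one fact with label z, as in eq. (2)."""
    prob = fact_probability(r, p)
    return -(z * np.log(prob) + (1 - z) * np.log(1 - prob))
```

In practice the parameters rho and e would be updated by stochastic gradient descent on this loss, with unobserved facts sampled as negatives.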

Non-Negative Embedding Space
With the model described above we have no control over the learned representations. However, the embeddings can gain useful properties once we restrict them in an appropriate way. We propose the following restrictions, motivated below: we require all components of v(e) to be non-negative, and we confine the relation representations v(r) to the unit hypercube (0, 1)^k.
We want to be able to model implications between relations by defining an order relation on their vector representations. An in-depth description of order-embeddings is given in Vendrov et al. (2016), but the main idea applied to relation representations is as follows. Consider a pair of relations r_p and r_q such that r_p implies r_q for any entity pair for which r_p holds (written shortly as 'r_p ⇒ r_q'). For their vector representations we require that the component-wise inequality v_i(r_p) ≤ v_i(r_q) holds (i = 1, …, k). Note that enforcing this locally for every relation pair also leads to globally consistent relation representations (e.g., imposing r_s ⇒ r_t and r_t ⇒ r_u satisfies r_s ⇒ r_u by construction). Relations that hold true more often will have larger entries, whereas relation vectors with the overall lowest values will represent the most specific relations (such as leaf nodes in an ontology).
If r_p ⇒ r_q holds, it needs to hold for any entity pair e. Thus, we require that

    ∀e ∈ P : σ(v(r_p)^T v(e)) ≤ σ(v(r_q)^T v(e)).

If v_i(r_p) ≤ v_i(r_q) (i = 1, …, k), and we restrict all components of v(e) to be non-negative, then by construction v(r_p)^T v(e) ≤ v(r_q)^T v(e), and with eq. (1), the above requirement is satisfied.
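The argument above can be checked numerically. The sketch below uses arbitrary illustrative vectors: v(r_p) is constructed to lie component-wise below v(r_q), and the entity-pair vectors are forced to be non-negative, so the dot products (and hence the fact probabilities) must be ordered for every sampled entity pair:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8

# The more general relation, and a relation component-wise below it.
v_rq = rng.uniform(0.0, 1.0, size=k)
v_rp = v_rq * rng.uniform(0.0, 1.0, size=k)   # v_rp[i] <= v_rq[i] for all i

for _ in range(1000):
    v_e = np.abs(rng.normal(size=k))          # non-negative entity-pair vector
    # Order of dot products carries over to sigma(...) since sigma is monotone.
    assert v_rp @ v_e <= v_rq @ v_e
```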
Besides the ability to capture pairwise implications, we also want to incorporate more complex first-order formulae, and need to be able to express these as a function of the relation and entity-pair representations. The approximate Boolean vectors discussed in Kruszewski et al. (2015) provide an attractive direction, but studying how they can be adapted to the relation extraction use case is beyond the scope of the current work. To pave the way for future work on incorporating arbitrary first-order constraints, we do however investigate whether constraining relation representations to the unit hypercube, v(r) ∈ (0, 1)^k, still allows us to reliably encode observed facts and impose implications.

Training Restricted Representations
There are different ways to impose the discussed restrictions on vector representations. In this work, we choose v(r) = σ(ρ), and v(e) = softplus(e) or exp(e), where softplus(e) = log(1 + exp e) is the component-wise smooth approximation of the rectified linear unit (ReLU), and again ρ ∈ R^k and e ∈ R^k. The imposed restrictions constrain the set of usable loss functions for training. Indeed, the lowest value of σ(v(r)^T v(e)) is 0.5, which makes training with the loss function in eq. (2) impractical. The problem can be avoided if the dot product v(r)^T v(e) is first mapped from the positive real axis to the entire real line. Among various options, we choose the logarithm, because

    σ(log(v(r)^T v(e))) = v(r)^T v(e) / (1 + v(r)^T v(e)),    (3)

such that the loss from eq. (2) simplifies to

    L_F(r, e) = log(1 + v(r)^T v(e)) − z log(v(r)^T v(e)).    (4)

The expression on the right-hand side of eq. (3) represents an alternative form of the probability in eq. (1) for training and predicting the validity of facts using non-negative embeddings. Note that since log and exp are inverse functions, choosing v(e) = exp(e) leads to values of log(v(r)^T v(e)) of the same order of magnitude as e, unlike the choice v(e) = softplus(e). This may be the reason why the former seems to work better in practice (see §3). Yet another option would be to construct an approximate Boolean factorization for both entity pairs and relations, with v(r) = σ(ρ) and v(e) = σ(e). Finding a suitable loss function is then less straightforward, but we tested the quadratic loss on v(r)^T v(e). As shown in the following section, this additional restriction reduces the ability of the model to reconstruct facts.
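The identity behind the logarithm mapping, σ(log x) = x / (1 + x), and the resulting positivity of the loss can be verified numerically. The sketch below uses randomly chosen illustrative parameters ρ and e:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
k = 10
rho = rng.normal(size=k)   # free relation parameters
e = rng.normal(size=k)     # free entity-pair parameters

v_r = sigmoid(rho)         # relation vector in the unit hypercube (0, 1)^k
v_e = np.exp(e)            # non-negative entity-pair vector

# sigma(log x) = 1 / (1 + 1/x) = x / (1 + x): the alternative probability.
x = v_r @ v_e
p_via_log = sigmoid(np.log(x))
p_direct = x / (1.0 + x)
assert np.isclose(p_via_log, p_direct)

# The corresponding fact loss for a positive fact (z = 1):
z = 1
loss = np.log1p(x) - z * np.log(x)
```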

Implication Regularization
We will refer to the loss term L_F introduced above as the fact loss, as it measures how well training facts are recovered with low-dimensional representations. To impose logical constraints, we add an additional loss term per rule, which we will call the implication loss L_I. As described above, the required order relation between two relations can be expressed in terms of their representations as v_i(r_p) ≤ v_i(r_q) (i = 1, …, k). We thus propose the following loss term for every implication r_p ⇒ r_q:

    L_I(r_p ⇒ r_q) = Σ_{i=1}^{k} softplus(ρ_{p,i} − ρ_{q,i}).    (5)

As before, other choices are possible. It is however essential that mainly positive values of ρ_{p,i} − ρ_{q,i} are penalized, which is obtained by applying the softplus function (see § 2.2). The difficulty in choosing an appropriate loss function is that its behavior needs to be compatible with the fact loss. For instance, the simple hard rectifier loss max(0, ρ_{p,i} − ρ_{q,i}) seems not to work in practice, as balancing both losses during optimization becomes difficult. The particular form of L_I in eq. (5) was obtained in a similar way to eq. (4), and originates from simplifying

    −Σ_{i=1}^{k} log σ(ρ_{q,i} − ρ_{p,i}).    (6)

We empirically found that this loss works well in practice and behaves in an intuitive way. For example, injecting the formulae r_p ⇒ r_q and r_q ⇒ r_p leads to roughly identical representations for both relations.
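As an illustration, here is a minimal sketch of an implication loss of this kind: a smooth (softplus) penalty summed over the components of ρ_p − ρ_q, which is near zero when the required order holds and grows when it is violated. All shapes and values below are illustrative assumptions:

```python
import numpy as np

def softplus(x):
    # Smooth approximation of the rectifier: log(1 + exp(x)).
    return np.log1p(np.exp(x))

def implication_loss(rho_p, rho_q):
    """Penalize components where rho_p exceeds rho_q (violating r_p => r_q)."""
    return np.sum(softplus(rho_p - rho_q))

rng = np.random.default_rng(3)
k = 10
rho_q = rng.normal(size=k)
satisfied = rho_q - 1.0   # rho_p well below rho_q, order holds
violated = rho_q + 1.0    # rho_p above rho_q, order violated
```

A satisfied implication should incur a much smaller penalty than a violated one, which is what makes the term usable as a regularizer alongside the fact loss.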

Experiments
To gain insights into the proposed models, we investigate their behavior on small-scale artificial KB inference datasets that we can adapt to different scenarios. Concretely, we sample facts for a predefined number of entities and relations. We then generate implications for sampled pairs of relations, add a fraction of the implied facts to the training data, and put the rest in a test set. This gives us control over how visible an implication is when training the fact representations in the KB.
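The sampling protocol above can be sketched as follows. All sizes, probabilities, and the particular implication pairs are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(4)
n_relations, n_pairs = 8, 17

# Sample observed facts as (relation, entity-pair) index tuples.
facts = {(r, p) for r in range(n_relations) for p in range(n_pairs)
         if rng.random() < 0.2}

# Implications r_p => r_q over sampled relation pairs (illustrative).
implications = [(4, 1), (7, 3), (4, 2)]

# Facts implied by the rules that are not already observed.
implied = sorted({(r_q, p) for (r_p, r_q) in implications
                  for (r, p) in facts if r == r_p} - facts)

# Reveal a fraction of the implied facts for training; the rest become test facts.
idx = rng.permutation(len(implied))
cut = int(0.2 * len(implied))
train_extra = [implied[i] for i in idx[:cut]]
test = [implied[i] for i in idx[cut:]]
```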
Fact Reconstruction in Non-Negative Space

We first investigate whether restricting the embedding spaces still allows us to reconstruct observed facts. To this end, we consider a dataset with 20 relations and 50 entities, leading to observations for 249 entity pairs. We calculate the F1 score for reconstructing all training facts, assuming that all unobserved facts are negative. Fig. 1 shows the result for different combinations of restrictions on the relation and entity-pair embedding spaces. Every model maps a relation r and an entity pair e into vector space via v(r) and v(e), where ρ and e denote the learned real-valued (i.e., non-restricted) representations before mapping into a non-negative subspace. The results are shown as a function of the embedding size k. Of the two models that satisfy both the relation and the entity-pair restriction, the one with v(e) = exp(e) works best, and we use it in the remainder of the experiments. As expected, imposing restrictions leads to a reduced ability to fit the data exactly and hence requires higher-dimensional vector representations of relations and entity pairs.
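The reconstruction metric can be sketched as a set-based F1 computation, where predictions thresholded at 0.5 are compared against the training facts and every unobserved fact is treated as negative. The example facts below are illustrative:

```python
def f1_score(predicted, actual):
    """F1 over sets of (relation, entity-pair) index tuples."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

actual = {(0, 1), (0, 2), (1, 3)}
predicted = {(0, 1), (1, 3), (2, 4)}
# precision = 2/3, recall = 2/3, so F1 = 2/3
score = f1_score(predicted, actual)
```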

Implication Regularization
To visualize what happens when regularizing relation representations based on given implications, we sample a small KB with 8 relations, 17 entity-pair observations, and the following five implications: r_4 ⇒ r_1, r_7 ⇒ r_3, r_4 ⇒ r_2, r_6 ⇒ r_4, and r_5 ⇒ r_4. We add 20% of the facts that can be inferred from these rules to the training data and use the rest as test data. Fig. 2(a) shows observed facts (dark blue), as well as test facts (light blue). With an embedding size of 15, L_F alone is able to perfectly reconstruct the training data, as shown in Fig. 2(b), but consequently overfits. In contrast, when imposing implications we can reconstruct the training facts and also predict the test facts that can be inferred from these implications (Fig. 2(c)). Note that in Fig. 2(b) the predictions are made with high confidence, whereas in Fig. 2(c) the reconstruction is not perfect, with predictions distributed between 0 and 1. This is because during training the loss related to some of the facts is influenced both by the implication loss and by a conflicting contribution from the fact loss (due to the random sampling of negative examples among the unobserved ones). Although this effect is an artifact of the small scale of the example (where unobserved facts are sampled more often than in large and sparse settings), it underlines the importance of properly weighting both loss terms, for which further research on large-scale data is needed.
The learned relation embeddings are visualized in Fig. 3. We can see that regularizing relation embeddings by implications leads to representations that satisfy the order imposed by the implications (see Fig. 3(b)).
For the final experiment, we again consider the dataset used for Fig. 1, but this time we inject 10 pairwise implications and add half of the additional facts that can be inferred from them to the training set. The other half is added to the test set, together with an equal number of sampled negative test facts. The F1 value on the test facts for different embedding sizes is shown in Fig. 4. We found that the implication loss successfully acts as a regularizer, yielding F1 scores of around 80% for predicting unobserved valid facts even at large embedding sizes, where a model without this regularization overfits drastically.

Conclusion and Future Work
We have presented a scalable method to incorporate first-order implications into relation representations for knowledge base inference. It alleviates the need for propositionalization of such formulae and we plan to use it to improve large-scale knowledge base inference with many formulae extracted from ontologies. We discussed and illustrated the method in a matrix factorization setting, but it can be applied to any model that produces relation and entity (or entity-pair) representations that can be mapped into non-negative space. In future work, we will investigate ways to efficiently incorporate more complex formulae as well, involving conjunctions, disjunctions, and negations.