Jointly Embedding Knowledge Graphs and Logical Rules



Introduction
Knowledge graphs (KGs) provide rich structured information and have become extremely useful resources for many NLP applications such as word sense disambiguation (Wasserman-Pritsker et al., 2015) and information extraction (Hoffmann et al., 2011). A typical KG represents knowledge as multi-relational data, stored in triples of the form (head entity, relation, tail entity), e.g., (Paris, Capital-Of, France). Although powerful in representing structured data, the symbolic nature of such triples makes KGs, especially large-scale KGs, hard to manipulate.

* Corresponding author: Quan Wang.
Recently, a promising approach, namely knowledge graph embedding, has been proposed and successfully applied to various KGs (Nickel et al., 2012; Socher et al., 2013; Bordes et al., 2014). The key idea is to embed components of a KG, including entities and relations, into a continuous vector space, so as to simplify the manipulation while preserving the inherent structure of the KG. The embeddings contain rich semantic information about entities and relations, and can significantly enhance knowledge acquisition and inference.
Most existing methods perform the embedding task based solely on fact triples (Wang et al., 2014; Nickel et al., 2016). The only requirement is that the learned embeddings should be compatible with those facts. While logical rules contain rich background information and are extremely useful for knowledge acquisition and inference (Jiang et al., 2012; Pujara et al., 2013), they have not been well studied in this task. Wei et al. (2015), among others, tried to leverage both embedding methods and logical rules for KG completion. In their work, however, rules are modeled separately from embedding methods, serving as post-processing steps, and thus do not help to obtain better embeddings. Rocktäschel et al. (2015) recently proposed a joint model which injects first-order logic into embeddings. But it focuses on the relation extraction task, and creates vector embeddings for entity pairs rather than individual entities. Since entities do not have their own embeddings, relations between unpaired entities cannot be effectively discovered (Chang et al., 2014).
In this paper we introduce KALE, a new approach that learns entity and relation Embeddings by jointly modeling Knowledge And Logic. Knowledge triples are taken as atoms and modeled by the translation assumption, i.e., relations act as translations between head and tail entities. A triple (e_i, r_k, e_j) is scored by ∥e_i + r_k − e_j∥_1, where e_i, r_k, and e_j are the vector embeddings of the entities and relation. The score is then mapped to the unit interval [0, 1] to indicate the truth value of that triple. Logical rules are taken as complex formulae constructed by combining atoms with logical connectives (e.g., ∧ and ⇒), and modeled by t-norm fuzzy logics (Hájek, 1998). The truth value of a rule is a composition of the truth values of the constituent atoms, defined by specific logical connectives. In this way, KALE represents triples and rules in a unified framework, as atomic and complex formulae respectively. Figure 1 gives a simple illustration of the framework. After unifying triples and rules, KALE minimizes a global loss involving both of them to obtain entity and relation embeddings. The learned embeddings are therefore compatible not only with triples but also with rules, making them more predictive for knowledge acquisition and inference.
The main contributions of this paper are summarized as follows. (i) We devise a unified framework that jointly models triples and rules to obtain more predictive entity and relation embeddings. The new framework KALE is general enough to handle any type of rules that can be represented as first-order logic formulae. (ii) We evaluate KALE with link prediction and triple classification tasks on WordNet (Miller, 1995) and Freebase (Bollacker et al., 2008). Experimental results show significant and consistent improvements over state-of-the-art methods. Particularly, joint embedding enhances the prediction of new facts which cannot even be directly inferred by pure logical inference, demonstrating the capability of KALE to learn more predictive embeddings.

Related Work
Recent years have seen rapid growth in KG embedding methods. Given a KG, such methods aim to encode its entities and relations into a continuous vector space, by using neural network architectures (Socher et al., 2013; Bordes et al., 2014), matrix/tensor factorization techniques (Nickel et al., 2011; Riedel et al., 2013; Chang et al., 2014), or Bayesian clustering strategies (Kemp et al., 2006; Xu et al., 2006; Sutskever et al., 2009). Among these methods, TransE, which models relations as translating operations, achieves a good trade-off between prediction accuracy and computational efficiency. Various extensions like TransH (Wang et al., 2014) and TransR (Lin et al., 2015b) were later proposed to further enhance the prediction accuracy of TransE. Most existing methods perform the embedding task based solely on triples contained in a KG. Some recent work tries to further incorporate other types of available information, e.g., relation paths (Neelakantan et al., 2015; Lin et al., 2015a; Luo et al., 2015), relation type-constraints (Krompaß et al., 2015), entity types, and entity descriptions (Zhong et al., 2015), to learn better embeddings.
Logical rules have been widely studied in knowledge acquisition and inference, usually on the basis of Markov logic networks (Richardson and Domingos, 2006; Bröcheler et al., 2010; Pujara et al., 2013; Beltagy and Mooney, 2014). Recently, there has been growing interest in combining logical rules and embedding models. Wei et al. (2015), among others, tried to utilize rules to refine predictions made by embedding models, via integer linear programming or Markov logic networks. In their work, however, rules are modeled separately from embedding models, and do not help obtain better embeddings. Rocktäschel et al. (2015) proposed a joint model that injects first-order logic into embeddings. But their work focuses on relation extraction, creating vector embeddings for entity pairs, and hence fails to discover relations between unpaired entities. This paper, in contrast, aims at learning more predictive embeddings by jointly modeling knowledge and logic. Since each entity has its own embedding, our approach can successfully make predictions between unpaired entities, providing greater flexibility for knowledge acquisition and inference.

Jointly Embedding Knowledge and Logic
We first describe the formulation of joint embedding. We are given a KG containing a set of triples K = {(e i , r k , e j )}, with each triple composed of two entities e i , e j ∈ E and their relation r k ∈ R.
Here E is the entity vocabulary and R the relation set. Besides the triples, we are given a set of logical rules L, either specified manually or extracted automatically. A logical rule is encoded, for example, in the form of ∀x, y : (x, r s , y) ⇒ (x, r t , y), stating that any two entities linked by relation r s should also be linked by relation r t . Entities and relations are associated with vector embeddings, denoted by e, r ∈ R d , representing their latent semantics. The proposed method, KALE, aims to learn these embeddings by jointly modeling knowledge triples K and logical rules L.

Overview
To enable joint embedding, a key ingredient of KALE is to unify triples and rules in terms of first-order logic (Rocktäschel et al., 2014; Rocktäschel et al., 2015). A triple (e_i, r_k, e_j) is taken as a ground atom which applies a relation r_k to a pair of entities e_i and e_j. Given a logical rule, it is first instantiated with concrete entities in the vocabulary E, resulting in a set of ground rules. For example, a universally quantified rule ∀x, y : (x, Capital-Of, y) ⇒ (x, Located-In, y) might be instantiated with the concrete entities Paris and France, giving the ground rule (Paris, Capital-Of, France) ⇒ (Paris, Located-In, France). A ground rule can then be interpreted as a complex formula, constructed by combining ground atoms with logical connectives (e.g., ∧ and ⇒).
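A minimal sketch of this grounding step, assuming triples are stored as Python tuples (the function name and data layout are ours, not from the paper's released code):

```python
from itertools import product

def ground_rule(rule, entities, observed):
    """Instantiate a rule of the form  forall x, y: (x, r_s, y) => (x, r_t, y)
    with concrete entities, keeping only ground rules with at least one
    constituent triple observed (as in KALE's training-set construction)."""
    r_s, r_t = rule
    grounded = []
    for x, y in product(entities, repeat=2):
        premise, conclusion = (x, r_s, y), (x, r_t, y)
        if premise in observed or conclusion in observed:
            grounded.append((premise, conclusion))
    return grounded

observed = {("Paris", "Capital-Of", "France")}
rules = ground_rule(("Capital-Of", "Located-In"), ["Paris", "France"], observed)
print(rules)
# [(('Paris', 'Capital-Of', 'France'), ('Paris', 'Located-In', 'France'))]
```

With only two entities, the rule yields four candidate groundings, of which a single one has an observed constituent triple and is kept.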
Let F denote the set of training formulae, both atomic (triples) and complex (ground rules). KALE further employs a truth function I : F → [0, 1] to assign a soft truth value to each formula, indicating how likely a triple holds or to what degree a ground rule is satisfied. The truth value of a triple is determined by the corresponding entity and relation embeddings. The truth value of a ground rule is determined by the truth values of the constituent triples, via specific logical connectives. In this way, KALE models triples and rules in a unified framework. See Figure 1 for an overview. Finally, KALE minimizes a global loss over the training formulae F to learn entity and relation embeddings compatible with both triples and rules. In what follows, we describe the key components of KALE, including triple modeling, rule modeling, and joint learning.

Triple Modeling
To model triples we follow TransE, as it is simple and efficient while achieving state-of-the-art predictive performance. Specifically, given a triple (e_i, r_k, e_j), we model the relation embedding r_k as a translation between the entity embeddings e_i and e_j, i.e., we want e_i + r_k ≈ e_j when the triple holds. The intuition here originates from linguistic regularities such as France − Paris = Germany − Berlin (Mikolov et al., 2013). In relational data, such an analogy holds because of the relation Capital-Of, through which we get Paris + Capital-Of = France and Berlin + Capital-Of = Germany. We then score each triple on the basis of ∥e_i + r_k − e_j∥_1, and define its soft truth value as

I(e_i, r_k, e_j) = 1 − (1/(3√d)) ∥e_i + r_k − e_j∥_1,    (1)

where d is the dimension of the embedding space. It is easy to see that I(e_i, r_k, e_j) ∈ [0, 1] under the constraints ∥e_i∥_2 ≤ 1, ∥e_j∥_2 ≤ 1, and ∥r_k∥_2 ≤ 1.² I(e_i, r_k, e_j) is expected to be large if the triple holds, and small otherwise.
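Assuming the normalization of Eq. (1) with unit-norm embeddings, the soft truth value of a triple can be computed in a few lines of NumPy (a sketch, not the authors' released code):

```python
import numpy as np

def triple_truth(e_i, r_k, e_j):
    """Soft truth value of a triple under the translation assumption.

    With ||e_i||_2, ||r_k||_2, ||e_j||_2 <= 1, the l1 distance is at most
    3*sqrt(d), so the returned value always lies in [0, 1].
    """
    d = e_i.shape[0]
    score = np.abs(e_i + r_k - e_j).sum()  # l1 distance
    return 1.0 - score / (3.0 * np.sqrt(d))

# A perfect translation e_i + r_k = e_j yields truth value 1.
e_i = np.array([0.3, 0.1])
r_k = np.array([0.2, 0.4])
print(triple_truth(e_i, r_k, e_i + r_k))  # 1.0
```

The larger the translation error, the closer the value falls toward 0, matching the intended reading of the score as a degree of truth.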

Rule Modeling
To model rules we use t-norm fuzzy logics (Hájek, 1998), which define the truth value of a complex formula as a composition of the truth values of its constituents, through specific t-norm based logical connectives. We follow Rocktäschel et al. (2015) and use the product t-norm. The compositions associated with logical conjunction (∧), disjunction (∨), and negation (¬) are defined as follows:

I(f_1 ∧ f_2) = I(f_1) · I(f_2),
I(f_1 ∨ f_2) = I(f_1) + I(f_2) − I(f_1) · I(f_2),
I(¬f_1) = 1 − I(f_1),

where f_1 and f_2 are two constituent formulae, either atomic or complex. Given these compositions, the truth value of any complex formula can be calculated recursively, e.g., by rewriting an implication f_1 ⇒ f_2 as ¬f_1 ∨ f_2. This paper considers two types of rules. The first type is ∀x, y : (x, r_s, y) ⇒ (x, r_t, y). Given a ground rule f ≜ (e_m, r_s, e_n) ⇒ (e_m, r_t, e_n), the truth value is calculated as:

I(f) = I(e_m, r_s, e_n) · I(e_m, r_t, e_n) − I(e_m, r_s, e_n) + 1,    (2)

where I(·, ·, ·) is the truth value of a constituent triple, defined by Eq. (1). The second type is ∀x, y, z : (x, r_{s1}, y) ∧ (y, r_{s2}, z) ⇒ (x, r_t, z). Given a ground rule f ≜ (e_ℓ, r_{s1}, e_m) ∧ (e_m, r_{s2}, e_n) ⇒ (e_ℓ, r_t, e_n), the truth value is:

I(f) = I(e_ℓ, r_{s1}, e_m) · I(e_m, r_{s2}, e_n) · I(e_ℓ, r_t, e_n) − I(e_ℓ, r_{s1}, e_m) · I(e_m, r_{s2}, e_n) + 1.    (3)

The larger the truth values are, the better the ground rules are satisfied. Besides these two types of rules, the KALE framework is general enough to handle any rules that can be represented as first-order logic formulae. The investigation of other types of rules is left for future work.
² Note that ∥x∥_1 ≤ √d ∥x∥_2 for any x ∈ R^d, according to the Cauchy-Schwarz inequality; together with the ℓ_2-constraints, this bounds ∥e_i + r_k − e_j∥_1 by 3√d.
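Under the product t-norm, Eqs. (2) and (3) follow from rewriting the implication f_1 ⇒ f_2 as ¬f_1 ∨ f_2. A minimal sketch of the compositions (function names are ours):

```python
def implication(a, b):
    """I(f1 => f2) = I(~f1 v f2) = I(f1)*I(f2) - I(f1) + 1 (product t-norm)."""
    return a * b - a + 1.0

def rule_type1(i_s, i_t):
    """(x, r_s, y) => (x, r_t, y): Eq. (2), with triple truths i_s and i_t."""
    return implication(i_s, i_t)

def rule_type2(i_s1, i_s2, i_t):
    """(x, r_s1, y) ^ (y, r_s2, z) => (x, r_t, z): Eq. (3); the premise
    conjunction composes as the product i_s1 * i_s2."""
    return implication(i_s1 * i_s2, i_t)

print(rule_type1(1.0, 1.0))       # 1.0: premise and conclusion both hold
print(rule_type2(1.0, 1.0, 0.0))  # 0.0: premise holds, conclusion fails
```

Note that a rule with a false premise is vacuously satisfied: rule_type1(0.0, x) evaluates to 1 for any x, exactly as material implication prescribes.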

Joint Learning
After unifying triples and rules as atomic and complex formulae, we minimize a global loss over this general representation to learn entity and relation embeddings. We first construct a training set F containing all positive formulae, including (i) observed triples, and (ii) ground rules in which at least one constituent triple is observed. Then we minimize a margin-based ranking loss, enforcing positive formulae to have larger truth values than negative ones:

min Σ_{f+ ∈ F} Σ_{f− ∈ N_{f+}} [γ − I(f+) + I(f−)]_+ ,    (4)

Here f+ ∈ F is a positive formula, f− ∈ N_{f+} a negative one constructed for f+, γ a margin separating positive and negative formulae, and [x]_+ ≜ max{0, x}. If f+ = (e_i, r_k, e_j) is a triple, we construct f− by replacing either e_i or e_j with a random entity e ∈ E, and calculate its truth value according to Eq. (1). For example, we might generate a negative instance (Paris, Capital-Of, Germany) for the triple (Paris, Capital-Of, France). If f+ ≜ (e_m, r_s, e_n) ⇒ (e_m, r_t, e_n) or (e_ℓ, r_{s1}, e_m) ∧ (e_m, r_{s2}, e_n) ⇒ (e_ℓ, r_t, e_n) is a ground rule, we construct f− by replacing r_t in the consequent with a random relation r ∈ R, and calculate its truth value according to Eq. (2) or Eq. (3). For example, given a ground rule (Paris, Capital-Of, France) ⇒ (Paris, Located-In, France), a possible negative instance (Paris, Capital-Of, France) ⇒ (Paris, Has-Spouse, France) could be generated. We believe that most instances (both triples and ground rules) generated in this way are truly negative. Stochastic gradient descent in mini-batch mode is used to carry out the minimization. To satisfy the ℓ_2-constraints, e and r are projected to the unit ℓ_2-ball before each mini-batch. Embeddings learned in this way are compatible with not only triples but also rules.
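The per-pair loss term, negative sampling, and projection steps above can be sketched as follows; the helper names are ours, and this is a minimal illustration rather than the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge(gamma, pos_truth, neg_truth):
    """One term of the ranking loss of Eq. (4): [gamma - I(f+) + I(f-)]_+."""
    return max(0.0, gamma - pos_truth + neg_truth)

def corrupt_triple(triple, entities):
    """Build a negative triple by replacing the head or tail entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (str(rng.choice(entities)), r, t)
    return (h, r, str(rng.choice(entities)))

def project_to_unit_ball(v):
    """l2-projection applied to every embedding before each mini-batch."""
    n = np.linalg.norm(v)
    return v / n if n > 1.0 else v

# A well-separated pair (high positive truth, low negative truth)
# incurs zero loss under margin gamma = 0.12.
print(hinge(0.12, 0.9, 0.2))  # 0.0
```

In a full training loop these pieces would be combined with the truth functions of Eqs. (1)-(3) and mini-batch SGD; only the loss shape and the projection are shown here.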

Discussions
Complexity. We compare KALE with several state-of-the-art embedding methods in space complexity and time complexity (per iteration) during learning. The results are given in Table 1, where d is the dimension of the embedding space, and n_e/n_r/n_t/n_g is the number of entities/relations/triples/ground rules. The results indicate that incorporating additional rules does not significantly increase the space or time complexity of KALE, keeping the model complexity almost the same as that of TransE (optimal among the methods listed in the table). Note, however, that KALE needs to ground universally quantified rules before learning, which further requires O(n_u n_t / n_r) time. Here, n_u is the number of universally quantified rules, and n_t/n_r is the average number of observed triples per relation. During grounding, we select those ground rules with at least one triple observed. Grounding is required only once before learning, and is not included in the iterations.
Extensions. Our approach is quite general.
(i) Besides TransE, a variety of embedding methods, e.g., those listed in Table 1, can be used for triple modeling (Section 3.2), as long as we further define a mapping f : R → [0, 1] to map original scores to soft truth values. (ii) Besides the two types of rules introduced in Section 3.3, other types of rules can also be handled as long as they can be represented as first-order logic formulae. (iii) Besides the product t-norm, other types of t-norm based fuzzy logics can be used for rule modeling (Section 3.3), e.g., the Łukasiewicz t-norm used in probabilistic soft logic (Bröcheler et al., 2010) and the minimum t-norm used in fuzzy description logic (Stoilos et al., 2007).
(iv) Besides the pairwise ranking loss, other types of loss functions can be designed for joint learning (Section 3.4), e.g., the pointwise squared loss or the logarithmic loss (Rocktäschel et al., 2014;Rocktäschel et al., 2015).

Experiments
We empirically evaluate KALE with two tasks: (i) link prediction and (ii) triple classification. We further create logical rules for each dataset, in the form of ∀x, y : (x, r_s, y) ⇒ (x, r_t, y) or ∀x, y, z : (x, r_{s1}, y) ∧ (y, r_{s2}, z) ⇒ (x, r_t, z). To do so, we first run TransE to get entity and relation embeddings, and calculate the truth value of each such rule according to Eq. (2) or Eq. (3). We then rank all such rules by their truth values and manually filter those ranked at the top. We finally create 47 rules on FB122 and 14 on WN18 (see Table 2 for examples). The rules are then instantiated with concrete entities (grounding). Ground rules in which at least one constituent triple is observed in the training set are used in joint learning.
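As an illustration of this rule-creation step, one might enumerate candidate rules r_s ⇒ r_t, score each with a truth function (e.g., Eq. (2) evaluated under the pre-trained TransE embeddings), and keep the top of the ranking for manual filtering. The sketch below assumes a user-supplied truth function; all names are illustrative:

```python
def rank_candidate_rules(relations, rule_truth):
    """Enumerate all ordered pairs of distinct relations as candidate rules
    r_s => r_t, and sort them by their (externally computed) truth value."""
    candidates = [(rs, rt) for rs in relations for rt in relations if rs != rt]
    return sorted(candidates, key=lambda pair: rule_truth(*pair), reverse=True)

# Toy truth values standing in for Eq. (2) scores under TransE embeddings.
truth = {("Capital-Of", "Located-In"): 0.98, ("Located-In", "Capital-Of"): 0.41}
ranked = rank_candidate_rules(["Capital-Of", "Located-In"],
                              lambda rs, rt: truth[(rs, rt)])
print(ranked[0])  # ('Capital-Of', 'Located-In')
```

The plausible direction (capitals are located in their countries, not the reverse) surfaces at the top, which is exactly what the manual filtering pass then confirms or rejects.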
Note that some of the test triples can be inferred by directly applying these rules on the training set (pure logical inference). On each dataset, we further split the test set into two parts, test-I and test-II. The former contains triples that cannot be directly inferred by pure logical inference, and the latter the remaining test triples. Table 3 gives some statistics of the datasets, including the number of entities, relations, triples in training/validation/test-I/test-II set, and ground rules.
We further test our approach in three different scenarios. (i) KALE-Trip uses triples alone to perform the embedding task, i.e., only the training triples are included in the optimization Eq. (4). It is a linearly transformed version of TransE; the only difference is that relation embeddings are normalized in KALE-Trip but not in TransE. (ii) KALE-Pre first repeats pure logical inference on the training set and adds inferred triples as additional training data, until no further triples can be inferred. Both original and inferred triples are then included in the optimization. For example, given a logical rule ∀x, y : (x, r_s, y) ⇒ (x, r_t, y), a new triple (e_i, r_t, e_j) can be inferred if (e_i, r_s, e_j) is observed in the training set, and both triples will be used as training instances for embedding. (iii) KALE-Joint is the joint learning scenario, which considers both training triples and ground rules in the optimization. In the aforementioned example, the training triple (e_i, r_s, e_j) and the ground rule (e_i, r_s, e_j) ⇒ (e_i, r_t, e_j) will be used in the training process of KALE-Joint, without explicitly incorporating the triple (e_i, r_t, e_j). Among the methods, TransE/TransH/TransR and KALE-Trip use only triples, while KALE-Pre/KALE-Joint further incorporate rules, before or during embedding. Implementation details. We use the publicly released code for TransE,⁴ and the code provided by Lin et al. (2015b) for TransH and TransR.⁵ KALE is implemented in Java. Note that Lin et al. (2015b) initialized TransR with the results of TransE. However, to ensure fair comparison, we randomly initialize all the methods in our experiments. For all the methods, we create 100 mini-batches on each dataset, and tune the embedding dimension d in {20, 50, 100}.

⁴ https://github.com/glorotxa/SME
⁵ https://github.com/mrlyk423/relation_extraction
For TransE, TransH, and TransR, which score a triple by a distance in R_+, we tune the learning rate η in {0.001, 0.01, 0.1}, and the margin γ in {1, 2, 3, 4}. For KALE, which scores a triple (as well as a ground rule) by a soft truth value in the unit interval [0, 1], we set the learning rate η in {0.01, 0.02, 0.05, 0.1}, and the margin γ in {0.1, 0.12, 0.15, 0.2}. KALE allows triples and rules to have different weights, with the former fixed to 1 and the latter (denoted by λ) selected from {0.001, 0.01, 0.1, 1}.

Link Prediction
This task is to complete a triple (e_i, r_k, e_j) with e_i or e_j missing, i.e., predict e_i given (r_k, e_j) or predict e_j given (e_i, r_k). Evaluation protocol. We follow the same evaluation protocol used in TransE. For each test triple (e_i, r_k, e_j), we replace the head entity e_i by every entity e′_i in the dictionary, and calculate the truth value (or distance) for the corrupted triple (e′_i, r_k, e_j). Ranking the truth values in descending order (or the distances in ascending order), we get the rank of the correct entity e_i. Similarly, we can get another rank by corrupting the tail entity e_j. Aggregated over all the test triples, we report three metrics: (i) the mean reciprocal rank (MRR), (ii) the median of the ranks (MED), and (iii) the proportion of ranks no larger than n (HITS@N). We do not report the averaged rank (i.e., the "Mean Rank" metric used in the TransE evaluation), since it is usually sensitive to outliers (Nickel et al., 2016).
Note that a corrupted triple may itself exist in the KG and should then be taken as valid. Consider a test triple (Paris, Located-In, France) and a possible corruption (Lyon, Located-In, France). Both triples are valid. In this case, ranking Lyon before the correct answer Paris should not be counted as an error. To avoid such errors, we follow common practice and remove those corrupted triples that exist in the training, validation, or test set before computing the ranks. That is, we remove Lyon from the candidate list before getting the rank of Paris in the aforementioned example. We call the original setting "raw" and the new setting "filtered".
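Assuming candidates are indexed by integers and higher truth values are better, the filtered ranking and the reported metrics can be sketched as follows (a minimal illustration, not the authors' evaluation script):

```python
import numpy as np

def filtered_rank(truth_values, correct_idx, known_idx):
    """Rank of the correct entity under the 'filtered' setting.

    truth_values: soft truth value for every candidate entity (np.ndarray).
    known_idx:    indices of corrupted candidates that form valid triples
                  elsewhere (train/valid/test); removed before ranking.
    """
    mask = np.ones(len(truth_values), dtype=bool)
    drop = [i for i in known_idx if i != correct_idx]
    if drop:
        mask[drop] = False
    kept = np.flatnonzero(mask)
    order = kept[np.argsort(-truth_values[kept])]  # descending truth value
    return int(np.where(order == correct_idx)[0][0]) + 1

def metrics(ranks, n=10):
    """MRR, MED, and HITS@n aggregated over a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return {"MRR": float((1.0 / ranks).mean()),
            "MED": float(np.median(ranks)),
            "HITS": float((ranks <= n).mean())}

tv = np.array([0.9, 0.8, 0.7, 0.6])
print(filtered_rank(tv, correct_idx=2, known_idx={0}))  # 2: index 0 filtered
```

Without filtering (empty known_idx), the same correct entity would receive rank 3, which shows how the filtered setting removes the spurious penalty.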
Optimal configurations. For each of the methods to be compared, we tune its hyperparameters in the ranges specified in Section 4.1, and select a best model that leads to the highest filtered MRR score on the validation set (with a total of 500 epochs over the training data). The optimal configurations for KALE are: d = 100, η = 0.05, γ = 0.12, and λ = 1 on FB122; d = 50, η = 0.05, γ = 0.2, and λ = 0.1 on WN18. To better see and understand the effects of rules, we use the same configuration for KALE-Trip, KALE-Pre, and KALE-Joint on each dataset.
Results. Table 4 and Table 5 report the link prediction results. To better understand how the joint embedding scenario learns more predictive embeddings, on each dataset we further split the test-I set into two parts. Given a triple (e_i, r_k, e_j) in the test-I set, we assign it to the first part if relation r_k is covered by the rules, and to the second part otherwise. We call the two parts Test-Incl and Test-Excl respectively. Table 6 compares the performance of KALE-Trip and KALE-Joint on the two parts. The results show that KALE-Joint outperforms KALE-Trip on both parts, but the improvements on Test-Incl are much more significant than those on Test-Excl. Take the filtered setting on WN18 as an example. On Test-Incl, KALE-Joint increases MRR by 55.7%, decreases MED by 26.9%, and increases HITS@10 by 38.2%. On Test-Excl, however, MRR rises by 3.1%, MED remains the same, and HITS@10 rises by only 0.3%. This observation indicates that jointly embedding triples and rules helps to learn more predictive embeddings, especially for relations used to construct the rules. This might be the main reason that KALE-Joint can make better predictions even beyond the scope of pure logical inference.

Triple Classification
This task is to verify whether an unobserved triple (e_i, r_k, e_j) is correct or not. Evaluation protocol. We follow an evaluation protocol similar to that used in TransH (Wang et al., 2014). We first create labeled data for evaluation. For each triple in the test or validation set (i.e., a positive triple), we construct 10 negative triples by randomly corrupting the entities, 5 at the head position and the other 5 at the tail position. To make the negative triples as difficult as possible, we corrupt a position using only entities that have appeared in that position, and further ensure that the corrupted triples do not exist in the training, validation, or test set. We simply use the truth values (or distances) to classify triples: triples with large truth values (or small distances) tend to be predicted as positive. To evaluate, we first rank the triples associated with each specific relation (in descending order of truth value, or in ascending order of distance), and calculate the average precision for that relation. We then report on the test sets the mean average precision (MAP) aggregated over different relations.
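The per-relation average precision and the final MAP can be sketched as below; this is the textbook AP computation, not the authors' evaluation script:

```python
import numpy as np

def average_precision(labels_in_ranked_order):
    """AP for one relation, given 1/0 labels sorted by predicted score."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)
    if labels.sum() == 0:
        return 0.0
    # Precision at each cut-off k, averaged over the positive positions.
    precision_at_k = np.cumsum(labels) / (np.arange(labels.size) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())

def mean_average_precision(per_relation_labels):
    """MAP: the mean of per-relation average precisions."""
    return float(np.mean([average_precision(l) for l in per_relation_labels]))

# Positives at ranks 1 and 3 of three candidates: AP = (1/1 + 2/3) / 2 = 5/6.
print(average_precision([1, 0, 1]))
```

Ranking by truth value (or by negated distance) produces exactly the label ordering that this computation expects.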
Optimal configurations. The hyperparameters of each method are again tuned in the ranges specified in Section 4.1, and the best models are selected by maximizing MAP on the validation set. The optimal configurations for KALE are: d = 100, η = 0.1, γ = 0.2, and λ = 0.1 on FB122; d = 100, η = 0.1, γ = 0.2, and λ = 0.001 on WN18. Again, we use the same configuration for KALE-Trip, KALE-Pre, and KALE-Joint on each dataset.
Results. Table 7 shows the results on the test-I, test-II, and test-all sets of our datasets. From the results, we can see that: (i) KALE-Pre and KALE-Joint outperform the other methods, which use triples alone, on almost all the test sets, demonstrating the superiority of incorporating logical rules. (ii) KALE-Joint performs better than KALE-Pre on the test-I sets, i.e., on triples that cannot be directly inferred by performing pure logical inference on the training set. This is similar to the observation in the link prediction task, demonstrating that the joint embedding scenario learns more predictive embeddings and makes predictions beyond the capability of pure logical inference.

Conclusion and Future Work
In this paper, we propose a new method for jointly embedding knowledge graphs and logical rules, referred to as KALE. The key idea is to represent and model triples and rules in a unified framework. Specifically, triples are represented as atomic formulae and modeled by the translation assumption, while rules are represented as complex formulae and modeled by t-norm fuzzy logics. A global loss on both atomic and complex formulae is then minimized to perform the embedding task. Embeddings learned in this way are compatible not only with triples but also with rules, making them more useful for knowledge acquisition and inference. We evaluate KALE with the link prediction and triple classification tasks on WordNet and Freebase data. Experimental results show that joint embedding brings significant and consistent improvements over state-of-the-art methods. More importantly, it obtains more predictive embeddings and makes better predictions even beyond the scope of pure logical inference.
For future work, we would like to (i) investigate the efficacy of incorporating other types of logical rules, such as ∀x, y, z : (x, Capital-Of, y) ⇒ ¬(x, Capital-Of, z); (ii) investigate the possibility of modeling logical rules using only relation embeddings, as suggested by Demeester et al. (2016), e.g., modeling the above rule using only the embedding associated with Capital-Of, which avoids grounding, a step that can be time and space inefficient especially for complicated rules; and (iii) investigate the use of automatically extracted rules, which are no longer hard rules and hence require tolerance of uncertainty.