Improving Knowledge Graph Embedding Using Simple Constraints

Embedding knowledge graphs (KGs) into continuous vector spaces is a focus of current research. Early works performed this task via simple models developed over KG triples. Recent attempts focused on either designing more complicated triple scoring models, or incorporating extra information beyond triples. This paper, by contrast, investigates the potential of using very simple constraints to improve KG embedding. We examine non-negativity constraints on entity representations and approximate entailment constraints on relation representations. The former help to learn compact and interpretable representations for entities. The latter further encode regularities of logical entailment between relations into their distributed representations. These constraints impose prior beliefs upon the structure of the embedding space, without negative impacts on efficiency or scalability. Evaluation on WordNet, Freebase, and DBpedia shows that our approach is simple yet surprisingly effective, significantly and consistently outperforming competitive baselines. The constraints imposed indeed improve model interpretability, leading to a substantially increased structuring of the embedding space. Code and data are available at https://github.com/iieir-km/ComplEx-NNE_AER.


Introduction
The past decade has witnessed great achievements in building web-scale knowledge graphs (KGs), e.g., Freebase (Bollacker et al., 2008), DBpedia (Lehmann et al., 2015), and Google's Knowledge Vault (Dong et al., 2014). A typical KG is a multi-relational graph composed of entities as nodes and relations as different types of edges, where each edge is represented as a triple of the form (head entity, relation, tail entity). Such KGs contain rich structured knowledge, and have proven useful for many NLP tasks (Wasserman-Pritsker et al., 2015; Hoffmann et al., 2011; Yang and Mitchell, 2017).

* Corresponding author: Quan Wang.
Recently, the concept of knowledge graph embedding has been presented and has quickly become a hot research topic. The key idea is to embed components of a KG (i.e., entities and relations) into a continuous vector space, so as to simplify manipulation while preserving the inherent structure of the KG. Early works on this topic learned such vectorial representations (i.e., embeddings) via just simple models developed over KG triples (Bordes et al., 2011, 2013; Jenatton et al., 2012; Nickel et al., 2011). Recent attempts focused on either designing more complicated triple scoring models (Socher et al., 2013; Bordes et al., 2014; Wang et al., 2014; Lin et al., 2015b; Xiao et al., 2016; Nickel et al., 2016b; Trouillon et al., 2016; Liu et al., 2017), or incorporating extra information beyond KG triples (Chang et al., 2014; Zhong et al., 2015; Lin et al., 2015a; Neelakantan et al., 2015; Luo et al., 2015b; Xie et al., 2016a,b; Xiao et al., 2017). See (Wang et al., 2017) for a thorough review. This paper, by contrast, investigates the potential of using very simple constraints to improve the KG embedding task. Specifically, we examine two types of constraints: (i) non-negativity constraints on entity representations and (ii) approximate entailment constraints over relation representations. By using the former, we learn compact representations for entities, which would naturally induce sparsity and interpretability (Murphy et al., 2012). By using the latter, we further encode regularities of logical entailment between relations into their distributed representations, which might be advantageous to downstream tasks like link prediction and relation extraction (Rocktäschel et al., 2015; Guo et al., 2016). These constraints impose prior beliefs upon the structure of the embedding space, and will help us to learn more predictive embeddings, without significantly increasing the space or time complexity.
Our work has some similarities to those which integrate logical background knowledge into KG embedding (Rocktäschel et al., 2015; Guo et al., 2016, 2018). Most of such works, however, need grounding of first-order logic rules. The grounding process could be time and space inefficient especially for complicated rules. To avoid grounding, Demeester et al. (2016) tried to model rules using only relation representations. But their work creates vector representations for entity pairs rather than individual entities, and hence fails to handle unpaired entities. Moreover, it can only incorporate strict, hard rules which usually require extensive manual effort to create. Minervini et al. (2017b) proposed adversarial training which can integrate first-order logic rules without grounding. But their work, again, focuses on strict, hard rules. Minervini et al. (2017a) tried to handle uncertainty of rules. But their work assigns the same confidence level to all rules, and considers only equivalence and inversion of relations, which might not always be available in a given KG.
Our approach differs from the aforementioned works in that: (i) it imposes constraints directly on entity and relation representations without grounding, and can easily scale up to large KGs; (ii) the constraints, i.e., non-negativity and approximate entailment derived automatically from statistical properties, are quite universal, requiring no manual effort and applicable to almost all KGs; (iii) it learns an individual representation for each entity, and can successfully make predictions between unpaired entities.
We evaluate our approach on the publicly available KGs WordNet, Freebase, and DBpedia.
Experimental results indicate that our approach is simple yet surprisingly effective, achieving significant and consistent improvements over competitive baselines, but without negative impacts on efficiency or scalability. The non-negativity and approximate entailment constraints indeed improve model interpretability, resulting in a substantially increased structuring of the embedding space.
The remainder of this paper is organized as follows. We first review related work in Section 2, and then detail our approach in Section 3. Experiments and results are reported in Section 4, followed by concluding remarks in Section 5.

Related Work
Recent years have seen growing interest in learning distributed representations for entities and relations in KGs, a.k.a. KG embedding. Early works on this topic devised very simple models to learn such distributed representations, solely on the basis of triples observed in a given KG, e.g., TransE, which takes relations as translating operations between head and tail entities (Bordes et al., 2013), and RESCAL, which models triples through bilinear operations over entity and relation representations (Nickel et al., 2011). Later attempts roughly fell into two groups: (i) those which tried to design more complicated triple scoring models, e.g., the TransE extensions (Wang et al., 2014; Lin et al., 2015b; Ji et al., 2015), the RESCAL extensions (Yang et al., 2015; Nickel et al., 2016b; Trouillon et al., 2016; Liu et al., 2017), and the (deep) neural network models (Socher et al., 2013; Bordes et al., 2014; Shi and Weninger, 2017; Schlichtkrull et al., 2017; Dettmers et al., 2018); (ii) those which tried to integrate extra information beyond triples, e.g., entity types (Xie et al., 2016b), relation paths (Neelakantan et al., 2015; Lin et al., 2015a), and textual descriptions (Xie et al., 2016a; Xiao et al., 2017). Please refer to (Nickel et al., 2016a; Wang et al., 2017) for a thorough review of these techniques. In this paper, we show the potential of using very simple constraints (i.e., non-negativity constraints and approximate entailment constraints) to improve KG embedding, without significantly increasing the model complexity.
A line of research related to ours is KG embedding with logical background knowledge incorporated (Rocktäschel et al., 2015; Guo et al., 2016, 2018). But most of such works require grounding of first-order logic rules, which is time and space inefficient especially for complicated rules. To avoid grounding, Demeester et al. (2016) modeled rules using only relation representations, and Minervini et al. (2017b) proposed adversarial training. Both works, however, can only handle strict, hard rules which usually require extensive effort to create. Minervini et al. (2017a) tried to handle uncertainty of background knowledge. But their work considers only equivalence and inversion between relations, which might not always be available in a given KG. Our approach, in contrast, imposes constraints directly on entity and relation representations without grounding. And the constraints used are quite universal, requiring no manual effort and applicable to almost all KGs.
Non-negativity has long been a subject studied in various research fields. Previous studies reveal that non-negativity could naturally induce sparsity and, in most cases, better interpretability (Lee and Seung, 1999). In many NLP-related tasks, non-negativity constraints are introduced to learn more interpretable word representations, which capture the notion of semantic composition (Murphy et al., 2012; Luo et al., 2015a; Fyshe et al., 2015). In this paper, we investigate the ability of non-negativity constraints to learn more accurate KG embeddings with good interpretability.

Our Approach
This section presents our approach. We first introduce a basic embedding technique to model triples in a given KG ( § 3.1). Then we discuss the nonnegativity constraints over entity representations ( § 3.2) and the approximate entailment constraints over relation representations ( § 3.3). And finally we present the overall model ( § 3.4).

A Basic Embedding Model
We choose ComplEx (Trouillon et al., 2016) as our basic embedding model, since it is simple and efficient, achieving state-of-the-art predictive performance. Specifically, suppose we are given a KG containing a set of triples O = {(e_i, r_k, e_j)}, with each triple composed of two entities e_i, e_j ∈ E and their relation r_k ∈ R. Here E is the set of entities and R the set of relations. ComplEx then represents each entity e ∈ E as a complex-valued vector e ∈ C^d, and each relation r ∈ R as a complex-valued vector r ∈ C^d, where d is the dimensionality of the embedding space. Each x ∈ C^d consists of a real vector component Re(x) and an imaginary vector component Im(x), i.e., x = Re(x) + iIm(x). For any given triple (e_i, r_k, e_j) ∈ E × R × E, a multilinear dot product is used to score that triple, i.e.,

φ(e_i, r_k, e_j) = Re(⟨e_i, r_k, ē_j⟩) = Re(Σ_ℓ [e_i]_ℓ [r_k]_ℓ [ē_j]_ℓ),  (1)

where e_i, r_k, e_j ∈ C^d are the vectorial representations associated with e_i, r_k, e_j, respectively; ē_j is the conjugate of e_j; [·]_ℓ is the ℓ-th entry of a vector; and Re(·) means taking the real part of a complex value. Triples with higher φ(·, ·, ·) scores are more likely to be true. Owing to the asymmetry of this scoring function, i.e., φ(e_i, r_k, e_j) ≠ φ(e_j, r_k, e_i) in general, ComplEx can effectively handle asymmetric relations (Trouillon et al., 2016).
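The scoring function of Eq. (1) can be sketched in a few lines of numpy. The embedding values below are illustrative toy numbers, not learned parameters:

```python
import numpy as np

def complex_score(e_i, r_k, e_j):
    """Multilinear dot product of Eq. (1): Re(<e_i, r_k, conj(e_j)>).

    e_i, r_k, e_j are complex-valued d-dimensional vectors.
    """
    return np.real(np.sum(e_i * r_k * np.conj(e_j)))

# Toy embeddings with d = 2 (values chosen for illustration only).
e_i = np.array([0.9 + 0.1j, 0.2 + 0.8j])
e_j = np.array([0.3 + 0.5j, 0.7 + 0.2j])
r_k = np.array([0.5 + 0.4j, 0.1 + 0.6j])

# The scoring function is asymmetric in general: swapping head and
# tail changes the score, which is how ComplEx models asymmetric
# relations.
s_forward = complex_score(e_i, r_k, e_j)
s_backward = complex_score(e_j, r_k, e_i)
```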

Non-negativity of Entity Representations
On top of the basic ComplEx model, we further require entities to have non-negative (and bounded) vectorial representations. In fact, these distributed representations can be taken as feature vectors for entities, with latent semantics encoded in different dimensions. In ComplEx, as well as most (if not all) previous approaches, there is no limitation on the range of such feature values, which means that both positive and negative properties of an entity can be encoded in its representation. However, as pointed out by Murphy et al. (2012), it would be uneconomical to store all negative properties of an entity or a concept. For instance, to describe cats (a concept), people usually use positive properties such as cats are mammals, cats eat fishes, and cats have four legs, but hardly ever negative properties like cats are not vehicles, cats do not have wheels, or cats are not used for communication.
Based on such intuition, this paper proposes to impose non-negativity constraints on entity representations, by using which only positive properties will be stored in these representations. To better compare different entities on the same scale, we further require entity representations to stay within the hypercube of [0, 1]^d, as approximately Boolean embeddings (Kruszewski et al., 2015), i.e.,

0 ≤ Re(e) ≤ 1,  0 ≤ Im(e) ≤ 1,  (2)

where e ∈ C^d is the representation for entity e ∈ E, with its real and imaginary components denoted by Re(e), Im(e) ∈ R^d; 0 and 1 are d-dimensional vectors with all their entries being 0 or 1; and ≥, ≤, = denote entry-wise comparisons throughout the paper whenever applicable. As shown by Lee and Seung (1999), non-negativity, in most cases, will further induce sparsity and interpretability.

Approximate Entailment for Relations
Besides the non-negativity constraints over entity representations, we also study approximate entailment constraints over relation representations. By approximate entailment, we mean an ordered pair of relations such that the former approximately entails the latter, e.g., BornInCountry and Nationality, stating that a person born in a country is very likely, but not necessarily, to have a nationality of that country. Each such relation pair is associated with a weight to indicate the confidence level of entailment. A larger weight stands for a higher level of confidence. We denote by r_p →(λ) r_q the approximate entailment between relations r_p and r_q, with confidence level λ. This kind of entailment can be derived automatically from a KG by modern rule mining systems (Galárraga et al., 2015). Let T denote the set of all such approximate entailments derived beforehand.
Before diving into approximate entailment, we first explore the modeling of strict entailment, i.e., entailment with infinite confidence level λ = +∞. The strict entailment r_p → r_q states that if relation r_p holds then relation r_q must also hold. This entailment can be roughly modelled by requiring

φ(e_i, r_p, e_j) ≤ φ(e_i, r_q, e_j),  ∀ e_i, e_j ∈ E,  (3)

where φ(·, ·, ·) is the score for a triple predicted by the embedding model, defined by Eq. (1). Eq. (3) can be interpreted as follows: for any two entities e_i and e_j, if (e_i, r_p, e_j) is a true fact with a high score φ(e_i, r_p, e_j), then the triple (e_i, r_q, e_j), with an even higher score, should also be predicted as a true fact by the embedding model. Note that given the non-negativity constraints defined by Eq. (2), a sufficient condition for Eq. (3) to hold is to further impose

Re(r_p) ≤ Re(r_q),  Im(r_p) = Im(r_q),  (4)

where r_p and r_q are the complex-valued representations for r_p and r_q respectively, with the real and imaginary components denoted by Re(·), Im(·) ∈ R^d. That means, when the constraints of Eq. (4) (along with those of Eq. (2)) are satisfied, the requirement of Eq. (3) (or in other words r_p → r_q) will always hold. We provide a proof of sufficiency as supplementary material.
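The sufficiency claim can also be checked numerically: sample entities satisfying Eq. (2) and a relation pair satisfying Eq. (4), and verify that the score ordering of Eq. (3) never flips. A sketch under those assumptions (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def score(e_i, r, e_j):
    # Eq. (1): Re(<e_i, r, conj(e_j)>)
    return np.real(np.sum(e_i * r * np.conj(e_j)))

def random_entity():
    # Entities satisfy Eq. (2): real and imaginary parts in [0, 1].
    return rng.uniform(0, 1, d) + 1j * rng.uniform(0, 1, d)

# Relations satisfy Eq. (4): Re(r_p) <= Re(r_q) entry-wise,
# Im(r_p) == Im(r_q).
im = rng.normal(size=d)
re_q = rng.normal(size=d)
re_p = re_q - rng.uniform(0, 1, d)
r_p = re_p + 1j * im
r_q = re_q + 1j * im

# For every sampled entity pair, r_p's score never exceeds r_q's,
# i.e. the strict entailment r_p -> r_q of Eq. (3) holds.
for _ in range(1000):
    e_i, e_j = random_entity(), random_entity()
    assert score(e_i, r_p, e_j) <= score(e_i, r_q, e_j) + 1e-9
```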
Next we examine the modeling of approximate entailment. To this end, we further introduce the confidence level λ and allow slackness in Eq. (4), which yields

λ (Re(r_p) − Re(r_q)) ≤ α,  (5)
λ (Im(r_p) − Im(r_q))² ≤ β.  (6)

Here α, β ≥ 0 are slack variables, and (·)² means an entry-wise operation. Entailments with higher confidence levels show less tolerance for violating the constraints. When λ = +∞, Eqs. (5)-(6) degenerate to Eq. (4). The above analysis indicates that our approach can model entailment simply by imposing constraints over relation representations, without traversing all possible (e_i, e_j) entity pairs (i.e., grounding). In addition, different confidence levels are encoded in the constraints, making our approach moderately tolerant of uncertainty.
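The slackened constraints of Eqs. (5)-(6) amount to simple entry-wise quantities: a hinge on the real-part differences and a square on the imaginary-part differences, both scaled by λ. A toy sketch with made-up relation vectors:

```python
import numpy as np

def entailment_penalty(r_p, r_q, lam):
    """Total violation of Eqs. (5)-(6) for an approximate entailment
    r_p -> r_q with confidence lam: violations of Re(r_p) <= Re(r_q)
    are hinged, imaginary-part differences are squared, both scaled
    by lam (entry-wise), then summed.
    """
    re_viol = np.maximum(0.0, lam * (np.real(r_p) - np.real(r_q)))
    im_viol = lam * (np.imag(r_p) - np.imag(r_q)) ** 2
    return re_viol.sum() + im_viol.sum()

r_p = np.array([0.5 + 0.3j, 0.9 + 0.1j])
r_q = np.array([0.7 + 0.3j, 0.8 + 0.1j])

# Only the second real entry violates Re(r_p) <= Re(r_q); the
# imaginary parts match, so with lam = 1.0 the penalty is 0.1.
p = entailment_penalty(r_p, r_q, lam=1.0)
```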

The Overall Model
Finally, we combine together the basic embedding model of ComplEx, the non-negativity constraints on entity representations, and the approximate entailment constraints over relation representations.
The overall model is presented as follows:

min_{Θ, {α, β}}  Σ_{(e_i, r_k, e_j) ∈ D⁺ ∪ D⁻} log(1 + exp(−y_ijk · φ(e_i, r_k, e_j))) + μ Σ_{r_p →(λ) r_q ∈ T} 1ᵀ(α + β) + η ||Θ||_2^2,
s.t.  λ (Re(r_p) − Re(r_q)) ≤ α,  λ (Im(r_p) − Im(r_q))² ≤ β,  α, β ≥ 0,  ∀ r_p →(λ) r_q ∈ T,
      0 ≤ Re(e), Im(e) ≤ 1,  ∀ e ∈ E.  (7)

Here, Θ = {e : e ∈ E} ∪ {r : r ∈ R} is the set of all entity and relation representations; D⁺ and D⁻ are the sets of positive and negative training triples respectively; a positive triple is directly observed in the KG, i.e., (e_i, r_k, e_j) ∈ O; a negative triple can be generated by randomly corrupting the head or the tail entity of a positive triple, i.e., (e_i', r_k, e_j) or (e_i, r_k, e_j'); y_ijk = ±1 is the label (positive or negative) of triple (e_i, r_k, e_j). In this optimization, the first term of the objective function is a typical logistic loss, which enforces triples to have scores close to their labels. The second term is the sum of slack variables in the approximate entailment constraints, with a penalty coefficient μ ≥ 0. The motivation is, although we allow slackness in those constraints, we hope the total slackness to be small, so that the constraints can be better satisfied. The last term is L2 regularization to avoid over-fitting, and η ≥ 0 is the regularization coefficient.
To solve this optimization problem, the approximate entailment constraints (as well as the corresponding slack variables) are converted into penalty terms and added to the objective function, while the non-negativity constraints remain as they are. As such, the optimization problem of Eq. (7) can be rewritten as:

min_Θ  Σ_{(e_i, r_k, e_j) ∈ D⁺ ∪ D⁻} log(1 + exp(−y_ijk · φ(e_i, r_k, e_j))) + μ Σ_{r_p →(λ) r_q ∈ T} 1ᵀ[λ (Re(r_p) − Re(r_q))]_+ + μ Σ_{r_p →(λ) r_q ∈ T} 1ᵀ (λ (Im(r_p) − Im(r_q))²) + η ||Θ||_2^2,
s.t.  0 ≤ Re(e), Im(e) ≤ 1,  ∀ e ∈ E,  (8)

where [x]_+ = max(0, x) with max(·, ·) being an entry-wise operation. The equivalence between Eq. (7) and Eq. (8) is shown in the supplementary material. We use SGD in mini-batch mode as our optimizer, with AdaGrad (Duchi et al., 2011) to tune the learning rate. After each gradient descent step, we project (by truncation) real and imaginary components of entity representations into the hypercube of [0, 1]^d, to satisfy the non-negativity constraints.
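The projection step after each gradient update is just an entry-wise truncation. A minimal sketch, with toy dimensions and random noise standing in for a real SGD update:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Toy parameter store: complex embeddings kept as separate real and
# imaginary real-valued arrays so plain gradient updates apply.
E_re = rng.uniform(0, 1, (3, d))
E_im = rng.uniform(0, 1, (3, d))

def project_entities(E_re, E_im):
    """After each SGD step, truncate entity components into [0, 1]^d
    so the non-negativity constraints of Eq. (2) hold exactly."""
    np.clip(E_re, 0.0, 1.0, out=E_re)
    np.clip(E_im, 0.0, 1.0, out=E_im)

# Simulate a gradient step that pushes entries out of the box,
# then project back by truncation.
E_re += rng.normal(scale=2.0, size=E_re.shape)
E_im -= rng.normal(scale=2.0, size=E_im.shape)
project_entities(E_re, E_im)
```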
While favouring a better structuring of the embedding space, imposing the additional constraints will not substantially increase model complexity. Our approach has a space complexity of O(nd + md), which is the same as that of ComplEx. Here, n is the number of entities, m the number of relations, and O(nd + md) accounts for storing a d-dimensional complex-valued vector for each entity and each relation. The time complexity (per iteration) of our approach is O(sd + td + n̄d), where s is the average number of triples in a mini-batch, n̄ the average number of entities in a mini-batch, and t the total number of approximate entailments in T. O(sd) is to handle triples in a mini-batch, O(td) the penalty terms introduced by the approximate entailments, and O(n̄d) further the non-negativity constraints on entity representations. Usually there are much fewer entailments than triples, i.e., t ≪ s, and also n̄ ≤ 2s. 1 So the time complexity of our approach is on a par with O(sd), i.e., the time complexity of ComplEx.

Experiments and Results
This section presents our experiments and results. We first introduce the datasets used in our experiments (§ 4.1). Then we empirically evaluate our approach in the link prediction task (§ 4.2). After that, we conduct extensive analysis on both entity representations (§ 4.3) and relation representations (§ 4.4) to show the interpretability of our model.

1 There will be at most 2s entities contained in s triples.
Code and data used in the experiments are available at https://github.com/iieir-km/ ComplEx-NNE_AER.

Datasets
The first two datasets we used are WN18 and FB15K, released by Bordes et al. (2013). 2 WN18 is a subset of WordNet containing 18 relations and 40,943 entities, and FB15K a subset of Freebase containing 1,345 relations and 14,951 entities. We create our third dataset from the mapping-based objects of core DBpedia. 3 We eliminate relations not included within the DBpedia ontology, such as HomePage and Logo, and discard entities appearing fewer than 20 times. The final dataset, referred to as DB100K, is composed of 470 relations and 99,604 entities. Triples in each dataset are further divided into training, validation, and test sets, used for model training, hyperparameter tuning, and evaluation respectively. We follow the original split for WN18 and FB15K, and draw a split of 597,572/50,000/50,000 triples for DB100K.
We further use AMIE+ (Galárraga et al., 2015) 4 to extract approximate entailments automatically from the training set of each dataset. As suggested by Guo et al. (2018), we consider entailments with PCA confidence higher than 0.8. 5 As such, we extract 17 approximate entailments from WN18, 535 from FB15K, and 56 from DB100K. Table 1 gives some examples of these approximate entailments, along with their confidence levels. Table 2 further summarizes the statistics of the datasets.
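For intuition, the confidence of an entailment r_p → r_q can be approximated as the fraction of r_p's entity pairs that also appear under r_q. AMIE+'s PCA confidence is more refined, so the sketch below, with invented entities and relations, is only a simplified stand-in:

```python
from collections import defaultdict

# Toy KG; the triples and relation names are invented for illustration.
triples = [
    ("anna", "born_in_country", "france"),
    ("anna", "nationality", "france"),
    ("ben", "born_in_country", "japan"),
    ("ben", "nationality", "japan"),
    ("carl", "born_in_country", "spain"),
    ("carl", "nationality", "italy"),  # the exception: entailment is approximate
]

pairs = defaultdict(set)  # relation -> set of (head, tail) pairs
for h, r, t in triples:
    pairs[r].add((h, t))

def confidence(r_p, r_q):
    """Fraction of r_p's pairs that also hold under r_q (a crude
    stand-in for AMIE+'s PCA confidence)."""
    supp = pairs[r_p]
    return len(supp & pairs[r_q]) / len(supp)

# born_in_country entails nationality for 2 of 3 pairs, i.e. 0.67,
# which falls below the 0.8 threshold used in the paper.
c = confidence("born_in_country", "nationality")
```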

Link Prediction
We first evaluate our approach in the link prediction task, which aims to predict a triple (e_i, r_k, e_j) with e_i or e_j missing, i.e., predict e_i given (r_k, e_j) or predict e_j given (e_i, r_k).
Evaluation Protocol: We follow the protocol introduced by Bordes et al. (2013). For each test triple (e_i, r_k, e_j), we replace its head entity e_i with every entity e_i' ∈ E, and calculate a score for the corrupted triple (e_i', r_k, e_j), e.g., φ(e_i', r_k, e_j) defined by Eq. (1). Then we sort these scores in descending order, and get the rank of the correct entity e_i. During ranking, we remove corrupted triples that already exist in either the training, validation, or test set, i.e., the filtered setting as described in (Bordes et al., 2013). This whole procedure is repeated while replacing the tail entity e_j. We report on the test set the mean reciprocal rank (MRR) and the proportion of correct entities ranked in the top n (HITS@N), with n = 1, 3, 10.
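The filtered ranking computation can be sketched as follows; the entity names and scores are hypothetical:

```python
def filtered_rank(scores, correct, known):
    """Rank of `correct` among candidate scores, after removing other
    candidates that are already known true (the 'filtered' setting).

    scores: dict candidate -> score; known: set of known true candidates.
    """
    s = scores[correct]
    better = sum(1 for c, v in scores.items()
                 if c != correct and c not in known and v > s)
    return better + 1

scores = {"paris": 3.2, "lyon": 2.9, "berlin": 3.5, "rome": 3.8}
# 'rome' is a known true answer for this query, so it is filtered
# out; only 'berlin' outscores 'paris', giving rank 2.
rank = filtered_rank(scores, correct="paris", known={"rome"})
mrr_contrib = 1.0 / rank       # this triple's contribution to MRR
hits_at_3 = int(rank <= 3)     # counts toward HITS@3
```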
Comparison Settings: We compare the performance of our approach against a variety of KG embedding models developed in recent years. These models can be categorized into three groups:

• Simple embedding models that utilize triples alone without integrating extra information, including TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), HolE (Nickel et al., 2016b), ComplEx (Trouillon et al., 2016), and ANALOGY (Liu et al., 2017). Our approach is developed on the basis of ComplEx.
• Other extensions of ComplEx that integrate logical background knowledge in addition to triples, including RUGE (Guo et al., 2018) and ComplEx^R (Minervini et al., 2017a). The former requires grounding of first-order logic rules. The latter is restricted to relation equivalence and inversion, and assigns an identical confidence level to all different rules.
• Latest developments or implementations that achieve current state-of-the-art performance reported on the benchmarks of WN18 and FB15K, including R-GCN (Schlichtkrull et al., 2017), ConvE (Dettmers et al., 2018), and Single DistMult (Kadlec et al., 2017). 6 The first two are built on neural network architectures, which are, by nature, more complicated than the simple models. The last one is a re-implementation of DistMult, generating 1000 to 2000 negative training examples per positive one, which leads to better performance but requires significantly longer training time.
We further evaluate our approach in two different settings: (i) ComplEx-NNE that imposes only the Non-Negativity constraints on Entity representations, i.e., optimization Eq. (8) with µ = 0; and (ii) ComplEx-NNE+AER that further imposes the Approximate Entailment constraints over Relation representations besides those non-negativity ones, i.e., optimization Eq. (8) with µ > 0.
Implementation Details: We compare our approach against all three groups of baselines on the benchmarks of WN18 and FB15K. We directly report their original results on these two datasets to avoid re-implementation bias. On DB100K, the newly created dataset, we take the first two groups of baselines, i.e., the simple embedding models and the ComplEx extensions with logical background knowledge incorporated. We do not use the third group of baselines due to efficiency and complexity issues. We use the code provided by Trouillon et al. (2016) 7 for TransE, DistMult, and ComplEx, and the code released by their authors for ANALOGY 8 and RUGE 9 . We re-implement HolE and ComplEx^R so that all the baselines (as well as our approach) share the same optimization mode, i.e., SGD with AdaGrad and gradient normalization, to facilitate a fair comparison. 10 We follow Trouillon et al. (2016) to adopt a ranking loss for TransE and a logistic loss for all the other methods.

Table 3: Link prediction results on the test sets of WN18 and FB15K. Baseline results are taken from (Trouillon et al., 2016) or the original papers. Missing scores not reported in the literature are indicated by "-". Best scores are highlighted in bold, and "*" indicates statistically significant improvements over ComplEx.

Table 4: Link prediction results on the test set of DB100K, with best scores highlighted in bold, statistically significant improvements marked by "*".
Among those baselines, RUGE and ComplEx^R require additional logical background knowledge. RUGE makes use of soft rules, which are extracted by AMIE+ from the training sets. As suggested by Guo et al. (2018), length-1 and length-2 rules with PCA confidence higher than 0.8 are utilized. Note that our approach also makes use of AMIE+ rules with PCA confidence higher than 0.8, but it only considers entailments between a pair of relations, i.e., length-1 rules. ComplEx^R takes into account equivalence and inversion between relations. We derive such axioms directly from our approximate entailments: if r_p →(λ1) r_q and r_q →(λ2) r_p with λ1, λ2 > 0.8, we consider relations r_p and r_q equivalent; and similarly, if r_p⁻¹ →(λ1) r_q and r_q⁻¹ →(λ2) r_p with λ1, λ2 > 0.8, we consider r_p an inverse of r_q.
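These heuristics are straightforward to implement given the mined entailment sets; the relation names and confidence values below are invented for illustration:

```python
# Hypothetical entailment sets: `forward` maps (r_p, r_q) to the
# confidence of r_p -> r_q; `inverse` maps (r_p, r_q) to the
# confidence of r_p^{-1} -> r_q, both as a rule miner might produce.
forward = {("birth_place", "born_in"): 0.95,
           ("born_in", "birth_place"): 0.91,
           ("capital_of", "city_in"): 0.85}   # no reverse rule mined
inverse = {("has_child", "child_of"): 0.93,
           ("child_of", "has_child"): 0.88}

def equivalences(entailments, thr=0.8):
    """r_p and r_q are equivalent when they entail each other with
    confidence above the threshold."""
    return {frozenset((p, q)) for (p, q), c in entailments.items()
            if c > thr and entailments.get((q, p), 0.0) > thr}

def inversions(inv_entailments, thr=0.8):
    """r_p is an inverse of r_q when r_p^{-1} -> r_q and
    r_q^{-1} -> r_p both exceed the threshold."""
    return {frozenset((p, q)) for (p, q), c in inv_entailments.items()
            if c > thr and inv_entailments.get((q, p), 0.0) > thr}

eq = equivalences(forward)   # only the mutually-entailing pair qualifies
inv = inversions(inverse)
```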
Experimental Results: Table 3 presents the results on the test sets of WN18 and FB15K, where the results for the baselines are taken directly from previous literature. Table 4 further provides the results on the test set of DB100K, with all the methods tuned and tested in (almost) the same setting. On all the datasets, we test statistical significance of the improvements achieved by ComplEx-NNE/ ComplEx-NNE+AER over ComplEx, by using a paired t-test. The reciprocal rank or HITS@N value with n = 1, 3, 10 for each test triple is used as paired data. The symbol " * " indicates a significance level of p < 0.05.
The results demonstrate that imposing the non-negativity and approximate entailment constraints indeed improves KG embedding. ComplEx-NNE and ComplEx-NNE+AER perform better than (or at least as well as) ComplEx in almost all the metrics on all three datasets, and most of the improvements are statistically significant (except those on WN18). More interestingly, just by introducing these simple constraints, ComplEx-NNE+AER can beat very strong baselines, including the best performing basic models like ANALOGY, the previous extensions of ComplEx like RUGE or ComplEx^R, and even the complicated developments or implementations like ConvE or Single DistMult. This demonstrates the superiority of our approach.

Analysis on Entity Representations
This section inspects how the structure of the entity embedding space changes when the constraints are imposed. We first provide a visualization of entity representations on DB100K. On this dataset each entity is associated with a single type label. 11 We pick 4 types, reptile, wine region, species, and programming language, and randomly select 30 entities from each type. Figure 1 visualizes the representations of these entities learned by ComplEx and ComplEx-NNE+AER (real components only), with the optimal configurations determined by link prediction (see § 4.2 for details, applicable to all analysis hereafter). During the visualization, we normalize the real component x of each entity as (x − min(x)) / (max(x) − min(x)), where min(x) or max(x) is the minimum or maximum entry of x respectively. We observe that after imposing the non-negativity constraints, ComplEx-NNE+AER indeed obtains compact and interpretable representations for entities. Each entity is represented by only a relatively small number of "active" dimensions. Entities with the same type tend to activate the same set of dimensions, while entities with different types often get clearly different dimensions activated.

Then we investigate the semantic purity of these dimensions. Specifically, we collect the representations of all the entities on DB100K (real components only). For each dimension of these representations, the top K percent of entities with the highest activation values on this dimension are picked, and we calculate the entropy of the type distribution of the entities selected. This entropy reflects the diversity of entity types, or in other words, semantic purity. If all the K percent of entities have the same type, we get the lowest entropy of zero (the highest semantic purity). On the contrary, if each of them has a distinct type, we get the highest entropy (the lowest semantic purity).
Figure 2 shows the average entropy over all dimensions of entity representations (real components only) learned by ComplEx, ComplEx-NNE, and ComplEx-NNE+AER, as K varies. We can see that after imposing the non-negativity constraints, ComplEx-NNE and ComplEx-NNE+AER can learn entity representations with latent dimensions of consistently higher semantic purity. We have conducted the same analyses on imaginary components of entity representations, and observed similar phenomena. The results are given as supplementary material.
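The entropy-based purity measure described above can be sketched as follows, with invented activation values and type labels:

```python
import math
from collections import Counter

def purity_entropy(activations, types, k_percent=10):
    """Entropy of the type distribution among the top K% most
    activated entities on one dimension; 0 means perfectly pure."""
    k = max(1, len(activations) * k_percent // 100)
    top = sorted(range(len(activations)),
                 key=lambda i: activations[i], reverse=True)[:k]
    counts = Counter(types[i] for i in top)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

types = ["reptile"] * 5 + ["wine_region"] * 5
# A 'pure' dimension: reptiles dominate the top activations.
pure_dim = [0.9, 0.8, 0.85, 0.7, 0.95] + [0.1, 0.05, 0.2, 0.0, 0.15]
# A 'mixed' dimension: both types are activated equally strongly.
mixed_dim = [0.9, 0.1, 0.8, 0.2, 0.7] + [0.85, 0.15, 0.75, 0.25, 0.95]

low = purity_entropy(pure_dim, types, k_percent=40)    # top 4 all reptiles
high = purity_entropy(mixed_dim, types, k_percent=40)  # 2 of each type
```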

Analysis on Relation Representations
This section further provides a visual inspection of the relation embedding space when the constraints are imposed. To this end, we group relation pairs involved in the DB100K entailment constraints into 3 classes: equivalence, inversion, and others. 12 We choose 2 pairs of relations from each class, and visualize these relation representations learned by ComplEx-NNE+AER in Figure 3, where for each relation we randomly pick 5 dimensions from both its real and imaginary components. By imposing the approximate entailment constraints, these relation representations encode logical regularities quite well. Pairs of relations from the first class (equivalence) tend to have identical representations r_p ≈ r_q; those from the second class (inversion), complex-conjugate representations r_p ≈ r̄_q; and the others, representations such that Re(r_p) ≤ Re(r_q) and Im(r_p) ≈ Im(r_q).
12 Equivalence and inversion are detected using heuristics introduced in § 4.2 (implementation details). See the supplementary material for detailed properties of these three classes.

Conclusion
This paper investigates the potential of using very simple constraints to improve KG embedding. Two types of constraints have been studied: (i) the non-negativity constraints to learn compact, interpretable entity representations, and (ii) the approximate entailment constraints to further encode logical regularities into relation representations. Such constraints impose prior beliefs upon the structure of the embedding space, and will not significantly increase the space or time complexity. Experimental results on benchmark KGs demonstrate that our method is simple yet surprisingly effective, showing significant and consistent improvements over strong baselines. The constraints indeed improve model interpretability, yielding a substantially increased structuring of the embedding space.