Knowledge Graph Embedding with Hierarchical Relation Structure

The rapid development of knowledge graphs (KGs), such as Freebase and WordNet, has changed the paradigm for AI-related applications. However, even though these KGs are impressively large, most of them suffer from incompleteness, which degrades the performance of downstream AI applications. Most existing research focuses on knowledge graph embedding (KGE) models. Nevertheless, those models simply embed entities and relations into latent vectors without leveraging the rich information in the relation structure. Indeed, relations in KGs conform to a three-layer hierarchical relation structure (HRS): semantically similar relations can be grouped into relation clusters, and some relations can be further split into several fine-grained sub-relations. Relation clusters, relations and sub-relations fit in the top, middle and bottom layers of the three-layer HRS respectively. To this end, in this paper, we extend the existing KGE models TransE, TransH and DistMult to learn knowledge representations by leveraging the information from the HRS. In particular, our approach can also be used to extend other KGE models. Finally, the experimental results clearly validate the effectiveness of the proposed approach against the baselines.


Introduction
Knowledge Graphs (KGs) are extremely useful resources for many AI-related applications, such as question answering, information retrieval and query expansion. Indeed, KGs are multi-relational directed graphs composed of entities as nodes and relations as edges. They represent information about real-world entities and relations in the form of knowledge triples, denoted as (h, r, t), where h and t correspond to the head and tail entities and r denotes the relation between them, e.g., (Donald Trump, presidentOf, USA). Large-scale, collaboratively created KGs, such as Freebase (Bollacker et al., 2008), WordNet (Miller, 1994), Yago (Suchanek et al., 2007), Gene Ontology (Sherlock, 2009), NELL (Carlson et al., 2010) and Google's KG, have recently become available. However, despite their impressively large sizes, the coverage of most existing KGs is far from complete. This has motivated research on the knowledge base completion task, which includes KGE methods that aim to embed the entities and relations of a KG into low-dimensional vectors.
In the literature, there are a number of studies on KGE models. These models embed entities and relations into latent vectors and complete KGs based on these vectors, e.g., TransE (Bordes et al., 2013), TransH (Wang et al., 2014) and TransR (Lin et al., 2015b). However, most of the existing works simply embed relations into vectors. Less effort has been made to investigate the rich information in the relation structure. Indeed, in this research, we define a three-layer hierarchical relation structure (HRS), which is made up of the relation clusters, relations and sub-relations in KGs.
• Relation clusters: Semantically similar relations are often observed in large-scale KGs. For example, the relations 'producerOf' and 'directorOf' may be semantically related if both of them describe a relation between a person and a film. These semantically similar relations can make up relation clusters. We believe the information from semantically similar relations is of great value, and relations in the same group can be trained collectively to facilitate knowledge sharing when learning the embeddings of related relations.
• Relations: A relation connects the head and tail entities in a knowledge triple, denoted as (h, r, t), where h and t correspond to the head and tail entities and r denotes the relation between them.
• Sub-relations: Some relations have multiple semantic meanings and can be split into several sub-relations. For example, the relation partOf has at least two semantics: location-related, as in (New York, partOf, USA), and composition-related, as in (monitor, partOf, television). We believe the sub-relations give fine-grained descriptions for each relation.
The relation clusters, relations and sub-relations correspond to the top, middle and bottom layer of the three-layer HRS.
In this paper, we extend the state-of-the-art models TransE (Bordes et al., 2013), TransH (Wang et al., 2014) and DistMult (Yang et al., 2015) to learn knowledge representations by leveraging the rich information from the HRS. Moreover, the same technique can easily be used to extend other state-of-the-art models to utilize the HRS information. In the proposed models, for each knowledge triple (h, r, t), the embedding of r is the sum of three embedding vectors, which correspond to the three layers of the HRS respectively; therefore, the information from the HRS is leveraged. In particular, instead of using additional information such as text or paths, our models simply use the knowledge triples in KGs together with the rich information from the HRS. Extensive experiments on popular benchmark data sets demonstrate the effectiveness of our models.
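The per-triple composition described above can be sketched as follows. This is a minimal illustration, not the authors' code; the lookup-table names and ids are hypothetical.

```python
import numpy as np

# Sketch: composing a relation embedding from the three HRS layers,
# r = r_c + r' + r_s. All container names and ids here are hypothetical.
def compose_relation(r_id, sub_id, cluster_of, cluster_emb, rel_emb, sub_emb):
    """Sum the cluster, relation-specific and sub-relation embeddings."""
    return (cluster_emb[cluster_of[r_id]]     # r_c: shared across a cluster
            + rel_emb[r_id]                   # r': relation-specific part
            + sub_emb[(r_id, sub_id)])        # r_s: fine-grained sub-relation

k = 4                                         # embedding dimension
cluster_emb = {0: np.full(k, 1.0)}
rel_emb = {7: np.full(k, 0.5)}
sub_emb = {(7, 1): np.full(k, 0.25)}
r = compose_relation(7, 1, {7: 0}, cluster_emb, rel_emb, sub_emb)
```

Because the three parts are summed, gradients from every triple of every relation in a cluster flow into the shared cluster vector, which is how knowledge is shared across related relations.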
In summary, we highlight our key contributions as follows: 1. We propose a technique that makes use of the HRS information for the KGE task and extend three state-of-the-art models with it. The technique can easily be applied to other KGE models.
2. Our proposed models do not use additional information such as text or paths; instead, we only use the knowledge triples in KGs and take advantage of the rich information from the HRS.
3. We evaluate our models on popular benchmark data sets, and the results show that our extended models achieve substantial improvements against the original models as well as other state-of-the-art baselines.

Preliminaries and Related Work
We extend three popular KGE models by leveraging the HRS information in this study. Therefore, in this section, we first introduce the three existing models TransE (Bordes et al., 2013), TransH (Wang et al., 2014) and DistMult (Yang et al., 2015) in detail. Then, we further summarize other state-of-the-art models on the topic of KGE.

TransE, TransH and DistMult
Recently, a number of KGE models have been proposed. These methods learn low-dimensional vector representations for entities and relations (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015b).
TransE (Bordes et al., 2013) is one of the most widely used models. It views a relation as a translation from the head entity to the tail entity in the same low-dimensional space, i.e., h + r ≈ t when (h, r, t) holds. This indicates that t should be the nearest neighbor of h + r. In this case, the score function of TransE is defined as

f_r(h, t) = ||h + r - t||,

which can be measured by the L_1 or L_2 norm. Positive triples are supposed to have lower scores than negative ones. TransH (Wang et al., 2014) introduces a mechanism of projecting entities onto relation-specific hyperplanes, which enables an entity to play different roles in different relations. TransH models the relation as a vector r on a hyperplane with normal vector w_r and assumes that h_⊥ + r ≈ t_⊥ when (h, r, t) holds, where h_⊥ and t_⊥ are the projections of h and t onto the relation-specific hyperplane. The score function of TransH is defined as

f_r(h, t) = ||h_⊥ + r - t_⊥||,

where h_⊥ = h - w_r^T h w_r, t_⊥ = t - w_r^T t w_r and ||w_r||_2 = 1. As in TransE, positive triples in TransH should have lower scores than negative ones.
DistMult (Yang et al., 2015) adopts a bilinear score function to score (h, r, t) triples. The score function is defined as

f_r(h, t) = h^T M_r t,

where M_r is a relation-specific diagonal matrix that represents the characteristics of the relation. Different from TransE and TransH, positive triples should have larger scores than negative ones.
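The three score functions above can be sketched as follows (a hedged illustration, not the authors' implementation; the L_1 norm is used for the two translational models):

```python
import numpy as np

# Sketches of the three score functions; h, r, t, w_r, m_r are 1-D vectors.
def score_transe(h, r, t):
    return np.sum(np.abs(h + r - t))              # lower = more plausible

def score_transh(h, r, w_r, t):
    w = w_r / np.linalg.norm(w_r)                 # enforce ||w_r||_2 = 1
    h_p = h - np.dot(w, h) * w                    # project h onto the hyperplane
    t_p = t - np.dot(w, t) * w                    # project t onto the hyperplane
    return np.sum(np.abs(h_p + r - t_p))          # lower = more plausible

def score_distmult(h, m_r, t):
    return float(np.sum(h * m_r * t))             # diagonal bilinear; higher = better
```

Note the sign convention: the translational scores are distances (smaller is better), while the DistMult score is a similarity (larger is better), which matters when ranking candidates.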

Other KGE Models
Besides TransE, TransH and DistMult, there are many other KGE models. TransR (Lin et al., 2015b) embeds entities and relations into a separate entity space and relation-specific spaces. ComplEx (Welbl et al., 2016) extends DistMult by embedding entities and relations into complex-valued vectors instead of real-valued ones. HolE (Nickel et al., 2016) employs circular correlations to create compositional representations. ProjE (Shi and Weninger, 2017) adopts a two-layer network to embed entities and relations.
Some KGE works focus on making use of the information from relations. CTransR (Lin et al., 2015b), TransD (Ji et al., 2015) and TransG (Xiao et al., 2016) try to find fine-grained representations for each relation. However, these works do not utilize the information from semantically similar relations, and the HRS is not exploited either. Different from the above studies, we believe semantically similar relations can make up relation clusters, and some relations may have multiple semantic meanings and can be split into fine-grained sub-relations. In this paper, we take advantage of the three-layer HRS and conduct the KGE task by extending three widely used models.

Methodology
In this section, we provide the technical details of how to extend existing KGE models by leveraging the HRS information. We first formally define the HRS and its integration with existing models. Then we introduce the new loss functions of the extended models TransE-HRS, TransH-HRS and DistMult-HRS. Finally, two variants of the HRS model and the implementation details are provided.

Hierarchical Relation Structure
Formally, a KG can be denoted as G = (E, R), where E and R are the entity (node) set and relation (edge) set respectively. We believe the relations in KGs can make up relation clusters as well as be split into fine-grained sub-relations. On the one hand, large-scale KGs always contain semantically related relations. The information from semantically similar relations is of great value, and these relations should be trained collectively. In this way, meaningful associations among related relations can be utilized and less frequent relations can be enriched with more training data. On the other hand, some relations may have multiple semantic meanings and can be split into several sub-relations, which provide fine-grained descriptions for each relation. In general, relations in KGs conform to a three-layer HRS, as shown in Figure 1. The HRS includes a relation cluster layer, a relation layer and a sub-relation layer, which are denoted in yellow, green and blue in Figure 1 respectively.
For a triple (h, r, t) in the HRS model, the embedding of r is comprised of three parts: the relation cluster embedding r_c, the relation-specific embedding r' and the sub-relation embedding r_s, which is denoted as

r = r_c + r' + r_s.

According to the above equation, the embedding of each relation can leverage the information from the three-layer HRS. The relation clusters and sub-relations are determined by the k-means algorithm based on the results of TransE:
• Relation clusters. We first run TransE on a given data set and obtain the embeddings of relations r_1, r_2, r_3, ..., r_|R|, where |R| is the number of relations. Then, the k-means algorithm is applied on these embeddings. In this way, we get relation clusters C_1, C_2, C_3, ..., C_|C|, where C is the set of relation clusters. Previous studies have shown that the embeddings of semantically similar relations lie near each other in the latent space (Yang et al., 2015), so we are able to find relation clusters composed of semantically related relations.
• Sub-relations. TransE assumes that t - h ≈ r when (h, r, t) holds. For each triple (h, r, t), we define r̂ = t - h, where h and t are obtained from the results of TransE. For each relation, we collect all the r̂ and adopt the k-means algorithm to cluster these vectors into several groups S_1^r, S_2^r, S_3^r, ..., S_{n_r}^r, where n_r is the number of sub-relations for relation r. Each group corresponds to a fine-grained sub-relation.
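The two clustering steps above can be sketched with a tiny Lloyd's-algorithm k-means; random vectors stand in for the pretrained TransE embeddings, which would be the real input:

```python
import numpy as np

# Minimal k-means (Lloyd's algorithm) used for both HRS clustering steps.
def kmeans(x, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # assign to nearest center
        for j in range(k):                        # recompute centers
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
rel_emb = rng.normal(size=(50, 16))       # stand-in TransE relation embeddings
cluster_ids = kmeans(rel_emb, k=5)        # top layer: relation clusters

offsets = rng.normal(size=(200, 16))      # t - h for each training triple of one r
sub_ids = kmeans(offsets, k=3)            # bottom layer: sub-relations of r
```

In practice a library implementation (e.g. scikit-learn's KMeans) would be used; the point is only that the top layer clusters relation vectors, while the bottom layer clusters per-triple offsets t - h of a single relation.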

Loss Function
The loss of the extended HRS model is comprised of two parts, as shown in Equation (5):

L = L_Orig + L_HRS,   (5)

where L_Orig is the loss function of the original model and L_HRS is the loss function for the HRS information.
TransE, TransH and DistMult all adopt a margin-based ranking loss. Taking TransE as an example, the loss function for the first part L_Orig is shown in Equation (6):

L_Orig = Σ_{r∈R} Σ_{(h,r,t)∈Δ_r} Σ_{(h',r,t')∈Δ'_r} [γ + f_r(h, t) - f_r(h', t')]_+,   (6)

where [x]_+ = max(0, x), Δ_r denotes the set of positive triples for relation r and Δ'_r = {(h', r, t) | h' ∈ E} ∪ {(h, r, t') | t' ∈ E} is the set of negative ones for relation r. γ is the margin separating the positive triples from the negative ones. f_r(h, t) is the score function shown in Equation (7):

f_r(h, t) = ||h + r - t||,   (7)

which can be measured by the L_1 or L_2 norm. Positive triples are supposed to have lower scores than negative ones.
The second part, L_HRS, is composed of three regularization terms, as shown in Equation (8):

L_HRS = λ_1 Σ_{r_c∈C} ||r_c||_2^2 + λ_2 Σ_{r∈R} ||r'||_2^2 + λ_3 Σ_{r_s∈S} ||r_s||_2^2,   (8)

where C is the set of relation clusters, S = {S_1^r, S_2^r, ..., S_{n_r}^r | r ∈ R} is the set of fine-grained sub-relations and n_r is the number of sub-relations for relation r. λ_1, λ_2 and λ_3 are trade-off parameters. A large value of λ_1 will result in the separate training of each relation, while a large value of λ_2 will lead to all relations in the same relation cluster sharing the same embedding vector. λ_3 should be larger than λ_1 and λ_2 to restrict r_s to a small value, i.e., the sub-relations of the same relation should be close.
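A sketch of the three regularization terms (squared L_2 norms and the function signature are assumptions for illustration, not the authors' code):

```python
import numpy as np

# Sketch of L_HRS: three squared-L2 penalties with trade-off weights.
def l_hrs(cluster_embs, rel_embs, sub_embs, lam1, lam2, lam3):
    term_c = sum(np.sum(rc ** 2) for rc in cluster_embs)  # shrinking r_c decouples relations
    term_r = sum(np.sum(r ** 2) for r in rel_embs)        # shrinking r' merges a cluster
    term_s = sum(np.sum(rs ** 2) for rs in sub_embs)      # keeps sub-relations close
    return lam1 * term_c + lam2 * term_r + lam3 * term_s
```

With lam3 largest, the sub-relation offsets stay small, so sub-relations of one relation remain near each other, matching the constraint described above.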

Variants of the HRS Model and Implementation details
Additionally, we introduce two variants of the HRS model: the top-middle model and the middle-bottom model. The top-middle model only leverages the information from the top to the middle layer of the HRS. For this model, the relation embedding and the HRS loss are defined in Equations (9) and (10):

r = r_c + r',   (9)

L_HRS = λ_1 Σ_{r_c∈C} ||r_c||_2^2 + λ_2 Σ_{r∈R} ||r'||_2^2.   (10)
The middle-bottom model only utilizes the information from the middle to the bottom layer. The relation embedding and HRS loss are defined in Equations (11) and (12):

r = r' + r_s,   (11)

L_HRS = λ_2 Σ_{r∈R} ||r'||_2^2 + λ_3 Σ_{r_s∈S} ||r_s||_2^2.   (12)
The learning process of the extended models is carried out with the Adam (Kingma and Ba, 2014) optimizer. For the extended models of TransE, all the entity and relation embedding parameters are initialized with a uniform distribution U(-6/√k, 6/√k) following TransE, where k is the dimension of the embedding space. For the extended models of TransH and DistMult, we initialize these parameters with the results of TransE. The relation cluster embeddings and sub-relation embeddings are all initialized to zero.
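The initialization scheme above can be sketched as follows (function names are illustrative):

```python
import numpy as np

# Sketch of the uniform initialization U(-6/sqrt(k), 6/sqrt(k)) used for TransE.
def init_uniform(n, k, seed=0):
    bound = 6.0 / np.sqrt(k)
    return np.random.default_rng(seed).uniform(-bound, bound, size=(n, k))

ent_emb = init_uniform(1000, 100)    # entity embeddings; bound = 0.6 for k = 100
hrs_emb = np.zeros((300, 100))       # cluster / sub-relation parts start at zero
```

Starting the cluster and sub-relation parts at zero means the extended model initially behaves like the base model, and the extra HRS components grow only as the data demands.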

Data Sets
In this research, we evaluate the performance of our extended models on the popular benchmarks FB15k (Bordes et al., 2013), FB15k-237 (Toutanova and Chen, 2015), FB13 (Socher et al., 2013), WN18 (Bordes et al., 2013) and WN11 (Socher et al., 2013). FB15k, FB15k-237 and FB13 are extracted from Freebase (Bollacker et al., 2008), which provides general facts about the world. WN18 and WN11 are obtained from WordNet (Miller, 1994), which provides semantic knowledge of words. FB15k-237 and WN18 are used for the link prediction task, FB13 and WN11 are used for the triple classification task, and FB15k is used for both tasks. The statistics of the five data sets are summarized in Table 1.

Baselines
To demonstrate the effectiveness of our models, we compare results with the following baselines.
• DistMult (Yang et al., 2015): a state-of-the-art model that uses a bilinear score function to compute the scores of knowledge triples.
• TransG (Xiao et al., 2016): the first generative KGE model, which uses a non-parametric Bayesian model to embed KGs.

Link Prediction
Link prediction, a.k.a. knowledge graph completion, aims to fill in the missing element of an incomplete knowledge triple. More formally, the goal of link prediction is to predict either the head entity in a given query (?, r, t) or the tail entity in a given query (h, r, ?).

Experimental Settings
All the parameters are set by preliminary tests. For TransE-HRS, TransE-top-middle and TransE-middle-bottom, λ_1, λ_2, λ_3 and the margin γ are set as λ_1 = 1e-5, λ_2 = 1e-4, λ_3 = 1e-3, γ = 2. For the extended models of TransH, we set λ_1 = 1e-5, λ_2 = 1e-5, λ_3 = 1e-3, γ = 1. For the extended models of DistMult, the parameters are set as λ_1 = 1e-5, λ_2 = 1e-4, λ_3 = 1e-3, γ = 1. For all the above models, the learning rate ς, batch size b and embedding size k are set as ς = 1e-3, b = 4096, k = 100. The L_1 norm is adopted by the score function of TransE and its extended models. The number of relation clusters is set to 300, 120 and 10 for FB15k, FB15k-237 and WN18 respectively. For all the data sets, we generate 3 sub-relations for relations that have more than 500 occurrences in the training set. For all the extended models and baselines, we produce negative triples following the "bern" sampling strategy introduced in TransH (Wang et al., 2014). For the baselines TransE, TransH and DistMult, the embedding parameters of entities and relations are initialized in the same way as for the extended models, for a fair comparison.
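For reference, the "bern" negative-sampling strategy from TransH can be sketched as below; the function and its per-relation statistics are a hedged illustration, not the paper's code:

```python
import random

# Sketch of "bern" sampling (Wang et al., 2014): corrupt the head with
# probability tph / (tph + hpt), where tph is the average number of tail
# entities per head and hpt the average number of heads per tail for r.
def corrupt_bern(triple, tph, hpt, entities, rng):
    h, r, t = triple
    if rng.random() < tph[r] / (tph[r] + hpt[r]):
        return (rng.choice(entities), r, t)       # replace the head entity
    return (h, r, rng.choice(entities))           # replace the tail entity

rng = random.Random(0)
neg = corrupt_bern(("e1", "r0", "e2"), {"r0": 3.0}, {"r0": 1.0},
                   ["e1", "e2", "e3", "e4"], rng)
```

The intuition: for one-to-many relations it is safer to corrupt the head (a random head is almost surely wrong), and for many-to-one relations the tail, which reduces false-negative training triples.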
In the test phase, for each triple in the test set, we replace the head and tail entities in turn with every entity in the KG. Then we compute a score for each corrupted triple. Note that for each corrupted triple (h', r, t'), the sub-relation is determined by t' - h', i.e., the k-means model is adopted to assign t' - h' to a specific sub-relation of r. We rank all the candidate entities according to their scores; positive candidates are supposed to precede negative ones. Finally, the rank of the correct entity is recorded. We compare our models with the baselines using the following metrics: (1) Mean Rank (MR, the mean of all the predicted ranks); (2) Mean Reciprocal Rank (MRR, the mean of the reciprocals of all predicted ranks); (3) Hits@n (Hn, the proportion of ranks not larger than n). Lower values of MR and larger values of MRR and Hn indicate better performance. All the results are reported in the "filtered" setting (Bordes et al., 2013).
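The filtered ranking protocol and the three metrics can be sketched as follows (an illustrative, TransE-style lower-is-better convention; not the evaluation code used in the paper):

```python
import numpy as np

# Sketch of "filtered" evaluation: candidates known to be true (other than
# the test triple itself) are removed before computing the rank.
def filtered_rank(scores, correct_idx, known_true):
    keep = np.ones(len(scores), dtype=bool)
    for i in known_true:
        if i != correct_idx:
            keep[i] = False                       # filter other true answers
    better = np.sum(scores[keep] < scores[correct_idx])
    return int(better) + 1                        # rank of the correct entity

def metrics(ranks):
    ranks = np.asarray(ranks, dtype=float)
    return {"MR": ranks.mean(),
            "MRR": (1.0 / ranks).mean(),
            "Hits@10": float((ranks <= 10).mean())}
```

This also makes the MR-sensitivity point in the results section concrete: a handful of very large ranks inflate the mean of `ranks` directly, while their reciprocals barely move the MRR.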

Experimental Results
Evaluation results are shown in Table 2. We divide all the results into 4 groups. The second, third and fourth groups contain the results of TransE, TransH, DistMult and their extended models respectively, while the first group contains the results of other state-of-the-art competitors. Results in bold font are the best results in each group and underlined results denote the best results in each column. From Table 2, we have the following findings: (1) Our extended models outperform the original models, which indicates that the information learned from the HRS is valuable; (2) On WN18, the results of the 'top-middle' models of TransE, TransH and DistMult are worse than those of the original models, and the HRS models cannot outperform the middle-bottom ones. We conjecture the reason is as follows: WN18 has only 18 relations and the semantic correlation among relations is weak. In this case, the information learned from the top to the middle layer of the HRS may lead to worse results, since for each relation, even though the information learned from semantically similar relations is useful, the information learned from unrelated relations may damage the results. This indicates that the HRS models are especially useful for KGs with dense semantic distributions over relations; (3) On WN18, TransE-middle-bottom and DistMult-middle-bottom achieve the best results on MRR, Hits@10, Hits@3 and Hits@1 while failing to achieve the best results on MR in the same group. Further analysis shows that in the results of TransE-middle-bottom, 56 test triples receive ranks of more than 10000, contributing a loss of more than 110 in MR, while in the results of DistMult-middle-bottom, 37 test triples receive ranks of more than 7000, contributing a loss of about 50 in MR.
Indeed, MR is sensitive to these high ranks, which leads to worse results on the MR metric; (4) Across all the results, built on the strong base model DistMult, the extended models of DistMult achieve the best performance compared with the other state-of-the-art baselines CTransR, TransD and TransG.
We also provide some case studies on relation clusters and sub-relations. Table 3 shows some relation clusters of FB15k. Clusters 1 to 3 contain Olympics-related, basketball-related and software-related relations respectively. From Table 3 we can see that semantically related relations join the same cluster. Table 4 shows some (head, tail) pairs for the sub-relations of '/educational institution/education/degree'. Sub-relations 1 to 3 are about the degrees of Doctor, Master and Bachelor respectively. Table 5 gives some (head, tail) pairs for the sub-relations of '/music/artist/genre'. Sub-relations 1 and 2 are about rock music and pop music respectively, while sub-relation 3 covers other kinds of music. From Tables 4 and 5, we can see that different sub-relations give fine-grained descriptions for each relation.

Parameter Study
In this section, we study how performance is affected by the number of relation clusters N_1 and the number of sub-relations per relation N_2. The results in Figures 2 and 3 clearly show that there exists an optimal value of N_1 and N_2 for each data set. All three models keep achieving better results as we increase the number of clusters from 0 to the optimal value. Then, after N_1 and N_2 exceed the optimal point, the performance starts to decline. The reasons are as follows: (1) A smaller value of N_1 leads to large relation clusters; some unrelated relations may join the same large cluster and degrade the performance of our models. A larger value of N_1 leads to small relation clusters, so less information can be leveraged by each relation, which also leads to unsatisfying performance; (2) A smaller value of N_2 cannot provide sufficiently fine-grained representations for each relation and degrades the performance of our models. A larger value of N_2 may lead to a lack of training data for each sub-relation and also results in unsatisfying performance.

Triple Classification
In order to test the discriminative capability of our models, we conduct a triple classification task that aims to predict the label (True or False) of a given triple (h, r, t).

Experimental Settings
In this paper, we use three data sets, WN11, FB13 and FB15k, to evaluate our models. The data sets WN11 and FB13 released by NTN (Socher et al., 2013) already contain negative triples. The test set of FB15k only contains correct triples, which requires us to construct negative triples. In this study, we construct negative triples following the same setting used for FB13 (Socher et al., 2013). For the extended models of TransE, λ_1, λ_2, λ_3 and γ are set as λ_1 = 1e-5, λ_2 = 1e-5, λ_3 = 1e-3 and γ = 4. For the extended models of TransH, we set λ_1 = 1e-5, λ_2 = 1e-4, λ_3 = 1e-3 and γ = 5. For the extended models of DistMult, the parameters are set as λ_1 = 1e-5, λ_2 = 1e-4, λ_3 = 1e-2 and γ = 4. For WN11 and FB13, we generate 2 sub-relations for each relation. For FB15k, we generate 3 sub-relations for relations that have more than 500 occurrences in the training set. The other parameters are set as introduced in Section 4.3.1. We follow the same decision process as NTN (Socher et al., 2013): for TransE and TransH, a triple is predicted to be positive if f_r(h, t) is below a threshold, while for DistMult, a triple is regarded as positive if f_r(h, t) is above a threshold; otherwise it is negative. The thresholds are determined on the validation set. We adopt accuracy as our evaluation metric.
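The decision process above can be sketched as follows; the per-relation threshold search over validation scores is an assumption about how the tuning could look, not the paper's exact procedure:

```python
# Sketch of threshold-based triple classification. TransE/TransH scores are
# distances (lower = positive); DistMult scores are similarities (higher = positive).
def classify(score, threshold, higher_is_positive=False):
    return score > threshold if higher_is_positive else score < threshold

def best_threshold(scores, labels, higher_is_positive=False):
    """Pick the candidate threshold maximizing validation accuracy."""
    def acc(th):
        hits = sum(classify(s, th, higher_is_positive) == y
                   for s, y in zip(scores, labels))
        return hits / len(labels)
    return max(sorted(set(scores)), key=acc)
```

Restricting candidate thresholds to the observed scores is sufficient here, since accuracy only changes when the threshold crosses one of them.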

Experimental Results
Finally, the evaluation results in Table 6 lead to the following findings: (1) Our models outperform the other baselines on WN11 and FB15k and obtain comparable results on FB13, which validates the effectiveness of our models; (2) The extended models TransE-HRS, TransH-HRS and DistMult-HRS achieve substantial improvements over the original models. On WN11, TransE-HRS outperforms TransE by a margin as large as 10.9%. These improvements indicate that the technique of utilizing the HRS information can be extended to different KGE models. Figure 4 shows the classification accuracy for different relations on WN11. We can see that the extended models significantly improve on the original models for each relation, which again validates the effectiveness of our models.

Conclusion
In this paper, we found that relations in KGs conform to a three-layer HRS. This HRS provides a critical capacity for embedding entities and relations, and along this line we extended three state-of-the-art models to leverage the HRS information. The technique we used can easily be applied to extend other KGE models. Moreover, our proposed models do not need additional information such as text or paths; instead, we made full use of the knowledge triples in KGs and the rich information from the HRS. We evaluated our models on the link prediction and triple classification tasks. The results show that our extended models achieve substantial improvements over the original models as well as the other baseline competitors.
In the future, we will utilize more sophisticated models to leverage the HRS information, e.g., (1) utilizing the embeddings of the three layers in a more sophisticated way instead of simply summing them; (2) determining the number of relation clusters and sub-relations automatically instead of manually.