SEEK: Segmented Embedding of Knowledge Graphs

In recent years, knowledge graph embedding has become a popular research topic in artificial intelligence and plays an increasingly vital role in various downstream applications, such as recommendation and question answering. However, existing methods for knowledge graph embedding cannot make a proper trade-off between model complexity and model expressiveness, which leaves them still far from satisfactory. To mitigate this problem, we propose a lightweight modeling framework that can achieve highly competitive relational expressiveness without increasing the model complexity. Our framework focuses on the design of scoring functions and highlights two critical characteristics: 1) facilitating sufficient feature interactions; 2) preserving both symmetry and antisymmetry properties of relations. It is noteworthy that, owing to the general and elegant design of its scoring functions, our framework can incorporate many famous existing methods as special cases. Moreover, extensive experiments on public benchmarks demonstrate the efficiency and effectiveness of our framework. Source code and data can be found at https://github.com/Wentao-Xu/SEEK.


Introduction
Learning embeddings for a knowledge graph (KG) is a vital task in artificial intelligence (AI) and can benefit many downstream applications, such as personalized recommendation (Zhang et al., 2016) and question answering (Huang et al., 2019). In general, a KG stores a large collection of entities and inter-entity relations in a triple format, (h, r, t), where h denotes the head entity, t represents the tail entity, and r corresponds to the relationship between h and t. The goal of knowledge graph embedding (KGE) is to project massive interconnected triples into a low-dimensional space while preserving the original semantic information.
Although recent years have witnessed tremendous research efforts on the KGE problem, existing research has not made a proper trade-off between model complexity (the number of parameters) and model expressiveness (the performance in capturing semantic information). To illustrate this issue, we divide existing research into two categories.
The first category of methods prefers simple models but suffers from poor expressiveness. Some early KGE methods, such as TransE (Bordes et al., 2013) and DistMult (Yang et al., 2015), fall into this category. It is easy to apply these methods to large-scale real-world KGs, but their performance in capturing semantic information (e.g., for link prediction) is far from satisfactory.
To address these drawbacks of existing methods, in this paper, we propose a light-weight framework for KGE that achieves highly competitive expressiveness without increasing the model complexity. Next, we introduce our framework from three aspects: 1) facilitating sufficient feature interactions; 2) preserving various necessary relation properties; 3) designing both efficient and effective scoring functions.
First, to pursue high expressiveness with reasonable model complexity, we need to facilitate more sufficient feature interactions given the same number of parameters. Specifically, we divide the embedding dimension into multiple segments and encourage interactions among different segments. In this way, we can obtain highly expressive representations without increasing the number of model parameters. Accordingly, we name our framework Segmented Embedding for KGs (SEEK).
Second, it is crucial to preserve different relation properties, especially symmetry and antisymmetry. We note that some previous research did not preserve symmetry or antisymmetry and thus obtained inferior performance (Bordes et al., 2013; Lin et al., 2015). Similar to recent advanced models (Trouillon et al., 2016; Kazemi and Poole, 2018; Sun et al., 2019; Xu and Li, 2019), we also pay close attention to the modeling of both symmetric and antisymmetric relationships.
Third, after an exhaustive review of the literature, we find that one critical difference between various KGE methods lies in the design of scoring functions. Therefore, we dive deeply into designing powerful scoring functions for a triple (h, r, t). Specifically, we combine the above two aspects (facilitating feature interactions and preserving various relation properties) and develop four kinds of scoring functions progressively. Based on these scoring functions, we can specify many existing KGE methods, including DistMult (Yang et al., 2015), HolE (Nickel et al., 2016), and ComplEx (Trouillon et al., 2016), as special cases of SEEK. Hence, as a general framework, SEEK can help readers better understand the pros and cons of existing research as well as the relationships between individual methods. Moreover, extensive experiments demonstrate that SEEK achieves either state-of-the-art or highly competitive performance on a variety of KGE benchmarks compared with existing methods.
In summary, this paper makes the following contributions.
-We propose a light-weight framework (SEEK) for KGE that achieves highly competitive expressiveness without increasing the model complexity.
-As a unique framework that focuses on designing scoring functions for KGE, SEEK combines two critical characteristics: facilitating sufficient feature interactions and preserving fundamental relation properties.
-As a general framework, SEEK can incorporate many previous methods as special cases, which can help readers to understand and compare existing research.
-Extensive experiments demonstrate the effectiveness and efficiency of SEEK. Moreover, sensitivity experiments about the number of segments also verify the robustness of SEEK.

Related Work
We can categorize most of the existing work into two categories according to the model complexity and the model expressiveness.
The first category of methods is simple but lacks expressiveness, and can easily scale to large knowledge graphs. This kind of method includes TransE (Bordes et al., 2013) and DistMult (Yang et al., 2015). TransE treats the relation r as a translation from the head entity h to the tail entity t in the embedding space of (h, r, t); DistMult utilizes the multi-linear dot product as the scoring function.
The second kind of work introduces more parameters to improve the expressiveness of the simple methods. TransH (Wang et al., 2014), TransR (Lin et al., 2015), TransD (Ji et al., 2015), and ITransF (Xie et al., 2017) are extensions of TransE, which introduce extra parameters to map entities and relations into different semantic spaces. The Single DistMult (Kadlec et al., 2017) increases the embedding size of DistMult to obtain more expressive features. Besides, ProjE (Shi and Weninger, 2017), ConvE (Dettmers et al., 2018), and InteractE (Vashishth et al., 2019) leverage neural networks to capture more feature interactions between embeddings and thus improve the expressiveness. However, these neural network-based methods also introduce many additional parameters. Although the second kind of method performs better than the simple methods, they are difficult to apply to real-world KGs due to their high model complexity (a large number of parameters).
Compared with the two types of methods above, our SEEK can achieve high expressiveness without increasing the number of model parameters.

Methods
Method                                 Scoring Function                           Performance   # Parameters   Sym   Antisym
TransE (Bordes et al., 2013)           ||h + r − t||                              Low           Small                 ✓
DistMult (Yang et al., 2015)           ⟨h, r, t⟩                                  Low           Small          ✓
ComplEx (Trouillon et al., 2016)       Re(⟨h, r, t̄⟩)                              Low           Small          ✓     ✓
Single DistMult (Kadlec et al., 2017)  ⟨h, r, t⟩                                  High          Large          ✓
ConvE (Dettmers et al., 2018)          f(vec(f([h, r] ∗ ω))W)t                    High          Large                 ✓
SEEK                                   Σ_{x,y} s_{x,y} · ⟨r_x, h_y, t_{w_{x,y}}⟩  High          Small          ✓     ✓

Table 1: Comparison between our SEEK framework and some representative knowledge graph embedding methods in terms of the scoring function, performance, the number of parameters, and the ability to preserve the symmetry (Sym) and antisymmetry (Antisym) properties of relations.

Table 1 shows the comparison between our framework and some representative KGE methods in different aspects. Besides, preserving the symmetry and antisymmetry properties of relations is vital for KGE models. Many recent methods are devoted to preserving these relation properties to improve the expressiveness of embeddings (Trouillon et al., 2016; Nickel et al., 2016; Ding et al., 2018; Kazemi and Poole, 2018; Sun et al., 2019; Xu and Li, 2019). Motivated by these methods, we also pay attention to preserving the symmetry and antisymmetry properties of relations when designing our scoring functions.

SEEK
Briefly speaking, we build SEEK by designing scoring functions, which are one of the most critical components of existing KGE methods, as discussed in the related work. During the design of the scoring functions, we progressively introduce two characteristics that contribute hugely to the model expressiveness: 1) facilitating sufficient feature interactions; 2) supporting both symmetric and antisymmetric relations. In this way, SEEK achieves excellent model expressiveness with a light-weight model that has the same number of parameters as simple KGE counterparts such as TransE (Bordes et al., 2013) and DistMult (Yang et al., 2015).

Scoring Functions
In this section, we illustrate our four scoring functions progressively.

f1: Multi-linear Dot Product
First, we start with the scoring function f1 developed by Yang et al. (2015), which computes a multi-linear dot product of three vectors:

f1(h, r, t) = ⟨r, h, t⟩ = Σ_i r_i · h_i · t_i,

where r, h, and t are the low-dimensional representations of the relation r, the head entity h, and the tail entity t, respectively, and r_i, h_i, and t_i correspond to the i-th dimension of r, h, and t, respectively. We note that the function f1 is the building block of much previous research (Trouillon et al., 2016; Kadlec et al., 2017; Kazemi and Poole, 2018). Different from this existing research, we focus on designing more advanced scoring functions with better expressiveness.
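To make the computation concrete, here is a minimal NumPy sketch of f1 (an illustration, not the authors' released implementation):

```python
import numpy as np

def f1(h, r, t):
    """Multi-linear dot product <r, h, t> = sum_i r_i * h_i * t_i."""
    return float(np.sum(r * h * t))
```

Note that f1 is invariant to swapping h and t, which is exactly why a DistMult-style score alone cannot model antisymmetric relations.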

f2: Multi-linear Dot Product Among Segments
Next, we introduce fine-grained feature interactions to improve the model expressiveness further.
To be specific, we develop the scoring function f2, which conducts the multi-linear dot product among different segments of the entity/relation embeddings. First, we uniformly divide the d-dimensional embeddings of the head h, the relation r, and the tail t into k segments, so the dimension of each segment is d/k. For example, we can write the embedding of relation r as:

r = [r_0; r_1; …; r_{k−1}],

where r_x is the x-th segment of the embedding r. Then, we define the scoring function f2 as follows:

f2(h, r, t) = Σ_{x,y,w=0}^{k−1} ⟨r_x, h_y, t_w⟩.

Compared with the scoring function f1, where the interactions only happen among the same positions of the h, r, and t embeddings, the scoring function f2 can exploit more feature interactions among different segments of the embeddings.
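A direct, unoptimized NumPy sketch of f2, assuming the embedding dimension d is divisible by k (function names are ours):

```python
import numpy as np

def f2(h, r, t, k):
    """Sum the multi-linear dot product over all k^3 segment combinations."""
    seg = len(r) // k
    rs, hs, ts = (v.reshape(k, seg) for v in (r, h, t))
    return float(sum(np.sum(rs[x] * hs[y] * ts[w])
                     for x in range(k) for y in range(k) for w in range(k)))
```

Because the sum ranges over every (y, w) pair symmetrically, swapping h and t leaves the score unchanged; this is exactly the limitation addressed by f3 below.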

f3: Modeling both Symmetric and Antisymmetric Relations
Although the scoring function f2 can facilitate fine-grained feature interactions, it can only preserve the symmetry property of relations and cannot support the modeling of antisymmetric relations: since the summation in f2 ranges over all segment combinations, f2(h, r, t) is always equal to f2(t, r, h), regardless of whether the relation r is symmetric. To preserve the antisymmetry property of relations, we divide the segments of the relation embedding r into odd and even parts. Then we define a variable s_{x,y} that enables the even parts of the segments to capture the symmetry property of relations and the odd parts to capture the antisymmetry property. The scoring function after adding s_{x,y} is:

f3(h, r, t) = Σ_{x,y,w=0}^{k−1} s_{x,y} · ⟨r_x, h_y, t_w⟩,  where s_{x,y} = −1 if x is odd and x + y ≥ k, and s_{x,y} = 1 otherwise.

In the scoring function f3, s_{x,y} indicates the sign of each dot product term ⟨r_x, h_y, t_w⟩. Figure 1 depicts an example of the function f3 with k = 2. When r_x is an even part of r (the index x is even), s_{x,y} is positive, and the summation of the even parts of f3(h, r, t) equals the corresponding summation of f3(t, r, h). Therefore, the function f3 can model symmetric relations via the even segments of r.
When r_x is an odd part of r (the index x is odd), s_{x,y} can be either negative or positive depending on whether x + y ≥ k. Then, the summation of the odd parts of f3(h, r, t) differs from that of f3(t, r, h). Accordingly, f3(h, r, t) can support antisymmetric relations with the odd segments of r.
The scoring function f3 can support both symmetric and antisymmetric relations inherently because of the design of the segmented embeddings. Moreover, the optimization of the relation embeddings is entirely data-driven; we thus focus on providing a proper mechanism to capture common relation properties.
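Under the sign convention described above (s_{x,y} = −1 only when x is odd and x + y ≥ k), f3 can be sketched as follows; the function names are ours, not from the released code:

```python
import numpy as np

def sign(x, y, k):
    """s_{x,y}: negative only for odd relation segments with x + y >= k."""
    return -1.0 if (x % 2 == 1 and x + y >= k) else 1.0

def f3(h, r, t, k):
    """f3: k^3 signed segment-level multi-linear dot products."""
    seg = len(r) // k
    hs, rs, ts = (v.reshape(k, seg) for v in (h, r, t))
    return float(sum(sign(x, y, k) * np.sum(rs[x] * hs[y] * ts[w])
                     for x in range(k) for y in range(k) for w in range(k)))
```

Zeroing the odd segments of r removes every signed term, so the remaining even part of the score is symmetric in h and t, matching the analysis above.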

f4: Reducing Computing Overheads
However, though it captures various relation properties, the function f3 suffers from huge computation overheads: there are k³ dot product terms ⟨r_x, h_y, t_w⟩ in total, so the scoring function f3 needs k³ dot products to compute the score of a triple (h, r, t). Recall that the dimension of each segment is d/k, so each multi-linear dot product requires O(d/k) multiplications; the time complexity of the function f3 is therefore O(k³ × d/k) = O(k²d). To reduce the computation overheads of the function f3, we introduce another variable w_{x,y} that determines the segment index of the tail entity t. Accordingly, we define the scoring function f4 as follows:

f4(h, r, t) = Σ_{x,y=0}^{k−1} s_{x,y} · ⟨r_x, h_y, t_{w_{x,y}}⟩,  where w_{x,y} = y if x is even, and w_{x,y} = (x + y) % k if x is odd.

The scoring function f4 reduces the number of dot product terms to k², so its time complexity is O(k² × d/k) = O(kd). Moreover, the scoring function f4 still preserves the symmetry property in the even parts of r and the antisymmetry property in the odd parts of r. Figure 2 shows an example of the scoring function f4 with k = 4. The dot product terms in Figure 2 can be categorized into four groups according to the segment indexes of r. In the groups of r_0 and r_2, which are the even parts of r, the index w_{x,y} of the segment t_{w_{x,y}} is the same as the index y of the segment h_y, and s_{x,y} is always positive. Thus, the summation Σ s_{x,y} · ⟨r_x, h_y, t_{w_{x,y}}⟩ of the even parts of f4(h, r, t) is equal to the corresponding summation Σ s_{x,y} · ⟨r_x, t_y, h_{w_{x,y}}⟩ of f4(t, r, h). In the groups of r_1 and r_3, which are the odd parts of r, the segment index of t is (x + y) % k, where x and y are the segment indexes of r and h, respectively. When x + y ≥ k, the variable s_{x,y} changes from positive to negative, so the summations of the odd parts of f4(h, r, t) and f4(t, r, h) are not the same.
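Combining s_{x,y} and w_{x,y}, a minimal sketch of f4 (illustrative naming, assuming d divisible by k):

```python
import numpy as np

def f4(h, r, t, k):
    """SEEK score: k^2 segment-level dot products with signs and tail indices."""
    seg = len(r) // k
    hs, rs, ts = (v.reshape(k, seg) for v in (h, r, t))
    score = 0.0
    for x in range(k):
        for y in range(k):
            w = y if x % 2 == 0 else (x + y) % k               # w_{x,y}
            s = -1.0 if (x % 2 == 1 and x + y >= k) else 1.0   # s_{x,y}
            score += s * np.sum(rs[x] * hs[y] * ts[w])
    return float(score)
```

Compared with f3's k³ terms, only k² terms are evaluated, so one score costs O(k² · d/k) = O(kd) multiplications.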
Besides, it is apparent that the number of feature interactions among h, r, and t increases by a factor of k, since each segment has k interactions with other segments.
In summary, the scoring function f4 of our SEEK framework has the following characteristics:
- Tunable Computation. The scoring function involves each segment of r, h, and t exactly k times. Thus the number of feature interactions and the computation cost are fully tunable through a single hyperparameter k.
- Symmetry and Antisymmetry. The even parts of r preserve the symmetry property of relations, and the odd parts of r preserve the antisymmetry property.
-Dimension Isolation. The dimensions within the same segment are isolated from each other, which will prevent the embeddings from excessive correlations.

Discussions
Complexity analysis As described above, the number of dot product terms in the scoring function f4 is k², and each term requires O(d/k) multiplications, so the time complexity of our SEEK framework is O(k² × d/k) = O(kd), where k is a small constant such as 4 or 8. For the space complexity, the dimension of the entity and relation embeddings is d, and there are no other parameters in SEEK, so the space complexity is O(d). The low time and space complexity demonstrate that SEEK has high scalability, which is vital for large-scale real-world knowledge graphs.
Connection with existing methods Our SEEK framework is a generalization of some existing methods, such as DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), and HolE (Nickel et al., 2016). In the following, we prove that these methods are special cases of our framework when we set k = 1 and k = 2, respectively.
Proposition 1. SEEK (k = 1) is equivalent to DistMult.
Proof. The proof is trivial. Given k = 1, we have x = 0 and y = 0 in the scoring function f4, with r_0 = r, h_0 = h, and t_0 = t. Thus the function f4 can be written as f4^{k=1}(h, r, t) = ⟨r, h, t⟩, which is exactly the scoring function of DistMult.
Proposition 2. SEEK (k = 2) is equivalent to ComplEx and HolE.
Proof. Given k = 2, by expanding the k² = 4 terms of f4 with their signs and tail indices, the function f4 can be written as:

f4^{k=2}(h, r, t) = ⟨r_0, h_0, t_0⟩ + ⟨r_0, h_1, t_1⟩ + ⟨r_1, h_0, t_1⟩ − ⟨r_1, h_1, t_0⟩.

If we consider r_0, h_0, t_0 as the real parts of r, h, t, and r_1, h_1, t_1 as the imaginary parts, then f4^{k=2}(h, r, t) is exactly the scoring function Re(⟨r, h, t̄⟩) of the ComplEx framework. Since Hayashi and Shimbo (2017) have already discussed the equivalence of ComplEx and HolE, SEEK (k = 2) is also equivalent to the HolE framework.
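The equivalence for k = 2 is easy to check numerically; the sketch below assumes each embedding concatenates its real half and imaginary half (our naming):

```python
import numpy as np

def f4_k2(h, r, t):
    """f4 with k = 2, written out term by term."""
    n = len(r) // 2
    r0, r1, h0, h1, t0, t1 = r[:n], r[n:], h[:n], h[n:], t[:n], t[n:]
    return float(np.sum(r0 * h0 * t0) + np.sum(r0 * h1 * t1)
                 + np.sum(r1 * h0 * t1) - np.sum(r1 * h1 * t0))

def complex_score(h, r, t):
    """ComplEx: Re(<r, h, conj(t)>); first half = real part, second = imaginary."""
    n = len(r) // 2
    rc, hc, tc = (v[:n] + 1j * v[n:] for v in (r, h, t))
    return float(np.real(np.sum(rc * hc * np.conj(tc))))
```

Expanding Re((r0 + i·r1)(h0 + i·h1)(t0 − i·t1)) term by term reproduces the four signed products above.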

Training
SEEK takes the negative log-likelihood loss with L2 regularization as its objective function to optimize the parameters of the entities and relations:

L(Θ) = − Σ_{(h,r,t)∈Ω} log σ(Y_hrt · f4(h, r, t)) + λ‖Θ‖²_2,   (5)

where σ is the sigmoid function defined as σ(x) = 1 / (1 + e^{−x}), and Θ represents the parameters in the embeddings of the entities and relations in the knowledge graph; Ω is the triple set containing the true triples of the knowledge graph and the false triples generated by negative sampling. In negative sampling, we generate a false triple (h′, r, t) or (h, r, t′) by replacing the head or tail entity of a true triple with a random entity. Y_hrt is the label of (h, r, t), which is 1 for true triples and −1 for false triples, and λ is the L2 regularization parameter. The gradients of Equation 5 are then given by:

∂L/∂θ = − Σ_{(h,r,t)∈Ω} (1 − σ(Y_hrt · f4(h, r, t))) · Y_hrt · ∂f4/∂θ + 2λθ,

where L represents the objective function of SEEK, and θ denotes the parameters in the segments. Specifically, the partial derivatives of the function f4 with respect to the x-th segment of r and the y-th segment of h are:

∂f4/∂r_x = Σ_y s_{x,y} · h_y ⊙ t_{w_{x,y}},
∂f4/∂h_y = Σ_x s_{x,y} · r_x ⊙ t_{w_{x,y}},

where ⊙ is the entry-wise product of two vectors, e.g., c = a ⊙ b means that the i-th dimension of c is a_i · b_i. The derivative of the scoring function f4 with respect to t_w is different from the above two:

∂f4/∂t_w = Σ_{x,y} 1_[w = w_{x,y}] · s_{x,y} · r_x ⊙ h_y,

where 1_[w = w_{x,y}] has value 1 if w = w_{x,y} holds, and 0 otherwise.
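For concreteness, the objective on a batch of pre-computed scores can be sketched as follows (variable names are ours; the paper optimizes this loss with asynchronous SGD and AdaGrad rather than plain NumPy):

```python
import numpy as np

def nll_loss(scores, labels, params, lam):
    """Negative log-likelihood with L2 regularization (Equation 5).

    scores: f4(h, r, t) for every triple in Omega
    labels: Y_hrt, +1 for true triples and -1 for sampled negatives
    params: flattened embedding parameters Theta
    lam:    L2 regularization weight lambda
    """
    sigma = 1.0 / (1.0 + np.exp(-labels * scores))  # sigmoid(Y * f4)
    return float(-np.sum(np.log(sigma)) + lam * np.sum(params ** 2))
```

With a score of 0 and label +1, each triple contributes −log σ(0) = log 2, which is a quick sanity check on the sign conventions.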

Experimental Evaluation
In this section, we present thorough empirical studies to evaluate and analyze our proposed SEEK framework. We first introduce the experimental setting and evaluate SEEK on the task of link prediction. We then study the influence of the number of segments k on the SEEK framework, and present case studies to demonstrate why SEEK is highly effective.

Experimental Setting
Datasets In our experiments, we first use a de facto benchmark dataset: FB15K. FB15K is a subset of the Freebase dataset (Bollacker et al., 2008), and we use the same training, validation, and test sets provided by (Bordes et al., 2013). We also use two datasets proposed in recent years: DB100K (Ding et al., 2018) and YAGO37 (Guo et al., 2018). DB100K was built from the mapping-based objects of core DBpedia (Bizer et al., 2009); YAGO37 was extracted from the core facts of YAGO3 (Mahdisoltani et al., 2013). Table 2 lists the statistics of the three datasets.  Compared Methods Many knowledge graph embedding methods have been developed in recent years. We categorize the compared methods into the following groups: -Simple knowledge graph embedding methods that have low time and space complexity, such as TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), HolE (Nickel et al., 2016), ComplEx (Trouillon et al., 2016), and Analogy (Liu et al., 2017). Specifically, TransE is a translation-based method, and the others are multi-linear dot product-based frameworks.
-Methods that achieve state-of-the-art performance on DB100K and YAGO37, including RUGE (Guo et al., 2018) and ComplEx-NNE+AER (Ding et al., 2018).
-We also evaluate the scoring function f2 as an ablation study of our approach, which allows us to observe the respective effects of facilitating sufficient feature interactions and preserving the relation properties. Since the scoring function f2 can only preserve the symmetry property, we refer to it as Sym-SEEK.
Since our framework does not use additional information such as text (Toutanova and Chen, 2015), relational paths (Ebisu and Ichise, 2019), or external memory (Shen et al., 2017), we do not compare with methods that use such information. Moreover, we only compare our method with single models; the Ensemble DistMult (Kadlec et al., 2017) is a simple ensemble of multiple different methods, so we do not compare with it.
Experimental Details We use asynchronous stochastic gradient descent (SGD) with the learning rate adapted by AdaGrad (Duchi et al., 2011) to optimize our framework. The loss function of our SEEK framework is given by Equation 5. We conducted a grid search to find the hyperparameters that maximize the results on the validation set, tuning the number of segments k ∈

Link Prediction
We study the performance of our method on the task of link prediction, a prevalent task for evaluating knowledge graph embeddings. We used the same data preparation process as (Bordes et al., 2013). Specifically, we replace the head/tail entity of each true triple in the test set with other entities in the dataset and call these derived triples corrupted triples. The goal of link prediction is to score the original true triples higher than the corrupted ones. We rank the triples by the value of the scoring function and use the MRR and Hits@N metrics to evaluate the ranking results: a) MRR: the mean reciprocal rank of the original triples; b) Hits@N: the percentage of original triples ranked in the top N of the predictions. For both metrics, we remove the corrupted triples that already exist in the datasets from the ranking results, which is known as the filtered setting in (Bordes et al., 2013). We use Hits@1, Hits@3, and Hits@10 as the Hits@N metrics. Table 3 summarizes the results of link prediction on DB100K and YAGO37, and Table 4 shows the results on FB15K. Note that the results of the compared methods on DB100K and YAGO37 are taken from (Ding et al., 2018; Guo et al., 2018); the results on FB15K are taken from (Kadlec et al., 2017; Ding et al., 2018; Kazemi and Poole, 2018; Sun et al., 2019; Xu and Li, 2019).
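Given the filtered rank of each original test triple, the two metrics can be sketched as (our helper, not the evaluation script from the repository):

```python
def mrr_and_hits(ranks, ns=(1, 3, 10)):
    """Compute MRR and Hits@N from the filtered, 1-based ranks of the
    original test triples among their corrupted counterparts."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {n: sum(r <= n for r in ranks) / len(ranks) for n in ns}
    return mrr, hits
```

For example, ranks of 1, 2, and 10 yield an MRR of (1 + 1/2 + 1/10) / 3 ≈ 0.533.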
On DB100K, SEEK outperforms the compared methods on all metrics, and Sym-SEEK also achieves good performance. On YAGO37, SEEK and Sym-SEEK obtain similar results and outperform the previous methods. The results on YAGO37 show that exploiting more feature interactions can significantly improve the performance of the embeddings, while preserving the relation properties brings only a slight additional improvement on this dataset. On FB15K, SEEK achieves the best performance on MRR, Hits@1, and Hits@3. Although SEEK is worse than the Single DistMult on Hits@10, the Single DistMult is just a higher-dimensional version of DistMult: it uses 512-dimensional embeddings, which are larger than the 400-dimensional embeddings of SEEK. On the whole, our method's improvements on these datasets demonstrate that it has higher expressiveness.

Influence of the Number of Segments k
In the SEEK framework, a larger number of segments k implies more feature interactions and a higher computational cost. To empirically study the influence of k on the performance and computation time of SEEK, we let k vary in {1, 4, 8, 16, 20}, fix all the other hyperparameters, and observe the MRR and time cost of the link prediction task on the test set of FB15K. Figure 3 shows the MRR and time costs for different segment counts k on FB15K. As we can see, changing k affects the performance of the knowledge graph embeddings significantly. When k varies from 1 to 8, the performance increases steadily. However, when k becomes even larger, no consistent and dramatic improvements are observed on the FB15K dataset. This phenomenon suggests that excessive feature interactions cannot further improve performance; therefore, k is a sensitive hyperparameter that needs to be tuned for the best performance on a given dataset. Figure 3 also illustrates that the running time of SEEK is linear in k, which verifies that the time complexity of SEEK is O(kd).

Case Studies
We employ case studies to explain why our framework has high expressiveness. Specifically, we use the scoring functions f1, f2, and f4 to train embeddings on DB100K, respectively. Then we use the corresponding scoring functions to score the triples in the test set and their reverse triples, and we feed the scores into the sigmoid function to obtain the correct probabilities P1, P2, and P4 of each triple. Figure 4 shows the correct probabilities of some triples: two triples have symmetric relations, and the other two have antisymmetric relations. For the triples with symmetric relations, the original triples in the test set and their reverse triples are both true, and the scoring functions f1, f2, and f4 all yield high probabilities on the original and reverse triples. For the triples with antisymmetric relations, the reverse triples are false. Since f1(h, r, t) is equal to f1(t, r, h) and f2(h, r, t) is equal to f2(t, r, h), the scoring functions f1 and f2 assign high probabilities to the reverse triples as well. In contrast, the scoring function f4, which can model both symmetric and antisymmetric relations, assigns low probabilities to the reverse triples. Meanwhile, we also find that the function f2 yields higher probabilities than f1 on the true triples, which further shows that facilitating sufficient feature interactions can improve the expressiveness of embeddings.

Conclusion and Future Work
In this paper, we propose a lightweight KGE framework (SEEK) that improves the expressiveness of embeddings without increasing the model complexity. To this end, our framework focuses on designing scoring functions and highlights two critical characteristics: 1) facilitating sufficient feature interactions and 2) preserving various relation properties. Besides, as a general framework, SEEK can incorporate many existing models, such as DistMult, ComplEx, and HolE, as special cases. Our extensive experiments on widely used public benchmarks demonstrate the efficiency, effectiveness, and robustness of SEEK. In the future, we plan to extend the key insights of segmenting features and facilitating interactions to other representation learning problems.