RatE: Relation-Adaptive Translating Embedding for Knowledge Graph Completion

Many graph embedding approaches have been proposed for knowledge graph completion via link prediction. Among those, translating embedding approaches enjoy the advantages of light-weight structure, high efficiency and great interpretability. Especially when extended to complex vector space, they show the capability of handling various relation patterns including symmetry, antisymmetry, inversion and composition. However, previous translating embedding approaches defined in complex vector space suffer from two main issues: 1) the representing and modeling capacities of the model are limited by the translation function built on the rigid multiplication of two complex numbers; and 2) the embedding ambiguity caused by one-to-many relations is not explicitly alleviated. In this paper, we propose a relation-adaptive translation function built upon a novel weighted product in complex space, where the weights are learnable, relation-specific and independent of the embedding size. The translation function only requires eight more scalar parameters per relation, but improves expressive power and alleviates the embedding ambiguity problem. Based on this function, we then present our Relation-adaptive translating Embedding (RatE) approach to score each graph triple. Moreover, a novel negative sampling method is proposed to utilize both prior knowledge and self-adversarial learning for effective optimization. Experiments verify that RatE achieves state-of-the-art performance on four link prediction benchmarks.


Introduction
A knowledge graph refers to a collection of interlinked entities, which is usually formatted as a set of triples. A triple is represented as a head entity linked to a tail entity by a relation, which is written as (head, relation, tail) or (h, r, t). Large-scale knowledge graphs containing structured information, such as Freebase (Bollacker et al., 2008) and WordNet (Miller, 1995), have been leveraged to support a broad spectrum of natural language processing (NLP) tasks, e.g., question answering (Hao et al., 2017), recommender systems (Zhang et al., 2016), relation extraction (Min et al., 2013), etc. Nonetheless, human-curated, real-world knowledge graphs often suffer from incompleteness or sparseness (Toutanova et al., 2015), which inevitably hurts the performance of downstream tasks. Hence, how to automatically complete knowledge graphs has become a popular problem in both the research and industry communities.
For this purpose, many light-weight graph embedding approaches (Bordes et al., 2013; Sun et al., 2019) have been proposed. Unlike costly graph neural networks (GNNs) (Schlichtkrull et al., 2018), these approaches use low-dimensional embeddings to represent the entities and relations, and capture their relationships via semantic matching or geometric distance. Specifically, the approaches with semantic matching, e.g., DistMult (Yang et al., 2015) and QuatE (Zhang et al., 2019), use a matching function f(h, r, t) that operates on the whole triple to directly derive its plausibility score. In contrast, the approaches with geometric distance, e.g., TransE (Bordes et al., 2013) and RotatE (Sun et al., 2019), first apply a translation function to the head entity and relation to obtain a new embedding in the latent space, and then measure the distance from the new embedding to the tail entity, i.e., f(h, r, t) = −||g(h, r) − t||_p. Empirically, the latter, namely the trans-based approach, usually has higher efficiency and superior performance on link prediction than the former. Based on the translating process, it also offers better interpretability of the graph embeddings and relation modeling (Sun et al., 2019).

Table 1: A brief comparison of semantic matching and trans-based graph embedding approaches, where a check mark denotes that the model is equipped with the corresponding property. "Sym.", "Antisym.", "Inv." and "Comp." are abbreviations of the relation patterns of symmetry, antisymmetry, inversion and composition respectively. For a trans-based graph embedding model, "Disambiguation" denotes whether the model explicitly handles the embedding ambiguity problem as detailed in §2.5. Further, · denotes the generalized dot product, • denotes the element-wise (Hadamard) complex product, ⊗ denotes the element-wise Hamilton product, an overline denotes normalization of a vector, and ⊛_W denotes our proposed weighted product defined in Eq.(2).
Recently, some trans-based graph embedding approaches, e.g., RotatE (Sun et al., 2019), go beyond real vector space. They represent the entities and relations in complex vector space, and define the translation function on complex vectors. Empowered by the properties of arithmetic operations (e.g., product) in complex space, the translation function can easily capture the relation patterns of symmetry (e.g., marriage), antisymmetry (e.g., father), inversion (e.g., hypernym vs. hyponym) and composition (e.g., mother ∧ husband → father). Compared to those defined in real vector space, these approaches improve the model's capability in handling a variety of relation patterns and achieve state-of-the-art performance.
Nevertheless, current trans-based graph embedding approaches with complex embeddings suffer from the following two issues. On the one hand, although approaches defined solely in complex vector space enjoy high interpretability for various relation patterns, they are limited by the expressive power of the standard product/addition of two complex numbers. To improve upon this, QuatE (Zhang et al., 2019) introduces quaternion hypercomplex vector space with semantic matching, at the cost of both interpretability and computational overhead, yet the improvement is still marginal. On the other hand, the embedding ambiguity problem, in which different entities are assigned similar embeddings, cannot be explicitly handled by existing trans-based approaches (e.g., TransE and RotatE). It is mainly caused by applying a translation function to one-to-many relations, which optimizes every tail toward t = g(h, r), and by the propagation of the resulting similar embeddings through the graph.
To alleviate both issues above, we propose a novel Relation-adaptive translating Embedding (RatE) approach for knowledge graph completion. As an extension of the trans-based embedding approach RotatE, our proposed RatE inherits the capability to handle various relation patterns, and further presents a light-weight yet effective relation-adaptive translation function. Specifically, the function is built upon a novel element-wise weighted product defined in complex vector space, where the weights are learnable, relation-specific and independent of the embedding dimension. Rather than the rigid complex number product in RotatE and QuatE, RatE provides a more flexible operation: the resulting real and imaginary parts are each a weighted sum of the products of every pair of components (real or imaginary) from the two complex arguments. Hence, RatE only requires eight more scalar parameters per relation than the baseline RotatE, which is smaller than the embedding dimension by one or two orders of magnitude. Through the relation-adaptive translation function, the proposed approach empirically improves the capacity of modeling the translation process and alleviates the embedding ambiguity problem, while preserving most of the interpretability needed to handle various relation patterns.
We also propose a novel local-cognitive negative sampling method, by integrating type-constraint training technique (Krompaß et al., 2015) with self-adversarial learning (Sun et al., 2019). The former leverages prior knowledge in graph during training and samples negative head (tail) entities from relation-specific domain (range), which is limited by the hard sampling criterion and suffers from graph sparseness. By comparison, the latter scores a certain number of uniformly-sampled negative samples based on current model, and uses the normalized scores as weights for the loss function. It hence depends heavily on an incompletely-trained model. Thus, we integrate them for their mutual benefits: besides using a self-adversarial loss, our method leverages prior knowledge to weaken the effect of current model.
Our main contributions are summarized in the following.
• We propose a trans-based graph embedding approach with a novel relation-adaptive translation function in complex vector space, which achieves a better trade-off between interpretability and representing capacity than previous approaches.
• We verify the model's capability in alleviating the embedding ambiguity problem caused by the one-to-many relation pattern, from both theoretical and empirical perspectives.
• With a novel negative sampling method, we evaluate the proposed approach on four link prediction benchmark datasets, i.e., WN18, FB15k, WN18RR and FB15k-237, which shows state-of-the-art results among semantic matching and trans-based graph embedding approaches. The experimental codes are available at https://github.com/Hhnro/RatE.

Proposed Approach
This section begins with a definition of the link prediction task for knowledge graph completion, followed by an introduction to the baseline RotatE. We then propose a novel relation-adaptive translation function to compose the final relation-adaptive translating embedding approach, and present an efficient negative sampling method that integrates the merits of two previous sampling strategies. Lastly, we demonstrate the capability of our proposed model in alleviating the embedding ambiguity problem.

Link Prediction
Formally, a knowledge graph G = {E, R} consists of a set of triples (h, r, t), where h, t ∈ E are head and tail entities respectively and r ∈ R is the relation between them. Given a head entity h (or tail entity t) and a relation r, the goal of link prediction is to find the most plausible tail t (or head h) from E such that the new triple (h, r, t) holds in the knowledge graph G. In a graph embedding approach, each entity/relation is assigned an embedding vector, and a triple is denoted as (h, r, t). To tackle link prediction, a scoring function f(h, r, t) is used to derive the plausibility score of each triple candidate. In particular, in a trans-based approach, the score function is formulated as f(h, r, t) = −||g(h, r) − t||_p, where g(·) denotes a translation function.

Baseline: RotatE
RotatE is a state-of-the-art trans-based graph embedding approach in complex vector space. Motivated by Euler's identity, its translating process is formulated as a relation-specific rotation of the head's embedding vector. RotatE in complex space can be viewed as a natural extension of vanilla TransE in real vector space, which additionally supports the relation pattern of symmetry. Specifically, RotatE represents both entities E and relations R in complex vector space C^d, and defines each relation's embedding as a rotation by constraining the modulus of every dimension to 1. Its translation function g(h, r) is simply fulfilled by a Hadamard product (i.e., element-wise complex product, denoted as "•"), i.e., g(h, r) = h • r. Therefore, the scoring function in RotatE is written as

f(h, r, t) = −||h • r − t||_p.  (1)

Note, the p-norm of a complex vector v is defined as ||v||_p = (Σ_i |v_i|^p)^{1/p}.
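As a concrete reference, the translating and scoring process of RotatE can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas above, not the paper's implementation; the toy dimensionality and variable names are ours:

```python
import numpy as np

def rotate_score(h, r_phase, t, p=1):
    """RotatE plausibility score of a triple (a sketch).

    h, t: complex entity embeddings of shape (d,).
    r_phase: real phase vector of shape (d,); the relation embedding is
    the unit-modulus complex vector exp(i * r_phase).
    """
    r = np.exp(1j * r_phase)           # |r_i| = 1 on every dimension
    translated = h * r                 # element-wise (Hadamard) complex product
    # p-norm of a complex vector: (sum_i |v_i|^p)^(1/p)
    return -np.sum(np.abs(translated - t) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=d) + 1j * rng.normal(size=d)
phase = rng.uniform(0.0, 2.0 * np.pi, size=d)
t_exact = h * np.exp(1j * phase)       # the tail that the rotation maps h onto
print(rotate_score(h, phase, t_exact))  # the exact tail scores (near-)zero, the maximum
```

Since the score is a negated distance, the best possible value is zero, attained exactly when the rotated head coincides with the tail.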

Relation-Adaptive Translating Embedding
Based on the baseline, we propose a trans-based graph embedding approach, named Relation-adaptive translating Embedding (RatE). It extends the complex number product to a novel weighted product in complex space, where the weights are learnable and relation-specific. Given two complex numbers u = a + bi and v = c + di, the weighted product is defined as

o = u ⊛_W v := W s^{(u,v)},  (2)

where o, u, v ∈ C, W ∈ R^{2×4} and s^{(u,v)} = [ac; ad; bc; bd] ∈ R^4, and the two entries of W s^{(u,v)} give the real and imaginary parts of o respectively.
Here, W denotes a learnable weight matrix updated during training. The standard complex number product is its special case when W = [[1, 0, 0, −1]; [0, 1, 1, 0]]. Hence, empowered by the learnable weights, the weighted product promotes the ability to implicitly capture arithmetic or geometric relationships in complex space when adapted into a data-driven neural model. The proposed weighted product is then readily integrated with RotatE to compose a novel relation-adaptive translation function, that is,

g(h, r) = h ⊛_{W^{(r)}} r,  (3)

where h, r ∈ C^d are the embeddings of the head entity and relation respectively, and ⊛_{W^{(r)}} denotes the element-wise weighted product whose weights W^{(r)} are specified for each relation r ∈ R. Based on this translation function, we formulate the score function of relation-adaptive translating embedding as

s^{(h,r,t)} = f(h, r, t) = −||h ⊛_{W^{(r)}} r − t||_p,  (4)

where s^{(h,r,t)} ∈ R is the resulting score of the triple (h, r, t) that measures its plausibility. As both the graph embeddings and the translation function are defined in complex vector space and learnable during training, our proposed RatE is a generic formulation of previous trans-based approaches. In other words, approaches like RotatE and TransE are special cases of RatE, so our approach makes the most of deep neural networks and promotes the representing capacity of the translating paradigm. This is achieved by adding only eight learnable parameters per relation, which is fewer than the relation's embedding size by one or two orders of magnitude. Moreover, besides handling the four relation patterns (i.e., symmetry, antisymmetry, inversion and composition), the proposed RatE also reduces the effect of embedding ambiguity (detailed at the end of this section). It is also noteworthy that although the integration above is based on RotatE, the proposed weighted product is compatible with any complex or hypercomplex embedding approach (e.g., QuatE).
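The weighted product defined in Eq.(2) and the resulting RatE score can be sketched as follows. This is a minimal illustration of the definitions above, not the paper's implementation; `W_complex` is the special case that recovers the standard complex product:

```python
import numpy as np

# Special case of the weight matrix that recovers the standard complex
# product: with s = [ac, ad, bc, bd], Re = ac - bd and Im = ad + bc.
W_complex = np.array([[1.0, 0.0, 0.0, -1.0],
                      [0.0, 1.0, 1.0, 0.0]])

def weighted_product(u, v, W):
    """Weighted product of two complex scalars: o = W @ s^(u,v), W in R^{2x4}."""
    a, b = u.real, u.imag
    c, d = v.real, v.imag
    s = np.array([a * c, a * d, b * c, b * d])   # s^(u,v) in R^4
    re, im = W @ s
    return re + 1j * im

def rate_score(h, r, t, W_r, p=1):
    """RatE score: relation-specific weighted product applied element-wise,
    followed by the negated p-distance to the tail embedding."""
    translated = np.array([weighted_product(hi, ri, W_r) for hi, ri in zip(h, r)])
    return -np.sum(np.abs(translated - t) ** p) ** (1.0 / p)

u, v = 1 + 2j, 3 - 1j
assert np.isclose(weighted_product(u, v, W_complex), u * v)   # recovers u * v
```

With `W_r = W_complex` and unit-modulus relation embeddings, `rate_score` degenerates to the RotatE score, which illustrates why RotatE is a special case of RatE.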

Negative Sampling and Optimization
The way negative sampling is conducted can significantly affect the performance of a graph embedding approach (Cai and Wang, 2018; Sun et al., 2019), because contrasting a challenging negative sample against the corresponding positive one is more effective for learning structured knowledge. Formally, given an arbitrary correct triple x = (h, r, t) ∈ G^{(tr)}, negative sampling aims to corrupt either its head or its tail entity to get a wrong triple x' = (h', r, t) or (h, r, t'), where x' ∉ G^{(tr)} and G^{(tr)} denotes the knowledge graph used to train an embedding model. Note, we only exhibit tail corruption for clear elaboration in the following; head corruption is also considered in our implementation.
We first introduce two popular sampling strategies. The type-constraint training technique (Krompaß et al., 2015) presents a new link prediction setting based on local closed-world assumptions: the entities to corrupt a triple only come from a relation-specific entity set during both training and test. We only take this idea in the training phase to introduce prior knowledge and provide strong distractors. Particularly, for a triple (h, r, t), the candidate set of tail corruptions is

E^{(h,r,t)} = {t' | ∃h': (h', r, t') ∈ G^{(tr)}} \ {t}.  (5)

However, sampling only in this set, E^{(h,r,t)}, suffers from not only graph sparseness under local closed-world assumptions but also information loss of the other corrupting entities, which are denoted as

Ē^{(h,r,t)} = E \ (E^{(h,r,t)} ∪ {t}).  (6)

In contrast, self-adversarial negative sampling (Sun et al., 2019) applies the triple scoring function to a certain number of uniformly-sampled wrong triples, where each f(h, r, t') represents the triple's difficulty for the current embedding model. It then uses the normalized scores as the weights in the loss function to perform self-adversarial training. However, this sampling strategy depends heavily on the current embedding model.

Figure 1: Toy examples: applying the translation functions of TransE, RotatE and RatE to (h_i, r_i) for the resulting t_i. Note that 1) the dimension index i is omitted, and 2) TransE is defined in real space whereas RotatE/RatE are defined in complex space.
Then, we propose a novel local-cognitive negative sampling method that integrates the two strategies to complement each other. Our integration is non-trivial: a dynamic coefficient γ ∈ [0, 1] is used to control the proportion of negative samples drawn from E^{(h,r,t)} or Ē^{(h,r,t)}. In particular, a certain number n of wrong triples is first sampled for each triple x = (h, r, t) ∈ G^{(tr)}. To achieve this, we conduct uniform sampling individually in E^{(h,r,t)} and Ē^{(h,r,t)}, which respectively produces N containing γn samples and N̄ containing (1 − γ)n samples. Then we optimize the proposed embedding model by minimizing

L = −log σ(λ + f(h, r, t)) − Σ_{x'∈N∪N̄} p(x') log σ(−f(x') − λ) + μ||W^{(r)}||_1,  (7)

where σ is the sigmoid function, λ is the fixed margin, p(x') is the normalized (softmax) score of the wrong triple x' over all n samples, and μ is the weight decay of the L1 regularization, set to 0.01 without tuning. Lastly, we update the coefficient γ at the end of every training epoch by

γ ← Σ_{x'∈N} exp(f(x')) / (Σ_{x'∈N} exp(f(x')) + Σ_{x'∈N̄} exp(f(x'))),  (8)

so that γ inclines to the candidate set with more challenging negative samples, as determined by all wrong triples sampled in the previous epoch. In summary, our sampling method employs a self-adversarial training loss, and leverages prior knowledge to weaken the effect of the current model.
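The sampling split and the epoch-end coefficient update can be sketched as follows. The pool contents are illustrative, and the exp-normalized form of `update_gamma` is one plausible instantiation of the update rule described above rather than the paper's exact formula:

```python
import math
import random

def local_cognitive_sample(typed_pool, other_pool, n, gamma, rng=random):
    """Draw n tail corruptions: about gamma*n uniformly from the
    relation-typed candidate set E^(h,r,t), the rest from the remaining
    entities (a sketch)."""
    n_typed = min(round(gamma * n), len(typed_pool))
    typed = rng.sample(typed_pool, n_typed)
    rest = rng.sample(other_pool, min(n - n_typed, len(other_pool)))
    return typed, rest

def update_gamma(scores_typed, scores_other):
    """Move gamma toward the pool whose sampled negatives scored higher
    (i.e., were harder for the current model) in the previous epoch."""
    a = sum(math.exp(s) for s in scores_typed)
    b = sum(math.exp(s) for s in scores_other)
    return a / (a + b)
```

For example, if the typed pool produced the harder negatives (higher scores), `update_gamma` returns a value above 0.5, so the next epoch draws a larger share of corruptions from E^(h,r,t).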

Embedding Disambiguation
Embedding ambiguity here refers to similar embeddings being assigned to distinct entities. In a trans-based graph embedding approach, it is usually caused by one-to-many (i.e., a kind of non-injective) relations in knowledge graphs. Specifically, given a set of triples {(h, r, t_1), . . . , (h, r, t_M)} as an example of a one-to-many relation, invoking a translation function directly defined in real or complex space makes the model optimize toward t_j = g(h, r) for every j, which inevitably results in similar tail embeddings. Because one-to-many relations are ubiquitous in a knowledge graph, e.g., has part in WordNet, the embedding ambiguity problem deteriorates and propagates through the graph. Fortunately, the proposed RatE is able to alleviate this problem by cutting off the propagation.
To intuitively demonstrate RatE's capability in embedding disambiguation by stopping the propagation, we illustrate toy examples of TransE, RotatE and our proposed RatE in Figure 1. Given two head entities with similar embeddings, their similarity is preserved in the corresponding tail entities after applying the same relation in TransE or RotatE, even when the relation r is one-to-many. A triple scoring function built upon geometric distance can hardly discriminate such subtle differences in the space, which negatively affects the quality of predictions. In principle, compared to the rigid transformations in RotatE and TransE, the proposed RatE with the weighted product shares a similar inspiration with projective transformations and changes the distance between the tail entities according to the spatial positions of the head entities. Consequently, besides increasing the distance between tail entities to disambiguate entity embeddings, RatE can also decrease the distance for better support of many-to-one relations. A rigorous proof of these properties is provided in Appendix A.
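This geometric difference can be checked numerically: a unit-modulus Hadamard product (RotatE) is an isometry in each dimension, so two near-duplicate heads remain exactly as close after translation, while a generic weighted product changes their distance. A small NumPy sketch, in which the embeddings and the weight matrix are arbitrary illustrations rather than learned values:

```python
import numpy as np

# Two heads with similar embeddings (a near-duplicate pair).
h1 = np.array([1.0 + 1.0j, 2.0 - 1.0j])
h2 = h1 + np.array([0.01 + 0.0j, 0.0 + 0.01j])
r = np.exp(1j * np.array([0.7, 2.1]))    # unit-modulus relation (a rotation)

# RotatE: per-dimension rotation preserves distances, so the translated
# heads are exactly as close as the original ones -- ambiguity propagates.
d_heads = np.linalg.norm(h1 - h2)
d_rotate = np.linalg.norm(h1 * r - h2 * r)
assert np.isclose(d_heads, d_rotate)

# RatE: a generic weighted product is not an isometry, so the same relation
# can push the two translated heads apart (or pull them together).
W = np.array([[2.0, 0.0, 0.0, -1.0],
              [0.0, 1.0, 3.0, 0.0]])     # illustrative, not a learned matrix

def wprod(u, v):
    s = np.array([u.real * v.real, u.real * v.imag,
                  u.imag * v.real, u.imag * v.imag])
    re, im = W @ s
    return re + 1j * im

t1 = np.array([wprod(a, b) for a, b in zip(h1, r)])
t2 = np.array([wprod(a, b) for a, b in zip(h2, r)])
d_rate = np.linalg.norm(t1 - t2)
assert not np.isclose(d_heads, d_rate)   # the distance has changed
```

Here the weighted product stretches the gap between the two translated heads, which is exactly the mechanism that lets distinct tail entities keep distinct embeddings.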

Experimental Setting
Dataset. We employ four widely-used link prediction benchmarks, WN18, FB15k, WN18RR and FB15k-237, whose statistics are summarized in Table 2. Note, Toutanova et al. (2015) find that both WN18 and FB15k suffer from a direct link problem: most test triples (e1, r1, e2) can be found in the training or validation set with another relation, e.g., (e1, r2, e2) or (e2, r2, e1).
• WN18 (Bordes et al., 2013) is extracted from WordNet (Miller, 1995), a knowledge graph composed of English phrases and lexical relations between them.
• FB15k (Bordes et al., 2013) is extracted from Freebase (Bollacker et al., 2008), which is a large-scale knowledge graph consisting of real-world named entities and their relationships.
• WN18RR (Dettmers et al., 2017) is a subset of WN18 in which inverse relations are removed to eliminate the direct link problem described above.
• FB15k-237 (Toutanova et al., 2015) is a subset of FB15k by 1) removing near-duplicate and inverse triples, and 2) filtering out the direct links to avoid data leakage.
Training Setting. The ranges of the hyper-parameters for grid search are as follows: embedding dimension d ∈ {250, 500, 1000}, batch size ∈ {512, 1024, 2048}, and fixed margin λ ∈ {6, 9, 12, 18}. Following previous works, all entity and relation embeddings are randomly initialized from uniform distributions. The initialization range for entities is [−λ/d, +λ/d] for both real and imaginary parts, and the initialization range for relation phases is [0, 2π] with |r| = 1 in complex space. Our model is implemented in PyTorch on a single Titan V GPU. We use minibatch SGD with the Adam optimizer, where the learning rate is set to 5 × 10^−5 without decay.
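The initialization described above can be sketched as follows, with assumed values of d and λ picked from the grid for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 500, 12.0

# Entity embeddings: real and imaginary parts drawn uniformly
# from [-lam/d, +lam/d].
ent_real = rng.uniform(-lam / d, lam / d, size=d)
ent_imag = rng.uniform(-lam / d, lam / d, size=d)
ent = ent_real + 1j * ent_imag

# Relation embeddings: phases drawn uniformly from [0, 2*pi),
# which fixes the modulus of every dimension to 1.
phase = rng.uniform(0.0, 2.0 * np.pi, size=d)
rel = np.exp(1j * phase)

assert np.all(np.abs(ent.real) <= lam / d) and np.all(np.abs(ent.imag) <= lam / d)
assert np.allclose(np.abs(rel), 1.0)
```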
Evaluation Metrics. Following Bordes et al. (2013), we use the "filtered" setting to calculate evaluation metrics during test: in either head or tail entity corruption, all correct triples in the train/dev/test sets except the current oracle test triple are removed to avoid affecting the ranking. Given all candidate triples ranked according to the score function f(h, r, t), we use the standard evaluation metrics for link prediction tasks: 1) mean rank (MR), the mean rank of the oracle test triples; 2) mean reciprocal rank (MRR); and 3) Hits@N (N = 1, 3, 10), the ratio of oracle test triples ranked in the top N.
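Given the filtered rank of each oracle test triple, the three metrics reduce to simple statistics over those ranks; a minimal sketch:

```python
def link_prediction_metrics(ranks, ks=(1, 3, 10)):
    """MR, MRR and Hits@N from filtered ranks. `ranks` holds, for each
    test triple, the rank of the true entity among all candidates after
    removing the other known-correct triples."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics

m = link_prediction_metrics([1, 2, 5, 11])
print(m)  # MR = 4.75 and Hits@10 = 0.75 for this toy rank list
```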
Comparative Approach. We compare RatE with several strong graph embedding approaches, especially the trans-based approaches to which RatE belongs. In particular, for trans-based approaches, we mainly consider TransE (Bordes et al., 2013) in real space and RotatE (Sun et al., 2019) in complex space. For semantic matching approaches, we consider DistMult (Yang et al., 2015), HolE (Nickel et al., 2016), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2017) and QuatE (Zhang et al., 2019).
For most approaches, we copy the results from the original papers or from Sun et al. (2019) unless otherwise stated.

Evaluation on Link Prediction
Link prediction results on the four datasets are shown in Table 3 and Table 4. The proposed RatE achieves new state-of-the-art results on most metrics compared to previous graph embedding approaches. Overall, compared with the baseline model RotatE, RatE employs merely a few additional parameters to deliver a significant improvement. To the best of our knowledge, RotatE is the previous best trans-based graph embedding approach and belongs to the same category as RatE. RatE also outperforms the previous state-of-the-art semantic matching graph embedding approach, QuatE, which is defined in hypercomplex space and requires more computational overhead. Specifically, since WN18 and FB15k suffer from the direct link problem as detailed above, the baselines and our proposed RatE obtain comparable results on all metrics; for example, Dettmers et al. (2017) find that a simple rule-based model that learns the inverse relations achieves competitive results on WN18. This explains why our improvement is marginal on these two datasets. Moreover, since WN18RR and FB15k-237 were constructed to resolve this problem in WN18 and FB15k respectively, the evaluation results on WN18RR and FB15k-237 are more canonical measures of link prediction capability. As shown in Table 4, the proposed RatE brings a more noticeable improvement over previous approaches on these two datasets.

Ablation Study
We conduct an extensive ablation study in Table 5 to verify the effectiveness of each proposed part. We first replace the relation-adaptive translation function with a weighted product shared among all relations (i.e., "RatE w/o relation-adaptive"), and observe a performance drop. The weighted product is then further degenerated to the standard complex product (i.e., "RatE w/o weighted product"), which only results in a slight additional drop. This suggests the proposed weighted product should be coupled with relation-adaptation to maximize its effectiveness. Then, removing the L1 regularization of W^{(r)} in Eq.(7) and the proposed local-cognitive negative sampling leads to 0.6% and 2.6% Hits@10 drops respectively. Note, "RatE w/o negative sampling" denotes using a uniform negative sampling method instead of our proposed local-cognitive negative sampling. Lastly, when removing all the proposed parts, the model is equivalent to its baseline RotatE without self-adversarial negative sampling, which results in inferior performance.

Table 6: Test performance in Hits@10 regarding different relation patterns and the corresponding relations on WN18RR. ||W^{(r)}||_1 is used to measure the complexity of the proposed relation-adaptive translation function. Since only three triples with the relation "similar to" appear in the test set of WN18RR, we omit this relation.

Analysis of Relation-Adaptive Translation Function
A major difference between RatE and previous trans-based graph embedding approaches (e.g., RotatE) is that a learnable relation-adaptive translation function is used in RatE to capture the translating relationship. To understand the expressive power of RatE, it is instructive to investigate the learned weights of each relation-specific weighted product. As shown in Table 6, the L1 norm of the learned W^{(r)} for symmetric relations is clearly smaller than that for antisymmetric relations. In particular, with the redundancy of the complex number product removed, RatE preserves the ability to handle symmetric relations and achieves competitive results.

Performance on Non-Injective Relations
Following Sun et al. (2019), we also evaluate the proposed RatE on different relation types, including one injective type (i.e., one-to-one) and three non-injective types (i.e., one-to-many, many-to-one and many-to-many). As shown in Table 7, although RatE delivers Hits@10 values similar to RotatE on the injective relation type, it significantly surpasses both TransE and RotatE on the non-injective relation types. The improvements are especially substantial for 1-to-M relations (+8.8%) on tail prediction and M-to-1 relations (+12.2%) on head prediction, which verifies RatE's capability in handling one-to-many relations. Coupled with the theoretical proof in §2.5, this also indirectly verifies that RatE is able to alleviate the embedding ambiguity problem posed by one-to-many relations.

Analysis of Negative Sampling
As negative sampling is crucial for a model to learn structured knowledge, we evaluate RatE with different negative sampling methods. "Local-cognitive w/o self-adv loss" can be viewed as only using the prior knowledge from local closed-world assumptions (Krompaß et al., 2015). The experimental results in Table 8 demonstrate that, compared with uniform sampling, both self-adversarial sampling and the type-constraint training technique (i.e., local-cognitive w/o self-adv loss) contribute to performance improvement. The results also highlight the effectiveness of our proposed local-cognitive negative sampling method, a non-trivial integration of both of the above, in structured knowledge learning.

Analysis of Efficiency
Lastly, we discuss RatE's efficiency, which is mainly brought by the following two factors. On the one hand, in line with previous trans-based graph embedding approaches, RatE only employs a fast translation function and a geometric distance measurement. On the other hand, even though a relation-adaptive translation function with the weighted product is used in the translating process, the function has few parameters and thus low time and space complexity. We compare RatE with a semantic matching graph embedding method, TuckER (Balazevic et al., 2019), which uses a weight tensor to score a triple. As shown in Table 9, despite competitive performance, TuckER requires many more learnable parameters than RatE for scoring. For example, TuckER has a weight tensor with 1,200,000 parameters on WN18RR, whereas RatE only requires 88 parameters for all eleven relations.

Related Work
Unlike semantic matching graph embedding approaches (Nickel et al., 2011; Dettmers et al., 2017; Balazevic et al., 2019; Zhang et al., 2019), which require additional overheads to score a triple, this work is in line with trans-based graph embedding approaches that employ an efficient translation function defined in a latent space. TransE (Bordes et al., 2013) is the most representative trans-based approach; it embeds entities and relations in real vector space, utilizes relations as translations, and optimizes the score function toward h + r = t. Several recent trans-based approaches (Wang et al., 2014; Lin et al., 2015; Ji et al., 2015; Ebisu and Ichise, 2018) can be viewed as extensions of TransE. More recently, RotatE (Sun et al., 2019), a state-of-the-art trans-based approach, represents the entities and relations in complex vector space and formulates the translating process as a rotation in complex space. Also related to this work, many negative sampling methods (Cai and Wang, 2018; Sun et al., 2019) have been proposed to effectively learn structured knowledge. KBGAN (Cai and Wang, 2018) uses a knowledge graph embedding model as a negative sample generator to fool the main embedding model (i.e., the discriminator in GANs). In contrast, self-adversarial learning (Sun et al., 2019) scores a certain number of uniformly-sampled negative samples based on the current model, and utilizes the scores to weight the loss function. Lastly, this work is also related to using prior knowledge in graphs for training: the type-constraint method (Krompaß et al., 2015), based on local closed-world assumptions, corrupts heads (or tails) from the relation-specific domain (or range).

Conclusion
In this paper, we study a novel trans-based graph embedding approach, called Relation-adaptive translating Embedding (RatE), for knowledge graph completion. It is based on the proposed relation-adaptive translation function with a novel weighted product in complex space, which not only improves representing and modeling capacities but also alleviates the embedding ambiguity problem caused by non-injective relations. Therefore, RatE achieves a better trade-off between interpretability and expressive power than previous trans-based approaches. Moreover, a local-cognitive negative sampling method is also presented that seamlessly integrates prior knowledge with self-adversarial learning for effective optimization. In experiments, RatE achieves state-of-the-art performance on four link prediction benchmarks. An extensive ablation study and analyses further provide comprehensive insights into RatE.

Acknowledgement
This research was funded by the Australian Government through the Australian Research Council (ARC) under grant LP180100654. The authors would like to thank the anonymous reviewers for their insightful and constructive feedback.