Orthogonal Relation Transforms with Graph Context Modeling for Knowledge Graph Embedding

Distance-based knowledge graph embeddings have shown substantial improvement on the knowledge graph link prediction task, from TransE to the latest state-of-the-art RotatE. However, complex relations such as N-to-1, 1-to-N and N-to-N remain challenging to predict. In this work, we propose a novel distance-based approach for knowledge graph link prediction. First, we extend RotatE from the 2D complex domain to high-dimensional space with orthogonal transforms to model relations. The orthogonal transform embedding for relations keeps the capability of modeling symmetric/antisymmetric, inverse and compositional relations while achieving better modeling capacity. Second, graph context is integrated directly into the distance scoring functions. Specifically, graph context is explicitly modeled via two directed context representations. Each node embedding in the knowledge graph is augmented with two context representations, computed from the neighboring outgoing and incoming nodes/edges respectively. The proposed approach improves prediction accuracy on the difficult N-to-1, 1-to-N and N-to-N cases. Our experimental results show that it achieves state-of-the-art results on two common benchmarks, FB15k-237 and WN18RR, especially on FB15k-237, which has many high in-degree nodes.


Introduction
A knowledge graph is a multi-relational graph whose nodes represent entities and whose edges denote relationships between entities. Knowledge graphs store facts about people, places and the world from various sources. These facts are kept as triples (head entity, relation, tail entity), denoted (h, r, t). A large number of knowledge graphs, such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), NELL (Carlson et al., 2010) and YAGO3 (Mahdisoltani et al., 2013), have been built over the years and successfully applied to many domains such as recommendation and question answering (Bordes et al., 2014; Zhang et al., 2016). However, these knowledge graphs need to be updated with new facts periodically. Therefore, many knowledge graph embedding methods have been proposed for link prediction, which is used for knowledge graph completion.
Though much progress has been made, 1-to-N, N-to-1, and N-to-N relation predictions (Bordes et al., 2013; Wang et al., 2014) still remain challenging. In Figure 1, the relation "profession" gives an N-to-N example, with the corresponding edges highlighted in green. Assume the triple (SergeiRachmaninoff, Profession, Pianist) is unknown. The link prediction model takes "SergeiRachmaninoff" and the relation "Profession" and ranks all entities in the knowledge graph to predict "Pianist". Entity "SergeiRachmaninoff" is connected to multiple entities as a head entity via the relation "profession", while "Pianist" as a tail entity is also reached by multiple entities through the relation "profession". This makes N-to-N prediction hard, because the mapping from a given entity-relation pair can lead to multiple different entities. The same issue arises in the 1-to-N and N-to-1 cases.
The recently proposed RotatE (Sun et al., 2019) models each relation as a 2D rotation from the source entity to the target entity. The desired properties for relations include symmetry/antisymmetry, inversion and composition, which have been demonstrated to be useful for link prediction in knowledge graphs. Many existing methods model only one or a few of these relation patterns, while RotatE naturally handles all of them. In addition, the entity and relation embeddings are divided into multiple groups (for example, 1000 2D rotations are used in (Sun et al., 2019)). Each group is modeled and scored independently, and the final score is the sum of all group scores. This can be viewed as an ensemble of different models and further boosts link prediction performance. However, RotatE is limited to 2D rotations and thus has limited modeling capacity. In addition, RotatE does not consider graph context, which is helpful for 1-to-N, N-to-1, and N-to-N relation prediction.
In this work, a novel distance-based knowledge graph embedding called orthogonal transform embedding (OTE), combined with graph context, is proposed to alleviate the 1-to-N, N-to-1 and N-to-N issues while keeping the desired relation patterns of RotatE. First, we employ orthogonal transforms to represent relations in high-dimensional space for better modeling capability. The orthogonal transform embedding models the symmetry/antisymmetry, inversion and composition relation patterns just as RotatE does; indeed, RotatE can be viewed as an orthogonal transform in 2D complex space.
Second, we integrate graph context directly into the distance scoring, which is helpful for predicting 1-to-N, N-to-1 and N-to-N relations. For example, even from an incomplete knowledge graph, one finds useful context such as (SergeiRachmaninoff, role, Piano) and (SergeiRachmaninoff, Profession, Composer) in Figure 1. In this work, each node embedding in the knowledge graph is augmented with two graph context representations, computed from the neighboring outgoing and incoming nodes respectively. Each context representation is computed from the embeddings of the neighboring nodes and the corresponding relations connecting to them. These context representations are used as part of the distance scoring function to measure the plausibility of triples during training and inference. We show that OTE together with graph context modeling performs consistently better than RotatE on the standard FB15k-237 and WN18RR benchmarks.
In summary, our main contributions include:
• A new orthogonal transform embedding, OTE, which extends RotatE from 2D space to high-dimensional space and also models the symmetry/antisymmetry, inversion and composition relation patterns;
• A directed graph context modeling method that integrates knowledge graph context (including both neighboring entity nodes and relation edges) into the distance scoring function;
• Experimental results of OTE on the standard benchmark FB15k-237 and WN18RR datasets showing consistent improvements over RotatE, the state-of-the-art distance-based embedding model, especially on FB15k-237 with many high in-degree nodes. On WN18RR our results achieve new state-of-the-art performance.
Related Work

Knowledge Graph Embedding
Knowledge graph embedding methods can be roughly categorized into two classes: distance-based models and semantic matching models. Distance-based models are also known as additive models, since they project head and tail entities into the same embedding space and use the distance between the two entity embeddings to measure the plausibility of a given triple.
TransE (Bordes et al., 2013) is the first and most representative translational distance model, and a series of works followed this line, such as TransH (Wang et al., 2014), TransR (Lin et al., 2015) and TransD (Ji et al., 2015). RotatE (Sun et al., 2019) further extends the computation to the complex domain and is currently the state of the art in this category. Semantic matching models, on the other hand, usually use multiplicative score functions to compute the plausibility of a given triple, such as DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), TuckER (Balazevic et al., 2019) and QuatE (Zhang et al., 2019). ConvKB (Nguyen et al., 2017) and CapsE (Nguyen et al., 2019) further take the triple as a whole and feed head, relation and tail embeddings into convolutional models or capsule networks. The above knowledge graph embedding methods focus on modeling individual triples; they ignore the knowledge graph structure and do not take advantage of context from neighboring nodes and edges. This issue inspired the use of graph neural networks (Kipf and Welling, 2016; Veličković et al., 2017) for graph context modeling. An encoder-decoder framework was adopted in (Schlichtkrull et al., 2017; Shang et al., 2019; Bansal et al., 2019): the knowledge graph structure is first encoded via graph neural networks, and the output with rich structure information is passed to a graph embedding model for prediction. The graph model and the scoring model can be trained together end-to-end, or the graph encoder output can be used only to initialize the entity embeddings (Nathani et al., 2019). We take another approach in this paper: we integrate the graph context directly into the distance scoring function.

Orthogonal Transform
Orthogonal transforms are considered more stable and efficient for neural networks (Saxe et al., 2013; Vorontsov et al., 2017). However, optimizing a linear transform while preserving orthogonality is not straightforward. Soft constraints can be imposed during optimization to encourage the learned linear transform to be close to orthogonal. Bansal et al. (2018) extensively compared different orthogonal regularizations and found that they make training faster and more stable across tasks. On the other hand, some work enforces strict orthogonality during optimization by applying special gradient update schemes. Harandi and Fernando (2016) proposed a Stiefel layer to guarantee fully connected layers to be orthogonal by using Riemannian gradients. Huang et al. (2017) treated the estimation of orthogonal matrices as optimization over multiple dependent Stiefel manifolds and solved it via eigenvalue decomposition on a proxy parameter matrix. Vorontsov et al. (2017) applied a hard constraint on orthogonal transform updates via the Cayley transform. In this work, we construct the orthogonal matrix via the Gram Schmidt process, and the gradient is computed automatically through the autograd mechanism in PyTorch (Paszke et al., 2017).

Our Proposed Method
We consider a knowledge graph as a collection of triples D = {(h, r, t)}, with V the graph node set and R the graph edge set. Each triple has a head entity h and a tail entity t, where h, t ∈ V. A relation r ∈ R connects two entities with direction from head to tail. As discussed in the introduction, 1-to-N, N-to-1 and N-to-N relation predictions (Bordes et al., 2013; Wang et al., 2014) are difficult to deal with. Our proposed approach addresses them by: 1) orthogonal relation transforms that operate on groups of embedding dimensions, where each group is modeled and scored independently and the final score is the sum of all group scores; each group can thus capture a different aspect of an entity-relation pair, alleviating the 1-to-N and N-to-N relation mapping issues; and 2) directed graph context that integrates knowledge graph structure information to reduce ambiguity.
Next, we first briefly review RotatE, which motivates our orthogonal transform embedding (OTE), and then describe the proposed method in detail.

RotatE
OTE is inspired by RotatE (Sun et al., 2019). In RotatE, distance scoring is done via the Hadamard (element-wise) product defined on the complex domain. Given a triple (h, r, t), the corresponding embeddings are e_h, θ_r, e_t, where e_h, e_t ∈ R^{2d}, θ_r ∈ R^d, and d is the embedding dimension. For each dimension i, e[2i] and e[2i+1] are the corresponding real and imaginary components. The projection ẽ_t of t from the corresponding relation and head entity rotates each complex component and is hence an orthogonal transform:

ẽ_t[2i] = e_h[2i] cos(θ_r[i]) − e_h[2i+1] sin(θ_r[i])
ẽ_t[2i+1] = e_h[2i] sin(θ_r[i]) + e_h[2i+1] cos(θ_r[i])

The distance score is then the norm of the difference between this projection and the tail embedding, ||ẽ_t − e_t||. Though RotatE is simple and effective for knowledge graph link prediction, it is defined in the 2D complex domain and thus has limited modeling capability. A natural extension is to apply a similar operation in a higher-dimensional space.
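As a concrete illustration, below is a minimal PyTorch sketch of the RotatE projection above (our own helper, not the authors' code); it applies the 2D rotation to each real/imaginary pair:

```python
import torch

def rotate_project(e_h: torch.Tensor, theta_r: torch.Tensor) -> torch.Tensor:
    """RotatE projection: rotate each complex component of e_h by theta_r.

    e_h:     (..., 2d), where e_h[..., 0::2] / e_h[..., 1::2] hold the real /
             imaginary parts of the d complex dimensions.
    theta_r: (..., d), one rotation angle per complex dimension.
    """
    re, im = e_h[..., 0::2], e_h[..., 1::2]
    cos, sin = torch.cos(theta_r), torch.sin(theta_r)
    re_out = re * cos - im * sin   # each (re, im) pair undergoes a 2D rotation,
    im_out = re * sin + im * cos   # an orthogonal, norm-preserving map
    # interleave real/imaginary parts back into a (..., 2d) tensor
    return torch.stack((re_out, im_out), dim=-1).flatten(-2)
```

The distance score is then, for example, `(rotate_project(e_h, theta_r) - e_t).norm(dim=-1)`.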

Orthogonal Transform Embedding (OTE)
We use e_h, M_r, e_t to represent the embeddings of the head, relation and tail, where e_h, e_t ∈ R^d and d is the dimension of the entity embedding. The entity embedding e_x, where x ∈ {h, t}, is divided into K sub-embeddings of dimension d_s (so d = K × d_s):

e_x = [e_x(1); ⋯; e_x(K)], with e_x(i) ∈ R^{d_s}.

For each sub-embedding e_t(i) of tail t, we define the projection from h and r to t as

ẽ_t(i) = φ(M_r(i)) e_h(i),    (1)

where M_r(i) ∈ R^{d_s × d_s} and φ is the Gram Schmidt process (see details in the Gram Schmidt Process section) applied to the square matrix M_r(i).
The output transform φ(M_r(i)) is an orthogonal matrix derived from M_r(i), and ẽ_t is the concatenation of all sub-vectors ẽ_t(i) from Eq. 1, i.e., ẽ_t = [ẽ_t(1); ⋯; ẽ_t(K)]. The L2 norm of e_h(i) is preserved under the orthogonal transform, so we further use a scalar tensor s_r(i) ∈ R^{d_s} to scale the L2 norm of each group of embeddings separately. Eq. 1 is then re-written as

ẽ_t(i) = diag(exp(s_r(i))) φ(M_r(i)) e_h(i).    (2)

The corresponding distance scoring function is defined as

D_t(h, r, t) = Σ_{i=1}^{K} ||ẽ_t(i) − e_t(i)||.    (3)

For each sub-embedding e_h(i) of head h, we define the projection from r and t to h as

ẽ_h(i) = diag(exp(−s_r(i))) φ(M_r(i))^T e_t(i),    (4)

where the reverse projection from tail to head simply transposes φ(M_r(i)) and reverses the sign of s_r. The corresponding distance scoring function is defined as

D_h(h, r, t) = Σ_{i=1}^{K} ||ẽ_h(i) − e_h(i)||.    (5)
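To make the grouped scoring concrete, here is a minimal PyTorch sketch of Eqs. 2-5; `orthogonalize` stands in for the Gram Schmidt routine of the next subsection, and all names are ours rather than from the authors' code:

```python
import torch

def ote_score_tail(e_h, e_t, M_r, s_r, orthogonalize):
    """D_t(h,r,t) of Eq. 3, with the projection of Eq. 2.

    e_h, e_t: (K, d_s)      entity embeddings split into K groups
    M_r:      (K, d_s, d_s) per-group relation matrices (pre-orthogonalization)
    s_r:      (K, d_s)      per-group log-scales
    """
    phi = orthogonalize(M_r)                        # (K, d_s, d_s), orthogonal per group
    e_t_hat = torch.einsum('kij,kj->ki', phi, e_h)  # phi(M_r(i)) e_h(i)
    e_t_hat = e_t_hat * torch.exp(s_r)              # diag(exp(s_r(i))) scaling
    return (e_t_hat - e_t).norm(dim=-1).sum()       # sum of per-group L2 distances

def ote_score_head(e_h, e_t, M_r, s_r, orthogonalize):
    """D_h(h,r,t) of Eq. 5: transpose phi and negate s_r (Eq. 4)."""
    phi = orthogonalize(M_r)
    e_h_hat = torch.einsum('kji,kj->ki', phi, e_t)  # phi(M_r(i))^T e_t(i)
    e_h_hat = e_h_hat * torch.exp(-s_r)
    return (e_h_hat - e_h).norm(dim=-1).sum()
```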

Gram Schmidt Process
We employ the Gram Schmidt process to orthogonalize a linear transform into an orthogonal transform (i.e., φ(M_r(i)) in the OTE subsection above). The Gram Schmidt process takes a set of tensors S = {v_1, ⋯, v_k} for k ≤ d_s and generates an orthogonal set S′ = {u_1, ⋯, u_k} that spans the same k-dimensional subspace of R^{d_s} as S:

t_i = v_i − Σ_{j=1}^{i−1} (⟨v_i, t_j⟩ / ⟨t_j, t_j⟩) t_j,    (6)
u_i = t_i / ||t_i||,    (7)
where t_1 = v_1, ||t|| denotes the L2 norm of vector t, and ⟨v, t⟩ denotes the inner product of v and t. Orthogonal transforms have many desirable properties; for example, the inverse matrix is obtained simply by transposition, and the L2 norm of a vector is preserved under the transform. For our work, we are mainly interested in the property that the inverse is obtained by transposition, which reduces the number of model parameters (see Table 3).
It can easily be proved that OTE, like RotatE, can model and infer all three types of relation patterns: symmetry/antisymmetry, inversion, and composition. The proof is given in Appendix A.
It should be noted that the orthogonal matrix φ(M_r(i)) is computed from M_r(i) in every forward pass of the network, while the corresponding gradient is propagated back to M_r(i) via PyTorch's autograd during the backward pass. This eliminates the need for the special gradient update schemes employed in previous hard-constraint-based orthogonal transform estimations (Harandi and Fernando, 2016; Vorontsov et al., 2017). In our experiments, we initialize M_r(i) to ensure it has full rank. During training, we also monitor the determinant of M_r(i). We find the updates fairly stable, and we observe no issues with sub-embedding dimensions varying from 5 to 100.
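Below is a minimal differentiable Gram Schmidt sketch in PyTorch (modified Gram Schmidt, equivalent to Eqs. 6-7 up to normalization order), illustrating that plain autograd suffices and no special gradient scheme is needed; the function name and shapes are our own choices:

```python
import torch

def gram_schmidt(M: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Orthogonalize the rows of each (d_s x d_s) matrix in M.

    M: (..., d_s, d_s). Returns phi(M) with orthonormal rows. Built from
    differentiable ops only, so autograd propagates gradients back to M.
    """
    rows = []
    for i in range(M.shape[-2]):
        v = M[..., i, :]
        # subtract projections onto the already-orthonormalized rows
        for u in rows:
            v = v - (v * u).sum(-1, keepdim=True) * u
        rows.append(v / (v.norm(dim=-1, keepdim=True) + eps))
    return torch.stack(rows, dim=-2)

# Usage: phi is orthogonal up to numerical error, and its transpose inverts it.
M = torch.randn(3, 5, 5, requires_grad=True)
phi = gram_schmidt(M)
eye = torch.eye(5).expand(3, 5, 5)
assert torch.allclose(phi @ phi.transpose(-1, -2), eye, atol=1e-5)
phi.sum().backward()   # gradients flow back to M via autograd
```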

Directed Graph Context
The knowledge graph is a directed graph: a valid triple (h, r, t) does not imply that (t, r, h) is also valid. Therefore, for a given entity in the knowledge graph, there are two kinds of context information: nodes that come into it and nodes that go out of it. Specifically, for each entity e we consider the following two context settings:
1. If e is a tail, all (head, relation) pairs in the training triples whose tail is e form its Head Relation Pair Context.
2. If e is a head, all (relation, tail) pairs in the training triples whose head is e form its Relation Tail Pair Context.
Figure 1 illustrates the graph context for a test triple (SergeiRachmaninoff, profession, Pianist). Edges for the relation "profession" are colored green. Entities marked with • are head entities of "Pianist"; these entities and the corresponding relations connecting them to "Pianist" form the head relation pair context of "Pianist". Entities marked with ⭒ are tail entities of "SergeiRachmaninoff"; those entities and the corresponding relations form the relation tail pair context of "SergeiRachmaninoff".

Head Relation Pair Context
For a given tail t, all head-relation pairs (h′, r′) of the triples with tail t are considered as its graph context, denoted N_g(t). First, we compute the head-relation context representation ẽ^c_t as the average over all these pairs in N_g(t) together with the tail embedding itself:

ẽ^c_t = (1 / (|N_g(t)| + 1)) (e_t + Σ_{(h′, r′) ∈ N_g(t)} f(h′, r′)),    (8)

where e_t is the embedding of the tail t and f(h′, r′) is the representation of (h′, r′) induced from Eq. 2.
We include e_t in Eq. 8 so that the context representation is defined even when N_g(t) is empty; this can be viewed as a form of additive smoothing. Then, we compute the distance between the head-relation context of t and the corresponding orthogonal transform based representation of the triple (h, r, t):

D^c_t(h, r, t) = Σ_{i=1}^{K} ||ẽ^c_t(i) − f(h, r)(i)||,    (9)

where f(h, r) is the projection of the pair (h, r) from Eq. 2.
No new parameters are introduced for graph context modeling, since the message passing is done via the OTE entity-relation projection f(h′, r′).
The graph context can easily be applied to other translational embedding algorithms, such as RotatE and TransE, by replacing OTE with the corresponding projection.

Relation Tail Pair Context
For a given head h, all relation-tail pairs (r′, t′) of the triples with head h are considered as its graph context, denoted N_g(h). First, we compute the relation-tail context representation ẽ^c_h as the average over all these pairs in N_g(h) together with the head embedding itself:

ẽ^c_h = (1 / (|N_g(h)| + 1)) (e_h + Σ_{(r′, t′) ∈ N_g(h)} f(r′, t′)),    (10)

where f(r′, t′) is the representation of (r′, t′) induced from Eq. 4. Then, we compute the distance between the relation-tail context of h and the corresponding orthogonal transform based representation of the triple (h, r, t):

D^c_h(h, r, t) = Σ_{i=1}^{K} ||ẽ^c_h(i) − f(r, t)(i)||,    (11)

where f(r, t) is the reverse projection of the pair (r, t) from Eq. 4.
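To make the two directed contexts concrete, here is a minimal sketch of the averaging in Eq. 8 and Eq. 10, assuming hypothetical projection callables `f_tail` (Eq. 2, mapping a (head, relation) pair to a tail-side vector) and `f_head` (Eq. 4, mapping a (relation, tail) pair to a head-side vector); the names are ours, not from the paper:

```python
import torch

def head_relation_context(e_t, hr_pairs, f_tail):
    """Eq. 8: average of e_t and f(h', r') over all (h', r') pairs with tail t.
    Including e_t acts as additive smoothing, so the context is defined
    even when the pair set is empty."""
    msgs = [f_tail(e_h_p, r_p) for e_h_p, r_p in hr_pairs]
    return torch.stack([e_t] + msgs).mean(dim=0)

def relation_tail_context(e_h, rt_pairs, f_head):
    """Eq. 10: average of e_h and f(r', t') over all (r', t') pairs with head h."""
    msgs = [f_head(r_p, e_t_p) for r_p, e_t_p in rt_pairs]
    return torch.stack([e_h] + msgs).mean(dim=0)
```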

Scoring Function
We further combine the four distance scores discussed above (Eq. 3, Eq. 5, Eq. 9 and Eq. 11) into the final distance score of the graph contextual orthogonal transform embedding (GC-OTE), used for both training and inference:

D(h, r, t) = D_t(h, r, t) + D_h(h, r, t) + D^c_t(h, r, t) + D^c_h(h, r, t).    (12)

Therefore the full GC-OTE model can be seen as an ensemble of K local GC-OTE models. This view provides an intuitive explanation for the success of GC-OTE.
Optimization The self-adversarial negative sampling loss (Sun et al., 2019) is used to optimize the embeddings in this work:

L = −log σ(γ − D(h, r, t)) − Σ_i p(h′_i, r, t′_i) log σ(D(h′_i, r, t′_i) − γ),

where γ is a fixed margin, σ is the sigmoid function, (h′_i, r, t′_i) is a negative triple, and p(h′_i, r, t′_i) is the negative sampling weight defined in (Sun et al., 2019).
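For clarity, here is a minimal sketch of this loss in PyTorch, under our reading of (Sun et al., 2019); the margin and temperature values are placeholders, not the paper's tuned settings:

```python
import torch
import torch.nn.functional as F

def self_adversarial_loss(pos_dist, neg_dist, gamma=9.0, alpha=1.0):
    """Self-adversarial negative sampling loss, with D(.) the distance of Eq. 12.

    pos_dist: ()   distance of the positive triple (h, r, t)
    neg_dist: (n,) distances of n negative triples (h', r, t')
    alpha:    temperature of the self-adversarial weights p(h', r, t')
    """
    # harder negatives (smaller distance) receive larger weight; the weights
    # are treated as constants, so no gradient flows through them
    p_neg = F.softmax(-alpha * neg_dist, dim=0).detach()
    pos_term = -F.logsigmoid(gamma - pos_dist)
    neg_term = -(p_neg * F.logsigmoid(neg_dist - gamma)).sum()
    return pos_term + neg_term
```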

Experiments

Datasets We evaluate on two common benchmarks, FB15k-237 and WN18RR. Each dataset is split into training, validation and test sets, following the setting of (Sun et al., 2019). The statistics of the two datasets are summarized in Table 1. Only triples in the training set are used to compute the graph context.

Evaluation Protocol
Following the evaluation protocol of (Dettmers et al., 2018; Sun et al., 2019), each test triple (h, r, t) is measured under two scenarios: head-focused (?, r, t) and tail-focused (h, r, ?). In each case, the test triple is ranked among all triples whose masked entity is replaced by every entity in the knowledge graph. True triples observed in the train/validation/test sets, other than the test triple itself, are excluded during evaluation. Hits at top 1, 3 and 10 (Hits@1, Hits@3, Hits@10) and the Mean Reciprocal Rank (MRR) are reported.
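A minimal sketch of this filtered ranking protocol for one query (our own helper; for distance-based models the score can be the negative distance):

```python
import torch

def filtered_rank_metrics(scores, target, known_true):
    """Filtered ranking for one test query, e.g. (h, r, ?).

    scores:     (num_entities,) plausibility scores, higher is better
    target:     index of the gold entity
    known_true: indices of entities forming true triples in train/valid/test
                (masked out of the ranking, except the gold entity itself)
    """
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[known_true] = True
    mask[target] = False                       # never mask the test entity
    s = scores.masked_fill(mask, float('-inf'))
    rank = int((s > s[target]).sum()) + 1      # 1-based rank of the gold entity
    return {'MRR': 1.0 / rank, 'Hits@1': rank <= 1,
            'Hits@3': rank <= 3, 'Hits@10': rank <= 10}
```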

Experimental Setup
Hyper-parameter settings The hyper-parameters of our model are tuned by grid search during training, including the learning rate, the embedding dimension d and the sub-embedding dimension d_s. In our setting, the embedding dimension is the number of parameters in each entity embedding; each entity embedding consists of K sub-embeddings of dimension d_s, i.e., d = K × d_s. There are two steps in model training: 1) the model is pre-trained with OTE or RotatE, and 2) the graph context based model is fine-tuned from the pre-trained model. Parameter settings are selected by the highest MRR with early stopping on the validation set. We use the adaptive moment (Adam) algorithm (Kingma and Ba, 2014) to train the models.
Specifically, for FB15k-237 we set the embedding dimension d = 400, the sub-embedding dimension d_s = 20, and the learning rates to 2e-3 and 2e-4 for the pre-training and fine-tuning stages respectively; for WN18RR we set d = 400, d_s = 4, and the learning rates to 1e-4 and 3e-5.
Implementation Our models are implemented in PyTorch and run on NVIDIA Tesla P40 GPUs. Pre-training OTE takes 5 hours for 240,000 steps, and fine-tuning GC-OTE takes 23 hours for 60,000 steps. Though graph context based training requires more computation, inference can be efficient if both head and tail context representations are precomputed and cached for each entity in the knowledge graph.

Experimental Results
In this section, we first present the results of link prediction, followed by an ablation study and an error analysis of our models. Table 2 compares the proposed models (OTE and graph context based GC-OTE) with several state-of-the-art models, including translational distance based models such as TransE and RotatE, as well as semantic matching models; baseline numbers are quoted directly from the published papers.

Results of Link Prediction
From Table 2, we observe that: 1) on FB15k-237, OTE outperforms RotatE, and GC-OTE outperforms all other models on all metrics. Specifically, MRR is improved from 0.338 for RotatE to 0.361, about a 7% relative improvement; increasing the sub-embedding dimension from 2 to 20 (OTE) and adding graph context each contribute about half of the improvement. 2) On WN18RR, OTE outperforms RotatE, and GC-OTE achieves new state-of-the-art results (as far as we know from published papers). These results show the effectiveness of the proposed OTE and graph context for predicting missing links in knowledge graphs.
Moreover, GC-OTE improves more on FB15k-237 than on WN18RR. This is because FB15k-237 has richer graph structure context than WN18RR: an average of 19 edges per node vs. 2 edges per node in WN18RR. These results indicate that the proposed GC-OTE is more effective on datasets with rich context structure information.

Ablation Study
Table 3 shows the ablation study of the proposed models and compares the number of model parameters with RotatE on the FB15k-237 validation set. We perform the ablation study with an embedding dimension of 400; the entity embedding dimensions for RotatE-S and RotatE-L are 400 and 2000, respectively.
First, we notice that increasing the embedding size from 400 to 2000 more than quadruples the RotatE model size while the performance gain is very limited (Rows 1 and 2 in Table 3); increasing the sub-embedding size from 2 to 20 does not increase the model size of OTE much, but yields a nice performance gain (Rows 3 and 4 in Table 3). The model size of OTE is less than one-third that of RotatE-L, yet it performs better. This shows the effectiveness of OTE.
We examine the proposed model in terms of the following aspects.
Impact of sub-embedding dimension: we fix the embedding dimension at 400 and increase the sub-embedding dimension d_s from 2 to 20; the MRR of OTE improves from 0.327 to 0.355 (Rows 3 and 4). In RotatE the entities are embedded in a complex vector space, which is similar to our setting with sub-embedding dimension 2. Our results show that increasing the sub-embedding dimension with OTE is beneficial to link prediction.
Impact of orthogonal transform: we replace the orthogonal transform operation in OTE with two different settings: 1) removing the diagonal scalar tensor, reducing Eq. 2 to Eq. 1 (OTE-scalar), and 2) using a normal linear transform rather than an orthogonal transform (LNE). Both settings lead to MRR degradation. This indicates that the proposed orthogonal transform is effective in modeling the relation patterns that are helpful for link prediction.
Impact of graph context: we add the graph context based model to both OTE (GC-OTE) and RotatE-L (GC-RotatE-L). MRRs improve for both RotatE-L and OTE, which shows the importance of modeling context information for link prediction.
Sub-embedding dimension size: Table 3 shows that increasing the sub-embedding dimension brings a nice improvement in MRR. Is larger always better? Figure 2 shows the impact of d_s on OTE performance: we fix the entity embedding dimension at 400 and vary the sub-embedding size over 2, 5, 10, 20, 50 and 100. The blue line and green bars represent MRR and Hits@10, respectively.
From Figure 2 we observe that both MRR and Hits@10 improve and slowly saturate around d_s = 20. Similar experiments on the WN18RR dataset find the best sub-embedding dimension to be 4.

Error Analysis
We present an error analysis of the proposed model on 1-to-N, N-to-1 and N-to-N relation predictions on FB15k-237. Table 4 shows results in terms of Hits@10, where "Num." is the number of validation triples belonging to the corresponding category, "H"/"T" denote the experiments predicting the head and the tail entity respectively, and "A" denotes the average over "H" and "T". Let c(h, r) and c(r, t) be the numbers of times the (h, r) and (r, t) pairs appear in triples from the training set, respectively. A validation triple (h, r, t) is assigned to a category according to these counts: 1-to-1 if both counts equal 1, N-to-N if both are greater than 1, and the two mixed cases give 1-to-N and N-to-1 (a sketch of this categorization is given below). From Table 4 we observe that, compared to the RotatE large model, the proposed model obtains better Hits@10 in all cases, especially the difficult ones: predicting the head entity for the 1-to-N/N-to-N relation types and the tail entity for the N-to-1/N-to-N relation types. This is because in the proposed model both the grouping of sub-embeddings in OTE and the graph context modeling help distinguish the N different tails/heads that share the same (head, rel)/(rel, tail) pair.
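As referenced above, here is a small sketch of the triple categorization, assuming the conventional head-to-tail naming in which c(h, r) > 1 means one head reaches several tails; the paper's exact name assignment is not recoverable from the text, so treat the mapping as an assumption:

```python
from collections import Counter

def build_pair_counts(train_triples):
    """Precompute c(h, r) and c(r, t) over the training set."""
    c_hr = Counter((h, r) for h, r, t in train_triples)
    c_rt = Counter((r, t) for h, r, t in train_triples)
    return c_hr, c_rt

def categorize(triple, c_hr, c_rt):
    h, r, t = triple
    many_tails = c_hr[(h, r)] > 1   # this (h, r) pair reaches several tails
    many_heads = c_rt[(r, t)] > 1   # several heads reach this (r, t) pair
    if many_tails and many_heads:
        return 'N-to-N'
    if many_tails:                  # assumed naming: one head, N tails
        return '1-to-N'
    if many_heads:                  # assumed naming: N heads, one tail
        return 'N-to-1'
    return '1-to-1'
```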

Conclusions
In this paper we propose a new distance-based knowledge graph embedding for link prediction, with two components. First, OTE extends the modeling of RotatE from the 2D complex domain to high-dimensional space with orthogonal relation transforms. Second, graph context is integrated into the distance scoring function to measure the plausibility of triples during training and inference.
The proposed approach effectively improves prediction accuracy on the difficult N-to-1, 1-to-N and N-to-N link predictions. Experimental results on the standard benchmarks FB15k-237 and WN18RR show that OTE improves consistently over RotatE, the state-of-the-art distance-based embedding model, especially on FB15k-237 with many high in-degree nodes. On WN18RR our model achieves new state-of-the-art results.

Acknowledgments This work is partially supported by the Beijing Academy of Artificial Intelligence (BAAI).

A Discussion on the Ability of Pattern Modeling and Inference
It can be proved that OTE can infer all three types of relation patterns, i.e., the symmetry/antisymmetry, inversion and composition patterns.

A.1 Symmetry/antisymmetry
If e_t = f(r, h) and e_h = f(r, t) both hold, we have

e_t = diag(exp(s_r)) φ(M_r) diag(exp(s_r)) φ(M_r) e_t  ⇒  φ(M_r) φ(M_r) = I and s_r = 0.

In other words, if φ(M_r) is a symmetric matrix and no scaling is applied, the relation is a symmetric relation.
If the relation is antisymmetric, i.e., e_t = f(r, h) and e_h ≠ f(r, t), we only need one of the φ(M_r(i)) to be a non-symmetric matrix, or one s_r(i) ≠ 0.
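As a small numerical illustration of the symmetry condition (our own example, not from the paper): a Householder reflection is both symmetric and orthogonal, so with s_r = 0 it realizes a symmetric relation, since applying it twice returns the input:

```python
import torch

# Householder reflection: I - 2 v v^T with unit v is symmetric and orthogonal
v = torch.randn(8); v = v / v.norm()
phi = torch.eye(8) - 2.0 * torch.outer(v, v)

e_h = torch.randn(8)
e_t = phi @ e_h                                    # tail projected from head
assert torch.allclose(phi @ e_t, e_h, atol=1e-6)   # and head from tail: symmetry
```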