Relation Embedding with Dihedral Group in Knowledge Graph

Link prediction is critical for applying incomplete knowledge graphs (KGs) to downstream tasks. As a family of effective approaches to link prediction, embedding methods try to learn low-rank representations of both entities and relations such that the bilinear form defined therein is a well-behaved scoring function. Despite their successful performance, existing bilinear forms overlook the modeling of relation compositions, resulting in a lack of interpretability for reasoning on KGs. To fill this gap, we propose a new model called DihEdral, named after the dihedral symmetry group. This new model learns knowledge graph embeddings that can capture relation compositions by nature. Furthermore, our approach models the relation embeddings parametrized by discrete values, thereby decreasing the solution space drastically. Our experiments show that DihEdral is able to capture all desired properties such as (skew-) symmetry, inversion and (non-) Abelian composition, outperforms existing bilinear form based approaches, and is comparable to or better than deep learning models such as ConvE.


Introduction
Large-scale knowledge graphs (KGs) play a critical role in downstream tasks such as semantic search (Berant et al., 2013), dialogue management (He et al., 2017) and question answering (Bordes et al., 2014). In most cases, despite its large scale, a KG is not complete because it is difficult to enumerate all facts in the real world. Predicting the missing links from the existing dataset has been one of the most important research topics for years. A common representation of a KG is a set of triples (head, relation, tail), and the problem of link prediction can be viewed as predicting new triples from the existing set. A popular approach is KG embedding, which maps both entities and relations in the KG to a vector space such that the scoring function of entities and relations distinguishes ground-truth facts from false ones (Socher et al., 2013; Bordes et al., 2013; Yang et al., 2015). Another family of approaches explicitly models the reasoning process on the KG by synthesizing information from paths (Guu et al., 2015). More recently, researchers have applied deep learning methods to KG embeddings so that non-linear interactions between entities and relations are enabled (Schlichtkrull et al., 2018; Dettmers et al., 2018).
* Equal contribution.
The standard task for link prediction is to answer queries (h, r, ?) or (?, r, t). In this context, recent works on KG embedding that focus on bilinear form methods (Trouillon et al., 2016; Nickel et al., 2016; Liu et al., 2017; Kazemi and Poole, 2018) are known to perform reasonably well. The success of this family of models resides in the fact that they are able to model relation (skew-) symmetries. Furthermore, when serving downstream tasks such as learning first-order logic rules and reasoning over the KG, the learned relation representation is expected to discover relation composition by itself. One key property of relation composition is that in many cases it can be non-commutative. For example, exchanging the order of parent_of and spouse_of results in a completely different relation (parent_of as opposed to parent_in_law_of). We argue that, in order to learn relation composition within the link prediction task, this non-commutative property should be explicitly modeled.
In this paper, we propose DihEdral to model the relations in a KG with the representation of the dihedral group. The elements of a dihedral group are constructed from rotation and reflection operations over a 2D symmetric polygon. As the matrix representations of the dihedral group can be symmetric or skew-symmetric, and the multiplication of the group elements can be Abelian or non-Abelian, it is a good candidate for modeling relations with all the corresponding desired properties.
To the best of our knowledge, this is the first attempt to employ a finite non-Abelian group in KG embedding to account for relation compositions. Another merit of using the dihedral group is that even when the parameters are quantized or binarized, the performance in link prediction tasks can be improved over state-of-the-art bilinear form methods, due to the implicit regularization imposed by quantization.
The rest of the paper is organized as follows: in §2 we present the mathematical framework of bilinear form modeling for the link prediction task, followed by an introduction to group theory and the dihedral group. In §3 we formalize a novel model, DihEdral, to represent relations with full expressiveness. In §4 and §5 we develop two efficient ways to parametrize DihEdral and show that both approaches outperform existing bilinear form methods. In §6 we carry out extensive case studies to demonstrate the enhanced interpretability of the relation embedding space by showing that the desired properties of (skew-) symmetry, inversion and relation composition are coherent with the relation embeddings learned by DihEdral.

Bilinear Form for KB Link Prediction
Let $\mathcal{E}$ and $\mathcal{R}$ be the sets of entities and relations. A triple $(h, r, t)$, where $h, t \in \mathcal{E}$ are the head and tail entities and $r \in \mathcal{R}$ is a relation, corresponds to an edge in the KG.
In a bilinear form, the entities $h, t$ are represented by vectors $h, t \in \mathbb{R}^M$ where $M \in \mathbb{Z}^+$, and the relation $r$ is represented by a matrix $R \in \mathbb{R}^{M \times M}$. The score for the triple is defined as $\phi(h, r, t) = h^\top R t$. Good representations of the entities and relations are learned such that the scores are high for positive triples and low for negative triples.

Group and Dihedral Group
Let $g_i, g_j$ be two elements in a set $G$, and let $\circ$ be a binary operation between any two elements in $G$. The set $G$ forms a group when the following axioms are satisfied:
Closure For any two elements $g_i, g_j \in G$, $g_k = g_i \circ g_j$ is also an element in $G$.
Associativity For any $g_i, g_j, g_k \in G$, $(g_i \circ g_j) \circ g_k = g_i \circ (g_j \circ g_k)$.
Identity There exists an identity element $e$ in $G$ such that, for every element $g$ in $G$, the equation $e \circ g = g \circ e = g$ holds.
Inverse For each element $g$, there is an inverse element $g^{-1}$ such that $g \circ g^{-1} = g^{-1} \circ g = e$.
If the number of group elements is finite, the group is called a finite group. If the group operation is commutative, i.e. $g_i \circ g_j = g_j \circ g_i$ for all $g_i$ and $g_j$, the group is called Abelian; otherwise the group is non-Abelian.
Moreover, if the group elements can be represented by matrices, with the group operation defined as matrix multiplication, then the identity element is represented by the identity matrix and the inverse element is represented by the matrix inverse. In the following, we do not distinguish between a group element and its corresponding matrix representation when no confusion arises.
A dihedral group is a finite group that consists of the symmetry operations of a regular polygon in two-dimensional space. Here the symmetry operations refer to the operators that preserve the polygon. For a $K$-sided ($K \in \mathbb{Z}^+$) polygon, the corresponding dihedral group is denoted as $D_K$ and consists of $2K$ elements, of which $K$ are rotation operators and $K$ are reflection operators. A rotation operator $O_K^{(m)}$ rotates the polygon anti-clockwise around the center by an angle of $2\pi m/K$, and a reflection operator $F_K^{(m)}$ mirrors the rotation $O_K^{(m)}$ vertically. The elements of the dihedral group $D_K$ can be represented as 2D orthogonal matrices:
$$O_K^{(m)} = \begin{bmatrix} \cos(2\pi m/K) & -\sin(2\pi m/K) \\ \sin(2\pi m/K) & \cos(2\pi m/K) \end{bmatrix}, \qquad F_K^{(m)} = \begin{bmatrix} \cos(2\pi m/K) & \sin(2\pi m/K) \\ \sin(2\pi m/K) & -\cos(2\pi m/K) \end{bmatrix} \qquad (1)$$
where $m \in \{0, 1, \cdots, K-1\}$. Correspondingly, the group operation of the dihedral group can be represented as multiplication of the representation matrices. Note that when $K$ is divisible by 4, the rotation matrices $O_K^{(K/4)}$ and $O_K^{(3K/4)}$ are skew-symmetric.
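As a minimal sketch of the construction described above (not the authors' implementation), the following standard-library Python builds the $2K$ rotation and reflection matrices and checks closure numerically. The helper names (`rotation`, `reflection`, `dihedral_group`, `is_closed`) and the tolerance are illustrative assumptions.

```python
import math

def rotation(m, K):
    """2x2 matrix of an anti-clockwise rotation by 2*pi*m/K."""
    a = 2 * math.pi * m / K
    return ((math.cos(a), -math.sin(a)), (math.sin(a), math.cos(a)))

def reflection(m, K):
    """The rotation matrix above multiplied on the right by diag(1, -1)."""
    a = 2 * math.pi * m / K
    return ((math.cos(a), math.sin(a)), (math.sin(a), -math.cos(a)))

def matmul(A, B):
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def transpose(A):
    return tuple(tuple(A[j][i] for j in range(2)) for i in range(2))

def negate(A):
    return tuple(tuple(-x for x in row) for row in A)

def close(A, B, tol=1e-9):
    # Floating-point comparison; the tolerance is an arbitrary choice.
    return all(abs(A[i][j] - B[i][j]) < tol for i in range(2) for j in range(2))

def dihedral_group(K):
    """All 2K elements: K rotations followed by K reflections."""
    return [rotation(m, K) for m in range(K)] + [reflection(m, K) for m in range(K)]

def is_closed(G):
    """Every pairwise product must again be (numerically) an element of G."""
    return all(any(close(matmul(A, B), C) for C in G) for A in G for B in G)

D4 = dihedral_group(4)
symmetric = [A for A in D4 if close(A, transpose(A))]
skew_symmetric = [A for A in D4 if close(A, negate(transpose(A)))]
print(len(D4), is_closed(D4), len(symmetric), len(skew_symmetric))
```

For $K = 4$ the two skew-symmetric elements are the rotations by $\pi/2$ and $3\pi/2$; every other element is symmetric.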

Relation Modeling with Dihedral Group and Expressiveness
We propose to model the relations by the group elements in $D_K$. Like ComplEx (Trouillon et al., 2016), we assume an even number of latent dimensions $2L$. More specifically, the relation matrix takes a block diagonal form $R = \mathrm{diag}(R^{(1)}, R^{(2)}, \cdots, R^{(L)})$ where $R^{(l)} \in D_K$ for $l \in \{1, 2, \cdots, L\}$. The corresponding embedding vectors $h \in \mathbb{R}^{2L}$ and $t \in \mathbb{R}^{2L}$ take the form of $[h^{(1)}, \cdots, h^{(L)}]$ and $[t^{(1)}, \cdots, t^{(L)}]$, where $h^{(l)}, t^{(l)} \in \mathbb{R}^2$. As a result, the score for a triple $(h, r, t)$ in bilinear form can be written as a sum of these $L$ components:
$$h^\top R t = \sum_{l=1}^{L} h^{(l)\top} R^{(l)} t^{(l)}.$$
We name the model DihEdral because each component $R^{(l)}$ is a representation matrix of a dihedral group element.
Theorem 1. The relation matrices in DihEdral form a group under matrix multiplication.
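The component-wise score can be illustrated with a short sketch. It assumes embeddings stored as flat Python lists of length $2L$ and relation blocks stored as $2 \times 2$ tuples; the function names and the toy values are hypothetical.

```python
def component_score(h2, R2, t2):
    """h2^T R2 t2 for one 2-dimensional component."""
    return sum(h2[i] * R2[i][j] * t2[j] for i in range(2) for j in range(2))

def dihedral_score(h, blocks, t):
    """Sum of the L component scores; equals h^T diag(blocks) t."""
    assert len(h) == len(t) == 2 * len(blocks)
    return sum(
        component_score(h[2 * l:2 * l + 2], R, t[2 * l:2 * l + 2])
        for l, R in enumerate(blocks)
    )

# Two D_4 elements used as the L = 2 blocks of one relation (toy example).
ROT90 = ((0, -1), (1, 0))   # rotation by pi/2
FLIP = ((1, 0), (0, -1))    # reflection with m = 0

h = [1.0, 2.0, 0.5, -1.0]
t = [3.0, -1.0, 2.0, 4.0]
score = dihedral_score(h, [ROT90, FLIP], t)
print(score)
```

Because the relation matrix is block diagonal, the score never mixes coordinates from different components, which is why each block can be inspected independently.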
Though its relation embedding takes discrete values, DihEdral is fully expressive, as it is able to model relations with the desired properties for each component $R^{(l)}$ by the corresponding matrices in $D_K$. The properties are summarized in Table 1, with a comparison to DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), ANALOGY (Liu et al., 2017) and SimplE (Kazemi and Poole, 2018). The details of expressiveness are described as follows. For notational convenience, we denote by $T^+$ the set of all possible true triples, and by $T^-$ the set of all possible false triples.
Symmetric A relation $r$ is symmetric iff $(h, r, t) \in T^+ \Leftrightarrow (t, r, h) \in T^+$. Symmetric relations in the real world include synonym and similar_to.
Note that with DihEdral, the component $R^{(l)}$ can be a reflection matrix that is symmetric and off-diagonal. This is in contrast to DistMult and ComplEx, where the relation matrix has to be diagonal when it is symmetric.

Skew-Symmetric
When K is a multiple of 4, pure skew-symmetric matrices in D 4 can be chosen. As a result, the relation is guaranteed to be skew-symmetric satisfying φ(h, r, t) = −φ(t, r, h).
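A small numerical check of this property, assuming a single 2-dimensional component and toy embedding values chosen only for illustration:

```python
def score(h, R, t):
    """Bilinear score h^T R t for one 2-dimensional component."""
    return sum(h[i] * R[i][j] * t[j] for i in range(2) for j in range(2))

ROT90 = ((0, -1), (1, 0))    # rotation by pi/2 in D_4: skew-symmetric
REFLECT = ((0, 1), (1, 0))   # reflection with m = 1 in D_4: symmetric, off-diagonal

h, t = (1.0, 2.0), (3.0, -1.0)

forward, backward = score(h, ROT90, t), score(t, ROT90, h)
sym_forward, sym_backward = score(h, REFLECT, t), score(t, REFLECT, h)
print(forward, backward, sym_forward, sym_backward)
```

Swapping head and tail flips the sign of the score for the skew-symmetric block and leaves it unchanged for the symmetric one.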
Inversion r 2 is the inverse of r 1 iff (h, r 1 , t) ∈ T + ⇔ (t, r 2 , h) ∈ T + . As a real world example, parent_of is the inversion of child_of.
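The inverse of the relation $r$ is represented by $R^{-1}$ in an ideal situation: for two positive triples $(h, r_1, t)$ and $(t, r_2, h)$, we have $R_1 h \approx t$ and $R_2 t \approx h$ (cf. Lemma 2). With enough occurrences of the pair $\{h, t\}$ we have $R_2 = R_1^{-1}$.
Composition $r_3$ is the composition of $r_1$ and $r_2$, denoted as $r_3 = r_1 \odot r_2$, when $(h, r_1, m) \in T^+ \wedge (m, r_2, t) \in T^+ \Rightarrow (h, r_3, t) \in T^+$. An example of composition in the real world is nationality = born_in_city $\odot$ city_belong_to_nation. Depending on the commutative property, there are two cases of relation compositions:
• Abelian $r_1$ and $r_2$ are Abelian if $(h, r_1 \odot r_2, t) \in T^+ \Leftrightarrow (h, r_2 \odot r_1, t) \in T^+$.
• Non-Abelian $r_1$ and $r_2$ are non-Abelian otherwise. A real-world example is parent_of $\odot$ spouse_of $\neq$ spouse_of $\odot$ parent_of.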
In DihEdral, the relation composition operator corresponds to the matrix multiplication of the corresponding representations, i.e. $R_3 \approx R_1 R_2$. Consider three positive triples $(h, r_1, m)$, $(m, r_2, t)$ and $(h, r_3, t)$. In the ideal situation, chaining the constraints imposed by the first two triples and comparing the result with the third yields this relation between the three matrices. Note that although the rotation matrices form a subgroup of the dihedral group, and are hence algebraically closed, the rotation subgroup cannot model non-Abelian relations. To model non-Abelian relation compositions, at least one reflection matrix must be involved.
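The algebraic facts used here can be checked directly on $D_4$, whose matrices have integer entries. The sketch below is illustrative only; the constant names are assumptions, and the matrices are not tied to any particular learned relation.

```python
def matmul(A, B):
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def transpose(A):
    return tuple(tuple(A[j][i] for j in range(2)) for i in range(2))

IDENTITY = ((1, 0), (0, 1))
ROT90 = ((0, -1), (1, 0))     # rotation by pi/2
ROT180 = ((-1, 0), (0, -1))   # rotation by pi
FLIP = ((1, 0), (0, -1))      # reflection with m = 0

# Inversion: the matrices are orthogonal, so the inverse is the transpose.
inverse_ok = matmul(ROT90, transpose(ROT90)) == IDENTITY

# Abelian case: two rotations commute.
rotations_commute = matmul(ROT90, ROT180) == matmul(ROT180, ROT90)

# Non-Abelian case: a rotation and a reflection do not commute in general.
mixed_commute = matmul(ROT90, FLIP) == matmul(FLIP, ROT90)

print(inverse_ok, rotations_commute, mixed_commute)
```

A reflection is its own inverse, so a component that learns a reflection represents a relation that coincides with its own inversion on that component.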

Training
In the standard training framework for KG embedding models, the parameters $\Theta = \Theta_E \cup \Theta_R$, i.e. the union of entity and relation embeddings, are learned by stochastic optimization methods. For each minibatch of positive triples, a small number of negative triples are sampled by corrupting the head or tail of each positive triple; the related parameters in the model are then updated by minimizing the binary negative log-likelihood such that positive triples get higher scores than negative triples. Specifically, the loss function is written as follows:
$$\min_{\Theta} \sum_{(h, r, t) \in T^+ \cup T^-} -\log \sigma\left(y\, h^\top R t\right) + \lambda \lVert \Theta_E \rVert^2 \qquad (2)$$
where $\lambda \in \mathbb{R}$ is the $L_2$ regularization coefficient for entity embeddings only, $T^+$ and $T^-$ are the sets of positive and sampled negative triples in a minibatch, and $y$ equals 1 if $(h, r, t) \in T^+$ and $-1$ otherwise. $\sigma$ is the sigmoid function defined as $\sigma(x) = 1/(1 + \exp(-x))$.
Special treatment of the relation representations $R$ is required, as they take discrete values. In the next subsections we describe a reparametrization method for general $K$, followed by a simple approach for small integer values of $K$. With these treatments, DihEdral can be trained within the standard framework.
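A minimal sketch of this objective on precomputed scores, assuming the loss is summed (not averaged) over the minibatch and that the regularizer is the squared $L_2$ norm of the entity vectors involved; both choices, and the function names, are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def batch_loss(pos_scores, neg_scores, entity_vectors, lam):
    """Binary negative log-likelihood with labels y = +1 / -1, plus L2 on entities.

    pos_scores / neg_scores: bilinear scores of positive / sampled negative triples.
    entity_vectors: entity embeddings touched by the minibatch (relations are
    deliberately not regularized).
    """
    nll = -sum(math.log(sigmoid(s)) for s in pos_scores)    # y = +1
    nll += -sum(math.log(sigmoid(-s)) for s in neg_scores)  # y = -1
    l2 = sum(x * x for vec in entity_vectors for x in vec)
    return nll + lam * l2

# Raising positive scores and lowering negative scores reduces the loss.
loose = batch_loss([0.0], [0.0], [], 0.0)
tight = batch_loss([2.0], [-2.0], [], 0.0)
print(loose, tight)
```

For very large score magnitudes a numerically stable softplus formulation would be preferable to calling `math.log(sigmoid(...))` directly.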

Gumbel-Softmax Approach
Each relation component $R^{(l)}$ can be parametrized with a one-hot variable $c^{(l)} \in \{0, 1\}^{2K}$ encoding the $2K$ choices of matrices in $D_K$: $R^{(l)} = \sum_{k=1}^{2K} c_k^{(l)} D_k$, where $D_k$ enumerates $D_K$. The number of parameters for each relation is $2LK$ in this approach.
The one-hot variable $c^{(l)}$ is further parametrized by $s^{(l)} \in \mathbb{R}^{2K}$ via the Gumbel trick (Jang et al., 2017) with the following steps: 1) take i.i.d. samples $q_1, q_2, \ldots, q_{2K}$ from a Gumbel distribution: $q_i = -\log(-\log u_i)$, where $u_i \sim U(0, 1)$ are samples from a uniform distribution; 2) use the log-softmax form of $s^{(l)}$ to parametrize $c^{(l)}$:
$$c_i^{(l)} = \frac{\exp\left[\left(\log \mathrm{softmax}(s^{(l)})_i + q_i\right)/\tau\right]}{\sum_{j=1}^{2K} \exp\left[\left(\log \mathrm{softmax}(s^{(l)})_j + q_j\right)/\tau\right]}$$
where $\tau$ is the tunable temperature. During training, we start with a high temperature, e.g. $\tau_0 = 3$, to drive the system out of poor local minima, and gradually cool the system with $\tau = \max(0.5, \tau_0 \exp(-0.001 t))$, where $t$ is the number of epochs elapsed.
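The sampling steps and the cooling schedule can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' PyTorch code: the relaxed sample follows the standard Gumbel-Softmax form of Jang et al. (2017), the uniform samples are clamped away from 0 and 1 to avoid `log(0)`, and all function names are hypothetical.

```python
import math
import random

def temperature(epoch, tau0=3.0, floor=0.5, rate=0.001):
    """Cooling schedule tau = max(0.5, tau0 * exp(-0.001 * t))."""
    return max(floor, tau0 * math.exp(-rate * epoch))

def gumbel_softmax_sample(s, tau, rng, eps=1e-12):
    """Relaxed one-hot sample over len(s) choices from logits s."""
    # log-softmax of the logits (shifted by the max for stability)
    mx = max(s)
    lse = mx + math.log(sum(math.exp(x - mx) for x in s))
    log_pi = [x - lse for x in s]
    # Gumbel noise q_i = -log(-log(u_i)), with u_i clamped into (0, 1)
    q = []
    for _ in s:
        u = min(max(rng.random(), eps), 1.0 - eps)
        q.append(-math.log(-math.log(u)))
    # tempered softmax over the perturbed log-probabilities
    z = [(lp + g) / tau for lp, g in zip(log_pi, q)]
    mz = max(z)
    e = [math.exp(v - mz) for v in z]
    total = sum(e)
    return [v / total for v in e]

def relaxed_block(c, group):
    """R = sum_k c_k D_k: a convex mixture of the 2x2 group matrices."""
    return tuple(
        tuple(sum(ck * D[i][j] for ck, D in zip(c, group)) for j in range(2))
        for i in range(2)
    )

# The eight elements of D_4: four rotations, then four reflections.
D4 = [
    ((1, 0), (0, 1)), ((0, -1), (1, 0)), ((-1, 0), (0, -1)), ((0, 1), (-1, 0)),
    ((1, 0), (0, -1)), ((0, 1), (1, 0)), ((-1, 0), (0, 1)), ((0, -1), (-1, 0)),
]

rng = random.Random(0)
c = gumbel_softmax_sample([0.0] * 8, temperature(0), rng)
R = relaxed_block(c, D4)
```

At high temperature the sample is spread over several group elements; as $\tau$ decreases the sample approaches a one-hot vector and `relaxed_block` approaches a single element of $D_K$.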

Reparametrization with Binary Variables
Another parametrization technique for $D_K$ with $K \in \{4, 6\}$ is to parametrize each element of the matrix $R^{(l)}$ directly. Specifically, we have
$$R^{(l)} = \begin{bmatrix} \lambda & -\alpha\gamma \\ \gamma & \alpha\lambda \end{bmatrix}$$
where $\lambda = \cos(2\pi k/K)$, $\gamma = \sin(2\pi k/K)$, $k \in \{0, 1, \cdots, K-1\}$ and $\alpha \in \{-1, 1\}$ is the reflection indicator. Both $\lambda$ and $\gamma$ can be parametrized by the same set of binary variables $\{x, y, z\}$. In the forward pass, each binary variable $b \in \{x, y, z\}$ is parametrized by taking an element-wise sign function of a real number: $b = \mathrm{sign}(b_{\mathrm{real}})$. In the backward pass, the gradient of the sign function is zero almost everywhere, so $b_{\mathrm{real}}$ would never be updated; the gradient of the loss with respect to the real variable is therefore estimated with the straight-through estimator (STE) (Yin et al., 2019). The functional form of the STE is not unique and deserves further theoretical study. In our experiments, we used the identity STE (Bengio et al., 2013): $\partial \mathcal{L} / \partial b_{\mathrm{real}} = \partial \mathcal{L} / \partial b \cdot \mathbb{1}$, where $\mathbb{1}$ stands for the element-wise identity. For these two approaches, we name the model DK-Gumbel for the Gumbel-Softmax approach and DK-STE for the reparametrization with binary variables.
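The following sketch illustrates the idea for $K = 4$. The specific mapping from binary variables to $(\lambda, \gamma)$, namely $\lambda = (x + y)/2$ and $\gamma = (x - y)/2$, is an assumption chosen here because it reaches exactly the four cosine/sine pairs of $D_4$; it is not claimed to be the authors' formula. The convention `sign(0) = +1` and the function names are likewise assumptions.

```python
def sign(v):
    """Forward pass: binarize a real-valued parameter (sign(0) taken as +1 here)."""
    return 1.0 if v >= 0 else -1.0

def d4_block(x_real, y_real, alpha_real):
    """Build one 2x2 block of D_4 from three real-valued parameters."""
    x, y, alpha = sign(x_real), sign(y_real), sign(alpha_real)
    lam = (x + y) / 2   # ASSUMED mapping: takes values in {-1, 0, 1}
    gam = (x - y) / 2   # ASSUMED mapping: takes values in {-1, 0, 1}
    return ((lam, -alpha * gam), (gam, alpha * lam))

def ste_backward(grad_wrt_binary):
    """Identity STE: pass the gradient w.r.t. b through to b_real unchanged."""
    return grad_wrt_binary

# The 8 sign patterns of (x, y, alpha) enumerate the 8 elements of D_4.
blocks = {
    d4_block(x, y, a)
    for x in (-0.3, 0.3) for y in (-0.7, 0.7) for a in (-1.0, 1.0)
}
print(len(blocks))
```

In an autograd framework the same effect is usually obtained by writing the binarized value as the real value plus a detached correction, so that the forward pass sees the sign and the backward pass sees the identity.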

Experimental Result
This section presents our experiments and results. We first introduce the benchmark datasets used in our experiments, after that we evaluate our approach in the link prediction task.

Datasets
Introduced in Bordes et al. (2013), WN18 and FB15K are popular benchmarks for link prediction tasks. WN18 is a subset of the WordNet database, which describes relations between words. In WN18 the most frequent types of relations form reversible pairs (e.g., hypernym to hyponym, part_of to has_part). FB15K is a subsample of Freebase limited to 15k entities. It contains triples with different characteristics (e.g., from one-to-one relations such as capital_of to many-to-many relations such as actor_in_film). YAGO3-10 (Dettmers et al., 2018) is a subset of YAGO3 (Suchanek et al., 2007) in which each entity has at least 10 relations.
As noted in Toutanova et al. (2015) and Dettmers et al. (2018), in the original WN18 and FB15K datasets a large number of test triples appear as the reciprocal form of training samples, due to the reversible relation pairs. Therefore, these authors eliminated the inverse relations and constructed corresponding subsets: WN18RR with 11 relations and FB15K-237 with 237 relations, both of which are free from test data leakage. All dataset statistics are shown in Table 2.

Evaluation Metric
We use the popular metrics filtered HITS@1, 3, 10 and mean reciprocal rank (MRR) as our evaluation metrics as in Bordes et al. (2013).
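For concreteness, the filtered ranking protocol can be sketched as follows: when ranking the correct entity for a query, other candidates already known to be true (from the training, validation or test sets) are removed before the rank is computed. The data layout (a dictionary from candidate to score) and the function names are assumptions for illustration.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, ignoring other known-true candidates.

    scores: dict mapping candidate entity -> model score (higher is better).
    known_true: set of candidates that also form true triples for this query.
    """
    target_score = scores[target]
    better = sum(
        1
        for cand, s in scores.items()
        if cand != target and cand not in known_true and s > target_score
    )
    return better + 1

def mrr_and_hits(ranks, ns=(1, 3, 10)):
    """Mean reciprocal rank and HITS@N over a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {n: sum(1 for r in ranks if r <= n) / len(ranks) for n in ns}
    return mrr, hits

scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
rank = filtered_rank(scores, target="c", known_true={"a"})
mrr, hits = mrr_and_hits([1, 2, 4])
print(rank, mrr, hits)
```

In the toy query, candidate "a" outranks the target but is itself a known true answer, so it is filtered out and the target receives rank 2 instead of 3.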

Model Selection and Hyper-parameters
We implemented DihEdral in PyTorch (Paszke et al., 2017). In all our experiments, we selected the hyperparameters of our model by grid search for the best MRR on the validation set. We trained DK-Gumbel for K ∈ {4, 6, 8} and DK-STE for K ∈ {4, 6} with the AdaGrad optimizer (Duchi et al., 2011), and we did not notice a significant difference in the evaluation metrics when varying K. In the following we only report the results for K = 4.
[Table 3 caption (partially recovered): "... (Trouillon et al., 2016), and the rest of the results are taken from the original literature."]
The link prediction results are shown in Tables 3 and 4, where the results for the baselines are taken directly from the original literature. DihEdral outperforms almost all models in bilinear form, and even ConvE on FB15K, WN18RR and YAGO3-10. The results demonstrate that, even though DihEdral takes discrete values in its relation representations, properly modeling the underlying structure of relations using $D_K$ is essential.

Case Studies
The representation learned by DihEdral not only reaches state-of-the-art performance in link prediction tasks, but also provides insights through its special properties. In this section, we present detailed case studies on these properties. In order to achieve better resolution, we increased the embedding dimension to 2L = 600 for the WN18 dataset.

Inversion
We show the multiplication of some pairs of inversion relations on WN18 and FB15K in Figure 2, and the result is close to an identity matrix. For the relation pair {_member_of_domain_usage, _synset_domain_usage_of}, the multiplication deviates from the ideal identity matrix, as the performance for these two relations is poorer compared to the others. We also repeated the same case study for other bilinear embedding methods; however, their multiplications are not the identity, but close to diagonal matrices with different elements.
Table 4: Link prediction results on WN18RR and FB15K-237 datasets (columns: MRR and HITS@N for N = 1, 3, 10 on WN18RR, FB15K-237 and YAGO3-10). Results marked by '†' are taken from (Dettmers et al., 2018), and the result marked by '*' is taken from (Das et al., 2018).
[Figure caption fragment: "... are skew-symmetric components and others are symmetric."]

Symmetry and Skew-Symmetry
Since the KB datasets do not contain negative triples explicitly, there is no penalty for modeling skew-symmetric relations with symmetric matrices. This is perhaps the reason why DistMult performs well on the FB15K dataset, in which many relations are skew-symmetric.
To resolve this ambiguity, for each positive triple (h, r, t) with a definite skew-symmetric relation r, a negative triple (t, r, h) is sampled with probability 0.5. After adding this new negative sampling scheme to D4-Gumbel, the symmetric and skew-symmetric relations can be distinguished on the WN18 dataset without reducing performance on link prediction tasks. Figure 3 shows that both symmetric and skew-symmetric relations favor the corresponding components in $D_4$, as expected. Again, due to the imperfect performance on _synset_domain_topic_of, its corresponding representation is imperfect as well. We also conducted the same experiment without this sampling scheme: the histograms for the symmetric relations are similar, but there is no strong preference for skew-symmetric relations.
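The additional sampling scheme is simple enough to sketch directly. The function name, the triple layout and the example relation set below are illustrative assumptions; only the rule itself (reverse a skew-symmetric positive with probability 0.5) comes from the text.

```python
import random

def sample_reversed_negatives(positives, skew_relations, rng, p=0.5):
    """For each positive (h, r, t) with a skew-symmetric r, emit (t, r, h)
    as an extra negative triple with probability p."""
    negatives = []
    for h, r, t in positives:
        if r in skew_relations and rng.random() < p:
            negatives.append((t, r, h))
    return negatives

positives = [
    ("cat", "hypernym", "animal"),    # treated as skew-symmetric in this toy set
    ("big", "similar_to", "large"),   # symmetric: never reversed
]
rng = random.Random(0)
extra = sample_reversed_negatives(positives, {"hypernym"}, rng)
print(extra)
```

These reversed triples would be added to the usual negatives obtained by corrupting the head or tail, so that a symmetric block is explicitly penalized on skew-symmetric relations.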

Relation Composition
In the FB15K-237 dataset, the majority of patterns are relation compositions. However, these compositions are Abelian only, because all the inverse relations were filtered out on purpose. To verify whether non-Abelian relation compositions can be discovered by DihEdral in an ideal situation, we generated a synthetic dataset called FAMILY. Specifically, we first generated two generations of people with equal numbers of males and females in each generation, and randomly assigned spouse edges within each generation and child and parent edges between the two generations, after which the sibling, parent_in_law and child_in_law edges were connected based on commonsense logic.
We trained D4-Gumbel on FAMILY with latent dimension 2L = 400. In addition to the loss in Eq. 2, we add a regularization term that encourages the score of a positive triple to be higher than that of a negative triple for each component independently; the term is computed over pairs of a positive triple $(h, r, t) \in T^+$ and the corresponding negative triple $(h^*, r, t^*) \in T^-$. For each composition $r_3 = r_1 \odot r_2$, we compute the histogram of $R_1 R_2 R_3^{-1}$. The results for relation compositions in FB15K-237 and FAMILY are shown in Figure 4, from which we can see that composition is well captured by matrix multiplication. We also reveal the non-Abelian property in FAMILY by exchanging the order of $r_1$ and $r_2$.
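The composition check can be sketched as a block-wise histogram. The sketch assumes each relation is stored as a list of $L$ integer-valued $D_4$ blocks and uses the transpose as the inverse, which is valid because the blocks are orthogonal; the toy relations are constructed by hand and the function names are hypothetical.

```python
from collections import Counter

def matmul(A, B):
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def transpose(A):
    return tuple(tuple(A[j][i] for j in range(2)) for i in range(2))

def composition_histogram(R1, R2, R3):
    """Count which group element each block of R1 R2 R3^{-1} equals."""
    return Counter(
        matmul(matmul(a, b), transpose(c)) for a, b, c in zip(R1, R2, R3)
    )

IDENTITY = ((1, 0), (0, 1))
ROT90 = ((0, -1), (1, 0))
FLIP = ((1, 0), (0, -1))

# Toy relations with L = 2 blocks; R3 is built as the block-wise product R1 R2.
R1 = [ROT90, FLIP]
R2 = [FLIP, ROT90]
R3 = [matmul(a, b) for a, b in zip(R1, R2)]

in_order = composition_histogram(R1, R2, R3)   # all mass on the identity
swapped = composition_histogram(R2, R1, R3)    # order exchanged
print(in_order, swapped)
```

A histogram concentrated on the identity indicates that the composition holds on most components; exchanging the order moves the mass away from the identity when the composition is non-Abelian.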

Related Works
In this section we discuss the related works and their connections to our approach.
TransE (Bordes et al., 2013) treats relations as translation operators between head and tail entities. More complicated distance functions (Wang et al., 2014; Lin et al., 2015b,a) have also been proposed as extensions to TransE. TorusE (Ebisu and Ichise, 2018) proposes a novel distance function defined over a torus, obtained by transforming the vector space by an Abelian group onto an n-dimensional torus. ProjE (Shi and Weninger, 2017) designs a neural network with a combination layer and a projection layer. R-GCN (Schlichtkrull et al., 2018) employs convolution over multiple entities to capture the spectrum of the knowledge graph. ConvE (Dettmers et al., 2018) performs 2D convolution on the concatenation of entity and relation embeddings, and thus by nature introduces non-linearity to enhance expressiveness.
In RESCAL (Nickel et al., 2011) each relation is represented by a full-rank matrix. As a downside, RESCAL has a huge number of parameters, making the model prone to overfitting. The totally symmetric DistMult model (Yang et al., 2015) simplifies RESCAL by representing each relation with a diagonal matrix. To parametrize skew-symmetric relations, ComplEx (Trouillon et al., 2016) extends DistMult by using complex-valued instead of real-valued vectors for entities and relations. The representation matrix of ComplEx supports both symmetric and skew-symmetric relations while being closed under matrix multiplication. HolE (Nickel et al., 2016) models skew-symmetry with circular correlation between entity embeddings, and thus ensures shifts in covariance between embeddings at different dimensions. It was recently shown that HolE is isomorphic to ComplEx (Hayashi and Shimbo, 2017). ANALOGY (Liu et al., 2017) and SimplE (Kazemi and Poole, 2018) both reformulate the tensor decomposition approach in light of analogical and reversible relations.
Though embedding based approaches achieve state-of-the-art performance on the link prediction task, symbolic relation composition is not explicitly modeled. In contrast, the latter goal is currently popularized by directly modeling the reasoning paths (Neelakantan et al., 2015; Xiong et al., 2017; Das et al., 2018; Lin et al., 2018; Guo et al., 2019). As paths are consistent with the logical structure of reasoning, non-Abelian composition is supported by nature.
DihEdral is more expressive than other bilinear form based embedding methods such as DistMult, ComplEx and ANALOGY. As the relation matrix is restricted to be orthogonal, DihEdral could bridge translation based and bilinear form based approaches, since the training objective w.r.t. the relation matrix is similar (cf. Lemma 2). Besides, DihEdral is the first embedding method to incorporate non-Abelian relation compositions in terms of matrix multiplications (cf. Theorem 1).

Conclusion
This paper proposed DihEdral for KG relation embedding. By leveraging the desired properties of the dihedral group, relation (skew-) symmetry, inversion, and (non-) Abelian compositions are all supported. Our experimental results on benchmark KGs showed that DihEdral outperforms existing bilinear form models and even deep learning methods. Finally, we demonstrated through extensive case studies that the above properties can be learned by DihEdral, yielding a substantial increase in interpretability over existing models.