AprilE: Attention with Pseudo Residual Connection for Knowledge Graph Embedding

Knowledge graph embedding maps entities and relations into a low-dimensional vector space. However, many existing methods still struggle to model diverse relational patterns, especially symmetric and antisymmetric relations. To address this issue, we propose a novel model, AprilE, which employs triple-level self-attention and a pseudo residual connection to model relational patterns. The triple-level self-attention treats the head entity, relation, and tail entity as a sequence and captures the dependency within a triple, while the pseudo residual connection retains primitive semantic features. Furthermore, to deal with symmetric and antisymmetric relations, two schemas of score function are designed via a position-adaptive mechanism. Experimental results on public datasets demonstrate that our model produces expressive knowledge embeddings and significantly outperforms most state-of-the-art works.


Introduction
Large-scale knowledge graphs such as DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008), and YAGO (Suchanek et al., 2007) have been shown to be useful in many applications, including natural language understanding (Wang et al., 2017), question answering (Mohammed et al., 2018), and recommender systems (Wang et al., 2019). Knowledge graphs (KGs) (Dong et al., 2014; Hogan et al., 2020) are multi-relational graphs containing factual information in the form of triples (h, r, t), where h represents the head entity, t represents the tail entity, and r represents the relationship between h and t. Although the symbolic form of triples effectively represents structured data, it makes it hard to calculate the semantic similarity between entities and relations. In this paper, we focus specifically on knowledge graph embedding, which maps entities and relations into a low-dimensional vector space while preserving the inherent structures of entities and relations.
There are diverse relations in KGs. Figure 1 shows symmetric and antisymmetric relational patterns in a KG. Barack Obama and Michelle Obama are in a marriage relation, and vice versa; therefore, marriage is a symmetric relation. In another case, Barack Obama Sr. is the father of Barack Obama, but Barack Obama is not the father of Barack Obama Sr.; thus father of is an antisymmetric relation. In addition, Figure 1 also shows different relational categories: one-to-one, one-to-many, many-to-one, and many-to-many. For example, both Barack Obama and Michelle Obama hold a J.D. (Juris Doctor) doctorate, which belongs to the many-to-one relational category. The alma maters of Barack Obama are Columbia University and Harvard University, which belongs to the one-to-many relational category.
Many existing models are effective at processing different relational categories. TransE (Bordes et al., 2013) interprets a relation as a translation operation on entities in a low-dimensional space, which straightforwardly captures the one-to-one relational category. However, the representations learned by TransE suffer from poor expressiveness on complex relational categories, such as one-to-many, many-to-one, and many-to-many. To address this issue, TransH (Wang et al., 2014), TransR (Lin et al., 2015), and TransD (Ji et al., 2015) extend TransE by projecting entities or relations into different vector spaces, which makes progress on complex relational categories and helps in handling symmetric and antisymmetric relation patterns. However, it remains challenging, since these models still have difficulty capturing richer semantic features due to their limited expressiveness.
It is crucial to preserve different relation patterns, especially symmetry and antisymmetry (Xu et al., 2020). Several efforts have been made to model and infer symmetric and antisymmetric patterns. DistMult (Yang et al., 2015) exploits a bilinear diagonal model via element-wise products to capture pairwise interactions between the head entity, relation, and tail entity. However, DistMult can only handle symmetric relations and fails to model antisymmetric ones. ComplEx (Trouillon et al., 2016) represents entities and relations in a complex space, so both symmetric and antisymmetric relations can be learned; however, it incurs high memory costs due to its need for a high-dimensional space.
Moreover, neural network architectures have been proposed to learn deep expressive features of triples. ConvE (Dettmers et al., 2018) and ConvKB (Nguyen et al., 2018) employ convolutional neural networks for KG embedding. CapsE (Nguyen et al., 2019) applies a capsule neural network (Sabour et al., 2017) to model the entries at the same dimension in the entity and relation embeddings. Although deep neural networks can capture more expressive features, they are computationally expensive, as they usually require pre-trained KG embeddings as input.
To address the above issues, this paper proposes a novel knowledge graph embedding model named AprilE (Attention with pseudo residual connection Embedding) to model relational patterns, especially symmetric and antisymmetric relations. AprilE consists of a triple-level self-attention and a pseudo residual connection. We employ the triple-level self-attention to capture the dependency within a triple by treating the factual triple as a sequence. To retain low-level semantic features, we propose a pseudo residual connection, which assigns a new embedding vector to each element within the triple as a pseudo identity of the original embedding and connects it to the output of the self-attention. Both embeddings together construct the final representation of each element. To overcome the limitation of dealing only with symmetric relations, we propose another schema to process antisymmetric relations, in which the role of each embedding can be switched through a position-adaptive mechanism.
Namely, if an entity is the head of a triple, the second half of its representation is used in the self-attention layer and the first half is treated as the pseudo identity; if the entity is the tail of a triple, the first half of its representation becomes the input of the self-attention layer and the second half serves as the pseudo identity. In this way, the triple-level self-attention is sensitive to position, so the model can preserve both symmetric and antisymmetric relational patterns. Moreover, two schemas of score function are designed to facilitate the position-adaptive mechanism. We conduct extensive experiments on real-world public datasets. The link prediction results on FB15k, WN18, FB15k-237, and WN18RR show that AprilE outperforms most state-of-the-art KG embedding models. Furthermore, the per-relation experimental results on the WN18 dataset indicate that AprilE achieves significant improvements over baselines on symmetric and antisymmetric relations. Moreover, further experiments on FB15k by relational category show that AprilE is effective at handling different relational categories.
In summary, the main contributions of this paper are three-fold: (1) To learn sufficient semantic features and capture the interdependency of h, r, and t within a triple, we propose a novel model, AprilE, that employs well-designed triple-level self-attention and pseudo residual connection mechanisms. (2) Two schemas of score function are designed based on combinations of different embedding partitions to deal with symmetric and antisymmetric relations. (3) Extensive experiments on public datasets demonstrate that AprilE can effectively process not only symmetric/antisymmetric relational patterns but also different relational categories, and that it significantly outperforms previous state-of-the-art works.

Preliminary
Throughout this paper, given a triple (h, r, t), the lower-case letters h, r, t represent head entities, relations, and tail entities, respectively. The corresponding boldface lower-case letters h, r, t ∈ R^{2d}, where d is the embedding dimension, are the embeddings of the head, relation, and tail, respectively. We employ E, R, S, and S' to denote the sets of all head and tail entities, all relations, valid triples, and invalid triples, respectively.
There are different relational forms in KGs. Relational categories refer to the mapping properties of relations, including one-to-one, one-to-many, many-to-one, and many-to-many relations (Bordes et al., 2013). Relational patterns include symmetric, antisymmetric, inversion, and composition relations (Sun et al., 2019). This paper mainly focuses on symmetric and antisymmetric patterns, which are defined as follows:
Definition 1 (Symmetric pattern). Given a triple (h, r, t), if ∀h, t, r(h, t) ⇒ r(t, h) holds, then the relation r is symmetric and the triple (h, r, t) is a symmetric pattern.
Definition 2 (Antisymmetric pattern). Given a triple (h, r, t), if ∀h, t, r(h, t) ⇒ ¬r(t, h) holds, then the relation r is antisymmetric and the triple (h, r, t) is an antisymmetric pattern.
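As a concrete illustration, the two definitions can be checked mechanically against a toy set of triples. The entity and relation names below are illustrative, following the Figure 1 example; the helper functions are ours:

```python
# Toy triples following the Figure 1 example (names are illustrative).
triples = {
    ("BarackObama", "marriage", "MichelleObama"),
    ("MichelleObama", "marriage", "BarackObama"),
    ("BarackObamaSr", "father_of", "BarackObama"),
}

def looks_symmetric(r, triples):
    """r is consistent with symmetry iff every (h, r, t) has its reverse (t, r, h)."""
    return all((t, r, h) in triples for h, rel, t in triples if rel == r)

def looks_antisymmetric(r, triples):
    """r is consistent with antisymmetry iff no (h, r, t) co-occurs with its reverse."""
    return all((t, r, h) not in triples for h, rel, t in triples if rel == r and h != t)
```

On this toy set, marriage passes the symmetry check and father_of passes the antisymmetry check, matching the two definitions.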

Overview of AprilE
The principle of AprilE is shown in Figure 2. To model both symmetric and antisymmetric relations, AprilE consists of triple-level self-attention and a pseudo residual connection. First, instead of using the whole embedding for triple-level self-attention and the residual connection, AprilE divides the embeddings of h, r, and t into two equal-size partitions, Ft ∈ R^d and Sd ∈ R^d. Second, AprilE applies the self-attention mechanism to capture the dependency within a triple. Then the pseudo residual connection is used to retain the original information. Finally, AprilE introduces a new translation-based score function and a rank-based hinge loss function for model training.
We leverage the positional information of sub-embeddings to keep the description concise and clear; for example, Ft(r) and Sd(r) represent the first and second partitions of r, respectively. By combining different embedding partitions, it is easy to design different schemas to model relational patterns. Two schemas, the symmetric schema and the antisymmetric schema, are proposed in this paper to deal with symmetric and antisymmetric relational patterns, respectively.
Symmetric schema The symmetric schema combines the second partitions of the head, relation, and tail embeddings for triple-level self-attention. The first partitions serve as the pseudo identities for the pseudo residual connection and are connected to the attention output. Therefore, the learned embeddings encode both high-level and low-level semantic features.
Antisymmetric schema Different from the symmetric schema, the second partitions of the head and relation embeddings and the first partition of the tail embedding are selected for triple-level self-attention, and the remaining embedding partitions act as pseudo identities for the pseudo residual connection. As a result, the learned embedding expresses distinct features according to the position of the entity; therefore, the model is able to process antisymmetric relational patterns.
Note that the antisymmetric schema can also preserve symmetric relational patterns without increasing computational complexity. Hence we set the antisymmetric schema as the default schema of AprilE.
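A minimal sketch of the partition selection in the two schemas, using NumPy; representing each embedding as a flat 2d-dimensional vector and the function names are our own illustrative choices:

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=2 * d) for _ in range(3))  # embeddings in R^{2d}

def Ft(x):
    """First partition of an embedding."""
    return x[:d]

def Sd(x):
    """Second partition of an embedding."""
    return x[d:]

def select(h, r, t, schema="antisymmetric"):
    """Pick (attention inputs, pseudo identities) according to the schema."""
    if schema == "symmetric":
        return (Sd(h), Sd(r), Sd(t)), (Ft(h), Ft(r), Ft(t))
    # Antisymmetric (default): the tail contributes its first partition
    # to attention, so head and tail roles differ by position.
    return (Sd(h), Sd(r), Ft(t)), (Ft(h), Ft(r), Sd(t))
```

The only difference between the two schemas is which half of the tail embedding enters the attention layer, which is exactly what makes the antisymmetric schema position-sensitive.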

Triple-level self-attention
We apply the self-attention mechanism (Vaswani et al., 2017) to capture the dependency within a triple. We first expand the embedding partitions to two dimensions: Sd(h) ∈ R^{1×d}, Sd(r) ∈ R^{1×d}, and Ft(t) ∈ R^{1×d}. Then we concatenate the matrices along the first axis to form a 3 × d matrix, which we call the union representation:

c = [Sd(h); Sd(r); Ft(t)],

where [; ] stands for the concatenation operation. Inspired by Vaswani et al. (2017), we use c as the input of the three components of self-attention: query, key, and value. We first project c via a non-linear fully-connected layer to produce the query q and the key k, respectively:

q = ReLU(cW_q + b_q),
k = ReLU(cW_k + b_k),

where ReLU(·) is an activation function, W_q, W_k ∈ R^{d×d} are the weights, and b_q, b_k ∈ R^d are the bias terms of the query and key, respectively. Next, we calculate the element-wise multiplication between the query and the key representation to produce the matched product s:

s = q ⊙ k,

where ⊙ stands for element-wise multiplication. After the matching phase, we normalize s via softmax to get the attention weight α:

α = softmax(s).

Then we apply the attention weight to the union representation c:

a = α ⊙ c.

In order to calculate the pseudo residual connection of each element, we unpack the weighted union representation a into three parts to get the latent head, relation, and tail representations:

[Sd(h)•; Sd(r)•; Ft(t)•] = a.

We can see that the triple-level self-attention treats the head, relation, and tail as a whole and seizes their intrinsic dependency to adjust the weight of each element dynamically, which makes the learned representation more flexible and expressive.
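The attention computation described above can be sketched in NumPy as follows. The softmax axis (row-wise over s) is our assumption, since the text does not specify it, and the random initialization stands in for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 4
rng = np.random.default_rng(0)
# Partitions selected by the antisymmetric schema, each a 1 x d row.
Sd_h, Sd_r, Ft_t = (rng.normal(size=(1, d)) for _ in range(3))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_q, b_k = np.zeros(d), np.zeros(d)

c = np.concatenate([Sd_h, Sd_r, Ft_t], axis=0)  # union representation, 3 x d
q = np.maximum(c @ W_q + b_q, 0.0)              # query (ReLU fully-connected layer)
k = np.maximum(c @ W_k + b_k, 0.0)              # key
s = q * k                                       # element-wise matched product
alpha = softmax(s, axis=-1)                     # attention weights (assumed row-wise)
a = alpha * c                                   # weighted union representation
Sd_h_att, Sd_r_att, Ft_t_att = a                # unpack latent head/relation/tail
```

Each row of a is the attention-weighted version of the corresponding input partition, ready for the pseudo residual connection.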

Pseudo residual connection
Residual connections (He et al., 2016) create shortcuts that combine shallow layers and deep layers, which can smooth convergence as networks get deeper. Inspired by the residual connection, in our setting we connect the attention outputs Sd(h)•, Sd(r)•, Ft(t)• with their pseudo identities Ft(h), Ft(r), Sd(t), respectively, as follows:

ĥ = [Ft(h); Sd(h)•],
r̂ = [Ft(r); Sd(r)•],
t̂ = [Ft(t)•; Sd(t)].

Apart from creating shortcuts between layers, the pseudo residual connection also keeps the original information in the network. The pseudo identity and the attention output come from the same embedding and are exclusive of each other. Therefore, AprilE can learn an attention-transformed representation and simultaneously retain the primitive embedding representation as much as possible. Integrating the transformed representation and the primitive embedding representation helps AprilE balance low-level and high-level semantic features and learn a sufficient representation.
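Assuming "connect" means concatenation (so each final representation is again 2d-dimensional, matching the original embedding size), the residual step can be sketched as follows; the variable names are ours and the random values stand in for the actual partitions:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
# Pseudo identities: the untouched halves of the original embeddings.
Ft_h, Ft_r, Sd_t = (rng.normal(size=d) for _ in range(3))
# Attention outputs for the complementary halves (from the attention layer).
Sd_h_att, Sd_r_att, Ft_t_att = (rng.normal(size=d) for _ in range(3))

# Reassemble each element so that each half keeps its original position.
h_hat = np.concatenate([Ft_h, Sd_h_att])    # head:     identity | attended
r_hat = np.concatenate([Ft_r, Sd_r_att])    # relation: identity | attended
t_hat = np.concatenate([Ft_t_att, Sd_t])    # tail:     attended | identity
```

Because the identity half and the attended half occupy disjoint positions, the original embedding information is preserved exactly alongside the transformed features.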

Score function and loss function
Similar to TransE (Bordes et al., 2013), we adopt a translation-based score function. For the symmetric schema, the score function is defined as follows:

f_sym(h, r, t) = || [Ft(h); Sd(h)•] + [Ft(r); Sd(r)•] − [Ft(t); Sd(t)•] ||_{L1/L2},

where L1/L2 stands for the L1 or L2 norm. This score function cannot deal with antisymmetric relations, because the triple-level self-attention is insensitive to position: switching the positions of the head and tail entities does not change their respective representations. For the antisymmetric schema, the score function is defined as follows:

f_anti(h, r, t) = || [Ft(h); Sd(h)•] + [Ft(r); Sd(r)•] − [Ft(t)•; Sd(t)] ||_{L1/L2}.

For antisymmetric relations, the embedding partitions chosen for triple-level self-attention and the pseudo residual connection are exchanged when the positions of the head and tail entities change, so the head or tail entity has a different representation in different positions. The score is expected to be lower for a valid triple and higher for an invalid one. To achieve this, we adopt a rank-based hinge loss, which maximizes the discriminative margin between a valid triple (h, r, t) and an invalid triple (h', r', t'):

L = Σ_{(h,r,t)∈S} Σ_{(h',r',t')∈S'_{(h,r,t)}} max(0, γ + f(h, r, t) − f(h', r', t')),

where γ is the margin and S'_{(h,r,t)} stands for the set of invalid triples generated by randomly replacing the head entity, the tail entity, or both, in the KG.
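A small sketch of the score and loss computation on the final 2d-dimensional representations; the helper names are ours:

```python
import numpy as np

def score(h_hat, r_hat, t_hat, p=1):
    """Translation-based score ||h + r - t|| under the L1 (p=1) or L2 (p=2) norm;
    lower means more plausible."""
    return np.linalg.norm(h_hat + r_hat - t_hat, ord=p)

def hinge_loss(valid, corrupted, gamma=1.0):
    """Rank-based hinge loss over pairs of valid and corrupted triple representations."""
    total = 0.0
    for (h, r, t), (hc, rc, tc) in zip(valid, corrupted):
        total += max(0.0, gamma + score(h, r, t) - score(hc, rc, tc))
    return total
```

For example, a valid triple with h + r = t scores 0, so any corruption whose score exceeds the margin γ contributes nothing to the loss.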
Baselines We compare AprilE with several state-of-the-art models, including the translation-based models TransE (Bordes et al., 2013) and TransH (Wang et al., 2014), the convolution-based model ConvE (Dettmers et al., 2018), the bilinear embedding models DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), and RotatE (Sun et al., 2019), and the graph convolution-based model R-GCN+ (Schlichtkrull et al., 2018).

Evaluation task and metrics We evaluate the performance of our method on the task of link prediction. Following Bordes et al. (2013), for each valid test triple (h, r, t), we replace either h or t with each of all other entities to create a set of corrupted triples. We adopt the MR (mean rank) and Hits@10 (the proportion of valid test triples ranked in the top 10 predictions) metrics. All metrics are reported in the filtered setting (Bordes et al., 2013), i.e., all triples appearing in the graph are removed from the set of corruptions.
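The filtered evaluation protocol can be sketched as follows, assuming lower scores are better (matching a translation-based score); the function names are ours:

```python
import numpy as np

def filtered_rank(scores, true_idx, known_valid):
    """Rank of the true entity among candidates, after removing all other
    known-valid candidates (the filtered setting of Bordes et al., 2013)."""
    true_score = scores[true_idx]
    keep = np.ones(len(scores), dtype=bool)
    for i in known_valid:
        if i != true_idx:
            keep[i] = False  # filter out other triples known to be valid
    # Lower score = better, so the rank is 1 + number of candidates scoring lower.
    return 1 + int((scores[keep] < true_score).sum())

def mr_and_hits10(ranks):
    """Mean rank and Hits@10 over a list of filtered ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return float(ranks.mean()), float((ranks <= 10).mean())
```

Filtering matters because a corruption that happens to be a valid triple elsewhere in the graph would otherwise unfairly penalize the model.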

Experimental results of link prediction
Link prediction aims to predict the missing h or t for a relational fact triple (h, r, t), which is a valuable task for evaluating the performance of knowledge graph embedding. The experimental results on four datasets are reported in Table 2. For ablation purposes, we report the results of three variants of our model: AprilE is the model in the antisymmetric (default) schema, AprilE(sym.) is the model in the symmetric schema, and AprilE(w/o PR.) is a variant without the pseudo residual connection. We observe that AprilE outperforms AprilE(sym.) on all datasets, which highlights the importance of modelling and inferring more relational patterns.
It is noticeable that AprilE performs competitively compared to the state-of-the-art baselines. AprilE achieves the best results on FB15k and its subset FB15k-237 across all metrics. On WN18, AprilE outperforms all baselines on MR, while R-GCN+ achieves the best result on Hits@10. On WN18RR, which contains a number of symmetric relations, AprilE achieves the best result on MR while ConvE does not work very well; the reason is that ConvE cannot model symmetric patterns. We also notice that RotatE achieves the best result on Hits@10 on WN18RR, which indicates that complex-space-based embedding models are also powerful at handling symmetric patterns. To compare the performance of different models on symmetric and antisymmetric relations, we also report the per-relation results on WN18 in Table 3. WN18 describes lexical and semantic hierarchies between concepts; it contains antisymmetric relations such as hypernymy, hyponymy, and part of (Trouillon et al., 2016), and some symmetric relations such as similar to. The results indicate that all baseline models except TransE deal with symmetric relations well, especially the similar to relation, on which four models achieve 100% Hits@10. Although TransE can deal with antisymmetric relations, it fails to achieve reasonable performance due to the poor expressiveness of its representations. DistMult can only process symmetric relations, so for antisymmetric relations there is still a gap compared with the other models. ComplEx, RotatE, and AprilE can deal with symmetric and antisymmetric relations simultaneously, and AprilE achieves 0.4% to 1.7% higher average performance than ComplEx and RotatE on antisymmetric relations, since AprilE learns richer semantic expressions and captures the dependency within a triple.

Experimental results on FB15k by relational category
Compared to the translation-based models, AprilE significantly outperforms TransE and TransH on the FB15k and WN18 datasets. Considering the fact that TransE and TransH are able to process different relational categories, we assume that AprilE can deal with such complex relational categories as well. To verify this assumption, we subsequently conduct a further experiment to investigate the performance of AprilE on different relational categories: one-to-one, one-to-many, many-to-one, and many-to-many relations.
The dataset of relational categories is built following the approach of Wang et al. (2014). The results are summarized in Table 4. Compared with the baseline models, AprilE outperforms all baselines in head prediction and achieves the best results on one-to-many and many-to-one relations in tail prediction. The results illustrate that models producing expressive representations are capable of handling different relational categories. It is also remarkable that ComplEx and RotatE (Trouillon et al., 2016; Sun et al., 2019) achieve the best results on one-to-one and many-to-many relations in tail prediction, which shows the importance of complex-space-based embedding. We leave extending AprilE to complex-space-based embedding as future work.

Effectiveness of triple-level self-attention
In this paper, we propose triple-level self-attention, which takes the dependency within a triple into account to produce expressive knowledge graph embeddings. Triple-level self-attention is the key component of AprilE: without it, AprilE degenerates into TransE. Like TransE, we adopt a translation-based score function. Many translation-based models extend TransE via entity or relation projections; instead, we adopt a novel approach, triple-level self-attention, to improve TransE. Extensive experimental results show that triple-level self-attention enables AprilE to produce more expressive representations, which makes AprilE capable of handling different relational patterns and different relational categories.

Effectiveness of pseudo residual connection
To explore the effect of the pseudo residual connection, we conduct ablation experiments comparing AprilE with and without it. The contribution of the pseudo residual connection is distinctly identified, as AprilE achieves better results than the variant AprilE(w/o PR.) without the pseudo residual connection, as shown in Table 2. We conclude that the pseudo residual connection is important because it not only retains useful low-level semantic features but also enables AprilE to deal with symmetric and antisymmetric relations through the design of different score functions. Besides, the pseudo residual connection connects the pseudo identity (low-level semantics) and the corresponding attention output (high-level semantics), which helps trade off different levels of semantic features so as to produce better knowledge graph embeddings.

Related work
Knowledge graph embedding is a critical research issue for KGs, and a variety of embedding methods have been proposed. Table 5 summarizes some previous state-of-the-art models, as well as the AprilE model proposed in this paper, in terms of scoring function, entity and relation representations, and the ability to model the symmetry and antisymmetry patterns.
Translation-based models all follow the translation principle h + r ≈ t. TransE (Bordes et al., 2013) represents relations as translations of the entities from head to tail. Despite its simplicity and efficiency, TransE struggles when dealing with complex one-to-many, many-to-one, and many-to-many relationships (Wang et al., 2014). To overcome the disadvantages of TransE, TransH (Wang et al., 2014), TransR (Lin et al., 2015), and TransD (Ji et al., 2015) introduce projection vectors or matrices to map entity embeddings into different relation vector spaces.
In Table 5, · denotes the generalized dot product, σ denotes an activation function, * denotes 2D convolution, • of RotatE stands for the element-wise product, Sym. stands for Symmetric, and Antisym. stands for Antisymmetric.
Bilinear embedding methods utilize product-based scoring functions to match the latent semantics of entities and relations contained in the vector space representations. RESCAL (Nickel et al., 2011) captures the interaction between the head entity and the tail entity through a relation-specific matrix. DistMult (Yang et al., 2015) simplifies RESCAL for multi-relational representation learning by restricting the relation-specific matrix to a diagonal matrix. Although the number of parameters is greatly reduced, it cannot handle the antisymmetric relations in general KGs. To model both symmetric and antisymmetric relations, ComplEx (Trouillon et al., 2016) first introduces a complex space to model triples. RotatE (Sun et al., 2019) models each relation as a rotation from the head entity to the tail entity in complex space.
Recently, CNN-based models have been proposed to learn deep expressive features. ConvE (Dettmers et al., 2018) uses 2D convolution over embeddings and multiple layers of nonlinear features to model the interactions between entities and relations, in which the head entity and relation embeddings are reshaped into a 2D matrix. ConvKB (Nguyen et al., 2018) adopts a CNN to explore the global relationships among same-dimensional entries of the entity and relation embeddings, generalizing the transitional characteristics of translation-based models.
Our proposed model AprilE belongs to the translational embedding methods. More specifically, AprilE employs triple-level self-attention and a pseudo residual connection, which aim to model relational patterns including symmetric and antisymmetric relations, while it can also model different relational categories including one-to-one, one-to-many, many-to-one, and many-to-many.

Conclusion and future work
In this paper, we have proposed a novel model, AprilE, for knowledge graph embedding. The well-designed triple-level self-attention and pseudo residual connection enable AprilE to model symmetric and antisymmetric relations effectively, while it can also deal with the complex one-to-many, many-to-one, and many-to-many relational categories. Moreover, extensive experiments on public benchmark datasets show that AprilE outperforms most state-of-the-art baselines. In the future, we plan to extend our method to complex-space embeddings and to handle more relational patterns, thereby providing deeper insight into the proposed model.