ReInceptionE: Relation-Aware Inception Network with Joint Local-Global Structural Information for Knowledge Graph Embedding

The goal of knowledge graph embedding (KGE) is to learn low-dimensional vector representations for entities and relations based on observed triples. Conventional shallow models are limited in their expressiveness. ConvE (Dettmers et al., 2018) takes advantage of CNNs and improves expressive power with parameter-efficient operators by increasing the interactions between head and relation embeddings. However, there is no structural information in the embedding space of ConvE, and its performance is still limited by the number of interactions. The recent KBGAT (Nathani et al., 2019) provides another way to learn embeddings by adaptively exploiting structural information. In this paper, we combine the benefits of ConvE and KBGAT and propose a Relation-aware Inception network with joint local-global structural information for knowledge graph Embedding (ReInceptionE). Specifically, we first explore the Inception network to learn query embeddings, which aims to further increase the interactions between head and relation embeddings. Then, we propose a relation-aware attention mechanism to enrich the query embedding with local neighborhood and global entity information. Experimental results on both WN18RR and FB15k-237 demonstrate that ReInceptionE achieves competitive performance compared with state-of-the-art methods.


Introduction
Knowledge graphs (KGs) are at the core of most state-of-the-art natural language processing solutions and have been spotlighted in many real-world applications, including question answering (Hao et al., 2017), dialogue generation (Madotto et al., 2018) and machine reading comprehension (Yang and Mitchell, 2017). Typically, KGs are directed graphs whose nodes denote entities and whose edges represent the different relations between entities. The structured knowledge in KGs is organized in the form of triples (h, r, t), where h and t stand for the head and tail entities respectively, and r represents the relation from h to t. Although large-scale KGs (e.g., Freebase (Bollacker et al., 2008), DBpedia (Lehmann et al., 2015)) already contain millions or even billions of triples, they are still far from complete since new knowledge constantly emerges. Knowledge graph embedding (KGE) is an effective solution to this incompleteness problem.
KGE aims to learn low-dimensional vectors (embeddings) for entities and relations based on the observed triples in KGs. Conventional models include TransE (Bordes et al., 2013) and its numerous extensions (e.g., TransD (Ji et al., 2015), TransR (Lin et al., 2015), DistMult, ComplEx (Trouillon et al., 2016), etc.). These shallow models are limited in their expressiveness (Dettmers et al., 2018). Recently, CNN-based methods have been proposed to capture expressive features with parameter-efficient operators. ConvE (Dettmers et al., 2018) takes advantage of CNNs and applies convolution filters on 2D reshapings of the head entity and relation embeddings. Through this, ConvE can increase the interactions between head and relation embeddings. Empirical results have shown that increasing the number of interactions is beneficial to the KGE task, but ConvE is still limited by the number of interactions (Vashishth et al., 2020).
Furthermore, ConvE does not consider structural information. In contrast, graph-based methods are effective at aggregating neighborhood information to enrich entity/relation representations (Schlichtkrull et al., 2018; Bansal et al., 2019; Nathani et al., 2019). Among them, KBGAT (Nathani et al., 2019) achieves state-of-the-art performance on various benchmark datasets by using graph attention networks (GAT) (Velickovic et al., 2018). KBGAT learns an embedding for every entity by taking all possible relations into account, which requires multiple hops of reasoning. In contrast, it can be beneficial to learn embeddings from a query-relevant subgraph of the local neighborhood and global entities. As the example in Figure 1 shows, given the query (Jack London, nationality, ?), we can gather the relation-aware local neighbor (place lived, Oakland). This neighbor allows us to project Jack London into the Oakland region of the embedding space, which can lead to a high score for predicting the target America, since Oakland and America are close in embedding space. Besides, we also note that a specific relation can act as a "bridge" linking related entities. Considering the relation nationality, the related head entities { Kaneto Shiozawa, Shammi Kapoor, Will Smith, · · · } and tail entities { America, Canada, Japan, · · · } tend to be a set of person names and countries, respectively. These related entities act as a strong signal for judging whether a triple is valid or not.
Based on the above observations, we combine the benefits of ConvE and KBGAT and propose a Relation-aware Inception network with joint local-global structural information for knowledge graph Embedding, which we name ReInceptionE. In ReInceptionE, we first adapt the Inception network (Szegedy et al., 2015, 2016), a high-performing convolutional neural network with carefully designed filters, to increase the interactions using multiple convolution filters with different scales while remaining parameter-efficient. Then, we construct a local neighborhood graph and a global entity graph for a given query by sharing the head and relation, respectively. With the constructed graphs, we apply a relation-aware attention mechanism to aggregate local neighborhood features and gather global entity information to enrich the head/relation representation. Finally, we aggregate the joint local-global structural information using a fully connected layer to predict the missing links.
In summary, we make the following three contributions: (1) To the best of our knowledge, we are the first to explore the Inception network to learn query embeddings, with the aim of further increasing the interactions between head and relation embeddings; (2) We propose a relation-aware attention mechanism to enrich the query embedding with local neighborhood and global entity information; (3) We conduct a series of experiments to evaluate the performance of the proposed method. Experimental results demonstrate that our method obtains competitive performance in comparison with state-of-the-art models on both WN18RR and FB15k-237.
The rest of this paper is structured as follows.
Section 2 describes our proposed method for KGE. Section 3 presents the experimental results. Section 4 concludes the paper.

Our Approach
In this section, we first describe the background and definitions in Subsection 2.1 and the Inception-based query encoder in Subsection 2.2. Then, we introduce relation-aware local attention and global attention in Subsections 2.3 and 2.4, respectively. Finally, we describe their joint use in Subsection 2.5.

Background and Definition
Definition 2.1 Knowledge Graph G: A knowledge graph G = {(h, r, t) | (h, r, t) ∈ E × R × E} denotes a collection of triples, where E and R indicate the sets of entities and relations, respectively, h, t ∈ E represent the head and tail entities, and r ∈ R denotes the specific relation linking the head entity h to the tail entity t.
Definition 2.2 Knowledge Graph Embedding: Knowledge graph embedding aims to learn embeddings of entities and relations from the valid triples in G, and then predict the missing head entity h for a query (?, r, t) or the missing tail entity t for a query (h, r, ?) using the learned entity and relation embeddings.
The framework of the proposed ReInceptionE is shown in Figure 1 (right). ReInceptionE consists of four modules: (1) the Inception-based query encoder (InceptionE), which transforms the input query $q = (h, r, ?)$ into a k-dimensional vector $v_q$; (2) relation-aware local attention and (3) relation-aware global attention, which capture the local neighborhood information and the global entity information, respectively; and (4) joint relation-aware attention, which aggregates the different structural information using a fully connected layer. Finally, we compute the score for a given triple (h, r, t) based on the query embedding and the tail entity embedding.

Inception-Based Query Encoder
ConvE (Dettmers et al., 2018) is the first model to apply a CNN to KGE; it uses 2D convolution operations to model the head and relation in a query. However, ConvE is limited by the number of interactions between the head and relation embeddings (Vashishth et al., 2020). In this paper, we propose to employ the Inception network (Szegedy et al., 2015, 2016), a high-performing convolutional neural network with carefully designed filters, to increase the interactions by taking the head and relation as two channels of the input. Figure 2 shows the differences between InceptionE (right) and ConvE (left). ConvE cannot capture full interactions between the head and relation embeddings since its convolution operations slide over the entity or relation 2D matrices independently. In contrast, InceptionE can increase the interactions between the head and relation embeddings using multiple convolution filters with different scales, while remaining parameter-efficient.
As shown in Figure 2, given a query $q = (h, r, ?)$, we first reshape the head and relation embeddings as 2D matrices denoted as $v_h$ and $v_r$. The 2D embeddings are then viewed as two channels of the input for the Inception network. Thus, the entries at the same dimension of $v_h$ and $v_r$ are aligned over the channel dimension, which enables the convolution operations to increase the interactions between the head and relation embeddings. Specifically, we first use $1 \times 1$ convolutions to capture the direct interactions at the same dimension, which can be formulated as:

$$v_{1\times 1} = \mathrm{Relu}([v_h \,\|\, v_r] * \omega_{1\times 1}) \tag{1}$$

where Relu (Glorot et al., 2011) is a non-linear activation function, $\|$ denotes the concatenation operation, $*$ denotes the convolution operation, $\omega_{1\times 1}$ is the parameter of the convolution filters of size $1 \times 1$, and $v_{1\times 1}$ denotes the interaction features of the first $1 \times 1$ convolutional layer. Then, filters with different sizes, such as $2 \times 2$ and $3 \times 3$, are applied to capture higher-level interaction features at various scales. Thus, we obtain the interaction features of the $2 \times 2$ and $3 \times 3$ convolutional layers, denoted by $v_{2\times 2}$ and $v_{3\times 3}$, respectively.
As suggested in (Szegedy et al., 2016), we use two $3 \times 3$ convolutions instead of a single $5 \times 5$ convolution to capture interaction features with larger spatial filters, which reduces the number of parameters. The two $3 \times 3$ convolutions are denoted as:

$$v_{5\times 5} = \mathrm{Relu}(\mathrm{Relu}(v_{1\times 1} * \omega^{1}_{3\times 3}) * \omega^{2}_{3\times 3}) \tag{2}$$

where $v_{1\times 1}$ is the input interaction features, and $\omega^{1}_{3\times 3}$ and $\omega^{2}_{3\times 3}$ are the parameters of the two $3 \times 3$ convolution layers.
Finally, the output interaction features with different scales and levels are concatenated, and a fully connected layer is applied to obtain the embedding of the given query. Formally, we define the Inception-based query encoder as:

$$v_q = \mathrm{Inception}(v_h, v_r) = \mathrm{Relu}\big(\mathrm{vec}([v_{1\times 1} \,\|\, v_{2\times 2} \,\|\, v_{3\times 3} \,\|\, v_{5\times 5}])\, W\big) \tag{3}$$

where $\mathrm{vec}(\cdot)$ denotes vectorization and $W$ is the parameter of the fully connected layer.
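To make the encoder concrete, the following PyTorch sketch gives one plausible implementation of Equations (1)-(3) under stated assumptions: the 10 × 10 reshaping of 100-dimensional embeddings and the 32 filters per branch follow our experimental setup, but the padding choices, the decision to feed the larger filters the 1 × 1 output, and all class/variable names are our own, not the authors' released code.

```python
import torch
import torch.nn as nn

class InceptionE(nn.Module):
    """Sketch of the Inception-based query encoder (Eqs. 1-3): head and
    relation embeddings are reshaped to 2D, stacked as two input channels,
    and passed through parallel multi-scale convolution branches."""

    def __init__(self, emb_dim=100, shape=(10, 10), n_filters=32):
        super().__init__()
        self.shape = shape
        # 1x1 convolution captures direct interactions at the same dimension (Eq. 1).
        self.conv1x1 = nn.Conv2d(2, n_filters, kernel_size=1)
        # Larger filters capture higher-level interactions at various scales;
        # feeding them the 1x1 output follows standard Inception practice (assumption).
        self.conv2x2 = nn.Conv2d(n_filters, n_filters, kernel_size=2, padding=1)
        self.conv3x3 = nn.Conv2d(n_filters, n_filters, kernel_size=3, padding=1)
        # Two stacked 3x3 convolutions emulate a 5x5 receptive field (Eq. 2).
        self.conv5x5_a = nn.Conv2d(n_filters, n_filters, kernel_size=3, padding=1)
        self.conv5x5_b = nn.Conv2d(n_filters, n_filters, kernel_size=3, padding=1)
        # Fully connected projection back to the embedding dimension (Eq. 3);
        # LazyLinear infers the concatenated feature size on first use.
        self.fc = nn.LazyLinear(emb_dim)

    def forward(self, v_h, v_r):
        b = v_h.size(0)
        # Stack the 2D-reshaped head and relation embeddings as two channels.
        x = torch.stack([v_h.view(b, *self.shape), v_r.view(b, *self.shape)], dim=1)
        f1 = torch.relu(self.conv1x1(x))
        f2 = torch.relu(self.conv2x2(f1))
        f3 = torch.relu(self.conv3x3(f1))
        f5 = torch.relu(self.conv5x5_b(torch.relu(self.conv5x5_a(f1))))
        # Concatenate multi-scale interaction features and project (Eq. 3).
        feats = torch.cat([f.flatten(1) for f in (f1, f2, f3, f5)], dim=1)
        return torch.relu(self.fc(feats))

# Example: encode a batch of 4 queries with 100-dimensional embeddings.
v_q = InceptionE()(torch.randn(4, 100), torch.randn(4, 100))  # shape [4, 100]
```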

Relation-Aware Local Attention
KBGAT learns an embedding for every entity by taking all possible relations into account, so the embedding learning is impaired by irrelevant neighbors. In contrast, it can be beneficial to learn embeddings from a query-relevant neighborhood graph. In this subsection, we first construct a relation-aware neighborhood graph and then apply an attention mechanism to aggregate the local graph structure information.
For the query $q = (h, r, ?)$, we denote its neighbors as $N_q = \{n_i = (e_i, r_i) \,|\, (h, r_i, e_i) \in G\}$. Note that for each triple $(h, r, t)$ we create an inverse triple $(t, r^{-1}, h)$, as also done in (Lacroix et al., 2018; Dettmers et al., 2018). Thus, a query $(?, r, t)$ can be converted to $(t, r^{-1}, ?)$, and the neighbors $\{(r_j, e_j) \,|\, (h, r_j, e_j) \in G\}$ of the head entity $h$ can be converted to the form $\{(e_j, r_j^{-1}) \,|\, (h, r_j, e_j) \in G\}$. Hence, $N_q$ contains both the outgoing and incoming neighbors of a query $q = (h, r, ?)$. Each neighbor $n_i = (e_i, r_i) \in N_q$ is itself a query with head entity $e_i$ and relation $r_i$, so each neighbor can be encoded using the Inception-based query encoder:

$$v_{n_i} = \mathrm{Inception}(v_{e_i}, v_{r_i}) \tag{4}$$

where $v_{e_i}$ and $v_{r_i}$ are the 2D embeddings of entity $e_i$ and relation $r_i$. In practice, different neighbors have different impacts for a given query, so it is useful to determine the importance of each neighbor for the specific query. For example, in Figure 1, for the query (Jack London, nationality, ?), it is reasonable to focus on the neighbors related to the relation nationality, such as (Jack London, place lived, Oakland). To this end, we use a relation-aware attention mechanism to assign a different importance to each neighbor, computing the relevance score of each neighbor with a non-linear activation layer:

$$s_{n_i} = \mathrm{LeakyRelu}(W_1 [W_2 v_{n_i} \,\|\, W_3 v_q]) \tag{5}$$

where $W_1$, $W_2$ and $W_3$ are parameters to be trained and LeakyRelu (Maas et al., 2013) is the activation function. We then normalize the relevance scores across the different neighbors using a softmax function to make them comparable:

$$\alpha_{n_i} = \frac{\exp(s_{n_i})}{\sum_{n_j \in N_q} \exp(s_{n_j})} \tag{6}$$

Finally, we aggregate the neighborhood information according to the attention scores and apply a non-linear function to obtain the neighborhood vector. To retain more information from the original query embedding, we also apply a residual operation:

$$v_n = \mathrm{Relu}\Big(\sum_{n_i \in N_q} \alpha_{n_i} v_{n_i} + v_q\Big) \tag{7}$$

For simplicity, we denote the above relation-aware attention operations as:

$$v_n = \mathrm{ReAtt}(V_n, v_q) \tag{8}$$

where $V_n = \{v_{n_i} \,|\, n_i \in N_q\}$ is the set of local neighborhood vectors.
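The relation-aware attention operator of Equations (5)-(8) can be sketched as follows; the batching scheme (padding neighbor lists to a fixed length with a boolean mask) and all names are our assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReAtt(nn.Module):
    """Sketch of relation-aware attention (Eqs. 5-8): score each neighbor
    vector against the query, softmax-normalize, aggregate, add a residual."""

    def __init__(self, dim=100):
        super().__init__()
        self.W1 = nn.Linear(2 * dim, 1, bias=False)  # scores the joint projection
        self.W2 = nn.Linear(dim, dim, bias=False)    # projects neighbor vectors
        self.W3 = nn.Linear(dim, dim, bias=False)    # projects the query vector

    def forward(self, V_n, v_q, mask=None):
        # V_n: [B, N, d] neighbor (or global-entity) vectors; v_q: [B, d] query.
        q = self.W3(v_q).unsqueeze(1).expand_as(V_n)            # [B, N, d]
        s = F.leaky_relu(self.W1(torch.cat([self.W2(V_n), q], dim=-1))).squeeze(-1)
        if mask is not None:                                    # ignore padded slots
            s = s.masked_fill(~mask, float('-inf'))
        alpha = torch.softmax(s, dim=-1)                        # Eq. 6
        v = (alpha.unsqueeze(-1) * V_n).sum(dim=1)              # weighted aggregation
        return torch.relu(v + v_q)                              # residual, Eq. 7
```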

Relation-Aware Global Attention
The number of relation-aware local neighbors varies greatly from entity to entity, making the neighborhood graph very sparse, and this sparsity affects the accuracy of the embeddings. However, a specific relation can act as a "bridge" linking related entities. In this subsection, we construct a relation-aware head graph and tail graph by gathering all entities linked by the relation r in the given query q = (h, r, ?). Intuitively, all head entities of relation r share some common type information, and the tail entities of relation r contain implicit information about the type of the target entity t. For example, in Figure 1, given the relation nationality, all heads { Kaneto Shiozawa, Shammi Kapoor, Will Smith, · · · } and tails { America, Canada, Japan, · · · } are person names and country names, respectively, sharing similar entity types. These relation-aware global heads and tails can provide useful information for the KGE task. Thus, we construct relation-aware global head and tail graphs from the head and tail entities of the relation. Let $H_r = \{e_i \,|\, (e_i, r, e_j) \in G\}$ and $T_r = \{e_j \,|\, (e_i, r, e_j) \in G\}$ denote the sets of head and tail entities for relation r, respectively. For each head entity $h_{r_i} \in H_r$, we first represent it as an embedding vector $v_{h_{r_i}}$. Then, we use the relation-aware attention mechanism to capture the relevant information from all the relation-aware head entities:

$$v_{rh} = \mathrm{ReAtt}(V_{rh}, v_q) \tag{9}$$

where $V_{rh} = \{v_{h_{r_i}} \,|\, h_{r_i} \in H_r\}$ is the set of entity vectors for the relation-aware global heads. Similarly, we use the relation-aware attention mechanism to capture the global tail information:

$$v_{rt} = \mathrm{ReAtt}(V_{rt}, v_q) \tag{10}$$

where $V_{rt} = \{v_{t_{r_i}} \,|\, t_{r_i} \in T_r\}$ is the set of entity embeddings for the relation-aware global tails.
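Since Equations (9) and (10) reuse the same operator, the ReAtt module sketched in the previous subsection applies unchanged to the global head and tail graphs. A minimal usage sketch follows; the random tensors stand in for real embeddings, and sharing one set of attention parameters across the three calls is our simplification, not something the paper states.

```python
import torch

B, d = 4, 100
v_q = torch.randn(B, d)                    # query embedding from InceptionE
V_n, V_rh, V_rt = (torch.randn(B, 8, d) for _ in range(3))

att = ReAtt(dim=d)                         # see the sketch in Subsection 2.3
v_n  = att(V_n,  v_q)                      # local neighborhood vector (Eq. 8)
v_rh = att(V_rh, v_q)                      # global head vector (Eq. 9)
v_rt = att(V_rt, v_q)                      # global tail vector (Eq. 10)
```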

Joint Relation-Aware Attention
Once the relation-aware local neighborhood vector $v_n$ and the global head and tail vectors $v_{rh}$ and $v_{rt}$ have been obtained, we concatenate these vectors with the query embedding and merge them using a linear feed-forward layer:

$$v_q' = \mathrm{Relu}(W_4 [v_q \,\|\, v_n \,\|\, v_{rh} \,\|\, v_{rt}] + b) \tag{11}$$

where $W_4$ and $b$ are the parameters of the feed-forward layer. Finally, we compute the score of each triple $(h, r, t)$ as the dot product of the query embedding $v_q'$ and the tail embedding $v_t$:

$$s(h, r, t) = v_q'^{\top} v_t \tag{12}$$

To optimize the parameters of our model, we compute the probability of the tail $t$ using a softmax function:

$$p(t \,|\, h, r) = \frac{\exp(\lambda\, s(h, r, t))}{\sum_{(h, r, t') \in \{(h, r, t)\} \cup G'} \exp(\lambda\, s(h, r, t'))} \tag{13}$$

where $\lambda$ is a smoothing parameter, and $G'$ is a set of invalid triples created by randomly replacing the tail $t$ with an invalid entity $t'$. We train the model by minimizing the following loss function:

$$L = -\frac{1}{|G|} \sum_{(h_i, r_i, t_i) \in G} \log p(t_i \,|\, h_i, r_i) \tag{14}$$

where $(h_i, r_i, t_i) \in G$ is a valid triple and $|G|$ is the number of valid triples in $G$.
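The joint merge, scoring and training objective of Equations (11)-(14) can be sketched as follows. For brevity this sketch scores against all entities and uses a full cross-entropy in place of the paper's sampled invalid triples; that simplification, and all names, are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointScorer(nn.Module):
    """Sketch of Eqs. 11-14: fuse local/global vectors into the final query
    embedding, score candidate tails by dot product, and train with a
    smoothed softmax loss."""

    def __init__(self, dim=100, lam=5.0):
        super().__init__()
        self.merge = nn.Linear(4 * dim, dim)   # W4 and b in Eq. 11
        self.lam = lam                         # smoothing parameter lambda

    def forward(self, v_q, v_n, v_rh, v_rt, tail_emb):
        # Eq. 11: merge the query with the joint local-global vectors.
        q = torch.relu(self.merge(torch.cat([v_q, v_n, v_rh, v_rt], dim=-1)))
        # Eq. 12: dot-product score against every candidate tail embedding.
        return q @ tail_emb.t()                # [B, num_entities]

    def loss(self, scores, target):
        # Eqs. 13-14: scaled softmax over candidates; cross-entropy over all
        # entities replaces the paper's sampled invalid triples (assumption).
        return F.cross_entropy(self.lam * scores, target)
```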

Experimental Setup
Datasets: We conduct KGE experiments on two widely used public benchmark datasets: WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova et al., 2015). WN18RR is a subset of WN18 (Bordes et al., 2013), while FB15k-237 is a subset of FB15k (Bordes et al., 2013). Since WN18 and FB15k contain a large number of inverse relations, triples in their test sets can be obtained simply by inverting triples in the training sets. To address this problem, WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova et al., 2015) were generated by removing the inverse relations from WN18 and FB15k. In recent years, WN18RR and FB15k-237 have become the most popular datasets for the KGE task. Table 1 shows the summary statistics of the datasets.

(Table 2 caption: the superscript a marks results reported in the original papers, the superscript b marks results taken from (Sun et al., 2020), and the remaining results are taken from the corresponding papers. Hits@1 is not reported since it correlates strongly with MRR and gives no new insight (Nguyen et al., 2019). The best results are in bold and the second-best results are underlined.)
Implementations: For a test triple (h, r, t), the purpose of the KGE task is to predict missing links, e.g., to predict the tail entity t given the head entity h and relation r, or to predict the head entity h given the tail entity t and relation r. To evaluate our method, three metrics are used: Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@10 (i.e., the proportion of correct entities ranked in the top 10 predictions). Note that lower MR, higher MRR and higher Hits@10 indicate better performance. We follow the "Filtered" setting protocol (Bordes et al., 2013) to evaluate our model, i.e., we rank all the entities excluding the set of other true entities that appear in the training, validation and test sets. We initialize the entity and relation embeddings in our ReInceptionE model using the pre-trained 100-dimensional embeddings used in (Nguyen et al., 2019). We use Adam (Kingma and Ba, 2015) to optimize the model. The hyper-parameters of our model are selected via grid search according to the MRR on the validation set. We select the dropout rate from {0.1, 0.2, 0.4, 0.5}, the learning rate from {0.001, 0.0005, 0.0002, 0.0001}, the L2 regularization coefficient from {1e-3, 1e-5, 1e-8}, the batch size from {32, 64, 128, 256, 512}, and the smoothing parameter λ in Equation (13) from {1, 5, 10}. Finally, the learning rate is set to 0.0002 for WN18RR and 0.0001 for FB15k-237, the L2 coefficient is set to 1e-5, the batch size is set to 256, the dropout rate is set to 0.4 for WN18RR and 0.2 for FB15k-237, and the smoothing parameter is set to λ = 5. The number of filters for each convolution operation in the Inception module is set to 32.
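As an illustration of the "Filtered" protocol and how the three metrics are aggregated, the following sketch ranks a single test query; the function names and data layout are illustrative, not taken from an official evaluation script.

```python
import numpy as np

def filtered_rank(scores, target, known_tails):
    """Sketch of the 'Filtered' protocol (Bordes et al., 2013): for one test
    query (h, r, ?), rank the gold tail after masking all *other* tails known
    to be true in the train/valid/test splits.

    scores: [num_entities] model scores, higher is better.
    target: index of the gold tail entity.
    known_tails: indices of all true tails for this (h, r) across all splits.
    """
    scores = scores.copy()
    others = [e for e in known_tails if e != target]
    scores[others] = -np.inf                   # exclude other true entities
    return 1 + int((scores > scores[target]).sum())

# Aggregating ranks collected over a test set (illustrative values):
ranks = np.array([1, 3, 12, 2])
print("MR:", ranks.mean(),                     # lower is better
      "MRR:", (1.0 / ranks).mean(),            # higher is better
      "Hits@10:", (ranks <= 10).mean())        # higher is better
```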

Main Results
We compare our results with various state-of-the-art methods. The experimental results are summarized in Table 2. For all KGE models, a key step is creating invalid triples to construct negative samples. Recently, Sun et al. (2020) investigated an inappropriate evaluation protocol used in ConvKB (Nguyen et al., 2018), CapsE (Nguyen et al., 2019) and KBGAT (Nathani et al., 2019). This issue stems from an unusual score distribution: the score function assigns some invalid triples exactly the same value as the valid triple. Sun et al. (2020) also found that KBGAT removed invalid triples that appeared in the test set during negative sampling, thus suffering from test-set leakage. Therefore, we take the results for ConvKB, CapsE and KBGAT (marked with the superscript b) from (Sun et al., 2020). Besides, we also list the results reported in the original papers (marked with the superscript a).
From Table 2, we can see that our proposed ReInceptionE obtains competitive results compared with the state-of-the-art methods. On WN18RR, ReInceptionE achieves the best results on Hits@10 and MRR, and the second-best result on MR. On FB15k-237, ReInceptionE obtains the second-best result on MR and comparable results on MRR and Hits@10.
Our proposed ReInceptionE is closely related to ConvE (Dettmers et al., 2018) and KBGAT (Nathani et al., 2019). Compared with ConvE, ReInceptionE achieves large performance gains on both datasets. The reason is that instead of simply concatenating the head and relation embeddings, ReInceptionE takes the head and relation as two channels of the input and applies the Inception network to capture their rich interactions, which makes it possible to learn expressive features using filters with various scales. Unlike KBGAT, ReInceptionE takes the (entity, relation) pair as a query and utilizes the relation-aware attention mechanism to gather the most relevant local neighbors and global entity information for the given query. The results again verify the effectiveness of relation-aware local and global information for KGE. Various other methods have been proposed for the KGE task, such as pLogicNet (Qu and Tang, 2019), RPJE (Niu et al., 2020), CoKE, TuckER (Balazevic et al., 2019a), D4-Gumbel (Xu and Li, 2019) and HAKE. pLogicNet (Qu and Tang, 2019) and RPJE (Niu et al., 2020) leverage logic rules to improve performance. CoKE uses the Transformer (Vaswani et al., 2017) to encode contextualized representations. HAKE embeds entities in a polar coordinate system to learn semantic hierarchies. D4-Gumbel (Xu and Li, 2019) uses the dihedral group to model relation composition. TuckER (Balazevic et al., 2019a) uses Tucker decomposition to learn a tensor factorization for KGE. These methods model the KGE task in a variety of different ways. For example, logic rules play an important role in determining whether a triple is valid, and we suspect that the performance of ReInceptionE could be further improved by taking logic rules into account. We leave this comparison and deeper analysis for future work.

Impact of Different Modules
We report experimental results in Table 3 to investigate the impact of the different modules in ReInceptionE. In Table 3, "InceptionE" is the baseline model without relation-aware local neighbors and global entities; "ReInception w/o N" is the model without relation-aware local neighbor information, while "ReInception w/o E" is the model without relation-aware global entity information. Besides, we also include the two closely related models ConvE and KBGAT for fair comparison.
From Table 3, we can see that our baseline InceptionE outperforms the closely related CNN-based model ConvE. InceptionE is more powerful than ConvE because it can capture rich interaction features using filters with various scales. Moreover, ReInceptionE, which incorporates relation-aware local neighborhood and global entity information, outperforms the related graph-based model KBGAT. Table 3 also shows that ReInceptionE outperforms InceptionE, "ReInception w/o N" and "ReInception w/o E" by a large margin on both datasets, which confirms our observation that relation-aware local neighbors and global entities make distinct contributions to KGE.

Evaluation on Different Relation Types
In this subsection, we present experimental results on different relation types on WN18RR and FB15k-237 using Hits@10. We choose the closely related model ConvE, as well as InceptionE, as the baselines. Following (Bordes et al., 2013), we classify the relations into four groups based on the average number of tails per head and the average number of heads per tail: one-to-one (1-1), one-to-many (1-N), many-to-one (N-1) and many-to-many (N-N); a sketch of this classification procedure is given below. Table 4 shows the link prediction results for each relation category. From Table 4, we find that InceptionE achieves better performance than ConvE for all relation types, indicating that increasing the number of interactions between head and relation embeddings is indeed beneficial to the KGE task. Furthermore, our proposed ReInceptionE significantly outperforms ConvE and InceptionE for all relation types. In particular, ReInceptionE obtains larger improvements on complex relations, such as one-to-many, many-to-one and many-to-many. This again verifies our observation that increasing the interactions and exploiting the local-global structural information allows the model to capture more complex relations.
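For reference, the relation grouping can be computed as in the sketch below; the 1.5 cut-off on the average tails-per-head and heads-per-tail follows Bordes et al. (2013), while the function name and data layout are illustrative.

```python
from collections import defaultdict

def relation_categories(triples, threshold=1.5):
    """Sketch of the 1-1 / 1-N / N-1 / N-N grouping (Bordes et al., 2013):
    classify each relation by the average number of tails per head and of
    heads per tail; 1.5 is the cut-off used in the original paper."""
    tails_of = defaultdict(lambda: defaultdict(set))  # r -> h -> {t}
    heads_of = defaultdict(lambda: defaultdict(set))  # r -> t -> {h}
    for h, r, t in triples:
        tails_of[r][h].add(t)
        heads_of[r][t].add(h)

    categories = {}
    for r in tails_of:
        avg_tph = sum(len(ts) for ts in tails_of[r].values()) / len(tails_of[r])
        avg_hpt = sum(len(hs) for hs in heads_of[r].values()) / len(heads_of[r])
        many_tails = avg_tph >= threshold   # one head links to "many" tails
        many_heads = avg_hpt >= threshold   # one tail links to "many" heads
        categories[r] = {(False, False): "1-1", (True, False): "1-N",
                         (False, True): "N-1", (True, True): "N-N"}[(many_tails, many_heads)]
    return categories

# Example: nationality comes out as N-1, since many people share one country.
print(relation_categories([("a", "nationality", "USA"), ("b", "nationality", "USA")]))
```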

Case Study
To further analyze how relation-aware neighbors contribute to the KGE task, we give two examples in Table 5. For the query (Jack London, nationality, ?), ReInceptionE assigns the highest attention score to the neighbor (place lived, Oakland), since Oakland and America are close to each other in embedding space because of other relations between them, and the top predictions for the query are a set of entities of type country. For the second example, (Jerry Lewis, languages, ?), ReInceptionE assigns a very high score to the neighbor (place of birth, Newark). This allows the model to project the query into the relevant region of the embedding space, which leads to a high score for predicting the target English Language. These examples give clear evidence of how our proposed ReInceptionE benefits the KGE task.

Conclusions
In this paper, we have proposed a novel relation-aware Inception network for knowledge graph embedding, called ReInceptionE, which combines the benefits of ConvE and KBGAT. The proposed method first employs the Inception network to learn the query embedding, with the aim of increasing the interactions between head and relation embeddings while remaining parameter-efficient. Then, it gathers relation-aware local neighborhood and global entity information with an attention mechanism and enriches the query embedding with this joint local-global structural information. Empirical studies demonstrate that our proposed method obtains competitive performance compared with state-of-the-art methods on two widely used benchmark datasets, WN18RR and FB15k-237.