Knowledge Graph Embedding with Atrous Convolution and Residual Learning

Knowledge graph embedding is an important task that benefits many downstream applications. Currently, methods based on deep neural networks achieve state-of-the-art performance. However, most of these methods are very complex and need much time for training and inference. To address this issue, we propose a simple but effective knowledge graph embedding method based on atrous convolution. Compared with existing state-of-the-art methods, our method has the following main characteristics. First, it effectively increases feature interactions by using atrous convolutions. Second, it uses residual learning to address the original information forgotten issue and the vanishing/exploding gradient issue. Third, it has a simpler structure but much higher parameter efficiency. We evaluate our method on six benchmark datasets with different evaluation metrics. Extensive experiments show that our model is very effective: on these diverse datasets, it achieves better results than the compared state-of-the-art methods on most evaluation metrics. The source code of our model can be found at https://github.com/neukg/AcrE.


Introduction
A knowledge graph (KG) is a valuable kind of knowledge base and is important for many AI-related applications. Generally, a KG stores factual knowledge in the form of structural triplets like <h, r, t>, which means there is a relation r from h (head entity) to t (tail entity). Nowadays, great achievements have been made in building large scale KGs. Usually a KG may contain millions of entities and billions of relational facts. However, there are still two major difficulties that limit the usability of KGs. First, although most existing KGs contain a large number of triplets, they are far from complete. Second, most existing KGs are stored in symbolic and logical formats while applications often involve numerical computing in continuous spaces. To address these two issues, researchers proposed knowledge graph embedding (KGE) methods that aim to learn embedding representations for a KG's items (entities and relations) by projecting these items into continuous low-dimensional spaces. Generally, different kinds of KGE methods mainly differ in how they view the role of relations in the projected spaces. For example, translation based methods (TransE (Bordes et al., 2013), TransH (Wang et al., 2014), TransR (Lin et al., 2015a), TransD (Ji et al., 2015), etc.) view the relation in a triplet as a translation operation from the head entity to the tail entity. Other KGE methods view relations as combination operators that link head entities and tail entities. For example, HolE (Nickel et al., 2016) employs a circular correlation function as the combination operator in the projected space. ComplEx (Trouillon et al., 2016) makes use of complex valued embeddings and takes matrix decomposition as the combination operator. RT uses Tucker decomposition for KGE. RotatE uses the rotation operation in the complex space as the combination operator.
Experimental results show that these existing methods are feasible and robust in solving the two issues mentioned above.
Recently, deep neural network (DNN) based KGE methods (Dettmers, 2018; Nguyen et al., 2018; Yao et al., 2020; Vashishth et al., 2020a; Vashishth et al., 2020b) have pushed the performance of KGE to new heights. Compared with previous methods, these methods can learn more effective embeddings, mainly owing to the powerful learning ability of DNN models. However, as pointed out by Xu et al. (2020), existing research has not made a proper trade-off between model complexity (the number of parameters) and model expressiveness (the performance in capturing semantic information). Thus deep convolutional neural network (DCNN) based methods are attracting more and more research attention due to their simple but effective structure. However, Chen et al. (2018) point out that DCNN based methods usually suffer from reduced feature resolution, which is caused by the repeated combination of max-pooling and down-sampling ("striding") performed at consecutive layers of DCNNs. This results in feature maps with significantly reduced spatial resolution when a DCNN is employed in a fully convolutional fashion.
To address this issue, we propose an atrous convolution based KGE method, which allows the model to effectively enlarge the field of view of filters almost without increasing the number of parameters or the amount of computation. To address the vanishing/exploding gradient issue inherent in DNN based learning frameworks and the original information forgotten issue that arises when more convolutions are used, we introduce residual learning into our method. We propose two learning structures to integrate different kinds of convolutions: one is a serial structure, and the other is a parallel structure. We evaluate our method on six diverse benchmark datasets. Extensive experiments show that our method achieves better results than the compared state-of-the-art baselines under most evaluation metrics on these datasets.

Related Work
Translation based KGE methods view the relation in a triplet as a translation operation from the head entity to the tail entity. These methods usually define a score function (or energy function) of the form ||h + r − t|| to measure the plausibility of a triplet. During training, almost all of them minimize a margin based ranking loss function over the training data. TransE (Bordes et al., 2013) is a seminal work in this branch. It directly takes the embedding space as a translation space. Formally, it tries to let h + r ≈ t if <h, r, t> holds. TransH (Wang et al., 2014) models a relation as a hyperplane together with a translation operation on it. TransR (Lin et al., 2015a) models entities and relations in distinct spaces, i.e., the entity space and multiple relation spaces. TransD (Ji et al., 2015) models each entity or relation by two vectors. TranSparse (Ji et al., 2016) mainly considers the heterogeneity property and the imbalance property in KGs. PTransE (Lin et al., 2015b) integrates relation paths into a TransE model. ITransF (Xie et al., 2017) uses a sparse attention mechanism to discover hidden concepts of relations and to transfer knowledge through the sharing of concepts. Recently, researchers have also combined different distance functions for KGE. For example, Sadeghi et al. (2019) proposed a multi distance embedding (MDE) model, which uses several distances as objectives.
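To make the translation view concrete, the following is a minimal sketch of a distance-style score of the form ||h + r − t|| together with a margin based ranking loss. The function names, the choice of norm, and the toy vectors are illustrative assumptions and are not taken from any of the cited implementations.

```python
import math

def transe_score(h, r, t, norm=2):
    """Distance ||h + r - t||; a lower value means a more plausible triplet."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return math.sqrt(sum(d * d for d in diff))

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that is zero once the corrupted triplet is at least
    `margin` farther away than the true triplet."""
    return max(0.0, margin + pos_score - neg_score)

# Toy 2-dimensional embeddings (hypothetical values).
h = [0.1, 0.2]
r = [0.3, 0.1]
t_true = [0.4, 0.3]     # equals h + r up to floating point error
t_false = [1.0, -1.0]   # a corrupted tail entity

pos = transe_score(h, r, t_true)
neg = transe_score(h, r, t_false)
loss = margin_ranking_loss(pos, neg)
```

In this toy case the true triplet has a distance close to zero and the corrupted one is more than a margin away, so the loss is zero and no update would be needed.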
Bilinear KGE models use combination operators other than translation. For example, HolE (Nickel et al., 2016) employs a circular correlation as the combination operator. ComplEx (Trouillon et al., 2016) makes use of complex valued embeddings and takes matrix decomposition as the combination operator. Similar to ComplEx, RotatE also uses a complex space, where each relation is defined as a rotation from the source entity to the target entity. Xu and Li (2019) proposed DihEdral for KG relation embedding. By leveraging the desired properties of the dihedral group, their method can support many kinds of relations such as symmetry and inversion. The Relational Tucker3 (RT) decomposition has also been proposed for multi-relational link prediction in knowledge graphs.
Among other work, KG2E uses a density-based method to model the certainty of entities and relations in a space of multi-dimensional Gaussian distributions. TransG (Xiao et al., 2016) mainly addresses the issue of multiple relation semantics.
Recently, researchers have begun to explore DNN based methods for KGE and have achieved state-of-the-art results. For example, ConvE (Dettmers, 2018) applies 2D convolutions to reshaped entity and relation embeddings. VR-GCN is an extension of graph convolutional networks for embedding both nodes and relations. Shang et al. (2019) propose SACN, which takes the benefit of both GCN and ConvE. Vashishth et al. (2020b) propose CompGCN, which jointly embeds both nodes and relations in a relational graph.
However, as pointed out by Xu et al. (2020), most of these existing DNN-based or GNN-based KGE methods are very complex and time-consuming, which prevents them from being used in some on-line or real-time application scenarios.

AcrE Model
We denote our model as AcrE (the abbreviation of Atrous Convolution and Residual Embedding). In this study, we design two structures to integrate the standard convolution and atrous convolutions. One is a serial structure, as shown in Figure 1 (a), and the other is a parallel structure, as shown in Figure 1 (b). We introduce them one by one below.

Serial AcrE Model
In the Serial AcrE model, the standard convolution and atrous convolutions are organized in a serial manner. As shown in Figure 1 (a), the output of one convolution is taken as the input of the convolution that follows it. In this model, the embeddings of an entity and its relation are first reshaped into a 2-dimensional representation, then a standard convolution and several atrous convolutions are performed serially. Next, the output of the last atrous convolution and the initial embeddings are combined by a residual learning based method. The combined embeddings are flattened into a vector. This vector is then used as features to get the probability distribution over the entity candidates.
2D Embedding Representation. For a triplet <h, r, t>, we denote h, r and t as their corresponding embedding representations. ConvE points out that a 2-dimensional (2D for short) convolution operation is better than a 1D convolution operation because a 2D convolution increases the expressiveness of a CNN model through additional points of interaction between embeddings. Thus, following ConvE, we also use a 2D convolution in our model. To this end, the concatenation of the embeddings of an entity and its linked relation is reshaped into a 2D embedding representation.
Specifically, we use $\tau$ to denote a 2D reshaping function and use $\mathbf{e}$ to denote the embedding of an entity $e$. If $\mathbf{e}, \mathbf{r} \in \mathbb{R}^{m}$, then $\tau([\mathbf{e}; \mathbf{r}]) \in \mathbb{R}^{n_1 \times n_2}$, where $2 \times m = n_1 \times n_2$. In this study, we use $[\mathbf{e}; \mathbf{r}]$ to denote the concatenation of $\mathbf{e}$ and $\mathbf{r}$.
Standard Convolution based Learning. After the 2D reshaping process, a standard convolution operation is performed with Equation 1.
$$C_0^i = \omega_0^i \star \tau([\mathbf{e}; \mathbf{r}]) + b_0^i \qquad (1)$$
where $\star$ denotes the convolution operation, $\omega_0^i \in \mathbb{R}^{k \times k}$ is the $i$-th filter and $b_0^i$ is the $i$-th bias vector. Then the outputs of these filters are stacked to form the output of the standard convolution learning. We denote the final output of this standard convolution learning as $C_0$, which can be simply written as $C_0 = C_0^1 : C_0^2 : C_0^3 : \dots : C_0^F$, where $F$ is the number of filters used. It should be noted that we do not perform the max-pooling operation that is often used in traditional CNN models. This is because the input of our model is always an entity and a relation, so the length of the convolution output is fixed. It is therefore unnecessary to use max-pooling to generate a new fixed-length representation. Our in-house experiments show that there is no obvious performance difference between using and not using a max-pooling operation.
Atrous Convolution based Learning. Atrous convolution, also called dilated convolution, inserts holes (zeros) between the filter values during convolution, so that the filter samples the input at spaced positions. Given an input vector $x$ and a filter vector $w$ of length $K$, the output vector $y$ of an atrous convolution is computed with Equation 2.
$$y[i] = \sum_{k=1}^{K} x[i + l \cdot k] \, w[k] \qquad (2)$$
Here l (the atrous rate parameter) means the stride with which we sample the input. Obviously, the standard convolution is a special case of the atrous convolution when l is set to 1.
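The following is a minimal sketch of a 1D atrous convolution, assuming the usual dilated form y[i] = Σ_k x[i + l·k] w[k] with no padding and no filter flipping. The function name `atrous_conv1d`, the 0-based indexing, and the toy input are illustrative assumptions. It shows that a rate of 1 gives the standard convolution, while a larger rate widens the span covered by the same filter.

```python
def atrous_conv1d(x, w, rate=1):
    """1D atrous (dilated) convolution without padding.

    Each output sums K input samples that are `rate` positions apart,
    weighted by the K filter values (0-based indices).
    """
    K = len(w)
    span = rate * (K - 1)        # distance between the first and last tap
    out_len = len(x) - span      # number of positions where the filter fits
    return [
        sum(x[i + rate * k] * w[k] for k in range(K))
        for i in range(out_len)
    ]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]                   # the same 3 weights are used for both rates

standard = atrous_conv1d(x, w, rate=1)   # taps at i, i+1, i+2
dilated = atrous_conv1d(x, w, rate=2)    # taps at i, i+2, i+4

print(standard)  # [-2, -2, -2, -2, -2]
print(dilated)   # [-4, -4, -4]
```

With rate 2, the three-weight filter compares samples four positions apart instead of two, so it sees a wider context without any additional weights.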
Specifically, in the Serial AcrE model, an atrous convolution takes the output of its previous convolution as input, and outputs a new result with Equation 3.
$$C_t = \omega_t \star C_{t-1} + b_t \qquad (3)$$
where $C_{t-1}$ is the output of the previous convolution operation, and $\omega_t$ and $b_t$ are the filter and bias vector, respectively, in the $t$-th convolution.
Feature Vector Generation. In the Serial AcrE model, different kinds of convolutions are performed one by one. Each convolution extracts some interaction features from the output of its previous convolution. Thus the mined features "forget" more and more of the original input information as convolutions are performed. However, the original information is the foundation of all mined features, so forgetting it increases the risk that the mined features are actually irrelevant to what is needed. We call this phenomenon the original information forgotten issue. Besides, there is an inherent vanishing/exploding gradient issue in deep networks. Here we use the residual learning method to add the original input information back so as to address both issues. Then the result of residual learning is flattened into a feature vector. Specifically, the whole process is defined with Equation 4.
$$o = \mathrm{Flatten}(\mathrm{ReLU}(C_T + \tau([\mathbf{e}; \mathbf{r}]))) \qquad (4)$$
where $C_T$ is the output of the last atrous convolution, and $T$ is the number of atrous convolutions.
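The serial pipeline described above (reshape, one standard convolution, several atrous convolutions, residual addition, ReLU, flatten) can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions that are not stated in the paper: a single filter per convolution, odd square kernels, and zero padding so that every output keeps the n1 × n2 input shape and can be added to the reshaped input. All function names are hypothetical.

```python
def reshape_2d(vec, n_rows, n_cols):
    """tau(.): reshape a flat vector into an n_rows x n_cols grid."""
    assert len(vec) == n_rows * n_cols
    return [vec[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]

def atrous_conv2d_same(grid, kernel, bias=0.0, rate=1):
    """2D atrous convolution with zero padding; output shape equals input shape.

    Assumes an odd, square kernel. rate=1 is the standard convolution.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    k = len(kernel)
    half = k // 2
    out = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            acc = bias
            for a in range(k):
                for b in range(k):
                    ii = i + rate * (a - half)
                    jj = j + rate * (b - half)
                    # Positions outside the grid act as zero padding.
                    if 0 <= ii < n_rows and 0 <= jj < n_cols:
                        acc += kernel[a][b] * grid[ii][jj]
            row.append(acc)
        out.append(row)
    return out

def serial_features(e, r, kernels, rates, n_rows, n_cols):
    """Serial structure: chain the convolutions, then residual + ReLU + flatten."""
    x0 = reshape_2d(e + r, n_rows, n_cols)     # tau([e; r])
    c = x0
    for kernel, rate in zip(kernels, rates):   # first entry: standard conv (rate 1)
        c = atrous_conv2d_same(c, kernel, rate=rate)
    return [
        max(0.0, c[i][j] + x0[i][j])           # ReLU(C_T + tau([e; r]))
        for i in range(n_rows)
        for j in range(n_cols)
    ]

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # passes its input through unchanged
features = serial_features(
    e=[1.0, -2.0], r=[3.0, 4.0],
    kernels=[identity, identity, identity], rates=[1, 2, 3],
    n_rows=2, n_cols=2,
)
print(features)  # [2.0, 0.0, 6.0, 8.0]
```

With identity kernels the convolution chain returns the input itself, so the residual sum doubles each value before ReLU clips the negative entry. With all-zero kernels the same function would return ReLU of the reshaped input, which illustrates how the residual path keeps the original information available even when the convolutions discard it.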
Score Function. With the generated feature vector $o$, we define the following function to compute a score that measures the degree to which an entity candidate $t$ can form a correct triplet with the input $\langle h, r \rangle$.
$$\psi(h, r, t) = (oW + b)\,\mathbf{t}^{\top} \qquad (5)$$
where $W$ is a transformation matrix and $b$ is a bias vector. Then a sigmoid function is used to get the probability distribution over all candidate entities.
$$p(t \mid h, r) = \mathrm{sigmoid}(\psi(h, r, t)) \qquad (6)$$
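As a small illustration of the scoring step, the sketch below projects a feature vector with a matrix W and bias b, takes the dot product with each candidate entity embedding, and applies a sigmoid. The exact shapes (W of size len(o) × m, candidates of dimension m) and the function name are assumptions made for the example.

```python
import math

def score_candidates(o, W, b, entity_embeddings):
    """Return sigmoid((o W + b) . t) for every candidate embedding t."""
    m = len(b)
    # Project the feature vector into the entity embedding space.
    projected = [
        sum(o[i] * W[i][j] for i in range(len(o))) + b[j]
        for j in range(m)
    ]
    # One dot product per candidate entity.
    scores = [
        sum(p * t for p, t in zip(projected, emb))
        for emb in entity_embeddings
    ]
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

o = [1.0, 2.0]                        # toy feature vector
W = [[1.0, 0.0], [0.0, 1.0]]          # identity projection for readability
b = [0.0, 0.0]
entities = [[1.0, 2.0], [-1.0, -2.0], [0.0, 0.0]]

probs = score_candidates(o, W, b, entities)
```

Here the first candidate is aligned with the projected features and receives a probability close to 1, the opposite candidate receives a probability close to 0, and the zero vector receives exactly 0.5. Because all candidates share the same projected vector, a single (h, r) pair is scored against every entity in one pass.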

Parallel AcrE Model
In the Parallel AcrE model, the standard convolution and atrous convolutions are organized in a parallel manner. As shown in Figure 1 (b), different kinds of convolutions are performed simultaneously, then their results are combined and flattened into a vector. Similar to the Serial AcrE model, this vector is used as features to get the probability distribution over the entity candidates. Compared with the Serial AcrE model, most of the components in the Parallel AcrE model have the same definitions, except for the results integration and the feature vector generation. We introduce these two differences in the following part.
Results Integration. Different from the serial structure, multiple results are generated by the different convolution operations. Accordingly, we need to integrate these results. This process is defined with Equation 7.
$$C = C_0 \oplus C_1 \oplus \dots \oplus C_T \qquad (7)$$
where $C_0$ is the result of the standard convolution, $C_i$ is the result of the $i$-th atrous convolution, and $\oplus$ denotes a result integration operation. There are different kinds of integration methods. In this study, we explore two widely used ones: an element-add operation based method and a concatenation operation based method.
Feature Vector Generation. As shown in Figure 1, the final output of the whole convolution learning is followed by a transformation operation. Then the results are flattened into the feature vector. Specifically, the process can be written as Equation 8, where $W_1$ is the transformation matrix.
$$c = \mathrm{Flatten}(W_1\,\mathrm{ReLU}(C + \tau([\mathbf{e}; \mathbf{r}]))) \qquad (8)$$
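The two integration operations can be illustrated as follows. In this sketch each convolution result is represented as a list of channels, and each channel as a 2D grid; this layout and the function names are assumptions made for the example rather than details from the paper.

```python
def integrate_add(results):
    """Element-add: sum equally shaped results (channels x rows x cols)."""
    first = results[0]
    return [
        [
            [sum(res[c][i][j] for res in results)
             for j in range(len(first[c][i]))]
            for i in range(len(first[c]))
        ]
        for c in range(len(first))
    ]

def integrate_concat(results):
    """Concatenation: stack the channels of all results along the channel axis."""
    return [channel for res in results for channel in res]

# One channel of size 2 x 2 per branch (toy values).
c0 = [[[1, 2], [3, 4]]]           # standard convolution output
c1 = [[[10, 20], [30, 40]]]       # first atrous convolution output
c2 = [[[100, 200], [300, 400]]]   # second atrous convolution output

added = integrate_add([c0, c1, c2])
stacked = integrate_concat([c0, c1, c2])

print(added)         # [[[111, 222], [333, 444]]]
print(len(stacked))  # 3
```

Element-add keeps the number of channels fixed but mixes the branches, whereas concatenation keeps each branch separate and multiplies the number of channels by the number of branches, so whatever follows must accept the wider input.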

Training
Different from other KGE methods that often use a max-margin loss function for training, most neural network based KGE methods (like ProjE, ConvE, etc.) use one of the following two kinds of ranking loss functions. One is a binary cross-entropy loss in which the ranking scores are calculated independently (pointwise ranking method), and the other is a softmax regression loss that considers the ranking scores collectively (listwise ranking method). Both ProjE and ConvE show that the latter achieves better experimental results. In AcrE, we define the same listwise loss function as used in ConvE.
$$\mathcal{L}(h, r) = -\frac{1}{N} \sum_{i=1}^{N} \Big( t_i \log p(t_i \mid h, r) + (1 - t_i) \log \big(1 - p(t_i \mid h, r)\big) \Big) \qquad (9)$$
where $t$ is a label vector whose elements are ones for relationships that exist and zeros otherwise, and $N$ is the number of entities in a KG. This loss function takes one (h, r) pair and scores it against all entities simultaneously. Thus our model is very fast for both training and inference.
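A minimal sketch of such a loss is given below, assuming it takes the form used by ConvE, namely a cross-entropy between the predicted probabilities and a 0/1 label vector, averaged over all N candidate entities of one (h, r) pair. The function name and the clipping constant `eps` are illustrative choices.

```python
import math

def one_to_n_loss(probs, labels, eps=1e-12):
    """Average cross-entropy over all N candidates of one (h, r) pair.

    probs[i]  : predicted probability that entity i completes the triplet
    labels[i] : 1 if the triplet with entity i exists in the KG, else 0
    """
    n = len(probs)
    total = 0.0
    for p, t in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / n

labels = [1, 0, 0, 1]                     # entities 0 and 3 are correct tails
good = one_to_n_loss([0.9, 0.1, 0.2, 0.8], labels)
bad = one_to_n_loss([0.1, 0.9, 0.8, 0.2], labels)
```

Predictions that agree with the labels give a small loss (`good`), and inverted predictions give a much larger one (`bad`). A single call covers every entity, which matches the description of scoring one (h, r) pair against all entities at once.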

Experiment Settings
Datasets. We evaluate our method on six widely used benchmark datasets. The first two are WN18 (Bordes et al., 2014) and FB15k (Bordes et al., 2014). The next two are FB15k-237 and WN18RR (Dettmers, 2018), which are variants of FB15k and WN18 designed to avoid test leakage. The remaining two are Alyawarra Kinship (Lin et al., 2018) and DB100K (Ding et al., 2018), both of which are new datasets proposed in recent years. Some statistics of these six datasets are shown in Table 1.
[Results table fragment; the column headers, the model name of the first row, and the values of the last row are missing in the source. A trailing caption fragment reads "Xu et al. (2020)."]
(model name missing)                   0.233   11.5   30.1   44.8
HolE (Nickel et al., 2016)             0.26    18.2   30.9   41.1
ComplEx (Trouillon et al., 2016)       0.242   12.6   31.2   44
Analogy                                0.252   14.2   32.3   42.7
RUGE                                   0.246   12.9   32.5   43.3
ComplEx-NNE+AER (Ding et al., 2018)    (values missing)
Evaluation Task. We use link prediction, one of the most frequently used benchmark evaluation tasks for KGE methods, to evaluate our model. Link prediction aims to predict the missing h or t for a correct triplet <h, r, t>, i.e., to predict t given <h, r> or to predict h given <r, t>. Rather than requiring one best answer, this task puts more emphasis on ranking a set of candidate entities from the KG. Hits@k and MRR are often used as the evaluation metrics.
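For reference, Hits@k and MRR can be computed from the rank of the correct entity in each test query, as in the sketch below. It uses raw ranks without filtering out other known correct entities, and it resolves ties in favor of the target; both are simplifying assumptions, and the helper names are hypothetical.

```python
def rank_of(scores, target_index):
    """1-based rank of the target when candidates are sorted by descending score."""
    target = scores[target_index]
    return 1 + sum(1 for s in scores if s > target)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Ranks of the correct entity for five hypothetical test queries.
ranks = [1, 3, 2, 10, 50]

print(rank_of([0.1, 0.9, 0.5], target_index=2))  # 2
print(hits_at_k(ranks, 1))    # 0.2
print(hits_at_k(ranks, 10))   # 0.8
mrr = mean_reciprocal_rank(ranks)
```

Hits@k only checks whether the correct entity appears in the top k, while MRR also rewards moving it closer to the top, which is why the two are usually reported together.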
In experiments, all the parameters, including initial embeddings, transformation matrices, and bias vectors, are randomly initialized. Hyper-parameters are selected by a grid search on the validation set. All the results are reported with 3 atrous convolutions used in both learning structures.

Experimental Results
Overall Results. Tables 2 and 3 show the experimental results on different datasets under different evaluation metrics. It should be noted that not all models report their results on all six datasets, so the compared baselines on different datasets are different in these two tables. In the subsequent parts, all the experimental results for the compared baselines are taken from recently published papers or from their original papers. From these results we can draw the following two conclusions. First, our model is very robust and it significantly outperforms the compared state-of-the-art results under all the evaluation metrics on all datasets except for WN18RR. Especially on DB100K, FB15k, and Kinship, both AcrE (Serial) and AcrE (Parallel) outperform the compared baselines by a large margin under almost all the evaluation metrics. As for WN18RR, our model still achieves very competitive results. Especially when compared with other DCNN-based KGE methods like ConvE and ConvKB, both the Serial and the Parallel models perform much better. Second, AcrE (Parallel) performs better than AcrE (Serial) in most cases. We think this is mainly because the Serial structure based method suffers more from the original information forgotten issue than the Parallel structure based method.
[Table header: Model; Prediction Head (Hits@10): 1-to-1, 1-to-n, n-to-1, m-to-n; Prediction Tail (Hits@10): 1-to-1, 1-to-n, n-to-1, m-to-n]
Detailed Results. We conduct the following two kinds of detailed experiments to further demonstrate the performance of our model. One is Head and Tail Prediction, and the other is Prediction by Categories.
In the first kind of detailed experiments, we compare the performance of our model with several representative state-of-the-art baselines on FB15k-237 for predicting missing head entities and predicting missing tail entities. The results are summarized in Table 4, from which we can see that our model again outperforms the compared baselines under all the evaluation metrics.
In the second kind of detailed experiments, we compare the performance of our model with several representative state-of-the-art baselines on FB15k for prediction by relation categories. The results are shown in Table 5. We can see that AcrE does much better than the other compared baselines on almost all types of relations except the 1-to-1 relations. This merit is very important for real application scenarios, where the more complex relations often take up large proportions. For example, in FB15k, one of the largest available KGs, about 1.4% of the triplets are 1-to-1, about 8.9% are 1-to-n, about 14.6% are n-to-1, and about 75.1% are m-to-n.
Ablation Results. Table 6 shows the ablation experiments of our model on FB15k and FB15k-237. We can see that there is a large difference between the performance with and without residual learning in most cases. As analyzed above, the more serial convolutions are used, the more original information is forgotten, while residual learning adds the original information back. Accordingly, the mentioned issue is alleviated greatly. Since AcrE (Serial) forgets more original information than AcrE (Parallel), it achieves larger performance gains from residual learning. From Table 6 we can also observe that the integration method plays an important role in AcrE (Parallel). The concatenation based integration method is superior to the element-add based integration method in most cases. Here we do not use more complex integration methods such as gate control based methods, because we do not want to make the model too complex.
Table 6: Ablation experiments on FB15k and FB15k-237. "add" and "con" refer to the element-add and concatenation integration methods respectively.
Besides, the atrous rate and the number of atrous convolutions used also affect the performance. Here we do not report the performance under different settings of these two hyper-parameters due to space limitations. In fact, both of these parameters are easily selected because their search spaces are small.
Parameter Efficiency. We also compare the parameter efficiency of our model with that of some state-of-the-art models on FB15k-237. For each method, we report the number of parameters associated with the optimal configuration that leads to the performance shown in Table 3. The comparison results are shown in Table 7, from which we can see that the number of parameters in AcrE is close to that of ConvE, but is far smaller than that of the other compared baselines. This is in line with our expectation: using atrous convolutions does not increase the number of parameters greatly. These results show that our model is more parameter efficient: it achieves substantially better results with fewer parameters. Note that AcrE (Parallel) has more parameters than AcrE (Serial) because it has an extra transformation operation after the result integration.
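The reason atrous convolutions add so few parameters can be seen with a short calculation. For a k × k filter with atrous rate l, the filter still has k × k weights, while the region it covers grows to an effective size of k + (k − 1)(l − 1) per side, which is the standard relation for dilated filters. The helper names below are illustrative.

```python
def effective_kernel_size(k, rate):
    """Side length of the input region covered by a k x k filter at a given rate."""
    return k + (k - 1) * (rate - 1)

def filter_weights(k):
    """Number of learnable weights in one k x k filter (independent of the rate)."""
    return k * k

# (rate, covered side length, weights per filter) for a 3 x 3 filter.
table = [(rate, effective_kernel_size(3, rate), filter_weights(3))
         for rate in (1, 2, 3)]

for rate, size, weights in table:
    print(f"rate={rate}: covers {size}x{size}, weights={weights}")
# rate=1: covers 3x3, weights=9
# rate=2: covers 5x5, weights=9
# rate=3: covers 7x7, weights=9
```

A standard 7 × 7 filter would need 49 weights to cover the same region that a 3 × 3 filter with rate 3 covers using 9 weights.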
Here we do not quantitatively compare the runtime of different models because it is difficult to provide a fair evaluation environment: many non-model factors affect the runtime, such as coding tricks, hyper-parameter settings (like batch size and learning rate), and parallelization. However, AcrE can be viewed as a variant of ConvE. Theoretically, it has the same time complexity as ConvE, which has been shown to be faster than most existing state-of-the-art methods. Taking FB15k-237 as an example, when using a Titan XP GPU server, training takes about 220 and 100 seconds per epoch for AcrE (Serial) and AcrE (Parallel) respectively. As for inference, it takes only 14 and 6 seconds for AcrE (Serial) and AcrE (Parallel) respectively to finish the evaluation of the whole test set. In contrast, some recent GNN or DNN based methods often take many hours or even several days to complete the same work under the same experimental settings.

Conclusions
In this paper, we propose AcrE, a simple but effective DNN-based KGE model. We make comprehensive comparisons between AcrE and many state-of-the-art baselines on six diverse benchmark datasets. Extensive experimental results show that AcrE is very effective and achieves better results than the compared baselines under most evaluation metrics on these datasets. The main contributions of our method are summarized as follows. First, to the best of our knowledge, this is the first work that uses different kinds of convolutions for the KGE task. Second, we propose two simple but effective learning structures to integrate different kinds of convolutions. Third, the proposed model has much better parameter efficiency than the compared baselines.