Disentangle-based Continual Graph Representation Learning

Graph embedding (GE) methods embed nodes (and/or edges) of a graph into a low-dimensional semantic space and have shown their effectiveness in modeling multi-relational data. However, existing GE models are impractical in real-world applications because they overlook the streaming nature of incoming data. To address this issue, we study the problem of continual graph representation learning, which aims to continually train a GE model on new data so as to learn incessantly emerging multi-relational data while avoiding catastrophic forgetting of previously learned knowledge. Moreover, we propose a disentangle-based continual graph representation learning (DiCGRL) framework inspired by the human ability to learn procedural knowledge. The experimental results show that DiCGRL effectively alleviates the catastrophic forgetting problem and outperforms state-of-the-art continual learning models. The code and datasets are released at https://github.com/KXY-PUBLIC/DiCGRL.


Introduction
Multi-relational data represents relationships between entities in the world and is usually denoted as a multi-relational graph with nodes and edges connecting them. It is widely used in real-world NLP applications such as knowledge graphs (KGs) (e.g., Freebase (Bollacker et al., 2008) and DBpedia (Lehmann et al., 2015)) and information networks (e.g., social networks and citation networks). Therefore, modeling multi-relational graphs with graph embeddings (Bordes et al., 2013; Tang et al., 2015a; Sun et al., 2019; Bruna et al., 2014) has attracted intensive attention in both academia and industry. Graph embedding (GE), which aims to embed nodes and/or edges of a graph into a low-dimensional semantic space so that neural models can effectively and efficiently utilize multi-relational data, has demonstrated remarkable effectiveness in various downstream NLP tasks such as question answering (Bordes et al., 2014) and dialogue systems (Moon et al., 2019).
Nevertheless, most existing graph embedding works overlook the streaming nature of the incoming data in real-world scenarios. In consequence, these models have to be retrained from scratch to reflect the data change, which is computationally expensive. To tackle this issue, we propose to study the problem of continual graph representation learning (CGRL) in this work.
The goal of continual learning is to alleviate catastrophic forgetting of old data while learning new data. There are two mainstream continual learning methods in NLP: (1) consolidation-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Liu et al., 2018; Ritter et al., 2018), which consolidate the model parameters important to old data when learning new data; and (2) memory-based methods (Lopez-Paz and Ranzato, 2017; Shin et al., 2017; Chaudhry et al., 2019), which remember a few old examples and learn them jointly with new data. Despite the promising results these methods have achieved on classification tasks, their effectiveness has not been validated on graph representation learning. Unlike classification problems, where instances are generally independent and can be operated on individually, nodes and edges in multi-relational data are correlated, making it sub-optimal to directly deploy existing continual learning methods on multi-relational data.
In cognitive psychology (Solso et al., 2005), procedural knowledge refers to a set of operational steps. Its smallest unit is the production, and multiple productions together complete a series of cognitive activities. When learning new procedural knowledge, humans update their cognitive results by only updating a few related productions, leaving the rest intact. Intuitively, such a process can be mimicked to learn constantly growing multi-relational data by regarding each piece of new data as new procedural knowledge. For example, as illustrated in Figure 1, the relational triplets of Barack Obama and Michelle Obama are related to three concepts: "family", "occupation" and "location". When a new relational triplet (Michelle Obama, Daughter, Malia Ann Obama) appears, we only need to update the "family"-related information in Barack Obama. Consequently, we can further infer that the triplet (Barack Obama, Daughter, Malia Ann Obama) also holds.
Inspired by procedural knowledge learning, we propose DiCGRL, a disentangle-based continual graph representation learning framework. Our proposed DiCGRL consists of two modules: (1) Disentangle module. It decouples the relational triplets in the graph into multiple independent components according to their semantic aspects, and leverages two typical GE methods, Knowledge Graph Embedding (KGE) and Network Embedding (NE), to learn disentangled graph embeddings; (2) Updating module. When new relational triplets arrive, it selects the relevant old relational triplets and only updates the corresponding components of their graph embeddings. Compared with memory-based continual learning methods, which save a fixed set of old data, DiCGRL dynamically selects important old data according to the new data to fine-tune the model, which enables DiCGRL to better model the complex multi-relational data stream.
We conduct extensive experiments on both KGE and NE settings based on the real-world scenarios, and the experimental results show that DiCGRL effectively alleviates the catastrophic forgetting problem and significantly outperforms existing continual learning models while remaining efficient.

Graph Embedding
Graph embedding (GE) methods are critical techniques for obtaining good representations of multi-relational data. There are mainly two categories of typical multi-relational data in the real world: knowledge graphs (KGs) and information networks. GE handles them via Knowledge Graph Embedding (KGE) and Network Embedding (NE) respectively. Our DiCGRL framework can adapt to both of these typical GE methods, which demonstrates the generalization ability of our model.
KGE is an active research area recently, which can mainly be divided into two categories tackling the link prediction task (Ji et al., 2020). One line of work is reconstruction-based models, which reconstruct the head/tail entity's embedding of a triplet using the relation and tail/head embeddings, such as TransE (Bordes et al., 2013), RotatE (Sun et al., 2019), and ConvE (Dettmers et al., 2018). Another line of work is bilinear-based models, which consider link prediction as a semantic matching problem. They take the head, tail and relation embeddings as inputs, and measure a semantic matching score for each triplet using a bilinear transformation (e.g., DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), ConvKB (Nguyen et al., 2018)). Besides KGE, NE is also widely explored in both academia and industry. Early works (Perozzi et al., 2014; Tu et al., 2016; Tang et al., 2015b) focus on learning static node embeddings on information graphs. More recently, graph neural networks (Bruna et al., 2014; Henaff et al., 2015; Veličković et al., 2018) have attracted considerable attention and achieved remarkable success in learning network embeddings. However, most existing GE models assume the training data is static, i.e., does not change over time, which makes them impractical in real-world applications.

Continual Learning
Continual learning, also known as life-long learning, helps alleviate catastrophic forgetting and enables incremental training on streaming data. Methods for continual learning in the natural language processing (NLP) field can mainly be divided into two categories: (1) consolidation-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017), which slow down parameter updating to preserve old knowledge, and (2) memory-based methods (Lopez-Paz and Ranzato, 2017; Shin et al., 2017; Chaudhry et al., 2019; Wang et al., 2019), which retain examples from old data for replay upon learning the new data. Although continual learning has been widely studied in NLP and computer vision (Kirkpatrick et al., 2017), its exploration on graph embedding is relatively rare. Sankar et al. (2018) seek to train graph embeddings on constantly evolving data. However, they assume the timestamp information is known beforehand, which hinders the application of their method to other tasks. Song and Park (2018) extend the idea of regularization-based methods to continually learn graph embeddings, straightforwardly limiting parameter updating to only the embedding layer; it is therefore hard to generalize to more complex multi-relational data. Our proposed DiCGRL model is distinct from previous works in two aspects: (1) our method does not require pre-annotated timestamps, which makes it more feasible for various types of multi-relational data; (2) inspired by procedural knowledge learning, we exploit disentanglement to conduct continual learning and achieve promising results.

Task Formulation and Overall Framework
We represent multi-relational data as a multi-relational graph G = (V, E), where V and E denote the node set and the edge set of G. G can be formalized as a set of relational triplets {(u, r, v)}, where u, v ∈ V and r is the relation connecting them. Given a relational triplet (u, r, v) ∈ G, we denote the embeddings of u, v and r as u, v ∈ R^d and r ∈ R^l, where d and l indicate the vector dimensions. Continual graph representation learning trains graph embedding (GE) models on constantly growing multi-relational data, where the i-th part of the multi-relational data has its own training set T_i, validation set V_i, and query set Q_i. The i-th training set is defined as a set of relational triplets, i.e., T_i = {(u_j, r_j, v_j)}_{j=1}^{N}, where N is the instance number of T_i. The i-th validation and query sets are defined similarly. As new relational triplets emerge, continual graph representation learning requires GE models to achieve good results on all previous query sets. Therefore, after training on the i-th training set T_i, GE models are evaluated on Q̃_i = ∪_{j=1}^{i} Q_j to measure whether they model both new and old multi-relational data well. This evaluation protocol implies that achieving high performance becomes more and more difficult as new relational triplets emerge.
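The training-and-evaluation protocol above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `train` and `evaluate` are hypothetical stand-ins for any GE model's routines, and each part is a (T_i, V_i, Q_i) triple of triplet sets.

```python
# Sketch of the continual evaluation protocol: after training on part i,
# the model is evaluated on the union of all query sets seen so far (Q~_i).
def continual_run(model, parts, train, evaluate):
    """parts: list of (T_i, V_i, Q_i) tuples of relational-triplet sets."""
    results = []
    seen_queries = []
    for T_i, V_i, Q_i in parts:
        train(model, T_i, V_i)            # fit on the i-th training set
        seen_queries.append(Q_i)
        # accumulate Q~_i = Q_1 u ... u Q_i and evaluate on it
        union = [q for Q in seen_queries for q in Q]
        results.append(evaluate(model, union))
    return results
```

Because the query sets accumulate, any forgetting of earlier parts directly lowers the later scores, which is what makes this protocol a forgetting benchmark.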
In general, our model continually learns on the streaming data. Whenever a new part of multi-relational data arrives, DiCGRL learns the new graph embeddings while preventing catastrophic forgetting of old knowledge through two procedures: (1) Disentangle module. It decouples the relational triplets in the graph into multiple components according to their semantic aspects, and learns disentangled graph embeddings that divide node embeddings into multiple independent components, where each component describes one semantic aspect of the node; (2) Updating module. When new relational triplets arrive, DiCGRL first activates those old relational triplets from previous graphs whose semantic aspects are relevant to the new ones, and only updates the corresponding components of their graph embeddings.

Disentangle Module
When the i-th training set T_i becomes available, DiCGRL needs to update the graph embeddings according to the new relational triplets. To this end, for each node u ∈ V, we want to learn a disentangled node embedding u composed of K independent components, i.e., u = [u_1, u_2, ..., u_k, ..., u_K], where 1 ≤ k ≤ K and u_k ∈ R^d. The component u_k represents the k-th semantic aspect of node u. As shown in Figure 2, the key challenge of the disentangle module is how to decouple the relational triplets into multiple components according to their semantic aspects, and how to learn the disentangled graph embeddings of different components independently.
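The component layout can be pictured as a simple reshape. This is an illustrative sketch only: the dimensions are made up, and storing each component as a slice of one flat vector is one plausible reading of the paper's layout, not a confirmed implementation detail.

```python
import numpy as np

# Minimal sketch of a disentangled node embedding: a flat vector u stored
# as K independent components, components[k] playing the role of u_k.
d, K = 8, 4
u = np.arange(d, dtype=float)
components = u.reshape(K, d // K)   # components[k] is the k-th semantic aspect
```

The point of the layout is that a later update can touch `components[k]` for a few selected k while leaving all other rows untouched.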
Formally, given a relational triplet (u, r, v) ∈ T_i, we aim to extract the most related semantic components of u and v with respect to the relation r. Specifically, we model this process with an attention mechanism, where (u, r, v) is associated with K attention weights (α^1_r, α^2_r, ..., α^K_r), which represent the probability of the triplet being assigned to each of the K semantic components. After that, we select the top-n related components of u and v, i.e., those with the highest attention weights. Then we leverage existing GE methods to extract different features in the selected top-n components, and we denote this feature extraction operation as f. Here, f can be any graph embedding operation that incorporates the features of nodes u and v in the selected top-n components. In this work, we adapt our DiCGRL model to two typical types of graph embeddings:

Knowledge Graph Embeddings (KGEs) Intuitively, the most related semantic components of a relational triplet in a KG are determined by its relation r. Therefore, we directly set K attention values for each explicit relation r, where the k-th attention value a^k_r (1 ≤ k ≤ K) is a trainable parameter indicating how related this edge is to the k-th component. The normalized attention weight α^k_r is computed as:

α^k_r = exp(a^k_r) / Σ_{k'=1}^{K} exp(a^{k'}_r). (1)

As described in the related work, KGE models can mainly be divided into two categories: reconstruction-based and bilinear-based models. We explore the effectiveness of both lines of work for feature extraction in our framework.
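The softmax normalization and top-n selection above can be sketched as follows; `select_components` is a hypothetical helper name, and the attention scores are arbitrary example values.

```python
import numpy as np

# Sketch of Eq. (1): each relation r owns K trainable scores a_r; the
# normalized weights alpha_r then pick the top-n semantic components.
def select_components(a_r, n):
    alpha = np.exp(a_r) / np.exp(a_r).sum()   # softmax over K components
    top = np.argsort(-alpha)[:n]              # indices of the n largest weights
    return alpha, sorted(top.tolist())

alpha, top = select_components(np.array([2.0, 0.1, 1.5, -1.0]), n=2)
```

Here components 0 and 2 carry the largest weights, so only those two slices of the node embeddings would be read and updated for this relation.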
Specifically, we leverage two classic KGE models as f to extract latent features in our experiments, including TransE (reconstruction-based):

f(u, r, v) = −||û + r − v̂||_p, (2)

and ConvKB (bilinear-based):

f(u, r, v) = Conv([û; r; v̂]) W_1, (3)

where û and v̂ are the concatenations of the top-n component embeddings of node u and node v respectively; || · ||_p denotes the p-norm operation; [·; ·] denotes the concatenation operation; Conv(·) indicates a convolutional layer with M filters, and W_1 is a trainable matrix mapping the concatenated feature maps to a scalar score. Overall, f is expected to give higher scores to valid triplets than to invalid ones.
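The TransE-style scorer on the selected components can be sketched as below. The sign convention (negated p-norm, so higher means more plausible) is an assumption made to match the statement that f gives higher scores to valid triplets; the vectors are toy values.

```python
import numpy as np

# Sketch of Eq. (2): u_hat, v_hat are the concatenated top-n component
# embeddings; a valid triplet should satisfy u_hat + r ~= v_hat.
def transe_score(u_hat, r, v_hat, p=1):
    return -np.linalg.norm(u_hat + r - v_hat, ord=p)

u_hat = np.array([1.0, 0.0])
r     = np.array([0.5, 0.5])
v_hat = np.array([1.5, 0.5])
score_valid   = transe_score(u_hat, r, v_hat)               # exact match
score_invalid = transe_score(u_hat, r, np.zeros(2))          # corrupted tail
```

Because only the top-n components enter û and v̂, the score (and hence its gradient) never touches the remaining components, which is what keeps unrelated semantic aspects frozen.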
Network Embeddings (NEs) Since NE usually does not provide explicit relations, we first determine α^k_r from the representations of node u and node v. Hence, α^k_r is calculated by performing a non-linear transformation over the concatenation of the corresponding components of u and v:

α^k_r = exp(σ'(W_2 [u_k; v_k])) / Σ_{k'=1}^{K} exp(σ'(W_2 [u_{k'}; v_{k'}])), (4)

where W_2 ∈ R^{1×2d} is a trainable matrix and σ' is a non-linearity. Graph attention networks (GATs) (Veličković et al., 2018) gather information from a node's neighborhood and assign varying levels of importance to neighbors, which is a widely used and powerful way to learn embeddings for information networks. Thus we leverage GATs as f to extract latent features for NE. Given a target node u and its neighbors {v | v ∈ N_u}, we first determine the top-n related components for each pair of nodes (u, v) according to the attention weights α^k_r. When updating the k-th component of u, a neighbor v is considered if and only if the k-th component is among the top-n related components for the node pair (u, v). In this way, we can thoroughly disentangle the neighbors of the target node into different components so that they play their roles separately. GATs are used to update each component as follows:

u_k ← σ( Σ_{v ∈ N_u^k} β_{uv} W_4 v_k ), β_{uv} = σ( W_3 [W_4 u_k; W_4 v_k] ), (5)

where N_u^k is the set of neighbors considered for the k-th component, W_3 ∈ R^{1×d} and W_4 ∈ R^{h×d} are two trainable matrices, h is the hidden size within GATs, and σ is the softmax function used to calculate each neighbor's relative attention value in the k-th component.
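The pair-wise attention of Eq. (4) can be sketched as below. This assumes the per-component reading of the equation (attention for component k computed from the k-th components of u and v); `tanh` is a placeholder non-linearity and `W2` stands in for the trainable matrix.

```python
import numpy as np

# Sketch of Eq. (4): without explicit relations, the component weights are
# computed from the node pair itself via a learned projection over [u_k; v_k].
def pair_attention(u_comps, v_comps, W2):
    # u_comps, v_comps: (K, d) component matrices; W2: weight vector of size 2d
    scores = np.array([np.tanh(W2 @ np.concatenate([uk, vk]))
                       for uk, vk in zip(u_comps, v_comps)])
    e = np.exp(scores)
    return e / e.sum()                    # normalized alpha^k over K components

u = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.zeros((3, 2))
alpha = pair_attention(u, v, np.ones(4))
```

The same weights then decide which components of u each neighbor v is allowed to influence in the GAT-style update.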

Updating Module
Now the remaining problem is how to update the disentangled graph embeddings when new relational triplets appear while preventing catastrophic forgetting. As shown in Figure 3, this process mainly includes two steps: (1) Neighbor activation: DiCGRL needs to identify which relational triplets from T_1, ..., T_{i−1} need to be updated. In multi-relational data, nodes are not independent, so a new relational triplet may influence the embeddings of nodes that are not directly connected to it. Inspired by this, for each relational triplet (u, r, v), we activate both its direct and indirect neighbor triplets. Specifically, the neighbors of a triplet (u, r, v) ∈ T_i are all triplets that contain node u or node v in the previous multi-relational graphs (T_1, ..., T_{i−1}). In practice, adding all neighbors to T_i is occasionally computationally expensive, since some nodes have very high degrees (the highest degree in the FB15k-237 dataset is 7,614), i.e., a huge number of neighbors. Therefore, we leverage a selection mechanism inspired by the human ability to learn procedural knowledge introduced in Section 1 and only update a few related neighbors: for each (u, r, v), we only activate the neighbors with related semantic components (i.e., those that share at least one component among their top-n semantic components).
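The activation rule can be sketched as a simple predicate: an old triplet is activated if it touches a node of the new triplet and their top-n component sets overlap. The triplets and component indices below are invented for illustration.

```python
# Sketch of neighbor activation: `top` maps each triplet to its top-n
# semantic-component indices (as selected by the disentangle module).
def activated(old, new, top):
    share_node = len({old[0], old[2]} & {new[0], new[2]}) > 0
    share_comp = len(set(top[old]) & set(top[new])) > 0
    return share_node and share_comp

top = {("A", "r1", "B"): [0, 2],
       ("B", "r2", "C"): [1, 3],
       ("B", "r3", "D"): [2, 5]}
new = ("A", "r4", "B")
top[new] = [2, 4]
olds = [t for t in top if t != new and activated(t, new, top)]
```

Here ("B", "r2", "C") shares node B with the new triplet but no top component, so it stays frozen, mirroring the "family"/"award" examples in the paper.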
(2) Components updating: It is not necessary to update all semantic components of the activated neighbors. For example, if a relational triplet (u′, r′, v′) ∈ (T_1, ..., T_{i−1}) is activated by (u, r, v) ∈ T_i, we only need to update the common components, i.e., the shared top-n components, since the semantics of the other components do not change. We use existing GE methods to update their features, as explained in the disentangle module. Generally speaking, in each epoch, we iteratively train on the new relational triplets and on the relevant semantic aspects of the activated neighbor triplets. Through this training process, our model not only effectively prevents catastrophic forgetting but also learns the embeddings of the new data.
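Restricting the update to the shared components amounts to masking the gradient step, which can be sketched as below (a hypothetical helper, not the paper's optimizer; a full implementation would instead zero gradients inside the training framework).

```python
import numpy as np

# Sketch of components updating: apply a gradient step only to the shared
# top-n components of an activated neighbor, leaving the rest frozen.
def masked_update(components, grad, shared, lr=0.1):
    out = components.copy()
    for k in shared:                    # only the shared components move
        out[k] -= lr * grad[k]
    return out

comps = np.ones((4, 2))                 # K = 4 components of dimension 2
grad  = np.ones((4, 2))
updated = masked_update(comps, grad, shared=[1, 3])
```

Components 0 and 2 are untouched, so whatever old knowledge they encode survives the update of the activated neighbor.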

Training Objective
As mentioned before, for newly arrived multi-relational data T_i, we iteratively train our model on T_i and its activated neighbor relational triplets. We denote the loss functions of these two parts as L_new and L_old respectively. For KGE, we utilize a soft-margin loss to train DiCGRL on the link prediction task. The loss function L_new can be defined as:

L_new = Σ_{(u,r,v) ∈ T_i ∪ T_i*} log(1 + exp(−y · f(u, r, v))), (6)

where T_i* represents a set of invalid relational triplets for T_i; y = 1 if (u, r, v) ∈ T_i, otherwise y = −1. For NE, we leverage a standard cross-entropy loss following GATs and train our model on the node classification task:

L_new = − Σ_{u ∈ N(T_i)} Σ_{c=1}^{|C|} y(c) log(softmax(W_5 u)_c), (7)

where c is a node class and y(c) = 1 if the node label is c, otherwise y(c) = 0; N(T_i) is the node set of T_i, |C| indicates the number of classes, and W_5 ∈ R^{|C|×d} is a trainable matrix. For both KGE and NE, L_old is defined in the same way as L_new on the set of selected old relational triplets.
Intuitively, the fewer components a relation focuses on, the better the disentanglement. Therefore, we add a constraint loss term L_norm for T_i to encourage the sum of the attention weights of the top-n selected components to reach 1, i.e.,

L_norm = Σ_{(u,r,v) ∈ T_i} |1 − Σ_{k ∈ top-n} α^k_r|, (8)

where n indicates the number of selected components. The overall loss function L of our proposed model is defined as follows:

L = L_new + L_old + β L_norm, (9)

where β is a hyper-parameter.
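The two ingredients of the objective can be sketched numerically. Both functions are illustrative stand-ins written from the equations above: the exact forms of the soft-margin loss and the norm constraint in the released code may differ.

```python
import numpy as np

# Sketch of the training objective: soft-margin link-prediction loss (Eq. 6)
# plus the disentanglement constraint L_norm (Eq. 8), combined as
# L = L_new + L_old + beta * L_norm (Eq. 9).
def soft_margin(scores, labels):
    # labels: +1 for valid triplets, -1 for invalid ones
    return np.log1p(np.exp(-labels * scores)).sum()

def norm_constraint(alpha, top):
    # penalize attention mass escaping the top-n selected components
    return abs(1.0 - alpha[top].sum())

alpha = np.array([0.6, 0.3, 0.05, 0.05])
L_norm = norm_constraint(alpha, [0, 1])   # 10% of the mass is outside top-2
```

A well-disentangled relation concentrates nearly all of its attention mass on its top-n components, driving L_norm toward zero.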

Experiments
In this section, we evaluate our model on two popular tasks: link prediction for knowledge graph and node classification for information network.

Datasets
We conduct experiments on several continual learning datasets adapted from existing graph embedding benchmarks: KGE datasets. We consider two link prediction benchmark datasets, namely FB15K-237 (Toutanova and Chen, 2015) and WN18RR (Dettmers et al., 2018). We randomly split each benchmark dataset into five parts to simulate real-world scenarios, with ratios of 0.8 : 0.05 : 0.05 : 0.05 : 0.05. We further divide each part into a training set, a validation set and a query set. The statistics of the FB15k-237 and WN18RR datasets are presented in Appendix C.
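The dataset construction can be sketched as a shuffle-and-slice over the benchmark triplets. The ratios come from the paper; the seed and the rounding policy are arbitrary choices for the sketch.

```python
import random

# Sketch of splitting a benchmark into five sequential parts
# with ratio 0.8 : 0.05 : 0.05 : 0.05 : 0.05.
def split_parts(triplets, ratios=(0.8, 0.05, 0.05, 0.05, 0.05), seed=0):
    t = triplets[:]
    random.Random(seed).shuffle(t)
    parts, start = [], 0
    for r in ratios:
        end = start + int(round(r * len(triplets)))
        parts.append(t[start:end])
        start = end
    parts[-1].extend(t[start:])       # absorb any rounding leftovers
    return parts

parts = split_parts(list(range(100)))
```

Each resulting part is then itself divided into train/validation/query subsets, as described above.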
NE datasets. We conduct our experiments on three real-world information networks for the node classification task: Cora, CiteSeer and PubMed (Sen et al., 2008). The nodes, edges and labels in these three citation datasets represent articles, citations and research areas respectively, and the nodes come with rich features. Like the KGE datasets, we split each dataset into four parts with a partition ratio of 0.7 : 0.1 : 0.1 : 0.1, and further split each part into train/validation/query sets. The statistics of Cora, CiteSeer and PubMed are presented in Appendix C.

Experimental Settings
We use Adam (Kingma and Ba, 2015) as the optimizer and fine-tune the hyper-parameters on the validation set for each task. We perform a grid search over the following hyper-parameters: the number of components K ∈ {2, 4, 6, 8, 10}, the number of top components n ∈ {2, 4}, the node embedding dimension d ∈ {100, 200} (note that the relation embedding dimension in KG is l = d×n/K, and d is fixed to the feature length in information networks), the initial learning rate lr ∈ {0.001, 0.005}, and the weight of the regularization loss L_norm, β ∈ {0.1, 0.3}. The optimal hyper-parameters on the FB15k-237 dataset are: K = 8, n = 4, d = 200, lr = 0.001, β = 0.3; those on the WN18RR dataset are: K = 4, n = 2, d = 200, lr = 0.001, β = 0.1. For the NE datasets, the optimal hyper-parameters are: K = 8, n = 4, lr = 0.005, β = 0.1. For a fair comparison, we implement the baseline models (TransE, ConvKB, and GATs) ourselves based on the released codes, and use the same hyper-parameters as DiCGRL. For example, the embedding dimensions for TransE and ConvKB are both 200, and the number of heads in GATs is 8.
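The grid described above can be enumerated in a couple of lines; the grid values are taken from the paper, while the enumeration itself is just a sketch of the search procedure (the actual search would train and score each configuration on the validation set).

```python
from itertools import product

# Sketch of the hyper-parameter grid search over the values listed above.
grid = {"K": [2, 4, 6, 8, 10], "n": [2, 4], "d": [100, 200],
        "lr": [0.001, 0.005], "beta": [0.1, 0.3]}
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
```

The grid contains 5 × 2 × 2 × 2 × 2 = 80 configurations per dataset, small enough for exhaustive search.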
As continual learning on multi-relational data is not task-dependent as in previous works, we implement the baseline models ourselves based on the toolkit released by Han et al. (2020). For a fair comparison, we use the same embedding dimension (i.e., d = K × n) and the same number of replay instances for both our model and the baselines. For the other hyper-parameters, we follow the settings in Han et al. (2020).
Following existing works (Bordes et al., 2013), the evaluation metrics for the link prediction task are mean reciprocal rank (MRR) and the proportion of valid test triplets ranked in the top 10 (H@10). For the node classification task, we use accuracy as the evaluation metric. We use two settings to evaluate the overall performance of DiCGRL after learning on all graphs: (1) whole performance, which calculates the evaluation metrics on the whole test set of all data; (2) average performance, which averages the evaluation metrics over all test sets. Since average performance highlights how well a model handles the catastrophic forgetting problem, it is our main evaluation metric.
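The difference between the two settings can be made concrete with a toy example. The helper below is a sketch with invented numbers: per-set hit counts and set sizes stand in for any metric's raw counts.

```python
# Sketch of the two evaluation settings: "whole" pools every test set into
# one metric, while "average" computes the metric per set and averages, so
# a collapse on an old (small) set is not masked by a large new one.
def whole_and_average(per_set_hits, per_set_sizes):
    whole = sum(per_set_hits) / sum(per_set_sizes)
    average = sum(h / s for h, s in zip(per_set_hits, per_set_sizes)) / len(per_set_sizes)
    return whole, average

# a model that scores 90/100 on the new set but only 5/50 on the old one
whole, average = whole_and_average([90, 5], [100, 50])
```

Here the whole score (about 0.63) looks respectable, while the average score (0.50) exposes the forgetting on the old set, which is why average performance is the main metric.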

Baselines
We compare our model with several baselines, including two theoretical models that measure the lower and upper bounds of continual learning: (1) Lower Bound, which continually fine-tunes models on the new multi-relational data without memorizing any historical instances; (2) Upper Bound, which continually re-trains models with all historical and new incoming instances, and thus serves as the ideal upper bound for continual learning performance; and several typical continual learning models: (3) EWC (Kirkpatrick et al., 2017), which adopts elastic weight consolidation to regularize parameter changes. It uses Fisher information to measure parameter importance relative to old data and slows down the updates of those important parameters when learning new data; (4) EMR (Parisi et al., 2019), a basic memory-based method which memorizes a few historical instances and conducts memory replay. Each time new data comes in, EMR mixes memorized instances with new instances to fine-tune the models; (5) GEM (Lopez-Paz and Ranzato, 2017), which memorizes a few historical instances and constrains the directions of new gradients so that they do not conflict with the optimization directions on old data.

Table 2: Accuracy results of models on three information network benchmarks (%). "Lower" and "Upper" are abbreviations of the "Lower Bound" and "Upper Bound" baselines.

Table 1 and Table 2 show the overall performance on both KGE and NE benchmarks under the two evaluation settings. From the tables, we can see that:

Overall Results
(1) Our proposed DiCGRL model significantly outperforms the other baselines and achieves state-of-the-art performance in almost all settings and datasets. This verifies the effectiveness of our disentangled approach to continual learning, which decouples the node embeddings into multiple components with respect to their semantic aspects and only updates the corresponding components of the graph embeddings for new relational triplets.
(2) There is still a large gap between our model and the upper bound. This indicates that although we have proposed an effective approach for continual graph representation learning, it remains an open problem deserving further exploration.
(3) Although DiCGRL outperforms the other baselines in almost all settings on the three information network benchmarks, the performance gain is not as high as on the KG datasets. The reason is that these three citation benchmarks come with rich node features, which reduces the impact of topology changes. As can be seen, even the weakest Lower Bound achieves relatively high results, close to the Upper Bound. To further investigate how the evaluation metrics change while learning new relational triplets, we show the average performance on the KG and NE datasets at each part in Figure 4 and Figure 5. From the figures, we observe that: (1) With increasing numbers of new relational triplets, the performance of all models on almost all datasets decreases to some degree (CiteSeer may introduce some instability in the random data splitting since this dataset is small). This indicates that catastrophically forgetting old data is inevitable and is indeed one of the major difficulties of continual graph representation learning.
(2) The memory-based method GEM outperforms the consolidation-based methods, which suggests that memory-based methods may be more suitable for alleviating catastrophic forgetting in multi-relational data.
(3) Our proposed DiCGRL model achieves significantly better results compared to other baseline models. It indicates that disentangling relational triplets and updating dynamically selected components of relational triplets are more useful and reasonable than rote memorization of static examples from old multi-relational data.
In addition, we evaluate our model under noncontinual learning settings to illustrate the superiority of our disentangled approach, and the results are presented in Appendix A.

Hyper-Parameter Sensitivity
In this section, we investigate the effect of the number of components K and the number of top selected components n, which are important hyper-parameters of DiCGRL. These experiments are only performed on the NE datasets, since in the KG setting the node embedding dimension d is affected by K and n, as introduced in Section 4.2, which would make a fair comparison difficult.
Component Number K: We use n = 2, β = 0.1 to run DiCGRL on the Cora dataset with five different settings of K. The results are illustrated in Figure 6(a). From the figure, we find that the average accuracy generally rises as K increases from 2 to 8, which suggests the importance of disentangling components. However, when K grows larger than 8, the performance starts to decline. One possible reason is that the number of components then exceeds the number of semantic aspects, making it harder to achieve good disentanglement. Therefore, we select the component number K for each dataset on the development set, and for most datasets we select K = 4 or K = 8.
Top Selected Component Number n: For a fair comparison, we set K = 8, β = 0.1 and vary n on the Cora dataset. As shown in Figure 6(b), except for the case of n = 1, the settings have comparable performance. However, when n = 4 the average accuracy on the last task is the highest, which indicates that the model is strongest at avoiding the catastrophic forgetting problem when n = 4.

Efficiency Analysis
We show the training time of different continual learning methods on the biggest benchmark, FB15k-237, to highlight the efficiency gap between methods. For a fair comparison, all algorithms use the same hyper-parameters. Although our model also requires some previous data, it consumes much less time than GEM and EMR, which verifies the efficiency of our disentangled approach in continual learning.

Case Study
In this section, we visualize an example from the FB15k-237 dataset (more readable than the other datasets) to show that the neighbors activated by our updating module are in line with human common sense. As shown in Figure 8(a), newly arrived relational triplets such as (Robert Clohessy, award nominee, Jack Huston) and (Robert Clohessy, award winner, Dominic Chianese) are both related to the "award" semantic aspect. Therefore, only the "award"-related neighbors of the new triplets are updated, such as (Jack Huston, nominated for, Boardwalk Empire), since Robert Clohessy is also very likely to be related to the movie Boardwalk Empire. Meanwhile, relational triplets of place of birth and gender, which are not related to "award", are not updated. Moreover, to verify that the learned representations satisfy the intuition that different relations focus on different components of entities, we plot the attention values over the components of the entity Britain in Figure 8(b), where the y-axis lists sampled relations that appear in the same triplets as Britain. We observe that semantically similar relations have similar attention value distributions. For example, the relations "gdp nominal", "gdp real", "dated money" and "ppp dollars" are all related to economics, and the relations "olympic medal", "olympics" and "medal won" are all related to Olympic competitions. These results demonstrate that the disentangled representations learned by DiCGRL are semantically meaningful.

Conclusion and Future Work
In this paper, we propose to study the problem of continual graph representation learning, aiming to handle the streaming nature of emerging multi-relational data. To this end, we propose a disentangle-based continual graph representation learning (DiCGRL) framework, inspired by the human ability to learn procedural knowledge. Extensive experiments on several typical KGE and NE datasets show that DiCGRL achieves consistent and significant improvements compared to existing continual learning models, which verifies the effectiveness of our model in alleviating the catastrophic forgetting problem. In the future, we will explore extending the idea of disentanglement to the continual learning of other NLP tasks.

A Results under Non-Continual Learning Settings
Results under non-continual learning settings, i.e., using the entire training sets to train the models, are presented in Table 3 and Table 4. For a fair comparison, we also reproduce the results of the baselines, using the same optimal hyper-parameters as in the Experimental Settings section of the main paper. DiCGRL (T) and DiCGRL (C) indicate using TransE and ConvKB, respectively, as the feature extraction method for DiCGRL. DiCGRL (·)_α1 represents that α^k is calculated by performing a non-linear transformation over the concatenation of u and v, as is done for Network Embedding, and DiCGRL (·)_α2 represents that α^k is calculated by performing a non-linear transformation over the concatenation of u, r and v, since relations are an integral part of KGs. From the tables, we can see that: (1) DiCGRL is comparable with our reproduced baselines; in particular, on the FB15k-237 and Cora datasets, DiCGRL performs better than the vanilla GE methods. This phenomenon shows the effectiveness of our disentangled approach of decoupling the relational triplets in the graph into multiple independent components according to their semantic aspects.
(2) As shown in Table 3, the performance of DiCGRL (·)_α1 and DiCGRL (·)_α2 is worse than that of DiCGRL, and even worse than the original baselines in some settings. This indicates that assigning global attention values to each relation, as done in DiCGRL, is the better option for KG datasets.

Table 4: Node classification results on three whole information networks. The best score is in bold.

B PubMed Result
The results of DiCGRL on the PubMed data is shown in Figure 9.

C Dataset Statistics
The statistics of FB15k-237 and WN18RR are presented in Table 5, where "Pi" denotes the i-th part, "# Accumulated Entities" and "# Accumulated Relations" represent the cumulative entities and relations after each new part of multi-relational data is generated. Statistics of Cora, Citeseer, and PubMed are presented in Table 6, where "# Acc Nodes" and "# Acc Edges" represent the cumulative nodes and edges after each new part of multi-relational data is generated.