Knowledge Router: Learning Disentangled Representations for Knowledge Graphs

The design of expressive representations of entities and relations in a knowledge graph is an important endeavor. While many existing approaches have primarily focused on learning from relational patterns and structural information, the intrinsic complexity of KG entities has been largely overlooked. More concretely, we hypothesize that KG entities may be more complex than they appear, i.e., an entity may wear many hats, and relational triplets may form for more than a single reason. To this end, this paper proposes to learn disentangled representations of KG entities via a new method that separates the inner latent properties of each entity. Our disentanglement process operates at the graph level and leverages a neighborhood routing mechanism to disentangle the hidden properties of each entity. This disentangled representation learning approach is model agnostic and compatible with canonical KG embedding approaches. We conduct extensive experiments on several benchmark datasets, equipping a variety of models (DistMult, SimplE, and QuatE) with our proposed disentangling mechanism. Experimental results demonstrate that our approach substantially improves performance on key metrics.


Introduction
Knowledge graphs (KGs) have emerged as a compelling abstraction for organizing structured knowledge, and they play crucial roles in many machine learning tasks. A knowledge graph represents a collection of linked data, describing entities of interest and the relationships between them. To incorporate KGs into other machine learning systems, a prevalent approach is to map the entities and relations of a knowledge graph into expressive representations in a low-dimensional space that preserve the relationships among objects, also known as knowledge graph embeddings. Representative work such as (Bordes et al., 2013; Yang et al., 2014; Sun et al., 2019; Zhang et al., 2019; Chami et al., 2020) has attracted intensive attention in recent years.
The substantial effectiveness of recent work can be attributed to relational pattern modeling, in which a suitable relational inductive bias is used to fit the structural information in the data. Nevertheless, these methods ignore the fact that the origination and formation of KGs can be rather complex (Ehrlinger and Wöß, 2016). They may be collected, mined, handcrafted, or merged in a complicated process (Ji et al., 2017; Bosselut et al., 2019; Qin et al., 2018). Consequently, entities in a knowledge graph may be highly entangled, and relational triplets may be constructed for various reasons under a plethora of different circumstances or contexts, with multiple contextual reasons and/or domains at play at the same time. As such, it is only natural that KG embedding methods trained in a holistic fashion would produce highly entangled latent factors, and the existing holistic approaches, which fail to disentangle such factors, may yield sub-optimal solutions.
Recently, disentangled representation learning has achieved state-of-the-art performance and attracted much attention in the field of visual representation learning. A disentangled representation should separate the distinct, informative factors of variation in the data (Bengio et al., 2013). Disentangling the latent factors hidden in the observed data can not only increase robustness, making the model less sensitive to misleading correlations, but also enhance model explainability. Disentanglement can be achieved with either supervised signals or unsupervised approaches. (Zhu et al., 2014) propose to untangle the identity and view features in a supervised face recognition task. A bilinear model is adopted in (Tenenbaum and Freeman, 2000) to separate content from style. There is also a large body of work on unsupervised disentangled representation learning (Chen et al., 2016; Denton et al., 2017; Higgins et al., 2016). Generally, the disentanglement mechanism is integrated into unsupervised learning frameworks such as variational autoencoders (Kingma and Welling, 2013) and generative adversarial networks (Goodfellow et al., 2014). The quality of unsupervised disentangled representations can even match that of representations learned from supervised label signals.
Inspired by the success of disentangled representation learning, we seek to enhance the disentanglement capability of entity representations in knowledge graphs. Our hope is that this idea can address the aforementioned challenge in learning entity embeddings, that is, enabling entity embeddings to better reflect their inner properties. Unlike in visual data, it is more challenging to learn disentangled representations for discrete relational data. Most KG embedding approaches operate at the triplet level, which is uninformative for disentanglement. Intuitively, information about the entities resides largely within the graph, encoded through neighborhood structures. Our assumption is that an entity connects with a certain group of entities for a certain reason. For example, Tim Robbins, as an actor, starred in films such as The Shawshank Redemption; as a musician, he is a member of the folk music group The Highwaymen. We believe that relational triplets form because of different factors, and these factors can be disentangled when looking at the graph level.
To summarize, our key contributions are: (1) We propose Knowledge Router (KR), an approach that learns disentangled representations for entities in knowledge graphs. Specifically, a neighbourhood routing mechanism disentangles the hidden factors of entities from interactions with their neighbors. (2) Knowledge Router is model agnostic, which means that it can be combined with different canonical knowledge graph embedding approaches, enabling those models to learn disentangled entity representations without incurring additional free parameters. (3) We conduct extensive experiments on four publicly available datasets to demonstrate the effectiveness of Knowledge Router. We apply Knowledge Router to models such as DistMult, SimplE, and QuatE and observe notable performance enhancements. We also conduct model analysis to inspect the inner workings of Knowledge Router.

Learning Disentangled Representations
Learning representations from data is a key challenge in many machine learning tasks. The primary premise of disentangled representation learning is that disentangling the underlying structure of data into disjoint parts can bring advantages.
Recently, there has been growing interest in learning disentangled representations across various applications. A trending line of work integrates disentanglement into generative models. (Tran et al., 2017) propose a disentangled generative adversarial network for face recognition and synthesis. The learned representation is explicitly disentangled from pose variation to make it pose-invariant, which is critical for face recognition and synthesis tasks. (Denton et al., 2017) present a disentangled representation learning approach for videos, which separates each frame into a time-independent component and a temporal-dynamics-aware component; as such, it can reflect both the time-invariant and temporal features of a video. (Ma et al., 2018) propose a disentangled generative model for personal image generation. It separates out the foreground, background, and pose information, and offers a mechanism to manipulate these three components as well as control the generated images. Some works (Higgins et al., 2016; Burgess et al., 2018) (e.g., β-VAE) integrate a disentanglement mechanism with the variational autoencoder, a probabilistic generative model. β-VAE uses a regularization coefficient β to constrain the capacity of the latent information channel. This simple modification enables latent representations to be more factorised.
Drawing inspiration from the vision community, learning disentangled representations has also been investigated in areas such as natural language processing and graph analysis. (Jain et al., 2018) propose an autoencoder architecture to disentangle the populations, interventions, and outcomes in biomedical texts. A prism module has also been proposed for semantic disentanglement in named entity recognition; the prism module can be easily trained with downstream tasks to enhance performance. For graph analysis, (Ma et al., 2019a) propose to untangle the node representations of graph-structured data in graph neural networks. (Ma et al., 2019b) present a disentangled variational autoencoder that disentangles a user's diverse interests for recommender systems.

Knowledge Graph Embeddings
Learning effective representations for knowledge graphs has been extensively studied because of its importance in downstream tasks such as knowledge graph completion, natural language understanding, web search, and recommender systems. Among the large body of related literature, two popular lines are translational approaches and semantic matching approaches. The groundbreaking TransE (Bordes et al., 2013) set the fundamental paradigm for translational models. Typically, the aim is to reduce the distance between the relation-translated head entity and the tail entity. Successors such as TransH and TransR (Lin et al., 2015) follow this translational pattern. Semantic matching methods calculate the semantic similarities between entities. A representative semantic model is DistMult (Yang et al., 2014), which measures the plausibility of triplets with vector multiplications. To model more complex relation patterns, (Trouillon et al., 2016; Zhang et al., 2019; Sun et al., 2019; Zhang et al., 2021) extend the embedding space to the complex number space or hyperbolic space. A fully expressive model named SimplE (Kazemi and Poole, 2018) achieves the same level of expressive capability as ComplEx (Trouillon et al., 2016) with lower computational cost.
Inspired by the success of disentangled representations, we explore methods to factorize different components/aspects of entangled entities in a knowledge graph. To the best of our knowledge, our work is one of the first efforts to induce disentangled representations in knowledge graphs. Our disentangled embedding algorithm can be easily integrated into existing knowledge graph embedding models (model agnostic).

Notation and Problem Formulation
Suppose we have an entity set E and a relation set R. A knowledge graph is a set of facts F, where each fact is a triplet (h, r, t) with h, t ∈ E and r ∈ R. The triplet (h, r, t) ∈ F means that entities h and t are connected via relation r. Facts are usually directional, which means that exchanging the head entity and the tail entity does not necessarily result in a legitimate fact.
We are concerned with the link prediction task. The goal is to embed the entities and relations of a knowledge graph into low-dimensional representations that can preserve the facts in the graph. A classical setting is using an embedding matrix E ∈ R^{N×d} to represent all the entities and an embedding matrix W ∈ R^{M×d} to represent all the relations. Table 1 summarizes the notation.

Table 1: Notation.
E        The entity embedding matrix.
W        The relation embedding matrix.
E_e      The e-th row of the entity embedding matrix.
W_r      The r-th row of the relation embedding matrix.
d        The length of the embedding vector.
N(e)     The neighbourhood entity set of entity e.
K        The number of independent components.
T        The number of routing iterations.
x_{e,k}  The k-th initial vector for entity e.
p_{e,k}  The k-th vector of entity e after disentanglement.
s_{i,k}  The similarity score between entity e and entity i w.r.t. the k-th component.
w_{i,k}  The extent to which the model attends to the k-th component of entity i.
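To make this setup concrete, the following is a minimal PyTorch sketch of the classical embedding setting described above; the entity and relation counts are hypothetical placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration: N entities, M relations, dimension d.
N, M, d = 10000, 200, 100

E = nn.Embedding(N, d)  # entity embedding matrix E in R^{N x d}
W = nn.Embedding(M, d)  # relation embedding matrix W in R^{M x d}

# E_e and W_r in the notation above are single rows of these matrices:
e, r = torch.tensor([42]), torch.tensor([7])
E_e, W_r = E(e), W(r)   # each of shape (1, d)
```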

Disentangled Knowledge Graph Embeddings
Instead of directly modeling triplet facts, we propose to disentangle entities with their neighbors in a message passing setting. The neighborhood entities may form several clusters for different reasons, and the entity is updated with the information received from its neighborhood clusters. Figure 1 illustrates the overall process of Knowledge Router. It consists of two stages: (1) disentangling the entities from a graph perspective using neighbourhood routing; (2) scoring the facts using the relations and the disentangled entity representations.
Let us build an undirected graph from the training data. The relations are anonymized, which means we do not need to know under which conditions two entities are linked. We denote the neighbourhood of entity e as N(e), regardless of the relations. Our neighborhood routing approach operates on this graph.
Given an entity e, we aim to learn a disentangled embedding that encodes the various attributes of the entity. In this regard, we suppose that each entity is composed of K independent components, with each component denoted by p_{e,k} ∈ R^{d/K}, for k = 1, 2, ..., K. Each component stands for one aspect of the entity, e.g., a role of a person. A major challenge here is to make the learned K components independent of one another so that different facets can be separately encoded. To this end, we adopt a routing mechanism inspired by capsule networks (Hinton et al., 2011). Specifically, we aim to learn the K components from both the entity e and its neighbourhood N(e). Next, we describe this procedure in detail.

Figure 1: The overall procedure of the proposed Knowledge Router algorithm for learning disentangled entity representations. In this example, we disentangle the entity embedding into four components (K = 4) via neighborhood routing (iterated T times). These components are then concatenated to represent the corresponding entity.
For each entity e, we first initialize E_e randomly and evenly split it into K parts, with the k-th part denoted by x_{e,k} ∈ R^{d/K}. By doing so, the embedding is projected into K different subspaces. To ensure computational stability, each part is also normalized as follows:

x_{e,k} = x_{e,k} / ||x_{e,k}||_2    (1)

This is used as the initialization of p_{e,k}. Obviously, the information contained in it is limited and cannot by itself achieve disentanglement. To enrich the information, we use a graph message passing mechanism and define the update rule for the k-th component of p_e as follows:

p_{e,k} = AGGREGATE(x_{e,k}, {x_{i,k} : i ∈ N(e)})    (2)

where AGGREGATE is the neighborhood aggregation function (defined in equation (5)). The same ℓ2 normalization as in (1) is applied to p_{e,k} afterwards.
In this way, p_{e,k} contains information about the k-th aspect of both entity e and all of its neighbors. Common aggregation functions such as mean pooling and sum pooling are viable, but treating each neighbor equally when determining one component of the representation is clearly not sensible. As such, an attention mechanism is used to obtain a weight for each neighbor. In particular, scaled dot-product attention is applied: for each k, we take the dot product between p_{e,k} and x_{i,k}, ∀i ∈ N(e), giving the similarity score

s_{i,k} = p_{e,k}ᵀ x_{i,k} / √(d/K)    (3)

which indicates how entity e interacts with its neighbour entity i pertaining to aspect k. Then the softmax function is applied to obtain a weight distribution over the different components for each neighbour:
w_{i,k} = exp(s_{i,k}) / Σ_{k'=1}^{K} exp(s_{i,k'})    (4)

where w_{i,k} indicates the extent to which the model attends to the k-th component of entity i. Now, we can formulate the AGGREGATE function as follows:

AGGREGATE(x_{e,k}, {x_{i,k} : i ∈ N(e)}) = x_{e,k} + Σ_{i ∈ N(e)} w_{i,k} x_{i,k}    (5)

The above process for learning p_{e,k}, ∀k = 1, 2, ..., K, i.e., equations (2)-(5), is repeated for T iterations, in the same manner as a routing mechanism. Like capsule networks (Sabour et al., 2017), we assume that an entity (object) is composed of entity (object) parts. This routing method enables the model to capture part-whole relationships and enlarges the differences between the parts after several routing iterations.
Afterwards, the concatenation of all K components of an entity is used to represent that entity. That is, the disentangled representation p_e of entity e is defined as:

p_e = [p_{e,1}; p_{e,2}; ...; p_{e,K}]    (6)

This neighborhood routing algorithm is model agnostic, as its aim is simply to learn the entity embedding matrix required by most knowledge graph embedding methods. It is worth noting that it introduces no additional free parameters to the model.
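To make the routing procedure concrete, below is a minimal PyTorch-style sketch of equations (1)-(6). It assumes neighbor lists have been pre-padded to a fixed width (padding masks are omitted for brevity); the function name and tensor layout are illustrative, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def knowledge_router(E, neighbors, K, T):
    """Neighborhood routing sketch (equations 1-6).

    E         : (N, d) entity embedding matrix, d divisible by K
    neighbors : (N, n) padded neighbor indices for every entity
    K, T      : number of components / routing iterations
    """
    N, d = E.shape
    # Eq. (1): split each embedding into K chunks of size d/K, l2-normalize.
    x = F.normalize(E.view(N, K, d // K), dim=-1)                # (N, K, d/K)
    x_nbr = x[neighbors]                                         # (N, n, K, d/K)

    p = x                                                        # initialize p_{e,k}
    for _ in range(T):
        # Eq. (3): scaled dot-product similarity s_{i,k}.
        s = (p.unsqueeze(1) * x_nbr).sum(-1) / (d // K) ** 0.5   # (N, n, K)
        # Eq. (4): softmax over the K components for each neighbor.
        w = torch.softmax(s, dim=-1)                             # (N, n, K)
        # Eqs. (2) and (5): own chunk plus attention-weighted neighbor chunks.
        p = x + (w.unsqueeze(-1) * x_nbr).sum(dim=1)             # (N, K, d/K)
        p = F.normalize(p, dim=-1)                               # renormalize as in (1)
    # Eq. (6): concatenate the K components.
    return p.reshape(N, d)
```

Note that the routine adds no learnable parameters beyond E itself, which is what makes the algorithm model agnostic.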
The intuition behind the "routing mechanism" is that each facet of an entity has a separate route through which it contributes to the meaning of the entity. The routing algorithm coordinately infers p_{e,k} (which we can view as the center of each cluster) and w_{i,k} (the probability that factor k is the reason why entity e is connected with entity i). They are learned jointly under the constraint that each neighbor should belong to one cluster. This is reminiscent of the iterative procedure of the EM algorithm (Bishop, 2006) and is expected to lead to convergence and meaningful disentangled representations (Ma et al., 2019a).
Up to this point, the relation embeddings have not been utilized, as all relations are anonymized during graph construction. The algorithm is therefore jointly trained with the following fact scoring functions.

Facts Scoring using Disentangled Entities
Using the disentangled entity embeddings alone cannot recover the facts in a knowledge graph; they must be updated jointly with the relation embeddings through a fact scoring process. To predict whether a triplet (h, r, t) holds, we first fetch the learned disentangled representations of the head and tail entities, p_h and p_t. We then adopt three scoring methods: DistMult (Yang et al., 2014), SimplE (Kazemi and Poole, 2018), and QuatE (Zhang et al., 2019). We denote the resulting disentangled models as KR-DistMult, KR-SimplE, and KR-QuatE.
The scoring function of KR-DistMult is defined as follows:

φ(h, r, t) = ⟨p_h, W_r, p_t⟩    (7)

where ⟨·, ·, ·⟩ denotes the standard component-wise multi-linear dot product.
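As a sketch, the KR-DistMult score of equation (7) is a single line on top of the routed embeddings; here p_h and p_t are rows of the output of the routing sketch above, and W_r is the corresponding relation row.

```python
def kr_distmult_score(p_h, W_r, p_t):
    # <p_h, W_r, p_t>: component-wise product summed over the embedding dim.
    return (p_h * W_r * p_t).sum(-1)
```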
SimplE needs an additional entity embedding matrix H ∈ R^{N×d} and an additional relation embedding matrix V ∈ R^{M×d}. We perform the same disentanglement process on H and denote the resulting disentangled representation of entity e as q_e. The scoring function of KR-SimplE (SimplE-avg is adopted since it outperforms SimplE-ignr) is:

φ(h, r, t) = (⟨p_h, W_r, q_t⟩ + ⟨p_t, V_r, q_h⟩) / 2    (8)

In QuatE, entities and relations are represented with quaternions. Each quaternion is composed of one real component and three imaginary components. Let Q ∈ H^{N×d} denote the quaternion entity embedding matrix and W ∈ H^{M×d} denote the quaternion relation embedding matrix, where H is the quaternion space. Each entity is represented by Q_e. We apply the Knowledge Router algorithm to each component of Q_e. The scoring function of KR-QuatE is:

φ(h, r, t) = (Q_h^KR ⊗ W_r) · Q_t^KR    (9)

where "⊗" is the Hamilton product, "·" represents the quaternion inner product, and Q^KR denotes the entity representation after disentanglement.
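A corresponding sketch of the KR-SimplE score in equation (8); the pairing of the two embedding matrices follows the standard SimplE-avg formulation, which we assume here. A KR-QuatE sketch is omitted, as it would additionally require Hamilton-product machinery.

```python
def kr_simple_score(p_h, q_h, p_t, q_t, W_r, V_r):
    # SimplE-avg: average of the two directional trilinear products.
    # p_* are routed rows of E; q_* are routed rows of the extra matrix H;
    # V_r is the corresponding inverse-relation embedding row.
    forward  = (p_h * W_r * q_t).sum(-1)
    backward = (p_t * V_r * q_h).sum(-1)
    return 0.5 * (forward + backward)
```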
As Knowledge Router is model agnostic, other scoring functions are also applicable.

Objective Functions
To learn a disentangled KG model, we adopt the following negative log-likelihood loss:

L = Σ_{i=1}^{S} log(1 + exp(−y^{(i)} φ^{(i)}))    (10)

where S is the number of training samples (triplets); y^{(i)} ∈ {−1, 1} is a binary label indicating whether the i-th triplet holds or not; and φ^{(i)} is the prediction for the i-th triplet. Our model can be trained with commonly used minibatch gradient descent optimizers.
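A minimal sketch of this objective, assuming labels y ∈ {−1, +1}; softplus(−y·φ) is the numerically stable form of log(1 + exp(−y·φ)) in equation (10).

```python
import torch.nn.functional as F

def nll_loss(phi, y):
    # Eq. (10): mean logistic negative log-likelihood over a minibatch,
    # where y in {-1, +1} marks observed vs. corrupted triplets.
    return F.softplus(-y * phi).mean()
```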

Complexity Analysis
The disentanglement process of each node takes O(|N(e)| (d/K) K + T(|N(e)| (d/K) K + (d/K) K)) time, where |N(e)| is the neighborhood size. After simplification, the time complexity is O(T |N(e)| d). This does not incur a high computational cost, since T is usually a small number (e.g., 3), and the neighborhood size is determined by the average degree and can usually be bounded by a constant (e.g., 10). With regard to fact scoring, the complexity is the same as that of the original scoring functions, since Knowledge Router introduces no additional parameters.

Experiments
In this section, we conduct experiments on several benchmark datasets to verify the effectiveness of the proposed approach. We aim to answer: RQ I: Can the disentanglement method enhance traditional knowledge graph embedding methods? RQ II: Model-agnosticism: can it work effectively with different baseline models? RQ III: How do certain important hyper-parameters impact model performance, and is what the disentanglement algorithm has learned meaningful?

Datasets Description
We use four publicly available datasets: ICEWS14, ICEWS05-15, WikiData, and FB15k-237. We choose these because their entities are complicated and highly entangled. The WordNet dataset is not appropriate for evaluating the proposed method, as the entities in WordNet are already disentangled. FB15k-237 is a subset of the Freebase knowledge base, which contains general information about the world. We adopt the widely used version generated by (Dettmers et al., 2018), where inverse relations are eliminated to avoid data leakage.
WikiData is sampled from Wikidata, a collaborative open knowledge base. Its knowledge is relatively up-to-date compared with FB15k-237. We use the version provided by (García-Durán et al., 2018). Timestamps are discarded.
ICEWS (García-Durán et al., 2018) is collected from the Integrated Crisis Early Warning System, which was built to monitor and forecast national and international crises. The datasets contain political events that connect entities (e.g., countries, presidents, intergovernmental organizations) to other entities via predicates (e.g., "make a visit", "sign formal agreement", etc.). ICEWS14 contains events from the year 2014, while ICEWS05-15 contains events occurring between 2005 and 2015. Temporal information is not used in our experiments.
Data statistics and the train/validation/test splits are summarized in Table 2.

Evaluation Protocol
We adopt four commonly used evaluation metrics: the hit rate with a given cut-off (HR@1, HR@3, HR@10) and the mean reciprocal rank (MRR). HR@k measures the percentage of test triplets whose true entity is ranked within the top k of the ranked list. MRR is the mean of the reciprocal ranks and reflects the overall ranking quality. Evaluation is performed under the commonly used filtered setting (Bordes et al., 2013), which is more reasonable and stable than the unfiltered setting.
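For reference, a small sketch of how these metrics can be computed once the filtered rank of the true entity is known for every test triplet; the helper name is illustrative.

```python
import torch

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    # ranks: filtered rank of the true entity for each test triplet.
    ranks = torch.as_tensor(ranks, dtype=torch.float)
    metrics = {"MRR": (1.0 / ranks).mean().item()}
    for k in ks:
        metrics[f"HR@{k}"] = (ranks <= k).float().mean().item()
    return metrics

# e.g., mrr_and_hits([1, 4, 2, 11]) -> MRR ~0.46, HR@1 0.25, HR@3 0.5, HR@10 0.75
```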

Implementation Details
We implement our model using PyTorch (Paszke et al., 2019) and run it on TITAN Xp GPUs. We adopt the Adam optimizer (Goodfellow et al., 2016) and set the learning rate to 0.01 without further tuning. The embedding size d is set to 100 and the number of negative samples is fixed at 50. The batch size is selected from {128, 512, 1024}. The regularization rate is searched over {0.0, 0.01, 0.1, 0.2, 0.3, 0.5}. For the disentanglement algorithm, the number of components K is selected from {2, 4, 5, 10} (d should be divisible by K); the number of routing iterations T is tuned over {2, 3, 4, 5, 7, 10}. The hyperparameters are determined on the validation set. Each experiment is run five times and the average is reported. For convenience of implementation, the maximum neighbor sizes are: 16 (FB15K-237), 4 (WikiData), 10 (ICEWS14), and 16 (ICEWS05-15). We apply zero padding to entities that have fewer neighbors.

Main Results
The test results on the four datasets are shown in Tables 3-6.

Table 3: Results on the FB15K-237 dataset. Best results are in bold. "D", "S", and "Q" stand for DistMult, SimplE, and QuatE, respectively. "♥": results from (Schlichtkrull et al., 2018). Results from (Sun et al., 2019) are reported without adversarial negative sampling for fair comparison. Results from (Zhang et al., 2019) are reported without N3 regularization and type constraints.

On FB15K-237, equipping the base models with Knowledge Router brings clear improvements, including for KR-DistMult and KR-SimplE. In addition, it is worth noting that the performance of each of the three KR-models is much higher than that of the graph convolutional network based model, R-GCN. This implies that naively incorporating graph structures might not lead to good performance. Knowledge Router also operates at the graph level; the difference is that the neighborhood information is effectively utilized for disentanglement.
Similar trends are observed on WikiData. Interestingly, we find that the performance differences among the three KR-models are quite small on this dataset. We hypothesize that performance on this dataset is already quite high, leaving less room for further improvement.
Among the baselines, SimplE is the best performer. We notice that even though plain QuatE does not show impressive performance, Knowledge Router boosts its results and enables it to achieve state-of-the-art performance.
On the two ICEWS datasets, disentanglement usually leads to a large performance boost: the average gains of the Knowledge Router based models (KR-DistMult, KR-SimplE, KR-QuatE) over the original models (DistMult, SimplE, and QuatE) are high. We also observe that KR-QuatE significantly outperforms the other models.
To conclude, our experimental evidence shows that disentangling the entities can indeed yield performance gains, and that the proposed Knowledge Router can be effectively integrated into different models.

Model Analysis
To answer RQ III and gain further insights, we empirically analyze the important ingredients of the model via qualitative analysis and visualization.

Visualization of similarity scores
The attention mechanism is critical to achieving the final disentanglement. To show its efficacy, we visualize four examples of attention weights w_{i,k} in Figure 2. The color scale represents the strength of the attention weights. Each row represents a neighbor of the selected entity and each column represents a disentangled component. We observe a clear staggered pattern in the attention weights. For example, in the upper-left panel, neighbors 1, 2, and 3 give higher weights to the second component, while neighbor 0 gives a stronger weight to the first component. In the other panels, the attention weights are also staggered among the disentangled components.

Case study
We randomly pick one entity (Michael Rensing, a German footballer) from WikiData and show the learned weights between him and his neighborhood entities in Figure 3. We observe that FC Bayern Munich and Jan Kirchhoff (who is also a member of the FC Bayern Munich club) contribute more to the first component of the representation of Michael Rensing, while the Germany national under-18 football team and the Germany national under-21 football team make larger contributions to the second component. Clearly, the first component captures the fact that Michael Rensing is a member of the FC Bayern Munich association football club, and the second component reflects that he is also a German national football team member. This case supports our assumption that entities are connected for different reasons and demonstrates that Knowledge Router is able to disentangle the underlying factors effectively.

Impact of size K
We analyze the impact of K. Intuitively, K is difficult to choose, since there is no prior information on how many components each entity should be decomposed into. The test results of KR-QuatE with varying K on ICEWS14 are shown in Figure 4 (a).
As can be seen, using a large K can result in performance degradation. One possible reason is that there are not enough neighborhood entities to be divided into 20 groups. Empirically, we found that setting K to a small value, around 2 to 5, usually yields reasonable results. A practical suggestion is that K should not exceed the average degree of the knowledge graph.

Impact of routing iteration T
We study the influence of the number of routing iterations T. As shown in Figure 4 (b), the model performance is stable across different numbers of iterations, as the Knowledge Router algorithm is not prone to saturation and has good convergence properties. In practice, we find that a small number of iterations (e.g., 3) leads to good enhancement without adding much computational burden.

Conclusion
In this paper, we present Knowledge Router, an algorithm for learning disentangled entity representations in knowledge graphs. Our method is model agnostic and can be applied to many canonical knowledge graph embedding methods. Extensive experiments on four benchmark datasets demonstrate that equipping popular embedding models with the proposed Knowledge Router outperforms a number of recent strong baselines. Via qualitative model analysis, we find that Knowledge Router effectively learns the hidden factors connecting entities, thus leading to disentanglement. We also examine the impact of certain important hyper-parameters and give suggestions on hyperparameter tuning.