Edge: Enriching Knowledge Graph Embeddings with External Text

Knowledge graphs suffer from sparsity, which degrades the quality of the representations generated by various embedding methods. While textual information is abundant across the web and in many existing knowledge bases, aligning information across these diverse sources remains a challenge. Previous work has partially addressed this issue by enriching knowledge graph entities based on "hard" co-occurrence of words shared between the knowledge graph entities and external text. We instead achieve "soft" augmentation by proposing Edge, a knowledge graph enrichment and embedding framework. Given an original knowledge graph, we first generate a rich but noisy augmented graph from external texts at the semantic and structural levels. To distill the relevant knowledge and suppress the introduced noise, we design a graph alignment term in a shared embedding space between the original and augmented graphs. To enhance embedding learning on the augmented graph, we further regularize the locality relationship of each target entity via negative sampling. Experimental results on four benchmark datasets demonstrate the robustness and effectiveness of Edge in link prediction and node classification.


Introduction
Knowledge Graph (KG)¹ embedding learning has been an emerging research topic in natural language processing; it aims to learn a low-dimensional latent vector for every node. One major challenge is sparsity: knowledge graphs are often incomplete, and it is difficult to generate low-dimensional representations from a graph with many missing edges. To mitigate this issue, auxiliary texts that are easily accessible have been widely exploited for enhancing the KG (as illustrated in Figure 1). More specifically, given that KG entities contain textual features, we can link them to an auxiliary source of knowledge, e.g., WordNet, and thereby enhance the existing feature space.

¹ A knowledge graph is usually a heterogeneous multigraph whose nodes and relations can have different types. In this work, however, we follow (Kartsaklis et al., 2018) and consider the knowledge graph enrichment problem where only one relation type (connected or not) appears.
Aside from a few notable exceptions, the use of external textual properties for KG embedding has not been extensively explored. Recently, (Kartsaklis et al., 2018) used entities of the KG to query BabelNet (Navigli and Ponzetto, 2012), added new nodes to the original KG based on co-occurrence of entities, and produced more meaningful embeddings using the enriched graph. However, this hard-coded, co-occurrence based KG enrichment strategy fails to make connections to other semantically related entities. As motivated in Figure 1, the newly added entities "wound", "arthropod" and "protective body" are semantically close to some input KG entity nodes (marked in red), yet they cannot be directly retrieved from BabelNet using co-occurrence matching.
In this paper, we aim to address the sparsity issue by integrating a learning component into the enrichment process. We propose a novel framework, EDGE, for KG enrichment and embedding. EDGE first constructs a graph from the external text based on similarity and aligns the enriched graph with the original KG in the same embedding space. It infuses learning into the knowledge distillation process via graph alignment, ensuring that similar entities remain close while dissimilar entities are pushed apart. Consuming information from an auxiliary textual source improves the quality of the final product, i.e., the low-dimensional embeddings, by introducing new features. This new feature space is effective because it is obtained from a distinct knowledge source and established based on affinity captured by the learning component of our model.
More specifically, our framework takes KG and an external source of texts, T, as inputs, and generates an augmented knowledge graph, aKG. In generating aKG, we attend to semantic and structural similarities among KG entities, and we ensure it contains all the original entities of KG. This guarantees common nodes between the two graphs, which facilitates the alignment process. To align KG and aKG in the embedding space, a novel multi-criteria objective function is devised. In particular, we design a cost function that minimizes the distance between the embeddings of the two graphs. As a result, textual nodes (e.g., blue nodes in Figure 1) related to each target entity are rewarded while unrelated ones are penalized in a negative sampling setting.
Extensive experimental results on four benchmark datasets demonstrate that EDGE outperforms state-of-the-art models in different tasks and scenarios, including link prediction and node classification. Evaluation results also confirm the generalizability of our model. We summarize our contributions as follows: (i) We propose EDGE, a general framework to enrich knowledge graphs and node embeddings by exploiting auxiliary knowledge sources. (ii) We introduce a procedure to generate an augmented knowledge graph from external texts, which is linked with the original knowledge graph. (iii) We propose a novel knowledge graph embedding approach that optimizes a multi-criteria objective function in an end-to-end fashion and aligns two knowledge graphs in a joint embedding space. (iv) We demonstrate the effectiveness and generalizability of EDGE by evaluating it on two tasks, namely link prediction and node classification, on four graph datasets.
The rest of the paper is organized as follows. In the next section, we identify the gap in the existing literature and motivate our work. In Section 3, we set up the problem definition and describe our approach through an in-depth explanation of our model. We evaluate the proposed model by experimenting with link prediction and node classification on four benchmark datasets, and present the results and an ablation study in Section 4. Finally, we conclude our work and outline future directions in Section 5.

Related Work
Knowledge graph embedding learning has been studied extensively in the literature (Bordes et al., 2013; Wang et al., 2014; Yang et al., 2015; Sun et al., 2019; Zhang et al., 2019; Xian et al., 2020; Yan et al., 2020; Sheu and Li, 2020). Many of these works deal with heterogeneous knowledge graphs, which contain different types of edges. In this work, by contrast, we consider knowledge graphs with only one relation type (i.e., connected or not) and focus solely on entity embedding learning. Our work is related to graph neural networks, such as the graph convolutional network (GCN) (Kipf and Welling, 2017) and its variants (Wu et al., 2020; Jiang et al., 2019, 2020), which learn node embeddings by feature propagation. In the following, we mainly review the most relevant works in two areas, i.e., graph embedding learning with external text and knowledge graph construction.

Graph Embedding with External Text
The most similar line of work to ours considers an external textual source to enrich the graph and learns low-dimensional graph embeddings from the enriched version of the knowledge graph. For instance, (Wang and Li, 2016) annotates the KG entities in text, creates a network based on entity-word co-occurrences, and then learns from the enhanced KG. Similarly, (Kartsaklis et al., 2018) adds an edge (e, t) to KG per entity e based on co-occurrence and finds graph embeddings using random walks. However, these approaches include no learning component in constructing the new knowledge graph, and the enrichment procedure is based solely on occurrences ("hard" matching) of entities in the external text.
For the graph completion task, (Malaviya et al., 2020) uses pre-trained language models to improve the representations, and for the question answering task, (Sun et al., 2018) extracts a sub-graph G_q from KG and Wikipedia that contains the answer to the question with high probability and applies a GCN on G_q, which is limited to a specific task. We emphasize that the main difference between our model and previous work is that we first create an augmented knowledge graph from an external source and then improve the quality of node representations by jointly mapping the two graphs to an embedding space. To the best of our knowledge, this is the first time a learning component has been incorporated into enriching knowledge graphs.

Figure 2: Our proposed framework for aligning two graphs in the embedding space. The graph alignment component, L_J, requires an additional matrix, R, that selects embeddings of KG entities from Z_T, so the resulting matrix, RZ_T, has the same dimension as Z_K. Furthermore, L_N penalizes additional entities that are unrelated to the target entity, while rewarding related ones. We omit the graph reconstruction loss for simplicity.

Knowledge Graph Construction
Knowledge graph construction methods are broadly classified into two main groups: 1) Curated approaches, where facts are generated manually by experts, e.g., WordNet (Fellbaum, 1998) and UMLS (Bodenreider, 2004), or by volunteers, as in Wikipedia; and 2) Automated approaches, where facts are extracted from semi-structured text, as in DBpedia (Auer et al., 2007), or from unstructured text (Carlson et al., 2010). The latter can be described as extracting structured information from unstructured text. In this work, we do not intend to construct a knowledge base from scratch; instead, we aim to generate an augmented knowledge graph using side information. Hence, we employ existing tools to acquire a set of new facts from external text and link them to an existing KG.

Problem Statement
We formulate the knowledge graph enrichment and embedding problem as follows: given a knowledge graph KG = (E, R, X) with |E| nodes (or entities), |R| edges (or relations), and a feature matrix X ∈ R^{|E|×D}, where D is the number of features per entity, together with an external textual source T, the goal is to generate an augmented knowledge graph and jointly learn d-dimensional (d ≪ |E|) embeddings for the knowledge graph entities that preserve structural and semantic properties of the knowledge graph. The learned representations are then used for the tasks of link prediction and node classification. Link prediction is a binary classification task whose goal is to predict whether or not an edge exists in KG, and node classification is the task of determining node labels in labeled graphs.
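As a minimal illustration of the link prediction task (a sketch, not the authors' implementation; the `edge_score` function and the 0.5 threshold are assumptions), a candidate edge can be scored by the sigmoid of the dot product of two learned node embeddings:

```python
import numpy as np

def edge_score(z_u, z_v):
    """Score a candidate edge as the sigmoid of the embedding dot product."""
    return 1.0 / (1.0 + np.exp(-np.dot(z_u, z_v)))

def predict_link(z_u, z_v, threshold=0.5):
    """Binary decision: predict that an edge exists if the score passes the threshold."""
    return edge_score(z_u, z_v) >= threshold
```

Embeddings of connected nodes should score near 1, and unrelated pairs near 0.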
To address the problem of knowledge graph enrichment and embedding, we propose EDGE, a framework that contains two major components, i.e., augmented knowledge graph construction, and knowledge graph alignment in a joint embedding space.

Augmented Knowledge Graph Construction
Given the entities of KG and an external source of textual data, T, we aim to generate an augmented graph, aKG, which is a supergraph of KG (i.e., KG is a subgraph of aKG). Augmentation is the process of adding new entities to KG; these newly added entities are called textual entities or textual nodes. A crucial property of aKG is that it contains all entities of KG. The presence of these entities establishes a relationship between the two graphs, and this relationship will be leveraged to learn the shared graph embeddings. To construct aKG, we need a set of keywords with which to query an external source. To obtain high-quality keywords and acquire new textual entities, we design the following procedure for each target entity e_t (for each step of this process, see Table 1 for a real example from the SNOMED dataset). First, we find a set of semantically and structurally similar entities to e_t, denoted by E_{e_t}. This set creates a textual context around e_t, which we use to find keywords to query an external text, e.g., WordNet or Wikipedia. Here, by "query" we mean using the API of the external source to find related sentences, S (for instance, for the keyword "bite" we can capture several sentences from the Wikipedia page for the entry "biting", or find several Synsets² from WordNet when we search for "bite"). Finally, we extract entities from S and attach them to e_t. We call these new entities textual entities or textual features. By connecting these newly found textual entities to e_t, we enhance KG and generate the augmented knowledge graph, aKG.

Table 1: We employ representation learning algorithms to find a set of semantically and structurally similar entities to each target entity (column 2). We then find a set of keywords, K, that are representative of the target entity (column 3) and use them to query an external text to obtain a set of sentences, S (column 4). Finally, we extract textual entities (column 5) and connect them to the target entity.
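The procedure above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: `query_fn` is a hypothetical placeholder for the WordNet/Wikipedia API, and keywords are chosen by a plain term-frequency heuristic rather than the representation-learning step described above.

```python
import re
from collections import Counter

STOP = {"a", "an", "the", "of", "and", "with", "without", "on", "in"}

def extract_keywords(entity_texts, top_k=3):
    """Pick the most frequent content words across the target entity and its
    similar entities as query keywords (a simple term-frequency heuristic)."""
    words = []
    for text in entity_texts:
        words += re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP and len(w) > 2)
    return [w for w, _ in counts.most_common(top_k)]

def augment(target, similar_entities, query_fn):
    """Collect textual entities for `target`: derive keywords from the target
    plus its similar entities, query the external source for sentences, and
    return the content words found in those sentences as new textual nodes."""
    keywords = extract_keywords([target] + similar_entities)
    textual_entities = set()
    for kw in keywords:
        for sentence in query_fn(kw):  # query_fn stands in for a WordNet/Wikipedia API
            textual_entities.update(
                w for w in re.findall(r"[a-z]+", sentence.lower())
                if w not in STOP and len(w) > 2
            )
    return textual_entities
```

Each returned word would become a textual node attached to the target entity in aKG.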
We observed that the new textual entities are different from our initial feature space. Also, it is possible for two different target entities to share one or more textual nodes, in which case the distance between them in aKG decreases. The implementation details of this process are provided in the supplementary materials.
Querying an external text allows us to extend the feature space beyond the immediate context around e_t. By finding other entities in KG that are similar to the target entity and extracting keywords from this collection to query the external text, distant entities that are related but not connected become closer to each other owing to the shared keywords. Figure 1 illustrates a subset of the SNOMED graph and its augmented counterpart obtained by following the above procedure. As this figure reveals, the structure of aKG differs from that of KG, and as a result of the added textual nodes, distant but similar entities become closer. Therefore, augmenting knowledge graphs alleviates the KG sparsity issue. Although we may introduce noise by adding new entities, we address this issue later in the alignment process.

² A Synset is the fundamental building block of WordNet, accompanied by a definition, example(s), etc.
Remarks. In the above procedure, we need to obtain similar entities before looking for textual entities; the rationale for this strategy is as follows. A naive alternative would be to simply use keywords included in the target entity itself to find new textual features. In that case, we would end up with textual features related to that target entity, but we could not extend the feature space to capture similarity (i.e., dependency) among entities.

Knowledge Graph Alignment in a Joint Embedding Space

With the help of the augmented knowledge graph aKG, we aim to enrich the graph embeddings of KG. However, a portion of the newly added entities are inevitably noisy, or even wrong. To mitigate this issue, inspired by Hinton et al. (Hinton et al., 2015), we propose a graph alignment process for knowledge distillation. Since aKG and KG share common entities, it is possible to map the two knowledge graphs into a joint embedding space. In particular, we propose to extract low-dimensional node embeddings of the two knowledge graphs using graph auto-encoders (Kipf and Welling, 2016), and we design novel constraints to align the two graphs in the embedding space. The architecture of our approach is illustrated in Figure 2.
Let A_K and A_T denote the adjacency matrices of KG and aKG, respectively. The loss functions of the graph auto-encoders that reconstruct the knowledge graphs are defined as:

L_K = ||A_K - Â_K||_2^2,    (1)
L_T = ||A_T - Â_T||_2^2,    (2)

where Â_K = σ(Z_K Z_K^T) is the reconstructed graph using node embeddings Z_K, and Z_K is the output of a graph encoder implemented by a two-layer GCN (Kipf and Welling, 2016):

Z_K = Ã_K tanh(Ã_K X_K W_0) W_1,    (3)

where Ã_K is the normalized adjacency matrix, tanh is the hyperbolic tangent function that acts as the activation function of the neurons, W_i are the model parameters, and X_K is the feature matrix.³ Similarly, Â_T = σ(Z_T Z_T^T), and Z_T is learned by another two-layer GCN. Equations (1) and (2) are l_2-norm based loss functions that minimize the distance between the original graphs and the reconstructed graphs.
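The encoder and reconstruction term can be sketched in NumPy as follows. This is a hedged sketch: the symmetric normalization and weight shapes follow the standard GCN of Kipf and Welling, which may differ in detail from the paper's implementation.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops (Kipf & Welling)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_encode(A, X, W0, W1):
    """Two-layer GCN encoder with tanh activation: Z = A~ tanh(A~ X W0) W1."""
    A_tilde = normalize_adj(A)
    return A_tilde @ np.tanh(A_tilde @ X @ W0) @ W1

def reconstruction_loss(A, Z):
    """l2 loss between the graph and its reconstruction sigmoid(Z Z^T)."""
    A_rec = 1.0 / (1.0 + np.exp(-Z @ Z.T))
    return np.sum((A - A_rec) ** 2)
```

The same encoder, with its own weights, would be applied to aKG to obtain Z_T.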
Furthermore, to map KG and aKG to a joint embedding space and align their embeddings through common entities, we define the following graph alignment loss function:

L_J = ||Z_K - R Z_T||_2^2,    (4)

where R is a transform matrix that selects the common entities of KG and aKG. Note that the two terms Z_K and RZ_T must be of the same size in the l_2-norm equation. Our motivation is to align the embeddings of common entities across the two knowledge graphs; using R, the node embeddings of common entities can be selected from Z_T. Note that Z_T is always larger than Z_K, as KG is a subgraph of aKG. Equation (4) also helps preserve local structures of the original knowledge graph KG in the embedding space: nodes that are close to each other in the original knowledge graph will be neighbors in the augmented graph as well. Moreover, the proposed augmented knowledge graph aKG involves more complicated structures than the original knowledge graph KG, due to the newly added textual nodes for each target entity. In aKG, a target entity is closely connected to its textual nodes, and their embeddings should be very close to each other in the embedding space. However, such local structures might be distorted: without proper constraints, a target entity may end up close to the textual entities of other target entities in the embedding space, which is undesirable for downstream applications.

Algorithm 1: Training process of EDGE
Input: A_K, X_K, A_T, X_T, POS, NEG
for each training epoch do
    Calculate L_K and L_T using Equations (1) and (2)
    Compute L_J using Equation (4)
    Find negative and positive samples and calculate L_N using Equation (5)
    Sum up all losses with their corresponding ratios using Equation (6)
    Run the Adam optimizer to minimize L
    Update model parameters W_i^K and W_i^T
end for
Output: Z_K
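A minimal sketch of the alignment term follows (not the authors' code; for illustration we assume the shared entities occupy the first rows of Z_T, so R is a truncated identity):

```python
import numpy as np

def selection_matrix(n_common, n_aug):
    """Build R: selects the rows of Z_T corresponding to original KG entities
    (assumed here, for illustration, to be the first rows of aKG)."""
    R = np.zeros((n_common, n_aug))
    R[np.arange(n_common), np.arange(n_common)] = 1.0
    return R

def alignment_loss(Z_K, Z_T, R):
    """l2 alignment term: distance between KG embeddings and the selected
    aKG embeddings of the shared entities."""
    return np.sum((Z_K - R @ Z_T) ** 2)
```

When the shared entities have identical embeddings in both graphs, the term is zero; any drift between the two graphs is penalized quadratically.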
To address this issue, we design a margin-based loss function with negative sampling to preserve the locality relationship as follows:

L_N = - Σ_{t ∈ POS} log σ(z_e^T z_t) - Σ_{t' ∈ NEG} log σ(-z_e^T z_{t'}),    (5)

where z_t are the embeddings of the textual nodes related to the target entity e, z_{t'} are the embeddings of textual nodes that are not related to it, and σ is the sigmoid function. Finally, the overall loss function is defined as:

L = L_K + α L_T + β L_J + γ L_N,    (6)

where α, β, and γ are hyper-parameters. We perform full-batch gradient descent using the Adam optimizer to learn all the model parameters in an end-to-end fashion. The whole training process of our approach is summarized in Algorithm 1.
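The locality term and the weighted overall loss can be sketched as follows (a log-sigmoid negative-sampling form is assumed for illustration; the paper's exact margin formulation may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def locality_loss(z_e, pos, neg):
    """Negative-sampling loss: pull related textual nodes (pos) toward the
    target entity embedding z_e, push unrelated ones (neg) away from it."""
    loss = -sum(np.log(sigmoid(z_e @ z_t)) for z_t in pos)
    loss -= sum(np.log(sigmoid(-z_e @ z_t)) for z_t in neg)
    return loss

def total_loss(L_K, L_T, L_J, L_N, alpha=0.001, beta=10.0, gamma=1.0):
    """Weighted sum of the four loss terms; default weights follow the
    hyper-parameter values reported in the experiments."""
    return L_K + alpha * L_T + beta * L_J + gamma * L_N
```

A configuration where related textual nodes align with the target and unrelated ones point away should cost less than the inverted configuration.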
The learned low-dimensional node embeddings Z K could benefit a number of unsupervised and supervised downstream applications, such as link prediction and node classification. Link prediction is the task of inferring missing links in a graph, and node classification is the task of predicting labels to vertices of a (partially) labeled graph. Extensive evaluations on both tasks will be provided in the experiment section.

Model Discussions
We have proposed a general framework for graph enrichment and embedding that exploits auxiliary knowledge sources. What we consider as a source of knowledge is a textual knowledge base that can provide additional information about the entities of the original knowledge graph. It is a secondary source of knowledge that supplies new sets of features outside the existing feature space, which improves the quality of the representations. The proposed graph alignment approach can fully exploit the augmented knowledge graph and thus improve the graph embeddings. Although aKG is a supergraph of KG, its connectivity pattern is different. With the help of our customized loss function for graph alignment, both graphs contribute to the quality of the derived embeddings. We will also demonstrate the superiority of our joint embedding approach over the independent graph embedding approach (with only aKG) in the experiments, and we investigate which component of our model contributes most to the final performance in the ablation study in Subsection 4.4.

Experiment
We design our experiments to investigate the effectiveness of the different components of EDGE as well as its overall performance. To this end, we aim to answer the following three questions.⁴

Q1: How well does EDGE perform compared to the state-of-the-art on the task of link prediction? (Section 4.1)
Q2: How does the quality of embeddings generated by EDGE compare to similar methods? (Sections 4.2 and 4.3)
Q3: What is the contribution of each component (augmentation and alignment) to the overall performance? (Section 4.4)

⁴ We plan to release our code upon publication.

Task 1: Link Prediction
To investigate Q1, we perform link prediction on four benchmark datasets and compare the performance of our model with five relevant baselines. For this task we consider SNOMED and three citation networks. For SNOMED, similar to (Kartsaklis et al., 2018), we select 21K medical concepts from the original dataset. Each entity in SNOMED is a text description of a medical concept, e.g., Non-venomous insect bite of hip without infection. Following the procedure explained in subsection 3.2, we construct an augmented knowledge graph, aKG. Additionally, we consider three other datasets, namely Cora, Citeseer, and PubMed, citation networks consisting of 2,708, 3,312, and 19,717 papers, respectively. In all three datasets, each node is accompanied by a short text extracted from the title or abstract of the paper. For these networks, relations are defined as citations, and the textual content of the nodes enables us to obtain aKG. Cora and Citeseer come with a set of default features. We defer the detailed description of the datasets to the supplementary material.
In this experiment, for each dataset, we train the model on 85% of the input graph. The remaining 15% of the data is split into a 5% validation set and a 10% test set (positive samples only). An additional set of edges, equal in number to the positive samples and not present in the graph, is produced as negative samples. The union of positive and negative samples is used as the test set. In all baselines, we test the model on KG. We obtain the following values for the loss ratios after hyper-parameter tuning: α = 0.001, β = 10, γ = 1. We discuss parameter tuning and explain the small value of α in Section 4.5.
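The negative half of the test set can be constructed as sketched below (function name and uniform sampling are illustrative assumptions; the paper does not specify the sampler):

```python
import random

def sample_negative_edges(edges, n_nodes, n_samples, seed=0):
    """Sample node pairs that do not appear as edges in the graph, to serve
    as the negative half of the link-prediction test set."""
    existing = set(edges) | {(v, u) for u, v in edges}
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if u != v and (u, v) not in existing:
            negatives.add((u, v))
    return sorted(negatives)
```

Drawing as many negatives as held-out positives keeps the test set balanced, so AUC and AP are directly comparable across models.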
We provide comparisons against VGAE (Kipf and Welling, 2016) and its adversarial variant ARVGE (Pan et al., 2018). We also consider LoNGAE (Tran, 2018), SCAT (Zou and Lerman, 2019), and GIC (Mavromatis and Karypis, 2020), which are designed for the link prediction task on graphs and hence make strong baselines. Table 2 presents the Area Under the ROC Curve (AUC) and Average Precision (AP) scores for the five baselines and our method across all datasets. We observe that EDGE outperforms all baselines on three of the four datasets and produces comparable results on PubMed.

Task 2: Node Classification on Citation Networks
To evaluate the quality of the embeddings (Q2), we design a node classification task based on the final product of our model. For this task, we use the Cora, Citeseer, and PubMed datasets, follow the same procedure explained in subsection 3.2 to generate aKG, and jointly map the two graphs into an embedding space. All settings are identical to Task 1. To perform node classification, we use the final product of our model, a 160-dimensional vector per node. We train a linear SVM classifier and report accuracy to compare the performance of our model with state-of-the-art methods. The training ratio varies across datasets, and we consider several baselines for comparison. We compare our approach with state-of-the-art semi-supervised models for node classification, including GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), LoNGAE (Tran, 2018), and MixHop (Abu-El-Haija et al., 2019). These models are semi-supervised and thus were exposed to node labels during training, while our approach is completely unsupervised. We also include DeepWalk, an unsupervised approach, for a more complete comparison. Table 3 shows that our model achieves reasonable performance compared with the semi-supervised models on two of the three datasets. Since EDGE is fully unsupervised, it is fair to say its performance is comparable, as the other methods are exposed to more information (i.e., node labels).
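The downstream classification step can be sketched with scikit-learn (a sketch only: default `LinearSVC` settings are assumed, as the paper does not specify the classifier configuration):

```python
import numpy as np
from sklearn.svm import LinearSVC

def classify_nodes(Z, labels, train_idx, test_idx):
    """Fit a linear SVM on the training-node embeddings and report accuracy
    on held-out nodes; the embeddings Z are treated as fixed features."""
    clf = LinearSVC()  # default hinge-loss linear classifier
    clf.fit(Z[train_idx], labels[train_idx])
    preds = clf.predict(Z[test_idx])
    return (preds == labels[test_idx]).mean()
```

Because the embeddings are learned without labels, this step is the only place supervision enters the pipeline.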

Embedding Effectiveness
Further, to measure the quality of the embeddings produced by our model and compare it against the baseline, we visualize the similarity matrix of node embeddings for two scenarios on the Cora dataset: 1) GAE on KG, and 2) EDGE on KG and aKG. The results are illustrated in Figure 3. In this heatmap, elements are pair-wise similarity values sorted by label (7 classes). We observe that the block-diagonal structure learned by our approach is clearer than that of GAE, indicating enhanced separability between classes. Next, we examine our model in more detail and study how different parameters affect its performance.
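Such a similarity heatmap can be computed as follows (cosine similarity is assumed here for illustration; the paper does not specify the similarity measure):

```python
import numpy as np

def similarity_heatmap(Z, labels):
    """Pairwise cosine-similarity matrix with rows/columns sorted by class
    label, so that same-class blocks line up along the diagonal."""
    order = np.argsort(labels, kind="stable")
    Zs = Z[order]
    Zn = Zs / np.clip(np.linalg.norm(Zs, axis=1, keepdims=True), 1e-12, None)
    return Zn @ Zn.T
```

With well-separated embeddings, the sorted matrix shows bright diagonal blocks (high within-class similarity) against a darker background.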

Ablation Study
To investigate the effectiveness of the different modules of our model (Q3), we consider two scenarios. First, we use a single graph to train our model. Note that with a single graph, the graph alignment and locality-preserving losses are discarded and our model reduces to GAE. In the single-graph scenario we consider two versions of the augmented graph: aKG, as explained in subsection 3.2, and aKG*, created based on the co-occurrence approach proposed by (Kartsaklis et al., 2018). In the second scenario, we use two graphs to jointly train EDGE, feeding our model with KG + aKG* and with KG + aKG to show the effect of augmentation. For link prediction we only consider SNOMED, the largest dataset, and as Table 4 shows, our augmentation process is slightly more effective than co-occurrence-based augmentation. More importantly, comparing the last two rows with the first two, we see that the alignment module improves performance more than the augmentation process does, which highlights the importance of our proposed joint learning method. Moreover, we repeat this exercise for node classification (see Table 5), which yields a similar trend across all datasets.

Figure 5: Effect of parameterization on link prediction performance. Panels: (a) Cora, β = 1, γ = 1; (b) Cora, α = 1, γ = 1; (c) Cora, α = 1, β = 1; (d) Cora, β = 10, γ = 1.
Finally, we plot the t-SNE visualization of the embedding vectors of our model with and without features. Figure 4 clearly illustrates the difference in cluster quality between the two approaches. This implies that knowledge graph text carries useful information: when the text is incorporated into the model, it helps improve performance.

Parameter Sensitivity
We evaluate the parameterization of EDGE; specifically, we examine how changes to the hyper-parameters of our loss function (i.e., α, β, and γ) affect model performance on the task of link prediction on the Cora dataset. In each analysis, we fix the values of two of the three parameters and study the third; the results are shown in Figure 5. Figure 5a shows the effect of varying α with β = 1 and γ = 1 fixed. We observe a fairly consistent trend: decreasing α improves performance. α is the coefficient of L_T (see Equation (2)). This examination suggests that the effect of this loss term is less significant, because we re-address it in the L_N part of the loss function, where we consider the same graph (aKG) and optimize distances between its nodes under additional constraints. Figure 5b illustrates the effect of varying β, with α = 1 and γ = 1 fixed. Tuning β results in more pronounced changes in model performance. Small values of β degrade performance remarkably, and we observe a much improved AUC score for larger values of β. This implies the dominant effect of the joint loss function, L_J, which is defined as the distance between corresponding entities of KG and aKG.

Table 5: Node classification results in terms of accuracy for citation networks. TR stands for training ratio, and aKG* is an augmented knowledge graph produced by the method proposed in (Kartsaklis et al., 2018).
Next, we fix α = 1 and β = 1 and tweak γ from 0.1 to 10. As Figure 5c reveals, the variation in performance is very small. Finally, as we obtained the best results when β = 10, we set γ = 1 and once again tune α. Figure 5d shows the results for this updated setting. These experiments confirm the insignificance of parameter α. In practice, we obtained the best results by setting α to 0.001.
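The one-at-a-time sweep described above can be sketched as follows (the grid values and the fixed value of 1.0 are illustrative assumptions; `evaluate` stands in for a full train-and-validate run):

```python
def sweep_one_at_a_time(evaluate, grid=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Mirror the sensitivity analysis: vary one loss coefficient over `grid`
    while fixing the other two at 1.0. `evaluate` maps (alpha, beta, gamma)
    to a validation score such as AUC."""
    results = {}
    for name in ("alpha", "beta", "gamma"):
        for value in grid:
            params = {"alpha": 1.0, "beta": 1.0, "gamma": 1.0, name: value}
            results[(name, value)] = evaluate(**params)
    return results
```

Reading off the best-scoring cells of `results` recovers settings like α = 0.001 and β = 10 reported above.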

Conclusion
Sparsity is a major challenge in KG embedding, and many studies have failed to address it properly. We proposed EDGE, a novel framework that enriches a KG and aligns the enriched version with the original one with the help of auxiliary text. Using an external source of information introduces new sets of features that enhance the quality of the embeddings. We applied our model to three citation networks and one large-scale medical knowledge graph. Experimental results show that our approach outperforms existing graph embedding methods on link prediction and node classification.