Context-Dependent Knowledge Graph Embedding

We consider the problem of embedding knowledge graphs (KGs) into continuous vector spaces. Existing methods can only deal with explicit relationships within each triple, i.e., local connectivity patterns, but cannot handle implicit relationships across different triples, i.e., contextual connectivity patterns. This paper proposes context-dependent KG embedding, a two-stage scheme that takes into account both types of connectivity patterns and obtains more accurate embeddings. We evaluate our approach on the tasks of link prediction and triple classification, and achieve significant and consistent improvements over state-of-the-art methods.


Introduction
Knowledge Graphs (KGs) like WordNet (Miller, 1995), Freebase (Bollacker et al., 2008), and DBpedia (Lehmann et al., 2014) have become extremely useful resources for many NLP-related applications. A KG is a directed graph whose nodes correspond to entities and edges to relations. Each edge is a triple of the form (h, r, t), indicating that entities h and t are connected by relation r. Although powerful in representing complex data, the symbolic nature makes KGs hard to manipulate.
Recently, knowledge graph embedding has attracted much attention (Bordes et al., 2011; Bordes et al., 2013; Socher et al., 2013). It attempts to embed entities and relations in a KG into a continuous vector space, so as to simplify the manipulation while preserving the inherent structure of the original graph.
Most of the existing KG embedding methods model triples individually, ignoring the fact that entities connected to the same node are usually implicitly related to each other, even if they are not directly connected. Figure 1 gives two examples. Shaquille O'Neal and NBA in the former example and Nevada and Utah in the latter example are implicitly related to each other, through the intermediate nodes Phoenix Suns and USA respectively. We refer to such implicit relationships as contextual connectivity patterns (CCPs). Relationships explicitly represented in triples are referred to as local connectivity patterns (LCPs). In most of the existing methods, only LCPs are explicitly modeled.
This paper proposes a two-stage embedding scheme that explicitly takes into account both CCPs and LCPs, called context-dependent KG embedding. In the first stage, each CCP is formalized as a knowledge path, i.e., a sequence of entities and relations occurring in the pattern. A word embedding model is adopted to learn embeddings of entities and relations, by taking them as pseudo-words. The embeddings are enforced to be compatible within each knowledge path, and hence can capture CCPs. In the second stage, the learned embeddings are fine-tuned by an existing KG embedding technique. Since such a technique requires the embeddings to be compatible on each individual triple, LCPs are also encoded.
* Corresponding author: Quan Wang.
The advantages of our approach are three-fold. 1) It fully exploits both CCPs and LCPs, and can obtain more accurate embeddings. 2) It is a general scheme, applicable to a wide variety of word embedding models in the first stage and KG embedding models in the second. 3) No auxiliary data is further required in the two-stage process, except for the original graph.
We evaluate our approach on two publicly available data sets, and achieve significant and consistent improvements over state-of-the-art methods in the link prediction and triple classification tasks. The learned embeddings are not only more accurate but also more stable.

Context-Dependent KG Embedding
We are given a KG with nodes corresponding to entities and edges to relations. Each edge is denoted by a triple (h, r, t), where h is the head entity, t the tail entity, and r the relation between them. Entities and relations are represented as vectors, matrices, or tensors in a continuous vector space. Context-dependent KG embedding aims to automatically learn entity and relation embeddings, by using observed triples O in a two-stage process.

Modeling CCPs
The first stage models CCPs conveyed in the KG. Each CCP is formalized as a knowledge path, i.e., a sequence of entities and relations occurring in the pattern. For the CCPs in Figure 1, the associated knowledge paths connect Shaquille O'Neal to NBA through Phoenix Suns, and Nevada to Utah through USA. We fix the length of knowledge paths to 5. During path extraction, we ignore the directionality of edges and treat the KG as an undirected graph.
Given the extracted knowledge paths, we employ word embedding models to pre-train the embeddings of entities and relations, by taking them as pseudo-words. We use two word embedding models: CBOW and Skip-gram (Mikolov et al., 2013a; Mikolov et al., 2013b). In CBOW, words in the context are projected to their embeddings and then summed. Based on the summed embedding, log-linear classifiers are employed to predict the current word. In Skip-gram, the current word is projected to its embedding, and log-linear classifiers are further adopted to predict its context. We restrict the context of a word (i.e., an entity or relation) to lie within a single knowledge path. The entity and relation embeddings pre-trained in this way are required to be compatible within each knowledge path, and thus can encode CCPs.
Perozzi et al. (2014) and Goikoetxea et al. (2015) have proposed similar ideas, i.e., to generate random walks from online social networks or from the WordNet knowledge base, and then employ word embedding techniques on these random walks. Our approach differs in two respects. 1) It deals with heterogeneous graphs with different types of edges. Both nodes (entities) and edges (relations) are included during knowledge path extraction, whereas the previous studies focus only on nodes. 2) We devise a two-stage scheme where the embeddings learned in the first stage are fine-tuned in the second one, while the previous studies take such embeddings as final output.
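As an illustration of the path extraction step, the following is a minimal sketch, not the authors' implementation. It assumes that paths are sampled by random walks, that a path of length 5 means five tokens (three entities interleaved with two relations), and that an entity is not revisited within a path; none of these details is stated in the text. The function names and the relation names in the usage note are hypothetical.

```python
import random
from collections import defaultdict


def build_adjacency(triples):
    """Map each entity to its (relation, neighbour) pairs, ignoring edge direction."""
    adj = defaultdict(list)
    for h, r, t in triples:
        adj[h].append((r, t))
        adj[t].append((r, h))  # treat the KG as an undirected graph
    return dict(adj)


def sample_path(adj, start, length=5, rng=random):
    """Random walk from `start` that alternates entities and relations.

    Returns a list of `length` tokens (entity, relation, entity, ...), or
    None if the walk reaches a dead end first. `length` is assumed to be odd.
    """
    path, visited, node = [start], {start}, start
    while len(path) < length:
        # Assumption: never revisit an entity inside one path.
        options = [(r, n) for r, n in adj.get(node, ()) if n not in visited]
        if not options:
            return None
        r, node = rng.choice(options)
        path += [r, node]
        visited.add(node)
    return path
```

With the hypothetical triples `("Shaquille O'Neal", "plays_for", "Phoenix Suns")` and `("Phoenix Suns", "member_of", "NBA")`, a walk started at Shaquille O'Neal yields a five-token path ending at NBA. Each such path can then be fed to a word embedding model as one sentence of pseudo-words.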

Modeling LCPs
The second stage models LCPs conveyed in the KG. We employ three state-of-the-art KG embedding models, namely SME (Bordes et al., 2014), TransE (Bordes et al., 2013), and SE (Bordes et al., 2011) to fine-tune the pre-trained embeddings. These three models work in the following way. First, entities are represented as vectors, and relations as operators in an embedding space, characterized by vectors (SME and TransE) or matrices (SE). Then, for each triple (h, r, t), an energy function $f_r(h, t)$ is defined to measure its plausibility. Plausible triples are assumed to have low energies. Finally, to obtain entity and relation embeddings, a margin-based ranking loss, i.e.,
$$\mathcal{L} = \sum_{t^+ \in \mathcal{O}} \sum_{t^- \in \mathcal{N}_{t^+}} \left[ \gamma + f_r(h, t) - f_r(h', t') \right]_+ ,$$
is minimized. Here, $t^+ = (h, r, t) \in \mathcal{O}$ is an observed (positive) triple; $\mathcal{N}_{t^+}$ is the set of negative triples constructed by replacing entities in $t^+$, and $t^- = (h', r, t') \in \mathcal{N}_{t^+}$; $\gamma$ is a margin separating positive and negative triples; $[x]_+ = \max(0, x)$. Table 1 summarizes the entity/relation embeddings and the energy functions used in SME, TransE, and SE. For other KG embedding models, please refer to (Nickel et al., 2011; Riedel et al., 2013; Wang et al., 2014; Chang et al., 2014).
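The margin-based ranking loss can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' code: it uses a TransE-style energy with the L1 norm (the choice of norm is an assumption), enumerates every head and tail replacement instead of sampling negatives, and skips corrupted triples that are themselves observed. All function names are hypothetical.

```python
def transe_energy(h, r, t):
    """TransE-style energy ||h + r - t||_1 (the L1 norm is an assumption)."""
    return sum(abs(a + b - c) for a, b, c in zip(h, r, t))


def corrupt(triple, entities):
    """Negative set for a positive triple: replace its head or its tail."""
    h, r, t = triple
    negatives = [(e, r, t) for e in entities if e != h]
    negatives += [(h, r, e) for e in entities if e != t]
    return negatives


def ranking_loss(observed, entities, emb, margin=1.0):
    """Sum of [margin + f(t+) - f(t-)]_+ over positives and their negatives."""
    observed_set = set(observed)
    total = 0.0
    for pos in observed:
        h, r, t = pos
        e_pos = transe_energy(emb[h], emb[r], emb[t])
        for neg in corrupt(pos, entities):
            if neg in observed_set:
                continue  # assumption: observed triples are not used as negatives
            h2, _, t2 = neg
            e_neg = transe_energy(emb[h2], emb[r], emb[t2])
            total += max(0.0, margin + e_pos - e_neg)
    return total
```

The loss is zero once every negative triple has an energy at least `margin` higher than its positive counterpart, which is the separation the margin is meant to enforce.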
Table 1: Entity/relation embeddings and energy functions.
Method | Entity/Relation embedding | Energy function
SME (linear) (Bordes et al., 2014) | h, t ∈ R^k, r ∈ R^k | f_r(h, t) = (W_u1 r + W_u2 h + b_u)^T (W_v1 r + W_v2 t + b_v)
SME (bilinear) (Bordes et al., 2014) | h … | …
TransE (Bordes et al., 2013) | h … | …
We adopt stochastic gradient descent to solve the minimization problem, taking the entity and relation embeddings pre-trained in the first stage as initial values. The entity and relation embeddings fine-tuned in this way are required to be compatible within each triple, and thus can encode LCPs. Socher et al. (2013) have proposed a similar idea, i.e., to use embeddings learned from an auxiliary corpus as initial values. However, linking entities recognized in an auxiliary corpus to those occurring in the KG is always a non-trivial task. Our approach requires no auxiliary data, and naturally avoids the entity linking task.
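To make the warm start concrete, the sketch below initializes the embedding table from stage-one vectors and performs one stochastic gradient step on a single positive/negative pair. It is a hedged illustration only: the L1 TransE-style energy, the uniform range used for entries without a pre-trained vector, and the helper names are all assumptions, and the authors' implementations of SME, TransE, and SE may differ.

```python
import random


def init_embeddings(names, k, pretrained=None, rng=random):
    """Copy stage-one vectors where available; otherwise sample uniformly.

    The range [-0.1, 0.1] is an arbitrary assumption for this sketch.
    """
    pretrained = pretrained or {}
    return {
        n: list(pretrained[n]) if n in pretrained
        else [rng.uniform(-0.1, 0.1) for _ in range(k)]
        for n in names
    }


def l1_energy(h, r, t):
    return sum(abs(a + b - c) for a, b, c in zip(h, r, t))


def sign(x):
    return (x > 0) - (x < 0)


def sgd_step(emb, pos, neg, margin=1.0, lr=0.01):
    """One update on [margin + f(pos) - f(neg)]_+; returns the loss before the update."""
    (h, r, t), (h2, _, t2) = pos, neg
    loss = max(0.0, margin + l1_energy(emb[h], emb[r], emb[t])
               - l1_energy(emb[h2], emb[r], emb[t2]))
    if loss == 0.0:
        return loss  # the pair is already separated by the margin
    k = len(emb[r])
    # Subgradients of the two L1 energies, computed before any update.
    g_pos = [sign(emb[h][i] + emb[r][i] - emb[t][i]) for i in range(k)]
    g_neg = [sign(emb[h2][i] + emb[r][i] - emb[t2][i]) for i in range(k)]
    for i in range(k):
        emb[h][i] -= lr * g_pos[i]
        emb[t][i] += lr * g_pos[i]
        emb[r][i] -= lr * (g_pos[i] - g_neg[i])
        emb[h2][i] += lr * g_neg[i]
        emb[t2][i] -= lr * g_neg[i]
    return loss
```

Passing a dictionary of stage-one vectors as `pretrained` gives the context-dependent setting, while omitting it corresponds to a randomly initialized baseline.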

Experiments
We test our approach on the tasks of link prediction and triple classification. Two publicly available data sets are used. The first is WN18, released by Bordes et al. (2013). It is a subset of WordNet, consisting of 18 relations and the entities connected by them. The second is NELL186, containing the most frequent 186 relations in NELL (Carlson et al., 2010) and the associated entities. Triples are split into training/validation/test sets, used for model training, parameter tuning, and evaluation respectively. Knowledge paths are extracted from training sets. Table 2 gives some statistics of the data sets.
To perform context-dependent KG embedding, we use CBOW and Skip-gram in the pre-training stage, and SME, TransE, and SE in the fine-tuning stage. We take randomly initialized SME, TransE, and SE as baselines, denoted as *-Random. We do not compare to the setting that employs only CBOW or Skip-gram, since it does not provide an energy function to calculate triple plausibility, which hinders the evaluation of both tasks.

Link Prediction
Link prediction aims to predict whether there is a specific relation between two entities.
Evaluation Protocol. For each test triple, the head is replaced by every entity in the KG, and the energy is calculated for each corrupted triple. Ranking the energies in ascending order, we get the rank of the correct answer. We can get another rank by corrupting the tail. We report two metrics on the test sets: Mean (averaged rank) and Hits@10 (proportion of ranks no larger than 10).
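The ranking protocol can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the L1 TransE-style energy is an assumption, ties are resolved in favour of the correct entity, and no corrupted triples are filtered out. Function names are hypothetical.

```python
def transe_energy(h, r, t):
    """TransE-style energy ||h + r - t||_1 (assumed for illustration)."""
    return sum(abs(a + b - c) for a, b, c in zip(h, r, t))


def rank_of(true_entity, candidates, energy):
    """1-based rank of `true_entity` when candidates are sorted by ascending energy."""
    true_energy = energy(true_entity)
    better = sum(1 for c in candidates
                 if c != true_entity and energy(c) < true_energy)
    return 1 + better


def evaluate(test_triples, entities, emb, energy_fn=transe_energy):
    """Return (Mean, Hits@10) over head-corrupted and tail-corrupted ranks."""
    ranks = []
    for h, r, t in test_triples:
        # Corrupt the head: score every entity e in (e, r, t).
        ranks.append(rank_of(h, entities,
                             lambda e: energy_fn(emb[e], emb[r], emb[t])))
        # Corrupt the tail: score every entity e in (h, r, e).
        ranks.append(rank_of(t, entities,
                             lambda e: energy_fn(emb[h], emb[r], emb[e])))
    mean_rank = sum(ranks) / len(ranks)
    hits_at_10 = sum(rk <= 10 for rk in ranks) / len(ranks)
    return mean_rank, hits_at_10
```

Each test triple contributes two ranks, so a test set of n triples produces 2n ranks in total.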

Implementation Details. To train CBOW and Skip-gram, we use the word2vec implementations. 20 negative samples are drawn for each positive one. The context size is fixed to 5. To train SME, TransE, and SE, we use the implementations provided by the authors, with 100 mini-batches. We vary the learning rate in {0.01, 0.1, 1, 10}, the dimension k in {20, 50}, and the margin γ in {1, 2, 4}. The best model is selected by monitoring Hits@10 on the validation sets, with a total of at most 1000 iterations over the training sets.
Results. Table 3 reports the results on the test sets of WN18 and NELL186. The improvements of CBOW/Skip-gram over Random are also given. Statistically significant improvements are marked by ‡ (sign test, significance level 0.05). The results show that a pre-training stage consistently improves over the baselines for all the methods on both data sets. Almost all of the improvements are statistically significant.

Triple Classification
Triple classification aims to verify whether an unseen triple is correct or not.
Evaluation Protocol. For classification, a triple is predicted to be positive if its energy is below a relation-specific threshold δ_r, and negative otherwise. We report two metrics on the test sets: micro-averaged accuracy (per-instance average) and macro-averaged accuracy (per-relation average).
Implementation Details. We use the same parameter settings as in the link prediction task. The relation-specific threshold δ_r is determined by maximizing Micro-ACC on the validation sets.
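The threshold selection and the two accuracy metrics can be sketched as below. This is an illustrative sketch with hypothetical names, assuming each labeled instance is given as a (relation, energy, is_positive) tuple. Because each threshold affects only the instances of its own relation, choosing the most accurate threshold per relation also maximizes micro-averaged accuracy on the validation set. The candidate thresholds (midpoints between sorted energies) are an assumption.

```python
from collections import defaultdict


def best_threshold(scored):
    """Pick the threshold with the highest accuracy on (energy, is_positive) pairs."""
    energies = sorted(e for e, _ in scored)
    # Candidates: below all energies, midpoints between neighbours, above all.
    candidates = [energies[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(energies, energies[1:])]
    candidates.append(energies[-1] + 1.0)

    def accuracy(th):
        return sum((e < th) == y for e, y in scored) / len(scored)

    return max(candidates, key=accuracy)


def fit_thresholds(valid):
    """valid: iterable of (relation, energy, is_positive) tuples."""
    by_relation = defaultdict(list)
    for r, e, y in valid:
        by_relation[r].append((e, y))
    return {r: best_threshold(s) for r, s in by_relation.items()}


def micro_macro_accuracy(test, thresholds):
    """Micro = per-instance average; macro = average of per-relation accuracies."""
    per_relation = defaultdict(list)
    for r, e, y in test:
        per_relation[r].append((e < thresholds[r]) == y)
    flat = [c for cs in per_relation.values() for c in cs]
    micro = sum(flat) / len(flat)
    macro = sum(sum(cs) / len(cs) for cs in per_relation.values()) / len(per_relation)
    return micro, macro
```

The sketch raises a `KeyError` for a test relation that never occurs in the validation data; how such relations should be handled is left open here.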
Results. Table 4 reports the results on the test sets of WN18 and NELL186. The results again demonstrate both the superiority and the generality of our approach.

Discussions
This section explores why pre-training helps in KG embedding, specifically in link prediction. We first test different random initializations in traditional KG embedding models. We run SME (linear) twice on WN18, with two different initialization settings. Both are randomly sampled from the same uniform distribution, but with different seeds, referred to as Random-I and Random-II. Each setting finally gets 10,000 ranks on the test set: for each of the 5,000 test triples, both the head and the tail are corrupted and ranked. To better understand the difference between the two settings, we analyze the ranks individually, rather than reporting aggregated metrics (Mean and Hits@10). Specifically, we distribute the 10,000 instances into different bins according to the ranks given by one setting (e.g., Random-I). Instances assigned to the i-th bin all have rank i, that is, they are all ranked in the i-th position by this setting. Then, within each bin, we calculate the average rank of the instances given by the other setting (e.g., Random-II). If the average rank differs drastically from the bin ID, the instances in this bin are ranked significantly differently by the two settings. Figures 2(a) and 2(b) show the results, with the instances distributed according to Random-I and Random-II respectively. In both cases, we retain the bins with ID no larger than 50, covering about 85% of the instances. In most of the bins, the average rank (red bars in the figures) differs drastically from the bin ID (black bars in the figures), indicating that the ranks given by Random-I and Random-II are significantly different at the instance level. The results demonstrate the non-convexity of SME (linear): different initial values lead to different local minima.
We further compare two settings of initial values: 1) randomly sampled from a uniform distribution (Random) and 2) pre-trained by Skip-gram (Skip-gram). The results are given in Figures 2(c) and 2(d). In most of the bins Skip-gram has an average rank lower than the bin ID (Figure 2(c)), while Random has an average rank much higher than the bin ID (Figure 2(d)), implying that Skip-gram performs better than Random-I at the instance level. The results indicate that pre-training might help in finding better initial values, which lead to better local minima.
Finally, we test our two-stage KG embedding scheme where the Skip-gram model itself is given two different initialization settings, say Skip-gram-I and Skip-gram-II. The results are given in Figures 2(e) and 2(f). In each of the first 20 bins, Skip-gram-I and Skip-gram-II get an average rank almost the same as the bin ID, implying that the two settings perform quite similarly, particularly at the highest ranking levels. The results indicate that a pre-training stage might help in obtaining more stable embeddings.
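The bin-wise comparison used throughout this section can be sketched as follows. The function name is hypothetical; the inputs are assumed to be two aligned lists holding, for each test instance, the rank given by one setting and the rank given by the other.

```python
from collections import defaultdict


def binned_average(ranks_a, ranks_b, max_bin=50):
    """Bin instances by their rank under setting A, then average their rank under B.

    Returns {bin_id: average rank under B} for bins with ID no larger than `max_bin`.
    """
    bins = defaultdict(list)
    for rank_a, rank_b in zip(ranks_a, ranks_b):
        if rank_a <= max_bin:
            bins[rank_a].append(rank_b)
    return {i: sum(v) / len(v) for i, v in sorted(bins.items())}
```

An average close to the bin ID means the two settings rank those instances similarly, while a large gap means they disagree at the instance level.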

Conclusion
We have proposed a novel two-stage scheme for KG embedding, called context-dependent KG embedding. In the pre-training stage CCPs are encoded by a word embedding model, and in the fine-tuning stage LCPs are encoded by a traditional KG embedding model. Since both types of connectivity patterns are explicitly taken into account, our approach can obtain more accurate embeddings. Moreover, our approach is quite general, applicable to various word embedding and KG embedding models. Experimental results on link prediction and triple classification demonstrate the superiority, generality, and stability of our approach.
As future work, we plan to 1) investigate the efficacy of longer CCPs (i.e., knowledge paths with lengths longer than 5), and 2) design a joint model that encodes LCPs and CCPs simultaneously. Moreover, our approach reveals the possibility of a broader idea, i.e., initializing an embedding model by another embedding model. We would also like to test the feasibility of other such strategies, e.g., initializing SME by TransE, so as to combine the benefits of both models.