Neighborhood Mixture Model for Knowledge Base Completion

Knowledge bases are useful resources for many natural language processing tasks; however, they are far from complete. In this paper, we define a novel entity representation as a mixture of its neighborhood in the knowledge base and apply this technique to TransE, a well-known embedding model for knowledge base completion. Experimental results show that the neighborhood information significantly improves the results of TransE, leading to better performance than that obtained by other state-of-the-art embedding models on three benchmark datasets for the triple classification, entity prediction and relation prediction tasks.

Most embedding models for KB completion learn only from triples and, by doing so, ignore much of the information implicitly provided by the structure of the knowledge graph. Recently, several authors have addressed this issue by incorporating relation path information into model learning (García-Durán et al., 2015; Lin et al., 2015a; Guu et al., 2015; Toutanova et al., 2016) and have shown that the relation paths between entities in KBs provide useful information and improve knowledge base completion. For instance, a three-relation path (head, born in hospital/$r_1$, $e_1$) ⇒ ($e_1$, hospital located in city/$r_2$, $e_2$) ⇒ ($e_2$, city in country/$r_3$, tail) is likely to indicate that the fact (head, nationality, tail) could be true, so the relation path here $p = \{r_1, r_2, r_3\}$ is useful for predicting the relationship "nationality" between the head and tail entities.
Besides the relation paths, there could be other useful information implicitly present in the knowledge base that could be exploited for better KB completion. For instance, the whole neighborhood of entities could provide a lot of useful information for predicting the relationship between two entities. Consider for example a KB fragment given in Figure 1. If we know that Ben Affleck has won an Oscar award and Ben Affleck lives in Los Angeles, then this can help us to predict that Ben Affleck is an actor or a film maker, rather than a lecturer or a doctor. If we additionally know that Ben Affleck's gender is male then there is a higher chance for him to be a film maker. This intuition can be formalized by representing an entity vector as a relation-specific mixture of its neighborhood as follows: $\text{Ben Affleck} = \omega_{r,1}(\text{Violet Anne}, \text{child of}) + \omega_{r,2}(\text{male}, \text{gender}^{-1}) + \omega_{r,3}(\text{Los Angeles}, \text{lives in}^{-1}) + \omega_{r,4}(\text{Oscar award}, \text{won}^{-1})$, where $\omega_{r,i}$ are the mixing weights that indicate how important each neighboring relation is for predicting the relation $r$. For example, for predicting the occupation relationship, the knowledge about the child of relationship might not be that informative and thus the corresponding mixing coefficient can be close to zero, whereas it could be relevant for predicting some other relationship, such as parent or spouse, in which case the relation-specific mixing coefficient for the child of relationship could be high.
The primary contribution of this paper is introducing and formalizing the neighborhood mixture model. We demonstrate its usefulness by applying it to the well-known TransE model (Bordes et al., 2013). However, it could be applied to other embedding models as well, such as Bilinear models (Bordes et al., 2012; Yang et al., 2015) and STransE (Nguyen et al., 2016). While relation path models exploit extra information using longer paths existing in the KB, the neighborhood mixture model effectively incorporates information about many paths simultaneously. Our extensive experiments on three benchmark datasets show that it achieves superior performance over competitive baselines in three KB completion tasks: triple classification, entity prediction and relation prediction.

Neighborhood mixture modeling
In this section, we start by explaining how to formally construct the neighbor-based entity representations in Section 2.1, and then describe the Neighborhood Mixture Model applied to the TransE model (Bordes et al., 2013) in Section 2.2. Section 2.3 explains how we train our model.

Neighbor-based entity representation
Let $E$ denote the set of entities and $R$ the set of relation types. Denote by $R^{-1}$ the set of inverse relations $r^{-1}$. Denote by $G$ the knowledge graph consisting of a set of correct triples $(h, r, t)$, such that $h, t \in E$ and $r \in R$. Let $K$ denote the symmetric closure of $G$, i.e. if a triple $(h, r, t) \in G$, then both $(h, r, t)$ and $(t, r^{-1}, h) \in K$. Define:
$$N_{e,r} = \{e' \mid (e', r, e) \in K\}$$
as a set of neighboring entities connected to entity $e$ with relation $r$. Then
$$N_e = \{(e', r) \mid r \in R \cup R^{-1},\ e' \in N_{e,r}\}$$
is the set of all entity and relation pairs that are neighbors for entity $e$.
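The construction above can be illustrated with a minimal Python sketch. It assumes that $N_{e,r}$ collects every entity $e'$ with $(e', r, e) \in K$, which agrees with the Ben Affleck example in the introduction. The `^-1` suffix for inverse relations, the function names and the edge directions of the toy graph are illustrative assumptions, not part of the original implementation.

```python
from collections import defaultdict

INV = "^-1"  # hypothetical naming convention for inverse relations


def inverse(relation):
    """Return the inverse relation name (the inverse of an inverse is the original)."""
    return relation[:-len(INV)] if relation.endswith(INV) else relation + INV


def symmetric_closure(G):
    """K contains (h, r, t) and (t, r^-1, h) for every (h, r, t) in G."""
    K = set()
    for h, r, t in G:
        K.add((h, r, t))
        K.add((t, inverse(r), h))
    return K


def neighbor_sets(K):
    """Map each entity e to {r: N_{e,r}}, with N_{e,r} = {e' | (e', r, e) in K}."""
    N = defaultdict(lambda: defaultdict(set))
    for e_prime, r, e in K:
        N[e][r].add(e_prime)
    return N


def neighbor_pairs(N, e):
    """N_e: all (e', r) pairs that are neighbors of entity e."""
    return {(e_prime, r) for r, ents in N[e].items() for e_prime in ents}


# Toy graph; edge directions are inferred from the mixture example, not from Figure 1 itself.
G = {
    ("Violet Anne", "child of", "Ben Affleck"),
    ("Ben Affleck", "gender", "male"),
    ("Ben Affleck", "lives in", "Los Angeles"),
    ("Ben Affleck", "won", "Oscar award"),
}
K = symmetric_closure(G)
N = neighbor_sets(K)
ben_neighbors = neighbor_pairs(N, "Ben Affleck")
```

Under these assumptions, `ben_neighbors` contains exactly the four (entity, relation) pairs that appear in the mixture example for Ben Affleck, with inverse relations for gender, lives in and won.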
Each entity $e$ is associated with a $k$-dimensional vector $v_e \in \mathbb{R}^k$ and relation-dependent vectors $u_{e,r} \in \mathbb{R}^k$, $r \in R \cup R^{-1}$. Now we can define the neighborhood-based entity representation $\vartheta_{e,r}$ for an entity $e \in E$ for predicting the relation $r \in R$ as follows:
$$\vartheta_{e,r} = a_e v_e + \sum_{(e', r') \in N_e} b_{r,r'} u_{e',r'} \tag{1}$$
$a_e$ and $b_{r,r'}$ are the mixture weights that are constrained to sum to 1 for each neighborhood:
$$a_e \propto \delta + \exp \alpha_e$$
$$b_{r,r'} \propto \exp \beta_{r,r'}$$
where $\delta \geq 0$ is a hyper-parameter that controls the contribution of the entity vector $v_e$ to the neighbor-based mixture, and $\alpha_e$ and $\beta_{r,r'}$ are the learnable exponential mixture parameters.
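To make the sum-to-one constraint concrete, the following sketch normalizes the weights of one neighborhood. It assumes that the entity weight is proportional to $\delta + \exp \alpha_e$, that each neighbor weight is proportional to $\exp \beta_{r,r'}$, and that the constraint is enforced by dividing by the total over the entity and all of its neighbor pairs. The function name `mixture_weights` is hypothetical.

```python
import math


def mixture_weights(alpha_e, betas, delta):
    """Return (a_e, [b_1, ..., b_n]) for one neighborhood, normalized to sum to 1.

    alpha_e: learnable parameter of the entity itself.
    betas:   one learnable parameter beta_{r,r'} per neighbor pair (e', r').
    delta:   non-negative hyper-parameter boosting the entity's own vector.
    """
    raw_a = delta + math.exp(alpha_e)
    raw_b = [math.exp(beta) for beta in betas]
    z = raw_a + sum(raw_b)  # normalization constant (an assumption of this sketch)
    return raw_a / z, [rb / z for rb in raw_b]


# All parameters at zero and delta = 1: the entity gets 2/5, each of 3 neighbors 1/5.
a, b = mixture_weights(0.0, [0.0, 0.0, 0.0], delta=1.0)
# With no neighbors, the whole weight falls on the entity vector.
a_alone, b_alone = mixture_weights(0.0, [], delta=1.0)
```

A larger $\delta$ shifts weight toward the entity's own vector, which is the role the text assigns to this hyper-parameter.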
In real-world factual KBs, e.g. Freebase (Bollacker et al., 2008), some entities, such as "male", can have thousands or millions of neighboring entities sharing the same relation "gender". For such entities, computing the neighbor-based vectors can be computationally very expensive. To overcome this problem, we introduce in our implementation a filtering threshold $\tau$ and consider in the neighbor-based entity representation construction only those relation-specific neighboring entity sets for which $|N_{e,r}| \leq \tau$.

TransE-NMM: applying neighborhood mixtures to TransE
Embedding models define for each triple $(h, r, t) \in G$ a score function $f(h, r, t)$ that measures its implausibility. The goal is to choose $f$ such that the score $f(h, r, t)$ of a plausible triple $(h, r, t)$ is smaller than the score $f(h', r', t')$ of an implausible triple $(h', r', t')$.
TransE (Bordes et al., 2013) is a simple embedding model for knowledge base completion which, despite its simplicity, obtains very competitive results (García-Durán et al., 2016; Nickel et al., 2016). In TransE, both entities $e$ and relations $r$ are represented with $k$-dimensional vectors $v_e \in \mathbb{R}^k$ and $v_r \in \mathbb{R}^k$, respectively. These vectors are chosen such that for each triple $(h, r, t) \in G$:
$$v_h + v_r \approx v_t$$
The score function of the TransE model is the norm of this translation:
$$f(h, r, t)_{\text{TransE}} = \|v_h + v_r - v_t\|_{\ell_{1/2}}$$
We define the score function of our new model TransE-NMM in terms of the neighbor-based entity vectors as follows:
$$f(h, r, t) = \|\vartheta_{h,r} + v_r - \vartheta_{t,r^{-1}}\|_{\ell_{1/2}}$$
using either the $\ell_1$ or the $\ell_2$-norm, where $\vartheta_{h,r}$ and $\vartheta_{t,r^{-1}}$ are defined following Equation 1. The relation-specific entity vectors $u_{e,r}$ used to construct the neighbor-based entity vectors $\vartheta_{e,r}$ are defined based on the TransE translation operator:
$$u_{e,r} = v_e + v_r$$
in which $v_{r^{-1}} = -v_r$. For each correct triple $(h, r, t)$, the sets of neighboring entities $N_{h,r}$ and $N_{t,r^{-1}}$ exclude the entities $t$ and $h$, respectively.
If we set the filtering threshold $\tau = 0$ then $\vartheta_{h,r} = v_h$ and $\vartheta_{t,r^{-1}} = v_t$ for all triples. In this case, TransE-NMM reduces to the plain TransE model. In all our experiments presented in Section 4, the baseline TransE results are obtained with TransE-NMM with $\tau = 0$.
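The reduction to TransE can be checked numerically with a short NumPy sketch. It assumes neighbor vectors of the form $v_{e'} + v_{r'}$ (with the negated relation vector for inverse relations), mixture weights proportional to $\delta + \exp \alpha_e$ and $\exp \beta_{r,r'}$, and a score equal to the norm of $\vartheta_{h,r} + v_r - \vartheta_{t,r^{-1}}$. All function names and the data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def filter_by_tau(neighbors_by_relation, tau):
    """Keep only relation-specific neighbor sets with |N_{e,r}| <= tau."""
    return {r: ents for r, ents in neighbors_by_relation.items() if len(ents) <= tau}


def neighbor_vector(v_e, alpha_e, beta_row, neighbors, delta):
    """Neighbor-based vector of entity e when predicting some relation r.

    beta_row:  dict r' -> beta_{r,r'} for the relation r being predicted.
    neighbors: list of (v_neighbor, v_relation, r') entries that survived the
               tau filter; for an inverse relation pass v_relation = -v_r.
    """
    raw_a = delta + np.exp(alpha_e)
    raw_b = np.array([np.exp(beta_row[r]) for _, _, r in neighbors], dtype=float)
    z = raw_a + raw_b.sum()
    theta = (raw_a / z) * v_e
    for (v_n, v_rel, _), b in zip(neighbors, raw_b / z):
        theta = theta + b * (v_n + v_rel)  # translated neighbor vector
    return theta


def nmm_score(theta_h, v_r, theta_t, order=1):
    """Implausibility score: l1 (order=1) or l2 (order=2) norm of the translation."""
    return np.linalg.norm(theta_h + v_r - theta_t, ord=order)


rng = np.random.default_rng(0)
v_h, v_r, v_t = rng.normal(size=(3, 4))

# tau = 0 removes every non-empty neighbor set, so no neighbors remain.
kept = filter_by_tau({"gender^-1": {"a", "b", "c"}, "child of": {"x"}}, tau=0)
theta_h = neighbor_vector(v_h, 0.0, {}, [], delta=1.0)
theta_t = neighbor_vector(v_t, 0.0, {}, [], delta=1.0)
reduced = nmm_score(theta_h, v_r, theta_t)
plain_transe = np.linalg.norm(v_h + v_r - v_t, ord=1)

# A neighbor that translates exactly onto v_h leaves the mixture at v_h.
v_n = rng.normal(size=4)
theta_mix = neighbor_vector(v_h, 0.0, {"child of": 0.0},
                            [(v_n, v_h - v_n, "child of")], delta=1.0)
```

In this sketch `reduced` equals `plain_transe`, mirroring the statement that TransE-NMM with $\tau = 0$ is the plain TransE model. Excluding $t$ from the neighborhood of $h$ (and $h$ from that of $t$) is left to the caller.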

Parameter optimization
The TransE-NMM model parameters include the vectors $v_e$, $v_r$ for entities and relation types, the entity-specific weights $\alpha = \{\alpha_e \mid e \in E\}$ and relation-specific weights $\beta = \{\beta_{r,r'} \mid r, r' \in R \cup R^{-1}\}$. To learn these parameters, we minimize the $L_2$-regularized margin-based objective function:
$$L = \sum_{(h,r,t) \in G} \sum_{(h',r,t') \in G'_{(h,r,t)}} [\gamma + f(h, r, t) - f(h', r, t')]_+ + \frac{\lambda}{2}\left(\|\alpha\|_2^2 + \|\beta\|_2^2\right)$$
where $[x]_+ = \max(0, x)$, $\gamma$ is the margin hyper-parameter, $\lambda$ is the $L_2$ regularization parameter and
$$G'_{(h,r,t)} = \{(h', r, t) \mid h' \in E, (h', r, t) \notin G\} \cup \{(h, r, t') \mid t' \in E, (h, r, t') \notin G\}$$
is the set of incorrect triples generated by corrupting the correct triple $(h, r, t) \in G$. We applied the "Bernoulli" trick to choose whether to generate the head or tail entity when sampling an incorrect triple (Wang et al., 2014; Lin et al., 2015b; He et al., 2015; Ji et al., 2015; Ji et al., 2016). We use Stochastic Gradient Descent (SGD) with RMSProp adaptive learning rate to minimize $L$, and impose the following hard constraints during training: $\|v_e\|_2 \leq 1$ and $\|v_r\|_2 \leq 1$. We employ alternating optimization to minimize $L$. We first initialize the entity and relation-specific mixing parameters $\alpha$ and $\beta$ to zero and only learn the randomly initialized entity and relation vectors $v_e$ and $v_r$. Then we fix the learned vectors and only optimize the mixing parameters. In the final step, we fix the mixing parameters again and fine-tune the vectors. In all experiments presented in Section 4, we train for 200 epochs in each of the three optimization steps.
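As an illustration of the sampling and margin steps, the sketch below computes the per-relation probability of corrupting the head used by the "Bernoulli" trick of Wang et al. (2014), draws one incorrect triple, and evaluates the hinge term $[\gamma + f(h, r, t) - f(h', r, t')]_+$. Function names and the toy data are assumptions; the regularizer, the norm constraints and the RMSProp updates are omitted.

```python
import random
from collections import defaultdict


def head_replacement_prob(triples):
    """Per relation: tph / (tph + hpt), where tph is the average number of
    tails per head and hpt the average number of heads per tail."""
    count = defaultdict(int)
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in set(triples):
        count[r] += 1
        heads[r].add(h)
        tails[r].add(t)
    probs = {}
    for r in count:
        tph = count[r] / len(heads[r])
        hpt = count[r] / len(tails[r])
        probs[r] = tph / (tph + hpt)
    return probs


def corrupt(triple, entities, known, probs, rng):
    """Sample an incorrect triple by replacing the head or the tail.
    Loops until a triple outside `known` is found, so at least one must exist."""
    h, r, t = triple
    replace_head = rng.random() < probs[r]
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if replace_head else (h, r, e)
        if candidate not in known:
            return candidate


def hinge(pos_score, neg_score, gamma):
    """Margin term of the objective for one (correct, incorrect) pair."""
    return max(0.0, gamma + pos_score - neg_score)


known = {("a", "gender", "male"), ("b", "gender", "male"), ("c", "gender", "male")}
entities = ["a", "b", "c", "male", "female"]
probs = head_replacement_prob(known)
negative = corrupt(("a", "gender", "male"), entities, known, probs, random.Random(0))
```

For the many-to-one toy relation, the head is replaced with probability 0.25, so the tail is corrupted more often; this lowers the chance of accidentally generating a correct triple.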

Related work
Table 1: The score functions $f(h, r, t)$ and the optimization methods (Opt.) of several prominent embedding models for KB completion. In all of these models, the entities $h$ and $t$ are represented by vectors $v_h$ and $v_t \in \mathbb{R}^k$ respectively.
The models listed in Table 1 differ in their score function $f(h, r, t)$ and the algorithm used to optimize their margin-based objective function, e.g., SGD, AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012) or L-BFGS (Liu and Nocedal, 1989). The Unstructured model (Bordes et al., 2012) assumes that the head and tail entity vectors are similar. As the Unstructured model does not take the relationship into account, it cannot distinguish different relation types. The Structured Embedding (SE) model (Bordes et al., 2011) extends the Unstructured model by assuming that the head and tail entities are similar only in a relation-dependent subspace, where each relation is represented by two different matrices. Furthermore, the SME model (Bordes et al., 2012) uses four different matrices to project entity and relation vectors into a subspace. The TransH model (Wang et al., 2014) associates each relation with a relation-specific hyperplane and uses a projection vector to project entity vectors onto that hyperplane. TransD (Ji et al., 2015) and TransR/CTransR (Lin et al., 2015b) extend the TransH model by using two projection vectors and a matrix to project entity vectors into a relation-specific space, respectively. STransE (Nguyen et al., 2016) and TranSparse (Ji et al., 2016) are extensions of the TransR model, where head and tail entities are associated with their own projection matrices.
The DISTMULT model (Yang et al., 2015) is based on the Bilinear model (Nickel et al., 2011; Bordes et al., 2012; Jenatton et al., 2012) where each relation is represented by a diagonal rather than a full matrix. The neural tensor network (NTN) model (Socher et al., 2013) uses a bilinear tensor operator to represent each relation. Similar quadratic forms are used to model entities and relations in KG2E (He et al., 2015) and TATEC (García-Durán et al., 2016).
Recently, Neelakantan et al. (2015), Gardner and Mitchell (2015), Luo et al. (2015), Lin et al. (2015a), García-Durán et al. (2015), Guu et al. (2015) and Toutanova et al. (2016) showed that relation paths between entities in KBs provide richer information and improve the relationship prediction. In fact, our new TransE-NMM model can also be viewed as a three-relation path model as it takes into account the neighborhood entity and relation information of both head and tail entities in each triple. Luo et al. (2015) constructed relation paths between entities and, viewing entities and relations in the path as pseudo-words, applied Word2Vec algorithms (Mikolov et al., 2013) to produce pre-trained vectors for these pseudo-words. Luo et al. (2015) showed that using these pre-trained vectors for initialization helps to improve the performance of the TransE, SME and SE models. RTransE (García-Durán et al., 2015), PTransE (Lin et al., 2015a) and TransE-COMP (Guu et al., 2015) are extensions of the TransE model. These models similarly represent a relation path by a vector which is the sum of the vectors of all relations in the path, whereas in the Bilinear-COMP model (Guu et al., 2015), each relation is a matrix and so it represents the relation path by matrix multiplication. Our neighborhood mixture model can be adapted to both relation path models Bilinear-COMP and TransE-COMP, by replacing head and tail entity vectors by the neighbor-based vector representations, thus combining advantages of both path and neighborhood information. Nickel et al. (2015) review other approaches for learning from KBs and multi-relational data.
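The two ways of composing a relation path mentioned above can be contrasted in a few lines of NumPy. This is only a sketch: the function names are hypothetical, and multiplying the matrices in path order is an assumption about the Bilinear-COMP convention.

```python
from functools import reduce

import numpy as np


def path_as_vector_sum(relation_vectors):
    """Additive composition: the path vector is the sum of its relation vectors."""
    return np.sum(relation_vectors, axis=0)


def path_as_matrix_product(relation_matrices):
    """Multiplicative composition: the path matrix is the product of its
    relation matrices, taken in path order (assumed convention)."""
    return reduce(np.matmul, relation_matrices)


r1, r2, r3 = np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 3.0])
p_vec = path_as_vector_sum([r1, r2, r3])

M1 = np.array([[1.0, 1.0], [0.0, 1.0]])
M2 = np.array([[2.0, 0.0], [0.0, 3.0]])
p_mat = path_as_matrix_product([M1, M2])
```

The additive form is insensitive to the order of relations in the path, whereas the matrix product is not, since matrix multiplication is not commutative in general.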

Experiments
To investigate the usefulness of the neighbor mixtures, we compare the performance of TransE-NMM against the results of the baseline TransE and other state-of-the-art embedding models on the triple classification, entity prediction and relation prediction tasks. We conduct experiments using three publicly available datasets: WN11, FB13 and NELL186.

Datasets
For all three datasets, the validation and test sets containing both correct and incorrect triples have already been constructed. Statistical information about these datasets is given in Table 2.
The two benchmark datasets, WN11 and FB13, were produced by Socher et al. (2013) for triple classification. WN11 is derived from the large lexical KB WordNet (Miller, 1995) and involves 11 relation types. FB13 is derived from the large real-world fact KB FreeBase (Bollacker et al., 2008) and covers 13 relation types. The NELL186 dataset was introduced by Guo et al. (2015) for both the triple classification and entity prediction tasks and contains the 186 most frequent relations in the KB of the CMU Never Ending Language Learning project (Carlson et al., 2010).

Evaluation tasks
We evaluate our model on three commonly used benchmark tasks: triple classification, entity prediction and relation prediction. This subsection describes those tasks in detail.

Triple classification:
The triple classification task was first introduced by Socher et al. (2013), and since then it has been used to evaluate various embedding models. The aim of the task is to predict whether a triple $(h, r, t)$ is correct or not.
For classification, we set a relation-specific threshold $\theta_r$ for each relation type $r$. If the implausibility score of an unseen test triple $(h, r, t)$ is smaller than $\theta_r$ then the triple will be classified as correct, otherwise incorrect. Following Socher et al. (2013), the relation-specific thresholds are determined by maximizing the micro-averaged accuracy, which is a per-triple average, on the validation set. We also report the macro-averaged accuracy, which is a per-relation average.
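The thresholding procedure and the two averages can be sketched as follows. The sketch assumes that each threshold is chosen independently per relation among the midpoints of the sorted validation scores; the function names and the toy scores are hypothetical.

```python
from collections import defaultdict


def best_threshold(scores, labels):
    """Pick theta maximizing accuracy when 'score < theta' means 'correct'."""
    s = sorted(set(scores))
    thetas = [s[0] - 1.0] + [(a + b) / 2 for a, b in zip(s, s[1:])] + [s[-1] + 1.0]

    def accuracy(theta):
        return sum((x < theta) == y for x, y in zip(scores, labels)) / len(scores)

    return max(thetas, key=accuracy)


def micro_macro_accuracy(examples, thresholds):
    """examples: list of (relation, score, is_correct) test triples."""
    hits = defaultdict(list)
    for r, score, is_correct in examples:
        hits[r].append((score < thresholds[r]) == is_correct)
    total = sum(len(v) for v in hits.values())
    micro = sum(sum(v) for v in hits.values()) / total          # per-triple average
    macro = sum(sum(v) / len(v) for v in hits.values()) / len(hits)  # per-relation average
    return micro, macro


validation = {
    "r1": ([0.2, 0.4, 0.9, 1.1], [True, True, False, False]),
    "r2": ([1.0, 2.0], [True, False]),
}
thresholds = {r: best_threshold(s, y) for r, (s, y) in validation.items()}
test = [("r1", 0.3, True), ("r1", 0.5, False), ("r1", 1.0, False), ("r1", 0.1, True),
        ("r2", 1.2, True), ("r2", 1.4, False)]
micro, macro = micro_macro_accuracy(test, thresholds)
```

On this toy data the micro-averaged accuracy is 4/6 while the macro-averaged accuracy is 0.625, because the smaller relation `r2` counts as much as `r1` in the per-relation average.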

Entity prediction:
The entity prediction task (Bordes et al., 2013) predicts the head or the tail entity given the relation type and the other entity, i.e. predicting $h$ given $(?, r, t)$ or predicting $t$ given $(h, r, ?)$, where $?$ denotes the missing element. The results are evaluated using a ranking induced by the function $f(h, r, t)$ on test triples. Note that the incorrect triples in the validation and test sets are not used for evaluating the entity prediction task or the relation prediction task.
Each correct test triple $(h, r, t)$ is corrupted by replacing either its head or tail entity by each of the possible entities in turn, and then we rank these candidates in ascending order of their implausibility score. This is called the "Raw" setting protocol. For the "Filtered" setting protocol described in Bordes et al. (2013), we also filter out before ranking any corrupted triples that appear in the KB. Ranking a corrupted triple appearing in the KB (i.e. a correct triple) higher than the original test triple is also correct, but is penalized by the "Raw" score; thus the "Filtered" setting provides a clearer view of the ranking performance.
In addition to the mean rank and the Hits@10 (i.e., the proportion of test triples for which the target entity was ranked in the top 10 predictions), which were originally used in the entity prediction task (Bordes et al., 2013), we also report the mean reciprocal rank (MRR), which is commonly used in information retrieval. In both "Raw" and "Filtered" settings, mean rank is always greater than or equal to 1 and a lower mean rank indicates better entity prediction performance. The MRR and Hits@10 scores always range from 0.0 to 1.0, and a higher score reflects a better prediction result.
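The ranking protocol and the three metrics can be summarized in a small sketch. It assumes that ties are resolved in favor of the target (only strictly smaller scores count against it); the function names and toy scores are illustrative.

```python
def rank_of_target(scores, target, filtered_out=()):
    """1-based rank of `target` among candidates sorted by ascending implausibility.

    scores:       dict candidate entity -> implausibility score.
    filtered_out: candidates that form other correct triples in the KB; they are
                  skipped in the "Filtered" setting and left empty for "Raw".
    """
    skip = set(filtered_out) - {target}
    target_score = scores[target]
    better = sum(1 for c, s in scores.items()
                 if c != target and c not in skip and s < target_score)
    return 1 + better


def summarize(ranks, k=10):
    """Mean rank, mean reciprocal rank and Hits@k over a list of ranks."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        "Hits@%d" % k: sum(r <= k for r in ranks) / n,
    }


scores = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
raw_rank = rank_of_target(scores, "c")                            # a and b rank higher
filtered_rank = rank_of_target(scores, "c", filtered_out={"a"})   # a is a known correct answer
metrics = summarize([1, 2, 20])
```

Here the "Filtered" rank is better than the "Raw" rank because the candidate `a`, assumed to form another correct triple, no longer counts against the target.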

Relation prediction:
The relation prediction task (Lin et al., 2015a) predicts the relation type given the head and tail entities, i.e. predicting $r$ given $(h, ?, t)$, where $?$ denotes the missing element. We corrupt each correct test triple $(h, r, t)$ by replacing its relation $r$ by each possible relation type in turn, and then rank these candidates in ascending order of their implausibility score. Just as in the entity prediction task, we use two setting protocols, "Raw" and "Filtered", and evaluate on mean rank, MRR and Hits@10.

Hyper-parameter tuning
For all evaluation tasks, results for TransE are obtained with TransE-NMM with the filtering threshold τ = 0, while we set τ = 10 for TransE-NMM.
For triple classification, we first performed a grid search to choose the optimal hyper-parameters for TransE by monitoring the micro-averaged triple classification accuracy after each training epoch on the validation set. For all datasets, we chose either the $\ell_1$ or $\ell_2$ norm in the score function $f$ and the initial RMSProp learning rate $\eta \in \{0.001, 0.01\}$. Following previous work (Wang et al., 2014; Lin et al., 2015b; Ji et al., 2015; He et al., 2015; Ji et al., 2016), we selected the margin hyper-parameter $\gamma \in \{1, 2, 4\}$ and the number of vector dimensions $k \in \{20, 50, 100\}$ on WN11 and FB13. On NELL186, we set $\gamma = 1$ and $k = 50$ (Guo et al., 2015; Luo et al., 2015). The highest accuracy on the validation set was obtained with $\eta = 0.01$ for all three datasets, and with the $\ell_2$ norm for NELL186; $\gamma = 4$, $k = 20$ and the $\ell_1$ norm for WN11; and $\gamma = 1$, $k = 100$ and the $\ell_2$ norm for FB13.
We set the hyper-parameters $\eta$, $\gamma$, $k$, and the $\ell_1$ or the $\ell_2$-norm in our TransE-NMM model to the same optimal hyper-parameters searched for TransE. We then used a grid search to select the hyper-parameter $\delta \in \{0, 1, 5, 10\}$ and the $L_2$ regularizer $\lambda \in \{0.005, 0.01, 0.05\}$ for TransE-NMM. By monitoring the micro-averaged accuracy after each training epoch, we obtained the highest accuracy on the validation set when using $\delta = 1$ and $\lambda = 0.05$ for both WN11 and FB13, and $\delta = 0$ and $\lambda = 0.01$ for NELL186.

In Table 4, we compare the micro-averaged triple classification accuracy of our TransE-NMM model with the previously reported results on the WN11 and FB13 datasets. The first five rows report the performance of models that use TransE to initialize the entity and relation vectors. The last eight rows present the accuracy of models with randomly initialized parameters. Table 4 shows that our TransE-NMM model obtains the highest accuracy on WN11 and achieves the second highest result on FB13. Note that there are higher results reported for NTN (Socher et al., 2013), Bilinear-COMP (Guu et al., 2015) and TransE-COMP when entity vectors are initialized by averaging the pre-trained word vectors (Mikolov et al., 2013; Pennington et al., 2014). This is not surprising as many entity names in WordNet and FreeBase are lexically meaningful. It is possible for all other embedding models to utilize the pre-trained word vectors as well. However, as pointed out by Wang et al. (2014) and Guu et al. (2015), averaging the pre-trained word vectors for initializing entity vectors is an open problem and it is not always useful since entity names in many domain-specific KBs are not meaningful.
Table 5 compares the accuracy for triple classification, the raw mean rank and raw Hits@10 scores for entity prediction on the NELL186 dataset. The first three rows present the best results reported in Guo et al. (2015), while the next three rows present the best results reported in Luo et al. (2015). TransE-NMM obtains the highest triple classification accuracy, the best raw mean rank and the second highest raw Hits@10 on the entity prediction task in this comparison.

Qualitative results
Table 6 presents some examples to illustrate the useful information modeled by the neighbors. We took the relation-specific mixture weights from the learned TransE-NMM model optimized on the entity prediction task, and then extracted three neighbor relations with the largest mixture weights given a relation.
Table 6 shows that those relations are semantically coherent. For example, if we know the place of birth and/or the place of death of a person and/or the location where the person is living, it is likely that we can predict the person's nationality. On the other hand, if we know that a person works for an organization and that this person is also the top member of that organization, then it is possible that this person is the CEO of that organization.
[...] higher than the TransE results originally published in Bordes et al. (2013).
As presented in Table 3, for entity prediction using WN11, TransE-NMM with the filtering threshold $\tau = 10$ obtains a better mean rank than TransE (about 15% relative improvement) but lower Hits@10 and mean reciprocal rank. The reason might be that in semantic lexical KBs such as WordNet, where relationships between words or word groups are manually constructed, whole neighborhood information might be useful. So when using a small filtering threshold, the model ignores a lot of potential information that could help predict relationships. Figure 2 presents relative improvements in entity prediction of TransE-NMM over TransE on WN11 when varying the filtering threshold $\tau$. Figure 2 shows that TransE-NMM gains better scores with higher values of $\tau$. Specifically, when $\tau = 500$ TransE-NMM does significantly better than TransE in all entity prediction metrics.
[...] (with other hyper-parameters being the same as selected in Section 4.3). Prefixes "R-" and "F-" denote the "Raw" and "Filtered" settings, respectively. Suffixes "-MR", "-MRR" and "-H@10" abbreviate the mean rank, the mean reciprocal rank and Hits@10, respectively.

Conclusion and future work
We introduced a neighborhood mixture model for knowledge base completion by constructing neighbor-based vector representations for entities. We demonstrated its effect by extending TransE (Bordes et al., 2013) with our neighborhood mixture model. On three different datasets, experimental results show that our model significantly improves TransE and obtains better results than the other state-of-the-art embedding models on triple classification, entity prediction and relation prediction tasks. In future work, we plan to apply the neighborhood mixture model to other embedding models, especially to relation path models such as TransE-COMP, to combine the useful information from both relation paths and entity neighborhoods.

Acknowledgments
This research was supported by a Google award through the Natural Language Understanding Focused Program, and under the Australian Research Council's Discovery Projects funding scheme (project number DP160102156). This research was also supported by NICTA, funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The first author was supported by an International Postgraduate Research Scholarship and a NICTA NRPA Top-Up Scholarship.

Figure 1: An example fragment of a KB.

Table 1 summarizes related embedding models for link prediction and KB completion.

Table 2: Statistics of the experimental datasets used in this study (and previous works). #E is the number of entities, #R is the number of relation types, and #Train, #Valid and #Test are the numbers of correct triples in the training, validation and test sets, respectively. Each validation and test set also contains the same number of incorrect triples as the number of correct triples.

Table 5: Results on the NELL186 test set. Results for the entity prediction task are in the "Raw" setting. "-SkipG" abbreviates "-Skip-gram".