A2N: Attending to Neighbors for Knowledge Graph Inference

State-of-the-art models for knowledge graph (KG) completion aim to learn a fixed embedding representation of entities in a multi-relational graph which can generalize to infer unseen entity relationships at test time. This can be sub-optimal, as it requires memorizing and generalizing to all possible entity relationships using these fixed representations. We thus propose a novel attention-based method to learn a query-dependent representation of entities, which adaptively combines the relevant graph neighborhood of an entity, leading to more accurate KG completion. The proposed method is evaluated on two benchmark datasets for KG completion, and experimental results show that the proposed model performs competitively with or better than existing state-of-the-art methods, including recent methods for explicit multi-hop reasoning. Qualitative probing offers insight into how the model can reason about facts involving multiple hops in the knowledge graph through the use of neighborhood attention.


Introduction
Knowledge graphs (KGs), such as Freebase (Bollacker et al., 2008), contain a wealth of structured knowledge in the form of relationships between entities and are useful for numerous end applications. However, KGs, whether automatically constructed or human curated, are incomplete (Banko et al., 2007), and thus automatic methods for KG completion have been an important area of research (Nickel et al., 2016). The task of KG completion requires inferring missing entity relationships from the observed graph and is often formulated as predicting a target entity for a given query of source entity e and relation r, that is, completing the tuple (e, r, ?).

* Work done as an intern at Google Research.

Most state-of-the-art methods for KG completion learn vector embeddings of entities and relations (Bordes et al., 2013; Toutanova et al., 2015; Dettmers et al., 2017; Trouillon et al., 2016), which are used in conjunction with a (potentially parameterized) scoring function that scores every tuple in the graph. These embeddings are optimized such that the score for observed graph tuples is higher than for random tuples. While these models achieve good performance, they learn a fixed-dimensional embedding for every entity, which necessitates that this embedding memorize, and then be able to generalize to infer, all possible relationships for the entity, which may require multiple hops of reasoning in the KG (Neelakantan et al., 2015; Das et al., 2017).
In contrast, it can be beneficial to compose embeddings from a query-relevant subset of the graph neighborhood of the entity. As a motivating example, consider answering the query (e, nationality, ?) for some entity e. Observing the KG neighbor (e, lived in, Maui) allows us to project e into the Maui region of the embedding space, which can lead to a high score for predicting the target USA (through an appropriate scoring function), since Maui and USA are close in embedding space due to other relations between them in the KG. Note that e can have a type very different from that of Maui; for example, e can be Oprah Winfrey, in which case its type would be Actor, but using the neighborhood we can still project it close to USA for this query.
Thus, we propose A2N, an effective model (Section 2) which, conditioned on the query, uses bilinear attention over the graph neighborhood of an entity to generate an embedding representation of the entity. This query-specific, neighborhood-informed representation is then used to score target entities for the query.

Figure 1: An actual example of how the A2N model generates the answer for two different queries on the same entity, Oprah Winfrey, in the FB15k-237 graph. We show a subset of the top neighbors. Each neighbor is assigned a probability based on the query, and the neighbor representations are pooled according to these probabilities to obtain the embedding of the source entity. The top 3 neighbors are in bold face.

Intuitively, for the example described above, the model
can score neighbors connected via the lived in relation higher, so that the resulting entity embedding lies in the US region of the embedding space, which, when scored in conjunction with the query relation nationality, yields a high score for the target entity US. Fig. 1 shows an actual example of how the model scores the graph neighborhood for two different queries on the same node, attending to a different relevant subset of the neighborhood for each query.
On two standard benchmark datasets for KG completion (Dettmers et al., 2017) -FB15k-237 (Toutanova et al., 2015) and WN18RR -we show (Section 3.2) that the model performs competitively or better than existing state-of-the-art models. Qualitative analysis (Fig. 1, Section 3.2) shows that the model indeed assigns higher scores to relevant neighbors based on the query and provides insight into how the model answers queries requiring multiple hops.

Methodology
Problem Formulation and Notation: Let [X] represent the integer set {1, . . . , X}. We are given a KG, G := {(s, r, t)}, where each tuple consists of a source entity s ∈ [V_e], a relation r ∈ [V_r], and a target entity t ∈ [V_e], with V_r being the number of relations and V_e the number of entities in the graph. The objective is to predict the target entity for a given query of source entity and relation, q := (s, r, ?), such that the predicted tuple does not already exist in G.
Entities e and relations r are represented as k-dimensional embeddings ẽ and r̃. Most embedding-based methods for KG completion work by defining a scoring function f for every possible tuple in the KG. For example, DistMult (Yang et al., 2014) uses the score

f(s, r, t) = s̃ᵀ Diag(r̃) t̃,    (1)

where Diag(r̃) is a k × k diagonal matrix with r̃ on its diagonal. Other scoring functions have been proposed in the literature (Bordes et al., 2013; Dettmers et al., 2017; Trouillon et al., 2016). We use the DistMult scoring function in our experiments for its simplicity and good performance when tuned properly (Kadlec et al., 2017), though our model can be combined with any other scoring function.
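To make the scoring concrete, here is a minimal NumPy sketch of the DistMult function: since Diag(r̃) is diagonal, the bilinear form reduces to an element-wise triple product summed over the k dimensions. Variable names are ours, not from the paper.

```python
import numpy as np

def distmult_score(s_emb, r_emb, t_emb):
    """DistMult score f(s, r, t) = s̃ᵀ Diag(r̃) t̃ of Eq. (1)."""
    # Diag(r) collapses the bilinear form into an element-wise product summed over k.
    return float(np.sum(s_emb * r_emb * t_emb))

# Toy usage with random k-dimensional embeddings.
rng = np.random.default_rng(0)
k = 8
s_emb, r_emb, t_emb = (rng.normal(size=k) for _ in range(3))
score = distmult_score(s_emb, r_emb, t_emb)
```

Note that the score is symmetric in the source and target, one reason the inverse-relation augmentation used during training (Section 2) is helpful.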

A2N Model
We now describe our graph-attention model. Consider the neighborhood of an entity s to be N_s = {(r_i, e_i) | (s, r_i, e_i) ∈ G}. We associate each graph entity e with an initial embedding ẽ⁰, and each relation r with an embedding r̃. We first encode every neighbor into an embedding: the embedding of a neighbor (r_i, e_i) ∈ N_s of entity s is obtained by concatenating the initial embeddings and projecting with a linear transform,

n_i = W_n [r̃_i ; ẽ_i⁰].

The model then attends to each element of N_s, assigning it a probability of being relevant for answering the query, and generates the query-dependent embedding of s by aggregating the neighbor embeddings weighted by their relevance. Concretely, given a query (s, r, ?), we assign each neighbor n_i ∈ N_s a scalar attention score

a_i = s̃⁰ᵀ Diag(r̃) n_i,    (2)

which is normalized over all neighbors to obtain probabilities p_i = exp(a_i) / Σ_j exp(a_j). The neighbor embeddings are then aggregated with p_i as weights to generate a new source embedding ŝ = Σ_i p_i n_i. This is concatenated with the initial source embedding and projected to K dimensions to obtain the final source embedding

s̃ = W_s [s̃⁰ ; ŝ],

where W_n ∈ R^{K×2K} and W_s ∈ R^{K×2K} are projection matrices. We use this attention-based source embedding s̃, along with the query relation embedding r̃ and the base embedding of a potential target entity t̃⁰, in the DistMult scoring function of Eq. (1) to score the tuple (s, r, t). This is done for all entities t ∈ [V_e] to obtain a ranked list of potential target entities for the query. We use the DistMult form for the attention scoring in Eq. (2) because it allows the model to learn to project neighbors into the same space as target entities, so that correct targets receive high scores when the resulting embedding is scored again with Eq. (1).
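The forward pass above can be sketched end-to-end in NumPy. The exact parameterization of the bilinear attention score is our reading of the DistMult-style scoring between the query and each encoded neighbor; all names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def a2n_embed(s0, r_query, neighbor_rels, neighbor_ents, W_n, W_s):
    """Query-dependent source embedding (sketch of the A2N forward pass).

    s0: (K,) initial source embedding; r_query: (K,) query relation embedding;
    neighbor_rels / neighbor_ents: (m, K) embeddings of the m neighbors (r_i, e_i);
    W_n, W_s: (K, 2K) projection matrices.
    """
    # n_i = W_n [r_i ; e_i]: encode each neighbor with a linear projection.
    n = np.concatenate([neighbor_rels, neighbor_ents], axis=1) @ W_n.T  # (m, K)
    # DistMult-style attention score a_i between the query (s0, r) and each n_i.
    a = n @ (s0 * r_query)                                              # (m,)
    p = softmax(a)                      # neighbor relevance probabilities
    s_hat = p @ n                       # attention-weighted aggregation
    # Final embedding: project the concatenation of initial and aggregated parts.
    s_final = W_s @ np.concatenate([s0, s_hat])                         # (K,)
    return s_final, p

# Toy usage: 5 neighbors, K = 8, random parameters.
rng = np.random.default_rng(1)
K, m = 8, 5
W_n, W_s = rng.normal(size=(K, 2 * K)), rng.normal(size=(K, 2 * K))
s_final, p = a2n_embed(rng.normal(size=K), rng.normal(size=K),
                       rng.normal(size=(m, K)), rng.normal(size=(m, K)),
                       W_n, W_s)
```

The returned probabilities p are what Fig. 1 visualizes: a distribution over the neighborhood that changes with the query relation.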
Training: The model is randomly initialized, and all embeddings and projection parameters are trained by taking a tuple (s, r, t) ∈ G from the graph, hiding the target entity t, and randomly sampling negative entities t⁻ ∈ {e | (s, r, e) ∉ G}. The scores for the positive tuple and each negative tuple are passed through a softmax to compute the likelihood of predicting the correct target. The same process is repeated for predicting the source entity given the query (?, r, t). We also augment the graph by adding an inverse relation for every graph relation, which improves training by increasing the number of neighborhood elements for the model to attend to.
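A simplified scalar sketch of this objective, assuming the tuple scores have already been computed by the model: the loss is the cross-entropy of a softmax taken over the positive tuple and its sampled negatives.

```python
import numpy as np

def softmax_nll(pos_score, neg_scores):
    """-log p(correct target) under a softmax over {positive} ∪ sampled negatives."""
    scores = np.concatenate(([pos_score], neg_scores))
    z = np.exp(scores - scores.max())       # subtract max for numerical stability
    return float(-np.log(z[0] / z.sum()))

# When the positive tuple scores far above the negatives, the loss approaches 0.
low = softmax_nll(10.0, np.array([-5.0, -4.0, -6.0]))
# With indistinguishable scores, the loss equals log(number of candidates).
flat = softmax_nll(0.0, np.zeros(3))
```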

Experimental Setup
Datasets: Following Dettmers et al. (2017), we evaluate the model on two standard benchmark datasets for KG completion, FB15k-237 (Toutanova et al., 2015) and WN18RR (Dettmers et al., 2017).

Evaluation Protocol: We followed the evaluation protocol of Dettmers et al. (2017). Each test tuple (s, r, t) is converted into two queries: a target query (s, r, ?) and a source query (?, r, t). For every query, the correct entity is ranked among all KG entities, excluding the set of other true entities for the same query observed in the train, dev, or test sets. See Kadlec et al. (2017) for more details. We report the Mean Reciprocal Rank (MRR), that is, the average reciprocal rank of the correct entity, and Hits@N, the fraction of queries for which the correct entity appears in the top N predictions. Experimental details, including all hyper-parameters, are in Appendix A.
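A small sketch of the filtered ranking protocol and the two metrics. The tie-breaking rule (counting ties pessimistically against the correct entity) is our assumption, as the protocol description does not specify one.

```python
def filtered_rank(scores, correct, other_true):
    """Rank of the correct entity among all candidates, excluding other known
    true answers for the same query (the "filtered" protocol).

    scores: list of model scores, one per entity; correct: index of the gold
    entity; other_true: set of indices of other true answers to exclude.
    """
    competitors = [s for i, s in enumerate(scores)
                   if i != correct and i not in other_true]
    # Ties count against the correct entity (assumption).
    return 1 + sum(s >= scores[correct] for s in competitors)

def mrr_and_hits(ranks, n=10):
    """Mean Reciprocal Rank and Hits@N over a list of per-query ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_at_n = sum(r <= n for r in ranks) / len(ranks)
    return mrr, hits_at_n

# 4 candidate entities; entity 1 is correct, entity 0 is another true answer,
# so only entities 2 and 3 compete and the filtered rank is 3.
rank = filtered_rank([0.9, 0.5, 0.7, 0.8], correct=1, other_true={0})
```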

Results
Results are summarized in Tables 1 and 2. We compare with many state-of-the-art methods for KG completion. Note that we did not fine-tune separate models for target-only and source-and-target prediction; instead we trained one model for source-and-target prediction and used it for both evaluations. In Table 1, we evaluate performance on target-only prediction. The baseline results on target-only prediction are taken from Das et al. (2017), who fine-tuned all the models for this task. We find that the proposed A2N model performs significantly better than all baseline models for target-only prediction. Interestingly, the model also performs significantly better than MINERVA, a model which uses reinforcement learning to search for explicit paths for multi-hop reasoning (Das et al., 2017).
In Table 2, we evaluate on both source and target prediction. The remaining baseline results are reproduced from the respective papers. We find that on WN18RR the model performs better than all baselines, except on the Hits@10 metric, where it is competitive with ConvE. On FB15k-237, the model performs competitively with ConvE and better than all the other models. Among existing baselines, we found ConvE to be the best competitor to our model. Note that, in general, all models perform better on target-only prediction (Table 1) than on both source and target prediction. This is due to more ambiguous, one-to-many queries when predicting the source entity (Das et al., 2017), for example (?, nationality, US). For such generic source-prediction queries we expect attention to be of limited use.
Qualitative Results: Fig. 1 shows how the model attends to different subsets of neighbors of the same graph entity for different queries. This example also demonstrates how the model can reason over multiple hops of facts. Using neighbors such as places lived, the entity is first projected into a relevant subspace of the embedding space, which, when scored with the target entity US, leads to a high DistMult score for the relation nationality. Here the model implicitly reasoned over a two-hop fact: first about places lived, and second about the country of those places. More examples of attention are provided in Fig. 2; refer to Appendix B for more qualitative analysis.

Related Work
KG completion is an important research area, and several embedding-based models have been proposed, such as TransE, which scores translations of entities in embedding space (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx, which extends DistMult to complex space (Trouillon et al., 2016), and ConvE, which uses 2D convolution layers (Dettmers et al., 2017), as well as recent tensor decomposition methods (Lacroix et al., 2018). Refer to Nickel et al. (2016) for a more comprehensive review. Recently, Das et al. (2017) and Xiong et al. (2017) proposed reinforcement learning methods which find paths in the KG. We compared with MINERVA (Das et al., 2017), a recent such method, and found A2N to perform favorably. Graph Convolutional Networks (Kipf and Welling, 2016; Schlichtkrull et al., 2017) and Graph Attention Networks (Veličković et al., 2017) also learn neighborhood-based representations of nodes. However, they do not learn a query-dependent composition of the neighborhood, which is sub-optimal, as also seen in our experiments and noted previously (Dettmers et al., 2017). They are also computationally expensive. Nguyen et al. (2016) proposed a closely related neighborhood mixture model. However, their model learns a fixed mixture over neighbors, as opposed to an adaptive mixture based on the query, and requires storing an embedding parameter for every entity-relation pair, which can be prohibitively large, potentially O(V_e × V_r), whereas our model only requires O(V_e + V_r). Moreover, their model cannot generalize to unseen entity-relation pairs and new neighbors of an entity, even when the entity and relation of the pair were observed with other relations or entities. Our work is also related to memory network models, often used for question answering (Kumar et al., 2016; Miller et al., 2016; Bansal et al., 2017). To the best of our knowledge, this is the first work utilizing attention to learn query-dependent entity embeddings from entity neighborhoods.

Conclusion
We proposed A2N, an attention-based model for learning query-dependent entity embeddings based on the graph neighborhood. The model performs favorably when compared with state-of-the-art models for KG completion. The model has attractive properties: it is interpretable, and its number of parameters does not depend on the size of entity neighborhoods. Future research will look into applying such methods to reason jointly about text and KGs, by attending to textual mentions of entities in addition to the graph (Verga et al., 2016).

A Experimental Details
We found hyperparameters by selecting the set which performed best on the validation sets according to Hits@10. We evaluated k ∈ {128, 256, 512}, number of negative samples n⁻ ∈ {500, 1000, 2000}, and batch size b ∈ {256, 512, 1024, 2048}, and chose k = 512, n⁻ = 2000, and b = 1024. We used Adam (Kingma and Ba, 2014) with a fixed learning rate of 0.001, β₁ = 0.9, β₂ = 0.999. Gradients were clipped to a maximum norm of 10. We capped the maximum number of neighbors of an entity to 500, randomly sub-sampling the set of neighbors when required. We used dropout on all embeddings and after the projection matrices W_n and W_s. We used the same fixed dropout probability d everywhere, tuned in {0.4, 0.3, 0.2, 0.1}, and chose d = 0.3 on both datasets.

B Qualitative Analysis

For a query about an entity's profession, the model attends to neighbors recording, for example, his contribution as a synthesizer and that he is an instrumentalist for Keyboards, to infer that he is a musician. On the other hand, for queries like nationality, the model attends to neighbors like place of birth (see the query for Burt Young) and places lived. For a query about time zone, the model attends to the state and metropolitan area containing the location to infer the time zone. Note that all of these queries require reasoning over multiple facts, and the model achieves this by (1) explicitly selecting a subset of the entity's neighbors to project it into an appropriate neighborhood of the embedding space, and then (2) selecting the entity with the largest DistMult score for the query relation. We also examined nearest neighbors of entities based on their initial embeddings, before attention. These entities should ideally cluster into regions which participate in similar relations, as that would benefit attention by allowing entities to be projected into the appropriate region of the embedding space. Table 3 shows the nearest neighbors for some entities.
For sitcom TV shows like Two and a Half Men, we find other sitcoms like How I Met Your Mother among the top neighbors; for University of Oxford, we find other universities like University of Cambridge; and for cities like Edinburgh, we find other cities in the same country, like Aberdeen and Glasgow. Overall, we found the nearest neighbors to be functionally related, which should benefit attention.