Composing Relationships with Translations

Performing link prediction in Knowledge Bases (KBs) with embedding-based models, such as TransE (Bordes et al., 2013), which represents relationships as translations in the embedding space, has shown promising results in recent years. Most of these works focus on modeling single relationships and hence do not take full advantage of the graph structure of KBs. In this paper, we propose an extension of TransE that learns to explicitly model compositions of relationships via the addition of their corresponding translation vectors. We show empirically that this improves performance both for predicting single relationships and for predicting compositions of pairs of them.


Introduction
Performing link prediction on multi-relational data is becoming essential in order to complete the huge amount of information missing from knowledge bases. Such knowledge can be formalized as directed multi-relational graphs, whose nodes correspond to entities connected by edges encoding various kinds of relationships. We denote these connections via triples (head, label, tail). Link prediction consists of filling in incomplete triples such as (head, label, ?) or (?, label, tail).
In this context, embedding models (Wang et al., 2014; Lin et al., 2015; Jenatton et al., 2012; Socher et al., 2013) that attempt to learn low-dimensional vector or matrix representations of entities and relationships have shown promising performance in recent years. In particular, the basic model TRANSE (Bordes et al., 2013) has proven to be very powerful. This model treats each relationship as a translation vector operating on the embeddings representing the entities. Hence, for a triple (head, label, tail), the vector embeddings of head and tail are learned so that they are connected through a translation parameterized by the vector associated with label. Many extensions have been proposed to improve the representational power of TRANSE while keeping its simplicity, by adding projection steps before the translation (Wang et al., 2014; Lin et al., 2015).
In this paper, we propose an extension of TRANSE that focuses on improving its representation of the underlying graph of multi-relational data by learning compositions of relationships as sequences of translations in the embedding space. The idea is to train the embeddings by learning simple reasonings, such as: the relationship people/nationality should give a similar result as the composition of people/city of birth and city/country. In our approach, called RTRANSE, the training set is augmented with relevant examples of such compositions by performing constrained walks in the knowledge graph, and training so that sequences of translations lead to the desired result. The idea of compositionality to model multi-relational data was previously introduced in (Neelakantan et al., 2015). That work composes relationships by means of recurrent neural networks (RNNs) with non-linearities, one per relationship. However, we show that there is a natural way to compose relationships by simply adding translation vectors, without requiring additional parameters, which makes it especially appealing because of its scalability.
We present experimental results that show the superiority of RTRANSE over TRANSE in terms of link prediction. A detailed evaluation, in which test examples are classified as easy or hard depending on their similarity to the training data, highlights the improvement of RTRANSE in both categories. Our experiments include a new evaluation protocol, in which the model is directly asked to answer questions related to compositions of relationships, such as (head, label1, label2, ?). RTRANSE also achieves significantly better performance than TRANSE on this new dataset.
We describe RTRANSE in the next section, and present our experiments in Section 3.

Model
The model we propose is inspired by TRANSE (Bordes et al., 2013). In TRANSE, entities and relationships of a KB are mapped to low-dimensional vectors, called embeddings. These embeddings are learnt so that for each fact (h, ℓ, t) in the KB, we have h + ℓ ≈ t in the embedding space.
Using translations for relationships naturally leads to embedding the composition of two relationships as the sum of their embeddings: on a path (h, ℓ, t), (t, ℓ′, t′), we should have h + ℓ + ℓ′ ≈ t′ in the embedding space. The original TRANSE does not enforce that the embeddings accurately reproduce such compositions. The recurrent TRANSE we propose here has a modified training stage that includes such compositions. This should allow the model to capture simple reasonings in the KB, such as people/nationality being similar to the composition of people/city of birth and city/country.

Recurrent TransE
In this section we describe our model in its full generality, which can handle compositions of an arbitrary number of relationships, even though in this first work we experiment only with compositions of two relationships.
Triples that are the result of a composition are denoted by (h, [ℓ1, ..., ℓp], t), where p is the number of relationships that are composed to go from h to t. Such a path means that there exist entities e1, ..., ep+1, with e1 = h and ep+1 = t, such that for all k, (ek, ℓk, ek+1) is a fact in the KB. Our model, RTRANSE, accumulates the translations along the path in the KB with the recurrence relationship (boldface characters denote embedding vectors, i.e. h is the embedding vector of the entity h):

x1 = h,    xk+1 = xk + ℓk    for k = 1, ..., p.

Then, the energy of a composed triple is computed as d(xp+1, t), where d is the dissimilarity measure used by TRANSE, i.e. it measures how far h + ℓ1 + ... + ℓp is from t.
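As an illustration, the recurrence above can be sketched in a few lines. The squared Euclidean distance is used as the dissimilarity d here, which is an assumption, and all names are hypothetical:

```python
import numpy as np

def path_energy(h, relations, t):
    """Energy of a composed triple (h, [l1, ..., lp], t): accumulate the
    translations along the path (x1 = h, x_{k+1} = x_k + l_k), then
    measure the squared Euclidean distance between x_{p+1} and t."""
    x = h.copy()
    for l in relations:  # one translation per relationship on the path
        x = x + l
    return np.linalg.norm(x - t) ** 2

# Toy 2-d embeddings: the two translations compose exactly onto t,
# so the energy of the composed triple is zero.
h = np.array([0.0, 0.0])
l1 = np.array([1.0, 0.0])  # e.g. people/city_of_birth
l2 = np.array([0.0, 1.0])  # e.g. city/country
t = np.array([1.0, 1.0])
print(path_energy(h, [l1, l2], t))  # 0.0
```

A single triple (h, ℓ, t) is the special case p = 1, so the same function scores both triples and quadruples.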

Path construction and filtering
The work in this paper is motivated by learning simple reasonings in the KB through compositions of relationships. Therefore, we restricted our analysis to paths of length 2, created as follows. First, for each fact (h, ℓ, t), we retrieve all paths (h, {ℓ1, ℓ2}, t) such that there is an entity e for which both (h, ℓ1, e) and (e, ℓ2, t) are in the KB. Then, we filter out paths where (h, ℓ1, e) = (h, ℓ, t) or (e, ℓ2, t) = (h, ℓ, t), as well as paths with ℓ1 = ℓ2 and h = e = t. We focused on "unambiguous" paths, so that the reasoning might actually make sense. In particular, we considered only paths where ℓ1 is either a 1-to-1 or a 1-to-many relationship, and where ℓ2 is either a 1-to-1 or a many-to-1 relationship. In our experiments, the paths created for training only consider the training subset of facts.
In the remainder of the paper, such paths of length 2 are called quadruples.
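The path construction and filtering described above can be sketched as follows; the function, the quadruple representation and the relationship-type labels are hypothetical names introduced for illustration:

```python
from collections import defaultdict

def build_quadruples(facts, rel_types):
    """Enumerate length-2 paths (h, {l1, l2}, t) paralleling an existing
    fact (h, l, t), with the filters described above. `facts` is a list of
    (h, l, t) triples; `rel_types` maps each relationship to one of
    '1-to-1', '1-to-many', 'many-to-1' or 'many-to-many'."""
    out_edges = defaultdict(list)  # head -> [(label, tail), ...]
    for h, l, t in facts:
        out_edges[h].append((l, t))
    quads = []
    for h, l, t in facts:
        for l1, e in out_edges[h]:
            if (l1, e) == (l, t):
                continue  # first hop must differ from the fact itself
            if rel_types[l1] not in ("1-to-1", "1-to-many"):
                continue  # keep only "unambiguous" first hops
            for l2, t2 in out_edges[e]:
                if t2 != t or (e, l2) == (h, l):
                    continue  # path must end at t; second hop must differ from the fact
                if l1 == l2 and h == e == t:
                    continue  # degenerate self-loop path
                if rel_types[l2] not in ("1-to-1", "many-to-1"):
                    continue  # keep only "unambiguous" second hops
                quads.append((h, l1, l2, t, l))  # keep l for the weights N
    return quads

# Tiny KB: (anne, nationality, france) is paralleled by the path
# (anne, city_of_birth, paris), (paris, country, france).
facts = [("anne", "nationality", "france"),
         ("anne", "city_of_birth", "paris"),
         ("paris", "country", "france")]
rel_types = {"nationality": "many-to-1",
             "city_of_birth": "1-to-1",
             "country": "many-to-1"}
print(build_quadruples(facts, rel_types))
# [('anne', 'city_of_birth', 'country', 'france', 'nationality')]
```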

Training and regularization
Our training objective is decomposed into two parts: the first is the ranking criterion on triples of TRANSE, which ignores quadruples. Paths are then taken into account through additional regularization terms.
Denoting by S the set of facts in the KB, the first part of the training objective is the following ranking criterion, which operates on triples:

Σ_{(h,ℓ,t) ∈ S} Σ_{(h′,ℓ,t′) ∈ S′(h,ℓ,t)} [γ + d(h + ℓ, t) − d(h′ + ℓ, t′)]+

where [x]+ = max(x, 0) is the positive part of x, γ is a margin hyperparameter, d is the dissimilarity measure, and S′(h,ℓ,t) is the set of corrupted triples created from (h, ℓ, t) by replacing either h or t with another KB entity.
This ranking loss effectively trains the model so that the embedding of the tail is the nearest neighbor of the translated head, but it does not guarantee that the distance between the tail and the translated head is small. The nearest-neighbor criterion is sufficient for inference over simple triples, but making sure that the distance is small is necessary for the composition rule to be accurate. In order to account for the compositionality of relationships, we add two additional regularization terms:

λ Σ_{(h,ℓ,t) ∈ S} ||h + ℓ − t||²    and    α Σ_{(h,{ℓ1,ℓ2},t)} N_{ℓ→{ℓ1,ℓ2}} ||h + ℓ1 + ℓ2 − t||².

The first term only applies to original facts of the KB, while the second term applies to quadruples. The weight N_{ℓ→{ℓ1,ℓ2}}, which involves both the relationships of the quadruple and the relationship from which it was created, is the number of paths involving relationships {ℓ1, ℓ2} created from a fact involving ℓ, normalized by the total number of quadruples created from facts involving ℓ. This weighting puts more weight on paths that are reliable as an alternative for a relationship: for instance, {people/city of birth, city/country} is likely a better alternative to people/nationality than {people/writer of the film, film/film release region}. Finally, a regularization term µ||e||² is added for each entity embedding e.
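A minimal sketch of the criterion and regularization terms, for a single (triple, corrupted triple) pair and a single quadruple, assuming the squared Euclidean distance as dissimilarity (function names are hypothetical):

```python
import numpy as np

def margin_ranking_loss(h, l, t, h_c, t_c, gamma):
    """TransE-style criterion for one triple (h, l, t) and one corrupted
    triple (h_c, l, t_c): [gamma + d(h + l, t) - d(h_c + l, t_c)]_+."""
    d_pos = np.sum((h + l - t) ** 2)
    d_neg = np.sum((h_c + l - t_c) ** 2)
    return max(0.0, gamma + d_pos - d_neg)

def composition_regularizers(h, l, t, l1, l2, lam, alpha, weight):
    """The two extra terms for a fact (h, l, t) and a quadruple
    (h, {l1, l2}, t) created from it; `weight` is N_{l -> {l1, l2}}."""
    fact_term = lam * np.sum((h + l - t) ** 2)                   # pull h + l close to t
    path_term = alpha * weight * np.sum((h + l1 + l2 - t) ** 2)  # and h + l1 + l2 too
    return fact_term + path_term
```

In a full training step these quantities would be summed over all triples and quadruples in a mini-batch, with the per-entity term µ||e||² added on top.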

Experiments
This section presents experiments on the benchmark FB15K introduced in (Bordes et al., 2013) and on FAMILY, a slightly extended version of the artificial database described in (García-Durán et al., 2014). Table 1 gives their statistics.

Experimental Protocol
Data FB15K is a subset of Freebase, a very large database of generic facts gathering more than 1.2 billion triples and 80 million entities. Inspired by (Hinton, 1986), FAMILY is a database that contains triples expressing family relationships (cousin of, has ancestor, married to, parent of, related to, sibling of, uncle of) among the members of 5 families over 6 generations. This dataset is artificial, and each family is organized in a layered tree structure where each layer corresponds to a generation. Families are connected to one another by marriage links between two members, randomly sampled from the same layer of different families. Interestingly, this dataset contains obvious compositional relationships, such as uncle of ≈ sibling of + parent of or parent of ≈ married to + parent of, among others.
Setting Our main point of comparison is TRANSE, so we followed the same experimental setting as in (Bordes et al., 2013), using ranking metrics for evaluation. For each test triple, we replaced the head by each entity in turn, computed the score of each of these candidates, and sorted them. Since other positive candidates (i.e. entities forming true triples) can be ranked higher than the target one, we filtered out of the ranking all the positive candidates existing in the training, validation or test sets, except the target one, and kept the rank of the target entity. The same procedure is repeated, removing the tail instead of the head. The filtered mean rank (mean rank in the rest of the paper) is the average of these ranks, and the filtered Hits@10 (H@10 in the rest) is the proportion of target entities ranked in the top 10 predictions. The embedding dimension was set to 20 for FAMILY and 100 for FB15K. Training was performed by stochastic gradient descent, stopping after 500 epochs. On FB15K, we used the embeddings of TRANSE to initialize RTRANSE, and set a learning rate of 0.001 to fine-tune RTRANSE. On FAMILY, both algorithms were initialized randomly and used a learning rate of 0.01. The mean rank was used as a validation criterion, and the values of γ, λ, α and µ were chosen respectively among {0.25, 0.5, 1}, {1e-4, 1e-5, 0}, {0.1, 0.05, 0.01, 0.005} and {1e-4, 1e-5, 0}.
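The filtered ranking used in this protocol can be sketched as follows (names are hypothetical; lower scores are assumed to mean better candidates):

```python
def filtered_rank(scores, target, known_positives):
    """Rank of `target` among all candidate entities, after removing from
    the ranking every other known positive (from the training, validation
    and test sets). `scores` maps entity -> dissimilarity, lower = better."""
    target_score = scores[target]
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in known_positives:
            continue  # the target keeps its place; other positives are filtered out
        if score < target_score:
            rank += 1
    return rank

# The filtered mean rank is the average of these ranks over all head and
# tail queries, and H@10 is the fraction of queries with rank <= 10.
print(filtered_rank({"paris": 0.1, "lyon": 0.2, "rome": 0.3}, "rome", {"paris"}))  # 2
```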

Results
Overall performance Experiments on FAMILY show a quantitative improvement from RTRANSE: where TRANSE obtains a mean rank of 6.7 and a H@5 of 68.7, RTRANSE obtains 6.3 and 72.3 respectively.
Similarly, on FB15K, Table 2 (last row) shows that training on longer paths (length 2 here) consistently improves performance even when predicting only heads and tails of triples: the overall H@10 improves by almost 5% over TRANSE's 71.5.

Qualitative analysis Table 3 shows the nearest entities to h + ℓ1 + ℓ2 for a few example quadruples. One example shows that answering a question by using such a path is sometimes difficult: the members of the cast of the TV show in question have different nationalities, so RTRANSE lists the nationalities of these cast members and the correct one is ranked third. TRANSE is again more affected than RTRANSE by this cascading error. In the third example, RTRANSE deduces the main region where Malay is spoken from the continent of the country with the largest number of speakers of that language. In the last two examples, our model infers the location of the universities by forcing an equivalence between their location and the location of their respective campuses.
Prediction performance For a more quantitative analysis, we have generated a new test dataset for link prediction on quadruples on FB15K. This test set was created by generating the paths from the usual test set (the triple test set) and removing those quadruples that are used for training, yielding 1,852 quadruples. The overall experimental protocol is the same as before, trying to predict the head or tail of these quadruples in turn.
On that evaluation protocol, RTRANSE has a mean rank of 114.0 and a H@10 of 68.2%, while TRANSE obtains a mean rank of 159.9 and a H@10 of 65.2% (using the same models as in the previous subsection). We can see that learning on paths improves performance on both metrics, with a gain of 3% in H@10 and an important gain of about 46 in mean rank, which corresponds to a relative improvement of about 30%.

Conclusion
We have proposed to learn embeddings of compositions of relationships in the translation model for link prediction in KBs. Our experimental results show that this approach is promising.
We considered in this work a restricted set of small paths of length two. We leave the study of more general paths to future work.