Out-of-Sample Representation Learning for Knowledge Graphs

Many important problems can be formulated as reasoning in knowledge graphs. Representation learning has proved extremely effective for transductive reasoning, in which one needs to make new predictions for already observed entities. This is true for both attributed graphs (where each entity has an initial feature vector) and non-attributed graphs (where the only initial information derives from known relations with other entities). For out-of-sample reasoning, where one needs to make predictions for entities that were unseen at training time, much prior work considers attributed graphs. However, this problem is surprisingly under-explored for non-attributed graphs. In this paper, we study the out-of-sample representation learning problem for non-attributed knowledge graphs, create benchmark datasets for this task, develop several models and baselines, and provide empirical analyses and comparisons of the proposed models and baselines.


Introduction
Multi-relational graphs are a prevalent form of graphs where each edge has a label and a direction associated with it. Many prediction problems can be formulated as reasoning within a multi-relational graph. For example, Figure 1 depicts a job recommendation system that has been formulated in these terms. A notable example of multi-relational graphs is knowledge graphs (KGs), which have several applications in natural language processing and information retrieval, including search, question answering, and commonsense reasoning. Much prior work has considered transductive KG reasoning, in which predictions are made at test time only for entities that were observed during training. These are known as in-sample entities. In Figure 1, predicting whether $A_1$ is an expert in $S_2$ is an example of transductive reasoning.

Figure 1: An example of a multi-relational graph for a job recommendation system is presented on the left side of the dashed blue line, where the vertices $A_i$, $C_i$, $S_i$, $J_i$ and $T_i$ represent applicants, companies, skills, job postings, and titles respectively. Predicting whether $A_1$ is an expert in $S_2$ is an example of transductive reasoning. $J_{new}$ represents an out-of-sample entity that has not been observed during training. Predicting whether $A_3$ is a good fit for $J_{new}$ based on the relations of $J_{new}$ observed during test time (red arrows) is an example of out-of-sample reasoning.
In contrast, we consider out-of-sample KG reasoning: we make predictions for previously unseen (out-of-sample) entities based on their relations with the in-sample entities. This is more challenging than transductive reasoning as it requires generalizing to unseen entities. In Figure 1, predicting whether $A_3$ is a good fit for the previously unseen job posting $J_{new}$, given the relations of $J_{new}$ with in-sample entities (observed at test time), is an example of out-of-sample reasoning.
Representation learning has proved effective for reasoning in KGs (Nickel et al., 2016; Hamilton et al., 2017b). It has been extensively studied for transductive reasoning in attributed graphs (where each entity has an initial feature vector) and non-attributed KGs (where the only initial information derives from known relations with other entities) as well as simple graphs (in which there is only a single relation). One prominent family of work is based on extensions of the convolution operator to non-Euclidean domains (Kipf and Welling, 2017; Defferrard et al., 2016; Hammond et al., 2011; Schlichtkrull et al., 2018). A second family models relations as translations (or rotations) from subject to object entities (Bordes et al., 2013; Ji et al., 2015; Nguyen et al., 2016; Sun et al., 2019). A third approach represents the facts in a KG as a third-order tensor and factorizes this tensor to produce entity and relation embeddings (Yang et al., 2015; Trouillon et al., 2016; Kazemi and Poole, 2018; Zhang et al., 2019).
Out-of-sample representation learning has also been extensively studied for attributed KGs (Xie et al., 2016; Zhao et al., 2017) and attributed simple graphs (Yang et al., 2016; Hamilton et al., 2017a; Veličković et al., 2018; Chen et al., 2018). However, for non-attributed KGs, it remains under-explored. The main challenge of out-of-sample representation learning for non-attributed KGs is that an entity representation must be learned using only the relations the entity participates in. Such a model has been developed for non-attributed simple graphs, but extending it to KGs is not straightforward. Out-of-sample representation learning in non-attributed graphs is an important problem for high-throughput production systems, where it is not tractable to adapt the transductive approaches by running additional rounds of gradient descent to incorporate new entities at test time.
The contributions of this work are as follows: 1) we formally define out-of-sample representation learning for KGs, 2) we create benchmark datasets for this problem, 3) we propose several baselines, 4) we extend current transductive KG representation learning approaches by developing new training algorithms that can support the incorporation of out-of-sample entities at test time via aggregation functions to compute representations, and 5) we provide a thorough experimental comparison of the baselines and the proposed approaches.

Background and Notation
Lower-case letters denote scalars, bold lower-case letters denote vectors, and bold upper-case letters denote matrices. For a vector $\mathbf{z} \in \mathbb{R}^d$, we denote by $\mathbf{z}[i]$ ($i \le d$) the $i$-th element of $\mathbf{z}$ and by $\|\mathbf{z}\|$ the Euclidean norm of $\mathbf{z}$. For $\mathbf{z}_1, \mathbf{z}_2 \in \mathbb{R}^d$, we let $\mathbf{z}_1 \odot \mathbf{z}_2 \in \mathbb{R}^d$ represent the element-wise (Hadamard) product of the two vectors. For $\mathbf{z}_1, \dots, \mathbf{z}_k \in \mathbb{R}^d$, we let $\langle \mathbf{z}_1, \dots, \mathbf{z}_k \rangle = \sum_{i=1}^{d} (\mathbf{z}_1[i] * \cdots * \mathbf{z}_k[i])$ represent the sum of the element-wise product of the elements of the $k$ vectors.
Let V and R represent a set of entities and relations respectively. We represent a triple as (v, r, u), where v ∈ V is the head (or subject), r ∈ R is the relation, and u ∈ V is the tail (or object) of the triple. Let ζ represent the set of all triples on entities V and relations R that are facts (e.g., (Montreal, LocatedIn, Canada)). A (non-attributed) knowledge graph (KG) G ⊂ ζ is a subset of ζ. Hereafter, whenever we refer to a KG, we assume a non-attributed KG.
Transductive KG Reasoning: In transductive KG reasoning, a model is learned for a KG G with entities V and relations R such that the model can make predictions about any triple (v, r, u) where v, u ∈ V are both in-sample entities and r ∈ R.
KG embedding models map entities and relations to hidden representations known as embeddings and define a function $\phi$ from the embeddings of the entities and the relation in a triple to a score corresponding to the degree of belief the model has for the relation holding between the entities. Typically, the embeddings can be formulated as two matrices $\mathbf{Z}_{ent} \in \mathbb{R}^{|V| \times d_{ent}}$ and $\mathbf{Z}_{rel} \in \mathbb{R}^{|R| \times d_{rel}}$, where each row of $\mathbf{Z}_{ent}$ corresponds to the embedding for an entity, each row of $\mathbf{Z}_{rel}$ corresponds to the embedding for a relation, and $d_{ent}$ and $d_{rel}$ represent the entity and relation embedding sizes. One can look up the embedding for a particular entity $v$ by multiplying the transpose of $\mathbf{Z}_{ent}$ by the one-hot encoding of $v$, and for a particular relation $r$ by multiplying the transpose of $\mathbf{Z}_{rel}$ by the one-hot encoding of $r$. A large number of approaches define $\mathbf{Z}_{ent}$ and $\mathbf{Z}_{rel}$ as matrices with directly learnable parameters. Other approaches define encoders that produce these two matrices, typically through several rounds of message passing among entities.
Algorithm 1 outlines one epoch of training for learning the embeddings as well as the parameters of the φ function. The training is performed using stochastic gradient descent with mini-batches. For each batch (line 2), the nextBatch function extracts a set of positive triples from the KG and creates n negative triples per positive triple by corrupting the positive triple according to the procedure introduced in (Bordes et al., 2013). n is known as the negative ratio. For each triple (v, r, u) in the batch, the embeddings for v, r and u are looked up and the score for the triple is computed according to φ. Then the embeddings and the parameters of φ are updated based on the predicted scores, the labels of the triples, and a loss function L.
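The batch construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `corrupt` and `next_batch` are hypothetical, and the sketch assumes that a negative is formed by replacing either the head or the tail of a positive triple with a uniformly sampled entity (with equal probability), without checking whether the corrupted triple happens to be a true fact.

```python
import random


def corrupt(positive, entities, n, rng):
    """Create n negatives for one positive triple by swapping its head or tail."""
    v, r, u = positive
    negatives = []
    for _ in range(n):
        replacement = rng.choice(entities)
        if rng.random() < 0.5:
            negatives.append((replacement, r, u))  # corrupt the head
        else:
            negatives.append((v, r, replacement))  # corrupt the tail
    return negatives


def next_batch(positives, entities, n, rng):
    """Return triples and labels: +1 for positives, -1 for their n negatives."""
    triples, labels = [], []
    for pos in positives:
        triples.append(pos)
        labels.append(1)
        for neg in corrupt(pos, entities, n, rng):
            triples.append(neg)
            labels.append(-1)
    return triples, labels
```

Here `n` plays the role of the negative ratio; a seeded `random.Random` instance can be passed as `rng` to make the sampling reproducible.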
Algorithm 1 (excerpt):
8: scores.append(φ(z_v, z_r, z_u))
9: end for
10: updateParams(L, scores, labels)
11: end for

Different models have been proposed in the literature, mainly by changing the score function. Note that some models may break the vector embeddings into multiple pieces and reshape each piece before using it in the score function. In this paper, we focus primarily on DistMult, a simple yet effective model for transductive KG embedding. However, many of the ideas we develop in this paper are general and can be applied to other models as well.
DistMult (Yang et al., 2015): In DistMult, $\mathbf{Z}_{ent} \in \mathbb{R}^{|V| \times d}$ and $\mathbf{Z}_{rel} \in \mathbb{R}^{|R| \times d}$. For a triple $(v, r, u)$, let $\mathbf{z}_v, \mathbf{z}_r, \mathbf{z}_u \in \mathbb{R}^d$ represent the embeddings for $v$, $r$ and $u$ respectively, where each embedding is obtained by looking up the $\mathbf{Z}_{ent}$ and $\mathbf{Z}_{rel}$ matrices. DistMult defines the score for the triple as $\phi(\mathbf{z}_v, \mathbf{z}_r, \mathbf{z}_u) = \langle \mathbf{z}_v, \mathbf{z}_r, \mathbf{z}_u \rangle$, i.e., the sum of the element-wise product of the head, relation, and tail embeddings.
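As a small illustration of the DistMult score, the sketch below computes the sum of the element-wise product of three embeddings using plain Python lists. The function name `distmult_score` and the toy embedding values are made up for the example.

```python
from typing import Sequence


def distmult_score(z_v: Sequence[float], z_r: Sequence[float], z_u: Sequence[float]) -> float:
    """Sum over dimensions of the element-wise product of head, relation, and tail."""
    if not (len(z_v) == len(z_r) == len(z_u)):
        raise ValueError("all three embeddings must share the same dimension d")
    return sum(a * b * c for a, b, c in zip(z_v, z_r, z_u))


# Toy 3-dimensional embeddings (illustrative values only).
z_v = [1.0, 2.0, 0.5]
z_r = [0.5, -1.0, 2.0]
z_u = [2.0, 1.0, 3.0]
score = distmult_score(z_v, z_r, z_u)  # 1.0 - 2.0 + 3.0 = 2.0
```

Because the product is symmetric in the head and tail embeddings, swapping `z_v` and `z_u` gives the same score.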
Loss function: We use the L2 regularized negative log-likelihood, which has proved effective in several works (Trouillon et al., 2016; Kazemi and Poole, 2018). The loss $L(\Theta)$ for a single batch of labeled triples is defined as follows:

$$L(\Theta) = \sum_{((v, r, u),\, l) \in \text{batch}} \text{softplus}\big(-l \cdot \phi(\mathbf{z}_v, \mathbf{z}_r, \mathbf{z}_u)\big) + \lambda \|\Theta\|_2^2 \qquad (1)$$

where $\Theta$ represents the parameters of the model, $\text{softplus}(x) = \log(1 + \exp(x))$, $l \in \{-1, 1\}$ represents the label of the triple in the batch, and $\lambda$ represents the L2 regularization hyperparameter.
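A minimal sketch of this loss follows. It assumes the usual form of the L2 regularized negative log-likelihood, namely the sum of softplus(-l · score) over the batch plus λ times the squared L2 norm of the parameters; the names `softplus` and `batch_loss` and the flat `params` list are illustrative choices.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def batch_loss(scores, labels, params, lam):
    """Sum of softplus(-label * score) plus lam * squared L2 norm of params.

    scores: model scores, one per triple in the batch
    labels: +1 for positive triples, -1 for negative triples
    params: flat list of parameter values (assumed layout, for illustration)
    """
    data_term = sum(softplus(-l * s) for s, l in zip(scores, labels))
    reg_term = lam * sum(p * p for p in params)
    return data_term + reg_term
```

With this form, a positive triple with a large score and a negative triple with a very negative score both contribute a data term close to zero.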

Out-of-Sample KG Reasoning
We define out-of-sample reasoning for KGs as follows.
Definition 1. Out-of-sample reasoning for KGs is the problem of training a model on a KG $G$ with entities $V$ and relations $R$ such that, at test time, the model can be used for making predictions about any out-of-sample entity $v \notin V$ given a set $G_v$ of triples of the form $(v, r, u)$ or $(u, r, v)$, with $u \in V$ and $r \in R$, corresponding to the relations between $v$ and in-sample entities.
According to the definition, $G_v$ is observed only at test time, so during training the model does not observe any triples involving $v$. To develop a representation learning model for out-of-sample reasoning in KGs, one needs to learn i) embeddings for the in-sample entities in $V$ and the relations in $R$, ii) a function $\phi$ from triples to scores, and iii) a function from $G_v$ and the in-sample entity and relation embeddings to an embedding for $v$ that can be used to make further predictions about $v$.
One possible way of extending transductive models such as DistMult to the out-of-sample domain is to follow the standard training procedure outlined in Algorithm 1 and then define an aggregation function with no learnable parameters which, at inference time, provides an embedding for an out-of-sample entity $v$ based on the embeddings of the entities and relations in $G_v$. A simple aggregation function, for instance, can be the average of the embeddings for the entities $\{u : \exists r \text{ s.t. } (v, r, u) \in G_v \text{ or } (u, r, v) \in G_v\}$ (i.e., all entities that have a relation with $v$). Such a procedure, however, introduces an inconsistency between training and testing: training is done irrespective of the aggregation function and with the objective of performing well on a transductive task, whereas the model is tested on an out-of-sample task.

Proposed Training Procedure
To make the training procedure resemble what is expected of the model at test time and make it aware of the aggregation function being used, we propose a new training algorithm that guides learning towards entity and relation embeddings that better match the aggregation function. A general training procedure for out-of-sample representation learning is proposed in Algorithm 2. For each triple $(v, r, u)$ in the batch, we first look up the embedding for $r$. Then, with probability $\psi/2$, where $0 \le \psi \le 1$ is a hyperparameter, we consider $v$ to be out-of-sample and $u$ to be in-sample. In this case, for $v$ we use an aggregate function that computes the embedding for $v$ based on the triples involving $v$ except for $(v, r, u)$, and for $u$ we simply look up its embedding. Also with probability $\psi/2$, we consider $u$ to be out-of-sample and $v$ to be in-sample and follow a similar procedure. Finally, with probability $1 - \psi$, we follow the standard training procedure by looking up the embeddings for both entities. Having the embeddings for $v$, $r$ and $u$, we use a score function (e.g., DistMult) to compute the score for this triple being true. We then update the embeddings (and the parameters of the aggregate and $\phi$ functions if they have any) according to the scores, labels, and a loss function $L$. Note that when $\psi = 0$, Algorithm 2 reduces to Algorithm 1. Algorithm 2 is generic and can be used with any KG embedding model.

Algorithm 2 (excerpt):
7: if rand < ψ/2 then
...
19: updateParams(L, scores, labels)
20: end for
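The per-triple choice between lookup and aggregation can be sketched as below. This is a hedged illustration of the sampling step only: `choose_embeddings`, `lookup`, and `aggregate` are hypothetical names, and `aggregate(entity, excluded_triple)` is assumed to compute an embedding from the triples of `entity` other than `excluded_triple`.

```python
import random


def choose_embeddings(triple, lookup, aggregate, psi, rng):
    """Return (z_v, z_u) for one training triple (v, r, u).

    With probability psi/2 the head is treated as out-of-sample, with
    probability psi/2 the tail is, and otherwise both are looked up.
    """
    v, _, u = triple
    draw = rng.random()  # uniform in [0, 1)
    if draw < psi / 2:
        return aggregate(v, triple), lookup(u)   # head treated as out-of-sample
    if draw < psi:
        return lookup(v), aggregate(u, triple)   # tail treated as out-of-sample
    return lookup(v), lookup(u)                  # standard transductive step
```

With `psi = 0.0` the function always takes the last branch, which matches the observation that the procedure then reduces to standard transductive training; with `psi = 1.0` exactly one of the two entities is aggregated for every triple.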
By using Algorithm 2, one can develop different models for out-of-sample representation learning by choosing different φ and aggregate functions. We propose two aggregate functions that extend DistMult to out-of-sample domains.

Proposed Models
oDistMult-ERAvg: Let $v$ be an entity for which we need to compute an embedding using aggregation and $G_v$ be the triples involving $v$. According to the score function of DistMult, for each triple $(v, r, u) \in G_v$ (and similarly for each triple $(u, r, v) \in G_v$), we want $\langle \mathbf{z}_v, \mathbf{z}_r, \mathbf{z}_u \rangle$ to be high, where $\mathbf{z}_v$, $\mathbf{z}_r$ and $\mathbf{z}_u$ represent the embeddings of $v$, $r$ and $u$ respectively. The score can be written as $\langle \mathbf{z}_v, \mathbf{z}_r, \mathbf{z}_u \rangle = \mathbf{z}_v \cdot (\mathbf{z}_r \odot \mathbf{z}_u)$, where $\cdot$ represents the dot product. Since $\mathbf{z}_v \cdot (\mathbf{z}_r \odot \mathbf{z}_u) = \|\mathbf{z}_v\| \, \|\mathbf{z}_r \odot \mathbf{z}_u\| \cos(\mathbf{z}_v, \mathbf{z}_r \odot \mathbf{z}_u)$, one possible choice to ensure a high value for $\langle \mathbf{z}_v, \mathbf{z}_r, \mathbf{z}_u \rangle$ is to choose $\mathbf{z}_v$ to be the vector $\mathbf{z}_r \odot \mathbf{z}_u$, so that the angle $\theta$ between the two vectors becomes 0 (and consequently $\cos(\theta) = 1$). Since there may be multiple triples in $G_v$, we average these vectors and define $\mathbf{z}_v = \text{aggregate}(v)$ as follows:

$$\mathbf{z}_v = \frac{1}{|G_v|} \Big( \sum_{(v, r, u) \in G_v} \mathbf{z}_r \odot \mathbf{z}_u + \sum_{(u, r, v) \in G_v} \mathbf{z}_r \odot \mathbf{z}_u \Big) \qquad (2)$$

where $|G_v|$ represents the number of triples in $G_v$.
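The averaging strategy described above can be sketched in a few lines. The sketch assumes embeddings are stored in plain dictionaries from names to lists of floats; the names `aggregate_eravg`, `ent_emb`, and `rel_emb` and the toy values are illustrative and not taken from the released code.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def aggregate_eravg(v, triples_of_v, ent_emb, rel_emb):
    """Average of z_r * z_u (element-wise) over all triples that involve v."""
    if not triples_of_v:
        raise ValueError("at least one triple is needed to aggregate")
    total = None
    for head, rel, tail in triples_of_v:
        other = tail if head == v else head  # the in-sample entity of the triple
        term = hadamard(rel_emb[rel], ent_emb[other])
        total = term if total is None else [t + x for t, x in zip(total, term)]
    return [t / len(triples_of_v) for t in total]


# Toy example with 2-dimensional embeddings (made-up values).
ent_emb = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
rel_emb = {"r1": [1.0, 0.5], "r2": [2.0, 1.0]}
g_new = [("new", "r1", "a"), ("b", "r2", "new")]
z_new = aggregate_eravg("new", g_new, ent_emb, rel_emb)  # [3.5, 2.5]
```

Triples in which the new entity is the head and triples in which it is the tail are treated the same way, which mirrors the symmetry of the DistMult score.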
oDistMult-LS: An alternative to the averaging strategy in Equation (2) is to find $\mathbf{z}_v$ as the solution to a least squares problem that ensures the scores for the triples in $G_v$ are maximized. One way to achieve this goal is by solving a (potentially underdetermined) system of linear equations where there exists one equation of the form $\frac{\mathbf{z}_v \cdot (\mathbf{z}_r \odot \mathbf{z}_u)}{\|\mathbf{z}_v\| \, \|\mathbf{z}_r \odot \mathbf{z}_u\|} = 1$ for each triple $(v, r, u) \in G_v$ (and similarly for each triple $(u, r, v) \in G_v$). The presence of $\|\mathbf{z}_v\|$ in the denominator makes finding an analytical solution difficult. We note that $\|\mathbf{z}_v\|$ only affects the magnitude of the scores and not their ranking, so instead we consider the following equation:

$$\mathbf{z}_v \cdot (\mathbf{z}_r \odot \mathbf{z}_u) = \|\mathbf{z}_r \odot \mathbf{z}_u\| \qquad (3)$$

Considering a matrix $\mathbf{A} \in \mathbb{R}^{|G_v| \times d}$ (recall that $d$ is the embedding dimension) such that $\mathbf{A}[i] = \mathbf{z}_r \odot \mathbf{z}_u$, where $r$ and $u$ are the relation and entity involved in the $i$-th triple in $G_v$, and a vector $\mathbf{b} \in \mathbb{R}^{|G_v|}$ such that $\mathbf{b}[i] = \|\mathbf{z}_r \odot \mathbf{z}_u\|$, we compute $\mathbf{z}_v = \text{aggregate}(v)$ analytically as follows:

$$\mathbf{z}_v = (\mathbf{A}^\top \mathbf{A} + \lambda \mathbf{I})^{-1} \mathbf{A}^\top \mathbf{b} \qquad (4)$$

where $\mathbf{I} \in \mathbb{R}^{d \times d}$ is an identity matrix and $\lambda$ is a hyperparameter corresponding to L2 regularization, which ensures the system has a unique solution. While we proposed the aggregation functions for DistMult, note that they can be easily extended to other models such as SimplE, ComplEx, and QuatE, which have 2, 4 and 8 $\langle \cdot, \cdot, \cdot \rangle$ terms respectively.
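The least-squares aggregation can be sketched as follows, assuming the standard regularized closed form in which the embedding solves (AᵀA + λI) z = Aᵀb, with the rows of A given by the element-wise products of relation and entity embeddings and the entries of b given by their norms. The pure-Python solver `solve_linear` and the name `aggregate_ls` are illustrative; in practice a linear algebra library would normally be used instead.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * c for x, c in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def aggregate_ls(v, triples_of_v, ent_emb, rel_emb, lam):
    """Regularized least-squares embedding for v (lam > 0 keeps the system solvable)."""
    rows = []
    for head, rel, tail in triples_of_v:
        other = tail if head == v else head
        rows.append([x * y for x, y in zip(rel_emb[rel], ent_emb[other])])
    b = [math.sqrt(sum(x * x for x in row)) for row in rows]  # norms of the rows
    d = len(rows[0])
    # Left-hand side: A^T A + lam * I
    lhs = [[sum(row[i] * row[j] for row in rows) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    # Right-hand side: A^T b
    rhs = [sum(row[i] * bi for row, bi in zip(rows, b)) for i in range(d)]
    return solve_linear(lhs, rhs)


# Toy example: a single triple whose row is [3, 4] with norm 5 (made-up values).
ent_emb = {"a": [3.0, 4.0]}
rel_emb = {"r": [1.0, 1.0]}
z_new = aggregate_ls("new", [("new", "r", "a")], ent_emb, rel_emb, lam=0.5)
```

For a single row a, the solution is a scaled copy of a, namely ‖a‖ a / (‖a‖² + λ), so with a small λ the resulting embedding points in the direction of the row, as in the averaging strategy.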

Time Complexity
We analyze the time complexity of the proposed algorithms for finding the embedding of an out-of-sample entity $v$. Let us assume that $|G_v| = N$

Datasets
We created datasets for out-of-sample representation learning over KGs using WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova and Chen, 2015), two standard datasets for KG completion. WN18RR is a subset of WordNet (Miller, 1995) and FB15k-237 is a subset of Freebase (Bollacker et al., 2008). We call the two datasets oWN18RR and oFB15k-237 respectively, where the "o" at the beginning of the name stands for "out-of-sample". The statistics for these datasets can be found in Table 1.
We outline the steps we took for creating the datasets.
1. We merge the train, validation, and test triples from the original dataset into a single set.
2. From the entities appearing in at least 2 triples, we randomly select 20% to be candidates for the out-of-sample entities; other entities are in-sample entities. We avoid having entities appearing in only 1 triple as out-of-sample entities because, during test time, we select one triple as query and need other triples for learning a representation for the out-of-sample entity.
3. Triples containing two out-of-sample entities are removed, triples with one out-of-sample entity are considered as test triples and other triples are considered as train triples.
4. In step 3, it is possible that some entities selected to be in-sample appear in no training triples. This can happen whenever an in-sample entity only appears in triples involving an out-of-sample entity. A similar situation can occur for some relations as well (i.e., some relations only appearing in the test set). We remove such entities and relations and the triples they appear in from the dataset.
5. After doing the above steps, if the number of triples for an out-of-sample entity is less than 2, we remove that entity from the test set.
6. We randomly select half of the out-of-sample entities and the triples they appear in as the validation set and the other half as the test set.
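The six steps above can be sketched as a single function. This is an illustrative reading of the procedure, not the script used to build oWN18RR or oFB15k-237: the name `build_oos_split`, the `oos_fraction` and `seed` parameters, and the returned dictionary layout are assumptions, and the caller is assumed to pass the already merged triples of step 1.

```python
import random
from collections import Counter


def build_oos_split(triples, oos_fraction=0.2, seed=0):
    rng = random.Random(seed)
    triples = list(triples)  # step 1: merged train/validation/test triples

    # Step 2: sample out-of-sample candidates among entities in >= 2 triples.
    counts = Counter(e for h, _, t in triples for e in {h, t})
    eligible = sorted(e for e, c in counts.items() if c >= 2)
    candidates = set(rng.sample(eligible, int(oos_fraction * len(eligible))))

    def oos_side(tr):
        return tr[0] if tr[0] in candidates else tr[2]

    def in_sample_side(tr):
        return tr[2] if tr[0] in candidates else tr[0]

    # Step 3: drop triples with two candidates; one candidate -> held out.
    train = [tr for tr in triples if tr[0] not in candidates and tr[2] not in candidates]
    held_out = [tr for tr in triples if (tr[0] in candidates) != (tr[2] in candidates)]

    # Step 4: drop held-out triples whose in-sample entity or relation is unseen in train.
    train_entities = {e for h, _, t in train for e in (h, t)}
    train_relations = {r for _, r, _ in train}
    held_out = [tr for tr in held_out
                if in_sample_side(tr) in train_entities and tr[1] in train_relations]

    # Step 5: keep only out-of-sample entities that still have >= 2 triples.
    per_entity = Counter(oos_side(tr) for tr in held_out)
    kept = sorted(e for e, c in per_entity.items() if c >= 2)

    # Step 6: split the out-of-sample entities in half for validation and test.
    rng.shuffle(kept)
    half = len(kept) // 2
    valid_entities, test_entities = set(kept[:half]), set(kept[half:])
    return {
        "train": train,
        "valid": [tr for tr in held_out if oos_side(tr) in valid_entities],
        "test": [tr for tr in held_out if oos_side(tr) in test_entities],
        "valid_entities": valid_entities,
        "test_entities": test_entities,
    }
```

Sorting the entity lists before sampling makes the split reproducible for a given seed, under the assumption that entity identifiers are mutually comparable (for example, all strings or all integers).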

Experiments and results
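To measure the performance of different models, for any out-of-sample entity $v$ in the test set with triples $G_v$, we create $|G_v|$ queries where, in the $i$-th query, we use our learned model to compute an embedding for $v$ given all except the $i$-th triple in $G_v$ and use that embedding to make a prediction about the $i$-th triple. Figure 2 presents statistics on the number of triples used to compute the embedding of the out-of-sample entities in the test set for both oWN18RR and oFB15k-237. If the $i$-th triple is of the form $(v, r, u)$, then we create the query $(v, r, ?)$ and find the ranking our model assigns to $u$ (the correct answer to the query) among entities $u' \in V$ such that $(v, r, u') \notin G_v$ (the $(v, r, u') \notin G_v$ constraint is known as the filtered setting). We follow a similar procedure for the case where the $i$-th triple is of the form $(u, r, v)$. Let $\kappa_{(v,r,?),u}$ represent the rank of $u$ for query $(v, r, ?)$, and similarly $\kappa_{(?,r,v),u}$ for query $(?, r, v)$. We report filtered mean reciprocal rank (MRR) computed as:

$$\text{MRR} = \frac{1}{\sum_{v} |G_v|} \sum_{v} \Big( \sum_{(v, r, u) \in G_v} \frac{1}{\kappa_{(v,r,?),u}} + \sum_{(u, r, v) \in G_v} \frac{1}{\kappa_{(?,r,v),u}} \Big)$$

and filtered Hit@k (for $k \in \{1, 3, 10\}$) defined as:

$$\text{Hit@}k = \frac{1}{\sum_{v} |G_v|} \sum_{v} \Big( \sum_{(v, r, u) \in G_v} \mathbb{1}_{\kappa_{(v,r,?),u} \le k} + \sum_{(u, r, v) \in G_v} \mathbb{1}_{\kappa_{(?,r,v),u} \le k} \Big)$$

where the outer sums range over the out-of-sample entities in the test set and $\mathbb{1}_{\text{condition}}$ is 1 if the condition holds and 0 otherwise.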
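The filtered ranking and the two metrics can be sketched as follows. The sketch assumes candidate scores are given as a dictionary from entity to score and that ties are resolved in favour of the correct answer (only strictly higher scores count); both the tie rule and the names `filtered_rank`, `mrr`, and `hit_at_k` are assumptions made for illustration.

```python
def filtered_rank(scores, answer, known_answers):
    """Rank of `answer` after removing other entities known to be correct."""
    target = scores[answer]
    rank = 1
    for candidate, score in scores.items():
        if candidate == answer or candidate in known_answers:
            continue  # filtered setting: skip other correct answers
        if score > target:
            rank += 1
    return rank


def mrr(ranks):
    """Mean reciprocal rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy query: "c" is the answer and "a" is another known-correct entity.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
rank_c = filtered_rank(scores, "c", known_answers={"a"})  # only "b" ranks above "c"
```

In this toy query the unfiltered rank of "c" would be 3, while the filtered rank is 2 because "a" is removed from the candidate list.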

Baselines
We develop several baselines for out-of-sample representation learning over KGs.
Popularity: In this baseline, we rank the in-sample entities based on the number of times they appear in the triples of the training set. We break ties randomly. At test time, we use this ranking as our answer to all queries.
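A minimal sketch of the Popularity baseline is given below. The function name `popularity_ranking` and the `seed` parameter are illustrative; random tie-breaking is implemented here by shuffling the entities and then applying a stable sort on the counts, which is one of several reasonable ways to do it.

```python
import random
from collections import Counter


def popularity_ranking(train_triples, seed=0):
    """Rank in-sample entities by how often they appear in training triples."""
    rng = random.Random(seed)
    counts = Counter()
    for head, _, tail in train_triples:
        counts[head] += 1
        counts[tail] += 1
    entities = sorted(counts)                 # fixed starting order for reproducibility
    rng.shuffle(entities)                     # random order among equally frequent entities
    entities.sort(key=lambda e: -counts[e])   # stable sort keeps the shuffled tie order
    return entities


# Toy training set: "a" appears 3 times, "b" and "c" twice, "d" once.
train = [("a", "r", "b"), ("a", "r", "c"), ("d", "r", "a"), ("b", "r", "c")]
ranking = popularity_ranking(train)
```

The same ranking is returned for every query, so the baseline ignores both the relation and the out-of-sample entity.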
OOV: This baseline is inspired by the way a word embedding is computed for out-of-vocabulary (OOV) words (i.e. words unseen during training) in some works in the natural language processing literature. After training, we compute the average embedding of all in-sample entities and use it as the embedding for out-of-sample entities.
RGCN-D: Graph convolutional networks (GCNs) have proved effective for inductive and out-of-sample learning when initial entity features are available. When such features are not available, Hamilton et al. (2017a) propose to use node degrees as initial entity features. Since we work with multi-relational graphs, we initialize entity features as vectors of size $2|R|$, where the $i$-th and $(|R| + i)$-th elements (for $i < |R|$) represent the number of incoming and outgoing edges with relation type $r_i$ respectively. We use RGCN (Schlichtkrull et al., 2018) as the GCN.
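The degree-based initial features can be sketched as below, using zero-based indices so that position i holds the incoming count and position |R| + i the outgoing count for the i-th relation. The function name `degree_features` and the toy triples are illustrative.

```python
def degree_features(entity, triples, relations):
    """Per-relation incoming and outgoing edge counts, as a vector of size 2|R|."""
    index = {rel: i for i, rel in enumerate(relations)}
    n = len(relations)
    features = [0] * (2 * n)
    for head, rel, tail in triples:
        if tail == entity:
            features[index[rel]] += 1        # incoming edge of this relation type
        if head == entity:
            features[n + index[rel]] += 1    # outgoing edge of this relation type
    return features


# Toy graph: "e" has one incoming r0 edge and two outgoing r1 edges.
relations = ["r0", "r1"]
triples = [("x", "r0", "e"), ("e", "r1", "y"), ("e", "r1", "z"), ("x", "r1", "y")]
features_e = degree_features("e", triples, relations)  # [1, 0, 0, 2]
```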
oDistMult-EAvg: Similar to a baseline used in prior work, we create a simpler version of oDistMult-ERAvg by defining the embedding for an unseen entity $v$ as the average of the embeddings of the entities that are related to $v$. More formally, this baseline defines $\mathbf{z}_v = \text{aggregate}(v) = \frac{1}{|G_v|} \big( \sum_{(v, r, u) \in G_v} \mathbf{z}_u + \sum_{(u, r, v) \in G_v} \mathbf{z}_u \big)$.
DistMult-LS-U: As an ablation study, we also include an unnormalized version of DistMult-LS where we change Equation (3) to $\mathbf{z}_v \cdot (\mathbf{z}_r \odot \mathbf{z}_u) = 1$ (in other words, setting the elements of $\mathbf{b}$ in Equation (4) to 1).

Implementation Details
For RGCN-D, we used the implementation in the Deep Graph Library (DGL). We implemented other models and baselines in PyTorch (Paszke et al., 2017) and used the AdaGrad optimizer (Duchi et al., 2011). We selected the hyperparameters corresponding to learning rate and L2 regularization ($\lambda$) via a grid search over {0.1, 0.01} and {0.1, 0.01, 0.001, 0.0001} respectively, validating the models every 100 epochs and selecting the best hyperparameters and epoch based on validation filtered MRR. We set the negative ratio to 1 and the embedding dimension to 200. When using Algorithm 2 for training, we set $\psi$ to 0.5 unless stated otherwise. The code and datasets are available at https://github.com/BorealisAI/OOS-KGE.

Figure 2: The two figures provide statistics on the test sets of (a) oWN18RR and (b) oFB15k-237. They show the number of test queries (on the y-axis) for which the embedding of the out-of-sample entity is computed based on $k$ triples (e.g., for almost 2000 queries in oWN18RR, the embedding of the out-of-sample entity is learned based on only 1 triple). Since the number of samples for many of the larger values of $k$ is 0, to keep the plots readable we restricted the x-axis to $k \le 30$ for oWN18RR and $k \le 120$ for oFB15k-237 and did not include in the diagrams the few cases where $k$ was larger. The colors show the bins used for the experiment in Figure 3(b, c).

Results
According to the results on oWN18RR and oFB15k-237 reported in Table 2, in almost all cases, using Algorithm 2 for training as opposed to Algorithm 1 results in a boost in performance. Recall that the models whose names start with an "o" use Algorithm 2 and the models without "o" correspond to the variants where Algorithm 1 is used instead. On oWN18RR, for instance, oDistMult-ERAvg and oDistMult-LS achieve 28% and 16% improvement in terms of filtered MRR compared to DistMult-ERAvg and DistMult-LS respectively. The margins of improvement on oFB15k-237 are smaller, as oFB15k-237 is generally a more challenging dataset than oWN18RR and it is more difficult to make progress on. We believe the observed boost when using Algorithm 2 arises mainly because the train and test procedures become more consistent compared to when Algorithm 1 is used. Furthermore, it can be observed that the proposed oDistMult-ERAvg and oDistMult-LS models outperform the other baselines. We believe the reason for the poor performance of RGCN-D on oWN18RR is that the out-of-sample entities have few neighbors (see Figure 2(a)) and the degree information (used as initial features) is not discriminative enough. Between the two proposed models, the winner is dataset-dependent, with oDistMult-LS performing slightly better on oWN18RR and oDistMult-ERAvg showing better performance on oFB15k-237. DistMult-LS also outperforms DistMult-LS-U, shedding light on the importance of the normalization in Equation (3).
Selecting ψ: For the results in Table 2, we set the value of $\psi$ to 0.5 (see Algorithm 2 for the usage of $\psi$). Here, we explore different values for $\psi$ to see how it affects the performance. Figure 3(a) shows the test MRR of oDistMult-ERAvg on oWN18RR for different values of $\psi$. When $\psi = 0$ (corresponding to using the standard transductive training algorithm presented in Algorithm 1), the performance is poor. As soon as $\psi$ becomes greater than zero, we observe a substantial boost in performance. The performance keeps increasing as $\psi$ increases until reaching a plateau, and then it goes down when $\psi = 1$, corresponding to a training procedure where, for each triple, one entity is always treated as out-of-sample. We repeated the experiment with other models and on other datasets and observed similar behavior. We believe one reason why we observe better performance for $0 < \psi < 1$ compared to $\psi = 1$ is that when $0 < \psi < 1$, the model is encouraged to learn embeddings that do well for both transductive and out-of-sample prediction tasks, with the transductive task acting as an auxiliary task (and possibly as a regularizer) helping the embeddings capture more information.
Neighbor-size effect: Different out-of-sample entities appear in different numbers of triples. Figure 2 shows statistics for oWN18RR and oFB15k-237 on the number of triples used to learn the embedding for the out-of-sample entity in each query in the test set. To test how this number affects the models, we divided our test queries into 5 bins of (approximately) equal size, as shown by the bar colors in Figure 2, and measured the test MRR on each bin. According to the results for oDistMult-ERAvg and DistMult-ERAvg, presented in Figure 3(b, c), oDistMult-ERAvg outperforms DistMult-ERAvg on all bins except one. For both models, as the number of triples from which we learn the embedding for out-of-sample entities increases, the performance deteriorates, highlighting a shortcoming of the averaging strategy used for aggregation. Future work can look into other aggregation functions (e.g., attention-based averaging).
In-sample performance: To measure how training with Algorithm 2 affects model performance for in-sample (i.e., transductive) link prediction, we compared DistMult and oDistMult-ERAvg on the original splits of WN18AM, the cleaned version of WN18RR (Hajimoradlou and Kazemi, 2020). For this experiment, we used the Adam optimizer (Kingma and Ba, 2014) and added a dropout of 0.5 after the Hadamard product of the embeddings (before taking the sum of the features) in DistMult. We tuned both learning rate and weight decay from the set {0.0001, 0.001, 0.01, 0.1}. The results in Table 3 indicate that training with our proposed algorithm does not deteriorate the performance for in-sample link prediction.

Table 3: In-sample link prediction results on a cleaned version of WN18RR named WN18AM (for details, see Hajimoradlou and Kazemi, 2020). Although oDistMult-ERAvg has been trained for out-of-sample reasoning, its performance on in-sample reasoning is almost as good as DistMult.

Conclusion
We studied out-of-sample representation learning for non-attributed multi-relational graphs, a problem that is surprisingly poorly studied. We created two benchmarks for this task and outlined the procedure we followed for creating these datasets to facilitate the creation of more datasets in the future. We also developed several baselines, a new training algorithm, and two aggregation models for out-of-sample representation learning. Future work includes developing new training strategies, testing other aggregation functions, combining the aggregation functions with other transductive models, extending out-of-sample reasoning to temporal KG completion and knowledge hypergraph completion (e.g., extending the proposed training algorithm and aggregation functions to the temporal or hypergraph versions of DistMult or SimplE (Fatemi et al., 2019)), transferring the knowledge learned over one graph to a new graph with new entities (similar to (Muhan Zhang, 2020; Teru and Hamilton, 2019)), studying the similarities and differences between out-of-sample representation learning and out-of-vocabulary word embedding, and testing the proposed models on relational domains other than knowledge graphs.