Meta Relational Learning for Few-Shot Link Prediction in Knowledge Graphs

Link prediction is an important way to complete knowledge graphs (KGs), but embedding-based methods, although effective for link prediction in KGs, perform poorly on relations that have only a few associated triples. In this work, we propose a Meta Relational Learning (MetaR) framework for the common but challenging task of few-shot link prediction in KGs, namely predicting new triples about a relation after observing only a few associated triples. We solve few-shot link prediction by transferring relation-specific meta information so that the model learns the most important knowledge and learns faster, corresponding to relation meta and gradient meta respectively in MetaR. Empirically, our model achieves state-of-the-art results on few-shot link prediction KG benchmarks.


Introduction
A knowledge graph is composed of a large number of triples in the form of (head entity, relation, tail entity), (h, r, t) for short, encoding knowledge and facts about the world. Many KGs have been proposed (Vrandečić and Krötzsch, 2014; Bollacker et al., 2008; Carlson et al., 2010) and applied to various applications (Bordes et al., 2014; Zhang et al., 2016, 2019a). Despite their huge numbers of entities, relations and triples, many KGs still suffer from incompleteness, so knowledge graph completion is vital for the development of KGs. One knowledge graph completion task is link prediction, which predicts new triples based on existing ones. For link prediction, KG embedding methods (Bordes et al., 2013; Nickel et al., 2011; Trouillon et al., 2016; Yang et al., 2015) are promising. They learn latent representations, called embeddings, for entities and relations in a continuous vector space and accomplish link prediction via calculations with these embeddings.
Figure 1: An example of 3-shot link prediction in KGs. One task represents observing only three instances of one specific relation and conducting link prediction on this relation. Our model focuses on extracting relation-specific meta information with a relational learner that is shared across tasks, and on transferring this meta information to do link prediction within one task.
* Equal contribution. † Corresponding author.
The effectiveness of KG embedding methods depends on sufficient training examples, so results are much worse for elements with only a few instances during training (Zhang et al., 2019c). However, the few-shot problem is widespread in KGs. For example, about 10% of relations in Wikidata (Vrandečić and Krötzsch, 2014) have no more than 10 triples. Relations with only a few instances are called few-shot relations. In this paper, we study few-shot link prediction in knowledge graphs: predicting the tail entity t given the head entity h and relation r after observing only K triples about r, where K is usually small. Figure 1 depicts an example of 3-shot link prediction in KGs.
For few-shot link prediction, Xiong et al. (2018) made the first attempt and proposed GMatching, which learns a matching metric by considering both learned embeddings and one-hop graph structures. We approach few-shot link prediction from another perspective, based on the intuition that the most important information to transfer from a few existing instances to incomplete triples is the common, shared knowledge within one task. We call such information relation-specific meta information and propose a new framework, Meta Relational Learning (MetaR), for few-shot link prediction. For example, in Figure 1, relation-specific meta information related to the relation CEOof or CountryCapital will be extracted and transferred by MetaR from a few existing instances to incomplete triples.
The relation-specific meta information is helpful from the following two perspectives: 1) transferring common relation information from observed triples to incomplete triples, and 2) accelerating the learning process within one task when only a few instances are observed. Thus we propose two kinds of relation-specific meta information, relation meta and gradient meta, corresponding to the two aforementioned perspectives respectively. In our proposed framework MetaR, relation meta is the high-order representation of a relation connecting head and tail entities. Gradient meta is the loss gradient of relation meta, which is used to make a rapid update before transferring relation meta to incomplete triples during prediction.
Compared with GMatching (Xiong et al., 2018), which relies on a background knowledge graph, MetaR does not depend on one and is therefore more robust, as background knowledge graphs might not be available for few-shot link prediction in real scenarios.
We evaluate MetaR with different settings on few-shot link prediction datasets. MetaR achieves state-of-the-art results, indicating the success of transferring relation-specific meta information in few-shot link prediction tasks. In summary, the main contributions of our work are threefold:
• Firstly, we propose a novel meta relational learning framework (MetaR) to address few-shot link prediction in knowledge graphs.
• Secondly, we highlight the critical role of relation-specific meta information for few-shot link prediction, and propose two kinds of relation-specific meta information, relation meta and gradient meta. Experiments show that both of them contribute significantly.
• Thirdly, our MetaR achieves state-of-the-art results on few-shot link prediction tasks, and we also analyze the factors that affect MetaR's performance.

Related Work
One target of MetaR is to learn entity representations suited to the few-shot link prediction task, and the learning framework is inspired by knowledge graph embedding methods. Furthermore, using the loss gradient as one kind of meta information is inspired by MetaNet (Munkhdalai and Yu, 2017) and MAML (Finn et al., 2017), which explore methods for few-shot learning by meta-learning. From these two points, we regard knowledge graph embedding and meta-learning as the two main kinds of related work.

Knowledge Graph Embedding
Knowledge graph embedding models map relations and entities into a continuous vector space. They use a score function to measure the truth value of each triple (h, r, t). Like knowledge graph embedding methods, MetaR also needs a score function; the main difference is that the representation of r is the learned relation meta in MetaR rather than an embedding of r as in standard knowledge graph embedding methods. One line of work was started by TransE (Bordes et al., 2013) with its distance score function. TransH (Wang et al., 2014) and TransR (Lin et al., 2015b) are two typical models that use different methods to connect head and tail entities and their relations. DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) are derived from RESCAL (Nickel et al., 2011) and try to mine latent semantics in different ways. There are also others, such as ConvE (Dettmers et al., 2018), which uses a convolutional structure to score triples, and models using additional information such as entity types and relation paths (Lin et al., 2015a). Wang et al. (2017) comprehensively summarize the current popular knowledge graph embedding methods.
Traditional embedding models rely heavily on rich training instances (Zhang et al., 2019b; Xiong et al., 2018) and are thus limited in few-shot link prediction. MetaR is designed to address this weakness of existing embedding models.

Meta-Learning
Meta-learning seeks the ability to learn quickly from only a few instances within the same concept and to adapt continuously to more concepts, which is the kind of rapid and incremental learning that humans are very good at.
Several meta-learning models have been proposed recently. Generally, there are three kinds of meta-learning methods so far:
(1) Metric-based meta-learning (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Xiong et al., 2018), which tries to learn a matching metric between the query and the support set that generalizes to all tasks, where the idea of matching is similar to some nearest-neighbor algorithms. Siamese Neural Network (Koch et al., 2015) is a typical method that uses symmetric twin networks to compute the metric of two inputs. GMatching (Xiong et al., 2018), the first attempt at one-shot link prediction in knowledge graphs, learns a matching metric based on entity embeddings and local graph structures, and can also be regarded as a metric-based method.
(2) Model-based methods (Santoro et al., 2016; Munkhdalai and Yu, 2017; Mishra et al., 2018), which use a specially designed component such as memory to learn rapidly from only a few training instances. MetaNet (Munkhdalai and Yu, 2017), a kind of memory-augmented neural network (MANN), acquires meta information from the loss gradient and generalizes rapidly via its fast parameterization.
(3) Optimization-based approaches (Finn et al., 2017; Lee and Choi, 2018), which learn faster by changing the optimization algorithm. Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is a model-agnostic algorithm. It first updates the parameters of a task-specific learner, and then performs meta-optimization across tasks over the original parameters using the updated parameters, which is like "a gradient through a gradient".
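To make the phrase "a gradient through a gradient" concrete, the following toy sketch differentiates a query loss through one inner gradient step on a support loss. It is only an illustration on a scalar parameter with quadratic losses, not the implementation of Finn et al. (2017); the function names, the targets `a` and `b`, and the step size `alpha` are all hypothetical.

```python
def inner_update(theta, a, alpha):
    """One gradient step on the support loss (theta - a)^2."""
    return theta - alpha * 2.0 * (theta - a)


def query_loss_after_adaptation(theta, a, b, alpha):
    """Query loss (theta_fast - b)^2 evaluated at the adapted parameter."""
    theta_fast = inner_update(theta, a, alpha)
    return (theta_fast - b) ** 2


def meta_gradient(theta, a, b, alpha):
    """Derivative of the adapted query loss with respect to the ORIGINAL theta.

    Chain rule: dL_q/dtheta = dL_q/dtheta_fast * dtheta_fast/dtheta,
    where dtheta_fast/dtheta = 1 - 2 * alpha comes from differentiating
    through the inner gradient step itself.
    """
    theta_fast = inner_update(theta, a, alpha)
    return 2.0 * (theta_fast - b) * (1.0 - 2.0 * alpha)


theta, a, b, alpha = 1.0, 0.0, 2.0, 0.1
analytic = meta_gradient(theta, a, b, alpha)

# Central finite difference as an independent check of the chain rule.
eps = 1e-6
numeric = (query_loss_after_adaptation(theta + eps, a, b, alpha)
           - query_loss_after_adaptation(theta - eps, a, b, alpha)) / (2 * eps)
print(analytic, numeric)
```

The factor `1 - 2 * alpha` is the part that a plain first-order update would miss; it appears only because the outer derivative passes through the inner update.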
As far as we know, the work of Xiong et al. (2018) is the first research on few-shot learning for knowledge graphs. It is a metric-based model that consists of a neighbor encoder and a matching processor. The neighbor encoder enhances the embeddings of entities with their one-hop neighbors, and the matching processor performs multi-step matching with an LSTM block.

Task Formulation
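In this section, we present the formal definitions of a knowledge graph and of the few-shot link prediction task. A knowledge graph is defined as follows: a knowledge graph is $\mathcal{G} = \{\mathcal{E}, \mathcal{R}, \mathcal{TP}\}$, where $\mathcal{E}$ is the entity set, $\mathcal{R}$ is the relation set, and $\mathcal{TP} = \{(h, r, t) \in \mathcal{E} \times \mathcal{R} \times \mathcal{E}\}$ is the triple set. A few-shot link prediction task in knowledge graphs is defined as follows: with a knowledge graph $\mathcal{G}$, given a support set $S_r = \{(h_i, t_i) \in \mathcal{E} \times \mathcal{E} \mid (h_i, r, t_i) \in \mathcal{TP}\}$ about relation $r \in \mathcal{R}$, where $|S_r| = K$, predicting the tail entity linked with relation $r$ to head entity $h_j$, formulated as $r : (h_j, ?)$, is called K-shot link prediction.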
As defined above, a few-shot link prediction task is always defined for a specific relation. During prediction, there is usually more than one triple to be predicted, and, alongside the support set $S_r$, we call the set of all triples to be predicted the query set $Q_r = \{r : (h_j, ?)\}$. The goal of a few-shot link prediction method is to gain the capability of predicting new triples about a relation $r$ after observing only a few triples about $r$. Thus its training process is based on a set of tasks $T_{train} = \{T_i\}_{i=1}^{M}$, where each task $T_i = \{S_i, Q_i\}$ corresponds to an individual few-shot link prediction task with its own support and query set. Its testing process is conducted on a set of new tasks $T_{test} = \{T_j\}_{j=1}^{N}$, which is similar to $T_{train}$ except that each $T_j \in T_{test}$ should be about a relation that has never been seen in $T_{train}$.
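As a rough illustration of this formulation, the sketch below builds one K-shot task for a single relation by splitting its entity pairs into a support set of size K and a query set. The helper `build_task`, the dictionary layout and the toy entity names are hypothetical and are not taken from the paper's data pipeline; only the relation names echo Figure 1.

```python
import random


def build_task(triples, relation, k, rng):
    """Split one relation's entity pairs into a K-shot support set and a query set."""
    pairs = [(h, t) for (h, r, t) in triples if r == relation]
    if len(pairs) <= k:
        raise ValueError("need more than K triples so the query set is non-empty")
    rng.shuffle(pairs)
    return {"relation": relation, "support": pairs[:k], "query": pairs[k:]}


# Hypothetical toy triples (h, r, t); the entities are placeholders.
toy_triples = [(f"person_{i}", "CEOof", f"company_{i}") for i in range(5)]
toy_triples += [(f"country_{i}", "CountryCapital", f"city_{i}") for i in range(4)]

# A 3-shot task about the relation CEOof: 3 support pairs, the rest are queries.
task = build_task(toy_triples, "CEOof", k=3, rng=random.Random(0))
print(len(task["support"]), len(task["query"]))
```

Under this reading, the training and test task sets would simply be lists of such dictionaries built from disjoint sets of relations.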

Table 1 gives a concrete example of the data during learning and testing for few-shot link prediction.
Figure 2: Overview of MetaR, showing the embedding learner in the support step and in the query step. $T_r = \{S_r, Q_r\}$; $R_{T_r}$ and $R'_{T_r}$ represent relation meta and updated relation meta, and $G_{T_r}$ represents gradient meta.

Method
To give one model the capability of few-shot link prediction, the most important thing is transferring information from the support set to the query set, and there are two questions to consider: (1) what is the most transferable and common information between the support set and the query set, and (2) how to learn faster when only a few instances are observed within one task. For question (1), within one task, all triples in the support set and the query set are about the same relation, so it is natural to suppose that the relation is the key common part between the support and query sets. For question (2), the learning process is usually conducted by minimizing a loss function via gradient descent, so gradients reveal how the model's parameters should be changed. Intuitively, we believe that gradients are a valuable source for accelerating the learning process.
Based on these thoughts, we propose two kinds of meta information, shared between the support set and the query set, to deal with the above problems:
• Relation Meta represents the relation connecting head and tail entities in both the support and query sets. We extract relation meta for each task, represented as a vector, from the support set and transfer it to the query set.
• Gradient Meta is the loss gradient of relation meta in the support set. As gradient meta shows how relation meta should be changed in order to reach a loss minimum, relation meta is updated through gradient meta before being transferred to the query set, which accelerates the learning process. This update can be viewed as the rapid learning of relation meta.
In order to extract relation meta and gradient meta and incorporate them with knowledge graph embedding to solve few-shot link prediction, our proposal, MetaR, mainly contains two modules:
• Relation-Meta Learner generates relation meta from the embeddings of heads and tails in the support set.
• Embedding Learner calculates the truth values of triples in the support set and the query set via entity embeddings and relation meta. Based on the loss function in the embedding learner, gradient meta is calculated and a rapid update of relation meta is applied before transferring relation meta to the query set.
The overview and algorithm of MetaR are shown in Figure 2 and Algorithm 1. Next, we introduce each module of MetaR via one few-shot link prediction task $T_r = \{S_r, Q_r\}$.

Relation-Meta Learner
To extract the relation meta from support set, we design a relation-meta learner to learn a mapping from head and tail entities in support set to relation meta. The structure of this relation-meta learner can be implemented as a simple neural network.
In task $T_r$, the input of the relation-meta learner is the head and tail entity pairs in the support set, $\{(h_i, t_i) \in S_r\}$. We first extract entity-pair specific relation meta via an L-layer fully connected neural network,
\[ x^0 = h_i \oplus t_i, \]
\[ x^l = \sigma(W^l x^{l-1} + b^l), \]
\[ R_{(h_i, t_i)} = W^L x^{L-1} + b^L, \]
where $h_i \in \mathbb{R}^d$ and $t_i \in \mathbb{R}^d$ are the embeddings of head entity $h_i$ and tail entity $t_i$ with dimension $d$. $L$ is the number of layers in the neural network, and $l \in \{1, \ldots, L-1\}$. $W^l$ and $b^l$ are the weights and bias in layer $l$. We use LeakyReLU for the activation $\sigma$. $x \oplus y$ represents the concatenation of vectors $x$ and $y$. Finally, $R_{(h_i, t_i)}$ represents the relation meta from the specific entity pair $h_i$ and $t_i$.
With multiple entity-pair specific relation meta, we generate the final relation meta in the current task by averaging all entity-pair specific relation meta in the current task,
\[ R_{T_r} = \frac{\sum_{i=1}^{K} R_{(h_i, t_i)}}{K}. \]
Algorithm 1 (final steps): 8: Update φ and emb by the loss in $Q_r$. 9: end while.
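A minimal sketch of the relation-meta learner described here follows, written in plain Python so that it runs without dependencies. It concatenates head and tail embeddings, applies fully connected layers with LeakyReLU on all but the last layer, and averages the per-pair outputs. The layer sizes, the random initialization and the LeakyReLU slope of 0.01 are assumptions made for illustration; the text does not specify the slope, and the real model uses much larger layers.

```python
import random


def leaky_relu(v, slope=0.01):
    """Elementwise LeakyReLU; the slope value is an assumption."""
    return [x if x > 0 else slope * x for x in v]


def linear(W, b, x):
    """Affine map W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_in, n_out, rng):
    """Hypothetical small random initialization for one layer."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def pair_relation_meta(h, t, layers):
    """Entity-pair specific relation meta from one (head, tail) embedding pair."""
    x = h + t  # list concatenation plays the role of h ⊕ t
    for W, b in layers[:-1]:
        x = leaky_relu(linear(W, b, x))
    W, b = layers[-1]  # the last layer has no activation
    return linear(W, b, x)


def relation_meta(support, layers):
    """Average the entity-pair specific relation meta over the support set."""
    metas = [pair_relation_meta(h, t, layers) for h, t in support]
    return [sum(col) / len(metas) for col in zip(*metas)]


rng = random.Random(0)
d = 4  # toy embedding dimension
layers = [init_layer(2 * d, 8, rng), init_layer(8, 6, rng), init_layer(6, d, rng)]
support = [([rng.gauss(0, 1) for _ in range(d)], [rng.gauss(0, 1) for _ in range(d)])
           for _ in range(3)]  # K = 3 embedding pairs
R = relation_meta(support, layers)
print(len(R))
```

The output has the same dimension as the entity embeddings, which is what allows it to stand in for a relation embedding in a translation-style score.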

Embedding Learner
As we want to obtain gradient meta to make a rapid update on relation meta, we need a score function to evaluate the truth value of entity pairs under specific relations, and also a loss function for the current task. We apply the key idea of knowledge graph embedding methods in our embedding learner, as they have proved effective for evaluating the truth value of triples in knowledge graphs.
In task $T_r$, we first calculate the score for each entity pair $(h_i, t_i)$ in the support set $S_r$ as follows:
\[ s_{(h_i, t_i)} = \| h_i + R_{T_r} - t_i \|, \]
where $\|x\|$ represents the L2 norm of vector $x$. We design the score function inspired by TransE (Bordes et al., 2013), which assumes that the head entity embedding h, relation embedding r and tail entity embedding t of a true triple (h, r, t) satisfy h + r = t. Thus the score function is defined according to the distance between h + r and t.
Transferring this to our few-shot link prediction task, we replace the relation embedding r with relation meta $R_{T_r}$, as there are no direct general relation embeddings in our task and $R_{T_r}$ can be regarded as the relation embedding for the current task $T_r$. With the score function for each triple, we set the following loss,
\[ L(S_r) = \sum_{(h_i, t_i) \in S_r} \left[ \gamma + s_{(h_i, t_i)} - s_{(h_i, t_i')} \right]_+, \]
where $[x]_+$ represents the positive part of $x$, $\gamma$ represents the margin, which is a hyperparameter, and $s_{(h_i, t_i')}$ is the score of a negative sample $(h_i, t_i')$ corresponding to the positive pair $(h_i, t_i)$.
$L(S_r)$ should be small for task $T_r$, which indicates that the model properly encodes the truth values of triples. Gradients of parameters indicate how the parameters should be updated. Thus we regard the gradient of $R_{T_r}$ based on $L(S_r)$ as gradient meta $G_{T_r}$:
\[ G_{T_r} = \nabla_{R_{T_r}} L(S_r). \]
Following the gradient update rule, we make a rapid update on relation meta as follows:
\[ R'_{T_r} = R_{T_r} - \beta G_{T_r}, \]
where $\beta$ indicates the step size of gradient meta when operating on relation meta. When scoring the query set with the embedding learner, we use the updated relation meta. After getting the updated relation meta $R'_{T_r}$, we transfer it to the samples in the query set $Q_r = \{(h_j, t_j)\}$ and calculate their scores and the loss of the query set in the same way as for the support set:
\[ s_{(h_j, t_j)} = \| h_j + R'_{T_r} - t_j \|, \]
\[ L(Q_r) = \sum_{(h_j, t_j) \in Q_r} \left[ \gamma + s_{(h_j, t_j)} - s_{(h_j, t_j')} \right]_+, \]
where $L(Q_r)$ is our training objective to be minimized. We use this loss to update the whole model.
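The support-step computation can be sketched as follows: a TransE-style distance score, a margin loss over positive and negative tails, the gradient of that loss with respect to relation meta (gradient meta), and one rapid update. This is a hand-derived, dependency-free sketch under the assumption that each support pair comes with one negative tail; the toy vectors and helper names are hypothetical. The step size and margin are both set to 1, matching the values reported in the Implementation section.

```python
import math


def residual(h, rel, t):
    """Vector h + rel - t, whose L2 norm is the distance score."""
    return [hi + ri - ti for hi, ri, ti in zip(h, rel, t)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def loss_and_gradient(triples, rel, gamma):
    """Margin loss over (h, t, t_neg) triples and its gradient w.r.t. rel."""
    loss, grad = 0.0, [0.0] * len(rel)
    for h, t, t_neg in triples:
        u_pos, u_neg = residual(h, rel, t), residual(h, rel, t_neg)
        s_pos, s_neg = norm(u_pos), norm(u_neg)
        margin = gamma + s_pos - s_neg
        if margin > 0:  # only violated margins contribute to loss and gradient
            loss += margin
            for i in range(len(rel)):
                # d||u||/d rel = u / ||u||; guard against a zero norm
                grad[i] += u_pos[i] / max(s_pos, 1e-12) - u_neg[i] / max(s_neg, 1e-12)
    return loss, grad


def rapid_update(rel, grad, beta):
    """One gradient step on relation meta using gradient meta."""
    return [r - beta * g for r, g in zip(rel, grad)]


# Toy 2-d example: one support pair with a single negative tail.
support = [([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])]  # (h, t, t_neg)
rel = [0.5, 0.5]  # stands in for the relation meta of the task
loss_before, grad_meta = loss_and_gradient(support, rel, gamma=1.0)
rel_updated = rapid_update(rel, grad_meta, beta=1.0)
loss_after, _ = loss_and_gradient(support, rel_updated, gamma=1.0)
print(loss_before, loss_after)
```

In this toy case the single update moves relation meta toward the true tail and away from the negative one, so the support loss falls from the margin value to zero. In the full model the updated relation meta would then be used to score the query set, and the query loss would be backpropagated to the embeddings and the relation-meta learner.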

Training Objective
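During training, our objective is to minimize the following loss $L$, which is the sum of the query losses for all tasks in one minibatch:
\[ L = \sum_{T_r} L(Q_r), \]
where the sum runs over the tasks $T_r = \{S_r, Q_r\}$ sampled from $T_{train}$ in the current minibatch.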

Experiments
With MetaR, we want to answer the following questions: 1) can MetaR accomplish the few-shot link prediction task and even perform better than the previous model? 2) how much does relation-specific meta information contribute to few-shot link prediction? 3) is there any requirement for MetaR to work on few-shot link prediction? To answer them, we conduct experiments on two few-shot link prediction datasets and analyze the experimental results in depth.

Datasets and Evaluation Metrics
We use two datasets, NELL-One and Wiki-One, which were constructed by Xiong et al. (2018). NELL-One and Wiki-One are derived from NELL (Carlson et al., 2010) and Wikidata (Vrandečić and Krötzsch, 2014) respectively. Furthermore, because these two benchmarks were first tested on GMatching, which considers both learned embeddings and one-hop graph structures, a background graph is constructed from relations outside the training/validation/test sets to obtain pre-trained entity embeddings and to provide the local graph for GMatching. Unlike GMatching, which uses the background graph to enhance the representations of entities, MetaR can be trained without a background graph. For NELL-One and Wiki-One, which originally have a background graph, we can make use of it by fitting it into training tasks or by using it to train embeddings that initialize entity representations. Overall, we have three kinds of dataset settings, shown in Table 3. For the BG:In-Train setting, in order to include the background graph in training tasks, we sample tasks from triples in the background graph and the original training set, rather than from the original training set only. Note that these three settings do not violate the task formulation of few-shot link prediction in KGs. The statistics of NELL-One and Wiki-One are shown in Table 2.
We use two traditional metrics to evaluate different methods on these datasets, MRR and Hits@N. MRR is the mean reciprocal rank and Hits@N is the proportion of correct entities ranked in the top N in link prediction.
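Both metrics are simple functions of the rank assigned to the correct tail entity for each query. The sketch below assumes a distance-style score where lower is better and breaks ties optimistically; the tie rule, the helper names and the toy ranks are assumptions, since the text does not specify them.

```python
def rank_of_true(scores, true_entity):
    """1-based rank of the true tail among candidates; lower score is better."""
    true_score = scores[true_entity]
    better = sum(1 for e, s in scores.items() if e != true_entity and s < true_score)
    return 1 + better


def mrr(ranks):
    """Mean reciprocal rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks, n):
    """Proportion of queries whose true entity is ranked in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Hypothetical candidate scores for one query (entity -> distance).
candidate_scores = {"city_a": 0.9, "city_b": 0.2, "city_c": 0.5}
print(rank_of_true(candidate_scores, "city_c"))  # one candidate scores lower, so rank 2

# Hypothetical ranks of the true tail over four queries.
ranks = [1, 3, 12, 2]
print(mrr(ranks), hits_at(ranks, 1), hits_at(ranks, 5), hits_at(ranks, 10))
```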

Implementation
Table 4: Results of few-shot link prediction in NELL-One and Wiki-One (columns: MRR, Hits@10, Hits@5 and Hits@1, each for 1-shot and 5-shot). Bold numbers are the best results of all and underlined numbers are the best results of GMatching. The contents of the brackets after MetaR indicate the dataset setting used for MetaR.
During training, mini-batch gradient descent is applied with batch size 64 for NELL-One and 128 for Wiki-One. We use Adam (Kingma and Ba, 2015) with an initial learning rate of 0.001 to update parameters. We set γ = 1 and β = 1. The number of positive and negative triples in the query set is 3 and 10 in NELL-One and Wiki-One respectively. The trained model is applied to validation tasks every 1000 epochs, and the current model parameters and corresponding performance are recorded. After stopping, the model with the best Hits@10 performance is treated as the final model. For the number of training epochs, we use early stopping with 30 patient epochs, which means that we stop training when the performance on Hits@10 drops 30 times in a row. Following GMatching, the embedding dimension is 100 for NELL-One and 50 for Wiki-One. The sizes of the two hidden layers in the relation-meta learner are 500 and 200 for NELL-One and 250 and 100 for Wiki-One.
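The model-selection rule can be sketched as a loop over periodic validation scores. The sketch below takes one possible reading of the stopping rule, counting consecutive validations that fail to improve on the best Hits@10 so far; the function name and the toy score history are hypothetical, and the actual training script may count drops differently.

```python
def select_with_early_stopping(validation_hits10, patience=30):
    """Return (index, score) of the best validation result seen before stopping.

    Stops once `patience` consecutive validations fail to beat the best score.
    """
    best_score, best_index, bad_evals = float("-inf"), None, 0
    for index, score in enumerate(validation_hits10):
        if score > best_score:
            best_score, best_index, bad_evals = score, index, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_index, best_score


# Hypothetical Hits@10 values, one per validation round.
history = [0.10, 0.20, 0.15, 0.18, 0.30]
print(select_with_early_stopping(history, patience=2))   # stops before the last value
print(select_with_early_stopping(history, patience=30))  # sees the whole history
```

With a small patience the loop stops before reaching the late improvement, which illustrates why a fairly large patience such as 30 can be useful when validation scores are noisy.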

Results
The results of two few-shot link prediction tasks, 1-shot and 5-shot, on NELL-One and Wiki-One are shown in Table 4. The baseline in our experiments is GMatching (Xiong et al., 2018), which made the first attempt at the few-shot link prediction task and is the only method that we could find as a baseline. In this table, the results of GMatching with different KG embedding initializations are copied from the original paper. MetaR is tested on the different dataset settings introduced in Table 3.
In Table 4 [...] explore, the results of MetaR are no worse than those of GMatching, indicating that MetaR is capable of accomplishing few-shot link prediction. In parallel, the impressive improvement over GMatching demonstrates that the key idea of MetaR, transferring relation-specific meta information from the support set to the query set, works well on the few-shot link prediction task.
Table 5: Results of ablation study on Hits@10 of 1-shot link prediction in NELL-One (row -g -r: .052 and .052 on the two dataset settings).
Furthermore, unlike GMatching, MetaR is independent of background knowledge graphs. We test MetaR on 1-shot link prediction in partial NELL-One and Wiki-One, which discard the background graph, and obtain Hits@10 of 0.279 and 0.348 respectively. These results are still comparable with those of GMatching on the full datasets with background graphs.

Ablation Study
The previous section showed that relation-specific meta information, the key point of MetaR, contributes to few-shot link prediction. As there are two kinds of relation-specific meta information in this paper, relation meta and gradient meta, we want to find out how each of them contributes to the performance. Thus, we conduct an ablation study with three settings. The first is our complete MetaR method, denoted as standard. The second removes gradient meta by transferring the un-updated relation meta directly from the support set to the query set, denoted as -g. The third further removes relation meta, which reduces the model to a simple TransE embedding model, denoted as -g -r. The result under the third setting is copied from Xiong et al. (2018). It uses the triples from the background graph, the training tasks and the one-shot training triples from the validation/test set, so it is neither BG:Pre-Train nor BG:In-Train. We conduct the ablation study on NELL-One with the metric Hits@10, and the results are shown in Table 5. Table 5 shows that removing gradient meta decreases performance by 29.3% and 15% on the two dataset settings, and further removing relation meta continues to decrease performance, by 55% and 72% compared to the standard results. Thus both relation meta and gradient meta contribute significantly, and relation meta contributes more than gradient meta. Without gradient meta and relation meta, no relation-specific meta information is transferred in the model and it almost does not work. This also illustrates that relation-specific meta information is important and effective for the few-shot link prediction task.

Factors That Affect MetaR's Performance
We have shown that both relation meta and gradient meta contribute to few-shot link prediction. But is there any requirement for MetaR to ensure its performance on few-shot link prediction? We analyze this from two points based on the results: one is the sparsity of entities and the other is the number of tasks in the training set.
The sparsity of entities. We notice that the best results on NELL-One and Wiki-One appear in different dataset settings. With NELL-One, MetaR performs better in the BG:In-Train dataset setting, while with Wiki-One it performs better in BG:Pre-Train. The performance difference between the two dataset settings is more significant on Wiki-One.
Most datasets for few-shot tasks are sparse, and so are NELL-One and Wiki-One, but the entity sparsity of these two datasets is still significantly different. This is especially reflected in the proportion of entities that appear in only one triple in the training set: 82.8% and 37.1% in Wiki-One and NELL-One respectively. Entities with only one triple during training prevent MetaR from learning good representations for them, because entity embeddings in MetaR rely heavily on the triples related to them. Based on only one triple, the learned entity embeddings will include a lot of bias. Knowledge graph embedding methods can learn better embeddings than MetaR for those one-shot entities, because entity embeddings can be corrected by the embeddings of the relations connected to them, while they cannot in MetaR. This is why the best performance on Wiki-One occurs in the BG:Pre-Train setting: pre-trained entity embeddings help MetaR overcome the low quality of one-shot entities.
The number of tasks. Comparing MetaR's performance with and without the background dataset setting on NELL-One, we find that the number of tasks affects MetaR's performance significantly. With BG:In-Train, there are 321 tasks during training and MetaR achieves 0.401 on Hits@10, while without background knowledge there are 51 tasks, 270 fewer, and MetaR achieves 0.279. This explains why MetaR achieves its best performance on BG:In-Train with NELL-One. Even though NELL-One has 37.1% one-shot entities, adding background knowledge to the dataset increases the number of training tasks significantly, which compensates for the sparsity problem and contributes more to the task.
Thus we conclude that both the sparsity of entities and the number of tasks affect the performance of MetaR. Generally, MetaR performs better with more training tasks, and for extremely sparse datasets, pre-trained entity embeddings are preferred.
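For a sense of scale, the figures quoted for NELL-One can be turned into a task gap and a relative drop in Hits@10. The short calculation below uses only the numbers stated in this paragraph; the helper name is hypothetical, and the relative-drop framing is an added convenience rather than a figure reported in the paper.

```python
def relative_drop(reference, value):
    """Fractional decrease of `value` relative to `reference`."""
    return (reference - value) / reference


tasks_with_bg, tasks_without_bg = 321, 51        # training tasks on NELL-One
hits10_with_bg, hits10_without_bg = 0.401, 0.279  # Hits@10 on NELL-One

task_gap = tasks_with_bg - tasks_without_bg
drop = relative_drop(hits10_with_bg, hits10_without_bg)
print(task_gap, round(100 * drop, 1))  # 270 fewer tasks, about a 30.4% relative drop
```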

Conclusion
We propose a meta relational learning framework for few-shot link prediction in KGs, and we design our model to transfer relation-specific meta information from the support set to the query set. Specifically, we use relation meta to transfer common and important information, and gradient meta to accelerate learning. Compared to GMatching, which is the only other method for this task, MetaR achieves better performance and is also independent of background knowledge graphs. Based on the experimental results, we find that the performance of MetaR is affected by the number of training tasks and the sparsity of entities. In the future, we may consider obtaining more valuable information about sparse entities in few-shot link prediction in KGs.