One-Shot Relational Learning for Knowledge Graphs

Knowledge graphs (KGs) are key components of various natural language processing applications. To further expand KGs' coverage, previous studies on knowledge graph completion usually require a large number of positive examples for each relation. However, we observe that long-tail relations are actually more common in KGs and that newly added relations often do not have many known triples for training. In this work, we aim at predicting new facts under a challenging setting where only one training instance is available. We propose a one-shot relational learning framework, which utilizes the knowledge distilled by embedding models and learns a matching metric by considering both the learned embeddings and one-hop graph structures. Empirically, our model yields considerable performance improvements over existing embedding models, and also eliminates the need to re-train the embedding models when dealing with newly added relations.


Introduction
Large-scale knowledge graphs (Suchanek et al., 2007; Vrandečić and Krötzsch, 2014; Bollacker et al., 2008; Auer et al., 2007; Carlson et al., 2010) represent every piece of information as binary relationships between entities, usually in the form of triples, i.e. (subject, predicate, object). This kind of structured knowledge is essential for many downstream applications such as Question Answering and the Semantic Web.
Despite KGs' large scale, they are known to be highly incomplete (Min et al., 2013). To automatically complete KGs, extensive research efforts (Nickel et al., 2011; Bordes et al., 2013; Yang et al., 2014; Trouillon et al., 2016; Lao and Cohen, 2010; Neelakantan et al., 2015; Xiong et al., 2017; Das et al., 2017; Chen et al., 2018) have been made to build relational learning models that can infer missing triples by learning from existing ones. These methods exploit the statistical information of triples or path patterns to infer new facts of existing relations, and have achieved considerable performance on various public datasets. However, the datasets (e.g. FB15k, WN18) used by previous models mostly cover only the common relations in KGs. For more practical scenarios, we believe the desired KG completion models should handle two key properties of KGs. First, as shown in Figure 1, a large portion of KG relations are actually long-tail; in other words, they have very few instances. But intuitively, the fewer training triples a relation has, the more useful KG completion techniques could be. Therefore, it is crucial for models to be able to complete relations with limited numbers of triples. However, existing research usually assumes the availability of sufficient training triples for all relations, which limits its usefulness on sparse long-tail relations. (Code and datasets can be found at https://github.com/xwhan/One-shot-Relational-Learning.)
Second, to capture up-to-date knowledge, real-world KGs are often dynamic and evolving at any given moment. New relations are added whenever new knowledge is acquired. If a model can predict new triples given only a small number of examples, a large amount of human effort can be spared. However, to predict target relations, previous methods usually rely on well-learned representations of these relations. In the dynamic scenario, the representations of new relations cannot be sufficiently trained given limited training instances, so current models' ability to adapt to new relations is also limited.
In contrast to previous methods, we propose a model that depends only on the entity embeddings and local graph structures. Our model aims at learning a matching metric that can be used to discover more similar triples given one reference triple. The learnable metric model is based on a permutation-invariant network that effectively encodes the one-hop neighbors of entities, and also a recurrent neural network that allows multi-step matching. Once trained, the model is able to make predictions about any relation, while existing methods usually require fine-tuning to adapt to new relations. With two newly constructed datasets, we show that our model achieves consistent improvements over various embedding models on the one-shot link prediction task.
In summary, our contributions are three-fold:
• We are the first to consider long-tail relations in the link prediction task and formulate the problem as few-shot relational learning;
• We propose an effective one-shot learning framework for relational data, which achieves better performance than various embedding-based methods;
• We also present two newly constructed datasets for the task of one-shot knowledge graph completion.

Related Work
Embedding Models for Relational Learning Various models have been developed to model relational KGs in continuous vector space and to automatically infer missing links. RESCAL (Nickel et al., 2011) is one of the earlier works that model relationships using tensor operations. Bordes et al. (2013) proposed to model relationships in 1-D vector space. Following this line of research, more advanced models such as DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016) and ConvE (Dettmers et al., 2017) have been proposed. These embedding-based models usually assume enough training instances for all relations and entities, and pay little attention to sparse symbols. More recently, several models (Shi and Weninger, 2017; Xie et al., 2016) have been proposed to handle unseen entities by leveraging text descriptions. In contrast to these approaches, our model deals with long-tail or newly added relations and focuses on one-shot relational learning without any external information, such as text descriptions of entities or relations.
Few-Shot Learning Recent deep-learning-based few-shot learning approaches fall into two main categories: (1) metric-based approaches (Koch, 2015; Vinyals et al., 2016; Snell et al., 2017; Yu et al., 2018), which try to learn generalizable metrics and the corresponding matching functions from a set of training tasks. Most methods in this class adopt the general matching framework proposed in deep siamese networks (Koch, 2015). One example is Matching Networks (Vinyals et al., 2016), which make predictions by comparing the input example with a small labeled support set; (2) meta-learner-based approaches (Ravi and Larochelle, 2017; Munkhdalai and Yu, 2017; Finn et al., 2017; Li et al., 2017), which aim to learn the optimization of model parameters (by either outputting the parameter updates or directly predicting the model parameters) given the gradients on few-shot examples. One example is the LSTM-based meta-learner (Ravi and Larochelle, 2017), which learns the step size for each dimension of the stochastic gradients. Besides the above categories, there are also other styles of few-shot learning algorithms, e.g. Bayesian Program Induction (Lake et al., 2015), which represents concepts as simple programs that best explain observed examples under a Bayesian criterion.
Previous few-shot learning research mainly focuses on vision and imitation learning (Duan et al., 2017) domains. In the language domain, Yu et al. (2018) proposed a multi-metric based approach for text classification. To the best of our knowledge, this work is the first research on few-shot learning for knowledge graphs.

Problem Formulation
A knowledge graph G is represented as a collection of triples {(h, r, t)} ⊆ E × R × E, where E and R are the entity set and the relation set. The task of knowledge graph completion is either to predict unseen relations r between two existing entities, (h, ?, t), or to predict the tail entity t given the head entity and the query relation, (h, r, ?). As our purpose is to infer unseen facts for newly added or existing long-tail relations, we focus on the latter case. In contrast to previous work that usually assumes enough triples of the query relation are available for training, this work studies the case where only one training triple is available. To be more specific, the goal is to rank the true tail entity t_true higher than the other candidate entities t ∈ C_{h,r}, given only one example triple (h_0, r, t_0). The candidate set is constructed using the entity type constraint (Toutanova et al., 2015). It is also worth noting that when we predict new facts of a relation r, we only consider a closed set of entities, i.e. no unseen entities appear during testing. For open-world settings where new entities might appear at test time, external information such as text descriptions of these entities is usually required, and we leave this to future work.

One-Shot Learning Settings
This section describes the settings for the training and evaluation of our one-shot learning model.
The goal of our work is to learn a metric that can be used to predict new facts given one-shot examples. Following the standard one-shot learning settings (Vinyals et al., 2016; Ravi and Larochelle, 2017), we assume access to a set of training tasks. In our problem, each training task corresponds to a KG relation r ∈ R and has its own training/testing triples: T_r = {D_r^train, D_r^test}. This task set is often denoted as the meta-training set, T_meta-train.
To imitate the one-shot prediction at evaluation time, there is only one triple (h_0, r, t_0) in each D_r^train. The set D_r^test = {(h_i, r, t_i, C_{h_i,r})} consists of the testing triples of r, with ground-truth tail entities t_i for each query (h_i, r) and the corresponding tail-entity candidates C_{h_i,r} = {t_ij}, where each t_ij is an entity in G. The metric model can thus be tested on this set by ranking the candidate set C_{h_i,r} given the test query (h_i, r) and the labeled triple in D_r^train. We denote an arbitrary ranking-loss function as ℓ_θ(h_i, r, t_i | C_{h_i,r}, D_r^train), where θ represents the parameters of our metric model. This loss function indicates how well the metric model works on the tuple (h_i, r, t_i, C_{h_i,r}) while observing only the one-shot data in D_r^train. The objective of training the metric model, i.e. the meta-training objective, thus becomes:

min_θ E_{T_r} [ Σ_{(h_i, r, t_i, C_{h_i,r}) ∈ D_r^test} ℓ_θ(h_i, r, t_i | C_{h_i,r}, D_r^train) / |D_r^test| ],

where T_r is sampled from the meta-training set T_meta-train and |D_r^test| denotes the number of tuples in D_r^test. Once trained, we can use the model to make predictions on new relations r' ∈ R', which is called the meta-testing step in the literature. These meta-testing relations are unseen during meta-training, i.e. R ∩ R' = ∅. Each meta-testing relation r' also has its own one-shot training data D_r'^train and testing data D_r'^test, defined in the same way as in meta-training. These meta-testing relations form a meta-testing set T_meta-test.
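The episode construction above can be made concrete with a short sketch that builds one meta-training task from triples grouped by relation. This is an illustrative reconstruction under our own naming (`build_episode` is not from the released code):

```python
import random

def build_episode(triples_by_relation, relation, batch_size, seed=0):
    """Build a one-shot training episode for one task relation.

    triples_by_relation: dict mapping relation -> list of (head, tail) pairs.
    Returns the single reference triple (the one-shot D_r^train) and a batch
    of query triples (a sample of D_r^test, with the reference excluded).
    """
    rng = random.Random(seed)
    pairs = list(triples_by_relation[relation])
    reference = rng.choice(pairs)              # the one-shot reference triple
    queries = [p for p in pairs if p != reference]
    rng.shuffle(queries)
    return reference, queries[:batch_size]     # query/test batch
```

In an actual training loop, one relation T_r would be sampled per episode and the query batch scored against the single reference pair.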
Moreover, we leave out a subset of relations in T_meta-train as the meta-validation set T_meta-validation. Because of the one-shot assumption, the meta-testing relations do not have validation sets as in the traditional machine-learning setting; otherwise the metric model would see more than one labeled example during meta-testing, violating the one-shot assumption.
Finally, we assume that the method has access to a background knowledge graph G', which is a subset of G with all triples of the relations in T_meta-train, T_meta-validation and T_meta-test removed.

Model
In this section, we describe the proposed model for similarity metric learning and also the corresponding loss function we use to train our model.
The core of our proposed model is a similarity function M((h, t), (h', t') | G'). Thus for any query relation r, as long as there is one known fact (h_0, r, t_0), the model can predict the likelihood of the testing triples {(h_i, r, t_ij) | t_ij ∈ C_{h_i,r}}, based on the matching score between each (h_i, t_ij) and (h_0, t_0). Implementing this matching function involves two sub-problems: (1) the representation of entity pairs, and (2) the comparison function between two entity-pair representations. Our overall model, as shown in Figure 2, handles these two problems with two major components:
• Neighbor encoder (Figure 2b): utilizes the local graph structure to better represent entities, so that the model can leverage more of the information the KG provides for every entity within an entity pair.
• Matching processor (Figure 2c): takes the vector representations of two entity pairs from the neighbor encoder, performs multi-step matching between them, and outputs a scalar similarity score.

Neighbor Encoder
This module is designed to enhance the representation of each entity with its local connections in the knowledge graph.
Although the entity embeddings from KG embedding models (Bordes et al., 2013; Yang et al., 2014) already have relational information encoded, previous work (Neelakantan et al., 2015; Lin et al., 2015a; Xiong et al., 2017) showed that explicitly modeling structural patterns, such as paths, is usually beneficial for relationship prediction. In view of this, we propose a neighbor encoder to incorporate graph structures into our metric-learning model. To benefit from the structural information while remaining efficient enough to scale to real-world large-scale KGs, our neighbor encoder only considers entities' local connections, i.e. the one-hop neighbors.
For any given entity e, its local connections form a set of (relation, entity) tuples. As shown in Figure 2a, for the entity Leonardo da Vinci, one such tuple is (occupation, painter). We refer to this neighbor set as N_e = {(r_k, e_k) | (e, r_k, e_k) ∈ G'}. The purpose of our neighbor encoder is to encode N_e and output a vector as the latent representation of e. Because this is a problem of encoding sets of varying sizes, we want the encoding function to be (1) invariant to permutations and (2) insensitive to the size of the neighbor set. Inspired by recent results on encoding sets, we use the following function f, which satisfies the above properties:

f(N_e) = σ( (1/|N_e|) Σ_{(r_k, e_k) ∈ N_e} C_{r_k, e_k} ),

where C_{r_k, e_k} is the feature representation of the relation–entity pair (r_k, e_k) and σ is the activation function. In this paper we set σ = tanh, which achieves the best performance on T_meta-validation.
To encode every tuple (r_k, e_k) ∈ N_e into C_{r_k, e_k}, we first use an embedding layer emb of dimension d (which can be pre-trained using existing embedding-based models) to get the vector representations of r_k and e_k:

v_{r_k} = emb(r_k),  v_{e_k} = emb(e_k).

Dropout (Srivastava et al., 2014) is applied to the vectors v_{r_k} and v_{e_k} to achieve better generalization. We then apply a feed-forward layer to encode the interaction within the tuple:

C_{r_k, e_k} = W_c (v_{r_k} ⊕ v_{e_k}) + b_c,

where W_c ∈ R^{d×2d} and b_c ∈ R^d are parameters to be learned and ⊕ denotes concatenation.

Figure 3: The distribution of entities' degrees (numbers of neighbors) on our two datasets. Since we work on a closed set of entities, we draw the figure by considering the intersection between the entities in our background knowledge graph G' and the entities appearing in T_meta-train, T_meta-validation or T_meta-test. Note that all triples in T_meta-train, T_meta-validation or T_meta-test are removed from G'. Upper: NELL; Lower: Wikidata.
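The encoder described above can be sketched in plain numpy. This is an illustrative reconstruction, not the released implementation; `encode_neighbors` and the embedding-lookup dictionaries are our own naming, and dropout is omitted for brevity:

```python
import numpy as np

def encode_neighbors(neighbors, rel_emb, ent_emb, W_c, b_c):
    """Permutation-invariant neighbor encoder f(N_e).

    neighbors: list of (relation, entity) ids forming one entity's
               one-hop neighbor set N_e.
    rel_emb/ent_emb: dicts mapping ids to d-dim vectors.
    W_c: (d, 2d) weight matrix; b_c: (d,) bias.
    Returns tanh of the mean of the per-tuple features C_{r_k,e_k}.
    """
    feats = []
    for r_k, e_k in neighbors:
        v = np.concatenate([rel_emb[r_k], ent_emb[e_k]])  # v_rk ⊕ v_ek
        feats.append(W_c @ v + b_c)                       # C_{rk,ek}
    return np.tanh(np.mean(feats, axis=0))                # 1/|N_e| scaling, then σ
```

Because the features are averaged, the output is unchanged under any permutation of the neighbor list, matching property (1) above.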

To enable batching during training, we manually specify the maximum number of neighbors and use all-zero vectors as "dummy" neighbors. Although different entities have different degrees (number of neighbors), the degree distribution is usually very concentrated, as shown in Figure 3. We can easily find a proper bound as the maximum number of neighbors to batch groups of entities.
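The zero-padding scheme for batching can be sketched as follows (an illustrative snippet with hypothetical names; the returned mask records the true |N_e| needed for the encoder's 1/|N_e| scaling):

```python
import numpy as np

def pad_neighbors(neighbor_sets, max_neighbors):
    """Pad variable-size neighbor feature sets with all-zero 'dummy'
    neighbors so a batch of entities can be encoded together.

    neighbor_sets: list of (n_i, d) feature arrays, one per entity.
    Returns a (batch, max_neighbors, d) tensor plus a 0/1 mask marking
    the real neighbors; sets larger than the bound are truncated.
    """
    d = neighbor_sets[0].shape[1]
    batch = np.zeros((len(neighbor_sets), max_neighbors, d))
    mask = np.zeros((len(neighbor_sets), max_neighbors))
    for i, feats in enumerate(neighbor_sets):
        n = min(len(feats), max_neighbors)  # truncation; random sampling upstream
        batch[i, :n] = feats[:n]
        mask[i, :n] = 1.0
    return batch, mask
```

Averaging with `mask.sum(-1)` instead of `max_neighbors` keeps the dummy neighbors from diluting the representation.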
The neighbor encoder module we propose here is similar to Relational Graph Convolutional Networks (Schlichtkrull et al., 2017) in the sense that we also use a shared kernel {W_c, b_c} to encode the neighbors of different entities. But unlike their model, which operates on the whole graph and performs multiple steps of information propagation, we only encode the local graphs of entities and perform one-step propagation. This enables us to easily apply our model to large-scale KGs such as Wikidata. Besides, their model does not operate on pre-trained graph embeddings. We leave the investigation of other graph encoding strategies, e.g. (Xu et al., 2018; Song et al., 2018), to future work.

Matching Processor
Given the neighbor encoder module, we now discuss how to perform effective similarity matching with our recurrent matching processor. By applying f(N_e) to the reference entity pair (h_0, t_0) and any query entity pair (h_i, t_ij), we get two concatenated neighbor vectors:

s = f(N_{h_0}) ⊕ f(N_{t_0}),  q = f(N_{h_i}) ⊕ f(N_{t_ij}).

To get a similarity score that can be used to rank (h_i, t_ij) among other candidates, we could simply calculate the cosine similarity between the two concatenated vectors. However, this simple metric model turns out to be too shallow and does not give good performance. To enlarge our model's capacity, we leverage an LSTM-based (Hochreiter and Schmidhuber, 1997) recurrent "processing" block (Vinyals et al., 2015, 2016) to perform multi-step matching. Every processing step is defined as follows:

h'_{k+1}, c_{k+1} = LSTM(q, [h_k ⊕ s, c_k])
h_{k+1} = h'_{k+1} + q
score_{k+1} = (h_{k+1} · s) / (||h_{k+1}|| ||s||)

where LSTM(x, [h, c]) is a standard LSTM cell with input x, hidden state h and cell state c, and s and q are the concatenated neighbor vectors of the reference pair and the query pair. After K processing steps, we use score_K as the final similarity score between the query and the reference entity pair. For every query (h_i, r, ?), by comparing each (h_i, t_ij) with (h_0, t_0), we obtain ranking scores for every t_ij ∈ C_{h_i,r}.

Algorithm 1 One-shot Training
1: Input: (a) meta-training task set T_meta-train; (b) pre-trained KG embeddings (excluding the relations in T_meta-train); (c) initial parameters θ of the metric model
2: for epoch = 0 to M−1 do
3:   Shuffle the tasks in T_meta-train
4:   for each T_r in T_meta-train do
5:     Sample one triple as the reference
6:     Sample a batch B+ of query triples
7:     Pollute the tail entities of the query triples to get B−
8:     Calculate the matching scores for the triples in B+ and B−
9:     Calculate the batch loss L
10:    Update θ using gradient g ∝ ∇L
11:   end for
12: end for
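The multi-step matching above can be sketched in plain numpy. This is an illustrative reconstruction of the processing block, not the released implementation; `lstm_cell` and `match_score` are hypothetical names, and the parameters are left unoptimized:

```python
import numpy as np

def lstm_cell(x, hcat, c, W, U, b):
    """Standard LSTM cell; the hidden input hcat = h_k ⊕ s has size 2D,
    so W is (4D, D), U is (4D, 2D), b is (4D,)."""
    z = W @ x + U @ hcat + b
    D = x.shape[0]
    i, f, g, o = z[:D], z[D:2 * D], z[2 * D:3 * D], z[3 * D:]
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c_new = sig(f) * c + sig(i) * np.tanh(g)
    return sig(o) * np.tanh(c_new), c_new

def match_score(s, q, W, U, b, K):
    """K-step matching: s and q are the concatenated neighbor vectors of
    the reference and query pairs; returns score_K (a cosine similarity)."""
    D = q.shape[0]
    h, c = np.zeros(D), np.zeros(D)
    score = 0.0
    for _ in range(K):
        h_hat, c = lstm_cell(q, np.concatenate([h, s]), c, W, U, b)
        h = h_hat + q                                   # h_{k+1} = h'_{k+1} + q
        score = float(h @ s /
                      (np.linalg.norm(h) * np.linalg.norm(s)))
    return score
```

The query vector q is re-fed as input at every step while s enters through the hidden state, so each step refines the comparison rather than recomputing it from scratch.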

Loss Function and Training
For a query relation r and its reference/training triple (h_0, r, t_0), we collect a group of positive (true) query triples {(h_i, r, t_i^+) | (h_i, r, t_i^+) ∈ G} and construct another group of negative (false) query triples {(h_i, r, t_i^-) | (h_i, r, t_i^-) ∉ G} by polluting the tail entities. Following previous embedding-based models, we use a hinge loss function to optimize our model:

ℓ_θ = Σ max(0, γ + score_θ^- − score_θ^+),

where score_θ^+ and score_θ^- are scalars calculated by comparing the query triples (h_i, r, t_i^+ / t_i^-) with the reference triple (h_0, r, t_0) using our metric model, and the margin γ is a hyperparameter to be tuned. For every training episode, we first sample one task/relation T_r from the meta-training set T_meta-train. From all the known triples of T_r, we then sample one triple as the reference/training triple D_r^train and a batch of other triples as the positive query/test triples D_r^test. The details of the training process are shown in Algorithm 1. Our experiments are discussed in the next section.

Existing benchmarks for knowledge graph completion, such as FB15k-237 (Toutanova et al., 2015) and YAGO3-10 (Mahdisoltani et al., 2013), are all small subsets of real-world KGs. These datasets consider the same set of relations during training and testing, and often include sufficient training triples for every relation. To construct datasets for one-shot learning, we go back to the original KGs and select the relations that do not have too many triples as one-shot task relations. We refer to the rest of the relations as background relations, since their triples provide important background knowledge for matching entity pairs.

Datasets
Our first dataset is based on NELL (Mitchell et al., 2018), a system that continuously collects structured knowledge by reading the web. We take the latest dump and remove the inverse relations. We select the relations with fewer than 500 but more than 50 triples (so that there are enough triples for evaluation) as one-shot tasks. To show that our model is able to operate on large-scale KGs, we follow a similar process to build another, larger dataset based on Wikidata (Vrandečić and Krötzsch, 2014). The dataset statistics are shown in Table 1. Note that the Wiki-One dataset is an order of magnitude larger than any other benchmark dataset in terms of the numbers of entities and triples. For NELL-One, we use 51/5/11 task relations for training/validation/testing. For Wiki-One, the division ratio is 133:16:34.

Implementation Details
In our experiments, we consider the following embedding-based methods: RESCAL (Nickel et al., 2011), TransE (Bordes et al., 2013), DistMult (Yang et al., 2014) and ComplEx (Trouillon et al., 2016). For TransE, we use the code released by Lin et al. (2015b). For the other models, we tried the code released by Trouillon et al. (2016), but it gives much worse results than TransE on our datasets; we thus use our own implementations based on PyTorch (Paszke et al., 2017) for comparison. When training the baseline embedding models, we use not only the triples of the background relations but also all the triples of the training relations and the one-shot training triple of each validation/test relation. In contrast, since the proposed metric model does not require the embeddings of query relations, we only include the triples of the background relations for embedding training. As TransE and DistMult use 1-D vectors to represent entities and relations, they can be directly used in our matching model. For RESCAL, which uses matrices to represent relations, we employ mean-pooling over these matrices to get 1-D embeddings. For the ComplEx model, we use the concatenation of the real part and the imaginary part. The hyperparameters of our model are tuned on the validation task set and can be found in the appendix.
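The conversions above can be sketched as follows (an illustrative snippet, not the authors' code; `flatten_relation_embedding` is a hypothetical helper, and the pooling axis for RESCAL is an assumption since the text does not specify it):

```python
import numpy as np

def flatten_relation_embedding(emb, model):
    """Convert relation embeddings from different KG-embedding formats
    into the 1-D vectors the matching model consumes.

    RESCAL represents a relation as a (d, d) matrix -> mean-pool it
    (axis choice assumed here); ComplEx uses a complex d-vector ->
    concatenate real and imaginary parts; TransE/DistMult are 1-D already.
    """
    if model == "rescal":        # (d, d) matrix
        return emb.mean(axis=0)
    if model == "complex":       # complex-valued vector
        return np.concatenate([emb.real, emb.imag])
    return emb                   # TransE / DistMult: already 1-D
```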
Apart from the above embedding models, a more recent method (Dettmers et al., 2017) applies convolution to model relationships and achieves the best performance on several benchmarks. For every query (h, r, ?), their model enumerates the whole entity set to get positive and negative triples for training. We find that this training paradigm requires substantial computational resources when dealing with large entity sets and cannot scale to real-world KGs such as Wikidata, which have millions of entities. Due to this scalability concern, our experiments only consider models that use negative sampling for training.

Results
The main results of our methods are shown in Table 2. We denote our method as "GMatching" since our model is trained to match local graph patterns. We use mean reciprocal rank (MRR) and Hits@K to evaluate different models. We can see that our method produces consistent improvements over various embedding models on these one-shot relations. The improvements are even more substantial on the larger Wiki-One dataset.
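For reference, MRR and Hits@K are computed from the rank of each true tail entity among its candidates; the sketch below is an illustrative implementation (the function name is ours, not from the paper's code):

```python
def mrr_and_hits(ranks, ks=(1, 5, 10)):
    """Compute mean reciprocal rank and Hits@K.

    ranks: 1-based ranks of the true tail entities within their
    candidate sets C_{h_i,r}, one per test query.
    """
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits
```

For example, ranks [1, 2, 10] give MRR = (1 + 1/2 + 1/10) / 3 ≈ 0.533 and Hits@10 = 1.0.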
To investigate the learning power of our model, we also try training our metric model with randomly initialized embeddings. Surprisingly, although the results are worse than those of the metric models with pre-trained embeddings, they are still superior to the baseline embedding models. This suggests that, by incorporating the neighbor entities into our model, the embeddings of many relations and entities are actually updated in an effective way and provide useful information for our model to make predictions on test data. It is worth noting that once trained, our model can be used to predict any newly added relation without fine-tuning, while existing models usually need to be re-trained to handle newly added symbols. On a large real-world KG, this re-training process can be slow and computationally expensive.
Remark on Model Selection Given the existence of various KG embedding models, one interesting experiment is to incorporate model selection into hyper-parameter tuning and choose the best validation model for testing.
If we view KG embedding and metric learning as two competing approaches, the results of this model selection process can then be used as the "final" measurement for comparison. For example, among the baseline KG embedding models, RESCAL achieves the best validation MRR on Wiki-One (11.9%), so we report the corresponding testing MRR (7.2%) as the final model-selection result for the KG embedding approach. In this way, in the top half of Table 2, we select the best KG embedding method according to the validation performance. The results are highlighted with underlines. Similarly, we select the best metric learning approach in the bottom half.
Our metric-based method outperforms KG embedding by a large margin from this perspective as well. Taking MRR as an example, the selected metric model achieves 17.1% on NELL-One and 20.0% on Wiki-One, while the results of KG embedding are 9.3% and 7.2%; the improvements are 7.8% and 12.8%, respectively.

Analysis on Neighbor-Encoder
As our model leverages entities' local graph structures by encoding their neighbors, here we investigate the effect of the neighbor set by restricting the maximum number of neighbors. If the size of the true neighbor set exceeds the maximum limit, the neighbors are selected by random sampling. Figure 4 shows the learning curves of the different settings, based on Hits@10 calculated on the validation set. We see that encoding more neighbors for every entity generally leads to better performance. We also observe that the model that encodes at most 40 neighbors actually yields worse performance than the model that encodes only 30. We think the potential reason is that for some entity pairs, certain local connections are irrelevant and provide noisy information to the model.

Ablation Studies
We conduct ablation studies using the model that achieves the best Hits@10 on the NELL-One dataset. The results are shown in Table 4. We use Hits@10 on the validation and test sets for comparison, as the hyperparameters are selected using this evaluation metric. We can see that both the matching processor and the neighbor encoder play important roles in our model. Another important observation is that the scaling factor 1/|N_e| turns out to be essential for the neighbor encoder. Without scaling, the neighbor encoder actually gives worse results than simple embedding-based matching.

Performance on Different Relations
When testing the various models, we observe that the results on different relations are of high variance. Table 3 shows the decomposed results on NELL-One generated by our best metric model (GMatching-ComplEx) and its corresponding embedding method. For reference, we also report the embedding model's performance under standard training settings, where 75% of the triples (instead of only one) are used for training and the rest are used for testing. We can see that relations with smaller candidate sets are generally easier, and on them our model can even outperform the embedding model trained under standard settings. For some relations, such as athleteInjuredHisBodypart, the involved entities have very few connections in the KG; as expected, one-shot learning on these kinds of relations is quite challenging. Relations with many (>3000) candidates are challenging for all models. Even for the embedding model with more training triples, the performance on some relations is still very limited. This suggests that the knowledge graph completion task is still far from being solved.

Conclusion
This paper introduces a one-shot relational learning framework that can be used to predict new facts of long-tail relations in KGs. Our model leverages the local graph structure of entities and learns a differentiable metric to match entity pairs. In contrast to existing methods that usually need fine-tuning to adapt to new relations, our trained model can be directly used to predict any unseen relation, and it also achieves much better performance in the one-shot setting. Future work might consider incorporating external text data and enhancing our model to make better use of multiple training examples in the few-shot case.

A Hyperparameters
For the NELL dataset, we set the embedding size to 100. For Wikidata, we set the embedding size to 50 for faster training with millions of triples. The embeddings are trained for 1,000 epochs. The other hyperparameters are tuned using the Hits@10 metric (the percentage of correct answers ranked within the top 10) on the validation tasks. For the number of matching steps, the optimal setting is 2 for NELL-One and 4 for Wiki-One. For the number of neighbors, we find that a maximum limit of 50 works best for both datasets. For parameter updates, we use Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001, and we halve the learning rate after 200k update steps. The margin used in our loss function is 5.0. The dimension of the LSTM's hidden state is 200.

Table 5: 5-shot experiments on NELL-One.