Adaptive Attentional Network for Few-Shot Knowledge Graph Completion

Few-shot Knowledge Graph (KG) completion is a focus of current research, where each task aims at querying unseen facts of a relation given its few-shot reference entity pairs. Recent attempts solve this problem by learning static representations of entities and references, ignoring their dynamic properties, i.e., entities may exhibit diverse roles within task relations, and references may make different contributions to queries. This work proposes an adaptive attentional network for few-shot KG completion that learns adaptive entity and reference representations. Specifically, entities are modeled by an adaptive neighbor encoder to discern their task-oriented roles, while references are modeled by an adaptive query-aware aggregator to differentiate their contributions. Through the attention mechanism, both entities and references capture their fine-grained semantic meanings and thus render more expressive representations, which are more predictive for knowledge acquisition in the few-shot scenario. Evaluation on link prediction over two public datasets shows that our approach achieves new state-of-the-art results with different few-shot sizes.


Introduction
Knowledge Graphs (KGs) like Freebase (Bollacker et al., 2008), NELL (Carlson et al., 2010) and Wikidata (Vrandecic and Krötzsch, 2014) are extremely useful resources for NLP tasks, such as information retrieval, machine reading (Yang and Mitchell, 2017), and relation extraction (Ren et al., 2017). A typical KG is a multi-relational graph, represented as triples of the form (h, r, t), indicating that two entities are connected by relation r. Although a KG contains a great number of triples, it is also known to suffer from the incompleteness problem. KG completion, which aims at automatically inferring missing facts by examining existing ones, has thus attracted broad attention. A promising approach, namely KG embedding, has been proposed and successfully applied to this task. The key idea is to embed KG components, including entities and relations, into a continuous vector space and make predictions with their embeddings.
Current KG embedding methods mostly require sufficient training triples for all relations to learn expressive representations (i.e., embeddings). In real KGs, a large portion of KG relations is actually long-tail, having only a limited (few-shot) number of relational triples. This may lead to low performance of embedding models on KG completion for those long-tail relations. Few-shot relational learning (Xiong et al., 2018) was proposed to address the few-shot issue of KG completion, where one task is to predict the tail entity t in a query (h, r, ?) given only a few entity pairs of the task relation r. These known few-shot entity pairs associated with r are called references. To improve semantic representations of the references, Xiong et al. (2018) and Zhang et al. (2020) devise modules to enhance entity embeddings with their local graph neighbors. The former simply assumes that all neighbors contribute equally to the entity embedding, so the neighbors are always weighted identically. The latter develops the idea by employing an attention mechanism to assign different weights to neighbors, but the weights do not change across task relations. Therefore, both works assign static weights to neighbors, leading to static entity representations when involved in different task relations. We argue that entity neighbors could have varied impacts under different task relations. Figure 1(a) gives an example of the head entity BillGates associated with two task relations. The left neighbors show his business role, while the right ones show his family role, revealing quite different meanings. Intuitively, the task relation CeoOf is supposed to pay more attention to the business role of entity BillGates than to his family role.
In addition, task relations can be polysemous, also showing different meanings when involved in different entity pairs. Therefore, the reference triples could also make different contributions to a particular query. Take a task relation SubPartOf as an example. As shown in Figure 1(b), SubPartOf associates with different meanings, e.g., organization-related as (Cavaliers, SubPartOf, NBA) and location-related as (Petersburg, SubPartOf, Virginia). Obviously, for query (ChicagoBulls, SubPartOf, ?), referring to the organization-related references would be more beneficial.
To address the above issues, we propose an Adaptive Attentional Network for Few-Shot KG completion (FAAN), a novel paradigm that takes dynamic properties into account for both entities and references. Specifically, given a task relation with its reference/query triples, FAAN employs an adaptive attentional neighbor encoder to model entity representations with one-hop entity neighbors. Unlike the previous neighbor encoder with a fixed attention map (Zhang et al., 2020), we make attention scores dynamically adaptive to the task relation under the translation assumption. This captures the diverse roles of entities through the varied impacts of neighbors. Given the enhanced entity representations, FAAN further adopts a stack of Transformer blocks over reference/query triples to capture the multiple meanings of the task relation. Then, FAAN obtains a general reference representation by adaptively aggregating the references, further differentiating their contributions to different queries. As such, both entities and references capture their fine-grained meanings, and render richer representations that are more predictive for knowledge acquisition in the few-shot scenario.
The contributions of this paper are three-fold: (1) We propose the notion of dynamic properties in few-shot KG completion, which differs from previous paradigms by studying the dynamic nature of entities and references in the few-shot scenario.
(2) We devise a novel adaptive attentional network FAAN to learn dynamic representations. An adaptive neighbor encoder is used to adapt entity representations to different tasks. A Transformer encoder and an attention-based aggregator are used to adapt reference representations to different queries.
(3) We evaluate FAAN in few-shot link prediction on benchmark KGs of NELL and Wikidata. Experimental results reveal that FAAN could achieve new state-of-the-art results with different few-shot sizes.

Related Work
Recent years have seen increasing interest in learning representations for entities and relations in KGs, a.k.a. KG embedding. Various methods have been devised, and they roughly fall into three groups: 1) translation-based models, which interpret relations as translating operations between head-tail entity pairs (Bordes et al., 2013); 2) simple semantic matching models, which compute composite representations over entities and relations using linear mapping operations (Yang et al., 2015; Trouillon et al., 2016; Liu et al., 2017; Sun et al., 2019); and 3) (deep) neural network models, which obtain composite representations using more complex operations (Schlichtkrull et al., 2018; Dettmers et al., 2018). Please refer to (Nickel et al., 2016; Wang et al., 2017; Ji et al., 2020) for a thorough review of KG embedding techniques. Traditional embedding models always require sufficient training triples for all relations, and are thus limited when solving the few-shot problem.
Previous few-shot learning studies mainly focus on computer vision (Sung et al., 2018), imitation learning (Duan et al., 2017) and sentiment analysis. Recent attempts (Xiong et al., 2018; Chen et al., 2019; Zhang et al., 2020) perform few-shot relational learning for long-tail relations. Xiong et al. (2018) proposed a matching network, GMatching, which is, as far as we know, the first research on one-shot learning for KGs. GMatching exploits a neighbor encoder to enhance entity embeddings from their one-hop neighbors, and an LSTM-based matching processor to perform multi-step matching. FSRL (Zhang et al., 2020) extends GMatching to few-shot cases, further capturing local graph structures with an attention mechanism. Chen et al. (2019) proposed a novel meta relational learning framework, MetaR, which extracts and transfers shared knowledge across tasks from a few existing facts to incomplete ones. However, previous studies learn static representations of entities or references, ignoring their dynamic properties. This work attempts to learn dynamic entity and reference representations with an adaptive attentional network.
Dynamic properties have also been explored in other contexts outside few-shot relational learning. Ji et al. (2015); Wang et al. (2019) performed KG completion by learning dynamic entity and relation representations, but their methods are specially devised for traditional KG completion. Lu et al. (2017) adopted an adaptive attentional model for image captioning. Luo et al. (2019) tried to model dynamic user preference using a recurrent network with adaptive attention for the sequential recommendation. All these studies demonstrate the capability of modeling dynamic properties to enhance learning algorithms.

Background
Consider a KG G containing a set of triples T = {(h, r, t) ∈ E × R × E}, where E and R denote the entity set and relation set, respectively. This work focuses on a challenging link prediction scenario, i.e., few-shot KG completion. We follow the standard definition of this task (Zhang et al., 2020):

Definition 1 (Few-shot KG Completion) Given a relation r ∈ R and its reference set S_r = {(h_k, t_k) | (h_k, r, t_k) ∈ T}, one task is to complete a triple (h, r, t) with tail entity t ∈ E missing, i.e., to predict t from a candidate entity set C given (h, r, ?). When |S_r| = K and K is very small, the task is called K-shot KG completion.
For this task, the goal of a few-shot learning method is to rank the true tail entity higher than false candidate entities, given the few-shot reference entity pairs S_r. To imitate such a link prediction task, each training task corresponds to a relation r ∈ R with its own reference/query entity pairs, i.e., D_r = {S_r, Q_r}, where S_r consists of only K-shot reference entity pairs (h_k, t_k). Additionally, Q_r = {(h_m, t_m / C_{h_m,r})} contains all queries with ground-truth tail entity t_m and the corresponding candidates C_{h_m,r}, where each candidate is an entity in E selected based on the entity type constraint. The few-shot learning method can thus be trained on the task set by ranking the candidates in C_{h_m,r} given the query (h_m, r, ?) and its references S_r. All tasks in training form the meta-training set, denoted as T_mtr = {D_r}. Here, we only consider a closed set of entities appearing in E.
After sufficient training on the meta-training set, the learned model can be used to predict facts of a new relation r′ ∈ R′ in testing. The relations used for testing are unseen during meta-training, i.e., R ∩ R′ = ∅. Each testing relation r′ also has its own few-shot references and queries, i.e., D_r′ = {S_r′, Q_r′}, defined in the same way as in meta-training. All tasks in testing form the meta-testing set, denoted as T_mte = {D_r′}. In addition, we also suppose that the model has access to a background KG G′, which is a subset of G with all the relations in T_mtr and T_mte excluded.
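To make the task construction concrete, the following sketch assembles K-shot tasks from a set of triples (plain Python; the function and variable names are illustrative and not taken from any released code):

```python
import random
from collections import defaultdict

def build_tasks(triples, K, seed=0):
    """Group triples by relation and split each relation's entity
    pairs into a K-shot reference (support) set S_r and a query set Q_r."""
    rng = random.Random(seed)
    by_rel = defaultdict(list)
    for h, r, t in triples:
        by_rel[r].append((h, t))
    tasks = {}
    for r, pairs in by_rel.items():
        if len(pairs) <= K:      # too few pairs to leave any queries
            continue
        rng.shuffle(pairs)
        tasks[r] = {"support": pairs[:K], "query": pairs[K:]}
    return tasks
```

In meta-training, negative queries would additionally be produced by corrupting the tail entity of each query pair, as described in the Model Training section.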

Our Approach
This section introduces our approach FAAN. Given a meta-training set T mtr , the purpose of FAAN is to learn a metric function for predictions by comparing the input query to the given references. To achieve this goal, FAAN consists of three major parts: (1) Adaptive neighbor encoder to learn adaptive entity representations; (2) Transformer encoder to learn relational representations for entity pairs; (3) Adaptive matching processor to compare the query to the given references. Finally, we present the detailed training objective of our model.

Adaptive Neighbor Encoder for Entities
Previous works on embeddings (Schlichtkrull et al., 2018; Shang et al., 2019) have demonstrated that explicitly modeling graph contexts benefits KG completion. Recent few-shot relational learning methods encode one-hop neighbors to enhance entity embeddings with equal or fixed attentions (Xiong et al., 2018; Zhang et al., 2020), ignoring the dynamic properties of entities. To tackle this issue, we devise an adaptive neighbor encoder that discerns entity roles associated with task relations. Specifically, we are given a triple of a few-shot task for relation r, e.g., (h, r, t). Taking the head entity h as a target, we denote its one-hop neighbors as N_h = {(r_nbr, e_nbr) | (h, r_nbr, e_nbr) ∈ G′}. Here, G′ is the background KG; r_nbr and e_nbr represent a neighboring relation and entity of h, respectively. The aim of the proposed neighbor encoder is to obtain varied entity representations with N_h that exhibit the entities' different roles when involved in different task relations. Figure 2(a) gives the details of the adaptive neighbor encoder, where CeoOf is the few-shot task relation and the other relations such as MarryTo, ProxyFor and WorksWith are the neighboring relations of the head entity BillGates.
As claimed in the introduction, the role of entity h can vary with respect to the few-shot task relation r. However, it is hard to obtain effective representations of few-shot task relations with existing embedding models, which always require sufficient training data for the relations. Inspired by TransE (Bordes et al., 2013), we model the task relation embedding r as a translation between the entity embeddings h and t, i.e., we want h + r ≈ t when the triple holds. The intuition here originates from linguistic regularities such as Italy − Rome = France − Paris, and such an analogy holds because of the relation CapitalOf. Under the translation assumption, we can obtain the embedding of the few-shot task relation r given its entity pair (h, t):

r = t − h, (1)

where r, t, h ∈ R^d; t and h are embeddings pre-trained on G′ with a current embedding model such as TransE; d denotes the pre-trained embedding dimension. Actually, the translation mechanism is not the only way to model the task relations.
We leave the investigation of other KG embedding methods (Trouillon et al., 2016; Sun et al., 2019) to future work. Intuitively, relations can reflect the roles of an entity. As shown in Figure 1(a), the task relation CeoOf may be more related to WorksWith than to MarryTo, since the first two exhibit a business role. That is to say, we can discern the roles of h according to the relevance between the task relation r and a neighboring relation r_nbr. Hence, we first define a metric function ψ to calculate their relevance score by a bilinear dot product:

ψ(r, r_nbr) = r⊤ W r_nbr + b, (2)

where r and r_nbr can be obtained by Eq. (1); both W ∈ R^{d×d} and b ∈ R are learnable parameters. Then, we obtain a role-aware neighbor embedding c_nbr for h by attending over its neighbors:

α_nbr = exp(ψ(r, r_nbr)) / Σ_{(r′_nbr, e′_nbr) ∈ N_h} exp(ψ(r, r′_nbr)), (3)

c_nbr = Σ_{(r_nbr, e_nbr) ∈ N_h} α_nbr e_nbr. (4)

That means, when a neighboring relation is more related to the task relation, ψ(·, ·) will be higher and the corresponding neighboring entity will play a more important role in the neighbor embedding. In order to enhance entity embeddings, we simultaneously couple the pre-trained entity embedding h and its role-aware neighbor embedding c_nbr. The enhanced representation h′ can then be formulated as:

h′ = σ(W_1 h + W_2 c_nbr), (5)

where σ(·) denotes an activation function (we use ReLU); W_1, W_2 ∈ R^{d×d} are learnable parameters. Entity representations obtained in this way shall 1) preserve individual properties made by the current embedding model, and 2) possess diverse roles adaptive to different tasks. The same procedure also holds for the candidate tail entity t.
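The neighbor-encoding steps above can be sketched in NumPy as follows: the translation assumption for the task relation, the bilinear relevance score ψ, the attention weights α_nbr over neighbors, and the enhanced entity representation. The shapes and parameter names are ours, and the matrices here are random placeholders standing in for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_neighbor_encoder(h, t, nbr_rels, nbr_ents, W, b, W1, W2):
    """h, t: (d,) pre-trained embeddings of the target entity pair.
    nbr_rels, nbr_ents: (n, d) embeddings of the n one-hop neighbors.
    Returns the task-adaptive entity representation h'."""
    r = t - h                                  # translation assumption: r = t - h
    scores = nbr_rels @ W.T @ r + b            # psi(r, r_nbr) = r^T W r_nbr + b, per neighbor
    alpha = softmax(scores)                    # attention weights alpha_nbr
    c_nbr = alpha @ nbr_ents                   # role-aware neighbor embedding c_nbr
    return np.maximum(0.0, W1 @ h + W2 @ c_nbr)  # ReLU(W1 h + W2 c_nbr)
```

With a different task relation r (i.e., a different tail t), the same entity h receives different attention weights and thus a different representation, which is the adaptivity the text describes.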

Transformer Encoder for Entity Pairs
Based on the enhanced entity embeddings, we now derive embeddings of entity pairs. Figure 2(b) gives the details of the Transformer encoder for entity pairs. FAAN borrows ideas from recent techniques for learning dynamic KG embeddings (Wang et al., 2019). Given an entity pair in a task of r, i.e., (h, t) ∈ D_r, we treat each entity pair with its task relation as a sequence X = (x_1, x_2, x_3), where the first/last element is the head/tail entity, and the middle is the task relation.
For each element x_i in X, we construct its input representation as:

x_i = x_i^ele + x_i^pos, (6)

where x_i^ele denotes the element embedding, and x_i^pos the position embedding. Both x_1^ele and x_3^ele are obtained from the adaptive neighbor encoder. We learn a position embedding for each of the 3 positions. After constructing all input representations, we feed them into a stack of L Transformer blocks (Vaswani et al., 2017) to encode X and obtain:

z_i^l = Transformer(z_i^{l−1}), l = 1, 2, ..., L, (7)

where z_i^l is the hidden state of x_i after the l-th layer, and z_i^0 = x_i. The Transformer adopts a multi-head self-attention mechanism, with each block allowing each element to attend to all elements in the sequence with different weights.
To perform the few-shot KG completion task, we restrict the mask solely to the task relation r (i.e., x_2), so as to obtain meaningful entity pair embeddings. The final hidden state z_2^L is taken as the desired representation for the entity pair in D_r. Such a representation encodes the semantic roles of each entity, and thus helps discern the fine-grained meanings of task relations associated with different entity pairs. For more details about the Transformer, please refer to Vaswani et al. (2017).
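A highly simplified stand-in for this encoder is sketched below: a single-layer, single-head self-attention block over the three-element sequence, returning the hidden state at the relation position. The real model stacks L multi-head Transformer blocks with masking, so this only illustrates the data flow; all parameters here are random placeholders:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encode_entity_pair(h_enc, r_emb, t_enc, pos, Wq, Wk, Wv):
    """Encode the sequence X = (head, relation, tail) and return the
    hidden state at the relation position (a stand-in for z_2^L).
    h_enc, t_enc: (d,) outputs of the neighbor encoder; r_emb: (d,)
    task relation embedding; pos: (3, d) position embeddings;
    Wq, Wk, Wv: (d, d) projection matrices."""
    X = np.stack([h_enc, r_emb, t_enc]) + pos        # element + position embedding
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax_rows(Q @ K.T / np.sqrt(X.shape[1]))  # scaled dot-product attention
    Z = A @ V                                        # each token attends to all tokens
    return Z[1]                                      # hidden state at the relation slot
```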

Adaptive Matching Processor
To make predictions by comparing the query to references, we devise an adaptive matching processor considering different semantic meanings of the task relation. Figure 2(c) gives the details of adaptive matching processor.
In order to compare one query to the K-shot references, we first obtain a general reference representation for the given reference set S_r. Considering the various meanings of the task relation, we define a metric function δ(q_r, s_rk) that measures the semantic similarity of the query q_r and the reference triple s_rk. For simplicity, we implement δ(q_r, s_rk) with a simple but effective dot product:

δ(q_r, s_rk) = q_r · s_rk. (8)

Unlike current few-shot relational learning models that learn static representations when predicting different queries, we adopt an attention mechanism to obtain a general reference representation g(S_r) adaptive to the query. This can be formulated as:

g(S_r) = Σ_{s_rk ∈ S_r} β_k s_rk, (9)

β_k = exp(δ(q_r, s_rk)) / Σ_{s_rj ∈ S_r} exp(δ(q_r, s_rj)). (10)

Here, β_k denotes the attention score of a reference; s_rk = (h_k, t_k) ∈ S_r denotes the k-th reference in the task of r, and s_rk is its embedding; q_r is the embedding of a query q_r ∈ Q_r. Both s_rk and q_r are obtained by Eq. (7), so as to capture their fine-grained meanings. Eq. (9) ensures that references having similar meanings to the query are more referential, giving the reference set S_r a representation adaptive to different queries.
To make predictions, we define a metric function φ(q_r, S_r) to measure the semantic similarity of the query q_r and the reference representation g(S_r):

φ(q_r, S_r) = q_r · g(S_r). (11)

φ(·,·) is expected to be large if the query holds, and small otherwise. Here, φ(·,·) can also be implemented with alternative metrics such as cosine similarity or Euclidean distance.
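The adaptive matching step can be sketched compactly in NumPy: dot-product similarity δ, softmax attention β_k over the references, the query-adaptive aggregate g(S_r), and the final score φ. The embeddings here are stand-ins for the Transformer outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def match_query(q, refs):
    """q: (d,) query-pair embedding; refs: (K, d) reference-pair
    embeddings. Returns the matching score phi of Eq. (11)."""
    beta = softmax(refs @ q)   # attention scores beta_k from dot-product similarity
    g = beta @ refs            # query-adaptive reference representation g(S_r)
    return float(q @ g)        # phi(q_r, S_r) = q_r . g(S_r)
```

References pointing in the same direction as the query dominate g, which is exactly the query-adaptive behavior described above.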

Model Training
With the adaptive neighbor encoder, the Transformer encoder and the adaptive matching processor in place, the overall FAAN model is trained on the meta-training set T_mtr, which is obtained as follows. For each few-shot relation r, we randomly sample K-shot positive entity pairs from T as the reference set S_r. The remaining entity pairs are utilized as the positive query set Q_r = {(h_m, t_m)}. Then we construct a set of negative queries Q−_r = {(h_m, t−_m)} by randomly corrupting the tail entity of (h_m, t_m), where t−_m ∈ E \ {t_m}. The overall loss is then formulated as:

L = Σ_r Σ_{(q, q−)} [γ + φ(q−, S_r) − φ(q, S_r)]_+, (12)

where q ∈ Q_r and q− ∈ Q−_r are a positive query and its corresponding negative query; [x]_+ = max(0, x) is the standard hinge loss, and γ is a margin separating positive and negative queries. To minimize L, we take each relation in T_mtr as a task, and adopt the batch sampling based meta-training procedure proposed in (Zhang et al., 2020). To optimize the model parameters in Θ and the Transformer, we use the Adam optimizer (Kingma and Ba, 2015), and further impose L2 regularization on the parameters to avoid over-fitting.
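Assuming matching scores for a positive query and its corrupted counterpart, the per-pair margin term can be written as follows (a sketch; the full objective sums this over all tasks and query pairs):

```python
def hinge_loss(score_pos, score_neg, gamma=5.0):
    """Margin ranking term for one (positive, negative) query pair:
    [gamma + phi(q-, S_r) - phi(q, S_r)]_+ ."""
    return max(0.0, gamma + score_neg - score_pos)

def batch_loss(pos_scores, neg_scores, gamma=5.0):
    """Sum of hinge terms over paired positive/negative queries."""
    return sum(hinge_loss(p, n, gamma) for p, n in zip(pos_scores, neg_scores))
```

The loss is zero once a positive query outscores its negative by at least the margin γ, so training pushes true queries above corrupted ones by a fixed gap.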

Experiments
In this section, we conduct link prediction experiments to evaluate the performance of FAAN.

Datasets
We conduct experiments on two public benchmark datasets: NELL and Wiki 1. In both datasets, relations that have fewer than 500 but more than 50 triples are selected to construct few-shot tasks. There are 67 and 183 tasks in NELL and Wiki, respectively. We use the original splits of 51/5/11 and 133/16/34 relations in NELL and Wiki, respectively, for training/validation/testing as defined in Section 3. Moreover, for each task relation, both datasets also provide candidate entities, which are constructed based on the entity type constraint. More details are shown in Table 1.

Comparison Methods
In order to evaluate the effectiveness of our method, we compare it against the following two groups of baselines:

1 https://github.com/xwhan/One-shot-Relational-Learning

KG embedding methods. This kind of method learns entity/relation embeddings by modeling relational structures in the KG. We adopt five widely used methods as baselines: TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), SimplE (Kazemi and Poole, 2018) and RotatE (Sun et al., 2019). All KG embedding methods require sufficient training triples for each relation, and learn static representations of the KG.
Few-shot relational learning methods. This kind of method achieves state-of-the-art performance on few-shot KG completion over the NELL and Wiki datasets. GMatching (Xiong et al., 2018) adopts a neighbor encoder and a matching network, but assumes that all neighbors contribute equally. FSRL (Zhang et al., 2020) encodes neighbors with a fixed attention mechanism, and applies a recurrent autoencoder to aggregate references. MetaR (Chen et al., 2019) makes predictions by transferring shared knowledge from the references to the queries based on a novel optimization strategy. All the above methods learn static representations of entities or references, ignoring their dynamic properties.

Implementation Details
We perform the 5-shot KG completion task for all the methods. Our implementation of the KG embedding baselines is based on OpenKE (Han et al., 2018) with the best hyperparameters reported in the original literature. During training, all triples in the background KG G′ and the training set, as well as the few-shot reference triples of the validation and testing sets, are used to train the models. For the few-shot relational learning baselines, we extend GMatching from the original one-shot scenario to the few-shot scenario in three settings: obtaining the general reference representation by mean/max pooling (denoted as MeanP/MaxP) over references, or taking the reference that leads to the maximal similarity score to the query (denoted as Max). Because FSRL (Zhang et al., 2020) was reported under completely different experimental settings, we re-implement the model to make a fair comparison. We directly report the original results of MetaR with pre-trained embeddings to avoid re-implementation bias.
For all implemented few-shot learning methods, we initialize entity embeddings with TransE. The entity neighbors are randomly sampled and fixed before model training, and the maximum number of neighbors M is fixed to 50 on both datasets. The embedding dimensionality is set to 50 and 100 for NELL and Wiki, respectively. For FAAN, we further set the number of Transformer layers to 3 and 4, and the number of Transformer heads to 4 and 8, respectively. To avoid over-fitting, we also apply dropout to the neighbor encoder and the Transformer layers with the rate tuned in {0.1, 0.3}. The L2 regularization coefficient is tuned in {0, 1e-4}. The margin γ is fixed to 5.0. The optimal initial learning rate η for the Adam optimizer is 5e-5 and 6e-5 for NELL and Wiki, respectively, which is warmed up over the first 10k training steps and then linearly decayed. We evaluate all methods every 10k training steps, and select the best models leading to the highest MRR (described later) on the validation set within 300k steps. The optimal hyperparameters are tuned by grid search on the validation set.
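Our reading of the learning-rate schedule described above (linear warmup over the first 10k steps, then linear decay) can be sketched as follows; the function name and the assumption that the decay reaches zero at the final step are ours:

```python
def lr_at(step, base_lr, warmup_steps=10_000, total_steps=300_000):
    """Linear warmup from 0 to base_lr over warmup_steps, then linear
    decay back to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```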

Evaluation Metrics
To evaluate the performance of all methods, we measure the quality of the ranking of each test triple among all tail substitutions in the candidates: (h_m, r, t_k), t_k ∈ C_{h_m,r}. We report two standard evaluation metrics on both datasets: MRR and Hits@N. MRR is the mean reciprocal rank and Hits@N is the proportion of correct entities ranked in the top N, with N = 1, 5, 10.
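These metrics are straightforward to compute from the rank of each gold tail entity among its candidates (a minimal sketch):

```python
def mrr(ranks):
    """Mean reciprocal rank over the gold tail entities' ranks (1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, n):
    """Fraction of gold tail entities ranked within the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)
```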

Main Results in Link Prediction
The performance of all models on NELL and Wiki is shown in Table 2. The table reveals that: (1) Compared to the traditional KG embedding methods, our model achieves better performance on both datasets. The experimental results indicate that our few-shot learning method is more suitable for solving few-shot issues.
(2) Compared to the few-shot learning baselines, our model also consistently outperforms them on both datasets in all metrics. Compared to the best performing baseline, MetaR, FAAN achieves an improvement of 33.5%/20.6% in MRR/Hits@10 on the NELL test data, and of 5.6%/10.8% on the Wiki test data, respectively. This demonstrates that exploiting the dynamic properties of KGs can indeed improve the performance of few-shot KG completion.

Impact of Few-Shot Size
We conduct experiments to analyze the impact of few-shot size K. Figure 3 reports the performance of models on NELL data in different settings of K.
The figure shows that: (1) Our model outperforms all baselines by a large margin under different K, showing the effectiveness of our model in the few-shot scenario.
(2) An interesting observation is that a larger reference set does not always achieve better performance in the few-shot scenario. The reason is probably that the few-shot scenario makes the performance sensitive to the available references. Take the task relation SubPartOf in Figure 1(b) as an example: when making predictions for organization-related queries, injecting more location-related references is not necessarily useful. Even so, FAAN still obtains relatively stable improvements over most baselines such as GMatching and FSRL. This robustness to the few-shot size comes from the better reference embeddings generated by the adaptive aggregator.

Discussion for Model Variants
To inspect the effectiveness of the model components, we show the results of experiments on model variants in Table 3: (A) Neighbor Encoder Variants: In A1, we replace the encoder with the mean pooling module used in GMatching. In A2, we aggregate neighbors with a fixed attention map as used in FSRL. In A3, we remove the entities' own embeddings and encode them with only their neighbors. Experiments show that aggregating entity neighbors in an adaptive way and considering the self-embedding both benefit model performance.
(B) Transformer Encoder Variants: In B1, we replace the encoder with a concatenation operation on entity pairs as used in both GMatching and FSRL. In B2, we remove the position embeddings in the Transformer encoder. Experiments indicate that the Transformer can effectively model few-shot relations, and that position embeddings are also essential.

Table 5: Attention weights of 5-shot references, given two queries: Query 1 (C. Bulls, NBA) and Query 2 (Astana, Kazakhstan). The task relation of all entity pairs is SubPartOf. The references that are more related to the query achieve higher attention weights.
(C) Matching Processor Variants: In C1, we obtain the embedding of the reference set by simply averaging all reference representations. In C2, we only take the reference that is the most relevant to the query. In C3, we adopt the LSTM matching network as used in GMatching. Experiments indicate that our adaptive matching processor has superior capability in computing relevance between references and queries.

Case Study for Adaptive Attentions
To better understand the effects of adaptive attentions in the neighbor encoder and the matching processor, we conduct a case study. Table 4 provides the most contributive neighboring relations, i.e., those with the highest attention weights, in different tasks. We can see that the contributive neighbors for each entity differ between the two tasks; the entities tend to focus more on the neighbors that are related to the task. Table 5 shows the attention weights of references given different queries. The attention map over references varies per query, and the queries focus more on the related references. We can see that the attention weights are higher for location-related references when the query is location-related, and higher for organization-related references when the query is organization-related. This further indicates that our adaptive matching processor can aggregate references dynamically adaptive to the query, which benefits the matching process. All the above results further confirm the intuition described in the introduction.

Results on Different Relations
Besides the overall performance reported in the main results, we also conduct experiments to evaluate the performance on each task relation in the NELL testing data. Table 6 reports the results of the best baseline model, MetaR, and our model FAAN. According to the table, the results of both models on different task relations have high variance. The reason may be that the number of candidate entities differs across relations, and relations with large candidate sets are usually harder to predict. Even so, our model FAAN performs better in most cases, which indicates that it is robust across different task relations.

Conclusion
This paper proposes an adaptive attentional network for few-shot KG completion, termed FAAN. Previous studies solve this problem by learning static representations of entities or references, ignoring their dynamic properties. FAAN instead encodes entity pairs adaptively, and predicts facts by adaptively matching references with queries. Experiments on two public datasets demonstrate that our model outperforms current state-of-the-art methods with different few-shot sizes.
Our future work might consider other advanced methods to model few-shot relations, and exploit more contextual information, such as textual descriptions, to enhance entity embeddings.