Knowledge Base Embedding By Cooperative Knowledge Distillation

Knowledge bases are increasingly exploited as gold standard data sources which benefit various knowledge-driven NLP tasks. In this paper, we explore a new research direction to perform knowledge base (KB) representation learning grounded in the recent theoretical framework of knowledge distillation over neural networks. Given a set of KBs, our proposed approach, KD-MKB, learns KB embeddings by mutually and jointly distilling knowledge within a dynamic teacher-student setting. Experimental results on two standard datasets show that knowledge distillation between KBs through entity and relation inference is actually observed. We also show that cooperative learning significantly outperforms the two proposed baselines, namely traditional and sequential distillation.


Introduction
Knowledge Bases (KBs), which organize structured information about entities (nodes) and relations (edges) as graphs, are increasingly exploited as gold standard data sources for a broad range of human-AI tasks including language modeling (Logan et al., 2019), question answering (Shen et al., 2019) and semantic search (Bast et al., 2016). Although typical KBs may include a huge amount of observed knowledge through millions of entities and their relations, they are by nature incomplete since they can only capture a fraction of world knowledge. This limitation has given rise to extensive research on predicting new knowledge from the observed one (Socher et al., 2013; Nickel et al., 2016). This issue has been successfully tackled by neural approaches for representation learning of KBs (Wang et al., 2017; Bordes et al., 2013; Yang et al., 2014). These models aim at representing KB entities and relations in low-dimensional embedding spaces, and at supporting relational inference using simple vector algebra. Recent years have witnessed increasing interest in embedding models that connect multiple KBs (Liu et al., 2016; Chen et al., 2017; Trivedi et al., 2018; Zhu et al., 2017; Zhang et al., 2019).
The key objective of multi-graph representation learning is to enrich the entity and relation models with different graph contexts that potentially bridge different semantic contexts. To achieve this goal, embeddings are learned upon the combined triples across graphs. Although the above multi-graph representation learning methods have achieved promising results, they still face two main limitations. First, they are particularly suited to graph alignment and machine translation as downstream tasks, which necessarily raises computational challenges for large-scale KBs. Second, these methods assume that each KB has access to all the entities and relations that are stored in the other KBs, while it may be neither feasible nor relevant for the KBs to share unaligned information, such as in personal KBs (Balog and Kenter, 2019).
Following a different objective, we argue that apart from any downstream task, modeling the relational patterns across KBs might mainly focus on explicitly modeling connectivity patterns within each KB using its own observed triples and, infer additional patterns from its peer by only using partially aligned observed triples. As a motivating example let us consider two KBs (KB 1 and KB 2 ) that contain facts regarding cities, capitals, and countries as illustrated in Figure 1, but where none of them include the fact that Rome is a city of Italy. By learning an embedded space, KB 1 model may be able to correctly generalize the relation CityIn and infer that Rome is a city of Italy by grouping other similar entity embeddings to Rome such as the embedding of Pisa. Although this knowledge was not inferred by KB 1 , KB 2 can teach this information to KB 1 by distillation. Then, KB 1 model will be able to understand the relation CapitalOf by directly observing examples within its own semantic context and the relation CityIn by distilled knowledge from KB 2 semantic context.
Accordingly, unlike existing work in multi-graph embedding, which relies on a unified view over multiple graphs, our work relies on multiple within-KB views that are bridged with aligned information. While each KB learns embeddings in its own semantic context based on its associated hard triples, it can additionally exchange inferred knowledge through soft aligned triples provided by other KBs, so that the KBs improve each other's embeddings based on different semantic contexts. Our key idea is to model a knowledge distillation process (Hinton et al., 2015) across KBs to strengthen their generalization ability. Although several works show the rationale behind entity and relation inference between KBs (Sun et al., 2018; Zhu et al., 2017), none has shown the feasibility of the knowledge distillation framework for modeling knowledge inference between KBs. Since the KBs play symmetric roles in knowledge transfer, one critical issue is how to train each KB model using entity/relation labels based on the soft predictive distributions provided by the teacher, as well as on its own predictive distribution. To tackle this issue, we argue for a mutual learning paradigm, where each KB acts dynamically as either a teacher or a student. Unlike traditional static one-way knowledge transfer from a teacher model to a student model, we argue for a two-way cooperative knowledge transfer between a KB and its peers. Concretely, our setup is the following: the representation learning model of each KB is equipped with two losses which are jointly optimized: 1) a classic supervised margin-based ranking loss for KBs, whose objective is to make the scores of positive triples lower than those of negative ones; and 2) a mimicry cooperative distillation loss that makes the posterior class predictions of aligned entities and aligned relations close to the entity and relation class probabilities of its peer, respectively.
Through joint optimization, the knowledge is also naturally transferred from the seed aligned information to unaligned information. In summary, the main contributions of the paper are the following: 1) a first attempt to ground multigraph representation learning by a knowledge distillation theoretical framework; 2) a novel KB representation learning model called KD-MKB, based on a cooperative knowledge distillation strategy; 3) experiments on two standard datasets, WN18RR and FB15K-237, that empirically validate the rationale of knowledge distillation across KBs and show the effectiveness of the cooperative knowledge distillation as proposed in KD-MKB.
The remainder of this paper is structured as follows. Section 2 presents the related works. In Section 3, we first introduce the used preliminary notions and then detail the KD-MKB model. In Section 4, we present and discuss the experimental results. Section 5 concludes the paper.

Representation learning across multiple KBs
Learning KB embeddings has drawn considerable attention in recent years. The embedding models are mainly categorized into: 1) translational models such as TransE (Bordes et al., 2013), TransH (Wang et al., 2014), and TransR (Lin et al., 2015), or complex-based models such as ComplEx (Trouillon et al., 2016) and RotatE, which learn vector embeddings of both entities and relations by interpreting a relation as a translation operation from a head entity to a tail entity; 2) deep neural models based, for instance, on graph convolutional networks (Schlichtkrull et al., 2018). More recently, representation learning across multiple graphs has gained increasing attention. The general objective is to encode different graphs into a unified embedding space, such that the alignment likelihood between entities can be directly measured via their embeddings. Trivedi et al. (2018) propose the LinkNBed multi-graph representation learning model based on a deep neural architecture. LinkNBed jointly learns relational data embeddings across multiple graphs in a shared space and entity linkage between these graphs using a multi-task approach. Sun et al. (2018) propose a bootstrapping approach for entity alignment across multiple graphs. The key idea is to iteratively label likely alignments as training data and use them to further improve entity embeddings and alignment. Other work extends embedding models to multilingual learning across graphs. A seminal work is MTransE (Chen et al., 2017), which connects monolingual models by jointly aligning cross-lingual counterparts. Chen et al. (2018), in turn, propose a co-training process combining multiple multilingual graph embedding models to learn on two views, namely the structure and the literal descriptions of entities. None of these methods jointly learns multiple KB embeddings while preserving the structure of each KB, which is the main goal of our work.

Knowledge Distillation
Knowledge distillation was initially designed to distill the function approximated by a powerful ensemble of models, playing the role of teacher, into a simpler single model playing the role of student (Bucila et al., 2006). This idea has recently attracted increasing attention for distilling the generalization ability of a large and easy-to-train network model into a small but harder-to-train network (Adriana et al., 2015). The general framework relies on training a teacher first and then using the teacher's outputs, in the form of posterior class probabilities, to train the student model so that it mimics the teacher by providing similar outputs. Knowledge distillation has been widely used in NLP tasks to distill large models into small models (Mou et al., 2016) or ensembles of models into single models (Liu Yijia, 2018; Liu Xiaodong, 2019; Kevin Clark, 2019). Mou et al. (2016) addressed the problem of distilling word embeddings in NLP applications. They proposed a supervised encoding approach to distill task-specific knowledge from cumbersome word embeddings. The approach has been shown to be effective in sentiment analysis and relation classification. Clark et al. (Kevin Clark, 2019) instead distill knowledge from single-task teacher models into multi-task student models. Their work extends born-again networks (Tommaso Furlanello, 2018) to the multi-task setting. The authors mainly rely on the teacher annealing technique, which consists in mixing the teacher prediction with the ground-truth label during training. This strategy allows the student to surpass the teacher. The method has shown good performance in various NLP tasks including textual entailment, question answering and paraphrase. In all of these works, distillation is applied to a pair of models that statically play either the role of teacher or that of student.
In contrast, we adopt a mutual learning approach proposed in computer vision, where a set of models dynamically play the roles of teacher and student. However, unlike that learning framework, we propose cooperative learning over different learning tasks with their respective ground truths, where knowledge distillation enables communication between models via shared data characteristics. Moreover, to the best of our knowledge, this is the first work that models and empirically validates the concept of knowledge distillation across KBs.

Cooperative Knowledge Distillation Across Multiple KBs
In this section, we describe KD-MKB, a KB representation learning model. We first introduce a couple of terminological definitions so we can formally define our model.
$I_e(i, j) = \{(e^i_x, e^j_y) \in E_i \times E_j\}$ is the set of aligned entities, meaning that $e^i_x$ and $e^j_y$ represent the same real-world entity. Note that the set of entities from

Knowledge distillation from a KB to its peer
We conjecture that knowledge transfer between KBs can be drawn by relation and entity distillation. Our underlying intuitions are the following: Relation distillation. Let us consider (e i 1 , e j x ), (e i 2 , e j y ) ∈ I e (i, j) two pairs of aligned entities between KB i and KB j . Assuming the existence of aligned relations between KB i and KB j , our intuition is that such entity pairs lead to the same probability of relation inference because the aligned entities refer to the same real world objects (Sun et al., 2018;Zhu et al., 2017). Accordingly, we argue for the relevance of mutually distilling likely aligned relations from one KB to its peers. Formally, plausibility scores of triples (e i 1 , r i , e i 2 ) can be estimated with high confidence based on plausibility scores of triplets (e j x , r j , e j y ) and vice versa. Entity distillation. Let us consider (r i v , r j w ) ∈ I r (i, j) and (e i 1 , e j x ) ∈ I e (i, j) a pair of aligned entities and relations between KB i and KB j . Similarly to the intuition underlying relation distillation, we believe that such relation pairs lead to the same probability of entity inference because the aligned relations bring equivalent semantics that link between entities (Sun et al., 2018;Zhu et al., 2017). Thus, we argue for the relevance of mutually distilling likely aligned entities from one KB to its peer. Analogously to relation distillation principle, plausibility scores of triples (e i 1 , r i v , e i 2 ) can be estimated with high confidence based on plausibility scores of triples (e j x , r j w , e j y ) and vice versa.

Formulation of the KD-MKB model
In this paper, we study the representation learning of entities and relations across multiple KBs, while preserving the essential information included in each KB. Formally, given a collection of KBs KB = {KB 1 , KB 2 , . . . KB n }, a knowledge embedding model M i is learned to preserve entities and relations of each KB i , i = 1 . . . n in a separated embedding space.

Design principles and objectives
Our main design principle is, on the one hand, to learn embeddings directly from the knowledge included in each KB and, on the other hand, to improve the learning using knowledge distilled from its peers w.r.t. aligned entities and aligned relations. Based on this principle, the learning framework jointly achieves two complementary objectives.
Objective O1. Preserve the relational structure of each KB. For each participating KB i , a dedicated knowledge embedding model M i takes triples (e i x , r i , e i y ), either positive triples in T + i or negative triples in T − i , and learns the corresponding embedding vectors (e i x , r i , e i y ) by maximizing a triple plausibility scoring function f i . TransE (Bordes et al., 2013), TransH (Wang et al., 2014) and RotatE are examples of state-of-the-art scoring functions.
Objective O2. Improve the generalization ability of the representation learning model of each KB by leveraging its peers. Based on a cooperative learning setting, each knowledge embedding model M i is further improved using knowledge distilled from each of the other embedding models M j , j = 1 . . . n, j ≠ i. Each KB model M i acts dynamically as either a teacher or a student by respectively distilling, or leveraging, distilled relations and distilled entities. Thus, we formulate the KD-MKB model as a set of n networks which act dynamically as either teacher or student networks and each mutually learn a specific model M i , i = 1 . . . n. Figure 2 provides an overview of the KD-MKB architecture with a setting of two teacher-students (n = 2). Each KB model M i uses a teacher-student setting in which it learns both from the ground-truth labels, using a score function that measures the plausibility of the embeddings, and from the soft labels provided by the n − 1 teacher networks as prediction outputs, using the probability of relation inference and entity inference based on the principle of knowledge distillation. The probability mass associated with each prediction output provided by the teacher KBs KB j , j = 1 . . . n, j ≠ i, allows the model M i to learn richer contextual information about the similarity of relation and entity embeddings, leading to an increased generalization ability. Thus, the model M i is equipped with two losses which are jointly optimized: a classic KB supervised loss L i s on ground-truth labels and a mimicry cooperative knowledge distillation loss L i KD on soft labels.
The two losses are combined into the overall objective L(θ i ) (Equation 1), in which α is a hyperparameter. Accordingly, each KB model learns both to predict the correct label based on ground-truth training triples (loss L i s ) and to match the posterior probability estimates of relations and entities provided by its peers (loss L i KD ), following the intuitions outlined above (see Section 3.1). Such mutual learning helps each KB to learn additional context from its peers.
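As a rough illustration of how the two losses of a KB model could be combined, the sketch below adds the distillation loss to the supervised loss with weight α. The exact form of Equation 1 is not reproduced in this text, so the additive weighting and the function name joint_loss are assumptions, not the authors' definition; a convex combination would be an equally plausible reading.

```python
def joint_loss(supervised_loss: float, distillation_loss: float, alpha: float = 0.98) -> float:
    """Combine the two per-KB losses into one training objective.

    ASSUMPTION: L(theta_i) = L_s + alpha * L_KD. The source text only says
    that alpha is a hyperparameter of the combined objective.
    """
    return supervised_loss + alpha * distillation_loss


# Hypothetical loss values for one mini-batch of one KB model.
example = joint_loss(supervised_loss=1.0, distillation_loss=1.0, alpha=0.98)
```

With α close to 1, as in the reported setting, both terms contribute at a comparable scale under this assumed form.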

Supervised classification loss
Following objective O1 (see Section 3.1), we adopt a standard KB embedding model, namely TransE (Bordes et al., 2013). It is worth mentioning that other KB embedding models can also be used (e.g., TransH (Wang et al., 2014) or RotatE). Given a relation fact (e i x , r i w , e i y ) in KB i , the plausibility of the embeddings is estimated with the TransE score function f i , which is based on the translation residual $\|\mathbf{e}^i_x + \mathbf{r}^i_w - \mathbf{e}^i_y\|$, where $\|\cdot\|$ denotes either the L 1 or L 2 vector norm. Accordingly, we define the probability of (e i x , r i w , e i y ) being a true triple as $\sigma(f_i(e^i_x, r^i_w, e^i_y))$, where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic sigmoid applied to each triple score. The embedding model parameters θ i are learned by minimizing the logistic loss function L i s over the positive triples T + i and the negative triples T − i .
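The supervised part can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the score is taken as a margin γ minus the TransE distance so that higher scores mean more plausible triples (the sign convention is an assumption), and the toy two-dimensional embeddings are hypothetical.

```python
import math


def transe_distance(head, relation, tail, norm=1):
    """TransE translation residual ||head + relation - tail|| (L1 or L2 norm)."""
    residual = [h + r - t for h, r, t in zip(head, relation, tail)]
    if norm == 1:
        return sum(abs(x) for x in residual)
    return math.sqrt(sum(x * x for x in residual))


def triple_score(head, relation, tail, gamma=9.0, norm=1):
    # ASSUMPTION: plausibility = margin - distance, so higher is more plausible.
    return gamma - transe_distance(head, relation, tail, norm)


def sigmoid(x):
    """Logistic sigmoid, mapping a triple score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_loss(positive_scores, negative_scores, eps=1e-12):
    """Mean negative log-likelihood with positives labelled 1 and negatives 0."""
    terms = [-math.log(sigmoid(s) + eps) for s in positive_scores]
    terms += [-math.log(1.0 - sigmoid(s) + eps) for s in negative_scores]
    return sum(terms) / len(terms)


# Hypothetical toy embeddings: Rome + CityIn lands exactly on Italy.
rome, city_in = [1.0, 0.0], [0.0, 1.0]
italy, france = [1.0, 1.0], [3.0, -2.0]
pos = triple_score(rome, city_in, italy, gamma=1.0)
neg = triple_score(rome, city_in, france, gamma=1.0)
loss = logistic_loss([pos], [neg])
```

Swapping the positive and negative scores yields a larger loss, which is the behaviour the supervised objective relies on.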

Cooperative knowledge distillation loss
Following objective O2 (see Section 3.1), knowledge distillation is cooperatively conducted on the set of n KBs. At each learning step, each KB model M i takes turns in the student-teacher process. As a teacher, the model distills its background knowledge through class prediction estimates L i s , which are used as soft labels by the other student KBs to compute their mimicry loss functions L j KD , j = 1 . . . n, j ≠ i. Conversely, as a student, the model M i uses in its own mimicry loss L i KD the soft labels distilled from the other KB teachers through L j s , j = 1 . . . n, j ≠ i. From the perspective of KB i , the distillation loss function L i KD is formalized as the sum of two losses related to relation distillation L ij KDr and entity distillation L ij KDe from teacher network j to student network i: $L^{i}_{KD} = \sum_{j \neq i} \left( L^{ij}_{KDr} + L^{ij}_{KDe} \right)$. Following the relation (resp. entity) distillation principles, the distillation function L ij KDr (resp. L ij KDe ) quantifies how well the relation (resp. entity) prediction outputs of each student network match the soft labels provided by the teacher networks, with respect to the plausibility estimates given by the corresponding supervised classification functions f j on ground-truth labels. The relation and entity confidence scores used by the student and the teachers for prediction are obtained by converting the triple plausibility scores f i and f j on triples involving seed aligned relations and seed aligned entities, as detailed in the following.
Relation distillation. Relation distillation encourages the student model M i to mimic the teacher model M j on the relation prediction outputs over the set of aligned relations r ∈ I r (i, j), such that triples (e j 1 , r, e j 2 ) and (e i 1 , r, e i 2 ) have close plausibility scores.
Thus, L ij KDr is computed by applying a distillation function D to the teacher distribution P(r (e j x ,·,e j y ) | θ j ) and the student distribution P(r (e i x ,·,e i y ) | θ i ), where D can be defined in several ways (Sau and Balasubramanian, 2016), such as the L2 loss (Ba and Caruana, 2014) or the Kullback-Leibler divergence (Hinton et al., 2015); r (e x ,·,e y ) is a categorical variable with |I r (i, j)| values corresponding to aligned relation labels; P(r (e j x ,·,e j y ) | θ j ) is a categorical distribution generated from the true triples (e j x , r, e j y ) ∈ T + j ; and P(r (e i x ,·,e i y ) | θ i ) is a categorical distribution generated from the triples involving soft relation labels provided by model M j . The relation confidence score of relation r v is obtained by converting plausibility scores using the softmax function over the aligned relations r ∈ I r (i, j): $P_v(r_{(e^k_x,\cdot,e^k_y)} \mid \theta_k) = \mathrm{softmax}(f_k(e^k_x, r_v, e^k_y))$, where k = i, j and $\mathrm{softmax}(f_k(e^k_x, r_v, e^k_y)) = \frac{\exp(f_k(e^k_x, r_v, e^k_y))}{\sum_{r_w \in I_r(i,j)} \exp(f_k(e^k_x, r_w, e^k_y))}$ is the softmax function applied to each triple score.
Entity distillation. Entity distillation encourages the student model M i to mimic the teacher model M j on the link prediction outputs over the set of aligned entities e ∈ I e (i, j), such that triples (e j x , r j w , e j y ) and (e i 1 , r i v , e i 2 ) have close plausibility scores. Thus, L ij KDe is computed analogously by applying D to the teacher and student entity distributions, where r (e x ,r,·) is a categorical variable with |I e (i, j)| values corresponding to aligned entity labels, P(r (e j x ,r,·) | θ j ) is a categorical distribution generated from the true triples (e j x , r, e j y ) ∈ T + j , and P(r (e i x ,r,·) | θ i ) is a categorical distribution generated from the triples involving soft entity labels provided by model M j . The entity confidence score of the entity e y is obtained by converting plausibility scores using the softmax function over the aligned entities e ∈ I e (i, j): $P_y(r_{(e_x,r,e_y)} \mid \theta_k) = \mathrm{softmax}(f_k(e_x, r, e_y))$, where $\mathrm{softmax}(f_k(e_x, r, e_y)) = \frac{\exp(f_k(e_x, r, e_y))}{\sum_{e_z \in I_e(i,j)} \exp(f_k(e_x, r, e_z))}$ is the softmax function applied to each triple score.
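To make the conversion from plausibility scores to soft labels concrete, the following sketch normalizes teacher and student scores over a shared candidate set with a softmax and compares the two distributions with the Kullback-Leibler divergence, one of the options named for D. The function names and the toy scores are hypothetical, and the code is an illustration rather than the authors' implementation; the same routine applies to aligned relations or aligned entities.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the scores of the candidate labels."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) between two categorical distributions."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def distillation_loss(teacher_scores, student_scores):
    """D applied to the softmax-normalized teacher and student scores.

    Both lists must score the same candidates in the same order, for example
    the aligned relations r_w for one aligned entity pair.
    """
    return kl_divergence(softmax(teacher_scores), softmax(student_scores))


# Hypothetical scores of three aligned relations for one entity pair.
teacher = [3.0, 0.5, -1.0]   # f_j from the teacher model M_j
student = [0.2, 0.1, 0.0]    # f_i from the student model M_i
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge.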
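Algorithm 1 (fragment of the training procedure listing; the rest of the listing is truncated in the source):
batch j e ← query index top-k using batch i ∩ I e (i, j) to get P(g (e i x , e i y ) | θ i )
batch j r ← query index top-k using batch i ∩ I r (i, j) to get P(g (e i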

The training procedure
A key characteristic of our proposed cooperative knowledge distillation is that all the losses L(θ i ), i = 1 . . . n, of the n knowledge embedding models are jointly and cooperatively optimized. At each iteration, each loss L(θ i ) uses both the true labels and the soft labels provided by the network models M j , j = 1 . . . n, j ≠ i, to update the parameters θ i . The training procedure is summarized in Algorithm 1. The learning strategy is applied at each mini-batch-based model update. At each iteration, all the losses L(θ i ) are jointly optimized using one mini-batch for training L i s and (n − 1) mini-batches comprising pairs of alignments for training L ij KDe and L ij KDr . Since the entity and relation sets used in the softmax normalization of P(·) over E i or R i may be very large, we apply a sampling technique to estimate the probability distribution, as done in previous work (Liu Yijia, 2018). It consists in selecting the top-k candidate entities (resp. relations) w.r.t. Equation 2 for the given example to be distilled, plus k random entities (resp. relations). The teacher is in charge of choosing the top-k candidates. Thus, we only use 2 × k entities (resp. relations) for the softmax normalization instead of the |E i | or |I e (i, j)| (resp. |R i | or |I r (i, j)|) total values, which drastically reduces the number of computations required for each distillation mini-batch.
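The candidate sampling step can be sketched as follows. This is a simplified illustration under assumptions: candidates are ranked by an exhaustive sort of hypothetical teacher scores (where the experiments use a faiss index), higher scores are taken to mean more plausible, and the k random candidates are drawn without replacement from the remaining ones, a detail the text does not specify.

```python
import random


def select_candidates(teacher_scores, k, rng=None):
    """Pick the teacher's top-k candidates plus up to k random other ones.

    teacher_scores: dict mapping a candidate id to the teacher's score
    (ASSUMPTION: higher means more plausible).
    Returns at most 2 * k candidate ids, top-k first.
    """
    rng = rng or random.Random(0)
    ranked = sorted(teacher_scores, key=teacher_scores.get, reverse=True)
    top, rest = ranked[:k], ranked[k:]
    # ASSUMPTION: random candidates are sampled without replacement.
    extra = rng.sample(rest, min(k, len(rest)))
    return top + extra


# Hypothetical teacher scores for 20 aligned entities.
scores = {f"e{i}": float(i) for i in range(20)}
candidates = select_candidates(scores, k=3, rng=random.Random(1))
```

The softmax normalization is then computed over these 2 × k candidates only, rather than over the full entity or relation set.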

Experiments
Two main objectives have guided our experiments: 1) show the validity of knowledge distillation to formally support knowledge transfer between KBs; 2) evaluate the effectiveness of the KD-MKB model.

Settings
Datasets and splits. We perform our experiments on two standard real-world KBs, WN18RR and FB15K-237. We simulate the multiple-KB setting by randomly splitting the train triples of each of WN18RR and FB15K-237 into 2 and 3 partitions (n = 2, 3 in the KD-MKB setting, see Section 3.2). This evaluation setting is motivated by two reasons: 1) our goal with KD-MKB is to learn improved KB embeddings rather than multi-graph embeddings; 2) we want to evaluate the intrinsic effect of the KD-MKB model without any bias induced by the uncontrolled effect of knowledge alignment quality. More precisely, two FB15K-237 partitions usually share 95% of their entities, whereas this overlap drops to 64% for WN18RR. The setting n = 1 allows reporting the traditional KB model on the entire set of triples. Table 1 provides statistics for both datasets used in our experiments. Within each setting, we obtain n teacher M i and student M j models, so reported results are averages. We compare the performance of our model against a state-of-the-art neural representation model, namely TransE (Bordes et al., 2013). We focus on the standard entity link prediction task for knowledge base population. This task evaluates model performance for a given tail query (e i , r j , e?), where the response is a ranked list of the entities that best fit e? (head queries can be evaluated similarly). We use the standard HITS@k (k = 1, 3, 10) and Mean Reciprocal Rank (MRR) metrics. We report the means of multiple runs over test partitions.
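For reference, HITS@k and MRR over a set of ranked query answers can be computed as in the sketch below. It is a generic illustration of the two metrics with hypothetical toy queries, not the evaluation script used in the experiments; it returns fractions in [0, 1] and does not filter out other known true triples from the rankings.

```python
def reciprocal_rank(ranked_entities, gold):
    """1 / rank of the gold entity, or 0 if it is absent from the ranking."""
    if gold not in ranked_entities:
        return 0.0
    return 1.0 / (ranked_entities.index(gold) + 1)


def hits_at_k(ranked_entities, gold, k):
    """1 if the gold entity is among the first k ranked entities, else 0."""
    return 1.0 if gold in ranked_entities[:k] else 0.0


def evaluate(queries, ks=(1, 3, 10)):
    """queries: list of (ranked_entities, gold_entity) pairs, one per query."""
    n = len(queries)
    metrics = {"MRR": sum(reciprocal_rank(r, g) for r, g in queries) / n}
    for k in ks:
        metrics[f"HITS@{k}"] = sum(hits_at_k(r, g, k) for r, g in queries) / n
    return metrics


# Two hypothetical tail queries: gold entity ranked first, then third.
toy_queries = [(["a", "b", "c"], "a"), (["a", "b", "c"], "c")]
toy_metrics = evaluate(toy_queries)
```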
Knowledge distillation strategies. We analyze the effectiveness of knowledge distillation strategies by comparing the results obtained in the following scenarios: 1) Independent is the traditional one-way distillation setup (Hinton et al., 2015), where only half (n = 2) or a third (n = 3) of the knowledge is transferred from the teacher to the respective student. The teacher model is pre-trained and provides posterior entity and relation predictions to the student model. Note that within this setting, each KB model statically plays either the role of teacher or that of student during the learning process; 2) Xdistills∼X is a sequential setup in which each model is first trained until convergence over one of the sets defined in the partition. It then plays the role of a teacher and distills its knowledge to the other models, which play the role of students; afterwards, it plays the role of student. Thus, the parameters of the KB teacher and KB student models are updated one after the other in a sequential fashion; 3) KD-MKB, in which each model dynamically and simultaneously acts as teacher and as student during the whole training process. Thus, the predictions and parameters of the KB models are jointly updated.
Implementation details. We implemented all baselines and our model using PyTorch. The loss function is minimized using the Adam stochastic optimization method with a learning rate of 10^−5. The maximum number of iterations is set to 8 × 10^4. Parameters of the TransE model were fixed by selecting the best configuration on the validation partition of each dataset and by following recommendations from prior work. Thus, the embedding size is fixed to 1000 (resp. 500), the batch size to 512 (resp. 256), the negative sampling size to 128 (resp. 512), the α parameter of the adversarial loss to 1 (resp. 0.5), and the margin hyperparameter γ to 9 (resp. 6) for FB15K-237 (resp. WN18RR). Top-k entities are found using faiss (Johnson et al., 2017) with k set to 10. The hyperparameter α in Equation 1 is set to 0.98 following a loss analysis between L i s and L i KD .

Does knowledge distillation between KBs work?
To the best of our knowledge, this is the first attempt in the literature to empirically assess knowledge inference between KB embeddings using the theoretical framework of knowledge distillation (Hinton et al., 2015). Table 2 shows link prediction performance for the two datasets when performing traditional independent distillation from a teacher M i to a student M j . We can see from Table 2 that, overall, the performance levels of the student model follow those of the teacher model for all the metrics and that the performance trends remain the same for increasing numbers of KBs. This result empirically validates our idea of modeling knowledge inference between KBs through the simultaneous distillation of entities (L ij KDe ) and relations (L ij KDr ).

KD-MKB model analysis
Distillation model. To highlight the benefit of cooperatively distilling knowledge across KBs, we report in Table 3 link prediction results for the three knowledge distillation strategies Independent, Xdistills∼X and KD-MKB, using the same splits as those presented in Table 2. We report the best and worst results of the M i and M j models for each dataset partition. The main observation that can be drawn from Table 3 is that the KD-MKB model outperforms the Independent model (e.g., between 24.8% and 455.7% improvement in HITS@1, and between 17.9% and 85.7% improvement in MRR) over all the partitions and datasets and w.r.t. all the metrics. Additionally, we can see that the Xdistills∼X model outperforms the Independent model. For example, when n = 2, the HITS@3 performance reaches around 27.93 (resp. 32.60) using Xdistills∼X for the WN18RR (resp. FB15K-237) dataset vs. a lower value of around 19.95 (resp. 31.92) using the Independent model. This can be explained by the fact that the Independent model is only trained over the soft labels in just one partition, while the Xdistills∼X model first uses hard labels from its own partition (when it plays the role of a teacher) and then uses soft labels from the other partitions (when it plays the role of a student). Interestingly, we can also observe that the KD-MKB model outperforms the Xdistills∼X model, though by a lower percent change (e.g., between 1.0% and 11.0% improvement in HITS@10 on the WN18RR dataset, and between 10.6% and 18.6% improvement in HITS@10 on the FB15K-237 dataset). It is worth mentioning that the KD-MKB model uses the same number of soft and hard labels as the Xdistills∼X model, but our proposed cooperative strategy takes more advantage of both kinds of labels. This result highlights a clear benefit of both dynamically switching between the teacher and student roles for each of the M i models, supported by mimicry cooperative learning, and jointly updating their parameters.
Distillation with larger alignments. We further analyze the effect of the size of entity alignments on KD-MKB performance. To be consistent with the results presented in the previous experiments, we keep the same partitions. Figure 3 plots the performance variations w.r.t. all the metrics using the WN18RR dataset. It is worth mentioning that, unlike FB15K-237, which exhibits an overlap of 95%, WN18RR allows simulating an increasing overlap of entities by adding extra information to I e (i, j) until reaching 100% overlap. The additional mapping information between entities does not increase the number of training triples, but it allows a larger number of entities to be distilled from teachers. Figure 3 indicates that, on average, 20% of extra aligned entities leads to an absolute gain of 5.2 points in HITS@10 and 0.06 points in MRR. As expected, a higher number of soft labels improves the mutual knowledge inference from a KB to its peers. Results for the FB15K-237 dataset are presented in Figure 4. In this case, the increase is also beneficial and is more stable when n = 3 because of the larger increase in overlapping entities.
Multi-KB learning vs. single-KB learning. We depict in Figures 3 and 4 the performance of KD-MKB as well as of TransE on both datasets. Both models were trained using the full train data, with hard labels only for the latter, and with hard and soft labels for the former. Results are constant for TransE, as its performance does not depend on the number of aligned entities. When the proportion of aligned entities is 100% (highlighted by a circle), we can verify from these figures the improvement obtained by using KD-MKB on multiple KBs instead of the classical TransE embedding model on an individual KB. For FB15K-237, both partition configurations (i.e., KD-MKB with n = 2 and n = 3) outperform TransE for all studied metrics. However, for WN18RR, KD-MKB slightly outperforms TransE when n = 2 in terms of HITS@3 and MRR, but fails to improve in terms of HITS@10.

Conclusion and Future Work
This paper presents a new framework for learning entity and relation embeddings over multiple KBs. Our framework exploits a new way to transfer learning from one KB model to its peers. First, we formalize entity and relation inference between KBs as a distillation loss over posterior probability distributions on aligned knowledge. Building on this formalization, we propose a cooperative distillation framework where a set of KB models are jointly learned, each using hard labels from its own context as well as soft labels provided by its peers. We empirically demonstrate the rationale behind knowledge distillation between KBs and show the effectiveness of our cooperative learning framework on the link prediction task compared to the existing distillation strategies. Further experiments are planned for future work using more complex and realistic configurations for multiple-KB learning to assess the generalizability of our findings. We also plan to extend our approach to consider weak alignments while distilling knowledge over pairs of KBs. This would give the cooperative learning a higher generalization ability, particularly for heterogeneous KBs. We believe that this work could be used without major changes to the core methodology to support a wide range of knowledge-driven applications.