Pretrain-KGE: Learning Knowledge Representation from Pretrained Language Models

Conventional knowledge graph embedding (KGE) models often suffer from limited knowledge representation, which degrades performance, especially in low-resource settings. To remedy this, we propose to enrich knowledge representation by leveraging world knowledge from pretrained language models. Specifically, we present a universal training framework named Pretrain-KGE consisting of three phases: a semantic-based fine-tuning phase, a knowledge extracting phase and a KGE training phase. Extensive experiments show that our proposed Pretrain-KGE improves results over KGE models, especially in low-resource settings.


Introduction
Knowledge graphs (KGs) provide effective access to world knowledge for a wide variety of NLP tasks, such as entity linking (Luo et al., 2017), information retrieval (Xiong et al., 2017), question answering (Hao et al., 2017) and recommendation systems (Zhang et al., 2016). A typical KG, such as Freebase (Bollacker et al., 2008) or WordNet (Miller, 1995), consists of a set of triplets in the form (h, r, t), with the head entity h and the tail entity t as nodes and the relation r as an edge in the graph. A triplet represents the relation between two entities, e.g., (Steve Jobs, founded, Apple Inc.). To learn effective representations of entities and relations in the graph, knowledge graph embedding (KGE) models are among the most prominent approaches (Bordes et al., 2013; Ji et al., 2015; Lin et al., 2015; Sun et al., 2019; Nickel et al., 2011; Kazemi and Poole, 2018; Trouillon et al., 2016; Zhang et al., 2019).
However, traditional KGE models often suffer from limited knowledge representation due to sparse and noisy dataset annotations, leading to performance degradation, especially in low-resource settings. To address this issue, we propose to enrich knowledge representation via pretrained language models (i.e., BERT (Devlin et al., 2019)) given semantic descriptions of entities and relations, incorporating world knowledge from BERT into the entity and relation representations. Although simply fine-tuning BERT can enrich knowledge representation, it learns inadequate structure information from the training triplets, as we demonstrate in our analysis of the rationality of the KGE training phase.
We propose a model-agnostic training framework for learning knowledge graph embeddings consisting of three phases: a semantic-based fine-tuning phase, a knowledge extracting phase and a KGE training phase (see Fig. 1). During the semantic-based fine-tuning phase, we learn knowledge representation via BERT given the semantic descriptions of entities and relations as the input sequence, thereby incorporating world knowledge from BERT into the knowledge representation. Then, during the knowledge extracting phase, we extract the entity and relation representations encoded by BERT and inject them into the embeddings of a KGE model. Finally, during the KGE training phase, we train the KGE model to learn adequate structure information from the dataset while preserving partial knowledge from BERT, so as to learn better knowledge graph embeddings.
Extensive experiments show that our proposed Pretrain-KGE can improve performance over KGE models on four benchmark KG datasets. Further analysis and visualization of the knowledge learning process demonstrate that our method enriches knowledge representation via pretrained language models through the training framework.

Figure 1: An illustration of our proposed three-phase Pretrain-KGE. "KGE loss" is the score function of an arbitrary KGE model, so our method can be applied to any variant of KGE models. "BERT Encoder" represents the entity/relation encoder given semantic descriptions of entities and relations.

Related Work
KGE models can be roughly divided into translational models and semantic matching models according to their score functions (Wang et al., 2017). Translational models treat the relation between the head and tail entity as a translation between the two entity embeddings, such as TransE (Bordes et al., 2013), TransH (Wang et al., 2014), TransR (Lin et al., 2015), TransD (Ji et al., 2015), RotatE (Sun et al., 2019), and TorusE (Ebisu and Ichise, 2018). Semantic matching models define a score function to match the latent semantics of the head entity, the tail entity and the relation, such as RESCAL (Nickel et al., 2011), DistMult, SimplE (Kazemi and Poole, 2018), ComplEx (Trouillon et al., 2016) and QuatE (Zhang et al., 2019). QuatE is the recent state-of-the-art KGE model, which represents entities as hypercomplex-valued embeddings and models relations as rotations in the quaternion space.
In a knowledge graph dataset, the names of each entity and relation are provided as semantic descriptions of entities and relations. Recent works also leverage semantic descriptions to enrich knowledge representation but ignore the contextual information of the descriptions (Socher et al., 2013a; Li et al., 2016; Speer and Havasi, 2012; Xiao et al., 2017). Instead, our method exploits world knowledge via pretrained models. Recent approaches to modeling language representations offer significant improvements over static embeddings, such as pretrained deep contextualized language models (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019). KG-BERT first utilizes BERT (Devlin et al., 2019) for knowledge graph completion, treating triplets in knowledge graphs as textual sequences. However, KG-BERT does not extract knowledge representations from BERT and thus cannot provide entity or relation embeddings. In this work, we leverage world knowledge from BERT to learn better knowledge representations of entities and relations given semantic descriptions.

Training Framework
An overview of Pretrain-KGE is shown in Fig. 1. The framework consists of three phases: a semantic-based fine-tuning phase, a knowledge extracting phase, and a KGE training phase.
Semantic-based fine-tuning phase We first encode the semantic description by BERT (Devlin et al., 2019). Define S(e) and S(r) as the semantic descriptions of entity e and relation r respectively. BERT(·) converts S(e) and S(r) into representations of the entity and relation. We then project the entity and relation representations into two separate vector spaces F^d through linear transformations, where F^d denotes a vector space over the number set F. Formally, we obtain the entity encoder Enc_e(·) for each entity e and the relation encoder Enc_r(·) for each relation r, and output the entity and relation representations as:

v_h = Enc_e(h) = σ(W_e BERT(S(h)) + b_e)
v_r = Enc_r(r) = σ(W_r BERT(S(r)) + b_r)
v_t = Enc_e(t) = σ(W_e BERT(S(t)) + b_t)

where v_h, v_r and v_t denote the encoding vectors of the head entity, the relation and the tail entity in a triplet (h, r, t), respectively; W_e, W_r ∈ F^{d×n}, b_e, b_r ∈ F^d, and σ denotes a nonlinear activation function. The entity and relation representations are used to train the BERT encoder with a KGE loss. After fine-tuning, the entity encoder and the relation encoder are used in the following knowledge extracting phase.
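The fine-tuning phase can be sketched as follows. This is a minimal illustration, not the paper's implementation: `bert_encode` is a hypothetical stand-in for BERT(·) (a real implementation would pool the contextual token embeddings of a pretrained encoder), and tanh is an assumed choice for σ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 8  # n = encoder output size, d = KGE embedding dimension

def bert_encode(description: str) -> np.ndarray:
    # Hypothetical stand-in for BERT(S(.)): maps a description string to
    # a fixed n-dimensional sentence vector (deterministic per string).
    vec_rng = np.random.default_rng(abs(hash(description)) % (2**32))
    return vec_rng.standard_normal(N)

# Linear projections W_e, W_r in F^{d x n} and biases b_e, b_r in F^d,
# followed by a pointwise nonlinearity sigma (tanh here, an assumption).
W_e, b_e = rng.standard_normal((D, N)), np.zeros(D)
W_r, b_r = rng.standard_normal((D, N)), np.zeros(D)

def enc_entity(desc: str) -> np.ndarray:
    return np.tanh(W_e @ bert_encode(desc) + b_e)

def enc_relation(desc: str) -> np.ndarray:
    return np.tanh(W_r @ bert_encode(desc) + b_r)

v_h = enc_entity("Steve Jobs")
v_r = enc_relation("founded")
v_t = enc_entity("Apple Inc.")
```

In the real framework, W_e, b_e, W_r, b_r and the BERT parameters would all be updated jointly under the KGE loss during fine-tuning.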
Knowledge extracting phase In this phase, we extract the knowledge representation encoded by the BERT encoder and inject it into the embeddings of a KGE model as initialization: the entity embedding E = [E_1; E_2; ...; E_k] ∈ F^{k×d} and the relation embedding R = [R_1; R_2; ...; R_l] ∈ F^{l×d}, where ";" means concatenating column vectors into a matrix, and k and l denote the total numbers of entities and relations, respectively. Formally, we extract the knowledge representation encoded by BERT and inject it into a KGE model by setting E_i = Enc_e(e_i) and R_j = Enc_r(r_j).
KGE training phase After the knowledge extracting phase, we train the KGE model in the same way as a traditional KGE model. For example, if the max-margin loss function with negative sampling is adopted, the loss is calculated as:

L = Σ_{(h,r,t)} Σ_{(h′,r′,t′)} [γ − f(h, r, t) + f(h′, r′, t′)]_+

where (h, r, t) and (h′, r′, t′) represent a candidate triplet and a corrupted false triplet respectively, γ denotes the margin, [·]_+ = max(·, 0), and f(·) denotes the score function. The KGE training phase is indispensable because simply fine-tuning a pretrained language model cannot learn adequate structure information from the training triplets. We demonstrate the rationality of the three-phase training framework in Section 5.2.
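A minimal sketch of this loss, assuming TransE's L1 distance d(h, r, t) as the negated score −f (so the hinge becomes [γ + d(h,r,t) − d(h′,r′,t′)]_+); the toy embeddings and γ = 1 are illustrative assumptions.

```python
import numpy as np

def transe_distance(v_h, v_r, v_t):
    # L1 distance of TransE: lower means a more plausible triplet.
    return np.abs(v_h + v_r - v_t).sum(axis=-1)

def margin_loss(pos, neg, gamma=1.0):
    # [gamma + d(h,r,t) - d(h',r',t')]_+ averaged over samples, where
    # pos/neg are distances of true and corrupted triplets.
    return np.maximum(gamma + pos - neg, 0.0).mean()

# toy embeddings: a true triplet and one with a corrupted tail
rng = np.random.default_rng(1)
h, r = rng.standard_normal(4), rng.standard_normal(4)
t_true = h + r                    # perfectly translated tail: d = 0
t_false = rng.standard_normal(4)  # negative sample

loss = margin_loss(transe_distance(h, r, t_true),
                   transe_distance(h, r, t_false))
```

In practice the negatives (h′, r, t) and (h, r, t′) are drawn by corrupting the head or tail of each training triplet, as described in Section "Datasets and Evaluation Metrics".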

Implementation of Baseline Models
To evaluate the universality of the training framework Pretrain-KGE, we select multiple public KGE models as baselines, including translational models:

• TransE (Bordes et al., 2013), the translational model which models relations as translations between entity embeddings;
• RotatE (Sun et al., 2019), an extension of translational models which introduces complex-valued embeddings to model relations as rotations in complex vector space;

and semantic matching models:

• DistMult, a semantic matching model where each relation is represented with a diagonal matrix;
• ComplEx (Trouillon et al., 2016), an extension of semantic matching models which embeds entities and relations in complex space;
• QuatE (Zhang et al., 2019), the recent state-of-the-art KGE model which learns entity and relation embeddings in the quaternion space.

The score functions of these models and their corresponding number systems F are summarized in Table 1.

Method | Score function | F
TransE (Bordes et al., 2013) | ‖v_h + v_r − v_t‖ | R
DistMult | ⟨v_h, v_r, v_t⟩ | R
ComplEx (Trouillon et al., 2016) | Re(⟨v_h, v_r, v̄_t⟩) | C
QuatE (Zhang et al., 2019) | v_h ⊗ ṽ_r · v_t | H

Table 1: Score functions and the corresponding F. v_h, v_r, v_t denote head, relation and tail embeddings respectively. R, C, H denote the real number field, the complex number field and the quaternion division ring respectively. ‖·‖ denotes the L1 norm and ⟨·⟩ denotes the generalized dot product. Re(·) and v̄ denote the real part and the conjugate of complex vectors respectively. ⊗ denotes circular correlation, · denotes the Hadamard product, and ṽ denotes the normalization operator.
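The real- and complex-valued score functions in Table 1 can be written directly in code. This is an illustrative NumPy sketch (complex-valued arrays for ComplEx), not the paper's implementation; the quaternion case for QuatE is omitted since NumPy has no native quaternion type.

```python
import numpy as np

def score_transe(h, r, t):
    # -||h + r - t||_1 : higher score = more plausible triplet
    return -np.abs(h + r - t).sum()

def score_distmult(h, r, t):
    # generalized dot product <h, r, t> over real embeddings
    return np.sum(h * r * t)

def score_complex(h, r, t):
    # Re(<h, r, conj(t)>) with complex-valued embeddings
    return np.real(np.sum(h * r * np.conj(t)))

# toy vectors: a perfectly translated TransE triplet scores 0 (the maximum)
h, r = np.array([1.0, 2.0]), np.array([3.0, 4.0])
t = h + r
s = score_transe(h, r, t)
```

Each function returns a scalar plausibility score; during training these would be applied batch-wise to all candidate and corrupted triplets.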

Datasets and Evaluation Metrics
We evaluate our proposed training framework on four benchmark KG datasets: WN18 (Bordes et al., 2013), WN18RR (Dettmers et al., 2018), FB15K (Bordes et al., 2013) and FB15K-237 (Toutanova and Chen, 2015). Detailed statistics of the datasets are in the appendix. WN18 and WN18RR are two subsets of WordNet (Miller, 1995); FB15K and FB15K-237 are two subsets of Freebase (Bollacker et al., 2008). We use the entity names and relation names provided by the four datasets as input semantic descriptions for BERT, and we also utilize synset definitions provided by WordNet as additional semantic descriptions of entities.

Table 3: Link prediction and triplet classification ("Class.") results over QuatE. ↓ means a lower metric is better; ↑ means a higher metric is better. ♠ denotes state-of-the-art performance of KGE models. "+Name" means Pretrain-KGE uses entity and relation names as the semantic description; "+Definition" means Pretrain-KGE also adopts definitions of word senses as additional semantic description.
In our experiments, we perform the link prediction task (filtered setting) together with the triplet classification task. Link prediction aims to predict either the head entity given the relation and the tail entity, or the tail entity given the head entity and the relation, while triplet classification aims to judge whether a candidate triplet is correct or not. For link prediction, we generate corrupted false triplets (h′, r, t) and (h, r, t′) using negative sampling. We obtain the ranks of test triplets and compute standard evaluation metrics: Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hits at N (H@N). For triplet classification, we follow the evaluation protocol of Socher et al. (2013b) and adopt the accuracy metric (Acc).
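Given the filtered 1-based ranks of the true entities, these metrics reduce to a few lines of code; the ranks below are illustrative, not experimental results.

```python
import numpy as np

def ranking_metrics(ranks, hits_at=(1, 3, 10)):
    # ranks: 1-based filtered rank of the true entity for each test triplet
    ranks = np.asarray(ranks, dtype=float)
    metrics = {
        "MR": ranks.mean(),           # Mean Rank (lower is better)
        "MRR": (1.0 / ranks).mean(),  # Mean Reciprocal Rank (higher is better)
    }
    for n in hits_at:
        # H@N: fraction of test triplets ranked in the top N
        metrics[f"H@{n}"] = (ranks <= n).mean()
    return metrics

m = ranking_metrics([1, 2, 10, 50])
```

In the filtered setting, any corrupted triplet that happens to be a true triplet elsewhere in the dataset is removed before computing the rank.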

Main Results
We present the main results of our Pretrain-KGE method in Table 2 and Table 3. As shown in Table 2, our universal training framework can be applied to multiple variants of KGE models despite their different embedding spaces, and achieves improvements over TransE, DistMult, ComplEx, RotatE and QuatE on most evaluation metrics, especially on MR, while remaining competitive on MRR. The results in Table 3 show that our method also improves QuatE on most evaluation metrics for link prediction and triplet classification. Together, these results verify the effectiveness and universality of the proposed training framework.

Analysis
In this section, we evaluate our Pretrain-KGE on the low-resource problem and further verify the rationality of our training framework.

Performance on the Low-resource Problem
We evaluate our training framework with fewer training triplets on WordNet, and test its performance on out-of-knowledge-base (OOKB) entities, as shown in Fig. 2. To test the performance of Pretrain-KGE given fewer training triplets, we conduct experiments on WN18 and WN18RR by feeding varying numbers of training triplets, as shown in Fig. 2a and 2b. We also evaluate Pretrain-KGE on WordNet for the OOKB entity problem, as shown in Fig. 2c and 2d. We use traditional TransE and the word-averaging model following Li et al. (2016) as baselines. Experimental details are in the appendix. Results show that our training framework achieves the best performance with fewer training triplets and on OOKB entities. Baseline-TransE performs the worst when training triplets are few and cannot address the OOKB entity problem because it does not utilize any semantic description. The word-averaging model improves the performance of TransE with fewer training triplets, yet it does not learn knowledge representation as well as BERT, because the latter can better understand the semantic descriptions of entities and relations by exploiting the world knowledge in them. In contrast, our Pretrain-TransE further enriches knowledge representation by encoding the semantic descriptions of entities and relations via BERT and using the learned representation to initialize the embeddings for TransE. In this way, we incorporate world knowledge from BERT into the entity and relation embeddings, so that TransE performs better given fewer training triplets and also alleviates the OOKB entity problem.

Rationality of the Framework
We visualize the knowledge learning process of Baseline-TransE and our Pretrain-TransE in Fig. 3.
We select the five most common supersenses in WN18: plant, animal, act, person and artifact, among which the last three are all relevant to the concept of human beings. In Fig. 3a, we observe that Baseline-TransE learns the structure information in the training triplets but does not distinguish plant and animal from the other three supersenses. In contrast, Fig. 3b shows that our Pretrain-TransE can distinguish entities belonging to different supersenses. In particular, entities relevant to the same concept of human beings are more condensed, and entities belonging to clearly different supersenses are more cleanly separated.
The main reason is that we introduce knowledge from BERT to enrich the knowledge representation of entities and relations. We also demonstrate the rationality of the KGE training phase: Table 4 shows that the full Pretrain-KGE method outperforms an ablated version which excludes the KGE training phase.

Conclusion
We propose Pretrain-KGE, an efficient pretraining technique for learning knowledge graph embedding. Pretrain-KGE is a universal training framework that can be applied to any KGE model. It learns knowledge representation via pretrained language models and incorporates world knowledge from the pretrained model into the entity and relation embeddings. Extensive experimental results demonstrate consistent improvements over KGE models across multiple benchmark datasets. The knowledge incorporation introduced by Pretrain-KGE alleviates the low-resource problem, and we justify our three-phase training framework through an analysis of the knowledge learning process.

A.2.1 Non-linear Activation Function

In the semantic-based fine-tuning phase, we adopt the following non-linear pointwise function σ(·): for x = Σ_i x_i e_i ∈ F (where F can be the real number field R, the complex number field C or the quaternion ring H):

σ(x) = Σ_i σ(x_i) e_i

where x_i ∈ R and e_i is the K-dimensional hypercomplex unit. For instance, when K = 1, F = R; when K = 2, F = C and e_1 = i (the imaginary unit); when K = 4, F = H and e_1,2,3 = i, j, k (the quaternion units). For example:

σ(x_0 + x_1 i + x_2 j + x_3 k) = σ(x_0) + σ(x_1) i + σ(x_2) j + σ(x_3) k

where i, j, k denote the quaternion units.
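This pointwise activation over the real components of a hypercomplex value can be sketched as follows; tanh is an assumed choice for the base real nonlinearity, and the hypercomplex value is represented simply by its K real components.

```python
import numpy as np

def sigma_pointwise(components, base=np.tanh):
    # Apply a real nonlinearity to each hypercomplex component x_i of
    # x = sum_i x_i e_i. K = 1, 2, 4 components correspond to R, C, H.
    return base(np.asarray(components, dtype=float))

# quaternion x = x0 + x1*i + x2*j + x3*k, stored as its 4 real components
q = [0.0, 1.0, -1.0, 2.0]
sq = sigma_pointwise(q)
```

Because the nonlinearity acts componentwise, the same function covers the real, complex and quaternion cases without any change.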

A.2.2 Implementation of the Word-averaging Baseline
We implement the word-averaging baseline to utilize the entity names and entity definitions in WordNet for better entity embeddings. Formally, for entity e with textual description T(e) = w_1 w_2 ... w_L, where w_i denotes the i-th token and T(e) concatenates the entity name and its definition in WordNet, the entity embedding is computed as:

Avg(e) = (1/L) Σ_{i=1}^{L} u_i

where u_i denotes the word embedding of token w_i, a randomly initialized trainable parameter updated in the semantic-based fine-tuning phase. We also adopt our three-phase training method to train the word-averaging baseline.
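The averaging step can be sketched as follows; the tiny vocabulary and its random initialization are illustrative assumptions standing in for the trainable word embeddings.

```python
import numpy as np

# Hypothetical toy vocabulary: each token maps to a trainable d-dim vector.
rng = np.random.default_rng(2)
D = 8
vocab = {w: rng.standard_normal(D) for w in
         "apple a fruit with red or green skin".split()}

def avg_embedding(description: str) -> np.ndarray:
    # Avg(e) = (1/L) * sum_i u_i over the L tokens of T(e)
    tokens = description.split()
    return np.mean([vocab[w] for w in tokens], axis=0)

e = avg_embedding("apple a fruit with red or green skin")
```

Unlike the BERT encoder, this ignores word order and context, which is exactly the limitation discussed in the low-resource analysis.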
Similarly, E = [E_1; E_2; ...; E_k] ∈ F^{k×d} and R = [R_1; R_2; ...; R_l] ∈ F^{l×d} denote the entity and relation embeddings. In the semantic-based fine-tuning phase, for head entity h, tail entity t and relation r, the score function is calculated as:

f(h, r, t) = ‖Avg(h) + R_r − Avg(t)‖

where R_r denotes the relation embedding of relation r. In the knowledge extracting phase, similar to our proposed model, we initialize E_i with Avg(e_i).
In the KGE training phase, we optimize E and R with the same training method as the TransE baseline.

A.3 Experimental Settings
The hyper-parameters are listed in Table 6. Experiments are conducted on a GeForce GTX TITAN X GPU.