Multi-Task Learning for Knowledge Graph Completion with Pre-trained Language Models

As research on utilizing human knowledge in natural language processing has attracted considerable attention in recent years, knowledge graph (KG) completion has come into the spotlight. Recently, KG-BERT, a knowledge graph completion method that uses a pre-trained language model, was presented and showed high performance. However, its scores in ranking metrics such as Hits@k are still behind those of state-of-the-art models. We claim that there are two main reasons: 1) failure to sufficiently learn relational information in knowledge graphs, and 2) difficulty in picking out the correct answer from lexically similar candidates. In this paper, we propose an effective multi-task learning method to overcome these limitations of previous work. By combining relation prediction and relevance ranking tasks with our target task, link prediction, the proposed model can learn more relational properties in KGs and perform properly even when candidates are lexically similar. Experimental results show that we not only largely improve ranking performance compared to KG-BERT but also achieve state-of-the-art performance in Mean Rank and Hits@10 on the WN18RR dataset.


Introduction
A Knowledge Graph (KG) is a graph-structured knowledge base in which real-world knowledge is represented as triples (h, r, t): (head entity, relation, tail entity), meaning that h and t are connected by the relation r. The entities and the relation in a triple correspond to nodes and an edge of the graph, respectively. In recent years, Natural Language Processing (NLP) has benefited from utilizing KGs in various applications such as language modeling (Peters et al., 2019; Liu et al., 2019a), question answering (Huang et al., 2019), and machine reading (Yang and Mitchell, 2017). With the increasing demand for high-quality knowledge, the reliability of KGs has also become important. Therefore, knowledge graph completion (a.k.a. link prediction), which identifies whether a triple in a KG is valid or not, has been actively investigated.
Several studies on knowledge graph completion have been conducted (Bordes et al., 2013; Trouillon et al., 2016; Dettmers et al., 2018). They presented methods to model the connectivity patterns between entities in a KG, and score functions to define the validity of a triple. However, these methods only consider the graph structure and relational information of the existing KG. Thus, they cannot predict well on triples that contain less frequent entities. Recently, to address the sparseness problem of previous models, Yao et al. (2019) proposed a method called KG-BERT for knowledge graph completion, using entity descriptions and pre-trained language models. Even though KG-BERT significantly improved mean ranks using prior linguistic information from BERT (Devlin et al., 2018), its results in other ranking metrics such as MRR and Hits@k are still behind those of state-of-the-art models.
We claim that there are two major reasons for this problem. First, KG-BERT misses much of the relational information in KGs. While previous state-of-the-art methods aimed to model relational properties in graphs, KG-BERT only uses a binary cross-entropy loss to predict whether triples are valid or invalid for the link prediction task. Second, KG-BERT has difficulty in picking out the answer entity among lexically similar candidates. For example, given the head entity and relation (take a breather, derivationally related form, ) and the correct tail entity "breathing time", KG-BERT gives its top scores to "snorkel breather" and "breath" because of their lexical overlap on "breath". This problem leads to lower performance in MRR and Hits@k.
Figure 1: Architecture of the proposed multi-task learning method for knowledge graph completion.
In this paper, we propose an effective multi-task learning method to overcome these problems. We devise a multi-task framework by adding two tasks (relation prediction and relevance ranking) to link prediction, our target task. In relation prediction, the model is trained to predict the relation between two given entities, which helps the model learn more relational properties. In relevance ranking, the model is trained with a margin ranking loss to create a gap between the valid triple and lexically similar candidates. We evaluate the proposed method on two popular datasets, WN18RR and FB15k-237, and experimental results show that our method improves ranking performance by a large margin compared to KG-BERT. Notably, our method achieves state-of-the-art performance in Mean Rank and Hits@10 on the WN18RR dataset.

Proposed Method
In this section, we propose a multi-task learning method for knowledge graph completion. As shown in Figure 1, we follow the multi-task learning framework of MT-DNN (Liu et al., 2019b) and use the pre-trained BERT model as a shared layer. We combine three tasks: link prediction, relation prediction, and relevance ranking. Each task has a classification layer $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels and $H$ is the hidden size of BERT. Following Devlin et al. (2018), every input sequence has a [CLS] token at the head of the sequence, and the [SEP] token is used as a separator.
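The shared-layer design above can be illustrated with a minimal sketch. This is not the authors' implementation: `shared_encoder` is a hypothetical stand-in for BERT that returns a deterministic pseudo-random [CLS] vector, the hidden size is shrunk to 8 for readability, and giving the relevance ranking head a single output (K = 1) is an assumption.

```python
import random

H = 8                # hidden size (BERT-base uses 768; kept small for illustration)
NUM_RELATIONS = 11   # e.g. the number of relations in WN18RR


def make_head(k, h, rng):
    """Create a K x H classification layer with small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(h)] for _ in range(k)]


def apply_head(weights, cls_vector):
    """Compute C W^T: one logit per label."""
    return [sum(w * c for w, c in zip(row, cls_vector)) for row in weights]


def shared_encoder(text):
    """Hypothetical stand-in for BERT: maps text to a fixed [CLS] vector."""
    local = random.Random(sum(ord(ch) for ch in text))
    return [local.uniform(-1.0, 1.0) for _ in range(H)]


_rng = random.Random(0)
heads = {
    "LP": make_head(2, H, _rng),              # valid / invalid
    "RP": make_head(NUM_RELATIONS, H, _rng),  # one logit per relation
    "RR": make_head(1, H, _rng),              # assumption: a single ranking score
}


def forward(task, text):
    """Every task shares the encoder; only the final layer differs."""
    return apply_head(heads[task], shared_encoder(text))
```

Because all three heads read the same encoder output, a gradient step on any one task would also update the parameters used by the other two, which is the point of sharing the layer.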
Link Prediction (LP): We define link prediction in the same way as KG-BERT (Yao et al., 2019), and it is our main target task. Given a training set $S$, the input $x$ is a text sequence of (h, r, t). Each entity is represented by its name and description; e.g., for the triple (plant tissue, hypernym, plant structure), the input sequence is "[CLS] plant tissue, the tissue of a plant [SEP] hypernym [SEP] plant structure, any part of a plant or fungus [SEP]". The model is trained to predict whether a given triple (h, r, t) is valid or not, and invalid triples are made by replacing the head or tail entity with a random entity. Let $C$ be the final hidden vector of the [CLS] token, $W_{LP} \in \mathbb{R}^{2 \times H}$ be the classification layer for link prediction, and $S'$ be the set of invalid triples. The model is then trained with a binary cross-entropy loss over $S \cup S'$, where $f(x)$ is the final output of the model and $y \in \{0, 1\}$ is the label. Let the output of $C W_{LP}^{\top}$ be $[s_0, s_1] \in \mathbb{R}^2$; then $s_1$ is used as the final ranking score in evaluation.
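The input construction, negative sampling, and loss for link prediction can be sketched as follows. The helper names are hypothetical. Two details are assumptions rather than statements from the text: corrupted triples that happen to be known valid triples are resampled, and $f(x)$ is taken to be the softmax probability of the "valid" class. In practice a BERT tokenizer would insert [CLS] and [SEP] itself; the string form here is only for illustration.

```python
import math
import random


def entity_text(name, description):
    """An entity is rendered as 'name, description'."""
    return f"{name}, {description}"


def lp_input(head, relation, tail):
    """Build the LP input sequence; head and tail are (name, description) pairs."""
    return (f"[CLS] {entity_text(*head)} [SEP] {relation} "
            f"[SEP] {entity_text(*tail)} [SEP]")


def corrupt(triple, entities, known, rng):
    """Replace the head or the tail with a random entity.

    Assumption: resample when the result is the original triple or another
    known valid triple. This loops forever if no such corruption exists.
    """
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate != triple and candidate not in known:
            return candidate


def valid_probability(logits):
    """Softmax over [s0, s1], returning the probability of the valid class."""
    s0, s1 = logits
    m = max(s0, s1)
    e0, e1 = math.exp(s0 - m), math.exp(s1 - m)
    return e1 / (e0 + e1)


def binary_cross_entropy(p, y):
    """Loss for one example with label y in {0, 1} and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```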
Relation Prediction (RP): The model learns to classify the relation between two entities. The input consists of the head and tail entity sequences, e.g., "[CLS] plant tissue, the tissue of a plant [SEP] plant structure, any part of a plant or fungus [SEP]", and the model is trained to predict the relation hypernym. The classification layer for relation prediction is $W_{RP} \in \mathbb{R}^{R \times H}$, where $R$ is the number of relations, and we minimize a cross-entropy loss in which $g(x)$ is the output of the model and $y \in \mathbb{R}^{R}$ is a class indicator.
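For reference, the standard multi-class cross-entropy consistent with these definitions can be written as below. This is a sketch under two assumptions: $g(x)$ is a softmax over the $R$ relation logits, and $y$ is a one-hot indicator. The exact form used by the authors may differ.

```latex
g(x) = \mathrm{softmax}\left(C W_{RP}^{\top}\right), \qquad
\mathcal{L}_{RP} = -\sum_{x \in S} \sum_{i=1}^{R} y_i \log g(x)_i
```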
Relevance Ranking (RR): The objective of relevance ranking is to make valid triples score higher than invalid triples. We use a margin ranking loss to widen the gap between valid and invalid triples. The input is the same as in link prediction, and the task has its own classification layer; in the loss, $h(x)$ is the output of the model and $\lambda$ is a margin.
During training, we use mini-batch stochastic gradient descent. We first compose mini-batches for each task, $D_{LP}$, $D_{RP}$, and $D_{RR}$, then combine all data as $D = D_{LP} \cup D_{RP} \cup D_{RR}$. At each training step, a mini-batch is randomly selected from $D$, and the model is trained on the task corresponding to that batch.
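The margin ranking loss and the batch scheduling described above can be sketched as follows. All function names are hypothetical, and two points are assumptions: the loss is written in the common hinge form over a (valid, invalid) score pair, and "randomly selected from D" is read as shuffling the pooled mini-batches and visiting each once per epoch, as in MT-DNN-style training.

```python
import random


def margin_ranking_loss(pos_score, neg_score, margin=0.1):
    """Hinge form: zero once the valid triple outscores the invalid one by `margin`."""
    return max(0.0, neg_score - pos_score + margin)


def make_batches(examples, batch_size, task):
    """Split one task's examples into (task, mini-batch) pairs."""
    return [(task, examples[i:i + batch_size])
            for i in range(0, len(examples), batch_size)]


def build_schedule(task_data, batch_size, rng):
    """Compose mini-batches per task, pool them into D, and shuffle the order."""
    pool = []
    for task, examples in task_data.items():
        pool.extend(make_batches(examples, batch_size, task))
    rng.shuffle(pool)
    return pool


def train_epoch(task_data, step_fns, batch_size=32, seed=0):
    """Run one epoch; step_fns[task] is the update function for that task."""
    rng = random.Random(seed)
    counts = {task: 0 for task in task_data}
    for task, batch in build_schedule(task_data, batch_size, rng):
        step_fns[task](batch)  # would update the shared encoder and this task's layer
        counts[task] += 1
    return counts
```

Each mini-batch contains examples from a single task, so every step optimizes exactly one of the three losses while still updating the shared parameters.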

Experiments
Datasets We evaluated the proposed multi-task learning method on two benchmark datasets, WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova and Chen, 2015). Each dataset consists of a set of triples in the form (h, r, t). WN18RR is a subset of WordNet, a lexical database of English. Thus, entities in WN18RR are words or short phrases, and there are 11 relations between two words, such as hypernym and similar to. FB15k-237 is a subset of Freebase (Bollacker et al., 2008), a large-scale graph database of general human knowledge. FB15k-237 has more general entities, such as Lincoln and Monaco, and its relations are longer and more complex than those of WN18RR. We used the same entity descriptions as Yao et al. (2019).
Baselines We mainly compare our method with KG-BERT (Yao et al., 2019), and also provide a comparison with several strong models: TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), and RotatE.
Experimental Settings We used pre-trained BERT-base as the shared layer and fine-tuned it in the multi-task setup for 3 epochs. We used a mini-batch size of 32 and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2e-5. For relevance ranking, we tuned the margin λ on the validation set, and λ = 0.1 gave the best results.
Evaluation Settings We evaluate our method on link prediction, where the model predicts the head entity given ( , r, t) and the tail entity given (h, r, ). To compare with prior work, we follow the evaluation protocol and filtered setting of Bordes et al. (2013). Let $E$ be the entity set and $T$ be the set of all triples in the train, validation, and test sets. Then, the set of test candidates $U$ for predicting $h$ in a given triple $(h, r, t)$ is
$U = \{(h, r, t)\} \cup \{(h', r, t) \mid h' \in E \wedge (h', r, t) \notin T\}$.
After the model computes scores for all candidate triples, they are sorted in descending order. Performance is reported in Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@1, 3, 10.
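The filtered ranking protocol and the reported metrics can be sketched as follows. `score_fn` is a hypothetical callable that maps a triple to its ranking score (for the proposed model this would be $s_1$). Ties are resolved in favor of the test triple, which is an assumption and not something the text specifies.

```python
def filtered_rank(test_triple, entities, all_triples, score_fn, corrupt="head"):
    """Rank of the test triple among its filtered candidates (1 is best)."""
    h, r, t = test_triple
    target = score_fn(test_triple)
    rank = 1
    for e in entities:
        candidate = (e, r, t) if corrupt == "head" else (h, r, e)
        # Filtered setting: skip the test triple itself and any other
        # candidate that is a known valid triple in train/valid/test.
        if candidate == test_triple or candidate in all_triples:
            continue
        if score_fn(candidate) > target:
            rank += 1
    return rank


def summarize(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hits@k (as fractions) from a list of ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics
```

In a full evaluation, `filtered_rank` would be called twice per test triple, once with `corrupt="head"` and once with `corrupt="tail"`, and all resulting ranks would be passed to `summarize`.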

Main Results
Table 2 shows how the proposed method improves performance over the baseline model on link prediction. Multi-task learning with two tasks, (LP + RP) or (LP + RR), improves over the baseline by a large margin while maintaining low MR scores. When the model is trained on three tasks (LP + RP + RR), we gain significant improvements, especially in Hits@1 and Hits@3, with gains of 10.8 and 14.0, respectively. Table 3 shows an example of results on WN18RR. We observe that our model ranks the correct answer "breathing time" first among lexically similar words, while KG-BERT places "snorkel breather" and "breath" in the top ranks. More examples are presented in Appendix A.
In the FB15k-237 benchmark, the task becomes more challenging because the number of relations rises to 237, whereas WN18RR contains only 11 relations. Thus, joint training with Relation Prediction (RP) was more effective on FB15k-237: the model outperformed the baseline by 7, 2.5, 2.5, 2.9, and 2 absolute points on MR, MRR, Hits@1, Hits@3, and Hits@10, respectively. When the Relevance Ranking (RR) task is added and the model is trained on all three tasks, it achieves further improvements in all metrics, with gains of 13, 3, 2.8, 3.8, and 3.1 points, respectively.
A comparison with previous models is presented in Table 4. Our model achieved state-of-the-art performance in MR and Hits@10 on WN18RR. On the FB15k-237 dataset, the Hits@10 of our model is lower than that of several models. Since FB15k-237 has more relations and a more complex graph structure than WN18RR, we conjecture that pre-trained language models cannot capture the complex structural information in knowledge graphs. Despite that, we achieved the best MR score on FB15k-237.

Related Work
A common approach to knowledge graph completion is learning vector embeddings of the entities and relations in a KG (Bordes et al., 2013; Yang et al., 2014; Trouillon et al., 2016; Dettmers et al., 2018). The most widely used method is TransE (Bordes et al., 2013), which models relations as translations in a low-dimensional vector space. Dettmers et al. (2018) and Nguyen et al. (2018) proposed embedding models that use a convolutional neural network. Recent research has shown that modeling relations in a complex vector space can infer connectivity patterns: symmetry/antisymmetry, inversion, and composition. On the other hand, Yao et al. (2019) proposed KG-BERT, which uses pre-trained language models (PLMs) with entity descriptions. It can capture the contextualized meaning of entities and significantly improves mean ranks with rich linguistic information from PLMs.
Multi-task learning has been popular in natural language processing for over a decade and has been applied to a variety of tasks (Collobert and Weston, 2008; Luong et al., 2015; Hashimoto et al., 2017; Liu et al., 2019b). It aims to regularize deep learning models against overfitting by sharing parameters across different tasks while training them jointly. With the advent of powerful PLMs such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019), multi-task learning is applied by sharing the pre-trained parameters of these models while training different tasks simultaneously.

Conclusion and Future Work
We propose an effective multi-task learning method for knowledge graph completion by combining relation prediction and relevance ranking tasks with link prediction. Experimental results demonstrate that our method outperforms previous strong baselines, and we largely improve MRR and Hits@k compared to the previous KG-BERT model.
In the future, we plan to investigate how to combine pre-trained language models and graph embedding methods to fully utilize the prior linguistic information of pre-trained models and graph structural information.
Appendix A Examples of the results in Link Prediction
E1. Given (take a breather, derivationally related form, ), the answer is breathing time.
In example 1, the entity breathing time appears only once in the training set. Thus, methods that use only graph structure information, such as TransE and RotatE, cannot predict well on the given triple. Our model provides the correct answer, while KG-BERT gives its top scores to snorkel breather and breath because of their lexical overlap on breath. In example 2, the entity piece of music has many relationships with other entities; thus, most models show low performance on that example. Lastly, example 3 shows how the pre-trained language model (PLM) improves Mean Rank significantly. KG-BERT and our model give a high score to the answer programme using prior linguistic information from the PLM, whereas TransE and RotatE rank it extremely low.