Integrating Graph Contextualized Knowledge into Pre-trained Language Models

Complex node interactions are common in knowledge graphs (KGs), and these interactions can be regarded as contextualized knowledge that exists in the topological structure of KGs. Traditional knowledge representation learning (KRL) methods usually treat a single triple as a training unit, neglecting graph contextualized knowledge. To exploit this unexploited graph-level knowledge, we propose an approach that models subgraphs in a medical KG. The learned knowledge is then integrated with a pre-trained language model for knowledge generalization. Experimental results demonstrate that our model achieves state-of-the-art performance on several medical NLP tasks, and the improvement over MedERNIE indicates that graph contextualized knowledge is beneficial.


Introduction
In 1954, Harris (1954) proposed the distributional hypothesis: words that occur in the same contexts tend to have similar meanings. Firth (1957) explained the context-dependent nature of meaning in linguistics with his famous quotation "you shall know a word by the company it keeps". Although the distributional hypothesis was proposed for language models, if we look at knowledge graphs (KGs) from the perspective of this hypothesis, we find that a similar hypothesis holds in KGs as well. We call it the KG distributional hypothesis: you shall know an entity by the relationships it involves.
Given this hypothesis, contextualized information in language models can be mapped to knowledge graphs, which we call "graph contextualized knowledge". Figure 1 illustrates a knowledge subgraph that includes several medical entities. In this figure, four incoming and four outgoing neighboring nodes (hereinafter "in-entities" and "out-entities") of the node "Bacterial pneumonia" are linked by various relation paths. These linked nodes and correlations can be seen as the "graph contextualized information" of the entity node "Bacterial pneumonia". In this study, we explore how to integrate graph contextualized knowledge into pre-trained language models.
Pre-trained language models learn contextualized word representations on large-scale text corpora through self-supervised learning, and obtain new state-of-the-art (SOTA) results on most downstream tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). This has gradually become a new paradigm for natural language processing research. Recently, several knowledge-enhanced pre-trained language models have been proposed, such as ERNIE-Baidu (Sun et al., 2019), ERNIE-Tsinghua (Zhang et al., 2019a), WKLM (Xiong et al., 2019) and K-ADAPTER (Wang et al., 2020).
In this study, since we need to learn graph contextualized knowledge from a large-scale medical knowledge graph, ERNIE-Tsinghua (hereinafter "ERNIE") is chosen as our backbone model. In ERNIE, entity embeddings are learned by TransE (Bordes et al., 2013), a popular translation-based method for knowledge representation learning (KRL). However, TransE cannot model complex relations, such as 1-to-n, n-to-1 and n-to-n relations. This shortcoming is amplified in the medical knowledge graph, in which many entities have a large number of related neighbors.
Inspired by previous work (Veličković et al., 2018; Nathani et al., 2019), we propose an approach to learn knowledge from subgraphs and inject graph contextualized knowledge into the pre-trained language model. We call this model BERT-MK (a BERT-based language model integrated with Medical Knowledge). Our contributions are as follows:
• We propose a novel knowledge-enhanced pre-trained language model, BERT-MK, for medical NLP tasks, which integrates graph contextualized knowledge learned from the medical KG.
• Experimental results show that BERT-MK achieves better performance than previous state-of-the-art biomedical pre-trained language models on entity typing and relation classification tasks.

Methodology
Our model consists of two modules: a knowledge learning module and a language model pre-training module. The first learns the graph contextualized knowledge existing in KGs, and the second integrates the learned knowledge into the language model for knowledge generalization. The details are described in the following subsections.

Learning Graph Contextualized Knowledge
We denote a knowledge graph as G = (E, R), where E represents the entity set and R is the set of relations between entity pairs. A triple in G is formalized as (e_s, r, e_o), where e_s is a subjective entity, e_o is an objective entity, and r is the relation between e_s and e_o. In Figure 1, two entities (rectangles) and a relation (arrow) between them construct a knowledge triple, for example, (Bacterial pneumonia, causative agent of, Bacteria).

Subgraph Conversion
To enrich the contextualized information in knowledge representations, we extract subgraphs from the knowledge graph as the modeling objectives, and the generation process is described in Algorithm 1. For a given entity, its two 1-hop in-entities and out-entities are sampled to generate a subgraph, and we repeat the generation process M times for each entity. Figure 2(a) shows an instance of a knowledge subgraph, which consists of four 1-hop and four 2-hop relations. In this study, we propose a Transformer-based (Vaswani et al., 2017) module to model subgraphs. Relations are learned
as nodes equivalent to entities in our model, and the relation conversion process is illustrated in Figure 2(b). Therefore, knowledge graph G can be redefined as G = (V, E), where V represents the nodes in G, involving entities in E and relations in R, and E denotes the directed edges among the nodes in V .
Then, subgraphs are converted into sequences of nodes. The conversion result of a subgraph is shown in Figure 2(c), including a node sequence, a node position index matrix and an adjacency matrix. Each row of the node position index matrix corresponds to a triple in the subgraph. For example, the triple (e_1, r_1, e) is represented as the first row (0, 1, 4) in this matrix. In the adjacency matrix, the element A_{ij} equals 1 if node i is connected to node j in Figure 2(b), and 0 otherwise.
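To make the conversion concrete, the following sketch builds the three structures from a list of triples. It is an illustrative simplification: `convert_subgraph` and its indexing scheme are our own, and the exact index layout may differ from the one shown in Figure 2(c).

```python
def convert_subgraph(center, triples):
    """Convert a subgraph (list of (head, relation, tail) triples) into a node
    sequence, a node position index matrix and an adjacency matrix, treating
    relations as nodes on a par with entities."""
    seq = [center]                    # node sequence starts with the center entity
    index = {center: 0}
    positions = []                    # one (head_idx, rel_idx, tail_idx) row per triple
    for h, r, t in triples:
        for node in (h, t):
            if node not in index:
                index[node] = len(seq)
                seq.append(node)
        rel_idx = len(seq)            # each relation mention becomes its own node
        seq.append(r)
        positions.append((index[h], rel_idx, index[t]))
    n = len(seq)
    adj = [[0] * n for _ in range(n)]
    for hi, ri, ti in positions:      # directed edges: head -> relation -> tail
        adj[hi][ri] = 1
        adj[ri][ti] = 1
    return seq, positions, adj
```

Each row of `positions` lets the triple be restored from the encoded node sequence later, while `adj` drives the attention masking described in the next subsection.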

GCKE
After the subgraph conversion preprocessing, the input samples for learning graph contextualized knowledge are generated. Formally, we denote the node sequence as {x_1, . . . , x_N}, where N is the length of the input sequence. Besides, the node position index matrix and the adjacency matrix are defined as P and A, respectively. Entity embeddings and relation embeddings are integrated in the same matrix V, where V ∈ R^{(n_e+n_r)×d}, n_e is the number of entities in E and n_r is the number of relation types in R. The node embeddings X = {x_1, . . . , x_N} are generated by looking up the node sequence {x_1, . . . , x_N} in the embedding matrix V. X, P and A constitute the input of the graph contextualized knowledge embedding learning module, called GCKE, as shown in Figure 3.
The inputs are fed into a Transformer-based model to encode the node information:

$$\tilde{x}_i = \mathop{\Vert}_{h=1}^{H} \sum_{j=1}^{N} \alpha_{ij}^{h} W_v^{h} x_j \quad (1)$$

$$\alpha_{ij}^{h} = \mathrm{softmax}_j\left(s_{ij}^{h}\right) \quad (2)$$

$$s_{ij}^{h} = \mathrm{Masking}\left(\frac{(W_q^{h} x_i)^{\top} (W_k^{h} x_j)}{\sqrt{d/H}},\, A\right) \quad (3)$$

where \tilde{x}_i is the new embedding for node x_i, ‖ denotes the concatenation of the H attention heads in this layer, and α^h_{ij} and W^h_v are the attention weight of node x_j and a linear transformation of the node embedding x_j in the h-th attention head, respectively. The Masking function in Equation 3 restricts the contextualized dependency among the input nodes: only the incoming neighbor nodes and the current node itself are involved in updating the node embedding. The subfigure in the lower right corner of Figure 3 shows these contextualized dependencies. Similar to W^h_v, W^h_q and W^h_k are independent linear transformations of the node embeddings. The updated node representations are then fed into the feed-forward layer for further encoding. The aforementioned Transformer blocks are stacked L times, and the output hidden states are formalized as

$$\{h_1^{L}, \ldots, h_N^{L}\} = \mathrm{Transformer}^{L}(\{x_1, \ldots, x_N\}) \quad (4)$$

Then, the node position indexes P are utilized to restore triple representations:

$$t_i = [h_{P_{i1}}^{L}; h_{P_{i2}}^{L}; h_{P_{i3}}^{L}] \quad (5)$$

where t_i is the representation of the i-th triple. The subfigure in the upper right corner of Figure 3 shows the triple restoration process.
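A minimal single-head version of this masked node update can be sketched as follows. This is illustrative only: the function name and shapes are our own, and the actual module uses multiple heads, stacked layers and a feed-forward sublayer.

```python
import numpy as np

def masked_self_attention(X, A, Wq, Wk, Wv):
    """One single-head, GCKE-style attention update: node i attends only to
    its incoming neighbors (A[j, i] == 1) and itself; all other scores are
    masked to -inf before the softmax."""
    d = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)       # (N, N) raw scores
    mask = (A.T + np.eye(len(X))) > 0                 # allowed dependencies
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax; every row has at least the self-score, so max is finite.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ (X @ Wv)                         # updated node embeddings
```

With an all-zero adjacency matrix each node can attend only to itself, so (with identity projections) the update leaves the embeddings unchanged, which is a convenient sanity check.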
In this study, the translation-based scoring function f(e_s, r, e_o) = ‖e_s + r − e_o‖ is adopted to measure the energy of a knowledge triple. The node embeddings are learned by minimizing a margin-based loss function on the training data:

$$\mathcal{L} = \sum_{t \in T} \sum_{t' \in T'} \max\left(0,\, \gamma + f(t) - f(t')\right) \quad (6)$$

where T is the set of valid triples, γ is the margin, and T' is produced by an entity replacement operation in which the head entity or the tail entity of a triple is replaced such that the resulting triple is invalid in the KG.
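The scoring and loss computation can be sketched as follows. This is a simplification under stated assumptions: we use the L1 norm for the energy and pair each valid triple with one corrupted triple; both helper names are ours.

```python
import numpy as np

def translation_energy(h, r, t):
    """Translation-based energy ||h + r - t||_1; lower means more plausible."""
    return np.abs(h + r - t).sum()

def margin_loss(pos_triples, neg_triples, margin=1.0):
    """Margin-based ranking loss over paired valid / corrupted triples:
    sum of max(0, margin + f(valid) - f(corrupted))."""
    loss = 0.0
    for (h, r, t), (h2, r2, t2) in zip(pos_triples, neg_triples):
        loss += max(0.0, margin + translation_energy(h, r, t)
                         - translation_energy(h2, r2, t2))
    return loss
```

A valid triple whose energy is already at least one margin below its corrupted counterpart contributes zero loss, so training focuses on triples the model still ranks poorly.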

Integrating Knowledge into the Language Model
Given a comprehensive medical knowledge graph, graph contextualized knowledge representations can be learned using the GCKE module. We follow the language model architecture proposed in (Zhang et al., 2019a), and utilize graph contextualized knowledge to enhance medical language representations. The pre-training process is shown in the left part of Figure 3. The Transformer block encodes contextualized word representations, while the aggregator block fuses knowledge and language information. According to the characteristics of medical NLP tasks, a domain-specific finetuning procedure is designed. Similar to BioBERT, the symbols "@" and "$" are used to mark entity boundaries, which indicate the entity positions in a sample and distinguish relation samples sharing the same sentence. For example, the input sequence for the relation classification task can be "pain control was initiated with morphine but was then changed to @ demerol $ , which gave the patient better relief of @ his epigastric pain $". In the entity typing task, the entity mention and its context are critical for predicting the entity type, so more localized features of the entity mention benefit this prediction. In our experiments, the entity start symbol is selected to represent an entity typing sample.
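The boundary-marking step for a relation sample can be sketched as below. The helper is hypothetical; token spans are (start, end) pairs with an exclusive end index.

```python
def mark_entities(tokens, span1, span2):
    """Wrap the two entity spans of a relation sample with the boundary
    symbols '@' (entity start) and '$' (entity end)."""
    out = []
    for i, tok in enumerate(tokens):
        if i in (span1[0], span2[0]):     # opening boundary before the span
            out.append("@")
        out.append(tok)
        if i + 1 in (span1[1], span2[1]): # closing boundary after the span
            out.append("$")
    return " ".join(out)
```

Because both entities in a sentence are marked, two relation samples that share the same sentence but involve different entity pairs produce distinct input sequences.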

Medical Knowledge Graph
The Unified Medical Language System (UMLS) (Bodenreider, 2004) is a comprehensive knowledge base in the biomedical domain, which contains large-scale concept names and relations among them. The metathesaurus in UMLS involves various terminology systems and comprises about 14 million terms covering 25 different languages. In this study, a subset of this knowledge base is extracted to construct the medical knowledge graph. Non-English and long terms are filtered, and the final statistics are shown in Table 1.

Corpus for Pre-training
To ensure that sufficient medical knowledge can be integrated into the language model, PubMed abstracts 2 and PubMed Central full-text papers 3 are chosen as the pre-training corpus, which are open-access datasets for biomedical and life sciences journal literature. Since sentences in different paragraphs may not have good contextual coherence, paragraphs are selected as the document unit for next sentence prediction. The Natural Language Toolkit (NLTK) 4 is utilized to split the sentences within a paragraph, and sentences with fewer than 5 words are discarded. As a result, a large corpus containing 9.9B tokens is obtained for language model pre-training.

Table 2: Statistics of the datasets for the downstream tasks (train / valid / test samples).

Entity Typing:
  JNLPBA (Kim et al., 2004): 51,301 / - / 8,653
  BC5CDR (Li et al., 2016): 9,385 / 9,593 / 9,809
Relation Classification:
  2010 i2b2/VA (Uzuner et al., 2011): 10,233 / - / 19,115
  GAD (Bravo et al., 2015): 5,339 / - / -
  EU-ADR (Van Mulligen et al., 2012): 355 / - / -

In our model, medical terms appearing in the corpus need to be aligned to the entities in the UMLS metathesaurus before pre-training. To ensure the coverage of identified entities in the metathesaurus, the forward maximum matching (FMM) algorithm is used to extract term spans from the aforementioned corpus, and spans shorter than 5 characters are filtered. Then, the BERT vocabulary is used to tokenize the input text into word pieces, and each medical entity is aligned with the first subword of the identified term.

Downstream Tasks
In this study, entity typing and relation classification tasks in the medical domain are used to evaluate the models.
Entity Typing Given a sentence with an entity mention tagged, this task is to identify the semantic type of the mention. For example, the type "medical problem" labels the entity mention "asystole" in the sentence "he had a differential diagnosis of <e> asystole </e>". To the best of our knowledge, there are no publicly available entity typing datasets in the medical domain. Therefore, three entity typing datasets are constructed from the corresponding medical named entity recognition datasets. Entity mentions and entity types are annotated in these datasets; in this study, entity mentions are taken as input while entity types are the output labels. Table 2 shows the statistics of the datasets for the entity typing task. The datasets can be downloaded from here 5 .
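The construction of entity typing samples from a BIO-tagged NER dataset can be sketched as below. This is illustrative: the helper name is ours and the `<e>`/`</e>` marker tokens stand in for whatever boundary markers the tokenizer actually uses.

```python
def ner_to_typing_samples(tokens, bio_tags):
    """Convert one BIO-tagged NER sentence into entity typing samples:
    each sample marks a single mention and uses its NER tag as the label."""
    samples, i = [], 0
    while i < len(bio_tags):
        if bio_tags[i].startswith("B-"):
            label, j = bio_tags[i][2:], i + 1
            while j < len(bio_tags) and bio_tags[j] == "I-" + label:
                j += 1                    # extend the mention over I- tags
            marked = tokens[:i] + ["<e>"] + tokens[i:j] + ["</e>"] + tokens[j:]
            samples.append((" ".join(marked), label))
            i = j
        else:
            i += 1
    return samples
```

A sentence with several mentions yields several typing samples, one per mention, which is how an NER dataset expands into an entity typing dataset.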
Relation Classification Given two entities within one sentence, this task aims to determine the relation type between them. For example, in the sentence "pain control was initiated with morphine but was then changed to <e1> demerol </e1> , which gave the patient better relief of <e2> his epigastric pain </e2> ", the relation type between the two entities is TrIP (Treatment Improves medical Problem). In this study, three relation classification datasets are utilized to evaluate our models, and the statistics of these datasets are shown in Table 2. The datasets can be downloaded from here 6 .

5 https://drive.google.com/file/d/1OletxmPYNkz2ltOr9pyT0b0iBtUWxslh/view

Baselines
In addition to the state-of-the-art models on these datasets, we also include the popular BERT-Base model and two other models pre-trained on biomedical literature for further comparison.
BERT-Base (Devlin et al., 2019) This is the original bidirectional pre-trained language model proposed by Google, which achieves state-of-the-art performance on a wide range of NLP tasks.
BioBERT This model follows the same model architecture as BERT-Base, but uses PubMed abstracts and PubMed Central full-text articles (about 18B tokens) for further training on top of BERT-Base.
SCIBERT (Beltagy et al., 2019) In this model, a new wordpiece vocabulary is built based on a large scientific corpus (about 3.2B tokens). Then, a new BERT-based model is trained from scratch using this scientific vocabulary and the scientific corpus. Since a large portion of the scientific corpus consists of biomedical articles, this scientific vocabulary can also be regarded as a biomedical vocabulary, and helps improve the performance of downstream tasks in the biomedical domain.

Graph Contextualized Knowledge
Firstly, UMLS triples are fed into the TransE model to obtain a basic knowledge representation. We use the OpenKE toolkit to learn entity and relation embeddings. The knowledge embedding dimension is set to 100, and the number of training epochs is set to 10000. Following the initialization method used in (Nguyen et al., 2018; Nathani et al., 2019), the embeddings produced by TransE are utilized to initialize the knowledge representations of the GCKE module. We set the layer number to 4, and each layer contains 4 heads. Since the median degree of entities in UMLS is 4 (shown in Table 1), we sample two in-entities and two out-entities per entity, so each subgraph contains four 1-hop and four 2-hop relations. The GCKE module runs 1200 epochs on a single NVIDIA Tesla V100 (32GB) GPU to learn graph contextualized knowledge. The batch size is set to 50000.

Table 3: Experimental results on the entity typing and relation classification tasks. Accuracy (Acc), Precision, Recall, and F1 scores are used to evaluate the model performance. The results reported in previous work are underlined. E-SVM is short for Ensemble SVM (Bhasuran and Natarajan, 2018), which achieves SOTA performance on the GAD dataset. CNN-M stands for CNN using multi-pooling (He et al., 2019), which is the SOTA model on the 2010 i2b2/VA dataset.

Pre-training
In this study, two pre-trained language models are trained. The first is MedERNIE, a medical ERNIE model trained on the UMLS triples and the PubMed corpus, inheriting the model hyperparameters used in (Zhang et al., 2019a). Then, the entity embeddings learned by the GCKE module are integrated into the language model to train the BERT-MK model. In our work, we use the same number of pre-training epochs as BioBERT, which uses the same pre-training corpus as ours, and finetune the BERT-Base model on the PubMed corpus for one epoch.

Finetune
As shown in Table 2, some datasets have no standard valid or test set. For datasets containing a standard test set but no standard valid set, we divide the training set into new train/valid sets at a 4:1 ratio. We perform each experiment 5 times with different random seeds under the same experimental settings. For datasets without a standard test set, 10-fold cross-validation is used to evaluate model performance. According to the maximum sequence length of the sentences in each dataset, the input sequence lengths for 2010 i2b2/VA (Uzuner et al., 2011), JNLPBA (Kim et al., 2004), BC5CDR (Li et al., 2016), GAD (Bravo et al., 2015) and EU-ADR (Van Mulligen et al., 2012) are set to 390, 280, 280, 130 and 220, respectively. The initial learning rate is set to 2e-5.

Entity Typing

Table 3 presents the experimental results on the entity typing and relation classification tasks. For entity typing, all these pre-trained language models achieve high accuracy, indicating that the type of a medical entity is not as ambiguous as in the general domain. BERT-MK outperforms BERT-Base and BioBERT on three datasets, and is competitive with SCIBERT. Without using external knowledge in the pre-trained language model, SCIBERT achieves results comparable to BERT-MK, which suggests that a domain-specific vocabulary is critical to the feature encoding of inputs. Long tokens are relatively common in the medical domain, and these tokens are split into short pieces when a domain-independent vocabulary is used, which causes an over-generalization of lexical features. Therefore, a medical vocabulary generated from the PubMed corpus could be introduced into BERT-MK in future work.

Relation Classification
On the relation classification tasks, BERT-Base does not perform as well as the other models, which indicates that pre-trained language models require a domain adaptation process when used in restricted domains. Compared with BioBERT, which utilizes the same domain-specific corpus as ours for domain adaptation, BERT-MK improves the F1 score on 2010 i2b2/VA, GAD and EU-ADR by 1.1%, 5.21% and 0.49%, respectively, which demonstrates that medical knowledge has indeed played a positive role in the identification of medical relations.
The following example provides a brief explanation of why medical knowledge improves model performance on relation classification. "On postoperative day number three, patient went into <e1> atrial fibrillation </e1>, which was treated appropriately with <e2> metoprolol </e2> and digoxin and converted back to sinus rhythm" is a relation sample from the 2010 i2b2/VA dataset, and the relation label is TrIP. Meanwhile, this entity pair can be aligned to the knowledge triple (atrial fibrillation, may be treated by, metoprolol) in the medical knowledge graph. Obviously, this knowledge is advantageous for identifying the relation type of the aforementioned example.

TransE vs. GCKE
In order to explicitly analyze the improvement the GCKE module brings to pre-trained language models, we compare MedERNIE (TransE-based) and BERT-MK (GCKE-based) on two relation classification datasets. Table 4 shows the results of these two models. As we can see, integrating graph contextualized knowledge into the pre-trained language model improves the F1 score by 0.9% and 0.64% on the two relation classification datasets, respectively.
In Figure 4, as the amount of pre-training data increases, BERT-MK consistently outperforms MedERNIE on the 2010 i2b2/VA relation dataset, and the performance gap shows an increasing trend. However, on the GAD dataset, the performance curves of BERT-MK and MedERNIE are intertwined. We link the entities in each relation sample to the medical KG, and find that some entity pairs have a connected relationship in the KG. A statistical analysis of 2-hop neighbor relationships between these entity pairs shows that there are 136 such cases in the 2010 i2b2/VA dataset, but only 1 in GAD. The second case shown in Table 5 gives an example of this observation. (CAD, member of, Other ischemic heart disease) and (Other ischemic heart disease, has member, Angina symptom) are triples in the medical KG, which indicates that the entity pair "cad" and "angina symptoms" in the relation sample has a 2-hop neighbor relationship in the KG. GCKE learns these 2-hop neighbor relationships in 2010 i2b2/VA and produces an improvement for BERT-MK. However, due to the characteristics of the GAD dataset, the capability of GCKE is limited there.
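The 2-hop neighbor statistic above can be computed with a check like the following. It is a simplified, direction-agnostic version, and the function name is ours.

```python
def two_hop_connected(kg_triples, e1, e2):
    """Check whether e1 reaches e2 through exactly one intermediate entity
    (a 2-hop neighbor relationship), ignoring edge direction for simplicity."""
    neighbours = {}
    for h, _, t in kg_triples:          # build an undirected adjacency map
        neighbours.setdefault(h, set()).add(t)
        neighbours.setdefault(t, set()).add(h)
    return any(e2 in neighbours.get(mid, set())
               for mid in neighbours.get(e1, set()) if mid not in (e1, e2))
```

Applied to the Table 5 example, the intermediate entity "Other ischemic heart disease" links "CAD" to "Angina symptom", so the pair counts as 2-hop connected.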

Effect of Different Corpus Sizes in Pre-training

Figure 4 shows the model performance with different proportions of the pre-training corpus. From this figure, we observe that BERT-MK outperforms BioBERT using only 10%-20% of the corpus, which indicates that medical knowledge has the capability to enhance pre-trained language models and save computational costs (Schwartz et al., 2019).

Related Work
Pre-trained language models represented by ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) have attracted great attention, and a large number of variant models have been proposed. Among these studies, some researchers devote their efforts to introducing knowledge into language models (Levine et al., 2019; Lauscher et al., 2019; Liu et al., 2019; Zhang et al., 2019b). ERNIE-Baidu (Sun et al., 2019) introduces new masking units such as phrases and entities to learn the knowledge contained in these units. As a reward, syntactic and semantic information from phrases and entities is implicitly integrated into the language model. Furthermore, a different kind of knowledge is explored in ERNIE-Tsinghua (Zhang et al., 2019a), which incorporates a knowledge graph into BERT to learn lexical, syntactic and knowledge information simultaneously. Xiong et al. (2019) introduce an entity replacement checking task into the pre-trained language model, and improve several entity-related downstream tasks, such as question answering and entity typing. Wang et al. (2020) propose a plug-in way to infuse knowledge into language models, keeping different kinds of knowledge in different adapters. The knowledge introduced by these methods does not pay much attention to the graph contextualized knowledge in the KG. Recently, several KRL methods have attempted to introduce more contextualized information into knowledge representations. Relational Graph Convolutional Networks (R-GCNs) (Schlichtkrull et al., 2018) are proposed to learn entity embeddings from incoming neighbors, which greatly enhances the information interaction between related triples. Nathani et al.
(2019) further extend the information flow from 1-hop in-entities to n-hop during the learning of entity representations, and achieve SOTA performance on multiple relation prediction datasets, especially those containing high in-degree nodes. We believe that the information contained in knowledge graphs is far from sufficiently exploited. In this study, we develop an approach that integrates more graph contextualized information by modeling subgraphs as training samples. This module has the ability to model arbitrary subgraph structures in the KG. In addition, the learned knowledge is integrated into the language model to obtain an enhanced version of the medical pre-trained language model.

Conclusion and Future Work
We propose a novel approach to learn more comprehensive knowledge, focusing on modeling subgraphs in the knowledge graph with a knowledge learning module. The learned medical knowledge is then integrated into the pre-trained language model, which outperforms BERT-Base and two other domain-specific pre-trained language models on several medical NLP tasks. Our work validates the intuition that medical knowledge benefits some medical NLP tasks and provides a preliminary exploration of the application of medical knowledge.
In follow-up work, knowledge-guided tasks will be used to validate the effectiveness of the knowledge learning module GCKE. Moreover, we will explore other ways to inject medical knowledge into language models, such as multi-task learning. More subgraph sampling strategies also need to be explored, such as the r-ego subgraph (Qiu et al., 2020) and degree-dependent subgraphs.