Improving Biomedical Pretrained Language Models with Knowledge

Pretrained language models have shown success in many natural language processing tasks, and many works explore incorporating knowledge into language models. In the biomedical domain, experts have spent decades building large-scale knowledge bases. For example, UMLS contains millions of entities with their synonyms and defines hundreds of relations among entities. Leveraging this knowledge can benefit a variety of downstream tasks such as named entity recognition and relation extraction. To this end, we propose KeBioLM, a biomedical pretrained language model that explicitly leverages knowledge from the UMLS knowledge base. Specifically, we extract entities from PubMed abstracts and link them to UMLS. We then train a knowledge-aware language model that first applies a text-only encoding layer to learn entity representations and then applies a text-entity fusion encoding to aggregate entity representations. In addition, we add two training objectives, entity detection and entity linking. Experiments on the named entity recognition and relation extraction tasks from the BLURB benchmark demonstrate the effectiveness of our approach. Further analysis on a collected probing dataset shows that our model is better at modeling medical knowledge.

As applied disciplines that rely heavily on facts and evidence, the biomedical and clinical fields have accumulated data and knowledge from very early on (Ashburner et al., 2000; Stearns et al., 2001). One of the most representative works is the Unified Medical Language System (UMLS) (Bodenreider, 2004), which contains more than 4M entities with their synonyms and defines over 900 kinds of relations. Figure 1 shows an example: the two entities "glycerin" and "inflammation" can be linked to C0017861 (1,2,3-Propanetriol) and C0011603 (dermatitis) respectively, with a may_prevent relation between them in UMLS. As the most important facts in biomedical text, entities and relations can provide information for better text understanding (Xu et al., 2018; Yuan et al., 2020).
To this end, we propose to improve biomedical PLMs with explicit knowledge modeling. First, we process PubMed text to link entities to the knowledge base: we apply ScispaCy, an entity recognition and linking tool, to annotate 660M entities in 3.5M documents. Second, we implement a knowledge-enhanced language model based on Févry et al. (2020), which performs a text-only encoding followed by a text-entity fusion encoding. The text-only encoding is responsible for bridging text and entities; the text-entity fusion encoding fuses information from tokens with knowledge from entities. Finally, two objectives, entity detection and entity linking, are added to learn better entity representations. Notably, we initialize the entity embeddings with TransE (Bordes et al., 2013), which leverages not only entity but also relation information from the knowledge graph.
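TransE models a relation as a translation in embedding space: for a valid triplet (h, r, t), the head embedding plus the relation embedding should land near the tail embedding. A minimal sketch of the TransE scoring idea follows; the toy vectors are illustrative, not real UMLS embeddings.

```python
import math

def transe_score(head, relation, tail):
    """TransE plausibility score: negative L2 distance ||h + r - t||.
    Closer to zero means the triplet is more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2
                          for h, r, t in zip(head, relation, tail)))

# Toy 3-d embeddings where head + relation = tail exactly
h = [0.1, 0.2, 0.3]
r = [0.4, 0.0, -0.1]
t = [0.5, 0.2, 0.2]
print(transe_score(h, r, t))                 # near 0: plausible triplet
print(transe_score(h, r, [1.0, 1.0, 1.0]))   # far from 0: implausible triplet
```

Embeddings pretrained with this objective encode which entities are related and how, which is what the initialization carries into KeBioLM.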
We conduct experiments on the named entity recognition (NER) and relation extraction (RE) tasks in the BLURB benchmark. Results show that KeBioLM outperforms previous work, with average scores of 87.1 and 81.2 on 5 NER datasets and 3 RE datasets respectively. Furthermore, KeBioLM achieves better performance on a probing task that requires models to fill in the masked entity in UMLS triplets.
We summarize our contributions as follows: • We propose KeBioLM, a biomedical pretrained language model that explicitly incorporates knowledge from UMLS.
• We conduct experiments on 5 NER datasets and 3 RE datasets. Results demonstrate that our KeBioLM achieves the best performance on both NER and RE tasks.
• We collect a cloze-style probing dataset from UMLS relation triplets. The probing results show that our KeBioLM absorbs more knowledge than other biomedical PLMs.
2 Related Work

Biomedical PLMs
Models like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) show the effectiveness of the paradigm of first pretraining an LM on unlabeled text and then fine-tuning it on downstream NLP tasks. However, directly applying LMs pretrained on encyclopedia and web text usually fails in the biomedical domain because of its distinctive terminology and idioms. The gap between the general and biomedical domains has inspired researchers to propose LMs specially tailored to the biomedical domain. BioBERT (Lee et al., 2020) is the most widely used biomedical PLM; it is trained on PubMed abstracts and PMC articles and outperforms vanilla BERT on named entity recognition, relation extraction, and question answering tasks. Jin et al. (2019) train BioELMo on PubMed abstracts and find that features extracted by BioELMo contain entity-type and relational information. Different training corpora have been used to enhance performance on sub-domain tasks. ClinicalBERT (Alsentzer et al., 2019), BlueBERT (Peng et al., 2019), and bio-lm (Lewis et al., 2020) utilize clinical notes from MIMIC to improve clinical downstream tasks. SciBERT uses papers from the biomedical and computer science domains as training corpora, with a new vocabulary. KeBioLM is trained on PubMed abstracts to adapt to PubMed-related downstream tasks.
To understand the factors in pretraining biomedical LMs, Gu et al. (2020) systematically study pretraining techniques and propose PubMedBERT, pretrained from scratch with an in-domain vocabulary. Lewis et al. (2020) also find that using an in-domain vocabulary enhances downstream performance. This inspires us to use an in-domain vocabulary for KeBioLM.

Knowledge-enhanced LMs
LMs like ELMo and BERT are trained to predict correlations between tokens, ignoring the meanings behind them. To capture both textual and conceptual information, several knowledge-enhanced PLMs have been proposed.
Entities are used to bridge tokens and knowledge graphs. Zhang et al. (2019) align tokens and entities within sentences, and aggregate token and entity representations via two multi-head self-attention modules. KnowBert (Peters et al., 2019) and Entities as Experts (EAE) (Févry et al., 2020) use an entity linker to perform entity disambiguation for candidate entity spans and enhance token representations with entity embeddings. Inspired by entity-enhanced PLMs, we follow the EAE model to inject biomedical knowledge into KeBioLM by performing entity detection and linking.
Relation triplets provide intrinsic knowledge between entity pairs. KEPLER learns knowledge embeddings from relation triplets while pretraining. K-BERT (Liu et al., 2020) converts input sentences into sentence trees using relation triplets to infuse knowledge. In the biomedical domain, He et al. (2020) inject disease knowledge into existing PLMs by predicting disease names and aspects on Wikipedia passages. Michalopoulos et al. (2020) use UMLS synonyms to supervise masked language modeling. We propose KeBioLM to infuse various kinds of biomedical knowledge from UMLS, including but not limited to diseases.

3 Approach
In this paper, we assume access to an entity set E = {e_1, ..., e_t}. For a sentence x = {x_1, ..., x_n}, we assume some spans m = (x_i, ..., x_j) can be grounded to one or more entities in E. We further assume these spans are disjoint. In this paper, we use UMLS as the entity set.

Model Architecture
To explicitly model both textual and conceptual information, we follow Févry et al. (2020) and use a multi-layer self-attention network to encode both text and entities. The model can be viewed as building links between text and entities in the lower layers and fusing the text and entity representations in the upper layers. The overall architecture is shown in Figure 2. More specifically, we use PubMedBERT (Gu et al., 2020) as our backbone and split its layers into two groups, which perform a text-only encoding and a text-entity fusion encoding respectively.
Text-only encoding. For the first group, which is closer to the input, we extract the final hidden states {h_i} and perform a token-wise classification to identify whether each token is at the beginning, inside, or outside of a mention (i.e., the BIO scheme). The probabilities of the B/I/O labels {l_i} are written as:

p(l_i | x) = softmax(W_l h_i + b_l)    (1)

After identifying the mention boundaries, we maintain a function M(i) → E ∪ {NIL}, which returns the entity to which the i-th token belongs. We collect the mentions within a sentence x. For a mention m = (s, t), where s and t represent the starting and ending indexes of m, we encode it as the concatenation of the hidden states of the boundary tokens: h_m = [h_s; h_t]. For an entity e_j ∈ E in the KG, we denote its entity embedding as e_j. For a mention m, we search for the k nearest entities to its projected representation h'_m = W_m h_m + b_m in the entity embedding space, obtaining a set of entities E'. The normalized similarity between h'_m and e_j is calculated as:

a_j = exp(h'_m · e_j) / Σ_{e_l ∈ E'} exp(h'_m · e_l)    (2)

The additional entity representation e_m of m is calculated as a weighted sum of the embeddings:

e_m = Σ_{e_j ∈ E'} a_j · e_j    (3)
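The mention encoding and entity attention above can be sketched in plain Python; the mention projection W_m, real hidden states, and real entity embeddings are omitted, and all vectors are toy values:

```python
import math

def mention_repr(hidden, s, t):
    """Encode mention (s, t) as the concatenation of its boundary hidden
    states [h_s; h_t] (lists stand in for vectors, so + concatenates)."""
    return hidden[s] + hidden[t]

def entity_attention(h_m, entity_embs):
    """Softmax-normalized dot-product similarity between the (projected)
    mention representation and each candidate entity embedding, then the
    weighted sum of candidate embeddings forming the entity representation."""
    sims = [sum(a * b for a, b in zip(h_m, e)) for e in entity_embs]
    z = sum(math.exp(s) for s in sims)
    weights = [math.exp(s) / z for s in sims]
    dim = len(entity_embs[0])
    e_m = [sum(w * e[d] for w, e in zip(weights, entity_embs)) for d in range(dim)]
    return weights, e_m
```

In the real model the candidates are the k = 100 nearest entities in embedding space rather than an explicit list.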
Text-entity fusion encoding. After obtaining the mentions and entities, we fuse the entity embeddings with the text embeddings by summation. For the i-th token, the entity-enhanced embedding is calculated as:

h*_i = h_i + e_m  if M(i) = m;  h*_i = h_i  if M(i) = NIL    (4)

where M(i) = m indicates that the i-th token belongs to the mention with entity representation e_m. The sequence h*_1, ..., h*_n is then fed into the second group of transformer layers to generate text-entity representations. The final hidden states are calculated as:

h^f_1, ..., h^f_n = Transformer(h*_1, ..., h*_n)    (5)
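The summation fusion can be sketched as follows, with a `mention_of` list playing the role of M(i) (None standing for NIL); names and data layout are illustrative:

```python
def fuse_entities(hidden, mention_of, entity_repr):
    """Add the mention's entity representation e_m to each token inside a
    mention; tokens outside any mention (mention_of[i] is None) pass through
    unchanged, mirroring the M(i) = NIL case."""
    fused = []
    for i, h in enumerate(hidden):
        m = mention_of[i]
        if m is None:
            fused.append(list(h))
        else:
            e = entity_repr[m]
            fused.append([a + b for a, b in zip(h, e)])
    return fused
```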

Pretraining Tasks
We have three pretraining tasks for KeBioLM. Masked language modeling is a cloze-style task for predicting masked tokens. Since entities are the main focus of our model, we add two further tasks, entity detection and entity linking, following Févry et al. (2020). We jointly minimize the following loss:

L = L_MLM + L_ED + L_EL    (6)

Masked Language Modeling. Like BERT and other LMs, we predict the masked tokens {x_i} in the input using the final hidden representations {h^f_i}. The loss L_MLM is the cross-entropy between the masked and predicted tokens:

L_MLM = -Σ_i log p(x_i | h^f_i)    (7)

Whole word masking has proven successful in training masked language models (Devlin et al., 2019; Cui et al., 2019). In the biomedical domain, entities are the semantic units of texts, so we extend this technique to whole entity masking: we mask all tokens within a word or entity span. KeBioLM replaces 12% of tokens with [MASK] and 1.5% of tokens with random tokens. This makes it more difficult for the model to recover the masked tokens, which leads to learning better entity representations.
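Whole entity masking can be sketched as below; the sketch treats entity spans and remaining single tokens as masking units and masks every token inside a chosen unit. The random-token replacement and word-level grouping of the real setup are omitted, and the span format is assumed:

```python
import random

def whole_entity_mask(tokens, entity_spans, mask_prob=0.12, seed=0):
    """Mask whole units: each entity span (s, t) is a unit, every token not
    inside an entity is its own unit, and a chosen unit is masked entirely."""
    rng = random.Random(seed)
    in_entity = set(i for s, t in entity_spans for i in range(s, t + 1))
    units = entity_spans + [(i, i) for i in range(len(tokens)) if i not in in_entity]
    out = list(tokens)
    for s, t in units:
        if rng.random() < mask_prob:
            for i in range(s, t + 1):
                out[i] = "[MASK]"
    return out
```

The key property is that an entity is either fully masked or fully visible, so the model cannot recover one wordpiece of an entity name by copying its neighbors.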
Entity Detection. Entity detection is an important task in biomedical NLP for linking tokens to entities. We add an entity detection loss by calculating the cross-entropy over the BIO labels:

L_ED = -Σ_i log p(l_i | x)    (8)

Entity Linking. Linking different names of one medical entity to the same index permits the model to learn better text-entity representations. To link mentions {m} in texts to entities in the entity set E, we calculate a cross-entropy loss using the similarities between {h'_m} and the entities in E.
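All three objectives are cross-entropies; a toy sketch of how they combine into the joint loss (batching, masking, and model internals are simplified away, and the term format is hypothetical):

```python
import math

def cross_entropy(logits, gold):
    """Cross-entropy of one prediction: -log softmax(logits)[gold]."""
    z = sum(math.exp(l) for l in logits)
    return -math.log(math.exp(logits[gold]) / z)

def joint_loss(mlm_terms, ed_terms, el_terms):
    """Joint pretraining loss L = L_MLM + L_ED + L_EL, where each objective
    is the mean cross-entropy over its (logits, gold index) pairs."""
    mean = lambda terms: sum(cross_entropy(l, g) for l, g in terms) / len(terms)
    return mean(mlm_terms) + mean(ed_terms) + mean(el_terms)
```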

Data Creation
Given a sentence from PubMed, we need to recognize entities and link them to the UMLS knowledge base. We use ScispaCy, a robust biomedical NER and entity linking model, to annotate the sentences. Unlike previous work (Vashishth et al., 2020) that only retains recognized entities in a subset of the Medical Subject Headings (MeSH) (Lipscomb, 2000), we relax this restriction and annotate all entities in the UMLS 2020 AA release whose linking scores are higher than a threshold of 0.85.
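The filtering step amounts to a score threshold over candidate CUIs. A minimal sketch follows; the mention/candidate format here is a hypothetical stand-in for the linker's output, not ScispaCy's actual API:

```python
def filter_linked_entities(mentions, threshold=0.85):
    """Keep mentions whose best UMLS candidate scores at or above the
    threshold. Each mention is (text, [(cui, score), ...]) with candidates
    assumed sorted by descending score; returns (text, cui) pairs."""
    kept = []
    for text, candidates in mentions:
        if candidates and candidates[0][1] >= threshold:
            kept.append((text, candidates[0][0]))
    return kept
```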

Experiments
In this section, we first introduce the pretraining details of KeBioLM. Then we introduce the BLURB datasets for evaluating our approach. Finally, we introduce a probing dataset based on UMLS triplets for evaluating knowledge modeling.

Pretraining Details
We use ScispaCy to acquire 477K CUIs and 660M entities in 3.5M PubMed documents from the PubMedDS dataset (Vashishth et al., 2020). We also use the vocabulary from PubMedBERT. AdamW (Loshchilov and Hutter, 2017) is used as the optimizer, with 10,000 warmup steps and linear decay. We use an 8-layer transformer for text-only encoding and a 4-layer transformer for text-entity fusion encoding. We set the learning rate to 5e-5, the batch size to 512, the max sequence length to 512, and the number of training epochs to 2. For each input sequence, we limit the entity count to 50; excessive entities are truncated. To generate the entity representation e_m, the k = 100 most similar entities are used. We train our model on 8 NVIDIA 16GB V100 GPUs.

Datasets
In this section, we evaluate KeBioLM on the NER and RE tasks of the BLURB benchmark (https://microsoft.github.io/BLURB/) (Gu et al., 2020). For all tasks, we use the preprocessed version from BLURB and measure performance in terms of F1-score. Table 1 shows the counts of training instances in the BLURB datasets (i.e., annotated mentions for NER datasets and sentences with two mentions for RE datasets). We also report the count of annotated mentions in each dataset that overlap with the UMLS 2020 release and with KeBioLM. The percentage of mentions overlapping with KeBioLM ranges from 8.7% (NCBI-disease) to 58.5% (DDI), which indicates that KeBioLM learns entity knowledge related to the downstream tasks. JNLPBA (Collier and Kim, 2004) includes 2,000 PubMed abstracts for identifying molecular biology-related entities. We ignore entity types in JNLPBA following Gu et al. (2020).

Relation Extraction
ChemProt (Krallinger et al., 2017) classifies the relation between chemicals and proteins within sentences from PubMed abstracts. Sentences are classified into 6 classes including a negative class.
DDI (Herrero-Zazo et al., 2013) is an RE dataset with sentence-level drug-drug relations on PubMed abstracts. There are four relation classes: advice, effect, mechanism, and false.
GAD (Bravo et al., 2015) is a gene-disease relation binary classification dataset collected from PubMed sentences.

Fine-tuning Details
NER We follow Gu et al. (2020) in using the BIO tagging scheme and ignore entity types in the NER datasets. We classify token labels with a linear layer on top of the hidden representations.
RE We replace the entity mentions in the RE datasets with entity indicators such as @DISEASE$ or @GENE$ to prevent models from classifying relations by memorizing entity names. We add these entity indicators to the vocabulary of the LMs. We concatenate the representations of the two concerned entities and feed them into a linear layer for relation classification.
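The indicator substitution can be sketched over token-level spans; the span format is assumed for illustration and does not reflect BLURB's actual preprocessing code:

```python
def replace_mentions(tokens, mentions):
    """Replace each mention span (s, t, type) with a typed indicator token,
    e.g. @GENE$. Spans are processed right-to-left so earlier spans keep
    their original indices after each replacement."""
    out = list(tokens)
    for s, t, etype in sorted(mentions, key=lambda m: -m[0]):
        out[s:t + 1] = ["@%s$" % etype]
    return out
```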
Parameters We adopt AdamW as the optimizer, with a linear warmup over the first 10% of steps followed by linear decay. We search the learning rate among 1e-5, 3e-5, and 5e-5. We fine-tune for 60 epochs, evaluate the model at the end of each epoch, and choose the best model according to the score on the development set. We set the batch size to 16 when fine-tuning. The maximal input length is 512 for all NER datasets; we truncate ChemProt and DDI to 256 tokens and GAD to 128 tokens. For a fair comparison, we fine-tune our model with 5 different seeds and report the average score.

Results
We compare KeBioLM with the following base-size biomedical PLMs on the above-mentioned datasets: BioBERT (Lee et al., 2020), SciBERT, ClinicalBERT (Alsentzer et al., 2019), BlueBERT (Peng et al., 2019), bio-lm (Lewis et al., 2020), diseaseBERT (He et al., 2020), and PubMedBERT (Gu et al., 2020). Table 2 shows the main results on the NER and RE datasets of the BLURB benchmark, along with average scores for the NER and RE tasks respectively. KeBioLM achieves state-of-the-art performance on both NER and RE. Compared with the strong baseline BioBERT, KeBioLM shows stable improvements on the NER and RE datasets (+1.1 in NER, +1.9 in RE). Compared with our baseline model PubMedBERT, KeBioLM performs significantly better on BC5dis, NCBI, JNLPBA, ChemProt, and GAD (p ≤ 0.05 based on a one-sample t-test) and achieves better average scores (+0.8 in NER, +0.6 in RE). DiseaseBERT is carefully designed for predicting disease names and aspects, which leads to better performance on the BC5dis dataset (+0.4); however, it only reports promising results on disease-related tasks, whereas our model obtains consistently promising performance across all kinds of biomedical tasks. On the BC2GM dataset, KeBioLM outperforms PubMedBERT and all other PLMs except bio-lm, and the standard deviation on BC2GM is evidently larger than on the other tasks. The other exception is the DDI dataset, where we observe a slight performance degradation compared to PubMedBERT (-0.5). The average performance demonstrates that fusing entity knowledge into the LM boosts performance across the board.
Table 3: Ablation studies for the KeBioLM architecture on the BLURB benchmark. We use -wem, +rand, and +frz to denote pretraining settings (a), (b), and (c), respectively.

Ablation Test
We conduct ablation tests to validate the effectiveness of each part in KeBioLM. We pretrain the model with the following settings and reuse the same parameters described above: (a) Remove whole entity masking and retain whole word masking while pretraining (-wem); (b) Initialize entity embeddings randomly (+rand); (c) Initialize entity embeddings by TransE and freeze the entity embeddings while pretraining (+frz).
In Table 3, we observe the following. First, comparing KeBioLM with setting (a) shows that whole entity masking boosts performance consistently on all datasets (+0.5 in NER, +0.9 in RE). Second, comparing KeBioLM with setting (b) indicates that initializing the entity embeddings randomly degrades performance on both NER and RE tasks (-0.4 in NER, -1.2 in RE): entity embeddings initialized by TransE utilize the relation knowledge in UMLS and enhance the results. Third, freezing the entity embeddings in setting (c) reduces performance on all datasets except BC2GM compared to KeBioLM (-0.4 in NER, -1.1 in RE). This indicates that updating the entity embeddings while pretraining helps KeBioLM learn better text-entity representations, which leads to better downstream performance.
To evaluate how the number of transformer layers affects our model, we pretrain KeBioLM with different numbers of layers. For convenience of notation, let l_0 denote the layer count of the text-only encoding and l_1 the layer count of the text-entity fusion encoding. We compare the following settings, with results shown in Table 4: (i) our base model with l_0 = 8 and l_1 = 4; (ii) a setting that shifts layers from text-only to text-entity fusion encoding; (iii) l_0 = 12 and l_1 = 0. Our base model (i) performs better than setting (ii) (+0.3 in NER, +0.7 in RE). Setting (iii) is equivalent to a traditional BERT model with the additional entity detection and entity linking tasks. The comparison between (i) and (iii) indicates that text-entity representations outperform text-only representations (+0.5 in NER, +0.9 in RE) with the same number of parameters.

UMLS Knowledge Probing
We establish a probing dataset based on UMLS triplets to evaluate how LMs understand medical knowledge via pretraining.

Probing Dataset
UMLS triplets are stored in the form (s, r, o), where s and o are CUIs in UMLS and r is a relation type. We generate two queries for each triplet based on the names of the CUIs and the relation type: one query masks the name of s, and the other masks the name of o. For relation names ending with "of", "as", and "by", we add "is" in front of the relation name. For instance, translation_of is converted to is translation of, classified_as is converted to is classified as, and used_by is converted to is used by. We summarize the generated UMLS relation probing dataset in Table 5. Unlike LAMA (Petroni et al., 2019) and X-FACTR (Jiang et al., 2020), which contain fewer than 50 kinds of relations, our probing task is more difficult, requiring a model to decode entities over 900 kinds of relations.
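The query construction described above can be sketched as follows (function names are illustrative; the exact surface templates are a plausible reading of the rules in the text):

```python
def relation_phrase(relation):
    """Turn a UMLS relation name into a query phrase: underscores become
    spaces, and names ending with 'of', 'as', or 'by' get a leading 'is'."""
    words = relation.split("_")
    phrase = " ".join(words)
    if words[-1] in ("of", "as", "by"):
        phrase = "is " + phrase
    return phrase

def make_queries(s_name, relation, o_name, n_mask):
    """Two cloze queries per triplet: one masking the subject name with
    n_mask [MASK] tokens, one masking the object name."""
    rel = relation_phrase(relation)
    masks = " ".join(["[MASK]"] * n_mask)
    return ("%s %s %s." % (masks, rel, o_name),
            "%s %s %s." % (s_name, rel, masks))
```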

Multi [MASK] Decoding
To probe PLMs with the generated queries, we require models to recover the masked tokens. Since biomedical entities are usually formed of multiple words, and each word can be tokenized into several wordpieces (Wu et al., 2016), models have to recover multiple [MASK] tokens. We limit the max length of one entity to 10 tokens for decoding. We decode the multiple [MASK] tokens using the confidence-based method described in Jiang et al. (2020), and we also implement a beam search for decoding. Unlike beam search in machine translation, which decodes tokens from left to right, we decode tokens in arbitrary order. At each step, we calculate the probabilities of all undecoded masked tokens based on the original input and the already decoded tokens, and commit only one token among the undecoded positions, keeping the hypotheses with the top B = 5 accumulated log probabilities. Decoding finishes after as many iterations as there are [MASK] tokens, and we keep the best B = 5 decoding results. We skip the refinement stage since it is time-consuming and does not significantly improve the results.
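A minimal sketch of the arbitrary-order beam search is below. `logprob_fn` stands in for a masked-LM forward pass (the real decoder scores candidates with the PLM over its full vocabulary); the small candidate list and scoring function here are toy assumptions:

```python
def beam_decode(n_mask, candidates, logprob_fn, beam=5):
    """Arbitrary-order beam search over n_mask positions: at each step, every
    beam commits one token at any still-undecoded position, and the top `beam`
    partial decodings by accumulated log-probability survive.

    logprob_fn(filled, pos, tok) -> log p(tok at position pos | committed
    tokens), where `filled` is a dict {position: token}."""
    beams = [({}, 0.0)]  # (committed tokens, accumulated log prob)
    for _ in range(n_mask):
        expanded = []
        for filled, lp in beams:
            for pos in range(n_mask):
                if pos in filled:
                    continue
                for tok in candidates:
                    new = dict(filled)
                    new[pos] = tok
                    expanded.append((new, lp + logprob_fn(filled, pos, tok)))
        # deduplicate identical partial decodings, then keep the best `beam`
        best = {}
        for filled, lp in expanded:
            key = tuple(sorted(filled.items()))
            if key not in best or lp > best[key]:
                best[key] = lp
        beams = sorted(((dict(k), v) for k, v in best.items()),
                       key=lambda b: -b[1])[:beam]
    return [([filled[i] for i in range(n_mask)], lp) for filled, lp in beams]
```

Because the most confident position is committed first, an easy wordpiece can anchor the prediction of a harder one regardless of their left-to-right order.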

Evaluation Metric
Since multiple correct CUIs may exist for one query, we consider a model to have answered a query correctly if any decoded sequence, at any [MASK] length, hits a name of any of the correct CUIs. We evaluate the probing results with relation-level macro-recall@5.
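The metric reduces to a per-relation mean of hit rates, averaged over relations so that frequent relations do not dominate. A sketch (the boolean hit computation against decoded strings is assumed to happen upstream):

```python
def macro_recall_at_5(results):
    """results maps relation -> list of booleans, one per query, where True
    means some correct CUI name appeared in the top-5 decodings. Returns the
    macro average: the mean of per-relation recall values."""
    per_relation = [sum(hits) / len(hits) for hits in results.values()]
    return sum(per_relation) / len(per_relation)
```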

Probing Results
We classify probing queries into two types based on their difficulty. Type 1: the answer appears within the query (24,260 queries); Type 2: the answer does not appear in the query (119,511 queries).
Here are examples of Type 1 (Q_1 and A_1) and Type 2 (Q_2 and A_2) queries: • Q_1: [MASK] has form tacrolimus monohydrate.
• A_2: C0001614: adrenal cortex disease
Table 6 summarizes the probing results of different PLMs by query type. The checkpoints of BioBERT and PubMedBERT lack a cls/predictions layer and cannot perform the probe directly. Compared to the other PLMs, KeBioLM achieves the best scores on both query types and outperforms BlueBERT and ClinicalBERT by a large margin, which indicates that KeBioLM learns more medical knowledge. Table 7 lists some probing examples. SciBERT can decode medical entities for the [MASK] tokens, but they may be unrelated. KeBioLM decodes the relation correctly and is aware of the synonyms of hepatic. KeBioLM states that Vaccination may prevent tetanus, which is correct but not precise.

Conclusions
In this paper, we propose to improve biomedical pretrained language models with knowledge. We propose KeBioLM, which applies a text-only encoding and a text-entity fusion encoding and has two additional entity-related pretraining tasks: entity detection and entity linking. Extensive experiments show that KeBioLM outperforms other PLMs on the NER and RE datasets of the BLURB benchmark. We further probe biomedical PLMs by querying UMLS relation triplets; the results indicate that KeBioLM absorbs more biomedical knowledge than the others. In this work, we only leverage the relation information in TransE to initialize the entity embeddings; in the future, we will investigate how to directly incorporate relation information into LMs.