Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art

A large array of pretrained models is available to the biomedical NLP (BioNLP) community. Finding the best model for a particular task can be difficult and time-consuming. For many applications in the biomedical and clinical domains, it is crucial that models can be built quickly and are highly accurate. We present a large-scale study across 18 established biomedical and clinical NLP tasks to determine which of several popular open-source biomedical and clinical NLP models work well in different settings. Furthermore, we apply recent advances in pretraining to train new biomedical language models, and carefully investigate the effect of various design choices on downstream performance. Our best models perform well across all of our benchmarks, and set a new state of the art on 9 tasks. We release these models in the hope that they can help the community to speed up and increase the accuracy of BioNLP and text mining applications.


Introduction
The pretrain-and-finetune approach has become the dominant paradigm for NLP applications in the last few years (Devlin et al., 2019; Conneau et al., 2020, inter alia), bringing significant performance gains in many areas of NLP. Models trained on Wikipedia and WebText (Radford et al., 2019) generally perform well on a variety of target domains, but various works have noted that pretraining on in-domain text is an effective method for boosting downstream performance further (Beltagy et al., 2019; Gururangan et al., 2020). Several pretrained models targeting the biomedical and clinical domain have driven the state of the art forward, including BioBERT, SciBERT (Beltagy et al., 2019), ClinicalBERT (Alsentzer et al., 2019) and BioMed-RoBERTa (Gururangan et al., 2020).
While it is great to have multiple options, it can be difficult to decide which model to use in which case: different models are often compared on different tasks. To further complicate matters, more powerful general-purpose models are being released continuously. It is unclear whether it is better to use a more powerful general-purpose model like RoBERTa, or a domain-specific model derived from an earlier model such as BioBERT. And given the opportunity to pretrain a new model, it is unclear what the best practices are for doing so efficiently.
Our goal is to better understand the landscape of pretrained biomedical and clinical NLP models. To that end, we perform a large-scale study across 18 established biomedical and clinical NLP tasks. We evaluate four popular BioNLP models using the same experimental setup and compare them to general-purpose RoBERTa checkpoints. We find that BioBERT performs best overall on biomedical tasks, but the general-purpose RoBERTa-large model performs best on clinical tasks. We then take advantage of recent advances in pretraining by adapting RoBERTa (Liu et al., 2019) to biomedical and clinical text. We investigate which choices are important in pretraining for strong downstream BioNLP performance, including model size, vocabulary/tokenization choices and training corpora. Our best models perform well across all of the tasks, establishing a new state of the art on 9 tasks. Finally, we apply knowledge distillation to train a smaller model that outperforms all other models with similar computational requirements. We will release our pretrained models and the code used to run our experiments.

Tasks and Datasets
We select a broad range of datasets to cover both scientific and clinical textual domains, and common modelling tasks, namely: i) sequence labelling tasks, covering Named Entity Recognition (NER) and de-identification (De-id); and ii) classification tasks, covering relation extraction, multi-class and multi-label classification, and Natural Language Inference (NLI)-style tasks. These tasks were also selected to maximize overlap with previous work in the space, drawing tasks from the BLUE benchmark (Peng et al., 2019), BioBERT, SciBERT (Beltagy et al., 2019) and ClinicalBERT (Alsentzer et al., 2019). The tasks are summarized in Table 1 and described in the following subsections.

Sequence Labelling Tasks
BC5-CDR (Li et al., 2016) is an NER task requiring the identification of Chemical and Disease concepts from 1,500 PubMed articles. There are 5,203 and 4,182 training instances for chemicals and diseases respectively.
JNLPBA (Collier and Kim, 2004) is an NER task requiring the identification of entities of interest in micro-biology, with 2,000 training PubMed abstracts.
NCBI-Disease (Dogan et al., 2014) requires identification of disease mentions in PubMed abstracts. There are 6,892 annotations from 793 abstracts.
BC4CHEMD (Krallinger et al., 2015) requires the identification of chemical and drug mentions from PubMed abstracts. There are 84,310 annotations from 10,000 abstracts.
BC2GM (Smith et al., 2008) requires the identification of 24,583 protein and gene mentions from 20,000 sentences from PubMed.
LINNAEUS (Gerner et al., 2010) is a collection of 4,077 species annotations from 153 PubMed articles.
I2B2-2010/VA (Uzuner et al., 2011) is made up of 871 de-identified clinical reports. The task requires labelling a variety of medical concepts in clinical text.
I2B2-2012 (Sun et al., 2013b) is made up of 310 de-identified clinical discharge summaries. The task requires the identification of temporal events within these summaries.
I2B2-2014 is made up of 1,304 de-identified longitudinal medical records. The task requires the labelling of spans of text containing private health information (de-identification).

Classification Tasks
HOC (Baker et al., 2016) is a multi-label classification task requiring the classification of cancer concepts for PubMed articles. We follow Peng et al. (2019) and report the abstract-level F1 score.
MedNLI (Romanov and Shivade, 2018) is a 3-class NLI dataset built from 14K sentence pairs in the clinical domain.
ChemProt (Krallinger et al., 2017) requires classifying chemical-protein interactions from 1,820 PubMed articles. We follow the standard practice of evaluating over the 5 most common classes.
GAD (Bravo et al., 2015) is a binary relation extraction task with 5,330 annotated gene-disease interactions from PubMed. We use the cross-validation splits from prior work.
EU-ADR (van Mulligen et al., 2012) is a small binary relation extraction task with 355 annotated gene-disease interactions from PubMed. We use the cross-validation splits from prior work.
DDI-2013 (Herrero-Zazo et al., 2013) is a relation extraction task requiring the recognition of drug-drug interactions. There are 4 relation classes to extract from 4,920 PubMed sentences, as well as many sentences which do not contain relations.
I2B2-2010-RE (Uzuner et al., 2011): in this setting of I2B2-2010, we focus on the relation extraction task, detecting 8 types of clinical relations.

Pretraining Corpora
There is a wide range of text corpora in the biomedical and clinical domains. We limit our options to data that is freely available to the public so that models can be open-sourced.

Pretrained Models
We compare five publicly-available language models which together form a representative picture of the state-of-the-art in biomedical and clinical NLP.
We use the HuggingFace Transformers library to access the model checkpoints (Wolf et al., 2019).
SciBERT (Beltagy et al., 2019) uses the BERT-base architecture, but is pretrained from scratch on scientific text with a domain-specific vocabulary.
BioBERT is based on the BERT-base model (Devlin et al., 2019), with additional pretraining in the biomedical domain. We use BioBERT-v1.1. This model was trained on PubMed for 200K steps and on PMC for 270K steps, followed by an additional 1M steps of training on PubMed, using the same hyperparameter settings as BERT-base.
ClinicalBERT (Alsentzer et al., 2019) is also based on BERT-base, but with a focus on clinical tasks. We use the "Bio+Clinical BERT" checkpoint, which is initialized from BioBERT, and then trained using texts from MIMIC-III for 150K steps using a batch size of 32.
RoBERTa (Liu et al., 2019) is a state-of-the-art general-purpose model. We experiment with RoBERTa-base and RoBERTa-large to understand how general-domain models perform on biomedical tasks. Both models are pretrained with much larger batch sizes than BERT, and use dynamic masking to prevent the model from memorizing the training corpus. RoBERTa outperforms BERT on general-domain tasks (Liu et al., 2019).
BioMed-RoBERTa (Gururangan et al., 2020) is a recent model based on RoBERTa-base. BioMed-RoBERTa is initialized from RoBERTa-base and pretrained for an additional 12.5K steps with a batch size of 2048, using a corpus of 2.7M scientific papers from Semantic Scholar (Ammar et al., 2018).
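All of these checkpoints can be loaded through the HuggingFace Transformers API mentioned above. The sketch below shows one way to do so; the Hub identifiers are our best guesses at the public releases and should be checked against the corresponding model cards.

```python
# Minimal sketch: loading the public checkpoints compared in this study via
# HuggingFace Transformers. The Hub identifiers are assumptions based on the
# public releases and should be verified against the respective model cards.
from transformers import AutoTokenizer, AutoModel

CHECKPOINTS = {
    "scibert": "allenai/scibert_scivocab_uncased",
    "biobert": "dmis-lab/biobert-v1.1",
    "clinicalbert": "emilyalsentzer/Bio_ClinicalBERT",
    "biomed-roberta": "allenai/biomed_roberta_base",
    "roberta-base": "roberta-base",
    "roberta-large": "roberta-large",
}

def load(name: str):
    """Return (tokenizer, model) for one of the baseline checkpoints."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    return tokenizer, model

tokenizer, model = load("biobert")
```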

Pretraining New Models
In addition to these publicly available models, we also pretrain new models on the corpora in Section 3 and examine which design criteria are important for strong downstream performance on BioNLP tasks. We have three criteria we are interested in studying: i) the effect of model size on downstream performance; ii) the effect of the pretraining corpus on downstream performance; and iii) whether tokenizing with a domain-specific vocabulary has a strong effect on downstream performance. We pretrain a variety of models based on the RoBERTa-base and RoBERTa-large architectures, with detailed ablations discussed in Section 6.1. We use the PubMed data, and optionally include MIMIC-III. We initialize our models with the RoBERTa checkpoints, except when we use a domain-specific vocabulary, in which case we train the model from a random initialization. Our domain-specific vocabulary is a byte-level byte-pair encoding (BPE) dictionary learned over our PubMed pretraining corpus (Radford et al., 2019; Sennrich et al., 2016). Both the general-purpose (RoBERTa) and domain-specific vocabularies contain 50K subword units. Our best performing models use PubMed abstracts, PMC and MIMIC-III for pretraining together with a domain-specific vocabulary, and are referred to as "ours-base" and "ours-large" in the following sections.
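As an illustration of the vocabulary step, the sketch below learns a 50K-subword byte-level BPE vocabulary with the HuggingFace tokenizers library; the input file name and special-token list are illustrative assumptions rather than our exact preprocessing.

```python
# Sketch: learning a 50K-subword byte-level BPE vocabulary over a PubMed text
# dump, analogous to the domain-specific vocabulary described above.
# "pubmed_abstracts.txt" and the special-token list are illustrative assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pubmed_abstracts.txt"],   # raw pretraining text, one document per line
    vocab_size=50_000,                # same size as the general-purpose RoBERTa vocabulary
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa-style specials
)
tokenizer.save_model("biomed-bpe")    # writes vocab.json and merges.txt
```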

Pretraining
We largely follow the pretraining methodology of Liu et al. (2019). We pretrain models using FAIRSEQ (Ott et al., 2019) on input sequences of 512 tokens, of which 15% are masked and later predicted. We pretrain with batches of 8,192 sequences and use the AdamW optimizer (Loshchilov and Hutter, 2019) with β1 = 0.9, β2 = 0.98 and ε = 1e-6. We regularize the model with dropout (p = 0.1) and weight decay (λ = 0.01). We pretrain all models for 500K steps using mixed precision on V100 GPUs. We linearly warm up the learning rate for the first 5% of steps and linearly decay it to 0 over the remaining steps. We use a learning rate of 6e-4 for base models and 4e-4 for large models.
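For concreteness, the following is a minimal sketch of this optimization setup using PyTorch and the transformers scheduler helper; the actual models were trained with FAIRSEQ, so this is illustrative rather than the exact configuration.

```python
# Sketch of the optimizer and linear warmup/decay schedule described above,
# expressed in PyTorch; the stand-in model is a RoBERTa-base-shaped masked LM.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM, get_linear_schedule_with_warmup

TOTAL_STEPS = 500_000
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)        # linear warmup over the first 5% of steps (25K)
LR = 6e-4                                     # 6e-4 for base models, 4e-4 for large models

model = RobertaForMaskedLM(RobertaConfig())   # placeholder masked LM with RoBERTa-base defaults

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LR,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP_STEPS,
    num_training_steps=TOTAL_STEPS,           # learning rate decays linearly to 0 afterwards
)
```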

Fine-tuning
We fine-tune models using 5 different seeds and report the median result on the test sets.
For sequence labelling tasks, we use a learning rate of 1e-5 and a batch size of 32. For all sequence labelling tasks, we train for 20 epochs in total and choose the best checkpoint based on validation set performance (evaluating every 500 optimization steps).
For classification tasks, we use a learning rate of 0.002 and a batch size of 16. For HOC, ChemProt, MedNLI and I2B2-2010-RE, we run for a maximum of 10 epochs and perform early stopping, evaluating performance on validation data every 200 optimization steps. As GAD and EU-ADR are split into 10 train/test cross-validation partitions, we choose early-stopping hyperparameters using one fold, and report the median test results on the other 9 folds.
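As a concrete illustration, the configuration below mirrors these fine-tuning settings using the HuggingFace TrainingArguments API; it is a sketch of the recipe rather than our exact training code, and the output directories are placeholders.

```python
# Sketch of the fine-tuning recipes described above, expressed as HuggingFace
# TrainingArguments; output directories are placeholders and the seed is varied
# over 5 values in practice, with the median result reported.
from transformers import TrainingArguments

sequence_labelling_args = TrainingArguments(
    output_dir="ner-finetune",        # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    evaluation_strategy="steps",
    eval_steps=500,                   # evaluate every 500 optimization steps
    save_steps=500,
    load_best_model_at_end=True,      # keep the best checkpoint by validation performance
    seed=1,
)

classification_args = TrainingArguments(
    output_dir="cls-finetune",        # placeholder
    learning_rate=2e-3,               # 0.002, as described above
    per_device_train_batch_size=16,
    num_train_epochs=10,
    evaluation_strategy="steps",
    eval_steps=200,                   # early stopping checked every 200 optimization steps
    save_steps=200,
    load_best_model_at_end=True,
    seed=1,
)
```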

Results

Table 2 shows our main results. The first column shows results for the general-purpose RoBERTa-base checkpoint, and the next four show results for the specialized models described in Section 4. The RoBERTa-large column shows results for the general-purpose RoBERTa-large checkpoint. The "ours-base" and "ours-large" columns refer to our proposed RoBERTa-base and RoBERTa-large sized models respectively, which were trained using PubMed and MIMIC-III data and a domain-specific vocabulary. We observe the following: i) RoBERTa-large consistently outperforms RoBERTa-base, despite having access to the same training corpora; ii) BioBERT performs best among the publicly available models that we experiment with; and iii) our newly introduced models perform well, achieving the best results for 17 out of the 18 tasks in our experiments, often by a large margin. The exception is EU-ADR, which has a small test set on which all models achieve essentially the same classification accuracy.

Table 2: Test results on all tasks for our RoBERTa baselines, publicly available models and our best large and base-sized models. All results are the median of 5 runs with different seeds.
Digging deeper, we note that standard RoBERTa-large is competitive with the four specialized models on sequence labelling tasks (85.8 vs 85.9) and outperforms them on clinical tasks (84.0 vs 83.3), despite having no specialized biomedical or clinical pretraining. This suggests that larger, more powerful general-purpose models could be a good default choice compared to smaller, less powerful domain-specific models.
Nevertheless, applying domain-specific training to otherwise comparable models results in significant performance gains in our experiments, as shown by comparing ours-base and ours-large to RoBERTa-base and RoBERTa-large in Table 2 (+3.5% and +2.6% mean improvement, respectively), consistent with findings from previous work (Gururangan et al., 2020).

Ablations
The "ours-base" and "ours-large" models shown in Table 2 refer to the best language models that we trained in our experiments described in Section 4.1. These models use the RoBERTa architectures, are initialized with random weights, use a BPE vocabulary learnt from PubMed, and are pretrained on both our PubMed and MIMIC-III corpora. We performed a detailed ablation study to arrive at these models, and in what follows, we analyse the design decisions in detail. A summary of these results are shown in Table 3, a description of task groupings in Table 4, and full results can be found in Appendix A.2.

Effect of vocabulary
The effect of learning a dedicated biomedical vocabulary for base and large models can be analysed by comparing row 2 to row 3, row 4 to row 5, and row 7 to row 8 in Table 3. A dedicated vocabulary consistently improves sequence labelling tasks, improving results for base models by 0.7% and our large model by 0.6% on average. The difference is less consistent for classification tasks, improving the large model by 0.5% but reducing performance of the base model by 0.7%. A specialized domain-specific vocabulary was also shown to be useful in Beltagy et al. (2019). Since our specialized-vocabulary models are trained from scratch only on biomedical data, we see that Wikipedia and WebText (Radford et al., 2019) pretraining is not necessary for strong performance.

Table 3: Ablation test set results. Rows 5 and 8 correspond to "ours-base" and "ours-large" in Table 2 respectively. Bold indicates the best model overall, underlined indicates the best base model. "PM" indicates training with the PubMed and PMC corpora and "M3" refers to the MIMIC-III corpus. "Voc" indicates the use of a dedicated biomedical vocabulary. Details of the tasks included in each column are given in Table 4.

Effect of pretraining corpora

Table 3 also shows the effect of the pretraining corpora. Rows 1 and 2 show that, unsurprisingly, including PubMed pretraining improves results over the RoBERTa-only model, by 2.6%. Comparing row 2 to row 4 and row 3 to row 5 shows that including MIMIC-III in pretraining results in a large improvement on clinical tasks over PubMed-only models (+1.5% and +1.7%) but has little effect on PubMed-based tasks (-0.1% and +0.1%).

Effect of model size
Consistent with findings from the recent literature (Devlin et al., 2019;Liu et al., 2019;Radford et al., 2019;Brown et al., 2020), we find that large models perform consistently better than comparable smaller ones. Comparing row 1 to row 6, row 4 to 7, and row 5 to 8 in Table 3 shows average improvements of 2%, 1.6% and 0.9% respectively. These improvements are mostly driven by improved sequence labelling performance for large models.

Comparisons to the state-of-the-art
The focus of this paper was not to set the state of the art on specific downstream tasks, but rather to evaluate which models consistently perform well. As such, we prioritized a consistent hyperparameter search and did not consider task-specific tuning. Nevertheless, the models that we trained compare favorably to the state of the art. Table 5 shows the best results obtained for each task in our experiments, alongside the best numbers reported in the literature. In some cases, models used in our experiments have been reported with higher results in the literature. We attribute such differences to variance in test performance, small differences in pre-processing, and differing levels of hyperparameter optimization and tuning. We control for test-set variance by running each model 5 times with different random seeds and reporting median results. We also use standard hyperparameter settings as reported in the literature. The best model in our experiments sets a new state of the art on 9 out of 18 tasks, and comes within 0.1% of the best reported result on another 3 tasks.

Table 5: Our best models compared to the best reported results in the literature. Unless otherwise stated, the best model in our experiments is RoBERTa-large with PubMed, MIMIC-III and a specialized vocabulary ("ours-large" in Table 2). Other models are indicated by: (*) RoBERTa-large + PubMed + MIMIC-III; (†) SciBERT; (‡) RoBERTa-base + PubMed + MIMIC-III + vocab.

Distillation
In Section 6.1.3, we noted that larger models result in better accuracy. However, they also require more computational resources to run, limiting their applicability. Recent work addresses this issue by distilling larger models into smaller ones while retaining performance. Next, we investigate whether distillation works well in the BioNLP space.

Distillation Technique
Knowledge distillation (Hinton et al., 2015) aims to transfer the performance of a more accurate and computationally expensive teacher model into a more efficient student model. Typically, the student network is trained to mimic the output distribution or internal activations of the teacher network, while keeping the teacher network's weights fixed.
In NLP, prior work has explored distilling larger BERT-like models into smaller ones. Most of this work trains the student network to mimic a teacher that has already been finetuned for a specific task, i.e., task-specific distillation (Tsai et al., 2019; Turc et al., 2019; Sun et al., 2020). Recently, Sanh et al. (2020) showed that it is also possible to distill BERT-like models in a task-agnostic way by training the student to mimic the teacher's outputs and activations on the pretraining objective, i.e., masked language modeling (MLM). Task-agnostic distillation is appealing because it enables the distilled student model to be applied to a variety of downstream tasks. Accordingly, we primarily explore task-agnostic distillation in this work.
Recent work has also shown the importance of student network initialization. For example, Sanh et al. (2020) find that initializing the student network with a subset of layers from the teacher network outperforms random initialization; unfortunately, this approach constrains the student network to the same embedding and hidden dimensions as the teacher. Turc et al. (2019) instead advocate initializing the student model via standard MLM pretraining, finding that it outperforms the layer-subset approach. However, they only consider task-specific distillation, where the teacher network has already been finetuned on the end task, reducing the generality of the resulting student network.
We combine the approaches from Sanh et al. (2020) and Turc et al. (2019) by initializing the student network via standard MLM pretraining and then performing task-agnostic distillation by training the student to mimic a pretrained teacher on the MLM objective. We use our pretrained base model as the student network and large model as the teacher network. We also experiment with aligning the hidden states of the teacher's and student's last layer via a cosine embedding loss (Sanh et al., 2020). Since our student and teacher networks have different hidden state sizes, we learn a linear projection from the student's hidden states to the dimension of the teacher's hidden states prior to computing this loss.
We distill each student for 50K steps. As in pretraining (Section 5.1), we distill with a batch size of 8,192 and linearly warm up the learning rate for the first 5% of steps. We use a learning rate of 5e-4 and largely follow the distillation hyperparameter choices of Sanh et al. (2020). In particular, our loss function is a weighted combination of the original MLM cross-entropy loss (with weight α_MLM = 5.0), a KL divergence loss term encouraging the student to match the teacher's outputs (with weight α_KL = 2.0), and optionally a cosine embedding loss term to align the student's and teacher's last-layer hidden states (with weight α_cos = 1.0). For the KL loss we additionally employ a temperature of 2.0 to smooth the teacher's output distribution, following Sanh et al. (2020), as originally advocated by Hinton et al. (2015).
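The following is a minimal PyTorch sketch of this combined objective, assuming the logits, labels and last-layer hidden states are already available from forward passes of the student and (frozen) teacher; the shapes, the projection module and the label masking convention are illustrative assumptions, not our exact implementation.

```python
# Sketch of the task-agnostic distillation objective described above: a weighted
# sum of the MLM cross-entropy, a temperature-smoothed KL term against the
# teacher, and an optional cosine loss aligning last-layer hidden states through
# a learned projection (student hidden size != teacher hidden size).
import torch
import torch.nn.functional as F

ALPHA_MLM, ALPHA_KL, ALPHA_COS, TEMP = 5.0, 2.0, 1.0, 2.0

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden, proj):
    """student_logits/teacher_logits: (batch, seq, vocab); labels: (batch, seq)
    with -100 on non-masked positions; student_hidden/teacher_hidden: last-layer
    hidden states; proj: nn.Linear mapping student dim -> teacher dim."""
    # Standard masked-LM cross entropy on the student's predictions.
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between temperature-smoothed student and teacher distributions,
    # scaled by TEMP**2 as is standard when distilling with a temperature.
    kl = F.kl_div(
        F.log_softmax(student_logits / TEMP, dim=-1),
        F.softmax(teacher_logits / TEMP, dim=-1),
        reduction="batchmean",
    ) * TEMP ** 2
    # Cosine embedding loss on last-layer hidden states, after projecting the
    # student's states up to the teacher's hidden size.
    s = proj(student_hidden).view(-1, teacher_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    cos = F.cosine_embedding_loss(s, t, torch.ones(s.size(0), device=s.device))
    return ALPHA_MLM * mlm + ALPHA_KL * kl + ALPHA_COS * cos
```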

Distillation Results
Results for distillation are shown in Table 6. Since distillation trains the student for an additional 50k steps, we also include a baseline that just trains the student (base) model for longer without any distillation loss terms ("ours-base + train longer").
We find that distillation only slightly outperforms the original base model (+0.2% on average) and the original base model trained longer (+0.1% on average). Aligning the student's and teacher's hidden states provides a small additional gain. These results are consistent with Turc et al. (2019), who show that pretrained student models are a competitive baseline. The best student ("distill + align") improves upon the base model (+0.3% on average) but underperforms the large teacher (-0.8% on average).
Related Work

Methods for training or fine-tuning models on downstream tasks are also an active area of research.
We focus on well-established single-task finetuning techniques for BERT-like models using standard hyperparameter settings. Si et al. (2019) use complex task-specific models to yield strong results on clinical tasks, and Peng et al. (2020) investigate STILTS methods (Phang et al., 2019) on a suite of BioNLP tasks, achieving gains over baselines.
In this work, we build a suite of 18 tasks to evaluate our models. Aggregated benchmarks have become a common tool in NLP research, popularized by the GLUE benchmark (Wang et al., 2018a) for language understanding and its successor SuperGLUE (Wang et al., 2019). Evaluating on a suite of tasks is common in BioNLP too: BioBERT is evaluated on a set of 15 tasks, Peng et al. (2019) evaluate on 10 tasks referred to as "BLUE", and Beltagy et al. (2019) and Gururangan et al. (2020) evaluate on 7 and 2 biomedical tasks respectively. Unfortunately, there is often little overlap between these efforts, and different metrics and dataset splits are often used, making cross-model comparisons challenging; hence our effort to evaluate all models on a single testbed. In concurrent work, Gu et al. (2020) also note this problem and release a similar suite of tasks, referred to as "BLURB", but do not include clinical tasks. We plan to evaluate our models on the BLURB benchmark in future work.

Conclusion
We have thoroughly evaluated 6 open-source language models on 18 biomedical and clinical tasks. Of these models, we found that BioBERT was the best on biomedical tasks, but general-purpose RoBERTa-large performed best on clinical tasks. We then pretrained 6 of our own large-scale specialized biomedical and clinical language models. We determined that the most effective models were larger, used a dedicated biomedical vocabulary, and included both biomedical and clinical pretraining. These models outperform all of the other models in our experiments. Finally, we demonstrated that our base model can be further improved by knowledge distillation from our large model, although a gap remains between the distillation-improved base model and our large model.