Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets

Inspired by the success of the General Language Understanding Evaluation benchmark, we introduce the Biomedical Language Understanding Evaluation (BLUE) benchmark to facilitate research on pre-trained language representations in the biomedicine domain. The benchmark consists of five tasks with ten datasets that cover both biomedical and clinical texts with different dataset sizes and difficulties. We also evaluate several baselines based on BERT and ELMo and find that the BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes achieves the best results. We make the datasets, pre-trained models, and code publicly available at https://github.com/ncbi-nlp/BLUE_Benchmark.


Introduction
With the growing amount of biomedical information available in textual form, there have been significant advances in the development of pre-trained language representations that can be applied to a range of tasks in the biomedical domain, such as pre-trained word embeddings, sentence embeddings, and contextual representations (Chiu et al., 2016; Peters et al., 2017; Smalheiser et al., 2019).
In the general domain, we have recently observed that the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) has successfully promoted the development of general-purpose language representations (Peters et al., 2017; Radford et al., 2018; Devlin et al., 2019). To the best of our knowledge, however, there is no publicly available benchmark in the biomedicine domain.
To facilitate research on language representations in the biomedicine domain, we present the Biomedical Language Understanding Evaluation (BLUE) benchmark, which consists of five different biomedicine text-mining tasks with ten corpora. Here, we rely on preexisting datasets because they have been widely used by the BioNLP community as shared tasks (Huang and Lu, 2015). These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges. We expect that the models that perform better on all or most tasks in BLUE will address other biomedicine tasks more robustly.
To better understand the challenge posed by BLUE, we conduct experiments with two baselines: one based on the BERT model (Devlin et al., 2019) and one based on ELMo (Peters et al., 2017). Both are state-of-the-art language representation models that have demonstrated promising results on general-purpose NLP tasks. We find that the BERT model pre-trained on PubMed abstracts (Fiorini et al., 2018) and MIMIC-III clinical notes achieves the best results and is significantly superior to the other models on tasks in the clinical domain. This demonstrates the importance of pre-training on different text genres.
In summary, we offer: (i) five tasks with ten biomedical and clinical text-mining corpora of different sizes and levels of difficulty, (ii) code for data construction and model evaluation to enable fair comparisons, (iii) BERT models pre-trained on PubMed abstracts and MIMIC-III, and (iv) baseline results.

Related work
There is a long history of using shared language representations to capture text semantics in biomedical text and data mining research. Such research utilizes a technique, termed transfer learning, whereby the language representations are pre-trained on large corpora and fine-tuned on a variety of downstream tasks, such as named entity recognition and relation extraction.
One established trend is the use of word embeddings that represent word semantics with high-dimensional vectors (Chiu et al., 2016; Wang et al., 2018c; Zhang et al., 2019). Similar methods have also been derived to improve embeddings of word sequences by introducing sentence embeddings. These methods, however, often require complicated neural networks to be used effectively in downstream applications.
Another popular trend, especially in recent years, is context-dependent representation. Different from word embeddings, it allows the meaning of a word to change according to the context in which it is used (Melamud et al., 2016; Peters et al., 2017; Devlin et al., 2019; Dai et al., 2019). In the scientific domain, Beltagy et al. released SciBERT, which is trained on scientific text. In the biomedical domain, BioBERT and BioELMo were pre-trained and applied to several specific tasks. In the clinical domain, Alsentzer et al. (2019) released a clinical BERT base model trained on the MIMIC-III database. Most of these works, however, were evaluated either on different datasets or on the same dataset with slightly different numbers of examples. This makes it challenging to compare the various language models fairly.
For these reasons, a standard benchmark is urgently required. Parallel to our work, other efforts have introduced benchmarks covering named entity recognition, relation extraction, and question answering, or NLI in addition to named entity recognition. In comparison, BLUE differs in three ways. First, BLUE is selected to cover a diverse range of text genres, including both the biomedical and clinical domains. Second, BLUE goes beyond sentences or sentence pairs by including document classification tasks. Third, BLUE provides a comprehensive suite of code to reconstruct the datasets from scratch without removing any instances.

Tasks
BLUE contains five tasks with ten corpora that cover a broad range of data quantities and difficulties (Table 1). Here, we rely on preexisting datasets because they have been widely used by the BioNLP community as shared tasks.

Sentence similarity
The sentence similarity task is to predict similarity scores based on sentence pairs. Following common practice, we evaluate similarity by using Pearson correlation coefficients.
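As a concrete sketch of this evaluation, the Pearson correlation between gold and predicted similarity scores can be computed as follows (the score lists in the example are illustrative, not actual BIOSSES or MedSTS data):

```python
import math

def pearson(gold, pred):
    """Pearson correlation coefficient between gold and predicted
    similarity scores (one score per sentence pair)."""
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    std_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    return cov / (std_g * std_p)
```

In practice, a library routine such as scipy.stats.pearsonr returns the same coefficient together with a p-value.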
BIOSSES is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain (Sogancıoglu et al., 2017). 1 To develop BIOSSES, five curators judged the similarity of each sentence pair, using scores ranging from 0 (no relation) to 4 (equivalent). Here, we randomly select 80% of the pairs for training and 20% for testing because there are no standard splits in the released data.
MedSTS is a corpus of sentence pairs selected from Mayo Clinic's clinical data warehouse (Wang et al., 2018b). To develop MedSTS, two medical experts graded the semantic similarity of each sentence pair on a scale from 0 to 5 (low to high similarity). We use the standard training and testing sets from the shared task.

Named entity recognition
The aim of the named entity recognition task is to predict the mention spans in the given text (Jurafsky and Martin, 2008). The results are evaluated by comparing the set of mention spans annotated in the document with the set of mention spans predicted by the model. We evaluate the results using the strict version of precision, recall, and F1-score. For disjoint mentions, all spans must also be strictly correct. To construct the dataset, we used spaCy 2 to split the text into a sequence of tokens when the original datasets did not provide such information.
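A minimal sketch of the strict evaluation, assuming each mention is represented as an exact (start, end) offset pair (the representation here is illustrative):

```python
def strict_prf(gold_spans, pred_spans):
    """Strict span-level precision/recall/F1: a predicted mention counts
    as correct only if its (start, end) offsets exactly match a gold mention."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```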
BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task. 3 The diseases and chemicals mentioned in the articles were annotated independently by two human experts with medical training and curation experience. We use the standard training and test sets from the task.
ShARe/CLEF eHealth Task 1 Corpus is a collection of 299 de-identified clinical free-text notes from the MIMIC II database (Suominen et al., 2013). 4 The disorders mentioned in the clinical notes were annotated by two professionally trained annotators, followed by an adjudication step, resulting in high inter-annotator agreement. We use the standard training and test sets from ShARe/CLEF eHealth Task 1.

Relation extraction
The aim of the relation extraction task is to predict relations and their types between two entities mentioned in the sentences. The predicted relations and types are compared against the annotated data. We use the standard micro-average precision, recall, and F1-score metrics.
DDI extraction 2013 corpus is a collection of 792 texts selected from the DrugBank database and 233 additional MEDLINE abstracts (Herrero-Zazo et al., 2013). 5 The drug-drug interactions, including both pharmacokinetic and pharmacodynamic interactions, were annotated by two expert pharmacists with a substantial background in pharmacovigilance. In our benchmark, we use 624 training files and 191 test files to evaluate performance and report the micro-average F1-score over the four DDI types.
i2b2 2010 shared task collection consists of 170 documents for training and 256 documents for testing, a subset of the original dataset (Uzuner et al., 2011). 7 The dataset was collected from three different hospitals and was annotated by medical practitioners for eight types of relations between problems and treatments.

Document multilabel classification
The multilabel classification task predicts multiple labels from the texts.
HoC (the Hallmarks of Cancer corpus) consists of 1,580 PubMed abstracts annotated with ten currently known hallmarks of cancer (Baker et al., 2016). 8 Annotation was performed at the sentence level by an expert with 15+ years of experience in cancer research. We use 315 (∼20%) abstracts for testing and the remaining abstracts for training. For the HoC task, we follow common practice and report the example-based F1-score at the abstract level (Zhang and Zhou, 2014; Du et al., 2019).
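The example-based F1 score computes, for each document, the F1 between the predicted and gold label sets, and then averages over documents. A minimal sketch (the label sets in the example are illustrative):

```python
def example_based_f1(gold_labels, pred_labels):
    """Example-based F1 (Zhang and Zhou, 2014): per-document F1 between
    the predicted and gold label sets, averaged over documents.
    Uses the identity F1 = 2*TP / (|gold| + |pred|)."""
    scores = []
    for gold, pred in zip(gold_labels, pred_labels):
        gold, pred = set(gold), set(pred)
        if not gold and not pred:
            scores.append(1.0)  # both empty: perfect agreement
            continue
        tp = len(gold & pred)
        scores.append(2 * tp / (len(gold) + len(pred)))
    return sum(scores) / len(scores)
```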

Inference task
The aim of the inference task is to predict whether the premise sentence entails or contradicts the hypothesis sentence. We use the standard overall accuracy to evaluate the performance.

Total score
Following the practice in Wang et al. (2018a), we use a macro-average of F1 scores and Pearson scores to determine a system's position.
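A minimal sketch of the ranking computation, assuming each task contributes one score (an F1 or a Pearson coefficient):

```python
def total_score(task_scores):
    """Macro-average of the per-task scores: an unweighted mean over
    all tasks, used to rank systems on the benchmark."""
    return sum(task_scores) / len(task_scores)
```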

Baselines
For baselines, we evaluate several pre-training models as described below. The original code for the baselines is available at https://github.com/ncbi-nlp/NCBI_BERT.

Pre-training BERT
BERT (Devlin et al., 2019) is a contextualized word representation model that is pre-trained based on a masked language model, using bidirectional Transformers (Vaswani et al., 2017).
In this paper, we pre-trained our own BERT model on PubMed abstracts and clinical notes (MIMIC-III). The statistics of the text corpora on which BERT was pre-trained are shown in Table 2.

Corpus            Words      Domain
PubMed abstract   > 4,000M   Biomedical
MIMIC-III         > 500M     Clinical

We initialized BERT with the pre-trained BERT model provided by Devlin et al. (2019). We then continued to pre-train the model using the listed corpora.
We released our BERT-Base and BERT-Large models, using the same vocabulary, sequence length, and other configurations provided by Devlin et al. (2019). Both models were trained with 5M steps on the PubMed corpus and 0.2M steps on the MIMIC-III corpus.

Fine-tuning with BERT
BERT is applied to various downstream text-mining tasks while requiring only minimal architecture modification.
For sentence similarity tasks, we packed the sentence pairs together into a single sequence, as suggested in Devlin et al. (2019).
For named entity recognition, we used BIO tags for each token in the sentence. We treated the task as analogous to machine translation: predicting the sequence of BIO tags from the input sentence.
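The BIO conversion can be sketched as follows, assuming mentions are given as token-level (start, end, type) triples with an exclusive end index (the entity type names are illustrative):

```python
def to_bio_tags(tokens, mentions):
    """Convert token-level mention annotations into BIO tags:
    B- marks the first token of a mention, I- the rest, O everything else.
    `mentions` holds (start_token, end_token_exclusive, type) triples."""
    tags = ["O"] * len(tokens)
    for start, end, etype in mentions:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags
```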
We treated the relation extraction task as sentence classification by replacing the two named entity mentions of interest in the sentence with predefined tags (e.g., @GENE$, @DRUG$). For example, the original sentence "Citalopram protected against the RTI-76-induced inhibition of SERT binding.", in which "citalopram" and "SERT" have a chemical-gene relation, becomes "@CHEMICAL$ protected against the RTI-76-induced inhibition of @GENE$ binding."
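The entity-replacement step can be sketched as follows, assuming the two mentions are given as character (start, end) offsets with the first mention preceding the second in the sentence:

```python
def mask_entities(sentence, span1, span2, tag1, tag2):
    """Replace two entity mentions, given as character (start, end)
    offsets with span1 preceding span2, by predefined tags such as
    @CHEMICAL$ and @GENE$ before sentence classification."""
    s1, e1 = span1
    s2, e2 = span2
    return sentence[:s1] + tag1 + sentence[e1:s2] + tag2 + sentence[e2:]
```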
For multi-label tasks, we fine-tuned the model to predict multiple labels for each sentence in the document. We then combined the labels across each document and compared them with the gold standard.
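One natural reading of this label-combination step is a union over the sentence-level predictions, sketched below; the union rule is our assumption, since the text does not spell out how sentence labels are combined:

```python
def document_labels(sentence_predictions):
    """Combine per-sentence multi-label predictions into one
    document-level label set (assumed here to be the union
    of the labels predicted for each sentence)."""
    labels = set()
    for sentence_labels in sentence_predictions:
        labels.update(sentence_labels)
    return labels
```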
As with the original BERT release, we provide source code for fine-tuning, prediction, and evaluation, making it straightforward to follow those examples and use our pre-trained BERT models for all tasks.

Fine-tuning with ELMo
We adopted the ELMo model pre-trained on PubMed abstracts (Peters et al., 2017) to accomplish the BLUE tasks. 10 The ELMo embedding of each token is used as input to the fine-tuning model. We retrieved the output states of both layers in ELMo and concatenated them into one vector for each word. We used a maximum sequence length of 128 for padding. The learning rate was set to 0.001 with the Adam optimizer. We trained for up to 20 epochs with batch size 64 and stopped early if the training loss did not decrease.
For sentence similarity tasks, we used bag-of-embeddings with the average strategy to transform the sequence of word embeddings into a sentence embedding. Afterward, we concatenated the two sentence embeddings and fed them into an architecture with one dense layer to predict the similarity of the two sentences.
For named entity recognition, we used a Bi-LSTM-CRF implementation as a sequence tagger (Si et al., 2019; Lample et al., 2016). Specifically, we concatenated the GloVe word embeddings (Pennington et al., 2014), character embeddings, and ELMo embeddings of each token and fed the combined vectors into the sequence tagger to predict the label for each token. The GloVe word embeddings (https://nlp.stanford.edu/projects/glove/) and character embeddings have 100 and 25 dimensions, respectively. The hidden sizes of the Bi-LSTM are also set to 100 and 25 for the word and character embeddings, respectively.
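The bag-of-embeddings sentence representation used for the similarity tasks can be sketched as follows (a pure-Python stand-in for the actual vector operations; the dense prediction layer itself is omitted):

```python
def sentence_embedding(word_vectors):
    """Bag-of-embeddings with the average strategy: the sentence vector
    is the element-wise mean of its word vectors."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

def pair_features(sent1_vectors, sent2_vectors):
    """Concatenate the two sentence embeddings; this combined vector
    is fed to a single dense layer that predicts the similarity score."""
    return sentence_embedding(sent1_vectors) + sentence_embedding(sent2_vectors)
```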
For relation extraction and multi-label tasks, we followed the steps in fine-tuning with BERT but used the averaged ELMo embeddings of all words in each sentence as the sentence embedding.

Benchmark results and discussion
We pre-trained four BERT models: BERT-Base (P) and BERT-Large (P) on PubMed abstracts only, and BERT-Base (P+M) and BERT-Large (P+M) on the combination of PubMed abstracts and clinical notes. We present performance on the main benchmark tasks in Table 3; a more detailed comparison is shown in Appendix A. Overall, BERT-Base (P+M), which was pre-trained on both PubMed abstracts and MIMIC-III, achieved the best results across the five tasks, even though it is only slightly better than the model pre-trained on PubMed abstracts only. On tasks in the clinical domain in particular, BERT-Base (P+M) is significantly superior to the other models. This demonstrates the importance of pre-training on different text genres.
When comparing BERT pre-trained with the base settings against BERT pre-trained with the large settings, it is somewhat surprising that BERT-Base is better than BERT-Large except on the relation extraction and document classification tasks. Further analysis shows that, on these tasks, the average sentence length is longer than on the others (Table 1). In addition, BERT-Large pre-trained on PubMed and MIMIC-III is worse than the other models overall. However, BERT-Large (P) performs best on the multilabel task, even compared with a feature-based model utilizing an enriched ontology (Yan and Wong, 2017). This is partially because the MIMIC-III data are relatively smaller than the PubMed abstracts and thus cannot pre-train the large model sufficiently.
In the sentence similarity tasks, BERT-Base (P+M) achieves the best results on both datasets. Because the BIOSSES dataset is very small (there are only 16 sentence pairs in the test set), the performance of all BERT models was unstable. This problem was also noted by Devlin et al. (2019) when the model was evaluated on the GLUE benchmark. Here, we obtained the best results by following the same strategy: selecting the best model on the development set after several runs. Other possible ways to overcome this issue include choosing the model with the best performance from multiple runs or averaging the results of multiple fine-tuned models.
In the named entity recognition tasks, BERT-Base (P) achieved the best results on the two biomedical datasets, whereas BERT-Base (P+M) achieved the best results on the clinical dataset. In all cases, we observed that the winning model obtained higher recall than the others. Given that we use the pre-defined vocabulary from the original BERT and that this task relies heavily on tokenization, it is possible that pre-training BERT with a custom SentencePiece tokenizer may further improve the model's performance.

Conclusion
In this study, we introduce BLUE, a collection of resources for evaluating and analyzing biomedical natural language representation models. We find that the BERT models pre-trained on PubMed abstracts and clinical notes achieve better performance than most state-of-the-art models. Detailed analysis shows that our benchmark can be used to evaluate the capacity of models to understand biomedical text and, moreover, to shed light on future directions for developing biomedical language representations.