BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA

The impact of design choices on the performance of biomedical language models has recently been a subject of investigation. In this paper, we empirically study biomedical domain adaptation with large transformer models under different design choices. We evaluate the performance of our pretrained models against other existing biomedical language models in the literature. Our results show that we achieve state-of-the-art results on several biomedical domain tasks at a similar or lower computational cost than other models in the literature. Our findings highlight the significant effect of design choices on improving the performance of biomedical language models.


Introduction
The amount of biomedical literature has grown substantially in recent years. This growth has created a demand for powerful biomedical language models. Transformer-based language models, such as BERT (Devlin et al., 2019), have proven effective at capturing contextual representations from large corpora. To address the lack of biomedical contextual representations, both BioBERT and SciBERT (Beltagy et al., 2019) have adapted BERT to the biomedical domain.
Recently, several Transformer-based models have been introduced, including Megatron, RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and ELECTRA (Clark et al., 2020). These models show impressive performance gains over BERT in the general domain and lead most NLP leaderboards. However, these models have been evaluated under design factors that vary along several dimensions (e.g., vocabulary and corpora domain, loss function, training steps, batch size, and model scale). Understanding the contribution of these factors to the performance of a language model is challenging, especially when the goal is to shift the contextual representations to the biomedical domain.
This challenge motivates us to investigate the impact of design choices on the performance of biomedical language models. Moreover, highlighting this impact is critical when evaluating new applications in BioNLP, where each application may evaluate its performance against other models that use different design setups. In this work, we pretrain and evaluate different variants of large biomedical Transformer-based models across different design factors.
Thus, our contributions in this paper include: (i) We pretrain four different variants of Transformer-based models, namely ELECTRA Base, ELECTRA Large, BERT Large, and ALBERT xxlarge, on biomedical domain corpora using Tensor Processing Units (TPUs).
(ii) We fine-tune and evaluate our pretrained models on several downstream biomedical tasks. We present a comprehensive evaluation that highlights the impact of design choices on the performance of biomedical language models.
(iii) We release our pretrained models along with our GitHub repository.

Related Work

Transformer-based Language Models
The introduction of the BERT model (Devlin et al., 2019) initiated the advancement of Transformer-based models. Subsequent investigation of BERT's architecture and design choices produced new state-of-the-art models. By exploiting large batch sizes and a larger corpus, RoBERTa (Liu et al., 2019) achieved significant performance gains on all downstream tasks. The loss function and scalability of BERT were also investigated by ELECTRA (Clark et al., 2020) and ALBERT (Lan et al., 2020). ELECTRA reaches state-of-the-art results by introducing a binary replaced-token-detection loss, which pairs a generator with a discriminator to accelerate learning. The ALBERT model introduces multiple ideas to improve BERT's performance and scalability, including a parameter-sharing technique, the LAMB optimizer, and factorization of the embedding layer. Both ELECTRA and ALBERT now lead most NLP benchmarks, including SQuAD (Rajpurkar et al., 2016) and GLUE (Wang et al., 2018).
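For reference, ELECTRA's combined pretraining objective (Clark et al., 2020) jointly minimizes the generator's masked-language-modeling loss and the discriminator's binary replaced-token-detection loss:

\min_{\theta_G,\, \theta_D} \; \sum_{x \in \mathcal{X}} \mathcal{L}_{\mathrm{MLM}}(x, \theta_G) + \lambda \, \mathcal{L}_{\mathrm{Disc}}(x, \theta_D)

where \mathcal{X} is the pretraining corpus and \lambda weights the discriminator loss (set to 50 in the original paper); only the discriminator is kept for fine-tuning on downstream tasks.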

Biomedical Language Models
In this section, we briefly summarize the current state-of-the-art biomedical language models. We should also note that there are other insightful models in the literature, such as ClinicalBERT (Alsentzer et al., 2019), BlueBERT (Peng et al., 2019), BioELECTRA (Ozyurt, 2020), and BioMed-BERT (Chakraborty et al., 2020).
BioBERT is a BERT Base model that has been pretrained on biomedical corpora, including PubMed abstracts and PMC articles, for 23 days on eight V100 GPUs. In our evaluation, we use BioBERT Base v1.1, which extends the pretraining of BioBERT Base to 1M steps and was trained on PubMed abstracts only.
SciBERT (Beltagy et al., 2019) is a BERT Base model that has been pretrained on 1.14M biomedical and computer science papers from the Semantic Scholar corpus.
PubMedBERT (Gu et al., 2021) follows an approach similar to BioBERT, pretraining the BERT model on large biomedical corpora, including PubMed abstracts and PMC articles. In contrast to BioBERT, PubMedBERT is pretrained with a large batch size (8192) and studies various effects of design choices on domain adaptation. The paper also introduces the BLURB benchmark, a collection of downstream biomedical tasks.
BioMegaTron 345m (Shin et al., 2020) is a large-scale model (345M parameters) by NVIDIA based on the MegaTron architecture. BioMegaTron introduces a variety of large biomedical language models, examining the choice of corpora and vocabulary domain.
BioRoBERTa (Lewis et al., 2020) extends the state-of-the-art results by testing different design choices. Similar to BioMegaTron's approach, the BioRoBERTa models investigate the effect of vocabulary and corpora domain on the performance of biomedical language models.

Pretraining our Language Models
We pretrain all our models using the original implementations of BERT, ALBERT, and ELECTRA. We use TensorFlow 1.15 and TPUv3-512 units to pretrain our large models and a TPUv3-32 unit to pretrain our BioM-ELECTRA Base model.

BioM-ALBERT
Initially, we pretrain BioM-ALBERT xxlarge on PubMed abstracts only. BioM-ALBERT xxlarge is based on the ALBERT xxlarge architecture, which has a larger hidden size (4096) than both BERT Large and ELECTRA Large (1024). We build our domain-specific vocabulary of 30K words using the SentencePiece model (Kudo and Richardson, 2018). We keep the same hyperparameters as (Lan et al., 2020), except that we increase the batch size to 8192 and decrease the initializer range to 0.01. We pretrain BioM-ALBERT xxlarge with a learning rate of 1.76e-3 for 264K steps. Table 1 shows the details of our pretrained models compared to existing models in the literature. The goal of pretraining BioM-ALBERT xxlarge is to understand the impact of ALBERT's techniques on domain adaptation. Moreover, we introduce PMC articles at step 264K to study the influence of adding PMC articles on the language model. BioM-ALBERT xxlarge is the first model that we pretrain and fine-tune among our large models.
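To illustrate the vocabulary step only, the following is a minimal sketch of training a 30K-word SentencePiece vocabulary; the input path, model prefix, and model type are placeholders and assumptions, not the exact files or settings used for BioM-ALBERT.

import sentencepiece as spm

# Train a SentencePiece model with a 30K vocabulary over domain text.
# "pubmed_abstracts.txt" is a hypothetical one-sentence-per-line corpus file.
spm.SentencePieceTrainer.train(
    input="pubmed_abstracts.txt",
    model_prefix="biom_albert_30k",
    vocab_size=30000,
    model_type="unigram",  # assumption: ALBERT's default SentencePiece setting
)

# Load the resulting model and tokenize a sample biomedical sentence.
sp = spm.SentencePieceProcessor(model_file="biom_albert_30k.model")
print(sp.encode("Aspirin inhibits platelet aggregation.", out_type=str))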

BioM-ELECTRA
We build our BioM-ELECTRA Base and BioM-ELECTRA Large models on the ELECTRA architecture (Clark et al., 2020). We pretrain BioM-ELECTRA Large on PubMed abstracts only, using the domain-specific vocabulary generated by PubMedBERT, which has a size of 28,895 words. Our evaluation of BioM-ALBERT xxlarge on downstream tasks influenced our decision to pretrain BioM-ELECTRA on PubMed abstracts only.
The main objective of pretraining BioM-ELECTRA Base is to study the effect of the ELECTRA objective by comparing its performance with PubMedBERT Base and RoBERTa Base.
Furthermore, we build our BioM-ELECTRA Large model to study the effect of model scale by comparing it with BioM-ELECTRA Base and PubMedBERT Base, where other factors are similar. We should also note that we choose the general-domain ELECTRA Base++ model as a baseline instead of ELECTRA Base. The difference between the two is that ELECTRA Base is pretrained with fewer steps (1M) and on smaller corpora (Wikipedia + Books) (Clark et al., 2020).

BioM-BERT
We pretrain the BioM-BERT Large model on PubMed abstracts and PMC articles using the same vocabulary as BioBERT Base. BioBERT Base uses a general-domain vocabulary built from English Wikipedia and the Books Corpus. Our BioM-BERT Large model aims to study the effect of using a general-domain vocabulary together with the PubMed + PMC corpora on downstream biomedical tasks. We use a batch size of 4096 and a learning rate of 2e-4, and we set the pretraining steps to 700K. However, since we use preemptible TPUs, our TPUs were preempted at 690K steps. We use the ELECTRA implementation of BERT to pretrain our BERT Large model. This implementation uses dynamic masking and drops the next-sentence prediction objective.
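To make the dynamic-masking distinction concrete, the following is a minimal, simplified sketch (not the ELECTRA code itself): masked positions are re-sampled every time an example is drawn, instead of being fixed once during data preprocessing as in the original BERT pipeline.

import random

MASK_TOKEN = "[MASK]"

def dynamically_mask(tokens, vocab, mask_prob=0.15):
    # Re-sample masked positions on every call, so each epoch sees a
    # different masking pattern for the same sentence. The 80/10/10
    # mask/random/keep split follows the usual BERT masking recipe.
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # random replacement
            # else: keep the original token unchanged
    return masked, labels

# The same sentence yields a different mask on each call.
tokens = "aspirin inhibits platelet aggregation in humans".split()
print(dynamically_mask(tokens, vocab=tokens))
print(dynamically_mask(tokens, vocab=tokens))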

Downstream Tasks
Our choices of downstream biomedical tasks are similar to (Shin et al., 2020). For Named Entity Recognition (NER) and Relation Extraction (RE), we generate our training, development, and test data using the same script that PubMedBERT uses (Gu et al., 2021).

Named Entity Recognition. Our choices for NER tasks include BC5CDR-Chemical, BC5CDR-Disease (Li et al., 2016), and NCBI-Disease (Dogan et al., 2014). These tasks aim to identify chemical and disease entities using the IOB tagging format (Ramshaw and Marcus, 1995). For NER tasks, we use the entity-level F1 score, which is the common standard in the literature.

Relation Extraction. Relation extraction is a text classification task in which we assign each sequence one label from a list of classes. For the RE task, we choose ChemProt (Krallinger et al., 2015), which classifies chemical-protein interactions. We use the micro-level F1 score on the five most common classes. We reproduce the results of BioRoBERTa Large on ChemProt since BioRoBERTa uses a different pre-processing script than (Gu et al., 2021); BioRoBERTa released its models at https://github.com/facebookresearch/bio-lm, and we use the following hyperparameters to reproduce the results: learning rate 2e-5, batch size 16, 10 epochs, and seeds 10, 42, 1234, 12345, and 666.

Question Answering. We use the same BioASQ7B-factoid dataset used in prior work, which is provided in the SQuADv1.1 format. We use Mean Reciprocal Rank (MRR) as the evaluation metric for this task. Moreover, as is common practice, we fine-tune our models on the BioASQ task starting from a checkpoint fine-tuned on SQuAD2.0 (Rajpurkar et al., 2016).
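As a small illustration of these metrics (the tag sequences and ranks below are made up, not taken from our data), entity-level F1 can be computed with the seqeval library, and MRR is the mean of the reciprocal ranks of the first correct answer per question:

from seqeval.metrics import f1_score

# Entity-level F1 on IOB-tagged sequences: an entity counts as correct only
# if both its span boundaries and its type match the gold annotation.
y_true = [["B-Chemical", "I-Chemical", "O", "B-Disease", "O"]]
y_pred = [["B-Chemical", "I-Chemical", "O", "O", "O"]]
print(f1_score(y_true, y_pred))  # the missed Disease span lowers entity-level recall

# Mean Reciprocal Rank for factoid QA: average of 1/rank of the first
# correct answer in each question's ranked candidate list.
def mean_reciprocal_rank(first_correct_ranks):
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

print(mean_reciprocal_rank([1, 2, 5]))  # three questions answered at ranks 1, 2, and 5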

Fine-Tuning Hyperparameters
We conduct a hyperparameter grid search using the development sets on a TPUv3-8. We use TensorFlow 1.15 to fine-tune our models for all tasks, except that we use the Transformers library (Wolf et al., 2020) to fine-tune BioM-ALBERT on NER tasks. Since we are fine-tuning different architectures, we extend our grid search range to: learning rate (1e-4, 2e-4, 1e-5 to 7e-5), batch size (24, 32, 48, 64, 128), and 2 to 5 epochs (a minimal sketch of this grid is shown below). We fix our choice of hyperparameters for each set of tasks, model scale, and architecture. The details of our fine-tuning hyperparameters can be found in Appendix A.1.

Table 2 shows our evaluation results. We categorize models into four categories based on the domain and the scale of each model. We show the results of BioM-BERT Large and BioM-ALBERT xxlarge at different steps. We report entity-level F1 for NER tasks, micro-level F1 for ChemProt, F1 for SQuAD2.0, and Mean Reciprocal Rank (MRR) for BioASQ. We add SQuAD results to track the direction of the contextual representation between the general and biomedical domains.
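The sketch below illustrates the hyperparameter grid from the fine-tuning setup above; fine_tune_and_eval is a hypothetical placeholder rather than our actual fine-tuning script, and the expansion of the 1e-5 to 7e-5 learning-rate range into 1e-5 steps is an assumption.

import itertools

# Hyperparameter grid from the fine-tuning setup described above.
learning_rates = [1e-4, 2e-4, 1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5, 7e-5]
batch_sizes = [24, 32, 48, 64, 128]
epoch_counts = [2, 3, 4, 5]

def fine_tune_and_eval(lr, batch_size, epochs):
    # Hypothetical placeholder: fine-tune on the training split with the
    # given configuration and return the development-set score.
    return 0.0  # stand-in value so the sketch runs end to end

best_score, best_config = float("-inf"), None
for lr, bs, ep in itertools.product(learning_rates, batch_sizes, epoch_counts):
    score = fine_tune_and_eval(lr, bs, ep)
    if score > best_score:
        best_score, best_config = score, {"lr": lr, "batch_size": bs, "epochs": ep}

print(best_config)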

ELECTRA Objective
The effect of the ELECTRA objective can be seen by comparing PubMedBERT Base and BioM-ELECTRA Base, which use similar design choices, the same vocabulary set, and a similar C ratio. Our evaluation shows that the ELECTRA objective improves performance on the ChemProt, SQuAD, and BioASQ tasks. On the SQuAD task, our BioM-ELECTRA Base exceeds RoBERTa Base despite using biomedical corpora and a lower C ratio. On NER tasks, BioM-ELECTRA Base performs better on NCBI-Disease and worse on the BC5-CDR tasks. In contrast, BioM-ELECTRA Large performs better than other large models on the BC5-CDR dataset, which rules out the assumption that the ELECTRA objective negatively affects BioM-ELECTRA Base performance on the BC5-CDR tasks.

Named Entity Recognition
Domain-specific vocabulary significantly improves the results on NER tasks. The results of BioM-ELECTRA Large and BioRoBERTa Large show that the choice of biomedical corpora has a marginal effect on NER tasks. Our results also show that the gap between base-scale and large-scale biomedical models on NER tasks is relatively smaller than on RE and QA tasks, especially for the NCBI-Disease task.

Relation Extraction
On the ChemProt task, BioM-BERT Large achieves a 78.8 F1 score at step 100K with a C ratio of 0.4x, matching the performance of BioRoBERTa Large, which has a C ratio of 4.0x. At a 1.6x C ratio (400K steps), it exceeds all large-scale biomedical models by a significant margin. BioM-BERT Large is the only large model in Table 2 with the PP design choice, which highlights the critical impact of a general-domain vocabulary on some RE tasks such as ChemProt.

Question Answering
Our results highlight that question answering tasks are sensitive to out-of-domain corpora. This sensitivity can be clearly seen when we introduce the (PP) design to BioM-ALBERT xxlarge: performance decreases significantly on the BioASQ challenge. In contrast, performance on the SQuAD dataset increases to 88.0%. This increase is not caused by extending the training steps, since the SQuAD score remains stable at the 215K and 264K steps.
Moreover, we observe a gap of 3.9% on the SQuAD benchmark between BioM-ELECTRA Large and BioM-ELECTRA Base. However, this gap is not reflected on the BioASQ benchmark, since BioASQ is in the SQuADv1.1 format, highlighting the need for a biomedical question answering task in the SQuADv2.0 format.
Furthermore, our evaluation shows that the ELECTRA Base++ model achieves a state-of-the-art result on BioASQ among base-scale models. We attribute this performance to the fact that we use a SQuAD fine-tuned checkpoint to fine-tune our models on the BioASQ task (a minimal sketch of this setup appears at the end of this section). In contrast, the gap between the general and biomedical domains is worse on NER and RE tasks, since we do not use any general-domain fine-tuned checkpoints for those tasks. (Notes on Table 2: the comparison includes PubMedBERT (Gu et al., 2021), BioMegaTron (Shin et al., 2020), and BioRoBERTa Large (Lewis et al., 2020). We generate QA results for all models, except that we use the reported results for BioMegaTron, BioBERT (Shin et al., 2020), and RoBERTa Base. BioMegaTron uses sub-token evaluation for NER tasks rather than whole-entity evaluation and uses a differently pre-processed dataset for the ChemProt task. Our results are the average scores of five different runs.)

Table 3 shows the fine-tuning efficiency; we fine-tune all models on the ChemProt dataset for 3 epochs with a batch size of 32 and a maximum sequence length of 128 on a 3090RTX GPU with PyTorch (FP16). All base-scale models in Table 2 have fine-tuning times similar to BioM-ELECTRA Base, since they are built on the BERT Base architecture. Likewise, all models based on the BERT Large architecture, such as BioRoBERTa Large, have fine-tuning times similar to BioM-ELECTRA Large. Our evaluation shows that the hidden layer size (H) significantly influences the fine-tuning time.
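Returning to the BioASQ setup mentioned above, the following is a minimal sketch of initializing from a SQuAD2.0-fine-tuned checkpoint with the Transformers library; the checkpoint path is a hypothetical placeholder, not a specific released file.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Hypothetical path to a BioM-ELECTRA checkpoint already fine-tuned on SQuAD2.0.
checkpoint = "path/to/BioM-ELECTRA-Large-SQuAD2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# The model is then fine-tuned on BioASQ7B-factoid (provided in SQuADv1.1
# format) with a standard extractive-QA training loop, e.g. the run_qa.py
# example script that ships with the Transformers library.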

Conclusion
We introduce four biomedical Transformer-based language models. Our results show that language models with a general-domain vocabulary and PubMed + PMC corpora perform better on the ChemProt task, while language models with a domain-specific vocabulary and PubMed abstracts perform better on NER and QA tasks. In the future, we plan to extend our evaluation to additional biomedical tasks and to investigate early exiting (Zhou et al., 2020) to reduce the fine-tuning time. We also plan to build an end-to-end ensemble QA system with our large models and Sentence-BERT (Reimers and Gurevych, 2019) to address pandemic issues such as COVID-19.