BioMegatron: Larger Biomedical Domain Language Model

There has been an influx of biomedical domain-specific language models, showing language models pre-trained on biomedical text perform better on biomedical domain benchmarks than those trained on general domain text corpora such as Wikipedia and Books. Yet, most works do not study the factors affecting each domain language application deeply. Additionally, the study of model size on domain-specific models has been mostly missing. We empirically study and evaluate several factors that can affect performance on domain language applications, such as the sub-word vocabulary set, model size, pre-training corpus, and domain transfer. We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus, contributing to our understanding of domain language model applications. We demonstrate noticeable improvements over the previous state-of-the-art (SOTA) on standard biomedical NLP benchmarks of named entity recognition, relation extraction, and question answering. Model checkpoints and code are available at [https://ngc.nvidia.com] and [https://github.com/NVIDIA/NeMo].


Introduction
Effectively transferring the success of BERT (Devlin et al., 2018) to the biomedical domain has been an active line of research. Huang et al. (2019) added clinical text to the PubMed biomedical pre-training corpus and tested on standard biomedical and clinical NLP benchmarks. Many other similar works appeared at the ACL BioNLP Workshop (Demner-Fushman et al., 2019).
More recently, Gu et al. (2020) performed a comprehensive study on the pre-training corpus domain, language model masking method, and adversarial training, benchmarking on a number of different datasets for token classification, sequence classification, and sequence regression.
Compared to the previous works, we perform a more detailed study on (1) subword vocabulary, (2) labeling method, (3) model size, and (4) domain transfer, showing gains in token classification, sequence classification, and question answering.

Related Works
A prime example of Language Models (LMs) in the biomedical domain is BioBERT, a transformer LM pre-trained on the PubMed (www.ncbi.nlm.nih.gov/pubmed) corpus of biomedical literature abstracts. Its pre-training started from the checkpoint of Devlin et al. (2018) trained on Wikipedia and BooksCorpus. Independently, Beltagy et al. (2019) (SciBERT) pre-trained BERT from scratch with their own vocabulary set on scientific text corpora, including PubMed abstracts and computer science papers. Both demonstrated increased performance over the previous non-BERT SOTA on biomedical benchmarks, including Named Entity Recognition (NER), Relation Extraction (RE), and Question Answering (QA). BioBERT and SciBERT report similar results on NER and RE, while only BioBERT reports QA results.
They inspired other follow-up works (Alsentzer et al., 2019; Huang et al., 2019; Peng et al., 2019). Peng et al. (2019) report slightly improved performance on RE using BERT-Large while reporting worse results on NER, compared to BERT-Base. These results suggest that biomedical tasks do not benefit from scaling model size to the same degree as standard NLP benchmarks such as GLUE or SQuAD (Shoeybi et al., 2019; Raffel et al., 2019).
Language Model Pre-training

BERT-Base & -Large We compare our models to the pre-trained BERT-Base and BERT-Large models of BioBERT and PubMedBERT (Gu et al., 2020) (BERT-Base) for fine-tuning and evaluation. For QA we use the BERT-Large variant of BioBERT, following the authors' recommendation.
BioMegatron Megatron-LM (Shoeybi et al., 2019) was introduced for efficient model parallel training of large LMs, with up to 8.3B parameters. Shoeybi et al. (2019) showed that rearranging the order of the layer normalization and the residual connections is critical to enabling the scaling of the BERT-style models beyond 336m parameters, and we use the same architecture.
Megatron-LM also used a larger pre-training text corpus, comprised of Wikipedia (Devlin et al., 2018), CC-Stories (Trinh and Le, 2018), RealNews (Zellers et al., 2019), and OpenWebText (Radford et al., 2019). For our LM training, we use the 4.5 billion-word PubMed abstract set and the 1.6 billion-word CC0-licensed Commercial Use Collection of the PMC full-text corpus (www.ncbi.nlm.nih.gov/pmc).
We train three sizes of BioMegatron: with 345 million, 800 million, and 1.2 billion parameters (Table 1).
We compare four pre-training scenarios in the smallest 345m model: using the BERT-cased/uncased vocabularies, each pre-trained from scratch or fine-tuned from a general-domain LM. We also compare two sets of domain vocabularies learned on the PubMed text corpus using the SentencePiece library (github.com/google/sentencepiece), containing 30k and 50k subword units, respectively.
We train the larger BioMegatron models with less variation: the 800m models from scratch on PubMed with the BERT-cased/-uncased vocabularies, and the 1.2b model starting from a general-domain LM checkpoint using the BERT-uncased vocabulary.

Downstream Benchmark Tasks
We use the most widely used downstream biomedical benchmark datasets for NER, RE, and QA.
Named Entity Recognition The BC5CDR NER dataset annotates disease and chemical terms with IOB tagging (Ramshaw and Marcus, 1999). In NCBI-disease (Dogan et al., 2014), only disease entities are IOB-tagged.
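As a concrete illustration, the IOB scheme marks the first token of an entity with a B- tag and continuation tokens with I- tags. A minimal extractor sketch (the example sentence is invented, not drawn from BC5CDR itself):

```python
def iob_spans(tokens, tags):
    """Extract (entity_text, entity_type) spans from IOB-tagged tokens."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # start of a new entity
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the open entity
            current.append(tok)
        else:                               # "O" closes any open entity span
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans

# Made-up BC5CDR-style example:
tokens = ["Famotidine", "-", "associated", "delirium", "."]
tags   = ["B-Chemical", "O", "O", "B-Disease", "O"]
print(iob_spans(tokens, tags))  # [('Famotidine', 'Chemical'), ('delirium', 'Disease')]
```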

Relation Extraction
The ChemProt (Krallinger et al., 2015) dataset contains sentences from PubMed abstracts, where chemical-protein interaction types are annotated in five categories. Relation Extraction is essentially a sequence classification task: each sentence is classified into a relation category.
Question Answering The BioASQ-7b factoid task (Tsatsaronis et al., 2015) is a biomedical QA dataset whose format is similar to the SQuAD dataset (Rajpurkar et al., 2016). In this task, each question comes with context snippets and gold answers, and factoid predictions are evaluated with strict accuracy (SAcc), lenient accuracy (LAcc), and mean reciprocal rank (MRR).
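The three metrics can be sketched as follows. `bioasq_metrics` is a hypothetical helper, not the official BioASQ evaluation script; it assumes each prediction is a ranked list of candidate answers (BioASQ scores the top five):

```python
def bioasq_metrics(predictions, golds):
    """Strict accuracy: gold answer ranked first; lenient accuracy: gold
    anywhere in the ranked list; MRR: mean of 1/rank of the first hit."""
    strict = lenient = rr = 0.0
    for ranked, gold in zip(predictions, golds):
        hits = [i for i, ans in enumerate(ranked) if ans in gold]
        if hits:
            lenient += 1
            rr += 1.0 / (hits[0] + 1)       # ranks are 1-based
            if hits[0] == 0:
                strict += 1
    n = len(golds)
    return strict / n, lenient / n, rr / n

# Illustrative ranked predictions and gold-answer sets (not real BioASQ data):
preds = [["aspirin", "ibuprofen"], ["tp53", "brca1"], ["insulin"]]
golds = [{"aspirin"}, {"brca1"}, {"glucagon"}]
sacc, lacc, mrr = bioasq_metrics(preds, golds)
print(sacc, lacc, mrr)  # 1/3, 2/3, 0.5
```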

Results and Discussion
The evaluation results on NER and RE are shown in Table 2, and those on QA in Table 3. We evaluate NER with entity-level F1 using the official CoNLL evaluation script translated into Python (github.com/spyysalo/conlleval.py). RE uses micro-level F1, and QA uses the BioASQ evaluation script (github.com/BioASQ/Evaluation-Measures).
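A minimal sketch of micro-level F1 as it is typically computed for ChemProt-style RE, treating the no-relation label as negative; the helper and the example labels are illustrative, not the official evaluation code:

```python
def micro_f1(golds, preds, negative_label="false"):
    """Micro-averaged F1 over the positive relation classes; the
    'false'/no-relation label contributes only to FP/FN counts."""
    tp = fp = fn = 0
    for g, p in zip(golds, preds):
        if p != negative_label and p == g:
            tp += 1                         # correct positive prediction
        if p != negative_label and p != g:
            fp += 1                         # spurious or wrong-class positive
        if g != negative_label and p != g:
            fn += 1                         # missed or misclassified gold relation
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

golds = ["CPR:3", "CPR:4", "false", "CPR:9"]
preds = ["CPR:3", "false", "CPR:4", "CPR:9"]
print(micro_f1(golds, preds))  # 2/3
```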

Named Entity Recognition
While the NER benchmark datasets appear saturated due to their small sample sizes, we find that the subword vocabulary is the most critical factor. Examples of tokenization with different vocabularies are shown in Figure 1. Representing named entities as single terms is more helpful than breaking them into several subtokens. A domain vocabulary can achieve a lower break-out rate (fewer subtokens per word) while being smaller in size than our 50k-size vocabulary; a lower break-out rate with a smaller vocabulary size probably helps achieve better NER performance despite a smaller model size.

Table 3: Evaluation results on QA after fine-tuning for 30 epochs on checkpoints fine-tuned on the SQuAD dataset, with fixed hyper-parameter settings: num-fc-layers: 2; fc-hidden-size: 2048; fc-dropout: 0.1; max-seq-length: 512; learning-rate: 3e-5; cross-entropy loss; Adam optimizer. BioMegatron models are pre-trained from scratch on PubMed, except the 1.2b model, which is fine-tuned from a general-domain model checkpoint.
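The break-out rate can be illustrated with a greedy longest-match (WordPiece-style) tokenizer. The miniature vocabularies below are invented for illustration only; they are not the actual BERT or BioMegatron vocabularies:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match subword tokenization (WordPiece-style sketch)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest candidate first
            cand = word[i:j] if i == 0 else "##" + word[i:j]
            if cand in vocab:
                pieces.append(cand)
                i = j
                break
        else:                               # no piece matched
            return ["[UNK]"]
    return pieces

def breakout_rate(words, vocab):
    """Average number of subword pieces per word; lower is better for NER."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)

# Tiny made-up vocabularies:
general_vocab = {"dex", "##ame", "##tha", "##sone"}
domain_vocab = {"dexamethasone"} | general_vocab

print(greedy_tokenize("dexamethasone", general_vocab))  # ['dex', '##ame', '##tha', '##sone']
print(greedy_tokenize("dexamethasone", domain_vocab))   # ['dexamethasone']
```

A domain vocabulary that keeps frequent biomedical terms whole drives this rate toward 1.0, which is the effect discussed above.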
We can label entities for NER training by: (1) marking the whole entity with a single label, or (2) labeling sub-tokens separately. Figure 1 shows examples of the two labeling methods. We find these different schemes can result in as much as ∼2% difference in F1-score on NER evaluation, possibly indicating that the datasets are too small. We report NER results with sub-tokens labeled separately, except for the NCBI-disease dataset, where whole-entity labeling gives better results across models.
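One plausible rendering of the two labeling schemes projects word-level tags onto subtokens as follows; the exact recipe used in our experiments may differ, and the subtoken counts in the example are illustrative:

```python
def label_subtokens(word_tags, subtoken_counts, scheme="separate"):
    """Project word-level IOB tags onto subtokens.

    scheme="whole": every subtoken repeats the word's tag, marking the
    entity as one block of identical labels.
    scheme="separate": the first subtoken keeps the word's tag and the
    continuation subtokens receive the matching I- tag.
    """
    out = []
    for tag, n in zip(word_tags, subtoken_counts):
        if scheme == "whole" or tag == "O":
            out.extend([tag] * n)
        else:
            out.extend([tag] + ["I-" + tag[2:]] * (n - 1))
    return out

# "dexamethasone" split into 4 pieces, "therapy" kept whole:
tags, counts = ["B-Chemical", "O"], [4, 1]
print(label_subtokens(tags, counts, "whole"))
# ['B-Chemical', 'B-Chemical', 'B-Chemical', 'B-Chemical', 'O']
print(label_subtokens(tags, counts, "separate"))
# ['B-Chemical', 'I-Chemical', 'I-Chemical', 'I-Chemical', 'O']
```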

Relation Extraction
Since RE is a classification task, albeit on sequences rather than on tokens, the choice of subword vocabulary has a notable effect.
We also observe that larger models yield higher precision at lower recall, for both NER and RE. More hyper-parameter tuning could achieve higher F1-scores, even if the generalization ability of such results may be questionable.

Question Answering

Table 3 shows evaluation results after fine-tuning on SQuAD for 10 epochs and then on BioASQ for 30 epochs, following the recipe previously found to work best. We found a large batch size to be beneficial, as Q&A pairs repeat up to 88 times; we use a batch size of 64 per GPU with data parallelism on 16 GPUs. Using the biomedical vocabularies results in much worse performance, possibly due to their low relevance to the initial SQuAD fine-tuning task. Larger models tend to perform better on QA, though the gains level off after 345m parameters. The model-size effect is more evident when fine-tuning on BioASQ directly, as shown in Table 5.

Domain Transfer and Generalization
We examine how well general- and domain-specific LMs generalize across domains, in relation to model size. Gu et al. (2020) studied the effect of "domain-specific" vs. "mixed-domain" pre-training, i.e., pre-training on PubMed from scratch vs. pre-training starting from a general-domain LM (fine-tuning). They found that pre-training on PubMed from scratch is better for biomedical NLP benchmarks; we analyze its effect with further pre-training (fine-tuning) steps. In other words, if starting from a general-domain LM, does sufficient fine-tuning make it as good as a fully domain-specific model? And can such a model have any advantage in cross-domain or cross-discipline generalization?

RE ChemProt

Pre-training steps   F1
10^3                 0.00
10^4                 34.1
10^5                 63.4
2·10^5               71.1
3·10^5               70.4
4·10^5               69.7
5·10^5               68.3

Table 6 shows F1-score evaluation on NER and RE benchmarks using the general-domain BioMegatron-1.2b with additional fine-tuning. It shows that even a large LM pre-trained on a large general-domain text corpus needs sufficient further pre-training on domain text (PubMed). After sufficient pre-training on domain text, it can be as good as an LM pre-trained on domain text only, except that the vocabulary has a more significant effect on NER.

Model                        SAcc  LAcc  MRR
Megatron-345m (general LM)   38.5  52.6  43.7
Megatron-1.2b (general LM)   29.3  39.7  32.7

Table 7 shows the results of general-domain LMs fine-tuned on BioASQ-7b-factoid. Larger models do not perform better, which may indicate overfitting on the small training set. On the SQuAD dataset, by contrast, a large biomedical LM pre-trained on a large text corpus performs better than smaller general-domain LMs such as BERT-Large, even though it was pre-trained on biomedical text.

Other Domain-Specific Factors
Size and Bias in Biomedical Datasets Annotating biomedical data requires in-depth domain knowledge. Moreover, the data often have substantial label bias, as occurrences of "abnormal" or "findings" labels are rare by nature. As a result, biomedical benchmark datasets tend to be smaller and more highly biased than their general-domain counterparts. Table 9 compares benchmark datasets for NER, RE (CLS), and QA in the biomedical domain with their general-domain counterparts. The SQuAD QA set is 15 times larger than the BioASQ data, and the same question-answer combinations appear up to 88 times in BioASQ.
Question-answer pairs are seldom repeated in the SQuAD data, at most twice. The BC5CDR NER dataset is one-third the size of CoNLL-2003, and its ratio of I- to O-tags is 0.08, compared to 0.18 for CoNLL.
Methods to circumvent data-imbalance issues, such as oversampling the minority classes (Chawla et al., 2002; Chen et al., 2010) and using weighted cross-entropy, gave only minor improvements on our NER and RE benchmarks. Recently, a dice loss was proposed for data-imbalance issues in NLP, with SOTA results on NER and QA, which could be a future avenue to explore for domain LMs. Transfer learning proved effective on the biomedical QA task; however, it is less clear how to apply it to NER and RE tasks.
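The two imbalance remedies mentioned above can be sketched in a few lines; these helpers are illustrative, not the code used in our experiments:

```python
import random
from collections import Counter

def inverse_freq_weights(labels):
    """Per-class weights for weighted cross-entropy, inversely proportional
    to class frequency (normalized so the sample-weighted mean is 1)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}

def oversample(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    out = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    return out

# A 9:1 imbalanced tag distribution, as in NER data dominated by "O" tags:
labels = ["O"] * 9 + ["B-Disease"]
print(inverse_freq_weights(labels))  # {'O': 0.555..., 'B-Disease': 5.0}
```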

Model         PubMed Corpus             #Words
BioBERT       abstracts                 4.5 billion
PubMedBERT    abstracts + full-text     16.8 billion
BioMegatron   abstracts + full-text-CC  6.1 billion

Table 10: Pre-training text corpus of each biomedical LM. We pre-train on PubMed abstracts and the full-text commercial-use collection (CC), which are free of copyright restrictions.

Pre-training Corpus and Duration
PubMedBERT is pre-trained on a much larger text corpus, as shown in Table 10. It is a performant domain LM, with a large pre-training corpus and a domain vocabulary well matched to its model size. We pre-train our LMs for about one epoch, reaching a masked-LM loss of about 1.2 (Devlin et al., 2018). Further pre-training may be helpful, but it is challenging to run strictly controlled experiments across so many different settings.

Conclusion
We review and test several factors that can affect the performance of domain language models. We find that a language model targeted to a specific domain and application performs best: for token classification, for example, model size is secondary to the choice of vocabulary set. A larger model does not necessarily translate to better performance on cross-domain benchmark tasks, which probably indicates that there is no master model that can "do it all," at least not as well as a targeted one. Model size remains a secondary factor; given a domain- and application-specific language model, a larger size can probably improve its performance further.