BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR

The SARS-CoV-2 (COVID-19) pandemic spotlighted the importance of moving quickly with biomedical research. However, as the number of biomedical research papers continues to increase, finding relevant articles to answer pressing questions has become increasingly challenging. In this work, we propose a textual data mining tool that supports literature search to accelerate the work of researchers in the biomedical domain. We achieve this by building a neural-based deep contextual understanding model for Question-Answering (QA) and Information Retrieval (IR) tasks. We also leverage the new BREATHE dataset, one of the largest available datasets of biomedical research literature, containing abstracts and full-text articles from ten different biomedical literature sources, on which we pre-train our BioMedBERT model. Our work achieves state-of-the-art results on the QA fine-tuning task on the BioASQ 5b, 6b and 7b datasets. In addition, we observe highly relevant results when BioMedBERT embeddings are used with Elasticsearch for the Information Retrieval task on the intelligently reformulated BioASQ dataset. We believe our diverse dataset and our unique model architecture are what led us to achieve state-of-the-art results for QA and IR tasks.


Introduction
The COVID-19 pandemic reminded us of the need for a tool that biomedical researchers can use to sift through existing research to extract novel insights, and ultimately help them make novel drug discoveries. The rate of new publications in the biomedical field is on the rise. PubMed reports that more than 1 million biomedical research papers are published each year, amounting to nearly two papers per minute (Landhuis, 2016). For papers mentioning COVID-19 alone, as of June 2020 more than 8000 peer-reviewed publications had been published on PubMed. With the rate of scientific papers on COVID-19 doubling every fourteen days (Coren, 2020), it is imperative to have a language understanding tool that can extract relevant information from credible literature, such as the research methodology, data, authors, results, and citations (Hao, 2020).
In this paper, we address the problem from an information retrieval perspective, extracting the textual and contextual information from the corpus by taking a hierarchical approach. Traditional search approaches such as Lucene-based Elasticsearch (Gormley and Tong, 2015) using BM25 and Jaccard-based metrics are efficient at retrieving objective answers where the primary task is to extract specific parts of a passage. However, such methods struggle with the contextual retrieval of documents, for which we need latent space representations of the query and the corpus of passages.
Our work leverages the BERT language model architecture to pre-train a large-scale biomedical language representation model, named BioMedBERT. The work is inspired by the BioBERT research (Lee et al., 2020) from Korea University and the Clova AI research group. In our work, we use the new BREATHE dataset, which combines full-text articles and abstracts from ten data sources in the biomedical domain, to train our BioMedBERT model. We use the BERT LARGE model as a pre-training backbone to achieve new state-of-the-art results for question answering in the biomedical domain. In addition, we obtain highly relevant results for information retrieval by combining BioMedBERT embeddings with Elasticsearch. This is achieved by using a neural passage re-ranking mechanism, which learns the inherent structural dependencies in the query and the research articles. We validated our search algorithm by formulating BioASQ as a retrieval dataset.

Related Work
Latent space representation learning and vector space modeling have proven extremely successful in the natural language processing domain, where they have been shown to efficiently encapsulate the hidden meaning and context of sentences or passages. The journey of distributed word representation learning began with the efficient Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Joulin et al., 2016), which outperformed the traditional bag-of-words model by significant margins on linguistic tasks, even though these methods largely ignored the context in which tokens appear. With the advent of sequential modeling and recurrent neural models, there was a significant enhancement in the information content of the learned latent representations, with LSTMs as the state of the art at the time (Greff et al., 2016). Non-recurrent sequence-to-sequence encoder-decoder models, also known as Transformers, were introduced by (Vaswani et al., 2017). Transformers stack attention layers into a multi-headed attention architecture. The attention mechanism makes it possible to learn long-running word dependencies between input and output sequences by computing a context vector over the encoder states for each token in the sequence (Bahdanau et al., 2014). The BERT architecture, which is primarily the encoder part of the transformer, achieved notable performance on several linguistic tasks. Pre-training and fine-tuning the BERT network on a domain-specific corpus has also been shown to outperform many language models (Devlin et al., 2018). Our work combines ideas from (Devlin et al., 2018) and BioBERT (Lee et al., 2020) in using deep bidirectional transformers to learn contextual embeddings of word features from the large BREATHE corpus.

BREATHE Dataset
Biomedical Research Extensive Archive To Help Everyone, or BREATHE, is a new dataset collection of biomedical research articles from leading medical archives. It is a combination of both full body texts and abstracts. The development and recent availability of this dataset significantly inspired the work in this paper. To the best of our knowledge, BREATHE is the largest diverse collection of publicly available, machine-readable biomedical text for advanced language modeling (Goncharov et al., ).
The dataset collection process followed ethical principles for scraping, and public APIs were used when available. The primary advantage of the BREATHE dataset for our model is its source diversity. BREATHE contains full-text articles and abstracts from nine sources: BMJ, arXiv, medRxiv, bioRxiv, CORD-19, Springer Nature, NCBI, JAMA, and BioASQ (Goncharov et al., ).
We performed our experiments with the BREATHE v1.0 dataset, which contains more than 6M articles and about 4 billion words. Although BREATHE v2.0 is the most recent version (Goncharov et al., ), the results reported in this paper are from training BioMedBERT on BREATHE v1.0.

Methodology
BioMedBERT was built on the foundation of the BERT architecture (Devlin et al., 2018). In our work, we leverage both the pre-training and fine-tuning aspects of the BERT architecture, which enhances accuracy while also ensuring the robustness of the model. BERT leverages a transformer architecture (Vaswani et al., 2017) and uses the stacked encoder, where each encoder consists of a multi-headed attention layer and a feed-forward network. BERT builds on the transformer architecture to train on large unlabeled data over two self-supervised tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM (also referred to as the Cloze task) (Taylor, 1953) works by randomly masking 15% of the sequence and then attempting to predict the masked tokens, whereas the NSP task helps the model understand the relationship between two sentences by predicting whether a sentence is the actual "following" sentence or just a random one. For more details on the training procedure for BERT, the reader is referred to (Devlin et al., 2018). In training the BERT model, the total loss is computed as the sum of the masked-LM loss and the next sentence prediction loss, both of which are cross-entropy losses of the form H(p, q) = -Σ_y p(y) log q(y), with the former a multi-class log-loss over the vocabulary and the latter a binary log-loss.
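As a concrete illustration, the combined pre-training objective is simply the sum of the two cross-entropy terms. Below is a minimal numpy sketch of that computation; the function names are ours for illustration and are not taken from the BERT codebase.

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy -log q(y) for one example, where probs is the
    predicted distribution and label is the index of the true class."""
    return -np.log(probs[label])

def bert_pretraining_loss(mlm_probs, mlm_labels, nsp_probs, nsp_label):
    """Total loss = mean masked-LM cross-entropy + NSP cross-entropy.

    mlm_probs : list of predicted token distributions, one per masked position
    mlm_labels: true token ids of the masked positions
    nsp_probs : distribution over {IsNext, NotNext}
    nsp_label : 0 for IsNext, 1 for NotNext
    """
    mlm_loss = np.mean([cross_entropy(p, y) for p, y in zip(mlm_probs, mlm_labels)])
    nsp_loss = cross_entropy(nsp_probs, nsp_label)
    return mlm_loss + nsp_loss
```

In practice both heads share the transformer encoder, and the losses are averaged over a mini-batch; the sketch shows only the per-example arithmetic.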

BioMedBERT Model Architecture. Our model retains the BERT LARGE architecture, with 24 transformer blocks, a hidden size of 1024, and 16 attention heads.

Pre-training BioMedBERT
The "de facto" BERT model was pre-trained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2.5B words) (Devlin et al., 2018). The BioBERT model (Lee et al., 2020) was pre-trained on different combinations of text data from English Wikipedia, BooksCorpus, PubMed abstracts (4.5 billion words) and PubMed Central full-text articles (13.5 billion words). We trained the BioMedBERT model on BREATHE, which contains over 6 million articles from 9 different archives with 4 billion words. In pre-processing our dataset, we treated lists, tables and headers as contiguous sequences of text and initialized our model using the pre-trained BERT weights. Initially we trained the model from scratch with a custom SentencePiece vocabulary for tokenization, but this did not yield good results when evaluated on downstream fine-tuned tasks. We represented the input sequences as vectors using the WordPiece embeddings, whose vocabulary contains 30,000 tokens (Wu et al., 2016). Using WordPiece allowed us to better leverage the pre-trained weights of BERT. WordPiece accounts for words not found in its vocabulary by breaking a word into subwords, thereby creating multiple tokens for a given word. In addition, WordPiece has a subword for every character in the alphabet, so no word is left untokenizable.
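The subword splitting that WordPiece performs is a greedy longest-match-first search over the vocabulary, with non-initial pieces carrying a "##" continuation prefix. The following is an illustrative re-implementation, not the official tokenizer; `wordpiece_tokenize` is our own helper name.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece-style.
    Non-initial subwords carry the '##' continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no subword matched: emit the unknown token
            return [unk]
        tokens.append(piece)
        start = end
    return tokens
```

For example, with a vocabulary containing "bio", "##med" and "##bert", the word "biomedbert" splits into three subword tokens, all of which map to pre-trained BERT embeddings.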

Fine-tuning BioMedBERT
The BioMedBERT model retains the architectural effectiveness of the BERT model, which itself leverages the multi-head self-attention mechanism of the transformer network, making it amenable to task-specific fine-tuning. It is important to note the [CLS] token assigned by the WordPiece tokenizer, a special classification token that represents the aggregate sequence for downstream classification tasks. Building on the framework provided by (Lee et al., 2020) and (Devlin et al., 2018), our model is fine-tuned on three NLP tasks: Named Entity Recognition (NER), Relation Extraction (RE) and Question Answering (QA). Named entity recognition is the most fundamental sub-task of information extraction; it involves recognizing and classifying named entities such as drugs, diseases, etc. in unstructured biomedical text. A single output layer predicts the token-level BIO2 probabilities for each input sequence. We use precision, recall, and F1 score as evaluation metrics. The details of the datasets used for the NER fine-tuning task are given in Table 2.
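To make the BIO2 target representation concrete: each token is labeled "B-type" if it begins an entity, "I-type" if it continues one, and "O" otherwise. The helper below is a hypothetical sketch (not part of any released code) that derives such tags from token-level entity spans.

```python
def spans_to_bio2(tokens, spans):
    """Label each token with BIO2 tags given (start, end, type) spans,
    where start is inclusive and end is exclusive (token indices).
    Note: spans_to_bio2 is an illustrative helper, not a library function."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # entity-initial token
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # entity-continuation tokens
    return tags
```

The NER head then simply predicts one of these tag classes per token, and entity-level exact-match evaluation compares predicted spans against gold spans.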

Table 2: Biomedical NER datasets used for fine-tuning.

Dataset                              Entity type     Number of annotations
NCBI Disease (Dogan et al., 2014)    Disease          6,881
BC5CDR (Li et al., 2016)             Disease         12,694
BC5CDR (Li et al., 2016)             Drug/Chem.      15,411
BC4CHEMD (Krallinger et al., 2015)   Drug/Chem.      79,842
BC2GM (Smith et al., 2008)           Gene/Protein    20,703
JNLPBA (Kim et al., 2004)            Gene/Protein    35,460

Relation extraction is another sub-task of information extraction, involving the detection and classification of relationships between named entity mentions in a biomedical corpus. In processing the results for the relation extraction datasets, we incorporated the technique from BioBERT (Lee et al., 2020) of using predefined tags to anonymize target named entities. For the evaluation of RE, we also used precision, recall, and F1 score as metrics. Statistical details of the biomedical RE datasets are provided in Table 3.
Question answering is the task of predicting the text span of the answer, given the question and a passage containing the answer. For the general language question answering task, we fine-tuned BioMedBERT on the SQuAD v1.1 and v2.0 datasets using the same BERT architecture used for SQuAD. For the biomedical QA fine-tuning task, we used the BioASQ factoid datasets, as their data format is similar to that of the SQuAD datasets.

Table 3: Biomedical RE datasets used for fine-tuning.

Dataset                              Entity type     Number of relations
GAD (Bravo et al., 2015)             Gene-Disease    5,330
EU-ADR (Van Mulligen et al., 2012)   Gene-Disease      355

For evaluating BioASQ, we used the evaluation code from BioBERT to exclude samples with unanswerable questions from the training set. Like (Lee et al., 2020) and (Wiese et al., 2017), we fine-tuned BioASQ on weights that were initially fine-tuned on SQuAD v1.1. However, unlike them, we also fine-tuned BioASQ on SQuAD v2.0 weights, which was beneficial in outperforming state-of-the-art results for these tasks. This is another key contribution of this work.
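Span-based QA heads produce a start logit and an end logit per token, and the predicted answer is the span maximizing their sum. The sketch below is a simplified version of the standard SQuAD-style decoding (names and the `max_len` constraint are ours), not the exact BioBERT evaluation code.

```python
import numpy as np

def best_answer_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) token pair maximizing
    start_logits[s] + end_logits[e], subject to s <= e < s + max_len."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

Production implementations vectorize this search and additionally filter spans that cross into the question segment, but the selection criterion is the same.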
The results are reported in Table 6 and the dataset details are provided in Table 4.

Neural Information Retrieval & Biomedical Search
In this phase, the primary objective is to retrieve the most relevant research papers and articles in a ranked order based on the query made by the researcher (q_i). Two major aspects add to the complexity of this problem. First, the computational complexity associated with retrieval from a biomedical corpus of millions of records (N). Second, the ability to accurately retrieve the relevant documents in a properly rank-ordered fashion; to do so, the process needs to account for a deep contextual understanding of the text.
We approach the problem in a two-step hierarchical fashion. In the first step, we primarily reduce the search-space complexity by using Elasticsearch and the BM25 algorithm, a well-known metric in retrieval scenarios (Gormley and Tong, 2015). Elasticsearch itself operates in two stages. The first stage creates an inverted index of the entire corpus of biomedical research papers, an extremely efficient data structure for textual search. The second stage queries the corpus using the inverted index, scores the search results using the similarity (relevance) ranking function BM25, and returns the top k search results based on that BM25 score, where k is a tunable parameter (Turney and Pantel, 2010). The output of this stage is an ordered list of retrieved biomedical papers sorted by BM25 score; let (a_r1, a_r2, ..., a_rk) be the top k retrieved documents.
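The first stage can be illustrated with a toy inverted index: each term maps to the set of documents containing it, so a query only touches the posting lists of its own terms instead of scanning the corpus. This is a minimal sketch of the idea, not Elasticsearch's actual implementation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def candidate_docs(index, query):
    """Union of the posting lists of the query terms (first-stage recall)."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits
```

The candidate set returned here would then be scored with BM25 and truncated to the top k before the neural re-ranking step.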
The BM25 score of a document d for a query q is

score(d, q) = Σ_{t ∈ q} idf(t, D) · f(t, d) · (k + 1) / (f(t, d) + k · (1 − b + b · |d| / avgdl))

Here, f(t, d) represents the raw frequency of the term t in the document d; idf(t, D) is a function of the number of documents in which the term t occurs, given the universal set of documents D; |d| is the number of words in the document; avgdl is the average document length over the corpus; and k, b are free parameters. The results of the Elasticsearch-based retrieval mechanism do not incorporate the contextual aspects of the query or the specific biomedical aspects of the corpus. As mentioned earlier, for effective IR in this domain, contextual consideration is highly relevant. When biomedical researchers search for relevant studies, they search for specific combinations of topics and scenarios. For instance, they might search for "how does coronavirus impact the lungs and to what extent" rather than a generic search such as "what is coronavirus". In order to best answer these types of specific queries, the system needs robust context for both the question and the possible answers.
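Using the terms defined above (f(t, d), idf, |d|, and the parameters k and b, plus the average document length, which we call `avgdl`), a plain-Python BM25 scorer might look like the following sketch. It uses the common idf variant log(1 + (N − n_t + 0.5)/(n_t + 0.5)); Elasticsearch's exact formulation may differ slightly.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k=1.2, b=0.75):
    """BM25 score of one document for a query.

    doc_freqs[t] = number of documents containing term t (for idf)
    avgdl        = average document length over the corpus
    """
    score = 0.0
    for t in query_terms:
        f = doc_terms.count(t)                       # raw term frequency f(t, d)
        if f == 0:
            continue
        n_t = doc_freqs.get(t, 0)
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        # Length-normalized term-frequency saturation.
        norm = f * (k + 1) / (f + k * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score
```

Sorting documents by this score and keeping the top k yields the candidate list (a_r1, ..., a_rk) passed to the second, neural stage.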
To solve this challenge, in the second step of the hierarchical search we use BioMedBERT to project the query (q_i) and the top k retrieved papers (a_r1, a_r2, ..., a_rk) into a d-dimensional latent space and extract the embeddings (Nogueira and Cho, 2019). Let the query embedding vector of q_i be represented by Q_i ∈ R^d, and let the matrix of embeddings of the top k retrieved papers be A ∈ R^(k×d), with rows A_i ∈ R^d. We compute the cosine similarity as a vector-matrix product between the normalized query (Q̂_i = Q_i / ||Q_i||) and the row-normalized paper embedding matrix (Â, where Â_i = A_i / ||A_i||). Let Z be the output of this cosine similarity computation, where Z = Â · Q̂_i and Z ∈ R^k. Finally, we sort the cosine scores Z, re-ranking and returning the most relevant biomedical research papers conditioned on the query (q_i).
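The normalization, cosine scoring, and re-ranking steps above can be written in a few lines of numpy. This is an illustrative sketch of the computation, not our production code.

```python
import numpy as np

def rerank(query_emb, paper_embs):
    """Re-rank the top-k retrieved papers by cosine similarity with the query.

    query_emb : shape (d,), the query embedding Q_i
    paper_embs: shape (k, d), embedding matrix A of the k retrieved papers
    Returns paper indices sorted by descending cosine score.
    """
    q = query_emb / np.linalg.norm(query_emb)                      # Q_i / ||Q_i||
    A = paper_embs / np.linalg.norm(paper_embs, axis=1, keepdims=True)
    z = A @ q                       # Z = A · Q_i, cosine scores in R^k
    return np.argsort(-z)           # indices of papers, most relevant first
```

Because both the query and the rows of A are unit-normalized, the matrix-vector product directly yields the k cosine scores in one pass.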
The primary intent of selecting cosine similarity as the preferred metric is that the vector cosine scores are normalized on each of the dimensions and hence are robust to scaling (Eibl and Gaedke, 2017).

Datasets
In order to simplify the process of comparing our work with related works, we performed the fine-tuning experiments on the biomedical and general language datasets that are most widely used by other NLP researchers. For the NER fine-tuning task, we used the eight pre-processed datasets provided by (Wang et al., 2019) mentioned in Table 7. The NER evaluations for all datasets mentioned in Table 7 are based on entity-level exact matches. For the RE fine-tuning task, we used the pre-processed GAD and EU-ADR datasets from (Lee et al., 2020), which contain gene-disease relations. For the general domain QA fine-tuning task, we used the SQuAD v1.1 and SQuAD v2.0 datasets (Rajpurkar et al., 2016), and for the biomedical domain QA task, we used the pre-processed 4b, 5b and 6b BioASQ datasets provided by (Lee et al., 2020) and the pre-processed BioASQ 7b dataset from . We report the results in the form of micro-average scores over all five test batches of the BioASQ datasets for QA fine-tuning, and 10-fold cross-validation scores for the GAD and EU-ADR datasets for the RE fine-tuning task.

Experimental Setup
In pre-training BioMedBERT, we use the setup provided by BERT (Devlin et al., 2018). For fine-tuning and evaluation, we use the setup provided by (Lee et al., 2020). During the initial phase of experimentation, we explored training the BioMedBERT model from scratch on BREATHE using a custom vocabulary created with the SentencePiece tokenizer (Kudo and Richardson, 2018). But after evaluation on the downstream tasks of NER and QA, we found that we needed to either add general domain data to our dataset or use the weights of BERT to initialize the model.
Hence, we trained our model using BERT LARGE weights as a transfer-learning backbone on the BREATHE dataset. We used the same architecture and hyper-parameters as BERT and trained the model for 68k steps. The resulting scores were very close to state-of-the-art for NER. This motivated us to continue work on BioMedBERT and eventually train the model for 1M steps. Pre-training the BioMedBERT model on Google Cloud v3-TPUs with 128 cores for 1M steps took a little over 3 days. Fine-tuning tasks on the same TPU took less than an hour.

Experimental Results
We provide the results for the selected downstream tasks in Tables 5, 6, 7 and 8. In each table, we compare BioMedBERT's performance against BERT and the extant state-of-the-art model for the corresponding task. Among the four fine-tuning tasks selected for evaluation and comparison with previous works, BioMedBERT achieves better results than the BERT model for nearly all of them on biomedical datasets. BioMedBERT outperforms the state-of-the-art models on the QA fine-tuning task using the SQuAD v2.0 dataset, and also achieves close to state-of-the-art results in the NER and RE fine-tuning tasks for biomedical datasets, demonstrating its robustness in domain-specific downstream tasks. The BioMedBERT v1.0 model fine-tuned on the SQuAD v2.0 dataset and further fine-tuned on the BioASQ 5b, 6b and 7b datasets (trained for 2 epochs) outperforms state-of-the-art MRR scores for all 3 datasets, as seen in Table 8. The best scores are highlighted in bold, while the second-best scores are underlined. BioMedBERT may thus be viewed as the new state of the art for biomedical question-answering tasks.


Information Retrieval Task Formulation & Experimental Results
One of the most challenging parts of our research was creating a validation framework for our end-to-end biomedical retrieval methodology. To accomplish that, we had to ensure two major things. First, the validation corpus should be biomedical in nature, as our embeddings are primarily trained on a biomedical corpus. Second, the word-length distribution of the validation corpus used for retrieval should be similar to the word-length distribution of the abstracts of the biomedical research papers in the BREATHE corpus, in order to have a meaningful validation. Both of these constraints were satisfied by formulating the BioASQ dataset intelligently: we retrieve the 'context' from the 'question', rather than the 'answers', which are typically much shorter. Additionally, we discovered a bias in the BioASQ dataset from a retrieval perspective, due to the high percentage of word overlap between the question and the context, which would not be the case in real-life scenarios. Moreover, a higher percentage of overlap between questions and contexts would bias results toward the Elasticsearch-only approach (Figure 2).
To debias the dataset, we removed records whose question and context share a very high percentage of common words, as measured by a threshold on the Jaccard index (Leskovec et al., 2020). The results in Table 9 show that our methodology significantly outperforms other models on the re-structured BioASQ dataset, which is a major novelty of our research work.
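The debiasing filter can be sketched as follows; the record field names and the 0.5 threshold are assumptions for illustration, not the exact values used in our experiments.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def debias(records, threshold=0.5):
    """Drop records whose question/context word overlap exceeds the threshold,
    keeping only examples where retrieval requires more than lexical matching."""
    return [r for r in records
            if jaccard(r["question"].lower().split(),
                       r["context"].lower().split()) <= threshold]
```

Records surviving the filter are exactly those where an exact-match retriever has little lexical signal to exploit, so the comparison between BM25 alone and BM25 plus embedding re-ranking becomes meaningful.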

Discussion
We observe that the BioMedBERT model achieves state-of-the-art results in the QA tasks for both of the datasets in the biomedical domain (see Table 8) as well as in the general language domain with SQuAD v1.1 and 2.0 (see Table 6). Additionally, the model showed robustness, with an impressive performance compared to BERT and the extant state-of-the-art in other language tasks such as Named Entity Recognition. While the model did not outperform the state-of-the-art there, it was on par with BERT for most of the NER performance metrics. The same case for model robustness can be made for our model's performance on Relation Extraction tasks. Current challenges related to research into COVID-19 directed us to build a language mining tool to support question-answering and information retrieval for the biomedical domain. With respect to the main purpose of this research, our results are the current state-of-the-art given the model performances on the BioASQ question-answering datasets. Our model outperformed BioBERT (Lee et al., 2020) on QA tasks despite the fact that BioBERT was trained on a total of 18B words from the biomedical domain (13.5B from PMC full-text articles and 4.5B from PubMed abstracts), while our BioMedBERT model was trained on just over 4.1B words. The key reason for our better performance was the diversity of the datasets on which the BioMedBERT model was trained. Diversity helped enhance both the performance and robustness of the BioMedBERT model.
Another important and novel aspect of our work was how we framed the BioASQ dataset to validate our information retrieval methodology. Specifically, we debiased the dataset to better reflect reality and achieved robust and relevant results. BioMedBERT embeddings coupled with Elasticsearch outperformed the retrieval performance of the other models based on the Mean Reciprocal Rank values, as shown in Table 9. Here we experimented with retrieving a variable number of documents using Elasticsearch and then re-ranking them using embeddings, to eliminate any further source of bias. Our methodology (BioMedBERT + ES) outperforms the others by significant margins in all cases.

Conclusion
In this paper, we present the BioMedBERT model pre-trained on the BREATHE v1.0 dataset, one of the largest and most diverse datasets of biomedical research literature. BioMedBERT achieves state-of-the-art results when fine-tuned on Question Answering datasets, and also produces impressive performance on other language tasks such as Named Entity Recognition and Relation Extraction. BioMedBERT embeddings coupled with Elasticsearch give state-of-the-art performance on the re-framed BioASQ dataset. Moreover, the BioMedBERT model achieves state-of-the-art results for multiple tasks even when pre-trained only on the BREATHE v1.0 dataset, which contains just over 6 million articles. Work is in progress to train an improved BioMedBERT model on the BREATHE v2.0 dataset with over 16 million articles. We believe continued enhancements of the BioMedBERT model will help biomedical researchers discover meaningful insights from literature faster and make significant improvements in their field.
Acknowledgments
We would like to thank Daniel Goncharov and 42 School Silicon Valley for their contributions to building the BREATHE dataset. We would also like to thank Dave Elliott with Google Cloud Platform and Google TensorFlow Research Cloud for providing infrastructure and guidance throughout the project.