Transfer Learning in Biomedical Named Entity Recognition: An Evaluation of BERT in the PharmaCoNER task

To date, a large amount of biomedical content has been published in non-English texts, especially clinical documents. Therefore, it is of considerable significance to conduct Natural Language Processing (NLP) research on non-English literature. PharmaCoNER is the first Named Entity Recognition (NER) task to recognize chemical and protein entities in Spanish biomedical texts. Since abundant resources already exist in the NLP field, how to exploit them for a new task and obtain competitive performance is a meaningful question. Inspired by the success of transfer learning with language models, we introduce a BERT benchmark to facilitate research on the PharmaCoNER task. In this paper, we evaluate two baselines based on Multilingual BERT and BioBERT on the PharmaCoNER corpus. Experimental results show that transferring the knowledge learned from large-scale source datasets to the target domain offers an effective solution for the PharmaCoNER task.


Introduction
Currently, most biomedical Natural Language Processing (NLP) tasks focus on English documents, while little research has been carried out on non-English texts. However, a considerable amount of biomedical literature is published in languages other than English, especially clinical documents. Therefore, it is of considerable significance to conduct NLP research on non-English literature. PharmaCoNER (Gonzalez-Agirre et al., 2019) is the first Named Entity Recognition (NER) task to recognize chemical and protein entities in Spanish biomedical texts. Biomedical NER is the foundation of biomedical NLP research and is often used as the first step in relation extraction, information retrieval, question answering, etc.
The existing biomedical NER methods can be roughly classified into two categories: traditional machine learning-based methods and deep learning-based methods. Traditional machine learning-based methods (Settles, 2005; Campos et al., 2013; Leaman et al., 2015, 2016) mainly depend on feature engineering, i.e., the design of useful features using various NLP tools. Overall, this is a labor-intensive and skill-dependent process. In contrast, deep learning-based methods are more promising for biomedical NER tasks. Since they learn features automatically, these methods no longer require feature engineering and exhibit more encouraging performance. For example, Luo et al. (2017) proposed an attention-based BiLSTM-CRF approach to document-level chemical NER. Dang et al. (2018) proposed the D3NER model, which uses a CRF and a BiLSTM improved with fine-tuned embeddings of various linguistic information to recognize disease and protein/gene entities. Recently, language model pre-training (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019) has proven effective for improving many NLP tasks. Fine-tuning approaches (Radford et al., 2018; Devlin et al., 2019) can transfer the knowledge learned from large-scale datasets to domain-specific tasks by simply fine-tuning the pre-trained parameters.
Inspired by the success of transfer learning with language models, we would like to make full use of existing language model resources for the PharmaCoNER task. In this paper, we introduce a BERT (Devlin et al., 2019) benchmark to facilitate research on the PharmaCoNER task. We regard the large-scale dataset used to train the BERT model as the source domain and the PharmaCoNER dataset as the target domain, thus treating the PharmaCoNER task as a transfer learning problem. We evaluate two baselines based on Multilingual BERT and BioBERT. Experimental results show that transferring the knowledge learned from large-scale source datasets to the target domain offers an effective solution for the PharmaCoNER task.
Related Work

Language Model
Learning widely applicable representations of words has been an active area of research for decades. To date, pre-trained word embeddings are considered an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). Recently, ELMo (Peters et al., 2018) has been proposed to generalize traditional word embedding research (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) by extracting context-sensitive features. When its contextual word embeddings are integrated with existing task-specific architectures, ELMo achieves competitive performance on many major NLP benchmarks. More recent studies (Radford et al., 2018; Devlin et al., 2019) pre-train a model architecture on a language model objective and then fine-tune that same model for downstream tasks. BERT (Devlin et al., 2019), which stands for Bidirectional Encoder Representations from Transformers (Vaswani et al., 2017), is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. The pre-trained BERT can be fine-tuned to create competitive models for a wide range of tasks.

Transfer Learning
Many machine learning methods work well only under a common assumption: the training and test data are drawn from the same feature space and distribution (Pan and Yang, 2009). When the distribution changes, most models need to be rebuilt from scratch using newly annotated training data, which is an expensive and challenging process. It is therefore worthwhile to reduce the need and effort to recollect annotated training data. In such scenarios, transfer learning between task domains is useful. For example, Cui et al. (2018) demonstrate the effects of transfer learning in the computer vision domain. They explore transfer learning by fine-tuning the knowledge learned from large-scale datasets on small-scale, domain-specific fine-grained visual categorization datasets. For NLP tasks, Conneau et al. (2017) and McCann et al. (2017) also demonstrate the effects of transfer learning on natural language inference and machine translation, respectively. These studies demonstrate the significance of transfer learning for machine learning methods.

Problem Definition
The PharmaCoNER task is structured into two sub-tracks: 'NER offset and entity classification' and 'concept indexing'. Since we only participated in the first track, we explain only that track in detail. There are three entity types for evaluation in the PharmaCoNER corpus, namely 'normalizables', 'notnormalizables' and 'proteins'. Specifically, 'normalizables' are mentions of chemicals that can be manually normalized to a unique concept identifier, 'notnormalizables' are mentions of chemicals that could not be manually normalized to a unique concept identifier, and 'proteins' are mentions of proteins and genes. We used the extended BIO (Begin, Inside, Other) tagging scheme in our experiments. Formally, we formulate the PharmaCoNER task as a multiclass classification problem. Given an input sequence $S = \{w_1, \cdots, w_i, \cdots, w_n\}$ that has been processed by WordPiece, the goal of PharmaCoNER is to classify the tag $t$ of each token $w_i$. Essentially, the model estimates the probability $P(t \mid w_i)$, where $T$ = {B-normalizables, I-normalizables, B-notnormalizables, I-notnormalizables, B-proteins, I-proteins, O, X, CLS, SEP}, $t \in T$, $1 \le i \le n$.
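The extended tag set above can be illustrated with a small label-alignment sketch. It assumes a common convention that the text does not spell out: the first WordPiece of each word keeps the word's BIO label, continuation pieces receive 'X', and the sequence is wrapped in 'CLS' and 'SEP'. The `toy_wordpiece` function is a hypothetical stand-in for a real WordPiece tokenizer, and the example sentence and its labels are invented for illustration.

```python
from typing import Callable, List, Tuple


def toy_wordpiece(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for WordPiece: cut a non-empty word into
    fixed-size chunks and mark continuation pieces with '##'."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def align_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]] = toy_wordpiece,
) -> Tuple[List[str], List[str]]:
    """Expand word-level BIO labels to sub-token level (assumed convention)."""
    tokens, tags = ["[CLS]"], ["CLS"]
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        # First piece keeps the word label; continuation pieces get 'X'.
        tags.extend([label] + ["X"] * (len(pieces) - 1))
    tokens.append("[SEP]")
    tags.append("SEP")
    return tokens, tags


# Invented example: a chemical mention spanning two words.
words = ["Tratamiento", "con", "ácido", "acetilsalicílico"]
labels = ["O", "O", "B-normalizables", "I-normalizables"]
tokens, tags = align_labels(words, labels)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Dropping every position tagged 'X', 'CLS' or 'SEP' recovers the original word-level BIO sequence, which is why these three extra tags can be discarded after prediction.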

Model Architecture
BERT (Devlin et al., 2019), which stands for Bidirectional Encoder Representations from Transformers, is designed to learn deep bidirectional representations by jointly conditioning on both left and right context in all layers. The architecture of BERT is illustrated in Figure 1. The pre-trained BERT can be fine-tuned to create competitive models for a wide range of downstream tasks, such as named entity recognition, relation extraction, and question answering.
Here, we explain the architecture of BERT for NER tasks. The input of BERT can represent either a single text sentence or a pair of text sentences in one sequence. BERT differentiates the sentences in two ways: first, it separates them with a special token ([SEP]); second, it adds the sentence A embedding to every token of the first sentence and the sentence B embedding to every token of the second sentence. Furthermore, every sequence starts with a special token ([CLS]). For a given token, the input representation is constructed by summing the corresponding token, segment, and position embeddings. BERT provides two model sizes: BERT BASE and BERT LARGE. For the BERT model, the number of layers L, the hidden size H and the number of self-attention heads A are listed as follows:
• BERT BASE: L=12, H=768, A=12, Total Parameters=110M.
• BERT LARGE: L=24, H=1024, A=16, Total Parameters=340M.
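The input packing described above can be sketched in a few lines. This is a minimal illustration and not the BERT implementation itself: `pack_inputs` and `sum_embeddings` are hypothetical helper names, segment ids 0 and 1 stand for the sentence A and sentence B embeddings, and the embedding vectors are toy lists rather than learned tables.

```python
from typing import List, Optional, Tuple


def pack_inputs(
    sent_a: List[str], sent_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int], List[int]]:
    """Build the token, segment-id and position-id sequences for one input."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # sentence A embedding index
    if sent_b is not None:
        tokens += sent_b + ["[SEP]"]
        segment_ids += [1] * (len(sent_b) + 1)  # sentence B embedding index
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def sum_embeddings(
    tok_vec: List[float], seg_vec: List[float], pos_vec: List[float]
) -> List[float]:
    """Input representation of one token: element-wise sum of three vectors."""
    return [t + s + p for t, s, p in zip(tok_vec, seg_vec, pos_vec)]


# Single-sentence input, as used for NER.
print(pack_inputs(["el", "paciente"]))
# Sentence-pair input, shown only to illustrate the segment ids.
print(pack_inputs(["el", "paciente"], ["recibe", "heparina"]))
```

For NER only the single-sentence form is needed, so every segment id is 0 and the segment embedding contributes the same vector at each position.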
During the shared task, we used Multilingual BERT (Devlin et al., 2019) and BioBERT (Lee et al., 2019) for the PharmaCoNER task. Both the Multilingual BERT and BioBERT models are pre-trained at the BERT BASE size. The Multilingual BERT model is pre-trained on Wikipedia in multiple languages. The BioBERT model is pre-trained on Wikipedia, BooksCorpus, PubMed (PubMed abstracts) and PMC (PubMed Central full-text articles). The pre-training process of Multilingual BERT and BioBERT is similar to that of BERT BASE. More details about Multilingual BERT and BioBERT can be found in the original studies (Devlin et al., 2019; Lee et al., 2019).
For the output layer, we feed the final hidden representation $h_i$ of each token $i$ into the softmax function. The probability $P$ is calculated as follows:

$$P(t \mid h_i) = \mathrm{softmax}(W_o h_i + b_o)$$

where $T$ = {B-normalizables, I-normalizables, B-notnormalizables, I-notnormalizables, B-proteins, I-proteins, O, X, CLS, SEP}, $t \in T$, and $W_o$ and $b_o$ are weight parameters. During training, we use the categorical cross-entropy as the loss function. Finally, as shown in Figure 1, we removed the special tokens (labeled 'X', 'CLS' and 'SEP') and obtained the final BIO labels in the post-processing step.
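A minimal pure-Python sketch of this output layer follows, assuming the usual form in which the tag scores are a linear map of the hidden state followed by a softmax. The helper names (`linear`, `softmax`, `cross_entropy`, `strip_special`) and the toy weights are illustrative assumptions, and filtering on the predicted tag is one simple reading of the post-processing step, not necessarily the exact procedure used in the experiments.

```python
import math
from typing import List, Sequence

TAGS = [
    "B-normalizables", "I-normalizables",
    "B-notnormalizables", "I-notnormalizables",
    "B-proteins", "I-proteins", "O", "X", "CLS", "SEP",
]
SPECIAL = {"X", "CLS", "SEP"}


def linear(W: Sequence[Sequence[float]], h: Sequence[float],
           b: Sequence[float]) -> List[float]:
    """Compute W h + b, one score per tag."""
    return [sum(w * x for w, x in zip(row, h)) + bias
            for row, bias in zip(W, b)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Categorical cross-entropy for one token with a one-hot gold tag."""
    return -math.log(probs[gold_index])


def strip_special(pred_tags: Sequence[str]) -> List[str]:
    """Post-processing: drop positions tagged 'X', 'CLS' or 'SEP'."""
    return [t for t in pred_tags if t not in SPECIAL]


# Toy hidden state of size 2 and made-up weights for the 10 tags.
h_i = [0.5, -1.0]
W_o = [[0.1 * k, -0.2 * k] for k in range(len(TAGS))]
b_o = [0.0] * len(TAGS)
probs = softmax(linear(W_o, h_i, b_o))
predicted = TAGS[probs.index(max(probs))]
loss = cross_entropy(probs, TAGS.index("O"))
print(predicted, round(loss, 4))
print(strip_special(["CLS", "O", "X", "B-proteins", "X", "SEP"]))
```

In training, the per-token losses would be averaged over the sequence; that step is omitted here for brevity.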

Experimental Settings
In this section, we introduce the dataset, evaluation metrics and details of the training process of our model.
Dataset. The PharmaCoNER corpus has been randomly sampled into three subsets: the training set, the development set and the test set. The training set contains 500 clinical cases, and the development set and the test set each include 250 clinical cases.
Evaluation Metrics. We apply the standard measures of precision, recall and micro-averaged F1-score to evaluate the effectiveness of our model. These are also the official evaluation metrics of the PharmaCoNER task.
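As a rough illustration of these metrics, the sketch below converts BIO tag sequences into entity spans and computes micro-averaged precision, recall and F1 over exact matches. It is not the official scorer: the function names are hypothetical, spans are token indices rather than the character offsets used in the shared task, and a prediction is assumed to count as correct only when both boundaries and entity type match.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start index, end index exclusive, type)


def bio_to_entities(tags: List[str]) -> Set[Entity]:
    """Collect (start, end, type) spans from one BIO tag sequence."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def micro_prf(gold_docs: List[Set[Entity]],
              pred_docs: List[Set[Entity]]) -> Tuple[float, float, float]:
    """Micro-average: pool true/false positives over all documents."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = bio_to_entities(["O", "B-proteins", "I-proteins", "O", "B-normalizables"])
pred = bio_to_entities(["O", "B-proteins", "I-proteins", "O", "O"])
precision, recall, f1 = micro_prf([gold], [pred])
print(f"P={precision:.4f} R={recall:.4f} F={f1:.4f}")
```

Because counts are pooled before the ratios are taken, frequent entity types weigh more heavily in the micro-averaged score than rare ones.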
Training Details. During the PharmaCoNER task, we used the training set to train the model and the development set to choose its hyper-parameters. In the prediction stage, we combined the training and development sets to train our model, and the organizers used the gold-standard test set to evaluate the final results. The detailed hyper-parameter settings are given in Table 1.

Experimental Results
We applied Multilingual BERT and BioBERT to the PharmaCoNER corpus. The experimental results are shown in Table 2. 'P', 'R', 'F' denote precision, recall, and micro-averaged F1-score, respectively. It is encouraging to see that the performance of both models is quite competitive. For the Multilingual BERT model, since the model learned Spanish language information during the pre-training process, its F1-score reaches 89.24%. The BioBERT model reaches an F1-score of 89.02%.
Furthermore, we manually analyzed the errors generated by our models on the test set after the PharmaCoNER task. The main errors can be classified into three categories: (1) incorrect boundaries, (2) missed chemical/protein mentions, and (3) confusion between chemical and protein mentions. From these error examples, we infer that document-level information or biomedical knowledge may be helpful for the PharmaCoNER task.

Conclusion
In this paper, we introduce a BERT benchmark to facilitate research on the PharmaCoNER task. We evaluate two baselines based on Multilingual BERT and BioBERT on the PharmaCoNER corpus. It is encouraging to see that the performance of both models is quite competitive, reaching F1-scores of 89.24% and 89.02%, respectively. Experimental results demonstrate that transferring the knowledge learned from large-scale source datasets to the target domain offers an effective solution for the PharmaCoNER task.
In future work, we would like to explore an appropriate way to integrate document-level information or biomedical knowledge to improve the performance of the model.