TIMBERT: Toponym Identifier For The Medical Domain Based on BERT

In this paper, we propose an approach to automate the process of place name detection in the medical domain to enable epidemiologists to better study and model the spread of viruses. We created a family of Toponym Identification Models based on BERT (TIMBERT), which learn, in an end-to-end fashion, the mapping from an input sentence to the same sentence labeled with toponyms. When evaluated on the SemEval 2019 task 12 test set (Weissenbacher et al., 2019), our best TIMBERT model achieves an F1 score of 90.85%, a significant improvement over the state-of-the-art of 89.13% (Wang et al., 2019).


Introduction
Phylogeographers, who study the geographic distribution of viruses, have long linked the increase in the geographical spread of viruses (Gautret et al., 2012; Green and Roberts, 2000) to the growth in global tourism and international trade of goods. Most notably, in December 2019, a pneumonia-like disease, later dubbed COVID-19, that was detected in the city of Wuhan, China quickly became a world pandemic, mainly due to global travel (World Health Organization, 2020b; World Health Organization, 2020a).
Epidemiologists study the global impact of the spread of viruses by considering information on the DNA sequence and structure of viruses, as well as relying on accurate metadata. Although accurate localized geographical data is critical for creating maps of the locations of viruses and their migration paths, most publicly available databases, such as GenBank (Benson et al., 2012), provide insufficient details on the matter, limited to the country or state level. Scotch et al. (2011) estimated that 7% of GenBank records are missing geospatial metadata entirely and 73% lack detailed records. However, more fine-grained localization information on the viruses may be present in the articles that describe the research work. Therefore, a manual inspection of biomedical articles is vital for obtaining more detailed information about the locations of the viruses.
Toponym detection aims to identify the word boundaries of expressions that denote geographic names. Toponym detection has been the focus of much work in recent years (Ardanuy and Sporleder, 2017; De-Lozier et al., 2015; Taylor, 2017) and studies have shown that the task is highly dependent on the textual domain (Amitay et al., 2004; Purves et al., 2007; Qin et al., 2010; Kienreich et al., 2006; Garbin and Mani, 2005). The focus of this paper is to propose a competitive deep learning based model for toponym detection that learns in an end-to-end fashion the mapping from an input sentence to the associated sentence with toponym labels.
We evaluated our models using the SemEval 2019 task 12 test set (Weissenbacher et al., 2019) and report their performance in terms of precision, recall, and F1. These metrics can be measured in two ways: strict or overlapping. The strict measures consider a prediction to match the gold standard annotation only if both point to the exact same span of text at the character level. On the other hand, the overlapping measures are more lenient, as they consider a prediction to match the gold standard annotation when the two share a common span of text. Since the research community in toponym identification is more concerned with strict measures (Magge et al., 2018), we only report the strict measures of precision, recall, and F1. As reported in Section 4, our best model achieves a strict F1 score of 90.85% on the SemEval 2019 task 12 test set (Weissenbacher et al., 2019).
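The distinction between the two matching regimes can be made concrete with a small sketch. Here spans are (start, end) character offsets; the function names are illustrative and not part of the official SemEval 2019 task 12 scorer.

```python
def strict_match(pred, gold):
    """A prediction matches only if both character offsets are identical."""
    return pred == gold

def overlapping_match(pred, gold):
    """A prediction matches if the two spans share at least one character."""
    return pred[0] < gold[1] and gold[0] < pred[1]
```

For example, if "Wuhan" is annotated at characters 30-35, a prediction of (30, 35) matches under both regimes, while a prediction of (30, 40) only counts under the overlapping measures.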

Previous Work
Toponym detection is the task of categorizing each word of a text as a toponym or a non-toponym. For example, given the sentence: COVID-19 was first reported in Wuhan, Hubei Province, China. A toponym detection system should identify Wuhan, Hubei Province, and China as toponyms, and all other words as non-toponyms. Toponym detection tackles ambiguities between toponyms and other classes of Named Entity Recognition, as well as metonymic usage of toponyms.
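The word-level labeling described above can be sketched as follows. The tag names ("TOP"/"O"), the whitespace tokenizer, and the `label_tokens` helper are simplifications for illustration, not the paper's actual pipeline.

```python
def label_tokens(tokens, toponym_words):
    """Tag each token as a toponym ("TOP") or non-toponym ("O")."""
    return ["TOP" if tok.strip(",.") in toponym_words else "O"
            for tok in tokens]

sentence = "COVID-19 was first reported in Wuhan, Hubei Province, China.".split()
toponyms = {"Wuhan", "Hubei", "Province", "China"}
labels = label_tokens(sentence, toponyms)
# "Wuhan," is tagged TOP; "reported" is tagged O.
```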
Toponym resolution in the epidemiology domain was the objective of the SemEval 2019 task 12 (Weissenbacher et al., 2019). The majority of current approaches to toponym detection in the medical domain are based on a combination of rule-based techniques, geographical gazetteer approaches, and deep learning models (Magge et al., 2018; Qi et al., 2019). In our approach, we aim to develop a toponym detection system based solely on deep learning techniques, in order to sidestep the many design choices and the pattern mining that come with rule-based techniques and gazetteer approaches.
Contextualized Embeddings Previous attempts at toponym detection in the medical domain have used contextualized embeddings, specifically ELMo (Peters et al., 2018), as the core of their models (Yadav et al., 2019). In this paper, we experiment with a different contextualized embedding model and choose the pretrained BERT model (Devlin et al., 2019) as the backbone of our network architecture.
Linguistic Features Previous works on toponym detection in the medical domain have typically taken advantage of handcrafted features to achieve competitive performance. The most notable features include: (1) orthographic features that capture character level attributes of each token (Magge et al., 2018; Davari et al., 2019), and (2) part of speech (POS) tags (Davari et al., 2019; Qi et al., 2019). In Section 4, we evaluate the influence of these features on our proposed model.
Network Architecture There are two paradigms governing the network architectures used for this task: namely, whether localized contextual information is enough, or all available contextual information should be taken into account when making predictions. The former leads to models that only have access to a sliding window of information, such as CNNs or MLPs (Magge et al., 2018; Davari et al., 2019; Magnusson and Dietz, 2019). The latter leads to sequential models operating at the sentence level, among which the BiLSTM-CRF architecture is the most favored and provides state-of-the-art results (Yadav et al., 2019; Qi et al., 2019; Magnusson and Dietz, 2019). In our experiments, we focused on neural architectures that consider all available contextual information within a sentence. However, we deviated from the trend of Recurrent Neural Networks (RNNs) and based our network on Transformer models (Vaswani et al., 2017) due to their ability to process variable-length inputs while being much more parallelizable than RNNs (see Section 3).

Our Proposed Model
The architecture of our toponym recognition model is shown in Figure 1. The WordPiece (Wu et al., 2016) tokenization of a sentence constitutes the input of the model. These tokens are then passed to a pretrained BERT network (Devlin et al., 2019). The output of the network, along with certain linguistic features (see Section 4), is then concatenated and passed to a fully connected layer which determines the label of each token. In our experiments, we used two variations of the BERT model: BERT-Base and BERT-Large (Devlin et al., 2019). The corresponding TIMBERT models are called TIMBERT-Base and TIMBERT-Large. Since the BERT-Large model is much more computationally expensive than BERT-Base, we limited the number of experiments involving this model.
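The concatenate-then-classify step can be sketched with plain Python. This is a minimal illustration of the mechanism only: the vector dimensions, weights, and helper names are made up, and the real model uses a trained fully connected layer over BERT's hidden states.

```python
def linear_layer(x, weights, bias):
    """One fully connected layer: returns one score per output label."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def classify_token(bert_vec, feature_vec, weights, bias):
    """Concatenate the BERT output with linguistic features, then classify."""
    combined = bert_vec + feature_vec  # list concatenation = vector concat
    scores = linear_layer(combined, weights, bias)
    return max(range(len(scores)), key=scores.__getitem__)  # argmax label id
```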

Experiments and Results
Our models were evaluated on the SemEval 2019 task 12 (Weissenbacher et al., 2019) dataset. The dataset consists of 150 articles from PubMed annotated with toponym mentions, and was split into a training, validation, and test set containing 60%, 10%, and 30% of the dataset respectively. Table 1 shows statistics of the dataset. Table 2 shows the performance of our basic model, i.e. TIMBERT-Base without any linguistic features (see row #7), compared to the state-of-the-art system (Wang et al., 2019) (see row #4). A series of experiments was carried out to evaluate the influence of a variety of parameters on the performance of the model; these are described in the following sections.

Stop Words As Table 2 (row #10) shows, the removal of stop words using the NLTK stop words corpus (Bird et al., 2009) of 179 words significantly worsens the F1 performance of the model (from 81.36% to 72.08%). We hypothesize that some stop words, such as in, do help the system detect toponyms, as they provide a learnable structure for the detection of toponyms. Moreover, the BERT model is trained to capture contextual information and language structures during pretraining. Therefore, to transfer its knowledge to a downstream task, it should be given fully comprehensible and structured sentences. Stop word removal distorts sentence structure and therefore harms model performance.
Punctuation We then trained TIMBERT-Base without any punctuation information. As Table 2 (row #9) shows, the removal of punctuation marks decreased the F1 from 81.36% to 79.39%. A manual error analysis showed that many toponyms appear inside parentheses, near a dot at the end of a sentence, or after a comma (e.g. (Kara, Togo)). Hence, as suggested in (Davari et al., 2019; Gelernter and Balaji, 2013), punctuation is a good indicator of toponyms and should not be ignored.

Part of Speech Tags The majority of the work in toponym detection has indicated that augmenting the model with part of speech (POS) tags improves its performance. On the other hand, the BERT model has shown a great ability to capture a number of linguistic features and transfer them to downstream tasks (Clark et al., 2019). Since BERT constitutes the backbone of TIMBERT, we investigated whether or not our model is already aware of POS tags. We used the NLTK POS tagger (Bird et al., 2009), which uses the Penn Treebank tagset (Marcus et al., 1993). As indicated in Table 2 (row #8), including the POS tags slightly reduced the F1 of the model from 81.36% to 81.01%. This suggests that the TIMBERT-Base model is already aware of POS tags, since augmenting the model with them does not improve its performance.
Orthographic Features This feature is presented to the model as a one-hot encoded vector capturing whether a word is capitalized (e.g. Togo), uncapitalized (e.g. cell), or written in uppercase letters (e.g. UK). Since all tokens are lowercased during preprocessing, the model is otherwise unaware of this information. Our experiments showed that augmenting the model with the orthographic features increases the F1 from 81.36% to 82.90% (see Table 2, row #6).
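A possible encoding of this three-way orthographic feature is sketched below. The category order and the handling of other word shapes (e.g. single letters, mixed case) are our assumptions, not details given in the paper.

```python
def orthographic_feature(token):
    """One-hot vector over {capitalized, lowercase, all-uppercase}."""
    if token.isupper():        # e.g. "UK"
        return [0, 0, 1]
    if token[:1].isupper():    # e.g. "Togo"
        return [1, 0, 0]
    return [0, 1, 0]           # e.g. "cell"
```

This vector would be concatenated with the BERT output for each token before the fully connected layer.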
Backbone Model Experiments on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) have shown great performance gains when the backbone model is switched from BERT-Base to BERT-Large (Devlin et al., 2019). Motivated by these findings, we investigated the impact of BERT-Large on our task. We replaced the backbone module with BERT-Large in our best performing model from previous experiments (Table 2, row #6) and observed an improvement in F1 from 82.90% to 85.11% (see Table 2, row #5).
Precursory Fine-tuning Precursory fine-tuning of a language model on an objective comparable to the target task can be seen as a form of distant supervision, where a knowledge resource is exploited to gather possibly noisy training instances (Mintz et al., 2009). We investigated the effect of precursory fine-tuning on the task of toponym detection. We used the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003) and filtered it to only include instances of location names in English, establishing a training set of 8.5k sentences. We first fine-tuned our best performing model architecture (Table 2, row #5) on this dataset and used early stopping to end the training process. We then further fine-tuned the network on the SemEval 2019 task 12 dataset (Weissenbacher et al., 2019) and observed a significant improvement in F1, from 85.11% to 89.98% (see Table 2, rows #5 and #3).
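The location-only filtering step might look as follows. This is a sketch assuming standard CoNLL-style (token, NER tag) pairs with B-/I- prefixed tags; `keep_locations_only` is a hypothetical helper, not code from the paper.

```python
def keep_locations_only(tagged_sentence):
    """Map every NER tag except the location tags to O, keeping only
    toponym supervision for precursory fine-tuning."""
    return [(tok, tag if tag.endswith("-LOC") else "O")
            for tok, tag in tagged_sentence]

sent = [("Wuhan", "B-LOC"), ("is", "O"), ("in", "O"),
        ("China", "B-LOC"), ("said", "O"), ("WHO", "B-ORG")]
# The B-ORG tag on "WHO" becomes O; the two location tags survive.
```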
Pruning and Regularization Parameter pruning of neural networks can be seen as a form of permanent dropout that increases regularization and results in better generalization performance (Srivastava et al., 2014). It can also be seen as a form of L0 regularization, as it encourages sparse model representations, which have been shown to improve generalization performance (Louizos et al., 2018). We investigated the extent of the parameter pruning paradigm proposed by Frankle and Carbin (2019) on large scale transfer learning for toponym detection. We used a Monte Carlo estimate of the Lottery Ticket to prune and regularize our best performing model from previous experiments (Table 2, row #3). We observed that the compressed models outperformed the original uncompressed model at the 10% and 20% pruning levels, with 90.85% and 90.37% F1 respectively (see Table 2, rows #1 and #2).
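The general idea of magnitude-based pruning can be illustrated as follows. Note this is plain global magnitude pruning over a flat weight list, not the Lottery Ticket procedure of Frankle and Carbin (2019), which additionally rewinds the surviving weights to their initial values and retrains.

```python
def prune_by_magnitude(weights, rate):
    """Zero out the fraction `rate` of weights with the smallest
    absolute value, acting as a form of permanent dropout."""
    n_prune = int(len(weights) * rate)
    # indices of the n_prune smallest-magnitude weights
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```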

Discussion
Our search for a toponym identifier for the medical domain with little to no task-specific design choices led us to the development of the TIMBERT models. Our experiments with BERT as the backbone of our models, detailed in Section 4, confirmed that certain linguistic insights, such as POS tags, are seamlessly transferred to downstream tasks, while others, such as orthographic features, need to be integrated into the model. Our experiments with typical preprocessing techniques led to poor model performance. This suggests the need for a general agreement between the textual structure of the data during pretraining and fine-tuning.

Conclusion and Future Work
In this work, we presented a competitive model for toponym detection in the medical domain that significantly improves the state-of-the-art performance. We developed a family of toponym detection models and used BERT as the backbone of our models. In future studies, we will investigate the effects of using other language models, such as XLNet (Yang et al., 2019), RoBERTa, and ALBERT (Lan et al., 2019), for the backbone module. We experimented with parameter pruning to regularize our models as well as to reduce their computational complexity. In future studies, we will examine other knowledge distillation techniques in order to find competitive and resource-conservative models for toponym identification in the medical domain. Our experiments with precursory fine-tuning resulted in a significant performance improvement of our model. Further research can determine whether precursory task-specific fine-tuning is helpful for other NLP tasks. This can potentially lead to the development of task-specific pretrained models for more efficient transfer learning, especially for tasks with small datasets.