Evaluating Pretrained Transformer-based Models on the Task of Fine-Grained Named Entity Recognition

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task and has remained an active research field. In recent years, transformer models, and more specifically the BERT model developed at Google, revolutionised the field of NLP. While the performance of transformer-based approaches such as BERT has been studied for NER, there has not yet been such a study for the fine-grained Named Entity Recognition (FG-NER) task. In this paper, we compare three transformer-based models (BERT, RoBERTa, and XLNet) to two non-transformer-based models (CRF and BiLSTM-CNN-CRF). Furthermore, we apply each model to a multitude of distinct domains. We find that transformer-based models incrementally outperform the studied non-transformer-based models in most domains with respect to the F1 score. Furthermore, we find that the choice of domain significantly influences the performance regardless of the respective data size or the model chosen.


Introduction
Named Entity Recognition (NER) is one of the fundamental tasks in Natural Language Processing (NLP). The main objective of NER is to detect and classify proper names (named entities) in free text. Typically, named entities can be subdivided into four broad categories: persons, i.e., first and last names; locations, such as countries or landscapes; organisations, such as companies or political parties; and miscellaneous entities, which serve as a catch-all category for other named entities such as brands, meals, or social events. NER is an active research field and state-of-the-art solutions such as spaCy 1 , flair (Akbik et al., 2018), and Primer 2 manage to achieve near-human performance. However, classical NER (which we refer to as coarse-grained NER in this paper) models typically distinguish between only a small number of entity types, usually fewer than a dozen distinct categories.
While this kind of shallow classification is sufficient for many applications, there are industrial use cases in which more precise information is necessary, such as financial document processing in the banking and finance context. For instance, application forms for a business loan are usually supplied with several supporting textual documents. These can contain the names of different types of persons, such as the owner or the CEO of the applying company, the contact person(s) at the issuing bank, finance analysts, or lawyers. The same is true for organisation names, such as the name of the issuing bank, a government agency, or the name of the applying company or third-party companies. It is necessary not only to detect entity names, but also to qualify and differentiate between various entity types. Indeed, in many contexts the actual name of an entity is important only if it can be associated with a role or any other relevant quality. In the banking and finance world, for example, the strict regulatory requirements cannot be satisfied with just a list of who is involved; knowing how entities are involved is a necessity.
The term "Fine-Grained Named Entity Recognition" (FG-NER) was first coined by Fleischman and Hovy (2002). It describes a subtask of NER where the objective remains the same as standard NER, but where the number of entity types is considerably higher. In extreme cases, FG-NER models such as the fine-grained entity recognizer (FIGER) (Ling and Weld, 2012) are able to distinguish between more than 100 distinct labels.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
Conditional Random Field (CRF) models (Lafferty et al., 2001) have been popular for numerous sequence-to-sequence tasks such as NER. They perform reasonably well and can serve as a baseline for the task of FG-NER.
In a previous study, Mai et al. (2018) compared the performance of several FG-NER approaches for the English and Japanese languages. They found that the BiLSTM-CNN-CRF model devised by Ma and Hovy (2016) combined with gazetteers performed the best in terms of F1 score for the English language. They also found that BiLSTM-CNN-CRF performed well without the use of gazetteers: among the models that did not make use of gazetteers, it achieved the highest F1 score. In 2017, the introduction of the transformer model (Vaswani et al., 2017) revolutionised the NLP landscape and led to a number of novel language modeling approaches which manage to outperform state-of-the-art models in numerous tasks. In 2018, Devlin et al. (2019) developed the Bidirectional Encoder Representations from Transformers (BERT) model, a powerful language modeling technique which is considered one of the most significant breakthroughs in NLP in recent memory. BERT models are pretrained on Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks. Devlin et al. (2019) fine-tuned the resulting models on several fundamental NLP tasks such as the GLUE language understanding tasks (Wang et al., 2018), the SQuAD question answering task (Rajpurkar et al., 2016), and the SWAG Common Sense Inference task (Zellers et al., 2018), for which BERT manages to achieve state-of-the-art performance. Furthermore, Devlin et al. (2019) reported an F1 score of 92.8% when fine-tuned on the CoNLL-2003 dataset for NER (Sang and De Meulder, 2003), achieving results similar to those of state-of-the-art models such as Contextual String Embeddings (Akbik et al., 2018) and ELMo Embeddings (Peters et al., 2017).
Improving on the BERT model, Liu et al. (2019) at Facebook AI 3 developed a Robustly optimized BERT approach (RoBERTa). They claim that the standard BERT models were undertrained and proposed a new version of BERT that was trained for a longer time, on longer sequences, on more data, and with larger batches. Furthermore, they trained only on the MLM task and with dynamic changes of the masking patterns applied to the training data. BERT's pretraining was performed on the same dataset using the same masked positions throughout the MLM task. RoBERTa mitigated that limitation by duplicating the dataset ten times and using different masking patterns for each duplicate. They report that fine-tuned models derived from RoBERTa either matched or improved on BERT models in terms of performance, although they did not perform tests specifically on the NER task.
2019 also saw an attempt to solve the shortcomings of BERT in terms of the training approach. Yang et al. (2019) presented XLNet. During the MLM pretraining task of BERT, a special [MASK] token is introduced in the training set. According to Yang et al. (2019), BERT models neglect dependencies between the masked tokens. Furthermore, this token is absent from the fine-tuning tasks, resulting in a pretrain/fine-tune discrepancy. XLNet avoids this shortcoming as it does not mask its tokens, and instead permutes the order of token predictions. Yang et al. (2019) report that XLNet outperforms BERT in 20 NLP tasks, specifically language understanding, reading comprehension, text classification, and document ranking tasks. They do not report any results on sequence-to-sequence tasks like NER. While BERT, RoBERTa, and XLNet (which we refer to as transformer-based models throughout the paper) achieve state-of-the-art performance in numerous Natural Language Understanding (NLU) tasks, we observe a lack of research in the area of FG-NER. In this paper, we present an empirical study of the performance of FG-NER approaches derived from a pretrained BERT, a pretrained RoBERTa, and a pretrained XLNet model, as well as a comparison to a simple CRF model and the model presented by Ma and Hovy (2016). Furthermore, we apply these approaches to a large number of distinct domains, with varying numbers of data samples and entity categories. Specifically, we will address the following research questions:
• RQ1: Do transformer-based models outperform the state-of-the-art model for the FG-NER task?
• RQ2: What are the strengths, weaknesses, and trade-offs of each investigated model?
• RQ3: How does the choice of the domain influence the performance of the models?
To the best of our knowledge, our study is the first aiming to precisely evaluate the performance of these existing approaches on the FG-NER task.

Experimental Setup
In this section, we present the dataset used in this study and we introduce the different models that we compare against each other.

Dataset
For this study, we apply the selected models to the English Wikipedia Named Entity Recognition and Text Categorization (EWNERTC) dataset 4 published by Sahin et al. (2017b). It is a collection of automatically categorised and annotated sentences from Wikipedia articles. The original dataset consists of roughly 7 million annotated sentences, divided into 49 separate domains. These 49 domains vary significantly in overall size and number of entity types. The physics domain is the smallest subset with 68 sentences, 144 entities, and merely 6 distinct entity types. In contrast, the location domain is the largest subset with 443 646 sentences, 1 472 198 entities, and 1603 types. Table 1 contains statistics for a small selection of domains. 5 Physics, fashion, finance, exhibitions, and meteorology are the five smallest sets, consisting of fewer than 3000 sentences each. Food, media, biology, travel, and business are medium-sized sets, comprising between 40 000 and 70 000 sentences. Finally, government, film, music, people, and location are the largest sets with more than 300 000 sentences each. It is noteworthy that the physics dataset is an obvious outlier in terms of size (since the second smallest dataset is the fashion dataset, which contains an order of magnitude more sentences). It is possible that the size of the physics subset is too small to produce meaningful results.
For this study, the number of entity types was drastically reduced. This measure was taken for two reasons: first, most entity types appear only a few times in any given subset; second, the training time for CRF models tends to explode when dealing with a high number of entity types, according to Mai et al. (2018). We limited the number of entity types per domain to the top 50 and, if necessary, added a miscellaneous type as a catch-all for all remaining named entities.
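The label-reduction step described above can be sketched as follows. The function name, the BIO-style tag format, and the "MISC" tag name are our own assumptions for illustration, not taken from the EWNERTC release:

```python
from collections import Counter

TOP_K = 50

def reduce_entity_types(sentences, top_k=TOP_K):
    """Keep the top_k most frequent entity types; map the rest to MISC.

    sentences: list of sentences, each a list of (token, tag) pairs,
    with BIO-style tags such as "B-person" / "I-person" / "O".
    """
    # Count entity types, ignoring the B-/I- prefix and "O" tokens.
    counts = Counter(
        tag.split("-", 1)[1]
        for sent in sentences
        for _, tag in sent
        if tag != "O"
    )
    kept = {t for t, _ in counts.most_common(top_k)}

    reduced = []
    for sent in sentences:
        new_sent = []
        for token, tag in sent:
            if tag == "O" or tag.split("-", 1)[1] in kept:
                new_sent.append((token, tag))
            else:
                prefix = tag.split("-", 1)[0]  # preserve the B-/I- prefix
                new_sent.append((token, f"{prefix}-MISC"))
        reduced.append(new_sent)
    return reduced
```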

Approaches
In this section, we present the five models that we investigate for this study in more detail and we specify the configuration of each model.

CRF
As CRF models remain popular solutions for sequence-to-sequence tasks, we use a simple CRF model as a baseline. We use a large number of context and word shape features, such as casing information and whether or not the word contains numerical characters. While simple CRF models generally perform well for coarse-grained NER, they require custom-made features and their usefulness is limited for FG-NER according to Mai et al. (2018), who observed that CRF models tend to require too much time to finish when handling a large number of labels. We use the sklearn-crfsuite API 6 for Python with the following training hyperparameters: gradient descent using the L-BFGS method as the training algorithm, with a maximum of 100 iterations. The coefficients for L1 and L2 regularisation are fixed to c1 = 0.4 and c2 = 0.0. We use the following features: the word itself, casing information, whether the word is alphabetical, numerical, or alphanumerical, suffixes and prefixes, as well as the words and features in a two-word context window. Considering that the datasets are numerous and very diverse, we decided against using specialised gazetteers/dictionaries for this study, despite their proven usefulness in earlier studies (Mai et al., 2018).
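As an illustration, a feature extractor in the spirit of the one described above could look as follows. The exact feature names are our own, and the sklearn-crfsuite configuration in the closing comment mirrors the hyperparameters stated in the text:

```python
def word2features(sent, i):
    """Word shape and context features for token i of a tokenised sentence."""
    word = sent[i]
    feats = {
        "word": word,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # casing information
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),   # numerical
        "word.isalpha": word.isalpha(),   # alphabetical
        "word.isalnum": word.isalnum(),   # alphanumerical
        "suffix3": word[-3:],
        "prefix3": word[:3],
    }
    # Two-word context window on each side.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(sent):
            ctx = sent[j]
            feats[f"{offset}:word.lower"] = ctx.lower()
            feats[f"{offset}:word.istitle"] = ctx.istitle()
    if i == 0:
        feats["BOS"] = True
    if i == len(sent) - 1:
        feats["EOS"] = True
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# The CRF itself would then be configured roughly as:
#   crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.4, c2=0.0,
#                              max_iterations=100)
#   crf.fit([sent2features(s) for s in train_sents], train_labels)
```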

BiLSTM-CNN-CRF
As our state-of-the-art model, we use the implementation of Reimers and Gurevych (2017b) 7 of the BiLSTM-CNN-CRF model proposed by Ma and Hovy (2016). The model consists of a combination of a convolutional neural network (CNN) layer, a bidirectional long short-term memory (BiLSTM) layer, and a CRF layer. First, the CNN is used to extract character-level representations of given words, which are then concatenated with word embeddings to create word-level representations of the input tokens. These representations are fed into a forward and a backward LSTM layer, creating a bidirectional encoding of the input sequence. Finally, a CRF layer decodes the resulting representations into the most probable label sequence (Ma and Hovy, 2016). Mai et al. (2018) achieved the best performance with a combination of gazetteers and BiLSTM-CNN-CRF, but as was mentioned above, we do not use gazetteers for this study due to the diverse nature of our datasets. We use the hyperparameters recommended by Reimers and Gurevych (2017a) as they were shown to be useful for coarse-grained NER. We also use Global Vectors (GloVe) 8 word embeddings with 300 dimensions for the same reason.

BERT
Pretraining a language model can take several days due to the large number of trainable parameters. Furthermore, a sizable amount of data is required to achieve good results. Indeed, we tried to train a few language models using the EWNERTC dataset, but it is too small and the resulting models were essentially unusable as they yielded very low F1 scores. Fortunately, Google provides a variety of pretrained models that have been trained on the BooksCorpus (Zhu et al., 2015) and English Wikipedia, amounting to a grand total of 3.3 billion words. We use the Transformers library 9 provided by Huggingface (Wolf et al., 2019), which allows one to pretrain and fine-tune BERT models with a simplified procedure using CLI commands. For this study, we fine-tune an English BERT Base model using each dataset separately. As we compare models for FG-NER, we chose the cased model as recommended, in order to preserve casing information. The BERT Base model contains 12 transformer blocks, a hidden size of 768, 12 self-attention heads, and 110 million parameters in total. While the BERT Large model yields better results in every task that Devlin et al. (2019) investigated, the BERT Base model can be useful for determining a lower boundary for the performance. Devlin et al. (2019) report that the recommended hyperparameters vary depending on the task, but generally the best performances are observed for a batch size in {16, 32}, a learning rate in {2e-5, 3e-5, 5e-5}, and a number of training epochs in {2, 3, 4}. After testing on three specific domains (comic books, symbols, and fictional universe with 21 262, 21 171, and 39 781 sentences respectively), we found that a batch size of 16, a learning rate of 5e-5, and 5 training epochs yielded the highest F1 scores.
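The hyperparameter search described above can be sketched with plain Python. Here `train_and_eval` is a placeholder for fine-tuning a model with the given settings and returning its F1 score on a held-out split (its name and signature are our own), and the grid extends the recommended epoch range with the value 5 that we ultimately selected:

```python
from itertools import product

# Search space: Devlin et al.'s recommended values, plus 5 epochs.
GRID = {
    "batch_size": [16, 32],
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "epochs": [2, 3, 4, 5],
}

def grid_search(train_and_eval, grid=GRID):
    """Exhaustively try every configuration; return the best one and its F1."""
    best_f1, best_cfg = -1.0, None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        f1 = train_and_eval(**cfg)  # fine-tune + evaluate (placeholder)
        if f1 > best_f1:
            best_f1, best_cfg = f1, cfg
    return best_cfg, best_f1
```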

RoBERTa
RoBERTa presents similar challenges to BERT as it needs a large amount of resources, time, and data. Liu et al. (2019) provide pretrained models, trained on 160GB of text, which represents about 3-4 times the amount of data used for pretraining BERT. We use the RoBERTa Base model, which contains 12 transformer blocks, a hidden size of 768, 12 self-attention heads, and 125 million trainable parameters. We fine-tune it on each dataset separately. Similar to the pretrained BERT model, the pretrained RoBERTa model is also cased, making it appropriate for fine-tuning on NER tasks. Liu et al. (2019) trained RoBERTa using the same hyperparameters as BERT, except for the number of training epochs, which they fixed to ten. We perform a similar grid search as for BERT, i.e., a batch size in {16, 32} and a learning rate in {2e-5, 3e-5, 5e-5}, but training epochs in {2, 4, 6, 8, 10}. Testing on the comic books, symbols, and fictional universe domains, we found that a batch size of 16, a learning rate of 5e-5, and 10 training epochs performed best with regards to F1 score.

XLNet
While the pretraining approach of the XLNet model differs significantly from BERT models, the pretraining step still requires a vast amount of resources and time. Thus, we once again use a pretrained model rather than training one ourselves. For the comparison, we use the cased XLNet Base model with 12 transformer blocks, a hidden size of 768, 12 self-attention heads, and 110 million parameters. Yang et al. (2019) fine-tuned their pretrained model using the same hyperparameters as the BERT models to compare their performances. We perform the same hyperparameter grid search as for BERT, and obtain the best F1 score with a batch size of 16, a learning rate of 5e-5, and 5 training epochs for the comic books, symbols, and fictional universe domains.

Experimental Results
In this section, we answer the three research questions that we formulated for this study (cf. Section 1). Table 2 shows the performance of the five models for each domain. In order to account for the imbalanced distribution of the entity types, we opt to calculate micro-averaged performance scores, which take into account the frequency of every entity type. To facilitate reading, we highlight (in bold) the highest F1 score for each domain.
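For reference, micro-averaged scores pool the entity-level true positives, false positives, and false negatives over all entity types before computing precision, recall, and F1, so frequent types carry more weight. A minimal sketch, where the `(type, span)` input format is our own convention for illustration rather than the one used by our evaluation tooling:

```python
from collections import Counter

def micro_prf(gold_sents, pred_sents):
    """Micro-averaged precision, recall, and F1 over entity annotations.

    Each sentence is a list of (entity_type, span) tuples; duplicates
    are handled as multisets via Counter.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = Counter(gold), Counter(pred)
        tp += sum((g & p).values())  # entities found with the right type/span
        fp += sum((p - g).values())  # predicted but not in the gold standard
        fn += sum((g - p).values())  # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```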
3.1 RQ1: Do transformer-based models outperform the state-of-the-art model for the FG-NER task?
The results indicate that, overall, the transformer-based models outperform CRF and BiLSTM-CNN-CRF in most domains in terms of F1 score. Specifically, the results show that the BERT and RoBERTa models yield the highest and second-highest F1 scores for almost every domain. BERT has the highest F1 score in 36 out of 49 domains, while RoBERTa achieves the best F1 score in 10 out of 49 domains. While XLNet outperforms BiLSTM-CNN-CRF in most domains, its performance scores are slightly lower than those of both the BERT and RoBERTa models. It is also noteworthy that XLNet performs consistently worse than BiLSTM-CNN-CRF in the ten smallest domains. Figure 1a provides the boxplots showing the distributions of the F1 scores over all the domains across the five models. We can make two observations. The boxplots indicate that, on average, all of the transformer-based models achieve higher performance than both CRF and BiLSTM-CNN-CRF. Furthermore, we can observe that the ranges, and, more importantly, the interquartile ranges of the transformer-based models are smaller. This indicates that their performances are more stable and less sensitive to the choice of domain than those of CRF and BiLSTM-CNN-CRF.

3.2 RQ2: What are the strengths, weaknesses, and trade-offs of each investigated model?
While the transformer-based models clearly outperform the other models with regards to the F1 score, it is worth examining the precision and recall scores as well. Regarding the precision, the CRF model almost consistently outperforms all of the other models as shown in Table 2. When compared to the BiLSTM-CNN-CRF model, the transformer-based models perform worse in most domains in terms of precision. In fact, BERT outperforms BiLSTM-CNN-CRF in less than half of the domains, RoBERTa outperforms BiLSTM-CNN-CRF in only a third of the domains and XLNet outperforms it in only a fifth of the domains. Figure 1b shows the distribution of the precision scores over all the domains across the five models. The boxplots confirm the strength of CRF over the other models. Furthermore, they show that BiLSTM-CNN-CRF performs slightly better than the transformer-based models, albeit at a loss of stability as indicated by the large range.
On the other hand, the transformer-based models significantly outperform the other models with regards to recall, as seen in Table 2 and as can be observed in Figure 1c. The transformer-based models not only outperform the other models, but their interquartile ranges are significantly smaller as well. This difference in recall score also explains the higher F1 scores for the transformer-based models.
To summarise, CRF shows its strength in terms of precision, while BERT, RoBERTa, and XLNet perform well with regards to both recall and F1 score, with BERT usually achieving the highest performances. The BiLSTM-CNN-CRF model acts as a trade-off between CRF and the transformer-based models.
3.3 RQ3: How does the choice of the domain influence the performance of the models?

Figure 1a shows that while different models may achieve significantly different performance, no approach yields a significant breakthrough with respect to the others for the task at hand, and all leave room for improvement. The five tested models obtained relatively stable performances, as is visible from the fact that the boxes, which represent the performance measurements of 50% of the domains, cover only a ±0.05 band around the average. Figure 2, which plots the F1 scores for every domain (ordered by size), reveals however that all models are similarly impacted by domains: with the exception of the four smallest domains (left-most in Figure 2), when one model achieves a lower performance than its overall average, all models also perform worse than their overall averages. We also note that the per-domain variations in performance cannot be explained by the size of the domains (since the performance looks erratic across all domain sizes). Overall, the results are a clear indication that most domains are either (a) relatively hard for every model, or (b) relatively easy for every model. This suggests that no model manages to acquire a massively better language understanding that would make it able to avoid the difficulties faced by the other models, at least in the context of FG-NER.
Furthermore, the ranking of the five models is very stable across domains: if one specific model performs the best (resp. the worst) for one domain, it can reliably be predicted that this model will also perform the best (resp. the worst) across the other domains. It follows that some models do bring a sometimes incremental, but nonetheless measurable, improvement over other models. Nevertheless, we note that for the four smallest domains, the difference in performance from one model to another is larger, and no ranking pattern is visible.
The performance variations between domains that we see in our results have also been reported in the study by Guo et al. (2006), who investigated the stability of coarse-grained NER across domains for the Chinese language. Notably, when trained on the sports domain, their baseline achieves a significantly higher F1 score than on the other domains. The same is true here, but it has to be noted that they use the classic NER labels, i.e., person, location, organisation, and miscellaneous, rather than domain-specific labels.
Take-Home Messages: To summarise, the transformer-based models do indeed outperform the BiLSTM-CNN-CRF model with regards to F1 score, with BERT yielding the highest results overall. The simple CRF model achieved the best performance in terms of precision, while performing the worst in terms of recall. Compared to both CRF and BiLSTM-CNN-CRF, the transformer-based models achieved significantly higher recall scores. Furthermore, we observe significant discrepancies when applying the models to different domains. Moreover, when a model is performing better (resp. worse) on one domain, the other models also perform better (resp. worse). This suggests that while transformer-based models can indeed bring significant performance improvements, their language understanding may not be outstandingly different. Indeed, if they were clearly different, we could have reasonably expected to note different patterns in the performance for the FG-NER task (i.e., they would not systematically perform well/badly for the same domains).

Fine-Grained Named Entity Recognition
Early efforts to develop a fine-grained approach to NER were made by Béchet et al. (2000), who focused on differentiating between first names, last names, countries, towns, and organisations. While this would be considered coarse-grained by today's standards, they do split the classical NER labels person and location into more nuanced labels. FG-NER was first described as "fine grained classification of named entities" by Fleischman and Hovy (2002). They focused on a fine-grained label set for personal names, dividing the generic person label into eight subcategories, i.e., athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, and police. They experimented with a variety of classic machine learning approaches for this task, and achieved promising results of 68.1%, 69.5%, and 70.4% in terms of accuracy for an SVM, a feed-forward neural network, and a C4.5 decision tree, respectively. Furthermore, Ling and Weld (2012) introduced their fine-grained entity recognizer (FIGER), which can distinguish between 112 different labels and handle multi-label classification. Mai et al. (2018) presented an empirical study on FG-NER prior to the rise of transformer-based models (which are the focus of our study). They targeted an English dataset containing 19 800 sentences and a Japanese dataset containing 19 594 sentences, dividing the named entities into 200 categories. They compared performances for FIGER, BiLSTM-CNN-CRF, and a hierarchical CRF+SVM classifier, which classifies an entity into a coarse-grained category before further classifying it into a fine-grained subcategory. Furthermore, they combined some of the aforementioned methods with gazetteers and category embeddings to further improve the performance of the models.
They found that the BiLSTM-CNN-CRF model by Ma and Hovy (2016) combined with gazetteer information performed the best for the English language with an F1 score of 83.14%, while BiLSTM-CNN-CRF with both gazetteers and category embeddings yielded an F1 score of 82.29%, and 80.93% without either gazetteers or category embeddings. Vaswani et al. (2017) first described the transformer model, which superseded the popular LSTM model in favour of the attention mechanism (Bahdanau et al., 2014). As transformers do not need to process sentences in sequence, they allow for more parallelisation than LSTMs or other recurrent neural network models. Due to this advantage, transformers have become fundamental for state-of-the-art models in the NLP field. One early notable model that employed transformers is the Generative Pretraining Transformer (GPT) model (Radford et al., 2018), which outperformed state-of-the-art models in nine out of twelve NLU tasks. Devlin et al. (2019) further revolutionised the NLP landscape by introducing BERT. Unlike the unidirectional GPT model, BERT is a deeply bidirectional transformer model, pretrained on the MLM and NSP tasks. Fine-tuned BERT models managed to outperform state-of-the-art models in eleven NLP tasks, including the GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016) benchmarks. The success of BERT led to a large variety of similar models, which were pretrained on different datasets. Most notably, RoBERTa (Liu et al., 2019) and XLNet managed to further outperform BERT in a large number of tasks. Specifically, Yang et al. (2019) introduced XLNet, replacing the MLM task with a permutation-based autoregression task, effectively predicting sentence tokens in random order. XLNet manages to outperform BERT in 20 tasks, including the GLUE, SQuAD, and RACE (Lai et al., 2017) benchmarks. Meanwhile, the RoBERTa model was trained on more data and for longer periods of time, tweaked the MLM pretraining task, and removed the NSP task. Liu et al. (2019) reported that RoBERTa outperforms BERT on the GLUE, SQuAD, and RACE benchmarks.