A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land?

Fine-tuning pretrained models has achieved promising performance on standard NER benchmarks. Generally, these benchmarks are blessed with strong name regularity, high mention coverage, and sufficient context diversity. Unfortunately, when scaling NER to open situations, these advantages may no longer exist, raising the critical question of whether pretrained supervised models can still work well when facing these issues. As no currently available dataset allows us to investigate this problem, this paper proposes to conduct randomization tests on standard benchmarks. Specifically, we erase name regularity, mention coverage, and context diversity respectively from the benchmarks, in order to explore their impact on the generalization ability of models. Moreover, we construct a new open NER dataset that focuses on entity types with weak name regularity, such as book, song, and movie. From both the randomization tests and empirical experiments, we draw the following conclusions: 1) name regularity is vital for generalization to unseen mentions; 2) high mention coverage may undermine the model's generalization ability; and 3) context patterns may not require enormous data to capture when using pretrained supervised models.

Despite the success of recent models, current NER benchmarks have specific advantages that significantly facilitate supervised neural networks. First, these benchmarks mainly focus on a limited set of entity types, and most mentions of these types have strong name regularity. For example, nearly all person names follow the "FirstName LastName" or "LastName FirstName" pattern, and location names frequently end with indicator words such as "street" or "road". Second, the training and test data in these benchmarks are usually sampled from the same corpus, so the training data commonly have high mention coverage of the test data, i.e., a large proportion of mentions in the test set have already been observed in the training set. Unfortunately, this high coverage is inconsistent with the primary goal of NER models, which is to identify unseen mentions in new data based on name and context knowledge. For observed mentions, other techniques, such as entity linking (Lin et al., 2012), would be more appropriate and effective. Third, these benchmarks generally provide decent training data, so the context diversity of all entity types can be sufficiently learned. In this paper, we refer to NER tasks with strong name regularity, high mention coverage, and sufficient training instances as regular NER. It turns out that neural networks can easily exploit such name regularity, mention coverage, and context diversity knowledge, and therefore achieve state-of-the-art performance on these benchmarks.
Unfortunately, when it comes to a more open scenario, there are significant discrepancies between regular benchmark settings and general NER tasks. Table 1 overviews their discrepancies in name regularity, mention coverage, and context pattern acquisition. In open NER, mentions of many entity types do not follow regular compositional structures. For example, a movie name can be an arbitrary n-gram utterance and may not even be a regular noun phrase, such as "Gone with the Wind". Furthermore, fully-annotated training data will be rare due to the expensive annotation cost. Therefore, the training set can only cover a minor part of the test mentions, and diverse context patterns must be learned from minimal instances. Obviously, these discrepancies can lead to biased estimation of open NER performance when using regular NER benchmarks.
In this paper, we want to shed some light on the impact of the discrepancies between regular and open NER, and provide some valuable insights into constructing NER models in a more effective and efficient way. Specifically, we want to answer the following question: Can pretrained supervised neural networks still generalize well on NER when weak name regularity, low mention coverage, or inadequate context diversity exists?
It is non-trivial to answer this question because currently no well-established benchmark concentrates on these issues. To this end, this paper investigates the efficacy of the above three kinds of information by conducting a series of experiments based on randomization tests (Edgington and Onghena, 2007; Zhang et al., 2016). Specifically, we design several on-demand mention replacing mechanisms, which can erase specific kinds of information from current NER benchmarks. By applying the same supervised models to both the vanilla and the information-erased data, we can investigate how much the models rely on the particular erased information to identify entity mentions. Generally, we propose to erase name regularity, mention coverage, and context diversity respectively using the following kinds of randomization tests, whose examples are shown in Table 2:
• Name Permutation (NP) is used to investigate the necessity of name regularity for NER. It replaces the same entity mention with an identical, random n-gram string, so the structural correlation between mentions of the same type is removed. For the example in Table 2, all mentions of "Putin" are replaced by the same utterance "the united".
• Mention Permutation (MP) is used to investigate the impact of mention coverage. Different from NP, MP replaces each mention occurrence with a unique n-gram string, so even two mentions with the same utterance are replaced by different strings. For the example in Table 2, the two mentions of "Putin" are replaced by "the united" and "the innocent" respectively. In this way, the mention coverage is erased and the model must rely solely on context knowledge for NER prediction.
• Context Reduction (CR) and Mention Reduction (MR) are used to investigate the influence of using less training data. CR decreases the diversity of sentences but preserves all mentions in the vanilla data, while MR keeps all sentences but preserves only a small part of the original mentions. By comparing these two settings, we can figure out how many original training instances are needed to learn context patterns and name regularity. A minimal code sketch of the two permutation strategies is given after this list.
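To make the replacement mechanisms concrete, the following is a minimal sketch of the two permutation strategies. It assumes each sentence is a token list with non-overlapping (start, end, type) mention spans and that replacement n-grams are sampled from a plain vocabulary list; all function names here are illustrative, not the exact implementation used in our experiments.

```python
import random

def sample_ngram(vocab, max_len=3):
    # Draw a random n-gram utterance to serve as a replacement string.
    n = random.randint(1, max_len)
    return " ".join(random.choice(vocab) for _ in range(n))

def permute(sentences, vocab, mode="name"):
    """mode="name": name permutation (NP), identical surface forms share one
    replacement, so mention coverage is kept but name regularity is erased.
    mode="mention": mention permutation (MP), every occurrence gets a fresh
    replacement, so mention coverage is erased as well."""
    name_map = {}  # surface form -> replacement, used only by NP
    permuted = []
    for tokens, mentions in sentences:  # mentions: (start, end, type) spans
        new_tokens, new_mentions, cursor = [], [], 0
        for start, end, etype in sorted(mentions):
            surface = " ".join(tokens[start:end])
            if mode == "name":
                replacement = name_map.setdefault(surface, sample_ngram(vocab))
            else:
                replacement = sample_ngram(vocab)
            rep_tokens = replacement.split()
            new_tokens.extend(tokens[cursor:start])
            new_mentions.append(
                (len(new_tokens), len(new_tokens) + len(rep_tokens), etype))
            new_tokens.extend(rep_tokens)
            cursor = end
        new_tokens.extend(tokens[cursor:])
        permuted.append((new_tokens, new_mentions))
    return permuted
```

Note that NP keys its replacement table on the mention surface form, so the training/test overlap of mentions is preserved, whereas MP samples afresh at every occurrence.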
To verify the findings from our randomization tests, we further conduct a verification experiment by constructing a new dataset derived from Wikipedia, which focuses on entity types with weak name regularity. To the best of our knowledge, this is the first work that tries to investigate such critical differences between regular and open NER. From both the randomization tests and the verification experiment, we reach the following main conclusions:
• Decent name regularity is vital to generalization over unseen entity mentions. When name regularity is erased, the performance on unseen mentions is significantly undermined. This finding indicates that it will be challenging to build models for open entity types with weak name regularity.
• High mention coverage weakens the model's ability to capture informative context knowledge. In other words, high mention coverage will mislead models to overfit on popular mentions, rather than learning sufficient generalization knowledge. This also reveals that current performance on regular NER benchmarks is highly biased, i.e., performance on open NER will be significantly lower than on regular benchmarks.
• Sufficient context diversity may not require enormous training data to capture. We will show that with simple data augmentation techniques that preserve name regularity, the required training data can be significantly reduced. This observation also raises the possibility of designing more effective NER models with less annotated data.
Experiment Settings

Dataset Summary
We use ACE2005 (LDC2006T06) as our primary experiment dataset for the randomization tests. Other openly-available datasets, such as CoNLL03 and OntoNotes, are not suitable for our randomization tests because they only annotate named mentions but ignore nominal and pronominal mentions. However, the contexts of named and nominal/pronominal mentions are generally identical, and therefore the models would be unable to distinguish between them once name regularity is removed. For better illustration and reproduction, we report experiment results using the same dataset splits as previous work. For all experiments, we only consider the outermost mentions, following the majority of previous work. Finally, there are 18739/2531/2314 mentions in the train/dev/test sets respectively. We found that 58.4% of the mentions in the test set have appeared in the training data, which confirms our concern about high mention coverage. We have also conducted multiple experiments using an 8:1:1 train/dev/test split, and found that all of them lead to the same conclusions, which we illustrate in the next section.
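For reference, the mention coverage statistic above can be computed directly from the mention lists of the two splits; this is a minimal sketch under the assumption that mentions are given as (surface form, type) pairs:

```python
def mention_coverage(train_mentions, test_mentions):
    # Proportion of test mentions whose surface form also occurs in training.
    seen_surfaces = {surface for surface, _ in train_mentions}
    covered = sum(1 for surface, _ in test_mentions if surface in seen_surfaces)
    return covered / len(test_mentions)

# On our ACE2005 splits, this returns roughly 0.584, i.e., 58.4% coverage.
```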

Baseline
We use a BERT-based CRF tagger as our baseline model. Specifically, a Transformer-based (Vaswani et al., 2017) encoder is used to extract features from the input sentence. Then two dense layers map the hidden representations into the label space. Finally, a linear-chain CRF is applied to tag all tokens in the BIO schema. The Transformer is initialized with bert_uncased_L-24_H-1024_A-16, which achieves the best performance in our auxiliary experiments. All model parameters are then fine-tuned on the NER training set. During fine-tuning, we use Adam (Kingma and Ba, 2014) as the optimizer and set the learning rate to 10^-5. Finally, this model achieves an 81.76 micro-F1 score on ACE2005, which is in accordance with the performance reported in previous work (Xia et al., 2019; Lin et al., 2019b).

Table 3: Micro-F1 scores of the BERT-CRF tagger on the original data, the name permutation setting, and the mention permutation setting. We can see that erasing name regularity and mention coverage significantly undermines the model performance.
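The baseline can be sketched as follows; this is a minimal illustration assuming the Hugging Face transformers and pytorch-crf packages rather than our exact training code, with bert-large-uncased standing in for bert_uncased_L-24_H-1024_A-16:

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # from the pytorch-crf package

class BertCrfTagger(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        # 24 layers, 1024 hidden units, 16 heads, i.e., L-24_H-1024_A-16.
        self.bert = BertModel.from_pretrained("bert-large-uncased")
        hidden = self.bert.config.hidden_size
        # Two dense layers map hidden representations into the label space.
        self.ffn = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_labels))
        # A linear-chain CRF tags all tokens in the BIO schema.
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        feats = self.ffn(
            self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log-likelihood of the gold label sequence.
            return -self.crf(feats, labels, mask=mask)
        return self.crf.decode(feats, mask=mask)

# All parameters are fine-tuned, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```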

Randomization Test on NER
To investigate the discrepancies between regular and open NER, this paper controllably erases target information from the vanilla data via a variant of the randomization test (Edgington and Onghena, 2007; Zhang et al., 2016) from non-parametric statistics. Concretely, to probe the effect of a specific kind of information, we erase it from the vanilla data by randomly replacing entity mentions with particular irregular utterances. After that, we train and compare NER models on both the vanilla benchmark and the information-erased version, in order to evaluate the models' robustness and generalization ability when the target information is absent. The results of our randomization tests can serve as a frame of reference for open NER, where the erased information is often truly absent.
Specifically, three kinds of information are considered, and four kinds of strategies are used in our randomization tests. Table 2 shows all the randomization tests in this paper together with examples. In the following subsections, we illustrate the empirical findings from our randomization tests, with one subsection per kind of information. For each kind of information, we first present the critical conclusion and then demonstrate how we reach it.

Name Regularity
Conclusion 1 Name regularity is vital for supervised NER models to generalize over unseen entity mentions.
One critical difference between regular and open NER is whether names of the same entity type share an inner compositional structure. In regular NER, entity types (e.g., PER, ORG and LOC) commonly have strong name regularity. In open NER, however, most entity types (e.g., movie, song and book) do not have such strong regularity, and some mentions can even be random utterances. Therefore, it is critical to evaluate the impact of name regularity on generalization.
To address this issue, we propose name permutation, which replaces each mention utterance with a randomly sampled n-gram string; mentions with the same name are all replaced by the same string. To ensure that no structural correlation between these mentions is retained, the replacing strings are randomly sampled. For the example in Table 2, all mentions of "Putin" are replaced by "the united", and all mentions of "Bush" are replaced by "analysts". In this way, name regularity is erased, but mention coverage is retained because the same mention in the training and test data is still the same. Table 3 shows the overall results. We can see that when we erase name regularity from the dataset, the performance is significantly undermined: the overall drop in micro-F1 is 24%, and for the majority of entity types the performance slips by more than 40%. This shows the importance of name regularity for model performance. To investigate the reasons behind this, we split the mentions for evaluation by whether the predicted/golden mention is covered by the training data, which we refer to as the in-dictionary portion (InDict) and the out-of-dictionary portion (OutDict) respectively. The results of these two portions on the vanilla dataset and the name permutation setting are shown in Table 4.

Table 4: Comparison between the baseline and name permutation on the in-dictionary and out-of-dictionary portions. We can see that the performance gap between InDict and OutDict is significantly enlarged when name regularity is erased.
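Concretely, the portion-wise evaluation can be sketched as follows, assuming each mention is represented as a hashable tuple whose first element is its surface form; helper names are illustrative:

```python
def split_by_dict(mentions, train_surfaces):
    # Partition mentions by whether their surface form was seen in training.
    in_dict = {m for m in mentions if m[0] in train_surfaces}
    return in_dict, set(mentions) - in_dict

def portion_scores(gold, pred, train_surfaces):
    # Score the InDict and OutDict portions separately.
    scores = {}
    gold_parts = split_by_dict(gold, train_surfaces)
    pred_parts = split_by_dict(pred, train_surfaces)
    for name, g, p in zip(("InDict", "OutDict"), gold_parts, pred_parts):
        tp = len(g & p)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[name] = (prec, rec, f1)
    return scores
```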
From the above results, we find that removing name regularity leads to a more severe performance drop on mentions not covered by the training set (OutDict) than on those appearing in the training set (InDict). On the vanilla dataset, the performance gap between in-dictionary and out-of-dictionary mentions is not very large, which shows the good generalization ability of pretrained supervised models over unseen mentions. However, after erasing name regularity, this gap is significantly enlarged: the performance on the InDict portion does not drop much, but the performance on the OutDict portion drops dramatically. This result shows that it is quite difficult to recognize unseen entity mentions when name regularity is missing. Besides, we can see that even after erasing name regularity, the model still performs quite well on the in-dictionary portion, whose precision remains quite high. This demonstrates the strong ability of neural networks to memorize and disambiguate observed mentions even when they are irregular.
Consequently, we conclude that name regularity is critical for the model to generalize over unseen mentions. Without name regularity, current supervised models can still work well on mentions covered by the training data via memorizing and disambiguating names, but they cannot generalize well to unseen mentions.

Mention Coverage
Conclusion 2 High mention coverage weakens the model's ability to capture informative generalization knowledge for NER.
Another critical difference between regular and open NER is whether the training data can cover the majority of mentions in the test scenario. High mention coverage can provide misleading evidence during model learning, because neural networks can achieve considerable performance by merely memorizing and disambiguating observed entity names. This ability, obviously, is not what we desire because 1) what we really want is to identify unseen names, and in real-world applications most valuable mentions will be zero-shot, which means out-of-dictionary mentions will dominate the test process; 2) training instances are very limited in open situations, which makes high mention coverage unattainable; and 3) many long-tail mentions in the training set will be one-shot, i.e., the mention appears only once in the training data. Therefore, it is necessary to investigate whether NER models can still reach reasonable performance in low mention coverage situations.
To this end, we conducted experiments via mention permutation, which replaces each mention with a random n-gram, similar to name permutation. However, to erase mention coverage information, the replacing string for each mention is independently sampled, so even mentions with the same utterance in the vanilla data are replaced by different strings. For example, the two occurrences of "Putin" in Table 2 are replaced by different utterances. In this way, (almost) no mention in the test set is covered by the training data, and no name information remains in the data. Consequently, the models must rely solely on context knowledge for NER prediction.
The overall results of mention permutation are listed in Table 3. We can see that the performance of MP drops further compared with NP, which demonstrates that high mention coverage makes mention detection much simpler. To further investigate whether high mention coverage influences the models' generalization ability to unseen mentions, we also compared MP with NP on the out-of-dictionary portion. The results are shown in Table 5. Surprisingly, the model performs significantly better in the MP setting than in the NP setting on all entity types. In other words, high mention coverage undermines the models' ability to generalize to unseen mentions. We believe this is because, as previous work on other tasks (Zhang et al., 2016; Lu et al., 2019) has pointed out, neural networks have a strong ability to memorize training instances. Consequently, high mention coverage misleads the models to focus mainly on memorizing and disambiguating frequent entity names even when they are irregular, while ignoring the informative context patterns that are useful for identifying unseen mentions. These results reveal that NER models should focus more on context knowledge for generalization, rather than only memorizing popular mentions. This is even more critical for entity types with weak or no name regularity, because context patterns are more important in that circumstance.

Table 5: Experiment results on the out-of-dictionary portion. We can see that mention permutation performs significantly better than name permutation, which indicates that high mention coverage may undermine the generalization ability of models.

Context Diversity
Conclusion 3 Sufficient context patterns may not require enormous training data to capture when building upon pretrained neural networks.
Current NER benchmarks commonly provide decent training data for learning the context patterns of entities. However, due to the expensive annotation cost, it is impractical for open NER to assume enough fully-annotated training data. If we can figure out how many training instances are necessary for context patterns and name regularity respectively, it will provide valuable insights for constructing open NER datasets and models more effectively and efficiently.
To this end, we propose to conduct context reduction (CR) and mention reduction (MR) on the vanilla training set using simple data augmentation strategies. The purpose of context reduction is to reduce context diversity in the training data while keeping all mentions, so that name regularity matches the vanilla setting. Specifically, CR keeps only a subset of sentences from the vanilla training data, then duplicates the preserved sentences and randomly replaces the mentions in them with mentions of the same type from the vanilla training data. In this way, all mentions share identical frequency in the vanilla and CR datasets. In contrast, MR aims to reduce mention diversity for name regularity while retaining context diversity by keeping all of the original contexts. For this, MR keeps only a part of the mentions in the original training set as seeds, and replaces every other mention in the training data with a mention randomly sampled from the seeds of the same type. In this way, only part of the name knowledge is retained, but all contexts are preserved. Furthermore, we also compare CR and MR with a naive reduction strategy that simply subsamples sentences from the training data, which we refer to as sentence reduction.
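A simplified sketch of the two reduction strategies is given below; mentions are reduced to (surface, type) pairs for brevity, exact frequency bookkeeping is omitted, and all names are illustrative:

```python
import random

def context_reduction(train, keep_ratio):
    # Keep a sentence subset, then pad back to the original size by duplicating
    # kept sentences with mentions re-sampled from all same-type mentions of
    # the vanilla data, so name knowledge stays while context diversity shrinks.
    by_type = {}
    for _, mentions in train:
        for surface, etype in mentions:
            by_type.setdefault(etype, []).append(surface)
    kept = random.sample(train, max(1, int(len(train) * keep_ratio)))
    data = list(kept)
    while len(data) < len(train):
        tokens, mentions = random.choice(kept)
        resampled = [(random.choice(by_type[t]), t) for _, t in mentions]
        data.append((tokens, resampled))
    return data

def mention_reduction(train, keep_ratio):
    # Keep all sentences, but restrict names to a sampled seed set per type;
    # non-seed mentions are replaced by random same-type seeds, so context
    # diversity stays while name diversity shrinks.
    surfaces = {}
    for _, mentions in train:
        for surface, etype in mentions:
            surfaces.setdefault(etype, set()).add(surface)
    seeds = {t: random.sample(sorted(s), max(1, int(len(s) * keep_ratio)))
             for t, s in surfaces.items()}
    return [(tokens,
             [(surf if surf in seeds[t] else random.choice(seeds[t]), t)
              for surf, t in mentions])
            for tokens, mentions in train]
```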
We varied the ratio of preserved information in each setting from 5% to 100%. The overall results are shown in Figure 1. We can see that in the sentence reduction setting, performance steadily improves as the training data grows. This phenomenon is also observed in the mention reduction setting, which indicates that increasing training data introduces more name regularity knowledge and thus results in better performance. However, in the context reduction setting, there is no significant performance improvement on PER, ORG and GPE once the preserved sentences exceed 30% of the vanilla data. For FAC, in contrast, increasing the preserved data still improves performance, likely because the instances of FAC are significantly fewer than those of PER, ORG and GPE in the vanilla dataset. From the above experiments, it seems that once the training data reaches a certain amount, its instances suffice to capture the context patterns, and increasing training instances mainly provides more name regularity knowledge rather than more context diversity.

Figure 1: Experiments on context reduction, mention reduction and sentence reduction as the ratio of kept information varies. We can see that when all name regularity information is preserved, sufficient context patterns can be captured once the number of training sentences reaches a certain amount, and introducing more training data does not significantly improve the performance.
The above results provide a valuable insight: the name regularity and the context patterns for NER can be learned separately, rather than jointly. For example, we can learn context patterns using a moderate number of training instances and then incorporate more name regularity knowledge using other resources, e.g., easily-obtainable gazetteers.

Data Preparation
To further verify the conclusions from our randomization tests, we conduct experiments on a real-world open NER dataset that considers entity types with weaker name regularity than previously-used benchmarks. Because no suitable dataset is currently available for verifying our conclusions, this paper constructs a new dataset from Wikipedia. Specifically, we consider four entity types in our experiments: movie, song, book and TV series. We extract all sentences in Wikipedia that contain mentions linking to entities of these types as our experiment dataset. From these, we randomly sample 10,000 sentences as the test set and 2,000 sentences as the development set; part of the remaining data is used as training data according to the different settings below. Finally, there are 2875, 2791, 598 and 580 mentions of movie, TV series, song and book in the test set respectively. Note that, unlike real scenarios, this dataset only keeps sentences containing at least one mention, owing to the partial labeling nature of Wikipedia. Therefore, the performance on this dataset may over-estimate precision compared with real applications.

Generalizing over Unseen Mentions
The first group of experiments was conducted to verify the influence of name regularity on in-dictionary and out-of-dictionary mentions. To this end, we randomly sampled 5,000 sentences from the dataset as the training set, which is close to the training data size of ACE2005. We use bert_cased_L-24_H-1024_A-16 rather than the uncased version of the pretrained model because we find that capitalization has a significant impact on this dataset. Furthermore, unlike ACE2005, whose training data covers nearly 58% of the test set mentions, the training set of our Wikipedia dataset covers only 27% of the mentions in the test set. This confirms our concern that mention coverage is much lower in open NER than in regular NER.

Table 6: Comparison between the in-dictionary portion and the out-of-dictionary portion on the Wikipedia dataset. We can see that there is a significant gap between these two portions.

Table 6 reports the experiment results on the in-dictionary and out-of-dictionary portions respectively. We can see that the performance gap between the in-dictionary portion and the out-of-dictionary portion is very significant, due to the weak name regularity of these entity types. This confirms our previous conclusion that name regularity is vital for an NER system to generalize over unseen mentions. However, the performance gap between the InDict and OutDict portions is not as large as in the previous name permutation setting shown in Table 4. We believe this is because: 1) some kinds of name regularity still exist for these entity types, e.g., the capitalization of the first letter; 2) Wikipedia documents are much more formal than ACE2005 documents, which makes the context patterns much easier to capture. For example, a movie mention in Wikipedia frequently shares the same context "in the film Xxx Xxx", where Xxx is a mention word with a capitalized first letter. This kind of pattern in Wikipedia is very easy to capture with neural networks. Despite this, the performance gap between the InDict and OutDict portions is still more than 24% in precision and 18% in recall, which verifies the necessity of name regularity for NER to achieve good generalization.

Influence of Training Data Size
This group of experiments investigates the impact of training data size on model performance. To this end, we varied the size of the training set from 500 to 10,000 sentences and measured the performance on the test set. Because the entity types we consider have weak name regularity, the increase in training instances will mainly increase the context diversity. Therefore, this group of experiments can be used to verify Conclusion 3. Figure 2 shows the results. We can see that the performance improvements on all entity types become less significant once the training data size exceeds 3,000. This phenomenon is very similar to the performance in the context reduction setting on ACE2005, shown in Figure 1 (a). This further verifies our Conclusion 3 and confirms that once the sample size reaches a certain level, introducing more training data will not improve the learning of context knowledge.

Related Work
Named entity recognition has long been studied and has attracted much attention. Conventional methods (Zhou and Su, 2002; Chieu and Ng, 2002; Bender et al., 2003; Settles, 2004) commonly rely on handcrafted features to build NER models, which are hard to transfer among different languages, domains and entity types. Recently, deep learning methods, which automatically extract high-level features and perform sequence tagging with neural networks (Santos and Guimaraes, 2015; Chiu and Nichols, 2016; Lample et al., 2016; Yadav and Bethard, 2019), have achieved significant progress, especially under the pretraining and fine-tuning paradigm (Li et al., 2019b; Akbik et al., 2019; Zhai et al., 2019; Li et al., 2019a). These methods have achieved promising results on almost all popular NER benchmarks covering regular entity types.
Several studies have shifted attention to name tagging in open scenarios, where entity types may have weaker name regularity and training data may be insufficient. These papers mainly focus on introducing and denoising weakly-supervised data (Täckström et al., 2013; Ni et al., 2017; Cao et al., 2019), or are devoted to incorporating external resources (Yang et al., 2017; Peng and Dredze, 2016; Pan et al., 2017; Lin et al., 2018; Xie et al., 2018).
By contrast, to the best of our knowledge, this is the first work that investigates the essential differences between regular and open NER. By conducting both randomization tests (Edgington and Onghena, 2007) and verification experiments, we reached three critical conclusions about the impact of name regularity, mention coverage and context pattern sufficiency. Our conclusions can potentially guide the development of models and datasets for open NER.

Conclusion and Future Work
This paper investigates whether current state-of-the-art models for regular NER can still work well on open NER. From the perspectives of name regularity, mention coverage and context diversity, we conducted both randomization tests and verification experiments to evaluate the generalization ability of models. Our investigation leads to three valuable conclusions, which show the necessity of decent name regularity for identifying unseen mentions, the hazard of high mention coverage for model generalization, and the redundancy of enormous data for capturing context patterns.
The above findings also shed light on promising directions for open NER, including 1) exploiting name regularity more efficiently with easily-obtainable resources; 2) preventing overfitting on popular in-dictionary mentions with constraints or regularizers; and 3) decoupling the acquisition of context knowledge and name knowledge with more effective models.