Coarse-to-Fine Pre-training for Named Entity Recognition

More recently, Named Entity Recognition has achieved great advances aided by pre-training approaches such as BERT. However, current pre-training techniques focus on building language modeling objectives to learn a general representation, ignoring named entity-related knowledge. To this end, we propose a NER-specific pre-training framework to inject coarse-to-fine automatically mined entity knowledge into pre-trained models. Specifically, we first warm up the model via an entity span identification task by training it with Wikipedia anchors, which can be deemed general-typed entities. Then we leverage a gazetteer-based distant supervision strategy to train the model to extract coarse-grained typed entities. Finally, we devise a self-supervised auxiliary task to mine fine-grained named entity knowledge via clustering. Empirical studies on three public NER datasets demonstrate that our framework achieves significant improvements over several pre-trained baselines, establishing new state-of-the-art performance on the three benchmarks. Besides, we show that our framework gains promising results without using human-labeled training data, demonstrating its effectiveness in label-few and low-resource scenarios.


Introduction
Named Entity Recognition (NER) is the task of discovering information entities and identifying their corresponding categories, such as mentions of people, organizations, locations, temporal and numeric expressions (Freitag, 2004). It is an essential component in many applications including machine translation (Babych and Hartley, 2003), relation extraction (Yu et al., 2019), entity linking (Xue et al., 2019a), and so on.
Recently, NER has seen remarkable advances with the help of pre-trained representation models such as BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). By providing contextual representations, these pre-trained models can easily be applied to NER as an encoder through simple fine-tuning. Despite refreshing the state-of-the-art performance of NER, current pre-training techniques are not directly optimized for NER. Typically, these models build unsupervised training objectives to capture dependencies between words and learn a general language representation (Tian et al., 2020), while rarely considering incorporating named entity information, which can provide rich knowledge for NER. Due to the weak knowledge connection between NER and general language modeling, how to adapt public pre-trained models to be NER-specific remains an open problem.
To this end, injecting named entity knowledge during pre-training is a possible solution. However, this process of knowledge acquisition may be inefficient and expensive. In fact, there are extensive weakly labeled annotations that naturally exist on the web yet to be explored for NER model pre-training, which are relatively easier to obtain than labeled data (Cao et al., 2019). One can collect them from online resources, such as Wikipedia anchors and gazetteers (named entity dictionaries). Although automatically derived corpora usually contain massive noisy data, they still contain, to some extent, the valuable semantic information required for NER (Peng et al., 2019).
In this paper, we propose a Coarse-to-Fine Entity knowledge Enhanced (CoFEE) pre-training framework for the NER task, aiming to gather and utilize knowledge related to named entities. In particular, we first extract anchors from Wikipedia and use them as training corpora for entity span identification. While anchors carry no entity type information, the model can acquire general-typed entity knowledge from them and learn to distinguish entity words from non-entity words. In the second phase, we use gazetteers and anchors to generate weakly labeled data for specific entity types and use it to train the model to extract entities with coarse-grained types. Furthermore, another observation is that entities with the same coarse-grained type may belong to different fine-grained types. According to the cluster hypothesis (Chapelle et al., 2009), the features of entities with the same latent fine-grained label will cluster together in the semantic space. Intuitively, mining these latent cluster structures provides auxiliary information about the coarse-grained entity type, which could be beneficial to NER performance. Based on this motivation, we finally devise a self-supervised method to exploit fine-grained type knowledge and tap the potential of weakly labeled data, which effectively trains the NER model with clustering-generated pseudo labels.
We conduct experiments on three realistic NER benchmarks in this paper. Experimental results show that the proposed CoFEE pre-training framework significantly outperforms other competitive baselines, often by large margins. We also demonstrate that CoFEE pre-training works well in more challenging, label-free and low-resource scenarios. Further ablation studies show the impact of each pre-training task in achieving this strong performance. To the best of our knowledge, this is the first work that tackles NER-specific representation learning during pre-training.

Related Work
Entity Knowledge for NER. Recently, neural networks have been used for NER and achieved great success (Collobert et al., 2011; dos Santos and Guimarães, 2015; Huang et al., 2015; Ma and Hovy, 2016). Specifically, various types of entity knowledge, including lexical words, gazetteers and anchors in Wikipedia, have been proved useful for a wide range of NER tasks.
For the supervised NER task, some researchers utilize lattice structures to incorporate lexical information into character-based NER and avoid the error propagation of word segmentation (Gui et al., 2019a; Xue et al., 2019b; Gui et al., 2019b; Sui et al., 2019). Additionally, gazetteers have long been regarded as a piece of useful knowledge for NER; previous methods commonly incorporated gazetteers either as handcrafted features (Alan et al., 2011; Dominic et al., 2018) or as auxiliary structural information (Ding et al., 2019). For weakly supervised NER, a typical line of methods centres around transfer learning to transfer source knowledge to the target, such as cross-domain (Yang et al., 2017; Lin and Lu, 2018; Jia et al., 2019) or cross-lingual (Ni et al., 2017; Xie et al., 2018; Zhou et al., 2019) transfer. There are also many weak labels lying on the web or in gazetteers that have not been explored. Consequently, a number of works focus on distantly supervised methods, using anchors or gazetteers to generate data by distant supervision (Liu et al., 2015; Cao et al., 2019; Peng et al., 2019).
Task-Specific Pre-training. Unsupervised language model pre-training and task-specific fine-tuning achieve state-of-the-art results on many NLP tasks, including NER (Peters et al., 2018; Devlin et al., 2019; Li et al., 2020). Recently, with the help of automatically mined knowledge lying on the web, researchers have devoted themselves to pre-training models for specific tasks, including word sense disambiguation, word-in-context tasks (Levine et al., 2020), entity linking and relation classification, and sentiment classification (Tian et al., 2020).

Background
In this section, we give a brief introduction to MRC-NER (Li et al., 2020), which achieves satisfying performance in NER and is thus chosen as the foundation of our work. Given an input paragraph X = {x_1, x_2, ..., x_n} where x_i denotes the i-th character, NER aims at discovering each entity x_{start,end} in X and identifying its corresponding type y ∈ Y, where Y is the set of predefined tags (e.g., PER, LOC). x_{start,end} = {x_start, x_{start+1}, ..., x_{end-1}, x_end} is a substring of X satisfying start ≤ end. Specifically, MRC-NER formulates NER as a machine reading comprehension (MRC) problem. Each entity type y is characterized by a natural language query Q^y = {q^y_1, q^y_2, ..., q^y_m}, and entities are extracted by answering these queries given the contexts. For example, the task of assigning the PER label to "[Washington] was born into slavery on the farm" is formalized as answering the question "Find person including fictional". This strategy naturally introduces a natural language query which encodes significant prior knowledge about the entity type. BERT (Devlin et al., 2019) captures the contextual information for each token in the string via self-attention and produces the representation matrix H ∈ R^{n×d} of X, where d is the dimension of the last layer of BERT. To extract entity spans, the representation of each word is fed to two softmax layers to predict the probability of each token being a start or end index as follows:

P_start = softmax(H W_s + b_s) ∈ R^{n×2},  P_end = softmax(H W_e + b_e) ∈ R^{n×2},

where W_s, W_e ∈ R^{d×2} and b_s, b_e ∈ R^2 are trainable parameters. At training time, each input sequence associated with question Q^y is paired with two label sequences Y_start = {y^s_1, y^s_2, ..., y^s_n} and Y_end = {y^e_1, y^e_2, ..., y^e_n}, where y^s_i (y^e_i) is the ground-truth label of x_i being the start (end) index of a y-typed entity or not. The cross-entropy losses of the start and end index predictions are therefore denoted as:

L_start = -Σ_{(X, Y_start) ∈ D} Σ_{i=1}^{n} log P_start(y^s_i | x_i),
L_end = -Σ_{(X, Y_end) ∈ D} Σ_{i=1}^{n} log P_end(y^e_i | x_i),

where D denotes the training dataset.
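The span prediction described above can be sketched in plain Python. This is an illustrative toy (random toy parameters, sequence length 5), not the authors' implementation; in practice H comes from BERT and the parameters are learned.

```python
import math
import random

random.seed(0)
n, d = 5, 8  # toy sequence length and hidden size

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def linear(h, W, b):
    # h: [d], W: [d][2], b: [2] -> two logits ("not start/end" vs "start/end")
    return [sum(h[j] * W[j][k] for j in range(d)) + b[k] for k in range(2)]

# Toy token representations H in R^{n x d} and parameters W_s, W_e, b_s, b_e.
H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W_s = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(d)]
W_e = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(d)]
b_s, b_e = [0.0, 0.0], [0.0, 0.0]

P_start = [softmax(linear(h, W_s, b_s)) for h in H]
P_end = [softmax(linear(h, W_e, b_e)) for h in H]

# Cross-entropy against gold start/end label sequences (1 = is start/end).
Y_start, Y_end = [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]
L_start = -sum(math.log(P_start[i][Y_start[i]]) for i in range(n)) / n
L_end = -sum(math.log(P_end[i][Y_end[i]]) for i in range(n)) / n
L_mrc = L_start + L_end  # overall MRC-NER training objective
```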
Finally, the overall training objective to be minimized can be formulated as:

L_MRC = L_start + L_end.

Methodology
In this section, we introduce the overall framework of our coarse-to-fine pre-training. Figure 1 gives a brief illustration, which operates in three stages as follows: (1) Stage 1: identify entity spans based on Wikipedia anchors; (2) Stage 2: extract coarse-grained entities based on gazetteers; (3) Stage 3: predict fine-grained entity types with a clustering-oriented self-supervised method.

Entity Span Identification
Pre-trained language models such as BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) have been proven to capture rich language information from text. However, as the entity information of a text is seldom explicitly studied, such pre-trained general representations can hardly be expected to capture entity-centric knowledge. In order to better capture entity information and learn NER-specific representations, we propose the first pre-training task, named Entity Span Identification (ESI). The entity-centric knowledge is automatically mined from the large-scale Wikipedia corpus. In Wikipedia, an anchor ⟨m, e⟩ links a mention m to an entity e. Therefore, we assign an "Entity" tag to each anchor in the sentence and construct a general-typed weakly labeled NER dataset D_g without considering the entity type. To align with MRC-NER, the question for the generated dataset is set as "Find Entities". With this general labeled data, the MRC-NER model can be warmed up with loss L^{D_g}_{MRC}. By integrating general-typed named entity knowledge into the pre-training process, the learned representation incorporates structural information of crucial importance for NER.
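The anchor-to-label conversion above can be sketched as follows. The `[[target|mention]]` markup and the helper are illustrative assumptions standing in for real Wikipedia dump processing, not the paper's pipeline:

```python
import re

# Matches [[target|mention]] or [[mention]] wiki anchors (simplified).
ANCHOR = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")

def anchors_to_example(wikitext):
    """Return (plain_text, spans): spans are (start, end, tag) character
    offsets of anchor mentions, all labeled with the generic "Entity" tag."""
    plain, spans = [], []
    last = offset = 0
    for m in ANCHOR.finditer(wikitext):
        plain.append(wikitext[last:m.start()])
        offset += m.start() - last
        mention = m.group(1)
        spans.append((offset, offset + len(mention), "Entity"))
        plain.append(mention)
        offset += len(mention)
        last = m.end()
    plain.append(wikitext[last:])
    return "".join(plain), spans

text, spans = anchors_to_example("[[Barack Obama|Obama]] visited [[Paris]].")
# text  -> "Obama visited Paris."
# spans -> [(0, 5, "Entity"), (14, 19, "Entity")]
```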

Named Entity Extraction
After the ESI pre-training, the model has learned to distinguish entity words from non-entity words. We then step into the second phase (i.e., NEE), in which the model is trained to extract typed entities with gazetteer-labeled data. To alleviate human effort, gazetteer-based distant supervision has been applied to automatically generate labeled data and has seen success in NER (Peng et al., 2019). A standard strategy is to scan through the anchor text in D_g using the gazetteer of a given entity type y and treat anchors matched with entries of the given gazetteer as entities of type y. In this way, we obtain a specific-typed NER dataset D_s, which is then exploited to train the MRC-NER model by optimizing L^{D_s}_{MRC}. Besides, in order to meet the paradigm of MRC-NER, we also generate a natural language query for each entity type. This procedure is critical since queries encode prior knowledge about labels. Inspired by Li et al. (2020), we take annotation guideline notes as references to construct queries, and we list all of the queries used in our model in Table 1. They are theoretical descriptions of the tag categories and thus enable the model to incorporate the information within the label categories unambiguously and completely.
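The gazetteer matching step can be sketched as below. The gazetteer contents and type names are toy assumptions; only anchors already identified in D_g are considered, and unmatched anchors are simply dropped:

```python
# Toy gazetteers standing in for the crowd-sourced dictionaries.
GAZETTEERS = {
    "PER": {"washington", "lincoln"},
    "LOC": {"paris", "mount fuji"},
}

def distant_label(anchor_spans, text):
    """Map (start, end) anchor spans to (start, end, type) triples when the
    surface form matches a gazetteer entry; unmatched anchors are dropped."""
    labeled = []
    for start, end in anchor_spans:
        surface = text[start:end].lower()
        for etype, entries in GAZETTEERS.items():
            if surface in entries:
                labeled.append((start, end, etype))
                break
    return labeled

sentence = "Washington was born near Paris."
print(distant_label([(0, 10), (25, 30)], sentence))
# -> [(0, 10, 'PER'), (25, 30, 'LOC')]
```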
However, as most existing gazetteers cover only part of the entities, the automatically derived dataset usually contains massive noise, including missing labels and incorrect boundaries and types. To address this issue, we propose an iterative self-picking strategy. At the beginning (iteration 0), the model starts training from the original noisy label set. At the end of each iteration, the model determines the next label set by making predictions on D_s. Concretely, a new entity is extracted with type y if the probabilities of its start and end indices being predicted as y are both greater than a picking threshold δ. In the next iteration, we use the newly derived dataset as input for model training. Considering that we aim to recall the missing labels, we set δ < 0.5. The model is trained until we find the best model w.r.t. the performance on the validation set, and the final derived dataset is denoted as D_s^best.
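One self-picking step can be sketched as follows. The probabilities here are toy numbers standing in for the model's start/end predictions on D_s:

```python
DELTA = 0.3  # picking threshold; set below 0.5 to recall missing labels

def self_pick(candidates, delta=DELTA):
    """Keep a candidate span as a new y-typed label only if both its start
    and end probabilities under type y exceed the threshold."""
    return [
        (start, end, y)
        for (start, end, y, p_start, p_end) in candidates
        if p_start > delta and p_end > delta
    ]

candidates = [
    (0, 5, "PROD", 0.62, 0.55),   # confidently recalled -> kept
    (8, 12, "PROD", 0.45, 0.20),  # weak end probability -> dropped
    (14, 20, "BRAN", 0.35, 0.40), # both above 0.3       -> kept
]
print(self_pick(candidates))
# -> [(0, 5, 'PROD'), (14, 20, 'BRAN')]
```

The kept spans form the label set for the next training iteration; a lower δ trades precision for recall.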

Fine-grained Entity Typing
NEE pre-training focuses on teaching the model named entity knowledge about coarse-grained entity types. However, one coarse-grained entity type may be composed of a set of fine-grained entity types. For example, the coarse-grained type Location includes City, Country, Bodies of water, etc. These fine-grained types can provide auxiliary information to help us understand the meaning of Location. With this in mind, it is intuitive to group the extracted entities with a cluster miner and use the resulting cluster assignments as pseudo labels to mine the fine-grained NER knowledge. One of the most well-studied clustering algorithms is k-Means, whose simplicity and efficiency have established it as a popular means of performing clustering across different disciplines. Formally, in order to partition the entity set E = {e_1, e_2, ..., e_M} in D_s^best into K pre-defined distinct clusters {C_k}_{k=1}^{K}, k-Means minimizes the sum of the intra-cluster variances:

L_kmeans = Σ_{k=1}^{K} V_k,  V_k = Σ_{i=1}^{M} δ_ik ||e_i − μ_k||^2,  μ_k = (Σ_{i=1}^{M} δ_ik e_i) / (Σ_{i=1}^{M} δ_ik),

where V_k and μ_k are the variance and the center of the k-th cluster, respectively, e_i = sumpool([h_{start_i}, h_{start_i+1}, ..., h_{end_i}]) denotes the representation of the i-th entity, and δ_ik is a cluster indicator variable with δ_ik = 1 if e_i ∈ C_k and 0 otherwise. Clustering proceeds by alternating between assigning instances to their closest center and recomputing the centers, until a local minimum is reached. The cluster assignments are used as pseudo labels to transform D_s^best into a pseudo-labeled fine-grained dataset D_c = {(e_i, y^c_i)}, where y^c_i is the pseudo label of e_i. We can then take the negative log-likelihood of the pseudo-labeled tags as the training objective:

L_clus = − Σ_{i=1}^{M} log P_c(y^c_i | e_i),  P_c(y^c | e_i) = softmax(W_c e_i + b_c),

where W_c ∈ R^{K×d} and b_c ∈ R^K are trainable parameters, and P_c(y^c_i | e_i) denotes the probability of entity e_i being assigned to the y^c_i-th cluster.
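The pseudo-label mining step can be illustrated with a tiny plain-Python k-Means over 2-d "entity embeddings" — a toy stand-in for the cluster miner over pooled BERT representations, with per-cluster variances V_k computed for later re-weighting:

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: recompute centers (keep old center if a cluster empties).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    # Intra-cluster variances V_k for each cluster.
    variances = [
        sum(dist2(p, centers[c]) for p, a in zip(points, assign) if a == c)
        for c in range(k)
    ]
    return assign, centers, variances

# Two well-separated groups of toy entity representations.
E = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 4.9]]
pseudo, centers, variances = kmeans(E, k=2)
# Entities in the same group share a fine-grained pseudo label.
```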
Recall that our purpose is to pre-train the NER model to discover typed entities belonging to Y rather than fine-grained entities, so L_clus can be deemed an auxiliary task that assists the model in mining fine-grained NER knowledge and regularizes the optimization of L^{D_s}_{MRC}. The training objective in this stage is thus defined as:

L = L^{D_s}_{MRC} + γ L_clus,

where γ is the trade-off parameter. While optimizing with pseudo labels created by the cluster miner seems reasonable, it ignores the inevitable label noise caused by the clustering procedure. To this end, we propose a variance-weighted cross-entropy loss to alleviate the influence of noisy pseudo labels. Observe that the inverse of V_k (V_k^{-1}) represents the intra-cluster compactness of the k-th cluster. If the features of instances in the k-th cluster are close together, V_k^{-1} will be large, and the confidence of assigning pseudo label y^c_i to these instances should accordingly be high, and vice versa. Thus we re-formulate the clustering objective as:

L_clus = − Σ_{i=1}^{M} V_{y^c_i}^{-1} log P_c(y^c_i | e_i).

Finally, we iterate the above clustering-optimizing process by putting the model back to output new representations, generating new pseudo labels D_c, and starting the next iteration.
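The variance-weighted loss can be sketched as below; the probabilities and variances are toy values, not model outputs, and the tiny `eps` guards against a zero-variance cluster:

```python
import math

def weighted_clus_loss(examples, variances, eps=1e-8):
    """examples: list of (p_correct, pseudo_label) pairs, where p_correct is
    the predicted probability of the pseudo label; variances: V_k per cluster."""
    loss = 0.0
    for p, y in examples:
        w = 1.0 / (variances[y] + eps)  # V_k^{-1}: confidence of cluster k
        loss += -w * math.log(p)
    return loss / len(examples)

examples = [(0.9, 0), (0.6, 1)]
variances = [0.5, 2.0]  # cluster 0 is tight, cluster 1 is loose
loss = weighted_clus_loss(examples, variances)
# The example in the tight cluster is weighted by ~2.0,
# the one in the loose cluster by only ~0.5.
```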

Algorithm Workflow
In this subsection, we introduce the overall procedure of our framework, which Algorithm 1 sketches. First, we construct the general-typed NER dataset D_g based on Wikipedia anchors and pre-train the model to extract general-typed entities with loss L^{D_g}_{MRC}. Then we leverage the gazetteer-based distant supervision strategy to construct a specific-typed NER dataset D_s and propose an iterative self-picking method to alleviate the missing-label problem. In each iteration, the model is optimized to fit the data labeled in the previous iteration. When the performance on the validation set starts to decline, the iteration ends and the best-performing model is passed to the third stage, where a cluster miner is deployed to group the entities extracted in the second stage into fine-grained types, and the model is trained to simultaneously distinguish fine-grained entities and extract specific-typed entities. We also iteratively cluster the features from the last iteration to gradually refine the fine-grained pseudo labels for the current one.

Experiments
We evaluate the CoFEE framework under two settings: (i) the supervised setting and (ii) the weakly supervised setting. In the supervised setting, the pre-trained model is fine-tuned on human-labeled datasets, while in the weakly supervised setting, the model pre-trained with CoFEE is directly applied to perform NER without fine-tuning. Next, we describe these experiments in detail.

Datasets
Our experiments are conducted on three benchmarks. (1) Chinese OntoNotes 4.0 consists of newswire text and was published by Ralph et al. (2011). It is annotated with four types of Chinese named entities: PER (Person), ORG (Organization), GPE (Geo-Political Entity) and LOC (Location). It contains 15.7k sentences for training and 4.3k for testing.
(2) E-commerce is a Chinese NER dataset collected from the e-commerce domain and released by Ding et al. (2019). It is annotated with the PROD (product) and BRAN (brand) types. The training and test sets contain 273k and 53k lines, respectively. (3) Twitter is an English NER dataset (Qi et al., 2018). Following Peng et al. (2019), we only use textual information to perform NER and detect PER, LOC and ORG entities. It contains 4,000 tweets for training and 3,257 tweets for testing.

Pre-training Corpora
Wikipedia. We use the 20200401 Chinese and English Wikipedia dumps for data construction, where we set the max sentence length to 250 and remove sentences which contain three or fewer anchors. The resulting Chinese corpus contains 1,116,514 sentences and 6,383,142 anchors (entity mentions), and the English corpus contains 3,911,059 sentences and 37,755,176 anchors.
Gazetteer. For Chinese PER, ORG, GPE, and LOC, we collect the gazetteers from the crowd-sourced dictionaries used by the Chinese input method "Sougou", which contain 2,314 person names, 2,649 organization names, 895 geopolitical entities, and 628 location names. For Chinese PROD and BRAN, we use the gazetteers provided by Ding et al. (2019), which contain 628 brand names and 2,974 product names. For English PER, ORG and LOC, we collect the gazetteers using the method released by Peng et al. (2019), which contain 2,795 person names, 1,825 organization names and 1,408 location names.

Baselines
We chose two types of baselines: supervised methods and weakly supervised methods. We refer to our proposed CoFEE pre-training framework with the MRC-NER backbone as CoFEE-MRC. In addition, to demonstrate the model-agnostic and generic property of CoFEE, we also implemented another competitive baseline by replacing the MRC-NER backbone with the widely used BERT model (Devlin et al., 2019) without any change to the training procedure, denoted as CoFEE-BERT. We used the open-source release at https://github.com/huggingface/transformers. Supervised Setting. We fine-tune CoFEE-MRC and CoFEE-BERT on supervised NER data and compare with the following baselines to learn how much improvement can be achieved over supervised models. BiLSTM-CRF (Huang et al., 2015) is a classical neural-network baseline for NER, which usually achieves competitive performance in supervised NER. BERT-Tagger (Devlin et al., 2019) uses the outputs from the last layer of BERT_base as character-level enriched contextual representations to perform sequence labeling. MRC-NER (Li et al., 2020) formulates NER as a machine reading comprehension task and uses BERT as the basic encoder.
Weakly Supervised Setting. We investigate the effect of CoFEE-MRC in solving the NER task without any human annotations and compare the model to several weakly supervised NER models. For a fair comparison, we implemented the baselines with the same gazetteers constructed in Section 5.2. Gazetteer Matching applies the constructed gazetteers directly to the test set to obtain entity mentions with exactly the same surface name. By comparing with it, we can check the improvements of neural models over the distant supervision itself. MRC-NER uses the MRC-NER backbone to perform the weakly supervised NER task with gazetteer-labeled training data. Furthermore, we explore the influence of our proposed pre-training tasks by removing entity span identification pre-training (-ESI) and fine-grained entity typing pre-training (-FTP) from CoFEE-MRC.

Hyper-parameter settings
We use BertAdam as our optimizer. All models are implemented in PyTorch using a single NVIDIA Tesla V100 GPU. We use "bert-base-chinese" and "bert-base-cased" as our pre-trained models for Chinese and English, respectively; the number of parameters is the same as in these pre-trained models, plus two binary classifiers. For each training stage, we vary the learning rate from 1e−6 to 1e−4. In the NEE stage, we select the best picking threshold δ from 0.1 to 0.5 in increments of 0.1. In the FET stage, letting K̂ denote the number of fine-grained entity categories, we choose the number of clusters K from {K̂−2, K̂−1, K̂, K̂+1, K̂+2}. For all these hyper-parameters, we select the best values according to the F1 score on the dev sets.

Evaluation
Following the evaluation metrics in previous work (Li et al., 2020), we apply the entity-level (exact entity match) standard micro Precision (P), Recall (R), and F1 score to evaluate the results.


Overall Performance

CoFEE-BERT significantly improves the performance compared with BERT and achieves a new state of the art for supervised NER on E-commerce of 79.73%, which confirms the model-agnostic property of our CoFEE pre-training framework. Note that the results of MRC-NER on OntoNotes raise a concern that needs to be addressed: MRC-NER sets the max sentence length to 77, which is far less than the true maximum length of the dataset, while our method guarantees a maximum length of more than 100.

Table 3 reports the results of our models against baselines under the weakly supervised setting. We can find that: 1) Gazetteer Matching performs quite poorly, and its capability is strongly influenced by the size of the gazetteers. For OntoNotes, the coverage of the large-scale gazetteer is almost 40%, but its huge size causes low precision. For Twitter, the recall is only about 14% due to its limited gazetteers. 2) If we directly use MRC-NER to perform the weakly supervised NER task with gazetteer-labeled data, the model achieves a degree of improvement but is still inaccurate due to the distantly labeled data. 3) CoFEE-MRC achieves the state-of-the-art F1 score on all three benchmarks, which confirms the validity of our proposed CoFEE pre-training framework. 4) The FET pre-training task brings performance improvements, which verifies the effectiveness of introducing fine-grained named entity knowledge. 5) ESI pre-training further improves the performance, which demonstrates the necessity of warming up the pre-trained language model with general-typed named entity knowledge.

Figure 2: (a) Impact of pre-training data size on the weakly supervised setting; (b) impact of fine-tuning data size on the supervised setting; (c) impact of picking rate δ; (d) impact of cluster size K.
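The entity-level exact-match micro metrics used above can be sketched as follows: a predicted entity counts as correct only if its span and type both match a gold entity. The tuple encoding is an illustrative assumption:

```python
def micro_prf(gold, pred):
    """gold/pred: sets of (sent_id, start, end, type) tuples."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(0, 0, 5, "PER"), (0, 14, 19, "LOC"), (1, 2, 6, "ORG")}
pred = {(0, 0, 5, "PER"), (0, 14, 19, "ORG")}  # second span has the wrong type
p, r, f1 = micro_prf(gold, pred)
# p = 0.5, r ≈ 0.333, f1 ≈ 0.4
```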

Impact of Data Size
We analyze the influence of reducing the amount of pre-training data and fine-tuning data. The results on the dev set of E-commerce are shown in Figures 2(a) and 2(b), respectively. From Figure 2(a), we observe that increasing the size of the pre-training data generally improves performance, but the improvement tends to flatten out at 60%∼80% of the data; we suppose this is because the number of unique patterns saturates, so additional training data brings diminishing returns. From Figure 2(b), we see that knowledge-enhanced pre-training is more effective in low-resource cases, where there is a larger performance gap between our CoFEE-MRC and MRC-NER. Besides, the performance of CoFEE pre-training is more stable as the data scale changes. This further demonstrates that our CoFEE pre-training framework can significantly reduce the human effort needed to create NER taggers.

Impact of Picking Rate
We then evaluate the influence of the value of the picking rate δ. From Figure 2(c), we can see that setting a lower picking rate to recall more named entities indeed yields a substantial performance improvement for the model, with the highest result at δ = 0.1.

Impact of Gazetteer Size
We further explore how the training data and performance change when we use gazetteers of different sizes. In particular, we used 20%, 40%, 60%, 80% and 100% of the original gazetteers to construct pre-training corpora. Statistics for each resulting gazetteer are illustrated in Figure 3(a), and the model performance on the E-commerce dev set with these gazetteers is shown in Figure 3(b). We observe that increasing the size of the gazetteers generally improves the performance of our proposed CoFEE-MRC model, and that the performance grows in line with that of "Matching", indicating that in addition to gazetteer size, the matching degree also has a crucial influence on model performance.

Impact of Cluster Size
The proposed CoFEE framework requires a cluster size K as the label space for pseudo labels. One may wonder whether the choice of K has a significant influence on the final results. In this subsection, we vary K from 4 to 90 and report the F1 score of CoFEE-MRC on the E-commerce dev set. As shown in Figure 2(d), the best performance is obtained when K is set exactly to the number of fine-grained entity types described in the queries (23), indicating that our CoFEE pre-training can leverage this information as useful prior knowledge. Thanks to the self-supervised learning schema, the model achieves a stable F1 score across this whole range and is not sensitive to the choice of K. The results further indicate the applicability of the proposed framework when applied to a new kind of named entity where the number of fine-grained entity types is not available in advance: we can safely assign a larger value than needed and the model remains robust.

Conclusion
We investigated coarse-to-fine entity knowledge enhanced pre-training for named entity recognition, which integrates three kinds of entity knowledge with different granularity levels. Though conceptually simple, our framework is highly effective and easy to implement. On three popular NER benchmarks, we found consistent improvements over both state-of-the-art supervised and weakly-supervised methods. Further analysis verifies the necessity of utilizing NER knowledge for pre-training models.