Entity Enhanced BERT Pre-training for Chinese NER

Character-level BERT pre-trained on Chinese suffers from a lack of lexicon information, which has been shown to be effective for Chinese NER. To integrate a lexicon into pre-trained LMs for Chinese NER, we investigate a semi-supervised entity enhanced BERT pre-training method. In particular, we first extract an entity lexicon from the relevant raw text using a new-word discovery method. We then integrate the entity information into BERT using a Char-Entity-Transformer, which augments the self-attention with a combination of character and entity representations. In addition, an entity classification task helps inject the entity information into the model parameters during pre-training. The pre-trained models are used for NER fine-tuning. Experiments on a news dataset and two datasets annotated by ourselves for NER in long text show that our method is highly effective and achieves the best results.


Introduction
As a fundamental task in information extraction, named entity recognition (NER) is useful for NLP tasks such as relation extraction (Zelenko et al., 2003), event detection (Kumaran and Allan, 2004) and machine translation (Babych and Hartley, 2003). We investigate Chinese NER (Gao et al., 2005), for which the state-of-the-art methods use a character-based neural encoder augmented with lexicon word information (Zhang and Yang, 2018; Gui et al., 2019a,b; Xue et al., 2019).
* Equal contribution.

[Figure 1: Entity enhanced pre-training for NER. "老妇人(The old lady)", the nickname of the football club Juventus F.C., is extracted by new-word discovery and integrated into the Transformer structure. After pre-training, the embedding of "老妇人(The old lady)" carries global information and is correctly classified as an ORG, which also helps recognize "意甲(Serie A)" as an ORG.]

NER has been a challenging task due to the flexibility of named entities. There can be a large number of OOV named entities in the open domain, which poses challenges to supervised learning algorithms. In addition, named entities can be ambiguous. Take Figure 1 for example. The term "老妇人(the old lady)" literally means "older woman".
However, in the context of football news, it is the nickname of the football club Juventus F.C.. Thus entity lexicons that contain domain knowledge can be useful for the task (Radford et al., 2015). Intuitively, such lexicons can be collected automatically from a set of documents that are relevant to the input text. For example, in the news domain, a set of news articles in the same domain and concurrent with the input text can contain highly relevant entities. In the finance domain, the financial reports of a company over the years can serve as a context for collecting named entities when conducting NER for a current-year report. In the science domain, relevant articles can mention the same technological terms, which can facilitate recognition of the terms. In the literature domain, a full-length novel itself can serve as a context for mining entities. There has been work exploiting lexicon knowledge for NER (Passos et al., 2014; Zhang and Yang, 2018). However, little has been done on integrating entity information into BERT, which gives the state-of-the-art for Chinese NER. We consider enriching BERT (Devlin et al., 2019) with automatically extracted domain knowledge as mentioned above. In particular, we leverage the strength of new-word discovery on large documents by calculating point-wise mutual information to identify entities in the documents. Information over such entities is integrated into the BERT model by replacing the original self-attention modules (Vaswani et al., 2017) with a Char-Entity-Self-Attention mechanism, which captures the contextual similarities of characters and document-specific entities, and explicitly combines character hidden states with entity embeddings in each layer. The extended BERT model is then used for both LM pre-training and NER fine-tuning.
We investigate the effectiveness of this semi-supervised framework on three NER datasets, including a news dataset and two datasets (novels and financial reports) annotated by ourselves, which aim to evaluate NER on long text. We make comparisons with two groups of state-of-the-art Chinese NER methods, including BERT and ERNIE (Sun et al., 2019a,b). For a more reasonable comparison, we also complement both BERT and ERNIE with our entity dictionary and further pre-train them on the same raw text as ours.
Results on the three datasets show that our method outperforms these methods and achieves the best results, which demonstrates the effectiveness of the proposed Char-Entity-Transformer structure for integrating entity information into LM pre-training for Chinese NER. To our knowledge, we are the first to investigate how to make use of the scale of the input document text for enhancing NER. Our code and NER datasets are released at https://github.com/jiachenwestlake/Entity_BERT.

Related Work
Chinese NER. Previous work has shown that character-based approaches perform better for Chinese NER than word-based approaches because they are free from Chinese word segmentation errors (He and Wang, 2008; Liu et al., 2010; Li et al., 2014). Lexicon features have been applied so that external word-level information enhances NER training (Luo et al., 2015; Zhang and Yang, 2018; Gui et al., 2019a,b; Xue et al., 2019). However, these methods are supervised models, which cannot deal with datasets with relatively little labeled data. We address this problem with a semi-supervised method based on a pre-trained LM.
Pre-trained Language Models. Pre-trained language models have been applied as an integral component in modern NLP systems for effectively improving downstream tasks (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019b). Recently, there has been increasing interest in augmenting such contextualized representations with external knowledge (Zhang et al., 2019; Liu et al., 2019a; Peters et al., 2019). These methods focus on augmenting BERT by integrating KG embeddings such as TransE (Bordes et al., 2013). Different from this line of work, our model dynamically integrates document-specific entities without using any pre-trained entity embeddings. A method more similar to ours is ERNIE (Sun et al., 2019a,b), which enhances BERT through knowledge integration. In particular, instead of masking individual subword tokens as BERT does, ERNIE is trained by masking full entities. The entity-level masking trick in ERNIE pre-training can be seen as an implicit way to integrate entity information through error backpropagation. In contrast, our method uses an explicit way to encode the entities into the Transformer structure.

Method
As shown in Figure 2, the overall architecture of our method can be viewed as a Transformer structure with multi-task learning. There are three output components, namely masked LM, entity classification and NER. With only the masked language model component, the model resembles BERT without the next sentence prediction task; the entity classification task is added to enhance pre-training. When only NER outputs are yielded, the model is a sequence labeler for NER. We integrate entity-level information by extending the standard Transformer.

New-Word Discovery
In order to enhance a BERT LM with document-specific entities, we adopt an unsupervised new-word discovery method that combines point-wise mutual information (Bouma, 2009) with the left and right entropy of a candidate string, and adds these three values as the validity score of possible entities. The specific induction process is shown in Appendix A.
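To make the scoring concrete, here is a minimal Python sketch of this kind of MI-plus-entropy new-word discovery. The function name, the n-gram length cap and the frequency threshold are illustrative choices of ours, not the paper's exact procedure (which is described in Appendix A).

```python
import math
from collections import Counter

def candidate_scores(text, max_len=4, min_count=2):
    """Score n-gram candidates by point-wise MI plus left/right entropy.

    A simplified sketch of MI-based new-word discovery; thresholds and
    boundary markers are illustrative, not the paper's settings.
    """
    n = len(text)
    counts = Counter()
    lefts, rights = {}, {}
    for i in range(n):
        for L in range(1, max_len + 1):
            if i + L > n:
                break
            w = text[i:i + L]
            counts[w] += 1
            if L >= 2:  # record neighbor context for multi-char candidates
                lefts.setdefault(w, []).append(text[i - 1] if i > 0 else "^")
                rights.setdefault(w, []).append(text[i + L] if i + L < n else "$")

    def entropy(chars):
        c = Counter(chars)
        total = sum(c.values())
        return -sum(v / total * math.log(v / total) for v in c.values())

    scores = {}
    for w, c in counts.items():
        if len(w) < 2 or c < min_count:
            continue
        # PMI of the weakest split point: a low value means the parts
        # co-occur only by chance, so the string is a poor word candidate
        pw = c / n
        pmi = min(
            math.log(pw / ((counts[w[:k]] / n) * (counts[w[k:]] / n)))
            for k in range(1, len(w))
        )
        scores[w] = pmi + entropy(lefts[w]) + entropy(rights[w])
    return scores
```

Candidates that recur in free contexts (high MI, high boundary entropy) rank highest; a frequency-ranked cut of this score list would then form the entity dictionary.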

Char-Entity-Transformer
We construct models based on the Transformer structure of BERT_BASE for Chinese (Devlin et al., 2019). In order to make use of the extracted entities, we extend the baseline Transformer to the Char-Entity-Transformer, which consists of a stack of multi-head Char-Entity-Self-Attention blocks. We denote the hidden dimension of characters and the hidden dimension of new-words (entities) as H_c and H_e, respectively. L is the number of layers, and A is the number of self-attention heads.
Baseline Transformer. The Transformer encoder (Vaswani et al., 2017) is constructed with a stacked layer structure. Each layer consists of a multi-head self-attention sub-layer. In particular, given the hidden representation of a sequence {h_1^{l-1}, ..., h_T^{l-1}} for the (l-1)-th layer, packed together as a matrix h^{l-1} ∈ R^{T×H_c}, the self-attention function of the l-th layer is a linear transformation on the Value V^l space by means of Query Q^l and Key K^l mappings, represented as:

Q^l = h^{l-1} W_q^l,  K^l = h^{l-1} W_k^l,  V^l = h^{l-1} W_v^l
Atten(Q^l, K^l, V^l) = softmax(Q^l (K^l)^T / √d_k) V^l   (1)

where d_k is the scaling factor and W_q^l, W_k^l, W_v^l ∈ R^{H_c×H_c} are trainable parameters of the l-th layer. The result of Atten(Q^l, K^l, V^l) is further fed to a feed-forward network sub-layer with layer normalization to obtain the final representation h^l of the l-th layer.

[Algorithm 1: Maximum entity matching.]
Char-Entity matching. Given a character sequence c = {c_1, ..., c_T} and an extracted entity dictionary E_ent, we use a maximum entity matching algorithm to obtain the corresponding entity-labeled sequence e = {e_1, ..., e_T}. In particular, we label each character with the index of the longest entity in E_ent that includes the character, and label characters with no entity match with 0. The process is summarized in Algorithm 1.
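A possible implementation of this forward maximum entity matching might look like the following sketch; the 1-based dictionary indexing and the helper name are our illustrative choices, not the paper's code.

```python
def max_entity_match(chars, entity_dict):
    """Label each character with the (1-based) index of the longest
    dictionary entity covering it, or 0 for no match.

    A sketch of maximum entity matching in the spirit of Algorithm 1:
    at each position we try the longest span first, then advance past
    the matched entity (or by one character if nothing matched).
    """
    ent_index = {e: i + 1 for i, e in enumerate(entity_dict)}
    max_len = max((len(e) for e in entity_dict), default=0)
    T = len(chars)
    labels = [0] * T
    i = 0
    while i < T:
        matched = 0
        for L in range(min(max_len, T - i), 0, -1):  # longest span first
            span = "".join(chars[i:i + L])
            if span in ent_index:
                for t in range(i, i + L):
                    labels[t] = ent_index[span]
                matched = L
                break
        i += matched if matched else 1
    return labels
```

Every character inside "老妇人" would thus share the same entity index, which is what the Char-Entity-Self-Attention layer consumes.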
Char-Entity-Self-Attention. The Char-Entity-Self-Attention structure is shown in Figure 2 (right). Following BERT (Devlin et al., 2019), given a character sequence c = {c_1, ..., c_T}, the representation of the t-th (t ∈ {1, ..., T}) character in the input layer is the sum of the character, segment and position embeddings:

h_t^0 = E_c[c_t] + E_s[s] + E_p[t]   (2)

where E_c, E_s and E_p represent the character, segment and position embedding lookup tables, respectively. In particular, the segment index s ∈ {0, 1} is used to distinguish the order of input sentences for the next sentence prediction task in BERT (Devlin et al., 2019), which is not included in our method. Thus we set the segment index s to a constant 0. Given the (l-1)-th layer character hidden sequence {h_1^{l-1}, ..., h_T^{l-1}} and the entity-labeled sequence e = {e_1, ..., e_T}, we compute the combination of each character hidden state and its corresponding entity embedding as:

q_t^l = h_t^{l-1} W_{h,q}^l
k_t^l = h_t^{l-1} W_{h,k}^l,  if e_t = 0;  (h_t^{l-1} W_{h,k}^l + E_ent[e_t] W_{e,k}^l) / 2,  otherwise
v_t^l = h_t^{l-1} W_{h,v}^l,  if e_t = 0;  (h_t^{l-1} W_{h,v}^l + E_ent[e_t] W_{e,v}^l) / 2,  otherwise   (3)

where W_{h,q}^l, W_{h,k}^l, W_{h,v}^l ∈ R^{H_c×H_c} are trainable parameters of the l-th layer, and W_{e,k}^l, W_{e,v}^l ∈ R^{H_e×H_c} are trainable parameters for the corresponding entities. E_ent is the entity embedding lookup table.
As shown in Eq. (3), if there is no corresponding entity for a character, the representation is equal to the baseline self-attention. To show how a character and its corresponding entity are encoded jointly, we denote a pack of entity embeddings {E_ent[e_1], ..., E_ent[e_T]} as e ∈ R^{T×H_e}. The attention score of the i-th character in the l-th layer, S_i^l, is computed as:

S_i^l = softmax([q_i^l (k_1^l)^T, ..., q_i^l (k_T^l)^T] / √d_k)   (4)

where the char-to-char attention score s_t^c = exp(q_i^l (h_t^{l-1} W_{h,k}^l)^T / √d_k) is computed in the same way as in the baseline self-attention, and the char-to-entity attention score s_t^e = exp(q_i^l (E_ent[e_t] W_{e,k}^l)^T / √d_k) represents the similarity between a character and its corresponding entity.
Before normalization, the attention score between the i-th and t-th characters, {S_i^l}_t, is √(s_t^c · s_t^e), which is the geometric mean of s_t^c and s_t^e. This shows that the similarity between two characters under Char-Entity-Self-Attention is computed as a combination of the char-to-char geometric distance and the char-to-entity geometric distance.
Given the attention score S_i^l, Atten(q_i^l, K^l, V^l) is computed as a weighted sum over the Value V^l, which is a combination of character values and entity values.
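The key/value mixing of Eq. (3) followed by standard scaled dot-product attention can be sketched for a single head as follows. This is a simplified NumPy illustration under our own naming; the actual model applies this per head and per layer inside BERT.

```python
import numpy as np

def char_entity_self_attention(h, ent_emb, e_ids, Wq, Wk, Wv, We_k, We_v):
    """One simplified head of Char-Entity-Self-Attention.

    Keys and values of characters matched to an entity are averaged with
    the projected entity embedding; e_ids[t] == 0 means no entity match.
    The hidden size is used as the scaling factor d_k for simplicity.
    """
    T, Hc = h.shape
    q = h @ Wq                     # queries come from characters only
    k = h @ Wk
    v = h @ Wv
    for t in range(T):
        if e_ids[t] != 0:
            ek = ent_emb[e_ids[t]] @ We_k   # project entity into char space
            ev = ent_emb[e_ids[t]] @ We_v
            k[t] = (k[t] + ek) / 2          # combine char and entity keys
            v[t] = (v[t] + ev) / 2          # ... and values, as in Eq. (3)
    scores = q @ k.T / np.sqrt(Hc)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)  # row-wise softmax
    return attn @ v
```

With all entity ids set to 0 this reduces exactly to the baseline self-attention, mirroring the remark after Eq. (3).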

Masked Language Modeling Task
Following Devlin et al. (2019), we use the masked LM (MLM) task for pre-training. In particular, given a character sequence c = {c_1, ..., c_T}, we randomly select 15% of the input characters and replace them with [MASK] tokens. Formally, given the hidden outputs of the last layer {h_1^L, ..., h_T^L}, for each masked character c_t, the prediction probability of MLM p(c_t | c_{<t} ∪ c_{>t}) is computed as:

p(c_t | c_{<t} ∪ c_{>t}) = exp(h_t^L E_c[c_t]^T) / Σ_{c'∈V} exp(h_t^L E_c[c']^T)   (5)

where E_c is the character embedding lookup table and V is the character vocabulary.
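The 15% masking step can be sketched as below. Note that this simplified version always substitutes [MASK], whereas BERT's full recipe replaces 80% of selected tokens with [MASK], 10% with random tokens and keeps 10% unchanged; the function name and seed handling are our own.

```python
import random

def mask_characters(chars, mask_token="[MASK]", ratio=0.15, seed=0):
    """Randomly replace ~15% of characters with [MASK] for the MLM task.

    A simplified sketch: BERT's 80/10/10 replacement mix is omitted and
    at least one position is always masked.
    """
    rng = random.Random(seed)
    T = len(chars)
    n_mask = max(1, int(T * ratio))
    positions = sorted(rng.sample(range(T), n_mask))
    masked = list(chars)
    for p in positions:
        masked[p] = mask_token
    return masked, positions
```

The returned positions form the set over which the MLM loss of Eq. (5) is summed.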

Entity Classification Task
In order to further enhance the coherence between characters and their corresponding entities, we propose an entity classification task, which predicts the specific entity that the current character belongs to. A theoretical explanation of this task is that it maximizes the mutual information I(e; c) between the character c ∼ p(c) and the corresponding entity e ∼ p(e), where p(c) and p(e) represent the probability distributions of c and e, respectively:

I(e; c) = H(e) − H(e | c) = H(e) + E_{(e,c)∼p(e,c)}[log p(e | c)]

where H(e) indicates the entropy of e ∼ p(e), represented as H(e) = −E_{e∼p(e)}[log p(e)], which is a constant corresponding to the frequency of entities in a document. Thus the maximization of the mutual information I(e; c) is equivalent to the maximization of the expectation of log p(e | c).
Considering the computational complexity due to the excessive number of candidate entities, we employ sampled softmax for output prediction (Jean et al., 2015). Formally, given the hidden outputs of the last layer {h_1^L, ..., h_T^L} and the corresponding entity-labeled sequence e = {e_1, ..., e_T}, we compute the probability of each character c_t (s.t. e_t ≠ 0) aligning with its corresponding entity e_t as:

p(e_t | c_t) = exp(h_t^L E_ent[e_t]^T + b_{e_t}) / Σ_{e'∈{e_t}∪R^−} exp(h_t^L E_ent[e']^T + b_{e'})   (6)

where R^− represents a randomly sampled negative set from the candidate entities of the current input document, E_ent is the entity embedding lookup table and b_e is the bias of entity e.
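As a sketch of the sampled softmax above: the gold entity competes only against a small negative set R^− rather than the full entity vocabulary. Names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sampled_entity_softmax(h_t, ent_emb, b, target, neg_ids):
    """Probability of the gold entity among {target} ∪ negatives.

    A sketch of sampled softmax: logits are scored only for the gold
    entity and the sampled negatives, then normalized over that subset.
    """
    cand = [target] + list(neg_ids)
    logits = np.array([h_t @ ent_emb[e] + b[e] for e in cand])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[0]  # probability assigned to the gold entity
```

The negative log of this probability, summed over entity-matched positions, gives the entity classification loss used in pre-training.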

NER Task
Given the hidden outputs of the last layer {h_1^L, ..., h_T^L}, the output layer for NER is a linear classifier f : R^{H_c} → Y, where Y is an (m − 1)-simplex and m is the number of NER tags. The probability that the character c_t aligns with the k-th NER tag is computed using softmax:

p(y_t = k | c_t) = exp(w_k^T h_t^L + b_k) / Σ_{k'=1}^{m} exp(w_{k'}^T h_t^L + b_{k'})   (7)

where w_k ∈ R^{H_c} and b_k are trainable parameters specific to the k-th NER tag. We adopt the B-I-O tagging scheme for NER.
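A minimal sketch of the NER output layer and the B-I-O scheme: a per-character softmax over tags, plus a decoder that turns a tag sequence back into entity spans. The decoder is our illustrative addition; the paper only specifies the tagging scheme.

```python
import numpy as np

def ner_tag_probs(h_t, W, b):
    """Softmax over NER tags for one character (per-tag linear scores)."""
    logits = W @ h_t + b
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def bio_decode(tags):
    """Turn a B-I-O tag sequence into (start, end, type) spans."""
    spans, start, typ = [], None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes last span
        if tag.startswith("B-") or tag == "O" or (
                tag.startswith("I-") and tag[2:] != typ):
            if start is not None:
                spans.append((start, i - 1, typ))
                start, typ = None, None
        if tag.startswith("B-"):
            start, typ = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, typ = i, tag[2:]               # tolerate I- without B-
    return spans
```

At inference time, argmax over `ner_tag_probs` at each position yields the tag sequence fed into `bio_decode`.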

Training Procedure
Our model is initialized using a pre-trained Chinese BERT model (https://github.com/google-research/bert, pre-trained on Chinese Wikipedia), and the other parameters are randomly initialized. During training, we first pre-train an LM over all of the raw text to acquire the entity-enhanced model parameters, and then fine-tune the parameters using the NER task.

Pre-training. Given raw text with induced entities D_lm = {(c^n, e^n)}_{n=1}^{N}, where c^n is a character sequence and e^n is its corresponding entity sequence detected by Algorithm 1, we feed each training character sequence and its corresponding entity sequence into the model. We denote the masked subset of D_lm as D_lm^+ = {(n, t) | c_t^n = [MASK], c^n ∈ D_lm}; the loss of the masked LM task is:

L_mlm = − Σ_{(n,t)∈D_lm^+} log p(c_t^n | c_{<t}^n ∪ c_{>t}^n)

We denote the entity prediction subset of D_lm as D_lm^e = {(n, t) | e_t^n ≠ 0, c^n ∈ D_lm}; the loss of the entity classification task is:

L_ent = − Σ_{(n,t)∈D_lm^e} log p(e_t^n | c_t^n)

To jointly train the masked LM task and the entity classification task in pre-training, we minimize the overall loss:

L_pre = L_mlm + L_ent

Fine-tuning. Given an NER dataset D_ner = {(c^n, y^n)}_{n=1}^{N}, we train the NER output layer and fine-tune both the pre-trained LM and the entity embeddings by the NER loss:

L_ner = − Σ_{n=1}^{N} Σ_{t=1}^{T} log p(y_t^n | c_t^n)

The overall process of pre-training and fine-tuning is summarized in Algorithm 2.
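The joint pre-training objective can be sketched as the sum of the two negative log-likelihoods. Whether the two terms are weighted equally is an assumption of this sketch, as is the function name.

```python
import math

def pretraining_loss(mlm_log_probs, ent_log_probs):
    """Joint pre-training objective: masked-LM NLL plus entity-classification
    NLL (unweighted sum, as a sketch; a weighting coefficient could be added).
    """
    l_mlm = -sum(mlm_log_probs)  # summed over masked positions D+_lm
    l_ent = -sum(ent_log_probs)  # summed over entity-matched positions De_lm
    return l_mlm + l_ent
```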

Experiments
We empirically verify the effectiveness of entity enhanced BERT pre-training on different NER datasets. In addition, we also investigate how different components in the model impact the performance of NER with different settings.

Datasets
We conduct experiments on three datasets, including one public NER dataset, CLUENER-2020 (Xu et al., 2020), and two datasets annotated by ourselves, which are also contributions of this paper. The statistics of the datasets are listed in Table 1.
News dataset. We use the CLUENER-2020 (Xu et al., 2020) dataset. Compared with the OntoNotes (Weischedel et al., 2012) and MSRA (Levow, 2006) datasets for Chinese news NER, CLUENER-2020 is constructed as a fine-grained Chinese NER dataset with 10 entity types, and its labeled sentences belong to different news domains rather than one domain. We randomly sample 5.2K, 0.6K and 0.7K sentences from the original CLUENER-2020 dataset as the training, dev and test sets, respectively. (In practice, a little manual labeling can be performed on each news domain separately for the best results. However, considering the expense of performing experiments to study the influence of training data scale, we use a single set of training data for all the news domains. This setting is also used for the novel dataset.) The corresponding raw text is taken from THUCNews (Sun et al., 2016) in four news domains, namely GAM (game), ENT (entertainment), LOT (lottery) and FIN (finance), with a total of about 100M characters. (The original CLUENER-2020 dataset has no domain divisions, but our method aims to leverage domain-specific entity information for NER. Thus we select some specific news domains according to the raw text from THUCNews and construct an entity dictionary for each domain. We also release a smaller version of CLUENER-2020 with domain divisions.) The detailed entity statistics are shown in Appendix B.1.

Novel dataset. We select three Chinese Internet novels, titled "天荒神域(Stories in Myth)", "道破天穹(Taoist Stories)" and "茅山诡术师(Maoshan Wizards)", respectively, and manually label around 0.9K sentences for each novel as the development and test sets. We also label around 6.7K sentences from six other novels for the training set. Considering the literature genre, we annotate six types of entities. Besides, we use the original text of the nine novels, about 48M characters, for pre-training. The details of annotation and entity statistics are shown in Appendix B.2.
Financial report dataset. We collect the annual financial reports of 12 banks in China over five years and select about 2K sentences to annotate as the test set. The annotation rules follow the MSRA dataset (Levow, 2006), and the annotation process follows the novel dataset. In addition, we use the MSRA training and dev sets as our training and dev data. The unannotated annual reports, about 26M characters, are used in LM pre-training. The detailed entity statistics are shown in Appendix B.3.

Experimental Settings
Model size. Our model is constructed using BERT_BASE (Devlin et al., 2019), with the number of layers L = 12, the number of self-attention heads A = 12, the hidden size of characters H_c = 768 and the hidden size of entities H_e = 64. The total number of non-embedding model parameters is about 86M; that of BERT_BASE is about 85M. The entity integration module thus occupies only a small proportion of the whole model and has little impact on training efficiency.
Hyperparameters. For pre-training, we largely follow the default hyperparameters of BERT (Devlin et al., 2019). We use the Adam optimizer with an initial learning rate of 5e-5 and a maximum of 10 epochs for fine-tuning. We list the details of the pre-training and fine-tuning hyperparameters in Table 2.
Baselines. We compare our method with three groups of state-of-the-art methods for Chinese NER.

BERT baselines. BERT (Devlin et al., 2019) directly fine-tunes a pre-trained Chinese BERT on NER. BERT+FUR uses the same raw text as ours to further pre-train BERT with only the masked LM task. BERT+FUR+ENT uses the sum of the character embeddings and the corresponding entity embeddings, obtained by the same entity matching algorithm as ours, but only in the input layer, and then further pre-trains BERT on the same raw text as ours.

ERNIE baselines. ERNIE (Sun et al., 2019a,b; https://github.com/PaddlePaddle/ERNIE/tree/repro) enhances BERT through knowledge integration, using an entity-level masked LM task and more raw text from Web resources, and achieves the current best results on Chinese NER. ERNIE+FUR+ENT is a stronger baseline, which uses the same entity dictionary as ours for entity-level masking and further pre-trains ERNIE on the same raw text as ours.
LSTM baselines. We compare with a character-level BILSTM (Lample et al., 2016) and BILSTM+ENT, which concatenates the character embeddings with their corresponding entity embeddings as inputs. We also compare with the gazetteer-based method LATTICE (Zhang and Yang, 2018) and LATTICE (REENT), which replaces the word gazetteer of LATTICE with our entity dictionary for fair comparison. We use the same embeddings as Zhang and Yang (2018), which are pre-trained on Gigaword (https://catalog.ldc.upenn.edu/LDC2011T13) using Word2vec (Mikolov et al., 2013). The entity embeddings are randomly initialized and fine-tuned during training.

Overall Results
The overall F 1 -scores are listed in Table 3.
Comparison with BERT baselines. BERT+FUR achieves a slightly better result than BERT on the news dataset All (75.14% F1 vs. 74.22% F1), but similar results on the novel dataset All and the financial report dataset. This shows that simply further pre-training BERT on document-specific raw text can hardly improve performance. After using a naive method to integrate entity information, BERT+FUR+ENT achieves significantly better results on the novel dataset All (76.23% F1 vs. 73.22% F1) compared to BERT+FUR, but lower F1 on the news and financial report datasets, which shows that this naive method cannot effectively benefit from the entities of an arbitrary text genre.
Compared with BERT, Ours achieves significantly larger improvements on the novel dataset and the financial report dataset than on the news dataset (at least over 4% F1 vs. 2.4% F1), indicating the effectiveness of Ours for the long-text genre. Compared with all of the BERT baselines, Ours achieves significant improvements (over at least 1.5% F1 on the news dataset All, over 1.3% F1 on the novel dataset All and over 4% F1 on the financial report dataset), which shows that the Char-Entity-Transformer structure effectively integrates the document-specific entities extracted by new-word discovery and benefits Chinese NER.
Comparison with the state-of-the-art. We make comparisons with ERNIE baselines. Even though ERNIE uses more raw text and entity information from the Web resources for pre-training, Ours outperforms ERNIE significantly (about 1% F 1 on the news dataset All, over 4% F 1 on both the novel dataset All and the financial report dataset), which shows the importance of document-specific entities for pre-training.
Using the same entity dictionary as Ours to further pre-train ERNIE on the same raw text as Ours, ERNIE+FUR+ENT achieves better results on the novel dataset and the financial report dataset than ERNIE, but suffers a decrease on the news dataset All, which shows that integrating a document-specific entity dictionary benefits ERNIE for Chinese NER in the long-text genre. Compared with ERNIE+FUR+ENT, Ours achieves significant improvements, which shows that our explicit method of integrating entity information by the Char-Entity-Transformer structure is more effective than entity-level masking for Chinese NER. Finally, BERT and ERNIE outperform the LSTM baselines on all of the three datasets, indicating the effectiveness of LM pre-training for Chinese NER.

[Figure 3: Performance of new-word discovery against word frequency on the news dataset. We ignore the interval >1000 because it occupies less than 5% of new-words or entities.]

Analysis
MI-based new-word discovery. Figure 3 illustrates the relationship between new-words extracted by MI-based new-word discovery (NWD) and the named entities within the scope of the news dataset.
On the one hand, within the scope of the news dataset, the proportion of entities extracted by the MI-based NWD is relatively higher when they are more frequently appearing n-grams in the raw text (overall 31.04% of the named entities are extracted by the NWD), as shown by the red line in Figure 3. On the other hand, within the n-grams in the news dataset, new-words with lower frequencies extracted by the MI-based NWD are more likely to be named entities (overall 3.86% of new words within the news dataset are named entities), as shown by the blue line in Figure 3.
Fine-grained comparison. In order to study the performance of our method on different entity types, we make fine-grained comparisons on the news dataset, which has plenty of entity types in different news domains. Figure 4 illustrates the F1-scores of several typical entity types, including GOV (government), BOO (book), MOV (movie) and ADD (address), for fine-grained comparison with BERT and ERNIE on the news dataset. The trends are consistent with the overall results. The full table is shown in Appendix C.
Ablation study. As shown in Table 4, we use two groups of ablation study to investigate the effect of entity information.
(1) Entity prediction task. We consider (i) NO-ENT-CLASS, which does not use the entity classification task in pre-training; and (ii) NO-PRETRAIN, which does not use entity enhanced pre-training. Results of these methods suffer significant decreases compared to FINAL, which shows that pre-training, especially with the entity classification task, plays an important role in integrating the entity information.
In addition, we also explore the effect of raw text quantity. The result of (iii) HALF-RAW shows that a larger amount of the raw text is helpful.
(2) Entity dictionary. We consider (i) HALF-ENT, which uses 50% randomly selected entities from the original entity dictionary; and (ii) N-GRAMS, which uses randomly selected n-grams from the raw text. The results show that the document-specific entity dictionary benefits the performance, and that the new-word discovery method is effective for collecting the entity dictionary.
The amount of NER training data. To compare the performance of different models under different numbers of labeled training sentences, we randomly select different numbers of training sentences on the novel dataset. As shown in Figure 5, in nearly unsupervised settings, Ours gives the largest improvements (33.92% F1 over BILSTM+ENT, 20.80% F1 over BERT+FUR and 2.81% F1 over ERNIE+FUR+ENT). With only 500 training sentences, Ours achieves a competitive result, which shows the effectiveness of our LM pre-training method in the few-shot setting.
Case study. Table 5 shows a case study on the news dataset. "花旗中国(Citi China)" is a COM (company) and "《辐射》(Radiation)" is a MOV (movie). Since the text genre and entities in the news are so different from Wikipedia, BERT does not recognize the company name "花旗中国(Citi China)" and misclassifies "《辐射》(Radiation)" as a GAM (game). Benefiting from integrating entity information into LM pre-training, both ERNIE and Ours recognize "花旗中国(Citi China)".

[Figure 6: We use an example in the news dataset, "休顿很难鼓舞将士。(It is difficult for Hughton to encourage team members.)"]
Ours uses document-specific entities to pre-train on raw news text. So with the global information, Ours also classifies "《辐射》(Radiation)" accurately as a MOV.
Visualization. Figure 6 uses BertViz (Vig, 2019) to visualize the last-layer attention patterns of "休(Hugh)" in a news example. BERT only assigns a high attention score to the token itself, while Ours assigns relatively higher attention scores to all the tokens in the current entity "休顿(Hughton)", especially in the first attention head (in blue). This shows that Ours enables entity information to enhance the contextual representation.

Conclusion
We investigated an entity enhanced BERT pre-training method for Chinese NER. Results on a news dataset and two long-text NER datasets show that it is highly effective to explicitly integrate document-specific entities into BERT pre-training with a Char-Entity-Transformer structure, and our method outperforms the state-of-the-art methods for Chinese NER.

A Details of New-Word Discovery

In addition to the point-wise mutual information MI of a candidate string, we compute its left and right entropy:

E_L(w) = − Σ_{a∈A} p(a·w | w) log p(a·w | w)
E_R(w) = − Σ_{b∈B} p(w·b | w) log p(w·b | w)

where E_L and E_R represent the left and right entropy, respectively, w represents an n-gram substring, and A and B are the sets of words that appear to the left or right of w, respectively. Finally, we add the three values MI, E_L and E_R as the validity score of possible new entities, remove the common words based on an open-domain dictionary from Jieba (http://github.com/fxsjy/jieba), and save the top 50% of the remaining words as the potential input document-specific entity dictionary.

B Details of the Datasets

B.1 News Dataset

Entity statistics. As listed in Table 6, the fine-grained news dataset consists of 10 entity types, including GAM (game), POS (position), MOV (movie), NAM (name), ORG (organization), SCE (scene), COM (company), GOV (government), BOO (book) and ADD (address). The four test domains have obviously different distributions of entity types, which are visualized by the gray scale of color in Table 6.

B.2 Novel Dataset
Data collection. We construct our corpus from a professional Chinese novel reading site named Babel Novel (https://babelnovel.com/). Unlike news, the novel dataset covers a mixture of literary styles including historical novels and martial arts novels in the genres of fantasy, mystery, romance, military, etc. Therefore, unique characteristics of this dataset, such as novel-specific types of named entities, present challenges for NER.
Annotation. Considering the literature genre, we annotate three more entity types in addition to the PER (person), LOC (location) and ORG (organization) types of MSRA (Levow, 2006), namely (i) TIT (title), which represents the appellation or nickname of a person, such as "冥界之主(Lord of the Underworld)" and "无极剑圣(Sword Master)"; (ii) WEA (weapon), which represents weapons or objects with a special purpose (e.g. "天龙战戟(Dragon Spear)" and "星辰法杖(Stardust Wand)"); and (iii) KUN (kongfu), which represents the names of martial arts such as "太极(Tai Chi)" and "忍术(Ninjutsu)". The annotation work is undertaken by five undergraduate students and two experts. All of the annotators read the whole novels before annotation, which aims to prevent label inconsistency problems. In terms of annotation process, each sentence is first annotated by at least two students, and then the experts select the examples with inconsistent annotations and correct the mistakes. The inter-annotator agreement exceeded a Cohen's kappa value (McHugh, 2012) of 0.915 on the novel dataset.

Entity statistics. The statistics for the above six entity types are listed in Table 7. We can see that the entity distributions of the three test novels are similar, with only a few differences due to the different topics of the novels.

B.3 Financial Report Dataset
Annotation. The annotation process is similar to that of the novel dataset. The inter-annotator agreement exceeded a Cohen's kappa value (McHugh, 2012) of 0.923 on the financial report dataset.
Entity statistics. The detailed statistics for the financial report dataset are listed in Table 7.

C Fine-grained Comparison
The total results of fine-grained comparisons on the news dataset are listed in Table 8. The news dataset has a total of 10 entity types, including GAM (game), POS (position), MOV (movie), NAM (name), ORG (organization), SCE (scene), COM (company), GOV (government), BOO (book) and ADD (address).