Toward Recognizing More Entity Types in NER: An Efficient Implementation using Only Entity Lexicons

In this work, we explore how to quickly adapt an existing named entity recognition (NER) system so that it can recognize entity types not defined in the system. As an illustrative example, consider an NER system that has been built to recognize person and organization names and is now required to additionally recognize job titles. Such a situation is common in industry, where the entity types to be recognized vary a lot across products and keep changing. To avoid laborious data labeling and achieve fast adaptation, we propose to adjust the existing NER system using the previously labeled data and entity lexicons of the newly introduced entity types. We formulate this task as a partially supervised learning problem and accordingly propose an effective algorithm to solve it. Comprehensive experimental studies on several public NER datasets validate the effectiveness of our method.


Introduction
Named Entity Recognition (NER) is an information extraction task that seeks to identify entity names in unstructured text and categorize them into a predefined list of types. It plays an important role in many downstream tasks such as knowledge base construction (Riedel et al., 2013; Shen et al., 2012), machine translation (Babych and Hartley, 2003), and search (Zhu et al., 2005). In this field, supervised methods, ranging from the conventional graph models (McCallum et al., 2000; Malouf, 2002; McCallum and Li, 2003; Settles, 2004) to the dominant deep neural methods (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016; Gridach, 2017; Zhang and Yang, 2018; Jiang et al., 2019; Gui et al., 2019), have achieved great success. However, these supervised methods usually require large-scale labeled data to achieve good performance, while the annotation of NER data is often laborious and time-consuming.
In the real world, there are many, or strictly speaking, an unbounded number of entity types, and it is impossible for an NER system to cover all of them (Ling and Weld, 2012; Mai et al., 2018). Therefore, in industry, it often happens that some entity types that clients need recognized are not defined in the previously designed NER system. In such a case, we need to quickly adjust the existing NER system so that it can recognize the new entity types required by the clients. In this paper, we refer to the existing NER system as the source system and to the adjusted system as the target system. The NER tasks defined in the two systems are referred to as the source task and the target task, respectively. The goal of this work is to transfer quickly from the source task to the target task.
Suppose the new entity types defined in the target task are classified into class K in the source task (e.g., GPE and non-GPE are both annotated as the location type). A common practice for building the target system is to sample some examples from the training data of the source task and ask annotators to re-annotate the words of class K in these examples. Then, the model pretrained on the source task (with its output layer replaced) is finetuned on the re-annotated data to perform the target task. However, it is worth noting that the NER labels of words are context-dependent. To re-annotate the words of class K, the annotators need to read the whole sentence rather than only the fragments labeled as class K. This is still laborious and time-consuming, making it far from ideal when fast adaptation is required or when the entity types required by the clients vary a lot and keep changing.
Figure 1: Application of our method to the illustrative sentence "Bobick works at Google as a program support specialist." for additionally introducing Job Title in the target task. Here, words in blue color constitute a job title. "Partial Label" corresponds to the partial labels automatically obtained using the job title lexicon, where "U" means the label of the word is unknown (can be "O" or "JOB"). "Predict" denotes the expected labels predicted by our model.

In this work, we propose to transfer from the source task to the target task using only the labeled data of the source task and entity lexicons of the newly introduced entity types. Note that the collection of entity lexicons is often much easier than data annotation. For example, we can ask language experts familiar with NER to provide some common mentions of the new entity types, or we can collect some confident mentions of the types from the internet to construct the lexicons.
In some cases, we can even ask the clients of the target system to provide the lexicons, and usually they are more willing to do so than to annotate data. To perform the transfer task using the entity lexicons, we formulate the task as a partially supervised learning problem. Figure 1 depicts the general process of our method, where the target task needs to additionally recognize job titles, which are annotated as the Other (O) class in the source task. Specifically, an entity lexicon is collected for the job title type. The lexicon is used to automatically re-annotate the training words of the Other class in the source task ("O" of the "Source Label" in Figure 1), yielding some labeled data of the new entity type ("JOB" of the "Partial Label" in Figure 1). The remaining words of the Other class that are not annotated by the lexicons form the unlabeled data ("U" of the "Partial Label" in Figure 1). Note that the unlabeled data contains both words of the new entity types and words not belonging to any entity type in the target task, and that there is no labeled data for the Other class in the target task. Based on the obtained labeled and unlabeled data, a multi-class classifier is trained to perform the target task using a partially supervised (PS) learning algorithm.
In this classifier, the constituent words of an entity mention correspond to the same class label without distinction of their positions in the mention, and words not belonging to any entity type are grouped into a single class ("O" of "Predict" in Figure 1). The contribution of this work is threefold: 1) We explore fast transferring from a source NER system (task) to a target NER system (task). This setting has a wide range of applications in the real world but has rarely been studied. 2) We propose to perform the task using only the labeled data of the source task and entity lexicons of the newly introduced entity types, avoiding laborious and time-consuming data labeling. 3) We formulate the task as a partially supervised learning problem and accordingly propose an effective algorithm to address it.

Table 1: Important notations used throughout this work.
Notation | Description
$T_s$, $T_t$ | the source and target tasks
$n_s$ | the number of classes defined in $T_s$
$n_t$ | the number of new entity types defined in $T_t$
$e_i$, $i \le n_s$ | the $i$-th predefined entity type in $T_s$
$e_{n_s+j}$ | the $j$-th new entity type defined in $T_t$
$L_j$ | the lexicon of $e_{n_s+j}$
$D^s$, $D^t$ | the labeled and partially labeled data for $T_s$ and $T_t$
$D^t_i$ | the obtained labeled data of class $i$ in $D^t$
$D^t_u$ | the unlabeled data in $D^t$
$\pi_i$ | the ratio of class $i$ data in $D^t$
$\hat{\pi}_{n_s+j}$ | the ratio of class $(n_s+j)$ data in $D^t_u$

Task Definition
In the setting of this work, there are a source and a target NER system, in which a source NER task $T_s$ and a target NER task $T_t$ are defined, respectively. For $T_s$, a labeled dataset $D^s$ is available, in which $n_s$ classes are defined (each class corresponds to an entity type or the Other class). Compared with $T_s$, the target task $T_t$ needs to additionally recognize some new entity types. Without loss of generality, we assume that the $n_t$ newly introduced entity types all belong to class $K$ in the source task. For an intuitive understanding, consider introducing two entity types, Government Organization and Company, which are both labeled as the Organization type in the source task. In this case, the Organization type is class $K$ in the source task, while Government Organization and Company are two sub-classes of class $K$. In this work, we present a way to perform $T_t$ using only $D^s$ and the entity lexicons of the new entity types. Table 1 lists some important notations used throughout this work for convenient reference.

Table 2: Obtained labeled and unlabeled data for each class in the target task. The labeled data of each predefined entity type is copied from the training data of the source task, while the labeled data of each new entity type is automatically obtained using the lexicons. Note that there is no labeled data for class $K$ of the source task. Thus, a fully supervised learning algorithm is not applicable to train the classifier.
Class label | $1, \cdots, n_s$ (except $K$) | $n_s+1, \cdots, n_s+n_t$ | $K$
Labeled data | copied from $D^s$ | obtained using the lexicons | none

Label Assignment
We apply a plain multi-class label assignment mechanism for performing $T_t$, instead of the prevalent BIO or BIOES mechanism. That is, the constituent words of a mention of the entity type $e_i$ are all classified into class $i$, without distinction of their positions in the mention. This is because the words labeled by the lexicons may not cover all the constituent words of an entity mention, which means that we cannot tell which position tag, B (beginning), I (inside), or E (end), a lexicon-labeled word should receive.

Method Overview
Based on the above label assignment mechanism, we train an $(n_s+n_t)$-class classifier to perform the target task $T_t$. In the classifier, the $n_s$ entity types predefined in $T_s$ are denoted as $e_i$, $i = 1, \cdots, n_s$, and mapped to classes $1, \cdots, n_s$, respectively. The $n_t$ new entity types introduced in $T_t$ are denoted as $e_{n_s+j}$, $j = 1, \cdots, n_t$, and mapped to classes $n_s+j$, $j = 1, \cdots, n_t$, respectively. The challenge in training the classifier is that, in $D^s$, words of the $n_t$ newly introduced entity types are all classified into the same class $K$ of the source task. For training the classifier, we construct a partially labeled dataset $D^t$ from $D^s$ using the lexicons of the newly introduced entity types. Specifically, let $D^t_i \subseteq D^t$ denote the labeled data of class $i$ in $D^t$. For $i \neq K$, $D^t_i$ is constructed from the words of class $i$ in $D^s$. To obtain the labeled data of a new entity type $e_{n_s+j}$, we use its corresponding entity lexicon $L_j$ to scan the words of class $K$ in $D^s$ and pick out confident words of the entity type, which form the labeled data $D^t_{n_s+j}$ of class $n_s+j$. This process is applied $n_t$ times to obtain the labeled data of the $n_t$ new entity types. The remaining words of class $K$ in $D^s$ that are not selected by the lexicons form the unlabeled dataset $D^t_u \subseteq D^t$ of the target task, which contains both words of the new entity types (a lexicon cannot cover all of its corresponding entities in the data) and words not belonging to any of the newly introduced entity types. Table 2 lists the available labeled and unlabeled data for each class in the target task after the above process. It is worth noting from the table that there is no labeled data for class $K$ in the target task. This means that it is impossible to train the classifier using a normal supervised learning algorithm. To address this challenge, we introduce a novel partially supervised learning algorithm to train the classifier, as described in §2.6.

Obtain the Partially Labeled Data using the Entity Lexicons
In this section, we detail the construction of the partially labeled dataset $D^t$ for the target task. As illustrated before, the labeled data $D^t_i$ of each class $i \le n_s$, $i \neq K$, can be easily obtained from $D^s$ according to the data labeling of $T_s$. Thus, in the following, we focus on obtaining $D^t_{n_s+j}$, $j = 1, \cdots, n_t$, and $D^t_u$ using the entity lexicons. Following prior work, we apply the maximum matching algorithm (Xue, 2003) to find the words that match the lexicon $L_j$ and belong to class $K$ in $D^s$, and use them to construct $D^t_{n_s+j}$. As summarized in Algorithm 1, this algorithm is a greedy search routine that walks through a sequence of class $K$ words, trying to find the longest string that matches an entry of the lexicons. Note that in Algorithm 1, the maximum mention length $l_w$ is heuristically set to 4, and the "for" loop is broken in step 12 because a mention cannot occur in multiple lexicons, which is guaranteed by the input condition $L_j \cap L_k = \emptyset$ for $j \neq k$.
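The greedy longest-match routine described above can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithm 1: the function name `max_match_label`, the `"U"` marker for unmatched words, the representation of lexicon entries as lowercase word tuples, and the case-insensitive comparison are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Set, Tuple

UNKNOWN = "U"  # hypothetical marker for class-K words matched by no lexicon


def max_match_label(
    words: Sequence[str],
    lexicons: Dict[str, Set[Tuple[str, ...]]],
    max_len: int = 4,
) -> List[str]:
    """Greedy longest-match labeling of a run of class-K words."""
    labels = [UNKNOWN] * len(words)
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for span in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(w.lower() for w in words[i:i + span])
            for type_name, lexicon in lexicons.items():
                if candidate in lexicon:
                    labels[i:i + span] = [type_name] * span
                    i += span
                    matched = True
                    break  # lexicons are disjoint, so one hit is enough
            if matched:
                break
        if not matched:
            i += 1  # no entry starts here; the word stays unlabeled
    return labels


job_lexicon = {"JOB": {("program", "support", "specialist"), ("specialist",)}}
run = ["as", "a", "program", "support", "specialist"]
print(max_match_label(run, job_lexicon))  # ['U', 'U', 'JOB', 'JOB', 'JOB']
```

Words labeled with a type name would go into the labeled data of that new type, and words left as `"U"` would go into the unlabeled data.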

Model Architecture
For a sentence $s = [w_1, \cdots, w_l]$ with $l$ words, we first get the contextualized representations of the words using the BERT model (Devlin et al., 2019):

$$[h_1, \cdots, h_l] = \mathrm{BERT}([w_1, \cdots, w_l]). \quad (1)$$

Based on the obtained word representations, we apply a multi-layer perceptron (MLP), $f_c$, whose last-layer activation function is set to softmax, to perform label inference on each representation $h_i$. In the following, we denote $f$ as the classifier, with $f(w_i) = f_c(h_i)$ being an $(n_s+n_t)$-dimensional probability vector.

Algorithm 1 Data Labeling using the Lexicons
Input: entity lexicons $L_j$ for $j = 1, \cdots, n_t$, with $L_j \cap L_k = \emptyset$ if $j \neq k$; a word sequence $s = \{w_1, \cdots, w_n\}$ of class $K$ in $D^s$; and the maximum mention length $l_w$
Result: the partially labeled dataset $D^t$

Partially Supervised Learning for Model Training
In this section, we discuss how to train the $(n_s+n_t)$-class classifier using the partially labeled dataset $D^t$. In the following, $\ell(f(w), i)$ denotes the classification loss defined on the input-label pair $(w, i)$, $\pi_i$ denotes the ratio of class $i$ data in $D^t$, and

$$L_j^i = \frac{1}{|D^t_j|} \sum_{w \in D^t_j} \ell(f(w), i)$$

denotes the classification loss defined on the dataset-label pair $(D^t_j, i)$.

Theoretical foundation. Suppose the labeled data of class $K$ is available and denoted as $D^t_K$. Then, we can train the classifier on the normal fully supervised learning loss, which is defined as follows:

$$L_{sup} = \sum_{i=1}^{n_s+n_t} \pi_i L_i^i.$$

Here, we assume that the value of $\pi_i$ is known and will discuss its estimation in the next section. However, due to the absence of $D^t_K$, we cannot directly obtain the value of $L_K^K$ and, consequently, cannot obtain $L_{sup}$. To address this problem, we propose a method to estimate $L_K^K$ using the available labeled and unlabeled data. Specifically, based on the unlabeled data $D^t_u$, we can obtain the loss defined on the dataset-label pair $(D^t_u, K)$ as follows:

$$L_u^K = \frac{1}{|D^t_u|} \sum_{w \in D^t_u} \ell(f(w), K).$$

Note that $D^t_u$ consists of unlabeled data from class $(n_s+1)$ to class $(n_s+n_t)$ and from class $K$. Thus, the right-hand side of the above equation can be factorized as follows:

$$L_u^K = \sum_{j=1}^{n_t} \hat{\pi}_{n_s+j} L^K_{u(n_s+j)} + \hat{\pi}_K L_K^K,$$

where $D^t_u(n_s+j)$ denotes the class $(n_s+j)$ data in $D^t_u$, $L^K_{u(n_s+j)}$ denotes the loss defined on the pair $(D^t_u(n_s+j), K)$, $\hat{\pi}_{n_s+j} = |D^t_u(n_s+j)| / |D^t_u|$ denotes the ratio of class $(n_s+j)$ in $D^t_u$, and $\hat{\pi}_K$ denotes the ratio of class $K$ in $D^t_u$ (all class $K$ data lies in $D^t_u$, since none of it is labeled). Based on this factorization and the assumption that the data distribution in $D^t_{n_s+j}$ is close to the data distribution in $D^t_u(n_s+j)$, we have that:

$$L_u^K \approx \sum_{j=1}^{n_t} \hat{\pi}_{n_s+j} L^K_{n_s+j} + \hat{\pi}_K L_K^K. \quad (4)$$

By reformulating the approximate equation (4), we can obtain an approximation of $L_K^K$ by:

$$L_K^K \approx \frac{1}{\hat{\pi}_K} \Big( L_u^K - \sum_{j=1}^{n_t} \hat{\pi}_{n_s+j} L^K_{n_s+j} \Big),$$

which can be calculated using the unlabeled data and the labeled data of the new entity types.
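As a numerical sanity check of the mixture argument above, the following sketch builds a toy unlabeled set whose hidden class membership is known, and recovers the class K loss from the loss on the whole unlabeled set. The loss values, the `"JOB"` type name, and the helper `estimate_class_k_loss` are hypothetical. Because the hidden new-type words themselves are used here, the recovery is exact; with lexicon-labeled words standing in for them, it holds only approximately.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def estimate_class_k_loss(
    loss_u: float,
    new_type_losses: Dict[str, float],
    ratios_u: Dict[str, float],
    ratio_k: float,
) -> float:
    """Solve the mixture identity on the unlabeled set for the class-K term."""
    mixed = sum(ratios_u[t] * new_type_losses[t] for t in new_type_losses)
    return (loss_u - mixed) / ratio_k


# Hypothetical per-word losses l(f(w), K) on the unlabeled set, grouped by the
# (normally hidden) true class of each word.
hidden = {"JOB": [0.9, 0.7], "K": [0.1, 0.2, 0.0, 0.1, 0.05, 0.15]}
n_u = sum(len(v) for v in hidden.values())
loss_u = mean([x for v in hidden.values() for x in v])
ratios = {c: len(v) / n_u for c, v in hidden.items()}

estimate = estimate_class_k_loss(
    loss_u, {"JOB": mean(hidden["JOB"])}, {"JOB": ratios["JOB"]}, ratios["K"]
)
# The estimate coincides with the true average loss of the hidden class-K words.
assert abs(estimate - mean(hidden["K"])) < 1e-9
```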
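In addition, according to the theoretical and empirical analysis of (du Plessis et al., 2014), training over this approximate value of $L_K^K$ is expected to be equivalent to training over its true value if $\ell$ is upper-bounded.

Practical loss definition. According to the above analysis, we implement the classification loss $\ell$ by the mean square error (MSE) between the predicted probability vector $f(w)$ and the target label, where $f(w)[i]$ denotes the $i$-th dimension value of $f(w)$. Here, we implement $\ell$ with the mean square error instead of the popular cross-entropy loss because the mean square error is upper-bounded (by 1), which is critical for the estimation of $L_K^K$, while the cross-entropy loss is not (the cross-entropy loss can be infinitely large). The empirical training loss is obtained by replacing $L_K^K$ in the fully supervised loss with its approximation:

$$L_{ps} = \sum_{i \neq K} \pi_i L_i^i + \frac{\pi_K}{\hat{\pi}_K} \Big( L_u^K - \sum_{j=1}^{n_t} \hat{\pi}_{n_s+j} L^K_{n_s+j} \Big),$$

where $L_u^K$ is the loss defined on the pair $(D^t_u, K)$ and $\hat{\pi}$ denotes a class ratio in $D^t_u$. In addition, following the practice of (Kiryo et al., 2017), we constrain:

$$L_u^K - \sum_{j=1}^{n_t} \hat{\pi}_{n_s+j} L^K_{n_s+j} \ge 0$$

during the minimization of $L_{ps}$. An intuitive understanding of this constraint is that the loss for class $K$ should be non-negative.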
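A minimal sketch of how such a partially supervised loss could be computed from classifier outputs is given below. Several points are assumptions rather than details confirmed by the text: the MSE is taken as the mean squared gap between the probability vector and the one-hot target, the training loss is taken to be the supervised loss with the class K term replaced by its estimate, and the non-negativity constraint is imposed by simple clipping. Classes are 0-indexed and all names are hypothetical.

```python
from typing import Dict, List, Sequence

Probs = Sequence[float]


def mse_loss(probs: Probs, label: int) -> float:
    """Assumed MSE form: mean squared gap to the one-hot target of `label`."""
    return sum(
        (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs)
    ) / len(probs)


def dataset_loss(batch: Sequence[Probs], label: int) -> float:
    """Average loss of a dataset's predictions against a single label."""
    return sum(mse_loss(p, label) for p in batch) / len(batch)


def ps_loss(
    labeled: Dict[int, List[Probs]],  # predictions on labeled data, classes != k
    unlabeled: List[Probs],           # predictions on the unlabeled data
    pi: Dict[int, float],             # class ratios in the full data (incl. k)
    pi_hat: Dict[int, float],         # ratios in the unlabeled data (new types, k)
    k: int,
) -> float:
    supervised = sum(pi[i] * dataset_loss(b, i) for i, b in labeled.items())
    residual = dataset_loss(unlabeled, k) - sum(
        pi_hat[j] * dataset_loss(labeled[j], k) for j in pi_hat if j != k
    )
    residual = max(residual, 0.0)  # assumed way to keep the class-k loss >= 0
    return supervised + pi[k] / pi_hat[k] * residual


# Toy setting: class 0 = predefined type, class 1 = class K, class 2 = new type.
labeled = {0: [[0.9, 0.05, 0.05]], 2: [[0.1, 0.1, 0.8]]}
unlabeled = [[0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
pi = {0: 0.2, 1: 0.6, 2: 0.2}
pi_hat = {1: 0.75, 2: 0.25}
print(ps_loss(labeled, unlabeled, pi, pi_hat, k=1))
```

In a real training loop the same quantities would be computed on differentiable tensors; clipping is only one of several ways to enforce the constraint.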
Class ratio estimation. To obtain the value of $L_{ps}$, it is necessary to know the values of the class ratios. Here, we present our method to estimate them. For $i \le n_s$, $i \neq K$, $\pi_i$ is estimated by $\pi_i = |D^t_i| / |D^t|$, since class $i$ data is fully labeled in $D^t$. For estimating $\pi_{n_s+j}$ and $\hat{\pi}_{n_s+j}$, $j = 1, \cdots, n_t$, as well as $\pi_K$ and $\hat{\pi}_K$, we apply an iterative strategy. In particular, we first initialize $\pi_{n_s+j}$ and $\hat{\pi}_{n_s+j}$ for $j \le n_t$ by $|D^t_{n_s+j}| / |D^t|$, and initialize $\pi_K$ and $\hat{\pi}_K$ by $|D^t_u| / |D^t|$ and 1, respectively. Based on this, we train the classifier $f$ and then re-estimate these ratios using the trained classifier. This process iterates several times to get the final estimates. Note that, according to the theoretical analysis of Kato et al. (2018), the estimates will converge to fixed values.
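The initialization step described above can be sketched as follows. The sketch covers only the starting values; the re-estimation step is left out because it depends on the trained classifier. The function name `init_ratios`, the 0-indexed class ids, and the use of word counts as dataset sizes are assumptions.

```python
from typing import Dict, Iterable, Tuple


def init_ratios(
    labeled_sizes: Dict[int, int],  # words per labeled class (all classes but k)
    unlabeled_size: int,            # words in the unlabeled data
    new_classes: Iterable[int],     # ids of the newly introduced entity types
    k: int,                         # id of class K
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Initial class ratios in the full data (pi) and in the unlabeled data (pi_hat)."""
    total = sum(labeled_sizes.values()) + unlabeled_size
    # Labeled classes start from their share of the whole dataset.
    pi = {c: n / total for c, n in labeled_sizes.items()}
    # New types use the same starting value for their ratio in the unlabeled data.
    pi_hat = {c: labeled_sizes[c] / total for c in new_classes}
    # Class K starts as if every unlabeled word belonged to it.
    pi[k] = unlabeled_size / total
    pi_hat[k] = 1.0
    return pi, pi_hat


pi0, pi_hat0 = init_ratios({0: 20, 2: 5}, unlabeled_size=75, new_classes=[2], k=1)
print(pi0, pi_hat0)
```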

Lexicon Adaptation
Iteratively enriching the lexicons in a self-training style has proved to be an effective technique for improving model performance. We follow this technique in our method. In particular, we use the trained classifier to perform label prediction for the words of $D^t_u$. Among the predicted entity mentions of the new entity types, we add the frequently occurring ones to the lexicons, which are then used for data labeling in the next iteration. This process repeats until the lexicons no longer change.
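One round of this lexicon enrichment might look like the sketch below. The frequency threshold `min_count`, the function name `adapt_lexicon`, and the representation of mentions as plain strings are assumptions; the text does not state how "frequently occurring" is decided.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def adapt_lexicon(
    lexicon: Set[str],
    predicted_mentions: Iterable[str],
    min_count: int = 2,  # hypothetical frequency threshold
) -> Tuple[Set[str], bool]:
    """Add frequently predicted mentions; report whether the lexicon changed."""
    counts = Counter(predicted_mentions)
    additions = {
        m for m, c in counts.items() if c >= min_count and m not in lexicon
    }
    return lexicon | additions, bool(additions)


job_titles = {"engineer"}
predictions = ["data scientist", "data scientist", "chef", "engineer", "engineer"]
job_titles, changed = adapt_lexicon(job_titles, predictions)
print(sorted(job_titles), changed)  # ['data scientist', 'engineer'] True
```

An outer loop would relabel the data with the enlarged lexicon, retrain the classifier, and stop once `changed` is false.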

Label Inference
For a query sentence, we first perform label prediction for its constituent words using the trained classifier $f$ as follows:

$$\hat{y}_i = \arg\max_{k} f(w_i)[k].$$

Consecutive words predicted to be of the same class form an entity mention. For example, for the sentence $s = \{w_1, w_2, w_3, w_4, w_5\}$, if the predicted label sequence is $\{1, 1, 3, 4, 4\}$ with $n_s = 2$ and $n_t = 1$, then $\{w_1, w_2\}$ and $\{w_3\}$ are treated as entity mentions of type $e_1$ and type $e_3$, respectively.
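The grouping of consecutive same-class words into mentions can be sketched as below. The function name `decode_mentions` and the `(start, end, label)` output format are hypothetical, and treating label 4 as the non-entity class is an assumption made so that the output matches the example above.

```python
from typing import List, Sequence, Tuple


def decode_mentions(
    labels: Sequence[int], other: int
) -> List[Tuple[int, int, int]]:
    """Return (start, end_exclusive, label) for each run of one entity class."""
    mentions = []
    start = 0
    for i in range(1, len(labels) + 1):
        # A run ends at the end of the sentence or where the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != other:
                mentions.append((start, i, labels[start]))
            start = i
    return mentions


print(decode_mentions([1, 1, 3, 4, 4], other=4))  # [(0, 2, 1), (2, 3, 3)]
```

One consequence of position-free labels is that two adjacent mentions of the same type are merged into a single mention by this decoding.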

Related Work
NER is a well-studied natural language processing (NLP) task. Early NER systems were mostly knowledge-based (Nadeau et al., 2006; Gerner et al., 2010; Liu et al., 2015). They do not require annotated training data but rely heavily on background knowledge (rules) and lexicon resources. They work well when the lexicon is exhaustive, but fail when the lexicon is incomplete. Precision is generally high for these systems, but recall is often low due to incomplete lexicons. Current state-of-the-art NER systems are mainly based on annotated data and machine learning approaches. The lexicons introduced in some of these systems mainly serve to extract external features (Liu et al., 2015; Agerri and Rigau, 2016; Chiu and Nichols, 2016). This field was previously dominated by graph models like Hidden Markov Models (HMM) (Zhou and Su, 2002), Maximum Entropy Markov Models (MEMM) (Malouf, 2002; McCallum et al., 2000), and Conditional Random Fields (CRF) (McCallum and Li, 2003). Starting with (Collobert et al., 2011), neural network NER systems with minimal feature engineering have become popular. Such models do not require exhaustive feature engineering. Various neural architectures have been proposed, like the bidirectional long short-term memory network (LSTM) plus a CRF layer (Huang et al., 2015), the convolutional neural network (CNN) plus a CRF layer, the combination of LSTM and CNN (Chiu and Nichols, 2016), and the BERT-based LSTM+CRF model (Jiang et al., 2019; Hakala and Pyysalo, 2019).

Table 3: Task information built on five public NER datasets, including the sentence number (#Sent) and word number (#Word) in $D^s$ (also $D^t$), the entity types that make up class $K$ of the source task (also the newly introduced entity types in the target task), and the mention ratio of each entity type (e.g., 28.3% of entity mentions are of the person type in CoNLL03 (en)).

One of the most related works proposes to perform NER using entity lexicons and unlabeled data. For this purpose, a distinct binary classifier is trained for each entity type using the unbiased positive-unlabeled (PU) learning algorithm (du Plessis et al., 2014; Kiryo et al., 2017). At inference time, the recognition results of the binary classifiers for different entity types are combined to make the final decision. The difference between the compared work and ours is that, in the compared work, the mention recognition for one entity type is performed independently of the other types through a binary classifier. Consequently, it has to resolve the conflicts between the recognition results of different binary classifiers using a heuristic method at inference time. In this work, by contrast, the mention recognition for different entity types is performed simultaneously using a single model. This way, the recognition of different entity types can enhance each other, and heuristic conflict resolution at inference time is avoided.

Datasets
Following the experimental setting of the most related work, we performed the experiments on five public NER datasets, including CoNLL03 (en) in English (Tjong Kim Sang and De Meulder, 2003), CoNLL02 (sp) in Spanish (Sang and Erik, 2002), MUC-7 (Chinchor, 1998), Twitter in English, and OntoNotes4.0 (cn) in Chinese (Weischedel et al., 2011). For the former four datasets, we treated the location (LOC) and person (PER) types as the newly introduced entity types in the target task, and treated the remaining entity types as the predefined entity types in the source task. For OntoNotes4.0, we treated the GPE (countries, cities, states) and location (non-GPE locations, mountain ranges, bodies of water) types as the newly introduced entity types in the target task, which are both classified as the location type in the source task. Table 3 shows this setting and some statistics of these datasets.

Lexicon Collection
We used the same entity lexicons of the person and location types as the most related work to perform the experiments. According to the description in the referred work, the collection of these lexicons is quite easy. For example, the lexicon of the person type is constructed from 2,000 popular English names in England and Wales in 2015 from ONS, and the lexicon of the location type is constructed from names of countries and their top two popular cities, plus 200 popular mountain names. The resultant person and location lexicons contain 2,000 distinct person names and 948 location names, respectively. We refer the reader to the referred work for more information about the lexicons.
We stress that these lexicons can label only a small part of the mentions of the person and location types.

Table 4: Testing chunk-level F1 on the target task. The four label-based methods are fully supervised and trained on the fully re-annotated data of the source task, while the five lexicon-based methods train the model using only the existing labels of the source task and entity lexicons of the new entity types. The best performance in each group is marked in boldface.

Compared Methods
In the following, we use SourceBERT to refer to the BERT-based model trained on the source task.
The compared methods can be divided into two groups. The first group of methods performs the target task using only $D^s$ and the entity lexicons of the new entity types. It includes the Match method, which directly uses the lexicons to search for the mentions of the new entity types according to Algorithm 1, and the bnPU method as well as its lexicon-adapted version AdaPU. For these methods, we combined their recognition results with those of SourceBERT to perform entity recognition. In particular, for a query sentence, we first perform label inference using SourceBERT and then apply these methods to the words predicted to be of the "O" class by SourceBERT to further identify mentions of the new entity types. This practice also applies to our proposed method AdaPS as well as its variant without lexicon adaptation, bnPS.
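The two-stage combination described above can be sketched as follows, with both models abstracted as functions from a word sequence to a label sequence. The names `combine_predictions`, `source_tagger`, and `target_tagger`, the string labels, and the stub taggers in the usage example are hypothetical.

```python
from typing import Callable, List, Sequence, Set

Tagger = Callable[[Sequence[str]], List[str]]


def combine_predictions(
    words: Sequence[str],
    source_tagger: Tagger,   # stands in for SourceBERT
    target_tagger: Tagger,   # stands in for a lexicon-based method
    new_types: Set[str],
    k_label: str = "O",      # source class that contains the new entity types
) -> List[str]:
    """Keep source labels, refining only the words the source puts in class K."""
    source = source_tagger(words)
    target = target_tagger(words)
    return [
        t if s == k_label and t in new_types else s
        for s, t in zip(source, target)
    ]


sentence = ["Bobick", "works", "as", "a", "specialist"]
combined = combine_predictions(
    sentence,
    source_tagger=lambda ws: ["PER", "O", "O", "O", "O"],
    target_tagger=lambda ws: ["JOB", "O", "O", "O", "JOB"],
    new_types={"JOB"},
)
print(combined)  # ['PER', 'O', 'O', 'O', 'JOB']
```

Because the source label wins whenever it is not class K, the first word keeps its PER label even though the second tagger proposes JOB for it.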
The second group of methods is fully supervised. It includes the benchmark CRF model Stanford NER (CRF) (Lafferty et al., 2001; Finkel et al., 2005), the bidirectional long short-term memory network with a CRF layer (BiLSTM+CRF) or without it (BiLSTM) (Huang et al., 2015), and the BERT-based model (Devlin et al., 2019) described in the "Model Architecture" section. These supervised models were trained on the fully re-annotated $D^s$ according to the data labeling criteria of the target task.

Implementation
The implementation of the fully supervised methods except BERT follows the protocol of prior work. The BERT model was initialized using the bert-base-cased model for the three English datasets and using the bert-multilingual-base-cased model for CoNLL02 (sp) and OntoNotes4.0 (cn); $f_c$ was implemented with a one-layer MLP ($768 \xrightarrow{\text{softmax}} n_s + n_t$). Parameter updating was implemented using the Adam (Kinga and Adam, 2015) optimizer with the learning rate set to 5e-5. For a fair comparison with our methods, we replaced the BiLSTM-based sequence modeling layer of bnPU and AdaPU with the BERT module, which showed better performance.

Figure 2: Testing chunk-level F1 (mean ± std. over 4 runs) of the BERT model against the number of re-annotated sentences used for its finetuning (best viewed in color). The dotted line denotes the performance of the BERT model, while the solid line in the same color denotes the corresponding performance of our method, AdaPS. Note that AdaPS does not use re-annotated data, so its performance stays the same along the x-axis.

Results
Following the protocol of most previous works, we use the chunk-level (exact mention match) F1 to evaluate model performance. We report the F1 score on the mention set of each new entity type, as well as the overall F1 score on the mention set of all new entity types. Note that our methods and the other lexicon-based baselines are only applied to words predicted as class K by SourceBERT. Thus, their performance on the predefined entity types is the same and is determined by SourceBERT.
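For reference, exact-match chunk-level F1 can be computed as in the sketch below, where a predicted mention counts as correct only if its sentence, boundaries, and type all match a gold mention. The tuple layout `(sentence_id, start, end, type)` and the function name `chunk_f1` are assumptions for the example.

```python
from typing import Iterable, Tuple

Mention = Tuple[int, int, int, str]  # (sentence_id, start, end, type)


def chunk_f1(gold: Iterable[Mention], pred: Iterable[Mention]) -> float:
    """Exact-match F1 over two sets of mentions."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


gold_mentions = [(0, 0, 2, "PER"), (0, 3, 4, "LOC")]
pred_mentions = [(0, 0, 2, "PER"), (0, 3, 5, "LOC")]  # second boundary is off
print(chunk_f1(gold_mentions, pred_mentions))  # 0.5
```

Restricting both sets to one entity type gives the per-type score, and using all new types together gives the overall score.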
General performance. Table 4 shows the model performance on the four tested datasets. From the table, we can observe that: 1) Our methods, AdaPS and bnPS, consistently outperform their PU-learning based counterparts AdaPU and bnPU. This shows the advantage of our methods over the PU-learning baselines. 2) Compared with bnPU and bnPS, AdaPU and AdaPS achieve further improvement on most of the four tested tasks. This verifies the effectiveness of lexicon adaptation. However, the improvement of AdaPS over bnPS is much smaller than the improvement of AdaPU over bnPU. A possible explanation is that bnPS already performs much better than bnPU, so improving further over bnPS is harder than improving over bnPU. 3) The performance of the Match baseline is quite poor (mainly due to its low recall). This observation is consistent with the results reported in previous works and shows the insufficiency of a purely lexicon-matching strategy. 4) Compared with BiLSTM and BiLSTM+CRF, the BERT-based model achieves much better performance on the four tested tasks. This shows the effectiveness of the pretrained BERT model for NER. 5) Our methods, AdaPS and bnPS, achieve performance quite comparable with that of the fully supervised BERT model, which requires re-annotating $D^s$. In addition, enhanced by the pretrained BERT model, our methods even outperform the fully supervised CRF, BiLSTM, and BiLSTM+CRF models on the CoNLL02 (sp), MUC-7, and Twitter datasets. This shows the efficiency of our methods in transferring from the source task to the target task.
Comparison with model finetuning. In this study, we explore how much re-annotated data the BERT model requires to achieve performance similar to that of our proposed method, AdaPS. Figure 2 shows the performance of BERT when using varying sizes of randomly sampled re-annotated data to finetune SourceBERT (with the output layer replaced). From the figure, we can see that: 1) In terms of the overall F1 score, it requires on average re-annotating about 500, 200, 750, 750, and 1,000 sentences of $D^s$ to achieve performance similar to our method on CoNLL03 (en), CoNLL02 (sp), MUC-7, Twitter, and OntoNotes4.0 (cn), respectively. 2) To achieve performance similar to our method for all new entity types, it requires on average re-annotating about 500, 500, 1,000, and 750 sentences of $D^s$ on CoNLL03 (en), CoNLL02 (sp), MUC-7, and Twitter, respectively. 3) On OntoNotes4.0 (cn), more re-annotated data is required for the location type than for the GPE type. This is because mentions of the location type occur much less frequently than mentions of the GPE type. Thus, more data must be annotated for the location type to cover enough of its mentions.
Influence of SourceBERT for label inference. As mentioned in the "Compared Methods" section, we combined the recognition results of our method with those of the SourceBERT model to perform entity recognition for the target task. Here, we study the influence of SourceBERT on the recognition results. Table 5 shows the performance of our method, bnPS, when using the trained classifier $f$ only and when additionally using SourceBERT to perform entity recognition for the target task. From the table, we can see that: 1) Introducing the SourceBERT model consistently improves the recognition performance of our method for the predefined entity types, and on three of the four tested tasks, it also improves the overall recognition performance of our method. 2) For the newly introduced entity types, the improvement introduced by SourceBERT is relatively smaller, and the improvement is even negative on some tasks.

Conclusion
In this work, we address the task of introducing one or more new entity types to an existing NER system, for which a dataset has been previously labeled. To avoid laborious and time-consuming data labeling, we propose a partially supervised learning algorithm to perform the task using only the labeled data of the existing NER system and entity lexicons of the new entity types. Experimental studies on four public NER datasets show that our method can achieve performance quite comparable with that of the fully supervised methods using some easily collected lexicons. This makes our method a good choice for fast entity type introduction.