Distantly Supervised Named Entity Recognition using Positive-Unlabeled Learning

In this work, we explore performing named entity recognition (NER) using only unlabeled data and named entity dictionaries. To this end, we formulate the task as a positive-unlabeled (PU) learning problem and accordingly propose a novel PU learning algorithm to perform the task. We prove that the proposed algorithm can unbiasedly and consistently estimate the task loss as if there were fully labeled data. A key feature of the proposed method is that it does not require the dictionaries to label every entity within a sentence, and it does not even require the dictionaries to label all of the words constituting an entity. This greatly reduces the requirement on the quality of the dictionaries and makes our method generalize well with quite simple dictionaries. Empirical studies on four public NER datasets demonstrate the effectiveness of our proposed method. We have published the source code at https://github.com/v-mipeng/LexiconNER.


Introduction
Named Entity Recognition (NER) is concerned with identifying named entities, such as person, location, product, and organization names, in unstructured text. It is a fundamental component of many natural language processing tasks such as machine translation (Babych and Hartley, 2003), knowledge base construction (Riedel et al., 2013; Shen et al., 2012), automatic question answering (Bordes et al., 2015), search (Zhu et al., 2005), etc. In this field, supervised methods, ranging from typical graph models (Zhou and Su, 2002; McCallum et al., 2000; McCallum and Li, 2003; Settles, 2004) to currently popular neural-network-based models (Chiu and Nichols, 2016; Lample et al., 2016; Gridach, 2017), have achieved great success. However, these supervised methods often require large-scale fine-grained annotations (labeling every word of a sentence) to generalize well. This makes it hard to apply them to annotation-scarce domains, e.g., bio/medical domains (Delėger et al., 2016).
In this work, we explore performing NER using only unlabeled data and named entity dictionaries, which are relatively easier to obtain than labeled data. A natural way to perform the task is to scan through the query text using the dictionary and treat terms matched with entries of the dictionary as entities (Nadeau et al., 2006; Gerner et al., 2010; Liu et al., 2015). However, this practice requires very high-quality named entity dictionaries that cover most entities; otherwise it performs poorly. As shown in Figure 1, the constructed dictionary of person names only labels one entity within the query text, which contains two entities, "Bobick" and "Joe Frazier", and it only labels one word, "Joe", of the two-word entity "Joe Frazier". To address this problem, an intuitive solution is to further perform supervised or semi-supervised learning on the dictionary-labeled data. However, since there is no guarantee that the dictionary covers all entity words (words belonging to entities) within a sentence, we cannot simply treat a word not labeled by the dictionary as a non-entity word. Take the data labeling results depicted in Figure 1 as an example. Simply treating "Bobick" and "Frazier" as non-entity words and then performing supervised learning would introduce label noise into the supervised classifier. Therefore, when using the dictionary to perform data labeling, we can actually only obtain some entity words and a bunch of unlabeled data comprising both entity and non-entity words. In this case, conventional supervised or semi-supervised learning algorithms are not suitable, since they usually require labeled data of all classes.
With this consideration, we propose to formulate the task as a positive-unlabeled (PU) learning problem and accordingly introduce a novel PU learning algorithm to perform the task. In our proposed method, the labeled entity words form the positive (P) data and the rest form the unlabeled (U) data for PU learning. We prove that the proposed algorithm can unbiasedly and consistently estimate the task loss as if there were fully labeled data, under the assumption that the labeled P data can reveal the data distribution of class P. Of course, since the words labeled by the dictionary only cover part of the entities, they cannot fully reveal the data distribution of entity words. To deal with this problem, we propose an adapted method, motivated by the AdaSampling algorithm (Yang et al., 2017), to enrich the dictionary. We evaluate the effectiveness of our proposed method on four NER datasets. Experimental results show that it can even achieve performance comparable with several supervised methods, using quite simple dictionaries.
The contributions of this work can be summarized as follows: 1) We propose a novel PU learning algorithm to perform the NER task using only unlabeled data and named entity dictionaries. 2) We prove that the proposed algorithm can unbiasedly and consistently estimate the task loss as if there were fully labeled data, under the assumption that the entities matched by the dictionary can reveal the distribution of entities. 3) To make the above assumption hold as closely as possible, we propose an adapted method, motivated by the AdaSampling algorithm, to enrich the dictionary. 4) We empirically demonstrate the effectiveness of our proposed method with extensive experimental studies on four NER datasets.

Risk Minimization
Let X ∈ 𝒳 and Y ∈ 𝒴 be the input and output random variables, where 𝒳 ⊂ ℝ^d and 𝒴 = {0, 1} denote the spaces of X and Y, respectively. Let f : 𝒳 → ℝ denote a classifier. A loss function is a map ℓ : ℝ × 𝒴 → ℝ₊. Given any loss function ℓ and a classifier f, we define the ℓ-risk of f by:

$R_\ell(f) = \mathbb{E}_{X,Y}[\ell(f(X), Y)],$

where $\mathbb{E}$ denotes the expectation and its subscript indicates the random variables with respect to which the expectation is taken. In ordinary supervised learning, we estimate $R_\ell$ with the empirical risk:

$\hat{R}_\ell(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i),$

and update model parameters to learn a classifier f* that minimizes $\hat{R}_\ell$.

Unbiased Positive-Unlabeled learning
In PU learning, the risk is decomposed by conditioning on the class:

$R_\ell(f) = \pi_p \mathbb{E}_{X|Y=1}[\ell(f(X), 1)] + \pi_n \mathbb{E}_{X|Y=0}[\ell(f(X), 0)],$

where $\pi_p = P(Y = 1)$ and $\pi_n = P(Y = 0)$. Note that $\mathbb{E}_{X|Y=1}[\ell(f(X), 1)]$ can be effectively estimated using positive data. Therefore, the main problem of PU learning is how to estimate $\mathbb{E}_{X|Y=0}[\ell(f(X), 0)]$ without using negatively labeled data. To this end, it is further formulated as:

$\pi_n \mathbb{E}_{X|Y=0}[\ell(f(X), 0)] = \mathbb{E}_{X}[\ell(f(X), 0)] - \pi_p \mathbb{E}_{X|Y=1}[\ell(f(X), 0)].$
According to this equation, we can now estimate $\pi_n \mathbb{E}_{X|Y=0}[\ell(f(X), 0)]$ using only unlabeled data and positive data. Thus, $R_\ell$ can be effectively estimated using only unlabeled data and positive data. In summary, $R_\ell$ can be unbiasedly estimated by:

$\hat{R}_\ell(f) = \pi_p \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(f(x^p_i), 1) + \frac{1}{n_u}\sum_{i=1}^{n_u} \ell(f(x^u_i), 0) - \pi_p \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(f(x^p_i), 0),$

where $x^u_i$ and $x^p_i$ denote an unlabeled and a positive example, respectively, and $n_u$ and $n_p$ denote the numbers of unlabeled and positive examples, respectively.
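As a concrete illustration, the unbiased estimator above can be sketched in a few lines of Python. This is a minimal sketch on toy inputs (classifier outputs are assumed to be precomputed scores in [0, 1]), not the paper's implementation:

```python
def mae(pred, y):
    # mean absolute error for a single prediction; bounded in [0, 1]
    # when pred lies in [0, 1]
    return abs(y - pred)

def upu_risk(loss, pos_scores, unl_scores, pi_p):
    """Unbiased PU estimate of the risk (uPU).

    pos_scores / unl_scores: classifier outputs f(x) for the labeled
    positive and the unlabeled examples; pi_p: the positive class prior.
    """
    pos_risk = pi_p * sum(loss(s, 1) for s in pos_scores) / len(pos_scores)
    # estimate pi_n * E_{X|Y=0}[loss(f(X), 0)] from unlabeled and positive data
    neg_risk = (sum(loss(s, 0) for s in unl_scores) / len(unl_scores)
                - pi_p * sum(loss(s, 0) for s in pos_scores) / len(pos_scores))
    return pos_risk + neg_risk
```

Note that the negative-risk term can become negative for a flexible classifier; the bounded and non-negative variants discussed below address exactly this issue.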

Consistent Positive-Unlabeled Learning
As we know, a good estimator should be not only unbiased but also consistent. The above derivation has shown that $\hat{R}_\ell$ is an unbiased estimator of $R_\ell$. In this section, we show that $\hat{R}_\ell$ is also a consistent estimator of $R_\ell$ when the loss function $\ell$ is upper bounded. We argue that this is the first work to give such a proof, which is summarized in the following theorem:

Theorem 1. If $\ell$ is bounded by $[0, M]$, then for any $\epsilon > 0$, the probability that the deviation $|\hat{R}_\ell(f) - R_\ell(f)|$ exceeds $\epsilon$, uniformly over the hypothesis space, vanishes at a rate governed by $B = L_M M + C_0$. Here, $L_M$ denotes a Lipschitz constant such that $L_M > \partial \ell(w, y) / \partial w$ for all $w \in \mathbb{R}$, $C_0 = \max_y \ell(0, y)$, and $\mathcal{H}$ denotes a Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950). $\mathcal{H}_R$ is the hypothesis space consisting, for each given $R > 0$, of the ball of radius $R$ in $\mathcal{H}$. $\mathcal{N}(\epsilon)$ denotes the covering number of $\mathcal{H}_R$ following Theorem C of Cucker and Smale (2002).
Proof. Proof appears in Appendix A.

Remark 1. Let us intuitively consider what happens if $\ell$ is not upper bounded (e.g., the cross-entropy loss function). Suppose that there is a positive example $x^p_i$ not occurring in the unlabeled data set. Then its contribution to $\hat{R}_\ell$ is $\frac{\pi_p}{n_p}[\ell(f(x^p_i), 1) - \ell(f(x^p_i), 0)]$, which can be made arbitrarily negative by a classifier that fits $x^p_i$ with ever higher confidence when $\ell$ is unbounded. From this analysis, we can expect that, when using an unbounded loss function and a flexible classifier, $\hat{R}_\ell$ will dramatically decrease to a value far below zero. Therefore, in this work, we force $\ell$ to be bounded by replacing the commonly used, unbounded cross-entropy loss function with the mean absolute error, resulting in a bounded unbiased positive-unlabeled learning (buPU) algorithm. This slightly differs from the setting of uPU, which only requires $\ell$ to be symmetric.
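A minimal sketch of how the bounded loss and the non-negative correction change the estimator (toy inputs; the clamped variant follows Kiryo et al. (2017), which is adopted next):

```python
def mae(pred, y):
    # mean absolute error; bounded in [0, 1] for pred in [0, 1]
    return abs(y - pred)

def bnpu_risk(pos_scores, unl_scores, pi_p):
    """Bounded non-negative PU risk (bnPU) on precomputed scores f(x)."""
    pos_risk = pi_p * sum(mae(s, 1) for s in pos_scores) / len(pos_scores)
    neg_risk = (sum(mae(s, 0) for s in unl_scores) / len(unl_scores)
                - pi_p * sum(mae(s, 0) for s in pos_scores) / len(pos_scores))
    # non-negative constraint: the true negative-class risk cannot be < 0
    return pos_risk + max(0.0, neg_risk)
```

When the estimated negative-class risk would be negative (as in the overfitting scenario of Remark 1), the clamp keeps the total risk from collapsing below zero.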
We further combine buPU with the non-negative constraint proposed by Kiryo et al. (2017), which has proved effective in alleviating overfitting, obtaining a bounded non-negative positive-unlabeled learning (bnPU) algorithm:

$\hat{R}_\ell(f) = \pi_p \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(f(x^p_i), 1) + \max\!\left(0,\; \frac{1}{n_u}\sum_{i=1}^{n_u} \ell(f(x^u_i), 0) - \pi_p \frac{1}{n_p}\sum_{i=1}^{n_p} \ell(f(x^p_i), 0)\right).$

Dictionary-based NER with PU Learning

In the following, we first define some notation used throughout this work and describe the label assignment mechanism used in our method. Then, we describe the data labeling process using the dictionary. After that, we show the details of building the PU classifier, including word representation, loss definition, and label inference. Finally, we present the adapted method for enriching the dictionary.

Notations
Let W ∈ V and S = {W} ∈ 𝒮 be the word-level and sentence-level input random variables, where V is the word vocabulary and 𝒮 is the sentence space. D_e denotes the entity dictionary for a given entity type, and D = {s_1, · · · , s_N} ⊆ 𝒮 denotes the unlabeled dataset. We denote by D+ the set of entity words labeled by D_e, and by D_u the remaining unlabeled words.

Label Assignment Mechanism
In this work, we apply a binary label assignment mechanism for the NER task instead of the prevalent BIO or BIOES mechanism: entity words are mapped to the positive class and non-entity words are mapped to the negative class. This is because, as discussed in §1, the dictionary cannot guarantee to cover all entity words within a sentence. It may only label the beginning (B), an internal (I), or the last (E) word of an entity. Therefore, we cannot distinguish which type, B, I, or E, a labeled entity word belongs to. Take the data labeling results depicted in Figure 1 as an example. With the dictionary, we know that "Joe" is an entity word. However, we cannot know that it is the beginning of the person name "Joe Frazier".

Algorithm 1: Data Labeling using the Dictionary
1: Input: named entity dictionary D_e, a sentence s = {w_1, · · · , w_n}, and the context size k
2: Result: partially labeled sentence
3: Initialize: i ← 1
4: while i ≤ n do
5:   find the largest j ≤ k such that {w_i, · · · , w_min(i+j−1, n)} matches an entry of D_e
6:   if such a j exists, label {w_i, · · · , w_min(i+j−1, n)} as the positive class and set i ← i + j
7:   else set i ← i + 1
8: end while

Data Labeling using the Dictionary
To obtain D + , we use the maximum matching algorithm (Liu et al., 1994;Xue, 2003) to perform data labeling with D e . It is a greedy search routine that walks through a sentence trying to find the longest string, starting from a given point in the sentence, that matches with an entry in the dictionary. The general process of this algorithm is summarized in Alg. 1. In our experiments, we intuitively set the context size k = 4.
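The maximum matching routine of Alg. 1 can be sketched as follows. This is a minimal illustration (the function name, list-based inputs, and whitespace joining of multi-word entries are assumptions, not the paper's code):

```python
def max_matching_label(words, entity_dict, k=4):
    """Greedy longest-match labeling: 1 = entity word, 0 = unlabeled."""
    labels = [0] * len(words)
    i = 0
    while i < len(words):
        matched = 0
        # try the longest candidate first, up to the context size k
        for j in range(min(k, len(words) - i), 0, -1):
            if " ".join(words[i:i + j]) in entity_dict:
                matched = j
                break
        if matched:
            for t in range(i, i + matched):
                labels[t] = 1
            i += matched  # jump past the matched entity
        else:
            i += 1
    return labels
```

For instance, a dictionary containing only the entry "Joe Frazier" yields [1, 1, 0, 0] on the words ["Joe", "Frazier", "beat", "Bobick"], leaving "Bobick" unlabeled.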

Build PU Learning Classifier
In this work, we use a neural-network-based architecture to implement the classifier f , and this architecture is shared by different entity types.
Word Representation. The context-independent word representation consists of three parts of features, i.e., the character sequence representation e_c(w), the word embedding e_w(w), and some human-designed features on the word surface, e_h(w).
For the character-level representation e_c(w) of w, we use a one-layer convolutional network model (Kim, 2014) on its character sequence {c_1, c_2, · · · , c_m} ∈ V_c, where V_c is the character vocabulary. Each character c is represented by its embedding v(c), looked up from W_c, where W_c denotes a character embedding lookup table. The one-layer convolutional network is then applied to {v(c_1), v(c_2), · · · , v(c_m)} to obtain e_c(w).
For the word-level representation e_w(w) of w, we introduce a unique dense vector for w, which is initialized with Stanford's GloVe word embeddings (Pennington et al., 2014) and fine-tuned during model training.
For the human-designed features e_h(w) of w, we introduce a set of binary feature indicators. These indicators are designed following the options proposed by Collobert et al. (2011): allCaps, upperInitial, lowercase, mixedCaps, noinfo. If a feature is activated, its corresponding indicator is set to 1, otherwise 0. This way, the model keeps the capitalization information that is erased during lookup of the word embedding.
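A small sketch of such indicators (the exact activation rules and their precedence here are assumptions; only the five option names come from Collobert et al. (2011)):

```python
def cap_feature(word):
    """One-hot indicator over {allCaps, upperInitial, lowercase,
    mixedCaps, noinfo}; the precedence order is an assumption."""
    if word.isupper():
        idx = 0                       # allCaps, e.g. "USA"
    elif word[:1].isupper():
        idx = 1                       # upperInitial, e.g. "Joe"
    elif word.islower():
        idx = 2                       # lowercase, e.g. "beat"
    elif any(c.isupper() for c in word):
        idx = 3                       # mixedCaps, e.g. "iPhone"
    else:
        idx = 4                       # noinfo, e.g. "1996"
    vec = [0] * 5
    vec[idx] = 1
    return vec
```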
The final context-independent word representation e(w) ∈ ℝ^{k_w} of w is obtained by concatenating these three parts of features:

$e(w) = e_c(w) \oplus e_w(w) \oplus e_h(w),$

where ⊕ denotes the concatenation operation. Based on this representation, we apply a bidirectional LSTM (BiLSTM) network (Huang et al., 2015), taking e(w_t), w_t ∈ s, as the step input, to model the context information of w_t given the sentence s. The hidden states of the forward and backward LSTMs at step t are concatenated:

$e(w_t|s) = \overrightarrow{h}_t \oplus \overleftarrow{h}_t,$

to form the representation of w_t given s.
Loss Definition. Given the word representation e(w|s) of w conditional on s, its probability of being predicted as the positive class is modeled by:

$f(w|s) = \sigma(\mathbf{w}_p^\top e(w|s) + b),$

where σ denotes the sigmoid function, $\mathbf{w}_p$ is a trainable parameter vector, and b is the bias term. The prediction risk on this word given label y is defined with the mean absolute error:

$\ell(f(w|s), y) = |y - f(w|s)|.$

Note that ℓ(f(w|s), y) ∈ [0, 1) is upper bounded. The empirical training loss is defined following bnPU over D+ and D_u, where n_p = |D+|, n_u = |D_u|, and π_p is the ratio of entity words within D_u.
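The scoring and per-word loss above amount to the following few lines (plain-Python sketch with toy dimensions; in the model, e(w|s) is produced by the BiLSTM):

```python
import math

def predict(w_p, b, e_ws):
    """f(w|s) = sigmoid(w_p . e(w|s) + b)."""
    z = sum(wi * ei for wi, ei in zip(w_p, e_ws)) + b
    return 1.0 / (1.0 + math.exp(-z))

def word_loss(prob, y):
    """Mean absolute error on a single word; bounded in [0, 1)."""
    return abs(y - prob)
```

Because the sigmoid output lies in (0, 1), the per-word loss is automatically bounded, which is exactly the property the consistency proof requires.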
In addition, during our experiments, we found that, due to the class imbalance problem (π_p is very small), f tends to predict all instances as the negative class, achieving high accuracy but a low F1 score on the positive class. This is unacceptable for NER. Therefore, we introduce a class weight γ for the positive class and accordingly redefine the training loss by weighting the risk on positive-class instances with γ.

Label Inference. Once the PU classifier has been trained, we use it to perform label prediction. However, since we build a distinct classifier for each entity type, a word may be predicted as the positive class by multiple classifiers. To resolve the conflict, we choose the type with the highest prediction probability (evaluated by f(w|s)); the predictions of the classifiers of the other types are reset to 0. At inference time, we first resolve type conflicts using this method. After that, consecutive words predicted as the positive class by the classifier of the same type are treated as an entity. Specifically, for the sequence s = {w_1, w_2, w_3, w_4, w_5}, if its predicted labels by the classifier of a given type are L = {1, 1, 0, 0, 1}, then we treat {w_1, w_2} and {w_5} as two entities of that type.
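The decoding of consecutive positive labels into entity spans can be sketched as follows (the function name and the half-open span convention are assumptions):

```python
def decode_entities(labels):
    """Turn a 0/1 label sequence into a list of [start, end) spans."""
    spans, start = [], None
    for i, y in enumerate(labels):
        if y == 1 and start is None:
            start = i                     # an entity begins here
        elif y == 0 and start is not None:
            spans.append((start, i))      # the entity just ended
            start = None
    if start is not None:                 # an entity running to the end
        spans.append((start, len(labels)))
    return spans
```

On the example above, decode_entities([1, 1, 0, 0, 1]) returns [(0, 2), (4, 5)], i.e., the entities {w_1, w_2} and {w_5}.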

Adapted PU Learning for NER
In PU learning, we use the empirical risk on the labeled positive data, $\frac{1}{n_p}\sum_{i=1}^{n_p} \ell(f(x^p_i), 1)$, to estimate the expected risk on positive data. This requires that the positive examples $x^p_i$ are drawn i.i.d. from the distribution P(X|Y = 1). This requirement is usually hard to satisfy when using a simple dictionary to perform data labeling.
To alleviate this problem, we propose an adapted method motivated by the AdaSampling (Yang et al., 2017) algorithm. The key idea of the proposed method is to adaptively enrich the named entity dictionary. Specifically, we first train a PU learning classifier f and use it to label the unlabeled dataset. Based on the predicted labels, we extract all of the predicted entities. If a predicted entity occurs more than k times and all of its occurrences within the unlabeled dataset are predicted as entities, we add it to the entity dictionary for the next iteration. This process iterates until the dictionary no longer changes.
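The dictionary update rule can be sketched as follows (the names and the dict-based bookkeeping are assumptions; counting occurrences over the corpus is left to the caller):

```python
def enrich_dictionary(predicted, total, dictionary, k=5):
    """Add a predicted entity when it occurs more than k times and
    every one of its corpus occurrences was predicted as an entity.

    predicted: entity string -> occurrences predicted as an entity
    total:     entity string -> total occurrences in the corpus
    """
    for entity, n_pred in predicted.items():
        if n_pred > k and n_pred == total.get(entity, 0):
            dictionary.add(entity)
    return dictionary
```

Requiring that all occurrences are predicted positive filters out ambiguous strings that are entities only in some contexts.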

Experiments
In this section, we empirically study:
• the general performance of our proposed method using simple dictionaries;
• the influence of the unlabeled data size;
• the influence of dictionary quality, such as size, data labeling precision, and recall;
• the influence of the estimation of π_p.

Compared Methods
We compare our proposed Adapted PU learning (AdaPU) algorithm with five indispensable baselines. The first is the dictionary matching method, which we call Matching. It directly uses the constructed named entity dictionary to label the testing set, as illustrated in Alg. 1. The second is a supervised method that uses the same architecture as f but trains on fine-grained annotations (fully labeled D_u and D+). In addition, it applies the BIOES label assignment mechanism for model training.
We call this baseline BiLSTM. The third is the uPU algorithm, which implements ℓ with the cross-entropy loss. The fourth is the bounded uPU (buPU) algorithm, which implements ℓ with the mean absolute error. Compared with AdaPU, it applies neither the non-negative constraint nor dictionary adaptation. The last is the bounded non-negative PU learning (bnPU) algorithm, which, compared with AdaPU, does not perform dictionary adaptation. Additionally, we compare our method with several representative supervised methods that have achieved state-of-the-art performance on NER: Stanford NER (MEMM) (McCallum et al., 2000), a maximum-entropy-Markov-model-based method; Stanford NER (CRF) (Finkel et al., 2005), a conditional-random-field-based method; and BiLSTM+CRF (Huang et al., 2015), the same neural-network-based method as the BiLSTM baseline but with an additional CRF layer.

Datasets
CoNLL (en). The CoNLL 2003 NER Shared Task dataset in English (Tjong Kim Sang and De Meulder, 2003).

Twitter. A dataset collected from Twitter. It contains 4,000 tweets for training and 3,257 tweets for testing. Every tweet contains both textual and visual information. In this work, we only used the textual information to perform NER, and we only performed entity detection on PER, LOC, and ORG.

For the proposed method and the PU-learning-based baselines, we used the training set of each dataset as D. Note that we did not use the label information of the training sets when training these models.

Build Named Entity Dictionary
For the CoNLL (en), MUC, and Twitter datasets, we collected the 2,000 most popular English names in England and Wales in 2015 from the ONS to construct the PER dictionary. For LOC, we collected the names of countries and their two most popular cities to construct the dictionary. For MISC, we turned country names into their adjective forms, for example, England → English and China → Chinese, and used the resulting forms to construct the dictionary. For ORG, we collected names of popular organizations and their corresponding abbreviations from Wikipedia to construct the dictionary. We also added the names of some international companies, such as Microsoft, Google, and Facebook, to the dictionary. In addition, we added some common words occurring in organization names, such as "Conference", "Cooperation", and "Commission", to the dictionary.
For CoNLL (sp), we used the DBpedia query editor to select the 2,000 most common names of people born in Spain to construct the PER dictionary. We further used Google Translate to translate the English LOC, ORG, and MISC dictionaries into Spanish.
The resulting named entity dictionaries contain 2,000 person names, 748 location names, 353 organization names, and 104 MISC entities. Table 1 lists statistics of the data labeling results obtained with these dictionaries using Alg. 1. From the table, we can see that the precision of the data labeling is acceptable but the recall is quite poor. This is expected and is a typical problem of methods that use only dictionaries to perform NER.

Estimate π p
Before discussing the estimation of π_p defined in Eq. (12), let us first look at some statistics of the four studied datasets. Table 2 lists the true value of π_p = (# of entity words) / (# of words of the training set) for the different entity types over each dataset. From the table, we can see that the variation of π_p across datasets is quite small. This motivates us to use the value of π_p obtained from an existing labeled dataset as an initialization; the labeled dataset may be from another domain or be out of date. In this work, we initially set π_p = 0.04, 0.04, 0.05, and 0.03 for PER, LOC, ORG, and MISC, respectively. Starting from these values, we trained the proposed model and used it to perform prediction on the unlabeled dataset. Based on the predicted results, we re-estimated the value of π_p. The resulting values are listed in Table 2 and were used throughout our experiments unless otherwise specified.
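The estimate itself is just a ratio; re-estimation applies the same ratio to the model's predicted labels instead of gold labels (minimal sketch; the helper name is an assumption):

```python
def estimate_pi_p(word_labels):
    """pi_p = (# of entity words) / (# of words); word_labels holds
    one 0/1 label per word, either gold or model-predicted."""
    return sum(word_labels) / len(word_labels)
```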

Results
Following the protocol of most previous works, we apply the entity-level (exact entity match) F1 to evaluate model performance.
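Entity-level F1 counts an entity as correct only on an exact span (and type) match; a minimal sketch over span sets (the (start, end, type) tuple representation is an assumption):

```python
def entity_f1(gold_spans, pred_spans):
    """Exact-match entity-level F1 over sets of (start, end, type) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                 # exactly matched entities
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting only one of two gold PER entities gives precision 1.0, recall 0.5, and F1 = 2/3.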
General Performance. Table 3 shows model performance by entity type and the overall performance on the four tested datasets. From the table, we can observe: 1) The performance of the Matching model is quite poor compared to the other models. We found that this mainly results from low recall values, which accords with our discussion in §1 and shows its inapplicability with such simple dictionaries. 2) The PU-learning-based methods achieve significant improvements over Matching on all datasets. This demonstrates the effectiveness of the PU learning framework for NER in the studied setting. 3) buPU greatly outperforms uPU. This verifies our analysis in §2.3 of the necessity of making ℓ upper bounded. 4) bnPU slightly outperforms buPU on most datasets and entity types. This verifies the effectiveness of the non-negative constraint proposed by Kiryo et al. (2017). 5) The proposed AdaPU model achieves further improvement over bnPU, and it even achieves results comparable with some supervised methods, especially for the PER type. This verifies the effectiveness of our proposed method for enriching the named entity dictionaries.

Influence of Unlabeled Data Size. We further study the influence of the unlabeled data size on our proposed method. To perform the study, we used 20%, 40%, 60%, 80%, 100%, and 300% (using additional unlabeled data) of the training set of CoNLL (en) to train AdaPU, respectively. Figure 2 depicts the results of this study on PER, LOC, and ORG. From the figure, we can see that increasing the size of the training data will, in general, improve the performance of AdaPU, but the improvements are diminishing. Our explanation of this phenomenon is that when the data size exceeds a threshold, the number of unique patterns becomes a sublinear function of the data size. This is supported by the observation from the figure that the improvement is marginal after introducing the additional training data.
Influence of Dictionary. We then study the influence of the dictionary on our proposed model. To this end, we extended the dictionary with DBpedia using the same protocol proposed by Chiu and Nichols (2016). Statistics of the resulting dictionary are listed in Table 4, and model performance using this dictionary is listed in Table 5. A noteworthy observation is that, on LOC, the performance decreases considerably when using the extended dictionary. We turn to Table 4 for the explanation. We can see from the table that, on LOC, the data labeling precision dropped by about 13 points (85.07 → 71.77) with the extended dictionary. This means that the extended dictionary introduced more false-positive examples into the PU learning and made the empirical risk estimate deviate more from its expectation.
Influence of π_p Value. Table 6 lists the performance of AdaPU when using the true or the estimated value of π_p as listed in Table 2. From the table, we can see that the proposed model using the estimated π_p only slightly underperforms the one using the true value of π_p. This shows the robustness of the proposed model to small variations of π_p and verifies the effectiveness of the π_p estimation method.

Related Work
Positive-unlabeled (PU) learning (Li and Liu, 2005) aims to train a classifier using only labeled positive examples and a set of unlabeled data, which contains both positive and negative examples. Recently, PU learning has been used in many applications, e.g., text classification (Li and Liu, 2003), matrix completion (Hsieh et al., 2015), and sequential data (Nguyen et al., 2011). The main difference between PU learning and semi-supervised learning is that, in semi-supervised learning, there is labeled data from all classes, while in PU learning, the labeled data only contains examples of a single class.
AdaSampling (Yang et al., 2017) is a self-training-based approach designed for PU learning, which utilizes predictions of the model to iteratively update the training data. Generally speaking, it initially treats all unlabeled instances as negative examples. Then, based on the model trained in the last iteration, it generates the probability p(y = 0|x^u_i) of an unlabeled example x^u_i being a negative one. This value, in turn, determines the probability of x^u_i being selected as a negative example for model training in the next iteration. This process iterates until an acceptable result is obtained.

Conclusion
In this work, we introduce a novel PU learning algorithm to perform the NER task using only unlabeled data and named entity dictionaries. We prove that this algorithm can unbiasedly and consistently estimate the task loss as if there were fully labeled data, and we argue that it can greatly reduce the requirement on the sizes of the dictionaries. Extensive experimental studies on four NER datasets validate its effectiveness.