Reinforcement-based denoising of distantly supervised NER with partial annotation

Existing named entity recognition (NER) systems rely on large amounts of human-labeled data for supervision. However, obtaining large-scale annotated data is challenging, particularly in specific domains such as healthcare and e-commerce. Given the availability of domain-specific knowledge resources (e.g., ontologies, dictionaries), distant supervision is a solution for generating automatically labeled training data and reducing human effort. The outcome of distant supervision for NER, however, is often noisy: false positive and false negative instances are the main issues that reduce performance on this kind of auto-generated data. In this paper, we explore distant supervision in a supervised setup. We adopt a technique of partial annotation to address false negative cases and implement a reinforcement learning strategy with a neural network policy to identify false positive instances. Our results establish a new state of the art on four benchmark datasets taken from different domains and different languages. We then go on to show that our model reduces the amount of manually annotated data required to perform NER in a new domain.


Introduction
Named Entity Recognition (NER) is one of the primary tasks in information extraction pipelines (Ma and Hovy, 2016; Lample et al., 2016; Peters et al., 2018; Akbik et al., 2018). Traditional studies apply statistical techniques such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) using large numbers of features and extra resources (Ratinov and Roth, 2009; Passos et al., 2014). In recent years, deep learning approaches have achieved state-of-the-art results on the task without any feature engineering (Ma and Hovy, 2016; Lample et al., 2016). Most of these works assume that a certain amount of annotated sentences is available in the training phase. However, availability of large amounts of labeled data is problematic, particularly in specific domains. Distant supervision was proposed by Mintz et al. (2009) to address the challenge of obtaining training data for new domains using existing knowledge resources (dictionaries, ontologies). It has previously been successfully applied to tasks like relation extraction (Riedel et al., 2010; Augenstein et al., 2014) and entity recognition (Fries et al., 2017; Shang et al., 2018b). For the task of NER, it identifies entity mentions if they exist in the knowledge base (e.g., a domain-specific dictionary, glossary or ontology) and assigns the corresponding type according to the knowledge base.
However, distant supervision approaches encounter two main limitations. First, due to the limited coverage of the knowledge resources, unmatched tokens result in False Negatives (FNs). Second, since simple string matching is employed to detect entity mentions, ambiguity in the knowledge resource may lead to False Positives (FPs). For the FN problem, Tsuboi et al. (2008) incorporate partial annotations into CRFs and propose a parameter estimation method for CRFs using partially annotated corpora (hereinafter referred to as Partial-CRF). To reduce the negative impact of FPs for relation extraction, Qin et al. (2018) propose a deep reinforcement learning (RL) agent whose goal is to decide whether to remove or keep a distantly supervised instance.
In this paper we make the following contributions:
• We combine the Partial-CRF approach with performance-driven, policy-based reinforcement learning to clean the noisy, distantly supervised data for NER in a pre-processing step.
• We formulate the reward function in RL based on the change in the performance of the NER module where the policy of RL is trained in an unsupervised manner by interaction with the environment.
• We show that our approach can boost the performance of the neural NER system on four datasets from different domains and for two different languages (English and Chinese).

Related work
The task of NER has been widely studied in the last decade and is generally considered a sequence labeling problem. Using neural techniques, many studies report state-of-the-art results on this type of sequence labeling task (Lample et al., 2016; Ma and Hovy, 2016). These studies utilize character and/or word embeddings to encode sentence-level features automatically.
Recently, the use of contextualized word representations (Peters et al., 2018; Akbik et al., 2018) has significantly improved the state-of-the-art results on many sequence labeling tasks, including NER benchmarks.
In the supervised NER paradigm, the task suffers from a lack of large-scale labeled training data when moving to a new domain or new language. To alleviate the reliance on human-annotated data, distant supervision was proposed by Mintz et al. (2009) to generate annotated data by heuristically aligning text to an existing domain-specific knowledge resource. It is widely used for relation extraction (Mintz et al., 2009; Riedel et al., 2010; Augenstein et al., 2014) and has lately attracted attention for NER as well (Ren et al., 2015; Fries et al., 2017; Shang et al., 2018b). Shang et al. (2018b) present the AutoNER model, which employs a new type of tagging scheme (i.e., Tie or Break) rather than common ones (i.e., IOB, IOBES), works without any CRF layer, and achieves state-of-the-art unsupervised F1 scores on several benchmark datasets. Crucially, they employ a set of high-quality phrases in distant supervision, obtained using a phrase mining technique (Shang et al., 2018a), to reduce the false-negative labels. Feng et al. (2018) and  make use of reinforcement learning to tackle false positives in distantly supervised relation classification and NER, respectively. Similar to our work,  address the noisy automatic annotation in NER by using partial annotation learning and reinforcement learning. However, unlike our approach, they train the NER model and the reinforcement learning model jointly, calculating the reward based on the loss of the NER model, whereas we employ the RL module as a pre-processing/filtering step and incorporate the previous state to satisfy a Markov decision process (MDP).  evaluate only on a Chinese dataset, whereas we apply our model to English datasets as well. Furthermore, after running their code, we observe that to reach the results reported in their paper on the e-commerce dataset, the model needs more than 500 epochs, and the reinforcement learning component removes all the distantly annotated sentences after some epochs.
This means that after some epochs the code effectively runs only the baseline NER model on the annotated dataset and ignores the RL module, since no distantly annotated sentences remain. Their two datasets are included in our experiments in order to compare to their results. Qin et al. (2018) explore deep reinforcement learning as a false positive removal tool for distantly supervised relation extraction. Here, we adapt their approach to the NER task. Unlike Qin et al. (2018), however, we learn the policy agent in an unsupervised manner, where the parameters are learned by interaction with the environment.

Model
We implement Partial-CRF together with a performance-driven, policy-based reinforcement learning method to detect FNs and FPs in distantly supervised NER. In contrast to a previous study that applied RL to NER , we consider the RL agent as a pre-processing step to clean FPs from the noisy dataset. Furthermore, our RL agent is rewarded based on the change in the performance of the NER module, and the process is modeled as a Markov decision process (MDP).
Algorithm 1 describes the overall training procedure for our model; in the following, we detail its various components.
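The three-step procedure of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function parameters (pretrain, apply_rl_filter, train) are hypothetical stand-ins for the NER+PA pre-training, the RL-based filtering, and the final re-training.

```python
# Hypothetical sketch of the overall training procedure (Algorithm 1).
# pretrain / apply_rl_filter / train are illustrative stand-ins.
def train_ner_pa_rl(annotated, distant, pretrain, apply_rl_filter, train):
    """annotated: gold sentences A; distant: distantly labeled sentences D."""
    model = pretrain(annotated + distant)      # 1. pre-train NER w/ Partial-CRF on A+D
    cleaned = apply_rl_filter(model, distant)  # 2. RL agent removes false-positive sentences from D
    return train(annotated + cleaned)          # 3. re-train NER+PA on A + cleaned D
```

The key design choice is that the RL filter only ever touches the distantly labeled portion D; the gold data A always survives into the final training set.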

Baseline NER model
The goal of NER is to identify text spans that denote named entities and assign them to predefined categories. These categories vary depending on the domain: in the general domain, they include organization, person and location names; in the bio-medical domain, they include protein, drug, gene and disease names. Formally, given a sentence of words X = {x_1, x_2, ..., x_n}, NER assigns a unique tag to each word, y = {y_1, y_2, ..., y_n}, from a predefined set of categories y_i ∈ Φ, |Φ| = k.

Figure 1: Annotation of a distantly labeled example in Partial-CRF based on the IOBES scheme. Words with green tags are found in the dictionary and assigned the corresponding entity types; words not found in the dictionary are assigned all possible tags (yellow).

Algorithm 1: Overall Training Procedure NER+PA+RL
Input: Human Annotated data (A) + Distantly Labeled data (D)
1 Pre-train NER w/ Partial-CRF (NER+PA) on A+D
2 Apply RL on D
3 Train NER+PA using A + cleaned D

Our baseline model is a BiLSTM-CRF architecture (Lample et al., 2016; Habibi et al., 2017). The first layer takes character embeddings for each word sequence and merges the output vector with the word embedding vector to feed into a second BiLSTM layer. The CRF layer sits on top of the last layer to model the dependencies across output tags and locates the best tag sequence by maximizing the log-probability

log p(y|X) = s(X, y) − log Σ_{y′ ∈ Y} e^{s(X, y′)},    (1)

where the sentence score is

s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}.    (2)

Here P is an n × k output tensor of a linear encoder applied to the last BiLSTM layer, where P_{i,j} corresponds to the score of the j-th tag of the i-th word in the sentence, and T is a (k + 2) × (k + 2) transition tensor which represents the transition score from the i-th tag to the j-th tag. Two additional tags, <BOS> and <EOS>, are added at the start and end of a sequence, respectively. To infer the final tag sequence, the Viterbi algorithm is employed in the CRF model.
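The CRF sentence score, emission terms from the linear encoder plus transition terms between adjacent tags (including the <BOS>/<EOS> boundary tags), can be sketched as below. This is a minimal illustration with plain lists, not the model's actual tensor implementation; P is taken as n × k (word-major) so that P[i][j] scores the j-th tag of the i-th word.

```python
# Minimal sketch of the CRF sentence score s(X, y).
# P: n x k emission scores (word i, tag j); T: (k+2) x (k+2) transitions;
# y: list of tag ids; bos/eos: ids of the <BOS>/<EOS> boundary tags.
def sentence_score(P, T, y, bos, eos):
    tags = [bos] + list(y) + [eos]
    emit = sum(P[i][t] for i, t in enumerate(y))                 # emission terms
    trans = sum(T[tags[i]][tags[i + 1]]                          # transition terms,
                for i in range(len(tags) - 1))                   # incl. boundary tags
    return emit + trans
```

In the full model this score feeds a log-sum-exp normalization over all tag sequences; Viterbi decoding maximizes the same score at inference time.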

Partial-CRF layer (PA)
As mentioned above, FN instances constitute a common problem in distantly annotated datasets. They are caused by the limited coverage of the knowledge resource: entity mentions that are not found in the resource are subsequently labeled as non-entities ('O'). We follow Tsuboi et al. (2008) and treat the result of distant supervision as a partially annotated dataset in which non-entity text spans are annotated with every possible tag. Figure 1 illustrates the annotation of distantly supervised examples using the IOBES labeling scheme that we employ. Let Y_L denote all the possible tag sequences for a distantly supervised sentence X. Then, the conditional probability of the subset Y_L given X is

p(Y_L|X) = Σ_{y ∈ Y_L} p(y|X).    (3)

Extending the original equation of the CRF layer (Eq. 1) provides the log-probability for a distantly supervised instance:

log p(Y_L|X) = log Σ_{y ∈ Y_L} e^{s(X, y)} − log Σ_{y′ ∈ Y} e^{s(X, y′)}.    (4)
Using partial annotation, non-entity text spans are annotated with every possible tag. This gives non-entity text spans a chance to be considered and scored properly in the updated version of the CRF (Partial-CRF) and to become part of the optimal tag sequence.
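The Partial-CRF objective marginalizes over all tag sequences consistent with the partial annotation. The brute-force sketch below makes the math concrete by enumerating sequences; real implementations use a constrained forward algorithm instead, so this is for illustration only. It reuses a local score function matching the CRF score described above.

```python
import itertools
import math

# Brute-force sketch of the Partial-CRF log-probability log p(Y_L | X):
# marginalize over all tag sequences consistent with the partial annotation
# (one allowed-tag set per token), normalized over all k^n sequences.
def partial_crf_log_prob(P, T, allowed, bos, eos):
    """P: n x k emissions; T: transitions; allowed: per-token iterables of tag ids."""
    n, k = len(P), len(P[0])

    def score(y):  # emission + transition score s(X, y), with boundary tags
        tags = [bos] + list(y) + [eos]
        return sum(P[i][t] for i, t in enumerate(y)) + \
               sum(T[a][b] for a, b in zip(tags, tags[1:]))

    num = sum(math.exp(score(y)) for y in itertools.product(*allowed))
    den = sum(math.exp(score(y)) for y in itertools.product(range(k), repeat=n))
    return math.log(num) - math.log(den)
```

When every token allows every tag, the numerator equals the denominator and the log-probability is 0, which is why unconstrained partial annotation alone carries no signal; constraining even one token makes the objective informative.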

Reinforcement Learning for denoising
The RL agent is designed to determine whether a distantly supervised instance is a true positive or not. There are two main components in RL: I) the environment and II) a policy-based agent. Following Qin et al. (2018), we model the environment as a Markov Decision Process (MDP), where we add information from the previous state to the current state. The policy-based agent is formulated based on the Policy Gradient Algorithm (Sutton et al., 1999), where we update the policy model by computing the reward after finishing the selection process for the whole training set. Algorithm 2 presents additional details of the RL strategy in our NER model; the following subsections describe the elements of the RL agent.

Algorithm 2: RL-based denoising (core steps)
For each distantly supervised sentence s_j:
  Randomly sample a_j ∼ π(a; θ, s_j); compute p_j = π(a; θ, s_j); save (a_j, p_j)
  if a_j == 0 then save s_j into Ψ_i
Recompute s* as the average of all s_j ∈ Ψ_i
Update the policy network (Eq. 5)
Re-train NER+PA on A + D
State: The RL agent interacts with the environment to decide about instances at the sentence level. A central component of the environment is the current and previous state in the selection process. The state S_i in step i represents the current instances as well as their label sequences. Following , the state vector S_i includes: I) the vector representation of instances before the Partial-CRF layer, where we concatenate the outputs of the first and last nodes in the BiLSTM layer of the base NER model, and II) the label sequence scores calculated by the linear encoder before the Partial-CRF model (i.e., P_{i,j} in Eq. 2). If a word is annotated with a certain label, the score is the corresponding value of that label; otherwise, the score is the mean over all possible labels of the word in the linear encoder. These two vectors are concatenated to represent the current state. To satisfy the MDP, the average vector of the instances removed in the previous step i − 1 is concatenated to the current state to form the state for the RL agent.
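The state construction just described can be sketched as follows. This is a simplified illustration with plain lists: in practice the features are fixed-size model tensors, whereas here the label-score part varies with sentence length; all names are hypothetical.

```python
# Sketch of the RL state vector for one sentence: BiLSTM endpoint features,
# per-token label scores (the gold label's score if the token is annotated,
# else the mean over all possible labels), plus the average vector of the
# sentences removed in the previous epoch (to satisfy the MDP).
def build_state(h_first, h_last, token_scores, token_labels, prev_removed_avg):
    """token_scores: n x k linear-encoder scores; token_labels: tag id or None."""
    label_feat = [
        scores[lab] if lab is not None else sum(scores) / len(scores)
        for scores, lab in zip(token_scores, token_labels)
    ]
    return h_first + h_last + label_feat + prev_removed_avg  # concatenation
```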
Reward: If the RL agent filters out the FP instances from the noisy dataset, the NER model will achieve improved performance, and accordingly the RL agent receives a positive reward; otherwise, the agent receives a negative reward. Following Qin et al. (2018), we model the reward as the change in NER performance; in particular, we adopt the F1 score and calculate the reward as the difference between the F1 scores of adjacent epochs (i.e., r_i = F_1^i − F_1^{i−1}).

Policy Network: The policy network π(a_j; θ, s_j) is a feed-forward network with two fully-connected hidden layers. It receives the state vector of each distantly supervised instance and determines whether the instance is a false positive or not. As a classifier with parameters θ, π decides an action a_j ∈ {1, 0} for each s_j ∈ S_i. The loss function for the policy network is formulated based on the policy gradient method (Sutton et al., 1999) and the REINFORCE algorithm (Williams, 1992). Since we calculate the reward as the difference between F1 scores in two contiguous epochs, the agent is compensated for the set of actions that has a direct impact on the performance of the NER model in the current epoch. In other words, the differing parts of the sets of instances removed in consecutive epochs are the reason for the change in F1 scores. Accordingly, the policy is updated using the following gradient:

θ ← θ + α Σ_j r ∇_θ log π(a_j; θ, s_j).    (5)

Following Qin et al. (2018), assuming Ψ_i is the set of instances removed in epoch i: if there is an increase in F1 at the current epoch i, we assign a positive reward to the instances that have been removed in epoch i and not in epoch i − 1, and a negative reward to the instances that have been removed in epoch i − 1 and not in the current epoch.
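The reward-assignment rule over the two removal sets can be sketched as below. This is an illustrative helper, not the authors' code: it returns per-sentence rewards that would then weight the REINFORCE update grad = Σ_j r_j ∇_θ log π(a_j; θ, s_j).

```python
# Sketch of the reward assignment (following the scheme of Qin et al., 2018):
# with r_i = F1_i - F1_{i-1}, sentences removed in epoch i but not in i-1 get
# +r_i, and sentences removed in i-1 but not in i get -r_i. Sentences removed
# in both epochs receive no reward, as they cannot explain the F1 change.
def assign_rewards(removed_now, removed_prev, f1_now, f1_prev):
    """removed_now / removed_prev: sets of sentence ids (Psi_i, Psi_{i-1})."""
    r = f1_now - f1_prev
    rewards = {}
    for s in removed_now - removed_prev:   # newly removed this epoch
        rewards[s] = r
    for s in removed_prev - removed_now:   # kept again this epoch
        rewards[s] = -r
    return rewards
```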

Experiments
We perform experiments on four benchmark datasets to compare our method to similar techniques and investigate the impact of the number of available annotated sentences for our approach.

Experimental Settings
Datasets: Our approach requires an annotated dataset, a knowledge resource and a corpus of raw text. We rely on the resources used by Shang et al. (2018b) and  for English and Chinese, respectively, as well as their train-test splits. For all datasets, we employ an IOBES labeling scheme. Below we briefly describe the datasets:
• BC5CDR is from the BioCreative V Chemical Disease Relation task and contains 12,852 'Disease' and 15,935 'Chemical' entity mentions in 1,500 articles. It is already partitioned into a training, a development and a test set. The related dictionary comes from the MeSH database and the CTD Chemical and Disease vocabularies and contains 322,882 'Disease' and 'Chemical' entities. As raw text, we use a corpus of 20,217 sentences extracted from PubMed papers and provided by Shang et al. (2018b).
• LaptopReview, containing laptop aspect terms, is taken from the SemEval 2014 Challenge, Task 4 Subtask 1 (Pontiki et al., 2014). The 3,845 review sentences are annotated with 3,012 'AspectTerm' mentions. As raw text, we extract 15,000 sentences from the Amazon laptop review dataset, which Wang et al. (2011) designed for aspect-based sentiment analysis. Shang et al. (2018b) provide a dictionary of 13,457 computer terms crawled from a public website.
• EC is a Chinese dataset from the e-commerce domain. We choose this dataset in order to compare our results to the approach of . It covers five entity types on user queries: 'Brand', 'Product', 'Model', 'Material' and 'Specification'. The corpus contains 1,200 training instances, 400 development instances and 800 test instances.  provide a dictionary of 927 entries and 2,500 sentences as raw text.
• NEWS is another Chinese dataset, in the news domain. It is annotated with the PERSON type and provided by . The NEWS dataset contains 3,000 training, 3,328 development and 3,186 test sentences.  apply distant supervision to the raw data and obtain 3,722 annotated sentences.
Pre-trained Embeddings: We employ pre-trained embeddings as initialization for the embedding layers of the LSTMs. For the bio-medical dataset, we use pre-trained 200-dimensional word vectors trained on PubMed abstracts, all PubMed Central (PMC) articles and English Wikipedia (Pyysalo et al., 2013). Standard pre-trained 100-dimensional GloVe word vectors are employed for the LaptopReview dataset. In our experiments on the EC dataset, we use the 100-dimensional Chinese character embeddings provided by  and trained on user-generated text.
Evaluation: We report the performance of the model on the test set as the micro-averaged precision, recall and F1 score. A predicted entity counts as a true positive only if both the entity boundary and the entity type are the same as in the ground truth (i.e., exact match). To alleviate randomness in the scores, we report the mean of five different runs.
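The exact-match evaluation can be sketched as below: entities are compared as (start, end, type) triples pooled over the whole test set, so both boundary and type must match. This helper is illustrative; standard toolkits such as the CoNLL evaluation script compute the same quantities.

```python
# Sketch of micro-averaged exact-match evaluation: a prediction is a true
# positive only if its span boundaries AND its type match a gold entity.
# gold_entities / pred_entities: iterables of (start, end, type) triples
# pooled over the whole test set (micro-averaging).
def micro_f1(gold_entities, pred_entities):
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```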
Model Variants: We use slightly different variants of our model for English and Chinese. For English, we follow Liu et al. (2017) in leveraging a language model to extract character-level knowledge. We keep the model parameters the same as in the original work. In order to compare to state-of-the-art models, we follow the same approach during training (i.e., merging the training and development data into a single training set for BC5CDR, and randomly selecting 20% of the training set as the development set for LaptopReview). For the Chinese EC dataset, we only use character-based LSTM and CRF layers and discard the word-based LSTM and the language model. For a fair comparison, the model parameters are set to be the same as in . For RL, the batch size, optimizer and learning rate are equal to the parameters of the related NER model. We use 100 epochs in RL and initialize the average vector of the removed sentences as an all-zero vector.
High-Quality Phrases: Considering every non-entity span (i.e., 'O' type) as a potential entity introduces noise into the Partial-CRF process. To address this issue, we use a set of high-quality multi-word and single-word phrases provided by Shang et al. (2018b) and obtained with their AutoPhrase method (Shang et al., 2018a). Note that this resource is available only for the English datasets; therefore, it is not included in the experiments on the Chinese datasets. When using these phrases, we treat the high-quality phrases as potential entities and assign all possible tags only to the token spans that match this extended list. For example, in Figure 1, only the word 'leprosy' is found in this list; therefore, we assign all possible tags to this token, and the other non-entity tokens remain 'O'.
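The phrase-restricted partial annotation can be sketched as below. This is an illustrative helper under stated assumptions: greedy longest-match against the phrase list is our assumption, not a detail given in the text, and all names are hypothetical.

```python
# Sketch of phrase-restricted partial annotation: dictionary-matched tokens
# keep their fixed tag; unmatched spans receive the full set of possible tags
# ONLY if they match a high-quality phrase; all other tokens stay 'O'.
# Greedy longest-match over the phrase list is an assumption for illustration.
def partial_annotate(tokens, dict_tags, phrases, all_tags):
    """dict_tags: per-token tags from dictionary matching ('O' if unmatched)."""
    allowed = [{t} for t in dict_tags]  # start from the distant annotation
    n, i = len(tokens), 0
    while i < n:
        matched = False
        for j in range(n, i, -1):  # try the longest span first
            span = ' '.join(tokens[i:j])
            if span in phrases and all(t == 'O' for t in dict_tags[i:j]):
                for p in range(i, j):
                    allowed[p] = set(all_tags)  # potential entity: any tag
                i, matched = j, True
                break
        if not matched:
            i += 1
    return allowed
```

For the Figure 1 example, only 'leprosy' would match the phrase list, so only that token is opened up to all possible tags while the remaining non-entity tokens stay fixed to 'O'.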

Performance Comparison
The first two rows of Table 1 report the performance comparison. We further investigate the impact of the different components of the model (Table 2) on the two English datasets via ablation experiments, where we contrast the use of partial annotation (PA) and reinforcement-based denoising (RL), with and without the high-quality phrases. The experiments confirm the effectiveness of the PA and RL modules in resolving the FN and FP issues in the distantly labeled dataset. The results also corroborate Shang et al. (2018b) in showing that incorporating the high-quality phrases consistently boosts precision and consequently the F1 score.

Size Of Gold Dataset
In all the previous experiments, we take advantage of the availability of an annotated dataset. However, one of the challenges in domain-specific NER is the availability of gold supervision data. We here examine the performance of the proposed model on the BC5CDR corpus by selecting increasing amounts of annotated instances from the gold dataset. As shown in Figure 2, the proposed method achieves a performance of 83.18 with only 2% of the annotated dataset, whereas the base NER model requires almost 45% of the ground-truth sentences to reach the same performance. This indicates that with a small set of human-annotated data, our model can deliver relatively good performance.
We also carry out experiments on the BC5CDR and LaptopReview test sets, where our model is trained exclusively on distantly annotated data. We report the outcome together with the scores of the other state-of-the-art unsupervised methods in Table 3, where we also compare to simple dictionary matching. It is clear that the model of Shang et al. (2018b) (AutoNER) remains the best performing NER method on the BC5CDR and LaptopReview datasets in an unsupervised setup. However, as is clear from Figures 3-a and 3-c in Shang et al. (2018b), if at least some manually labeled data is available, our method makes better use of the gold supervision than the AutoNER system in a similar training scenario. It is also worth noting that the approach proposed by Fries et al. (2017) relies on extra human effort to design regular expressions and requires specialized hand-tuning.

Conclusion and Future work
This work presents an approach to alleviate the problems of auto-generated training data in NER. The performance-driven, policy-based reinforcement learning module removes sentences with FPs, whereas the adapted Partial-CRF layer deals with FNs. We examine the impact of each component in ablation experiments. Combining these in a supervised setting leads to state-of-the-art results on three benchmark datasets from different domains and different languages. Future work will extend the study to improve the performance of the model in an unsupervised fashion and to additional domains and languages.