Proactive Learning for Named Entity Recognition

The goal of active learning is to minimise the cost of producing an annotated dataset. Annotators are typically assumed to be perfect, i.e., they always choose the correct labels. However, in practice, annotators are not infallible, and they are likely to assign incorrect labels to some instances. Proactive learning is a generalisation of active learning that can model different kinds of annotators. Although proactive learning has been applied to certain labelling tasks, such as text classification, there is little work on its application to named entity (NE) tagging. In this paper, we propose a proactive learning method for producing NE annotated corpora, using two annotators with different levels of expertise, who charge different amounts according to their level of expertise. To optimise both cost and annotation quality, we also propose a mechanism to present multiple sentences to annotators at each iteration. Experimental results for several corpora show that our method facilitates the construction of high-quality NE labelled datasets at minimal cost.


Introduction
Manually annotating a dataset with NEs is both time-consuming and costly. Active learning, a semi-supervised machine learning approach, aims to address these issues (Lewis, 1995; Settles, 2010). Instead of asking annotators to label the whole dataset, active learning methods present only representative and informative instances to annotators. Through iterative application of this process, a high-quality annotated corpus can be produced in less time and at lower cost than traditional annotation methods.
There are two strong assumptions in active learning: (1) instances are labelled by experts, who always produce correct annotations and are not affected by the tedious and repetitive nature of the task; (2) all annotators are paid equally, regardless of their annotation quality or level of expertise. However, in practice, it is highly unlikely that all annotators will assign accurate labels all of the time. For example, especially in complex annotation tasks, some labels are likely to be assigned incorrectly (Donmez and Carbonell, 2008, 2010; Settles, 2010). Furthermore, if annotation is carried out for long periods of time, tiredness and reduced concentration may ensue (Settles, 2010), which can lead to annotation errors. An additional issue is that different annotators may have varying levels of expertise, which could make them reluctant to annotate certain cases, and they may assign incorrect labels in other cases. It is also possible that an inexperienced annotator may assign random labels.
To address the above-mentioned assumptions, proactive learning has been proposed to model different types of experts (Donmez and Carbonell, 2008, 2010). Proactive learning assumes that (1) not all annotators are perfect, but that there is at least one "perfect" expert and one less experienced or "fallible" annotator; (2) as the perfect expert always provides correct answers, their time is more expensive than that of the fallible annotator. The annotation process in proactive learning is similar to traditional active learning. At each iteration, annotators are asked to tag an unlabelled instance, the result of which is added to the labelled dataset. The difference, however, is that in proactive learning, in order to reduce annotation cost, an appropriate annotator is chosen to label each selected instance. For example, if there is a high probability that the fallible annotator will provide the correct label for an unlabelled instance, then proactive learning will send this instance to be annotated by the fallible annotator. In this way, costs are saved while the quality of the data is maintained.
Proactive learning has been used for several annotation tasks, such as binary and multi-class text classification, and parsing (Donmez and Carbonell, 2008, 2010; Olsson, 2009). In contrast, this paper proposes a proactive learning method for NE tagging, i.e., a sequence labelling task.
Similarly to other efforts that have used proactive learning, our method models two annotators: a reliable one and a fallible one, who have different probabilities of providing correct labels. The reliable annotator is much more likely to produce correct annotations, but their time is expensive. In contrast, the fallible annotator is likely to assign incorrect annotations more often, but charges less for their services. It should be noted that the characteristics of our reliable expert are different from those proposed in previous work (Donmez and Carbonell, 2008, 2010). Specifically, in conventional proactive learning, the reliable expert is assumed to be perfect, i.e., he/she always provides correct annotations. However, in practice, such an assumption is too strong, especially for NE annotation. Therefore, we assume that the reliable expert is not perfect, but that he/she has a higher level of expertise in the target domain, and has a very low error rate. In order to determine an appropriate annotator for each sentence, we calculate the probability that an annotator will assign the correct sequence of labels to a selected unlabelled sentence. Furthermore, at each iteration, we use a batch sampling mechanism to select several sentences for annotators to label (instead of selecting only a single sentence), which optimises both cost and performance.
For evaluation purposes, we simulate the two annotators by using two machine-learning based NER methods, namely LSTM-CRF (Lample et al., 2016) as the reliable expert, and CRF (Lafferty et al., 2001) as the fallible expert. We then apply our method to three corpora from different domains: ACE2005 (Walker et al., 2006) for general language entities, COPIOUS (an in-house corpus of biodiversity entities, available upon request), and GENIA (Kim et al., 2003), a corpus of biomedical entities. Our experimental results demonstrate that by using the proposed method, we can obtain a high-quality labelled corpus at a lower cost than current baseline methods.
The contributions of our work are as follows. Firstly, we have modified the conventional proactive learning method to ensure its suitability for a sequence labelling task. Secondly, in contrast to previous work, which selects a single instance for each annotator at each iteration (Donmez and Carbonell, 2008, 2010; Moon and Carbonell, 2014), our method selects multiple sentences for presentation to annotators. Thirdly, by applying our method to a number of different corpora, we demonstrate that it is generalisable to different domains.

Methodology
The proposed proactive learning method for NE tagging is outlined in Algorithm 1. As an initial step, the performance of each expert is estimated based on a benchmark dataset (see Section 2.1). Subsequently, at each iteration, all sentences in the unlabelled dataset are sorted according to an active learning criterion. The top-N most informative sentences are then used as input to the batch sampling step. In this step, the batch of sentences is divided into two sets, to be distributed to the reliable and fallible experts, respectively. Sentences distributed to the fallible expert are not only informative, but also have a high probability of being labelled correctly by that expert. Meanwhile, only those sentences that are estimated to be too difficult for the fallible expert to annotate will be sent to the reliable expert. By applying this process, annotation cost can be reduced. Further details about the batch sampling algorithm are presented in Section 2.2.
In Algorithm 1, UL_r is the set of selected unlabelled sentences assigned to the reliable expert and UL_f is the set assigned to the fallible expert. L_r and L_f are the annotated results of UL_r and UL_f, respectively.
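As a concrete illustration, the overall loop of Algorithm 1 can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper functions (`rank`, `batch_sample`, and the two `annotate_*` callables) are hypothetical stand-ins for the components described in Sections 2.1 and 2.2.

```python
def proactive_learning(labelled, unlabelled, budget, cost_r, cost_f,
                       rank, batch_sample, annotate_r, annotate_f,
                       n_top=200):
    """Sketch of Algorithm 1: iterate until the budget B is exhausted,
    each time ranking unlabelled sentences by an active learning
    criterion, splitting the top-N between the two experts via batch
    sampling, and paying each expert's per-sentence cost."""
    cost = 0.0
    while cost < budget and unlabelled:
        # Sort the unlabelled pool by informativeness; keep the top N.
        top_n = rank(unlabelled)[:n_top]
        # Distribute the batch between the reliable and fallible experts.
        batch_r, batch_f = batch_sample(top_n)
        if not batch_r and not batch_f:
            break  # nothing was assigned; avoid an infinite loop
        # Annotate and move the sentences into the labelled set.
        labelled += [annotate_r(s) for s in batch_r]
        labelled += [annotate_f(s) for s in batch_f]
        for s in batch_r + batch_f:
            unlabelled.remove(s)
        cost += cost_r * len(batch_r) + cost_f * len(batch_f)
    return labelled, cost
```

The loop stops as soon as the accumulated cost reaches the budget, mirroring the `while C < B` condition in Algorithm 1.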

Expert performance estimation
As mentioned above, our method assumes that there are two types of experts. One is reliable: they have a higher probability of assigning the correct sequence of labels to a sentence, but their time is costly. The other expert is fallible, meaning that they may assign a higher proportion of incorrect labels to a sequence, but charge less for their time. The likely annotation quality of each expert is estimated based on two different probabilities: the class probability, p(label|expert, c), and the sentence probability, p(CorrectLabels|expert, x).

Algorithm 1: Proactive Learning for NER
Input: a labelled dataset L, an unlabelled dataset UL, a test dataset T, a budget B, a reliable expert e_r with cost C_r for each sentence, a fallible expert e_f with cost C_f, the current cost C
Output: a labelled dataset L
1: Estimate the performance of each expert as described in Section 2.1;
2: while C < B do
     ...
     L_r, L_f ← e_r and e_f annotate UL_r and UL_f respectively;
     ...

Class probability
The class probability, p(label|expert, c), is the probability that an expert provides a correct label when annotating a named entity of class c. This probability is obtained by asking both the reliable and fallible experts to annotate a benchmark dataset and calculating F1 scores for each of them against the gold standard annotations.
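A minimal sketch of this estimation step, assuming entities are represented as (sentence id, start, end, class) spans; the function name and span format are illustrative, not from the paper:

```python
from collections import defaultdict

def class_probabilities(gold, predicted):
    """Estimate p(label|expert, c) as the expert's per-class F1 score
    against gold-standard entity spans on a benchmark dataset.

    gold, predicted: iterables of (sentence_id, start, end, entity_class)
    tuples; a predicted span counts as correct only on an exact match."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(predicted)
    for span in pred_set:
        c = span[-1]
        if span in gold_set:
            tp[c] += 1
        else:
            fp[c] += 1
    for span in gold_set - pred_set:
        fn[span[-1]] += 1  # gold entity the expert missed
    probs = {}
    for c in set(tp) | set(fp) | set(fn):
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        probs[c] = 2 * p * r / (p + r) if p + r else 0.0
    return probs
```

The returned dictionary plays the role of the per-class probability table used when computing sentence probabilities below.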

Sentence probability
The sentence probability is the probability that an expert provides a sequence of correct labels for a sentence x.
We firstly compute the probability for each token in the sentence by combining the class probability and the likelihood that an expert provides a correct label for the token x_i, as shown in Equation 1. The equation is inspired by Moon and Carbonell (2014), who used it for a classification task.

p(CorrectLabel|expert, x_i) = Σ_{c ∈ C} p(c|x_i) · p(label|expert, c)   (1)

C is the set of all entity labels and the label O. p(c|x_i) is the probability that a token x_i is an entity of class c, which is predicted by an NER model. Given the probabilities that an expert will provide correct labels for each token in a sentence, the sentence probability is calculated by averaging these probabilities, as presented in Equation 2.

p(CorrectLabels|expert, x) = (1/|x|) Σ_{i=1}^{|x|} p(CorrectLabel|expert, x_i)   (2)

|x| is the length of the sentence x.
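Under this reading of Equations 1 and 2, the token and sentence probabilities can be sketched as follows. The class-probability table and the per-token posteriors p(c|x_i) are assumed to be given (e.g., by the benchmark estimation and an NER model); the function names are illustrative.

```python
def token_correct_prob(token_posteriors, class_probs):
    """Equation 1: p(CorrectLabel|expert, x_i) is the sum over all
    classes c (entity labels plus O) of p(c|x_i) * p(label|expert, c).

    token_posteriors: dict mapping each class c to p(c|x_i).
    class_probs: dict mapping each class c to p(label|expert, c)."""
    return sum(p_c * class_probs[c] for c, p_c in token_posteriors.items())

def sentence_prob(sentence_posteriors, class_probs):
    """Equation 2: average the per-token probabilities over the
    |x| tokens of the sentence."""
    probs = [token_correct_prob(t, class_probs) for t in sentence_posteriors]
    return sum(probs) / len(probs)
```

For example, a token the model is unsure about (split between an entity class and O) contributes a probability between the expert's two per-class scores.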

Batch sampling
Instead of asking annotators to label only one sentence at each iteration, it is more efficient to ask them to annotate several sentences. To facilitate this, we propose a batch sampling algorithm that can select a set of sentences and assign them to appropriate annotators (see Algorithm 2). The input of the algorithm is a set of sentences in the unlabelled dataset that are considered to be the most informative ones, based on an active learning criterion (as described in line 5 of Algorithm 1). This batch sampling process is divided into two stages. In the first stage, unlabelled sentences for which the sentence probability for the fallible expert is higher than a threshold α, will be assigned to the fallible expert. Otherwise, the sentence will be passed to the second stage. In the second stage, we firstly reorder sentences according to a re-ranking criterion, as shown in Equation 3. The intuition behind this re-ranking step is that in order to save on annotation costs, we set a high priority for sentences to be assigned to the fallible expert in certain cases. Specifically, for sentences that are informative and for which there is a small difference between the sentence probabilities for the reliable and fallible experts, we favour the selection of the fallible one.
For an unlabelled sentence x, the difference between the sentence probabilities for the two experts is calculated as shown in Equation 4.

diff(x) = p(CorrectLabels|e_r, x) − p(CorrectLabels|e_f, x)   (4)
If the above difference is not significant, i.e., it is less than a threshold β, x will be distributed to the fallible expert. Otherwise, x will be assigned to the reliable expert. Equations (5)-(7) describe the estimation of the threshold β, in which x_i is the i-th sentence in the top-N sentences selected by an active learning criterion.

diff_max = max_i diff(x_i)   (5)
diff_min = min_i diff(x_i)   (6)
β = diff_min + γ · (diff_max − diff_min)   (7)

γ is a parameter that controls the value of the threshold β, and ranges from 0 to 1. If γ = 0, no sentences will be given to the fallible expert to annotate. If γ = 1, the fallible expert will label all the BatchSize sentences. It should be noted that β is a dynamic threshold, which is recalculated based on the difference between diff_max and diff_min at each iteration.
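The two-stage batch sampling described above can be sketched as follows. This is our reading of the text, not the paper's exact Algorithm 2: in particular, the form of β and the re-ranking criterion of Equation 3 are approximated (here, candidates are simply ordered by the probability difference, smallest first).

```python
def batch_sample(top_n, prob_r, prob_f, alpha=0.975, gamma=0.05):
    """Sketch of the batch sampling step. prob_r / prob_f map each
    sentence to the reliable / fallible expert's sentence probability."""
    to_fallible, remaining, to_reliable = [], [], []
    # Stage 1: sentences the fallible expert is very likely to label
    # correctly (probability above alpha) go straight to them.
    for x in top_n:
        (to_fallible if prob_f[x] > alpha else remaining).append(x)
    if remaining:
        # Stage 2: compute the dynamic threshold beta from the spread
        # of probability differences, then assign by comparison to it.
        diffs = {x: prob_r[x] - prob_f[x] for x in remaining}
        d_min, d_max = min(diffs.values()), max(diffs.values())
        beta = d_min + gamma * (d_max - d_min)
        # Re-rank so that small-difference sentences are considered first.
        for x in sorted(remaining, key=diffs.get):
            (to_fallible if diffs[x] < beta else to_reliable).append(x)
    return to_reliable, to_fallible
```

With γ = 0 the threshold collapses to diff_min, so (under strict comparison) nothing further is routed to the fallible expert; larger γ routes more of the batch to them.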

Dataset
We have applied our method to three different corpora: (1) ACE2005 (Walker et al., 2006), which includes named entities for the general domain, e.g., person, location, and organisation; (2) COPIOUS, which includes five categories of biodiversity entities, such as taxon, habitat, and geographical location; (3) GENIA (Kim et al., 2003), a biomedical named entity corpus. Table 1 shows the entity classes and the number of entities of each class that are annotated in the three corpora. As shown in the table, for the GENIA corpus, we combined the DNA and RNA entities into a single named entity class. Meanwhile, for ACE2005, although top-level entity classes are divided into a number of different subtypes, we only considered the top-level classes, as shown in the table.
For active and proactive learning experiments, 1% and 20% of sentences of each corpus were used as the initial labelled set and the test set, respectively. The remaining 79% of sentences were regarded as unlabelled data.
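A minimal sketch of this split, assuming sentences can be shuffled freely; the seed and rounding behaviour are our choices, not the paper's:

```python
import random

def split_corpus(sentences, seed=42):
    """Split a corpus into the 1% initial labelled set, the 20% test
    set, and the remaining 79% unlabelled pool described above."""
    s = list(sentences)
    random.Random(seed).shuffle(s)  # fixed seed for reproducibility
    n_seed = max(1, round(0.01 * len(s)))
    n_test = round(0.20 * len(s))
    return s[:n_seed], s[n_seed:n_seed + n_test], s[n_seed + n_test:]
```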

Expert simulation
We simulated the reliable and fallible experts by using two machine learning models: LSTM-CRF (Lample et al., 2016), a neural network-based NER model, and CRF (Lafferty et al., 2001). To evaluate the performance of the two models, we conducted preliminary experiments, firstly training the two models on 80% of the labelled corpora and subsequently testing them on the remaining 20% of the data.
Word embeddings As the three corpora belong to three different domains, we used three corresponding pre-trained word embeddings as input to the LSTM-CRF model.
• COPIOUS: we applied word2vec to the English subset of the Biodiversity Heritage Library 3 to learn vectors for biodiversity entities. The set has approximately 26 million pages with more than 8 billion words.
• GENIA: we used pre-trained biomedical word embeddings (Pyysalo et al., 2013).

CRF features
To train the CRF model, we used CRF++ 4 and employed the following features: the word base form, lemma, part-of-speech tag and chunk tag of a token. We also used unigram and bigram features that combine the features of the previous, current and following tokens. As illustrated in Table 2, the LSTM-CRF model is mostly more precise and achieves wider coverage than CRF. We therefore selected LSTM-CRF to simulate the reliable expert and CRF to simulate the fallible expert. The class probability of each expert is pre-calculated based on the F1 score for each class that an expert can achieve on the 1% initial labelled set. Meanwhile, the sentence probability of each expert is estimated at each iteration.
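The feature set described above might be assembled as follows for a general-purpose CRF toolkit; the dictionary encoding and key names are illustrative (CRF++ itself expresses these as template files rather than Python code):

```python
def token_features(tokens, i):
    """Build the feature dict for token i of a sentence: word, lemma,
    POS and chunk tags of the previous, current and following tokens,
    plus bigram features combining neighbouring words. Each token is a
    dict with hypothetical keys 'word', 'lemma', 'pos', 'chunk'."""
    feats = {}
    # Unigram features over a window of previous, current, next token.
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for key in ("word", "lemma", "pos", "chunk"):
                feats[f"{key}[{offset}]"] = tokens[j][key]
    # Bigram features combining neighbouring tokens' words.
    if i > 0:
        feats["word[-1]|word[0]"] = f"{tokens[i-1]['word']}|{tokens[i]['word']}"
    if i < len(tokens) - 1:
        feats["word[0]|word[1]"] = f"{tokens[i]['word']}|{tokens[i+1]['word']}"
    return feats
```

Window offsets at the sentence boundary are simply omitted rather than padded; a real template might instead emit explicit BOS/EOS values.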

Active learning criteria
Various active learning criteria were investigated using the three corpora. We firstly estimated the performance (F1 score) of a supervised NER model, using CRF++ and the above-mentioned features. We then compared the performance of each active learning criterion with that of the supervised model. The criterion that approximates the performance of the supervised model in the fewest iterations is considered the best one for the proactive learning experiments.
We experimented with the following criteria: least confidence (Culotta and McCallum, 2005), normalized entropy (Kim et al., 2006), MMR (Maximal Marginal Relevance) (Kim et al., 2006), density (Settles and Craven, 2008) using either feature vectors or word embeddings, and a combination of the least confidence and density criteria. Equation 8 describes the combined criterion used in our experiments. In this equation, UL is the current unlabelled dataset, x_u is the u-th unlabelled sentence in UL, the parameter λ = 0.8, and the similarity score (Settles and Craven, 2008) was calculated using feature vectors.
x* = arg max_x ( λ · LeastConfidence(x) + (1 − λ) · (1/|UL|) Σ_{x_u ∈ UL} similarity(x, x_u) )   (8)

We also implemented two baseline criteria. The first one is random selection, in which a batch of sentences is selected randomly at each iteration. The second one, namely longest, is a criterion that selects the longest sentences to be labelled.
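One plausible reading of this combined criterion can be sketched as follows; the exact way least confidence and density are combined in Equation 8 is our assumption (a λ-weighted interpolation), and the similarity function is supplied by the caller:

```python
def least_confidence(seq_prob):
    """Least confidence: 1 minus the probability of the model's most
    likely label sequence for the sentence."""
    return 1.0 - seq_prob

def combined_score(x, unlabelled, seq_prob, similarity, lam=0.8):
    """Interpolate least confidence with the average similarity of x
    to the unlabelled pool (a density term), weighted by lambda."""
    density = sum(similarity(x, u) for u in unlabelled) / len(unlabelled)
    return lam * least_confidence(seq_prob[x]) + (1 - lam) * density
```

The sentence maximising this score is selected; the density term keeps the criterion from favouring uncertain but unrepresentative outliers.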
Among these criteria, we selected the best criterion for further experiments. The best criterion is the one that produced competitive or better performance (F-score) than that of a supervised learning method with the least number of training instances. We report these criteria for each entity class as well as for the overall corpus in Table 4. In this table, Density (f2v) and Density (w2v) represent the density criteria when using feature and word vectors, respectively. Entropy is the normalized entropy. LC+Density is the combined criterion, described in Equation 8. As shown in the table, the best criteria at the level of individual classes are diverse. However, overall, normalized entropy is the best criterion for all three corpora. We therefore selected this criterion in our proactive learning experiments.

Proactive learning results
Our method was evaluated on the test datasets of the three corpora mentioned in Section 3.1. For all experiments with proactive learning, we used the following settings: α = 0.975, γ = 0.05, N = 200, and the annotation costs are 3 and 1 per sentence for the reliable and fallible experts, respectively.

BatchSize
We investigated different values of BatchSize, including 20, 10, 5, and 1. The results when BatchSize is 1 are not shown in Figure 1, as in this case our method always selects the fallible expert at every iteration, which results in performance that is inferior to the baselines. For the GENIA corpus, the F-scores are comparable, regardless of the BatchSize used. Meanwhile, for the ACE2005 corpus, the F-scores are highest when the batch size is 20. In contrast, for the COPIOUS corpus, the best scores are obtained with a batch size of 10. Figure 2 compares the experimental results of the two baseline methods (Reliable and Fallible) and the best performance of the proposed proactive learning method (PA), with batch sizes of 20, 10, and 5, respectively, on the three corpora. Reliable refers to a baseline in which we only select the reliable expert at each iteration. Similarly, only the fallible expert was selected in the Fallible experiments.

Comparison with baselines
It can be seen that the performance of the three models is comparable between the ACE2005 and COPIOUS corpora. For these two corpora, PA outperformed the two baselines. In most cases, by using PA, better F-scores are obtained at the same cost as the two baselines. The performance of both PA and Reliable increases as the total cost increases. Meanwhile, for the Fallible model, the performance stabilises at a lower level than the other methods when the cost rises above a certain level.
Regarding the GENIA corpus, PA achieved a higher performance than Reliable, but a lower performance than Fallible, in the cost range from 0 to approximately 3,500. This can be partly explained by the fact that there are only three NE classes in this corpus. Hence, the annotation task is simpler than for the other corpora, even for the fallible expert. However, when the cost is greater than 3,500, the performance of Fallible becomes stable, while the performance of PA continues to increase.
We also investigated the number of times that each expert was selected during the iterative process of PA. The results are shown in Figure 3. PA (Reliable) and PA (Fallible) correspond to the number of times that the reliable and fallible experts, respectively, were selected in PA, while Reliable corresponds to the number of times that the reliable expert was selected in the Reliable baseline experiment. The figure illustrates that the number of times that the fallible expert is selected grows continually as the number of iterations increases. This shows that our method can effectively distribute appropriate unlabelled sentences to the fallible expert, in order to save on annotation costs.
Related work

Active learning for NER
Active learning aims to decrease annotation cost, whilst maintaining acceptable quality of the annotated data. To achieve this, the method iteratively selects the most informative sentences to be annotated from an unlabelled dataset.
One of the most common selection criteria used when applying active learning to the task of NE labelling is the uncertainty-based criterion. This criterion assumes that the most uncertain sentence is the most useful instance for learning an NER model. There are several ways to implement this, such as least confidence (Culotta and McCallum, 2005), whereby the lower the probability of a sequence of labels, the less confident the model, and entropy (Kim et al., 2006), which measures the uncertainty of a probability distribution. Other criteria include a diversity measurement (Kim et al., 2006) and a density criterion (Settles and Craven, 2008).
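As an illustration, an entropy-based uncertainty score can be sketched as follows. Averaging per-token normalized entropies is our simplification of the sequence-level criterion (least confidence is simply 1 minus the probability of the best label sequence, so it needs no separate code here):

```python
import math

def normalized_entropy(posteriors):
    """Average per-token normalized entropy for a sentence; higher
    values indicate a more uncertain (and thus more informative)
    sentence. posteriors: one dict per token mapping labels to
    probabilities."""
    total = 0.0
    for dist in posteriors:
        # Shannon entropy of this token's label distribution.
        h = -sum(p * math.log(p) for p in dist.values() if p > 0)
        # Normalize by the maximum possible entropy, log(#labels).
        total += h / math.log(len(dist)) if len(dist) > 1 else 0.0
    return total / len(posteriors)
```

A token with a uniform label distribution scores 1 (maximally uncertain); a token the model is sure about scores 0.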

Cost-sensitive active learning
Cost-sensitive active learning is a type of active learning method that considers the annotation cost, e.g., the budget, time or effort required to complete the annotation process (Olsson, 2009). Since proactive learning also models the reliability or expertise of each annotator in addition to the annotation cost, it can be considered as another case of cost-sensitive active learning. Donmez and Carbonell (2008, 2010) investigated proactive learning for binary classification. They predicted the probability that a reluctant oracle refuses to annotate an instance and the probability that a fallible oracle assigns a random label to an instance. Each oracle charges a different amount for their efforts. They also proposed a model that assigns different costs to unlabelled instances according to their annotation difficulty. For the multi-class classification task, Moon and Carbonell (2014) used the same approach, but with multiple experts, each of whom is specialised in a particular class. Kapoor et al. (2007) proposed a decision-theoretic method for the task of voice mail classification. They defined a criterion named "expected value-of-information" that combines the misclassification risk with the labelling cost.

Figure 3: Number of times that each expert is selected in the PA and Reliable models
Cost-sensitive active learning was also applied to part-of-speech (POS) tagging (Haertel et al., 2008). In this work, an hourly cost measurement was determined and a linear regression model was trained to predict the annotation cost. Hwa (2000) aimed to reduce the manual effort for a parsing task by using tree entropy cost. Meanwhile, Baldridge and Osborne (2004) measured the total annotation cost to create a treebank by using unit cost and discriminant cost.

Conclusion and future work
Our work constitutes the first attempt to use a proactive learning method for named entity labelling. We simulated the behaviour of reliable and fallible experts with different levels of expertise and different costs. To save annotation costs while ensuring acceptable quality of the resulting annotated data, the method favours the selection of the fallible expert. To further increase efficiency, we also proposed a batch sampling algorithm that selects more than one sentence in each iteration.
Experimental results for three corpora belonging to different domains demonstrate that the employment of non-perfect experts can help to build a gold-standard dataset at reasonable cost. Moreover, our method performed well across the three different corpora, demonstrating the generality of our approach.
A potential limitation of our approach is that the initial step relies on the availability of a gold-standard corpus to estimate the experts' performance. However, for some domains, it may be difficult to obtain such a dataset. Therefore, as future work, we will explore how to assess experts' performance without the need for gold-standard labelled data.
As a further extension to our work, we will explore the deployment of our method on crowd sourcing platforms, such as CrowdFlower 5 and Amazon Mechanical Turk 6 . These platforms allow annotations to be obtained from non-expert annotators in a rapid and cost-effective manner (Snow et al., 2008). These non-experts can be treated as non-perfect annotators in our proposed proactive learning method.