A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers

Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lower-resourced languages. However, there are now many proposed solutions to this problem involving either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective training data based on model predictions. In this paper, we ask the question: given this recent progress, and some amount of human annotation, what is the most effective method for efficiently creating high-quality entity recognizers in under-resourced languages? Based on extensive experimentation using both simulated and real human annotation, we settle on a recipe of starting with a cross-lingual transferred model, then performing targeted annotation of only uncertain entity spans in the target language, minimizing annotator effort. Results demonstrate that cross-lingual transfer is a powerful tool when very little data can be annotated, but an entity-targeted annotation strategy can achieve competitive accuracy quickly, with just one-tenth of the training data.


Introduction
Named entity recognition (NER) is the task of detecting and classifying named entities in text into a fixed set of pre-defined categories (person, location, etc.), with several downstream applications including machine reading, entity and event co-reference (Yang and Mitchell, 2016), and text mining (Han and Sun, 2012). Recent advances in deep learning have yielded state-of-the-art performance on many sequence labeling tasks, including NER (Collobert et al., 2011; Ma and Hovy, 2016; Peters et al., 2018). However, the performance of these models is highly dependent on the availability of large amounts of annotated data, and as a result their accuracy is significantly lower on languages that have fewer resources than English.
In this work, we ask the question "how can we efficiently bootstrap a high-quality named entity recognizer for a low-resource language with only a small amount of human effort?" Specifically, we leverage recent advances in data-efficient learning for low-resource languages, proposing the following "recipe" for bootstrapping low-resource entity recognizers: First, we use cross-lingual transfer learning (Yarowsky et al., 2001; Ammar et al., 2016), which applies a model trained on another language to low-resource languages, to provide a good preliminary model to start the bootstrapping process. Specifically, we use the model of Xie et al. (2018), which reports strong results on a number of language pairs. Next, on top of this transferred model we further employ active learning (Settles and Craven, 2008; Marcheggiani and Artieres, 2014), which helps improve annotation efficiency by using model predictions to select informative, rather than random, data for human annotators. Finally, the model is fine-tuned on data obtained using active learning to improve accuracy in the target language.
1 https://github.com/Aditi138/EntityTargetedActiveLearning
Within this recipe, the choice of specific method for choosing and annotating data within active learning is highly important to minimize human effort. One relatively standard method used in previous work on NER is to select full sequences based on a criterion for the uncertainty of the entities recognized therein (Culotta and McCallum, 2005). However, as it is often the case that only a single entity within the sentence may be of interest, it can still be tedious and wasteful to annotate full sequences when only a small portion of the sentence is of interest (Neubig et al., 2011; Sperber et al., 2014). Inspired by this finding and considering the fact that named entities are both important and sparse, we propose an entity-targeted strategy to save annotator effort. Specifically, we select uncertain subspans of tokens within a sequence that are most likely named entities. This way, the annotators only need to assign types to the chosen subspans without having to read and annotate the full sequence. To cope with the resulting partial annotation of sequences, we apply a constrained version of conditional random fields (CRFs), partial CRFs, during training that only learn from the annotated subspans (Tsuboi et al., 2008; Wanvarie et al., 2011).
Figure 1: Our proposed recipe: cross-lingual transfer is used for projecting annotations from an English labeled dataset to the target language. Entity-targeted active learning is then used to select informative sub-spans which are likely entities for humans to annotate. Finally, the NER model is fine-tuned on this partially-labeled dataset.
To evaluate our proposed methods, we conduct simulated active learning experiments on 5 languages: Spanish, Dutch, German, Hindi and Indonesian. Additionally, to study our method in a more practical setting, we conduct human annotation experiments on two low-resource languages, Indonesian and Hindi, and one simulated low-resource language, Spanish. In sum, this paper makes the following contributions:
1. We present a bootstrapping recipe for improving low-resource NER. With just one-tenth of tokens annotated, our proposed entity-targeted active learning method provides the best results among all active learning baselines, with an average improvement of 9.9 F1.
2. Through simulated experiments, we show that cross-lingual transfer is a powerful tool, outperforming the un-transferred systems by an average of 8.6 F1 with only one-tenth of tokens annotated.
3. Human annotation experiments show that annotators are more accurate in annotating entities when using the entity-targeted strategy as opposed to full sequence annotation. Moreover, this strategy minimizes annotator effort by requiring them to label fewer tokens than the full-sequence annotation.

Approach
As noted in the introduction, our bootstrapping recipe consists of three components: (1) cross-lingual transfer learning, (2) active learning to select relevant parts of the data to annotate, and (3) fine-tuning of the model on these annotated segments.
Steps (2) and (3) are continued until the model has achieved an acceptable level of accuracy, or until we have exhausted our annotation budget. The system overview can be seen in Figure 1. In the following sections, we describe each of these three steps in detail.

Cross-lingual Transfer Learning
The goal of cross-lingual learning is to take a recognizer trained in a source language, and transfer it to a target language. Our approach to doing so for NER follows that of Xie et al. (2018), and we provide a brief review in this section.
To begin with, we assume access to two sets of pre-trained monolingual word embeddings in the source and target languages, X and Y, one small bilingual lexicon, either provided or obtained in an unsupervised manner (Artetxe et al., 2017; Conneau et al., 2017a), and labeled training data in the source language. Using these resources, we train bilingual word embeddings (BWE) to create a word-to-word translation dictionary, and finally use this dictionary to translate the source training data into the target language, which we use to train an NER model.
To learn BWE, we first obtain a linear mapping $W$ by solving the following objective:
$$\min_{W} \lVert X_D W^\top - Y_D \rVert_F \quad \text{s.t.} \quad W W^\top = I,$$
where $X_D$ and $Y_D$ correspond to the aligned word embeddings from the bilingual lexicon, and $\lVert \cdot \rVert_F$ denotes the Frobenius norm. We can first compute the singular value decomposition $Y_D^\top X_D = U \Sigma V^\top$, and solve the objective by taking $W^* = U V^\top$. We obtain BWE by linearly transforming the source and target monolingual word embeddings with $U$ and $V$, namely $XU$ and $YV$.
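The closed-form solution quoted above is the standard orthogonal Procrustes result. The short derivation below is a sketch under assumptions that the text does not spell out: the rows of $X_D$ and $Y_D$ hold the aligned source and target embeddings, the objective is written as $\lVert X_D W^\top - Y_D \rVert_F$, and $W$ is square and orthogonal.

```latex
\begin{align*}
\lVert X_D W^\top - Y_D \rVert_F^2
  &= \operatorname{tr}\!\left(W X_D^\top X_D W^\top\right)
     - 2\operatorname{tr}\!\left(Y_D^\top X_D W^\top\right)
     + \operatorname{tr}\!\left(Y_D^\top Y_D\right) \\
  &= \lVert X_D \rVert_F^2 + \lVert Y_D \rVert_F^2
     - 2\operatorname{tr}\!\left(Y_D^\top X_D W^\top\right)
     && \text{(since } W^\top W = I\text{)} \\[4pt]
\Rightarrow\quad
W^* &= \arg\max_{W W^\top = I} \operatorname{tr}\!\left(Y_D^\top X_D W^\top\right). \\[4pt]
\text{With } Y_D^\top X_D = U \Sigma V^\top:\quad
\operatorname{tr}\!\left(U \Sigma V^\top W^\top\right)
  &= \operatorname{tr}\!\left(\Sigma\, Z\right),
     \qquad Z = V^\top W^\top U \text{ orthogonal}, \\
\operatorname{tr}(\Sigma Z) = \sum_k \sigma_k Z_{kk}
  &\le \sum_k \sigma_k,
     \qquad \text{with equality at } Z = I
     \;\Longleftrightarrow\; W^* = U V^\top .
\end{align*}
```

The bound uses only that the singular values $\sigma_k$ are non-negative and that the diagonal entries of an orthogonal matrix are at most 1.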
After obtaining the BWE, we find the nearest neighbor target word for every source word in the BWE space using the cross-domain similarity local scaling (CSLS) metric (Conneau et al., 2017b), which produces a word-to-word translation dictionary. We use this dictionary to translate the source training data into the target language, and simply copy the label for each word, which yields transferred training data in the target language. We train an NER model on this transferred data as our preliminary model. Going forward, we refer to the use of cross-lingual transferred data as CT.
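The dictionary induction and label copying described above can be illustrated with a small sketch. It is not the implementation of Xie et al. (2018): the function names, the toy two-dimensional embeddings, and the neighborhood size `k` are all assumptions made for illustration, and the CSLS score follows its usual definition (twice the cosine similarity minus the mean similarity of each word to its `k` nearest cross-lingual neighbors).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_top_k(sims, k):
    """Average of the k largest similarities."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)


def csls_dictionary(src_emb, tgt_emb, k=2):
    """Map each source word to its best target word under CSLS.

    src_emb / tgt_emb: dicts word -> vector, assumed to live in the
    shared bilingual space already (hypothetical toy inputs).
    """
    src_words, tgt_words = list(src_emb), list(tgt_emb)
    cos = {s: {t: cosine(src_emb[s], tgt_emb[t]) for t in tgt_words}
           for s in src_words}
    # Mean similarity of each word to its k nearest neighbors in the
    # other language; this penalizes "hub" words that are close to everything.
    r_src = {s: mean_top_k(cos[s].values(), k) for s in src_words}
    r_tgt = {t: mean_top_k([cos[s][t] for s in src_words], k)
             for t in tgt_words}
    return {
        s: max(tgt_words, key=lambda t: 2 * cos[s][t] - r_src[s] - r_tgt[t])
        for s in src_words
    }


def transfer(sentence, dictionary):
    """Translate word by word and copy every label unchanged."""
    return [(dictionary.get(word, word), label) for word, label in sentence]


# Toy example with made-up embeddings (English -> Dutch).
src = {"london": [1.0, 0.0], "visited": [0.0, 1.0]}
tgt = {"londen": [0.9, 0.1], "bezocht": [0.1, 0.9]}
dictionary = csls_dictionary(src, tgt)
transferred = transfer([("london", "B-LOC"), ("visited", "O")], dictionary)
```

Words missing from the dictionary are copied through unchanged here, which is one simple fallback rather than a choice the text prescribes.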

Entity-Targeted Active Learning
After training a model using cross-lingual transfer learning, we start the active learning process based on this model's outputs. We begin by training a NER model $\Theta$ using the above model's outputs as training data. Using this trained model, our proposed entity-targeted active learning strategy, referred to as ETAL, then selects the most informative spans from a corpus $D$ of unlabeled sequences. Given an unlabeled sequence $s$, ETAL first selects a span of tokens $s_i^j = s_i \cdots s_j$ such that $s_i^j$ is a likely named entity, where $i, j \in [0, |s|]$. Then, in order to obtain highly informative spans across $D$, ETAL computes the entropy $H$ for each occurrence of the span $s_i^j$ in $D$ and then aggregates them over the entire corpus $D$, given by:
$$H_{\text{aggregate}}(s_i^j) = \sum_{x \in D} H(x_i^j = s_i^j),$$
where $x$ is an unlabeled sequence in $D$. Finally, the spans $s_i^j$ with the highest aggregate uncertainty $H_{\text{aggregate}}$ are selected for manual annotation.
We now describe the procedure for calculating $H(x_i^j)$, which is the entropy of a span $x_i^j$ being a likely entity. Given an unlabeled sequence $x$, the trained NER model $\Theta$ is used for computing the marginal probabilities $p_\theta(y_i|x)$ for each token $x_i$ across all possible labels $y_i \in Y$ using the forward-backward algorithm (Rabiner, 1989), where $Y$ is the set of all labels. Using these marginals we calculate the entropy of a given span $x_i^j$ being an entity as shown in Algorithm 1.
Algorithm 1 Entity-Targeted Active Learning
1: B ← label-set denoting beginning of an entity
2: I ← label-set denoting inside of an entity
3: O ← outside of an entity span
4: $p_\theta(y_i|x)$ ← marginal probability of label $y_i$ for token $x_i$
...
   if $H > H_{\text{threshold}}$ then
...

Let B denote the set of labels indicating the beginning of an entity, I the set of labels indicating the inside of an entity, and O the label for outside of an entity. First, we compute the probability of a span $x_i^j$ being an entity, starting with the token $i$, by marginalizing $p_\theta(y_i|x)$ over all labels in B, denoted as $p^{ij}_{\text{span}}$. Since an entity can span multiple tokens, for each subsequent token $j$ being part of that entity, we marginalize $p_\theta(y_j|x)$ over all labels in I and combine it with $p^{ij}_{\text{span}}$. Finally, we compute $p_{\text{entity}} = p^{ij}_{\text{span}} \cdot p_\theta(O_j|x)$, which denotes the end of a likely entity. Since we use the marginal probability for computing $p_{\text{entity}}$, it already factors in the transition probability between tags. Thus, any invalid sequences such as B-PER I-ORG have low scores. Since contiguous spans have overlapping tokens, using dynamic programming (DP) to compute $p^{ij}_{\text{span}}$ avoids an exponential computation when considering all possible spans in a sequence. Using $p_{\text{entity}}$, we compute the entropy $H$ and only consider the spans having $H$ higher than a pre-defined threshold $H_{\text{threshold}}$. This thresholding is purely for computational purposes: it allows us to discard all spans that have a very low probability of being an entity, keeping the number of spans actually stored in memory low. As mentioned above, we aggregate the entropy of spans $H_{\text{aggregate}}$ over the entire unlabeled set, thus combining uncertainty sampling with a bias towards high-frequency entities.
Using this strategy, we select subspans in each sequence for annotation. The annotator only needs to assign named entity types to the chosen subspans, adjust the span boundary if needed, and ignore the rest of the sequence, saving much effort.
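The span scoring and corpus-level aggregation described in this section can be sketched as follows. This is a minimal illustration rather than the released implementation, and several details are assumptions: the label sets, the convention that a candidate span covers tokens `i` to `j-1` with token `j` expected to be outside (or the sentence to end), and the use of the binary entropy of `p_entity` as the uncertainty measure. The function names are hypothetical.

```python
import math
from collections import defaultdict

# Assumed BIO label sets for illustration.
B_LABELS = {"B-PER", "B-LOC", "B-ORG"}
I_LABELS = {"I-PER", "I-LOC", "I-ORG"}
OUTSIDE = "O"


def binary_entropy(p):
    """Entropy of the event "this span is an entity" (an assumed choice)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def span_entropies(tokens, marginals, max_len=5, h_threshold=1e-8):
    """Yield (span_text, entropy) for candidate spans tokens[i:j].

    marginals[t] maps each label to its marginal probability at token t.
    p_span is extended one token at a time, so every span starting at i
    reuses the product computed for the shorter span (the DP step).
    """
    n = len(tokens)
    for i in range(n):
        p_span = sum(marginals[i].get(label, 0.0) for label in B_LABELS)
        for j in range(i + 1, min(i + max_len, n) + 1):
            # The entity ends if token j is outside, or the sentence ends.
            p_end = marginals[j].get(OUTSIDE, 0.0) if j < n else 1.0
            h = binary_entropy(p_span * p_end)
            if h > h_threshold:
                yield " ".join(tokens[i:j]), h
            if j < n:
                p_span *= sum(marginals[j].get(label, 0.0) for label in I_LABELS)


def aggregate_entropy(corpus):
    """Sum span entropies over all occurrences and rank spans."""
    totals = defaultdict(float)
    for tokens, marginals in corpus:
        for span, h in span_entropies(tokens, marginals):
            totals[span] += h
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Because the entropies are summed over occurrences, a moderately uncertain span that appears often can outrank a more uncertain span that appears once, which is the frequency bias mentioned above.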

Training the NER model
With the newly obtained training data from active learning, we attempt to improve the original transferred model. In this section, we first describe our model architecture, and then address two questions: 1) how can the NER model be trained effectively with partially annotated sequences? 2) what training scheme is best suited to improve the transferred model?

Model Architecture
Our NER model is a BiLSTM-CNN-CRF model based on Ma and Hovy (2016) consisting of: a character-level CNN, that allows the model to capture subword information; a word-level BiLSTM, that consumes word embeddings and produces context sensitive hidden representations; and a linear-chain CRF layer that models the dependency between labels for inference. We use the above model for training the initial NER model on the transferred data as well as for re-training the model on the data acquired from active learning.

PARTIAL-CRF
Active learning with span-based strategies such as ETAL produces a training dataset of partially labeled sequences. To train the NER model on these partially labeled sequences, we take inspiration from Bellare and McCallum (2007) and Tsuboi et al. (2008) and use a constrained CRF decoder. Normally, a CRF computes the likelihood of a label sequence $y$ given a sequence $x$ as follows:
$$p_\theta(y|x) = \frac{\prod_{t=1}^{T} \psi_t(y_{t-1}, y_t, x)}{\sum_{y' \in \mathcal{Y}(T)} \prod_{t=1}^{T} \psi_t(y'_{t-1}, y'_t, x)},$$
where $T$ is the length of the sequence, $\mathcal{Y}(T)$ denotes the set of all possible label sequences with length $T$, and $\psi_t(y_{t-1}, y_t, x) = \exp(W^\top_{y_{t-1},y_t} x_t + b_{y_{t-1},y_t})$ is the energy function. To compute the likelihood of a sequence where some labels are unknown, we use a constrained CRF which marginalizes out the un-annotated tokens. Specifically, let $Y_L$ denote the set of all possible sequences that include the partial annotations (for un-annotated tokens, all labels are possible), and we compute the likelihood as $p_\theta(Y_L|x) = \sum_{y \in Y_L} p_\theta(y|x)$, referred to as PARTIAL-CRF.
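The constrained likelihood can be computed with the usual forward algorithm by restricting the labels allowed at annotated positions. The sketch below illustrates this under simplifying assumptions that are not taken from the paper: log-potentials are split into per-token emission scores and a label-to-label transition matrix, there is no special start transition, and all function names are hypothetical. A brute-force enumeration is included only to check the dynamic program on tiny inputs.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions, allowed):
    """Forward algorithm restricted to labels in allowed[t] at each step."""
    n_labels = len(transitions)
    neg_inf = float("-inf")
    alpha = [emissions[0][y] if y in allowed[0] else neg_inf
             for y in range(n_labels)]
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[p] + transitions[p][y] for p in range(n_labels)])
            + emissions[t][y]
            if y in allowed[t] else neg_inf
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


def partial_crf_log_likelihood(emissions, transitions, partial_labels):
    """log p(Y_L | x); partial_labels[t] is a label index or None."""
    everything = set(range(len(transitions)))
    constrained = [everything if y is None else {y} for y in partial_labels]
    unconstrained = [everything] * len(emissions)
    return (log_partition(emissions, transitions, constrained)
            - log_partition(emissions, transitions, unconstrained))


def brute_force_log_likelihood(emissions, transitions, partial_labels):
    """Same quantity by enumerating every label sequence (tiny inputs only)."""
    n_labels = len(transitions)

    def score(seq):
        total = sum(emissions[t][y] for t, y in enumerate(seq))
        return total + sum(transitions[a][b] for a, b in zip(seq, seq[1:]))

    all_seqs = list(product(range(n_labels), repeat=len(emissions)))
    compatible = [s for s in all_seqs
                  if all(l is None or l == y for l, y in zip(partial_labels, s))]
    return (logsumexp([score(s) for s in compatible])
            - logsumexp([score(s) for s in all_seqs]))
```

With no annotated positions the constrained and unconstrained sums coincide and the log-likelihood is zero; with every position annotated the function reduces to the ordinary CRF log-likelihood of that single sequence.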

Training Scheme
To improve our model with the newly labeled data, we directly fine-tune the initial model, trained on the transferred data, on the data acquired through active learning, referred to as FINETUNE. Each active learning run produces more labeled data, for which this training procedure is repeated. We also compare the NER performance using two other training schemes: CORPUSAUG, where we train the model on the concatenated corpus of transferred data and the newly acquired data, and CORPUSAUG+FINETUNE, where we additionally fine-tune the model trained using CORPUSAUG on just the newly acquired data.

Experiments
In this section, we evaluate the effectiveness of our proposed strategy in both simulated (§3.2) and human-annotation experiments (§3.3).

Experimental Settings
Datasets: The first evaluation set includes the benchmark CoNLL 2002 and 2003 NER datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) for Spanish (from the Romance family), Dutch and German (like English, from the Germanic family). We use the standard corpus splits for train/dev/test. The second evaluation set is for the low-resource setting where we use the Indonesian (from the Austronesian family), Hindi (from the Indo-Aryan family) and Spanish datasets released by the Linguistic Data Consortium (LDC). We generate the train/dev/test split by random sampling. Details of the corpus statistics are in the Appendix.
English-transferred Data: We use the same experimental settings and resources as described in Xie et al. (2018) to get the translations of the English training data for each target language.
Active Learning Setup: As described in §2.2, a DP-based algorithm is employed to select the uncertain entity spans; it runs over all n-grams of length at most 5. This maximum length approximates the 90th percentile of entity lengths in the English training data. $H_{\text{threshold}}$ is a hyper-parameter set to 1e-8. The details of the NER model hyper-parameters can be found in the Appendix.
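As an illustration of how such a length cap could be derived, the sketch below extracts entity lengths from BIO tag sequences and takes a nearest-rank percentile. The helper names and the nearest-rank definition of the percentile are assumptions; the text does not state how the percentile was computed.

```python
import math


def entity_lengths(tags):
    """Lengths (in tokens) of the entities in one BIO-tagged sentence."""
    lengths, current = [], 0
    for tag in tags:
        if tag.startswith("B-"):
            if current:
                lengths.append(current)
            current = 1
        elif tag.startswith("I-") and current:
            current += 1
        else:
            # "O", or a stray I- tag with no open entity, closes any entity.
            if current:
                lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths


def percentile_nearest_rank(values, q):
    """Smallest value such that at least q percent of values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

Applied to the entity lengths collected over a training corpus, `percentile_nearest_rank(lengths, 90)` gives a maximum span length that covers most entities while keeping the number of candidate n-grams per sentence small.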

Simulation Experiments
Setup: We use cross-lingual transfer (§2.1) to train our initial NER model and test on the target language. This is the same setting as Xie et al. (2018) and serves as our baseline. Then we use several active learning strategies to select data for manual annotation using this trained NER model. We compare our proposed ETAL strategy with the following baseline strategies:
SAL: Select whole sequences for which the model has least confidence in the most likely labeling (Culotta and McCallum, 2005).
CFEAL: Select least confident spans within a sequence using the confidence field estimation method (Culotta and McCallum, 2004).
RAND: Select spans randomly from the unlabeled set for annotation.
In this experimental setting, we simulate manual annotation by using gold labels for the data selected by active learning. At each subsequent run, we annotate 200 tokens and fine-tune the NER model on all the data acquired so far, which is then used to select data for the next run of annotation. Figure 2 summarizes the results for all datasets across the different experimental settings. Each data-point on the x-axis corresponds to the NER performance after annotating 200 additional tokens. CT denotes using cross-lingual transferred data to train the initial NER model for both kickstarting the active learning process and also for fine-tuning the NER model on the newly-acquired data. PARTIAL-CRF/FULL-CRF denote the type of CRF decoder used in the NER model. Throughout this paper, we report results averaged across all active learning runs unless otherwise noted. Individual scores are reported in the Appendix.
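The simulation protocol (select a fixed token budget per run, look up gold labels in place of a human annotator, fine-tune on everything acquired so far, repeat) can be sketched as a short loop. All names here are hypothetical placeholders: `rank_spans` stands for any selection strategy that returns spans in order of preference, and `fine_tune` stands for whatever training routine is used.

```python
def select_within_budget(ranked_spans, budget=200):
    """Take top-ranked spans until the token budget for one run is used."""
    chosen, used = [], 0
    for span in ranked_spans:
        length = len(span.split())
        if used + length > budget:
            continue  # skip spans that no longer fit, try shorter ones
        chosen.append(span)
        used += length
    return chosen


def simulate(rank_spans, fine_tune, model, unlabeled, gold, runs=3, budget=200):
    """Run simulated active learning for a fixed number of runs.

    rank_spans(model, unlabeled) -> list of spans, best first (assumed API)
    fine_tune(model, acquired)   -> updated model (assumed API)
    gold                         -> dict span -> gold label (the "oracle")
    """
    acquired = {}
    for _ in range(runs):
        ranked = [s for s in rank_spans(model, unlabeled) if s not in acquired]
        for span in select_within_budget(ranked, budget):
            acquired[span] = gold[span]  # simulated annotation
        # Fine-tune on all data acquired so far, then reselect with the new model.
        model = fine_tune(model, acquired)
    return model, acquired
```

Keeping the selection strategy and the trainer as parameters makes it straightforward to swap ETAL for a baseline such as random selection while holding the budget and the number of runs fixed.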

Results
As can be seen in the figure, our proposed recipe ETAL+PARTIAL-CRF+CT outperforms the previous active learning baselines.
78.9 ± 3.5   62.3 ± 3.6   64.6 ± 4.0   68.6 ± 3.9
Table 1: Variance analysis for significance testing of different active learning systems using paired bootstrap resampling. ± denotes the 95% confidence intervals. Systems that are not significantly different from the best system, ETAL, are in bold. The CoNLL datasets reflect the same observation, as can be seen in the Appendix.
From Figure 2 we see that ETAL performs better than the baselines across multiple runs. To verify that this is not an artifact of randomness in the test data, we use a paired bootstrap resampling method, as illustrated in Koehn (2004), to compare SAL, CFEAL, and RAND with ETAL. For each system, we compute the F1 score on a randomly sampled 50% of the data and perform 10k bootstrapping steps at three active learning runs. From Table 1 we see that the baselines are significantly worse than ETAL at 600 and 1200 annotated tokens.
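A paired bootstrap comparison of two systems can be sketched as follows. The details are assumptions made for illustration: each system is summarized by per-sentence (true positive, false positive, false negative) entity counts on the same test sentences, sentences are sampled with replacement, and the function reports how often system A beats system B rather than the confidence intervals shown in Table 1.

```python
import random


def f1(counts):
    """Micro-averaged F1 from a list of (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(counts_a, counts_b, steps=10_000, fraction=0.5, seed=0):
    """Fraction of resamples in which system A has strictly higher F1 than B.

    counts_a[i] and counts_b[i] must describe the same test sentence, so
    both systems are always scored on identical resampled data.
    """
    assert len(counts_a) == len(counts_b)
    rng = random.Random(seed)
    n = len(counts_a)
    sample_size = max(1, int(n * fraction))
    wins_a = 0
    for _ in range(steps):
        idx = [rng.randrange(n) for _ in range(sample_size)]
        if f1([counts_a[i] for i in idx]) > f1([counts_b[i] for i in idx]):
            wins_a += 1
    return wins_a / steps
```

If A wins in at least 95% of the resamples, the difference is conventionally reported as significant at the 0.05 level.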

Ablation Study
In order to study the contribution of CT and PARTIAL-CRF in improving the NER performance, we conduct the following ablation, denoted by dashed lines in Figure 2.
CT: We observe that the transferred data from English provides a good start to the NER model: 69.4 (Dutch), 63.0 (Spanish-LDC), 65.7 (Spanish-CoNLL), 54.7 (German), 45.4 (Indonesian), 45.0 (Hindi) F1. As expected, cross-lingual transfer helps more for the languages closely related to English, namely Dutch, German, and Spanish. For this ablation, we train an ETAL+PARTIAL-CRF model where no transferred data is used. Therefore, to create the seed data, we randomly annotate 200 tokens in the target language and thereafter use ETAL. We observe that as more in-domain data is acquired, the un-transferred setting soon approaches the transferred setting ETAL+PARTIAL-CRF+CT, suggesting that an efficient annotation strategy can help close the gap between these two systems with as few as ∼1000 tokens (avg.).

PARTIAL-CRF: We study the effect of using the original CRF (FULL-CRF) instead of the PARTIAL-CRF for training with partially labeled data. Since the former requires fully labeled sequences, the un-annotated tokens in a sequence are labeled with the model predictions. We see from Figure 2 that ETAL+FULL-CRF+CT performs worse (avg. -4.1 F1) than ETAL+PARTIAL-CRF+CT. This is because the FULL-CRF significantly hurts recall, with average changes of -11.0 points for Hindi, -1.4 for Indonesian, -7.4 for Spanish-LDC, -3.3 for German, -3.7 for Dutch, and -4.8 for Spanish-CoNLL.
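The difference between the two settings lies in how a partially annotated sequence is turned into a training target. The sketch below contrasts them; the function names and the toy label set are hypothetical. Under FULL-CRF every gap is filled with the current model's prediction, so an entity the model misses is treated as a confirmed non-entity, which is one plausible reading of the recall drop reported above. Under PARTIAL-CRF the gap stays open to every label.

```python
def full_crf_targets(annotations, predictions):
    """One hard label per token: human label if present, else model prediction."""
    return [a if a is not None else p for a, p in zip(annotations, predictions)]


def partial_crf_constraints(annotations, label_set):
    """One set of allowed labels per token: a singleton where annotated,
    the whole label set where the token was left un-annotated."""
    return [set(label_set) if a is None else {a} for a in annotations]


# Toy sequence: only the middle span was annotated by the human.
labels = ["O", "B-LOC", "I-LOC"]
annotations = [None, "B-LOC", "I-LOC", None]
predictions = ["O", "O", "O", "O"]  # hypothetical model output

hard_targets = full_crf_targets(annotations, predictions)
soft_targets = partial_crf_constraints(annotations, labels)
```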

Comparison of Training Schemes
We experiment with different NER training regimes (described in §2.3.3) for ETAL. We observe that generally fine-tuning not only speeds up the training but also gives better performance than CORPUSAUG. For brevity, we compare results for two languages in Figure 3: Dutch, a relative of English, and Hindi, a distant language. We see that FINETUNE performs better for Hindi whereas CORPUSAUG+FINETUNE performs better for Dutch. This is because Dutch is closely related to English and benefits the most from the transferred data being explicitly augmented. For Hindi, in contrast, which is typologically distant from English, the transferred data is noisy and thus the model does not gain much from the transferred data. Xie et al. (2018) make a similar observation in their Dutch experiments with German.
Figure 3: Comparison of the NER performance trained with different schemes for the ETAL strategy. The x-axis denotes the total number of tokens annotated and the y-axis denotes the F1 score.

Human Annotation Experiments
Setup: We conduct human annotation experiments for Hindi, Indonesian and Spanish to understand whether ETAL helps reduce the annotation effort and improve annotation quality in practical settings. We compare ETAL with the full sequence strategy (SAL). We use six native speakers, two for each language, with different levels of familiarity with the NER task. Each annotator was provided with practice sessions to gain familiarity with the annotation guidelines and the user interface. The annotators annotated for 20 minutes for each strategy. For ETAL, the annotator was required to annotate single spans, i.e., each sequence contained one span of tokens. This involved assigning the correct label and adjusting the span boundary if required. For SAL, the annotator was required to annotate all possible entities in the sequence. We randomized the order in which the annotators used the ETAL and SAL strategies. Figure 5 illustrates the human annotation process for the ETAL strategy in the annotation interface.

Results
Table 2 records the results of the human annotation experiments. We first compare each annotator's annotation quality with respect to the oracle under both ETAL and SAL, denoted by Annotator Performance. We see that both Hindi and Spanish annotators have higher annotation quality using ETAL. This is because by selecting possible entity spans, ETAL not only saves effort on searching the entities in a sequence but also allows the annotators to read less overall and concentrate more on the things that they do read, as seen in Figure 4(a). However, for SAL we see that the annotator missed a likely entity because they focused on the other more salient entities in the sequence. For Indonesian, we see an opposite trend due to several inconsistencies in the gold labels. The most common inconsistency occurs when a common noun is part of an entity. For instance, the gold standard annotates the span Kabupaten Bogor as an entity, where Kabupaten means "district", whereas for Kabupaten Aceh tengah, the gold standard does not include Kabupaten. Similarly, the same span gunung krakatau is annotated inconsistently across different mentions, where the gunung (mountain) token is sometimes excluded. Since the annotators referred to these examples during their practice session, their annotations had similar inconsistencies. This issue affects ETAL more than SAL because ETAL selects more entities for annotation.
The Test Performance compares the performance of the NER models trained with these annotations. The number in the brackets denotes the total number of annotated tokens used for training the NER model. We observe that SAL has a larger number of annotated tokens than ETAL. This is because most sequences selected by SAL did not have any entities. Since "not-an-entity" is the default label in the annotation interface, no operation is required for annotating these, allowing for more tokens to be annotated per unit time. When we count the number of entities present in the data selected by the two strategies, we see in Figure 4(b) that data selected by ETAL has a significantly larger number of entities than SAL, across all the 6 annotation experiments. Therefore, we first compare the NER performance on the same number of annotated tokens. From Table 2 we see that under this setting ETAL outperforms SAL, similar to the simulation results. We note that when we consider all the annotated tokens, SAL (denoted by SAL-FULL) has slightly better results. However, despite having 6 times fewer annotated tokens, the difference between ETAL and SAL-FULL is (avg.) 2.1 F1. This suggests that ETAL can achieve competitive performance with fewer annotations.
From both the simulation and human experiments, we can conclude that a targeted annotation strategy such as ETAL achieves competitive performance with less manual effort while maintaining high annotation quality. Given that ETAL can help find twice as many entities as SAL, a potential application of ETAL is creating a high-quality entity gazetteer under a short time budget. Since a naive strategy of SAL allows for more labelled data to be acquired in the same amount of time, in the future we plan to explore mixed-mode annotation where we choose either full sequences or spans for annotation.

Related Work
Cross-Lingual Transfer: Transferring knowledge from high-resource languages has been extensively used for improving low-resource NER. More common approaches rely on annotation projection methods where annotations in the source language are projected to the target language using parallel corpora (Zitouni and Florian, 2008; Ehrmann et al., 2011) or bilingual dictionaries (Xie et al., 2018; Mayhew et al., 2017). Cross-lingual word embeddings (Bharadwaj et al., 2016; Chaudhary et al., 2018) also provide a way to leverage annotations from related languages.
Active Learning (AL): AL has been widely explored for many NLP tasks. For NER, Shen et al. (2017) explore token-level annotation strategies and Chen et al. (2015) present a study on AL for clinical NER; Baldridge and Palmer (2009) evaluate how well AL works with annotator expertise and label suggestions, and Garrette and Baldridge (2013) study type- and token-based strategies for low-resource languages. Settles and Craven (2008) present a survey of the different AL strategies for sequence labeling tasks, whereas Marcheggiani and Artieres (2014) discuss the strategies for acquiring partially labeled data. Wanvarie et al. (2011), Neubig et al. (2011), and Sperber et al. (2014) show the advantages of training a model on this partially labeled data. All of the above methods focus on either token or full sequence annotation.
The most similar work to ours perhaps is that of Fang and Cohn (2017), which selects informative word types for low-resource POS tagging. However, their method requires the annotator to annotate single tokens, which is not trivially applicable for multi-word entities in practical settings.
Figure 5: Example of the human annotation process for Hindi. (a) Selected spans using ETAL strategy are highlighted for the human annotator to annotate. (b) Human annotator correcting the span boundary and assigning the correct entity type. (c) Human annotator assigning the correct entity type only since selected span boundary is correct. (d) Partially-annotated sequences after being annotated by the human annotator.

Conclusion
In this paper, we presented a study on how to efficiently bootstrap NER systems for low-resource languages using a combination of cross-lingual transfer learning and active learning. We conducted both simulated and human annotation experiments across different languages and found that: 1) cross-lingual transfer is a powerful tool, consistently beating systems that do not use transfer; 2) our proposed recipe works the best among known active learning baselines; 3) our proposed active learning strategy saves annotators much effort while ensuring high quality. In the future, to account for different levels of annotator expertise, we plan to combine proactive learning (Li et al., 2017) with our proposed method.