Named Entity Recognition without Labelled Data: A Weak Supervision Approach

Named Entity Recognition (NER) performance often degrades rapidly when applied to target domains that differ from the texts observed during training. When in-domain labelled data is available, transfer learning techniques can be used to adapt existing NER models to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision. The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain. These annotations are then merged together using a hidden Markov model which captures the varying accuracies and confusions of the labelling functions. A sequence labelling model can finally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) and demonstrate an improvement of about 7 percentage points in entity-level F1 scores compared to an out-of-domain neural NER model.


Introduction
Named Entity Recognition (NER) constitutes a core component in many NLP pipelines and is employed in a broad range of applications such as information extraction (Raiman and Raiman, 2018), question answering (Mollá et al., 2006), document de-identification (Stubbs et al., 2015), machine translation (Ugawa et al., 2018) and even conversational models (Ghazvininejad et al., 2018). Given a document, the goal of NER is to identify and classify spans referring to an entity belonging to pre-specified categories such as persons, organisations or geographical locations.
State-of-the-art neural NER architectures require large corpora annotated with named entities, such as Ontonotes (Weischedel et al., 2011) or CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003). When only modest amounts of training data are available, transfer learning approaches can transfer the knowledge acquired from related tasks into the target domain, using techniques such as simple transfer (Rodriguez et al., 2018), discriminative fine-tuning (Howard and Ruder, 2018), adversarial transfer (Zhou et al., 2019) or layer-wise domain adaptation approaches (Yang et al., 2017; Lin and Lu, 2018).
However, in many practical settings, we wish to apply NER to domains where we have no labelled data, making such transfer learning methods difficult to apply. This paper presents an alternative approach using weak supervision to bootstrap named entity recognition models without requiring any labelled data from the target domain. The approach relies on labelling functions that automatically annotate documents with named-entity labels. A hidden Markov model (HMM) is then trained to unify the noisy labelling functions into a single (probabilistic) annotation, taking into account the accuracy and confusions of each labelling function. Finally, a sequence labelling model is trained using a cross-entropy loss on this unified annotation.
As in other weak supervision frameworks, the labelling functions allow us to inject expert knowledge into the sequence labelling model, which is often critical when data is scarce or non-existent (Hu et al., 2016; Wang and Poon, 2018). New labelling functions can easily be inserted to leverage the knowledge sources at our disposal for a given textual domain. Furthermore, labelling functions can often be ported across domains, which is not the case for manual annotations that must be repeated for every target domain.
The contributions of this paper are as follows:

1. A broad collection of labelling functions for NER, including neural models trained on various textual domains, gazetteers, heuristic functions, and document-level constraints.

2. A novel weak supervision model suited for sequence labelling tasks and able to include probabilistic labelling predictions.

3. An open-source implementation of these labelling functions and aggregation model that can scale to large datasets.¹

Related Work
Unsupervised domain adaptation: Unsupervised domain adaptation attempts to adapt knowledge from a source domain to predict new instances in a target domain which often has substantially different characteristics. Earlier approaches often try to adapt the feature space using pivots (Blitzer et al., 2006, 2007; Ziser and Reichart, 2017) to create domain-invariant representations of predictive features. Others learn low-dimensional transformation features of the data (Guo et al., 2009; Glorot et al., 2011; Chen et al., 2012; Yu and Jiang, 2016; Barnes et al., 2018). Finally, some approaches divide the feature space into general and domain-dependent features (Daumé III, 2007). Multi-task learning can also improve cross-domain performance (Peng and Dredze, 2017).
Recently, Han and Eisenstein (2019) proposed domain-adaptive fine-tuning, where contextualised embeddings are first fine-tuned to both the source and target domains with a language modelling loss and subsequently fine-tuned to source domain labelled data. This approach outperforms several strong baselines trained on the target domain of the WNUT 2016 NER task (Strauss et al., 2016).
Aggregation of annotations: Approaches that aggregate annotations from multiple sources have largely concentrated on noisy data from crowd-sourced annotations, with some annotators possibly being adversarial. The Bayesian Classifier Combination approach of Kim and Ghahramani (2012) combines multiple independent classifiers using a linear combination of predictions. Hovy et al. (2013) learn a generative model able to aggregate crowd-sourced annotations and estimate the trustworthiness of annotators. Rodrigues et al. (2014) present an approach based on Conditional Random Fields (CRFs) whose model parameters are learned jointly using EM. Nguyen et al. (2017b) propose a hidden Markov model to aggregate crowd-sourced sequence annotations and find that explicitly modelling the annotator leads to improvements for POS tagging and NER. Finally, Simpson and Gurevych (2019) propose a fully Bayesian approach to the problem of aggregating multiple sequential annotations, using variational EM to compute posterior distributions over the model parameters.
Weak supervision: The aim of weakly supervised modelling is to reduce the need for hand-annotated data in supervised training. A particular instance of weak supervision is distant supervision, which relies on external resources such as knowledge bases to automatically label documents with entities that are known to belong to a particular category (Mintz et al., 2009; Ritter et al., 2013; Shang et al., 2018). Ratner et al. (2017, 2019) generalised this approach with the Snorkel framework, which combines various supervision sources using a generative model to estimate the accuracy (and possible correlations) of each source. These aggregated supervision sources are then employed to train a discriminative model. Current frameworks are, however, not easily adaptable to sequence labelling tasks, as they typically require data points to be independent. One exception is the work of Wang and Poon (2018), which relies on deep probabilistic logic to perform joint inference on the full dataset. Finally, a weak supervision approach to NER has also been presented for the biomedical domain. However, unlike the model proposed in this paper, that approach relies on an ad-hoc mechanism for generating candidate spans to classify.
The approach most closely related to this paper is Safranchik et al. (2020), who describe a similar weak supervision framework for sequence labelling based on an extension of HMMs called linked hidden Markov models. The authors introduce a new type of noisy rules, called linking rules, to determine how sequence elements should be grouped into spans of the same tag. The main differences between their approach and this paper are the linking rules, which are not employed here, and the choice of labelling functions, in particular the document-level relations detailed in Section 3.1.

[Figure 1. Step 1: labelling functions; Step 2: label aggregation; Step 3: training of the sequence labelling model on the aggregated labels.]
Ensemble learning: The proposed approach is also loosely related to ensemble methods such as bagging, boosting and random forests (Sagi and Rokach, 2018). These methods rely on multiple classifiers run simultaneously and whose outputs are combined at prediction time. In contrast, our approach (as in other weak supervision frameworks) only requires labelling functions to be aggregated once, as an intermediary step to create training data for the final model. This is a non-trivial difference, as running all labelling functions at prediction time is computationally costly due to the need to run multiple neural models along with gazetteers extracted from large knowledge bases.

Approach
The proposed model collects weak supervision from multiple labelling functions. Each labelling function takes a text document as input and outputs a series of spans associated with NER labels. These outputs are then aggregated using a hidden Markov model (HMM) with multiple emissions (one per labelling function) whose parameters are estimated in an unsupervised manner. Finally, the aggregated labels are employed to learn a sequence labelling model. Figure 1 illustrates this process. The process is performed on documents from the target domain, e.g. a corpus of financial news.
Labelling functions are typically specialised to detect only a subset of possible labels. For instance, a gazetteer based on Wikipedia will only detect mentions of persons, organisations and geographical locations, and will ignore entities such as dates or percentages. This marks a departure from existing aggregation methods, which were originally designed for crowd-sourced data where annotators are expected to make use of the full label set. In addition, unlike previous weak supervision approaches, we allow labelling functions to produce probabilistic predictions instead of deterministic values. The aggregation model described in Section 3.2 directly captures these properties in the emission model associated with each labelling function.
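As an illustration, a labelling function in this setting can be viewed as mapping a document to spans carrying a probability distribution over a subset of labels. The sketch below uses hypothetical names (`Span`, `money_detector`) and is not taken from the released implementation:

```python
# Minimal sketch of the labelling-function interface assumed above: each
# function returns spans with *probabilistic* label predictions, and may
# cover only a subset of the label scheme.
from dataclasses import dataclass, field

@dataclass
class Span:
    start: int                 # token offset (inclusive)
    end: int                   # token offset (exclusive)
    label_probs: dict = field(default_factory=dict)  # e.g. {"MONEY": 0.9}

def money_detector(tokens):
    """Heuristic labelling function: tag '$' followed by a number as MONEY."""
    spans = []
    for i, tok in enumerate(tokens[:-1]):
        if tok == "$" and tokens[i + 1].replace(".", "", 1).isdigit():
            spans.append(Span(i, i + 2, {"MONEY": 0.9, "CARDINAL": 0.1}))
    return spans

spans = money_detector(["It", "cost", "$", "3.5", "million", "."])
```

Such a function abstains on all other tokens, which the aggregation model must account for.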
We first briefly describe the labelling functions integrated into the current system. We review in Section 3.2 the aggregation model employed to combine the labelling predictions. The final labelling model is presented in Section 3.3. The complete list of 52 labelling functions employed in the experiments is available in Appendix A.

Labelling functions
Out-of-domain NER models The first set of labelling functions consists of sequence labelling models trained on domains for which labelled data is available. In the experiments detailed in Section 4, we use four such models, respectively trained on Ontonotes (Weischedel et al., 2011), CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003),² the Broad Twitter Corpus (Derczynski et al., 2016) and a NER-annotated corpus of SEC filings (Salinas Alvarado et al., 2015).
For the experiments in this paper, all aforementioned models rely on a transition-based NER model (Lample et al., 2016) which extracts features with a stack of four convolutional layers with a filter size of three and residual connections. The model uses attention features and a multi-layer perceptron to select the next transition. It is initialised with GloVe embeddings (Pennington et al., 2014) and implemented in spaCy (Honnibal and Montani, 2017). However, the proposed approach does not impose any constraints on the model architecture, and alternative approaches based on e.g. contextualised embeddings can also be employed.
Gazetteers As in distant supervision approaches, we include a number of gazetteers from large knowledge bases to identify named entities. Concretely, we use resources from Wikipedia (Geiß et al., 2018), Geonames (Wick, 2015), the Crunchbase Open Data Map and DBPedia (Lehmann et al., 2015), along with lists of countries, languages, nationalities and religious or political groups.
To efficiently search for occurrences of these entities in large text collections, we first convert each knowledge base into a trie data structure. Prefix search is then applied to extract matches (using both case-sensitive and case-insensitive mode, as they have distinct precision-recall trade-offs).
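This trie-based matching can be sketched as follows, assuming a token-level trie with longest-match extraction (the gazetteer entries shown are purely illustrative):

```python
# Sketch of trie-based gazetteer matching: each knowledge base is compiled
# into a token-level trie, and the longest match starting at each token
# position is extracted.
def build_trie(entries):
    root = {}
    for tokens, label in entries:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<END>"] = label              # mark a complete entity
    return root

def find_matches(tokens, trie, lowercase=False):
    """Return (start, end, label) for the longest match at each position."""
    toks = [t.lower() for t in tokens] if lowercase else tokens
    matches = []
    for i in range(len(toks)):
        node, best = trie, None
        for j in range(i, len(toks)):
            node = node.get(toks[j])
            if node is None:
                break
            if "<END>" in node:
                best = (i, j + 1, node["<END>"])
        if best:
            matches.append(best)
    return matches

gazetteer = build_trie([(("New", "York"), "LOC"),
                        (("New", "York", "Times"), "ORG")])
result = find_matches(["The", "New", "York", "Times", "reported"], gazetteer)
```

In practice, case-sensitive and case-insensitive tries are kept separate, since (as noted above) they exhibit distinct precision-recall trade-offs.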
Heuristic functions We also include various heuristic functions, each specialised in the recognition of specific types of named entities. Several functions are dedicated to the recognition of proper names based on casing, part-of-speech tags or dependency relations. In addition, we integrate a variety of handcrafted functions relying on regular expressions to detect occurrences of various entities (see Appendix A for details). A probabilistic parser specialised in the recognition of dates, times, money amounts, percents, and cardinal/ordinal values (Braun et al., 2017) is also incorporated.
Document-level relations All labelling functions described above rely on local decisions on tokens or phrases. However, texts are not loose collections of words, but exhibit a high degree of internal coherence (Grosz and Sidner, 1986;Grosz et al., 1995) which can be exploited to further improve the annotations.
We introduce one labelling function to capture label consistency constraints within a document. As noted in (Krishnan and Manning, 2006; Wang et al., 2018), named entities occurring multiple times through a document have a high probability of belonging to the same category. For instance, while Komatsu may refer to either a Japanese town or a multinational corporation, a text including this mention will either be about the town or the company, but rarely both at the same time. To capture these non-local dependencies, we define the following label consistency model: given a text span e occurring in a given document, we look for all spans Z_e in the document that contain the same string as e. The (probabilistic) output of the labelling function then corresponds to the relative frequency of each label l for that string in the document:

P(l | e) = (1 / |Z_e|) Σ_{z ∈ Z_e} P_{label(z)}(l)

The above formula depends on the distributions P_{label(z)}, which can be defined on the basis of other labelling functions. Alternatively, a two-stage model similar to (Krishnan and Manning, 2006) could be employed, first aggregating local labelling functions and subsequently applying document-level functions to the aggregated predictions.

Another insight from Grosz and Sidner (1986) is the importance of the attentional structure. When introduced for the first time, named entities are often referred to in an explicit and univocal manner, while subsequent mentions (once the entity is part of the focus structure) frequently rely on shorter references. The first mention of a person in a given text is, for instance, likely to include the person's full name, which is often shortened to the last name in subsequent mentions. As in Ratinov and Roth (2009), we determine whether a proper name is a substring of another entity mentioned earlier in the text. If so, the labelling function replicates the label distribution of the earlier entity.
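The label-consistency function can be sketched as follows, assuming span-level outputs (string, label distribution) have been collected from other labelling functions over the document:

```python
# Sketch of the document-level label consistency function: for each entity
# string, output the relative frequency of each label over all occurrences
# of that string in the document.
from collections import Counter

def doc_level_label_fn(spans_with_probs):
    """spans_with_probs: list of (string, {label: prob}) pairs produced by
    other labelling functions over one document."""
    totals, counts = {}, Counter()
    for text, probs in spans_with_probs:
        counts[text] += 1
        acc = totals.setdefault(text, Counter())
        for label, p in probs.items():
            acc[label] += p
    return {text: {l: v / counts[text] for l, v in acc.items()}
            for text, acc in totals.items()}

# Three occurrences of "Komatsu"; most evidence points towards ORG:
doc = [("Komatsu", {"ORG": 1.0}),
       ("Komatsu", {"ORG": 0.6, "LOC": 0.4}),
       ("Komatsu", {"ORG": 1.0})]
freqs = doc_level_label_fn(doc)
```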

Aggregation model
The outputs of these labelling functions are then aggregated into a single layer of annotation through an aggregation model. As we do not have access to labelled data for the target domain, this model is estimated in a fully unsupervised manner.
Model We assume a list of J labelling functions {λ_1, ..., λ_J} and a list of S mutually exclusive NER labels {l_1, ..., l_S}. The aggregation model is represented as an HMM, in which the states correspond to the true underlying labels. This model has multiple emissions (one per labelling function) assumed to be mutually independent conditional on the latent underlying label.
Formally, for each token i ∈ {1, ..., n} and labelling function j, we assume a Dirichlet distribution for the probability labels P_{ij}. The parameters of this Dirichlet are separate vectors α_j^{s_i} ∈ [0, 1]^S for each of the latent states s_i ∈ {1, ..., S}. The latent states are assumed to have a Markovian dependence structure between the tokens {1, ..., n}. This results in an HMM expressed as a dependent mixture of Dirichlets:

P_{ij} | s_i ~ Dirichlet(α_j^{s_i})    p(s_i | s_{i−1}) ∝ exp(ω_{(s_i, s_{i−1})})

Here, ω_{(s_i, s_{i−1})} ∈ R are the parameters of the transition probability matrix, controlling for a given state s_{i−1} the probability of a transition to state s_i. Figure 2 illustrates the model structure.
Parameter estimation The learnable parameters of this HMM are (a) the transition matrix between states and (b) the α vectors of the Dirichlet distribution associated with each labelling function. The transition matrix is of size |S| × |S|, while we have |S| × |J| α vectors, each of size |S|. The parameters are estimated with the Baum-Welch algorithm, a variant of the EM algorithm that relies on the forward-backward algorithm to compute the statistics for the expectation step.

[Figure 2: Structure of the aggregation model, with one emission per labelling function j ∈ {1, ..., J}.]
To ensure faster convergence, we introduce an additional constraint on the likelihood function: for each token position i, the corresponding latent label s_i must have a non-zero probability in at least one labelling function (the likelihood of this label is otherwise set to zero for that position). In other words, the aggregation model will only predict a particular label if this label is produced by at least one labelling function. This simple constraint facilitates EM convergence, as it restricts the state space to a few possible labels at every time step.

Prior distributions The HMM described above can be provided with informative priors. In particular, the initial distribution over the latent states can be given a Dirichlet prior based on the label counts δ observed for the most reliable labelling function:³

π ~ Dirichlet(δ)

The prior for each row k of the transition probability matrix is also a Dirichlet, based on the frequencies κ_k of transitions between the observed classes for the most reliable labelling function:

A_k ~ Dirichlet(κ_k)

Finally, to facilitate convergence of the EM algorithm, informative starting values can be specified for the emission model of each labelling function.

³ In our experiments, the most reliable labelling function was found to be the NER model trained on Ontonotes 5.0.
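The support constraint described above can be sketched as follows (the label set and function outputs are toy examples; the actual implementation may differ):

```python
# Sketch of the likelihood constraint: at each token, the latent label is
# restricted to labels given non-zero probability by at least one labelling
# function (plus the null label O).
LABELS = ["O", "PER", "ORG", "LOC"]

def allowed_labels(token_outputs):
    """token_outputs: list over labelling functions, each a {label: prob}
    dict (empty dict = abstain). Returns the admissible latent labels."""
    support = {"O"}                      # the null label is always possible
    for probs in token_outputs:
        support.update(l for l, p in probs.items() if p > 0)
    return [l for l in LABELS if l in support]

# Two functions fire on this token, so EM only considers O, PER and LOC:
mask = allowed_labels([{"PER": 0.8}, {}, {"LOC": 0.3}])
```

Restricting the support in this way shrinks the effective state space at each position, which is what speeds up the E-step.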
Assuming we can provide rough estimates of the recall r_{jk} and precision ρ_{jk} of labelling function j on label k, the initial values for the parameters of the emission model can be expressed as:

α_j^{s}[k] ∝ r_{jk} if s = k,    α_j^{s}[k] ∝ (1 − ρ_{jk}) if s ≠ k

The probability of observing a given label k emitted by the labelling function j is thus proportional to its recall if the true label is indeed k. Otherwise (i.e. if the labelling function made an error), the probability of emitting k decreases with the precision of the labelling function j.
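A hedged sketch of this initialisation follows; the exact normalisation and pseudo-count scaling are assumptions, since the text only specifies the proportionality:

```python
# Sketch of emission initialisation from rough recall/precision estimates
# for one labelling function. The `strength` scaling into Dirichlet
# pseudo-counts is an assumption, not taken from the paper.
def init_emission_alpha(labels, recall, precision, strength=10.0):
    """recall, precision: {label: float} estimates. Returns, per true
    label, the initial alpha vector over observed labels."""
    alpha = {}
    for true in labels:
        row = []
        for obs in labels:
            if obs == true:
                row.append(recall[true])            # correct detection
            else:
                row.append(1.0 - precision[obs])    # spurious emission of obs
        total = sum(row)
        alpha[true] = [strength * v / total for v in row]
    return alpha

alpha = init_emission_alpha(["PER", "ORG"],
                            recall={"PER": 0.8, "ORG": 0.6},
                            precision={"PER": 0.9, "ORG": 0.7})
```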
Decoding Once the parameters of the HMM model are estimated, the forward-backward algorithm can be employed to associate each token marginally with a posterior probability distribution over possible NER labels (Rabiner, 1990).
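As an illustration, the posterior marginals can be obtained with a standard forward-backward pass; the toy parameters below (two states, two tokens) are purely illustrative:

```python
# Minimal forward-backward pass on a toy chain, showing how each token is
# associated with a posterior distribution over labels once the HMM is
# estimated.
def forward_backward(pi, A, emis):
    """pi: initial probs, A: transition matrix, emis: per-step emission
    likelihoods; all plain nested lists. Returns posterior marginals."""
    n, S = len(emis), len(pi)
    fwd = [[pi[s] * emis[0][s] for s in range(S)]]
    for t in range(1, n):
        fwd.append([emis[t][s] * sum(fwd[-1][r] * A[r][s] for r in range(S))
                    for s in range(S)])
    bwd = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [sum(A[s][r] * emis[t + 1][r] * bwd[t + 1][r]
                      for r in range(S)) for s in range(S)]
    post = []
    for t in range(n):
        unnorm = [fwd[t][s] * bwd[t][s] for s in range(S)]
        z = sum(unnorm)
        post.append([u / z for u in unnorm])
    return post

# States 0 = O, 1 = PER; the emissions strongly suggest PER at step 1:
post = forward_backward([0.9, 0.1],
                        [[0.8, 0.2], [0.3, 0.7]],
                        [[0.9, 0.1], [0.2, 0.8]])
```

A production implementation would work in log space (or with per-step scaling) to avoid underflow on long sequences.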

Sequence labelling model
Once the labelling functions have been aggregated on documents from the target domain, we can train a sequence labelling model on the unified annotations, without imposing any constraints on the type of model to use. To take advantage of the posterior marginal distribution p̃ over the latent labels, the optimisation should seek to minimise the expected loss with respect to p̃:

min_θ − Σ_i Σ_s p̃_i(s) log h_θ(i, s)

where h_θ(i, s) denotes the probability assigned by the sequence labelling model to label s for token i. This is equivalent to minimising the cross-entropy error between the outputs of the neural model and the probabilistic labels produced by the aggregation model.
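This expected loss can be sketched as follows, with soft labels from the aggregation model and predicted distributions from the sequence model (the function name and data layout are illustrative):

```python
# Sketch of the expected cross-entropy loss with soft (probabilistic)
# labels, as used to train the final sequence labelling model.
import math

def expected_cross_entropy(posterior, model_probs):
    """posterior: aggregated soft labels per token; model_probs: the
    model's predicted distributions. Both are lists of {label: prob}."""
    loss = 0.0
    for p_tilde, h in zip(posterior, model_probs):
        loss -= sum(p * math.log(h[label]) for label, p in p_tilde.items())
    return loss / len(posterior)

soft = [{"O": 0.1, "PER": 0.9}]          # aggregated HMM posterior
good = [{"O": 0.1, "PER": 0.9}]          # model matching the posterior
bad  = [{"O": 0.9, "PER": 0.1}]          # model contradicting it
```

As expected, the loss is lower when the model's distribution matches the soft labels than when it contradicts them.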

Evaluation
We evaluate the proposed approach on two English-language datasets, namely the CoNLL 2003 dataset and a collection of sentences from Reuters and Bloomberg news articles annotated with named entities by crowd-sourcing. We include a second dataset in order to evaluate the approach with a more fine-grained set of NER labels than the ones in CoNLL 2003. As the objective of this paper is to compare approaches to unsupervised domain adaptation, we do not rely on any labelled data from these two target domains.

Data
CoNLL 2003 The CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003) consists of 1163 documents, including a total of 35089 entities spread over 4 labels: ORG, PER, LOC and MISC.
Reuters & Bloomberg We additionally crowd-annotate 1054 sentences from Reuters and Bloomberg news articles from Ding et al. (2014). We instructed the annotators to tag sentences with the following 9 Ontonotes-inspired labels: PERSON, NORP, ORG, LOC, PRODUCT, DATETIME, PERCENT, MONEY, QUANTITY. Each sentence was annotated by at least two annotators, and a qualifying test with gold-annotated questions was conducted for quality control. Cohen's κ for sentences with two annotators is 0.39, while Krippendorff's α for three annotators is 0.44. We had to remove QUANTITY labels from the annotations, as the crowd results for this label were highly inconsistent.

Baselines
Ontonotes-trained NER The first baseline corresponds to a neural sequence labelling model trained on the Ontonotes 5.0 corpus. We use here the same model from Section 3.1, which is the single best-performing labelling function (that is, without aggregating multiple predictions).
We also experimented with other neural architectures, but these performed similarly or worse than the transition-based model, presumably because they are more prone to overfitting on the source domain.
Majority voting (MV) The simplest method for aggregating outputs is majority voting, i.e. outputting the most frequent label among the ones predicted by each labelling function. However, specialised labelling functions will output O for most tokens, which means that the majority label is typically O. To mitigate this problem, we first look at tokens that are marked with a non-O label by at least T labelling functions (where T is a hyperparameter tuned experimentally), and then apply majority voting on this set of non-O labels.
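This thresholded majority vote can be sketched as:

```python
# Sketch of the majority-voting baseline: require at least T non-O votes
# before taking the most frequent non-O label.
from collections import Counter

def majority_vote(token_predictions, T=2):
    """token_predictions: per-function labels for one token ('O' = no
    entity). T is the hyperparameter tuned experimentally."""
    non_o = [l for l in token_predictions if l != "O"]
    if len(non_o) < T:
        return "O"
    return Counter(non_o).most_common(1)[0][0]

label = majority_vote(["O", "PER", "PER", "O", "LOC"], T=2)
```

Without the threshold, the abundance of O predictions from specialised functions would make O the majority label almost everywhere.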
Snorkel model The Snorkel framework does not directly support sequence labelling tasks, as data points are required to be independent. However, heuristics can be used to extract named-entity candidates and then apply labelling functions to infer their most likely labels. For this baseline, we use the three functions nnp detector, proper detector and compound detector (see Appendix A) to generate candidate spans. We then create a matrix expressing the output of each labelling function for each span (including a specific "abstain" value to denote the absence of prediction) and run the matrix-completion-style approach of Ratner et al. (2019) to aggregate the predictions from all functions.
mSDA is a strong domain adaptation baseline (Chen et al., 2012) which augments the feature space of a model with intermediate representations learned using stacked denoising autoencoders. In our case, we learn the mSDA representations on the unlabelled source and target domain data. These 800-dimensional vectors are concatenated with 300-dimensional word embeddings and fed as input to a two-layer LSTM with a skip connection. Finally, we train the LSTM on the labelled source data and test on the target domain.
AdaptaBERT This baseline corresponds to a state-of-the-art unsupervised domain adaptation approach (AdaptaBERT) (Han and Eisenstein, 2019). The approach first uses unlabelled data from both the source and target domains to domain-tune a pretrained BERT model. The model is then task-tuned in a supervised fashion on the source domain labelled data (Ontonotes). At inference time, the model makes use of the pretraining and domain tuning to predict entities in the target domain. In our experiments, we use the cased version of the base BERT model and perform three fine-tuning epochs for both domain-tuning and task-tuning. We additionally include an ensemble model, which averages the predictions of five BERT models fine-tuned with different random seeds.

Mixtures of multinomials
Following the notation from Section 3.2, we define Y_{ijk} = I(P_{ijk} = max_{k' ∈ {1,...,S}} P_{ijk'}) as indicating the most probable label k for word i according to labelling function j. One can then model Y_{ij} with a multinomial distribution. The first four baselines listed below (the fifth assumes a Markovian dependence between the latent states) use an independent mixture of multinomials for Y_{ij}, i.e. p(s_i, s_{i−1}) = p(s_i) p(s_{i−1}).

Accuracy model (ACC) (Rodrigues et al., 2014) assumes that each labelling function j has a single accuracy parameter π_j, shared across all tokens and labels.

Confusion vector (CV) (Nguyen et al., 2017a) extends ACC by relying on separate success probabilities π_{jk} for each token label k.

Confusion matrix (CM) (Dawid and Skene, 1979) allows for distinct accuracies conditional on the latent states, resulting in a full confusion matrix per labelling function.

Sequential confusion matrix (SEQ) extends the CM model of Simpson and Gurevych (2019) with an "auto-regressive" component in the observed part of the model, assuming a dependence on a covariate indicating whether the label has remained unchanged for a given source.

Dependent confusion matrix (DCM) combines the distinct state-conditional accuracies of the CM model with the Markovian dependence between latent states from Section 3.2.

Results
The evaluation results are shown in Tables 1 and 2, respectively for the CoNLL 2003 data and the crowd-annotated sentences. The metrics are the (micro-averaged) precision, recall and F1 scores at both the token level and the entity level. In addition, we indicate the token-level cross-entropy error (in log scale). As the labelling functions are defined on a richer annotation scheme than the four labels of CoNLL 2003, we map GPE to LOC, and EVENT, FAC, LANGUAGE, LAW, NORP, PRODUCT and WORK OF ART to MISC. The results for the ACC and CV baselines are not included, as their parameter estimation did not converge and hence did not provide reliable posteriors. Table 1 further details the results for subsets of labelling functions. Of particular interest is the contribution of document-level functions, which boost the entity-level F1 from 0.702 to 0.716. This highlights the importance of these relations in NER.
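The label mapping used for this projection can be expressed directly as:

```python
# Mapping from the richer annotation scheme onto the four CoNLL 2003
# labels, as described above. PER, ORG and LOC pass through unchanged.
TO_CONLL = {"GPE": "LOC",
            **{l: "MISC" for l in ["EVENT", "FAC", "LANGUAGE", "LAW",
                                   "NORP", "PRODUCT", "WORK OF ART"]}}

def map_label(label):
    return TO_CONLL.get(label, label)
```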
The last line of the two tables reports the performance of the sequence labelling model (Section 3.3) trained on the aggregated labels. We observe that its performance remains close to that of the HMM-aggregated labels. This shows that the knowledge from the labelling functions can be injected into a standard neural model without substantial loss.

Discussion
Although not shown in the results due to space constraints, we also analysed whether the informative priors described in Section 3.2 influenced the performance of the aggregation model. We found informative and non-informative priors to yield similar performance for CoNLL 2003. However, the performance of non-informative priors was very poor on the Reuters and Bloomberg sentences (F 1 at 0.12), thereby demonstrating the usefulness of informative priors for small datasets.
We provide in Figure 3 an example with a few selected labelling functions. In particular, we can observe that the Ontonotes-trained NER model mistakenly labels "Heidrun" as a product. This erroneous label is, however, counter-balanced by other labelling functions, notably a document-level function looking at the global label frequency of this string through the document. We do, however, notice a few remaining errors, e.g. the labelling of "Status Weekly" as an organisation.

Figure 4 illustrates the pairwise agreement and disagreement between labelling functions on the CoNLL 2003 dataset. If both labelling functions make the same prediction on a given token, we count this as an agreement, whereas conflicting predictions (ignoring O labels) are seen as a disagreement. Large differences may exist between these functions for specific labels, especially MISC. The functions with the highest overlap are those making predictions on all labels, while labelling functions specialised to a few labels (such as legal detector) often have less overlap. We also observe that the two gazetteers from Crunchbase and Geonames disagree in about 15% of cases, presumably due to company names that are also geographical locations, as in the earlier Komatsu example.
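The agreement and disagreement counts behind such an analysis can be sketched as follows (tokens where both functions output O are ignored, and a conflict requires two non-O predictions):

```python
# Sketch of pairwise agreement/disagreement between two labelling
# functions, computed at the token level.
def pairwise_agreement(preds_a, preds_b):
    agree = conflict = 0
    for a, b in zip(preds_a, preds_b):
        if a == "O" and b == "O":
            continue                 # neither function fires
        if a == b:
            agree += 1
        elif a != "O" and b != "O":
            conflict += 1            # conflicting non-O predictions
    return agree, conflict

agree, conflict = pairwise_agreement(["ORG", "O", "LOC", "O"],
                                     ["ORG", "O", "ORG", "PER"])
```

Positions where exactly one function abstains (outputs O) count as neither agreement nor conflict.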
In terms of computational efficiency, the estimation of HMM parameters is relatively fast, requiring less than 30 mins on the entire CoNLL 2003 data. Once the aggregation model is estimated, it  can be directly applied to new texts with a single forward-backward pass, and can therefore scale to datasets with hundreds of thousands of documents. This runtime performance is an important advantage compared to approaches such as AdaptaBERT (Han and Eisenstein, 2019) which are relatively slow at inference time. The proposed approach can also be ported to other languages than English, although heuristic functions and gazetteers will need to be adapted to the target language.

Conclusion
This paper presented a weak supervision model for sequence labelling tasks such as Named Entity Recognition. To leverage all possible knowledge sources available for the task, the approach uses a broad spectrum of labelling functions, including data-driven NER models, gazetteers, heuristic functions, and document-level relations between entities. Labelling functions may be specialised to recognise specific labels while ignoring others. Furthermore, unlike previous weak supervision approaches, labelling functions may produce probabilistic predictions. The outputs of these labelling functions are then merged together using a hidden Markov model whose parameters are estimated with the Baum-Welch algorithm. A neural sequence labelling model can finally be learned on the basis of these unified predictions.
Evaluation results on two datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) show that the method can boost NER performance by about 7 percentage points on entity-level F 1 . In particular, the proposed model outperforms the unsupervised domain adaptation approach through contextualised embeddings of Han and Eisenstein (2019). Of specific linguistic interest is the contribution of document-level labelling functions, which take advantage of the internal coherence and narrative structure of the texts.
Future work will investigate how to take into account potential correlations between labelling functions in the aggregation model. Furthermore, some of the labelling functions can be rather noisy, and model selection of the optimal subset of labelling functions might well improve the performance of our model. Model selection approaches that could be adapted are discussed in Adams and Beling (2019) and Hubin (2019). We also wish to evaluate the approach on other types of sequence labelling tasks beyond Named Entity Recognition.

[Table 3: Full list of labelling functions employed in the experiments. The neural NER models are provided in two versions: one that directly outputs the raw model predictions, and one that runs a shallow postprocessing step on the model predictions to correct known recognition errors (for instance, ensuring that a numeric amount that is either preceded or followed by a currency symbol is always classified as an entity of type MONEY).]

B Label matching problem
The baseline models relying on mixtures of multinomials have to address the so-called label matching problem, which needs some extra care.
The following approach was employed in the experiments from Section 4: • First, we set strong initial values for the probabilities σ of the individual classes, based on the frequency of appearance of these classes in the most reliable labelling function. This is expected to increase the probability of EM exploring the mode around the initialised values.
• Second, we perform post-processing and set the labels to the states corresponding to the labels with the highest pairwise correlations to the latent labels from one of the three options:

C Detailed results
In Table 4, we provide the detailed results per NER label for the CoNLL 2003 data, which were presented in micro-averaged form in Table 1.