A Bayesian Approach for Sequence Tagging with Crowds

Current methods for sequence tagging, a core task in NLP, are data-hungry, which motivates the use of crowdsourcing as a cheap way to obtain labelled data. However, annotators are often unreliable and current aggregation methods cannot capture common types of span annotation error. To address this, we propose a Bayesian method for aggregating sequence tags that reduces errors by modelling sequential dependencies between the annotations as well as the ground-truth labels. By taking a Bayesian approach, we account for uncertainty in the model due to both annotator errors and the lack of data for modelling annotators who complete few tasks. We evaluate our model on crowdsourced data for named entity recognition, information extraction and argument mining, showing that our sequential model outperforms the previous state of the art, and that Bayesian approaches outperform non-Bayesian alternatives. We also find that our approach can reduce crowdsourcing costs through more effective active learning, as it better captures uncertainty in the sequence labels when there are few annotations.


Introduction
Current methods for sequence tagging, a core task in NLP, use deep neural networks that require tens of thousands of labelled documents for training (Ma and Hovy, 2016; Lample et al., 2016). This presents a challenge when facing new domains or tasks, where obtaining labels is often time-consuming or costly. Labelled data can be obtained cheaply by crowdsourcing, in which large numbers of untrained workers annotate documents instead of more expensive experts. For sequence tagging, this results in multiple sequences of unreliable labels for each document.
Probabilistic methods for aggregating crowdsourced data have been shown to be more accurate than simple heuristics such as majority voting (Raykar et al., 2010; Sheshadri and Lease, 2013; Rodrigues et al., 2013; Hovy et al., 2013). However, existing methods for aggregating sequence labels cannot model dependencies between the annotators' labels (Rodrigues et al., 2014; Nguyen et al., 2017) and hence do not account for their effect on annotator noise and bias. In this paper, we remedy this by proposing a sequential annotator model and applying it to tasks that follow a beginning, inside, outside (BIO) scheme, in which the first token in a span of type 'x' is labelled 'B-x', subsequent tokens are labelled 'I-x', and tokens outside spans are labelled 'O'.
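As a small illustrative sketch (our own code, not from the paper), the BIO scheme described above can be decoded back into typed spans as follows:

```python
def extract_spans(tags):
    """Return (start, end, type) tuples for each BIO-encoded span.

    'B-x' marks the first token of a span of type x, 'I-x' its
    continuation, and 'O' marks tokens outside any span.
    """
    spans, start, current = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the final span
        if tag.startswith("B-") or tag == "O":
            if current is not None:
                spans.append((start, i, current))
                current = None
            if tag.startswith("B-"):
                start, current = i, tag[2:]
        elif tag.startswith("I-") and current is None:
            # invalid 'O' -> 'I' transition; treat it as a new span in this sketch
            start, current = i, tag[2:]
    return spans

# e.g. "John Smith visited Paris ." tagged as a PER span and a LOC span:
spans = extract_spans(["B-PER", "I-PER", "O", "B-LOC", "O"])
```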
When learning from noisy or small datasets, commonly-used methods based on maximum likelihood estimation may produce over-confident predictions (Xiong et al., 2011;Srivastava et al., 2014). In contrast, Bayesian inference accounts for model uncertainty when making predictions. Unlike alternative methods that optimize the values for model parameters, Bayesian inference integrates over all possible values of a parameter, weighted by a prior distribution that captures background knowledge. The resulting posterior probabilities improve downstream decision making as they include the probability of errors due to a lack of knowledge. For example, during active learning, posterior probabilities assist with selecting the most informative data points (Settles, 2010).
In this paper, we develop Bayesian sequence combination (BSC), building on prior work that has demonstrated the advantages of Bayesian inference for aggregating unreliable classifications (Kim and Ghahramani, 2012; Simpson et al., 2013; Felt et al., 2016; Paun et al., 2018). BSC is the first fully-Bayesian method for aggregating sequence labels from multiple annotators. As a core component of BSC, we also introduce the sequential confusion matrix (seq), a probabilistic model of annotator noise and bias, which goes beyond previous work by modelling sequential dependencies between annotators' labels. Further contributions include a theoretical comparison of probabilistic models of annotator noise and bias, and an empirical evaluation on three sequence labelling tasks, in which BSC with seq consistently outperforms the previous state of the art. We make all of our code and data freely available 1 .


Related Work

Sheshadri and Lease (2013) benchmarked several aggregation models for non-sequential classifications, obtaining the most consistent performance from the model of Raykar et al. (2010), who represent the reliability of individual annotators using probabilistic confusion matrices, as proposed by Dawid and Skene (1979). Simpson et al. (2013) showed that a Bayesian variant of Dawid and Skene's model can outperform maximum likelihood approaches and simple heuristics when combining crowds of image annotators. This Bayesian variant, independent Bayesian classifier combination (IBCC) (Kim and Ghahramani, 2012), was originally used to combine ensembles of automated classifiers rather than human annotators. While traditional ensemble methods such as boosting focus on how to generate new classifiers (Dietterich, 2000), IBCC is concerned with modelling the reliability of each classifier in a given set. To reduce the number of parameters in multi-class problems, Hovy et al. (2013) proposed MACE, and showed that a Bayesian treatment of it performed better on NLP tasks. Paun et al. (2018) further illustrated the advantages of Bayesian models of annotator ability on NLP classification tasks with different levels of annotation sparsity and noise.

We expand this previous work by detailing the relationships between several annotator models and extending them to sequential classification. Here we focus on the core annotator representation, rather than extensions for clustering annotators (Venanzi et al., 2014; Moreno et al., 2015), modelling their dynamics (Simpson et al., 2013), adapting to task difficulty (Whitehill et al., 2009; Bachrach et al., 2012), or the time spent by annotators (Venanzi et al., 2016).
Methods for aggregating sequence labels include CRF-MA (Rodrigues et al., 2014), a CRF-based model that assumes only one annotator is correct for any given label. Recently, Nguyen et al. (2017) proposed a hidden Markov model (HMM) approach, called HMM-crowd, that outperformed CRF-MA. Both CRF-MA and HMM-crowd use simpler annotator models than Dawid and Skene (1979) that do not capture the effect of sequential dependencies on annotator reliability. Neither CRF-MA nor HMM-crowd uses a fully Bayesian approach, which has been shown to be more effective for handling uncertainty due to noise in crowdsourced data for non-sequential classification (Kim and Ghahramani, 2012; Simpson et al., 2013; Venanzi et al., 2014; Moreno et al., 2015). In this paper, we develop a sequential annotator model and an approximate Bayesian method for aggregating sequence labels.

1 http://github.com/ukplab/arxiv2018-bayesian-ensembles

Modeling Sequential Annotators
When combining multiple annotators with varying skill levels, we can improve performance by modelling their individual noise and bias using a probabilistic model. Here, we describe several models that do not consider dependencies between annotations in a sequence, before defining seq, a new extension that captures sequential dependencies. Probabilistic annotator models each define a different function, A, for the likelihood that the annotator chooses label c_τ given the true label t_τ, for the τth token in a sequence.
Accuracy model (acc): the basis of several previous methods (Donmez et al., 2010; Rodrigues et al., 2013), acc uses a single parameter, π, for each annotator's accuracy:

A(c_τ = i | t_τ = j) = π if i = j, and (1 − π)/(J − 1) otherwise,

where J is the number of classes. This may be unsuitable when one class label dominates the data, since a spammer who always selects the most common label will nonetheless have a high π.

Spamming model (spam): proposed as part of MACE (Hovy et al., 2013), this model also assumes constant accuracy, π, but that otherwise the annotator labels according to a spamming distribution, ξ, that is independent of the true label, t_τ:

A(c_τ = i | t_τ = j) = π δ_{ij} + (1 − π) ξ_i,

where δ is the Kronecker delta.
This addresses the case where spammers choose the dominant label but does not explicitly model different error rates in each class. For example, if an annotator is better at detecting type 'x' spans than type 'y', or if they frequently miss the first token in a span, thereby labelling the start of a span as 'O' when the true label is 'B-x', this would not be explicitly modelled by spam.
Confusion vector (CV): this approach learns a separate accuracy for each class label (Nguyen et al., 2017) using a parameter vector, π, of size J:

A(c_τ = i | t_τ = j) = π_j if i = j, and (1 − π_j)/(J − 1) otherwise.

This model does not capture spamming patterns where one of the incorrect labels has a much higher likelihood than the others.

Confusion matrix (CM) (Dawid and Skene, 1979): this model can be seen as an expansion of the confusion vector so that π becomes a J × J matrix with values given by:

A(c_τ = i | t_τ = j) = π_{j,i}.

This requires a larger number of parameters, J², compared to the J + 1 parameters of MACE or the J parameters of the confusion vector. Like spam, CM can model spammers who frequently choose one label regardless of the ground truth, but it also models different error rates and biases for each class. However, CM ignores dependencies between annotations in a sequence, such as the fact that an 'I' cannot immediately follow an 'O'.

Sequential confusion matrix (seq): we introduce a new extension of the confusion matrix that models the dependency of each label in a sequence on its predecessor, giving the following likelihood:

A(c_τ = i | c_{τ−1} = h, t_τ = j) = π_{j,h,i},

where π is now three-dimensional with size J × J × J. In the case of disallowed transitions, e.g. from c_{τ−1} = 'O' to c_τ = 'I', the value π_{j,c_{τ−1},c_τ} ≈ 0, ∀j, is fixed a priori. The sequential model can capture phenomena such as a tendency toward overly long sequences, by learning that 'I' is more likely to follow another 'I', so that π_{O,I,I} > π_{O,I,O}. A tendency to split spans by inserting 'B' in place of 'I' can be modelled by increasing the value of π_{I,I,B} without affecting π_{I,B,B} and π_{I,O,B}.

The annotator models presented in this section include the most widespread models for NLP annotation tasks, and can be seen as extensions of one another.
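As a hypothetical sketch of the annotator models above (function names and array layouts are our own, not from the paper), each likelihood A can be written as:

```python
import numpy as np

J = 3  # number of classes, e.g. {B, I, O}

def acc_lik(pi, c, t):
    # acc: single accuracy parameter pi; errors spread uniformly over J-1 labels
    return pi if c == t else (1.0 - pi) / (J - 1)

def spam_lik(pi, xi, c, t):
    # spam (MACE-style): correct with prob. pi, otherwise draw from
    # the spamming distribution xi, independent of the true label t
    return pi * (c == t) + (1.0 - pi) * xi[c]

def cv_lik(pi_vec, c, t):
    # CV: per-class accuracy, uniform error distribution
    return pi_vec[t] if c == t else (1.0 - pi_vec[t]) / (J - 1)

def cm_lik(pi_mat, c, t):
    # CM: full J x J confusion matrix
    return pi_mat[t, c]

def seq_lik(pi_3d, c, c_prev, t):
    # seq: J x J x J tensor, also conditioning on the previous annotation
    return pi_3d[t, c_prev, c]
```

Each function returns p(c_τ | t_τ) (and, for seq, also conditions on c_{τ−1}), so the models really are successive refinements of one another.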
The choice of annotator model for a particular annotator depends on the developer's understanding of the annotation task: if the annotations have sequential dependencies, this suggests the seq model; for non-sequential classifications CM may be effective with small (≤ 5) numbers of classes; spam may be more suitable if there are many classes, as the number of parameters to learn is low. However, there is also a trade-off between the expressiveness of the model and the number of parameters that must be learned. Simpler models with fewer parameters may be effective if there are only small numbers of annotations from each annotator. The next section shows how these annotator models can be used as components of a complete model for aggregating sequential annotations.

A Generative Model for Bayesian Sequence Combination
To construct a generative model for Bayesian sequence combination (BSC), we first define a hidden Markov model (HMM) with states t_{n,τ} and observations x_{n,τ} using categorical distributions:

t_{n,τ} ∼ Categorical(T_{t_{n,τ−1}}),   x_{n,τ} ∼ Categorical(ρ_{t_{n,τ}}),

where T_j is a row of a transition matrix, T, and ρ_j is a vector of observation likelihoods for state j. For text tagging, n indicates a document and τ a token index, while each state t_{n,τ} is a true sequence label and x_{n,τ} is a token. To provide a Bayesian treatment, we assume that T_j and ρ_j have Dirichlet distribution priors as follows:

T_j ∼ Dirichlet(γ_j),   ρ_j ∼ Dirichlet(κ_j),

where γ_j and κ_j are hyperparameters.
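A minimal sketch of this generative process (our own illustrative code; the variable names and the uniform initial state are assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
J, V, L = 3, 10, 8  # classes, vocabulary size, document length

# Draw rows of the transition matrix T and observation distributions rho
# from their Dirichlet priors with hyperparameters gamma_j and kappa_j.
gamma = np.ones((J, J))
kappa = np.ones((J, V))
T = np.array([rng.dirichlet(gamma[j]) for j in range(J)])
rho = np.array([rng.dirichlet(kappa[j]) for j in range(J)])

def sample_document(T, rho, L, rng):
    """Sample true labels t and tokens x from the HMM for one document."""
    t = np.zeros(L, dtype=int)
    x = np.zeros(L, dtype=int)
    t[0] = rng.choice(J)  # assumption: uniform initial state for this sketch
    for tau in range(L):
        if tau > 0:
            t[tau] = rng.choice(J, p=T[t[tau - 1]])   # t_{n,tau} ~ Cat(T_{t_{n,tau-1}})
        x[tau] = rng.choice(V, p=rho[t[tau]])          # x_{n,tau} ~ Cat(rho_{t_{n,tau}})
    return t, x
```

Crowd annotations c would then be generated from t by the chosen annotator model.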
Next, we assume one of the annotator models described in Section 3 for each of K annotators. Selecting an annotator model is a design choice, and all can be coupled with the Bayesian HMM above to form a complete BSC model. In our experiments in Section 6, we compare different choices of annotator model as components of BSC. All the parameters of these annotator models are probabilities, so to provide a Bayesian treatment, we assume that they have Dirichlet priors. For annotator k's annotator model, we refer to the hyperparameters of its Dirichlet prior as α^{(k)}. The annotator model defines a categorical likelihood over each annotation, c^{(k)}_{n,τ}:

p(c^{(k)}_{n,τ} = i | t_{n,τ} = j, c^{(k)}_{n,τ−1} = h) = A^{(k)}(i, j, h).

The annotators are assumed to be conditionally independent of one another given the true labels, t, which means that their errors are assumed to be uncorrelated. This is a strong assumption, considering that the annotators make their decisions based on the same input data. However, in practice, such dependencies do not usually cause the most probable label to change (Zhang, 2004), so the performance of classifier combination methods is only slightly degraded, and the complexity of modelling dependencies between annotators is avoided (Kim and Ghahramani, 2012).
Joint distribution: the complete model can be represented by the joint distribution:

p(c, t, T, ρ, A | γ, κ, α, x) = ∏_{n=1}^{N} ∏_{τ=1}^{L_n} [ ρ_{t_{n,τ}, x_{n,τ}} T_{t_{n,τ−1}, t_{n,τ}} ∏_{k=1}^{K} A^{(k)}(c^{(k)}_{n,τ} | t_{n,τ}, c^{(k)}_{n,τ−1}) ] ∏_{j=1}^{J} [ p(T_j | γ_j) p(ρ_j | κ_j) ∏_{k=1}^{K} p(A^{(k)}_j | α^{(k)}_j) ],

where c is the set of annotations for all documents from all annotators, t is the set of all sequence labels for all documents, N is the number of documents, L_n is the length of the nth document, J is the number of classes, x is the set of all word sequences for all documents, and ρ, γ and κ are the sets of parameters for all label classes.

Inference using Variational Bayes
Given a set of annotations, c, we obtain a posterior distribution over sequence labels, t, using variational Bayes (VB) (Attias, 2000). Unlike maximum likelihood methods such as standard expectation maximization (EM), VB considers prior distributions and accounts for parameter uncertainty due to noisy or sparse data. In comparison to other Bayesian approaches such as Markov chain Monte Carlo (MCMC), VB is often faster, readily allows incremental learning, and provides easier ways to determine convergence (Bishop, 2006). It has been successfully applied to a wide range of methods, including as the standard learning procedure for LDA (Blei et al., 2003), and to combining non-sequential crowdsourced classifications (Simpson et al., 2013). The trade-off is that we must approximate the posterior with a distribution that factorises between subsets of latent variables:

q(t, T, ρ, A) = q(t) q(T) q(ρ) ∏_{k=1}^{K} q(A^{(k)}).

VB performs approximate inference by updating each variational factor, q(z), in turn, optimising the approximate posterior distribution until it converges. Details of the theory are beyond the scope of this paper, but are explained by Bishop (2006). The VB algorithm is described in Algorithm 1, making use of the update equations for the variational factors given below.
The expectations of ln T and ln ρ can be computed using standard equations for a Dirichlet distribution:

E[ln T_{j,ι}] = Ψ(γ_{j,ι} + N_{j,ι}) − Ψ(∑_{ι′=1}^{J} (γ_{j,ι′} + N_{j,ι′})),

E[ln ρ_{j,w}] = Ψ(κ_{j,w} + o_{j,w}) − Ψ(∑_{w′} (κ_{j,w′} + o_{j,w′})),

where Ψ is the digamma function, N_{j,ι} = ∑_{n=1}^{N} ∑_{τ=1}^{L_n} s_{n,τ,j,ι} is the expected number of times that label ι follows label j, and o_{j,w} is the expected number of times that word w occurs with sequence label j. Similarly, for the seq annotator model, the expectation terms are:

E[ln π^{(k)}_{j,h,i}] = Ψ(α^{(k)}_{j,h,i} + C^{(k)}_{j,h,i}) − Ψ(∑_{i′=1}^{J} (α^{(k)}_{j,h,i′} + C^{(k)}_{j,h,i′})),

where C^{(k)}_{j,h,i} = ∑_{n=1}^{N} ∑_{τ=1}^{L_n} r_{j,n,τ} δ(c^{(k)}_{n,τ−1} = h) δ(c^{(k)}_{n,τ} = i) and δ is the Kronecker delta. For the other annotator models, this equation simplifies, as the values of the previous labels c^{(k)}_{n,τ−1} are not conditioned on.
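For illustration, the Dirichlet expectation update for ln T can be sketched as follows (our own code; to keep the sketch self-contained, the digamma function is approximated as a numerical derivative of ln Γ rather than taken from a scientific library):

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-6):
    # numerical derivative of ln Gamma; adequate precision for a sketch
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def expected_ln_T(gamma, N_counts):
    """E[ln T_{j,i}] = Psi(gamma_ji + N_ji) - Psi(sum_i' (gamma_ji' + N_ji')).

    gamma: (J, J) prior hyperparameters; N_counts: (J, J) expected
    transition counts accumulated from the posterior over labels.
    """
    post = gamma + N_counts                     # Dirichlet posterior pseudo-counts
    row_tot = post.sum(axis=1, keepdims=True)   # normaliser per row of T
    psi = np.vectorize(digamma)
    return psi(post) - psi(row_tot)
```

Note that exp(E[ln T]) rows sum to less than one, reflecting the uncertainty that VB propagates into the label updates.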

Most Likely Sequence Labels
The approximate posterior probabilities of the true labels, r_{j,n,τ}, provide confidence estimates for the labels. However, it is often useful to compute the most probable sequence of labels, t̂_n, using the Viterbi algorithm (Viterbi, 1967). The most probable sequence is particularly useful because, unlike r_{j,n,τ}, the sequence will be consistent with any transition constraints imposed by the priors on the transition matrix, T, such as preventing 'O'→'I' transitions by assigning them zero probability.
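A standard Viterbi decoder over log probabilities can be sketched as follows (illustrative code of the textbook algorithm, not the authors' implementation):

```python
import numpy as np

def viterbi(log_init, log_T, log_emit):
    """Most probable state sequence for one document.

    log_init: (J,) log initial-state probabilities;
    log_T: (J, J) log transition matrix (disallowed transitions at -inf);
    log_emit: (L, J) log likelihood of each observed token under each state.
    """
    L, J = log_emit.shape
    delta = log_init + log_emit[0]          # best log-prob of each state at tau=0
    back = np.zeros((L, J), dtype=int)      # backpointers to the best predecessor
    for tau in range(1, L):
        scores = delta[:, None] + log_T     # (previous state, next state)
        back[tau] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[tau]
    path = [int(delta.argmax())]            # backtrack from the best final state
    for tau in range(L - 1, 0, -1):
        path.append(int(back[tau, path[-1]]))
    return path[::-1]
```

Setting disallowed entries of log_T to −inf guarantees the decoded sequence respects the transition constraints, mirroring the near-zero priors on T.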

Experiments
Datasets. We compare BSC to alternative methods on three NLP datasets containing both crowdsourced and gold sequential annotations. NER, the CoNLL 2003 named entity recognition dataset (Tjong Kim Sang and De Meulder, 2003), contains gold labels for four named entity categories (PER, LOC, ORG, MISC), with crowdsourced labels provided by Rodrigues et al. (2014). PICO (Nguyen et al., 2017) consists of medical paper abstracts that have been annotated by a crowd to indicate text spans that identify the population enrolled in a clinical trial. ARG (Trautmann et al., 2019) contains a mixture of argumentative and non-argumentative sentences, in which the task is to mark the spans that contain pro or con arguments for a given topic. Dataset statistics are shown in Table 1.

Evaluation Metrics. For NER and ARG, we compute F1-scores that require exact span matches to be correct. Incomplete named entities are typically not useful, and for ARG, it is desirable to identify complete argumentative units that make sense on their own. For medical trial populations, partial matches still contain useful information, so for PICO we use a relaxed F1-score, as in Nguyen et al. (2017), which counts the matching fractions of spans when computing precision and recall. We also compute the cross entropy error (CEE, also known as log-loss). While this is a token-level rather than span-level metric, it evaluates the quality of the probability estimates produced by aggregation methods, which are useful for tasks such as active learning, as it penalises over-confident mistakes more heavily.
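The strict and relaxed metrics above can be sketched as follows (our own illustration; spans are assumed to be (start, end, type) or (start, end) tuples with exclusive end indices):

```python
def strict_f1(gold_spans, pred_spans):
    """Strict span-level F1: only exact (start, end, type) matches count."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relaxed_scores(gold_spans, pred_spans):
    """PICO-style relaxed precision/recall: credit the overlapping fraction
    of each span instead of requiring exact matches (types ignored here)."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    precision = sum(max((overlap(p, g) / (p[1] - p[0]) for g in gold_spans), default=0)
                    for p in pred_spans) / max(len(pred_spans), 1)
    recall = sum(max((overlap(g, p) / (g[1] - g[0]) for p in pred_spans), default=0)
                 for g in gold_spans) / max(len(gold_spans), 1)
    return precision, recall
```

Under the strict metric a single boundary error discards the whole span, which is why near-miss predictions are penalised so heavily on NER and ARG.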
Evaluated Methods. We evaluate BSC in combination with all of the annotator models described in Section 3. As well-established non-sequential baselines, we include token-level majority voting (MV), MACE (Hovy et al., 2013), which uses the spam annotator model, Dawid-Skene (DS) (Dawid and Skene, 1979), which uses the CM annotator model, and independent Bayesian classifier combination (IBCC) (Kim and Ghahramani, 2012), which is a Bayesian treatment of Dawid-Skene. We also compare against the state-of-the-art sequential HMM-crowd method (Nguyen et al., 2017), which uses a combination of maximum a posteriori (or smoothed maximum likelihood) estimates for the CV annotator model and variational inference for an integrated hidden Markov model (HMM). HMM-crowd and DS use non-Bayesian inference steps and can be compared with their Bayesian variants, BSC-CV and IBCC, respectively.
Besides the annotator models, BSC also makes use of text features and a transition matrix, T, over true labels. We test the effect of these components by running BSC-CM and BSC-seq with no text features (labelled notext), and without the transition matrix, which is replaced by simple independent class probabilities (labelled \T).
We tune the hyperparameters using grid search on the development sets. To limit the number of hyperparameters to tune, we optimize only three values for BSC: hyperparameters of the transition matrix, γ_j, are set to the same value, γ_0, except for disallowed transitions ('O'→'I', and transitions between types, e.g. 'I-PER'→'I-ORG'), which are set to 10⁻⁶; for the annotator models, all values are set to α_0, except for disallowed transitions, which are set to 10⁻⁶, and a further constant is added to the hyperparameters corresponding to correct annotations (e.g. the diagonal entries in a confusion matrix). This encodes the prior assumption that annotators are more likely to have an accuracy greater than random, which avoids the non-identifiability problem in which the class labels are switched around.

Aggregation Results. This task is to combine multiple crowdsourced labels and predict the true labels. The results are shown in Table 2. BSC-seq outperforms the other approaches, including the previous state of the art, HMM-crowd (significant on all datasets with p < .01 using a two-tailed Wilcoxon signed-rank test). Without the text model (BSC-seq-notext) or the transition matrix (BSC-seq\T), BSC-seq performance decreases heavily, while BSC-CM-notext and BSC-CM\T in some cases outperform BSC-CM. This suggests that seq, with its greater number of parameters to learn, is most effective in combination with the transition matrix and simple text model. On the ARG dataset, the scores are close to zero for BSC-seq\T. Further investigation shows that this is because BSC-seq\T identifies many spans with one or two incorrect labels. Since we use exact span matches to compute true and false positives, these small errors reduce the scores substantially. In particular, we find a large number of missing 'B' tags at the start of spans and misplaced 'O' tags that split spans in the middle.
The performance of all methods varies greatly across the three datasets. With NER, the spans are short and the task is less subjective than PICO or ARG, hence its higher F1 scores. PICO uses a relaxed F1-score, meaning its scores are only slightly lower despite being a more ambiguous task. What constitutes an argument is also ambiguous, so ARG scores are lower, particularly as we use strict span-matching to compute the F1 scores. Raising the scores may be possible in future by using expert labels as training data, i.e. as known values of t, which would help to put more weight on annotators whose labelling patterns are similar to the experts'.
We categorise the errors made by key methods and list the counts for each category in Table 3. All machine learning methods tested here reduce the number of spans that were completely missed by majority voting. Note that BSC completely removes all "invalid" spans ('O'→'I' transitions) due to the sequential model with near-zero prior hyperparameters for those transitions. For PICO and ARG, which contain longer spans, BSC-seq has a lower "length error", the mean difference in the number of tokens between the predicted and gold spans, than the other methods. It also reduces the number of missing spans, although in NER and ARG this comes at the cost of producing more false positives (predicting spans where there are none). Overall, BSC-seq appears to be the best choice for identifying exact span matches and reducing missed spans.
Visualising Annotator Models. To determine whether, in practice, BSC-seq really learns distinctive confusion matrices depending on the previous labels, we plot the learned annotator models for PICO as probabilistic confusion matrices in Figure 1 (for an alternative visualisation, see Figure 3 in the appendix). We cluster the annotators according to their inferred seq parameters and plot the mean confusion matrix for each cluster, revealing clear distinctions between the clusters. The third column, for example, shows annotators with a tendency toward 'B'→'I' transitions regardless of the true label, while other clusters indicate very different labelling behaviour. The model therefore appears able to learn distinct confusion matrices for different workers given previous labels, which supports the use of sequential annotator models.
Active Learning. Active learning is an iterative process that can reduce the amount of labelled data required to train a model. At each iteration, the active learning strategy selects informative data points, obtains their labels, then re-trains a labelling model given the new labels. The updated model is then used in the next iteration to identify the most informative data points. We simulate active learning in a crowdsourcing scenario where the goal is to learn the true labels, t, by selecting documents for the crowd to label. Each document can be labelled multiple times by different workers. In contrast, in a passive learning setup, the number of annotations per document is usually constant across all documents. For example, in the PICO dataset, each document was labelled six times. The aim of active learning is to decrease the number of annotations required by avoiding relabelling documents whose true labels can already be determined with high confidence from fewer labels.
We simulate active learning using the least confidence strategy, shown to be effective by Settles and Craven (2008), as described in Algorithm 2. At each iteration, we estimate t from the current set of crowdsourced labels, c, using one of the methods from our previous experiments as the labelling model, then use this model to select the batch size least confident documents to be labelled by the crowd. If the simulation has requested all of the labels available in our dataset for a document, that document is simply ignored when choosing new batches and is not selected again.
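The document-selection step of this strategy can be sketched as follows (our own code; the function and variable names are assumptions, and the labelling model is assumed to expose p(t*_n | c) for each document):

```python
def least_confidence_batch(doc_ids, seq_probs, labelled_counts, max_labels, batch_size):
    """Pick the batch_size documents whose most probable label sequence
    has the lowest posterior probability, skipping exhausted documents.

    seq_probs[n] = p(t*_n | c), the posterior probability of the most
    likely label sequence t*_n given the current annotations c.
    """
    lc = {n: 1.0 - seq_probs[n] for n in doc_ids
          if labelled_counts[n] < max_labels[n]}   # ignore fully-labelled docs
    ranked = sorted(lc, key=lc.get, reverse=True)  # highest LC_n first
    return ranked[:batch_size]
```

A well-calibrated posterior is what makes this ranking informative, which is where the Bayesian methods are expected to help.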
We hypothesise that BSC will learn more quickly than non-sequential and non-Bayesian methods in the active learning simulation.

while the annotation budget is not exhausted do
    Estimate t from the current set of crowdsourced labels, c, using the chosen labelling model;
    For each document n, compute LC_n = 1 − p(t*_n | c), where p(t*_n | c) is the probability of the most likely sequence of labels, t*_n, for document n;
    Obtain annotations for the batch size documents with the highest values of LC_n (least confidence), and add them to c;
end
Algorithm 2: Active learning simulation using least-confidence sampling.

Figure 2 plots the mean F1 scores over ten repeats of the active learning simulation on the NER dataset (for clarity, we only plot key methods). When the number of iterations is very small, neither IBCC nor DS is able to outperform majority vote, and both produce only a very small benefit as the number of labels grows. This highlights the need for a sequential model such as BSC or HMM-crowd for effective active learning with small numbers of labels. IBCC learns slightly faster than DS, while BSC-CV clearly outperforms HMM-crowd: we believe this difference is due to the Bayesian treatment of IBCC and BSC, which means they are better able to estimate confidence than DS and HMM-crowd, which use maximum likelihood and maximum a posteriori inference. BSC-seq produces the best overall performance, and the gap grows as the number of labels increases, since more data is required to learn the more complex annotator model.

Conclusions
We proposed BSC, a novel Bayesian approach to aggregating sequence labels that can be combined with several different models of annotator noise and bias. To model the effect of dependencies between labels on annotator noise and bias, we introduced the seq annotator model. Our results demonstrated the benefits of BSC over established non-sequential methods, such as MACE, Dawid-Skene (DS), and IBCC. We also showed the advantages of a Bayesian approach for active learning, and that the combination of BSC with the seq annotator model improves the state-of-the-art over HMM-crowd on three NLP tasks with different types of span annotations.
In future work, we plan to adapt active learning methods for easy deployment on crowdsourcing platforms, and to investigate techniques for automatically selecting good hyperparameters without recourse to a development set, which is often unavailable at the start of a crowdsourcing process.


Appendix

The variational factor for each annotator model is a distribution over its parameters, which differs between models. For seq, the variational factor is:

q(π^{(k)}) = ∏_{j=1}^{J} ∏_{h=1}^{J} Dirichlet(π^{(k)}_{j,h} | α^{(k)}_{j,h} + C^{(k)}_{j,h}),

where C^{(k)}_{j,h,i} = ∑_{n=1}^{N} ∑_{τ=1}^{L_n} r_{j,n,τ} δ(c^{(k)}_{n,τ−1} = h) δ(c^{(k)}_{n,τ} = i) and δ is the Kronecker delta. For CM, MACE, CV and acc, the factors follow a similar pattern of summing pseudo-counts of correct and incorrect answers.

Figure 3 provides an alternative visualisation of the seq models inferred by BSC-seq for annotators in the PICO dataset. The annotators were clustered as described in Section 6 of the main paper, and the mean confusion matrices for each cluster are plotted in Figure 3 using 3D plots to emphasise the differences between the likelihoods of annotators in each cluster providing a particular label given the true label value.