OptSLA: an Optimization-Based Approach for Sequential Label Aggregation

The need for annotated training data on which data-hungry machine learning algorithms feed has grown dramatically with the widespread adoption of machine learning applications. Annotating data requires people with domain expertise, who are scarce and expensive to hire. This has led to the thriving of crowdsourcing platforms such as Amazon Mechanical Turk (AMT). However, the annotations provided by a single worker cannot be used directly to train a model, because individual workers lack expertise. Existing literature on annotation aggregation focuses on binary and multi-choice problems. In contrast, little work has been done on complex tasks such as sequence labeling with imbalanced classes, a ubiquitous task in Natural Language Processing (NLP) and bioinformatics. We propose OptSLA, an Optimization-based Sequential Label Aggregation method that jointly considers the characteristics of sequential labeling tasks, worker reliabilities, and advanced deep learning techniques to address this challenge. We evaluate our model on crowdsourced data for the named entity recognition task. Our results show that the proposed OptSLA outperforms state-of-the-art aggregation methods, and its results are easier to interpret.


Introduction
Crowdsourcing (Howe, 2008) is a popular way to annotate massive corpora inexpensively, and it has attracted considerable interest in machine learning and deep learning. However, the annotations provided by crowd workers are noisier than labels provided by experts. Thus, it becomes essential to conduct truth inference from the noisy annotations.
One common annotation aggregation approach is Majority Voting (MV) (Lam and Suen, 1997), in which the annotation with the highest number of occurrences is deemed the truth. Another naive approach is to regard an annotation as correct if a certain number of workers provide the same annotation. The concern with these methods is that they assume all workers are of the same quality, which is usually invalid in practice. In this paper, we study the annotation aggregation problem for sequential labeling tasks, a common class of NLP tasks.
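As a concrete illustration (our own sketch, not part of the paper), token-level majority voting amounts to a frequency count per token; the function name is an assumption:

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate one token's worker annotations by majority vote.

    annotations: list of labels, one per worker, e.g. ["B-PER", "B-PER", "O"].
    Ties are broken arbitrarily by Counter's internal ordering.
    """
    return Counter(annotations).most_common(1)[0][0]
```

This baseline treats every worker as equally reliable, which is exactly the assumption the paper argues against.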
Many existing crowdsourcing label aggregation methods may suffer from performance loss because they assume that data instances are independent (Zheng et al., 2017). New approaches have recently been proposed to handle the particular characteristics of sequential labeling tasks, where tokens in one sentence have complex dependencies (Rodrigues et al., 2014; Simpson and Gurevych, 2019; Nguyen et al., 2017). In this line of approaches, probabilistic models are adopted to model the workers' labeling behavior and the dependencies between adjacent tokens. There are several drawbacks to these probabilistic models. First, they make strong statistical assumptions when modeling the sequence annotations, limiting the flexibility of the models. Second, they need to infer complex parameters, making it hard to interpret the relation between workers' qualities and tokens' true labels. Third, these aggregation methods cannot fully unleash the power of deep learning in sequential labeling tasks.
To address these challenges, we propose an optimization framework to improve aggregation performance. Our method OPTSLA estimates workers' reliability and models the label dependencies to infer the true labels from noisy annotations. OPTSLA handles the complex sequential label aggregation problem with fewer parameters than the state of the art and produces easy-to-understand results.
We further incorporate a state-of-the-art deep learning approach into OPTSLA, where the deep learning component and the aggregation component mutually enhance each other. To ensure high-quality training data, OPTSLA chooses sentences with high confidence from the aggregation component. The deep learning model is incrementally trained with the iteratively updated aggregation results.

Related Works
Data aggregation and label inference tasks have received much attention over the past decade, and many methods have been developed to handle various challenges (Li et al., 2016; Zheng et al., 2017). Earlier works such as (Dawid and Skene, 1979; Yin et al., 2008; Snow et al., 2008; Whitehill et al., 2009; Groot et al., 2011) proposed to model worker qualities and label inference using statistical methods. Later, optimization-based methods were proposed (Zhou et al., 2012; Li et al., 2014). Intensive experiments in many applications and tasks have shown that these methods generally outperform MV, which indicates that worker quality estimation can play an essential role in label inference. However, in these methods, the annotation instances are assumed to be independent. More recently, methods have been developed to handle various types of correlations among annotation instances. For example, the methods in (Yao et al., 2018; Zhi et al., 2018) handle spatial-temporal dependencies among instances, and the methods in (Rodrigues et al., 2014; Nguyen et al., 2017; Simpson and Gurevych, 2019) handle the sequential labeling tasks in NLP, which are more related to this paper. Rodrigues et al. (Rodrigues et al., 2014) proposed a probabilistic approach using Conditional Random Fields (CRF) to model the sequential annotations. In this model, a worker's reliability is modeled by his/her F1 score, but only one worker is assumed to be correct for any instance. These models are probabilistic models with significantly more parameters to tune, and they are harder to interpret than optimization-based methods (Zheng et al., 2017). Moreover, the existing methods do not fully unleash the power of deep learning approaches in sequential labeling tasks. In this paper, we propose an optimization-based aggregation method to address the interpretability challenge, and further include a deep learning module to boost performance.

Methodology
The sequential label aggregation task aims to combine the annotations provided by different workers to infer the ground truth sequential labels. In this section, we describe our approach, an optimization-based sequential label aggregation method (OPTSLA), which aggregates multiple workers' annotations with deep learning results by estimating the reliability of workers and modeling the dependencies among tokens in the sentences.

OPTSLA
We first introduce the notation. Suppose $m$ workers (indexed by $j$) are hired to annotate $s$ sentences (indexed by $k$) with $n$ tokens in total. Let $i_k$ denote the $i$-th token in the $k$-th sentence. $y^j_{i_k}$ is a one-hot vector denoting the annotation given by the $j$-th worker on the $i$-th token in the $k$-th sentence, and $y^*_{i_k}$ is the inferred aggregated label for the corresponding token. Each worker has a weight parameter $w_j$ reflecting his/her annotation quality, and $W = \{w_1, w_2, \ldots, w_m\}$ denotes the set of all worker weights. A higher weight implies that the worker is more reliable.
Our goal is to minimize the overall weighted loss between the inferred aggregated labels $y^*_{i_k}$ and the reliable workers' annotations $y^j_{i_k}$ and the deep learning predictions $\hat{y}^*_{i_k}$, together with a loss penalizing inconsistencies in the sequential labels. Mathematically, we formulate aggregation as an optimization problem with respect to the set of worker weights $W$, the weight of the deep learning model $w_{dl}$, the aggregated annotations $y^*_{i_k}$, and the deep learning parameters $\theta$, shown in Eq (1):

$$\min_{W,\, w_{dl},\, \{y^*_{i_k}\},\, \theta}\; \sum_j w_j \sum_k \xi(y^*_k) \sum_{i_k} H(y^j_{i_k}, y^*_{i_k}) + w_{dl} \sum_k \xi(y^*_k) \sum_{i_k} H(\hat{y}^*_{i_k}, y^*_{i_k}) - \sum_j |\{y^j_{i_k}\}_{i_k}| \log(w_j) - n \log(w_{dl}) + \sum_{i_k} g(y^*_{i_k-1}, y^*_{i_k}, y^*_{i_k+1}) \quad (1)$$
where $H(\cdot,\cdot)$ is the cross-entropy loss function, $\xi(y^*_k)$ is the confidence level of the $k$-th sentence, $|\{y^j_{i_k}\}_{i_k}|$ is the number of annotations provided by worker $j$, and $g(\cdot,\cdot)$ is a loss function that maintains consistency between token labels. More specifically, $\xi(y^*_k) = \frac{1}{l_k}\sum_{i_k} \mathrm{margin}(y^*_{i_k})$, where $l_k$ is the number of tokens in sentence $k$ and $\mathrm{margin}(y^*_{i_k})$ is the probability difference between the two most likely labels of $y^*_{i_k}$. In Eq (1), $\sum_j w_j \sum_k \xi(y^*_k) \sum_{i_k} H(y^j_{i_k}, y^*_{i_k})$ is the weighted cross-entropy loss between the inferred aggregated labels and the workers' annotations, adjusted by the confidence measure $\xi(y^*_k)$. Intuitively, if a worker is reliable (i.e., $w_j$ is high) and the aggregated labels have high confidence, a high penalty is incurred when his/her annotations differ substantially from the inferred aggregated labels. To minimize the objective function, the inferred aggregated labels $y^*_{i_k}$ therefore rely more on the workers with high weights.
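The sentence confidence $\xi$ can be sketched in Python as follows (our own illustration; the function and variable names are assumptions): it averages, over the tokens of a sentence, the margin between the two most probable labels.

```python
def sentence_confidence(token_dists):
    """xi(y*_k): mean margin between the two most probable labels per token.

    token_dists: list of per-token label probability distributions,
    e.g. [[0.9, 0.1], [0.6, 0.4]] for a two-token, two-label sentence.
    """
    margins = []
    for dist in token_dists:
        top = sorted(dist, reverse=True)
        # margin = gap between the best and second-best label probability
        margins.append(top[0] - (top[1] if len(top) > 1 else 0.0))
    return sum(margins) / len(margins)
```

A sentence whose tokens all have near-uniform label distributions gets a confidence near 0, while confidently labeled sentences approach 1.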
The term $w_{dl} \sum_k \xi(y^*_k) \sum_{i_k} H(\hat{y}^*_{i_k}, y^*_{i_k})$ is the weighted cross-entropy loss between $y^*_{i_k}$ and the labels $\hat{y}^*_{i_k}$ predicted by a trained deep learning model, where $w_{dl}$ is the reliability of the deep learning model. In our model, the deep learning model is essentially treated as an additional worker. The training of the deep learning model is discussed in Section 3.4. The term $\sum_j |\{y^j_{i_k}\}_{i_k}| \log(w_j) + n \log(w_{dl})$ is a constraint that ensures the calculated weights are positive. The final term $\sum_{i_k} g(y^*_{i_k-1}, y^*_{i_k}, y^*_{i_k+1})$ is a loss function that penalizes inferred aggregated labels that are inconsistent with the sequential label rules. One simple example of $g(\cdot,\cdot)$ is

$$g(y^*_{i_k-1}, y^*_{i_k}) = \begin{cases} 0 & \text{if the transition } (y^*_{i_k-1}, y^*_{i_k}) \text{ is valid according to the sequential label rules,} \\ 1 & \text{otherwise.} \end{cases} \quad (2)$$

Taking the NER task as an example, $P(y_{i_k} = \text{'I-LOC'} \mid y_{i_k-1} = \text{'B-PER'}) = 0$, so $g(\text{'B-PER'}, \text{'I-LOC'}) = 1$. Note that when updating $y^*_{i_k}$, both $y^*_{i_k-1}$ and $y^*_{i_k+1}$ are considered. The inferred aggregated labels $y^*_{i_k}$, the weights $W$ and $w_{dl}$, and the deep learning model are learned simultaneously by optimizing Eq (1). To solve the problem, we adopt the block coordinate descent method (Tseng, 2001), which keeps reducing the value of the objective function. To minimize the objective in Eq (1), we iteratively conduct the following three steps.
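For the NER setting, the indicator-style consistency function can be sketched as a BIO transition check. This Python sketch is our own and assumes standard BIO tagging rules rather than the paper's exact rule set:

```python
def g(prev_label, label):
    """Consistency loss in the spirit of Eq (2): 0 for a valid BIO
    transition, 1 otherwise.

    Under standard BIO rules, an "I-X" tag is valid only immediately
    after "B-X" or "I-X" of the same entity type X.
    """
    if label.startswith("I-"):
        ent = label[2:]
        if prev_label not in ("B-" + ent, "I-" + ent):
            return 1  # e.g. 'I-LOC' cannot follow 'B-PER'
    return 0
```

The paper's running example, $g(\text{'B-PER'}, \text{'I-LOC'}) = 1$, falls out of this check.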

Workers' Weight Update
We initialize all workers with equal weights. To update the weights in each iteration, we treat the other variables as fixed. Then $W$ has a closed-form solution, obtained by differentiating Eq (1) with respect to $W$:

$$w_j = \frac{|\{y^j_{i_k}\}_{i_k}|}{\sum_k \xi(y^*_k) \sum_{i_k} H(y^j_{i_k}, y^*_{i_k})} \quad (3)$$

Similarly, the weight of the deep learning model is

$$w_{dl} = \frac{n}{\sum_k \xi(y^*_k) \sum_{i_k} H(\hat{y}^*_{i_k}, y^*_{i_k})} \quad (4)$$
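A minimal Python sketch of this update (our own code; function names are assumptions): assuming the standard closed form in which a worker's weight is his/her annotation count divided by his/her confidence-weighted cross-entropy loss.

```python
import math

def cross_entropy(one_hot, dist, eps=1e-12):
    # H(y, y*): one-hot worker annotation vs. aggregated label distribution
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, dist))

def update_worker_weight(annotations, confidences):
    """Closed-form weight: #annotations / confidence-weighted loss.

    annotations: one list per sentence of (one_hot_label, aggregated_dist)
    pairs for this worker; confidences: xi(y*_k) per sentence.
    """
    n_j = sum(len(sent) for sent in annotations)
    loss = sum(xi * sum(cross_entropy(y, p) for y, p in sent)
               for sent, xi in zip(annotations, confidences))
    return n_j / loss
```

Workers whose annotations match the confident aggregated labels incur small losses and thus receive large weights.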

Aggregated Annotation Update
In the second step, once the workers' weights are updated, the inferred aggregated labels $y^*_{i_k}$ are updated to minimize Eq (1):

$$\min_{y^*_{i_k}}\; \sum_j w_j\, \xi(y^*_k)\, H(y^j_{i_k}, y^*_{i_k}) + w_{dl}\, \xi(y^*_k)\, H(\hat{y}^*_{i_k}, y^*_{i_k}) + g(y^*_{i_k-1}, y^*_{i_k}, y^*_{i_k+1}) \quad (5)$$
This function does not have a closed-form solution. In fact, for a general label consistency loss function $g(\cdot,\cdot)$, solving Eq (5) may be non-trivial because the variables are correlated. Therefore, we apply the gradient descent method to update $y^*_{i_k}$ while fixing all other variables.
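One gradient step for a single token can be sketched as follows (our own Python, not the paper's implementation; the label-consistency and deep-learning terms are omitted for brevity, and the aggregated label is parameterized by softmax logits so the update stays on the probability simplex):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def token_update(z, worker_labels, weights, lr=0.1):
    """One gradient-descent step on the logits z of y*_{i_k}.

    For the weighted cross-entropy against one-hot worker labels, the
    gradient w.r.t. the logits is sum_j w_j * (softmax(z) - y_j).
    """
    p = softmax(z)
    grad = [sum(w * (p[c] - y[c]) for w, y in zip(weights, worker_labels))
            for c in range(len(z))]
    return [zi - lr * gi for zi, gi in zip(z, grad)]
```

After a step, the aggregated distribution shifts toward the labels of the highly weighted workers.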

Incremental Deep Learning
With the updated aggregation results, we update the deep learning model. To maintain a high-quality model, we select sentences with high $\xi(y^*_k)$ (e.g., $\xi(y^*_k) > 0.9$) as training data. Since $y^*_{i_k}$ is updated iteratively, the training data change as well. However, retraining the deep learning model from scratch can be time-consuming. Therefore, we adopt an incremental deep learning approach (Sarwar et al., 2019) to improve algorithm efficiency.
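The confidence-based filtering step amounts to a simple threshold; a sketch in our own Python (the threshold value follows the text, the function name is ours):

```python
def select_training_sentences(sentences, confidences, threshold=0.9):
    """Keep only sentences whose aggregation confidence xi exceeds the
    threshold; these become training data for the deep learning model."""
    return [sent for sent, xi in zip(sentences, confidences)
            if xi > threshold]
```

As aggregation improves over iterations, more sentences clear the threshold, which is consistent with the growing training set reported in Figure 2.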

Class Priority (ρ)
Many sequential labeling tasks suffer from class imbalance. For example, in the NER task, the "O" label dominates the annotations. To handle this problem, class priorities ($\rho$'s) can be used to re-weight the classes: a higher $\rho$ increases the weight of entity labels when calculating $y^*_{i_k}$.
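A sketch of $\rho$-based re-weighting (our own Python; the paper does not specify exactly where $\rho$ enters the objective, so we show a generic per-token loss re-weighting that boosts entity labels over "O"):

```python
def reweight_losses(token_losses, labels, rho=2.0):
    """Scale entity-token losses by rho so that the dominant "O" class
    does not drown out the rare entity classes."""
    return [loss * (rho if lab != "O" else 1.0)
            for loss, lab in zip(token_losses, labels)]
```

With $\rho > 1$, mistakes on entity tokens cost more than mistakes on "O" tokens, counteracting the imbalance.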
Experiments

Datasets. We use real-world data to demonstrate the effectiveness of the proposed method OPTSLA. The NER dataset (Sang and De Meulder, 2003) consists of 5985 sentences, and 47 workers were hired to identify the named entities in the sentences and annotate them as persons, locations, organizations, or miscellaneous. To make the task more challenging, we use the 4515 sentences on which workers gave conflicting annotations, and for comparison we evaluate on 3466 sentences, the same test set used for the NER dataset. To evaluate the proposed OPTSLA, we compare the span-level precision, recall, and F1 score of the inferred aggregated labels with three state-of-the-art baseline methods: HMM-Crowd (Nguyen et al., 2017), CRF-MA (Rodrigues et al., 2014), and BSC-seq, whose result comes from (Simpson and Gurevych, 2019). For OPTSLA, a Convolutional Neural Network (CNN) is employed as the deep learning component for the NER dataset. To evaluate the effect of the deep learning module, we also compare against OPTSLA without the deep learning component, denoted OPTSLA (W/O DL).
The results are shown in Table 1. It is clear that the proposed OPTSLA method outperforms the state-of-the-art baseline methods. The results show that the deep learning component can indeed enhance aggregation performance. $H(\cdot,\cdot)$ and $\xi(y^*_k)$ help estimate worker reliability properly, which in turn helps aggregation. Because OPTSLA only uses sentences with high $\xi(y^*_k)$ for training, the deep learning model is trained on reliable data.
As the estimation of workers' reliability is key to obtaining high-quality aggregation results, we further plot the estimated worker weights against their actual F1 scores in Figure 1. There is a strong positive correlation between worker weights and actual F1 scores. Because OPTSLA uses a single parameter per worker, the results are more straightforward to interpret and justify compared with the baseline methods.
We observe that OPTSLA converges quickly. The algorithm stops when no more sentences can be added to the training set. Figure 2 illustrates the size of the training dataset with respect to the number of iterations.

Conclusion and Future Works
In this paper, we propose an innovative optimization-based approach, OPTSLA, for the sequential label aggregation problem. Our model jointly considers different factors in the objective function, including the workers' annotations, the workers' reliability, the deep learning model, and the characteristics of sequential labeling tasks. Our experimental results show that OPTSLA outperforms state-of-the-art sequential label aggregation methods such as CRF-MA, HMM-Crowd, and Bayesian Sequence Combination (BSC) in terms of F1 score. In future work, we will examine additional factors that may affect aggregation performance, such as task assignment, the deep learning model, and workers' behaviors.