Finding Patterns in Noisy Crowds: Regression-based Annotation Aggregation for Crowdsourced Data

Crowdsourcing offers a convenient means of obtaining labeled data quickly and inexpensively. However, crowdsourced labels are often noisier than expert-annotated data, making it difficult to aggregate them meaningfully. We present an aggregation approach that learns a regression model from crowdsourced annotations to predict aggregated labels for instances that have no expert adjudications. The predicted labels achieve a correlation of 0.594 with expert labels on our data, outperforming the best alternative aggregation method by 11.9%. Our approach also outperforms the alternatives on third-party datasets.


Introduction
Publicly available labeled datasets are scarce for many NLP tasks, and crowdsourcing services such as Amazon Mechanical Turk (AMT; www.mturk.com) offer researchers a quick, inexpensive means of labeling their data. However, workers employed by these services are typically unfamiliar with the annotation tasks, and they may have little motivation to perform high-quality work due to factors such as low pay and anonymity. To further complicate matters, some workers may produce spam or malicious responses. Thus, it is not uncommon for workers to correlate poorly with one another.
Researchers using crowdsourcing services commonly aggregate the labels they receive via simple strategies such as using the majority or average label. These methods are best suited for simple, straightforward tasks; with noisier data, such as that obtained for more difficult or subjective tasks, these strategies may produce skewed labels that misrepresent the instance. Thus, it is desirable to devise more effective aggregation strategies that consider factors such as label distribution and worker quality, while still avoiding manual adjudication of all instances.
In this work, our contributions are as follows: (1) we develop a regression-based method for automatically aggregating crowdsourced annotations of varying quality, with poor agreement and minimal expert-adjudicated data, that addresses multiple potential flaws or biases in non-expert human annotation. To do so, we (2) crowdsource annotations for a difficult NLP task, metaphor novelty scoring, and (3) describe a process by which we automatically detect untrustworthy workers. We then (4) introduce a feature set that captures label distribution and trustworthiness, and extract the features from our crowdsourced annotations. Finally, (5) we train a regression model that predicts aggregated labels for unseen instances and compare the predictions to expert annotations, finding that our method outperforms the best alternative approach. We evaluate our approach both on our data and on existing crowdsourcing datasets. All datasets and source code are available for the research community to improve on our results.

Related Work
Several methods have been proposed to identify low-quality workers in crowdsourced data. Jagabathula et al. (2016) filtered adversarial workers in binary labeling tasks by identifying those with outlier labeling patterns, and Lin et al. (2014) identified when additional labels for binary tasks should be crowdsourced to optimize classifier accuracy. Unlike these approaches, our filtering algorithm is suitable for multi-class annotation tasks.
Various methods have also been explored as intelligent modes of label aggregation. Most (Snow et al., 2008; Raykar et al., 2010; Karger et al., 2011; Liu et al., 2012; Hovy et al., 2013; Felt et al., 2014; Huang et al., 2015) have built upon the probabilistic item-response model first proposed by Dawid and Skene (1979), which simultaneously estimates annotator quality and aggregated labels using an expectation-maximization algorithm. MACE (Hovy et al., 2013) is a popular implementation inspired by this that aggregates labels as a function of the annotation and a learned binary variable indicating whether the annotator is a spammer. We posit that although annotator quality is an important factor in predicting accurate aggregations, the interplay between it and other factors is more nuanced. Thus, rather than adapting the item-response method, our learning approach incorporates features that address multiple potential flaws or biases in crowdsourced annotations.
Some researchers have also used data-aware approaches to predict aggregations (Raykar et al., 2010; Felt et al., 2014, 2015, 2016). We do not use the data itself in this work, to avoid skewing labels in a way that makes it trivial to learn classifiers based on the same data. To the best of our knowledge, our work is the first to frame label aggregation as a regression task, with features based solely on workers and their labels, that learns entirely from a small amount of expert-adjudicated crowdsourced annotations.

Data Collection
We evaluated our approach on our new metaphor novelty dataset, as well as on third-party datasets. To build our dataset, we crowdsourced annotations for 3112 potentially metaphoric word pairs, and randomly divided the instances into training (1036), validation (1038), and test (1038) subsets. We developed features and selected our regression algorithm using the training and validation sets only; the test set was withheld until the evaluation.

Annotation Task
Instances consisted of pairs of words from 1840 sentences in the VU Amsterdam Metaphor Corpus (VUAMC) (Steen et al., 2010). The VUAMC consists of documents for which individual words are labeled as metaphors. The novelty of those metaphors varies widely, from highly conventional to quite novel. Each sentence for which we collected annotations contained a content word (noun, verb, adjective, or adverb) labeled as being metaphoric, and one or more other content words or personal pronouns that were syntactically related to the metaphoric word. Word pairs containing a metaphoric word and a syntactically-related content word or personal pronoun were considered instances. AMT workers ("Turkers") were asked to score each instance on a discrete scale from non-metaphoric (0) to highly novel metaphor (3). Some examples are shown in Table 1. Instances were grouped into Human Intelligence Tasks (HITs) containing all instances associated with 10 sentences each. Five worker assignments were requested per HIT, and Turkers were paid $0.20 per HIT. Overall, 237 Turkers completed 942 assignments, with an average correlation of 0.269 per HIT (the poor agreement suggests this is a very difficult annotation task). An expert adjudicated all 3112 instances; those labels were considered the gold standard.

Table 1
Example | Score
--- | ---
Alice looked up, and there stood the Queen in front of them, with her arms folded, frowning like a thunderstorm. | Novel Metaphor (3)
'Once,' said the Mock Turtle at last, with a deep sigh, 'I was a real Turtle.' | Conventional Metaphor (1)
A large rose-tree stood near the entrance of the garden: the roses growing on it were white, but there were three gardeners at it, busily painting them red. | Non-Metaphor (0)

Data Filtering
Spam and malicious workers were identified during data collection using a filtering algorithm that compared annotations with those completed by "potentially good annotators" (PGA). Alg. 1 describes this process. Letting $H_i$ be a set of HITs collected, $A_i$ be the set of annotators who annotated $H_i$, and $A = \bigcup(A_1, \ldots, A_j)$ be the set of all annotators, the algorithm computes three sets of annotators: good annotators (GA), spammers or malicious annotators (Bad Robots, or BR), and annotators of currently unknown quality (UQA).
$R(a_j, a_k)$ computes the correlation coefficient between two annotators $a_j$ and $a_k$, where $a_k$ is a potentially good annotator whose annotations overlap with $a_j$'s, and $\mathrm{AVG\,R}(a_j)$ computes the average correlation between $a_j$ and all $a_k$. HITs completed under a minimum time threshold were also filtered. Following algorithm completion, filtered HITs and unpaid HITs from members of BR were rejected, and annotators in BR were disqualified from accepting future HITs. In total, 116 assignments were rejected by the filtering algorithm. Annotators in UQA ($UQA = A - GA - BR$) who had completed $\geq 2$ HITs and had an $r_j < 0.1$ were also disqualified. All other HITs were accepted.
Algorithm 1: Worker Filtering for Annotation Set i
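The correlation step of the filter can be illustrated with a short sketch. This is not a reproduction of Algorithm 1: the data layout (annotator mapped to a dictionary of instance labels), the function names, the toy labels, and the choice of Pearson correlation are all assumptions made for illustration. Only the idea of averaging an annotator's correlation with overlapping potentially good annotators comes from the text.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant label sequence has no defined correlation
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def r_pair(labels, a_j, a_k):
    """R(a_j, a_k): correlation over the instances both annotators labeled."""
    shared = sorted(set(labels[a_j]) & set(labels[a_k]))
    return pearson([labels[a_j][i] for i in shared],
                   [labels[a_k][i] for i in shared])

def avg_r(labels, a_j, pga):
    """AVG R(a_j): mean of R(a_j, a_k) over the overlapping PGAs a_k."""
    rs = [r_pair(labels, a_j, a_k) for a_k in pga if a_k != a_j]
    rs = [r for r in rs if r is not None]
    return sum(rs) / len(rs) if rs else None

# Hypothetical toy labels: annotator -> {instance id: label on the 0-3 scale}
labels = {
    "w1": {"i1": 0, "i2": 1, "i3": 3, "i4": 2},
    "w2": {"i1": 0, "i2": 1, "i3": 3, "i4": 2},
    "w3": {"i1": 3, "i2": 2, "i3": 0, "i4": 1},  # systematically inverted labels
}
score = avg_r(labels, "w3", pga={"w1", "w2"})
```

In this toy case the third worker's labels are the exact inverse of the two reference workers', so the average correlation is strongly negative and would fall below a cutoff such as the 0.1 value mentioned above.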

Features
We designed features to capture the distribution and trustworthiness of crowdsourced labels for each instance. The features are described in Table 2. ANNOTATIONS are designed to provide the regression algorithm with label distributions based on label value and worker trustworthiness. AVG. R features are intended to further clarify worker quality, and AVG. R (GOOD) is meant to provide a more selective view of the same characteristic. AVG., WEIGHTED AVG., and WEIGHTED AVG. (GOOD) allow the regressor to consider three different versions of a popular aggregation strategy, and finally, HIT R supplies the algorithm with an estimate of agreement on the current instance to consider when making its prediction.
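A rough sketch of how the per-instance features might be assembled is given below. The function name, input layout, and toy values are hypothetical. In particular, the weighted average is normalized by the sum of the weights here, which is an assumption: the exact formula is not recoverable from the text.

```python
def instance_features(annotations, avg_r):
    """Build per-instance features.

    annotations: list of (worker, label) pairs for one instance.
    avg_r: worker -> AVG. R (average correlation with other workers).
    """
    # Order from highest to lowest label so feature positions are comparable
    ordered = sorted(annotations, key=lambda wl: wl[1], reverse=True)
    labels = [label for _, label in ordered]
    weights = [avg_r[worker] for worker, _ in ordered]

    avg = sum(labels) / len(labels)
    total = sum(weights)
    # ASSUMPTION: weights are normalized by their sum; if they sum to zero,
    # fall back to the unweighted average.
    if total:
        weighted = sum(l * r for l, r in zip(labels, weights)) / total
    else:
        weighted = avg
    return {
        "annotations": labels,   # ANNOTATIONS
        "avg_r": weights,        # AVG. R, in order of label value
        "avg": avg,              # AVG.
        "weighted_avg": weighted,  # WEIGHTED AVG. (assumed form)
    }

# Hypothetical toy instance with five annotations
annotations = [("w1", 1), ("w2", 3), ("w3", 0), ("w4", 1), ("w5", 2)]
avg_r = {"w1": 0.4, "w2": 0.2, "w3": 0.1, "w4": 0.4, "w5": 0.5}
features = instance_features(annotations, avg_r)
```

Substituting AVG. R (GOOD) values for `avg_r` would give the corresponding (GOOD) variants under the same assumption.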

Regression Algorithm
The approach utilizes a random subspace regressor, which was selected based on its performance on the training and validation data relative to a large variety of other regression algorithms. Random subspace is similar in nature to bagging and random forests, using multiple decision trees constructed from subsets of features selected randomly without replacement to make its predictions (Ho, 1998). We used the implementation from the Weka library (Frank et al., 2016), with Weka's REPTree classifier as the base decision tree model.

Table 2
Feature | Description
--- | ---
ANNOTATIONS | From highest to lowest label, the five annotations for the instance.
AVG. R | For each annotator, in order of label value, his/her avg. correlation with other workers across all instances he/she annotated.
AVG. R (GOOD) | AVG. R in which each annotator is compared only to annotators with $r_j > 0.35$. If the annotator has no overlapping annotations with those, AVG. R is repeated.
AVG. | Average of the five ANNOTATIONS.
WEIGHTED AVG. | Let $l_i$ be the $i$th ANNOTATION, and $r_i$ be its annotator's AVG. R. Then WEIGHTED AVG. is the average of the $l_i$ weighted by the $r_i$.
WEIGHTED AVG. (GOOD) | Similar to WEIGHTED AVG., with weights ($r_i$) taken from AVG. R (GOOD) instead of AVG. R.
HIT R | The average weighted correlation among annotators for the HIT containing the instance. Letting $w_{i,j}$ be the weight for a pair of annotators equal to $\frac{r_i + r_j}{2}$, where $r_i$ and $r_j$ are the AVG. R associated with annotators $a_i$ and $a_j$, $r_{i,j}$ be the correlation between annotators $a_i$ and $a_j$ for the HIT, and $P$ contain all annotator pairs $(a_i, a_j)$ for the HIT, $\text{HIT R} = \frac{\sum_{p \in P} r_{i,j} \times w_{i,j}}{\sum_{p \in P} w_{i,j}}$.

Table 3: Dataset Details
 | Affect (Emotion) | Affect (Valence) | WebRel | Metaphor Novelty
--- | --- | --- | --- | ---
Label Range | 0 to 100 | -100 to 100 | 0 to 2 | 0 to 3
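The random subspace idea described in this section can be illustrated with a small pure-Python sketch. It is not the Weka implementation used in this work: depth-1 regression stumps stand in for REPTree, and the function names, the `subspace` fraction, the number of estimators, and the toy data are all hypothetical. The point is only that each base learner is trained on all instances but sees a random subset of features drawn without replacement, and that predictions are averaged.

```python
import random

def fit_stump(X, y, features):
    """Fit a depth-1 regression tree restricted to the given feature indices."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:  # no valid split: always predict the overall mean
        mean = sum(y) / len(y)
        return (features[0], float("inf"), mean, mean)
    return best[1:]  # (feature, threshold, left mean, right mean)

def fit_random_subspace(X, y, n_estimators=10, subspace=0.5, seed=0):
    """Train one stump per random feature subset (sampled without replacement)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    k = max(1, round(subspace * n_features))
    return [fit_stump(X, y, rng.sample(range(n_features), k))
            for _ in range(n_estimators)]

def predict(model, row):
    """Average the predictions of all base learners."""
    preds = [lm if row[f] <= t else rm for f, t, lm, rm in model]
    return sum(preds) / len(preds)

# Hypothetical toy data: two features, gold labels on the 0-3 scale
X = [[0, 0], [0, 1], [3, 2], [3, 3]]
y = [0, 0, 3, 3]
model = fit_random_subspace(X, y)
high = predict(model, [3, 3])
low = predict(model, [0, 0])
```

Because every base learner only sees part of the feature set, no single feature can dominate the ensemble, which is the property the method shares with random forests.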

Other Datasets
In addition to evaluating our approach on our data, we evaluate it on three existing crowdsourcing datasets that differ in terms of their size, noise level, and number of annotators. Details about each dataset are shown in Table 3, with additional information below. Each third-party dataset was randomly divided into 66% training and 34% test.
Affect (Emotion and Valence). Affect (Emotion) and Affect (Valence) were created for Snow et al.'s (2008) work, and contain emotion (anger, fear, disgust, joy, sadness, and surprise) and valence ratings for 100 headlines from the SemEval affective text annotation task (Strapparava and Mihalcea, 2007) test set. Annotations indicate the degree of emotion in an emotion-headline pair (Affect (Emotion)) and the overall positive or negative valence of a headline (Affect (Valence)). Snow et al. report an average correlation among annotators of 0.669 (emotion) and 0.844 (valence).
WebRel. WebRel was originally created for the TREC 2010 Relevance Feedback Track (Buckley et al., 2010), and its annotations indicate the relevance of web documents retrieved for queries. The full dataset contains crowdsourced annotations for 20,232 topic-document pairs; 3277 of those pairs additionally have gold-standard labels. The number of annotations collected per instance varied. We used the subset of instances with gold standard labels and at least five annotations, and reconstructed their HIT groupings based on the workers that annotated each instance (we assumed all instances annotated by the exact same set of workers were originally from the same HIT). Average correlation per HIT was 0.102 (quite noisy).
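The HIT reconstruction assumption for WebRel (instances annotated by exactly the same set of workers belong to the same HIT) amounts to grouping by worker set. A minimal sketch follows; the function name, identifiers, and toy data are hypothetical.

```python
from collections import defaultdict

def reconstruct_hits(instance_workers):
    """Group instances by the exact set of workers who annotated them."""
    hits = defaultdict(list)
    for instance, workers in instance_workers.items():
        # frozenset ignores worker order and is hashable, so it can be a key
        hits[frozenset(workers)].append(instance)
    return dict(hits)

# Hypothetical topic-document pairs and the workers who labeled each one
instance_workers = {
    "q1-d1": ["w1", "w2", "w3", "w4", "w5"],
    "q1-d2": ["w5", "w4", "w3", "w2", "w1"],  # same workers, different order
    "q2-d7": ["w1", "w2", "w3", "w4", "w6"],
}
hits = reconstruct_hits(instance_workers)
```

The first two instances end up in one reconstructed HIT and the third in another, since its worker set differs by one annotator.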

Experimental Setup
We compare our approach to a number of alternative methods, detailed with justifications in Table 4. The alternatives are popular aggregation techniques that address different potential flaws in non-expert annotation. We train our approach on the training (and validation, for our dataset) data, and test on the test set. Since MACE (used for Item-Response) learns from and outputs predictions for the same data, we provide it with the entire dataset (training, validation if available, and test), but report its results for the test instances only. We provide input to MACE as an n-dimensional sparse matrix (one row per instance and one column for each of the n distinct annotators in the dataset, with values filled only for the annotators who annotated that instance), since the approach must know which annotator provided each annotation to function properly.
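The instance-by-annotator layout described above can be sketched as follows. The function names and toy data are hypothetical, and serializing the matrix as CSV with empty cells for missing annotations is an assumption that should be checked against MACE's own documentation before use.

```python
import csv
import io

def to_sparse_rows(annotations):
    """annotations: instance -> {annotator: label}.

    Returns the sorted annotator list and one row per instance, with an
    empty string wherever an annotator did not label that instance.
    """
    annotators = sorted({a for labels in annotations.values() for a in labels})
    rows = [[str(labels.get(a, "")) for a in annotators]
            for _, labels in sorted(annotations.items())]
    return annotators, rows

def to_csv(rows):
    """Serialize rows as comma-separated text, one line per instance."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

# Hypothetical toy data: two instances, three distinct annotators
annotations = {
    "i1": {"w1": 2, "w3": 0},
    "i2": {"w2": 1, "w3": 3},
}
annotators, rows = to_sparse_rows(annotations)
text = to_csv(rows)
```

Keeping the column order fixed across rows is what preserves annotator identity, which is the property the aggregation model relies on.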

Table 4
Method | Description
--- | ---
Majority Vote | The most frequent label given by annotators for the instance. Ties were broken by taking the highest of the tied labels. Assumes the most popular opinion should be trusted.
Highest | The highest label for the instance. Assumes those who see a metaphor should be trusted.
Item-Response | The prediction expected from an item-response model. We use MACE (Hovy et al., 2013) to generate predictions since it is a well-documented item-response approach that is publicly available online.
Mode Average | The real-valued average of the mode(s) of the instance's labels (if there is only one mode, the value is that mode). Assumes popular opinions should be trusted, and equally popular opinions are equally trustworthy.
Average | The average of all five labels. Assumes each annotator's opinion is equally valid.
Rule-Based | Assigns a value of 0 if 4+ annotators labeled the instance as such; otherwise, takes the avg. non-zero label. Assumes annotators frequently miss tricky or subtle instances.

We also evaluate the performance of different feature subsets on our data. All−Averages contains all features except for AVG., WEIGHTED AVG., and WEIGHTED AVG. (GOOD). Each other subset contains all features except for the respective feature type noted from Table 2. The correlation coefficient (r) and root mean squared error (RMSE) were recorded for each test condition since our estimator produced continuous-valued scores. Since Mode Average, Average, and Rule-Based result in continuous values and Majority Vote, Highest, and Item-Response result in discrete values, we present two versions of our results; in one, predictions were rounded to the nearest integer (forcing a 0, 1, 2, or 3) and in the other, they were left as-is. For the discrete approaches on our data, we also report accuracy.
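The non-learned alternatives from Table 4 are simple enough to sketch directly. The function names and the toy label list are hypothetical, and the guard in `rule_based` for an instance with no non-zero labels is an added assumption for inputs with fewer than five annotations. Item-Response is omitted because it relies on MACE.

```python
from collections import Counter

def modes(labels):
    """All labels tied for the highest frequency, in ascending order."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(l for l, c in counts.items() if c == top)

def majority_vote(labels):
    # Ties are broken by taking the highest of the tied labels
    return max(modes(labels))

def highest(labels):
    return max(labels)

def mode_average(labels):
    m = modes(labels)
    return sum(m) / len(m)

def average(labels):
    return sum(labels) / len(labels)

def rule_based(labels):
    # 0 if four or more annotators chose 0; otherwise the mean non-zero label
    if labels.count(0) >= 4:
        return 0.0
    nonzero = [l for l in labels if l != 0]
    # ASSUMPTION: return 0.0 when no non-zero labels exist
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

# Hypothetical instance with five annotations on the 0-3 scale
labels = [0, 1, 1, 3, 3]
results = {
    "majority_vote": majority_vote(labels),
    "highest": highest(labels),
    "mode_average": mode_average(labels),
    "average": average(labels),
    "rule_based": rule_based(labels),
}
```

For this instance the two tied modes (1 and 3) make Majority Vote return 3 while Mode Average returns 2.0, which illustrates how the tie-handling choices separate otherwise similar strategies.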

Results
The results are presented in Tables 5, 6, and 7. Table 5 compares our method with each alternative approach on our data, and Table 6 compares our method with the alternatives on each third-party dataset. Table 7 shows the results of the feature ablation.
On our dataset, our approach outperformed all other approaches, with r = 0.594 with the gold standard and RMSE (0-3) = 0.605. This represented correlation improvements of 18.6%, 11.9%, and 69.2% relative to the continuous alternative approaches (Mode Average, Average, and Rule-Based, respectively). The rounded predictions also outperformed all discrete alternatives (Majority Vote, Highest, and Item-Response) with relative correlation improvements of 10.6%, 66.1%, and 35.4%, respectively. All approaches had strong positive statistically significant (p<<0.0001) correlations and the improvement of our results over the alternatives was statistically significant (p<<0.0001).
On WebRel and Affect (Valence), our approach outperformed all other approaches for both the discrete and continuous conditions. On Affect (Emotion), our approach outperformed all alternatives for the discrete condition and had a lower RMSE than all other approaches for the continuous condition (reductions in error relative to Rule-Based, Average, and Mode Average were 37.4%, 0.6%, and 24.2%, respectively), but the predictions from Average correlated better with the gold standard than did those of our approach.
Interestingly, Table 7 shows that the discrete version of our approach performed slightly better when the features indicating annotators' correlations with good annotators were removed; this was not the case for the continuous-labeled version. The raw annotations themselves were the most valuable features for both cases. Their removal led to a correlation reduction of 10.2% (rounded) and 6.2% (continuous) relative to using all features.
The results suggest that our approach is a suitable means of automatically aggregating noisy crowdsourced labels, and that reasonable results can be obtained even when training on only a small number of expert-adjudicated instances. Further, the performance of the alternative approaches suggests that typical aggregation techniques may be less suitable for tasks with many workers who completed relatively few annotations.
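The evaluation measures reported in this section can be sketched as follows. The function names and toy predictions are hypothetical. Two details are assumptions rather than statements from the text: relative improvement is computed as the difference divided by the baseline value, and rounding is done half-up and clamped to the label range.

```python
from math import floor, sqrt

def pearson_r(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def rmse(pred, gold):
    """Root mean squared error between predictions and gold labels."""
    return sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

def to_discrete(pred, lo=0, hi=3):
    """Round half-up to the nearest integer, clamped to [lo, hi] (assumed)."""
    return [min(hi, max(lo, floor(p + 0.5))) for p in pred]

def relative_improvement(ours, baseline):
    """Percentage improvement over a baseline (assumed definition)."""
    return (ours - baseline) / baseline * 100

# Hypothetical gold labels and continuous predictions
gold = [0, 1, 3, 2]
pred = [0.4, 1.2, 2.6, 1.5]
rounded = to_discrete(pred)
continuous_rmse = rmse(pred, gold)
discrete_rmse = rmse(rounded, gold)
```

Reporting both the continuous and the rounded variants, as done here, keeps the comparison fair to baselines that only produce discrete labels.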

Conclusion
In this work, we present a regression-based aggregation method that addresses multiple potential flaws or biases in non-expert human annotation. We show that the predictions from our approach correlate at r=0.594 with expert adjudications for a noisy, difficult task, outperforming the best alternative approach by 11.9% on our data and by up to 63.7% on third-party crowdsourcing datasets. This improvement shows that a learning approach can overcome some of the challenges faced by simple label aggregation techniques for these types of tasks. Our data and source code are publicly available for further research by others.