Early Gains Matter: A Case for Preferring Generative over Discriminative Crowdsourcing Models

In modern practice, labeling a dataset often involves aggregating annotator judgments obtained from crowdsourcing. State-of-the-art aggregation is performed via inference on probabilistic models, some of which are data-aware, meaning that they leverage features of the data (e.g., words in a document) in addition to annotator judgments. Previous work largely prefers discriminatively trained conditional models. This paper demonstrates that a data-aware crowdsourcing model incorporating a generative multinomial data model enjoys a strong competitive advantage over its discriminative log-linear counterpart in the typical crowdsourcing setting. That is, the generative approach is better except when the annotators are highly accurate, in which case simple majority vote is often sufficient. Additionally, we present a novel mean-field variational inference algorithm for the generative model that significantly improves on the previously reported state-of-the-art for that model. We validate our conclusions on six text classification datasets with both human-generated and synthetic annotations.


Introduction
The success of supervised machine learning has created an urgent need for manually labeled training datasets. Crowdsourcing allows human label judgments to be obtained rapidly and at relatively low cost. Micro-task markets such as Amazon's Mechanical Turk and CrowdFlower have popularized crowdsourcing by reducing the overhead required to distribute a job to a community of annotators (the "crowd"). However, crowdsourced judgments often suffer from high error rates. A common solution to this problem is to obtain multiple redundant human judgments, or annotations, relying on the observation that, in aggregate, the ability of non-experts often rivals or exceeds that of experts by averaging over individual error patterns (Surowiecki, 2005; Snow et al., 2008; Jurgens, 2013).
For the purposes of this paper a crowdsourcing model is a model that infers, at a minimum, class labels y based on the evidence of one or more imperfect annotations a. A common baseline method aggregates annotations by majority vote but by so doing ignores important information. For example, some annotators are more reliable than others, and their judgments ought to be weighted accordingly. State-of-the-art crowdsourcing methods formulate probabilistic models that account for such side information and then apply standard inference techniques to the task of inferring ground truth labels from imperfect annotations.
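As a point of reference, the majority vote baseline described above can be sketched in a few lines. This is only an illustrative sketch: the item names and labels in `toy` are hypothetical, and the tie-breaking rule (first label encountered) is an arbitrary assumption rather than anything specified in this paper.

```python
from collections import Counter

def majority_vote(annotations):
    """Return the most frequent label among one item's annotations.

    Ties go to the label seen first, which is an arbitrary choice.
    """
    counts = Counter(annotations)
    return counts.most_common(1)[0][0]

# Hypothetical toy annotations for three items.
toy = {
    "doc1": ["sports", "sports", "politics"],
    "doc2": ["politics", "sports", "politics"],
    "doc3": ["sports"],
}
labels = {item: majority_vote(votes) for item, votes in toy.items()}
```

Every vote counts equally here, which is exactly the information loss noted above: a highly reliable annotator is treated the same as an unreliable one.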
Data-aware crowdsourcing models additionally account for the features x comprising each data instance (e.g., words in a document). The data can be modeled generatively by proposing a joint distribution p(y, x, a). However, because of the challenge of accurately modeling complex data x, most previous work uses a discriminatively trained conditional model p(y, a|x), hereafter referred to as a discriminative model. As Ng and Jordan (2001) explain, maximizing conditional log likelihood is a computationally convenient approximation to minimizing a discriminative 0-1 loss objective, giving rise to the common practice of referring to conditional models as discriminative.
Contributions. This paper challenges the popular preference for discriminative data models in the crowdsourcing literature by demonstrating that in typical crowdsourcing scenarios a generative model enjoys a strong advantage over its discriminative counterpart. We conduct, on both real and synthetic annotations, the first empirical comparison of structurally comparable generative and discriminative crowdsourcing models. The comparison is made fair by developing similar mean-field variational inference algorithms for both models. The generative model is considerably improved by our variational algorithm compared with the previously reported state-of-the-art for that model.

Previous Work
Dawid and Skene (1979) laid the groundwork for modern annotation aggregation by proposing the item-response model: a probabilistic crowdsourcing model p(y, a|γ) over document labels y and annotations a parameterized by confusion matrices γ for each annotator. A growing body of work extends this model to account for such things as correlation among annotators, annotator trustworthiness, item difficulty, and so forth (Bragg et al., 2013;Hovy et al., 2013;Passonneau and Carpenter, 2013;Pasternack and Roth, 2010;Smyth et al., 1995;Welinder et al., 2010;Whitehill et al., 2009;Zhou et al., 2012).
Of the crowdsourcing models that are data-aware, most model the data discriminatively (Carroll et al., 2007; Liu et al., 2012; Raykar et al., 2010; Yan et al., 2014). A smaller line of work models the data generatively (Lam and Stork, 2005; Simpson and Roberts, In Press). We are aware of no papers that compare a generative crowdsourcing model with a similar discriminative model. In the larger context of supervised machine learning, Ng and Jordan (2001) observe that the parameters of generative models tend to converge with fewer training examples than those of their discriminatively trained counterparts, but to lower asymptotic performance levels. This paper explores those insights in the context of crowdsourcing models.

Models
At a minimum, a probabilistic crowdsourcing model predicts ground truth labels y from imperfect annotations a (i.e., $\arg\max_y p(y|a)$). In this section we review the specifics of two previously proposed data-aware crowdsourcing models. These models are best understood as extensions to a Bayesian formulation of the item-response model that we will refer to as ITEMRESP. ITEMRESP, illustrated in Figure 1a, is defined by the joint distribution

$$p(\theta, \gamma, y, a) = p(\theta) \Big[\prod_{j \in J} \prod_{k \in K} p(\gamma_{jk})\Big] \prod_{i \in N} p(y_i|\theta) \prod_{j \in J} p(a_{ij}|\gamma_j, y_i) \qquad (1)$$

where J is the set of annotators, K is the set of class labels, N is the set of data instances in the corpus, $\theta$ is a stochastic vector in which $\theta_k$ is the probability of label class k, $\gamma_j$ is a matrix of stochastic vector rows in which $\gamma_{jkk'}$ is the probability that annotator j annotates with $k'$ items whose true label is k, $y_i$ is the class label associated with the ith instance in the corpus, and $a_{ijk}$ is the number of times that instance i was annotated by annotator j with label k. The fact that $a_{ij}$ is a count vector allows for the general case where annotators express their uncertainty over multiple class values. Also, $\theta \sim \text{Dirichlet}(b^{(\theta)})$, $\gamma_{jk} \sim \text{Dirichlet}(b^{(\gamma)}_{jk})$, $y_i|\theta \sim \text{Categorical}(\theta)$, and $a_{ij}|y_i, \gamma_j \sim \text{Multinomial}(\gamma_{jy_i}, M_i)$ where $M_i$ is the number of times annotator j annotated instance i. We need not define a distribution over $M_i$ because in practice $M_i = |a_{ij}|_1$ is fixed and known during posterior inference. A special case of this model formulates $a_{ij}$ as a categorical distribution, assuming that annotators will provide at most one annotation per item. All hyperparameters are designated b and are disambiguated with a superscript (e.g., the hyperparameters for $p(\theta)$ are $b^{(\theta)}$). When ITEMRESP parameters are set with uniform $\theta$ values and diagonal confusion matrices $\gamma$, majority vote is obtained.
Inference in a crowdsourcing model involves a corpus with an annotated portion $N_A = \{i : |a_i|_1 > 0\}$ and also potentially an unannotated portion $N_U = \{i : |a_i|_1 = 0\}$. ITEMRESP can be written as $p(\gamma, y, a) = p(\gamma, y_A, y_U, a)$ where $y_A = \{y_i : i \in N_A\}$ and $y_U = \{y_i : i \in N_U\}$. However, because ITEMRESP has no model of the data x, it receives no benefit from unannotated data $N_U$.
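To make the role of the confusion matrices concrete, the following sketch computes the item-response posterior over a single label when the class prior and confusion matrices are treated as known, i.e., the posterior is proportional to the class prior times each annotator's probability of producing the observed annotations. It is a simplified illustration rather than the inference procedure used in this paper (which places priors on these parameters); the function name `label_posterior` and all numeric values are hypothetical.

```python
import math

def label_posterior(theta, gamma, counts):
    """p(y = k | a_i) for one item, with theta and gamma treated as known.

    theta[k]        : prior probability of class k
    gamma[j][k][k2] : probability that annotator j says k2 when the truth is k
    counts[j][k2]   : number of times annotator j gave label k2 to this item
    """
    log_scores = []
    for k, prior in enumerate(theta):
        score = math.log(prior)
        for j, row in enumerate(counts):
            for k2, n in enumerate(row):
                if n:
                    score += n * math.log(gamma[j][k][k2])
        log_scores.append(score)
    # Normalize in log space for numerical stability.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

theta = [0.5, 0.5]
gamma = [
    [[0.9, 0.1], [0.1, 0.9]],  # annotator 0: 90% accurate (hypothetical)
    [[0.6, 0.4], [0.4, 0.6]],  # annotator 1: 60% accurate (hypothetical)
    [[0.6, 0.4], [0.4, 0.6]],  # annotator 2: 60% accurate (hypothetical)
]
# Annotator 0 votes for class 0; annotators 1 and 2 vote for class 1.
counts = [[1, 0], [0, 1], [0, 1]]
posterior = label_posterior(theta, gamma, counts)
```

In this toy case majority vote would pick class 1, whereas the posterior places probability 0.8 on class 0 because the single dissenting annotator is much more reliable than the other two.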

Log-linear data model (LOGRESP)
One way to make ITEMRESP data-aware is by adding a discriminative log-linear data component (Raykar et al., 2010; Liu et al., 2012). For short, we refer to this model as LOGRESP, illustrated in Figure 1b. Concretely,

$$p(\gamma, \phi, y, a|x) = \Big[\prod_{j \in J} \prod_{k \in K} p(\gamma_{jk})\Big] \Big[\prod_{k \in K} p(\phi_k)\Big] \prod_{i \in N} p(y_i|x_i, \phi) \prod_{j \in J} p(a_{ij}|\gamma_j, y_i) \qquad (2)$$

where $x_{if}$ is the value of feature f in data instance i (e.g., a word count in a text classification problem), $\phi_{kf}$ is the weight of feature f for class k, $\phi_k \sim \text{Normal}(0, \Sigma)$, and $p(y_i|x_i, \phi) \propto \exp(\phi_{y_i}^\top x_i)$. In the special case that each $\gamma_j$ is the identity matrix (each annotator is perfectly accurate), LOGRESP reduces to a multinomial logistic regression model. Because it is a conditional model, LOGRESP lacks any built-in capacity for semi-supervised learning.

Multinomial data model (MOMRESP)
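An alternative way to make ITEMRESP data-aware is by adding a generative multinomial data component (Lam and Stork, 2005; Felt et al., 2014). We refer to the model as MOMRESP, shown in Figure 1c.

$$p(\theta, \gamma, \phi, y, x, a) = p(\theta) \Big[\prod_{j \in J} \prod_{k \in K} p(\gamma_{jk})\Big] \Big[\prod_{k \in K} p(\phi_k)\Big] \prod_{i \in N} p(y_i|\theta)\, p(x_i|y_i, \phi) \prod_{j \in J} p(a_{ij}|\gamma_j, y_i) \qquad (3)$$

where $\phi_{kf}$ is the probability of feature f occurring in an instance of class k, $\phi_k \sim \text{Dirichlet}(b^{(\phi)}_k)$, $x_i|y_i, \phi \sim \text{Multinomial}(\phi_{y_i}, T_i)$, and $T_i$ is a number-of-trials parameter (e.g., for text classification $T_i$ is the number of words in document i). $T_i = |x_i|_1$ is observed during posterior inference $p(\theta, \gamma, \phi, y|x, a)$.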
Because MOMRESP is fully generative over the data features x, it naturally performs semi-supervised learning as data from unannotated instances $N_U$ inform inferred class labels $y_A$ of annotated instances via $\phi$. This can be seen by observing that p(x) terms prevent terms involving $y_U$ from summing out of the marginal distribution. When no instances are annotated, MOMRESP is a mixture-of-multinomials clustering model. Otherwise, the model resembles a semi-supervised naïve Bayes classifier (Nigam et al., 2006). However, naïve Bayes is supervised by trustworthy labels whereas MOMRESP is supervised by imperfect annotations mediated by inferred annotator error characteristics $\gamma$. In the special case that $\gamma$ is the identity matrix (each annotator is perfectly accurate), MOMRESP reduces to a possibly semi-supervised naïve Bayes classifier where each annotation is a fully trusted label.

A Generative-Discriminative Pair
MOMRESP and LOGRESP are a generative-discriminative pair, meaning that they belong to the same parametric model family but with parameters fit to optimize joint likelihood and conditional likelihood, respectively. This relationship is seen via the equivalence of the conditional probability of LOGRESP $p_L(y, a|x)$ and the same expression according to MOMRESP $p_M(y, a|x)$. For simplicity in this derivation we omit priors and consider $\phi$, $\theta$, and $\gamma$ to be known values. Then

$$p_M(y, a|x) = \frac{p(y)\, p(x|y)\, p(a|y)}{\sum_{y'} \sum_{a'} p(y')\, p(x|y')\, p(a'|y')} \qquad (4)$$

$$= \frac{p(y)\, p(x|y)}{\sum_{y'} p(y')\, p(x|y')} \cdot p(a|y) \qquad (5)$$

$$= \frac{\exp(w_y^\top x + z_y)}{\sum_{k} \exp(w_k^\top x + z_k)} \cdot p(a|y) \qquad (6)$$

$$= p_L(y, a|x).$$

Equation 4 follows from Bayes Rule and conditional independence in the model. In Equation 5, $p(a'|y')$ sums to 1 over $a'$. The first term of Equation 6 is the posterior $p(y|x)$ of a naïve Bayes classifier, known to have the same form as a logistic regression classifier where parameters w and z are constructed from $\phi$ and $\theta$.
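The claim that a naïve Bayes posterior has the form of a logistic regression classifier can be checked numerically. The sketch below assumes the standard construction in which the weights are the log feature probabilities and the biases are the log class priors; the toy parameter values and function names are hypothetical and chosen only for illustration.

```python
import math

def naive_bayes_posterior(theta, phi, x):
    """p(y = k | x) under a multinomial naive Bayes model.

    The multinomial coefficient is omitted because it cancels on normalization.
    """
    joint = [theta[k] * math.prod(p ** c for p, c in zip(phi[k], x))
             for k in range(len(theta))]
    total = sum(joint)
    return [v / total for v in joint]

def log_linear_posterior(w, z, x):
    """softmax(w_k . x + z_k), the form used by logistic regression."""
    scores = [sum(wf * c for wf, c in zip(w[k], x)) + z[k]
              for k in range(len(z))]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters: two classes, three features.
theta = [0.3, 0.7]
phi = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
x = [2, 0, 3]

# Assumed construction: w = log(phi), z = log(theta).
w = [[math.log(p) for p in row] for row in phi]
z = [math.log(t) for t in theta]

nb = naive_bayes_posterior(theta, phi, x)
ll = log_linear_posterior(w, z, x)
```

The two posteriors agree up to floating-point error, so the two models differ in how their parameters are fit rather than in the functional form of $p(y|x)$.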

Mean-field Variational Inference (MF)
In this section we present novel mean-field (MF) variational algorithms for LOGRESP and MOMRESP. Note that Liu et al. (2012) present (in an appendix) variational inference for LOGRESP based on belief propagation (BP). They do not test their algorithm for LOGRESP; however, their comparison of MF and BP variational inference for the ITEMRESP model indicates that the two flavors of variational inference perform very similarly. Our MF algorithm for LOGRESP has not been designed with the idea of outperforming its BP analogue, but rather with the goal of ensuring that the generative and discriminative models use the same inference algorithm. We expect that we would achieve the same results if our comparison used variational BP algorithms for both MOMRESP and LOGRESP, although such an additional comparison is beyond the scope of this work.
Broadly speaking, variational approaches to posterior inference transform inference into an optimization problem by searching within some family of tractable approximate distributions Q for the distribution $q \in Q$ that minimizes distributional divergence from an intractable target posterior $p^*$. In particular, under the mean-field assumption we confine our search to distributions Q that are fully factorized.
Algorithm. For LOGRESP, initialize each $q(y_i)$ to the empirical distribution observed in the annotations $a_i$. The Kullback-Leibler divergence $KL(q\|p^*)$ is minimized by iteratively updating each variational distribution in the model. Approximate distributions are updated by calculating variational parameters $\alpha^{(\cdot)}$, disambiguated by a superscript. Because $q(\gamma_{jk})$ is a Dirichlet distribution, the term $E_{q(\gamma_{jk})}[\log \gamma_{jkk'}]$ appearing in $q(y_i)$ is computed analytically as $\psi(\alpha^{(\gamma)}_{jkk'}) - \psi(\sum_{k''} \alpha^{(\gamma)}_{jkk''})$, where $\psi$ is the digamma function.
The distribution $q(\phi_k)$ is a logistic normal distribution. This means that the expectations $E_{q(\phi_k)}[\phi_{kf}]$ that appear in $q(y_i)$ cannot be computed analytically. Following Liu et al. (2012), we approximate the distribution $q(\phi_k)$ with the point estimate $\hat{\phi}_k = \arg\max_{\phi_k} q(\phi_k)$, which can be calculated using existing numerical optimization methods for log-linear models. Such maximization can be understood as embedding the variational algorithm inside of an outer EM loop such as might be used to tune hyperparameters in an empirical Bayesian approach (where $\phi$ are treated as hyperparameters).

MOMRESP Inference
MOMRESP's posterior $p^*(y, \theta, \gamma, \phi|x, a)$ is approximated with the fully factorized distribution

$$q(y, \theta, \gamma, \phi) = q(\theta) \Big[\prod_{j \in J} \prod_{k \in K} q(\gamma_{jk})\Big] \Big[\prod_{k \in K} q(\phi_k)\Big] \prod_{i \in N} q(y_i).$$

Algorithm. Initialize each $q(y_i)$ to the empirical distribution observed in the annotations $a_i$. The Kullback-Leibler divergence $KL(q\|p^*)$ is minimized by iteratively updating each variational distribution in the model. Approximate distributions are updated by calculating the values of variational parameters $\alpha^{(\cdot)}$, disambiguated by a superscript. The expectations of log terms in the $q(y_i)$ update are all with respect to Dirichlet distributions and so can be computed analytically as explained previously.
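The following is a minimal sketch of what such a mean-field loop can look like for MOMRESP. It uses the textbook coordinate-ascent updates for a conjugate Dirichlet-multinomial model (each Dirichlet parameter is its prior plus expected counts, and each $q(y_i)$ is proportional to the exponentiated sum of expected log terms); it is not claimed to reproduce the authors' exact update schedule. The pure-Python `digamma` approximation, the confusion-matrix prior `b_gamma` (diagonal 2, off-diagonal 1), and the toy data are all assumptions made for illustration.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series

def expected_log(alpha):
    """E[log p_k] under Dirichlet(alpha): psi(alpha_k) - psi(sum(alpha))."""
    psi_total = digamma(sum(alpha))
    return [digamma(a) - psi_total for a in alpha]

def normalize_log(scores):
    """Exponentiate and normalize log scores in a numerically stable way."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def momresp_mean_field(x, a, K, b_theta=1.0, b_phi=0.1, iters=20):
    """x[i][f]: feature counts; a[i][j][k2]: annotation counts. Returns q(y_i)."""
    N, F, J = len(x), len(x[0]), len(a[0])
    # Hypothetical prior favoring the diagonal of each confusion matrix.
    b_gamma = [[2.0 if k == k2 else 1.0 for k2 in range(K)] for k in range(K)]
    # Initialize q(y_i) to the empirical annotation distribution (uniform if none).
    q = []
    for i in range(N):
        votes = [sum(a[i][j][k] for j in range(J)) for k in range(K)]
        n = sum(votes)
        q.append([v / n for v in votes] if n else [1.0 / K] * K)
    for _ in range(iters):
        # Dirichlet parameters are prior plus expected counts under q(y).
        e_theta = expected_log([b_theta + sum(q[i][k] for i in range(N))
                                for k in range(K)])
        e_phi = [expected_log([b_phi + sum(q[i][k] * x[i][f] for i in range(N))
                               for f in range(F)]) for k in range(K)]
        e_gamma = [[expected_log([b_gamma[k][k2]
                                  + sum(q[i][k] * a[i][j][k2] for i in range(N))
                                  for k2 in range(K)])
                    for k in range(K)] for j in range(J)]
        # Update each q(y_i) from the expected log parameters.
        for i in range(N):
            scores = []
            for k in range(K):
                s = e_theta[k] + sum(x[i][f] * e_phi[k][f] for f in range(F))
                s += sum(a[i][j][k2] * e_gamma[j][k][k2]
                         for j in range(J) for k2 in range(K))
                scores.append(s)
            q[i] = normalize_log(scores)
    return q

# Toy corpus: class 0 uses features 0-1, class 1 uses features 2-3.
x = [[3, 2, 0, 0], [2, 3, 0, 0], [4, 1, 0, 0],
     [0, 0, 3, 2], [0, 0, 2, 3], [0, 0, 1, 4]]
# Two annotators; items 1 and 4 are unannotated, item 2 has a disagreement.
a = [[[1, 0], [1, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 1]],
     [[0, 1], [0, 1]], [[0, 0], [0, 0]], [[0, 1], [0, 1]]]
q = momresp_mean_field(x, a, K=2)
predictions = [max(range(2), key=lambda k: qi[k]) for qi in q]
```

On this toy corpus the unannotated items and the item with conflicting annotations are resolved by the data model, which illustrates how features can inform inferred labels when annotations are missing or ambiguous.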

Model priors and implementation details
Computing a lower bound on the log likelihood shows that in practice the variational algorithms presented above converge after only a dozen or so updates. We compute $\arg\max_{\phi_k} q(\phi_k)$ for LOGRESP using the L-BFGS algorithm as implemented in MALLET (McCallum, 2002). We choose uninformed priors $b^{(\theta)}_k = 1$ for MOMRESP and identity matrix $\Sigma = I$ for LOGRESP. We set $b^{(\phi)}_{kf} = 0.1$ for MOMRESP to encourage sparsity in per-class word distributions. Liu et al. (2012) argue that a uniform prior over the entries of each confusion matrix $\gamma_j$ can lead to degenerate performance. Accordingly, we set the diagonal entries of each $b^{(\gamma)}_j$ to a higher value $b^{(\gamma)}_{jkk} = \frac{1+\delta}{K+\delta}$ and off-diagonal entries to a lower value $b^{(\gamma)}_{jkk'} = \frac{1}{K+\delta}$ with $\delta = 2$.
Both MOMRESP and LOGRESP are given full access to all instances in the dataset, annotated and unannotated. However, as explained in Section 3.1, LOGRESP is conditioned on the data and thus is structurally unable to make use of unannotated data. We experimented briefly with self-training for LOGRESP but it had little effect. With additional effort one could likely settle on a heuristic scheme that allowed LOGRESP to benefit from unannotated data. However, since such an extension is external to the model itself, it is beyond the scope of this work.

Experiments with Simulated Annotators
Models which learn from error-prone annotations can be challenging to evaluate in a systematic way. Simulated annotations allow us to systematically control annotator behavior and measure the performance of our models in each configuration.

Simulating Annotators
We simulate an annotator by corrupting ground truth labels according to that annotator's accuracy parameters. Simulated annotators are drawn from the annotator quality pools listed in Table 1. Each row is a named pool and contains five annotators A1-A5, each with a corresponding accuracy parameter (the number five is chosen arbitrarily). In the pools HIGH, MED, and LOW, annotator errors are distributed uniformly across the incorrect classes. Because there are no patterns among errors, these settings approximate situations in which annotators are ultimately in agreement about the task they are doing, although some are better at it than others. The HIGH pool represents a corpus annotation project with high quality annotators. In the MED and LOW pools annotators are progressively less reliable.
The CONFLICT annotator pool in Table 1 is special in that annotator errors are made systematically rather than uniformly. Systematic errors are produced at simulation time by constructing a per-annotator confusion matrix (similar to $\gamma_j$) whose diagonal is set to the desired accuracy setting, and whose off-diagonal row entries are sampled from a symmetric Dirichlet distribution with parameter 0.1 to encourage sparsity and then scaled so that each row properly sums to 1. These draws from a sparse Dirichlet yield consistent error patterns. The CONFLICT pool approximates an annotation project where annotators understand the annotation guidelines differently from one another. For the sake of example, annotator A5 in the CONFLICT setting will annotate documents with the true class B as B exactly 10% of the time but might annotate B as C 85% of the time. On the other hand, annotator A4 might annotate B as D most of the time. We choose low agreement rates for CONFLICT to highlight a case that violates majority vote's assumption that annotators are basically in agreement.

Pool       A1    A2    A3    A4    A5
HIGH       90    85    80    75    70
MED        70    65    60    55    50
LOW        50    40    30    20    10
CONFLICT   50†   40†   30†   20†   10†

Table 1: For each simulated annotator quality pool (HIGH, MED, LOW, CONFLICT), annotators A1-A5 are assigned an accuracy. † indicates that errors are systematically in conflict as described in the text.
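The two kinds of simulated annotator described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the simulation code used for the experiments: the function names are hypothetical, the Dirichlet draw is implemented by normalizing gamma variates, and the fallback for an all-zero draw is an added safeguard.

```python
import random

def uniform_confusion(accuracy, K):
    """Confusion matrix with errors spread evenly over the incorrect classes."""
    off = (1.0 - accuracy) / (K - 1)
    return [[accuracy if k2 == k else off for k2 in range(K)] for k in range(K)]

def conflict_confusion(accuracy, K, rng, concentration=0.1):
    """Confusion matrix with sparse, systematic error patterns.

    Off-diagonal entries in each row come from a symmetric Dirichlet draw
    (normalized gamma variates), rescaled so that the row sums to 1.
    """
    matrix = []
    for k in range(K):
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(K - 1)]
        total = sum(draws)
        if total == 0.0:  # safeguard against numerical underflow
            draws, total = [1.0] * (K - 1), float(K - 1)
        off = iter((1.0 - accuracy) * d / total for d in draws)
        matrix.append([accuracy if k2 == k else next(off) for k2 in range(K)])
    return matrix

def annotate(true_label, confusion, rng):
    """Corrupt a ground truth label according to an annotator's confusion matrix."""
    row = confusion[true_label]
    return rng.choices(range(len(row)), weights=row)[0]

rng = random.Random(0)
med_a1 = uniform_confusion(0.70, K=4)            # e.g., annotator A1 in the MED pool
conflict_a5 = conflict_confusion(0.10, K=4, rng=rng)
samples = [annotate(2, med_a1, rng) for _ in range(10000)]
observed_accuracy = samples.count(2) / len(samples)
```

Because each row of a CONFLICT matrix concentrates its error mass on one or two wrong classes, the same annotator tends to repeat the same mistake, which is what makes these errors systematic rather than random.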

Datasets and Features
We simulate the annotator pools from Table 1 on each of six text classification datasets. The datasets 20 Newsgroups, WebKB, Cade12, Reuters8, and Reuters52 are described by Cardoso-Cachopo (2007). The LDC-labeled Enron emails dataset is described by Berry et al. (2001). Each dataset is preprocessed via Porter stemming and by removal of the stopwords from MALLET's stopword list. Features occurring fewer than 5 times in the corpus are discarded. Features are fractionally scaled so that $|x_i|_1$ is equal to the average document length, since document scaling has been shown to be beneficial for multinomial document models (Nigam et al., 2006).
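The fractional scaling step can be illustrated with a short sketch. It assumes each document is a list of feature counts and rescales every non-empty document so that its counts sum to the corpus-wide average length; the function name and toy counts are hypothetical.

```python
def scale_to_average_length(docs):
    """Rescale each count vector so its L1 norm equals the average document length."""
    lengths = [sum(doc) for doc in docs]
    target = sum(lengths) / len(docs)
    # Empty documents are left unchanged to avoid division by zero.
    return [[count * target / n for count in doc] if n else list(doc)
            for doc, n in zip(docs, lengths)]

# Hypothetical toy corpus with document lengths 4, 2, and 6 (average 4).
docs = [[2, 0, 2], [1, 1, 0], [6, 0, 0]]
scaled = scale_to_average_length(docs)
```

After scaling, short and long documents contribute the same total count mass to the multinomial data model, so no single long document dominates the estimated per-class word distributions.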
Each dataset is annotated according to the following process: an instance is selected at random (without replacement) and annotated by three annotators selected at random (without replacement). Because annotation simulation is a stochastic process, each simulation is repeated five times.

Validating Mean-field Variational Inference
Figure 2 compares mean-field variational inference (MF) with alternative inference algorithms from previous work. For variety, the left and right plots are calculated over arbitrarily chosen datasets and annotator pools, but these trends are representative of other settings. MOMRESP using MF is compared with MOMRESP using Gibbs sampling estimating p(y|x, a) from several hundred samples (an improvement to the method used by Felt et al. (2014)). MOMRESP benefits significantly from MF. We suspect that this disparity could be reduced via hyperparameter optimization as indicated by Asuncion et al. (2009). However, that investigation is beyond the scope of the current work. LOGRESP using MF is compared with LOGRESP using expectation maximization (EM) as in Raykar et al. (2010). LOGRESP with MF displays minor improvements over LOGRESP with EM. This is consistent with the modest gains that Liu et al. (2012) reported when comparing variational and EM inference for the ITEMRESP model.

Discriminative (LOGRESP) versus Generative (MOMRESP)
We run MOMRESP and LOGRESP with MF inference on the cross product of datasets and annotator pools. Inferred label accuracy on items that have been annotated is the primary task of crowdsourcing; we track this measure accordingly. However, the ability of these models to generalize on unannotated data is also of interest and allows better comparison with traditional non-crowdsourcing models. Figure 3 plots learning curves for each annotator pool on the 20 Newsgroups dataset; results on other datasets are summarized in Table 2. The first row of Figure 3 plots the accuracy of labels inferred from annotations. The second row of Figure 3 plots generalization accuracy using the inferred model parameters φ (and θ in the case of MOMRESP) on held-out test sets with no annotations. The generalization accuracy curves of MOMRESP and LOGRESP may be compared with those of naïve Bayes and logistic regression, respectively. Recall that in the special case where annotations are both flawless and trusted (via diagonal confusion matrices γ) then MOMRESP and LOGRESP simplify to semi-supervised naïve Bayes and logistic regression classifiers, respectively.
Notice that MOMRESP climbs more steeply than LOGRESP in all cases. This observation is in keeping with previous work in supervised learning. Ng and Jordan (2001) argue that generative and discriminative models have complementary strengths: generative models tend to have steeper learning curves and converge in terms of parameter values after only log n training examples, whereas discriminative models tend to achieve higher asymptotic levels but converge more slowly after n training examples. The second row of Figure 3 shows that even after training on three-deep annotations over the entire 20 Newsgroups dataset, LOGRESP's data model does not approach its asymptotic level of performance. The early steep slope of the generative model is more desirable in this setting than the eventually superior performance of the discriminative model given large numbers of annotations. Figure 4 additionally plots MOMRESPA, a variant of MOMRESP deprived of all unannotated documents, showing that the early generative advantage is not attributable entirely to semi-supervision.
The generative model is more robust to annotation noise than the discriminative model, seen by comparing the LOW, MED, and HIGH columns in Figure 3. This robustness is significant because crowdsourcing tends to yield noisy annotations, making the LOW and MED annotator pools of greatest practical interest. This assertion is borne out by an experiment with CrowdFlower, reported in Section 6.
To validate that LOGRESP does, indeed, asymptotically surpass MOMRESP we ran inference on datasets with increasing annotation depths. Crossover does not occur until 20 Newsgroups is annotated nearly 12-deep for LOW, 5-deep for MED, and 3.5-deep (on average) for HIGH. Additionally, for each combination of dataset and annotator pool except those involving CONFLICT, by the time LOGRESP surpasses MOMRESP, the majority vote baseline is extremely competitive with LOGRESP. The CONFLICT setting is the exception to this rule: CONFLICT annotators are particularly challenging for majority vote since they violate the implicit assumption that annotators are basically aligned with the truth. The CONFLICT setting is of practical interest only when annotators have dramatic deep-seated differences of opinion about what various labels should mean. For most crowdsourcing projects this issue may be avoided with sufficient up-front orientation of the annotators. For reference, in Figure 4 we show that a less extreme variant of CONFLICT behaves more similarly to LOW. Table 2 reports the percent of the dataset that must be annotated three-deep before LOGRESP's inferred label accuracy surpasses that of MOMRESP. Crossover tends to happen later when annotation quality is low and earlier when annotation quality is high. Cases reported as NA were too close to call; that is, the dominating algorithm changed depending on the random run.
Unsurprisingly, MOMRESP is not well suited to all classification datasets. The 0% entries in Table 2 mean that LOGRESP dominates the learning curve for that annotator pool and dataset. These cases are likely the result of the MOMRESP model making the same strict inter-feature independence assumptions as naïve Bayes, rendering it tractable and effective for many classification tasks but ill-suited for datasets where features are highly correlated or for tasks in which class identity is not informed by document vocabulary. The CADE12 dataset, in particular, is known to be challenging. A supervised naïve Bayes classifier achieves only 57% accuracy on this dataset (Cardoso-Cachopo, 2007). We would expect MOMRESP to perform similarly poorly on sentiment classification data. Although we assert that generative models are inherently better suited to crowdsourcing than discriminative models, a sufficiently strong mismatch between model assumptions and data can negate this advantage.

Experiments with Human Annotators
In the previous section we used simulations to control annotator error. In this section we relax that control. To assess the effect of real-world annotation error on MOMRESP and LOGRESP, we selected 1000 instances at random from 20 Newsgroups and paid annotators on CrowdFlower to annotate them with the 20 Newsgroups categories, presented as human-readable names (e.g., "Atheism" for alt.atheism). Annotators were allowed to express uncertainty by