Uncertainty-aware generative models for inferring document class prevalence

Prevalence estimation is the task of inferring the relative frequency of classes of unlabeled examples in a group—for example, the proportion of a document collection with positive sentiment. Previous work has focused on aggregating and adjusting discriminative individual classifiers to obtain prevalence point estimates. But imperfect classifier accuracy ought to be reflected in uncertainty over the predicted prevalence for scientifically valid inference. In this work, we present (1) a generative probabilistic modeling approach to prevalence estimation, and (2) the construction and evaluation of prevalence confidence intervals; in particular, we demonstrate that an off-the-shelf discriminative classifier can be given a generative re-interpretation, by backing out an implicit individual-level likelihood function, which can be used to conduct fast and simple group-level Bayesian inference. Empirically, we demonstrate our approach provides better confidence interval coverage than an alternative, and is dramatically more robust to shifts in the class prior between training and testing.


Introduction
The goal of prevalence estimation is to infer the relative frequency of classes $y_i$ associated with unlabeled examples (e.g. documents) from a group, $x_i \in D$. For example, one might want to estimate the proportion of blogs with a positive sentiment towards a political candidate (Hopkins and King, 2010), sentiment of responses to natural disasters on social media (Mandel et al., 2012), or prevalence of car types in street photos to infer neighborhood demographics (Gebru et al., 2017). Often, an analyst wants to compare prevalence between multiple groups, such as inferring prevalence variation over time (e.g., changes to online abuse content (Bissias et al., 2016)), or across other covariates (e.g., changes in police officers' "respect" when speaking to minorities (Voigt et al., 2017)). This problem has been re-introduced in many different fields: as "quantification" in data mining (Forman, 2005, 2008), "prevalence estimation" in statistics and epidemiology (Gart and Buck, 1966), and "class prior estimation" in machine learning (Vucetic and Obradovic, 2001; Saerens et al., 2002). In NLP, SemEval 2016 and 2017 included Twitter sentiment class prevalence tasks (Nakov et al., 2016; Rosenthal et al., 2017).
Prevalence estimation assumes access to a (potentially small) set of labeled examples to train a classifier; but unlike the task of individual classification, the goal is to estimate the proportion of a class among examples in a group. If a perfectly accurate classifier is available, it is trivial to construct a perfect prevalence estimate by counting the classification decisions (§3.1). In fact, most application papers in the previous paragraph use this or a similar aggregation rule to conduct their prevalence estimates. However, classifiers often exhibit errors from different sources, including:
• Shifts in the class distribution from training to testing ($P_{train}(y) \neq P_{test}(y)$). A classifier may be biased toward predicting $P_{train}(y)$.
• Difficult classification tasks (such as predicting sentiment or sarcasm) that result in low accuracy classifiers; this can be exacerbated by limited training data, as is common in social science or industry settings that require manual human annotation for labels.
It is typically assumed (and sometimes confirmed) that when an individual classifier has less than 100% accuracy, it can still give reasonable prevalence estimates.[2] However, there is relatively little understanding of the extent to which the quality of the document-level model impacts prevalence estimates. Imperfect classifier accuracy ought to be reflected in uncertainty over the predicted prevalence.
In this work, we tackle both of these challenges simultaneously, using a generative probabilistic modeling approach to prevalence estimation. This model directly parameterizes and conducts inference for the unknown prevalence, naturally accommodating shifts between training and testing, and also allows us to infer confidence intervals for the prevalence. We show that our best model can be seen as an implicit likelihood generative re-interpretation of an off-the-shelf discriminative classifier ( §4.2); this unifies it with previous work, and also is easy for a practitioner to apply.
We additionally review several types of class prevalence estimators from the literature ( §3), and conduct a robust empirical evaluation on sentiment analysis over hundreds of document groups, illustrating the methods' biases and robustness to class prior shift between training and testing. Our method provides better confidence interval coverage and is more robust to class prior shift than previous methods, and is substantially more accurate than an algorithm in widespread use in political science.

Problem definition
We consider two prevalence estimation problems: (1) point prediction and (2) confidence interval prediction. In this work, we are most interested in supervised learning for discrete-valued document labels, with access to a small to moderate number (e.g. around 1000) of labeled documents with text $x$ and label $y$: $(x_i, y_i) \in D_{train}$. We restrict attention to binary-valued labels $y \in \{0, 1\}$. At test time, there are one or more groups of unlabeled test documents, $D^{(1)}, \ldots, D^{(G)}$; for example, one group might be a set of tweets sent during a certain month, or a set of online reviews associated with a particular product. For each group $D$, let $\theta^* \equiv (1/n) \sum_{i=1}^{n} y_i$ be the true proportion of positive labels (where $n = |D|$).
The prevalence point prediction problem is to take an unlabeled document group $D$ as input and infer an estimated $\hat{\theta} \in [0, 1]$. Ideally, this point estimate should be close to the true prevalence $\theta^*$; we evaluate this by mean absolute error.
[2] For example, Bissias et al. find a relative mean absolute error of less than 0.01 when the individual classifier has ROC AUC of 0.91.
Figure 1: Example posterior distributions with MAP prevalence estimates, $\hat{\theta}$ (solid line), and the true prevalence, $\theta^*$ (dashed line). A desirable property is that confidence intervals (shaded regions; technically Bayesian credible intervals) will be wider for more uncertain models. For example, the wider CI on the right (green) contains $\theta^*$ whereas the narrower CI on the left (red) does not.
In this work, we are the first (that we know of) to introduce the question of uncertainty in prevalence estimation. Since document classifiers are typically far from perfectly accurate, we should expect substantial error in prevalence prediction, and inference methods should quantify such uncertainty. We formalize this as prevalence confidence interval (CI) inference, which takes as input a desired nominal coverage level $(1 - \alpha)$ and predicts a real-valued interval $[\hat{\theta}_{lo}, \hat{\theta}_{hi}] \subseteq [0, 1]$. Ideally, a CI prediction algorithm should have frequentist coverage semantics: over a large number of test groups, a fraction $(1 - \alpha)$ of the predicted intervals ought to contain the true value $\theta^*$. If the problem is hard (for example, if the relationship between document features and the label is not captured well by the model), the CI should be wide. We empirically evaluate coverage of CI-aware prevalence inference models. See Fig. 1 for an intuitive example.
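The coverage criterion above can be checked empirically once intervals have been predicted for many groups. The following is a minimal sketch, assuming each test group yields one (lo, hi) interval and a known true prevalence; the function names are illustrative and do not come from the paper.

```python
from typing import Sequence, Tuple


def empirical_coverage(
    intervals: Sequence[Tuple[float, float]],
    true_prevalences: Sequence[float],
) -> float:
    """Fraction of groups whose interval [lo, hi] contains the true prevalence."""
    if len(intervals) != len(true_prevalences) or not intervals:
        raise ValueError("need one interval per group, and at least one group")
    hits = sum(
        lo <= theta <= hi
        for (lo, hi), theta in zip(intervals, true_prevalences)
    )
    return hits / len(intervals)


def mean_width(intervals: Sequence[Tuple[float, float]]) -> float:
    """Average interval width; narrower intervals express more confidence."""
    if not intervals:
        raise ValueError("need at least one interval")
    return sum(hi - lo for lo, hi in intervals) / len(intervals)
```

For a nominal 90% interval, a well-behaved method would return a coverage near 0.90 from `empirical_coverage`; reporting `mean_width` alongside it guards against trivially wide intervals.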

Review and baselines: Discriminative individual classification aggregation
The most straightforward baseline approach to prevalence estimation is to build on discriminative, supervised learning for individual-level labels, such as binary logistic regression with bag-of-words features, randomized feature hashing (Weinberger et al., 2009), or neural networks (Goldberg, 2016). Such a model defines an individual document's label probability $p_i \equiv p_\beta(y_i = 1 \mid x_i)$, where parameters $\beta$ are fit by maximizing regularized likelihood on the labeled training data.

Classify and Count (CC)
For prevalence point estimation, Forman (2005) defines the "classify and count" (CC) method as simply averaging the most-likely individual label predictions,
$$\hat{\theta}_{CC} = \frac{1}{n} \sum_{i} 1\{p_i > 0.5\}.$$
This is the most obvious approach for practitioners, but it has at least two weaknesses, which have been addressed in different groups of prior work. First, the class proportions may change between training and test groups, which the Adjusted CC and ReadMe algorithms attempt to fix (§3.2-3.3). Second, it discards probabilistic information, which is remedied by the Probabilistic CC method, and an extension we propose (§3.4-3.5).
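As a concrete illustration of CC, the sketch below thresholds each document's predicted probability and averages the resulting hard labels. It assumes the probabilities $p_i$ have already been produced by some classifier; the function name is hypothetical.

```python
from typing import Sequence


def classify_and_count(probs: Sequence[float], threshold: float = 0.5) -> float:
    """CC estimate: fraction of documents whose p_i exceeds the threshold."""
    if not probs:
        raise ValueError("empty document group")
    return sum(p > threshold for p in probs) / len(probs)
```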

Adjusted Classify and Count (ACC)
CC may encounter problems if the test class distribution differs from the training distribution. The "adjusted classify-and-count" method (Gart and Buck, 1966; Forman, 2005) treats the classifier output as a proxy variable, and estimates a separate confusion model of classifier output $\hat{y}_i \equiv 1\{p_i > 0.5\}$ conditional on the true label, $p(\hat{y} \mid y)$, from cross-validation within the training set. Assuming the confusion model extends to the test data, a moment-matching approach is then used to infer the true label proportions, by first observing $p_{test}(\hat{y}) = \sum_y p(\hat{y} \mid y)\, p_{test}(y)$ and solving the linear system for $p_{test}(y)$, the test-time expected class prevalence. Using empirical estimates for the true positive rate $TPR = p(\hat{y} = 1 \mid y = 1)$, the false positive rate $FPR = p(\hat{y} = 1 \mid y = 0)$, and $\hat{\theta}_{CC} = p(\hat{y} = 1)$, it has the closed form
$$\hat{\theta}_{ACC} = \frac{\hat{\theta}_{CC} - FPR}{TPR - FPR}.$$
By design, ACC is more robust to a new test-time prevalence, but it relies on the accuracy of its TPR and FPR estimates, and its lack of probabilistic semantics makes it unclear how to infer confidence intervals.
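The ACC correction can be sketched in a few lines. This is a minimal illustration, assuming TPR and FPR have already been estimated by cross-validation; clipping the result to [0, 1] is a common practical safeguard added here as an assumption, not something the text above specifies.

```python
def adjusted_classify_and_count(theta_cc: float, tpr: float, fpr: float) -> float:
    """ACC estimate: invert theta_cc = theta * TPR + (1 - theta) * FPR for theta."""
    if tpr <= fpr:
        raise ValueError("classifier must satisfy TPR > FPR for the adjustment")
    theta = (theta_cc - fpr) / (tpr - fpr)
    # Assumption: clip to the valid range, since noisy rate estimates can
    # push the raw solution below 0 or above 1.
    return min(1.0, max(0.0, theta))
```

For example, if the true prevalence is 0.3 with TPR = 0.8 and FPR = 0.2, the expected CC output is 0.3 × 0.8 + 0.7 × 0.2 = 0.38, and the adjustment maps 0.38 back to 0.3.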

ReadMe algorithm
An interesting extension to ACC is to remove the need for a discriminative classifier, by directly modeling text conditional on the latent document class. The ReadMe algorithm, developed in political science (Hopkins and King, 2010), extends ACC's linear system for every term type in a (subsampled and augmented) term vocabulary $V$, and calculates their class-conditional probabilities from the training data. Assuming these conditional models also hold in the test data, that implies $p_{test}(w) = \sum_y \hat{p}(w \mid y)\, p_{test}(y)$; the algorithm infers $p_{test}(y)$ by minimizing the squared error of predicted versus empirical term frequencies in the test set. The open-source ReadMe software package has been used in numerous political science studies, including inferring proportions of types of censored Chinese news (King et al., 2013), credit claiming in Congressional press releases (Grimmer et al., 2012), and voter intentions among Twitter messages (Ceron et al., 2015).
ReadMe is theoretically appealing in that it infers latent class prevalences to explain the test group's textual evidence; but as a nonprobabilistic model, it does not directly imply a method for confidence intervals (Hopkins and King use the bootstrap). Furthermore, our experiments ( §5), contra the original paper, show its implementation exhibits poor performance.
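The moment-matching idea behind ReadMe can be illustrated for the binary case, where the least-squares problem over a single prevalence has a closed form. This is only a sketch of the underlying principle, not the ReadMe implementation (which subsamples and augments the vocabulary); all names are hypothetical, and the clipping to [0, 1] is an added assumption.

```python
from typing import Sequence


def moment_matching_prevalence(
    test_word_freqs: Sequence[float],
    word_probs_neg: Sequence[float],
    word_probs_pos: Sequence[float],
) -> float:
    """Least-squares theta for q_w ~ theta * p(w|y=1) + (1 - theta) * p(w|y=0)."""
    num = 0.0
    den = 0.0
    for q, a, b in zip(test_word_freqs, word_probs_neg, word_probs_pos):
        # Rewrite the model as (q - a) ~ theta * (b - a) and solve by least squares.
        num += (q - a) * (b - a)
        den += (b - a) ** 2
    if den == 0.0:
        raise ValueError("class-conditional word distributions are identical")
    # Assumption: clip to the valid prevalence range.
    return min(1.0, max(0.0, num / den))
```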

Probabilistic Classify and Count (PCC)
Both the CC and ACC methods discard uncertainty information from the classification model. In a difficult classification setting, for example, we might expect many probabilities to be near, say, 0.6, in which case the CC method may undercount the negative class. This suggests an alternative method, "probabilistic classify and count" (PCC):
$$\hat{\theta}_{PCC} = \frac{1}{n} \sum_{i} p_i,$$
which is the expected value of the prevalence $(1/n) \sum_i y_i$, assuming each $y_i$ is distributed according to the original probabilistic classifier.
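The contrast between CC and PCC in the 0.6 example can be made concrete. The sketch below assumes a list of predicted probabilities is available; the function names are illustrative.

```python
from typing import Sequence, Tuple


def probabilistic_classify_and_count(probs: Sequence[float]) -> float:
    """PCC estimate: the mean predicted probability over the group."""
    if not probs:
        raise ValueError("empty document group")
    return sum(probs) / len(probs)


def cc_versus_pcc(probs: Sequence[float]) -> Tuple[float, float]:
    """Return (CC, PCC) for the same probabilities, to show how they differ."""
    cc = sum(p > 0.5 for p in probs) / len(probs)
    return cc, probabilistic_classify_and_count(probs)
```

With every probability equal to 0.6, CC labels all documents positive and reports 1.0, while PCC reports 0.6.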

PCC Poisson-Binomial distribution (PB-PCC)
If we assume each $y_i$ is conditionally independent given text $x_i$ and model parameters $\beta$, this defines a fully probabilistic model for the class prevalence. Let the latent variable $S = \sum_i y_i$; its distribution is thus Poisson-Binomial (Chen and Liu, 1997). The modeled prevalence distribution $p(S/n \mid D)$ can be inferred by Monte Carlo sampling: each iteration samples every $y_i$ and sums them for an $S$ sample. The $S/n$ distribution over many iterations can be used to construct a Monte Carlo CDF $\hat{F}$, from which any desired interval can be read off. To a certain degree, this model captures uncertainty in the classifier, since the per-document variance, $p_i(1 - p_i)$, is high when $p_i = 0.5$ and low when $p_i$ is near 0 or 1. However, it also has a major weakness: the variance concentrates with a large test group size $n$, which is the wrong behavior when a classifier is truly noisy, for example, when a classifier is genuinely uncertain and predicts the same constant $p_i = q$ for each document. In this case, the correct behavior would be to maintain a flat, wide posterior belief about $\theta$, which is better accomplished by the generative model we introduce in the subsequent section.
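The Monte Carlo procedure and its concentration problem can be sketched as follows. This is a minimal illustration using only the standard library; the use of a central quantile interval is an assumption, since the text above does not fix which interval is read from the Monte Carlo CDF, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_prevalences(
    probs: Sequence[float], n_samples: int = 500, seed: int = 0
) -> List[float]:
    """Draw S/n samples: each draw samples every y_i ~ Bernoulli(p_i) and averages."""
    rng = random.Random(seed)
    n = len(probs)
    return [sum(rng.random() < p for p in probs) / n for _ in range(n_samples)]


def central_interval(samples: Sequence[float], alpha: float = 0.10) -> Tuple[float, float]:
    """Assumed interval type: empirical alpha/2 and 1 - alpha/2 quantiles."""
    ordered = sorted(samples)
    lo_idx = int((alpha / 2) * (len(ordered) - 1))
    hi_idx = int((1 - alpha / 2) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# A constant, uninformative classifier (p_i = 0.6 everywhere): the interval
# still shrinks as the group grows, which is the weakness described above.
small = central_interval(sample_prevalences([0.6] * 20))
large = central_interval(sample_prevalences([0.6] * 2000))
```

Comparing `small` and `large` shows the interval narrowing around 0.6 purely because the group is larger, even though the classifier carries no document-level information.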

Our approach: generative probabilistic modeling
We turn to generative modeling, which seeks to jointly model the probability of labels and text in both the training and test groups, by assuming a document's text is generated conditional on the document label. Language models have widespread use in natural language processing, and class-conditional models have been used for document classification (e.g. multinomial Naive Bayes; McCallum and Nigam (1998)). We use a similar generative setup to explicitly model a class prevalence for test group $g$, with a generative story for each (bag-of-words) document $i$ in the group:
$$\theta_g \sim \mathrm{Dist}(\alpha), \qquad y_i \sim \mathrm{Bernoulli}(\theta_g), \qquad x_i \sim \mathrm{Multinomial}(\phi_{y_i}).$$
The test group is assumed to have a latent class prior $\theta_g$, which itself has a prior distribution (we assume $\mathrm{Dist}(\alpha) = \mathrm{Unif}(0, 1)$ in this work). For each class $k$, $\phi_k$ is a class-conditional unigram language model, which is learned from the training data but fixed at test time. We then perform inference to find $\theta_g$ that gives a high probability to text data $\{x_i \in D^{(g)}\}$. Figure 2 shows the probabilistic graphical model.

MNB and Loglin language models
We experiment with two explicit language models in this generative framework: (1) multinomial Naive Bayes (MNB), using a training-time symmetric Dirichlet prior $\phi_y \sim \mathrm{Dir}(\lambda / V)$ for vocabulary size $V$ and "pseudocount" $\lambda$, and (2) an additive log-linear model (Loglin, a.k.a. SAGE (Eisenstein et al., 2011)). Loglin estimates words' probabilities as deviations from a background log-probability $m$,
$$p(w \mid y) \propto \exp(m_w + \eta_{y,w}),$$
where $m_w$ is the empirical log-probability of a word $w$ among all training documents, and $\eta_{y,w}$ denotes class-specific deviations of the log-probability of a word $w$, MAP estimated under a sparsity-inducing L1 penalty. Such sparse additive models have been used in both supervised and unsupervised document modeling; for example, as a document-level posterior classifier it outperforms MNB (Eisenstein et al., 2011), or even discriminative models (Taddy, 2013), and its sparsity helps interpretability for analyzing political, literary, and legal texts (Monroe et al., 2008; Sim et al., 2013; Bamman et al., 2014; Wang et al., 2012).
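To make the MNB component concrete, the sketch below fits one class-conditional unigram model with Dirichlet smoothing and scores a document under it. It assumes a posterior-mean estimate, (count + λ/V) / (total + λ), which is one reasonable reading of the symmetric Dirichlet prior above rather than a confirmed detail of the experiments; all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def fit_unigram_log_probs(
    docs: Iterable[List[str]], vocab: Sequence[str], pseudocount: float = 1.0
) -> Dict[str, float]:
    """Smoothed log phi_w for one class, from that class's training documents."""
    vocab_set = set(vocab)
    counts = Counter(w for doc in docs for w in doc if w in vocab_set)
    total = sum(counts.values())
    v = len(vocab)
    # Assumption: posterior mean under a symmetric Dirichlet with
    # concentration pseudocount / V on each vocabulary item.
    return {
        w: math.log((counts[w] + pseudocount / v) / (total + pseudocount))
        for w in vocab
    }


def doc_log_likelihood(doc: Iterable[str], log_phi: Dict[str, float]) -> float:
    """log p(x | y) up to the multinomial coefficient, which is shared by all classes."""
    return sum(log_phi[w] for w in doc if w in log_phi)
```

Fitting one such model per class gives the class-conditional likelihoods that the group-level inference over the prevalence consumes.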

Implicit likelihoods from discriminative classifiers (LR-Implicit)
This generative formulation has a major advantage over the discriminative, CC-style aggregation models because it sets up a likelihood and posterior distribution over $\theta$. But in terms of document modeling for classification purposes, the independence assumptions of the generative model are typically too strong, and for document-level classification, discriminative models tend to outperform similarly parameterized generative ones, especially when the training set is sufficiently large (Ng and Jordan, 2002). Thus, discriminative models may have information better suited to class prevalence inference. Also, since the most common practice for document classification is to use discriminative models, it would be helpful to more effectively use discriminative posteriors within our generative context. In Naive Bayes-style generative document classification, the model defines $p_{gen}(x \mid y)$ and class prior $p(y)$, which are combined to calculate the posterior $p_{gen}(y \mid x) \propto p_{gen}(x \mid y)\, p(y)$. Discriminative models, by contrast, directly define a $p_{disc}(y \mid x)$. We can, however, expand this quantity via Bayes Rule:
$$p_{disc}(y \mid x) \propto p_{implicit}(x \mid y)\, p(y).$$
The "implicit document likelihood" $p_{implicit}(x \mid y)$ is a likelihood function that, combined with a particular class prior $p(y)$, would have resulted in the same posterior predicted by the discriminative model. Given the discriminative posterior predictions and the training-time class prior $p_{train}(y) = \hat{\theta}_{train}$, an implicit likelihood function can be backed out for any particular document $x$; we define the "simple implicit" likelihood for document $x$ to be:
$$p_{implicit}(x \mid y) \equiv \frac{p_{disc}(y \mid x)}{p_{train}(y)}.$$
This takes the form of a correction of the discriminative posterior, by dividing out the training-time class prevalence.[5] Our LR-Implicit generative model uses the same class prevalence and document label generation setup as before, but to calculate the individual documents' $p(x \mid y)$ probabilities, it uses $p_{implicit}$ based on a logistic regression $p_{disc}$.[6]
[5] Technically, $p_{implicit}$ is retrievable only up to a constant, and the simple implicit likelihood is one particular compatible choice, since it can be multiplied by any constant and still be consistent with the Bayes Rule expansion (Eq. 9), and would give rise to the same document- and group-level posteriors.
[6] The implicit likelihood still has the form of a logistic regression, adjusting its bias term: if $p_{disc}(y \mid x) = \sigma(\beta^\top x + \beta_0)$, then $p_{implicit}(x \mid y) = \sigma(\beta^\top x + \beta_0 - \log(\hat{\theta}_{train} / (1 - \hat{\theta}_{train})))$.
This construction is closely related to Saerens et al. (2002)'s EM algorithm for adjusting a classifier for a test set's class prior; they derive it differently, by applying the assumption $p_{train}(x \mid y) = p_{test}(x \mid y)$, expanding each side with Bayes' Rule, solving for $p_{test}(y \mid x)$, then estimating $p_{test}(y)$ via EM. Their procedure in fact optimizes the same marginal likelihood function described in the next section, under the implicit-discriminative generative model; our formulation broadens it into a fully Bayesian or likelihood-based model.
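The implicit likelihood and its connection to EM-style prior adjustment can be sketched as follows. This is a minimal illustration assuming each document comes with a discriminative posterior $p_{disc}(y = 1 \mid x)$ and a known training prevalence; the EM loop follows the standard mixture-weight update and is offered as a sketch of the idea, not as the exact procedure of Saerens et al. (2002). All names are hypothetical.

```python
from typing import Sequence, Tuple


def implicit_likelihoods(p_disc_pos: float, theta_train: float) -> Tuple[float, float]:
    """Simple implicit likelihoods (L_plus, L_minus) = p_disc(y | x) / p_train(y)."""
    if not 0.0 < theta_train < 1.0:
        raise ValueError("training prevalence must lie strictly between 0 and 1")
    return p_disc_pos / theta_train, (1.0 - p_disc_pos) / (1.0 - theta_train)


def reweighted_posterior(p_disc_pos: float, theta_train: float, theta_test: float) -> float:
    """Document posterior p(y = 1 | x) after swapping in a new class prior."""
    l_pos, l_neg = implicit_likelihoods(p_disc_pos, theta_train)
    num = theta_test * l_pos
    return num / (num + (1.0 - theta_test) * l_neg)


def em_prevalence(p_disc: Sequence[float], theta_train: float, n_iter: int = 200) -> float:
    """EM for the group prevalence, holding the implicit likelihoods fixed."""
    liks = [implicit_likelihoods(p, theta_train) for p in p_disc]
    theta = theta_train  # initialize at the training prior
    for _ in range(n_iter):
        # E-step: each document's posterior under the current theta.
        # M-step: the new theta is the mean of those posteriors.
        theta = sum(
            theta * lp / (theta * lp + (1.0 - theta) * ln) for lp, ln in liks
        ) / len(liks)
    return theta
```

Setting `theta_test` equal to `theta_train` in `reweighted_posterior` returns the original discriminative posterior, which is a quick sanity check on the construction.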

Inference
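To estimate class prevalence, we use the marginal log likelihood over $\theta$ to obtain a posterior over $\theta$. For each test group $g$, we have the marginal log probability of all document texts,
$$MLL_g(\theta) \equiv \log p(\{x_i\} \mid \theta) = \sum_{i \in D^{(g)}} \log\big(\theta L_i^+ + (1 - \theta) L_i^-\big),$$
where $L_i^+ \equiv p(x_i \mid y_i = 1)$ and $L_i^- \equiv p(x_i \mid y_i = 0)$ are the document's class-conditional likelihoods. Its gradient is
$$\frac{\partial MLL_g}{\partial \theta} = \sum_i \frac{L_i^+ - L_i^-}{\theta L_i^+ + (1 - \theta) L_i^-};$$
intuitively, the sign of the numerator says that documents that are more likely under the positive than negative class encourage higher likelihood for larger values of $\theta$. When the model is uncertain about a document (that is, when $L_i^+ \approx L_i^-$), that document contributes a relatively flat likelihood curve, expressing little preference among values of $\theta$. If a model is more heavily regularized (for example, when the log-linear additive model is more dominated by the background language model), this condition tends to hold for the documents, leading to a flat, highly uncertain likelihood curve.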
The marginal log likelihood is unimodal over $\theta \in [0, 1]$, since it is concave, being a sum of concave functions (each term is the log of a function linear in $\theta$), and having negative curvature:
$$\frac{\partial^2 MLL_g}{\partial \theta^2} = -\sum_i \frac{(L_i^+ - L_i^-)^2}{\big(\theta L_i^+ + (1 - \theta) L_i^-\big)^2} \le 0.$$
Since it is concave and there is only one parameter, a very wide variety of techniques could be used to reliably find a mode, including EM or first- or second-order methods. At least two approaches to inferring confidence intervals are possible. One is to use a central limit theorem-style approximation, assuming the sampling distribution is approximated by a normal with mean $\theta_{MLE}$ and variance $-[\partial^2 MLL_g / \partial \theta^2]^{-1}$. The second, which we focus on, is Bayesian estimation for $\log p(\theta_g \mid D^{(g)}) = \log p(\theta_g) + MLL_g(\theta_g) + \text{const}$, by simply using a grid search over values $\theta \in \{0.001, 0.002, \ldots, 0.999\}$ to infer both the posterior mode $\theta_{MAP}$ and a 90% highest posterior density interval. In small-scale experiments, this model had very similar results to the central limit theorem approximation (with EM for $\theta_{MLE}$).
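The grid-based Bayesian procedure can be sketched directly from the marginal log likelihood. This minimal illustration assumes a Unif(0, 1) prior and takes per-document class-conditional likelihoods as input; the way the highest-density set is assembled (adding grid points in order of decreasing posterior mass) is one straightforward choice, stated here as an assumption. Function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_posterior_grid(
    l_pos: Sequence[float], l_neg: Sequence[float], grid: Sequence[float]
) -> List[float]:
    """Unnormalized log posterior on the grid; with a uniform prior this is the MLL."""
    return [
        sum(math.log(t * a + (1.0 - t) * b) for a, b in zip(l_pos, l_neg))
        for t in grid
    ]


def map_and_hpd(
    l_pos: Sequence[float],
    l_neg: Sequence[float],
    mass: float = 0.90,
    step: float = 0.001,
) -> Tuple[float, Tuple[float, float]]:
    """Return (theta_MAP, (lo, hi)) from a grid over 0.001, 0.002, ..., 0.999."""
    grid = [k * step for k in range(1, int(round(1.0 / step)))]
    logp = log_posterior_grid(l_pos, l_neg, grid)
    peak = max(logp)
    weights = [math.exp(v - peak) for v in logp]  # subtract the max for stability
    z = sum(weights)
    probs = [w / z for w in weights]
    map_idx = max(range(len(grid)), key=probs.__getitem__)
    # Highest posterior density set: take grid points from most to least
    # probable until the requested mass is covered. Because the posterior
    # is unimodal, the selected points form one contiguous interval.
    order = sorted(range(len(grid)), key=probs.__getitem__, reverse=True)
    kept, total = [], 0.0
    for idx in order:
        kept.append(idx)
        total += probs[idx]
        if total >= mass:
            break
    return grid[map_idx], (grid[min(kept)], grid[max(kept)])
```

When every document has nearly equal likelihoods under both classes, the posterior is flat and the returned interval spans most of [0, 1], matching the intended behavior for an uninformative model.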

Data
In order to compare document class prevalence estimators, we desire datasets that (1) Bella et al. (2010) and Esuli and Sebastiani (2015) use large, pre-existing labeled document corpora, but they do not contain natural groups; evaluations utilize randomly sampled synthetic groups.
To better fulfill these criteria, we select the task of business review sentiment prevalence, where the goal is to estimate the proportion of reviews that are positive for one particular business; specifically, we use labeled data from the Yelp Dataset Challenge Round Nine corpus, which consists of 4.1M reviews by 1M users for 144K businesses. We sample 500 businesses with at least 200 reviews each as the test groups. We treat the task as binary classification, and assign $y_i = 1$ to reviews with 3 or more stars. This task seems reasonably representative of real-world sentiment analysis problems, and this type of dataset can easily be collected and reproduced from Yelp or other widely available review data.
For training, we simulate a small-scale annotation project by sampling 2000 labeled documents from the rest of the corpus. This is a natural prevalence that on average is about the same as the test groups, though individual test groups may have a very different prevalence (ranging from 0.096 to 0.997, mean (stdev) 0.823 (0.136)). We also construct a synthetic training setting with a highly skewed class prior, selecting 2000 documents with a 0.1 class prevalence (i.e. 200 positive documents in the group). In each case, for every model, we re-run and average results over 10 different samples of the training set. For preprocessing, we tokenize with NLTK and lowercase.

Model training
We use L1 regularization for logistic regression based on the vector of a document's word counts, to be most directly comparable to the generative models; for each model, we select its hyperparameter (LR and Loglin's λ, or MNB's pseudocount) by minimizing cross-validated cross-entropy of individual document posteriors (within the labeled training set), over a grid search of powers of 2. The log-linear additive model is trained with OWL-QN (Andrew and Gao, 2007), and the logistic regression model is trained with the default implementation in scikit-learn (Pedregosa et al., 2011). We used ReadMe with its default parameters.

Results
For each of the 500 test groups, we calculate a prevalence point estimate $\hat{\theta}$ with each method, and evaluate by averaging across groups for mean absolute error, $\frac{1}{G} \sum_g |\hat{\theta}_g - \theta^*_g|$, and bias, $\frac{1}{G} \sum_g (\hat{\theta}_g - \theta^*_g)$. For the models that allow for confidence interval prediction, we infer 90% intervals and calculate coverage, which is best if it is 0.90. We also report average CI width; a narrower interval indicates more confidence (even if misplaced). Results are in Table 1; every result is averaged over 10 resamplings of the training set.
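The two point-estimate metrics can be computed as below. This is a minimal sketch assuming one estimate and one true prevalence per test group; the function names are illustrative.

```python
from typing import Sequence


def mean_absolute_error(estimates: Sequence[float], truths: Sequence[float]) -> float:
    """Average |estimate - truth| across test groups."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need matching, non-empty sequences")
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(estimates)


def mean_bias(estimates: Sequence[float], truths: Sequence[float]) -> float:
    """Average signed error; positive values indicate systematic overestimation."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need matching, non-empty sequences")
    return sum(e - t for e, t in zip(estimates, truths)) / len(estimates)
```

A method can have near-zero bias yet a large mean absolute error when its over- and underestimates cancel, which is why both are reported.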
The ReadMe software did not have competitive performance; we hope in follow-up work to understand why Hopkins and King found it had considerably stronger performance than SVM-based CC.
For the natural training class prevalence setting (first column, Table 1), the discriminative-based models (CC, PCC and the adjusted variants ACC and LR-Implicit) all have very similar point estimate performance, outperforming the purely generative models (MNB and Loglin). For CI coverage, the log-linear and LR-Implicit generative models have significantly better coverage than the discriminative model (PB-PCC) or MNB. Future work is required to improve coverage to be closer to the nominal ideal of 90%.
By contrast, when the class prevalences are mismatched (second column, Table 1), the non-adjusted CC and PCC methods give extremely poor and biased point estimates, and PB-PCC has very poor CI coverage. ACC and the generative models do much better, presumably because their models directly allow for variability in the test class prior. While Loglin has somewhat higher coverage in this setting, overall, LR-Implicit has consistently strong performance in both training settings, and for both point estimation and (relatively, at least) confidence intervals. Figure 3 shows $\theta^*$ versus $\hat{\theta}$ for each of the 500 test groups for each of the models, including predicted CIs. CC's and PCC's erroneous assumptions are directly viewable: in the natural prevalence setting, the slope is shallower than 1, indicating a persistent under-sensitivity to the true class prevalence, unlike ACC and the generative models. In the synthetic training case, CC and PCC wildly underpredict, presumably because they are biased by the low training-time prevalence $\theta_{train} = 0.1$.

Comparison of PB-PCC and LR-Implicit
Since PB-PCC and LR-Implicit represent the strongest members of non-adjusted classification aggregation and generative modeling, respectively, we further compare their results. When varying synthetic training prevalence across 0.1 to 0.9 (Figure 5a), LR-Implicit has much better MAE in all settings except near the natural prevalence (the test groups have, on average, 0.82 positive prevalence), and consistently stronger CI coverage. Figure 5b shows results for natural class prevalence when varying the training set size. Unfortunately, LR-Implicit is disadvantaged at very small training sizes: its MAE is higher when there are only a few hundred training documents ($\le 2^8 = 256$), though performance converges after that. We suspect this may occur because, when textual evidence is weak, the classifier learns to more heavily rely on its bias term, which can be a useful form of bias when the training class prevalence matches the test groups (on average). However, at all levels, LR-Implicit's coverage is better.
Since we hypothesized that PB-PCC may be overconfident for large test groups ( §3.5), we test this by binning test groups by the number of documents per group. Figure 4 confirms that PB-PCC exhibits overconfidence for larger groups (smaller CI width alongside lower CI coverage), but LR-Implicit suffers from the same problem as well.
Additional Related Work
González et al. (2017a) review the class prevalence estimation literature, and we note a few threads of work here. Bella et al. (2010) propose a probabilistic variant of ACC, and Esuli and Sebastiani (2015) compare many methods on news article topics (RCV1) and medical record subject heading (OHSUMED-S) class prevalence tasks, finding varying results among CC, ACC, and PCC. A number of other empirical evaluations were conducted in two SemEval Twitter sentiment prevalence shared tasks, with varying results among these and other methods with a range of classifiers (Nakov et al., 2016; Rosenthal et al., 2017); Nakov et al. note that CC was often one of the strongest methods. Esuli and Sebastiani as well as Xue and Weiss (2009) present semi-supervised loss-augmented classifier training methods to improve prevalence estimation. Tasche (2017) presents theoretical results for ACC and Saerens et al.'s EM method (what we call the LR-Implicit MLE), arguing they correctly predict $\theta^*$ under class prior shift; we confirm that those two methods are indeed better than many alternatives in our empirical evaluation. While we focus on inference of the test-time class prior as a class prevalence estimate, Saerens et al. (2002) also show their method can improve individual-level classification accuracy, which Sulc and Matas (2018) use for image classification. (From the viewpoint of individual classification, this phenomenon is known as prior probability shift (Moreno-Torres et al., 2012).) González et al. (2017b) and Card and Smith (2018), similarly to our results, find that CC is much poorer than ACC under class shift. Card and Smith also show that PCC can be sensitive to properties of the classifier, finding that well-calibrated classifiers can give strong performance. They argue that discriminative aggregation models are appropriate for tasks where humans respond to text. Jerzak et al. (2018) analyze issues in class prevalence estimation and propose the ReadMe2 algorithm, which adds external word embeddings, optimization-based dimension reduction, and similarity matching to ReadMe's moment-matching framework.

Conclusion
Document class prevalence estimation is a widespread and much understudied task. We show that simple and obvious classifier aggregation methods display consistent biases, especially under class prior shift. Given how widely some of the less effective methods are used, machine learning and natural language processing research could have real impact in this space.
We also call attention to the need for uncertainty-aware inference: methods that give confidence intervals to summarize their uncertainty. While our method is a first step, future work is necessary to better understand the problem and develop methods with improved coverage. Also, our framework can accommodate a wide array of document and language models; while we focus on bag-of-words models, recent advances in sequence, neural, and attention-based document models could be added directly to our generative model, or used as a discriminative-implicit component. The overall framework could also be extended to multiclass, and potentially, structured prediction settings.