Inferring latent attributes of Twitter users with label regularization

Inferring latent attributes of online users has many applications in public health, politics, and marketing. Most existing approaches rely on supervised learning algorithms, which require manual data annotation and therefore are costly to develop and adapt over time. In this paper, we propose a lightly supervised approach based on label regularization to infer the age, ethnicity, and political orientation of Twitter users. Our approach learns from a heterogeneous collection of soft constraints derived from Census demographics, trends in baby names, and Twitter accounts that are emblematic of class labels. To counteract the imprecision of such constraints, we compare several constraint selection algorithms that optimize classification accuracy on a tuning set. We find that using no user-annotated data, our approach is within 2% of a fully supervised baseline for three of four tasks. Using a small set of labeled data for tuning further improves accuracy on all tasks.


Introduction
Data annotation is a key bottleneck in applying supervised machine learning to language processing problems. This is especially problematic in streaming settings such as social media, where models quickly become dated as new linguistic patterns emerge. An attractive alternative is lightly supervised learning (Schapire et al., 2002; Jin and Liu, 2005; Chang et al., 2007; Graça et al., 2007; Quadrianto et al., 2009; Mann and McCallum, 2010; Ganchev et al., 2010). In this approach, classifiers are trained from a set of domain-specific soft constraints, rather than individually labeled instances. For example, label regularization (Mann and McCallum, 2007; Graça et al., 2007) uses prior knowledge of the expected label distribution to fit a model from large pools of unlabeled instances. Similarly, annotating features with their expected class frequency has proven to be an efficient way of bootstrapping from domain knowledge (Druck et al., 2009; Melville et al., 2009; Settles, 2011).
In this paper we use lightly supervised learning to infer the age, ethnicity, and political orientation of Twitter users. Lightly supervised learning provides a natural method for incorporating the rich, declarative constraints available in social media. Our approach pairs unlabeled Twitter data with constraints from county demographics, trends in first names, and exemplar Twitter accounts strongly associated with a class label.
Prior applications of label regularization use a small number of highly-accurate constraints; for example, Mann and McCallum (2007) use a single constraint that is the true label proportions of an unlabeled dataset, and Ganchev and Das (2013) use cross-lingual constraints from aligned text. In contrast, we use hundreds of constraints that are heterogeneous, overlapping, and noisy. For example, we constrain the predicted attributes of users from a county to match those collected by the Census, despite the known non-representativeness of Twitter users (Mislove et al., 2011). Furthermore, users from that county who list first names in their profile have additional constraints imposed upon them, which may conflict with the county constraints.
To deal with such noisy constraints, we explore forward selection algorithms that choose from hundreds of soft constraints to optimize accuracy on a tuning set. We find that this approach is competitive with a fully supervised approach, with the added advantage of being less reliant on labeled data and therefore easier to update over time. Our primary research questions and answers are as follows:
RQ1. What effect do noisy constraints have on label regularization? We find that simply using all constraints, ignoring noise and overlap, results in surprisingly high accuracy, within 2% of a fully supervised approach on three of four tasks. For age classification, the constraint noise appears to substantially degrade accuracy.
RQ2. How can we select the most useful constraints? Using a small tuning set, we find that our forward selection algorithms improve label regularization accuracy while using fewer than 10% of the available constraints. Constraint selection improves age classification accuracy by nearly 18% (absolute).

RQ3. Which constraints are most informative?
We find that follower constraints result in the highest accuracy in isolation, yet the constraint types appear to be complementary. For three of four tasks, combining all constraint types leads to the highest accuracy.
In the following, we first review related work in lightly supervised learning and latent attribute inference, then describe the Twitter data and constraints. Next, we formalize the label regularization problem and our constraint selection algorithms. Finally, we present empirical results on four classification tasks and conclude with a discussion of future work.
The main drawback of supervised learning in social media is that human annotation is expensive and error-prone, and collecting pseudo-labeled data by self-identifying keywords is noisy and biased (e.g., searching for profiles that mention political orientation). For these reasons we investigate lightly supervised learning, which takes advantage of the plentiful unlabeled data.
One similarly-motivated work is that of Chang et al. (2010), who infer race/ethnicity of online users using name and ethnicity distributions provided by the U.S. Census Bureau. This external data is incorporated into the model as a prior; however, no linguistic content is used in the model, limiting the coverage of the resulting approach. Oktay et al. (2014) extend the work of Chang et al. (2010) to also include statistics over first names.
Other work has inferred population-level statistics from social media; e.g., Eisenstein et al. (2011) use geolocated tweets to predict zip-code statistics of demographic attributes of users, and Schwartz et al. (2013) predict county health statistics from Twitter. However, no user-level attributes are predicted. Patrini et al. (2014) build a Learning with Label Proportions (LLP) model whose objective is to learn a supervised classifier when, instead of labels, only label proportions for bags of observations are known. Their empirical results demonstrate that their algorithms are competitive with, or only a few percentage points of AUC away from, the supervised learning approach.
In preliminary work (Mohammady and Culotta, 2014), we fit a regression model to predict the ethnicity distribution of a county based on its Twitter usage, then applied the regression model to classify individual users. In contrast, here we use label regularization, which can more naturally be applied to user-level classification and can incorporate a wider range of constraint types.

Data
In this section we describe all data and constraints collected for our experiments.

Labeled Twitter Data
For validation (and for tuning some of the methods) we annotate Twitter users according to age, ethnicity, and political orientation. We collect four disjoint datasets for this purpose:
Race/ethnicity: This data set comes from the research of Mohammady and Culotta (2014). They categorized 770 Twitter profiles into one of four categories (Asian, Black, Latino, White). They used the Twitter Streaming API to obtain a random sample of 1,000 users, filtered to the United States. These were manually categorized by analyzing the profile, tweets, and profile image for each user, discarding those for which race could not be determined (230/1,000; 23%). The category frequency is Asian (22), Black (263), Latino (158), White (327). For each user, they collected the 200 most recent tweets using the Twitter API. We refer to this dataset as the race dataset.
Age: Annotating Twitter users by age can be difficult, since it is rarely explicitly mentioned. Similar to prior work (Rao et al., 2010; Al Zamal et al., 2012), we divide users into those below 25 and those above 25 years old. Using the idea from Al Zamal et al. (2012), we use the Twitter search API to find tweets with phrases like "happy 30th birthday to me," and then we collect those users and download their 200 most recent tweets using the Twitter API. We collect 1,436 users (771 below 25 and 665 above 25). While this sampling procedure introduces some selection bias, it provides a useful form of validation in the absence of expedient alternatives. We refer to this dataset as the age dataset.
Politician: Inspired by the work of Cohen and Ruths (2013), we select the official Twitter accounts of members of the U.S. Congress. We select 189 Democratic accounts and 188 Republican accounts and download their most recent 200 tweets. We refer to this dataset as the politician dataset.
Politician-follower: As the politician dataset is not representative of typical users, we collect a separate political dataset. We first collect a list of followers of the official Twitter accounts for both parties ("thedemocrats" and "gop"). We randomly select 598 likely Democrats and 632 likely Republicans, and download the most recent 200 tweets for each user. While the labels for these data may contain moderate noise (since not everyone who follows "gop" is Republican), a manual inspection did not reveal any mis-annotations. We refer to this as the politician-follower dataset.
We split each of the datasets above into 40% tuning/training and 60% testing (though not all methods will use the training set, as we describe below).

Unlabeled Twitter Data
Label regularization depends on a pool of unlabeled data, along with soft constraints over the label proportions in that data. Since many of our constraints involve location, we use the Twitter streaming API to collect 1% of geolocated tweets, using a bounding box of the United States (48 contiguous states plus Hawaii and Alaska). In order to assign each tweet to a county, we use the U.S. Census' center of population data. We use this data to map each geolocated Twitter user to a corresponding county. We use the k-d tree algorithm (Maneewongvatana and Mount, 2002) to find the nearest center of population for each tweet and use a threshold to discard tweets that are not within a specified distance of any county center. In total, we collect 18 million geolocated tweets from 2.7 million unique users.
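The county assignment step can be illustrated with a minimal sketch. It is not the implementation used in our experiments: a brute-force nearest-neighbor search stands in for the k-d tree, the great-circle (haversine) distance is an assumed choice of metric, and the center coordinates, the names `assign_county` and `CENTERS`, and the 50 km threshold are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def assign_county(lat, lon, centers, max_km):
    """Return the nearest center's name, or None if it is farther than max_km."""
    best_name, best_dist = None, float("inf")
    for name, (clat, clon) in centers.items():
        d = haversine_km(lat, lon, clat, clon)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= max_km else None


# Hypothetical centers of population (illustrative coordinates only).
CENTERS = {"county_a": (41.8, -87.7), "county_b": (40.7, -74.0)}

near = assign_county(41.9, -87.6, CENTERS, max_km=50)   # close to county_a
far = assign_county(35.0, -100.0, CENTERS, max_km=50)   # far from every center
```

A tweet that falls outside the threshold of every center is mapped to `None` and would be discarded. With thousands of centers and millions of tweets, a spatial index such as a k-d tree replaces the inner loop.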

Constraints
Finally, we describe the soft constraints used by label regularization. Each constraint will apply to a (possibly overlapping) subset of users from the unlabeled Twitter data. For all constraints below, we only include the constraint for consideration if at least 1,000 unlabeled Twitter users are matched. For example, if we only have 500 users from a county, we will not use that county's demographics as a constraint. This is to ensure that there is sufficient unlabeled data for learning. We consider three classes of constraints:
County constraints (cnt): The U.S. Census produces annual estimates of the ethnicity and age demographics for each county. We use the most recent decennial census (2010) to compute the proportion of each county that is below and above 25 years old (to match the labels of the annotated data). We additionally use the 2012 updated estimates of ethnicity by county, restricting to Asian, Black, Latino, and White. Each constraint, then, is applied to the users assigned to that county in the unlabeled data. For example, there are 46K unlabeled users from Cook County, which the Census estimates as 45% White. We consider 3,000 total counties as constraints, of which roughly 500 are retained for consideration after filtering those that match fewer than 1,000 users.
Name constraints (nam): Silver and McCann (2014) recently demonstrated how a person's first name can often indicate their age. The Social Security Administration reports the frequencies of names given to children born in a given year, and its actuarial tables estimate how many people born in a given year are still alive. From these data, one can estimate the age distribution of people with a given name. For example, the median age of someone named "Brittany" is 23. With this approach, we can assign constraints indicating the fraction of people with a given name that are above and below 25 years old.
For each user in the unlabeled Twitter data, we parse the "name" field of the profile, assuming that the first token represents the first name. Constraints are assigned to users with matching names. We consider more than 50K total name constraints, of which we retain 175 that match a sufficient number of users. For example, there are roughly 1,600 unlabeled users with the first name Katherine; the constraint specifies that 86% of them are under 25.
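The name-based estimate described above can be sketched as follows. This is an illustrative sketch only: the function name `fraction_under`, the birth counts, and the survival probabilities are made up for the example and are not Social Security Administration figures.

```python
def fraction_under(births_by_year, survival_by_year, current_year, cutoff_age):
    """Estimate the share of living people with a name who are younger than cutoff_age.

    births_by_year:   {birth_year: number of babies given the name that year}
    survival_by_year: {birth_year: probability that someone born then is still alive}
    """
    alive_under = 0.0
    alive_total = 0.0
    for year, births in births_by_year.items():
        alive = births * survival_by_year[year]
        alive_total += alive
        if current_year - year < cutoff_age:
            alive_under += alive
    if alive_total == 0:
        raise ValueError("no living people estimated for this name")
    return alive_under / alive_total


# Made-up counts and survival rates for a hypothetical first name.
births = {1950: 1000, 1980: 2000, 2000: 3000}
survival = {1950: 0.5, 1980: 0.9, 2000: 1.0}

p_under_25 = fraction_under(births, survival, current_year=2014, cutoff_age=25)
# The resulting soft constraint: expected label proportions for users with this name.
constraint = {"under_25": p_under_25, "over_25": 1.0 - p_under_25}
```

In this toy example, 3,000 of the estimated 5,300 living people with the name are under 25, so the constraint places roughly 57% of matching users in the younger class.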
Follower constraints (fol): Our final type of constraint uses Twitter accounts and hashtags strongly associated with a class label. The constraint applies to users that follow such exemplar accounts or use such hashtags. We consider two sources of such constraints. For age and race, we download demographic data for 1K websites from Quantcast.com, an audience measurement company that tracks the demographics of visitors to millions of websites (Kamerer, 2013). We then identify the Twitter accounts for each website. For example, one constraint indicates that 12% of Twitter users who follow "oprah" are Latino. For political constraints, we manually identify 18 Twitter accounts or hashtags that are strongly associated with either Democrats or Republicans. The constraint specifies that 90% of users that follow one of these accounts (or use one of these hashtags) are affiliated with the corresponding party. (We omit constraints used to construct the labeled data for the politician-follower data.)

Label Regularization
Our goal is to learn a classification model using the unlabeled Twitter data and the constraints described above. The idea of label regularization is to define an objective function that enforces that the predicted label distribution for a set of unlabeled data closely matches the expected distribution according to a constraint.
We select multinomial logistic regression as our classification model. Given a feature vector $x$, a class label $y$, and set of parameter vectors $\theta = \{\theta_{y_1} \ldots \theta_{y_k}\}$ (one vector per class), the conditional distribution of $y$ given $x$ is defined as follows:

$$p_\theta(y \mid x) = \frac{\exp(\theta_y \cdot x)}{\sum_{y'} \exp(\theta_{y'} \cdot x)}$$

Typically, $\theta$ is set to maximize the likelihood of a labeled training set. Instead, we will optimize the objective defined in Mann and McCallum (2007), using only unlabeled data and constraints.
Let $U = \{U_1 \ldots U_k\}$ be a set of sets, where $U_j$ consists of unlabeled feature vectors $x$. The elements of $U$ may be overlapping. Let $\tilde{p}_j$ be the expected label distribution of $U_j$. E.g., $\tilde{p}_j = \{.9, .1\}$ would indicate that 90% of examples in $U_j$ are expected to have class label 0. The combination $(U_j, \tilde{p}_j)$ is called a constraint.
Our goal, then, is to set $\theta$ so that the predicted label distribution matches $\tilde{p}_j$, for all $j$. Since using the predicted class counts results in an objective that is non-differentiable, Mann and McCallum (2007) instead use the model's posterior distribution:

$$\hat{q}_j(y) = \sum_{x \in U_j} p_\theta(y \mid x), \qquad \hat{p}_j(y) = \frac{\hat{q}_j(y)}{\sum_{y'} \hat{q}_j(y')}$$

where $\hat{p}_j$ is the normalized form of $\hat{q}_j$. Then, we want to set $\theta$ such that $\tilde{p}_j$ and $\hat{p}_j$ are close. Mann and McCallum (2007) use KL-divergence, which is equivalent to augmenting the likelihood with a Dirichlet prior over expectations where values for the priors are proportional to $\tilde{p}_j$. KL-divergence can be factored into two parts:

$$D(\tilde{p}_j \,\|\, \hat{p}_j) = \sum_y \tilde{p}_j(y) \log \frac{\tilde{p}_j(y)}{\hat{p}_j(y)} = H(\tilde{p}_j, \hat{p}_j) - H(\tilde{p}_j)$$

where $H(\tilde{p}_j)$ is constant for each $j$, and so we need to minimize $H(\tilde{p}_j, \hat{p}_j)$ in order to minimize KL-divergence, where $H(\tilde{p}_j, \hat{p}_j)$ is the cross-entropy of the hypothesized distribution and the expected distribution for $U_j$.
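As a small numerical check of the factorization above, the following sketch computes the predicted label distribution for one bag of unlabeled vectors and confirms that the KL-divergence equals the cross-entropy minus the (constant) entropy of the expected distribution. The parameter values, the bag, and all function names are hypothetical and chosen only for illustration.

```python
import math


def predict(theta, x):
    """Multinomial logistic regression posterior; theta holds one weight vector per class."""
    scores = [sum(w * xi for w, xi in zip(theta_y, x)) for theta_y in theta]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predicted_distribution(theta, bag):
    """Sum the posteriors over the bag (q_hat), then normalize (p_hat)."""
    q_hat = [0.0] * len(theta)
    for x in bag:
        for y, p in enumerate(predict(theta, x)):
            q_hat[y] += p
    total = sum(q_hat)
    return [q / total for q in q_hat]


def cross_entropy(p_tilde, p_hat):
    return -sum(pt * math.log(ph) for pt, ph in zip(p_tilde, p_hat) if pt > 0)


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p_tilde, p_hat):
    return sum(pt * math.log(pt / ph) for pt, ph in zip(p_tilde, p_hat) if pt > 0)


# Hypothetical two-class model, three unlabeled vectors, and one constraint.
theta = [[0.5, -0.2], [-0.1, 0.3]]
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p_tilde = [0.9, 0.1]

p_hat = predicted_distribution(theta, bag)
kl = kl_divergence(p_tilde, p_hat)
gap = abs(kl - (cross_entropy(p_tilde, p_hat) - entropy(p_tilde)))
```

Because the entropy term does not depend on the parameters, any parameter setting that lowers the cross-entropy lowers the KL-divergence by the same amount.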
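We additionally use L2 regularization: our final objective function sums the cross-entropy terms $H(\tilde{p}_j, \hat{p}_j)$ over all constraints $j$ and adds an L2 penalty on $\theta$ weighted by $\lambda$. In practice we find that $\lambda$ does not need tuning for each data set; we set it as a fixed function of a constant $C$, and we set $C$ to 1.3e10 in our experiments. Mann and McCallum (2007) compute the gradient of cross-entropy; with one parameter vector $\theta_y$ per class, it can be written as follows:

$$\frac{\partial}{\partial \theta_y} H(\tilde{p}_j, \hat{p}_j) = -\sum_{x \in U_j} p_\theta(y \mid x)\, x \left( \frac{\tilde{p}_j(y)}{\hat{q}_j(y)} - \sum_{y'} \frac{\tilde{p}_j(y')\, p_\theta(y' \mid x)}{\hat{q}_j(y')} \right)$$

where $\hat{q}_j(y) = \sum_{x \in U_j} p_\theta(y \mid x)$ is the unnormalized predicted distribution. The gradient for each $\theta_y$ is then a sum of the gradients for each constraint $j$. In order to minimize the objective function, we use L-BFGS, a gradient-based quasi-Newton method (Byrd et al., 1995). (While the objective is not guaranteed to be convex, this approximation has worked well in prior work.) To help reduce overfitting, we use early stopping (10 iterations).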
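The gradient of the cross-entropy term for a single constraint can be checked numerically. The sketch below is illustrative rather than the code used in our experiments: it assumes one weight vector per class, omits the L2 term and the temperature, and compares an analytic gradient against central finite differences on hypothetical toy values.

```python
import math


def predict(theta, x):
    """Multinomial logistic regression posterior; theta holds one weight vector per class."""
    scores = [sum(w * xi for w, xi in zip(t, x)) for t in theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def constraint_loss(theta, bag, p_tilde):
    """Cross-entropy H(p_tilde, p_hat) for one constraint (bag, p_tilde)."""
    q_hat = [0.0] * len(theta)
    for x in bag:
        for y, p in enumerate(predict(theta, x)):
            q_hat[y] += p
    total = sum(q_hat)
    return -sum(pt * math.log(q / total) for pt, q in zip(p_tilde, q_hat))


def constraint_grad(theta, bag, p_tilde):
    """Analytic gradient of constraint_loss with respect to each class weight vector."""
    n_classes, n_feats = len(theta), len(theta[0])
    posts = [predict(theta, x) for x in bag]
    q_hat = [sum(p[y] for p in posts) for y in range(n_classes)]
    grad = [[0.0] * n_feats for _ in range(n_classes)]
    for x, p in zip(bag, posts):
        mix = sum(p_tilde[y] * p[y] / q_hat[y] for y in range(n_classes))
        for y in range(n_classes):
            coef = -p[y] * (p_tilde[y] / q_hat[y] - mix)
            for k in range(n_feats):
                grad[y][k] += coef * x[k]
    return grad


def numeric_grad(theta, bag, p_tilde, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = [[0.0] * len(theta[0]) for _ in theta]
    for y in range(len(theta)):
        for k in range(len(theta[0])):
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[y][k] += eps
            down[y][k] -= eps
            diff = constraint_loss(up, bag, p_tilde) - constraint_loss(down, bag, p_tilde)
            grad[y][k] = diff / (2 * eps)
    return grad


# Hypothetical three-class model, three unlabeled vectors, and one constraint.
theta = [[0.5, -0.2], [-0.1, 0.3], [0.0, 0.1]]
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p_tilde = [0.5, 0.35, 0.15]

analytic = constraint_grad(theta, bag, p_tilde)
numeric = numeric_grad(theta, bag, p_tilde)
max_gap = max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn))
```

Summing `constraint_grad` over all constraints and adding the gradient of the L2 penalty gives the quantity a gradient-based optimizer would consume.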
Temperature: Mann and McCallum (2007) find that sometimes label regularization returns a degenerate solution. For example, for a three-class problem with constraint $\tilde{p}_j(y) = \{.5, .35, .15\}$, it may find a solution such that $p_\theta(y) = \{.5, .35, .15\}$ for every instance, and as a result all of the instances are assigned the same label. To avoid this behavior, Mann and McCallum (2007) introduce a temperature parameter $T$ into the classification function as follows:

$$p_\theta(y \mid x) = \frac{\exp(\theta_y \cdot x / T)}{\sum_{y'} \exp(\theta_{y'} \cdot x / T)}$$

In practice we find that we can set $T$ to two for binary classification and ten for multi-class problems.
While the approach described above closely follows Mann and McCallum (2007), we note two important distinctions: we use no labeled data in our objective, and we consider a set of hundreds of noisy, overlapping constraints (as opposed to only a handful of precise constraints).
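The effect of the temperature parameter on the classification function can be seen in a short sketch. It assumes the temperature divides the class scores before normalization; the scores below are hypothetical.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Normalize class scores after dividing each one by the temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical class scores (theta_y . x) for a three-class problem.
scores = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(scores, temperature=1.0)
flat = softmax_with_temperature(scores, temperature=10.0)
```

A larger temperature flattens the posterior without changing which class scores highest, so the weights must grow larger to produce the same posterior during training.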

Constraint Selection
As described above, our proposed constraints are undoubtedly inexact. For example, it is generally accepted that social media users are not a representative sample of the population. E.g., younger, urban and minority populations tend to be overrepresented on Twitter (Mislove et al., 2011; Lenhart and Fox, 2009), and Latino users tend to be underrepresented on Facebook (Watkins, 2009). Thus, it is incorrect to assume that the demographics of Twitter users from a county match those of all people from a county. While it may be possible to directly adjust for these mismatches using techniques from survey reweighting (Gelman, 2007), it is difficult to precisely quantify the proper weights in this context. Instead, we propose a search-based approach inspired by feature selection algorithms commonly used in machine learning (Guyon and Elisseeff, 2003). The idea is to select the subset of constraints that results in the most accurate model. We first assume the presence of a small set of labeled data $L = \{(x_1, y_1) \ldots (x_n, y_n)\}$. Given a set of constraints $C = \{(U_1, \tilde{p}_1) \ldots (U_k, \tilde{p}_k)\}$, the search objective is to select a subset of constraints $C^* \subseteq C$ to minimize error on $L$:

$$C^* \leftarrow \arg\min_{C' \subseteq C} E(p_{C'}(y \mid x), L)$$

where $E(\cdot)$ is a classification error function, and $p_{C'}(y \mid x)$ is the model fit by label regularization using constraint set $C'$.
In our experiments, |C| is in the hundreds, so exhaustive, exponential search is impractical. Instead, we consider the following greedy and pseudo-greedy forward-selection algorithms:
• Greedy (grdy): Standard greedy search. At each iteration, we select the constraint that leads to the greatest accuracy improvement on L.
• Semi-greedy (semi): Rather than selecting the constraint that improves accuracy the most, we randomly select from the top three constraints (Hart and Shogan, 1987).
• Improved-greedy (imp): The same as grdy, but after each iteration, optionally remove a single constraint. We consider each currently selected constraint, and compute the accuracy attained by removing this constraint from the set. We remove the constraint that improves accuracy the most (if any exists). This constraint is removed from consideration in future iterations.
We run each selection algorithm for 140 iterations (as we discuss below, accuracy plateaus well before then). Then, we select the constraint set that results in the highest accuracy. While this search procedure is computationally expensive, it is fortunately easily parallelizable (by partitioning by constraint), which we take advantage of in our implementation. All constraint selection algorithms use the 40% of the labeled data reserved for training/tuning. After we finalized all models using the tuning data, we then used them to classify the 60% of labeled data reserved for testing.
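The selection algorithms above can be sketched as follows. This is a simplified illustration rather than our implementation: `evaluate` is a placeholder for fitting label regularization on a constraint subset and measuring tuning-set accuracy, and the constraint names and the additive `toy_accuracy` function are hypothetical.

```python
import random


def greedy_select(constraints, evaluate, n_iters, top_k=1, rng=None):
    """Forward selection; top_k=1 is plain greedy, top_k=3 is the semi-greedy variant.

    Returns the best-scoring subset seen over all iterations and its accuracy.
    """
    rng = rng or random.Random(0)
    selected, remaining = [], list(constraints)
    best_subset, best_acc = [], float("-inf")
    for _ in range(n_iters):
        if not remaining:
            break
        ranked = sorted(remaining, key=lambda c: evaluate(selected + [c]), reverse=True)
        choice = rng.choice(ranked[:top_k])
        selected.append(choice)
        remaining.remove(choice)
        acc = evaluate(selected)
        if acc > best_acc:
            best_subset, best_acc = list(selected), acc
    return best_subset, best_acc


def improved_greedy_select(constraints, evaluate, n_iters):
    """Greedy selection with an optional single-constraint removal after each addition."""
    selected, remaining = [], list(constraints)
    best_subset, best_acc = [], float("-inf")
    for _ in range(n_iters):
        if not remaining:
            break
        choice = max(remaining, key=lambda c: evaluate(selected + [c]))
        selected.append(choice)
        remaining.remove(choice)
        acc = evaluate(selected)
        # Find the one constraint whose removal improves accuracy the most, if any.
        drop, drop_acc = None, acc
        for c in selected:
            trial = evaluate([s for s in selected if s != c])
            if trial > drop_acc:
                drop, drop_acc = c, trial
        if drop is not None:
            selected.remove(drop)  # never re-added: it is not put back in `remaining`
            acc = drop_acc
        if acc > best_acc:
            best_subset, best_acc = list(selected), acc
    return best_subset, best_acc


# Hypothetical stand-in for "fit label regularization, score on the tuning set".
GAIN = {"fol_1": 0.20, "cnt_1": 0.10, "nam_1": 0.05, "cnt_noisy": -0.15}


def toy_accuracy(subset):
    return 0.5 + sum(GAIN[c] for c in subset)


grdy_subset, grdy_acc = greedy_select(list(GAIN), toy_accuracy, n_iters=4)
semi_subset, semi_acc = greedy_select(list(GAIN), toy_accuracy, n_iters=4, top_k=3)
imp_subset, imp_acc = improved_greedy_select(list(GAIN), toy_accuracy, n_iters=4)
```

In this toy setting both greedy variants end up excluding the noisy constraint. Each call to `evaluate` inside an iteration is independent of the others, which is what makes the search easy to parallelize across candidate constraints.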

Baselines
We compare label regularization with standard logistic regression (logistic) trained using the 40% of labeled data reserved for training/tuning. We also consider several heuristic baselines:
• Name heuristic, race classification: We implement the method proposed by Mohammady and Culotta (2014), using the top 1000 most popular last names with their race distribution from the U.S. Census Bureau to infer the race/ethnicity of users based on the most probable race according to last name. If the last name is not among the top 1000 most popular for a given race, we simply predict White (the most frequent class).
• Name heuristic, age classification: We use the heuristic described in Section 3.3 that estimates a person's age by their first name. Given the age distribution of a first name, we classify the user according to the more probable class.
• Follower heuristic, political classification: We reuse the exemplar accounts used in the follower constraint in Section 3.3. That is, rather than using the fact that a user follows "dennis kucinich" as a soft constraint, we classify such a user as a Democrat. If a user follows more than one of the exemplar accounts, we select the more frequent party. In case of ties (or if the user does not follow any of the accounts), we classify at random.
Table 1: Accuracy on the testing set. all-const does no constraint selection; imp-greedy selects constraints to maximize accuracy on the tuning set using the Improved-greedy algorithm.
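The follower heuristic for political classification described in the baselines above amounts to a majority vote over followed exemplar accounts. A minimal sketch, with hypothetical account names and party codes:

```python
import random
from collections import Counter


def follower_heuristic(followed, exemplar_party, rng=None):
    """Predict a party by majority vote over followed exemplar accounts.

    Ties, including the case of no matching accounts, are broken at random.
    """
    rng = rng or random.Random(0)
    counts = Counter(exemplar_party[a] for a in followed if a in exemplar_party)
    dem, rep = counts.get("D", 0), counts.get("R", 0)
    if dem > rep:
        return "D"
    if rep > dem:
        return "R"
    return rng.choice(["D", "R"])


# Hypothetical exemplar accounts and their associated parties.
EXEMPLARS = {"dem_account_1": "D", "dem_account_2": "D", "rep_account_1": "R"}

mostly_dem = follower_heuristic(["dem_account_1", "dem_account_2", "rep_account_1"], EXEMPLARS)
only_rep = follower_heuristic(["rep_account_1", "unrelated_account"], EXEMPLARS)
no_match = follower_heuristic([], EXEMPLARS)
```

Unlike the soft follower constraint, this rule commits to a hard label for every user, which is why it serves only as a baseline.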
Features: For all models, we use a standard bag-of-words representation consisting of a binary term vector for the 200 tweets of each user, their description field, and their name field. We differentiate between terms used in the description, tweet text, and name field, and also indicate hashtags. Finally, we include additional features indicating the accounts followed by each user.
Table 1 shows the classification accuracy on the test set for each of the four tasks (F1 results are similar). We begin by comparing heuristic and logistic to the all-const results, which is our proposed label regularization approach using no constraint selection (i.e., no user-labeled data). We can see that for three of the four tasks (race, pol, pol-f), label regularization accuracy is either the same as logistic or within 2%. That is, using no user-annotated data, we can obtain accuracy competitive with logistic regression. For age, however, label regularization performs poorly; only using the fol constraints surpasses the heuristic baseline. We suspect that this is in part due to the greater noise in age constraints: Twitter users are particularly non-representative of the overall population according to age. To summarize our answer to RQ1, label regularization appears to perform quite well under a moderate amount of constraint noise, but can still fail under excessive noise.
We next consider the effect of the constraint selection algorithms. Table 2 compares the four different constraint selection algorithms, along with the model that selects all constraints. We report the accuracy for each approach considering all constraint types (county, follow, and name, where applicable). Importantly, this accuracy is computed on the tuning set, not the test set. The goal here is to determine which search algorithm is able to find the best approximate solution. By comparing with all, we can see that constraint selection can significantly improve accuracy on the tuning set (by 18% absolute on average).
The differences among the selection algorithms do not appear to be significant. Figure 1 plots the accuracy at each iteration of constraint selection for three of the datasets. The main conclusion we draw from these figures is that high accuracy can be achieved with only a small number of constraints, provided they are carefully chosen. Each method is very close to convergence after using only 20 constraints (selected from hundreds). When examining which constraints are selected, we find that those that apply to many users are often preferred, presumably because there is more data to inform the final model.

Results
Returning to Table 1, we have also listed the accuracy of the imp-greedy selection method (which performed best on the tuning set), further stratified by constraint type. Note that imp-greedy selects the constraints that perform best on the tuning set, fits the classification model, and then classifies the testing set. We can see that for three of the four tasks (race, age, pol-f), imp-greedy results in higher accuracy than using all the constraints. This is particularly pronounced for age: the best result without constraint selection is 61.4, compared with 79.2 for imp-greedy. Furthermore, imp-greedy outperforms logistic on two of four tasks, suggesting that using unlabeled data can improve accuracy. Note that both imp-greedy and logistic use the same amount of labeled data, though in different ways: logistic performs standard supervised classification; imp-greedy uses the labeled data to perform constraint selection for label regularization. Thus, to summarize our answer to RQ2, we find that imp-greedy provides a robust method to select constraints in the presence of noise. While it comes at the cost of a small amount of labeled data, it is less reliant on this data than a traditional supervised approach, and so may be more applicable in streaming settings.
To answer RQ3, we can compare the accuracies provided by each of the constraint types in Table 1. For all-const, the follower constraints (fol) outperform the county constraints (cnt) for all tasks, while the name constraint (which only applies to age) falls between the two. Including both cnt and fol improves accuracy on two of the four tasks. These trends change somewhat for imp-greedy. The cnt constraints are superior for two tasks, while fol are superior for the other two. The nam constraints again fall between the two. Unlike for all-const, using more constraint types improves accuracy on three of four tasks. These differences suggest that the constraint selection algorithms allow label regularization to be more robust to noisy and conflicting constraints. That is, using constraint selection, we can view constraint engineering as akin to feature engineering in discriminative, supervised learning methods: developers can add many types of constraints to the model without (much) fear of reducing accuracy. The usual caveat of overfitting applies here as well; indeed, comparing the accuracies on the tuning set (Table 2) with those on the testing set (Table 1) suggests that some over-tuning has occurred, most notably on age and pol.
We further examined the coefficients of the models trained using each constraint type. We find, for example, that county constraints result in models with large coefficients for location-specific terms (e.g., college names for younger users, southern cities for Republican users), while follower constraints tend to learn models dominated by follower features ("thenation" for Democrats, "glennbeck" for Republicans). Similarly, name constraints result in models dominated by name features. This analysis helps explain how combining constraint types can improve overall accuracy, since each type emphasizes different subsets of features.
This difference between constraint types is further shown in Table 3, which lists the top features for the semi-greedy constraint selection algorithm, fit using different subsets of constraints. In this table, the italicized words are the words from the description field of the user's profile, the underlined words are followed accounts, and the bold words are the words from the name field of the user profile. In the first row, we display the top features for a model fit using only county constraints. College names appear as top features for younger users, and "airport" and @NashvilleScene (a newspaper) appear for older users. The second row of Table 3 shows the top features for following constraints; some news outlets appear for younger (Alternative Press) and older (The News & Observer) users. The third row shows the top features for name constraints, and some names are in the top features for younger (Katherine and Diana) and older (Debra, Lori, Sandra, and Janet) users. In addition, the absence of a profile description is indicative of older users.
Table 3: Top features learned by label regularization for the age and politician datasets using semi-greedy constraint selection. Models were fit separately for each constraint type (county, follow, name). Italicized words are from the description field, bold words are from the name field, and underlined words are followed accounts.
The bottom of Table 3 shows top features for the politician dataset. The first row shows that some colleges, a sports network in New England, and locations in the Pacific Northwest are indicative of Democrats. Indiana-related terms are strong indicators of Republicans: indiana, the Indianapolis Colts (an American football team), and 'jgfortwayne' (The Journal Gazette, a newspaper in Fort Wayne, Indiana). This aligns with the strong support of the Republican party in Indiana. The second row shows top-ranked following features. Accounts 'keithellison' and 'repjohnlewis' are top features for the Democratic Party; these belong to Keith Ellison and John Robert Lewis, members of the Democratic leadership of the House of Representatives. On the other hand, 'gopleader' (the official account of the Republican majority leader in the House) and 'senmikelee' (Republican Senator Mike Lee from Utah) are the top features for Republicans.

Conclusions and Future work
While label regularization has been used on a number of NLP tasks, we have presented evidence that it is applicable to latent attribute inference even using many noisy, heterogeneous constraints. We have compared a number of constraint selection algorithms and found they can make label regularization more robust to noisy constraints, allowing developers to combine many rich constraint types without reducing accuracy.
There are many avenues for future work. Most pressing is the need to directly address the sampling bias created when constraints derived from the overall population are applied to online users. We plan to explore alternative optimization strategies to explicitly address this issue. Finally, additional research should quantify how responsive label regularization approaches are to the changing linguistic patterns common in online data.