The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes typically concern people, but the data is often aggregated without regard to the users in the Twitter populations of each community. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated by user. Results on four different U.S. county-level tasks, spanning demographic, health, and psychological outcomes, show large and consistent improvements in prediction accuracies (e.g., from Pearson r=.73 to .82 for median income prediction or r=.37 to .47 for life satisfaction prediction) over the standard approach of aggregating all tweets. We make our aggregated and anonymized community-level data, derived from 37 billion tweets (over 1 billion of which were mapped to counties), available for research.


Introduction
Social media is an increasingly popular resource for large-scale population assessment which promises a cheap and non-intrusive complement to standard surveys with finer spatiotemporal scales (Coppersmith et al., 2015; Mowery et al., 2016; Wang et al., 2017). Twitter has been used, among other things, to measure community health (Paul and Dredze, 2011; Mowery et al., 2016; Eichstaedt et al., 2015), wellbeing (Schwartz et al., 2013), and public opinion on politics (O'Connor et al., 2010; Miranda Filho et al., 2015). By having access to measurements from multiple locations or communities, models trained on text data from social media can be used both to predict future measurements and to provide community estimates where these are lacking or are not robust. Such research is made possible by the massive amount of easily accessible user-generated data from public social media.
However, there has been little research on the way in which such data should be aggregated in order to compute community-level lexical feature estimates. One study explored the benefit of normalizing lexicon word count features as the proportion of users in each county who use at least one word from a given category (Culotta, 2014a). This method led to a significant improvement but was limited to broad categories of a psychologically based dictionary, rather than normalizing frequencies of individual lexical features within users before aggregating to the county. Typically, data are aggregated in a "bag of words" style, disregarding tweets and authors (Schwartz et al., 2013; Eichstaedt et al., 2015; Curtis et al., 2018). We find, however, that giving equal weight to each user, rather than to each word or tweet, yields much more accurate community-level predictions.
In this paper, we conduct a series of experiments testing various simple yet intuitive aggregation methods. We show that the choice of aggregation method can result in substantial (one might even say "remarkable") boosts in accuracy when predicting U.S. county-level outcomes (e.g., user-to-county aggregation yields a 7% to 27% increase in Pearson correlation). Contributions include (a) validation of aggregation approaches across four outcomes related to health, psychology, and demographics, (b) validation that aggregation has some effect on a smaller sample of Twitter data, (c) a demonstration of the effect of power tweeters (or "super users") and (d) release of resource-intensive community-aggregated lexical data.
Related work. This is the first work we know of to explore simple aggregation techniques for population-level prediction tasks from language. Previous work has explored more sophisticated adjustments, such as addressing demographic self-selection bias in Twitter community predictions by re-weighting messages, finding small improvements (a 4.5% reduction in symmetric mean absolute percentage error) (Culotta, 2014b). In a political voting intention prediction application, Lampos et al. (2013) modeled users and words jointly by learning separate regression weights for the two dimensions, based on the intuition that each user contributes differently towards the outcome. However, their model was specifically adapted to problems that use time-series outcomes, rather than community-level aggregation. Distributions of lexical features are considered at multiple levels of analysis (message, user and community) in Almodaresi et al. (2017), though each level considers only one type of aggregation. Similar aggregation methods have been used in the context of topic modeling (Latent Dirichlet Allocation (Blei et al., 2003) and the Author Topic Model (Rosen-Zvi et al., 2004)) by considering user, hashtag and conversation level aggregations (Alvarez-Melis and Saveski, 2016; Hong and Davison, 2010) but, again, community-level aggregation and prediction tasks were not considered.

Data
Research was reviewed by an academic institutional review board and deemed exempt.

Twitter Data Collection
Twitter Sample. A random 10% sample of the entire Twitter stream ('GardenHose') was collected between July 2009 and April 2014, which was then supplemented with a random 1% sample from May 2014 to February 2015. The total sample contains approximately 37.6 billion tweets (Preotiuc-Pietro et al., 2012).
County Mapping. In order to map each tweet to a location within a county in the United States, we use both self-reported location information in user profiles and latitude/longitude coordinates associated with a tweet. If latitude/longitude coordinates are present then we trivially map the tweet to a county. The self-reported location information is a free text field and we use a cascading set of rules to map this field to a county. The rules are designed to avoid false positives (incorrect mappings) at the expense of fewer mappings. The full details of this process can be found in (Schwartz et al., 2013). Note that the latitude/longitude coordinates are a tweet attribute whereas the self-reported location is a user attribute, yet both are used to map tweets to counties. Users are assigned a county by considering their earliest county-mapped tweet.
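The user assignment step can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that "considering their earliest county-mapped tweet" means taking that tweet's county, and the `(user_id, timestamp, county)` tuple layout and the function name are hypothetical.

```python
from datetime import datetime


def assign_user_counties(tweets):
    """Map each user to the county of their earliest county-mapped tweet.

    `tweets` is an iterable of (user_id, timestamp, county) tuples, where
    county is None when the tweet could not be mapped (hypothetical layout).
    """
    earliest = {}  # user_id -> (timestamp, county) of the earliest mapped tweet
    for user_id, timestamp, county in tweets:
        if county is None:
            continue  # unmapped tweets cannot determine a user's county
        if user_id not in earliest or timestamp < earliest[user_id][0]:
            earliest[user_id] = (timestamp, county)
    return {user: county for user, (_, county) in earliest.items()}


# Toy input; the county codes are placeholders.
example = [
    ("u1", datetime(2012, 5, 1), "42101"),
    ("u1", datetime(2011, 3, 9), "36061"),
    ("u2", datetime(2013, 1, 2), None),
    ("u2", datetime(2013, 6, 7), "06037"),
]
user_county = assign_user_counties(example)
```

In this toy input, `u1` is assigned the county of the 2011 tweet even though a later tweet maps elsewhere, and `u2`'s unmapped tweet is ignored.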
In total, we are able to map 1.78 billion of the 37.6 billion tweets to a U.S. county using the above-mentioned method. The county-mapped data set was then filtered to contain only English tweets using the popular langid.py method (Lui and Baldwin, 2012), further reducing our tweet set to 1.64 billion tweets. For experiments with user-level data aggregation, we removed users who made relatively few (fewer than 30) posts in our data set.
Publicly Available Stream. The standard publicly available Twitter stream outputs approximately 1% of the public Tweet volume. Since a 10% sample is not available to most researchers, we replicated a 1% sample by taking a random 10% of our county-mapped, English-filtered 10% sample. The same process of county mapping, language filtering and user selection was applied to this data, resulting in 131 million county-mapped English tweets from 1.57 million users. Table 1 presents the data set statistics.
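The minimum-activity filter can be illustrated with a short sketch. The threshold of 30 comes from the text; the `(user_id, text)` pair layout and the function name are assumptions made for the example.

```python
from collections import Counter

MIN_POSTS = 30  # threshold stated in the text: users with fewer posts are removed


def filter_active_users(tweets, min_posts=MIN_POSTS):
    """Keep only tweets from users with at least `min_posts` tweets.

    `tweets` is a list of (user_id, text) pairs (hypothetical layout).
    """
    counts = Counter(user_id for user_id, _ in tweets)
    return [(user, text) for user, text in tweets if counts[user] >= min_posts]


# User "a" has exactly 30 tweets and is kept; user "b" has 29 and is dropped.
sample = [("a", "hi")] * 30 + [("b", "yo")] * 29
kept = filter_active_users(sample)
```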
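One way to draw such a subsample is to keep each tweet independently with probability 0.10. The text does not say whether the authors sampled this way or drew an exact 10% without replacement, so the sketch below is an assumption, as are the function name and the fixed seed.

```python
import random


def subsample(tweets, rate=0.10, seed=0):
    """Keep each tweet independently with probability `rate`.

    A fixed seed makes the draw reproducible. Per-tweet Bernoulli sampling is
    an assumption; an exact-size sample would use random.sample instead.
    """
    rng = random.Random(seed)
    return [tweet for tweet in tweets if rng.random() < rate]


# Integers stand in for tweets from the county-mapped 10% sample.
tweets = list(range(100_000))
one_percent_like = subsample(tweets)
```

The resulting size is only approximately 10% of the input, which is usually acceptable at this scale.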

The County Tweet Lexical Bank
The County Tweet Lexical Bank is a U.S. county-level data set comprising two feature sets:
• an aggregated "bag-of-words" count vector across all the county's messages, aggregated in order to preserve anonymity. The unigrams represent the most frequent words in the data set;
• a "bag-of-topics" representation for each county, with 2000 social media-derived topics described in (Schwartz et al., 2013).
Both feature sets will be released across the 2009-2015 time span as well as for individual years. Yearly updates will be included as they become available. As we are only releasing aggregated word-level features, as opposed to raw tweets, this data release is within Twitter's Terms of Service.

Outcomes
The following U.S. county demographic, psychological and health variables were used in our prediction tasks. Table 2 gives statistics for each county variable.

Methods

Aggregation
Our aim is to use the user-level information based on the assumption that aggregating data first at the user-level would remove biases introduced by non-standard users of the platform. To this end, we explore three types of aggregation: (1) tweet to county, (2) county "bag of words" and (3) user to county.
Tweet to County. Here we compute

$$f_{ij} = \frac{1}{N_j} \sum_{w \in W_j} 1_i(w) \qquad (1)$$

where $1_i$ denotes the indicator function for unigram $i$ and $W_j$ is the collection of word tokens in the tweets from county $j$. Here the $i$th feature for the $j$th unit of analysis (a U.S. county) is equal to the relative frequency of the unigram: the number of times the unigram was mentioned divided by $N_j$, the total number of tweets from county $j$.
County. Next, we use the method generally used in past research, which aggregates all messages to a community, disregarding any metadata, including tweet or user information. Previous state-of-the-art results using this method include life satisfaction r = .31 (Schwartz et al., 2013), atherosclerotic heart disease r = .42 (Eichstaedt et al., 2015) and education r = .15 (Culotta, 2014b). We therefore consider each county a "bag of words" using (1) with $N_j$ equal to the number of unigrams from county $j$.
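As described, the two count-based aggregations differ only in the denominator: the number of tweets in the county ("Tweet to County") or the number of unigram tokens in the county ("County"). The sketch below illustrates that reading. The input layout (a dict from county to tokenized tweets), the function name, and the `normalize_by` parameter are hypothetical.

```python
from collections import Counter


def count_based_features(county_tweets, normalize_by="tweets"):
    """Relative unigram frequencies per county.

    `county_tweets` maps a county to a list of tokenized tweets (lists of str).
    normalize_by="tweets"   divides counts by the number of tweets in the county.
    normalize_by="unigrams" divides counts by the number of tokens in the county.
    """
    features = {}
    for county, tweets in county_tweets.items():
        counts = Counter(token for tweet in tweets for token in tweet)
        if normalize_by == "tweets":
            n_j = len(tweets)
        elif normalize_by == "unigrams":
            n_j = sum(counts.values())
        else:
            raise ValueError("normalize_by must be 'tweets' or 'unigrams'")
        # Counties with no data get an empty feature dict.
        features[county] = {w: c / n_j for w, c in counts.items()} if n_j else {}
    return features


toy = {"county_a": [["good", "day"], ["good", "good", "coffee"]]}
tweet_to_county = count_based_features(toy, "tweets")
county_bow = count_based_features(toy, "unigrams")
```

Under this reading, a tweet-normalized value can exceed 1 when a word repeats within tweets ("good" above occurs 3 times in 2 tweets), whereas the unigram-normalized values in a county always sum to 1.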

User to County
The third method treats the unit of analysis (U.S. county) as a community of users. Therefore, feature weights are extracted at the user level, normalized and then averaged to communities:

$$f_{ij} = \frac{1}{N_j} \sum_{k \in U_j} r_k(i) \qquad (2)$$

where $f_{ij}$ is the value of feature $i$ for county $j$, $U_j$ is the set of users in county $j$, $N_j$ is the total number of Twitter users in county $j$ and $r_k(x)$ is the relative frequency of feature $x$ for user $k$, with $i \in \{\text{all unigrams}\}$ and $j \in \{\text{all counties}\}$.
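A minimal sketch of this user-to-county aggregation follows: each user's unigram counts are first normalized to relative frequencies, and those per-user vectors are then averaged over the users of the county. The dict-based input layout and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def user_to_county_features(user_tokens, user_county):
    """Average per-user relative frequencies within each county.

    `user_tokens` maps a user to the list of tokens from all their tweets;
    `user_county` maps a user to their county (both layouts are hypothetical).
    """
    sums = defaultdict(Counter)  # county -> sum of per-user relative frequencies
    n_users = Counter()          # county -> number of contributing users
    for user, tokens in user_tokens.items():
        county = user_county.get(user)
        if county is None or not tokens:
            continue  # skip users without a county or without any tokens
        n_users[county] += 1
        total = len(tokens)
        for word, count in Counter(tokens).items():
            sums[county][word] += count / total  # user-level relative frequency
    return {
        county: {word: s / n_users[county] for word, s in words.items()}
        for county, words in sums.items()
    }


user_tokens = {
    "u1": ["good", "good", "day", "day"],  # good: .5, day: .5
    "u2": ["bad", "day"],                  # bad: .5, day: .5
}
user_county = {"u1": "c1", "u2": "c1"}
features = user_to_county_features(user_tokens, user_county)
```

Every user contributes a vector that sums to 1, so a user with four tokens and a user with two tokens carry equal weight in the county average.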

Features
We use as features a list of 2,000 social media-derived topics generated from Latent Dirichlet Allocation (Blei et al., 2003) using the complete MyPersonality Facebook data set consisting of approximately 15 million posts (Schwartz et al., 2013). The topic loadings are computed from the most frequent 25,000 unigrams in our data set. We also use a subset of these unigrams as additional features in our models (25,000 reduced to 10,000).

Experimental setup
For each of the four county-level Census and health variables we built three models using 10-fold cross validation with the following features: (1) unigrams, (2) topics and (3) unigrams + topics. We used a feature selection pipeline which first removed all low-variance features and then features that were not correlated with our census and health data. Principal component analysis was then applied to the reduced feature set for further dimensionality reduction. This preprocessing was used to avoid overfitting, since our model included more independent variables (2000 topic frequencies and/or 10k unigrams) than observations (at most 2,041 counties). For the prediction task we used linear regression with ℓ2 regularization (Ridge regression) (Eichstaedt et al., 2015). The regression regularization parameter α was set to 1000 using grid search.
Because our initial data set consisted of 37.6 billion tweets, using distributed IO was crucial for data aggregation and feature extraction. We used a Hadoop-style cluster consisting of 64 disks and 64 CPU cores across 8 physical machines. Over this cluster, we used Hadoop MapReduce for the county mapping step (taking approximately 1 week of wall clock runtime) and Spark for the feature aggregations (taking approximately 1 day of wall clock runtime). The entire pipeline of county mapping, English language filtering, feature extraction and prediction used the DLATK Python package.

Table 4: Prediction results (Pearson r, using unigrams + topics) using full 10% data vs. users with 30+ tweets. The number of tweets used in each task is listed to highlight the fact that the "User to County" tasks use fewer tweets than the "all" tasks.
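For reference, the ridge estimator mentioned above can be written in its standard textbook form. The notation is introduced here and is not taken from the paper: $X$ is the matrix of PCA-reduced county features, $y$ the vector of county outcomes, and $\beta$ the regression weights. Whether an intercept is fitted and left unpenalized is not stated in the text.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 \;+\; \alpha \, \lVert \beta \rVert_2^2 ,
\qquad \alpha = 1000 .
```

A larger $\alpha$ shrinks the weights more strongly toward zero, which suits a setting with few observations relative to the number of features.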

Experiments
Using the above setup we perform three experiments in order to explore the effects of data aggregation. We (1) directly compare aggregation methods using our 10% data; (2) compare aggregation methods using a 1% sample; and, finally, (3) explore the effect of choosing an upper bound on the number of posts per Twitter user, looking at users with fewer than 50, 500, or 1000 posts. This allows us to exclude frequent posters who are potentially organizations or bots.

Results and Discussion
Direct aggregation comparison. The results of our predictive experiments on the 10% data can be found in Table 3. Across all four tasks we see that the "User to County" approach outperforms the other aggregation methods, giving a higher Pearson r and obtaining state-of-the-art results for community-level predictions. We see the largest gains for the "User to County" aggregation for the income outcome, with a 13 point increase in Pearson r for topics alone and a 9 point increase for unigrams + topics.
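Since all comparisons are reported as Pearson r between predicted and observed county values, a small reference implementation of the metric may be useful. This is the standard formula written in plain Python; the function name and the toy numbers are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy predicted vs. observed outcomes for four hypothetical counties.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [2.0, 4.1, 5.9, 8.2]
r = pearson_r(predicted, observed)
```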
In Table 4 we remove the 30+ tweet requirement from the "Tweet to County" and "County" methods and compare against the "User to County" method (with the 30+ tweet requirement). Again we see that the "User to County" method outperforms all others, in spite of the fact that it uses less data than both "all" approaches, which contain 108 million more tweets.
1% data. In Table 5 we repeat the above experiment on a 1% Twitter sample. Here we see that the "User to County" method outperforms both the "Tweet to County" and "County" methods (with all three tasks using the same number of tweets). When we compare the "User to County" and "County (all)" methods we see "User to County" outperforming on two out of four tasks (Income and Life Satisfaction). Again, we note that "User to County" uses less data than "County (all)". While, across the board, the performance increase is not as substantial as in the 10% results, we see comparable performance between the "User to County" and "County (all)" methods despite the difference in the number of tweets.
Super users. One theory for why we see such large gains depending on aggregation technique is that aggregating through users negates the effects of super users: those who post an extraordinary amount (such as organizations or bots). We implemented a maximum tweet requirement in order to remove these users and see if that accounts for the difference. Here we use both the "User to County" and "County (all)" samples and report results in Table 6. These results demonstrate that by keeping only users with fewer than 500 tweets we get results close to our "User to County (No Max)" method using the user-naive "County (all)" scheme. This shows that relatively few users (in this case 611k) can significantly decrease performance, though it still leaves a small gain from the user-to-county approach. As seen in the lower half of the table, this thresholding does not increase performance when using the "User to County" method, which suggests such users can still be beneficial if they are treated such that they cannot dominate a community. This highlights the benefit of our simple method: we do not need to consider optimizations which may not generalize across data, such as upper-bound thresholds on the number of tweets per user. Further, the user-to-county aggregation seems to provide at least a small benefit beyond removal of super users.
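The super-user effect can be seen in a toy county with nine ordinary users and one very prolific account. The data, the account names, and both helper functions are invented for illustration; they are a sketch of the two aggregation schemes, not the authors' implementation.

```python
from collections import Counter


def county_bag_of_words(user_tokens):
    """User-naive aggregation: pool all tokens in the county, then normalize."""
    counts = Counter(tok for tokens in user_tokens.values() for tok in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def user_to_county(user_tokens):
    """User-level aggregation: normalize within each user, then average users."""
    n_users = len(user_tokens)
    feats = Counter()
    for tokens in user_tokens.values():
        for word, count in Counter(tokens).items():
            feats[word] += count / len(tokens) / n_users
    return dict(feats)


# Nine ordinary users with 40 tokens each, plus one account posting 10,000 tokens.
county = {f"user{i}": ["coffee", "work"] * 20 for i in range(9)}
county["bot"] = ["deal"] * 10_000

bow = county_bag_of_words(county)
u2c = user_to_county(county)
```

In the pooled representation the prolific account's word accounts for about 97% of the county's mass, while under user-level aggregation it is capped at that account's share of the user population (1 in 10), with no posting threshold required.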

Conclusion
This study explored the benefit of aggregation techniques for streaming user-generated data from individual messages to community-level data, the typical setting for nowcasting. We showed that by simply aggregating to users first and then taking the mean within a county, we can obtain large gains (remarkably, up to a 13 point increase in Pearson correlation) over typical aggregation methods common in past work. In order to foster nowcasting research utilizing this improved aggregation, we will release the County Tweet Lexical Bank, a large aggregated and anonymized county-level data set computed on more than 1.6 billion tweets posted over 5 years. Future work in this area can look at adjusting models to account for other metadata such as temporal variation and diversity, and at adjusting for selection biases present in social media, where the user base is not representative of the population of the community (Greenwood et al., 2016).