Confounds and Consequences in Geotagged Twitter Data

Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40.


Introduction
Social media data, such as Twitter, is frequently used to identify the unique characteristics of geographical regions, including topics of interest (Hong et al., 2012), linguistic styles and dialects (Eisenstein et al., 2010; Gonçalves and Sánchez, 2014), political opinions (Caldarelli et al., 2014), and public health (Broniatowski et al., 2013). Social media permits the aggregation of datasets that are orders of magnitude larger than could be assembled via traditional survey techniques, enabling analysis that is simultaneously fine-grained and global in scale. Yet social media is not a representative sample of any "real world" population, introducing both geographic and demographic biases (Mislove et al., 2011); this is particularly true for volunteered geographic information (Hecht and Stephens, 2014).
This paper examines the effects of these biases on the geo-linguistic inferences that can be drawn from Twitter. We focus on the ten largest metropolitan areas in the United States, and consider three sampling techniques: drawing an equal number of GPS-tagged tweets from each area; drawing a county-balanced sample of GPS-tagged messages to correct Twitter's urban skew (Hecht and Stephens, 2014); and drawing a sample of location-annotated messages, using the location field in the user profile. Leveraging self-reported first names and census statistics, we show that the age and gender composition of these datasets differs significantly.
Next, we apply standard methods from the literature to identify geo-linguistic differences, and test how the outcomes of these methods depend on the sampling technique and on the underlying demographics. We also test the accuracy of text-based geolocation (Cheng et al., 2010; Eisenstein et al., 2010) in each dataset, to determine whether the accuracies reported in recent work will generalize to more balanced samples.
The paper reports several new findings about geotagged Twitter data:
• In comparison with tweets with self-reported locations, GPS-tagged tweets are written more often by young people and women.
• There are corresponding linguistic differences between these datasets, with GPS-tagged tweets including more geographically-specific non-standard words.
• Young people use significantly more geographically-specific non-standard words. Men tend to mention more geographically-specific entities than women, but these differences are significant only for individuals aged 30 or older.
• Users who GPS-tag their tweets tend to write more, making them easier to geolocate. Evaluating text-based geolocation on GPS-tagged tweets probably overestimates its accuracy.
• Text-based geolocation is significantly more accurate for men and for older people.
These findings should inform future attempts to generalize from geotagged Twitter data, and may suggest investigations into the demographic properties of other social media sites. We first describe the basic data collection principles that hold throughout the paper (§ 2). The following three sections tackle demographic biases (§ 3), their linguistic consequences (§ 4), and the impact on text-based geolocation (§ 5); each of these sections begins with a discussion of methods, and then presents results. We then summarize related work and conclude.

Dataset
This study is performed on a dataset of tweets gathered from Twitter's streaming API from February 2014 to January 2015, a period of almost a year. During an initial filtering step, we removed retweets: repetitions of previously posted messages, identified by the "retweeted status" metadata or by the "RT" token that Twitter users widely use to indicate a retweet. To eliminate spam and automated accounts (Yardi et al., 2009), we removed tweets containing URLs, user accounts with more than 1000 followers or followees, accounts that had tweeted more than 5000 messages at the time of data collection, and the top 10% of accounts by number of messages in our dataset. We also removed users who wrote more than 10% of their tweets in any language other than English, using the "lang" metadata of each tweet for this language filtering.
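The filtering rules above can be sketched as two predicates, one per message and one per account. This is a minimal illustration rather than the authors' pipeline: the dictionary keys (`text`, `lang`, `retweeted_status`, `followers`, `followees`, `total_tweets`) are hypothetical field names, and the removal of the top 10% of accounts by volume is omitted because it requires a pass over the whole dataset.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
RT_PATTERN = re.compile(r"\bRT\b")


def keep_tweet(tweet):
    """Return True if a tweet passes the message-level filters.

    `tweet` is a hypothetical dict; the key names are assumptions.
    """
    text = tweet["text"]
    if "retweeted_status" in tweet:   # retweet flagged in the metadata
        return False
    if RT_PATTERN.search(text):       # manual retweet marked with "RT"
        return False
    if URL_PATTERN.search(text):      # tweets containing URLs are dropped
        return False
    return True


def keep_user(user, tweets):
    """Return True if an account passes the account-level filters."""
    if user["followers"] > 1000 or user["followees"] > 1000:
        return False
    if user["total_tweets"] > 5000:
        return False
    if not tweets:
        return False
    # Drop users with more than 10% of their tweets in a non-English language.
    non_english = sum(1 for t in tweets if t["lang"] != "en")
    return non_english / len(tweets) <= 0.10
```

A user would be retained only if `keep_user` holds, and then only the tweets for which `keep_tweet` holds would enter the corpus.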
We consider the ten largest Metropolitan Statistical Areas (MSAs) in the United States, listed in Table 1. MSAs are defined by the U.S. Census Bureau as geographical regions of high population density organized around a single urban core; they are not legal administrative divisions. MSAs include outlying areas that may be substantially less urban than the core itself. For example, the Atlanta MSA is centered on Fulton County (1750 people per square mile), but extends to Haralson County (100 people per square mile), on the border of Alabama. A per-county analysis of this data therefore enables us to assess the degree to which Twitter's skew towards urban areas biases geo-linguistic analysis.

Representativeness of geotagged Twitter data
We first assess potential biases in sampling techniques for obtaining geotagged Twitter data. In particular, we compare two possible techniques for obtaining data: the location field in the user profile (Poblete et al., 2011), and the GPS coordinates attached to each message (Cheng et al., 2010; Eisenstein et al., 2010).

Methods
To build a dataset of GPS-tagged messages, we extracted the GPS latitude and longitude coordinates reported in each tweet, and reverse-geocoded these coordinates to counties. This set of geotagged messages will be denoted D_G. Only 1.24% of messages contain geo-coordinates, and it is possible that the individuals willing to share their GPS coordinates comprise a skewed population. We therefore also considered the user-reported location field in the Twitter profile, focusing on the two most widely-used patterns: (1) city name; (2) city name and two-letter state abbreviation (e.g. Chicago and Chicago, IL). Messages that matched any of the ten largest MSAs were grouped into a second set, D_L. While the inconsistencies of writing style in the Twitter location field are well-known (Hecht et al., 2011), analysis of the intersection between D_G and D_L found that the overwhelming majority of messages in D_L were geotagged to the appropriate MSA. Of course, there may be many false negatives: profiles that we fail to geolocate due to the use of non-standard toponyms like Pixburgh and ATL. If so, this would introduce a bias in the population sample in D_L. Such a bias might have linguistic consequences, such as a diminished tendency to use non-standard language in general.
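The two profile-location patterns can be illustrated with a small matcher. This is a sketch under stated assumptions: the `CITY_TO_MSA` table is a hypothetical two-entry stand-in for a full gazetteer of the ten MSAs, and the regular expression is one plausible reading of "city name" and "city name and two-letter state", not the exact rule used in the study.

```python
import re

# Hypothetical lookup from (city, state) to an MSA label; only two entries shown.
CITY_TO_MSA = {
    ("chicago", "il"): "Chicago",
    ("atlanta", "ga"): "Atlanta",
}
CITY_ONLY = {city: msa for (city, _state), msa in CITY_TO_MSA.items()}

# Pattern 1: "city"; pattern 2: "city, ST" with a two-letter state code.
LOCATION_PATTERN = re.compile(r"^\s*([a-z .]+?)\s*(?:,\s*([a-z]{2}))?\s*$")


def match_msa(location_field):
    """Return the MSA label for a profile location string, or None."""
    m = LOCATION_PATTERN.match(location_field.lower())
    if m is None:
        return None
    city, state = m.group(1), m.group(2)
    if state is None:
        return CITY_ONLY.get(city)
    return CITY_TO_MSA.get((city, state))
```

Non-standard toponyms such as ATL or Pixburgh return `None` under this matcher, which is exactly the false-negative case discussed above.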

Subsampling
The initial samples D_G and D_L were then resampled to create the following balanced datasets:
GPS-COUNTY-BALANCED We resampled D_G based on county-level population (obtained from the U.S. Census Bureau), and again obtained message-balanced and user-balanced samples. These samples are more geographically representative of the overall population distribution across each MSA.
LOC-MSA-BALANCED From D_L, we randomly sampled 25,000 tweets per MSA as the message-balanced sample, and all the tweets from 2,500 users per MSA as the user-balanced sample. It is not possible to obtain county-level geolocations in D_L, as exact geographical coordinates are unavailable.

Age and gender identification
To estimate the distribution of ages and genders in each sample, we queried statistics from the Social Security Administration, which records the number of individuals born each year with each given name. Using this information we obtained the probability distribution of age values for each given name. We then matched the names against the first token in the name field of each user's profile, enabling us to induce approximate distributions over ages and genders. Unlike Facebook and Google+, Twitter does not have a "real name" policy, so users are free to give names that are fake, humorous, etc. We eliminate user accounts whose names are not sufficiently common in the social security database (i.e. first names which are at least 100 times more frequent in Twitter than in the social security database). While some individuals will choose names not typically associated with their gender, we assume that this will happen with roughly equal probability in both directions.
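As a rough illustration of the name-based lookup, the sketch below turns birth records into a distribution over age and gender for each given name. The input format (a dict keyed by name, gender, and birth year) is a hypothetical simplification of the Social Security Administration files, and the sketch ignores mortality, so it treats everyone born in a given year as a potential user.

```python
from collections import defaultdict


def age_gender_given_name(birth_counts, current_year):
    """Turn birth records into p(age, gender | name).

    `birth_counts` maps (name, gender, birth_year) -> number of births;
    this layout is an assumption made for the example.
    """
    totals = defaultdict(float)
    joint = defaultdict(dict)
    for (name, gender, year), count in birth_counts.items():
        key = (current_year - year, gender)   # (age, gender)
        joint[name][key] = joint[name].get(key, 0.0) + count
        totals[name] += count
    # Normalize the counts for each name into a probability distribution.
    return {
        name: {key: c / totals[name] for key, c in dist.items()}
        for name, dist in joint.items()
    }
```

Summing the resulting probabilities over gender gives an age distribution for each name, and summing over age gives a gender distribution.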
With these caveats in mind, we induce the age distribution for the GPS-MSA-BALANCED sample and the LOC-MSA-BALANCED sample as

$$p_D(a) \propto \sum_{i \in D} p(a \mid n_i),$$

where n_i is the first name of user i and p(a | n_i) is the age distribution associated with that name. We induce distributions over author gender in much the same way (Mislove et al., 2011). This method does not incorporate prior information about the ages of Twitter users, and thus assigns too much probability to the extremely young and old, who are unlikely to use the service. While it would be easy to design such a prior (for example, assigning zero prior probability to users under the age of five or above the age of 95), we see no principled basis for determining these cutoffs. We therefore focus on the differences between the estimated p_D(a) for each sample D.
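Aggregating per-name age distributions over the users in a sample might look as follows. This is a sketch, assuming a precomputed table `age_given_name` that maps lowercase first names to age distributions; users whose names are missing from the table are skipped, mirroring the name filter described earlier.

```python
def sample_age_distribution(first_names, age_given_name):
    """Average p(age | name) over the users in a sample.

    `first_names` holds one first name per user; `age_given_name` is a
    hypothetical lookup table name -> {age: probability}.
    """
    totals = {}
    matched = 0
    for name in first_names:
        dist = age_given_name.get(name.lower())
        if dist is None:          # name absent from the reference table
            continue
        matched += 1
        for age, prob in dist.items():
            totals[age] = totals.get(age, 0.0) + prob
    if matched == 0:
        return {}
    return {age: total / matched for age, total in totals.items()}
```

Comparing the output for two samples, age by age, gives the kind of difference plot described for the GPS-based and profile-based datasets.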

Results
Geographical biases in the GPS Sample We first assess the differences between the true population distributions over counties, and the per-tweet and per-user distributions. Because counties vary widely in their degree of urbanization and other demographic characteristics, this measure is a proxy for the representativeness of GPS-based Twitter samples (county information is not available for the LOC-MSA-BALANCED sample). Population distributions for New York and Atlanta are shown in Figure 1. In Atlanta, Fulton County is the most populous and most urban, and is overrepresented in both geotagged tweets and user accounts; most of the remaining counties are correspondingly underrepresented, consistent with the urban skew noted by Hecht and Stephens (2014). In New York, Kings County (Brooklyn) is the most populous, but is underrepresented in both the number of geotagged tweets and user accounts, at the expense of New York County (Manhattan). Manhattan is the commercial and entertainment center of the New York MSA, so residents of outlying counties may be tweeting from their jobs or social activities in the city center.
To quantify the representativeness of each sample, we use the L1 distance $\|p - t\|_1 = \sum_c |p_c - t_c|$, where p_c is the proportion of the MSA population residing in county c and t_c is the proportion of tweets in county c (Table 1). County boundaries are determined by states, and their density varies: for example, the Los Angeles MSA covers only two counties, while the smaller Atlanta MSA is spread over 28 counties. The table shows that while New York is the most extreme example, most MSAs feature an asymmetry between county population and Twitter adoption.
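The L1 distance between the population and tweet distributions over counties can be computed directly from raw counts. The sketch below normalizes each set of counts to proportions first; the county labels used in any call are placeholders, not figures from Table 1.

```python
def l1_distance(population, tweets):
    """L1 distance between county-level population and tweet proportions.

    Both arguments map county name -> raw count; counties missing from
    one mapping are treated as having a count of zero.
    """
    counties = set(population) | set(tweets)
    pop_total = sum(population.values())
    tweet_total = sum(tweets.values())
    return sum(
        abs(population.get(c, 0) / pop_total - tweets.get(c, 0) / tweet_total)
        for c in counties
    )
```

The value ranges from 0 (identical distributions) to 2 (disjoint support), so larger values indicate a less representative sample.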
Usage Next, we turn to differences between the GPS-based and profile-based techniques for obtaining ground truth data. As shown in Figure 2, the LOC-MSA-BALANCED sample contains more low-volume users than either the GPS-MSA-BALANCED or GPS-COUNTY-BALANCED samples. We can therefore conclude that the county-level geographical bias in the GPS-based data does not impact usage rate, but that the difference between GPS-based and profile-based sampling does; the linguistic consequences of this difference will be explored in the following sections.
Demographics Table 2 shows the expected age and gender for each dataset, with bootstrap confidence intervals. Users in the LOC-MSA-BALANCED dataset are on average two years older than in the GPS-MSA-BALANCED and GPS-COUNTY-BALANCED datasets, which are statistically indistinguishable. Focusing on the difference between GPS-MSA-BALANCED and LOC-MSA-BALANCED, we plot the difference in age probabilities in Figure 3, showing that GPS-MSA-BALANCED includes many more teens and people in their early twenties, while LOC-MSA-BALANCED includes more people at middle age and older. Young people are especially likely to use social media on cellphones (Lenhart, 2015), where location tagging would be more relevant than when Twitter is accessed via a personal computer. Social media users in the age brackets 18-29 and 30-49 are also more likely to tag their locations in social media posts than social media users in the age brackets 50-64 and 65+ (Zickuhr, 2013), with women and men tagging at roughly equal rates. Table 2 shows that the GPS-MSA-BALANCED and GPS-COUNTY-BALANCED samples contain significantly more women than LOC-MSA-BALANCED, though all three samples are close to 50%.
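Bootstrap confidence intervals of the kind reported in Table 2 can be sketched with a percentile bootstrap over users. The text does not say which bootstrap variant was used, so the percentile method, the number of resamples, and the fixed seed are all assumptions made for this example.

```python
import random


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample users with replacement and record the resampled mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Here `values` would hold one expected age (or one probability of being female) per user, and two samples would be called statistically indistinguishable when their intervals overlap substantially.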

Impact on linguistic generalizations
Many papers use Twitter data to draw conclusions about the relationship between language and geography. What effect do the demographic differences identified in the previous section have on the linguistic conclusions that emerge? We measure the differences between the linguistic corpora obtained by each data acquisition approach. Since the GPS-MSA-BALANCED and GPS-COUNTY-BALANCED methods have nearly identical patterns of usage and demographics, we focus on the difference between GPS-MSA-BALANCED and LOC-MSA-BALANCED. These datasets differ in age and gender, so we also directly measure the impact of these demographic factors on the use of geographically-specific linguistic variables.

Discovering geographical linguistic variables
We focus on lexical variation, which is relatively easy to identify in text corpora. Monroe et al. (2008) survey a range of alternative statistics for finding lexical variables, demonstrating that a regularized log-odds ratio strikes a good balance between distinctiveness and robustness. A similar approach is implemented in SAGE (Eisenstein et al., 2011) 1 , which we use here. For each sample (GPS-MSA-BALANCED and LOC-MSA-BALANCED), we apply SAGE to identify the twenty-five most salient lexical items for each metropolitan area.
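To make the idea of a regularized log-odds ratio concrete, the sketch below scores each word by the difference in smoothed log-odds between two corpora, divided by an estimate of its standard deviation, in the spirit of Monroe et al. (2008). It is not SAGE itself: the symmetric prior value is an arbitrary assumption, and the function expects a vocabulary of at least two words.

```python
import math


def log_odds_z_scores(counts_a, counts_b, prior=0.01):
    """Z-scored log-odds ratio of each word in corpus A versus corpus B.

    `counts_a` and `counts_b` map word -> count. A symmetric Dirichlet
    prior (`prior` pseudo-counts per word) regularizes rare words.
    """
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    prior_total = prior * len(vocab)
    scores = {}
    for word in vocab:
        y_a = counts_a.get(word, 0) + prior
        y_b = counts_b.get(word, 0) + prior
        # Difference of smoothed log-odds between the two corpora.
        delta = (math.log(y_a / (n_a + prior_total - y_a))
                 - math.log(y_b / (n_b + prior_total - y_b)))
        # Approximate variance of the log-odds difference.
        variance = 1.0 / y_a + 1.0 / y_b
        scores[word] = delta / math.sqrt(variance)
    return scores
```

Taking one metropolitan area as corpus A and the remaining areas as corpus B, the highest-scoring words would play the role of the salient lexical items discussed here.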
Keyword annotation Previous research has identified two main types of geographical lexical variables. The first are non-standard words and spellings, such as hella and yinz. Such variables have been found to be very frequent in social media (Eisenstein, 2015). Other researchers have focused on the "long tail" of entity names (Roller et al., 2012). A key question is the relative importance of these two variable types, since this would decide whether geo-linguistic differences are primarily topic-based or stylistic. It is therefore important to know whether the frequency of these two variable types depends on properties of the sample. To test this, we take the lexical items identified by SAGE (25 per MSA, for both the GPS-MSA-BALANCED and LOC-MSA-BALANCED samples), and annotate them as NONSTANDARD-WORD, ENTITY-NAME, or OTHER. Annotation for ambiguous cases is based on the majority sense in randomly-selected examples. Overall, we identify 24 NONSTANDARD-WORDs and 185 ENTITY-NAMEs.
1 https://github.com/jacobeisenstein/jos-gender-2014
Inferring author demographics As described in § 3.1.2, we can obtain an approximate distribution over author age and gender by linking self-reported first names with aggregate statistics from the Social Security Administration. To sharpen these estimates, we now consider the text as well, building a simple latent variable model in which both the name and the word counts are drawn from distributions associated with the latent age and gender (Chang et al., 2010). The model is shown in Figure 4, and involves the following generative process. For each user i ∈ {1...N}:
(a) draw the age, a_i ∼ Categorical(π)
(b) draw the gender, g_i ∼ Categorical(0.5)
(c) draw the author's given name, n_i ∼ Categorical(φ_{a_i,g_i})
(d) draw the word counts, w_i ∼ Multinomial(θ_{a_i,g_i}),
where we elide the second parameter of the multinomial distribution, the total word count. We use expectation-maximization to perform inference in this model, binning the latent age variable into four groups: 0-17, 18-29, 30-39, above 40. We note that there is other work in this domain of demographic prediction (Nguyen et al., 2014; Volkova and Durme, 2015), but since it is not the focus of our research, we take a relatively simple approach, which is unsupervised.
Symbol     Description
a_i        Age (bin) for author i
g_i        Gender of author i
w_i        Word counts for author i
n_i        First name of author i
π          Prior distribution over age bins
θ_{a,g}    Word distribution for age a and gender g
φ_{a,g}    First name distribution for age a and gender g
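Expectation-maximization for a model of this shape can be sketched as a mixture over latent (age bin, gender) classes, where each class has its own name distribution and word distribution. The details below are assumptions made for illustration: the initial responsibilities are supplied by the caller (for example from name statistics), all parameters are re-estimated with additive smoothing, and the gender prior is fixed at 0.5.

```python
import math


def _log_normalize(weights):
    """Convert positive weights to log-probabilities."""
    total = sum(weights.values())
    return {k: math.log(v / total) for k, v in weights.items()}


def em_demographics(names, word_counts, init_resp, classes, n_iter=20, smooth=0.1):
    """EM over latent (age_bin, gender) classes.

    names[i]       -- first name of user i
    word_counts[i] -- dict word -> count for user i
    init_resp[i]   -- dict class -> initial responsibility for user i
    classes        -- list of (age_bin, gender) tuples
    Returns one dict of posterior class probabilities per user.
    """
    vocab = sorted({w for wc in word_counts for w in wc})
    name_set = sorted(set(names))
    ages = sorted({a for a, _ in classes})
    resp = [dict(r) for r in init_resp]
    for _ in range(n_iter):
        # M-step: re-estimate pi, phi, and theta from soft counts.
        pi = {a: smooth for a in ages}
        phi = {z: {n: smooth for n in name_set} for z in classes}
        theta = {z: {w: smooth for w in vocab} for z in classes}
        for i, r in enumerate(resp):
            for z, weight in r.items():
                pi[z[0]] += weight
                phi[z][names[i]] += weight
                for w, c in word_counts[i].items():
                    theta[z][w] += weight * c
        log_pi = _log_normalize(pi)
        log_phi = {z: _log_normalize(phi[z]) for z in classes}
        log_theta = {z: _log_normalize(theta[z]) for z in classes}
        # E-step: posterior over classes given each user's name and words.
        new_resp = []
        for i in range(len(names)):
            scores = {}
            for z in classes:
                s = log_pi[z[0]] + math.log(0.5) + log_phi[z][names[i]]
                s += sum(c * log_theta[z][w] for w, c in word_counts[i].items())
                scores[z] = s
            top = max(scores.values())
            unnorm = {z: math.exp(s - top) for z, s in scores.items()}
            total = sum(unnorm.values())
            new_resp.append({z: v / total for z, v in unnorm.items()})
        resp = new_resp
    return resp
```

The returned responsibilities correspond to the posterior over age bin and gender for each author; summing over gender gives an age posterior, and summing over age gives a gender posterior.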

Results
Linguistic differences by dataset We first consider the impact of the data acquisition technique on the lexical features associated with each city. The keywords identified in the GPS-MSA-BALANCED dataset feature more geographically-specific non-standard words, which occur at a rate of 3.9 × 10^−4 in GPS-MSA-BALANCED, versus 2.6 × 10^−4 in LOC-MSA-BALANCED; this difference is statistically significant (p < .05, t = 3.2). For entity names, the difference between datasets was not significant, with a rate of 4.0 × 10^−3 for GPS-MSA-BALANCED, and 3.7 × 10^−3 for LOC-MSA-BALANCED. Note that these rates include only the non-standard words and entity names detected by SAGE as among the top 25 most distinctive for one of the ten largest cities in the US; of course there are many other relevant terms that are below this threshold.
In a pilot study of the GPS-COUNTY-BALANCED data, we found few linguistic differences from GPS-MSA-BALANCED, in either the aggregate word-group frequencies or the SAGE word lists, despite the geographical imbalances shown in Table 1 and Figure 1. Informal examination of specific counties shows some expected differences: for example, Clayton County, which hosts Atlanta's Hartsfield-Jackson airport, includes terms related to air travel, and other counties include mentions of local cities and business districts. But the aggregate statistics for underrepresented counties are not substantially different from those of overrepresented counties, and are largely unaffected by county-based resampling.
Demographics Aggregate linguistic statistics for demographic groups are shown in Figure 5. Men use significantly more geographically-specific entity names than women (p ≪ .01, t = 8.0), but gender differences for geographically-specific non-standard words are not significant (p ≈ .2). Younger people use significantly more geographically-specific non-standard words than older people (ages 0-29 versus 30+, p ≪ .01, t = 7.8), and older people mention significantly more geographically-specific entity names (p ≪ .01, t = 5.1). Of particular interest is the intersection of age and gender: the use of geographically-specific non-standard words decreases with age much more sharply for men than for women; conversely, the frequency of mentioning geographically-specific entity names increases dramatically with age for men, but to a much lesser extent for women. This suggests that the high-level patterns of geographically-oriented language are more age-dependent for men than for women, an intriguing site for future research.
For a more detailed view, we apply SAGE to identify the most salient lexical items for each MSA, subgrouped by age and gender. Table 3 shows word lists for New York (the largest MSA) and Dallas (the 5th-largest MSA), using the GPS-MSA-BALANCED sample. Non-standard words tend to be used by the youngest authors: ilysm ('I love you so much'), ight ('alright'), oomf ('one of my followers'). Older authors write more about local entities (manhattan, nyc, houston), with men focusing on sports-related entities (harden, watt, astros, mets, texans), and women above the age of 40 emphasizing religiously-oriented terms (proverb, islam, rejoice, psalm).

Impact on text-based geolocation
A major application of geotagged social media is to predict the geolocation of individuals based on their text (Eisenstein et al., 2010; Cheng et al., 2010; Wing and Baldridge, 2011; Hong et al., 2012; Han et al., 2014). Text-based geolocation has obvious commercial implications for location-based marketing and opinion analysis; it is also potentially useful for researchers who want to measure geographical phenomena in social media, and wish to access a larger set of individuals than those who provide their locations explicitly.
Previous research has obtained impressive accuracies for text-based geolocation: for example, Hong et al. (2012) report a median error of 120 km, which is roughly the distance from Los Angeles to San Diego, in a prediction space over the entire continental United States. These accuracies are computed on test sets that were acquired through the same procedures as the training data, so if the acquisition procedures have geographic and demographic biases, then the resulting accuracy estimates will be biased too. Consequently, they may be overly optimistic (or pessimistic!) for some types of authors. In this section, we explore where these text-based geolocation methods are most and least accurate.

Methods
Our data is drawn from the ten largest metropolitan areas in the United States, and we formulate text-based geolocation as a ten-way classification problem, similar to Han et al. (2014). Using our user-balanced samples, we apply ten-fold cross-validation, and tune the regularization parameter on a development fold, using the vocabulary of the sample as features.
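One way to organize ten-fold cross-validation with a development fold is sketched below. How the development fold was chosen is not stated in the text, so using the fold that follows the test fold is an assumption; the classifier itself is left out, since only the data split is illustrated.

```python
import random


def cross_validation_folds(user_ids, k=10, seed=0):
    """Yield (train, dev, test) lists of user ids for k-fold cross-validation.

    Splitting is done over users, so all messages from one author fall
    in the same fold. The dev fold is the one after the test fold
    (an assumption made for this sketch).
    """
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        dev_index = (i + 1) % k
        test = folds[i]
        dev = folds[dev_index]
        train = [u for j, fold in enumerate(folds)
                 if j != i and j != dev_index for u in fold]
        yield train, dev, test
```

For each split, a regularized classifier would be fit on `train` for several regularization strengths, the strength with the best accuracy on `dev` would be kept, and accuracy would be reported on `test`.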

Results
Many author-attribute prediction tasks become substantially easier as more data is available (Burger et al., 2011), and text-based geolocation is no exception. Since GPS-MSA-BALANCED and LOC-MSA-BALANCED have very different usage rates (Figure 2), perceived differences in accuracy may be purely attributable to the amount of data available per user, rather than to users in one group being inherently harder to classify than another. For this reason, we bin users by the number of messages in our sample of their timeline, and report results separately for each bin. All error bars represent 95% confidence intervals.
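Reporting accuracy separately for each usage bin could be done along the following lines. The bin edges are placeholders, and the 95% interval uses a normal approximation to the binomial, which is an assumption: the text does not say how the error bars were computed.

```python
import math


def accuracy_by_bin(records, bin_edges):
    """Accuracy and 95% half-width per usage bin.

    records   -- iterable of (message_count, is_correct) pairs, one per user
    bin_edges -- ascending upper bounds; a user falls in the first bin whose
                 bound is >= the message count (users above the last bound
                 are ignored in this sketch)
    """
    stats = {edge: [0, 0] for edge in bin_edges}   # edge -> [hits, total]
    for n_messages, correct in records:
        for edge in bin_edges:
            if n_messages <= edge:
                stats[edge][0] += int(correct)
                stats[edge][1] += 1
                break
    result = {}
    for edge, (hits, total) in stats.items():
        if total == 0:
            continue
        acc = hits / total
        half_width = 1.96 * math.sqrt(acc * (1 - acc) / total)
        result[edge] = (acc, half_width)
    return result
```

Comparing two samples bin by bin, rather than overall, separates the effect of having more text per user from any difference in how hard the users are to classify.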
GPS versus location As seen in Figure 6a, there is little difference in accuracy across sampling techniques: the location-based sample is slightly easier to geolocate at each usage bin, but the difference is not statistically significant. However, due to the higher average usage rate in GPS-MSA-BALANCED, the overall accuracy for a sample of users will appear to be higher on this data.
Demographics Next, we measure classification accuracy by gender and age, using the posterior distribution from the expectation-maximization algorithm to predict the gender of each user (broadly similar results are obtained by using the prior distribution). For this experiment, we focus on the GPS-MSA-BALANCED sample. As shown in Figure 6b, text-based geolocation is consistently more accurate for male authors, across almost the entire spectrum of usage rates. As shown in Figure 6c, older users also tend to be easier to geolocate: at each usage level, the highest accuracy goes to one of the two older groups, and the difference is significant in almost every case. As discussed in § 4, older male users tend to mention many entities, particularly sports-related terms; these terms are apparently more useful than the non-standard spellings and slang favored by younger authors.

Related Work
Several researchers have studied how adoption of Internet technology varies with factors such as socioeconomic status, age, gender, and living conditions (Zillien and Hargittai, 2009). Hargittai and Litt (2011) use a longitudinal survey methodology to compare the effects of gender, race, and topics of interest on Twitter usage among young adults. Geographic variation in Twitter adoption has been considered both internationally (Kulshrestha et al., 2012) and within the United States, using both the Twitter location field (Mislove et al., 2011) and per-message GPS coordinates (Hecht and Stephens, 2014). This prior work consistently indicates an urban bias, with rural counties underrepresented in comparison to their population. Twitter has often been used to study the geographical distribution of linguistic information; of particular relevance are Twitter-based studies of regional dialect differences (Eisenstein et al., 2010; Doyle, 2014; Gonçalves and Sánchez, 2014; Eisenstein, 2015) and of text-based geolocation (Cheng et al., 2010; Hong et al., 2012; Han et al., 2014). This prior work rarely considers the impact of the demographic confounds, or the geographical biases mentioned in § 3. We address this question by measuring differences between three sampling techniques, in both language use and in the accuracy of text-based geolocation. Recent unpublished work proposes reweighting Twitter data to correct biases in political analysis (Choy et al., 2012) and public health (Culotta, 2014). Our results suggest that the linguistic differences between user-supplied profile locations and per-message geotags are more significant, and that accounting for the geographical biases among geotagged messages is not sufficient to offer a representative profile of Twitter users.

Discussion
Geotagged Twitter data offers an invaluable resource for studying the interaction of language and geography, and is helping to usher in a new generation of location-aware language technology. This makes critical investigation of the nature of this data source particularly important. This paper uncovers important demographic confounds in the linguistic analysis of geo-located Twitter data, but is limited to demographics that can be readily induced from given names. One key task for future work is to quantify the representativeness of geotagged Twitter data with respect to race and socioeconomic status; another is to expand this investigation to the international context.

Figure 6: Classification accuracies. (c) Classification accuracy by imputed age.