Geolocation for Twitter: Timing Matters

Automated geolocation of social media messages can benefit a variety of downstream applications. However, these geolocation systems are typically evaluated without attention to how changes in time impact geolocation. Since different people in different locations write messages at different times, these factors can significantly vary the performance of a geolocation system over time. We demonstrate cyclical temporal effects on geolocation accuracy in Twitter, as well as rapid drops as test data moves beyond the time period of the training data. We show that temporal drift can be effectively countered with even modest online model updates.

Most previous work considers the task of author geolocation: identifying an author's primary (home) location (Eisenstein et al., 2010; Han et al., 2014). Author geolocation systems rely on multiple tweets from each author to identify the location. In this work, we consider the task of tweet geolocation, where a system identifies the location at which a single tweet was written (Osborne et al., 2014). This approach is necessary when geolocation decisions must be made quickly, with limited resources, or when the location of a specific tweet is required. When focusing on a single tweet, time becomes relevant. Intuitively, tweets written in the morning might come from different locations (home) than tweets written during the day (work). This information is often ignored but can provide important clues to a tweet's location. Likewise, models built using historical data never adapt as time evolves. These factors may have a significant impact on geolocation accuracy, and downstream systems should be sensitive to these variations.
For the first time, we consider the impact of time on Twitter geolocation, predicting where a post was made (rather than the more usual, and easier, task of author location). We take a supervised learning approach, training a multi-class classifier to identify the city of a tweet. We train a system on 250 million tweets sampled from a 45-month period, perhaps the largest evaluation to date. We find that:
• Geolocation accuracy is cyclical, varying significantly with time.
• While access to massive training data improves accuracy, these effects are largely lost when models are deployed on new tweets, in large part due to new users and duplicate tweets.
• Periodically updating geolocation models, even with data available from the free Twitter API, can largely supplant massive training datasets.
Our study is similar to that of Pavalanathan and Eisenstein (2015), who called into question the accuracy of geolocation models due to mismatches between the behavior of users in available training data and users encountered in live data. While our work provides a cautionary tale, it also offers a guide for how these models can be used in practice.

Dataset
We start with every geocoded tweet (based on the "location" field) from January 1, 2012 to September 30, 2015: 8,530,693,792 tweets. 1 These tweets are associated with a specific location by Twitter (i.e., the "location" field is populated). We took several steps to remove tweets that were not relevant to the task. We removed tweets posted by location-sharing services (Foursquare and Swarm), since these are not written by users; we removed retweets for the same reason. We also removed tweets that contain a location but lack a specific latitude/longitude (geo field): Twitter allows users to tag a tweet with a location (populating the location field) even when the user's device does not provide a latitude and longitude. To ensure we know the precise location of the user, we only consider tweets with the geo field.
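The filtering steps above can be sketched as follows. The field names ("source", "retweeted_status", "geo") follow the public Twitter API tweet payload; this is our illustrative reading of the filters, not the paper's exact implementation.

```python
# Sketch of the dataset filters: drop location-sharing apps, retweets,
# and tweets lacking a precise latitude/longitude.
LOCATION_APPS = ("foursquare", "swarm")

def keep_tweet(tweet):
    """Return True if a tweet (a dict in Twitter-API shape) passes all filters."""
    # Drop posts generated by location-sharing services (not written by users).
    source = tweet.get("source", "").lower()
    if any(app in source for app in LOCATION_APPS):
        return False
    # Drop retweets for the same reason.
    if "retweeted_status" in tweet:
        return False
    # Require a precise latitude/longitude, not just a tagged place.
    geo = tweet.get("geo")
    if not geo or "coordinates" not in geo:
        return False
    return True
```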
We matched each tweet to a city using the procedure of Han et al. (2014), with 3,709 cities derived from the geonames database. 2 Only 2,983 locations contained a tweet; locations without tweets were mostly in Africa and China, which have low Twitter usage. Following Han et al. (2014) we focus on English tweets only, removing non-English tweets based on the metadata language code. We also identified each tweet's country for a country prediction task (161 labels). We divided this dataset into two time periods. We use tweets from January 1, 2012 to March 30, 2015 for a standard train/dev/test evaluation, selecting 2/10,000 of the data for the development and test sets. Data from March 31, 2015 to September 30, 2015 forms an "out of time" sample.
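Han et al. (2014) cluster nearby geonames cities before matching; as a simplified sketch of the matching step only, the snippet below assigns a coordinate to its nearest city center by great-circle distance (the actual procedure additionally merges small cities into larger ones).

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_city(lat, lon, cities):
    """cities: list of (name, lat, lon) tuples; return the closest city's name."""
    return min(cities, key=lambda c: haversine_km(lat, lon, c[1], c[2]))[0]
```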
The most common cities were Los Angeles, London, Jakarta, Chicago, Kuala Lumpur and Dallas.
The city clustering procedure of Han et al. (2014) greatly influences this list. For example, Los Angeles ends up as one large city, whereas the New York City area is divided into several smaller cities.

Geolocation Model
We treat geolocation as a multi-class task, with each city (or country) a label (Jurgens et al., 2015).
Features All of our features are extracted from a single tweet (text or metadata) without requiring additional queries to the Twitter API. 3 These include:
Text: unigrams and bigrams extracted from the text of each tweet after tokenizing with Twokenizer. We removed all punctuation, replaced usernames and URLs with placeholder tokens, and replaced numbers with a NUM token.
Profile location: unigrams and bigrams extracted from the user-supplied profile location field, as well as a feature for the entire location string. This field often provides clues to the user's location, e.g. "New York Living".
Timezone: the tweet's timezone, which reflects a specific location, e.g. "Pacific Time (US & Canada)", "Atlantic Time (Canada)", "Casablanca". We also include the UTC offset of the timezone.
Time: a feature indicating the hour of the day (in UTC) at which the tweet was posted.
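A minimal sketch of this feature extraction, using simple regexes in place of Twokenizer; the function name, field prefixes, and regex details are our assumptions:

```python
import re

def extract_features(text, profile_location, timezone, utc_offset, hour):
    """Build the sparse string-feature set for one tweet (simplified sketch)."""
    def tokenize(s):
        s = re.sub(r"@\w+", "USER", s)          # replace usernames with a placeholder
        s = re.sub(r"https?://\S+", "URL", s)   # replace URLs with a placeholder
        s = re.sub(r"\b\d+\b", "NUM", s)        # replace numbers with a NUM token
        s = re.sub(r"[^\w\s]", " ", s)          # strip punctuation
        return s.lower().split()

    feats = []
    for field, toks in (("text", tokenize(text)),
                        ("loc", tokenize(profile_location))):
        feats += [f"{field}:{t}" for t in toks]                        # unigrams
        feats += [f"{field}:{a}_{b}" for a, b in zip(toks, toks[1:])]  # bigrams
    feats.append(f"loc_full:{profile_location.lower()}")  # whole location string
    feats.append(f"tz:{timezone}")
    feats.append(f"utc_offset:{utc_offset}")
    feats.append(f"hour:{hour}")
    return feats
```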
Learning We used vowpal wabbit (version 8.1.1) (Agarwal et al., 2011), a linear classifier trained using stochastic gradient descent with adaptive, individual learning rates (Duchi et al., 2011) that minimizes the hinge loss. We used feature hashing with a 31-bit feature space. We selected the best model and parameters based on initial tests using development data. All other parameters used default settings.
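Our actual learner is vowpal wabbit's hinge-loss SGD over a 31-bit hashed feature space. To illustrate just the mechanics of feature hashing plus a multi-class linear model, here is a self-contained sketch that substitutes a simple perceptron update for VW's adaptive SGD; the class and constants (`HashedPerceptron`, `BITS`) are ours, and a smaller hash space is used to keep the sketch light.

```python
from collections import defaultdict

BITS = 20                 # the paper uses 31 bits; smaller here for illustration
MASK = (1 << BITS) - 1

def hash_features(feats):
    """Map string features into a fixed-size hash space (feature hashing)."""
    return [hash(f) & MASK for f in feats]

class HashedPerceptron:
    """Multi-class linear model over hashed features. A stand-in for VW's
    hinge-loss SGD: same model family, simpler (perceptron) update rule."""

    def __init__(self, labels):
        self.w = {y: defaultdict(float) for y in labels}  # per-label weights

    def score(self, idxs, y):
        return sum(self.w[y][i] for i in idxs)

    def predict(self, feats):
        idxs = hash_features(feats)
        return max(self.w, key=lambda y: self.score(idxs, y))

    def update(self, feats, gold):
        """One online step: adjust weights only when the prediction is wrong."""
        pred = self.predict(feats)
        if pred != gold:
            for i in hash_features(feats):
                self.w[gold][i] += 1.0
                self.w[pred][i] -= 1.0
```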
Baselines We include two baselines: (1) the majority predictor: always predicts the most popular label.
(2) alias matching: we create a list of aliases for each of the 2,983 cities from the geonames dataset, including the smaller cities clustered together by Han et al. (2014). We search each tweet and the user's profile location for these aliases, assigning a tweet with a matched alias to the corresponding city; unmatched tweets are assigned the majority label. When multiple cities match a tweet, we selected the correct one (if present) using oracle knowledge. About 90% of matches were in the profile location. This strategy is similar to that of prior work.
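A minimal sketch of this baseline, without the oracle tie-breaking; the function name and plain substring matching are our simplifications:

```python
def alias_match(text, profile_location, aliases, majority_label):
    """aliases: dict mapping a lowercase alias string to its city.
    Check the profile location first (where ~90% of matches occurred),
    then the tweet text; fall back to the majority-class city."""
    for source in (profile_location.lower(), text.lower()):
        for alias, city in aliases.items():
            if alias in source:
                return city
    return majority_label
```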
Duplicates A tweet may be duplicated in our dataset, appearing in both training and held out data, or appearing multiple times in held out data. We define duplicates as tweets with identical feature representations. We removed duplicates from dev and test splits, to ensure evaluation examples are unseen in training, yielding 22,966 dev and 23,240 test tweets.
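The deduplication step can be sketched as below, with feature tuples standing in for full feature representations (helper name and data shapes are ours):

```python
def dedupe(train_feats, held_out):
    """Drop held-out tweets whose feature representation appeared in training,
    and repeats within the held-out split itself."""
    seen = {tuple(f) for f in train_feats}
    kept = []
    for feats in held_out:
        key = tuple(feats)
        if key not in seen:
            kept.append(feats)
            seen.add(key)  # also removes later repeats within held-out data
    return kept
```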

Baseline Results
We begin by establishing the models' performance with a large training set, as measured on held-out evaluation data drawn from the same time period. Here we use a standard setting, with no online adaptation. We include results for city and country models trained with the tweet text features alone (content). These evaluations train with a sample of 25,822,353 tweets, similar to previous large-scale training for geolocation (Han et al., 2014). Table 2 shows our model beating both baselines, with the additional features generally improving over the content features alone. Interestingly, the improvements from adding features appear to be additive: the final model's accuracy is nearly the sum of the individual improvements from each feature set. On the non-deduplicated test set (25,941 tweets), accuracy was higher (city: 0.2920, country: 0.8777), but the trends from adding features remain unchanged. Our time feature, which captures a temporal prior over locations, provides only a small boost.
We consider the impact of training data size in Figure 1, including a model trained on 258,222,490 tweets, an order of magnitude larger than Han et al. (2014), which improves accuracy by roughly 3%. This figure provides guidance on how much data is necessary to do well on this task.
To summarize: our approach yields tweet-level geolocation accuracy similar to, or better than, state-of-the-art user-level geolocation. 4 We note that even with smaller datasets (tens of millions of training examples, which can be obtained from the Twitter streaming API), one can obtain a reasonable model.

Temporal Factors in Geolocation
We now consider factors that influence geolocation temporal accuracy using our largest city model (258M training tweets), which has an accuracy of 0.3302 on test data (0.3062 excluding duplicates).

Question 1: How do daily and weekly patterns impact geolocation accuracy?
Twitter traffic varies over the course of a day and a week. User behavior may change at different times, and different locations are active at different times. Figure 3 shows the number of tweets and test geolocation accuracy by hour of the day (b) and day of the week (c). The day of the week has a minor impact on geolocation accuracy; the standard deviation across the 7 days is 2.7% of the mean. Tweet volume is negatively correlated with accuracy (−0.435), i.e. more tweets may indicate more people tweeting from different locations, which makes the task harder. Notably, Monday is significantly harder, with an accuracy 1.5 standard deviations below the mean. However, the hour of the day has a much more significant impact on accuracy: some times of day are significantly easier or harder than average. The standard deviation is 6.8% of the mean, and tweet volume is strongly negatively correlated with accuracy (−0.647). Geolocation is easier when fewer locations are actively tweeting. This is most apparent during nighttime in the US, when there are far fewer tweets overall and many fewer active locations. In short, the accuracy of a geolocation system depends on when it is running.
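The analysis above amounts to bucketing test examples by hour and correlating hourly volume with hourly accuracy; a minimal sketch with our own helper names:

```python
from collections import defaultdict

def accuracy_by_hour(examples):
    """examples: iterable of (hour, correct: bool); return hour -> accuracy."""
    hits, total = defaultdict(int), defaultdict(int)
    for hour, correct in examples:
        total[hour] += 1
        hits[hour] += bool(correct)
    return {h: hits[h] / total[h] for h in total}

def pearson(xs, ys):
    """Pearson correlation, e.g. between hourly tweet volume and accuracy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)
```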
Question 2: How do changes over time impact a fixed geolocation model?
We now turn to our data sample taken after the training data: a 10% sample of 49,307,720 tweets from 2015/3/31 to 2015/9/30. 5 These tweets demonstrate the accuracy of a trained model deployed on new data over time.
Evaluating on these tweets (duplicates included), our model yields an accuracy of 0.2661, down from 0.3302, a 19% relative drop. Surprisingly, this is not a gradual change over time; the drop is quite rapid. The week immediately following the training period has an accuracy of 0.2884. Figure 4 shows the decline in accuracy over time. 6
New Users Since the model has only seen the users present in training, it has difficulty generalizing to new users. Far from a small percentage of the total, new users account for a significant number of tweets, at a rate that does not appear to be slowing.
Reposted Tweets Users often repost content, whether repeating a simple message (e.g. "feeling good!") or tweeting the same content to multiple users. Users are more likely to repost content shortly after it was first created, so the number of reposts declines over time. For example, while 8% of test tweets from the same time period as the training data are duplicates (they appear in the training data), only 3.8% of tweets in the six-month evaluation period are duplicates.
How much of an impact do these reposts have on accuracy? For test data from the same time period as training, performance dropped from 0.3302 to 0.3062, a fairly large difference. By comparison, removing reposts in the six-month evaluation period drops accuracy from 0.2661 to 0.2541, a more modest change. Reposts inflate geolocation accuracy, and their decrease as time progresses from training removes this inflation. 7

Question 3: Can periodic model updates maintain a trained geolocation system?
Our results so far are sobering: shortly after a static model is deployed, performance degrades to that of a model trained with two orders of magnitude less data (compare the drop in §6.2 with Figure 1). Increasing the amount of training data might be an option, but given our previous results on new users, this is unlikely to be sufficient. A simple method for addressing model degradation is to continuously update the model using online learning on new data as it becomes available. For example, we can continuously download a stream of (at least) 1% of geocoded tweets from the Twitter API to use as training data for updating a deployed system. What is the impact on a system's accuracy when it is updated on these geocoded tweets with SGD updates (§3)? Figure 4 shows the performance of our system in an online setting (dashed black line). This model updates on every 100th example (1% of all geocoded tweets) encountered in the six-month evaluation period. When we update the previously trained static model, we see a quick recovery to accuracy levels that meet or exceed those on the test set from the same time period as training (horizontal line).
Finally, we consider the case where a practitioner starts from scratch with no training data, but updates using just 1% of geocoded tweets. Can someone with no prior training data build an effective model? Encouragingly, within 20 days the new model (solid blue line) catches up to the previously trained static model (solid black line, "Existing model: no updates"). This is an extremely promising result: it suggests that practitioners without access to all geolocated data can produce geolocation models that approximate models trained on hundreds of millions of examples.
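The online-update scheme can be sketched as below; `model.update` is a hypothetical hook for one SGD step on a labeled tweet, and the 1% sampling corresponds to updating on every 100th streamed example:

```python
def online_update(model, stream, rate=100):
    """Update a deployed model on every `rate`-th geocoded tweet from the
    stream (rate=100 approximates the 1% public sample used above).
    `model.update(features, city)` is an assumed single-step SGD hook."""
    for i, (features, city) in enumerate(stream):
        if i % rate == 0:
            model.update(features, city)
    return model
```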

Conclusion
We have presented a tweet geolocation system that considers an order of magnitude more data than any prior work. Despite hundreds of millions of training examples, the resulting system is sensitive to the time at which a tweet was authored. Additionally, accuracy suffers when the system is deployed on data beyond the training period. We show that online updates can mitigate problems caused by concept drift. In short, sheer volume of data is not enough: geolocation models should adapt to new data. Encouragingly, starting from no training data and updating on just 1% of geocoded tweets, within 20 days we can obtain a model that catches up to a static model previously trained on hundreds of millions of tweets.