A Unified Neural Network Model for Geolocating Twitter Users

Locations of social media users are important to many applications such as rapid disaster response, targeted advertisement, and news recommendation. However, many users do not share their exact geographical coordinates due to reasons such as privacy concerns. The lack of explicit location information has motivated a growing body of research in recent years looking at different automatic ways of determining the user’s primary location. In this paper, we propose a unified user geolocation method which relies on a fusion of neural networks. Our joint model incorporates different types of available information including tweet text, user network, and metadata to predict users’ locations. Moreover, we utilize a bidirectional LSTM network augmented with an attention mechanism to identify the most location indicative words in textual content of tweets. The experiments demonstrate that our approach achieves state-of-the-art performance over two Twitter benchmark geolocation datasets. We also conduct an ablation study to evaluate the contribution of each type of information in user geolocation performance.


Introduction
Knowing physical locations involved in social media data helps us to understand what is happening in real life, to bridge the online and offline worlds, and to develop applications for supporting real-life demands. For example, we can monitor public health of residents (Cheng et al., 2010), recommend local events (Yuan et al., 2013) or attractive places (Noulas et al., 2012) to tourists, identify locations of emergency (Ao et al., 2014) or even disasters (Lingad et al., 2013), and summarize regional topics (Rakesh et al., 2013). Even though platforms such as Twitter allow users to geolocate their posts to reveal their locations either manually or with the help of GPS, it is reported that less than 1% of Twitter data has geo-coordinates provided (Jurgens, 2013). Moreover, location information on Twitter is far from being complete and accurate. For instance, self-declared home information in many user profiles is inaccurate or even invalid (Hecht et al., 2011). The lack of explicit location information in the majority of tweets has motivated a growing body of research in recent years looking at different automatic ways of determining the user's primary location (i.e., user geolocation) and/or, as a proxy for the former, the location from which tweets have been posted (Ajao et al., 2015).
Geolocation methods usually train a model on a small set of users whose locations are known (e.g., through GPS-based geotagging), and predict locations of other users using the resulting model. These models broadly fall into three categories: text-based (Eisenstein et al., 2010; Wing and Baldridge, 2011; Roller et al., 2012), network-based (Jurgens, 2013; Compton et al., 2014; Jurgens et al., 2015), and hybrid methods that combine text, user network, and metadata information (Rahimi et al., 2015b,a; Jayasinghe et al., 2016; Miura et al., 2016) with the aim of achieving state-of-the-art performance.
In this paper, we present a neural network-based system that we developed for user geolocation in Twitter. Our model combines different sources of information including tweet text, metadata, and user network. We employ a neural network model to generate a dense vector representation for each field and then use the concatenation of these representations as the feature for classification. Our main contributions can be summarized as follows:
1. We propose a unified user geolocation method that relies on a fusion of neural networks, incorporating different types of available information: tweet message, users' social relationships, and metadata fields embedded in tweets and profiles.
2. For modeling the tweet text (and textual metadata fields), we use bidirectional Long Short-Term Memory (LSTM) networks augmented with a context-aware attention mechanism (Yang et al., 2016), which helps to identify the most location indicative words.
3. Through the empirical studies on two standard Twitter datasets, we demonstrate that the proposed method outperforms other state-of-the-art approaches in addressing the problem of user geolocation.
4. We train an individual model for each information field, and analyze the contribution of each component in the geolocation process.
The rest of the paper is organized as follows. We review the related work in Section 2. Utilized data is described in Section 3. Section 4 explains the proposed approach. The experimental results are given in Section 5, and finally, we conclude the paper and outline possible future work in Section 6.
Related Work

Text-based Methods
Text-based methods utilize the geographical bias of language use in social media for geolocation. These methods have widely used probability distributions of words over locations. Maximum likelihood estimation approaches (Cheng et al., 2010, 2013) and language modeling approaches minimizing KL-divergence (Roller et al., 2012) have succeeded in predicting user locations using word distributions. Topic modeling approaches to extract latent topics with geographical regions (Eisenstein et al., 2010; Hong et al., 2012; Ahmed et al., 2013; Yuan et al., 2013) have also been explored considering word distributions.
Supervised learning methods with word features are also popular in text-based geoinference. Multinomial Naïve Bayes (Han et al., 2012, 2014; Wing and Baldridge, 2011), logistic regression (Wing and Baldridge, 2014; Han et al., 2014), hierarchical logistic regression (Wing and Baldridge, 2014), and a multi-layer neural network with stacked denoising autoencoders (Liu and Inkpen, 2015) have realized geolocation prediction from text. A semi-supervised learning approach has been proposed by Cha et al. (2015) using sparse coding and dictionary learning. Hulden et al. (2015) have used a kernel-based method to smooth linguistic features over very small grid sizes and consequently alleviate data sparseness. Chi et al. (2016) have employed Multinomial Naïve Bayes and focused on the use of textual features (i.e., location indicative words, GeoNames gazetteers, user mentions, and hashtags) for geolocation inference. More recently, Rahimi et al. (2017b) have proposed a neural network-based geolocation approach. They used the parameters of the hidden layer of the neural network as word and phrase embeddings, and performed a nearest neighbor search on a sample of city names and dialect terms.
While having good results, text-based approaches are often limited to those users who generate text that contains geographic references (Jurgens, 2013).

Network-based Methods
Network-based methods rely on the geospatial homophily of interactions (of several kinds) between users. An early work by Davis Jr et al. (2011) proposed an approach in which the location of a given user is inferred by simply taking the most frequently seen location among the user's social network. Jurgens (2013) has extended the idea of location inference as label propagation over some form of friendship graph by interpreting location labels spatially. Locations are then inferred using an iterative, multi-pass procedure. This method has been further extended by Compton et al. (2014) to take into account edge weights in the social network, and to limit the propagation of noisy locations. They weigh locations as a function of how many times users interacted there, hence favoring locations of friends with evidence of a close relationship. Jurgens et al. (2015) have released a framework for nine network-based geolocation methods targeting Twitter.
The main limitation of network-based models is that they completely fail to geolocate users who are not connected to geolocated components of the graph (i.e., isolated users).
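The simplest network-based idea described above, taking the most frequently seen location among a user's neighbors, can be illustrated with a minimal sketch. This is not the implementation of any cited work; the function name, the toy users, and the city labels are hypothetical. It also makes the isolated-user limitation concrete: with no located neighbor, no prediction is possible.

```python
from collections import Counter


def most_frequent_neighbor_location(user, neighbors, known_locations):
    """Return the most common known location among a user's neighbors, or None."""
    votes = Counter(
        known_locations[n] for n in neighbors.get(user, ()) if n in known_locations
    )
    if not votes:
        return None  # isolated user: no located neighbor can vote
    return votes.most_common(1)[0][0]


# Toy data (hypothetical users and cities).
neighbors = {"alice": ["bob", "carol", "dave"], "erin": []}
known_locations = {"bob": "Sydney", "carol": "Sydney", "dave": "Perth"}

print(most_frequent_neighbor_location("alice", neighbors, known_locations))  # Sydney
print(most_frequent_neighbor_location("erin", neighbors, known_locations))   # None
```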

Hybrid Methods
Several attempts have been made to combine different sources of information for geolocating social media users. Li et al. (2012) have proposed a geolocation method by integrating both friendship and content information in a probabilistic model. Rahimi et al. (2015b) showed that geolocation predictions from text can effectively be used as a back-off for disconnected users in a network-based approach. In another work by Rahimi et al. (2015a), a hybrid approach has been proposed by propagating information on a graph built from user mentions in Twitter messages, together with dongle nodes corresponding to the results of a text-based geolocation method. Ebrahimi et al. (2017, 2018b) have presented a hybrid approach by incorporating both text and network information, and shown that the filtering of highly mentioned users in the social graph can improve the geolocation performance. Rahimi et al. (2017b) have proposed a text geolocation method based on a neural network and incorporated it into their network-based approach (Rahimi et al., 2015a). Wang et al. (2017) have introduced a collective geographical embedding algorithm to embed multiple information sources into a low dimensional space, such that the distance in the embedding space reflects the physical distance in the real world.
Metadata such as location fields have also been used as effective clues to predict the user's location (Hecht et al., 2011). Different geoinference approaches have been proposed to consider text and metadata information simultaneously, such as a dynamically weighted ensemble method (Mahmud et al., 2012) and a stacking approach (Han et al., 2014). Jayasinghe et al. (2016) have proposed a cascade ensemble approach by combining text-based, metadata-based, and network-based geolocation methods. Additionally, their approach makes use of several dedicated services, such as GeoNames gazetteers, time zone to GeoName mappings, an IP country resolver, and customized scrapers for social media websites. Miura et al. (2016) have trained a neural network utilizing the fastText n-gram model (Joulin et al., 2016) on tweet text, user location, user description, and user timezone. They have utilized several mapping services using external resources, such as GeoNames and time zone boundaries, for feature preprocessing. This model has been further extended by Miura et al. (2017) to also consider user network information for geolocation. Thomas and Hennig (2017) have proposed a geolocation method that relies on the combination of individual neural networks trained on text and metadata fields. Ebrahimi et al. (2018a) have proposed a word embedding-based approach to predict the geographic proximity of connected users in the social graph based on their linguistic similarities. The calculated similarity scores have been used for weighting edges between users in the graph. Tweet content and metadata are also combined with an ensemble learning method to geolocate isolated users in the graph.

Data
We have used two benchmark Twitter geolocation datasets in our experiments:
• TWITTERUS is a dataset compiled by Roller et al. (2012), which contains 38M tweets from 450K users in the United States. Out of 450K users, 10K are reserved for the development set and another 10K for the test set. The ground truth location of each user is set to the user's first geotag in the dataset. To make city prediction possible in this dataset, we additionally assigned city centers to ground truth geotags using the city category of Han et al. (2012).
• WNUT is a user-level dataset from the geolocation prediction shared task of WNUT 2016 (Han et al., 2016). The dataset covers 13M tweets from 3362 cities worldwide, and consists of 1M training users, 10K development users, and 10K test users. The ground truth location of a user is decided by majority voting of the closest city center.
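The majority-voting rule used for the WNUT ground truth can be sketched as follows. This is only an illustration under stated assumptions: each geotag votes for its nearest city center by great-circle distance, and the most voted city becomes the user's label. The function names, the two city centers, and the approximate coordinates are hypothetical, and ties are broken by first occurrence, which the dataset description does not specify.

```python
import math
from collections import Counter


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def user_city_label(geotags, city_centers):
    """Each geotag votes for its closest city center; return the majority city."""
    votes = Counter(
        min(city_centers, key=lambda city: haversine_km(tag, city_centers[city]))
        for tag in geotags
    )
    return votes.most_common(1)[0][0]


# Hypothetical city centers with approximate coordinates, for illustration only.
city_centers = {"new_york": (40.71, -74.01), "boston": (42.36, -71.06)}
geotags = [(40.73, -73.99), (40.65, -73.95), (42.35, -71.05)]

print(user_city_label(geotags, city_centers))  # new_york
```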
Note that the metadata of a tweet includes not only the tweet message (text) but also a variety of information such as tweet publication time, and user account data such as location and timezone. The organizers have provided full metadata for the test sets but only the tweet IDs for training and development sets. We collect metadata for training/development tweets using the Twitter API.

The Proposed Approach
Figure 1 illustrates an overview of the proposed model for user geolocation. We make use of the following sources of information to train our model: 1) Tweet text; 2) User network; and 3) Metadata including user-declared location, user description, user name, timezone, user language, tweet creation time, user UTC offset, links (URL domains), and application source.
Each field is processed by a separate sub-network to generate a feature vector representation $R_j$. These feature vectors are then concatenated to build a final user representation $\hat{R}$, which is fed into a linear classification layer:

$$\hat{R} = R_1 \oplus R_2 \oplus \cdots \oplus R_N \tag{1}$$

$$r = W_r \hat{R} + b_r \tag{2}$$

where $\oplus$ denotes concatenation, $N$ is the number of features (11 in total), and $r \in \mathbb{R}^{R}$ is the hidden representation at the penultimate layer. $W_r$ is a weight matrix and $b_r$ is a bias vector. $r$ is fully connected to the output layer and activated by softmax to generate a probability distribution over the classes. We employ the cross-entropy loss as the objective function. Let $M$ be the number of examples (i.e., users) and $c$ be the number of classes (i.e., regions); then the cross-entropy loss is defined by:

$$\mathcal{L} = -\sum_{i=1}^{M} \sum_{j=1}^{c} y_i^{j} \log \tilde{y}_i^{j} \tag{3}$$

where $y_i$, $i = 1, ..., M$, is the ground-truth vector, $\tilde{y}_i$ is the predicted probability vector, and $\tilde{y}_i^{j}$ is the probability that user $i$ resides in region $j$. We minimize the objective function through Stochastic Gradient Descent (SGD) over shuffled mini-batches with Adam (Kingma and Ba, 2014).
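The fusion step described above (concatenate the per-field vectors, apply a linear layer, then a softmax output layer trained with cross-entropy) can be traced on toy numbers. The sketch below is a forward pass only, in plain Python, and is not the authors' implementation. The output-layer parameters `W_o` and `b_o`, the two toy field vectors, and all numeric values are assumptions made for illustration.

```python
import math


def concat(vectors):
    """Concatenate per-field feature vectors into one user representation."""
    return [x for vec in vectors for x in vec]


def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_prob):
    """Cross-entropy of one example given a one-hot ground-truth vector."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)


# Two toy field representations (the model described above uses 11 fields).
R_text, R_network = [0.5, -1.0], [2.0]
R_hat = concat([R_text, R_network])            # [0.5, -1.0, 2.0]

W_r, b_r = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]
r = linear(W_r, b_r, R_hat)                    # penultimate representation

W_o, b_o = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # hypothetical output layer
probs = softmax(linear(W_o, b_o, r))           # distribution over 2 regions
loss = cross_entropy([0.0, 1.0], probs)        # true region is index 1
```

In training, this per-example loss would be summed over all users and minimized with a gradient-based optimizer; the sketch stops at the loss value.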
We design several sub-networks to provide a vectorized representation for each raw field. For processing the tweet text, we utilize word embeddings (Mikolov et al., 2013) and bidirectional Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) augmented with a context-aware attention mechanism (Yang et al., 2016) (Section 4.1). We construct an @-mention graph as a representation of users' interactions, and utilize this graph to extract the user network. We then use an embedding layer with an attention mechanism to create the final user network representation (Section 4.2).
We divide metadata fields into two classes: textual and categorical. For representing textual metadata fields (i.e., location, description, user name, and timezone), we use word embeddings and bidirectional LSTM networks with an attention mechanism. We treat other metadata fields (language, tweet time, UTC offset, links, and source) as categorical features, and convert them to one-hot encodings which are then fed forward to a dense layer (Section 4.3). In the following subsections, we describe details of each component.

Text Component
Figure 2(a) demonstrates the architecture of the text sub-network. It takes the sequence of words in the tweet $T = \{w_1, w_2, ..., w_n\}$ as input. An embedding layer is used to project the words to a low-dimensional vector space $\mathbb{R}^{E}$, where $E$ is the size of the embedding layer. We initialize the weights of the embedding layer using our pretrained word embeddings (Section 5.1). The embeddings of tweet words are then forwarded to an LSTM layer. An LSTM takes as input the words of a tweet and produces the word annotations $H = (h_1, h_2, ..., h_n)$, where $h_i$ is the hidden state of the LSTM at time-step $i$, summarizing all the information of the sentence up to $w_i$. We use a bidirectional LSTM (BiLSTM) in order to get annotations for each word that summarize the information from both directions of the message. A bidirectional LSTM consists of a forward LSTM, $\overrightarrow{f}$, that reads the sentence from $w_1$ to $w_n$, and a backward LSTM, $\overleftarrow{f}$, that reads the sentence from $w_n$ to $w_1$. We obtain the final annotation for each word $w_i$ by concatenating the annotations from both directions:

$$h_i = \overrightarrow{h_i} \oplus \overleftarrow{h_i}, \quad h_i \in \mathbb{R}^{2L} \tag{4}$$

where $\oplus$ denotes the concatenation operation and $L$ the size of each LSTM.
In order to amplify the contribution of important words in the final representation, we use a context-aware attention mechanism (Yang et al., 2016) that aggregates all the intermediate hidden states using their relative importance. An attention mechanism assigns a weight $a_i$ to each word annotation, which reflects its importance. We compute the representation of the tweet text, $R_{text}$, as the weighted sum of all the word annotations using the attention weights. This attention mechanism introduces a context vector $u_h$ that helps to identify the informative words; it is randomly initialized and jointly learned with the rest of the attention layer weights. Formally, $R_{text}$ is defined as:

$$v_i = \tanh(W_h h_i + b_h) \tag{5}$$

$$a_i = \frac{\exp(v_i^{\top} u_h)}{\sum_{t=1}^{n} \exp(v_t^{\top} u_h)} \tag{6}$$

$$R_{text} = \sum_{i=1}^{n} a_i h_i \tag{7}$$

where $v_i$ is a hidden representation of the annotation $h_i$, and $W_h$, $b_h$ and $u_h$ are the layer's weights.
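To make the attention pooling concrete, the following minimal sketch computes attention weights and the weighted sum over a handful of precomputed word annotations. It follows the general form of Yang et al. (2016), scoring a tanh projection of each annotation against a context vector; the BiLSTM itself is not implemented, and the toy annotations and weight values are assumptions chosen for illustration.

```python
import math


def attention_pool(H, W_h, b_h, u_h):
    """Return (attention weights, weighted sum of annotations).

    H   : list of word annotations (each a list of floats)
    W_h : projection matrix as a list of rows; b_h: bias; u_h: context vector
    """
    scores = []
    for h in H:
        # Hidden representation of the annotation: tanh(W_h h + b_h).
        v = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(W_h, b_h)]
        # Similarity between the hidden representation and the context vector.
        scores.append(sum(vi * ui for vi, ui in zip(v, u_h)))
    # Softmax over the scores gives one weight per word.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the original annotations.
    pooled = [sum(a * h[d] for a, h in zip(weights, H)) for d in range(len(H[0]))]
    return weights, pooled


# Three toy annotations of size 2 (stand-ins for BiLSTM outputs).
H = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
W_h, b_h = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
u_h = [1.0, 0.0]  # this toy context vector favors the first dimension

weights, R_text = attention_pool(H, W_h, b_h, u_h)
```

With these toy values, the third annotation receives the largest weight and the second the smallest, since only the first dimension aligns with the context vector.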
We use batch normalization (Ioffe and Szegedy, 2015) for normalizing inputs in order to reduce internal covariate shift. The risk of overfitting by co-adapting units is reduced by implementing dropout (Srivastava et al., 2014) between individual neural network layers.

User Network Component
As a representation of users' social relationships, we construct an undirected graph from interactions among Twitter users based on @-mentions in their tweets (Rahimi et al., 2015b). In this graph, nodes are all users in the dataset (train and test), as well as other external users mentioned in their tweets, and undirected edges are created between two users if either user mentioned the other. This unidirectional setting, in which a mention in one direction is enough to create an edge, results in a large number of edges. To make the process more tractable, we remove all nodes corresponding to external users with degree less than 3 (i.e., external users who have been mentioned by fewer than 3 different users in the training set). Figure 2(b) illustrates an overview of the user network component. After filtering the graph, we consider the adjacent nodes (i.e., immediately linked users) of each training user as that user's network. The user network $N = \{u_1, u_2, ..., u_n\}$ is given as input to an embedding layer. The embedding of the user network, $E_N = (e_1, e_2, ..., e_n)$, is then fed to an attention layer to compute the final representation of the user network, $R_{network}$:

$$R_{network} = \sum_{i=1}^{n} a_i e_i \tag{8}$$

where $a_i$ is the weight assigned to embedding $e_i$ by the attention mechanism (Equation 6).
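The graph construction and filtering described above can be sketched in a few lines. This is a simplified illustration rather than the authors' pipeline: the function name and the toy tweets are hypothetical, handles are assumed to be already normalized, and the degree of an external user is counted over all dataset users passed in rather than over the training set only.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")


def build_mention_graph(tweets_by_user, min_external_degree=3):
    """Build an undirected @-mention graph and return each dataset user's neighbors.

    External users (mentioned, but without tweets in the dataset) are dropped
    when fewer than `min_external_degree` distinct users mention them.
    """
    dataset_users = set(tweets_by_user)
    adjacency = defaultdict(set)
    for user, tweets in tweets_by_user.items():
        for text in tweets:
            for mentioned in MENTION.findall(text):
                if mentioned != user:  # ignore self-mentions
                    adjacency[user].add(mentioned)
                    adjacency[mentioned].add(user)
    rare_external = [
        u for u in adjacency
        if u not in dataset_users and len(adjacency[u]) < min_external_degree
    ]
    for u in rare_external:
        for v in adjacency.pop(u):
            adjacency[v].discard(u)
    return {u: sorted(adjacency[u]) for u in dataset_users}


# Toy data: x is mentioned by three users and is kept; y by one and is removed.
tweets_by_user = {
    "a": ["hi @b and @x"],
    "b": ["@x hello"],
    "c": ["@x @y"],
}
graph = build_mention_graph(tweets_by_user)
```

The returned neighbor lists play the role of the user network that is passed to the embedding layer.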

Metadata Component
According to Guo and Berkhahn (2016), the embeddings of categorical variables can reduce the network size while capturing the intrinsic properties of the categorical variables. Hence, we convert metadata fields with a finite set of elements (UTC offset, links, user language, tweet publication time, and application source) to one-hot encodings, which are then forwarded to a dense layer with the Rectified Linear Unit (ReLU) activation function. The user UTC offset is an integer number of seconds (e.g., −18000), and the tweet publication time is given in UTC time, e.g., Fri Mar 02 12:19:40 +0000 2012. We convert the user UTC offset into hours (e.g., −18000/3600 = −5). For tweet publication time, we use only the time-of-day information (e.g., 12:19) and split it into multiple bins. Specifically, we interpret every 10 minutes as a bin (144 bins in total). The intuition is that tweets originating from a particular location (e.g., Germany) favor certain bins, and this preference of bins should be different from that of tweets from a distant location (e.g., Japan) (Lau et al., 2017).
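The two conversions above reduce to simple arithmetic, shown here as a minimal sketch using the example values from the text. The function names are hypothetical, and the timestamp format string is an assumption based on the example timestamp given above.

```python
from datetime import datetime


def utc_offset_hours(offset_seconds):
    """Convert a UTC offset in seconds to hours (may be fractional, e.g. 5.5)."""
    return offset_seconds / 3600


def time_bin(created_at, minutes_per_bin=10):
    """Map a tweet timestamp to its time-of-day bin (0..143 for 10-minute bins)."""
    t = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return (t.hour * 60 + t.minute) // minutes_per_bin


def one_hot(index, size):
    """One-hot encode a categorical value given its index."""
    vec = [0] * size
    vec[index] = 1
    return vec


print(utc_offset_hours(-18000))                    # -5.0
print(time_bin("Fri Mar 02 12:19:40 +0000 2012"))  # 73
encoded_time = one_hot(time_bin("Fri Mar 02 12:19:40 +0000 2012"), 144)
```

Here 12:19 falls 739 minutes after midnight, so it lands in bin 739 // 10 = 73 of the 144 bins.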
For metadata fields containing text (i.e., user description, user location, user name, and timezone), we use an embedding layer and subsequently forward the results to an LSTM layer. The attention mechanism is also employed to provide the final representation of textual metadata fields. Again, batch normalization and dropout are applied between individual layers to avoid overfitting.

Experiment Settings
In the text sub-network, words are input to the model as n-dimensional word embeddings. We pre-trained word embeddings using word2vec (Mikolov et al., 2013) over tweet text of the full training data. The model was trained using the Skip-gram architecture and negative sampling (k = 5) for five iterations, with a context window of 5 and subsampling factor of 0.001. Note that words must occur at least five times in the corpus to be part of the vocabulary. We chose word embeddings of size 200/300 for the TWITTERUS/WNUT datasets because smaller embeddings were found experimentally to capture less detail and resulted in lower accuracy. Larger word embeddings, on the other hand, made the model too complex to train. In the preprocessing step, we used replacement tokens for URLs, mentions, and numbers. However, we did not replace hashtags, as doing so was found experimentally to decrease the accuracy.
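The preprocessing step can be sketched with a few regular expressions. The exact patterns and replacement tokens used in the experiments are not given in the text, so the token strings (`<url>`, `<mention>`, `<number>`), the patterns, and the whitespace tokenization below are all assumptions. Hashtags are deliberately left untouched, including digits inside them.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
# Standalone numbers only: the lookbehind skips digits inside hashtags or words.
NUMBER = re.compile(r"(?<![#\w])\d+(?:[.,]\d+)*\b")


def normalize_tweet(text):
    """Replace URLs, mentions, and numbers with placeholder tokens, then tokenize."""
    text = URL.sub("<url>", text)
    text = MENTION.sub("<mention>", text)
    text = NUMBER.sub("<number>", text)
    return text.split()


tokens = normalize_tweet("Hi @Bob 3 pics at https://t.co/abc #Sydney #route66")
print(tokens)
# ['Hi', '<mention>', '<number>', 'pics', 'at', '<url>', '#Sydney', '#route66']
```

Token sequences of this kind would then serve as the training corpus for the word embeddings.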
The layers and the embeddings in our sub-networks have parameters such as embedding dimension, LSTM unit size, and attention context vector size. We chose optimal values for these parameters in terms of accuracy with a grid search using the development sets of TWITTERUS and WNUT. The selected parameter values are reported in Table 1. It should be noted that the main reason for selecting smaller values for the TWITTERUS dataset is its larger size (in terms of number of tweets) compared to the WNUT dataset. We set the hyper-parameters of our final model as follows: batch size = 256, learning rate = 0.001, epochs = 5. The dropout rate between layers is set to 0.2.

Evaluation Metrics
We evaluate our approach using the following three commonly used metrics for user geolocation:
• Acc@161: The percentage of predicted locations which are within a 161km (100 mile) radius of the actual location (Cheng et al., 2010). This metric is a proxy for accuracy within a metro area.
• Mean error: The mean value of error distances in predicted locations (Eisenstein et al., 2010).
• Median error: The median value of error distances in predictions (Eisenstein et al., 2010).
Note that higher numbers are better for Acc@161 but lower numbers are better for mean and median errors. Table 2 presents the performance of user geolocation methods over the TWITTERUS and WNUT datasets.
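The three metrics can be computed directly from predicted and actual coordinates, as in the minimal sketch below. The great-circle (haversine) distance with an Earth radius of 6371 km is an assumption; the text does not state which distance formula was used. The function names and the toy coordinates are hypothetical.

```python
import math
from statistics import mean, median


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def evaluate(predicted, actual, radius_km=161.0):
    """Return Acc@161 (in percent), mean error, and median error (in km)."""
    errors = [haversine_km(p, t) for p, t in zip(predicted, actual)]
    within = sum(e <= radius_km for e in errors)
    return {
        "acc@161": 100.0 * within / len(errors),
        "mean_km": mean(errors),
        "median_km": median(errors),
    }


# Toy example: errors of 0 km, about 111 km, and about 1112 km.
predicted = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
actual = [(0.0, 0.0), (0.0, 1.0), (0.0, 10.0)]
results = evaluate(predicted, actual)
```

In this toy example a single large error pulls the mean well above the median, which is why both statistics are reported alongside Acc@161.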

Results
The results show that our proposed method achieves the best performance in terms of all evaluation metrics. The main reason is the effective representation of text, metadata, and network information, and unifying them through a fusion of neural networks.

Ablation Study
To evaluate the contribution of each component in indicating the user's location, we train an individual neural network model for each field. To this end, we feed the final representation of each sub-network to a fully-connected dense layer, activated by the softmax function. We use stochastic gradient descent over shuffled mini-batches with Adam (Kingma and Ba, 2014) and cross-entropy loss as the objective function for classification. The parameters of all models are set as follows: batch size = 256, epochs = 5, dropout = 0.2, and learning rate = 0.001. Table 3 shows the performance breakdown for each model over the WNUT dataset.
The results indicate that the user-declared location in tweet metadata is the most informative field for geolocating users, and the model trained on this source achieves the best single-source performance. This model can correctly geolocate 44.9% of users with a median error of 41.0km.
Using only tweet text, our model can predict the correct location for 34.9% of all users with a median error of 169.3km. It is noteworthy that this model outperforms the text-based approach IBM.1 (Chi et al., 2016) in terms of all metrics by a large margin.
The user network model can correctly geolocate only 18.4% of users. However, our experiments show that excluding user network information reduces the accuracy of the final model by 5.1%. Models using other metadata fields provide an accuracy between 2.7% and 10.6%, with the description field being the most informative one. Tweet publication time, on the other hand, has the minimum accuracy in predicting the user's location. However, mining the temporal patterns of users' posting habits can potentially provide useful information for geolocation inference.
We have also reported the results of our model when it takes only the metadata fields as inputs. The metadata-based model can correctly geolocate 46.1% of the users with a mean error of 1318.3km, and a median error of 37.9km. It shows the effectiveness of the utilized metadata fields for user geolocation. Meanwhile, a deeper analysis of metadata fields can further improve the performance of user location prediction. As an example, customized scrapers for social media websites like FourSquare, Swarm, Path, Facebook, and Instagram can be employed as described by Jayasinghe et al. (2016) to increase the geolocation accuracy.

Table 2: [...], and Hybrid (Hyb) geolocation methods over TWITTERUS and WNUT datasets ("-" signifies that no results were published for the given dataset, and "?" signifies that the participant team has not provided descriptions of the proposed system). We have also reported the Accuracy of our proposed approach on the WNUT dataset to make our results comparable with the existing methods.

Error Analysis
As reported in Table 2, our proposed approach achieves quite low median errors over the TWITTERUS and WNUT datasets (i.e., 40.1km and 0km, respectively). However, there are some cases with large error distances, which make the mean errors much larger than median errors. Our analysis shows that some notable error distances are related to the following cases: (1) Users from remote areas for which less supervision is available; (2) Users from small cities/states who are misclassified as being in the neighboring larger cities/states; (3) Users from some neighboring cities/states who are misclassified between the two cities/states, which might be the result of business and entertainment connections between them.
Our ablation study demonstrates that the location field highly contributes to the geolocation performance. However, some prediction errors arise when location fields are incorrect. We found two main cases that result in incorrect location fields: (1) Users who move to a new place (i.e., house) but do not update their locations; (2) Users who visit a new place (e.g., as tourists) and temporarily update their locations. Our proposed model cannot handle these types of errors, since it only supports a single location field. A future direction is to extend the current architecture to track location changes and deal with temporal states such as traveling.
Previous network-based methods (Jurgens, 2013; Compton et al., 2014) have demonstrated the effectiveness of users' social relationships for geolocation inference. However, our ablation study shows relatively low accuracy for the user network component. One main reason is that our model is less sophisticated (but more scalable) compared to the mentioned network-based methods, since it only considers the immediately connected nodes as the network for each user. As future work, node/graph embeddings such as DeepWalk (Perozzi et al., 2014) can be employed to provide a better representation of users' social relationships, and consequently improve the accuracy of the network component.

Conclusion and Future Work
In this paper, we have proposed a unified user geolocation method which relies on a fusion of neural networks. Our joint model effectively utilizes different sources of information including tweet message, users' social relationships, and metadata fields embedded in tweets and profiles. In particular, we employed a neural network model to generate a dense vector representation for each information field and then used the concatenation of these representations as the feature for classification. For modeling tweet message and textual metadata fields, we utilized a bidirectional LSTM network augmented with an attention mechanism to identify the most location indicative words.
We have conducted comprehensive experiments on two standard Twitter geolocation datasets, and demonstrated that our method achieves the best performance in terms of all three evaluation metrics. In an ablation study, we have also trained individual models to investigate the usefulness of each information field in predicting the locations of Twitter users.
As future work, it would be intriguing to utilize customized scrapers for social media websites (Jayasinghe et al., 2016) to further improve the performance of our geolocation model. It is noteworthy that the proposed model could be modified to infer other user demographic attributes such as gender and age.
Tweet publication time includes both date and time; however, only time information is exploited in this work to infer users' geolocations. A future direction is to leverage tweeting behavior over dates for user geolocation. The intuition is that local residents would occasionally post tweets about their home city over a long period, while tourists tend to tweet a lot while visiting the city. Hence, their different tweeting patterns could be revealed using date information from their tweet timestamps.