Unifying Text, Metadata, and User Network Representations with a Neural Network for Geolocation Prediction

We propose a novel geolocation prediction model using a complex neural network. Geolocation prediction in social media has attracted many researchers to use information of various types. Our model unifies text, metadata, and user network representations with an attention mechanism to overcome previous ensemble approaches. In an evaluation using two open datasets, the proposed model exhibited a maximum 3.8% increase in accuracy and a maximum of 6.6% increase in accuracy@161 against previous models. We further analyzed several intermediate layers of our model, which revealed that their states capture some statistical characteristics of the datasets.


Introduction
Social media sites have become a popular source of information to analyze current opinions of numerous people. Many researchers have worked to realize various automated analytical methods for social media because manual analysis of such vast amounts of data is difficult. Geolocation prediction is one such analytical method that has been studied widely to predict a user location or a document location. Location information is crucially important information for analyses such as disaster analysis (Sakaki et al., 2010), disease analysis (Culotta, 2010), and political analysis (Tumasjan et al., 2010). Such information is also useful for analyses such as sentiment analysis (Martínez-Cámara et al., 2014) and user attribute analysis (Rao et al., 2010) to undertake detailed region-specific analyses.
Among these sources, Twitter is often preferred because of its characteristics, which are suited for geolocation prediction. First, some tweets include geotags, which are useful as ground truth locations. Secondly, tweets include metadata such as timezones and self-declared locations that can facilitate geolocation prediction. Thirdly, a user network is obtainable by consideration of the interaction between two users as a network link.
Herein, we propose a neural network model to tackle geolocation prediction in Twitter. Past studies have combined text, metadata, and user network information with ensemble approaches (Han et al., 2013, 2014; Rahimi et al., 2015a; Jayasinghe et al., 2016) to achieve state-of-the-art performance. Our model combines text, metadata, and user network information using a complex neural network. Neural networks have recently shown effectiveness to capture complex representations combining simpler representations from large-scale datasets (Goodfellow et al., 2016). We intend to obtain unified text, metadata, and user network representations with an attention mechanism that is superior to the earlier ensemble approaches. The contributions of this paper are the following:
1. We propose a neural network model that learns unified text, metadata, and user network representations with an attention mechanism.
2. We show that the proposed model outperforms the previous ensemble approaches in two open datasets.
3. We analyze some components of the proposed model to gain insight into the unification processes of the model.
Our model specifically emphasizes geolocation prediction in Twitter to use benefits derived from the characteristics described above. However, our model can be readily extended to other social media analyses such as user attribute analysis and political analysis, which can benefit from metadata and user network information.
In the subsequent sections of this paper, we explain related works from four perspectives in Section 2. The proposed neural network model is described in Section 3, and the two open datasets that we used for evaluations are described in Section 4. Details of the evaluation are reported in Section 5 with discussions in Section 6. Finally, Section 7 concludes the paper with some future directions.

2 Related Works
2.1 Text-based Approach
Probability distributions of words over locations have been used to estimate the geolocations of users. Maximum likelihood estimation approaches (Cheng et al., 2010) and language modeling approaches minimizing KL-divergence (Wing and Baldridge, 2011; Kinsella et al., 2011; Roller et al., 2012) have succeeded in predicting user locations using word distributions. Topic modeling approaches to extract latent topics with geographical regions (Eisenstein et al., 2010, 2011; Hong et al., 2012; Ahmed et al., 2013) have also been explored considering word distributions.
Supervised machine learning methods with word features are also popular in text-based geolocation prediction. Multinomial Naive Bayes (Han et al., 2012, 2014; Wing and Baldridge, 2011), logistic regression (Wing and Baldridge, 2014; Han et al., 2014), hierarchical logistic regression (Wing and Baldridge, 2014), and a multilayer neural network with stacked denoising autoencoder (Liu and Inkpen, 2015) have realized geolocation prediction from text. A semi-supervised machine learning approach by Cha et al. (2015) has also been developed using sparse coding and dictionary learning.

User-network-based Approach
Social media often include interactions of several kinds among users. These interactions can be regarded as links that form a network among users. Several studies have used such user network information to predict geolocation. Backstrom et al. (2010) introduced a probabilistic model to predict the location of a user using friendship information in Facebook. Friend and follower information in Twitter was used to predict user locations with a most frequent friend algorithm (Davis Jr. et al., 2011), a unified descriptive model (Li et al., 2012b), location-based generative models (Li et al., 2012a), dynamic Bayesian networks (Sadilek et al., 2012), a support vector machine (Rout et al., 2013), and maximum likelihood estimation (McGee et al., 2013). Mention information in Twitter is also used with label propagation models (Jurgens, 2013; Compton et al., 2014) and an energy and social local coefficient model (Kong et al., 2014). Jurgens et al. (2015) compared nine user-network-based approaches targeting Twitter, controlling data conditions.

Metadata-based Approach
Metadata such as location fields provide effective clues for predicting geolocation. Hecht et al. (2011) reported that decent accuracy of geolocation prediction can be achieved using location fields. Approaches that combine metadata with texts have also been proposed to extend text-based approaches. Combinatory approaches such as a dynamically weighted ensemble method (Mahmud et al., 2012), polygon stacking (Schulz et al., 2013), stacking (Han et al., 2013, 2014), and average pooling with a neural network (Miura et al., 2016) have strengthened geolocation prediction.

Combinatory Approach Extending User-network-based Approach
Several attempts have been made to combine user-network-based approaches with other approaches. A text-based approach with logistic regression was combined with label propagation approaches to enhance geolocation prediction (Rahimi et al., 2015a,b, 2016). Jayasinghe et al. (2016) combined nine components including text-based approaches, metadata-based approaches, and a user-network-based approach with a cascade ensemble method.

Comparisons with Proposed Model
The model we propose in Section 3, which combines text, metadata, and user network information with a neural network, can be regarded as an alternative to approaches using text and metadata (Mahmud et al., 2012; Schulz et al., 2013; Han et al., 2013, 2014; Miura et al., 2016), approaches with text and user network information (Rahimi et al., 2015a,b), and an approach with text, metadata, and user network information (Jayasinghe et al., 2016). In Section 5, we demonstrate that our model outperforms earlier models. In terms of machine learning methods, our model is a neural network model that shares some similarity with previous neural network models (Liu and Inkpen, 2015; Miura et al., 2016). Our model differs from these previous models in two key respects. First, our model integrates user network information along with other information. Second, our model combines text and metadata with an attention mechanism.

Proposed Model
Figure 1 presents an overview of our model: a complex neural network for classification with a city as a label. For each user, the model accepts inputs of messages, a location field, a description field, a timezone, linked users, and the cities of linked users.
User network information is incorporated by city embeddings and user embeddings of linked users. User embeddings are introduced along with city embeddings because linked users with city information are limited. We chose to let the model learn geolocation representations of linked users directly via user embeddings. The model can be broken down into several components, details of which are described in Sections 3.1.1-3.1.4.

Text Component
We describe the text component of the model, which is the "TEXT" section in Figure 1. As an implementation of RNN, we used Gated Recurrent Unit (GRU) with a bidirectional setting. In the RNN layer, word embeddings x of a message are processed with the GRU transition functions, where z_t is an update gate, r_t is a reset gate, h̃_t is a candidate state, and h_t is a state. The attention layer Attention M computes a message representation m as a weighted sum of the RNN outputs g_t with weight α_t, where v_α is a weight vector, W_α is a weight matrix, and b_α is a bias vector. u_t is an attention context vector calculated from g_t with a single fully connected layer (Eq. 7). u_t is normalized with softmax to obtain α_t as a probability (Eq. 6). The message representation m is passed to the second attention layer Attention TL to obtain a timeline representation from message representations.
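For reference, the transition functions and the attention equations described above can be sketched as follows. This sketch assumes the standard GRU formulation: the weight matrices W and U, the bias vectors b, the gate convention in the state update, the tanh in the last equation, and the reading of g_t as the concatenation of the forward and backward GRU states at position t are all assumptions. Only the symbols z_t, r_t, h̃_t, h_t, g_t, α_t, u_t, v_α, W_α, b_α and the roles of Eq. 5-7 come from the description above.

```latex
\begin{align}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
m &= \sum_t \alpha_t g_t \\
\alpha_t &= \frac{\exp(v_\alpha^{\top} u_t)}{\sum_l \exp(v_\alpha^{\top} u_l)} \\
u_t &= \tanh(W_\alpha g_t + b_\alpha)
\end{align}
```

Here σ denotes the logistic sigmoid and ⊙ element-wise multiplication. Numbered in order, the last three lines correspond to Eq. 5, Eq. 6, and Eq. 7.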

Text and Metadata Component
We describe the text and metadata component of the model, which is the "TEXT&META" section in Figure 1. This component considers the following three types of metadata along with text: location, a text field in which a user is allowed to write the user location freely; description, a text field a user can use for self-description; and timezone, a selective field from which a user can choose a timezone. Note that certain percentages of these fields are not available, and unknown tokens are used for inputs in such cases. We process location fields and description fields similarly to messages using an RNN layer and an attention layer. Because there is only one location and one description per user, a second attention layer is not required, unlike in the text component. We also chose to share word embeddings among the messages, the location, and the description processes because these inputs are all textual information. For the timezone, an embedding is assigned for each timezone value. A processed timeline representation, a location representation, and a description representation are then passed to the attention layer Attention U with a timezone representation. Attention U combines these four representations and outputs a user representation. This combination is done as in Attention TL with the four representations as g_1, ..., g_4 in Eq. 5.
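The attention-based combination of several representations into one can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `attention_pool`, the tanh nonlinearity, and the toy two-dimensional vectors are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def attention_pool(
    g: Sequence[Vector], W: Sequence[Vector], b: Vector, v: Vector
) -> Tuple[Vector, List[float]]:
    """Combine the input vectors g into one vector with attention weights.

    Assumed form: u_t = tanh(W g_t + b), alpha = softmax(v . u_t),
    pooled = sum_t alpha_t * g_t.
    """
    scores = []
    for g_t in g:
        # Single fully connected layer producing the attention context u_t.
        u_t = [
            math.tanh(sum(w * x for w, x in zip(row, g_t)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * u_i for v_i, u_i in zip(v, u_t)))

    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Weighted sum of the inputs.
    pooled = [
        sum(a * g_t[d] for a, g_t in zip(alpha, g)) for d in range(len(g[0]))
    ]
    return pooled, alpha


# Toy inputs standing in for the timeline, location, description,
# and timezone representations (hypothetical values).
representations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
user_repr, weights = attention_pool(representations, identity, [0.0, 0.0], [5.0, 0.0])
```

The returned weights are the attention probabilities, which always sum to one across the inputs.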

User Network Component
We describe the user network component of the model, which is the "USERNET" section in Figure 1. Figure 3 presents an overview of the user network component. The model has two inputs: linked cities and linked users. Users connected with a user network are extracted as linked users. We treat their cities as linked cities. Linked cities and linked users are assigned city embeddings c and user embeddings a, respectively. c and a are then processed to output p = c ⊕ a, where ⊕ is an element-wise addition operator. p is then passed to the subsequent attention layer Attention N to obtain a user network representation as in Attention U.

Table 1: Some properties of TwitterUS (train) and W-NUT (train). We were able to obtain approximately 70-78% of the full datasets because of accessibility changes in Twitter.

Model Output
An output of the text and metadata component and an output of the user network component are further passed to the final attention layer Attention UN to obtain a merged user representation as in Attention U. The merged user representation is then connected to labels with a fully connected layer FC UN.

TwitterUS The first dataset we used is TwitterUS, compiled by Roller et al. (2012), which consists of 429K training users, 10K development users, and 10K test users in a North American region. The ground truth location of a user is set to the first geotag of the user in the dataset. We collected TwitterUS tweets using TwitterAPI to reconstruct TwitterUS to obtain metadata along with text. Up-to-date versions as of November-December 2016 were used for the metadata. We additionally assigned city centers to ground truth geotags using the city category of Han et al. (2012) to make city prediction possible in this dataset. TwitterUS (train) in Table 1 presents some properties related to the TwitterUS training set.

Sub-models of the Proposed
W-NUT The second dataset we used is W-NUT, a user-level dataset of the geolocation prediction shared task of W-NUT 2016 (Han et al., 2016). The dataset consists of 1M training users, 10K development users, and 10K test users. The ground truth location of a user is decided by majority voting of the closest city center. Like in TwitterUS, we obtained metadata and texts using TwitterAPI. Up to date versions in August-September 2016 were used for the metadata. W-NUT (train) in Table 1 presents some properties related to the W-NUT training set.

Construction of the User Network
We construct mention networks (Jurgens, 2013; Compton et al., 2014; Rahimi et al., 2015a,b) from datasets as user networks. To do so, we follow the approach of Rahimi et al. (2015a) and Rahimi et al. (2015b), who use uni-directional mentions to set edges of a mention network. An edge is set between two user nodes if a user mentions another user. The number of uni-directional mention edges for TwitterUS and W-NUT can be found in Table 1. The uni-directional setting results in large numbers of edges, which are often computationally expensive to process. We restricted edges to satisfy one of the following conditions to reduce the size: (1) both users have ground truth locations or (2) one user has a ground truth location and another user is mentioned 5 times or more in a training set. The number of reduced edges with these conditions in TwitterUS and W-NUT can be confirmed in Table 1.
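The edge restriction described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name `build_reduced_edges`, the representation of mentions as (source, target) pairs, the counting of every mention occurrence, the skipping of self-mentions, and the storage of edges as unordered pairs are all assumptions.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def build_reduced_edges(
    mentions: Iterable[Tuple[str, str]],
    located: Set[str],
    min_mentions: int = 5,
) -> Set[Tuple[str, str]]:
    """Keep mention edges that satisfy condition (1) or (2).

    mentions: (source, target) pairs, one per mention occurrence.
    located:  users that have a ground truth location.
    """
    mentions = list(mentions)
    # How many times each user is mentioned in the training set.
    mentioned = Counter(target for _, target in mentions)

    edges: Set[Tuple[str, str]] = set()
    for source, target in mentions:
        if source == target:
            continue  # assumption: self-mentions carry no network signal
        source_known = source in located
        target_known = target in located
        # (1) both users have ground truth locations
        both_known = source_known and target_known
        # (2) one user is located and the other is mentioned often enough
        one_known = (source_known and mentioned[target] >= min_mentions) or (
            target_known and mentioned[source] >= min_mentions
        )
        if both_known or one_known:
            # Store the edge as an unordered pair of user names.
            edges.add((min(source, target), max(source, target)))
    return edges
```

Counting mentions once up front keeps the filter linear in the number of mention occurrences.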

Evaluation
5.1 Implemented Baselines
5.1.1 LR
LR is an l1-regularized logistic regression model with k-d tree regions (Roller et al., 2012) used in Rahimi et al. (2015a). The model uses tf-idf weighted bag-of-words unigrams for features. This model is simple, but it has shown state-of-the-art performance in cases when only text is available.

MADCEL-B-LR
MADCEL-B-LR, a model presented by Rahimi et al. (2015a), combines LR with Modified Adsorption (MAD) (Talukdar and Crammer, 2009). MAD is a graph-based label propagation algorithm that optimizes an objective with a prior term, a smoothness term, and an uninformativeness term. LR is combined with MAD by introducing LR results as dongle nodes to MAD.
This model includes an algorithm for the construction of a mention network. The algorithm removes celebrity users and collapses a mention network. We use binary edges for user network edges because they performed slightly better than weighted edges by the accuracy@161 metric in Rahimi et al. (2015a).

LR-STACK
LR-STACK is an ensemble learning model that combines four LR classifiers (LR-MSG, LR-LOC, LR-DESC, LR-TZ) with an l2-regularized logistic regression meta-classifier (LR-2ND). LR-MSG, LR-LOC, LR-DESC, and LR-TZ respectively use messages, location fields, description fields, and timezones as their inputs. This model is similar to the stacking (Wolpert, 1992) approach taken in Han et al. (2013) and Han et al. (2014), which showed superior performance compared to a feature concatenation approach.
The model takes the following three steps to combine text and metadata:
Step 1: LR-MSG, LR-LOC, LR-DESC, and LR-TZ are trained using a training set.
Step 2: The outputs of the four classifiers on the training set are obtained with 10-fold cross validation.
Step 3: LR-2ND is trained using the outputs of the four classifiers.
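Step 2 of the procedure above, obtaining classifier outputs on the training set without letting a classifier score examples it was trained on, can be sketched as follows. This is a hedged illustration only: `out_of_fold_outputs` and `train_prior` are hypothetical names, the fold assignment by index modulo k is an assumption, and `train_prior` is a toy stand-in for a base classifier such as LR-MSG (it ignores its input and returns label frequencies). Training the meta-classifier LR-2ND on these outputs is not shown.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence

Predictor = Callable[[object], Dict[Hashable, float]]
Trainer = Callable[[Sequence[object], Sequence[Hashable]], Predictor]


def out_of_fold_outputs(
    xs: Sequence[object], ys: Sequence[Hashable], train: Trainer, k: int = 10
) -> List[Dict[Hashable, float]]:
    """Return one output per training example, each produced by a model
    that was trained on the other k-1 folds only."""
    n = len(xs)
    outputs: List[Dict[Hashable, float]] = [{} for _ in range(n)]
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        rest = [i for i in range(n) if i % k != fold]
        predict = train([xs[i] for i in rest], [ys[i] for i in rest])
        for i in held_out:
            outputs[i] = predict(xs[i])
    return outputs


def train_prior(xs: Sequence[object], ys: Sequence[Hashable]) -> Predictor:
    """Toy base classifier: always predicts the training label frequencies."""
    counts = Counter(ys)
    total = sum(counts.values())
    probs = {label: c / total for label, c in counts.items()}
    return lambda x: dict(probs)


# Hypothetical toy data: 20 users with alternating city labels.
users = [f"user{i}" for i in range(20)]
cities = ["tokyo" if i % 2 == 0 else "osaka" for i in range(20)]
meta_features = out_of_fold_outputs(users, cities, train_prior, k=10)
```

In a stacking setup, the per-example outputs of each base classifier would be concatenated to form the input features of the meta-classifier.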

MADCEL-B-LR-STACK
MADCEL-B-LR-STACK is a combined model of MADCEL-B-LR and LR-STACK. LR-STACK results are introduced as dongle nodes to MAD instead of LR results to combine text, metadata, and network information.

Model Configurations
5.2.1 Text Processor
We applied lowercase conversion, Unicode normalization, Twitter user name normalization, and URL normalization for text pre-processing. The pre-processed text is then segmented using Twokenizer (Owoputi et al., 2013) to obtain words.
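The four normalization steps above could look like the following sketch. The details are assumptions, since the text does not specify them: the NFKC normalization form, the `<url>` and `<user>` placeholder tokens, and both regular expressions are hypothetical choices. Tokenization with Twokenizer is not shown.

```python
import re
import unicodedata

# Hypothetical patterns; real tweets may need more careful matching.
URL_PATTERN = re.compile(r"https?://\S+")
USER_PATTERN = re.compile(r"@\w+")


def preprocess(text: str) -> str:
    """Apply Unicode, case, URL, and user name normalization to one text."""
    text = unicodedata.normalize("NFKC", text)  # assumed normalization form
    text = text.lower()
    text = URL_PATTERN.sub("<url>", text)
    text = USER_PATTERN.sub("<user>", text)
    return text


example = preprocess("Hello @Bob see https://t.co/AbC")
```

Replacing user names and URLs with shared placeholder tokens keeps the vocabulary small, which is one common motivation for this kind of normalization.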

Pre-training of Embeddings
We pre-trained word embeddings using messages, location fields, and description fields of a training set using fastText (Bojanowski et al., 2016) with the skip-gram algorithm. We also pre-trained user embeddings using the non-reduced mention network described in Section 4.2 of a training set with LINE (Tang et al., 2015). Details of the pre-training parameters are described in Appendix A.1.

Neural Network Optimization
We chose cross-entropy loss as the objective function of our models. l2 regularization was applied to the RNN layers, the attention context vectors, and the FC layers of our models to avoid overfitting. The objective function was minimized through stochastic gradient descent over shuffled mini-batches with Adam (Kingma and Ba, 2014).

Model Parameters
The layers and the embeddings in our models have unit size and embedding dimension parameters. Our models and the baseline models have a regularization parameter α, which is sensitive to the dataset. The baseline models have an additional k-d tree bucket size c, celebrity threshold t, and MAD parameters µ_1, µ_2, and µ_3, which are also data sensitive.
We chose optimal values for these parameters in terms of accuracy with a grid search using the development sets of TwitterUS and W-NUT. Details of the parameter selection strategies and the selected values are described in Appendix A.2.

Metrics
We evaluate the models with the following four metrics commonly used in geolocation prediction: accuracy, the percentage of correctly predicted cities; accuracy@161, a relaxed accuracy that takes prediction errors within 161 km as correct predictions; median error distance, the median value of error distances in predictions; and mean error distance, the mean value of error distances in predictions.

Table 2: Performances of our models and the baseline models on TwitterUS. Significance tests were performed between models with the same Sign. Test IDs. The shaded lines represent values copied from related papers. Asterisks denote significant improvements against paired counterparts at the 1% (**) and 5% (*) significance levels.
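The four metrics defined above can be computed as in the following sketch. The distance function is an assumption: the text does not say how error distances are measured, so this example uses the haversine great-circle distance with a mean Earth radius of 6371 km. The names `haversine_km` and `evaluate` are hypothetical.

```python
import math
from statistics import mean, median
from typing import Dict, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance in km (assumed distance measure)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def evaluate(
    pred_cities: Sequence[str],
    gold_cities: Sequence[str],
    pred_coords: Sequence[Coord],
    gold_coords: Sequence[Coord],
) -> Dict[str, float]:
    """Compute accuracy, accuracy@161, and median and mean error distance."""
    errors = [haversine_km(p, g) for p, g in zip(pred_coords, gold_coords)]
    n = len(errors)
    return {
        "accuracy": sum(p == g for p, g in zip(pred_cities, gold_cities)) / n,
        # Errors within 161 km count as correct predictions.
        "accuracy@161": sum(e <= 161.0 for e in errors) / n,
        "median_error_km": median(errors),
        "mean_error_km": mean(errors),
    }


# Hypothetical toy example: one exact hit and one miss 10 degrees away.
scores = evaluate(
    ["a", "b"], ["a", "c"], [(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 10.0)]
)
```

Note that accuracy is computed on city labels while the other three metrics are computed on coordinates, so a wrong city that lies close to the gold location can still count as correct under accuracy@161.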

Table 3: Performance of our models and baseline models on W-NUT. The same notations as those in Table 2 are used in this table.

Result
Performance on TwitterUS
Table 2 presents results of our models and the implemented baseline models on TwitterUS. We also list values from earlier reports (Han et al., 2012; Wing and Baldridge, 2014; Rahimi et al., 2015a,b, 2016) to make our results readily comparable with past reported values.
We performed statistical significance tests among model pairs that share the same inputs. The values in the Sign. Test ID column of Table 2 represent the IDs of these pairs. In preparation for the statistical significance tests, the accuracy, accuracy@161, and error distance of each test user were calculated for each model pair. Two-sided Fisher-Pitman permutation tests were used for testing accuracy and accuracy@161. Mood's median test was used for testing error distance in terms of median. Paired t-tests were used for testing error distance in terms of mean.
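A two-sided permutation test on paired per-user scores can be sketched as follows. This is an illustration under assumptions, not the exact procedure used in the evaluation: it uses Monte Carlo sign flipping of the paired differences, and the function name, the number of rounds, and the seed are hypothetical.

```python
import random
from typing import Sequence


def paired_permutation_test(
    a: Sequence[float], b: Sequence[float], rounds: int = 10000, seed: int = 0
) -> float:
    """Monte Carlo two-sided p-value for the summed paired difference.

    a, b: per-user scores of two models on the same test users,
          e.g. 1.0 for a correct prediction and 0.0 otherwise.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(rounds):
        # Under the null hypothesis the two models are exchangeable,
        # so the sign of each paired difference is flipped at random.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)
```

With per-user correctness indicators as inputs, a small p-value suggests that the accuracy difference between the two models is unlikely under random relabeling of which model produced which prediction.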
We confirmed the significance of improvements in accuracy@161 and mean error distance for all of our models. Three of our models also improved in terms of accuracy. In particular, the proposed model achieved a 2.8% increase in accuracy and a 2.4% increase in accuracy@161 against the counterpart baseline model MADCEL-B-LR-STACK. One negative result we found was the median error distance between SUB-NN-META and LR-STACK: the baseline model LR-STACK performed significantly better than our model, by 4.5 km.

Performance on W-NUT
Table 3 presents the results of our models and the implemented baseline models on W-NUT. As for TwitterUS, we listed values from Miura et al. (2016) and Jayasinghe et al. (2016). We tested the significance of these results in the same way as we did for TwitterUS. We confirmed significant improvement in the four metrics for all of our models. The proposed model achieved a 4.8% increase in accuracy and a 6.6% increase in accuracy@161 against the counterpart baseline model MADCEL-B-LR-STACK. The accuracy is 3.8% higher than the previously reported best value (Jayasinghe et al., 2016), which combined texts, metadata, and user network information with an ensemble method.

6 Discussion
6.1 Analyses of Attention Probabilities

Unification Strategies
In the evaluation, the proposed model has implicitly shown effectiveness at unifying text, metadata, and user network representations through improvements in the four metrics. However, details of the unification processes are not clear from the model outputs because they are merely the probabilities of estimated locations. To gain insight into the unification processes, we analyzed the states of two attention layers: Attention U and Attention UN in Figure 1.
Figure 4 presents the estimated probability density functions (PDFs) of the four input representations for Attention U. These PDFs are estimated with kernel density estimation from the development sets of TwitterUS and W-NUT, where all four representations are available. From the PDFs, it is apparent that the model assigns higher probabilities to timeline representations than to the other three representations in TwitterUS compared to W-NUT. This finding is reasonable because timelines in TwitterUS consist of more tweets (tweet/user in Table 1) and are likely to be more informative than in W-NUT.
Figure 5 presents the estimated PDFs of user network representations for Attention UN. These PDFs are estimated from the development sets of TwitterUS and W-NUT, where both input representations are available. The PDFs show a stronger preference for the user network representation in TwitterUS than in W-NUT. This finding is intuitive because TwitterUS has substantially more user network edges (reduced-edge/user in Table 1) than W-NUT and is therefore likely to benefit more from user network information.

Attention Patterns
We further analyzed the proposed model by clustering attention probabilities to capture typical attention patterns. For each user, we assigned the six attention probabilities of Attention U and Attention UN as features for clustering. A k-means clustering was performed over these users with 9 clusters. The clustering clearly separated the users into 5 clusters for TwitterUS users and 4 clusters for W-NUT users. We extracted typical users of each cluster by selecting the users closest to the cluster centroids. Figure 6 shows a clustering result and the attention probabilities of these users.
These attention probabilities can be considered as typical attention patterns of the proposed model and match the previously estimated PDFs. For example, clusters 2 and 3 represent an attention pattern that processes users by balancing the representations of locations along with the representations of timelines. Additionally, the location probabilities in this pattern are in the right tail region of the location PDF.

City Prediction
The evaluation produced improvements in most of our models in the four metrics. One exception we found was the median error distance between SUB-NN-META and LR-STACK in TwitterUS. Because the median error distance of SUB-NN-META was quite low (46.8 km), we measured the performance of an oracle model that always predicts the ground-truth city. Table 4 denotes this oracle performance. The oracle mean error distance is 31.4 km, with a standard deviation of 30.1 km. Note that ground truth locations of TwitterUS are geotags and will not exactly match the oracle city centers. These oracle values imply that the current median error distances are close to the lower bound of the city classification approach and that they are difficult to improve.

Errors with High Confidences
The proposed model still makes errors for 28-30% of users even under accuracy@161. A qualitative analysis of errors with high confidences was performed to investigate cases in which the model fails. We found two common types of error in the error analysis. The first is a case when a location field is incorrect for a reason such as a house move. For example, the model predicted "Hong Kong" for a user with a location field of "Hong Kong" whose gold location is "Toronto". The second is a case when a user tweets the name of a place visited while traveling. For example, the model predicted "San Francisco" for a user who tweeted about a trip to "San Francisco" but whose gold location is "Boston".
These two types of error are difficult to handle with the current architecture of the proposed model. The architecture supports only a single location field, which prevents the model from tracking location changes. The architecture also treats each tweet independently, which prevents the model from expressing a temporal state such as traveling.

Conclusion
As described in this paper, we proposed a complex neural network model for geolocation prediction. The model unifies text, metadata, and user network information. The model achieved a maximum 3.8% increase in accuracy and a maximum 6.6% increase in accuracy@161 against several previous state-of-the-art models. We further analyzed the states of several attention layers, which revealed that the probabilities assigned to timeline representations and user network representations match some statistical characteristics of the datasets.
As future work, we plan to expand the proposed model to handle multiple locations and a temporal state to capture location changes and states such as traveling. Additionally, we plan to apply the proposed model to other social media analyses such as gender analysis and age analysis. In these analyses, metadata such as location fields and timezones may not be as effective as in geolocation prediction. However, a user network is known to include information on various user attributes including gender and age (McPherson et al., 2001), which suggests that unifying text and user network information can succeed as it did in geolocation prediction.

A Supplemental Materials
A.1 Parameters of Embedding Pre-training
Word embeddings were pre-trained with the parameters of learning rate=0.025, window size=5, negative sample size=5, and epoch=5. User embeddings were pre-trained with the parameters of initial learning rate=0.025, order=2, negative sample size=5, and training sample size=100M.

A.2 Model Parameters and Parameter Selection Strategies
Unit Sizes, Embedding Dimensions, and a Max Tweet Number
The layers and the embeddings in our models have unit size and embedding dimension parameters. We also restricted the maximum number of tweets per user for TwitterUS to reduce the memory footprint. Table 5 shows the values for these parameters. Smaller values were set for TwitterUS because TwitterUS is approximately 2.6 times larger in terms of tweet number. It was computationally expensive to process TwitterUS in the same settings as W-NUT.