Twitter Geolocation using Knowledge-Based Methods

Automatic geolocation of microblog posts from their text content is particularly difficult because many location-indicative terms are rare terms, notably entity names such as locations, people or local organisations. Their low frequency means that key terms observed in testing are often unseen in training, such that standard classifiers are unable to learn weights for them. We propose a method for reasoning over such terms using a knowledge base, through exploiting their relations with other entities. Our technique uses a graph embedding over the knowledge base, which we couple with a text representation to learn a geolocation classifier, trained end-to-end. We show that our method improves over purely text-based methods, which we ascribe to more robust treatment of low-count and out-of-vocabulary entities.


Introduction
Twitter has been used in diverse applications such as disaster monitoring (Ashktorab et al., 2014; Mizuno et al., 2016), news material gathering (Vosecky et al., 2013; Hayashi et al., 2015), and stock market prediction (Mittal and Goel, 2012; Si et al., 2013). In many of these applications, geolocation information plays an important role. However, less than 1% of Twitter users enable GPS-based geotagging, so third-party service providers require methods to automatically predict geolocation from text, profile and network information. This has motivated many studies on estimating geolocation using Twitter data (Han et al., 2014).
Approaches to Twitter geolocation can be classified into text-based and network-based methods. Text-based methods are based on the text content of tweets (possibly in addition to textual user metadata), while network-based methods use relations between users, such as user mentions, follower-followee links, or retweets. In this paper, we propose a text-based geolocation method which takes a set of tweets from a given user as input, performs named entity linking relative to a static knowledge base ("KB"), and jointly embeds the text of the tweets with concepts linked from the tweets, to use as the basis for classifying the location of the user. Figure 1 presents an overview of our method. The hypothesis underlying this research is that KBs contain valuable geolocation information, and that this can complement pure text-based methods. While others have observed that KBs have utility for geolocation tasks (Brunsting et al., 2016; Salehi et al., 2017), this is the first attempt to combine a large-scale KB with a text-based method for user geolocation.
The method we use to generate concept embeddings from a given KB is applied to all nodes in the KB, as part of the end-to-end training of our model. This has the advantage that it generates KB embeddings for all nodes in the graph associated with a given relation set, meaning that it is applicable to a large number of concepts in the KB, including the large number of NEs that are unattested in the training data. This is the primary advantage of our method over generating text embeddings for the named entity ("NE") tokens, which would only be applicable to NEs attested in the training data.
Our contributions are as follows: (1) we propose a joint knowledge-based neural network model for Twitter user geolocation, that outperforms conventional text-based user geolocation; and (2) we show that our method works well even if the accuracy of the NE recognition is low, a common situation with Twitter, because many posts are written colloquially, without capitalization for proper names, and with non-standard syntax (Baldwin et al., 2013, 2015).

Text-based methods
Text-based geolocation methods use text features to estimate geolocation. Unsupervised topic modeling approaches (Eisenstein et al., 2010; Hong et al., 2012; Ahmed et al., 2013) are one successful approach in text-based geolocation estimation, although they tend not to scale to larger data sets. It is also possible to use semi-supervised learning over gazetteers (Quercini et al., 2010), whereby gazetted terms are identified and used to construct a distribution over possible locations, and clustering or similar methods are then used to disambiguate over this distribution. More recent data-driven approaches extend this idea to automatically learn a gazetteer-like dictionary based on semi-supervised sparse-coding (Cha et al., 2015).
Supervised approaches tend to be based on bag-of-words modelling of the text, in combination with a machine learning method such as hierarchical logistic regression (Wing and Baldridge, 2014) or a neural network with denoising autoencoder (Liu and Inkpen, 2015). Han et al. (2012) focused on explicitly identifying "location indicative words" using multinomial naive Bayes and logistic regression classifiers combined with feature selection methods, while Rahimi et al. (2015b) extended this work using multi-level regularisation and a multi-layer perceptron architecture (Rahimi et al., 2017b).

Network-based methods
Twitter, as a social media platform, supports a number of different modalities for interacting with other users, such as mentioning another user in the body of a tweet, retweeting the message of another user, or following another user. If we consider the users of the platform as nodes in a graph, these define edges in the graph, opening the way for network-based methods to estimate geolocation.
Network-based methods are often combined with text-based methods, with the simplest methods being independently trained and combined through methods such as classifier combination, or the integration of text-based predictions into the network to act as priors on individual nodes (Han et al., 2016; Rahimi et al., 2017a). More recent work has proposed methods for jointly training combined text- and network-based models (Miura et al., 2017; Do et al., 2017; Rahimi et al., 2018).
Generally speaking, network-based methods are empirically superior to text-based methods over the same data set, but don't scale as well to larger data sets (Rahimi et al., 2015a).

Graph Convolutional Networks
Graph convolutional networks ("GCNs"), which we use for embedding the KB of named entities, have been attracting attention in the research community of late, as an approach to "embedding" the structure of a graph, in domains ranging from image recognition (Bruna et al., 2014; Defferrard et al., 2016), to molecular footprinting (Duvenaud et al., 2015) and quantum structure learning (Gilmer et al., 2017). Relational graph convolutional networks ("R-GCNs"; Schlichtkrull et al., 2017) are a simple implementation of a graph convolutional network, where a weight matrix is constructed for each channel, and combined via a normalised sum to generate an embedding. Kipf and Welling (2016) adapted graph convolutional networks for text based on a layer-wise propagation rule.

Methods
In this paper, we use the following notation to describe the methods: U is the set of users in the data set, E is the set of entities in the KB, R is the set of relations in the KB, T is the set of terms in the data set (the "vocabulary"), V is the union of U and T (V = U ∪ T), and d is the embedding dimension. Our method consists of two components: a text encoding, and a region prediction. We describe each component below.
Figure 2: Our proposed method expands input entities using a KB, then entities are fed into input layers along with their relation names and directions. The vectors that are obtained from the input layers are combined via a weighted sum. The text associated with a user is also embedded, and a combined representation is generated based on average pooling with the entity embedding.

Text encoding
To learn a vector representation of the text associated with a user, we use a method inspired by relational graph convolutional networks (Schlichtkrull et al., 2017).
Our proposed method is illustrated in Figure 2. Each channel in the encoding corresponds to a directed relation, and these channels are used to propagate information about the entity. For instance, the channel for (bornIn, BACKWARDS) can be used to identify all individuals born in a given location, which could provide a useful signal, e.g., to find synonymous or overlapping regions in the data set. Our text encoding method is based on embedding the properties of each entity based on its representation in the KB, and its neighbouring entities.
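Consider the tweets posted by a user, containing $n$ entity mentions $\{e_1, e_2, \ldots, e_n\}$, each of which is contained in the KB, $e_i \in E$. The vector $m_{e_i r} \in \mathbb{R}^{1 \times d}$ represents the entity $e_i$ based on the set of other entities connected through directed relation $r$, i.e.,
$$m_{e_i r} = \sum_{e' \in N_r(e_i)} W^{(1)}_{e'} \qquad (1)$$
where $W^{(1)}_{e} \in \mathbb{R}^{1 \times d}$ is the embedding of entity $e$ from embedding matrix $W^{(1)} \in \mathbb{R}^{|V| \times d}$, and $N_r(e)$ is the neighbourhood function, which returns all nodes $e'$ connected to $e$ by directed relation $r$. Then, $m_{e_i r}$ for all $r$ are transformed using a weighted sum:
$$v_{e_i} = \mathrm{ReLU}\Big(\sum_{r \in R} a_{ir}\, m_{e_i r}\Big) \qquad (2)$$
$$a_i = \sigma\big(e_i W^{(2)}\big) \qquad (3)$$
where $a_i \in \mathbb{R}^{1 \times |R|}$ is the attention that entity $e_i$, represented by the one-hot row vector $e_i$, pays to all relations using weight matrix $W^{(2)} \in \mathbb{R}^{|V| \times |R|}$, $a_{ir}$ is its component for relation $r$, and $\sigma$ and ReLU are the sigmoid and the rectified linear unit activation functions, respectively. Here, we obtain the entity embedding vector $v_{e_i} \in \mathbb{R}^{1 \times d}$ for entity $e_i$.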
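As an illustration of the entity encoding described above, the following is a minimal pure-Python sketch. It assumes that the per-relation vector is an unnormalised sum of neighbour embeddings, that the attention weights are sigmoids of per-entity logits (standing in for the row of $W^{(2)}$ selected by the one-hot entity vector), and that ReLU is applied after the weighted sum. The toy KB, the `FORWARDS` direction label, and all function and variable names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neighbour_sum(entity, channel, neighbours, emb, dim):
    """Sum the embeddings of all entities linked to `entity` via `channel`."""
    total = [0.0] * dim
    for other in neighbours.get((entity, channel), []):
        for k, value in enumerate(emb[other]):
            total[k] += value
    return total

def encode_entity(entity, channels, neighbours, emb, att_logits, dim):
    """Attention-weighted sum over channels, followed by ReLU (assumed form)."""
    combined = [0.0] * dim
    for idx, channel in enumerate(channels):
        weight = sigmoid(att_logits[entity][idx])  # attention for this channel
        message = neighbour_sum(entity, channel, neighbours, emb, dim)
        for k in range(dim):
            combined[k] += weight * message[k]
    return [max(0.0, x) for x in combined]

# Toy KB: each channel is a (relation, direction) pair.
channels = [("isLocatedIn", "FORWARDS"), ("isLocatedIn", "BACKWARDS")]
neighbours = {
    ("Lakers", ("isLocatedIn", "FORWARDS")): ["Los_Angeles"],
    ("Los_Angeles", ("isLocatedIn", "BACKWARDS")): ["Lakers", "Dodgers"],
}
emb = {"Lakers": [1.0, 0.0], "Dodgers": [0.0, 1.0], "Los_Angeles": [0.5, -0.5]}
att_logits = {"Lakers": [0.0, 0.0], "Los_Angeles": [0.0, 0.0]}

v = encode_entity("Los_Angeles", channels, neighbours, emb, att_logits, dim=2)
print(v)
```

Because the vector for an entity is built only from the embeddings of its KB neighbours, an entity that never occurs in the training tweets can still receive a meaningful representation as long as its neighbours do.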
Since entity mentions in tweets are sparse, we also encode and use all the terms in the tweets, regardless of whether they are entities. Each term $w_j$ is represented by a vector obtained from the shared embedding matrix, where $w_j$ is a one-hot-style vector of size $|V|$ whose $j$-th value equals the frequency of $w_j$ in the tweet, and $W^{(1)}$ is shared with entities (Equation 1). Overall, the user representation vector $u$ is obtained by average pooling over the entity vectors and the term vectors, where $m$ is the number of words that the user mentioned.
Our method has two special features: sharing the weight matrix across all channels, and using a weighted sum to combine vectors from each channel; these distinguish our method from R-GCN (Schlichtkrull et al., 2017). The reason we share the embedding matrix is that the meaning of an entity should be the same even if the relation type is different, so the embedding vector should be the same irrespective of relation type. We adopt a weighted sum because, even if the meaning of the entity is the same, its functional semantics should be customized to the particular relation type through which it is connected.
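The pooling step that combines entity and term vectors into a single user vector can be sketched as follows. This is an assumption-laden illustration: it treats the term vector as the term's row of the shared embedding matrix scaled by its frequency, applies no further nonlinearity, and counts each distinct term once in the average. The names `term_vector` and `average_pool` and the toy values are hypothetical.

```python
def term_vector(term, count, emb):
    """Frequency-valued one-hot vector times the embedding matrix:
    equivalent to scaling the term's embedding row by its count."""
    return [count * x for x in emb[term]]

def average_pool(vectors, dim):
    """Element-wise mean of a list of vectors (zero vector if empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]

# Toy shared embedding table and inputs for one user.
emb = {"lakers": [1.0, 0.0], "game": [0.0, 2.0]}
entity_vectors = [[0.5, 0.5]]            # e.g. output of the entity encoder
term_counts = {"lakers": 2, "game": 1}   # term frequencies in the user's tweets

term_vectors = [term_vector(t, c, emb) for t, c in term_counts.items()]
u = average_pool(entity_vectors + term_vectors, dim=2)
print(u)
```

Pooling both kinds of vectors into one representation means a user who mentions no linked entities is still represented by their terms alone.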

Region estimation
To estimate the location of a given user, we predict a region using a 1-layer feed-forward neural network with a classification output layer applied to the user representation $u$, where $W^{(3)} \in \mathbb{R}^{\mathit{class} \times d}$ is the weight matrix of that layer. The classes represent regions in the data set, defined using k-means clustering over the continuous location coordinates in the training set (Rahimi et al., 2017a). Each class is represented by the mean latitude and longitude of users belonging to that class, which forms the output of the model. The model is trained using categorical cross-entropy loss, using the Adam optimizer (Kingma and Ba, 2014) with gradient back-propagation.
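The prediction step can be illustrated with a short sketch. It assumes a softmax over the class scores (a common choice with categorical cross-entropy, not confirmed by the text) and a precomputed list of class centroids; the weights, centroids and function names below are hypothetical toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_region(u, w3, centroids):
    """Score each region class, then return the class index, its centroid
    (mean latitude/longitude of its training users) and the probabilities."""
    scores = [sum(w * x for w, x in zip(row, u)) for row in w3]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, centroids[best], probs

# Two region classes over a 2-dimensional user vector.
w3 = [[1.0, 0.0], [0.0, 1.0]]
centroids = [(34.05, -118.24), (35.23, -80.84)]
best, location, probs = predict_region([0.2, 0.9], w3, centroids)
print(best, location)
```

Returning a centroid turns the classification output into a coordinate, which is what distance-based evaluation measures require.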

Evaluation
Geolocation models are conventionally evaluated based on the distance (in km) between the known and predicted locations. Following Cheng et al. (2010) and Eisenstein et al. (2010), we use three evaluation measures:
1. Mean: the mean error, in km, between the known and predicted locations;
2. Median: the median error, in km, between the known and predicted locations; and
3. Acc@161: the accuracy of geolocating a test user within 161km (= 100 miles) of their real location, which is an indicator of whether the model has correctly predicted the metropolitan area a user is based in.
Note that lower numbers are better for Mean and Median, while higher is better for Acc@161.
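These three measures can be computed as in the sketch below. The text does not say how distances are calculated, so the haversine great-circle formula with an Earth radius of 6371 km is an assumption here, as are the function names and the toy coordinates.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def evaluate(gold, predicted):
    """Mean and Median error in km, and the share of users within 161 km."""
    errors = [haversine_km(g, p) for g, p in zip(gold, predicted)]
    return {
        "Mean": mean(errors),
        "Median": median(errors),
        "Acc@161": sum(e <= 161 for e in errors) / len(errors),
    }

gold = [(34.05, -118.24), (40.71, -74.01)]       # Los Angeles, New York
predicted = [(34.05, -118.24), (34.05, -118.24)]  # both predicted as Los Angeles
results = evaluate(gold, predicted)
print(results)
```

A single distant misprediction moves Mean far more than Median, which is why the measures are reported together.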

Data set and settings
We base our experiments on GeoText (Eisenstein et al., 2010), a Twitter data set focusing on the contiguous states of the USA, which has been widely used in geolocation research. The data set contains approximately 6,500 training users, and 2,000 users each for development and test. Each user has a latitude and longitude coordinate, which we use for training and evaluation. We exclude @mentions, and filter out words used by fewer than 10 users in the training set. We use Yago3 (Mahdisoltani et al., 2014) as our knowledge base in all experiments. Yago3 contains more than 12M relation edges, with around 4.2M unique entities and 37 relation types. We compare three relation sets:

GEORELATIONS: relations with an explicit, fine-grained location component
TOP-5 RELATIONS: the top-5 relations in Yago3 based on edge count
GEO+TOP-5 RELATIONS: the combination of GEORELATIONS and TOP-5 RELATIONS
We use AIDA (Nguyen et al., 2014) as our named entity recognizer and linker for Yago3.
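The vocabulary filtering applied to GeoText (excluding @mentions and dropping words used by fewer than 10 training users) might look roughly like the following. Lowercasing and whitespace tokenisation are assumptions made for this sketch, and the function name and toy data are hypothetical.

```python
from collections import Counter

def build_vocabulary(user_tweets, min_users=10):
    """Keep terms used by at least `min_users` distinct users, skipping @mentions."""
    user_counts = Counter()
    for tweets in user_tweets.values():
        seen = set()  # terms used by this user, counted once per user
        for tweet in tweets:
            for token in tweet.lower().split():
                if not token.startswith("@"):
                    seen.add(token)
        user_counts.update(seen)
    return {term for term, n in user_counts.items() if n >= min_users}

# Toy data with a lowered threshold so the effect is visible.
users = {
    "u1": ["go lakers @bob", "lakers win"],
    "u2": ["lakers tonight"],
    "u3": ["rain tonight @lakers"],
}
vocab = build_vocabulary(users, min_users=2)
print(sorted(vocab))
```

Counting users rather than occurrences prevents one prolific user from keeping an idiosyncratic term in the vocabulary.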
The hyperparameters used were: a minibatch size of 10 for our method, and full batch for R-GCN methods mentioned in the following section; each component, text encoding and region estimation, has one layer; 32 regions; $L_2$ regularization coefficient of $10^{-5}$; hidden layer size of 896; and 50 training iterations, with early stopping based on development performance.
All models were learned with the Adam optimiser (Kingma and Ba, 2014), based on categorical cross-entropy loss with channel weights $W_c = \frac{|c_{\max}|}{|c|}$, where $|c|$ is the number of entities of class type $c$ appearing in the training data, and $|c_{\max}|$ is that of the most-frequent class. Each layer is initialized using HeNormal (He et al., 2015), and all models were implemented in Chainer (Tokui et al., 2015).
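The weighting ratio above (the count of the most frequent class divided by the count of each class) can be computed directly from the training labels. The sketch below is a minimal illustration with hypothetical names and toy labels; how the weights enter the loss in the authors' implementation is not shown.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by (count of most frequent class) / (count of this class)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}

# Toy region labels: class 0 is four times as frequent as class 2.
labels = [0, 0, 0, 0, 1, 1, 2]
weights = class_weights(labels)
print(weights)
```

The most frequent class receives weight 1, and rarer classes receive proportionally larger weights, so errors on sparsely populated regions contribute more to the loss.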

Baseline Methods
We compare our method with two baseline methods: (1) the proposed method without weighted sum; and (2) an R-GCN baseline, over the same sets of relations as our proposed method. Both methods expand entities using the KB, which helps handle low-frequency and out-of-vocabulary (OOV) entities. Figure 3 illustrates the difference between the proposed and two baseline methods. The difference between these methods is only in the text encoding part. We describe these baseline methods in detail below.
Proposed Method without Weighted Sum ("simple average"): To confirm the effect of the weighted sum in the proposed method, we use the proposed method without weighted sum as one of our baselines. Here, we use $a_r = \frac{1}{|N_r(e_i)|}$ instead of $a_{ir}$ in Equation 2.

R-GCN baseline method (R-GCN):
The R-GCNs we use are based on the method of Schlichtkrull et al. (2017). The differences from our proposed method are that the R-GCN has a separate weight matrix for each channel, and uses a non-weighted sum.

Results
Table 1 presents the results for our method, which we compare with three benchmark text-based user geolocation models from the literature (Cha et al., 2015; Rahimi et al., 2015b, 2017b). We present results separately for the three relation sets, under the following settings: (1) implemented within our proposed method; (2) the proposed method without weighted sum; and (3) R-GCN baseline method.
Footnote 3: Note that GEORELATIONS and TOP-5 RELATIONS each include five relation types, while GEO+TOP-5 RELATIONS includes 10 relation types, so the comparison between the three relation sets is not entirely fair.
Figure 3: The difference between the proposed and two baseline methods. The proposed method shares the weight matrix between the different channels. The first baseline is almost the same as the proposed method, with the only difference being that a simple sum is used instead of a weighted sum. The R-GCN baseline learns a separate weight matrix for each channel.
The best results are achieved with our proposed method using GEO+TOP-5 RELATIONS, in terms of both Acc@161 and Median. The second-best results across these metrics are achieved using our proposed method without weighted sum using GEO+TOP-5 RELATIONS, and the third-best results are for our proposed method using GEORELATIONS. Surprisingly, the R-GCN baseline methods perform worse than the benchmark methods in terms of Acc@161 and Median. No method outperforms Cha et al. (2015) in terms of Mean, suggesting that this method produces the fewest high-value outlier predictions overall; we do not report Acc@161 for this method as it was not presented in the original paper.

Discussion
Our proposed method is able to estimate the geolocation of Twitter users with higher accuracy than pure text-based methods. One reason is that our method is able to handle OOV entities if those entities are related to training entities. Perhaps unsurprisingly, the fine-grained, geolocation-specific relation set (GEORELATIONS) performed better than the general-purpose set (TOP-5 RELATIONS), but it is important to observe that this is despite its relations being more sparsely distributed in Yago3, and also that the more general-purpose set of relations also resulted in higher accuracy. The combination of the geolocation-specific and general-purpose sets (GEO+TOP-5 RELATIONS) achieves the best result in the table, but the improvement over using only GEORELATIONS is limited. That is, even though our method works with a general-purpose relation set, it is better to choose task-specific relations.
To confirm which relations have the greatest utility for user geolocation, we conducted an experiment based on using one relation at a time. As detailed in Table 2, relations that are better represented in Yago3, such as isLocatedIn and playsFor, have a greater impact on results, in part because this supports greater generalization over OOV entities. Having said this, the relation which has the fewest edges, happenedIn, has the highest impact on results in terms of Acc@161 and the third-highest impact in terms of Mean and Median, showing that it is not just the density of a relation that determines its impact. Surprisingly, the overall best result in terms of Median, including the results for relation sets such as GEORELATIONS and GEO+TOP-5 RELATIONS, is obtained with isLocatedIn only, despite it being a single relation. This result also shows that choosing task-specific relations is one of the important features of our method.
Even though the R-GCN baseline is closely related to our method, the results were worse. The reason for this is that it has an individual weight matrix for each channel, which means that it has more parameters to learn than our proposed method. To confirm the effect of the number of parameters, we conducted an experiment comparing the Median error as we changed the number of units in the middle layer in the range {112, 224, 448, 672, 896} for our proposed method and the R-GCN baseline method. As shown in Figure 4, the Median error of the R-GCN baseline method is almost equal when the number of units is between 224 and 896, at a level worse than our proposed method. This result suggests that the R-GCN baseline method cannot be improved by simply reducing the number of parameters. This is because the amount of training data is imbalanced for each channel, so some channels do not train well over small data sets. With larger data sets, it is likely that the R-GCN baseline would perform better, which we leave to future work.
We also analyzed the results across test users with differing numbers of tweets in the data set, as detailed in Figure 5, broken down into bins of 20 tweets (from 40 tweets; note that the minimum number of tweets for a given user in the data set is 20). "Proposed" refers to our proposed method using GEORELATIONS, and "BoW" refers to the bag-of-words MLP method of Rahimi et al. (2017b). We can see that our method is superior for users with small numbers of tweets, indicating that it generalizes better from sparse data. This suggests that our method is particularly relevant for small-data scenarios, which are prevalent on Twitter in a real-time scenario.
Figure 6 shows the results across test users with differing numbers of entities in the data set. Our method improves results in all cases, even for users who do not mention any entities. This is because our method shares the same weight matrix for entity and word embeddings, meaning it is optimized for both.
On the other hand, the median error for users who mention over 10 entities is high. Most of their tweets mention sports events, and they typically include more than two geospatially-grounded entities. For example, Lakers @ Bobcats has two entities, Lakers and Bobcats, both of which are basketball teams, but their home cities differ (Los Angeles, CA for Lakers and Charlotte, NC for Bobcats). Therefore, users who mention many entities are difficult to geolocate.
Tweets are written in colloquial style, making NER difficult. For this reason, it is highly likely that there is noise in the output of AIDA, our NE recognizer. To investigate the tension between precision and recall of NE recognition and linking, we conducted an experiment using simple case-insensitive longest string match against Yago3 as our NE recognizer, which we would expect to have higher recall but lower precision than AIDA. Table 3 shows the results, based on GEORELATIONS. We see that AIDA has a slight advantage in terms of Acc@161 and Mean, but that longest string match is superior in terms of Median despite its simplicity. Given its efficiency, and there being no need to train the model, this potentially has applications when porting the method to new KBs or applying it in a real-time scenario.
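A case-insensitive longest string match of the kind described above can be sketched as a greedy left-to-right scan. The details here are assumptions rather than the authors' implementation: whitespace-tokenised input, a maximum span of five tokens, and a small in-memory set of KB entity names. The function name and toy entity names are hypothetical.

```python
def longest_match_entities(tokens, kb_names, max_len=5):
    """Greedily match the longest token span (case-insensitively) against KB names."""
    lookup = {name.lower(): name for name in kb_names}
    lowered = [t.lower() for t in tokens]
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(lowered[i:i + span])
            if candidate in lookup:
                found.append(lookup[candidate])
                i += span
                break
        else:
            i += 1  # no KB name starts at this token
    return found

kb_names = ["Los Angeles", "Los Angeles Lakers", "Lakers"]
tokens = "watching the los angeles lakers tonight".split()
entities = longest_match_entities(tokens, kb_names)
print(entities)
```

Preferring the longest span links the whole team name rather than its city or nickname, while ignoring case recovers mentions written without capitalization.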

Conclusion and Future Work
In this paper, we proposed a user geolocation prediction method based on entity linking and embedding a knowledge base, and confirmed the effectiveness of our method through evaluation over the GeoText data set. Our method outperformed conventional text-based geolocation, in terms of Acc@161 and Median, due to its ability to generalize over OOV named entities, which was seen particularly for users with smaller numbers of tweets. We also showed that our method is not reliant on a pre-trained named entity recognizer, and that the selection of relations has an impact on the results of the method.
In future work, we plan to combine our method with user mention-based network methods, and to confirm the effectiveness of our method over larger-sized data sets.