Geolocation with Attention-Based Multitask Learning Models

Geolocation, predicting the location of a post based on its text and other information, has great potential for many social media applications. Typically, the problem is modeled either as multi-class classification or as regression. In the first case, the classes are previously identified geographic areas; in the second, the models directly predict geographic coordinates. The former requires discretization of the coordinates, but yields better performance. The latter is potentially more precise and truer to the nature of the problem, but often performs worse. We propose to combine the two approaches in an attention-based multi-task convolutional neural network that jointly predicts both discrete locations and continuous geographic coordinates. We evaluate the multi-task (MTL) model against single-task models and prior work. We find that MTL significantly improves performance, reporting large gains on one data set, but also note that the correlation between labels and coordinates has a marked impact on the effectiveness of including a regression task.


Introduction
Knowing the location of a social media post is useful for a variety of applications: from improving content relevance for the socio-cultural environment of a geographic area (Rakesh et al., 2013), to the understanding of demographic distributions for disaster relief (Lingad et al., 2013).
However, most social media posts do not include location. On Twitter, one of the most studied social media platforms, geotagging is enabled for at most 5% of the posts (Sloan and Morgan, 2015; Cebeillac and Rault, 2016). To address this issue, samples of geolocated data have been used to create corpora of geo-tagged texts. These corpora allow us to train supervised models to predict the geographic location of a post, relying on the post's text and, possibly, users' interaction information and other meta-data provided by the social media platform. While a lot of work has gone into this problem, it is still far from solved.
The task is usually framed as a multi-class classification problem, but actual location information is normally given as a pair of continuous-valued latitude/longitude coordinates (e.g., 51.5074°N, 0.1278°W). Using these coordinates in classification requires translating them into labels corresponding to geographic areas (e.g., cities, states, countries). This translation is itself a non-trivial task (Wing and Baldridge, 2014), and necessarily loses information. Much less frequently, geolocation is framed as regression, i.e., direct prediction of the coordinates. While potentially more accurate, regression over geographic coordinates presents a host of challenges: values are continuous but bounded, can be negative, and distances are non-Euclidean, due to the Earth's curvature. It is therefore usually less effective than classification.
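Concretely, because geographic distances are non-Euclidean, prediction error in this task is typically measured with the great-circle (haversine) distance rather than the straight-line distance. A minimal sketch in Python (the constant and function names are ours, for illustration):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# London (51.5074 N, 0.1278 W) to New York (40.7128 N, 74.0060 W)
d = haversine_km(51.5074, -0.1278, 40.7128, -74.0060)  # roughly 5570 km
```

Metrics such as acc@161 (the share of predictions within 161 km of the true location) are defined on top of this distance.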
Ideally, we would like to combine the advantages of both approaches, i.e., let the regression over continuous-valued coordinates guide the discrete location classification. So far, however, no work has tried to do so. Recent advances in multi-task learning (MTL) give us the opportunity; in this paper, we do exactly that.
We combine classification and regression in a multi-task attention-based convolutional neural network (MTL-Att-CNN), which jointly learns to predict the geographic labels and the corresponding coordinates. We evaluate on two data sets widely used in the geolocation literature, TWITTER-US and TWITTER-WORLD (Section 3). In line with prior research on MTL (Alonso and Plank, 2017; Bingel and Søgaard, 2017), we do find that auxiliary regression can help classification performance, but under a somewhat surprising condition: when there are enough classification labels. We show this by evaluating on two different schemes for discretizing coordinates into labels. The first (Rahimi et al., 2017b) identifies irregular areas via k-d trees, and is the most common in the literature. The second (Fornaciari and Hovy, 2019b) directly identifies towns of at least 15K inhabitants and allows evaluating the method in a more realistic scenario, but results in 3-6 times more labels.
Contributions 1) We propose a novel multi-task CNN model, which learns geographic label prediction and coordinate regression jointly.
2) Based on Fornaciari and Hovy (2019b), we propose an alternative coordinate discretization, which correlates more strongly with the geo-coordinates (Section 3). We find that label granularity impacts the effectiveness of MTL.

Related Work
Most successful recent approaches to geolocation use Deep Learning architectures for the task (Liu and Inkpen, 2015; Iso et al., 2017; Han et al., 2016). Many authors (Miura et al., 2016; Bakerman et al., 2018; Rahimi et al., 2018; Ebrahimi et al., 2018; Do et al., 2018; Fornaciari and Hovy, 2019a) follow a hybrid approach, combining the text representation with network information and further meta-data. However, recent works explore the effectiveness of purely textual data for geolocation (Tang et al., 2019).
Other researchers have directly predicted the geographic coordinates associated with the texts. Eisenstein et al. (2010) were the first to formulate the problem as a regression task, predicting the coordinates as numerical values. Lourentzou et al. (2017) use very simple labels, but create a neural model which separately performs both the classification task and the prediction of the geographic coordinates, and evaluate the relative performance of each approach. Rahimi et al. (2017a) created a dense representation of bi-dimensional points using Mixture Density Networks (Bishop, 1994). They motivate the higher complexity of such multi-dimensional representations with the limits of loss minimization over uni-modal distributions in multi-target scenarios. In particular, they point out that minimizing the squared loss amounts to placing the predicted point in the middle of the possible outputs, where more flexible representations would be useful for geographic prediction: "a user who mentions content in both NYC and LA is predicted to be in the centre of the U.S.".
We address this point with a model which jointly solves the classification and regression problem, similar to the approach by Subramanian et al. (2018), who combine regression with a classification-like "ordinal regression" in order to predict both the number of votes for a petition as well as the voting threshold it reaches.

Data
Corpora We use two publicly available data sets commonly used for geolocation, known as TWITTER-US and TWITTER-WORLD. They were released by Roller et al. (2012) and Han et al. (2012), respectively. Both data sets consist of geolocated tweets written in English, coming from North America and from all over the world, respectively. Each instance consists of a set of tweets from a single user, associated with a pair of geographic coordinates (latitude and longitude). TWITTER-US has 449 694 instances, TWITTER-WORLD 1 386 766. Both corpora have predefined development and test sets of 10 000 records each. These corpora were used in the W-NUT 2016 shared task, providing a basis for comparison with other models in the literature.

Labels
Since the location is represented as coordinates, there is no single best solution for translating them into meaningful labels (i.e., geographic areas). We follow two distinct discretization approaches, resulting in different label sets. First, to allow comparison with prior work, we implement the coordinate clustering method proposed by Rahimi et al. (2017b). It relies on the k-d tree procedure (Maneewongvatana and Mount, 1999) and identifies 256 geographic areas for TWITTER-US and 930 for TWITTER-WORLD. These areas, however, are quite large and do not always correspond to any meaningful territorial division (e.g., city, county, state, etc.).
In order to create label sets corresponding more closely to existing geographic distinctions, we use Point2City (P2C), another k-d tree-based algorithm with additional steps, proposed by Fornaciari and Hovy (2019b). This results in more fine-grained geographic labels.
P2C clusters all points closer than 11 km (which roughly corresponds to the first decimal place on the longitude axis), then iteratively merges the centroids until no two centroids are closer than 11 km to each other. Finally, these points are labeled with the name of the closest city of at least 15 000 inhabitants, according to the free database GeoNames. We refer the reader to Fornaciari and Hovy (2019b) for further details of the method.
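The centroid-merging step can be sketched as follows. This is a simplified illustration of the idea only: the greedy pairwise merging and the size-weighted mean are our assumptions, not the exact P2C implementation, for which we refer to Fornaciari and Hovy (2019b).

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    (lat1, lon1), (lat2, lon2) = p, q
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def merge_centroids(points, threshold_km=11.0):
    """Iteratively merge any two centroids closer than the threshold,
    replacing them with their size-weighted mean, until no pair is left."""
    centroids = [(lat, lon, 1) for lat, lon in points]  # (lat, lon, weight)
    merged = True
    while merged:
        merged = False
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                (la1, lo1, w1), (la2, lo2, w2) = centroids[i], centroids[j]
                if haversine_km((la1, lo1), (la2, lo2)) < threshold_km:
                    centroids[j] = ((la1 * w1 + la2 * w2) / (w1 + w2),
                                    (lo1 * w1 + lo2 * w2) / (w1 + w2), w1 + w2)
                    del centroids[i]
                    merged = True
                    break
            if merged:
                break
    return [(lat, lon) for lat, lon, _ in centroids]
```

The surviving centroids would then be matched against the GeoNames list of cities with at least 15 000 inhabitants.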
The mean distance between P2C labels and the respective actual city centers is less than 3.5 km. P2C results in 1 593 labels for TWITTER-US and 2 975 for TWITTER-WORLD, respectively 6 and 3 times more than the method of Rahimi et al. (2017b). We provide our labels and our models on GitHub (Bocconi-NLPLab).
Pre-processing and feature selection We pre-process the text by lowercasing it and removing URLs and stop-words. We reduce numbers to 0, except for those appearing in mentions (e.g., @abc123). To keep the vocabulary size computationally tractable, we restrict it to words with a minimum frequency of 5 in each corpus. Since this removes about 80% of the vocabulary, possibly losing relevant information, we convert a subset of the low-frequency words into replacement tokens. In particular, considering the training set only, we select all low-frequency words that appear uniquely in a single place according to the P2C labels, and discard those found in more than one geographic area. The resulting vocabulary size is 1.470M words for TWITTER-US and 470K for TWITTER-WORLD.
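The replacement-token step can be sketched as follows, assuming one P2C label per training instance. The placeholder format `<LOC_...>` and the function name are ours, for illustration:

```python
from collections import Counter, defaultdict

def build_vocab(docs, labels, min_freq=5):
    """docs: list of token lists; labels: one P2C label per doc.
    Keep frequent words as-is; map rare words seen in a single place
    to a per-label placeholder; drop the remaining rare words."""
    freq = Counter(tok for doc in docs for tok in doc)
    places = defaultdict(set)
    for doc, label in zip(docs, labels):
        for tok in doc:
            places[tok].add(label)
    vocab = {}
    for tok, n in freq.items():
        if n >= min_freq:
            vocab[tok] = tok                    # frequent: kept as-is
        elif len(places[tok]) == 1:
            label = next(iter(places[tok]))
            vocab[tok] = f"<LOC_{label}>"       # location-unique rare word
        # rare words seen in several areas get no entry, i.e., are discarded
    return vocab
```

At training time, tokens would then be mapped through this table, with out-of-vocabulary words dropped.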
We follow Han et al. (2014) and Forman (2003) in limiting both vocabularies to the same number of tokens, i.e., 470K, by filtering terms according to their Information Gain Ratio (IGR). This measures how informative each term is about a set of labels (geographic areas, in our case), based on its distribution across them.
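A minimal sketch of the IGR computation, where the information gain of a term's presence/absence is normalized by its split (intrinsic) entropy; the exact formulation in Forman (2003) may differ in details:

```python
from math import log2
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def igr(docs, labels, term):
    """Information Gain Ratio of a term w.r.t. a set of labels.
    docs: list of token sets; labels: one label per doc."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(docs)
    h_labels = entropy(list(Counter(labels).values()))
    h_cond = (len(with_t) / n * entropy(list(Counter(with_t).values()))
              + len(without) / n * entropy(list(Counter(without).values())))
    ig = h_labels - h_cond                       # information gain
    split = entropy([len(with_t), len(without)]) or 1.0  # intrinsic value
    return ig / split
```

Terms would be ranked by this score and the top 470K retained.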

Methods
We train embeddings for both corpora, and use them as input to the multi-task learning model.

Embeddings
Since tweets are natively short texts, further shortened by stop-word removal, we use a small context window of 5 words. We train our embeddings on the training set of each corpus. As we are interested in potentially rare, geographically informative words, we use the skip-gram model, which is more sensitive to low-frequency terms than CBOW (Mikolov et al., 2013), and train for 50 epochs. We use an embedding size of 512: a power of 2 for memory efficiency, and a compromise between a rich representation and the computational tractability of the embedding matrix. For the same reason, we limit the length of each instance to 800 words for TWITTER-US and 400 words for TWITTER-WORLD, which preserves the entire text for 99.5% of the instances in each corpus.
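The skip-gram objective trains on (target, context) pairs drawn from a symmetric window around each token; a word-embedding library such as gensim would take the window size as a parameter directly. The sketch below only illustrates the windowing (the function name is ours, for illustration):

```python
def skipgram_pairs(tokens, window=5):
    """Yield (target, context) pairs from a symmetric window, as used
    by the skip-gram training objective."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

Because every context word around a target generates its own training pair, rare but geographically salient words receive their own updates, which is why skip-gram suits this setting better than CBOW.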

MTL-Att-CNN
We implement a CNN with the following structure. The input layer holds the word indices of the text, converted via the embedding matrix into a matrix of shape words × embeddings. We convolve two parallel channels with max-pooling layers and convolutional window sizes of 4 and 8 over the input. The two window sizes account for both short and relatively long patterns in the texts. In both channels, the number of filters is 128 for the first convolution and 256 for the second. We join the output of the convolutional channels and pass it through an attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017) to emphasize the weight of any meaningful pattern recognized by the convolutions. We use the implementation of Yang et al. (2016). The output consists of two independent, fully-connected layers for the predictions, respectively in the form of discrete labels for classification and of continuous latitude and longitude values for regression.

Gradient Normalization Multi-task networks are quite sensitive to the choice of auxiliary tasks and the associated loss (Benton et al., 2017). If the loss outputs of different tasks differ in scale, backpropagation also involves errors at different scales. This can unbalance the relative contribution of each task to the overall result: the "lighter" task can be disadvantaged to the point of becoming untrainable, since backpropagation becomes dominated by the task with the larger error scale. To prevent this problem, we first normalize the coordinates to the range 0-1. Since coordinates include negative values, we transform them by adding 180 and dividing by 360. As loss function, we compute the Euclidean distance between the predicted and the target coordinates. We rescale all distances to within 0-1 as well, i.e., to the same scale as the softmax output of the classification task. For the main task (i.e., classification), we use the Adam optimizer (Kingma and Ba, 2014).
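The coordinate normalization and the rescaled regression loss can be sketched as follows. The division by √2, the diagonal of the unit square, is our assumption about how the distances are rescaled into 0-1:

```python
from math import sqrt

def normalize(lat, lon):
    """Map degree coordinates into [0, 1] by adding 180 and dividing by 360."""
    return (lat + 180.0) / 360.0, (lon + 180.0) / 360.0

def regression_loss(pred, target):
    """Euclidean distance between two normalized points, rescaled to [0, 1]
    by dividing by sqrt(2), the maximum distance in the unit square
    (our assumption), so it matches the scale of the softmax output."""
    (p1, p2), (t1, t2) = pred, target
    return sqrt((p1 - t1) ** 2 + (p2 - t2) ** 2) / sqrt(2)
```

Keeping both task losses in the same 0-1 range prevents the regression error from dominating (or vanishing from) the shared gradients.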
This gradient descent optimizer is widely used, as it maintains moving averages of the gradients (i.e., momentum), in practice adjusting the step size during training (Bengio, 2012). The Adam optimizer, though, requires storing a large number of additional parameters. For the auxiliary task (i.e., regression), we simply use standard gradient descent.

Experiments
We carry out 8 experiments, 4 on TWITTER-US and 4 on TWITTER-WORLD. For each data set, we compare the performance of multi-task (MTL) and single-task, i.e., classification-only, models (STL), both with the labels of Rahimi et al. (2017b) and with our own label set. For each of the 8 conditions, we report results averaged over three runs to reduce the impact of random initialization. For each condition, we compute significance between STL and MTL via bootstrap sampling (Berg-Kirkpatrick et al., 2012; Søgaard et al., 2014). TWITTER-US and TWITTER-WORLD are two remarkably different data sets. Not only do they address areas of different size, with different geographic density of the entities to locate, they also differ in vocabulary size (larger in TWITTER-US), even under different pre-processing procedures. The performance difference many studies report is therefore not surprising.
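The significance test can be sketched as a paired bootstrap over per-instance correctness, in the spirit of Berg-Kirkpatrick et al. (2012); the function name and the doubling criterion follow their description, but the details of our actual test may differ:

```python
import random

def paired_bootstrap(gold, pred_a, pred_b, n_samples=10000, seed=0):
    """Paired bootstrap for the accuracy gap between systems A and B
    (a sketch following Berg-Kirkpatrick et al., 2012). Returns an
    approximate p-value for A's observed advantage over B."""
    rng = random.Random(seed)
    n = len(gold)
    wins_a = [int(p == g) for p, g in zip(pred_a, gold)]
    wins_b = [int(p == g) for p, g in zip(pred_b, gold)]
    delta = (sum(wins_a) - sum(wins_b)) / n   # observed accuracy gap
    hits = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]    # resample with replacement
        d = sum(wins_a[i] - wins_b[i] for i in idx) / n
        if d > 2 * delta:   # resampled gap doubles the observed one
            hits += 1
    return hits / n_samples
```

A small returned value (e.g., below 0.05) indicates that the observed gap is unlikely to be an artifact of the particular test sample.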
The outcomes are shown in Table 1. On both data sets, MTL yields the best results for exact accuracy. On TWITTER-US, we outperform Han et al. (2014) in exact accuracy, which Rahimi et al. (2017b) do not report, but do not reach their acc@161 or distance measures. For TWITTER-WORLD, we report the best results for both types of accuracy and for median distance. Interestingly, the mean distance is higher, suggesting a very long tail of far-away predictions.
The effectiveness of MTL increases with label granularity. This makes sense: under a more fine-grained label scheme, the correlation between coordinates and labels is higher, which is exactly what the auxiliary task learns. Under the broader labeling scheme of Rahimi et al. (2017b), label areas are of irregular size, so the correlation with the coordinates varies. With the k-d tree labels, the mean distance between the coordinates and the cluster centroids is 50 km for TWITTER-US and 40 km for TWITTER-WORLD, while with our labels it is 16 and 7 km, respectively. With the highly granular P2C labels, MTL consistently outperforms STL; in contrast, with wider areas, the STL mean distance beats MTL on TWITTER-US. The auxiliary regression adds valuable information to the classification task: MTL improves significantly over STL.

Ablation study
In order to verify the impact of the network components on the overall performance, we carry out a brief ablation study. In particular, we are interested in the attention mechanism, implemented following Yang et al. (2016). To this end, we train an MTL model without the attention mechanism. These results are not directly comparable to those shown in Table 1, since this model uses different, randomly initialized embeddings, and they should be interpreted with caution. They do suggest, though, that the attention mechanism increases performance by about 10 percentage points (both for exact accuracy and for acc@161), and reduces median distance by about 150 km. This effect holds for both multi-task and single-task models.

Conclusion
In this paper, we propose a novel multi-task learning framework with attention for geolocation, combining label classification with regression over geo-coordinates.
We find that the granularity of the labels (and their correlation with the coordinates) has a direct impact on the effectiveness of MTL, with more labels, counter-intuitively, resulting in higher exact accuracy. Besides the labels commonly adopted in the literature, we also evaluate on a larger set of more specific locations, arguably a more realistic evaluation of geolocation for many real-life applications. This effect holds independent of whether the model is trained with attention or not.
The auxiliary regression task helps classification when using more fine-grained labels, which address specific rather than broad areas. Our models are optimized for exact accuracy rather than for acc@161; we report some of the best accuracy measures for TWITTER-WORLD, and competitive results for TWITTER-US.