Dense Node Representation for Geolocation

Prior research has shown that geolocation can be substantially improved by including user network information. While effective, this approach suffers from the curse of dimensionality, since networks are usually represented as sparse adjacency matrices of connections, which grow quadratically with the number of users. In order to incorporate this information, we therefore need to limit the network size, in turn limiting performance and risking sample bias. In this paper, we address these limitations by instead using dense network representations. We explore two methods to learn continuous node representations from either 1) the network structure with node2vec (Grover and Leskovec, 2016), or 2) textual user mentions via doc2vec (Le and Mikolov, 2014). We combine both methods with input from social media posts in an attention-based convolutional neural network and evaluate the contribution of each component on geolocation performance. Our method enables us to incorporate arbitrarily large networks in a fixed-length vector, without limiting the network size. Our models achieve results competitive with comparable state-of-the-art methods, but with far fewer model parameters, while being applicable to networks of virtually any size.


Introduction
Current state-of-the-art methods for user geolocation in social media rely on a number of data sources. Text is the main source, since people use location-specific terms (Salehi et al., 2017). However, research has shown that text should be augmented with network information, since people interact with others from their local social circles. Even though social media allows for worldwide connections, most people have a larger number of connections with people who live close by (from their school, workplace, or friend network). The most successful predictive models are therefore architectures that combine these different kinds of inputs (Rahimi et al., 2018; Ebrahimi et al., 2018).
However, incorporating network information is the computational bottleneck of these hybrid approaches: we want to represent the whole network, but we have to do so efficiently. We show that both are possible with dense representations, which indeed improve performance over previous sparse graph network representations. Following graph theory (Bondy et al., 1976), networks are typically represented as connections between entities in a square adjacency matrix, whose size corresponds to the number of users in the network. This means, though, that the matrix grows quadratically with the number of nodes/users. For large-scale social media analysis, where the number of users is often in the millions, this property creates a computational bottleneck: incorporating such a matrix in a neural architecture, for example through graph convolution (Kipf and Welling, 2017), easily increases the parameters by orders of magnitude, making training more expensive and increasing the risk of overfitting.
Previous work has therefore usually resorted to sampling methods. While sampling addresses the space issue, it necessarily loses a large amount of information, especially in complex networks, and introduces the risk of sampling biases.
Compounding the problem is the fact that adjacency matrices, despite their size, are very sparse, and do not represent information efficiently. This problem is analogous to sparse word and text representations, which were successfully replaced by dense embeddings (Mikolov et al., 2013a).
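To make the size argument concrete, the following minimal sketch compares the number of cells in a square adjacency matrix with the number of values in a fixed-length dense representation. The user count, average degree, and embedding size are illustrative assumptions, not figures from the data sets used in this paper.

```python
def dense_adjacency_entries(num_users: int) -> int:
    """Number of cells in a square adjacency matrix over num_users nodes."""
    return num_users * num_users


def embedding_entries(num_users: int, dim: int) -> int:
    """Number of stored values when each user is one fixed-length dense vector."""
    return num_users * dim


def density(num_users: int, num_edges: int) -> float:
    """Fraction of non-zero cells for an undirected graph without self-loops."""
    return 2 * num_edges / (num_users * num_users)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the orders of magnitude.
    users = 1_000_000
    avg_degree = 100
    edges = users * avg_degree // 2

    print("adjacency cells:", dense_adjacency_entries(users))
    print("embedding values (dim=128):", embedding_entries(users, 128))
    print("fraction of non-zero cells:", density(users, edges))
```

Under these assumed numbers the adjacency matrix has four orders of magnitude more cells than the embedding table has values, and only a tiny fraction of those cells is non-zero.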
We show how to incorporate dense network representations in two ways: 1) with an existing word2vec-based method based on network structure, node2vec (Grover and Leskovec, 2016), and 2) with a new, doc2vec-based method of document representations (Le and Mikolov, 2014) over the set of user mentions in the text of posts (M2V). Both allow us to represent mentions as dense vectors that encode the network interactions so that similar users will have similar representations. However, they capture different aspects of interactions: people we are connected with vs. people we mention.
We compare the geolocation performance of models that combine a text view with the network views of both node2vec and the doc2vec-based method. We measure the contribution of each component to performance. Our results show that dense network representations significantly improve over sparse network representations, but that mention representations (M2V) are more important than structure representations (node2vec).
Contributions. The contributions of the study are the following:
• We propose a document embeddings application that builds effective network representations through dense vectors, with no need for sampling procedures even in large networks;
• We show that the node representations can be tuned via two parameters which model the width and strength of their interactions.

Related work
Different kinds of data sources and methods can be used for the geolocation of users in social media. The most straightforward approach is to exploit the geographic information conveyed by the linguistic behavior of the user. The first studies relied on the idea of exploiting Location-Indicative Words (LIW) (Han et al., 2012, 2014). More recently, neural models have been applied to the same strategy (Rahimi et al., 2017; Tang et al., 2019), improving performance. The problem, however, can be modeled in different ways, including the different designs of the geographic areas to predict, such as grids (Wing and Baldridge, 2011), hierarchical grids (Wing and Baldridge, 2014), or different kinds of clusters (Han et al., 2012, 2014). In this paper, we test our models both on the set of geographic areas (i.e., labels) used in the shared task of the Workshop on Noisy User-generated Text, W-NUT (Han et al., 2016), and on the more fine-grained clusters obtained through the method of Fornaciari and Hovy (2019b). Geographic coordinates themselves can also be exploited, as Fornaciari and Hovy (2019a) showed in a multi-task model that jointly predicts continuous geocoordinates and discrete labels.
In general, geolocation with multi-source models is becoming more popular, as indicated by their increased use in state-of-the-art systems. Miura et al. (2016, 2017) considered text, metadata and network information, modeling the last as a combination of user and city embeddings. Similarly to our study, Rahimi et al. (2015) exploited the mentions, even though they used them to build undirected graphs. Ebrahimi et al. (2017, 2018) also used mentions to create an undirected graph, which they pruned and fed into an embedding layer followed by an attention mechanism, in order to create a network representation. The study of Rahimi et al. (2018) is an example of network segmentation for use in a neural model. They propose a Graph Convolutional Neural Network (GCN), where network and text data are vertically concatenated in a single channel, rather than employed as parallel channels into the same model. Do et al. (2017, 2018) present the Multi-Entry Neural Network (MENET), a model which, similarly to our study, employs node2vec and, separately, includes doc2vec as methods for extraction of document features.
These works represent the state-of-the-art benchmark with respect to the implementation of network views in the models. Other models (Ebrahimi et al., 2017, 2018; Do et al., 2018) also include metadata or other sources of information.

The data sets
We test our methods on three data sets: GEOTEXT (Eisenstein et al., 2010), TWITTER-US (Roller et al., 2012) and TWITTER-WORLD (Han et al., 2012). They contain English tweets, concatenated by author, with geographic coordinates associated with each author. GEOTEXT contains 10K texts, TWITTER-US 450K and TWITTER-WORLD 1.390M. The corpora are each split into training, development and test sets.

node2vec
Grover and Leskovec (2016) presented node2vec, a method to obtain dense node representations through a skip-gram model. Those representations, however, are obtained through a tiered sampling procedure. While that allows node2vec to explore large networks, by balancing the breadth and depth of the search for the neighbours' identification, it does introduce a random factor. In addition, since node2vec uses the word2vec skip-gram model (Mikolov et al., 2013c), the sequence of the nodes does not carry any meaning, essentially functioning as a further random selection of neighbours. In the geolocation scenario, though, network breadth is more important than depth, as similarity between entities grows with their proximity: we would like to preserve this information entirely, even and especially in large networks. For this reason, we follow the authors' settings for the detection of nodes' homophily, rather than their structural similarity in the network, and set the node2vec parameters p = 1 and q = 0.5 (Grover and Leskovec, 2016, p. 11).
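The role of p and q can be illustrated with a minimal sketch of the second-order biased random walk that node2vec uses to sample node sequences, here for an unweighted, undirected graph. The graph representation, function names, and fixed random seed are assumptions made for illustration; this is not the reference implementation.

```python
import random
from typing import Dict, List, Optional, Set

Graph = Dict[str, Set[str]]  # node -> set of neighbours (undirected, unweighted)


def transition_weights(graph: Graph, prev: str, curr: str,
                       p: float, q: float) -> Dict[str, float]:
    """Unnormalised weights for leaving `curr`, having arrived from `prev`."""
    weights = {}
    for nxt in sorted(graph[curr]):
        if nxt == prev:              # step back to the previous node
            weights[nxt] = 1.0 / p
        elif nxt in graph[prev]:     # stay at distance 1 from the previous node
            weights[nxt] = 1.0
        else:                        # move to distance 2 from the previous node
            weights[nxt] = 1.0 / q
    return weights


def biased_walk(graph: Graph, start: str, length: int,
                p: float = 1.0, q: float = 0.5,
                rng: Optional[random.Random] = None) -> List[str]:
    """Sample one walk of at most `length` nodes with return/in-out bias p, q."""
    rng = rng or random.Random(0)    # fixed seed: an assumption, for repeatability
    walk = [start]
    while len(walk) < length:
        curr = walk[-1]
        neighbours = sorted(graph[curr])
        if not neighbours:
            break
        if len(walk) == 1:           # first step has no previous node: uniform
            walk.append(rng.choice(neighbours))
            continue
        w = transition_weights(graph, walk[-2], curr, p, q)
        walk.append(rng.choices(list(w), weights=list(w.values()), k=1)[0])
    return walk


if __name__ == "__main__":
    toy = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
    print(transition_weights(toy, "a", "b", p=1.0, q=0.5))
    print(biased_walk(toy, "a", length=5))
```

With p = 1 and q = 0.5, a neighbour at distance 2 from the previous node receives twice the weight of the other candidates. The sampled walks would then be passed to a skip-gram model as if they were sentences.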

mentions2vec -M2V
We introduce a novel network representation method which does not depend on graph theory. We bypass the adjacency matrices and directly learn the social interactions from the content of social media messages. In many social media this is straightforward, as the users' mentions are introduced by the at sign '@', but in general other forms of Named Entity Recognition (NER) might be considered for the same purpose.
Concretely, we filter from the text everything but the user mentions and apply doc2vec to the resulting "texts" (Mikolov et al., 2013b). In effect, we represent users according to their communicative behavior directed at other users, in the temporal order in which these mentions appear. Therefore, similarly to node2vec, M2V creates a dense representation of the user interactions.
As pointed out earlier, node2vec is applied to a sequence of nodes sampled from the whole network that does not account for temporal ordering. In contrast, M2V does not address nodes, but mentions, which are themselves evidence of personal connection. The consequence of this choice is two-fold. First, there is no need for a sampling procedure: the whole set of interactions can be considered, even for wide networks. Second, the order of the mentions in the texts reflects the time sequence of the interactions, possibly encoding patterns of social behaviors.
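The filtering step of M2V can be sketched as follows: everything except the '@' mentions is dropped from a user's concatenated posts, and the order of appearance is preserved. The regular expression and the lowercasing of handles are assumptions made for illustration; the exact preprocessing is not specified here.

```python
import re
from typing import List

# An '@' that is not preceded by a word character, followed by a handle.
# The lookbehind skips e-mail-like strings such as "name@example.org".
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def mentions_only(text: str) -> List[str]:
    """Return the @-mentions of a text, in order of appearance.

    Handles are lowercased (an assumption) so that '@Anna' and '@anna'
    map to the same token.
    """
    return [m.lower() for m in MENTION_RE.findall(text)]


if __name__ == "__main__":
    posts = "hi @Anna, see @bob_1 and @Anna again"
    print(mentions_only(posts))
```

Each resulting token list would then be treated as one "document" per user and passed to a doc2vec implementation, so that repeated mentions and their ordering remain visible to the model.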

Labels
For our experiments we use two different sets of labels: those used in the W-NUT 2016 task (Han et al., 2016), and our own labels (Fornaciari and Hovy, 2019b). Our label identification method, called point2city (P2C), clusters all points closer than 25 km and associates each cluster with the closest town of at least 15K people. For further details, see Fornaciari and Hovy (2019b). The resulting labels are highly granular and precise in the identification of meaningful administrative regions.
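A simplified reading of the P2C idea can be sketched as below. It assumes single-linkage grouping (any two points closer than 25 km end up in the same cluster) and labels each cluster with the nearest town of at least 15K inhabitants, measured from the cluster centroid. These choices, the quadratic pairwise comparison, and the toy town data are assumptions for illustration; the actual procedure is described in Fornaciari and Hovy (2019b).

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_points(points: List[Point], max_km: float = 25.0) -> List[int]:
    """Single-linkage grouping via union-find; O(n^2), fine for a toy example."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) < max_km:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


def label_clusters(points: List[Point],
                   towns: Dict[str, Tuple[Point, int]],
                   min_pop: int = 15_000,
                   max_km: float = 25.0) -> List[str]:
    """Assign each point the nearest eligible town of its cluster centroid."""
    eligible = {name: xy for name, (xy, pop) in towns.items() if pop >= min_pop}
    ids = cluster_points(points, max_km)
    labels = {}
    for cid in set(ids):
        members = [pt for pt, k in zip(points, ids) if k == cid]
        # Plain averaging of coordinates: a rough centroid, adequate only
        # for small clusters away from the poles and the antimeridian.
        centroid = (sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members))
        labels[cid] = min(eligible,
                          key=lambda n: haversine_km(centroid, eligible[n]))
    return [labels[k] for k in ids]


if __name__ == "__main__":
    pts = [(45.46, 9.19), (45.50, 9.20), (41.90, 12.50)]
    # Hypothetical towns: name -> ((lat, lon), population).
    toy_towns = {"north_city": ((45.46, 9.19), 1_000_000),
                 "south_city": ((41.90, 12.50), 2_000_000),
                 "hamlet": ((45.50, 9.20), 500)}
    print(label_clusters(pts, toy_towns))
```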

Feature selection
The label sets were involved in the preprocessing as follows. Using only the training data, we first select the terms with frequency greater than or equal to 10 and 5 for TWITTER-US and TWITTER-WORLD, respectively. This choice is motivated by the different vocabulary size of the two data sets. Any term with frequency greater than 2 but below these thresholds that is associated with only one label is replaced with a label-representative token. Low-frequency terms found in more than one place are considered geographically ambiguous and discarded. This allows us to substantially reduce the vocabulary size while maintaining the useful geographic information of the huge number of low-frequency terms. Considering the terms' Zipf distribution (Powers, 1998), this procedure allows us to replace a small number of types, but a great number of tokens. Following Han et al. (2012), we further filter the vocabulary by applying Information Gain Ratio (IGR), selecting the terms with the highest values until we reach a manageable vocabulary size: 750K and 470K for TWITTER-US and TWITTER-WORLD.
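The frequency-based part of this procedure can be sketched as follows. The sketch counts token occurrences over the training documents, keeps frequent terms, maps mid-frequency single-label terms to a label token, and discards the rest. The token format `<LABEL_x>`, the use of token counts rather than document counts, and the dropping of terms with frequency 2 or lower are assumptions; the IGR filtering step is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Doc = Tuple[List[str], str]  # (tokens of one training user, gold label)


def build_term_map(docs: List[Doc], min_freq: int) -> Dict[str, Optional[str]]:
    """Map each training term to itself, to a label token, or to None (drop)."""
    freq: Counter = Counter()
    labels_of = defaultdict(set)
    for tokens, label in docs:
        for tok in tokens:
            freq[tok] += 1
            labels_of[tok].add(label)

    mapping: Dict[str, Optional[str]] = {}
    for tok, n in freq.items():
        if n >= min_freq:                             # frequent: keep as is
            mapping[tok] = tok
        elif n > 2 and len(labels_of[tok]) == 1:      # rare but unambiguous
            mapping[tok] = f"<LABEL_{next(iter(labels_of[tok]))}>"
        else:                                         # ambiguous or too rare
            mapping[tok] = None
    return mapping


def apply_term_map(tokens: List[str],
                   mapping: Dict[str, Optional[str]]) -> List[str]:
    """Rewrite a token list; terms unseen in training are dropped as well."""
    out = []
    for tok in tokens:
        new = mapping.get(tok)
        if new is not None:
            out.append(new)
    return out


if __name__ == "__main__":
    train = [(["the"] * 3 + ["duomo"] * 3, "milan"),
             (["the"] * 2 + ["colosseo"] + ["x"] * 3, "rome"),
             (["x"], "milan")]
    term_map = build_term_map(train, min_freq=5)
    print(term_map)
    print(apply_term_map(["the", "duomo", "x", "unseen"], term_map))
```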

Multiview Attention-based Convolutional Models
Our models are multi-view neural networks with three input channels: the text view, node2vec, and mentions2vec. The text view, in turn, contains two channels of convolutional/max pooling layers (with window sizes 4 and 8), followed by an attention mechanism. Both node2vec and mentions2vec are fed into a dense layer, followed by an attention mechanism. All the outputs are then concatenated and fed into a fully connected output layer. For a graphical representation, see Figure 1.
We report the performance metrics commonly considered in the literature: accuracy and acc@161, i.e., the accuracy within 161 km (100 miles) of the target point, which measures how often predictions fall within a reasonable distance of the target. We also report the mean and median distance between the predicted and the target points. We evaluate significance via bootstrap sampling, following Søgaard et al. (2014).
The code for the methods described in this paper is available at github.com/Bocconi-NLPLab.
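The evaluation metrics described above can be sketched as follows. The sketch assumes that each predicted label has been mapped to a coordinate pair (for example, a cluster centre) and uses the haversine formula with a mean Earth radius of 6371 km; the function and key names are hypothetical.

```python
import math
from statistics import median
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def geolocation_metrics(pred_labels: List[str], gold_labels: List[str],
                        pred_points: List[Point], gold_points: List[Point],
                        radius_km: float = 161.0) -> Dict[str, float]:
    """Accuracy, accuracy within radius_km, and mean/median error distance."""
    dists = [haversine_km(p, g) for p, g in zip(pred_points, gold_points)]
    n = len(dists)
    return {
        "accuracy": sum(p == g for p, g in zip(pred_labels, gold_labels)) / n,
        "acc@161": sum(d <= radius_km for d in dists) / n,
        "mean_km": sum(dists) / n,
        "median_km": median(dists),
    }


if __name__ == "__main__":
    # Toy predictions on the equator, for illustration only.
    metrics = geolocation_metrics(
        pred_labels=["a", "b", "c"], gold_labels=["a", "b", "d"],
        pred_points=[(0.0, 0.0)] * 3,
        gold_points=[(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)],
    )
    print(metrics)
```

Because the median is insensitive to a few very distant errors, it is reported alongside the mean, which such outliers can dominate.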

Results
Tables 1 and 2 show the performance of our models with and without N2V/M2V, in TWITTER-US and TWITTER-WORLD. Compared to the previous studies using only textual features, our basic model AttCNN shows comparable (TWITTER-WORLD) or better performance (TWITTER-US).
Therefore, we use our base AttCNN model as the baseline for the hybrid models AttCNN-N2V, AttCNN-M2V and AttCNN-M2V-N2V. We test two label sets (k-d tree and P2C), and the significance levels differ markedly between these two conditions.
In TWITTER-US, with coarse granularity labels, there is no performance improvement with dense node representations. In contrast, the models with M2V show a significant effect with fine granularity labels. In TWITTER-WORLD, the dense node representations significantly improve the models' performance, with both kinds of labels, even though AttCNN-N2V does not show improvements with P2C labels.
Figure 2: Labels' coordinates in TWITTER-US and TWITTER-WORLD

Discussion
Mentions2vec is a computationally affordable method for dense network representations, designed to capture social interactions. It proves very effective under most experimental conditions. The results suggest that dense representations of the users' network enhance geolocation performance, in particular when fine-grained labels identify specific geographic areas, rather than when a small number of labels refers to larger areas, where more diverse social communities can be found. Figure 2 shows the different density of labels identified by k-d tree and P2C. These settings are particularly useful for M2V, which considers the users' linguistic behavior. In contrast, node2vec does not lead to significant improvement in TWITTER-US, presumably because its sampling procedure does not allow homophily to be detected with sufficient clarity. Mentions2vec, which does not suffer from this limitation, appears to be more effective in that context. However, in general, the labels' granularity affects the usefulness of the methods. In TWITTER-US, using labels which cover large areas is detrimental for techniques which address geographical homophily, that is, relatively small cultural/linguistic areas. Even so, it makes sense to use these techniques, as in favourable conditions (for example in TWITTER-WORLD), they lead to remarkable performance improvements.

References
Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855-864. ACM.