The Language of Place: Semantic Value from Geospatial Context

There is a relationship between what we say and where we say it. Word embeddings are usually trained assuming that semantically-similar words occur within the same textual contexts. We investigate the extent to which semantically-similar words occur within the same geospatial contexts. We enrich a corpus of geolocated Twitter posts with physical data derived from Google Places and OpenStreetMap, and train word embeddings using the resulting geospatial contexts. Intrinsic evaluation of the resulting vectors shows that geographic context alone does provide useful information about semantic relatedness.


Introduction
Words follow geographic patterns of use. At times the relationship is obvious; we would expect to hear conversations about actors in and around a movie theater. Other times the connection between location and topic is less clear; people are more likely to tweet about something they love from a bar than from home, but vice versa for something they hate. Distributional semantics is based on the theory that semantically similar words occur within the same textual contexts. We question the extent to which similar words occur within the same geospatial contexts.
Previous work validates the relationship between the content of text and its physical origin. Geographically-grounded models of language enable toponym resolution (DeLozier et al., 2015), document origin prediction (Wing and Baldridge, 2011; Hong et al., 2012; Han et al., 2012b; Han et al., 2013; Han et al., 2014), and tracking regional variation in word use (Eisenstein et al., 2010; Eisenstein et al., 2014; Bamman et al., 2014; Huang et al., 2016). Our work differs from earlier models; rather than modeling language with respect to an absolute, physical location (like a geographic bounding box), we model language with respect to attributes describing a type of location (like amenity:movie theater or landuse:residential). This allows us to model the impact of geospatial context independently of language and region.
We enrich a corpus of geolocated tweets with geospatial information describing the physical environment where they were posted. We use the geospatial contexts to train geo-word embeddings with the skip-gram with negative sampling (SKIPGRAM) model (Mikolov et al., 2013) as adapted to support arbitrary contexts (Levy and Goldberg, 2014). We then demonstrate the semantic value of geospatial context in two ways. First, using intrinsic methods of evaluation, we show that the resulting geo-word embeddings themselves encode information about semantic relatedness. Second, we present initial results suggesting that because the embeddings are trained with language-agnostic features, they give a potentially useful signal about bilingual translation pairs.

Tokens in these tweets were normalized by converting to lowercase, replacing @-mentions, numbers, and URLs with special symbols, and applying the lexical normalization dictionary of Han et al. (2012a).
To enrich our collected tweets with geospatial features, we used publicly-available geospatial data from OpenStreetMap and the Google Places API. OpenStreetMap (OSM) is a crowdsourced mapping initiative. Users provide surveyed data such as administrative boundaries, land use, and road networks in their local area. In addition to geographic coordinates, each shape in the data set includes tags describing its type and attributes, such as shop:convenience and building:retail for a convenience store. We downloaded metro extracts for our 20 cities in shapefile format. To maximize coverage, we supplemented the OSM data with Google Places data from its web API, consisting of places tagged with one or more types (e.g. aquarium, ATM).
We enrich each geolocated tweet by finding the coordinates and tags for all OSM shapes and Google Places located within 50m of the tweet's coordinates. The enumerated tags become geographic contexts for training word embeddings. Figure 1 gives an example of geospatial data collected for a single tweet.
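The radius lookup described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: it assumes every shape is reduced to a single point (real OSM shapes are polygons and lines, which need a proper geometry library), and the function names, the `places` records, and the coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def tags_within(tweet_coord, places, radius_m=50):
    """Collect the tags of every place whose point lies within radius_m of the tweet."""
    lat, lon = tweet_coord
    tags = []
    for place in places:
        if haversine_m(lat, lon, place["lat"], place["lon"]) <= radius_m:
            tags.extend(place["tags"])
    return tags


# Hypothetical places: the first is roughly 17 m from the tweet, the second over 100 m.
places = [
    {"lat": 40.7580, "lon": -73.9855, "tags": ["amenity:theatre", "building:retail"]},
    {"lat": 40.7590, "lon": -73.9855, "tags": ["shop:convenience"]},
]
tweet = (40.7580, -73.9853)
nearby_tags = tags_within(tweet, places)  # only the first place's tags
```

A linear scan like this is only workable for small data; for a full corpus a spatial index would be the natural replacement.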

Geo-Word Embeddings
SKIPGRAM learns latent fixed-length vector representations v_w and v_c for each word and context in a corpus such that v_w · v_c is highest for frequently observed word-context pairs. Typically a word's context is modeled as a fixed-length window of words surrounding it. Levy and Goldberg (2014) generalized SKIPGRAM to accept arbitrary contexts as input. We use their software (word2vecf) to train word embeddings using geospatial contexts.
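For reference, the standard negative-sampling objective for one observed (word, context) pair, as given by Mikolov et al. (2013), can be written as below. The symbols k (the number of negative samples) and P_n (the noise distribution over contexts) are notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
\ell(w, c) = \log \sigma(v_w \cdot v_c)
  + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\left[ \log \sigma(-v_w \cdot v_{c_i}) \right],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

Maximizing the first term raises v_w · v_c for observed pairs, while the second term lowers it for randomly sampled contexts.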
word2vecf takes a list of (word, context) pairs as input. We train 300-dimensional geo-word embeddings denoted GEOD, where D indicates a radius, as follows. For each length-n tweet, we find all shapes within D meters of its origin and enumerate the length-m list of the shapes' geographic tags. The tweet in Figure 1, for example, has m = 10 tags as context when training GEO30 embeddings. Under our model, each token in the tweet shares the same contexts. Thus the input to word2vecf for training GEO30 embeddings produced by the example tweet is an m × n list of (word, context) pairs, pairing each of the n tokens with each of the m tags.
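The pair enumeration can be sketched in a few lines. The tokens and tags below are hypothetical and are not the Figure 1 example; the one-pair-per-line, space-separated output is an assumption about the input layout that word2vecf expects, so tags containing spaces would need to be rewritten first.

```python
from itertools import product


def word_context_pairs(tokens, tags):
    """Pair every token in a tweet with every geographic tag found near it."""
    return list(product(tokens, tags))


# Hypothetical tweet with n = 3 tokens and m = 2 nearby tags.
tokens = ["great", "movie", "tonight"]
tags = ["amenity:cinema", "building:retail"]

pairs = word_context_pairs(tokens, tags)  # m * n = 6 pairs
lines = [f"{word} {context}" for word, context in pairs]
print("\n".join(lines))
```

Because every token receives the full tag list, a tweet contributes the same contexts to each of its words.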

Intrinsic Evaluation
To determine the extent to which geo-word embeddings capture useful semantic information, we first evaluate their performance on three semantic relatedness and four semantic similarity benchmarks (listed in Table 1). In each case we calculate Spearman's rank correlation between numerical human judgements of semantic similarity or relatedness for a large set of word pairs, and the cosine similarity between the same word pairs under the geo-word embedding models.
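The evaluation procedure can be illustrated with a small standard-library sketch. The embeddings and human scores below are invented toy values, not benchmark data, and the helper names are hypothetical; a real evaluation would load the trained vectors and one of the benchmarks in Table 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy embeddings and toy human judgements (hypothetical values).
emb = {"movie": [1.0, 0.1], "film": [0.9, 0.2], "burger": [0.1, 1.0], "fries": [0.3, 0.9]}
human = [("movie", "film", 9.0), ("burger", "fries", 8.0), ("movie", "burger", 2.0)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in human]
human_scores = [score for _, _, score in human]
rho = spearman(model_scores, human_scores)  # the two rankings agree on this toy data
```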
To understand the impact of geographic contexts on the embedding model, we compare GEO15, GEO30, and GEO50 geo-word embeddings to the following baselines:
TEXT5: Using our corpus of geolocated tweets, we train word embeddings with word2vecf using traditional linear bag-of-words contexts with window width 5.
GEO30+TEXT5: We also evaluate the impact of combining textual and geospatial contexts. We train a model over the geolocated tweets corpus using both the geospatial contexts from GEO30 and the textual contexts from TEXT5.
RAND30: Because our GEOD models assign the same geospatial contexts to every token in a tweet, we need to rule out the possibility that GEOD models are simply capturing relatedness between words that frequently appear in the same tweets, like movie and theater. We implement a random baseline model that captures similarities arising from tweet co-location alone. For each tweet, we enumerate the geospatial tags (i.e. contexts) for shapes within 30m of the tweet origin. Then, before feeding the m × n list of (word, context) pairs to word2vecf for training, we randomly map each tag type to a different tag type within the context vocabulary. For example, route:bus could be mapped to amenity:bank for input to the model. We redo the random tag mapping for each tweet. In this way, vectors for words that always appear together within tweets are trained on the same set of associated contexts. But the randomly mapped contexts do not model the geographic distribution of words.
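One possible reading of the RAND30 construction is sketched below. The text does not say whether the per-tweet mapping is a permutation of the vocabulary or an independent draw for each tag; this sketch assumes an independent draw that never maps a tag to itself. All names and data are hypothetical.

```python
import random


def random_tag_mapping(tweet_tags, vocabulary, rng):
    """Draw a fresh mapping for one tweet: each distinct tag goes to a different tag."""
    mapping = {}
    for tag in sorted(set(tweet_tags)):  # sorted so a seeded rng gives repeatable draws
        candidates = [t for t in vocabulary if t != tag]
        mapping[tag] = rng.choice(candidates)
    return mapping


def rand_pairs(tokens, tweet_tags, vocabulary, rng):
    """Build the (word, context) pairs for one tweet using randomly remapped tags."""
    mapping = random_tag_mapping(tweet_tags, vocabulary, rng)
    remapped = [mapping[tag] for tag in tweet_tags]
    return [(word, context) for word in tokens for context in remapped]


# Hypothetical context vocabulary and tweet.
vocabulary = ["route:bus", "amenity:bank", "amenity:cinema", "landuse:residential"]
tokens = ["movie", "theater"]
tweet_tags = ["amenity:cinema", "route:bus"]

pairs = rand_pairs(tokens, tweet_tags, vocabulary, random.Random(0))
```

Tokens in the same tweet still share identical contexts, so co-occurrence within tweets is preserved, while the link between a word and the actual type of place is broken because the mapping is redrawn for every tweet.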

Intrinsic Evaluation Results
Qualitatively, we find that strongly locational words, like #nyc, and words frequently associated with a type of place, like burger and baseball, tend to have the most semantically and topically similar neighbors (Table 2). Function words and others with geographically independent use (e.g. man) have less semantically-similar neighbors.
We can also qualitatively examine the geographic context embeddings v_c output by word2vecf. Recall that the SKIPGRAM objective function pushes the vectors for frequently co-occurring v_c and v_w close to one another in a shared vector space. Thus we can find the words (Table 4) and other contexts (Table 3) most closely associated with each geographic context on the basis of cosine similarity. We find qualitatively that the word-context and context-context associations make intuitive sense.
In our intrinsic evaluation (Table 1), geo-word embeddings outperformed the random baseline in six of seven benchmarks. These results are significant (p < .01) based on the Minimum Required Difference for Significance test of Rastogi et al. (2015). This indicates that geospatial information does provide some useful semantic information. However, the GEOD embeddings underperformed the TEXT5 embeddings in all cases. And although the combined GEO30+TEXT5 embeddings outperformed the TEXT5 embeddings in 2 of 3 semantic relatedness benchmarks, the results were significant only in the case of the MEN dataset (p < .05). This suggests, inconclusively, that geospatial contextual information may improve the semantic relatedness content of word embeddings in some cases, but that geospatial context is no substitute for textual context in capturing semantic relationships. Nevertheless, geospatial context does provide some signal for semantic relatedness that may be useful in combination with other multimodal signals. Finally, it should be noted that the Spearman correlation achieved by all models in our tests is significantly below the current state-of-the-art; this is to be expected given the relatively small size of our training corpus (approx. 400M tokens).
Table 1: We calculate the Spearman correlation between pairwise human semantic similarity (sim) and relatedness (rel) judgements, and cosine similarity of the associated word embeddings, over 7 benchmark datasets. (Table body not recovered; the surviving fragment cites Hill et al., 2015.) Note 1 indicates a significant difference between TEXT5 and GEO30+TEXT5 results (p < 0.05; Rastogi et al., 2015). Note 2 indicates RAND30 results are significantly lower than any GEO or WORD embedding results (p < 0.01; Rastogi et al., 2015).
Table 2: Most similar words based on cosine similarity of embeddings trained using geographic contexts within a radius of 30m (GEO30) and textual contexts with a window of 5 words (TEXT5).
Table 3: Most similar contexts, based on cosine similarity of the associated GEO30 context vectors.

Translation Prediction
Our intrinsic evaluation established that geospatial context provides semantic information about words, but it is weaker than information provided by textual context. So a natural question to ask is whether geospatial context can be useful in any setting. One potential strength of word embeddings trained using geospatial contexts is that the features are language-independent. Thus we infer that training geo-word embeddings jointly over two languages might yield translation pairs that are close to one another in vector space. This type of model could be applicable in a low-resource language setting where large parallel texts are unavailable but geolocated text is. To test this hypothesis, we collect an additional 236k geolocated Turkish tweets and re-train GEO30, TEXT5, and GEO30+TEXT5 vectors on the larger set.
Similar to Irvine and Callison-Burch (2013), we use a supervised method to make a binary translation prediction for Turkish-English word pairs. We build a dataset of positive Turkish-English word pairs by taking all Turkish words in a Turkish-English dictionary (Pavlick et al., 2014) that appear in our vector vocabulary and do not translate to the same word in English (528 words in total). We add these words and their translations to our dataset as positive examples. Then, for each Turkish word in the dataset we also select a random English word and add this pair as a negative example. Our resulting data set has 1056 word pairs, 50% of which are correct translations. We split this into 80% train and 20% test examples.
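The dataset construction can be sketched as follows. The five-entry dictionary is a toy stand-in for the 528-word dictionary, the function name is hypothetical, and two choices are assumptions not stated in the text: each Turkish word has exactly one translation, and the random negative is never the correct translation.

```python
import random


def build_dataset(dictionary, english_vocab, rng, train_fraction=0.8):
    """Create one positive and one random negative pair per Turkish word, then split."""
    examples = []
    for tr_word, en_word in dictionary.items():
        examples.append((tr_word, en_word, 1))  # correct translation
        negative = rng.choice([w for w in english_vocab if w != en_word])
        examples.append((tr_word, negative, 0))  # random incorrect pairing
    rng.shuffle(examples)
    split = int(train_fraction * len(examples))
    return examples[:split], examples[split:]


# Toy Turkish-English dictionary.
dictionary = {"kitap": "book", "su": "water", "ev": "house", "kedi": "cat", "köpek": "dog"}
english_vocab = sorted(set(dictionary.values()))

train, test = build_dataset(dictionary, english_vocab, random.Random(0))
```

With 528 dictionary entries the same procedure yields the 1056 balanced pairs described above.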
We construct a logistic regression model, where the input for each word pair is the difference between its Turkish and English word vectors, v_f − v_e. We evaluate the results using precision, recall, and F-score of positive translation predictions. Table 5 gives our results, which we compare to a model that makes a random guess for each word pair. Combining geographic and textual contexts to train embeddings leads to better translation performance than using textual or geospatial contexts in isolation. In particular, with a seed dictionary of just 528 Turkish words and monolingual text of just 236k tweets, our supervised method is able to predict correct translation pairs with 67.8% precision. While the results are not significant under McNemar's test (p=0.07), they suggest that geospatial contextual information may provide a useful signal for bilingual lexicon induction when used in combination with other methods, as in Irvine and Callison-Burch (2013).
Table 5: We make a binary translation prediction for Turkish-English word pairs using their embeddings in a simple logistic regression model.
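A minimal sketch of the classifier is given below, using plain batch gradient descent in place of whatever solver was actually used. The two-dimensional vectors are invented so that the toy problem is linearly separable, and the learning rate, epoch count, and function names are all assumptions. The scores are computed on the training pairs purely to show the mechanics; they say nothing about held-out performance.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    w, b, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def precision_recall_f1(gold, predicted):
    """Precision, recall, and F-score of the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy (v_f, v_e, label) triples: hypothetical Turkish and English vectors.
pairs = [
    ([0.5, 0.4], [0.4, 0.5], 1), ([0.3, 0.6], [0.3, 0.5], 1), ([0.7, 0.2], [0.5, 0.2], 1),
    ([0.1, 0.9], [1.0, 0.1], 0), ([0.0, 0.8], [1.0, 0.1], 0), ([0.1, 1.0], [0.9, 0.1], 0),
]
features = [[f - e for f, e in zip(v_f, v_e)] for v_f, v_e, _ in pairs]  # v_f - v_e
labels = [label for _, _, label in pairs]

w, b = train_logreg(features, labels)
predicted = [1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0
             for x in features]
precision, recall, f1 = precision_recall_f1(labels, predicted)
```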

Conclusion
Typically word embeddings are generated using the text surrounding a word as context from which to derive semantic information. We explored what happens when we use the geospatial context (information about the physical location where text originates) instead. Intrinsic evaluation of word embeddings trained over a set of geolocated Twitter data, using geospatial information derived from OpenStreetMap and the Google Places API as context, indicated that the geospatial context does encode information about semantic relatedness. We also suggested an extrinsic evaluation method for geo-word embeddings: predicting translation pairs without bilingual parallel corpora. Our experiments suggested that while geospatial context is not as semantically-rich as textual context, it does provide useful semantic relatedness information that may be complementary as part of a multimodal model. As future work, another extrinsic evaluation task that may be appropriate for geo-word embeddings is geolocation prediction.