Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting

Dialects are one of the main drivers of language variation, a major challenge for natural language processing tools. In most languages, dialects exist along a continuum, and are commonly discretized by combining the extent of several preselected linguistic variables. However, the selection of these variables is theory-driven and itself insensitive to change. We use Doc2Vec on a corpus of 16.8M anonymous online posts in the German-speaking area to learn continuous document representations of cities. These representations capture continuous regional linguistic distinctions, and can serve as input to downstream NLP tasks sensitive to regional variation. By incorporating geographic information via retrofitting and agglomerative clustering with structure, we recover dialect areas at various levels of granularity. Evaluating these clusters against an existing dialect map, we achieve a match of up to 0.77 V-score (harmonic mean of cluster completeness and homogeneity). Our results show that representation learning with retrofitting offers a robust general method to automatically expose dialectal differences and regional variation at a finer granularity than was previously possible.


Introduction
People actively use dialects to mark their regional origin (Shoemark et al., 2017a,b), making them one of the main drivers of language variation. Accounting for this variation is a challenge for NLP systems (consider, for example, the failed attempts of people with accents to use dialogue systems). Accounting for variation can significantly improve performance in machine translation (Mirkin and Meunier, 2015; Östling and Tiedemann, 2017) and geolocation (Rahimi et al., 2017a,b), and help personalize applications and search. However, regional variation involves a complex set of grammatical, lexical, and phonological features, all of them continuously changing. Consequently, dialects are not static discrete entities, but exist along a continuum in most languages. Variational linguistics and dialectology typically discretize this continuum by using a set of preselected features (Trudgill, 2000), often including outdated vocabulary. The resulting dialect areas are highly accurate, but extremely time-consuming to construct, and inflexible: the largest and to date most comprehensive evaluation of German dialects, the Wenker-Atlas (Rabanus et al., 2010), is almost 150 years old and took decades to complete. Work in dialectometry has shown that computational methods, such as clustering (Nerbonne and Heeringa, 1997; Prokić and Nerbonne, 2008; Szmrecsanyi, 2008, inter alia) and dimensionality reduction (Nerbonne et al., 1999; Shackleton Jr, 2005), can instead be used to identify dimensions of variation in manually constructed discrete feature vectors. However, the success of such approaches depends on precise prior knowledge of variation features (Lameli, 2013).
Distributed representations, as unsupervised methods, can complement these methods by capturing similarities between words and documents (here: cities) along various latent dimensions, including syntactic, semantic, and pragmatic aspects. These representations are therefore more compact and less susceptible to data sparsity than latent variable models, and allow us to represent a larger number of possible clusters than feature-based representations (cf. Luong et al. (2013)). These properties also allow us to measure similarities on a continuous scale, which makes representation learning especially useful for the study of regional language variation along several linguistic dimensions.
We use a corpus of 16.8 million anonymous German online posts, cast cities as document labels, and induce document embeddings for these cities via Doc2Vec (Le and Mikolov, 2014). We first show that the resulting city embeddings capture regional linguistic variation at a more fine-grained, continuous level than previous approaches (Bamman et al., 2014; Östling and Tiedemann, 2017), which operated at the state or language level. 1 We also show that the embeddings can serve as input to a geolocation task, outperforming a bag-of-words model, and producing competitive results.
However, such representations are susceptible to linguistic data bias, ignore geographic factors, and are hard to evaluate with respect to their fit with existing linguistic distinctions. We address these problems by including geographic information via retrofitting (Faruqui et al., 2015; Hovy and Fornaciari, 2018): we use administrative region boundaries to modify the city embeddings, and evaluate the resulting vectors in a clustering approach to discover larger dialect regions.
In contrast to most dialectometric approaches (Nerbonne et al., 1999; Prokić and Nerbonne, 2008), and in line with common NLP practice (Doyle, 2014; Grieve, 2016; Huang et al., 2016; Rahimi et al., 2017a), we also evaluate the clustered dialect areas quantitatively. Rather than testing the geographic extent of individual words against known dialect areas (Doyle, 2014), we compare the match of entire geographic regions to a recent German dialect map (Lameli, 2013). We use cluster evaluation metrics to measure how well our clusters match the known dialect regions.
The results show that our method automatically captures existing (manually determined) dialect distinctions well, and even goes beyond them, in that it also allows for a more fine-grained qualitative analysis. Our research shows that representation learning is well suited to the study of language variation, and demonstrates the potential of incorporating non-linguistic information via retrofitting. For an application of our methodology to a larger Twitter data set over multiple languages, see Hovy et al. (In Preparation).

1 Han et al. (2014) used city-level representations, but did not apply them to the identification of dialect areas.
Contributions In this paper, we contribute linguistic insights, performance improvements, and algorithmic extensions. We show:
1. how Doc2Vec can be used to learn distributed representations of cities that capture continuous regional linguistic variation; the approach is general and can be applied to other languages and data sets;
2. that the city representations capture enough regional distinction to produce competitive results in geolocation, even though this was not the main focus;
3. that retrofitting can be used to incorporate geographic information into the embeddings, extending the original algorithm's applications;
4. that the clusterings match a sociolinguistic dialect map (Lameli, 2013), measured by homogeneity, completeness, and their harmonic mean (V-measure), reaching a V-measure of 0.77 and beating an informed baseline.
We publicly release the data, code, and map files for future research at https://github.com/Bocconi-NLPLab.

Source
We use data from the social media app Jodel, a mobile chat application that lets people anonymously talk to other users within a 10km radius around them. The app was first published in 2014, and has seen substantial growth since. It has several million users in the German-speaking area (GSA), and is expanding to France, Italy, Scandinavia, Spain, and lately the United States. Users can post and reply to posts within the radius around their own current location. All users are anonymous. Answers to an initial post are organized in threads. The vast majority of posts on Jodel are written in standard German, but since it is conceptually spoken language (Koch and Oesterreicher, 1985; Eisenstein, 2013), regional and dialectal forms are common, especially in Switzerland, Austria, and rural areas of Southern Germany. The data therefore reflects current developments in language dynamics to mark regionality (Purschke, 2018). We used a publicly available API to collect data between April and June 2017 from 123 initial locations: 79 German cities with a population over 100k people, all 17 major cities in Austria ("Mittel- und Oberzentren"), and 27 cities in Switzerland (the 26 cantonal capitals plus Lugano in the very south of the Italian-speaking area). Due to the 10km radius, posts from other nearby cities get collected as well. We include these additional cities if they have more than 200 threads, thereby growing the total number of locations. Ultimately, this results in 408 cities (333 in Germany, 27 in Austria, 48 in Switzerland). The resulting locations are spread relatively evenly across the entire GSA, albeit with some gaps in parts of Germany with low population density. In total, we collect 2.3 million threads, or 16.8 million posts.
We treat each thread as a document in our representation learning setup, labeled with the name of the city in which the thread took place.

Preprocessing
We preprocess the data to minimize vocabulary size, while maintaining regional discriminative power. We lowercase the input and restrict ourselves to content words, based on the part-of-speech (nouns, verbs, adjectives, adverbs, and proper names), using the spaCy tagger.
Prior studies showed that many regionally distributed content words are topically driven (Eisenstein et al., 2010; Salehi et al., 2017). People talk more about their own region than about others, so the most indicative words include place names (the city itself, or specific places within that city), and other local culture terms, such as sports teams. We try to minimize the effect of such regional topics by excluding all named entities, as well as the names of all cities in our list, to instead focus on dialectal lexical variation.
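A minimal sketch of this content-word filter in Python follows. The spaCy model name (de_core_news_sm) and the exact tag set are our assumptions, since the text only names the POS classes kept; the city list is a placeholder.

```python
import spacy

# Assumption: spaCy's small German model and universal POS tags.
nlp = spacy.load("de_core_news_sm")

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
CITY_NAMES = {"münchen", "wien", "zürich"}  # placeholder for the full city list

def content_words(text):
    doc = nlp(text)
    return [tok.lower_ for tok in doc
            if tok.pos_ in CONTENT_POS
            and tok.ent_type_ == ""            # drop named entities
            and tok.lower_ not in CITY_NAMES]  # drop known city names
```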
We use NLTK to remove German stop words, and to lemmatize the words. While this step removes the inflectional patterns found in German, which could have regional differences, we focus here on lexical differences, and lemmatization greatly reduces vocabulary size, leading to better representations. While both POS tagging and NER can introduce noise, they are more flexible and exhaustive than pre-defined word lists. Finally, we concatenate collocations based on the PMI of adjacent words in the cleaned corpus. The average instance length is about 40 words after cleaning.
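The collocation step can be sketched with gensim's Phrases; we use NPMI scoring as a normalized stand-in for the PMI criterion described above, and the min_count and threshold values are assumptions.

```python
from gensim.models.phrases import Phrases, Phraser

# Detect adjacent-word collocations on the cleaned, tokenized corpus.
phrases = Phrases(cleaned_sentences, min_count=10,
                  threshold=0.5, scoring="npmi")
bigram = Phraser(phrases)

# High-scoring adjacent pairs become single tokens (e.g., "sankt_gallen").
collocated = [bigram[sentence] for sentence in cleaned_sentences]
```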

Data Statement
The corpus was selected to represent informal, everyday online speech across the German-speaking area in Europe, and to capture regional distinctions. The data was acquired via the publicly available API. The language is mainly standard German, but with a substantial amount of dialectal entries, mainly from southern German varieties, as well as some French and Italian, which could not be removed without losing dialect. The platform is anonymous, but mainly used by young people, as indicated by a prevalence of college-related topics. It contains spontaneous, written, asynchronous interactions in a chat platform organized by threads. Anonymous reference to prior interlocutors is possible. The app is mainly used to discuss everyday topics, entertainment, flirting, venting, and informal surveys.


Representation Learning
To learn both word and city representations, we use the Doc2Vec implementation of paragraph2vec (Le and Mikolov, 2014) in gensim. The model is conceptually similar to word2vec (Mikolov et al., 2013), but also learns document label representations (in our case, city names), embedded in the same space as the words. We use distributed bag-of-words (DBOW) training. The model parameters are fitted by predicting randomly sampled context words from a city vector. The objective is to maximize the log probability of the prediction,

$$\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid k)$$

where k is a city, and W = w_{i...N} a sequence of N randomly sampled words from the thread (see Figure 1 for a schematic representation).
During training, semantically similar words end up closer together in vector space, as do words "similar" to a particular city, and cities that are linguistically similar to each other.
Due to the nature of our task, we unfortunately do not have gold data (i.e., verified cluster labels) to tune parameters. We therefore follow the settings described in Lau and Baldwin (2016), and set the vector dimensions to 300, window size to 15, minimum frequency to 10, negative samples to 5, downsampling to 0.00001, and run for 10 iterations.
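With gensim, the training setup can be sketched as follows; the variables threads (token lists) and cities (one city label per thread) stand in for the preprocessed corpus and are our own names.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each thread is a document tagged with its city name.
corpus = [TaggedDocument(words=tokens, tags=[city])
          for tokens, city in zip(threads, cities)]

model = Doc2Vec(corpus,
                dm=0,            # DBOW training
                vector_size=300,
                window=15,
                min_count=10,
                negative=5,
                sample=1e-5,
                epochs=10)

# City embeddings live in the same space as word embeddings
# (model.docvecs in older gensim versions).
city_vector = model.dv["Wien"]
```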

Visualization
In order to examine whether the city embeddings capture the continuous nature of dialects, we visualize them. If our assumption holds, we expect to see gradual continuous change between cities and regions.
We use non-negative matrix factorization (NMF) on the 300-dimensional city representation matrix to find the first three principal components, normalize each of them to values between 0.0 and 1.0, and interpret those as RGB values. I.e., we assume the first principal component signals the amount of red, the second component the amount of green, and the third component the amount of blue. This triple can be translated into a single color value. E.g., 0.5 red, 0.5 green, and 0.5 blue translates into medium gray. This transformation translates city representations into color values that preserve linguistic similarities. Similar hues correspond to similar representations, and therefore, by extension, linguistic similarity.
NMF decomposes a given i-by-k matrix X into d components, represented by an i-by-d row matrix V and a d-by-k column matrix H, such that X ≈ VH. In our case, d = 3. Since we are only interested in a reduced representation of the cities, V, we discard H.
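A sketch of the color mapping with scikit-learn; note that NMF requires non-negative input, so we shift the embedding matrix first (the shift is our assumption, since the paper does not spell out this step).

```python
from sklearn.decomposition import NMF

# Shift the city embedding matrix to be non-negative (assumption).
X = city_matrix - city_matrix.min()

# V: one 3-dimensional row per city.
V = NMF(n_components=3, random_state=0).fit_transform(X)

# Normalize each component to [0, 1] and read the rows as (R, G, B).
rgb = (V - V.min(axis=0)) / (V.max(axis=0) - V.min(axis=0))
```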
The result is indeed a continuous color gradient over the cities with more than 200 threads; see Figure 2. The circle size for every city indicates the relative number of threads per location.
In order to get reliable statistics, we restrict ourselves to cities with more than 200 observed conversations (about 2.1M conversations: 1.82M in Germany, 173k in Austria, and 146k in Switzerland). Including cities with fewer conversations adds more data points, but induces noise, as many of those representations are based on too little data, resulting in inaccurate vectors.
Even without in-depth linguistic analysis, we can already see differences between Switzerland (green color tones) and the rest of the GSA. Within Switzerland, we see a distinction between the German-speaking area (lighter green) and the French-speaking area around Lausanne and Geneva (darker tones). On the other hand, we find a continuous transition from red over purple to bluish colors in Germany and Austria. These gradients largely correspond to the dimensions North→South(East) (red→blue) and West→East (intense→pale tones). These dimensions mirror the well-known strong linguistic connection between the southeast of Germany and Austria, and between most cities in the north of Germany.

Clustering
The visualization in the last section already suggests that we capture the German dialect continuum, and the existence of larger dialect areas. However, in order to evaluate against existing dialect maps, we need to discretize the continuous representation. We use hierarchical agglomerative clustering (Ward Jr, 1963) with Ward linkage, Euclidean affinity, and structure to discover dialect areas. We compare the agglomerative clustering results to a k-means approach.
Agglomerative clustering starts with each city in its own cluster, and recursively merges pairs into larger clusters, until we have reached the required number. Pairs are chosen to minimize the increase in linkage distance (for Ward linkage, this measure is the new cluster's variance). We use cities with 50-199 threads (66 cities) to tune the clustering parameters (linkage function and affinity), and report results obtained on cities with more than 200 threads.
Since the city representations are indirectly based on the words used in the respective cities, the clustering essentially captures regional similarity in vocabulary. If the clusters we find in our data match existing dialect distinctions, this provides a compelling argument for the applicability of our methodology.

Including geographic knowledge
While we capture regional variation by means of linguistic similarity here, such variation also includes a geographic component. The embeddings we learn do not include this component, though. This can produce undesirable clustering results: large cities, due to their "melting-pot" function, often use similar language, so their representations are close in embedding space. This is an example of Galton's problem (Naroll, 1961): Munich and Berlin are not linguistically similar because they belong to the same dialect, but due to an outside factor (in this case, shared vocabulary through migration).
Structure To introduce geographic structure into the clustering, we use a connectivity matrix over the inverse distance between cities (i.e., geographically close cities have a higher number), which is used as a weight during merging. This weight makes close geographic neighbors more likely to be merged before distant cities are.
Note, though, that this geographic component does not predetermine the clustering outcome: geographically close cities that are linguistically different still end up in separate clusters, as we will see. The Spearman ρ correlation between geographic distance and the cosine similarity of cities is positive, but does not fully explain the similarities (Austria 0.40, Germany 0.42, Switzerland 0.72). The stronger correlation for Switzerland suggests a localized effect of regional varieties. Geographic structure in clustering does, however, provide speedups and more stable, regionally coherent clustering solutions than unstructured clustering. We will see this in comparison to k-means.
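In scikit-learn terms, the structured clustering can be sketched as follows; we approximate the inverse-distance connectivity with a k-nearest-neighbor graph over city coordinates, which is a simplification of the weighting described above, and n_neighbors is our assumption.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Geographic structure: only nearby cities are candidates for early merges.
connectivity = kneighbors_graph(city_coords, n_neighbors=10,
                                include_self=False)

labels = AgglomerativeClustering(
    n_clusters=16,              # e.g., the best-scoring solution below
    linkage="ward",             # Ward linkage implies Euclidean affinity
    connectivity=connectivity,
).fit_predict(city_vectors)
```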
Retrofitting Faruqui et al. (2015) introduced retrofitting of vectors based on external knowledge. We take the idea, proposed for word vectors and semantic resources, and extend it following Hovy and Fornaciari (2018) to apply to city representations and membership in geographic regions. We construct a set Ω of city tuples (c_i, c_j) such that there exists a region R where c_i ∈ R and c_j ∈ R. We use the NUTS2 regions (Nomenclature of Territorial Units for Statistics, a Eurostat geocoding standard) to determine R. In Germany, NUTS2 has 39 regions, corresponding to government regions. To include the geographic knowledge, we retrofit the existing city embeddings C. The goal is to make the representations of cities that are in the same region more similar to each other than to cities in other regions, resulting in a retrofit embedding matrix Ĉ. For a retrofit city vector ĉ_i, the update equation is

$$\hat{c}_i = \alpha c_i + \beta \sum_{j:(i,j)\in\Omega} \frac{\hat{c}_j}{N}$$

where c_i is the original city vector, N is the number of same-region neighbors of c_i, and α and β are tradeoff parameters that control the influence of the geographic vs. the linguistic information. See Faruqui et al. (2015) and Hovy and Fornaciari (2018) for more details.
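The update can be implemented in a few lines of NumPy; the number of iterations and the choice α = β = 0.5 are our assumptions, as the tradeoff values are left open here.

```python
import numpy as np

def retrofit(C, same_region_pairs, alpha=0.5, beta=0.5, n_iter=10):
    """Sketch of the geographic retrofitting update above.

    C: (n_cities, dim) original embedding matrix.
    same_region_pairs: the set Omega of same-region city index pairs.
    """
    neighbors = [[] for _ in range(len(C))]
    for i, j in same_region_pairs:
        neighbors[i].append(j)
        neighbors[j].append(i)

    C_hat = C.copy()
    for _ in range(n_iter):
        for i, nbrs in enumerate(neighbors):
            if nbrs:  # cities without same-region neighbors stay put
                # alpha weights the original vector, beta the mean of
                # the same-region neighbors' (already retrofit) vectors
                C_hat[i] = alpha * C[i] + beta * C_hat[nbrs].mean(axis=0)
    return C_hat
```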

Evaluation
In order to evaluate our methodology, we measure both its ability to match German dialect distinctions, and the performance of the learned embeddings in a downstream geolocation task. Figure 3 provides examples of different clustering solutions after retrofitting. Note that colors are assigned randomly and do not correspond to the linguistic similarity in Figure 2. Switzerland immediately forms a separate cluster (the 2-cluster solution separates Switzerland vs. everything else), and further clusters first separate out more southern German varieties before distinguishing the northern varieties. This is in line with sociolinguistic findings about the ubiquity of dialect use (Plewnia and Rothe, 2012): dialects are more common in the south, leading to more varied regions there, which is reflected in our clustering. Due to space constraints, we have to omit further clustering stages, but we find linguistically plausible solutions beyond the ones shown here. For an in-depth qualitative analysis of the different clustering solutions and the sociodemographic and linguistic factors, see Purschke and Hovy (In Preparation).
Dialect match We use the map of German dialects and their regions by Lameli (2013) (see Figure 4) and its 14 large-scale areas as the gold standard to measure how well the various clustering solutions correspond to the dialect boundaries. This map is based on an empirical quantitative analysis of German dialects, albeit on data from the 19th century, and therefore naturally on different domains and media than our study.
Note that we can only assess cities within modern-day Germany (clusters formed in Austria or Switzerland are not covered). We therefore rerun the clusterings on the subset of German cities, so results differ slightly from the clusters induced on the entire GSA. We report homogeneity (whether a cluster contains only data points from a single dialect region) and completeness (whether all data points of a dialect region end up in the same cluster), as well as their harmonic mean, the V-score. These correspond to the precision, recall, and F1 scores used in classification. Note that we will not be able to faithfully reconstruct Lameli's distinctions, since his map contains overlapping regions, whose data points therefore already violate perfect homogeneity.
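All three metrics are available in scikit-learn. In this sketch, gold holds each German city's dialect region from Lameli's map and pred the induced cluster label per city; both are hypothetical variable names.

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

h = homogeneity_score(gold, pred)    # clusters contain one region only?
c = completeness_score(gold, pred)   # regions stay in one cluster?
v = v_measure_score(gold, pred)      # harmonic mean of h and c
```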
The outline of dialect regions in Lameli's map is based on the NUTS2 regions, so we compare all clustering solutions to an informed baseline that assigns each city the NUTS2 region it is located in. Except for regions in dialect overlaps, each NUTS region is completely contained in one dialect region, so the baseline can achieve almost perfect homogeneity.
Downstream task geolocation For the geolocation task, we randomly select 100 cities with at least 200 threads from each country (7 in Austria, 82 in Germany, 11 in Switzerland). We then collect threads with at least 100 words from these cities for each country (11,240 threads from Austria, 137,081 from Germany, and 18,590 from Switzerland). Each thread is a training instance, i.e., we have 166,911 instances. We use the Doc2Vec model from before to induce a document representation for each instance and use the vector as input to a logistic regression model that predicts the city name.
For testing, we sample 5,000 threads from the same cities (maintaining the same proportional distribution and word count constraint), but from a separate data set, collected two months after the original sample. We again use the Doc2Vec model to induce representations, and evaluate the classifier on this data.
We measure accuracy, accuracy at 161km (100 miles), and the median distance between prediction and target.
We compare the model with Doc2Vec representations to a bag-of-words (BOW) model with the same parameters. Since that representation is based directly on words, we cannot apply retrofitting to it. As a baseline, we report the most-frequent-city prediction.
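A sketch of the geolocation pipeline and its metrics; model is the trained Doc2Vec model from above, and city_km is a hypothetical helper returning the distance in kilometers between two city names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Infer a document vector for each thread (lists of tokens).
X_train = np.array([model.infer_vector(t) for t in train_threads])
X_test = np.array([model.infer_vector(t) for t in test_threads])

# Logistic regression predicts the city name from the thread vector.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = np.mean(pred == np.array(y_test))
# accuracy@161: correct if within 100 miles (161 km) of the true city
acc_161 = np.mean([city_km(p, t) <= 161 for p, t in zip(pred, y_test)])
median_km = np.median([city_km(p, t) for p, t in zip(pred, y_test)])
```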

Results
Dialect match Table 1 shows the results of clustering solutions up to 20 clusters for both retrofit and original embeddings. Irrespective of the clustering approach, retrofit representations perform markedly better.
Homogeneity increases substantially the more clusters we induce (in the limit, each data point becomes a single cluster, resulting in perfect homogeneity), whereas completeness decreases slightly with more clusters (they increase the likelihood that a region is split up into several clusters). We achieve the best V-score, 0.77, with 16 clusters.
Averaged k-means (over 5 runs) is much less consistent, due to random initialization, but presumably also because it cannot incorporate the geographic information. For few clusters, its performance is better than agglomerative clustering, but as the number of clusters increases (and the geographic distribution of the cities becomes more intricate), k-means stops improving.
The baseline achieves almost perfect homogeneity, as expected (the only outliers are NUTS regions in overlap areas). Completeness is lower than almost all clustering solutions, though. The V-score, 0.74, is therefore lower than the best clustering solution.
Both the cluster evaluation metrics and the visual correspondence suggest that our method captures regional variation at a lexical level well.

Geolocation Table 2 shows the results of the geolocation downstream task. Despite the fact that the representation learning setup was not designed for this task and excluded the most informative words for it (Salehi et al., 2017), the induced embeddings capture enough pertinent regional differences to achieve reasonable performance (albeit slightly below the state of the art, which typically reaches a median distance of around 100km and an accuracy@161 of 0.54, cf. Rahimi et al. (2017b)), and they decidedly outperform the BOW model and the most-frequent-city baseline on all measures.


Analysis
Because both words and cities are represented in the same embedding space (at least before retrofitting), we can compare the vectors of cities to each other (asking: which cities are linguistically most similar to each other, as we have done above) and words to cities (asking: which words are most similar to, or indicative of, a city). The latter allows us to get a qualitative sense of how descriptive the words are for each city. Figure 5 shows an example of word and city similarity for the city representation of Vienna.
We can also use the cluster centroid of several city vectors to represent entire regions. The new vector no longer represents a real location, but is akin to the theoretical linguistic center of a dialect region. We can then find the most similar words to this centroid. For the solution with 3 clusters (cf. Figure 3), we get the words in Table 3. As expected, the regional prototypes do not overlap, but feature dialectal expressions in the south, and general standard German expressions in the north.
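In gensim, this amounts to averaging the city vectors of a cluster and querying the word space with the result; cluster_cities is a hypothetical list of the city labels in one cluster.

```python
import numpy as np

# Centroid over the cluster's city vectors: the region's "linguistic center".
centroid = np.mean([model.dv[c] for c in cluster_cities], axis=0)

# Words closest to the theoretical center of the dialect region.
print(model.wv.similar_by_vector(centroid, topn=10))
```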
Again, for an in-depth qualitative analysis and discussion of the socio-linguistic correlations, see Purschke and Hovy (In Preparation).

Related Work
Dialectometric studies, exploring quantitative statistical models of regional variation, range from work on dialect data in Dutch (Nerbonne and Heeringa, 1997; Prokić and Nerbonne, 2008; Wieling et al., 2011, inter alia) and British English (Szmrecsanyi, 2008), to Twitter-based approaches to American dialect distinctions (Grieve et al., 2011; Huang et al., 2016) and the regional differentiation of African American Vernacular English (Jones, 2015). While these papers rely on existing dialect maps for comparison, they rarely evaluate against them quantitatively, as we do.
The use of representation learning is still relatively new and limited, especially given its prevalence in other areas of NLP. Bamman et al. (2014) have shown how regional meaning differences can be learned from Twitter via distributed word representations, but between US states rather than for individual cities. More recently, Kulkarni et al. (2016), Rahimi et al. (2017a), and Rahimi et al. (2017b) have shown how neural models can exploit regional lexical variation for geolocation, while also enabling dialectological insights, whereas our goals are exactly reversed. Östling and Tiedemann (2017) have shown how distributed representations of entire national languages capture typological similarities that improve translation quality. Most of these papers focus on downstream performance that accounts for regional variation, rather than on explicitly modeling the variation. We include a downstream task, but also evaluate the cluster composition quantitatively.

Conclusion
We use representation learning, structured clustering, and geographic retrofitting on city embeddings to study regional linguistic variation in German. Our approach captures gradual linguistic differences, and matches an existing German dialect map, achieving a V-score of 0.77. The learned city embeddings also capture enough regional distinction to serve as input to a downstream geolocation task, outperforming a BOW baseline and producing competitive results. Our findings indicate that city embeddings capture regional linguistic variation, which can be further enriched with geographic information via retrofitting. They also suggest that traditional ideas of regionality persist online. Our methodology is general enough to be applied to other languages and areas that lack dialect maps (e.g., Switzerland), and to other tasks studying regional variation. We publicly release our data and code.