Dialectometric analysis of language variation in Twitter

In the last few years, microblogging platforms such as Twitter have given rise to a deluge of textual data that can be used for the analysis of informal communication between millions of individuals. In this work, we propose an information-theoretic approach to geographic language variation using a corpus based on Twitter. We test our models with tens of concepts and their associated keywords detected in Spanish tweets geolocated in Spain. We employ dialectometric measures (cosine similarity and Jensen-Shannon divergence) to quantify the linguistic distance on the lexical level between cells created in a uniform grid over the map. This can be done for a single concept or in the general case taking into account an average of the considered variants. The latter permits an analysis of the dialects that naturally emerge from the data. Interestingly, our results reveal the existence of two dialect macrovarieties. The first group includes a region-specific speech spoken in small towns and rural areas whereas the second cluster encompasses cities that tend to use a more uniform variety. Since the results obtained with the two different metrics qualitatively agree, our work suggests that social media corpora can be efficiently used for dialectometric analyses.


Introduction
Dialects are language varieties defined across space. These varieties can differ in distinct linguistic levels (phonetic, morphosyntactic, lexical), which determine a particular regional speech (Chambers and Trudgill, 1998). The extension and boundaries (always diffuse) of a dialect area are obtained from the variation of one or many features such as the different word alternations for a given concept. Typically, the dialect forms plotted on a map appear as a geographical continuum that gradually connects places with slightly different diatopic characteristics. A dialectometric analysis aims at a computational approach to dialect distribution, providing quantitative linguistic distances between locations (Séguy, 1971; Goebl, 2006; Wieling and Nerbonne, 2015).
Dialectometric data is based upon a corpus that contains the linguistic information needed for the statistical analysis. The traditional approach is to generate these data from surveys and questionnaires that address variable types used by a few informants. Upon appropriate weighting, the distance metric can thus be mapped on an atlas. In the last few years, however, the impressive upswing of microblogging platforms has led to a scenario in which human communication features can be studied without the effort that traditional studies usually require. Platforms such as Twitter, Flickr, Instagram or Facebook bring us the possibility of investigating massive amounts of data in an automatic fashion. Furthermore, microblogging services provide us with real-time communication among users that, importantly, tend to employ an oral style. Another difference with traditional approaches is that while the latter focus on male, rural informants, users of social platforms are likely to be young, urban people (Smith and Rainie, 2010), which opens the route to novel investigations on today's usage of language. Thanks to advances in geolocation, it is now possible to directly examine the diatopic properties of specific regions. Examples of computational linguistic works that investigate regional variation with Twitter or Facebook corpora thus far comprise English (Eisenstein et al., 2014; Doyle, 2014; Kulkarni et al., 2016; Huang et al., 2016; Blodgett et al., 2016), Spanish (Gonçalves and Sánchez, 2014; Gonçalves and Sánchez, 2016; Malmasi et al., 2016), German (Scheffler et al., 2014), Arabic (Lin et al., 2014) and Dutch (Tulkens et al., 2016). It is noticeable that many of these works combine big data techniques with probabilistic tools or machine learning strategies to unveil linguistic phenomena that are absent or hard to obtain from conventional methods (interviews, hand-crafted corpora, etc.).
The subject of this paper is the language variation in a microblogging platform using dialectometric measures. In contrast to previous works, here we precisely determine the linguistic distance between different places by means of two metrics. Our analysis shows that the results obtained with both metrics are compatible, which encourages future developments in the field. We illustrate our main findings with a careful analysis of the dialect division of Spanish. For definiteness, we restrict ourselves to Spain but the method can be straightforwardly applied to larger areas. We find that, due to language diversity, cities and main towns have similar linguistic distances unlike rural areas, which differ in their homogeneous forms. This picture agrees with previous findings (Gonçalves and Sánchez, 2014) but is obtained with a completely different method.

Methods
Our corpus consists of approximately 11 million geotagged tweets produced in Europe in Spanish between October 2014 and June 2016. (Although we will focus on Spain, we will not consider in this work the speech of the Canary Islands due to difficulties with the data extraction.) The classification of tweets is accomplished by applying the Compact Language Detector (CLD) (McCandless, 2012) to our dataset. CLD exhibits accurate benchmarks and is thus good for our purposes, although a different detector might be used (Lui and Baldwin, 2012). We have empirically checked that when CLD determines the language with a probability of at least 60% the results are extremely reliable. Therefore, we only take into account those tweets for which the probability of being written in Spanish is greater than 0.6. Further, we remove unwanted characters, such as hashtags or at-mentions, using Twokenize (O'Connor et al., 2010), a tokenizer designed for Twitter text in English, adapted to our goals.
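As an illustration only, the filtering step described above might be sketched as follows. The sketch assumes that a language detector has already produced (text, language code, probability) triples; it does not call CLD or Twokenize, and the regular expression, which drops whole hashtag and at-mention tokens, is a hypothetical stand-in for the actual cleaning.

```python
import re

MIN_CONFIDENCE = 0.6  # probability threshold quoted in the text

# Hypothetical pattern (an assumption): drops whole #hashtag and @mention tokens.
_TAG_OR_MENTION = re.compile(r"[#@]\w+")


def clean_text(text):
    """Remove hashtags and at-mentions, then collapse whitespace."""
    return " ".join(_TAG_OR_MENTION.sub(" ", text).split())


def keep_spanish(detected):
    """Keep tweets labeled Spanish with probability above the threshold.

    `detected` is an iterable of (text, language_code, probability) triples,
    assumed to come from a language detector that was run beforehand.
    """
    return [
        clean_text(text)
        for text, lang, prob in detected
        if lang == "es" and prob > MIN_CONFIDENCE
    ]


# Toy input, invented for the example.
sample = [
    ("Qué frío hace hoy #invierno @amiga", "es", 0.97),
    ("so cold today", "en", 0.99),
    ("jaja ok", "es", 0.41),
]
kept = keep_spanish(sample)
```

Only the first toy tweet passes both conditions; the second is not Spanish and the third falls below the confidence threshold.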
We present the spatial coordinates of all tweets in figure 1 (only the south-western part of Europe is shown for clarity). As expected, most of the tweets are localized in Spain, mainly around major cities and along main roads.
Next, we select a word list from Varilex (Ueda et al., 2015), a lexical database that contains Spanish variation across the world. We consider 89 concepts expressed in different forms. Our selection eliminates possible semantic ambiguities. The complete list of keywords is included in the supplementary material below. For each concept, we determine the coordinates of the tweets in which the different keywords appear. From our corpus, we find that 219362 tweets include at least one form corresponding to any of the selected concepts.
The pictorial representation of these concepts is made using a shapefile of both the Iberian Peninsula and the Balearic Islands. Then, we construct a polygon grid over the shapefile. The size of the cells (0.35° × 0.35°) roughly corresponds to 1200 km². We locate the cell in which a given keyword matches and assign a different color to each keyword. We follow a majority criterion, i.e., we depict the cell with the keyword color whose absolute frequency is maximum. This procedure yields a useful geographical representation of how the different variants for a concept are distributed over space.
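The cell assignment and the majority criterion can be sketched in a few lines. This is a simplified illustration, not the procedure used to produce the maps: it replaces the polygon grid built over the shapefile with plain arithmetic binning of longitude and latitude, and the grid origin (lon0, lat0) and the keyword labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

CELL_DEG = 0.35  # cell side in degrees, as quoted in the text


def cell_index(lon, lat, lon0=-10.0, lat0=35.0):
    """Map coordinates to a (column, row) cell; lon0/lat0 are assumed origins."""
    return (
        math.floor((lon - lon0) / CELL_DEG),
        math.floor((lat - lat0) / CELL_DEG),
    )


def cell_counts(matches):
    """Count keyword occurrences per cell.

    `matches` is an iterable of (lon, lat, keyword) triples, one per tweet
    in which a keyword of the concept was found.
    """
    counts = defaultdict(Counter)
    for lon, lat, keyword in matches:
        counts[cell_index(lon, lat)][keyword] += 1
    return counts


def majority_keyword(counts):
    """Majority criterion: keyword with the largest absolute frequency per cell."""
    return {cell: c.most_common(1)[0][0] for cell, c in counts.items()}


# Toy matches with placeholder keyword labels.
matches = [
    (-3.68, 40.40, "variant_a"),
    (-3.60, 40.45, "variant_a"),
    (-3.65, 40.42, "variant_b"),
    (2.17, 41.39, "variant_b"),
]
winners = majority_keyword(cell_counts(matches))
```

Ties are resolved here by whichever keyword was seen first in a cell, which is an arbitrary choice of the sketch.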

Language distance
The dialectometric differences are quantified between regions defined with the aid of our cells. For this purpose we take into account two metrics, which we now briefly discuss.

Cosine similarity
This metric is a vector comparison measure. It is widely used in text classification, information retrieval and data mining (Murphy, 2012). Let u and v be two vectors whose components are given by the relative frequencies of the lexical variations for a concept within a cell. Quite generally, u and v represent points in a high-dimensional space. The similarity measure d(u, v) between these two vectors is related to their inner product conveniently normalized to the product of their lengths,
$$ d(u, v) = 1 - \frac{u \cdot v}{|u|\,|v|}. \tag{1} $$
This expression has an easy interpretation. If both vectors lie parallel, the direction cosine is 1 and thus the distance becomes d = 0. Since all vector components in our approach are non-negative, the upper bound of d is 1, which is attained when the two vectors are maximally dissimilar.
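A minimal sketch of this distance, one minus the cosine of the angle between two frequency vectors, is given below. The function name and the toy vectors are illustrative; the sketch assumes both vectors list the variants of a concept in the same order.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Parallel vectors (same proportions of variants) are at distance ~0.
d_parallel = cosine_distance([0.5, 0.5], [0.25, 0.25])
# Cells that share no variant at all are at the maximal distance 1.
d_disjoint = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Because relative frequencies are never negative, the value stays between 0 and 1 up to floating-point rounding.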

Jensen-Shannon metric
This distance is a similarity measure between probability density functions (Lin, 1991). It is a symmetrized version of a more general measure, the Kullback-Leibler divergence. Let P and Q be two probability distributions. In our case, these functions are built from the relative frequencies of each concept variation. Our frequentist approach differs from previous dialectometric works, which prefer to measure distances using the Dice similarity coefficient or the Jaccard index (Manning and Schütze, 1999).
The Kullback-Leibler divergence is defined as
$$ D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. \tag{2} $$
We now symmetrize this expression and take the square root,
$$ JSD(P \| Q) = \sqrt{\frac{D_{KL}(P \| M) + D_{KL}(Q \| M)}{2}}, \tag{3} $$
where M = (P + Q)/2. The Jensen-Shannon distance JSD(P||Q) is indeed a metric, i.e., it satisfies the triangle inequality. Additionally, JSD(P||Q) fulfills the metric requirements of non-negativity, d(x, y) = 0 if and only if x = y (identity of indiscernibles) and symmetry (by construction). This distance has been employed in bioinformatics and genome comparison (Sims et al., 2009; Itzkovitz et al., 2010), social sciences (DeDeo et al., 2013) and machine learning (Goodfellow et al., 2014). To the best of our knowledge, it has not been used in studies of language variation. An exception is the work of Sanders (2010), where JSD is calculated for an analysis of syntactic variation of Swedish. Here, we propose to apply the Jensen-Shannon metric to lexical variation. Below, we demonstrate that this idea leads to quite promising results.
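The construction above (Kullback-Leibler divergence, symmetrization through the mixture M, then a square root) can be sketched as follows. The base of the logarithm is an assumption of this sketch: base 2 is used so that the distance is bounded by 1. Terms with P(i) = 0 are taken to contribute zero, following the usual convention.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence with log base 2 (an assumption here).

    Terms with p_i == 0 contribute nothing; q_i is never 0 where p_i > 0
    when q is the mixture (p + other) / 2.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the symmetrized divergence with respect to M = (P + Q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(0.0, js))  # guard against tiny negative rounding errors


# Identical distributions are at distance 0.
d_same = jensen_shannon_distance([0.3, 0.7], [0.3, 0.7])
# Distributions with disjoint support reach the maximal value 1 (base 2).
d_disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
```

With a natural logarithm the maximal value would instead be the square root of ln 2, so the choice of base only rescales the distances.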

Average distance
Equations 1 and 3 give the distance between cells A and B for a certain concept. We assign the global linguistic distance in terms of lexical variability between two cells to the mean value
$$ d(A, B) = \frac{1}{N} \sum_{i=1}^{N} d_i, \tag{4} $$
where d_i is the distance between cells A and B for the i-th concept and N is the total number of concepts used to compute the distance. In the cosine similarity model, we replace d_i in equation 4 with equation 1 whereas in the Jensen-Shannon metric d_i is given by equation 3.
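The averaging step can be illustrated with a short sketch. It assumes that each cell is stored as a mapping from concept to frequency vector and that only concepts observed in both cells enter the mean; this treatment of missing concepts is an assumption of the sketch. Any per-concept metric can be plugged in; a compact cosine distance is included only to make the example self-contained.

```python
import math


def cosine_distance(u, v):
    """Per-concept metric used for the demo: 1 - cosine of the angle."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def global_distance(cell_a, cell_b, metric):
    """Mean of the per-concept distances over concepts present in both cells.

    cell_a, cell_b: dicts mapping concept name -> frequency vector.
    Skipping concepts missing from either cell is an assumption.
    """
    shared = sorted(set(cell_a) & set(cell_b))
    if not shared:
        raise ValueError("the two cells share no concept")
    return sum(metric(cell_a[c], cell_b[c]) for c in shared) / len(shared)


# Toy cells with placeholder concept names.
cell_a = {"concept_1": [1.0, 0.0], "concept_2": [0.5, 0.5]}
cell_b = {"concept_1": [0.0, 1.0], "concept_2": [0.5, 0.5], "concept_3": [1.0, 0.0]}
d_ab = global_distance(cell_a, cell_b, cosine_distance)
```

In the toy example the two shared concepts have distances 1 and 0, so the mean is 0.5; the third concept is ignored because it appears in one cell only.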

Results and discussion
We first check the quality of our corpus with a few selected concepts. Examples of their spatial distributions can be seen in figure 2. The lexical variation depends on the particular concept and on the keyword frequency. We recall that the majority rule demands that we depict the cell with the color corresponding to the most popular word. Despite a few cells appearing to be blank, we have instances in most of the map. Importantly, our results agree with the distribution for the concept cold reported by Gonçalves and Sánchez (2014) with a different corpus. The north-south bipartition of the variation suggested in figure 2(a) also agrees with more traditional studies (Ordóñez, 2011). As a consequence, these consistencies support the validity of our data. The novelty of our approach is to further analyze this dialect distribution with a quantitative measure as discussed below.

Single-concept case
Let us quantify the lexical difference between regions using the concept cold as an illustration. First, we generate a symmetric matrix of linguistic distances m_ij(d) between pairs of cells i and j with d calculated using equation (1) or equation (3). Then, we find the maximum possible d value in the matrix (d_max) and select either its corresponding i_max or j_max index as the reference cell. Since both metrics are symmetric, the choice between i_max and j_max should not affect the results much (see below for a detailed analysis). Next, we normalize all values to d_max and plot the distances to the reference cell using a color scale. In figure 3, panel (a) [(b)] is obtained with the cosine similarity (Jensen-Shannon metric). Crucially, we observe that both metrics give similar results, which confirms the robustness of our dialectometric method.
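The selection of the reference cell and the normalization can be sketched as follows, assuming the symmetric distance matrix has already been computed and is stored as a list of lists. The function name and the toy matrix are illustrative; when several pairs share the maximal distance, the sketch simply picks one of them.

```python
def distances_to_reference(m):
    """Find the most distant pair of cells and normalize one row to d_max.

    m: symmetric matrix (list of lists) of pairwise distances between cells.
    Returns (i_max, j_max, row), where row[k] = m[i_max][k] / d_max, i.e.
    the normalized distances from the reference cell i_max to every cell.
    """
    n = len(m)
    d_max, i_max, j_max = max(
        (m[i][j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    if d_max == 0:
        raise ValueError("all cells are identical; nothing to normalize")
    return i_max, j_max, [m[i_max][k] / d_max for k in range(n)]


# Toy 3-cell matrix, invented for the example.
m = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.5],
    [0.8, 0.5, 0.0],
]
i_ref, j_ref, row = distances_to_reference(m)
```

Here cells 0 and 2 are the most distant pair, so cell 0 serves as reference and cell 2 receives the normalized value 1; taking row j_ref instead would give the map drawn from the other end of the pair.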
Clearly, cells with a low number of tweets will largely contribute to fluctuations in the maps. To avoid this noise-related effect, we impose in figure 4 a minimum threshold of 5 tweets in every cell. Obviously, the number of colored cells decreases but fluctuations become quenched at the same time. If the threshold is increased up to 10 tweets, we obtain the results plotted in figure 5, where the north-south bipartition is now better seen. We stress that there exist minimal differences between the cosine similarity and the Jensen-Shannon metric models.
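A sketch of the thresholding step is given below. It assumes that the threshold applies to the number of tweets matching the concept in each cell, which is one possible reading of the text, and it converts the surviving counts into the relative-frequency vectors used by the distance measures. Cell keys and variant labels are placeholders.

```python
def frequency_vectors(cell_counts, variants, min_tweets=5):
    """Drop sparse cells and turn keyword counts into relative frequencies.

    cell_counts: dict mapping cell -> {variant: count} for one concept.
    variants: fixed ordering of the variants, shared by all cells.
    min_tweets: minimum number of matching tweets a cell must contain
                (assumed to count only tweets matching this concept).
    """
    vectors = {}
    for cell, counts in cell_counts.items():
        total = sum(counts.get(v, 0) for v in variants)
        if total < min_tweets:
            continue  # too few observations: treated as noise and left blank
        vectors[cell] = [counts.get(v, 0) / total for v in variants]
    return vectors


# Toy counts, invented for the example.
counts = {
    (18, 15): {"variant_a": 6, "variant_b": 2},
    (34, 18): {"variant_a": 1, "variant_b": 2},
}
kept_5 = frequency_vectors(counts, ["variant_a", "variant_b"], min_tweets=5)
kept_10 = frequency_vectors(counts, ["variant_a", "variant_b"], min_tweets=10)
```

Raising the threshold from 5 to 10 removes the remaining cell in this toy example, mirroring the trade-off between map coverage and noise discussed above.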

Global distance
Our previous analysis assessed the lexical distance for a single concept (cold). Let us now take into account all concepts and calculate the averaged distances using equation (4). To do so, we proceed as above and measure the distance from any of the two cells that presents the maximal value of d, where d is now calculated from equation 4. As aforementioned, d_max connects two cells, which we denote as C_1 and C_2. Any of these can be selected as the reference cell from which the remaining linguistic distances are plotted in the map. To ensure that we obtain the same results, we plot the distance distribution in both directions. The results with the cosine similarity model are shown in figure 6. It is worth noting that qualitatively the overall picture is only slightly modified when the reference cell is changed from C_1 [figure 6(a)] to C_2 [figure 6(b)]. The same conclusion is reached when the distance is calculated with the Jensen-Shannon metric model, see figures 7(a) and (b).
After averaging over all concepts, we lose information on the lexical variation that each concept presents but on the other hand one can now investigate which regions show similar geolectal variation, yielding well defined linguistic varieties. Those cells that have similar colors in either figure 6 or figure 7 are expected to be ascribed to the same dialect zone. Thus, we can distinguish two main regions or clusters in the maps. The purple background covers most of the map and represents rural regions with small, scattered population. Our analysis shows that this group of cells possesses more specific words in their lexicon. In contrast, the green and yellow cells form a second cluster that is largely concentrated in the center and along the coastline, areas that correspond to big cities and industrialized regions. In these cells, the use of standard Spanish is widespread, probably due to school education, media, travelers, etc. The character of its vocabulary is more uniform as compared with the purple group. While the purple cluster prefers particular utterances, the lexicon of the urban group includes most of the keywords. Importantly, we emphasize that both distance measures (cosine similarity and Jensen-Shannon) give rise to the same result, with small discrepancies in the numerical values that are not significant. The presence of two Twitter superdialects (urban and rural) has been recently suggested (Gonçalves and Sánchez, 2014) based on a machine learning approach. Here, we arrive at the same conclusion but with a totally distinct model and corpus. The advantage of our proposal is that it may serve as a useful tool for dialectometric purposes.

Conclusions
To sum up, we have presented a dialectometric analysis of lexical variation in social media posts employing information-theoretic measures of language distances. We have considered a grid of cells in Spain and have calculated the linguistic distances in terms of dialects between the different regions. Using a Twitter corpus, we have found that the synchronic variation of Spanish can be grouped into two types of clusters. The first region shows more lexical items and is present in big cities. The second cluster corresponds to rural regions, i.e., mostly villages and less industrialized regions. Furthermore, we have checked that the different metrics used here lead to similar results in the analysis of the lexical variation for a representative concept and provide a reasonable description of language variation in Twitter.
We remark that the small number of tweets generated after matching the lexical variations of concepts within our automatic corpus puts a limit to the quantitative analysis, making the differences between regions small. Our work might be improved by similarly examining Spanish tweets worldwide, especially in Latin America and the United States. This approach should give more information on the lexical variation on the global scale and would help linguists in their dialectal classification work of micro- and macro-varieties. Our work hence represents a first step into the ambitious task of a thorough characterization of language variation using big data resources and information-theoretic methods.

Figure 1: Heatmap of Spanish tweets geolocated in Europe. There exist 11208831 tweets arising from a language detection and tokenization procedure. We have zoomed in on those arising in Spain, Portugal and the south of France.

Figure 2: Spatial distribution of a few representative concepts based on the maximum absolute frequency criterion. Each concept has a lexical variation as indicated in the figure. The concepts are: (a) cold, (b) school, (c) streetlight, (d) fans.

Figure 3: Linguistic distances for the concept cold using (a) cosine similarity and (b) Jensen-Shannon divergence metrics. The horizontal (vertical) axis is expressed in longitude (latitude) coordinates.

Figure 4: Linguistic distances as in figure 3 but with a minimum threshold of 5 tweets in each cell using (a) cosine similarity and (b) Jensen-Shannon metric.

Figure 5: Linguistic distances as in figure 3 but with a minimum threshold of 10 tweets in each cell using (a) cosine similarity and (b) Jensen-Shannon metric.

Figure 6: Global distances averaged over all concepts. Here, we use the cosine similarity measure to calculate the distance. The color distribution displays a small variation from (a) to (b) due to the change of the reference cell.

Figure 7: Global distances averaged over all concepts. Here, we use the Jensen-Shannon metric to calculate the distance. The color distribution displays a small variation from (a) to (b) due to the change of the reference cell.