Spatio-Temporal Prediction of Dialectal Variant Usage

The distribution of most dialectal variants has not only spatial but also temporal patterns. According to the 'apparent time hypothesis', much dialect change happens through younger speakers adopting innovations (see footnote 1). Thus, synchronic diversity can be interpreted diachronically. Assuming the 'contact effect', i.e. that contact possibility (contact and isolation) between speaker communities is responsible for language change, together with the apparent time hypothesis, we aim to predict the usage of dialectal variants. In this paper we model contact possibility based on two of the factors that sociolinguistics considers most important for language change: age and distance. The first steps of the approach involve modeling contact possibility using a logistic predictor, taking the age of respondents into account. We test the global and the local role of age for variation, where the local level means spatial subsets around each survey site, chosen based on k nearest neighbors. The prediction approach is tested on Swiss German syntactic survey data, featuring multiple respondents from different age cohorts at survey sites. The results show the relative success of the logistic prediction approach as well as the limitations of the method; further proposals are therefore made to develop the methodology.

Motivation
Contact and isolation, in geographic space and in social space, are assumed to be the most important factors behind language change. The concept of apparent time (Bailey et al., 1991) hypothesizes that the mother tongue is mostly acquired by the late teenage years, after which one's language is more resistant to change. Throughout an individual's life, contact patterns and social networks might change (e.g., due to the ease of contact through media and changing migration or commuting patterns, especially since the 20th century). However, based on the apparent time hypothesis, an individual's linguistic patterns can be assumed to reflect the contact patterns of their early life, unless the individual has been uprooted. Keeping all other variables constant, it can be assumed that two people who are close in age and spent their youth near each other are more likely to use similar language.
Footnote 1: An innovation is, of course, relative. A locally appearing new form, with or without attestation elsewhere, can be considered an innovation.
Thus, the quantification of contact possibility allows predicting current language usage and, through the concept of 'apparent time depth', future dialect change. If it is possible to predict the usage of variants based on the contact among users, core issues in sociolinguistics and diachronic linguistics, such as the diffusion of variants and the tracing back and forecasting of language change, can be addressed with better (spatial and temporal) granularity. Besides, such an approach allows testing long-used linguistic theories, such as the apparent time concept (Bailey et al., 1991), language change following gravity-like paths (Trudgill, 1974) or wave-like diffusion (Yokoyama and Sanada, 2009; Blythe and Croft, 2012). Further, it can contribute to natural language processing endeavours, such as predicting age from language attributes (Morgan-Lopez et al., 2017). This study, tracing language variation back to the patterns of contact between communities, contributes to existing approaches (e.g., Pickl and Rumpf, 2012; Wieling and Nerbonne, 2015; Yamauchi and Murawaki, 2016; Burridge, 2017) in language change and variation studies. So far, linguistic geography has mostly tested individual phenomena (Willis, 2017), but as data with better granularity can be obtained ever faster, computational approaches can speed up analysis in language change studies and highlight variants that can then be investigated more thoroughly with the methods of qualitative and quantitative linguistics.
To account for the diverse roles of contact quantitatively, the relationship between the measured linguistic variation and the variables affecting contact patterns (including social, demographic, policy-related or geographic factors) has to be tested. This paper is not the first step in this direction: sociolinguistics and linguistic geography have extensively researched social status, geographic distances and trade, among others, in this regard (e.g., Labov, 1963; Gooskens, 2004; Nerbonne, 2009; Szmrecsanyi, 2012; Lameli et al., 2015). However, this paper presents one of the first steps towards assembling a model that predicts the usage of dialectal variants, and thereby language change, by taking as many extralinguistic variables as possible into account. In this paper we start assembling the model with the two main variables assumed by sociolinguistics to have a crucial impact on language contact and change: age and distance. In a previous paper (Jeszenszky et al., 2018), we provided first steps from the ordination aspect for assessing the spatial predictors of different grammatical domains.
The specific goal of this paper is to analyze the roles that age and distance play in language contact, as explanatory variables for the usage patterns of dialectal variants, tested at the linguistic level of syntax. We build a logistic predictor model at global and local scales for classifying multivariate syntactic data from a Swiss German dialect survey and present first results.

Dialect Data
It is often assumed in dialectology that, of all linguistic levels, change in syntax is the slowest (Longobardi and Guardiano, 2009). This could mean that the association with age is weaker in syntax than in the lexicon (Morgan-Lopez et al., 2017). However, the lower possible number of syntactic variants allows for more robust results with fewer responses in a survey.
The dialectal data used in this paper stem from the database of the Syntactic Atlas of German-speaking Switzerland (SADS; Bucheli and Glaser, 2002; Glaser and Bart, 2015). The database holds data collected in a series of four dialect surveys, which were conducted between 2000 and 2002 and probed 54 different (morpho)syntactic phenomena. At 383 survey sites, relatively homogeneously distributed throughout the German-speaking area, a total of 3'174 respondents (3-26 per survey site, median = 7) filled in the questionnaires containing 118 questions. Respondents of several age groups (12-94 years old) were included at most survey sites. However, the age distribution is slightly skewed, with a median of 57 years (Stoeckle, 2018). The multitude of responses shows the local variation in variant usage and gives a higher attribute granularity, which allows testing the association between variant usage and extralinguistic variables such as age. Most survey questions were translation tasks from Standard German into the local dialect or multiple choice (MC) questions. For MC questions, respondents could accept several answer variants as locally valid, and they were asked to specify their 'preferred' variants. In this work we rely on these preferred variants, as especially younger respondents tended to accept more variants (Glaser et al., 2019), which is itself a clue that age conditions the usage patterns of dialectal variants. Note that even though dialectological research often refers to survey questions as variables, in this paper we call them 'phenomena', as the term 'variable' overlaps with the statistical terms used further on (i.e., explanatory variable, independent variable).

Predicting Dialectal Variant Usage Based on Age and Spatial Neighborhood
This paper presents the methodology and first results of our proposed approach for analysing the effects of age and regional contact. Regional contact is assumed to be more important in language change, manifesting itself in the variation of dialectal variants by age. We test the following two hypotheses:
• At the global scale, age explains the usage of dialect variants in linguistic phenomena.
• Age is a better predictor for the usage of dialect variants at the local scale.
Firstly, using logistic regression, similarly to Willis (2017), we analyse how well age predicts the usage of the variants of each dialectal phenomenon at the global level, taking all respondents into account. Secondly, we utilize a regionalisation approach: for every survey site s, we take a set of k nearest survey sites and predict the usage of each variant in s, based on the age of respondents and the variant preference in the whole set.
Global scale. We test the association between linguistic variation as a categorical (nominal) variable and age as a continuous predictor variable, using logistic regression. Logistic regression does not provide a good effect size statistic comparable to the R² used with Pearson's product-moment correlation. Nevertheless, its predictive power can be tested by training the logistic regression predictor on a training subset of the data and checking whether the model's predictions correspond to the observed data that were previously masked. We use a 10-fold cross-validation strategy, so that all data are used for training and all observations are predicted. This tests whether logistic regression based on age provides a significant prediction of variant usage at the global level and, if so, with what accuracy. We report the significance in Figure 1 in a binary way (i.e., whether the prediction of the usage of a certain variant is significant or not). Figure 1 also presents the AUC (area under the ROC curve), a typical performance measure for binary classifiers that shows separability, i.e., to what degree the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
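The global procedure can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper: the data are synthetic, the logistic regression is fitted by plain gradient ascent, the AUC is computed by pairwise comparison, and the function names (`fit_logistic`, `auc`, `cross_validated_auc`) are hypothetical. The significance test reported in Figure 1 is not reproduced here.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=300, lr=0.5):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def auc(scores, labels):
    """Probability that a random user (1) is scored above a random non-user (0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(ages, used, folds=10, seed=0):
    """AUC of out-of-fold predictions: each respondent is scored by a model
    trained on the other folds only."""
    mean = sum(ages) / len(ages)
    sd = math.sqrt(sum((a - mean) ** 2 for a in ages) / len(ages))
    xs = [(a - mean) / sd for a in ages]  # rescaling only helps the optimiser
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    scores = [0.0] * len(xs)
    for f in range(folds):
        test = set(order[f::folds])
        train = [i for i in order if i not in test]
        b0, b1 = fit_logistic([xs[i] for i in train], [used[i] for i in train])
        for i in test:
            scores[i] = sigmoid(b0 + b1 * xs[i])
    return auc(scores, used)


# Synthetic respondents (NOT SADS data): older speakers prefer the variant more often.
rng = random.Random(42)
ages = [rng.randint(12, 94) for _ in range(300)]
used = [1 if rng.random() < sigmoid((a - 55) / 12) else 0 for a in ages]
cv_auc = cross_validated_auc(ages, used)
print(round(cv_auc, 2))
```

Because the synthetic variant is strongly tied to age, the out-of-fold AUC lies well above 0.5; for a variant unrelated to age it would stay close to 0.5, the value the text describes as no discriminative capacity.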
Local scale. The regional approach can be viewed as a classification problem. Our model has to decide for each variant whether respondents at a central survey site s used it or not, based on age as the predictor variable in a set of k nearest neighbor survey sites. We use a logistic regression approach again. Using age as a continuous variable and the answer variants of all respondents as Boolean variables, we train a logistic model and predict the variant usage for each respondent at s. We do this for all 383 survey sites. In this paper, we choose the k nearest neighbors based on Euclidean distance and test models with different values of k (5 to 50). Our approach employs a distance cut-off rather than distance decay; however, it can also be assumed that the closer survey sites are, the more linguistic influence they have on each other.
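The local procedure might be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: the survey sites form a synthetic 5 x 5 grid, the training set for a site s consists of the respondents at its k nearest sites and excludes s itself (the text leaves open whether s belongs to the set), and the names `k_nearest_sites` and `predict_site` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=200, lr=0.5):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def k_nearest_sites(site, coords, k):
    """IDs of the k sites closest to `site` by Euclidean distance (hard cut-off).
    Assumption: the central site itself is excluded."""
    x0, y0 = coords[site]
    others = [s for s in coords if s != site]
    others.sort(key=lambda s: math.hypot(coords[s][0] - x0, coords[s][1] - y0))
    return others[:k]


def predict_site(site, coords, respondents, k):
    """Train on respondents at the k nearest sites, then return the predicted
    probability of variant usage for each respondent at `site`."""
    train = [r for s in k_nearest_sites(site, coords, k) for r in respondents[s]]
    ages = [a for a, _ in train]
    mean = sum(ages) / len(ages)
    sd = math.sqrt(sum((a - mean) ** 2 for a in ages) / len(ages)) or 1.0
    b0, b1 = fit_logistic([(a - mean) / sd for a in ages], [u for _, u in train])
    return [sigmoid(b0 + b1 * (a - mean) / sd) for a, _ in respondents[site]]


# Synthetic 5 x 5 grid of survey sites (NOT SADS data), 7 respondents per site.
rng = random.Random(1)
coords = {i: (i % 5, i // 5) for i in range(25)}
respondents = {}
for i, (x, _) in coords.items():
    site_ages = [rng.randint(12, 94) for _ in range(7)]
    # Toy rule: only speakers aged 50+ in the western columns use the variant.
    respondents[i] = [(a, int(x <= 2 and a >= 50)) for a in site_ages]

for site in (0, 24):
    probs = predict_site(site, coords, respondents, k=5)
    print(site, [round(p, 2) for p in probs])
```

Replacing the hard cut-off in `k_nearest_sites` with distance-based weights in the fit would turn the sketch into a distance decay variant.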

Results
For this paper, we used 60 phenomena from the SADS survey (approximately half of all), which were already used in Jeszenszky et al. (2017). Appendix A provides some linguistic details on each phenomenon.
Results with regard to the explanatory value of age as a global predictor for variant usage are presented in Figure 1. For more than half of the variants considered, age is not a significant predictor (dark grey squares). The AUC values, reported for the variants where the relationship with age is significant, are relatively low (0.5 means that the model has no discriminative capacity). At the same time, variants that reach higher values typically have relatively few users (below 100 out of 3'174), e.g., II5 3 (10 users; see footnote 2 on variant coding) and II30 7 (8 users). However, several variants with sparse usage are also found among those not predicted significantly by age. Variants with many users (e.g., I12 1 with 2'683 users, I7 3 with 2'880 users, III2 1 with 2'021 users) typically have an AUC value between 0.5 and 0.6. These values of association between variant usage and age alone are relatively low overall, leading us to investigate the predictive power of age at the regional scale, the patterns of which are possibly concealed by the global patterns.
For each variant in each phenomenon, Figure 2 presents the number of survey sites (out of the total 383) in which age significantly predicts the variant's usage, based on k = 13 nearest neighbors. Age proves to be a significant predictor at a large number of survey sites only for a few variants. These are, however, not always variants with few users. The first few variants in each phenomenon usually cover the majority of respondents.
The distribution of one such variant (III7 2) is mapped in Figure 3 along with the significance and accuracy of the predictor variable age. The patterns in Figure 3 show that a higher number of users does not necessarily make age a significant predictor. The significance of age as a predictor variable is spatially autocorrelated, which can be interpreted as follows: where present, the usage of this variant is characteristic of certain age groups at survey sites marked with green points, while at sites marked with red points it is used by different age groups.
Footnote 2: Variant coding includes the survey question number and a variant ID. For example, II5 3 is Variant #3 in the 5th question of the 2nd survey sheet.
As logistic regression is sensitive to class imbalance, it might not always be the best choice of predictor when the data contain many 0s and only a few 1s: it might reach a misleadingly high accuracy by predicting only 0s and none of the 1s.
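A small worked example makes the imbalance problem concrete. The counts are hypothetical (10 users among 1000 respondents), and balanced accuracy is shown only as one possible imbalance-aware measure; it is an assumption of this sketch, not a measure used in the paper.

```python
def accuracy(y_true, y_pred):
    """Share of respondents whose usage (0 or 1) is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Share of observations of class `cls` that were predicted as `cls`."""
    idx = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in idx) / len(idx)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recalls for users (1) and non-users (0)."""
    return (recall(y_true, y_pred, 1) + recall(y_true, y_pred, 0)) / 2


# Hypothetical rare variant: 10 users among 1000 respondents (illustrative counts).
y_true = [1] * 10 + [0] * 990
always_zero = [0] * 1000  # a degenerate model that never predicts usage

print(accuracy(y_true, always_zero))           # 0.99
print(recall(y_true, always_zero, 1))          # 0.0
print(balanced_accuracy(y_true, always_zero))  # 0.5
```

The degenerate model is correct for 99% of respondents without identifying a single user of the variant, which is why accuracy alone can be misleading for sparsely used variants.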
The first results show that age alone does not prove to be an exceptionally good predictor of syntactic variation. This is partly due to the nature of the data: it has been shown that while the lexicon is more prone to correlate with age, syntax changes more slowly. The first results nevertheless show that, even with a relatively simple approach, our research direction seems to be a worthwhile undertaking, and they leave ample room for developing the methodology further. The area and number of respondents involved in each model will be tested through different values of k, a distance decay approach and weights based on different parameters (including age). The spatial basis of the model will feature estimations of contact potential that have proved more 'informative' than Euclidean distance, such as travel time (Jeszenszky et al., 2017); linguistic gravity (Trudgill, 1974), which predicts influence, and therefore language change, based on settlement populations as weights in a gravity equation; or linguistic distance (Pickl et al., 2014), assuming that closer dialect varieties are the outcomes of closer (historical) contact. Furthermore, different algorithms beyond the logistic predictor (e.g., random forests, SVM, XGBoost) will be tested in the prediction model.

Appendix A
Oh, ich habe den Fritz kommen hören. | Oh, I heard Fritz coming. | perfect with 'hear': form and position of non-finite verb (IPP)
I.5 | Der Korb ist umgekippt. | The basket is toppled over. | resultative: subject agreement
I.6 | Wissen Sie, jetzt brauche ich sogar Tabletten zum einschlafen. | You know, now I even need pills in order to fall asleep.
No, it belongs to my sister.
But I already helped doing the dishes yesterday. | perfect with 'help': form and position of non-finite verb (IPP)
I.9 | Also ich weiss auch nicht, ob er einmal heiraten will. | Well, I don't know if he ever wants to get married.
But I just sat down in order to read a book.
Actually, fish fingers should be fried while still frozen.
'get'-passive
III.11 | Also mich erwischt keiner! | Well, no one will catch me! | personal pronoun (1sg): DO