Case and Cause in Icelandic: Reconstructing Causal Networks of Cascaded Language Changes

Linguistic drift is a process that produces slow, irreversible changes in the grammar and function of a language's constructions. Importantly, changes in one part of a language can have trickle-down effects, triggering changes elsewhere in that language. Although such causally triggered chains of changes have long been hypothesized by historical linguists, no explicit demonstration of the actual causality has been provided. In this study, we use co-occurrence statistics and machine learning to demonstrate that the functions of morphological cases experience a slow, irreversible drift throughout history, even in a language as conservative as Icelandic. Crucially, we then move on to demonstrate -using the notion of Granger-causality- that there are explicit causal connections between the changes in the functions of the different cases, which are consistent with documented processes in the history of Icelandic. Our technique provides a means for the quantitative reconstruction of connected networks of subtle linguistic changes.

1 Introduction

Sapir (1921/2014) noticed that "Language moves down time in a current of its own making. It has a drift" (emphasis added). In Sapir's view, the formation of different dialects requires that the small changes constantly being introduced by speakers are not just plain white noise, but rather random walks in which minute changes accumulate over time. The very high dimensionality on which languages operate makes cumulative linguistic changes irreversible. Once a change has been effected there is very little chance that the language will ever return to its original state before the change, in the same way that a diffusion process in a very high-dimensional space is never going to return to the exact same point in the space. Drift in language is in this respect reminiscent of random genetic drift in evolutionary biology (Wright, 1955). However, Sapir's idea of drift goes further, in that he viewed it as a directional process, more similar to Wright's (1929) concept of a directional drift related to selectional pressures. In Sapir's view, "language has a 'slope'"; the small changes that accumulate in linguistic drift are not fully random, but rather reflect the speakers' unconscious cognitive tendency to increase the consistency within their languages. This idea is currently challenged by some researchers (Croft, 2000; Lupyan and Dale, 2015), who are of the opinion that purely random drift -of the same type as that found in genetics-, when coupled with adequate selection mechanisms, is sufficient to account for the diachronic changes observed in the world's languages. Sapir motivated the need for directional change in what he saw as apparent causal chains in language change, which he illustrated with the progressive loss and functional shift of English oblique case markers into an absolutive, case-free system encoding animacy and position relative to the head noun.
'Chain reactions' along the history of a language are particularly well-studied in phonology. Chain shifts (Martinet, 1952) are processes by which the position in perceptual/articulatory space of a phoneme changes in response to the change in position of another phoneme (either moving away from the second phoneme, in a 'push' chain, or moving to occupy the space left void by the other, in a 'pull' chain). A famous example of a chain shift is the Great English Vowel Shift. In a similar fashion, one could think of functional chain shifts in morphology, by which a certain morphological category takes over some of the functions of another, triggering a chain of 'push' and/or 'pull' movements in other categories. Such cascaded changes have often been reported in diachronic linguistics (Biberauer and Roberts, 2008;Fisiak, 1984;Lightfoot, 2002;Wittmann, 1983).
Icelandic is a famously conservative language. Compared to most other languages, its grammar has experienced remarkably little change since the high middle ages. For instance, Barðdal and Eythórsson (2003) argue that the changes it has experienced from its old phase (Old West Norse; mid XI century to mid XIV century) to its current phase are comparable to the slight changes occurring from early Modern English (late XV century into early XVIII century) into Modern English (from early XVIII century). In terms of inflectional morphology, change in Icelandic has been minimal. For instance, one finds that the nominal paradigms of Old West Norse are mostly the same as those of Modern Icelandic. Notwithstanding the apparent formal stability of Icelandic cases, there is evidence that they are experiencing subtle changes in their functions (Barðdal, 2011; Barðdal and Eythórsson, 2003; Eythórsson, 2000). In particular, Barðdal argues that an accumulation of small syntactico-semantic shifts has finally resulted in a shift in the Icelandic dative's functions (i.e., 'dative sickness'), possibly triggered by earlier changes in nominatives and accusatives (e.g., 'nominative sickness').
In this study, we investigate whether one can reliably detect a drift in the functions of Icelandic case and -crucially- whether there is evidence for causal chain shifts in these functions. In Section 2, we describe the processing of a diachronic corpus of Icelandic to obtain co-occurrence representations of the functions of case types and tokens. Section 3 uses machine learning on the co-occurrence vectors of tokens to demonstrate that the usage of Icelandic cases has been subject to a constant drift along history, a drift that is distinguishable from the overall changes experienced by the language in this period. We then go on -in Section 4- to demonstrate, using Granger-causality (Granger, 1969), that there are causal relations between the changes in the different cases, and that it is possible to reconstruct a directed network of chain shifts, which is consistent with the directions of causality hypothesized by Barðdal (2011). Finally, in Section 5, we discuss the theoretical implications of our results for theories of language change, as well as the possibilities offered by the technical innovations presented here.

2 Corpus Processing

2.1 Corpus
We used the Icelandic Parsed Historical Corpus (Wallenberg et al., 2011), a sample of around one million word tokens of Icelandic texts that have been orthographically standardized, manually lemmatized, part-of-speech tagged, and parsed into context-free derivation trees. An example of the lemmatization and part-of-speech tagging for a sentence is shown in Fig. 1. The dating of the text samples ranges from 1150 CE to 2008 CE, covering most of the history of Icelandic (from its origins in Old West Norse to the current official language of Iceland). The corpus is divided into 61 files of similar sizes (around 18,000 words per file), each corresponding to a single document. The documents were chosen to cover the period in a roughly uniform manner, sampling from similar genres across the periods.

2.2 Preprocessing
We collapsed into a single file all documents that were dated to the same year. This left us with 44 files containing texts from distinct years. From each of these files, we discarded all tokens that contained anything but valid Icelandic alphabetic characters or the dollar sign (used for marking enclitic breaks within a word, such as the clitic determiner in krossins from the example in Fig. 1). All remaining word tokens were lower-cased, and the '$' character was removed from the stem elements of broken stem-plus-clitic pairs (e.g., kross$ was changed into kross).

2.3 First Order Co-occurrence Vectors
Ideally, for constructing co-occurrence vectors, it is best to choose as features those words with the highest overall informativity, which in fact tend to be the words with the highest occurrence frequencies (Bullinaria and Levy, 2007; Bullinaria and Levy, 2012; Lowe and McDonald, 2000). In our case, however, using plain token frequencies runs the risk of creating a representational space that is strongly uninformative about particular periods in the history of the language. Instead, we used document frequencies, as these still provide a measure of word frequency (and therefore informativity), while at the same time ensuring that the words chosen as features are representative across the history of the language. We selected as features all word types that occurred in at least 75% of the 44 by-year files, that is, all 529 distinct (unlemmatized) word forms with a document frequency in the corpus of at least 33 documents. For each (unlemmatized) word type (w) occurring at least three times in the whole corpus (17,741 distinct word types), we computed its co-occurrence frequency with each of the feature words (t) within a symmetrical window including the two preceding and two following words. In this way, we obtained a 17,741 × 529 matrix of word-by-feature co-occurrence frequencies (f[w, t]). The plain co-occurrence matrix was converted into a matrix of word-feature pointwise mutual information values M = (m_{i,j}), such that

m_{i,j} = log [ N f[w_i, t_j] / ( (W_1 − 1) f[w_i] f[t_j] ) ],

where N = 899,763 is the total number of tokens in the corpus, W_1 = 5 is the total sliding window size considered, and f[w_i] and f[t_j] are the overall corpus frequencies of words w_i and t_j, respectively. In this manner, the row M_{i,·} = (m_{i,1}, ..., m_{i,529}) represents the contexts in which the word type w_i is found across the whole corpus.
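As a rough sketch of this step, the windowed counting and PMI conversion can be implemented as follows. The toy corpus, the single feature word, and the `pmi_matrix` helper are illustrative assumptions, not the paper's actual pipeline; the `2 * window` factor plays the role of W_1 − 1 above.

```python
# Sketch of first-order co-occurrence counting and PMI conversion.
import math
from collections import Counter

def pmi_matrix(tokens, features, window=2):
    """Count co-occurrences of every word with each feature word inside a
    symmetric window (± `window` tokens) and convert counts to PMI."""
    unigram = Counter(tokens)
    cooc = Counter()
    n = len(tokens)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in features:
                cooc[(w, tokens[j])] += 1
    # PMI: log of observed co-occurrence over chance expectation,
    # with (2 * window) context slots per token (the W_1 - 1 factor).
    m = {}
    for (w, t), f_wt in cooc.items():
        expected = (2 * window) * unigram[w] * unigram[t] / n
        m[(w, t)] = math.log(f_wt / expected)
    return m

tokens = "the king saw the sword and the king took the sword".split()
M = pmi_matrix(tokens, features={"the"})
```

In the paper's setting, `tokens` would be one year's text and `features` the 529 high-document-frequency word forms.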

2.4 Second Order Co-occurrence Vectors
The co-occurrence vectors computed above provide representations for the average contexts in which a given word type is found. In order to represent the specific context of each word token, we used second order co-occurrence vectors (Schütze and Pedersen, 1997). These provide important information about the aspects of a context that are relevant for inflectional morphology (Moscoso del Prado Martín, 2007). The second order vectors were computed by passing a symmetrical sliding window including, for each token, the immediately preceding and following word. The vector for each token was computed as the average of the first order vectors (of Subsection 2.3) of the preceding and following words. If a first order vector was available for only one of the two neighboring words, the second order vector was simply the first order vector of that word. We excluded those tokens for which we had first order vectors for neither the preceding nor the following word type. We computed such second order vectors for all tokens in the corpus that had been tagged for grammatical case (a total of 419,910 vectors, on average 9,453 vectors per year, of which 38.14% were nominatives, 10.91% were genitives, 26.38% were accusatives, and 24.56% were datives).
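The neighbour-averaging scheme just described can be sketched as follows (the tiny two-dimensional first-order vectors and the vocabulary are illustrative assumptions):

```python
# Sketch of second-order token vectors: each token is represented by the
# average of the first-order vectors of its immediate neighbours (window ±1).
import numpy as np

def second_order_vectors(tokens, first_order):
    """first_order maps word type -> 1-D numpy vector. Tokens whose
    neighbours both lack a first-order vector are returned as None."""
    out = []
    for i in range(len(tokens)):
        neigh = []
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens) and tokens[j] in first_order:
                neigh.append(first_order[tokens[j]])
        out.append(np.mean(neigh, axis=0) if neigh else None)
    return out

first = {"the": np.array([1.0, 0.0]), "a": np.array([0.0, 1.0])}
vecs = second_order_vectors(["the", "king", "a"], first)
```

Here the middle token gets the average of its two neighbours' vectors, while the edge tokens (whose only in-vocabulary neighbour is missing) are excluded, mirroring the exclusion criterion in the text.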

2.5 Representation of the Case Prototypes
In order to represent the prototypical usages of each grammatical case (i.e., nominative, genitive, accusative, and dative) in a given year, we used the first order co-occurrence technique. For each of the 44 distinct years -using the same features identified in Subsection 2.3- we computed first order co-occurrence vectors collapsing all word tokens in each grammatical case, and using a reduced window size including just the preceding and following words (i.e., W_2 = 3). For each year (y) we obtained a 4 × 529 element matrix of co-occurrence frequencies (f_y[c, t]), indicating the number of times that each case (c) was found to co-occur (within the specified window) with feature (t). These matrices were transformed into case-to-feature pointwise mutual information values, resulting in a series of 44 matrices M[y] = (m_y[c_i, t_j]), with

m_y[c_i, t_j] = log [ N f_y[c_i, t_j] / ( (W_2 − 1) f[c_i] f[t_j] ) ],

where N is the total number of tokens in the corpus, f[c_i] is the number of instances of case c_i, and f[t_j] denotes the number of instances of word t_j in the corpus. In this way, the rows of the M[y] matrices provided a representation of the contexts in which each grammatical case was used in each year. Fig. 2a plots the spatial organization of the resulting vectors (after reducing to a bidimensional projection using Singular Value Decomposition; SVD). Notice that the prototypes for each case very naturally cluster together across the years. The scatter is however asymmetric, hinting at a process of change along the years common to all four cases. If we compute a yearly overall prototype vector as the average vector for the cases in each year, and we subtract it from the corresponding case prototypes, we find that the case identities become clearly differentiated in space (see Fig. 2b), demonstrating that the case prototype vectors do indeed capture the contextual properties of all four cases, which are highly distinctive.
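The centering step that produces the separation in Fig. 2b, together with the SVD projection, can be sketched as follows. Random stand-in data replace the actual PMI prototypes; only the array shapes (44 years × 4 cases × 529 features) follow the paper.

```python
# Sketch of prototype centering and 2-D SVD projection (cf. Fig. 2).
import numpy as np

rng = np.random.default_rng(0)
n_years, n_cases, n_feat = 44, 4, 529
# protos[y, c] = PMI prototype of case c in year y (random stand-in here)
protos = rng.normal(size=(n_years, n_cases, n_feat))

# Subtract each year's mean vector so that case identity, rather than the
# year-specific signal shared by all four cases, dominates the geometry.
centered = protos - protos.mean(axis=1, keepdims=True)

# Bidimensional projection via SVD of the stacked, mean-centered prototypes.
flat = centered.reshape(n_years * n_cases, n_feat)
U, S, Vt = np.linalg.svd(flat - flat.mean(axis=0), full_matrices=False)
proj2d = U[:, :2] * S[:2]
```

With the real prototypes, plotting `proj2d` colored by case would reproduce the clustering described for Fig. 2b.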

3 Functional Drift in Icelandic Cases
As discussed in the Introduction, the inflectional paradigms marking case and number have barely changed -if at all- along the history of Icelandic. On the basis of this fact alone, one could conclude that the grammatical case system is not actually experiencing any linguistic drift, but has rather remained basically static throughout the last millennium. There is, however, another possibility: linguistic drift could have been affecting the functions of grammatical cases in Icelandic. If this were the case, one would expect to observe a slow, constant-rate diachronic change in the contexts in which each of the four cases is used.
To investigate this latter possibility, for each of the 44 years documented in the corpus, we trained a basic logistic classifier in the task of assigning grammatical case to the second order co-occurrence vectors developed in Subsection 2.4. Once each of the classifiers had been trained, we tested the classifiers' performances on the vectors obtained from each of the other 43 years on which they were not trained. On the one hand, if the functions of the cases have indeed remained constant along the history of the language, one would expect the performance of a classifier trained on the data from a given year to remain approximately constant when tested on vectors from all other years. If, on the other hand, the functions of Icelandic grammatical cases have been subject to linguistic drift, the irreversible and cumulative nature of the drift (Sapir, 1921/2014) implies that the classifier error should grow -if only very slightly- with each year passed. The reason for this is that the contexts in which one would use each case should be slightly different from year to year. One should then predict that the error of the classifier should depend on the temporal distance between the year of the testing vectors and that of the training ones. Furthermore, the change in error should be of a linear nature, with a very slight slope. When tested on the same years in which they had been trained, the classifiers performed rather well in inferring the case to which each of the context vectors belonged (the distribution of errors across the 44 years was well approximated by a normal distribution with a mean error of 26.67%, a standard deviation of 1.99%, and best and worst classification errors of 22.17% and 30.39%, respectively). 3

Figure 3: Correlation between the classifier error and the temporal distance from the year from which the training vectors were obtained to the year when the testing vectors were obtained (β+ = 0.002, p+ < 0.001; β− = −0.002, p− < 0.001). The solid lines plot linear regressions.

3 It was not essential to have a classifier with a decent performance, as our goal was showing that the error is time-dependent.
We then tested the classifiers on the vectors obtained from different years. The results are plotted in Fig. 3. The scatter plot shows the classifier error as a function of the difference in years between testing and training (difference values are positive when the classifier was tested on vectors obtained after those used for training, and negative when tested on vectors obtained before the training ones). When testing on data different from the training sets, there is an expected loss in performance (of about 8%) relative to the baseline of testing on the training set itself. We fitted two linear regressions, one to the positive differences and another to the negative differences (plotted by the solid lines in Fig. 3). The first thing that stands out is that the performance of the classifier remains remarkably good when tested on vectors obtained at considerable temporal distances from the time when the training vectors were obtained. While the error of the classifier is about 34% when tested on vectors from the year after or before the training vectors, the error remains at 35% for vectors originating from texts that are five centuries apart. Once again, this speaks to the remarkable conservativeness of the Icelandic language. However, these small differences are in fact reliable: there are significant slopes in both regressions (positive differences: R+ = .161, p+ < .001; negative differences: R− = −.164, p− < .001). A second remarkable finding is that both regressions are essentially symmetrical; in fact their slopes are basically identical (|β+| = |β−| = .002). This indicates that the degree to which the usages of the cases at different time points have diverged depends on the amount of time that has intervened, irrespective of whether it was the training or the testing set that was collected earlier.
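The train-on-one-year, test-on-another design can be sketched with scikit-learn's logistic regression. Everything here is an illustrative assumption: synthetic Gaussian 'case' clusters stand in for the second order vectors, and drift is crudely modeled as a translation of the whole space along a fixed direction.

```python
# Sketch of the cross-year classification experiment (cf. Section 3).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d = 10
base_means = rng.normal(scale=2.0, size=(4, d))  # one mean per case
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)           # shared drift direction

def year_data(drift, n=200, seed=2):
    """Four synthetic case clusters, all translated by `drift` along a
    shared direction (a crude stand-in for cumulative functional drift)."""
    r = np.random.default_rng(seed)
    y = np.repeat(np.arange(4), n // 4)
    X = base_means[y] + drift * direction + r.normal(size=(n, d))
    return X, y

X0, y0 = year_data(drift=0.0)
clf = LogisticRegression(max_iter=1000).fit(X0, y0)
err_same = 1 - clf.score(X0, y0)            # error on the training year
err_far = 1 - clf.score(*year_data(5.0))    # error after substantial drift
```

In the paper's setup, one classifier is trained per year and `err` is recorded for all 44 × 43 train/test year pairs, after which the errors are regressed on the signed year difference.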
One could argue that the slow drift observed may not really be due to changes in the functions of the grammatical cases themselves, but just to overall changes either in the language as a whole, or in the very topics that are addressed (e.g., one might guess that talk of swords, slaves, and longships was more frequent in XII century Norse than it is in Modern Icelandic). To investigate this possibility we used an information-theoretical measure of the prototypicality of a set of second order vectors for a particular year (based on that used in Moscoso del Prado Martín, 2007). From the vectors of each year, irrespective of their case, we fitted a 529-dimensional Gaussian distribution (by estimating the mean vector for that year, µ_y, and the covariance matrix, Σ_y). The average inadequacy of a given set of vectors {v_1, ..., v_n} obtained in year z with respect to the distribution fitted to the vectors obtained in year y is measured by the cross-entropy, a Monte Carlo estimator of which is given by

H({v_1, ..., v_n}; µ_y, Σ_y) = (1 / 2n) Σ_{k=1}^{n} (v_k − µ_y)^T Σ_y^{−1} (v_k − µ_y) + K,

where K is a constant. In addition, one should also take into account the fact that, for some years, the classifier might generalize better or worse than for others (due to irrelevant idiosyncrasies of the specific text used for training), which could lead to a distortion of the results.
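The cross-entropy 'inadequacy' measure can be sketched as follows: fit a Gaussian to one year's vectors, then score another year's vectors under it. The dimensions and data are illustrative assumptions (the paper works with 529-dimensional vectors).

```python
# Sketch of the Gaussian cross-entropy (inadequacy) measure.
import numpy as np

def gaussian_cross_entropy(test_vecs, mu, cov):
    """Monte Carlo estimate of the cross-entropy between the empirical
    distribution of test_vecs and a Gaussian N(mu, cov):
    -(1/n) * sum_k log N(v_k; mu, cov)."""
    d = len(mu)
    diff = test_vecs - mu
    inv = np.linalg.inv(cov)
    mahal = np.einsum("ij,jk,ik->i", diff, inv, diff)  # per-row Mahalanobis
    sign, logdet = np.linalg.slogdet(cov)
    # The log-determinant and 2*pi terms form the additive constant K.
    return 0.5 * (mahal.mean() + logdet + d * np.log(2 * np.pi))

rng = np.random.default_rng(3)
year_a = rng.normal(size=(500, 5))
year_b = rng.normal(loc=2.0, size=(500, 5))        # a "shifted" year
mu, cov = year_a.mean(axis=0), np.cov(year_a, rowvar=False)
h_same = gaussian_cross_entropy(year_a, mu, cov)
h_other = gaussian_cross_entropy(year_b, mu, cov)
```

Vectors from a shifted distribution receive a higher cross-entropy, i.e. they are less adequate to the fitted year, which is the quantity entered as a covariate in the next analysis.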
To investigate whether, after discounting for the inadequacy of the vectors to the overall distribution of those on which the classifier was trained, there was still evidence for drift in the functions of the cases, while also accounting for the different generalization powers of the classifiers, we fitted a linear mixed-effects model to the classifier errors, including as fixed-effect predictors the cross-entropy described above and the absolute value of the difference in years between the training set and testing set dates (as indicated above, the effects were equivalent for positive and negative differences in years), and the dating of the testing set as a random effect. As expected, we found that the cross-entropies had a significant positive effect (β = .002, z = 5.009, p < .001; left panel of Fig. 4), indicating that the performances of the models were indeed worse for less adequate sets of testing vectors, irrespective of any aspect of grammatical case. However -crucially- even after considering the effect of cross-entropy, there remained a significant positive effect of the temporal distance (β = .001, z = 4.661, p < .001; right panel of Fig. 4). 6 This result therefore supports the hypothesis that the function of grammatical case has been subject to a slight constant change during the history of the language: a functional drift.
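The mixed-effects analysis can be sketched with statsmodels' MixedLM; the paper does not specify its software, and the simulated data, coefficients, and column names below are all assumptions.

```python
# Sketch of the mixed-effects model: classifier error ~ cross-entropy +
# |year distance|, with testing-set year as a random intercept.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 400
df = pd.DataFrame({
    "xent": rng.normal(10, 2, n),                    # cross-entropy covariate
    "dist": rng.integers(1, 500, n).astype(float),   # |year difference|
    "test_year": rng.integers(0, 40, n),             # grouping factor
})
# Simulated errors: both predictors contribute, plus a by-year random offset.
year_offset = rng.normal(0, 0.01, 40)
df["error"] = (0.30 + 0.002 * df["xent"] + 0.0001 * df["dist"]
               + year_offset[df["test_year"]] + rng.normal(0, 0.01, n))

model = smf.mixedlm("error ~ xent + dist", df, groups=df["test_year"]).fit()
```

A positive fixed-effect estimate for `dist`, net of `xent`, is the pattern the paper interprets as functional drift over and above overall language change.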

4 Functional Chain Shifts in Case
In the previous section we demonstrated that the functions of Icelandic cases have been subject to slow linguistic drift. The question now arises of whether this drift is purely random, or whether it has some degree of directionality arising from endogenous linguistic factors. It is possible that changes in the functions of some cases caused changes in the functions of others. We investigate this possibility using the notion of Granger-causality.

4.1 Granger-causality
Granger-causality (Granger, 1969) is a powerful technique for assessing whether one time series can be said to be the cause of another. The basic idea is that a time series x is said to Granger-cause another series y if the past of x predicts the future of y over and above any predictive power that can be found in y's own past. This idea has proven of great value for investigating the causal connections between economic variables, sequences of behavioral responses, neural spikes, or electroencephalographic potentials. Often, the technique is used to reconstruct directional networks of variables and processes that have causal connections.

6 The estimated covariance matrices were not positive definite for two of the years, which were excluded from the analyses. In addition, in 552 out of the remaining 1,849 estimates, the cross-entropy took unusually large values, orders of magnitude larger than the rest (likely reflecting the inadequacy of the multidimensional Gaussian approximation in these cases), which distorted the effect estimates. The analyses reported exclude these 552 points. Keeping these outlying values in the regression, both key effects remained significant, but the slope estimates were less trustworthy.
If x and y are stationary time series indexed by time (τ), in order to test whether x Granger-causes y, one begins by fitting an autoregressive model (AR) that predicts the values of y from its own n values lagged into the past. This consists of finding values a_1, a_2, ..., a_n that minimize the error ε in the equation

y_τ = Σ_{k=1}^{n} a_k y_{τ−k} + ε_τ.

One then augments the autoregression by including m lagged values of x, with additional parameters b_1, ..., b_m to be fitted,

y_τ = Σ_{k=1}^{n} a_k y_{τ−k} + Σ_{k=1}^{m} b_k x_{τ−k} + ε'_τ,

where the ε sequences are uncorrelated (white) Gaussian noises, reflecting the fully random or chaotic part of the system, which cannot be predicted from its past (i.e., the error, which some term the 'creativity' of the model). If the second regression is a significant improvement over the first, then x is said to Granger-cause y, indicating that past values of x significantly predict future values of y over and above any predictive power of y's own past values. This is tested using an F-test, with the null hypothesis being that the second model does not improve on the first. The selection of the autoregressive order parameters n and m is achieved by model comparisons using information criteria.
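Under the assumption n = m = 1, the restricted/augmented comparison can be sketched directly with ordinary least squares on synthetic series in which x drives y at lag 1 (the series and coefficients are illustrative assumptions):

```python
# Sketch of a bivariate Granger test (n = m = 1) via two nested OLS fits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
T = 300
x = rng.normal(size=T)
y = np.empty(T)
y[0] = rng.normal()
for t in range(1, T):
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + rng.normal(scale=0.5)

def ols_rss(X, target):
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return resid @ resid

ones = np.ones(T - 1)
target = y[1:]
X_restricted = np.column_stack([ones, y[:-1]])        # y's own past only
X_full = np.column_stack([ones, y[:-1], x[:-1]])      # ... plus x's past
rss_r, rss_f = ols_rss(X_restricted, target), ols_rss(X_full, target)

# F-test: does adding x's past significantly reduce the residual error?
df_extra, df_resid = 1, (T - 1) - X_full.shape[1]
F = ((rss_r - rss_f) / df_extra) / (rss_f / df_resid)
p = 1 - stats.f.cdf(F, df_extra, df_resid)
```

A small p here licenses the conclusion that x Granger-causes y; running the same test with the roles of x and y swapped probes the reverse direction.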
When one is interested in reconstructing a network of causal relations between multiple variables, one can use a multivariate generalization of the AR model, the vector autoregressive model (VAR). The VAR model consists of multiple AR equations (one for each variable in the model). If we consider an autoregressive order of one (i.e., m = n = 1), and we are simultaneously considering p variables Y = {y_1, ..., y_p}, the VAR[1] model to be fitted can be expressed in matrix notation as

y_τ = A y_{τ−1} + ε_τ,

where y_τ = (y_{1,τ}, ..., y_{p,τ})^T, A = (A_{i,j}) is a p × p matrix of coefficients, and ε_τ is a vector of white Gaussian noises.
This model enables testing for Granger-causality between any pair of variables y_i ∈ Y and y_j ∈ Y, after partialling out the possible confounding effects of the remaining variables {y_k : k ≠ i, k ≠ j, 1 ≤ k ≤ p}. y_j is said to Granger-cause y_i if the model coefficient A_{i,j} is significantly different from zero, and conversely y_i Granger-causes y_j if A_{j,i} is significantly different from zero.

4.2 Granger-causality in Case Drift
To investigate whether the pattern of change in one case triggers (i.e., Granger-causes) the pattern of change in another, we made use of the prototype vectors for the cases in each of the years developed in Subsection 2.5. As a measure of the amount of contextual change for a given case in a given year, we computed the city-block distances between the case prototypes from each year to the next available time point, which are plotted in Fig. 5a. Notice that there is an overall pattern of change equally affecting all cases, and the changes are therefore strongly correlated. This reflects the overall pattern of historical changes affecting Icelandic as a whole, as well as changes in the topics discussed in the different time periods, as was documented in Subsection 2.5 and Section 3. Considering the changes in each case as a component of a four-dimensional vector, the modulus of this vector (plotted by the dashed orange line in Fig. 5a) gives the overall magnitude of the changes that are unspecific to the cases themselves. To remove this component from the changes, we fitted a linear regression to the sequence of changes in each case, using the overall pattern of change as a predictor. Fig. 5b plots the resulting residuals, indicating the amount of change that was specific to each case, over and above the overall pattern. 7

A precondition for testing for Granger-causality is that the time series under consideration are stationary. In our case, the series depicted in Fig. 5b are significantly non-stationary; they exhibit, for instance, significant temporal trends. In order to remove the non-stationarities, the series were differenced (i.e., we considered the difference between each two consecutive points). The result of this differencing, plotted in Fig. 5c, removed the non-stationary trends from the original series.

7 Negative values in this figure indicate changing less than the average, rather than 'negative change'.
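The three preprocessing steps just described (city-block changes, residualization against the overall pattern, and differencing) can be sketched as follows, with random stand-in prototypes of illustrative dimensionality:

```python
# Sketch of the change-series construction (cf. Fig. 5a-c).
import numpy as np

rng = np.random.default_rng(6)
protos = rng.normal(size=(44, 4, 20))    # year x case x feature (toy data)

# City-block (L1) distance from each year's prototype to the next year's.
change = np.abs(np.diff(protos, axis=0)).sum(axis=2)   # shape (43, 4)

# Overall change magnitude: modulus of the 4-dimensional change vector.
overall = np.linalg.norm(change, axis=1)

# Residualize each case's change series against the overall pattern.
resid = np.empty_like(change)
X = np.column_stack([np.ones(len(overall)), overall])
for c in range(4):
    beta, *_ = np.linalg.lstsq(X, change[:, c], rcond=None)
    resid[:, c] = change[:, c] - X @ beta

# Differencing to remove the remaining non-stationary trends.
diffed = np.diff(resid, axis=0)
```

The four columns of `diffed` are the stationary case-specific change series that enter the VAR analysis below.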
We fitted a VAR[n] model to the four differenced time series. The autoregressive order found to be optimal according to Akaike's Information Criterion (Akaike, 1974) was n = 1. The F statistics and significance values for the coefficients in the resulting VAR[1] model are given in Tab. 1. In order to reconstruct the causality network, we also need to consider that we started out with only very vague predictions on the possible directions of causality. As the model involved twelve separate p-value tests, the p-value estimates need to be corrected for multiple comparisons. This correction was done using the false discovery rate (FDR) method (Benjamini and Hochberg, 1995), resulting in the corrected p-value estimates listed in the last column of Tab. 1.
The Granger-causality analysis leads us to reconstruct the causality network depicted in Fig. 6. It appears that the drift in the functions of Icelandic case is not purely random. Instead, we find evidence that changes in the functions of the accusatives and genitives have had a domino effect, triggering further changes in the functions of the nominatives. Finally, changes in the other three cases result in changes in the functions of the dative. In summary, the changes observed are consistent with the idea, discussed in the Introduction, of a functional chain shift affecting the morphological case system of Icelandic.

5 Discussion
We have presented evidence for a steady drift -of the precise kind advocated by Sapir (1921/2014)- even in a language as remarkably conservative as Icelandic. This supports the claim that human languages are in a state of 'perpetual motion' (Beckner et al., 2009; Dediu et al., 2013; Hawkins and Gell-Mann, 1992; Hopper, 1987; Larsen-Freeman and Cameron, 2007; Niyogi and Berwick, 1997). Although we have found that functional change in Icelandic case has proceeded at a constant rate, we do not think, as argued by Nettle (1999a, 1999b), that this rate of change needs to be constant across languages. There are strong arguments suggesting that in other languages such rates might be different (Wichmann, 2008; Wichmann and Holman, 2009). The crucial innovation presented in this paper is the reconstruction of the causality network linking the changes in the four cases. Previous applications of the notion of Granger-causality to diachronic language change (Moscoso del Prado Martín, 2014) have focused on the macroscopic relation between sudden changes in syntax and morphology. Here, we have demonstrated that Granger-causality can also be used to reconstruct detailed networks of slow changes within the morphological system, at a more microscopic scale. The techniques developed offer a mechanism for investigating subtle changes in the functions of linguistic constructions, and the causal relations between them. Traditionally, historical linguists have focused on 'narrative' accounts of the chains of change within a language. Although such accounts are extremely useful, the often very subtle changes in usage that can occur from one time point to another cannot always be described with such clear-cut patterns. Nevertheless, we have shown that those very small changes do accumulate in meaningful ways.
An important question addressed by this study is the presence of endogenous causal chains in language change. Lupyan and Dale (2015) argue that languages are constrained by their 'ecological niches', the communities in which they are spoken, and that the extralinguistic properties of those niches can trigger exogenous change in the morphology of the languages. Following Lupyan and Dale's ecosystem analogy, one can note that, as well as being part of ecosystems, languages are also ecosystems in themselves, in a nesting similar to that found in natural ecosystems (i.e., an animal is part of a particular ecosystem, and its body is an ecosystem in itself). Sounds, words, and constructions have their own ecological niches within the language, and disturbances in the system can trigger cascaded changes, leading to readaptation (evolution) of the constructions. This contrasts with the view of changes in the function of Icelandic cases expressed by Eythórsson (2000). He showed that verbs whose arguments exhibit 'nominative sickness' and 'accusative sickness' tend to cluster along certain syntactic and semantic lines. That it is in these particular niches that accusatives and datives ended up settling is not, however, the cause of the language changes. As we have shown, the case system was subject to a string of cascaded pressures. That the cases ended up settling in new syntactico-semantic niches was the result, rather than the cause, of the changes.