On the Linearity of Semantic Change: Investigating Meaning Variation via Dynamic Graph Models

We consider two graph models of semantic change. The first is a time-series model that relates embedding vectors from one time period to embedding vectors of previous time periods. In the second, we construct one graph for each word: nodes in this graph correspond to time points and edge weights to the similarity of the word's meaning across two time points. We apply our two models to corpora across three different languages. We find that semantic change is linear in two senses. Firstly, today's embedding vectors (= meaning) of words can be derived as linear combinations of embedding vectors of their neighbors in previous time periods. Secondly, self-similarity of words decays linearly in time. We consider both findings as new laws/hypotheses of semantic change.


Introduction
Meaning is not uniform, neither across space nor across time. Across space, different languages tend to exhibit different polysemous associations for corresponding terms (Eger et al., 2015; Kulkarni et al., 2015b). Across time, several well-known examples of meaning change in English have been documented. For example, the meaning of the word gay shifted during the 1970s from its adjectival sense of cheerful, current at the beginning of the 20th century, to its present meaning of homosexual (Kulkarni et al., 2015a). Similarly, technological progress has led to semantic broadening of terms such as transmission, mouse, or apple.
In this work, we consider two graph models of semantic change. Our first model is a dynamic model in that the underlying paradigm is a (time-)series of graphs. Each node in the series of graphs corresponds to one word, with which a semantic embedding vector is associated. We then ask how the embedding vectors in one time period (graph) can be predicted from the embedding vectors of neighbor words in previous time periods. In particular, we postulate a linear functional relationship that couples a word's present meaning with its neighbors' meanings in the past. When estimating the coefficients of this model, we find that the linear form indeed appears very plausible. This functional form then allows us to address further questions, such as negative relationships between words (which indicate semantic differentiation over time) as well as projections into the future. We call our second graph model time-indexed self-similarity graphs. In these graphs, each node corresponds to a time point and the link between two time points indicates the semantic similarity of a specific word across the two time points under consideration. The analysis of these graphs reveals that most words obey a law of linear semantic 'decay': semantic self-similarity decreases linearly over time.
In our work, we capture semantics by means of word embeddings derived from context-predicting neural network architectures, which have become the state of the art in distributional semantics modeling (Baroni et al., 2014). Our approach and results are partly independent of this representation, however, in that we take a structuralist approach: we derive new, 'second-order embeddings' by modeling the meaning of words by means of their semantic similarity relations to all other words in the vocabulary (de Saussure, 1916; Rieger, 2003). Thus, future research may in principle replace the deep-learning architectures for semantics considered here with any other method capable of producing semantic similarity values between lexical units. This work is structured as follows. In §2, we discuss related work. In §3.1 and §3.2, respectively, we formally introduce the two graph models outlined. In §4, we detail our experiments and in §5, we conclude.

Related work
Broadly speaking, one can distinguish two recent NLP approaches to meaning change analysis. On the one hand, coarse-grained trend analyses compare the semantics of a word in one time period with the meaning of the word in the preceding time period (Jatowt and Duh, 2014; Kulkarni et al., 2015a). Such coarse-grained models, by themselves, do not specify in which respects a word has changed (e.g., semantic broadening or narrowing), but just aim at capturing whether meaning change has occurred. In contrast, more fine-grained analyses typically sense-label word occurrences in corpora and then investigate changes in the corresponding meaning distributions (Rohrdantz et al., 2011; Mitra et al., 2014; Plitz et al., 2015; Zhang et al., 2015). Sense-labeling may be achieved by clustering the context vectors of words (Huang et al., 2012; Chen et al., 2014; Neelakantan et al., 2014) or by applying LDA-based techniques where word contexts take the roles of documents and word senses take the roles of topics (Rohrdantz et al., 2011; Lau et al., 2012). Finally, there are studies that test particular meaning change hypotheses, such as whether similar words tend to diverge in meaning over time (according to the 'law of differentiation') (Xu and Kemp, 2015), and papers that aim to detect corresponding terms across time (words with similar meanings/roles in two time periods but potentially different lexical forms) (Zhang et al., 2015).

Graph models
Let $V = \{w_1, \ldots, w_{|V|}\}$ be the common vocabulary (intersection) of all words in all time periods $t \in T$. Here, $T$ is a set of time indices. Denote an embedding of a word $w_i$ at time period $t$ as $w_i(t) \in \mathbb{R}^d$. Since embeddings $w_i(s)$, $w_i(t)$ for two different time periods $s, t$ are generally not comparable, as they may lie in different coordinate systems, we consider the vectors

$$\tilde{w}_i(t) = \big(\mathrm{sim}(w_i(t), w_1(t)), \ldots, \mathrm{sim}(w_i(t), w_{|V|}(t))\big), \tag{1}$$

each of which lies in $\mathbb{R}^{|V|}$ and where sim is a similarity function such as the cosine. We note that our structuralist definition of $\tilde{w}_i(t)$ is not unproblematic, since the vectors $w_1(t), \ldots, w_{|V|}(t)$ tend to be different across $t$, by our very postulate, so that there is non-identity of these 'reference points' over time. However, as we may assume that the meanings of at least a few words are stable over time, we strongly expect the vectors $\tilde{w}_i(t)$ to be suitable for our task of analyzing meaning changes. For the remainder of this work, for convenience, we do not distinguish, in terms of notation, between $w_i(t)$ and $\tilde{w}_i(t)$.
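The second-order construction of Eq. (1) can be illustrated with a minimal sketch. The function names (`cosine`, `second_order`) and the toy vectors are hypothetical and serve illustration only; they are not taken from our implementation. The example also shows why the construction helps: rotating all first-order vectors of a period, which mimics a different coordinate system, leaves the second-order vectors unchanged.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def second_order(embeddings, vocab):
    """Map each word to its vector of similarities to all words in vocab.

    embeddings: dict word -> first-order vector for ONE time period.
    vocab: fixed word order shared by all time periods.
    """
    return {w: [cosine(embeddings[w], embeddings[r]) for r in vocab]
            for w in vocab}

# Hypothetical toy data: three words, two dimensions.
vocab = ["cat", "dog", "car"]
period_a = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
# The same configuration rotated by 90 degrees, mimicking the different
# coordinate system of an independently trained embedding space.
period_b = {w: [-y, x] for w, (x, y) in period_a.items()}

so_a = second_order(period_a, vocab)
so_b = second_order(period_b, vocab)
print(so_a["cat"])
print(so_b["cat"])
```

Each resulting vector has $|V|$ entries, so the two periods become directly comparable even though their first-order coordinates differ.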

A linear model of semantic change
We postulate, and subsequently test, the following model of meaning dynamics, which describes meaning change over time for words $w_i$:

$$w_i(t) = \sum_{n=1}^{p} \sum_{w_j \in N(w_i)} \alpha^n_{w_j}\, w_j(t-n), \tag{2}$$

where $\alpha^n_{w_j} \in \mathbb{R}$, for $n = 1, \ldots, p$, are coefficients of meaning vectors $w_j(t-n)$ and $p \geq 1$ is the order of the model. The set $N(w_i) \subseteq V$ denotes a set of 'neighbors' of word $w_i$. This model says that the meaning of a word $w_i$ at some time $t$ is determined by reference to the meanings of its 'neighbors' in previous time periods, and that the underlying functional relationship is linear.
We remark that the model described by Eq. (2) is a time-series model, and, in particular, a vector-autoregressive (VAR) model with special structure. The model may also be seen in the socio-economic context of so-called "opinion dynamic models" (Golub and Jackson, 2010; Acemoglu and Ozdaglar, 2011; Eger, 2016). There it is assumed that agents are situated in network structures and continuously update their opinions/beliefs/actions according to their ties with other agents. Model (2) substitutes multi-dimensional embedding vectors for one-dimensional opinions.
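To make the structure of Eq. (2) concrete, the following sketch fits an order $p = 1$ model for a single target word by ordinary least squares. This is an illustration under stated assumptions rather than a description of our estimation procedure: it assumes one scalar coefficient per neighbor shared across all vector dimensions, it includes the target word among its own neighbors, and all names (`solve`, `fit_order1`) and data are hypothetical.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_order1(series, target, neighbors):
    """Least-squares coefficients alpha_j with w_target(t) ~ sum_j alpha_j w_j(t-1).

    series: dict word -> list of vectors, one per time period.
    Every (period, dimension) pair contributes one regression row.
    """
    rows, ys = [], []
    for t in range(1, len(series[target])):
        for k in range(len(series[target][t])):
            rows.append([series[j][t - 1][k] for j in neighbors])
            ys.append(series[target][t][k])
    m = len(neighbors)
    # Normal equations: (X^T X) alpha = X^T y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(m)] for a in range(m)]
    xty = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(m)]
    return dict(zip(neighbors, solve(xtx, xty)))

# Hypothetical toy series: "b" evolves freely, "a" follows the model exactly
# with coefficients 0.7 (on itself) and 0.3 (on "b").
b = [[0.0, 1.0], [0.2, 0.9], [0.5, 0.5], [0.1, 0.3]]
a = [[1.0, 0.0]]
for t in range(1, len(b)):
    a.append([0.7 * a[t - 1][k] + 0.3 * b[t - 1][k] for k in range(2)])

coefs = fit_order1({"a": a, "b": b}, "a", ["a", "b"])
print(coefs)
```

Because the toy series is generated from the model itself, the fit recovers the generating coefficients up to floating-point error; on real data the residual would be non-zero.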

Time-indexed self-similarity graphs
We track meaning change by considering a fully connected graph G(w) for each word w in V . The nodes of G(w) are the time indices T , and there is an undirected link between any two s, t ∈ T whose weight is given by sim(w(s), w(t)). We call the graphs G(w) time-indexed self-similarity (TISS) graphs because they indicate the (semantic) similarity of a given word with itself across different time periods. Particular interest may lie in weak links in these graphs as they indicate low similarity between two different time periods, i.e., semantic change across time.
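A TISS graph for one word can be sketched as a dictionary keyed by pairs of time indices. The names (`tiss_graph`, `word_over_time`) and the toy vectors below are hypothetical; in practice the vectors would be the second-order embeddings of Eq. (1).

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tiss_graph(vectors_by_time):
    """Return {(s, t): sim(w(s), w(t))} for all pairs s < t of time indices."""
    times = sorted(vectors_by_time)
    return {(s, t): cosine(vectors_by_time[s], vectors_by_time[t])
            for s, t in combinations(times, 2)}

# Hypothetical toy word whose vector drifts over three decades.
word_over_time = {1900: [1.0, 0.0], 1910: [0.8, 0.6], 1920: [0.0, 1.0]}
graph = tiss_graph(word_over_time)

# The weakest link marks the pair of periods with the lowest self-similarity.
weakest = min(graph, key=graph.get)
print(weakest, graph[weakest])
```

Since the graph is undirected, storing each pair once with s < t suffices.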

Experiments
Data As the corpus for English, we use the Corpus of Historical American English (COHA). This covers texts from the time period 1810 to 2000. We extract two slices: the years 1900-2000 and 1810-2000. For both slices, each time period $t$ is one decade, e.g., $T = \{1810, 1820, 1830, \ldots\}$. For each slice, we only keep words belonging to the word classes nouns, adjectives, and verbs. For computational and estimation purposes, we also only consider words that occur at least 100 times in each time period. To induce word embeddings $w \in \mathbb{R}^d$ for each word $w \in V$, we use word2vec (Mikolov et al., 2013b) with default parametrizations. We do so for each time period $t \in T$ independently. We then use these embeddings to derive the new embeddings as in Eq. (1). Throughout, we use cosine similarity as the sim measure. For German, we consider a proprietary dataset of the German newspaper Süddeutsche Zeitung (SZ), for which $T = \{1994, 1995, \ldots, 2003\}$.
We lemmatize and POS-tag the data and likewise only consider nouns, verbs, and adjectives, applying the same frequency constraints as for English. Finally, we use the Patrologia Latina (PL; Migne, 1855) as the data set for Latin. Here, $T = \{300, 400, \ldots, 1300\}$. We use the same preprocessing, frequency, and word class constraints as for English and German.
Throughout, our datasets are well balanced in terms of size. For example, the English COHA datasets contain about 24M-30M tokens for each decade from 1900 to 2000, where the decades 1990 and 2000 contain slightly more data than the earlier decades. The pre-1900 decades contain 18M-24M tokens, with only the decades 1810 and 1820 containing very little data (1M and 7M tokens, respectively). The corpora are also balanced by genre.

TISS graphs
We start by investigating the TISS graphs. Let $D_{t_0}$ represent how semantically similar a word is across two time periods, on average, when the distance between time periods is $t_0$:

$$D_{t_0} = \frac{1}{C} \sum_{w \in V} \sum_{|s-t| = t_0} \mathrm{sim}(w(s), w(t)),$$

where $C = |V| \cdot |\{(s, t) \mid |s - t| = t_0\}|$ is a normalizer. Figure 1 plots the values $D_{t_0}$ for the time slice from 1810 to 2000, for the English data. We notice a clear trend: the self-similarity of a word tends to decrease (almost perfectly) linearly with time distance.

Table 1: Coefficients (%) in regression of $D_{t_0}$ on $t_0$, and adjusted $R^2$ values (%).
The values $D_{t_0}$ as a function of $t_0$ are averages over all words. Thus, it is possible that the average word's meaning decays linearly in time while the semantic behavior, over time, of a large fraction of words follows different trends. To investigate this, we consider the distribution of

$$D_{t_0}(w) = \frac{1}{C'} \sum_{|s-t| = t_0} \mathrm{sim}(w(s), w(t))$$

over fixed words $w$. Here $C' = |\{(s, t) \mid |s - t| = t_0\}|$. We fit the regression of $D_{t_0}(w)$ on $t_0$ for each word $w$ independently and assess the distribution of the coefficients $\alpha$ as well as the goodness-of-fit values. Figure 2 shows, taking the English 1900-2000 COHA data as an example, that the coefficients $\alpha$ are negative for almost all words. In fact, the distribution is left-skewed with a mean of around −0.4%. Moreover, the linear model is generally a good to very good fit of the data in that $R^2$ values are centered around 85% and rarely fall below 75%. We find similar patterns for all other datasets considered here. This shows that not only does the average word's meaning decay linearly, but the semantics of almost all words (whose frequency mass exceeds a particular threshold) behaves this way.
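The per-word analysis can be sketched as follows: average the link weights of a TISS graph by time distance to obtain $D_{t_0}(w)$, then regress these averages on $t_0$. The sketch uses simple least squares and plain (not adjusted) $R^2$; the function names and the toy graph are hypothetical, and the toy similarities are constructed to decay exactly linearly.

```python
from collections import defaultdict

def decay_profile(graph):
    """Average link weight per time distance: {t0: D_t0(w)}."""
    buckets = defaultdict(list)
    for (s, t), sim in graph.items():
        buckets[abs(s - t)].append(sim)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}

def linear_fit(xs, ys):
    """Simple least-squares fit y = intercept + slope * x, with plain R^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical TISS graph with exactly linear decay of self-similarity.
times = [1900, 1910, 1920, 1930]
graph = {(s, t): 1.0 - 0.01 * (t - s)
         for i, s in enumerate(times) for t in times[i + 1:]}

profile = decay_profile(graph)
slope, intercept, r2 = linear_fit(list(profile), list(profile.values()))
print(profile)
print(slope, r2)
```

Averaging `decay_profile` outputs over all words would give the aggregate curve $D_{t_0}$; a negative slope corresponds to the 'decay' discussed above.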
Next, we use our TISS graphs for the task of finding words that have undergone meaning change. To this end, we sort the graphs $G(w)$ by the ratios $R_{G(w)} = \frac{\text{maxlink}}{\text{minlink}}$, where maxlink denotes the maximal weight of a link in graph $G(w)$ and minlink the minimal weight of a link in graph $G(w)$. We note that weak links may indicate semantic change, but the stated ratio requires that 'weakness' be seen relative to the strongest semantic links in the TISS graphs. Table 2 shows selected words that have the highest values $R_{G(w)}$. We omit a fine-grained semantic change analysis, which could be conducted via the methods outlined in §2, but note a few cases. 'Terrific' has a large semantic discrepancy between the 1900s and the 1970s, when the word had probably changed from a negative to a more positive meaning. The largest discrepancy for 'web' is between the 1940s and the 2000s, when it probably came to be massively used in the context of the Internet. The high $R_{G(w)}$ value for $w$ = 'axis' derives from comparing its use in the 1900s with its use in the 1940s, when it probably came to be used in the context of Nazi Germany and its allies. We notice that the presented method can account for gradual, accumulating change, which is not possible for models that compare two succeeding time points such as the model of Kulkarni et al. (2015a).

Table 2: bush (1), web (2), alan (3), implement (4), jeff (5), gay (6), program (7), film (8), focus (9), terrific (16), axis (36). In brackets are the ranks of words, i.e., bush has the highest value $R_{G(w)}$, web the 2nd highest, etc.
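The ranking by the maxlink/minlink ratio can be sketched in a few lines. The word labels and link weights below are invented for illustration and are not measurements; the sketch additionally assumes that all link weights are positive, since the ratio is not meaningful otherwise.

```python
def link_ratio(graph):
    """Ratio of the strongest to the weakest link weight in a TISS graph."""
    weights = list(graph.values())
    if min(weights) <= 0:
        # Assumption of this sketch: similarities are strictly positive.
        raise ValueError("ratio is only defined for positive link weights")
    return max(weights) / min(weights)

# Hypothetical TISS graphs for two made-up words.
graphs = {
    "stable": {(1900, 1910): 0.95, (1910, 1920): 0.94, (1900, 1920): 0.90},
    "shifted": {(1900, 1910): 0.92, (1910, 1920): 0.60, (1900, 1920): 0.30},
}

# Words with the largest ratio come first, i.e. the strongest change candidates.
ranking = sorted(graphs, key=lambda w: link_ratio(graphs[w]), reverse=True)
print(ranking)
```

Because the ratio compares the weakest link with the strongest one in the same graph, a word whose self-similarity is uniformly low does not rank highly.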

Meaning dynamics network models
Finally, we estimate meaning dynamics models as in Eq. (2), i.e., we estimate the coefficients $\alpha^n_{w_j}$ from our data sources. We let the neighbors $N(w)$ of a word $w$ as in Eq. (2) be the union (w.r.t. $t$) over sets $N_t(w; n)$ denoting the $n \geq 1$ semantically most similar words (estimated by cosine similarity on the original word2vec vectors) of word $w$ in time period $t \in T$. In Table 3, we report two measures: adjusted $R^2$, which indicates the goodness-of-fit of a model, and prediction error. Prediction error is the Euclidean distance between the true semantic vector of a word in the final time period $t_N$ and the vector predicted by the linear model in Eq. (2), estimated on the data excluding the final period; the reported value is the average over all words. We note the following: $R^2$ values are high (typically above 95%), indicating that the linear semantic change model we have suggested fits the data well. Moreover, $R^2$ values slightly increase between order $p = 1$ and $p = 2$; however, for prediction error this trend is reversed. We also include a strong baseline that assumes that word meanings do not change in the final period $t_N$ but are the same as in $t_{N-1}$. We note that the order $p = 1$ model consistently improves upon this baseline, by as much as 18%, depending upon parameter settings.
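The comparison between the model's prediction error and the no-change baseline can be sketched for a single word as follows. All vectors and coefficients are invented toy values chosen for illustration, not estimates from our data, and the helper names are hypothetical; the reported figures average this quantity over all words.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_order1(coefs, previous):
    """Order-1 prediction: sum over neighbors of alpha_j * w_j(t_N - 1)."""
    dim = len(next(iter(previous.values())))
    return [sum(alpha * previous[j][k] for j, alpha in coefs.items())
            for k in range(dim)]

# Hypothetical vectors in period t_N - 1 and the true vector in period t_N.
previous = {"target": [0.7, 0.3], "neighbor": [0.1, 0.9]}
true_final = [0.5, 0.5]
# Hypothetical coefficients, assumed to be estimated on earlier periods only.
coefs = {"target": 0.7, "neighbor": 0.3}

model_error = euclidean(true_final, predict_order1(coefs, previous))
# Baseline: the word's meaning in t_N equals its meaning in t_N - 1.
baseline_error = euclidean(true_final, previous["target"])
print(model_error, baseline_error)
```

The baseline is strong because consecutive periods are highly self-similar, so a model only improves on it if the neighbors carry information about the direction of change.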
Negative relationships Another very interesting aspect of the model in Eq. (2) is that it allows for detecting words $w_j$ whose embeddings have negative coefficients $\alpha_{w_j}$ for a target word $w_i$ (we consider $p = 1$ in the remainder). Such negative coefficients may be seen as instantiations of the 'law of differentiation': the two words' meanings move, over time, in opposite directions in semantic space. We find significantly negative relationships between the following words, among others: summit ↔ foot, boy ↔ woman, vow ↔ belief, negro ↔ black. Instead of a detailed analysis, we mention that the Wikipedia entry for the last pair indicates that the meanings of 'negro' and 'black' switched roles between the early and late 20th century. While 'negro' was once the "neutral" term for the colored population in the US, it acquired negative connotations after the 1960s; and vice versa for 'black'.

Concluding remarks
We suggested two novel models of semantic change. First, TISS graphs allow for detecting gradual, non-consecutive meaning change. They make it possible to detect statistical trends of a possibly general nature. Second, our time-series models allow for investigating negative trends in meaning change ('law of differentiation') as well as forecasting into the future, which we leave for future work. Both models hint at a linear behavior of semantic change, which deserves further investigation. We note that this linearity concerns the core vocabulary of languages (in our case, words that occurred at least 100 times in each time period), and, in the case of TISS graphs, is an average result; particular words may have drastic, non-linear meaning changes across time (e.g., proper names referring to entirely different entities). However, our analysis also finds that most core words' meanings decay linearly in time.