Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change

Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity: the rate of semantic change scales with an inverse power-law of word frequency; (ii) the law of innovation: independent of frequency, words that are more polysemous have higher rates of semantic change.


Introduction
Shifts in word meaning exhibit systematic regularities (Bréal, 1897; Ullmann, 1962). The rate of semantic change, for example, is higher in some words than others (Blank, 1999). Compare the stable semantic history of cat (from Proto-Germanic kattuz, "cat") to the varied meanings of English cast: "to mould", "a collection of actors", "a hardened bandage", etc. (all from Old Norse kasta, "to throw"; Simpson et al., 1989).
Various hypotheses have been offered about such regularities in semantic change, such as an increasing subjectification of meaning, or the grammaticalization of inferences (e.g., Geeraerts, 1997;Blank, 1999;Traugott and Dasher, 2001).
But many core questions about semantic change remain unanswered. One is the role of frequency. Frequency plays a key role in other linguistic changes, associated sometimes with faster change (sound changes like lenition occur in more frequent words) and sometimes with slower change (high-frequency words are more resistant to morphological regularization) (Bybee, 2007; Pagel et al., 2007; Lieberman et al., 2007). What is the role of word frequency in meaning change?
Another unanswered question is the relationship between semantic change and polysemy. Words gain senses over time as they semantically drift (Bréal, 1897; Wilkins, 1993; Hopper and Traugott, 2003), and polysemous words occur in more diverse contexts, affecting lexical access speed (Adelman et al., 2006) and rates of L2 learning (Crossley et al., 2010). But we don't know whether the diverse contextual use of polysemous words makes them more or less likely to undergo change (Geeraerts, 1997; Winter et al., 2014; Xu et al., 2015). Furthermore, polysemy is strongly correlated with frequency (high-frequency words have more senses; Zipf, 1945; Ilgen and Karaoglan, 2007), so understanding how polysemy relates to semantic change requires controlling for word frequency.
Answering these questions requires new methods that can go beyond the case-studies of a few words (often followed over widely different time-periods) that are our most common diachronic data (Bréal, 1897; Ullmann, 1962; Blank, 1999; Hopper and Traugott, 2003; Traugott and Dasher, 2001). One promising avenue is the use of distributional semantics, in which words are embedded in vector spaces according to their co-occurrence relationships (Bullinaria and Levy, 2007; Turney and Pantel, 2010), and the embeddings of words are then compared across time-periods. This new direction has been effectively demonstrated in a number of case-studies (Sagi et al., 2011; Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014) and used to perform large-scale linguistic change-point detection (Kulkarni et al., 2014) as well as to test a few specific hypotheses, such as whether English synonyms tend to change meaning in similar ways (Xu and Kemp, 2015). However, these works employ widely different embedding approaches and test their approaches only on English.
Figure 1: Two-dimensional visualization of semantic change in English using SGNS vectors.[2] a, The word gay shifted from meaning "cheerful" or "frolicsome" to referring to homosexuality. b, In the early 20th century broadcast referred to "casting out seeds"; with the rise of television and radio its meaning shifted to "transmitting signals". c, Awful underwent a process of pejoration, as it shifted from meaning "full of awe" to meaning "terrible or appalling" (Simpson et al., 1989).
In this work, we develop a robust methodology for quantifying semantic change using embeddings by comparing state-of-the-art approaches (PPMI, SVD, word2vec) on novel benchmarks.
We then apply this methodology in a large-scale cross-linguistic analysis using 6 corpora spanning 200 years and 4 languages (English, German, French, and Chinese). Based on this analysis, we propose two statistical laws relating frequency and polysemy to semantic change:
• The law of conformity: Rates of semantic change scale with a negative power of word frequency.
• The law of innovation: After controlling for frequency, polysemous words have significantly higher rates of semantic change.

Diachronic embedding methods
The following sections outline how we construct diachronic (historical) word embeddings, by first constructing embeddings in each time-period and then aligning them over time, and the metrics that we use to quantify semantic change. All of the learned embeddings and the code we used to analyze them are made publicly available.
[2] Appendix B details the visualization method.

Embedding algorithms
We use three methods to construct word embeddings within each time-period: PPMI, SVD, and SGNS (i.e., word2vec). These distributional methods represent each word $w_i$ by a vector $\mathbf{w}_i$ that captures information about its co-occurrence statistics. These methods operationalize the 'distributional hypothesis' that word semantics are implicit in co-occurrence relationships (Harris, 1954; Firth, 1957). The semantic similarity/distance between two words is approximated by the cosine similarity/distance between their vectors (Turney and Pantel, 2010).

PPMI
In the PPMI representations, the vector embedding for word $w_i \in \mathcal{V}$ contains the positive point-wise mutual information (PPMI) values between $w_i$ and a large set of pre-specified 'context' words. The word vectors correspond to the rows of the matrix $\mathbf{M}^{\text{PPMI}} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}_C|}$ with entries given by
$$\mathbf{M}^{\text{PPMI}}_{i,j} = \max\left\{ \log\left( \frac{\hat{p}(w_i, c_j)}{\hat{p}(w_i)\,\hat{p}(c_j)} \right) - \alpha,\; 0 \right\}, \tag{1}$$
where $c_j \in \mathcal{V}_C$ is a context word and $\alpha > 0$ is a negative prior, which provides a smoothing bias (Levy et al., 2015). The $\hat{p}$ correspond to the smoothed empirical probabilities of word (co-)occurrences within fixed-size sliding windows of text. Clipping the PPMI values at zero ensures they remain finite and has been shown to dramatically improve results (Bullinaria and Levy, 2007; Levy et al., 2015); intuitively, this clipping ensures that the representations emphasize positive word-word correlations over negative ones.
Table 1 (partially recovered): [...] Google books (all genres), $1.9 \times 10^{11}$ tokens, 1800-1999, POS source Sagot et al. (2006); CHIALL, Chinese, Google books (all genres), $6.0 \times 10^{10}$ tokens, 1950-1999, POS source Xue et al. (2005).
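The PPMI construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `ppmi_matrix`, the dense count matrix, and the parameter names `alpha` and `cds` are assumptions made here for readability, and the smoothing exponent of 0.75 is taken from Appendix A.

```python
import numpy as np

def ppmi_matrix(counts, alpha=0.0, cds=0.75):
    """Shifted, clipped PMI from a (|V| x |V_C|) word-context count matrix.

    alpha: negative prior subtracted from every PMI value (hypothetical name).
    cds:   context distribution smoothing exponent (hypothetical name).
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                               # joint probability of (word, context)
    p_w = counts.sum(axis=1, keepdims=True) / total     # marginal probability of each word
    ctx = counts.sum(axis=0, keepdims=True) ** cds      # smoothed context counts
    p_c = ctx / ctx.sum()                               # smoothed context probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = -np.inf                    # unseen pairs get -inf before clipping
    return np.maximum(pmi - alpha, 0.0)                 # clip at zero so all entries are finite


# Toy example: two words, two contexts, each word seen only with "its" context.
toy_counts = [[10, 0], [0, 10]]
print(ppmi_matrix(toy_counts, alpha=0.0))
```

A real corpus would need a sparse matrix; the dense version is used only to keep the arithmetic visible.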

SVD
SVD embeddings correspond to low-dimensional approximations of the PPMI embeddings learned via singular value decomposition (Levy et al., 2015). The vector embedding for word $w_i$ is given by
$$\mathbf{w}_i^{\text{SVD}} = \left(\mathbf{U}\mathbf{\Sigma}^{\gamma}\right)_i, \tag{2}$$
where $\mathbf{M}^{\text{PPMI}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$ is the truncated singular value decomposition of $\mathbf{M}^{\text{PPMI}}$ and $\gamma \in [0, 1]$ is an eigenvalue weighting parameter. Setting $\gamma < 1$ has been shown to dramatically improve embedding qualities (Turney and Pantel, 2010; Bullinaria and Levy, 2012). This SVD approach can be viewed as a generalization of Latent Semantic Analysis (Landauer and Dumais, 1997), where the term-document matrix is replaced with $\mathbf{M}^{\text{PPMI}}$. Compared to PPMI, SVD representations can be more robust, as the dimensionality reduction acts as a form of regularization.
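As a rough illustration of the SVD embeddings, the sketch below truncates a dense SVD of a PPMI matrix and weights the singular values by an exponent. The function name `svd_embeddings` and the argument names `dim` and `gamma` are assumptions made for this example; a corpus-scale matrix would call for a sparse truncated solver instead of `numpy.linalg.svd`.

```python
import numpy as np

def svd_embeddings(m_ppmi, dim=300, gamma=0.0):
    """Rows of U * Sigma**gamma, keeping the top `dim` singular directions."""
    u, s, _ = np.linalg.svd(np.asarray(m_ppmi, dtype=float), full_matrices=False)
    dim = min(dim, s.shape[0])              # cannot keep more directions than exist
    return u[:, :dim] * (s[:dim] ** gamma)  # gamma < 1 down-weights the largest singular values


# Toy example: a 3 x 3 "PPMI" matrix reduced to 2 dimensions.
toy = np.array([[2.0, 0.0, 1.0],
                [0.0, 3.0, 0.0],
                [1.0, 0.0, 2.0]])
print(svd_embeddings(toy, dim=2, gamma=0.5).shape)
```

With `gamma=1` the rows reproduce the usual rank-`dim` projection, while `gamma=0` keeps only the left singular vectors.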

Skip-gram with negative sampling
SGNS 'neural' embeddings are optimized to predict co-occurrence relationships using an approximate objective known as 'skip-gram with negative sampling' (Mikolov et al., 2013). In SGNS, each word $w_i$ is represented by two dense, low-dimensional vectors: a word vector ($\mathbf{w}_i^{\text{SGNS}}$) and context vector ($\mathbf{c}_i^{\text{SGNS}}$). These embeddings are optimized via stochastic gradient descent so that
$$p(c_i \mid w_i) \propto \exp\left(\mathbf{w}_i^{\text{SGNS}} \cdot \mathbf{c}_i^{\text{SGNS}}\right), \tag{3}$$
where $p(c_i \mid w_i)$ is the empirical probability of seeing context word $c_i$ within a fixed-length window of text, given that this window contains $w_i$. The SGNS optimization avoids computing the normalizing constant in (3) by randomly drawing 'negative' context words, $c_n$, for each target word and ensuring that $\exp(\mathbf{w}_i^{\text{SGNS}} \cdot \mathbf{c}_n^{\text{SGNS}})$ is small for these examples.
SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time $t$ are initialized with the embeddings from time $t - \Delta$ (Kim et al., 2014). We employ this trick here, though we found that it had a negligible impact on our results.
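To make the role of the negative samples concrete, the following sketch evaluates the standard per-pair negative-sampling loss of Mikolov et al. (2013) for one observed (word, context) pair and a handful of sampled negatives. It is an illustration only: the function names are hypothetical, vectors are plain Python lists, and no training loop or sampling distribution is shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_pair_loss(word_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one observed pair plus sampled negatives.

    The first term rewards a large dot product with the observed context;
    each remaining term rewards a small dot product with a negative context.
    """
    loss = -math.log(sigmoid(dot(word_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(word_vec, neg)))
    return loss


# A well-placed word vector (close to its context, far from the negative)
# incurs a lower loss than a badly placed one.
good = sgns_pair_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = sgns_pair_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
print(good < bad)
```

Stochastic gradient descent on this quantity pushes dot products with negative contexts down, which is what removes the need for the normalizing constant.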

Datasets, pre-processing, and hyperparameters
We trained models on the 6 datasets described in Table 1, taken from Google N-Grams (Lin et al., 2012) and the COHA corpus (Davies, 2010). The Google N-Gram datasets are extremely large (comprising ≈6% of all books ever published), but they also contain many corpus artifacts due, e.g., to shifting sampling biases over time (Pechenick et al., 2015). In contrast, the COHA corpus was carefully selected to be genre-balanced and representative of American English over the last 200 years, though as a result it is two orders of magnitude smaller. The COHA corpus also contains pre-extracted word lemmas, which we used to validate that our results hold at both the lemma and raw token levels. All the datasets were aggregated to the granularity of decades.
We follow the recommendations of Levy et al. (2015) in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings. For all methods, we used symmetric context windows of size 4 (on each side). For SGNS and SVD, we use embeddings of size 300. See Appendix A for further implementation and pre-processing details.

Aligning historical embeddings
In order to compare word vectors from different time-periods we must ensure that the vectors are aligned to the same coordinate axes. Explicit PPMI vectors are naturally aligned, as each column simply corresponds to a context word. Low-dimensional embeddings will not be naturally aligned due to the non-unique nature of the SVD and the stochastic nature of SGNS. In particular, both these methods may result in arbitrary orthogonal transformations, which do not affect pairwise cosine-similarities within-years but will preclude comparison of the same word across time. Previous work circumvented this problem by either avoiding low-dimensional embeddings (e.g., Gulordava and Baroni, 2011;Jatowt and Duh, 2014) or by performing heuristic local alignments per word (Kulkarni et al., 2014).
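We use orthogonal Procrustes to align the learned low-dimensional embeddings. Defining $\mathbf{W}^{(t)} \in \mathbb{R}^{d \times |\mathcal{V}|}$ as the matrix of word embeddings learned at year $t$, we align across time-periods while preserving cosine similarities by optimizing:
$$\mathbf{R}^{(t)} = \arg\min_{\mathbf{Q}^{\top}\mathbf{Q} = \mathbf{I}} \left\| \mathbf{Q}\mathbf{W}^{(t)} - \mathbf{W}^{(t+1)} \right\|_F, \tag{4}$$
with $\mathbf{R}^{(t)} \in \mathbb{R}^{d \times d}$. The solution corresponds to the best rotational alignment and can be obtained efficiently using an application of SVD (Schönemann, 1966).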
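The closed-form Procrustes solution can be sketched as follows, assuming embeddings are stored as columns of a d × |V| matrix (as in the definition of W(t) above) and that both matrices share the same vocabulary order. The function name `procrustes_align` is hypothetical. Note that the orthogonality constraint admits reflections as well as rotations.

```python
import numpy as np

def procrustes_align(w_t, w_next):
    """Return (Q, Q @ w_t), where the orthogonal Q minimizes ||Q @ w_t - w_next||_F.

    w_t, w_next: arrays of shape (d, |V|), one column per word, same word order.
    """
    # SVD of the d x d cross-covariance matrix gives the optimal orthogonal map.
    u, _, vt = np.linalg.svd(w_next @ w_t.T)
    q = u @ vt
    return q, q @ w_t


# Toy check: rotate random embeddings by a known orthogonal matrix and recover it.
rng = np.random.default_rng(0)
w_old = rng.normal(size=(3, 5))
q_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
w_new = q_true @ w_old
q_est, aligned = procrustes_align(w_old, w_new)
print(np.allclose(aligned, w_new))
```

Because Q is orthogonal, cosine similarities between words within the aligned time-period are unchanged.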

Time-series from historical embeddings
Diachronic word embeddings can be used in two ways to quantify semantic change: (i) we can measure changes in pair-wise word similarities over time, or (ii) we can measure how an individual word's embedding shifts over time.
Pair-wise similarity time-series. Measuring how the cosine-similarity between pairs of words changes over time allows us to test hypotheses about specific linguistic or cultural shifts in a controlled manner. We quantify shifts by computing the similarity time-series between two words $w_i$ and $w_j$ over a time-period $(t, \ldots, t + \Delta)$. We then measure the Spearman correlation ($\rho$) of this series against time, which allows us to assess the magnitude and significance of pairwise similarity shifts; since the Spearman correlation is non-parametric, this measure essentially detects whether the similarity series increased/decreased over time in a significant manner, regardless of the 'shape' of this curve.
Measuring semantic displacement. After aligning the embeddings for individual time-periods, we can use the aligned word vectors to compute the semantic displacement that a word has undergone during a certain time-period. In particular, we can directly compute the cosine-distance between a word's representation for different time-periods, i.e. $\text{cos-dist}(\mathbf{w}_t, \mathbf{w}_{t+\Delta})$, as a measure of semantic change. We can also use this measure to quantify 'rates' of semantic change for different words by looking at the displacement between consecutive time-points.
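Both measures can be illustrated with a small pure-Python sketch. The helper names below are hypothetical, the Spearman coefficient is computed directly from average ranks, and no p-value is produced (a statistics package would be needed for the significance test mentioned above).

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed series."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def semantic_displacement(vec_t, vec_next):
    """Cosine distance between a word's aligned vectors at two time-points."""
    return 1.0 - cosine_similarity(vec_t, vec_next)


# Toy similarity series for one word pair, one value per decade.
decades = [1900, 1910, 1920, 1930]
similarities = [0.10, 0.20, 0.25, 0.90]
print(spearman(decades, similarities))
```

A monotonically rising series yields a coefficient of 1 whatever its shape, which is the property the text relies on.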

Comparison of different approaches
We compare the different distributional approaches on a set of benchmarks designed to test their scientific utility. We evaluate both their synchronic accuracy (i.e., ability to capture word similarity within individual time-periods) and their diachronic validity (i.e., ability to quantify semantic changes over time).

Synchronic Accuracy
We evaluated the synchronic (within-time-period) accuracy of the methods using a standard modern benchmark and the 1990s portion of the ENGALL data. On Bruni et al. (2012)'s MEN similarity task of matching human judgments of word similarities, SVD performed best (ρ = 0.739), followed by PPMI (ρ = 0.687) and SGNS (ρ = 0.649).
These results echo the findings of Levy et al. (2015), who found SVD to perform best on similarity tasks while SGNS performed best on analogy tasks (which are not the focus of this work).

Diachronic Validity
We evaluate the diachronic validity of the methods on two historical semantic tasks: detecting known shifts and discovering shifts from data. For both these tasks, we performed detailed evaluations on a small set of examples (28 known shifts and the top-10 "discovered" shifts by each method). Using these reasonably-sized evaluation sets allowed the authors to evaluate each case rigorously using existing literature and historical corpora.
Table 3 (caption, partially recovered): [...] Table 2. SGNS and SVD capture the correct directionality of the shifts in all cases (%Correct), e.g., gay becomes more similar to homosexual, but there are differences in whether the methods deem the shifts to be statistically significant at the p < 0.05 level (%Sig).
Detecting known shifts. First, we tested whether the methods capture known historical shifts in meaning. The goal in this task is for the methods to correctly capture whether pairs of words moved closer or further apart in semantic space during a pre-determined time-period. We use a set of independently attested shifts as an evaluation set (Table 2). For comparison, we evaluated the methods on both the large (but messy) ENGALL data and the smaller (but clean) COHA data. On this task, all the methods performed almost perfectly in terms of capturing the correct directionality of the shifts (i.e., the pairwise similarity series have the correct sign on their Spearman correlation with time), but there were some differences in whether the methods deemed the shifts statistically significant at the p < 0.05 level.[7] Overall, SGNS performed the best on the full English data, but its performance dropped significantly on the smaller COHA dataset, where SVD performed best. PPMI was noticeably worse than the other two approaches (Table 3).
[7] All subsequent significance tests are at p < 0.05.
Discovering shifts from data. We tested whether the methods discover reasonable shifts by examining the top-10 words that changed the most from the 1900s to the 1990s according to the semantic displacement metric introduced in Section 2.4 (limiting our analysis to words with relative frequencies above $10^{-5}$ in both decades).
We used the ENGFIC data, as the most-changed list for ENGALL was dominated by scientific terms due to changes in the corpus sample. Table 4 shows the top-10 words discovered by each method. These shifts were judged by the authors as being either clearly genuine, borderline, or clearly corpus artifacts. SGNS performed by far the best on this task, with 70% of its top-10 list corresponding to genuine semantic shifts, followed by 40% for SVD, and 10% for PPMI. However, a large portion of the discovered words for PPMI (and less so SVD) correspond to borderline cases, e.g. know, that have not necessarily shifted significantly in meaning but that occur in different contexts due to global genre/discourse shifts. The poor quality of the nearest neighbors generated by the PPMI algorithm, which are skewed by PPMI's sensitivity to rare events, also made it difficult to assess the quality of its discovered shifts. SVD was the most sensitive to corpus artifacts (e.g., co-occurrences due to cover pages and advertisements), but it still captured a number of genuine semantic shifts.
We opted for this small evaluation set and relied on detailed expert judgments to minimize ambiguity; each potential shift was analyzed in detail by consulting existing literature (especially the OED; Simpson et al., 1989) and all disagreements were discussed.
Table 5 (caption, partially recovered): [...] Table 4. In English, wanting underwent subjectification and shifted from meaning "lacking" to referring to subjective "desire", as in "the education system is wanting" (1900s) vs. "I've been wanting to tell you" (1990s). In French asile ("asylum") shifted from primarily referring to "hospitals, or infirmaries" to also referring to "asylum seekers, or refugees". Finally, in German Widerstand ("resistance") gained a formal meaning as referring to the local German resistance to Nazism during World War II.
[...] some significant changes for Chinese in this short time-period, such as 病毒 ("virus") moving closer to 电脑 ("computer", ρ = 0.89).

Methodological recommendations
PPMI is clearly worse than the other two methods; it performs poorly on all the benchmark tasks, is extremely sensitive to rare events, and is prone to false discoveries from global genre shifts. Between SVD and SGNS the results are somewhat equivocal, as both perform best on two out of the four tasks (synchronic accuracy, ENGALL detection, COHA detection, discovery). Overall, SVD performs best on the synchronic accuracy task and has higher average accuracy on the 'detection' task, while SGNS performs best on the 'discovery' task. These results suggest that both these methods are reasonable choices for studies of semantic change but that they each have their own tradeoffs: SVD is more sensitive, as it performs well on detection tasks even when using a small dataset, but this sensitivity also results in false discoveries due to corpus artifacts. In contrast, SGNS is robust to corpus artifacts in the discovery task, but it is not sensitive enough to perform well on the detection task with a small dataset. Qualitatively, we found SGNS to be most useful for discovering new shifts and visualizing changes (e.g., Figure 1), while SVD was most effective for detecting subtle shifts in usage.

Statistical laws of semantic change
We now show how diachronic embeddings can be used in a large-scale cross-linguistic analysis to reveal statistical laws that relate frequency and polysemy to semantic change. In particular, we analyze how a word's rate of semantic change,
$$\Delta^{(t)}(w_i) = \text{cos-dist}\left(\mathbf{w}_i^{(t)}, \mathbf{w}_i^{(t+1)}\right),$$
depends on its frequency, $f^{(t)}(w_i)$, and a measure of its polysemy, $d^{(t)}(w_i)$ (defined in Section 4.4).

Setup
We present results using SVD embeddings (though analogous results were found to hold with SGNS). Using all four languages and all four conditions for English (ENGALL, ENGFIC, and COHA with and without lemmatization), we performed regression analysis on rates of semantic change, $\Delta^{(t)}(w_i)$; thus, we examined one data-point per word for each pair of consecutive decades and analyzed how a word's frequency and polysemy at time $t$ correlate with its degree of semantic displacement over the next decade.
To ensure the robustness of our results, we analyzed only the top-10000 non-stop words by average historical frequency (lower-frequency words tend to lack sufficient co-occurrence data across years) and we discarded proper nouns (changes in proper noun usage are primarily driven by nonlinguistic factors, e.g. historical events; Traugott and Dasher, 2001). We also log-transformed the semantic displacement scores and normalized the scores to have zero mean and unit variance; we denote these normalized scores by $\tilde{\Delta}^{(t)}(w_i)$. We performed our analysis using a linear mixed model with random intercepts per word and fixed effects per decade; i.e., we fit $\beta_f$, $\beta_d$, and $\beta_t$ s.t.
$$\tilde{\Delta}^{(t)}(w_i) = \beta_f \log\left(f^{(t)}(w_i)\right) + \beta_d \log\left(d^{(t)}(w_i)\right) + \beta_t + z_{w_i} + \epsilon^{(t)}_{w_i}, \tag{7}$$
where $z_{w_i} \sim \mathcal{N}(0, \sigma_{w_i})$ is the random intercept for word $w_i$ and $\epsilon^{(t)}_{w_i}$ is an error term. $\beta_f$, $\beta_d$ and $\beta_t$ correspond to the fixed effects for frequency, polysemy and the decade $t$, respectively. Intuitively, this model estimates the effects of frequency and polysemy on semantic change, while controlling for temporal trends and correcting for the fact that measurements on the same word will be correlated across time. We fit (7) using the standard restricted maximum likelihood algorithm (McCulloch and Neuhaus, 2001; Appendix C).
Table 6: The top-10 most and least polysemous words in the ENGFIC data. Words like yet, even, and still are used in many diverse ways and are highly polysemous. In contrast, words like photocopying, postage, and holster tend to be used in very specific well-clustered contexts, corresponding to a single sense; for example, mail and letter are both very likely to occur in the context of postage and are also likely to co-occur with each other, independent of postage.
Top-10 most polysemous: yet, always, even, little, called, also, sometimes, great, still, quite
Top-10 least polysemous: photocopying, retrieval, thirties, mom, sweater, forties, seventeenth, fifteenth, holster, postage
Figure 2: Higher frequency words have lower rates of change (a), while polysemous words have higher rates of change (b). The negative curvature for polysemy, which is significant only at high $d(w_i)$, varies across datasets and was not present with SGNS, so it is not as robust as the clear linear trend that was seen with all methods and across all datasets. The trendlines show 95% CIs from bootstrapped kernel regressions on the ENGALL data (Li and Racine, 2007).

Overview of results
We find that, across languages, rates of semantic change obey a scaling relation of the form
$$\Delta(w_i) \propto f(w_i)^{\beta_f} \times d(w_i)^{\beta_d},$$
with $\beta_f < 0$ and $\beta_d > 0$. This finding implies that frequent words change at slower rates while polysemous words change faster, and that both these relations scale as power laws.
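A short worked example may help in reading the scaling relation. The sketch below compares the predicted rates of change of two words from the ratios of their frequencies and polysemy scores. The exponent values are hypothetical round numbers chosen for illustration, not fitted estimates, and the function name is an assumption of this example; it also assumes both ratios are positive.

```python
def relative_change_rate(freq_ratio, poly_ratio, beta_f, beta_d):
    """Predicted ratio of change rates for two words under a power-law relation.

    freq_ratio: frequency of word A divided by frequency of word B.
    poly_ratio: polysemy score of word A divided by that of word B (assumed positive).
    """
    return (freq_ratio ** beta_f) * (poly_ratio ** beta_d)


# Hypothetical exponents for illustration only.
BETA_F, BETA_D = -0.5, 0.5

# Word A is 10x more frequent than word B, with equal polysemy.
print(relative_change_rate(10.0, 1.0, BETA_F, BETA_D))

# Word A is equally frequent but has 4x the polysemy score.
print(relative_change_rate(1.0, 4.0, BETA_F, BETA_D))
```

Under these illustrative exponents, a tenfold increase in frequency cuts the predicted rate to roughly a third, while a fourfold increase in polysemy doubles it.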

Law of conformity: Frequently used words change at slower rates
Using the model in equation (7), we found that the logarithm of a word's frequency, $\log(f(w_i))$, has a significant and substantial negative effect on rates of semantic change in all settings (Figures 2a and 3a). Given the use of log-transforms in pre-processing the data this implies rates of semantic change are proportional to a negative power ($\beta_f$) of frequency, i.e.
$$\Delta(w_i) \propto f(w_i)^{\beta_f}.$$
The relatively large range of values for $\beta_f$ is due to the fact that the COHA datasets are outliers due to their substantially smaller sample sizes (Figure 3; the range is $\beta_f \in [-0.66, -0.27]$ with COHA excluded).
Figure 3: a, The estimated linear effect of log-frequency ($\beta_f$) is significantly negative across all languages. The effect is significantly stronger in the COHA data, but this is likely due to its small sample size (∼100× smaller than the other datasets); the small sample size introduces random variance that may artificially inflate the effect of frequency. From the COHA data, we also see that the result holds regardless of whether lemmatization is used. b, Analogous trends hold for the linear effect of the polysemy score ($\beta_d$), which is strong and significantly positive across all conditions. Again, we see that the smaller COHA datasets are mild outliers.[9] 95% CIs are shown.

Law of innovation: Polysemous words change at faster rates
There is a common hypothesis in the linguistic literature that "words become semantically extended by being used in diverse contexts" (Winter et al., 2014), an idea that dates back to the writings of Bréal (1897). We tested this notion by examining the relationship between polysemy and semantic change in our data.

Quantifying polysemy
Measuring word polysemy is a difficult and fraught task, as even "ground truth" dictionaries differ in the number of senses they assign to words (Simpson et al., 1989;Fellbaum, 1998). We circumvent this issue by measuring a word's contextual diversity as a proxy for its polysemousness. The intuition behind our measure is that words that occur in many distinct, unrelated contexts will tend to be highly polysemous. This view of polysemy also fits with previous work on semantic change, which emphasizes the role of contextual diversity (Bréal, 1897;Winter et al., 2014).
We measure a word's contextual diversity, and thus polysemy, by examining its neighborhood in an empirical co-occurrence network. We construct empirical co-occurrence networks using the PPMI measure defined in Section 2. In these networks words are connected to each other if they co-occur more than one would expect by chance (after smoothing). The polysemy of a word is then measured as its local clustering coefficient within this network (Watts and Strogatz, 1998):
$$d(w_i) = -\sum_{c_i, c_j \in N_{\text{PPMI}}(w_i)} \frac{\mathbb{I}\left\{\text{PPMI}(c_i, c_j) > 0\right\}}{|N_{\text{PPMI}}(w_i)|\,\left(|N_{\text{PPMI}}(w_i)| - 1\right)},$$
where $N_{\text{PPMI}}(w_i) = \{w_j : \text{PPMI}(w_i, w_j) > 0\}$. This measure counts the proportion of $w_i$'s neighbors that are also neighbors of each other. According to this measure, a word will have a high clustering coefficient (and thus a low polysemy score) if the words that it co-occurs with also tend to co-occur with each other. Polysemous words that are contextually diverse will have low clustering coefficients, since they appear in disjointed or unrelated contexts.
[9] The COHA data is ∼100× smaller, which has a global effect on the construction of the co-occurrence network (e.g., lower average degree) used to compute polysemy scores.
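The clustering-based score can be illustrated on a toy network. In the sketch below the network is a plain dictionary mapping each word to the set of words it is connected to (standing in for positive PPMI), and the score is the negated local clustering coefficient, following the statement above that a high clustering coefficient means a low polysemy score. The function name, the dictionary representation, and the convention of returning 0 for words with fewer than two neighbors are assumptions of this example.

```python
from itertools import combinations

def polysemy_score(word, neighbors):
    """Negated local clustering coefficient of `word` in an undirected network.

    neighbors: dict mapping each word to the set of words it is connected to.
    """
    nbrs = neighbors[word] - {word}
    k = len(nbrs)
    if k < 2:
        return 0.0  # coefficient undefined; 0 is a convention of this sketch
    linked = sum(
        1 for a, b in combinations(sorted(nbrs), 2)
        if b in neighbors.get(a, set())
    )
    clustering = 2.0 * linked / (k * (k - 1))  # fraction of neighbor pairs that are connected
    return -clustering


# Toy network: the contexts of "postage" all co-occur with each other,
# while the contexts of "also" are unrelated.
toy_network = {
    "postage": {"mail", "letter", "stamp"},
    "mail": {"postage", "letter", "stamp"},
    "letter": {"postage", "mail", "stamp"},
    "stamp": {"postage", "mail", "letter"},
    "also": {"cat", "tax", "piano"},
    "cat": {"also"},
    "tax": {"also"},
    "piano": {"also"},
}
print(polysemy_score("postage", toy_network) < polysemy_score("also", toy_network))
```

The tightly clustered word receives the lower score, mirroring the postage example in Table 6.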
Variants of this measure are often used in word-sense discrimination and correlate with, e.g., number of senses in WordNet (Dorow and Widdows, 2003; Ferret, 2004). However, we found that it was slightly biased towards rating contextually diverse discourse function words (e.g., also) as highly polysemous, which needs to be taken into account when interpreting our results. We opted to use this measure, despite this bias, because it has the strong benefit of being clearly interpretable: it simply measures the extent to which a word appears in diverse textual contexts. Table 6 gives examples of the least and most polysemous words in the ENGFIC data, according to this score.
As expected, this measure has significant intrinsic positive correlation with frequency. Across datasets, we found Pearson correlations in the range 0.45 < r < 0.8 (all p < 0.05), confirming frequent words tend to be used in a greater diversity of contexts. As a consequence of this high correlation, we interpret the effect of this measure only after controlling for frequency (this control is naturally captured in equation (7)).

Polysemy and semantic change
After fitting the model in equation (7), we found that the logarithm of the polysemy score exhibits a strong positive effect on rates of semantic change, throughout all four languages (Figure 3b). As with frequency, the relation takes the form of a power law with a language/corpus dependent scaling constant in $\beta_d \in [0.37, 0.77]$. Note that this relationship is a complete reversal from what one would expect according to $d(w_i)$'s positive correlation with frequency; i.e., since frequency and polysemy are highly positively correlated, one would expect them to have similar effects on semantic change, but we found that the effect of polysemy completely reversed after controlling for frequency. Figure 2b shows the relationship of polysemy with rates of semantic change in the ENGALL data after regressing out the effect of frequency (using the method of Graham, 2003).

Discussion
We show how distributional methods can reveal statistical laws of semantic change and offer a robust methodology for future work in this area. Our work builds upon a wealth of previous research on quantitative approaches to semantic change, including prior work with distributional methods (Sagi et al., 2011;Wijaya and Yeniterzi, 2011;Gulordava and Baroni, 2011;Jatowt and Duh, 2014;Kulkarni et al., 2014;Xu and Kemp, 2015), as well as recent work on detecting the emergence of novel word senses (Lau et al., 2012;Mitra et al., 2014;Cook et al., 2014;Mitra et al., 2015;Frermann and Lapata, 2016). We extend these lines of work by rigorously comparing different approaches to quantifying semantic change and by using these methods to propose new statistical laws of semantic change.
The two statistical laws we propose have strong implications for future work in historical semantics. The law of conformity (frequent words change more slowly) clarifies frequency's role in semantic change. Future studies of semantic change must account for frequency's conforming effect: when examining the interaction between some linguistic process and semantic change, the law of conformity should serve as a null model in which the interaction is driven primarily by underlying frequency effects.
The law of innovation (polysemous words change more quickly) quantifies the central role polysemy plays in semantic change, an issue that has concerned linguists for more than 100 years (Bréal, 1897). Previous works argued that semantic change leads to polysemy (Wilkins, 1993; Hopper and Traugott, 2003). However, our results show that polysemous words change faster, which suggests that polysemy may actually lead to semantic change.
Overall, these two factors (frequency and polysemy) explain between 48% and 88% of the variance in rates of semantic change (across conditions). This remarkable degree of explanatory power indicates that frequency and polysemy are perhaps the two most crucial linguistic factors that explain rates of semantic change over time.
These empirical statistical laws also lend themselves to various causal mechanisms. The law of conformity might be a consequence of learning: perhaps people are more likely to use rare words mistakenly in novel ways, a mechanism formalizable by Bayesian models of word learning and corresponding to the biological notion of genetic drift (Reali and Griffiths, 2010). Or perhaps a sociocultural conformity bias makes people less likely to accept novel innovations of common words, a mechanism analogous to the biological process of purifying selection (Boyd and Richerson, 1988;Pagel et al., 2007). Moreover, such mechanisms may also be partially responsible for the law of innovation. Highly polysemous words tend to have more rare senses (Kilgarriff, 2004), and rare senses may be unstable by the law of conformity. While our results cannot confirm such causal links, they nonetheless highlight a new role for frequency and polysemy in language change and the importance of distributional models in historical research.

A Hyperparameter and pre-processing details
For all datasets, words were lowercased and stripped of punctuation. For the Google datasets we built models using the top-100000 words by their average frequency over the entire historical time-periods, and we used the top-50000 for COHA. During model learning we also discarded all words within a year that occurred below a certain threshold (500 for the Google data, 100 for the COHA data). For all methods, we used the hyperparameters recommended in Levy et al. (2015). For the context word distributions in all methods, we used context distribution smoothing with a smoothing parameter of 0.75. Note that for SGNS this corresponds to smoothing the unigram negative sampling distribution. For both SGNS and PPMI, we set the negative sample prior $\alpha = \log(5)$, while we set this value to $\alpha = 0$ for SVD, as this improved results. When using SGNS on the Google data, we also subsampled, with words being randomly removed with probability $p_r(w_i) = 1 - \sqrt{10^{-5} / f(w_i)}$, as recommended by Levy et al. (2015) and Mikolov et al. (2013). Furthermore, to improve the computational efficiency of SGNS (which works with text streams and not co-occurrence counts), we downsampled the larger years in the Google N-Gram data to have at most $10^9$ tokens. No such subsampling was performed on the COHA data.
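The subsampling rule can be illustrated numerically. The sketch below assumes the frequent-word subsampling form of Mikolov et al. (2013), one minus the square root of the threshold divided by the word's relative frequency, with a threshold of 10^-5. The function name is hypothetical, and clamping the probability at zero for words rarer than the threshold is an addition of this example.

```python
import math

def removal_probability(relative_frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word during subsampling.

    relative_frequency: the word's share of all tokens, in (0, 1].
    The max(0, ...) clamp keeps rare words (below the threshold) untouched.
    """
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))


# Very frequent words are dropped most of the time; rare words are always kept.
for freq in (1e-2, 1e-3, 1e-5, 1e-7):
    print(freq, removal_probability(freq))
```

The effect is to thin out very common words so that training spends more of its budget on informative co-occurrences.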
For all methods, we defined the context set to simply be the same vocabulary as the target words, as is standard in most word vector applications (Levy et al., 2015). However, we found that the PPMI method benefited substantially from larger contexts (similar results were found in Bullinaria and Levy, 2007), so we did not remove any low-frequency words per year from the context for that method. The other embedding approaches did not appear to benefit from the inclusion of these low-frequency terms, so they were dropped for computational efficiency.
For SGNS, we used the implementation provided in Levy et al. (2015). The implementations for PPMI and SVD are released with the code package associated with this work.

B Visualization algorithm
To visualize semantic change for a word $w_i$ in two dimensions we employed the following procedure, which relies on the t-SNE embedding method (Van der Maaten and Hinton, 2008) as a subroutine:
1. Find the union of the word $w_i$'s $k$ nearest neighbors over all necessary time-points.
2. Compute the t-SNE embedding of these words on the most recent (i.e., the modern) time-point.
3. For each of the previous time-points, hold all embeddings fixed, except for the target word's (i.e., the embedding for $w_i$), and optimize a new t-SNE embedding only for the target word. We found that initializing the embedding for the target word to be the centroid of its $k$ nearest neighbors in a time-point was highly effective.
Thus, in this procedure the background words are always shown in their "modern" positions, which makes sense given that these are the current meanings of these words. This approximation is necessary, since in reality all words are moving.
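The neighbor-selection and initialization steps of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: it assumes cosine similarity for the nearest-neighbor search, assumes the modern 2-D t-SNE coordinates are already available as an array indexed by word id, and omits the t-SNE optimization itself. All function names and toy values are hypothetical.

```python
import numpy as np

def nearest_neighbors(vectors, target, k):
    """Indices of the k words most cosine-similar to `target`, excluding the target."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[target]
    sims[target] = -np.inf  # never return the target as its own neighbor
    return np.argsort(-sims)[:k]

def neighbor_union(embeddings_by_time, target, k):
    """Step 1: union of the target's k nearest neighbors over all time-points."""
    union = set()
    for vectors in embeddings_by_time:
        union.update(nearest_neighbors(vectors, target, k).tolist())
    return sorted(union)

def centroid_init(modern_2d, vectors, target, k):
    """Initial 2-D position of the target at one historical time-point.

    Takes the centroid of the fixed modern 2-D positions of the words that are
    the target's k nearest neighbors at that time-point.
    """
    neighbors = nearest_neighbors(vectors, target, k)
    return modern_2d[neighbors].mean(axis=0)

# Toy data: 5 words, target is word 0 (hypothetical values).
old = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]])
modern = old.copy()
modern[0] = [0.1, 1.0]  # the target word has moved toward word 3
modern_2d = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

background = neighbor_union([old, modern], target=0, k=2)
start = centroid_init(modern_2d, old, target=0, k=2)
```

In a full implementation, `start` would seed the optimization of the target word's position while every background point stays fixed.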

C Regression analysis details
In addition to the pre-processing mentioned in the main text, we also normalized the contextual diversity scores $d(w_i)$ within years by subtracting the yearly median. This was necessary because there were substantial changes in the median contextual diversity scores over years, due for example to changes in corpus sample sizes. Data points corresponding to words that occurred fewer than 500 times during a time-period were also discarded, as these points lack sufficient data to robustly estimate change rates (this threshold only came into effect on the COHA data, however). We removed stop words and proper nouns by (i) removing all stop words from the available lists in Python's NLTK package (Bird et al., 2009) and (ii) restricting our analysis to words with part-of-speech (POS) tags corresponding to four main linguistic categories (common nouns, verbs, adverbs, and adjectives), using the POS sources in Table 1. When analyzing the effects of frequency and contextual diversity, the model contained fixed effects for these features and for time along with random effects for word identity. We opted not to control for POS tags in the presented results, as contextual diversity is collinear with these tags (e.g., adverbs are more contextually diverse than nouns), and the goal was to demonstrate the main effect of contextual diversity across all word types. That said, the effect of contextual diversity remained strong and significantly positive in all datasets even after controlling for POS tags.
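The count threshold and the within-year median normalization can be sketched as below. This is an illustrative sketch only: the record layout and function name are hypothetical, and computing the median after applying the count threshold is an assumption, since the order of the two steps is not specified here.

```python
from collections import defaultdict
from statistics import median

def normalize_diversity(records, min_count=500):
    """Drop low-count data points, then subtract the yearly median diversity.

    `records` is a list of dicts with keys: word, year, count, diversity.
    Assumption: the median is taken over the points that survive the threshold.
    """
    kept = [r for r in records if r["count"] >= min_count]
    by_year = defaultdict(list)
    for r in kept:
        by_year[r["year"]].append(r["diversity"])
    medians = {year: median(vals) for year, vals in by_year.items()}
    return [dict(r, diversity_norm=r["diversity"] - medians[r["year"]]) for r in kept]

# Toy records (hypothetical values).
records = [
    {"word": "a", "year": 1900, "count": 1000, "diversity": 0.2},
    {"word": "b", "year": 1900, "count": 600, "diversity": 0.4},
    {"word": "c", "year": 1900, "count": 700, "diversity": 0.6},
    {"word": "d", "year": 1900, "count": 100, "diversity": 0.9},  # below threshold
    {"word": "a", "year": 1910, "count": 1000, "diversity": 0.3},
    {"word": "b", "year": 1910, "count": 800, "diversity": 0.5},
]
normalized = normalize_diversity(records)
```

After normalization, each year's scores are centered on zero, so diversity values are comparable across years with different corpus sizes.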
To fit the linear mixed models, we used the Python statsmodels package with restricted maximum likelihood estimation (REML) (Seabold and Perktold, 2010). All mentioned significance scores were computed according to Wald's z-tests, though these results agreed with Bonferroni-corrected likelihood ratio tests on the eng-all data.
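A model of this kind might be fit with statsmodels roughly as follows. This is a sketch on synthetic data, not the analysis script: the column names, the simulated effect sizes, and the choice to encode time as a categorical decade term are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): 200 words observed in 5 decades each.
rng = np.random.default_rng(0)
n_words, n_decades = 200, 5
word = np.repeat(np.arange(n_words), n_decades)
decade = np.tile(np.arange(n_decades), n_words)
log_freq = rng.normal(size=word.size)
diversity = rng.normal(size=word.size)
word_effect = rng.normal(scale=0.3, size=n_words)[word]  # per-word random intercept
noise = rng.normal(scale=0.2, size=word.size)
change = -0.5 * log_freq + 0.3 * diversity + 0.05 * decade + word_effect + noise

data = pd.DataFrame({
    "word": word,
    "decade": decade,
    "log_freq": log_freq,
    "diversity": diversity,
    "change": change,
})

# Fixed effects for frequency, diversity, and time; random intercept per word.
model = smf.mixedlm("change ~ log_freq + diversity + C(decade)", data, groups=data["word"])
result = model.fit(reml=True)  # restricted maximum likelihood

# result.pvalues holds the Wald z-test p-values for each coefficient.
coefficients = result.params
p_values = result.pvalues
```

On this synthetic data the fitted coefficients recover the simulated signs: negative for `log_freq` and positive for `diversity`.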
The visualizations in Figure 2 were computed on the eng-all data and correspond to bootstrapped locally-linear kernel regressions with bandwidths selected via the Hurvich AIC criterion (Li and Racine, 2007).