A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains

We perform an interdisciplinary large-scale evaluation for detecting lexical semantic divergences in a diachronic and in a synchronic task: semantic sense changes across time, and semantic sense changes across domains. Our work addresses the superficiality and lack of comparison in assessing models of diachronic lexical change by bringing together and extending benchmark models on a common state-of-the-art evaluation task. In addition, we demonstrate that the same evaluation task and modelling approaches can successfully be utilised for the synchronic detection of domain-specific sense divergences in the field of term extraction.


Introduction
Diachronic Lexical Semantic Change (LSC) detection, i.e., the automatic detection of word sense changes over time, is a flourishing new field within NLP (Frermann and Lapata, 2016; Hamilton et al., 2016b; Schlechtweg et al., 2017, i.a.). Yet, it is hard to compare the performance of the various models, and optimal parameter choices remain unclear, because up to now most models have been compared on different evaluation tasks and data. Presently, we do not know which model performs best under which conditions, and whether more complex model architectures gain performance benefits over simpler models. This situation hinders advances in the field and favors infelicitous conclusions about statistical laws of diachronic LSC (Dubossarsky et al., 2017).
In this study, we provide the first large-scale evaluation of an extensive number of approaches.
Relying on an existing German LSC dataset we compare models regarding different combinations of semantic representations, alignment techniques and detection measures, while exploring various pre-processing and parameter settings. Furthermore, we introduce Word Injection to LSC, a modeling idea drawn from term extraction, that overcomes the problem of vector space alignment. Our comparison of state-of-the-art approaches identifies best models and optimal parameter settings, and it suggests modifications to existing models which consistently show superior performance.
Meanwhile, the detection of lexical sense divergences across time-specific corpora is not the only possible application of LSC detection models.
In more general terms, they have the potential to detect sense divergences between corpora of any type, not necessarily time-specific ones. We acknowledge this observation and further explore a synchronic LSC detection task: identifying domain-specific changes of word senses in comparison to general-language usage, which is addressed, e.g., in term identification and automatic term extraction (Drouin, 2004; Pérez, 2016; Hätty and Schulte im Walde, 2018), and in determining social and dialectal language variations (Del Tredici and Fernández, 2017; Hovy and Purschke, 2018). For addressing the synchronic LSC task, we present a recent sense-specific term dataset (Hätty et al., 2019) that we created analogously to the existing diachronic dataset, and we show that the diachronic models can be successfully applied to the synchronic task as well. This two-fold evaluation assures robustness and reproducibility of our model comparisons under various conditions.

Related Work
Diachronic LSC Detection. Existing approaches for diachronic LSC detection are mainly based on three types of meaning representations: (i) semantic vector spaces, (ii) topic distributions, and (iii) sense clusters. In (i), semantic vector spaces, each word is represented as two vectors reflecting its co-occurrence statistics at different periods of time (Gulordava and Baroni, 2011; Kim et al., 2014; Xu and Kemp, 2015; Eger and Mehler, 2016; Hamilton et al., 2016a,b; Hellrich and Hahn, 2016; Rosenfeld and Erk, 2018). LSC is typically measured by the cosine distance (or some alternative metric) between the two vectors, or by differences in contextual dispersion between the two vectors (Kisselew et al., 2016; Schlechtweg et al., 2017). (ii) Diachronic topic models infer a probability distribution for each word over different word senses (or topics), which are in turn modeled as a distribution over words (Wang and McCallum, 2006; Bamman and Crane, 2011; Wijaya and Yeniterzi, 2011; Lau et al., 2012; Mihalcea and Nastase, 2012; Cook et al., 2014; Frermann and Lapata, 2016). LSC of a word is measured by calculating a novelty score for its senses based on their frequency of use. (iii) Clustering models assign all uses of a word into sense clusters based on some contextual property (Mitra et al., 2015). Word sense clustering models are similar to topic models in that they map uses to senses. Accordingly, LSC of a word is measured in a similar way as in (ii). For an overview of diachronic LSC detection, see Tahmasebi et al. (2018).
Synchronic LSC Detection. We use the term synchronic LSC to refer to NLP research areas with a focus on how the meanings of words vary across domains or communities of speakers.
Synchronic LSC per se is not widely researched; for meaning shifts across domains, there is strongly related research which is concerned with domain-specific word sense disambiguation (Maynard and Ananiadou, 1998; Chen and Al-Mubaid, 2006; Taghipour and Ng, 2015; Daille et al., 2016) or term ambiguity detection (Baldwin et al., 2013; Wang et al., 2013). The only notable work explicitly measuring cross-domain meaning shifts is Ferrari et al. (2017), which is based on semantic vector spaces and cosine distance. Synchronic LSC across communities has been investigated as meaning variation in online communities, leveraging the large-scale data which has become available thanks to online social platforms (Del Tredici and Fernández, 2017; Rotabi et al., 2017).
Evaluation. Existing evaluation procedures for LSC detection can be divided into evaluation on (i) empirically observed data, and (ii) synthetic data or related tasks. (i) includes case studies of individual words (Sagi et al., 2009; Jatowt and Duh, 2014; Hamilton et al., 2016a), stand-alone comparison of a few hand-selected words (Wijaya and Yeniterzi, 2011; Hamilton et al., 2016b; Del Tredici and Fernández, 2017), comparison of hand-selected changing vs. semantically stable words (Lau et al., 2012; Cook et al., 2014), and post-hoc evaluation of the predictions of the presented models (Cook and Stevenson, 2010; Kulkarni et al., 2015; Del Tredici et al., 2016; Eger and Mehler, 2016; Ferrari et al., 2017). Schlechtweg et al. (2017) propose a small-scale annotation of diachronic metaphoric change.
Overall, the various studies use different evaluation tasks and data, with little overlap. Most evaluation data has not been annotated. Models were rarely compared to previously suggested ones, especially if the models differed in meaning representations. Moreover, for the diachronic task, synthetic datasets are used which do not reflect actual diachronic changes.

Task and Data
Our study makes use of the evaluation framework proposed in , where diachronic LSC detection is defined as a comparison between word uses in two time-specific corpora. We further applied the framework to create an analogous synchronic LSC dataset that compares word uses across general-language and domain-specific corpora. The common, meta-level task in our diachronic+synchronic setup is, given two corpora $C_a$ and $C_b$, to rank the targets in the respective datasets according to their degree of relatedness between word uses in $C_a$ and $C_b$.

Corpora
DTA (Deutsches Textarchiv, 2017) is a freely available lemmatized, POS-tagged and spelling-normalized diachronic corpus of German containing texts from the 16th to the 20th century.
COOK is a domain-specific corpus. We crawled cooking-related texts from several categories (recipes, ingredients, cookware and cooking techniques) from the German cooking recipes websites kochwiki.de and Wikibooks Kochbuch.
SDEWAC (Faaß and Eckart, 2013) is a cleaned version of the web-crawled corpus DEWAC (Baroni et al., 2009). We reduced SDEWAC to 1/8th of its original size by selecting every 8th sentence for our general-language corpus. Table 1 summarizes the corpus sizes after applying pre-processing. See Appendix A for pre-processing details.

Datasets and Evaluation
Diachronic Usage Relatedness (DURel). DURel is a gold standard for diachronic LSC consisting of 22 target words with varying degrees of LSC . Target words were chosen from a list of attested changes in a diachronic semantic dictionary (Paul, 2002), and for each target a random sample of use pairs from the DTA corpus was annotated for meaning relatedness of the uses on a scale from 1 (unrelated meanings) to 4 (identical meanings), both within and across the time periods 1750-1799 and 1850-1899. The annotation resulted in an average Spearman's ρ = 0.66 across five annotators and 1,320 use pairs. For our evaluation of diachronic meaning change we rely on the ranking of the target words according to their mean usage relatedness across the two time periods.

Synchronic Usage Relatedness (SURel).
SURel is a recent gold standard for synchronic LSC (Hätty et al., 2019) using the same framework as in DURel. The 22 target words were chosen such as to exhibit different degrees of domain-specific meaning shifts, and use pairs were randomly selected from SDEWAC as general-language corpus and from COOK as domain-specific corpus. The annotation for usage relatedness across the corpora resulted in an average Spearman's ρ = 0.88 across four annotators and 1,320 use pairs. For our evaluation of synchronic meaning change we rely on the ranking of the target words according to their mean usage relatedness between general-language and domain-specific uses.
Evaluation. The gold LSC ranks in the DURel and SURel datasets are used to assess the correctness of model predictions by applying Spearman's rank-order correlation coefficient ρ as evaluation metric, as done in similar previous studies (Gulordava and Baroni, 2011; Schlechtweg et al., 2017). As corpus data underlying the experiments we rely on the corpora from which the annotated use pairs were sampled: DTA documents from 1750-1799 as $C_a$ and documents from 1850-1899 as $C_b$ for the diachronic experiments, and the SDEWAC corpus as $C_a$ and the COOK corpus as $C_b$ for the synchronic experiments.
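The evaluation step can be illustrated with a minimal sketch of Spearman's ρ, computed as the Pearson correlation of rank-transformed scores. The function names and the toy scores are illustrative assumptions and are not taken from DURel, SURel or the authors' code.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(gold, predicted):
    """Pearson correlation of the two rank-transformed score lists."""
    rg, rp = average_ranks(gold), average_ranks(predicted)
    n = len(rg)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / (var_g * var_p) ** 0.5


# Hypothetical change scores for four targets (toy values).
gold_change = [0.1, 0.4, 0.7, 0.9]
model_change = [0.2, 0.3, 0.9, 0.8]
rho = spearman_rho(gold_change, model_change)
```

Only the ordering of the targets matters here, so a model is not penalized for predicting change scores on a different scale than the gold annotation.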

Meaning Representations
Our models are based on two families of distributional meaning representations: semantic vector spaces (Section 4.1), and topic distributions (Section 4.2). All representations are bag-of-words-based, i.e. each word representation reflects a weighted bag of context words. The contexts of a target word $w_i$ are the words surrounding it in an $n$-sized window: $w_{i-n}, ..., w_{i-1}, w_{i+1}, ..., w_{i+n}$.
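The window definition can be sketched as follows. This is an illustrative helper, not the authors' implementation; truncating the window at sentence boundaries is an assumption made here for simplicity.

```python
def context_windows(tokens, n):
    """Yield (target, contexts) pairs for a symmetric window of size n.

    Assumption: windows are cut off at the sentence boundaries.
    """
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - n):i]
        right = tokens[i + 1:i + 1 + n]
        yield target, left + right


# Toy sentence (hypothetical example).
sentence = ["das", "Messer", "schneidet", "das", "Brot"]
pairs = list(context_windows(sentence, 2))
```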

Semantic Vector Spaces
A semantic vector space constructed from a corpus $C$ with vocabulary $V$ is a matrix $M$, where each row vector represents a word $w$ in the vocabulary $V$ reflecting its co-occurrence statistics (Turney and Pantel, 2010). We compare two state-of-the-art approaches to learn these vectors from co-occurrence data, (i) counting and (ii) predicting, and construct vector spaces for each time period and domain.

Count-based Vector Spaces
In a count-based semantic vector space the matrix $M$ is high-dimensional and sparse. The value of each matrix cell $M_{i,j}$ represents the number of co-occurrences of the word $w_i$ and the context $c_j$, $\#(w_i, c_j)$. In line with Hamilton et al. (2016b) we apply a number of transformations to these raw co-occurrence matrices, as previous work has shown that this improves results on different tasks (Bullinaria and Levy, 2012; Levy et al., 2015).

Positive Pointwise Mutual Information (PPMI).
In PPMI representations the co-occurrence counts in each matrix cell $M_{i,j}$ are weighted by the positive pointwise mutual information of target $w_i$ and context $c_j$, reflecting their degree of association. The values of the transformed matrix are

$$M^{\mathrm{PPMI}}_{i,j} = \max\left\{ \log\left( \frac{\#(w_i, c_j) \sum_{c} \#(c)^{\alpha}}{\#(w_i)\, \#(c_j)^{\alpha}} \right) - \log(k),\ 0 \right\},$$

where $k > 1$ is a prior on the probability of observing an actual occurrence of $(w_i, c_j)$ and $0 < \alpha < 1$ is a smoothing parameter reducing PPMI's bias towards rare words (Levy et al., 2015).
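A minimal sketch of the shifted, smoothed PPMI weighting follows. It assumes that $\#(w_i)$ and $\#(c_j)$ are the row and column sums of the co-occurrence counts; the sparse dictionary format and the function name are illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter


def ppmi(cooc, alpha=0.75, k=1):
    """Map {(word, context): count} to positive shifted, smoothed PMI values.

    Assumption: word and context frequencies are the marginal sums of cooc.
    Cells whose weight is not positive are dropped (sparse representation).
    """
    word_totals, ctx_totals = Counter(), Counter()
    for (w, c), n in cooc.items():
        word_totals[w] += n
        ctx_totals[c] += n
    smoothed_sum = sum(n ** alpha for n in ctx_totals.values())
    weights = {}
    for (w, c), n in cooc.items():
        pmi = math.log(n * smoothed_sum / (word_totals[w] * ctx_totals[c] ** alpha))
        value = max(pmi - math.log(k), 0.0)
        if value > 0:
            weights[(w, c)] = value
    return weights


# Toy counts (hypothetical).
cooc = {("a", "x"): 4, ("a", "y"): 1, ("b", "x"): 1, ("b", "y"): 4}
weights = ppmi(cooc, alpha=1.0, k=1)
```

Raising $k$ subtracts a larger constant from every cell, so fewer word-context pairs keep a positive weight.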
Singular Value Decomposition (SVD). Truncated SVD finds the optimal rank $d$ factorization of matrix $M$ with respect to L2 loss (Eckart and Young, 1936). We use truncated SVD to obtain low-dimensional approximations of the PPMI representations by factorizing $M^{\mathrm{PPMI}}$ into the product of the three matrices $U \Sigma V^{\top}$. We keep only the top $d$ elements of $\Sigma$ and obtain

$$M^{\mathrm{SVD}} = U_d \Sigma_d^{p},$$

where $p$ is an eigenvalue weighting parameter (Levy et al., 2015). The $i$th row of $M^{\mathrm{SVD}}$ corresponds to $w_i$'s $d$-dimensional representation.
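The truncation and eigenvalue weighting can be sketched with NumPy as below. The dense `numpy.linalg.svd` call is an assumption suitable only for small matrices; a realistic PPMI matrix would need a sparse solver.

```python
import numpy as np


def truncated_svd(m_ppmi, d, p=0.0):
    """Return U_d * Sigma_d^p, one d-dimensional row per word."""
    u, s, _ = np.linalg.svd(m_ppmi, full_matrices=False)
    # Singular values come back in descending order, so the first d are the top d.
    return u[:, :d] * (s[:d] ** p)


# Toy matrix (hypothetical values).
m = np.array([[3.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
reduced = truncated_svd(m, d=2, p=1.0)
unweighted = truncated_svd(m, d=2, p=0.0)
```

With $p = 0$ the singular values are ignored and the rows of $U_d$ are used directly, which makes all retained dimensions equally important.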
Random Indexing (RI). RI is a dimensionality reduction technique based on the Johnson-Lindenstrauss lemma according to which points in a vector space can be mapped into a randomly selected subspace under approximate preservation of the distances between points, if the subspace has a sufficiently high dimensionality (Johnson and Lindenstrauss, 1984; Sahlgren, 2004). We reduce the dimensionality of a count-based matrix $M$ by multiplying it with a random matrix $R$:

$$M^{\mathrm{RI}} = M R,$$

where the $i$th row of $M^{\mathrm{RI}}$ corresponds to $w_i$'s $d$-dimensional semantic representation. The choice of the random vectors corresponding to the rows in $R$ is important for RI. We follow previous work (Basile et al., 2015) and use sparse ternary random vectors with a small number $s$ of randomly distributed −1s and +1s, all other elements set to 0, and we apply subsampling with a threshold $t$.
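A minimal sketch of the projection with sparse ternary random vectors is given below. The data layout (nested dictionaries) and the function names are assumptions for illustration, and subsampling is omitted.

```python
import random


def ternary_vector(d, s, rng):
    """Sparse ternary vector: s entries are +1 or -1, all others are 0."""
    vec = [0] * d
    for pos in rng.sample(range(d), s):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_index(counts, contexts, d=300, s=2, seed=0):
    """Compute M_RI = M R for counts given as {word: {context: count}}."""
    rng = random.Random(seed)
    r = {c: ternary_vector(d, s, rng) for c in contexts}  # one row of R per context
    reduced = {}
    for word, row in counts.items():
        vec = [0] * d
        for c, n in row.items():
            for j, sign in enumerate(r[c]):
                if sign:
                    vec[j] += n * sign
        reduced[word] = vec
    return reduced


# Toy counts (hypothetical).
contexts = ["x", "y", "z"]
counts = {"a": {"x": 2}, "b": {"x": 1, "y": 3}}
ri = random_index(counts, contexts, d=10, s=2, seed=1)
```

Because the random vectors depend only on the seed and the context list, reusing both for a second count matrix places the two reduced matrices in the same random space.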

Predictive Vector Spaces
Skip-Gram with Negative Sampling (SGNS) differs from count-based techniques in that it directly represents each word $w \in V$ and each context $c \in V$ as a $d$-dimensional vector by implicitly factorizing $M = W C^{\top}$ when solving

$$\arg\max_{\theta} \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w),$$

where $\sigma(x) = \frac{1}{1+e^{-x}}$, $D$ is the set of all observed word-context pairs and $D'$ is the set of randomly generated negative samples (Mikolov et al., 2013a,b). The optimized parameters $\theta$ are $v_{c_i} = C_{i*}$ and $v_{w_i} = W_{i*}$ for $w, c \in V$, $i \in 1, ..., d$. $D'$ is obtained by drawing $k$ contexts from the empirical unigram distribution $P(c) = \frac{\#(c)}{|D|}$ for each observation of $(w, c)$, cf. Levy et al. (2015). SGNS and PPMI representations are highly related in that the cells of the implicitly factorized matrix $M$ are PMI values shifted by the constant $\log k$. Hence, SGNS and PPMI share the hyperparameter $k$. The final SGNS matrix is given by

$$M^{\mathrm{SGNS}} = W,$$

where the $i$th row of $M^{\mathrm{SGNS}}$ corresponds to $w_i$'s $d$-dimensional semantic representation. As in RI we apply subsampling with a threshold $t$. SGNS with particular parameter configurations has been shown to outperform transformed count-based techniques on a variety of tasks (Baroni et al., 2014; Levy et al., 2015).

Alignment
Column Intersection (CI). In order to make the matrices $A$ and $B$ from time periods $a < b$ (or domains $a$ and $b$) comparable, they have to be aligned via a common coordinate axis. This is rather straightforward for count and PPMI representations, because their columns correspond to context words which often occur in both $A$ and $B$ (Hamilton et al., 2016b). In this case, the alignment for $A$ and $B$ is

$$A^{\mathrm{CI}}_{*j} = A_{*j} \quad \text{for all } c_j \in V_a \cap V_b,$$
$$B^{\mathrm{CI}}_{*j} = B_{*j} \quad \text{for all } c_j \in V_a \cap V_b,$$

where $X_{*j}$ denotes the $j$th column of $X$.
Shared Random Vectors (SRV). RI offers an elegant way to align count-based vector spaces and reduce their dimensionality at the same time (Basile et al., 2015). Instead of multiplying count matrices $A$ and $B$ each by a separate random matrix $R_A$ and $R_B$, they may both be multiplied by the same random matrix $R$, representing them in the same low-dimensional random space. Hence, $A$ and $B$ are aligned by

$$A^{\mathrm{SRV}} = A R, \qquad B^{\mathrm{SRV}} = B R.$$

We follow Basile et al. and adopt a slight variation of this procedure: instead of multiplying both matrices by exactly the same random matrix (corresponding to an intersection of their columns) we first construct a shared random matrix and then multiply $A$ and $B$ by the respective sub-matrix.
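Column Intersection amounts to keeping only the columns whose context word occurs in both vocabularies, in a shared order. The following sketch assumes small row-major list matrices with explicit column labels; the function name is illustrative.

```python
def column_intersection(a, cols_a, b, cols_b):
    """Restrict matrices a and b to their shared context columns.

    a, b: lists of rows; cols_a, cols_b: context word labelling each column.
    The shared columns are returned in the order in which they occur in cols_a.
    """
    index_a = {c: j for j, c in enumerate(cols_a)}
    index_b = {c: j for j, c in enumerate(cols_b)}
    shared = [c for c in cols_a if c in index_b]
    a_ci = [[row[index_a[c]] for c in shared] for row in a]
    b_ci = [[row[index_b[c]] for c in shared] for row in b]
    return a_ci, b_ci, shared


# Toy matrices (hypothetical values).
a_ci, b_ci, shared = column_intersection(
    [[1, 2, 3]], ["x", "y", "z"],
    [[4, 5]], ["z", "x"],
)
```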
Orthogonal Procrustes (OP). In the low-dimensional vector spaces produced by SVD, RI and SGNS the columns may represent different coordinate axes (orthogonal variants) and thus cannot directly be aligned to each other. Following Hamilton et al. (2016b) we apply OP analysis to solve this problem. We represent the dictionary as a binary matrix $D$, so that $D_{i,j} = 1$ if $w_i \in V_b$ (the $i$th word in the vocabulary at time $b$) corresponds to $w_j \in V_a$. The goal is then to find the optimal mapping matrix $W^*$ such that the sum of squared Euclidean distances between $B$'s mapping $B_{i*}W$ and $A_{j*}$ for the dictionary entries $D_{i,j}$ is minimized:

$$W^* = \arg\min_{W} \sum_i \sum_j D_{i,j} \lVert B_{i*}W - A_{j*} \rVert^2.$$

Following standard practice we length-normalize and mean-center $A$ and $B$ in a pre-processing step (Artetxe et al., 2017), and constrain $W$ to be orthogonal, which preserves distances within each time period. Under this constraint, minimizing the squared Euclidean distance becomes equivalent to maximizing the dot product when finding the optimal rotational alignment (Hamilton et al., 2016b; Artetxe et al., 2017). The optimal solution for this problem is then given by $W^* = UV^{\top}$, where $B^{\top} D A = U \Sigma V^{\top}$ is the SVD of $B^{\top} D A$. Hence, $A$ and $B$ are aligned by

$$A^{\mathrm{OP}} = A, \qquad B^{\mathrm{OP}} = B W^*,$$

where $A$ and $B$ correspond to their preprocessed versions. We also experiment with two variants: OP− omits mean-centering (Hamilton et al., 2016b), which is potentially harmful as a better solution may be found after mean-centering. OP+ corresponds to OP with additional pre- and post-processing steps and has been shown to improve performance in research on bilingual lexicon induction (Artetxe et al., 2018a,b). We apply all OP variants only to the low-dimensional matrices.
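The closed-form solution can be sketched with NumPy as below. The sketch assumes that the rows of the two matrices are already paired word by word, so that the dictionary matrix $D$ is the identity; the exact order of the pre-processing steps is likewise an assumption.

```python
import numpy as np


def preprocess(m):
    """Length-normalize the rows, then mean-center the columns."""
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    return m - m.mean(axis=0)


def orthogonal_procrustes(a, b):
    """Return W* = U V^T, where U S V^T is the SVD of B^T A.

    Assumption: row i of a and row i of b belong to the same word,
    so D is the identity and drops out of B^T D A.
    """
    u, _, vt = np.linalg.svd(b.T @ a)
    return u @ vt


# Synthetic check: b is a rotated copy of a, so the mapping should undo the rotation.
rng = np.random.default_rng(0)
a = preprocess(rng.normal(size=(50, 5)))
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
b = a @ q.T
w = orthogonal_procrustes(a, b)
```

Since $W^*$ is orthogonal, distances and cosines among the rows of $B$ are unchanged by the mapping; only the orientation of the space relative to $A$ changes.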
Vector Initialization (VI). In VI we first learn A VI using standard SGNS and then initialize the SGNS model for learning B VI on A VI (Kim et al., 2014). The idea is that if a word is used in similar contexts in a and b, its vector will be updated only slightly, while more different contexts lead to a stronger update.
Word Injection (WI). Finally, we use the word injection approach by Ferrari et al. (2017) where target words are substituted by a placeholder in one corpus before learning semantic representations, and a single matrix M WI is constructed for both corpora after mixing their sentences. The advantage of this approach is that all vector learning methods described above can be directly applied to the mixed corpus, and target vectors are constructed directly in the same space, so no post-hoc alignment is necessary.
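The corpus preparation for Word Injection can be sketched as follows. The placeholder format (target plus an underscore) and the function names are hypothetical choices for illustration.

```python
import random


def inject(sentences, targets, suffix="_"):
    """Replace every target word by a placeholder (target + suffix)."""
    return [[tok + suffix if tok in targets else tok for tok in sent]
            for sent in sentences]


def word_injection_corpus(corpus_a, corpus_b, targets, seed=0):
    """Mix corpus_a (targets replaced) with corpus_b (unchanged)."""
    mixed = inject(corpus_a, targets) + [list(s) for s in corpus_b]
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy corpora (hypothetical sentences).
corpus_a = [["die", "Schnecke", "kriecht"]]
corpus_b = [["eine", "Schnecke", "backen"]]
mixed = word_injection_corpus(corpus_a, corpus_b, {"Schnecke"})
```

After training any of the vector models on the mixed corpus, the vectors of a target and of its placeholder live in one space and can be compared directly.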

Topic Distributions
Sense ChANge (SCAN). SCAN models LSC of word senses via smooth and gradual changes in associated topics (Frermann and Lapata, 2016).
The semantic representation inferred for a target word $w$ and time period $t$ consists of a $K$-dimensional distribution over word senses $\phi^t$ and a $V$-dimensional distribution over the vocabulary $\psi^{t,k}$ for each word sense $k$, where $K$ is a predefined number of senses for target word $w$. SCAN places parametrized logistic normal priors on $\phi^t$ and $\psi^{t,k}$ in order to encourage a smooth change of parameters, where the extent of change is controlled through the precision parameter $K^{\phi}$, which is learned during training.
Although $\psi^{t,k}$ may change over time for word sense $k$, senses are intended to remain thematically consistent as controlled by the word precision parameter $K^{\psi}$. This allows comparison of the topic distribution across time periods. For each target word $w$ we infer a SCAN model for two time periods $a$ and $b$ and take $\phi^a_w$ and $\phi^b_w$ as the respective semantic representations.

LSC Detection Measures
LSC detection measures predict a degree of LSC from two time-specific semantic representations of a word w. They either capture the contextual similarity (Section 5.1) or changes in the contextual dispersion (Section 5.2) of w's representations. 6

Similarity Measures
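Cosine Distance (CD). CD is based on cosine similarity, which measures the cosine of the angle between two non-zero vectors $\vec{x}, \vec{y}$ of equal dimensionality (Salton and McGill, 1983):

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\sqrt{\vec{x} \cdot \vec{x}}\,\sqrt{\vec{y} \cdot \vec{y}}}.$$

The cosine distance is then defined as $CD(\vec{x}, \vec{y}) = 1 - \cos(\vec{x}, \vec{y})$.
CD's prediction for a degree of LSC of $w$ between time periods $a$ and $b$ is obtained by $CD(\vec{w}_a, \vec{w}_b)$.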
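As a minimal sketch, cosine distance on plain Python lists can be written as follows (the function name is illustrative):

```python
import math


def cosine_distance(x, y):
    """1 - cos(x, y) for two non-zero vectors of equal dimensionality."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

The measure ignores vector length: scaling either vector by a positive constant leaves the distance unchanged, so only the direction of the aligned vectors is compared.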
Local Neighborhood Distance (LND). LND computes a second-order similarity for two non-zero vectors $\vec{x}, \vec{y}$ (Hamilton et al., 2016a). It measures the extent to which $\vec{x}$'s and $\vec{y}$'s distances to their shared nearest neighbors differ. First the cosine similarity of $\vec{x}, \vec{y}$ with each vector in the union of the sets of their $k$ nearest neighbors $N_k(\vec{x})$ and $N_k(\vec{y})$ is computed and represented as a vector $s$ whose entries are given by

$$s(j) = \cos(\vec{x}, \vec{z}_j) \quad \forall\, \vec{z}_j \in N_k(\vec{x}) \cup N_k(\vec{y}).$$

LND is then computed as cosine distance between the two vectors: $LND(\vec{x}, \vec{y}) = CD(\vec{s}_x, \vec{s}_y)$.
6 Find an overview of which measure was applied to which representation type in Appendix A.
LND does not require matrix alignment, because it measures the distances to the nearest neighbors in each space separately. It was claimed to capture changes in paradigmatic rather than syntagmatic relations between words (Hamilton et al., 2016a).
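The following sketch illustrates LND on two spaces stored as dictionaries from words to vectors. Restricting the neighbor search to words present in both spaces is an assumption made here; the function names are illustrative.

```python
import math


def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def nearest(space, word, k):
    """The k words whose vectors are most similar to word's vector."""
    others = [w for w in space if w != word]
    return sorted(others, key=lambda w: cos(space[word], space[w]), reverse=True)[:k]


def lnd(space_a, space_b, word, k=25):
    """Cosine distance between word's similarity profiles in the two spaces."""
    shared = [w for w in space_a if w in space_b]  # assumption: shared vocabulary only
    sub_a = {w: space_a[w] for w in shared}
    sub_b = {w: space_b[w] for w in shared}
    neighbours = sorted(set(nearest(sub_a, word, k)) | set(nearest(sub_b, word, k)))
    s_a = [cos(sub_a[word], sub_a[z]) for z in neighbours]  # similarities within space a
    s_b = [cos(sub_b[word], sub_b[z]) for z in neighbours]  # similarities within space b
    return 1.0 - cos(s_a, s_b)


# Toy spaces (hypothetical vectors): space_b is space_a with swapped axes,
# space_c moves only the target word "w".
space_a = {"w": [1.0, 0.2], "n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [0.5, 0.5]}
space_b = {word: vec[::-1] for word, vec in space_a.items()}
space_c = dict(space_a, w=[0.2, 1.0])
```

In the toy data, swapping the axes leaves all within-space similarities intact, so LND stays at zero although the two spaces are not aligned, whereas moving the target relative to its neighbors yields a clearly positive value.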
Jensen-Shannon Distance (JSD). JSD computes the distance between two probability distributions $\phi_x, \phi_y$ of words $w_x, w_y$ (Lin, 1991; Donoso and Sanchez, 2017). It is the symmetrized square root of the Kullback-Leibler divergence:

$$JSD(\phi_x \,\|\, \phi_y) = \sqrt{\frac{D_{KL}(\phi_x \,\|\, M) + D_{KL}(\phi_y \,\|\, M)}{2}},$$

where $M = (\phi_x + \phi_y)/2$. JSD is high if $\phi_x$ and $\phi_y$ assign different probabilities to the same events.
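A minimal sketch of JSD for two discrete distributions follows. The base-2 logarithm, which bounds the distance by 1, is an assumption; the text does not specify the base.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits (base 2 is an assumption)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Square root of the mean KL divergence of p and q to their mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))
```

Unlike plain KL divergence, the measure stays finite when one distribution assigns zero probability to a sense, because the mixture $M$ is non-zero wherever either input is.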

Dispersion Measures
Frequency Difference (FD). The log-transformed relative frequency of a word $w$ for a corpus $C$ is defined by

$$F(w, C) = \log \frac{|w \in C|}{|C|}.$$

FD of two words $x$ and $y$ in two corpora $X$ and $Y$ is then defined by the absolute difference in $F$:

$$FD(x, X, y, Y) = |F(x, X) - F(y, Y)|.$$

FD's prediction for $w$'s degree of LSC between time periods $a$ and $b$ with corpora $C_a$ and $C_b$ is computed as $FD(w, C_a, w, C_b)$ (and analogously for the measures below).
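FD can be sketched in a few lines, assuming each corpus is given as a flat list of tokens in which the word occurs at least once (otherwise the logarithm is undefined):

```python
import math


def log_rel_freq(word, tokens):
    """F(w, C): log of the relative frequency of word in the token list."""
    return math.log(tokens.count(word) / len(tokens))


def frequency_difference(x, corpus_x, y, corpus_y):
    """FD: absolute difference of the log-transformed relative frequencies."""
    return abs(log_rel_freq(x, corpus_x) - log_rel_freq(y, corpus_y))


# Toy corpora (hypothetical).
tokens_a = ["a", "b", "b", "b"]
tokens_b = ["a", "a", "b", "b"]
fd = frequency_difference("a", tokens_a, "a", tokens_b)
```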
Type Difference (TD). TD is similar to FD, but based on word vectors $\vec{w}$ for words $w$. The normalized log-transformed number of context types of a vector $\vec{w}$ in corpus $C$ is defined by

$$T(\vec{w}, C) = \log \frac{\sum_i \mathbb{1}[\vec{w}_i \neq 0]}{|C_T|},$$

where $|C_T|$ is the number of types in corpus $C$.
The TD of two vectors $\vec{x}$ and $\vec{y}$ in two corpora $X$ and $Y$ is the absolute difference in $T$: $TD(\vec{x}, X, \vec{y}, Y) = |T(\vec{x}, X) - T(\vec{y}, Y)|$.
Entropy Difference (HD). HD relies on vector entropy as suggested by Santus et al. (2014). The entropy of a non-zero word vector $\vec{w}$ is defined by

$$VH(\vec{w}) = - \sum_i \frac{\vec{w}_i}{\sum_j \vec{w}_j} \log \frac{\vec{w}_i}{\sum_j \vec{w}_j}.$$

VH is based on Shannon's entropy (Shannon, 1948), which measures the unpredictability of $w$'s co-occurrences (Schlechtweg et al., 2017). HD is defined as

$$HD(\vec{x}, \vec{y}) = |VH(\vec{x}) - VH(\vec{y})|.$$

We also experiment with differences in $H$ between topic distributions $\phi^a_w, \phi^b_w$, which are computed in a similar fashion, and with normalizing VH by dividing it by $\log(VT(\vec{w}))$, its maximum value.
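Both dispersion measures can be sketched on raw count vectors as follows. The sketch assumes non-negative counts and reads $VT(\vec{w})$ as the number of non-zero entries of the vector; that reading and the function names are assumptions.

```python
import math


def log_type_ratio(vec, n_corpus_types):
    """T: log of (number of non-zero entries of vec / number of corpus types)."""
    return math.log(sum(1 for v in vec if v != 0) / n_corpus_types)


def type_difference(x, types_x, y, types_y):
    return abs(log_type_ratio(x, types_x) - log_type_ratio(y, types_y))


def vector_entropy(vec, normalize=False):
    """VH: Shannon entropy of the normalized, non-negative count vector."""
    total = sum(vec)
    probs = [v / total for v in vec if v > 0]
    h = -sum(p * math.log(p) for p in probs)
    if normalize and len(probs) > 1:
        h /= math.log(len(probs))  # assumption: VT is the number of non-zero entries
    return h


def entropy_difference(x, y, normalize=False):
    return abs(vector_entropy(x, normalize) - vector_entropy(y, normalize))
```

A word whose co-occurrences are spread evenly over many context types receives a high entropy, while a word that occurs in a single context receives an entropy of zero.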

Results and Discussions
First of all, we observe that nearly all model predictions have a strong positive correlation with the gold rank. Table 2 presents the overall best results across models and parameters (for models with randomness we computed the average results of five iterations). With ρ = 0.87 for diachronic LSC (DURel) and ρ = 0.85 for synchronic LSC (SURel), the models reach comparable and unexpectedly high performances on the two distinct datasets. The overall best-performing model is Skip-Gram with orthogonal alignment and cosine distance (SGNS+OP+CD). The model is robust in that it performs best on both datasets and produces very similar, sometimes identical, results across different iterations.
Pre-processing and Parameters. Regarding pre-processing, the results are less consistent: L_ALL (all lemmas) dominates in the diachronic task, while L/P (lemma:pos of content words) dominates in the synchronic task. In addition, L/P pre-processing, which is already limited to content words, prefers shorter windows, while L_ALL (pre-processing where the complete sentence structure is maintained) prefers longer windows. We attribute the preference of L/P for SURel to noise in the COOK corpus, which contains many recipes listing ingredients and quantities with numerals and abbreviations that presumably contribute little information about context words. For instance, COOK contains 4.6% numerals, while DTA only contains 1.2% numerals.
Looking at the influence of subsampling, we find that it does not improve the mean performance for Skip-Gram (SGNS) (with ρ = 0.506, without ρ = 0.517), but clearly does so for Random Indexing (RI) (with ρ = 0.413, without ρ = 0.285). Levy et al. (2015) found that SGNS prefers numerous negative samples (k > 1), which is confirmed here: mean ρ with k = 1 is 0.487, and mean ρ with k = 5 is 0.535. This finding is also indicated in Table 2, where k = 5 dominates the 5 best results on both datasets; yet, k = 1 provides the overall best result on both datasets.
Semantic Representations. Table 3 shows the best and mean results for different semantic representations. SGNS is clearly the best vector space model, even though its mean performance does not outperform other representations as clearly as its best performance. Regarding count models, PPMI and SVD show the best results. SCAN performs poorly, and its mean results indicate that it is rather unstable. This may be explained by the particular way in which SCAN constructs context windows: it ignores asymmetric windows, thus reducing the number of training instances considerably, in particular for large window sizes.
Alignments. The fact that our modification of Hamilton et al. (2016b) (SGNS+OP) performs best across datasets confirms our assumption that column-mean centering is an important pre-processing step in Orthogonal Procrustes analysis and should not be omitted. Additionally, the mean performance in Table 4 shows that OP is generally more robust than its variants. OP+ has the best mean performance on DURel, but performs poorly on SURel. Artetxe et al. (2018a) show that the additional pre- and post-processing steps of OP+ can be harmful in certain conditions. We tested the influence of the different steps and identified the non-orthogonal whitening transformation as the main reason for a performance drop of ≈20%.
In order to see how important the alignment step is for the low-dimensional embeddings (SVD/RI/SGNS), we also tested the performance without alignment ('None' in Table 4). As expected, the mean performance drops considerably. However, it remains positive, which suggests that the spaces learned in the models are not random but rather slightly rotated variants.
Especially interesting is the comparison of Word Injection (WI), where one common vector space is learned, against the OP models, where two separately learned vector spaces are aligned. Although WI avoids (post-hoc) alignment altogether, it is consistently outperformed by OP, which is shown in Table 4 for low-dimensional embeddings. We found that OP profits from mean-centering in the pre-processing step; applying mean-centering to WI matrices likewise improves the performance by 3% on WI+SGNS+CD.
The results for Vector Initialization (VI) are unexpectedly low (on DURel mean ρ = −0.017, on SURel mean ρ = 0.082). An essential parameter choice for VI is the number of training epochs for the initialized model. We experimented with 20 epochs instead of 5, but could not improve the performance. This contradicts the results obtained by Hamilton et al. (2016b) who report a "negligible" impact of VI when compared to OP−. We suspect that VI is strongly influenced by frequency. That is, the more frequent a word is in corpus $C_b$, the more its vector will be updated after initialization on $C_a$. Hence, VI predicts more change with higher frequency in $C_b$.
Detection Measures. Cosine distance (CD) dominates Local Neighborhood Distance (LND) on all vector space and alignment types (e.g., mean ρ on DURel with SGNS+OP is 0.723 for CD vs. 0.620 for LND) and hence should be generally preferred if alignment is possible. Otherwise LND or a variant of WI+CD should be used, as they show lower but robust results. Dispersion measures in general exhibit a low performance, and previous positive results for them could not be reproduced (Schlechtweg et al., 2017). It is striking that, contrary to our expectation, dispersion measures on SURel show a strong negative correlation (max. ρ = −0.79). We suggest that this is due to frequency particularities of the dataset: SURel's gold LSC rank has a rather strong negative correlation with the targets' frequency rank in the COOK corpus (ρ = −0.51). Moreover, because COOK is orders of magnitude smaller than SDEWAC, the normalized values computed in most dispersion measures are much higher in COOK. This also gives them a much higher weight in the final calculation of the absolute differences. Hence, the negative correlation in COOK propagates to the final results. This is supported by the fact that the only measure not normalized by corpus size (HD) has a positive correlation. As these findings show, the dispersion measures are strongly influenced by frequency and very sensitive to different corpus sizes.
... the findings in Dubossarsky et al. (2019), using, however, a different task and synthetic data.
Control Condition. As we saw, dispersion measures are sensitive to frequency. Similar observations have been made for other LSC measures (Dubossarsky et al., 2017). In order to test for this influence within our datasets we follow Dubossarsky et al. (2017) in adding a control condition to the experiments for which sentences are randomly shuffled across corpora (time periods). For each target word we merge all sentences from the two corpora C a and C b containing it, shuffle them, split them again into two sets while holding their frequencies from the original corpora approximately stable and merge them again with the original corpora. This reduces the target words' mean degree of LSC between C a and C b significantly. Accordingly, the mean degree of LSC predicted by the models should reduce significantly if the models measure LSC (and not some other controlled property of the dataset such as frequency). We find that the mean prediction on a result sample (L/P, win=2) indeed reduces from 0.5 to 0.36 on DURel and from 0.53 to 0.44 on SURel. Moreover, shuffling should reduce the correlation of individual model predictions with the gold rank, as many items in the gold rank have a high degree of LSC, supposedly being canceled out by the shuffling and hence randomizing the ranking. Testing this on a result sample (SGNS+OP+CD, L/P, win=2, k=1, t=None), as shown in Table 5, we find that it holds for DURel with a drop from ρ = 0.816 (ORG) to 0.180 on the shuffled (SHF) corpora, but not for SURel where the correlation remains stable (0.767 vs. 0.763). We hypothesize that the latter may be due to SURel's frequency properties and find that downsampling all target words to approximately the same frequency in both corpora (≈ 50) reduces the correlation (+DWN). However, there is still a rather high correlation left (0.576). 
Presumably, other factors play a role: (i) Time-shuffling may not totally randomize the rankings because words with a high change still end up having slightly different meaning distributions in the two corpora than words with no change at all. Combined with the fact that the SURel rank is less uniformly distributed than DURel this may lead to a rough preservation of the SURel rank after shuffling. (ii) For words with a strong change the shuffling creates two equally polysemous sets of word uses from two monosemous sets. The models may be sensitive to the different variances in these sets, and hence predict stronger change for more polysemous sets of uses.
Table 5: ρ for SGNS+OP+CD (L/P, win=2, k=1, t=None) before (ORG) and after time-shuffling (SHF) and downsampling them to the same frequency (+DWN).
Overall, our findings demonstrate that much more work has to be done to understand the effects of time-shuffling as well as sensitivity effects of LSC detection models to frequency and polysemy.
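The shuffling step of the control condition described above can be sketched as follows. The sketch keeps the number of target sentences per corpus exactly constant, whereas the text only requires approximately stable frequencies; the function name and data layout are illustrative assumptions.

```python
import random


def shuffle_control(corpus_a, corpus_b, target, seed=0):
    """Shuffle the sentences containing target across two corpora.

    Each corpus keeps its original number of target sentences;
    all other sentences stay where they were.
    """
    with_a = [s for s in corpus_a if target in s]
    with_b = [s for s in corpus_b if target in s]
    rest_a = [s for s in corpus_a if target not in s]
    rest_b = [s for s in corpus_b if target not in s]
    pool = with_a + with_b
    random.Random(seed).shuffle(pool)
    new_a = rest_a + pool[:len(with_a)]
    new_b = rest_b + pool[len(with_a):]
    return new_a, new_b


# Toy corpora (hypothetical sentences), with "t" as the target word.
corpus_a = [["a", "t"], ["b"], ["c", "t"]]
corpus_b = [["d", "t"], ["e"]]
new_a, new_b = shuffle_control(corpus_a, corpus_b, "t")
```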

Conclusion
We carried out the first systematic comparison of a wide range of LSC detection models on two datasets which were reliably annotated for sense divergences across corpora. The diachronic and synchronic evaluation tasks we introduced were solved with impressively high performance and robustness. We introduced Word Injection to overcome the need for (post-hoc) alignment, but find that Orthogonal Procrustes yields a better performance across vector space types. The overall best-performing approach on both datasets is to learn vector representations for different time periods (or domains) with SGNS, to align them with an orthogonal mapping, and to measure change with cosine distance. We further improved the performance of the best approach with the application of mean-centering as an important pre-processing step for rotational vector space alignment.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations.

2017. 11 For all corpora, we removed words below a frequency threshold t. For the smallest corpus COOK we set t = 2, and set the other thresholds in the same proportion to the corpus size. This led to t = 25, 37, 97 for DTA18, DTA19 and SDEWAC respectively. (Note that we excluded three targets from the DURel dataset and one target from the SURel dataset because they were below the frequency threshold.) We then created two versions:
• a version with minimal pre-processing, i.e., with punctuation removed and lemmatization (L_ALL)
• a more strongly pre-processed version with only content words: after punctuation removal, lemmatization and POS-tagging, only nouns, verbs and adjectives were retained in the form lemma:POS (L/P)
Context window. For all models we experimented with values n = {2, 5, 10} as done in Levy et al. (2015).
It is important to note that the extraction of context words differed between models, because of inherent parameter settings of the implementations. While our implementations of the count-based vectors have a stable window of size n, SGNS has a dynamic context window with maximal size n (cf. Levy et al., 2015) and SCAN has a stable window of size n, but ignores all occurrences of a target word where the number of context words on either side is smaller than n. This may affect the comparability of the different models, as especially the mechanism of SCAN can lead to very sparse representations on corpora with short sentences, such as the COOK corpus. Hence, this variable should be controlled in future experiments.
Vector Spaces. We followed previous work in setting further hyper-parameters (Hamilton et al., 2016b; Levy et al., 2015). We set the number of dimensions d for SVD, RI and SGNS to 300. We trained all SGNS models for 5 epochs. For PPMI we set α = .75 and experimented with k = {1, 5} for PPMI and SGNS. For RI and SGNS we experimented with t = {none, .001}. For SVD we set p = 0. In line with Basile et al. (2015) we set s = 2 for RI and SRV. Note though that we had a lower d than Basile et al., who set d = 500.
11 http://www.deutschestextarchiv.de/download
SCAN. We experimented with K = {4, 8}. For further parameters we followed the settings chosen by Frermann and Lapata (2016): $K^{\psi}$ = 10 (a high value forcing senses to remain thematically consistent across time). We set $K^{\phi}$ = 4, and the Gamma parameters a = 7 and b = 3. We used 1,000 iterations for the Gibbs sampler and set the minimum amount of contexts for a target word per time period min = 0 and the maximum amount to max = 2000.
Measures. For LND we set k = 25 as recommended by Hamilton et al. (2016a). The normalization constants for FD, HD and TD were calculated on the full corpus with the respective preprocessing (without deleting words below a frequency threshold).

B Model Overview
Find an overview of all tested combinations of semantic representations, alignments and measures in Table 6.

C Datasets
Find the datasets with the target words and their annotated degree of LSC in Tables 7 and 8.
Table 8: SURel dataset without Messerspitze, which was excluded for low frequency. $C_a$ = SDEWAC, $C_b$ = COOK. LSC denotes the inverse compare rank from , where high values mean high change.