Sequential Modelling of the Evolution of Word Representations for Semantic Change Detection

Semantic change detection concerns the task of identifying words whose meaning has changed over time. Current state-of-the-art approaches operating on neural embeddings detect the level of semantic change in a word by comparing its vector representation in two distinct time periods, without considering its evolution through time. In this work, we propose three variants of sequential models for detecting semantically shifted words, effectively accounting for the changes in word representations over time. Through extensive experimentation under various settings with synthetic and real data, we showcase the importance of sequential modelling of word vectors through time for semantic change detection. Finally, we compare different approaches in a quantitative manner, demonstrating that temporal modelling of word representations yields a clear-cut advantage in performance.


Introduction
Identifying words whose lexical meaning has changed over time is a primary area of research at the intersection of natural language processing and historical linguistics. By studying the evolution of language, the task of "semantic change detection" (Tahmasebi et al., 2018; Tang, 2018; Kutuzov et al., 2018) can provide valuable insights into cultural evolution over time (Michel et al., 2011). Measuring linguistic change is also relevant to understanding the dynamics of online communities (Danescu-Niculescu-Mizil et al., 2013) and the evolution of individuals (McAuley and Leskovec, 2013). Recent years have seen a surge of interest in this area, since researchers are now able to leverage the increasing availability of historical corpora in digital form and develop models that detect shifts in a word's meaning through time.
However, two key challenges in the field remain. Firstly, there is little work in the existing literature on model comparison (Dubossarsky et al., 2019; Shoemark et al., 2019). Partially due to the lack of (longitudinal) labelled datasets, existing work assesses model performance mainly in a qualitative manner, without quantitative comparisons against prior work. Therefore, it becomes difficult to assess what constitutes an appropriate approach for semantic change detection. Secondly, on a methodological front, a large body of related work detects semantically shifted words by pairwise comparisons of their representations in distinct time periods, ignoring the sequential modelling aspect of the task. Since semantic change is a time-sensitive process (Tsakalidis et al., 2019), considering consecutive vector representations through time, instead of two bins of word representations (Schlechtweg et al., 2018, 2020), can be crucial to improving model performance (Shoemark et al., 2019).
Here we tackle both challenges by approaching semantic change detection as an anomaly identification task. Working on embedding representations of words in the English language, we learn their evolution through time via an encoder-decoder architecture. We hypothesize that once such a model has been successfully trained on temporally sensitive sequences of word representations, it will accurately predict the evolution of the semantic representation of any word through time. Words that have undergone semantic change will be those that yield the highest errors under the prediction model. Our work makes the following contributions:
• we develop three variants of an LSTM-based architecture that measure the level of semantic change of a word by tracking its evolution through time in a sequential manner: (a) a word representation autoencoder, (b) a future word representation decoder and (c) a hybrid approach combining (a) and (b);
• we show the effectiveness of our models through thorough experimentation with synthetic data;
• we compare our models against current practices and competitive baselines using real data, demonstrating important gains in performance and highlighting the importance of sequential modelling of word vectors through time;
• we release our code, to help set up a benchmark for model comparison within the domain in a quantitative fashion.¹

Related Work
One can distinguish two directions within the literature on semantic change detection: (a) learning word representations over discrete time intervals (bins) and comparing the resulting vectors; and (b) jointly learning word representations across time (Bamler and Mandt, 2017; Rosenfeld and Erk, 2018; Yao et al., 2018; Rudolph and Blei, 2018). Such representations can be generated via different approaches, such as topic- (Frermann and Lapata, 2016; Perrone et al., 2019), graph- (Mitra et al., 2014) and neural-based models (e.g., word2vec); work by Tahmasebi et al. (2018) provides an overview of such approaches. In this work we focus on (a), due to scalability issues in learning diachronic representations from very large corpora, as in our case, and, without loss of generality, we utilise pre-trained, neural-based representations.
Related work in (a) derives word representations W_i (i ∈ [0, ..., |T|−1]) across |T| time intervals and performs pairwise comparisons for different values of i. Early work used frequency- or co-occurrence-based representations (Sagi et al., 2009; Cook and Stevenson, 2010; Gulordava and Baroni, 2011; Mihalcea and Nastase, 2012). However, leveraging word2vec-based representations (Mikolov et al., 2013) has become the common practice in recent years. Due to the stochastic nature of word2vec, Orthogonal Procrustes (OP) (Schönemann, 1966) is often applied to the resulting vectors, aiming at aligning the pairwise representations (Kulkarni et al., 2015; Hamilton et al., 2016; Shoemark et al., 2019; Tsakalidis et al., 2019). Given two word matrices W_k, W_j at times k and j respectively, OP finds the optimal transformation matrix R = argmin_{Ω: ΩᵀΩ=I} ‖ΩW_k − W_j‖_F, and the semantic shift level of a word w during this time interval is defined as the cosine distance between its representations in the two aligned matrices (Hamilton et al., 2016).
¹ Code is available at: https://github.com/adtsakal/semantic_change_evolution
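As an illustration, the OP alignment and the resulting per-word shift scores can be sketched in a few lines of numpy. This is a minimal sketch, assuming the rows of W_k and W_j are the d-dimensional vectors of the same vocabulary and that any centering/normalisation has already been applied; function names are our own.

```python
import numpy as np

def procrustes_align(W_k, W_j):
    """Find the orthogonal matrix R minimising ||W_k R - W_j||_F.

    Assumes W_k, W_j are (|V| x d) matrices whose rows are the vectors
    of the same words at two points in time (our layout assumption).
    """
    U, _, Vt = np.linalg.svd(W_k.T @ W_j)
    return U @ Vt

def semantic_shift(W_k, W_j):
    """Per-word cosine distance between the aligned representations."""
    A = W_k @ procrustes_align(W_k, W_j)
    num = (A * W_j).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(W_j, axis=1)
    return 1.0 - num / den
```

Words are then ranked in descending order of `semantic_shift`; a pure rotation between the two spaces yields near-zero distances for all words.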
By operating in a linear pairwise fashion, such approaches ignore the time-sensitive and possibly non-linear nature of semantic change.
By contrast, Kim et al. (2014), Kulkarni et al. (2015), Dubossarsky et al. (2019) and Shoemark et al. (2019) derive time series of a word's level of semantic change to detect semantically shifted words. Even though these methods incorporate temporal modelling, they either rely heavily on the linear transformation R (Kulkarni et al., 2015;Shoemark et al., 2019) or focus primarily on the generation of temporally-sensitive representations as a means towards capturing semantic change (Kim et al., 2014;Dubossarsky et al., 2019). A key contribution of our work is that we do not base our methods on pre-defined transformations, but instead propose a model for learning how (any type of) pre-trained word representations vary across time, effectively exploiting the full sequence of a word's evolution.
Finally, the comparative evaluation of semantic change detection models is still in its infancy.
Most related work assesses model performance based on artificial tasks (Cook and Stevenson, 2010; Kulkarni et al., 2015; Rosenfeld and Erk, 2018; Dubossarsky et al., 2019; Shoemark et al., 2019) or on a few hand-picked examples (Sagi et al., 2009), without cross-model comparison. The recently introduced shared tasks SemEval Task 1 (Schlechtweg et al., 2020) and DIACR-Ita (Basile et al., 2020) aim at bridging this gap; however, the respective datasets consist of documents split into two distinct time periods, thus not facilitating the study of the sequential nature of semantic change. Setting a benchmark for model comparison with real-world and sequential word representations is crucial in this field.

Methods
We formulate semantic change detection as an anomaly detection task in the evolution of pre-trained word embeddings. We assume that pre-trained word vectors W_t ∈ [W_0, ..., W_{|T|−1}], where W_t ∈ R^{|V|×d} (|V|: vocabulary size; d: word representation size), in a historical corpus over |T| time periods evolve according to a non-linear function f(W_t).² By approximating f, we obtain the level of semantic shift of a word w at time t by measuring the similarity between its word representation w_t and the model prediction f(w_t). A low similarity score for a given word implies an inaccurate model prediction (an anomaly) and thus a high level of semantic change for the given word. Therefore, we can obtain a ranking of the words based on their semantic shift level by ordering them in ascending order of the similarity scores between w_t and f(w_t). We approximate f via temporally sensitive deep neural models: (a) an autoencoder, which aims to reconstruct a word's trajectory up to a given point in time i, [w_0, ..., w_i] (section 3.1); and (b) a future predictor, which aims to predict the future representations of the word, [w_{i+1}, ..., w_{|T|−1}] (section 3.2). The two models can be trained individually or (c) in a joint multi-task setting (section 3.3). These models benefit from accounting for sequential word representations across time [W_0, ..., W_{|T|−1}], which is better suited to detecting semantically shifted words than the common practice of comparing only the first and last word representations.
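The ranking step described above can be sketched in numpy. This is a minimal sketch, assuming the actual and predicted representations are stacked as (|V|, T, d) arrays; the names and layout are our own choices.

```python
import numpy as np

def rank_by_change(actual, predicted, words):
    """Rank words by their level of semantic change.

    actual, predicted: arrays of shape (|V|, T, d) holding the true and
    model-predicted representations of each word at each time step.
    Returns the words sorted in ascending order of average cosine
    similarity, so the first entries are the strongest candidates for
    semantic change, along with the similarity scores themselves.
    """
    num = (actual * predicted).sum(axis=2)
    den = (np.linalg.norm(actual, axis=2) *
           np.linalg.norm(predicted, axis=2))
    avg_sim = (num / den).mean(axis=1)   # average over the T time steps
    order = np.argsort(avg_sim)          # ascending similarity
    return [words[i] for i in order], avg_sim
```

A word whose trajectory the model fails to predict (low average similarity) surfaces at the top of the returned list.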

Reconstructing Word Representations
Given an input sequence of vectors representing the words in a vocabulary across i points in time, W_{0:i−1} = [W_0, W_1, ..., W_{i−1}], the goal of the autoencoder is to reconstruct the input sequence W_{0:i−1}. Since the task of semantic change detection has a natural temporal dimension, we model our autoencoder via RNNs (see Figure 1). The encoder is composed of two LSTM layers (Hochreiter and Schmidhuber, 1997) with Dropout layers operating on their outputs for regularisation (Srivastava et al., 2014). The first layer encodes the input sequence W_{0:i−1} and returns its hidden states as input to the second layer. The output of the second layer is the final encoded state, which is then copied i times and fed as input to the decoder. The decoder has the same architecture as the encoder, albeit with additional dense layers on top of the second LSTM layer to produce the final reconstruction W^r_{0:i−1} over the i time steps. The model is trained by minimising the mean squared error (MSE) loss:
L_r = (1 / (|V|·i)) Σ_{w∈V} Σ_{t=0}^{i−1} ‖w_t − w^r_t‖²   (1)
After training, the words that yield the highest error rates in a given test set of word representations through time are considered to be the ones whose semantics have changed the most during the given time period. This is compatible with prior work based on word alignment (Hamilton et al., 2016; Tsakalidis et al., 2019), where the alignment error of a word indicates its level of semantic change.

Predicting Future Word Representations
Reconstructing the word vectors can reveal which words have changed their semantics in the past (i.e., up to time i−1; see 3.1). If we are interested in predicting changes in the semantics of future word representations (i.e., after time i−1), we can instead consider a future word representation prediction task: given the sequence of past word representations W_{0:i−1} over the first i time points, we predict the future representations of the words in the vocabulary, W_{i:T−1}, for a sequence of overall length |T| (see Figure 1). We follow the same model architecture as in section 3.1, with the only difference being the number of time steps used in the decoder to make the |T − i| predictions. The model is trained using the loss function L_f:
L_f = (1 / (|V|·(T − i))) Σ_{w∈V} Σ_{t=i}^{T−1} ‖w_t − w^f_t‖²   (2)

Joint Model
The two models can be combined into a joint one where, given an input sequence of representations of the vocabulary W_{0:i−1} over i points in time, the goal is both to (a) reconstruct the input sequence and (b) predict the |T − i| future word representations W_{i:T−1}. The complete model architecture is provided in Figure 1: the encoder is identical to the one used in 3.1 and 3.2. However, the bottleneck is now copied |T| times and passed to the decoders for reconstruction (i times) and future prediction (|T − i| times). The loss function L_rf here is the summation of the individual losses in Eq. 1 and 2:
L_rf = L_r + L_f   (3)
There are two main reasons for modelling semantic change in this multi-task setting. Firstly, we benefit from the finer granularity of the two decoders, each of which handles only part of the sequence, compared to the individual task models. Secondly, the joint model is insensitive to the value of i in Eq. 3, in contrast to Eq. 1 and 2, as discussed next.

Model Equivalence
The three models perform different operations; however, setting the operational time periods appropriately in Eq. 1-3 can result in model equivalence (i.e., the models performing the same task). Specifically, to detect the words whose semantics have changed during [0, T−1], the autoencoder (Eq. 1) needs to be fed with, and reconstruct, the full sequence across [0, T−1] (i.e., i = T−1). Reducing this interval (reducing i) would limit its operational time period. On the contrary, an increase in the value of i in Eq. 2 shortens the time period during which the future prediction model can detect the words whose semantics have changed the most: to detect the words whose semantics have changed within the full sequence [1, T−1], it requires only the word representations W_0 of the first time interval. Therefore, setting the parameter i can be crucial for the performance of the individual models. By contrast, the joint model in section 3.3 is able to detect the words that have undergone semantic change regardless of the value of i (see Eq. 3), since it still operates on the full sequence; we showcase these effects in section 5.2.

Experiments with Synthetic Data
Tasks run on artificial data have been used for evaluation purposes in related work (Gale et al., 1992; Schütze, 1998; Cook and Stevenson, 2010; Kulkarni et al., 2015; Rosenfeld and Erk, 2018; Dubossarsky et al., 2019; Shoemark et al., 2019). In this section, we work with artificial data as a proof-of-concept for our proposed models; we compare against state-of-the-art and other baseline methods on real data in the next section. Here we employ a longitudinal dataset of word representations (4.1) and artificially alter the representations of a small set of words across time (4.2). We then train (4.3) our models and evaluate them on the basis of their ability to identify the words that have undergone (artificial) semantic change (4.4).

Dataset
We employ the UK Web Archive dataset (Tsakalidis et al., 2019), which contains 100-dimensional representations of 47.8K words for each year in the period 2000-2013. These were obtained by employing word2vec (i.e., skip-gram with negative sampling (Mikolov et al., 2013)) on the documents published in each year independently. Note that our models can be applied to any type of pre-trained embeddings. Each year corresponds to a time step in our modelling. The dataset contains 65 words whose meaning has changed within 2001-13, as indicated by the Oxford English Dictionary. These are removed for the purposes of this section, to avoid interference with the artificial data modelling. We use one subset (80%) of the remaining longitudinal word representations for training our models and the rest (20%) for evaluation purposes.
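For concreteness, the data layout and the 80/20 split described above can be sketched in numpy. The stand-in sizes and random vectors are for illustration only (the real dataset holds pre-trained word2vec vectors for 47.8K words over 14 years); all variable names are our own.

```python
import numpy as np

# Illustrative stand-in sizes; the real dataset has 47.8K words,
# 14 years (2000-2013) and d = 100.
n_words, n_years, dim = 1_000, 14, 100
rng = np.random.default_rng(42)

# Stand-in for the pre-trained yearly embeddings, stacked as (|V|, |T|, d).
embeddings = rng.normal(size=(n_words, n_years, dim)).astype(np.float32)

# 80/20 train/test split over the vocabulary.
perm = rng.permutation(n_words)
split = int(0.8 * n_words)
X_train, X_test = embeddings[perm[:split]], embeddings[perm[split:]]
```

Each row of `X_train` is the full 14-step trajectory of one word, which is the unit the sequential models consume.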

Artificial Examples of Semantic Change
We generate artificial examples of words with changing semantics by following a paradigm inspired by Rosenfeld and Erk (2018). We select uniformly at random 5% of the words in the test set and alter their semantics. For every selected "source" word α, we select a "target" word β (details about the selection process of β are provided in the next paragraph). We then alter the representation w^(α)_t of the source word α at each point in time t, so that it shifts towards the representation w^(β)_t of the target word at this point in time, as:
w^(α)_t ← (1 − λ_t)·w^(α)_t + λ_t·w^(β)_t   (4)
Following Rosenfeld and Erk (2018), we model λ_t via a sigmoid function. λ_t takes values within [0, 1] and acts as a decay function that controls the speed of change in the source word's semantics towards the target. Thus, the semantic representation of α is not altered during the first time points; it then gradually shifts towards the representation of β (for middle values of t) and stabilises towards the last time points. Since the duration of the semantic shift of a word may vary, we experiment under different scenarios (see "Conditioning on Duration of Change" below). Alternative approaches to modelling artificial semantic change have been presented in Shoemark et al. (2019), e.g., forcing a word to acquire a new sense while also retaining its original meaning. We opted for the "stronger" case of semantic shift (Eq. 4) as a proof-of-concept for our models. In section 5 we experiment with real-world data, without any assumptions about the form of the function underlying semantic change.
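The decay-driven interpolation described above can be sketched as follows. The sigmoid midpoint and steepness below are illustrative choices, not the paper's exact parameterisation.

```python
import numpy as np

def sigmoid_decay(T, t0=None, steepness=1.5):
    """lambda_t in [0, 1]: flat at first, fastest around t0, flat at the end.

    t0 (midpoint of the change) and steepness are illustrative knobs.
    """
    t = np.arange(T, dtype=float)
    if t0 is None:
        t0 = (T - 1) / 2.0
    return 1.0 / (1.0 + np.exp(-steepness * (t - t0)))

def inject_change(w_alpha, w_beta, lam):
    """Shift the source word's trajectory towards the target, per Eq. 4.

    w_alpha, w_beta: (T, d) trajectories; lam: (T,) decay values.
    """
    return (1.0 - lam[:, None]) * w_alpha + lam[:, None] * w_beta
```

With T=14 (one step per year), the source word keeps its original vector in the early years and essentially becomes the target word in the final years.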
Conditioning on Target Words The selection of the target words should be such that they allow the representation of the source word to change through time (Dubossarsky et al., 2019). This will not be the case if we select a pair of {α, β} {source, target} words whose representations are very similar (e.g., synonyms). Thus, for each source word

Artificial Data Experiment
Our task is to rank the words in the test set according to their level of semantic change. We first train our three models on the training set and then apply them to the test set. Finally, we measure the level of semantic change of a word by means of the average cosine similarity between the predicted and actual word representations at each time step of the decoder. Model performance is assessed via rank-based metrics (Basile and McGillivray, 2018; Tsakalidis et al., 2019; Shoemark et al., 2019).

Model Training
We define and train our models as follows: seq2seq_r, the autoencoder (section 3.1), receives and reconstructs the full sequence of the word representations in the training set. seq2seq_r and seq2seq_f have a different number of time steps in their input, so that the decoder in each model operates on the maximum possible output sequence, thus exploiting the semantic change of the words over the whole time period (see section 3.4). seq2seq_rf is expected to be insensitive to the number of input time steps; therefore we conventionally set its input to half of the overall sequence. We keep 25% of our training set for validation purposes and train our models using the Adam optimiser (Kingma and Ba, 2015). We select the best parameters after 25 trials of the Tree of Parzen Estimators algorithm of the hyperopt module (Bergstra et al., 2013), by means of the maximum average (i.e., per time step) cosine similarity on the validation set.³

Testing and Evaluation
After training, each model is applied to the test set, yielding its predictions for every word through time.⁴ The level of semantic change of a word is then calculated via the average cosine similarity between the actual and the predicted word representations through time, with higher values indicating a better model prediction (and thus a lower level of semantic change). The words are ranked in ascending order of their average cosine similarity, with the first ranks indicating the words whose representations have changed the most (low cosine similarity). For evaluation, similarly to Tsakalidis et al. (2019), we employ the average rank across all of the semantically changed words (in %, denoted here as µ_r), with lower scores indicating a better model. We prefer µ_r to the mean reciprocal rank, because the latter emphasises the first rankings. Since semantic change detection is an under-explored task in quantitative terms, we aim at getting better insights into model performance by working with an averaging metric. For the same reason we avoid using classification-based metrics that rely on a cut-off point (e.g., recall at k (Basile and McGillivray, 2018)). We do make use of such metrics in the cross-model comparison with real data (section 5.2).
Figure 3 presents the results of the three models on our synthetic data across all (c, λ) combinations. seq2seq_rf performs consistently better than the individual (seq2seq_r, seq2seq_f) models in µ_r, showing that combining the two models under a multi-task setting benefits from the joint and finer-grained parameter tuning of the two components. seq2seq_r performs slightly better than seq2seq_f, probably due to the autoencoder having to output a longer sequence (i.e., due to W^r_00), which helps explore the temporal variation of the words more effectively.
Figure 4 shows the cosine similarity between the predicted and actual representation of each synthetic word per time step for the "Full" case when c=0.0 (highest level of change, see section 4.2). seq2seq_r reconstructs the input sequence of synthetic examples more accurately than the future prediction component (average cosine similarity per year (avg cos): .65 vs .50). It particularly manages to reconstruct the synthetic word representations during the years 2006-2008 (avg cos_06:08=.75), which are the points when λ_t varies most rapidly (see Figure 2); however, it fails to reconstruct equally well their representations before (avg cos_00:05=.65) and after (avg cos_09:13=.59) this sharp change. On the contrary, seq2seq_f predicts the synthetic word representations more accurately during the first years (avg cos_01:05=.74), when the change in their semantics is minor, but (correctly) fails after the semantic change is almost complete (i.e., when λ_t ≤ .25, avg cos_09:13=.24). seq2seq_rf benefits from the individual components' advantages: it appropriately reconstructs the artificial examples in the first years (avg cos_00:05=.85), so that their semantic shift is highlighted more clearly during (avg cos_06:08=.62) and after (avg cos_09:13=.26) the process is almost complete. Finally, avg cos in seq2seq_rf correlates highly with λ_t (ρ=.987), potentially providing insights on how to measure the speed of semantic change of a word.

Effect of Conditioning Parameters
Regardless of the duration of the semantic change process, an increase in the value of c results in model performance degradation. This is expected, since an increase of c implies that the level of semantic change of the source words is lower, as discussed in 4.2, thus making the task of detecting them more difficult. Nevertheless, our worst performing model in the most challenging setting (c=0.5, Full, seq2seq_f) achieves µ_r=28.17, which is clearly better than the µ_r expected from a random baseline (µ_r=50.00).
Decreasing the duration of semantic change has a positive effect on our models (see Figure 3). This is more evident in the cases with a high value of c, where seq2seq_r (µ_r: 26.09 to 18.21 in the Full-to-Quarter cases), seq2seq_f (µ_r: 28.17 to 22.48) and seq2seq_rf (µ_r: 20.38 to 13.09) show clear gains in performance. This indicates that our models can capture semantic change in small subsequences of the time series. Studying this effect in datasets of longer duration is an important future direction.

Experimental Setting
We approach the task in a rank-based manner, as in section 4. However, here we are interested in detecting real-world examples of semantic change in words and comparing our models against strong baselines and current practices.

Data and Task
We employ the UK Web Archive dataset (see section 4.1). We keep the same 80/20 train/test split as in section 4 and incorporate in the test set the 65 words with known changes in meaning according to the Oxford English Dictionary. We train our models as in section 4.3, aiming at detecting the 65 words in the test set. We use µ_r (as in section 4) and recall at k (Rec@k, k=5%, 10%, 50%) as our evaluation metrics. We refrain from using precision at k, since the Oxford English Dictionary is not expected to have full coverage of the semantically shifted words. Lower µ_r and higher Rec@k scores indicate better models.
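The two evaluation metrics can be sketched as follows. This is a minimal implementation under our own naming; ties and rounding at the k% cut-off may be handled differently in the original evaluation.

```python
import numpy as np

def mean_rank_pct(ranked_words, changed):
    """mu_r: average rank (as % of the list length) of the changed words.

    ranked_words: full ranking, most-changed first; changed: gold set.
    Lower is better.
    """
    positions = [ranked_words.index(w) + 1 for w in changed]
    return 100.0 * np.mean(positions) / len(ranked_words)

def recall_at_k(ranked_words, changed, k_pct):
    """Fraction of changed words found in the top k% of the ranking."""
    cutoff = int(round(len(ranked_words) * k_pct / 100.0))
    found = set(ranked_words[:cutoff]) & set(changed)
    return len(found) / len(changed)
```

For example, if the two gold words sit at ranks 1 and 2 of a 10-word list, µ_r is 15 and Rec@50 is 1.0.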

Models
We compare the three variants from section 3 against four types of baselines: -A random word rank generator (RAND). We report average metrics after 1K runs on the test set.
- Variants of Procrustes Alignment, as a common practice in past work: given word representations in two different years [W_0, W_i], centered around the origin and s.t. tr(W Wᵀ) = 1 for each matrix, PROCR transforms W_i into W*_i s.t. the squared differences between W_0 and W*_i are minimised. We also use the PROCR_k and PROCR_kt variants (Tsakalidis et al., 2019), which first detect the k most stable words across either [W_0, W_i] (PROCR_k) or [W_0, ..., W_{T−1}] (PROCR_kt) to learn the alignment, and then transform W_i into W*_i. Words are ranked based on the cosine distance between [W_0, W*_i].
- Models leveraging the first and last word representations only. We use a Random Forest (Breiman, 2001) regression model (RF) that predicts W_i given W_0. We also use the same architectures presented in sections 3.1-3.2, trained on [W_0, W_i] (ignoring the full sequence): LSTM_r reconstructs the sequence [W_0, W_i]; LSTM_f predicts W_i given W_0, similarly to RF. Words are ranked in ascending order of the (average, for LSTM_r) cosine similarity between their predicted and actual representations.
- Models operating on the time series of distances. Given a sequence of vectors [W_0, ..., W_i], we construct the time series of cosine distances produced by PROCR. Then, we use two global trend models (Shoemark et al., 2019): GT_c ranks the words by means of the absolute value of the Pearson correlation of their time series; GT_β fits a linear regression model for every word and ranks the words by the absolute value of the slope. Finally, we employ PROCR*, ranking words based on the average cosine distance within [0, i]; this is similar to the "Mean Distances" model used in Rodina et al. (2019), with the difference that the distances at time point i are calculated by measuring the cosine distance resulting from the alignment against the initial time point 0 and not against i − 1.⁶
We report the performance of the models (a) when they operate on the full interval (2000-13) and (b) averaged across all intervals [2000-01, ..., 2000-13]. In (b), our models use additional (future) information compared to our baselines: when seq2seq_f is fed with the word sequences of [2000, 2001], it makes a prediction for the years [2002, ..., 2013]; such information cannot be leveraged by the baselines. Thus, for (b), we only perform intra-model (and intra-baseline) comparisons.
Table 1: Model comparison when operating on the entire time sequence (2000-13) and averaged across time (2000-01, ..., 2000-13). Past work and baseline models shown in the table are defined in section 5.1 ("Models").
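For illustration, the global trend and mean-distance baselines reduce to simple operations on each word's time series of (Procrustes-based) cosine distances from the initial year. This is a hedged numpy sketch; the per-word series layout and function names are our assumptions.

```python
import numpy as np

def gt_c(series):
    """GT_c score: |Pearson correlation| between the distances and time."""
    t = np.arange(len(series), dtype=float)
    return abs(np.corrcoef(t, series)[0, 1])

def gt_beta(series):
    """GT_beta score: |slope| of a linear fit to the distance series."""
    t = np.arange(len(series), dtype=float)
    slope, _intercept = np.polyfit(t, series, 1)
    return abs(slope)

def procr_star(series):
    """PROCR* score: mean cosine distance across the interval."""
    return float(np.mean(series))
```

Under any of the three scores, words are ranked in descending order: a word whose distance from its year-2000 representation drifts steadily upwards outranks a word whose distances merely fluctuate.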

Results
Our models vs baselines Results are shown in Table 1. The three proposed models consistently achieve the lowest µ_r and highest Rec@k when working on the whole time sequence ('00-'13 columns). The comparison between {seq2seq_r, LSTM_r} and {seq2seq_f, LSTM_f} in the years 2000-13 showcases the benefit of modelling the full sequence of the word representations across time, compared to using the first and last representations only. Our models provide a relative boost of 4.6% in µ_r and of [35.7%, 42.8%, 5.8%] in Rec@k (for k=[5, 10, 50]) compared to the best performing baseline.⁵ seq2seq_f and seq2seq_rf outperform the autoencoder (seq2seq_r) in most metrics, while seq2seq_rf yields the most stable results across all runs. We explore these differences in detail in the last paragraph of this section.
⁵ Example (2005 on the x-axis): the sequence of the word representations until 2005 is the input to all of our models. Then, seq2seq_r reconstructs the word representations up to 2005, seq2seq_f predicts the future representations (2006, ..., 2013) and seq2seq_rf performs both tasks jointly.
⁶ We refrain from evaluating the GT models when i ≤ 2, due to the very short time interval, which does not allow for correlations to appear in the data, leading to very poor performance.
Intra-baseline comparison Models operating only on the first and last word representations fail to confidently outperform the Procrustes-based baselines, demonstrating again the weakness of operating in a non-sequential manner. The LSTM models achieve low µ_r in the 2000-13 experiments; however, their difference from the rest of the baselines in µ_r across all years is negligible. The intra-Procrustes model comparison shows that the benefit of selecting a few anchor words to learn a better alignment (PROCR_k, PROCR_kt), shown in Tsakalidis et al. (2019) for semantic change over two consecutive years, might not apply when examining a longer time period. Finally, contrary to Shoemark et al. (2019), we find that time-sensitive models operating on the word distances across time (GT_c, GT_β) perform worse than the baselines that leverage only the first and last word representations. This difference is attributed to the low number of time steps in our dataset, which does not allow the GT models to exploit long-term correlations (i.e., considering the average distance across time (PROCR*) performs better), but it also highlights the importance of leveraging the full word sequence across time.
Effect of input/output lengths Figure 5 shows the µ_r of our three variants when we alter the length of the input and output (see section 3.4). The performance of seq2seq_r increases with the input size, since by definition the decoder is able to detect words whose semantics have changed over a longer period of time (i.e., within [2000, i], with i increasing), while also modelling a longer sequence. On the contrary, the performance of seq2seq_f increases as the number of input time steps decreases. This is expected since, as i decreases, seq2seq_f encodes a shorter input sequence and the decoding (and hence the semantic change detection) is applied on the remaining (and increased number of) time steps within [i+1, 2013]. These findings provide empirical evidence that both models can achieve better performance if their decoders operate over longer sequences of time steps. Finally, the stability of seq2seq_rf showcases its input length-invariant nature, which is also clearly evident in all of the averaged results (standard deviation in the avg±std columns) in Table 1: in its worst performing setting, seq2seq_rf still manages to achieve results that are close to those of the best performing model (µ_r=25.17, Rec@k=[21.54, 36.92, 83.08] for the three thresholds) and always better than (or equal to) the best performing baseline shown in Table 1 in Rec@k. This is a very attractive aspect of the model, as it removes the need to manually define the number of time steps fed to the encoder.
Words with shifted meaning Figure 6 shows the cosine distances between the actual and predicted vectors of the 65 words that acquired a new meaning between 2001-2013. The distances are calculated by applying the seq2seq_rf model (trained as in section 4.3) to the test set. The words are ranked based on their average cosine distance throughout the years, such that the words in the first rows form the more challenging examples for detecting semantic shift. Although some of these words have acquired an additional meaning in the context of social networks (e.g., "like", "unlike"), this is not effectively captured by their vectors. Utilising contextual representations (Giulianelli et al., 2020) in our models could be more effective for capturing such cases in future work.

Conclusion and Future Work
We introduce three sequential models for semantic change detection that effectively exploit the full sequence of a word's representations through time to determine its level of semantic change. Through extensive experiments on synthetic and real data, we showcase the effectiveness of the proposed models under various settings and in comparison to the state-of-the-art on the UK Web Archive dataset. Importantly, we show that their performance increases alongside the duration of the time period under study, confidently outperforming competitive models and common practices in semantic change detection.
Future work can use anomaly detection approaches operating on our model's predicted word vectors to detect anomalies in a word's representation across time. We also plan to investigate different architectures, such as Variational Autoencoders (Kingma and Welling, 2014), and to incorporate contextual representations (Devlin et al., 2019; Hu et al., 2019) to detect new senses of words. A limitation of our work is that it has been tested on a single dataset, in which 65 words have undergone semantic change; testing our models on datasets of different duration and in different languages will provide clearer evidence of their effectiveness.
• LSTM_r/f: we follow the exact same settings as in our seq2seq_r and seq2seq_f models, respectively.
• RF: we experiment with the number of trees ([50, 100, 150, 200]) and select the best model based on the maximum average cosine similarity across all predictions, as in our models.
• PROCR_k/kt: we experiment with different rates [.001, .01, .05, .1, .2, ..., .9] of anchor (or diachronic anchor) words on the basis of the size of the test set. We display in our results the best model based on the average performance on the test set (k=.9 for PROCR_k, k=.5 for PROCR_kt).
• GT_c: we explore different correlation metrics (Spearman Rank, Pearson Correlation, Kendall Tau) and display the best one (Pearson Correlation) on the basis of its average performance on the test set across all experiments. Due to the very poor performance of all metrics when operating on a small number of time steps (≤ 2), we only provide the results in Table 1 (avg±std columns) when these models operate on longer sequences.
• PROCR, PROCR*, GT_β, RAND: there are no hyper-parameters to tune in these models. In terms of preprocessing, for all PROCR-based baselines, we first subtract the mean and then divide each matrix by its Frobenius norm, so that the resulting (transformed) matrices lie in the same space.

B Complete Results on Real Data
The complete list of results (µ_r) of all models in all of the experiments with real data (section 5, Table 1 and Figure 5) is provided in Table 2. The interpretation of the "year" for each model is provided in Table 3.