An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability

Our research aims at tracking the semantic evolution of the lexicon over time. For this purpose, we investigated two well-known training protocols for neural language models in a synchronic experiment and encountered several problems relating to accuracy and reliability. We were able to identify critical parameters for improving the underlying protocols in order to generate more adequate diachronic language models.


Introduction
The lexicon can be considered the most dynamic part of all linguistic knowledge sources over time. There are two innovative change strategies typical for lexical systems: the creation of entirely new lexical items, commonly reflecting the emergence of novel ideas, technologies or artifacts, on the one hand, and, on the other hand, shifts in the meaning of already existing lexical items, a process which usually takes place over larger periods of time. Tracing semantic changes of the latter type is the main focus of our research.
Meaning shift has recently been investigated with emphasis on neural language models (Kim et al., 2014; Kulkarni et al., 2015). This work is based on the assumption that the measurement of semantic change patterns can be reduced to the measurement of lexical similarity between lexical items. Neural language models, originating from the word2vec algorithm (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c), are currently considered state-of-the-art solutions for implementing this assumption (Schnabel et al., 2015). Within this approach, changes in similarity relations between lexical items at two different points in time are interpreted as a signal for meaning shift. Accordingly, lexical items which are very similar to the lexical item under scrutiny can be considered as approximating its meaning at a given point in time. Both ideas were already combined in prior work to show, e.g., the increasing association of the lexical item "gay" with the meaning dimension of "homosexuality" (Kim et al., 2014; Kulkarni et al., 2015).
We here investigate the accuracy and reliability of such similarity judgments derived from different training protocols dependent on word frequency, word ambiguity and the number of training epochs (i.e., iterations over all training material). Accuracy renders a judgment of the overall model quality, whereas reliability between repeated experiments ensures that qualitative judgments can indeed be transferred between experiments. Based on the identification of critical conditions in the experimental set-up of previously employed protocols, we recommend improved training strategies for more adequate neural language models dealing with diachronic lexical change patterns. Our results concerning reliability also cast doubt on the reproducibility of experiments where semantic similarity between lexical items is taken as a computationally valid indicator for properly capturing lexical meaning (and, consequently, meaning shifts) under a diachronic perspective.

Related Work
Neural language models for tracking semantic changes over time typically distinguish between two different training protocols: continuous training of models (Kim et al., 2014), where the model for each time span is initialized with the embeddings of its predecessor, and, alternatively, independent training with a mapping between models for different points in time (Kulkarni et al., 2015). A comparison between these two protocols, such as the one proposed in this paper, has not been carried out before. Also, the application of such protocols to non-English corpora is lacking, with the exception of our own work relating to German data (Hellrich and Hahn, 2016b; Hellrich and Hahn, 2016a).
The word2vec algorithm is a heavily trimmed version of an artificial neural network used to generate low-dimensional vector space representations of a lexicon. We focus on its skip-gram variant, which is trained to predict plausible contexts for a given word and was shown to be superior to other settings for modeling semantic information (Mikolov et al., 2013a). There are several parameters to choose for training: learning rate, downsampling factor for frequent words, number of training epochs and the choice between two strategies for managing the huge number of potential contexts. One strategy, hierarchical softmax, uses a binary tree to efficiently represent the vocabulary, while the other, negative sampling, works by updating only a limited number of word vectors during each training step.
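To illustrate why negative sampling touches only a few word vectors per training step, the following is a minimal sketch of a single skip-gram update with negative sampling. It is a simplified illustration under our own assumptions (plain Python lists, a fixed learning rate, hand-picked negative words), not the GENSIM implementation; the function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_step(in_vecs, out_vecs, center, context, negatives, lr=0.025):
    """One skip-gram update with negative sampling (illustrative sketch).

    in_vecs / out_vecs map words to lists of floats (input and output
    embeddings). Only the center word, the observed context word and the
    sampled negative words are modified; all other vectors stay untouched.
    Returns the list of output words whose vectors were updated.
    """
    v = in_vecs[center]
    grad_center = [0.0] * len(v)
    touched = []
    # The observed context word has label 1, sampled negatives have label 0.
    for target, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        u = out_vecs[target]
        g = lr * (label - sigmoid(dot(v, u)))
        for d in range(len(v)):
            grad_center[d] += g * u[d]  # accumulate using the old output vector
            u[d] += g * v[d]            # update the output vector in place
        touched.append(target)
    for d in range(len(v)):
        v[d] += grad_center[d]
    return touched
```

Hierarchical softmax would instead update the vectors of the inner tree nodes on the path to the context word, so the number of updates per step grows with the depth of the tree rather than with a fixed number of negative samples.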
Furthermore, artificial neural networks, in general, are known for a large number of local optima encountered during optimization. While these commonly lead to very similar performance (LeCun et al., 2015), they cause different representations in the course of repeated experiments.
Approaches to modelling changes of lexical semantics not using neural language models, e.g., Wijaya and Yeniterzi (2011), Gulordava and Baroni (2011), Mihalcea and Nastase (2012), Riedl et al. (2014) or Jatowt and Duh (2014) are, intentionally, out of the scope of this paper. In the same way, we here refrain from comparison with computational studies dealing with literary discussions related to the Romantic period (e.g., Aggarwal et al. (2014)).

Experimental Set-up
For comparability with earlier studies (Kim et al., 2014; Kulkarni et al., 2015), we use the fiction part of the GOOGLE BOOKS NGRAM corpus (Michel et al., 2011; Lin et al., 2012). This part of the corpus is also less affected by sampling irregularities than other parts (Pechenick et al., 2015). Due to the opaque nature of GOOGLE's corpus acquisition strategy, the influence of OCR errors on our results cannot be reasonably estimated, yet we assume that they will affect all experiments in an equal manner.
The wide range of experimental parameters described in Section 2 makes it virtually impossible to test all their possible combinations, especially as repeated experiments are necessary to probe a method's reliability. We thus concentrate on two experimental protocols: the one described by Kim et al. (2014) (referred to as Kim protocol) and the one from Kulkarni et al. (2015) (referred to as Kulkarni protocol), including close variations thereof. Kulkarni's protocol operates on all 5-grams occurring during five consecutive years (e.g., 1900-1904) and trains models independently of each other. Kim's protocol operates on uniformly sized samples of 10M 5-grams for each year from 1850 onwards in a continuous fashion (years before 1900 are used for initialization only). Its constant sampling sizes result in both oversampling and undersampling, as is evident from Figure 1.
Table 1: Accuracy and reliability among top n words for threefold application of different training protocols. Reliability is given as fraction of the maximum for n. Standard deviation for accuracy ±0, if not noted otherwise; reliability is based on the evaluation of all lexical items, thus no standard deviation.
We use the PYTHON-based GENSIM implementation of word2vec for our experiments; the relevant code is made available via GITHUB. Due to the 5-gram nature of the corpus, a context window covering four neighboring words is used for all experiments. Only words with at least 10 occurrences in a sample are modeled. Training for each sample is repeated until convergence is achieved or 10 epochs have passed. Following both protocols, we use word vectors with 200 dimensions for all experiments, as well as an initial learning rate of 0.01 for experiments based on 10M samples, and one of 0.025 for systems trained on unsampled texts; the threshold for downsampling frequent words was 10^-3 for sample-based experiments and 10^-5 for unsampled ones.
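As an illustration, the hyperparameters described above could be collected as follows. This is a sketch under assumptions: the keyword names follow the GENSIM 4.x `Word2Vec` constructor (older releases use `size` and `iter` instead of `vector_size` and `epochs`), the helper function name is hypothetical, the number of negative samples is not stated in the text and is set to an assumed value of 5, and the convergence check is not modeled.

```python
def word2vec_params(sampled, hierarchical_softmax):
    """Keyword arguments mirroring the settings described in the text.

    sampled: True for experiments on 10M 5-gram samples (Kim-style),
             False for experiments on unsampled texts (Kulkarni-style).
    hierarchical_softmax: True for hierarchical softmax,
                          False for negative sampling.
    """
    return {
        "sg": 1,                # skip-gram variant
        "vector_size": 200,     # 200-dimensional word vectors
        "window": 4,            # four neighboring words within a 5-gram
        "min_count": 10,        # only words with at least 10 occurrences
        "epochs": 10,           # upper bound; early stopping is not modeled here
        "alpha": 0.01 if sampled else 0.025,   # initial learning rate
        "sample": 1e-3 if sampled else 1e-5,   # downsampling threshold
        "hs": 1 if hierarchical_softmax else 0,
        # ASSUMPTION: the number of negative samples is not given in the text.
        "negative": 0 if hierarchical_softmax else 5,
    }
```

With GENSIM 4.x installed, such a dictionary could be passed on as `Word2Vec(sentences=five_grams, **word2vec_params(sampled=True, hierarchical_softmax=False))`, where `five_grams` is a hypothetical iterable of tokenized 5-grams.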
We tested both negative sampling and hierarchical softmax training strategies, the latter being canonical for Kulkarni's protocol, whereas Kim's protocol is underspecified in this regard. We evaluate accuracy by using the test set developed by Mikolov et al. (2013a). This test set is based on present-day English language and world knowledge, yet we assume it to be a viable proxy for overall model quality. It contains groups of four words connected via the analogy relation '::' and the similarity relation '∼', as exemplified by the expression king ∼ queen :: man ∼ woman.
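The analogy-based accuracy measure can be sketched as follows. The sketch assumes the common vector-offset formulation (predict the fourth word as the nearest neighbor of b - a + c by cosine similarity, excluding the three query words); the toy vectors and function names are invented for illustration and are not taken from our models.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes a ~ b :: c ~ d."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate set.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, quadruples):
    """Fraction of (a, b, c, d) test items for which d is predicted."""
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in quadruples)
    return correct / len(quadruples)


# Invented two-dimensional toy embeddings (royalty, gender).
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}
```

On this toy data, `solve_analogy(toy_vectors, "king", "queen", "man")` selects "woman", since the offset from "king" to "queen" applied to "man" points exactly at the "woman" vector.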
We evaluate reliability by training three identically parametrized models for each experiment. We then compare the top n similar words (by cosine distance) for each word modeled by the experiments with a variant of the Jaccard coefficient (Manning et al., 2008, p.61). We limit our analysis to values of n between 1 and 5, in accordance with data on word2vec accuracy (Schnabel et al., 2015). The 3-dimensional array W_{i,j,k} contains words ordered by similarity (i) for a word in question (j) according to an experiment (k). If a word in question is not modeled by an experiment, as can be the case for comparisons over different samples, ∅ is the corresponding entry. The reliability r for a specific value of n (r@n) is defined as the magnitude of the intersection of similar words produced by all three experiments with a rank of n or lower, averaged over all t words modeled by any of these experiments and normalized by n, the maximally achievable score for this value of n:
$$r@n := \frac{1}{t \cdot n} \sum_{j=1}^{t} \left| \bigcap_{k=1}^{3} \{ W_{i,j,k} \mid 1 \le i \le n \} \right|$$
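The reliability measure described above can be sketched in a few lines. The sketch assumes that each experiment is represented as a dictionary mapping a word to its ranked list of most similar words, and that a word missing from any experiment contributes no shared neighbors; the function name and this data layout are our own illustrative choices.

```python
def reliability_at_n(runs, n):
    """Compute r@n for a list of experiments.

    runs: list of dicts, each mapping a word to its neighbors ordered by
          decreasing similarity (one dict per experiment).
    n:    only neighbors with rank n or lower are compared.
    """
    # All t words modeled by any of the experiments.
    words = set().union(*(run.keys() for run in runs))
    if not words:
        return 0.0
    shared = 0
    for word in words:
        # ASSUMPTION: a word not modeled by some experiment yields an
        # empty intersection and thus contributes 0.
        if all(word in run for run in runs):
            top_sets = [set(run[word][:n]) for run in runs]
            shared += len(set.intersection(*top_sets))
    # Average over t words and normalize by the maximum score n.
    return shared / (len(words) * n)
```

For three runs that agree on exactly one of the top-2 neighbors for a single word out of three modeled words, this yields 1 / (3 * 2), i.e., roughly 0.17.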

Results
We focus our analysis on the representations generated for the initial period, i.e., 1900 for sample-based experiments and 1900-1904 for unsampled ones. This choice was made since researchers can be assumed to be aware of current word meanings, thus making correct judgments on initial word semantics more important. As a beneficial side effect, we get a marked reduction of computational demands, saving several CPU years compared to an evaluation based on the most recent period.

Detailed Investigation
As variations of Kulkarni's protocol yield more consistent results, we further explore its performance considering word frequency, word ambiguity and the number of training epochs. All experiments described in this section are based on the complete 1900-1904 corpus. Figure 2 shows the influence of word frequency: negative sampling is overall more reliable, especially for words with low or medium frequency. The 21 words reported to have undergone traceable semantic changes (see footnote 4) are all frequent, with percentiles between 89 and 99. For such high-frequency words, hierarchical softmax performs similarly or slightly better. Entries in the lexical database WORDNET (Fellbaum, 1998) can be employed to measure the effect of word ambiguity on reliability. The number of WORDNET synsets a word belongs to (i.e., the number of its senses) seems to have little effect on top-1 reliability for negative sampling, while hierarchical softmax underperforms for words with a low number of senses, as shown in Figure 3.
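A frequency-dependent analysis of this kind can be sketched as follows. The exact percentile definition and binning behind Figure 2 are not spelled out here, so the sketch assumes that a word's percentile is the share of vocabulary entries that are strictly less frequent, and that per-word reliability scores are averaged within fixed-width percentile bins; all names and the toy data in the example are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def frequency_percentiles(counts):
    """Map each word to the percentage of words that are strictly rarer."""
    ordered = sorted(counts.values())
    return {word: 100.0 * bisect_left(ordered, count) / len(ordered)
            for word, count in counts.items()}


def mean_score_by_bin(percentiles, scores, width=10):
    """Average per-word scores within percentile bins of the given width.

    Returns a dict mapping the lower bound of each bin to the mean score
    of the words falling into it.
    """
    bins = defaultdict(list)
    for word, score in scores.items():
        bins[int(percentiles[word] // width) * width].append(score)
    return {start: sum(vals) / len(vals) for start, vals in sorted(bins.items())}
```

Given invented counts such as `{"the": 1000, "king": 200, "gay": 50, "lazzaroni": 10}`, the rarest word receives percentile 0 and the most frequent one percentile 75, so a bin width of 50 separates the two rarer words from the two more frequent ones.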
Model reliability and accuracy depend on the number of training epochs, as shown in Figure 4. There are diminishing returns for hierarchical softmax, with reliability staying constant after 5 epochs, while negative sampling increases in reliability with each epoch. Yet, both methods achieve maximal accuracy after only 2 epochs; additional epochs lead to a small decrease from 0.4 down to 0.38 for negative sampling. This could indicate overfitting, but accuracy is based on a test set for modern-day language, and thus cannot be considered a fully valid yardstick.

Discussion
Our investigation into the performance of two common protocols for training neural language models on historical text data led to several hitherto unknown results. We could show that negative sampling outperforms hierarchical softmax both in terms of accuracy and reliability.
Footnote 4: Kulkarni et al. (2015) compiled the following list based on prior work (Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014; Kim et al., 2014): card, sleep, parent, address, gay, mouse, king, checked, check, actually, supposed, guess, cell, headed, ass, mail, toilet, cock, bloody, [...]
Even the most reliable system often identifies widely different words as most similar. This creates potential for erroneous conclusions on a word's semantic evolution, e.g., "romantic" happens to be identified as most similar to "lazzaroni", "fanciful" and "melancholies" by three systems trained with negative sampling on 1900-1904 texts. We are thus skeptical about using such similarity clouds to describe or visualize lexical semantics at a point in time.
In future work, we will explore the effects of continuous training based on complete corpora. The selection of a convergence criterion remains another open issue due to the threefold trade-off between training time, reliability and accuracy. It would also be interesting to replicate our experiments for other languages or points in time. Yet, the enormous corpus size for more recent years might require a reduced number of maximum epochs for these experiments. In order to improve the semantic modeling itself, one could lemmatize the training material or utilize the part-of-speech annotations provided in the latest version of the GOOGLE corpus (Lin et al., 2012). Also, recently available neural language models with support for multiple word senses (Bartunov et al., 2016; Panchenko, 2016) could be helpful, since semantic changes can often be described as changes in the usage frequency of different word senses (Rissanen, 2008, pp.58-59). Finally, it is clearly important to test the effect of our proposed changes, based on synchronic experiments, on a system for tracking diachronic changes in word semantics.