GASC: Genre-Aware Semantic Change for Ancient Greek

Word meaning changes over time, depending on linguistic and extra-linguistic factors. Associating a word’s correct meaning in its historical context is a central challenge in diachronic research, and is relevant to a range of NLP tasks, including information retrieval and semantic search in historical texts. Bayesian models for semantic change have emerged as a powerful tool to address this challenge, providing explicit and interpretable representations of semantic change phenomena. However, while corpora typically come with rich metadata, existing models are limited by their inability to exploit contextual information (such as text genre) beyond the document time-stamp. This is particularly critical in the case of ancient languages, where lack of data and long diachronic span make it harder to draw a clear distinction between polysemy (the fact that a word has several senses) and semantic change (the process of acquiring, losing, or changing senses), and current systems perform poorly on these languages. We develop GASC, a dynamic semantic change model that leverages categorical metadata about the texts’ genre to boost inference and uncover the evolution of meanings in Ancient Greek corpora. In a new evaluation framework, our model achieves improved predictive performance compared to the state of the art.


Introduction
Change and its precondition, variation, are inherent in languages. Over time, new words enter the lexicon, others become obsolete, and existing words acquire new senses. These changes are grounded in cognitive, social, and contextual factors, and can be realized in different ways. For example, in Old English thing meant 'a public assembly', whereas it now more generally means 'entity'. Semantic change research has a number of practical applications beyond historical linguistics, including new sense detection in computational lexicography and information retrieval for historical texts that lets users restrict a search to certain word senses (e.g. the old sense of the English adjective nice as 'silly'). To take an example from recent semantic change in English, the verb tweet used to be uniquely associated with birds' sounds and has recently acquired a new sense related to the social media platform Twitter. However, in this as in many other cases, the original sense co-exists with the new one, and specific contexts or genres will select one over the other. This is known as synchronic variation, and can be successfully modelled probabilistically, as advocated by several authors (see e.g. Jenset and McGillivray (2017)). The close relationship between innovation and variation is well known in historical linguistics, and is critical for ancient languages, for which balanced corpora are not available due to the limited amount of data at our disposal; models therefore need to explicitly account for confounding variables like genre.
To address these challenges, we introduce GASC (Genre-Aware Semantic Change), a novel dynamic Bayesian topic model for semantic change. In this model, the evolution of word senses over time is based not only on distributional information of lexical nature, but also on additional features, specifically genre. This allows GASC to decouple sense probabilities and genre prevalence, which is critical with genre-unbalanced data such as ancient languages corpora. The value of incorporating genre information in the model goes beyond literary corpora and historical language data and can be applied to recent data spanning over a period of time where text type information is critical, for example in specialized domains. Explicitly modelling genres also makes it possible to address a number of additional questions, revealing the genre most likely associated to a given sense, the most unusual sense for a genre, and which genres have the most similar senses. Naturally, this framework can be applied to other kinds of categorical metadata about the text, such as author, geography, or style.
Ancient Greek is an insightful test case for several reasons. First, Ancient Greek words tend to have a particularly high number of different senses (Clarke, 2010), and the extant corpus of Ancient Greek texts displays a large number of literary genres. Second, we can use data spanning over several centuries. Third, Ancient Greek scholarship provides high-quality data to validate automatic systems. Top-quality transcribed Ancient Greek texts are available, thus eliminating the need for OCR correction.
Finally, polysemous words are particularly sensitive to register variation and the distribution of senses can vary greatly across registers (Leiwo et al., 2012). As most extant texts are literary and relatively conservative from a linguistic perspective, we expect genre and register to play a significant role in the variation of sense distributions in polysemous words. The word mus, for instance, can mean 'mouse', 'muscle', or 'mussel'. As Figure 1 shows, the distribution of 'muscle' over time (light blue bars) closely follows the distribution of this word in technical genres over time (red line), suggesting that the effect of genre should be incorporated in semantic change models.

Related work
Semantic change in historical languages, especially on a large scale and over a long time period, is an under-explored, but impactful research area. Previous work has mainly been qualitative in nature, due to the complexity of the phenomenon (cf. e.g. Leiwo et al. (2012)). In recent years, NLP research has made great advances in the area of semantic change detection and modelling (for an overview of the NLP literature, see Tang (2018)), with methods ranging from topic-based models (Boyd-Graber et al., 2007; Wijaya and Yeniterzi, 2011; Frermann and Lapata, 2016), to graph-based models (Mitra et al., 2014, 2015; Tahmasebi and Risse, 2017), and word embeddings (Kim et al., 2014; Basile and McGillivray, 2018; Kulkarni et al., 2015; Hamilton et al., 2016; Dubossarsky et al., 2017; Tahmasebi, 2018; Rudolph and Blei, 2018). However, such models are purely based on words' lexical distribution information (such as bag-of-words) and do not account for language variation features such as text type because genre-balanced corpora are typically used.
With the exception of Bamman and Crane (2011) and Rodda et al. (2016), no previous work has focussed on ancient languages. Recent work on languages other than English is rare but exists: Falk et al. (2014) use topic models to detect changes in French, whereas Cavallin (2012) and Tahmasebi (2018) focus on Swedish, with the comparison of verb-object pairs and word embeddings, respectively. Zampieri et al. (2016) use SVMs to assign a time period to text snippets in Portuguese, and Tang et al. (2016) work on Chinese newspapers using S-shaped models. Most work in this area focusses on simply detecting the occurrence of semantic change, while Frermann and Lapata (2016)'s system, SCAN, takes into account synchronic polysemy and models how the different word senses evolve across time.
The work we present bears important connections with the topic model literature. The idea of enriching topic models with document-specific author meta-data has been explored in Rosen-Zvi et al. (2004) for the static case. Several time-dependent extensions of Bayesian topic models have been developed by the machine learning community, with a number of parametric and nonparametric approaches (Blei and Lafferty, 2006;Rao and Teh, 2009;Ahmed and Xing, 2012;Dubey et al., 2013;Perrone et al., 2017). In this paper we transfer such ideas to the semantic change domain, where each datapoint comes in the form of a bag of words associated to a single sense (rather than a mixture of topics). Excluding cases of intentional ambiguity, which we expect to be rare, we can safely assume that there are generally no ambiguities in a context, and each word instance maps to a single sense.

The model
We start with a lemmatized corpus which has been pre-processed into a set of text snippets, each containing an instance of the target word. Every snippet corresponds to a fixed-size window W, e.g. a set of 5 words to the left and to the right of an instance of the target word. The inferential task is to detect the sense associated to the target word in the given context, and describe the evolution of all sense proportions over time.
The generative model for GASC is presented in Algorithm 1 and illustrated by the plate diagram in Figure 2. First, suppose that throughout the corpus the target word is used with K different senses, where we define a sense at time t as a distribution $\psi^{t}_{k}$ over words from the dictionary. Based on the intuition that each genre is more or less likely to feature a given sense, we assume that each of G possible text genres determines a different distribution over senses. Each observed document snippet is then associated with a genre-specific distribution over senses $\phi^{t}_{g_d}$ at time t, where $g_d$ is the observed genre for document d. Crucially, conditioning on the observed genre we have a specific distribution over senses, which accounts for genre-specific word usage patterns. On the other hand, to make sure senses can be uniquely identified across genres, we associate each sense to the same probability distribution over words for all genres. We let word and sense distributions evolve over time with changes drawn from a Gaussian, ensuring smooth transitions. The degree of coupling between sense probabilities over time is controlled by $K_\phi$, the sense probability precision parameter, so that the larger $K_\phi$, the stronger the coupling between the sense probabilities over time. We place a Gamma prior over $K_\phi$ with hyperparameters a and b, and infer $K_\phi$ from the data. We fix $K_\psi$, the word probability precision parameter.
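Algorithm 1 is not reproduced in this text, so the following is only a minimal forward-simulation sketch of the generative story described above. Several choices are assumptions made for illustration: distributions evolve as a Gaussian random walk in logit space mapped through a softmax, the Gamma prior is parameterised with b as a rate, and genre and time are sampled uniformly only to produce toy data (in the model they are observed metadata). All function and variable names are hypothetical.

```python
import math
import random


def softmax(xs):
    """Map real-valued scores to a probability vector."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def random_walk(dim, n_steps, precision, rng):
    """Gaussian random walk in logit space; higher precision means smaller steps."""
    sd = 1.0 / math.sqrt(precision)
    path = [[rng.gauss(0.0, 1.0) for _ in range(dim)]]
    for _ in range(n_steps - 1):
        path.append([x + rng.gauss(0.0, sd) for x in path[-1]])
    return path


def generate_corpus(T, K, G, vocab, n_snippets, window, a, b, k_psi, seed=0):
    rng = random.Random(seed)
    # Sense precision drawn from Gamma(a, b); b is treated as a rate (assumption).
    k_phi = rng.gammavariate(a, 1.0 / b)
    # phi[g][t]: distribution over the K senses for genre g at time t.
    phi = [[softmax(v) for v in random_walk(K, T, k_phi, rng)] for _ in range(G)]
    # psi[k][t]: distribution over words for sense k at time t, shared by all genres.
    psi = [[softmax(v) for v in random_walk(len(vocab), T, k_psi, rng)]
           for _ in range(K)]
    snippets = []
    for _ in range(n_snippets):
        t = rng.randrange(T)  # time slice (observed in real data)
        g = rng.randrange(G)  # genre (observed in real data)
        z = rng.choices(range(K), weights=phi[g][t])[0]  # sense given genre and time
        words = rng.choices(vocab, weights=psi[z][t], k=2 * window)  # context words
        snippets.append({"time": t, "genre": g, "sense": z, "words": words})
    return phi, psi, snippets


phi, psi, snippets = generate_corpus(
    T=3, K=2, G=2, vocab=["a", "b", "c"], n_snippets=5,
    window=5, a=7, b=3, k_psi=10,
)
```

The sketch makes the key structural point visible: `phi` is indexed by genre while `psi` is not, so senses keep the same word distribution across genres and only their prevalence is genre-specific.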
The model can be applied to different inferential goals: we can focus on the evolution of the sense probabilities or on the changes within each sense. For each of these aims, we can use several combinations of the hyperparameters a and b, which determine the prior distribution from which $K_\phi$ is drawn, and $K_\psi$. Specifically, we consider the following three settings. Setting 1: a = 7, b = 3, $K_\psi$ = 10, as in Frermann and Lapata (2016). Setting 2: a = 7, b = 3, $K_\psi$ = 100. This setting aims at enforcing less variation within senses over time. Setting 3: a = 1, b = 1, $K_\psi$ = 100. This still keeps the bag of words stable for each sense, but also induces less smoothing of the sense probabilities over time. Setting 3 allows the probabilities to vary widely from one century to another. We also expect the high value of $K_\psi$ to reduce the likelihood of dramatic changes within the same sense across contiguous time periods and also to favour the emergence of new senses. If not otherwise specified, we use Setting 3. Finally, note that an extra parameter of the model is the window size W, namely the number of words surrounding an instance of the target. While larger values increase the range of dependencies that can be captured by the model, this tends to introduce noise as wider windows can include irrelevant contextual words.

Inference
For posterior inference we extend the blocked Gibbs sampler proposed in Frermann and Lapata (2016). Specifically, the full conditional is available for the snippet-sense assignment, while to sample the sense and word distributions we adopt the auxiliary variable approach from Mimno et al. (2008). The sense precision parameters are drawn from their conjugate Gamma priors. For the distribution over genres we proceed as follows. First, sample the distribution over senses $\phi^{t}_{g}$ for each genre g = 1, . . . , G following Mimno et al. (2008). Then, sample the sense assignment conditioned on the observed genre from its full conditional. Writing $z_d$ for the sense assignment of snippet d and $w_d$ for its context words, the generative model implies

$$p(z_d = k \mid w_d, t, g_d, \phi, \psi) \propto \phi^{t}_{g_d, k} \prod_{w \in w_d} \psi^{t}_{k, w}$$

This setting easily extends to sample genre assignments for tasks where, for example, some genre metadata are missing.
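The sense-assignment step can be sketched as follows. Under the generative assumptions of the model, the full conditional of a snippet's sense is proportional to the genre-specific sense probability multiplied by the probabilities of the snippet's context words under that sense. The function names, the data layout, and the toy probabilities are hypothetical, and the sketch assumes all probabilities are strictly positive.

```python
import math
import random


def sense_posterior(words, phi_tg, psi_t):
    """Normalised full conditional over senses for one snippet.

    phi_tg[k]   : probability of sense k for the snippet's genre and time slice
    psi_t[k][w] : probability of word w under sense k at that time slice
    """
    log_scores = [
        math.log(p_sense) + sum(math.log(psi_t[k][w]) for w in words)
        for k, p_sense in enumerate(phi_tg)
    ]
    m = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_sense(words, phi_tg, psi_t, rng=random):
    """Draw a sense index from the full conditional."""
    probs = sense_posterior(words, phi_tg, psi_t)
    return rng.choices(range(len(probs)), weights=probs)[0]


# Toy example with two senses and a two-word vocabulary (made-up numbers).
toy_phi = [0.5, 0.5]
toy_psi = [{"sky": 0.8, "order": 0.2}, {"sky": 0.2, "order": 0.8}]
posterior = sense_posterior(["sky", "sky"], toy_phi, toy_psi)
```

Working in log space avoids underflow when a window contains many context words, which matters once the vocabulary is realistic in size.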

Ancient Greek corpus
We used the Diorisis Annotated Ancient Greek Corpus (Vatri and McGillivray, 2018), which consists of 10,206,421 words and is lemmatized and part-of-speech-tagged (https://doi.org/10.6084/m9.figshare.6187256). The corpus contains 820 texts spanning between the beginnings of the Ancient Greek literary tradition (8th century BC) and the 5th century AD. The corpus covers a number of Ancient Greek literary and technical genres: poetry (narrative, choral, epigrams, didactic), drama (tragedy, comedy), oratory, philosophy, essays, narrative (historiography, biography, mythography, novels), geography, religious texts (hymns, Jewish and Christian Scriptures, theology, homilies), technical literature (medicine, mathematics, natural science, tactics, astronomy, horsemanship, hunting, politics, art history, rhetoric, literary criticism, grammar), and letters (see Table 1). In technical texts, we expect polysemous words to have a technical sense. On the other hand, in works more closely representing general language (comedy, oratory, historiography) we expect the words to appear in their more concrete and less metaphorical senses; in a number of genres such as philosophy and tragedy, we cannot assume that this distribution holds. Whilst genre-annotated corpora are not especially common in NLP, where most tasks rely on specific genres (e.g. Twitter) or on genre-balanced corpora such as COHA (Davies, 2002), they are more prevalent within the humanities, and especially the classics. Additionally, research on automated genre identification has been flourishing for decades (e.g. Kessler et al. (1997)), so the need for genre information in a potential corpus is less of a hindrance than might be thought.

Figure 2: GASC plate diagram with 3 time periods.

Evaluation framework
Evaluating the performance of models tackling lexical semantic change is notoriously challenging. Frameworks are either lacking or focus on very specific types of sense change (Schlechtweg et al., 2018). Exceptions are Kulkarni et al. (2015), Basile and McGillivray (2018) [...] languages), semantic change is so closely related to polysemy, and the corpora so often contain gaps and an uneven distribution of text genres, that it is very hard to find a specific point in time when a new sense emerged in the language. Therefore, it is more appropriate to take a probabilistic approach to modelling sense distribution, and devise an evaluation approach that fits this. Although historical dictionaries and traditional philology do describe the evolution of words' senses across time, they do not necessarily reflect the evidence from corpora on which models can be evaluated, and often only provide insights into the appearance of a new sense, rather than the relative predominance of a word's senses across time. These reasons led us to craft a novel evaluation dataset and framework, which has the advantage of reflecting the data on which the model is evaluated, and allows for a finer-grained evaluation of the predominance of a word's senses across time.

Log-likelihood evaluation
First, we compared GASC against the current state of the art (SCAN) in terms of log-likelihood of held-out data. We chose 50 target words in the corpus that could be identified as polysemous (e.g. the verb legō, whose senses are 'gather' and 'tell') based on expert judgment. 17 words were selected from the technical vocabulary of Greek aesthetics, whose polysemy has been described in the secondary literature (Pollitt, 1974) and 33 words have been manually selected from the highest-frequency lemmas in the Diorisis corpus. The necessity to manually identify suitable words has led us to limit their overall number to 50. For each one of these target words, we randomly divided the corpus into a training (80%) and test set (20%). Results on the 50-word dataset are reported in Section 6.1.

Expert annotation
To evaluate our method against ground truth, we proceeded as follows. First, two Ancient Greek experts determined, for each of three target words (mus 'mouse'/'muscle'/'mussel', harmonia 'fastening'/'agreement'/'stringing (musical scale, melody)', and kosmos 'order'/'world'/'decoration'), the range of its possible senses. This was achieved using the standard scholarly Ancient Greek-English dictionary (Liddell et al., 1996) and existing philological evidence (Pollitt, 1974). The target words were selected (a) from the vocabulary of Ancient Greek aesthetics, which includes numerous terms that are used with an abstract metaphorical sense and have a concrete counterpart in the general vocabulary, and as such it has been the subject of much philological research literature; (b) from high-frequency polysemous terms. In addition, these words are attested in most of the time periods covered by the corpus and across different literary genres. Once the set of possible senses was created, experts manually annotated the whole corpus by tagging the senses of the target words in context, and we plan to publish this dataset in the future. Table 2 shows an example from the annotated dataset for the word kosmos.
The annotators also marked the cases in which the semantic annotation was purely based on the corpus context of the target words, which is the evidence base on which the model can rely (category "collocates"). Only the annotations that were based on collocates were retained in the evaluation.

date | genre | author | work | target word | sense id
-335 | Technical | Aristotle | De Mundo | kosmos | kosmos:world

Table 2: Example from annotated dataset displaying a sentence containing the target word kosmos and its expert-assigned sense 'world'. The date of the text is given as a negative number because it refers to the year 335 BC. This instance refers to the sentence Tou de sumpantos ouranou te kai kosmou sphairoeidous ontos kai kinoumenou kathaper eipon "The whole of the heaven, the whole cosmos, is spherical, and moves continuously, as I have said".
Using this information, the relative frequency of each sense usage for each target word in any time slice becomes computable, and was used to create ground-truth data on the diachronic predominance of a word's senses as reflected in the corpus.
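As an illustration of how such ground-truth proportions can be derived, the sketch below groups annotated occurrences by century and normalises the sense counts within each slice. The function name, the input layout of (year, sense) pairs, and the toy annotations are hypothetical; the century-sized slice follows the paper's references to variation across centuries.

```python
from collections import Counter, defaultdict


def sense_proportions(annotations, slice_years=100):
    """Relative frequency of each sense per time slice.

    annotations: iterable of (year, sense_label) pairs, with BC years negative.
    Returns {slice_index: {sense_label: proportion}}.
    """
    counts = defaultdict(Counter)
    for year, sense in annotations:
        # Floor division keeps BC years in the right slice: -335 // 100 == -4.
        counts[year // slice_years][sense] += 1
    return {
        time_slice: {sense: n / sum(counter.values()) for sense, n in counter.items()}
        for time_slice, counter in counts.items()
    }


# Made-up annotations for illustration only.
toy_annotations = [(-335, "world"), (-350, "world"), (-310, "order"), (120, "order")]
proportions = sense_proportions(toy_annotations)
```

In practice only the occurrences annotated as "collocates" would be passed in, since those are the ones retained for evaluation.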
On the other hand, the expert annotation provides lists of occurrences of the target words in their corpus context, each associated to a sense label. In the example displayed in Table 2, the sense label is 'kosmos-world' and we can associate lemmas such as ouranos 'sky' and sphairoeides 'spherical' to this sense because these lemmas occur in the corpus context of this target word.
In order to evaluate the model's output against the expert annotation, we need a way to automatically match the lists of words associated to each sense by the model to the sense labels assigned by the annotators. To achieve this, we devised the following sense-labelling strategy that matches the word senses assigned by the annotators (denoted by s) with the senses outputted by the model (denoted by k).
First, we aimed to measure how closely each model's sense k matches each expert sense s. We assigned a confidence score to every possible (k, s) pair by relying on the words associated to k in the model's output and the words co-occurring with the target word in a given expert-assigned sense. In the example for kosmos, for k = 0 we compare words from the model output, such as ouranos 'sky' and sphairoeides 'spherical', with words from the context words of the annotated sentences, such as aêr 'air', gê 'earth', and ouranos 'sky'. We therefore considered two elements. For the words from the model output, we consider the normalized probability with which these words $w_i$ are associated to the model sense k, i.e. $P(w_i \mid k)$. In the example for kosmos, aêr 'air' is associated to probability 0.069, gê 'earth' and ouranos 'sky' to 0.033. For the context words from the annotated dataset, we consider the degree by which these words are associated to an expert sense. In the example of kosmos from Table 2, this is calculated based on how many different senses a context word like ouranos 'sky' or sphairoeides 'spherical' is associated to. To measure this degree of association we define the expert score $m(w_i, s)$ of a word $w_i$ as 1 divided by the number of senses assigned by the experts to this word. If the word is associated to only one sense s in the annotated dataset, its expert score $m(w_i, s)$ will be highest (1). If a word is not assigned to the sense s by the experts, its expert score $m(w_i, s)$ is 0.
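The expert score defined above can be computed directly from the annotated contexts. The following is a minimal sketch; the function name, the input layout, and the toy contexts (including the sense label 'kosmos-order' and the context word taxis) are made up for illustration.

```python
from collections import defaultdict


def build_expert_score(annotated_contexts):
    """Return a function m(word, sense) implementing the expert score.

    annotated_contexts: iterable of (sense_label, context_words) pairs.
    """
    senses_of = defaultdict(set)
    for sense, words in annotated_contexts:
        for word in words:
            senses_of[word].add(sense)

    def m(word, sense):
        word_senses = senses_of.get(word, set())
        if sense not in word_senses:
            return 0.0  # word never seen in a context labelled with this sense
        return 1.0 / len(word_senses)  # 1 divided by the number of senses of the word

    return m


# Hypothetical toy contexts.
toy_contexts = [
    ("kosmos-world", ["ouranos", "sphairoeides"]),
    ("kosmos-order", ["ouranos", "taxis"]),
]
m = build_expert_score(toy_contexts)
```

With these toy contexts, a word seen under a single sense scores 1, a word shared by two senses scores 0.5 for each, and a word never seen under a sense scores 0 for it.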
The formula for the confidence score of a pair of model sense k and expert-assigned sense s is as follows:

$$\mathrm{conf}(k, s) = \sum_{i} P(w_i \mid k)\, m(w_i, s)$$

The confidence score is highest when both elements, $P(w_i \mid k)$ and $m(w_i, s)$, are highest for all words. In the extreme cases, $P(w_i \mid k)$ will be 1 if the model estimated that a word $w_i$ is very strongly associated to sense k, and $m(w_i, s)$ is 1 if $w_i$ is only found in contexts labelled as s by the experts. This points to k and s being associated to the same words, and therefore being the same sense. On the other hand, the confidence score is lowest when k and s do not share any words.
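A small sketch of the confidence computation follows. It assumes that the score is the sum over words of the model probability multiplied by the expert score, a reading that is consistent with the description above and with a random baseline of 1 divided by the number of expert senses. The function name and all numbers are hypothetical.

```python
def confidence(word_probs, expert_score, sense):
    """Confidence that a model sense matches an expert sense.

    word_probs   : dict word -> normalised P(word | k) for one model sense k
    expert_score : dict (word, sense) -> m(word, sense); missing pairs count as 0
    sense        : expert sense label s
    """
    return sum(
        prob * expert_score.get((word, sense), 0.0)
        for word, prob in word_probs.items()
    )


# Made-up model output and expert scores for illustration.
toy_word_probs = {"ouranos": 0.5, "ge": 0.3, "taxis": 0.2}
toy_expert_score = {
    ("ouranos", "world"): 1.0,
    ("ge", "world"): 0.5,
    ("ge", "order"): 0.5,
    ("taxis", "order"): 1.0,
}
conf_world = confidence(toy_word_probs, toy_expert_score, "world")
conf_order = confidence(toy_word_probs, toy_expert_score, "order")
```

Under this reading, a word shared by every expert sense contributes equally to all of them, so only words that discriminate between senses move the score away from the baseline.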
In contrast with clustering overlap techniques like purity or the Rand index (which compare sets of words), we use this weighting to ensure that words with a higher model-estimated probability and uniquely associated to a sense (and thus with an expert score of 1) weigh more than other words.
The confidence scores were used to find the best matching pair (k, s): for every expert sense s we selected the sense(s) k for which conf(k, s) was higher than the random baseline (1 divided by the number of expert senses) and higher than the sum of the second and third best confidence scores, when possible. We consider NA as an additional expert sense whenever the expert assigned a sense based on other factors than lexical context. After matching the model-estimated senses to the expert-assigned senses, we calculated precision and recall metrics, to measure if the words associated to a given sense by the model were correct. For every target word and for every matched pair (s, k), we considered a word to be correctly assigned to a sense k by the model if this word also appeared within a 5-word window of the target word in the expert annotation for s. In the example above for kosmos, with k = 0 and s = 'kosmos-world', one such word is ouranos 'sky' because it appears in the model output for k = 0 and in the context window of a sentence labelled as 'kosmos-world' by the annotators. Moreover, we decided to weight every word by the probability that the model assigned to it, so as to take into account the fact that some words are more strongly associated to a sense than others.
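One possible reading of the matching rule is sketched below. It is an assumption that only the top-ranked model sense is kept for each expert sense and that an expert sense stays unmatched when either threshold fails; the wording "sense(s)" and "when possible" leaves room for other interpretations. Names and scores are hypothetical.

```python
def match_senses(conf, n_expert_senses):
    """Match each expert sense to at most one model sense.

    conf: dict expert_sense -> dict model_sense -> confidence score
    Returns dict expert_sense -> matched model sense, or None if no match.
    """
    baseline = 1.0 / n_expert_senses  # random baseline
    matches = {}
    for expert_sense, scores in conf.items():
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        best_k, best_score = ranked[0]
        # Sum of the second and third best scores (fewer if not available).
        runners_up = sum(score for _, score in ranked[1:3])
        if best_score > baseline and best_score > runners_up:
            matches[expert_sense] = best_k
        else:
            matches[expert_sense] = None
    return matches


# Made-up confidence scores for three model senses (0, 1, 2).
toy_conf = {
    "world": {0: 0.65, 1: 0.20, 2: 0.10},
    "order": {0: 0.30, 1: 0.30, 2: 0.25},
}
toy_matches = match_senses(toy_conf, n_expert_senses=3)
```

The second threshold guards against ambiguous matches: a model sense is accepted only when it clearly dominates its closest competitors for that expert sense.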
Therefore, we defined precision as the ratio between the number of words correctly assigned to k, weighted by their respective normalised model-estimated probabilities, and the number of words assigned to k by the model. Note that our precision metric is based on the distributional hypothesis whereby words occurring in similar contexts tend to exhibit similar meanings. We computed this metric after stop word removal, which limited the amount of noise by excluding uninformative contextual words. We fixed the window size W to the same value as in SCAN for all methods to ensure a fair comparison. We have defined precision in terms of the words that are assigned to a sense by the model and that also appear within a 5-word window of the target word in the expert annotation for that sense. The reason for this is that our model, like SCAN, only considers those context words for determining the target word's senses, and that for the evaluation against the ground truth we only retained the cases in which the annotators were able to disambiguate the words based purely on their context. We defined recall as the ratio between the number of all words correctly assigned to k (weighted by their respective probabilities) and the number of words assigned to sense s by the experts (weighted by their expert scores). For each model, the precision and recall scores for each (s, k) pair were averaged and used as the final scores. Since recall directly depends on the number of expert words, the metric can only be used to compare the performance of models for a specific target word. While the proposed assessment method focusses on evaluating dynamic topic models, it can be generalised to any probabilistic model by considering the posterior probability of the gold word sense.
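The weighted precision and recall can be sketched as below for a single matched (s, k) pair. One point is an assumption: the precision denominator is taken here as the probability-weighted count of all words assigned to k, whereas the text could also be read as using an unweighted count. Function and variable names, and all numbers, are hypothetical.

```python
def precision_recall_f1(word_probs, expert_words, expert_score):
    """Weighted precision, recall and F1 for one matched (s, k) pair.

    word_probs   : dict word -> normalised P(word | k) for model sense k
    expert_words : set of words seen within the window in contexts labelled s
    expert_score : dict word -> m(word, s) for the words in expert_words
    """
    # Probability mass of model words that also occur in the expert contexts.
    correct = sum(p for word, p in word_probs.items() if word in expert_words)
    # Assumption: the denominator is weighted by the same model probabilities.
    precision = correct / sum(word_probs.values())
    recall = correct / sum(expert_score[word] for word in expert_words)
    if correct == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up inputs for illustration.
toy_probs = {"ouranos": 0.5, "ge": 0.3, "taxis": 0.2}
toy_expert_words = {"ouranos", "ge", "aer"}
toy_scores = {"ouranos": 1.0, "ge": 0.5, "aer": 1.0}
precision, recall, f1 = precision_recall_f1(toy_probs, toy_expert_words, toy_scores)
```

Because the recall denominator grows with the number of expert context words, values are comparable across models for the same target word but not across target words, as noted above.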

Predictions on held-out words
Considering the 50-word dataset described in Section 5, we evaluated the predictive performance in terms of log-likelihood of held-out data for 3 models: SCAN (not using any genre information), GASC-all (GASC with all the G = 10 available genres) and GASC-narr (GASC with 2 genres, Narrative vs. non-Narrative). Narrative and Technical are the genres with the highest frequency in the corpus for which all the 50 words occurred at least once in the training and test sets, and analogous results are obtained when GASC with Technical vs. non-Technical is used. For each model, we compared the 3 hyperparameter settings previously reported, with higher scores indicating that a model is better at explaining unseen data. Figure 3 shows the predictive log-likelihood scores for a range of values of K, averaging the results over 50 leave-one-out folds. Each time, the log-likelihood scores were averaged under the final 10 samples of the latent variables, out of 1000 MCMC iterations (of which 150 were used as burn-in). On average, GASC-narr consistently outperforms SCAN across every K and for each hyperparameter setting. On the other hand, SCAN exhibits a higher held-out log-likelihood than GASC-all. Exploiting some information on the genre yields better predictions, while using all genres attested in the corpus is not effective as some genres are not sufficiently represented by the data. Figure 3 also shows that the best predictions over unseen data are obtained for K between 10 and 15. Higher K values tend to introduce noisy senses with no improvement for the model output. In addition, Setting 3 proved to work as well as or better than the other settings. In the next section, we fix the model hyperparameters and use a validation set of words that were not part of the 50 targets of this experiment.
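A held-out log-likelihood of this kind could be computed roughly as follows. The sketch assumes that each held-out snippet is scored by marginalising over senses, using the genre-specific sense probabilities and the sense-specific word probabilities, and that scores are averaged over the retained posterior samples. The data layout and names are hypothetical, and probabilities are assumed to be strictly positive.

```python
import math


def snippet_log_likelihood(words, phi_tg, psi_t):
    """log sum_k phi_tg[k] * prod_w psi_t[k][w], computed in log space."""
    logs = [
        math.log(p_sense) + sum(math.log(psi_t[k][w]) for w in words)
        for k, p_sense in enumerate(phi_tg)
    ]
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))


def heldout_log_likelihood(test_snippets, samples):
    """Average held-out log-likelihood over posterior samples.

    test_snippets : list of dicts with keys "time", "genre", "words"
    samples       : list of (phi, psi) draws, with phi[g][t][k] and psi[t][k][word]
    """
    totals = []
    for phi, psi in samples:
        totals.append(sum(
            snippet_log_likelihood(s["words"], phi[s["genre"]][s["time"]], psi[s["time"]])
            for s in test_snippets
        ))
    return sum(totals) / len(totals)


# Toy check with one genre, one time slice, two senses (made-up numbers).
toy_phi = [[[0.5, 0.5]]]
toy_psi = [[{"sky": 0.8, "order": 0.2}, {"sky": 0.2, "order": 0.8}]]
toy_test = [{"time": 0, "genre": 0, "words": ["sky"]}]
score = heldout_log_likelihood(toy_test, [(toy_phi, toy_psi)])
```

For a genre-agnostic model such as SCAN the same function applies with a single genre index, which makes the comparison between models direct.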

Ground truth recovery
We explored the ability to recover ground truth when available. For the word mus, experts annotated 205 instances, of which 198 were assigned to one of the 3 senses 'mouse', 'mussel', and 'muscle'; out of these 198 assignments, 114 were performed based on lexical contextual information only (category 'collocates') and were retained for the evaluation. For harmonia, the number of annotated occurrences was 599, of which 411 were of the type 'collocates'. For kosmos, 1,411 occurrences were annotated, of which 1,406 were assigned to a sense, and in 1,102 cases the annotation was of the type 'collocates'. We identified the genres that manual annotation shows to have the largest effect on the distribution of senses for each target word by calculating Spearman's rank correlation coefficient, for each word sense s, between the frequency f(s) of s across centuries and the frequency f(s,g) of s in each genre g across centuries. Significant correlation between f(s) and any f(s,g) would suggest that variation in the frequency of a word sense across centuries is not due to diachronic change, but to how frequently s is attested in g in each century (and, ultimately, to the amount of texts representing g in each century). Significance (p < 0.05) is reached for the senses and genres shown in Table 3.

Table 4: SCAN vs GASC on mus ('mouse', 'muscle', 'mussel'), harmonia ('abstract', 'concrete', 'musical'), and kosmos ('order', 'decoration', 'world') in terms of precision ('P'), recall ('R'), and F1-score ('F1').

[...] of available data and the size of the correlations, we considered the genres Technical and non-Technical for mus and harmonia, and both Technical and non-Technical and Narrative and non-Narrative for kosmos (see Table 1). These target words were selected as examples of polysemous words (a) exhibiting a range of clearly distinct senses (such as mus, whose three senses are strikingly diverse), (b) attested in most, if not all, the time periods covered by the corpus, and (c) attested across a number of literary genres. As expert annotations of semantic change in Ancient Greek corpora are virtually unavailable, this particular choice of targets also allowed us to leverage ground truth for validation.
We compared the performance of SCAN with GASC and GASC-independent, namely a simpler version of GASC that fits an independent model to each collection of documents sharing the same genre, so that parameters and senses are inferred independently across genres (while in GASC senses are shared but their probability distributions are independent across genres). The comparison was carried out with two approaches: 1) a comparison of the word senses across time against the expert-annotated data, 2) precision, recall, and F1 scores (the harmonic mean of precision and recall) to determine how closely the words assigned to a sense by the model match the words assigned to a sense by the experts. Figure 4 compares the time distribution of the senses of kosmos in the expert annotation (left) and as output by GASC run on Narrative vs. non-Narrative (right). For every matched (k, s) pair, we computed precision, recall, and F1 scores. For GASC, the reported values are the averages of precision, recall, and F1 score over {Technical, non-Technical} for mus and harmonia and over {Narrative, non-Narrative} for kosmos. The results reported in Table 4 indicate that, for the targets that are sufficiently represented in the corpus, incorporating genre information leads to a greater ability to recover the ground truth.

Discussion
We introduced GASC, a Bayesian model to study the evolution of word senses in ancient texts. Crucially, we performed this analysis conditional on the text genre, demonstrating that the ability to harness genre metadata addresses a fundamental challenge in disambiguating word senses in Ancient Greek. In experiments we showed that GASC is able to provide interpretable representations of the evolution of word senses, and achieves improved predictive performance compared to the state of the art. Further, we established a new framework to assess model accuracy against expert judgment. To our knowledge, no previous work has systematically compared the estimates from a statistical model to manual semantic annotations of ancient texts. This work can be seen as a step towards the development of richer evaluation schemes and models that can embed expert judgments. Future work in this direction could encode more structured cross-genre dependencies, or allow for change points triggered by exogenous forces such as historical events.