Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings

The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17th-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing into a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices, which indicate style. The focus of the paper is on the latter: tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques, such as clustering and relative entropy, for meaningful aggregation of data and diachronic comparison.


Introduction
Word embeddings are by now a well-established instrument for exploring and comparing corpora in terms of lexical fields and semantic richness (Lenci, 2008). More recently, diachronic word embeddings have been successfully applied to investigate lexical semantic change (e.g. Jatowt and Duh (2014); Hamilton et al. (2016a); Hellrich and Hahn (2016); Fankhauser and Kupietz (2017); Hellrich et al. (2018)). We supplement this line of work, using diachronic word embeddings for the analysis of change in grammatical use, potentially indicating shifts in style/register. Word embeddings reflect shared usage contexts not only of lexical words but also of grammatical words. By grammatical words we understand function words (determiners, conjunctions, etc.) as well as some other specific word forms, such as wh-pronouns or ing-forms of verbs. Typically, the latter are poly-functional (e.g. verbal ing-forms can be gerunds, participles or markers of the present continuous). Function words are high-frequency words and are affected by change only in the long term (e.g. by becoming clitics or bound forms), while lexical words, typically in the lower frequency band, tend to change (meaning) fast. If pressure arises for grammar to change (e.g. towards more economical expression), it will likely affect the poly-functional word forms first, which can spread to new syntagmatic environments or attract new lexemes and extend paradigmatically (like lexical words, unlike function words). To capture such developments, we employ diachronic word embeddings with visualization of word clusters on a diachronic axis, combined with other exploratory techniques such as clustering and relative entropy. For instance, the spread of a word/word form will result in the word moving in the overall embedding space, while paradigmatic extension will result in locally more populated, denser sub-spaces.
Comparing lexical words, function words and poly-functional word forms, we inspect the overall topology of the embedding space over time as well as capture the internal composition of (selected) individual sub-spaces.
As a data set we use the Royal Society Corpus (RSC) (Kermes et al., 2016), a diachronic corpus of the Philosophical Transactions and the Proceedings of the Royal Society of London, which includes text material that is linguistically well explored in terms of style, register and diachrony (e.g. Biber and Finegan (1997); Atkinson (1999); Banks (2008); Degaetano-Ortlieb et al. (2018)).
Following related work (Section 2), we present our data and methods (Section 3). In Section 4, we analyze the embedding space in terms of change in overall topology as well as changes in selected clusters. Zooming in on ing-forms, we also microinspect their (changing) syntagmatic contexts. We conclude with a summary and future work directions (Section 5).

Related work
Quantitative corpus-based approaches to language change (e.g. Hilpert (2006); Geeraerts et al. (2011); Sagi et al. (2011); Hilpert and Gries (2016)) share the basic assumption that language use is governed by statistical properties of lexical and grammatical items. In recent years, distributional semantic approaches based on word embeddings, often combined with clustering, capture this assumption in a bottom-up fashion, making it possible to model the semantic similarity of words from corpora. Approaches such as word2vec (Mikolov et al., 2013) and SVD PPMI (Levy and Goldberg, 2014; Levy et al., 2015) trained on corpora covering several time spans allow us to investigate changes in the semantic usage of lexical items over time (Jatowt and Duh, 2014; Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016a; Hellrich and Hahn, 2016). To also capture syntactic information, approaches have been developed that account for word order based on structured skip-gram models (Ling et al., 2015) and clustering of the model output (Dubossarsky et al., 2015; Fankhauser and Kupietz, 2017).
Particularly targeted at the digital humanities as well as socio-historical corpus-linguistics are approaches which also allow meaningful ways to inspect the data. For instance, Hellrich et al. (2018) provide a visualization website (JeSemE) to inspect change in word meaning over time by means of line and bar plots considering different comparative parameters (word similarity, word emotion, typical context, and relative frequency); Fankhauser and Kupietz (2017) provide a visualization of change in the distributional semantics of words combined with their relative frequency over time.

Data
As a data set we use v4.0 of the Royal Society Corpus (RSC), containing the publications of the Philosophical Transactions and Proceedings of the Royal Society of London from 1665 to 1869 (ca. 32 million tokens and 10,000 documents). The RSC contains various types of metadata (e.g. author, publication date, text title) and linguistic annotations (e.g. lemma, parts of speech, sentence boundaries).

Diachronic word embeddings
For computing word embeddings on a diachronic corpus, we follow the approach of Fankhauser and Kupietz (2017), based on the structured skip-gram method described in Ling et al. (2015), with a one-hot encoding of words as input layer, a 200-dimensional hidden layer, and a window of [-5,5] as the output layer. Importantly, as this approach takes word order into account, it captures grammatical patterns in word usage. Word embeddings are calculated for each decade of the RSC. The embeddings for the first decade are initialized with a first-run training on the whole corpus, and subsequently refined for each of the 20 decades taken into consideration (1670-1860). The vocabulary of the models consists of a total of 117,165 100-dimensional points. The vocabulary consists only of "spaced" tokens (i.e. tokens divided by space or punctuation in the original text). Multiword expressions and phrases are not taken into account, to keep the modelling as agnostic as possible about the content of the corpus. The models were trained on non-lemmatized text. For interpretability, Fankhauser and Kupietz apply dimensionality reduction using t-Distributed Stochastic Neighbor Embedding (van der Maaten and Hinton, 2008). Finally, a dynamic, interactive visualization of the resulting embeddings is provided which covers two crucial factors involved in diachronic change: frequency (encoded by colour: shades of violet-blue for decreasing frequency, shades of red-orange for increasing frequency) and similarity in context of use (encoded by proximity in space). For an example see Figure 1a. This allows us to explore changes in word use as shown in Section 4. As in most studies of distributional semantics, we use cosine distance to compute the similarity between words in the space.
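Since cosine distance underlies all comparisons in the paper, a minimal sketch may help; the toy vectors below stand in for trained embeddings and are not from the corpus:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance, as used to compare words in the space."""
    return 1.0 - cosine_similarity(u, v)

# Toy vectors standing in for trained embeddings.
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 1.0])  # identical context -> similarity 1.0
w = np.array([0.0, 1.0, 0.0])  # orthogonal context -> distance 1.0
```

Note that cosine distance depends only on vector direction, not magnitude, which is why it is the standard choice for comparing embedding vectors whose norms correlate with word frequency.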

Investigating change in grammatical use
The large majority of studies performed on diachronic corpora through embedding spaces focuses on lexical semantics: to analyze changes in the distance between specific words over time (Szymanski, 2017), to infer semantic changes between specific categories of words, e.g. words referring to specific objects or concepts (Recchia et al., 2016), or to model the development of new terms with respect to the existing "neighborhoods" to infer their emergent semantic profile (Gangal et al., 2017).
But embedding spaces can be used to go beyond the study of change in lexical meaning (Jenset, 2013; Perek, 2016; Lenci, 2011), as they capture, to varying degrees, both paradigmatic and syntagmatic properties of words. The same methods used for lexical words can be applied to grammatical words (as defined in Section 1): measuring the distance of individual words from their neighbours, mapping their evolution from their original position in the space, nearest-neighbour similarity, similarity to other specifically selected words, etc. Operating on grammatical words in the same way in which we traditionally operate on lexical words can yield interesting observations, just as it does for lexical words. Poly-functional grammatical words, such as ing-forms, sit at the boundary between lexis and grammar and are therefore particularly interesting, because they can give us insights into the interplay between lexis and grammar. We outline here two main phenomena pertaining to the interplay between lexical semantics and grammatical function in distributional spaces: (1) diachronic expansion of the space; (2) diachronic clustering of poly-functional words, with ing-forms as an exemplary case.
Considering (1), we measure average distances of lexical and function words as well as poly-functional word forms. Average distance is the average of the mean distances of each word from the rest of the vocabulary. In addition, we consider the average distance between words within a group (henceforth: inner distance), and the average distance of the group from all other words in the space (henceforth: outer distance). Change in inner distance reflects how much the words remain close to each other or drift apart in meaning/usage. Change in outer distance reflects how much semantically similar words become more isolated from all other words, possibly indicating a trend towards more specialized meaning/usage.
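Inner and outer distance can be sketched as follows, assuming the embeddings are given as NumPy arrays (the helper names and toy data are ours, not from the paper):

```python
import numpy as np

def mean_pairwise_distance(X):
    """Average cosine distance over all pairs of distinct rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T
    n = len(X)
    return float(D[~np.eye(n, dtype=bool)].mean())

def inner_outer_distance(group, rest):
    """Inner distance: average cosine distance within the group.
    Outer distance: average cosine distance from group words to
    all other words in the space."""
    gn = group / np.linalg.norm(group, axis=1, keepdims=True)
    rn = rest / np.linalg.norm(rest, axis=1, keepdims=True)
    inner = mean_pairwise_distance(group)
    outer = float((1.0 - gn @ rn.T).mean())
    return inner, outer

# Toy data: a tight group of similar words versus a distant rest
# of the vocabulary.
group = np.array([[1.0, 0.0], [0.99, 0.14], [0.98, 0.20]])
rest = np.array([[0.0, 1.0], [0.10, 0.99]])
inner, outer = inner_outer_distance(group, rest)
```

Tracking these two quantities per decade yields the trend lines discussed in Section 4.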
Considering (2), we operate in two main steps: (i) We first explore the sub-space of ing-forms to see whether meaningful clusters of verbs can be suspected. We do this by simply looking at verbs that have near neighbours, setting a threshold for what we consider near. Through this very simple procedure, we develop an idea of what kinds of verbs are likely to constitute the clusters we are interested in. (ii) Once we have formed a hypothesis about the structure of the sub-space, we run some widely used clustering algorithms and compare their results with our predictions and interpretations. This two-step procedure arises from the conviction that unsupervised clustering algorithms require a hypothesis about the structure of the data, both to set their parameters and to interpret their results, and that such a hypothesis has to be acquired through an exploration of the space.
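Step (i) amounts to a threshold query over cosine distances. A sketch, with an illustrative function name and toy data of our own:

```python
import numpy as np

def near_neighbours(vectors, words, target, threshold=0.3):
    """Return all words whose cosine distance to `target` falls
    below the threshold -- a simple way to spot candidate clusters."""
    Xn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    t = Xn[words.index(target)]
    dist = 1.0 - Xn @ t
    return [w for w, d in zip(words, dist) if d < threshold and w != target]

# Toy space: "boiling" and "heating" share a usage context,
# "morning" does not.
words = ["boiling", "heating", "morning"]
vectors = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
cluster = near_neighbours(vectors, words, "boiling")
```

Words with many near neighbours under such a query are the candidate cluster centers examined in step (ii).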

Investigating syntagmatic context
For further insights, we inspect the syntagmatic context of selected clusters of ing-forms, extracting the part-of-speech ngrams preceding an ing-form. We then use relative entropy (here: pointwise Kullback-Leibler Divergence (KLD); Kullback and Leibler (1951); Fankhauser et al. (2014); Tomokiyo and Hurst (2003)) to measure how distinctive particular syntagmatic contexts are for particular time periods. This is done for each inspected feature (in our case a syntagmatic context in terms of a part-of-speech ngram, e.g. preposition-noun-ing-verb) comparing two time periods, T1 and T2 (cf. Equation (1)).
Basically, the probability of a feature in time period T1 (p(feature|T1)) is compared to the probability of that feature in time period T2 (p(feature|T2)):

D(feature) = p(feature|T1) log2 ( p(feature|T1) / p(feature|T2) )    (1)
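The pointwise KLD score for a single feature can be computed directly from per-period feature counts; the toy counts below are illustrative, not corpus figures:

```python
import math
from collections import Counter

def pointwise_kld(feature, counts_t1, counts_t2):
    """Pointwise Kullback-Leibler Divergence of a single feature:
    p(feature|T1) * log2(p(feature|T1) / p(feature|T2))."""
    p1 = counts_t1[feature] / sum(counts_t1.values())
    p2 = counts_t2[feature] / sum(counts_t2.values())
    return p1 * math.log2(p1 / p2)

# Toy counts of two POS ngrams in two time periods.
t1 = Counter({"IN-NN-VVG": 30, "VBZ-VVG": 10})
t2 = Counter({"IN-NN-VVG": 10, "VBZ-VVG": 30})
score = pointwise_kld("IN-NN-VVG", t1, t2)  # positive: distinctive for T1
```

A positive score marks a feature as distinctive for T1, a negative score for T2; ranking features by this score yields tables like Table 3.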

Analyses
In the analysis, we inspect (1) changes in the overall topology of the embedding space over time, and (2) the development of ing-forms of verbs.
Topology of the overall embedding space over time

Figures 1a and 1b show the embedding spaces for the RSC's first (1670s) and last (1860s) full decades. Most function words (e.g. the, and, from) are isolated in both decades, indicating their functional status. Lexical words (e.g. verbs, nouns, adjectives), instead, cluster in one large group in the middle. Considering diachronic development, apart from local clusters disappearing altogether (e.g. a cluster of Latin, marked in blue), a visible general trend is the expansion of the overall space into smaller, more spread-out and more separated clusters. Thus, the distance between words seems to increase in general, possibly indicating a process of specialization at word level. We test this for three cases: all words, function words, and two poly-functional word forms (ing- and ed-forms of verbs).
All words. Analysing the spaces diachronically, we find that most lexical words tend to drift further from each other over time. This does not mean that they do not form lexico-semantic clusters, but the average distance of each word both from its nearest neighbours (inner distance) and from every other word in the space (outer distance) increases (see Figure 2). Considering different sets of words in the spaces' vocabulary, we observe the same phenomenon: the average distributional distance tends to increase, both within the group (inner distance) and between the group and the rest of the lexicon (outer distance). Figure 2 shows that this trend is clearly detectable in our spaces, independently of the words' frequencies. It can also be noted that low-frequency words maintain, most of the time, a lower average distance than high-frequency words. We consider this a hint that the expansion of the space is due to specialization: the tail of the frequency curve tends to contain many highly technical words with particularly specialized meanings. While being far from the rest of the vocabulary, these words usually have a small number of very close neighbours, representing those few words that happen to share similar specialized contexts. This is often considered an indication of a single, specialized meaning (Hamilton et al., 2016b). In fact, words with a frequency lower than three in each decade have, on average, one neighbour that is considerably closer than the closest neighbour of highly frequent words (0.84 vs. 0.71 cosine similarity on average). This all leads to the conclusion that the underlying mechanism is lexical specialization.
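The nearest-neighbour statistic used above (the cosine similarity of each word's single closest neighbour) can be sketched as follows; the function name and toy vectors are ours:

```python
import numpy as np

def closest_neighbour_similarity(X):
    """For every word vector, the cosine similarity of its single
    nearest neighbour (self-similarity excluded)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # mask self-similarity
    return S.max(axis=1)

# Toy vectors: the first two words share a near-identical context,
# the third is isolated.
X = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]])
sims = closest_neighbour_similarity(X)
```

Averaging this statistic separately over low- and high-frequency words gives the 0.84 vs. 0.71 comparison reported above.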
Function words. If we compare these general distributional behaviours to the behaviour of only function words (here: determiners, conjunctions and adpositions), we observe an interesting difference: function words show an increasingly "reclusive" tendency. While their outer distance increases (see Figure 3), their inner distance stays stable. In other terms, while the average lexical word in our corpus undergoes a process of contextual specialization, function words do not.

Poly-functional word forms. If lexical words undergo expansion in both directions (inner and outer distance), while function words only show an increase in outer distance, we can assume that the increase in distances is due to the lexico-semantic side of words rather than to their functional-grammatical side. This becomes particularly clear when we look at poly-functional word forms which share a common formal feature (e.g. the suffix -ed) but not a common semantic class. For example, the average inner distance between ed-forms of verbs, while increasing over time (see Figure 3), remains lower than their average outer distance: their grammatical side shows its effect on their distributional behaviour, somehow in tension with their semantic change. Among ing-forms of verbs, the same tension can be observed: the inner and outer distances both increase, but the inner distance remains smaller. Compare also the trends in Figure 3, where the difference between inner and outer distance is immediately evident (outer distance always higher), with those in Figure 2, where no such difference stands out. See also Figure 4 for an illustration of this semantic-grammatical tension.

Tracing the development of ing-forms
We have observed that for poly-functional word forms, which are very much "in between" lexis and grammar, inner distance grows more slowly.
To analyze this phenomenon in more detail, we focus on ing-forms of verbs.

Figure 4: Example of semantic-grammatical tension. Two pairs of verbs undergoing semantic diversification (the left-side verbs become more specialized in meaning). In the lower part of the space, the two verbs differ both semantically and grammatically. In the upper part of the space, the verbs have a growing semantic distance, but their grammatical profile remains similar; thus their distance grows more slowly.

Diachronic frequency distribution of ing-forms
In a first step, to obtain a better understanding of the frequency distribution of ing-verb forms in the RSC, we extract all verbs part-of-speech tagged as "gerunds or present participles" (VVG, VBG, VHG). Verbs with this tag include progressives, but exclude other verbs ending in -ing (e.g. sing, bring) and other parts of speech (e.g. morning, spring). We observe a fairly stable diachronic tendency. In addition, scientific writing is known to use ing-verbs most prominently as gerunds and participles rather than progressives (Biber et al., 1999). Indeed, the progressive form (i.e. BE + ing-verb) is quite infrequent in the RSC overall and declines over time: in the 1860s, there are about 250 occurrences of the progressive per million tokens, out of 13,000 occurrences per million of ing-forms altogether.

Inspecting clusters of ing-forms
We take all ing-forms per decade and consider as a cluster all neighbours closer than a given threshold distance. In this way, we can analyze (1) how close ing-forms are to other words on average, (2) how large their average cluster is (i.e. number of words in a cluster), and (3) how much they tend to cluster with each other (i.e. whether and which ing-forms tend to occur in other ing-forms' neighbourhoods).
To build clusters we use a dynamic threshold, which we set empirically to the decade's average nearest-neighbour distance + 0.05. Thus, for each decade we can see which ing-forms have the highest number of "near" neighbours, and how many large clusters are formed, despite the general expansion of the space. From this exploratory analysis we observe, first, that despite our dynamic threshold, the density (i.e. number of words per cluster) of ing-clusters diminishes over time. We ascribe this effect, like the more general expansion of the space, mostly to the lexical-semantic component of the verbs involved: their meaning becomes more specific, their context more specialized, and thus less overlap between their contexts is observed. At the same time, the words at the center of a cluster (i.e. words with relatively large and close neighbourhoods) appear to belong to three increasingly distinct categories. The most prominent category are so-called academic verbs, such as ascertaining, determining, examining etc., which acquire relatively tight and large neighbourhoods (see Figures 5a and 5b).
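The dynamic-threshold procedure can be sketched as follows (helper names and toy data are ours; the 0.05 margin is the one stated above):

```python
import numpy as np

def pairwise_cosine_distances(X):
    """Full cosine-distance matrix with the diagonal masked out."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T
    np.fill_diagonal(D, np.inf)
    return D

def threshold_clusters(X, words, margin=0.05):
    """Per-word cluster: all neighbours closer than the decade's
    average nearest-neighbour distance plus a fixed margin."""
    D = pairwise_cosine_distances(X)
    thr = D.min(axis=1).mean() + margin  # dynamic threshold
    return {w: [words[j] for j in np.where(D[i] < thr)[0]]
            for i, w in enumerate(words)}

# Toy space: two tight pairs and one isolated word.
words = ["a", "b", "c", "d", "e"]
X = np.array([[1.0, 0.0], [0.999, 0.04],
              [0.0, 1.0], [0.04, 0.999],
              [0.7, 0.7]])
clusters = threshold_clusters(X, words)
```

Because the threshold adapts to each decade's average nearest-neighbour distance, cluster sizes remain comparable across decades despite the overall expansion of the space.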
The complementary analysis of the most frequent neighbours (words that occur most frequently in other words' close neighbourhood) shows the same phenomenon: academic verbs rise in frequency. The two other main categories we observe at the center of large clusters are change-of-state verbs (saturating, diluting, etc.) and motion verbs (passing, falling, etc.).

Clustering specialized vs. broader meanings
Based on the above findings, which gave us a general idea of possible clusters, we can now apply some traditional clustering algorithms to our dataset. We show the results of three algorithms: Affinity Propagation (AP) (Frey and Dueck, 2007), DBSCAN (Ester et al., 1996; Tran et al., 2013), and MiniBatch K-Means (Sculley, 2010; Feizollah et al., 2014). Results are presented in Table 2. Affinity Propagation, much like DBSCAN, does not require a pre-determined number of clusters, i.e. it defines its own number of centroids. While usually seen as an advantage, in our case this can be a flaw: these algorithms tend towards micro-clustering (clustering tight relationships), leading to many small clusters of specialized meanings of ing-verbs. This would probably overshadow the larger and looser clustering resulting from a possible interplay of semantics with the more grammatical classes of ing-verbs. In fact, Affinity Propagation identifies a large number of ing-clusters and, most relevantly, an increasing number of ing-clusters over time. What we see here is lexico-semantic specialization at work: every cluster contains "few" words that are semantically very close, e.g. drawing and tracing, or preceding and foregoing.
DBSCAN does not require a pre-determined number of clusters either, but it does require a fixed threshold and a fixed minimum number of neighbours for points to be considered members of a cluster. While the number of centroids it finds is lower than the number found by Affinity Propagation, it still increases over time.
Unlike the previous two algorithms, MiniBatch K-Means requires a heavier pre-interpretation of the data: we need to know how many clusters we are looking for. While usually seen as a disadvantage, once we have more than an educated guess (thanks to our previous exploration of the data), it can turn into a strength: we can force the algorithm to look beyond the most evident micro-clusters and define a larger subdivision of the space. In fact, once we run the K-Means algorithm on the ing-subspace, setting the number of centroids to 3 (the number of verb classes we observed through our exploration in Section 4.2.2), we obtain results that are very close to our observations. The verbs falling into the three groups increasingly pertain to what we would call academic, change-of-state, and motion verbs (see Table 2, 1860s decade). The centroids determined by the MiniBatch K-Means algorithm for these three clusters grow further apart through time, and especially from the beginning of the 19th century we can detect a growing distributional difference between them.
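The contrast between the three algorithms can be illustrated with scikit-learn on synthetic data (the toy groups below merely stand in for the ing-subspace; parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, DBSCAN, MiniBatchKMeans

# Toy stand-in for the ing-subspace: three loose verb groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(20, 5))
               for c in (0.0, 1.0, 2.0)])

# AP and DBSCAN choose their own number of clusters ...
ap = AffinityPropagation(random_state=0).fit(X)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
# ... while MiniBatch K-Means is forced to find exactly 3.
km = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

n_ap = len(ap.cluster_centers_indices_)  # AP's own cluster count
n_km = len(set(km.labels_))              # fixed at 3 by construction
```

Setting `n_clusters` explicitly is what lets K-Means recover the three broad verb classes rather than the many tight micro-clusters AP tends to return.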

Grammatical classes of ing-clusters
To observe whether the use of these main ing-clusters differs in terms of grammatical class (gerund vs. participle), we further inspect their syntagmatic context. For this, we generate lists of the top 30 verbs derived from the clusters and extract their preceding part-of-speech ngrams to observe how their use varies in syntactic context. Using Kullback-Leibler Divergence, we can inspect which grammatical classes (i.e. gerund vs. participle) are distinctive of later time periods in comparison to earlier time periods for each semantic group of verbs (i.e. academic, change-of-state, motion). Figure 6 shows the frequency distribution of the three clusters across decades in the RSC. Change-of-state verbs (e.g. purifying, warming, cooling) seem to remain relatively stable, showing only a very slight increase. Motion verbs (e.g. passing, extending, running) increase especially after 1820. Verbs belonging to the academic semantic sub-space rise until 1810 and decline afterwards. It seems that the beginning of the 19th century (1810-1840) marks a period of change.
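Extracting the preceding POS ngrams can be sketched as follows; the function name, tag set and example sentence are illustrative assumptions, not the paper's pipeline:

```python
from collections import Counter

def preceding_pos_ngrams(tagged, targets, n=2):
    """Collect the part-of-speech ngrams immediately preceding each
    target ing-form in a POS-tagged sentence.

    `tagged`  -- list of (word, pos) pairs
    `targets` -- set of ing-forms of interest
    `n`       -- length of the preceding POS context
    """
    grams = Counter()
    for i, (word, pos) in enumerate(tagged):
        if word in targets and i >= n:
            context = "-".join(p for _, p in tagged[i - n:i])
            grams[f"{context}-{pos}"] += 1
    return grams

# Toy tagged sentence with one target ing-form.
sent = [("by", "IN"), ("heat", "NN"), ("purifying", "VVG"),
        ("the", "DT"), ("water", "NN")]
grams = preceding_pos_ngrams(sent, {"purifying"})
```

Aggregating such counters per time period yields the feature distributions that the pointwise KLD of Equation (1) then compares.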
Using relative entropy, we compare the part-of-speech ngrams of the three main clusters (academic, motion, and change-of-state verbs in ing-form) for the period preceding the 1810s and the period after the 1840s (i.e. 1660-1810 vs. 1850-1869). Table 3 shows the top five ngrams for each cluster, ranked by KLD. By inspecting the grammatical class of each ngram, we see a clear difference between the academic and the motion clusters: while verbs in the academic ing-cluster are used as gerunds, those in the motion ing-cluster are used as participles. Change-of-state ing-verbs are also most distinctively used as gerunds. This shows that besides capturing semantic relatedness, the diachronic word embeddings also capture grammatical use.

Conclusion
We have presented an analysis of diachronic word embeddings based on a diachronic corpus of English scientific writing. The aim of the analysis has been to trace changes in the embeddings of words with grammatical functions (function words, poly-functional word forms) compared to lexical words. Analyzing the changing topology of the embedding space over time, operating with the notions of inner and outer distance (see Section 3), we were able to show that grammatical words behave differently from lexical words (Section 4). Specifically, we focused on words that have both a lexical meaning and specific grammatical functions, exemplified by ing- and ed-forms of verbs, because such forms seem to be common hosts for short- to mid-term change in language use in scientific language. Here, we showed that ing-forms of verbs form three semantic groups (academic, motion and change-of-state), where change-of-state and academic verbs tend to be gerunds and motion verbs tend to be used as participles.
Methodologically, we showed that diachronic word embeddings are well suited to detect change not only in lexical but also in grammatical use, as well as the interplay of lexis and grammar. Diachronic word embeddings combined with informative visualization and appropriate exploratory techniques (here: clustering and relative entropy) present a powerful tool to investigate changing language use.
In our future work, we plan to inspect other poly-functional words and word forms, such as wh-words, because they seem to be involved in the development of scientific style as well. At the level of lexical words, we plan to analyze the embedding space in terms of domain-specific vocabulary. As mentioned in our analyses in various places, the overall trend in scientific vocabulary is specialization. To form distinctive registers (e.g., the language of chemistry, physics, medicine, etc.), vocabulary needs to become diversified. To track diversification related to register formation is therefore a high priority on our research agenda.