Profiling of Intertextuality in Latin Literature Using Word Embeddings

Identifying intertextual relationships between authors is of central importance to the study of literature. We report an empirical analysis of intertextuality in classical Latin literature using word embedding models. To enable quantitative evaluation of intertextual search methods, we curate a new dataset of 945 known parallels drawn from traditional scholarship on Latin epic poetry. We train an optimized word2vec model on a large corpus of lemmatized Latin, which achieves state-of-the-art performance for synonym detection and outperforms a widely used lexical method for intertextual search. We then demonstrate that training embeddings on very small corpora can capture salient aspects of literary style and apply this approach to replicate a previous intertextual study of the Roman historian Livy, which relied on hand-crafted stylometric features. Our results advance the development of core computational resources for a major premodern language and highlight a productive avenue for cross-disciplinary collaboration between the study of literature and NLP.


Introduction
In "Lonesome Day Blues," Bob Dylan sings, "I'm gonna spare the defeated...I am goin' to teach peace to the conquered / I'm gonna tame the proud." This lyric echoes a passage from Vergil's ancient Latin epic, the Aeneid, as translated by Allen Mandelbaum: "to teach the ways of peace to those you conquer, / to spare defeated peoples, tame the proud" (Thomas, 2012). Such allusions or "intertexts" transmit ideas across space and time, diverse media, and languages. Although researchers focus on those intertextual connections felt to have special literary significance for the works at hand, in principle intertextuality refers to any verbal or semantic resemblance within the literary system, ranging from direct quotation to topical similarities (Kristeva, 1980; Juvan, 2009). Given the importance of intertextual criticism to literary study, computational identification of text reuse in literature is an active area of research (Bamman and Crane, 2008; Forstall and Scheirer, 2019).
Classical Latin literature is a highly influential tradition characterized by an extraordinary density of allusions and other forms of text reuse (Hinds, 1998). The most widely used tools for the detection of Latin intertextuality, such as Tesserae and Diogenes, rely on lexical matching of repeated words or phrases (Coffee et al., 2012, 2013; Heslin, 2019). In addition to these core methods, other research has explored the use of sequence alignment (Chaudhuri et al., 2015; Chaudhuri and Dexter, 2017), semantic matching (Scheirer et al., 2016), and hybrid approaches (Moritz et al., 2016; Manjavacas et al., 2019) for Latin intertextual search, complementing related work on English (Smith et al., 2014; Zhang et al., 2014; Barbu and Trausan-Matu, 2017). Much NLP research on historical text reuse, including previous applications of Latin word embeddings, has focused on the Bible and other religious texts (Lee, 2007; Moritz et al., 2016; Bjerva and Praet, 2016; Manjavacas et al., 2019). As such, there is a clear need for enhanced computational methods for classical Latin literature. We describe the optimization of word embedding models for Latin and their application to longstanding questions about literary intertextuality.

Evaluation and optimization of word embedding models for Latin
As is typical for many low-resource and premodern languages, development of core NLP technologies for Latin remains at an early stage. Following attempts to train word2vec models on unlemmatized corpora of Latin literature shortly after the method's introduction (Bamman; Bjerva and Praet, 2015) and the inclusion of Latin in large-scale multilingual releases of FastText and BERT (Devlin et al., 2019), in the past year there has been increased interest in systematic optimization and evaluation of Latin embeddings. Spurred by the recent EvaLatin challenge (Sprugnoli et al., 2020), a number of Latin models have been trained for use in lemmatization and part-of-speech tagging (Bacon, 2020; Celano, 2020; Straka and Straková, 2020; Stoeckel et al., 2020), complementing new literary applications to Biblical text reuse and neo-Latin philosophy (Manjavacas et al., 2019). In addition, recent work introduced a synonym selection dataset, based on the TOEFL benchmark for English, which was used to evaluate word2vec and FastText models trained on the LASLA corpus of Latin literature. To the best of our knowledge, there have been no attempts to compare the performance of these models on standard evaluation tasks. To establish a baseline for further language-specific optimization and to inform our research on intertextuality, we evaluate five Latin models for which pretrained embeddings are publicly available. These models encompass a variety of training corpora and methods, including word2vec, FastText, and nonce2vec (Appendix). We consider two tasks involving synonym matching. The first is the selection task from that dataset: distinguishing the true synonym of a Latin word from three distractors (N = 2,759). The second task, which is modeled on one of the English evaluation datasets from Mikolov et al. (2013), involves unrestricted search for the synonyms of 1,910 words found in an online dictionary of Latin near-synonyms (Appendix).
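The four-way selection task reduces to choosing the candidate nearest the target word in embedding space. A minimal sketch, using toy lemma vectors (the vocabulary and vectors here are illustrative, not real model output):

```python
import numpy as np

# Toy lemma embeddings standing in for a trained Latin model.
emb = {
    "magnus": np.array([0.9, 0.1, 0.0]),
    "ingens": np.array([0.8, 0.2, 0.1]),   # near-synonym of magnus
    "parvus": np.array([-0.9, 0.1, 0.0]),
    "aqua":   np.array([0.0, 1.0, 0.2]),
    "ignis":  np.array([0.1, -0.2, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_synonym(target, candidates, emb):
    """Pick the candidate with the highest cosine similarity to the target."""
    return max(candidates, key=lambda c: cosine(emb[target], emb[c]))

# One item of the selection task: the true synonym plus three distractors.
choice = select_synonym("magnus", ["ingens", "parvus", "aqua", "ignis"], emb)
```

Accuracy on the full benchmark is then the fraction of the 2,759 items for which the true synonym is selected.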
In addition, we train word2vec embeddings on a large corpus of Latin compiled from the Internet Archive (Bamman and Crane, 2011;Bamman and Smith, 2012), which we first lemmatize using either the Classical Language Toolkit (Johnson, 2021) or TreeTagger (Schmid, 1994).
The results of the comparative evaluation are summarized in Table 1. For the synonym search task, we consider the number of correct matches found in the top 1, 10, and 25 results by cosine similarity, as well as the mean reciprocal rank (MRR). We find that our models achieve state-of-the-art performance on both tasks compared to the five published models. The improvement in performance may be due to the combination of training on lemmatized text, which prior work identified as an important optimization for Latin, and the use of a lower-quality but much larger training corpus (1.38 billion tokens, compared to 1.7 million tokens in the curated LASLA corpus).
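MRR on the synonym search task averages, over all queries, the reciprocal rank of the first true synonym returned. A small self-contained sketch with hypothetical query results:

```python
def mean_reciprocal_rank(ranked_results, gold):
    """MRR over queries: for each query, take the reciprocal of the rank
    (1-based) of the first correct synonym in the ranked list, or 0 if
    none is returned; then average over all queries."""
    total = 0.0
    for query, results in ranked_results.items():
        for rank, word in enumerate(results, start=1):
            if word in gold[query]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Hypothetical ranked search output for two queries, with gold synonym sets.
gold = {"magnus": {"ingens"}, "pulcher": {"formosus"}}
ranked = {"magnus": ["ingens", "parvus"], "pulcher": ["aqua", "formosus"]}
mrr = mean_reciprocal_rank(ranked, gold)  # (1/1 + 1/2) / 2 = 0.75
```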

Construction of benchmark intertextuality dataset
Despite the enormous number of Latin intertextual parallels recorded in the scholarship, computational research on literary text reuse is hampered by a lack of benchmark datasets. Existing benchmarks tend to focus either on binary comparisons, such as between Vergil and Lucan (Coffee et al., 2012), or on specialized forms of religious intertextuality (Moritz et al., 2016;Manjavacas et al., 2019). To enable validation testing of general NLP methods for intertextual search, we assemble a new benchmark dataset based on Valerius Flaccus' Argonautica, an epic poem dating from the 1st century C.E. which recounts the myth of Jason and the Argonauts. For Book 1 of the Argonautica we record 945 verbal intertexts with four major epics (Vergil's Aeneid, Ovid's Metamorphoses, Lucan's Pharsalia, and Statius' Thebaid) that are noted in the commentaries of Spaltenstein (2002), Kleywegt (2005), and Zissos (2008). Our dataset thus contains a substantial number of intertexts of established literary interest with coverage across Book 1.

Enhanced intertextual search
Several widely used computational search methods for Latin intertextuality rely on lexical matching of related words. We present an alternative approach in which potential intertextual phrases are ranked using word embeddings. In this method, we compare a bigram of interest to all bigrams in another text, subject to the constraint that the distance between the words does not exceed a fixed interval. The interval parameter is the number of words occurring between the two words comprising the bigram and is usually, but not exclusively, between 0 and 2. The choice of bigrams as the basic unit conforms to ancient poetic practice, in which allusive phrases frequently consist of two words (although they can also be single words or longer phrases), and hence also to modern intertextual search methods such as Tesserae (Coffee et al., 2012, 2013). A key difference in our approach, however, is that bigram pairs may share only one or even zero words in common. The bigrams are drawn from the dataset of commentators' annotations; in cases where commentators note only a single-word intertext or a phrase longer than a bigram, we supplement or select words on a case-by-case basis, giving preference to words that bear a semantic or syntactic similarity to one or more words in the intertext.
The similarity score for a bigram pair is calculated by taking the cosine similarities of the embeddings of the four possible pairs of words across both bigrams, and averaging the highest cosine similarity and the score for the remaining pair of words. The bigram pair flammifero Olympo ("fiery Olympus") and flammifera nocte ("fiery night"), for example, generates the four lemmatized pairs flammifer ∼ flammifer, flammifer ∼ nox, Olympus ∼ flammifer, and Olympus ∼ nox. Hence, the similarity score for the bigram pair is the average of 1.0 for the exact match, flammifer ∼ flammifer, and 0.35 for the other remaining word pair, Olympus ∼ nox (i.e., 0.67). In this way, the similarity score for an intertext noted by the commentators is ranked against all other bigrams in the relevant text, the size of which we set at a single book of poetry (i.e., equivalent to the text on which the dataset is based). Although the choice to use one unit of text rather than another is somewhat arbitrary (one could consider complete works rather than constituent books, for example), the use of single books has several advantages, notably provision of a large but not overwhelming number of comparison phrases while maintaining ancient textual units with distinct episodes and themes.
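The scoring rule above can be sketched directly; the toy lemma vectors below are illustrative (chosen so the Olympus ∼ nox similarity is 0.8, not the 0.35 from the real model):

```python
import numpy as np
from itertools import product

# Toy lemma embeddings; vectors are illustrative, not real model output.
emb = {
    "flammifer": np.array([1.0, 0.0, 0.0]),
    "olympus":   np.array([0.0, 1.0, 0.0]),
    "nox":       np.array([0.6, 0.8, 0.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bigram_score(bigram_a, bigram_b, emb):
    """Score a bigram pair: of the four cross-bigram word pairs, take the
    pair with the highest cosine similarity, then average its score with
    that of the remaining (disjoint) pair of words."""
    sims = {(i, j): cosine(emb[bigram_a[i]], emb[bigram_b[j]])
            for i, j in product(range(2), range(2))}
    (bi, bj), best = max(sims.items(), key=lambda kv: kv[1])
    remaining = sims[(1 - bi, 1 - bj)]
    return (best + remaining) / 2.0

# flammifer ~ flammifer is an exact match (1.0); olympus ~ nox scores 0.8.
score = bigram_score(("flammifer", "olympus"), ("flammifer", "nox"), emb)
```

Ranking an annotated intertext then amounts to computing this score against every candidate bigram in the comparison book.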
Following this approach, we compute a ranking for each of the 945 parallels in the Valerius Flaccus benchmark. For embeddings we use our word2vec model trained on CLTK-lemmatized text, which by MRR performs best in the synonym ranking task (Table 1). The precision@k and recall@k for k = 1, 3, 5, 10, 25, 50, 75, 100, and 250 are summarized in Fig. 1. We next compare our method and the Tesserae search tool, which is regarded as state-of-the-art for Latin intertextual search (Bernstein et al., 2015; Forstall and Scheirer, 2019). Using the public web-based interface, we run Tesserae searches comparing Book 1 of the Argonautica with each of the four texts in the benchmark dataset. Tesserae produces lists of repeated bigrams ranked according to a hand-crafted scoring formula that considers the rareness and proximity of the words in each bigram. For the complete set of Tesserae results, the recall is 33.9%, and the precision is 0.97%; with k = 250, our method achieves a comparable precision (1.4%) but higher recall (82.4%). An important advantage of the Tesserae tool, however, is that it searches for similar phrases in parallel and does not require a list of specific queries as input. As such, the results aggregated for this comparison come from a much smaller number of Tesserae searches than the 945 embedding-based searches we run. For this reason, Tesserae is likely to be more suitable than our method for applications in which the user does not have predetermined phrases of interest.
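One plausible way to compute these metrics, under the assumption that each annotated parallel has a single gold target bigram whose position in the ranked candidate list is recorded:

```python
def precision_recall_at_k(gold_ranks, k):
    """gold_ranks: for each annotated parallel, the 1-based rank of the
    gold bigram among all candidate bigrams in the comparison book.
    With one gold item per query, recall@k is the fraction of parallels
    ranked within the top k, and precision@k is total hits divided by
    total retrieved items (k per query)."""
    hits = sum(1 for r in gold_ranks if r <= k)
    recall = hits / len(gold_ranks)
    precision = hits / (k * len(gold_ranks))
    return precision, recall

# Hypothetical ranks for four parallels.
precision, recall = precision_recall_at_k([1, 4, 12, 300], k=10)
# 2 of 4 parallels fall in the top 10: recall 0.5, precision 2 / 40 = 0.05
```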
A minority of intertexts in the dataset contain no shared lemma and hence present a challenge for existing detection methods based on lexical matching but are recoverable using our search method. The phrases e clausis [antris] ("from the enclosed [cave]," Arg. 1.417) and circum claustra [fremunt] ("[they roar] around the gate," Aen. 1.56), for example, contain no words in common but have similar syntax (the prepositions e and circum) and semantics (words indicating enclosure). Similarly, the phrases Phlegethontis operti ("hidden Phlegethon," Arg. 1.735) and Acherontis aperti ("open Acheron," Theb. 11.150) both refer to rivers in the underworld and contain near-antonymic adjectives. Word embeddings can thus be used to identify intertexts of literary interest in a way that complements existing methods.

Anomaly detection
Computational analysis of literary intertextuality is typically treated as an information retrieval problem, as in the previous section. Here we consider an alternative framework of studying intertextuality through anomaly detection (Forstall et al., 2011). For this approach, we train word embeddings on highly restricted corpora, so that the resulting models capture aspects of authorial style. We use those restricted embeddings as features to predict instances of similarity between authors, which can indicate intertextuality. To illustrate this approach we describe a case study involving Latin historiography and the development of prose style.
In particular, we examine patterns of stylistic influence between the Roman historian Livy, his source material, and other Latin prose literature. As assessment of similarities in literary style is inherently subjective, we consider the task of replicating, using word embeddings, two experiments from a previous computational study of Livy that employed a hand-crafted set of Latin stylometric features such as syntactic markers and function words. Our approach to evaluation of a subjective task is thus similar to that of Bamman et al. (2014), who tested a set of preregistered hypotheses about literary characters.
Like most historical writing, Livy's monumental history of Rome drew on a wide range of source material, such as earlier historiography and political speeches, most of which is no longer extant. The extent to which Livy cited these earlier sources, and their influence on Livy's compositional practice, remain important open questions for ancient historians. Previous work demonstrated that anomaly detection could be used to distinguish a database of 439 putative citational passages from the remainder of Livy. To replicate this analysis, we train a word2vec model on all of Livy's surviving history and use the embeddings as input for a one-class support vector machine (SVM). Following the earlier study, we set the detection rate of the one-class SVM to 20% and train on a random selection of 30,000 5-sentence passages of Livy. We find that the one-class SVM labels 38.2 ± 0.8% of passages from the citation database as anomalous, compared to 18.4 ± 2.0% of a validation set with 439 passages of general Livy (mean and standard deviation from N = 3 runs). These results provide further evidence that citational passages of Livy exhibit an anomalous writing style, whether due to source use or stylistic modulation, corroborating the earlier analysis. Finally, we consider the stylistic similarity of Livy to 17 other works of Latin literature analyzed in the same study. Again using a one-class SVM trained on Livy, we predict the "Livianess" of each work (Fig. 2). Our results confirm the major trends identified by the prior stylometric analysis, including the expected dissimilarity to Livy of the verse texts and the consistent similarity of contemporary and early imperial historiography. The primary difference between the two sets of results is that the stylometric features indicate greater similarity between Livy and non-historiographical prose, such as Augustine's Confessions and Vitruvius' De architectura, than do word embeddings, which may reflect a relative lack of shared diction.
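The anomaly-detection setup can be sketched with scikit-learn's OneClassSVM; the random feature vectors below are stand-ins for passage-level embedding features (in the study, each 5-sentence passage of Livy is represented via its word embeddings), and nu=0.2 corresponds to the 20% detection rate described above:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Illustrative stand-ins for embedding-based passage features:
# rows = passages, columns = feature dimensions.
livy_passages = rng.normal(0.0, 1.0, size=(500, 50))
candidate_passages = rng.normal(0.5, 1.0, size=(100, 50))

# Train the one-class SVM on "normal" Livy; nu sets the detection rate.
clf = OneClassSVM(nu=0.2, kernel="rbf", gamma="scale").fit(livy_passages)

# +1 = consistent with Livy's style, -1 = anomalous.
preds = clf.predict(candidate_passages)
anomalous_rate = float(np.mean(preds == -1))
```

The fraction of a held-out set flagged as anomalous (here `anomalous_rate`) is the quantity compared against the 20% baseline in the text.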

Conclusions
We present an empirical analysis of Latin intertextuality using word embedding models. In addition to its specific contributions to literary criticism and the digital humanities, our work makes several methodological advances of interest to the broader NLP community. We conduct a comparative evaluation of Latin word embedding models for two synonym matching tasks and report an optimized model that achieves state-of-the-art performance, which we apply to intertextual search of Latin poetry. By capturing similarities other than exact repetition of words and phrases, our method complements existing search tools, such as Diogenes and Tesserae. Given the diversity and complexity of references employed by Latin authors, taking a multifaceted approach is essential to the computational study of Latin intertextuality. Although our initial work focuses on static embeddings, one potential avenue for improving our search method would be to leverage context-aware embeddings such as multilingual or Latin BERT (Devlin et al., 2019; Bamman and Burns, 2020). In addition, we illustrate how intertextuality can be studied using anomaly detection, and we replicate previous stylometric research about the Roman historian Livy, which was informed by domain knowledge, using an unsupervised approach. We hope that this work will strengthen cross-disciplinary collaboration between classics, the digital humanities, and NLP.