A Stylometry Toolkit for Latin Literature

Computational stylometry has become an increasingly important aspect of literary criticism, but many humanists lack the technical expertise or language-specific NLP resources required to exploit computational methods. We demonstrate a stylometry toolkit for analysis of Latin literary texts, which is freely available at www.qcrit.org/stylometry. Our toolkit generates data for a diverse range of literary features and has an intuitive point-and-click interface. The features included have proven effective for multiple literary studies and are calculated using custom heuristics without the need for syntactic parsing. As such, the toolkit models one approach to the user-friendly generation of stylometric data, which could be extended to other premodern and non-English languages underserved by standard NLP resources.


Introduction
Stylometry, the quantitative analysis of writing style, is a longstanding yet active area of research in literary studies. Traditional applications of stylometry in both classical and modern literary scholarship have focused on authorship attribution and establishing relative chronology (Mosteller and Wallace, 1964; Marriott, 1979; Fitch, 1981; Vickers, 2004; Jockers and Witten, 2010; Stover et al., 2016). In recent years, new digital tools and computational methods, especially machine learning (Long and So, 2016; Dexter et al., 2017), have allowed researchers to address more fine-grained literary critical questions and have also given rise to novel frameworks for literary analysis, such as 'distant reading' and 'macroanalysis' (Moretti, 2013; Jockers, 2013; Piper, 2018; Underwood, 2019).
Much research in computational stylometry has focused on English literature due in part to the rich NLP resources available for the English language, especially high-quality syntactic parsing. NLP resources for many premodern and non-English languages are, by contrast, at an earlier stage of development or entirely lacking. Moreover, many of the academic disciplines studying these languages are smaller than for English, and thus the community of potential developers is correspondingly reduced. These factors suggest the need for user-friendly stylometric tools, which can provide a wide range of literary data for under-resourced languages and are suitable for use by humanists lacking a computational background.
Syntactic parsing, which remains at an early stage of development for Latin, is not a prerequisite for the successful application of computational stylometry to literary problems. Our prior work has shown that custom heuristics can enable extraction of a wide range of features useful for the study of Latin literature, in particular syntactic markers, non-content words, and elements of sound and rhythm (Dexter et al., 2017; Chaudhuri et al., 2018). Here we report development of a point-and-click stylometry toolkit to enable easy generation of such data for a corpus containing almost all major classical Latin texts.
Other recently developed stylometry packages, such as the "stylo" R package and Lexomics, are aimed at audiences with a range of computational expertise (Eder et al., 2016; Drout et al., 2007). These packages, however, have typically been developed for general-purpose application to multiple languages rather than for a single language. Focusing on a single language creates opportunities for targeting language-specific features, which often play a crucial role in literary style.
The need for a point-and-click toolkit is particularly acute in classical studies. Although classical philologists have long applied stylometry to shed light on questions of authorship, relatively few studies have employed digital tools. Exceptions have tended to focus on a restricted set of features, such as relative word frequency (Stover et al., 2016) or average sentence length (Marriott, 1979; Clayman, 1981). Such limitations may be due in part to the absence of an accurate method for syntactic parsing, and in part to a more general lack of collaboration to date between classical philologists and NLP specialists. By improving the accessibility of rich philological data, our toolkit should further promote the adoption of quantitative approaches by literary critics. At the same time, the toolkit bridges the gap between classical studies and research on English, in which computational approaches are more common and are supported by a more extensive technical apparatus.

Toolkit
Our toolkit provides researchers working with Latin literature access to large-scale stylometric data difficult to acquire by non-computational methods and enables humanists without specialist digital training to construct custom datasets.
The design goal for the toolkit is to provide an intuitive and easy-to-use interface hosted in a web browser. The interface is point-and-click and can be used by researchers with no prior programming or NLP experience. Users can choose from over 700 Latin texts, which comprise almost all of the surviving corpus of classical Latin. The texts were originally digitized by the Perseus Digital Library and further developed by the Tesserae Project (Crane, 1996; Coffee et al., 2012). Texts can be selected by author, text, or book (roughly the ancient equivalent of a chapter). Searches can be as fine-grained as examining a single book, or as large-scale as analyzing the entire built-in corpus in one go (Figure 1).
Next, users select the stylometric features to analyze for their chosen corpus. They can run analyses using any combination of the twenty-six features (Figure 2 shows a sample output). The results are displayed on a spreadsheet in the web browser and can be downloaded as a CSV file. In addition, a user can produce simple visualizations (e.g., a bar chart comparing the values of a particular feature across a set of texts) inside the toolkit.
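As a rough illustration of how a downloaded CSV file might be processed further, the following Python sketch loads a feature table and ranks texts by one feature. The layout assumed here (one row per text, a `text` column, one column per feature) and all names and values in the sample are hypothetical placeholders, not the toolkit's documented export format.

```python
import csv
import io

# Hypothetical export: one row per text, one column per feature.
# Column names and values are invented placeholders, not toolkit output.
SAMPLE_CSV = """text,feature_a,feature_b
text_1,0.10,0.25
text_2,0.30,0.05
"""


def load_feature_table(handle):
    """Return {text name: {feature name: float value}} from a CSV file handle."""
    table = {}
    for row in csv.DictReader(handle):
        name = row.pop("text")  # assumed name of the identifier column
        table[name] = {feature: float(value) for feature, value in row.items()}
    return table


def rank_by_feature(table, feature):
    """List text names from the highest to the lowest value of one feature."""
    return sorted(table, key=lambda name: table[name][feature], reverse=True)


table = load_feature_table(io.StringIO(SAMPLE_CSV))
ranking = rank_by_feature(table, "feature_a")
```

For a real download, the `io.StringIO` object would be replaced by an opened file, and the column names adjusted to match the actual export.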
The ease with which the toolkit can be used does not limit its versatility. A user can create a custom corpus of texts preselected from the existing database, which is close to comprehensive for canonical material, or upload texts of their own for analysis. This latter functionality is especially important for understudied texts, such as those produced in Late Antiquity and during the Renaissance, the sum total of which far exceeds the quantity of extant classical Latin. While digital versions are available for many post-classical texts, for the most part the later periods are not well served by the prominent tools or repositories in the field, which maintain a classical focus. Our toolkit allows users to analyze any text available in electronic form. Furthermore, if a work is not available online, a user may upload a plain text file or transcribe it directly into the upload interface.

Features
Our feature set comprises twenty-six stylometric features across four broad syntactic and grammatical categories (pronouns and non-content adjectives, subordinate clauses, conjunctions, and miscellaneous, as listed in Table 1) and is described in detail in a previous publication (Chaudhuri et al., 2018). Some features are lexical (e.g., prepositions), while others are syntactic (e.g., sentence length) or address semantic and rhetorical aspects of the texts (e.g., superlatives and interrogative sentences). Taken together, the features offer a rich and diverse, albeit necessarily partial, profile of Latin literary style.
An important aspect of our toolkit is that it does not depend on syntactic parsing, named entity recognition, or other NLP methods that have not been developed fully for classical Latin (Erdmann et al., 2016). We employ three strategies to circumvent current technical limitations. One strategy identifies a feature by a signal n-gram. For instance, all regular superlative adjectives include the n-gram -issim- (e.g., largissimus, "most abundant" or clarissima, "clearest"). As this n-gram is extremely rare outside of superlatives, we could curate a near-comprehensive list of exclusions (e.g., dissimilis, "unlike"). We use a similar strategy to capture the instances of selected gerunds and gerundives, which contain the n-grams -ndus, -ndum, -ndarum, or -ndorum. A third class of features is determined using punctuation (e.g., question marks to assess the frequency of direct interrogative sentences or to filter interrogative pronouns, which have many forms in common with relative pronouns, from relative clause counts).
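The n-gram and punctuation heuristics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's implementation: the tokenizer is a plain lowercase `[a-z]+` match, the exclusion list contains only the one example given above (a usable list would have to be curated), and the function names are hypothetical.

```python
import re

# Illustrative exclusion list: stems that contain "issim" but are not superlatives.
# Only "dissimilis" is taken from the text; a real list would need curation.
EXCLUDED_STEMS = ("dissimil",)

WORD_RE = re.compile(r"[a-z]+")  # simplifying assumption: unaccented Latin script only


def count_superlatives(text):
    """Count words containing the signal n-gram "issim", minus known exclusions."""
    words = WORD_RE.findall(text.lower())
    return sum(
        1 for word in words
        if "issim" in word and not word.startswith(EXCLUDED_STEMS)
    )


def count_direct_questions(text):
    """Count question marks as a proxy for direct interrogative sentences."""
    return text.count("?")


superlatives = count_superlatives("vir clarissimus et largissimus, dissimilis fratri")
questions = count_direct_questions("Quis es? Unde venis? Nescio.")
```

Matching on the stem rather than the full word lets a single exclusion entry cover inflected forms such as dissimilem or dissimiles.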
The precision and recall of each of these heuristics are discussed in detail in Chaudhuri et al. (2018). We emphasize that these approaches are not intended as a substitute for NLP, but rather as a stopgap for philologists until more substantial resources become available for classical languages. We expect that the overall usefulness of the toolkit will increase as our heuristics are rendered obsolete by improvements in part-of-speech tagging and dependency parsing for Latin.
Our features are drawn from a wide array of sources in order to maximize the capture of information pertinent to Latin literary style. Some features, such as prepositions, are inspired by studies of other languages, where they have proven useful for the characterization of genres or subgenres (Jockers, 2013). Most features, however, are based on previous studies of Latin style and are designed to capture aspects specific to the Latin language (Adams, 1972; Adams et al., 2005). For example, atque ("and") followed by a word beginning with a consonant is a stylistic feature that is associated with certain influential figures writing early in the tradition. When later authors employ atque + consonant, they do so either in imitation of these figures specifically, or to recall an archaizing style more generally.
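A feature such as atque + consonant lends itself to a simple word-pair check, sketched below in Python. The details are assumptions made for illustration rather than the toolkit's actual rules: vowels are taken to be a, e, i, o, u only, so consonantal i and u and initial h are not given any special treatment, and punctuation between atque and the following word is ignored.

```python
import re

VOWELS = set("aeiou")  # simplification: consonantal i/u and initial h are not special-cased
WORD_RE = re.compile(r"[a-z]+")


def count_atque_consonant(text):
    """Count occurrences of "atque" directly followed by a consonant-initial word."""
    words = WORD_RE.findall(text.lower())
    return sum(
        1 for current, following in zip(words, words[1:])
        if current == "atque" and following[0] not in VOWELS
    )


hits = count_atque_consonant("arma atque viros, atque omnia atque deos")
```

In the sample phrase, atque viros and atque deos are counted, while atque omnia is not.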

Literary Importance
The stylometric data generated by the toolkit sheds light on a variety of literary problems. The simplest type of analysis involves a single feature calculated across a small number of texts. Past research in Ancient Greek stylometry, for instance, has shown that sentence length constitutes one meaningful difference between the early Homeric hexameter tradition and the Hellenistic tradition, since later writers use longer sentences even as they retain other core aspects such as formulaic language and meter (Clayman, 1981). Figure 3 shows the mean sentence length of most of the surviving classical Latin epics as calculated by the toolkit. Three texts, De Rerum Natura by Lucretius, Astronomicon by Manilius, and the Georgics by Vergil, have noticeably longer sentences on average (mean length >140 characters, compared to <125 characters for the other epics). An attractive explanation for the three anomalous texts is that they are all identified with a subgenre of epic known as "didactic," a specific class which purports to teach its readers philosophy or a specialized technical skill, such as astrology or farming. The sentences are longer plausibly because detailed treatment of intricate philosophical or technical issues requires more complex sentences than typically more straightforward narrative action or direct speech, which represent the principal content of the other epics.
The toolkit also reveals that Latin drama has a higher frequency of personal pronouns than other verse genres, as shown in Figure 4. This is no doubt due to drama's dialogic form: characters speak to each other directly, often employing first ("I" or "we") and second person ("you") pronouns. Many other literary genres primarily employ a narrative structure in which a narrator describes the action. This narrative type often uses third person pronouns ("he," "she," "it"), but rarely uses first or second person pronouns. Accordingly, the frequency of personal pronoun use is higher in drama.
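For readers who wish to see what a mean sentence length in characters involves, the Python sketch below shows one way it could be computed. The choices made here are assumptions for illustration, not the toolkit's documented procedure: sentences are split on periods, question marks, and exclamation marks, and length is counted over the trimmed sentence, including internal spaces and punctuation.

```python
import re

# Assumed sentence terminators; editions punctuate Latin differently,
# so this boundary rule is a simplification.
SENTENCE_END_RE = re.compile(r"[.?!]+")


def mean_sentence_length(text):
    """Return the mean length in characters of the sentences in text."""
    sentences = [part.strip() for part in SENTENCE_END_RE.split(text)]
    sentences = [sentence for sentence in sentences if sentence]
    if not sentences:
        return 0.0
    return sum(len(sentence) for sentence in sentences) / len(sentences)


average = mean_sentence_length("Arma cano. Quis es?")
```

The sample contains two sentences of 9 and 7 characters, giving a mean of 8.0. Because the result depends on editorial punctuation, comparisons are most meaningful between texts prepared under similar conventions.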
While this difference may be intuitive to a reader, the large-scale data generated by the toolkit offers quantitative evidence of a genre's formal style, which would otherwise be difficult if not impossible to calculate by hand. Finally, the toolkit can also generate input data for supervised and unsupervised machine learning analyses. In our recent study of Latin prose and verse, we trained a random forest classifier using all 26 features to distinguish the two genres with high (>97%) accuracy (Chaudhuri et al., 2018). The underlying data can now be produced easily using the toolkit, and similar datasets can be constructed for other machine learning applications.

Conclusion and Future Work
This paper introduces a stylometry toolkit for Latin literature, which incorporates a diverse feature set demonstrably useful for literary criticism. The toolkit includes a point-and-click interface to maximize usage among core domain specialists, principally researchers in the humanities, who may not have specialized computational training. Future versions of the toolkit will further diversify the feature set, incorporating high-frequency n-grams and sense-pauses alongside the existing categories (Fitch, 1981; Dexter et al., 2017), and will leverage expected advances in Latin NLP to improve the methods for calculation of existing features.
In related work, we have developed a similar feature set for Ancient Greek, which has been used to classify prose and verse and, at a more fine-grained level, epic and drama (Gianitsos et al., 2019). Our work on Old English has demonstrated the utility of related features for various literary and attribution studies (Neidorf et al., 2019). After extension of the current toolkit to Ancient Greek and Old English, we plan in due course to incorporate other underserved languages, in particular Bengali.