Stylometric Classification of Ancient Greek Literary Texts by Genre

Classification of texts by genre is an important application of natural language processing to literary corpora but remains understudied for premodern and non-English traditions. We develop a stylometric feature set for ancient Greek that enables identification of texts as prose or verse. The set contains over 20 primarily syntactic features, which are calculated according to custom, language-specific heuristics. Using these features, we classify almost all surviving classical Greek literature as prose or verse with >97% accuracy and F1 score, and further classify a selection of the verse texts into the traditional genres of epic and drama.


Introduction
Classification of large corpora of documents into coherent groups is an important application of natural language processing. Research on document organization has led to a variety of successful methods for automatic genre classification (Stamatatos et al., 2000; Santini, 2007). Computational analysis of genre has most often involved material from a single source (e.g., a newspaper corpus, for which the goal is to distinguish between news articles and opinion pieces) or from standard, well-curated test corpora that contain primarily non-literary texts (e.g., the Brown corpus or equivalents in other languages) (Kessler et al., 1997; Petrenz and Webber, 2011; Amasyali and Diri, 2006).
Notions of genre are also of substantial importance to the study of literature. For instance, examination of the distinctive characteristics of various forms of poetry dates to classical Greece and Rome (for instance, by Aristotle and Quintilian) and remains an active area of humanistic research today (Frow, 2015). A number of computational analyses of literary genre have been reported, using both English and non-English corpora such as classical Malay poetry, German novels, and Arabic religious texts (Tizhoosh et al., 2008; Kumar and Minz, 2014; Jamal et al., 2012; Hettinger et al., 2015; Al-Yahya, 2018). However, computational prediction of even relatively coarse generic distinctions (such as between prose and poetry) remains unexplored for classical Greek literature.
Encompassing the epic poems of Homer, the tragedies of Aeschylus, Sophocles, and Euripides, the historical writings of Herodotus, and the philosophy of Plato and Aristotle, the surviving literature of ancient Greece is foundational for the Western literary tradition. Here we report a computational analysis of genre involving the whole of the classical Greek literary tradition. Using a custom set of language-specific stylometric features, we classify texts as prose or verse and, for the verse texts, as epic or drama with >97% accuracy. An important advantage of our approach is that all of the features can be computed without syntactic parsing, which remains in an early phase of development for ancient Greek. As such, our work illustrates how computational modeling of literary texts, where research has concentrated overwhelmingly on modern English literature (Elson et al., 2010; Elsner, 2012; Bamman et al., 2014; Chaturvedi et al., 2016; Wilkens, 2016), can be extended to premodern, non-Anglophone traditions.

Stylometric feature set for ancient Greek
The feature set is composed of 23 features covering four broad grammatical and syntactical categories. The majority of the features are function or non-content words, such as pronouns and syntactical markers; a minority concern rhetorical functions, such as questions and uses of superlative adjectives and adverbs. Function words are standard features in stylometric research on English (Stamatatos, 2009; Hughes et al., 2012) and have also been used in studies of ancient Greek literature (Gorman and Gorman, 2016). Our feature selection is not drawn from a prior source but has been devised based on three criteria: amenability to exact or approximate calculation without use of syntactic parsing, substantial applicability to the corpus, and diversity of function. The feature set is listed in Table 1. While some features are common in standard studies of English stylistics, such as pronouns, others are specific to ancient Greek. Attention to language-specific features enhances stylometric methods that were developed for the English language and are not directly transferable to languages possessing a different structure (Rybicki and Eder, 2011; Kestemont, 2014). Greek particles, for example, are uninflected adverbs used for a wide range of logical and emotional expressions; in English their equivalent meaning is often expressed by a phrase or, in speech, tone. In order to avoid significant problems arising from dialectal variation, including a large increase in homonyms, we restrict features to the Attic dialect, in which the majority of classical Greek texts were composed. Many features are computed by counting all inflected forms of the appropriate word(s), which can be found in any standard ancient Greek textbook or grammar such as Smyth (1956). A detailed description of the methods for computing the features is given in Appendix A.
Calculation of five features relies on heuristics to disambiguate between words of similar morphology. (All other features can be calculated exactly.) To assess the effectiveness of these heuristics, we hand-annotate the five features in a representative sub-corpus containing three verse (Homer's Odyssey 6, Quintus of Smyrna's Posthomerica 12, and Euripides' Cyclops) and two prose (Lysias 7 and Plutarch's Caius Gracchus) texts. Table 2 lists the precision and recall of each feature on the aggregated verse and prose texts. In every instance, the precision is > 0.95 and the recall is > 0.85.
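As a minimal illustration of how such an evaluation could be scored, the sketch below compares the positions flagged by a heuristic with the positions marked by an annotator. The function name and the toy token positions are hypothetical and are not taken from our code or data.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for two collections of token positions."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Invented example: the heuristic flags four tokens, the annotator marks five.
heuristic_hits = {3, 10, 17, 25}
hand_annotations = {3, 10, 17, 30, 41}
p, r = precision_recall(heuristic_hits, hand_annotations)
print(p, r)
```

Precision here is the share of heuristic matches confirmed by the annotator, and recall is the share of annotated instances that the heuristic recovers.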

Experimental setup

Dataset
We use a corpus of ancient Greek text files, which was assembled by the Perseus Digital Library and further processed by the Tesserae Project (Crane, 1996; Coffee et al., 2012). A full list of texts is provided in Appendix B. Each file typically contains either an entire work of literature (e.g., a play or a short philosophical treatise) or one book of a longer work (e.g., Book 1 of Homer's Iliad). Twenty-nine files are composites of multiple books included elsewhere in the Tesserae corpus and are omitted from our analysis, leaving 751 files. In total, this corpus contains essentially all surviving classical Greek literature and spans from the 8th century BCE to the 6th century CE.
For our first experiment, we hand-annotate the full set of texts as prose (610 files) or verse (141 files) according to standard conventions (Appendix B). For the second experiment, we hand-annotate the verse texts as epic (82 files) and drama (45 files), setting aside 14 files that contain poems of other genres (Appendix C).

Feature extraction
All text processing is done using Python 3.6.5. We first tokenize the files from the Tesserae corpus into either words or sentences using the Natural Language Toolkit (NLTK; v. 3.3.0) (Bird et al., 2009). For sentence tokenization, we use the PunktSentenceTokenizer class of NLTK Greek (Kiss and Strunk, 2006). After tokenization, the features are calculated either by tabulating instances of signal n-grams or (for length-based features) counting characters exclusive of whitespace, as described in Appendix A.
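The two kinds of computation can be sketched as follows. This is a simplified illustration under stated assumptions, not our extraction code: a regular expression stands in for the NLTK tokenizers, Unicode NFC normalization is assumed so that visually identical Greek forms compare equal, and the function names and the sample string (an invented phrase, not a quotation) are hypothetical.

```python
import re
import unicodedata


def word_tokens(text):
    """Split text into lowercase word tokens (a stand-in for NLTK)."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", text)


def count_signal_words(text, signal_words):
    """Tabulate tokens that match any of the given signal words."""
    signals = {unicodedata.normalize("NFC", w) for w in signal_words}
    return sum(1 for token in word_tokens(text) if token in signals)


def nonspace_length(text):
    """Count characters exclusive of whitespace."""
    return sum(1 for ch in text if not ch.isspace())


# Invented sample containing two instances of the signal word.
sample = "σκόπει ὅπως ταῦτα ἔσται, καὶ ὅπως μὴ λήσεις."
hopos_count = count_signal_words(sample, ["ὅπως"])
print(hopos_count, nonspace_length("ab c\n d"))
```

Any normalization of the raw counts (for example, per token or per character) is left out here, since it depends on the definitions in Appendix A.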

Supervised learning
All supervised learning is done using Python 3.6.5. For each experiment, we use the scikit-learn (v. 0.19.2) implementation of the random forest classifier. A full list of hyperparameters and other settings is given in Appendix D. For each binary classification experiment (prose vs. verse and epic vs. drama), we perform 400 trials of stratified 5-fold cross-validation; each trial has a unique combination of two random seeds, one used to initialize the classifier and the other to initialize the data splitter. Feature rankings are determined by the average Gini importance across the 400 trials.
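A minimal sketch of this procedure is given below, assuming NumPy and a recent scikit-learn are installed. The synthetic data, the seed scheme, the number of trees, and the reduced number of trials are placeholders chosen for illustration; the actual settings are those listed in Appendix D.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold


def run_trials(X, y, n_trials=3, n_estimators=10):
    """Repeat stratified 5-fold CV with a distinct seed pair per trial."""
    trial_scores, importances = [], []
    for trial in range(n_trials):
        # Hypothetical seed scheme: any unique (classifier, splitter) pair works.
        clf_seed, split_seed = 2 * trial, 2 * trial + 1
        splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=split_seed)
        fold_acc, fold_f1 = [], []
        for train_idx, test_idx in splitter.split(X, y):
            clf = RandomForestClassifier(n_estimators=n_estimators, random_state=clf_seed)
            clf.fit(X[train_idx], y[train_idx])
            pred = clf.predict(X[test_idx])
            fold_acc.append(accuracy_score(y[test_idx], pred))
            fold_f1.append(f1_score(y[test_idx], pred, average="weighted"))
            # feature_importances_ is the impurity-based (Gini) importance.
            importances.append(clf.feature_importances_)
        trial_scores.append((np.mean(fold_acc), np.mean(fold_f1)))
    return trial_scores, np.mean(importances, axis=0)


# Synthetic stand-in data: 80 "prose" and 20 "verse" rows with 4 features,
# where only feature 0 separates the classes.
rng = np.random.RandomState(0)
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(100, 4))
X[y == 1, 0] += 5.0
scores, mean_importances = run_trials(X, y)
print(scores, mean_importances)
```

Stratification keeps the class ratio roughly constant across folds, which matters for an imbalanced split such as 610 prose files against 141 verse files.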

Prose vs. verse classification
Using the workflow described in Section 3.3, we classify each of the literary texts in the corpus as prose or verse. Table 3 lists the accuracy and weighted F1 score for a sample cross-validation trial, along with the mean for that trial and overall mean across the 400 trials. We find that the texts can be classified as prose or verse with extremely high accuracy using the set of 23 stylometric features and that, despite the small size of the corpus, classifier performance is robust to the choice of cross-validation partition. The five highest-ranked features are given in Table 4. Outside of these five, no other feature has a Gini importance of > 0.05. All five features predominate in prose rather than poetry, and three of them are pronouns or pronominal adjectives. The sustained discussions commonly found in various prose genres may favor the use of pronouns to avoid extensive repetition of nouns and proper names. The high ranking of conjunctions is plausibly connected to the longer sentences characteristic of most prose (mean length 205 characters, compared to 166 characters for poetry).

Classification of poems as epic or drama
The genres of epic and drama are in certain respects quite distinct: they differ in length and poetic meter, and the vocabulary of Aristophanes' comic plays is unlike that of either epic or tragedy. In other aspects of form and content, however, they have much in common, including passages of direct speech, high register diction, and mythological subject matter. The playwright Aeschylus is even reported to have described his tragedies as "slices from the great banquets of Homer" (Athenaeus, Deipnosophistae 8.347E). The similarities between epic and drama thus present an intuitively greater challenge for classification. Table 5 summarizes the results of the epic vs. drama experiment, for which we achieve performance comparable to that of the prose vs. verse experiment. Table 6 lists the top features, which reflect several important differences between the genres. The most important feature, sentence length, highlights the relatively shorter sentences of drama compared to epic, which can be explained at least in part by the rapid exchanges between speakers that occur throughout both tragedy and comedy. Although sentence length is a feature that can be affected by modern editorial practice, the difference between drama and epic on this score is sufficiently large that it cannot be explained by variations in editorial practice alone (< 80 characters/sentence on average across dramatic texts, > 150 characters/sentence for epic). The importance of demonstrative pronouns, ranked second, plausibly captures a different side of drama: the habit of characters referring, often indexically, to persons or objects in the plot (e.g., ἐκεῖνος οὗτός εἰμι, ekeinos houtos eimi, "I am that very man," Euripides, Cyclops 105, which uses two demonstrative pronouns in succession).
Another typical characteristic of dramatic plot and dialogue accounts for the third-ranked feature, interrogative sentences, since both tragedies and comedies often show characters in a state of uncertainty or ignorance, or making inquiries of other characters. Although many of the features in the full set are correlated (e.g., sentence length and various markers of subordinate clauses), none of the top 5 plausibly are, suggesting that the analysis identifies a diverse set of stylistic markers for epic and drama.

Misclassifications
For epic vs. drama, no text is misclassified in more than 12% of the trials. For prose vs. verse, only five texts are misclassified in >50% of the trials (Demades, On the Twelve Years; Dionysius of Halicarnassus, De Antiquis Oratoribus Reliquiae 2; Plato, Epistle 1; Aristotle, Virtues and Vices; Sophocles, Ichneutae). Most of the common misclassifications result from highly fragmentary or short texts. Almost half the speech of Demades, for example, contains short or incomplete sentences. The misclassified text of Dionysius of Halicarnassus amounts to only a few unconnected sentences; Sophocles' Ichneutae (the only verse text misclassified in over half the trials) is also fragmentary. The third most frequently misclassified text, Plato's First Epistle, in fact highlights the classifier's effectiveness, as it contains several verse quotations, which (given the short length of the text) plausibly account for the error.
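The per-text misclassification rate can be tallied as in the following sketch, which relies on the fact that each text is predicted exactly once per cross-validation trial. The data layout (one dictionary of predictions per trial), the function name, and the toy labels are hypothetical.

```python
from collections import Counter


def misclassification_rates(true_labels, trial_predictions):
    """Fraction of trials in which each text receives the wrong label."""
    errors = Counter()
    for predictions in trial_predictions:
        for text_id, predicted in predictions.items():
            if predicted != true_labels[text_id]:
                errors[text_id] += 1
    n_trials = len(trial_predictions)
    return {text_id: errors[text_id] / n_trials for text_id in true_labels}


# Invented example with two texts and four trials.
true_labels = {"text_a": "prose", "text_b": "verse"}
trial_predictions = [
    {"text_a": "verse", "text_b": "verse"},
    {"text_a": "verse", "text_b": "verse"},
    {"text_a": "prose", "text_b": "verse"},
    {"text_a": "verse", "text_b": "verse"},
]
rates = misclassification_rates(true_labels, trial_predictions)
frequent = sorted(t for t, rate in rates.items() if rate > 0.5)
print(rates, frequent)
```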

Conclusion
In this paper, we demonstrate that ancient Greek literature can be classified by genre using a straightforward supervised learning approach and stylometric features calculated without syntactic parsing. Our work suggests a number of natural follow-up analyses, especially extension of the experiments to encompass the full range of traditional prose genres (such as historiography, philosophy, and oratory) and application of the feature set to other questions in classical literary criticism. In addition, we hope that our heuristic approach will motivate and inform analogous work on other premodern traditions for which natural language processing research remains at an early stage.

Acknowledgments
This work was conducted under the auspices of the Quantitative Criticism Lab (www.qcrit.org), an interdisciplinary group co-directed by P.C. and J.P.
Interrogative sentences are excluded because the Greek interrogative pronoun (τίς) is often identical in form to the indefinite pronoun.
• ἵνα (hina, an adverb of place often translated "where" or a conjunction indicating purpose often translated "in order that") is computed by counting all instances of ἵνα and ἵν΄ (hin).
• ὅπως (hopos, an adverb of manner often translated "how" or a conjunction indicating purpose often translated "in order that") is computed by counting all instances of ὅπως.
• Fraction of sentences with a relative clause is determined by counting sentences that have one or more of the inflected forms of the Greek relative pronouns ὅς, ἥ, ὅ (hos, he, ho, "who" or "which").
• ὥστε (hoste, a conjunction used to indicate a result, "so as to") not preceded by ἤ is calculated by counting all instances of ὥστε not immediately preceded by ἤ. This limitation is imposed to exclude instances in which ὥστε is part of a comparative phrase.
• The mean length of relative clauses is determined by counting the number of characters between each relative pronoun and the next punctuation mark.
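Two of the rules above can be sketched as follows. This is an illustrative approximation under several assumptions: a regular expression stands in for the NLTK word tokenizer, "immediately preceded" is interpreted at the level of word tokens (ignoring intervening punctuation), the grave-accented form ἢ is treated as equivalent to ἤ, the punctuation set is a guess, and the list of relative pronoun forms is supplied by the caller (the two forms used in the example are only a small subset). The sample strings are invented.

```python
import re
import unicodedata

PUNCTUATION = ".,;·"  # assumed set of clause-ending marks


def nfc(text):
    return unicodedata.normalize("NFC", text)


def count_hoste_not_after_e(text):
    """Count ὥστε tokens whose preceding token is not ἤ (or ἢ)."""
    hoste = nfc("ὥστε")
    excluded_prev = {nfc("ἤ"), nfc("ἢ")}
    tokens = re.findall(r"\w+", nfc(text).lower())
    count = 0
    for i, token in enumerate(tokens):
        if token == hoste and (i == 0 or tokens[i - 1] not in excluded_prev):
            count += 1
    return count


def relative_clause_lengths(text, pronoun_forms):
    """Non-whitespace characters from each pronoun to the next punctuation."""
    text = nfc(text).lower()
    forms = {nfc(form) for form in pronoun_forms}
    lengths = []
    for match in re.finditer(r"\w+", text):
        if match.group() in forms:
            rest = text[match.end():]
            clause = re.split("[" + re.escape(PUNCTUATION) + "]", rest, maxsplit=1)[0]
            lengths.append(sum(1 for ch in clause if not ch.isspace()))
    return lengths


hoste_count = count_hoste_not_after_e("μείζων ἢ ὥστε φέρειν. ὥστε ἀπῆλθον.")
clause_lengths = relative_clause_lengths("ὁ ἀνὴρ ὃν εἶδον, ἀπῆλθεν.", ["ὅν", "ὃν"])
print(hoste_count, clause_lengths)
```

In the first sample only the second ὥστε is counted, because the first follows ἢ in a comparative phrase.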

A.4 Miscellaneous
• Interrogative sentences are computed by counting all instances of ";" (the Greek question mark).
• Sentences with ὦ exclamations are determined by identifying sentences that have at least one instance of ὦ (o, "O"), a Greek exclamation.
• ὡς (hos, an adverb of manner often translated "how" or a conjunction often translated as "that," "so that," or "since," among several other possibilities) is computed by counting all instances of ὡς.
• Mean and variance of sentence length are determined by counting the number of characters in each tokenized sentence (see Section 3.2 of main paper).
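The sentence-level features in this section can be approximated as in the sketch below. It makes several assumptions that should be read as such: a naive split on "." and ";" replaces the Punkt sentence tokenizer, characters are counted exclusive of whitespace, the variance is the population variance, and the function names and the sample string are invented. NFC normalization maps the dedicated Greek question mark (U+037E) to the ASCII semicolon, so both encodings are counted.

```python
import re
import statistics
import unicodedata


def split_sentences(text):
    """Naive splitter on '.' and ';' (a stand-in for the Punkt tokenizer)."""
    text = unicodedata.normalize("NFC", text)
    parts = re.findall(r"[^.;]+[.;]?", text)
    return [part.strip() for part in parts if part.strip()]


def count_interrogatives(text):
    """Count instances of ';', the Greek question mark."""
    return unicodedata.normalize("NFC", text).count(";")


def sentence_length_stats(text):
    """Mean and population variance of non-whitespace sentence lengths."""
    lengths = [sum(1 for ch in s if not ch.isspace()) for s in split_sentences(text)]
    return statistics.mean(lengths), statistics.pvariance(lengths)


def fraction_with_o(text):
    """Fraction of sentences containing ὦ as a separate word."""
    o = unicodedata.normalize("NFC", "ὦ")
    sentences = split_sentences(text)
    hits = sum(1 for s in sentences if o in re.findall(r"\w+", s.lower()))
    return hits / len(sentences)


sample = "τίς εἶ; ὦ ξένε, χαῖρε."
questions = count_interrogatives(sample)
mean_len, var_len = sentence_length_stats(sample)
o_fraction = fraction_with_o(sample)
print(questions, mean_len, var_len, o_fraction)
```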