Supervised Machine Learning for Hybrid Meter



Introduction
The divergence of Latin into distinct regional dialects had profound linguistic and literary implications for all of Europe. Even before the Middle Ages, the syllable length of classical Latin had been nearly forgotten in the vernacular. Latin poetry had used quantitative meter, whereby syllable length was the organizing principle. However, the emerging dialects differed from Latin in that stress became a phonologically important feature, and so-called qualitative meter predominated. In order to reconcile these linguistic differences, poetic forms emerged in which meter relied on both stress and syllable length. These hybrid metrical forms pose unique challenges to automated scansion (the process of determining the metrical value of each syllable for a line of poetry). In applying machine learning techniques to scan syllables of a hybrid meter, we believe we can contribute to the study of both metrics and poetics, medieval and otherwise. Our system serves not only pedagogical purposes by introducing students to the meter of medieval German epic poetry, but also presents itself as a tool for further research in author identification or topic modeling of metrical form.
To illustrate quantitative meter, we consider the epic poetry of Latin and Greek. Each line consists of six feet, each foot typically a dactyl (a long syllable followed by two short syllables) or spondee (two long syllables). A syllable is considered long if it has a long vowel or diphthong, or ends in a consonant (Hayes, 1989). All other syllables are short. The first line of Virgil's Aeneid serves as an example:

arma vi|rumque ca|nō, Tro|jae quī|prīmus ab|ōrīs

Shakespeare's verse, on the other hand, exhibits qualitative meter, structured in iambic pentameter, where each line has five iambs (a bisyllabic foot consisting of an unstressed first syllable followed by a stressed second syllable). The first line of Romeo and Juliet is scanned below:

|Two house|holds, both |alike |in dig|nity.|
| ×× | ×× |×× |×× |××|

Middle High German Meter
This paper considers the meter of twelfth- and thirteenth-century Middle High German (MHG) epic verse. Although written in a Germanic language, MHG poetry was greatly influenced by the Romance tradition. This heritage is evident in its hybrid metrical structure: MHG verse patterns according to both syllable stress and length (Bostock, 1947).
The predominating pattern is an alternation between stressed and unstressed syllables (Tervooren, 1997). 4 MHG epic verse employs trochaic tetrameter: each line has four feet (Bostock, 1947), and each foot is a trochee. Phonologically, a trochee consists of two syllables; the first syllable is stressed, and the second is unstressed. For example, the English word "better" is a trochee, but the word "alive" is not. The famous Longfellow epic poem The Song of Hiawatha is written in trochaic tetrameter, and the first line serves to illustrate this rhythm: Should you ask me, whence these stories?
Similarly, the typical MHG epic verse foot is two syllables in length, a stressed syllable followed by an unstressed syllable. However, feet can also be filled by one or three syllables (Domanowski et al., 2009). If a foot is filled by one syllable, the syllable must be phonologically long. If the foot is filled by three syllables, either the first two or the last two syllables must both be phonologically short.
It is in these atypical feet that the influence of quantitative meter, where syllable length is the key factor, becomes evident. We must slightly redefine the foot to account for this. Syllable length is measured in morae. Phonologically, a mora is a unit of time such that a short syllable has one mora and a long syllable has two morae (Fox, 2000). 5 A foot in this meter is more precisely defined as having two morae, not necessarily two syllables. 6 Indeed, the mora, not the syllable, has been called the fundamental unit of MHG verse (Tervooren, 1997, p. 1), although the mora functions differently in this poetic tradition than in its phonological definition. If a foot has only one syllable, the syllable must be long because a long syllable is two morae and the MHG foot requires two morae. A short syllable cannot be the only syllable in a foot, since it cannot be two morae. If a foot has three syllables, two must be short because only short syllables can be scanned as half morae, together forming one mora. The other syllable is analyzed as one mora, yielding the required two morae in the foot. To summarize, a syllable can have one of three length values: mora, half mora, or double mora. A half mora must be phonologically short, and a double mora must be phonologically long. Phonological length is otherwise irrelevant and any syllable can be one mora.

4 There is no consensus view on MHG meter. For this work we have most closely followed the viewpoints presented by Domanowski et al. (2009) and Heusler (1956), as well as more explicitly addressed the function of morae.
5 For example, the English word "red" has two morae since it ends in a consonant, whereas the first syllable in the English word "reduce" has one mora, since it ends in a short vowel.
6 It can be helpful to think of MHG meter in the musical sense. Each foot is a measure of 2/4 meter, where one mora is equivalent to one quarter note (Bögl, 2006).
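The foot constraints described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mora_values` and the boolean encoding of phonological length are hypothetical, stress is ignored, and when all three syllables of a foot are short the sketch arbitrarily treats the first two as the half morae, a choice the text does not prescribe.

```python
def mora_values(is_long):
    """Assign mora values to the syllables of one foot, or return None.

    is_long: list of booleans, True for a phonologically long syllable.
    Hypothetical helper for illustration; stress is not modelled.
    """
    n = len(is_long)
    if n == 1:
        # A lone syllable must fill both morae, so it must be long.
        return [2.0] if is_long[0] else None
    if n == 2:
        # Any syllable may count as one mora, whatever its length.
        return [1.0, 1.0]
    if n == 3:
        short = [not x for x in is_long]
        # Two adjacent short syllables share one mora as half morae.
        if short[0] and short[1]:
            return [0.5, 0.5, 1.0]
        if short[1] and short[2]:
            return [1.0, 0.5, 0.5]
    # No permissible filling: foot too long, or length constraints violated.
    return None
```

Every permissible filling returned by this sketch sums to two morae, which is the redefinition of the foot given above; a single short syllable or a three-syllable foot without two adjacent short syllables is rejected.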
In addition to length, as a function of morae, syllables are also assigned stress. There are three stress values: primary, secondary, or unstressed. Primary stress is assigned to the first or only stressed syllable in a word. Secondary stress is assigned to any following stressed syllable(s) in that word. All other syllables are unstressed. The final mora of the final foot of a line is omitted by convention. This is construed as a pause, and receives its own symbol in the scansion, even though there is no corresponding word or syllable. A short, word-final syllable may also be elided before a word beginning with a vowel. MHG epic verse permits up to three syllables in anacrusis (a series of syllables at the beginning of a line that do not count in the meter). These syllables may or may not carry lexical or syntactic stress, but they are always scanned as unstressed morae.
The above features yield eight possible metrical values for any syllable:
• mora - primary stress (×): a syllable with primary stress
• mora - secondary stress (×): a syllable with secondary stress
• mora - unstressed (×): an unstressed syllable
• half mora - primary stress (´ ): a short syllable with primary stress; according to metrical convention the preceding syllable must be long
• half mora - secondary stress (` ): a short syllable with secondary stress
• half mora - unstressed ( ): a short unstressed syllable
• double mora (-): a stressed long syllable; double morae always carry primary stress
• elision (e . ): an elided syllable

Line 1 of Hartmann von Aue's Der arme Heinrich is prototypical. Each foot consists of a stressed syllable followed by an unstressed syllable. There is a one-syllable anacrusis: 9

Ein | rîter | sô ge|lêret | was 10
× |× × |× ×|× × |×

Line 6 also begins with one syllable in anacrusis. The second foot has a stressed mora consisting of two syllables, each one a half mora. The third foot has one syllable; a diphthong allows it to be scanned as long. The final foot has a mora with secondary stress, since the preceding syllable is stressed and in the same word:

der |nam im |manege |schou|we 11
× |× × |´ × | -|×

Line 34 has no anacrusis, and in the second foot two half mora syllables form the unstressed mora:

|die ein |ritter in |sîner |jugent 12
|× × |× |× × |´

Line 8 shows an elided syllable in the second foot:

dar |an be|gunde . er |suo|chen 13
× |× × |× × | -|×

9 ˆ represents the rest for the empty mora at the end of a line. Note that this notation differs slightly from that which is used for classical and Shakespearian verse.
10 "There was a knight so learned"
11 "he looked extensively,"
12 "which a knight [should have] in his youth."
13 "in [these books] he began to search,"
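As a rough illustration, the eight metrical values listed above can be derived from the two dimensions of length and stress plus the two special cases. The class and function names below are hypothetical and the label strings are our own; nothing here reflects the annotation format actually used.

```python
from enum import Enum
from itertools import product


class Length(Enum):
    MORA = "mora"
    HALF = "half mora"
    DOUBLE = "double mora"


class Stress(Enum):
    PRIMARY = "primary stress"
    SECONDARY = "secondary stress"
    UNSTRESSED = "unstressed"


def metrical_values():
    """Return the label inventory as a list of strings (hypothetical labels)."""
    labels = []
    # Morae and half morae combine freely with all three stress values.
    for length, stress in product((Length.MORA, Length.HALF), Stress):
        labels.append(f"{length.value} - {stress.value}")
    # Double morae always carry primary stress, so they add a single label.
    labels.append(Length.DOUBLE.value)
    # Elision is a value of its own, outside the length/stress grid.
    labels.append("elision")
    return labels
```

The count works out as 2 × 3 + 1 + 1 = 8, matching the list above.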

Previous Computational Approaches to Meter
There are two prevailing treatments of meter in the literature concerned with computational poetic text analysis. One approach takes a known meter and assigns syllables to stress patterns based on such parameters (Hartman, 1996). The second approach assumes nothing of the meter, and seeks to determine it by marking syllables and identifying patterns (Plamondon, 2006). Our approach draws more on the latter. Previous scholarship has focused on relatively simple systems of meter and adopted rule-based, statistical, or unsupervised approaches. The hybrid nature of MHG meter, and of other complex systems developing out of classical antiquity, makes it difficult to scan poetry using these methodologies. 14

Data
As supervised machine learning is a novel approach to scansion, annotated metrical data do not exist for MHG or most other languages. Following the scansion categorization system outlined above, the authors annotated syllables of MHG epic poetry into the eight categories of metrical value.
The annotated data consist of 450 lines from Hartmann von Aue's Der arme Heinrich, 200 lines from Wolfram von Eschenbach's Parzival, and 100 lines from Wirnt von Grafenberg's Wigalois. 15 An additional 10% (75 lines of Hartmann von Aue's Iwein) was annotated and held out for testing, yielding a total of 825 annotated lines. Summary statistics for all annotated data are given in Table 1.
Syllabification was performed prior to annotation. The principles of onset maximization (early formulation in Vennemann (1972)) and sonority sequencing (early formulation in Jespersen (1904)) govern syllabification in many languages, including, to a great degree, MHG. The division of words with only one intervocalic consonant such as ta-ge "days" poses no difficulties. Only certain consonant clusters necessitate further information from MHG phonology. For example, the orthographic sequence of a nasal followed by a velar obstruent, although representing simply a velar nasal in modern German, in fact still represented two separate phonemes in MHG (Paul et al., 1982). Thus the word lange "long" is syllabified as lan-ge. Intervocalic affricates can be viewed as either ambisyllabic or biphonemic, for example in sitzen "sit"; under both interpretations the first syllable has a coda and the second has an onset. There are also instances where morpheme boundaries interfere with the otherwise normal processes of syllabification. For example, the common MHG suffix -lich in wîplich "female" results in the syllabification wîp-lich, not wî-plich, despite onset maximization preferring the latter. Adding rules for these idiosyncrasies to onset maximization and sonority sequencing yielded a syllabification accuracy of 99.4% on the first 1,000 words of Hartmann von Aue's Iwein (95% confidence interval: 98.9% to 99.9%).

14 A strictly rule-based approach was undertaken by Friedrich Dimpel (2004). While Dimpel's approach is accurate, it is an arduous task, restrictive, and extremely language-specific. Moreover, by identifying only the stressed syllables, it does not encompass the full complexity of MHG meter.
15 Incorporating different poems from different poets accommodates varying styles of writing, but it also introduces more variability, an issue to be addressed in subsequent work.
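The interaction of onset maximization with a morpheme-boundary exception can be sketched as follows. This is a minimal illustration and not the syllabifier used for the annotated data: the onset inventory `LEGAL_ONSETS` and the suffix list `SUFFIXES` are hypothetical and deliberately incomplete, every run of vowel letters is assumed to form a single nucleus, and sonority sequencing is approximated only through the list of permitted onsets.

```python
import re

VOWELS = "aeiouâêîôûäöüæœ"
# Hypothetical, incomplete inventory of permissible syllable onsets.
LEGAL_ONSETS = {
    "b", "d", "g", "p", "t", "k", "f", "v", "w", "s", "z", "h",
    "l", "r", "m", "n", "j", "ch", "sch",
    "bl", "br", "dr", "gl", "gr", "pl", "pr", "tr", "kl", "kr", "fl", "fr",
}
# Hypothetical list of suffixes that keep their own syllable boundary.
SUFFIXES = ("lich",)


def syllabify(word, suffixes=SUFFIXES):
    """Split a word into syllables by onset maximization (illustrative sketch)."""
    word = word.lower()
    for suffix in suffixes:
        stem = word[: -len(suffix)]
        # Morpheme-boundary exception: split the suffix off before anything else.
        if word.endswith(suffix) and any(ch in VOWELS for ch in stem):
            return syllabify(stem, suffixes) + [suffix]
    # Alternate runs of vowel letters (nuclei) and consonant letters.
    runs = re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)
    syllables, current = [], ""
    for i, run in enumerate(runs):
        is_last = i == len(runs) - 1
        if run[0] in VOWELS or not current or is_last:
            # Nuclei, word-initial consonants and word-final consonants
            # simply attach to the syllable being built.
            current += run
            continue
        # Intervocalic cluster: the longest legal onset goes to the next syllable.
        split = len(run)
        for k in range(len(run)):
            if run[k:] in LEGAL_ONSETS:
                split = k
                break
        syllables.append(current + run[:split])
        current = run[split:]
    syllables.append(current)
    return syllables
```

With these assumed inventories, `syllabify("lange")` keeps the nasal in the first syllable because "ng" is not listed as an onset, and `syllabify("wîplich")` respects the suffix boundary, whereas calling it with `suffixes=()` falls back to pure onset maximization.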
Annotation was carried out by both authors, who are trained in MHG scansion. 16 In the case that a line exhibits multiple permissible scansions, priority is given to the scansion which best preserves the alternation of stressed and unstressed syllables. If a decision still cannot be made, then stress is assigned according to semantic importance. An additional consideration is the syntactic stress of a particular line. Clearly, such evaluations allow some room for interpretation. Nevertheless, on a sample of 100 lines from the annotated data (739 syllables), the Cohen's kappa co-

16 Although neither author is a native speaker of New High German, the two phases of the language and the metrical traditions are sufficiently different that both native and non-native speakers require training in MHG scansion.

• Position within line: the last mora of a line is always stressed, and double morae occur most often in the third foot.
• Length of syllable in characters: longer syllables are more likely to be stressed. Unstressed prefixes and suffixes tend to be maximally three characters.
• Syllable characters: the characters in a syllable can help identify grammatical morphemes that are often unstressed. Slices were taken of the first character, first and second characters, last three characters, last two characters, and last character.
• Elision: the last two characters of the previous syllable and the first two characters of the current syllable are identified to detect conditions for elision.
• Syllable weight and length: syllables ending in a vowel or consonant are open and closed, respectively. Syllables ending in a short vowel are short; otherwise they are long. Such values are useful in identifying double or half mora syllables, which must be long or short respectively. For example, the syllable "schou" in line 6 of Der arme Heinrich above is a double mora, and is accordingly long.
• Word boundaries: stress usually occurs on the first syllable of a word.

17 Future work might consider an LSTM neural network model. The decision to implement a CRF model was predicated on the interpretability of CRF modeling and understanding the primary features for MHG scansion.
18 The implementation of the CRF model was expedited with the help of Python's crfsuite (Okazaki, 2007).
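The features listed above can be illustrated with a small extraction function that turns a syllabified line into one feature dictionary per syllable, a shape that CRF toolkits commonly accept. This is a sketch under assumptions: the feature names, the vowel inventories, and the treatment of any vowel-plus-vowel ending as a diphthong are ours and are not taken from the actual implementation.

```python
SHORT_VOWELS = set("aeiouäöü")
VOWELS = SHORT_VOWELS | set("âêîôûæœ")


def syllable_features(line):
    """line: list of words, each a list of non-empty syllable strings.

    Returns one feature dict per syllable (hypothetical feature names).
    """
    # Flatten the line, remembering which syllables begin a word.
    flat = [(syl, idx == 0) for word in line for idx, syl in enumerate(word)]
    total = len(flat)
    features = []
    for pos, (syl, word_initial) in enumerate(flat):
        prev = flat[pos - 1][0] if pos > 0 else ""
        # Short only if the syllable ends in a single short vowel; a final
        # vowel preceded by another vowel is assumed to be a diphthong.
        is_short = syl[-1] in SHORT_VOWELS and not (
            len(syl) > 1 and syl[-2] in VOWELS
        )
        features.append({
            "position": pos,                  # position within the line
            "from_end": total - 1 - pos,      # distance from the line end
            "n_chars": len(syl),              # syllable length in characters
            "first1": syl[:1],
            "first2": syl[:2],
            "last3": syl[-3:],
            "last2": syl[-2:],
            "last1": syl[-1:],
            "elision_ctx": prev[-2:] + "|" + syl[:2],  # context for elision
            "open": syl[-1] in VOWELS,        # open vs. closed syllable
            "long": not is_short,             # phonological length
            "word_initial": word_initial,     # word boundary
        })
    return features
```

For line 6 of Der arme Heinrich, passed as `[["der"], ["nam"], ["im"], ["ma", "ne", "ge"], ["schou", "we"]]`, the syllable "schou" comes out as open and long, while "ma" is open and short.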
The model was tuned only on the development data and the best performing model was chosen. The resulting best model uses an L1 coefficient of 1.3 and an L2 coefficient of 0.001. No further changes were made after the model features and parameters were selected.

Results and Comparison to Other Models 19
To evaluate the performance of our CRF model, we compare against two baselines: an n-gram model cascading into regular expressions, and a Brill transformation-based model on top of the n-gram model, both using syllables as units, just as the CRF model does. The n-gram model consists of cascading trigram, bigram, unigram, and regular expression models: a label is first predicted from the previous two labels, if possible; otherwise it is predicted from the previous label alone; and if both of these models fail, it is predicted solely from the label probability for the syllable itself. If the syllable did not appear in the training data and cannot be predicted by the first three models, the cascade resorts to regular expressions. Based on MHG scansion theory and observations while annotating, syllables with long vowels were assigned to double mora, short syllables to unstressed mora, and the remaining syllables to mora with primary stress (which proved important to recognize stress alternation). The n-gram model was implemented with default settings.
The Brill model (Brill, 1995), implemented with the help of NLTK (Bird et al., 2009), first assigns the most likely label from the n-gram model described above, and then generates rules to improve the initial estimate of the n-gram model according to the training data. It then iterates over these rules, correcting labels until accuracy no longer increases. The Brill model was implemented with a maximum of 200 rules.

19 The results for all models are 10-fold cross-validated.
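The cascade described above can be sketched in plain Python. This is an illustration rather than the baseline actually evaluated: we assume, as is common for n-gram taggers, that each context combines the preceding labels with the current syllable; the label strings are hypothetical; and the two regular expressions are assumed, incomplete stand-ins for the fallback rules.

```python
import re
from collections import Counter, defaultdict

# Assumed, incomplete patterns: circumflexed vowels and a few diphthong
# spellings count as long; a final plain vowel counts as short.
LONG_VOWEL = re.compile(r"[âêîôûæœ]|ie|uo|üe|ei|ou|öu|iu")
SHORT_FINAL = re.compile(r"[aeiouäöü]$")


def train(lines):
    """lines: list of lines, each a list of (syllable, label) pairs."""
    tri, bi, uni = defaultdict(Counter), defaultdict(Counter), defaultdict(Counter)
    for line in lines:
        prev2, prev1 = "<s>", "<s>"
        for syl, label in line:
            tri[(prev2, prev1, syl)][label] += 1
            bi[(prev1, syl)][label] += 1
            uni[syl][label] += 1
            prev2, prev1 = prev1, label
    return tri, bi, uni


def regex_fallback(syl):
    """Last resort for syllables never seen in training."""
    if LONG_VOWEL.search(syl):
        return "double_mora"
    if SHORT_FINAL.search(syl):
        return "mora_unstressed"
    return "mora_primary"


def tag(syllables, model):
    """Label a line by backing off: trigram, bigram, unigram, then regex."""
    tri, bi, uni = model
    labels, prev2, prev1 = [], "<s>", "<s>"
    for syl in syllables:
        cascade = ((tri, (prev2, prev1, syl)), (bi, (prev1, syl)), (uni, syl))
        for table, key in cascade:
            if key in table:
                label = table[key].most_common(1)[0][0]
                break
        else:
            # No table has seen this context or syllable.
            label = regex_fallback(syl)
        labels.append(label)
        prev2, prev1 = prev1, label
    return labels
```

A Brill-style model would then take the output of `tag` as its initial labelling and learn correction rules on top of it; that second stage is not sketched here.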
The n-gram model found little success even with added training data, ending with an accuracy of only 61.8%. The transformation-based Brill model improved quickly upon the n-gram model, but plateaued at 82.8% accuracy. Figure 1 shows the increase in accuracy with an increase in the number of annotated lines for all models, suggesting that marginal returns to annotation begin to diminish significantly after around 400 lines, or, in the case of MHG, about 3,000 syllables. Supervised machine learning thus proves to be an economical option for languages with complex meter.
The final results of the cross-validated CRF model are given in Table 3 in descending order of frequency in the data, along with a final held-out test set of 75 lines from Hartmann von Aue's Iwein. The model achieves an F-score of 0.894 on the cross-validated development data and 0.904 on the held-out testing data.
Apart from the infrequent half morae with secondary stress and elisions, the confusion matrix highlights four other problematic situations: (1) stressed morae marked as unstressed morae, (2) unstressed morae marked as stressed morae, (3) dou-

The model correctly predicts stress, but does not hold to the constraints of a foot as defined above. While these errors may be common, they are not as severe as the situation depicted in the first example.

CRF
(1) not × if next syll. is end of line
(2) - if end of line is next syll.
(3) e . if last char is "e" and first char. of next syll. is "e"

Brill
(1) × → × if at word boundary and following syll. is ×
(2) × → - if followed by × and word boundary

20 "and with which he might"
21 "to our gods, who brought him to us"
The top five highest scoring features of the CRF model and rules of the Brill model are given in Table 4. The CRF feature scores show that syllable quality is helpful (4), something the Brill model cannot generalize. Both models recognize alternation, but while the Brill model adopts a more general rule (1), CRF features (1) and (5) recognize alternation using the number of syllables and the end of the line. CRF feature (5) also accounts for anacrusis. Notably, the Brill model took more advantage of word boundaries in (1) and (2), while these features rank lower in the CRF model. Features for phonemes do not rank high individually, which is unsurprising considering their impact among the other features. Nevertheless, "ver" and "ge" (MHG prefixes) both score highly in the CRF for marking a syllable unstressed, but only "ge" ranks as a top five rule for the Brill model.
The scores from both models confirm extant MHG metrical theory and suggest new methods of approach for students of MHG meter. Instead of first marking stress, as suggested by Minimalmetrik (1997) and the pedagogically oriented website Mittelhochdeutsche Metrik Online (2009), it may be useful to first determine the cadence and anacrusis, as noted in CRF features (1), (2), (5), and Brill rule (3). The advantage of this approach is not mistakenly marking stress in anacrusis. Stress can then be marked in the remaining syllables, and metrical values can be assigned based on phonological features. These results and insights support our feature decisions and our implementation of a CRF model.

Conclusion
This paper has presented a new application of machine learning models to poetry, specifically to traditions with hybrid meter. It promises to contribute to other literary interests in computational linguistics such as author and genre analysis. Research has shown that a proper account of meter must consider its variation throughout an entire text (Golston, 2009); automated scansion makes this a realistic enterprise. In applying this method, researchers can survey large-scale variation within and across texts to discover the patterns that characterize authors and genres. Indeed, subtle differences in meter may prove to mark a distinct authorial voice or reveal significant stylistic choices. This paper also paves the way for further work such as cluster analysis of meter over a large corpus of texts, topic modeling of cadence across genres, and charting the literary affect of meter.