A computationally-assisted procedure for discovering poetic organization within oral tradition

A procedure is described which is capable of detecting poetic organization within transcribed oral tradition text. It consists of two components: an automated process which generates recurring n-gram patterns, and a verification process of manual review. Applied to a corpus of Tahitian and Mangarevan oral tradition, it has led to the discovery of a variety of uses of meter and parallelism, ranging from the universally common to the unanticipated. Having been designed to identify a wide range of organizational possibilities, the procedure should be generalizable to the study of other oral poetries of the world.


Introduction
Our knowledge of the many ways by which oral tradition may be organized poetically derives from an uneven study of mostly European, Middle-Eastern, and Asian traditions. On a positive note, descriptions of the oral poetry of Indo-European languages have been sufficient to spawn the field of comparative-historical poetics (see Watkins 1995). Unfortunately, much less effort has been applied to the remainder of the world's oral traditions, which tend to fade away well before their languages die off. In a homogenizing era, unless these vulnerable data are collected and their varied means of poetic organization discovered, much of what could have been learned with regard to oral poetics universally might be forsaken.
When venturing into the study of an undescribed poetic tradition, a purely manual approach is generally insufficient. The investigative path is likely to be lined with wide cognitive gaps from researcher prejudice as to what might be recognized as poetic.
The procedure described here attempts to remedy potential bias by informing the researcher of instances of parallelism which may not have been otherwise detected. The procedure consists of two components: An automated process which generates recurring n-gram patterns, and a verification process of manual review. Manual verification is recommended given that a tradition may employ many different organizational methods, but the corpora which contain them are often small.
Some of the examples used below are drawn from application of the procedure to two sources of Polynesian oral tradition: a 50,000-word corpus of early 19th-century Tahitian material representing multiple genres, and a 10,000-word corpus of early 20th-century Mangarevan songs and chants. Treatment of the complete Tahitian corpus led to the discovery of two varieties of counting meter (one of which may be unique to Tahiti), complex patterns of meter and sound parallelism, and many uses of syntactic and semantic parallelism (see Meyer 2011 and 2013). Analysis of the Mangarevan data is still underway.
Due to space constraints, the automated procedure's functionality has only been summarized here. It is hoped that enough information will have been provided for the computational linguist reader to be successful at his or her own implementation.

Description of the Procedure
As mentioned, the procedure consists of an automated process which generates recurring n-gram patterns, followed by a verification process of manual review.
With regard to the former, computationally generated candidates consist of recurring n-grams of linguistic features,¹ any of which could potentially have application to poetic composition. The n-grams are sorted and counted, and then presented, in their original context, in multiple interactive reports as preparation for manual review.
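The generate, sort, count, and present-in-context steps can be sketched as follows. This is a minimal illustration, not the implementation used for the Polynesian data: the function name `recurring_ngrams`, the `min_count` parameter, and the toy feature sequence are all assumptions introduced here.

```python
from collections import defaultdict


def recurring_ngrams(features, n, min_count=2):
    """Return (ngram, start_positions) pairs for n-grams that recur.

    `features` is a flat sequence of feature values for one text, such as
    part-of-speech tags or syllable counts (a hypothetical input format).
    """
    positions = defaultdict(list)
    for start in range(len(features) - n + 1):
        positions[tuple(features[start:start + n])].append(start)
    recurring = [(g, p) for g, p in positions.items() if len(p) >= min_count]
    # Most frequent first; ties broken by n-gram content for a stable report.
    return sorted(recurring, key=lambda item: (-len(item[1]), item[0]))


# Toy sequence of abstract feature values (not drawn from the corpus).
toy = ["A", "B", "C", "A", "B", "D", "A", "B"]
report = recurring_ngrams(toy, n=2)
for ngram, starts in report:
    print("-".join(ngram), "occurs", len(starts), "times at", starts)
```

Keeping the start positions, rather than only the counts, is what allows each candidate to be shown in its original context during manual review.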
The automated component initially attempts to accommodate any linguistic feature a poet may wish to employ. After an initial round of manual analysis, however, it is desirable to pare the feature set down to just those which demonstrate some degree of promise, in order to lighten the load of the overall endeavor.² The list in table 1, for example, contains the reduced linguistic feature set which was ultimately selected for treatment of the Polynesian data. It may also be necessary to re-apply the procedure were it discovered during manual review that the oral tradition specialist's poetic use of linguistic features differs from that of general language.³
In its implementation, the automated process need not be restricted to observation of a single feature in isolation (single-feature pattern detection), but should attempt to be sufficiently expansive so as to detect an oral poet's efforts to coordinate more than one feature (multi-feature pattern detection). It should also be capable of detecting patterns of inverted parallelism. Line,⁴ word, and syllable boundaries may or may not be significant, and therefore all possibilities for boundary inclusion into a pattern should be permitted.
2 See also the discussion of combinatorial explosion in 2.4 below.
Concerning the raw output of candidate pattern generation, it was found during manual review of the Polynesian data that:
2. A pattern should occur multiple times in the same text. A longer pattern need only occur twice in the same text.
3 Long and short vowels have been conflated, as it was discovered early on in manual analysis that patterns could be extended, or those near to each other joined, by permitting such an abstraction. It was also discovered that the Tahitian and Mangarevan diphthong /ae/ is poetically equivalent to /ai/, and the Tahitian /ao/ to /au/. Poetic equivalence of /ae/ to /ai/ has been similarly observed by Jacob Love to apply to Samoan rhyme (Love 1991:88). Finally, the glottal stop phoneme /ʔ/ was determined to serve no role in Tahitian poetic function.
4 A tradition's poetic line must be established before line boundary may be included as a pattern element. Nigel Fabb suggests that the concept of line is a poetic universal (Fabb 2009:54-55). It generally represents a syntactic structure with a specific metrical count, although for some traditions it may be non-metrical, bounded by some indicator such as a pause or lengthened vowel. Its identification, perhaps through trial and error, should be accomplished early on in the analysis.
5 These criteria were empirically motivated mostly from analysis of the Polynesian data, and so may evolve after the described process has found application to a wider variety of traditions.
Poetic intent might subsequently be asserted if either of the following were satisfied:
1. The candidate pattern was found to match any method of poetic organization documented for other of the world's poetic traditions.
2. For promising pattern types unspecified in the literature, a pattern might be esteemed to self-justify as poetic were it found to be sufficiently complex or repetitive so as to eliminate the likelihood of chance.
The following sections will discuss single-feature pattern detection, multi-feature pattern detection, and detection of inverted parallelism. Examples will be provided of application of the procedure to a passage from a familiar English children's poem, and to extracts from several of the transcribed Tahitian and Mangarevan oral texts.

Single-Feature Pattern Detection
In single-feature pattern detection, only one linguistic feature is analyzed at a time. As with the other detection methods, the possibility exists of poetic intent whenever an n-gram token recurs.
The first four lines of the well-known children's poem Mary had a little lamb will serve to initially demonstrate this type of analysis. The passage in (1) has been tagged for three word-level linguistic features: IPA word form, simple part-of-speech, and word syllable count.
(1) Passage from Mary had a little lamb tagged for word form, simple part-of-speech, and word syllable count
The list of bi-gram word form tokens from this passage would begin:
The list of 4-gram simple part-of-speech tokens would begin:
From a tally of matching simple part-of-speech bigrams, we note in (2) below four occurrences of NOUN-VERB.
(2) Some bigram repetition in the Mary had a little lamb passage
Level of analysis: Word
Linguistic feature: Simple part-of-speech
Boundary relevance: Line boundary is significant.
Minimum pattern occurrences = 4
With prior knowledge that English is an SVO language, however, the NOUN-VERB pattern candidate is dismissed during manual review as being common as well to English prose.⁶
In (3) below, we find repetition of the word syllable count 11-gram 1-2-1-|-1-1-1-1-1-1-|,⁷ corresponding to a little lamb | whose fleece was white as snow |, and that Mary went | her lamb was sure to go |.
(3) 11-gram repetition in the Mary had a little lamb passage
Level of analysis: Word
Linguistic feature: Word syllable count
Boundary relevance: Line boundary is significant.
Minimum pattern occurrences = 2
1. Mary had a little lamb
   2 1 1 2 1
2. whose fleece was white as snow
It may be that parallelism of such a long pattern is metrically significant, although this would be difficult to confirm given just one recurrence. It should be reiterated that while patterns which emerge out of a single text are not always conclusively poetic, when compared with similar pattern organization in other texts, poetic intent often becomes clear.
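The 11-gram repetition described for (3) can be reproduced with a short sketch. The syllable counts for line 1 are those given in (3); the counts for lines 2 to 4 are inferred from the pattern description and from the syllable rhymes listed in (4). The function name `repeated` and the flat, space-separated encoding are assumptions made for illustration.

```python
from collections import defaultdict

# Word syllable counts for the four lines, with "|" marking line boundary.
# 22 words plus 5 boundaries gives 27 word-level elements.
passage = "| 2 1 1 2 1 | 1 1 1 1 1 1 | 1 3 1 2 1 | 1 1 1 1 1 1 |".split()


def repeated(seq, n, min_count=2):
    """Map each n-gram occurring at least min_count times to its start positions."""
    found = defaultdict(list)
    for start in range(len(seq) - n + 1):
        found[tuple(seq[start:start + n])].append(start)
    return {gram: starts for gram, starts in found.items() if len(starts) >= min_count}


hits = repeated(passage, n=11)
for gram, starts in hits.items():
    print("-".join(gram), "starts at element positions", starts)
```

Because line boundary is treated as an ordinary element of the sequence, the same routine serves whether or not boundaries are deemed significant; dropping the "|" entries before calling it gives the boundary-free analysis.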
In (4), we turn to analysis at the syllable level. Here, we discover the apparent end-rhyming bigram /oʊ/-| of snow |, and go |.
(4) Some bigram repetition in the Mary had a little lamb passage
Level of analysis: Syllable
Linguistic feature: Syllable rhyme
Boundary relevance: Line boundary is significant.
Minimum pattern occurrences = 2
1. Mary had a little lamb
   ɛ ɪ æd ǝ ɪ əl æm
2. whose fleece was white as snow
   uːz iːs ǝz aɪt æz oʊ
3. and everywhere that Mary went
   ænd ɛv i ɛɹ æt ɛ ɪ ɛnt
4. her lamb was sure to go
   ɚ æm ǝz ɚ u oʊ
With prior knowledge that end-rhyme on alternating lines is common to English, French, and several other poetic traditions, we conclude that the intent here is poetic.
6 With regard to languages for which common patterns of prose, part-of-speech or otherwise, are unknown, the analysis process should be applied as well to a prose corpus, and its findings subtracted, either by automated or manual means, from poetry analysis results.
7 To ease readability, line boundary is indicated in some pattern descriptions as a vertical bar |.
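The syllable-level analysis in (4) can be sketched in the same way. The syllable rhymes below are those listed in (4); the variable names and the filter that keeps only bigrams ending in a line boundary are assumptions added for illustration.

```python
from collections import Counter

BOUNDARY = "|"

# Syllable rhymes for each line, as listed in example (4).
lines = [
    ["ɛ", "ɪ", "æd", "ǝ", "ɪ", "əl", "æm"],
    ["uːz", "iːs", "ǝz", "aɪt", "æz", "oʊ"],
    ["ænd", "ɛv", "i", "ɛɹ", "æt", "ɛ", "ɪ", "ɛnt"],
    ["ɚ", "æm", "ǝz", "ɚ", "u", "oʊ"],
]

# Flatten into one sequence with a boundary element before and after each line.
sequence = [BOUNDARY]
for line in lines:
    sequence.extend(line)
    sequence.append(BOUNDARY)

bigrams = Counter(zip(sequence, sequence[1:]))

# Candidate end-rhymes: recurring bigrams whose second element is a line boundary.
line_final = {bg: c for bg, c in bigrams.items() if bg[1] == BOUNDARY and c >= 2}
print(line_final)
```

The unfiltered tally also contains the recurring bigram ɛ-ɪ, which simply reflects the repeated word Mary. Separating such cases from genuine candidates is the kind of judgment left to manual review.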
In (5), we encounter a passage of a Mangarevan song⁸ which consists of a repeated syntactic frame, with the four nouns vai, kukau, aʔi, and inaina and the two adjectives rito and ka serving as its variable elements. We observe end-rhyme in lines 1 and 5 with the syllable rhyme pattern a-i| (in bold) corresponding to the nouns vai and aʔi, and note that a-i as a bigram is also contained within the name of the song's subject, the young woman Tai-tinaku-toro. We additionally observe assonant matching between the syllable rhyme bigram a-u (in bold underlined) of the noun ku.ka.u and the syllables na.ku of the woman's name. Finally, we note a match between the syllable rhyme bigram I-A (in bold small caps) of the noun i.na~i.na and the syllables ti.na of the woman's name.
(5) Some bi- and tri-gram repetition in an extract of a
If similar use of assonance were discovered in several other texts of the same genre, such should warrant a claim that assonant matching between a syntactic frame's variable elements and the poem's theme is a method of Mangarevan poetic organization.

Multi-Feature Pattern Detection
In multi-feature analysis, n-gram patterns are composed of cross-level linguistic feature information. This is motivated by a desire to be sufficiently expansive so as to detect a poet's efforts to coordinate more than one feature.⁹
In the Mary had a little lamb passage, the addition of a bit of manual semantic tagging reveals the following multi-feature tri-gram:
Semantics: lamb-part - Word form: wǝz - Part-of-speech: MODIF
The tri-gram token is provided in context in (6):
(6) Some multi-feature trigram repetition in the Mary had a little lamb passage
Level of analysis: Word
Linguistic features: Word form, simple part-of-speech, and "Mary-part" and "lamb-part" semantic tagging
Boundary relevance: All boundaries are ignored.
Minimum pattern occurrences = 2
Whether or not the recurrence of this tri-gram might be interpreted as poetic, it should be recognized that it would not have been detected by single-feature analysis.
9 Multi-feature detection was originally inspired by the bag of trees approach used by Data-Oriented Parsing, which permits assembling syntactic patterns from different levels of tree structure (see Bod 1998).
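One way to generate multi-feature n-grams is to let each position in a window contribute any one of its tagged features, which is what the sketch below does. The tag values are illustrative assumptions, since the tagged passage itself is not reproduced here: word forms other than wǝz are given orthographically, and the helper names `tok` and `multi_feature_ngrams` are hypothetical.

```python
from collections import Counter
from itertools import product


def tok(form, pos, sem=None):
    """One word with its tagged features; sem is None when untagged."""
    return {"form": form, "pos": pos, "sem": sem}


# Lines 2 and 4 of the passage, boundaries ignored. Tags are assumptions.
tokens = [
    tok("whose", "PRON"), tok("fleece", "NOUN", "lamb-part"),
    tok("wǝz", "VERB"), tok("white", "MODIF"),
    tok("as", "PREP"), tok("snow", "NOUN"),
    tok("her", "PRON", "Mary-part"), tok("lamb", "NOUN", "lamb-part"),
    tok("wǝz", "VERB"), tok("sure", "MODIF"),
    tok("to", "PART"), tok("go", "VERB"),
]


def multi_feature_ngrams(tokens, n):
    """Yield every n-gram that picks one available feature per position."""
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        choices = [[(f, v) for f, v in t.items() if v is not None] for t in window]
        yield from product(*choices)


counts = Counter(multi_feature_ngrams(tokens, 3))
target = (("sem", "lamb-part"), ("form", "wǝz"), ("pos", "MODIF"))
print(counts[target])
```

Each window of n positions yields up to F^n feature combinations, which is the source of the combinatorial growth that makes a reduced feature set desirable.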
From the Tahitian corpus, we find an 11-gram multi-feature token which combines information relevant to word form, syllable count, and word vowel:
Line boundary - Word form: e - Word form: noho - Line boundary - Syllable count: 1 - Syllable count: 2 - Line boundary - Word form: i - Word form: te - Word vowels: a-o-a - Line boundary
This token appears initially in lines 1 through 3 and then repeats in lines 4 through 6 of (7) below:
(7) Some multi-feature 11-gram repetition in an extract of "Warning by messengers of the paʻi-atua service" (Henry 1928:158-
It might be best to re-interpret this complex n-gram as simply providing evidence of two overlapping methods of organization: a 3-3-5 pattern of syllabic counting meter alongside an a-o-a pattern of end-rhyme. During manual review, an attempt should always be made to re-analyse candidates into more generalizable patterns.
It should be noted that, with regard to the Polynesian data, the discovery of poetic organization was generally achievable through single-feature analysis. Patterns only detectable through multi-feature analysis were uncommon.

Inverted Parallelism
In some poetic traditions, patterns of linguistic features are not always repeated as is, but rather by means of an inverted ordering. An example is chiasmus, an inversion of repeated semantic elements which is very common in the Ancient Hebrew of the Old Testament.
Automated detection of inverted parallelism is accomplished by simply comparing the linguistic feature n-grams of a given document with the n-grams generated from a reverse ordering of those features. As before, matching n-grams are sorted and counted, and then presented within the context of the non-reversed material.
In the Tahitian example given in (9) below, we find the 7-gram pattern of syllabic counting meter 6-4-5-3-3-3-4 which is followed, after a 5 count, by its inverted match 4-3-3-3-5-4-6. Due to the detection as well of many other patterns of inverted meter in the corpus, inversion of the patterns which govern syllabic counting meter was deemed to self-justify, under criteria mentioned above, as a method of Tahitian poetic organization.
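The comparison of forward n-grams against n-grams of the reversed feature sequence can be sketched as below, using the syllabic counts quoted above. The function name `inverted_matches` and the rule that the two spans must not overlap (so that a short palindromic run does not match itself) are assumptions added for illustration.

```python
def inverted_matches(seq, n):
    """Find n-grams that recur later in the text in reversed order.

    Returns (start, inverted_start, ngram) triples, where both positions
    index the original, non-reversed sequence.
    """
    rev = seq[::-1]
    length = len(seq)
    hits = []
    for i in range(length - n + 1):
        forward = tuple(seq[i:i + n])
        for j in range(length - n + 1):
            # Position j in the reversed sequence covers this original span.
            inv_start = length - j - n
            if inv_start < i + n:
                continue  # skip overlapping spans and duplicate pairings
            if tuple(rev[j:j + n]) == forward:
                hits.append((i, inv_start, forward))
    return hits


# Syllabic counts of successive lines: the 7-gram, a 5 count, then its inversion.
meter = [6, 4, 5, 3, 3, 3, 4, 5, 4, 3, 3, 3, 5, 4, 6]
for start, inv_start, gram in inverted_matches(meter, 7):
    print(gram, "at", start, "is inverted at", inv_start)
```

Reporting both positions against the original ordering keeps the match presentable in the context of the non-reversed material.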

Concerning Combinatorial Explosion
Inherent to the automated process is a combinatorial explosion of n-grams, particularly with regard to multi-feature analysis. The total number of single- and multi-feature n-gram tokens generated for a given text may be determined as described in figure 1. The number of word-level n-grams generated from a typical 1,000-word text, after restricting analysis to 10 layers of linguistic feature tagging, is the quite large 6.82 × 10^501. By reducing the number of tagged layers to four and maximum n to 10, however, the final count diminishes to a much more tractable 1.39 billion. It should be mentioned that foregoing multi-level analysis would permit maximum n to be set much higher.
Figure 1. Calculation for all single- and multi-feature n-gram tokens of a text.¹²
Given:
C = The count of all single- and multi-feature n-gram tokens which might be generated from a text at a given level of analysis (e.g. word level, syllable level).
E = The number of linguistic elements in the text (e.g. in the passage from Mary had a little lamb, we analysed at the word level where there are 22 words and 5 instances of line boundary, for a total of 27 word-level elements).
N = The current n-gram n number.
Max N = The n number of the largest desired n-gram. For an n-gram token to be able to occur at least twice, and thereby potentially demonstrate a pattern, max n should not exceed half the total number of linguistic elements (e.g. for word-level analysis of the passage from Mary had a little lamb, it would not be useful for n to be larger than 13).
F = The number of tagged linguistic features (e.g. the passage from Mary had a little lamb in (1) is tagged for three features).
$$C = \sum_{N=1}^{\mathrm{Max}\,N} \bigl(E - (N - 1)\bigr) \cdot F^{N}$$
It follows that a reduction in the interaction of linguistic features for a given pass would result in some patterns being missed by the automated process. Therefore, a certain degree of trial and error must be pursued in order to determine which combinations of four features at a time yield the best candidates. Furthermore, with a maximum n of just 10, it may become necessary to stitch together, either manually or through an automated process, adjacent and overlapping patterns.
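The calculation in figure 1 translates directly into a few lines of code. The function name `ngram_token_count` is hypothetical, and the first call assumes E = 1,000 for the 1,000-word text, i.e. that line boundaries are not counted among its elements.

```python
def ngram_token_count(e, f, max_n):
    """C = sum over N = 1..max_n of (E - (N - 1)) * F**N, as in figure 1."""
    return sum((e - (n - 1)) * f ** n for n in range(1, max_n + 1))


# Reduced setting discussed in the text: four tagged layers, maximum n of 10.
reduced = ngram_token_count(e=1000, f=4, max_n=10)
print(f"{reduced:,}")

# The Mary had a little lamb passage: 27 elements, 3 features, maximum n of 13.
mary = ngram_token_count(e=27, f=3, max_n=13)
print(f"{mary:,}")
```

Under that assumption the first call reproduces the figure of roughly 1.39 billion tokens. Because the F^N term dominates the sum, lowering either the number of features or the maximum n reduces the total sharply.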
art is enriched through their study. Relevant to the level of detail required for such research, John Miles Foley asserts that "We must give the idiosyncratic aspects of each tradition their due, for only when we perceive sameness against the background of rigorously examined individualized traits can we claim a true comparison of oral traditions" (Foley 1981:275).
The procedure which has been described here is admittedly labor-intensive, especially with regard to its manual component. However, it is probably necessary that it be so in order to succeed at documenting the majority of a poetic tradition's individualized traits. Relevant to the Tahitian material, the procedure was successful at the detection of a syllabic counting meter based upon word stress (see Meyer 2013:88-105). Such a meter was previously unattested among world poetries, and with its discovery our understanding of what is universally possible for meter has been expanded.