Converting a comprehensive lexical database into a computational model: The case of East Cree verb inflection

In this paper we present a case study of how comprehensive, well-structured, and consistent lexical databases, one indicating the exact inflectional subtype of each word and another exhaustively listing the full paradigm for each inflectional subtype, can be quickly and reliably converted into a computational model of the finite-state transducer (FST) kind. As our example language, we will use (Northern) East Cree (Algonquian, ISO 639-3: crl), a morphologically complex Indigenous language. We will focus on modeling (Northern) East Cree verbs, as their paradigms represent the most richly inflected forms in this language.


Introduction
In this paper we present a case study of how comprehensive, well-structured, and consistent lexical databases, one indicating the exact inflectional subtype of each word and another exhaustively listing the full paradigm for each inflectional subtype, can be quickly and reliably converted into a computational model of the finite-state transducer (FST) kind. As our example language, we will use (Northern) East Cree (Algonquian, ISO 639-3: crl), a morphologically complex Indigenous language. We will focus on modeling (Northern) East Cree verbs, as their paradigms represent the most richly inflected forms in this language.

Background on East Cree
East Cree is a Canadian Indigenous language spoken by over 12,000 people in nine communities situated in the James Bay region of Northern Quebec. It is still learned by children as their first language and is spoken fluently in schools and in the communities overall, where it is used in most spheres of life as an oral language. Speakers have basic literacy in East Cree, but written communication tends to be in English or, to a lesser extent, French. The language is fairly well documented, with the main resources available on a web site (www.eastcree.org) that includes multilingual dictionaries of two dialects (English, French, East Cree Northern and Southern dialects), thematic dictionaries, an interactive grammar, verb conjugation applets, oral stories, interactive lessons and exercises, a book catalogue, and various other resources such as typing tools for the syllabics used, tutorials, and spelling manuals.

Verb structure
The verb in Northern East Cree follows the general Algonquian structure. Verbs fall into four major types according to transitivity and the animacy of the participants (intransitive with inanimate, or no, subject: II; intransitive with animate subject: AI; transitive with animate subject and inanimate object: TI; and transitive with animate subject and animate object: TA). Verbs are inflected for the person of the subject and/or object, and for modality. There are three major types of inflections, with their specific properties, known as orders: independent, conjunct, and imperative. As can be seen in the examples below, only the independent forms have person prefixes, while for conjunct and imperative forms person is only marked in the suffixes.
The orders can be divided into subparadigms according to how modality is marked in the post-verbal suffix complex. For Northern East Cree, a total of 15 distinct subparadigms have been identified, 7 for independent, 6 for conjunct, and 2 for imperative (Junker & MacKenzie, 2015), taking a classical Word-and-Paradigm approach (Blevins, 2011). In addition, verb stems can be combined with several pre-stem elements, known as preverbs, which can be divided into grammatical and lexical ones and which functionally correspond to auxiliary verbs or adverbials in English. For example, the preverb chî(h) indicates 'past' tense, wî 'want', and nitû 'go and V'. These are illustrated in examples (1a-c) below.
As for orthographical conventions, grammatical and lexical preverbs are separated from the rest of the verb construction by spaces (though this is not followed consistently for lexical preverbs, which are sometimes written attached to the stem). Personal prefixes (in the case of independent order forms) are attached to the first preverb or the verb stem, as can be seen in (1a) and (3a-b). Moreover, long vowels may be indicated with a circumflex, such as <â>, used throughout the examples in this paper, or by doubling the vowel grapheme, i.e. <â> could alternatively be written as <aa>. The double-vowel notation is used for long vowels in the computational model to be discussed below.
Morphophonology: While Northern East Cree (NEC) is fairly regularly agglutinative in its structure, there are some morphophonological phenomena occurring at the stem-suffix juncture, at the juncture between the prefix and the initial morpheme of the preverb or verb stem, as well as within the suffix complex. For instance, a template morphology approach such as Collette (2014) presents 10 different suffix positions for the NEC verb.
Furthermore, in the case of conjunct verb forms, the first syllable of the verbal complex, whether that of the first preverb or the stem, can undergo ablaut, known as Initial Change (IC), resulting in a changed conjunct form. For example, the vowel -â- of the first syllable of the verb mâtû below (2a-c) changes to -iyâ- in the conjunct neutral form used in partial questions. Initial change of the verb stem only happens when there is no preverb before the verb stem, as preverbs can undergo initial change as well (cf. Junker, Salt & MacKenzie, 2015a).
To account for morphophonological phenomena at the stem-suffix juncture, Junker, Salt and MacKenzie (2015b) identify up to 19 stem types. For example, t-/sh-stems alternate depending on person marking (3a-b), and h-stems trigger vowel i-lengthening (4).
All the inflectional information above is encoded into two databases, (1) a verb paradigm database and (2) a dictionary database. The verb paradigm database, consisting of 9,457 entries, lists exhaustive paradigms for each inflectional subtype (19 in all), plus some partial paradigms as well. That is, all basic prefix and suffix sequence combinations, indicating the person and number of subject (for all verb classes) and object (for TA verbs) as well as the various possible types of modality, are identified for each inflectional paradigm subtype and verb class (II, AI, TI, TA). Each entry in the verb paradigm database is a fully inflected verb form, which is associated with the relevant set of morphological features (Table 1).
The dictionary database (15,614 entries) (Junker et al. 2012) determines the inflectional subtype for each verb. This allows for linking each verb with its entire paradigm according to a model verb for each inflectional subtype, as enumerated in the verb paradigm database. In addition, a technical stem, in both its regular and changed (conjunct) form, is explicitly stored for each verb in the dictionary database. Using these technical stems and the corresponding word-final (and word-initial) technical suffix chunks from the verb paradigm database, one can generate all the inflected forms by simple concatenation, without needing any morphophonological rules. Nevertheless, one needs to bear in mind that these technical stems and word-final technical suffix chunks have no morphological reality, but are simply representations of convenience (see Junker & Stewart, 2008). Furthermore, all grammatical and lexical preverbs are also included as their own entries in the dictionary database, and we treat initial-changed forms of preverbs as separate entries labelled as conjunct preverbs.
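The concatenative generation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table layout, the field names, the subtype label `AI-aa`, and the uppercase placeholder strings (`LEMMA`, `STEM`, `CHANGED`, `SFX`) are all assumptions made for illustration. Only the prefix ni-, the suffix chunk -aan, and the tags +Indic, +Neu and +1Sg come from the description of the model given later in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixChunk:
    prefix: str            # word-initial chunk (empty when there is no person prefix)
    suffix: str            # word-final technical suffix chunk
    tags: tuple            # morphological feature tags for the whole chunk
    changed: bool = False  # True if the form uses the changed (conjunct) stem


# Hypothetical excerpt of the verb paradigm database, keyed by inflectional subtype.
PARADIGMS = {
    "AI-aa": [
        SuffixChunk("ni", "aan", ("+Indic", "+Neu", "+1Sg")),
        # Placeholder row, only to show how a changed-stem form would be selected.
        SuffixChunk("", "SFX", ("+Cnj",), changed=True),
    ],
}

# Hypothetical dictionary entry: subtype plus regular and changed technical stems.
LEXICON = {
    "LEMMA": {"subtype": "AI-aa", "stem": "STEM", "changed_stem": "CHANGED"},
}


def generate_forms(lemma):
    """Return (surface form, analysis) pairs built by plain concatenation."""
    entry = LEXICON[lemma]
    forms = []
    for chunk in PARADIGMS[entry["subtype"]]:
        stem = entry["changed_stem"] if chunk.changed else entry["stem"]
        surface = chunk.prefix + stem + chunk.suffix
        forms.append((surface, lemma + "".join(chunk.tags)))
    return forms
```

No rewrite rule is involved at any point: every alternation is assumed to be already precomposed in the stored stems and chunks, which is what makes the lookup-and-concatenate loop sufficient.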

Computational modeling of the Northern East Cree verb
As our computational modeling technology, we are using Finite-State Transducers (FST) (e.g. Beesley & Karttunen 2003), well-known computational data structures that are optimized for word form analysis and generation, with a calculus for powerful manipulations. FSTs are easily portable to different operating systems and platforms, and thus can be packaged and integrated with other software applications, for example to provide spell-checking functionality within a word processor. In designing a finite-state computational model for a fairly regularly agglutinative language such as East Cree, one has to decide whether to model morphophonological alternations at stem+affix junctures by (1) dividing stems into subtypes, each associated with its own inflectional affix set that can simply be glued onto the stem, or (2) using context-based rewrite rules. Furthermore, one has to decide the extent to which one treats affix sequences by splitting these into their constituent morphemes, each associated with one morphosyntactic feature, or rather treats affixes as chunks which are associated with multiple morphosyntactic features (Arppe et al., in press). The more one splits affix sequences, the more one may need to develop and test rules for dealing with morphophonological alternations at these morpheme junctures, whereas in the case of chunking such alternations are precomposed within the chunk. In contrast, the more one uses chunks, the more chunks one has to enumerate, in proportion to the number of relevant inflectional subtypes.
While the chunking strategy is not parsimonious and compact, in our experience it results in FST source code which is nevertheless structurally quite flat and easily comprehensible for scholars who are not specialists in the language in question. Importantly, current finite-state compilers, e.g. XFST, HFST, or FOMA (Beesley and Karttunen 2003; Lindén et al. 2011; Hulden 2009), implement a minimization procedure on the finite-state model, so that recurring realizations of string-final character sequences and associated morphological features are systematically identified and merged, resulting in the end in a relatively compact model (that in practice might not be much larger, nor structurally substantially different, than a model compiled from source code implementing maximal splitting). On the other hand, if some aspect of the chunked morpheme sequences needs to be changed, with the chunking strategy the change has to be implemented in a potentially large number of locations.
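The effect of minimization on an exhaustively enumerated word list can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not what XFST, HFST, or FOMA do internally: it treats the model as a plain acyclic acceptor over characters rather than a transducer, and the word strings are placeholders. States are merged whenever they accept exactly the same set of continuations, which is how shared string-final sequences collapse into a single path.

```python
def build_trie(words):
    """Build a prefix tree; each state is {"final": bool, "next": {char: state}}."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_trie_states(node):
    """Number of states before any merging (one per distinct prefix)."""
    return 1 + sum(count_trie_states(child) for child in node["next"].values())


def count_minimal_states(root):
    """Number of states after merging states with identical continuations."""
    registry = {}  # signature of a state -> id of its equivalence class

    def visit(node):
        # Two states are equivalent if they agree on finality and on the
        # equivalence classes reached by each outgoing character.
        signature = (
            node["final"],
            tuple(sorted((ch, visit(child)) for ch, child in node["next"].items())),
        )
        return registry.setdefault(signature, len(registry))

    visit(root)
    return len(registry)
```

With the two placeholder words `xaan` and `yaan`, the trie has nine states while the merged automaton has five, because the shared ending `aan` is stored only once. The same sharing is what keeps a chunk-enumerating source file from producing a proportionally large compiled model.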
For the Northern East Cree model, we decided to (1) split the pre-stem morphemes (personal prefixes for the independent order forms, and the regular and initial-changed forms of the grammatical and lexical preverbs), as the pre-stem part involves very few morphophonological phenomena (initial change, epenthesis), and these are very regular, with initial change handled by exhaustively listing the two alternative forms of each preverb or stem (regular vs. changed); (2) entirely chunk the post-stem suffix morphemes, associating the chunks with multiple morphological feature tags; and (3) make maximal use of inflectional subtypes by using the aforementioned technical stems and post-stem word-final technical suffix chunks. Thus we require no morphophonological rules for the stem-suffix morpheme juncture, and only two regular morphophonological rules in the pre-stem part. These morphophonological rules are implemented using the TWOLC formalism within the FST framework. As to the rest, the LEXC formalism in the FST framework is used to define the concatenation of the morpheme sequences as treated above. For Independent order forms, in which subject (and object) person and number are marked with a combination of a prefix and a suffix (which can be understood to constitute a circumfix), agreement constraints between these affixes are implemented with the flag diacritic notation within the LEXC formalism.
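The prefix-suffix agreement enforced by flag diacritics can be simulated with a short Python sketch. It models only unification flags of the form `@U.feature.value@`, following their standard semantics (the first occurrence sets the feature, and any later occurrence must carry the same value). The function name, the placeholder stem `STEM`, and the flag value `OTHER` are hypothetical. The flag `@U.person.NI@`, the prefix ni-, and the suffix chunk -aan are taken from the description of the model in this paper.

```python
import re

# Matches unification flags such as @U.person.NI@ and captures feature and value.
FLAG = re.compile(r"^@U\.([^.@]+)\.([^.@]+)@$")


def realize(symbols):
    """Return the surface string for a path, or None if two flags clash."""
    features = {}
    surface = []
    for symbol in symbols:
        match = FLAG.match(symbol)
        if match is None:
            surface.append(symbol)  # ordinary symbol: part of the surface form
            continue
        feature, value = match.groups()
        # First occurrence sets the feature; later ones must agree with it.
        if features.setdefault(feature, value) != value:
            return None
    return "".join(surface)


# Prefix and suffix carry the same flag value, so the path is accepted.
agreeing = ["@U.person.NI@", "ni", "STEM", "@U.person.NI@", "aan"]
# A prefix that set a different (hypothetical) value blocks the same suffix chunk.
clashing = ["@U.person.OTHER@", "PFX", "STEM", "@U.person.NI@", "aan"]
```

Under these semantics a path with no prefix flag at all would still pass, since the suffix flag would simply set the feature. A model relying on unification flags alone would therefore presumably also need the prefixless path to set some value of its own, which is a design point this sketch leaves open.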

Model statistics and details
The computational model currently includes stems and suffixes for AI, TI, and TA, but not for II verbs (which have the simplest paradigms). The LEXC source code for verb affixal morphology in its current form consists of 16,590 lines, of which 68 concern the pre-stem component and 16,514 the post-stem technical suffix chunks. With minimization, its compilation with XFST takes 5.462 seconds with a 2 GHz Intel Core i7 processor and 8MB of RAM, resulting in a 108 kB XFST model (1,084 kB with HFST). While this full enumeration of suffix chunks for each inflectional paradigm type results in a large number of lines in the LEXC code, in comparison to a decompositional approach, the structure of the source code is quite flat and easy to grasp. As can be seen in Table 3, which presents the source code for the Independent Neutral Indicative suffix chunks for Animate Intransitive verbs of the -aa paradigm type, the suffix chunk -aan, which requires a first person prefix ni- to have been observed at the very beginning of the verb construction, as indicated by the flag diacritic @U.person.NI@, is associated with three morphological tags +Indic, +Neu and +1Sg, corresponding to the morphological features INDICATIVE, NEUTRAL and FIRST PERSON SINGULAR actor, respectively. In addition, the numeric code +[01] is provided, indicating the paradigm subset for Regular (Non-Relational) Independent Neutral Indicative verb forms. Example analyses provided by the FST analyzer for the forms (1a-c) are presented below in (5a-c). Grammatical and lexical preverbs are indicated with the notation PV/…+, and the subset of the paradigm with a notation using bracketed numbers, e.g. +[05].
The almost entirely concatenative modeling strategy described above is made possible by the exhaustive listing of the technical stems (both regular and changed) for each verb in the dictionary database, and by the likewise comprehensive enumeration of all inflected forms for each subtype in the verb paradigm database, with one of the representations of each inflected form providing a partitioning into the technical stem and a technical suffix chunk. All the forms in the verb paradigm database have been verified in countless sessions with fluent East Cree Elders over decades.
Importantly, though the creation of the two databases has taken a substantial amount of meticulous human work and scrutiny, and while FST source code for the (relatively straightforward) pre-stem component has been written by hand, the FST source code for the suffix component is generated in its entirety from the underlying two lexical databases, minimizing the risk for human typing error (when the underlying databases are error-free). Equally importantly, the automatic generation allows for easy generation of revised versions, if changes need to be implemented.
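Generating the suffix component from database rows, as described above, could look roughly like the following Python sketch. The row format, the function names, and the lexicon name `AI_aa_IndicNeu` are hypothetical, and the exact layout of the authors' generated LEXC entries is not reproduced here, so the entry shape is an assumption. It relies only on the general LEXC conventions of `LEXICON` blocks and `upper:lower Continuation ;` entries, and it leaves out the numeric paradigm-subset code.

```python
def lexc_entry(chunk, tags, flag="", continuation="#"):
    """Format one LEXC entry pairing the tag string with the surface chunk."""
    upper = flag + "".join(tags)  # analysis side: flag diacritic plus feature tags
    lower = flag + chunk          # surface side: flag diacritic plus suffix chunk
    return f"{upper}:{lower} {continuation} ;"


def lexc_lexicon(name, rows):
    """Format a LEXICON block from rows with 'chunk', 'tags' and optional 'flag'."""
    lines = [f"LEXICON {name}"]
    lines += [lexc_entry(**row) for row in rows]
    return "\n".join(lines)


# Hypothetical row as it might be read from the verb paradigm database.
rows = [
    {"chunk": "aan", "tags": ["+Indic", "+Neu", "+1Sg"], "flag": "@U.person.NI@"},
]
source = lexc_lexicon("AI_aa_IndicNeu", rows)
```

Because every entry is produced by the same formatting function, a change in tag names or entry layout is made once in the script and then propagated by regenerating the file, rather than edited by hand across thousands of lines.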
In terms of the time required, creating this general FST architecture, manually coding the basic pre-stem morphology, and developing the scripts for automatically generating the post-stem FST source code took altogether two weeks of work by three people.

Conclusion
Given comprehensive, well-structured resources such as those described above, and people with appropriate programming and linguistic skills, the brute-force listing strategy presented in this paper is a surprisingly fast and efficient way of creating a finite-state computational model that can form a basis for the subsequent development of practical end-user applications.