Learning Transducer Models for Morphological Analysis from Example Inflections

In this paper, we present a method to convert morphological inflection tables into unweighted and weighted finite transducers that perform parsing and generation. These transducers model the inflectional behavior of morphological paradigms induced from examples and can map inflected forms of previously unseen words to their lemmas and give morphosyntactic descriptions of them. The system is evaluated on several languages with data collected from the Wiktionary.


Introduction
Wide-coverage morphological parsers that return lemmas and morphosyntactic descriptions (MSDs) of arbitrary word forms are fundamental for achieving strong performance on many downstream tasks in NLP (Tseng et al., 2005; Spoustová et al., 2007; Avramidis and Koehn, 2008; Zeman, 2008; Hulden and Francom, 2012). This is particularly true for languages that exhibit rich inflectional and derivational morphology. Finite-state transducers are the standard technology for addressing this issue, but constructing them often requires not only a significant commitment of resources but also linguistic expertise from the developers (Maxwell, 2015). Access to large numbers of example inflections organized into inflection tables in resources such as the Wiktionary promises to offer a less laborious route to constructing robust large-scale analyzers. Learning morphological generalizations from such example data has been the focus of much recent research, particularly in the domain of morphologically complex languages (Cotterell et al., 2016).
In this paper we present a tool for automatic generation of both probabilistic and non-probabilistic morphological analyzers that can be represented as unweighted and weighted transducers. The assumption is that we have access to a collection of example word forms together with corresponding MSDs. We present two systems: one that is designed to be high-recall and operates with unweighted automata, the purpose of which is to return all linguistically plausible analyses for an unknown word form; the second is an addition to the first in that the word shapes are modeled with a generative probabilistic model that can be implemented as a weighted transducer that produces a ranking of the plausible analyses. The analyzers are constructed with standard finite state tools and are designed to operate similarly to a hand-constructed morphophonological analyzer extended with a 'guesser' module to handle unknown word forms.
The system takes as input sets of lemmatized words annotated with an MSD, all grouped into inflection tables, such as can be found in, for example, the Wiktionary. The output is a morphological analyzer, either as an unweighted model (in the non-probabilistic case) or a weighted model (in the probabilistic case). For the non-probabilistic case we use the Xerox regular expression formalism (Karttunen et al., 1996), which we compile into a transducer with the open-source finite-state toolkit foma (Hulden, 2009), and for the weighted case we have used the Kleene toolkit (Beesley, 2012).

Paradigm Learning
The starting point for the research in this paper is the notion that inflections and derivations of related word forms can be expressed as functions. This idea is often filed under the rubric of 'functional morphology' and is strongly related to word-and-paradigm models of morphology (Hockett, 1954; Robins, 1959; Matthews, 1972; Stump, 2001). In particular, we assume a model where a single function generates all the possible inflected forms of a group of lemmas that behave alike. This approach has earlier been seen as an alternative to finite-state morphology, and the functions that model inflectional behavior have been hand-built in much previous work (Forsberg and Ranta, 2004; Forsberg et al., 2006; Détrez and Ranta, 2012). Here, we assume the recent model of Ahlberg et al. (2014) and Ahlberg et al. (2015), which automatically learns the functions that model inflection tables from labeled data.
Figure 1: [...] (2) the aligned Longest Common Subsequence is extracted; (3) resulting identical paradigms are merged. If the resulting paradigm $f_1$ is interpreted as a function, $f_1$(shr, nk) produces shrink, shrank, shrunk. (Figure content: tables such as drink, drank, drunk generalize to $x_1$ i $x_2$, $x_1$ a $x_2$, $x_1$ u $x_2$; a second paradigm has the forms $x_1$, $x_1$ ed, $x_1$ ed.)
The purpose of modeling inflection types as functions is to be able to generalize concrete manifestations of word inflection for specific lemmas, and to apply those generalizations to unseen word forms. The generalization in question is performed by extracting the Longest Common Subsequence (LCS) from all word forms related to some specific lemma and then expressing each word form in terms of the LCS (Hulden, 2014). The LCS in turn is broken down into possibly discontiguous sequences that express parts of word forms that are variable in nature. Figure 1 shows a toy example of four inflection tables generalized into variable and non-variable parts by first extracting the LCS, expressing the original word forms in terms of this LCS, and then collapsing the resulting functions that are identical. The resulting representation is essentially a set of strings that have variable parts ($x_1, \ldots, x_n$) and fixed parts (such as i, a, u). It can be used to generate an unbounded number of new inflection tables by instantiating the variable parts in new ways and concatenating the variables and the fixed parts.
This learning method often produces a very small number of functions compared with the number of complete inflection tables that have been input, since many lemmas behave alike and result in identical functions. We note that the output of this procedure is human-readable, i.e. it can be inspected (even in real-world scenarios) for correctness and also hand-corrected in case of noise in the learning data. In the current work, we use these functions as the backbone of a generative model and implement them as transducers that can be run in the inverse direction to map fully inflected forms into their lemmas and morphosyntactic descriptions.
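The LCS extraction step can be illustrated with a small Python sketch. It is only an approximation of the procedure described above: it folds a standard two-string LCS over the table pairwise, which is not guaranteed to find the true LCS of many strings, and it does not perform the subsequent split into variables. The function names `lcs` and `table_lcs` are illustrative, not taken from the authors' tool.

```python
from functools import reduce


def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two strings via dynamic programming."""
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def table_lcs(forms):
    """Approximate the LCS of a whole inflection table by pairwise folding."""
    return reduce(lcs, forms)


print(table_lcs(["swim", "swam", "swum"]))     # swm
print(table_lcs(["drink", "drank", "drunk"]))  # drnk
```

For the toy tables, the shared material (swm, drnk) is what would be cast as the variables, with the vowels i, a, u left over as fixed parts.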

Paradigm functions
The variables $x_1, \ldots, x_n$ that are used in the paradigm function representation capture possible inter-word variation. This means that each lemma that gives rise to an inflection table can be directly represented as simply an instantiation of the variables, together with the inflection function. As seen in Figure 1, the function $f_1$ learned from the inflection tables swim and drink can be used to represent some other word, e.g. sing, by instantiating $x_1$ as s and $x_2$ as ng.
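A paradigm function of this kind can be sketched in a few lines of Python. The encoding below (integers index variables, strings are fixed parts) and the names `F1` and `instantiate` are assumptions made for illustration only, not the representation used by the authors.

```python
# Paradigm f1 as a list of word-form templates.
# Integers refer to variables (0 -> x1, 1 -> x2); strings are fixed parts.
F1 = [
    [0, "i", 1],  # x1 i x2, e.g. drink
    [0, "a", 1],  # x1 a x2, e.g. drank
    [0, "u", 1],  # x1 u x2, e.g. drunk
]


def instantiate(paradigm, variables):
    """Generate a full inflection table from variable instantiations."""
    return [
        "".join(variables[part] if isinstance(part, int) else part for part in form)
        for form in paradigm
    ]


print(instantiate(F1, ["s", "ng"]))    # ['sing', 'sang', 'sung']
print(instantiate(F1, ["shr", "nk"]))  # ['shrink', 'shrank', 'shrunk']
```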
As we collect a large number of inflection tables, many of which result in identical paradigms, we can also collect statistics about the variables involved and how they were instantiated in the original tables. For example, from the truncated tables in Figure 1, we can gather that $f_1$ has witnessed $x_1$ as both dr and sw, and $x_2$ as nk and m. These statistics can be used to turn the learned functions into a restricted generative model that produces entire inflection tables while also taking advantage of how variables tend to be instantiated in that paradigm function.
Additionally, since each possible inflected form consists of the same variables, we can also define a string-to-string mapping between any two related forms, where the content of the variable parts stays fixed and the non-variable parts change. For example, in Figure 1, we know that we can, for some verbs, go from the past participle (e.g. drunk) to the past (e.g. drank) by a string transformation $x_1$ u $x_2$ → $x_1$ a $x_2$, with some constraints on the nature of $x_1$ and $x_2$. This information can then be encoded in transducer form, where the variable parts can be modeled as a probabilistic language model (for weighted transducers) or a nonprobabilistic, constrained model (for unweighted transducers). Figure 2 illustrates this idea. We have learned a paradigm in Spanish, which we (arbitrarily) call avenir, since that is one of the 12 verbs that behaved the same way and gave rise to the same function. A natural mapping to learn from the data is how to go from any inflected form to the dictionary or 'citation' form. For example, going from the present participle to the infinitive involves changing the fixed i occurring between the two variables $x_1$ and $x_2$ into e and then changing the fixed suffix after $x_2$ from iendo to ir. The figure also shows how the different variables were instantiated in the training data: $x_1$ showed up in varying shapes (but always ending in v), while $x_2$ was always n.
Although the learning model in principle states nothing about the nature of the variables, morphophonological restrictions will constrain their appearance and the key to producing a transducer that can inflect unseen words without undue overgeneration is to take these restrictions into account. We do so in two ways: (1) for the unweighted case, we collect statistics on the seen variables and constrain their possible shapes in an absolute manner, and (2) for the weighted case, we induce a language model over the shapes of the variables, which can later be used to rank parses produced by the system.

The unweighted case: constraining variables
Different generalized inflection tables naturally give rise to different variable instantiations for $x_1, \ldots, x_n$. However, many of the seen variables will not differ arbitrarily in a paradigm. This is something we can take advantage of when designing a parsing mechanism; in particular, we can express a preference for parses in which variables resemble already seen instantiations. Figure 3 is a case in point. Here, we show the implicit string-to-string rule in the paradigm which derives the lemma form from the present participle and the first person singular present forms in Spanish. In both of the paradigms learned, the variables $x_1$ and $x_2$ show a somewhat repetitive pattern. In the paradigm avenir, $x_1$ ends in the letter v for all the inflection tables seen that produced that paradigm, while $x_2$ always consists of the single letter n. Likewise, in the other paradigm (negar), $x_2$ is consistently the string eg across all forms seen (the inflected forms of cegar, denegar, etc.). The only variable that does not show such a regular pattern is the $x_1$ variable for the paradigm called negar.

Estimating probabilities of new variable instantiations
That the parts of paradigms that vary from lemma to lemma, i.e. the 'variables', are not subject to arbitrary variation can be used to constrain their shape. To model the unweighted transducers, we begin by formalizing our belief in not seeing novel variable shapes in the future. To quantify this, we assume we have seen $n$ concrete instantiations of $t$ different types of variables, and subsequently ask: if there were in fact $t + 1$ types, all of which are drawn from a uniform distribution, how likely are we to have witnessed only the $t$ types we did? This quantity can be expressed as

$$p_{\text{unseen}} = \left(1 - \frac{1}{t+1}\right)^{n}$$

For example, the measure for the $x_2$ variable in Figure 3 (avenir), which was seen 12 times and always as the single type n, becomes $\left(1 - \frac{1}{2}\right)^{12} \approx 0.0002$. We can use this as a cutoff parameter that defines how much evidence we require to declare a variable not subject to further variation apart from the types we have already seen. With this, we assume that if $p_{\text{unseen}} \leq 0.05$ for some variable, that variable in the paradigm will not exhibit new types.
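The cutoff test is easy to check numerically. The sketch below assumes the quantity is $(1 - 1/(t+1))^n$, which is the reading consistent with the worked avenir example (one type, 12 instantiations); the function names are illustrative.

```python
def p_unseen(num_types: int, num_instances: int) -> float:
    """Probability of having missed one of t+1 uniform types in n draws."""
    return (1.0 - 1.0 / (num_types + 1)) ** num_instances


def is_fixed(instances, cutoff: float = 0.05) -> bool:
    """Declare a variable 'fixed' if new types are sufficiently unlikely."""
    return p_unseen(len(set(instances)), len(instances)) <= cutoff


# x2 in the avenir paradigm: always 'n' across 12 inflection tables.
print(round(p_unseen(1, 12), 6))  # 0.000244
print(is_fixed(["n"] * 12))       # True
```

Note that with only a handful of observations the test is deliberately conservative: a variable seen four times with a single type gives $0.5^4 = 0.0625$, which is still above the 0.05 cutoff.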

Expressing constraints through regular expressions
Figure 2: Examples of two single generalized word forms mapped to lemmas followed by morphosyntactic description, e.g. ciego → cegar[person=1st number=singular mood=indicative]. The parts that correspond to constraints of the variables $x_1$ and $x_2$ are marked. Transitions marked @ are identity transduction 'elsewhere' cases, matching any symbol not explicitly mentioned in the state.

| Inflection | Generalization | MSD | Inflection | Generalization | MSD |
|---|---|---|---|---|---|
| avenir | $x_1$+e+$x_2$+ir | infinitive | negar | $x_1$+$x_2$+ar | infinitive |
| aviniendo | $x_1$+i+$x_2$+iendo | pres part | negando | $x_1$+$x_2$+ando | pres part |
| avenido | $x_1$+e+$x_2$+ido | past part | negado | $x_1$+$x_2$+ado | past part |
| avengo | $x_1$+e+$x_2$+go | 1sg pres ind | niego | $x_1$+i+$x_2$+o | 1sg pres ind |
| avienes | $x_1$+ie+$x_2$+es | 2sg pres ind | niegas | $x_1$+i+$x_2$+as | 2sg pres ind |

Table 1: Two partial Spanish verb inflection tables generalized into paradigm functions. The segments that are part of the longest common subsequence, which are cast as variables in the generalization, are shown in boldface in the inflection tables.

We also expand this measure to cover variables that show variation only in non-edge positions. For example, $x_2$ in the avenir paradigm in Figure 3 is always n and can be assumed not to be subject to variation by the calculation above. The paradigm's $x_1$ variable, however, cannot. That variable seems to vary much more, with the exception of the last letter, which is always v. To capture this, we extend the method to apply not only to the whole string, but also to edge positions of the string. First, we examine the whole string, and if that fails to yield the conclusion that the variable is 'fixed', we find the longest prefix and suffix which can be assumed to be fixed by the same measure. With this we construct a regular expression that models the variables as follows:
1. $(w_1 \cup w_2 \cup \ldots \cup w_n)$ if the variable is assumed to be fixed, where the $w_i$ are the complete strings seen as instantiations.
[...] if both prefixes and suffixes can be constrained; here the $p_i$ correspond to the prefixes of the maximal length that can be assumed to be drawn from a fixed set of types, and the $s_i$ to the suffixes.
In the above, $\Sigma$ represents all the symbols seen in the training data. Under this formulation, the variables in the avenir paradigm in Figure 3 yield the following regular expressions: [...]
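To make the edge-constraint idea concrete, here is a minimal Python sketch that derives a constraint for one variable from its observed instantiations. Several things are assumptions rather than the authors' formulation: Python `re` syntax stands in for the Xerox formalism, `.*` stands in for an unconstrained middle part, only suffixes are handled (prefixes would be symmetric), and the list of $x_1$ instantiations is hypothetical.

```python
import re


def p_unseen(num_types, num_instances):
    return (1.0 - 1.0 / (num_types + 1)) ** num_instances


def is_fixed(strings, cutoff=0.05):
    return p_unseen(len(set(strings)), len(strings)) <= cutoff


def longest_fixed_suffixes(instances, cutoff=0.05):
    """Suffix set of maximal length that still passes the fixedness test."""
    best = None
    for k in range(1, min(map(len, instances)) + 1):
        suffixes = [w[-k:] for w in instances]
        if not is_fixed(suffixes, cutoff):
            break  # longer suffixes can only have more types
        best = sorted(set(suffixes))
    return best


def variable_regex(instances, cutoff=0.05):
    """Regex source constraining one variable (assumed encoding)."""
    if is_fixed(instances, cutoff):
        return "(?:" + "|".join(sorted(map(re.escape, set(instances)))) + ")"
    suffixes = longest_fixed_suffixes(instances, cutoff)
    if suffixes:
        return ".*(?:" + "|".join(map(re.escape, suffixes)) + ")"
    return ".+"  # no constraint beyond being non-empty


# Hypothetical x1 instantiations: varied, but all ending in v.
x1_seen = ["av", "conv", "prev", "interv", "v", "dev",
           "contrav", "desav", "desconv", "reconv", "sobrev", "subv"]
print(variable_regex(x1_seen))     # .*(?:v)
print(variable_regex(["n"] * 12))  # (?:n)
```

The early exit in `longest_fixed_suffixes` is safe because the number of distinct suffix types can only grow as the suffix gets longer, so the measure never improves again once it fails.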

Deriving morphological analyzers
Once we have the constraints in place, they can be used to construct larger regular expressions that reflect mappings from a specific word form to a lemma together with the MSD. We convert each inflection form in a paradigm to a regular expression that permits the above variable values in place of $x_1, \ldots, x_n$, and that maps the remaining fixed strings to other fixed strings, depending on what kind of application is needed. For example, to create a regular expression for mapping the 1sg pres ind form (exemplified by niego) to the lemma form in Table 1, we proceed as follows: we construct a transducer that repeats the $x_1$ and $x_2$ variables, possibly subject to the constraints on their shape, and maps an i to the empty string and o to ar. Here $x_1$ is not constrained in the paradigm, while $x_2$ is constrained to always be the string eg. The transducer corresponding to the resulting regular expression is seen in Figure 2, and will generalize to words that fit the variable pattern, e.g. ciego → cegar.
Each inflection form of every paradigm is converted in such a manner to a transducer that maps that single inflection to its lemma and morphosyntactic description. All such individual transducers are then unioned together, over every form in every paradigm, into a single analyzer.
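The behavior of such a unioned analyzer can be imitated with ordinary Python regular expressions. This is a rough stand-in, not the foma implementation: each rule pairs a pattern over the inflected form with a lemma template and an MSD, and the "union" is a loop over all rules. The rule list, the `.*v` constraint on $x_1$ in the avenir rule, and the MSD strings are assumptions for illustration.

```python
import re

# (pattern over the inflected form, lemma template, MSD)
RULES = [
    # negar paradigm, 1sg pres ind: x1 i x2 o -> x1 x2 ar, with x2 = eg
    (re.compile(r"(?P<x1>.+)i(?P<x2>eg)o"), "{x1}{x2}ar", "1sg pres ind"),
    # avenir paradigm, pres part: x1 i x2 iendo -> x1 e x2 ir, with x2 = n
    (re.compile(r"(?P<x1>.*v)i(?P<x2>n)iendo"), "{x1}e{x2}ir", "pres part"),
]


def analyze(word):
    """Return (lemma, MSD) pairs from every rule that matches the whole word."""
    analyses = []
    for pattern, lemma_template, msd in RULES:
        match = pattern.fullmatch(word)
        if match:
            analyses.append((lemma_template.format(**match.groupdict()), msd))
    return analyses


print(analyze("ciego"))        # [('cegar', '1sg pres ind')]
print(analyze("conviniendo"))  # [('convenir', 'pres part')]
```

One difference from a real transducer is worth keeping in mind: a regex match yields a single variable segmentation per rule, whereas a transducer returns every accepting path.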

Prioritizing analyses
The above formulation, though it already produces a working transducer that generalizes to unseen forms, can be refined further. First, if a word form matches the original variables seen exactly, it may be superfluous to return extra analyses from other paradigms that the word form might also fit. Secondly, it may be the case that we have overconstrained some variable with the heuristic described earlier, and so return no analyses at all, motivating a potential relaxation of the constraints on variable shapes.
Table 2: Example of the tri-level analyses produced by the unweighted system: here the three sub-grammars (Original = O, Constrained = C, Unconstrained = U) each allow for successively more analyses. The word peleaste 'quarrel' has been seen in the training data and thus receives an analysis from the constrained analyzer, whereas aceleran 'accelerate' has not and only receives parses from C and U.
To provide a ranking of the analyses in the unweighted analyzer, we build a layered system with three different models:
• Original: an analyzer where each $x_i$ must match exactly some shape seen in training data.
• Constrained: an analyzer where variables are constrained as described above.
• Unconstrained: an analyzer where there are no constraints on variables, except that they must be at least one symbol long, i.e. match $\Sigma^+$.
The three analyzers can be joined by a "priority union" operation (Kaplan, 1987), producing a single analyzer that prioritizes more constrained analyses, if such are possible: Original $\cup_P$ Constrained $\cup_P$ Unconstrained. The resulting analyzer can be thought of as first consulting Original; if that fails to produce an analysis, it consults Constrained, and if that also fails, it consults Unconstrained. The same effect can also be modeled in runtime code by keeping the three transducers separate for potential savings of space. Table 2 illustrates this priority effect with two Spanish words being analyzed.
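The runtime variant mentioned above, where the three analyzers are kept separate and consulted in order, might look like the following Python sketch. The three toy analyzers are dictionary lookups with made-up contents; they only stand in for the Original, Constrained, and Unconstrained transducers.

```python
def priority_union(*analyzers):
    """Combine analyzers so that the first one with any result wins."""
    def combined(word):
        for analyzer in analyzers:
            results = analyzer(word)
            if results:
                return results
        return []
    return combined


# Toy stand-ins for the three layers (contents are hypothetical).
def original(word):
    return {"peleaste": [("pelear", "O-analysis")]}.get(word, [])


def constrained(word):
    return {"aceleran": [("acelerar", "C-analysis")]}.get(word, [])


def unconstrained(word):
    return [(word, "U-guess")]


analyzer = priority_union(original, constrained, unconstrained)
print(analyzer("peleaste"))  # [('pelear', 'O-analysis')]
print(analyzer("aceleran"))  # [('acelerar', 'C-analysis')]
print(analyzer("xyz"))       # [('xyz', 'U-guess')]
```

Keeping the layers separate trades one large compiled transducer for up to three lookups per word, which is the space saving the text alludes to.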

The weighted case: language models over variables
The above unweighted model provides a hierarchical system by which to return plausible analyses, while curbing implausible ones. However, it lacks the power to provide a ranking of analyses within each layer of ever laxer constraints on the variables. An alternative to that model is to directly use the statistics over the variable parts to generate a weighted transducer that performs the same type of parsing, but with a (hopefully) strict ranking of candidate parses. We address this by inducing an n-gram model over each variable in each paradigm. We calculate these individual n-gram models in the usual way for a single variable $v$, consisting of the letters $v_1, \ldots, v_n$:

$$P(v_1, \ldots, v_n) = \prod_{i=1}^{n} P(v_i \mid v_{i-k+1}, \ldots, v_{i-1}) \tag{5}$$

where $k$ denotes the order of the n-gram model. For each variable and function, we perform a standard maximum likelihood estimate of the n-grams by

$$P(v_i \mid v_{i-k+1}, \ldots, v_{i-1}) = \frac{\#(v_{i-k+1} \ldots v_i)}{\#(v_{i-k+1} \ldots v_{i-1})}$$

with some additional add-δ smoothing to prevent zero counts. The resulting variable models can then (after taking negative logs) replace the variable portions of each individual transducer that maps a word form to its citation form. The fixed parts of the inflection mappings retain the weight 0. These language models are then concatenated with the same model as used for the unweighted case in place of the variables. This is illustrated in Figure 4. We tune the model for each language evaluated by doing a grid search on (1) the order of the n-gram (1-5), (2) the prior on the n-grams (0.01-3.0), and (3) the prior of picking a paradigm (we include a paradigm weight for each individual paradigm).
Similarly to the unweighted case, the final model is a union of all the individual inflection models for each paradigm and word, with the language models for the variables interleaved.
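A character n-gram model over variable instantiations, with add-δ smoothing and negative log weights, can be sketched as follows. This is a generic textbook construction, not the authors' Kleene implementation: the boundary symbols, the class name `VariableLM`, the default order and δ, and the training strings are all assumptions.

```python
import math
from collections import Counter

BOS, EOS = "<", ">"  # boundary symbols, assumed not to occur in the data


class VariableLM:
    """Add-delta smoothed character n-gram model for one variable."""

    def __init__(self, instances, order=2, delta=0.1):
        self.order, self.delta = order, delta
        self.ngrams, self.contexts = Counter(), Counter()
        self.vocab = {EOS}
        for s in instances:
            self.vocab.update(s)
            for context, symbol in self._events(s):
                self.ngrams[context, symbol] += 1
                self.contexts[context] += 1

    def _events(self, s):
        """Yield (context, symbol) pairs, padding with boundary symbols."""
        padded = BOS * (self.order - 1) + s + EOS
        for i in range(self.order - 1, len(padded)):
            yield padded[i - self.order + 1:i], padded[i]

    def weight(self, s):
        """Negative log probability of s (lower means more plausible)."""
        total = 0.0
        for context, symbol in self._events(s):
            numerator = self.ngrams[context, symbol] + self.delta
            denominator = self.contexts[context] + self.delta * len(self.vocab)
            total -= math.log(numerator / denominator)
        return total


# Hypothetical x1 instantiations for a paradigm where x1 tends to end in v.
lm = VariableLM(["av", "conv", "prev"], order=2, delta=0.1)
print(lm.weight("sobrev") < lm.weight("sobrez"))  # True
```

In the weighted analyzer such a weight would be attached to the variable portion of each mapping, while the fixed parts keep weight 0, so candidate parses can be ranked by their total weight.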

Evaluation
To evaluate the systems, we used the data set published by Durrett and DeNero (2013) (D&DN13), which includes full inflection tables for a large number of lemmas in German (nouns and verbs), Spanish (verbs), and Finnish (nouns+adjectives and verbs). That source also provides a division into train/dev/test splits, with 200 tables in dev and test, respectively. We then evaluated the ability of our systems to provide a correct lemmatization and MSD of each word form in the held-out tables, testing separately on each part of speech. For the unweighted analyzer, we use the three-part setup as described above. For the weighted case, we produce a single highest scoring analysis. The train/dev/test sets are entirely disjoint and share no tables.
We trained the models by inspecting all the word forms and corresponding MSDs, organizing them into tables, learning the paradigms, and then generating weighted and unweighted transducers as described above. These transducers were then run on the test data to provide lemmatization and analyses of the unseen word forms. Table 4 summarizes the number of inflection tables seen during training, together with the final number of paradigms learned. Table 5 shows the statistics in the held-out data.
Because we focus on the recall figures of the analyzers, we also calculated an "inherent ambiguity" measure of the test data. This is the average number of different MSDs that are given for each word form. This ambiguity may arise as follows: the Spanish verb tenga, for example, can be either the first person singular present subjunctive of tener 'to have' or the third person singular present subjunctive. Such ambiguity shows that there exist cases where returning multiple analyses is warranted, given that we do not have any sentence context to determine the correct choice.
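The "inherent ambiguity" measure can be computed directly from the labeled test data. The sketch below assumes the data is available as (word form, lemma, MSD) triples and follows the textual definition literally (distinct MSDs per word form, averaged over word forms); the function name and the tiny example set are illustrative.

```python
from collections import defaultdict


def inherent_ambiguity(entries):
    """Average number of distinct MSDs per word form.

    entries: iterable of (word_form, lemma, msd) triples.
    """
    msds_by_form = defaultdict(set)
    for form, _lemma, msd in entries:
        msds_by_form[form].add(msd)
    return sum(len(m) for m in msds_by_form.values()) / len(msds_by_form)


toy_test_set = [
    ("tenga", "tener", "1sg pres subj"),
    ("tenga", "tener", "3sg pres subj"),
    ("tener", "tener", "infinitive"),
]
print(inherent_ambiguity(toy_test_set))  # 1.5
```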
For the weighted case, the system sometimes returns multiple equally scoring parses. This is because the language model only operates over the variables, and in many languages multiple MSDs often have the same surface form. For example, Spanish compraba 'bought 1P/3P' (like verb forms with the -aba suffix in general) is always ambiguous between 1st and 3rd person past tense. For this reason, we calculate the recall (as opposed to accuracy) of all the top scoring parses. The weighted system always returns a single lemma in the evaluation. It can, of course, produce a number of ranked analyses if needed; an example of extracting the top-10 ranked analyses of a word form is given in Table 7.
Figure 4: Illustration of the coupling of language models for variables $x_1$ and $x_2$ to create the weighted analyzer. Here, LM $x_1$ and LM $x_2$ illustrate a collection of states representing the language models for the variables, inferred from variable instantiations seen in the training data.
Table 7: [...] PAST' in Spanish with weights (in effect the negative log probability), the inferred variable division, the lemmatization, and MSDs. Lemmas and parts of the analysis that are correct are given in boldface. Note that several paradigms can produce an entirely correct parse for a single form such as this one, even though the paradigms would differ in other forms.

Results
Table 3 shows the main results of the evaluation of the unweighted model and the weighted model. For the unweighted model, we consider the lemma recall and lemma+MSD recall, and also document the average number of unique parses returned (lemma or lemma+MSD). For the weighted model, we give the recall for all combinations of lemma+MSD. The weighted recall is, for obvious reasons, consistently below the unweighted version, as the unweighted case uses the hierarchical model to potentially return a much larger number of analyses. The weighted version always returns a single lemma, and possibly several equally ranked MSDs, as discussed above. Still, for some languages (Spanish and Finnish verbs in particular), despite returning only a single analysis, performance is on par with the unweighted model, which returns 1.93 analyses on average (Spanish) and 3.77 (Finnish). We emphasize that the test set for our experiments is entirely disjoint from the training set, and that the figures therefore reflect potential performance on unseen word forms, not standard per-token performance in running text, which is presumably much higher. The reported figures can thus be interpreted to correspond to a per-type performance for OOV items.

Conclusion and future work
We have described two supervised methods for producing finite-state morphological analyzers and guessers from labeled word forms organized into inflection tables. The methods can be used to quickly produce high-recall morphological analyzers from labeled data with little or no linguistic development effort.
These tools can be used as is and can also be modified to exploit unlabeled data in the form of raw text corpora in a semi-supervised lexicon expansion setting. Some potential extensions could be of immediate value: the generative weighted model could be combined and evaluated on a task of tagging/disambiguating running text where contextual features could be used and seamlessly combined with the morphological language model. The weighted model also offers paths for further experimentation; for example, it is not immediately obvious that an n-gram model is the best choice. It seems reasonable to assume that those parts of the variables modeled that stand closer to the fixed parts, i.e. at the edges, would be more important in judging similarity to previously seen inflected forms. Table 2 hints at this being the case since, for example, the Spanish variables seem far more constrained at edge positions than in the middle of the variable string. Which parts to weight as more important in judging similarity could also be inferred from data. Another potential extension is to also constrain the analysis form by integrating a word-level language model instead of only a variable-level one, either replacing the variable-level model or working in conjunction with it.