Evaluating Sequence Alignment for Learning Inflectional Morphology

This work examines CRF-based sequence alignment models for learning natural language morphology. Although these systems have performed well for a limited number of languages, this work, as part of the SIGMORPHON 2016 shared task, specifically sets out to determine whether these models handle non-concatenative morphology as well as previous work might suggest. Results, however, indicate a strong preference for simpler, concatenative morphological systems.


Introduction
Morphologically rich languages pose a challenge for the natural language processing and generation community. Computationally mapping inflected wordforms to a baseform has been standard practice in semantics and generation. Traditionally, these mappings were hand-coded as rule-based systems, which required extensive engineering overhead but produced high-quality resolution between base and inflected wordforms. This work extends Durrett and DeNero (2013), who automatically learn morphological paradigms by comparing edit operations between a lemma and its inflected forms, and tests a similar algorithm on other morphologically rich languages, including those that make more extensive use of non-concatenative morphology.

Background
Morphological reinflection and lemma generation are not trivial tasks, and have been the subject of much research and engineering. Traditionally, rule-based and finite-state methods (Minnen et al., 2001; Koskenniemi, 1984) have been used, particularly when no training data is available. Although these handcrafted systems perform with a high level of accuracy, creating them is difficult and requires a great deal of engineering overhead.
Recently, more automatic machine learning methods have been used. These systems require far less handcrafting of rules, but they also do not perform as well. Specifically, Durrett and DeNero (2013) exploit sequence alignment across strings, a technique originally developed for DNA analysis. They showed that by computing minimum edit operations between two strings and having a semi-Markov conditional random field (CRF) (Sarawagi and Cohen, 2004) predict when wordform edit rules are to be used, a system could achieve state-of-the-art performance in completing morphological paradigms for English and German.
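To make the notion of minimum edit operations concrete, the following is a minimal Python sketch of unit-cost edit distance with a backtrace. It is not the implementation used by Durrett and DeNero (2013) or by this work; the function name `min_edit_ops` and the unit costs for insertion, deletion, and substitution are assumptions made for illustration.

```python
def min_edit_ops(source, target):
    """Return (distance, ops) for turning source into target.

    Assumption: insertion, deletion, and substitution each cost 1.
    Each op is a tuple (kind, source_char, target_char).
    """
    n, m = len(source), len(target)
    # dist[i][j] = minimum number of edits turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete source[i-1]
                dist[i][j - 1] + 1,         # insert target[j-1]
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Walk back from the bottom-right cell to recover one optimal edit script.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("sub", source[i - 1], target[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


print(min_edit_ops("give", "gave"))  # (1, [('sub', 'i', 'a')])
```

A single substitution is enough for give → gave, which is the kind of non-concatenative change that a purely suffixing analysis would miss.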
English and German, along with other Germanic languages, have a somewhat rarer tendency towards ablauting, that is, changing or deleting segments from the lemma of a wordform as part of its inflection. In some circles, morphology is thought of in the purely concatenative sense (i.e. give + -s → gives). Durrett and DeNero's work shows promise in that it already accounts for non-concatenative morphology in English and German. Using a similar system, this work hypothesizes that such an approach will perform well on languages with more prolific non-concatenative morphology, such as Arabic and Maltese.

Shared Task
The SIGMORPHON 2016 shared task on morphological reinflection (Cotterell et al., 2016) consisted of multiple tracks for discerning fully inflected wordforms in ten languages, two of which were surprise languages whose data was not released until a week before the submission deadline. In task 1, participants were given a lemma and a target word's grammatical information with which to guess the fully inflected target wordform. In task 2, participants were supplied with two fully inflected wordforms (one source and one target) and their grammatical features. Task 3 was the same as task 2, except that no source grammatical information was supplied. Additionally, participants were allowed to choose standard, restricted, or bonus training sets. Standard training allowed any task to use training data from a task lower than it. Restricted training only allowed training on data for the given task (i.e. task 1 can only train on task 1, task 2 on task 2, and task 3 on task 3). A system attempting a certain task number and training on a higher task number (e.g. attempting task 1 and additionally using task 2 training data) constituted using bonus training.
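The three training regimes described above can be summarized as a small decision rule. The sketch below is only a reading of that description, not an official definition from the task organizers; the function name `training_regime` and its arguments are hypothetical.

```python
def training_regime(task, training_tasks):
    """Classify a setup as 'restricted', 'standard', or 'bonus'.

    task: the task number being attempted (1, 2, or 3).
    training_tasks: the task numbers whose training data is used.
    """
    used = set(training_tasks)
    if any(t > task for t in used):
        # Data from a higher-numbered task counts as bonus training.
        return "bonus"
    if any(t < task for t in used):
        # Data from a lower-numbered task is allowed under standard training.
        return "standard"
    # Only the task's own data is used.
    return "restricted"


print(training_regime(1, [1]))     # restricted
print(training_regime(2, [1, 2]))  # standard
print(training_regime(1, [1, 2]))  # bonus
```

Under this reading, standard and restricted training coincide for task 1, since no lower-numbered task exists.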
For the purposes of testing this work's hypothesis, task 1 was chosen as being the most analogous and direct means of evaluation. Additionally, restricted training was used to minimize variance between the training sets of the ten languages in question. As seen in table 1, although generally most training sets have about 12,000 items, Navajo, Maltese, and Hungarian are the exceptions.

Implementation
This work exploits string sequence alignment algorithms such as Hirschberg's algorithm (Hirschberg, 1975) and the Ratcliff/Obershelp algorithm (Black, 2004) in the same vein as recent work by Durrett and DeNero (2013) and Nicolai et al. (2015). In these frameworks, the smallest set of edits required to convert one string to another is considered to be the morphological rules. As shown in figure 1, source and target words are aligned to minimize the edit operations required to make them the same. This minimal list of edit operations is converted into an edit rule at the character level (i.e. this work does not predict word level edit operations).

springen → gesprungen: s p r i n g e n | ge s p r u n g e n | Rule: +g+e, -i +u
Figure 1: Sample edits for English give → gave, Arabic kitab → kutub ('book' singular → plural), and German springen → gesprungen ('to jump' infinitive → past participle). Note that edit rules are applied in a character-by-character manner across the lemma.

These segment edits are fed, together with a feature set, to a linear chain CRF (Sutton and McCallum, 2011) trained with online passive-aggressive training (Crammer et al., 2006). Features for the CRF included a mix of information provided by the task data and surface features from the uninflected lemmas. All features were shared across all segments (i.e. at the word level) except for features specific to the current segment and listed in table 2. Outputs from the CRF were edit operations for each segment of the input lemma. After these operations were carried out on their respective segments within the lemma, a fully inflected wordform was the final output from the system. The feature set was chosen with insight from previous work.

• Character level trigram information, forwards and backwards
• Character indexes from beginning and end
• Distance from the current character to the beginning of the lemma
• Distance from the current character to the end of the lemma
• Affix type (prefixing, infixing, or suffixing; circumfixing was not explicitly encoded into the feature set)
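As a rough illustration of how a lemma and its inflected form could be turned into per-character training instances (features plus an edit label), consider the sketch below. It is not the system's actual code. It uses Python's `difflib.SequenceMatcher` as a stand-in aligner, and the label scheme (the `=` copy symbol), the padding symbols `^` and `$`, the feature names, and the `pos` word-level feature are all assumptions introduced here. Affix type is left out for brevity.

```python
from difflib import SequenceMatcher

COPY = "="  # assumed label symbol: copy the lemma character unchanged


def edit_labels(lemma, target):
    """Return one edit label per lemma character (lemma must be non-empty)."""
    labels = [COPY] * len(lemma)
    pending = ""  # inserted material waiting for the next lemma character
    matcher = SequenceMatcher(None, lemma, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            pending += target[j1:j2]
            continue
        if tag != "equal":
            # replace/delete: the first character carries the new material,
            # the remaining characters of the span are deleted.
            labels[i1] = target[j1:j2]
            for i in range(i1 + 1, i2):
                labels[i] = ""
        labels[i1] = pending + labels[i1]
        pending = ""
    if pending and labels:
        labels[-1] += pending  # trailing insertion, e.g. a suffix
    return labels


def apply_labels(lemma, labels):
    """Carry out the per-character edits to produce the inflected form."""
    return "".join(lab.replace(COPY, ch) for ch, lab in zip(lemma, labels))


def char_features(lemma, i, word_feats=None):
    """Features for character i; word-level features are shared by all characters."""
    padded = "^^" + lemma + "$$"
    feats = dict(word_feats or {})
    feats.update({
        "char": lemma[i],
        "trigram_back": padded[i:i + 3],     # two preceding characters + current
        "trigram_fwd": padded[i + 2:i + 5],  # current + two following characters
        "dist_start": i,
        "dist_end": len(lemma) - 1 - i,
    })
    return feats


def training_instances(lemma, target, word_feats=None):
    """Pair each character's features with its edit label."""
    labels = edit_labels(lemma, target)
    return [(char_features(lemma, i, word_feats), labels[i])
            for i in range(len(lemma))]


labels = edit_labels("springen", "gesprungen")
print(labels)  # ['ge=', '=', '=', 'u', '=', '=', '=', '=']
print(apply_labels("springen", labels))  # gesprungen
```

Attaching inserted material to a neighbouring lemma character keeps the output at exactly one label per input character, which is the shape a linear chain sequence labeler expects.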

Results
Overall, the system performed far better on the development set than on the test set. It is easiest to summarize the results from table 6 in terms of the number of edit rules the system had to learn. Languages with under 500 edit rules performed best and experienced only a moderate drop-off between the development and test sets. Languages with over 500 edit rules both performed worse and, in some instances, experienced extreme drop-offs. The exception, Turkish, will be discussed below and in the next section. Languages traditionally used in these tasks, such as German, performed best, while those less often tested in these systems, such as Maltese, seem to be more difficult for the system to predict accurately. There was a drastic drop in Navajo, which the task organizers attribute to a dialectal shift between the training, development, and test sets.

Discussion
The difference in system performance between development and testing could be interpreted as overfitting. That said, overfitting to the development set would show a more universal drop in scores from development to testing than is exhibited here. Table 5 shows the number of unique affixes the system had to learn. As expected, languages traditionally thought to have less complex morphological structure had fewer unique affixes in both training and development sets. This is echoed in table 4, where entropy over the unique affix counts was calculated.
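The exact entropy formulation behind table 4 is not spelled out here. One plausible reading, shown as a hedged sketch, is Shannon entropy over the relative frequencies of the affixes (or edit rules) observed in a data set. The function name `affix_entropy`, the base-2 logarithm, and the toy input are assumptions and do not reproduce any value from the tables.

```python
import math
from collections import Counter


def affix_entropy(affixes):
    """Shannon entropy (in bits) of the empirical affix distribution."""
    counts = Counter(affixes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy input, not taken from the shared-task data.
print(affix_entropy(["-s", "-s", "-ed", "-ing"]))  # 1.5
```

Under this reading, a language whose forms are spread over many roughly equally frequent affixes has a higher entropy than one dominated by a few affixes.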
In addition to a non-uniform drop in accuracy, a strong negative correlation (as seen in table 3) between the number of affixes in the training set and accuracy seems to indicate that data sparsity might explain this phenomenon more fully. It appears that data sparsity has a greater effect as the number of affixes increases.
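The statistic behind the correlation in table 3 is not named in the text. Assuming it is a Pearson coefficient, it could be computed as in the sketch below. The function name `pearson_r` is hypothetical, and the example input is a toy sequence rather than the per-language affix counts and accuracies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values only: as x rises, y falls in lockstep, so r is -1.
print(pearson_r([1, 2, 3], [3, 2, 1]))  # -1.0
```

A value near -1 for affix counts against accuracy would correspond to the strong negative relationship described above.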
Certain languages did appear to drop more drastically than others between development and testing. While Finnish, German, Hungarian, Russian, Spanish, and Turkish fell less than 10%, Navajo and Arabic fell more than 30% each. Navajo's drop can be explained by the lack of training data. In 6012 training items, there were 684 edit rules that the system had to learn. This ratio of edit rules to wordforms is more than 1:10, which is higher than almost any other language in the task, second only to Maltese. What is particularly interesting is the difference in the number of affixes between Turkish and Arabic.
Although Arabic fell more drastically, Turkish clearly has more affixes in the data set, both by ratio and sheer count, and should perform worse given the previously mentioned observations about an overall negative correlation between affix number and system performance. It should be taken into consideration that the kinds of morphology in Arabic and Turkish are not entirely analogous. Turkish, although agglutinative, is also primarily a suffixing language (Lewis, 2000), while Modern Standard Arabic is comparatively more non-concatenative. Arabic and Maltese, both of which have high entropies as seen in table 4 in addition to more non-concatenative morphological structures, also performed worse in the development results than Turkish, which had an entropy more akin to Russian and Finnish. This points to the likelihood that non-concatenative morphology is still an issue for sequence alignment algorithms. Whether this problem can be solved by using a different algorithm, increasing training data, or altering the underlying machine learning is beyond the scope of this task. It should also be noted that, as far as this work is aware, the data sets were not balanced for frequency. Language learners often memorize irregular forms by rote because they do not fit a productive inflectional pattern. Luckily, irregular forms usually occur more frequently than wordforms subject to productive morphological rules (Bybee and Thompson, 1997; Grabowski and Mindt, 1995). Since the algorithm ostensibly treats productive and lexicalized forms equally, it would be interesting to see whether there is any difference in performance between these datasets and others balanced to account for irregular form frequency.
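The edit-rule-to-wordform ratio quoted for Navajo earlier in this section (684 edit rules in 6012 training items) can be checked with a few lines of arithmetic. The function name `rule_ratio` is hypothetical; the two counts are the ones given in the text.

```python
def rule_ratio(edit_rules, training_items):
    """Return (rules per item, items per rule) for a training set."""
    return edit_rules / training_items, training_items / edit_rules


per_item, per_rule = rule_ratio(684, 6012)
print(round(per_item, 3))  # 0.114, i.e. above the 1:10 mark
print(round(per_rule, 1))  # 8.8 training items per edit rule on average
```

With fewer than nine training items per edit rule on average, many rules are likely to be observed only a handful of times, which is consistent with the data sparsity explanation given above.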

Conclusion
Sequence alignment algorithms have proven useful in automatically learning natural language morphology. That said, supervised models require exceptional amounts of training data to overcome data sparsity. Given a lack of training data, more traditional finite-state methods might be preferable given enough time to engineer such systems. This work has shown that CRF-based sequence alignment models do perform well for languages with lower affix to wordform ratios and unique affix count entropy values. Although there is not enough evidence to overtly reject this work's hypothesis, the evidence does indicate a preference for concatenative morphology by CRF-based sequence alignment models.

Table 6: Aggregate Accuracy Across Languages. Maltese required 15 days to train, and was unable to finish before the results were due.