Unsupervised Morphological Segmentation for Low-Resource Polysynthetic Languages

Polysynthetic languages pose a challenge for morphological analysis due to the root-morpheme complexity and to the word class “squish”. In addition, many of these polysynthetic languages are low-resource. We propose unsupervised approaches for morphological segmentation of low-resource polysynthetic languages based on Adaptor Grammars (AG) (Eskander et al., 2016). We experiment with four languages from the Uto-Aztecan family. Our AG-based approaches outperform other unsupervised approaches and show promise when compared to supervised methods, outperforming them on two of the four languages.

We propose an unsupervised approach for morphological segmentation of polysynthetic languages based on Adaptor Grammars (Johnson et al., 2007). We experiment with four Uto-Aztecan languages: Mexicanero (MX), Nahuatl (NH), Wixarika (WX) and Yorem Nokki (YN) (Kann et al., 2018). Adaptor Grammars (AGs) are nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFGs), and have proven successful for unsupervised morphological segmentation, where the PCFG is a morphological grammar that specifies word structure (Johnson, 2008; Sirts and Goldwater, 2013; Eskander et al., 2016, 2018). Our main goal is to examine the success of Adaptor Grammars for unsupervised morphological segmentation when applied to polysynthetic languages, where the morphology is synthetically complex (not simply agglutinative), and where resources are minimal. We use the datasets introduced by Kann et al. (2018) in an unsupervised fashion (unsegmented words). We design several AG learning setups: 1) use the best-on-average AG setup from Eskander et al. (2016); 2) optimize for language using just the small training vocabulary (unsegmented) and dev vocabulary (segmented) from Kann et al. (2018); 3) approximate the effect of having some linguistic knowledge; 4) learn from all languages at once; and 5) add additional unsupervised data for NH and WX (Section 3). We show that the AG-based approaches outperform other unsupervised methods, namely Morfessor (Creutz and Lagus, 2007) and MorphoChain (Narasimhan et al., 2015), and that for two of the languages (NH and YN), the best AG-based approaches outperform the best supervised methods (Section 4).

Languages and Datasets
Typically, polysynthetic languages demonstrate holophrasis, i.e. the ability of an entire sentence to be expressed as what is considered by native speakers to be just one word. To illustrate, consider the following example from Inuktitut (Klavans, 2018b), where the morpheme -tusaa- is the root and all the other morphemes are synthetically combined with it in one unit:

tusaa-tsia-runna-nngit-tu-alu-u-jung
hear-well-be.able-NEG-DOE-very-BE-PT.1S
I can't hear very well.
Another example from WX, one of the languages in the dataset for this paper (from Mager et al., 2018c), shows this complexity:

yu-huta-me   ne-p+-we-iwa
an-two-ns    1sg:s-asi-2pl:o-brother
I have two brothers.
In linguistic typology, the broader gradient is: isolating/analytic to synthetic to polysynthetic. Agglutinating refers to the clarity of boundaries between morphemes. This more specific gradation is: agglutinating to mildly fusional to fusional. Thus a language might be characterized overall as polysynthetic and agglutinating, i.e. generally a high number of morphemes per word, with clear boundaries between morphemes and thus easily segmentable. Another language might be characterized as polysynthetic and fusional, so again, many morphemes per word, but many phonological and other processes so it is difficult to segment morphemes.
Thus, morphological analysis of polysynthetic languages is challenging due to the root-morpheme complexity and to word class gradations. Linguists recognize a gradience in word classes, known as "squishiness", a term first discussed in Ross (1972), who argued that, instead of a fixed, distinct inventory of syntactic categories, a quasi-continuum from verb through adjective to noun best reflects most lexical distinctions. The root-morpheme complexity and the word class "squish" make it difficult to develop segmented training data that is reliable across annotators. Kann et al. (2018) have made a first step by releasing a small set of morphologically segmented datasets, although even in these carefully curated datasets, the distinction between affix and clitic is not always indicated. We use these datasets in an unsupervised fashion (i.e., we use the unsegmented words). These datasets were taken from detailed descriptions in the Archive of Indigenous Languages collection for MX (Canger, 2001), NH (de Suárez, 1980), WX (Gómez and López, 1999), and YN (Freeze, 1989). They were constructed so they include both segmentable and non-segmentable words [...] Kann et al. (2018), for training we do not use the segmented version of the data (our approach is unsupervised). In addition to the datasets, for NH and WX we also have available the Bible (Christodouloupoulos and Steedman, 2015; Mager et al., 2018a), which we consider for one of our experimental setups as additional training data. In the datasets from Kann et al. (2018), the maximum number of morphemes per word for MX is seven, with an average of 2.13; for NH, six, with an average of 2.2; for WX, ten, with an average of 3.3; and for YN, ten, with an average of 2.13.

Using Adaptor Grammars for Polysynthetic Languages
An Adaptor Grammar is typically composed of a PCFG and an adaptor that adapts the probabilities of individual subtrees. For morphological segmentation, the PCFG is a morphological grammar that specifies word structure, and AGs learn latent tree structures given a list of words. In this paper, we experiment with the grammars and the learning setups proposed by Eskander et al. (2016), which we outline briefly below.
Grammars. We use the nine grammars from Eskander et al. (2016, 2018) that were designed based on three dimensions: 1) how the grammar models word structure (e.g., prefix-stem-suffix vs. morphemes); 2) the level of abstraction in nonterminals (e.g., compounds, morphemes and sub-morphemes); and 3) how the output boundaries are specified (see Table 2 for sample grammars). For example, the PrStSu+SM grammar models the word as a complex prefix, a stem and a complex suffix, where the complex prefix and suffix are composed of zero or more morphemes, and a morpheme is a sequence of sub-morphemes. The boundaries in the output are based on the prefix, stem and suffix levels.
Table 2 (sample grammars; Eskander et al., 2016, 2018). Compound = upper-level representation of the word as a sequence of compounds; Morph = affix/morpheme representation as a sequence of morphemes; SubMorph (SM) = lower-level representation of characters as a sequence of sub-morphemes. "+" denotes one or more.
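To make the role of the output boundaries concrete, the following Python sketch reads a segmentation off a parse tree of the kind a PrStSu+SM grammar would produce. The tuple-based tree encoding, the nonterminal names, the function names and the toy English word are all assumptions made for illustration; they are not the data structures of the AG tooling used in the paper.

```python
# Hypothetical tree encoding: (label, children), where children is either
# a string of characters (a leaf) or a list of subtrees.

def yield_of(tree):
    """Concatenate the characters covered by a subtree."""
    _label, children = tree
    if isinstance(children, str):
        return children
    return "".join(yield_of(child) for child in children)


def segment(tree, boundary_labels):
    """Split the word at the topmost nodes whose label is in boundary_labels."""
    label, children = tree
    if label in boundary_labels:
        text = yield_of(tree)
        return [text] if text else []
    if isinstance(children, str):
        return [children] if children else []
    segments = []
    for child in children:
        segments.extend(segment(child, boundary_labels))
    return segments


# Toy parse of an invented example word (not taken from the paper's data).
toy_parse = ("Word", [
    ("Prefix", [("PrefixMorph", [("SubMorph", "un")])]),
    ("Stem", [("SubMorph", "he"), ("SubMorph", "lp")]),
    ("Suffix", [("SuffixMorph", [("SubMorph", "ful")])]),
])

# Boundaries at the prefix, stem and suffix levels.
print(segment(toy_parse, {"Prefix", "Stem", "Suffix"}))  # ['un', 'help', 'ful']
```

Passing a different label set, for example only the sub-morpheme label, yields a finer segmentation from the same tree, which is why the choice of output boundary level is treated as a separate design dimension.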
Learning Settings. The input to the learner is a grammar and a vocabulary of unsegmented words. We consider the three learning settings of Eskander et al. (2016): Standard, Scholar-seeded Knowledge and Cascaded. The Standard setting is language-independent and fully unsupervised, while in the Scholar-seeded Knowledge setting, some linguistic knowledge (in the form of affixes taken from grammar books) is seeded into the grammar trees before learning takes place. The Cascaded setting simulates the effect of seeding scholar knowledge in a language-independent manner by first running a high-precision AG to derive a set of affixes, and then seeding those affixes into the grammars.

AG Setups for Polysynthetic Languages
We experimented with several setups using AGs for unsupervised segmentation.
Language-Independent Morphological Segmenter. LIMS is the best-on-average AG setup obtained by Eskander et al. (2016) when trained on six languages (English, German, Finnish, Estonian, Turkish and Zulu); it corresponds to the Cascaded PrStSu+SM configuration. We use this AG setup for each of the four languages. We refer to this system as AG_LIMS.
Best AG Configuration per Language. In this experimental setup, we consider all nine grammars from Eskander et al. (2016) using both the Standard and the Cascaded approaches, and choose the configuration that is best for each polysynthetic language by training on the training set and evaluating on the development set. We denote this system as AG_BestL.
Using Seeded Knowledge. To approximate the effect of Scholar-seeded Knowledge in Eskander et al. (2016), we used the training set to derive affixes and used them as scholar-seeded knowledge added to the grammars (before the learning happens). However, since affixes and stems are not distinguished in the training annotations from Kann et al. (2018), we only consider the first and last morphemes that appear at least five times. We call this setup AG_Scholar_BestL.
Multilingual Training. Since the vocabulary in Kann et al. (2018) for each language is small, and the languages are from the same language family, one data augmentation approach is to train on all languages and then test on each language individually. We call this setup AG_Multi.
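The affix-derivation step of the Using Seeded Knowledge setup can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the input format (one list of morphemes per word), the decision to count word-initial and word-final morphemes separately, and the decision to skip single-morpheme words are all assumptions.

```python
from collections import Counter


def derive_seed_affixes(segmented_words, min_count=5):
    """Collect word-initial and word-final morphemes seen at least min_count times.

    segmented_words: list of morpheme lists, one per word.
    Assumption: words with a single morpheme contribute no affix candidates.
    """
    first = Counter()
    last = Counter()
    for morphemes in segmented_words:
        if len(morphemes) < 2:
            continue
        first[morphemes[0]] += 1
        last[morphemes[-1]] += 1
    prefixes = {m for m, count in first.items() if count >= min_count}
    suffixes = {m for m, count in last.items() if count >= min_count}
    return prefixes, suffixes


# Invented toy data, only to show the thresholding behaviour.
toy_words = [["a", "x"]] * 5 + [["b", "y"]] * 4 + [["solo"]]
prefixes, suffixes = derive_seed_affixes(toy_words)
print(prefixes, suffixes)
```

The resulting sets would then be seeded into the grammar as prefix and suffix candidates before learning; how that seeding is encoded depends on the AG implementation and is not shown here.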
Data Augmentation. In this setup, we examine the performance of the best AG configuration per language (AG_BestL) when more data is available. We merge the training corpus with the unique words in the New Testament of the Bible (train_Bible). We run this only on NH and WX since the Bible text is only available for these two languages. We denote this setup as AG_Aug.
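The merging step above amounts to a set union over word types. The sketch below also reports what fraction of the original training vocabulary occurs in the additional text, which is a simple way to gauge how close the two domains are. The function name and the assumption that the additional text is already tokenized are illustrative choices, not details from the paper.

```python
def augment_vocabulary(train_words, extra_tokens):
    """Merge a training vocabulary with the unique words of an additional corpus.

    Returns the merged vocabulary (sorted word types) and the fraction of
    training word types that also occur in the additional corpus.
    """
    train_vocab = set(train_words)
    extra_vocab = set(extra_tokens)
    merged = sorted(train_vocab | extra_vocab)
    if not train_vocab:
        return merged, 0.0
    overlap = len(train_vocab & extra_vocab) / len(train_vocab)
    return merged, overlap


# Invented toy input; real input would be the unsegmented training words
# and the tokens of the additional text.
merged, overlap = augment_vocabulary(["a", "b", "c", "d"], ["b", "x", "x"])
print(merged, overlap)
```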

Evaluation and Discussion
We evaluate the different AG setups on the blind test set from Kann et al. (2018) and compare our AG approaches to state-of-the-art unsupervised systems as well as supervised models including the best supervised deep learning models from Kann et al. (2018). As the metric, we use the segmentation-boundary F1-score, which is standard for this task (Virpioja et al., 2011).
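For readers unfamiliar with the metric, a boundary F1-score can be computed as in the following sketch. It compares the internal boundary positions of predicted and gold segmentations. Micro-averaging over all words is an assumption here; the evaluation script used for the reported results may aggregate differently, and the function names are illustrative.

```python
def boundaries(morphemes):
    """Return the set of internal boundary positions (character offsets)."""
    positions = set()
    offset = 0
    for morpheme in morphemes[:-1]:
        offset += len(morpheme)
        positions.add(offset)
    return positions


def boundary_f1(predicted, gold):
    """Micro-averaged boundary precision, recall and F1 over parallel lists."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        p, g = boundaries(pred), boundaries(ref)
        tp += len(p & g)   # boundaries found in both
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented toy example: the prediction oversegments both words.
predicted = [["un", "help", "ful"], ["cat", "s"]]
gold = [["un", "helpful"], ["cats"]]
precision, recall, f1 = boundary_f1(predicted, gold)
print(precision, recall, f1)
```

In this toy case the single gold boundary is recovered (full recall), but two spurious boundaries lower precision, which is the typical signature of oversegmentation.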
Evaluating different AG setups. Table 3 shows the performance of our AG setups on the four languages. The best AG setup learned for each of the four polysynthetic languages (AG_BestL) is the PrStSu+SM grammar using the Cascaded learning setup. This is an interesting finding as the Cascaded PrStSu+SM setup is in fact AG_LIMS, the best-on-average AG setup obtained by Eskander et al. (2016) [...] WX and YN, respectively. Seeding affixes into the grammar trees (AG_Scholar_BestL) improves the performance of the Cascaded PrStSu+SM setup only for MX and WX (absolute F1 gains of 0.023 and 0.019, respectively). It does not help for NH, and it even decreases the performance on YN. This occurs because AGs are able to recognize the main affixes in the Cascaded setup, while the seeded affixes were either abundant or conflicting with the automatically discovered ones. The multilingual setup (AG_Multi) does not improve the performance on any of the languages. This could be because the datasets are too small to generalize common patterns across languages. Finally, augmenting with Bible text in the cases of NH and WX leads to an absolute F1-score increase of 0.015 for both languages when compared to AG_BestL. There are two possible explanations for why we only see a slight increase when adding more data: 1) AGs are able to generalize from small data, and 2) the added Bible data represents a domain that is different from those of the datasets we are experimenting with, as only 4.8% and 9% of the words in the training sets from Kann et al. (2018) appear in the augmented data of NH and WX, respectively. Overall, AG_BestL is the best setup for YN, AG_Scholar_BestL is the best setup for MX and WX, while AG_Aug is the best for NH.
Comparison with unsupervised baselines. We consider Morfessor (Creutz and Lagus, 2007), a commonly used toolkit for unsupervised morphological segmentation, and MorphoChain (Narasimhan et al., 2015), another unsupervised morphological system based on constructing morphological chains. Our AG approaches significantly outperform both Morfessor and MorphoChain on all four languages, as shown in Table 3.
Comparison with supervised baselines. To obtain an upper bound, we compare the best AG setup to the best supervised neural methods presented in Kann et al. (2018) for each language. We consider their best multi-task approach (BestMTT) and the best data-augmentation approach (BestDA), using F1 scores from their Table 4 for each language. In addition, we report the results on their other supervised baselines: a supervised seq-to-seq model (S2S) and a supervised CRF approach. As can be seen in Table 4, our unsupervised AG-based approaches outperform the best supervised approaches for NH and YN by absolute F1 margins of 0.010 and 0.012, respectively. An interesting observation is that for YN we only used the words in the training set of Kann et al. (2018) (unsegmented), without any data augmentation. For MX and WX, the neural models from Kann et al. (2018) (BestMTT and BestDA) outperform our unsupervised AG-based approaches.
Error Analysis. For the purpose of error analysis, we train our unsupervised segmentation models on the training sets and analyze the output on the development sets, based on our best unsupervised models (AG_BestL). Since there is no distinction between stems and affixes in the labeled data, we only consider the morphemes that appear at least three times in order to eliminate open-class morphemes from our statistics.
We first define the degree of ambiguity of a morpheme to be the percentage of times its sequence of characters does not form a segmentable morpheme when it appears in the training set. We also define the degree of ambiguity of a language as the average degree of ambiguity of the morphemes in that language. Table 5 shows the number of morphemes, the average length of a morpheme (in characters) and the degree of morpheme ambiguity in each language. Looking at the two languages where our models perform worse than the supervised models, we notice that MX has the fewest morphemes, and our unsupervised methods tend to oversegment; WX has the highest degree of ambiguity, with a large number of one-letter morphemes, which makes the task more challenging for unsupervised segmentation than for a supervised setup. Analyzing all the errors that our AG-based models made across all languages, we noticed one, or a combination, of the following factors: a high degree of morpheme ambiguity, short morpheme length and/or low frequency of a morpheme.
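The definition of the degree of ambiguity can be made concrete with a short sketch. Here every occurrence of a morpheme's character string in the training words is counted, including overlapping occurrences, and an occurrence counts as a real morpheme only when it coincides exactly with a gold morpheme span. Those counting conventions, the function names and the toy data are assumptions; the paper does not spell them out.

```python
def morpheme_spans(morphemes):
    """Return the (start, end) character spans of the morphemes of one word."""
    spans = set()
    start = 0
    for morpheme in morphemes:
        spans.add((start, start + len(morpheme)))
        start += len(morpheme)
    return spans


def degree_of_ambiguity(morpheme, segmented_words):
    """Percentage of occurrences of the string that are not a gold morpheme."""
    total = 0
    as_morpheme = 0
    for morphemes in segmented_words:
        surface = "".join(morphemes)
        spans = morpheme_spans(morphemes)
        start = surface.find(morpheme)
        while start != -1:
            total += 1
            if (start, start + len(morpheme)) in spans:
                as_morpheme += 1
            start = surface.find(morpheme, start + 1)
    if total == 0:
        return 0.0
    return 100.0 * (total - as_morpheme) / total


def language_ambiguity(morphemes, segmented_words):
    """Average degree of ambiguity over a set of morphemes."""
    values = [degree_of_ambiguity(m, segmented_words) for m in morphemes]
    return sum(values) / len(values) if values else 0.0


# Invented toy data: the string "ka" occurs five times, twice as a morpheme.
toy_words = [["ka", "ma"], ["kam", "a"], ["ta", "ka"], ["kaka"]]
print(degree_of_ambiguity("ka", toy_words))  # 60.0
```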
Examples. Table 6 shows some examples of words segmented correctly and incorrectly by our models (blue indicates correct morphemes, while red indicates wrong ones). For MX, our models fail to recognize ka as a correct affix 100% of the time due to its high degree of ambiguity (71.79%), while we often wrongly detect ro as an affix, most likely because ro tends to appear at the end of a word; our approaches tend to oversegment in such cases. On the other hand, our method correctly identifies ki as an affix 100% of the time since it appears frequently in the training data. For NH, the morpheme tla has a high degree of ambiguity at 79.12%, which leads the model to fail to recognize it as an affix (see an example in Table 6). On the other hand, NH has a higher percentage of correctly recognized morphemes, due to their less ambiguous nature and higher frequency (such as ke, tl or mo). For WX, a large portion of errors stem from one-letter morphemes that are highly ambiguous (e.g., u, a, e, m, n, p and r), in addition to morphemes in the training set that are not frequent enough to learn from, such as ki, nua and wawi (see Table 6). Examples of correct segmentation involve morphemes that are more frequent and less ambiguous (pe, p@ and ne). For YN, ambiguity is the main source of segmentation errors (e.g., wa, wi and ßa).

Conclusions
Unsupervised approaches based on Adaptor Grammars show promise for morphological segmentation of low-resource polysynthetic languages. We worked with the AG grammars developed by Eskander et al. (2016, 2018) for languages that are not polysynthetic. We showed that even when using these approaches and very little data, we can obtain encouraging results, and that using additional unsupervised data is a promising path.