Data-Driven Morphological Analysis for Uralic Languages

Abstract

This paper describes an initial set of experiments in data-driven morphological analysis of Uralic languages. It differs from previous work in that it covers both lemmatization and the generation of ambiguous analyses. While hand-crafted finite-state transducers represent the state of the art in morphological analysis for most Uralic languages, we believe that there is a place for data-driven approaches, especially with respect to making up for lack of completeness in the lexicon. We present results for nine Uralic languages showing that, at least for basic nominal morphology in six of the nine languages, data-driven methods can achieve an F-score of over 90%, approaching the results of finite-state techniques. We also compare our system to an earlier approach to Finnish data-driven morphological analysis (Silfverberg and Hulden, 2018) and show that our system outperforms this baseline.


Introduction
Morphological analysis is the task of producing, for a given surface form, a list of all and only the valid analyses in the language. For example, given the surface form voisi in Finnish, a morphological analyser must produce not only the most frequent analysis voida+VERB|Mood=Cond|Number=Sg|Person=3 'can they?' but also the less frequent voida+VERB|ConNeg=Yes 'can-neg', as well as theoretical or rare ones such as voi+NOUN|Number=Sg|Case=Nom|Possessor=Sg2 'your butter'.
Morphological analysis is a cornerstone of language technology for Uralic and other morphologically complex languages, where the type-to-token ratio becomes prohibitive for purely word-based methods. Rule-based morphological analyzers (Beesley and Karttunen, 2003) represent the current state of the art for this task. The analyses returned by such systems are typically very accurate; however, rule-based systems suffer from low coverage, since novel lexical items often need to be manually added to the system.¹ We explore the task of data-driven morphological analysis, that is, learning a model for analyzing previously unseen word forms from a morphologically analyzed text corpus. This can help with the coverage problem encountered with rule-based analyzers. Morphological guessers based on existing rule-based analyzers represent a classical approach to extending the coverage of a rule-based analyzer. These are constructed by transforming an existing analyzer, typically using weighted finite-state methods (Lindén, 2009). In practice, this limits the range of data-driven models that can be applied. For example, models which do not incorporate a Markov assumption (such as RNNs) can be difficult to apply due to the inherent finite-state nature of rule-based analyzers.
Our system² is a neural encoder-decoder which is learned directly from morphologically analyzed text corpora. It is inspired by previous approaches to morphological analysis by Moeller et al. (2018) and Silfverberg and Hulden (2018). In contrast to these existing neural morphological analyzers, our system produces full morphological analyses: it provides both morphological tags and lemmas as output and it can return multiple alternative analyses for one input word form using beam search.
We present experiments on morphological analysis of nouns for nine Uralic languages: Estonian, Finnish, Komi-Zyrian, Moksha, Hill Mari, Meadow Mari, Erzya, North Sámi and Udmurt. We show that our system achieves roughly 90% F1-score for most of the tested languages. Additionally, we compare our system to the Finnish data-driven morphological analyzer presented by Silfverberg and Hulden (2018). As seen in Section 5, our system clearly outperforms the earlier approach.

Related Work
There is a strong tradition of work on rule-based morphological analysis for Uralic languages. Recent examples include Pirinen et al. (2017), Trosterud et al. (2017) and Antonsen et al. (2016), although work in the area has been going on for many years (cf. Koskenniemi (1983)). There is also a growing body of work on data-driven morphological tagging for Uralic languages, especially Finnish. Here, a system is trained to find a single contextually appropriate analysis for each token in a text. Examples of work exploring morphological tagging for Finnish include Kanerva et al. (2018) and Silfverberg et al. (2015). However, work on full data-driven morphological analysis, where the task is to return all and only the valid analyses for each token irrespective of sentence context, is almost non-existent for Uralic languages. The only system known to the authors is the recent neural analyzer for Finnish presented by Silfverberg and Hulden (2018). The system first encodes an input word form into a vector representation using an LSTM encoder. It then applies one binary logistic classifier, conditioned on this vector representation, for each morphological tag (for example NOUN|Number=Sg|Case=Nom). The classifier determines whether the tag is a valid analysis for the given input word form. Like Silfverberg and Hulden (2018), our system is a neural morphological analyzer, but unlike theirs it incorporates lemmatization. Moreover, the design of our system differs considerably from their system, as explained below in Section 3. The lack of work on morphological analysis for Uralic languages is unsurprising, because the field of data-driven morphological analysis in general remains underexplored. Classically, morphological analyzers have been extended using morphological guessers (Lindén, 2009); however, the premise for such work is quite different: an existing analyzer is modified to analyze unknown word forms based on orthographically similar known word forms. In contrast, we explore a setting where the starting point is a morphologically analyzed corpus and the aim is to learn a model for analyzing unseen text.

¹ Although novel lexical items can cause problems for data-driven systems as well, most data-driven systems are still able to analyze any word form in principle.
² Code available at https://github.com/mpsilfve/morphnet.
Outside of the domain of Uralic languages, Nicolai and Kondrak (2017) frame morphological analysis as a discriminative string transduction task. They present experiments on Dutch, English, German and Spanish. In contrast to Nicolai and Kondrak (2017), Moeller et al. (2018) use a neural encoder-decoder system for morphological analysis of Arapaho verbs. Their system returns both lemmas and morphological tags but it cannot handle ambiguous analyses in general.³ Our work is inspired by the neural encoder-decoder approach presented by Moeller et al. (2018) but we do handle unrestricted ambiguity.
In contrast to data-driven morphological analysis, data-driven morphological generation has received a great deal of attention lately due to several shared tasks organized by CoNLL and SIGMORPHON (Cotterell et al., 2016, 2017, 2018). The most successful approaches to the generation task (Kann and Schütze, 2016; Bergmanis et al., 2017; Makarov et al., 2017; Makarov and Clematide, 2018) involve different flavors of the neural encoder-decoder model. Therefore, we opted to apply it in our morphological analyzer.

Model
This section presents the encoder-decoder model used in the experiments.

An Encoder-Decoder Model for Morphological Analysis
Following Moeller et al. (2018), we formulate morphological analysis as a character-level string transduction task and use an LSTM (Hochreiter and Schmidhuber, 1997) encoder-decoder model with attention (Bahdanau et al., 2014) for performing the string transduction.

Figure 1: We use a bidirectional LSTM encoder to encode an input word form into forward and backward states (the pink and green bars in the figure) one character at a time. We then use an attentional LSTM decoder to generate output analyses one symbol at a time. We return the smallest number of most probable analyses whose combined probability mass is greater than a threshold p. In this example, for p = 0.9 and the input form koiraan, the analyzer would return koira+NOUN+Num=Sg|Case=Ill and koiras+NOUN+Num=Sg|Case=Gen, whose combined probability mass is 0.97.

³ The system can handle ambiguity in limited cases by using underspecified tags. For example, an ambiguity between singular and dual number could be expressed using a tag [SG/DPL].

To this end, we train our model to translate input word forms like koiraan (the singular illative of koira 'dog' or the singular genitive of koiras 'male' in Finnish) into a set of output analyses:

koira+NOUN+Number=Singular|Case=Ill koiras+NOUN+Number=Singular|Case=Gen
Each analysis consists of a lemma (koira 'dog'), a part-of-speech (POS) tag (NOUN) and a morphosyntactic description (MSD) (Number=Singular|Case=Gen). The procedure is illustrated in Figure 1. Above, we presented the Finnish example voisi, which can be both an inflected form of a noun and an inflected form of a verb. This shows that a word form may have multiple valid morphological analyses with different lemmas, POS tags and MSDs. Therefore, our model needs to be able to generate multiple output analyses for a given input word form. We accomplish this by extracting several output candidates from the model using beam search and selecting the most probable candidates as model outputs. The number of outputs is controlled by a probability threshold hyperparameter p: we extract the smallest number of top-scoring candidates whose combined probability mass is greater than p. Additionally, we restrict the maximal number of output candidates with a hyperparameter N. The hyperparameters p and N are tuned on the development data.
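The selection procedure can be sketched as follows. This is a minimal illustration with our own function names; the beam candidates and their probabilities are invented for the example and do not come from the trained model:

```python
def select_analyses(candidates, p=0.9, n_max=5):
    """Return the smallest number of top-scoring candidates whose combined
    probability mass exceeds the threshold p, capped at n_max outputs.
    candidates: list of (analysis, probability) pairs, e.g. an n-best
    list produced by beam search."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected, mass = [], 0.0
    for analysis, prob in ranked:
        if mass > p or len(selected) >= n_max:
            break
        selected.append(analysis)
        mass += prob
    return selected

# Invented beam for the koiraan example from Figure 1:
beam = [
    ("koira+NOUN+Num=Sg|Case=Ill", 0.62),
    ("koiras+NOUN+Num=Sg|Case=Gen", 0.35),
    ("koi+NOUN+Num=Sg|Case=Ill", 0.02),
]
print(select_analyses(beam, p=0.9))
# → ['koira+NOUN+Num=Sg|Case=Ill', 'koiras+NOUN+Num=Sg|Case=Gen']
```

With p = 0.9, the first two candidates are returned because their combined mass (0.97) exceeds the threshold; the third never enters the output.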

Implementation Details
We implement our LSTM encoder-decoder model using the OpenNMT neural machine translation toolkit (Klein et al., 2017). We use 500-dimensional character and tag embeddings for input and output characters as well as POS and MSD tags. These are processed by a 2-layer bidirectional LSTM encoder with hidden state size 500. Encoder representations are fed into a 2-layer LSTM decoder with hidden state size 500. During inference, we use beam search with beam width 10.
When training, we use a batch size of 64 and train for 10,000 steps, where one step corresponds to an update on a single mini-batch. Model parameters are optimized using the Adam optimization algorithm (Kingma and Ba, 2014).
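For reference, the setup above roughly corresponds to a training invocation along the following lines. This is a sketch only: the flag names follow OpenNMT-py 1.x conventions and should be checked against the installed version, and the data and model paths are placeholders, not the ones used for the paper.

```shell
# 2-layer BiLSTM encoder and 2-layer LSTM decoder, 500-dim embeddings
# and hidden states, Adam, batch size 64, 10,000 training steps.
onmt_train \
    -data data/analyzer \
    -save_model models/analyzer \
    -rnn_type LSTM -encoder_type brnn \
    -layers 2 -rnn_size 500 -word_vec_size 500 \
    -batch_size 64 -train_steps 10000 \
    -optim adam

# At inference time, decode with beam width 10:
onmt_translate -model models/analyzer_step_10000.pt \
    -src test.src -output test.pred -beam_size 10 -n_best 10
```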

Data
We use two datasets in the experiments. The first dataset was created by using the morphological transducers from Giellatekno to analyze word forms in frequency lists derived from Uralic Wikipedias. The second was created using data from the Turku Dependency Treebank and was originally presented by Silfverberg and Hulden (2018). We explicitly do not use any data from the UniMorph project.

Uralic Wikipedia Data
We applied the models to nine Uralic languages: Erzya (myv), Estonian (est), Finnish (fin), Komi-Zyrian (kpv), Meadow Mari (mhr), Hill Mari (mrj), Moksha (mdf), North Sámi (sme) and Udmurt (udm). These languages were chosen as they had both a moderately-sized free and open text corpus (Wikipedia) and an existing free/open-source morphological analyser from the Giellatekno infrastructure (Moshagen et al., 2014). Hungarian (hun) was omitted as there was no functional analyser in the Giellatekno infrastructure, while the remainder of the Sámi languages (i.e. South (sma), Lule (smj), Inari (smn), …) and Kven (fkv) were left out as they do not yet have a Wikipedia. The remaining Uralic languages have neither a wide-coverage analyser nor a Wikipedia.
The data used in the experiments consisted of tab-separated files with five columns: language code, surface form, lemma, part of speech and a list of morphological tags expressed as Feature=Value pairs (see Figure 2). Both the parts of speech and the morphological tags broadly follow the conventions of the Universal Dependencies project (Nivre et al., 2016), with one exception: the tags are given in the order in which they appear in the original morphological analyses (largely morpheme order), rather than in alphabetical order by feature name.
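A minimal reader for this file format might look as follows. The field names are ours, and the example row is illustrative rather than copied from the actual data files:

```python
from typing import NamedTuple

class Entry(NamedTuple):
    """One row of the tab-separated training data."""
    lang: str     # language code, e.g. "sme"
    surface: str  # inflected surface form
    lemma: str
    pos: str      # part-of-speech tag, e.g. "NOUN"
    msd: str      # Feature=Value tags kept in original morpheme order

def parse_line(line):
    # Five tab-separated columns, as described in the text.
    return Entry(*line.rstrip("\n").split("\t"))

# Illustrative row (not taken verbatim from the data files):
row = parse_line("sme\tbeatnagiid\tbeana\tNOUN\tNumber=Pl|Case=Gen")
```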
Each file was generated as follows: First we downloaded the relevant Wikipedia dump⁴ and extracted the text using WikiExtractor.⁵ This gave us a plain-text corpus of the language in question. We then used the morphological transducers from Giellatekno (Moshagen et al., 2014) to both tokenize and analyze the text. This was then made into a frequency list using standard Unix utilities. We then extracted only the forms with noun analyses and removed all non-noun analyses, along with noun analyses that included numerals, abbreviations, acronyms, spelling errors or dialectal forms. All derived and compound analyses were also removed, in addition to analyses that included clitics (e.g. Finnish -kään, -kaan). The exclusion of these phenomena makes the task less applicable to a real-world setting, but at the same time makes it tractable for initial experiments such as the ones presented in this paper.
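The analysis-filtering step can be sketched roughly as follows, assuming hfst-lookup-style analysis strings such as "beana+N+Sg+Nom". The tag names in the exclusion list are illustrative stand-ins for the Giellatekno tag sets, not the exact tags used by each analyser:

```python
# Illustrative tags for derivation, compounding, numerals, abbreviations,
# acronyms, errors, dialectal forms and clitics (e.g. Finnish -kaan/-kin).
EXCLUDE = {"+Der", "+Cmp", "+Num", "+ABBR", "+ACR", "+Err", "+Dial",
           "+Foc/kaan", "+Foc/kin"}

def keep_analysis(analysis):
    """Keep plain noun analyses that show none of the excluded phenomena."""
    return "+N+" in analysis and not any(t in analysis for t in EXCLUDE)

def filter_nouns(freq_list):
    """freq_list maps a surface form to (frequency, list of analyses).
    Drop non-noun analyses, and drop forms left with no noun analysis."""
    filtered = {}
    for form, (freq, analyses) in freq_list.items():
        kept = [a for a in analyses if keep_analysis(a)]
        if kept:
            filtered[form] = (freq, kept)
    return filtered
```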
After creating the frequency list, we converted the format of the analyses by means of a simple lookup table (e.g. +Gen → Case=Gen). An example from the training data of North Sámi can be found in Figure 2 and details about the size of the training data for each of the languages can be found in Table 1.
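A minimal sketch of this conversion follows; the lookup table shown is a small illustrative fragment, whereas the real table covers the full tag inventory for each language:

```python
# Illustrative fragment of the tag-conversion lookup table.
TAG_MAP = {
    "+N": "NOUN",
    "+Sg": "Number=Sg",
    "+Pl": "Number=Pl",
    "+Nom": "Case=Nom",
    "+Gen": "Case=Gen",
    "+Ill": "Case=Ill",
}

def convert(analysis):
    """Rewrite e.g. 'beana+N+Sg+Gen' as ('beana', 'NOUN',
    'Number=Sg|Case=Gen'), keeping the tags in their original
    (morpheme) order rather than sorting them."""
    lemma, *tags = analysis.split("+")
    mapped = [TAG_MAP["+" + t] for t in tags]
    return lemma, mapped[0], "|".join(mapped[1:])
```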
All data sets were randomly split into 80% training data, 10% development data and 10% test data. The splits are disjoint in the sense that the training and development set never include word forms seen in the test set. They may, however, include other inflected forms of lemmas that do occur in the test set.
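The type-disjoint split can be sketched as follows; this is our own minimal rendering, not the exact script used to prepare the data:

```python
import random

def split_types(forms, seed=0):
    """Randomly split unique word forms 80/10/10 into train/dev/test.
    Splitting at the type level guarantees that no surface form in the
    test set appears in the training or development data; other inflected
    forms of the same lemma may still occur across splits."""
    forms = sorted(set(forms))
    random.Random(seed).shuffle(forms)
    n_train = int(0.8 * len(forms))
    n_dev = int(0.1 * len(forms))
    return (forms[:n_train],
            forms[n_train:n_train + n_dev],
            forms[n_train + n_dev:])
```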

Finnish Treebank Data
Our second dataset was presented by Silfverberg and Hulden (2018). It is the Finnish part of the Universal Dependencies treebank v1 (Pyysalo et al., 2015), analyzed using the OMorFi morphological analyzer (Pirinen et al., 2017). We used the splits into training, development and test sets provided by Silfverberg and Hulden (2018). In contrast to the Uralic Wikipedia data, which is a type-level resource consisting of analyses for unique word forms, the Finnish treebank data is a token-level resource consisting of morphologically analyzed running text. Therefore, the same word form can occur multiple times in the dataset. This means that the training, development and test sets are not disjoint, which makes the task somewhat easier. However, the dataset contains word forms from all Finnish word classes, as well as derivations and clitics. This, in turn, makes it more versatile than the Uralic Wikipedia data. The dataset is described in Table 2.

Experiments and Results
We present results for two experiments. In the first experiment, we train analyzers for the Uralic Wikipedia data presented in Section 4. In the second experiment, we train an analyzer on the Finnish Treebank data used by Silfverberg and Hulden (2018) and compare our system to theirs.
Because an input word form can have several alternative analyses, we present results for precision, recall and F1-score over analyses. These are defined in terms of the quantities true positives (tp), the number of gold-standard analyses that our system recovered; false positives (fp), the number of incorrect analyses that our system produced; and false negatives (fn), the number of gold-standard analyses that our system was unable to recover. Definitions for recall, precision and F1-score are given below:

Recall = tp / (tp + fn), Precision = tp / (tp + fp), F1-score = 2 · (Recall · Precision) / (Recall + Precision)
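Per word form, these metrics can be computed over sets of analyses; the following is a minimal sketch with our own function name (corpus-level scores would be obtained by summing tp, fp and fn over all forms before applying the formulas):

```python
def prf(gold, predicted):
    """Precision, recall and F1 over sets of analyses for one word form."""
    tp = len(gold & predicted)   # gold analyses the system recovered
    fp = len(predicted - gold)   # incorrect analyses the system produced
    fn = len(gold - predicted)   # gold analyses the system missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```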

Experiment on Uralic Wikipedia Data
We present three different evaluations of the results.

Table 6: Qualitative evaluation of the errors in the output of the system for Udmurt. Loan words from Russian make up a large proportion of the errors. (Total: 97, 100.0%)

Table 4 shows results for plain lemmas without POS or MSD. Here all languages except North Sámi achieve an F1-score of over 90% and, as in the case of full analyses, recall is again higher than precision for all languages. For lemmas, the best F1-score is again attained on Udmurt (95.77%). The final evaluation is shown in Table 5, which gives results for the POS and MSD tags. Overall, results here are higher than for the lemma or full analysis: in excess of 92% for all languages. As in the case of full analyses and lemmas, our model again delivers the best F1-score for Udmurt (97.03%).
For the Udmurt data, given that only 97 analyses were incorrect, we were able to do a partial qualitative evaluation, shown in Table 6. We looked at all the erroneous analyses and categorised them into nine error classes: (1) Russian loan words ending in -ие, -ье or -ья that do not receive the right lemma; (2) other mistakes in loan words from Russian; (3) the plural morpheme being treated as part of the stem; (4) words ending in the soft sign -ь that were mislemmatized; (5) overenthusiastic lemmatization, i.e. the system produced a lemma that did not exist in the data; (6) underenthusiastic lemmatization, i.e. a lemma in the data was not produced by the system; (7) impossible lemmas, i.e. the singular nominative did not have the same form as the lemma; (8) words containing a hyphen; and (9) other.
A typical error of the first type can be found in the lemmatization of the word путешествие 'travel', for which the lemma given by the network was *путешестви;⁶ similarly, *междометия was given for междометие 'interjection'. The second error class included errors like the lemma *республик for the form республиказы 'to/in our republic'. The system also sometimes generated lemmas in the plural form (third error type); for example, бурдъёсаз 'on/to its wings' generated two correct analyses with the lemma бурд 'wing' and one incorrect analysis with the lemma бурдъёс 'wings'. For errors of the fourth type, consider the form пристане 'wharf-ill', which has the lemma пристань 'wharf' (as in Russian), but for which the system produced both *пристан and *пристане, neither of which exists as a lemma in Udmurt or Russian.
For the fifth type we have спортэ giving the lemma *спор⁷ instead of спорт 'sport', and for the sixth type берлань, where we get the noun lemma берлань instead of бер 'back-approx'. Note that there is a much more frequent reading of берлань as an adverb 'ago, back' (Russian назад), but as this was not a nominal reading it was excluded from the experiments.

⁶ According to some Udmurt authors this is the preferred nominative singular form, but we count it as an error, as the analyser we based the gold standard on uses путешествие as the lemma.
⁷ Note that this could potentially be a loan of спор 'dispute, argument' from Russian, but as it was not in the gold standard we counted it as an error.
For the seventh type consider the word пияш 'boy, lad' which generated nominative singular analyses with the lemmas пи 'son' and *пиеш.
The system was also confused by compound words written with a hyphen (error type 8). Three out of seven of these had various kinds of errors, for example losing part of the compound (ваньмыз-ӧвӧлэз → ваньмыз), making compound-internal vowel changes (тодон-эскеронъя → *тодон-ӧскерон) or treating an affix as part of the lemma (музей-коркан 'village-house museum' → музей-коркан).
While the 'Other' class makes up almost half of the errors, we can see that over half of the errors could in principle be solved simply by adding more data. That is, the model has not received enough information about how Russian loan words or words with hyphens behave, as these make up only a small fraction of the data.

Experiment on Finnish Treebank Data
In our second experiment, we compare our system against the neural morphological analyzer proposed by Silfverberg and Hulden (2018). We trained a morphological analyzer on the Finnish treebank training data used by Silfverberg and Hulden (2018) and report results on their test data. Similarly to Silfverberg and Hulden (2018), we also return the set of analyses seen in the training data for those test word forms which were seen in the training data. Table 7 shows results on the Finnish treebank dataset. We only report precision, recall and F1-score with regard to tags (POS + MSD), because the system by Silfverberg and Hulden (2018) is not capable of lemmatization. As Table 7 shows, our system clearly outperforms the system proposed by Silfverberg and Hulden (2018) with regard to F1-score on tags. Results on the Finnish treebank data are also far better than results on the Finnish Wikipedia data.
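The lookup-backoff for seen word forms can be sketched as follows; this is our own minimal rendering, where `model` stands in for the neural analyzer:

```python
def analyze(form, train_lexicon, model):
    """Return analyses for a word form: forms seen in training are
    looked up directly in the stored training analyses; only unseen
    forms are passed to the neural model.
    train_lexicon: dict mapping seen forms to their sets of analyses.
    model: any callable returning a set of analyses for a form."""
    if form in train_lexicon:
        return train_lexicon[form]
    return model(form)
```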
Our system clearly outperforms the system by Silfverberg and Hulden (2018) on the Finnish Treebank data. In contrast to the Uralic Wikipedia data, the Finnish Treebank dataset represents continuous text with word forms belonging to a mix of word classes. It also covers clitics and derivations, which are missing from the Uralic Wikipedia dataset.⁸ Therefore, this experiment indicates that our system is also applicable to the analysis of running text for Finnish.
The overall better performance on the Finnish Treebank dataset is explained by the fact that it is a token-level resource where frequent words, which are easy to analyze, can substantially improve performance.
In contrast to what Silfverberg and Hulden (2018) found, our results on the Uralic Wikipedia data indicate that recall is higher than precision for most languages. On the Finnish treebank data, however, we too get higher precision than recall, although our system delivers more balanced recall and precision than the system proposed by Silfverberg and Hulden (2018). It is not immediately clear why it is advantageous to prefer precision over recall, but this may be related to the large number of possible POS + MSD combinations in the Finnish Treebank dataset. Many of these could appear applicable judging purely from the orthographical shape of a particular word form, but only a small number of the combinations will actually result in a valid analysis. Therefore, it may be advantageous to return a more restricted set of highly likely analyses.
As explained in Section 3, we return analyses based on probability mass. It could be better to predict how many analyses should be returned based on the input word form itself. For example, if the input word form is markedly different from most forms seen in the training data, the model may assign low confidence to all output analyses; applying a probability mass threshold in this case may result in a very large number of outputs.
Large training sets are available for only a few Uralic languages. Therefore, we should explore using a hard attention model similar to Makarov and Clematide (2018) in our encoder-decoder. The results from the CoNLL-SIGMORPHON shared tasks (Cotterell et al., 2018) show that a hard attention model can be a far stronger learner in a low-resource setting.
Conclusion

We presented a data-driven morphological analyzer and evaluated its performance on morphological analysis of nouns for nine Uralic languages. Moreover, we evaluated its performance on Finnish running text. Our system delivers encouraging results: the F1-score for analysis of nouns is around 90% for most of our languages. In addition, our system substantially improves upon the baseline presented by Silfverberg and Hulden (2018). In future work, we intend to explore hard attention models for morphological analysis, since these deliver strong performance in the low-resource settings that are typical for Uralic languages. Moreover, we intend to explore more principled ways of handling ambiguous analyses.
⁸ Recall that clitics and derivations are missing because they were removed during processing of the Wikipedia data (§4) to make the data easier to process and more comparable cross-linguistically, as clitics are treated differently in the different analysers.