Morphological disambiguation from stemming data

Morphological analysis and disambiguation is an important task and a crucial preprocessing step in natural language processing of morphologically rich languages. Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis. While linguistically curated finite state tools can be easily developed for morphological analysis, the morphological richness of the language allows many ambiguous analyses to be produced, requiring effective disambiguation. In this paper, we propose learning to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing. Using feature engineering and a feed-forward neural network based classifier, we achieve about 89% non-contextualized disambiguation accuracy. Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.


Introduction
For morphologically rich languages, morphological analysis and disambiguation play a critical role in most natural language processing (NLP) tasks. When inflections are generated by piecing together multiple morphemes, a large and sparse vocabulary is produced, requiring tools to unpack the individual morphemes for downstream NLP tasks such as information extraction and machine translation. A key characteristic of these languages is that morphemes often have specific meanings (often relating to properties of the words they form or referring to contextual entities) and their combination into words is mostly regular. Figure 1 shows typical morphological units contained in the word 'ntuzamwibeshyeho' (Never underestimate him/her).
While several morphologically rich languages such as Turkish, Arabic and Modern Hebrew already have mature tools for morphological segmentation (Çöltekin, 2010; Çöltekin, 2014; Itai and Segal, 2003; Habash and Rambow, 2006), Kinyarwanda still lacks appropriate tools for the task. A key limitation in this effort is the need for high-quality datasets manually annotated by language experts. With limited funding opportunities, research on NLP for low-resource languages lags behind recent advancements made for high-resource languages. In this work, we leverage an easy-to-collect stemming dataset and transform it into a resource for morphological disambiguation. While the focus here is on Kinyarwanda verbal forms, the method can be applied to other morphologically rich languages. Collecting stemming data is much faster and less prone to errors than producing full morphological segmentations, which require subtle linguistic knowledge.
Through a maximum entropy model, we are able to combine morphological properties of stems with inflectional similarity information from word embeddings to accurately disambiguate candidate segmentations from a rule-based morphological analyzer. Our work here pertains to non-contextual verb-phrase disambiguation but is a key step towards contextual disambiguation. We believe that this work will allow rich morphology to be incorporated in new models for Kinyarwanda and improve downstream NLP tasks on the language.

Figure 1: Morphological segmentation of the word 'ntuzamwibeshyeho'. Each morpheme unit has a specific meaning or function that contributes to the meaning of the whole word. The word is an inflection of the lemma 'kwibeshya' (to err), which is a derivation of 'kubeshya' (to lie) formed by adding the reflexive -ii-. Therefore, a literal morpheme-by-morpheme translation of the word would be 'Never lie to yourself about him/her' while the real meaning is 'Never underestimate him/her'.

Related work
Computational morphology has been studied for decades, but most implementations and evaluations have been conducted on languages that are not related to Kinyarwanda. Finite state methods were proposed by Beesley and Karttunen (Karttunen, 2000) and have been popular for morphological analysis. Our morphological analyzer is based on the underlying principle of two-level morphology (Koskenniemi, 1983), but our custom implementation does not follow the exact formalism of finite state transducers. We rather focus on refining rules that are specific to Kinyarwanda through extensive empirical examination. In (Muhirwe, 2007), a morphological alternation model for Kinyarwanda was presented using Xerox tools, but no empirical evaluation was conducted. In (Garrette et al., 2013), an experiment was conducted on learning POS taggers for Kinyarwanda and Malagasy using a small dataset of frequent words annotated by linguists. While POS tagging is an important task, morphological analysis is even more important for morphologically rich languages because it reveals more information than can be covered by a finite set of tags. Other work on Kinyarwanda has been more linguistic in nature (Kimenyi, 1980; Jerro, 2016), especially because Kinyarwanda is considered a prototypical representative of the larger group of Bantu languages, owing to its rich morphology and tonal system. The problem of morphological disambiguation has been studied for other morphologically rich languages such as Turkish, Arabic and Modern Hebrew. Proposed methods range from statistical ones (Hakkani-Tür et al., 2002; Cohen and Smith, 2007) to rule-based approaches (Yuret and Türe, 2006), and more recently recurrent neural networks (Zalmout and Habash, 2017). Most of these methods are trained and evaluated on richly annotated tree-banks which are not available for low-resource languages like Kinyarwanda.
The only labeled data required by our method is a list of inflection/stem pairs, which can be conveniently collected by untrained native speakers. Another major difference is that our disambiguation focuses on uncontextualized morphology of Kinyarwanda verbal forms. We reserve contextual disambiguation and part-of-speech tagging for a future study.

Figure 2: A smart-phone interface is used by volunteers to stem Kinyarwanda verb forms. For instance, the analysis of the inflected form 'izarigishijwe' (...which have been made disappear) can result in ambiguous stems: 'igish' (kwigisha: to teach), 'gish' (kugisha: to ask [for advice]), 'rigis' (kurigisa: to make disappear), 'ig' (kwiga: to study), 'shir' (gushira: to end) and 'siz' (gusiza: to prepare a place for construction).

Dataset development
The dataset for this project comes from a crowd-sourcing effort where users labeled inflected forms of Kinyarwanda verbs with their corresponding lemmas. From a web-crawled corpus, our toolkit detects potential verb inflections, auto-segments them and asks volunteers to choose the right lemma from a proposed list of candidates. For ease of use, volunteers lemmatize the inflected verbs through a simple mobile application (see Figure 2) and the labeled data are sent to a back-end server. The raw labeled dataset was filtered for potential random user inputs by using a baseline classifier and removing data from users who performed poorly on the otherwise least ambiguous instances. While more than 200 volunteers used the stemming application, only data from 37 annotators were found to be consistent and used in this study.

Morphological analysis
Our morphological analysis is based on finite state methods (Karttunen, 2000). Table 2 shows a repertoire of Kinyarwanda verb morphemes and examples of when they are used. The morphotactics, which dictate the ordering of morphemes, are modeled with a hierarchical graph shown in Figure 3. It is this graph of morpheme slots that makes the Kinyarwanda verbal system very productive. In theory, thousands of different morpheme slot sequences can be produced by this graph, but in practice the number is limited by additional semantic and syntactic restrictions.
In addition to the basic morphotactics graph model, morpho-graphemic rules and other morpheme association rules are added to the analyzer using a small constraint-enforcement language. The language is expressive enough to allow a researcher to incorporate complex grammatical regularities.
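To make the idea concrete, a morphotactics graph with an association-rule filter can be sketched in Python (the paper's analyzer is a custom C/C++ implementation; the slot names, graph edges and the example rule below are invented for this sketch, not the actual Kinyarwanda grammar):

```python
# Toy slot graph: each key lists the slots that may follow it.
# Slot inventory and edges are illustrative, not the analyzer's.
SLOT_GRAPH = {
    "START":  ["NEG", "SUBJ"],           # a verb may begin with negation
    "NEG":    ["SUBJ"],                  # negation precedes the subject marker
    "SUBJ":   ["TENSE", "OBJ", "ROOT"],
    "TENSE":  ["OBJ", "ROOT"],
    "OBJ":    ["ROOT"],
    "ROOT":   ["SUFFIX", "END"],
    "SUFFIX": ["END"],
}

def slot_sequences(node="START", path=()):
    """Enumerate every morpheme-slot ordering the graph permits."""
    if node == "END":
        yield path
        return
    for nxt in SLOT_GRAPH[node]:
        yield from slot_sequences(nxt, path + ((nxt,) if nxt != "END" else ()))

def satisfies(seq):
    """Toy association rule: the negative prefix requires a tense marker."""
    return "NEG" not in seq or "TENSE" in seq

valid = [s for s in slot_sequences() if satisfies(s)]
```

In the real analyzer, rules such as `satisfies` are expressed in the constraint-enforcement language rather than hard-coded, so a researcher can add or refine grammatical restrictions without touching the traversal code.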

Classification
We handle morphological disambiguation as a classification problem with a variable number of classes (candidate stems) for each instance. We compute two types of features from each morphological segmentation, feed them to a feed-forward neural network, and produce probabilities over the candidates with a softmax function. We train the network to minimize a cross-entropy loss.
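A minimal sketch of this setup, assuming a shared scorer applied to each candidate's feature vector with a softmax over the variable-sized candidate list (the paper's implementation is in C/C++ with Eigen; the feature dimension, layer sizes and initialization below are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 6                                # feature dim, hidden units (illustrative)
W1, b1 = 0.1 * rng.standard_normal((H, D)), np.zeros(H)
w2, b2 = 0.1 * rng.standard_normal(H), 0.0

def score(feats):
    """feats: (n_candidates, D) -> one scalar score per candidate."""
    h = np.tanh(feats @ W1.T + b1)
    return h @ w2 + b2

def softmax(z):
    e = np.exp(z - z.max())                # shift for numerical stability
    return e / e.sum()

def cross_entropy(feats, gold):
    """Negative log-probability of the annotated (gold) stem."""
    return -np.log(softmax(score(feats))[gold])
```

Because the softmax is taken over however many candidates the analyzer proposes, the same small network handles instances with 2 stems as easily as instances with 6.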

Similarity features
The first type of features estimates how similar a given segmentation is to other inflections of the same stem, effectively handling the stem disambiguation part of the problem. Formally, given a candidate segmentation x with stem s, we first produce a set of N common inflections of the same stem, {y_i}_i^N ⊂ Infl(s) ∩ V, by associating the stem with common standard affixes, generating the surface forms, and keeping only those surface forms that are part of the word embedding vocabulary V. We choose the K inflections nearest to x among {y_i}_i^N (by cosine similarity) and estimate the final similarity scores as:

sim_f(x) = σ(f_m({d_e(x, y_i)}_i^K))

where:
* the {·}_i^N notation means a set of N elements indexed by i
* f_m(·) is a mean function; we use the arithmetic, geometric and harmonic means as separate features.
* d_e(x, y_i) is the normalized angular similarity between the word embedding vectors for x and y_i, i.e.:

d_e(x, y_i) = 1 - arccos(cos(e(x), e(y_i))) / π

with e(·) being the word embedding lookup function.
* σ(·) is a normalizing sigmoid function with tunable hyper-parameters min_f and Max_f for each type of feature f, demarcating the active range of the feature.
Additionally, we use the 2-dimensional Euclidean distance between the token- and document-corpus frequencies of x and y_i to estimate how popular a given segmentation is in the corpus in relation to the popularity of the inflection set {y_i}_i^K:

d_freq(x, y_i) = sqrt((t_c(x) - t_c(y_i))^2 + (t_d(x) - t_d(y_i))^2)

where:
* t_c(z) = σ(token_count(z)), i.e. the sigmoid-normalized number of times z appears in the corpus
* t_d(z) = σ(document_count(z)), i.e. the sigmoid-normalized number of corpus documents containing z

Finally, f_m({d_freq(x, y_i)}_i^K) and (t_c(x) + t_d(x))/2 are included as separate features.
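The core of the similarity computation can be sketched as follows (an illustrative Python sketch, not the paper's C/C++ implementation; the σ normalization with its min_f/Max_f hyper-parameters is omitted, and the function names are ours):

```python
import numpy as np

def angular_sim(u, v):
    """Normalized angular similarity d_e in [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def sim_features(x_vec, infl_vecs, K=3):
    """Keep the K nearest inflections, then summarize with mean functions f_m."""
    sims = np.array(sorted((angular_sim(x_vec, y) for y in infl_vecs),
                           reverse=True)[:K])
    return {
        "arith": float(sims.mean()),
        "geom": float(np.exp(np.log(sims).mean())),    # assumes sims > 0
        "harm": float(len(sims) / np.sum(1.0 / sims)),
    }
```

The three means react differently to outliers (the harmonic mean is pulled down by a single dissimilar neighbor), which is presumably why they are kept as separate features rather than collapsed into one.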

Morphological indicator features
The second type of features evaluates the appropriateness of the morphological features present in a given segmentation against the features typically associated with its stem in the training dataset. These indicator features include the use of special morphemes, morpheme associations and special morpho-graphemic rules inherent in the segmentation. For example, passivization (transformation from active to passive form) is expressed by a special suffix, but not all verbs can be used in the passive form. The same applies to transitivity (the number and type of object pronouns a verb can take), the use of special suffixes, personal pronouns, locatives, and so on. Essentially, the M linguistically-motivated indicator features f_i(x) are compared to their selection ratio scores in the training dataset. By selection ratio scores, we mean the ratio

chosen[f_i, s] / proposed[f_i, s]

computed separately for each indicator feature f_i and stem s, where:
* chosen[f_i, s] is the number of times stem s has been chosen as the valid stem for any morphological segmentation having morphological feature f_i.
* proposed[f i , s] is the number of times stem s has been proposed (either chosen or rejected) among candidate lemmas for any segmentation having morphological feature f i .
The list of indicator features f_i used in our experiments is given in Appendix A.
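The selection ratio scores can be computed with a short sketch over the annotation records; the record format and feature names below are hypothetical, invented for illustration:

```python
from collections import Counter

# Each hypothetical record pairs the annotator-chosen stem with the list of
# candidate analyses (stem, indicator features) proposed for one word.
def selection_ratios(records):
    chosen, proposed = Counter(), Counter()
    for gold_stem, candidates in records:
        for stem, feats in candidates:
            for f in feats:
                proposed[(f, stem)] += 1        # proposed[f_i, s]
                if stem == gold_stem:
                    chosen[(f, stem)] += 1      # chosen[f_i, s]
    # chosen[f_i, s] / proposed[f_i, s] for every observed pair
    return {key: chosen[key] / n for key, n in proposed.items()}
```

A ratio near 1 means annotators almost always accept stem s when feature f_i is present; a ratio near 0 means the combination is routinely rejected, making it a strong negative signal for the classifier.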

Feature extraction example
Here we present a working example of how features are extracted. Given the input word 'gatwikirwa' to be lemmatized by the annotator, we take the following steps to extract the training data.
Step 1. Morphological analysis: The morphological analyzer first produces candidate segmentations as provided in Table 2. The morphological indicator features (from Appendix A) of each candidate analysis are shown in the same table.

Experimental setup
The first step in our experiments was to generate stem morphological indicator features from user annotations as explained in section 3.4. The second step involves preparing the dataset for training and evaluation. After features are extracted, we split the data into training and validation sets and then train a baseline classifier using only the data from user annotations. We up-sample by a factor of 4 the annotations from the best-trained annotator, who has a subtle linguistic understanding of Kinyarwanda verb morphology. We then use the baseline classifier to predict the stem for the entire unlabeled vocabulary of Kinyarwanda verbal forms. We rank the predicted stems by prediction uncertainty (entropy). The most uncertain instances are sent back to annotators for labeling in a batched active learning fashion (Settles, 2009). For active learning, we send batches of the top 10000 uncertain samples for which the entropy H > 1. We also enrich our labeled set with the most confident predictions in a semi-supervised manner. For semi-supervised learning, we only take examples with at least 3 competing stems for which the baseline model's top prediction probability P_1 is at least 0.95, (P_1 − P_2) > 0.95, and entropy H < 0.1.
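The two selection rules can be sketched as follows; the thresholds mirror those stated above, while the function names and the list-of-probabilities prediction format are invented for the sketch:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def select_for_labeling(preds, batch_size=10000, h_min=1.0):
    """Active learning: indices of the most entropic predictions (H > h_min)."""
    scored = [(entropy(p), i) for i, p in enumerate(preds) if entropy(p) > h_min]
    return [i for _, i in sorted(scored, reverse=True)[:batch_size]]

def select_confident(preds, p1_min=0.95, margin=0.95, h_max=0.1, min_stems=3):
    """Semi-supervised: keep only very confident multi-candidate predictions."""
    keep = []
    for i, p in enumerate(preds):
        if len(p) < min_stems:
            continue
        top = sorted(p, reverse=True)
        if top[0] >= p1_min and (top[0] - top[1]) > margin and entropy(p) < h_max:
            keep.append(i)
    return keep
```

The two rules act on opposite tails of the uncertainty distribution: the entropic tail buys the most informative human labels, while the confident tail grows the training set cheaply at low risk of label noise.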
After expanding our labeled data through active learning and semi-supervised learning, we repeat the first step of feature extraction to form our final training and evaluation dataset, which contains about 170,000 examples. We split our dataset into training, development and test sets in the ratio of 70%/15%/15%. All our models are trained with gradient descent using the ADAM update rule (Kingma and Ba, 2014), a 0.01 learning rate, mini-batches of 256 examples, and 50 epochs. For our best fine-tuned model, we re-train the model with large batches of 4000 examples each using the LAMB method (You et al., 2019) for 100 more epochs. Since all features are precomputed, training each feedforward neural network takes less than 5 minutes on a quad-core machine. We use POSIX C and C++11 for this project, with the only external dependency being the Eigen matrix algebra library.

Experimental results and discussion
Model size - We evaluated the model robustness by varying the number of hidden units in the feedforward neural nets (Table 4). Surprisingly, the size of the model does not affect the performance. Even a small network with two hidden layers of 6 and 3 units respectively achieves almost the same accuracy as a network with 3 hidden layers of 32, 16 and 8 units. We also observed very little over-fitting, with the same level of accuracy on the training, development and test sets. We believe that this persistent performance is probably due to the semi-supervised method we used, and possibly to the precomputed summarizing features (i.e. f_m(·)) explaining most of the label variations.

Feature subsets - We then evaluated different feature subsets to assess which ones had the greatest impact on the final performance. All the results presented in Table 5 used the small two-hidden-layer model with 64-6-3 feed-forward units. The three statistics (arithmetic, geometric and harmonic means) of the morphological features account for most of the accuracy, while using the individual morphological features under-performs. This is probably because the small neural network used cannot effectively learn these statistics on its own. The difference in the performance of the inflectional similarity features may be attributed to differences between the two pre-trained word embeddings used. The vocabulary of our "Morpho" embeddings is almost twice as big as the fastText (Bojanowski et al., 2017) one, even though both are trained on the same corpus and based on the Skip-Gram model (Mikolov et al., 2013). So, comparing them requires carefully setting the hyper-parameters min_f and Max_f of the normalizing function σ(·) in equation 4.

Annotator performance - Our final evaluation looked at how our fine-tuned model rated the labels of individual annotators depending on their linguistic training level and the mode of active learning used (Table 6). The 3 annotators reported in the table were identified contacts of the author and together contributed more than 30% of the labeled data. Our interpretation of the results is that the model might be relying too much on easy examples pulled in through semi-supervised learning and on noise introduced by individual annotators. The level of annotator training also has a clear impact on the performance.

Sources of ambiguity - There are inherently multiple sources of ambiguity when one encounters a Kinyarwanda verbal expression. Achieving full disambiguation requires access to complete contextual information. This information may even be encoded only in the tonal system (Kimenyi, 2002) and thus be unavailable in written form. In fact, reading written Kinyarwanda requires careful real-time disambiguation by the reader because tones are not marked in text. Contextual information is also needed for semantic disambiguation. For example, the verb 'yarigishije' can mean either 'a-ara-igish-iz-ye' (he taught) or 'a-a-rigis-iz-ye' (he made disappear). Without the semantic context, both segmentations are possible. Sentence-level disambiguation may also benefit from contextual agreements through the Bantu noun class system. Our annotation process is also affected by lemmatization ambiguity and the blurred boundary between inflection and derivation. For example, it is subjective whether the verbs kwivuga 'ku-ii-vug-a' (to talk about oneself), kuvuza 'ku-vug-y-a' (to make sound with (some object)) and kuvugisha 'ku-vug-ish-a' (to talk to (someone)) are themselves lemma forms or just inflections of kuvuga 'ku-vug-a' (to talk).

Conclusion
This work focused on morphological disambiguation of Kinyarwanda verb forms using a maximum entropy model on a new crowd-sourced stemming dataset. High disambiguation accuracy was achieved through careful feature engineering. Intuitively curated inflectional features emerged as important parsimonious predictors. Future work should look at how to directly use morpheme embedding methods as a way to more generically represent both semantics and morphology in a unified form. Achieving total disambiguation ultimately requires complete contextual information, which may not be available in written form.