KU-CST at the SIGMORPHON 2020 Task 2 on Unsupervised Morphological Paradigm Completion

We present a model for the unsupervised discovery of morphological paradigms. The goal of this model is to induce morphological paradigms from the Bible (raw text) and a list of lemmas. Our model splits each lemma into a stem and a suffix, and then builds a plausible suffix list by considering lemma pairs. Our model was not able to outperform the official baseline, and there is still room for improvement, but we believe that the ideas presented here are worth considering.


Introduction
In this paper we describe our attempt to capture morphological paradigms entirely from scratch, prepared for the task of morphological paradigm completion in the CoNLL-SIGMORPHON 2020 Shared Task. Computational morphology is not a new area and there is plenty of related work. Some years ago, this problem was commonly tackled using finite-state and two-level approaches, such as in Kaplan and Kay (1994), Beesley and Karttunen (2003), and Koskenniemi (1983). Recent works, on the other hand, rely mostly on statistical approaches, such as in Faruqui et al. (2016) and Kann and Schütze (2017).
There have been several Shared Tasks recently on morphological inflection (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019). The task for this year is more complex, as we are asked to discover paradigms from scratch. This is an intriguing research area that could give us the chance of recovering dead languages that have only limited written resources. Several researchers have attempted to solve this task, such as Goldsmith et al. (2017), Jin et al. (2020), and Erdmann et al. (2020).
We present a pipeline that assumes that all morphological realizations in a paradigm (for each language) follow a fixed structure: stem+suffix.
Based on that logic, we look for the best candidates to compose the suffix inventory, cluster them using K-means, and then join stems and suffixes. We employ language models to get the most natural outputs. The pipeline that we have developed does not contain any neural network component, but we consider adding one as a possible extension of our work in the near future.
This paper is structured as follows: In the next section we introduce the task that we have worked on. We describe our approach in the third section. Afterwards, we show our results compared to the baseline model. To conclude, we discuss our results and provide possible future directions.

Task
This competition consisted of a single task: building a computational system that, given a raw text and a set of lemmas, returns the complete paradigm for each verb. The system should be able to read a text like this:
The aircraft landed at the JFK airport. Other pilots decided to land in Philadelphia. As you may imagine, landing a plane is not an easy job, but imagination can help.
and extract morphological paradigms. In the shared task, a list of lemmas is also given as a starting point. This list of lemmas could include verbs like land and imagine.
In the case of the verb land, in the example above, it is fairly easy to get its inflections (land, landed, landing). This could, for example, be done with a method based on Minimum Edit Distance, and it is relatively easy because land is not used as a noun in the text. It gets slightly more complicated with the verb imagine, as a simple distance-based algorithm could fail by proposing imagination as a possible conjugation of the verb imagine.
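To make the difficulty concrete, the following sketch computes the standard unit-cost Levenshtein distance for the word pairs of the example. It is only an illustration and not part of our system; the length normalization is an assumption introduced here. Relative to word length, imagination is almost as close to imagine as landing is to land, so a distance threshold alone leaves very little margin for separating inflection from derivation.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an illustrative choice)."""
    return edit_distance(a, b) / max(len(a), len(b))


for lemma, word in [("land", "landed"), ("land", "landing"),
                    ("imagine", "imagination")]:
    print(lemma, word, edit_distance(lemma, word),
          round(normalized_distance(lemma, word), 3))
```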

Dataset
As the Bible is one of the most widely available resources across languages, the organizers chose it as the raw text input data. Together with the Bible, a list of verbal lemmas was given. The languages for development were Maltese, Persian, Portuguese, Russian and Swedish. The languages for testing were Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish and Turkish.

Method
Our method makes a very strong assumption, which oversimplifies the problem but also makes it possible to recognize some patterns. The assumption is that, for all languages, all lemmas and their inflections have the following form:
STEM+SUFFIX → STEM+SUFFIX
as illustrated in the following examples for English and Spanish:
play → playing      play+ → play+ing
jugar → jugando     jug+ar → jug+ando

Pipeline
We use a pipeline that includes four different steps. These are described below.

Step 1
In the first step, for each lemma $l$ in the lemma list $L$ and each word $w$ in the corpus/dictionary $D$, all possible splits $l_1^1 + l_2^{|l|}$, $l_1^2 + l_3^{|l|}$, ..., $l_1^{|l|}+$, and analogously for $w$, are generated. We assume the stem (the hypothesized STEM) to be nonempty but allow the suffix to be empty. For the Spanish lemma jugar we thus get j+ugar, ju+gar, jug+ar, juga+r, and jugar+.
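The split enumeration of this step can be sketched as follows. This is a minimal illustration; the function name `splits` is chosen here and is not taken from our implementation.

```python
def splits(word: str) -> list[tuple[str, str]]:
    """All (stem, suffix) splits with a nonempty stem; the suffix may be empty."""
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]


for stem, suffix in splits("jugar"):
    print(f"{stem}+{suffix}")
```

A word of length n yields exactly n splits, the last one pairing the whole word with the empty suffix.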

Step 2
In the second step, we determine the inflections of the regular verbs of the language. These will be used for the estimation of the morphological richness $r_m$ of the lemmas (verbs) in the third step. The morphological richness of the lemmas can be identified with the number of combinations of those tense, aspect, mood, and agreement features that can be distinctively morphologically realized. Because the morphological richness of the lemmas (verbs) does not tend to vary much across the different lemmas (verbs), even if they inflect semi-irregularly or irregularly, we assume that each lemma has $r_m$ different inflections. $r_m$ thus provides an upper bound on the number of cells of the paradigms of the language/corpus.
To determine $r_m$, we identify the inflections of the lemmas with regular inflection. First, we determine for each split lemma $l = r+s$ the number of potential inflections $r+s'$ of the hypothesized stem $r$ in $D$. This is the size of the set $S_{r+s} = \{s' \mid r+s' \in D\}$. For regularly inflecting lemmas, $S_{r+s}$ will be large for the actual split, but also for any split within the stem. For the German lemma spielen (play), for example, the actual split is spiel+en, but a split inside the stem such as spie+len finds just as many continuations in $D$, since every form spiel+s' can also be read as spie+ls'.
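To compensate for this deficiency, we also consider pairs of split lemmas $l = r+s$, $l' = r'+s$ with distinct stem endings, $r_{|r|} \neq r'_{|r'|}$, and we determine the split $\hat{\imath}$ of $s$ that yields the maximum number of common inflections:
$$\hat{\imath} = \arg\max_{0 \le i \le |s|} \left| S_{r s_0^{i} + s_{i+1}^{|s|}} \cap S_{r' s_0^{i} + s_{i+1}^{|s|}} \right|$$
We choose for each lemma pair $l, l'$ the splits $\hat{r}+\hat{s}$ and $\hat{r}'+\hat{s}$, with $\hat{r} = r s_0^{\hat{\imath}}$, $\hat{r}' = r' s_0^{\hat{\imath}}$, and $\hat{s} = s_{\hat{\imath}+1}^{|s|}$, and consider their common suffixes in $D$: $S_{\hat{r}+\hat{s}} \cap S_{\hat{r}'+\hat{s}}$.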
Because regularly inflecting verbs tend to share their inflections, this lemma pairing allows us to reliably predict that, for example, the stems of spielen and gehen are spiel and geh.
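The suffix sets and the lemma pairing of this step can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the toy vocabulary, the function names, and the tie-breaking rule (the shortest stem wins when several splits share equally many suffixes) are all introduced here.

```python
from typing import Optional


def suffix_set(stem: str, vocab: set[str]) -> set[str]:
    """All continuations s' such that stem + s' is in the vocabulary."""
    return {w[len(stem):] for w in vocab if w.startswith(stem)}


def common_suffix(a: str, b: str) -> str:
    """Longest string that both a and b end with."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


def best_pair_split(l1: str, l2: str,
                    vocab: set[str]) -> Optional[tuple[str, str, set[str]]]:
    """Move the split point through the shared ending of two lemmas and
    keep the stems whose suffix sets overlap the most."""
    s = common_suffix(l1, l2)
    r1, r2 = l1[:len(l1) - len(s)], l2[:len(l2) - len(s)]
    best = None
    for i in range(len(s) + 1):
        stem1, stem2 = r1 + s[:i], r2 + s[:i]
        if not stem1 or not stem2:  # stems must be nonempty
            continue
        shared = suffix_set(stem1, vocab) & suffix_set(stem2, vocab)
        if best is None or len(shared) > len(best[2]):  # ties: keep shortest stem
            best = (stem1, stem2, shared)
    return best


# Hypothetical toy vocabulary, not taken from the shared-task data.
vocab = {"spielen", "spiele", "spielst", "spielt", "spielte",
         "gehen", "gehe", "gehst", "geht"}

# On its own, a split inside the stem looks just as productive as the real one.
print(len(suffix_set("spiel", vocab)), len(suffix_set("spie", vocab)))
print(best_pair_split("spielen", "gehen", vocab))
```

In this toy vocabulary, spiel and spie each have five continuations, so the single-lemma counts cannot tell them apart, while the pairing with gehen selects the stems spiel and geh.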

Step 3
The goal of this step is to group different realizations of the same suffix. The previous step captures relevant suffixes, but in some cases parts of the stem are also included in these suffixes, or there are slight differences caused by morphophonological changes. In order to group them, we employ K-means.
When using K-means we need a function that calculates the distance between elements; the instances are then clustered based on this distance. We decided to employ a modified version of Minimum Edit Distance. Our modified version penalizes changes made at the end of the suffix. The assumption is that changes at the beginning of the suffix are more likely to be caused by the stem (so the two strings could still be the same suffix), whereas changes at the end indicate a different suffix. Our edit distance algorithm allows insertion and deletion as possible changes. We also assume that substituting a vowel with a consonant is worse than substituting a vowel with a vowel. Therefore:
Distance(era, bra) > Distance(era, ara)

         ntar    ntaron  aron    ar
ntar     0.000   0.939   0.778   0.094
ntaron   -       0.000   0.015   0.832
aron     -       -       0.000   0.656
ar       -       -       -       0.000

We estimate that the number of paradigm cells ($r_m$) in a language is approximately one third of the number of different suffixes found in the previous step. This number was estimated based on the behaviour of the model on the Swedish data. Therefore, K-means reduces the number of possible suffixes to one third (this is a parameter that will be tuned in the future). For example, one of the clustered groups found in this step for the Spanish data is {rá, erá, derá, ará, irá}. This corresponds to the suffix of the future simple, third person singular.
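A distance with the two properties described above could look like the following sketch. The concrete weights are assumptions made for illustration (an edit that is d characters away from the end of the suffix is scaled by 1/(1+d), and a substitution within the same class, vowel or consonant, costs half as much as one across classes); they are not the weights of our system and do not reproduce the values reported above. The clustering itself is not shown.

```python
VOWELS = set("aeiouáéíóú")  # assumed vowel inventory for the Spanish examples


def position_weight(dist_from_end: int) -> float:
    """Edits close to the end of the suffix weigh more (assumed scheme)."""
    return 1.0 / (1.0 + dist_from_end)


def substitution_cost(x: str, y: str) -> float:
    """Cheaper when both symbols are vowels or both are consonants."""
    if x == y:
        return 0.0
    same_class = (x in VOWELS) == (y in VOWELS)
    return 0.5 if same_class else 1.0


def suffix_distance(a: str, b: str) -> float:
    """Position-weighted edit distance between two candidate suffixes."""
    la, lb = len(a), len(b)
    d = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        d[i][0] = d[i - 1][0] + position_weight(la - i)
    for j in range(1, lb + 1):
        d[0][j] = d[0][j - 1] + position_weight(lb - j)
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            delete = d[i - 1][j] + position_weight(la - i)
            insert = d[i][j - 1] + position_weight(lb - j)
            weight = position_weight(min(la - i, lb - j))
            substitute = d[i - 1][j - 1] + weight * substitution_cost(a[i - 1], b[j - 1])
            d[i][j] = min(delete, insert, substitute)
    return d[la][lb]


print(suffix_distance("era", "bra") > suffix_distance("era", "ara"))
print(suffix_distance("ntar", "ar") < suffix_distance("aron", "ar"))
```

Under these assumed weights, stem material at the front of a suffix (ntar versus ar) is cheaper than extra material at its end (aron versus ar), which is the qualitative behaviour the clustering relies on.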

Step 4
In the previous steps we have generated possible suffixes for each cell in a paradigm. Now, the goal is to guess how a word form should be generated. For example, in Spanish, if we have the lemma sanar and we want to build the first person singular of the future simple tense (sanaré), we could expect the lemma to be combined with suffixes like é, ré, aré, iré, and so on. These suffixes would be the output of the previous step.
First of all, for each lemma, the model needs to decide the position at which the lemma is split since, following the previous assumption, a word has the shape STEM+SUFFIX. In order to make that decision, we check how often each lemma is associated with a specific stem in the output of step 2, and use the most frequently occurring stem for all the suffixes. For example, for the Spanish verb sanar we get these frequencies: san: 15, sana: 21, sa: 1. Therefore, we would use the stem sana.
We then try to join that stem with the clustered suffixes. Each stem is joined with one suffix from each cluster. In order to decide which suffix is best, we use a character-level bigram language model, trained on the input Bible, to estimate the probability of the output sequences. These are the probabilities that we get if we consider the example of the stem sana (from sanar) and the suffixes é, ré, aré and iré in Spanish.

[Table: candidate outputs and their probabilities]

Obviously, in this case, the conjugation sanaré would be returned.
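The selection with the character-level bigram language model described above can be sketched as follows. The training words are a hypothetical toy list rather than the Bible text, and add-one smoothing and the boundary symbols < and > are assumptions made for this illustration.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram model with add-one smoothing (assumed) and boundaries."""

    def __init__(self, words: list[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.alphabet: set[str] = set()
        for w in words:
            padded = "<" + w + ">"
            self.alphabet.update(padded)
            for c1, c2 in zip(padded, padded[1:]):
                self.bigrams[c1, c2] += 1
                self.contexts[c1] += 1

    def log_prob(self, word: str) -> float:
        """Log probability of the word, including both boundary transitions."""
        padded = "<" + word + ">"
        v = len(self.alphabet)
        return sum(
            math.log((self.bigrams[c1, c2] + 1) / (self.contexts[c1] + v))
            for c1, c2 in zip(padded, padded[1:])
        )


# Hypothetical toy training data standing in for the raw text.
training_words = ["cantar", "cantaré", "hablar", "hablaré", "amar", "amaré",
                  "comer", "comeré", "vivir", "viviré", "sanar"]
lm = CharBigramLM(training_words)

stem = "sana"
cluster = ["é", "ré", "aré", "iré"]
candidates = [stem + suffix for suffix in cluster]
best = max(candidates, key=lm.log_prob)
print(best)
```

Because the score is an unnormalized sequence probability, longer candidates are penalized by construction; whether to normalize by length is a design choice that this sketch leaves open.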

Expansion of the lemma list
At this point, the model produced only a few suffixes. We therefore decided to extend the list of input lemmas, so that the model can find new suffixes and thereby increase its recall. We obtain new lemmas by training a very simple verb classifier. We create a simple dataset with the input lemmas and some random words from the corpus. The input lemmas are tagged as verbs and the random words are tagged as non-verbs. We then train a simple Logistic Regression model, using character uni-, bi- and trigrams to represent each word. We also include word boundary symbols. For instance, in Spanish we would have cases like:

Word    Features (trigrams)          Class
comer   <co, com, ome, mer, er>      V
plaza   <pl, pla, laz, aza, za>      NV

Using this approach we obtain new verbs that can be used in our pipeline. The model that uses the extended list of lemmas for extracting suffixes is called the Flexible model; the initial model (the one that uses only the initial lemmas as input) is called the Non-flexible model.

Results
Unfortunately, we could not surpass the baseline model in any of the languages. Among the development languages, Portuguese and Swedish are the ones best captured by the Non-flexible model. Among the test languages, Spanish and English are the ones best modeled by the Non-flexible model. It also seems that, while the Flexible model might have a better recall, its output is not good enough and therefore still requires some filtering.
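The character n-gram representation used by the verb classifier in the expansion of the lemma list can be sketched as follows. Only the feature extraction is shown; the function name is chosen here for illustration, and the Logistic Regression training is omitted.

```python
def char_ngrams(word: str, n_values: tuple[int, ...] = (1, 2, 3)) -> list[str]:
    """Character n-grams of a word padded with the boundary symbols < and >."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in n_values
            for i in range(len(padded) - n + 1)]


print(char_ngrams("comer", (3,)))
print(char_ngrams("plaza", (3,)))
```

With the default `n_values`, the uni-, bi- and trigrams of a word are returned together and can be counted into a sparse feature vector.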

Discussion and Future Work
We have presented our approach for automatically discovering morphological paradigms, given a text and a list of lemmas. As mentioned above, our results are behind the official baseline, so there is a wide range of possibilities for improvement. We discuss some of them below.
We assumed each inflected form to be decomposable into a stem and a suffix. This could be sufficient for English or Spanish, for example, but not for languages such as German that follow a two-split pattern:
STEM+SUFFIX → PREFIX+STEM+SUFFIX
In German, for example, participles are formed by prefixing ge:
play → played         play+ → play+ed
spielen → gespielt    spiel+en → ge+spiel+t
Apart from that, a much more straightforward estimate of the morphological richness $r_m$ could, for example, be obtained by just considering the triple $l_1 = r_1+s$, $l_2 = r_2+s$, $l_3 = r_3+s$ of optimally split distinct lemmas with the maximum number of common suffixes. Because these lemmas are most likely to be frequently used lemmas with regular inflection, the size of the union of their inflections would presumably yield a good estimate of $r_m$. Clustering of these triples could also help in identifying verb classes with distinct but regular inflection.
Moreover, splitting of compound verbs of the form X+V, with X typically a noun or verb, would certainly improve performance because the inflections of the verb V could be used for the typically less frequent compound verb X+V.
With respect to the writing system, the Basque Bible follows old orthographical rules, whereas the lemmas were written following more recent orthography rules. This lack of consistency makes the task more challenging, and we expect it to occur in other languages as well. This issue requires special attention, possibly by preprocessing the lemmas to adapt them to the old writing system (Etxeberria et al., 2019).
Also, as we mentioned at the beginning of the article, we have not used any neural network based component. Such components would be very useful for learning the morphophonological changes that commonly happen when inflecting words. Therefore, we would like to incorporate a Sequence-to-Sequence model at the end of our pipeline (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017).