The IMS–CUBoulder System for the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion

In this paper, we present the systems of the University of Stuttgart IMS and the University of Colorado Boulder (IMS-CUBoulder) for SIGMORPHON 2020 Task 2 on unsupervised morphological paradigm completion (Kann et al., 2020). The task consists of generating the morphological paradigms of a set of lemmas, given only the lemmas themselves and unlabeled text. Our proposed system is a modified version of the baseline introduced together with the task. In particular, we experiment with substituting the inflection generation component with an LSTM sequence-to-sequence model and an LSTM pointer-generator network. Our pointer-generator system obtains the best score of all seven submitted systems on average over all languages, and outperforms the official baseline, which was best overall, on Bulgarian and Kannada.


Introduction
In recent years, a lot of progress has been made on the task of morphological inflection, which consists of generating an inflected word, given a lemma and a list of morphological features (Kann and Schütze, 2017; Makarov and Clematide, 2018; Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019). The systems developed for this task learn to model inflection in morphologically complex languages in a supervised fashion.
However, not all languages have annotated data available. For the 2018 SIGMORPHON shared task (Cotterell et al., 2018), data for 103 unique languages was provided. Even this highly multilingual dataset covers just 1.61% of the 6359 languages that exist in the world (Lewis, 2009); the exact number varies depending on the classification schema used. The task of unsupervised morphological paradigm completion (Jin et al., 2020) aims at generating inflections (more specifically, all inflected forms, i.e., the entire paradigms, of given lemmas) without any explicit morphological information during training. A system that is able to solve this problem could easily generate morphological resources for most of the world's languages. This motivates us to participate in the SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion.
The task, however, is challenging: As the number of inflected forms per lemma is unknown a priori, an unsupervised morphological paradigm completion system needs to detect the paradigm size from raw text. Since the names of the morphological features expressed in a language are not known without supervision, a system has to mark which inflections correspond to the same morphological features across lemmas, but needs to do so without naming those features, cf. Figure 1. For the shared task, no external resources such as pretrained models, annotated data, or even additional monolingual text can be used. The same holds true for multilingual models.
We submit two systems, both of which are modifications of the official shared task baseline. The latter is a pipeline system that performs four steps: edit tree retrieval, additional lemma retrieval, paradigm size discovery, and inflection generation (Jin et al., 2020). We experiment with substituting the original generation component, which is either a simple non-neural system (Cotterell et al., 2017) or a transducer-based hard-attention model (Makarov and Clematide, 2018), with an LSTM encoder-decoder architecture with attention (Bahdanau et al., 2015), IMS-CUB1, and a pointer-generator network (See et al., 2017), IMS-CUB2. IMS-CUB2 achieves the best results of all submitted systems, outperforming the second best system by 2.07% macro-averaged best-match accuracy (BMAcc; Jin et al., 2020) when averaged over all languages. However, we underperform the official baseline, which obtains a 1.03% higher BMAcc than IMS-CUB2. Looking at individual languages, IMS-CUB2 obtains the best results overall for Bulgarian and Kannada.
The findings from our work on the shared task are as follows: i) the copy capabilities of a pointer-generator network are useful in this setup; and ii) unsupervised morphological paradigm completion is a challenging task: no submitted system outperforms the baselines.
Related Work

In the realm of morphological generation, Yarowsky and Wicentowski (2000) worked on a task which was similar to unsupervised morphological paradigm completion, but required additional knowledge (e.g., a list of morphemes). Dreyer and Eisner (2011) used a set of seed paradigms to train a paradigm completion model. Ahlberg et al. (2015) and Hulden et al. (2014) also relied on information about the paradigms in the language. Erdmann et al. (2020) proposed a system for a task similar to this shared task.
Learning to generate morphological paradigms in a fully supervised way is the more common approach. Methods include Durrett and DeNero (2013), Nicolai et al. (2015), and Kann and Schütze (2018). Supervised morphological inflection has further gained popularity through previous SIGMORPHON and CoNLL-SIGMORPHON shared tasks on the topic (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019). The systems proposed for these shared tasks have a special relevance for our work, as we investigate the performance of morphological inflection components based on Kann and Schütze (2016a,b) and Sharma et al. (2018) within a pipeline for unsupervised morphological paradigm completion.

System Description
In this section, we introduce our pipeline system for unsupervised morphological paradigm completion. First, we describe the baseline system, since we rely on some of its components. Then, we describe our morphological inflection models.

The Shared Task Baseline
For the initial steps of our pipeline, we employ the first three components of the baseline (Jin et al., 2020), cf. Figure 2, which we describe in this subsection. We use the official implementation. 2

Retrieval of relevant edit trees. This component (cf. Figure 2.1) identifies words in the monolingual corpus that could belong to a given lemma's paradigm by computing the longest common substring between the lemma and all words. Then, the transformation from a lemma to each word potentially belonging to its paradigm is represented by an edit tree (Chrupała, 2008). Edit trees whose frequencies are below a threshold are discarded.
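The first step can be illustrated with a small sketch. The code below computes the longest common substring and, as a simplified stand-in for a full edit tree, represents the transformation as a prefix/suffix rewrite pair around it; the function names are ours, not the baseline's.

```python
def longest_common_substring(a: str, b: str) -> str:
    # Dynamic programming over all character pairs of a and b.
    best, best_end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, best_end = dp[i][j], i
    return a[best_end - best:best_end]

def edit_rule(lemma: str, form: str):
    # Simplified stand-in for an edit tree: the prefix and suffix
    # rewrites around the longest common substring.
    lcs = longest_common_substring(lemma, form)
    if not lcs:
        return None
    li, fi = lemma.index(lcs), form.index(lcs)
    return ((lemma[:li], form[:fi]),
            (lemma[li + len(lcs):], form[fi + len(lcs):]))
```

For example, `edit_rule("walk", "walked")` yields the suffix rewrite `(("", ""), ("", "ed"))`, while `edit_rule("sing", "sang")` yields the prefix rewrite `(("si", "sa"), ("", ""))`. A real edit tree recurses on the residual prefix and suffix, which this sketch omits.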
Retrieval of additional lemmas. To increase the confidence that retrieved edit trees represent valid inflections, more lemmas are needed (cf. Figure 2.2). To find those, the second component of the system applies the edit trees to potential lemmas in the corpus. If enough potential inflected forms are found in the corpus, a lemma is considered valid.
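Continuing the simplified prefix/suffix representation from above, this filtering step might look as follows; the threshold and helper names are illustrative, not the baseline's actual values.

```python
def apply_rule(rule, word):
    # Apply a (prefix, suffix) rewrite rule to a candidate lemma,
    # or return None if the rule does not match.
    (lp, fp), (ls, fs) = rule
    if word.startswith(lp) and word.endswith(ls) and len(word) >= len(lp) + len(ls):
        stem = word[len(lp):len(word) - len(ls)] if ls else word[len(lp):]
        return fp + stem + fs
    return None

def is_valid_lemma(word, rules, corpus_vocab, min_hits=2):
    # Keep a candidate lemma if enough of its rule applications
    # actually surface in the corpus (min_hits is illustrative).
    hits = sum(1 for r in rules if apply_rule(r, word) in corpus_vocab)
    return hits >= min_hits
```

For instance, with the rules for "-ed" and "-s" and a vocabulary containing "talked" and "talks", the candidate "talk" would be accepted.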
Paradigm size discovery. Now the system needs to find a mapping between edit trees and paradigms (cf. Figure 2.3). This is done based on two assumptions: that for each lemma a maximum of one edit tree per paradigm slot can be found, and that each edit tree only realizes one paradigm slot for all lemmas. In addition, the similarity of potential slots is measured. With these elements, similar potential slots are merged until the final paradigm size for the language is determined.
Generation. Now that the system has a set of lemmas and corresponding potential inflected forms, the baseline employs a morphological inflection component, which learns to generate inflections from lemmas and a slot indicator, and generates the missing forms (cf. Figure 2.4). We experiment with substituting this final component.
In the remainder of this paper, we will refer to the original baselines with the non-neural system from Cotterell et al. (2017) and the inflection model from Makarov and Clematide (2018) as BL-1 and BL-2, respectively.

LSTM Encoder-Decoder
We use an LSTM encoder-decoder model with attention (Bahdanau et al., 2015) for our first system, IMS-CUB1, since it has been shown to obtain high performance on morphological inflection (Kann and Schütze, 2016a). This model takes two inputs: a sequence of characters and a sequence of morphological features. It then generates the sequence of characters of the inflected form. For the input, we simply concatenate the paradigm slot number and all characters.
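The input formatting described above can be sketched as follows; the slot-token naming is ours, and the actual system's format may differ in details.

```python
def make_example(lemma: str, slot: int, form: str):
    # Source: the (numeric) paradigm slot indicator concatenated with
    # the lemma's characters; target: the inflected form's characters.
    src = [f"SLOT_{slot}"] + list(lemma)
    tgt = list(form)
    return src, tgt
```

For example, the training pair (walk, slot 3, walked) becomes the source sequence `["SLOT_3", "w", "a", "l", "k"]` and the target sequence `["w", "a", "l", "k", "e", "d"]`.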

Pointer-Generator Network
For IMS-CUB2, we use a pointer-generator network (See et al., 2017). 3 We expect this system to perform better than IMS-CUB1, given the pointer-generator's better performance on morphological inflection in the low-resource setting (Sharma et al., 2018). A pointer-generator network is a hybrid between an attention-based sequence-to-sequence model (Bahdanau et al., 2015) and a pointer network (Vinyals et al., 2015).
The standard pointer-generator network consists of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) encoder and a unidirectional LSTM decoder with a copy mechanism. Here, we follow Sharma et al. (2018) and use two separate encoders: one for the lemma and one for the morphological tags. The decoder then computes the probability distribution of the output at each time step as a weighted sum of the probability distribution over the output vocabulary and the attention distribution over the input characters. The weights can be seen as the probabilities to generate or copy, respectively.
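The mixture computed at each decoding step can be written out as a small sketch: the final distribution is p_gen times the vocabulary distribution plus (1 - p_gen) times the attention mass scattered onto the vocabulary indices of the source characters. The function below is our illustration of this standard formula, not the submitted implementation.

```python
def output_distribution(p_gen, p_vocab, attention, src_ids):
    # p_vocab:   distribution over the output vocabulary (softmax output)
    # attention: distribution over the source positions
    # src_ids:   vocabulary index of each source character
    out = [p_gen * p for p in p_vocab]
    for pos, idx in enumerate(src_ids):
        # Copy probability mass flows to the character at this position.
        out[idx] += (1.0 - p_gen) * attention[pos]
    return out
```

Since both input distributions sum to one, the output is again a valid probability distribution for any p_gen in [0, 1]; characters appearing in the source can thus be produced even if the generator assigns them little probability, which is exactly what makes copying stems effective.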

Evaluation Metric
The official evaluation metric of the shared task is BMAcc (Jin et al., 2020). Gold solutions are obtained from UniMorph (Kirov et al., 2018). Two versions of BMAcc exist: micro-averaged BMAcc and macro-averaged BMAcc. In this paper, we only report macro-averaged BMAcc, the official shared task metric.
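Since predicted slot labels are arbitrary, BMAcc scores predictions under the best possible mapping of predicted slots to gold slots. The brute-force toy sketch below illustrates that idea only; the official metric's exact matching procedure and handling of mismatched paradigm sizes differ, and all names here are ours.

```python
from itertools import permutations

def best_match_accuracy(pred, gold):
    # pred, gold: {lemma: {slot: form}} with arbitrary slot labels.
    # Try every mapping of predicted slots onto gold slots and return
    # the accuracy of the best one (exhaustive search; toy sizes only).
    pred_slots = sorted({s for forms in pred.values() for s in forms})
    gold_slots = sorted({s for forms in gold.values() for s in forms})
    total = sum(len(forms) for forms in gold.values())
    best = 0.0
    for perm in permutations(pred_slots):
        mapping = dict(zip(perm, gold_slots))
        correct = sum(
            1
            for lemma, forms in gold.items()
            for gs, gf in forms.items()
            for ps, pf in pred.get(lemma, {}).items()
            if mapping.get(ps) == gs and pf == gf
        )
        best = max(best, correct / total)
    return best
```

For example, predicting slot 0 = "walked" and slot 1 = "walks" against gold PST = "walked" and 3SG = "walks" yields accuracy 1.0, whichever numeric labels the system happened to assign.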
During the development of our morphological generation systems, we use regular accuracy, the standard evaluation metric for morphological inflection (Cotterell et al., 2016).

Morphological Inflection Component
Morphological inflection data. We use the first three components of the baseline model, i.e., the ones performing edit tree retrieval, additional lemma retrieval, and paradigm size discovery, to create training and development data for our inflection models. Those datasets consist of lemma-inflection pairs found in the raw text, together with a number indicating the (predicted) paradigm slot, and are described in Table 1.
The test set for our morphological inflection systems consists of the lemma-paradigm slot pairs not found in the corpus.
Hyperparameters. For IMS-CUB1, we use an embedding size of 300, a hidden layer of size 100, a batch size of 20, Adadelta (Zeiler, 2012) for optimization, and a learning rate of 1. For each language, we train a system for 100 epochs, using early stopping with a patience of 10 epochs.
For IMS-CUB2, we follow two different approaches. The first is to use a single hyperparameter configuration for all languages (IMS-CUB2-S). The second consists of using a variable setup depending on the training set size (IMS-CUB2-V). For IMS-CUB2-S, we use an embedding size of 300, a hidden layer size of 100, a dropout rate of 0.3, and train for 60 epochs with an early-stopping patience of 10 epochs. We further use an Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 0.001.
For IMS-CUB2-V, we use the following hyperparameters for training set size T:
• T < 101: an embedding size of 100, a dropout coefficient of 0.5, 300 epochs of training, and an early-stopping patience of 100;
• 100 < T < 501: an embedding size of 100, a dropout coefficient of 0.5, 80 training epochs, and an early-stopping patience of 20;
• 500 < T: the same hyperparameters as for IMS-CUB2-S.
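The size-dependent setup amounts to a simple threshold rule; the function and key names below are ours, written only to make the three regimes explicit.

```python
def ims_cub2_v_config(train_size: int) -> dict:
    # Hyperparameters per training-set-size regime, as described above.
    if train_size < 101:
        return {"emb": 100, "dropout": 0.5, "epochs": 300, "patience": 100}
    if train_size < 501:
        return {"emb": 100, "dropout": 0.5, "epochs": 80, "patience": 20}
    # 500 < T: fall back to the fixed IMS-CUB2-S configuration.
    return {"emb": 300, "dropout": 0.3, "epochs": 60, "patience": 10}
```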
For IMS-CUB2, we select the better performing system (between IMS-CUB2-S and IMS-CUB2-V) as our final model. The models are evaluated on the morphological inflection development set using accuracy. All scores are shown in Table 2.

Table 3 shows the official test set results for IMS-CUB1 and IMS-CUB2, compared to the official baselines and all other submitted systems.

Results
Our best system, IMS-CUB2, achieves the highest scores of all submitted systems (i.e., excluding the baselines), outperforming the second best submission by 2.07% BMAcc. However, BL-1 and BL-2 outperform IMS-CUB2 by 1.03% and 0.3%, respectively. Looking at the results for individual languages, IMS-CUB2 obtains the highest performance overall for Bulgarian (0.42% above the second best system) and Kannada (0.53% above the second best system). Comparing our two submissions, IMS-CUB1 underperforms IMS-CUB2 by 3.6%, showing that vanilla sequence-to-sequence models are not optimally suited for the task. We hypothesize that this could be due to the amount or the diversity of the generated morphological inflection training data.
As our systems rely on the output of the first three steps of the baseline, only a few training examples were available for Basque and Navajo: 85 and 17, respectively. Likely at least partially due to this, i.e., because finding patterns in the raw text corpus is difficult for these languages, all systems obtain their lowest scores on them. However, even though Finnish has 2306 training instances for morphological inflection, our best system surprisingly only reaches 5.38% BMAcc. The same happens for Kannada and Turkish: the inflection training set is relatively large, but the overall performance on unsupervised morphological paradigm completion is low. In contrast, even though English has a relatively small training set (343 examples), the performance of IMS-CUB2 is highest for this language, with 66.20% BMAcc. We think that the quality of the generated inflection training set and the correctness of the predicted paradigm size are the main reasons behind these performance differences. Improving steps 1 to 3 of the overall pipeline thus seems important in order to achieve better results on unsupervised morphological paradigm completion in the future.

Conclusion
In this paper, we described the IMS-CUBoulder submission to the SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion. We explored two modifications of the official baseline system by substituting its inflection generation component with two alternative models. Thus, our final system performed four steps: edit tree retrieval, additional lemma retrieval, paradigm size discovery, and inflection generation. The last component was either an LSTM sequence-to-sequence model with attention (IMS-CUB1) or a pointer-generator network (IMS-CUB2). Although our systems could not outperform the official baselines on average, IMS-CUB2 was the best submitted system. It further obtained the overall highest performance for Bulgarian and Kannada.