Unsupervised Morphological Paradigm Completion

We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or to assist linguistic annotators. From a cognitive science perspective, this can shed light on how children acquire morphological knowledge. We further introduce a system for the task, which generates morphological paradigms via the following steps: (i) EDIT TREE retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and (iv) inflection generation. We perform an evaluation on 14 typologically diverse languages. Our system outperforms trivial baselines with ease and, for some languages, even obtains a higher accuracy than minimally supervised systems.


Introduction
Morphologically rich languages express syntactic and semantic properties of words, like tense or case, through inflection, i.e., changes to the surface forms of the words. The set of all inflected forms of a lemma (the canonical form) is called its paradigm. While English does not manifest a rich inflectional morphology, Polish verbs have around a hundred different forms (Sadowska, 2012), and Archi paradigms, an extreme example, can have over 1.5 million slots (Kibrik, 1977).
Morphologically rich languages constitute a challenge for natural language processing (NLP) systems: because each lemma can take on a variety of surface forms, the frequency of each individual inflected word decreases drastically. This yields problems for speech recognition (Creutz et al., 2007), parsing (Seeker and Çetinoğlu, 2015), and keyword spotting (Narasimhan et al., 2014), inter alia. For unsupervised machine translation, Guzmán et al. (2019) encounter difficulties when translating into the morphologically rich languages Nepalese and Sinhalese. (Our implementation is available under https://github.com/cai-lw/morpho-baseline.)
Children acquire morphological knowledge from raw utterances and, in particular, without access to explicit morphological information (Berko, 1958). Do they have an innate capacity that enables them to learn a language's morphology? Or can morphology be learned in an unsupervised fashion? This question, in addition to practical considerations like benefits for the aforementioned NLP tasks, has motivated work on unsupervised morphological analysis (Goldsmith, 2001; Creutz, 2003). To the best of our knowledge, no previous work has considered unsupervised morphological generation. However, over the last few years, there has been a lot of progress on morphological generation tasks with limited amounts of supervision, in particular on morphological inflection (Cotterell et al., 2018) and paradigm completion (Kann and Schütze, 2018), which can potentially be leveraged for unsupervised solutions.
Here, we fill the gap between unsupervised morphological analysis and morphological generation with limited training data by proposing the task of unsupervised morphological paradigm completion. That is, we aim to construct and fill inflection tables exclusively from raw text and a lemma list for a known part of speech (POS), a situation similar to that encountered by field linguists. We further present a system for the task (see Figure 1) which employs state-of-the-art methods common in NLP and computational morphology. It performs the following four steps: (i) EDIT TREE (Chrupała, 2008) retrieval (§4.1), (ii) additional lemma retrieval (§4.2), (iii) paradigm size discovery using distributional information (§4.3), and (iv) inflection generation (§4.4).
To evaluate our approach, we design a metric for unsupervised paradigm completion, best-match accuracy (§5.4), and experiment on 14 languages from 7 families. As we are tackling a novel task with no baselines in the NLP literature, we perform an extensive ablation study to demonstrate the importance of all steps in our pipeline. We further show that our system outperforms trivial baselines and, for some languages, even obtains higher accuracy than a minimally supervised system.

Related Work
Morphological Generation Versions of our task with varying degrees of supervision, though never totally unsupervised, have been explored in the past. Yarowsky and Wicentowski (2000) is the previous work most similar to ours. They also assume raw text and a word list as input, but additionally require knowledge of a language's consonants and vowels, as well as canonical suffixes for each part of speech. Dreyer and Eisner (2011) assume access to seed paradigms to discover paradigms in an empirical Bayes framework. Ahlberg et al. (2015) and Hulden et al. (2014) combine information about paradigms and word frequency from corpora to perform semi-supervised paradigm completion. Our work differs from theirs in that we do not assume any gold paradigms to be given. Durrett and DeNero (2013), Nicolai et al. (2015), and Faruqui et al. (2016) explore a fully supervised approach, learning morphological paradigms from large annotated inflection tables. This framework has evolved into the SIGMORPHON shared tasks on morphological inflection (Cotterell et al., 2016), which have sparked further interest in morphological generation (Kann and Schütze, 2016; Aharoni and Goldberg, 2017; Bergmanis et al., 2017; Makarov et al., 2017; Zhou and Neubig, 2017; Kann and Schütze, 2018). We integrate two systems (Cotterell et al., 2017; Makarov and Clematide, 2018b) produced for SIGMORPHON shared tasks into our framework for unsupervised morphological paradigm completion.
Morphological Analysis Most research on unsupervised systems for morphology aims at developing approaches to segment words into their smallest meaning-bearing units, called morphemes (Goldsmith, 2001; Creutz, 2003; Creutz and Lagus, 2007; Snyder and Barzilay, 2008). Unsupervised morphological paradigm completion differs from segmentation in that, besides capturing how morphology is reflected in the word form, it also requires correctly clustering transformations into paradigm slots as well as generating unobserved forms. The model by Xu et al. (2018) recovers something akin to morphological paradigms. However, those paradigms are a means to a segmentation end, and Xu et al. (2018) do not explicitly model information about the paradigm size as required for our task.
Other unsupervised approaches to learning morphological analysis and generation rely on projections between word embeddings (Soricut and Och, 2015; Narasimhan et al., 2015); however, these approaches require very large corpora to train embeddings: at a minimum, Narasimhan et al. (2015) use 129 million word tokens of English Wikipedia. As we will describe later on (§5.1), we, in contrast, are concerned with a setting of mere thousands of sentences.
For a detailed survey of unsupervised approaches to problems in morphology, we refer the reader to Hammarström and Borin (2011).
SIGMORPHON 2020: Unsupervised Morphological Paradigm Completion After multiple shared tasks on morphological inflection starting with Cotterell et al. (2016), in 2020, SIGMORPHON (the ACL special interest group on computational morphology and phonology) is organizing its first shared task on unsupervised morphological paradigm completion. The system presented here is the official shared task baseline system. The first other approach applicable to this shared task has been developed by Erdmann et al. (2020). Their pipeline system is similar in spirit to ours, but the individual components are different; e.g., a transformer model (Vaswani et al., 2017) is used for inflection generation.

Formal Task Description
Given a corpus D = w_1, ..., w_|D| with a vocabulary V of word types {w_i} and a lexicon L = {ℓ_j} with |L| lemmas belonging to the same part of speech, the task of unsupervised morphological paradigm completion consists of generating the paradigms {π(ℓ)}_{ℓ∈L} of the entries in the lexicon.
Following Matthews (1972) and Aronoff (1976), we treat a paradigm as a vector of inflected forms belonging to a lemma ℓ. Paradigm completion consists of predicting missing slots in the paradigm

π(ℓ) = ⟨f(ℓ, t_γ)⟩_{γ∈Γ(ℓ)},

where f : Σ* × T → Σ* transforms a lemma into an inflected form, t_γ ∈ T is a vector of inflectional features describing paradigm slot γ, and Γ(ℓ) is the set of slots in lemma ℓ's paradigm. Since we only consider lemmas that belong to the same part of speech, we will use Γ and Γ(ℓ) interchangeably in the following. Furthermore, we will denote f(ℓ, t_γ) as f_γ(ℓ) for simplicity.
Remarks on Task Design In general, not all paradigm entries will be present in the corpus D. Thus, the task requires more than a keyword search.
On another note, it is not necessary to predict the features t_γ corresponding to each slot γ; as the exact denotation of features is up to human annotators, they cannot be inferred by unsupervised systems. For our task, it is enough to predict the ordered vector π(ℓ) of inflected forms.

Methodology
Our system implicitly solves two subtasks: (i) determining the number of paradigm slots; and (ii) generating the inflected form corresponding to each paradigm slot for each lemma. It is organized as a pipeline consisting of multiple components. Our system is highly modular: individual components can be exchanged easily. In the remainder of this section, we will dedicate each subsection to one component.

Retrieval of Relevant EDIT TREES
The first component in our pipeline identifies words in the corpus D which could belong to the paradigm of one of the lemmas in the lexicon L. We call those words paradigm candidates. It then uses the discovered paradigm candidates to identify EDIT TREE (Chrupała, 2008) operations that correspond to valid inflections.
Paradigm Candidate Discovery For most paradigms, all participating inflected forms share some characteristics, usually substrings, which humans use to identify the paradigm any given word belongs to. Given a pair (ℓ, w) of a lemma ℓ and a word form w, the first step in our pipeline is to determine whether w is a paradigm candidate for lemma ℓ. For example, studied is likely to be an inflected form of the English lemma study, while monkey is not. We identify the paradigm candidates C(ℓ) of a lemma ℓ by computing the longest common substring (LCS) between ℓ and w for all words w in the vocabulary V. If the ratio between the LCS's length and the length of ℓ is higher than a threshold λ_P, w is a paradigm candidate for ℓ:

C(ℓ) = {w ∈ V : |LCS(ℓ, w)| / |ℓ| > λ_P}.

EDIT TREE Discovery Surface form changes, which we denote as ψ, define a modification of a word's surface form. Our system employs EDIT TREES (Chrupała, 2008) to represent ψ.
An EDIT TREE(x, x′) is constructed by first determining the LCS between x and x′. We then recursively model the substrings before and after the LCS. If the length of the LCS is zero, the EDIT TREE consists of the substitution operation replacing the first string with the second. For example, EDIT TREE(najtrudniejszy, trudny) could be visualized as in Figure 2, where Split(i, j) represents taking the substring x_[i...n−j].
An EDIT TREE can be applied to new input strings. The EDIT TREE in Figure 2, for example, could be applied to najappleiejszs, and the resulting string would be apples. Note that not all EDIT TREES can be applied to all strings. For example, the EDIT TREE in Figure 2 can only be applied to words starting with naj.
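To make the candidate filter and the EDIT TREE mechanics concrete, the following is a minimal sketch, not the authors' implementation: all function names and the threshold value are illustrative. Trees are built by recursing around the LCS, and substitution nodes refuse to apply when their source substring does not match, which is why some trees are inapplicable to some words.

```python
from difflib import SequenceMatcher

def lcs(a, b):
    """Longest common substring: returns (start_a, start_b, length)."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, m.size

def is_candidate(lemma, w, lambda_p=0.5):
    # w is a paradigm candidate if |LCS| / |lemma| exceeds the threshold
    return lcs(lemma, w)[2] / len(lemma) > lambda_p

def build_tree(src, tgt):
    """EDIT TREE: recurse on the substrings before and after the LCS."""
    i, j, size = lcs(src, tgt)
    if size == 0:
        return ("replace", src, tgt)          # substitute src with tgt
    return ("split", i, len(src) - i - size,  # prefix/suffix lengths
            build_tree(src[:i], tgt[:j]),
            build_tree(src[i + size:], tgt[j + size:]))

def apply_tree(tree, s):
    """Apply a tree to a new string; None if the tree is inapplicable."""
    if tree[0] == "replace":
        return tree[2] if s == tree[1] else None
    _, pre, suf, left, right = tree
    if len(s) < pre + suf:
        return None
    l, r = apply_tree(left, s[:pre]), apply_tree(right, s[len(s) - suf:])
    return None if l is None or r is None else l + s[pre:len(s) - suf] + r
```

For instance, the tree built from study → studied maps carry to carried but is inapplicable to walk, since walk does not end in y.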
Our system constructs EDIT TREES from all pairs (ℓ, w) of lemmas and their paradigm candidates w and counts their frequencies:

n_ψ = |{(ℓ, w) : ℓ ∈ L, w ∈ C(ℓ), EDIT TREE(ℓ, w) = ψ}|.

It then discards EDIT TREES with frequencies n_ψ below a threshold λ_FC(|L|), which is a function of the size of the lexicon L:

λ_FC(|L|) = max{2, φ_FC |L|},

where φ_FC ∈ R is a hyperparameter. The idea is that an EDIT TREE is only valid if it can be applied to multiple given lemmas. EDIT TREES which we observe only once are always considered unreliable.
Our system then retains a set Ψ of frequent surface form changes represented by EDIT TREES. Assuming a one-to-one mapping between surface form changes and paradigm slots (that is, that |Ψ| = |Γ| and that each ψ is equivalent to a particular inflection function f_γ), we now have a first basic paradigm completion system (PCS-I), which operates by applying all suitable EDIT TREES to all lemmas in our lexicon.

Complexity and Runtime
The time complexity of this component is dominated by computing the EDIT TREES for all pairs of lemmas and paradigm candidates. This computation can trivially be parallelized, and in practice it does not take much time.

Retrieval of Additional Lemmas
Since we assume a low-resource setting, our lexicon is small (≤ 100 entries). However, the more lemmas we have, the more confident we can be that the EDIT TREES retrieved by the first component of our system represent valid inflections. An intuitive method to obtain additional lemmas would be to train a lemmatizer and to generate new lemmas from words in our corpus. However, due to the limited size of our initial lemma list, such a lemmatizer would most likely not be reliable.
The second component of our system employs another method, which guarantees that additionally retrieved lemmas are valid words: it is based on the intuition that a word w ∈ V is likely to be a lemma if the pseudo-inflected forms of w, obtained by applying the EDIT TREES from §4.1, also appear in V. For a word w ∈ V, we say it is a discovered lemma if w ∉ L and

|{ψ ∈ Ψ : ψ(w) ∈ V}| ≥ λ_NL, with λ_NL = max{3, φ_NL |Ψ|},

with φ_NL ∈ R being a hyperparameter. Similar to Equation 4, λ_NL depends on the number of discovered EDIT TREES, but is never smaller than 3. We set this minimum to require evidence for at least two transformations in addition to the identity.
We can now bootstrap by iteratively computing additional paradigm candidates and EDIT TREES, and then retrieving more lemmas. We denote the paradigm completion systems resulting from one and two such iterations as PCS-II-A and PCS-II-B, respectively. Since we cannot be fully confident about retrieved lemmas, we associate each additional lemma with a weight θ = θ_NL^it, where θ_NL is a preset hyperparameter and it identifies the iteration in which a lemma is added, i.e., it = 0 for gold lemmas and it = i for lemmas retrieved in the ith iteration. The weights θ are used in later components of our system.
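The retrieval criterion can be sketched as follows. This is a hedged illustration with hypothetical names; the threshold formula (growing with the number of trees but never below 3) is our reading of the prose, since the exact equation did not survive extraction, and φ_NL here is an arbitrary example value.

```python
def discover_lemmas(vocab, lexicon, edit_trees, phi_nl=0.5):
    """vocab: set of word types; lexicon: known lemmas;
       edit_trees: callables mapping a word to a pseudo-inflected form."""
    # threshold grows with the number of trees but is never below 3
    lam_nl = max(3, int(phi_nl * len(edit_trees)))
    found = []
    for w in sorted(vocab):
        if w in lexicon:
            continue
        # count pseudo-inflections of w that are themselves attested
        hits = sum(1 for t in edit_trees if t(w) in vocab)
        if hits >= lam_nl:
            found.append(w)
    return found
```

On a toy vocabulary, a word like play would be accepted because play, played, playing, and plays are all attested, while a form like played would not, since almost none of its pseudo-inflections exist.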

Paradigm Size Discovery
Until now, we have assumed a one-to-one mapping between paradigm slots and surface form changes. However, different EDIT TREES may indeed represent the same inflection. For example, the past tense inflection of verbs in English involves multiple EDIT TREES, as shown in Figure 3.
Thus, the next step is to group surface form changes based on the paradigm slots they realize.

One EDIT TREE per Lemma and Slot
Our algorithm for grouping surface form changes is based on two assumptions. First, since EDIT TREES are extracted from (ℓ, w) pairs, different EDIT TREES belonging to the same paradigm slot cannot be extracted from the same lemma. Thus: Assumption 1 For each lemma, at most one inflected form per paradigm slot can be found in the corpus.
Formally, for a multi-to-one mapping from EDIT TREES to paradigm slots z : Ψ → Γ, we define the EDIT TREE set Ψ_γ of a potential paradigm slot γ as

Ψ_γ = {ψ ∈ Ψ : z(ψ) = γ}.

Then, for any lemma ℓ ∈ L and proposed paradigm slot γ ∈ Γ, we have:

|{ψ ∈ Ψ_γ : ψ(ℓ) ∈ V}| ≤ 1. (8)

One Paradigm Slot per EDIT TREE
Our second assumption is a simplification, but helps to reduce the search space during clustering: Assumption 2 Each surface form change ψ ∈ Ψ belongs to exactly one paradigm slot.

Paradigm Slot Features and Similarity
In addition to Assumptions 1 and 2, we make use of a feature function r(γ) and a score function s(r(γ), r(γ′)), which measures the similarity between two potential paradigm slots.
Our feature function makes the connection between paradigm slots and the instances of inflected forms in the corpus by utilizing part-of-speech (POS) tags as context information. In our implementation, we employ the unsupervised POS tagger of Stratos et al. (2016) to extract tags for each word in the corpus D. This tagger assigns an anchor word, i.e., a pseudo POS tag, to each word w_i in the corpus by using an anchor hidden Markov model (HMM) with 8 hidden states. Our feature function counts the tag tuples within a sliding window centered on each instance of an inflected form of a potential paradigm slot.
Formally, we denote the set of lemmas that surface form change ψ is applied to as L_ψ, the set of lemmas that express a potential paradigm slot γ as L_γ = ∪_{ψ∈Ψ_γ} L_ψ, and the corresponding inflected forms as

V_γ = {ψ(ℓ) : ψ ∈ Ψ_γ, ℓ ∈ L_ψ}.

We further refer to the set of available POS tag labels as P = {p_1, ..., p_8}. For a corpus D = w_1, ..., w_|D|, a window size 2d+1, and a potential paradigm slot γ, our feature function returns a vector r_γ with one dimension for each possible tag tuple of length 2d+1. In practice, we initialize r_γ to the zero vector and iterate the sliding window over the corpus. When the central word w_i ∈ V_γ, the value of r_γ at the dimension corresponding to the POS tuple within the sliding window is incremented by 1. Figure 4 shows an example in English.
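The sliding-window counting and the similarity score can be sketched as below. This is an illustrative simplification with hypothetical names, representing the sparse count vector as a dictionary keyed by tag tuples rather than as a dense vector over all of P^(2d+1).

```python
from collections import Counter

def slot_features(tokens, tags, slot_forms, d=1):
    """Count the POS-tag tuple inside each (2d+1)-window centered on
       an occurrence of one of the slot's inflected forms."""
    r = Counter()
    for i, w in enumerate(tokens):
        if w in slot_forms and d <= i < len(tokens) - d:
            r[tuple(tags[i - d:i + d + 1])] += 1
    return r

def cosine(r1, r2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * r2[k] for k, v in r1.items())
    n1 = sum(v * v for v in r1.values()) ** 0.5
    n2 = sum(v * v for v in r2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Two slots whose forms occur in the same tag contexts, e.g., two past-tense transformations, end up with proportional count vectors and hence a cosine score near 1.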
We assume that paradigm slots γ and γ′ are similar if the words in V_γ and V_γ′ frequently appear in similar contexts, i.e., within similar tag sequences. With the feature function defined above, our system uses cosine similarity as the score function s.

Algorithm 1: Surface Form Change Grouping
Result: Γ, {Ψ_γ}_{γ∈Γ}
Initialize Γ s.t. |Γ| = |Ψ| and f_γi = ψ_i for all ψ_i ∈ Ψ.
We then develop Algorithm 1 to group one-to-one mappings from surface form changes to paradigm slots into many-to-one mappings. The idea is to iteratively merge the most similar slots, as long as this does not violate Assumption 1, until the similarity gets too low. λ_S ∈ (0, 1) is a threshold parameter.
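The greedy merging idea can be sketched as follows; this is a simplified reconstruction with hypothetical names, not the paper's Algorithm 1. It repeatedly merges the most similar pair of slots whose trees were not extracted from a shared lemma (Assumption 1), until no pair scores above λ_S.

```python
from collections import Counter

def cosine(r1, r2):
    dot = sum(v * r2[k] for k, v in r1.items())
    n1 = sum(v * v for v in r1.values()) ** 0.5
    n2 = sum(v * v for v in r2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def group_changes(tree_lemmas, tree_feats, lam_s=0.5):
    """tree_lemmas: {tree_id: set of lemmas the tree was extracted from};
       tree_feats:  {tree_id: Counter of context features}."""
    groups = {t: {t} for t in tree_lemmas}   # one slot per tree initially
    feats = {t: Counter(f) for t, f in tree_feats.items()}
    covered = {t: set(tree_lemmas[t]) for t in tree_lemmas}
    while True:
        best_s, best = lam_s, None
        for a in groups:
            for b in groups:
                # Assumption 1: a merged slot may use at most one tree per lemma
                if a < b and not covered[a] & covered[b]:
                    s = cosine(feats[a], feats[b])
                    if s > best_s:
                        best_s, best = s, (a, b)
        if best is None:
            return groups
        a, b = best
        groups[a] |= groups.pop(b)
        feats[a] += feats.pop(b)
        covered[a] |= covered.pop(b)
```

For instance, two past-tense trees (+ed and +d) extracted from disjoint lemma sets and seen in the same contexts get merged, while a tree sharing a source lemma with an existing group stays separate.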

Generation
Now, one paradigm slot can be represented by multiple EDIT TREES. Our system thus needs to learn to apply the correct transformation for a given combination of lemma and paradigm slot. However, mapping lemmas and paradigm slots to inflected forms corresponds exactly to the morphological inflection task, which has been the subject of multiple shared tasks over the last years (Cotterell et al., 2018).
Our morphological inflection models take (slot, lemma, word) tuples extracted by the previous components of our system as training data. Formally, they are trained on the training set

{(γ, ℓ, ψ(ℓ)) : γ ∈ Γ, ψ ∈ Ψ_γ, ℓ ∈ L_ψ}.

We explore two morphological inflection models from the literature.
Affix Editing The baseline system of the CoNLL-SIGMORPHON 2017 shared task (Cotterell et al., 2017) is a simple approach, which is very suitable for low-resource settings. The system breaks each word into PREFIX, STEM, and SUFFIX, and then stores the PREFIX editing rules and the STEM+SUFFIX editing rules. At test time, it applies the longest possible PREFIX and STEM+SUFFIX editing rules to the input lemma. We denote the surface form change grouping in combination with this system as PCS-III-C.
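The actual shared task baseline also handles prefix rules; as a much-simplified, suffix-only sketch of the rule-editing idea (names and the rule format are illustrative, not the baseline's), consider:

```python
def suffix_rule(lemma, form):
    """Extract a suffix rewrite rule from one (lemma, form) example."""
    i = 0
    while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
        i += 1                        # longest common prefix acts as the stem
    return lemma[i:], form[i:]        # e.g. ("y", "ied") from study/studied

def apply_rules(lemma, rules):
    """Apply the rule with the longest matching lemma suffix."""
    for old, new in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if lemma.endswith(old):
            return lemma[: len(lemma) - len(old)] + new
    return lemma
```

Preferring the longest matching suffix lets a specific rule like y → ied override the generic ∅ → ed rule where both would apply.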
Transducer-Based Hard-Attention We further experiment with a transducer-based hard-attention model (Makarov and Clematide, 2018a). Unlike widely used soft-attention sequence-to-sequence models (Bahdanau et al., 2015), which predict the target tokens directly, it predicts edit action sequences that transform the input sequence into the output, and it employs a hard attention mechanism. We denote the surface form change grouping in combination with this system as PCS-III-H.

Data
To evaluate our approach in a real-world setting, we restrict our data to resources typically available to a field linguist: a small written corpus (≤ 100k tokens) and a small lexicon.
For our corpora, we use the JHU Bible Corpus (McCarthy et al., 2020), which allows future work to build systems for 1,600 languages. The Bible is frequently available even in low-resource languages: Ethnologue identifies 3,995 written languages, and the New Testament has been translated into 2,246. The Bible is also highly representative of a language's core vocabulary: Resnik et al. (1999) find high overlap with both the Longman Dictionary of Contemporary English (Summers and Gadsby, 1995) and the Brown Corpus (Francis and Kučera, 1964). Furthermore, the Bible is multiparallel and thus allows for a fair comparison across languages without confounds like domain.
For evaluation of our methods only, we additionally obtain ground-truth morphological paradigms from UniMorph (Kirov et al., 2018), which provides paradigms for over 100 languages.
From the intersection of languages in the Bible and UniMorph, we select 14 typologically diverse languages from 7 families, all of which display inflectional morphology: Basque (EUS), Bulgarian (BUL), English (ENG), Finnish (FIN), German (DEU), Kannada (KAN), Navajo (NAV), Spanish (SPA), and Turkish (TUR) as test languages, and Maltese (MLT), Persian (FAS), Portuguese (POR), Russian (RUS), and Swedish (SWE) for development. To create test data for all languages and development data for our development languages, we sample 100 paradigms for each set from UniMorph and take their lemmas as our lexicon L (for Basque, Kannada, and Maltese, we only take 20 paradigms for each set, due to limited availability).

Baselines and Skylines
Lemma Baseline (LB) Our first, trivial baseline predicts inflected forms identical to the lemma for all paradigm slots.We compare to one version of this baseline that has access to the ground-truth paradigm size (LB-Truth), and a second version which predicts the paradigm size as the average over the development languages (LB-Dev).
One/Ten-Shot Inflection Model Our second baseline could be seen as a skyline, since it leverages morphological information our proposed system does not have access to.In particular, we train the baseline system of CoNLL-SIGMORPHON 2017 (Cotterell et al., 2017) on one (CoNLL17-1) and ten (CoNLL17-10) paradigms.For this, we randomly sample paradigms from UniMorph, excluding those in our test data.

Hyperparameters
We choose the hyperparameters by grid search over intuitively reasonable ranges, using the development languages. No test language data is seen before final testing. Note also that only the corpus and the lexicon can be accessed by our system; no ground-truth morphological information (including the paradigm size) is given.

Evaluation Metrics
Systems for supervised or semi-supervised paradigm completion are commonly evaluated using word-level accuracy (Dreyer and Eisner, 2011; Cotterell et al., 2017). However, this is not possible for our task because our system cannot access the gold paradigm slot descriptions and, thus, does not necessarily produce one word for each ground-truth inflected form. Furthermore, the system outputs pseudo-tags, and the mapping from pseudo-tags to paradigm slots is unknown.
Therefore, we propose to use best-match accuracy (BMAcc), the best accuracy among all mappings from pseudo-tags to paradigm slots, for evaluation. Let Γ = {γ_i}_{i=1}^N and Γ̂ = {γ̂_j}_{j=1}^M be the sets of all paradigm slots in the ground truth and the prediction, respectively, with transformation functions f_γ and f_γ̂, where f_γ(ℓ) = ∅ if the corresponding inflection is missing in the ground truth. We define two types of BMAcc:

Macro-averaged BMAcc This is the average per-slot accuracy for the best possible matching of slots. For any γ_i, γ̂_j, we define g_t(L, γ_i, γ̂_j) as the number of correct guesses (true positives) if γ̂_j maps to γ_i, g_a(L, γ_i) as the number of ground-truth inflections for γ_i, and the per-slot accuracy as

acc(L, γ_i, γ̂_j) = g_t(L, γ_i, γ̂_j) / g_a(L, γ_i).

Then, we construct a complete bipartite graph with Γ and Γ̂ as the two sets of vertices and acc(L, γ_i, γ̂_j) as edge weights. The maximum-weight full matching can be computed efficiently with the algorithm of Karp (1980). With such a matching M = {(γ_m, γ̂_m)}_{m=1}^{min{N,M}}, the macro-averaged BMAcc is defined as

BMAcc-macro(L, Γ, Γ̂) = (1 / max{N, M}) Σ_{(γ_m, γ̂_m)∈M} acc(L, γ_m, γ̂_m).

The normalizing factor 1/max{N, M} rewards predicting the correct number of paradigm slots. In the case when acc(L, γ_m, γ̂_m) = 1 for all (γ_m, γ̂_m) ∈ M, BMAcc-macro(L, Γ, Γ̂) reaches its maximum if and only if N = M.

Micro-averaged BMAcc Our second metric is conceptually closer to word-level accuracy. We start with the same process of bipartite graph matching, but instead use g_t(L, γ_i, γ̂_j) as edge weights. Given the optimal matching M, the micro-averaged BMAcc is defined as

BMAcc-micro(L, Γ, Γ̂) = Σ_{(γ_m, γ̂_m)∈M} g_t(L, γ_m, γ̂_m) / Σ_{i=1}^N g_a(L, γ_i).
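The macro-averaged metric can be sketched with a small brute-force matcher. The paper uses an efficient assignment algorithm over a bipartite graph; the exhaustive search over injections below is only practical for a handful of slots, and all names are illustrative.

```python
from itertools import permutations

def bmacc_macro(gold, pred):
    """gold, pred: lists of {lemma: form} dicts, one dict per slot."""
    def acc(g, p):  # per-slot accuracy if predicted slot p maps to gold slot g
        return sum(1 for lem, form in g.items() if p.get(lem) == form) / len(g)
    n, m = len(gold), len(pred)
    if n <= m:   # try every injection of gold slots into predicted slots
        best = max(sum(acc(gold[i], pred[p[i]]) for i in range(n))
                   for p in permutations(range(m), n))
    else:        # or of predicted slots into gold slots
        best = max(sum(acc(gold[p[j]], pred[j]) for j in range(m))
                   for p in permutations(range(n), m))
    # normalizing by max(n, m) penalizes a wrong number of predicted slots
    return best / max(n, m)
```

Note how adding a spurious predicted slot lowers the score even when the matched slots are unchanged, which is the intended reward for predicting the correct paradigm size.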

Results and Discussion
Overall Results We present our experimental results in Table 1. The performance of our system varies widely across languages, with the best results for ENG (74% BMAcc). On average over languages, our final system obtains 18.76%/19.06% BMAcc on the test set, as compared to the baseline of 4.94%/5.25% and skylines of 18.70%/18.70% and 35.58%/35.56%. Compared to versions of our system without selected components, our final system performs best on average for both development and test languages. Leaving out step II or step III leads to a reduction in performance.
Notably, variants of our system outperform the skyline CoNLL17-1, which has seen one training example and thus knows the correct paradigm size in advance, on EUS, BUL, ENG, FIN, KAN, NAV, TUR, MLT, RUS, and SWE. Moreover, our system even outperforms CoNLL17-10 on EUS, ENG, and NAV, which shows that unsupervised paradigm completion has promise even in cases where a limited number of training examples, but not large amounts, are available.

Differences between Languages
We hypothesize that the large differences between languages, over 73% BMAcc between EUS and ENG, can be explained in part by the following reasons: intuitively, the larger the paradigm, the more difficult the task. If the number of slots is huge, each individual inflection is rare, and it is hard for any unsupervised paradigm completion system to distinguish true inflections (e.g., rise → rises) from false candidates (e.g., rise → arise). This could explain the high performance on ENG and the low performance on EUS, FIN, and FAS.
Related to the last point, in a limited corpus such as the Bible, some inflected forms might not appear for any lemma, which makes them undetectable for unsupervised paradigm completion systems. For example, a FAS paradigm has 136 slots in UniMorph, but only 46 are observed. Additional statistics can be found in Table 2.
Furthermore, Assumption 2 does not hold for all languages. Surface forms can be shared between paradigm slots, as, for instance, in English for he studied and he has studied. Different languages show different degrees of this phenomenon, which is called syncretism.
Pipeline Effects Different combinations of components result in major performance differences. In particular, each step of our system has the potential to introduce errors. This demonstrates a pitfall of pipeline methods also discussed in McCarthy et al. (2020): the quality of individual steps, e.g., EDIT TREE discovery and retrieval of additional lemmas, can greatly affect the results of PCS-II and PCS-III.
Differences in Components Details of individual components also affect the results. On the one hand, applying more than one iteration of additional lemma retrieval impacts the results only slightly, as those lemmas are assigned very small weights. On the other hand, we see performance differences of more than 2% between PCS-III-C and PCS-III-H for DEU, MLT, and SWE.

Analysis of EDIT TREE Quality
As it is the first step in our pipeline, the quality of the EDIT TREE discovery strongly affects the performance of later components. For our development languages, we show in Table 2 the percentage of (ℓ, w) pairs for which the system predicts an EDIT TREE ψ such that ψ(ℓ) = w appears in the gold paradigm of ℓ. This corresponds to the highest possible performance after PCS-I. FAS has the worst performance (12.10%), while the results for SWE are high (72.91%). As expected, languages with lower values here also obtain lower final results.

Analysis of Syncretism
We further hypothesize that syncretism could be a source of errors due to Assumption 2. Table 2 shows the percentage of words that are the inflected forms corresponding to multiple paradigm slots of the same lemma.
We observe that SWE has a low degree of syncretism, and, in fact, our system predicts the correct paradigm size for SWE.A high degree of syncretism, in contrast, might contribute to the low performance on MLT.

Conclusion
We proposed unsupervised morphological paradigm completion, a novel morphological generation task. We further developed a system for the task, which performs the following steps: (i) EDIT TREE retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and (iv) inflection generation. Introducing best-match accuracy, a metric for the task, we evaluated our system on a typologically diverse set of 14 languages. Our system obtained promising results for most of our languages and even outperformed a minimally supervised baseline on Basque, English, and Navajo. Further analysis showed the importance of our individual components and detected possible sources of errors, like wrongly identified EDIT TREES early in the pipeline or syncretism.
In the future, we will explore the following directions: (i) A difficult challenge for our proposed system is to correctly determine the paradigm size. Since transfer across related languages has been shown to be beneficial for morphological tasks (Jin and Kann, 2017; McCarthy et al., 2019; Anastasopoulos and Neubig, 2019, inter alia), future work could use typologically aware priors to guide the number of paradigm slots based on the relationships between languages. (ii) We plan to explore other methods, like word embeddings, to incorporate context information into our feature function. (iii) We aim to develop better performing string transduction models for the morphological inflection step. By substituting the current transducers in our pipeline, we expect to improve the overall performance of our system.

Figure 1 :
Figure 1: Our unsupervised paradigm completion system, which takes raw text and a lemma list as inputs. We describe it in detail in §4.

Figure 3 :
Figure 3: Visualization of the EDIT TREES representing (a) work → worked and (b) continue → continued.

Figure 4 :
Figure 4: An example of the distributionally informed feature function with window size 3 for the past tense slot (PST): stop ∈ L and f_PST(stop) = stopped. When the sliding window arrives at this instance of stopped, r_PST[N,V,V] is increased by 1.

Table 1 :
Macro- and micro-averaged BMAcc (in percent), as well as the predicted number of paradigm slots (in brackets), for each method. Overall best scores are bold, and the best scores of our system are underlined.

Table 2 :
Statistics for our development languages, computed with UniMorph. ET Match is the percentage of gold (ℓ, w) pairs that can be matched to an EDIT TREE. Rep. Words denotes the percentage of inflected forms that represent multiple paradigm slots. Absent/Total Slots gives the numbers of unobservable and total slots.