The Paradigm Discovery Problem

This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work.


Introduction
In childhood, we induce our native language's morphological system from unannotated input. For instance, we learn that ring and rang belong to the same inflectional paradigm. We also learn that rings and bangs belong to the same cell, i.e., they realize the same morphosyntactic properties 3.SG.PRES, but in different paradigms. Acquiring such paradigmatic knowledge enables us to produce unseen inflectional variants of new vocabulary items, i.e., to complete morphological paradigms. Much work has addressed this task, which Ackerman et al. (2009) call the paradigm cell filling problem (PCFP),¹ but few have discussed inducing paradigmatic knowledge from scratch, which we call the paradigm discovery problem (PDP).² As an unsupervised task, the PDP poses challenges for modeling and evaluation and has yet to be attempted in its full form (Elsner et al., 2019). However, we contend there is much to be gained from formalizing and studying the PDP. There are insights for cognitive modeling to be won (Pinker, 2001; Goldwater, 2007) and intuitions on combating sparse data for language generation (King and White, 2018) to be accrued. Unsupervised language processing also has natural applications in the documentation of endangered languages (Zamaraeva et al., 2019), where large amounts of annotated data are never likely to exist. Our formalization of the PDP offers a starting point for future work on unsupervised morphological paradigm completion.
Our paper presents a concrete formalization of the PDP. Then, as a baseline for future work, we introduce a heuristic benchmark system. Our benchmark system takes an unannotated text corpus and a lexicon of words from the corpus to be analyzed. It first clusters the lexicon by cell and then by paradigm, making use of distributional semantics and string similarity. Finally, it uses this clustering as silver-standard supervision to bootstrap a neural transducer (Vaswani et al., 2017) that generates the desired target inflections. That is, the model posits forms to realize unoccupied cell slots in each proposed paradigm. Even though our benchmark system models only one part of speech (POS) at a time, our framework extends to the full PDP to support future, more intricate systems. We propose two separate sets of metrics, one to evaluate the clustering of attested forms into paradigms and cells and one to evaluate the prediction of unseen inflected forms. Our metrics handle non-canonical morphological behavior discussed in theoretical literature (Corbett, 2005) and extend to the full PDP.
For three of the five languages we consider, our benchmark system predicts unattested inflections of lexicon forms with accuracy within 20% of a fully supervised system. However, our analysis suggests clustering forms into cells consistently across paradigms is still a very pressing challenge.
² [...] one of its subtasks which Boyé and Schalchli (2019) call the paradigm cell finding problem (see §2.2).

Previous Work in Morphology
This section couches our work on the PDP in terms of previous trends in morphological modeling.

Unsupervised Morphology
Much work on unsupervised morphological modeling focuses on segmentation (Gaussier, 1999; Goldsmith, 2001; Creutz and Lagus, 2005; Narasimhan et al., 2015; Bergmanis and Goldwater, 2017; Xu et al., 2018). While morphological segmenters can distinguish real from spurious affixes (e.g., bring ≠ br + ing) with high accuracy, they do not attempt to solve the PDP. They do, however, reveal which forms take the same affixes (e.g., walked, talked), but not which forms occupy the same cell (e.g., walked, brought). Indeed, they explicitly struggle with irregular morphology. Segmenters also cannot easily model non-concatenative phenomena like ablaut, vowel harmony and templatic processes.
Two works have proposed tasks which can be considered alternative formulations of the PDP, using either minimal or indirect supervision to bootstrap their models. We discuss each in turn. First, Dreyer and Eisner (2011) use a generative model to cluster forms into paradigms and cells with a Bayesian non-parametric mixture of weighted finite-state transducers. They present a PDP framework which, in principle, could be fully unsupervised, but their model requires a small seed of labeled data to get key information like the number of cells distinguished, making it less relevant cognitively. In contrast, our task is not directly supervised and focuses on distributional context. Second, contemporaneous to our work, Jin et al. (2020) propose a similar framework for SIGMORPHON 2020's shared task on unsupervised morphological paradigm completion. Given only a small corpus and lexicon of verbal lemmata, participating systems must propose full paradigms for each lemma. By contrast, our framework does not reveal how many paradigms should be generated, nor do we privilege a specific form as the lemma, but we do use a larger lexicon of exclusively verbal or nominal forms. Their proposed baseline uses distributional context for POS tagging and features, but does not train embeddings as the corpus is small.

Subtasks of Paradigm Discovery
A few works address subtasks of the PDP.  learn paradigm membership from raw text, but do not sort paradigms into cells. Boyé and Schalchli (2019) discuss the paradigm cell finding problem, identifying the cell (but not paradigm) realized by a given form. Lee (2015) clusters forms into cells across inflection classes. Beniamine et al. (2018) group paradigms into inflection classes, and Eskander et al. (2013) induce inflection classes and lemmata from cell labels.

The Paradigm Cell Filling Problem
The PCFP is the task of predicting unseen inflected forms given morphologically labeled input. PCFP models can guess a word's plural having only seen its singular, but the child must bootstrap morphological knowledge from scratch, first learning that singular-plural is a relevant distinction. Thus, the PDP must be at least partially solved before the PCFP can be attempted. Yet, as a supervised task, the PCFP is more easily studied, and has received much attention on its own, especially from the word-and-paradigm camp of morphological theory.
Some cognitive works suggest the PCFP cannot be too difficult for any language (Dale et al., 1998; Malouf, 2013, 2015; Blevins et al., 2017). Neural models can test and extend such proposals (Cotterell et al., 2018a). A related vein of work discusses how speakers inflect nonce words (Berko, 1958; Plunkett and Juola, 1999; Yang, 2015), e.g., is the past tense of sping, spinged or spung? There is a long tradition of modeling past-tense generation with neural networks (Rumelhart and McClelland, 1986; Corkery et al., 2019).
On the engineering side, Durrett and DeNero (2013) inspired much recent work, which has since benefited from large inflectional datasets and advances in neural sequence modeling (Bahdanau et al., 2015). Shared tasks have drawn extra attention to the PCFP (Cotterell et al., 2016a, 2017, 2018c).

The Paradigm Discovery Problem
Paradigm discovery is a natural next step in computational morphology, building on related minimally or indirectly supervised works (§2.2) to bridge the gap between unsupervised traditions (§2.1) and supervised work on the PCFP (§2.3). In the PCFP, each input form is labeled with its morphosyntactic property set, i.e., the cell in the paradigm which it realizes, and its lexeme, i.e., the paradigm of related forms to which it belongs. By contrast, to solve the PDP, unlabeled input forms must be assigned cells and paradigms. This task requires learning what syntactic and semantic factors distinguish cells, what combinations of cells can co-occur in a paradigm, and what aspects of a surface form reflect its paradigm and its cell, respectively. Table 1 provides an overview of our PDP setup. The first two rows show input data: an unannotated corpus and a lexicon of forms attested in that corpus. Given only these data, the task is to output a grid such that (i) all lexicon forms and all their (potentially unseen) inflectional variants appear in the grid, (ii) all forms appearing in the same column realize the same morphosyntactic cell, and (iii) all forms appearing in the same row belong to the same paradigm. Unattested «forms» to be generated are depicted in brackets in Table 1's gold grid, which shows the ideal output of the system. Our setup permits multiple forms realizing the same slot, i.e., a specific cell in a specific paradigm, a single form realizing multiple slots, and unrealizable empty slots. This supports overabundance (Thornton, 2010, 2011), defectiveness (Sims, 2015), and syncretism (Blevins, 1995; Cotterell et al., 2018b). See Corbett (2005) for more on these phenomena. Experimentally, we constrain the PDP by limiting the lexicon to forms from one POS, but our formalization is more general.

Corpus
  The cat watched me watching it .
  I followed the show but she had n't seen it .
  Let 's see who follows your logic .
Lexicon
  watching, seen, follows, watched, followed, see
Gold Grid
              cell 1     cell 2      cell 3        cell 4     cell 5
  paradigm 1  «watch»    «watches»   watching      watched    watched
  paradigm 2  «follow»   follows     «following»   followed   followed
  paradigm 3  see        «sees»      «seeing»      «saw»      seen
Table 1: An example corpus, lexicon, and gold analyses. All lexicon entries appear in the corpus and, for our experiments, they will all share a POS, here, verb. The grid reflects all possible analyses of syncretic forms (e.g., watched, followed), even though these only occur in the corpus as PST realizations, like saw in Cell 4, not as PST.PTCP, like seen in Cell 5. Bracketed «forms» are paradigm mates of attested forms, not attested in the lexicon.
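To make the grid requirements concrete, the following is a minimal Python sketch of one possible grid representation, populated with the gold grid of Table 1. The names make_grid, slots_of and covers_lexicon are hypothetical and chosen for illustration only; the paper does not prescribe a data structure.

```python
from collections import defaultdict

def make_grid(rows):
    """Build a grid {(paradigm, cell): set of forms} from a list of rows.

    Each row is a list of cells and each cell is a list of forms, so a slot
    may hold several forms (overabundance) or none (defectiveness).
    """
    grid = defaultdict(set)
    for p, row in enumerate(rows, start=1):
        for c, forms in enumerate(row, start=1):
            for form in forms:
                grid[(p, c)].add(form)
    return dict(grid)

def slots_of(grid, form):
    """All slots realized by a form; more than one slot means syncretism."""
    return sorted(slot for slot, forms in grid.items() if form in forms)

def covers_lexicon(grid, lexicon):
    """Requirement (i): every lexicon form appears somewhere in the grid."""
    present = set().union(*grid.values())
    return set(lexicon) <= present

# Gold grid of Table 1 (attested and unattested forms alike).
gold = make_grid([
    [["watch"], ["watches"], ["watching"], ["watched"], ["watched"]],
    [["follow"], ["follows"], ["following"], ["followed"], ["followed"]],
    [["see"], ["sees"], ["seeing"], ["saw"], ["seen"]],
])
lexicon = ["watching", "seen", "follows", "watched", "followed", "see"]
```

Keying the grid by slot rather than by form is one way to let the same form occupy several slots, which the formalization requires for syncretic forms such as watched.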

Data for the PDP
For a given language and POS, we create a corpus, lexicon, and gold grid based on a Universal Dependencies (UD) corpus (Nivre et al., 2016). At a high level, the corpus includes raw, non-UD sentences, and UD sentences stripped of annotations. The lexicon includes all forms occurring in the UD sentences with the specified POS (potentially including variant spellings and typographical errors). The gold grid consists of full paradigms for every word which co-occurs in UD and the UniMorph lexicon with a matching lemma-cell analysis; this is similar to the corpus created by . As a system does not know which lexicon forms will be evaluated in the gold grid, it must model the entire lexicon, which should contain a realistic distribution over rare words and inflection classes, having been directly extracted from distributional data (Bybee, 2003; Lignos and Yang, 2018).
To ensure the gold grid is reasonably clean, we take all word-lemma-feature tuples from the UD portion of the corpus matching the specified POS and convert the features to a morphosyntactic cell identifier compatible with UniMorph representation as in McCarthy et al. (2018).³ Then we check which word-lemma-cell tuples also occur in UniMorph. For each unique lemma in this intersection, the full paradigm is added as a row to the gold grid. To filter typos and annotation discrepancies, we identify any overabundant slots, i.e., slots realized by multiple forms, and remove all but the most frequently attested realization in UD. While some languages permit overabundance (Thornton, 2010), it often indicates typographical or annotation errors in UD and UniMorph (Gorman et al., 2019; Malouf et al., 2020). Unlike the gold grid, the lexicon retains overabundant realizations, requiring systems to handle such phenomena. For each language, the raw sentences used to augment the corpus add over 1 million additional words. For German and Russian, we sample sentences from OpenSubtitles (Lison and Tiedemann, 2016), for Latin, the Latin Library (Johnson et al., 2016), and for English and Arabic, Gigaword (Parker et al., 2011a,b). Supplementary sentences are preprocessed via Moses (Koehn et al., 2007) to split punctuation, and, for supported languages, clitics. Table 3 shows corpus and lexicon sizes.
³ Aligning UniMorph and UD requires removing diacritics in (Latin and Arabic) UniMorph corpora to match UD. This can obscure some morphosyntactic distinctions but is more consistent with natural orthography in distributional data. The use of orthographic data for morphological tasks is problematic, but standard in the field, due to scarcity of phonologically transcribed data (Malouf et al., 2020).
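The intersection and overabundance-filtering steps described above can be sketched as follows. This is an illustrative sketch only: the function names and the tuple layout (form, lemma, cell) are assumptions, the example data are made up, and ties between equally frequent realizations are broken by first occurrence, a detail the text does not specify.

```python
from collections import Counter

def shared_tuples(ud_tuples, unimorph_tuples):
    """Word-lemma-cell tuples attested in both resources (UD order kept)."""
    unimorph = set(unimorph_tuples)
    return [t for t in ud_tuples if t in unimorph]

def most_frequent_realizations(ud_tuples):
    """Keep one form per (lemma, cell) slot: the most frequent one in UD.

    ud_tuples is an iterable of (form, lemma, cell) tokens, so repeated
    tuples encode frequency. Ties keep the realization seen first.
    """
    counts = Counter(ud_tuples)
    best = {}
    for (form, lemma, cell), n in counts.items():
        slot = (lemma, cell)
        if slot not in best or n > best[slot][1]:
            best[slot] = (form, n)
    return {slot: form for slot, (form, _) in best.items()}

# Hypothetical tokens: "dreamed" is attested twice, "dreamt" once.
ud = [
    ("dreamed", "dream", "PST"),
    ("dreamt", "dream", "PST"),
    ("dreamed", "dream", "PST"),
    ("sees", "see", "3SG"),
]
unimorph = [("dreamed", "dream", "PST"), ("sees", "see", "3SG")]
```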

Metrics
A system attempting the PDP is expected to output a morphologically organized grid in which rows and columns are arbitrarily ordered, but ideally, each row corresponds to a gold paradigm and each column to a gold cell. Aligning rows to paradigms and columns to cells is non-trivial, making it difficult to simply compute accuracy over gold grid slots. Furthermore, cluster-based metrics (Rosenberg and Hirschberg, 2007) are difficult to apply as forms can appear in multiple columns or rows. Thus, we propose novel metrics that are lexical, based on analogical relationships between forms. We propose a set of PDP metrics, to measure how well organized lexicon forms are in the grid, and a set of PCFP metrics, to measure how well the system anticipates unattested inflectional variants. All metrics support non-canonical phenomena such as defective paradigms and overabundant slots.

PDP Metrics
A form f's paradigm mates are all those forms that co-occur in at least one paradigm with f. f's paradigm F-score is the harmonic mean of precision and recall of how well we predicted its paradigm mates when viewed as an information retrieval problem (Manning et al., 2008). We macro-average all forms' paradigm F-scores to compute F_par. Qualitatively, F_par tells us how well we cluster words that belong to the same paradigm. A form f's cell mates are all those forms that co-occur in at least one cell with f. f's cell F-score is the harmonic mean of precision and recall of how well we predicted its cell mates. As before, we macro-average all forms' cell F-scores to compute F_cell. Qualitatively, F_cell tells us how well we cluster words that belong to the same cell. Finally, we propose the F_grid metric as the harmonic mean of F_par and F_cell. F_grid is a single number that reflects a system's ability to cluster forms into both paradigms and cells. Because we designate separate PCFP metrics to evaluate gold grid forms not in the lexicon, we restrict f's mates to only include forms that occur in the lexicon.
Consider the proposed grid in Table 2. There are 6 lexicon forms in the gold grid. Starting with watched, we correctly propose its only attested paradigm mate, watching. Thus, watched's paradigm F-score is 100%. For see, we propose no attested paradigm mates, but we should have proposed seen. 0 correct out of 1 true paradigm mate from 0 predictions results in an F-score of 0% for see. We continue like this for all 6 attested forms in the gold grid and average their scores to get F_par. As for F_cell, we correctly predict that watched's only cell mate is followed, yielding an F-score of 100%. However, we incorrectly predict that see has a cell mate, seen, yielding an F-score of 0%; we average each word's F-score to get F_cell; the harmonic mean of F_par and F_cell gives us F_grid.
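The mate-based F-scores can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the evaluation code used in the paper: grids are dictionaries from (paradigm, cell) slots to sets of forms, a form with no gold mates and no predicted mates is scored 1.0 (a convention the text does not spell out), and the proposed grid below is a hypothetical one, not the grid of Table 2.

```python
def mates(grid, form, axis, lexicon):
    """Lexicon forms sharing a paradigm (axis=0) or a cell (axis=1) with form."""
    keys = {slot[axis] for slot, forms in grid.items() if form in forms}
    found = set()
    for slot, forms in grid.items():
        if slot[axis] in keys:
            found |= forms & lexicon
    found.discard(form)
    return found

def f_score(gold_mates, pred_mates):
    """Harmonic mean of precision and recall over predicted mates."""
    if not gold_mates and not pred_mates:
        return 1.0  # assumed convention: nothing to find, nothing proposed
    correct = len(gold_mates & pred_mates)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_mates)
    recall = correct / len(gold_mates)
    return 2 * precision * recall / (precision + recall)

def pdp_metrics(gold, pred, lexicon):
    """Return (F_par, F_cell, F_grid), macro-averaged over attested gold forms."""
    lexicon = set(lexicon)
    evaluated = sorted(set().union(*gold.values()) & lexicon)
    averages = []
    for axis in (0, 1):
        scores = [f_score(mates(gold, f, axis, lexicon),
                          mates(pred, f, axis, lexicon)) for f in evaluated]
        averages.append(sum(scores) / len(scores))
    f_par, f_cell = averages
    total = f_par + f_cell
    f_grid = 2 * f_par * f_cell / total if total else 0.0
    return f_par, f_cell, f_grid

lexicon = {"watching", "seen", "follows", "watched", "followed", "see"}
gold = {
    (1, 1): {"watch"}, (1, 2): {"watches"}, (1, 3): {"watching"},
    (1, 4): {"watched"}, (1, 5): {"watched"},
    (2, 1): {"follow"}, (2, 2): {"follows"}, (2, 3): {"following"},
    (2, 4): {"followed"}, (2, 5): {"followed"},
    (3, 1): {"see"}, (3, 2): {"sees"}, (3, 3): {"seeing"},
    (3, 4): {"saw"}, (3, 5): {"seen"},
}
# Hypothetical system output: see and seen end up in different paradigms.
proposed = {
    (1, 1): {"watched"}, (1, 2): {"watching"},
    (2, 1): {"followed"}, (2, 3): {"follows"},
    (3, 4): {"see"}, (4, 4): {"seen"},
}
f_par, f_cell, f_grid = pdp_metrics(gold, proposed, lexicon)
```

Because mates are collected per row or per column key, a syncretic form such as watched simply contributes the mates of every slot it occupies, which is why no alignment between proposed and gold columns is needed.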
While F_grid handles syncretism, overabundance, defectiveness and mismatched grid dimensions, it is exploitable by focusing exclusively on the best attested cells realized by the most unique forms, since attested cells tend to exhibit a Zipfian distribution (Blevins et al., 2017; Lignos and Yang, 2018). Exploiting F_grid in this manner propagates errors when bootstrapping to predict unattested forms and, thus, will be punished by PCFP metrics.

PCFP Metrics
We cannot evaluate the PCFP as in supervised settings (Cotterell et al., 2016a) because proposed cells and paradigms cannot be trivially aligned to gold cells and paradigms. Instead, we create a test set by sampling 2,000 four-way analogies from the gold grid. The first and second forms must share a row, as must the third and fourth; the first three forms must be attested and the fourth unattested, e.g., watched : watching :: seen : «seeing».
Table 3: Statistics regarding the input corpus and lexicon. UD tokens refers to tokens in the corpus originally extracted from UD sentences.
From this test set and a proposed grid, we compute a strict analogy accuracy (An) metric and a more lenient lexicon expansion accuracy (LE) metric. An counts predictions as correct if all analogy directions hold in the proposed grid (i.e., watched, watching and seen, «seeing» share rows and watched, seen and watching, «seeing» share columns). LE counts predictions as correct if the unattested fourth form appears anywhere in the grid. That is, LE asks, for each gold form, if it was predicted in any slot in any paradigm.
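The two PCFP metrics can be sketched as follows, again assuming a grid stored as a dictionary from (paradigm, cell) slots to sets of forms. The function names are hypothetical, and the row and column checks are deliberately simple: they only test that the relevant pairs share some row or some column, which is a simplification when forms are syncretic.

```python
def share(grid, x, y, axis):
    """True if x and y share a row (axis=0) or a column (axis=1)."""
    keys_x = {slot[axis] for slot, forms in grid.items() if x in forms}
    keys_y = {slot[axis] for slot, forms in grid.items() if y in forms}
    return bool(keys_x & keys_y)

def analogy_correct(grid, a, b, c, d):
    """Strict check for a : b :: c : d, all four directions must hold."""
    return (share(grid, a, b, 0) and share(grid, c, d, 0)
            and share(grid, a, c, 1) and share(grid, b, d, 1))

def expansion_correct(grid, d):
    """Lenient check: the unattested form appears in any slot at all."""
    return any(d in forms for forms in grid.values())

def pcfp_metrics(grid, analogies):
    """Return (An, LE) over a list of (a, b, c, d) analogies."""
    n = len(analogies)
    an = sum(analogy_correct(grid, *q) for q in analogies) / n
    le = sum(expansion_correct(grid, q[3]) for q in analogies) / n
    return an, le

analogies = [("watched", "watching", "seen", "seeing")]
# Hypothetical proposed grids: the second one generates "seeing" but
# places it in a paradigm of its own.
aligned = {(1, 1): {"watched"}, (1, 2): {"watching"},
           (2, 1): {"seen"}, (2, 2): {"seeing"}}
misplaced = {(1, 1): {"watched"}, (1, 2): {"watching"},
             (2, 1): {"seen"}, (3, 2): {"seeing"}}
```

The second grid illustrates the gap between the two metrics: the form is generated, so LE credits it, but it is not organized consistently with its paradigm mate, so An does not.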
Like the PDP metrics, our PCFP metrics support syncretism, overabundance, defectiveness, etc. One can, however, exploit them by proposing a gratuitous number of cells, paradigms, and syncretisms, increasing the likelihood of completing analogies by chance, though this will reduce F_grid. As both PDP and PCFP metrics can be exploited independently but not jointly, we argue that both types of metrics should be considered when evaluating an unsupervised system.

Building a Benchmark
This section presents a benchmark system for proposing a morphologically organized grid given a corpus and lexicon. First, we cluster lexicon forms into cells. Then we cluster forms into paradigms given their fixed cell membership. To maintain tractability, clustering assumes a one-to-one mapping of forms to slots. Following cell and paradigm clustering, we predict forms to realize empty slots given one of the lexicon forms assigned to a cell in the same paradigm. This allows forms to appear in multiple slots, but does not support overabundance, defectiveness, or multi-word inflections.

Clustering into Cells
We use a heuristic method to determine the number of cells and what lexicon forms to assign to each. Inspired by work on inductive biases in word embeddings (Pennington et al., 2014;Trask et al., 2015;Goldberg, 2016;Avraham and Goldberg, 2017;Tu et al., 2017), we train morphosyntactically biased embeddings on the corpus and use them to k-means cluster lexicon forms into cells. Following , we emphasize morphosyntactically salient dimensions in embedding space by manipulating hyperparameters in fastText (Bojanowski et al., 2017). Specifically, to encourage grouping of morphologically related words, fastText computes a word's embedding as the sum of its subword embeddings for all subword sequences between 3 and 6 characters long (Schütze, 1993). We shorten this range to 2 to 4 to bias the grouping toward shared affixes rather than (usually longer) shared stems. This helps recognize that the same affix is likely to realize the same cell, e.g., watch +ed and follow +ed. We limit the context window size to 1; small windows encourage a morphosyntactic bias in embeddings (Erk, 2016).
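The effect of shortening the subword range can be illustrated without training any embeddings. The sketch below only enumerates character n-grams with word boundary markers, in the style of fastText subwords; it is a simplified illustration (the helper name char_ngrams is hypothetical) and not a reimplementation of fastText.

```python
def char_ngrams(word, min_n, max_n):
    """Character n-grams of a word padded with < and > boundary markers."""
    marked = f"<{word}>"
    return {marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)}

def shared_ngrams(w1, w2, min_n, max_n):
    """N-grams that two words have in common for a given length range."""
    return char_ngrams(w1, min_n, max_n) & char_ngrams(w2, min_n, max_n)

short_range = shared_ngrams("watched", "followed", 2, 4)    # biased range
default_range = shared_ngrams("watched", "followed", 3, 6)  # default range
stem_overlap = shared_ngrams("watched", "watching", 3, 6)
```

With the 2 to 4 range, watched and followed share three affix n-grams (ed, d> and ed>), against a single one (ed>) with the default 3 to 6 range, so a sum of subword embeddings gives relatively more weight to the shared suffix. Longer n-grams, by contrast, mostly capture shared stems such as watch.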
We determine the number of cells to cluster lexicon forms into, k, via the elbow method, which progressively considers adding clusters until the reduction in dispersion levels off (Kodinariya and Makwana, 2013; Bholowalia and Kumar, 2014).⁴ Since Tibshirani et al. (2001)'s popular formalism of the method does not converge on our data, we implement a simpler technique that works in our case. We incrementally increase k, each time recording clustering dispersion, d_k (for consistency, we average d_k over 25 iterations). Starting at k = 2, we calculate dispersion deceleration as the difference between the previous and the next reduction in dispersion:
\[ \mathrm{decel}(k) = d_{k-1} - 2 d_k + d_{k+1} \tag{1} \]
Once decel(k) decreases below decel(2), we take the kth clustering: the (k + 1)th cluster did not explain enough variation in the embedding space to justify an additional morphosyntactic distinction.
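The stopping rule can be sketched as below. The sketch assumes that deceleration is the second difference of dispersion, decel(k) = d_{k-1} - 2 d_k + d_{k+1}, and that dispersions have already been averaged over repeated k-means runs; the function names, the k_max cap and the dispersion values are hypothetical.

```python
def decel(d, k):
    """Second difference of dispersion at k (assumed form of deceleration)."""
    return d[k - 1] - 2 * d[k] + d[k + 1]

def choose_k(d, k_max):
    """Smallest k > 2 whose deceleration falls below decel(2).

    d maps k to averaged dispersion d_k and must cover 1 .. k_max + 1.
    Falls back to k_max if the threshold is never crossed.
    """
    threshold = decel(d, 2)
    for k in range(3, k_max + 1):
        if decel(d, k) < threshold:
            return k
    return k_max

# Hypothetical averaged dispersions for k = 1 .. 6.
dispersion = {1: 100.0, 2: 60.0, 3: 35.0, 4: 30.0, 5: 28.0, 6: 27.0}
k = choose_k(dispersion, k_max=5)
```

Here decel(2) = 15 and decel(3) = 20, while decel(4) = 3, so the sketch settles on k = 4: a fifth cluster would reduce dispersion by little more than the fourth did.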

Clustering into Paradigms
Given a clustering of lexicon forms into k cells, denoted as C_1, ..., C_k, we heuristically cluster each form f into a paradigm, π, as a function of f's cell, c. For tractability, we assume paradigms are pairwise disjoint and no paradigm contains multiple forms from the same cell. Our algorithm greedily builds paradigms cell by cell. To gauge the quality of a candidate paradigm, we first identify its base and exponents. Following Beniamine et al. (2018), we define π's base, b_π, as the longest common subsequence shared by all forms in π.⁵,⁶ For each form f in π, we define the exponent x_f as the subsequences of f that remain after removing b_π, i.e., x_f is a tuple of affixes. For example, if π contains words wxyxz and axx, b_π is xx and the exponents are (<w, y, z>) and (<a), respectively.⁷ Inspired by unsupervised maximum matching in greedy tokenization (Guo, 1997; Erdmann et al., 2019), we define the following paradigm score function:
\[ \mathrm{score}(\pi) = \sum_{f \in \pi} \left( |b_\pi| - |x_f| \right) \tag{2} \]
which scores a candidate paradigm according to the number of base characters minus the number of exponent characters; it can be negative. Algorithm 1 then details our heuristic clustering approach. We greedily select one or zero forms from each cell to add (via the list concatenation operator •) to each paradigm such that the paradigm's score is maximized.⁸ After performing a first pass of paradigm clustering with Algorithm 1, we estimate an unsmoothed probability distribution p(x | c) as follows: we take the number of times each exponent (tuple of affixes) realizes a cell in the output of Algorithm 1 and divide by the number of occurrences of that cell. We use this distribution p(x | c) to construct an exponent penalty:
\[ \omega(x_f, c) = \begin{cases} 0 & \text{if } x_f = \arg\max_x p(x \mid c) \\ 2 - p(x_f \mid c) & \text{otherwise} \end{cases} \tag{3} \]
Intuitively, if an exponent is the most likely exponent in the cell to which it belongs, the penalty weight is zero and its characters are not subtracted from the score. Otherwise, the weight is in the interval [1, 2] such that each exponent character is penalized at least as harshly but no more than twice as harshly as in the first pass, according to the exponent's likelihood. We use this exponent penalty weight to define a penalized score function:
\[ \mathrm{score}_\omega(\pi) = \sum_{\langle f, c \rangle \in \pi} \left( |b_\pi| - \omega(x_f, c)\,|x_f| \right) \]
We then re-run Algorithm 1, swapping out score(·) for score_ω(·), to re-cluster forms into paradigms. Empirically, we find that harsher exponent penalties, i.e., forcing weights to be greater than 1 for suboptimal exponents, lead to higher paradigm precision in this second pass.
For an example, consider candidate paradigm [«», watched, «», «», «»]. If we add nothing, each character of watched can be analyzed as part of the base, yielding a score of 7. What if we attempt to add watching, pre-determined to belong to column 5 during cell clustering? Candidate paradigm [«», watched, «», «», watching] increases the number of base characters to 10 (watch shared by 2 words), but yields a score of 5 after subtracting the characters from both exponents, (ed>) and (ing>). Hence, we do not get this paradigm right on our first pass, as 5 < 7. Yet, after the first pass, should (ed>) and (ing>) be the most frequent exponents in the second and fifth cells, the second pass will be different. Candidate paradigm [«», watched, «», «», watching] is not penalized for either exponent, yielding a score of 10, thereby allowing watching to be added to the paradigm.

Algorithm 1 Paradigm Clustering Algorithm
1: input C_1, ..., C_k
2: π ← [ ]
3: for C_i ∈ {C_1, ..., C_k} do
   [...]
   s ← s_{f_j}
   [...]

⁵ The fact that we use a subsequence, instead of a substring, means that we can handle non-concatenative morphology.
⁶ We note that the longest common subsequence may be found with a polynomial-time dynamic program; however, there will not exist an algorithm whose runtime is polynomial in the number of strings unless P = NP (Maier, 1978).
⁷ We use word start (<) and end (>) tokens to distinguish exponents; they do not count as exponent characters in eq. (2).
⁸ Algorithm 1 has complexity O(|L|^2) where |L| is lexicon size. In practice, to make Algorithm 1 tractable, we limit the candidates for f_j (line 8) to the n = 250 forms from cell j nearest to f_i in pre-trained embedding space (trained via FastText with default parameters). This achieves a complexity upper bounded by O(|L|nk).
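The scoring logic and the greedy selection step can be sketched as follows. This is a hedged illustration, not the authors' implementation: the base is approximated by folding a pairwise longest common subsequence over the forms (which need not be optimal for more than two strings), the penalty weight assumes the form 2 - p(x | c) for non-modal exponents, weights are passed per form rather than derived from exponent tuples, and the helper names (lcs, base, score, penalty_weight, grow_paradigm) are hypothetical.

```python
from functools import reduce

def lcs(a, b):
    """Longest common subsequence of two strings via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    out, i, j = [], m, n
    while i and j:  # backtrack to recover one optimal subsequence
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

def base(forms):
    """Approximate shared base: pairwise LCS folded over all forms."""
    return reduce(lcs, forms)

def score(forms, weights=None):
    """Base characters minus (weighted) exponent characters, summed over forms."""
    b = base(forms)
    weights = weights if weights is not None else [1] * len(forms)
    return sum(len(b) - w * (len(f) - len(b)) for f, w in zip(forms, weights))

def penalty_weight(x, c, p):
    """Assumed penalty: 0 for the cell's most likely exponent, else 2 - p(x|c)."""
    dist = p[c]
    if x == max(dist, key=dist.get):
        return 0.0
    return 2.0 - dist.get(x, 0.0)

def grow_paradigm(seed, other_cells, score_fn=score):
    """Greedily add one or zero forms per cell if that raises the score."""
    paradigm = [seed]
    for cell in other_cells:
        best, best_score = None, score_fn(paradigm)
        for f in cell:
            s = score_fn(paradigm + [f])
            if s > best_score:
                best, best_score = f, s
        if best is not None:
            paradigm.append(best)
    return paradigm

first_pass = grow_paradigm("watched", [["watching", "seen"]])
# Second pass under the assumption that both exponents are modal (weight 0).
unpenalized = lambda forms: score(forms, weights=[0] * len(forms))
second_pass = grow_paradigm("watched", [["watching", "seen"]], score_fn=unpenalized)
```

On the running example, the first pass leaves watched alone (5 < 7), while the second pass, with zero weights for the modal exponents, adds watching (10 > 7), matching the walkthrough above.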

Reinflection
We now use the output of the clustering by cell and paradigm to bootstrap the PCFP. We use a Transformer (Vaswani et al., 2017) to predict the forms that realize empty slots. Transformer-based neural transducers constitute the state of the art for the PCFP.⁹ In Cotterell et al. (2016b)'s terms, we reinflect the target from one of the non-empty source cells in the same paradigm. We select the source from which we can most reliably reinflect the target. We quantify this reliability by calculating the accuracy with which each target cell's realizations were predicted from each source cell's realizations in our development set. For each target cell, we rank our preferred source cells according to accuracy.
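Source selection amounts to a ranking followed by a lookup, which can be sketched as below. The data layout (a dictionary of development-set accuracies keyed by source and target cell) and the function names are assumptions made for illustration; the accuracy values are made up.

```python
def rank_sources(dev_accuracy):
    """Map each target cell to its source cells, most reliable first.

    dev_accuracy: {(source_cell, target_cell): accuracy on the dev set}.
    """
    by_target = {}
    for (src, tgt), acc in dev_accuracy.items():
        by_target.setdefault(tgt, []).append((acc, src))
    return {tgt: [src for _, src in sorted(pairs, key=lambda t: -t[0])]
            for tgt, pairs in by_target.items()}

def pick_source(paradigm, target_cell, ranking):
    """Most reliable non-empty source slot for a target cell, or None.

    paradigm: {cell: form} holding the attested forms of one paradigm.
    """
    for src in ranking.get(target_cell, []):
        if src in paradigm:
            return src, paradigm[src]
    return None

# Hypothetical dev accuracies: cell 1 predicts cell 3 better than cell 2 does.
dev_accuracy = {(1, 3): 0.9, (2, 3): 0.6, (1, 2): 0.7}
ranking = rank_sources(dev_accuracy)
```

When the preferred source cell is empty in a given paradigm, the lookup falls through to the next most reliable attested cell.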
To generate train and development sets, we create instances for every possible pair of realizations occurring in the same paradigm (90% train, 10% development). We pass these instances into the Transformer, flattening cells and characters into a single sequence. Neural models for reinflection often perform poorly when the training data are noisy. We mitigate this via the harsh exponent penalty weights (eq. (3)) which encourage high paradigm precision during clustering.
⁹ We use the following hyperparameters: N = 4, d_model = 128, d_ff = 512. Remaining hyperparameters retain their default values as specified in Vaswani et al. (2017). Our models are trained for 100 epochs in batches of 64. We stop early after 20 epochs without improvement on the development set.

Results and Discussion
Table 4 shows results for two versions of our benchmark system: BENCH, as described in §4, and GOLD k, with the number of cells oracularly set to the ground truth. For reference, we also report a supervised benchmark, SUP, which assumes a gold grid as input, then solves the PCFP exactly as the benchmark does. (An refers to analogy accuracy and LE to the lexicon expansion accuracy.) In terms of the PDP, clustering assigns lexicon forms to paradigms (46-82%) more accurately than to cells (26-80%). Results are high for English, which has the fewest gold cells, and lower elsewhere. In German, Latin, and Russian, our benchmark proposes nearly as many cells as GOLD k, thus performing similarly. For English, it overestimates the true number and performs worse. For Arabic, it severely underestimates k but performs better, likely due to the orthography: without diacritics, the three case distinctions become obscured in almost all instances. In general, fixing the true number of cells can be unhelpful because syncretism and the Zipfian distribution of cells create situations where certain gold cells are too difficult to detect. Allowing the system to choose its own number of cells lets it focus on distinctions for which there is sufficient distributional evidence.
As for the PCFP, our benchmark system does well on lexicon expansion accuracy and poorly on the analogy task. While lexicon expansion accuracy (50-86% compared to 72-97% for SUP) shows that the benchmark captures meaningful inflectional trends, analogy accuracy demonstrates vast room for improvement in terms of consistently organizing cell-realizations across paradigms. English is the only language where analogy accuracy is within half of SUP's upper bound. A major reason for low analogy accuracy is that forms, despite being clustered into paradigms well, get assigned to the wrong cell, or the same gold cell gets misaligned across paradigms from different inflection classes. We discuss this phenomenon in more detail below.

Latin Noun Error Analysis
A detailed analysis of Latin nouns (also analyzed by Stump and Finkel (2015) and Beniamine et al. (2018)) reveals challenges for our system. Table 5 shows the inflectional paradigms for three Latin nouns exemplifying different inflection classes, which are mentioned throughout the analysis. In keeping with the UD standard, there are no diacritics for long vowels in the table.
One major challenge for our system is that similar affixes can mark different cells in different inflection classes, e.g. the ACC.SG of servus "slave.M" ends in um, as does the GEN.PL of frater "brother". Table 6 shows system-posited cells, the gold cells they best match to, and the longest suffix shared by 90% of their members. The system is often misled by shared affixes, e.g., cell 0 is evenly split between ACC.SG and GEN.PL, driven by the suffix um (cells 3 (is) and 4 (a) suffer from this as well). This kind of confusion could be resolved with better context modeling, as each distinct underlying cell, despite sharing a surface affix, occurs in distinct distributional contexts. We observe that the current system often fails to make use of context to handle some misleading suffixes. However, Cell 7 correctly groups ABL.PL forms marked with both is and ibus, excluding other suffixes ending in s. Similarly, cell 8 contains NOM.SG forms with heterogeneous endings, e.g., r, ix and ns.
In some cases, the system misinterprets derivational processes as inflectional, combining gold paradigms. Derivational relatives servus and serva, male and female variants of "slave", are grouped into one paradigm, as are philosophos "philosopher" and philosophia "philosophy." In other cases, cell clustering errors due to shared suffixes create spurious paradigms. After falsely clustering gold paradigm mates servum (ACC.SG) and servorum (GEN.PL) into the same cell, we must assign each to separate paradigms during paradigm clustering. This suggests clustering cells and paradigms jointly might avoid error propagation in future work.
We also find that clustering errors lead to PCFP errors. For servus/a, the neural reinflector predicts servibus in cell 8 with a suffix from the wrong inflection class, yet the slot should not be empty in the first place. The correct form, servis, is attested, but was mistakenly clustered into cell 3.

Benchmark Analysis
Table 7 evaluates variants of the benchmark to determine the contribution of several system-task components in Arabic and Latin. We consider augmenting and shrinking the corpus. We also reset the fastText hyperparameters used to achieve a morphosyntactic inductive bias to their default values (no affix or window bias) and consider two constant exponent penalty weights (ω(x_f, c) = 1 and ω(x_f, c) = 0) instead of our heuristic weight defined in eq. (3). Finally, we consider selecting random sources for PCFP reinflection instead of identifying reliable sources. For all variants, the number of cells is fixed to the ground truth.
Corpus Size We consider either using a smaller corpus containing only the UD subset, or using a larger corpus containing 15 (Latin) or 100 (Arabic) million words from additional supplementary sentences. As expected, performance decreases for smaller corpora, but it does not always increase for larger ones, potentially due to domain differences between UD and the supplemental sentences. Interestingly, F_cell always increases with larger corpora, yet this can lead to worse F_par scores, more evidence of error propagation that might be avoided with joint cell-paradigm clustering.
Embedding Morphosyntactic Biases Targeting affix embeddings by shrinking the default fastText character n-gram sizes seems to yield a much more significant effect than shrinking the context window. In Latin, small context windows can even hurt performance slightly, likely due to extremely flexible word order, where agreement is often realized over non-adjacent words.
Exponent Penalties When clustering paradigms with the penalty weight ω(x, c) = 1 (which is equivalent to just running the first pass of paradigm clustering), we see a steep decline in performance as opposed to the proposed heuristic weighting. It is even more detrimental to not penalize exponents at all (i.e., ω(x, c) = 0), but maximize the base characters in paradigms without concern for size or likelihoods of exponents. Given allomorphic variation and multiple inflection classes, we ideally want a penalty weight which is lenient to more than just the single most likely exponent, but without supervised data, it is difficult to determine when to stop being lenient and start being harsh in a language-agnostic manner. Our choice to be harsh by default proposes fewer false paradigm mates, yielding less noisy input to train the reinflection model. In a post-hoc study, we calculated GOLD k PCFP scores on pure analogies only, where the first three attested forms were assigned correctly during clustering. Pure analogy PCFP scores were still closer to GOLD k's performance than SUP's for all languages. This suggests most of the gap between GOLD k and SUP is due to noisy training on bad clustering assignments, not impossible test instances created by bad clustering assignments. This supports our choice of harsh penalties and suggests future work might reconsider clustering decisions given the reinflection model's confidence.
Reinflection Source Selection During reinflection, feeding the Transformer random sources instead of learning the most reliable source cell for each target cell slightly hurts performance. The margin is small, though, as most paradigms have only one attested form. In preliminary experiments, we also tried jointly encoding all available sources instead of just the most reliable, but this drastically lowers performance.

Conclusion
We present a framework for the paradigm discovery problem, in which words attested in an unannotated corpus are analyzed according to the morphosyntactic property set they realize and the paradigm to which they belong. Additionally, unseen inflectional variants of seen forms are to be predicted. We discuss the data required to undertake this task, a benchmark for solving it, and multiple evaluation metrics. We believe our benchmark system represents a reasonable approach to solving the problem based on past work and highlights many directions for improvement, e.g. joint modeling and making better use of distributional semantic information.