An Encoder-Decoder Approach to the Paradigm Cell Filling Problem

The Paradigm Cell Filling Problem in morphology asks to complete word inflection tables from partial ones. We implement novel neural models for this task, evaluating them on 18 data sets in 8 languages, showing performance that is comparable with previous work with far less training data. We also publish a new dataset for this task and code implementing the system described in this paper.


Introduction
An important learning question in morphology, both for NLP and for models of language acquisition, is the so-called Paradigm Cell Filling Problem (PCFP). So dubbed by Ackerman et al. (2009), this problem asks how speakers of a language can reliably produce inflectional forms of most lexemes without ever having witnessed those forms. For example, a Finnish noun or adjective can be inflected in 2,263 ways if one includes case forms, number, and clitics (Karlsson, 2008). However, it is unlikely that a Finnish speaker would have heard all forms for even a single, highly frequent lexical item. It is also unlikely that all 2,263 forms are found in the aggregate of all the witnessed inflected forms over different lexemes. Speakers must therefore be able to assess the felicity of, and possibly produce, inflectional combinations they have never witnessed for any noun or adjective. Figure 1 illustrates the PCFP.
This paper investigates the PCFP in three different settings: (1) when we know n > 1 randomly selected forms in each of a number of inflection tables, (2) when we know a set of frequent word forms in each table (this most closely resembles an L1 language learning setting), and finally (3) when we know exactly n = 1 word form from each table.
[1] https://github.com/mpsilfve/pcfp-data
We treat settings (1) and (2) as traditional morphological reinflection tasks (Cotterell et al., 2016), as explained in Section 2. In contrast, setting (3) is substantially more challenging because it cannot be handled using a traditional reinflection approach. To overcome this problem, we utilize an adaptive dropout mechanism, which is discussed in Section 2. This allows us to train the reinflection system in a manner reminiscent of denoising autoencoders (Vincent et al., 2008).

Related Work
Neural models have recently been shown to be highly competitive in many different supervised learning tasks for morphological inflection (Faruqui et al., 2016; Kann and Schütze, 2016; Makarov et al., 2017; Aharoni and Goldberg, 2017) and derivation (Cotterell et al., 2017b). Most current architectures are based on encoder-decoder models (Sutskever et al., 2014) and usually contain an attention component (Bahdanau et al., 2015).
The SIGMORPHON (Cotterell et al., 2016) and CoNLL-SIGMORPHON (Cotterell et al., 2017a, 2018) shared tasks in recent years have explored morphological inflection but not explicitly the PCFP. In the 2017 task, participants were given full paradigms (i.e. a listing of all forms) of lexemes during training, after which they were given incomplete paradigms which had to be completed at test time. This is a slightly unrealistic setting in an L1-style learning scenario, where arguably very few full paradigms are ever witnessed and where generalization has to proceed on a number of very gappy paradigms. Of course, such gaps form a distribution where frequently used lexemes have fewer gaps than infrequent ones, which we attempt to model in this work.
Earlier work evaluates an extension to a linguistically informed symbolic paradigm model based on stem extraction from the longest common subsequence (LCS) shared among related forms (Ahlberg et al., 2014, 2015). While the original LCS paradigm extraction method was intended to learn from complete inflection tables (Hulden, 2014), that work presents modifications to allow learning from incomplete paradigms as well, and applies the method to the PCFP. Comparing against its results shows that our neural model consistently outperforms such a subsequence-based learning model.
Kann et al. (2017) report results on so-called multi-source reinflection, in which several input forms are used to generate one output form. This task is related to the PCFP; however, Kann et al. (2017) use full inflection tables for training. Moreover, their approach is applicable to the PCFP only when 3 or more forms are given in the input tables. Since this mostly excludes our experimental settings, we do not compare to their system.
Malouf (2016, 2017) documents an experiment with a generator LSTM in completing inflection tables in up to seven languages with either 10% or 40% of table entries missing. Our work differs from this in that Malouf's system takes as input, during both training and testing, a two-hot encoding of the lexeme and the desired slot. This means that the system cannot complete paradigms of lexemes for which it has not seen examples in the training data. By contrast, our system has no notion of lexeme: we simply work from symbol strings, namely the inflected forms of a lexeme given in the test data, which may in principle be completely disjoint from the training data lexemes. We use the Malouf system as a baseline to compare against.

Encoder-Decoder Models for PCFP
We explore two different models for paradigm filling. The first model is applicable when n > 1 forms are given in each inflection table. When exactly one (n = 1) form is given, we use another model.
Case n > 1. When more than one form is given in training tables, the PCFP can be treated as a morphological reinflection task (Cotterell et al., 2016), where the aim is to translate inflected word forms and their tags into target word forms. For example, a model would translate tried+PAST into the present participle (PRES,PCPLE) form trying. We adopt a common approach employed by Kann and Schütze (2016) and many others: we build a model which translates an input word form, its tag and a target tag, for example tried+PAST+PRES,PCPLE, into the target word form trying.
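As a minimal sketch of this input format, the snippet below assembles the symbol sequence that would be fed to the encoder. The function name encode_input, the "+" separator symbol and the choice to split a tag such as PRES,PCPLE into separate sub-tag symbols are assumptions made for illustration; the text above only specifies that the input combines the word form, its tag and the target tag.

```python
from typing import List


def encode_input(form: str, source_tag: str, target_tag: str) -> List[str]:
    """Build the encoder input: characters, source sub-tags, target sub-tags.

    Assumption: tags are comma-separated sub-tags (e.g. "PRES,PCPLE") and a
    "+" symbol separates the three parts, mirroring tried+PAST+PRES,PCPLE.
    """
    symbols = list(form)                    # one symbol per character
    symbols.append("+")
    symbols.extend(source_tag.split(","))   # sub-tags of the given form
    symbols.append("+")
    symbols.extend(target_tag.split(","))   # sub-tags of the requested form
    return symbols


example = encode_input("tried", "PAST", "PRES,PCPLE")
print(example)
```

The decoder would then be trained to emit the characters of trying for this input.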
Our model closely follows the formulation of the encoder-decoder LSTM model for morphological reinflection proposed by Kann and Schütze (2016). We use a 1-layer bidirectional LSTM encoder for encoding the input word form into a sequence of state vectors and a 1-layer LSTM decoder with an attention mechanism over encoder states for generating the output word form.
We form training pairs by using the given forms in each table: we take the cross-product of the given forms and learn to reinflect each given form in a table to another given form in the same table, as demonstrated in Figure 2. At test time, we predict forms for missing slots based on each of the given forms in the table and take a majority vote of the results.
Figure 3: … (singular inessive of sileys 'smoothness'), a German past participle (of rammen 'to ram') and a French verb (infinitive form of resigner 'to resign'). The figure demonstrates that confidence is in general higher in the inflectional affixes than in the stem. It is also high at the stem-affix boundary.
Case n = 1. When only one form is given in each inflection table, we cannot train the model as a traditional reinflection model. The best we can do is to train a model to reinflect forms into the same form, walked+PAST+PAST → walked, and then try to apply this model for reinflection to fill in missing forms, walked+PAST+PRES,PCPLE → walking. According to preliminary experiments, however, this leads to massive over-fitting, and the model simply learns to copy input forms.
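The pair construction and test-time voting for case n > 1 can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the names training_pairs, fill_slot and toy_reinflect are hypothetical, pairs are taken between distinct slots only (which matches the count of 12 examples for four identical known forms given in the discussion), and ties in the vote are broken by first occurrence, which the text does not specify.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Table = Dict[str, str]  # morphological tag -> known word form


def training_pairs(known: Table) -> List[Tuple[str, str, str, str]]:
    """Ordered pairs of distinct known slots: (form, tag, target_tag, target_form)."""
    return [
        (known[src], src, tgt, known[tgt])
        for src, tgt in permutations(known, 2)
    ]


def fill_slot(known: Table, target_tag: str,
              reinflect: Callable[[str, str, str], str]) -> str:
    """Predict the target slot from every known form and take a majority vote."""
    votes = Counter(reinflect(form, tag, target_tag) for tag, form in known.items())
    return votes.most_common(1)[0][0]  # ties: first-seen prediction wins (assumption)


known = {"PAST": "walked", "PRES,PCPLE": "walking", "3SG,PRES": "walks"}
pairs = training_pairs(known)  # 3 known slots give 3 * 2 = 6 ordered pairs


def toy_reinflect(form: str, tag: str, target_tag: str) -> str:
    # Hypothetical stand-in for the trained encoder-decoder.
    guesses = {"walked": "walk", "walking": "walk", "walks": "walks"}
    return guesses[form]


print(len(pairs))
print(fill_slot(known, "NFIN", toy_reinflect))
```

With a single known form, training_pairs returns an empty list, which is why case n = 1 needs a different training signal.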
The idea for our approach in case n = 1 is to first learn to segment word forms into a stem and an affix, for example walk+ed. We then hide the affix in the input form and learn to inflect. In other words, we map the word form walked into walk$$ and then learn a mapping walk$$+PAST → walked. This model suffers less from overfitting and we can use it to find missing forms in partial inflection tables.
Since we do not have access to segmented training data, we cannot directly train a segmentation model. Instead, we use the forms in the training data to train an LSTM language model conditioned on morphological tags. We then use the language model for identifying which characters belong to stems and which characters belong to affixes.
Table 1: … (2017) using our own implementation of the model. In contrast to Malouf (2017), who used cross-validation, we train one system for each language. Therefore, we only report standard deviation for the results in Column 2.
As shown in Figure 3, the language model in general gives higher confidence for predictions of characters in the affix than in the word stem. Nevertheless, it only gives a probabilistic segmentation into a stem and affix(es). Therefore, we do not perform a deterministic segmentation. Instead, we use the language model to guide a character dropout mechanism in our word inflection model. When the language model is very confident, as in the case of affix characters, we frequently drop characters. In contrast, when the language model is less confident, as in the case of stem characters, we typically keep the character. Apart from this adaptive dropout applied during training, our inflection system in case n = 1 is exactly the same as in case n > 1.
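A minimal sketch of the adaptive dropout step is given below. It assumes that per-character confidences are already available; here they are hard-coded, hypothetical values, whereas in the system described above they would come from the tag-conditioned LSTM language model. The function name adaptive_dropout is likewise an assumption.

```python
import random
from typing import Sequence


def adaptive_dropout(form: str, confidence: Sequence[float],
                     rng: random.Random, drop_symbol: str = "$") -> str:
    """Replace each character by drop_symbol with probability equal to the
    language-model confidence for that character."""
    if len(form) != len(confidence):
        raise ValueError("need one confidence value per character")
    return "".join(
        drop_symbol if rng.random() < p else ch
        for ch, p in zip(form, confidence)
    )


rng = random.Random(0)

# Limiting case: confidence 0 on the stem and 1 on the affix reproduces the
# deterministic segmentation walk + ed, i.e. walk$$.
print(adaptive_dropout("walked", [0, 0, 0, 0, 1, 1], rng))

# Softer, made-up confidences: stem characters are usually kept and affix
# characters are usually dropped, with a different sample on each call.
print(adaptive_dropout("walked", [0.1, 0.1, 0.1, 0.2, 0.9, 0.9], rng))
```

Because the masking is resampled every time a training example is visited, the model sees many partially hidden variants of the same form, which is what makes the setup resemble a denoising autoencoder.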
More precisely, given an input word form, which is a sequence of characters x = x_1, ..., x_T, the LSTM language model emits a probability p(x_{t+1}, h_t, E_{x_t}, E_y) for the next character x_{t+1} based on the entire previous input sequence x_1, ..., x_t. Here h_t is the hidden state vector of the language model at position t, E is a joint tag and character embedding, and y is the morphological tag of the input word form. The embedding vector E_y is in fact a sum of sub-tag embeddings. For example, E_{PAST+PCPLE} denotes E_PAST + E_PCPLE. This allows us to handle combinations of sub-tags which we have not seen in the training data. Guided by the language model, we replace input characters x_{t+1} during training of the reinflection system with a dropout character $ with probability equal to the language model confidence p(x_{t+1}, h_t, E_{x_t}, E_y).
Baseline Model. As a baseline model, we use the neural system presented by Malouf (2016, 2017) for solving the PCFP. It is an LSTM generator which is conditioned on the table number of the partial inflection table and the morphological tag index. The model is trained to generate the training word forms in inflection tables. During testing, it can then generate missing forms by conditioning on the morphological tags of the missing forms. In order to ensure a fair comparison, we perform the paradigm completion experiment described in Malouf (2017), where 90% of the word forms in the data set are used for training and the remaining 10% for testing [5]. As the results in Table 1 show, our results very closely replicate those reported by Malouf (2017).

Implementation details
We use 1-layer bidirectional LSTM encoders, decoders and generators with embeddings and hidden states of size 100. We train the language model for case n > 1 for 20 epochs and all other models for 60 epochs without batching. We train 10 models for every language and part-of-speech and apply majority voting to get the final output forms. All models were implemented using DyNet (Neubig et al., 2017).

Data
We use UniMorph morphological paradigm data in our experiments (Kirov et al., 2018). UniMorph data sets are crowd-sourced collections of morphological inflection tables based on Wiktionary. We conduct experiments on noun and verb paradigms from eight languages: Finnish (fin), French (fre), Georgian (geo), German (ger), Latin (lat), Latvian (lav), Spanish (spa) and Turkish (tur). Not all languages have 1,000 noun and verb tables. Hence, our selection is not complete, as seen in Table 3.
We conduct experiments on two different sets of tables: (1) we randomly sample 1,000 tables for each language and part-of-speech, and (2) we select UniMorph tables that include some of the 10,000 most common word forms according to Wikipedia frequency. The Wikipedia word frequencies are based on plain Wikipedia text dumps from the Polyglot project (Al-Rfou et al., 2013). Georgian and Latin did not have a Polyglot Wikipedia, so we excluded them. Moreover, we excluded Latvian verbs because there was very little overlap between the most frequent Wikipedia word forms and UniMorph table entries (< 200 forms occurred in both). Details for both types of data sets are given in Tables 2 and 3.
[5] … however, we did not have access to the exact splits into training and test data used by Malouf (2017). This may influence results.
Table 4: Overall results for filling in missing forms when the 10,000 most frequent forms are given in the inflection tables. We give the 0.99 confidence intervals as given by a one-sided t-test. Figures where one system significantly outperforms the other one are in boldface.
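The frequency-based selection of given forms can be sketched as below. The data are toy, hypothetical examples, and the function names split_by_frequency and select_tables are assumptions; details such as case normalization are not specified in the text and are ignored here. Membership is tested on the word form itself, so a form that fills several slots is marked as known in all of them.

```python
from typing import Dict, Set, Tuple

Table = Dict[str, str]  # morphological tag -> word form


def split_by_frequency(table: Table, frequent: Set[str]) -> Tuple[Table, Table]:
    """Split a table into known slots (form is in the frequency list) and missing slots."""
    known = {tag: form for tag, form in table.items() if form in frequent}
    missing = {tag: form for tag, form in table.items() if form not in frequent}
    return known, missing


def select_tables(tables: Dict[str, Table],
                  frequent: Set[str]) -> Dict[str, Tuple[Table, Table]]:
    """Keep only tables that contain at least one frequent word form."""
    selected = {}
    for lexeme, table in tables.items():
        known, missing = split_by_frequency(table, frequent)
        if known:
            selected[lexeme] = (known, missing)
    return selected


# Toy data: a stand-in for the 10,000 most frequent Wikipedia word forms.
frequent = {"walked", "the", "of"}
tables = {
    "walk": {"NFIN": "walk", "3SG,PRES": "walks", "PAST": "walked",
             "PAST,PCPLE": "walked", "PRES,PCPLE": "walking"},
    "yodel": {"NFIN": "yodel", "PAST": "yodeled"},
}
selected = select_tables(tables, frequent)
known, missing = selected["walk"]
print(sorted(known))    # both slots filled by "walked" are treated as given
print(sorted(missing))  # slots the model has to reconstruct
```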

Experiments and Results
We perform two experiments. In the first one, we take the set of 1,000 randomly sampled inflection tables for each language and part-of-speech and then randomly select n = 1, 2 or 3 training forms from each table. We then train a reinflection system on these forms and use the resulting system to predict the missing forms. We report accuracy on correctly predicted missing forms and on reconstructing the entire paradigm correctly.
In our second experiment, we consider UniMorph tables which contain entries from the list of the 10,000 most common word tokens compiled using a Wikipedia dump of the language, as explained above. We take the forms in the top-10,000 list as given and train a model which is used to reconstruct the remaining forms in each table. On tables with more than one given form, we train the same model as in case n > 1. As in the first task, we evaluate with regard to accuracy for reconstructed forms and full tables.
Results are presented in Tables 4 and 5 and in an accompanying figure. Table 4 shows results for completing tables for common lexemes.
Figure: … The blue bars (on the left) denote accuracy for our system and green bars (on the right) accuracy for the baseline system. The graphs show accuracy separately for tables where 1, 2, 3, 4, and > 4 forms are given.

Discussion and Conclusions
Our system significantly outperforms the baseline on all datasets apart from German nouns. We believe that the reason for the German outlier is the high degree of syncretism in German noun tables. To see why syncretism is harmful, consider the German noun Gräben. Its paradigm consists of eight forms, but four of those are identical: Gräben. Only this form is observed among the top 10,000 forms in the German Wikipedia. Following Section 2, this gives rise to 12 training examples where both the input and output form are Gräben. This strongly biases the system toward copying input forms into the output. However, this will never give the correct output because, by design, missing forms cannot be Gräben (if the same word form occurs in multiple slots, all of them are considered known). This can be seen as a problem with our datasets rather than the model itself. Consequently, an important direction for future work in addressing the PCFP from an acquisition perspective is to create realistic and accurate data sets that model learner exposure both in word types and frequencies, to enable assessment of the true difficulty of the PCFP.
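The count of 12 copy examples in the Gräben case can be checked with a short sketch. It assumes that training pairs are ordered pairs of distinct known slots, which is the reading that reproduces the figure in the text; the slot labels are illustrative and the function name count_copy_pairs is hypothetical.

```python
from itertools import permutations
from typing import Dict


def count_copy_pairs(known: Dict[str, str]) -> int:
    """Count ordered pairs of distinct known slots whose word forms are identical."""
    return sum(1 for a, b in permutations(known, 2) if known[a] == known[b])


# Four known slots share one surface form (slot labels are illustrative).
known = {
    "NOM,PL": "Gräben",
    "GEN,PL": "Gräben",
    "DAT,PL": "Gräben",
    "ACC,PL": "Gräben",
}
print(count_copy_pairs(known))  # 4 * 3 = 12
```

Every training example drawn from such a table is a pure copy operation, so the table contributes no evidence about how the remaining, distinct forms are built.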
There is a notable transition from witnessing one form in each inflection table to witnessing two forms. With only two forms given, we already approach accuracies reported in earlier work (Malouf, 2016, 2017) that used almost complete tables for training, with only 10% of the forms missing. Additionally, our encoder-decoder model strongly outperforms the generator model designed for the same task, with the same amount of training data, on nearly all of our datasets.