Analogy Models for Neural Word Inflection

Analogy is assumed to be the cognitive mechanism speakers resort to in order to inflect an unknown form of a lexeme based on knowledge of other words in a language. In this process, an analogy is formed between word forms within an inflectional paradigm but also across paradigms. As neural network models for inflection are typically trained only on lemma-target form pairs, we propose three new ways to provide neural models with additional source forms to strengthen analogy formation, and compare our methods to other approaches in the literature. We show that the proposed methods of providing a Transformer sequence-to-sequence model with additional analogy sources in the input are consistently effective, and improve upon recent state-of-the-art results on 46 languages, particularly in low-resource settings. We also propose a method to combine the analogy-motivated approach with data hallucination or augmentation. We find that the two approaches are complementary, and that combining them is especially helpful when the training data is extremely limited.


Introduction
Morphological tasks such as the task of morphological inflection generation have attracted great research interest in recent years. SIGMORPHON has organized annual shared tasks on morphological inflection in the past five years (Cotterell et al., 2016; McCarthy et al., 2019; Vylomova et al., 2020). In the typical SIGMORPHON shared task of morphological inflection, a lemma (citation form) and a morphosyntactic description (MSD) consisting of a set of features are provided, and the task is to generate an inflected form for the lemma corresponding to the MSD.
Neural network models have been very successful in handling natural language processing (NLP) problems, and have achieved state-of-the-art results in almost every area of NLP, including character-level sequence-to-sequence transduction tasks like morphological inflection, especially when abundant labeled data is available (Goldberg, 2016). However, neural network models are usually very data-hungry, and their performance can suffer when labeled data is limited. Unfortunately, the large amounts of labeled data needed are not always available and can be difficult to obtain for many languages.
As interest has grown in low-resource NLP, several effective strategies to improve the performance of neural models have surfaced, and neural network models have become the dominant approach in low-resource settings as well. Such efforts for the morphological inflection task include engineering the neural network architecture to take better advantage of linguistic knowledge (Aharoni and Goldberg, 2017; Wu et al., 2018; Canby et al., 2020), designing data hallucination techniques to generate synthetic data based on existing labeled data (Silfverberg et al., 2017; Bergmanis et al., 2017; Anastasopoulos and Neubig, 2019; Yu et al., 2020), augmenting the training data by making better use of labeled or unlabeled data (Kann et al., 2017a; Liu and Hulden, 2020), and cross-lingual transfer learning, i.e. using labeled data in related languages to train models for the target language (McCarthy et al., 2019).
Model architecture engineering and data hallucination and augmentation techniques have seen consistent performance gains in the current literature, but the effect of cross-lingual transfer for morphological inflection is less consistent. Some work has shown advances from cross-lingual learning (Kann et al., 2017b; Anastasopoulos and Neubig, 2019; Murikinati and Anastasopoulos, 2020; Scherbakov, 2020; Peters and Martins, 2020), while other work has not found obvious improvements (Bergmanis et al., 2017; Rama and Çöltekin, 2018; Çöltekin, 2019; Hauer et al., 2019; Madsack and Weißgraeber, 2019). Wu et al. (2020) show the success of the Transformer architecture (Vaswani et al., 2017) for character-level transduction tasks, as is also supported by the results of the SIGMORPHON 2020 shared task 0 on morphological inflection (Vylomova et al., 2020). One approach the SIGMORPHON 2020 shared task 0 participating teams adopted to tackle the low-resource languages is data augmentation. The winning system (Liu and Hulden, 2020) reorganized the shared task data into partial paradigms and augmented the training data by inflecting from multiple known source forms in a paradigm, as opposed to the prevailing practice of using only the lemma form. This turned out to be very effective, and the system achieved the best performance in average accuracy and Levenshtein edit distance. Other participating systems (Yu et al., 2020; Singer and Kann, 2020; Murikinati and Anastasopoulos, 2020; Scherbakov, 2020) and the baseline system show the positive effect of data hallucination, which has also been evidenced by previous studies (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019). This motivates us to explore the following three questions:

1. Can one improve upon the choice of source forms to use in generating an inflected form?
2. Is data hallucination complementary to augmenting training data by using multiple source forms, or are the two strategies orthogonal, particularly in low-resource scenarios?
3. Is ensembling or model selection over multiple models necessary for best results?
For the first question, we follow the practice of organizing individual inflectional examples into incomplete paradigms and propose different ways, motivated by the analogy mechanism, to make use of known forms both within the same paradigm and across paradigms, which achieves even better results. For the second question, we conduct an experiment combining the previous strategy with a data hallucination approach, and find that the two approaches are complementary in general, and that using both is especially helpful when the training data is extremely limited (< 1,000 training examples). A comparison of results across models with respect to training data size, paradigm completion rate, and language group did not find a dominant advantage for any single model, indicating that model ensembling or model selection is worthwhile.


Model descriptions

Motivation
Analogy is assumed to be at the core of human cognition, and it is assumed to be the mechanism by which we can inflect an unknown word given the other word forms we know (Blevins and Blevins, 2009). For example, Table 1 presents paradigm examples from Tagalog. If we know that the imperfective aspect with agent focus (V;IPFV;AGFOC) form for the Tagalog verb guhit ("draw") is nagguguhit ("is drawing"), we can predict the V;IPFV;AGFOC form of another Tagalog verb, walis ("sweep"), to be nagwawalis ("is sweeping"). In this process, the analogy happens between the inflected form and the lemma, i.e. between guhit and nagguguhit, and between walis and nagwawalis, where commonality is found in the stem shared by each pair. This part of the analogy has attracted much explicit attention and discussion in the literature, and it is the mechanism the typical morphological inflection task relies on. Though the lemma form is usually prioritized in morphological tasks, it is not always the most useful source form for inflecting other forms in the same paradigm. The notion of principal parts states that there is a subset of forms in each paradigm which provides enough information to correctly inflect the other slots in the same paradigm; this subset of forms is called the paradigm's principal parts (Finkel and Stump, 2007). For example, for Tagalog verbs, the different agent focus (i.e. AGFOC) forms are very informative for each other's inflection, and the perfective (i.e. PFV) and imperfective (i.e. IPFV) forms of patient focus (i.e. PFOC) are good sources for inflecting each other, but AGFOC and PFOC forms are not very reliable predictors for each other. In the Tagalog example, forms which are more closely related are reliable sources for each other, but this is not the case in every language: in some languages, the reliable source form for a target form may not be closely related.
The linguistic notion of Priscianic formation generalizes the situation where a slot in an inflectional paradigm is reliably formed from another slot of the same paradigm which is not necessarily closely related (Haspelmath and Sims, 2013). Both the principal parts and the Priscianic formation notions go against the idea of prioritizing the lemma as the only source form and encourage the use of other slots in the paradigm as source forms from which to predict the target slot. Previous work (Kann et al., 2017a; Liu and Hulden, 2020) has attempted to incorporate the notion of principal parts into neural network models for morphological inflection.
However, when we inflect walis V;NFIN → nagwawalis V;IPFV;AGFOC by analogy to guhit V;NFIN → nagguguhit V;IPFV;AGFOC, analogy also happens between paradigms: we also compare guhit with walis, and nagguguhit with nagwawalis, where commonality is found in the affix shared by each pair. We resort to both intraparadigmatic analogy and interparadigmatic analogy in order to inflect unknown words from our knowledge of other words. Previous work has tried to capture both parts of analogical reasoning (Hulden, 2014; Ahlberg et al., 2014; Ahlberg et al., 2015; Forsberg and Hulden, 2016), though neural network models for morphological inflection have relied on the neural model itself to capture the interparadigmatic analogy implicitly and have not explicitly incorporated the cross-paradigm information.
Neural models for morphological inflection are traditionally trained to inflect from the lemma form only. In our Tagalog example, then, every form of the verb walis would be predicted from the NFIN form walis. Since models trained in this fashion perform quite well, they must have implicitly learned to form the analogies described above, even though only one source form is used. The root of our investigation, therefore, is the question: is it advantageous to explicitly provide source forms other than the lemma form when the model is trained?

Model architectures
Liu and Hulden (2020) convert the morphological inflection task into a partial paradigm completion problem, and use each form or each pair of forms together with the corresponding MSDs as input to the morphological inflection model of the Transformer architecture, which generates the inflected form for the target MSD (Figure 1 (a) and (b)). As this approach turned out to be very effective in modeling low-resource languages, it motivates us to explore additional ways to make use of the given data inspired by the analogy mechanism.
1-source and 2-source models Since Liu and Hulden (2020) presented their results as an ensemble and did not analyze the performance of using one source slot versus two source slots individually, we first reproduce their work and conduct an analysis of the two models they proposed: the 1-source model (see Figure 1(a)), where each given slot in the paradigm is used as the input to predict the missing slot (i.e. source form + source MSD + target MSD → target form), and the 2-source model (see Figure 1(b)), where each pair of given slots is used as the input to the inflection model for predicting the target form (i.e. source form1 + source MSD1 + source form2 + source MSD2 + target MSD → target form). This increases the amount of training data compared to the typical morphological inflection task data format of lemma + target MSD → target form, since every given slot or every pair of given slots is used.

Figure 1: For model (a), every given slot is used for the target slot prediction, so we would get 6 1-source training input-output pairs out of the example partial paradigm, though only one such input-output pair is illustrated. For model (b), every pair of given slots is used for the target slot prediction, so we would get 15 2-source training input-output pairs out of the example partial paradigm, though only one such pair is illustrated. For models (d) and (e), the crosstable forms are the inflected forms of the current target MSD from randomly picked partial paradigms where this form has been given.
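The two reformatting schemes can be sketched as follows. This is a minimal illustration rather than the authors' code: the flat character-plus-feature input representation and the function names are our assumptions.

```python
from itertools import combinations

def one_source_examples(paradigm, target_msd, target_form):
    """1-source: each given slot alone predicts the target slot,
    i.e. source form + source MSD + target MSD -> target form."""
    examples = []
    for msd, form in paradigm.items():
        src = list(form) + msd.split(";") + target_msd.split(";")
        examples.append((src, list(target_form)))
    return examples

def two_source_examples(paradigm, target_msd, target_form):
    """2-source: each pair of given slots predicts the target slot,
    i.e. form1 + MSD1 + form2 + MSD2 + target MSD -> target form."""
    examples = []
    for (msd1, f1), (msd2, f2) in combinations(paradigm.items(), 2):
        src = (list(f1) + msd1.split(";")
               + list(f2) + msd2.split(";")
               + target_msd.split(";"))
        examples.append((src, list(target_form)))
    return examples
```

A partial paradigm with six known slots thus yields 6 one-source and 15 two-source training pairs per target slot, matching the counts in the Figure 1 caption.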
Leave-1-out model The 1-source and 2-source models make use of the principal parts idea in an indirect way, by using average score or majority vote to pick out a final prediction from the multiple predictions for the target form. Is it possible for the neural network model to learn to pick out the subset of slots which are the principal parts? In order to explore this question, we propose the leave-1-out model (see Figure 1(c)) where the concatenation of all the known forms followed by their corresponding MSDs is input to the morphological inflection model to predict the target form, and the morphological inflection model is expected to learn to pick out the subset of slots which are the principal parts.
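As a sketch (under an assumed flat character-plus-feature token representation; the function name is ours, not the authors'), the leave-1-out input simply concatenates every known slot before the target MSD:

```python
def leave_one_out_example(paradigm, target_msd, target_form):
    """Concatenate all known forms, each followed by its MSD, then the
    target MSD: form1 + MSD1 + ... + formN + MSDN + target MSD -> target form."""
    src = []
    for msd, form in paradigm.items():
        src += list(form) + msd.split(";")
    src += target_msd.split(";")
    return src, list(target_form)
```

Unlike the 1-source and 2-source formats, this produces a single (longer) training example per target slot, leaving the selection of informative slots entirely to the model.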
1-source+1-crosstable and 1-source+2-crosstable models Considering the analogy between paradigms, we propose the 1-source+1-crosstable model and the 1-source+2-crosstable model. In the 1-source+1-crosstable model (see Figure 1(d)), we propose to use each given slot with its corresponding MSD, concatenated with the inflected form of another randomly picked lemma for the target MSD, as input to predict the target form for the target lemma, i.e. source form + source MSD + inflected form from another random table for the target MSD + target MSD + target MSD → target form.
In the 1-source+2-crosstable model (see Figure 1(e)), we propose to use each given slot with its corresponding MSD, concatenated with the inflected forms of two other randomly picked lemmas for the target MSD, as input to predict the target form for the target lemma, i.e. source form + source MSD + inflected form from another random table for the target MSD + target MSD + inflected form from a second random table for the target MSD + target MSD + target MSD → target form. The linguistic intuition for using the target slots of two other lemmas is that this provides additional analogy sources which may be helpful for the neural model to learn from.

1-source+hallucination model The approach of using each individual form or each pair of forms in a paradigm to predict a target form is essentially a method of augmenting the training data, but it differs from the data augmentation method of "hallucination," where synthetic "plausible" data are generated based on known labeled data and added to the training data for the morphological inflection model. Both augmentation by reformatting training data and data hallucination have produced improvements in neural model performance for morphological inflection in low-resource settings, but to our knowledge no work has analyzed whether the two data augmentation approaches are complementary to each other. Therefore, we propose the 1-source+hallucination approach: we use the 1-source input format proposed by Liu and Hulden (2020) to create more training examples from the given data, generate synthetic data based on the newly formatted training examples with the data hallucination method proposed by Anastasopoulos and Neubig (2019), and combine the newly formatted training data with the hallucinated data to train the morphological inflection model.
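The crosstable inputs, together with a greatly simplified stand-in for the hallucination step, can be sketched as follows. All function names are our assumptions, and the stem detection here (longest common substring) is only a rough proxy for the alignment-based method of Anastasopoulos and Neubig (2019):

```python
import random

def crosstable_examples(paradigm, target_msd, target_form, other_tables, k=1, seed=0):
    """1-source+k-crosstable: each given slot, plus the target-MSD forms of
    k randomly picked other paradigms in which that slot is filled. Each
    crosstable form is followed by its MSD (the target MSD), and the final
    target MSD closes the input."""
    rng = random.Random(seed)
    donors = [t[target_msd] for t in other_tables if target_msd in t]
    examples = []
    for msd, form in paradigm.items():
        src = list(form) + msd.split(";")
        for cross in rng.sample(donors, k):
            src += list(cross) + target_msd.split(";")
        src += target_msd.split(";")
        examples.append((src, list(target_form)))
    return examples

def _longest_common_substring(a, b):
    best = ""
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            if a[i:j] in b and j - i > len(best):
                best = a[i:j]
    return best

def hallucinate(lemma, form, alphabet, rng):
    """Rough hallucination sketch: treat the longest common substring of
    lemma and inflected form as the stem and swap it for a random string,
    keeping the affix material intact."""
    stem = _longest_common_substring(lemma, form)
    fake = "".join(rng.choice(alphabet) for _ in stem)
    return lemma.replace(stem, fake, 1), form.replace(stem, fake, 1)
```

For the Tagalog pair guhit → nagguguhit, for instance, the shared substring guhit would be replaced by the same random string in both lemma and inflected form, preserving the nag- prefix material (though note this crude sketch would miss reduplicated stem material such as -gu-).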
Transformer As the Transformer architecture has been shown to be very successful in handling character-level string transduction tasks such as morphological inflection (Wu et al., 2020;Vylomova et al., 2020), we adopt the Transformer architecture for all the inflection models in our experiments.

Experiments
We evaluate the performance of all the models on the low-resource languages of the SIGMORPHON 2020 shared task 0 on morphological inflection (Vylomova et al., 2020). For our experiments, we regard languages with fewer than 5,000 training examples as low-resource; there are 46 such languages from 17 language groups in the SIGMORPHON 2020 shared task 0 data. The training and development sets of the shared task are provided as triples of lemma, target MSD, and target form, e.g. jump V;PST jumped. The test data is missing the target form, which the morphological inflection model is expected to predict. The dataset contains labeled data for 1 to 3 parts of speech (POSs) depending on the language (nouns, verbs, and adjectives). We follow the same method as Liu and Hulden (2020) to reconstruct paradigms from the shared task data. Detailed statistics about the data for each language, including training data size, POSs, paradigm size per POS, the number of paradigms per POS, and average paradigm completion rate, as well as language group information, are provided in Tables 5 and 6 in Appendix B. The final number of training examples after the 1-source and 2-source transformations of the original training data is also provided in these tables. In this dataset, the development set is usually 1/7 of the original training set size and the test set is usually 2/7 of the original training set size.

The SIGMORPHON 2020 shared task 0 provides 2 types of neural baselines: a Transformer architecture applied at the character level (Wu et al., 2020) and a BiLSTM-based sequence-to-sequence architecture with exact hard monotonic attention. Each type of architecture is trained in four different ways with identical hyperparameters: one model per language with and without data hallucination, or one model per language group with and without data hallucination. This results in 8 baseline models: trm-single, trm-hal-single, trm-shared, trm-hal-shared, and mono-single, mono-hal-single, mono-shared, mono-hal-shared. All the baseline models are trained with only the lemma as the source form. Since we adopt the Transformer architecture for morphological inflection, our work focuses on the comparison with the Transformer baselines.
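The paradigm reconstruction step can be sketched as grouping triples by lemma. This is a simplification and not the authors' code: keying by lemma alone conflates homographs and ignores POS, and the triple ordering lemma, MSD, form follows the jump V;PST jumped example.

```python
from collections import defaultdict

def build_paradigms(triples):
    """Group shared-task triples (lemma, MSD, form) into partial paradigms:
    one dict of MSD -> form per lemma."""
    paradigms = defaultdict(dict)
    for lemma, msd, form in triples:
        paradigms[lemma][msd] = form
    return dict(paradigms)
```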
We use the implementation of the Transformer architecture in the Fairseq toolkit (Ott et al., 2019), and set the hyperparameters equal to those of the SIGMORPHON shared task Transformer baselines, except that we use beam search rather than greedy search for decoding. Details on the hyperparameters and training heuristics used in the current paper are provided in Appendix A. We train one model with the Fairseq Transformer implementation using the same input-output format as the SIGMORPHON single-language baseline (i.e. trm-single) and hyperparameters identical to those used in our other model experiments. The result is presented in the row named fairseq-trm-single in Table 2. The fairseq-trm-single result is no better than the results for trm-single provided by the shared task organizers. This shows that the performance improvements in the other models of our implementation truly reflect the contribution of incorporating more analogy sources into the input, and that we can compare our results with the SIGMORPHON 2020 shared task 0 official baseline results. We reproduce the experiments with the 1-source and 2-source models on the 46 languages, and train models for our proposed approaches for comparison: leave-1-out, 1-source+1-crosstable, 1-source+2-crosstable, and 1-source+hallucination. The evaluation metric we use to compare the performance of different models is accuracy, i.e. the fraction of correctly predicted target forms out of all predictions.
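The evaluation metric is plain exact-match accuracy, which can be stated in a few lines (the function name is ours):

```python
def accuracy(predictions, gold):
    """Fraction of predicted target forms that exactly match the gold forms."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```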

Results and discussion
Overall performance Figure 2 provides an overview of the performance of the different models. Details about the accuracy for each language under each model are provided in Table 7 in Appendix B. We have the following findings based on observation of the overall model performance.
1. The good performance of the 1-source+1-crosstable and the 1-source+2-crosstable models supports the positive effect of providing more analogy sources for the neural model to learn from, or at least ones that differ from the citation form.

2. Models trained with data augmentation techniques in our experiments achieve better results. Specifically, the 1-source+hallucination model produces the highest average accuracy, followed by the 1-source+1-crosstable model and the 1-source+2-crosstable model. The 1-source model achieves an average accuracy higher than the baseline trm-hal-single. Though the 2-source model has an average accuracy lower than the baseline trm-hal-single and has a larger variance, its average accuracy is still higher than that of the baseline Transformer model trained without data hallucination, i.e. trm-single.
3. The analogy-motivated approach of reformatting given data (Liu and Hulden, 2020) and the data hallucination approach (Anastasopoulos and Neubig, 2019) are complementary and can be profitably combined to improve the result. This is evidenced by the best performance of the 1-source+hallucination model.
4. Our proposed leave-1-out model has the third lowest average accuracy, lower than the baseline trm-single model, indicating that the Transformer model failed to pick out the principal parts in the way we proposed. This failure may be related to the limited amount of training data, which we leave to future work for validation.
5. The two baseline Transformer models trained per language group have the lowest average accuracies, indicating that cross-lingual learning did not have a positive effect in these models.
Performance and data size Figure 3 plots the performance of the Transformer models in relation to the training data size. The regression lines indicate that reformatting the training data by adding more analogy sources for the model to learn from is essentially an effective data augmentation approach, but as data increases, the lines in plots (a) and (c) cross each other, indicating that this approach may not be necessary when abundant training data is available. This is true for the data hallucination approach as well, as shown in plot (b). Plot (d) shows that reformatting the data has similar effects to data hallucination, but that reformatting is in general more effective.

Table 2: Average accuracy (%) of each model grouped by training data size range. fairseq-trm-single is the Transformer model we trained with the Fairseq implementation using the same hyperparameters as our other models, and for which the input is the same as for trm-single. The 1-source and 2-source rows present our reproduction results for the models proposed by Liu and Hulden (2020). SIGMORPHON 2020 shared task 0 baseline model results are copied from the published official results. The highest accuracies for each data size range are in boldface and the second highest are italicized.
We further break down the languages by training data size and present the average accuracy for each data size range in Table 2 in order to explore whether any model shows an obvious advantage with respect to the amount of labeled data. Because the LSTM-based sequence-to-sequence architecture with exact hard monotonic attention was particularly designed to tackle low-resource languages, we include in the comparison the results for this type of model provided in the SIGMORPHON 2020 shared task 0 as well. Our findings are as follows:

1. For all the data size ranges, the models with additional analogy sources in the input (i.e. 1-source, 2-source, 1-source+1-crosstable, 1-source+2-crosstable, and 1-source+hallucination) usually achieve better performance than models using only the lemma as input. This shows the effectiveness of explicitly providing more analogy sources as input to the neural morphological inflection model.

2. The 1-source+hallucination model is effective across all data size ranges and especially in extremely low-resource scenarios. This model achieves significantly higher accuracy than the other models for languages with fewer than 1,000 training examples, and its performance in the other training data size ranges is also very close to that of the best models. This provides additional support for the benefit of combining the analogy-motivated data reformatting approach and the data hallucination approach.
3. The 2-source model can be helpful in some scenarios, but it has high variance and is not as flexible and reliable as the 1-source models. The 2-source model produces the highest average accuracy for languages with 1,000-2,000 or 3,000-4,000 training examples, while its performance for languages with fewer than 1,000 or 4,000-5,000 training examples is much worse than that of the 1-source models. This may be related to the fact that the 2-source approach can augment the data exponentially, which may result in a large number of pairs that mislead the model with noise. For languages with 2,000-3,000 training examples a different model achieves the highest average accuracy (see Table 2), and the 1-source+2-crosstable model is the best one for the 4,000-5,000 range.

Figure 4: Scatterplots of average paradigm completion rate and accuracy for different models with regression lines.
4. The mono-hal-single model usually produces higher accuracy than the other models of the same architecture. However, it is still worse than most Transformer models. This reinforces the earlier observation that the Transformer architecture is superior to the hard-attention-enhanced LSTM encoder-decoder architecture in handling character-level sequence transduction tasks (Wu et al., 2020).

5. Considering that no model shows an obvious advantage across the board, model ensembling by picking the best model for each language on the development data set may be good practice in order to produce the best results for morphological inflection, as has been noted in Vylomova et al. (2020).
Performance and paradigm completion rate The relationship between model performance and paradigm completion rate is illustrated in Figure 4, from which we can see that languages with more known forms in their paradigms tend, on average, to have higher accuracy. We also see that the contribution of data augmentation, whether by data reformatting with more analogy sources or by data hallucination, tends to decrease as the average paradigm completion rate increases. Still, data reformatting by analogy demonstrates an advantage over data hallucination, as reflected by the line for the 1-source+1-crosstable model lying above the regression line for the trm-hal-single model in plot (d), and by the crossing of the trend lines in plots (a) and (c) occurring at a higher paradigm completion rate than in plot (b).
Performance and language group The advantage of the strategy of reformatting data by the analogy mechanism is also observed across language groups, as shown in Table 3. The one exception is the Siouan language group, whose average accuracy in the shared task baseline results is higher than under the analogy-augmented input methods, but this group has only one language in our data, and the difference is not significant (higher by only 0.1%). However, none of the models shows a general advantage across language groups, again supporting the idea that model ensembling is a good choice for producing the best results for a collection of diverse languages.

Table 3 (abbreviations): lv1out: leave-1-out, 1src+1: 1-source+1-crosstable, 1src+2: 1-source+2-crosstable, 1src+h: 1-source+hallucination; sing, h.sing, shrd, h.shrd are results copied from the SIGMORPHON 2020 shared task 0 results for the Transformer baseline models trained per language without (sing) or with (h.sing) data hallucination, or per language group without (shrd) or with (h.shrd) data hallucination.

Conclusion
We propose three new ways to reformat training data using an analogy mechanism for morphological inflection in low-resource scenarios: leave-1-out, 1-source+1-crosstable, and 1-source+2-crosstable. A systematic evaluation of model performance shows that the proposed methods that provide both intraparadigmatic and interparadigmatic analogy sources (i.e. 1-source+1-crosstable and 1-source+2-crosstable) are effective. In general, providing more analogy sources for the Transformer model to learn from is helpful. We further explore whether the data reformatting approach is orthogonal to data hallucination. Experimental results show that combining the two approaches is especially helpful when the training data is extremely limited. However, none of the models we evaluated in our experiments shows an across-the-board advantage with respect to training data size, paradigm completion rate, or language group, implying that model ensembling or model selection based on the development data is a good choice to achieve the best morphological inflection performance for a diverse collection of languages. This also indicates that morphological inflection generation is complicated, with many orthogonal factors affecting performance.
Appendix A: Hyperparameters and training heuristics

The frequency of checkpoint saving, the maximum number of parameter updates, and the early-stop threshold vary between languages and models as stated below and summarized in Table 4, because the amount of training data varies after conversion and the settings are meant to optimize the training process. The general pattern is that for languages with more training data, checkpoints are saved more frequently, the maximum number of updates is larger, and more updates are allowed before early stopping is enforced.

leave-1-out models: all languages 6,000
1-source models: (0, 20k) bod, ceb, ctp, czn, dak, dje, gaa, gmh, gml, gsw, hil, izh, kjh, kon, lin, lud, mao, mlg, mlt, ood, sot, syc, tel, tgk, tgl, vot, vro, xno, xty, zpv

Table 6: Data information for languages with only one part of speech in the data. Languages are listed in increasing order of the amount of original training data. trn-raw: the amount of original training data, trn-1-src: the amount of training data after 1-source conversion, which is the same for all 1-source models, trn-2-src: the amount of training data after 2-source conversion, POS: part of speech, psize: paradigm size, pnum: the number of paradigms, pfill-rate: paradigm completion rate in percentage.

Table 7: Accuracy (%) for each language by each model we compare. The SIGMORPHON 2020 shared task 0 baseline results (i.e. the last 4 columns) are copied from the shared task official results. The two columns shaded gray (i.e. 1src and 2src) are our reproduction results for Liu and Hulden (2020). The other models are our proposed methods. Languages are listed in increasing order of the original training data size.