IGT2P: From Interlinear Glossed Texts to Paradigms

Abstract

An intermediate step in the linguistic analysis of an under-documented language is to find and organize inflected forms that are attested in natural speech. From this data, linguists generate unseen inflected word forms in order to test hypotheses about the language's inflectional patterns and to complete inflectional paradigm tables. To get the data, linguists spend many hours manually creating interlinear glossed texts (IGTs). We introduce a new task that speeds this process and automatically generates new morphological resources for natural language processing systems: IGT-to-paradigms (IGT2P). IGT2P generates entire morphological paradigms from IGT input. We show that existing morphological reinflection models can solve the task with 21% to 64% accuracy, depending on the language. We further find that (i) having a language expert spend only a few hours cleaning the noisy IGT data improves performance by as much as 21 percentage points, and (ii) POS tags, which are generally considered a necessary part of NLP morphological reinflection input, have no effect on the accuracy of the models considered here.


Introduction
Over the last few years, multiple shared tasks have encouraged the development of systems for learning morphology, including generating inflected forms of the canonical form (the lemma) of a word. NLP systems that account for morphology can reduce data sparsity caused by an abundance of individual word forms in morphologically rich languages (Cotterell et al., 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020) and help mitigate bias in training data for natural language processing (NLP) systems (Zmigrod et al., 2019). However, such systems have often been limited to languages with publicly available structured data, i.e., languages for which tables containing inflectional patterns can be found, for example, in online dictionaries like Wiktionary. 1 This limits the development of NLP systems for morphology to languages for which morphological information can be easily extracted.

Figure 1: Inflected word forms attested in interlinear glossed texts (IGT) train a transformer encoder-decoder to generalize morphological paradigmatic patterns and generate word forms when given known morphosyntactic features of missing paradigm cells. Noisy paradigms are automatically constructed from IGT, and a language expert creates "cleaned" paradigms. Both sets are tested on the same missing word forms and the results are compared.
Here, we propose to instead make use of a resource which is much more common, especially for low-resource languages: we explore how to leverage interlinear glossed text (IGT), a common artifact of linguistic field research, to generate unseen forms of inflectional paradigms, as illustrated in Figure 1. This task, which we call IGT-to-paradigms (IGT2P), differs from the existing morphological inflection task (Yarowsky and Wicentowski, 2000) in three aspects: (1) inflected forms extracted from IGT are noisier than curated training data for morphological generation, (2) since lemmas are not explicitly identified in IGT, systems cannot be trained on typical lemma-to-form mappings and, instead, must be trained on form-to-form mappings, and (3) part-of-speech (POS) tags are often unavailable in IGT. IGT2P can thus be seen as a noisy version of morphological reinflection, but without explicit POS information. Our experiments show that morphological reinflection systems with suitable preprocessing are strong baselines for this task.
We further perform two analyses: (i) Part-of-speech (POS) tags are usually considered necessary inputs for learning morphological generation. However, they are frequently missing from IGT, since they result from a later step in a linguist's pipeline. Thus, we ask: are POS tags necessary for morphological generation? Surprisingly, we find that POS tags are of little use for morphological generation systems.
(ii) How much does manual cleaning of IGT data by a domain expert improve performance? As expected, cleaning the data improves performance across the board with a transformer model: by 1.27% to 16.32%, depending on the language.
We examine which inflection model performs better on noisy and cleaned IGT data and how the performance varies across languages and data quality or size.

Background: Morphological Generation
An inflectional paradigm can be illustrated as a table, such as Table 1. Paradigms can be large; for example, Polish verb paradigms can have up to 30 cells, and paradigms in other languages can be larger still. Here we define the notation related to morphological inflection systems for the remainder of this paper.
We denote the paradigm of a lemma ℓ as

π(ℓ) = ⟨f(ℓ, t_γ)⟩_{γ ∈ Γ(ℓ)},

where f : Σ* × T → Σ* defines a mapping from a tuple consisting of the lemma and a vector t_γ ∈ T of morphological features to the corresponding inflected form. Σ is an alphabet of discrete symbols, i.e., the characters used in the natural language. Γ(ℓ) is the set of slots in lemma ℓ's paradigm. We will abbreviate f(ℓ, t_γ) as f_γ(ℓ) for simplicity. Using this notation, we now describe the most important generation tasks from the computational morphology literature.

              present           past
           sing.    pl.     sing.    pl.
1 person    am      are     was      were
2 person    are     are     were     were
3 person    is      are     was      were

Table 1: The inflectional paradigm of the English verb "to be". This verb has more inflected forms than any other English lemma, but its paradigm is quite small compared to paradigms in many other languages.

Morphological inflection.
The task of morphological inflection consists of generating unknown inflected forms, given a lemma and a feature vector t γ . Thus, it corresponds to learning the mapping f : Σ * × T → Σ * .
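As a concrete illustration, the mapping f for the English verb "to be" from Table 1 can be written as a lookup table. This is only a toy sketch: a real system must generalize to unseen lemmas, and the feature-tuple encoding below is illustrative rather than any standard annotation scheme.

```python
# Toy illustration of f : (lemma, feature vector) -> inflected form,
# using part of the "to be" paradigm from Table 1.
PARADIGM_BE = {
    ("PRS", "1", "SG"): "am",
    ("PRS", "2", "SG"): "are",
    ("PRS", "3", "SG"): "is",
    ("PST", "1", "SG"): "was",
    ("PST", "2", "SG"): "were",
    ("PST", "3", "SG"): "was",
}

def inflect(lemma, features):
    """f(lemma, t_gamma) as an exact lookup; a trained model instead
    generates the target form character by character."""
    if lemma == "be":
        return PARADIGM_BE[features]
    raise KeyError(f"no paradigm stored for {lemma!r}")

inflect("be", ("PST", "1", "SG"))  # -> "was"
```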

Morphological reinflection.
Morphological reinflection is a generalized version of the previous task. Here, instead of a lemma, systems are given some inflected form f(ℓ, t_γ1), optionally together with t_γ1, and a target feature vector t_γ2. The goal is then to produce the inflected form f(ℓ, t_γ2).

Paradigm completion.
The task of paradigm completion consists of, given a partial paradigm π_P(ℓ) = ⟨f(ℓ, t_γ)⟩_{γ ∈ Γ_P(ℓ)} of a lemma ℓ, generating the inflected forms for all slots γ ∈ Γ(ℓ) − Γ_P(ℓ). Training data for this task consists of entire paradigms.
Unsupervised morphological paradigm completion.
For the unsupervised version of the paradigm completion task, systems are given a corpus D = w_1, ..., w_|D| with a vocabulary V of word types {w_i} and a lexicon L = {ℓ_j} with |L| lemmas belonging to the same part of speech. However, no explicit paradigms are observed during training. The task of unsupervised morphological paradigm completion then consists of generating the paradigms {π(ℓ)}_{ℓ ∈ L} of all lemmas ℓ ∈ L.

IGT-to-Paradigms
The task we propose, IGT-to-paradigms (IGT2P), can be described as the paradigm completion problem above, with an additional step of inference regarding which of the attested forms is associated with which lemma.
Similar to unsupervised paradigm completion, we do not assume information about the lemma to be explicit. Similar to morphological reinflection, the input includes word forms with features, and a system has to learn to generate inflections from other word forms and morphological feature vectors. IGT2P is further similar to paradigm completion in that we aim at generating all inflected forms for each lemma. 2

Why IGT2P?
Descriptive linguistics aims to objectively analyze primary language data in new languages and publish descriptions of their structure. This work informs our understanding of human language and provides resources for NLP development, through academic literature that informs projects such as UniMorph, or through crowdsourced efforts such as Wiktionary. Yet, with most descriptive work performed manually and with very little NLP assistance, language resources for thousands of under-described languages remain limited. This includes languages with millions of speakers, such as Manipuri in India.
However, there exists a type of labeled data that is available in nearly all languages where a linguist has undertaken any scientific endeavor: interlinear glossed texts (IGT), illustrated in Table 2. They are the output of early steps in a field linguist's pipeline, which consist of recording natural speech, transcribing it, and then identifying minimal meaningful units (the morphemes) and labeling the morphemes' morphosyntactic features with internally consistent tags. IGTs serve as vital sources of morphological, syntactic, and higher levels of linguistic information. They are often archived in long-term repositories and openly accessible for non-commercial purposes, yet they are rarely utilized in NLP.
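The gloss-line conventions just described (hyphens separate morphemes, dots separate stacked feature tags, stems glossed in lowercase) can be seen in Table 2. A minimal sketch of how a glossed word decomposes into a stem gloss and feature tags follows; the heuristic of treating all-caps glosses as feature tags is our simplification, not a universal IGT convention.

```python
def parse_igt_word(segmented: str, gloss: str):
    """Split a segmented word and its gloss into morphemes and pair them.
    Glosses in all caps (possibly dotted, e.g. PFV.PST.SG.FEM) are treated
    as morphosyntactic feature tags; anything else is a stem gloss."""
    morphs = segmented.split("-")
    glosses = gloss.split("-")
    stem_gloss, features = None, []
    for morph, g in zip(morphs, glosses):
        if g.replace(".", "").isupper():
            features.extend(g.split("."))  # stacked tags like PFV.PST.SG.FEM
        else:
            stem_gloss = g
    return stem_gloss, features

# Russian example from Table 2
parse_igt_word("pobeja-la", "run-PFV.PST.SG.FEM")
# -> ('run', ['PFV', 'PST', 'SG', 'FEM'])
```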
IGT2P has potential benefits not only for NLP (by increasing available resources in low-resource languages) but also for linguistic inquiry. First, since machine assistance has been shown to increase the speed and accuracy of manual linguistic annotation even at just 60% model accuracy (Felt, 2012), such a model could assist the initial analysis of morphological patterns in IGT. Second, by quickly learning morphological patterns from word forms attested in IGT, IGT2P generates forms that fill empty cells in a lemma's paradigm. Since IGTs are unlikely to contain complete paradigms of lemmas, an accompanying step in fieldwork is the elicitation of inflectional paradigms for selected lemmas. Presenting candidate words to a native speaker for acceptance or rejection is often easier than asking the speaker to grasp the abstract concept of a paradigm and to generate the missing cells in a table. With the help of IGT2P, linguists could use the machine-generated word forms to support this elicitation process. IGT2P thus becomes a tool for the discovery of morphological patterns in under-described and endangered languages.

Related Work
IGT for NLP. The AGGREGATION project (Bender, 2014) has used IGT to automatically construct grammars for multiple languages. This includes inferring and visualizing systems of morphosyntax (Lepp et al., 2019; Wax, 2014). Much of their data comes from the Online Database of INterlinear Text (ODIN; Lewis and Xia, 2010), a collection of IGTs extracted from published linguistic documents on the web. Published IGT excerpts, such as those in ODIN, differ from IGTs produced by field linguists, such as those used in our experiments. First, noise is generally removed from the published examples. Second, the amount of glossed information in published IGT snippets can vary widely depending on the phenomenon that is the main focus of the publication.
Computational morphology. Our work is further related to and takes inspiration from research on the tasks described in Section 2.1.
Most recent work in the area of computational morphology concerned with generation (as opposed to analysis) has focused on morphological inflection or reinflection. Approaches include Durrett and DeNero (2013); Nicolai et al. (2015); Kann and Schütze (2016); Aharoni and Goldberg (2017). Partially building on these, other research has developed models which are more suitable for low-resource languages and perform well with limited data (Kann et al., 2017b; Sharma et al., 2018; Makarov and Clematide, 2018; Kann et al., 2020a; Wu et al., 2020). These are the most relevant approaches for our work, since we expect IGT2P to aid documentation of low-resource languages. Accordingly, we use the systems by Wu and Cotterell (2019) and Wu et al. (2020) in our experiments.

Text         Vecherom ya pobejala v magazin.
Segmented    vecher-om ya pobeja-la v magazin
Glossed      evening-INS 1.SG.NOM run-PFV.PST.SG.FEM in store.ACC
Translation  'In the evening I ran to the store.'

Table 2: An example of typical interlinear glossed text (IGT) with a transliterated Russian sentence, including a translation. This paper leverages the original text and gloss lines.
Work on paradigm completion, or the paradigm cell filling problem (PCFP; Ackerman et al., 2009), includes Malouf (2016), who trained recurrent neural networks for the task and applied them successfully to Irish, Maltese, and Khaling, among other languages. Silfverberg and Hulden (2018) also trained neural networks for the task. Kann et al. (2017a) differed from other approaches in that they encoded multiple inflected forms of a lemma to provide complementary information for the generation of unknown forms of the same lemma. Finally, neural graphical models have been introduced which complete paradigms based on principal parts. The unsupervised version of the paradigm completion task (Jin et al., 2020) has been the subject of a recent shared task (Kann et al., 2020b), with the conclusion that it is extremely challenging for current state-of-the-art systems. Here, we propose to generate paradigms not from raw text, but from IGT, a resource available for many under-studied languages.

To POS Tag or Not to POS Tag
In addition to the lemma and the morphological features of the target form, part-of-speech (POS) tags are by default part of the input to neural morphological reinflection systems. POS tags are assumed to carry valuable information since, for example, stems that are otherwise identical (e.g. "seat") may take one set of inflectional morphemes as nouns (e.g. "many seats") and another as verbs ("be seated").
Since POS tags are typically annotated at a later stage than morpheme boundaries and glosses, IGTs often do not contain POS tags for all words. This makes large parts of the IGT unusable for state-of-the-art reinflection systems if POS tags are assumed to be necessary. However, the assumption that POS tags improve morphological generation performance has never been empirically verified for recent state-of-the-art systems. We hypothesize that, in fact, POS tags might not be necessary, since they might be implicitly defined by either the morphological features or the input word form. Thus, we ask the following research question: are POS tags a necessary or beneficial input to a morphological reinflection system?
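In UniMorph-style annotation, as used in the shared task data below, the POS tag is the first field of a semicolon-separated feature string, so the "without POS" condition amounts to dropping that field. A minimal sketch of this ablation follows; the tag set here is partial and purely illustrative.

```python
# Strip the leading POS tag from a UniMorph-style feature string,
# e.g. "V;PST;3;SG" -> "PST;3;SG", producing the no-POS input condition.
# The POS tag inventory below is a partial, illustrative set.
POS_TAGS = {"N", "V", "ADJ"}

def strip_pos(feature_string: str) -> str:
    feats = feature_string.split(";")
    if feats and feats[0] in POS_TAGS:
        feats = feats[1:]
    return ";".join(feats)

strip_pos("V;PST;3;SG")  # -> "PST;3;SG"
```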

Experimental Setup
To answer this question, we train morphological reinflection systems on 10 languages released for the CoNLL-SIGMORPHON 2018 shared task (Cotterell et al., 2018), once with and once without POS tags as input. In order to obtain generalizable results, our selected languages belong to different families and are typologically diverse with regard to morphology, as shown in Table 3. 3 We kept the original training/validation/test splits and experiment with three training set sizes: 10,000, 1,000, and 100 examples for the high, medium, and low settings, respectively.

Models
We experiment with two state-of-the-art neural models for morphology learning: the transformer model for character-level transduction (Wu et al., 2020) and the LSTM sequence-to-sequence model with exact hard monotonic attention for character-level transduction (Wu and Cotterell, 2019).

Results
Table 3 illustrates the performance difference when including and not including POS tags for all three training data sizes. The largest difference is a decrease of 4.4 percentage points when POS tags are removed for Finnish at the medium setting using hard monotonic attention. The average difference is about 0.2 percentage points. We therefore conclude that a lack of POS tags does not make a significant difference in the reinflection task.

Language data
We used IGTs that were primarily transcribed from naturally occurring oral speech in low-resource and endangered languages. They represent a wide range of projects, which is reflected in the size and quality of the data. The amount of usable data (i.e. glossed words) ranges from approximately 90,000 tokens in Arapaho to about 5,000 in Manipuri. The five languages (see Table 4) are spoken by communities across five continents. They represent different language families and degrees of morphological complexity, though all are agglutinating to some degree. Other than the IGT, there are very limited resources for these languages.

4 It is theoretically possible that the other baselines could outperform these models once we limit our experiments to words with POS information. However, based on our preliminary experiments using POS tags, this seems unlikely.

Issues specific to IGT
The most notable issue with IGT is noise. An inevitable cause is the dynamic nature of ongoing linguistic analysis: as the linguist gains a better understanding of the language's structure through interlinearization, early decisions about morpheme shapes and glosses differ from later ones. Another cause is that limited budgets and time mean IGT are often only partially completed. Noise also arises when a project is focused on annotating one particular phenomenon. For example, frequently only one morphosyntactic feature was glossed in each Manipuri word, meaning different inflected forms looked as if they had the same morphosyntactic features. A further source of noise is imprecision introduced by human error or by choices made for convenience to speed tedious annotation. One example of imprecision is glossing different stem morphemes with the same English word. For example, Lezgi has several copula verbs which can be narrowly translated as 'be in', 'be at', etc., but most were merely glossed as 'be', so all copula verbs were initially grouped into one paradigm. A similar situation occurred with Arapaho: nuances of meaning were often not distinguished in the glosses; thus, different verb stems are glossed simply as 'give' when, in reality, they should be divided into 'hand to someone' in one case, 'give as a present' in another, and 'give ceremonially, as an honor' in a third. Another issue is that IGT annotators do not usually differentiate between different types of morphemes, so we do not always distinguish between them either. Derivational and inflectional morphemes were only differentiated where we were able to easily identify and eliminate derivational glosses. For example, in Arapaho we were able to group derived stems into separate paradigms because they were glossed distinctly. Also, clitics are often not distinguished from affixes. This means that the morphological patterns the models learn are not always, strictly speaking, inflectional paradigms, but it does mean that the models learn all attested forms related to one lemma.

Figure 2: Lezgi paradigms were automatically constructed from IGT (left columns) and have typos or incorrect paradigm clusters. Experts filtered or corrected these issues, resulting in "clean" paradigms (middle). These can be compared with the published description (right column), which includes historic forms that are rarely used today.

Approach
As a first step, partial inflectional paradigms were automatically extracted from the IGT. Words were organized into paradigms based on the gloss of the stem morpheme. Then, these stem glosses were removed, leaving only the affix glosses which serve as morphosyntactic feature tags.
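The extraction step above can be sketched as follows. The function name and the triple representation are our illustration; the second Russian form is an invented example added to show a two-cell paradigm.

```python
from collections import defaultdict

def build_paradigms(igt_words):
    """Group attested (surface form, stem gloss, affix glosses) triples into
    noisy paradigms keyed by the stem gloss; after grouping, the stem gloss
    is dropped and only affix glosses remain as morphosyntactic feature tags."""
    paradigms = defaultdict(dict)
    for form, stem_gloss, affix_glosses in igt_words:
        paradigms[stem_gloss][tuple(affix_glosses)] = form
    return dict(paradigms)

words = [
    ("pobejala", "run", ["PFV", "PST", "SG", "FEM"]),
    ("pobejal", "run", ["PFV", "PST", "SG", "MASC"]),   # invented second cell
    ("magazin", "store", ["ACC"]),
]
paradigms = build_paradigms(words)  # 'run' has two cells, 'store' has one
```

Note that two distinct lemmas glossed with the same English word (the 'be' and 'give' cases above) collapse into one key here, which is exactly the kind of noise the expert cleaning step corrects.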
Step 1: Preprocessing paradigms. The automatically extracted paradigms were preprocessed in two ways. The resulting data is publicly available. 5 In the first preprocessing method, a language domain expert was asked to "clean" the automatically extracted paradigms. Example results are shown in Figure 2. Experts reorganized words into correct inflectional paradigms, for example, by regrouping the Lezgi copula verbs. They also completed missing morphosyntactic information, for example, adding PL (plural) or SG (singular) where nouns were otherwise glossed identically. Finally, they removed any words that are not inflected in the language. This usually included words that are morphologically derived from another part of speech but not inflected. For example, an affix might derive an adverb from a noun root; if the adverbializing affix was glossed, the word form would have been extracted automatically, adding noise since it displays derivational rather than inflectional morphology. Experts were asked to spend no more than six hours on the cleaning task.
For the second preprocessing method, the automatically extracted paradigms were surveyed by a non-expert. Since non-experts could not be expected to identify and correct most issues, they simply removed obvious mistakes, such as glosses of stem morphemes that were misidentified as affix glosses, and word forms with obviously incomplete or ambiguous glosses (due to identical glosses on one or more word forms). For some languages, this cleaning-by-removal made the paradigms smaller than the "cleaned" dataset.
Step 2: Preparing reinflection data. Typical morphological reinflection data comes in tuples of (source form, target form, target features). We convert the paradigm data into this format in preparation for reinflection. Table 5 presents the data sizes.
For each language, we prepare the validation and test sets from the expert-cleaned data in the following way: if a paradigm has more than one form, we pick a random form as the source form and select each of the remaining forms in the paradigm with probability 0.3 to be "unknown", i.e. to be predicted from the source form. Half of the "unknown" data obtained in this way is used for validation and the other half for testing. The validation and test sets for each language are shared across all experiments we conduct for that language.
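The held-out selection just described can be sketched as follows; the function name and the fixed seed are ours, added for reproducibility of the sketch.

```python
import random

def split_paradigm(forms, p=0.3, seed=0):
    """Pick one random form of a multi-form paradigm as the source; mark
    each remaining form as "unknown" (to be predicted) with probability p.
    Returns (source form, known forms, unknown forms); half of the unknown
    forms would then go to validation and half to testing."""
    rng = random.Random(seed)
    forms = list(forms)
    source = forms.pop(rng.randrange(len(forms)))
    unknown = [f for f in forms if rng.random() < p]
    known = [source] + [f for f in forms if f not in unknown]
    return source, known, unknown

src, known, unknown = split_paradigm(["am", "is", "are", "was", "were"])
```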
To prepare the training data from the noisy and clean paradigms, we first map each form in the data to itself and add these pairs to the training data. Paradigms with a single entry thus have only a self-to-self mapping. If a paradigm has more than one form, all possible pairs of forms in the paradigm are generated and added to the training data, excluding those that are part of the test or validation sets, i.e. "unknown" forms.
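The pair-generation step above can be sketched as follows; the helper name and the held-out handling are our illustration.

```python
from itertools import permutations

def make_training_pairs(paradigm, held_out=frozenset()):
    """Build (source form, target form, target features) triples from one
    paradigm: every form is first mapped to itself, and every ordered pair
    of distinct forms is added, skipping forms reserved for validation or
    testing."""
    triples = [(form, form, feats)
               for feats, form in paradigm.items() if form not in held_out]
    for (feats1, form1), (feats2, form2) in permutations(paradigm.items(), 2):
        if form1 not in held_out and form2 not in held_out:
            triples.append((form1, form2, feats2))
    return triples

paradigm = {("PST", "SG"): "was", ("PST", "PL"): "were"}
pairs = make_training_pairs(paradigm)  # 2 self-maps + 2 cross pairs
```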
Step 3: Reinflection models and experimental setup. We experiment with two state-of-the-art models for morphological reinflection: the transformer model for character-level transduction (Wu et al., 2020) and the LSTM sequence-to-sequence model with exact hard monotonic attention for character-level transduction (Wu and Cotterell, 2019). For all models, we used the implementation of the SIGMORPHON 2020 shared task 0 baseline (Vylomova et al., 2020), 6 and our hyperparameters are the same as the shared task baseline's.
After paradigms are extracted and preprocessed, we conduct two experiments to generate "unknown" inflected forms. We then expand those experiments with two data augmentation techniques. First, we add all unannotated/uninflected words from the IGT data to the training data. When tokens that were either unannotated or uninflected are added, they are self-mapped as the source and target forms (as we do with single-entry paradigms), and their morphosyntactic features are annotated with a special tag: XXXX. Second, we augment the training data by generating 10,000 artificial instances with the SIGMORPHON 2020 shared task 0 baseline's implementation of the data hallucination method proposed by Anastasopoulos and Neubig (2019). Finally, we combine both additions. These augmentations are intended to overcome data scarcity.
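The first augmentation (self-mapping unannotated or uninflected tokens under the placeholder tag XXXX) can be sketched as follows; the function name and triple layout are our illustration.

```python
def add_uninflected(triples, extra_tokens):
    """Augment (source, target, target features) training triples with
    unannotated/uninflected IGT tokens, self-mapped and tagged with the
    special placeholder feature XXXX."""
    return triples + [(w, w, ("XXXX",)) for w in extra_tokens]

add_uninflected([], ["v", "magazin"])
# -> [('v', 'v', ('XXXX',)), ('magazin', 'magazin', ('XXXX',))]
```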
All models and techniques were tested on the same held-out set chosen randomly from multi-entry paradigms in each language.

6 https://github.com/shijie-wu/neural-transducer/tree/f1c89f490293f6a89380090bf4d6573f4bfca76f

Results
We compared results when training on the noisy paradigms and on the expertly cleaned paradigms and found that the limited involvement of experts always improved results. We also found the transformer outperformed the LSTM with hard monotonic attention on cleaned data in all instances and on noisy data overall. When comparing results from augmenting the data by artificial and uninflected/unannotated tokens, we find varied results. The results are displayed in Table 6.
There is no clear correlation between accuracy and the total number of annotated tokens or training paradigms (see Tables 4 and 5). Tsez [ddo] and Arapaho [arp] achieved over 60% accuracy, and these languages do have more training data (35K and 283K triples, respectively) than the others (less than 10K). However, even though Arapaho has considerably more training data, its accuracy is lower than Tsez's. A slight correlation between accuracy and the proportion of multi-entry paradigms does exist: languages with a higher proportion of multi-entry paradigms tend to have better results. Fewer single-entry paradigms may indicate more complete paradigm information.
Any correlation between results and linguistic factors such as language family or morphological type is uncertain because of the limited number of test languages. Tsez [ddo] gave the best results overall. This could be due to its limited allomorphy and very regular inflection, which may also explain why its relative Lezgi [lez] performs better than languages with more data. Arapaho's poorer performance could be due to its polysynthetic morphology (Cowell and Moss, 2008), which is more complex than the fairly straightforward agglutination in Tsez (Job, 1994) and Lezgi (Haspelmath, 1993). The models also seem less reliable at recognizing word structure in Arapaho: when the initial part of a stem is incidentally identical to a common inflectional affix, the stem is often generated incorrectly.
The factor that seems most clearly correlated with accuracy is the consistency and thoroughness of IGT annotations. The Arapaho, Tsez, and Natügu [ntu] corpora were noticeably more complete (i.e. most morphemes were glossed) and polished. This probably explains why Tsez not only had the best results but also showed the smallest improvements after cleaning. Interestingly, augmentation techniques also helped these languages the least (only artificial data augmentation helped Tsez slightly). It seems, therefore, that data augmentation is most helpful when the original manual annotations are least consistent or complete, while results are highest when they are most consistent.
As might be expected with limited data, errors were most common with irregular or rare forms. For example, the best performing model incorrectly inflected many Lezgi pronouns, which have an inflection pattern identical to nouns except for an unpredictable change in the stem vowel. Perhaps relatedly, the model also misidentified some epenthetic vowels in several Lezgi nouns. Another interesting pattern involved the case-stacking unique to Nakh-Daghestanian languages (Tsez and Lezgi), where nominal affixes concatenate, rather than replace one another, to form several peripheral cases such as SUPERELATIVE or POSTDIRECTIVE. The more common affixes in a concatenation string were often generated correctly, but the less common concatenated affixes were not. Allomorphy also causes difficulty: models struggle to generate the right form when multiple forms are possible. For example, in Arapaho the third person singular inflection has several variants (e.g. -oo, -o, or -'). On the other hand, models learned regular inflectional patterns well enough to correctly inflect forms even where the expert had left misspellings of that form in the clean data.
Finally, we clearly see that expert cleaning improved performance across the board (with two negligible exceptions for Tsez and Lezgi with the hard monotonic attention model). Experts were asked to spend no more than six hours and in practice spent between two and seven hours per language. This indicates that expert labor is well worth its "cost".

Conclusion
We proposed a new morphological generation task called IGT2P, which aims to learn inflectional paradigmatic patterns from interlinear glossed texts (IGT) produced in linguistic fieldwork. We experimented with neural models that have been used for morphological reinflection, combined with new preprocessing steps, as baselines for the task. Our experiments show that IGT2P is a promising method for creating new morphological resources in a wide range of low-resource languages.

Table 6: Accuracy percentages on the reinflection task for the transformer model (T) and the LSTM seq2seq model with exact hard monotonic attention (mono), with and without artificial data augmentation (+aug), unannotated/uninflected word forms (+uninfl), and both together. Boldface indicates the best result; italics indicate the best result on noisy paradigms.
With sufficient IGT annotations, IGT2P obtains reasonable performance from noisy data. We investigated the effect of manual cleaning on model performance and showed that even a very limited cleaning effort (2-7 hours) drastically improves results. The inherent noisiness of IGT and other linguistic field data can thus be overcome with limited input from domain experts. This is a significant contribution considering the extensive effort, on the order of months or years, needed to produce the curated structured data normally used to train NLP models. In languages with the noisiest data, performance is improved even further by data augmentation techniques. Finally, since field data does not often include POS annotation, we investigated the usefulness of POS tags for morphological reinflection and found that, surprisingly and in contrast to common assumptions, they are not beneficial to recent state-of-the-art systems. This is a useful discovery for researchers who wish to optimize their inflection systems.
There is room for future improvement. Better techniques for further cleaning might be useful, since accuracy seems closely related to data quality; however, at some point additional cleaning will yield diminishing returns. Upper bounds could be established by comparing results on languages with gold-standard inflection tables, although polysynthetic languages like Arapaho would make this difficult since their tables do not always include noun incorporation. A better use of experts' time might involve identifying lemmata that could be used to train a lemma-to-form model, rather than the form-to-form mapping used here. Another approach would be to compare improvements between manual-only cleaning and cleaning done by a linguist working with someone who can write scripts to automatically correct repeated patterns of noise.
IGT2P also has implications for the documentation of endangered languages and for addressing the digital inequity faced by speakers of marginalized languages. It could be integrated into linguists' workflows in order to improve the study of inflection and increase IGT data. For example, the generated inflected forms could be used for automated glossing of raw text, and IGT2P could speed the discovery and description of a language's entire morphological structure. An elicitation step with native speakers could be added to strategically augment data: IGT2P results could serve as prompts for speakers to produce forms that are rare in natural speech. IGT2P might also be integrated into linguistic software such as FLEx.