SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.


Introduction
Human language is marked by considerable diversity around the world. Though the world's languages share many basic attributes (e.g., Swadesh, 1950 and, more recently, List et al., 2016), grammatical features, and even abstract implications (proposed in Greenberg, 1963), each language nevertheless has a unique evolutionary trajectory that is affected by geographic, social, cultural, and other factors. As a result, the surface form of languages varies substantially. The morphology of languages can differ in many ways: some exhibit rich grammatical case systems (e.g., 12 in Erzya and 24 in Veps) and mark possessiveness, others might have complex verbal morphology (e.g., Oto-Manguean languages; Palancar and Léonard, 2016) or even "decline" nouns for tense (e.g., Tupi-Guarani languages). Linguistic typology is the discipline that studies these variations by means of a systematic comparison of languages (Croft, 2002; Comrie, 1989). Typologists have defined several dimensions of morphological variation to classify and quantify the degree of cross-linguistic variation. This comparison can be challenging as the categories are based on studies of known languages and are progressively refined with documentation of new languages (Haspelmath, 2007). Nevertheless, to understand the potential range of morphological variation, we take a closer look at three dimensions here: fusion, inflectional synthesis, and position of case affixes (Dryer and Haspelmath, 2013).
Fusion, our first dimension of variation, refers to the degree to which morphemes bind to one another in a phonological word (Bickel and Nichols, 2013b). Languages range from strictly isolating (i.e., each morpheme is its own phonological word) to concatenative (i.e., morphemes bind together within a phonological word); nonlinearities such as ablaut or tonal morphology can also be present. From a geographic perspective, isolating languages are found in the Sahel Belt in West Africa, Southeast Asia and the Pacific. Ablaut-concatenative morphology and tonal morphology can be found in African languages. Tonal-concatenative morphology can be found in Mesoamerican languages (e.g., Oto-Manguean). Concatenative morphology is the most common system and can be found around the world. Inflectional synthesis, the second dimension considered, refers to whether grammatical categories like tense, voice or agreement are expressed as affixes (synthetic) or individual words (analytic) (Bickel and Nichols, 2013c). Analytic expressions are common in Eurasia (except the Pacific Rim, and the Himalaya and Caucasus mountain ranges), whereas synthetic expressions are used to a high degree in the Americas. Finally, affixes can variably surface as prefixes, suffixes, infixes, or circumfixes (Dryer, 2013). Most Eurasian and Australian languages strongly favor suffixation, and the same holds true, but to a lesser extent, for South American and New Guinean languages (Dryer, 2013). In Mesoamerican languages and African languages spoken below the Sahara, prefixation is dominant instead.
These are just three dimensions of variation in morphology, and the cross-linguistic variation is already considerable. Such cross-lingual variation makes the development of natural language processing (NLP) applications challenging. As Bender (2009, 2016) notes, many current architectures and training and tuning algorithms still present language-specific biases. The most commonly used language for developing NLP applications is English. Along the above dimensions, English is productively concatenative, a mixture of analytic and synthetic, and largely suffixing in its inflectional morphology. With respect to languages that exhibit inflectional morphology, English is relatively impoverished. Importantly, English is just one morphological system among many. A larger goal of natural language processing is that the system work for any presented language. If an NLP system is trained on just one language, it could be missing important flexibility in its ability to account for cross-linguistic morphological variation.
In this year's iteration of the SIGMORPHON shared task on morphological reinflection, we specifically focus on typological diversity and aim to investigate systems' ability to generalize across typologically distinct languages, many of which are low-resource. For example, if a neural network architecture works well for a sample of Indo-European languages, should the same architecture also work well for Tupi-Guarani languages (where nouns are "declined" for tense) or Austronesian languages (where verbal morphology is frequently prefixing)?

Task Description
The 2020 iteration of our task is similar to CoNLL-SIGMORPHON 2017 (Cotterell et al., 2017) and 2018 (Cotterell et al., 2018) in that participants are required to design a model that learns to generate inflected forms from a lemma and a set of morphosyntactic features that derive the desired target form. For each language we provide a separate training, development, and test set. More historically, all of these tasks resemble the classic "wug" test that Berko (1958) developed to test child and adult knowledge of English nominal morphology.
Unlike the task from earlier years, this year's task proceeds in three phases: a Development Phase, a Generalization Phase, and an Evaluation Phase, each of which introduces previously unseen data. The task starts with the Development Phase, an extended period (about two months) during which participants develop a model of morphological inflection. In this phase, we provide training and development splits for 45 languages representing the Austronesian, Niger-Congo, Oto-Manguean, Uralic and Indo-European language families. Table 1 provides details on the languages. The Generalization Phase is a short period (it starts about a week before the Evaluation Phase) during which participants fine-tune their models on new data. At the start of the phase, we provide training and development splits for 45 new languages, of which approximately half are genetically related (belong to the same family) and half are genetically unrelated (are isolates or belong to a different family) to the languages presented in the Development Phase. More specifically, we introduce (surprise) languages from the Afro-Asiatic, Algic, Dravidian, Indo-European, Niger-Congo, Sino-Tibetan, Siouan, Songhay, Southern Daly, Tungusic, Turkic, Uralic, and Uto-Aztecan families. See Table 2 for more details.
Finally, test splits for all 90 languages are released in the Evaluation Phase. During this phase, the models are evaluated on held-out forms. Importantly, the languages from both previous phases are evaluated simultaneously. This way, we evaluate the extent to which models (especially those with shared parameters) overfit to the development data: a model based on the morphological patterning of the Indo-European languages may end up with a bias towards suffixing and will struggle to learn prefixing or infixation.

Meet our Languages
In the 2020 shared task we cover 15 language families: Afro-Asiatic, Algic, Austronesian, Dravidian, Indo-European, Niger-Congo, Oto-Manguean, Sino-Tibetan, Siouan, Songhay, Southern Daly, Tungusic, Turkic, Uralic, and Uto-Aztecan. Five language families were used for the Development Phase while ten were held out for the Generalization Phase. Tab. 1 and Tab. 2 provide information on the languages, their families, and sources of data. In the following section, we provide an overview of each language family's morphological system.

Afro-Asiatic
The Afro-Asiatic language family, consisting of six branches and over 300 languages, is among the largest language families in the world. It is mainly spoken in Northern, Western and Central Africa as well as West Asia and spans large modern languages such as Arabic, in addition to ancient languages like Biblical Hebrew. Similarly, some of its languages have a long tradition of written form, while others have yet to incorporate a writing system. The six branches differ most notably in typology and syntax, with the Chadic languages being the main source of differences, which has sparked discussion of the division of the family (Frajzyngier, 2018). For example, in the Egyptian and Semitic branches, the root of a verb may not contain vowels, while this is allowed in Chadic. Although only four of the six branches, excluding Chadic and Omotic, use a prefix and suffix in conjugation when adding a subject to a verb, it is considered an important characteristic of the family. In addition, some of the families in the phylum use tone to encode tense, modality and number, among others. However, all branches use objective and passive suffixes. Markers of tense are generally simple, whereas aspect is typically distinguished with more elaborate systems.

Algic
The Algic family comprises languages native to North America (more specifically, the United States and Canada) and contains three branches. Of these, our sample contains Cree, a language from the largest genus, Algonquian, most of whose languages are now extinct. The Algonquian genus is characterized by its concatenative morphology. Cree morphology is also concatenative and suffixing. It distinguishes between impersonal and non-impersonal verbs and presents four apparent declension classes among non-impersonal verbs.

Austronesian
The Austronesian family of languages is largely comprised of languages from the Greater Central Philippine and Oceanic regions. They are characterized by limited morphology, mostly prefixing in nature. Additionally, tense-aspect affixes are predominantly seen as prefixes, though some suffixes are used. In the general case, verbs do not mark number, person, or gender. In Māori, verbs may be suffixed with a marker indicating the passive voice. This marker takes the form of one of twelve endings. These endings are difficult to predict as the language has undergone a loss of word-final consonants and there is no clear link between a stem and the passive suffix that it employs (Harlow, 2007).

Dravidian
The family of Dravidian languages comprises several languages which are primarily spoken across Southern India and Northern Sri Lanka, with over 200 million speakers. The shared task includes Kannada and Telugu. Dravidian languages primarily use the SOV word order.

Indo-European
Languages in the Indo-European family are native to most of Europe and a large part of Asia, with our sample including languages from the genera Germanic, Indic, Iranian, and Romance. This is (arguably) the most well studied language family, containing a few of the highest-resource languages in the world.

Romance
The Romance genus comprises a set of fusional languages evolved from Latin. They traditionally originated in Southern and Southeastern Europe, though they are presently spoken in other continents such as Africa and the Americas. Romance languages mark tense, person, number and mood in verbs, and gender and number in nouns. Inflection is primarily achieved through suffixes, with some verbal person syncretism and suppletion for high-frequency verbs. There is some morphological variation within the genus: French exhibits comparatively less inflection, whereas Romanian exhibits comparatively more and still marks case.

Germanic
The Germanic genus comprises several languages which originated in Northern and Northwestern Europe, and today are spoken in many parts of the world. Verbs in Germanic languages mark tense and mood; in many languages person and number are also marked, predominantly through suffixation. Some Germanic languages exhibit widespread Indo-European ablaut. The gendering of nouns differs between Germanic languages: German nouns can be masculine, feminine or neuter, while English nouns are not marked for gender. In Danish and Swedish, historically masculine and feminine nouns have merged to form one common gender, so nouns are either common or neuter. Marking of case also differs between the languages: German nouns have one of four cases and this case is marked in articles and adjectives as well as nouns and pronouns, while English does not mark noun case (although Old English, which also appears in our language sample, does).

Indo-Iranian
The Indo-Iranian genus contains languages spoken in Iran and across the Indian subcontinent. Over 1.5 billion people worldwide speak an Indo-Iranian language. Within the Indo-European family, Indo-Iranian languages belong to the Satem group of languages. Verbs in Indo-Iranian languages indicate tense, aspect, mood, number and person. In languages such as Hindi, verbs can also express levels of formality. Noun gender is present in some Indo-Iranian languages, such as Hindi, but absent in languages such as Persian. Nouns generally are marked for case. no grammatical evidentials.

Oto-Manguean
The Oto-Manguean languages are a diverse family of tonal languages spoken in central and southern Mexico. Even though all of these languages are tonal, the tonal system within each language varies widely. Some have an inventory of two tones (e.g., Chichimec and Pame), while others have ten tones (e.g., the Eastern Chatino languages of the Zapotecan branch; Palancar and Léonard, 2016).
Oto-Manguean languages are also rich in tonal morphology.
The inflectional system marks person-number and aspect in verbs and person-number in adjectives and noun possessions, relying heavily on tonal contrasts. Other interesting aspects of Oto-Manguean languages include the fact that pronominal inflections use a system of enclitics, and first and second person plural has a distinction between exclusive and inclusive (Campbell, 2016). Tone marking schemes in the writing systems also vary greatly. Some writing systems do not represent tone, others use diacritics, and others represent tones with numbers. In languages that use numbers, single digits represent level tones and double digits represent contour tones. For example, in San Juan Quiahije of Eastern Chatino, the number 1 represents a high tone, the number 4 represents a low tone, the number 14 represents a descending tone contour, and the number 42 represents an ascending tone contour (Cruz, 2014).

Sino-Tibetan
The Sino-Tibetan family is represented by the Tibetan language. Tibetan uses an abugida script and contains complex syllabic components in which vowel marks can be added above and below the base consonant. Tibetan verbs are inflected for tense and mood. Previous studies on Tibetan morphology (Di et al., 2019) indicate that the majority of mispredictions produced by neural models are due to allomorphy. This is followed by generation of nonce words (impossible combinations of vowel and consonant components).

Siouan
The Siouan languages are located in North America, predominantly along the Mississippi and Missouri Rivers and in the Ohio Valley. The family is represented in our task by Dakota, a critically endangered language spoken in North and South Dakota, Minnesota, and Saskatchewan. The Dakota language is largely agglutinating in its derivational morphology and fusional in its inflectional morphology with a mixed affixation system (Rankin et al., 2003). The present task includes verbs, which are marked for first and second person, number, and duality. All three affixation types are found: person is generally marked by an infix but can also appear as a prefix, and plurality is marked by a suffix. Morphophonological processes of fortition and vowel lowering are also present.

Songhay
The Songhay family consists of around eleven or twelve languages spoken in Mali, Niger, Benin, Burkina Faso and Nigeria. In the shared task we use Zarma, the most widely spoken Songhay language. Most of the Songhay languages are predominantly SOV with medium-sized consonant inventories (with implosives), five phonemic vowels, vowel length distinctions, and word level tones, which also are used to distinguish nouns, verbs, and adjectives (Heath, 2014).

Southern Daly
Southern Daly is a small language family of the Northern Territory in Australia that consists of two distantly related languages. In the current task we only have one of the languages, Murrinh-patha (which was initially thought to be a language isolate). Murrinh-patha is classified as polysynthetic with highly complex verbal morphology. Verbal roots are surrounded by prefixes and suffixes that indicate tense, mood, object, and subject. As Mansfield (2019) notes, Murrinh-patha verbs have 39 conjugation classes.

Tungusic
Tungusic languages are spoken principally in Russia, China and Mongolia. In Russia they are concentrated in north and eastern Siberia and in China in the east, in Manchuria. The largest languages in the family are Xibe, Evenki and Even; we use Evenki in the shared task. The languages are of the agglutinating morphological type with a moderate number of cases, 7 for Xibe and 13 for Evenki. In addition to case markers, Evenki marks possession in nominals (including reflexive possession) and distinguishes between alienable and inalienable possession. In terms of morphophonological processes, the languages exhibit vowel harmony, consonant alternations and phonological vowel length.

Turkic
Languages of the Turkic family are primarily spoken in Central Asia. The family is morphologically concatenative, fusional, and suffixing. Turkic languages generally exhibit back vowel harmony, with the notable exception of Uzbek. In addition to harmony in backness, several languages also have labial vowel harmony (e.g., Kyrgyz, Turkmen, among others). In addition, most of the languages have dorsal consonant allophony that accompanies back vowel harmony. Additional morphophonological processes include vowel epenthesis and voicing assimilation. Selection of the inflectional allomorph can frequently be determined from the infinitive morpheme (which frequently reveals vowel backness and roundedness) and also the final segment of the stem.

Uralic
The Uralic languages are spoken in Russia from the north of Siberia to Scandinavia and Hungary in Europe. They are agglutinating with some subgroups displaying fusional characteristics (e.g., the Sámi languages). Many of the languages have vowel harmony. The languages have almost complete suffixal morphology and a medium-sized case inventory, ranging from 5-6 cases to numbers in the high teens. Many of the larger case paradigms are made up of spatial cases, sometimes with distinctions for direction and position. Most of the languages have possessive suffixes, which can express possession, or agreement in non-finite clauses. The paradigms are largely regular, with few, if any, irregular forms. Many exhibit complex patterns of consonant gradation, i.e., consonant mutations that occur in specific morphological forms in some stems. Which gradation category a stem belongs to is often unpredictable. The languages spoken in Russia are typically SOV, while those in Europe have SVO order.

Uto-Aztecan
The Uto-Aztecan family is represented by the Tohono O'odham (Papago-Pima) language spoken along the US-Mexico border in southern Arizona and northern Sonora. O'odham is agglutinative with a mixed prefixing and suffixing system. Nominal and verbal pluralization is frequently realized by partial reduplication of the initial consonant and/or vowel, and occasionally by final consonant deletion or null affixation. Processes targeting vowel length (shortening or lengthening) are also present. A small number of verbs exhibit suppletion in the past tense.

Data Format
Similar to previous years, training and development sets contain triples consisting of a lemma, a target form, and morphosyntactic descriptions (MSDs, or morphological tags). Each MSD is a set of features separated by semicolons. Test sets only contain two fields, i.e., target forms are omitted. All data follows UTF-8 encoding.
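As an illustration of this format, the following minimal sketch reads one data line into a lemma, an optional target form, and a set of features. The tab delimiter, the field order (lemma, form, MSD), and the names Example and parse_line are assumptions made for the example; the text above only states that MSD features are separated by semicolons and that test sets omit the target form.

```python
from typing import FrozenSet, NamedTuple, Optional


class Example(NamedTuple):
    lemma: str
    form: Optional[str]   # None for test lines, where the target form is omitted
    msd: FrozenSet[str]   # the MSD as an unordered set of feature tags


def parse_line(line: str, has_form: bool = True) -> Example:
    """Parse one data line. Tab-separated fields are an assumption."""
    fields = line.rstrip("\n").split("\t")
    expected = 3 if has_form else 2
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}: {line!r}")
    lemma = fields[0]
    form = fields[1] if has_form else None
    # The MSD is always the last field; its features are separated by semicolons.
    msd = frozenset(tag for tag in fields[-1].split(";") if tag)
    return Example(lemma, form, msd)
```

A training line would be parsed with parse_line(line) and a test line with parse_line(line, has_form=False).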

Conversion and Canonicalization
A significant amount of data for this task was extracted from corresponding (language-specific) grammars. In order to allow cross-lingual comparison, we manually converted their features (tags) into the UniMorph format (Sylak-Glassman, 2016). We then canonicalized the converted language data to make sure all tags are consistently ordered and no category (e.g., "Number") is assigned two tags (e.g., singular and plural).
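The two canonicalization checks described above (consistent tag order, at most one tag per category) can be sketched as follows. This is not the script used for the shared task: the tag inventory TAG_CATEGORY is a small hypothetical excerpt, and the category order in CATEGORY_ORDER is an assumption chosen for the example.

```python
from typing import Dict, List

# Hypothetical, heavily truncated tag inventory mapping each tag to its category.
TAG_CATEGORY: Dict[str, str] = {
    "N": "POS", "V": "POS", "ADJ": "POS",
    "PST": "Tense", "PRS": "Tense",
    "1": "Person", "2": "Person", "3": "Person",
    "SG": "Number", "PL": "Number",
}
# Assumed canonical order of categories within an MSD.
CATEGORY_ORDER: List[str] = ["POS", "Tense", "Person", "Number"]


def canonicalize(msd: str) -> str:
    """Return the MSD with tags in canonical order; reject conflicting tags."""
    by_category: Dict[str, str] = {}
    for tag in msd.split(";"):
        if tag not in TAG_CATEGORY:
            raise ValueError(f"unknown tag: {tag!r}")
        category = TAG_CATEGORY[tag]
        # A category such as Number must not receive two different tags.
        if category in by_category and by_category[category] != tag:
            raise ValueError(
                f"{category} assigned twice: {by_category[category]} and {tag}"
            )
        by_category[category] = tag
    return ";".join(by_category[c] for c in CATEGORY_ORDER if c in by_category)
```

Under these assumptions, canonicalize("SG;3;V;PRS") yields "V;PRS;3;SG", while an MSD such as "N;SG;PL" is rejected.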

Splitting
We use only noun, verb, and adjective forms to construct training, development, and evaluation sets. We de-duplicate annotations such that there are no multiple examples of exact lemma-form-tag matches. To create splits, we randomly sample 70%, 10%, and 20% for train, development, and test, respectively. We cap the training set size to 100k examples for each language; where languages exceed this (e.g., Finnish), we subsample to this point, balancing lemmas such that all forms for a given lemma are either included or discarded. Some languages such as Zarma (dje), Tajik (tgk), Lingala (lin), Ludian* (lud), Māori (mao), Sotho (sot), Võro (vro), Anglo-Norman (xno), and Zulu (zul) contain fewer than 400 training samples and are extremely low-resource. Tab. 6 and Tab. 7 in the Appendix provide the number of samples for every language in each split, the number of samples per lemma, and statistics on inconsistencies in the data.
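The splitting procedure can be sketched roughly as below. The function names, the fixed seed, and the greedy way whole lemmas are kept or dropped when capping are assumptions made for the example; the text only states that the split is random, that the cap is 100k, and that all forms of a lemma are either included or discarded.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (lemma, form, msd)


def cap_by_lemma(train: List[Triple], cap: int, rng: random.Random) -> List[Triple]:
    """Subsample so that every lemma is kept with all of its forms or dropped."""
    by_lemma: Dict[str, List[Triple]] = defaultdict(list)
    for triple in train:
        by_lemma[triple[0]].append(triple)
    lemmas = sorted(by_lemma)
    rng.shuffle(lemmas)
    kept: List[Triple] = []
    for lemma in lemmas:
        if len(kept) + len(by_lemma[lemma]) > cap:
            continue  # this lemma would exceed the cap, so drop it entirely
        kept.extend(by_lemma[lemma])
    return kept


def split_data(triples: Iterable[Triple], seed: int = 0, train_cap: int = 100_000):
    """De-duplicate, split 70/10/20 at random, then cap the training set."""
    rng = random.Random(seed)
    unique = sorted(set(triples))  # removes exact lemma-form-tag duplicates
    rng.shuffle(unique)
    n_train = len(unique) * 7 // 10
    n_dev = len(unique) // 10
    train = unique[:n_train]
    dev = unique[n_train:n_train + n_dev]
    test = unique[n_train + n_dev:]
    if len(train) > train_cap:
        train = cap_by_lemma(train, train_cap, rng)
    return train, dev, test
```

Note that the random split in this sketch is over individual triples, so forms of the same lemma may land in different splits.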

Baseline Systems
The organizers provided two types of pre-trained baselines. Their use was optional.

Non-neural
The first baseline was a non-neural system that had been used as a baseline in earlier shared tasks on morphological reinflection (Cotterell et al., 2017, 2018). The system first heuristically extracts lemma-to-form transformations; it assumes that these transformations are suffix- or prefix-based.
A simple majority classifier is used to apply the most frequent suitable transformation to an input lemma, given the morphological tag, yielding the output form. See Cotterell et al. (2017) for further details.
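To make the idea concrete, here is a deliberately simplified sketch of a suffix-only variant: it extracts one suffix rewrite rule per training pair and applies the most frequent rule that fits the input lemma. It is not the actual baseline of Cotterell et al. (2017), which also handles prefixes and uses its own alignment heuristics; the function names and the longest-common-prefix rule extraction are assumptions made for the example.

```python
from collections import Counter, defaultdict
from os.path import commonprefix
from typing import Dict, Iterable, Tuple

Rule = Tuple[str, str]  # (suffix stripped from the lemma, suffix appended)


def extract_rule(lemma: str, form: str) -> Rule:
    """Treat the longest common prefix as the stem; the remainders form the rule."""
    stem = commonprefix([lemma, form])
    return lemma[len(stem):], form[len(stem):]


def train(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Counter]:
    """Count how often each suffix rule occurs for each morphological tag."""
    rules: Dict[str, Counter] = defaultdict(Counter)
    for lemma, form, msd in triples:
        rules[msd][extract_rule(lemma, form)] += 1
    return rules


def inflect(lemma: str, msd: str, rules: Dict[str, Counter]) -> str:
    """Apply the most frequent rule for this tag whose stripped suffix matches."""
    for (strip, add), _count in rules.get(msd, Counter()).most_common():
        if lemma.endswith(strip):
            return lemma[:len(lemma) - len(strip)] + add
    return lemma  # fall back to copying the lemma when no rule applies
```

Because the classifier only counts rule frequencies, it generalizes well to regular suffixing patterns and poorly to stem-internal changes, which is consistent with the assumption of suffix- or prefix-based transformations stated above.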

Neural
The first neural baseline is a neural transducer (Wu and Cotterell, 2019), which is essentially a hard monotonic attention model (mono-*). The second baseline is a transformer (Vaswani et al., 2017) adapted for character-level tasks that currently holds the state of the art on the 2017 SIGMORPHON shared task data (Wu et al., 2020, trm-*). Both models take the lemma and morphological tags as input and output the target inflection. The baselines are further expanded to include the data augmentation technique used by Anastasopoulos and Neubig (2019, -aug-), which is conceptually similar to the one proposed by Silfverberg et al. (2017). Relying on a simple character-level alignment between lemma and form, this technique replaces shared substrings of length > 3 with random characters from the language's alphabet, producing hallucinated lemma-tag-form triples. Both neural baselines were trained in monolingual (*-single) and multilingual (shared parameters among the same family, *-shared) settings.
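The hallucination step can be approximated as in the sketch below. It is a simplification rather than the procedure of Anastasopoulos and Neubig (2019): the alignment is replaced by the single longest common substring found with Python's difflib, and the function name, the alphabet argument, and the min_len parameter are assumptions made for the example.

```python
import random
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple


def hallucinate(
    lemma: str,
    form: str,
    msd: str,
    alphabet: Sequence[str],
    rng: random.Random,
    min_len: int = 4,  # "length > 3" in the description above
) -> Optional[Tuple[str, str, str]]:
    """Replace the longest substring shared by lemma and form with random characters."""
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    match = matcher.find_longest_match(0, len(lemma), 0, len(form))
    if match.size < min_len:
        return None  # the shared material is too short to be treated as a stem
    fake_stem = "".join(rng.choice(alphabet) for _ in range(match.size))
    new_lemma = lemma[:match.a] + fake_stem + lemma[match.a + match.size:]
    new_form = form[:match.b] + fake_stem + form[match.b + match.size:]
    return new_lemma, new_form, msd
```

The same random string is substituted in both the lemma and the form, so the affixal material outside the shared substring is preserved in the hallucinated triple.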

Competing Systems
As Tab. 3 shows, 10 teams submitted 22 systems in total, out of which 19 were neural. Some teams such as ETH Zurich and UIUC built their models on top of the proposed baselines. In particular, ETH Zurich enriched each of the (multilingual) neural baseline models with an exact decoding strategy that uses Dijkstra's search algorithm. UIUC enriched the transformer model with a synchronous bidirectional decoding technique (Zhou et al., 2019) in order to condition the prediction of an affix character on its environment from both sides. (The authors demonstrate positive effects in Oto-Manguean, Turkic, and some Austronesian languages.) A few teams further improved models that were among top performers in previous shared tasks. IMS and Flexica re-used the hard monotonic attention model from Aharoni and Goldberg (2017). IMS developed an ensemble of two models (with left-to-right and right-to-left generation order) with a genetic algorithm for ensemble search (Haque et al., 2016) and iteratively provided hallucinated data. Flexica submitted two neural systems. The first model (flexica-02-1) was a multilingual (family-wise) hard monotonic attention model with an improved alignment strategy. This model is further improved (flexica-03-1) by introducing a data hallucination technique which is based on phonotactic modelling of extremely low-resource languages (Shcherbakov et al., 2016). LTI focused on their earlier model (Anastasopoulos and Neubig, 2019), a neural multi-source encoder-decoder with a two-step attention architecture, training it with hallucinated data, cross-lingual transfer, and romanization of scripts to improve performance on low-resource languages. DeepSpin reimplemented the gated sparse two-headed attention model from Peters and Martins (2019) and trained it on all languages at once (massively multilingual). The team experimented with two modifications of the softmax function: sparsemax (Martins and Astudillo, 2016, deepspin-02-1) and 1.5-entmax (Peters et al., 2019, deepspin-01-1).
Many teams based their models on the transformer architecture.
NYU-CUBoulder experimented with a vanilla transformer model (NYU-CUBoulder-04-0), a pointer-generator transformer that allows for a copy mechanism (NYU-CUBoulder-02-0), and ensembles of three (NYU-CUBoulder-01-0) and five (NYU-CUBoulder-03-0) pointer-generator transformers. For languages with fewer than 1,000 training samples, they also generate hallucinated data. CULing developed an ensemble of three (monolingual) transformers with identical architecture but different input data formats. The first model was trained on the initial data format (lemma, target tags, target form). For the other two models the team used the idea of a lexeme's principal parts (Finkel and Stump, 2007) and augmented the initial input (that only used the lemma as a source form) with entries corresponding to other (non-lemma) slots available for the lexeme. The CMU Tartan team compared the performance of models with transformer-based and LSTM-based encoders and decoders. The team also compared monolingual to multilingual training in which they used several (related and unrelated) high-resource languages for low-resource language training. Although the majority of submitted systems were neural, some teams experimented with non-neural approaches, showing that in certain scenarios they might surpass neural systems. A large group of researchers from CU7565 manually developed finite-state grammars for 25 languages (CU7565-01-0). They additionally developed a non-neural learner for all languages (CU7565-02-0) that uses hierarchical paradigm clustering (based on similarity of string transformation rules between inflectional slots). Another team, Flexica, proposed a model (flexica-01-0) conceptually similar to Hulden et al. (2014), although they did not attempt to reconstruct the paradigm itself and treated transformation rules independently, assigning each of them a score based on its frequency and specificity as well as the diversity of the characters surrounding the pattern.
(For example, the English plural noun formation rule "* → *s" has high diversity, whereas a past tense rule such as "*a* → *oo*", as in (understand, understood), has low diversity.)

Evaluation
This year, we instituted a slightly different evaluation regimen than in previous years, which takes into account the statistical significance of differences between systems and allows for an informed comparison across languages and families better than a simple macro-average.
The process works as follows:
1. For each language, we rank the systems according to their accuracy (or Levenshtein distance). To do so, we use paired bootstrap resampling (Koehn, 2004) to only take statistically significant differences into account. That way, any system which is the same (as assessed via statistical significance) as the best performing one is also ranked 1st for that language.
2. For the set of languages where we want collective results (e.g., languages within a linguistic genus), we aggregate the systems' ranks and re-rank them based on the number of times they ranked 1st, 2nd, 3rd, etc. Table 4 illustrates an example of this process using four Zapotecan languages and six systems.
Table 4: Illustration of our ranking method, over the four Zapotecan languages. Note: The final ranking is based on the actual counts (#1, #2, etc.), not on the system's average rank.
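The two ingredients of this procedure can be sketched as follows. This is an illustrative reading of the description, not the organizers' evaluation script: the function names, the number of bootstrap samples, and the fixed seed are assumptions, and the significance threshold applied to the bootstrap result is left to the caller because it is not stated here.

```python
import random
from typing import Dict, List, Sequence


def paired_bootstrap_wins(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A has more correct items than B.

    correct_a[i] and correct_b[i] are 1 if the system got test item i right, else 0.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_samples):
        # Resample the same item indices (with replacement) for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_a[i] for i in idx) > sum(correct_b[i] for i in idx):
            wins += 1
    return wins / n_samples


def aggregate_ranks(ranks: Dict[str, List[int]]) -> List[str]:
    """Order systems by how often they ranked 1st, then 2nd, and so on."""
    max_rank = max(r for per_language in ranks.values() for r in per_language)

    def sort_key(system: str) -> List[int]:
        counts = [ranks[system].count(r) for r in range(1, max_rank + 1)]
        return [-c for c in counts]  # more 1st places first, ties broken by 2nd places

    return sorted(ranks, key=sort_key)
```

As a small illustration of why counts and averages can disagree, a system with per-language ranks [1, 1, 4, 4] (average 2.5) is placed above one with ranks [1, 2, 2, 2] (average 1.75) under the count-based ordering, because it has more first places.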

Results
This year we had four winning systems (i.e., ones that outperform the best baseline): CULing-01-0, deepspin-02-1, uiuc-01-0, and deepspin-01-1, all neural. As Tab. 5 shows, they achieve over 90% accuracy. Although CULing-01-0 and uiuc-01-0 are both monolingual transformers that do not use any hallucinated data, they follow different strategies to improve performance. The strategy proposed by CULing-01-0, enriching the input data with extra entries that include non-lemma forms and their tags as a source form, enabled their system to be among the top performers on all language families; uiuc-01-0, on the other hand, did not modify the data but rather changed the decoder to be bidirectional and performed family-wise fine-tuning of each (monolingual) model. The system is also among the top performers on all language families except Iranian. The third team, DeepSpin, trained and fine-tuned their models on all language data. Both models are ranked high (although the sparsemax model, deepspin-02-1, performs better overall) on most language groups, with the exception of Algic. Sparsemax was also found useful by CMU-Tartan. The neural ensemble model with data augmentation from the IMS team shows superior performance on languages with smaller data sizes (under 10,000 samples). The LTI and Flexica teams also observed positive effects of multilingual training and data hallucination on low-resource languages. The latter was also found useful in the ablation study made by the NYU-CUBoulder team. Several teams aimed to address particular research questions; we will further summarize their results.

Is developing morphological grammars manually worthwhile? This was the main question asked by CU7565, who manually designed finite-state grammars for 25 languages. Paradigms of some languages were relatively easy to describe, but neural networks also performed quite well on them even with a limited amount of data. For low-resource languages such as Ingrian and Tagalog the grammars demonstrate superior performance, but this comes at the expense of a significant amount of person-hours.
What is the best training strategy for low-resource languages? Teams that generated hallucinated data highlighted its utility for low-resource languages. Augmenting the data with tuples where lemmas are replaced with non-lemma forms and their tags is another technique that was found useful. In addition, multilingual training and ensembles yield extra gain in terms of accuracy.

Are the systems complementary? To address this question, we evaluate oracle scores for baseline systems, submitted systems, and all of them together. Typically, as Tables 8-21 in the Appendix demonstrate, the baselines and the submissions are complementary: adding them together increases the oracle score. Furthermore, while the full systems tend to dominate the partial systems (those designed for a subset of languages, such as CU7565-01-0), there are a number of cases where the partial systems find the solution when the full systems do not, and these languages often then get even bigger gains when combined with the baselines. This even happens when the accuracy of the baseline is very high: Finnish has a baseline oracle of 99.89, a full-systems oracle of 99.91, a submission oracle of 99.94, and a complete oracle of 99.96, so an ensemble might be able to improve on the results. The largest gaps in oracle systems are observed in the Algic, Oto-Manguean, Sino-Tibetan, Southern Daly, Tungusic, and Uto-Aztecan families (please see the results per language here: https://docs.google.com/spreadsheets/d/1ODFRnHuwN-mvGtzXA1sNdCi-jNqZjiE-i9jRxZCK0kg/edit?usp=sharing).
Has morphological inflection become a solved problem in certain scenarios? The results shown in Fig. 2 suggest that for some of the development language families, such as Austronesian and Niger-Congo, the task was relatively easy, with most systems achieving high accuracy, whereas the task was more difficult for Uralic and Oto-Manguean languages, which showed greater variability in level of performance across submitted systems. Languages such as Ludic (lud), Norwegian Nynorsk (nno), Middle Low German (gml), Evenki (evn), and O'odham (ood) seem to be the most challenging languages based on simple accuracy. For a more fine-grained study, we have classified test examples into four categories: "very easy", "easy", "hard", and "very hard". "Very easy" examples are ones that all submitted systems got correct, while "very hard" examples are ones that no submitted system got correct. "Easy" examples were predicted correctly by 80% of systems, and "hard" examples were predicted correctly by only 20% of systems. Fig. 3, Fig. 4, and Fig. 5 show the percentage of noun, verb, and adjective samples that fall into each category and illustrate that most language samples are correctly predicted by the majority of the systems. For noun declension, Old English (ang), Middle Low German (gml), Evenki (evn), O'odham (ood), and Võro (vro) are the most difficult (some of this difficulty comes from language data inconsistency, as described in the following section). For adjective declension, Classical Syriac presents the highest difficulty (likely due to its limited data).
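The oracle score and the difficulty categories discussed above can be sketched as follows. This is an illustration under stated assumptions, not the organizers' evaluation script: the function names are hypothetical, the 80% and 20% values are treated as inclusive thresholds, and examples falling between the thresholds get a placeholder label `"other"`, since the text does not say how they are handled.

```python
def oracle_accuracy(predictions, gold):
    """Share of test items that at least one system predicts correctly.

    predictions: dict mapping system name -> list of predicted forms,
    aligned with the list of gold forms.
    """
    hits = sum(
        1
        for i, target in enumerate(gold)
        if any(preds[i] == target for preds in predictions.values())
    )
    return hits / len(gold)


def difficulty_labels(predictions, gold, easy=0.8, hard=0.2):
    """Label each test item by the share of systems that got it right.

    The inclusive thresholds and the 'other' label are assumptions.
    """
    n_systems = len(predictions)
    labels = []
    for i, target in enumerate(gold):
        correct = sum(preds[i] == target for preds in predictions.values())
        share = correct / n_systems
        if correct == n_systems:
            labels.append("very easy")
        elif correct == 0:
            labels.append("very hard")
        elif share >= easy:
            labels.append("easy")
        elif share <= hard:
            labels.append("hard")
        else:
            labels.append("other")
    return labels


if __name__ == "__main__":
    gold = ["a", "b", "c", "d", "e"]
    systems = {
        "s1": ["a", "b", "c", "x", "e"],
        "s2": ["a", "b", "x", "x", "e"],
        "s3": ["a", "b", "x", "x", "e"],
        "s4": ["a", "b", "x", "x", "x"],
        "s5": ["a", "x", "x", "x", "x"],
    }
    print(oracle_accuracy(systems, gold))
    print(difficulty_labels(systems, gold))
```

Computing the oracle over different subsets of the `predictions` dictionary (baselines only, submissions only, all systems) gives the kind of comparison described for Finnish.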

Error Analysis
In our error analysis we follow the error type taxonomy proposed in Gorman et al. (2019). First, we evaluate systematic errors due to inconsistencies in the data, followed by an analysis of whether having seen the language or its family improved accuracy. We then proceed with an overview of accuracy for each of the language families. For a select number of families, we provide a more detailed analysis of the error patterns.
Tab. 6 and Tab. 7 provide the number of samples in the training, development, and test sets, the percentage of inconsistent entries (the same lemma-tag pair has multiple inflected forms) in them, the percentage of contradicting entries (the same lemma-tag pair occurs in the train and development or test sets but is assigned to different inflected forms), and the percentage of entries in the development or test sets containing a lemma observed in the training set. The train, development, and test sets contain 2%, 0.3%, and 0.6% inconsistent entries, respectively. Azerbaijani (aze), Old English (ang), Cree (cre), Danish (dan), Middle Low German (gml), Kannada (kan), Norwegian Bokmål (nob), Chichimec (pei), and Veps (vep) had the highest rates of inconsistency. These languages also exhibit the highest percentage of contradicting entries. The inconsistencies in some Finno-Ugric languages (such as Veps and Ludic) are due to dialectal variations.
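The three data statistics defined above can be computed along these lines. This is a minimal sketch that assumes each entry is a `(lemma, form, tag)` triple and that percentages are taken over entries; the function names are hypothetical and the exact counting convention behind Tab. 6 and Tab. 7 may differ.

```python
from collections import defaultdict


def inconsistent_share(entries):
    """Share of entries whose (lemma, tag) pair has more than one form
    within the same split."""
    forms = defaultdict(set)
    for lemma, form, tag in entries:
        forms[(lemma, tag)].add(form)
    bad = sum(1 for lemma, _, tag in entries if len(forms[(lemma, tag)]) > 1)
    return bad / len(entries)


def contradicting_share(train, heldout):
    """Share of held-out entries whose (lemma, tag) pair occurs in train
    but only with different inflected forms."""
    train_forms = defaultdict(set)
    for lemma, form, tag in train:
        train_forms[(lemma, tag)].add(form)
    bad = sum(
        1
        for lemma, form, tag in heldout
        if (lemma, tag) in train_forms and form not in train_forms[(lemma, tag)]
    )
    return bad / len(heldout)


def seen_lemma_share(train, heldout):
    """Share of held-out entries whose lemma was observed in train."""
    train_lemmas = {lemma for lemma, _, _ in train}
    return sum(1 for lemma, _, _ in heldout if lemma in train_lemmas) / len(heldout)


if __name__ == "__main__":
    train = [
        ("walk", "walked", "V;PST"),
        ("walk", "walkt", "V;PST"),
        ("sing", "sang", "V;PST"),
    ]
    dev = [("sing", "singed", "V;PST"), ("run", "ran", "V;PST")]
    print(inconsistent_share(train))
    print(contradicting_share(train, dev))
    print(seen_lemma_share(train, dev))
```

In the toy data, the two competing past-tense forms of "walk" make two of the three training entries inconsistent, and the dev entry for "sing" contradicts the training set.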
The overall accuracy of system and language pairings appeared to improve with an increase in the size of the dataset (Fig. 6; see also Fig. 7 for accuracy trends by language family and Fig. 8 for accuracy trends by system). Overall, the variance was considerable regardless of whether the language family or even the language itself had been observed during the Development Phase. A linear mixed-effects regression was used to assess variation in accuracy using fixed effects of language category, the size of the training dataset (log count), and their interactions, as well as random intercepts for system and language family accuracy.10 Language category was sum-coded with three levels: development language-development family, surprise language-development family, or surprise language-surprise family.
A significant effect of dataset size was observed, such that a one-unit increase in log count corresponded to an approximately 2% increase in accuracy ($\beta = 0.019$, $p < 0.001$). Language category type also significantly influenced accuracy: both development languages and surprise languages from development families were less accurate on average ($\beta_{\text{dev-dev}} = -0.145$, $\beta_{\text{sur-dev}} = -0.167$, each $p < 0.001$). These main effects were, however, significantly modulated by interactions with dataset size: on top of the main effect of dataset size, accuracy for development languages increased an additional ≈ 1.7% per unit of log count ($\beta_{\text{dev-dev} \times \text{size}} = 0.017$, $p < 0.001$), and accuracy for surprise languages from development families increased an additional ≈ 2.9% ($\beta_{\text{sur-dev} \times \text{size}} = 0.029$, $p < 0.001$).
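To make the interaction terms concrete, the reported coefficients can be combined arithmetically. The sketch below only restates the estimates given in the text; it assumes accuracy is on a 0-1 scale, leaves the base of the logarithm unspecified (it is not stated here), and uses hypothetical function names. With sum coding, a category coefficient is read as a deviation from the overall (grand-mean) regression line.

```python
# Fixed-effect estimates as reported in the text.
BETA_SIZE = 0.019
BETA_CATEGORY = {"dev-dev": -0.145, "sur-dev": -0.167}
BETA_CATEGORY_X_SIZE = {"dev-dev": 0.017, "sur-dev": 0.029}


def size_slope(category):
    """Total change in accuracy per unit of log count for a category:
    the main effect of size plus the category-by-size interaction."""
    return BETA_SIZE + BETA_CATEGORY_X_SIZE[category]


def category_deviation(category, log_count):
    """Deviation of a category from the overall regression line at a
    given log count (category offset plus interaction contribution)."""
    return BETA_CATEGORY[category] + BETA_CATEGORY_X_SIZE[category] * log_count


def break_even_log_count(category):
    """Log count at which the category's negative offset is cancelled
    by its steeper slope."""
    return -BETA_CATEGORY[category] / BETA_CATEGORY_X_SIZE[category]


if __name__ == "__main__":
    for name in BETA_CATEGORY:
        print(name, size_slope(name), break_even_log_count(name))
```

Under these assumptions the combined slopes are 0.036 for development languages and 0.048 for surprise languages from development families, so both categories start below the overall line but gain on it as the training set grows.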

Afro-Asiatic: This family was represented by three languages. Mean accuracy across systems was above average at 91.7%. Relative to other families, variance in accuracy was low, but accuracy nevertheless ranged from 41.1% to 99.0%.
Algic: This family was represented by one language, Cree. Mean accuracy across systems was below average at 65.1%. Relative to other families, variance in accuracy was low, ranging from 41.5% to 73%. All systems appeared to struggle with the choice of preverbal auxiliary. Some auxiliaries were overloaded: 'kitta' could refer to future, imperfective, or imperative. The morphological features for mood and tense were also frequently combined, such as SBJV+OPT (subjunctive plus optative mood). While the paradigms were very large, there were very few lemmas (28 impersonal verbs and 14 transitive verbs), which may have contributed to the lower accuracy. Interestingly, the inflections could largely be generated by rules.11
Austronesian: This family was represented by five languages. Mean accuracy across systems was around average at 80.5%. Relative to other families, variance in accuracy was high, with accuracy ranging from 39.5% to 100%. One may notice a discrepancy in the difficulty of processing different Austronesian languages. For instance, we see a difference of over 10% in the baseline performance of Cebuano (84%) and Hiligaynon (96%).12 This could come from the fact that Cebuano only has partial reduplication while Hiligaynon has full reduplication. Furthermore, the prefix choice for Cebuano is more irregular, making it more difficult to predict the correct conjugation of the verb.
Dravidian: This family was represented by two languages: Kannada and Telugu. Mean accuracy across systems was around average at 82.2%. Relative to other families, variance in accuracy was high: system accuracy ranged from 44.6% to 96.0%. Accuracy for Telugu was systematically higher than accuracy for Kannada.

Indo-European: This family was represented by 29 languages and four main branches. Mean accuracy across systems was slightly above average at 86.9%. Relative to other families, variance in accuracy was very high: system accuracy ranged from 0.02% to 100%. For Indo-Aryan, mean accuracy was high (96.0%) with low variance; for Germanic, mean accuracy was slightly below average (79.0%) but with very high variance (ranging from 0.02% to 99.5%); for Romance, mean accuracy was high (93.4%) but also had high variance (ranging from 23.5% to 99.8%); and for Iranian, mean accuracy was high (89.2%), but again with high variance (ranging from 25.0% to 100%). Languages from the Germanic branch of the Indo-European family were included in the Development Phase.
Niger-Congo: This family was represented by ten languages. Mean accuracy across systems was very good at 96.4%. Relative to other families, variance in accuracy was low, with accuracy ranging from 62.8% to 100%. Most languages in this family are considered low resource, and the resources used for data gathering may have been biased towards the languages' regular forms. As such, this high accuracy may not be representative of the "easiness" of the task in this family. Languages from the Niger-Congo family were included in the Development Phase.

Oto-Manguean: This family was represented by nine languages. Mean accuracy across systems was slightly below average at 78.5%. Relative to other families, variance in accuracy was high, with accuracy ranging from 18.7% to 99.1%. Languages from the Oto-Manguean family were included in the Development Phase.
Sino-Tibetan: This family was represented by one language, Bodic. Mean accuracy across systems was average at 82.1%, and variance across systems was also very low. Accuracy ranged from 67.9% to 85.1%. The results are similar to those in Di et al. (2019), where the majority of errors relate to allomorphy and impossible combinations of Tibetan unit components.
Siouan: This family was represented by one language, Dakota. Mean accuracy across systems was above average at 89.4%, and variance across systems was also low, despite the range from 0% to 95.7%. Dakota presented variable prefixing and infixing of person morphemes, along with some complexities related to fortition processes. Determining the factor(s) that governed variation in affix position was difficult from a linguist's perspective, though many systems were largely successful. Success varied in the choice of the first or second person singular allomorphs, which had increasing degrees of consonant strengthening (e.g., /wa/, /ma/, /mi/, /bde/, or /bdu/ for the first person singular and /ya/, /na/, /ni/, /de/, or /du/ for the second person singular). In some cases, these fortition processes were overapplied, and in some cases, entirely missed.
Songhay: This family was represented by one language, Zarma. Mean accuracy across systems was above average at 88.6%, and variance across systems was relatively high. Accuracy ranged from 0% to 100%.
Southern Daly: This family was represented by one language, Murrinh-Patha. Mean accuracy across systems was below average at 73.2%, and variance across systems was relatively high. Accuracy ranged from 21.2% to 91.9%.
Tungusic: This family was represented by one language, Evenki. The overall accuracy was the lowest across families. Mean accuracy was 53.8% with very low variance across systems. Accuracy ranged from 43.5% to 59.0%. The low accuracy is due to several factors. Firstly and primarily, the dataset was created from oral speech samples in various dialects of the language. The Evenki language is known to have rich dialectal variation. Moreover, there was little attempt at any standardization in the oral speech transcription. These peculiarities led to a high number of errors. For instance, some of the systems synthesized a wrong plural form for a noun ending in /-n/. Depending on the dialect, it can be /-r/ or /-l/, and there is a trend to have /-hVl/ for borrowed nouns. Deducing such a rule, as well as the fact that the noun is a loanword, is a hard task. Other suffixes may also have variable forms (such as /-kVllu/ vs. /-kVldu/ for the 2PL imperative, depending on the dialect). Some verbs have irregular past tense forms depending on the dialect and the meaning of the verb (e.g., /o:-/ 'to make' and 'to become'). Next, various dialects exhibit various vowel and consonant changes in suffixes. For example, some dialects (but not all of them) change /w/ to /b/ after /l/, and the systems sometimes synthesized a wrong form. The vowel harmony is complex: not all suffixes obey it, and it is also dialect-dependent. Some suffixes have variants (e.g., /-sin/ and /-s/ for SEMEL (semelfactive)), and the choice between them might be hard to understand. Finally, some of the mistakes are due to the scarcity of the markup scheme. For example, various past tense forms are all annotated as PST, and several comitative suffixes are all annotated as COM. Moreover, some features are present in the word form but receive no annotation at all. It is worth mentioning that some of the predictions could theoretically be possible. To sum up, the Evenki case presents the challenges of oral non-standardized speech.
Turkic: This family was represented by nine languages. Mean accuracy across systems was relatively high at 93%, and relative to other families, variance across systems was low. Accuracy ranged from 51.5% to 100%. Accuracy was lower for Azerbaijani and Turkmen; closer inspection revealed some slight contamination in the 'gold' files for these languages. There was very marginal variation in the accuracy for these languages across systems. Besides these two, accuracies were predominantly above 98%. A few systems struggled with the choice and inflection of the postverbal auxiliary in various languages (e.g., Kyrgyz, Kazakh, and Uzbek).
Uralic: This family was represented by 16 languages. Mean accuracy across systems was average at 81.5%, but the variance across systems and languages was very high. Accuracy ranged from 0% to 99.8%. Languages from the Uralic family were included in the Development Phase.
Uto-Aztecan: This family was represented by one language, O'odham. Mean accuracy across systems was slightly below average at 76.4%, but the variance across systems and languages was fairly low. Accuracy ranged from 54.8% to 82.5%. The systems with higher accuracy may have benefited from better recall of suppletive forms relative to lower accuracy systems.

Conclusion
This year's shared task on morphological reinflection focused on building models that could generalize across an extremely typologically diverse set of languages, many from understudied language families and with limited available text resources. As in previous years, neural models performed well, even in relatively low-resource cases. Submissions were able to make productive use of multilingual training to take advantage of commonalities across languages in the dataset. Data augmentation techniques such as hallucination helped fill in the gaps and allowed networks to generalize to unseen inputs. These techniques, combined with architecture tweaks like sparsemax, resulted in excellent overall performance on many languages (over 90% accuracy on average). However, the task's focus on typological diversity revealed that some morphology types and language families (Tungusic, Oto-Manguean, Southern Daly) remain a challenge for even the best systems. These families are extremely low-resource, represented in this dataset by few languages or a single one. This makes cross-linguistic transfer of similarities by multilingual training less viable. They may also have morphological properties and rules (e.g., Evenki is agglutinating with many possible forms for each lemma) that are particularly difficult for machine learners to induce automatically from sparse data. For some languages (Ingrian, Tajik, Tagalog, Zarma, and Lingala), optimal performance was only achieved in this shared task by hand-encoding linguist knowledge in finite-state grammars. It is up to future research to imbue models with the right kinds of linguistic inductive biases to overcome these challenges.

Figure 8: Accuracy for each language by the log size of the dataset, grouped by submitted system. Points are color- and shape-coded according to language type: development language-development family, surprise language-development family, surprise language-surprise family.

1.0 99.9 BASE: trm-shared
1.0 99.9 BASE: mono-aug-single
1.0 99.9 cmu_tartan_00-0
1.0 99.9 BASE: trm-aug-shared
1.0 99.9 BASE: trm-aug-single
1.0 99.7 cmu_tartan_01-1
1.0 99.7 BASE: mono-aug-shared

Figure 1: Languages in our sample colored by family.

Figure 2: Accuracy by language averaged across all the final submitted systems with their standard deviations. Language families are demarcated by color, with accuracy on development languages (top), and generalization languages (bottom).

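Figure 3: Difficulty of Nouns: Percentage of test samples falling into each category. The total number of test samples for each language is outlined on the top of the plot.

Figure 4: Difficulty of Verbs: Percentage of test samples falling into each category. The total number of test samples for each language is outlined on the top of the plot.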

Figure 5: Difficulty of Adjectives: Percentage of test samples falling into each category. The total number of test samples for each language is outlined on the top of the plot.

Figure 6: Accuracy for each system and language by the log size of the dataset. Points are color-coded according to language type: development language-development family, surprise language-development family, surprise language-surprise family.

Figure 7: Accuracy for each system and language by the log size of the dataset, grouped by language family. Points are color-coded according to language family, and shape-coded according to language type: development language-development family, surprise language-development family, surprise language-surprise family.

Table 2: Surprise languages used in the shared task.

Table 3: The list of systems submitted to the shared task.

Table 5: Aggregate results on all languages. Bolded results are the ones which beat the best baseline. * and italics denote systems that did not submit outputs in all languages (their accuracy is a partial average).

Table 6: Number of samples in training, development, test sets, as well as statistics on systematic errors (inconsistency) and percentage of samples with lemmata observed in the training set.

Table 7: Number of samples in training, development, test sets, as well as statistics on systematic errors (inconsistency) and percentage of samples with lemmata observed in the training set.