The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years’ inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year’s strong baselines or highly ranked systems from previous years’ shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.


Introduction
While producing a sentence, humans combine various types of knowledge to produce fluent output: various shades of meaning are expressed through word selection and tone, while the language is made to conform to underlying structural rules via syntax and morphology. Native speakers are often quick to identify disfluency, even if the meaning of a sentence is mostly clear.
Automatic systems must also consider these constraints when constructing or processing language. Strong enough language models can often reconstruct common syntactic structures, but are insufficient to properly model morphology. Many languages implement large inflectional paradigms that mark both function and content words with varying levels of morphosyntactic information. For instance, Romanian verb forms inflect for person, number, tense, mood, and voice; meanwhile, Archi verbs can take on thousands of forms (Kibrik, 1998). Such complex paradigms produce large inventories of words, all of which must be producible by a realistic system, even though a large percentage of them will never be observed over billions of lines of linguistic input. Compounding the issue, good inflectional systems often require large amounts of supervised training data, which is infeasible to obtain for many of the world's languages.
This year's shared task concentrates on encouraging the construction of strong morphological systems that perform two related but different inflectional tasks. The first task asks participants to create morphological inflectors for a large number of under-resourced languages, encouraging systems that use highly-resourced, related languages as a cross-lingual training signal. The second task welcomes submissions that invert this operation in light of contextual information: given an unannotated sentence, lemmatize each word and tag it with a morphosyntactic description. Both of these tasks extend previous morphological competitions, and the best submitted systems now represent the state of the art in their respective tasks.

Task 1: Cross-lingual transfer for morphological inflection
Annotated resources for the world's languages are not distributed equally: some languages simply have more, as they have more native speakers willing and able to annotate more data. We explore how to transfer knowledge from high-resource languages that are genetically related to low-resource languages.
The first task iterates on last year's main task: morphological inflection.
Instead of giving some number of training examples in the language of interest, we provided only a limited number in that language. To accompany it, we provided a larger number of examples in either a related or unrelated language. Each test example asked participants to produce some other inflected form when given a lemma and a bundle of morphosyntactic features as input. The goal, thus, is to perform morphological inflection in the low-resource language, having hopefully exploited some similarity to the high-resource language. Models which perform well here can aid downstream tasks like machine translation in low-resource settings. All datasets were resampled from UniMorph, which makes them distinct from past years.
The mode of the task is inspired by Zoph et al. (2016), who fine-tune a model pre-trained on a high-resource language to perform well on a low-resource language. We do not, though, require that models be trained by fine-tuning. Joint modeling or any number of methods may be explored instead.
Example The model will have access to type-level data in a low-resource target language, plus a high-resource source language. We give an example here of Asturian as the target language with Spanish as the source language.

Test input (Asturian): baxar V;V.PTCP;PRS
Test output (Asturian): baxando
Evaluation We score the output of each system in terms of its predictions' exact-match accuracy and the average Levenshtein distance between the predictions and their corresponding true forms.
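The two evaluation measures can be illustrated with a minimal sketch. This is not the official evaluation script; the function names are hypothetical, and the edit distance assumes unit costs for insertion, deletion, and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def evaluate(predictions, gold):
    """Return (exact-match accuracy, mean Levenshtein distance)."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    correct = sum(p == g for p, g in zip(predictions, gold))
    total_dist = sum(levenshtein(p, g) for p, g in zip(predictions, gold))
    return correct / len(gold), total_dist / len(gold)


# "baxand" is a made-up wrong prediction, used only for illustration.
accuracy, mean_distance = evaluate(["baxando", "baxand"], ["baxando", "baxando"])
print(accuracy, mean_distance)  # 0.5 0.5
```

The two numbers can diverge: a system that is often almost right has low accuracy but also a low average distance, which is why both are reported.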

Task 2: Morphological analysis in context
Although inflection of words in a context-agnostic manner is a useful evaluation of the morphological quality of a system, people do not learn morphology in isolation.
In 2018, the second task of the CoNLL-SIGMORPHON Shared Task required submitted systems to complete an inflectional cloze task (Taylor, 1953) given only the sentential context and the desired lemma. An example of the problem is given in the following line:
The ___ are barking. (lemma: dog)
A successful system would predict the plural form "dogs". Likewise, a Spanish word form "ayuda" may be a feminine noun or a third-person verb form, which must be disambiguated by context.
This year's task extends the second task from last year. Rather than inflect a single word in context, the task is to provide a complete morphological tagging of a sentence: for each word, a successful system will need to lemmatize and tag it with a morphosyntactic description (MSD). Context is critical: depending on the sentence, identical word forms realize a large number of potential inflectional categories, which will in turn influence lemmatization decisions. If the sentence were instead "The barking dogs kept us up all night", "barking" would be an adjective, and its lemma would also be "barking".

Data for Task 1
Language pairs We presented data in 100 language pairs spanning 79 unique languages. Data for all but four languages (Basque, Kurmanji, Murrinhpatha, and Sorani) are extracted from English Wiktionary, a large multi-lingual crowd-sourced dictionary with morphological paradigms for many lemmata.¹ Twenty of the 100 language pairs are either distantly related or unrelated; this allows speculation about the relative importance of data quantity and linguistic relatedness.
Data format For each language, the basic data consists of triples of the form (lemma, feature bundle, inflected form), as in Table 1. The first feature in the bundle always specifies the core part of speech (e.g., verb). For each language pair, separate files contain the high- and low-resource training examples.
All features in the bundle are coded according to the UniMorph Schema, a cross-linguistically consistent universal morphological feature set (Sylak-Glassman et al., 2015a,b).
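As a rough illustration of the triple format, the sketch below parses one tab-separated line into a lemma, a feature bundle, and an inflected form. The column order (lemma, form, features) and the names `Triple` and `parse_line` are assumptions made for this example, not something the text above specifies.

```python
from typing import NamedTuple, Tuple


class Triple(NamedTuple):
    lemma: str
    features: Tuple[str, ...]
    form: str

    @property
    def pos(self) -> str:
        # The first feature in the bundle is the core part of speech.
        return self.features[0]


def parse_line(line: str) -> Triple:
    """Parse one line, ASSUMING the column order lemma<TAB>form<TAB>features."""
    lemma, form, bundle = line.rstrip("\n").split("\t")
    # UniMorph feature bundles are semicolon-separated.
    return Triple(lemma, tuple(bundle.split(";")), form)


example = parse_line("baxar\tbaxando\tV;V.PTCP;PRS")
print(example.lemma, example.pos, example.features, example.form)
```

Keeping the bundle as a tuple of features rather than one opaque string makes it easy to compare bundles across the high- and low-resource files of a language pair.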
Extraction from Wiktionary For each of the Wiktionary languages, Wiktionary provides a number of tables, each of which specifies the full inflectional paradigm for a particular lemma. As in the previous iteration, tables were extracted using the template annotation procedure described by Kirov et al. (2018).
Sampling data splits From each language's collection of paradigms, we sampled the training, development, and test sets as in 2018.² Crucially, while the data were sampled in the same fashion, the datasets are distinct from those used for the 2018 shared task.
Our first step was to construct probability distributions over the (lemma, feature bundle, inflected form) triples in our full dataset. For each triple, we counted how many tokens the inflected form has in the February 2017 dump of Wikipedia for that language. To distribute the counts of an observed form over all the triples that have this token as their form, we follow the method used in the previous shared task, training a neural network on unambiguous forms to estimate the distribution over all, even ambiguous, forms. We then sampled 12,000 triples without replacement from this distribution. The first 100 were taken as training data for low-resource settings. The first 10,000 were used as high-resource training sets. As these sets are nested, the highest-count triples tend to appear in the smaller training sets.³
The final 2,000 triples were randomly shuffled and then split in half to obtain development and test sets of 1,000 forms each.⁴ The final shuffling was performed to ensure that the development set is similar to the test set. By contrast, the development and test sets tend to contain lower-count triples than the training set.⁵
Other modifications We further adopted some changes to increase compatibility. Namely, we corrected some annotation errors created while scraping Wiktionary for the 2018 task, and we standardized Romanian t-cedilla and t-comma to t-comma. (The same was done with s-cedilla and s-comma.)
1 [...] is discussed in Mansfield (2019). Data for Kurmanji Kurdish and Sorani Kurdish were created as part of the Alexina project.
2 These datasets can be obtained from https://sigmorphon.github.io/sharedtasks/2019/
3 Several high-resource languages necessarily had fewer, but on a similar order of magnitude: Bengali, Uzbek, Kannada, Swahili. Likewise, the low-resource language Telugu had fewer than 100 forms.
4 When sufficient data are unavailable, we instead use 50 or 100 examples.
5 This mimics a realistic setting, as supervised training is usually employed to generalize from frequent words that appear in annotated resources to less frequent words that do not. Unsupervised learning methods also tend to generalize from more frequent words (which can be analyzed more easily by combining information from many contexts) to less frequent ones.

Data for Task 2
Our data for task 2 come from the Universal Dependencies treebanks (UD; Nivre et al., 2018, v2.3), which provide pre-defined training, development, and test splits and annotations in a unified annotation schema for morphosyntax and dependency relationships. Unlike the 2018 cloze task, which used UD data, we require no manual data preparation and are able to leverage all 107 monolingual treebanks. As is typical, data are presented in CoNLL-U format,⁶ although we modify the morphological feature and lemma fields.
6 https://universaldependencies.org/format.html

Data conversion
The morphological annotations for the 2019 shared task were converted to the UniMorph schema (Kirov et al., 2018) according to McCarthy et al. (2018), who provide a deterministic mapping that increases agreement across languages. This also moves the part of speech into the bundle of morphological features. We do not attempt to individually correct any errors in the UD source material. Further, some languages received additional pre-processing. In the Finnish data, we removed morpheme boundaries that were present in the lemmata (e.g., puhe#kieli → puhekieli 'spoken+language'). Russian lemmata in the GSD treebank were presented in all uppercase; to match the 2018 shared task, we lowercased these. In development and test data, all fields except for form and index within the sentence were struck.
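To make the Task 2 input and output concrete, here is a small sketch of reading CoNLL-U text and blanking every field except the index and the form. It assumes that the converted UniMorph bundle sits in the standard FEATS column (the sixth of the ten CoNLL-U columns); the sample sentence and its tags are illustrative and are not taken from the released data.

```python
# Illustrative rows: (index, form, lemma, UniMorph-style MSD). Tags are made up.
ROWS = [
    ("1", "The", "the", "DET;DEF"),
    ("2", "dogs", "dog", "N;PL"),
    ("3", "are", "be", "V;PRS;PL"),
    ("4", "barking", "bark", "V;V.PTCP;PRS"),
    ("5", ".", ".", "_"),
]
SAMPLE = "# text = The dogs are barking.\n" + "".join(
    "\t".join([i, form, lemma, "_", "_", msd, "_", "_", "_", "_"]) + "\n"
    for i, form, lemma, msd in ROWS
)


def read_conllu(text):
    """Yield sentences as lists of (index, form, lemma, msd) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
            sentence.append((cols[0], cols[1], cols[2], cols[5]))
    if sentence:
        yield sentence


def strip_annotations(text):
    """Replace every field except ID and FORM with '_' (comment lines are kept)."""
    out = []
    for line in text.splitlines():
        if line and not line.startswith("#"):
            cols = line.split("\t")
            line = "\t".join(cols[:2] + ["_"] * (len(cols) - 2))
        out.append(line)
    return "\n".join(out)


gold = next(read_conllu(SAMPLE))
blind = next(read_conllu(strip_annotations(SAMPLE)))
print(gold[1], blind[1])
```

A system receives the blinded version and must restore the lemma and MSD for each token.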

Task 1 Baseline
We include four neural sequence-to-sequence models mapping lemmata to inflected word forms: soft attention (Luong et al., 2015), non-monotonic hard attention (Wu et al., 2018), monotonic hard attention, and a variant with an offset-based transition distribution. Neural sequence-to-sequence models with soft attention (Luong et al., 2015) have dominated previous SIGMORPHON shared tasks (Cotterell et al., 2017). Wu et al. (2018) instead model the alignment between characters in the lemma and the inflected word form explicitly with hard attention and learn this alignment and transduction jointly. Enforcing strict monotonicity with hard attention has been shown to be beneficial in tasks such as morphological inflection, where the transduction is mostly monotonic. The encoder is a biLSTM while the decoder is a left-to-right LSTM. All models use multiplicative attention and have roughly the same number of parameters. In the model, a morphological tag is fed to the decoder along with target character embeddings to guide the decoding. During the training of the hard attention model, dynamic programming is applied to marginalize all latent alignments exactly.
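The exact marginalization over latent alignments can be sketched with a forward recursion of the kind used for hidden Markov models. The version below is a simplified illustration, not the baseline code: it assumes a first-order transition table that does not change over time, whereas in the neural models these probabilities are computed from the decoder state. All names and numbers are hypothetical.

```python
from itertools import product


def forward_likelihood(init, trans, emit):
    """Sum p(y, a) over all monotonic alignments a with a forward recursion.

    init[j]     : p(a_1 = j), the first attended lemma position
    trans[i][j] : p(a_t = j | a_{t-1} = i), zero whenever j < i (monotonic)
    emit[t][j]  : p(y_t | a_t = j) for the observed target character y_t
    """
    J = len(init)
    alpha = [init[j] * emit[0][j] for j in range(J)]
    for t in range(1, len(emit)):
        # Monotonicity: position j can only be reached from positions i <= j.
        alpha = [
            emit[t][j] * sum(alpha[i] * trans[i][j] for i in range(j + 1))
            for j in range(J)
        ]
    return sum(alpha)


def brute_force_likelihood(init, trans, emit):
    """Enumerate every alignment explicitly; exponential, for checking only."""
    J, T = len(init), len(emit)
    total = 0.0
    for a in product(range(J), repeat=T):
        p = init[a[0]] * emit[0][a[0]]
        for t in range(1, T):
            p *= trans[a[t - 1]][a[t]] * emit[t][a[t]]
        total += p
    return total


# Toy numbers: 3 lemma positions, 3 target characters.
init = [0.7, 0.2, 0.1]
trans = [[0.5, 0.3, 0.2], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1, 0.1], [0.2, 0.8, 0.3], [0.1, 0.2, 0.7]]
print(forward_likelihood(init, trans, emit))
print(brute_force_likelihood(init, trans, emit))
```

The recursion costs on the order of T times J squared operations instead of enumerating J to the power T alignments, which is what makes exact training with a latent alignment practical.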

Task 2 Baselines
Non-neural (Müller et al., 2015): The Lemming model is a log-linear model that performs joint morphological tagging and lemmatization. The model is globally normalized with the use of a second-order linear-chain CRF. To efficiently calculate the partition function, the choice of lemmata is pruned with the use of pre-extracted edit trees.
Neural (Malaviya et al., 2019): This is a state-of-the-art neural model that also performs joint morphological tagging and lemmatization, but also accounts for the exposure bias with the application of maximum likelihood (MLE). The model stitches the tagger and lemmatizer together with the use of jackknifing (Agić and Schluter, 2017) to expose the lemmatizer to the errors made by the tagger model during training. The morphological tagger is based on a character-level biLSTM embedder that produces the embedding for a word, and a word-level biLSTM tagger that predicts a morphological tag sequence for each word in the sentence. The lemmatizer is a neural sequence-to-sequence model that uses the decoded morphological tag sequence from the tagger as an additional attribute. The model uses hard monotonic attention instead of standard soft attention, along with a dynamic programming based training scheme.
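The idea behind jackknifing can be shown with a toy sketch: every training example receives a tag predicted by a tagger that was trained without that example, so the downstream lemmatizer sees realistic tagger errors rather than gold tags. The most-frequent-tag tagger below is only a stand-in for the neural tagger, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict


def train_tagger(examples):
    """Toy stand-in tagger: most frequent tag per word form."""
    counts = defaultdict(Counter)
    for form, tag in examples:
        counts[form][tag] += 1
    # Unseen forms fall back to the overall most frequent tag.
    fallback = Counter(tag for _, tag in examples).most_common(1)[0][0]
    table = {form: c.most_common(1)[0][0] for form, c in counts.items()}
    return lambda form: table.get(form, fallback)


def jackknife_tags(examples, k=5):
    """Predict a tag for every example with a tagger that never saw it."""
    if k < 2 or len(examples) < 2:
        raise ValueError("need at least two folds and two examples")
    predicted = [None] * len(examples)
    for fold in range(k):
        # Train on every example outside the current fold.
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        tagger = train_tagger(train)
        # Tag only the held-out examples of this fold.
        for i in range(fold, len(examples), k):
            predicted[i] = tagger(examples[i][0])
    return predicted


data = [("ayuda", "N"), ("ayuda", "V"), ("ayuda", "N"),
        ("casa", "N"), ("ayuda", "N"), ("come", "V")]
print(jackknife_tags(data, k=3))
```

In this toy run the verbal reading of "ayuda" is mis-tagged as a noun, which is exactly the kind of error a lemmatizer trained on jackknifed tags learns to tolerate.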

Results
The SIGMORPHON 2019 shared task received 30 submissions (14 for task 1 and 16 for task 2) from 23 teams. In addition, the organizers' baseline systems were evaluated.

Task 1 Results
Five teams participated in the first task, with a variety of methods aimed at leveraging the cross-lingual data to improve system performance.
The University of Alberta (UAlberta) performed a focused investigation on four language pairs, training cognate-projection systems from external cognate lists. Two methods were considered: one that trained a neural encoder-decoder on the high-resource language (HRL) and projected the test data into the HRL, and one that projected the HRL data into the low-resource language (LRL) and trained a combined system. Results demonstrated that certain language pairs may be amenable to such methods.
The Tuebingen University submission (Tuebingen) aligned source and target to learn a set of edit actions, with both linear and neural classifiers that independently learned to predict action sequences for each morphological category. Adding in the cross-lingual data only led to modest gains.
AX-Semantics combined the low- and high-resource data to train an encoder-decoder seq2seq model, optionally also implementing domain adaptation methods to focus later epochs on the target language.
The CMU submission first attends over a decoupled representation of the desired morphological sequence before using the updated decoder state to attend over the character sequence of the lemma. Second, in order to reduce the bias of the decoder's language model, they hallucinate two types of data that encourage common affixes and character copying. Simply allowing the model to learn to copy characters for several epochs significantly outperforms the task baseline, while further improvements are obtained through fine-tuning. With an adversarial language discriminator, cross-lingual gains are highly correlated with linguistic similarity, while augmenting the data with hallucinated forms and multiple related target languages further improves the model.
The system from IT-IST also attends separately to tags and lemmas, using a gating mechanism to interpolate the importance of the individual attentions. By combining the gated dual-head attention with a SparseMax activation function, they are able to jointly learn stem and affix modifications, improving significantly over the baseline system.
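For readers unfamiliar with sparsemax, the sketch below implements the standard sorting-based formulation and contrasts it with softmax. It is a generic illustration of the activation function, not the IT-IST implementation; the function names are chosen for this example.

```python
import math


def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex."""
    z = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for i, value in enumerate(z, start=1):
        cumulative += value
        # The support is the longest sorted prefix satisfying this condition.
        if 1.0 + i * value > cumulative:
            tau = (cumulative - 1.0) / i
    # Scores at or below the threshold tau receive exactly zero probability.
    return [max(s - tau, 0.0) for s in scores]


def softmax(scores):
    """Standard softmax, for comparison: every output is strictly positive."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(sparsemax([1.0, 0.0, -1.0]))  # [1.0, 0.0, 0.0]
print(softmax([1.0, 0.0, -1.0]))
```

Because sparsemax can assign exactly zero weight, an attention head using it can ignore irrelevant characters or tags entirely, whereas softmax always spreads some mass over every position.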
The relative system performance is described in Table 5, which shows the average per-language accuracy of each system. The table reflects the fact that some teams submitted more than one system (e.g. Tuebingen-1 & Tuebingen-2 in the table).

Task 2 Results
Nine teams submitted system papers for Task 2, with several interesting modifications to either the baseline or other prior work that led to modest improvements.
Charles-Saarland achieved the highest overall tagging accuracy by leveraging multi-lingual BERT embeddings fine-tuned on a concatenation of all available languages, effectively transporting the cross-lingual objective of Task 1 into Task 2. Lemmas and tags are decoded separately (with a joint encoder and separate attention); lemmas are a sequence of edit actions, while tags are calculated jointly. (There is no splitting of tags into features; tags are atomic.)
CBNU instead lemmatizes using a transformer network, while performing tagging with a multilayer perceptron with biaffine attention. Input words are first lemmatized, and then pipelined to the tagger, which produces atomic tag sequences (i.e., no splitting of features).
The team from Istanbul Technical University (ITU) jointly produces lemmatic edit actions and morphological tags via a two-level encoder (first word embeddings, and then context embeddings) and separate decoders. Their system slightly improves over the baseline lemmatization, but significantly improves tagging accuracy.
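Several of the systems above predict lemmas as edit actions rather than generating characters directly. As a hedged illustration of what such a representation can look like, the sketch below derives copy, delete, and insert actions from a (form, lemma) pair using Python's standard `difflib`. The action inventory is an assumption for this example and does not reproduce any team's actual scheme.

```python
from difflib import SequenceMatcher


def edit_actions(form, lemma):
    """Character-level actions that rewrite `form` into `lemma`."""
    actions = []
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            actions.append(("copy", i2 - i1))
        elif op == "delete":
            actions.append(("delete", i2 - i1))
        elif op == "insert":
            actions.append(("insert", lemma[j1:j2]))
        else:  # "replace" is split into a delete followed by an insert
            actions.append(("delete", i2 - i1))
            actions.append(("insert", lemma[j1:j2]))
    return actions


def apply_actions(form, actions):
    """Run an action sequence over `form` and return the resulting string."""
    out, pos = [], 0
    for name, arg in actions:
        if name == "copy":
            out.append(form[pos:pos + arg])
            pos += arg
        elif name == "delete":
            pos += arg
        else:  # insert
            out.append(arg)
    return "".join(out)


actions = edit_actions("dogs", "dog")
print(actions)                         # [('copy', 3), ('delete', 1)]
print(apply_actions("cats", actions))  # cat
```

The second call shows why this representation is attractive: the same short action sequence lemmatizes an unseen word of the same shape, so a classifier over action sequences can generalize across lexemes.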
The team from the University of Groningen (RUG) also uses separate decoders for lemmatization and tagging, but uses ELMo to initialize the contextual embeddings, leading to large gains in performance. Furthermore, joint training on related languages further improves results.
CMU approaches tagging differently from the multi-task decoding seen so far (the baseline is used for lemmatization). Making use of a hierarchical CRF that first predicts POS (which is subsequently looped back into the encoder), they then seek to predict each feature separately. In particular, predicting POS separately greatly improves results. An attempt to leverage gold typological information led to little gain in the results; experiments suggest that the system is already learning the pertinent information.
The team from Ohio State University (OHIOSTATE) concentrates on predicting tags; the baseline lemmatizer is used for lemmatization. To that end, they make use of a dual decoder that first predicts features given only the word embedding as input; the predictions are fed to a GRU seq2seq, which then predicts the sequence of tags.
The UNT HiLT+Ling team investigates a low-resource setting of the tagging, by using parallel Bible data to learn a translation matrix between English and the target language, learning morphological tags through analogy with English.
The UFAL-Prague team extends their submission from the UD shared task (multi-layer LSTM), replacing the pretrained embeddings with BERT, to great success (first in lemmatization, second in tagging). Although they predict complete tags, they use the individual features to regularize the decoder. Small gains are also obtained from joining multilingual corpora and ensembling.
CUNI-Malta performs lemmatization as operations over edit actions with LSTM and ReLU. Tagging is a bidirectional LSTM augmented by the edit actions (i.e., two-stage decoding), predicting features separately.
The Edinburgh system is a character-based LSTM encoder-decoder with attention, implemented in OpenNMT. It can be seen as an extension of the contextual lemmatization system Lematus (Bergmanis and Goldwater, 2018) to include morphological tagging, or alternatively as an adaptation of the morphological re-inflection system MED (Kann and Schütze, 2016) to incorporate context and perform analysis rather than re-inflection. Like these systems, it uses a completely generic encoder-decoder architecture with no specific adaptation to the morphological processing task other than the form of the input. In the submitted version of the system, the input is split into short chunks corresponding to the target word plus one word of context on either side, and the system is trained to output the corresponding lemmas and tags for each three-word chunk.
Several teams relied on external resources to improve their lemmatization and feature analysis. Several teams made use of pre-trained embeddings. CHARLES-SAARLAND-2 and UFALPRAGUE-1 used pretrained contextual embeddings (BERT) provided by Google (Devlin et al., 2019). CBNU-1 used a mix of pre-trained embeddings from the CoNLL 2017 shared task and fastText. Further, some teams trained their own embeddings to aid performance.

Future Directions
In general, the application of typology to natural language processing (e.g., Gerz et al., 2018) provides an interesting avenue for multilinguality. Further, our shared task was designed to only leverage a single helper language, though many may exist with lexical or morphological overlap with the target language. Techniques like those of Neubig and Hu (2018) may aid in designing universal inflection architectures. Neither task this year included unannotated monolingual corpora. Using such data is well-motivated from an L1-learning point of view, and may affect the performance of low-resource data settings.
In the case of inflection, an interesting future topic could involve departing from orthographic representation and using more IPA-like representations, i.e. transductions over pronunciations. Different languages, in particular those with idiosyncratic orthographies, may offer new challenges in this respect.⁷
Table 9: Task 2 Morph F1 scores
Only one team tried to learn inflection in a multilingual setting, i.e. to use all training data to train one model. Such transfer learning is an interesting avenue of future research, but evaluation could be difficult. Whether any cross-language transfer is actually being learned vs. whether having more data better biases the networks to copy strings is an evaluation step to disentangle.⁸
Creating new data sets that accurately reflect learner exposure (whether L1 or L2) is also an important consideration in the design of future shared tasks. One pertinent facet of this is information about inflectional categories: often the inflectional information is insufficiently prescribed by the lemma, as with the Romanian verbal inflection classes or nominal gender in German.
As we move toward multilingual models for morphology, it becomes important to understand which representations are critical or irrelevant for adapting to new languages; this may be probed in the style of Thompson et al. (2018), and it can be used as a first step toward designing systems that avoid "catastrophic forgetting" as they learn to inflect new languages (Thompson et al., 2019).
Future directions for Task 2 include exploring cross-lingual analysis, in stride with both Task 1 and Malaviya et al. (2018), and leveraging these analyses in downstream tasks.
7 Although some work suggests that working with IPA or phonological distinctive features in this context yields very similar results to working with graphemes (Wiemerslage et al., 2018).
8 This has been addressed by Jin and Kann (2017).

Conclusions
The SIGMORPHON 2019 shared task provided a type-level evaluation on 100 language pairs in 79 languages and a token-level evaluation on 107 treebanks in 66 languages, of systems for inflection and analysis. On task 1 (low-resource inflection with cross-lingual transfer), 14 systems were submitted, while on task 2 (lemmatization and morphological feature analysis), 16 systems were submitted. All used neural network models, completing a trend in past years' shared tasks and other recent work on morphology.
In task 1, gains from cross-lingual training were generally modest, with gains positively correlating with the linguistic similarity of the two languages. In the second task, several methods were implemented by multiple groups, with the most successful systems implementing variations of multi-headed attention, multi-level encoding, multiple decoders, and ELMo and BERT contextual embeddings.
We have released the training, development, and test sets, and expect these datasets to provide a useful benchmark for future research into learning of inflectional morphology and string-to-string transduction.