CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages

The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms. Both sub-tasks included high, medium, and low-resource conditions. Sub-task 1 received 24 system submissions, while sub-task 2 received 3 system submissions. Following the success of neural sequence-to-sequence models in the SIGMORPHON 2016 shared task, all but one of the submissions included a neural component. The results show that high performance can be achieved with small training datasets, so long as models have appropriate inductive bias or make use of additional unlabeled data or synthetic data. However, different biasing and data augmentation resulted in disjoint sets of inflected forms being predicted correctly, suggesting that there is room for future improvement.


Introduction
Morphology interacts with both syntax and phonology. As a result, explicitly modeling morphology has been shown to aid a number of tasks in human language technology (HLT), including machine translation (MT) (Dyer et al., 2008), speech recognition (Creutz et al., 2007), parsing (Seeker and Çetinoğlu, 2015), keyword spotting (Narasimhan et al., 2014), and word embedding (Cotterell et al., 2016b). Dedicated systems for modeling morphological patterns and complex word forms have received less attention from the HLT community than tasks that target other levels of linguistic structure. Recently, however, there has been a surge of work in this area (Durrett and DeNero, 2013; Ahlberg et al., 2014; Nicolai et al., 2015; Faruqui et al., 2016), representing a renewed interest in morphology and the potential to use advances in machine learning to attack a fundamental problem in string-to-string transformations: the prediction of one morphologically complex word form from another. This increased interest in morphology as an independent set of problems within HLT arrives at a particularly opportune time, as morphology is also undergoing a methodological renewal within theoretical linguistics, where it is moving towards increased interdisciplinary work and quantitative methodologies (Moscoso del Prado Martín et al., 2004; Milin et al., 2009; Ackerman et al., 2009; Sagot and Walther, 2011; Ackerman and Malouf, 2013; Baayen et al., 2013; Blevins, 2013; Pirrelli et al., 2015; Blevins, 2016). Pushing the HLT research agenda forward in the domain of morphology promises to lead to mutually highly beneficial dialogue between the two fields.
Rich morphology is the norm among the languages of the world. The linguistic typology database WALS shows that 80% of the world's languages mark verb tense through morphology while 65% mark grammatical case (Haspelmath et al., 2005). The more limited inflectional system of English may help to explain the fact that morphology has received less attention in the computational literature than it is arguably due.
The CoNLL-SIGMORPHON 2017 shared task worked to promote the development of robust systems that can learn to perform cross-linguistically reliable morphological inflection and morphological paradigm cell filling using varying amounts of training data. We note that this is also the first CoNLL-hosted shared task to focus on morphology. The task itself featured training and development data from 52 languages representing a range of language families. Many of the languages included were extremely low-resource, e.g., Quechua, Navajo, and Haida. The chosen languages also encompassed diverse morphological properties and inflection processes. Whenever possible, three data conditions were given for each language: low, medium, and high. In the inflection sub-task, these corresponded to seeing 100 examples, 1,000 examples, and 10,000 examples respectively in the training data for almost all languages. The results show that encoder-decoder recurrent neural network models (RNNs) can perform very well even with small training sets, if they are augmented with various mechanisms to cope with the low-resource setting. The shared task training, development, and test data are released publicly.
Table 1: Example training data from sub-task 1. Each training example maps a lemma and inflection to an inflected form. The inflection is a bundle of morphosyntactic features. Note that inflected forms (and lemmata) can encompass multiple words. In the test data, the last column (the inflected form) must be predicted by the system.

Task and Evaluation Details
This year's shared task contained two sub-tasks, which represented slightly different learning scenarios that might be faced by an HLT engineer or (roughly speaking) a human learner. Beyond manually vetted data for training, development and test, monolingual corpus data (Wikipedia dumps) was also provided for both of the sub-tasks. Figure 1 illustrates the two tasks and defines some terminology.
Table 2: At training time, the system is provided with complete paradigms, i.e., tables of all inflections for a given lemma, like the example at top. At test time, the system is asked to complete partially filled paradigms, like the example at bottom; note that the inflectional features for the missing paradigm cells are provided in the input.
The CoNLL-SIGMORPHON 2017 shared task is the second shared task in a series that began with the SIGMORPHON 2016 shared task on morphological reinflection (Cotterell et al., 2016a). In contrast to 2016, it happens that both of the 2017 sub-tasks actually involve only inflection, not reinflection. Nonetheless, we kept "reinflection" in this year's title to make it easier to refer to the series of tasks.

Sub-Task 1: Inflected Form from Lemma
The first sub-task in Figure 1 required morphological generation with sparse training data, something that can be practically useful for MT and other downstream tasks in NLP. Here, participants were given examples of inflected forms as shown in Table 1. Each test example asked them to produce some other inflected form when given a lemma and a bundle of morphosyntactic features.
The training data was sparse in the sense that it included only a few inflected forms from each lemma. That is, as in human L1 learning, the learner does not necessarily observe any complete paradigms.
Figure 1: Each large rectangle represents a paradigm, i.e., the full set of inflected forms for some lemma. Each small rectangle within the paradigm is a cell that is associated with a known morphological feature bundle, and lists a string that either is observed (shaded background) or must be predicted (white background). Sub-task 1 featured sparse training data and asked systems to inflect individual forms at test time. Sub-task 2 provides dense paradigms as training data and asks for full paradigm completion of unseen items.
2. The task is inflection: Given an input lemma and desired output tag, participants had to generate the correct output inflected form (a string).
3. The supervised training data consisted of individual forms (Table 1) that were sparsely sampled from a large number of paradigms.
4. Forms that are empirically more frequent were more likely to appear in both training and test data (see §3 for details).
5. Unannotated corpus data was also provided to participants.
6. Systems were evaluated after training on $10^2$, $10^3$, and $10^4$ forms.
Of course, human L1 learners do not get to observe explicit morphological feature bundles for the types that they observe. Rather, they analyze inflected tokens in context to discover both morphological features (including inherent features such as noun gender (Arnon and Ramscar, 2012)) and paradigmatic structure (number of forms per lemma, number of expressed featural contrasts such as tense, number, person, ...).

Sub-Task 2: Paradigm Completion
The second sub-task in Figure 1 focused on paradigm completion, also known as "the paradigm cell filling problem" (Ackerman et al., 2009).
Here, participants were given a few complete inflectional paradigms as training data. At test time, partially filled paradigms, i.e., paradigms with significant gaps in them, were to be completed by filling out the missing cells. Table 2 gives examples.
Thus, sub-task 2 requires predicting many inflections of the same lemma. Recall that sub-task 1 also required the system to predict several inflections of the same lemma (when they appear as separate examples in test data). However, in sub-task 2, one of our test-time evaluation metrics (§2.3) is full-paradigm accuracy. Also, the sub-task 2 training data provides full paradigms, in contrast to sub-task 1 where it included only a few inflected forms per lemma. Finally, at test time, sub-task 2 presents each lemma along with some of its inflected forms, which is potentially helpful if the lemma had not appeared previously in training data.
Apart from the theoretical interest in this problem (Ackerman and Malouf, 2013), this sub-task is grounded in the practical problem of extrapolation of basic resources for a language, where only a few complete paradigms may be available from a native speaker informant (Sylak-Glassman et al., 2016) or a reference grammar. L2 classroom instruction also asks human students to memorize paradigms.
2. Not all paradigms within a language have the same shape. A noun lemma will have a different set of cells than a verb lemma does, and verbs of different classes (e.g., lexically perfective vs. imperfective) may also have different sets of cells.
3. The task was paradigm completion: given a sparsely populated paradigm, participants should generate the inflected forms (strings) for all missing cells.
4. The task simulates learning from compiled grammatical resources and inflection tables, or learning from a limited time with a native-language informant in a fieldwork scenario.
5. Three training sets were given, building up in size from only a few complete paradigms to a large number (dozens).

Evaluation
Each team participating in a given sub-task was asked to submit 156 versions of their system, where each version was trained using a different training set (3 training sizes × 52 languages) and its corresponding development set. We evaluated each submitted system on its corresponding test set, i.e., the test set for its language. We computed three evaluation metrics: (i) overall 1-best test-set accuracy, i.e., is the predicted paradigm cell correct? (ii) average Levenshtein distance, i.e., how badly does the predicted form disagree with the answer? (iii) full-paradigm accuracy, i.e., is the complete paradigm correct? This final metric only truly makes sense in sub-task 2, where full paradigms are given for evaluation. For each sub-task, the three data conditions (low, medium, and high) resulted in a learning curve. For each system in each condition, we report the average metrics across all 52 languages.
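The three metrics can be sketched in a few lines. The following is an illustrative sketch, not the official evaluation script: the function names, the `(lemma, gold, predicted)` input layout, and the choice to identify a paradigm by its lemma are all assumptions made for the example.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def evaluate(examples):
    """examples: iterable of (lemma, gold_form, predicted_form) tuples.

    Assumption: all examples sharing a lemma belong to one paradigm.
    """
    examples = list(examples)
    n = len(examples)
    correct = sum(gold == pred for _, gold, pred in examples)
    distance = sum(levenshtein(gold, pred) for _, gold, pred in examples)
    by_lemma = defaultdict(list)
    for lemma, gold, pred in examples:
        by_lemma[lemma].append(gold == pred)
    full = sum(all(hits) for hits in by_lemma.values())
    return {
        "accuracy": correct / n,
        "avg_levenshtein": distance / n,
        "paradigm_accuracy": full / len(by_lemma),
    }
```

Averaging the returned values over languages would then give one point per data condition on the learning curve.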

Languages
The data for the shared task was highly multilingual, comprising 52 unique languages. Data for 47 of the languages came from the English edition of Wiktionary, a large multi-lingual crowd-sourced dictionary containing morphological paradigms for many lemmata. Data for Khaling, Kurmanji Kurdish, and Sorani Kurdish was created as part of the Alexina project (Walther et al., 2013, 2010; Walther and Sagot, 2010). Novel data for Haida, a severely endangered North American language isolate, was prepared by Jordan Lachler (University of Alberta). The Basque language data was extracted from a manually designed finite-state morphological analyzer (Alegria et al., 2009).
The shared task language set is genealogically diverse, including languages from 10 language stocks. Although the majority of the languages are Indo-European, we also include two language isolates (Haida and Basque) along with languages from Athabaskan (Navajo), Kartvelian (Georgian), Quechua, Semitic (Arabic, Hebrew), Sino-Tibetan (Khaling), Turkic (Turkish), and Uralic (Estonian, Finnish, Hungarian, and Northern Sami). The shared task language set is also diverse in terms of morphological structure, with languages which use primarily prefixes (Navajo), suffixes (Quechua and Turkish), and a mix, with Spanish exhibiting internal vowel variations along with suffixes and Georgian using both infixes and suffixes. The language set also exhibits features such as templatic morphology (Arabic, Hebrew), vowel harmony (Turkish, Finnish, Hungarian), and consonant harmony (Navajo) which require systems to learn non-local alternations. Finally, the resource level of the languages in the shared task set varies greatly, from major world languages (e.g., Arabic, English, French, Spanish, Russian) to languages with few speakers (e.g., Haida, Khaling).

Data Format
For each language, the basic data consists of triples of the form (lemma, feature bundle, inflected form), as in Table 1. The first feature in the bundle always specifies the core part of speech (e.g., verb). All features in the bundle are coded according to the UniMorph Schema, a cross-linguistically consistent universal morphological feature set (Sylak-Glassman et al., 2015a,b).
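A triple of this kind is easy to represent in code. The sketch below is only illustrative: the `Triple` type and `make_triple` helper are hypothetical names, and the semicolon separator is assumed from bundles written like V;V.PTCP;PST; it says nothing about the column order of the released files.

```python
from typing import NamedTuple, Tuple


class Triple(NamedTuple):
    """One (lemma, feature bundle, inflected form) training example."""
    lemma: str
    features: Tuple[str, ...]
    form: str

    @property
    def pos(self) -> str:
        # The first feature in the bundle is the core part of speech.
        return self.features[0]


def make_triple(lemma: str, bundle: str, form: str, sep: str = ";") -> Triple:
    """Split a bundle string into its individual features (separator assumed)."""
    return Triple(lemma, tuple(bundle.split(sep)), form)


example = make_triple("schielen", "V;V.PTCP;PST", "geschielt")
```

Keeping the bundle as a tuple preserves feature order, so the part of speech stays recoverable as the first element.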

Extraction from Wiktionary
For each of the 47 Wiktionary languages, Wiktionary provides a number of tables, each of which specifies the full inflectional paradigm for a particular lemma. These tables were initially extracted via a multi-dimensional table parsing strategy (Kirov et al., 2016; Sylak-Glassman et al., 2015a).
As noted in §2.2, different paradigms may have different shapes. To prepare the shared task data, each language's parsed tables from Wiktionary were grouped according to their tabular structure and number of cells. Each group represents a different type of paradigm (e.g., verb). We used only groups with a large number of lemmata, relative to the number of lemmata available for the language as a whole. For each group, we associated a feature bundle with each cell position in the table, by manually replacing the prose labels describing grammatical features (e.g., "accusative case") with UniMorph features (e.g., acc). This allowed us to extract triples as described in the previous section.
By applying this process across the 47 languages, we constructed a large multilingual dataset that refines the parsed tables from previous work. This dataset was sampled to create appropriately-sized data for the shared task, as described in §3.4. Full and sampled dataset sizes by language are given in Table 3.
Systematic syncretism is collapsed in Wiktionary. For example, in English, feature bundles do not distinguish between different person/number forms of past tense verbs, because they are identical. Thus, the past-tense form went appears only once in the table for go, not six times, and gives rise to only one triple, whose feature bundle specifies past tense but not person and number.

Sampling the Train-Dev-Test Splits
From each language's collection of paradigms, we sampled the training, development, and test sets as follows. These datasets can be obtained from http://www.sigmorphon.org/conll2017.
Our first step was to construct probability distributions over the (lemma, feature bundle, inflected form) triples in our full dataset. For each triple, we counted how many tokens the inflected form has in the February 2017 dump of Wikipedia for that language. Note that this simple "string match" heuristic overestimates the count, since strings are ambiguous: not all of the counted tokens actually render that feature bundle.⁹ From these counts, we estimated a unigram distribution over triples, using Laplace smoothing (add-1 smoothing). We then sampled 12000 triples without replacement from this distribution. The first 100 were taken as the low-resource training set for sub-task 1, the first 1000 as the medium-resource training set, and the first 10000 as the high-resource training set. Note that these training sets are nested, and that the highest-count triples tend to appear in the smaller training sets.
The final 2000 triples were randomly shuffled and then split in half to obtain development and test sets of 1000 forms each. The final shuffling was performed to ensure that the development set is similar to the test set. By contrast, the development and test sets tend to contain lower-count triples than the training set.¹⁰ In those languages where we have fewer than 12000 total forms, we omit the high-resource training set (all languages have at least 3000 forms).
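The sampling procedure for sub-task 1 can be sketched as follows. This is a minimal illustration, not the organizers' script: `sample_subtask1` and its parameters are hypothetical names, and the handling of languages with fewer than 12000 forms (hold out the last 2000 sampled items, then keep whichever training sizes still fit) is an assumption. Weighted sampling without replacement is implemented with the standard "exponential race" trick: sorting items by independent exponential variates whose rate is the smoothed count gives the same ordering distribution as drawing items one at a time in proportion to their weights.

```python
import random


def sample_subtask1(counts, n_total=12000, sizes=(100, 1000, 10000),
                    n_heldout=2000, seed=0):
    """counts: dict mapping each triple to its (overestimated) corpus count.

    Returns (train_sets, dev, test), where train_sets maps size -> list.
    """
    rng = random.Random(seed)
    # Add-1 (Laplace) smoothing: every triple gets weight count + 1 > 0.
    keyed = [(rng.expovariate(count + 1), triple)
             for triple, count in counts.items()]
    # Smaller arrival time = drawn earlier; high-count triples tend to come first.
    keyed.sort(key=lambda pair: pair[0])
    ordered = [triple for _, triple in keyed[:n_total]]

    heldout = ordered[-n_heldout:]          # the final triples of the sample
    pool = ordered[:-n_heldout]
    # Nested training sets: each is a prefix of the next larger one.
    train_sets = {size: pool[:size] for size in sizes if size <= len(pool)}

    rng.shuffle(heldout)                    # make dev and test similar
    half = len(heldout) // 2
    return train_sets, heldout[:half], heldout[half:]
```

Because every training set is a prefix of the same ordering, the nesting property and the tendency of frequent triples to land in the smaller sets both follow directly from the sort.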
To sample the data for sub-task 2, we performed a similar procedure. For each paradigm in our full dataset, we counted the number of tokens in Wikipedia that matched any of the inflected forms in the paradigm. From these counts, we estimated a unigram distribution over paradigms, using Laplace smoothing. We sampled 300 paradigms without replacement from this distribution. The low-resource training set contains the first 10 paradigms, the medium-resource training set contains the first 50, and the high-resource training set contains the first 200. Again, these training sets are nested. Note that since different languages have paradigms of different sizes, the actual number of training exemplars may differ drastically.
⁹ For example, in English, any token of the string walked will be double-counted as both the past tense and the past participle of the lemma walk. This problem holds for all regular English verbs. Similarly, when we are counting the present-tense tokens lay of the lemma lay, we will also include tokens of the string lay that are actually the past tense of lie, or are actually the adjective or noun senses of lay. The alternative to double-counting each ambiguous token would have been to use EM to split the token's count of 1 unequally among its possible analyses, in proportion to their estimated prior probabilities (Cotterell et al., 2015).
¹⁰ This is a realistic setting, since supervised training is usually employed to generalize from frequent words that appear in annotated resources to less frequent words that do not. Unsupervised learning methods also tend to generalize from more frequent words (which can be analyzed more easily by combining information from many contexts) to less frequent ones.
With the same motivation as before, we shuffled the remaining 100 paradigms and took the first 50 as development and the next 50 as test. (In those languages with fewer than 300 paradigms, we again omitted the high-resource training setting.) For each development or test paradigm, we chose about 1/5 of the slots to provide to the system as input along with the lemma, asking the system to predict the remaining 4/5. We determined which cells to keep by independently flipping a biased coin with probability 0.2 for each cell.
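The cell-masking step for development and test paradigms can be illustrated as below. This is a sketch under assumptions: `mask_paradigm` is a hypothetical name, and a paradigm is represented as a dictionary from feature bundle to inflected form.

```python
import random


def mask_paradigm(paradigm, keep_prob=0.2, rng=None):
    """Split a paradigm into cells shown to the system and cells to predict.

    paradigm: dict mapping feature bundle -> inflected form.
    Each cell is kept independently with probability keep_prob, so the
    number of given cells varies from paradigm to paradigm.
    """
    rng = rng or random.Random()
    given, hidden = {}, {}
    for bundle, form in paradigm.items():
        if rng.random() < keep_prob:
            given[bundle] = form
        else:
            hidden[bundle] = form
    return given, hidden
```

Because the coin flips are independent, a small paradigm may end up with no given cells at all, in which case only the lemma is available as input.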
Because of the count overestimates mentioned above, our sub-task 1 dataset overrepresents triples where the inflected form (the answer) is ambiguous, and our sub-task 2 dataset overrepresents paradigms that contain ambiguous inflected forms. The degree of ambiguity varied among languages: the average number of triples per inflected form string ranged from 1.00 in Sorani to 2.89 in Khaling, with an average of 1.43 across all languages. Despite this distortion of true unigram counts, we believe that our datasets captured a sufficiently broad sample of the feature combinations for every language.
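The ambiguity statistic quoted above can be computed as the number of distinct triples divided by the number of distinct inflected form strings. The sketch below assumes that reading of the statistic; the function name and the toy English triples (with UniMorph-style bundles chosen for illustration) are not taken from the shared task data.

```python
def triples_per_form(triples):
    """Average number of distinct triples sharing one inflected form string.

    triples: iterable of (lemma, feature_bundle, inflected_form).
    """
    distinct = set(triples)
    forms = {form for _, _, form in distinct}
    return len(distinct) / len(forms)


toy = [
    ("walk", "V;PST", "walked"),         # past tense
    ("walk", "V;V.PTCP;PST", "walked"),  # past participle, same string
    ("walk", "V;NFIN", "walk"),
]
ratio = triples_per_form(toy)  # 3 triples over 2 distinct strings
```

A value of 1.00 means that every form string realizes exactly one (lemma, bundle) pair, so no string-match count is shared between triples.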

Previous Work
Most recent work in inflection generation has focused on sub-task 1, i.e., generating inflected forms from the lemma. Numerous, methodologically diverse approaches have been published. We highlight a representative sample of recent work. Durrett and DeNero (2013) heuristically extracted transformation rules and trained a semi-Markov model (Sarawagi and Cohen, 2004) to learn when to apply them to the input. Nicolai et al. (2015) trained a discriminative string-to-string monotonic transduction tool, DIRECTL+ (Jiampojamarn et al., 2008), to generate inflections. Ahlberg et al. (2014) reduced the problem to multi-class classification, where they used finite-state techniques to first generalize inflectional patterns and then trained a feature-rich classifier to choose the optimal such pattern to inflect unseen words (Ahlberg et al., 2015). Finally, Malouf (2016), Faruqui et al. (2016) and Kann and Schütze (2016) proposed neural sequence-to-sequence models (Sutskever et al., 2014), with Kann and Schütze making use of an attention mechanism (Bahdanau et al., 2015). Overall, the neural approaches have generally been found to be the most successful.
Some work has also focused on scenarios similar to sub-task 2. For example, Dreyer and Eisner (2009) modeled the distribution over the paradigms of a language as a Markov Random Field (MRF), where each cell is represented as a string-valued random variable. The MRF's factors are specified as weighted finite-state machines of the form given by Dreyer et al. (2008). Building upon this, Cotterell et al. (2015) proposed using a Bayesian network where both lemmata (repeated within a paradigm) and affixes (repeated across paradigms) were encoded as string-valued random variables. That work required its finite-state transducers to take a more restricted form (Cotterell et al., 2014) for computational reasons. Finally, Kann et al. (2017a) proposed a multi-source sequence-to-sequence network, allowing a neural transducer to exploit multiple source forms simultaneously.
SIGMORPHON 2016 Shared Task. Last year, the SIGMORPHON 2016 shared task (http://sigmorphon.org/sharedtask) focused on 10 languages (including 2 surprise languages). As for the present 2017 task, most of the 2016 data was derived from Wiktionary. The 2016 shared task had submissions from 9 competing teams with members from 11 universities. As mentioned in §2.1, our sub-task 1 is an extension of sub-task 1 from 2016. The other sub-tasks in 2016 focused on the more general reinflection problem, where systems had to learn to map from any inflected form to any other with varying degrees of annotations. See Cotterell et al. (2016a) for details.

The Baseline System
The shared task provided a baseline system to participants that addressed both tasks and all languages. The system was designed for speed of application and also for adequate accuracy with little training data, in particular in the low and medium data conditions. The design of the baseline was inspired by the University of Colorado's submission (Liu and Mao, 2016) to the SIGMORPHON 2016 shared task.
Table 4: Quantity of data available in sub-task 2. For each possible part of speech in each language, we present the range in the number of forms that comprise a paradigm as an indication of the difficulty of the task of forming a full paradigm. These ranges were computed using the data in the Train Medium condition.

Alignment
For each (lemma, feature bundle, inflected form) triple in training data, the system initially aligns the lemma with the inflected form by finding the minimum-cost edit path. Costs are computed with a weighted scheme such that substitutions have a slightly higher cost (1.1) than insertions or deletions (1.0). For example, the German training data pair schielen-geschielt 'to squint' (going from the lemma to the past participle) is aligned as:
--schielen
geschielt-
The system now assumes that each aligned pair can be broken up into a prefix, a stem, and a suffix, based on where the inputs or outputs have initial or trailing blanks after alignment. We assume that initial or trailing blanks in either input or output reflect boundaries between a prefix and a stem, or a stem and a suffix. This allows us to divide each training example into three parts. Using the example above, after padding the edges with $-symbols, the pair divides into the prefix pairing $ → $ge, the stem pairing schiele → schielt, and the suffix pairing n$ → $.
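A weighted edit-distance alignment of this kind can be sketched with a small dynamic program. This is not the released baseline code: the function name is hypothetical, costs are scaled to integers (11 for substitution, 10 for insertion or deletion) to keep tie comparisons exact, and the tie-breaking order in the backtrace (deletion, then match or substitution, then insertion) is an assumption chosen so that the schielen example comes out as shown in the text. The blank symbol `-` is assumed not to occur in the words themselves.

```python
def align(lemma, form, sub_cost=11, indel_cost=10, blank="-"):
    """Return the lemma and form padded with blanks along a min-cost edit path."""
    n, m = len(lemma), len(form)
    # cost[i][j] = cheapest way to turn lemma[:i] into form[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel_cost
    for j in range(1, m + 1):
        cost[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if lemma[i - 1] == form[j - 1] else sub_cost)
            cost[i][j] = min(diag,
                             cost[i - 1][j] + indel_cost,   # delete lemma char
                             cost[i][j - 1] + indel_cost)   # insert form char

    # Backtrace from the end, preferring deletion, then diagonal, then insertion.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and cost[i][j] == cost[i - 1][j] + indel_cost:
            top.append(lemma[i - 1]); bottom.append(blank); i -= 1
        elif (i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1]
              + (0 if lemma[i - 1] == form[j - 1] else sub_cost)):
            top.append(lemma[i - 1]); bottom.append(form[j - 1]); i -= 1; j -= 1
        else:
            top.append(blank); bottom.append(form[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


aligned = align("schielen", "geschielt")
```

Making substitutions slightly more expensive than a single insertion or deletion, but cheaper than a deletion plus an insertion, is what lets the e of schielen line up with the t of geschielt instead of being deleted and reinserted.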

Inflection Rules
From this alignment, the system extracts a prefix-changing rule based on the prefix pairing, as well as a set of suffix-changing rules based on suffixes of the stem+suffix pairing. The example alignment above yields the eight extracted suffix-modifying rules as well as the prefix-modifying rule $ → $ge.
Since these rules were obtained from the triple (schielen, V;V.PTCP;PST, geschielt), they are associated with a token of the feature bundle V;V.PTCP;PST.
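One way to read the rule extraction step is sketched below. It is an interpretation, not the baseline implementation: the helper names are hypothetical, the input is assumed to be an already aligned pair with `-` as the blank symbol, and the suffix rules are taken to be the suffix pairing extended leftwards by 0, 1, 2, ... aligned stem positions, which yields eight rules for the schielen example.

```python
def split_aligned(top, bottom, blank="-"):
    """Split an aligned pair into (prefix, stem, suffix) pairings."""
    n = len(top)
    start = 0
    while start < n and blank in (top[start], bottom[start]):
        start += 1                      # leading blanks mark the prefix
    end = n
    while end > start and blank in (top[end - 1], bottom[end - 1]):
        end -= 1                        # trailing blanks mark the suffix

    def strip(s):
        return s.replace(blank, "")

    return ((strip(top[:start]), strip(bottom[:start])),
            (top[start:end], bottom[start:end]),      # stem stays aligned
            (strip(top[end:]), strip(bottom[end:])))


def extract_rules(top, bottom, blank="-"):
    """Return one prefix rule and a list of suffix rules as (lhs, rhs) pairs."""
    (pre_in, pre_out), (stem_in, stem_out), (suf_in, suf_out) = \
        split_aligned(top, bottom, blank)
    prefix_rule = ("$" + pre_in, "$" + pre_out)
    suffix_rules = []
    for k in range(len(stem_in) + 1):
        cut = len(stem_in) - k          # take the last k aligned stem positions
        lhs = stem_in[cut:].replace(blank, "") + suf_in + "$"
        rhs = stem_out[cut:].replace(blank, "") + suf_out + "$"
        suffix_rules.append((lhs, rhs))
    return prefix_rule, suffix_rules


prefix_rule, suffix_rules = extract_rules("--schielen", "geschielt-")
```

Under this reading the extracted rules range from the shortest, n$ → $, to the longest, schielen$ → schielt$, and each would be stored with the feature bundle V;V.PTCP;PST.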

Generation
At test time, to inflect a lemma with features, the baseline system applies rules associated with training tokens of the precise feature bundle. There is no generalization across bundles that share features.
Table 5: The teams' abbreviations as well as their members' institutes and the accompanying system description paper are listed here. Note that in the main text the abbreviations are used with an integer index, indicating the specific submission. One team (marked *) did not submit a system description.
Specifically, the longest-matching suffix rule associated with the feature bundle is consulted and applied to the input form. Ties are broken by frequency, in favor of the rule that has occurred most often with this feature bundle. After this, the prefix rule that occurred most often with the bundle is likewise applied. That is, the prefix-matching rule has no longest-match preference, while the suffix-matching rule does.
For example, to inflect kaufen 'to buy' with the features V;V.PTCP;PST, using the single example above as training data, we would find that the longest matching stored suffix-rule is en$ → t$, which would transform kaufen into an intermediate form kauft, after which the most frequent prefix-rule, $ → $ge would produce the final output gekauft.If no rules have been associated with a particular feature bundle (as often happens in the low data condition), the inflected form is simply taken to be a copy of the lemma.
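The generation step can be sketched as follows. The function and variable names are hypothetical, rules are assumed to be stored per feature bundle as counters over (left-hand side, right-hand side) pairs, and restricting the prefix rule to candidates whose left-hand side actually matches the word is an added assumption.

```python
from collections import Counter


def inflect(lemma, bundle, suffix_rules, prefix_rules):
    """Apply the longest matching suffix rule, then the most frequent prefix rule."""
    if not suffix_rules.get(bundle) and not prefix_rules.get(bundle):
        return lemma                       # unseen bundle: copy the lemma
    word = "$" + lemma + "$"

    # Longest left-hand side wins; ties are broken by frequency.
    matches = [(len(lhs), freq, lhs, rhs)
               for (lhs, rhs), freq in suffix_rules.get(bundle, {}).items()
               if word.endswith(lhs)]
    if matches:
        _, _, lhs, rhs = max(matches)
        word = word[:len(word) - len(lhs)] + rhs

    # Prefix rules: most frequent only, no longest-match preference.
    prefix_matches = [(freq, lhs, rhs)
                      for (lhs, rhs), freq in prefix_rules.get(bundle, {}).items()
                      if word.startswith(lhs)]
    if prefix_matches:
        _, lhs, rhs = max(prefix_matches)
        word = rhs + word[len(lhs):]
    return word.strip("$")


# Rules from the single training example schielen -> geschielt.
BUNDLE = "V;V.PTCP;PST"
stem_in, stem_out = "schiele", "schielt"
suffix_rules = {BUNDLE: Counter((stem_in[i:] + "n$", stem_out[i:] + "$")
                                for i in range(len(stem_in) + 1))}
prefix_rules = {BUNDLE: Counter([("$", "$ge")])}

result = inflect("kaufen", BUNDLE, suffix_rules, prefix_rules)
```

With these rules, only n$ → $ and en$ → t$ match kaufen, the longer one is applied, and the prefix rule then prepends ge.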
In sub-task 2, paradigm completion, the baseline system simply repeats the sub-task 1 method and generates all the missing forms independently from the lemma.It does not take advantage of the other forms that are presented in the partially filled paradigm.
In addition to the above, the baseline system uses a heuristic to place a language into one of two categories: largely prefixing or largely suffixing.Some languages, such as Navajo, are largely prefixing and have more complex changes in the left periphery of the input rather than at the right.However, in the method described above, the operation of the prefix rules is more restricted than that of the suffix rules: prefix rules tend to perform no change at all, or insert or delete a prefix.For largely prefixing languages, the method performs better when operating with reversed strings.Classifying a language into prefixing or suffixing is done by simply counting how often there is a prefix change vs. suffix change in going from the lemma form to the inflected form in the training data.Whenever a language is found to be largely prefixing, the system works with reversed strings throughout to allow more expressive changes in the left edge of the input.
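The prefixing-versus-suffixing heuristic could look roughly like the sketch below. The text does not say how a "prefix change" or "suffix change" is detected, so the test used here (the lemma and the inflected form begin, or end, with different characters) is purely an assumption, as are the function names and the toy strings in the second example.

```python
def is_prefixing(pairs):
    """pairs: list of (lemma, inflected_form) training pairs."""
    # Assumed proxy: the left (right) edge changed if the first (last)
    # characters of lemma and form differ.
    prefix_changes = sum(lemma[:1] != form[:1] for lemma, form in pairs)
    suffix_changes = sum(lemma[-1:] != form[-1:] for lemma, form in pairs)
    return prefix_changes > suffix_changes


def orient(pairs):
    """Reverse all strings if the language looks largely prefixing."""
    if is_prefixing(pairs):
        return [(lemma[::-1], form[::-1]) for lemma, form in pairs], True
    return list(pairs), False


suffixing_pairs = [("walk", "walked"), ("go", "goes")]
prefixing_pairs = [("abc", "xabc"), ("abd", "yabd")]   # toy strings, not real data
```

When the second return value is true, predicted forms would need to be reversed again before output, so that the expressive suffix rules effectively operate on the left edge of the word.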

System Descriptions
The CoNLL-SIGMORPHON 2017 shared task received submissions from 11 teams with members from 15 universities and institutes (Table 5).Many of the teams submitted more than one system, yielding a total of 25 unique systems entered including the baseline system.
In contrast to the 2016 shared task, all but one of the submitted systems included a neural component. Despite the relative uniformity of the submitted architectures, we still observed large differences in the individual performances. Rather than differences in architecture, a major difference this year was the various methods for supplying the neural network with auxiliary training data. For ease of presentation, we break down the systems into the features of their system (see Table 6) and discuss the systems that had those features. In all cases, further details of the methods can be found in the system description papers, which are cited in Table 5.
Neural Parameterization. All systems except for the EHU team employed some form of a neural network. Moreover, all teams except for SU-RUG, which employed a convolutional neural network, made use of some form of gated recurrent network: either a gated recurrent unit (GRU) (Chung et al., 2014) or long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). In these neural models, a common strategy was to feed in the morphological tag of the form to be predicted along with the input into the network, where each subtag was its own symbol.
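The "each subtag is its own symbol" strategy can be illustrated with a small input-encoding sketch. The details differ between systems; here the function name, the angle-bracket notation, and the choice to place tag symbols before the lemma characters are all assumptions made for illustration.

```python
def encode_input(lemma, bundle, sep=";"):
    """Build the symbol sequence fed to a sequence-to-sequence model.

    Each subtag of the feature bundle becomes one atomic symbol,
    followed by the characters of the lemma.
    """
    tag_symbols = [f"<{tag}>" for tag in bundle.split(sep)]
    return tag_symbols + list(lemma)


symbols = encode_input("kaufen", "V;V.PTCP;PST")
```

Wrapping subtags in brackets keeps them distinct from ordinary characters in the vocabulary, so a tag such as V cannot be confused with the letter V.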
Hard Alignment versus Soft Attention. Another axis along which the systems differ is the use of hard alignment versus soft attention. The neural attention mechanism was introduced in Bahdanau et al. (2015) for neural machine translation (NMT). In short, these mechanisms avoid the necessity of encoding the input word into a fixed length vector, by allowing the decoder to attend to different parts of the inputs. Just as in NMT, the attention mechanism has led to large gains in morphological inflection. The CMU, CU, IIT (BHU), LMU, UE-LMU, UF and UTNII systems all employed such mechanisms.
An alternative to soft attention is hard, monotonic alignment, i.e., a neural parameterization of a traditional finite-state transduction system. These systems enforce a monotonic alignment between source and target forms. In the 2016 shared task (see Cotterell et al., 2016a, Table 6) such a system placed second (Aharoni et al., 2016), and this year's winning system, CLUZH, was an extension of that one. (See also Aharoni and Goldberg (2017) for a further explication of the technique and Rastogi et al. (2016) for discussion of a related neural parameterization of a weighted finite-state machine.) Their system allows for explicit biasing towards a copy action that appears useful in the low-resource setting. Despite its neural parameterization, the CLUZH system is most closely related to the systems of UA and EHU, which train weighted finite-state transducers, albeit with a log-linear parameterization.
Reranking. Reranking the output of a weaker system was a tack taken by two systems: ISI and UA. The ISI system started with a heuristically induced candidate set, using the edit tree approach described by Chrupała et al. (2008), and then chose the best edit tree. This approach is effectively a neuralized version of the lemmatizer proposed in Müller et al. (2015) and, indeed, was originally intended for that task (Chakrabarty et al., 2017). The UA team, following their 2016 submission, proposed a linear reranking on top of the k-best output of their transduction system.
Data Augmentation. Many teams made use of auxiliary training data: unlabeled or synthetic forms.
Some teams leveraged the provided Wikipedia corpora (see §3). The UE-LMU team used these unlabeled corpora to bias their methods towards copying by transducing an unlabeled word to itself. The same team also explored a similar setup that instead learned to transduce random strings to themselves, and found that using random strings worked almost as well as words that appeared in unlabeled corpora. CMU used a variational autoencoder and treated the tags of unannotated words in the Wikipedia corpus as latent variables (see Zhou and Neubig (2017b) for more details). Other teams attempted to get silver-standard labels for the unlabeled corpora. For example, the UA team trained a tagger on the given training examples, and then tagged the corpus with the goal to obtain additional instances, while the UE-LMU team used a series of unsupervised heuristics. The CU team, which did not make use of external resources, hallucinated more training data by identifying suffix and prefix changes in the given training pairs and then using that information to create new artificial training pairs. The LMU submission also experimented with handwritten rules to artificially generate more data. It seems likely that the primary difference in the performance of the various neural systems lay in these strategies for the creation of new data to train the parameters, rather than in the neural architectures themselves.
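The copy-biasing idea (training the model to transduce a word, or a random string, to itself) can be sketched as a simple data-generation step. This is an illustration of the general idea only, not any team's actual pipeline: the function names and the special `COPY` tag marking the auxiliary examples are hypothetical.

```python
import random


def copy_examples(words, tag="COPY"):
    """Turn unlabeled words into (source, tag, target) pairs with target == source."""
    return [(word, tag, word) for word in words]


def random_copy_examples(alphabet, n, min_len=3, max_len=8, seed=0, tag="COPY"):
    """Same idea, but with random strings drawn from the language's alphabet."""
    rng = random.Random(seed)
    strings = ["".join(rng.choice(alphabet)
                       for _ in range(rng.randint(min_len, max_len)))
               for _ in range(n)]
    return copy_examples(strings, tag)


from_corpus = copy_examples(["haus", "kaufen"])
from_noise = random_copy_examples("abcdefghijklmnopqrstuvwxyz", n=5)
```

Such pairs would be mixed into the supervised training data, encouraging the network to reproduce stem characters unchanged unless the tag calls for an edit.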
Performance of the Systems
Three teams exploited external resources in some form: UA, CMU, and UE-LMU. In general, any relative performance gain was minimal. The CMU system was outranked by several systems that avoided external resource use in the High and Medium conditions in which it competed. UE-LMU only submitted a system that used additional resources in the Medium condition, and saw gains of ∼1% compared to their basic system, while it was still outranked overall by CLUZH. In the Low condition, UA saw gains of ∼3% using external data. However, all UA submissions were limited to a small handful of languages.
All but one of the systems submitted were neural. As expected given the results from SIGMORPHON 2016, these systems perform very well in the High training condition, where data is relatively plentiful. In the Low and Medium conditions, however, standard encoder-decoder architectures perform worse than the baseline using only the training data provided. Teams that beat the baseline succeeded by biasing networks towards the correct solutions through pre-training on synthetic data designed to capture the overall inflectional patterns in a language. As seen in Table 9, these techniques worked better for some languages than for others. Languages with smaller, more regular paradigms were handled well (e.g., English sub-task 1 low-resource accuracy was at 90%). Languages with more complex systems, like Latin, proved more challenging (the best system achieved only 19% accuracy in the low condition). For these languages, it is possible that the relevant variation required to learn a best per-form inflectional pattern was simply not present in the limited training data, and that a language-specific learning bias was required.
Even though the top-ranked systems do well on their own, different systems may contain some amount of complementary information, so that an ensemble over multiple approaches has a chance to improve accuracy. We present an upper bound on the possible performance of such an ensemble. Table 7 and Table 8 include an "Ensemble Oracle" system (oracle-e) that gives the correct answer if any of the submitted systems is correct. The oracle performs significantly better than any one system in both the Medium (∼10%) and Low (∼15%) conditions. This suggests that the different strategies used by teams to "bias" their systems in an effort to make up for sparse data lead to substantially different generalization patterns.
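The ensemble oracle follows directly from its definition: a test item counts as correct if at least one system predicted the gold form. The sketch below uses hypothetical system names and toy predictions; it is not the organizers' evaluation script.

```python
def per_form_accuracy(gold, predictions):
    """Percentage of test forms that a single system predicted exactly."""
    hits = sum(1 for g, p in zip(gold, predictions) if g == p)
    return 100.0 * hits / len(gold)

def ensemble_oracle_accuracy(gold, system_predictions):
    """Percentage of test forms that at least one system predicted exactly.

    system_predictions maps a system name to its list of predicted forms,
    aligned item by item with the gold list.
    """
    hits = 0
    for i, g in enumerate(gold):
        if any(preds[i] == g for preds in system_predictions.values()):
            hits += 1
    return 100.0 * hits / len(gold)

# Toy example with two hypothetical systems.
gold = ["walked", "sang", "went", "ran"]
systems = {
    "sys_a": ["walked", "singed", "goed", "ran"],
    "sys_b": ["walked", "sang", "goed", "runned"],
}
acc_a = per_form_accuracy(gold, systems["sys_a"])
acc_b = per_form_accuracy(gold, systems["sys_b"])
acc_oracle = ensemble_oracle_accuracy(gold, systems)
```

In this toy case each system scores 50% while the oracle scores 75%, because the two systems err on different items. A gap of this kind is what signals complementary generalization patterns.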
For sub-task 1, we also present a second "Feature Combination" Oracle (oracle-fc) that gives the correct answer for a given test triple iff its feature bundle appeared in training (with any lemma). Thus, oracle-fc provides an upper bound on the performance of systems that treat a feature bundle such as V;SBJV;FUT;3;PL as atomic. In the low-data condition, this upper bound was only 71%, meaning that 29% of the test bundles had never been seen in training data. Nonetheless, systems should be able to make some accurate predictions on this 29% by decomposing each test bundle into individual morphological features such as FUT (future) and PL (plural), and generalizing from training examples that involve those features. For example, a particular feature or sub-bundle might be realized as a particular affix. Several of the systems treated each individual feature as a separate input to the recurrent network, in order to enable this type of generalization. In the medium-data condition for some languages, these systems sometimes far surpassed oracle-fc. The most notable example of this is Basque, where oracle-fc produced a 47% accuracy while six of the submitted systems produced an accuracy of 85% or above. Basque is an extreme example with very large paradigms for the verbs that inflect in the language (only a few dozen common ones do). This result demonstrates the ability of the neural systems to generalize and correctly inflect according to unseen feature combinations.
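A minimal sketch of the feature-combination oracle, together with the decomposition argument, is given below. The function names and the toy Spanish triples are illustrative assumptions; in particular, treating a bundle as an unordered set of features is a choice made here, not something the task description specifies.

```python
def bundle(tags):
    """Represent a feature bundle such as 'V;SBJV;PRS;3;PL' as a set of features."""
    return frozenset(tags.split(";"))

def feature_combination_oracle(train, test):
    """Percentage of test triples whose whole bundle occurred in training."""
    seen_bundles = {bundle(tags) for _lemma, tags, _form in train}
    hits = sum(1 for _lemma, tags, _form in test if bundle(tags) in seen_bundles)
    return 100.0 * hits / len(test)

def unseen_but_decomposable(train, test):
    """Test items whose bundle is new although every single feature was seen."""
    seen_bundles = {bundle(tags) for _lemma, tags, _form in train}
    seen_features = set().union(*seen_bundles)
    return [
        (lemma, tags)
        for lemma, tags, _form in test
        if bundle(tags) not in seen_bundles and bundle(tags) <= seen_features
    ]

# Toy (lemma, bundle, form) triples, for illustration only.
train = [
    ("hablar", "V;IND;PRS;3;PL", "hablan"),
    ("comer", "V;SBJV;PRS;3;SG", "coma"),
]
test = [
    ("cantar", "V;IND;PRS;3;PL", "cantan"),
    ("vivir", "V;SBJV;PRS;3;PL", "vivan"),
]
oracle_fc = feature_combination_oracle(train, test)
decomposable = unseen_but_decomposable(train, test)
```

Here oracle-fc is 50%: the bundle for "vivan" never occurred in training as a whole, yet each of its features did, which is exactly the case in which a system that reads features individually can exceed the oracle.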

Future Directions
As regards morphological inflection, there is a plethora of future directions to consider. First, one might consider morphological transductions over pronunciations, rather than spellings. This is more challenging in the many languages (including English) where the orthography does not reflect the phonological changes that accompany morphological processes such as affixation. Orthography usually also does not reflect predictable allophonic distinctions in pronunciation (Sampson, 1985), which one might attempt to predict, such as the difference in aspiration of /t/ in English [tʰɑp] (top) vs. [stɑp] (stop).
A second future direction involves the effective incorporation of external unannotated monolingual corpora into the state-of-the-art inflection or reinflection systems. The best systems in our competition did not make use of external data, and those that did make heavy use of such data, e.g., the CMU team, did not see much gain. The best way to use external corpora remains an open question; we surmise that they can be useful, especially in the lower-resource cases. A related line of inquiry is the incorporation of cross-lingual information, which Kann et al. (2017b) did find to be helpful.
A third direction revolves around the efficient elicitation of morphological information (i.e., active learning). In the low-resource section, we asked our participants to find the best approach to generate new forms given existing morphological annotation. However, it remains an open question which of the cells in a paradigm are best to collect annotation for in the first place. Likely, it is better to collect diagnostic forms that are closer to principal parts of the paradigm (Finkel and Stump, 2007; Ackerman et al., 2009; Montermini and Bonami, 2013; Cotterell et al., 2017), as these will contain enough information such that the remaining transformations are largely deterministic. Experimental studies, however, suggest that speakers also strongly rely on pattern frequencies for inferring unknown forms (Seyfarth et al., 2014). Another interesting direction would therefore include organizing data according to plausible real frequency distributions (especially in spoken data) and exploring possibly varying learning strategies associated with lexical items of various frequencies.
Finally, there is a wide variety of other tasks involving morphology. While some of these have had a shared task, e.g., the parsing of morphologically-rich languages (Tsarfaty et al., 2010) and unsupervised morphological segmentation (Kurimo et al., 2010), many have not, e.g., supervised morphological segmentation and morphological tagging. A key purpose of shared tasks in the NLP community is the preparation and release of standardized data sets for fair comparison among methods. Future shared tasks in other areas of computational morphology would seem in order, given the overall effectiveness of shared tasks in unifying research objectives in subfields of NLP, and as a starting point for possible cross-over with cognitively-grounded theoretical and quantitative linguistics.

Conclusion
The CoNLL-SIGMORPHON shared task provided an evaluation on 52 languages, with large and small datasets, of systems for inflection and paradigm completion, two core tasks in computational morphological learning. On sub-task 1 (inflection), 24 systems were submitted, while on sub-task 2 (paradigm completion), 3 systems were submitted. All but one of the systems used rather similar neural network models, popularized by the SIGMORPHON shared task in 2016.
The results reinforce the conclusions of the 2016 shared task that encoder-decoder architectures perform strongly when training data is plentiful, with exact-match accuracy on held-out forms surpassing 90% on many languages; we note there was a shortage of non-neural systems this year to compare with. In addition, and contrary to common expectation, many participants showed that neural systems can do reasonably well even with small training datasets. A baseline sequence-to-sequence model achieves close to zero accuracy: e.g., Silfverberg et al. (2017) reported that all the team's neural models on the low data condition delivered accuracies in the 0-1% range without data augmentation, and other teams reported similar findings. However, with judicious application of biasing and data augmentation techniques, the best neural systems achieved over 50% exact-match prediction of inflected form strings on 100 examples, and 80% on 1,000 examples, as compared to 38% for a baseline system that learns simple inflectional rules. It is hard to say whether these are "good" results in an absolute sense. An interesting experiment would be to pit the small-data systems against human linguists who do not know the languages, to see whether the systems are able to identify the predictive patterns that humans discover (or miss).
An oracle ensembling of all systems shows that there is still much room for improvement, in particular in low-resource settings. We have released the training, development, and test sets, and expect these datasets to provide a useful benchmark for future research into learning of inflectional morphology and string-to-string transduction.

A Detailed Results
This section contains detailed results for each submitted system on each language. Systems are ordered by average per-form accuracy for each sub-task and data condition. Three metrics are presented for each system/language combination.
1. Per-Form Accuracy: Percentage of test forms inflected correctly.
2. Levenshtein Distance: Average Levenshtein distance of system-predicted form from gold inflected form.
3. Per-Paradigm Accuracy: Percentage of unique lemmata (paradigms) for which every form was inflected correctly.
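The three metrics above can be sketched as follows. This is an illustrative implementation, not the official evaluation script: the record layout `(lemma, gold, predicted)` and the function names are assumptions, and Levenshtein distance is computed over characters with unit costs.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Character-level edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def evaluate(records):
    """Compute the three metrics from (lemma, gold_form, predicted_form) records."""
    n = len(records)
    correct = sum(1 for _lemma, gold, pred in records if gold == pred)
    total_dist = sum(levenshtein(pred, gold) for _lemma, gold, pred in records)
    by_lemma = defaultdict(list)
    for lemma, gold, pred in records:
        by_lemma[lemma].append(gold == pred)
    full = sum(1 for flags in by_lemma.values() if all(flags))
    return {
        "per_form_accuracy": 100.0 * correct / n,
        "avg_levenshtein": total_dist / n,
        "per_paradigm_accuracy": 100.0 * full / len(by_lemma),
    }

# Toy records for two lemmas (illustrative only).
records = [
    ("walk", "walked", "walked"),
    ("walk", "walking", "walking"),
    ("sing", "sang", "singed"),
    ("sing", "singing", "singing"),
]
scores = evaluate(records)
```

Per-paradigm accuracy is the strictest of the three: a single wrong form ("singed") makes the whole paradigm for "sing" count as incorrect, even though three of the four forms are right.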
Scores in bold include the highest scoring non-oracle system for each language as well as any other systems that did not differ significantly in terms of per-form accuracy according to a sign test (p >= 0.05).
Scores marked with a † indicate submissions that were significantly better than the feature combination oracle (p < 0.05), showing per-feature generalization. Scores marked with ‡ did not differ significantly from the ensemble oracle, suggesting minimal complementary information across systems.
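For readers unfamiliar with the sign test, the sketch below shows one common form of it applied to paired per-form correctness. It is an assumption that a two-sided exact binomial version is appropriate here; the text does not state which variant was used, and the function name and toy data are illustrative.

```python
from math import comb

def sign_test_p(correct_a, correct_b):
    """Two-sided exact sign test on paired per-form correctness flags.

    Items on which both systems agree (both right or both wrong) are ignored;
    under the null hypothesis each remaining item favors either system with
    probability 0.5.
    """
    wins_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    wins_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# System A is right on 10 items where system B is wrong, and never the reverse.
p_clear = sign_test_p([True] * 10, [False] * 10)
# Each system wins on 5 of 10 disagreeing items.
p_tied = sign_test_p([True] * 5 + [False] * 5, [False] * 5 + [True] * 5)
```

With the 0.05 threshold used above, the first comparison would count as a significant difference and the second would not.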

Figure 1: Overview of sub-tasks. Each large rectangle represents a paradigm, i.e., the full set of inflected forms for some lemma. Each small rectangle within the paradigm is a cell that is associated with a known morphological feature bundle, and lists a string that either is observed (shaded background) or must be predicted (white background). Sub-task 1 featured sparse training data and asked systems to inflect individual forms at test time. Sub-task 2 provided dense paradigms as training data and asked for full paradigm completion of unseen items.

Table 2: Example training and test data from sub-task 2 in Spanish.

Table 3: Total number of lemmata and forms available for sampling, and number of distinct lemmata present in each data condition in sub-task 1. For almost all languages, these were spread across 10,000, 1,000, and 100 forms in the High, Medium, and Low conditions, respectively, and 1,000 forms in each Dev and Test set. For †-marked languages, there was not enough total data to support these numbers. Bengali had 4,423 forms in the High condition, and Dev and Test sets of 100 forms each. Haida had 6,840 forms in the High condition, and Dev and Test sets of 100 forms each. Scottish Gaelic had no High condition, a Medium condition of 681 forms, and Dev and Test sets of 50 forms each. The last three columns indicate how many inflected forms have undergone a change in a prefix (Pr), a change in a suffix (Su), or a stem-internal change (Ap) relative to the given lemma form.

Table 6: Features of the various submitted systems.

Table 7: Sub-task 1 results: Per-form accuracy (in percentage points) and average Levenshtein distance from the correct form (in characters), averaged across the 52 languages with all languages weighted equally. The columns represent the different training size conditions. Systems marked with † used external resources. Accuracies marked with ‡ indicate that the submission did not include all 52 languages and should not be compared to the other accuracies.

Table 8: Sub-task 2 results: Per-form accuracy (in percentage points) and average Levenshtein distance from the correct form (in characters).
accuracy of each system by resource condition, for each of the sub-tasks. The table reflects the fact that some teams submitted more than one system (e.g., LMU-1 & LMU-2 in the table). Learning curves for each language across conditions are shown in Table 9, which indicates the best per-form accuracy achieved by a submitted system. Full results can be found in Appendix A, including full-paradigm accuracy.

Table 9: Best per-form accuracy (and corresponding system) by language.