Weird Inflects but OK: Making Sense of Morphological Generation Errors

We conduct a manual error analysis of the CoNLL-SIGMORPHON Shared Task on Morphological Reinflection. This task involves natural language generation: systems are given a word in citation form (e.g., hug) and asked to produce the corresponding inflected form (e.g., the simple past hugged). We propose an error taxonomy and use it to annotate errors made by the top two systems across twelve languages. Many of the observed errors are related to inflectional patterns sensitive to inherent linguistic properties such as animacy or affect; many others are failures to predict truly unpredictable inflectional behaviors. We also find that nearly one quarter of the residual “errors” reflect errors in the gold data.


Introduction
A huge amount of work in natural language processing treats words as indivisible units, but the vast majority of the world's languages have rich word-internal structure. For instance, 80% of the languages analyzed in the World Atlas of Language Structures (Dryer and Haspelmath 2013) inflect verbs for tense and 65% inflect nouns for case. Generating and processing complex words is thus crucial for multilingual speech and language technologies.
Recent work on large-scale, multilingual computational modeling of morphology (e.g., Durrett and DeNero 2013, Cotterell et al. 2016) targets supervised inflection generation. Such tasks require variable-length outputs, so they are less constrained than earlier segmentation-based tasks (e.g., Kurimo et al. 2010), but appear to be tractable with existing neural network-based models. For example, in the CoNLL-SIGMORPHON 2017 Shared Task (sub-task 1 and the "high-data condition"), the focus of this study, the best-ranked system generates novel inflectional forms with 90% accuracy or better for 46 out of the 52 target languages, and it achieves perfect accuracy for four languages (Cotterell et al. 2017).
In light of these apparent successes, we examine the failure modes of existing models for morphological generation. We first propose and motivate an error taxonomy for this task, inspired by similar proposals for other natural language generation and processing technologies such as grammatical error correction (e.g., Rozovskaya and Roth 2016) and machine translation (e.g., Popović and Ney 2011, Fishel et al. 2012, Irvine et al. 2013). We then use this taxonomy to perform a manual error analysis of the CoNLL-SIGMORPHON 2017 Shared Task. Such analyses can help to identify strengths and weaknesses of existing systems, suggest future improvements, and guide development of strong ensemble models, but they are often neglected or treated as an afterthought. This annotation also allows us to measure the quality of the gold data.

Generating morphologically complex forms is a skill typically-developing children effortlessly acquire, so this task, and systems' error patterns, may have implications for the theory of language acquisition. While the shared task training paradigm is quite unlike human language learning, inference and evaluation resemble the classic wug-test (Berko 1958), in which speakers are presented with a word, either real or nonce, in citation form and prompted to provide a particular inflectional form of that word. Therefore, one can analyze inflection generation errors much as one might analyze errors made by a child acquiring their first language. And, one can ask whether humans' and artificial learners' errors are in any way alike.
To answer these questions, we examine errors made by the two top-performing systems in the CoNLL-SIGMORPHON 2017 Shared Task for twelve languages.

Materials and methods
Here, we describe the shared task, data sources, and the targeted systems.

The task
The CoNLL-SIGMORPHON 2017 Shared Task (Cotterell et al. 2017) consists of two supervised morphological generation sub-tasks across 52 languages. In sub-task 1, the training data consists of triples of lemma, inflectional bundle, and inflected form, as in Table 1. At inference time, the system is given lemmata and inflectional bundles and asked to produce the appropriate inflected forms. In sub-task 2, training data consists of complete inflectional paradigms, and at inference time, the system is asked to produce full paradigms for unseen lemmata. We focus on the results from sub-task 1, primarily because only two of the twelve teams chose to compete in sub-task 2. However, the proposed error taxonomy could easily be applied to sub-task 2, or to later morphological generation challenges such as sub-task 2 of the CoNLL-SIGMORPHON 2018 shared task or sub-task 1 of the SIGMORPHON 2019 shared task (McCarthy et al. 2019).
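The input-output interface of sub-task 1 can be made concrete with a toy sketch. The triples below mimic the training data format; the regular English past-tense rule is purely illustrative (real systems learn this mapping from data) and is not part of the shared task code:

```python
# Each training example is a (lemma, feature bundle, inflected form) triple.
train = [
    ("hug", "V;PST", "hugged"),
    ("walk", "V;PST", "walked"),
    ("jump", "V;V.PTCP;PRS", "jumping"),
]

def inflect(lemma, bundle):
    """Toy rule-based inflector covering only the regular English past
    tense, to illustrate sub-task 1's input/output interface."""
    if bundle == "V;PST":
        # Double a final consonant after a single vowel (hug -> hugged).
        if (len(lemma) >= 3 and lemma[-1] not in "aeiouwxy"
                and lemma[-2] in "aeiou" and lemma[-3] not in "aeiou"):
            return lemma + lemma[-1] + "ed"
        return lemma + "ed"
    raise NotImplementedError(bundle)
```

At inference time a system receives only the lemma and bundle (e.g., `("hug", "V;PST")`) and must generate the inflected form.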

Data
The data in both sub-tasks is primarily sampled from UniMorph, a free morphological database. In turn, UniMorph data for our twelve languages is automatically extracted from Wiktionary, a collaborative multilingual online dictionary. UniMorph pairs the cells of Wiktionary morphological paradigms, which bear prose labels like "genitive plural", to feature bundles in a language-independent morphological schema (Sylak-Glassman et al. 2015; also see Sylak-Glassman 2016). The data consist of the aforementioned triples of lemma, inflectional bundle, and inflected form. For sub-task 1, these triples were sampled from UniMorph paradigms according to frequencies of inflected forms as estimated from Wikipedia. Because of this sampling procedure, the data is sparse in the sense that there are rarely more than a few inflected forms per lemma. As such, it roughly mimics the statistical properties of the primary linguistic data encountered by child language learners (e.g., Chan 2008:71-100). Systems were evaluated under three training data conditions: low (100 triples), medium (1,000 triples) and high (10,000 triples). We focus on the high-data condition because nearly all systems performed poorly in the low- and medium-data conditions.
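The frequency-weighted sampling described above can be sketched as follows. The function name, the use of `random.choices`, and the toy frequency counts are our own illustrative assumptions; the shared task's actual sampling script may differ:

```python
import random

def sample_triples(paradigm_triples, freqs, k, seed=0):
    """Draw up to k distinct (lemma, bundle, form) triples, weighted by
    the estimated corpus frequency of each inflected form. Illustrative
    sketch only, not the shared task's actual procedure."""
    rng = random.Random(seed)
    # Unseen forms get a default weight of 1.
    weights = [freqs.get(form, 1) for (_, _, form) in paradigm_triples]
    sampled = set()
    while len(sampled) < min(k, len(paradigm_triples)):
        sampled.add(rng.choices(range(len(paradigm_triples)), weights)[0])
    return [paradigm_triples[i] for i in sorted(sampled)]
```

Because frequent forms dominate the draw, most lemmata contribute only a handful of their paradigm cells, producing the sparsity noted above.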

Systems
In sub-task 1, systems were ranked using the macro-averaged "per form" (i.e., full-token match) accuracy across all 52 target languages. We analyze errors made by the two top-ranked systems, briefly described below.
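The ranking metric, per form accuracy macro-averaged over languages, amounts to the following. This is a sketch of the metric as described, not the official evaluation script:

```python
def per_form_accuracy(golds, preds):
    """Full-token match accuracy for one language: a prediction counts
    only if the entire inflected form matches the gold form."""
    assert len(golds) == len(preds)
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def macro_average(per_language):
    """Unweighted mean of per-language accuracies, so each language
    contributes equally to the ranking regardless of test set size."""
    return sum(per_language.values()) / len(per_language)
```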
ue-lmu-1 (Bergmanis et al. 2017) This system uses a recurrent neural network (RNN) with a bidirectional gated recurrent unit (GRU) encoder, a unidirectional GRU decoder, and a standard attention mechanism. It enhances a closely-related competitor system (Kann and Schütze 2017) by augmenting the provided training data with identical input-output pairs so as to create a bias toward copying the input stem. It is ranked as the best-performing system on sub-task 1 (macro-average accuracy 95.32%).
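The copy-bias augmentation can be sketched as follows. The "COPY" feature tag and function name are our own illustrative assumptions; Bergmanis et al.'s exact augmentation format may differ:

```python
def add_copy_pairs(train):
    """Augment (lemma, bundle, form) training triples with identity
    pairs mapping each lemma to itself, biasing an encoder-decoder
    toward copying the input stem. Sketch of the idea in Bergmanis et
    al. (2017); the "COPY" tag is our assumption."""
    lemmas = {lemma for (lemma, _, _) in train}
    copies = [(lemma, "COPY", lemma) for lemma in sorted(lemmas)]
    return train + copies
```

Training on such identity pairs teaches the decoder that most of the output is usually the input stem verbatim, which helps on languages where inflection is mostly suffixal.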

cluzh-7 (Makarov et al. 2017)
This system also uses a neural encoder-decoder but replaces the "soft" attention mechanism with hard monotonic attention (Aharoni and Goldberg 2017) and special edit operations. It is ranked second-best overall on sub-task 1 (macro-average accuracy 95.12%) and also achieves the highest per form accuracy on eight languages including Hungarian and Spanish.
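The edit-operation view of inflection can be illustrated with a toy interpreter: the model emits a monotonic sequence of actions over the lemma rather than generating the output freely. The action inventory below (COPY, DEL, INS) is our simplification; the actual system learns such action sequences jointly with hard monotonic attention:

```python
def apply_edits(lemma, actions):
    """Apply a monotonic sequence of edit actions to a lemma.
    "COPY" copies the next input character, "DEL" skips it, and
    ("INS", c) emits c without consuming input. Illustrative of the
    edit-operation decoding in cluzh-7; the inventory is ours."""
    out, i = [], 0
    for act in actions:
        if act == "COPY":
            out.append(lemma[i]); i += 1
        elif act == "DEL":
            i += 1
        else:  # ("INS", c)
            out.append(act[1])
    return "".join(out)
```

For example, hug -> hugged is three copies followed by three insertions, and an ablauting form like German saufen -> soff deletes the stem vowels and inserts o.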

Error taxonomy
One major distinction in the proposed taxonomy of inflection generation errors is between those errors which can be given a linguistic characterization, i.e., explained as misapplication of inflectional patterns independently attested in the target language, and those which cannot. In drawing this distinction we are inspired by a long and contentious debate in computational morphology research. Rumelhart and McClelland (1986) claim that their connectionist model of English past tense inflection generalizes in a humanlike fashion. Pinker and Prince (1988) and Sproat (1992:216f.) dispute this characterization, pointing out bizarre errors like *membled for mailed. More recently, Kirov and Cotterell (2018) claim that modern neural network architectures, such as those used in the CoNLL-SIGMORPHON 2017 Shared Task, generalize reasonably well while largely eliminating these bizarre errors. However, Corkery et al. (2019) argue that the Kirov and Cotterell model predictions align poorly with human productions, and suggest that the reported results may be uncharacteristic due to fortuitous random seeding.

We desired a somewhat richer set of error categories than this prior work suggests. The final taxonomy, incorporating feedback from a ten-language pilot study, consists of four major error categories, with several additional sub-categories. The categories are applied sequentially, as in Figure 1. We now describe these categories.

Figure 1: Overview of our annotation scheme, including sub-categories. Annotators are instructed to proceed through the taxonomy from left to right.

Target errors This category consists of cases where the gold data is incorrect or incomplete. We discern three sub-categories of target errors.
Free variation errors occur when more than one acceptable inflected form exists, but only one is present in the UniMorph data. Extraction errors indicate flaws in UniMorph's parsing of Wiktionary inflectional paradigms. Wiktionary errors represent errors in the Wiktionary data itself.
Silly errors This category consists of those "bizarre" errors which defy any purely linguistic characterization. In addition to the aforementioned case of *membled, such errors have also been reported for other language generation tasks such as machine translation (Arthur et al. 2016) and text normalization (Gorman and Sproat 2016, Sproat and Jaitly 2017, Zhang et al. 2019).

Allomorphy errors This category consists of those errors which are characterized by misapplication of existing (i.e., independently attested) allomorphic patterns in the target language. Our annotation scheme recognizes four sub-categories of allomorphy error, but we set aside their description for reasons of space.
Spelling errors This category includes inflected forms that do not follow language-specific orthographic conventions but are otherwise correct.

Results
We performed full error annotation on twelve of the 52 languages. Several other languages were initially targeted for annotation but produced too few errors to draw meaningful conclusions. Annotations were performed by the authors, all specialists in computational linguistics. Of these, four languages (English, Finnish, Polish, and Russian) were annotated by native speakers; the remaining eight were annotated by second-language speakers. In addition to the annotation guidelines, annotators were encouraged to consult native speakers as well as authoritative dictionaries and reference grammars, such as the Iso suomen kielioppi (Hakulinen et al. 2008) for Finnish, the Duden for German, the Oxford Latin Dictionary (Lee 1968), or the Diccionario de la lengua española for Spanish.

Table 2 reports summary statistics for fully-annotated languages. Table 3 provides raw agreement and Krippendorff's α (Artstein and Poesio 2008) for those languages known to two annotators. As mentioned above, each annotator is a specialist in computational linguistics, and annotated at least one other language as well. Raw agreement is high, and while chance-corrected agreement statistics like α are notoriously difficult to interpret, α ≥ 0.8, a threshold obtained for all three double-coded languages, is generally considered to indicate substantial reliability (Krippendorff 2004:241f.). Table 4 provides the counts of the four major categories of error for all twelve languages and for both systems. We now proceed to describe some patterns observed within these categories. Table 5 gives counts for the three sub-categories of target errors.

Free variation errors Finnish has several free variation errors, many involving vowel harmony. For example, the abessive suffix has two allomorphs, namely the back-harmonic -tta and the front-harmonic -ttä. The lemma progestiini 'progestin' can take the back allomorph, giving progestiinitta, but vowel harmony often fails to apply when there are many intervening neutral vowels (i and e) between the harmonic trigger and the suffix (Hakulinen et al. 2008: §17), as is the case here. Therefore, the form progestiinittä, predicted by both systems, is grammatical, though it is not the form given by UniMorph. Another type of free variation error affects allomorphs of the Finnish genitive plural (gen.pl.). For instance, omenoiden, omenoitten, omenojen, omenien and omenain are all possible gen.pl. forms of omena 'apple', but only one is present in UniMorph.

Extraction errors
The comparatively low accuracy on Hungarian (cluzh-7, the best-performing system on this language, achieves 89.80% per form accuracy) appears to be due to a large number of extraction errors. In most cases, the error comes from pairing one paradigm cell with another cell's inflectional bundle. For instance, UniMorph incorrectly labels *lagúnák as the accusative plural (acc.pl.) for lagúna 'lagoon'; it is in fact the nominative plural (nom.pl.). In Romanian, a header for the Wiktionary paradigms reading "definite articulation" is incorrectly taken as an inflected form itself! Latin also suffers from pervasive extraction errors. This language has a robust phonemic contrast between short and long monophthongs (e.g., malus 'unpleasant' vs. mālus 'apple tree'). Long monophthongs are, at least in modern editions, indicated by the macron, a horizontal line above the vowel. UniMorph extraction has somehow removed macrons from all lemmata, though they are still present in the inflected forms. Thus, systems must attempt to predict an unpredictable phonemic contrast while mapping from lemma to inflected form. As a result, the vast majority of Latin errors concern vowel length.
Wiktionary errors Errors in the Wiktionary data itself are relatively rare and largely nonsystematic. For example, in Spanish, *demarce is given as the first person singular (1sg.) present subjunctive of demarcar 'to demarcate' instead of the correct form demarque.

Silly errors
Silly errors were found for all languages except English; they also appear to be somewhat more common for ue-lmu-1 (59) than for cluzh-7 (37).
ue-lmu-1 predicts praesōs as the acc.pl. of the Latin noun praesul (a title used by Roman religious leaders); there is no obvious analogue for the ul-ōs stem change. In German, cluzh-7 unexpectedly truncates the gen.pl. form of the compound noun Schädlingsbekämpfungsmittel 'pesticide' to produce *Schädlingsbekämpfungsmit. For the dative plural (dat.pl.) of the Russian compound noun meaning 'forced labor', ue-lmu-1 inexplicably deletes the r of rabóty 'labor', giving the bizarre *prinudítel'nym abótam. And, in Spanish, ue-lmu-1 gives *atuengáis as the second person plural present subjunctive of atener 'to maintain'. There is no analogue for this e-ue stem change.

Allomorphy errors
With the exception of Hungarian and Latin, which suffer from systematic extraction errors, allomorphy errors are the largest category of error in all languages.

Stem-final vowels in Finnish
In Finnish nouns and adjectives, stem-final vowels commonly disappear or alternate with e or o when the plural marker i is added to the stem (Hakulinen et al. 2008: §45). For instance, the inessive plural of lasi 'glass' is laseissa. In principle, such alternations are predictable given the syllable count of the nominal stem, the stem-final consonants and the penultimate vowel, though the exact conditions are rather complex (Hakulinen et al. 2008: §46-50).
For the compound noun pohjanpystykorva 'norrbottenspets' (a breed of dog), cluzh-7 predicts an incorrect gen.pl. form *pohjanpystykorvojen for the intended pohjanpystykorvien; it has transformed the stem-final a to o and then selected the wrong plural marker (*-jen instead of -ien) as a result.

Ablaut in Dutch and German
Stem vowel alternations in the Germanic strong verbs are known as ablaut. Ablaut is robust in Dutch and German, and in both languages it is occasionally misapplied. In Dutch, for example, both systems overapply ablaut to the 1sg. preterite indicative form of printen 'to print', producing *pront instead of printte. Similar errors are found in German. Both systems underapply ablaut to the 1sg. preterite indicative form of saufen 'to drink', producing *saufte instead of the expected soff, and ue-lmu-1 overapplies ablaut to the third person preterite subjunctive of the weak verb versenken 'to sink', giving *versächten in place of the expected versenkten.

Consonant gradation in Finnish
Many Finnish words undergo a set of unpredictable stem changes known as consonant gradation. Here, a "strong grade" of a consonant (normally a voiceless stop like t) alternates with the weak grade (a voiced stop like d), but the stop may also delete in the weak grade (Hakulinen et al. 2008: §41). Gradation leads to inflection errors because not all lexemes participate in gradation, and because the weak grade of the stem consonant is not predictable from the lemma. For instance, cluzh-7 incorrectly applies the weak grade to the negated third person singular *ei kiemurda (from kiemurtaa 'to crawl'); the proper gradation is t-r instead of the predicted t-d. cluzh-7 also incorrectly produces the strong grade where the weak grade is required, failing to delete the k in the comitative *rikoslakein (from rikoslaki 'criminal law').

Linking vowels in Hungarian
The Hungarian noun plural suffix is -k, usually preceded by a linking vowel a, o, e, or ö. For example, the nom.pl. form of vér 'blood' is vérek. The choice of linking vowel is partly determined by vowel harmony: back vowel stems select a or o whereas front vowel stems select e or ö. However, for back vowel stems, it is largely unpredictable whether a or o is used (Siptár and Törkenczy 2000:224f., Vago 1980:110f.), and there are several cases where one or both systems predict an incorrect linking vowel. For example, ue-lmu-1 predicts an incorrect elative plural *masszázsakból for masszázs 'massage'; the correct form is masszázsokból.

Yers in Polish
Another sub-category of allomorphy error in Polish concerns the yers, the "fleeting vowels" of Slavic. Oblique forms of the Polish nouns klęska 'defeat' and żagiel 'sail', for example, lack a stem e or ie, respectively, in certain case forms, as seen in the gen.pl. klęsk and żagli. Because fleeting vowels' position and quality are unpredictable, they cannot be analyzed as epenthetic. Instead, they are assumed to be present in the underlying form of certain roots and affixes, but somehow represented distinctly from the non-fleeting vowels (Lightner 1965, Rubach 1986). According to this analysis, a yer is deleted except when the following syllable also contains a yer, and the fleeting e and ie surface in the nom.sg. forms above because the masculine nom.sg. suffix is itself a yer (Gussman 1980:36f., Rubach 1984). It is impossible to predict the position or quality of a yer without referring to the rest of the inflectional paradigm, and this indeterminacy contributes to several inflectional errors. For instance, cluzh-7 predicts *żagieli instead of the expected żagli, and both systems predict *klęsek for the expected klęsk. Similar errors are also found in Russian.

Diphthongization in Spanish

In some Spanish verbs, the stem vowels e and o diphthongize to ie and ue, respectively, under stress; whether a given verb diphthongizes is lexically unpredictable. Both models underapply diphthongization in *desplegue (from desplegar 'to unfold') and *recola (from recolar 'to strain again'). Interestingly, these are 1st conjugation (i.e., -ar) verbs, and children acquiring Spanish tend to underapply diphthongization in this class (Mayol 2007). But cluzh-7 also overapplies diphthongization in *atañieres (from atañer 'to concern') and *gañieseis (from gañir 'to yelp'). Similar errors occur in Portuguese, which also exhibits a stress-conditioned stem vowel alternation.

Genitive singular suffixes in Polish
Polish has two gen.sg. suffixes, -a and -u. It is generally impossible to predict which gen.sg. allomorph a given stem will select, and there is no evidence that one is more productive than the other (Dąbrowska 2001, 2005, Kottum 1981, Maunsch 2003). This unpredictable allomorphy causes many gen.sg. errors for both systems, such as *ateuszu for ateusza 'atheist', *izotopa for izotopu 'isotope', *krzyka for krzyku 'scream', and *legaru for legara 'joist'.

Verbal prefixes in German
Some verbal prefixes in German are known as "separable" because they separate (i.e., are postposed) from their host verb when tensed. Others, the "inseparable" prefixes, are always attached to their host verb without exception. Finally, some prefixes, such as um-, are separable or inseparable depending on the verb, and this leads to several errors. For example, both systems predict *umkehre for the 1sg. present indicative of umkehren 'to turn around'; the correct form is the separable kehre um.

Animacy in Polish and Russian
Case syncretisms in inanimate (i.e., non-personal) nouns are found in many Slavic languages. However, animacy is an inherent feature of nouns and cannot be predicted from the form of the lemma alone. In Russian, for example, cluzh-7 wrongly predicts a syncretic acc.pl. for the animate sadist 'id.' and both systems incorrectly predict a distinct (i.e., non-syncretic) acc.sg. for the inanimate magazin 'shop'. Similarly in Polish, both systems predict incorrect syncretic accusatives for animates such as śpiewak 'singer' and Żyd 'Jew', and incorrect non-syncretic accusatives for inanimates such as szampan 'champagne'. Some Polish stem changes are also conditioned by animacy. For example, for the inanimate noun katalizator 'catalyst', both systems incorrectly predict a nom.pl. *katalizatorzy instead of katalizatory; the mutation of r to rz before the nom.pl. -y is restricted to masculine animates (Feldstein 2001:27).

Aspect in Russian
Russian verbal inflection is conditioned by an inherent feature known as aspect. For instance, the perfective verb sorvat' 'to pick' forms a synthetic future whereas the closely-related imperfective sryvat' forms a periphrastic (i.e., multi-word) future using future-tense forms of byt' 'to be'. Several errors involve the wrong future form for a verb's aspect. For example, for the perfective sorvat', cluzh-7 incorrectly predicts a periphrastic second person singular future *budeš' sorvat' instead of the expected synthetic sorvëš'.

Vowel harmony in Finnish compounds
In Finnish, the first stem in a noun compound does not participate in suffix harmony (Hakulinen et al. 2008: §14). For example, the partitive singular of the compound lapinsirri 'Temminck's stint' (a type of bird) is lapinsirriä. Because this lemma is a compound of Lapin 'of Lapland' and sirri 'stint', and because all vowels in the second stem of the compound are neutral, front harmony, the default, applies. However, cluzh-7 generates *lapinsirria, a form which would be correct were the lemma not a compound.

Internal inflection in Russian compounds
Many Russian nouns in the shared task are adjective-noun or noun-noun compounds, and systems fail to appropriately inflect both components of the compound. The acc.pl. of lëgkaja promyšlennost' 'light industry' is lëgkie promyšlennosti, but ue-lmu-1 predicts *lëgkix promyšlennosti, incorrectly placing the adjective in the genitive case. Other adjective-noun compounds for which one or both of the systems fail to produce proper agreement morphology include vizitnaja kartočka 'business card' and bulevo množestvo 'boolean domain'. Both stems of most noun-noun compounds, particularly hyphenated compounds, are inflected. For example, the prepositional plural of gosudarstvo-donor 'donor state' is gosudarstvax-donorax, but both systems predict *gosudarstvo-donorax, in which only the second stem is inflected. However, there are some cases in which one stem of a compound is not declined. For instance, in sindrom Aspergera 'Asperger's syndrome', only the head noun sindrom should be inflected because Aspergera is a nominal modifier and already in genitive case, but both systems incorrectly inflect the second stem producing the gen.pl. *sindromov Asperger.

Spelling errors
Spelling errors are relatively rare overall. In Dutch, diaeresis is used to mark hiatus (adjacent vowels in consecutive syllables) and thus the past participle of upgraden 'to upgrade' should be geüpgraded, not the predicted *geupgraded. Several English errors concern an orthographic doubling of certain final consonants; for example, both systems predict a past participle *disentered instead of the expected disenterred. There are many German spelling errors, including several concerning the spelling of the gen.sg. suffix, written as -es or -s depending on context, or of s, ss, and ß, all pronounced [s]. In Spanish, a g followed by i or e is read as [x], not as [g], so the verb fungir 'to serve as' has a 1sg. present indicative spelled funjo rather than the predicted *fungo. Several Portuguese and Spanish predictions omit the acute accent used to indicate exceptional primary stress; e.g., Portuguese *influisse for the 1sg. imperfect subjunctive influísse (from influir 'to influence').

Target errors
Target errors heavily impact performance for Hungarian, Latin, and Romanian. Overall, nearly one fourth of our sample's errors were target errors, and we suspect such errors also lurk in the training and development data. Clearly, the UniMorph data used in this task requires further vetting.

Allomorphy errors
Overall, silly errors were far less common than allomorphy errors. Many of the allomorphy errors appear to result from unpredictable linguistic behaviors rather than failures to extract reliable generalizations. In some cases, errors reflect systems' inability to predict inherent features such as animacy and aspect in Slavic. Such features are not encoded in UniMorph, although this information is often present on Wiktionary. Generally speaking, these features cannot be predicted from the orthographic form of lemmata, but we suspect that the relevant information could be induced using either contextual or type-level word embeddings. We leave this for future work.

Systems also appear to struggle with lemmata which are themselves internally complex due to word-formation processes like prefixation or compounding, including prefix verbs in German and compounds in Finnish and Russian. Lemma-internal structure, once again, is not currently encoded in UniMorph, though it could in principle be extracted from Wiktionary entries. Finally, we see that systems struggle with certain lexically-specific morphophonological patterns (Germanic ablaut and umlaut, Finnish consonant gradation, Hungarian linking vowels, Slavic yers, and Spanish diphthongization) and with lexically-conditioned affix selection in German and Polish. We have seemingly rediscovered what linguists have long known: certain allomorphic patterns cannot be predicted from the form of lemmata alone; they must be memorized. It is unreasonable to expect any neural network, no matter how powerful, to predict what is truly unpredictable.
Our analysis is limited to languages included in the shared task, those for which the top systems have a non-trivial number of errors, and those for which we have sufficient linguistic expertise. As a result, our final sample of twelve languages includes only two major language families, Indo-European and Uralic, the latter represented by Finnish and Hungarian. However, this sample has some degree of grammatical diversity. Linguists traditionally distinguish between two types of morphological exponence. In agglutination, each morphological feature corresponds roughly to a single affix. For instance, in the Hungarian form cinkosoknak, the dat.pl. of cinkos 'accomplice', the -ok suffix marks plurality and the -nak suffix indicates the dative case. In fusion, on the other hand, single affixes may realize many morphological features at once. For instance, in the Russian form čabrecov, the gen.pl. of čabrec 'thyme', the -ov suffix is both genitive and plural (and its form also indirectly indicates that the stem is masculine). Agglutination is characteristic of the Uralic languages, whereas Indo-European languages make heavy use of fusion. Furthermore, vowel harmony is limited to the two Uralic languages.

Conclusion
We propose an error taxonomy for morphological inflection generation and apply it to the predictions of the two best systems in the CoNLL-SIGMORPHON 2017 Shared Task. We estimate a lower bound for the percentage of "target" errors in the gold data. Over 80% of the remaining (non-target) errors can be understood as misapplication of language-specific morphological or spelling principles. One potential remedy is to enrich the input linguistic representations with, e.g., compound structure and inherent grammatical features. However, this is unlikely to eliminate all errors: some morphological patterns cannot be generalized, only memorized.
The above analysis depends on manual annotation, but one might prefer to automate error classification. An automated system, for example, could be integrated into a rapid development process, or used as an additional objective during training and tuning, so long as it has reasonably high agreement with human experts. Ideally, such a system would scale to arbitrary languages, not just those for which linguistic expertise is readily available. A powerful ensemble model could help identify candidate target errors, and for certain high-resource languages, it might be possible to leverage finite-state morphological analyzers and lexicons to distinguish between silly, spelling, and allomorphy errors. We leave these and many other open questions for future work.