The CMU-LTI submission to the SIGMORPHON 2020 Shared Task 0: Language-Specific Cross-Lingual Transfer

This paper describes the CMU-LTI submission to the SIGMORPHON 2020 Shared Task 0 on typologically diverse morphological inflection. The (unrestricted) submission uses the cross-lingual approach of last year's winning submission (Anastasopoulos and Neubig, 2019), but adapted to use specific transfer languages for each test language. Our system, with fixed, non-tuned hyperparameters, achieved a macro-averaged accuracy of 80.65, ranking 20th among 31 systems; nevertheless, it was still tied for best system on 25 of the 90 total languages.


Introduction
Morphological inflection is the process that creates grammatical forms of a lexeme/lemma, typically guided by sentence structure. As a computational task it is framed as mapping from the lemma and a set of morphological tags to the desired form, which simplifies the task by removing the necessity of inferring the form from context. For an example from Asturian, given the lemma aguar and the tags V;PRS;2;PL;IND, the task is to produce the indicative mood, present tense, second-person plural form aguà.
Let X = x_1 ... x_N be the character sequence of the lemma, T = {t_1, ..., t_M} a set of morphological tags, and Y = y_1 ... y_K the target inflected character sequence. The goal is to model P(Y | X, T). The problem has been studied in various settings through the SIGMORPHON shared tasks (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019), with the 2019 edition focusing on particularly challenging low-resource scenarios. The 2020 edition (Vylomova et al., 2020) focused on the generalization of systems across typologically diverse languages, regardless of data size.
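Concretely, each training instance pairs a lemma and a tag set with a target form. A minimal sketch of this data format in Python (the class and function names here are illustrative, not the shared task's actual tooling; the data follows the standard tab-separated lemma/form/tags layout):

```python
from dataclasses import dataclass

@dataclass
class InflectionExample:
    lemma: list[str]    # X = x_1 ... x_N, the lemma as a character sequence
    tags: list[str]     # T = t_1 ... t_M, the morphological tag set
    target: list[str]   # Y = y_1 ... y_K, the inflected form as characters

def parse_unimorph_line(line: str) -> InflectionExample:
    """Parse one tab-separated line: lemma <TAB> inflected form <TAB> tags."""
    lemma, form, tags = line.rstrip("\n").split("\t")
    return InflectionExample(list(lemma), tags.split(";"), list(form))

ex = parse_unimorph_line("aguar\taguà\tV;PRS;2;PL;IND")
```

A system then models P(Y | X, T), scoring the character sequence `ex.target` given `ex.lemma` and `ex.tags`.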
In our submission we built upon our previous work (Anastasopoulos and Neubig, 2019), utilizing cross-lingual transfer from related languages, data hallucination, and a series of training techniques and regularizers. The defining change was that we attempted to create a language-specific regime for each test language, depending on the particular characteristics of the language, the amount of training data available for it, and the availability of data in related languages. As a result, for some high-resource languages we submitted systems without cross-lingual transfer, for some we used a single related high-resource language, and for some we used multiple related languages. Lastly, for a few test languages we augmented our datasets with romanized versions of the training data, an approach that has shown promising results in concurrent work (Murikinati et al., 2020).
Our submission is very competitive on 25 of the 90 test languages, with performance statistically indistinguishable from that of the best-performing system, but it falls behind on many others. We suspect this is because we did not tune the system's hyperparameters towards higher-resource settings.

Table 1: Accuracy of our system on every language. We highlight the languages where our system was statistically equal to the best system (with p < 0.005).

System Description
Our system is the same as that of Anastasopoulos and Neubig (2019): a neural multi-source encoder-decoder (which reads in the lemma and the tag sequence in a disentangled manner using two separate encoders) with a task-specific attention mechanism. Rather than repeat that description, we direct the interested reader to Anastasopoulos and Neubig (2019) for all details. It is important to note, however, that we did not tune any model hyperparameters for our submissions (which we suspect contributed to the poor performance of our system on some languages); we used the default parameters from the system's distribution,1 which are tuned towards extremely low-resource settings.
Here, we provide an exhaustive list of modifications to the general pipeline that we devised for specific languages and language families.

Tonal languages like Eastern Highland Chatino (cly) often denote a syllable's tone through superscript diacritics: take the Eastern Highland Chatino lemma sqwe¹⁴ and its second-person singular habitual inflected form nsqwe²⁰. The data hallucination technique would identify the substring sqwe as a stem-like region and replace its characters with random ones. A completely random substitution, however, could create nonsensical syllables if tone diacritics are inserted in place of letter characters, e.g. if we hallucinated the lemma s³ae¹⁴ from the example above. Similarly, if a stem-like region includes a tone diacritic, we would not want to randomly replace it with a non-diacritic character, lest we end up with badly formed syllables lacking tone information.
To avoid these issues, we restrict the random substitutions for Oto-Manguean languages with tone diacritics, so that we only sample tone diacritics if we are substituting a tone diacritic (and similarly for letter characters). We have found this approach to significantly improve results in previous work on morphological inflection for Eastern Highland Chatino (Cruz et al., 2020).
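The class-restricted substitution can be sketched as follows; the alphabets and the function are illustrative placeholders, not the actual Chatino character inventory or our exact implementation:

```python
import random

# Placeholder character classes: real inventories would come from the
# language's training data, not these illustrative sets.
LETTERS = list("abcdefghijklmnopqrstuvwxyz")
TONE_MARKS = list("¹²³⁴⁰")  # superscript tone diacritics (placeholder set)

def hallucinate_stem(stem: str, rng: random.Random) -> str:
    """Replace each character of a stem-like region with a random character
    drawn from the SAME class: tone mark -> tone mark, letter -> letter."""
    out = []
    for ch in stem:
        pool = TONE_MARKS if ch in TONE_MARKS else LETTERS
        out.append(rng.choice(pool))
    return "".join(out)

rng = random.Random(0)
fake = hallucinate_stem("sqwe¹⁴", rng)
# The four letters stay letters and the two tone marks stay tone marks,
# so the hallucinated syllable keeps a well-formed tone annotation.
```

Unrestricted hallucination would sample from the union of both classes, producing exactly the malformed syllables described above.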

Single-Language Systems for High-Resource Languages

For languages with more than 20,000 training examples, we decided not to use cross-lingual transfer or data hallucination, as systems in previous SIGMORPHON shared tasks achieved very competitive performance in such high-resource settings without these additions. For languages with fewer than 20,000 but more than 10,000 training examples, we used our data hallucination process to create 10,000 additional training examples.
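The size-based regime selection above can be sketched as a simple threshold rule (the function and dictionary keys are our own naming, and the hallucination amount in the lowest-resource branch is an assumption, as the text specifies it only for the mid-resource band):

```python
def choose_regime(n_train: int) -> dict:
    """Pick a training regime from the training-set size alone."""
    if n_train > 20_000:
        # High resource: no cross-lingual transfer, no hallucination.
        return {"cross_lingual": False, "hallucinated_examples": 0}
    if n_train > 10_000:
        # Mid resource: add 10,000 hallucinated examples, still monolingual.
        return {"cross_lingual": False, "hallucinated_examples": 10_000}
    # Below 10,000 examples, the transfer regimes described in the following
    # sections apply; the hallucination amount here is an assumption.
    return {"cross_lingual": True, "hallucinated_examples": 10_000}
```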
Cross-Lingual Transfer from a Single Language

For some languages we combined a single, high-resource related language into our training data to perform cross-lingual transfer, along with data hallucination. We based most of these decisions on previous results (mainly from Anastasopoulos and Neubig (2019)), but some on semi-arbitrary experimenter's intuition. The complete list of these settings is:
• for Middle High German (gmh) we used German (deu),
• for Middle Low German (gml) we used German (deu), also bypassing data hallucination,
• for Swiss German (gsw) we used German (deu),
• for North Frisian (frr) we used Dutch (nld),
• for Kannada (kan) we used Telugu (tel),
• for Telugu (tel) we used Kannada (kan),
• for Asturian (ast) we used Galician (glg),
• for Friulian (fur) we used French (fra),
• for Ladin (lad) we used Friulian (fur),
• for Venetian (vec) we used Italian (ita),
• for Anglo-Norman (xno) we used Middle French (frm),
• for Azerbaijani (aze) we used Turkish (tur),
• for Khakas (kjh) we used Turkish (tur), but without data hallucination, and
• for Võro (vro) we used Estonian (est).

Table 2: Accuracy per language family and sub-family (columns: Family, Sub-family, Acc.).
Multiple-Language Cross-Lingual Transfer

For extremely low-resource languages for which several closely related languages were available, we submitted systems with unique transfer-language combinations (all of these systems also included hallucinated data in the test language). Specifically:
• for Ingrian (izh) we used Estonian (est), Votic (vot), and a random sample (20,000 instances) of the Finnish (fin) data,
• for Votic (vot) we used Estonian (est), Ingrian (izh), and a random sample (20,000 instances) of the Finnish (fin) data, and
• for Urdu (urd) we used Hindi (hin) and Bengali (ben).

Romanization for Different Scripts

In concurrent work (Murikinati et al., 2020) we experimented with transliterating the transfer language into the test language's script, with encouraging results in low-resource settings. Alternatively, if the transfer languages use the Latin script but the test language does not, we found that romanizing the test language's training data and concatenating it as another "language" (along with the data in the original script) also helped. We applied these strategies as follows:
1. for Classical Syriac we used [...] as well as romanized Classical Syriac (Classical Syriac originally uses a distinct script),
2. for Pashto (pus) we used romanized Farsi (fas) and romanized Pashto, and
3. for Tajik (tgk) we used romanized Farsi (fas) and romanized Tajik.
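The concatenation strategy can be sketched as follows; the `romanize` function here is a tiny stand-in for a real transliteration tool, and the row layout and language tags are illustrative, not our exact implementation:

```python
def romanize(text: str) -> str:
    """Placeholder transliteration: a real system would use a full mapping
    from e.g. Cyrillic or Syriac script to Latin; this tiny table covers
    only the characters in the usage example below."""
    table = str.maketrans({"т": "t", "о": "o", "ҷ": "j", "и": "i", "к": "k"})
    return text.translate(table)

def build_training_set(test_lang_data, transfer_data):
    """Concatenate (language-tag, lemma, tags, form) rows: the original-script
    test-language data, the transfer-language data, and a romanized copy of
    the test-language data treated as its own 'language'."""
    rows = [("test", l, t, f) for l, t, f in test_lang_data]
    rows += [("transfer", l, t, f) for l, t, f in transfer_data]
    rows += [("test-rom", romanize(l), t, romanize(f)) for l, t, f in test_lang_data]
    return rows

# E.g., a (hypothetical) Tajik row plus a romanized Farsi transfer row:
rows = build_training_set([("ток", "N;SG", "токи")], [("tak", "N;SG", "taki")])
```

Tagging the romanized copy as a separate language lets the model share Latin-script parameters with the transfer data while still learning the original script.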
Results

Table 1 lists the accuracy of our submitted system on every language. We also report results per language family and genus in Table 2, to further facilitate an equitable evaluation across language families. Our system achieves a macro-averaged accuracy of 86.6% with a standard deviation of 14.3. Even though it does not use self-attention and we did not tune any hyperparameters, our system still achieved competitive performance, tying for first on 25 of the 90 total languages (although it does not outperform the best baseline system (Wu et al., 2020)). These include languages that were generally easy for all systems, such as the Austronesian and Niger-Congo ones. However, they also include extremely low-resource languages like Ludian (lud), Võro (vro), and Middle Low German (gml), where we suspect that our system performed on par with the more sophisticated (and, we suspect, tuned) systems thanks to our informed selection of languages for cross-lingual transfer.
The two language families where our system performs worst are Algic (Cree) and Tungusic (Evenki). We suspect this is because the data hallucination technique, which is crucial in such low-resource settings, is not appropriate for capturing the vowel harmony of Evenki along with its agglutinative morphological patterns: the hallucinated data do not follow these patterns and hence do not guide the model towards learning them. As for Cree, we suspect that the problem again lies in the data hallucination process: the polysynthetic and fusional nature of Cree inflected verb forms is too complicated to be modeled by the simple character-level alignment model that constitutes the first step of hallucination.

Conclusion and Future Work
The performance of our system in the 2020 SIGMORPHON Shared Task leaves many questions unanswered and several avenues to explore in future work. Regarding the choice of languages for cross-lingual transfer, we will further investigate the use of automatic suggestion systems such as the one of Lin et al. (2019). With regards to modeling, we will update our model to use sparsemax (Martins and Astudillo, 2016), which can facilitate exact search and hopefully lead to better results (Peters and Martins, 2019).
As we anticipate (and hope) that the shared task and the community as a whole will become more multilingual, we will employ the language/task selection method of Xia et al. (2020), which will allow us to tune systems on a small subset of languages that generalizes well to all others. Similarly, we will employ more sophisticated techniques for learning in multilingual settings, such as differentiable data selection (Wang et al., 2019, 2020), which will allow us to optimize a single model for multiple objectives (namely, each target language).