Pushing the Limits of Low-Resource Morphological Inflection

Recent years have seen exceptional strides in the task of automatic morphological inflection generation. However, for a long tail of languages the necessary resources are hard to come by, and state-of-the-art neural methods that work well under higher resource settings perform poorly in the face of a paucity of data. In response, we propose a battery of techniques that greatly improve performance under such low-resource conditions. First, we present a novel two-step attention architecture for the inflection decoder. In addition, we investigate the effects of cross-lingual transfer from single and multiple languages, as well as monolingual data hallucination. The macro-averaged accuracy of our models outperforms the state-of-the-art by 15 percentage points. Finally, we identify the crucial factors for success with cross-lingual transfer for morphological inflection: typological similarity and a common representation across languages.


Introduction
The majority of the world's languages are categorized as synthetic, meaning that they have rich morphology, be it fusional, agglutinative, polysynthetic, or a mixture thereof. As Natural Language Processing (NLP) keeps expanding its frontiers to encompass more and more languages, modeling of the grammatical functions that guide language generation is of utmost importance.
In the case of morphologically-rich languages, explicit modeling of the inflection processes has significant potential to alleviate issues created by data scarcity and the resulting lack of vocabulary coverage. The potential for impact is especially high for low-resource, under-represented languages and dialects. For example, speech recognition (Foley et al., 2018) and predictive keyboards (Breiner et al., 2019) for under-represented languages, if they exist, largely still rely on unigram lexicons with performance inferior to the sophisticated language models used in high-resource ones. Good inflection models would be invaluable for predictive text technology in morphologically-rich languages, as they could effectively enable proper handling of the huge vocabulary. Additionally, they could be very useful for building educational applications for languages of under-represented communities (along with their inverse, morphological analyzers). Encouraging examples are the Yupik morphological analyzer (Schwartz et al., 2019) and the Inuktitut educational tools from the respective Native Peoples communities. The social impact of such applications can be enormous, effectively raising the status of the languages slightly closer to the level of the dominant regional language.
Figure 1: Visualization of our proposed two-step attention architecture. The decoder first attends over the tag sequence T and then uses the updated decoder state s to attend over the character sequence X in order to produce the inflected form Y. (Example from Asturian: the lemma aguar with tags V;PRS;2;PL;IND yields aguà.)
Morphological inflection has been thoroughly studied in monolingual high-resource settings, especially through the recent SIGMORPHON challenges (Cotterell et al., 2016, 2017, 2018). Low-resource settings, in contrast, are relatively underexplored. One promising direction, and the main focus of the SIGMORPHON 2019 challenge (McCarthy et al., 2019b), is cross-lingual training, which has been successfully applied in other low-resource tasks such as Machine Translation (MT) or parsing.
In this work we focus on this cross-lingual setting for low-resource morphological inflection and propose several simple, yet effective approaches to mitigating problems caused by extreme lack of data which, put together, improve accuracy by 15 percentage points over a state-of-the-art baseline. This is achieved through the combination of a novel decoder architecture, a training regime that alleviates the need for costly structural biases that force attention monotonicity, and a data hallucination technique. We also present thorough ablations and identify the crucial factors for success with our approach.
Our system was the best performing system in the 2019 SIGMORPHON shared task on morphological inflection when evaluated on accuracy, while it ranked third when evaluated with average Levenshtein distance.

Task Definition and Approach
Morphological inflection is the process that creates grammatical forms (typically guided by sentence structure) of a lexeme/lemma. As a computational task it is framed as mapping from the lemma and a set of morphological tags to the desired form, which simplifies the task by removing the necessity to infer the form from context. For an example from Asturian, given the lemma aguar and tags V;PRS;2;PL;IND, the task is to create the indicative mood, present tense, 2nd person plural form aguà.
Let $X = x_1 \ldots x_N$ be the character sequence of the lemma, $T = t_1 \ldots t_M$ a set of morphological tags, and $Y = y_1 \ldots y_K$ the inflection target character sequence. The goal is to model $P(Y \mid X, T)$.
Our approach consists of three major components. First, we propose a novel two-step attention decoder architecture (§2.1). Second, we augment the low-resource datasets with a data hallucination technique (§2.2). Third, we devise a training schedule (§2.3) that replaces the structural biases otherwise used to enforce attention monotonicity.

Model Architecture
Our models are based on a sequence-to-sequence model with attention (Bahdanau et al., 2015). In broad terms, the model is composed of three parts: an encoder, the attention, and a recurrent decoder. In the setting of the inflection task, there is an additional input provided (the set of morphological tags) which requires an additional encoder.
A visualization of our model is shown in Figure 1. The two inputs are encoded separately: $h^x_n = \mathrm{enc}_x(h^x_{n-1}, x_n)$ and $h^t_m = \mathrm{enc}_t(T)$. In our implementation, we use a single-layer bidirectional recurrent encoder for the lemma, and a self-attention encoder (Vaswani et al., 2017) for encoding the tags, as there is no inherent order (e.g. left-to-right) in their presentation. In preliminary experiments, a self-attention lemma encoder proved hard to train, while a recurrent tag encoder yielded quite competitive results.
Next, we have attention mechanisms that transform the two sequences of input states into two sequences of context vectors, $c^x_k$ and $c^t_k$, via two matrices of attention weights ($k$ is the current decoder time step). Finally, the recurrent decoder computes a sequence of output states in a two-step process, from which a probability distribution over output characters can be computed. The attention mechanisms produce their weights (such as $\alpha^t$ for the tags) with matrices such as $W^h$ being parameters to be learned. The two-step attention process first uses the decoder's previous state $s_{k-1}$ as the query for attending over the tags. Then, it creates a tag-informed state $s_k$ by adding the tag context $c^t_k$ to the previous state. The tag-informed state is then used as the query for attending over the source characters to produce the context $c^x_k$. The last step is to update the recurrent state and produce the output character. Ultimately, we want the provided tag set to guide the generation, which also means influencing the attention over the characters of the lemma.
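The two-step attention described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the scoring function is a plain dot product (the model itself uses learned attention parameters), the names attend and two_step_attention are hypothetical, and the recurrent state update and output layer are omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, states):
    # ASSUMPTION: dot-product scoring stands in for the learned scoring function.
    weights = softmax([dot(query, h) for h in states])
    dim = len(states[0])
    context = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return weights, context

def two_step_attention(prev_state, tag_states, char_states):
    # Step 1: the previous decoder state queries the encoded tags.
    tag_weights, tag_context = attend(prev_state, tag_states)
    # Tag-informed state: previous state plus the tag context.
    tag_informed = [s + c for s, c in zip(prev_state, tag_context)]
    # Step 2: the tag-informed state queries the encoded lemma characters.
    char_weights, char_context = attend(tag_informed, char_states)
    return tag_weights, char_weights, char_context
```

Because the second query already contains the tag context, changing the tags changes which lemma characters receive attention, which is the intended effect of ordering the two attentions.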

Additional Structural Biases for Attention
Incorporating structural biases in the model's architecture or in the training objective can lead to improvements in performance, especially for tasks where the attention mechanism is expected to behave similarly to an alignment model, like MT. This idea has been successfully applied to the inflection task and is at the core of the state-of-the-art model (Wu and Cotterell, 2019).
One bias we deem important is coverage of all input characters and tags by the attentions. Intuitively, this entails encouraging the model to "look at" the whole input. We take the approach of Cohn et al. (2016) and add two regularizers over the final attention matrices, encouraging them to also sum to one column-wise:
$-\lambda \left\lVert \sum_j \alpha^t_{jm} - I \right\rVert_2 - \lambda \left\lVert \sum_j \alpha^x_{jn} - I \right\rVert_2$
Another bias that we incorporate encourages the Markov assumption over attention/alignments. Briefly, this means that if the i-th source character/word is aligned to the j-th target one (i ← j), then alignments i+1 ← j+1 or i ← j+1 are also quite likely. In a neural architecture this can be approximated by providing the attention weight vector from the previous timestep as input to the function that computes the attention weights. We refer the reader to Cohn et al. (2016) for exact details.
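As an illustration of the coverage bias discussed in this section, the following sketch computes the size of the penalty for one attention matrix. It assumes a plain (unsquared) L2 norm and a hypothetical weight lam; the function name coverage_penalty is not from the paper.

```python
import math

def coverage_penalty(attention, lam=1.0):
    """attention[j][i] is the weight on input position i at decoder step j.

    Each row already sums to one (softmax); the penalty grows when the
    column sums drift away from one, i.e. when some input position is
    ignored or attended to repeatedly.
    ASSUMPTION: unsquared L2 norm and a hypothetical weight `lam`.
    """
    n_inputs = len(attention[0])
    col_sums = [sum(row[i] for row in attention) for i in range(n_inputs)]
    return lam * math.sqrt(sum((s - 1.0) ** 2 for s in col_sums))
```

A perfectly one-to-one attention pattern incurs no penalty, while a decoder that keeps attending to the same input position is penalized in proportion to the imbalance.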
Adversarial Language Discriminator. When training multilingual systems, encouraging the encoder to learn language-invariant representations can often lead to improvements (Xie et al., 2017; Chen et al., 2018), as it forces the model to truly work in a multilingual setting. We achieve that by introducing a language discriminator (Ganin et al., 2016). This additional component receives the last output of the (bi-directional) intermediate lemma representations $h^x_N$ and outputs a prediction $y_l$ of the source language such that $y_l = \mathrm{softmax}(\mathrm{MLP}(h^x_N))$. The discriminator is trained to predict the language by minimizing a standard cross-entropy loss $L_l$, similar to Lample et al. (2018). However, in order to encourage the encoder to learn language-invariant representations, we reverse the gradients flowing from that component into the encoder during back-propagation.
Figure 2: Original triple: lemma παρακάμπτω, tags V;2;SG;IPFV;PST, form παρέκαμπτες. Hallucinated triple: lemma πξρακάμοτω, tags V;2;SG;IPFV;PST, form πξρέκαμοτες.

Data Hallucination
Low-resource language datasets are usually too small to allow for proper learning with neural networks. A major issue in our case is label bias. Put simply, the character decoder will overly prefer outputting common character sequences. However, with just 50 examples to learn from, the output character n-gram distribution will hardly match the real one, because the majority of the n-grams will have zero probability. In order to mitigate this issue, we augment our training sets with hallucinated data. In most languages morphological reinflection is realized by adding, deleting, or modifying prefixes or suffixes over a stem that is mainly unchanged. Though we do not have prior information regarding the stem or affixes of our data, we can use character alignment to approximately compute them. We use the alignment method from the SIGMORPHON 2016 baseline (Cotterell et al., 2016) to align the lemmata and the inflected forms. For each example, we consider as part of the stem any sequence of three or more consecutive lemma characters that are aligned to the exact same characters in the inflected form. Now, for each such region considered as a "stem", we randomly substitute its inside (not start or end) characters with other characters from the language's alphabet. Note that we do not change the length of the region, though allowing for such variation could possibly lead to further improvement. The substitution characters are sampled uniformly from the alphabet; sampling from a more informed distribution is a potential avenue for further improvement. Overall, we hallucinated 10,000 examples for each low-resource language, creating an additional hallucinated dataset H.
A visualized example of our hallucination process is outlined in Figure 2. Out of the three regions with matching aligned characters (thick lines), we identify two with length equal to three or more. In the hallucinated example (bottom of the figure), we sample random characters for the inside of such regions. Silfverberg et al. (2017) have proposed a data hallucination method conceptually quite similar to ours, which treats the single longest common continuous substring between lemma and form as the stem. Their approach would be effectively similar to ours for languages with affixal stem-invariant morphology, but it would likely fail on more complicated morphological phenomena like apophony, stem alternation (conceptually similar to infix morphology) or the root-and-pattern morphology of Semitic languages. Consider the apophony example from the past participle form gschwommen of the lemma schwimmen in Swiss German: our approach would treat both the schw and the mmen regions as a stem, as opposed to only considering one (footnote 4). Note that neither of the two approaches is suitable for phenomena such as suppletion (e.g. the inflection of the Spanish verb ir 'to go' into fue 'went') or phenomena like the reduplication pattern of Indonesian noun plurals as in kuda 'horse', kuda-kuda 'horses' (Sneddon et al., 2012).
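The hallucination procedure can be sketched as follows. This is a minimal sketch under loud assumptions: Python's difflib.SequenceMatcher stands in for the SIGMORPHON 2016 baseline aligner (it is not the same algorithm), the function name hallucinate is hypothetical, and the alphabet is passed in by the caller.

```python
import random
from difflib import SequenceMatcher

def hallucinate(lemma, form, alphabet, rng, min_len=3):
    """Return a (lemma, form) pair with stem-internal characters resampled.

    ASSUMPTION: matching blocks from difflib approximate the character
    alignment; regions of at least `min_len` shared characters count as stem.
    """
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    new_lemma, new_form = list(lemma), list(form)
    for a, b, size in matcher.get_matching_blocks():
        if size < min_len:
            continue
        # Keep the first and last character of the region; resample the inside
        # uniformly, applying the same substitution to lemma and form.
        for offset in range(1, size - 1):
            ch = rng.choice(alphabet)
            new_lemma[a + offset] = ch
            new_form[b + offset] = ch
    return "".join(new_lemma), "".join(new_form)

if __name__ == "__main__":
    rng = random.Random(0)
    # Swiss German example from the text: both "schw" and "mmen" count as stem.
    print(hallucinate("schwimmen", "gschwommen", "abcdefghijklmnopqrstuvwxyz", rng))
```

Region lengths are never changed, so affixes and the unaligned vowel alternation stay exactly where they were in the original pair.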

Training Schedule
Our training schedule attempts to balance two desires: helping the model to (1) learn to copy, and (2) learn cross-lingually with a particular focus on the low-resource language. To achieve this, we split training into three phases: warm-up, cross-lingual training, and low-resource fine-tuning.
Footnote 4: Most likely, the desired parts are schw and mm.
Phase 1: Warm-Up. As several previous works have noted, learning to copy is crucial. Unlike other proposed models, though, our model does not include any structural biases that encourage copying. Instead, we rely on an additional copying task in a warm-up period. We transform each training triple [X, T, Y] into two additional triples that encourage copying: [X, COPY, X] and [Y, T, Y]. Using both the input and output sequences for the copying task allows the use of slightly more diverse data. Note that we only use the correct tags when copying the inflected form. For copying the lemma we use a specialized tag (COPY). Additional improvements could be achieved if one knew the exact tags that match the lemmata. But this could vary by language: an English verb's lemma is its infinitive and has a V;NFIN tag, while a verb lemma in Modern Greek is its V;1;SG;IPFV;PRS form.
In this stage we use a relatively large batch size (10) in order to encourage more coarse updates. In most cases, the model achieves extremely high copying accuracy after a couple of warm-up epochs, so we stop the warm-up stage when copying accuracy exceeds 75%. At this point the attention mechanism over the source characters has learned to be monotonic. In contrast, the model of Wu and Cotterell (2019) requires a dynamic programming method that forces strict monotonicity, with an additional (non-parallelizable) computational cost of $O(|x|^2)$ throughout training.
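The copying transformation used in the warm-up phase is simple enough to state as code. The sketch below assumes triples are represented as (lemma, tags, form) tuples; the function name copy_triples and the tuple layout are illustrative choices, not the authors' implementation.

```python
COPY_TAG = "COPY"  # the specialized tag used when copying the lemma

def copy_triples(lemma, tags, form):
    """Turn one training triple [X, T, Y] into the two copying triples
    [X, COPY, X] and [Y, T, Y]."""
    return [
        (lemma, [COPY_TAG], lemma),  # lemma tags are unknown, so use COPY
        (form, list(tags), form),    # the inflected form keeps its true tags
    ]

if __name__ == "__main__":
    for triple in copy_triples("aguar", ["V", "PRS", "2", "PL", "IND"], "aguà"):
        print(triple)
```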
Phase 2: Cross-lingual training. In the main training phase we use both high- and low-resource language data (including any hallucinated data). If not using hallucinated data, we up-sample the low-resource data in order to match the size of the high-resource data. Furthermore, with probability 0.30 we also sample copying tasks to intersperse throughout the training epoch. This ensures that the source-character attention remains monotonic.
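One possible reading of the Phase 2 data mixing is sketched below. The exact sampling scheme is not specified in the text, so this is an assumption: after every regular example, a copying example is added with probability 0.30, and the low-resource data is repeated until it matches the high-resource size. The name phase2_examples and its parameters are hypothetical.

```python
import random

def phase2_examples(high, low, copy_pool, rng, copy_prob=0.30, upsample=True):
    """Build one epoch of cross-lingual training examples.

    ASSUMPTION: up-sampling is done by repeating the low-resource list, and
    copying tasks are interleaved one at a time after regular examples.
    """
    if upsample and low and len(low) < len(high):
        reps = -(-len(high) // len(low))  # ceiling division
        low = (list(low) * reps)[:len(high)]
    main = list(high) + list(low)
    rng.shuffle(main)
    stream = []
    for example in main:
        stream.append(example)
        if copy_pool and rng.random() < copy_prob:
            stream.append(rng.choice(copy_pool))
    return stream
```

When hallucinated data is used, the caller would pass it as part of low and set upsample=False, matching the condition stated above.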
Phase 3: Fine-tuning. The last phase is inspired by fine-tuning, or continued training, as applied to cross-lingual MT, e.g. by Neubig and Hu (2018), or to domain adaptation for MT. The setting is nearly identical to the second phase, except that we only use the low-resource language data for training and do not use the copying task. Furthermore, we replace teacher forcing with scheduled sampling, where with probability 50% the input to the next step of the decoder is not the gold one, but the model's previous prediction. This technique allows the model to become more robust to its own mistakes, effectively limiting the effect of exposure bias. It is worth noting that at this point the learning rate is typically quite small, so we reduce the batch size to a single instance. In most cases, though, the improvements on development set accuracy are marginal (1 to 2%).
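The scheduled sampling decision in the fine-tuning phase amounts to a coin flip at each decoder step. The helper below is a hypothetical illustration (the name next_decoder_input is not from the paper), showing only the choice of input and not the decoder itself.

```python
import random

def next_decoder_input(gold_prev, predicted_prev, rng, sample_prob=0.5):
    """Choose what the decoder sees as the previous character.

    With probability `sample_prob` the model's own previous prediction is fed
    back; otherwise the gold character is used, as in teacher forcing.
    """
    return predicted_prev if rng.random() < sample_prob else gold_prev
```

Setting sample_prob to 0 recovers plain teacher forcing, which is what the first two training phases use.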

Inference and Implementation Details
The performance of inflection systems is typically evaluated with exact-match token-level accuracy, as well as character-level Levenshtein distance. Hence, during training we continuously evaluate the model's performance on the development set with both metrics. Accordingly, we store three checkpoints: the one that achieved the highest accuracy, the one that reached the lowest Levenshtein distance, and the one that improved on both metrics over previous dev evaluations. In a few cases these three checkpoints coincide, but this is rather rare. We ensemble these three models with equal weights for producing our final predictions. All our models are implemented in DyNet. Each model is trained on a single CPU, as each training run requires less than 1GB of RAM and typically concludes within 3 to 4 hours. We provide additional hyperparameter details in the Appendix.
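For reference, the two evaluation metrics can be computed as in the sketch below. The function names levenshtein and evaluate are illustrative, and the edit distance uses unit costs for insertion, deletion, and substitution, which is the standard definition but an assumption about the exact shared-task scorer.

```python
def levenshtein(a, b):
    """Character-level edit distance with unit costs (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]

def evaluate(predictions, golds):
    """Return (exact-match accuracy, average Levenshtein distance)."""
    n = len(golds)
    accuracy = sum(p == g for p, g in zip(predictions, golds)) / n
    avg_distance = sum(levenshtein(p, g) for p, g in zip(predictions, golds)) / n
    return accuracy, avg_distance
```

The two metrics can disagree: a system that is often almost right scores well on distance but poorly on accuracy, which is one reason to keep a separate checkpoint for each.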

Empirical Results
Our experiments are conducted on the SIGMORPHON 2019 shared task data.
Test Accuracy. Our main results are summarized in Table 1. Our models significantly outperform the baseline, evaluated with macro-averaged accuracy over the 100 language pairs. Our novel architecture performs slightly better than the baseline without any additional data. Importantly, including the hallucinated datasets boosts total accuracy by 10 percentage points, while transfer from multiple languages further improves to a state-of-the-art accuracy of 63.8%, higher than any system submitted to the shared task.
It is worth noting that these improvements are not uniform across languages/pairs: without hallucinated training data, our model's median is notably larger than its average, implying that for a few language pairs our model is under-trained and under-performs (in particular, pairs with Yiddish, Votic, and Ingrian as test languages). Nonetheless, if one had access to an oracle that would allow them to select the best performing model out of all the different settings, they would achieve an oracle accuracy of 68.2%.
Table 3: Results with a single transfer language. Monolingual data hallucination is crucial due to the distance of the languages. In some cases cross-lingual transfer should be avoided in favor of a purely monolingual setting (H).

Architecture Ablations
We focus on the architecture of the model and the structural biases we introduced, using the development set for evaluation. Table 2 presents the macro-averaged accuracy for the baseline (Wu and Cotterell, 2019) and the different versions of our model. The first thing to note is the importance of the warm-up period in our training schedule and the use of the copying task. Our model neither handles copying in any explicit way nor encourages the attention to be monotonic. Without the warm-up period and the additional copying tasks, our model's dev accuracy is worse than any of the baselines.
On the other hand, our two-step attention trained with the additional copying task already improves over most baselines without any of the additional biases, with a dev accuracy of 48. The best baseline model, with an average dev accuracy of 51.3, is better than our models, but this difference is not reflected on the test set (see Table 1) where our model performs slightly better. This discrepancy could be due to optimizer or early stopping decisions, but we leave a more thorough investigation for future work.
We attribute the success of our model to two factors, the first being our novel architecture. Compared to a single attention over the concatenated tag and lemma sequences, our two-step attention has the advantage of two distinct attention mechanisms, which capture the inherently different properties of the tag and the lemma character sequences. Another advantage is the two-step process that guides the lemma attention with the tags. Disentangling the two attentions and ordering them in an intuitive way makes it easier for them to learn their respective tasks.
The second factor is, we suspect, a slightly better choice of hyperparameters: our models are smaller than the baseline ones. Although we did not tune our hyperparameters extensively, a few experiments on a couple of language pairs showed that increasing the model size hurt performance under these extremely data-scarce conditions.
Each of the additional techniques we tested further contributed a few accuracy points. The attention biases add 0.7 percentage points and scheduled sampling in the third training phase adds a further 0.4 percentage points. The different batching schedule, with large batch sizes in the beginning but smaller ones towards the end of training, helps a little more in terms of accuracy, but most importantly it speeds up training. Table 2 also reports the development set accuracy when using hallucinated data. Although the large improvements are also proportionally reflected in the test set, these numbers are not directly comparable to the rest, as the development set data were used in the hallucination process.

Analysis
We analyze the results over various groupings of languages to elucidate the properties of our models over the 100 quite diverse language pairs. We use typological information from the URIEL database (Littell et al., 2017).
Table 4: Results with multiple transfer languages (sample). The best performing system per target (L2) language is highlighted. We repeated the hallucinated-data-only experiments as many times as potential transfer (L1) languages, hence the H column reports average accuracy ± standard deviation.
We first consider the test languages for which a single transfer language was provided, with detailed results presented in Table 3. Generally, the average genetic distance between the transfer and the task language is quite high (0.75) for these pairs. It is easy to observe that the larger the typological distance, the larger the improvement from adding hallucinated data. In fact, excluding the languages with no typological information available, there is a strong correlation (ρ=0.6) between genetic distance and improvement from hallucinated data. For cross-lingual transfer to be useful, the languages need to be at least somewhat related and share similar characteristics. A prime example is transfer from Italian for Neapolitan, which achieves 70% accuracy without any additional synthesized data. In the same vein, the same condition is necessary for the adversarial language discriminator to have impact, as using it on extremely distant language pairs leads to worse performance (e.g. Russian-Portuguese, Bengali-Greek, or Urdu-Old English). This is expected, as forcing language-invariant representations across vastly different languages is analogous to representing a bimodal distribution with its mean.
The results on Kurmanji-Sorani (Northern Kurdish-Central Kurdish) seem to be a valid counter-example to the above statement, i.e. the two languages are related, but cross-lingual transfer without hallucinated data performs poorly, achieving a mere 16.2 accuracy. The reason for this discrepancy lies in the characters: Kurmanji is written in the Latin alphabet, while Sorani uses the Arabic one. The lack of any similar representations across the languages is too hard to overcome even with the adversarial language discriminator.
Multiple Transfer Languages. For most low-resource languages and especially dialects, there exist several possible candidate transfer languages that can be related enough to satisfy the similarity constraint. We present extensive ablations on such cases in Table 4 with results on the rest of the SIGMORPHON language pairs. We again observe positive correlations between the language genetic proximity and the performance of cross-lingual transfer, even with all the transfer and test languages being related. For example, transfer from (distant) Basque to Kashubian performs about 10 percentage points worse than transfer from (related) Slovak or Czech, which in turn perform worse than transfer from Polish (more closely related to Kashubian than the others). Transfer for North Frisian using Danish is also 10 percentage points worse than transfer from the more closely related Dutch. It is worth noting that using the hallucinated data reduces the effect of the genetic distance between languages, with the standard deviation across the results within the same test language becoming much smaller.
A very interesting case of cross-lingual transfer is Maltese, which is a Semitic language and hence genetically close to Hebrew and Arabic. Surprisingly, we obtain better results when transferring from Italian. Again, a script discrepancy could be the main reason, also considering that the root-and-pattern morphology is only partially expressed in the scripts of Hebrew and Arabic, whereas it is fully expressed (by writing all the vowels) in Maltese. We should also point out that genetic similarity might not be enlightening enough. As Hoberman and Aronoff (2003) point out, the productive verbal morphology in Maltese has become affixal due to borrowings from Romance languages. Moreover, all provided train, dev, and test examples are verbs (no nouns or adjectives), which helps explain this seemingly counter-intuitive result.
Another interesting case is that of cross-lingual transfer for Bengali, with the potential languages varying from very related (Sanskrit, Hindi) to only distantly related (Greek). Nevertheless, there is notably little variance in the performance of the systems. We believe that again the culprit is the difference in writing systems between all selected transfer and test languages, which does not allow our system to leverage cross-lingual information: our Bengali data use the Bengali script, the Urdu dataset is in the Arabic one, Hindi and Sanskrit use Devanagari, and Greek uses the Greek alphabet.
We also present detailed results with cross-lingual transfer from all transfer languages from the suggested SIGMORPHON pairs. In 14 out of the 29 test languages, our best-performing model is trained on multiple transfer languages. For instance, using Turkish, Persian, Bashkir, and Uzbek data for transfer to Azeri leads to a 6 point improvement over any single-language-transfer result. A potential explanation is that a dialect/language has indeed been influenced by multiple languages. Another reason could lie in the increased amount of data and potential regularization effects. We suspect the truth lies in the union of those factors, but nonetheless we conclude that whenever available, transfer from multiple related languages can further improve accuracy.
In our experiments we used all transfer languages that the SIGMORPHON organizers proposed in the 2019 challenge. A more sophisticated data selection process could likely yield improvements. For example, Yiddish is primarily based on High German and has elements from Hebrew, Aramaic, as well as Slavic languages, but we did not test how transfer from these languages might perform. Moreover, in order to remain faithful to the SIGMORPHON challenge, we used some distant languages that probably worsen the results (e.g. Greek for Bengali).
Also, alphabet divergence issues still need to be addressed. For instance, we suspect that our accuracy in Yiddish (which uses the Hebrew alphabet with Yiddish orthography) could be greatly improved by finding some type of mapping between its orthography and the Latin script that most of its related languages use. The same could hold for transfer for the central Asian languages (Tatar, Turkmen, Azeri among others) which use a variety of the Latin, Cyrillic, or Arabic scripts.
Lastly, we experimented with a completely monolingual setting, using just the low-resource and hallucinated language data (columns H in Tables 3 and 4). For fairer comparison to cross-lingual transfer, we repeated the hallucination process as many times as candidate transfer languages and we report the mean and standard deviation of the test set accuracy. This baseline is extremely competitive, lagging only a few points behind the L + H combination. Encouragingly, this entails that hallucination is a viable option for entire language families without a single high-resource representative, or for low-resource isolates.
Interpretability. We find that the attention matrices can help understand our model's predictions. A visualization of two examples is shown in Figure 3, showcasing the interpretability advantage of the disentangled two-step attentions. In the Kazakh example the tag attention clearly identifies the suffixes да (that marks plural) and рды (that marks the accusative case). The Greek example shows well how the two-step process allows the tags to guide the lemma attention. Due to the SBJV tag, the model does not use the lemma until the necessary particle να has been generated. Consequently, the lemma attention properly copies the stem, and then the tag attention attends first over PFV and then over PL and 3 in order to construct the correct suffix for perfective and 3rd person plural.

Related Work
The inflection task in high-resource settings has been extensively studied through the SIGMORPHON shared tasks. Notably, the best models explicitly model copying and hard monotonic attention (Aharoni et al., 2016; Aharoni and Goldberg, 2017), with the previous state-of-the-art forcing strict monotonicity (Wu and Cotterell, 2019). We instead achieve state-of-the-art results with a cheaper approach that simply intermixes a copying task, which also encourages monotonicity. Data augmentation for inflection has been explored by Bergmanis et al. (2017) and Zhou and Neubig (2017) among others. The work of Silfverberg et al. (2017) is the most similar to ours, but as we already discussed, it has a few shortcomings that our approach addresses. Prior work has identified typology as playing a role for cross-lingual transfer, but it measures language similarity using lexical overlap. We contend that this data-based measure is less informative and more susceptible to variation, so we instead use the genetic typological information to quantify correlations between performance improvements and language distance.
Our novel two-step process decoder architecture bears similarities with multi-source models (Anastasopoulos and Chiang, 2018; Zoph and Knight, 2016), which provide two contexts from two encoded sources to the decoder. A similar disentangled encoding was also used by Ács (2018) for their SIGMORPHON 2018 submission. We in fact experimented with this architecture, but preliminary results on the development sets showed that our two-step architecture achieved better performance. Interestingly, the second-best performing system (Peters and Martins, 2019) at SIGMORPHON 2019, which also ranked first in terms of Levenshtein distance, also uses decoupled encoders to separately encode the lemma and the tags; this further consolidates our belief that such an approach is superior to using a single encoder for the concatenated sequence of the tags and lemma. The main difference to our model is that they do not use our two-step decoder process, while they substitute all softmax operations with sparsemax (Martins and Astudillo, 2016), yielding interpretable attention matrices very similar to ours. The use of sparsemax in conjunction with our two-step decoder process, as well as alongside our data hallucination technique, presents a promising direction towards even better results in the future.

Conclusion
With this work we advance the state-of-the-art for morphological inflection on low-resource languages by 15 points, through a novel architecture, data hallucination, and a variety of training techniques. Our two-step attention decoder follows an intuitive order, also enhancing interpretability. We also suggest that complicated methods for copying and forcing monotonicity are unnecessary. We identify language genetic similarity as a major success factor for cross-lingual training, and show that using many related languages leads to even better performance. Despite this significant stride, the problem is far from solved. Language-specific or language-family-specific improvements (e.g. properly handling different alphabets, or using an adversarial language discriminator) could potentially further boost performance.

Tables
Table 5 lists all results for test languages with a single candidate transfer language. Results on test languages with multiple candidate languages (both single-language transfer, and transfer with all suggested candidate languages) are listed in Tables 6 and 7.
Table 5: Results with a single transfer language. Monolingual data hallucination is crucial due to the distance of the languages. In some cases cross-lingual transfer should be avoided in favor of a purely monolingual setting (H).