Zero-Shot Cross-Lingual Transfer with Meta Learning

Learning what to share between tasks has recently become a topic of great importance, as strategic sharing of knowledge has been shown to improve downstream task performance. In multilingual applications, sharing knowledge between languages is important when considering that most languages in the world are under-resourced. In this paper, we consider the setting of training models on multiple different languages at the same time, when English training data is available, but little or no in-language data. We show that this challenging setup can be approached using meta-learning: in addition to training a source language model, another model learns to select which training instances are the most beneficial. We experiment using standard supervised, zero-shot cross-lingual, as well as few-shot cross-lingual settings for different natural language understanding tasks (natural language inference, question answering). Our extensive experimental setup demonstrates the consistent effectiveness of meta-learning for a total of 16 languages. We improve upon the state-of-the-art for zero-shot and few-shot NLI (on the MultiNLI and XNLI datasets) and QA (on the X-WikiRE dataset). We further conduct a comprehensive analysis, which indicates that the correlation of typological features between languages can further explain when parameter sharing learned via meta-learning is beneficial.


Introduction
There are more than 7000 languages spoken in the world, over 90 of which have more than 10 million native speakers each (Eberhard et al., 2019). Despite this, very few languages have proper linguistic resources when it comes to natural language understanding tasks. Although there is a growing awareness in the field, as evidenced by the release of datasets such as XNLI (Conneau et al., 2018), most NLP research still only considers English (Bender, 2019). While one solution to this issue is to collect annotated data for all languages, this process is too time-consuming and expensive to be feasible. Additionally, it is not trivial to train a model for a task in a particular language (e.g., English) and apply it directly to another language where only limited training data is available (i.e., low-resource languages). Therefore, it is essential to investigate strategies that allow one to use the large amount of training data available for English for the benefit of other languages.
Meta-learning has recently been shown to be beneficial for several machine learning tasks (Koch et al., 2015; Vinyals et al., 2016; Santoro et al., 2016; Finn et al., 2017; Ravi and Larochelle, 2017; Nichol et al., 2018). In the case of NLP, recent work has also shown the benefits of this sharing between tasks and domains (Dou et al., 2019; Gu et al., 2018; Qian and Yu, 2019). Although meta-learning for cross-lingual transfer has been investigated in the context of machine translation (Gu et al., 2018), this paper is, to the best of our knowledge, the first attempt to study the effect of meta-learning for natural language understanding tasks. In this work, we investigate cross-lingual meta-learning using two challenging evaluation setups, namely (i) few-shot learning, where only a limited amount of training data is available for the target domain, and (ii) zero-shot learning, where no training data is available for the target domain. Specifically, in Section 4, we evaluate the performance of our model on two natural language understanding tasks, namely (i) NLI (Natural Language Inference), experimenting on the MultiNLI dataset and on the cross-lingual XNLI dataset (Conneau et al., 2018), and (ii) cross-lingual QA (Question Answering) on the X-WikiRE dataset (Levy et al., 2017; Abdou et al., 2019). To summarise, the contribution of our model (detailed in Section 3) is four-fold.
Concretely, we: (i) exploit the use of meta-learning methods for two different natural language understanding tasks (i.e., NLI, QA); (ii) evaluate the performance of the proposed architecture in various scenarios: cross-domain, cross-lingual, standard supervised as well as zero-shot, across a total of 16 languages (i.e., 15 languages in XNLI and 4 languages in X-WikiRE); (iii) observe consistent improvements of our cross-lingual meta-learning architecture (X-MAML) over the state-of-the-art on various cross-lingual benchmarks (i.e., 3.65% average accuracy improvement on zero-shot XNLI, 1.04% average accuracy improvement on few-shot XNLI, and 0.55% improvement in terms of average F1 score on zero-shot QA); and (iv) perform an extensive error analysis, which reveals that cross-lingual trends can partially be explained by typological commonalities between languages.

Meta-Learning
Meta-learning tries to tackle the problem of fast adaptation to a handful of new training data instances. It discovers the structure among multiple tasks such that new tasks can be learned quickly. This is done by repeatedly simulating the learning process on low-resource tasks using many high-resource ones (Gu et al., 2018). There are several ways of performing meta-learning: (i) metric-based (Koch et al., 2015; Vinyals et al., 2016); (ii) model-based (Santoro et al., 2016); and (iii) optimisation-based (Finn et al., 2017; Ravi and Larochelle, 2017; Nichol et al., 2018). Metric-based methods aim to learn similarities between feature representations of instances from different training sets given a similarity metric. For model-based architectures, the focus has been on adapting models that learn fast (e.g., memory networks) for meta-learning (Santoro et al., 2016). In this work, we focus on optimisation-based methods due to their superiority in several tasks (e.g., computer vision (Finn et al., 2017)) over the above-mentioned meta-learning architectures. These optimisation-based methods are able to find good initialisation parameter values and adapt to new tasks quickly. To the best of our knowledge, we are the first to exploit the idea of meta-learning for transferring zero-shot knowledge in a cross-lingual setting for natural language understanding, in particular for the tasks of NLI and QA. Specifically, we exploit Model-Agnostic Meta-Learning (MAML), which uses gradient descent and achieves good generalisation on a variety of tasks (Finn et al., 2017). MAML is able to quickly adapt to new target tasks using only a few instances at test time, assuming that these new target tasks are drawn from the same distribution.
Formally, MAML assumes a distribution p(T) over tasks {T_1, T_2, ..., T_k}. The parameters θ of model M for a particular task T_i, sampled from p(T), are updated to θ_i. In particular, θ is updated using one or a few iterations of gradient descent on the training examples D_i^train of task T_i. For a single gradient update:

θ_i = θ − α ∇_θ L_{T_i}(M_θ),   (1)

where α is the step size, M_θ is the model parameterised by θ, and L_{T_i} is the loss on the specific task T_i. The parameters θ are trained to optimise the performance of the adapted model M_{θ_i} on the unseen test examples D_i^test across tasks from p(T). The meta-learning objective is:

min_θ Σ_{T_i ∼ p(T)} L_{T_i}(M_{θ_i}).   (2)

The MAML algorithm optimises the model parameters via a small number of gradient steps on a new task, which we refer to as the meta-update. The meta-update across all involved tasks is performed on the parameters θ using stochastic gradient descent (SGD):

θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L_{T_i}(M_{θ_i}),   (3)

where β is the meta-update step size.
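The inner update (Eq. (1)) and the meta-update (Eq. (3)) can be sketched in a few lines of NumPy on a toy regression problem. This is a first-order (FOMAML-style) sketch with an illustrative task distribution and model, not the architectures used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta, X, y):
    # Gradient of the mean squared error for a linear model X @ theta
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def mse(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

def make_task(true_theta, k=16, q=16):
    # A toy "task": noisy linear data split into D_train (k) and D_test (q)
    X = rng.normal(size=(k + q, 2))
    y = X @ true_theta + 0.01 * rng.normal(size=k + q)
    return (X[:k], y[:k]), (X[k:], y[k:])

def maml_meta_step(theta, tasks, alpha=0.05, beta=0.05):
    # First-order approximation of the meta-update (second derivatives dropped)
    meta_grad = np.zeros_like(theta)
    for (Xtr, ytr), (Xte, yte) in tasks:
        theta_i = theta - alpha * loss_grad(theta, Xtr, ytr)  # Eq. (1)
        meta_grad += loss_grad(theta_i, Xte, yte)             # gradient of outer loss
    return theta - beta * meta_grad                           # Eq. (3)

theta = np.zeros(2)
for _ in range(300):
    tasks = [make_task(rng.normal(size=2)) for _ in range(4)]
    theta = maml_meta_step(theta, tasks)
```

After meta-training, a single inner-loop step on a new task's support set should already reduce that task's loss, which is the behaviour MAML optimises for.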
Algorithm 1: X-MAML
Input: high-resource language h, set of low-resource languages L, model M, step size α and learning rate β
1  Pre-train M on h to obtain initial model parameters θ
2  Select one or more languages from L as a set of auxiliary languages A
3  while not done do
4    for l ∈ A do
5      Sample batches of tasks T_i from the development set of the auxiliary language l
6      for each T_i do
7        Sample k data points to form D_i^train
8        Sample q data points to form D_i^test
9        Compute adapted parameters with gradient descent: θ_i = θ − α ∇_θ L_{T_i}(M_θ)
10       Evaluate the loss L_{T_i}(M_{θ_i}) on D_i^test
11     Meta-update: θ ← θ − β ∇_θ Σ_{T_i} L_{T_i}(M_{θ_i})
12 Perform either (i) zero-shot or (ii) few-shot learning on L \ A using the meta-learned parameters θ

Cross-Lingual Meta-Learning
The underlying idea of using MAML in NLP tasks (Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019) is to employ a set of high-resource auxiliary tasks/languages to find an optimal initialisation from which learning a target task/language can be done using only a small number of training instances. In a cross-lingual setting (i.e., XNLI, X-WikiRE), where only an English dataset is available as a high-resource language and a small number of instances are available for other languages, the training procedure for MAML requires some non-trivial changes. For this purpose, we introduce a cross-lingual meta-learning framework (X-MAML), which uses the following training steps:

1. Pre-training on the high-resource language h (i.e., English): Given all the training samples in the high-resource language h, we first train the model M on h to initialise the model parameters θ.
2. Meta-learning using low-resource languages: This step consists of choosing one or more auxiliary languages from the low-resource set. Using the development set of each auxiliary language, we construct a set of randomly sampled batches of tasks T_i. Then, we update the model parameters using the k data points of each task (D_i^train) with one gradient descent step (see Eq. (1)). After this step, we calculate the loss value on q examples (D_i^test) in each task; note that the q test examples are disjoint from the k training examples. We sum the loss values from all tasks to minimise the meta-objective function and perform a meta-update using Eq. (3). This step is repeated for multiple iterations.
3. Zero-shot or few-shot learning on the target languages: In the last step of X-MAML, we first initialise the model parameters with those learned during meta-learning. We then continue by evaluating the model on the test set of target languages (i.e., zero-shot learning) or fine-tuning the model parameters with standard supervised learning using the development set of target languages and evaluate on the test set (i.e., few-shot learning).
A more formal description of the proposed model X-MAML is given in Algorithm 1.
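The three training steps above can likewise be sketched end-to-end on a toy problem, where a "language" is a stand-in ground-truth parameter vector and a "task batch" is regression data drawn from it. This is a first-order sketch under those simplifying assumptions, not the BERT-based setup used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(theta, X, y):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def sample_task(lang_theta, k=16, q=16):
    # Toy stand-in for sampling a task batch from a language's dev set:
    # each "language" is a ground-truth parameter vector to regress onto.
    X = rng.normal(size=(k + q, 2))
    y = X @ lang_theta
    return (X[:k], y[:k]), (X[k:], y[k:])

# Step 1: pre-train on the high-resource language h (English)
english = np.array([1.0, 1.0])
theta = english.copy()  # pretend pre-training recovered these parameters

# Step 2: meta-learn on one auxiliary low-resource language (first-order)
aux = np.array([1.0, -1.0])
alpha, beta = 0.05, 0.05
for _ in range(300):
    (Xtr, ytr), (Xte, yte) = sample_task(aux)
    theta_i = theta - alpha * grad(theta, Xtr, ytr)  # inner update, Eq. (1)
    theta = theta - beta * grad(theta_i, Xte, yte)   # meta-update, Eq. (3)

# Step 3 (zero-shot): evaluate theta directly on a held-out target task
_, (Xte, yte) = sample_task(aux)
zero_shot_mse = float(np.mean((Xte @ theta - yte) ** 2))
```

In the few-shot variant of Step 3, one would additionally fine-tune theta on a small target-language development set before evaluating.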

Natural Language Inference (NLI)
NLI is the task of predicting whether a hypothesis sentence is true (entailment), false (contradiction), or undetermined (neutral) given a premise sentence. The Multi-Genre Natural Language Inference (MultiNLI) dataset has 433k sentence pairs annotated with textual entailment information. It covers a range of different genres of spoken and written text and thus supports cross-genre evaluation. The NLI premise sentences are provided in 10 different genres: facetoface, telephone, verbatim, slate, government, fiction, letters, nineeleven, travel and oup. All of the genres appear in the test and development sets, but only five are included in the training set. To verify our learning routine more generally, we define T_i as an NLI task in each genre. We exploit MAML, in its original setting, to investigate whether meta-learning encourages the model to learn a good initialisation for all target genres, which can then be fine-tuned with limited supervision on each genre's development instances (2000 examples) to achieve good performance on its test set. The Cross-lingual Natural Language Inference (XNLI) dataset (Conneau et al., 2018) consists of 5000 test and 2500 development hypothesis-premise pairs with their textual entailment labels for English. Translations of these pairs are provided in 14 languages: French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw) and Urdu (ur). XNLI provides a multilingual benchmark for evaluating inference in low-resource languages such as Swahili or Urdu, for which only training data for the high-resource language English is available from MultiNLI. Following our X-MAML framework, we study the impact of meta-learning with one low-resource language serving as an auxiliary language. We evaluate the performance of a cross-lingual NLI model on the target languages provided in the XNLI dataset.
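Constructing per-genre tasks T_i from MultiNLI can be sketched as follows; the dictionary fields are illustrative stand-ins for the dataset's actual schema, and k/q follow the support/query split described in Section 2.

```python
import random
from collections import defaultdict

def genre_tasks(examples, k=16, q=16, seed=0):
    """Group NLI examples into per-genre MAML tasks, each with k support
    (D_train) and q query (D_test) instances. Field names ("genre",
    "premise", "hypothesis", "label") are illustrative, not the official
    MultiNLI schema."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for ex in examples:
        by_genre[ex["genre"]].append(ex)
    tasks = []
    for genre in sorted(by_genre):
        exs = by_genre[genre]
        rng.shuffle(exs)
        if len(exs) >= k + q:  # skip genres with too few examples
            tasks.append({"genre": genre,
                          "train": exs[:k],
                          "test": exs[k:k + q]})
    return tasks
```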
Evaluation on the NLI benchmarks is performed by reporting accuracy on the respective test sets.

Question Answering (QA)
Levy et al. (2017) frame the Relation Extraction (RE) task as a QA problem using pre-defined natural language question templates. For example, a relation type such as author is transformed into at least one natural language question template (e.g., who is the author of x?, where x is an entity). Building on the work of Levy et al. (2017), Abdou et al. (2019) introduce a new dataset (called X-WikiRE) for multilingual QA-based relation extraction in five languages (i.e., English, French, Spanish, Italian and German). Each instance in the dataset includes a question, a context and an answer. The question is a transformation of a target relation, and the context may contain the answer. If the answer is not present, it is marked as NIL. In the QA task, we evaluate the performance of our method in the UnENT setting of the X-WikiRE dataset, where the goal is to generalise to unseen entities. For the evaluation, we use exact match (EM) and F1 scores (for questions with valid answers), similar to Kundu and Ng (2018).
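The QA-as-RE reduction and the NIL convention can be sketched as follows. The template string and the helper names are illustrative, not the actual X-WikiRE templates, and the NIL check is a simplified substring test, not the model's nil-aware decoder.

```python
# Illustrative question templates, in the spirit of Levy et al. (2017);
# "{}" marks the slot for the entity x.
TEMPLATES = {"author": "who is the author of {}?"}

def relation_to_question(relation, entity, templates=TEMPLATES):
    """Instantiate a natural-language question for a (relation, entity) pair."""
    return templates[relation].format(entity)

def answer_or_nil(context, answer):
    """X-WikiRE marks unanswerable questions with NIL: here a simple
    substring check stands in for the model's nil-aware extraction."""
    return answer if answer and answer in context else "NIL"
```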

Experiments
We conduct experiments on the MultiNLI, XNLI and X-WikiRE datasets. We report results for few-shot as well as zero-shot cross-domain and cross-lingual learning. To examine the model-and task-agnostic features of MAML and X-MAML, we conduct experiments with various models for both tasks.
Experimental Setup: We implement X-MAML using the higher library. 1 We use the Adam optimiser (Kingma and Ba, 2014) with a batch size of 32 for both zero-shot and few-shot learning. We fix the step size α and the learning rate β to 1e-4 and 1e-5, respectively. We experimented with {10, 20, 30, 50, 100, 200, 300} meta-learning iterations in X-MAML; 100 iterations led to the best results in our experiments. The sample sizes k and q in X-MAML are both 16 for each dataset. We report results for each experiment by averaging the performance over ten different runs.
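For reference, the hyperparameters above can be collected in a single configuration. The values are exactly those stated in the text; the dictionary keys are our own naming, not an API of the higher library.

```python
# Hyperparameters as stated in the Experimental Setup (key names are ours)
X_MAML_CONFIG = {
    "optimizer": "Adam",     # Kingma and Ba (2014)
    "batch_size": 32,        # zero-shot and few-shot learning
    "alpha": 1e-4,           # inner-loop step size
    "beta": 1e-5,            # meta-update learning rate
    "meta_iterations": 100,  # best among {10, 20, 30, 50, 100, 200, 300}
    "k": 16,                 # support-set size per task
    "q": 16,                 # query-set size per task
    "n_runs": 10,            # results averaged over runs
}
```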

NLI:
We experiment using two different settings. (i) In MultiNLI, a cross-genre dataset, we employ the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2016), which is commonly used for textual entailment problems. ESIM employs LSTMs with attention to create a rich representation, capturing the relationship between premise and hypothesis sentences. (ii) In XNLI, which is a cross-lingual dataset, we employ the PyTorch version of BERT 2 (Devlin et al., 2018) as the underlying model M. However, since our proposed meta-learning method is model-agnostic, it can easily be extended to any other architecture.
It is worth mentioning that for Setting (i) we apply MAML, whereas for Setting (ii) we apply X-MAML on the original English BERT model (En-BERT) and on Multilingual BERT (Multi-BERT) models. As the first training step in X-MAML for XNLI (i.e., pre-training on a high-resource language; see Step 1 in Section 3), we fine-tune En-BERT and Multi-BERT on the MultiNLI dataset (English) to obtain the initial model parameters θ for each experiment.
QA: We use the Nil-Aware Answer Extraction Framework (NAMANDA, Kundu and Ng (2018)) 3 as the base model M in X-MAML for our QA experiments. NAMANDA encodes the question and context sequences to compute a similarity matrix. It creates evidence vectors through joint encoding of question and context and applies multi-factor self-attentive encoding. Finally, the evidence vectors are decomposed to output either the answer to the question or NIL. We set the parameters to the default values (as in the original work) for the training and evaluation phases. The NAMANDA model M is pre-trained on the full English training set (1M instances; see Step 1 in our training algorithm). This model is then used in our meta-learning step to adapt the pre-trained QA model. We then evaluate how well the English model has been adapted by each of the auxiliary languages through X-MAML by performing either few- or zero-shot learning. In few-shot X-MAML, the meta-learned M is fine-tuned on the development set (1k instances) of the other languages (i.e., fr, es, it and de). For both few- and zero-shot learning, we evaluate on the 10k test set of each of the target languages. Following the work of Abdou et al. (2019), the Multi-BERT model is used to jointly encode text for different languages in the QA model.

Baselines:
We create: (i) zero-shot baselines: directly evaluate the model on the test set of the target languages (for each task); (ii) few-shot baselines: fine-tune the model on the development set and evaluate on the test set of the low-resource languages.

Few-Shot Cross-Domain NLI
We train ESIM on the MultiNLI training set to provide initial model parameters θ (see Step 1 in Section 3). We evaluate the pre-trained model on the English test set of XNLI (since the MultiNLI test set is not publicly available) to set the baseline for this scenario. Since MultiNLI is already split into genres, we use each genre as a task within MAML. We then include either the training set (5 genres) or the development set (10 genres) during meta-learning (similar to Step 2 in X-MAML). In the last phase (similar to Step 3 in X-MAML), we first initialise the model parameters with those learned by MAML. We then continue to fine-tune the model using the development set of MultiNLI and report the accuracy on the English test set of XNLI. We proportionally select sub-samples x = [1%, 2%, 3%, 5%, 10%, 20%, 50%, 100%] from the training data (with random sampling). The results obtained by training on the corresponding proportions (x%) of the MultiNLI dataset using ESIM (as the learner model M) are shown in Table 1. We observe that for both settings (i.e., MAML on training (5 tasks) and on development (10 tasks)), the performance of all models (including baselines) improves as more training instances become available. However, the effectiveness of MAML is largest when only limited training data is available (a 12% accuracy improvement with 2% of the training data, when meta-learning on the development set).
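A minimal sketch of the proportional random sub-sampling used for the Table 1 conditions; the function name and seeding convention are ours.

```python
import random

def subsample(train_set, fraction, seed=0):
    """Randomly draw a proportion of the training data, e.g. fraction=0.02
    for the 2% condition in Table 1."""
    rng = random.Random(seed)
    n = max(1, round(fraction * len(train_set)))
    return rng.sample(train_set, n)

# The proportions evaluated in Table 1
FRACTIONS = [0.01, 0.02, 0.03, 0.05, 0.10, 0.20, 0.50, 1.00]
```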

Zero-and Few-Shot Cross-Lingual NLI
Zero-Shot Learning: In this set of experiments, we employ the proposed framework (i.e., X-MAML) within a zero-shot setup, in which we do not fine-tune after the meta-learning step. We report the impact of meta-learning for each target language as a difference in accuracy with and without meta-learning on top of the baseline model (Multi-BERT) on the test set (Fig. 1). Each column corresponds to the performance of Multi-BERT after meta-learning with a single auxiliary language, and evaluation on the target language of the XNLI test set. Overall, we observe that our zero-shot approach with X-MAML outperforms the baseline model without MAML and the results reported by Devlin et al. (2018). We thus improve the state-of-the-art performance for zero-shot cross-lingual NLI (by up to +3.6% accuracy for several languages, e.g., Hindi (hi) as target and Urdu (ur) as auxiliary language). For the exact accuracy scores, we refer to Table 4 in the Appendix. We further observe that typological commonalities among the languages have an effect (i.e., positive or negative) on the performance of X-MAML. The proposed learning approach provides positive impacts across most of the target languages. However, including Swahili (sw) as an auxiliary language in X-MAML is not beneficial for the performance on the other target languages. It is worth noting that we also experimented with simply training the model on an auxiliary language, instead of performing meta-learning (Step 2). From this experiment, we observe that meta-learning has a strongly positive effect on predictive performance (see also Fig. 2 in the Appendix).
In Table 2, we include the original baseline performances reported in Devlin et al. (2018) 4 and Wu and Dredze (2019). We report the average and maximum performance obtained by using one auxiliary language for each target language. We additionally report the performance of X-MAML using Hindi, which is the most effective auxiliary language for the zero-shot setting (as shown in Fig. 1). We suspect that this is because of the typological similarities between Hindi (hi) and the other languages. Furthermore, using two auxiliary languages in X-MAML yields the largest benefit in our zero-shot experiments.
Few-Shot Learning: For few-shot learning, the model meta-learned with X-MAML is fine-tuned (Step 3) on the development set (2.5k instances) of the target languages and then evaluated on the test set. Detailed ablation results are presented in the Appendix (Table 5 and Fig. 4). In Table 2, we compare X-MAML results with one or two auxiliary languages against external and internal baselines. We also showcase the performance using specifically Swahili (sw), the overall most effective auxiliary language for meta-learning with Multi-BERT in the few-shot learning setting. In addition, we report results from Devlin et al. (2018) that use machine translation at test time (TRANSLATE-TEST) and results from Wu and Dredze (2019) that use machine translation at training time (TRANSLATE-TRAIN). Note that, using X-MAML, we are able to obviate the machine translation step (TRANSLATE-TEST) from the target language into English. The results also indicate that X-MAML boosts Multi-BERT performance on XNLI. It is worth mentioning that Multi-BERT in the TRANSLATE-TRAIN setup outperforms few-shot X-MAML; however, we only use 2.5k development examples from the target languages, whereas the aforementioned work uses 433k translated sentences for fine-tuning.

Table 2: The baselines of Devlin et al. (2018) and Wu and Dredze (2019) are also Multi-BERT models. For our Multi-BERT baseline model, for (i) zero-shot learning we evaluate the pre-trained model on the test set of the target language, and for (ii) few-shot learning we fine-tune the model on the development set and evaluate on the test set of the target language. The avg column indicates the row-wise average accuracy. We also report the average (AVG) and maximum (MAX) performance obtained by using one auxiliary language for each target language. (l1, l2) are the most beneficial auxiliary languages in X-MAML for improving the test accuracy of each target language. In TRANSLATE-TEST (Devlin et al., 2018), the target-language test data is translated to English and the model is then fine-tuned on English. In TRANSLATE-TRAIN (Wu and Dredze, 2019), the English training data is translated to the target language and the model is fine-tuned using the translated data.

Zero-and Few-Shot Cross-Lingual QA
We use a similar approach for the cross-lingual QA task on the X-WikiRE dataset. Table 3 shows the results of both zero- and few-shot X-MAML for the UnENT part (i.e., generalising to unseen entities) of the X-WikiRE dataset. We compare our results for the UnENT scenario on the X-WikiRE dataset to those reported in the original paper. All of the target languages benefit from at least one of the auxiliary languages when adapting the model using X-MAML, highlighting the benefits of this method. We were not able to directly reproduce the zero-shot results of the original paper, so we also report our own baseline for the task. We find that: (i) our zero-shot results with X-MAML improve on those without meta-learning (i.e., the baselines); and (ii) we outperform Abdou et al. (2019) for the UnENT scenario of zero-shot cross-lingual QA. Furthermore, for the few-shot scenario, adapting the QA model using few-shot X-MAML with only 1k development instances outperforms the cross-lingual transfer model of Abdou et al. (2019), who use 10k instances in the fine-tuning phase.

Table 3: F1 scores (averaged over 10 runs) on the test set of the UnENT part of the X-WikiRE dataset using zero- and few-shot X-MAML. Baselines: for (i) zero-shot learning, we evaluate the pre-trained NAMANDA model on the test set of the target language indicated in each row; for (ii) few-shot learning, we fine-tune the model on the development set and evaluate on the test set of the target language. We report results for few-shot X-MAML with only 1k instances from the development set.

Related Work
The main motivation for this work is the low availability of labelled training datasets for most of the world's languages. To alleviate this issue, a number of methods, including so-called few-shot learning approaches have been proposed. Few-shot learning methods have initially been introduced within the area of image classification (Vinyals et al., 2016;Ravi and Larochelle, 2017;Finn et al., 2017), but have recently also been applied to NLP tasks such as relation extraction (Han et al., 2018), text classification  and machine translation (Gu et al., 2018). Specifically, in NLP, these few-shot learning approaches include either: (i) the transformation of the problem into a different task (e.g., relation extraction is transformed to question answering (Abdou et al., 2019;Levy et al., 2017)); or (ii) meta-learning (Andrychowicz et al., 2016;Finn et al., 2017).
Meta-Learning: Meta-learning, or learning to learn, has recently received a lot of attention from the NLP community. First-order MAML has been applied to the task of machine translation (Gu et al., 2018), using meta-learning to improve machine translation performance for low-resource languages by learning to adapt to target languages based on multilingual high-resource languages. However, their framework includes 18 high-resource languages as auxiliary languages and five diverse low-resource languages as target languages, whereas in our work, we assume access to only English as a high-resource language. For the task of dialogue generation, Qian and Yu (2019) address domain adaptation using meta-learning. Dou et al. (2019) explore MAML and variants thereof for low-resource NLU tasks in the GLUE benchmark. They consider different high-resource NLU tasks such as MultiNLI and QNLI (Rajpurkar et al., 2016) as auxiliary tasks to learn meta-parameters using MAML, and then fine-tune on the low-resource tasks using the adapted parameters from the meta-learning phase. All the aforementioned works on meta-learning in NLP assume that there are multiple high-resource tasks or languages, which are then adapted to new target tasks or languages with a handful of training samples. However, in a cross-lingual NLI and QA setting, the available high-resource language is usually only English. Our work thus fills an important gap in the literature, as we only require a single source language.
Cross-Lingual NLU: Cross-lingual learning has a fairly short history in NLP, and has mainly been restricted to traditional NLP tasks, such as PoS tagging and parsing. In contrast to these tasks, which have seen much cross-lingual attention (Plank et al., 2016;Bjerva, 2017;de Lhoneux et al., 2018), there has been relatively little work on cross-lingual NLU, partly due to lack of benchmark datasets. Existing work has mainly been focused on NLI (Conneau et al., 2018;Agic and Schluter, 2018), and to a lesser degree on RE (Verga et al., 2016;Faruqui and Kumar, 2015) and QA (Abdou et al., 2019). Previous research generally reports that cross-lingual learning is challenging and that it is hard to beat a machine translation baseline (e.g., Conneau et al. (2018)). Such a baseline is (for instance) suggested by Faruqui and Kumar (2015), where the text in the target language is automatically translated to English. We achieve competitive performance compared to a machine translation baseline (for XNLI and X-WikiRE), and propose a method that requires no training instances for the target task in the target language. Furthermore, our method is model agnostic, and can be used to extend any pre-existing model, such as that introduced by Conneau and Lample (2019).

Discussion and Analysis
Cross-Lingual Transfer: Somewhat surprisingly, we find that cross-lingual transfer with meta-learning yields improved results even when languages strongly differ from one another. For instance, for zero-shot meta-learning on XNLI, we observe gains for almost all auxiliary languages, with the exception of Swahili (sw). This indicates that the meta-parameters learned with X-MAML are sufficiently language agnostic, as we would otherwise not expect to see any benefits in transferring from, e.g., Russian (ru) to Hindi (hi) (one of the strongest results in Fig. 1). This does depend on having access to a pre-trained multilingual model such as BERT; using monolingual English BERT (En-BERT) only yields positive gains in some target/auxiliary settings (see additional results in Fig. 3 in the Appendix). For few-shot learning, our findings are similar, as almost all combinations of auxiliary and target languages lead to improvements when using Multi-BERT (Fig. 4 in the Appendix). Moreover, when we only have access to a handful of training instances, as in few-shot learning, even the English BERT model mostly leads to improvements (see additional results in Fig. 5 in the Appendix).
Typological Correlations: In order to better explain our results for cross-lingual zero-shot and few-shot learning, we investigate typological features and their overlap between target and auxiliary languages. We draw on the World Atlas of Language Structures (WALS, Dryer and Haspelmath (2013)), which is the largest openly available typological database. It comprises approximately 200 linguistic features with annotations for more than 2500 languages, made by expert typologists through the study of grammars and field work. We draw inspiration from previous work (Bjerva and Augenstein, 2018a; Bjerva and Augenstein, 2018b) that attempts to predict typological features based on language representations learned under various NLP tasks. Similarly, we experiment with two conditions: (i) we attempt to predict typological features based on the mutual gain/loss in performance using X-MAML; and (ii) we investigate whether sharing between two typologically similar languages is beneficial for performance with X-MAML. We train one simple logistic regression classifier per condition, for each WALS feature. In the first condition (i), the task is to predict the exact WALS feature value of a language, given the change in accuracy in combination with other languages. In the second condition (ii), the task is to predict whether a main and an auxiliary language have the same WALS feature value, given the change in accuracy when the two languages are used in X-MAML. We compare with two simple baselines: one that always predicts the most frequent feature value in the training set, and one that predicts feature values according to their distribution in the training set. We then investigate whether any features can be consistently predicted above baseline levels, given different test-training splits. We apply a simple paired t-test to compare our model's predictions to the baselines.
As we are running a large number of tests (one per WALS feature), we apply Bonferroni correction, changing our cut-off p-value from p = 0.05 to p = 0.00025.
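A minimal sketch of this per-feature analysis: the Bonferroni cut-off arithmetic, and a single-feature logistic regression for condition (ii) trained by plain gradient descent. This is an illustrative stand-in with synthetic inputs, not the exact classifier or features used in our analysis.

```python
import numpy as np

def bonferroni_cutoff(alpha=0.05, n_tests=200):
    # 0.05 over roughly 200 WALS features gives the 0.00025 cut-off used above
    return alpha / n_tests

def fit_logreg(x, y, lr=0.5, iters=5000):
    """Minimal one-feature logistic regression for condition (ii): predict
    whether two languages share a WALS feature value (y in {0, 1}) from the
    accuracy change x observed when they are paired in X-MAML."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted probabilities
        w -= lr * float(np.mean((p - y) * x))   # gradient of log-loss w.r.t. w
        b -= lr * float(np.mean(p - y))         # gradient of log-loss w.r.t. b
    return w, b
```

In the real analysis, one such classifier is fit per WALS feature and its test accuracy is compared against the two baselines with a paired t-test at the corrected threshold.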
We first investigate few-shot X-MAML using Multi-BERT, as reported in Table 5 (Appendix). We find that languages sharing the feature value for WALS feature 67A (The Future Tense) are beneficial to each other. This feature encodes whether or not a language has inflectional marking of the future tense, and can be considered a morphosyntactic feature. We next look at zero-shot X-MAML with Multi-BERT, as reported in Table 4 (Appendix). In this case, we find that languages sharing a feature value for WALS feature 25A (Locus of Marking: Whole-language Typology) typically help each other. This feature describes whether the morphosyntactic marking in a language is on the syntactic heads or the dependents of a phrase. For example, en, de, ru, and zh are 'dependent-marking' for this feature, and in Fig. 1 they show the largest mutual gains from each other during zero-shot X-MAML. In both cases, we thus find that languages with similar morphosyntactic properties can be beneficial to one another when using X-MAML.

Conclusion
In this work, we show that meta-learning allows one to leverage training data from an auxiliary language to perform zero-shot and few-shot cross-lingual transfer. We evaluated this on two challenging NLU tasks (NLI and QA) and on a total of 16 languages. We are able to improve the performance of state-of-the-art baseline models for (i) zero-shot XNLI, and (ii) both few-shot and zero-shot QA on the X-WikiRE dataset. Furthermore, we show in a typological analysis that languages which share certain morphosyntactic features tend to benefit from this type of transfer. Future studies will extend this work to other cross-lingual NLP tasks and more languages.

Figure 3: Differences in performance in terms of accuracy scores on the test set for zero-shot X-MAML on XNLI using the En-BERT (English) model. Rows correspond to target and columns to auxiliary languages used in X-MAML. Numbers on the off-diagonal indicate performance differences between X-MAML and the baseline model in the same row. The coloring scheme indicates the differences in performance (e.g., blue for large improvement).
Figure 4: Differences in performance in terms of accuracy scores on the test set for few-shot X-MAML on XNLI using the Multi-BERT model. Rows correspond to target and columns to auxiliary languages used in X-MAML. Numbers on the off-diagonal indicate performance differences between X-MAML and the baseline model in the same row. The coloring scheme indicates the differences in performance (e.g., blue for large improvement).
Figure 5: Differences in performance in terms of accuracy scores on the test set for few-shot X-MAML on XNLI using the En-BERT (English) model. Rows correspond to target and columns to auxiliary languages used in X-MAML. Numbers on the off-diagonal indicate performance differences between X-MAML and the baseline model in the same row. The coloring scheme indicates the differences in performance (e.g., blue for large improvement).

Table 4: The performance in terms of average test accuracy for the zero-shot setting over 10 runs of X-MAML on the XNLI dataset using Multi-BERT (multilingual BERT) as the base model. Each column corresponds to the performance of the Multi-BERT system after meta-learning with a single auxiliary language, and evaluation on the target language of the XNLI test set. The auxiliary language is not included during the evaluation phase. Results of the Multi-BERT model without X-MAML (baseline) are also reported.