X-METRA-ADA: Cross-lingual Meta-Transfer Learning Adaptation to Natural Language Understanding and Question Answering

Multilingual models, such as M-BERT and XLM-R, have gained increasing popularity due to their zero-shot cross-lingual transfer learning capabilities. However, their generalization ability is still inconsistent for typologically diverse languages and across different benchmarks. Recently, meta-learning has garnered attention as a promising technique for enhancing transfer learning under low-resource scenarios, particularly for cross-lingual transfer in Natural Language Understanding (NLU). In this work, we propose X-METRA-ADA, a cross-lingual MEta-TRAnsfer learning ADAptation approach for NLU. Our approach adapts MAML, an optimization-based meta-learning approach, to learn to adapt to new languages. We extensively evaluate our framework on two challenging cross-lingual NLU tasks: multilingual task-oriented dialog and typologically diverse question answering. We show that our approach outperforms naive fine-tuning, reaching competitive performance on both tasks for most languages. Our analysis reveals that X-METRA-ADA can leverage limited data for faster adaptation.


Introduction
Cross-lingual transfer learning is a technique used to adapt a model trained on a downstream task in a source language so that it generalizes directly to the same task in new languages. It aims to learn common cross-lingual representations and leverage them to bridge the divide between resources, making any NLP application scale to multiple languages. This is particularly useful for data-scarce scenarios, as it reduces the need for the API calls implied by machine translation or for costly task-specific annotation in new languages.

Figure 1: An overview of the X-METRA-ADA framework: we use English as the source and Spanish as the target language. The meta-train stage transfers from the source to the target languages, while the meta-adaptation stage further adapts the model to the target language. The application is few-shot if the test language is seen in any stage of X-METRA-ADA, or zero-shot if the test language is unseen.
Transformer-based contextualized embeddings and their multilingual counterparts such as M-BERT (Devlin et al., 2019) have become popular as off-the-shelf representations for cross-lingual transfer learning. While these multilingual representations exhibit some cross-lingual capability even for languages with low lexical overlap with English, the transfer quality is reduced for languages that exhibit different typological characteristics (Pires et al., 2019).
The generalization of such representations has been extensively evaluated on traditional tasks such as Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and Cross-lingual Document Classification (CLDC) (Ahmad et al., 2019; Wu and Dredze, 2019; Bari et al., 2020a; Schwenk and Li, 2018), with ever-growing open community annotation efforts like Universal Dependencies (Nivre et al., 2020) and the CoNLL shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). On the other hand, cross-lingual Natural Language Understanding (NLU) tasks have received less attention, with smaller benchmark datasets that cover a handful of languages and don't truly model linguistic variety (Conneau et al., 2018; Artetxe et al., 2020). NLU tasks are critical for dialog systems, as they make up an integral part of the dialog pipeline. Understanding and improving the mechanism behind cross-lingual transfer for natural language understanding in dialog systems requires evaluation on more challenging and typologically diverse benchmarks.
Numerous approaches have attempted to build stronger cross-lingual representations on top of those multilingual models; however, most require parallel corpora (Wang et al., 2019; Lample and Conneau, 2019) and are biased towards high-resource and balanced setups. This fuels the need for a method that doesn't require explicit cross-lingual alignment for faster adaptation to low-resource setups.
Meta-learning, a method for "learning to learn", has found favor especially in the computer vision and speech recognition communities (Nichol et al., 2018; Triantafillou et al., 2020). Meta-learning has been used for machine translation (Gu et al., 2018), few-shot relation classification (Gao et al., 2019), and a variety of GLUE tasks (Dou et al., 2019). Recently, Nooralahzadeh et al. (2020) applied the MAML (Finn et al., 2017) algorithm to cross-lingual transfer learning for XNLI (Conneau et al., 2018) and MLQA (Lewis et al., 2020), NLU tasks that are naturally biased towards machine translation-based solutions. Nooralahzadeh et al. show improvement over strong multilingual models, including M-BERT. However, they mainly show the effects of meta-learning as a first step in a framework that relies on supervised fine-tuning, making it difficult to properly compare and contrast the two approaches.
We study cross-lingual meta-transfer learning from a different perspective. We distinguish between meta-learning and fine-tuning and design systematic experiments to analyze the added value of meta-learning compared to naive fine-tuning. We also ground our analysis in more typologically diverse cross-lingual NLU tasks: the Multilingual Task-Oriented Dialogue System (MTOD) (Schuster et al., 2019) and Typologically Diverse Question Answering (TyDiQA) (Clark et al., 2020) datasets. While XNLI is a classification task, MTOD is a joint classification and sequence labeling task and is more typologically diverse. TyDiQA is not a classification task, but we show how meta-learning can be usefully applied to it. We also show greater performance improvements from meta-learning than from fine-tuning on transfer between typologically diverse languages.
To the best of our knowledge, we are the first to conduct an extensive analysis applied to MTOD and TyDiQA to evaluate the quality of cross-lingual meta-transfer. Our contributions are three-fold:
• Proposing X-METRA-ADA, a language-agnostic meta-learning framework (Figure 1), and extensively evaluating it.
• Applying X-METRA-ADA to two challenging, typologically diverse cross-lingual task-oriented dialog and QA tasks, including recipes for constructing appropriate meta-tasks (Section 2.3).
• Analyzing the importance of different components in cross-lingual transfer and the scalability of our approach across different k-shot and downsampling configurations (Section 4.2).

Methodology
We make use of optimization-based meta-learning on top of pre-trained models, with two levels of adaptation to reduce the risk of over-fitting to the target language: (i) meta-training from the source language to the target language(s); (ii) meta-adaptation on the same target language(s) for more language-specific adaptation (Figure 1). We apply our approach to two cross-lingual downstream tasks: MTOD (Section 2.1) and TyDiQA (Section 2.2). We start by describing the base architectures for both tasks, before explaining how they are incorporated into our meta-learning pipeline. Applying meta-learning to a task requires the construction of multiple 'pseudo-tasks', which are instantiated as pairs of datasets. We describe this construction for our downstream tasks in Section 2.3. Finally, we present our X-METRA-ADA algorithm (Section 2.4).

Multilingual Task-Oriented Dialog (MTOD)
Similar to the architecture in Castellucci et al. (2019), we model MTOD's intent classification and slot filling subtasks jointly. For that purpose, we use a joint text classification and sequence labeling framework with feature representations based on the Transformer (Vaswani et al., 2017). More specifically, given a multilingual pre-trained model, we use it to initialize the word-piece embedding layer. Then, we add on top of it a text classifier that predicts the intent from the [CLS] token representation and a sequence labeling layer in the form of a linear layer that predicts the slot spans (in BIO annotation), as shown in Figure 2. We optimize parameters using the sum of the intent and CRF-based slot losses.
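As an illustration of the two heads described above, here is a minimal numpy sketch of the forward pass. The pre-trained encoder and the CRF layer are omitted, and all names and dimensions are illustrative stand-ins, not the paper's actual implementation:

```python
import numpy as np

def joint_intent_slot_heads(token_reprs, W_intent, W_slot):
    """Illustrative forward pass of the two task heads.

    token_reprs: (seq_len, hidden) contextual embeddings from the
    multilingual encoder; row 0 is the [CLS] token.
    W_intent: (hidden, n_intents) intent classifier weights.
    W_slot:  (hidden, n_slot_tags) per-token slot classifier weights.
    """
    intent_logits = token_reprs[0] @ W_intent   # [CLS] -> intent logits
    slot_logits = token_reprs[1:] @ W_slot      # every other token -> BIO tag logits
    return intent_logits, slot_logits

rng = np.random.default_rng(0)
hidden, n_intents, n_slots, seq_len = 8, 12, 11, 5  # toy sizes
reprs = rng.normal(size=(seq_len, hidden))
intent_logits, slot_logits = joint_intent_slot_heads(
    reprs, rng.normal(size=(hidden, n_intents)), rng.normal(size=(hidden, n_slots)))
assert intent_logits.shape == (n_intents,)
assert slot_logits.shape == (seq_len - 1, n_slots)
```

In training, the intent cross-entropy and the (CRF-based) slot loss computed from these logits would simply be summed, as described above.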

Typologically Diverse Question Answering (TyDiQA)
Inspired by Hu et al. (2020), we apply to TyDiQA the same architecture as the original BERT fine-tuning procedure for question answering on SQuAD (Devlin et al., 2019). Specifically, the input question (prepended with a [CLS] token) and the context are concatenated into a single packed sequence separated by a [SEP] token. Then, the embeddings of the context tokens are fed to a linear layer plus a softmax to compute the probability that each token is the START or END of the answer. The whole architecture is fine-tuned by optimizing the joint loss over the START and END predictions. Any START and END positions that fall outside the scope of the context are truncated due to the length limitations of Transformer-based embeddings and are ignored during training. Figure 3 illustrates the architecture.
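The span-prediction head above can be sketched as follows; the weight vectors and dimensions are illustrative placeholders (random vectors stand in for the encoder output), not the actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_probs(context_reprs, w_start, w_end):
    """Probability of each context token being the answer START or END.

    context_reprs: (ctx_len, hidden) embeddings of the context tokens.
    w_start, w_end: (hidden,) weight vectors of the linear span head.
    """
    return softmax(context_reprs @ w_start), softmax(context_reprs @ w_end)

rng = np.random.default_rng(1)
ctx = rng.normal(size=(10, 8))                      # 10 context tokens, hidden=8
p_start, p_end = span_probs(ctx, rng.normal(size=8), rng.normal(size=8))
assert np.isclose(p_start.sum(), 1.0) and np.isclose(p_end.sum(), 1.0)
```

Training would sum the negative log-probabilities of the gold START and END positions, matching the joint loss described above.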

Pseudo-task Datasets
Meta-learning is distinguished from fine-tuning in that the former seeks an initialization point that is maximally useful to multiple downstream learning tasks, while the latter seeks to directly optimize a downstream 'child' task from the initialization point of a 'parent' task. To apply meta-learning to data scenarios that more closely fit fine-tuning, we construct multiple 'pseudo-tasks' by subsampling from parent and child task datasets. A pseudo-task is defined as a tuple T = (S, Q), where each of S and Q is a set of labeled samples. In the inner loops of meta-learning, the loss on Q of a model trained on S is used to adapt the initialization point (Q and S are referred to as the query and support sets in the meta-learning literature). Pseudo-tasks are constructed in such a way as to make them balanced and non-overlapping. We describe our approach for each task below.

MTOD Pseudo-task Construction
MTOD labeled data consists of a sentence from a dialogue along with a sentence-level intent label and subsequence slot labels. From the available data, we draw a number of task sets T; each T = (S, Q) ∈ T consists of k intent- and slot-labeled items per intent class in S and q items per class in Q. Tasks are carefully arranged to have the same number of items per class in each of the support and query sets, and the same task splits are used for slot prediction as well. During meta-training and meta-adaptation, task batches are sampled randomly from T.
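The per-class sampling above can be sketched in a few lines; the utterance IDs below are hypothetical, and this is a sketch of the balancing scheme rather than the paper's code:

```python
import random

def sample_pseudo_task(examples_by_intent, k, q, rng):
    """Draw one pseudo-task T = (S, Q): k support and q query examples
    per intent class, with no overlap between S and Q."""
    support, query = [], []
    for intent, examples in examples_by_intent.items():
        drawn = rng.sample(examples, k + q)   # distinct examples per class
        support.extend(drawn[:k])
        query.extend(drawn[k:])
    return support, query

# toy data: 3 intent classes, 20 utterances each
data = {i: [f"utt-{i}-{j}" for j in range(20)] for i in range(3)}
S, Q = sample_pseudo_task(data, k=6, q=6, rng=random.Random(0))
assert len(S) == 18 and len(Q) == 18 and not set(S) & set(Q)
```

With k = q = 6 (the setting used in the experiments), each pseudo-task is thus balanced across intent classes, and the same splits carry the slot labels.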

QA Pseudo-task Construction
Unlike MTOD, QA is not a standard classification task with fixed classes; thus, it is not directly amenable to class distribution balancing across pseudo-task query and support sets. To construct pseudo-tasks for QA from the available (question, context, answer span) triplet data, we use the following procedure: we draw a task T = (S, Q) by first randomly drawing q triplets, forming Q. For each triplet t in Q, we draw the k/q most similar triplets to t from the remaining available data, thus forming S. 2 For two triplets t1, t2 we define similarity as cos(f(t1), f(t2)), where f(·) is a representation of the concatenation of the triplet elements delimited by a space; we use a cross-lingual extension to SBERT's pre-trained model (Reimers and Gurevych, 2019, 2020).
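A minimal numpy sketch of this similarity-based task construction, with random vectors standing in for the SBERT triplet embeddings (the function name and all sizes are illustrative):

```python
import numpy as np

def build_qa_task(embeddings, q, k, rng):
    """Draw Q as q random triplets, then pick the k/q nearest remaining
    triplets (by cosine similarity of their embeddings) for each as S."""
    n = len(embeddings)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_idx = rng.choice(n, size=q, replace=False)
    remaining = np.setdiff1d(np.arange(n), query_idx)
    support_idx = []
    for qi in query_idx:
        sims = normed[remaining] @ normed[qi]          # cosine similarities
        nearest = remaining[np.argsort(-sims)[: k // q]]
        support_idx.extend(nearest.tolist())
        remaining = np.setdiff1d(remaining, nearest)   # keep S and Q disjoint
    return query_idx.tolist(), support_idx

rng = np.random.default_rng(2)
emb = rng.normal(size=(50, 16))    # stand-in for SBERT embeddings of 50 triplets
Q, S = build_qa_task(emb, q=6, k=6, rng=rng)
assert len(Q) == 6 and len(S) == 6 and not set(Q) & set(S)
```

With k = q = 6, each query triplet contributes exactly one nearest neighbor to the support set, which is why k must be a multiple of q.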

Cross-lingual extension
In the original MAML (Finn et al., 2017), in every iteration we sample a task set T from a single distribution D, and the support and query sets in a single task T would be drawn from a common space. We distinguish between the distributions D meta-train and D meta-adapt , which correspond to the two levels of adaptation introduced in Section 2 and explained below in Section 2.4.
To enable cross-lingual transfer, we draw data for the support set of tasks in D meta-train from task data in the high-resource base language (English, in our experiments). For the query set in D meta-train and for both support and query sets in D meta-adapt , we sample from task data in the language to be evaluated.
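The stage-dependent sampling above can be sketched as follows; the pools and language tags are illustrative, not the paper's code:

```python
import random

def draw_task(stage, en_pool, tgt_pool, k, q, rng):
    """During meta-train, support comes from the high-resource (English)
    pool; everything else is drawn from the target language pool."""
    if stage == "meta-train":
        support = rng.sample(en_pool, k)
    else:                                   # meta-adapt
        support = rng.sample(tgt_pool, k)
    query = rng.sample(tgt_pool, q)
    return support, query

rng = random.Random(3)
en = [("en", i) for i in range(100)]
es = [("es", i) for i in range(40)]
S, Q = draw_task("meta-train", en, es, k=6, q=6, rng=rng)
assert all(x[0] == "en" for x in S) and all(x[0] == "es" for x in Q)
```

This is the only difference between the two distributions: D meta-train mixes languages across support and query, while D meta-adapt is entirely in the target language.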

X-METRA-ADA Algorithm
Following the notation described in the above sections, we present our algorithm X-METRA-ADA, our adaptation of MAML to cross-lingual transfer learning in two stages. In each stage, we use the procedure outlined in Algorithm 1. We start by sampling a batch of tasks from a distribution D. For every task T j = (S j, Q j), we update θ j over n steps using batches drawn from S j. At the end of this inner loop, we compute the gradients with respect to the loss of θ j on Q j. After all tasks in the batch are processed, we sum over all pre-computed gradients and update θ, thus completing one outer loop. The difference between the meta-train and meta-adapt stages comes down to the parameters and hyperparameters passed into Algorithm 1.
• Meta-train: This stage is similar to classical MAML. Task sets are sampled from D meta-train , which uses high-resource (typically English) data in support sets and low-resource data in the query sets. The input model θ B is typically a pretrained multilingual downstream base model, and we use hyperparameters n = 5, α = 1e−3 and β = 1e−2 for MTOD and α = β = 3e−5 for QA.
2 Thus k is constrained to be a multiple of q.
Algorithm 1 X-METRA-ADA
Require: Task set distribution D, pre-trained learner B with parameters θ B, meta-learner M with parameters (θ, α, β, n)
1: Initialize θ ← θ B
2: while not done do
3: Sample batch of tasks T = {T 1, T 2, . . . T b } ∼ D
4: for all T j = (S j, Q j) in T do
5: Initialize θ j ← θ
6: for t = 1 . . . n do
7: Evaluate support loss L S j T j (B θ j )
8: Update θ j ← θ j − α ∂L S j T j (B θ j )/∂θ j
9: end for
10: Evaluate query loss L Q j T j (B θ j ) and save it for the outer loop
11: end for
12: Update θ ← θ − β Σ j ∂L Q j T j (B θ j )/∂θ
13: end while

• Meta-adapt: During this stage, we ensure the model learns from examples within the target language under a low-resource regime. Task sets are sampled from D meta-adapt, which uses low-resource data in both support and query sets. The input model is the result of the meta-train stage, and we use hyperparameters n = 5, α = 1e−3 and β = 1e−2 for MTOD and α = β = 3e−5 for QA.
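The inner/outer loop structure of Algorithm 1 can be sketched on a toy problem: each pseudo-task below is a quadratic loss ||θ − c||² with analytic gradients, and the outer update uses the first-order approximation (summing query gradients taken at the adapted parameters). All targets and hyperparameter values are illustrative; this is not the paper's learn2learn-based implementation:

```python
import numpy as np

def x_metra_stage(theta, tasks, alpha, beta, n):
    """One outer step of Algorithm 1 (first-order), on a toy regression
    where each task's loss is ||theta - c||^2 for a task-specific target c.
    Each task is a (support_target, query_target) pair."""
    grad_sum = np.zeros_like(theta)
    for c_support, c_query in tasks:
        theta_j = theta.copy()
        for _ in range(n):                       # inner loop on support set
            theta_j -= alpha * 2 * (theta_j - c_support)
        grad_sum += 2 * (theta_j - c_query)      # first-order query gradient
    return theta - beta * grad_sum               # outer update on theta

theta = np.zeros(2)
tasks = [(np.array([1.0, 0.0]), np.array([1.0, 1.0])),
         (np.array([0.0, 1.0]), np.array([1.0, 1.0]))]
before = sum(np.sum((theta - cq) ** 2) for _, cq in tasks)
for _ in range(50):
    theta = x_metra_stage(theta, tasks, alpha=0.1, beta=0.05, n=5)
after = sum(np.sum((theta - cq) ** 2) for _, cq in tasks)
assert after < before   # initialization improves w.r.t. query losses
```

Running the same routine twice, first with cross-lingual (support=English) tasks and then with target-language-only tasks, mirrors the meta-train and meta-adapt stages.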

Datasets
For dialogue intent prediction, we use the Multilingual Task-Oriented Dialogue (MTOD) (Schuster et al., 2019) dataset. MTOD covers 3 languages (English, Spanish, and Thai), 3 domains (alarm, reminder, and weather), 12 intent types, and 11 slot types. 3 We train models with the English training data (Train), but for the other languages we use the provided development sets (Dev), in keeping with our goal of analyzing methods of few-shot transfer. We evaluate on the provided test sets. Moreover, we evaluate on an in-house dataset of 7 languages. 4

For QA, we use the Typologically Diverse QA (TyDiQA-GoldP) (Clark et al., 2020) dataset. TyDiQA is a typologically diverse question answering dataset covering 11 languages. Like Hu et al. (2020), we use a simplified version of the primary task. Specifically, we discard questions that don't have an answer and use only the gold passage as context, keeping only the short answer and its spans. This makes the task similar to XQuAD and MLQA, although unlike those tasks, the questions are written without looking at the answers and without machine translation. As with MTOD, we use the English training data as Train. Since development sets are not specified for TyDiQA-GoldP, we instead reserve 10% of the training data in each of the other languages as Dev. We report on the provided test sets. Statistics of the datasets for both tasks can be found in Appendix A.

Evaluation
In order to fairly and consistently evaluate our approach to few-shot transfer learning via meta-learning and to ablate components of the method, we design a series of experiments based on both internal and external baselines. Our internal baselines ablate the effect of the X-METRA-ADA algorithm vs. conventional fine-tuning from a model trained on a high-resource language by keeping the datasets used for training constant. As our specific data conditions are not reproduced in any externally reported results on these tasks, we instead compare to other reported results using English-only or entirely zero-shot training data.
Internal Evaluation We design the following fine-tuning/few-shot schemes:
• PRE: An initial model is fine-tuned on the Train split of English only and then evaluated on new languages with no further tuning or adaptation. This strawman baseline has exposure to English task data only.
• MONO: An initial model is fine-tuned on the Dev split of the target language only. Comparing this baseline to standard fine-tuning (FT, below) shows the value of combining MONO and PRE.
• FT: We fine-tune the PRE model on the Dev split of the target language. This is a standard transfer learning approach that combines PRE and MONO.
• FT w/EN: Like FT, except both the Dev split of the target language and the Train split of English are used for fine-tuning. This is used for dataset equivalence with X-METRA-ADA (below).
• X-METRA: We use the PRE model as θ B for meta-train, the Train split from English to form support sets in D meta-train, and all of the Dev split of the target language to form query sets in D meta-train.
• X-METRA-ADA: We use the PRE model as θ B for meta-train and the Train split from English to form support sets in D meta-train. For MTOD, we use 75% of the Dev split of the target language to form query sets in D meta-train. We use the remaining 25% of the Dev split of the target language for both the support and query sets of D meta-adapt. For QA, we use ratios of 60% for D meta-train and 40% for D meta-adapt.
All models are ultimately fine-tuned versions of BERT, and all have access to the same task training data relevant for their variant. That is, X-METRA-ADA and PRE both see the same English Train data, and MONO, FT, and X-METRA-ADA see the same target language Dev data. However, since X-METRA-ADA uses both Train and Dev to improve upon PRE, and FT only uses Dev, we make an apples-to-apples comparison, data-wise, by including FT w/EN experiments as well.
External Baselines We focus mainly on transfer learning baselines from contextualized embeddings for a coherent external comparison; supervised experiments on target language data such as those reported in Schuster et al. (2019) are inappropriate for comparison because they use much more in-language labeled data for training. The experiments we compare to are zero-shot in the sense that they are not trained directly on the language-specific task data. However, most of these external baselines involve some strong cross-lingual supervision, either through cross-lingual alignment or mixed-language training. We also include machine translation baselines, which are often competitive and hard to beat. Our work, by contrast, uses no parallel language data or resources beyond pretrained multilingual language models, labeled English data, and few-shot labeled target language data. To the best of our knowledge, we are the first to explore cross-lingual meta-transfer learning for these benchmarks, so we only report our X-METRA-ADA approach in addition to those baselines. For MTOD, then, we focus on external baselines including MCoVe (Schuster et al., 2019), mixed-language training (MLT) variants, and XLM-R (Conneau et al., 2020) fine-tuned on intent classification. We also include Translate Train (TTrain) (Schuster et al., 2019), which translates English training data into target languages to train on them in addition to the target language training data.
For TyDiQA-GoldP, out of the already-mentioned baselines, we use M-BERT, XLM, MMTE, and TTrain (which, unlike Schuster et al. (2019), only translates English into the target language to train on it, without data augmentation). In addition, we include XLM-R as reported by Hu et al. (2020).

Implementation Details
We use M-BERT (bert-base-multilingual-cased) 5 with 12 layers as the initial model for MTOD and TyDiQA-GoldP in our internal evaluation. We use the xlm-r-distilroberta-base-paraphrase-v1 6 model for computing similarities when constructing the QA meta-dataset (Section 2.3.2). We implement X-METRA-ADA from scratch, using learn2learn (Arnold et al., 2020) for differentiation and update rules in the inner loop. 7 We use the first-order approximation option in learn2learn for updating the outer loop, also introduced in Finn et al. (2017). For each model, we run 3 to 4 different random initializations (for some experiments, like PRE for TyDiQA-GoldP, we use only 2 seeds) and report the average and standard deviation of the best model for the few-shot language for each run. We use training loss convergence as a stopping criterion. For the FT and MONO baselines, we don't have the luxury of Dev performance, since those baselines use the Dev dataset for training. 8 The Dev set is chosen to simulate a low-resource setup. More details on the hyperparameters used can be found in Appendix B.

Table 1: Performance evaluation on MTOD between meta-learning approaches, fine-tuning internal baselines, and external baselines. All our internal experiments use k = q = 6. Zero-shot learning experiments that train only on English are distinguished from few-shot learning, which includes a fair internal comparison. Models in bold indicate our own internal models. MONO, FT, FT w/EN, X-METRA, and X-METRA-ADA models include results for each test language when training on that language. FT w/EN trains jointly on English and only the target language. We highlight the best scores in bold and underline the second best for each language and sub-task. The rest are reported from † (Schuster et al., 2019), ‡, and + (Siddhant et al., 2020).

Table 1 shows the results for cross-lingual transfer learning on MTOD, comparing different baselines.
9 In general, the PRE model performs worse than the other baselines. It performs below even the simplest baseline, MCoVe, when transferring to Thai, with decreases of 25.3% and 23.1% and average cross-lingual relative losses of 4.5% and 2.1% for intent classification and slot filling respectively. This suggests that zero-shot fine-tuning M-BERT on English only over-fits to English and similar languages. Using MLT A, which adds more dialogue-specific mixed training, helps reduce that gap for Thai, mainly on intent accuracy, but not to the same degree on slot filling. The results confirm the positive effects of cross-lingual fine-tuning: although PRE is not a very effective cross-lingual learner, fine-tuning with in-language data on top of PRE (i.e., FT) adds value over the MONO baseline. Adding English data to fine-tuning (FT w/EN) is slightly harmful. However, the meta-learning approach appears to make the most effective use of this data in almost all cases (Spanish slot filling is an exception). We perform a pairwise two-sample t-test (assuming unequal variance) and find the results of X-METRA-ADA compared to FT on intent classification to be statistically significant, with p-values of 1.5% and 2.4% for Spanish and Thai respectively, rejecting the null hypothesis with 95% confidence.

Table 2: F1 comparison on TyDiQA-GoldP between different meta-learning approaches, fine-tuning, and external baselines. We highlight the best scores in bold and underline the second best for each language. Our own models are in bold, whereas the rest are reported from † (Hu et al., 2020). This is using k = q = 6.
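The significance test referenced above is a two-sample t-test assuming unequal variances, i.e. Welch's test. A sketch of the statistic follows; the per-seed scores are made up purely for illustration and are not the paper's numbers:

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # unbiased variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

ada_runs = [85.1, 84.7, 85.4, 84.9]   # hypothetical accuracies per seed
ft_runs = [83.2, 83.9, 82.8, 83.5]
t = welch_t(ada_runs, ft_runs)
assert t > 0   # X-METRA-ADA mean exceeds FT mean in this toy example
```

The p-value then comes from the t distribution with Welch–Satterthwaite degrees of freedom (e.g., scipy's ttest_ind with equal_var=False computes both).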
X-METRA-ADA outperforms all previous external baselines and fine-tuning models for both Spanish and Thai (except for slot filling on Spanish). We achieve the best overall performance, with an average cross-lingual, cross-task increase of 3.2% over the FT baseline, 6.9% over FT w/EN, and 12.6% over MONO. Among all models, MONO has the least stability, as suggested by its higher average standard deviation. X-METRA-ADA tends to work better for languages like Thai than for Spanish, as Thai is a truly low-resource language. This suggests that pre-training on English only learns an unsuitable initialization, impeding generalization to other languages. As expected, fine-tuning on small amounts of the Dev data does not help the model generalize to new languages. X-METRA-ADA, on the other hand, learns a more stable and successful adaptation to the target language, even on top of a model pre-trained on English, with less over-fitting.

Table 2 shows a comparison of methods for TyDiQA-GoldP across seven languages, evaluating using F1. 10 The benefits of fine-tuning and the improvements from X-METRA-ADA observed in Table 1 are confirmed. We also compare X-METRA-ADA to X-METRA, which is equivalent to X-METRA-ADA without the meta-adaptation phase. On average, X-METRA improves by 10.8% and 1.5% over the best external and fine-tuning baselines respectively, whereas MONO results lag behind. X-METRA-ADA outperforms X-METRA on average and is especially helpful on languages like Bengali and Telugu. We compare X-METRA and X-METRA-ADA in more depth in Section 4.2. Meta-learning significantly and consistently outperforms fine-tuning.
In Appendix D, we report zero-shot results for QA and notice improvements using X-METRA-ADA over FT for some languages. However, we cannot claim that there is a direct correlation between the degree to which the language is low-resource and the gain in performance of X-METRA-ADA over fine-tuning. Other factors like similarities of grammatical and morphological structure, and shared vocabulary in addition to consistency of annotation may play a role in the observed cross-lingual benefits. Studying such correlations is beyond the scope of this paper.

More Analysis
Meta-Adaptation Role The learning curves in Figure 4 compare X-METRA-ADA, X-METRA (i.e., meta-training but no meta-adaptation), and fine-tuning, both with English and with target language data only, for both Spanish and Thai intent detection in MTOD. In general, including English data alongside in-language fine-tuning data lags behind language-specific training for all models, languages, and sub-tasks. With the exception of slot filling on Spanish, there is a clear gap between naive fine-tuning and meta-learning, in favor of X-METRA-ADA, especially for Thai. Naive fine-tuning, X-METRA, and X-METRA-ADA all start from the same checkpoint fine-tuned on English, and all variants sample from the same data. For slot filling on Spanish, continuing to use English in naive fine-tuning reaches better performance than both variants of meta-learning (see Appendix E). This could be due to the typological similarity of Spanish and English, which makes optimization fairly easy for naive fine-tuning, compared to Thai, which is both typologically distant and low-resource.

K-Shot Analysis
We perform a k-shot analysis by treating the number of instances seen per class (i.e., 'shots') as a hyper-parameter, to determine at which level few-shot meta-learning starts to outperform the fine-tuning and monolingual baselines. As shown in Figure 5, while even one shot for X-METRA-ADA is better than fine-tuning on intent classification, k = q = 9 and k = q = 6 are at the same level of stability, with very slightly better results for 6 shots, showing that more shots beyond this level will not improve performance. While 1-shot performance is slightly below our monolingual baseline, it approaches the same level of performance as 3-shot upon convergence. Figure 6 shows an analysis over both k and q shots for TyDiQA-GoldP. In general, increasing q helps more than increasing k. The gap is bigger between k = 6, q = 3 and k = 6, q = 6, especially for languages like Bengali and Telugu. We can also see that k = 6, q = 3 is at the same level of performance as FT for those languages.

Downsampling Analysis We perform a downsampling analysis, where we gradually decrease the proportion of the target-language set from which few-shot examples are sampled in X-METRA-ADA and FT. Figure 7 shows a comparison of intent accuracies and slot F1 scores between the main models, X-METRA-ADA and FT, on Thai. We notice that as the percentage of query data increases, the gap between X-METRA-ADA and FT increases slightly, whereas the gain on slots is steadier. This suggests that X-METRA-ADA remains effective even at lower percentages.

Related Work
Cross-lingual transfer learning Recent efforts apply cross-lingual transfer to downstream applications such as information retrieval (Jiang et al., 2020), information extraction (M'hamdi et al., 2019; Bari et al., 2020b), and chatbot applications (Abbet et al., 2018). Upadhyay et al. (2018) and Schuster et al. (2019) propose the first real attempts at cross-lingual task-oriented dialog using transfer learning. Although they show that cross-lingual joint training outperforms monolingual training, their zero-shot models lag behind machine translation for other languages.
To circumvent imperfect alignments in the cross-lingual representations, prior work proposes a latent variable model combined with cross-lingual refinement using a small bilingual dictionary related to the dialogue domain, or enhances Transformer-based embeddings with mixed-language training to learn inter-lingual semantics across languages. However, although these approaches show promising zero-shot performance for Spanish, their learned refined alignments are not good enough to surpass machine translation baselines on Thai.
More recently, Hu et al. (2020) and Liang et al. (2020) introduce the XTREME and XGLUE benchmarks for the large-scale evaluation of cross-lingual capabilities of pre-trained models across a diverse set of understanding and generation tasks. In addition to M-BERT, they analyze models like XLM (Lample and Conneau, 2019) and Unicoder (Huang et al., 2019). Although the latter two models slightly outperform M-BERT, they need a large amount of parallel data for pre-training. It is also unclear to what extent massive cross-lingual supervision helps bridge the gap to linguistically distant languages.

Meta-learning for NLP Previous work on meta-learning for NLP focuses on the application of first-order MAML (Finn et al., 2017). Earlier work by Gu et al. (2018) extends MAML to improve low-resource languages in neural machine translation. Dou et al. (2019) apply MAML to NLU tasks in the GLUE benchmark. They show that meta-learning is a better alternative to multi-task learning, but they only validate their approach on English. Wu et al. (2020) also use MAML for cross-lingual NER, with a slight enhancement to the loss function. More recently, Nooralahzadeh et al. (2020) directly leverage MAML on top of M-BERT and XLM-R for the zero-shot and few-shot XNLI and MLQA datasets. Although their attempt shows that cross-lingual transfer using MAML outperforms other baselines, the degree of typological commonality among languages plays a significant role in that effect. In addition, their approach is an oversimplification of the n-way k-shot setup, with one-size-fits-all sampling of data points for support and query sets and additional supervised fine-tuning.

Conclusion
In this paper, we adapt a meta-learning approach for cross-lingual transfer learning on Natural Language Understanding tasks. Our experiments cover two challenging cross-lingual benchmarks, task-oriented dialog and question answering, including an extensive set of low-resource and typologically diverse languages. X-METRA-ADA converges more stably than naive fine-tuning, reaching a new state of the art for most languages.
Acknowledgments
This work was started while the first author was a research intern at Adobe Research (Summer 2020). This material is partially based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank the anonymous reviewers for their detailed comments.

A Dataset Statistics
Tables 3 and 4 show the statistics of MTOD and TyDiQA respectively per language and split.

B Hyperparameters
For MTOD, we fine-tune PRE on English training data. We use a batch size of 32, a dropout rate of 0.3, and AdamW with a learning rate of 4e−5 and an ε of 1e−8. We train for around 2000 steps; beyond that point, more training did not prove necessary, so we perform early stopping there. For MONO, using a smaller learning rate of 4e−5 helped achieve good convergence for that model. For all FT experiments, we use the same learning rate of 1e−3, which gave better convergence.
For QA, we use a batch size of 4, a doc stride of 128, a fixed maximum sequence length of 384, and a maximum question length of 30 words. We use the AdamW optimizer throughout all experiments, with a weight decay of 1e−3, a learning rate of 3e−5, and a scheduler with 4 warm-up steps. 11 We fine-tune PRE for 2 epochs and observe no more gains in performance. For all MONO and FT experiments, we use the same learning rate of 3e−5. This is the same optimizer and learning rate used for the outer loops in meta-learning as well.
For X-METRA-ADA and X-METRA, we sample 2500 tasks in total for both MTOD and QA. For each task, we randomly sample k = q = 6 examples from each intent class to form the support and query sets respectively (we consider all classes, not only the intersection across languages). For QA, we use one support example per query example and 6 query examples per task. For the inner loop, we use learn2learn's pre-built optimizer.

Table 7: Full F1 results on TyDiQA-GoldP between external, pre-training, monolingual, and fine-tuning baselines on one hand and X-METRA and X-METRA-ADA on the other.

E More Ablation
Figure 8 compares the learning curves for language-specific and joint training with respect to slot filling for both Spanish and Thai.

F More Analysis
More Downsampling Analysis Figure 9 shows a downsampling analysis on Spanish. Due to the typological similarity between Spanish and English, even lower percentages, starting from 50% of the query set, reach maximal performance for both intent classification and slot filling.

BERTology Analysis
We analyze the degree of contribution of M-BERT layers by freezing each pair of layers separately. This analysis is not conclusive, as performance does not change significantly between layers. We then freeze all layers of M-BERT and discover that the linear layers are more important in refining the cross-lingual alignment to the target language, as shown by the narrow gap between freezing and not freezing BERT layers in Figure 10. This can be explained by the challenge of fine-tuning M-BERT alone, with many layers and high dimensionality, in such a low-resource setting.

Table 8: Full EM results on TyDiQA-GoldP between external, pre-training, monolingual, and fine-tuning baselines on one hand and X-METRA and X-METRA-ADA on the other.