MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning

The combination of multilingual pre-trained representations and cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages. However, for extremely low-resource languages without large-scale monolingual corpora for pre-training or sufficient annotated data for fine-tuning, transfer learning remains an understudied and challenging task. Moreover, recent work shows that multilingual representations are surprisingly disjoint across languages, bringing additional challenges for transfer onto extremely low-resource languages. In this paper, we propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one and brings their representation spaces closer for effective transfer. Extensive experiments on real-world low-resource languages – without access to large-scale monolingual corpora or large amounts of labeled data – for tasks like cross-lingual sentiment analysis and named entity recognition show the effectiveness of our approach. Code for MetaXL is publicly available at github.com/microsoft/MetaXL.


Introduction
Recent advances in multilingual pre-trained representations have enabled success on a wide range of natural language processing (NLP) tasks for many languages. However, these techniques may not readily transfer to extremely low-resource languages, where: (1) large-scale monolingual corpora are not available for pre-training and (2) sufficient labeled data is lacking for effective fine-tuning on downstream tasks. For example, multilingual BERT (mBERT) (Devlin et al., 2018) is pre-trained on 104 languages with many articles on Wikipedia, and XLM-R (Conneau et al., 2020) is pre-trained on 100 languages with CommonCrawl corpora. However, these models still leave behind more than 200 languages with few articles available on Wikipedia, not to mention the 6,700 or so languages with no Wikipedia text at all (Artetxe et al., 2020). Cross-lingual transfer learning for these extremely low-resource languages is essential for better information access but under-studied in practice (Hirschberg and Manning, 2015). Recent work on cross-lingual transfer learning using pre-trained representations mainly focuses on transferring across languages that are already covered by existing representations (Wu and Dredze, 2019). In contrast, existing work on transferring to languages without significant monolingual resources tends to be sparser and typically focuses on specific tasks such as language modeling (Adams et al., 2017) or entity linking.

* Most of the work was done while the first author was an intern at Microsoft Research.
Building NLP systems in these settings is challenging for several reasons. First, a lack of sufficient annotated data in the target language prevents effective fine-tuning. Second, multilingual pre-trained representations are not directly transferable due to language disparities. Though recent work on cross-lingual transfer mitigates this challenge, it still requires a sizeable monolingual corpus to train token embeddings (Artetxe et al., 2019). As noted, these corpora are difficult to obtain for many languages (Artetxe et al., 2020).
Additionally, recent work (Singh et al., 2019) shows that contextualized representations of different languages do not always reside in the same space but are rather partitioned into clusters in multilingual models. This representation gap between languages suggests that joint training with combined multilingual data may lead to sub-optimal transfer across languages. This problem is further exacerbated by the often large lexical and syntactic differences between languages with existing pre-trained representations and extremely low-resource ones. Figure 1(a) provides a visualization of one such example of the disjoint representations of a resource-rich auxiliary language (English) and a resource-scarce target language (Telugu).
We propose a meta-learning based method, MetaXL, to bridge this representation gap and allow for effective cross-lingual transfer to extremely low-resource languages. MetaXL learns to transform representations from auxiliary languages in a way that maximally facilitates transfer to the target language. Concretely, our meta-learning objective encourages transformations that increase the alignment between the gradients of the source-language set and those of the target-language set. Figure 1(b) shows that MetaXL successfully brings representations from seemingly distant languages closer for more effective transfer.
We evaluate our method on two tasks: named entity recognition (NER) and sentiment analysis (SA). Extensive experiments on 8 low-resource languages for NER and 2 low-resource languages for SA show that MetaXL significantly improves over strong baselines by an average of 2.1 and 1.3 F1 score with XLM-R as the multilingual encoder.

Background and Problem Definition
The standard practice in cross-lingual transfer learning is to fine-tune a pre-trained multilingual language model f_θ parameterized by θ (e.g., XLM-R or mBERT) with data from one or more auxiliary languages, and then apply it to the target language. This is widely adopted in the zero-shot transfer setup, where no annotated data is available in the target language. The practice is still applicable in the few-shot setting, in which a small amount of annotated data in the target language is available.
In this work, we focus on cross-lingual transfer for extremely low-resource languages where only a small amount of unlabeled data and task-specific annotated data are available. That includes languages that are not covered by multilingual language models like XLM-R (e.g., Maori or Turkmen), or low-resource languages that are covered but with orders of magnitude less data for pre-training (e.g., Telugu or Persian). We assume the only target-language resource we have access to is a small amount of task-specific labeled data.
More formally, we are given: (1) a limited amount of annotated task data in the target language, denoted as D_t; (2) a larger amount of annotated data from one or more source language(s), denoted as D_s; and (3) a pre-trained model f_θ, which is not necessarily trained on any monolingual data from the target language. Our goal is to adapt the model to maximize performance on the target language.
When some target language labeled data is available for fine-tuning, a standard practice is to jointly fine-tune (JT) the multilingual language model on the concatenation of the labeled data from the source and target languages, D_s and D_t. The representation gap (Singh et al., 2019) between the source language and the target language in a jointly trained model brings additional challenges, which motivates our proposed method.

Representation Transformation
The key idea of our approach is to explicitly learn to transform source language representations, such that when training with these transformed representations, the parameter updates benefit performance on the target language the most. On top of an existing multilingual pre-trained model, we introduce an additional network, which we call the representation transformation network, to model this transformation explicitly.
The representation transformation network models a function g_φ: R^d → R^d, where d is the dimension of the representations.

Figure 2: Overview of MetaXL. For illustration, only two Transformer layers are shown for XLM-R, and the representation transformation network is placed after the first Transformer layer. (1) A batch of source language data passes through the first Transformer layer, through the current representation transformation network, and finally through the remaining layers to compute a training loss with the corresponding source labels. (2) The training loss is back-propagated onto all parameters, but only the parameters of XLM-R are updated. The updated weights of XLM-R are a function of the current representation transformation network due to gradient dependency (highlighted by the light-purple background of the updated XLM-R). (3) A batch of target language data passes through the updated XLM-R, and the meta loss is evaluated with the corresponding labels. (4) The meta loss is back-propagated into the representation transformation network, since the meta loss is in effect a function of weights from that network, and only the representation transformation network is updated.

Conceptually, any network with proper input and output sizes is feasible. We opt for a two-layer feed-forward network, a rather simple architecture, to avoid heavy parameter overhead on top of the pre-trained model. The input to the representation transformation network is the representations from any layer of the pre-trained model. Denoting the representations from layer i as h_i ∈ R^d, we parameterize the representation transformation network as

g_φ(h_i) = W_2 σ(W_1 h_i),   (1)

where W_1 ∈ R^{r×d} and W_2 ∈ R^{d×r}, σ is a non-linearity, and φ = {W_1, W_2} is the set of parameters of the representation transformation network. In practice, we set r to be bottlenecked, i.e., r < d, so the representation transformation network first compresses the input representation and then projects it back onto the original dimension. As shown in Figure 2, assuming the base model has N layers, a source example (x_s, y_s) ∈ D_s passes through the first i layers, then through the representation transformation network, and finally through the last N − i layers of the base model. We denote the resulting logits as f(x_s; θ, φ), encoded by both the base model and the representation transformation network. In contrast, a target example (x_t, y_t) ∈ D_t passes through the base model as usual, denoted f(x_t; θ).
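The bottleneck structure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the dimensions (d = 768, r = 384) and the tanh non-linearity are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 384  # representation dimension and bottleneck dimension (r < d)

# phi = {W1, W2}: a down-projection followed by an up-projection back to R^d
W1 = rng.normal(scale=0.02, size=(r, d))
W2 = rng.normal(scale=0.02, size=(d, r))

def rtn(h):
    """Two-layer bottleneck transformation g_phi: R^d -> R^d."""
    return W2 @ np.tanh(W1 @ h)

h_i = rng.normal(size=d)      # a source-token representation from layer i
h_transformed = rtn(h_i)
print(h_transformed.shape)    # (768,)
```

The output lives in the same space as the input, so the transformed representation can be fed directly into the next Transformer layer.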
Ideally, if we had a representation transformation network that properly transformed representations from a source language into the target language, the source data could be seen as almost equivalent to target data at the representation level. Unfortunately, we cannot train such a representation transformation network in a supervised manner without extensive parallel data.
Architecturally, the representation transformation network adopts a structure similar to existing work on language and task adapters for cross-lingual and multi-task transfer (Pfeiffer et al., 2020b): a simple down- and up-projection of input representations. Nevertheless, beyond network architecture, the goals and training procedures of the two approaches are significantly different. Adapters are typically trained to encode task- or language-specific information by fixing the rest of the model and updating only the parameters of the adapters, which allows training parameter-efficient models that can be flexibly adapted to multiple languages and tasks. In our proposed method, by contrast, we use the representation transformation network at training time to adjust the training dynamics so as to maximally improve test-time performance on the target language. The optimization procedure and the function of the representation transformation network are discussed in more detail in the next section.

Algorithm 1 Training procedure for MetaXL
Input: data from the target language D_t and the source language D_s
1: Initialize base model parameters θ with pre-trained XLM-R weights; initialize the parameters φ of the representation transformation network randomly
2: while not converged do
3:   Sample a source batch (x_s, y_s) from D_s and a target batch (x_t, y_t) from D_t
4:   Update θ with one gradient step on the source loss through the representation transformation network (Equation 3)
5:   Update φ with one meta-gradient step on the target loss (Equation 4)
6: end while

Optimization
The training of the representation transformation network conforms to the following principle: if the representation transformation network g_φ effectively transforms the source language representations, training on the transformed representations f(x_s; θ, φ) should be more beneficial to the target task than training on the original representations f(x_s; θ), such that the model achieves a smaller evaluation loss L_Dt on the target language. This objective can be formulated as a bi-level optimization problem:

min_φ L_Dt(x_t; θ*(φ))   s.t.   θ*(φ) = argmin_θ L_Ds(x_s; θ, φ),   (2)

where L(·) is the task loss function. In this bi-level optimization, the parameters φ of the representation transformation network are the meta parameters, which are only used at training time and discarded at test time. An exact solution requires solving for the optimal θ* whenever φ gets updated. This is computationally infeasible, particularly when the base model f is complex, such as a Transformer-based language model. Similar to existing work involving such optimization problems (Finn et al., 2017; Liu et al., 2019; Shu et al., 2019; Zheng et al., 2021), instead of solving for the optimal θ* for any given φ, we adopt a one-step stochastic gradient descent update for θ as an estimate of the optimal base model for a given φ:

θ' = θ − α ∇_θ L_Ds(x_s; θ, φ),   (3)

where L_Ds(x_s; ·) is the loss function of the lower problem in Equation 2 and α is the corresponding learning rate. Note that the resulting θ' is in effect a function of φ. We then evaluate the updated weights θ' on data x_t from the target language to update g_φ:

φ' = φ − β ∇_φ L_Dt(x_t; θ'(φ)),   (4)

where L_Dt(x_t; ·) is the loss function of the upper problem in Equation 2 and β is its corresponding learning rate. Note that the meta-optimization is performed over the parameters φ of the representation transformation network, whereas the objective is calculated solely using the updated parameters θ' of the main architecture. By plugging Equation 3 into Equation 4, we can further expand the meta gradient by the chain rule (omitting f and y for simplicity):

∇_φ L_Dt(x_t; θ') = −α ∇²_{φ,θ} L_Ds(x_s; θ, φ) ∇_{θ'} L_Dt(x_t; θ').
During training, we alternately update θ with Equation 3 and φ with Equation 4 until convergence. We term our method MetaXL, for its use of Meta-learning for extremely low-resource cross(X)-Lingual transfer. Both Figure 2 and Algorithm 1 outline the procedure for training MetaXL.
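The alternating updates above can be made concrete with a toy scalar example. In this sketch, a linear model f(x; θ) = θx stands in for the base model, a scalar multiplier φ on the source input stands in for the representation transformation, and the meta-gradient of Equation 4 is approximated by finite differences in place of the second-order term. All numbers are illustrative.

```python
# Toy MetaXL-style bi-level step (illustrative values throughout).
alpha, beta = 0.1, 0.05        # inner (Eq. 3) and meta (Eq. 4) learning rates
x_s, y_s = 1.0, 2.0            # one source example
x_t, y_t = 1.0, 1.0            # one target example
theta, phi = 0.0, 1.0          # base-model and transformation parameters

def source_loss_grad_theta(theta, phi):
    # L_Ds = (theta * phi * x_s - y_s)^2; gradient w.r.t. theta
    return 2.0 * (theta * phi * x_s - y_s) * phi * x_s

def meta_loss(phi):
    # One-step inner update of theta (Equation 3), then the target loss
    # (upper problem of Equation 2) evaluated at the updated theta'.
    theta_prime = theta - alpha * source_loss_grad_theta(theta, phi)
    return (theta_prime * x_t - y_t) ** 2

# Meta-gradient of phi via central finite differences, a stand-in for the
# second-order expression obtained by plugging Equation 3 into Equation 4.
eps = 1e-5
g_phi = (meta_loss(phi + eps) - meta_loss(phi - eps)) / (2 * eps)

before = meta_loss(phi)
phi = phi - beta * g_phi       # Equation 4
after = meta_loss(phi)
assert after < before          # the phi update reduces the target loss
```

In the full method, the finite-difference step is replaced by back-propagating the meta loss through the one-step update of the Transformer's parameters.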

Data
We conduct experiments on two diverse tasks: sequence labeling for Named Entity Recognition (NER) and sentence classification for Sentiment Analysis (SA). For the NER task, we use the cross-lingual WikiAnn dataset (Pan et al., 2017). For the sentiment analysis task, we use the English portion of the Multilingual Amazon Reviews Corpus (MARC) as the source dataset, together with the following target language datasets.

SentiPers SentiPers (Hosseini et al., 2018) is a sentiment corpus in Persian (fa) consisting of around 26k sentences of users' opinions on digital products. Each sentence has an assigned quantitative polarity from the set {−2, −1, 0, 1, 2}.
Sentiraama Sentiraama (Gangula and Mamidi, 2018) is a sentiment analysis dataset in Telugu (tel), a language widely spoken in India. The dataset contains reviews, each labeled as either positive or negative.
Pre-processing For SA, we use SentiPers and Sentiraama as target language datasets and MARC as the source language dataset. To unify the label space, we curate MARC by assigning negative labels to reviews rated 1 or 2 and positive labels to those rated 4 or 5, leaving out neutral reviews rated 3. For SentiPers, we assign negative labels to reviews rated −1 or −2 and positive labels to those rated 1 or 2. Though SentiPers is relatively large, we subsample it to match the extremely low-resource setting.

Tokenization For all languages, whether covered by XLM-R pre-training or not, we use XLM-R's default tokenizer. We also tried training subword tokenizers for unseen languages, similar to Artetxe et al. (2020), but obtained worse results than using the XLM-R tokenizer as is, due to the extremely small scale of target language data. We conjecture that the subword vocabulary XLM-R learns is also beneficial for encoding languages on which it is not pre-trained. We leave exploring the best tokenization strategy for leveraging pre-trained models on unseen languages as future work.
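The MARC rating-to-label mapping described in the pre-processing step can be sketched as follows (the function name is ours):

```python
def binarize_rating(stars):
    """Map a 1-5 star MARC review rating to a binary sentiment label.

    Reviews rated 1 or 2 become "negative", 4 or 5 become "positive",
    and neutral 3-star reviews are dropped (returned as None).
    """
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # neutral reviews are left out

labels = [binarize_rating(s) for s in [1, 2, 3, 4, 5]]
print(labels)  # ['negative', 'negative', None, 'positive', 'positive']
```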

Main Results
NER We present results of NER in Table 2, where we use 5k examples from English or a related language as source data. When we only use the annotated data of the target language to fine-tune XLM-R (target), we observe that the performance varies significantly across languages, ranging from 37.7 to 76.3 F1 score. Jointly fine-tuning XLM-R with target and source data (JT) leads to a substantial average gain of around 12.6 F1 score. Using the same amount of data from a related language (instead of English) is more effective, showing an average improvement of 16.3 F1 score over using target data only. Our proposed method, MetaXL, consistently outperforms the joint training baselines, leading to a significant average gain of 2.07 and 0.95 F1 score when paired with English or related languages, respectively.
SA We present results on the task of SA in Table 3, where we use 1k examples from English as source language data. We find that auxiliary data from the source language brings smaller but still significant gains for the joint training baseline (JT) over using target language data only (target only), as in the NER task. In addition, MetaXL outperforms joint training by around 0.9 and 1.6 F1 score on Telugu and Persian, respectively. These results support our hypothesis that MetaXL transfers representations from other languages more effectively, which in turn contributes to the performance gain on the target task.

Source Language Data Size
To evaluate how MetaXL performs with different sizes of source language data, we vary the amount of source data. For NER, we experiment with 5k, 10k, and 20k source examples. For SA, we test with 1k, 3k, and 5k source examples.
As observed from Table 4, MetaXL delivers consistent gains over the joint training model as the size of source data increases (except on fa when using 5k auxiliary data). However, the marginal gain decreases as the source data size increases on NER. We also note that MetaXL continues to improve even when joint training leads to a minor performance drop for SA.

Placement of the Representation Transformation Network

To evaluate where the transformation is most effective, we conducted experiments with representation transformation networks placed at various depths of the Transformer model. Specifically, we experiment with placing the representation transformation network after the 0th (embedding), 6th, and 12th layers (denoted L0, L6, L12). We also experiment with placing two identical representation transformation networks after both the 0th and 12th layers.

As observed from Table 5, transformations at the 12th layer are consistently effective, suggesting that transformation at a higher, more abstract level results in better transfer for both tasks. Transferring from lower layers leads to fewer gains for SA, coinciding with the fact that SA relies more on global semantic information. Transferring at multiple layers does not necessarily lead to higher performance, possibly because it increases instability in the bi-level optimization procedure.
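The placements compared above amount to splicing the transformation into the forward pass after a chosen layer. A schematic sketch, with random linear maps standing in for Transformer layers and an identity placeholder for the transformation network (all toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 12  # toy dimensions; a real encoder would use d = 768

# Toy stand-ins for Transformer layers: random linear maps R^d -> R^d.
layers = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]

def rtn(h):
    # identity placeholder for the representation transformation network
    return h

def forward(x, insert_after):
    """Run the layer stack, applying the RTN after layer `insert_after`
    (0 = right after the embedding, before any Transformer layer)."""
    h = x
    if insert_after == 0:
        h = rtn(h)
    for i, W in enumerate(layers, start=1):
        h = np.tanh(W @ h)
        if i == insert_after:
            h = rtn(h)
    return h

x = rng.normal(size=d)
outputs = {p: forward(x, p) for p in (0, 6, 12)}  # the L0, L6, L12 settings
```

Because the transformation maps R^d to R^d, it can be inserted after any layer without changing the rest of the stack.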

Joint Training with Representation Transformation Networks
There are two major differences between MetaXL and joint training: (1) in MetaXL, the source language data undergoes transformation via an augmented representation transformation network; (2) we adopt a bi-level optimization procedure to update the base model and the representation transformation network. To verify that the performance gain from MetaXL is not attributable to increased model capacity, we conduct experiments on joint training with the representation transformation network. Specifically, the forward pass remains the same as in MetaXL, whereas the backward pass uses standard stochastic gradient descent. We experiment with placing the representation transformation network after the 0th or the 12th layer and present results in Table 6. Interestingly, joint training with the representation transformation network deteriorates performance compared to vanilla joint training, and transferring after the 0th layer is even more detrimental than after the 12th. This finding suggests that Transformer models are rather sensitive to subtle architectural changes. In contrast, MetaXL escapes this restriction, pushing performance higher in both layer settings.

Analysis of Transformed Representations
To verify that MetaXL does bring the source and target language spaces closer, we demonstrate the representation shift qualitatively and quantitatively. In particular, we collect representations of both the source and target languages from the joint training and MetaXL models, with mBERT as the multilingual encoder, and present 2-component PCA visualizations in Figure 1 for SA and Figure 3 for NER. SA models are trained on Telugu paired with 5k English examples, and NER models are trained on Quechua paired with 5k English examples. From the figures, MetaXL merges the representations of the two languages for SA, though the phenomenon is less evident for NER. Singh et al. (2019) quantitatively analyze mBERT representations with canonical correlation analysis (CCA). However, CCA does not suit our case, as we do not have access to semantically aligned data across languages. We therefore adopt the Hausdorff distance, a metric widely used in vision and NLP tasks (Huttenlocher et al., 1993; Dubuisson and Jain, 1994; Patra et al., 2019), to measure the distance between two datasets. Informally, the Hausdorff distance measures the average proximity of data representations in the source language to the nearest ones in the target language, and vice versa. Given a set of representations of the source language S = {s_1, s_2, . . . , s_m} and a set of representations of the target language T = {t_1, t_2, . . . , t_n}, we compute the (average) Hausdorff distance as

H(S, T) = 1/2 ( 1/m Σ_i min_j d(s_i, t_j) + 1/n Σ_j min_i d(s_i, t_j) ),

where cosine distance is used as the inner distance, i.e., d(s, t) = 1 − cos(s, t). For SA, we observe a drastic drop in Hausdorff distance from 0.57 to 0.20 and a substantial performance improvement of around 4 F1 score. For NER, we observe a smaller decline in Hausdorff distance, from 0.60 to 0.53, as the representations are obtained at the token level, accompanied by a significant performance gain of 3 F1 score.
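The average Hausdorff distance with cosine inner distance can be computed as follows (a NumPy sketch; variable names are ours):

```python
import numpy as np

def cosine_distance_matrix(S, T):
    """Pairwise d(s, t) = 1 - cos(s, t) for rows of S (m x d) and T (n x d)."""
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    return 1.0 - Sn @ Tn.T

def avg_hausdorff(S, T):
    """Average distance from each source point to its nearest target point,
    and vice versa, averaged over the two directions."""
    D = cosine_distance_matrix(S, T)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 32))  # stand-ins for source-language representations
T = rng.normal(size=(120, 32))  # stand-ins for target-language representations
assert np.isclose(avg_hausdorff(S, S), 0.0)  # identical sets -> distance 0
assert avg_hausdorff(S, T) > 0.0
```

Averaging nearest-neighbor distances in both directions makes the metric symmetric and robust to outliers, unlike the classical max-based Hausdorff distance.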
For NER, we observe a correlation of 0.4 between performance improvement and the reduction in representation distance. Both qualitative visualization and quantitative metrics confirm our hypothesis that MetaXL performs more effective transfer by bringing the representations of different languages closer.

Additional Results on High-Resource Languages

Although our experiments so far focus on extremely low-resource languages, characterized by little labeled data for fine-tuning and limited or no unlabeled data for pre-training, MetaXL is in principle applicable to all languages. To better understand the scope of applying MetaXL to languages with varying resources, we perform NER experiments on five target languages that do not belong to our extremely low-resource category: Spanish (es), French (fr), Italian (it), Russian (ru) and Chinese (zh). These languages are typically considered high-resource, with 20k labeled examples in the WikiAnn dataset and large amounts of unlabeled data consumed by mBERT during pre-training. We use only 100 examples for each target language to mimic the low-resource setting and use 5k English examples for transfer.

As shown in Table 7, we found a slight performance drop using MetaXL for these high-resource languages. We conjecture that these languages are already learned quite well by mBERT during pre-training, leaving less scope for effective representation transformation in the low-resource setup. Nonetheless, this can be remedied with a back-off strategy: further fine-tuning the resulting MetaXL model on the concatenated data from both source and target languages matches the performance of JT training. As high-resource languages are out of the scope of this paper, we leave further analysis and understanding of these scenarios for future work.

Related Work
Unifying Language Spaces MetaXL in essence brings the source and target representations closer. Previous work has shown that learning invariant representations across languages leads to better transfer. On the representation level, adversarial training is widely adopted to filter out language-specific information (Xie et al., 2017). On the form level, prior work shows that replacing words in a source language with their correspondences in the target language brings significant gains in low-resource machine translation.

Adapters Adapter networks are designed to encode task- (Houlsby et al., 2019; Stickland and Murray, 2019; Pfeiffer et al., 2020a), domain- (Bapna and Firat, 2019) and language-specific (Pfeiffer et al., 2020c) information to efficiently share parameters across settings. Though the representation transformation network in MetaXL is similar to adapter networks in architecture, it plays a more explicit role in transforming representations across languages to bridge the representation gap. More importantly, MetaXL trains the representation transformation network in a meta-learning paradigm, significantly different from how adapters are trained.
Meta Learning MetaXL falls into the category of meta learning, as it learns to transform representations under the guidance of the target task. Related techniques have been used by Finn et al. (2017), who aim to learn a good initialization that generalizes well to multiple tasks; this approach was further extended to low-resource machine translation (Gu et al., 2018) and low-resource natural language understanding tasks (Dou et al., 2019). The bi-level optimization procedure is widely adopted across neural architecture search (Liu et al., 2019), instance re-weighting (Ren et al., 2018; Shu et al., 2019), learning from pseudo labels (Pham et al., 2020) and mitigating negative interference in multilingual systems (Wang et al., 2020). MetaXL is the first to meta-learn a network that explicitly transforms representations for cross-lingual transfer to extremely low-resource languages.

Conclusions and Future Work
In this paper, we study cross-lingual transfer learning for extremely low-resource languages without large-scale monolingual corpora for pre-training or sufficient annotated data for fine-tuning. To allow for effective transfer from resource-rich source languages and mitigate the representation gap in multilingual pre-trained representations, we propose MetaXL, which learns to transform representations from source languages in a way that best benefits a given task on the target language. Empirical evaluations on cross-lingual sentiment analysis and named entity recognition demonstrate the effectiveness of our approach. Further analysis of the learned transformations verifies that MetaXL indeed brings the representations of the source and target languages closer, thereby explaining the performance gains. For future work, exploring transfer from multiple source languages to further improve performance and investigating the placement of multiple representation transformation networks at multiple layers of the pre-trained model are both interesting directions to pursue.

Ethical Considerations
This work addresses cross-lingual transfer learning onto extremely low-resource languages, a less studied area in the NLP community. We hope that the progress and findings presented in this paper will raise awareness of advancing NLP for extremely low-resource languages and help improve information access for such under-represented language communities.
The proposed method is somewhat compute-intensive, as it requires approximating second-order gradients for updating the meta parameters. This may increase the carbon footprint of training the described models. Future work on developing more efficient meta-learning optimization methods and accelerating the meta-learning training procedure may help alleviate this concern.

A Hyper-parameters
We use a maximum sequence length of 200 for NER and 256 for SA. We use bottleneck dimensions of r = 384 and r = 192 for the representation transformation network, same as Pfeiffer et al. (2020c). During the bi-level optimization process, we use a learning rate of 3e-5 for training the main architecture and tune the learning rate over {3e-5, 1e-6, 1e-7} for training the representation transformation network. We use a batch size of 16 for NER and 12 for SA, and train for 20 epochs in each experiment on both tasks. We use a single NVIDIA Tesla V100 with 32GB of memory for each experiment. For each language, we pick the best model according to validation performance after each epoch.

B.1 Source Data Size
The full results of using 10k and 20k English examples as transfer data are presented in Table 9.

B.2 Placement of RTN
The full results of placing the representation transformation network at different layers are presented in Table 10.

B.3 Joint Training w/ RTN
The full results of joint training with the representation transformation network are presented in Table 11.

C Additional Results on mBERT
We conduct experiments on mBERT (Devlin et al., 2019), which covers the 104 languages with the most Wikipedia articles. For a language that mBERT is not pre-trained on, we train a subword tokenizer on the task data and combine the vocabulary from the newly trained tokenizer with the original mBERT vocabulary; a similar approach has been adopted in Artetxe et al. (2020). Table 12 and Table 8 present results for NER and SA, respectively, where we fine-tune the tasks on mBERT. Note that the languages of SA are covered by both mBERT and XLM-R, while the languages of NER are not. Table 13 shows MetaXL results on mBERT with various sizes of source data.

Table 8: SA results (F1) with mBERT as the encoder.

Method           tel     fa
(1) target only  75.00   73.86
(2) JT           75.13   74.81
MetaXL           77.36   76.69

Our method consistently brings gains on both tasks: we observe an average improvement of 2 F1 points on NER and 2.0 F1 points on SA, showing that the improvement brought by our method is consistent across different language models.
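The vocabulary-combination step above can be sketched as a simple dict-based merge (a simplification; real tokenizers store vocabularies in their own formats, and the tokens here are toy examples):

```python
def merge_vocab(base_vocab, new_tokens):
    """Extend a token -> id vocabulary with tokens from a newly trained
    tokenizer, assigning fresh ids and skipping tokens already present."""
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1 if base_vocab else 0
    for tok in new_tokens:
        if tok not in merged:
            merged[tok] = next_id
            next_id += 1
    return merged

# toy example: one subword overlaps with the base vocabulary
base = {"[UNK]": 0, "the": 1, "##ing": 2}
merged = merge_vocab(base, ["the", "##qa", "##xy"])
print(sorted(merged))  # ['##ing', '##qa', '##xy', '[UNK]', 'the']
```

Keeping the original ids intact means the pre-trained embedding matrix can be reused as-is, with new rows appended only for the newly added tokens.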