MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer

The main goal behind state-of-the-art pretrained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping NLP applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. However, due to limited model capacity, their transfer performance is the weakest exactly on such low-resource languages and languages unseen during pretraining. We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations. In addition, we introduce a novel invertible adapter architecture and a strong baseline method for adapting a pretrained multilingual model to a new language. MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and achieves competitive results on question answering.


Introduction
Current deep pretrained multilingual models (Devlin et al., 2019; Conneau and Lample, 2019) achieve state-of-the-art results on cross-lingual transfer but do not have enough capacity to represent all languages. Evidence for this is the importance of the vocabulary size (Artetxe et al., 2020) and the curse of multilinguality (Conneau et al., 2020), a trade-off between language coverage and model capacity. Scaling up a model to cover all of the world's 7,000 languages is prohibitive. At the same time, limited capacity is an issue even for high-resource languages, where state-of-the-art multilingual models underperform their monolingual variants (Eisenschlos et al., 2019; Virtanen et al., 2019; Nozza et al., 2020), and performance decreases further when moving down the list of languages towards lower-resource languages covered by the pretrained models. Moreover, the model capacity issue is arguably most severe for languages that were not included in the training data at all, and pretrained multilingual models perform poorly on those languages (Ponti et al., 2020).
In this paper, we propose and develop Multiple ADapters for Cross-lingual transfer (MAD-X), a modular framework that leverages a small number of extra parameters to address the fundamental capacity issue that limits current pretrained multilingual models. Using a state-of-the-art multilingual model as the foundation, we adapt the model to arbitrary tasks and languages by learning modular language- and task-specific representations via adapters (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers that are inserted between a model's pretrained weights.
Concretely, using a recent efficient adapter variant (Pfeiffer et al., 2020), we train 1) language-specific adapter modules via masked language modelling (MLM) on unlabelled target language data, and 2) task-specific adapter modules via optimising a target task on labelled data in any source language. Task and language adapters are stacked as in Figure 1, allowing us to adapt the pretrained multilingual model even to languages that are not covered in the model's (pre)training data by substituting the target language adapter at inference.
In order to deal with a mismatch between the shared multilingual vocabulary and the target language vocabulary, we propose invertible adapters, a new type of adapter that is well suited to performing MLM in another language. Our framework goes beyond prior work on using adapters for cross-lingual transfer (Bapna and Firat, 2019; Artetxe et al., 2020) by enabling adaptation to languages unseen during pretraining, without learning expensive language-specific token-level embeddings.
We compare our framework against state-of-the-art cross-lingual transfer methods on the standard WikiANN NER dataset (Pan et al., 2017; Rahimi et al., 2019), relying on a representative set of typologically diverse languages which includes high-resource, low-resource, as well as languages unseen by the pretrained multilingual model. Our framework outperforms the baselines significantly on seen and unseen high-resource and low-resource languages. On the high-resource languages of the challenging XQuAD question answering dataset (Artetxe et al., 2020), we achieve competitive performance while being more parameter-efficient.
Another contribution of our work is a simple method of adapting a pretrained multilingual model to a new language, which outperforms the standard setting of transferring a model only from labelled source language data. We use this novel method as an additional stronger baseline for our adapter-based approach, and demonstrate the usefulness of MAD-X also in comparison to this baseline.
In summary, our contributions are as follows. 1) We propose MAD-X, a modular framework that mitigates the curse of multilinguality and adapts a multilingual model to arbitrary tasks and languages. 2) We propose invertible adapters, a new adapter variant for cross-lingual masked language modelling. 3) We show that our method outperforms or is competitive with state-of-the-art approaches to cross-lingual transfer across typologically diverse languages on standard NER and question answering tasks. 4) We propose a simple method for adapting a pretrained multilingual model to a new language, which can result in stronger transfer performance than the standard and commonly used transfer baseline. 5) We shed light on the behaviour of current methods on languages that are unseen during multilingual pretraining.

Related Work
Cross-lingual Representations Over the last years, research in cross-lingual NLP has increasingly focused on learning general-purpose cross-lingual representations that can be applied to many tasks, first on the word level (Mikolov et al., 2013; Gouws et al., 2015; Glavaš et al., 2019; Wang et al., 2020) and later on the full-sentence level (Chidambaram et al., 2019; Devlin et al., 2019; Conneau and Lample, 2019; Cao et al., 2020). Recent models such as multilingual BERT (Devlin et al., 2019), i.e. large Transformer (Vaswani et al., 2017) models pretrained on large amounts of multilingual data, have been observed to perform surprisingly well when transferring to other languages (Pires et al., 2019; Wu and Dredze, 2019; Wu et al., 2020), and the current state-of-the-art model, XLM-R, has been shown to be competitive with the performance of monolingual models on the standard GLUE benchmark (Conneau et al., 2020). Recent studies (Hu et al., 2020), however, indicate that state-of-the-art models such as XLM-R still perform poorly on cross-lingual transfer across many language pairs. The main reason behind such poor performance is the current lack of capacity in the massively multilingual model to represent all languages equally in the vocabulary and representation space (Bapna and Firat, 2019; Artetxe et al., 2020; Conneau et al., 2020).
Adapters Adapter modules were originally studied in computer vision tasks, where they were restricted to convolutions and used to adapt a model for multiple domains (Rebuffi et al., 2017, 2018). In NLP, adapters have been mainly used for parameter-efficient and quick fine-tuning of a base pretrained Transformer model to new tasks (Houlsby et al., 2019; Stickland and Murray, 2019) and new domains (Bapna and Firat, 2019), avoiding catastrophic forgetting (McCloskey and Cohen, 1989; Santoro et al., 2016). Bapna and Firat (2019) also use adapters to fine-tune and recover performance of a multilingual NMT model on high-resource languages, but their approach cannot be applied to languages that were not seen during pretraining. Artetxe et al. (2020) employ adapters to transfer a pretrained monolingual model to an unseen language but rely on learning new token-level embeddings, which do not scale to a large number of languages. Pfeiffer et al. (2020) combine the information stored in multiple adapters for more robust transfer learning between monolingual tasks.

Multilingual Model Adaptation for Cross-lingual Transfer

Standard Transfer Setup The standard way of performing cross-lingual transfer with a state-of-the-art large multilingual model such as multilingual BERT or XLM(-R) is 1) to fine-tune it on labelled data of a downstream task in a source language and then 2) apply it directly to perform inference in a target language (Hu et al., 2020). A downside of this setting is that the multilingual initialisation balances many languages. It is thus not suited to excel at a specific language at inference time. We propose a simple method to ameliorate this issue by allowing the model to additionally adapt to the particular target language.
Target Language Adaptation Similar to fine-tuning monolingual models on the task domain to improve their performance (Howard and Ruder, 2018), we propose to fine-tune a pretrained multilingual model with masked language modelling (MLM) on unlabelled data of the target language prior to task-specific fine-tuning in the source language. A disadvantage of this approach is that it no longer allows us to evaluate the same model on multiple target languages, as it biases the model towards a specific target language. However, this approach might be preferable if we only care about performance in a specific (i.e., fixed) target language.
We find that target language adaptation results in improved cross-lingual transfer performance over the standard setting ( §6). In other words, it does not result in catastrophic forgetting of the multilingual knowledge already available in the pretrained model that enables the model to transfer to other languages. In fact, experimenting with methods that explicitly try to prevent catastrophic forgetting (Wiese et al., 2017) led to worse performance in our experiments.
Nevertheless, the proposed simple adaptation method inherits the fundamental limitation of the pretrained multilingual model and the standard transfer setup: the model's limited capacity hinders effective adaptation to low-resource and unseen languages. In addition, fine-tuning the full model does not scale well to many tasks or languages.

Adapters for Cross-lingual Transfer
Our MAD-X framework addresses these deficiencies and can be used to effectively adapt an existing pretrained multilingual model to other languages. The framework comprises three types of adapters: language, task, and invertible adapters. As in previous work (Rebuffi et al., 2017; Houlsby et al., 2019), adapters are trained while keeping the parameters of the pretrained multilingual model fixed. Our framework thus enables learning language- and task-specific transformations in a modular and parameter-efficient way. We show the full framework as part of a standard Transformer model (Vaswani et al., 2017) in Figure 1 and describe the three adapter types in what follows.

Figure 1: The MAD-X framework inside a Transformer model. Input embeddings are fed into the invertible adapter, whose inverse is fed into the tied output embeddings. Language and task adapters are added to each Transformer layer. Language adapters and invertible adapters are trained via masked language modelling (MLM) while the pretrained multilingual model is kept frozen. Task-specific adapters are stacked on top of source language adapters when training on a downstream task such as NER (solid lines). During zero-shot cross-lingual transfer, source language adapters are replaced with target language adapters (dashed lines).

Language Adapters
For learning language-specific transformations, we employ a recent efficient adapter architecture proposed by Pfeiffer et al. (2020). Following Houlsby et al. (2019), they define the interior of the adapter to be a simple down- and up-projection combined with a residual connection. The language adapter LA_l at layer l consists of a down-projection D ∈ R^{h×d}, where h is the hidden size of the Transformer model and d is the dimension of the adapter, followed by a ReLU activation and an up-projection U ∈ R^{d×h} at every layer l:

  LA_l(h_l, r_l) = U_l(ReLU(D_l(h_l))) + r_l    (1)

where h_l and r_l are the Transformer hidden state and the residual at layer l, respectively. The residual connection r_l is the output of the Transformer's feed-forward layer, whereas h_l is the output of the subsequent layer normalisation (see Figure 1).
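The bottleneck computation in Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative toy version, not the authors' implementation; the weight matrices and their initialisation are placeholders:

```python
import numpy as np

def language_adapter(h, r, D, U):
    """Pfeiffer-style bottleneck adapter: down-project, ReLU, up-project,
    then add the residual r (the output of the Transformer's FFN).
    h: (seq_len, hidden) layer-norm output; r: (seq_len, hidden) residual.
    D: (hidden, d) down-projection; U: (d, hidden) up-projection."""
    z = np.maximum(h @ D, 0.0)   # down-projection followed by ReLU
    return z @ U + r             # up-projection plus residual connection

# Toy dimensions: XLM-R Base hidden size 768, language adapter dim 384.
rng = np.random.default_rng(0)
h = rng.standard_normal((5, 768))
r = rng.standard_normal((5, 768))
D = rng.standard_normal((768, 384)) * 0.02
U = rng.standard_normal((384, 768)) * 0.02
out = language_adapter(h, r, D, U)   # shape (5, 768), same as the input
```

Note that with near-zero adapter weights the module reduces to the identity on the residual, which is why adapters can be inserted into a frozen pretrained model without disrupting it.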
We train language adapters on unlabelled data of a language using masked language modelling, which encourages them to learn transformations that make the pretrained multilingual model more suitable for a specific language. During task-specific training with labelled data, we use the language adapter of the corresponding source language, which is kept fixed. In order to perform zero-shot transfer to another language, we simply replace the source language adapter with its target language component. For instance, as illustrated in Figure 1, we can simply replace a language-specific adapter trained for English with a language-specific adapter trained for Quechua at inference time. This, however, requires that the underlying multilingual model does not change during fine-tuning on the downstream task. In order to ensure this, we additionally introduce task adapters that capture task-specific knowledge.

Task Adapters
Task adapters TA_l at layer l have the same architecture as language adapters. They similarly consist of a down-projection D ∈ R^{h×d} and a ReLU activation, followed by an up-projection. They are stacked on top of the language adapters and thus receive the output of the language adapter LA_l as input, together with the residual r_l of the Transformer's feed-forward layer (initial experiments showed that this residual connection performs better than one to the output of the language adapter):

  TA_l(h_l, r_l) = U_l(ReLU(D_l(LA_l(h_l, r_l)))) + r_l    (2)

The output of the task adapter is then passed to another layer normalisation component. Task adapters are the only parameters that are updated when training on a downstream task (e.g., NER) and aim to capture knowledge that is task-specific but generalises across languages.
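The stacking in Eq. (2) can be made concrete with the same toy NumPy sketch as above (placeholder shapes and weights, not the authors' code): the task adapter consumes the language adapter's output, but both adapters share the residual r from the feed-forward layer.

```python
import numpy as np

def adapter(x, r, D, U):
    """Bottleneck adapter: up(ReLU(down(x))) + residual r."""
    return np.maximum(x @ D, 0.0) @ U + r

rng = np.random.default_rng(1)
h = rng.standard_normal((4, 768))   # layer-norm output
r = rng.standard_normal((4, 768))   # FFN residual, shared by both adapters
# Language adapter (dim 384) is frozen; task adapter (dim 48) is trained.
D_lang, U_lang = rng.standard_normal((768, 384)) * 0.02, rng.standard_normal((384, 768)) * 0.02
D_task, U_task = rng.standard_normal((768, 48)) * 0.02, rng.standard_normal((48, 768)) * 0.02

lang_out = adapter(h, r, D_lang, U_lang)          # LA_l(h_l, r_l), Eq. (1)
task_out = adapter(lang_out, r, D_task, U_task)   # TA_l stacked on LA_l, Eq. (2)
```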

Invertible Adapters
The majority of the "parameter budget" of pretrained multilingual models is spent on token embeddings of the shared multilingual vocabulary. Despite this, they have been shown to underperform on low-resource languages (Artetxe et al., 2020;Conneau et al., 2020), and are bound to fare even worse for languages not covered by the multilingual model's training data.
In order to mitigate this mismatch between multilingual and target language vocabulary, we propose invertible adapters. We stack these adapters on top of the embedding layer while their respective inverses precede the output embedding layer (see Figure 1). As input and output embeddings are tied in multilingual pretrained models, invertibility allows us to leverage the same set of parameters for adapting both input and output representations. This is crucial as the output adapters might otherwise overfit to the pretraining task and get discarded during task-specific fine-tuning.
To ensure this invertibility, we employ Non-linear Independent Components Estimation (NICE; Dinh et al., 2015). NICE enables the invertibility of arbitrary non-linear functions through a set of coupling operations (Dinh et al., 2015). For the invertible adapter, we split the input embedding vector e_i of the i-th token into two vectors of equal dimensionality e_{1,i}, e_{2,i} ∈ R^{h/2}. For two arbitrary non-linear functions F and G, the forward pass through our invertible adapter A_Inv() is:

  o_1 = F(e_2) + e_1
  o_2 = G(o_1) + e_2
  o = [o_1, o_2]    (3)

where o is the output of the invertible adapter A_Inv and [·, ·] indicates concatenation of two vectors.
Correspondingly, the inverted pass through the adapter, A_Inv^{-1}, is computed as follows:

  e_2 = o_2 − G(o_1)
  e_1 = o_1 − F(e_2)
  e = [e_1, e_2]    (4)

where e is the output of A_Inv^{-1}(). For the non-linear transformations F and G, we use similar down- and up-projections as for the language and task adapters:

  F(x) = U_F(ReLU(D_F(x)))
  G(x) = U_G(ReLU(D_G(x)))    (5)

where x is a placeholder for e_1, e_2, o_1 and o_2. We illustrate the complete architecture of the invertible adapter and its inverse in Figure 2.

The invertible adapter has a similar function to the language adapter, but aims to capture language-specific transformations on the token level. As such, it is trained together with the language adapters using MLM on unlabelled data of a specific language. In an analogous manner, during task-specific training we use the fixed invertible adapter of the source language and replace it with the invertible adapter of the target language during zero-shot transfer. Importantly, note that our invertible adapters are much more parameter-efficient compared to the approach of Artetxe et al. (2020), which learns separate token embeddings for every new language.
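A useful property of the coupling in Eqs. (3)-(4) is that inversion is exact no matter what F and G compute. The toy NumPy sketch below (illustrative shapes and random weights, not the authors' implementation) verifies this numerically:

```python
import numpy as np

def mlp(x, D, U):
    """Non-linear transform used for F and G: up(ReLU(down(x)))."""
    return np.maximum(x @ D, 0.0) @ U

def inv_adapter_fwd(e, params):
    e1, e2 = np.split(e, 2, axis=-1)     # split embedding into two halves
    o1 = mlp(e2, *params["F"]) + e1      # o_1 = F(e_2) + e_1
    o2 = mlp(o1, *params["G"]) + e2      # o_2 = G(o_1) + e_2
    return np.concatenate([o1, o2], axis=-1)

def inv_adapter_inv(o, params):
    o1, o2 = np.split(o, 2, axis=-1)
    e2 = o2 - mlp(o1, *params["G"])      # e_2 = o_2 - G(o_1)
    e1 = o1 - mlp(e2, *params["F"])      # e_1 = o_1 - F(e_2)
    return np.concatenate([e1, e2], axis=-1)

rng = np.random.default_rng(2)
h = 768                                   # hidden size; each half is h/2 = 384
params = {k: (rng.standard_normal((h // 2, 192)) * 0.1,
              rng.standard_normal((192, h // 2)) * 0.1) for k in ("F", "G")}
e = rng.standard_normal((3, h))           # toy token embeddings
o = inv_adapter_fwd(e, params)
e_rec = inv_adapter_inv(o, params)        # recovers e exactly
```

Because each coupling step only adds a function of the *other* half, it can always be subtracted back out, which is what lets the same parameters serve both input and (tied) output embeddings.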
An Illustrative Example To make the training process of MAD-X more apparent, we briefly walk through the example from Figure 1. Assuming English (En) as the source language and Quechua (Qu) as the target language, we first pretrain invertible adapters A_Inv^En and A_Inv^Qu, and language adapters A_Lang^En and A_Lang^Qu with masked language modelling. We then train a task adapter for the NER task, A_Task^NER, on the English NER training set. During training, embeddings are passed through A_Inv^En. At every layer of the model the data is first passed through the fixed A_Lang^En and then into the NER adapter A_Task^NER. The output of the last hidden layer is passed through (A_Inv^En)^{-1}. For zero-shot inference, the English invertible and language adapters A_Inv^En and A_Lang^En are simply replaced with their Quechua counterparts A_Inv^Qu and A_Lang^Qu, while the data is still passed through the NER task adapter A_Task^NER.
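The plug-and-play logic of this walkthrough can be expressed schematically in plain Python (component names are just labels mirroring the notation above; no actual computation happens here): swapping languages changes only the language-specific components, never the task adapter.

```python
def forward_path(lang, task, n_layers=12):
    """Return the sequence of MAD-X components the input passes through
    (schematic only; XLM-R Base has 12 Transformer layers)."""
    return ([f"A_Inv^{lang}"]                                # invertible adapter on input embeddings
            + [f"A_Lang^{lang}", f"A_Task^{task}"] * n_layers  # per-layer language + task adapter stack
            + [f"(A_Inv^{lang})^-1"])                        # inverse on the tied output embeddings

train_path = forward_path("En", "NER")   # task training: English adapters
infer_path = forward_path("Qu", "NER")   # zero-shot transfer: Quechua adapters swapped in
```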

Experiments
Data We conduct experiments on the named entity recognition (NER) task using the standard multilingual NER dataset WikiANN (Pan et al., 2017), which was partitioned into train, development, and test splits by Rahimi et al. (2019). For question answering, we evaluate on the XQuAD dataset (Artetxe et al., 2020), a cross-lingual extension of SQuAD (Rajpurkar et al., 2016).
Languages The partitioned version of the NER dataset covers 176 languages. In order to obtain a comprehensive outlook on the performance of MAD-X in comparison to state-of-the-art cross-lingual methods under different evaluation conditions, we select languages based on: a) variance in data availability (by selecting languages with a range of respective Wikipedia sizes); b) their presence in pretrained multilingual models; more precisely, whether data in the particular language was included in the pretraining data of both multilingual BERT and XLM-R or not; and c) typological diversity, to ensure that different language types and families are covered. In total, we can discern four categories in our language set: 1) high-resource languages and 2) low-resource languages covered by the pretrained SOTA multilingual models (i.e., by mBERT and XLM-R); as well as 3) low-resource languages and 4) truly low-resource languages not covered by the multilingual models. We select four languages from different language families for each category. We highlight characteristics of the 16 languages from 11 language families in Table 1.
We evaluate on all possible language pairs (i.e., on the Cartesian product), using each language as a source language with every other language (including itself) as a target language. This subsumes both the standard zero-shot cross-lingual transfer setting (Hu et al., 2020) as well as the standard monolingual in-language setting.
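With 16 languages, evaluating on the Cartesian product yields 16 × 16 = 256 source-target pairs. The enumeration is trivial to sketch (placeholder language codes, shown here on a 4-language subset):

```python
from itertools import product

langs = ["en", "ja", "zh", "ar"]  # placeholder subset; the paper uses 16 languages
pairs = list(product(langs, langs))              # every source with every target, incl. itself
zero_shot = [(s, t) for s, t in pairs if s != t]   # standard cross-lingual transfer setting
in_language = [(s, t) for s, t in pairs if s == t]  # monolingual in-language setting
```

For the full 16-language set this gives 240 zero-shot pairs plus 16 in-language settings.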
For QA, we evaluate on the 11 languages provided in XQuAD, with English as source language.

Baselines
The baseline models are based on different approaches to multilingual model adaptation for cross-lingual transfer, discussed previously in §3.
XLM-R The main model we compare against is XLM-R (Conneau et al., 2020), the current state-of-the-art pretrained model for cross-lingual transfer (Hu et al., 2020). It is a Transformer-based model pretrained for one hundred languages on large cleaned Common Crawl corpora (Wenzek et al., 2019). For efficiency purposes, we use the XLM-R Base configuration as the basis for all of our experiments. However, we note that the main driving idea behind the MAD-X framework is not tied to any particular pretrained model, and the framework can be easily adapted to other pretrained multilingual models (e.g., multilingual BERT). First, we compare against XLM-R in the standard setting where the entire model is fine-tuned on labelled data of the task in the source language.
XLM-R MLM-SRC; XLM-R MLM-TRG In §3, we have proposed target language adaptation as a simple method to adapt pretrained multilingual models for better cross-lingual generalisation on the downstream task while retaining their zero-shot ability. As a sanity check, we also compare against adapting to the source language data, which we would expect to improve in-language performance but not to help with cross-lingual transfer. In particular, we fine-tune XLM-R with MLM on unlabelled source language data (XLM-R MLM-SRC) and target language data (XLM-R MLM-TRG) prior to task-specific fine-tuning.

MAD-X: Experimental Setup
For the construction of the MAD-X framework we rely on the XLM-R Base architecture; we evaluate the full MAD-X, MAD-X without invertible adapters (-INV), and also MAD-X without language and invertible adapters (-LAD -INV). We use the Transformers library (Wolf et al., 2019) for all our experiments. For fine-tuning via MLM on unlabelled data, we train on the Wikipedia data of the corresponding language for 250,000 steps, with a batch size of 64 and a learning rate of 5e-5 and 1e-4 for XLM-R (also for the -SRC and -TRG variants) and adapters, respectively. We train models on NER data for 100 epochs with a batch size of 16 and 8 for high-resource and low-resource languages, respectively, and a learning rate of 5e-5 and 1e-4 for XLM-R and adapters, respectively. We choose the best checkpoint for evaluation based on validation performance. Following Pfeiffer et al. (2020), we learn language adapters, invertible adapters, and task adapters with dimensionalities of 384, 192 (384 for both directions), and 48, respectively. XLM-R Base has a hidden size of 768, so these correspond to reduction factors of 2, 2, and 16.
For NER, we conduct five runs of fine-tuning on the WikiANN training set of the source language, evaluate on all target languages (except for XLM-R MLM-TRG, for which we conduct one run per source language-target language combination, for efficiency purposes), and report mean F1 scores. For QA, we conduct three runs of fine-tuning on the English SQuAD training set, evaluate on all target languages, and report mean F1 and exact match (EM) scores.

Named Entity Recognition
As our main summary of results, we average the cross-lingual transfer results of each method for each target language across all 16 source languages on the NER dataset. We show the aggregated results in Table 2. Moreover, in the appendix we report the detailed results for all methods across each single language pair, as well as a comparison of methods on the most common setting with English as source language.
In general, we can observe that XLM-R performance is indeed lowest for unseen languages (the right half of the table after the vertical dashed line). XLM-R MLM-SRC performs worse than XLM-R, which indicates that fine-tuning on the source language is not useful for cross-lingual transfer in general. However, there are some individual examples (e.g., JA, TK) where it does yield slight gains over the standard XLM-R transfer. On the other hand, XLM-R MLM-TRG is a stronger transfer method than XLM-R on average, yielding gains in 9/16 target languages. However, the gains with XLM-R MLM-TRG seem to vanish for low-resource languages. Further, there is another disadvantage, outlined in §3: XLM-R MLM-TRG requires fine-tuning the full large pretrained model separately for each target language in consideration, which can be prohibitively expensive.

Table 2: NER F1 scores averaged over all 16 source languages when transferring to each target language (i.e., the columns refer to target languages). The vertical dashed line distinguishes between languages seen in multilingual pretraining and the unseen ones (see also Table 1).

MAD-X without language and invertible adapters performs on par with XLM-R for almost all languages present in the pretraining data (left half of the table). This mirrors findings in the monolingual setting, where task adapters have been observed to achieve performance similar to regular fine-tuning while being more parameter-efficient (Houlsby et al., 2019). However, looking at unseen languages, the performance of MAD-X that only uses task adapters deteriorates significantly compared to XLM-R. This shows that task adapters alone are not expressive enough to bridge the discrepancy when adapting to an unseen language.
Adding language adapters to MAD-X improves its performance across the board, and their usefulness is especially pronounced for low-resource languages. Language adapters help capture the characteristics of the target language and consequently provide boosts for unseen languages. Even for high-resource languages, the addition of language-specific parameters substantially improves performance. Finally, invertible adapters provide another improvement in performance and generally outperform only using task and language adapters: for instance, we observe gains with MAD-X over MAD-X -INV on 13/16 target languages. Overall, the full MAD-X framework improves upon XLM-R by more than 5 F1 points on average.
To obtain a more fine-grained impression of the performance of MAD-X in different languages, we show its relative performance against XLM-R in the standard setting in Figure 3. We observe the largest differences in performance when transferring from high-resource to low-resource and unseen languages (top-right quadrant of Figure 3), which is arguably the most natural setup for cross-lingual transfer. In particular, we observe strong gains when transferring from Arabic, whose script might not be well represented in XLM-R's vocabulary. We also detect strong performance in the in-language monolingual setting (see the diagonal) for the subset of low-resource languages. This observation indicates that MAD-X may help bridge the perceived weakness of multilingual versus monolingual models. Finally, MAD-X performs competitively even when the target language is high-resource. We also plot relative performance of the full MAD-X method (with all three adapter types) versus XLM-R MLM-TRG across all language pairs in Figure 4. The scores lead to similar conclusions as before: the largest benefits of MAD-X are observed for the set of low-resource target languages (i.e., the right half of the heatmap in Figure 3). The scores also confirm that the proposed XLM-R MLM-TRG transfer baseline is more competitive than the standard XLM-R transfer across a substantial number of language pairs.

Further Analysis
Impact of Invertible Adapters To better understand the impact of invertible adapters, we show the relative performance difference of MAD-X with and without invertible adapters for each source language-target language pair on the NER dataset in Figure 5. Invertible adapters improve performance for many transfer pairs, and particularly when transferring to low-resource languages. Performance is only consistently lower with a single low-resource language as source (i.e., Maori), likely due to variation in the data.

Sample Efficiency
The main bottleneck of MAD-X when adapting to a new language is training language adapters and invertible adapters. However, due to the modularity of MAD-X, once trained, these adapters have the advantage of being directly reusable (i.e., "plug-and-playable") across different tasks (see also the discussion in §6.2). To estimate the sample efficiency of adapter training, we measure NER performance on several low-resource target languages (when transferring from English as the source) conditioned on the number of training iterations. The results are provided in Figure 6. They reveal that we can achieve strong performance for the low-resource languages already at 20k training iterations, and longer training offers only modest increases in performance.
Moreover, in Table 4 we present the number of parameters added to the original XLM-R Base model per language for each MAD-X variant. The full MAD-X model for NER receives an additional set of 8.25M adapter parameters for every language, which makes up only 3.05% of the original model.
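A back-of-the-envelope count of the projection weights (ignoring biases and any layer norm parameters, so only an approximation of the paper's exact bookkeeping) illustrates where these extra parameters go:

```python
h, n_layers = 768, 12            # XLM-R Base hidden size and layer count
d_lang, d_inv, d_task = 384, 192, 48   # adapter dimensionalities from the setup above

lang = n_layers * 2 * h * d_lang       # language adapter: down + up projection per layer
inv = 2 * 2 * (h // 2) * d_inv         # invertible adapter: F and G, each down + up on h/2
task = n_layers * 2 * h * d_task       # task adapter: down + up projection per layer

total = lang + inv + task              # 8,257,536 weights, roughly 8.26M
share = total / 270_000_000            # relative to XLM-R Base's ~270M parameters (~3%)
```

This rough count lands close to the 8.25M adapter parameters and ~3.05% share reported in Table 4, though the exact figures depend on which components (biases, layer norms) are included.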

Conclusion and Future Work
We have proposed MAD-X, a general modular framework for transfer across tasks and languages. It leverages a small number of additional parameters to mitigate the capacity issue which fundamentally hinders current multilingual models. MAD-X is model-agnostic and can be adapted to any current pretrained multilingual model as its foundation.
We have shown that it is particularly useful for adapting to languages not covered by the multilingual model's training data, while also achieving competitive performance on high-resource languages. We have additionally proposed a simple target language adaptation method for improved cross-lingual transfer, which may serve as a strong baseline if the target language is fixed.
In future work, we will apply MAD-X to other pretrained models, employ adapters that are particularly suited for languages with certain properties (e.g. with different scripts), evaluate on additional tasks, and investigate leveraging pretrained language adapters from related languages for improved transfer to truly low-resource languages with limited monolingual data.

A NER zero-shot results from English
We show the F1 scores when transferring from English to the other languages averaged over five runs in Table 5.

B NER results per language pair
We show the F1 scores on the NER dataset across all combinations of source and target language for all of our comparison methods in Figures