On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment

Modern multilingual models are trained on concatenated text from multiple languages in hopes of conferring benefits to each (positive transfer), with the most pronounced benefits accruing to low-resource languages. However, recent work has shown that this approach can degrade performance on high-resource languages, a phenomenon known as negative interference. In this paper, we present the first systematic study of negative interference. We show that, contrary to previous belief, negative interference also impacts low-resource languages. While parameters are maximally shared to learn language-universal structures, we demonstrate that language-specific parameters do exist in multilingual models and they are a potential cause of negative interference. Motivated by these observations, we also present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference, by adding language-specific layers as meta-parameters and training them in a manner that explicitly improves shared layers' generalization on all languages. Overall, our results show that negative interference is more common than previously known, suggesting new directions for improving multilingual representations.


Introduction
Advances in pretraining language models (Devlin et al., 2018; Yang et al., 2019) as general-purpose representations have pushed the state of the art on a variety of natural language tasks. However, not all languages enjoy large public datasets for pretraining and/or downstream tasks. Multilingual language models such as mBERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019) have proven effective for cross-lingual transfer learning by pretraining a single shared Transformer model (Vaswani et al., 2017) jointly on multiple languages. The goals of multilingual modeling are not limited to improving language modeling in low-resource languages (Lample and Conneau, 2019), but also include zero-shot cross-lingual transfer on downstream tasks: it has been shown that multilingual models can generalize to target languages even when labeled training data is only available in the source language (typically English) on a wide range of tasks (Pires et al., 2019; Wu and Dredze, 2019; Hu et al., 2020).
However, multilingual models are not equally beneficial for all languages. Prior work demonstrated that including more languages in a single model can improve performance for low-resource languages but hurt performance for high-resource languages. Similarly, recent work (Johnson et al., 2017; Tan et al., 2019; Aharoni et al., 2019) in multilingual neural machine translation (NMT) also observed performance degradation on high-resource language pairs. In multi-task learning (Ruder, 2017), this phenomenon is known as negative interference or negative transfer (Wang et al., 2019), where training multiple tasks jointly hinders the performance on individual tasks.
Despite these empirical observations, little prior work has analyzed or shown how to mitigate negative interference in multilingual language models. In particular, it is natural to ask: (1) Can negative interference also occur for low-resource languages? (2) What factors play an important role in causing it? (3) Can we mitigate negative interference to improve the model's cross-lingual transferability?
In this paper, we take a step towards addressing these questions. We pretrain a set of monolingual and bilingual models and evaluate them on a range of downstream tasks to analyze negative interference. We seek to individually characterize the underlying factors of negative interference through a set of ablation studies and glean insights on its causes. Specifically, we examine whether training corpus size and language similarity affect negative interference, and also measure gradient and parameter similarities between languages.
Our results show that negative interference can occur in both high-resource and low-resource languages. In particular, we observe that neither subsampling the training corpus nor adding typologically similar languages substantially impacts negative interference. On the other hand, we show that gradient conflicts and language-specific parameters do exist in multilingual models, suggesting that languages are fighting for model capacity, which potentially causes negative interference. We further test whether explicitly assigning language-specific modules to each language can alleviate negative interference, and find that the resulting model performs better within each individual language but worse on zero-shot cross-lingual tasks.
Motivated by these observations, we further propose to meta-learn these language-specific parameters to explicitly improve generalization of shared parameters on all languages. Empirically, our method improves not only within-language performance on monolingual tasks but also cross-lingual transferability on zero-shot transfer benchmarks. To the best of our knowledge, this is the first work to systematically study and remedy negative interference in multilingual language models.

Motivation
Multilingual transfer learning aims at utilizing knowledge transfer across languages to boost performance on low-resource languages. State-of-the-art multilingual language models are trained on multiple languages jointly to enable cross-lingual transfer through parameter sharing. However, languages are heterogeneous, with different vocabularies, morphosyntactic rules, and pragmatics across cultures. It is therefore natural to ask: is knowledge transfer beneficial for all languages in a multilingual model? To analyze the effect of knowledge transfer from other languages on a specific language lg, we can compare multilingual models with the monolingual model trained on lg. For example, in Figure 1, we compare the performance on a named entity recognition (NER) task of monolingually trained models vs. bilingual models (trained on lg and English) vs. state-of-the-art XLM. We can see that monolingual models outperform multilingual models on four out of six languages (see §3.3 for details). This shows that language conflicts may induce negative impacts on certain languages, which we refer to as negative interference. Here, we investigate the causes of negative interference (§3.3) and methods to overcome it (§4).
3 Investigating the Sources of Negative Interference in Multilingual Models

Methodology
To study negative interference, we compare multilingual models with monolingual baselines. Without loss of generality, we focus on analyzing bilingual models to minimize confounding factors. For two languages $lg_1$ and $lg_2$, we pretrain a single bilingual model and two monolingual models. We then assess their performance on downstream tasks using two different settings. To examine negative interference, we evaluate both monolingual and multilingual models using the within-language monolingual setting, such that the pretrained model is finetuned and tested on the same language. For instance, if the monolingual model of $lg_1$ outperforms the bilingual model on $lg_1$, we know that $lg_2$ induces a negative impact on $lg_1$ in the bilingual model. In addition, since multilingual models are trained to enable cross-lingual transfer, we also report their performance in the zero-shot cross-lingual transfer setting, where the model is only finetuned on the source language, say $lg_1$, and tested on the target language $lg_2$. We hypothesize that the following factors play important roles in causing negative interference and study each individually.

Training Corpus Size
While prior work mostly reports negative interference for high-resource languages, we hypothesize that it can also occur for languages with fewer resources. We study the impact of training data size per language on negative interference. We subsample a high-resource language, say $lg_1$, to create a "low-resource version". We then retrain the monolingual and bilingual models and compare the results with those of their high-resource counterparts. In particular, we test whether reducing the training size of $lg_1$ also reduces negative interference on $lg_2$.

Language Similarity
Language similarity has been shown to be important for effective transfer in multilingual models. Prior work shows that bilingual models trained on more similar language pairs achieve better zero-shot transfer performance. We thus expect it to play a critical role in negative interference as well. For a specific language $lg_1$, we pair it with languages that are closely and distantly related. We then compare these bilingual models' performance on $lg_1$ to investigate whether more similar languages cause less severe interference. In addition, we add a third language $lg_3$ that is similar to $lg_1$ and train a trilingual model on $lg_1$-$lg_2$-$lg_3$. We compare the trilingual model with the bilingual model to examine whether adding $lg_3$ can mitigate negative interference on $lg_1$.

Gradient Conflict
Recent work (Yu et al., 2020) shows that gradient conflict between dissimilar tasks, defined as a negative cosine similarity between gradients, is predictive of negative interference in multi-task learning. Therefore, we study whether gradient conflicts exist between languages in multilingual models. In particular, we sample one batch for each language in the model and compute the cosine similarity of the corresponding gradients every 10 steps during pretraining.
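The conflict test described above can be sketched in a few lines. This is a minimal illustration, assuming gradients have already been flattened into plain lists of floats; the function names and the toy gradient values are hypothetical and not taken from the paper's implementation.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for a zero gradient (an assumption)
    return dot / (norm1 * norm2)


def has_conflict(g1, g2):
    """Two gradients conflict when their cosine similarity is negative."""
    return cosine_similarity(g1, g2) < 0.0


# Made-up gradients for one English batch and one Arabic batch.
grad_en = [1.0, 2.0, 0.0]
grad_ar = [-2.0, 0.5, 3.0]
conflict = has_conflict(grad_en, grad_ar)
```

In a real model the two vectors would be the concatenated parameter gradients of the shared network, computed on one batch per language.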

Parameter Sharing
State-of-the-art multilingual models aim to share as many parameters as possible in the hope of learning a language-universal model for all languages. While prior studies measure the latent embedding similarity between languages, we instead examine model parameters directly. The idea is to test whether model parameters are language-universal or language-specific. To achieve this, we prune multilingual models for each language using relaxed $L_0$ norm regularization (Louizos et al., 2017), and compare parameter similarities between languages. Formally, for a model $f(\cdot; \theta)$ parameterized by $\theta = \{\theta_i\}_{i=1}^{n}$, where each $\theta_i$ represents an individual parameter or a group of parameters, the method introduces a set of binary masks $z$, drawn from some distribution $q(z \mid \pi)$ parametrized by $\pi$, and learns a sparse model $f(\cdot; \theta \odot z)$ by optimizing:

$$\min_{\theta, \pi} \; \mathbb{E}_{q(z \mid \pi)} \left[ \mathcal{L}\big(f(\cdot; \theta \odot z)\big) + \lambda \lVert \theta \odot z \rVert_0 \right] \quad (1)$$

where $\odot$ is the Hadamard (elementwise) product, $\mathcal{L}(\cdot)$ is some task loss and $\lambda$ is a hyper-parameter. We follow Louizos et al. (2017) and use the Hard Concrete distribution for the binary mask $z$, such that the above objective is fully differentiable. Then, for each bilingual model, we freeze its pretrained parameter weights and learn binary masks $z$ for each language independently. As a result, we obtain two independent sets of mask parameters $\pi$, which can be used to determine parameter importance. Intuitively, a parameter group is language-universal if both languages consider it important (positive $\pi$ values). On the other hand, if one language assigns a positive value while the other assigns a negative one, the parameter group is language-specific. We compare them across languages and layers to analyze parameter similarity in multilingual models.
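To make the masking procedure more concrete, the following is a minimal sketch of a Hard Concrete gate and of the sign-based comparison of mask parameters between two languages. It assumes that $\pi$ plays the role of the log-location parameter of the Hard Concrete distribution, and it uses the stretch limits and temperature suggested by Louizos et al. (2017). The function names, the "pruned" label for groups that both languages score negatively, and the example values are illustrative assumptions.

```python
import math
import random

# Stretch limits and temperature suggested by Louizos et al. (2017).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hard_concrete(log_alpha, rng):
    """Draw one gate z in [0, 1] for a parameter group with location log_alpha."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    s_stretched = s * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, s_stretched))  # hard clipping yields exact 0s and 1s


def prob_nonzero(log_alpha):
    """P(z != 0), the differentiable surrogate for one term of the L0 penalty."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def classify_groups(pi_a, pi_b):
    """Label each parameter group by the signs of its mask parameters."""
    labels = []
    for a, b in zip(pi_a, pi_b):
        if a > 0 and b > 0:
            labels.append("universal")   # important for both languages
        elif (a > 0) != (b > 0):
            labels.append("specific")    # important for exactly one language
        else:
            labels.append("pruned")      # neither language keeps it (our label)
    return labels


# Made-up mask parameters for three parameter groups and two languages.
labels = classify_groups([2.0, 1.5, -3.0], [1.0, -2.0, -1.0])
```

Groups with a large positive mask parameter are almost always kept (gate equal to 1), while groups with a large negative value are almost always dropped.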

Experimental Setup
We focus on standard multilingual masked language modeling (MLM) used in mBERT and XLM. We first pretrain models and then evaluate their performance on four NLP benchmarks.
For pretraining, we mainly follow the setup and implementation of XLM (Lample and Conneau, 2019). We focus on monolingual and bilingual models for a more controllable comparison, which we refer to as Mono and JointPair respectively. In particular, we always include English (En) in bilingual models to compare on zero-shot transfer settings with prior work. Besides, we consider three high-resource languages {Arabic (Ar), French (Fr), Russian (Ru)} and three low-resource languages {Hindi (Hi), Swahili (Sw), Telugu (Te)} (see Table 1 for their statistics). We choose these six languages based on their data availability in downstream tasks. We use Wikipedia as training data. For each model, we use BPE (Sennrich et al., 2016) to learn a 32k subword vocabulary shared between languages. For multilingual models, we sample languages proportionally to $P_i = \left( \frac{L_i}{\sum_j L_j} \right)^{1/T}$, where $L_i$ is the size of the training corpus for the i-th language and $T$ is the temperature. Each model is a standard Transformer (Vaswani et al., 2017) with 8 layers, 12 heads, an embedding size of 512 and a hidden dimension of 2048 for the feedforward layer. Note that we deliberately consider a smaller model capacity, to be comparable with existing models that have larger capacity but also cover many more (over 100) languages. We use the Adam optimizer (Kingma and Ba, 2014) and the same learning rate schedule as Lample and Conneau (2019). We train each model on 4 NVIDIA V100 GPUs with 32GB of memory. Using mixed precision, we fit a batch of 128 on each GPU, for a total batch size of 512. Each epoch contains 10k steps and we train for 50 epochs.

Table 2: NER and POS results. We observe negative interference when monolingual models outperform multilingual models. Besides, adding language-specific layers (e.g. ffn) mitigates interference but sacrifices transferability.
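The temperature-based language sampling can be illustrated with a short sketch. The corpus sizes below are made-up numbers, not the statistics of Table 1, and `sampling_probs` is a hypothetical helper that normalizes the $P_i$ values into a distribution.

```python
def sampling_probs(corpus_sizes, temperature):
    """Normalize P_i = (L_i / sum_j L_j) ** (1 / T) into sampling probabilities."""
    total = sum(corpus_sizes.values())
    weights = {
        lang: (size / total) ** (1.0 / temperature)
        for lang, size in corpus_sizes.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical corpus sizes (number of sentences) for a bilingual model.
sizes = {"en": 900, "sw": 100}
probs_t1 = sampling_probs(sizes, temperature=1.0)  # proportional to corpus size
probs_t2 = sampling_probs(sizes, temperature=2.0)  # flatter, favors the smaller corpus
```

With these toy sizes, raising the temperature from 1 to 2 increases the sampling probability of the smaller language from 0.1 to 0.25, which shows how a larger $T$ upsamples low-resource languages.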

NER
We use the WikiAnn (Pan et al., 2017) dataset, which is a sequence labelling task built automatically from Wikipedia. A linear layer with softmax classifier is added on top of pretrained models to predict the label for each word based on its first subword. We report the F1 score.

POS
Similar to NER, POS is also a sequence labelling task, but with a focus on syntactic knowledge. In particular, we use the Universal Dependencies treebanks (Nivre et al., 2018). Task-specific layers are the same and we report F1, as in NER.

QA
We use the TyDiQA-GoldP dataset (Clark et al., 2020), which covers typologically diverse languages. Similar to popular QA datasets such as SQuAD (Rajpurkar et al., 2018), this is a span prediction task where task-specific linear classifiers are used to predict the start/end positions of the answer. Standard metrics of F1 and Exact Match (EM) are reported.

NLI
XNLI (Conneau et al., 2018) is probably the most popular cross-lingual benchmark. Note that the original dataset only contains training data for English. Consequently, we only evaluate this task in the zero-shot transfer setting, while we consider both settings for the remaining tasks.
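For reference, the F1 and Exact Match metrics reported for QA can be sketched as follows. This is a simplified version that assumes lowercasing and whitespace tokenization only; official SQuAD-style evaluation scripts apply additional answer normalization, and the example strings are made up.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the predicted span equals the gold span after light normalization."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Token-level F1 between a predicted answer span and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical prediction that contains the gold answer plus extra tokens.
f1 = token_f1("in Nairobi Kenya", "Nairobi")
em = exact_match("in Nairobi Kenya", "Nairobi")
```

A prediction that overshoots the gold span is penalized by EM entirely, but only partially by F1 through the lower precision.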

Results and Analysis
In Tables 2 and 3, we report our results on NER, POS and QA together with XLM-100, which is trained on 100 languages and contains 827M parameters. In particular, we observe that monolingual models outperform bilingual models for all languages except Swahili on all three tasks. In fact, monolingual models even perform better than XLM on four out of six languages, including Hindi and Telugu, even though XLM is a much larger model trained with far more resources. This shows that negative interference can occur on low-resource languages as well. While the negative impact is expected to be more prominent on high-resource languages, we demonstrate that it may occur for languages with fewer resources than commonly believed. The existence of negative interference confirms that state-of-the-art multilingual models cannot generalize equally well on all languages, and there is still a gap compared to monolingual models on certain languages. We next turn to dissect negative interference by studying the four factors described in Section 3.1.

Training Corpus Size
By comparing the validation perplexity on Swahili and Telugu in Figure 2, we find that while both monolingual models outperform bilingual models in the first few epochs, the Swahili model's perplexity starts to increase and is eventually surpassed by the bilingual model in later epochs. This matches the intuition that monolingual models may overfit when the training data size is small. To verify this, we subsample French and Russian to 100k sentences to create a "low-resource version" of them (denoted as $fr_l$/$ru_l$). As shown in Table 5, while the performance of both models drops compared to their "high-resource" counterparts, bilingual models indeed outperform monolingual models for $fr_l$/$ru_l$, in contrast to fr/ru. This suggests that multilingual models can stimulate positive transfer for low-resource languages when monolingual models overfit. On the other hand, when we compare bilingual models on English, models trained using different sizes of fr/ru data obtain similar performance, indicating that the training size of the source language has little impact on negative interference on the target language (English in this case). While more training data usually implies a larger vocabulary and more diverse linguistic phenomena, negative interference seems to arise from more fundamental conflicts that are present even in a small training corpus.

Table 4: Comparing trilingual models with bilingual models. This shows the effect of adding a third similar language to bilingual models.

Language Similarity
As illustrated by Table 5, the in-language performance on English drops as the paired language becomes more distantly related (French vs. Russian). This verifies that transferring from more distant languages results in more severe negative interference. It is therefore natural to ask whether adding more similar languages can mitigate negative interference, especially for low-resource languages.
We then train two trilingual models, adding Marathi to English-Hindi, and Kannada to English-Telugu. Compared to their bilingual counterparts (Table 4), trilingual models obtain similar within-language performance, which indicates that adding similar languages cannot mitigate negative interference. However, they do improve zero-shot cross-lingual performance. One possible explanation is that even similar languages can fight for language-specific capacity, but they may nevertheless benefit the generalization of the shared knowledge.

Figure 3: Gradient cosine similarity. "En-En" refers to gradients of two English batches within the Ar-En model, while "Ar-En" and "Fr-En" refer to gradients of two batches, one from each language, within Ar-En and Fr-En models respectively.

Gradient Conflict
In Figure 3, we plot the gradient cosine similarity between Arabic-English and French-English in their corresponding bilingual models over the first 25 epochs. We also plot the similarity within English, measured using two independently sampled batches (see footnote 2). Gradients between two different languages are indeed less similar than those within the same language. The gap is more evident in the first few epochs, where we observe negative gradient similarities for Ar-En and Fr-En while those for En-En are positive. In addition, gradients in Ar-En are less similar than those in Fr-En, indicating that a distant language pair can cause more severe gradient conflicts. These results confirm that gradient conflict exists in multilingual models and is correlated with per-language performance, suggesting that it may introduce optimization challenges that result in negative interference.

Footnote 2: Note that we use gradient accumulation to sample an effectively larger batch of 4096 sentences to calculate the gradient similarity.

Parameter Sharing
The existence of gradient conflicts may imply that languages are fighting for capacity. Thus, we next study how language-universal these multilingual parameters are. Figure 4a shows the cosine similarity of mask parameters $\pi$ across different layers. We observe that within-language similarity (En-En) is near perfect, which validates the pruning method's robustness. The trend shows that model parameters are better shared in the bottom layers than the upper ones. Besides, it also demonstrates that parameters in multi-head attention layers obtain higher similarities than those in feedforward layers, suggesting that the attention mechanism might be more language-universal. We additionally inspect $\pi$ parameters with the highest absolute values and plot those values for Ar (Figure 4b), together with their En counterparts. A more negative value indicates that the parameter is more likely to be pruned for that language and vice versa. Interestingly, while many parameters with positive values (on the right) are language-universal as both languages assign very positive values, parameters with negative values (on the left) are mostly language-specific for Ar as En assigns positive values. We observe similar patterns for other languages as well. These results demonstrate that language-specific parameters do exist in multilingual models.
Having language-specific capacity in shared parameters is sub-optimal. It is less transferable and thus can hinder cross-lingual performance. Moreover, it may also take over the capacity budget of other languages and degrade their within-language performance, i.e., cause negative interference. A natural next question is whether explicitly adding language-specific capacity into multilingual models can alleviate negative interference. We thus train variants of bilingual models that contain language-specific components for each language. In particular, we consider adding language-specific feedforward layers, attention layers, and residual adapter layers (Rebuffi et al., 2017; Houlsby et al., 2019), denoted as ffn, attn and adpt respectively. For each type of component, we create two separate copies in each Transformer layer, one designated for each language, while the rest of the network remains unchanged. As shown in Tables 2 and 3, adding language-specific capacity does mitigate negative interference and improve monolingual performance. We also find that language-specific feedforward layers obtain larger performance gains than attention layers, consistent with our prior analysis. However, these gains come at the cost of cross-lingual transferability: zero-shot performance drops substantially. Our results suggest a tension between addressing interference and improving transferability. In the next section, we investigate how to address negative interference in a manner that can improve performance on both within-language tasks and cross-lingual benchmarks.

Proposed Method
In the previous section, we demonstrated that while explicitly adding language-specific components can alleviate negative interference, it can also hinder cross-lingual transferability. We notice that a critical shortcoming of language-specific capacity is that it is agnostic of all other languages, since by design it is trained on the designated language only. It is thus more likely to overfit and can induce optimization challenges for shared capacity as well. Inspired by recent work in meta learning (Flennerhag et al., 2019) that utilizes meta parameters to improve the gradient geometry of the base network, we propose a novel meta-learning formulation of multilingual models that exploits language-specific parameters to improve generalization of shared parameters.
For a model with some predefined language-specific parameters $\phi = \{\phi_i\}_{i=1}^{L}$, where $\phi_i$ is designated for the i-th language, and shared parameters $\theta$, our solution is to treat $\phi$ as meta parameters and $\theta$ as base parameters. Ideally, we want $\phi$ to store non-transferable language-specific knowledge to resolve conflicts and improve generalization of $\theta$ in all languages (i.e., mitigate negative interference and improve cross-lingual transferability). Therefore, we train $\phi$ based on the following principle: if $\theta$ follows the gradients on training data for a given $\phi$, the resulting $\theta$ should obtain good validation performance on all languages. This implies a bilevel optimization problem (Colson et al., 2007), formally written as:

$$\min_{\phi} \; \frac{1}{L} \sum_{i=1}^{L} \mathcal{L}^{i}_{val}(\theta^{*}, \phi_i) \quad \text{s.t.} \quad \theta^{*} = \arg\min_{\theta} \frac{1}{L} \sum_{i=1}^{L} \mathcal{L}^{i}_{train}(\theta, \phi_i) \quad (2)$$

where $\mathcal{L}^{i}_{val}$ and $\mathcal{L}^{i}_{train}$ denote the validation and the training MLM loss for the i-th language. Since directly solving this problem can be prohibitive due to the expensive inner optimization, we approximate $\theta^{*}$ by adapting the current $\theta^{(t)}$ using a single gradient step, similar to techniques used in prior meta-learning methods (Finn et al., 2017). This results in a two-phase iterative training process shown in Algorithm 1 (see Appendix B).
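To be specific, at each training step $t$ on the i-th language during pretraining, we first adapt a gradient step on $\theta$ to obtain a new $\theta'$ and update $\phi_i$ based on $\theta'$'s validation MLM loss:

$$\theta' = \theta^{(t)} - \beta \nabla_{\theta^{(t)}} \mathcal{L}^{i}_{train}(\theta^{(t)}, \phi_i^{(t)}), \qquad \phi_i^{(t+1)} = \phi_i^{(t)} - \alpha \nabla_{\phi_i^{(t)}} \frac{1}{L} \sum_{j=1}^{L} \mathcal{L}^{j}_{val}(\theta', \phi_j^{(t)}) \quad (3)$$

where $\alpha$ and $\beta$ are learning rates. Notice that $\theta'$ is a function of $\phi_i^{(t)}$ and thus this optimization requires computing the gradient of a gradient. In particular, by applying the chain rule to the gradient with respect to $\phi_i^{(t)}$, we can observe that it contains a higher-order term:

$$- \beta \, \nabla^{2}_{\phi_i^{(t)}, \theta^{(t)}} \mathcal{L}^{i}_{train}(\theta^{(t)}, \phi_i^{(t)}) \; \nabla_{\theta'} \frac{1}{L} \sum_{j=1}^{L} \mathcal{L}^{j}_{val}(\theta', \phi_j^{(t)}) \quad (4)$$

This is important, since it shows that $\phi_i$ can obtain information from other languages through higher-order gradients. In other words, language-specific parameters are no longer agnostic of other languages, without violating the language-specific requirement. This is because, in Eq. 3, while $\nabla_{\theta^{(t)}}$ is based on the i-th language only, the validation loss is computed for all languages. Finally, in the second phase, we update $\theta$ based on the new $\phi^{(t+1)}$:

$$\theta^{(t+1)} = \theta^{(t)} - \beta \nabla_{\theta^{(t)}} \mathcal{L}^{i}_{train}(\theta^{(t)}, \phi_i^{(t+1)}) \quad (5)$$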
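The two-phase update can be sketched on a toy problem where all gradients are available in closed form. The example below is only an illustration under strong assumptions: a single scalar shared parameter `theta`, one scalar language-specific parameter per language, a made-up quadratic loss `0.5 * (theta + phi_i - target_i) ** 2`, and identical training and validation targets. None of the names or values come from the paper's implementation.

```python
def train_grad(theta, phi_i, target_i):
    """Gradient of the toy training loss 0.5 * (theta + phi_i - target_i) ** 2.

    For this loss the gradient is the same with respect to theta and phi_i.
    """
    return theta + phi_i - target_i


def val_loss(theta, phi, targets):
    """Toy validation loss averaged over all languages."""
    return sum(0.5 * (theta + p - t) ** 2 for p, t in zip(phi, targets)) / len(phi)


def meta_grad(theta, phi, i, train_targets, val_targets, beta):
    """Gradient of the averaged validation loss at theta' with respect to phi[i]."""
    num_langs = len(phi)
    # One-step lookahead of the shared parameter on language i.
    theta_prime = theta - beta * train_grad(theta, phi[i], train_targets[i])
    # Direct dependence of language i's validation loss on phi[i].
    direct = (theta_prime + phi[i] - val_targets[i]) / num_langs
    # Gradient of the averaged validation loss with respect to theta'.
    dval_dtheta = sum(theta_prime + p - t for p, t in zip(phi, val_targets)) / num_langs
    # Higher-order term: d(theta')/d(phi_i) = -beta * (mixed second derivative),
    # and the mixed second derivative equals 1 for this toy loss.
    higher_order = -beta * 1.0 * dval_dtheta
    return direct + higher_order


def meta_step(theta, phi, i, train_targets, val_targets, alpha, beta):
    """One training step on language i: update phi[i] first, then theta."""
    new_phi = list(phi)
    # Phase 1: update the language-specific (meta) parameter.
    new_phi[i] = phi[i] - alpha * meta_grad(theta, phi, i, train_targets, val_targets, beta)
    # Phase 2: update the shared parameter using the new phi[i].
    new_theta = theta - beta * train_grad(theta, new_phi[i], train_targets[i])
    return new_theta, new_phi


# Two toy "languages" that pull the shared parameter in opposite directions.
targets = [1.0, -1.0]
theta, phi = 0.0, [0.0, 0.0]
initial_loss = val_loss(theta, phi, targets)
for step in range(400):
    theta, phi = meta_step(theta, phi, step % 2, targets, targets, alpha=0.1, beta=0.1)
final_loss = val_loss(theta, phi, targets)
```

Because the validation loss sums over both languages, the `higher_order` term in `meta_grad` lets the update of `phi[i]` depend on the other language's residual, which is the mechanism the text describes. In a real model these gradients would come from automatic differentiation rather than closed-form expressions.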

Evaluation
While our method is generic, we evaluate it on bilingual models with adapter networks. Adapters have been effectively utilized in multilingual models, and we choose them for the practical consideration of limiting per-language capacity. Unlike prior work that finetunes adapters for adaptation, we train them jointly with shared parameters during pretraining. We follow Houlsby et al. (2019) and insert language-specific adapters after attention and feedforward layers. We leave a more thorough investigation of how to better pick language-specific structures for future work. For downstream task evaluation, we finetune all layers. Notice that computing the gradient of a gradient in Eq. 3 doubles the memory requirement. In practice, we utilize the finite difference approximation (Appendix B).
By evaluating their performance in the zero-shot transfer settings (Tables 2, 3 and 6), we observe that our method, denoted as meta adpt, consistently improves performance over JointPair baselines, while ordinary adapters (adpt) perform worse than JointPair. This shows that the proposed method can effectively utilize the added language-specific adapters to improve generalization of shared parameters across languages. At the same time, our method also mitigates negative interference and outperforms JointPair on within-language performance, closing the gap with monolingual models. In particular, it performs better than ordinary adapters in both settings. We hypothesize that this is because it alleviates language conflicts during training and thus converges more robustly. For example, we plot the training loss in the early stage in Figure 4c, which shows that ordinary adapters converge more slowly than JointPair due to overfitting of language-specific adapters, while meta adapters converge much faster. For ablation studies, we also report results for JointPair trained with adapters shared between two languages, denoted as share adpt. Unlike language-specific adapters that can hinder transferability, shared adapters improve both within-language and cross-lingual performance with the extra capacity. However, meta adapters still obtain better performance. These results show that mitigating negative interference can improve multilingual representations.

Table 6: XNLI results (Accuracy).

Related Work
Unsupervised multilingual language models such as mBERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019) work surprisingly well on many NLP tasks without parallel training signals (Pires et al., 2019; Wu and Dredze, 2019). A line of follow-up work (Artetxe et al., 2019; Karthikeyan et al., 2020) studies what contributes to the cross-lingual ability of these models. These studies show that vocabulary overlap is not required for multilingual models, and suggest that abstractions shared across languages emerge automatically during pretraining. Another line of research investigates how to further improve this shared knowledge, such as applying post-hoc alignment (Wang et al., 2020b; Cao et al., 2020) and utilizing better calibrated training signals (Mulcaire et al., 2019; Huang et al., 2019). While prior work emphasizes how to share to improve transferability, we study multilingual models from a different perspective of how to unshare to resolve language conflicts. Our work is also related to transfer learning (Pan and Yang, 2010) and multi-task learning (Ruder, 2017). In particular, prior work has observed (Rosenstein et al., 2005) and studied (Wang et al., 2019) negative transfer, such that transferring knowledge from source tasks can degrade the performance in the target task. Others show it is important to remedy negative transfer in multi-source settings (Ge et al., 2014; Wang and Carbonell, 2018). In this work, we study negative transfer in multilingual models, where languages contain heavily unbalanced training data and exhibit complex inter-task relatedness.
In addition, our work is related to methods that measure similarity between cross-lingual representations. For example, existing methods utilize statistical metrics to examine cross-lingual embeddings such as singular vector canonical correlation analysis (Raghu et al., 2017; Kudugunta et al., 2019), eigenvector similarity (Søgaard et al., 2018), and centered kernel alignment (Kornblith et al., 2019). While these methods focus on testing latent representations, we directly compare similarity of neural network structures through network pruning. Finally, our work is related to meta learning, which sets a meta task to learn model initialization for fast adaptation (Finn et al., 2017; Gu et al., 2018; Flennerhag et al., 2019), data selection (Wang et al., 2020a), and hyperparameters (Baydin et al., 2018). In our case, the meta task is to mitigate negative interference.

Conclusion
We present the first systematic study of negative interference in multilingual models and shed light on its causes. We further propose a method and show it can improve cross-lingual transferability by mitigating negative interference. While prior efforts focus on improving sharing and cross-lingual alignment, we provide new insights and a different perspective on unsharing and resolving language conflicts.

A Fine-tuning Details
Note that XNLI only has training data available in English, so we only evaluate zero-shot cross-lingual performance on it. Following Hu et al. (2020), we finetune the model for 10 epochs for NER and POS, 2 epochs for QA and 200 epochs for XNLI. For NER, POS and QA, we search the following hyperparameters: batch size {16, 32}; learning rate {2e-5, 3e-5, 5e-5}. We use the English dev set for the zero-shot cross-lingual setting and the target language dev set for the within-language monolingual setting. For XNLI, we search over: batch size {4, 8}; encoder learning rate {1e-6, 5e-6, 2e-5}; classifier learning rate {5e-6, 2e-5, 5e-5}. For models with language-specific components, we test freezing these components or finetuning them together with the rest of the network. We find that finetuning the whole network always yields better results. For all experiments, we save a checkpoint after each epoch.

B Method Details
Let $z_i$ be the output of the i-th layer of dimension $d$. The residual adapter network (Rebuffi et al., 2017; Houlsby et al., 2019) is a bottleneck layer that first projects $z_i$ to an inner layer with dimension $b$:

$$h_i = g(z_i W_i^{z})$$

where $W_i^{z} \in \mathbb{R}^{d \times b}$, $z_i$ is treated as a row vector, and $g$ is some activation function such as ReLU. It is then projected back to the original input dimension $d$ with a residual connection:

$$\mathrm{Adapter}(z_i) = h_i W_i^{h} + z_i$$

where $W_i^{h} \in \mathbb{R}^{b \times d}$. In our experiments, we fix $b = \frac{1}{4} d$. For a bilingual model of $lg_1$ and $lg_2$, we inject two language-specific adapters after each attention and feedforward layer, one for each language. For example, if the input text is in $lg_1$, the network will be routed to adapters designated for $lg_1$. The rest of the network and training protocol remain unchanged. The injected adapter layers mimic the warp layers interleaved between base network layers in Flennerhag et al. (2019). Warp layers are meta parameters that aim to improve the performance of the base network. They precondition base network gradients to obtain better gradient geometry. In our experiments, we treat language-specific adapters as meta parameters to improve generalization of the shared network. The algorithm is outlined in Algorithm 1. The adapters are updated according to Eq. 3, which doubles the memory requirement. In particular, the higher-order term in Eq. 4 requires computing the gradient of a gradient. In practice, we approximate this term using the finite difference approximation:

$$\nabla^{2}_{\phi_i^{(t)}, \theta^{(t)}} \mathcal{L}^{i}_{train}(\theta^{(t)}, \phi_i^{(t)}) \; \nabla_{\theta'} \frac{1}{L} \sum_{j=1}^{L} \mathcal{L}^{j}_{val}(\theta', \phi_j^{(t)}) \approx \frac{\nabla_{\phi_i^{(t)}} \mathcal{L}^{i}_{train}(\theta^{+}, \phi_i^{(t)}) - \nabla_{\phi_i^{(t)}} \mathcal{L}^{i}_{train}(\theta^{-}, \phi_i^{(t)})}{2\epsilon}$$

where $\theta^{\pm} = \theta^{(t)} \pm \epsilon \nabla_{\theta'} \frac{1}{L} \sum_{j=1}^{L} \mathcal{L}^{j}_{val}(\theta', \phi_j^{(t)})$ and $\epsilon$ is a small scalar. We use the same value for learning rates $\alpha$ and $\beta$ in Eq. 3, to be consistent with the standard learning rate schedule used in XLM (Lample and Conneau, 2019).
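The finite difference approximation can be checked on a small example. The sketch below assumes a made-up training loss, sum over k of `theta_k ** 2 * phi_k + theta_k * phi_k ** 2`, whose mixed second derivative is known in closed form, and an arbitrary vector `v` standing in for the validation gradient with respect to the adapted shared parameters. All names and values are hypothetical.

```python
def grad_phi_train(theta, phi):
    """Gradient w.r.t. phi of the toy loss sum_k (theta_k**2 * phi_k + theta_k * phi_k**2)."""
    return [t * t + 2.0 * t * p for t, p in zip(theta, phi)]


def mixed_hvp_fd(grad_phi, theta, phi, v, eps=1e-3):
    """Central finite difference estimate of (d^2 L / d phi d theta) applied to v.

    Only two extra first-order gradient evaluations are needed, at theta + eps * v
    and at theta - eps * v, so no second-order graph has to be stored.
    """
    theta_plus = [t + eps * x for t, x in zip(theta, v)]
    theta_minus = [t - eps * x for t, x in zip(theta, v)]
    g_plus = grad_phi(theta_plus, phi)
    g_minus = grad_phi(theta_minus, phi)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


def mixed_hvp_exact(theta, phi, v):
    """Closed-form product for the toy loss: the mixed Hessian is diagonal."""
    return [(2.0 * t + 2.0 * p) * x for t, p, x in zip(theta, phi, v)]


# Made-up parameter values and a stand-in for the validation gradient.
theta = [0.5, -1.0, 2.0]
phi = [1.0, 0.3, -0.7]
v = [0.2, -0.4, 0.1]
approx = mixed_hvp_fd(grad_phi_train, theta, phi, v)
exact = mixed_hvp_exact(theta, phi, v)
```

For this quadratic toy loss the central difference matches the closed-form product up to floating-point rounding; for a general loss the error depends on the choice of the step size `eps`.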