Multilingual BERT Post-Pretraining Alignment

We propose a simple method to align multilingual contextual embeddings as a post-pretraining step for improved cross-lingual transferability of pretrained language models. Using parallel data, our method aligns embeddings at the word level through the recently proposed Translation Language Modeling objective, as well as at the sentence level via contrastive learning and random input shuffling. We also perform sentence-level code-switching with English when finetuning on downstream tasks. On XNLI, our best model (initialized from mBERT) improves over mBERT by 4.7% in the zero-shot setting and achieves comparable results to XLM for translate-train while using less than 18% of the same parallel data and 31% fewer model parameters. On MLQA, our model outperforms XLM-R_Base, which has 57% more parameters than ours.


Introduction
Building on the success of monolingual pretrained language models (LMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), their multilingual counterparts mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) are trained using the same objectives: Masked Language Modeling (MLM) and, in the case of mBERT, Next Sentence Prediction (NSP). MLM is applied to monolingual text covering over 100 languages. Despite the absence of parallel data and explicit alignment signals, these models transfer surprisingly well from high-resource languages, such as English, to other languages. On the Natural Language Inference (NLI) task XNLI (Conneau et al., 2018), a text classification model trained on English training data can be directly applied to the other 14 languages and achieve respectable performance. Having a single model that can serve over 100 languages also has important business applications.
Recent work improves upon these pretrained models by adding cross-lingual tasks leveraging parallel data that always involve English. Conneau and Lample (2019) pretrain a new Transformer-based (Vaswani et al., 2017) model from scratch with an MLM objective on monolingual data and a Translation Language Modeling (TLM) objective on parallel data. Cao et al. (2020) align mBERT embeddings in a post-hoc manner: they first apply a statistical toolkit, FastAlign (Dyer et al., 2013), to create word alignments on parallel sentences. Then, mBERT is tuned by minimizing the mean squared error between the embeddings of English words and those of the corresponding words in other languages. Such a post-hoc approach suffers from the limitations of word-alignment toolkits: (1) noise from FastAlign can propagate errors to the rest of the pipeline; (2) FastAlign mainly creates alignments based on word-level translation and usually overlooks contextual semantic composition. As a result, the tuned mBERT is biased toward shallow cross-lingual correspondences. Importantly, both approaches only involve word-level alignment tasks.
In this work, we focus on self-supervised, alignment-oriented training tasks using minimal parallel data to improve mBERT's cross-lingual transferability. We propose a Post-Pretraining Alignment (PPA) method consisting of both word-level and sentence-level alignment, as well as a finetuning technique for downstream tasks that take pairs of text as input, such as NLI and Question Answering (QA). Specifically, we use a slightly different version of TLM as our word-level alignment task and contrastive learning (Hadsell et al., 2006) on mBERT's [CLS] tokens to align sentence-level representations. Both tasks are self-supervised and do not require pre-alignment tools such as FastAlign. Our sentence-level alignment is implemented using MoCo (He et al., 2020), an instance discrimination-based method of contrastive learning that was recently proposed for self-supervised representation learning in computer vision. Lastly, when finetuning on NLI and QA tasks for non-English languages, we perform sentence-level code-switching with English as a form of both alignment and data augmentation. We conduct controlled experiments on XNLI and MLQA (Lewis et al., 2020), leveraging varying amounts of parallel data during alignment, and then conduct an ablation study that shows the effectiveness of our method. On XNLI, our aligned mBERT improves over the original mBERT by 4.7% for zero-shot transfer and outperforms Cao et al. (2020) while using the same amount of parallel data from the same source. For translate-train, where translation of the English training data is available in the target language, our model achieves comparable performance to XLM while using far fewer resources. On MLQA, we get a 2.3% improvement over mBERT and outperform XLM-R_Base for zero-shot transfer.

Figure 1: Model structure for our Post-Pretraining Alignment method using parallel data. We use MoCo to implement our sentence-level objective and TLM for our word-level objective. The model is trained in a multi-task manner with both objectives.

Method
This section introduces our proposed Post-Pretraining Alignment (PPA) method. We first describe the MoCo contrastive learning framework and how we use it for sentence-level alignment. Next, we describe the finer-grained word-level alignment with TLM. Finally, when training data in the target language is available, we incorporate sentence-level code-switching as a form of both alignment and data augmentation to complement PPA. Figure 1 shows our overall model structure.
Background: Contrastive Learning Instance discrimination-based contrastive learning aims to bring two views of the same source image closer to each other in the representation space while encouraging views of different source images to be dissimilar through a contrastive loss. Recent advances in this area, such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020), have bridged the gap in performance between self-supervised representation learning and fully-supervised methods on the ImageNet (Deng et al., 2009) dataset. A key ingredient of both methods is a large number of negative examples per instance, which the models need in order to learn such good representations. SimCLR uses in-batch negative example sampling, thus requiring a large batch size, whereas MoCo stores negative examples in a queue and casts the contrastive learning task as dictionary (query-key) lookup. In what follows, we first describe MoCo and then how we use it for sentence-level alignment.
Concretely, MoCo employs a dual-encoder architecture. Given two views v_1 and v_2 of the same image, v_1 is encoded by the query encoder f_q and v_2 by the momentum encoder f_k; v_1 and v_2 form a positive pair. Negative examples are views of different source images, and are stored in a queue of size K, which is randomly initialized. K is usually a large number (e.g., K = 65,536 for ImageNet). Negative pairs are formed by comparing v_1 with each item in the queue. Similarity between pairs is measured by dot product. MoCo uses the InfoNCE loss (van den Oord et al., 2019) to bring positive pairs closer to each other and push negative pairs apart. After a batch of view pairs is processed, those encoded by the momentum encoder are added to the queue as negative examples for future queries. During training, the query encoder is updated by the optimizer, while the momentum encoder is updated by the exponential moving average of the query encoder's parameters to maintain queue consistency:

$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$

where $\theta_q$ and $\theta_k$ are the model parameters of f_q and f_k, respectively, and m is the momentum coefficient.
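The momentum update can be sketched in a few lines. Here the encoder parameters are represented as plain Python lists of floats, an illustrative simplification; real implementations apply the same element-wise update to each parameter tensor:

```python
def momentum_update(theta_q, theta_k, m=0.999):
    """Exponential moving average: theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1.0 - m) * q for q, k in zip(theta_q, theta_k)]

# With a large m, the momentum encoder drifts only slowly toward the
# query encoder, keeping queued keys consistent with freshly encoded keys.
query_params = [1.0, 2.0]
key_params = [0.0, 0.0]
key_params = momentum_update(query_params, key_params, m=0.9)
```

With m close to 1 (0.999 in our setting), the momentum encoder changes very little per step, so representations already in the queue remain comparable to newly enqueued ones.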

Sentence-Level Alignment Objective
Our sentence-level alignment falls under the general problem of bringing two views of inputs from the same source closer in the representation space while keeping those from different sources dissimilar through a contrastive loss. From a cross-lingual alignment perspective, we treat an English sequence $S_i^{en}$ and its translation $S_i^{tr}$ in another language $tr \in L$ as two manifestations of the same semantics. At the same time, sentences that are not translations of each other should be further apart in the representation space. Given parallel corpora consisting of $\{(S_1^{en}, S_1^{tr}), \ldots, (S_N^{en}, S_N^{tr})\}$, we align sentence representations in all the different languages together using MoCo.
We use the pretrained mBERT model to initialize both the query and momentum encoders. mBERT consists of 12 Transformer blocks with 12 attention heads and hidden size $d_h = 768$. For input, instead of always feeding the query encoder English examples and the momentum encoder translation examples (or vice versa), we propose a random input shuffling approach. Specifically, we randomly shuffle the order of $S_i^{en}$ and $S_i^{tr}$ when feeding the two encoders, so that the query encoder sees both English and translation examples. We observe that this is a crucial step towards learning good multilingual representations with our method. The final hidden state $h \in \mathbb{R}^{1 \times d_h}$ of the [CLS] token, normalized with the $L_2$ norm, is treated as the sentence representation. Following Chen et al. (2020), we add a non-linear projection layer on top of $h$:

$c = W_2\,\mathrm{ReLU}(W_1 h^\top)$

where $W_1 \in \mathbb{R}^{d_h \times d_h}$, $W_2 \in \mathbb{R}^{d_k \times d_h}$, and $d_k$ is set to 300. The model is trained using the InfoNCE loss:

$\mathcal{L}_{\text{MoCo}} = -\log \frac{\exp(c_q \cdot c_{k^+} / \tau)}{\sum_{i=0}^{K} \exp(c_q \cdot c_{k_i} / \tau)}$

where $\tau$ is a temperature parameter, $c_q$ is the projected query representation, $c_{k^+}$ is its positive key, and the sum ranges over the positive key and the K queued negatives. In our implementation, we use a relatively small batch size of 128, resulting in more frequent parameter updates than if a large batch size were used. Items enqueued early on can thus become outdated with a large queue, so we scale down the queue size to K = 32,000 to prevent the queue from becoming stale.
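A minimal numeric sketch of the InfoNCE loss with a negative queue, using plain Python lists in place of batched tensors; the names (`info_nce`, `queue`) are illustrative, not taken from our implementation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, pos_key, queue, tau=0.05):
    """-log softmax score of the positive pair among (positive + queued negatives)."""
    logits = [dot(query, pos_key) / tau] + [dot(query, k) / tau for k in queue]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return -(logits[0] - log_denom)

q = [1.0, 0.0]                    # L2-normalized query sentence representation
k_pos = [1.0, 0.0]                # its translation, encoded by the momentum encoder
negs = [[0.0, 1.0], [-1.0, 0.0]]  # queued representations of unrelated sentences
loss = info_nce(q, k_pos, negs)   # small, since the positive pair is already aligned
```

Because similarities are divided by a small temperature (we use τ = 0.05), even modest differences in dot product translate into a sharp preference for the positive key.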

Word-Level Alignment Objective
We use TLM for word-level alignment. TLM is an extension of MLM that operates on bilingual data: parallel sentences are concatenated and MLM is applied to the combined bilingual sequence. Unlike Conneau and Lample (2019), we do not reset positional embeddings when forming the bilingual sequence, and we do not use language embeddings. In addition, the order of $S_i^{en}$ and $S_i^{tr}$ during concatenation is determined by the random input shuffling from the sentence-level alignment step, and we add a [SEP] token between $S_i^{en}$ and $S_i^{tr}$. We randomly mask 15% of the WordPiece tokens in each combined sequence. Masking is done by using a special [MASK] token 80% of the time, a random token from the vocabulary 10% of the time, and the unchanged token for the remaining 10%. TLM is performed using the query encoder of MoCo. Our final PPA model is trained in a multi-task manner with both the sentence-level objective and TLM:

$\mathcal{L} = \mathcal{L}_{\text{MoCo}} + \mathcal{L}_{\text{TLM}}$
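The 80/10/10 masking scheme on a concatenated bilingual sequence can be sketched as follows; the token strings and tiny stand-in vocabulary are illustrative placeholders, not the actual WordPiece vocabulary:

```python
import random

VOCAB = ["the", "cat", "sat", "le", "chat", "assis"]  # stand-in vocabulary
SPECIALS = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return (corrupted tokens, labels); labels are None where no prediction is made."""
    masked, labels = [], []
    for tok in tokens:
        if tok in SPECIALS or rng.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)            # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            masked.append("[MASK]")   # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(rng.choice(VOCAB))  # 10%: random vocabulary token
        else:
            masked.append(tok)        # 10%: keep unchanged
    return masked, labels

# English sentence and its French translation, concatenated with [SEP]
seq = ["[CLS]", "the", "cat", "sat", "[SEP]", "le", "chat", "assis", "[SEP]"]
masked_seq, mlm_labels = mask_tokens(seq, rng=random.Random(0))
```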

Finetuning on Downstream Tasks
After an alignment model is trained with PPA, we extract the query encoder from MoCo and finetune it on downstream tasks for evaluation. We follow the standard way of finetuning BERT-like models for sequence classification and QA tasks: (1) on XNLI, we concatenate the premise with the hypothesis and add a [SEP] token in between; a softmax classifier is added on top of the final hidden state of the [CLS] token. (2) On MLQA, we concatenate the question with the context and add a [SEP] token in between; we add two linear layers on top of mBERT, followed by softmax over the context tokens, to predict answer start and end positions, respectively. We conduct experiments in two settings: 1. Zero-shot cross-lingual transfer, where training data is available in English but not in the target languages. 2. Translate-train, where the English training set is (machine) translated into all the target languages. For the latter setting, we perform data augmentation with code-switched inputs when training on languages other than English. For example, a Spanish question q_es and context c_es pair can be augmented to two question-context pairs (q_es, c_en) and (q_en, c_es) with code-switching, resulting in 2x the training data. The same goes for XNLI with premises and hypotheses. The code-switching is always between English and a target language. During training, we ensure the two augmented pairs appear in the same batch.
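The augmentation step can be sketched as below; the field names and example sentences are hypothetical, chosen only to illustrate the pairing:

```python
def code_switch_pairs(q_en, c_en, q_tgt, c_tgt):
    """Expand one translated (question, context) pair into two code-switched pairs."""
    return [
        (q_tgt, c_en),  # target-language question, English context
        (q_en, c_tgt),  # English question, target-language context
    ]

batch = code_switch_pairs(
    q_en="Where is the Eiffel Tower?",
    c_en="The Eiffel Tower is in Paris.",
    q_tgt="¿Dónde está la Torre Eiffel?",
    c_tgt="La Torre Eiffel está en París.",
)
# Both augmented pairs go into the same training batch, as described above.
```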

Parallel Data for Post-Pretraining
Parallel Data All parallel data we use involve English as the source language. Specifically, we collect en-fr, en-es, and en-de parallel pairs from Europarl, en-ar and en-zh from MultiUN (Ziemski et al., 2016), en-hi from IITB (Kunchukuttan et al., 2018), and en-bg from both Europarl and EUbookshop. All datasets were downloaded from the OPUS website (Tiedemann, 2012). In our experiments, we vary the number of parallel sentence pairs used for PPA. For each language, we take the first 250k, 600k, and 2M English-translation parallel sentence pairs, except for those that are too short (either sentence has fewer than 10 WordPiece tokens) or too long (both sentences concatenated together have more than 128 WordPiece tokens). Table 1 shows the actual number of parallel pairs in each of our 250k, 600k, and 2M settings.
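The length filter described above can be expressed as a small predicate; the thresholds are the ones stated in the text, but a whitespace-style token list stands in for the real WordPiece tokenizer:

```python
def keep_pair(src_tokens, tgt_tokens, min_len=10, max_concat=128):
    """Keep a parallel pair unless either side is too short or the pair is too long combined."""
    too_short = len(src_tokens) < min_len or len(tgt_tokens) < min_len
    too_long = len(src_tokens) + len(tgt_tokens) > max_concat
    return not (too_short or too_long)

pairs = [
    (["tok"] * 12, ["tok"] * 15),  # kept
    (["tok"] * 4,  ["tok"] * 30),  # dropped: source side too short
    (["tok"] * 70, ["tok"] * 70),  # dropped: concatenation exceeds 128 tokens
]
kept = [p for p in pairs if keep_pair(*p)]
```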

Evaluation Benchmarks
XNLI is an evaluation dataset for cross-lingual NLI that covers 15 languages. The dataset is human-translated from the development and test sets of the English MultiNLI dataset. Given a premise-hypothesis sentence pair, the task is to classify their relationship as entailment, contradiction, or neutral. For zero-shot cross-lingual transfer, we train on the English MultiNLI training set and apply the model to the test sets of the other languages. For translate-train, we train on the translation data that come with the dataset.
MLQA is an evaluation dataset for QA that covers seven languages and is derived from a three-step process. We focus on the XLT (cross-lingual transfer) task in this work. For zero-shot cross-lingual transfer, we train on the English SQuAD v1.1 (Rajpurkar et al., 2016) training set. For translate-train, we train on the translation data provided by Hu et al. (2020).

Training Details
For both PPA and finetuning on downstream tasks, we use the AdamW optimizer with 0.01 weight decay and a linear learning rate scheduler. For PPA, we use a batch size of 128, an mBERT max sequence length of 128, and learning rate warmup over the first 10% of the total iterations, peaking at 0.00003. The MoCo momentum is set to 0.999, the queue size to 32,000, and the temperature to 0.05. Our PPA models are trained for 10 epochs, except in the 2M setting, where 5 epochs are trained. On XNLI, we use a batch size of 32 and an mBERT max sequence length of 128, and finetune the PPA model for 2 epochs; the learning rate peaks at 0.00005, with warmup over the first 1000 iterations. On MLQA, the mBERT max sequence length is set to 386 and the peak learning rate to 0.00003; the other parameters are the same as for XNLI. Our experiments are run on a single 32 GB V100 GPU, except for PPA training that involves either MLM or TLM, where two such GPUs are used. We also use mixed-precision training to save GPU memory and speed up experiments.
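For concreteness, the PPA learning rate schedule can be sketched as below. The linear decay back to zero after the peak is an assumption; the text specifies only a linear scheduler with warmup over the first 10% of iterations and a 3e-5 peak:

```python
def lr_at(step, total_steps, peak=3e-5, warmup_frac=0.1):
    """Linear warmup to `peak`, then (assumed) linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    return peak * (total_steps - step) / max(1, total_steps - warmup_steps)

# Sample the schedule every 100 steps over a hypothetical 1000-step run.
schedule = [lr_at(s, 1000) for s in range(0, 1001, 100)]
```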

Results
We report results on the test sets of XNLI and MLQA, with hyperparameter search done on the development sets. All experiments for translate-train were done using the code-switching technique introduced in Section 2.
XNLI Table 2 shows results on XNLI measured by accuracy. Devlin et al. (2019) only provide results for a few languages, so we use previously reported mBERT results as our baseline for zero-shot cross-lingual transfer, and those of Wu and Dredze (2019) for translate-train. Our best model, trained with 2M parallel sentences per language, improves over the mBERT baseline by 4.7% for zero-shot transfer and 3.2% for translate-train.
Compared to Cao et al. (2020), who use 250k parallel sentences per language from the same sources as we do for post-pretraining alignment, our 250k model does better for all languages considered, and we do not rely on the word-to-word pre-alignment step using FastAlign, which is prone to error propagation through the rest of the pipeline.

Table 1: Parallel data statistics. All parallel data involve English as the source language. We use Europarl for en-fr, en-es, and en-de, both Europarl and EUbookshop for en-bg, MultiUN for en-ar and en-zh, and IITB for en-hi. Our 250k setting uses an equal amount of data from the same sources as Cao et al. (2020). Our 2M setting uses approximately 63% and 17.8% of the parallel data of Artetxe and Schwenk (2019) and XLM, respectively.
Compared to XLM, our 250k, 600k, and 2M settings represent 3.1%, 7%, and 17.8% of the parallel data used by XLM, respectively (see Table 1). The XLM model also has 45% more parameters than ours, as Table 3 shows. Furthermore, XLM trained with MLM only is already significantly better than mBERT, even though its training data come from the same source as mBERT's (Wikipedia). One reason could be that XLM has 45% more model parameters than mBERT, as model depth and capacity are shown to be key to cross-lingual success (K et al., 2020). Additionally, Wu and Dredze (2019) hypothesize that limiting pretraining to the languages used by downstream tasks may be beneficial, since XLM models are pretrained on the 15 XNLI languages only. Our 2M model narrows the gap between mBERT and XLM from 7.5% to 2.8% for zero-shot transfer. Note that, for bg, our total processed pool of en-bg data consists of 456k parallel sentences, so there is no difference in en-bg data between our 600k and 2M settings. For translate-train, our model achieves comparable performance to XLM with the further help of code-switching during finetuning.
Our alignment-oriented method is, to a large degree, upper-bounded by the English performance, since all our parallel data involve English and all the other languages are implicitly aligned with English through our PPA objectives. Our 2M model improves the English performance to 82.4 from the mBERT baseline, but it is still lower than XLM (MLM), and much lower than XLM (MLM+TLM). We hypothesize that more high-quality monolingual data and model capacity are needed to further improve our English performance, thereby helping other languages better align with it.
MLQA Table 4 shows results on MLQA measured by F1 score. We notice that the mBERT baseline from the original MLQA paper is significantly lower than subsequently reported results, so we use the latter as our baseline. Our 2M model outperforms the baseline by 2.3% for zero-shot transfer and is also 0.2% better than XLM-R_Base, which uses 57% more model parameters than mBERT, as Table 3 shows. For translate-train, our 250k model is 1.3% better than the baseline.
Table 3: Model architecture and sizes from Conneau et al. (2020). L is the number of Transformer layers, H_m is the hidden size, H_ff is the dimension of the feed-forward layer, A is the number of attention heads, and V is the vocabulary size.

Comparing our model performance using varying amounts of parallel data, we observe that 600k pairs per language is our sweet spot, considering the trade-off between resources and performance. Going up to 2M helps on XNLI, but less significantly compared to the gain from 250k to 600k. On MLQA, surprisingly, 250k slightly outperforms the other two settings for translate-train.
Ablation Table 5 shows the contribution of each component of our method on XNLI. Removing TLM (-TLM) consistently leads to an accuracy drop of about 1% across the board, showing the positive effect of the word-level alignment objective. To better understand TLM's consistent improvement, we replace TLM with MLM (repl TLM w/ MLM), where we treat $S_i^{en}$ and $S_i^{tr}$ from the parallel corpora as separate monolingual sequences and perform MLM on each of them. The masking scheme is the same as for TLM, described in Section 2. We observe that MLM does not bring significant improvement. This confirms that the improvement from TLM does not come from the encoders being trained with more data and iterations; rather, the word-alignment nature of TLM is what helps the multilingual training.
Comparing our model without word-level alignment, i.e., -TLM, to the baseline mBERT in Table 2, we get 2-4% improvement in the zero-shot setting and 1-2% improvement in translate-train as the amount of parallel data is increased. These are relatively large improvements considering the fact that only sentence-level alignment is used. This also conforms to our intuition that sentence-level alignment is a good fit here since XNLI is a sentencelevel task.
In the zero-shot setting, removing MoCo (-MoCo) performs similarly to -TLM, where we observe an accuracy drop of about 1% compared to our full system. In translate-train, -MoCo outperforms -TLM and even matches the full system performance for 250k.
Finally, we show ablation results for code-switching in translate-train. On average, code-switching provides an additional gain of 1%.

Related Work
K et al. (2020) train several bilingual BERT models, such as en-es and enfake-es, where the data for enfake is constructed by Unicode shifting of the English data such that there is no character overlap with data of the other language. Results show that enfake-es still transfers well to Spanish and that the contribution from shared vocabulary is very small. The authors point out that model depth and capacity are instead the key factors contributing to mBERT's cross-lingual transferability. XLM-R (Conneau et al., 2020) improves over mBERT by training longer with more data from CommonCrawl, and without the NSP objective. In terms of model size, XLM-R uses over 3x more parameters than mBERT. Its base version, XLM-R_Base, is more comparable to mBERT, with the same hidden size and number of attention heads but a larger shared vocabulary.
Training Multilingual LMs with Parallel Sentences In addition to MLM on monolingual data, XLM (Conneau and Lample, 2019) further improves cross-lingual LM pretraining by introducing the TLM objective on parallel data. TLM concatenates source and target sentences together and predicts randomly masked tokens. Our work uses a slightly different version of TLM together with a contrastive objective to post-pretrain mBERT. Unlike XLM, our TLM does not reset the positions of target sentences and does not use language embeddings; we also randomly shuffle the order of source and target sentences. Another difference is that XLM has 45% more parameters than our model and uses more training data. Similar to XLM, Unicoder (Huang et al., 2019) pretrains LMs on multilingual corpora. In addition to MLM and TLM, it introduces three additional cross-lingual pretraining tasks: word recovery, paraphrase classification, and cross-lingual masked language modeling. Alternating Language Modeling (ALM) has also been proposed: on a pair of bilingual sequences, instead of TLM, phrase-level code-switching is performed and MLM is applied to the code-switched sequence. ALM is pretrained on both monolingual Wikipedia data and 1.5B code-switched sentences.
Training mBERT with Word Alignments Cao et al. (2020) post-align mBERT embeddings by first generating word alignments on parallel sentences that involve English. For each aligned word pair, the $L_2$ distance between their embeddings is minimized to train the model. In order to maintain the original transferability to downstream tasks, a regularization term is added to prevent the target language embeddings from deviating too much from their mBERT initialization. Our approach post-aligns mBERT with two self-supervised signals from parallel data, without using pre-alignment tools. Wang et al. (2019) also align mBERT embeddings using parallel data. They learn a linear transformation that maps a word embedding in a target language to the embedding of the aligned word in the source language. They show that their transformed embeddings are more effective on zero-shot cross-lingual dependency parsing.

Table 5: Ablation study on XNLI. 250k, 600k, and 2M refer to the maximum number of parallel sentence pairs per language used in PPA. MoCo refers to our sentence-level alignment task using contrastive learning. TLM refers to our word-level alignment task with translation language modeling. CS stands for code-switching. We conduct an additional study, repl TLM w/ MLM, in which, instead of TLM training, we augment our sentence-level alignment with regular MLM on monolingual text. This ablation confirms that the TLM objective helps because of its word-alignment capability, not because we train the encoders with more data and iterations.
Besides the aforementioned three major directions, Artetxe and Schwenk (2019) train a multilingual sentence encoder on 93 languages. Their stacked BiLSTM encoder is trained by first generating the embedding of a source sentence and then decoding that embedding into the target sentence in other languages.
Concurrent to our work, Chi et al. (2020) and Feng et al. (2020), among others, also leverage variants of contrastive learning for cross-lingual alignment. We focus on a smaller model and improve on it using as little parallel data as possible. We also explore code-switching during finetuning on downstream tasks to complement the post-pretraining alignment objectives.

Conclusion
Post-pretraining embedding alignment is an efficient means of improving cross-lingual transferability of pretrained multilingual LMs, especially when pretraining from scratch is not feasible. We showed that our self-supervised sentence-level and word-level alignment tasks can greatly improve mBERT's performance on downstream tasks of NLI and QA, and the method can potentially be applied to improve other pretrained multilingual LMs.
In addition to zero-shot cross-lingual transfer, we also showed that code-switching with English during finetuning provides additional alignment signals, when training data is available for the target language.