Explicit Alignment Objectives for Multilingual Bidirectional Encoders

Pre-trained cross-lingual encoders such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) have proven impressively effective at enabling transfer learning of NLP systems from high-resource languages to low-resource languages. This success comes despite the fact that there is no explicit objective to align the contextual embeddings of words/sentences with similar meanings across languages in the same space. In this paper, we present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bidirectional EncodeR). AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities. We conduct experiments on zero-shot cross-lingual transfer learning for different tasks including sequence tagging, sentence retrieval and sentence classification. Experimental results on the tasks in the XTREME benchmark (Hu et al., 2020) show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLM-R-large model, which has 3.2x the parameters of AMBER. Our code and models are available at http://github.com/junjiehu/amber.


Introduction
Cross-lingual embeddings, both traditional non-contextualized word embeddings (Faruqui and Dyer, 2014) and the more recent contextualized word embeddings (Devlin et al., 2019), are an essential tool for cross-lingual transfer in downstream applications. In particular, multilingual contextualized word representations have proven effective in reducing the amount of supervision needed in a variety of cross-lingual NLP tasks such as sequence labeling (Pires et al., 2019), question answering (Artetxe et al., 2020), parsing (Wang et al., 2019), sentence classification (Wu and Dredze, 2019) and retrieval (Yang et al., 2019a).
Some attempts at training multilingual representations (Devlin et al., 2019; Conneau et al., 2020a) simply train a (masked) language model on monolingual data from many languages. These methods can only implicitly learn which words and structures correspond to each other across languages in an entirely unsupervised fashion, but are nonetheless quite effective empirically (Conneau et al., 2020b; K et al., 2020). On the other hand, some methods directly leverage multilingual parallel corpora (McCann et al., 2017; Eriguchi et al., 2018; Conneau and Lample, 2019; Huang et al., 2019; Siddhant et al., 2020), which provides some degree of supervision by implicitly aligning the words in the two languages. However, the pressure on the model to learn clear correspondences between the contextualized representations in the two languages is still implicit and somewhat weak. Because of this, several follow-up works (Schuster et al., 2019; Wang et al., 2020; Cao et al., 2020) have proposed methods that use word alignments from parallel corpora as supervision signals to align multilingual contextualized representations, albeit in a post-hoc fashion.
In this work, we propose a training regimen for learning contextualized word representations that encourages symmetry at both the word and sentence levels at training time. Our word-level alignment objective is inspired by work in machine translation that defines objectives encouraging consistency between the source-to-target and target-to-source attention matrices (Cohn et al., 2016). Our sentence-level alignment objective encourages prediction of the correct translation within a mini-batch for a given source sentence, which is inspired by work on learning multilingual sentence representations (Yang et al., 2019a; Wieting et al., 2019). In experiments, we evaluate the zero-shot cross-lingual transfer performance of AMBER on four different NLP tasks in the XTREME benchmark (Hu et al., 2020): part-of-speech (POS) tagging, paraphrase classification, natural language inference, and sentence retrieval. We show that AMBER obtains gains of up to 1.1 average F1 score on cross-lingual POS tagging, up to 27.3 average accuracy score on sentence retrieval, and achieves competitive accuracy in paraphrase classification when compared with the XLM-R-large model. This is despite the fact that XLM-R-large is trained on 23.8x as much data and has 3.2x the parameters of AMBER. This shows that compared to large amounts of monolingual data, even a small amount of parallel data leads to significantly better cross-lingual transfer learning.

Cross-lingual Alignment
This section describes three objectives for training contextualized embeddings. We denote the monolingual and parallel data as M and P respectively.
Masked Language Modeling (MLM) A masked language modeling objective takes a pair of sentences x, y, and optimizes the prediction of randomly masked tokens in the concatenation of the sentence pair as follows:

$$\ell_{\mathrm{MLM}}(x, y) = -\mathbb{E}_{s}\big[\log P(z_s \mid z_{\backslash s})\big], \qquad (1)$$

where z is the concatenation of the sentence pair, z = [x; y], $z_s$ are the masked tokens randomly sampled from z, and $z_{\backslash s}$ denotes all the other tokens except the masked ones.
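For concreteness, a minimal sketch of this masked-token loss on a concatenated sentence pair is shown below. The `model` interface, the 15% masking rate, and the single-`[MASK]` corruption are illustrative assumptions rather than the exact AMBER configuration.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, x_ids, y_ids, mask_id, mask_prob=0.15):
    """Illustrative masked-LM loss on the concatenation z = [x; y].

    `model` is assumed to map token ids to per-position vocabulary logits;
    the 15% masking rate is a common default, not necessarily AMBER's.
    """
    z = torch.cat([x_ids, y_ids], dim=-1)                    # (batch, |x| + |y|)
    s = torch.rand(z.shape, device=z.device) < mask_prob     # sampled positions z_s
    corrupted = z.masked_fill(s, mask_id)                     # replace z_s with [MASK]
    logits = model(corrupted)                                 # (batch, len, vocab)
    # negative log-likelihood of the original tokens at the masked positions
    return F.cross_entropy(logits[s], z[s])
```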
In the standard monolingual setting, x, y are two contiguous sentences in a monolingual corpus. In Conneau and Lample (2019), x, y are two sentences in different languages from a parallel corpus, an objective we will refer to as Translation Language Modeling (TLM).

Sentence Alignment Our first proposed objective encourages cross-lingual alignment of sentence representations. For a source-target sentence pair (x, y) in the parallel corpus, we separately calculate sentence embeddings, denoted $c_x$ and $c_y$, by averaging the embeddings in the final layer. We then encourage the model to predict the correct translation y given a source sentence x. To do so, we model the conditional probability of a candidate sentence y being the correct translation of a source sentence x as:

$$P(y \mid x) = \frac{\exp\big(c_x^{\top} c_y\big)}{\sum_{y' \in \mathcal{M} \cup \mathcal{P}} \exp\big(c_x^{\top} c_{y'}\big)}, \qquad (2)$$

where y can be any sentence in any language. Since the normalization term in Eq. (2) is intractable, we approximate P(y|x) by sampling y within a mini-batch B rather than over M ∪ P. We then define the sentence alignment loss as the average negative log-likelihood of the above probability:

$$\ell_{\mathrm{SA}}(x, y) = -\log \frac{\exp\big(c_x^{\top} c_y\big)}{\sum_{y' \in B} \exp\big(c_x^{\top} c_{y'}\big)}. \qquad (3)$$
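A minimal sketch of this in-batch approximation of Eqs. (2)-(3) follows, assuming mean pooling over non-padding tokens; it is not AMBER's exact implementation.

```python
import torch
import torch.nn.functional as F

def sentence_alignment_loss(src_hidden, tgt_hidden, src_mask, tgt_mask):
    """In-batch sentence alignment loss (a sketch of Eqs. 2-3).

    src_hidden, tgt_hidden: final-layer embeddings, shape (B, len, d), for B
    parallel sentence pairs; masks are float tensors with 1.0 for real tokens.
    """
    # mean-pool over real tokens to get sentence embeddings c_x, c_y
    c_x = (src_hidden * src_mask.unsqueeze(-1)).sum(1) / src_mask.sum(1, keepdim=True)
    c_y = (tgt_hidden * tgt_mask.unsqueeze(-1)).sum(1) / tgt_mask.sum(1, keepdim=True)
    # scores[i, j] = c_x_i . c_y_j ; the diagonal holds the true translations
    scores = c_x @ c_y.t()                                   # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    # average negative log-likelihood of predicting the correct translation
    return F.cross_entropy(scores, targets)
```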
Bidirectional Word Alignment Our second proposed objective encourages alignment of word embeddings by leveraging the attention mechanism in the Transformer model. Motivated by work on encouraging consistency between the source-to-target and target-to-source translations (Cohn et al., 2016; He et al., 2016), we create two different attention masks as inputs to the Transformer model, and obtain two attention matrices from the top layer of the Transformer. We compute the target-to-source attention matrix $A_{y \to x}$ as follows:

$$A_{y \to x}[t, i] = \frac{\exp\big(Q_t K_i^{\top} / \sqrt{d_k}\big)}{\sum_{i'} \exp\big(Q_t K_{i'}^{\top} / \sqrt{d_k}\big)}, \quad Q_t = g^{l}_{y_t} W_q, \quad K_i = g^{l}_{x_i} W_k, \qquad (4)$$

where $g^{l}_{y_t}$ is the embedding of the t-th word in y on the l-th layer, $A_{y \to x}[i, j]$ is the (i, j)-th value in the attention matrix from y to x, $d_k$ is the key dimension, and $W = \{W_q, W_k, W_v\}$ are the linear projection weights for the queries Q, keys K, and values V respectively. We compute the source-to-target matrix $A_{x \to y}$ by switching x and y.
To encourage the model to align source and target words in both directions, we aim to minimize the distance between the forward and backward attention matrices. Similarly to Cohn et al. (2016), we maximize the trace $\operatorname{tr}(A_{y \to x} A_{x \to y})$, which sums the elementwise products of $A_{y \to x}^{\top}$ and $A_{x \to y}$. Since the attention scores are normalized in [0, 1], this trace is upper bounded by min(|x|, |y|), and the maximum value is obtained when the two matrices $A_{y \to x}^{\top}$ and $A_{x \to y}$ are identical. Since the Transformer generates multiple attention heads, we average the trace of the bidirectional attention matrices generated by all the heads, denoted by the superscript h:

$$\ell_{\mathrm{WA}}(x, y) = -\frac{1}{H} \sum_{h=1}^{H} \operatorname{tr}\big(A^{h}_{y \to x}\, A^{h}_{x \to y}\big). \qquad (8)$$
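A rough sketch of this trace-agreement objective is given below. It assumes access to the top-layer hidden states and per-head projection matrices, and it omits the attention masking over y described in the next paragraph; the function and argument names are illustrative, not AMBER's actual code.

```python
import torch
import torch.nn.functional as F

def word_alignment_loss(g_x, g_y, heads, d_head):
    """Attention-agreement word alignment loss (a sketch of Eq. 8).

    g_x: (|x|, d) top-layer embeddings of the source sentence.
    g_y: (|y|, d) top-layer embeddings of the target sentence.
    heads: list of (W_q, W_k) projection matrices, one pair per attention head.
    """
    agreements = []
    for W_q, W_k in heads:
        # target-to-source attention A_{y->x}: queries from y, keys from x
        A_y2x = F.softmax((g_y @ W_q) @ (g_x @ W_k).t() / d_head ** 0.5, dim=-1)
        # source-to-target attention A_{x->y}: obtained by switching x and y
        A_x2y = F.softmax((g_x @ W_q) @ (g_y @ W_k).t() / d_head ** 0.5, dim=-1)
        # sum of elementwise products of A_{y->x} and A_{x->y}^T, i.e. the trace
        # of their product; it is largest when the two attention patterns agree
        agreements.append(torch.sum(A_y2x * A_x2y.t()))
    # maximize agreement averaged over heads = minimize its negative
    return -torch.stack(agreements).mean()
```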
Notably, in the target-to-source attention in Eq. (4), we apply an attention mask that enforces the constraint that the t-th token in y can only attend to its preceding tokens $y_{<t}$ and the source tokens in x. This is particularly useful for controlling the information available to the query token $y_t$, in a manner similar to the decoding stage of NMT. Without attention masking, the standard Transformer performs self-attention over all tokens, i.e., the queries and keys are computed from the embeddings of all tokens in z, and minimizing the distance between the two attention matrices by Eq. (8) might lead to a trivial solution where $W_q \approx W_k$.

Combined Objective Finally, we combine the masked language modeling objective with the alignment objectives and obtain the total loss:

$$L = \mathbb{E}_{(x,y) \in \mathcal{M} \cup \mathcal{P}}\big[\ell_{\mathrm{MLM}}(x, y)\big] + \mathbb{E}_{(x,y) \in \mathcal{P}}\big[\ell_{\mathrm{SA}}(x, y) + \ell_{\mathrm{WA}}(x, y)\big]. \qquad (9)$$

Note that in each iteration, we sample a mini-batch of sentence pairs from M ∪ P.
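As a schematic sketch of how Eq. (9) composes per mini-batch, reusing the two loss functions sketched above (the argument names and the single-pair word-level term are illustrative simplifications, not the actual implementation):

```python
def amber_loss(mlm_term, h_x, h_y, x_mask, y_mask, head_projections, d_head,
               is_parallel):
    """Total loss of Eq. (9) for one mini-batch (a schematic sketch).

    mlm_term: the masked-LM loss already computed on the concatenated pair(s);
    h_x, h_y: final-layer embeddings of the two sides, shape (B, len, d).
    Only pairs drawn from the parallel data P contribute the alignment terms.
    """
    loss = mlm_term
    if is_parallel:
        loss = loss + sentence_alignment_loss(h_x, h_y, x_mask, y_mask)
        # word-level term shown for the first pair in the batch for brevity
        loss = loss + word_alignment_loss(h_x[0], h_y[0], head_projections, d_head)
    return loss
```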
Training Setup
We follow Conneau and Lample (2019) to prepare the parallel data, with one change to maintain truecasing. We set the maximum number of subwords in the concatenation of each sentence pair to 256, and use 10k warmup steps with a peak learning rate of 1e-4 and a linear decay of the learning rate. We train AMBER on TPU v3 for about 1 week.

Datasets
Cross-lingual Part-Of-Speech (POS) contains data in 13 languages from the Universal Dependencies v2.

Result Analysis
In Table 2, we show the average results over all languages on all tasks, and show detailed results for each language in Appendix A.3. First, we find that our re-trained mBERT (AMBER with MLM) performs better than the publicly available mBERT on all tasks, confirming the utility of pre-training BERT models with larger batches for more steps (Liu et al., 2019). Second, AMBER trained with the word alignment objective obtains a comparable average F1 score to the best performing model (Unicoder) on the POS tagging task, which shows the effectiveness of word-level alignment for token-level syntactic structure prediction. It is also worth noting that Unicoder is initialized from the larger XLM-R-base model, which is pre-trained on a larger corpus than AMBER, and Unicoder improves over XLM-R-base on all tasks. Third, for the sentence classification tasks, AMBER trained with our explicit alignment objectives obtains gains of up to 2.1 average accuracy points on PAWS-X and 3.9 average accuracy points on XNLI over AMBER with only the MLM objective. Although AMBER trained with only the MLM objective falls behind existing XLM/XLM-R/Unicoder models with many more parameters, AMBER trained with our alignment objectives significantly narrows the gap in classification accuracy with respect to XLM/XLM-R/Unicoder. Finally, for the sentence retrieval tasks, we find that XLM-15 and Unicoder, which are both trained on additional parallel data, outperform the other existing models trained only on monolingual data. Using additional parallel data, AMBER with MLM and TLM objectives also significantly improves over AMBER trained with only the MLM objective.

How does alignment help by language?
In Figure 2, we investigate the improvement of the alignment objectives over the MLM objective on low-resource and high-resource languages, by computing the performance difference between AMBER trained with alignment objectives and AMBER (MLM). First, we find that AMBER trained with alignment objectives significantly improves performance on languages with relatively small amounts of parallel data, such as Turkish, Urdu, and Swahili, while the improvement on high-resource languages is marginal. Through a further analysis (Appendix A.3), we observe that AMBER (MLM) performs worse on these low-resource and morphologically rich languages than on high-resource Indo-European languages, while AMBER trained with alignment objectives can effectively bridge the gap. Moreover, AMBER trained with our word-level alignment objective yields the largest improvement on these low-resource languages on the POS task, and AMBER trained with sentence-level alignment performs best on XNLI.

Alignment with Attention vs Dictionary
Recent studies (Cao et al., 2020; Wang et al., 2020) have proposed using a bilingual dictionary to align cross-lingual word representations. Compared with these methods, our word-level alignment objective encourages the model to automatically discover word alignment patterns from the parallel corpus in an end-to-end training process, which avoids potential errors accumulating across separate steps of a pipeline. Furthermore, an existing dictionary may not contain translations for all source words, especially words with multiple senses. Even if the dictionary is relatively complete, a heuristic is still required to find the corresponding substrings in the parallel sentences for alignment. If we instead use a word alignment tool to extract a bilingual dictionary in a pipeline, errors may accumulate and hurt the accuracy of the model. Besides, Wang et al.

A.1 Training Details for Reproducibility
Although English is not the best source language for some target languages (Lin et al., 2019), this zero-shot cross-lingual transfer setting is still practically useful, as many NLP tasks only have English annotations. In the following paragraphs, we provide details for reproducing our results in the zero-shot cross-lingual transfer setting.
Model: We use the same architecture as mBERT for AMBER, and we build AMBER trained with the alignment objectives on top of the original mBERT implementation at https://github.com/google-research/bert. Our models are released at http://github.com/junjiehu/amber.

Pre-training:
We first train the model on the Wikipedia data for 1M steps using the default hyper-parameters in the original repository, except that we use a larger batch of 8,192 sentence pairs. The maximum number of subwords in the concatenation of each sentence pair is set to 256. To continue training AMBER with additional objectives on parallel data, we use 10K warmup steps with a peak learning rate of 1e-4 and a linear decay of the learning rate. All models are pre-trained with our proposed objectives on TPU v3, and we use the same hyper-parameter setting for all AMBER variants in the experiments. We follow the practice of mBERT at https://github.com/google-research/bert/blob/master/multilingual.md#data-source-and-sampling to sample from the multilingual data for training. We select the checkpoint of all models at the 1M step for a fair comparison. It takes about 1 week to finish the pre-training.
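For reference, a small sketch of the exponentially smoothed language sampling described in that document; the 0.7 exponent follows the mBERT data-sampling recipe, but treat the exact value and the toy corpus sizes below as assumptions.

```python
import random

def language_sampling_weights(sizes, alpha=0.7):
    """Exponentially smoothed sampling over languages (a sketch).

    sizes: dict mapping language -> number of training sentences.
    alpha: smoothing exponent; values below 1 up-weight low-resource
           languages relative to their raw share of the data.
    """
    total = sum(sizes.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(smoothed.values())
    return {lang: w / norm for lang, w in smoothed.items()}

# usage: draw the language of the next training example (toy corpus sizes)
weights = language_sampling_weights({"en": 120_000_000, "sw": 300_000, "ur": 700_000})
lang = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
```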
Fine-tuning: For fine-tuning the models on the downstream applications, we use a constant learning rate of 2e-5, as suggested in the original paper (Devlin et al., 2019). We fine-tune all models for 10 epochs on the cross-lingual POS tagging task and 5 epochs on the sentence classification tasks, using a batch size of 32 for all models. All models are fine-tuned on 2080Ti GPUs, and training finishes within 1 day.
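These hyper-parameters can be summarized in a small configuration sketch; the values are taken from this section, while the dictionary structure itself is only illustrative.

```python
# Fine-tuning hyper-parameters reported above (structure is illustrative).
FINETUNE_CONFIG = {
    "learning_rate": 2e-5,                # constant, following Devlin et al. (2019)
    "batch_size": 32,
    "epochs": {
        "pos_tagging": 10,                # cross-lingual POS tagging
        "sentence_classification": 5,     # PAWS-X and XNLI
    },
    "hardware": "2080Ti GPUs",            # fine-tuning finishes within 1 day
}
```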
Datasets: We use the same parallel data that is used to train XLM-15. The parallel data can be processed by this script: https://github.com/facebookresearch/XLM/blob/master/get-data-para.sh.
All the datasets for the downstream applications can be downloaded with the script at https://github.com/google-research/xtreme/blob/master/scripts/download_data.sh. Table 4 lists the statistics of the parallel data by language.

A.2 Source-to-target attention matrix
We derive the source-to-target attention matrix $A_{x \to y}$ analogously to Eq. (4), by switching the roles of x and y:

$$A_{x \to y}[i, t] = \frac{\exp\big((g^{l}_{x_i} W_q)(g^{l}_{y_t} W_k)^{\top} / \sqrt{d_k}\big)}{\sum_{t'} \exp\big((g^{l}_{x_i} W_q)(g^{l}_{y_{t'}} W_k)^{\top} / \sqrt{d_k}\big)},$$

with the analogous attention mask allowing $x_i$ to attend to its preceding tokens $x_{<i}$ and all tokens in y.

A.3 Detailed Results
We show the detailed results over all languages on the cross-lingual POS task in Table 6, on the PAWS-X task in Table 5, on the XNLI task in Table 7, and on the Tatoeba retrieval task in Table 8.
A.4 Detailed Results on Performance Difference by Languages
Figure 4 and Figure 3 show the performance difference between AMBER trained with alignment objectives and AMBER trained with only the MLM objective on the POS and XNLI tasks over all languages.

Table 5: Accuracy of zero-shot cross-lingual classification on PAWS-X. Bold numbers highlight the highest scores across languages among the existing models (upper part) and AMBER variants (bottom part).

Figure 4: Performance difference between AMBER trained with alignments on parallel data and AMBER (MLM) on the POS task. Languages are sorted by the amount of parallel data used for training AMBER with alignments.