Do Explicit Alignments Robustly Improve Multilingual Encoders?

Multilingual BERT (mBERT), XLM-RoBERTa (XLMR) and other unsupervised multilingual encoders can effectively learn cross-lingual representations. Explicit alignment objectives based on bitexts like Europarl or MultiUN have been shown to further improve these representations. However, word-level alignments are often suboptimal, and such bitexts are unavailable for many languages. In this paper, we propose a new contrastive alignment objective that can better utilize such signals, and we examine whether these previous alignment methods can be adapted to noisier sources of aligned data: a randomly sampled 1 million pair subset of the OPUS collection. Additionally, rather than reporting results on a single dataset with a single model run, we report the mean and standard deviation of multiple runs with different seeds, on four datasets and tasks. Our more extensive analysis finds that, while our new objective outperforms previous work, overall these methods do not improve performance under this more robust evaluation framework. Furthermore, the gains from using a better underlying model eclipse any benefits from alignment training. These negative results call for more care in evaluating such methods and suggest limitations in applying explicit alignment objectives.


Introduction
Unsupervised massively multilingual encoders, including multilingual BERT (Devlin et al., 2019, mBERT) and XLM-RoBERTa (Conneau et al., 2019, XLMR), are now standard tools for zero-shot cross-lingual transfer in NLP tasks (Wu and Dredze, 2019; Xia et al., 2020). While almost all such encoders are pretrained without an explicit cross-lingual objective, i.e., without enforcing that similar words from different languages have similar representations, improvements can be attained through the use of explicitly cross-lingually linked data during pretraining, such as bitexts (Conneau and Lample, 2019; Huang et al., 2019; Ji et al., 2019) and dictionaries (Wu et al., 2019). As with cross-lingual embeddings (Ruder et al., 2019), these data can be used to support explicit alignment objectives based on either linear mappings (Wang et al., 2019, 2020; Wu et al., 2019; Liu et al., 2019) or fine-tuning (Cao et al., 2020). However, as word-level alignments from an unsupervised aligner are often suboptimal, we develop a new cross-lingual alignment objective for training our model. We base our objective on contrastive learning, in which two similar inputs, such as a pair from a bitext, are directly optimized to be similar relative to a negative set. These methods have been effective in computer vision tasks (He et al., 2019; Chen et al., 2020a). Additionally, most previous work on contextual alignment considers high-quality bitext like Europarl (Koehn, 2005) or MultiUN (Eisele and Chen, 2010). While helpful, these resources are unavailable for most languages to which we seek zero-shot transfer. To better reflect the quality of bitext available for most languages, we additionally use OPUS-100 (Zhang et al., 2020), a randomly sampled subset of the OPUS collection (Tiedemann, 2012) with 1 million sentence pairs per language pair.
We show that our new contrastive learning alignment objectives outperform previous work (Cao et al., 2020) when applied either to the bitext from previous works or to the OPUS-100 bitext. However, our experiments also produce a negative result. While previous work showed improvements from alignment-based objectives on zero-shot cross-lingual transfer for a single task (XNLI) with a single random seed, our more extensive analysis tells a different story. We report the mean and standard deviation of multiple runs with the same hyperparameters and different random seeds. We find that previously reported improvements disappear, even as our new method shows a small improvement. Furthermore, we extend the evaluation to multiple languages on 4 tasks, further supporting our conclusions. Finally, we evaluate XLMR large on these tasks; its results dominate those obtained with the alignment objectives. We conclude that explicit alignments do not improve cross-lingual representations under a more extensive evaluation with noisier bitexts, and that any improvements are lost when compared to larger models. This negative result shows the limitations of explicit alignment objectives with larger-scale bitexts and encoders.

Explicit Alignment Objectives
We begin with a presentation of objective functions that use parallel data across languages for training multilingual encoders. These objectives assume multilingual data in the form of word pairs in parallel sentences. Since gold word alignments are scarce, we use an unsupervised word aligner. Let S and T be the contextual hidden state matrices of corresponding words from a pretrained multilingual encoder. We assume S is English while T is a combination of different target languages. As both mBERT and XLMR operate at the subword level, we use the representation of the first subword, which is consistent with the evaluation stage. Each s_i and t_i is a corresponding row of S and T, respectively. S and T come from the final layer of the encoder, while S^l and T^l come from the l-th layer.
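The first-subword selection described above can be sketched as follows (a minimal illustration; the function name is ours, and `word_ids` mimics the subword-to-word mapping that subword tokenizers typically expose):

```python
def first_subword_indices(word_ids):
    """Return the index of the first subword of each word.
    `word_ids[k]` is the word that subword k belongs to
    (None for special tokens such as [CLS]/[SEP])."""
    seen = set()
    first = {}
    for k, w in enumerate(word_ids):
        if w is not None and w not in seen:
            seen.add(w)
            first[w] = k
    return first

# A sentence of two words, the first split into three subwords:
# [CLS] em ##bed ##ding s [SEP]
print(first_subword_indices([None, 0, 0, 0, 1, None]))  # {0: 1, 1: 4}
```

The hidden states at these indices then form the rows of S (or T) used by every alignment objective below.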
Linear Mapping If S and T are static features (such as from ELMo (Peters et al., 2018)), then T can be aligned to be close to S via a linear mapping (Wang et al., 2019, 2020; Wu et al., 2019; Liu et al., 2019), similar to aligning monolingual embeddings to produce cross-lingual embeddings. For features S^l and T^l from layer l, we learn a mapping W^l:

W^l = argmin_W ||T^l W - S^l||^2_F    (1)
When W^l is orthogonal, Eq. (1) is known as the Procrustes problem (Smith et al., 2017) and can be solved exactly by SVD. Alternatively, Eq. (1) can be solved by gradient descent, without needing to store the huge matrices S and T in memory; we adopt this more memory-efficient approach. Following Lample et al. (2018), we enforce orthogonality by alternating the gradient update with the update rule

W^l <- (1 + beta) W^l - beta (W^l W^l^T) W^l    (2)

L2 Alignment Instead of a linear mapping, the encoder itself can be fine-tuned. Cao et al. (2020) minimize the distance between aligned hidden states,

min_theta sum_i ||s_i - t_i||^2_2    (3)

where theta denotes the encoder parameters. To prevent a degenerate solution, they additionally use a regularization term that encourages the source hidden states to stay close to the pretrained hidden states,

sum ||s-bar - s-bar(theta_pretrained)||^2_2    (4)

where s-bar denotes all hidden states of the source sentence, including unaligned words. With mBERT and 20k to 250k parallel sentences from Europarl and MultiUN, Cao et al. show improvements on XNLI but not parsing. In preliminary experiments, we found that constraining the parameters to stay close to their original pretrained values also prevents degenerate solutions while being more efficient than Eq. (4). As a result, we adopt the following regularizer (with lambda = 1):

lambda ||theta - theta_pretrained||^2_2    (5)
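Both routes to an orthogonal mapping can be sketched in a few lines of numpy (an illustrative toy, not the paper's implementation; `beta` and the synthetic data are our choices):

```python
import numpy as np

def procrustes(T, S):
    """Closed-form solution of min_W ||T W - S||_F with W orthogonal:
    W = U V^T, where U Sigma V^T is the SVD of T^T S."""
    U, _, Vt = np.linalg.svd(T.T @ S)
    return U @ Vt

def orthogonalize_step(W, beta=0.01):
    """The update of Lample et al. (2018), interleaved with gradient
    steps in practice, pulling W back toward the orthogonal manifold:
    W <- (1 + beta) W - beta (W W^T) W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W

# Toy check: Procrustes recovers a known rotation.
rng = np.random.default_rng(0)
S = rng.normal(size=(100, 8))                  # "source" features
Q = np.linalg.qr(rng.normal(size=(8, 8)))[0]   # random orthogonal map
T = S @ Q.T                                    # rotated "target" features
W = procrustes(T, S)
print(np.allclose(T @ W, S))

# The iterative update drives a perturbed matrix back to orthogonality.
M = Q + 0.05 * rng.normal(size=(8, 8))
for _ in range(1000):
    M = orthogonalize_step(M)
print(np.allclose(M @ M.T, np.eye(8), atol=1e-6))
```

The SVD route needs the full feature matrices in memory at once, which motivates the gradient-descent alternative adopted in the paper.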

Contrastive Alignment
Inspired by the contrastive learning framework of Chen et al. (2020a), we propose a contrastive loss to align S and T by fine-tuning the encoder. Assume that in each batch we have corresponding pairs (s_i, t_i) where i ∈ {1, . . . , B}. Instead of optimizing the absolute distance between s_i and t_i as in Eq. (1) or Eq. (3), a contrastive loss allows more flexibility by encouraging s_i and t_i to be close relative to any other hidden state. In other words, our proposed contrastive alignment optimizes the relative distance between s_i and t_i. Because the alignment signal from an unsupervised aligner is often suboptimal, this makes our objective more robust to word-level alignment errors. Additionally, unlike previous work, we select different sets of negative examples to enforce different levels of cross-lingual alignment. Finally, the objective naturally scales to multiple languages.
Weak alignment When the negative examples come only from the target languages, we enforce a weak cross-lingual alignment: s_i should be closer to t_i than to any other t_j, ∀j ≠ i, and likewise in the other direction. The loss of a batch is

L_weak = -1/(2B) sum_{i=1}^{B} [ log( exp(sim(s_i, t_i)/T) / sum_{j=1}^{B} exp(sim(s_i, t_j)/T) ) + log( exp(sim(t_i, s_i)/T) / sum_{j=1}^{B} exp(sim(t_i, s_j)/T) ) ]    (6)

where T = 0.1 is a temperature hyperparameter and sim(a, b) measures the similarity of a and b.
We use a learned cosine similarity sim(a, b) = cos(f(a), f(b)), where f is a feed-forward feature extractor with one hidden layer (768-768-128) and ReLU. It can learn to discard language-specific information and align only the alignable information. Chen et al. (2020a) find that this similarity measure learns better representations in computer vision. After alignment, f is discarded, as most cross-lingual transfer tasks do not need this feature extractor, though tasks like parallel sentence retrieval might find it helpful. This learned similarity cannot be applied to an absolute distance objective like Eq. (3), as it can produce degenerate solutions.
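The weak-alignment loss can be sketched as follows (a minimal numpy illustration under simplifying assumptions: plain cosine similarity on already-extracted features stands in for the learned extractor f, and the function name is ours):

```python
import numpy as np

def weak_alignment_loss(S, T, temp=0.1):
    """Weak-alignment contrastive loss: s_i should be closer to t_i
    than to any other t_j, and symmetrically for t_i. S and T are
    (B, d) matrices of corresponding hidden states."""
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = (Sn @ Tn.T) / temp          # (B, B) pairwise similarities
    def xent(lg):
        # cross-entropy with the diagonal (aligned pair) as the positive
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (xent(logits) + xent(logits.T))  # both directions

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 8))
# Perfectly aligned pairs give a much lower loss than shuffled pairs.
print(weak_alignment_loss(S, S) < weak_alignment_loss(S, np.roll(S, 1, axis=0)))
```

Note that with the learned similarity, f would be applied to both arguments before the cosine; only the relative ordering of similarities matters to the loss.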
Strong alignment If the negative examples include both the source and target languages, we enforce a strong cross-lingual alignment: s_i should be closer to t_i than to any other t_j, ∀j ≠ i, as well as any other s_j, ∀j ≠ i. The loss of a batch is

L_strong = -1/(2B) sum_{h ∈ H} log( exp(sim(h, aligned(h))/T) / sum_{h' ∈ H \ {h}} exp(sim(h, h')/T) )    (7)
where aligned(h) is the aligned hidden state of h and H = {s_1, . . . , s_B, t_1, . . . , t_B}. For both the weak and strong alignment objectives, we add the regularization term of Eq. (5) with λ = 1.
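The strong-alignment variant differs only in the negative set; a sketch under the same simplifying assumptions as before (plain cosine in place of the learned sim, illustrative names):

```python
import numpy as np

def strong_alignment_loss(S, T, temp=0.1):
    """Strong-alignment loss: for every h in H = {s_1..s_B, t_1..t_B},
    its aligned counterpart must score higher than ALL other 2B - 2
    hidden states, source and target alike."""
    H = np.vstack([S, T])
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    B = S.shape[0]
    logits = (Hn @ Hn.T) / temp
    np.fill_diagonal(logits, -np.inf)           # h itself is excluded
    pos = np.concatenate([np.arange(B, 2 * B),  # aligned(s_i) = t_i
                          np.arange(0, B)])     # aligned(t_i) = s_i
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(2 * B), pos])

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 8))
# Well-aligned pairs yield a much lower loss than mismatched ones.
print(strong_alignment_loss(S, S + 0.01 * rng.normal(size=(4, 8))))
```

Compared with the weak variant, each positive now competes against 2B - 2 negatives rather than B - 1, which also discourages source states from collapsing onto each other.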

Experiments
Multilingual Alignment We consider alignment and transfer from English to 8 target languages: Arabic, German, Spanish, French, Hindi, Russian, Vietnamese, and Chinese. We use two sets of bitext: (1) the bitext used in previous works (Conneau and Lample, 2019); and (2) OPUS-100, which covers 100 languages with English as the pivot and is sampled randomly from the OPUS collection, better reflecting the average quality of bitext available for most languages. It contains 1M sentence pairs for each target language, except Hindi (0.5M).
We tokenize the bitext with Moses (Koehn et al., 2007) and segment Chinese with Chang et al. (2008). We use fast_align (Dyer et al., 2013) to produce unsupervised word alignments in both directions and symmetrize with the grow-diag-final-and heuristic. We only keep one-to-one alignments and discard any trivial alignment where the source and target words are identical.
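The final filtering step can be made concrete with a short sketch (function name and toy data are ours, not from the paper's pipeline; the input mimics symmetrized fast_align output as (source, target) index pairs):

```python
from collections import Counter

def one_to_one_alignments(align_pairs, src_tokens, tgt_tokens):
    """Keep only one-to-one alignment links, and drop 'trivial' links
    where the source and target tokens are identical (e.g. numbers)."""
    src_count = Counter(i for i, _ in align_pairs)
    tgt_count = Counter(j for _, j in align_pairs)
    kept = []
    for i, j in align_pairs:
        if src_count[i] == 1 and tgt_count[j] == 1 \
                and src_tokens[i] != tgt_tokens[j]:
            kept.append((i, j))
    return kept

src = ["the", "cat", "sat", "2020"]
tgt = ["le", "chat", "était", "assis", "2020"]
pairs = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4)]  # (2,2)/(2,3): one-to-many
print(one_to_one_alignments(pairs, src, tgt))     # [(0, 0), (1, 1)]
```

The one-to-many link for "sat" and the identical "2020"/"2020" link are both discarded, leaving only links usable as (s_i, t_i) training pairs.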
We train the L2, weak, and strong alignment objectives in a multilingual fashion: each batch contains examples from all target languages. Following Devlin et al. (2019), we optimize with Adam (Kingma and Ba, 2014), learning rate 1e-4, batch size 128, 100k total steps (≈ 2 epochs), with 4k steps of linear warmup followed by linear decay. We use 16-bit precision and train each model on a single RTX TITAN for around 18 hours, with the maximum sequence length set to 96. For the linear mapping, we use a learning rate that decays linearly from 1e-4 to 0 over 20k steps (≈ 3 epochs), training for 3 hours per language pair. Compared to no alignment, Linear Mapping performs much worse on NER, better on POS tagging and parsing, and comparably on XNLI. While previous work observed small improvements on selected languages and tasks, such improvements likely depend on the randomness of the evaluation. Under a more comprehensive evaluation covering 4 tasks and multiple seeds, the previously proposed methods do not consistently perform better than no alignment, even with millions of parallel sentences.
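The warmup-then-decay schedule used for the fine-tuned objectives above can be sketched as follows (a minimal illustration of the stated hyperparameters; the function name is ours):

```python
def lr_at_step(step, peak_lr=1e-4, warmup_steps=4_000, total_steps=100_000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay
    to 0 at total_steps (the schedule described in the text)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lr_at_step(0))        # 0.0
print(lr_at_step(4_000))    # 1e-4 (end of warmup)
print(lr_at_step(100_000))  # 0.0
```

The linear-mapping variant corresponds to `warmup_steps=0, total_steps=20_000`, i.e. pure linear decay from 1e-4 to 0.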

Evaluation
Contrastive Alignment In Tab. 1, with mBERT, both proposed contrastive alignment methods consistently perform at least as well as no alignment, and outperform it by more than one standard deviation on POS tagging and/or parsing. This suggests the proposed methods are more robust to suboptimal alignments. We hypothesize that the learned cosine similarity and the contrastive formulation allow the model to recover from suboptimal word-level alignments. Weak and strong alignment perform comparably. While preliminary experiments found that increasing the batch size by 1.5× does not lead to better performance, future work could consider using a memory bank to greatly increase the number of negative examples (Chen et al., 2020b), which has been shown to be beneficial for computer vision tasks.
Alignment with XLMR XLMR, trained on 2.5TB of text, has the same number of transformer layers as mBERT but a larger vocabulary, and it performs much better than mBERT. We therefore ask whether an explicit alignment objective can similarly lead to better cross-lingual representations. Unfortunately, as Tab. 1 shows, none of the alignment methods we consider improves over no alignment. Compared to no alignment, Linear Mapping and L2 Alignment perform worse on 3 out of 4 tasks (all except POS tagging). In contrast to previous work, both contrastive alignment objectives perform comparably to no alignment on all 4 tasks.
Impact of Bitext Quality Even though the OPUS-100 bitext is of lower quality than the bitext used in previous works (due to its greater inclusion of bitext from various sources), this has minimal impact on every alignment method we consider. This is good news for lower-resource languages, as not all languages are covered by MultiUN or Europarl.
Model Capacity vs Alignment XLMR large has nearly twice the number of parameters as XLMR base . Even trained on the same data, it performs much better than XLMR base , with or without alignment. This suggests increasing model capacity likely leads to better cross-lingual representations than using an explicit alignment objective. Future work could tackle the curse of multilinguality (Conneau et al., 2019) by increasing the model capacity in a computationally efficient way (Pfeiffer et al., 2020).

Discussion
Our proposed contrastive alignment objective outperforms L2 Alignment (Cao et al., 2020) and consistently performs as well as or better than no alignment, using bitext of various quality, on 4 NLP tasks under a comprehensive evaluation with multiple seeds. However, to our surprise, previously proposed methods do not show consistent improvement over no alignment in this setting. Therefore, we make the following recommendations for future work on cross-lingual alignment or multilingual representations: 1) Evaluations should consider average-quality data, not exclusively high-quality bitext. 2) Evaluation must consider multiple NLP tasks or datasets. 3) Evaluation should report mean and variance over multiple seeds, not a single run. More broadly, the community must establish a robust evaluation scheme for zero-shot cross-lingual transfer, as a single run with one random seed does not reflect the variance of the method (especially in a zero-shot or few-shot setting). While Keung et al. (2020) advocate using an oracle for model selection, we instead argue for reporting the variance of test performance, following the few-shot learning literature. Additionally, no alignment method improves XLMR, the larger XLMR large performs much better still, and raw text is easier to obtain than bitext. Therefore, scaling models to more raw text and larger capacity may be more beneficial for producing better cross-lingual models.

Acknowledgments
This research is supported in part by ODNI, IARPA, via the BETTER Program contract #2019-19051600005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Table 4: Zero-shot cross-lingual transfer results with the OPUS-100 bitext. Blue or orange indicates that the mean performance is one standard deviation above or below the mean of the baseline.