On the Evaluation of Contextual Embeddings for Zero-Shot Cross-Lingual Transfer Learning

Pre-trained multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning, where multilingual BERT is fine-tuned on some source language (typically English) and evaluated on a different target language. However, published results for baseline mBERT zero-shot accuracy vary by as much as 17 points on the MLDoc classification task across four papers. We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reproducible results on the MLDoc and XNLI tasks. English dev accuracy is often uncorrelated (or even anti-correlated) with target language accuracy, and zero-shot cross-lingual performance varies greatly both within the same fine-tuning run and between different fine-tuning runs. We recommend providing oracle scores alongside the zero-shot results: still fine-tune using English, but choose a checkpoint with the target dev set. Reporting this upper bound makes results more consistent by avoiding the variation from bad checkpoints.


Introduction
Zero-shot and zero-resource cross-lingual NLP has seen significant progress in recent years. The discovery of cross-lingual structure in word embedding spaces recently culminated in the work of Lample et al. (2018b), which showed that unsupervised word translation via adversarial mappings is competitive with supervised techniques. Concurrent work in machine translation also showed that it is possible to achieve non-trivial BLEU scores without any bitext (Artetxe et al., 2018; Lample et al., 2018a). Self-supervised multilingual contextual embeddings like mBERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019) have shown remarkably strong performance on cross-lingual named entity recognition, text classification, dependency parsing, and other tasks (e.g., Pires et al., 2019; Keung et al., 2019; Wu and Dredze, 2019).
Much of this recent work has demonstrated that mBERT performs very well on zero-shot tasks, superseding prior techniques as the baseline for zero-shot cross-lingual transfer learning. By zero-shot, we mean that no parallel text or labeled data from the target language was used during model training, fine-tuning, or hyperparameter search. In this setting, models are trained on labeled (usually English) text and tested on target (non-English) text. Standard practice prohibits the use of target language data for model selection; the final model is chosen using the English dev set only.
However, we find that zero-shot mBERT results can vary greatly. In Table 1, we compile published zero-shot mBERT accuracies on MLDoc from four papers. Even though these authors report English accuracies that are essentially identical, the target language performance is dramatically different: for the listed target languages, the highest accuracy is up to 3 points better than the next best and up to 17 points better than the worst. Given that each experiment starts with the same pre-trained mBERT model and MLDoc dataset, it is clear that the cross-lingual results from these publications are not reproducible. We investigate this reproducibility issue in both MLDoc and XNLI (Conneau et al., 2018), another major dataset for evaluating cross-lingual transfer.

Table 2: Zero-shot accuracies over 10 independent mBERT fine-tuning experiments on MLDoc and XNLI. For each run, we computed the zero-shot accuracies on the checkpoint with the best English dev performance. We present the minimum and maximum accuracy attained for each evaluation set over these 10 experiments. English dev accuracies are within 0.9% of each other, but target test accuracies vary by much more than that, depending on the language and dataset. Languages with ∆ ≥ 2.5% are bolded.
In Section 3, we show that the final zero-shot accuracies between and within independent mBERT training runs are highly variable. Variations over different random seeds are similar in magnitude to those in Table 1, with variation due to checkpoint selection using English dev being a significant underlying cause. In Section 4, we find that in many cases, English (En) dev accuracy is not predictive of target language performance. In fact, for some languages, En dev performance is actually anti-correlated with target language accuracy.
Poor correlation between En dev and target test accuracy, combined with high variance between independent runs, means that published zero-shot accuracies are somewhat arbitrary. In addition to zero-shot results, we recommend reporting oracle results, where one still fine-tunes using En dev, but uses the target dev set for checkpoint selection.

Experimental setup
We use cased mBERT (Devlin et al., 2019) for all of our experiments. We illustrate the reproducibility issues in zero-shot cross-lingual transfer through the document classification task in MLDoc and the natural language inference task in XNLI.
For the MLDoc experiments, we used the english.train.10000 training data. For the XNLI experiments, we used the entire En training set. Unless stated otherwise, we use the En development set and the non-En test sets provided in these corpora. When fine-tuning mBERT, we used a constant learning rate of 2 × 10⁻⁶ with a batch size of 32 for MLDoc, and a constant learning rate of 2 × 10⁻⁵ with a batch size of 32 for XNLI. Checkpoints were saved at regular intervals: one checkpoint each time 2% of the training corpus was processed. Models were trained until convergence based on En dev accuracy.
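The schedule above can be summarized in a minimal sketch; the config names and helper function are ours for illustration, not from the authors' code:

```python
# Hyperparameters as stated in the setup above.
FINETUNE_CONFIG = {
    "MLDoc": {"learning_rate": 2e-6, "batch_size": 32},
    "XNLI": {"learning_rate": 2e-5, "batch_size": 32},
}

def checkpoint_steps(num_examples, batch_size, checkpoint_frac=0.02):
    """Optimizer steps between checkpoints: one checkpoint per 2% of the corpus."""
    steps_per_pass = num_examples // batch_size
    return max(1, int(steps_per_pass * checkpoint_frac))
```

For english.train.10000 (10,000 documents) at batch size 32, this saves a checkpoint every 6 optimizer steps under these assumptions, so each run produces many candidate checkpoints for selection.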

Between-run and within-run variations in zero-shot accuracy
Running mBERT fine-tuning experiments under different random seeds yields highly variable results, similar to what we observed in Table 1. Previous work that discussed evaluation with random initializations (e.g., Reimers and Gurevych, 2017; Melis et al., 2018) reported only small effects on the test metric (e.g., ±1 point on En F1 for NER), but we observed much larger variations in zero-shot accuracy on MLDoc and XNLI. Firstly, we observed significant variation between independent runs (Tables 2a and 2b). We ran mBERT fine-tuning with different random seeds and, for each run, selected the checkpoint with the best En dev performance. The best checkpoint from each run gave very different zero-shot results, varying by as much as 15.0% absolute in French (MLDoc) and 4.3% in Thai (XNLI).
Secondly, we observed significant variation within each run, which we illustrate in Figure 1. En dev accuracy reaches a stable plateau as mBERT fine-tuning proceeds; however, zero-shot Es and Ja accuracies swing by several percentage points. A simple calculation with a test of proportions at a significance level of 0.05 shows that a difference of at least 2.5% (absolute) would be statistically significant given the size of the MLDoc and XNLI test sets. For all of the MLDoc languages and for 7 of the 14 XNLI languages, the variation within 10 runs exceeds 2.5%, even though the En dev accuracy varies within a narrow 0.9% range.

Table 3: Frequency of directional agreement between dev and test accuracy on MLDoc and XNLI (higher is better). We expect dev and test accuracy to generally increase and decrease together between randomly sampled checkpoints, which happens when using the target language dev set, but not when using the English dev set. English dev accuracy can be worse (bolded) than random chance (50%) at tracking target test accuracy.
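A threshold of this kind comes from a two-proportion z-test under the normal approximation. The sketch below is illustrative: the test-set size n and the shared proportion p are assumed inputs, not values taken from the paper.

```python
from math import sqrt

def min_significant_diff(n, p=0.5, z=1.96):
    """Smallest absolute accuracy difference that a two-proportion z-test
    would call significant at the 5% level, comparing two test sets of
    size n with a shared underlying proportion p."""
    return z * sqrt(2 * p * (1 - p) / n)
```

For example, with n = 4000 and p = 0.5, the minimum detectable difference is about 2.2% absolute; the exact threshold depends on the assumed accuracy level and the per-language test-set size.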
In other words, En dev accuracy is not necessarily useful for choosing the best model for zero-shot transfer among different runs. The En dev accuracy is very similar across our independent experiments, but the target test accuracy for each experiment fluctuates in a wide band.

English dev accuracy and its relationship with zero-shot accuracy
Experimenters use the En dev set for model selection under the assumption that zero-shot performance improves as En dev performance improves. We show that this assumption is often false.
We investigate whether En dev accuracy is a good predictor of the directional change in target language accuracy between different checkpoints by comparing the ability of En dev and target dev to predict changes in target test accuracy. In Tables 3a and 3b, we report the frequency of directional agreement on MLDoc and XNLI: how often does En dev accuracy increase/decrease with target test accuracy?
We randomly sample pairs of checkpoints where the target test accuracy changes by at least 0.5% and compute the proportion of pairs where En dev accuracy changed in the same direction. For MLDoc (Table 3a), En dev is not much better than a coin flip (∼50%) at predicting the direction of the change in target test accuracy, while target dev tracks target test accuracy more than 90% of the time. For XNLI (Table 3b), En dev sometimes approaches target dev in predictive power (i.e., Es and Fr), but falls short for the other languages. In general, one sees higher directional agreement in XNLI than MLDoc, which we attribute to XNLI's target test sets being professionally translated from En test.

Table 4: Oracle zero-shot accuracies with mBERT across 10 independent runs, using target dev to select the best checkpoint for each language. This provides an upper bound on the achievable zero-shot accuracy. Published results are derived from the sources in Table 1 and Table 6. Best En dev results are from Table 2.
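The agreement statistic can be sketched as follows; the function name and sampling details are our own, and the paper's exact procedure may differ.

```python
import random

def directional_agreement(dev_acc, test_acc, n_pairs=1000, min_delta=0.005, seed=0):
    """Estimate how often dev accuracy moves in the same direction as target
    test accuracy between randomly sampled checkpoint pairs.
    dev_acc, test_acc: parallel lists of per-checkpoint accuracies."""
    rng = random.Random(seed)
    indices = range(len(dev_acc))
    agree = total = 0
    while total < n_pairs:
        i, j = rng.sample(indices, 2)
        dt = test_acc[j] - test_acc[i]
        if abs(dt) < min_delta:  # skip pairs with negligible test movement
            continue
        dd = dev_acc[j] - dev_acc[i]
        agree += (dd > 0) == (dt > 0)
        total += 1
    return agree / total
```

A value near 1.0 means the dev set reliably tracks target test accuracy; near 0.5 it is no better than a coin flip, and below 0.5 it is actively misleading.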
Remarkably, for some languages (i.e., Ja in MLDoc and Hi, Sw, Tr, and Vi in XNLI), the frequency of directional agreement is less than 50%, which means that, more often than not, target test accuracy for these languages decreases when En dev accuracy increases; we discuss this in Section 5. Since En dev accuracy does not reliably move in the same direction as target test accuracy, it is an inadequate metric for tracking zero-shot cross-lingual transfer performance.

Catastrophic forgetting
The strange phenomenon in Table 3, where the probability of directional agreement is sometimes less than 50%, occurs even on XNLI, where the dev and test sets are translated from English and therefore have the same content. Hence, we believe this phenomenon is a form of catastrophic forgetting (Kirkpatrick et al., 2017), with mBERT losing, during the En-only fine-tuning phase, some of the cross-lingual knowledge it gained during pre-training.
In Figure 1, we plotted the En dev accuracy and the target test accuracy over time for the language with the highest directional agreement (Es, 0.59) and the language with the lowest directional agreement (Ja, 0.42) for MLDoc (see Table 3). From the figure, Es test accuracy does increase with En dev accuracy, while Ja test accuracy decreases as En dev accuracy increases. The same pattern holds with XNLI for Tr and En (not shown), where Turkish accuracy decreases somewhat as fine-tuning with English training data continues.
We conclude that En dev accuracy cannot detect when mBERT is improving on the En training data at the expense of non-En languages, and should not (solely) be used to assess zero-shot performance.

Recommendations and discussion
Using a poor metric like En dev accuracy to select a model checkpoint is similar to picking a checkpoint at random. This would not be a major issue if the variance between different training runs were low; the test performance would, in that case, be consistently mediocre. The problem arises when the variability is high, which we have seen experimentally (Table 2) and in the wild (Table 1).
We showed that different experiments can report very different results, which prevents us from making meaningful comparisons between different baselines and methods. Currently, it is standard practice to use the En dev accuracy for checkpoint selection in the zero-shot cross-lingual setting. However, we showed that using En dev accuracy for checkpoint selection leads to somewhat arbitrary zero-shot results.
Instead, we propose reporting oracle accuracies, where one still fine-tunes using English but selects a checkpoint using the target dev set. This represents the maximum achievable zero-shot accuracy. Note that we do not use the target dev set for hyperparameter tuning; we use it only to avoid selecting bad checkpoints within each fine-tuning experiment. Table 4 presents these oracle accuracies. In the Appendix, we include published results on MLDoc, XNLI, MLQA (Lewis et al., 2019), and CoNLL 2002/2003 (Sang and De Meulder, 2003) for mBERT and XLM-R large (Conneau et al., 2020), whose variations suggest similar issues across datasets, contextual embeddings, and publications.
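The two selection rules differ only in which dev set breaks the tie between checkpoints; the sketch below contrasts them with invented accuracy values (the record layout and numbers are ours, not from the paper):

```python
def select_checkpoint(checkpoints, dev_key):
    """Pick the checkpoint with the highest accuracy on the given dev set."""
    return max(checkpoints, key=lambda ckpt: ckpt[dev_key])

# One record of dev accuracies per saved checkpoint (values invented).
checkpoints = [
    {"step": 100, "en_dev": 0.85, "ja_dev": 0.70},
    {"step": 200, "en_dev": 0.90, "ja_dev": 0.66},  # En improves, Ja degrades
]

zero_shot = select_checkpoint(checkpoints, "en_dev")  # standard practice
oracle = select_checkpoint(checkpoints, "ja_dev")     # proposed upper bound
```

When catastrophic forgetting is at work, as in this toy example, the two rules choose different checkpoints, and only the oracle rule avoids the one where the target language has degraded.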
To avoid widespread variance in future published zero-shot cross-lingual experiments, we recommend reporting oracle accuracies alongside results from checkpoint selection with En dev alone.