Translation Artifacts in Cross-lingual Transfer Learning

Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such translation processes can introduce subtle artifacts that have a notable impact on existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, which current models are highly sensitive to. We show that some previous findings in cross-lingual transfer learning need to be reconsidered in the light of this phenomenon. Based on the gained insights, we also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.


Introduction
While most NLP resources are English-specific, there have been several recent efforts to build multilingual benchmarks. One possibility is to collect and annotate data in multiple languages separately (Clark et al., 2020), but most existing datasets have been created through translation (Conneau et al., 2018; Artetxe et al., 2020). This approach has two desirable properties: it relies on existing professional translation services rather than requiring expertise in multiple languages, and it results in parallel evaluation sets that offer a meaningful measure of the cross-lingual transfer gap of different models. The resulting multilingual datasets are generally used for evaluation only, relying on existing English datasets for training.
Closely related to that, cross-lingual transfer learning aims to leverage large datasets available in one language (typically English) to build multilingual models that can generalize to other languages. Previous work has explored 3 main approaches to that end: machine translating the test set into English and using a monolingual English model (TRANSLATE-TEST), machine translating the training set into each target language and training the models on their respective languages (TRANSLATE-TRAIN), or using English data to fine-tune a multilingual model that is then transferred to the rest of the languages (ZERO-SHOT).
The dataset creation and transfer procedures described above result in a mixture of original, human translated and machine translated data when dealing with cross-lingual models. In fact, the type of text a system is trained on does not typically match the type of text it is exposed to at test time: TRANSLATE-TEST systems are trained on original data and evaluated on machine translated test sets, ZERO-SHOT systems are trained on original data and evaluated on human translated test sets, and TRANSLATE-TRAIN systems are trained on machine translated data and evaluated on human translated test sets.
Although it has been overlooked to date, we show that such a mismatch has a notable impact on the performance of existing cross-lingual models. By using back-translation (Sennrich et al., 2016) to paraphrase each training instance, we obtain another English version of the training set that better resembles the test set, obtaining substantial improvements for the TRANSLATE-TEST and ZERO-SHOT approaches in cross-lingual Natural Language Inference (NLI). While improvements brought by machine translation have previously been attributed to data augmentation (Singh et al., 2019), we reject this hypothesis and show that the phenomenon is only present in translated test sets, but not in original ones. Instead, our analysis reveals that this behavior is caused by subtle artifacts arising from the translation process itself. In particular, we show that translating different parts of each instance separately (e.g., the premise and the hypothesis in NLI) can alter superficial patterns in the data (e.g., the degree of lexical overlap between them), which severely affects the generalization ability of current models. Based on the gained insights, we improve the state-of-the-art in XNLI, and show that some previous findings need to be reconsidered in the light of this phenomenon.

Related work
Cross-lingual transfer learning. Current cross-lingual models work by pre-training multilingual representations using some form of language modeling, which are then fine-tuned on the relevant task and transferred to different languages. Some authors leverage parallel data to that end (Conneau and Lample, 2019; Huang et al., 2019), but training a model akin to BERT (Devlin et al., 2019) on the combination of monolingual corpora in multiple languages is also effective (Conneau et al., 2020). Closely related to our work, Singh et al. (2019) showed that replacing segments of the training data with their translation during fine-tuning is helpful. However, they attribute this behavior to a data augmentation effect, which we believe should be reconsidered given the new evidence we provide.
Multilingual benchmarks. Most benchmarks covering a wide set of languages have been created through translation, as is the case of XNLI (Conneau et al., 2018) for NLI, PAWS-X (Yang et al., 2019) for adversarial paraphrase identification, and XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020) for Question Answering (QA). A notable exception is TyDi QA (Clark et al., 2020), a contemporaneous QA dataset that was separately annotated in 11 languages. Other cross-lingual datasets leverage existing multilingual resources, as is the case of MLDoc (Schwenk and Li, 2018) for document classification and Wikiann (Pan et al., 2017) for named entity recognition. Concurrently with our work, Hu et al. (2020) combine some of these datasets into a single multilingual benchmark, and evaluate some well-known methods on it.
Annotation artifacts. Several studies have shown that NLI datasets like SNLI (Bowman et al., 2015) and MultiNLI contain spurious patterns that can be exploited to obtain strong results without making real inferential decisions. For instance, Gururangan et al. (2018) and Poliak et al. (2018) showed that a hypothesis-only baseline performs better than chance due to cues in its lexical choice and sentence length. Similarly, McCoy et al. (2019) showed that NLI models tend to predict entailment for sentence pairs with a high lexical overlap. Several authors have worked on adversarial datasets to diagnose these issues and provide a more challenging benchmark (Naik et al., 2018; Glockner et al., 2018; Nie et al., 2020). Besides NLI, other tasks like QA have also been found to be susceptible to annotation artifacts (Jia and Liang, 2017; Kaushik and Lipton, 2018). While previous work has focused on the monolingual scenario, we show that translation can interfere with these artifacts in multilingual settings.
Translationese. Translated texts are known to have unique features like simplification, explicitation, normalization and interference, which are referred to as translationese (Volansky et al., 2013). This phenomenon has been reported to have a notable impact on machine translation evaluation (Zhang and Toral, 2019). For instance, back-translation brings large BLEU gains for reversed test sets (i.e., when translationese is on the source side and original text is used as reference), but its effect diminishes in the natural direction (Edunov et al., 2020). While connected, the phenomenon we analyze is different in that it arises from translation inconsistencies due to the lack of context, and affects cross-lingual transfer learning rather than machine translation.

Experimental design
Our goal is to analyze the effect of both human and machine translation in cross-lingual models. For that purpose, the core idea of our work is to (i) use machine translation to either translate the training set into other languages, or generate English paraphrases of it through back-translation, and (ii) evaluate the resulting systems on original, human translated and machine translated test sets in comparison with systems trained on original data. We next describe the models used in our experiments (§3.1), the specific training variants explored (§3.2), and the evaluation procedure followed (§3.3).

Models and transfer methods
We experiment with two models that are representative of the state-of-the-art in monolingual and cross-lingual pre-training: (i) ROBERTA (Liu et al., 2019), which is an improved version of BERT that uses masked language modeling to pre-train an English Transformer model, and (ii) XLM-R (Conneau et al., 2020), which is a multilingual extension of the former pre-trained on 100 languages. In both cases, we use the large models released by the authors in the fairseq repository. As discussed next, we explore different variants of the training set to fine-tune each model on different tasks. At test time, we try both machine translating the test set into English (TRANSLATE-TEST) and, in the case of XLM-R, using the actual test set in the target language (ZERO-SHOT).

Training variants
We try 3 variants of each training set to fine-tune our models: (i) the original one in English (ORIG), (ii) an English paraphrase of it generated through back-translation using Spanish or Finnish as pivot (BT-ES and BT-FI), and (iii) a machine translated version in Spanish or Finnish (MT-ES and MT-FI). For sentences occurring multiple times in the training set (e.g., premises repeated for multiple hypotheses), we use the exact same translation for all occurrences, as our goal is to understand the inherent effect of translation rather than its potential application as a data augmentation method.
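The requirement that repeated sentences share one translation can be illustrated with a minimal sketch. It assumes a generic `translate` callable (hypothetical, not the system used in the paper) that may be stochastic, as is the case with sampling decoding, and caches its output per unique sentence. The function name and data layout are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple


def translate_consistently(
    pairs: List[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Translate premises and hypotheses independently,
    reusing a single translation for every repeated sentence."""
    cache: Dict[str, str] = {}

    def cached(sentence: str) -> str:
        # Only call the (possibly stochastic) translator the first time
        # a sentence is seen; later occurrences reuse the stored output.
        if sentence not in cache:
            cache[sentence] = translate(sentence)
        return cache[sentence]

    return [(cached(premise), cached(hypothesis)) for premise, hypothesis in pairs]
```

For a back-translation variant such as BT-ES, `translate` would presumably be the composition of an English-to-pivot and a pivot-to-English system; for MT-ES it would be the English-to-Spanish system alone.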
In order to train the machine translation systems for MT-XX and BT-XX, we use the big Transformer model (Vaswani et al., 2017) with the same settings as  and SentencePiece tokenization (Kudo and Richardson, 2018) with a joint vocabulary of 32k subwords. For English-Spanish, we train for 10 epochs on all parallel data from WMT 2013 (Bojar et al., 2013) and ParaCrawl v5.0 (Esplà et al., 2019). For English-Finnish, we train for 40 epochs on Europarl and Wiki Titles from WMT 2019 (Barrault et al., 2019), ParaCrawl v5.0, and DGT, EUbookshop and TildeMODEL from OPUS (Tiedemann, 2012). In both cases, we remove sentences longer than 250 tokens, with a source/target ratio exceeding 1.5, or for which langid.py (Lui and Baldwin, 2012) predicts a different language, resulting in a final corpus size of 48M and 7M sentence pairs, respectively. We use sampling decoding with a temperature of 0.5 for inference, which produces more diverse translations than beam search  and performed better in our preliminary experiments.
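The corpus filtering criteria above could be sketched as follows. This is an illustration under stated assumptions rather than the exact preprocessing used: tokens are approximated by whitespace splitting, the length ratio is applied symmetrically, and `identify_language` is a hypothetical pluggable callable standing in for the language identifier.

```python
from typing import Callable

MAX_TOKENS = 250   # sentences longer than this are removed
MAX_RATIO = 1.5    # maximum allowed source/target length ratio


def keep_pair(
    src: str,
    tgt: str,
    src_lang: str,
    tgt_lang: str,
    identify_language: Callable[[str], str],
) -> bool:
    """Return True if a sentence pair passes all three filters."""
    # Assumption: whitespace tokens; the text does not say which tokenization is counted.
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if src_len > MAX_TOKENS or tgt_len > MAX_TOKENS:
        return False
    # Assumption: the ratio is checked in both directions.
    if max(src_len, tgt_len) / min(src_len, tgt_len) > MAX_RATIO:
        return False
    # Drop pairs where either side is identified as a different language.
    return identify_language(src) == src_lang and identify_language(tgt) == tgt_lang
```

With langid.py installed, one plausible choice for the callable is `lambda s: langid.classify(s)[0]`, since `langid.classify` returns a (language, score) pair.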

Tasks and evaluation procedure
We use the following tasks for our experiments:
Natural Language Inference (NLI). Given a premise and a hypothesis, the task is to determine whether there is an entailment, neutral or contradiction relation between them. We fine-tune our models on MultiNLI for 10 epochs using the same settings as Liu et al. (2019). In most of our experiments, we evaluate on XNLI (Conneau et al., 2018), which comprises 2490 development and 5010 test instances in 15 languages. These were originally annotated in English, and the resulting premises and hypotheses were independently translated into the rest of the languages by professional translators. For the TRANSLATE-TEST approach, we use the machine translated versions from the authors. Following Conneau et al. (2020), we select the best epoch checkpoint according to the average accuracy in the development set.
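The checkpoint selection step can be illustrated with a minimal sketch. It assumes that development accuracies are stored per epoch and per language and that the average is taken over languages; the function name, data layout and numbers are hypothetical and not taken from the paper.

```python
from typing import Dict


def select_best_epoch(dev_accuracy: Dict[int, Dict[str, float]]) -> int:
    """Return the epoch with the highest dev accuracy averaged over languages."""
    def average(epoch: int) -> float:
        per_language = dev_accuracy[epoch]
        return sum(per_language.values()) / len(per_language)

    return max(dev_accuracy, key=average)


# Hypothetical accuracies, for illustration only.
example = {
    1: {"en": 88.0, "es": 82.0, "fi": 78.0},
    2: {"en": 89.0, "es": 84.0, "fi": 80.0},
    3: {"en": 90.0, "es": 83.0, "fi": 79.0},
}
best = select_best_epoch(example)  # epoch 2 has the highest average
```

Selecting on the average rather than on English alone avoids favoring a checkpoint that is strong in the source language but transfers poorly.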
Question Answering (QA). Given a context paragraph and a question, the task is to identify the span answering the question in the context. We fine-tune our models on SQuAD v1.1 (Rajpurkar et al., 2016) for 2 epochs using the same settings as Liu et al. (2019), and report test results for the last epoch. We use two datasets for evaluation: XQuAD (Artetxe et al., 2020), a subset of the SQuAD development set translated into 10 other languages, and MLQA (Lewis et al., 2020), a dataset consisting of parallel context paragraphs plus the corresponding questions annotated in English and translated into 6 other languages. In both cases, the translation was done by professional translators at the document level (i.e., when translating a question, the text answering it was also shown). For our BT-XX and MT-XX variants, we translate the context paragraph and the questions independently, and map the answer spans using the same procedure as Carrino et al. (2020).
Both for NLI and QA, we run each system 5 times with different random seeds and report the average results. Space permitting, we also report the standard deviation across the 5 runs. In our result tables, we use an underline to highlight the best result within each block, and boldface to highlight the best overall result.

NLI experiments
We next discuss our main results in the XNLI development set (§4.1, §4.2), run additional experiments to better understand the behavior of our different variants (§4.3, §4.4, §4.5), and compare our results to previous work in the XNLI test set (§4.6).

TRANSLATE-TEST results
We start by analyzing XNLI development results for TRANSLATE-TEST. Recall that, in this approach, the test set is machine translated into English, but training is typically done on original English data. Our BT-ES and BT-FI variants close this gap by training on a machine translated English version of the training set generated through back-translation. As shown in Table 1, this brings substantial gains for both ROBERTA and XLM-R, with an average improvement of 4.6 points in the best case. Quite remarkably, MT-ES and MT-FI also outperform ORIG by a substantial margin, and are only 0.8 points below their BT-ES and BT-FI counterparts. Recall that, for these two systems, training is done in machine translated Spanish or Finnish, while inference is done in machine translated English. This shows that the loss of performance when generalizing from original data to machine translated data is substantially larger than the loss of performance when generalizing from one language to another.

ZERO-SHOT results
We next analyze the results for the ZERO-SHOT approach. In this case, inference is done in the test set in each target language which, in the case of XNLI, was human translated from English. As such, different from the TRANSLATE-TEST approach, neither training on original data (ORIG) nor training on machine translated data (BT-XX and MT-XX) makes use of the exact same type of text that the system is exposed to at test time. However, as shown in Table 1, both BT-XX and MT-XX outperform ORIG by approximately 2 points, which suggests that our (back-)translated versions of the training set are more similar to the human translated test sets than the original one. This also provides a new perspective on the TRANSLATE-TRAIN approach, which was reported to outperform ORIG in previous work (Conneau and Lample, 2019): while the original motivation was to train the model on the same language that it is tested on, our results show that machine translating the training set is beneficial even when the target language is different.

Original vs. translated test sets
So as to understand whether the improvements observed so far are limited to translated test sets or apply more generally, we conduct additional experiments comparing translated test sets to original ones. However, to the best of our knowledge, all existing non-English NLI benchmarks were created through translation. For that reason, we build a new test set that mimics XNLI, but is annotated in Spanish rather than English. We first collect the premises from a filtered version of CommonCrawl (Buck et al., 2014), taking a subset of 5 websites that represent a diverse set of genres: a newspaper, an economy forum, a celebrity magazine, a literature blog, and a consumer magazine. We then ask native Spanish annotators to generate an entailment, a neutral and a contradiction hypothesis for each premise. Unlike XNLI, we do not collect 4 additional labels for each example. Note, however, that XNLI kept the original label as the gold standard, so the additional labels are irrelevant for the actual evaluation. This is not entirely clear in Conneau et al. (2018), but can be verified by inspecting the dataset. We collect a total of 2490 examples using this procedure, which is the same size as the XNLI development set. Finally, we create a human translated and a machine translated English version of the dataset using professional translators from Gengo and our machine translation system described in §3.2, respectively; for the latter, we use beam search instead of sampling decoding. We report results for the best epoch checkpoint on each set. As shown in Table 2, both BT-XX and MT-XX clearly outperform ORIG in all test sets created through translation, which is consistent with our previous results. In contrast, the best results on the original English set are obtained by ORIG, and neither BT-XX nor MT-XX obtains any clear improvement on the one in Spanish either (note that the standard deviations are around 0.3). This confirms that the underlying phenomenon is limited to translated test sets. In addition, it is worth mentioning that the results for the machine translated test set in English are slightly better than those for the human

Stress tests
In order to better understand how systems trained on original and translated data differ, we run additional experiments on the NLI Stress Tests (Naik et al., 2018), which were designed to test the robustness of NLI models to specific linguistic phenomena in English. The benchmark consists of a competence test, which evaluates the ability to understand antonymy relations and perform numerical reasoning, a distraction test, which evaluates the robustness to shallow patterns like lexical overlap and the presence of negation words, and a noise test, which evaluates robustness to spelling errors. Just as with previous experiments, we report results for the best epoch checkpoint in each test set. As shown in Table 3, ORIG outperforms BT-FI and MT-FI on the competence test by a large margin, but the opposite is true on the distraction test. In particular, our results show that BT-FI and MT-FI are less reliant on lexical overlap and the presence of negation words. This is intuitive, as translating the premise and hypothesis independently (as BT-FI and MT-FI do) is likely to reduce the lexical overlap between them. More generally, the translation process can alter similar superficial patterns in the data, which NLI models are sensitive to (§2). This would explain why the resulting models behave differently on different stress tests.
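To make the notion of lexical overlap concrete, the following sketch computes one simple variant of it: the fraction of hypothesis tokens that also occur in the premise. This particular definition, the whitespace tokenization and the example sentences are assumptions made for illustration; the paper does not prescribe a specific overlap metric.

```python
def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    shared = sum(token in premise_tokens for token in hypothesis_tokens)
    return shared / len(hypothesis_tokens)


# Invented example: the same hypothesis before and after an independent paraphrase.
premise = "The doctor visited the lawyer"
original_hypothesis = "The lawyer visited the doctor"
paraphrased_hypothesis = "The attorney saw the physician"

overlap_original = lexical_overlap(premise, original_hypothesis)        # 1.0
overlap_paraphrased = lexical_overlap(premise, paraphrased_hypothesis)  # 0.4
```

The paraphrased hypothesis preserves the meaning but shares far fewer words with the premise, which is the kind of shift that independent translation of premise and hypothesis can introduce.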

Output class distribution
With the aim of understanding the effect of the previous phenomenon in cross-lingual settings, we look at the output class distribution of our different models in the XNLI development set. As shown in Table 4, the predictions of all systems are close to the true class distribution in the case of English. Nevertheless, ORIG is strongly biased for the rest of the languages, and tends to underpredict entailment and overpredict neutral. This can again be attributed to the fact that the English test set is original, whereas the rest are human translated. In particular, it is well-known that NLI models tend to predict entailment when there is a high lexical overlap between the premise and the hypothesis (§2). However, the degree of overlap will be smaller in the human translated test sets given that the premise and the hypothesis were translated independently, which explains why entailment is underpredicted. In contrast, BT-FI and MT-FI are exposed to the exact same phenomenon during training, which explains why they are not that heavily affected.
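The output class distribution discussed above can be computed with a few lines of code. The sketch below is illustrative: the label names follow the three NLI classes, while the function name and the toy predictions are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("entailment", "neutral", "contradiction")


def class_distribution(predictions: Sequence[str]) -> Dict[str, float]:
    """Percentage of examples assigned to each NLI class."""
    if not predictions:
        return {label: 0.0 for label in LABELS}
    counts = Counter(predictions)
    return {label: 100.0 * counts[label] / len(predictions) for label in LABELS}


# Toy predictions, for illustration only: neutral is overpredicted.
toy = ["entailment", "neutral", "neutral", "contradiction"]
distribution = class_distribution(toy)
```

Comparing this distribution with the one obtained from the gold labels of the same set shows directly which classes a system over- or underpredicts in each language.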
So as to measure the impact of this phenomenon, we explore a simple approach to correct this bias: having fine-tuned each model, we adjust the bias term added to the logit of each class so that the model predictions match the true class distribution for each language. We achieve this using an iterative procedure where, at each step, we select one class and set its bias term so that the class is selected for the right percentage of examples. As shown in Table 5, this brings large improvements for ORIG, but is less effective for BT-FI and MT-FI. This shows that the performance of ORIG was considerably hindered by this bias, which BT-FI and MT-FI effectively mitigate. Note that we are adjusting the bias term in the evaluation set itself, which requires knowing its class distribution and is thus a form of cheating. While useful for analysis, a fair comparison would require adjusting the bias term in a separate validation set. This is what we do for our final results in §4.6, where we adjust the bias term in the XNLI development set and report results on the XNLI test set.
Table 5: XNLI dev results with class distribution unbiasing (average acc across all languages). Adjusting the bias term of the classifier to match the true class distribution brings large improvements for ORIG, but is less effective for BT-FI and MT-FI.

Comparison with the state-of-the-art
So as to put our results into perspective, we compare our best variant to previous work on the XNLI test set. As shown in Table 6, our method improves the state-of-the-art for both the TRANSLATE-TEST and the ZERO-SHOT approaches by 4.3 and 2.8 points, respectively. It also obtains the best overall results published to date, with the additional advantage that the previous state-of-the-art required a machine translation system between English and each of the 14 target languages, whereas our method uses a single machine translation system between English and Finnish (which is not one of the target languages). While the main goal of our work is not to design better cross-lingual models, but to analyze their behavior in connection to translation, this shows that the phenomenon under study is highly relevant, to the extent that it can be exploited to improve the state-of-the-art.
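The class distribution unbiasing used for these results (bias terms adjusted on the XNLI development set, then applied to the test set) is only described at a high level, so the following is one possible reading rather than the authors' implementation. It assumes access to the per-example logits, sweeps over the classes a fixed number of times, and sets each bias to the midpoint between the two margins that separate the desired number of examples; the function names and the number of sweeps are hypothetical.

```python
from typing import List, Sequence


def predict(logits: Sequence[Sequence[float]], bias: Sequence[float]) -> List[int]:
    """Predicted class index for each example after adding the class biases."""
    return [max(range(len(bias)), key=lambda j: row[j] + bias[j]) for row in logits]


def fit_class_bias(
    logits: Sequence[Sequence[float]],
    target_fractions: Sequence[float],
    sweeps: int = 20,
) -> List[float]:
    """Iteratively set one class bias at a time so that the class is
    predicted for (approximately) its target fraction of examples."""
    n, k = len(logits), len(target_fractions)
    bias = [0.0] * k
    for _ in range(sweeps):
        for c in range(k):
            # Class c wins example i exactly when bias[c] exceeds this margin.
            margins = sorted(
                max(row[j] + bias[j] for j in range(k) if j != c) - row[c]
                for row in logits
            )
            wanted = round(target_fractions[c] * n)
            if wanted <= 0:
                bias[c] = margins[0] - 1.0
            elif wanted >= n:
                bias[c] = margins[-1] + 1.0
            else:
                # Midpoint between the last margin to include and the first to exclude.
                bias[c] = (margins[wanted - 1] + margins[wanted]) / 2.0
    return bias
```

In the fair setting described in the text, `fit_class_bias` would be run on development-set logits and the resulting biases passed to `predict` on test-set logits. Because each update changes the counts of the other classes, only the class adjusted last is guaranteed to hit its target exactly; repeated sweeps are an assumption intended to bring the others close.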

QA experiments
So as to understand whether our previous findings apply to other tasks besides NLI, we run additional experiments on QA. As shown in Table 7, BT-FI and BT-ES do indeed outperform ORIG for the TRANSLATE-TEST approach on MLQA. The improvement is modest, but very consistent across different languages, models and runs. The results for MT-ES and MT-FI are less conclusive, presumably because mapping the answer spans across languages might introduce some noise. In contrast, we do not observe any clear improvement for the ZERO-SHOT approach on this dataset. Our XQuAD results in Table 8 are more positive, but still inconclusive. These results can partly be explained by the translation procedure used to create the different benchmarks: the premises and hypotheses of XNLI were translated independently, whereas the questions and context paragraphs of XQuAD were translated together. Similarly, MLQA made use of parallel contexts, and translators were shown the sentence containing each answer when translating the corresponding question. As a result, one can expect both QA benchmarks to have more consistent translations than XNLI, which would in turn diminish this phenomenon. In contrast, the questions and context paragraphs are independently translated when using machine translation, which explains why BT-ES and BT-FI outperform ORIG for the TRANSLATE-TEST approach. We conclude that the translation artifacts revealed by our analysis are not exclusive to NLI, as they also show up on QA for the TRANSLATE-TEST approach, but their actual impact can be highly dependent on the translation procedure used and the nature of the task.

Discussion
Our analysis prompts us to reconsider previous findings in cross-lingual transfer learning as follows:
The cross-lingual transfer gap on XNLI was overestimated. Given the parallel nature of XNLI, accuracy differences across languages are commonly interpreted as the loss of performance when generalizing from English to the rest of the languages. However, our work shows that there is another factor that can have a much larger impact: the loss of performance when generalizing from original to translated data. Our results suggest that the real cross-lingual generalization ability of XLM-R is considerably better than what the accuracy numbers in XNLI reflect.
Overcoming the cross-lingual gap is not what makes TRANSLATE-TRAIN work. The original motivation for TRANSLATE-TRAIN was to train the model on the same language it is tested on. However, we show that it is training on translated data, rather than training on the target language, that is key for this approach to outperform ZERO-SHOT as reported by previous authors.
Improvements previously attributed to data augmentation should be reconsidered. The method by Singh et al. (2019) combines machine translated premises and hypotheses in different languages (§2), resulting in an effect similar to BT-XX and MT-XX. As such, we believe that this method should be analyzed from the point of view of dataset artifacts rather than data augmentation, as the authors do. From this perspective, having the premise and the hypotheses in different languages can reduce the superficial patterns between them, which would explain why this approach is better than using examples in a single language.
The potential of TRANSLATE-TEST was underestimated. The previous best results for TRANSLATE-TEST on XNLI lagged behind the state-of-the-art by 4.6 points. Our work reduces this gap to only 0.8 points by addressing the underlying translation artifacts. The reason why TRANSLATE-TEST is more severely affected by this phenomenon is twofold: (i) the effect is doubled by first using human translation to create the test set and then machine translation to translate it back to English, and (ii) TRANSLATE-TRAIN was inadvertently mitigating this issue (see above), but equivalent techniques were never applied to TRANSLATE-TEST.
Future evaluation should better account for translation artifacts. The evaluation issues raised by our analysis do not have a simple solution. In fact, while we use the term translation artifacts to highlight that they are an unintended effect of translation that impacts final evaluation, one could also argue that it is the original datasets that contain the artifacts, which translation simply alters or even mitigates. In any case, this is a more general issue that falls beyond the scope of

Conclusions
In this paper, we have shown that both human and machine translation can alter superficial patterns in data, which requires reconsidering previous findings in cross-lingual transfer learning. Based on the gained insights, we have improved the state-of-the-art in XNLI for the TRANSLATE-TEST and ZERO-SHOT approaches by a substantial margin. Finally, we have shown that the phenomenon is not specific to NLI but also affects QA, although it is less pronounced there thanks to the translation procedure used in the corresponding benchmarks. So as to facilitate similar studies in the future, we release our NLI dataset, which, unlike previous benchmarks, was annotated in a non-English language and human translated into English.