Tagged Back-translation Revisited: Why Does It Really Work?

In this paper, we show that neural machine translation (NMT) systems trained on large amounts of back-translated data overfit some of the characteristics of machine-translated texts. Such NMT systems better translate human-produced translations, i.e., translationese, but may largely worsen the translation quality of original texts. Our analysis reveals that adding a simple tag to back-translations prevents this quality degradation and improves the overall translation quality on average, by helping the NMT system to distinguish back-translated data from original parallel data during training. We also show that, in contrast to high-resource configurations, NMT systems trained in low-resource settings are much less prone to overfitting back-translations. We conclude that back-translations in the training data should always be tagged, especially when the origin of the text to be translated is unknown.


Introduction
During training, neural machine translation (NMT) can leverage a large amount of monolingual data in the target language. Among existing ways of exploiting monolingual data in NMT, the so-called back-translation of monolingual data (Sennrich et al., 2016a) is undoubtedly the most prevalent one, as it remains widely used in state-of-the-art NMT systems (Barrault et al., 2019). NMT systems trained on back-translated data can generate more fluent translations (Sennrich et al., 2016a) thanks to the use of much larger data in the target language to better train the decoder, especially in low-resource conditions where only a small quantity of parallel training data is available. However, the impact of the noisiness of the synthetic source sentences generated by NMT remains largely unclear and understudied. Edunov et al. (2018) even showed that introducing synthetic noise in back-translations actually improves translation quality and enables the use of a much larger quantity of back-translated data for further improvements in translation quality. More recently, Caswell et al. (2019) empirically demonstrated that adding a unique token at the beginning of each back-translation acts as a tag that helps the system during training to differentiate back-translated data from the original parallel training data, and is as effective as introducing synthetic noise for improving translation quality. It is also much simpler, since it requires only one editing operation (adding the tag), and is non-parametric. However, it is not fully understood why adding a tag has such a significant impact and to what extent it helps to distinguish back-translated data from the original parallel data.
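The tagging scheme of Caswell et al. (2019) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag token "<BT>" and the function name are our own choices.

```python
def tag_back_translations(source_sentences, tag="<BT>"):
    """Prepend a reserved tag token to each back-translated source sentence.

    The tag should be a single token that never occurs in the original
    parallel data, so that the model can learn to associate it with
    back-translated inputs during training.
    """
    return [f"{tag} {sentence}" for sentence in source_sentences]

# Only the synthetic (back-translated) source side is tagged;
# the original parallel data is left untouched.
synthetic = ["das ist ein Test .", "guten Morgen ."]
tagged = tag_back_translations(synthetic)
# tagged == ["<BT> das ist ein Test .", "<BT> guten Morgen ."]
```

At test time, the tag is simply omitted (or, as explored later in the paper, deliberately added to mark the input as translationese).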
In this paper, we report on the impact of tagging back-translations in NMT, focusing on the following research questions (see Section 2 for our motivation).
Q1. Do NMT systems trained on large back-translated data capture some of the characteristics of human-produced translations, i.e., translationese?
Q2. Does a tag for back-translations really help differentiate translationese from original texts?
Q3. Are NMT systems trained on back-translation for low-resource conditions as sensitive to translationese as in high-resource conditions?

Motivation
During training with back-translated data (Sennrich et al., 2016a), we can expect the NMT system to learn the characteristics of back-translations, i.e., translations generated by NMT, and such characteristics will consequently be exhibited at test time. However, translating translations is a rather artificial task, whereas users usually want to translate original texts. Nonetheless, many of the test sets used by the research community for evaluating MT systems actually contain a large portion of texts that are translations produced by humans, i.e., translationese. Translationese texts are known to be much simpler, with a lower mean sentence length, and more standardized than original texts (Laviosa-Braithwaite, 1998). These characteristics overlap with those of translations generated by NMT systems, which have been shown to be simpler, shorter, and to exhibit a less diverse vocabulary than original texts (Burlot and Yvon, 2018). These similarities raise Q1. Caswell et al. (2019) hypothesized that tagging back-translations helps the NMT system during training to make some distinction between the back-translated data and the original parallel data. Even though the effectiveness of a tag has been empirically demonstrated, the nature of this distinction remains unclear. Thus, we pose Q2.
The initial motivation for back-translation is to improve NMT for low-resource language pairs by augmenting the training data. Therefore, we verify whether our answers to Q1 and Q2 for high-resource conditions are also valid in low-resource conditions, answering Q3.

Data
As parallel data for training our NMT systems, we used all the parallel data provided for the shared translation tasks of WMT19 (http://www.statmt.org/wmt19/translation-task.html) for English-German (en-de), excluding the Paracrawl corpus, and WMT15 (http://www.statmt.org/wmt15/translation-task.html) for English-French (en-fr). After pre-processing and cleaning, we obtained 5.2M and 32.8M sentence pairs for en-de and en-fr, respectively. As monolingual data for each of English, German, and French to be used for back-translation, we concatenated all the News Crawl corpora provided by WMT and randomly extracted 25M sentences. For our simulation of low-resource conditions, we randomly sub-sampled 200k sentence pairs from the parallel data to train NMT systems, and used these systems to back-translate 1M sentences randomly sub-sampled from the monolingual data. For validation, i.e., selecting the best model after training, we chose newstest2016 for en-de and newstest2013 for en-fr, since their source sides are rather balanced between translationese and original texts. For evaluation, since most of the WMT test sets contain both original and translationese texts, we used all the newstest sets, from WMT10 to WMT19 for en-de, and from WMT08 to WMT15 for en-fr. All our data were pre-processed in the same way: we performed tokenization and truecasing with Moses (Koehn et al., 2007).
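The random sub-sampling of aligned sentence pairs described above can be sketched as follows. This is an illustrative sketch under our own naming; the paper does not specify how the sampling was implemented, and the seed value is a placeholder.

```python
import random

def subsample_parallel(src_sentences, tgt_sentences, n, seed=1234):
    """Randomly sub-sample n sentence pairs, keeping source and target aligned.

    Sampling indices (rather than each side independently) guarantees that
    the i-th sampled source sentence still corresponds to the i-th sampled
    target sentence.
    """
    assert len(src_sentences) == len(tgt_sentences)
    rng = random.Random(seed)  # fixed seed for reproducibility
    indices = rng.sample(range(len(src_sentences)), n)
    return ([src_sentences[i] for i in indices],
            [tgt_sentences[i] for i in indices])
```

The same routine applies to sampling the 200k parallel pairs for the low-resource simulation and (on a single list) the monolingual sentences to back-translate.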

NMT Systems
For NMT, we used the Transformer (Vaswani et al., 2017) implemented in Marian (Junczys-Dowmunt et al., 2018) with the standard hyper-parameters for training a Transformer base model. To compress the vocabulary, we learned 32k byte-pair encoding (BPE) operations (Sennrich et al., 2016b) for each side of the parallel training data.
The back-translations were generated by decoding the sampled monolingual sentences with Marian, using beam search with a beam size of 12 and a length normalization of 1.0. The back-translated data were then concatenated with the original parallel data, and a new NMT model was trained from scratch using the same hyper-parameters as the model that generated the back-translations.
We evaluated all systems with BLEU (Papineni et al., 2002) computed by sacreBLEU (Post, 2018). To evaluate only on the part of a test set that has original text or translationese on the source side, we used the --origlang option of sacreBLEU with the value "non-L1" for translationese texts and "L1" for original texts, where L1 is the source language, and report their respective BLEU scores.
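The origin-based filtering performed by sacreBLEU's --origlang option can be approximated as follows. This is a simplified sketch: the function name and the flat list-of-origins data layout are our own, whereas sacreBLEU reads the origin metadata from the WMT SGML test files.

```python
def split_by_origin(sentences, origins, source_lang):
    """Split test sentences into original texts and translationese.

    origins[i] is the language the i-th sentence was originally written in.
    A sentence originally written in the source language is "original";
    any other sentence was translated into the source language by humans,
    i.e., it is translationese.
    """
    original, translationese = [], []
    for sent, orig in zip(sentences, origins):
        (original if orig == source_lang else translationese).append(sent)
    return original, translationese
```

BLEU can then be computed separately on each part, mirroring the "L1" vs. "non-L1" evaluation reported in the paper.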

Results in Resource-Rich Conditions
Our results with back-translations (BT) and tagged back-translations (T-BT) are presented in Table 1. When using BT, we consistently observed a drop in BLEU scores for original texts across all translation tasks, with the largest drop of 12.1 BLEU points (en→fr, 2014). Conversely, BLEU scores for translationese texts improved for most tasks, with the largest gain of 10.4 BLEU points (de→en, 2018). These results answer Q1: NMT overfits back-translations, potentially because they are much larger than the original parallel data used for training. Interestingly, using back-translations does not consistently improve translation quality. We assume that the newstest sets may exhibit different characteristics of translationese from one year to another.
(Footnotes: For WMT14, we used the "full" version of the test set instead of the default filtered version in sacreBLEU, which does not contain information on the origin of the source sentences. The full list of hyper-parameters is provided in the supplementary material (Appendix A). sacreBLEU signatures, where "L1" and "L2" respectively indicate a two-letter identifier for the source and target languages of either de-en, en-de, fr-en, or en-fr, and "XXX" the name of the test set: BLEU+case.mixed+lang.L1-L2+numrefs)
Prepending a tag (T-BT) had a strong impact on translation quality for original texts, recovering or even surpassing the quality obtained by the NMT system trained without back-translated data, and always outperforming BT. The large improvements in BLEU scores over BT show that a tag helps in identifying translationese (answer to Q2). In the supplementary material (Appendix B), we present additional results obtained using more back-translations (up to 150M sentences), showing a similar impact of tags.
However, while a tag in such a configuration prevents an even larger drop in BLEU scores, it is not sufficient to attain a BLEU score similar to the configurations that use fewer back-translations.
Interestingly, the best NMT system was not always the same, depending on the translation direction and the origin of the test sets. It is thus possible to select either model to obtain the best translation quality given the origin of the source sentences, according to the results on the validation set, for instance. (Since this observation is rather secondary, we present results for best model selection in the supplementary material, Appendix C. Note also that these BLEU scores can potentially be further increased by using a validation set whose source side consists of either original texts or translationese, respectively, to translate original texts or translationese at test time.)

Results in Low-Resource Conditions
In low-resource conditions, as reported in Table 2, translation quality can be notably improved by adding back-translations. Using BT, we observed improvements in BLEU scores ranging from 0.7 (fr→en, 2011) to 12.4 (de→en, 2010) BLEU points for original texts, and from 2.1 (en→de, 2011) to 21.1 (de→en, 2018) BLEU points for translationese texts. These results are in line with one of the initial motivations for using back-translation: improving translation quality in low-resource conditions. In this setting, without back-translated data, the data in the target language are too small for the NMT system to learn reasonably good representations of the target language. Adding 5 times more data in the target language, through back-translation, clearly helps the systems, without any negative impact from the noisiness of the back-translations generated by the initial system. We assume here that since the quality of the back-translations is very low, their characteristics are quite different from those of translationese texts. This is confirmed by our observation that adding the tag has only a negligible impact on the BLEU scores for all the tasks (answer to Q3).

Tagged Test Sets
A tag on back-translations helps the NMT system identify translationese during training. Thus, adding the same tag to the test sets should have a very different impact depending on the origin of the source sentences. If we tag original sentences and decode them with a T-BT model, we enforce the decoding of translationese. Since we mislead the decoder, translation quality should drop. On the other hand, by tagging translationese sentences, we help the decoder, which can now rely on the tag to be confident that the text to decode is translationese. Our results, presented in Table 3, confirm this expectation.

Discussions
We empirically demonstrated that training NMT on back-translated data overfits some of its characteristics, which are partly similar to those of translationese. Using back-translation improves translation quality for translationese texts but worsens it for original texts. Previous work (Graham et al., 2019; Zhang and Toral, 2019) showed that state-of-the-art NMT systems are better at translating translationese than original texts. Our results show that this is partly due to the use of back-translations, which is also confirmed by concurrent and independent work (Bogoychev and Sennrich, 2019; Edunov et al., 2019). Adding a tag to back-translations prevents a large drop in translation quality on original texts, while the improvements in translation quality for translationese texts remain and may be further boosted by tagging test sentences at decoding time. Moreover, in low-resource conditions, we show that the overall tendency is significantly different from high-resource conditions: back-translation improves translation quality for both translationese and original texts, while adding a tag to back-translations has only a small impact. We conclude from this study that training NMT on back-translated data, in high-resource conditions, remains reasonable when the user knows in advance that the system will be used to translate translationese texts. If the user does not know this a priori, a tag should be added to back-translations during training to prevent a possible large drop in translation quality.
For future work, following the work on automatic identification of translationese (Rabinovich and Wintner, 2015; Rubino et al., 2016), we plan to investigate the impact of tagging translationese texts inside parallel training data, such as parallel sentences collected from the Web.

A NMT system hyper-parameters
For training NMT systems with Marian 1.7.6 (1d4ba73) on 8 GPUs, we used the hyper-parameters presented in Table 4 and kept the remaining ones at their default values.

C Best Model Selection
As discussed in Section 3.3, among the original model, the one trained with back-translation (BT), and the one trained with tagged back-translation (T-BT), the best-performing model is not always the same, depending on the translation direction. For de→en and en→de, the best model is always T-BT. However, for fr→en, the system that does not use any back-translation is the best for translating original texts, while T-BT is the best for translationese texts. For en→fr, the best system for translating translationese texts is BT, while the best system for translating original texts is T-BT. This selection is performed by evaluating the translation quality of each model on the original and translationese parts of the validation sets. By applying this selection strategy, we can significantly improve the overall translation quality for given test sets, as reported in Table 6.

Table 6: BLEU scores for all the systems for en-fr on the overall test sets. "selection" denotes that decoding is performed by using the best model given the origin of the source sentence.
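The selection strategy described above can be sketched as follows. The function name is our own, and the validation scores in the example are placeholders for illustration, not the paper's numbers; only the resulting per-origin ranking (T-BT best for original texts, BT best for translationese, as in en→fr) mirrors the discussion above.

```python
def select_models(val_bleu):
    """For each text origin, pick the model with the highest validation BLEU.

    val_bleu maps (model_name, origin) -> BLEU score on the corresponding
    part of the validation set, where origin is "original" or
    "translationese". At test time, each source sentence is then decoded
    with the model selected for its origin.
    """
    best = {}
    for (model, origin), score in val_bleu.items():
        if origin not in best or score > best[origin][1]:
            best[origin] = (model, score)
    return {origin: model for origin, (model, _) in best.items()}

# Placeholder validation scores, for illustration only.
val_bleu = {
    ("baseline", "original"): 30.1, ("baseline", "translationese"): 28.0,
    ("BT", "original"): 27.5, ("BT", "translationese"): 31.2,
    ("T-BT", "original"): 31.0, ("T-BT", "translationese"): 31.0,
}
# select_models(val_bleu) -> {"original": "T-BT", "translationese": "BT"}
```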