Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation

Neural Machine Translation (NMT) models have proven strong when translating clean text, but they are very sensitive to noise in the input. Improving the robustness of NMT models can be seen as a form of “domain” adaptation to noise. The recently created Machine Translation on Noisy Text task corpus provides noisy-clean parallel data for a few language pairs, but this data is very limited in size and diversity. The state-of-the-art approaches depend heavily on large volumes of back-translated data. This paper makes two main contributions. First, we propose new data augmentation methods to extend limited noisy data and further improve NMT robustness to noise while keeping the models small. Second, we explore the effect of utilizing noise from external data in the form of speech transcripts and show that it could help robustness.

Despite this success, NMT models still lack robustness when applied to noisy sentences. Belinkov and Bisk (2017) show that character-level perturbations can cause a significant decrease in translation quality. They point out that training on noisy data, which can be seen as adversarial training, might help to improve model robustness. Michel and Neubig (2018) propose the Machine Translation on Noisy Text (MTNT) dataset, which contains parallel sentence pairs built from comments crawled from Reddit (www.reddit.com) and their manual translations. This dataset contains user-generated text with different kinds of noise, e.g., typos, grammatical errors, emojis, and spoken language, for two language pairs.
In the WMT19 Robustness Task (Li et al., 2019), improving NMT robustness is treated as a domain adaptation problem. The MTNT dataset is used as in-domain data, where models are trained with clean data and adapted to noisy data. Domain adaptation is conducted in two main ways: fine tuning on in-domain data (Dabre and Sumita, 2019; Post and Duh, 2019) and mixed training with domain tags (Berard et al., 2019; Zheng et al., 2019). The noisy data provided by the shared task is small, with only thousands of noisy sentence pairs in each direction. Hence most approaches participating in the task performed noisy data augmentation using back translation (Berard et al., 2019; Helcl et al., 2019; Zheng et al., 2019), with some approaches also directly adding synthetic noise (Berard et al., 2019). The robustness of an NMT model can be seen as denoising source sentences (e.g. dealing with typos) while keeping a similar level of informal language in the translations (e.g. keeping emojis/emoticons). Based on this assumption, we believe that back translation of clean texts, although it provides a large volume of extra data, is limited since it removes most types of noise from the translations. In addition to adapting models on noisy parallel data, other techniques have been used to improve performance, generally measured according to BLEU (Papineni et al., 2002) against clean references. For example, Berard et al. (2019) apply inline casing by adding special tokens before each word to represent word casing. Murakami et al. (2019) use placeholders to help translate sentences with emojis.
In this paper, we also explore data augmentation for robustness but focus on techniques other than back translation. We follow the WMT19 Robustness Task and conduct experiments under constrained and unconstrained data settings on Fr↔En as language pairs. Under the constrained setting, we only use datasets provided by the shared task, and propose new data augmentation methods to generate noise from this data. We compare back translation (BT) (Sennrich et al., 2016a) with forward translation (FT) on noisy texts and find that pseudo-parallel data from forward translation improves robustness more. We also adapt the idea of fuzzy matches from Bulte and Tezcan (2019) to the MTNT case by finding similar sentences in a parallel corpus to augment the limited noisy data. Results show that the fuzzy match method can extend noisy parallel data and improve model performance on both noisy and clean texts. The proposed techniques substantially outperform the baseline. While they still lag behind the winning submission in the WMT19 shared task, the resulting models are trained on much less clean data, complemented by augmented noisy data, leading to faster and more efficient training. Under the unconstrained setting, we propose for the first time the use of speech datasets, in two forms: (a) the IWSLT (Cettolo et al., 2012) and MuST-C (Di Gangi et al., 2019) human transcripts as a source of spontaneous, clean speech data, and (b) automatically generated transcripts for the MuST-C dataset as another source of noise. We show that using informal language from spoken language datasets can also help to increase NMT robustness.
This paper is structured as follows: Section 2 introduces the data augmentation methods for noisy texts, including the previously proposed methods and our approaches to data augmentation. Section 3 describes our experimental settings, including the datasets we used, the augmented data and the baseline model. Section 4 shows the results of models built from different training data and evaluated on both clean and noisy test sets.

Previous Work
Considering the limited size of noisy parallel data, data augmentation methods are commonly used to generate more noisy training materials.
Previous methods include back translation, injecting synthetic noise, and adversarial attacks. In the WMT19 Robustness Task, back translation on monolingual data was used to generate noisy parallel sentences (Murakami et al., 2019; Zhou et al., 2019; Berard et al., 2019; Helcl et al., 2019). Zheng et al. (2019) proposed an extension of back translation that generates noisy translations from clean monolingual data. After reversing the direction, the noisy translations become the source, which simulates the noisy source sentences of the MTNT parallel data. Belinkov and Bisk (2017) and Karpukhin et al. (2019) inject synthetic noise into clean data to form noisy parallel data. However, rule-based synthetic noise injection is limited to certain types of noise. Cheng et al. (2018, 2019) propose adversarial methods that inject random noise into clean training data.
We explore the following new methods as alternative ways to augment noisy parallel data.

Fuzzy Matches
We adapted the method of Bulte and Tezcan (2019) for augmenting data from a parallel corpus. The original method uses a monolingual corpus to find sentences similar to the source sentences in a parallel corpus (S, T), and then reuses the translations of the original source sentences as translations for the similar sentences found. We adapted it to use only the provided noisy training corpus. For each source sentence s_i ∈ S in the training set, all other source sentences s_j ∈ S (s_i ≠ s_j) are compared with this sentence by measuring the string similarity Sim(s_i, s_j). If the similarity of the two sentences is above a threshold λ, the two sentences are mapped to each other's corresponding target sentence and the two new sentence pairs (s_i, t_j) and (s_j, t_i) are added to our augmented training data. The similarity is measured with Levenshtein distance (Levenshtein, 1966) on the token level. The similarity score is calculated from the edit distance normalized by the minimum length of the two sentences (Equation 1). In addition to fuzzy matches in the parallel corpus, we experimented with the monolingual corpus by mapping a sentence m_i in the monolingual corpus to its fuzzy match's target sentence t_j: if Sim(m_i, s_j) > λ, we add the new sentence pair (m_i, t_j) to the augmented training data.
To boost the speed of finding matches, we followed the approach in Bulte and Tezcan (2019) and used the Python library SetSimilaritySearch (https://github.com/ekzhu/SetSimilaritySearch) to select similar candidates before calculating edit distance. For each source sentence, only the top 10 similar candidates are selected to calculate the edit distance score.
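The matching procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it compares all sentence pairs by brute force instead of pre-selecting the top 10 candidates with SetSimilaritySearch, and it assumes the similarity is one minus the token-level edit distance divided by the shorter sentence length, which is one reading of the description of Equation 1. The function names and the threshold value in the usage note are hypothetical.

```python
from itertools import combinations


def token_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """ASSUMPTION: Sim = 1 - edit_distance / min(len(a), len(b))."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return 1.0 - token_edit_distance(a, b) / shortest


def fuzzy_match_pairs(sources, targets, threshold):
    """Return new (source, target) pairs built by swapping the targets
    of source sentences whose similarity exceeds the threshold."""
    tokenized = [s.split() for s in sources]
    new_pairs = []
    for i, j in combinations(range(len(sources)), 2):
        if sources[i] == sources[j]:
            continue  # identical sources add no new information
        if similarity(tokenized[i], tokenized[j]) > threshold:
            new_pairs.append((sources[i], targets[j]))
            new_pairs.append((sources[j], targets[i]))
    return new_pairs
```

For example, calling `fuzzy_match_pairs(sources, targets, 0.5)` on a corpus in which two source sentences differ by a single token yields two extra pairs, each source paired with the other's translation. The value 0.5 is only a placeholder for λ.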

Forward Translation
Back translation (Sennrich et al., 2016a) is a very popular technique for in-domain data augmentation. In our experiments, we back-translated the MTNT monolingual data using a model fine-tuned on the MTNT parallel corpus. However, considering that the task of improving robustness is to produce less noisy output, data generated with back translation might have noisy target translations (from monolingual data) and less noisy source texts (from back translation). Since this might increase the noise level of the output texts, we also experimented with forward translation using models fine-tuned on the noisy parallel corpus. Pseudo-parallel data generated by forward translation is used for fine tuning models on the same language direction. To avoid overfitting, the models that produce the noisy forward translations are fine-tuned on the merged noisy parallel data of both language directions. The pseudo-parallel data generated by back translation and forward translation is combined with the noisy parallel data and used to fine-tune the baseline model.
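The difference between the two kinds of pseudo-parallel data lies in which side is machine-generated. The sketch below makes this explicit; the `translate_*` arguments are hypothetical stand-ins for fine-tuned NMT models (any callable mapping a sentence to its translation), not part of any real toolkit.

```python
def back_translation_pairs(target_mono, translate_tgt_to_src):
    """Noisy monolingual text in the TARGET language stays on the target
    side; its machine translation becomes the (less noisy) source."""
    return [(translate_tgt_to_src(t), t) for t in target_mono]


def forward_translation_pairs(source_mono, translate_src_to_tgt):
    """Noisy monolingual text in the SOURCE language stays on the source
    side; its machine translation becomes the (less noisy) target."""
    return [(s, translate_src_to_tgt(s)) for s in source_mono]
```

With forward translation the noisy human-written text remains the model input, which matches the goal stated above of mapping noisy input to less noisy output.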

Automatic Speech Recognition
We used an ASR system to transcribe audio files into text. In this case, we would expect noise to be generated during the process of automatic speech recognition. We selected a dataset with both audio and human transcripts, namely the MuST-C dataset. This dataset provides audio files A, human transcripts S, and translations T of the transcripts into another language. We used the Google Speech-to-Text API (https://cloud.google.com/speech-to-text/) to transcribe the audio files into automatic transcripts S′. The human and ASR transcripts of the audio (S and S′) are treated as the source texts while the translations T are the target texts. We formed a new set of parallel data (S′, T) with the ASR-generated texts and the corresponding gold translations by humans. Looking into the ASR transcripts, we found that the ASR system tends to skip some sentences when the speech is fast, so we filtered the data based on the length ratio of the human transcripts and the ASR transcripts. We set a ratio threshold λ. For each pair (s′_i, t_i) in the ASR parallel data (S′, T), we removed the pair if the length of the human transcript s_i divided by the length of the ASR transcript s′_i is larger than the threshold λ.
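The length ratio filter can be illustrated with the short sketch below. It assumes that the ratio compares the human transcript with the ASR transcript, as in the description of the filter above, and that length is counted in whitespace-separated tokens (the unit is not specified here). The function name is hypothetical.

```python
def filter_asr_pairs(human_transcripts, asr_transcripts, translations, max_ratio):
    """Keep (ASR transcript, translation) pairs whose human/ASR length
    ratio does not exceed max_ratio."""
    kept = []
    for human, asr, target in zip(human_transcripts, asr_transcripts, translations):
        human_len = len(human.split())  # ASSUMPTION: length in tokens
        asr_len = len(asr.split())
        if asr_len == 0:
            continue  # the ASR system produced nothing for this segment
        if human_len / asr_len > max_ratio:
            continue  # the ASR transcript likely skipped part of the speech
        kept.append((asr, target))
    return kept
```

A lower `max_ratio` is stricter: it discards more pairs in which the ASR output is much shorter than what was actually said.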

Corpora
We used all parallel corpora from the WMT19 Robustness Task on Fr↔En. For out-of-domain training, we used the WMT15 Fr↔En News Translation Task data, including Europarl v7, Common Crawl, UN, News Commentary v10, and Gigaword Corpora. In the following sections, we refer to the combination of these corpora as "clean data". The MTNT dataset is used as our in-domain data for fine tuning. We also experimented with external corpora, namely the IWSLT2017 and MuST-C corpora, to explore the effect of informal spoken language in human transcripts of speech. We also utilized the monolingual data in the MTNT dataset. The size of the parallel and monolingual data is shown in Table 1.
We used the development set in MTNT and newsdiscussdev2015 for validation. The models with the best performance on the validation set are evaluated on both noisy (MTNT and MTNT2019) and clean (newstest2014 and newsdiscusstest2015) test sets.
For preprocessing, we first tokenized the data with the Moses tokenizer (Koehn et al., 2007) and applied Byte Pair Encoding (BPE) (Sennrich et al., 2016b) to segment words into subwords. We experimented with a large vocabulary to include noisy as well as clean tokens and applied 50k merge operations for BPE. For evaluation, we detokenized our hypothesis files with the Moses detokenizer.
We used multi-bleu-detok.perl to evaluate the BLEU score on the test sets.

Model
Our baseline model, which only uses clean data for training, is the standard Transformer model (Vaswani et al., 2017) with default hyperparameters. The batch size is 4096 tokens (subwords). Our models are trained with OpenNMT-py (Klein et al., 2017) on a single GTX 1080 Ti for 5 epochs. We experimented with fine tuning on noisy data and mixed training with "domain" tags (Caswell et al., 2019; Kobus et al., 2016) indicating where the sentences are sourced from. We used different tags for clean data, MTNT parallel data, forward translation, back translation, ASR data, and fuzzy match data. Tags are added at the beginning of each source sentence.
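Tagging of this kind amounts to prepending one reserved token per data source to every source sentence. The sketch below shows the idea; the tag strings and dictionary keys are hypothetical, since the exact tokens used in the experiments are not given here.

```python
# ASSUMPTION: tag spellings are illustrative placeholders.
DOMAIN_TAGS = {
    "clean": "<clean>",
    "mtnt": "<mtnt>",
    "forward_translation": "<ft>",
    "back_translation": "<bt>",
    "asr": "<asr>",
    "fuzzy_match": "<fm>",
}


def tag_source(sentence, domain):
    """Prepend the tag of the given data source to a source sentence."""
    return f"{DOMAIN_TAGS[domain]} {sentence}"


def tag_corpus(pairs, domain):
    """Tag the source side of every (source, target) pair; targets are unchanged."""
    return [(tag_source(src, domain), tgt) for src, tgt in pairs]
```

In practice the tags would need to be protected from subword segmentation so that each one remains a single token in the vocabulary.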
During fine tuning on in-domain data, we continued training from the learning rate and model parameters of the baseline. We stopped fine tuning when the perplexity on the noisy validation set had not improved for 3 epochs. The best fine-tuned checkpoints are evaluated on the test sets.
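The stopping rule can be written as a small patience-based check over the per-epoch validation perplexities. This is a generic sketch of such a rule under the 3-epoch patience stated above, not the training toolkit's own logic; the function name is hypothetical.

```python
def best_checkpoint(perplexities, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation
    perplexities. Fine tuning stops once `patience` epochs have passed
    without an improvement over the best perplexity so far."""
    if not perplexities:
        raise ValueError("need at least one validation perplexity")
    best_epoch, best_ppl = None, float("inf")
    for epoch, ppl in enumerate(perplexities):
        if best_epoch is None or ppl < best_ppl:
            best_epoch, best_ppl = epoch, ppl
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # stop early, keep the best checkpoint
    return best_epoch, len(perplexities) - 1
```

The checkpoint at `best_epoch` is the one that would be carried forward to evaluation.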
The MTNT dataset provides noisy parallel data in specific language pairs. We used models with two fine tuning strategies: Tune-S and Tune-B. The Tune-S model is fine-tuned only with the noisy parallel data in the same direction while the Tune-B model is fine-tuned with the combination of both language directions (Fr→En and En→Fr).

Results
We evaluated the models fine-tuned on different datasets in terms of BLEU on both noisy and clean test sets. We note that although both the MTNT and MTNT2019 test sets are noisy, MTNT2019 is less noisy and contains fewer occurrences of noise such as emojis. Similarly, since the newsdiscuss test set contains informal language, it is slightly noisier than the newstest test set. The evaluation results for both directions are shown in Tables 3 and 4.

Fine Tuning on Noisy Text
It can be seen that for both directions, fine tuning on noisy data gives better performance on the noisy test sets. Although the MTNT training data contains only 19k and 36k sentences, simply fine tuning on it increases the BLEU scores of the Tune-S model by +5.65 and +6.03 on the MTNT test set and by +5.44 and +2.68 on the MTNT2019 test set (see the second row in the tables). It is also worth noting that, although the model is fine-tuned on noisy data, its performance on the clean test sets increases as well. This shows that noisy parallel data could improve model robustness on both noisy and clean texts.
Table 4: BLEU scores of models fine-tuned on different data in the En→Fr direction. λ in ASR data represents the filtering threshold, as mentioned in Section 2.4. The mixed-training model combines all data available and adds domain tags in front of each sentence. Other notations are the same as in Table 3.

Data Augmentation
As is common in the field, we experimented with back translation (third row in Table 3). We used the target-to-source Tune-S model to back-translate the monolingual data in the MTNT corpus. The back-translated data is combined with the noisy parallel data and used to fine-tune the baseline model. For Fr→En, introducing back-translated data makes the model performance drop by over 2 BLEU points compared with simply fine tuning on the parallel data. This suggests that the back-translated data might disrupt the noise level gap between source and target texts, so that the model fine-tuned on back-translated data tends to output noisier translations and performs worse.
Since the MTNT dataset is small, we tried merging the data in both language directions, resulting in 55k sentence pairs for fine tuning. For both directions, the models tuned on the merged MTNT data (Tune-B) perform worse than the models tuned on single-direction data (Tune-S). This is because introducing opposite-direction data would increase the noise in the target texts. We added the forward translation and fuzzy match data separately and fine-tuned with the merged MTNT data. Results show that introducing either forward translation or fuzzy match data improves model performance on noisy test sets compared to the Tune-B model. However, with only forward translation or only fuzzy match data added, the model still lags behind the Tune-S performance. Therefore, we mixed the FT, FM, and merged MTNT data. When fine-tuned on this mixed data, models in both directions scored better. The Fr→En model with forward translation and fuzzy match data achieved 42.80 BLEU on the MTNT2019 test set, an improvement of +1.32 BLEU points over the Tune-S model. The forward translation data is generated using the Tune-B model, which includes information on the opposite direction, and this might benefit forward translation and prevent the model from overfitting. Compared to back translation, forward translation could preserve the noise level difference between the source and target sentences.

Double Fine Tuning
Considering that MTNT data from the opposite direction would harm model performance, we applied double fine tuning to compensate. We took the model that had already been tuned on the combination of forward translation data, fuzzy match data and merged MTNT data, and fine-tuned it with the MTNT data in the corresponding direction (e.g. Fr→En data for the Fr→En model). In this case, the MTNT data in the same language direction fine-tunes the model twice, thus adapting the model to the specific language direction. The second fine tuning further improved model robustness to noisy data while keeping a similar performance on clean data. In the En→Fr direction, the second fine tuning yields improvements of +0.57 and +0.85 BLEU points on MTNT and MTNT2019.

Punctuation Fixing
The MTNT2019 test set uses a different set of punctuation in French text than the MTNT dataset and the clean training data. In the MTNT2019 test set, the French references use apostrophes (’) and angle quotes (« and ») instead of the single quotes (') and double quotes (") used in the MTNT training data (Berard et al., 2019). Therefore, models fine-tuned with the MTNT training data produce punctuation that is inconsistent with the references when evaluated on the MTNT2019 test set. We fixed the punctuation in the En→Fr direction as a postprocessing step. This single replacement improves the double fine-tuned model by +4.36 BLEU points. For comparison, we also postprocessed the output of the Tune-S model, where the punctuation fix results in an increase of +3.24 BLEU points.
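A postprocessing step of this kind could look like the sketch below. It is an illustration under stated assumptions rather than the replacement rules actually used: it assumes the reference apostrophe is the typographic character U+2019, that straight double quotes occur in matched pairs, and that a plain space is placed inside the angle quotes. The function name and the `inner_space` parameter are hypothetical.

```python
import re

APOSTROPHE = "\u2019"  # typographic apostrophe (ASSUMPTION about the references)


def fix_french_punctuation(text, inner_space=" "):
    """Map straight quotes in French MT output to apostrophes and angle quotes."""
    # Straight single quotes become typographic apostrophes.
    text = text.replace("'", APOSTROPHE)

    # Each matched pair of straight double quotes becomes « ... ».
    def to_angle_quotes(match):
        inner = match.group(1).strip()
        return f"«{inner_space}{inner}{inner_space}»"

    return re.sub(r'"([^"]*)"', to_angle_quotes, text)
```

An unmatched double quote is left untouched, since its direction cannot be decided from the string alone.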

External Data
To explore the effect of other types of noise, we fine-tuned our baseline model on different external datasets (see the "Unconstrained" rows in Tables 3 and 4). We experimented with the human transcripts and translations in the IWSLT dataset. The Fr→En BLEU score on MTNT2019 increases by +2.14 over the baseline, while the results on the other three test sets decrease. In the En→Fr direction, fine tuning on IWSLT improves model performance on all four test sets, and with MTNT data added, the BLEU score on noisy data is even higher than that of the Tune-B model. The benefit of speech transcripts might come from informal language such as slang, spoken language, and domain-related words. Apart from this, we also kept the non-speech markers (e.g. "[laughter]" and "[applause]") in the transcripts, which could also act as noise.
When using ASR data generated from the audio files in the MuST-C dataset, we first filtered the ASR data by removing sentences where the original transcript is more than 1.5 times as long as the ASR transcript. The model fine-tuned on this ASR data shows a slight decrease in BLEU scores. We found that the ASR transcript often skips some phrases in a sentence. Therefore, we reduced the length ratio threshold to 1.2, and with that the model achieves performance similar to the baseline model. Evaluated on newstest, the ASR-tuned model improves by +1.08 BLEU points over the baseline. We also fine-tuned on the parallel human transcripts and translations in the MuST-C corpus, which is similar to IWSLT, and the performance increases on all test sets. This suggests that introducing external data with different types of noise could improve model robustness, even without the use of in-domain noisy data.
Finally, we conducted a domain-sensitive training experiment by adding tags for different data.
We mixed all available data, including the MTNT data (two directions), IWSLT, MuST-C, ASR, forward translation, and fuzzy match data. Tags are added at the beginning of the source sentences. As shown in the last row of Table 4, mixed training improves performance over the baseline on noisy texts. However, the model does not outperform the fine-tuned models. This might result from the introduction of ASR-generated data, which can contain more low-quality training samples.

WMT19 Robustness Leaderboard
We submitted our best constrained systems to the WMT19 Robustness Leaderboard, as shown in Table 5 and Table 6. In the Fr→En direction, we submitted the model fine-tuned on merged MTNT data, forward translation and fuzzy match data (row 5 in Table 3). In the En→Fr direction, the double-tuned model with punctuation fixed was submitted (row 6 in Table 4).

Table 5 (Fr→En)
System                    BLEU-uncased
(Berard et al., 2019)     48.8
(Helcl et al., 2019)      45.8
(Zheng et al., 2019)      44.5
Ours                      43.8
(Post and Duh, 2019)      41.8
(Zhou et al., 2019)       36.0
(Grozea, 2019)            30.8
MTNT paper baseline       26.2

Table 6 (En→Fr)
System                    BLEU-uncased
(Berard et al., 2019)     42.0
(Helcl et al., 2019)      39.1
Ours                      37.1
(Zheng et al., 2019)      37.0
(Grozea, 2019)            24.8
MTNT paper baseline       22.5

Our systems would have placed 4th and 3rd in the Fr→En and En→Fr directions, respectively. The leading systems use back translation on a large volume of clean monolingual data and could therefore benefit from the size of the clean data. Although our system does not utilize clean monolingual data, we find an alternative way to extend noisy parallel data, which might be more efficient for training. The results show that our systems could achieve a competitive position.

Conclusions
In this paper, we use data augmentation strategies to improve the robustness of neural machine translation models. We experiment under the setting of the WMT19 Robustness Task for the Fr↔En language directions. We propose the use of forward translation and fuzzy matches as alternatives to back translation to augment noisy data. Our best models with augmented noisy data improve by +1.32 and +0.97 BLEU points for Fr→En and En→Fr over models fine-tuned with noisy parallel data. We also explore the effect of external noisy data in the form of speech transcripts and show that models could benefit from data injected with noise through manual transcriptions of spoken language. The ASR-generated data does not help to improve robustness, as it contains low-quality training samples with broken sentences, while the human transcripts of speech proved helpful for translating noisy texts, even without in-domain data. Future work might include training a domain-related speech recognition model and generating better ASR parallel data instead of using an off-the-shelf system.