Improving Robustness of Neural Machine Translation with Multi-task Learning

While neural machine translation (NMT) achieves remarkable performance on clean, in-domain text, performance is known to degrade drastically when facing text which is full of typos, grammatical errors and other varieties of noise. In this work, we propose a multi-task learning algorithm for transformer-based MT systems that is more resilient to this noise. We describe our submission to the WMT 2019 Robustness shared task based on this method. Our model achieves a BLEU score of 32.8 on the shared task French to English dataset, which is 7.1 BLEU points higher than the baseline vanilla transformer trained with clean text.


Introduction
Real-world data, especially in the realm of social media, often contains noise such as misspellings, grammatical errors, or lexical variations. Even though humans do not have much difficulty in recognizing and translating noisy or ungrammatical sentences, neural machine translation (NMT; Bahdanau et al., 2015; Vaswani et al., 2017) systems are known to degrade drastically when confronted with noisy data (Belinkov and Bisk, 2017; Khayrallah and Koehn, 2018). Thus, there is an increasing need to build robust NMT systems that are resilient to naturally occurring noise.
In this work, we attempt to enhance the robustness of the NMT system through multi-task learning. Our model is a transformer-based model (Vaswani et al., 2017) augmented with two decoders, each bound to a different learning objective. It has a cascade architecture (Niehues et al., 2016; Anastasopoulos and Chiang, 2018) where the first decoder reads in the output of the encoder and the second decoder reads in the output of both the encoder and the first decoder. The objective of the first decoder, namely the denoising decoder, is to recover from the noisy sentence and generate the corresponding clean sentence. Given both the noisy and the clean sentence, the objective of the second decoder, namely the translation decoder, is to correctly translate the sentence into the target language. This framework should be beneficial in two ways: 1) since the model is trained with noisy text, it should inherently generalize better to noisy text; 2) the translation decoder could potentially take advantage of the recovered clean sentence while maintaining specific varieties of noise (e.g. emoji) by referring to the original noisy sentence. This framework requires triplets of clean and noisy source sentences, along with target translations, so we also follow Vaibhav et al. (2019) and design a back-translation strategy that synthesizes noisy data.
Our proposed model outperforms the baseline vanilla transformer trained with clean text by 4.6 BLEU points on the WMT 2019 Robustness shared task (Li et al., 2019) French to English dataset. The fine-tuning process brings an additional 2.5 points improvement. According to our analysis, however, the improvements can mainly be attributed to introducing noisy data during training rather than the multi-task learning objective.

Multi-task Transformer
In this section, we describe in detail the architecture of our proposed multi-task transformer. It is a transformer-based (Vaswani et al., 2017) cascade multi-task framework (Niehues et al., 2016;Anastasopoulos and Chiang, 2018).

Detailed Architecture
As illustrated in Figure 1, the model consists of one transformer encoder and two transformer decoders. The dataset consists of triplets $T = \{t_n, t_c, t_t\}$, where $t_n$ is the noisy source sentence, $t_c$ is the clean source sentence and $t_t$ is the target translation. Each $t$ consists of a sequence of words $[w_1, w_2, ..., w_l]$, where $l$ is the length of the corresponding text. By looking up the word and position embedding tables, each $t$ is converted to a representation matrix $x = \{e_1, e_2, ..., e_l\}$, resulting in $X = \{x_n, x_c, x_t\}$.
The encoder reads in the noisy text $x_n$ and generates the encoded representation $M_n$. The layers of the first decoder (the denoising decoder) first attend to $x_c$ (self-attention) and then attend to $M_n$ from the encoder. After $N$ layers, this decoder generates another representation $M_c$, which represents the clean rather than the noisy source text. The layers of the second decoder (the translation decoder) first perform self-attention as usual, and then attend to both $M_n$ and $M_c$ simultaneously. After repeating this process $N$ times, the translation decoder generates $M_t$, which is then passed on to a position-wise feed-forward network followed by a softmax layer. The output of the model is a probability matrix $P \in \mathbb{R}^{l \times V}$, where $V$ is the vocabulary size and $l$ is the length of the translated sentence.
As described above, the denoising decoder is exactly the same as the decoder of the vanilla transformer. The only difference is that each layer of the translation decoder needs to attend to both the encoder outputs $M_n$ and the denoising decoder outputs $M_c$ after self-attention. Therefore, the translation decoder receives two contexts, namely the encoder attention $A_n$ and the denoising decoder attention $A_c$. In our model, the final attention context is a linear transformation of the concatenation of these two attention states.
Following Tu et al. (2017) and Anastasopoulos and Chiang (2018), the first objective is to maximize the log likelihood of the clean text $t_c$ and the second objective is to maximize that of the translated text $t_t$. The relative importance of these two objectives is controlled by a hyper-parameter $\lambda$.
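The combination of the two contexts and the weighting of the two objectives can be sketched as follows. This is only an illustration under stated assumptions: the weight matrix W, the bias b, the use of single-head attention without learned projections, and the convex-combination form λ · log P(t_c | x_n) + (1 − λ) · log P(t_t | x_n, x_c) are not given in the text above and are chosen here for concreteness.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Single-head scaled dot-product attention; memory rows act as keys and values."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, row)) / scale for row in memory])
    return [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(len(query))]

def combined_context(query, M_n, M_c, W, b):
    """Linear map (assumed parameters W, b) over the concatenation [A_n; A_c]."""
    A_n = attend(query, M_n)  # context from the encoder output
    A_c = attend(query, M_c)  # context from the denoising decoder output
    concat = A_n + A_c        # list concatenation, length 2 * d
    return [sum(w * x for w, x in zip(row, concat)) + bias for row, bias in zip(W, b)]

def multitask_objective(logp_clean, logp_translation, lam=0.5):
    """Assumed weighting: lam * log P(t_c | x_n) + (1 - lam) * log P(t_t | x_n, x_c)."""
    return lam * logp_clean + (1.0 - lam) * logp_translation

# Toy example with d = 2: one query position and two memory rows per source.
query = [1.0, 0.0]
M_n = [[1.0, 0.0], [0.0, 1.0]]
M_c = [[0.5, 0.5], [0.5, 0.5]]
W = [[1.0, 0.0, 0.0, 0.0],  # this particular W copies A_n and ignores A_c
     [0.0, 1.0, 0.0, 0.0]]
b = [0.0, 0.0]
context = combined_context(query, M_n, M_c, W, b)
```

With a learned W, the translation decoder can weight the noisy and denoised contexts differently per dimension, which is what lets it, in principle, keep noise such as emoji from the original sentence.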

Two Phase Beam Search
Following Anastasopoulos and Chiang (2018), we use two separate beam search processes to decode the final translation. Let $N_{beam}$ be the beam size. The process is outlined here for clarity. Given a sentence $t_n$, the denoising decoder produces $N_{beam}$ outputs, each consisting of a denoised hypothesis $\hat{t}_c$, the probability of the hypothesis $P(\hat{t}_c \mid x_n; \theta)$, and the corresponding hidden state matrix $\hat{M}_c$. For each hypothesis from this first decoder, the second decoder also produces $N_{beam}$ tuples, each including a translation hypothesis $\hat{t}_t$ and its probability $P(\hat{t}_t \mid t_n, \hat{t}_c; \theta)$. At the end of the second phase, we have $N_{beam} \times N_{beam}$ translation hypotheses. We rank these hypotheses by their scores as defined in Equation 1.
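The two-phase procedure above can be sketched as a nested loop over hypotheses. This is a simplified illustration, not the actual implementation: the `denoise` and `translate` callables are hypothetical stand-ins for the two decoders, hidden states are omitted, and the score is assumed to be the λ-weighted sum of the two log-probabilities.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, log-probability)

def two_phase_search(
    noisy: str,
    denoise: Callable[[str, int], List[Hypothesis]],
    translate: Callable[[str, str, int], List[Hypothesis]],
    n_beam: int,
    lam: float = 0.5,
) -> List[Tuple[float, str, str]]:
    """Return (score, translation, denoised) tuples sorted best-first."""
    ranked = []
    # Phase 1: the denoising decoder proposes up to n_beam clean hypotheses.
    for clean, logp_clean in denoise(noisy, n_beam):
        # Phase 2: the translation decoder proposes up to n_beam translations per clean hypothesis.
        for target, logp_target in translate(noisy, clean, n_beam):
            score = lam * logp_clean + (1.0 - lam) * logp_target  # assumed scoring form
            ranked.append((score, target, clean))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

# Toy stand-ins for the two decoders (hypothetical outputs and probabilities).
def toy_denoise(noisy: str, n_beam: int) -> List[Hypothesis]:
    candidates = [("si tu joues sur pc", -0.2), ("si vous jouez au peloton", -1.5)]
    return candidates[:n_beam]

def toy_translate(noisy: str, clean: str, n_beam: int) -> List[Hypothesis]:
    table = {
        "si tu joues sur pc": [("if you play on PC", -0.3), ("if you play on a pile", -2.0)],
        "si vous jouez au peloton": [("if you play on pellets", -0.4), ("if you play on PC", -2.5)],
    }
    return table[clean][:n_beam]

results = two_phase_search("si tu joues sur pc", toy_denoise, toy_translate, n_beam=2)
best_score, best_translation, best_denoised = results[0]
```

Because every first-phase hypothesis spawns its own second-phase beam, the cost grows quadratically with the beam size, which is worth keeping in mind when choosing it.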

Training Triple Generation
As mentioned in Section 2, the desired training data for our multi-task transformer is a collection of triples $T = \{t_n, t_c, t_t\}$. However, datasets of this kind are very rare, and the available data is insufficient to train a model with such a large number of parameters. Inspired by Vaibhav et al. (2019), we instead use a back-translation strategy to synthesize these triples. Our proposed strategy is flexible and can be used as long as we have at least one element of the triple. Depending on which part of the triple is available, we select the proper NMT model and synthesize the missing parts. In Figure 2, we show the three ways in which we did this in this work. Note that because we focus on translation from French to English where the French text mostly consists of MTNT-style noise (Michel and Neubig, 2018), we specify the source language as fr, the target language as en and the noise style as MTNT; however, our approach could be used for other language pairs with different noise distributions.
Clean fr & Clean en: This is the most common kind of parallel corpus and can be obtained from many existing resources. The only missing text is the noisy French text. In this case, we synthesize the noisy text with the help of an NMT model trained with both TED and MTNT training data. During training, we add a tag showing the source of each pair at the beginning of each English sentence (Kobus et al., 2017; Vaibhav et al., 2019). By adding this tag, the model could potentially better distinguish TED data from MTNT data. To generate the noisy French text, we add an MTNT tag at the beginning of each sentence and feed it to this NMT model. Ideally, besides the inherent noise that results from imperfect translation, the translated French sentences would also have a noise distribution similar to that of MTNT.
Noisy fr & Clean en: This kind of parallel text can be found in the MTNT training data. Note that even though the manually translated English sentences contain some level of "noise" (e.g. emoji), we treat them as clean English text. In this scenario, we leverage a pre-trained NMT system provided by fairseq (Ott et al., 2019) to translate English sentences back to French. Considering its good performance on other benchmarks (e.g. WMT newstest datasets), we assume that the translated French sentences are of high quality and thus treat them as clean French text.
Clean fr: To make our back-translation strategy applicable to settings where the above parallel data is not enough to train the model, we also design a pipeline to utilize monolingual data, which is likely to be available most of the time. In this case, we first translate these sentences to English and then translate them back to French. Both NMT models are trained with TED and MTNT data as described above. Similarly, in both directions, we add the MTNT tag at the beginning of the sentences. Note that alternatively one could use an off-the-shelf NMT model to generate clean English text.
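The three synthesis routes above can be summarized in a short sketch. Everything named here is hypothetical: the translator callables stand in for the NMT systems described in the text, and the tag string `<MTNT>` is a placeholder, since the actual tag token is not specified.

```python
from typing import Callable, NamedTuple

class Triple(NamedTuple):
    noisy_fr: str
    clean_fr: str
    en: str

Translator = Callable[[str], str]
MTNT_TAG = "<MTNT>"  # placeholder tag token (assumption)

def with_tag(sentence: str, tag: str = MTNT_TAG) -> str:
    """Prepend the domain tag to a sentence before feeding it to a tagged NMT model."""
    return f"{tag} {sentence}"

def from_clean_pair(clean_fr: str, en: str, en2fr_tagged: Translator) -> Triple:
    """Clean fr & clean en: synthesize noisy French from tagged English."""
    return Triple(en2fr_tagged(with_tag(en)), clean_fr, en)

def from_noisy_pair(noisy_fr: str, en: str, en2fr_clean: Translator) -> Triple:
    """Noisy fr & clean en: back-translate English with a clean-domain model."""
    return Triple(noisy_fr, en2fr_clean(en), en)

def from_monolingual(clean_fr: str, fr2en_tagged: Translator, en2fr_tagged: Translator) -> Triple:
    """Clean fr only: round-trip through English, tagging the input in both directions."""
    en = fr2en_tagged(with_tag(clean_fr))
    return Triple(en2fr_tagged(with_tag(en)), clean_fr, en)

# Stub translators that only mark the direction, to show the data flow.
def stub_en2fr(text: str) -> str:
    return f"fr[{text}]"

def stub_fr2en(text: str) -> str:
    return f"en[{text}]"

triple = from_monolingual("bonjour", stub_fr2en, stub_en2fr)
```

In each route, exactly the missing elements of the triple are generated, and the available elements are passed through unchanged.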

Experiments
In this section, we first describe in detail our data pre-processing scheme, as well as the choice of hyperparameters. Then we compare our system with the baseline model (a vanilla transformer trained on clean French and clean English parallel data). Finally, we carry out a case study by comparing the output of our model with the baseline model.

Data Pre-processing
Because of time limitations, we did not use all three kinds of training triples. We only used the first two kinds introduced in Section 3.

Clean fr & Clean en: The clean data consists of the europarl-v7 and news-commentary-v10 corpora. We filter out sentences whose length is greater than 50. We apply a pretrained Byte Pair Encoding (BPE; Gage, 1994) model with 16k subword units to both source and target sentences. The process of synthesizing noisy French sentences is described in the corresponding paragraph of Section 3. We denote this set of triples as $T_{\text{europarl}}$.
Noisy fr & Clean en: As mentioned in the corresponding paragraph of Section 3, both the noisy French and the clean English come from the MTNT training data, and we create the clean French through back-translation. This set of triples is denoted as $T_{\text{mtnt}}$.
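The length filter applied to the clean corpus can be sketched as below. Two details are assumptions, since the text does not specify them: length is counted in whitespace-separated tokens, and a pair is dropped if either side exceeds the limit.

```python
from typing import Iterable, List, Tuple

def keep_pair(source: str, target: str, max_len: int = 50) -> bool:
    """Keep a sentence pair only if both sides have at most max_len tokens."""
    return len(source.split()) <= max_len and len(target.split()) <= max_len

def filter_corpus(pairs: Iterable[Tuple[str, str]], max_len: int = 50) -> List[Tuple[str, str]]:
    """Drop every pair in which either sentence is longer than max_len tokens."""
    return [(src, tgt) for src, tgt in pairs if keep_pair(src, tgt, max_len)]

corpus = [
    ("le chat dort", "the cat sleeps"),
    (" ".join(["mot"] * 51), "a short target"),  # source has 51 tokens and is dropped
]
filtered = filter_corpus(corpus)
```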

Hyperparameters
We follow the transformer-base setting of Vaswani et al. (2017), using $N = 6$ layers for both encoder and decoder and $h = 8$ heads for self-attention, with $d_k$ and $d_v$ both set to 64. The hidden size of the model $d_{model}$ is set to 512 and the hidden size of the feed-forward network is set to 2048. The smoothing rate is set to 0.1 and the dropout rate is set to 0.1. For our multi-task transformer specifically, the weight $\lambda$ in Equation 1 is set to 0.5. The implementation of the model is based on fairseq (Ott et al., 2019).
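For reference, the settings above can be collected in one place. The field names below are illustrative and are not fairseq option names; treating the "smoothing rate" as label smoothing is also an assumption. The small check confirms that the per-head dimensions are consistent with the model size (8 heads of size 64 give 512).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MultiTaskTransformerConfig:
    num_layers: int = 6        # N, for the encoder and both decoders
    num_heads: int = 8         # h
    d_k: int = 64
    d_v: int = 64
    d_model: int = 512
    d_ff: int = 2048           # hidden size of the feed-forward network
    smoothing: float = 0.1     # assumed to be label smoothing
    dropout: float = 0.1
    lam: float = 0.5           # weight between the two objectives

    def heads_consistent(self) -> bool:
        """True if concatenating all heads yields exactly d_model dimensions."""
        return (self.num_heads * self.d_k == self.d_model
                and self.num_heads * self.d_v == self.d_model)

config = MultiTaskTransformerConfig()
```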

Results
The baseline model is the vanilla transformer trained with clean French and clean English. In our experiment, it is trained on pairs $T_1 = \{t_c, t_t\}$ that are extracted from $T_{\text{europarl}}$. Our model, on the other hand, is the multi-task transformer trained with $T_{\text{europarl}}$. The same number of pairs and triples is used during training. We evaluate these two models on two MTNT datasets: one comes from the original paper (Michel and Neubig, 2018), while the other is provided by the WMT Robustness shared task (Li et al., 2019). The BLEU scores of these two models are shown in Table 1.
Compared to the vanilla transformer, our proposed multi-task transformer yields improvements of 2.5 and 4.6 BLEU points on the two MTNT datasets. However, the component that leads to the success of this model is unclear, as there are two main differences: 1) our proposed model utilizes an auxiliary decoder to recover from the noisy text, which could potentially benefit the translation process with cleaner data; 2) our model is further trained on noisy data, presumably overcoming any domain-adaptation issues. We investigate this issue by fine-tuning the baseline model with another set of pairs $T_2 = \{t_n, t_t\}$ that are extracted from $T_{\text{europarl}}$. We load the pre-trained model and continue training for an extra epoch. With this fine-tuning process, the baseline model sees exactly the same amount of data as our proposed model. The fine-tuning result is shown in the second row of Table 1.
Table 1: BLEU scores of different models. The second column shows the score on the MTNT test dataset introduced in Michel and Neubig (2018) and the third column shows the score on the MTNT test dataset provided by the WMT Robustness shared task (Li et al., 2019).
fairseq: https://github.com/pytorch/fairseq/tree/master/fairseq
The performance of the fine-tuned baseline system is very close to that of our proposed model on the original MTNT test data and is 3.2 BLEU points lower on the shared task dataset. This result suggests that while the benefit of including synthetic noisy sentences generalizes across datasets, using the denoising decoder might be beneficial only in specific settings.
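The shared-task scores behind these comparisons can be recovered from the differences quoted in the text. The worked arithmetic below uses only figures stated in this paper (32.8 and 7.1 from the abstract, 4.6 and 3.2 from this section); the symbol names are illustrative, and the results are subject to the rounding of the reported numbers.

```latex
\begin{align*}
\mathrm{BLEU}_{\text{baseline}} &= 32.8 - 7.1 = 25.7 \\
\mathrm{BLEU}_{\text{multi-task}} &= 25.7 + 4.6 = 30.3 \\
\mathrm{BLEU}_{\text{baseline, fine-tuned on noisy pairs}} &= 30.3 - 3.2 = 27.1
\end{align*}
```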
Further, to investigate each model's potential when in-domain training data is available, we fine-tune both models with the MTNT parallel training data. The data we use here is the same as the MTNT data we use to train the auxiliary NMT systems that generate triples (Section 3). Hence, we do not introduce new parallel data during the fine-tuning process. The performance of the fine-tuned systems is shown in the third and the last row of Table 1, respectively.
Even though the vanilla transformer does not beat the multi-task transformer on either dataset before being fine-tuned with in-domain data, it performs significantly better and outperforms our proposed model on both datasets after the fine-tuning process. The results suggest the potential of the vanilla transformer in fitting in-domain data. It is notable, of course, that the fine-tuning process leads to improvements of 9.5/8.9 BLEU points for the vanilla transformer and 7.2/2.5 points for our proposed model, respectively. This again shows the power of domain adaptation for building a robust NMT system. Table 2 shows example outputs from different models on the original MTNT test dataset. The denoised source is the sentence generated by the denoising decoder in our proposed model.

Case Study
The first example contains the special character '>' and the word 'xQc'. All models fail to correctly copy the special character '>' and instead generate a replacement. The word 'xQc', on the other hand, confuses the two baseline models, and they fail to correctly copy it. Our model, however, correctly copies the word and generates a reasonable translation. The denoised sentence does not seem to bring any benefit and, in fact, it attempts to denoise 'xQc' to 'XVC'. The translation decoder then seems to combine the two versions, copying the word from the noisy source sentence but uppercasing it just like the denoised version.
The second example contains the acronym 'PC', and our model does not produce a correct translation. It is interesting that the translated word 'pellets' is also not the corresponding translation of 'peloton' in the denoised sentence. Somewhat similar to the first example, this suggests that the translation decoder mostly ignores the context from the denoising decoder. As for the vanilla transformer, although the baseline model also fails, the fine-tuned model deals with 'PC' correctly and produces a good translation. This indicates that explicitly attending to both noisy and clean sentences does not always lead to better translation quality.
In the last example, the noise lies in a typo in the phrase corresponding to "double negative". None of the models produces a good translation of this phrase. Similar to the first case, the denoised sentence has a negative effect, as it falsely "corrects" 'ngation' to 'voie' ("way" in English), which changes the meaning of the word and results in the bad translation 'track'. This demonstrates that all models still need to address issues regarding rare and misspelled words.
The main takeaway from a manual inspection of the outputs is that the first (denoising) decoder does not deal with noise in the desired way, and the translation decoder generally ignores its output. We suspect that this issue is caused by the data synthesis process, which results in low-quality triples. Further improvements could possibly be achieved by constraining the output of the denoising decoder such that it produces minimal, non-meaning-altering edits. We leave these investigations as future work.

Related Work
Here, we discuss how the MT community handles the noise problem. In general, there are mainly two kinds of approaches: the first attempts to denoise text, and the second proposes training with noisy texts.
Denoising text: Sakaguchi et al. (2017) propose a semi-character level recurrent neural network (scRNN) to correct words with scrambled characters. Each word is represented as a vector with elements corresponding to the characters' positions. Heigold et al. (2018) investigate the robustness of character-based word embeddings in machine translation against word scrambling and random noise. Their experiments show that noise has a larger influence on character-based models than on BPE-based models. To minimize the influence of word structure, Belinkov and Bisk (2017) propose to represent a word as the average of its character embeddings, which is invariant to these kinds of noise. The proposed method enables the MT system to be more robust to scrambling noise even when the model is trained with clean text. Instead of handling noise at the word level, we try to recover the clean text from the noisy one at the sentence level. Besides noise like word scrambling, sentence-level denoising could potentially better deal with more complex noise such as grammatical errors.
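The invariance of an averaged character representation to scrambling can be seen in a few lines. This is a toy illustration with randomly initialized character vectors, not the embedding setup of the cited work; the function and variable names are hypothetical.

```python
import random
from typing import Dict, List

def make_char_table(alphabet: str, dim: int, seed: int = 0) -> Dict[str, List[float]]:
    """Assign each character a fixed random vector (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in alphabet}

def mean_char_embedding(word: str, table: Dict[str, List[float]]) -> List[float]:
    """Represent a word as the average of its character vectors."""
    vectors = [table[ch] for ch in word]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

table = make_char_table("abcdefghijklmnopqrstuvwxyz", dim=4)
original = mean_char_embedding("noise", table)
scrambled = mean_char_embedding("nosie", table)  # same characters, different order

# Averaging discards character order, so the two vectors agree up to rounding.
same = all(abs(a - b) < 1e-12 for a, b in zip(original, scrambled))
```

The same property means the representation cannot distinguish anagrams, which is one reason a sentence-level denoiser is a different trade-off rather than a strict improvement.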
Training with noisy data: Prior work designs methods to generate noise in the text, mainly focusing on syntactic noise and semantic noise. Sperber et al. (2017) propose a noise model based on automatic speech recognizer (ASR) error types, which consist of substitutions, deletions and insertions. Their noise model samples the positions of words that should be altered in the source sentence. Even though training with synthetic noisy data brings a large improvement in translating noisy data, Belinkov and Bisk (2017) show that models mainly perform well on the same kind of noise that is introduced at training time, and they mostly fail to generalize to text with other kinds of noise. Related work from 2019 evaluated MT systems on natural and natural-like grammatical noise, specifically on English produced by non-native speakers. Natural noise appears to be richer and more complex than synthetic noise, making it challenging to manually design a comprehensive set of noise types to approximate real-world settings. In our work, we follow Vaibhav et al. (2019) and synthesize the noisy text through back-translation. There is no need to manually control the distribution of noise.
Table 2: Example outputs from different models on the original MTNT test dataset.
Source: Si tu joues sur pc, a-t-il été bien adapté?
Target: If you play on PC, has it been well adapted?
Baseline: If you are playing on a pile, has it been adequate?
Baseline FT: If you play on pc, has it been properly adapted?
Denoised Source: Si vous jouez au peloton, a-t-il été bien adapté?
Our model: If you play on pellets, has you been well adapted?
Source: Les français sont les champions de la double-ngation.
Target: French people are the champions of the double negative.
Baseline: The French are the champions of dual-nation.
Baseline FT: The French are the champions of double-nutrition.
Denoised Source: Les Français sont les champions de la double voie.
Our model: The French are the champions of the double-track.
In terms of multi-task learning for machine translation, Tu et al. (2017) propose to add a reconstructor on top of the decoder. The auxiliary objective is to reconstruct the source sentence from the hidden layers of the translation decoder. This encourages the decoder to embed complete source information, which helps improve translation performance. This approach was also found to be helpful in low-resource MT scenarios by Niu et al. (2019). Anastasopoulos and Chiang (2018) propose a tied multi-task learning architecture to improve speech translation. The intuition is that speech transcription, as an intermediate task, should improve the performance of speech translation if the translation is based on both the input speech and its transcription.

Conclusion
In this work, we propose a multi-task transformer architecture that tries not only to denoise the noisy source text but also to translate it. We design a strategy for synthesizing data triplets for this architecture. Our model can be viewed as a combination of source-text denoising and domain adaptation, both of which are popular approaches for designing robust NMT systems. Compared to the baseline vanilla transformer that is trained on clean data only, our proposed model with fine-tuning achieves an improvement of 7.1 BLEU points on the WMT Robustness shared task French to English dataset. However, this improvement is most likely attributable to the noisy text we add to the training process (hence, to better domain adaptation), and not to the denoising multi-task strategy.