Robust Neural Machine Translation with Doubly Adversarial Inputs

Neural machine translation (NMT) is often vulnerable to noisy perturbations in the input. We propose an approach to improving the robustness of NMT models that consists of two parts: (1) attack the translation model with adversarial source examples; (2) defend the translation model with adversarial target inputs to improve its robustness against the adversarial source inputs. To generate adversarial inputs, we propose a gradient-based method that crafts adversarial examples informed by the translation loss over the clean inputs. Experimental results on Chinese-English and English-German translation tasks demonstrate that our approach achieves significant improvements (2.8 and 1.6 BLEU points) over Transformer on standard clean benchmarks and exhibits higher robustness on noisy data.


Introduction
In recent years, neural machine translation (NMT) has achieved tremendous success in advancing the quality of machine translation (Hieber et al., 2017). As an end-to-end sequence learning framework, NMT consists of two important components, the encoder and decoder, which are usually built on neural networks of different types, such as recurrent neural networks (Bahdanau et al., 2015), convolutional neural networks (Gehring et al., 2017), and more recently transformer networks (Vaswani et al., 2017). To overcome the bottleneck of encoding the entire input sentence into a single vector, an attention mechanism was introduced, which further enhanced translation performance (Bahdanau et al., 2015). Deeper neural networks with increased model capacity have also been explored in NMT and shown promising results.

Input: 他(她)一个残疾人，我女儿身体好好地。
Original Output: he is a handicapped person, my daughter is in good health.
Perturbed Output: one of her handicapped people, my daughter is in good health. ×
Table 1: An example of a Transformer NMT translation result for an input and its perturbed version, obtained by replacing "他 (he)" with "她 (she)".
Despite these successes, NMT models are still vulnerable to perturbations in the input sentences. For example, Belinkov and Bisk (2018) found that NMT models can be immensely brittle to small perturbations applied to the inputs. Even if these perturbations are not strong enough to alter the meaning of an input sentence, they can nevertheless result in different and often incorrect translations. Consider the example in Table 1: the Transformer model generates a worse translation (revealing gender bias) after a minor change in the input from "he" to "she". Perturbations originate from two sources: (a) natural noise in the annotation and (b) artificial deviations generated by attack models. In this paper, we do not distinguish the source of a perturbation and refer to perturbed examples as adversarial examples. The presence of such adversarial examples can significantly degrade the generalization performance of NMT models.
A few studies have been proposed to tackle this issue in other natural language processing (NLP) tasks, mainly classification, e.g., Miyato et al. (2017); Alzantot et al. (2018); Ebrahimi et al. (2018b); Zhao et al. (2018). As for NMT, previous approaches relied on prior knowledge to generate adversarial examples for robustness, neglecting the specific downstream NMT model. For example, Belinkov and Bisk (2018) and Karpukhin et al. (2019) studied the use of synthetic and/or natural noise. Cheng et al. (2018) proposed adversarial stability training to improve robustness against arbitrary noise types, including feature-level and word-level noise. Homophonic noise for Chinese translation has also been examined.
This paper studies learning a robust NMT model that can overcome small perturbations in the input sentences. Different from prior work, ours deals with perturbed examples generated jointly with a white-box NMT model, meaning that we have access to the parameters of the attacked model. To the best of our knowledge, the only previous work on this topic is Ebrahimi et al. (2018a) on character-level NMT.
Overcoming adversarial examples in NMT is challenging because the words in the input are discrete variables, which makes them difficult to switch with imperceptible perturbations. Moreover, the sequential nature of generation in NMT further intensifies this difficulty. To tackle this problem, we propose a gradient-based method, AdvGen, to construct adversarial examples guided by the final translation loss from the clean inputs of an NMT model. AdvGen is applied at both the encoding and decoding stages: (1) we attack an NMT model by generating adversarial source inputs that are sensitive to the training loss; (2) we then defend the NMT model with adversarial target inputs, aiming to reduce prediction errors for the corresponding adversarial source inputs.
Our contribution is threefold: 1. A white-box method to generate adversarial examples is explored for NMT. Our method is a gradient-based approach guided by the translation loss.
2. We propose a new approach to improving the robustness of NMT with doubly adversarial inputs. The adversarial inputs in the encoder aim at attacking the NMT model, while those in the decoder defend against errors in the predictions.
3. Our approach achieves significant improvements over the previous state-of-the-art Transformer model on two common translation benchmarks.
Experimental results on the standard Chinese-English and English-German translation benchmarks show that our approach yields improvements of 2.8 and 1.6 BLEU points over state-of-the-art models including Transformer (Vaswani et al., 2017). This substantiates that our model improves generalization performance on clean benchmark datasets. Further experiments on noisy text verify the ability of our approach to improve robustness. We also conduct ablation studies to gain further insight into which parts of our approach matter most.

Background
Neural Machine Translation NMT is typically a neural network with an encoder-decoder architecture. It aims to maximize the likelihood of a parallel corpus S = {(x^(s), y^(s))}_{s=1}^{|S|}. Different variants derived from this architecture have been proposed recently (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017). This paper focuses on the recent Transformer model (Vaswani et al., 2017) due to its superior performance, although our approach seems applicable to other models, too.
The encoder in NMT maps a source sentence x = x_1, ..., x_I to a sequence of I word embeddings e(x) = e(x_1), ..., e(x_I). The word embeddings are then encoded into their corresponding continuous hidden representations h by the transformation layers. Similarly, the decoder maps its target input sentence z = z_1, ..., z_J to a sequence of J word embeddings. For clarity, we denote the input and output of the decoder as z and y; z is a shifted copy of y in the standard NMT model, i.e. z = ⟨sos⟩, y_1, ..., y_{J-1}, where ⟨sos⟩ is a start symbol. Conditioned on the hidden representations h and the target input z, the decoder generates y as:

P(y|x; θ_mt) = ∏_{j=1}^{J} P(y_j | z_{<j}, h; θ_mt),    (1)

where θ_mt is a set of model parameters and z_{<j} is a partial target input. The training loss on S is defined as:

L_clean(θ_mt) = (1/|S|) Σ_{(x,y)∈S} −log P(y|x; θ_mt).    (2)

Adversarial Examples Generation An adversarial example is usually constructed by corrupting the original input with a small perturbation such that the difference from the original input remains barely perceptible but dramatically distorts the model output.
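As a numerical illustration of the clean loss in Eq. (2), the following toy sketch (not the actual Transformer; `token_probs` is a hypothetical stand-in holding the per-token probabilities P(y_j | z_{<j}, h)) averages the per-sentence negative log-likelihoods over a batch:

```python
import numpy as np

def clean_loss(token_probs):
    """Toy stand-in for Eq. (2): average over sentences of the sum of
    -log P(y_j | z_<j, h) over target positions."""
    return float(np.mean([-np.sum(np.log(p)) for p in token_probs]))

# two "sentences" with per-token prediction probabilities
batch = [np.array([0.9, 0.8]), np.array([0.5, 0.25, 0.5])]
loss = clean_loss(batch)
```

A perfectly confident model (all probabilities 1.0) would yield a loss of zero; lower-confidence predictions increase the loss.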
Adversarial examples can be generated by a white-box or black-box model; the latter has no access to the attacked model and often relies on prior knowledge, whereas white-box examples are generated using information about the attacked model. Formally, a set of adversarial examples Z(x, y) is generated with respect to a training sample (x, y) by solving the optimization problem:

Z(x, y) = argmax_{x′: R(x′, x) ≤ ε} J(x′, y),    (3)

where J(·) measures the possibility of a sample being adversarial, and R(x′, x) captures the degree of imperceptibility of a perturbation. For example, in a classification task, J(·) is a function whose output indicates the most probable target class y′ (y′ ≠ y) when fed the adversarial example x′. Although it is difficult to give a precise definition of the degree of imperceptibility R(x′, x), the l_∞ norm is usually used to bound perturbations in image classification (Goodfellow et al., 2015).

Approach
Our goal is to learn robust NMT models that can overcome small perturbations in the input sentences. As opposed to images, where small perturbations to pixels are imperceptible, even a single word change in natural language can be perceived. Moreover, NMT is a sequence generation model wherein each output word is conditioned on all previous predictions. Thus, one question is how to design meaningful perturbation operations for NMT. We propose a gradient-based approach, called AdvGen, to construct adversarial examples and use these examples to both attack and defend the NMT model. Our intuition is that an ideal model would generate similar translation results for similar input sentences despite any small difference caused by perturbations.
The attack and defense are carried out in the end-to-end training of the NMT model. We first use AdvGen to construct an adversarial example x ′ from the original input x to attack the NMT model. We then use AdvGen to find an adversarial target input z ′ from the decoder input z to improve the NMT model robustness to adversarial perturbations in the source input x ′ . Thereby we hope the NMT model will be robust against both the source adversarial input x ′ and adversarial perturbations in target predictions z ′ . The rest of this section will discuss the attack and defense procedures in detail.

Attack with Adversarial Source Inputs
Following Goodfellow et al. (2015), Miyato et al. (2017), and Ebrahimi et al. (2018b), we study a white-box method to generate adversarial examples tightly guided by the training loss. Given a parallel sentence pair (x, y) and following Eq. (3), we generate a set of adversarial examples A(x, y) specific to the NMT model by:

A(x, y) = argmax_{x′: R(x′, x) ≤ ε} −log P(y|x′; θ_mt),    (4)

where we use the negative log translation probability −log P(y|x′; θ_mt) to estimate J(·) in Eq. (3). This formulation constructs adversarial examples that are expected to distort the current prediction while retaining semantic similarity bounded by R.
Obtaining an exact solution of Eq. (4) is intractable, so we resort to a greedy gradient-based approach. For the original input x, we induce a possible adversarial word x′_i for each word x_i in x:

x′_i = argmax_{x ∈ V_{x_i}} sim(e(x) − e(x_i), g_{x_i}),    (5)
g_{x_i} = ∇_{e(x_i)} −log P(y|x; θ_mt),    (6)

where g_{x_i} is the gradient vector w.r.t. e(x_i), and sim(·, ·) is the similarity function computed as the cosine distance between two vectors. Enumerating all words in the source vocabulary V_x in Eq. (5) incurs a formidable computational cost, so we substitute V_x with a dynamic set V_{x_i} specific to each word x_i, consisting of the n most probable words under the likelihood Q(x_i, x), where n is a small constant integer and |V_{x_i}| ≪ |V_x|. For the source, we estimate this likelihood from:

Q_src(x_i, x) = P_lm(x_i | x; θ^x_lm),    (7)

where P_lm is a bidirectional language model for the source language.
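The word-selection rule of Eq. (5) can be sketched as follows; this is a minimal illustration, not the paper's implementation, and the candidate dictionary and embeddings are toy assumptions. Given the gradient of the loss w.r.t. a word embedding, the candidate whose replacement direction best aligns with that gradient is chosen:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pick_adversarial_word(e_xi, grad_xi, candidates):
    """Eq. (5) sketch: among candidate embeddings, choose the word whose
    replacement direction e(x) - e(x_i) best aligns with the loss
    gradient g_{x_i}. `candidates` maps word -> embedding."""
    best_word, best_score = None, -np.inf
    for word, e_x in candidates.items():
        score = cosine(e_x - e_xi, grad_xi)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# toy example: the gradient points along +x, so the candidate whose
# embedding moves most in that direction should be selected
e_xi = np.array([0.0, 0.0])
grad = np.array([1.0, 0.0])
cands = {"a": np.array([1.0, 0.1]),
         "b": np.array([-1.0, 0.0]),
         "c": np.array([0.0, 1.0])}
choice = pick_adversarial_word(e_xi, grad, cands)  # "a"
```

In practice the gradient g_{x_i} would come from backpropagating −log P(y|x; θ_mt) to the embedding layer, and the candidates from the top-n words under Q_src.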
Introducing the language model has three benefits. First, it makes approximating Eq. (5) computationally feasible. Second, the words returned by the language model tend to retain semantic similarity between the original words and their adversarial counterparts, strengthening the constraint R in Eq. (4). Finally, it prevents word representations from degenerating, because replacements with adversarial words usually affect the context information around them. Algorithm 1 describes the function AdvGen for generating an adversarial sentence s′ from an input sentence s. Its inputs are: Q, a likelihood function for candidate set generation (for the source, Q_src from Eq. (7)); and D_pos, a distribution over word positions {1, ..., |x|} from which the positions to perturb are sampled (for the source, the simple uniform distribution U). Following the constraint R, we want the output sentence not to deviate too much from the input, and thus change only a small fraction of its words, controlled by a hyper-parameter γ ∈ [0, 1].
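The overall AdvGen procedure (Algorithm 1) can be sketched as below. This is a simplified sketch under stated assumptions: `Q`, `pick_replacement`, and `D_pos` are hypothetical callables standing in for the candidate-set likelihood, the Eq. (5) selection rule, and the position distribution; the real function operates on the NMT model's embeddings and gradients.

```python
import numpy as np

def adv_gen(sentence, Q, D_pos, pick_replacement, gamma=0.25, rng=None):
    """Sketch of Algorithm 1 (hypothetical signatures).
    sentence: list of words; Q(i, sentence) -> candidate words for
    position i; D_pos: sampling probabilities over positions;
    pick_replacement(i, candidates) -> chosen word (e.g. via Eq. (5));
    gamma: fraction of words allowed to change (constraint R)."""
    rng = rng or np.random.default_rng(0)
    out = list(sentence)
    n_replace = int(np.ceil(gamma * len(sentence)))
    positions = rng.choice(len(sentence), size=n_replace,
                           replace=False, p=D_pos)
    for i in positions:
        cands = Q(i, out)
        if cands:
            out[i] = pick_replacement(i, cands)
    return out

# toy run: uniform position distribution, fixed candidate list,
# and a trivial replacement rule that takes the first candidate
sent = ["the", "cat", "sat", "down"]
uniform = np.full(len(sent), 1 / len(sent))
adv = adv_gen(sent, Q=lambda i, s: ["X"], D_pos=uniform,
              pick_replacement=lambda i, c: c[0], gamma=0.5)
```

With γ = 0.5, exactly half of the four positions are replaced, reflecting how γ bounds the deviation from the input sentence.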

Defense with Adversarial Target Inputs
After generating an adversarial example x′, we treat (x′, y) as a new training data point to improve the model's robustness. Adversarial examples in the source tend to introduce errors that may accumulate and cause drastic changes to the decoder predictions. To defend the model from errors in the decoder predictions, we generate an adversarial target input with AdvGen, similar to what we discussed in Section 3.1. A decoder trained with the adversarial target input is expected to be more robust to the small perturbations introduced in the source input. The ablation results in Table 8 substantiate the benefit of this defense mechanism.
Formally, let z be the decoder input for the sentence pair (x, y). We use the same AdvGen function to generate an adversarial target input z′ from z:

z′ = AdvGen(z, Q_trg, D_trg, −log P(y|x′)),    (8)

Note that for the target, the translation loss in Eq. (6) is replaced by −log P(y|x′). Q_trg is the likelihood used to select the target word candidate set V_z. To compute it, we combine the NMT model prediction with a target language model P_lm(y; θ^y_lm) as follows:

Q_trg(z_j, z) = λ P_lm(z_j | z; θ^y_lm) + (1 − λ) P(z_j | z_{<j}, x′; θ_mt),    (9)

where λ balances the importance of the two models.
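The interpolation in Eq. (9) is a simple convex combination of two distributions over the target vocabulary; a minimal sketch (the two probability vectors here are toy values, not real model outputs):

```python
import numpy as np

def q_trg(p_lm, p_nmt, lam=0.5):
    """Eq. (9) sketch: interpolate the bidirectional-LM distribution
    and the NMT prediction distribution over the target vocabulary."""
    return lam * p_lm + (1.0 - lam) * p_nmt

p_lm = np.array([0.7, 0.2, 0.1])   # language-model probabilities
p_nmt = np.array([0.1, 0.8, 0.1])  # NMT prediction probabilities
q = q_trg(p_lm, p_nmt, lam=0.5)    # still a valid distribution
```

Because both inputs are valid distributions and λ ∈ [0, 1], the result also sums to one, so the top-n candidates can be read off directly.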
D_trg is the distribution for sampling positions in the target input. Unlike the uniform distribution used for the source, in the target sentence we want to change the words most influenced by the perturbed words in the source input. To do so, we use the attention matrix M learned by the NMT model, obtained at the current mini-batch, to compute the distribution over target positions given (x, y, x′):

P(j | x, y, x′) ∝ Σ_i M_ij δ(x_i, x′_i),    j ∈ {1, ..., |y|},    (10)

where M_ij is the attention score between x_i and y_j, and δ(x_i, x′_i) is an indicator function that yields 1 if x_i ≠ x′_i (i.e., the source word was perturbed) and 0 otherwise.
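The attention-guided position distribution of Eq. (10) can be sketched as follows; the attention matrix and mask here are toy values, and the uniform fallback for the no-change case is an illustrative assumption:

```python
import numpy as np

def target_position_dist(M, changed):
    """Eq. (10) sketch: distribution over target positions j,
    proportional to the attention mass received from perturbed
    source words. M: |x| x |y| attention matrix; changed: boolean
    mask over source positions (True where x_i != x'_i)."""
    weights = M[changed].sum(axis=0)
    total = weights.sum()
    if total == 0:  # no source word changed: fall back to uniform
        return np.full(M.shape[1], 1.0 / M.shape[1])
    return weights / total

M = np.array([[0.9, 0.1],
              [0.2, 0.8]])
changed = np.array([False, True])  # only the second source word perturbed
d = target_position_dist(M, changed)  # mass concentrates on y_2
```

Target positions attending strongly to perturbed source words thus receive most of the sampling probability, so the defense perturbs exactly the target words likeliest to be affected.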

Training
Algorithm 2 details the entire procedure to calculate the robustness loss for a parallel sentence pair (x, y). We run AdvGen twice to obtain x′ and z′. We do not backpropagate gradients through AdvGen when updating parameters; it plays the role of a data generator only. In our implementation, this function incurs at most a 20% time overhead compared to the standard Transformer model.

Algorithm 2: Robustness loss for a sentence pair (x, y)
1. Compute Q_src as Eq. (7); set D_src to a uniform distribution;
2. x′ ← AdvGen(x, Q_src, D_src, −log P(y|x));
3. Compute Q_trg as Eq. (9);
4. Compute D_trg as Eq. (10);
5. z′ ← AdvGen(z, Q_trg, D_trg, −log P(y|x′));
6. loss ← −log P(y|x′, z′; θ_mt);
7. return loss

Accordingly, we compute the robustness loss on S as:

L_robust(θ_mt) = (1/|S|) Σ_{(x,y)∈S} −log P(y|x′, z′; θ_mt).    (11)

The final training objective L is a combination of four loss functions:

L(θ_mt, θ^x_lm, θ^y_lm) = L_clean(θ_mt) + L_robust(θ_mt) + L_lm(θ^x_lm) + L_lm(θ^y_lm),    (12)

where θ^x_lm and θ^y_lm are the parameters of the source and target bidirectional language models, respectively. Word embeddings are shared between θ_mt and θ^x_lm, and likewise between θ_mt and θ^y_lm.
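The flow of Algorithm 2 combined with the final objective can be sketched with stub callables; all function arguments below (`model_nll`, `adv_src`, `adv_trg`, `lm_x_nll`, `lm_y_nll`) are hypothetical stand-ins for the Transformer loss, the two AdvGen calls, and the two language-model losses:

```python
def training_step(x, y, z, model_nll, adv_src, adv_trg, lm_x_nll, lm_y_nll):
    """Sketch of Algorithm 2 plus Eq. (12). Gradients would not flow
    through adv_src/adv_trg; AdvGen acts only as a data generator."""
    x_adv = adv_src(x, y)                   # attack the source input
    z_adv = adv_trg(z, x_adv, y)            # defend the target input
    l_clean = model_nll(y, x, z)            # clean loss, Eq. (2)
    l_robust = model_nll(y, x_adv, z_adv)   # robustness loss, Eq. (11)
    return l_clean + l_robust + lm_x_nll(x) + lm_y_nll(y)  # Eq. (12)

# toy run with constant stubs to show how the four terms combine
loss = training_step(
    x=["a"], y=["b"], z=["<s>"],
    model_nll=lambda y, s, t: 1.0,
    adv_src=lambda x, y: x, adv_trg=lambda z, xa, y: z,
    lm_x_nll=lambda x: 0.5, lm_y_nll=lambda y: 0.5,
)  # 1.0 + 1.0 + 0.5 + 0.5 = 3.0
```

In a real implementation the stubs would be replaced by the Transformer forward pass and the pre-trained bidirectional language models, with embedding tables shared as described above.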

Setup
We conducted experiments on Chinese-English and English-German translation tasks. The Chinese-English training set is from the LDC corpus, which comprises 1.2M sentence pairs. We used the NIST 2006 dataset as the validation set for model selection and hyper-parameter tuning, and NIST 2002, 2003, 2004, 2005, and 2008 as test sets. For the English-German translation task, we used the WMT'14 corpus consisting of 4.5M sentence pairs. The validation set is newstest2013, and the test set is newstest2014.
In both translation tasks, we merged the source and target training sets and used byte pair encoding (BPE) (Sennrich et al., 2016c) to encode words through sub-word units.
We built a shared vocabulary of 32K sub-words for English-German, and created shared BPE codes with 60K operations for Chinese-English, which induce two vocabularies of 46K Chinese sub-words and 30K English sub-words. We report case-sensitive tokenized BLEU scores for English-German and case-insensitive tokenized BLEU scores for Chinese-English (Papineni et al., 2002). For a fair comparison, we did not average multiple checkpoints (Vaswani et al., 2017) and only report results for a single converged model.
We implemented our approach on top of the Transformer model (Vaswani et al., 2017). In AdvGen, we modify multiple positions in the source and target input sentences in parallel. The bidirectional language model used in AdvGen consists of left-to-right and right-to-left Transformer networks, a linear layer to combine the final representations from the two networks, and a softmax layer to make predictions. Each network has six transformation layers, consistent with the encoder of the Transformer model. The hyper-parameters of the Transformer model were set to the default values described in (Vaswani et al., 2017). We denote the Transformer model with 512 hidden units as Trans.-Base and with 1024 hidden units as Trans.-Big.
We tuned the hyper-parameters of our approach on the validation set via a grid search. Specifically, λ was set to 0.5, and the n for selecting the top-n word candidates was set to 10. The ratio pair (γ_src, γ_trg) was set to (0.25, 0.50), except for Trans.-Base on English-German where it was set to (0.15, 0.15). We treated each side of the parallel corpus as monolingual data to train the bidirectional language models, without introducing additional data. The model parameters of our approach were trained from scratch, except for the language-model parameters, which were initialized from models pre-trained on the respective side of the parallel corpus; these parameters were still updated during robustness training.

Table 3 shows the BLEU scores on the NIST Chinese-English translation task. We first compare our approach with the Transformer model (Vaswani et al., 2017) on which our model is built. As shown, applying our method to the standard backbone model (Trans.-Base) leads to substantial improvements across the validation and test sets. Specifically, our approach achieves an average gain of 2.25 BLEU points, and up to 2.8 BLEU points on NIST03.

Comparison to Baseline Methods
To further verify our method, we compare it to recent techniques for robust NMT learning. For a fair comparison, we implemented all methods on the same Transformer backbone. Miyato et al. (2017) applied perturbations to word embeddings using adversarial learning in text classification tasks; we apply this method to the NMT model. Sennrich et al. (2016a) augmented the training data with word dropout; we follow their method and randomly set source word embeddings to zero with probability 0.1. This simple technique performs reasonably well on Chinese-English translation. Wang et al. (2018) introduced a data-augmentation method for NMT called SwitchOut that randomly replaces words in both source and target sentences with other words. Cheng et al. (2018) employed adversarial stability training to improve the robustness of NMT; we cite the numbers reported in their paper for the RNN-based NMT backbone and also implemented their method on the Transformer backbone. We consider two types of noisy perturbations in their method, denoted by the subscripts lex. and fea. Back-translation (Sennrich et al., 2016b) is a common data-augmentation method for NMT that translates monolingual data with an inverse translation model. We sampled 1.2M English sentences from the Xinhua portion of the GIGAWORD corpus as monolingual data, back-translated them with an English-Chinese NMT model, and re-trained the Chinese-English model on the back-translated data together with the original parallel data.

Input & Noisy Input: 这体现了中俄两国和两国议会间密切(紧密)的友好合作关系。
Reference: this expressed the relationship of close friendship and cooperation between China and Russia and between our parliaments.
Vaswani et al. on Input: this reflects the close friendship and cooperation between China and Russia and between the parliaments of the two countries.
Vaswani et al. on Noisy Input: this reflects the close friendship and cooperation between the two countries and the two parliaments.
Ours on Input: this reflects the close relations of friendship and cooperation between China and Russia and between their parliaments.
Ours on Noisy Input: this embodied the close relations of friendship and cooperation between China and Russia and between their parliaments.

Table 7: BLEU scores computed using the zero-noise-fraction output as a reference.

Table 2 shows the comparisons to the above five baseline methods. Among all methods trained without extra corpora, our approach achieves the best result across datasets. After incorporating the back-translated corpus, our method yields an additional gain of 1-3 points over (Sennrich et al., 2016b) trained on the same back-translated corpus. Since all methods are built on the same backbone, this result substantiates the efficacy of our method on standard benchmarks that contain natural noise. Compared to (Miyato et al., 2017), we found that continuous gradient-based perturbations to word embeddings can be absorbed quickly, often resulting in a worse BLEU score than our proposed discrete perturbations by word replacement.

Results on Noisy Data
We have shown improvements on the standard clean benchmarks. This subsection validates the robustness of the NMT models against artificial noise. To this end, we added synthetic noise to the clean validation set by randomly replacing words with related words according to the similarity of their word embeddings. We repeated this process within each sentence according to a pre-defined noise fraction, where a noise level of 0.0 yields the original clean dataset and 1.0 yields an entirely altered set. For each sentence, we generated 100 noisy sentences, re-scored them using a pre-trained bidirectional language model, and picked the best one as the noisy input. Table 6 shows the results on artificial noisy inputs; BLEU scores were computed against the ground-truth translations. As we can see, our approach outperforms all baseline methods across all noise levels, and the improvement generally becomes more evident as the noise fraction grows.
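The noisy-input construction described above can be sketched as follows; `neighbors` and `lm_score` are hypothetical stand-ins for the embedding-similarity lookup and the bidirectional language model, and the sample count is reduced from 100 for illustration:

```python
import numpy as np

def make_noisy(sentence, neighbors, frac, lm_score, rng=None, n_samples=5):
    """Sketch of the noisy evaluation-set construction: replace a
    fraction `frac` of words with embedding-space neighbors (here a
    plain dict lookup), sample several noisy variants, and keep the
    one the language model scores highest."""
    rng = rng or np.random.default_rng(0)
    variants = []
    for _ in range(n_samples):
        out = list(sentence)
        k = int(round(frac * len(sentence)))
        for i in rng.choice(len(sentence), size=k, replace=False):
            out[i] = neighbors.get(out[i], out[i])
        variants.append(out)
    return max(variants, key=lm_score)

# toy run: at most one word per variant is replaced at frac ~ 1/3
sent = ["the", "cat", "sat"]
noisy = make_noisy(sent, neighbors={"cat": "dog"}, frac=0.34,
                   lm_score=lambda s: 0.0)
```

Rescoring with a language model keeps the noisy inputs fluent, so the evaluation measures robustness to plausible perturbations rather than to random garbage.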
To further analyze prediction stability, we compared model outputs for clean and noisy inputs. We selected the output of each model on the clean input (noise fraction 0.0) as a reference and computed BLEU scores against this reference. Table 7 presents the results, where a value of 100 in the second column means the output is exactly the same as the reference. The relative drop of our model as the noise level grows is smaller than that of the other baseline methods. The results in Table 6 and Table 7 together suggest that our model is more robust to input noise. In the translation example shown earlier, the original and perturbed inputs have essentially the same meaning, as "密切" and "紧密" both mean "close" in Chinese. Our model retains very important words such as "China and Russia", which are missing in the Transformer results.

Table 8 shows the importance of the different components in our approach: L_clean, L_robust, and L_lm. Within L_robust, we ablate the source adversarial input (setting x′ = x) and the target adversarial input (setting z′ = z). In the fourth row, with x′ = x and z′ = z, we choose the replacement positions of z randomly, since with no changes in x the distribution in Eq. (10) cannot be formed. We find that removing any component leads to a notable decrease in BLEU. Among them, removing the adversarial target input (z′ = z) causes the greatest decrease, 1.87 BLEU points, while removing the language models has the least impact on the BLEU score. Even so, the language models remain important for reducing the size of the candidate set, regularizing word embeddings, and generating fluent sentences.

Ablation Studies
The hyper-parameters γ_src and γ_trg control the ratio of word replacement in the source and target inputs. Table 9 shows the results of a sensitivity study, where rows correspond to γ_src and columns to γ_trg. As we can see, performance is relatively insensitive to the values of these hyper-parameters; the best configuration on the Chinese-English validation set is γ_src = 0.25 and γ_trg = 0.50. We found that a non-zero γ_trg always yields improvements over γ_trg = 0. While γ_src = 0.25 increases BLEU scores for all values of γ_trg, a larger γ_src appears to be damaging.

Related Work
Robust Neural Machine Translation Improving robustness has been receiving increasing attention in NMT; for example, Belinkov and Bisk (2018) found that NMT models are brittle to both synthetic and natural noise in the input. Michel and Neubig (2018) proposed a dataset for testing machine translation on noisy text; they adopt a domain adaptation method that first trains an NMT model on a clean dataset and then fine-tunes it on noisy data. This differs from our setting, in which no noisy training data is available. Another difference is that one of our primary goals is to improve NMT models on the standard clean test data, whereas Michel and Neubig (2018) aim to improve models on noisy test data; we leave the extension to their setting for future work.

Adversarial Examples Generation Our work is inspired by adversarial example generation, a popular research area in computer vision, e.g. (Szegedy et al., 2014; Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016). In NLP, similar ideas have been applied to a variety of tasks, such as text classification (Miyato et al., 2017; Ebrahimi et al., 2018b), machine comprehension (Jia and Liang, 2017), dialogue generation (Li et al., 2017), and machine translation (Belinkov and Bisk, 2018). Closely related to Miyato et al. (2017), who attacked text classification models in the embedding space, our method generates adversarial examples based on discrete word replacements, and our experiments show that it achieves better performance on both clean and noisy data.

Data Augmentation Our approach can also be viewed as a data-augmentation technique using adversarial examples.
In fact, incorporating monolingual corpora into NMT has been an important topic (Sennrich et al., 2016b; Cheng et al., 2016; Edunov et al., 2018). Other papers augment a standard dataset based on parallel corpora by dropping words (Sennrich et al., 2016a), replacing words (Wang et al., 2018), or editing rare words (Fadaee et al., 2017). Different from these data-augmentation techniques, our approach is trained only on parallel corpora yet outperforms a representative data-augmentation method (Sennrich et al., 2016b) trained with extra monolingual data. When monolingual data is included, our approach yields further improvements.

Conclusion
In this work, we have presented an approach to improving the robustness of NMT models with doubly adversarial inputs. We have also introduced a white-box method to generate adversarial examples for NMT. Experimental results on Chinese-English and English-German translation tasks demonstrate the capability of our approach to improve both translation performance and robustness. In future work, we plan to explore generating more natural adversarial examples that dispense with word replacements, as well as more advanced defense approaches such as curriculum learning (Jiang et al., 2015, 2018).