Towards Robust Neural Machine Translation

Small perturbations in the input can severely distort intermediate representations and thus impact translation quality of neural machine translation (NMT) models. In this paper, we propose to improve the robustness of NMT models with adversarial stability training. The basic idea is to make both the encoder and decoder in NMT models robust against input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. Experimental results on Chinese-English, English-German and English-French translation tasks show that our approaches can not only achieve significant improvements over strong NMT systems but also improve the robustness of NMT models.


Introduction
Neural machine translation (NMT) models have advanced the state of the art by building a single neural network that can better learn representations (Cho et al., 2014). The neural network consists of two components: an encoder network that encodes the input sentence into a sequence of distributed representations, based on which a decoder network generates the translation with an attention model (Bahdanau et al., 2015; Luong et al., 2015). A variety of NMT models derived from this encoder-decoder framework have further improved the performance of machine translation systems (Gehring et al., 2017; Vaswani et al., 2017). NMT is capable of generalizing better to unseen text by exploiting word similarities in embeddings and capturing long-distance reordering by conditioning on larger contexts in a continuous way.

Output: They are not afraid of difficulties to make Go AI.
Input: tamen buwei kunnan zuochu weiqi AI.
Output: They are not afraid to make Go AI.
Table 1: The non-robustness problem of neural machine translation. Replacing a Chinese word with its synonym (i.e., "bupa" → "buwei") leads to significant erroneous changes in the English translation. Both "bupa" and "buwei" can be translated to the English phrase "be not afraid of."

However, studies reveal that very small changes to the input can fool state-of-the-art neural networks with high probability (Goodfellow et al., 2015; Szegedy et al., 2014). Belinkov and Bisk (2018) confirm this finding by pointing out that NMT models are very brittle and easily falter when presented with noisy input. In NMT, due to the introduction of RNN and attention, each contextual word can influence the model prediction in a global context, which is analogous to the "butterfly effect." As shown in Table 1, although we only replace a source word with its synonym, the generated translation has been completely distorted. We investigate severe variations of translations caused by small input perturbations by replacing one word in each sentence of a test set with its synonym. We observe that 69.74% of translations have changed and the BLEU score is only 79.01 between the translations of the original inputs and the translations of the perturbed inputs, suggesting that NMT models are very sensitive to small perturbations in the input.

The vulnerability and instability of NMT models limit their applicability to a broader range of tasks, which require robust performance on noisy inputs. For example, simultaneous translation systems use automatic speech recognition (ASR) to transcribe input speech into a sequence of hypothesized words, which are subsequently fed to a translation system. In this pipeline, ASR errors appear as sentences with noisy perturbations (the same pronunciation but incorrect words), which is a significant challenge for current NMT models. Moreover, instability makes NMT models sensitive to misspellings and typos in text translation.
In this paper, we address this challenge with adversarial stability training for neural machine translation. The basic idea is to improve the robustness of the two main components of an NMT model: the encoder and the decoder. To this end, we propose two approaches to constructing noisy inputs with small perturbations and train NMT models to resist them. Because the intermediate representations produced by the encoder directly determine the accuracy of the final translations, we introduce adversarial learning to make the behavior of the encoder consistent for an input and its perturbed counterpart. To improve the stability of the decoder, our method jointly maximizes the likelihoods of the original and perturbed data. Adversarial stability training has the following advantages:
1. Improving both the robustness and translation performance: Our adversarial stability training is capable of not only improving the robustness of NMT models but also achieving better translation performance.
2. Applicable to arbitrary noisy perturbations: In this paper, we propose two approaches to constructing noisy perturbations for inputs. However, our training framework can be easily extended to arbitrary noisy perturbations. In particular, one can design task-specific perturbation methods.
3. Transparent to network architectures: Our adversarial stability training does not depend on specific NMT architectures. It can be applied to arbitrary NMT systems.
Experiments on Chinese-English, English-French and English-German translation tasks show that adversarial stability training achieves significant improvements across different language pairs. Our NMT system outperforms the state-of-the-art RNN-based NMT system (GNMT) and obtains comparable performance with the CNN-based NMT system (Gehring et al., 2017). Related experimental analyses validate that our training approach can improve the robustness of NMT models.

Background
NMT is an end-to-end framework which directly optimizes the translation probability of a target sentence $y = y_1, \ldots, y_N$ given its corresponding source sentence $x = x_1, \ldots, x_M$:

$$P(y \mid x; \theta) = \prod_{n=1}^{N} P(y_n \mid y_{<n}, x; \theta) \tag{1}$$

where $\theta$ is a set of model parameters and $y_{<n}$ is a partial translation. $P(y \mid x; \theta)$ is defined on a holistic neural network which mainly includes two core components: an encoder encodes a source sentence $x$ into a sequence of hidden representations $H_x = H_1, \ldots, H_M$, and a decoder generates the $n$-th target word based on the sequence of hidden representations:

$$P(y_n \mid y_{<n}, x; \theta) \propto \exp\{g(y_{n-1}, s_n, H_x; \theta)\} \tag{2}$$

where $s_n$ is the $n$-th hidden state on the target side. Thus the model parameters of NMT include the parameter sets of the encoder $\theta_{\text{enc}}$ and the decoder $\theta_{\text{dec}}$: $\theta = \{\theta_{\text{enc}}, \theta_{\text{dec}}\}$. The standard training objective is to minimize the negative log-likelihood of the training corpus $S$:

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \sum_{\langle x, y \rangle \in S} -\log P(y \mid x; \theta) \tag{3}$$

Due to the vulnerability and instability of deep neural networks, NMT models usually suffer from a drawback: small perturbations in the input can dramatically deteriorate their translation results. Belinkov and Bisk (2018) point out that character-based NMT models are very brittle and easily falter when presented with noisy input. We find that word-based and subword-based NMT models also share this shortcoming, as shown in Table 1. We argue that the distributed representations should fulfill the stability expectation, which is the underlying concept of the proposed approach. Recent work has shown that adversarially trained models can be made robust to such perturbations (Zheng et al., 2016; Madry et al., 2018). Inspired by this, in this work, we improve the robustness of encoder representations against noisy perturbations with adversarial learning.
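As a rough illustration of the factorized translation probability and the negative log-likelihood objective described in this section, the following minimal sketch sums per-token log-probabilities. The function names and the toy probabilities are hypothetical and stand in for the outputs of a real decoder.

```python
import math

def sentence_log_prob(token_probs):
    """Sum of log P(y_n | y_<n, x) over target positions (chain-rule factorization)."""
    return sum(math.log(p) for p in token_probs)

def corpus_nll(corpus_token_probs):
    """Negative log-likelihood summed over all sentence pairs in a corpus."""
    return -sum(sentence_log_prob(probs) for probs in corpus_token_probs)

# Hypothetical per-token probabilities that a model assigns to two reference translations.
corpus = [[0.5, 0.25], [0.5]]
nll = corpus_nll(corpus)
print(f"corpus NLL = {nll:.4f}")
```

Training would then search for the parameters that make this quantity as small as possible.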
Figure 1: The architecture of NMT with adversarial stability training. The dark solid arrow lines represent the forward-pass information flow for the input sentence $x$, while the red dashed arrow lines represent that for the noisy input sentence $x'$, which is transformed from $x$ by adding small perturbations. (Diagram components: the inputs $x$ and $x'$, the encoder producing $H_x$ and $H_{x'}$, the decoder, the discriminator, and the losses $\mathcal{L}_{\text{inv}}(x, x')$, $\mathcal{L}_{\text{true}}(x, y)$ and $\mathcal{L}_{\text{noisy}}(x', y)$.)

Overview
The goal of this work is to propose a general approach that makes NMT models more robust to input perturbations. Our basic idea is to keep the behavior of the NMT model consistent for the source sentence $x$ and its perturbed counterpart $x'$. As mentioned above, the NMT model contains two procedures for projecting a source sentence $x$ to its target sentence $y$: the encoder is responsible for encoding $x$ as a sequence of representations $H_x$, while the decoder outputs $y$ with $H_x$ as input. We aim at learning a perturbation-invariant encoder and decoder.

Figure 1 illustrates the architecture of our approach. Given a source sentence $x$, we construct a set of perturbed sentences $\mathcal{N}(x)$, in which each sentence $x'$ is constructed by adding small perturbations to $x$. We require that $x'$ is a subtle variation of $x$ and that the two have similar semantics. Given the input pair $(x, x')$, we have two expectations: (1) the encoded representation $H_{x'}$ should be close to $H_x$; and (2) given $H_{x'}$, the decoder is able to generate the robust output $y$. To this end, we introduce two additional objectives to improve the robustness of the encoder and decoder:
• $\mathcal{L}_{\text{inv}}(x, x')$ to encourage the encoder to output similar intermediate representations $H_x$ and $H_{x'}$ for $x$ and $x'$ to achieve an invariant encoder, which benefits outputting the same translations. We cast this objective in the adversarial learning framework.
• $\mathcal{L}_{\text{noisy}}(x', y)$ to guide the decoder to generate output $y$ given the noisy input $x'$, which is modeled as $-\log P(y \mid x')$. It can also be defined as the KL divergence between $P(y \mid x)$ and $P(y \mid x')$, which amounts to using $P(y \mid x)$ to teach $P(y \mid x')$.
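The two formulations of the noisy-input objective listed above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the KL variant is shown for a single target position over a small shared vocabulary, all probabilities are assumed to be strictly positive where needed, and the function names and toy numbers are hypothetical.

```python
import math

def nll_noisy(p_ref_given_noisy):
    """L_noisy as -log P(y | x'), from per-token probabilities of the reference y."""
    return -sum(math.log(p) for p in p_ref_given_noisy)

def kl_noisy(p_clean, p_noisy):
    """KL(P(. | x) || P(. | x')) for one target position over a shared vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(p_clean, p_noisy) if p > 0.0)

# Hypothetical distributions over a two-word vocabulary at one decoding step.
p_clean = [0.5, 0.5]
p_noisy = [0.25, 0.75]
print(nll_noisy([0.5, 0.25]))      # likelihood-based variant
print(kl_noisy(p_clean, p_noisy))  # distillation-style variant
```

The KL variant is zero when the model predicts the same distribution for the clean and the perturbed input, and grows as the two predictions diverge.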
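As seen, the two introduced objectives aim to improve the robustness of the NMT model, freeing it from high variance in target outputs caused by small perturbations in inputs. It is also natural to include the original training objective $\mathcal{L}_{\text{true}}(x, y)$ on $x$ and $y$, which can guarantee good translation performance while keeping the stability of the NMT model. Formally, given a training corpus $S$, the adversarial stability training objective is

$$\mathcal{J}(\theta) = \sum_{\langle x, y \rangle \in S} \Big( \mathcal{L}_{\text{true}}(x, y; \theta_{\text{enc}}, \theta_{\text{dec}}) + \alpha \sum_{x' \in \mathcal{N}(x)} \mathcal{L}_{\text{inv}}(x, x'; \theta_{\text{enc}}, \theta_{\text{dis}}) + \beta \sum_{x' \in \mathcal{N}(x)} \mathcal{L}_{\text{noisy}}(x', y; \theta_{\text{enc}}, \theta_{\text{dec}}) \Big) \tag{4}$$

where $\mathcal{L}_{\text{true}}(x, y)$ and $\mathcal{L}_{\text{noisy}}(x', y)$ are calculated using Equation 3, and $\mathcal{L}_{\text{inv}}(x, x')$ is the adversarial loss to be described in Section 3.3. $\alpha$ and $\beta$ control the balance between the original translation task and the stability of the NMT model. $\theta = \{\theta_{\text{enc}}, \theta_{\text{dec}}, \theta_{\text{dis}}\}$ are the trainable parameters of the encoder, the decoder, and the newly introduced discriminator used in adversarial learning. As seen, the parameters of the encoder $\theta_{\text{enc}}$ and decoder $\theta_{\text{dec}}$ are trained to minimize both the translation loss $\mathcal{L}_{\text{true}}(x, y)$ and the stability losses ($\mathcal{L}_{\text{noisy}}(x', y)$ and $\mathcal{L}_{\text{inv}}(x, x')$).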
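A minimal sketch of how the three losses might be combined for a single sentence pair is given below. It assumes that α weights the invariance loss and β weights the noisy-input loss, and that the per-neighbour loss values have already been computed; the function name and the numbers are hypothetical.

```python
def ast_objective(l_true, l_inv_terms, l_noisy_terms, alpha=1.0, beta=1.0):
    """Weighted sum of the translation loss and the two stability losses.

    l_inv_terms and l_noisy_terms hold one value per perturbed neighbour x' in N(x).
    Assumption: alpha scales the invariance loss and beta scales the noisy loss.
    """
    return l_true + alpha * sum(l_inv_terms) + beta * sum(l_noisy_terms)

# Hypothetical loss values for one sentence pair with a single perturbed neighbour.
total = ast_objective(l_true=2.0, l_inv_terms=[1.5], l_noisy_terms=[2.5])
print(total)
```

Setting both weights to zero recovers the original translation loss, which makes the role of α and β as a trade-off between translation quality and stability explicit.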
Since $\mathcal{L}_{\text{noisy}}(x', y)$ evaluates the translation loss on the perturbed neighbour $x'$ and its corresponding target sentence $y$, it effectively augments the training data with perturbed neighbours, which can potentially improve the translation performance. In this way, our approach not only makes the output of NMT models more robust, but also improves the performance on the original translation task.
In the following sections, we will first describe how to construct perturbed inputs with different strategies to fulfill different goals (Section 3.2), followed by the proposed adversarial learning mechanism for the perturbation-invariant encoder (Section 3.3). We conclude this section with the training strategy (Section 3.4).

Constructing Perturbed Inputs
At each training step, we need to generate a perturbed neighbour set $\mathcal{N}(x)$ for each source sentence $x$ for adversarial stability training. In this paper, we propose two strategies to construct the perturbed inputs at multiple levels of representations.
The first approach generates perturbed neighbours at the lexical level. Given an input sentence $x$, we randomly sample some word positions to be modified. Then we replace words at these positions with other words in the vocabulary according to the following distribution:

$$P(x \mid x_i) = \frac{\exp\{\cos(E[x_i], E[x])\}}{\sum_{\tilde{x} \in V_x \setminus x_i} \exp\{\cos(E[x_i], E[\tilde{x}])\}} \tag{5}$$

where $E[x_i]$ is the word embedding of word $x_i$, $V_x \setminus x_i$ is the source vocabulary excluding $x_i$, and $\cos(E[x_i], E[x])$ measures the similarity between word $x_i$ and $x$. Thus we can change the word to another word with similar semantics.

One potential problem of the above strategy is that it is hard to enumerate all possible positions and possible types to generate perturbed neighbours. Therefore, we propose a more general approach to modifying the sentence at the feature level. Given a sentence, we can obtain the word embedding for each word. We add Gaussian noise to a word embedding to simulate possible types of perturbations. That is

$$E[x'_i] = E[x_i] + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I) \tag{6}$$

where the vector $\epsilon$ is sampled from a Gaussian distribution with variance $\sigma^2$. $\sigma$ is a hyper-parameter. We simply introduce Gaussian noise to all of the word embeddings in $x$.
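The two perturbation strategies described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the similarity is taken to be cosine similarity between word embeddings, normalized with a softmax over the vocabulary excluding the original word; the toy embedding values and all function names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def replacement_distribution(word, embeddings):
    """Softmax over cosine similarity to `word`, excluding `word` itself."""
    scores = {w: math.exp(cosine(embeddings[word], e))
              for w, e in embeddings.items() if w != word}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

def lexical_perturb(sentence, embeddings, num_replace, rng):
    """Replace the words at `num_replace` randomly sampled positions."""
    perturbed = list(sentence)
    for i in rng.sample(range(len(sentence)), num_replace):
        dist = replacement_distribution(sentence[i], embeddings)
        words = list(dist)
        perturbed[i] = rng.choices(words, weights=[dist[w] for w in words])[0]
    return perturbed

def feature_perturb(embedded_sentence, sigma, rng):
    """Add independent Gaussian noise (std sigma) to every embedding component."""
    return [[v + rng.gauss(0.0, sigma) for v in vec] for vec in embedded_sentence]

# Hypothetical two-dimensional embeddings for three words.
toy = {"bupa": [1.0, 0.1], "buwei": [0.9, 0.2], "weiqi": [-0.2, 1.0]}
rng = random.Random(0)
dist = replacement_distribution("bupa", toy)
noisy = lexical_perturb(["bupa", "weiqi"], toy, 1, rng)
shifted = feature_perturb([toy["bupa"], toy["weiqi"]], 0.01, rng)
print(dist, noisy, shifted)
```

With these toy vectors, the near-synonym receives more probability mass than the unrelated word, which is the behaviour the lexical-level strategy relies on.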
The proposed scheme is a general framework where one can freely define the strategies to construct perturbed inputs. We just present two possible examples here. The first strategy is potentially useful when the training data contains noisy words, while the latter is a more general strategy to improve the robustness of common NMT models. In practice, one can design specific strategies for particular tasks. For example, we can replace correct words with their homonyms (same pronunciation but different meanings) to improve NMT models for simultaneous translation systems.

Adversarial Learning for the Perturbation-invariant Encoder
The goal of the perturbation-invariant encoder is to make the representations produced by the encoder indistinguishable when fed with a correct sentence $x$ and its perturbed counterpart $x'$, which is directly beneficial to the output robustness of the decoder. We cast the problem in the adversarial learning framework. The encoder serves as the generator $G$, which defines the policy that generates a sequence of hidden representations $H_x$ given an input sentence $x$.

We introduce an additional discriminator $D$ to distinguish the representation of the perturbed input $H_{x'}$ from that of the original input $H_x$. The goal of the generator $G$ (i.e., the encoder) is to produce similar representations for $x$ and $x'$ that can fool the discriminator, while the discriminator $D$ tries to correctly distinguish the two representations. Formally, the adversarial learning objective is

$$\mathcal{L}_{\text{inv}}(x, x'; \theta_{\text{enc}}, \theta_{\text{dis}}) = \mathbb{E}_{x \sim S}[-\log D(G(x))] + \mathbb{E}_{x' \sim \mathcal{N}(x)}[-\log(1 - D(G(x')))] \tag{7}$$

The discriminator outputs a classification score given an input representation, and tries to push $D(G(x))$ towards 1 and $D(G(x'))$ towards 0. The objective encourages the encoder to output similar representations for $x$ and $x'$, so that the discriminator fails to distinguish them. The training procedure can be regarded as a min-max two-player game. The encoder parameters $\theta_{\text{enc}}$ are trained to maximize the loss function to fool the discriminator. The discriminator parameters $\theta_{\text{dis}}$ are optimized to minimize this loss to improve the discriminating ability. For efficiency, we update both the encoder and the discriminator simultaneously at each iteration, rather than using the periodical training strategy that is commonly used in adversarial learning. Lamb et al. (2016) propose a similar idea, Professor Forcing, to make the behavior of RNNs indistinguishable between training and sampling.
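To make the two-player game above concrete, the following sketch evaluates the adversarial loss for a single pair of discriminator scores. It assumes the loss is the sum of the two negative log terms for one clean and one perturbed representation, with scores strictly between 0 and 1; the function name and the score values are hypothetical.

```python
import math

def inv_loss(d_clean, d_noisy):
    """-log D(G(x)) - log(1 - D(G(x'))) for one pair of discriminator scores in (0, 1)."""
    return -math.log(d_clean) - math.log(1.0 - d_noisy)

confident = inv_loss(0.9, 0.1)  # discriminator separates the two representations well
fooled = inv_loss(0.5, 0.5)     # discriminator is at chance on both inputs
print(confident, fooled)
```

A discriminator that separates the two representations drives this loss down, while an encoder that makes them indistinguishable pushes it back up; when the discriminator outputs 0.5 for both inputs, each of the two terms equals log 2.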

Training
As shown in Figure 1, our training objective includes three sets of model parameters for three modules. We use mini-batch stochastic gradient descent to optimize our model. In the forward pass, besides a mini-batch of $x$ and $y$, we also construct a mini-batch consisting of the perturbed neighbour $x'$ and $y$. We propagate the information along the arrows to calculate the three loss functions. Then, gradients are collected to update the three sets of model parameters. All gradients are backpropagated normally, except that the gradients of $\mathcal{L}_{\text{inv}}$ with respect to $\theta_{\text{enc}}$ are multiplied by $-1$. Note that we update $\theta_{\text{dis}}$ and $\theta_{\text{enc}}$ simultaneously for training efficiency.
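The sign flip in the update described above can be illustrated with plain lists standing in for parameter and gradient vectors. This is a schematic sketch, not the authors' training code: the per-loss gradients are assumed to be given, the weighting by α and β is an assumption, and all names and numbers are hypothetical.

```python
def encoder_gradient(grad_true, grad_noisy, grad_inv, alpha=1.0, beta=1.0):
    """Combine per-loss gradients w.r.t. encoder parameters.

    The invariance-loss gradient is multiplied by -1, so a descent step on the
    result increases L_inv while decreasing L_true and L_noisy.
    """
    return [gt + beta * gn - alpha * gi
            for gt, gn, gi in zip(grad_true, grad_noisy, grad_inv)]

def discriminator_gradient(grad_inv, alpha=1.0):
    """The discriminator receives the invariance-loss gradient unchanged in sign."""
    return [alpha * g for g in grad_inv]

def sgd_step(params, grads, lr):
    """One plain stochastic gradient descent update."""
    return [p - lr * g for p, g in zip(params, grads)]

# Hypothetical gradients for a two-parameter encoder and a one-parameter discriminator.
enc_grad = encoder_gradient([0.2, -0.1], [0.1, 0.0], [0.5, 0.5])
dis_grad = discriminator_gradient([0.3])
new_enc = sgd_step([1.0, 1.0], enc_grad, lr=0.1)
new_dis = sgd_step([0.0], dis_grad, lr=0.1)
print(new_enc, new_dis)
```

Because both updates are computed from the same forward pass, the encoder and the discriminator can be updated in the same iteration, which matches the simultaneous update mentioned above.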

Setup
We evaluated our adversarial stability training on translation tasks of several language pairs, and reported the 4-gram BLEU (Papineni et al., 2002) score as calculated by the multi-bleu.perl script.

Chinese-English. We used the LDC corpus consisting of 1.25M sentence pairs with 27.9M Chinese words and 34.5M English words respectively. We selected the best model using the NIST 2006 set as the validation set (hyper-parameter optimization and model selection). The NIST 2002, 2003, 2004, 2005, and 2008 datasets are used as test sets.

English-German. We used the WMT 14 corpus containing 4.5M sentence pairs with 118M English words and 111M German words. The validation set is newstest2013, and the test set is newstest2014.

English-French. We used the IWSLT corpus which contains 0.22M sentence pairs with 4.03M English words and 4.12M French words. The IWSLT corpus is very dissimilar from the NIST and WMT corpora. As its sentences are collected from TED talks and are inclined to spoken language, we use it to verify our approaches on non-normative text. The IWSLT 14 test set is taken as the validation set and the IWSLT 15 test set is used as the test set.
For English-German and English-French, we tokenize English, German and French words using the tokenize.perl script. We follow Sennrich et al. (2016b) to split words into subword units. The numbers of merge operations in byte pair encoding (BPE) are set to 30K, 40K and 30K respectively for Chinese-English, English-German, and English-French. We report the case-sensitive tokenized BLEU score for English-German and English-French and the case-insensitive tokenized BLEU score for Chinese-English.
Our baseline system is an in-house NMT system. Following Bahdanau et al. (2015), we implement an RNN-based NMT in which both the encoder and decoder are two-layer RNNs with residual connections between layers (He et al., 2016b). The gating mechanism of the RNNs is the gated recurrent unit (GRU) (Cho et al., 2014). We apply layer normalization (Ba et al., 2016) and dropout (Hinton et al., 2012) to the hidden states of GRUs. Dropout is also added to the source and target word embeddings. We share the same matrix between the target word embeddings and the pre-softmax linear transformation (Vaswani et al., 2017). We update the set of model parameters using Adam (Kingma and Ba, 2015). Its learning rate is initially set to 0.05 and varies according to the formula in Vaswani et al. (2017).
Our adversarial stability training initializes the model with the parameters trained by maximum likelihood estimation (MLE). We denote adversarial stability training based on lexical-level perturbations and feature-level perturbations as AST_lexical and AST_feature, respectively. We sample only one perturbed neighbour $x' \in \mathcal{N}(x)$ for training efficiency. For the discriminator used in $\mathcal{L}_{\text{inv}}$, we adopt the CNN discriminator proposed by Kim (2014) to address the variable-length problem of the sequence generated by the encoder. In the CNN discriminator, the filter windows are set to 3, 4, 5 and rectified linear units are applied after convolution operations. We tune the hyper-parameters on the validation set through a grid search. We find that the optimal values of $\alpha$ and $\beta$ are both 1.0. The hyper-parameter $\sigma$ of the Gaussian noise in Equation 6 is set to 0.01. The number of words that are replaced in the sentence $x$ during lexical-level perturbations is taken as $\max(0.2|x|, 1)$, in which $|x|$ is the length of $x$. The default beam size for decoding is 10.

Chinese-English NIST datasets trained on RNN-based NMT. Shen et al. (2016) propose minimum risk training (MRT) for NMT, which directly optimizes model parameters with respect to BLEU scores. Wang et al. (2017) address the issue of severe gradient diffusion with linear associative units (LAU). Their system is deep with an encoder of 4 layers and a decoder of 4 layers. Zhang et al. (2018) propose to exploit both left-to-right and right-to-left decoding strategies for NMT to capture bidirectional dependencies. Compared with them, our NMT system trained by MLE outperforms their best models by around 3 BLEU points. We hope that the strong baseline systems used in this work make the evaluation convincing.

NIST Chinese-English Translation
We find that introducing adversarial stability training into NMT brings substantial improvements over previous work (up to +3.16 BLEU points over Shen et al. (2016), up to +3.51 BLEU points over Wang et al. (2017) and up to +2.74 BLEU points over Zhang et al. (2018)) and over our system trained with MLE across all the datasets. Compared with our baseline system, AST_lexical achieves a +1.75 BLEU improvement on average. AST_feature performs better, obtaining +2.59 BLEU points on average and up to +3.34 BLEU points on NIST08.

WMT 14 English-German Translation
In Table 3, we list existing NMT systems as comparisons. All these systems use the same WMT 14 English-German corpus. Shen et al. (2016) adopt MRT and one other system adopts reinforcement learning (RL); all remaining systems use MLE as the training criterion. All the systems except for Shen et al. (2016) are deep NMT models with no fewer than four layers. Google's neural machine translation (GNMT) represents a strong RNN-based NMT system. Our two-layer baseline system achieves better performance than the other RNN-based NMT systems, with the exception of GNMT.
When training our NMT system with AST_lexical, a significant improvement (+1.11 BLEU points) can be observed. AST_feature obtains slightly better performance. Our NMT system outperforms the state-of-the-art RNN-based NMT system, GNMT, by +0.66 BLEU points and performs comparably with Gehring et al. (2017), which is based on a CNN with 15 layers. Given that our approach can be applied to any NMT system, we expect that the adversarial stability training mechanism can further improve the performance of advanced NMT architectures. We leave this for future work.

Table 4 shows the results on IWSLT English-French Translation. Compared with our strong baseline system trained by MLE, we observe that our models consistently improve translation performance on all datasets. AST_feature achieves significant improvements on tst2015, although AST_lexical obtains comparable results. These results demonstrate that our approach maintains good performance on non-normative text.

Table 5: Translation results of synthetic perturbations on the validation set in Chinese-English translation. "1 Op." denotes that we conduct one operation (swap, replacement or deletion) on the original sentence.

Source: zhongguo dianzi yinhang yewu guanli xingui jiangyu sanyue yiri qi shixing
Reference: china's new management rules for e-banking operations to take effect on march 1
MLE: china's electronic bank rules to be implemented on march 1
AST_lexical: new rules for business administration of china 's electronic banking industry will come into effect on march 1 .
AST_feature: new rules for business management of china 's electronic banking industry to come into effect on march 1
Perturbed Source: zhongfang dianzi yinhang yewu guanli xingui jiangyu sanyue yiri qi shixing
MLE: china to implement new regulations on business management
AST_lexical: the new regulations for the business administrations of the chinese electronics bank will come into effect on march 1 .
AST_feature: new rules for business management of china's electronic banking industry to come into effect on march 1
Table 6: Example translations of a source sentence and its perturbed counterpart by replacing a Chinese word "zhongguo" with its synonym "zhongfang."

Results on Synthetic Perturbed Data
In order to investigate the ability of our training approaches to deal with perturbations, we experiment with three types of synthetic perturbations:
• Swap: We randomly choose N positions from a sentence and then swap the chosen words with their right neighbours.
• Replacement: We randomly replace sampled words in the sentence with other words.
• Deletion: We randomly delete N words from each sentence in the dataset.
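The three synthetic operations listed above can be sketched as follows. This is one possible reading of the descriptions rather than the exact procedure used in the experiments: positions are sampled uniformly without replacement, replacement words are drawn uniformly from a small hypothetical vocabulary, and the function names are illustrative.

```python
import random

def swap(words, n, rng):
    """Swap the words at n sampled positions with their right neighbours."""
    out = list(words)
    for i in rng.sample(range(len(out) - 1), n):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def replace(words, n, vocab, rng):
    """Replace the words at n sampled positions with a different word from vocab."""
    out = list(words)
    for i in rng.sample(range(len(out)), n):
        out[i] = rng.choice([v for v in vocab if v != out[i]])
    return out

def delete(words, n, rng):
    """Delete the words at n sampled positions."""
    drop = set(rng.sample(range(len(words)), n))
    return [w for i, w in enumerate(words) if i not in drop]

rng = random.Random(0)
words = "zhongguo dianzi yinhang yewu guanli".split()
vocab = ["xingui", "shixing"]  # hypothetical replacement vocabulary
swapped = swap(words, 1, rng)
replaced = replace(words, 1, vocab, rng)
deleted = delete(words, 2, rng)
print(swapped, replaced, deleted)
```

Each function takes the number of operations as an argument, so the same helpers cover the settings with one or more operations per sentence.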
As shown in Table 5, our training approaches, AST_lexical and AST_feature, consistently outperform MLE under perturbations for every number of operations. This means that our approaches have the capability of resisting perturbations. As the number of operations increases, the performance of MLE drops quickly. Although the performance of our approaches also drops, they consistently surpass MLE. For AST_lexical, with 0 operations, the difference is +2.19 (43.57 vs. 41.38) for all synthetic types, but the differences grow to +3.20, +9.39, and +3.12 respectively for the three types with 5 operations.

Table 7: Ablation study of adversarial stability training AST_lexical on Chinese-English translation. "√" means the loss function is included in the training objective while "×" means it is not.

In the Swap and Deletion types, AST_lexical and AST_feature perform comparably after more than four operations. Interestingly, AST_lexical performs significantly better than both MLE and AST_feature after more than one operation in the Replacement type. This is because AST_lexical trains the model specifically on perturbed data that is constructed by replacing words, which agrees with the Replacement type. Overall, AST_lexical performs better than AST_feature against perturbations after multiple operations. We speculate that this is because the perturbation method for AST_lexical and the synthetic types are both discrete and are therefore more consistent with each other. Table 6 shows example translations of a Chinese sentence and its perturbed counterpart.
These findings indicate that we can construct specific perturbations for a particular task. For example, in simultaneous translation, an automatic speech recognition system often outputs wrong words that have the same pronunciation as the correct words, which dramatically affects the quality of the machine translation system. Therefore, we can design specific perturbations aimed at this task.

Ablation Study
Our training objective function Eq. (4) contains three loss functions. We perform an ablation study on Chinese-English translation to understand the importance of these loss functions, choosing AST_lexical as an example. As Table 7 shows, if we remove $\mathcal{L}_{\text{inv}}$, the translation performance decreases by 0.64 BLEU points. However, when $\mathcal{L}_{\text{noisy}}$ is excluded from the training objective function, it results in a significant drop of 1.66 BLEU points. Surprisingly, using only $\mathcal{L}_{\text{noisy}}$ leads to an increase of 0.88 BLEU points.

Figure 2 shows the changes of BLEU scores over iterations for AST_lexical and AST_feature. They behave nearly identically. Initialized by the model trained by MLE, their performance first drops rapidly and then quickly recovers. Compared with the starting point, the largest drop reaches about 7.0 BLEU points. Overall, the curves oscillate. We think that introducing random perturbations and adversarial learning makes the training less stable than MLE training.

Figure 3 shows the learning curves of the three loss functions, $\mathcal{L}_{\text{true}}$, $\mathcal{L}_{\text{inv}}$ and $\mathcal{L}_{\text{noisy}}$. Their values do not decrease steadily. Similar to Figure 2, there are still oscillations in the learning curves, although they are less sharp. We find that $\mathcal{L}_{\text{inv}}$ converges to around 0.68 after about 100K iterations, which indicates that the discriminator outputs probability 0.5 for both positive and negative samples and cannot distinguish them. Thus the encoder behaves nearly consistently for $x$ and its perturbed neighbour $x'$.

Related Work
Our work is inspired by two lines of research: (1) adversarial learning and (2) data augmentation.
Adversarial Learning. Generative Adversarial Networks (GANs) and their derivatives have been widely applied in computer vision (Radford et al., 2015; Salimans et al., 2016) and natural language processing. Previous work has constructed adversarial examples to attack trained networks and make networks resist them, which has proved to improve the robustness of networks (Goodfellow et al., 2015; Miyato et al., 2016; Zheng et al., 2016). Belinkov and Bisk (2018) introduce adversarial examples to training data for character-based NMT models. In contrast to their work, adversarial stability training aims to stabilize both the encoder and decoder in NMT models. We adopt adversarial learning to learn the perturbation-invariant encoder.
Data Augmentation. Data augmentation has the capability to improve the robustness of NMT models. In NMT, a number of studies augment the training data with monolingual corpora (Sennrich et al., 2016a; He et al., 2016a; Zhang and Zong, 2016). They all leverage complex models such as inverse NMT models to generate translation equivalents for monolingual corpora. Then they augment the parallel corpora with these pseudo corpora to improve NMT models. Some authors have recently endeavored to achieve zero-shot NMT through transferring knowledge from bilingual corpora of other language pairs (Chen et al., 2017; Zheng et al., 2017) or monolingual corpora (Lample et al., 2018; Artetxe et al., 2018). Our work differs significantly from these studies. We do not resort to any complicated models to generate perturbed data and do not depend on extra monolingual or bilingual corpora. Our approach is more convenient and easier to implement, and it focuses on improving the robustness of NMT models.

Conclusion
We have proposed adversarial stability training to improve the robustness of NMT models. The basic idea is to make both the encoder and decoder robust to input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. We propose two approaches to constructing perturbed data to adversarially train the encoder and stabilize the decoder. Experiments on Chinese-English, English-German and English-French translation tasks show that the proposed approach can improve both the robustness and translation performance. As our training framework is not limited to specific perturbation types, it would be interesting to evaluate our approach on the natural noise found in practical applications, such as homonyms in simultaneous translation systems. It is also necessary to further validate our approach on more advanced NMT architectures, such as CNN-based NMT (Gehring et al., 2017) and Transformer (Vaswani et al., 2017).