Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets

This paper proposes a new route for applying generative adversarial nets (GANs) to NLP tasks, taking neural machine translation (NMT) as an instance, and challenges the widespread view that GANs cannot work well in NLP. We build a conditional sequence generative adversarial net that comprises two adversarial sub-models: a generative model (generator), which translates the source sentence into the target sentence as traditional NMT models do, and a discriminative model (discriminator), which discriminates the machine-translated target sentence from the human-translated one. From the perspective of the Turing test, the proposed model aims to generate translations that are indistinguishable from human translations. Experiments show that the proposed model achieves significant improvements over the traditional NMT model. In Chinese-English translation tasks, we obtain improvements of up to +2.0 BLEU points. To the best of our knowledge, this is the first time that quantitative results on the application of GANs to a traditional NLP task have been reported. We also present detailed strategies for GAN training. In addition, we find that the discriminator of the proposed model shows great capability in data cleaning.


Introduction
Machine translation is one of the traditional NLP tasks; it aims to translate a source-language sentence into the corresponding target-language sentence automatically. Recently, with the rapid development of deep neural networks, neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Ranzato et al., 2015), which uses a single neural network to transform the source sentence directly into the target sentence, has obtained state-of-the-art performance for several language pairs (Johnson et al., 2016; Bradbury and Socher, 2016). This end-to-end NMT typically consists of two recurrent neural sub-networks. The encoder network reads the source sentence and encodes it into a context vector representation, and the decoder network generates the target sentence word by word based on the context vector. To dynamically generate a context vector for the target word being generated, an attention mechanism is usually deployed. The NMT model is optimized by directly maximizing the likelihood of the training data. Specifically, at each decoding step, the NMT model is optimized to maximize the likelihood of the ground-truth word at the current step (maximum likelihood estimation, MLE). Ranzato et al. (2015) point out that the MLE loss function is defined only at the word level instead of the sentence level. Hence the NMT model may generate the best candidate word for the current time step that nevertheless turns out to be a bad component of the whole sentence in the long run. Shen et al. (2015) give a solution by introducing minimum risk training from statistical machine translation (SMT). They incorporate the sentence-level BLEU (Chen and Cherry, 2014) into the loss function, so the NMT model is optimized to generate sentences with higher BLEU scores.
Since the BLEU score is computed as the geometric mean of the modified n-gram precisions (Papineni et al., 2002), we conclude that almost all prior NMT models are trained to share more n-grams with the ground-truth target sentence (MLE can be viewed as training the NMT model to cover more 1-grams of the target sentence).
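To make the n-gram view of BLEU concrete, the following is a minimal sketch of a single-reference, sentence-level score built from clipped n-gram precisions. It is an illustration only: it uses no smoothing (unlike the smoothed sentence-level BLEU variants cited above), and the function names are hypothetical rather than taken from any toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=4):
    """Geometric mean of the modified 1..max_n-gram precisions, multiplied
    by the brevity penalty. Unsmoothed: any zero precision gives 0."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * geo_mean
```

A candidate that repeats a single correct word, such as "the the the the" against the reference "the cat", receives a unigram precision of only 1/4 because of clipping, which is the sense in which the precisions are "modified".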
However, it is widely acknowledged that higher n-gram precision does not ensure a better sentence (Callison-Burch and Osborne, 2006; Chatterjee et al., 2007). Additionally, a manually defined loss function is unable to cover all crucial aspects, so the NMT model may be trained to deviate from the data distribution and generate suboptimal sentences. Intuitively, the model should be trained to directly generate a human-like translation instead of covering human-designed approximation features. From the Turing test perspective, we should make the model aware of what a human-generated sentence looks like and have it generate sentences that are indistinguishable from human-generated ones. Based on the analysis above, we propose that a good training objective for NMT has the following properties: 1) the loss function is defined at the sentence level rather than the word level; 2) no manually defined approximation feature is used to guide the NMT model; 3) the NMT model is directly exposed to the true data distribution. Specifically, the model should be trained to directly output translations indistinguishable from human-generated translations, and if a poor sentence is generated, the model should be penalized according to how far the poor sentence is from the human-generated one.
Borrowing the idea of generative adversarial training from computer vision (Goodfellow et al., 2014; Denton et al., 2015), we build a conditional sequence generative adversarial net (CS-GAN) which implements the training objective described above. In the proposed CSGAN-NMT, we jointly train two models: a generator (implemented as the traditional NMT model), which generates the target-language sentence based on the input source-language sentence, and a discriminator, which, conditioned on the source-language sentence, predicts the probability that the target-language sentence is a human-generated one. During training, the generator aims to fool the discriminator into believing that its output is a human-generated sentence, and the discriminator tries not to be fooled by improving its ability to distinguish machine-generated sentences from human-generated ones. This kind of adversarial training achieves a win-win situation when the generator and discriminator reach a Nash equilibrium (Zhao et al., 2016).
In summary, we make the following contributions:
• To the best of our knowledge, we are the first to introduce generative adversarial training into NMT, which trains the NMT model to generate sentences that are indistinguishable from human-generated sentences. We build a conditional generative adversarial net which can be applied to any end-to-end NMT system. We do not assume a specific architecture for the NMT model.
• Extensive experiments on Chinese-to-English translation tasks show that the proposed CSGAN-NMT significantly outperforms the strong attention-based NMT model which serves as the baseline. We present detailed, quantitative results to demonstrate the effectiveness of the proposed CSGAN-NMT. This indicates the feasibility of applying GANs to traditional NLP tasks.
• We successfully leverage the discriminator to clean the training data for NMT.
• We test two architectures for the discriminator, one based on a convolutional neural network (CNN) and one based on a recurrent neural network (RNN), and find that the RNN-based discriminator is not suitable for this setting.
• We report our specific training strategies for the proposed CSGAN-NMT. This provides a new, reliable route for applying generative adversarial nets to other NLP tasks.
Related work

Neural machine translation
This subsection briefly describes the attention-based NMT model, which simultaneously conducts dynamic alignment and generation of the target sentence. The NMT model produces the translation by generating one target word at every time step. Given an input sequence $x = (x_1, \ldots, x_{T_x})$ and the previously translated words $(y_1, \ldots, y_{i-1})$, the probability of the next word $y_i$ is:
$$p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i) \tag{1}$$
where $s_i$ is the decoder hidden state for time step $i$, which is computed as:
$$s_i = f(s_{i-1}, y_{i-1}, c_i) \tag{2}$$
Here $f$ and $g$ are nonlinear transform functions, which can be implemented as a long short-term memory network (LSTM) or a gated recurrent unit (GRU), and $c_i$ is a distinct context vector at time step $i$, which is calculated as a weighted sum of the input annotations $h_j$:
$$c_i = \sum_{j=1}^{T_x} a_{ij} h_j \tag{3}$$
where $h_j$ is the annotation of $x_j$ from a bidirectional RNN. The weight $a_{ij}$ for $h_j$ is calculated as:
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \tag{4}$$
where
$$e_{ij} = v_a^{\top} \tanh(W s_{i-1} + U h_j) \tag{5}$$
Here $v_a$ is the weight vector, and $W$ and $U$ are the weight matrices. All of the parameters in the NMT model are optimized to maximize the following conditional log-likelihood of the $M$ sentence-aligned bilingual samples:
$$L(\theta) = \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{T_y} \log p(y_i^m \mid y_{<i}^m, x^m, \theta) \tag{6}$$
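As an illustration of how the context vector is formed at one decoder step, the sketch below computes the attention weights and the weighted sum of annotations in plain Python. It assumes the standard additive scoring function, e_ij = v_a^T tanh(W s_{i-1} + U h_j), followed by a softmax over source positions; the helper names are hypothetical and the code is not taken from dl4mt.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def attention_context(s_prev, annotations, W, U, v_a):
    """Return (weights a_ij, context vector c_i) for one decoder step.

    s_prev      -- previous decoder state s_{i-1}
    annotations -- list of encoder annotations h_j
    W, U, v_a   -- attention parameters (assumed additive scoring)
    """
    Ws = matvec(W, s_prev)  # shared across all source positions
    energies = []
    for h_j in annotations:
        Uh = matvec(U, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        energies.append(sum(v * t for v, t in zip(v_a, hidden)))
    weights = softmax(energies)
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations))
               for d in range(dim)]
    return weights, context
```

Because the weights are a softmax, they are non-negative and sum to one, so the context vector is always a convex combination of the annotations.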

Generative adversarial net
Generative adversarial networks, in which a generative model is trained to generate outputs that fool a discriminator, have enjoyed great success in computer vision and have been widely applied to image generation. Conditional generative adversarial nets extend the generative adversarial network to a conditional setting, which enables the networks to condition on arbitrary external data.
However, to the best of our knowledge, this idea has not been applied to traditional NLP tasks with comparable success, and few quantitative experimental results have been reported. Some recent works have begun to apply generative adversarial training in NLP: the idea has been applied to sentiment analysis and to domain adaptation.
For sequence generation, policy gradient reinforcement learning has been used to back-propagate the reward from the discriminator, with presentable results for poem generation, speech language generation and music generation. Similarly, text has been generated from random noise via adversarial training. In striking contrast to the works mentioned above, our work operates in the conditional setting, where the target-language sentence is generated conditioned on the source-language one.
In parallel to our work, Li et al. (2017) propose a similar conditional sequence generative adversarial training for dialogue generation. They use a hierarchical LSTM architecture for the discriminator. In contrast to their approach, we apply a CNN-based discriminator to the machine translation task. Furthermore, we present detailed training strategies for the proposed model and report extensive quantitative results.

The CSGAN-NMT
In this section, we describe in detail the CSGAN-NMT, which consists of a generator G that generates the target-language sentence based on the source-language sentence and a discriminator D that distinguishes the machine-generated sentence from the human-generated one. The sentence generation process is viewed as a sequence of actions taken according to a policy regulated by the generator. In this work, we adopt the same policy gradient training strategy as the prior adversarial sequence generation work discussed above.

Generator
Resembling the traditional NMT model, the generator G generates the target-language sentence conditioned on the input source-language sentence. It defines the policy that generates the target sentence y given the source sentence x. The generator has exactly the same architecture as the traditional NMT model. Note that we do not assume a specific architecture for the generator. Here, we adopt as the generator the strong attention-based NMT model implemented in the open-source system dl4mt.

Discriminator
Recently, deep discriminative models such as the CNN and RNN have shown high performance in complicated sequence classification tasks. To test the efficacy of the discriminator, we propose two different architectures for the discriminator: the CNN-based and RNN-based discriminators.
Figure 1: The CNN-based architecture for the discriminator. Note that only the source-side CNN is depicted.
Figure 2: The BiLSTM-based architecture for the discriminator. Only the source-side BiLSTM is depicted.
CNN-based. Since sentences generated by the generator have variable length, padding is used to transform each sentence into a sequence of fixed length $T$, the maximum input length set by the user beforehand. Given the source-language sequence $x_1, \ldots, x_T$ and the target-language sequence $y_1, \ldots, y_T$, we build the source matrix $X_{1:T}$ and target matrix $Y_{1:T}$ respectively as:
$$X_{1:T} = x_1; x_2; \ldots; x_T \tag{7}$$
and
$$Y_{1:T} = y_1; y_2; \ldots; y_T \tag{8}$$
where $x_t, y_t \in \mathbb{R}^k$ are $k$-dimensional word embeddings and the semicolon is the concatenation operator. For the source matrix $X_{1:T}$, a kernel $w_j \in \mathbb{R}^{l \times k}$ applies a convolutional operation to a window of $l$ words to produce a series of feature maps:
$$c_{ji} = \rho(\mathrm{BN}(w_j \otimes X_{i:i+l-1} + b)) \tag{9}$$
where the $\otimes$ operator is the sum of element-wise products and $b$ is a bias term. $\rho$ is a nonlinear activation function, implemented as ReLU in this paper. Note that batch normalization (Ioffe and Szegedy, 2015), which accelerates training significantly, is applied to the input of the activation function (BN in equation 9). To get the final feature with respect to kernel $w_j$, a max-over-time pooling operation is applied over the feature maps:
$$\tilde{c}_j = \max\{c_{j1}, \ldots, c_{j,T-l+1}\} \tag{10}$$
We use various numbers of kernels with different window sizes to extract different features, which are finally concatenated to form the source-language sentence representation $c_x$. The target-language sentence representation $c_y$ is extracted from the target matrix $Y_{1:T}$ in the same way. Finally, given the source-language sentence, the probability that the target-language sentence is real is computed as:
$$p = \sigma(V[c_x; c_y]) \tag{11}$$
where $V$ is the transform matrix which transforms the concatenation of $c_x$ and $c_y$ into a 2-dimensional embedding and $\sigma$ is the logistic function. The CNN-based discriminator is depicted in figure 1.
RNN-based. Recurrent neural networks come in several forms, such as the LSTM, the GRU and the simple recurrent neural network (simple-RNN). This paper takes the LSTM as an instance. Given the source-language sequence $x_1, \ldots, x_s$, an LSTM is used to map the input sequence into a sequence of hidden states $h_1, \ldots, h_s$ by applying the update function of the LSTM cell recursively:
$$h_t = \mathrm{LSTM}(h_{t-1}, x_t) \tag{12}$$
The vector representation of the source-language sentence, $c_x$, is computed as the average of the hidden states. The target sentence vector $c_y$ is computed in the same way. Finally, the probability that the target-language sentence is real is computed as in equation 11. We also try the bidirectional LSTM as an alternative to the LSTM. The BiLSTM-based discriminator is depicted in figure 2.
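The CNN-based feature extraction described above (padding, windowed convolution, ReLU, and max-over-time pooling) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: batch normalization is omitted, the final classifier uses a single output logit rather than the 2-dimensional output mentioned in the text, and all function names are hypothetical.

```python
import math


def pad(embeddings, T, k):
    """Pad with zero vectors (or truncate) to exactly T rows of size k."""
    padded = embeddings[:T]
    return padded + [[0.0] * k for _ in range(T - len(padded))]


def conv_feature_map(X, kernel, bias):
    """Slide an l x k kernel over the T x k sentence matrix X.
    Each entry is ReLU(sum of element-wise products + bias).
    Batch normalization is left out of this sketch."""
    l = len(kernel)
    feature_map = []
    for i in range(len(X) - l + 1):
        total = bias
        for r in range(l):
            total += sum(w * x for w, x in zip(kernel[r], X[i + r]))
        feature_map.append(max(0.0, total))
    return feature_map


def sentence_representation(X, kernels, biases):
    """Max-over-time pooling per kernel, concatenated into one vector."""
    return [max(conv_feature_map(X, kern, b))
            for kern, b in zip(kernels, biases)]


def real_probability(c_x, c_y, v, bias=0.0):
    """Logistic score on the concatenation [c_x; c_y].
    Simplification: one logit instead of a 2-dimensional output."""
    logit = bias + sum(w * f for w, f in zip(v, c_x + c_y))
    return 1.0 / (1.0 + math.exp(-logit))
```

Because pooling takes the maximum over all window positions, the representation has one entry per kernel regardless of the sentence length T.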

Policy gradient training
Following the policy gradient formulation adopted above, the objective of the generator G is to generate a sequence from the start state that maximizes its expected end reward. Formally, the objective function is computed as:
$$J(\theta) = \sum_{Y_{1:T}} G_\theta(Y_{1:T} \mid X) \cdot R_D^{G_\theta}(Y_{1:T-1}, X, y_T) \tag{13}$$
where $Y_{1:T} = y_1, \ldots, y_T$ indicates the generated target sequence and $R_D^{G_\theta}$ is the action-value function of a target-language sentence given the source sentence $X$, i.e. the expected accumulative reward starting from the state $(Y_{1:T-1}, X)$, taking action $y_T$, and following the policy $G_\theta$. To estimate the action-value function, we take the probability of being real estimated by the discriminator D as the reward:
$$R_D^{G_\theta}(Y_{1:T-1}, X, y_T) = D(X, Y_{1:T}) - b(X, Y_{1:T}) \tag{14}$$
where $b(X, Y)$ denotes the baseline value used to reduce the variance of the reward. In practice, we take $b(X, Y)$ as a constant, 0.5, for simplicity. The difficulty is that, given the source sequence, the discriminator D only provides a reward value for a finished target sequence. If $Y_{1:T}$ is not a finished target sequence, the value of $D(X, Y_{1:T})$ is meaningless. Therefore, we cannot get the action-value for an intermediate state directly. To evaluate the action-value for an intermediate state, Monte Carlo search under the policy of G is applied to sample the unknown tokens. Each search ends when the end-of-sentence token is sampled or the sampled sentence reaches the maximum length. To obtain a more stable reward and reduce the variance, we represent an N-time Monte Carlo search as:
$$\{Y^1_{1:T_1}, \ldots, Y^N_{1:T_N}\} = \mathrm{MC}^{G_\theta}((Y_{1:t}, X), N) \tag{15}$$
where $T_i$ represents the length of the sentence sampled by the $i$-th Monte Carlo search, $(Y_{1:t}, X) = (y_1, \ldots, y_t, X)$ is the current state, and $Y^N_{t+1:T_N}$ is sampled based on the policy G. The discriminator provides N rewards, one for each of the N sampled sentences. The final reward for the intermediate state is calculated as the average of the N rewards.
Hence, for a target sentence of length $T$, we compute the reward for $y_t$ at the sentence level as:
$$R_D^{G_\theta}(Y_{1:t-1}, X, y_t) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} \left[ D(X, Y^n_{1:T_n}) - b(X, Y^n_{1:T_n}) \right] & t < T \\ D(X, Y_{1:t}) - b(X, Y_{1:t}) & t = T \end{cases} \tag{16}$$
Using the discriminator as a reward function allows the generator to be improved iteratively as the discriminator is dynamically updated. Once we have more realistic generated sequences, we re-train the discriminator as:
$$\min_D \; -\mathbb{E}_{(X,Y) \sim P_{data}}[\log D(X, Y)] - \mathbb{E}_{(X,Y) \sim G}[\log(1 - D(X, Y))] \tag{17}$$
After updating the discriminator, we are ready to re-train the generator. The gradient of the objective function $J(\theta)$ w.r.t. the generator's parameters $\theta$ is calculated as:
$$\nabla J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{y_t} R_D^{G_\theta}(Y_{1:t-1}, X, y_t) \cdot \nabla_\theta G_\theta(y_t \mid Y_{1:t-1}, X) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_t \sim G_\theta} \left[ R_D^{G_\theta}(Y_{1:t-1}, X, y_t) \cdot \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}, X) \right] \tag{18}$$
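The Monte Carlo estimate of the intermediate reward can be sketched as below. This is a minimal illustration, not the authors' implementation: `sample_next` and `discriminator` are hypothetical callables standing in for the generator policy and the discriminator, and the default maximum length of 50 is an assumption borrowed from the sentence-length limit used in the experiments.

```python
def rollout(source, prefix, sample_next, max_len, eos):
    """Complete a partial target sentence by sampling from the generator
    policy until the EOS token is produced or max_len is reached."""
    tokens = list(prefix)
    while len(tokens) < max_len and (not tokens or tokens[-1] != eos):
        tokens.append(sample_next(source, tokens))
    return tokens


def intermediate_reward(source, prefix, sample_next, discriminator,
                        n_rollouts=20, max_len=50, eos="</s>", baseline=0.5):
    """Estimate the reward of a (possibly unfinished) prefix.

    A finished sentence is scored once by the discriminator. An unfinished
    prefix is completed n_rollouts times and the rewards (D - baseline)
    are averaged.
    """
    if prefix and (prefix[-1] == eos or len(prefix) >= max_len):
        return discriminator(source, list(prefix)) - baseline
    total = 0.0
    for _ in range(n_rollouts):
        completed = rollout(source, prefix, sample_next, max_len, eos)
        total += discriminator(source, completed) - baseline
    return total / n_rollouts
```

With the constant baseline of 0.5, a completion that the discriminator judges more likely human than machine yields a positive reward and is encouraged, while one judged more likely machine-generated yields a negative reward.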

Training strategies
It is hard to train the generative adversarial networks since the generator and discriminator need to be carefully synchronized. To make this work easier to reproduce, this paper gives detailed strategies for training the CSGAN-NMT model.
Firstly, we use maximum likelihood estimation to pre-train the generator on the parallel training set s until the best translation performance is achieved.
Then, we produce the machine-generated sentences by using the generator to decode the training data. We simply use greedy sampling instead of beam search for decoding, so decoding the whole training set is very fast.
Next, we pre-train the discriminator on the combination of the true parallel data and the machine-generated data until its classification accuracy reaches $f$. Finally, we jointly train the generator and discriminator. The generator is trained with the policy gradient training method. We randomly sample a batch of source sentences from s as the training examples for the generator, with batch size $\beta_g$. Note that the target sentences are not used while the generator is undergoing policy gradient training. However, in practice, we find that updating the generator only with simple policy gradient training leads to unstable training: the translation performance drops sharply after a few updates. We conjecture that this is because the generator can only indirectly access the golden target sentence through the reward passed back from the discriminator, and this reward is used only to promote or discourage the machine-generated sentences. To alleviate this issue, we adopt a teacher forcing approach similar to Lamb et al. (2016) and Li et al. (2017). We make the discriminator automatically assign a reward of 1 to the golden target-language sentence, and the generator uses this reward to update itself on the true parallel examples. We run teacher forcing training once each time the generator is updated by policy gradient training. After the generator is updated, we use the new, stronger generator to generate $\eta$ more realistic sentences, which are then used to train the discriminator. The batch size for training the discriminator is referred to as $\beta_d$. Following Arjovsky et al. (2017), we clamp the weights of the discriminator to a fixed box ($[-\epsilon, \epsilon]$) after each gradient update. We perform one optimization step for the discriminator for each step of the generator. In practice, we set $f$ to 0.82, $\beta_g$ to 100, $\beta_d$ to 64, $\eta$ to 5000, $\epsilon$ to 1 and the N for Monte Carlo search to 20. We apply the Adam optimization method, with an initial learning rate of 0.001, for pre-training the generator and discriminator. During generative adversarial training, the RMSProp optimization method with an initial learning rate of 0.0001 is used for the generator and discriminator.
Table 1: BLEU score on Chinese-English translation tasks. "+sgd" means using SGD to fine-tune the model and "baseline-small" indicates that the model is trained on the small data set.
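The hyper-parameters and the alternating update schedule described in this section can be summarized in a short sketch. The field names, the `clamp_weights` helper and the callback-based loop are illustrative assumptions rather than the authors' code; in particular, the value 1.0 is assumed to be the clamp bound, since the symbol for it is not legible in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CSGANConfig:
    """Values as reported in the text; field names are hypothetical."""
    disc_pretrain_accuracy: float = 0.82  # f
    generator_batch_size: int = 100       # beta_g
    discriminator_batch_size: int = 64    # beta_d
    fresh_samples_per_update: int = 5000  # eta
    clamp_bound: float = 1.0              # assumed to be the clamp box bound
    mc_rollouts: int = 20                 # N for Monte Carlo search
    pretrain_lr: float = 1e-3             # Adam, pre-training
    adversarial_lr: float = 1e-4          # RMSProp, adversarial training


def clamp_weights(weights, bound):
    """Clip every discriminator weight into the box [-bound, bound]."""
    return [max(-bound, min(bound, w)) for w in weights]


def joint_training(num_iterations, policy_gradient_step,
                   teacher_forcing_step, discriminator_step):
    """Per iteration: one policy gradient update of the generator, one
    teacher forcing update, then one discriminator update (which is
    expected to clamp its own weights afterwards)."""
    for _ in range(num_iterations):
        policy_gradient_step()
        teacher_forcing_step()
        discriminator_step()
```

Passing the three update steps as callables keeps the sketch independent of any particular deep learning framework; in a real system each would wrap an optimizer step on the corresponding model.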

Experiments and Results
In this section, we detail our experiments and results on the CSGAN-NMT model for Chinese-English translation tasks. The open-source NMT system dl4mt, which has been used to build top-performing submissions to shared translation tasks at WMT and IWSLT (Sennrich et al., 2017), is used as the baseline model. Note that the generator of the CSGAN-NMT is implemented identically to the baseline model.

Setup
We perform two tasks on Chinese-English translation: one with a small training set (1.0M sentence pairs) and the other with a large-scale training set (1.6M sentence pairs). The training sets are randomly extracted from LDC corpora * . The large training set is only used to test the feasibility of the proposed model in settings where a large amount of training data is available. Unless otherwise specified, the following experiments are run on the small training set. We choose NIST02 as the development set. For testing, we use the NIST03, NIST04 and NIST05 data sets. We apply word-level translation in our experiments, and the Chinese sentences are segmented beforehand. To speed up training, sentences longer than 50 words are removed. We limit the vocabulary in both Chinese and English to the 30K most frequent words, and out-of-vocabulary words are replaced with UNK. The word embedding dimension is set to 512 and the size of the hidden layer to 1024. The other hyper-parameters are set according to section 4. We use the case-insensitive 4-gram BLEU score as the evaluation metric. We train the NMT models on 4 K80 GPUs, and it takes about 30 hours to train the baseline model on the small training set. The training time for the proposed CSGAN-NMT model (pre-training included) is about 3 days.

CNN or RNN for discriminator
We test different architectures for the discriminator, CNN-based and RNN-based (LSTM for instance). Figure 3 shows the BLEU scores on the development set tested at different time steps. We find that the performance of the models with RNN-based (LSTM and BiLSTM) discriminators deteriorates rapidly as training proceeds. Even more strikingly, the performance with the BiLSTM-based discriminator collapses sharply after several time steps. Training with the RNN-based discriminator is not stable. In contrast, the CNN-based discriminator performs very well. Empirically, after a few updates, the classification accuracy of the RNN-based discriminators can easily reach as high as 0.9, which is too strong for the generator. The sentences generated by the generator are then easily discriminated by the strong discriminator, and the generator is discouraged all the time. We conjecture that this is why the RNN-based discriminators work badly in our CSGAN-NMT. Unless otherwise specified, we use the CNN-based discriminator for the following experiments.
Table 1 shows the BLEU score on the Chinese-English test sets. Compared to the baseline model, the proposed CSGAN-NMT model leads to an improvement of up to +1.7 BLEU points on average when trained on the small data set (see (1) and (4) in table 1). Naturally, one may suspect that this improvement owes much to the small learning rate of the optimization method rather than to the CSGAN-NMT itself. To dispel this doubt and verify the efficacy of the proposed model, we fine-tune the baseline model using stochastic gradient descent with a learning rate of 0.0001, i.e., the learning rate used in CSGAN-NMT, and obtain an improvement of only +0.7 BLEU points (see (3) in table 1). There is a gap as large as 1.0 BLEU points on average (30.12 vs 31.14) between the fine-tuned baseline model and the CSGAN-NMT model (see (3) and (4) in table 1).
Additionally, we get another +0.3 BLEU points of improvement when running the CSGAN-NMT on top of the fine-tuned baseline model (see (3) and (5) in table 1). On the large training set, the CSGAN-NMT model leads to an improvement of up to +2.5 BLEU points on NIST05 and +2.1 BLEU points on average (see (2) and (6) in table 1). To conclude, these experiments show that NMT can be greatly improved by generative adversarial training and that the proposed CSGAN-NMT model achieves consistent improvement when it is trained on the large data set.
* LDC2002L27, LDC2002T01, LDC2002E18, LDC2003E07, LDC2004T08, LDC2004E12, LDC2005T10

Initial accuracy of the discriminator
The initial accuracy f of the discriminator, which can be viewed as a hyper-parameter, can be controlled carefully during pre-training. A natural question is when to end the pre-training: do we need to pre-train the discriminator until its accuracy is as high as possible? To test the impact of the initial accuracy of the discriminator, we pre-train five discriminators with accuracies of 0.6, 0.7, 0.8, 0.9 and 0.95 respectively. With the five discriminators, we train five different CSGAN-NMT models and test their translation performance on the development set at regular intervals. Figure 4 reports the results: the initial accuracy of the discriminator has a great impact on the translation performance of the proposed model. As figure 4 shows, the initial accuracy of the discriminator needs to be set carefully; whether it is set too high (0.9 and 0.95) or too low (0.6 and 0.7), the CSGAN-NMT performs badly. Empirically, we pre-train the discriminator until its accuracy reaches 0.82.
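Stopping the discriminator pre-training at a target accuracy, rather than training it to convergence, can be sketched as a simple loop. The callables `train_one_step` and `evaluate_accuracy` are hypothetical placeholders for one optimizer update and a held-out accuracy measurement; how often accuracy is evaluated in the actual experiments is not stated, so evaluating after every step is an assumption.

```python
def pretrain_until(train_one_step, evaluate_accuracy,
                   target=0.82, max_steps=100000):
    """Run training steps until accuracy first reaches `target`.

    Returns (number of steps run, last measured accuracy). Training stops
    as soon as the target is met, so the discriminator is deliberately
    left weaker than it could become.
    """
    accuracy = 0.0
    for step in range(1, max_steps + 1):
        train_one_step()
        accuracy = evaluate_accuracy()
        if accuracy >= target:
            return step, accuracy
    return max_steps, accuracy
```

The `max_steps` cap simply guards against a discriminator that never reaches the target.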

Sample times for Monte Carlo search
We are also curious about the number of samples N used for Monte Carlo search. If N is set too small, the intermediate reward computed by equation 16 may be inaccurate; if it is set too large, the computation becomes very time-consuming. There is a trade-off between accuracy and computational cost. Table 2

Discriminator for data cleaning
The discriminator of the CSGAN-NMT directly outputs the probability that, given the source-language sentence, the target-language sentence is a human-generated one. This motivates us to test the feasibility of applying the discriminator to data cleaning. When we finish training the CSGAN-NMT model, the accuracy of the discriminator is near 0.6, which is somewhat weak for the data cleaning task. Hence, we continue training the discriminator for 4 epochs, after which its accuracy reaches 0.95. Then, by feeding the parallel training data into the discriminator, we get a probability of being human-translated for each sentence pair. We select a set of examples ($s_1$) from the training data in descending order of this probability. Additionally, we randomly choose another set $s_2$ with the same number of examples as $s_1$. Two traditional NMT models with the same configuration are trained on the data sets $s_1$ and $s_2$ respectively. In practice, we sample the set $s_2$ five times with different random seeds and report the average translation performance of the five NMT models trained on the five $s_2$ sets. The results are reported in table 3. The models trained on the cleaned data achieve better translation performance than their counterparts trained on the randomly sampled data. Furthermore, the model trained on 600K cleaned sentence pairs achieves translation performance comparable to the model trained on 800K noisy sentence pairs on NIST02 and NIST05. This indicates that the discriminator is capable of cleaning the training data for NMT.
Table 3: The translation performance of the NMT model on test sets when the data size for the subset is 600K and 800K. The result for s2 is the average translation performance.
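The selection procedure used for data cleaning can be sketched as follows: rank the sentence pairs by the discriminator's probability and keep the top k, and draw a same-sized random subset as the control. The `score` callable is a hypothetical stand-in for the trained discriminator, and the function names are illustrative.

```python
import random


def select_cleaned(pairs, score, k):
    """Keep the k (source, target) pairs that the discriminator judges
    most likely to be human-translated."""
    ranked = sorted(pairs, key=lambda p: score(p[0], p[1]), reverse=True)
    return ranked[:k]


def select_random(pairs, k, seed):
    """Control subset of the same size, drawn uniformly without
    replacement; the seed makes each draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)
```

Repeating `select_random` with several different seeds and averaging the resulting scores mirrors the five-seed averaging described for the control set.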