Bidirectional Generative Adversarial Networks for Neural Machine Translation

Generative Adversarial Networks (GANs) have been proposed to tackle the exposure bias problem of Neural Machine Translation (NMT). However, the discriminator typically makes GAN training unstable because it is inadequately trained: the search space is so large that the sampled translations are not sufficient for discriminator training. To address this issue and stabilize GAN training, we propose a novel Bidirectional Generative Adversarial Network for Neural Machine Translation (BGAN-NMT), which introduces a generator model to act as the discriminator, so that the discriminator naturally considers the entire translation space and the inadequate training problem is alleviated. To satisfy this property, the generator and the discriminator are both designed to model the joint probability of sentence pairs, with the difference that the generator decomposes the joint probability into a source language model and a source-to-target translation model, while the discriminator is formulated as a target language model and a target-to-source translation model. To further leverage their symmetry, an auxiliary GAN is introduced that adopts the generator and discriminator models of the original GAN as its own discriminator and generator, respectively. The two GANs are trained alternately to update the parameters. Experimental results on German-English and Chinese-English translation tasks demonstrate that our method not only stabilizes GAN training but also achieves significant improvements over baseline systems.


Introduction
The past several years have witnessed the rapid development of Neural Machine Translation (NMT) (Sutskever et al., 2014), from catching up with Statistical Machine Translation (SMT) (Koehn et al., 2003; Chiang, 2007) to outperforming it by significant margins on many languages (Sennrich et al., 2016; Gehring et al., 2017; Vaswani et al., 2017; Hassan et al., 2018). The most common approach to training NMT is to maximize the conditional log-probability of the correct translation given the source sentence. However, as argued in Bengio et al. (2015), the Maximum Likelihood Estimation (MLE) principle suffers from so-called exposure bias at inference time: the model predicts the next token conditioned on its previously predicted tokens, which may never have been observed in the training data. To address this problem, much recent work attempts to reduce the inconsistency between training and inference, for example by adopting sequence-level objectives and directly maximizing BLEU scores (Bengio et al., 2015; Ranzato et al., 2015; Shen et al., 2016; Wiseman and Rush, 2016).
* This work was done when the first author was an intern at Microsoft Research Asia.
Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is another promising framework for alleviating the exposure bias problem and has recently shown remarkable promise in NMT (Wu et al., 2017). Formally, a GAN consists of two "adversarial" models: a generator and a discriminator. In machine translation, the NMT model is used as the generator that produces translation candidates given a source sentence, and another neural network is introduced to serve as the discriminator, which takes sentence pairs as input and distinguishes whether a given sentence pair is real or generated. Adversarial training between the two models involves optimizing a min-max objective, in which the discriminator learns to distinguish whether a given data instance is real or fake, and the generator learns to confuse the discriminator by generating high-quality translation candidates. Since the generated data consists of discrete symbols (words), the policy gradient method (Yu et al., 2017) is usually adopted to update the model parameters of the generator. Specifically, given a set of translation candidates sampled from the generator, the confidence scores calculated by the discriminator are used as rewards to update the generator.
However, in this training process the discriminator typically suffers from the inadequate training problem, which leads to unstable GAN training. In practice, sampling a large number of translation candidates is time-consuming for an NMT system, so only a few samples are used to train the discriminator. For a given source sentence, there is usually only one positive example (the real target sentence). If the sampled negative examples are also few, the discriminator easily overfits to the data. This is the inadequate training problem of the discriminator. In such a case, the rewards calculated by the discriminator can be biased, especially for unobserved samples. These biased rewards give a wrong signal to the generator and make it update incorrectly, which degrades the performance of the generator. Since this issue can occur repeatedly throughout the entire training process, GAN training becomes unstable and the performance of the generator can drop drastically.
On the other hand, the generator has well-designed properties that benefit the discriminator: it models the probability distribution over the entire translation space, so it does not overfit to the observed samples, and prior knowledge about unobserved samples is naturally taken into account. At the same time, the generator also exhibits a certain ability to identify whether a given data instance is good enough. For example, a target-to-source translation model can serve as the discriminator to improve a source-to-target translation model (Tu et al., 2017). Inspired by this, we propose a novel Bidirectional Generative Adversarial Network for Neural Machine Translation (BGAN-NMT), which employs a generator model in the role of the discriminator so as to handle the inadequate training problem and stabilize GAN training. To satisfy this property, both the generator and the discriminator of the original GAN are designed to model the joint probability of sentence pairs, with the difference that the generator model A is decomposed into a source language model and a source-to-target translation model, while the discriminator model B is formulated as a target language model and a target-to-source translation model. Intuitively, we can also let A act as the discriminator to improve B, and the improved B in turn serves as a better discriminator to guide the training of A. To make use of this symmetry, we bring in an auxiliary GAN that adopts the generator and discriminator models of the original GAN as its own discriminator and generator, respectively. We then design a joint training algorithm that alternately uses these two GANs to update the source-to-target and target-to-source translation models.
Our experiments are conducted on German-English and Chinese-English translation data sets. Experimental results demonstrate that our BGAN-NMT not only achieves the stability of GAN training but also significantly improves translation performance over baseline systems.

Neural Machine Translation
An attention-based NMT model is adopted for both the source-to-target and target-to-source translation models used in our BGAN-NMT. The attention-based NMT system is implemented as an encoder-decoder framework with recurrent neural networks (RNN), which can be Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks in practice.

Encoder-Decoder Framework
The encoder reads the source sentence $x = (x_1, x_2, \ldots, x_T)$ and transforms it into a sequence of hidden states $h = (h_1, h_2, \ldots, h_T)$ using a bidirectional RNN. The decoder uses another RNN to generate the translation $y = (y_1, y_2, \ldots, y_{T'})$ based on the hidden states $h$. At each time step $i$, the conditional probability of each word $y_i$ from a target vocabulary $V_y$ is computed with

$$P(y_i \mid y_{<i}, h) = \mathrm{softmax}(W_s z_i + b_z) \tag{1}$$

where $z_i$ is the $i$-th hidden state of the decoder and $W_s$, $b_z$ are the parameters of the output layer. The state $z_i$ is calculated conditioned on the previous hidden state $z_{i-1}$, the previous word $y_{i-1}$ and the source context vector $c_i$:

$$z_i = \mathrm{RNN}(z_{i-1}, y_{i-1}, c_i) \tag{2}$$

The source context vector $c_i$ is a weighted sum of the hidden states $(h_1, h_2, \ldots, h_T)$ with the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_T$ calculated with

$$\alpha_t = \frac{\exp\big(a(z_{i-1}, h_t)\big)}{\sum_{k=1}^{T} \exp\big(a(z_{i-1}, h_k)\big)} \tag{3}$$

where $a$ is a feed-forward neural network with a single hidden layer.
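The attention step described above (normalized weights over the encoder states, then a weighted sum) can be sketched for a single decoder step as follows. This is only an illustration: the function names are hypothetical, and the scoring function `dot` is a plain dot product standing in for the feed-forward network a, which is an assumption made to keep the example self-contained.

```python
import math

def attention_context(prev_state, enc_states, score):
    """Return (weights, context) for one decoder step.

    prev_state : previous decoder state z_{i-1} (list of floats)
    enc_states : encoder states h_1..h_T (list of lists of floats)
    score      : callable standing in for the alignment network a(z, h)
    """
    scores = [score(prev_state, h) for h in enc_states]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]   # softmax over source positions
    dim = len(enc_states[0])
    # Context vector: weighted sum of the encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context

def dot(u, v):
    # Hypothetical stand-in for a(.): a dot product, not a feed-forward network.
    return sum(a * b for a, b in zip(u, v))

weights, context = attention_context(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    dot,
)
```

In this toy input the first and third encoder states score equally against the decoder state, so they receive equal weight, while the second state receives less.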

MLE Training
NMT systems are usually trained to maximize the conditional log-probability of the correct translation given a source sentence with respect to the parameters $\theta$ of the model:

$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{|y^n|} \log P(y^n_t \mid y^n_{<t}, x^n; \theta) \tag{4}$$

where $N$ is the size of the training corpus, and $|y^n|$ is the length of the target sentence $y^n$. However, MLE training suffers from the exposure bias problem: during training, the history of any target word is correct and has been observed in the training data, while at test time the model predicts the next token conditioned on its previously predicted tokens, which may never have been observed in the training data. To solve this problem, reinforcement learning methods sample translation candidates, compute rewards for them, and use these rewards to update the parameters. GAN addresses the exposure bias problem in the same way, with the rewards computed by the discriminator.
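As a rough illustration of the MLE objective, the sketch below sums the log-probabilities of the reference tokens in each sentence and averages over the corpus. The averaging over the N sentence pairs is an assumption (it does not change the maximizer), and the probabilities are made-up numbers rather than outputs of a real model.

```python
import math

def mle_objective(corpus_token_probs):
    """Average over sentences of the summed token log-probabilities.

    corpus_token_probs holds one list per sentence pair; each entry is the
    probability the model assigns to a reference token given the gold prefix.
    """
    total = 0.0
    for token_probs in corpus_token_probs:
        total += sum(math.log(p) for p in token_probs)
    return total / len(corpus_token_probs)

# Hypothetical probabilities for a two-sentence corpus.
toy_corpus = [[0.5, 0.25], [0.125]]
objective = mle_objective(toy_corpus)
```

Because every probability is conditioned on the gold prefix, this objective never exposes the model to its own predictions, which is the source of the exposure bias discussed above.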

Generative Adversarial Network
As a new paradigm for training generative models, GAN (Goodfellow et al., 2014) has been successfully applied to computer vision tasks (Radford et al., 2015; Arjovsky et al., 2017). Conceptually, a GAN consists of two "adversarial" models: a generator $G$ that captures the data distribution, and a discriminator $D$ that estimates the probability that a sample comes from the training data rather than from $G$. When GAN is used for NMT, the NMT model is employed as $G$, and CNN-based or RNN-based neural networks serve as $D$ (Wu et al., 2017). During adversarial training, $G$ and $D$ play a two-player minimax game with the following value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{(x,y) \sim P_d}\big[\log D(x, y)\big] + \mathbb{E}_{(x,y) \sim P_G}\big[\log(1 - D(x, y))\big] \tag{5}$$

where $(x, y)$ is a sentence pair, $P_d$ represents the data distribution and $P_G$ denotes the generator distribution. With this objective function, the discriminator learns to distinguish whether a sentence pair is real (sampled from the bilingual corpus) or fake (generated by $G$), and the generator tries to confuse the discriminator by generating high-quality translation samples. In practice, the policy gradient method (Yu et al., 2017) is usually used to calculate gradients for the generator because its outputs are discrete symbols (words). To update the generator model, translation candidates are first sampled and their rewards are calculated using the discriminator. With these rewards, we can compute gradients and run backpropagation to update the generator. In such a training process, the real target sentence and the sampled translation candidates are used as positive and negative examples for discriminator training, respectively. Due to the computation cost, we cannot generate many negative examples, so the discriminator easily overfits. The overfitted discriminator gives biased signals to the generator and makes it update incorrectly, leading to unstable generator training.
Wu et al. (2017) found that combining the adversarial training objective with MLE can significantly improve the stability of generator training, which has also been reported for language modeling and neural dialogue generation (Lamb et al., 2016; Li et al., 2017). However, although this method leverages the real translation signal to guide the generator and alleviates the effect of an overfitted discriminator, it cannot deal with the inadequate training problem of the discriminator, which essentially plays a more important role in GAN training.
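To make the value function concrete, the following sketch estimates V(D, G) from discriminator outputs on a handful of real and generated pairs. The function name and the scores are hypothetical; the point is only that a discriminator which separates the two sets attains a higher value than one that outputs 0.5 everywhere.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real : D(x, y) on real sentence pairs
    d_fake : D(x, y') on generated sentence pairs
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator scores.
confused = gan_value([0.5, 0.5], [0.5, 0.5])   # cannot tell real from fake
confident = gan_value([0.9, 0.8], [0.1, 0.2])  # separates the two sets
```

With only a few generated samples per source sentence, the estimate of the second term rests on very little data, which is one way to see why the discriminator can overfit.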
Figure 1: The architecture of BGAN-NMT consisting of two GANs. The dotted line represents that GAN 2 adopts both generator and discriminator models of GAN 1 but interchanges their roles.
to address the inadequate training problem and stabilize GAN training. Based on these observations, we design a Bidirectional Generative Adversarial Network for Neural Machine Translation, named BGAN-NMT. As illustrated in Figure 1, the overall architecture of BGAN-NMT consists of an original GAN (GAN 1) and an auxiliary GAN (GAN 2). Both the generator and the discriminator of the original GAN are defined to model the joint probability of sentence pairs $P(x, y)$. Formally, $P(x, y)$ can be decomposed in two equivalent ways, $P(x, y) = P(x)P(y|x)$ and $P(x, y) = P(y)P(x|y)$, which are used as the generator $G$ and the discriminator $D$ of GAN 1, respectively. That is, the generator is decomposed into a source language model and a source-to-target translation model, while the discriminator is formulated as a target language model and a target-to-source translation model. The auxiliary GAN (GAN 2) employs $G$ and $D$ of GAN 1 as its own discriminator $D'$ and generator $G'$ to better exploit the symmetry between $G$ and $D$. The rest of this section details the objective function and the joint training algorithm for BGAN-NMT.
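The equivalence of the two decompositions can be checked on a toy example. The sketch below uses a made-up joint distribution over two source and two target "sentences" (all names and numbers are hypothetical) and verifies with exact fractions that P(x)P(y|x) and P(y)P(x|y) both recover P(x, y).

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(x, y) over toy sentence pairs.
joint = {
    ("x1", "y1"): F(1, 2), ("x1", "y2"): F(1, 4),
    ("x2", "y1"): F(1, 8), ("x2", "y2"): F(1, 8),
}

def p_x(x):
    # Marginal of the source side: plays the role of the source language model.
    return sum(p for (xi, _), p in joint.items() if xi == x)

def p_y(y):
    # Marginal of the target side: plays the role of the target language model.
    return sum(p for (_, yi), p in joint.items() if yi == y)

def generator_view(x, y):
    # P(x) * P(y|x): source language model times source-to-target model.
    return p_x(x) * (joint[(x, y)] / p_x(x))

def discriminator_view(x, y):
    # P(y) * P(x|y): target language model times target-to-source model.
    return p_y(y) * (joint[(x, y)] / p_y(y))

agree = all(
    generator_view(x, y) == discriminator_view(x, y) == p
    for (x, y), p in joint.items()
)
```

For exact distributions the two views coincide by construction. Learned models are trained separately, so in practice their factorizations need not agree exactly.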

Training Objective
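As $G$ and $D$ are defined as $P(x)P(y|x)$ and $P(y)P(x|y)$ respectively, the adversarial training objective $V(D, G)$ of GAN 1 in Equation 5 can be rewritten as

$$V(D, G) = \mathbb{E}_{(x,y) \sim P_d(x,y)}\big[\log P(x|y)P(y)\big] + \mathbb{E}_{x \sim P_d(x),\, y' \sim P(y'|x)}\big[\log\big(1 - P(x|y')P(y')\big)\big] \tag{6}$$

which means that, given a source sentence $x$, the source-to-target translation model $P(y|x)$ tries to generate a high-quality translation $y'$ to fool the discriminator $P(x|y)P(y)$, while the target-to-source translation model $P(x|y)$ and the language model $P(y)$ learn to distinguish translation candidates from real sentence pairs. In our implementation, the two language models $P(x)$ and $P(y)$ are fixed to reduce training complexity.
The discriminator $D$ is trained with the ground-truth sentence pair $(x, y)$ as a positive example and the generated sample $(x, y')$ from $G$ as a negative example. Formally, the objective of $D$ is to maximize $V(D, G)$:

$$L_D = \mathbb{E}_{(x,y) \sim P_d(x,y)}\big[\log P(x|y)P(y)\big] + \mathbb{E}_{x \sim P_d(x),\, y' \sim P(y'|x)}\big[\log\big(1 - P(x|y')P(y')\big)\big] \tag{7}$$

Since $P(y)$ is fixed, the gradient with respect to the parameters $\theta_D$ of the target-to-source translation model $P(x|y)$ is calculated as:

$$\frac{\partial L_D}{\partial \theta_D} = \mathbb{E}_{(x,y) \sim P_d(x,y)}\left[\frac{\partial \log P(x|y)}{\partial \theta_D}\right] - \mathbb{E}_{x \sim P_d(x),\, y' \sim P(y'|x)}\left[\frac{P(x|y')P(y')}{1 - P(x|y')P(y')} \frac{\partial \log P(x|y')}{\partial \theta_D}\right] \tag{8}$$

in which $\partial \log P(x|y) / \partial \theta_D$ is the gradient of a standard sequence-to-sequence NMT network.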
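For the generator $G$, following Goodfellow (2016), the training objective is to maximize the expected reward (the probability assigned by $D$) instead of directly minimizing $V(D, G)$:

$$L_G = \mathbb{E}_{x \sim P_d(x),\, y' \sim P(y'|x)}\big[P(x|y')P(y')\big] \tag{9}$$

Since the output of the generator $G$ is a sequence of discrete symbols (words), policy gradient is used to update the parameters, and the gradient with respect to the parameters $\theta_G$ of the source-to-target translation model $P(y|x)$ is calculated as:

$$\frac{\partial L_G}{\partial \theta_G} = \mathbb{E}_{x \sim P_d(x),\, y' \sim P(y'|x)}\left[P(x|y')P(y') \frac{\partial \log P(y'|x)}{\partial \theta_G}\right] \tag{10}$$

By exchanging the generator and discriminator models of GAN 1, we introduce GAN 2, in which the original $G$ is used as the discriminator $D'$ and the original $D$ serves as the generator $G'$. Similarly, the adversarial training objective $V(D', G')$ of GAN 2 is defined as:

$$V(D', G') = \mathbb{E}_{(x,y) \sim P_d(x,y)}\big[\log P(y|x)P(x)\big] + \mathbb{E}_{y \sim P_d(y),\, x' \sim P(x'|y)}\big[\log\big(1 - P(y|x')P(x')\big)\big] \tag{11}$$

According to this adversarial training objective, the objective functions of $D'$ and $G'$ are defined as follows:

$$L_{D'} = \mathbb{E}_{(x,y) \sim P_d(x,y)}\big[\log P(y|x)P(x)\big] + \mathbb{E}_{y \sim P_d(y),\, x' \sim P(x'|y)}\big[\log\big(1 - P(y|x')P(x')\big)\big] \tag{12}$$

$$L_{G'} = \mathbb{E}_{y \sim P_d(y),\, x' \sim P(x'|y)}\big[P(y|x')P(x')\big] \tag{13}$$

where the gradients with respect to $\theta_{D'} = \theta_G$ for $P(y|x)$ and $\theta_{G'} = \theta_D$ for $P(x|y)$ are respectively calculated as:

$$\frac{\partial L_{D'}}{\partial \theta_G} = \mathbb{E}_{(x,y) \sim P_d(x,y)}\left[\frac{\partial \log P(y|x)}{\partial \theta_G}\right] - \mathbb{E}_{y \sim P_d(y),\, x' \sim P(x'|y)}\left[\frac{P(y|x')P(x')}{1 - P(y|x')P(x')} \frac{\partial \log P(y|x')}{\partial \theta_G}\right] \tag{14}$$

$$\frac{\partial L_{G'}}{\partial \theta_D} = \mathbb{E}_{y \sim P_d(y),\, x' \sim P(x'|y)}\left[P(y|x')P(x') \frac{\partial \log P(x'|y)}{\partial \theta_D}\right] \tag{15}$$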
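The policy-gradient update for the generator can be illustrated on a toy problem. In the sketch below the generator is reduced to a softmax distribution over three candidate translations, and the rewards are made-up stand-ins for P(x|y')P(y'). Both simplifications are assumptions for illustration, since a real NMT generator defines a distribution over whole sequences. The code computes the expected reward-weighted score-function gradient exactly by enumeration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_reward(logits, rewards):
    # L_G for the toy generator: sum over candidates of pi(y') * reward(y').
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient(logits, rewards):
    """Exact E_{y' ~ pi}[ reward(y') * d log pi(y') / d logits ] by enumeration."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(pi, rewards)):
        for k in range(len(logits)):
            # For a softmax policy: d log pi(y) / d logit_k = 1[y == k] - pi_k
            dlog = (1.0 if y == k else 0.0) - pi[k]
            grad[k] += p_y * r_y * dlog
    return grad

logits = [0.0, 0.0, 0.0]
rewards = [0.6, 0.3, 0.1]   # hypothetical discriminator rewards
grad = policy_gradient(logits, rewards)
```

The gradient raises the logit of the candidate with above-average reward and lowers the others, and it matches a finite-difference derivative of `expected_reward`. In actual training the expectation would be approximated with sampled translations rather than enumerated.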

Joint Training Algorithm
In our approach, $G$ and $D$ act as discriminators for each other in a joint training process: the generator $G$ is improved with the discriminator $D$ in GAN 1, and the enhanced $G$ then serves as a better discriminator to guide the training of $D$ in GAN 2. This training process can be carried out iteratively to obtain further improvements, because after each iteration both $G$ and $D$ are expected to be improved by adversarial training. To optimize these two models simultaneously, we design a joint training algorithm to learn the parameters ($\theta_G$ and $\theta_D$) shared by the two GANs of BGAN-NMT (GAN 1 and GAN 2).

Algorithm 1: Joint Training Algorithm for BGAN-NMT
Input: Bilingual corpus $T = \{(x, y)\}_{n=1}^{N}$; pre-trained source-side language model $P(x)$; pre-trained target-side language model $P(y)$
Output: Well-trained models $P(y|x)$ and $P(x|y)$
1 Pre-train $P(y|x)$ and $P(x|y)$ on $T$ with MLE principle;
2 for number of training iterations do
As shown in Algorithm 1, the whole algorithm is divided into two main steps. First, given the parallel corpus $T = \{(x, y)\}_{n=1}^{N}$, we pre-train $P(y|x)$ and $P(x|y)$ with the MLE principle, while the source and target language models $P(x)$ and $P(y)$ are pre-trained on the corresponding sentences of the bilingual data. Next, based on these pre-trained models, we implement the two-player minimax game with an iterative approach, in which we alternate between $k$ steps ($k = 5$ in our experiments) of optimizing all discriminators ($D$ and $D'$) and one step of optimizing all generators ($G$ and $G'$). The iterative training continues until the performance on the development set no longer improves.
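The alternating schedule described above can be sketched as a simple loop. This is a skeleton under stated assumptions, not the authors' implementation: `update_discriminators` and `update_generators` are hypothetical callables standing for one optimization step on (D, D') and on (G, G'), and the development-set stopping criterion is replaced by a fixed number of iterations.

```python
def joint_training(num_iterations, k, update_discriminators, update_generators):
    """Alternate k discriminator steps with one generator step per iteration.

    Returns the order in which the two kinds of updates were executed,
    which makes the schedule easy to inspect.
    """
    schedule = []
    for _ in range(num_iterations):
        for _ in range(k):
            update_discriminators()   # one step on D and D'
            schedule.append("D")
        update_generators()           # one step on G and G'
        schedule.append("G")
    return schedule

# Hypothetical update functions that only count how often they are called.
calls = {"D": 0, "G": 0}

def step_discriminators():
    calls["D"] += 1

def step_generators():
    calls["G"] += 1

schedule = joint_training(
    num_iterations=2,
    k=5,
    update_discriminators=step_discriminators,
    update_generators=step_generators,
)
```

Since D shares parameters with G' and G shares parameters with D', each translation model is updated in both phases of an iteration, once in each role.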

Setup
To examine the effectiveness of our proposed approach, we conduct experiments on translation tasks with two language pairs: German-English (De-En for short) and Chinese-English (Zh-En for short). In all experiments, BLEU (Papineni et al., 2002) is adopted as the automatic metric for translation quality evaluation and is computed using the Moses multi-bleu.perl script.

Dataset
For the German-English translation task, following previous work (Ranzato et al., 2015; Bahdanau et al., 2016), we select data from the German-English machine translation track of the IWSLT2014 evaluation tasks, which consists of sentence-aligned subtitles of TED and TEDx talks. We closely follow the pre-processing described in Ranzato et al. (2015). The training corpus contains 153k sentence pairs with 2.83M English words and 2.68M German words. The validation set comprises 6,969 sentence pairs taken from the training data, and the test set is a combination of dev2010, dev2012, tst2010, tst2011 and tst2012 with a total of 6,750 sentence pairs.
For the Chinese-English translation task, the training data consists of a set of LDC datasets¹, which contains around 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words. Any sentence longer than 80 words is removed from the training data. The NIST OpenMT 2006 evaluation set is used as the validation set, and the NIST 2005, 2008 and 2012 datasets are used as test sets. We limit the vocabulary to the 50K most frequent words on both the source and target sides, and convert the remaining words into the <unk> token.
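The vocabulary restriction can be sketched as follows. The helper names and the toy corpus are hypothetical, and the vocabulary size is set to 3 instead of 50K so that the effect is visible on a few sentences.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_rare(sentence, vocab, unk="<unk>"):
    """Map every out-of-vocabulary token to the unknown-word token."""
    return [tok if tok in vocab else unk for tok in sentence]

# Hypothetical tokenized corpus.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
vocab = build_vocab(corpus, max_size=3)
mapped = replace_rare(["the", "dog", "ran"], vocab)
```

Here "the", "cat" and "sat" are kept, while "dog" and "ran" are mapped to `<unk>`. The same procedure would be applied separately to the source and target sides.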

Model Architecture
The RNNSearch model is used as the translation model, but it should be noted that our BGAN-NMT is independent of the NMT network structure. We use a single-layer GRU for both the encoder and the decoder. For Zh-En, the size of the word embeddings (for both source and target words) is 256 and the size of the hidden layer is set to 1024. For De-En, in order to compare with previous work (Ranzato et al., 2015; Bahdanau et al., 2016), the sizes of the word embeddings and the GRU hidden state are both set to 256. In addition, $P(x)$ and $P(y)$ are each designed as a single-layer GRU language model, which is pre-trained to compute the marginal probability of a sentence; the sizes of its word embeddings and GRU hidden state are the same as in the RNNSearch model.
¹ LDC2002E17, LDC2002E18, LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2005T10, LDC2006E17, LDC2006E26, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006T06, LDC2004T08, LDC2005T10

Training Details
For the training of BGAN-NMT, parameters are first initialized using a normal distribution with a mean of 0 and a variance of $6/(d_{row} + d_{col})$, where $d_{row}$ and $d_{col}$ are the number of rows and columns in the structure (Glorot and Bengio, 2010). We then pre-train the NMT and language models with the MLE principle until convergence, and select the best model according to performance on the validation set, where BLEU and perplexity are adopted as the evaluation metrics for the NMT and language models, respectively. Both the generator and discriminator models in BGAN-NMT are warm-started from these pre-trained models and optimized using the vanilla SGD algorithm with a mini-batch size of 32 for De-En and 128 for Zh-En. We re-normalize gradients if their norm exceeds 2.0. The initial learning rate is set to 0.2 for De-En and 0.02 for Zh-En, and it is halved when BLEU scores on the validation set do not increase for 20,000 batches. To generate the synthetic bilingual data, beam search with beam size 4 is adopted for both De-En and Zh-En. At test time, beam search with beam size 8 is employed to find the best translation, with translation probabilities normalized by the length of the candidate translations. Following Luong et al. (2015), <unk> is replaced with the corresponding target word in a post-processing step.
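Three of the training details above can be written down compactly: gradient re-normalization at norm 2.0, learning-rate halving after 20,000 batches without improvement, and length-normalized scoring of beam candidates. The sketch below uses hypothetical helper names, and it assumes that length normalization means dividing the summed token log-probabilities by the candidate length, which the text does not spell out.

```python
import math

def clip_by_norm(grads, max_norm=2.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def maybe_halve(lr, batches_since_improvement, patience=20000):
    """Halve the learning rate once validation BLEU has stalled for `patience` batches."""
    return lr / 2.0 if batches_since_improvement >= patience else lr

def length_normalized_score(token_log_probs):
    """Assumed normalization: average log-probability per target token."""
    return sum(token_log_probs) / len(token_log_probs)

clipped = clip_by_norm([3.0, 4.0])                       # norm 5.0 is rescaled to 2.0
short = length_normalized_score([-1.0, -1.0])            # total log-prob -2.0
longer = length_normalized_score([-0.5, -0.5, -0.5, -0.5])  # total log-prob -2.0
```

Both candidates have the same total log-probability, but after normalization the longer one scores higher, which counteracts the usual preference of beam search for short outputs.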

Results on German-English Translation
For the German-English translation task, in addition to the baseline system that is used to warm-start our BGAN-NMT training, we also include [...] Table 1, we can see that our BGAN-NMT achieves significant improvements over the baseline RNNSearch system. This demonstrates that the GAN framework can alleviate the exposure bias problem and improve the robustness of NMT systems. Our BGAN-NMT also obtains satisfactory translation quality compared with other existing NMT systems. In particular, our BGAN-NMT outperforms Adversarial-NMT* by 1.14 BLEU points. Adversarial-NMT* adopts MLE to stabilize generator training but gains limited improvement due to the inadequate training problem of the discriminator, while our BGAN-NMT effectively handles this issue and obtains a significant improvement.
To better analyze the training process of the different methods, we compare the BLEU score changes on the IWSLT2014 German-English validation set for RNNSearch, Adversarial-NMT* and BGAN-NMT during the entire training. As illustrated in Figure 2, when initialized with the best RNNSearch model, Adversarial-NMT* and BGAN-NMT continually improve translation performance. In addition, our BGAN-NMT consistently performs much better than Adversarial-NMT* throughout the training process. This confirms that our proposed approach not only stabilizes GAN training but also achieves better results.
² The result of the MRT method is taken from Wu et al. (2017).

Results on Chinese-English Translation
We also conduct experiments on the Chinese-English translation task with strong SMT and NMT baselines: HPSMT, RNNSearch and Adversarial-NMT*. HPSMT is an in-house implementation of the hierarchical phrase-based MT system (Chiang, 2007), where a 4-gram language model is trained using the modified Kneser-Ney smoothing algorithm over the target side of the bilingual data. Table 2 shows the evaluation results of the different models on the NIST datasets. All results are reported as case-insensitive BLEU. We can observe that RNNSearch significantly outperforms HPSMT by 4.78 BLEU points on average, and BGAN-NMT further improves performance by 2.33 BLEU points on average. Additionally, our BGAN-NMT outperforms Adversarial-NMT* by 1.03 BLEU points on average. These experimental results confirm the effectiveness of our proposed approach, similar to those shown for the German-English translation task.

Effect on Long Sentences
A longer source sentence implies a longer translation, which suffers more easily from the exposure bias problem. In this subsection, we group source sentences of similar length together and calculate the BLEU score for each group. As shown in Figure 3, our approach outperforms RNNSearch and Adversarial-NMT* in all length segments, with especially notable improvements on long sentences. These results further demonstrate that our approach can better handle the exposure bias problem and yield higher-quality translations.

Effect of Discriminative Loss
We also perform an ablation experiment to quantify the effect of the discriminative loss on our models. As shown in Table 3, the discriminative loss brings improvements of 0.58 and 0.73 BLEU points on the German-English and Chinese-English datasets, respectively. This result indicates that the discriminative loss improves the discriminative ability of the bidirectional NMT models, which can then give more accurate rewards for generator training in the GAN framework.

Related Work
Previous work has shifted training from maximum likelihood learning to optimizing BLEU scores with reinforcement learning algorithms. Bahdanau et al. (2016) designed an actor-critic algorithm for sequence prediction, in which the NMT system is the actor, and a critic network is proposed to predict the value of output tokens. Recently, Wu et al. (2017), among others, proposed to leverage the GAN framework to deal with the exposure bias problem, in which the NMT model is employed as the generator, and a CNN-based or RNN-based model is used as the discriminator. Different from their work, both the generator and the discriminator in our approach are designed to model the joint probability of sentence pairs, and we design an auxiliary GAN to take advantage of their symmetry.
Another related line of research in NMT leverages bidirectional dependency to improve translation quality. Tu et al. (2017) designed a reconstructor module for NMT so that the target representation contains the complete source information and can be used to reconstruct the source sentence. Other work proposed to reconstruct monolingual data with an auto-encoder, in which bidirectional translation models form a closed loop and are jointly updated. Recently, a similar idea has been used in unsupervised machine translation (Artetxe et al., 2017; Lample et al., 2018).

Conclusion
In this paper, we have presented a Bidirectional Generative Adversarial Network for Neural Machine Translation, consisting of an original GAN and an auxiliary GAN. Both the generator and the discriminator in the original GAN are designed to model the joint probability of sentence pairs. The auxiliary GAN adopts the generator and discriminator models of the original GAN but exchanges their roles to fully utilize their symmetry. These two GANs are then alternately updated using a joint training algorithm. Experimental results on German-English and Chinese-English translation tasks demonstrate that our proposed approach not only stabilizes GAN training but also leads to significant improvements. In the future, we plan to extend this method to other sequence-to-sequence NLP tasks.