ARAML: A Stable Adversarial Training Framework for Text Generation

Most of the existing generative adversarial networks (GAN) for text generation suffer from the instability of reinforcement learning training algorithms such as policy gradient, leading to unstable performance. To tackle this problem, we propose a novel framework called Adversarial Reward Augmented Maximum Likelihood (ARAML). During adversarial training, the discriminator assigns rewards to samples which are acquired from a stationary distribution near the data rather than the generator’s distribution. The generator is optimized with maximum likelihood estimation augmented by the discriminator’s rewards instead of policy gradient. Experiments show that our model can outperform state-of-the-art text GANs with a more stable training process.


Introduction
Natural text generation, as a key task in NLP, has advanced substantially thanks to the flourishing of neural models (Bengio et al., 2003; Mikolov et al., 2010). Typical frameworks such as sequence-to-sequence (seq2seq) have been applied to various generation tasks, including machine translation (Sutskever et al., 2014) and dialogue generation (Vinyals and Le, 2015). The standard paradigm for training such neural models is maximum likelihood estimation (MLE), which maximizes the log-likelihood of observing each word in the text given the ground-truth preceding context (Graves, 2013).
Although widely used, MLE suffers from the exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016): at test time, the model sequentially predicts the next word conditioned on its previously generated words, whereas during training it is conditioned on ground-truth words. To tackle this problem, generative adversarial networks (GAN) with reinforcement learning (RL) training approaches have been introduced to text generation tasks (Che et al., 2017; Lin et al., 2017; Shi et al., 2018), where the discriminator is trained to distinguish real and generated text samples to provide reward signals for the generator, and the generator is optimized via policy gradient. However, recent studies have shown that potential issues of training GANs on discrete data are more severe than exposure bias (Semeniuta et al., 2018; Caccia et al., 2018). One of the fundamental issues when generating discrete text samples with GANs is training instability. Updating the generator with policy gradient always leads to an unstable training process because it is difficult for the generator to derive positive and stable reward signals from the discriminator even with careful pre-training (Che et al., 2017). As a result, the generator gets lost due to the high variance of reward signals and the training process may finally collapse.
* Equal contribution. † Corresponding author: Minlie Huang.
In this paper, we propose a novel adversarial training framework called Adversarial Reward Augmented Maximum Likelihood (ARAML) to deal with the instability issue of training GANs for text generation. At each iteration of adversarial training, we first train the discriminator to assign higher rewards to real data than to generated samples. Then, inspired by reward augmented maximum likelihood (RAML) (Norouzi et al., 2016), the generator is updated on the samples acquired from a stationary distribution with maximum likelihood estimation (MLE), weighted by the discriminator's rewards. This stationary distribution is designed to guarantee that training samples surround the real data, so the exploration space of our generator is restricted by the MLE training objective, resulting in more stable training. Compared to other text GANs with RL training techniques, our framework acquires samples from the stationary distribution rather than the generator's distribution, and uses the RAML training paradigm to optimize the generator instead of policy gradient. Our contributions are mainly as follows:
• We analyze the fundamental issue of current GANs for text generation from the perspective of training instability.
• We propose a novel framework called Adversarial Reward Augmented Maximum Likelihood (ARAML), which incorporates stable RAML training into the adversarial training paradigm. Experimental results on three text generation tasks show the effectiveness of our method.

Related Work
Recently, text generation has been widely studied with neural models trained with maximum likelihood estimation (Graves, 2013). However, MLE tends to generate universal text. Various methods have been proposed to enhance the generation quality by refining the objective function (Mou et al., 2016) or modifying the generation distribution with external information like topic (Xing et al., 2017), sentence type (Ke et al., 2018), emotion (Zhou et al., 2018a) and knowledge (Zhou et al., 2018b). As mentioned above, MLE suffers from the exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016). Thus, reinforcement learning methods such as policy gradient (Ranzato et al., 2016) and actor-critic (Bahdanau et al., 2017) have been introduced to text generation tasks. Norouzi et al. (2016) proposed an efficient and stable approach called Reward Augmented Maximum Likelihood (RAML), which connects the log-likelihood and expected rewards to incorporate the MLE training objective into the RL framework.
Since some text generation tasks have no explicit metrics to be directly optimized, adversarial training has been applied to generating discrete text samples with a discriminator to learn a proper reward. For instance, SeqGAN devised a discriminator to distinguish the real data and generated samples, and a generator to maximize the reward from the discriminator via policy gradient. Other variants of GANs have been proposed to improve the generator or the discriminator. To improve the generator, MaliGAN (Che et al., 2017) developed a normalized maximum likelihood optimization target for the generator to stably model the discrete sequences. LeakGAN guided the generator with reward signals leaked from the discriminator at all generation steps to deal with long text generation. MaskGAN employed an actor-critic architecture to make the generator fill in missing text conditioned on the surrounding context, which is expected to mitigate the problem of mode collapse. As for the discriminator, RankGAN (Lin et al., 2017) replaced the traditional discriminator with a ranker to learn the relative ranking information between the real texts and generated ones. Inverse reinforcement learning (Shi et al., 2018) used a trainable reward approximator as the discriminator to provide dense reward signals at each generation step. DPGAN introduced a language model based discriminator and regarded cross-entropy as rewards to promote the diversity of generation results.
The most similar works to our model are RAML (Norouzi et al., 2016) and MaliGAN (Che et al., 2017): 1) Compared with RAML, our model adds a discriminator to learn the reward signals instead of choosing existing metrics as rewards. We believe that our model can adapt to various text generation tasks, particularly those without explicit evaluation metrics. 2) Unlike MaliGAN, we acquire samples from a fixed distribution near the real data rather than the generator's distribution, which is expected to make the training process more stable.

Task Definition and Model Overview
Text generation can be formulated as follows: given the real data distribution $P_{data}(X)$, the task is to train a generative model $G_\theta$ such that $P_{G_\theta}(X)$ fits $P_{data}(X)$ well. In this formulation, $X = x_1 x_2 \cdots x_m$ and $x_t$ ($1 \le t \le m$) denotes a word in the vocabulary $\mathcal{V}$. Figure 1 shows the overview of our model ARAML. This adversarial training framework consists of two phases: 1) The discriminator is trained to assign higher rewards to real data than to generated data. 2) The generator is trained on the samples acquired from a stationary distribution with a reward augmented MLE training objective. This training paradigm constrains the generator's search space with the MLE training objective, which alleviates the issue of unstable training.

Figure 1: Overview of ARAML. The training samples are acquired from a stationary distribution $P_s$ based on the real data. The generator is then trained on the samples augmented by the discriminator's rewards. The discriminator is trained to distinguish real data and generated data.

Discriminator
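The discriminator $D_\phi$ aims to distinguish real data from generated data, as in other GANs. Inspired by Least-Square GAN (Mao et al., 2017), we devise the loss function as follows:
$$\mathcal{L}_{D_\phi} = \frac{1}{2}\mathbb{E}_{X \sim P_{data}(X)}\left[(D_\phi(X) - 1)^2\right] + \frac{1}{2}\mathbb{E}_{X \sim P_{G_\theta}(X)}\left[(D_\phi(X))^2\right] \tag{1}$$
This loss function forces the discriminator to assign higher rewards to real data than to generated data, so the discriminator can learn to provide more appropriate rewards as training proceeds.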
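The least-squares objective described above can be illustrated with a minimal sketch. It assumes the standard least-squares targets of 1 for real samples and 0 for generated samples; the function name `discriminator_loss` and the toy scores are hypothetical and stand in for the outputs of an actual discriminator network.

```python
def discriminator_loss(real_scores, fake_scores):
    """Least-squares loss: push D(real) toward 1 and D(generated) toward 0.

    Both arguments are lists of scalar discriminator outputs (assumed inputs).
    """
    real_term = sum((d - 1.0) ** 2 for d in real_scores) / len(real_scores)
    fake_term = sum(d ** 2 for d in fake_scores) / len(fake_scores)
    return 0.5 * real_term + 0.5 * fake_term


# A discriminator that scores real data high and generated data low
# incurs a smaller loss than one that does the opposite.
well_separated = discriminator_loss([0.9, 0.8], [0.1, 0.2])
confused = discriminator_loss([0.2, 0.1], [0.9, 0.8])
```

Because the loss is quadratic in the score, it keeps providing a gradient even for samples the discriminator already classifies correctly, which is the usual motivation for least-squares GAN objectives.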

Generator
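The training objective of our generator $G_\theta$ is derived from the objective of other discrete GANs with the RL training method:
$$\mathcal{L}_{RL,\theta} = -\mathbb{E}_{X \sim P_{G_\theta}(X)}\left[r_\phi(X)\right] - \tau\,\mathcal{H}(P_{G_\theta}(X)) \tag{2}$$
where $r_\phi(X)$ denotes the rewards from the discriminator $D_\phi$ and the entropy regularization term $\mathcal{H}(P_{G_\theta}(X))$ encourages $G_\theta$ to generate diverse text samples. $\tau$ is a temperature hyper-parameter that balances these two terms.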
As mentioned above, discrete GANs suffer from instability due to policy gradient and are consequently difficult to train. Inspired by RAML (Norouzi et al., 2016), we introduce an exponential payoff distribution $Q_\phi(X)$ to connect the RL loss with the RAML loss:
$$Q_\phi(X) = \frac{1}{Z}\exp\left(r_\phi(X)/\tau\right) \tag{3}$$
where $Z = \sum_X \exp(r_\phi(X)/\tau)$. Thus, we can rewrite $\mathcal{L}_{RL,\theta}$ with $P_{G_\theta}(X)$ and $Q_\phi(X)$ as follows:
$$\mathcal{L}_{RL,\theta} = \tau\,\mathrm{KL}\left(P_{G_\theta}(X)\,\|\,Q_\phi(X)\right) + \text{constant} \tag{4}$$
Following RAML, we remove the constant term and optimize the KL divergence in the opposite direction:
$$\mathcal{L}_{RAML,\theta} = \mathrm{KL}\left(Q_\phi(X)\,\|\,P_{G_\theta}(X)\right) = -\mathbb{E}_{X \sim Q_\phi(X)}\left[\log P_{G_\theta}(X)\right] - \mathcal{H}(Q_\phi(X)) \tag{5}$$
where $\mathcal{H}(Q_\phi(X))$ is a constant in the training phase of the generator. It has been proved that $\mathcal{L}_{RL,\theta}$ and $\mathcal{L}_{RAML,\theta}$ are equivalent up to their first-order Taylor approximations, and they have the same global optimum (Norouzi et al., 2016). $\mathcal{L}_{RAML,\theta}$ can be trained in an MLE-like fashion, but sampling from the distribution $Q_\phi(X)$ is intractable in the adversarial setting, because $Q_\phi(X)$ varies with the discriminator $D_\phi$. Thus, we introduce importance sampling to separate the sampling process from $D_\phi$ and obtain the final loss function:
$$\mathcal{L}_{G_\theta} = -\mathbb{E}_{X \sim P_s(X)}\left[W_\phi(X)\log P_{G_\theta}(X)\right] \tag{6}$$
where $P_s(X)$ denotes a stationary distribution and $W_\phi(X) \propto Q_\phi(X)/P_s(X)$. To optimize this loss function, we first construct the fixed distribution $P_s(X)$ to get samples, and then devise a proper reward function $r_\phi(X)$ to train the generator in a stable and effective way.
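The two key steps of this derivation can be checked by hand. The worked equations below are a sketch that assumes only the definitions stated in the text: an entropy-regularized RL objective with temperature $\tau$, the payoff distribution $Q_\phi(X) = \exp(r_\phi(X)/\tau)/Z$, and a stationary distribution $P_s$ used for importance sampling.

```latex
\begin{align*}
\tau\,\mathrm{KL}(P_{G_\theta} \,\|\, Q_\phi)
  &= \tau \sum_X P_{G_\theta}(X)\left[\log P_{G_\theta}(X) - \frac{r_\phi(X)}{\tau} + \log Z\right] \\
  &= -\tau\,\mathcal{H}(P_{G_\theta}) - \mathbb{E}_{X \sim P_{G_\theta}}\left[r_\phi(X)\right] + \tau \log Z, \\[4pt]
\mathbb{E}_{X \sim Q_\phi}\left[\log P_{G_\theta}(X)\right]
  &= \sum_X P_s(X)\,\frac{Q_\phi(X)}{P_s(X)}\,\log P_{G_\theta}(X)
   = \mathbb{E}_{X \sim P_s}\left[\frac{Q_\phi(X)}{P_s(X)}\,\log P_{G_\theta}(X)\right].
\end{align*}
```

The first identity shows that the RL objective differs from $\tau\,\mathrm{KL}(P_{G_\theta}\,\|\,Q_\phi)$ only by the term $\tau \log Z$, which does not depend on $\theta$. The second identity is the importance-sampling step; it requires $P_s(X) > 0$ wherever $Q_\phi(X) > 0$.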

Sampling
We construct the distribution $P_s$ based on $P_{data}$:
$$P_s(X_s) = \mathbb{E}_{X \sim P_{data}(X)}\left[P_s(X_s|X)\right] \tag{7}$$
In this way, $P_s(X_s|X)$ can be designed to guarantee that $P_s(X)$ is near $P_{data}(X)$, leading to a more stable training process. To obtain a new sample $X_s$ from a real data sample $X$, we design three steps: sampling an edit distance $d$, sampling the positions $\{p_1, p_2, \cdots, p_d\}$ for substitution, and sampling the new words $\{w_1, w_2, \cdots, w_d\}$ filled into the corresponding positions. Thus, $P_s(X_s|X)$ can be decomposed into three terms:
$$P_s(X_s|X) = P(d, p, w|X) = P(d|X)\,P(p|X, d)\,P(w|X, d, p) \tag{8}$$
The first step is to sample an edit distance based on a real data sample $X$, where $X = x_1 x_2 \cdots x_m$ is a sequence of length $m$. The number of sentences which have the edit distance $e$ to some input sentence can be computed approximately as below:
$$c(e, m) = \binom{m}{e}\cdot(|\mathcal{V}| - 1)^e \tag{9}$$
where $c(e, m)$ denotes the number of sentences which have an edit distance $e$ ($e \in \{0, 1, \ldots, m\}$) to a sentence of length $m$, and $|\mathcal{V}|$ indicates the size of the vocabulary. We then follow Norouzi et al. (2016) to re-scale the counts by $\exp\{-e/\tau\}$ and normalize, so that we can sample an edit distance $d^*$ from:
$$P(d = d^*|X) = \frac{\exp\{-d^*/\tau\}\,c(d^*, m)}{\sum_{e=0}^{m}\exp\{-e/\tau\}\,c(e, m)} \tag{10}$$
where $\tau$, as a temperature hyper-parameter, restricts the search space surrounding the original sentence. A larger $\tau$ brings more samples with long edit distances.
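The edit-distance step can be sketched as follows. This is a minimal illustration that assumes the count approximation $c(e, m) = \binom{m}{e}(|\mathcal{V}|-1)^e$ and the re-scaling by $\exp\{-e/\tau\}$ exactly as described in the text; the function names and the toy vocabulary size are hypothetical. Note that for large vocabularies the count term grows quickly with $e$, so the toy example uses a very small vocabulary to keep the effect of $\tau$ visible.

```python
import math
import random


def edit_distance_distribution(m, vocab_size, tau):
    """Return [P(d = e | X) for e in 0..m], computed in log space for stability."""
    log_weights = []
    for e in range(m + 1):
        # log c(e, m), with c(e, m) = C(m, e) * (|V| - 1)^e
        log_count = math.log(math.comb(m, e)) + e * math.log(vocab_size - 1)
        # re-scale the count by exp(-e / tau)
        log_weights.append(log_count - e / tau)
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def sample_edit_distance(m, vocab_size, tau, rng=random):
    """Draw one edit distance d* from the distribution above."""
    probs = edit_distance_distribution(m, vocab_size, tau)
    return rng.choices(range(m + 1), weights=probs, k=1)[0]


def expected_distance(m, vocab_size, tau):
    probs = edit_distance_distribution(m, vocab_size, tau)
    return sum(e * p for e, p in enumerate(probs))
```

Comparing `expected_distance(10, 3, 0.5)` with `expected_distance(10, 3, 2.0)` shows the role of the temperature: the larger value of $\tau$ yields a larger expected edit distance, in line with the statement that larger $\tau$ brings more samples with long edit distances.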
The next step is to select positions for substitution based on the sampled edit distance $d^*$. Intuitively, we can randomly choose $d^*$ distinct positions in $X$ to be replaced by new words. The probability of choosing the position $p^*$ is calculated as follows:
$$P(p = p^*|X, d = d^*) = \frac{d^*}{m} \tag{11}$$
Following this sampling strategy, we can obtain the position set $\{p_1, p_2, \cdots, p_{d^*}\}$. This strategy approximately guarantees that the edit distance between a new sentence and the original sentence is $d^*$.
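Position selection can be sketched in a few lines. The snippet assumes that positions are drawn uniformly without replacement, which is one straightforward reading of "randomly choose $d^*$ distinct positions"; the function name `sample_positions` is hypothetical and positions are 0-based here.

```python
import random


def sample_positions(sentence_length, d_star, rng=random):
    """Pick d_star distinct 0-based positions uniformly at random, sorted left to right."""
    return sorted(rng.sample(range(sentence_length), d_star))


# Example: choose 3 of the 8 positions in a sentence of length 8.
positions = sample_positions(8, 3, random.Random(1))
```

Under uniform sampling without replacement, each individual position is included with probability $d^*/m$.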
At the final step, our model determines new words for substitution at each sampled position $p_j$ ($j = 1, 2, \ldots, d^*$). We can formulate this sampling process from the original sequence $X$ to a new sample $X_s$ as a sequential transition $X = X_0 \rightarrow X_1 \rightarrow \cdots \rightarrow X_{d^*} = X_s$. At each step from $X_{j-1}$ to $X_j$ ($j = 1, \cdots, d^*$), we first sample a new word $w_j$ from the distribution $P(w|X_{j-1}, p = p_j)$, then replace the old word at position $p_j$ of $X_{j-1}$ to obtain $X_j$. The whole sampling process can be decomposed as follows:
$$P(w|X, d = d^*, p = \{p_1, p_2, \cdots, p_{d^*}\}) = \prod_{j=1}^{d^*} P(w_j|X_{j-1}, p = p_j) \tag{12}$$
There are two common sampling strategies to model $P(w|X_{j-1}, p = p_j)$, i.e. random sampling and constrained sampling. The random sampling strategy samples a new word $w_j$ according to the uniform distribution over the vocabulary $\mathcal{V}$ (Norouzi et al., 2016), while the constrained sampling strategy samples $w_j$ to maximize the language model score of the target sentence $X_j$ (Su et al., 2018; Miao et al., 2019). Here, we adopt constrained sampling in our model and compare the performance of the two strategies in the experiment.
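The sequential substitution and the two strategies can be illustrated with the sketch below. Several parts are assumptions made for illustration: `lm_score` is a hypothetical callable that returns a language model score for a token list, the constrained variant is simplified to a greedy argmax over all candidate words, and the old word is excluded from the candidates so that every step changes the sentence.

```python
import random


def substitute(tokens, positions, vocab, lm_score=None, rng=random):
    """Apply the transition X_0 -> X_1 -> ... -> X_d at the given positions.

    lm_score=None     -> random sampling (uniform over the other words)
    lm_score=callable -> constrained sampling, simplified here to picking the
                         candidate that maximizes the score of the new sentence
    """
    current = list(tokens)  # work on a copy; the real sample stays untouched
    for p in positions:
        candidates = [w for w in vocab if w != current[p]]
        if lm_score is None:
            new_word = rng.choice(candidates)
        else:
            new_word = max(
                candidates,
                key=lambda w: lm_score(current[:p] + [w] + current[p + 1:]),
            )
        current[p] = new_word
    return current


# Toy setup: the "language model" simply prefers sentences containing "cat".
vocab = ["a", "cat", "dog", "sat"]
real = ["a", "dog", "sat"]
random_sample = substitute(real, [0, 2], vocab, rng=random.Random(0))
constrained_sample = substitute(real, [1], vocab, lm_score=lambda s: s.count("cat"))
```

A real constrained sampler would score candidates with a pretrained language model and might sample from the resulting distribution instead of taking the argmax; the greedy choice is used here only to keep the example deterministic.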

Training
We devise the reward function $r_\phi(X)$ according to the discriminator's output $D_\phi(X)$ and the stationary distribution $P_s(X)$:
$$r_\phi(X) = \tau\left[\log P_s(X) + D_\phi(X)\right] \tag{13}$$
Intuitively, this reward function encourages the generator to generate sentences with large sampling probability and high rewards from the discriminator. Thus, the weight of samples $W_\phi(X)$ can be calculated as follows:
$$W_\phi(X) \propto \frac{Q_\phi(X)}{P_s(X)} \propto \exp\{D_\phi(X)\} \tag{14}$$
We can now optimize the generator's loss $\mathcal{L}_{G_\theta}$ via Equation 6. This training paradigm lets our generator avoid the variance caused by policy gradient and obtain more stable reward signals from the discriminator, because the generator is restricted to exploring training samples near the real data.
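The weighting scheme can be sketched as a weighted negative log-likelihood. The snippet assumes that the weights $W_\phi(X) \propto \exp\{D_\phi(X)\}$ are normalized over the samples in a batch (a softmax over discriminator outputs); this normalization, the function names, and the toy numbers are assumptions for illustration rather than details confirmed by the text.

```python
import math


def sample_weights(d_scores):
    """Self-normalized weights proportional to exp(D(X)) within one batch."""
    peak = max(d_scores)  # subtract the max for numerical stability
    exps = [math.exp(d - peak) for d in d_scores]
    total = sum(exps)
    return [e / total for e in exps]


def generator_loss(log_probs, d_scores):
    """Weighted MLE loss: -sum_i W(X_i) * log P_G(X_i) over sampled X_i."""
    weights = sample_weights(d_scores)
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Two samples drawn near the real data, with generator log-likelihoods and
# discriminator outputs (hypothetical values).
weights = sample_weights([0.2, 0.9])
loss = generator_loss([-1.0, -3.0], [0.5, 0.5])
```

When all discriminator outputs are equal, the weights become uniform and the loss reduces to the ordinary average negative log-likelihood, which makes the connection to plain MLE training explicit.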
Algorithm 1 Adversarial Reward Augmented Maximum Likelihood
Require:
  Total adversarial training iterations: $N_{iters}$
  Steps of training generator: $G_{steps}$
  Steps of training discriminator: $D_{steps}$
1: Pre-train the generator $G_\theta$ with MLE loss
2: Generate samples from $P_{G_\theta}$
3: Pre-train the discriminator $D_\phi$ via Eq. (1)
4: Construct $P_s(X)$ via Eq. (7)

Extension to Conditional Text Generation
We have shown our adversarial training framework for text generation tasks without an input. It can also be extended to conditional text generation tasks such as dialogue generation. Given the data distribution $P_{data}(C, X)$, where $C$ and $X$ denote contexts and responses respectively, the objective function of ARAML's generator can be modified as below:
$$\mathcal{L}_{G_\theta} = -\mathbb{E}_{(C,X) \sim P_{data}(C,X)}\left[\mathbb{E}_{X_s \sim P_s(X_s|C,X)}\left[W_\phi(C, X_s)\log P_{G_\theta}(X_s|C)\right]\right] \tag{15}$$
where $W_\phi(C, X_s) \propto \exp\{D_\phi(C, X_s)\}$ and $D_\phi(C, X_s)$ is trained to distinguish whether $X_s$ is the true response to $C$.

Comparison with RAML and MaliGAN
The most similar works to our framework are RAML (Norouzi et al., 2016) and MaliGAN (Che et al., 2017). The main difference among them is the training objective of their generators. We have shown different objective functions in Table 1. For comparison, we use the form with no input for all the three models. Our model is greatly inspired by RAML, which gets samples from a non-parametric distribution Q(X) constructed based on a specific reward. Compared to RAML, our reward comes from a learnable discriminator which varies as the adversarial training proceeds rather than a specific reward function. This difference equips our framework with the ability to adapt to the text generation tasks with no explicit evaluation metrics as rewards.
Our model is also similar to MaliGAN, which gets samples from the generator's distribution. In MaliGAN's training objective, $G_\theta$ also indicates the generator's distribution, but it is used in the sampling phase and fixed at each optimization step. The weight of samples is $W_\phi(X) \propto \frac{D_\phi(X)}{1 - D_\phi(X)}$. Different from our model, MaliGAN acquires samples from the generator's distribution $P_{G_\theta}$, which usually brings samples with low rewards even with careful pre-training for the generator, leading to training instability. Instead, our framework gets samples from a stationary distribution $P_s$ around real data, so our training process is more stable.

Table 1: Training objectives of the generators of RAML, MaliGAN and ARAML (columns: Model, Training Objective of Generator).

Dataset
We evaluated ARAML on three datasets: the COCO image caption dataset (Chen et al., 2015), the EMNLP2017 WMT dataset and the WeiboDial single-turn dialogue dataset (Qian et al., 2018). COCO and EMNLP2017 WMT are the common benchmarks with no input for evaluating the performance of discrete GANs, and we followed the existing works to preprocess these datasets (Shi et al., 2018). WeiboDial, as a dialogue dataset, was used to test the performance of our model with an input trigger. We simply removed post-response pairs containing low-frequency words and randomly selected a subset for our training/test set. The statistics of the three datasets are presented in Table 2.

Baselines
We compared our model with MLE, RL and GAN baselines. Since COCO and EMNLP2017 WMT don't have input while WeiboDial regards posts as input, we chose the following baselines respectively:
• MLE: An RNN model trained with the MLE objective (Graves, 2013). Its extension, Seq2Seq, can work on the dialogue dataset (Sutskever et al., 2014).
• SeqGAN: The first text GAN model that updates the generator with policy gradient based on the rewards from the discriminator.
• LeakGAN: A variant of SeqGAN that provides rewards based on the leaked information of the discriminator for the generator.
• MaliGAN: A variant of SeqGAN that optimizes the generator with a normalized maximum likelihood objective (Che et al., 2017).
• IRL: This inverse reinforcement learning method replaces the discriminator with a reward approximator to provide dense rewards (Shi et al., 2018).
• RAML: An RL approach that incorporates the MLE objective into the RL training framework and regards BLEU as rewards (Norouzi et al., 2016).
• DialogGAN: An extension of SeqGAN tuned to the dialogue generation task, with the MLE objective added to the adversarial objective.
• DPGAN: A variant of DialogGAN which uses a language model based discriminator and regards cross-entropy as rewards.
Note that MLE, SeqGAN, LeakGAN, MaliGAN and IRL are the baselines on COCO and EMNLP2017 WMT, while MLE, RAML, DialogGAN and DPGAN are the baselines on WeiboDial. The original code was used to test the baselines.

Implementation Details
The implementation details of our model are shown in Table 3. For COCO / EMNLP2017, the generator is an LSTM unit (Hochreiter and Schmidhuber, 1997) with 128 cells, and the discriminator is implemented based on . For WeiboDial, the generator is an encoder-decoder structure with attention mechanism, where both the encoder and the decoder consist of a two-layer GRU (Cho et al., 2014) with 128 cells. The discriminator is implemented based on Tao et al. (2018). The language model used in the constrained sampling of ARAML is implemented in the same setting as the generators, and is pre-trained on the training set of each dataset. The code and the datasets are available at https://github.com/kepei1106/ARAML. As for the details of the baselines, the generators of all the baselines except LeakGAN are the same as ours. Note that the generator of LeakGAN consists of a hierarchical LSTM unit, so we followed the implementation in the original paper. In terms of the differences, the discriminators of the GAN baselines are implemented based on the original papers. Other hyper-parameters of the baselines, including batch size, learning rate and pre-training epochs, were set based on the original code, because the convergence of the baselines is sensitive to these hyper-parameters.

Language Generation on COCO and EMNLP2017 WMT
We adopted forward/reverse perplexity and Self-BLEU to evaluate the quality of generated texts. Forward perplexity (PPL-F) is the perplexity of the generated data under a language model trained on real data, and measures the fluency of generated samples. Reverse perplexity (PPL-R) switches the roles of generated data and real data to reflect the discrepancy between the generated distribution and the data distribution. Self-BLEU (S-BLEU) regards each sentence in the generated collection as the hypothesis and the others as references to obtain BLEU scores, which evaluates the diversity of generated results. Results are shown in Table 4. LeakGAN performs best on forward perplexity because it can generate more fluent samples. As for reverse perplexity, our model ARAML beats the other baselines, showing that our model can fit the data distribution better. Other GANs, particularly LeakGAN, obtain high reverse perplexity due to mode collapse (Shi et al., 2018): they only capture limited fluent expressions, resulting in a large discrepancy between the generated distribution and the data distribution. ARAML also outperforms the baselines in terms of Self-BLEU, indicating that our model doesn't fall into mode collapse with the help of the MLE training objective and has the ability to generate more diverse sentences.
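The Self-BLEU procedure described above can be sketched as follows. This is a simplified illustration, not the evaluation script used in the experiments: it assumes an unsmoothed sentence-level BLEU with clipped n-gram precisions up to `max_n` and a brevity penalty, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=2):
    """Simplified sentence BLEU: clipped precisions, geometric mean, brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        if not hyp_counts:
            return 0.0
        # For each n-gram, the largest count observed in any single reference
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / sum(hyp_counts.values())))
    closest = min(references, key=lambda r: (abs(len(r) - len(hypothesis)), len(r)))
    if len(hypothesis) >= len(closest):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(closest) / len(hypothesis))
    return penalty * math.exp(sum(log_precisions) / max_n)


def self_bleu(sentences, max_n=2):
    """Average BLEU of each sentence against all the other generated sentences."""
    scores = [
        bleu(s, sentences[:i] + sentences[i + 1:], max_n)
        for i, s in enumerate(sentences)
    ]
    return sum(scores) / len(scores)


repetitive = [["a", "man", "on", "a", "bench"]] * 3
diverse = [["a", "man", "sits"], ["two", "dogs", "run"], ["the", "sky", "is"]]
```

A lower Self-BLEU indicates more diverse output: the `repetitive` collection scores higher than the `diverse` one because every sentence is fully covered by the others.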
We also provide standard deviation of each metric in Table 4, reflecting the stability of each model's performance. Our model ARAML nearly achieves the smallest standard deviation in all the metrics, indicating that our framework outperforms policy gradient in the stability of adversarial training.

Dialogue Generation on WeiboDial
Dialogue evaluation is an open problem and existing works have found that automatic metrics have low correlation with human evaluation (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018). Thus, we resorted to manual evaluation to assess the generation quality on WeiboDial. We randomly sampled 200 posts from the test set and collected the generated results from all the models. For each pair of responses (one from ARAML and the other from a baseline, given the same input post), five annotators were hired to label which response is better (i.e. win, lose or tie) in terms of grammaticality (whether a response itself is grammatical and logical) and relevance (whether a response is appropriate and relevant to the post). The two metrics were evaluated independently.
The evaluation results are shown in Table 5. To measure the inter-annotator agreement, we calculated Fleiss' kappa (Fleiss, 1971) for each pairwise comparison; the results show moderate agreement (0.4 ≤ κ ≤ 0.6). We also conducted a sign test to check the significance of the differences.
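For readers unfamiliar with the sign test, the sketch below shows one common form: an exact two-sided test on win/lose counts with ties discarded. Whether ties were handled this way in the evaluation is an assumption, and the counts in the example are hypothetical, not results from Table 5.

```python
import math


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test under the null hypothesis P(win) = P(lose) = 0.5.

    Ties are assumed to have been discarded before calling this function.
    """
    n = wins + losses
    k = min(wins, losses)
    # Probability of observing at most k of the rarer outcome under the null
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 10 wins and 0 losses versus a balanced 5 and 5.
p_clear = sign_test_p_value(10, 0)
p_balanced = sign_test_p_value(5, 5)
```

A small p-value indicates that the preference for one system is unlikely to arise by chance if both systems were equally good.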
As shown in Table 5, ARAML performs significantly better than the baselines in all cases. This result indicates that the samples surrounding true responses provide stable rewards for the generator, and that the stable RAML training paradigm significantly enhances the performance in both metrics.
To verify the training stability, we ran experiments on COCO multiple times and chose the best 5 trials for SeqGAN, LeakGAN, IRL, MaliGAN and ARAML, respectively. Then, we plotted the forward/reverse perplexity during the training process in Figure 2. We can see that our model, with a smaller standard deviation, is more stable than the other GAN baselines in both metrics. Although LeakGAN reaches the best forward perplexity, its standard deviation is extremely large and it performs badly in reverse perplexity, indicating that it generates limited expressions that are grammatical yet divergent from the data distribution.

Impact of Temperature
The temperature τ controls the search space surrounding the real data, as analyzed in Section 3.3.1. To investigate its impact on the performance of our model, we fixed all the other hyper-parameters and tested ARAML with different temperatures on COCO. The experimental results are shown in Figure 3. We can see that as the temperature becomes larger, forward perplexity increases gradually while Self-BLEU decreases. As mentioned in Section 3.3.1, large temperatures encourage our generator to explore samples that are distant from the real data distribution, so the diversity of generated results improves. However, these samples distant from the data distribution are more likely to be poor in fluency, leading to worse forward perplexity. Reverse perplexity is influenced by both generation quality and diversity, so the correlation between temperature and reverse perplexity is not intuitive. We can observe that the model with τ = 0.95 reaches the best reverse perplexity.

Impact of Sampling Strategy
We have mentioned two common sampling strategies in Section 3.3.1, i.e. random sampling and constrained sampling. To analyze their impact, we kept all the model structures and hyper-parameters fixed and tested ARAML with these two strategies on COCO.

Table 6: PPL-F, PPL-R and S-BLEU of ARAML with random sampling (ARAML-R) and constrained sampling (ARAML-C) on COCO.

Table 6 shows the results. Random sampling clearly hurts the model performance on every metric except Self-BLEU-1, because it makes low-quality samples available to the generator. Exploring these samples degrades the quality and diversity of generated results. Despite the worse performance on automatic metrics, random sampling doesn't affect the training stability of our framework: the standard deviation of ARAML-R is still smaller than that of the other GAN baselines.

Table 7 presents the examples generated by the models on COCO. We can find that other baselines suffer from grammatical errors (e.g. "in front of flying her kite" from MLE), repetitive expressions (e.g. "A group of people" from IRL) and incoherent statements (e.g. "A group of people sitting on a cell phone" from IRL). By contrast, our model performs well in these sentences and has the ability to generate grammatical and coherent results.

Model | Generated Samples
MLE | A little girl sitting on a beach in front of flying her kite at the beach.
    | A little boy standing in a room next to a desk.
SeqGAN | A man sitting on a bench with snow board in the background.
    | A brown gray cat is in the corner of a street.
LeakGAN | A person that is holding something while another kid is standing in the water.
    | A room with a television, mantle, and a chair.
MaliGAN | A man with a shirt on holding one large pink giant and white kite.
    | A couple and vases are outside on the bed.
IRL | A group of people wearing helmet sitting down on a cell phone.
    | A group of people sitting in the middle of tracks.
ARAML | A man is wearing a hat and holding a toothbrush as he stands on the grass of a field.
    | A boy reading a book on a sofa in a room.

Table 8 shows the generated examples on WeiboDial. The other baselines fail to capture the topic word "late" in the post and thus generate irrelevant responses. ARAML can provide a response that is grammatical and closely relevant to the post.

Conclusion
We propose a novel adversarial training framework to deal with the instability problem of current GANs for text generation. To address the instability issue caused by policy gradient, we incorporate RAML into the adversarial training paradigm so that our generator acquires stable rewards. Experiments show that our model performs better than several state-of-the-art GAN baselines on three text generation tasks, with lower training variance.