DivGAN: Towards Diverse Paraphrase Generation via Diversified Generative Adversarial Network

Paraphrases refer to texts that convey the same meaning with different expression forms. Traditional seq2seq-based models on paraphrase generation mainly focus on the fidelity while ignoring the diversity of outputs. In this paper, we propose a deep generative model to generate diverse paraphrases. We build our model based on the conditional generative adversarial network, and propose to incorporate a simple yet effective diversity loss term into the model in order to improve the diversity of outputs. The proposed diversity loss maximizes the ratio of pairwise distance between the generated texts and their corresponding latent codes, forcing the generator to focus more on the latent codes and produce diverse samples. Experimental results on benchmarks of paraphrase generation show that our proposed model can generate more diverse paraphrases compared with baselines.


Introduction
The task of paraphrase generation refers to rewriting a given sentence to a new paraphrase sentence, which requires that the generated sentence and input sentence are different in expression form, but have the same expressed meaning. Paraphrase generation is a fundamental task of natural language processing (NLP). The technique of paraphrase generation has been widely used in many downstream applications, such as information retrieval, question answering, machine translation, and so on.
Paraphrases are diverse by nature: an input sentence can correspond to multiple plausible paraphrases. Traditional seq2seq-based methods tend to generate highly similar outputs, since the maximum likelihood estimation (MLE)-based objective function mostly cares about the validity rather than the diversity of outputs. Some works introduce control mechanisms over seq2seq models to produce diverse outputs (Iyyer et al., 2018; Park et al., 2019). However, the templates or exemplars used in these control mechanisms cannot cover all possible paraphrases, and the control mechanisms themselves are inflexible. Xu et al. (2018b) propose to use a shared decoder with different decoder embeddings to generate different outputs, but the decoder embeddings are not explicitly encouraged to produce different outputs.
Generative models, such as the Variational Autoencoder (VAE) (Kingma and Welling, 2014) and the Generative Adversarial Network (GAN) (Goodfellow et al., 2014), which learn distributions over the latent space, can generate diverse outputs. In this paper, we build a new framework on top of the conditional GAN (Mirza and Osindero, 2014) to generate diverse paraphrases. To get multiple outputs, generative models often take an additional random vector (latent code) as input, and this vector is responsible for producing variations in the outputs. However, unlike the traditional GAN, the conditional GAN also takes external conditional contexts as inputs. The conditional contexts are highly structured and complex compared to the latent code, so the latent code is easily ignored and becomes ineffective. Besides, GAN-based methods often suffer from the mode collapse problem (Salimans et al., 2016), in which only a few modes in the latent space are used.
We address the above problems by encouraging the generator to be sensitive to latent codes and explore more modes in the latent space. For this purpose, we incorporate the conditional GAN with a simple yet efficient diversity loss term. During training, the diversity loss maximizes the ratio of the pairwise distance between the generated texts and their corresponding latent codes. As a result, the generator is forced to pay attention to latent codes and has the chance to generate different outputs.
We conduct experiments on Quora and MSCOCO datasets. Experimental results show that our proposed model can generate more diverse paraphrases compared with baselines while retaining the same semantics.
In summary, the primary contributions of this paper are as follows:
• We propose a conditional GAN-based framework to generate diverse paraphrases.
• To make the latent code valid and to alleviate the mode collapse problem, we propose a diversity loss term, which makes the generator sensitive to the change of latent codes.
• The experimental results show that our model can successfully generate more diverse paraphrases.
Related Work

Paraphrase Generation
Seq2seq-based methods have been widely used in the task of paraphrase generation (Prakash et al., 2016;Kajiwara, 2019). Li et al. (2018) further adopt reinforcement learning with policy gradient technique to generate semantically consistent paraphrases. Gupta et al. (2018) propose a conditional VAE-based framework to generate paraphrases from the latent space. Shakeri and Sethy (2019) improve the VAE framework by conditioning the generator on a label which specifies whether the paraphrases are semantically consistent or not. Yang et al. (2019) further introduce the CVAE-GAN framework for paraphrase generation.
Some translation-based methods have also been proposed to generate paraphrases (Wieting et al., 2017; Guo et al., 2019). The main idea of these methods is to translate a text into another language (often referred to as the "pivot language") and then translate it back to the original language. The original text and the back-translated text are then considered a pair of paraphrases.
Some works also try to generate paraphrases in an unsupervised way. For example, Roy and Grangier (2019) adopt the vector-quantized VAE framework to discretize the latent space for paraphrase generation. Bao et al. (2019) decompose the latent space into a syntactic space and a semantic space, and sample in the syntactic space while keeping the semantics unchanged when generating paraphrases.

Generative Adversarial Nets
Generative Adversarial Nets were proposed by Goodfellow et al. (2014). The main idea of GAN is to train the generator and discriminator via minimax optimization, where the generator tries to generate realistic samples that match the real distribution, and the discriminator tries to distinguish between generated and real samples. GAN was first applied in computer vision. Some recent works have applied GAN-based frameworks to text generation (Yu et al., 2017; Kusner and Hernández-Lobato, 2016; Fedus et al., 2018; Guo et al., 2018; Wang and Wan, 2018). Applying GAN to text generation is nontrivial because generating discrete tokens is non-differentiable, which makes it difficult to optimize via back-propagation. The policy gradient technique (Sutton et al., 1999) is usually used to address this problem.

Methods
Given an input sentence $x = \{x_1, x_2, \cdots, x_n\}$, we seek to generate a set of $k$ paraphrase sentences $Y = \{y^{(1)}, y^{(2)}, \cdots, y^{(k)}\}$ such that every $y \in Y$ has the same meaning as $x$ but differs from it in expression form.

Base Model
We build our model on top of the conditional GAN. The model consists of a generator G and a discriminator D. The generator is a GRU-based seq2seq network and the discriminator is a CNN network.

Generator
The generator G is a GRU-based seq2seq network which consists of a GRU encoder $G_{enc}$ and a GRU decoder $G_{dec}$. Given a text $x$, the encoder takes $x$ as input and encodes it into a latent vector $h_x$. The decoder takes two inputs, the latent vector $h_x$ and a random vector $z$ sampled from the standard normal distribution, and generates the paraphrase $y$ corresponding to $x$. This process can be formalized as:

$$h_x = G_{enc}(x), \qquad y = G_{dec}(h_x, z), \qquad z \sim \mathcal{N}(0, I) \tag{1}$$

and we abbreviate it as $y = G(x, z)$. It is worth noting that the generator is architecture-agnostic and can adopt many other seq2seq frameworks such as the Transformer (Vaswani et al., 2017). Our work is orthogonal to those works that focus on designing sophisticated encoder and decoder architectures.
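To make the role of z concrete, the following sketch samples a latent code from the standard normal distribution and concatenates it to every token embedding, which is how the implementation details of this paper describe feeding the latent code to the generator. The helper names `sample_latent` and `attach_latent` are hypothetical, and plain Python lists stand in for tensors.

```python
import random


def sample_latent(dim=512, rng=None):
    """Draw a latent code z with independent standard normal entries."""
    rng = rng or random.Random()
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def attach_latent(token_embeddings, z):
    """Concatenate the same latent code z to every token embedding.

    Each output vector has len(embedding) + len(z) entries, so the
    network sees z at every time step.
    """
    return [list(embedding) + list(z) for embedding in token_embeddings]
```

Calling `attach_latent` with k different codes for the same sentence yields k different decoder inputs, which is the only source of variation between the k paraphrases.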

Discriminator
The discriminator D adopts a CNN, since CNNs have been shown to be highly effective for short text classification. Given a text $x$ and a paraphrase $y$, the CNN encodes them into $C(x)$ and $C(y)$, respectively, which have the same dimension. The quality of the paraphrase is then measured by a one-layer feed-forward network over $C(x)$ and $C(y)$ with weight $w$, bias $b$, and sigmoid activation $\sigma$. Its output $q(x, y) \in [0, 1]$ is the quality of the paraphrase $y$ given the sentence $x$.
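As a rough illustration of the scoring layer, the sketch below applies a single linear layer and a sigmoid to the two CNN encodings. How C(x) and C(y) are combined is not recoverable from the text here, so concatenation is an assumption, and `quality_score` is a hypothetical helper rather than the authors' implementation.

```python
import math


def quality_score(c_x, c_y, w, b):
    """One-layer feed-forward scorer with sigmoid activation.

    Assumption: C(x) and C(y) are concatenated before the linear layer;
    the paper only states that they have the same dimension.
    """
    if len(c_x) != len(c_y):
        raise ValueError("C(x) and C(y) must have the same dimension")
    features = list(c_x) + list(c_y)
    if len(w) != len(features):
        raise ValueError("w needs one weight per concatenated feature")
    logit = sum(wi * fi for wi, fi in zip(w, features)) + b
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```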

Training Objective
A good paraphrase should not only be natural but also have the same meaning as the input sentence. Similar to Reed et al. (2016), we extend the discriminator D to identify three types of paraphrases for each input sentence $x$: (1) $S_x$, the set of paraphrases produced by humans for $x$; (2) $S_G$, the set of paraphrases produced by the generator G for $x$; and (3) $S_{\setminus x}$, the set of paraphrases produced by humans but randomly sampled from all paraphrases, which may be irrelevant to the given sentence $x$. Then the training objective is given below:

$$\mathcal{L}(G, D) = \mathbb{E}_{y \in S_x}\left[\log q(x, y)\right] + \alpha\, \mathbb{E}_{y \in S_G}\left[\log\left(1 - q(x, y)\right)\right] + \beta\, \mathbb{E}_{y \in S_{\setminus x}}\left[\log\left(1 - q(x, y)\right)\right] \tag{3}$$

where $\alpha$ and $\beta$ are weight parameters. Feeding irrelevant sentences to the discriminator is a common practice in training CGANs. Without this term, in theory a sentence on any topic given to the discriminator would be considered correct.
The goal of the generator is to generate paraphrases that are semantically consistent and natural (i.e., indistinguishable for the discriminator). Therefore it should minimize Eq. 3. The goal of the discriminator is to distinguish among artificial paraphrases (i.e., those generated by the generator), golden paraphrases (i.e., those produced by humans for the input), and irrelevant paraphrases (i.e., those produced by humans but irrelevant to the input). Therefore it should maximize Eq. 3. This can be formalized as the following minimax problem:

$$\min_G \max_D \; \mathcal{L}(G, D) \tag{4}$$

We adopt adversarial training to optimize Eq. 4. To address the problem that the gradient cannot pass back to the generator, we formalize the generation of discrete tokens as a sequential decision-making process and adopt the policy gradient and early feedback techniques described in Yu et al. (2017). We refer readers to Yu et al. (2017) for more details.
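The three-way objective can be sketched numerically as follows. This assumes a Reed et al. (2016)-style form in which human paraphrases contribute log q and the generated and irrelevant paraphrases contribute log(1 − q), weighted by α and β respectively; the function name `cgan_objective` and the placement of the weights are assumptions made for illustration, and the defaults of 0.8 follow the values reported in the implementation details.

```python
import math


def cgan_objective(q_human, q_generated, q_irrelevant, alpha=0.8, beta=0.8):
    """Value of an assumed three-term conditional GAN objective.

    Each argument is a non-empty list of discriminator scores in [0, 1]:
    q_human for human paraphrases of x, q_generated for generator outputs,
    q_irrelevant for human sentences unrelated to x.
    The discriminator wants this value high, the generator wants it low.
    """
    eps = 1e-12  # avoids log(0) for saturated scores

    def mean(values):
        return sum(values) / len(values)

    real_term = mean([math.log(q + eps) for q in q_human])
    fake_term = mean([math.log(1.0 - q + eps) for q in q_generated])
    wrong_term = mean([math.log(1.0 - q + eps) for q in q_irrelevant])
    return real_term + alpha * fake_term + beta * wrong_term
```

A discriminator that separates the three sets perfectly drives the value towards 0, while one that outputs 0.5 everywhere gets (1 + α + β) · log 0.5.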

Motivation
We find in experiments that directly applying the conditional GAN model described above does not satisfactorily generate diverse paraphrases. Specifically, even if we sample multiple different z, the generated paraphrases are the same in many cases. This means that the latent code does not work or has minor impacts. We think this is because the conditional texts are highly structured and provide strong prior knowledge to guide the generation process, making the latent code negligible. Besides, from the perspective of optimization, this can be interpreted as the mode collapse problem (Salimans et al., 2016), where only a few modes get learned and the generator only generates samples from a few modes.
To solve this problem and produce diverse paraphrases, we propose to encourage the generator to explore more modes in the latent space and make the generator sensitive to latent codes. Inspired by Odena et al. (2018), we incorporate the conditional GAN with a diversity loss term.

Formulation
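Given an input sentence $x$, we sample a set of $k$ latent codes $\{z^{(i)}\}_{i=1}^{k}$ from the Gaussian distribution and generate the corresponding paraphrases $\{y^{(i)}\}_{i=1}^{k}$, where $y^{(i)} = G(x, z^{(i)})$. For convenience, we denote by $\tilde{y}^{(i)}$ the vector representation of $y^{(i)}$, which is obtained by taking the hidden state at the last time step of $y^{(i)}$. We use the L2 distance $\lVert \tilde{y}^{(i)} - \tilde{y}^{(j)} \rVert_2$ to measure the difference between $\tilde{y}^{(i)}$ and $\tilde{y}^{(j)}$, and $\lVert z^{(i)} - z^{(j)} \rVert_2$ to measure the difference between $z^{(i)}$ and $z^{(j)}$, and denote by $u^{(i,j)}$ the ratio of the two distances:

$$u^{(i,j)} = \frac{\lVert \tilde{y}^{(i)} - \tilde{y}^{(j)} \rVert_2}{\lVert z^{(i)} - z^{(j)} \rVert_2} \tag{5}$$

The diversity loss $\mathcal{L}_{div}$ (Eq. 6) is then calculated from the ratios $u^{(i,j)}$ together with a slack factor $\lambda$.
During training, the diversity loss $\mathcal{L}_{div}$ is appended to the original objective function:

$$\mathcal{L}'(G, D) = \mathcal{L}(G, D) + \gamma\, \mathcal{L}_{div} \tag{7}$$

where $\mathcal{L}(G, D)$ is the objective in Eq. 3 and $\gamma$ is the weight parameter. With the diversity loss term, the optimization problem becomes

$$\min_G \max_D \; \mathcal{L}(G, D) + \gamma\, \mathcal{L}_{div} \tag{8}$$

We use the same techniques described in Section 3.1.3 to solve this problem.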
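The pairwise ratio described above can be computed as in the following sketch. It covers only the ratio u for each pair of latent codes, not the diversity loss itself, and the function names are hypothetical; plain Python lists stand in for the hidden-state vectors and latent codes.

```python
import math
from itertools import combinations


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def pairwise_ratios(y_vecs, z_codes):
    """Return {(i, j): u_ij} for all i < j.

    u_ij = ||y_i - y_j||_2 / ||z_i - z_j||_2, where y_i is the vector
    representation of the paraphrase generated from latent code z_i.
    """
    if len(y_vecs) != len(z_codes):
        raise ValueError("need one output vector per latent code")
    ratios = {}
    for i, j in combinations(range(len(z_codes)), 2):
        dz = l2_distance(z_codes[i], z_codes[j])
        if dz == 0.0:
            raise ValueError("latent codes must be pairwise distinct")
        ratios[(i, j)] = l2_distance(y_vecs[i], y_vecs[j]) / dz
    return ratios
```

A ratio of 0 for a pair means the generator produced the same representation for two different latent codes, which is exactly the behaviour the diversity loss is meant to penalize.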

Why does it work
Minimizing Eq. 6 maximizes the ratio of $\lVert \tilde{y}^{(i)} - \tilde{y}^{(j)} \rVert_2$ to $\lVert z^{(i)} - z^{(j)} \rVert_2$. This means that the generator will be punished if it does not produce different paraphrases given different latent codes. Therefore, the generator is forced to focus more on the latent codes and generate different paraphrases.
From the perspective of mode collapse, minimizing Eq. 6 can prevent the generator from producing samples from only a few modes, and increases the chance of producing samples from some minor modes. Minimizing Eq. 6 can be seen as maximizing $u^{(i,j)}$, which corresponds to a lower bound on the gradient of the generator:

$$\frac{\lVert \tilde{y}^{(i)} - \tilde{y}^{(j)} \rVert_2}{\lVert z^{(i)} - z^{(j)} \rVert_2} \le \int_0^1 \left\lVert \nabla_z G(x, \Gamma(t)) \right\rVert \, dt \tag{9}$$

where $\Gamma(t) = t z^{(i)} + (1 - t) z^{(j)}$ is a line segment with $z^{(i)}$ and $z^{(j)}$ as its end points. Eq. 9 reveals that for any two modes $z^{(i)}$ and $z^{(j)}$, maximizing Eq. 5 will increase the gradient of the generator between $z^{(i)}$ and $z^{(j)}$. Therefore, by increasing the gradient of the generator, more modes can be learned, and thus the generator has the chance to generate samples from minor modes.
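The bound can be checked with a short derivation, sketched here under the assumption that, for a fixed input x, the map from a latent code z to the vector representation is continuously differentiable. The symbols $g$ (this map) and $J_g$ (its Jacobian) are notation introduced only for this sketch, and the matrix norm is the operator norm induced by the L2 norm.

```latex
\begin{aligned}
\tilde{y}^{(i)} - \tilde{y}^{(j)}
  &= g(\Gamma(1)) - g(\Gamma(0))
   = \int_0^1 J_g(\Gamma(t))\, \Gamma'(t)\, dt
   = \int_0^1 J_g(\Gamma(t))\, \bigl(z^{(i)} - z^{(j)}\bigr)\, dt, \\
\bigl\lVert \tilde{y}^{(i)} - \tilde{y}^{(j)} \bigr\rVert_2
  &\le \left( \int_0^1 \bigl\lVert J_g(\Gamma(t)) \bigr\rVert_2 \, dt \right)
       \bigl\lVert z^{(i)} - z^{(j)} \bigr\rVert_2, \\
u^{(i,j)}
  &\le \int_0^1 \bigl\lVert J_g(\Gamma(t)) \bigr\rVert_2 \, dt .
\end{aligned}
```

The first line is the fundamental theorem of calculus along the segment $\Gamma$, using $\Gamma(1) = z^{(i)}$, $\Gamma(0) = z^{(j)}$ and $\Gamma'(t) = z^{(i)} - z^{(j)}$. The second line moves the norm inside the integral, and the third divides by $\lVert z^{(i)} - z^{(j)} \rVert_2$. Pushing $u^{(i,j)}$ up therefore forces the average gradient norm along the segment up as well.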

Dataset
There are many datasets for paraphrase generation. We choose the two most widely used datasets, Quora 1 and MSCOCO (Lin et al., 2014) for experiments.
Quora The Quora dataset consists of over 400K candidate question paraphrase pairs, and each pair has a manually annotated label. Two questions are paraphrases of each other only when the question pair is annotated as 1. This dataset contains 155K paraphrase question pairs in total.
MSCOCO MSCOCO is a benchmark for the task of image captioning. This dataset contains over 82K training and 42K validation images, and each image has at most five human-labeled captions. Similar to previous work on paraphrase generation, we consider different captions of the same image as paraphrases. Following previous work, we truncate the sentences to 15 words.

Evaluation Metrics
BLEU4 : BLEU4 is the most widely used evaluation metric in paraphrase generation. We report the average BLEU4 score of the k outputs. Notice that some works also calculate the ROUGE or TER scores, but we think the role of these two metrics overlaps with the BLEU metric, as they all calculate the degree of overlap between outputs and references. Therefore we only calculate the BLEU score to evaluate the closeness of outputs to the references.
Self-BLEU : To evaluate the degree to which the generated paraphrases differ from the original sentence, we propose to calculate the BLEU4 score between the generated paraphrases and the input sentence. We name it "Self-BLEU". The lower the Self-BLEU score, the more significant the change in the generated paraphrase. We report the average Self-BLEU score of the k outputs.
Pairwise-BLEU : We propose to calculate the "Pairwise-BLEU" score to evaluate the difference between the k different paraphrases generated from the same given sentence. Concretely, for k outputs $\{y_1, y_2, \cdots, y_k\}$, we compute the BLEU4 score between all $y_i$ and $y_j$ ($i \neq j$), and average the $k(k-1)/2$ scores. A low Pairwise-BLEU score means a high diversity between outputs, and vice versa. We abbreviate Pairwise-BLEU as "P-BLEU".
1 https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
BERTScore : To evaluate the semantic changes of the generated paraphrase compared with the input sentence, we calculate the BERTScore (Zhang et al., 2020) between the generated paraphrase and the input sentence. We report the average BERTScore of the k outputs.
Human Evaluation : In addition to the above automatic evaluation metrics, we also conduct human evaluation. We randomly sample 50 examples from the test sets of the Quora and MSCOCO datasets, respectively. We ask five volunteers to evaluate the quality of the generated paraphrases from the following four aspects: (1) Fidelity: how semantically consistent are the generated paraphrases with the input sentence? (2) Fluency: how fluent are the generated paraphrases? (3) Diversity: how diverse are the generated paraphrases? (4) Variability: how much do the generated paraphrases change in expression form compared with the input sentences? These scores are all between 1 and 5, with 5 being the best.
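The Self-BLEU and Pairwise-BLEU definitions above can be sketched in a few lines. The `bleu4` function below is a simplified sentence-level BLEU-4 with add-one smoothing and whitespace tokenization, written only so that the example is self-contained; it is an assumption for illustration and will not reproduce the scores of a standard BLEU toolkit.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4 (add-one smoothed, single reference)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += 0.25 * math.log((overlap + 1) / (total + 1))
    # Brevity penalty for candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def self_bleu(source, outputs):
    """Average BLEU-4 of the k outputs against the input sentence."""
    return sum(bleu4(y, source) for y in outputs) / len(outputs)


def pairwise_bleu(outputs):
    """Average BLEU-4 over the k(k-1)/2 unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(bleu4(a, b) for a, b in pairs) / len(pairs)
```

For both metrics, lower is better from the diversity point of view: three identical outputs give a Pairwise-BLEU of 1 under this sketch, and any variation between them lowers it.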

Competitive Models
We compare our model with the following baselines:
LSTM The stacked residual-LSTM proposed by Prakash et al. (2016). We reimplemented this baseline ourselves.
Transformer The standard Transformer model proposed by Vaswani et al. (2017). To improve the diversity of outputs, we test three variants: (1) Transformer + beam: using beam search to generate k different outputs, (2) Transformer + divbeam: using the diverse beam search proposed by Vijayakumar et al. (2016) to generate k different outputs, and (3) Transformer + sampling: using the sampling strategy to generate each token in the decoding stage.

VAE-SVG
The variational auto-encoder model described in Gupta et al. (2018). We reimplemented this model ourselves for the experiments.

D-PAGE
The Diverse Paraphrase Generation model proposed by Xu et al. (2018b). They use a shared decoder with different decoder embeddings to generate different outputs.
DPGAN The Diversity-Promoting GAN proposed by Xu et al. (2018a). They assign low reward to repeated text and high reward to novel text to promote diverse outputs. 2
CGAN The conditional GAN with the same architecture as our model, but without the diversity loss term.
Other baselines We also report the results of DNPG, RbM-SL (Li et al., 2018), MC-WGAN, and UPSA. Notice that they focus on generating a single high-quality paraphrase and do not evaluate the generation of multiple paraphrases in their experiments. Thus we can only list their BLEU4 scores for reference. 3

Implementation Details
For the generator, the encoder is a one-layer bidirectional GRU network with inner self-attention, and the decoder is a two-layer unidirectional GRU network. The input and hidden dimensions are set to 512. The latent code dimension is set to 512, and the latent code is concatenated to each input token. For the discriminator, the CNN network is the same as in Kim (2014), where the sizes of the filter windows are set to 3, 4, and 5, with 100 feature maps each.
2 https://github.com/lancopku/DPGAN
3 They do not release their code, so we cannot obtain their results for generating multiple paraphrases.
Following previous work on GAN-based text generation, we pre-train the generator using the standard MLE loss for 25 epochs, and pre-train the discriminator using the objective in Eq. 3 for 5 epochs. After pre-training, the generator and discriminator are trained alternately, where each iteration consists of a G-step followed by a D-step.
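The schedule just described can be summarized as a small driver loop. The four callables and the `adversarial_iters` parameter are hypothetical placeholders (the number of adversarial iterations is not stated here); only the epoch counts and the order of the G-step and D-step come from the text.

```python
def run_training_schedule(pretrain_g_epoch, pretrain_d_epoch, g_step, d_step,
                          g_pretrain_epochs=25, d_pretrain_epochs=5,
                          adversarial_iters=1000):
    """Pre-train G, pre-train D, then alternate one G-step and one D-step.

    adversarial_iters is an illustrative default, not a value from the paper.
    """
    for _ in range(g_pretrain_epochs):
        pretrain_g_epoch()      # one MLE epoch for the generator
    for _ in range(d_pretrain_epochs):
        pretrain_d_epoch()      # one epoch of discriminator pre-training
    for _ in range(adversarial_iters):
        g_step()                # generator update (policy gradient)
        d_step()                # discriminator update
```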
We use the NLTK 4 tool to process the English texts. The vocabulary sizes are set as 50,000 and 80,000 for Quora and MSCOCO datasets, respectively. We set α = 0.8 and β = 0.8 in Eq. 3, and γ = 10 in Eq. 7 according to the performance on the validation set.

Experimental Setup
For generative models, we sample z (1) , z (2) , · · · , z (k) from the Gaussian distribution to generate k outputs. For Transformer models, we use the beam search to generate k outputs. We set k = 3 for all models in experiments.

Results of Automatic Evaluation Metrics
The comparison results of our model and the main baseline models on the Quora and MSCOCO datasets are shown in Table 1. For the other baselines, we also show the BLEU4 scores in Table 2 for reference. In terms of BLEU4 score, our DivGAN (average) performs worse than RbM-SL, MC-WGAN, VAE-SVG, D-PAGE and the Transformer-based methods. However, we argue that this does not mean that the quality of our generated paraphrases is worse than that of the paraphrases generated by these models. Previous works have shown that BLEU is not a good measure for evaluating several text generation tasks, including dialogue generation, sentence simplification (Sulem et al., 2018) and paraphrase generation (Liu et al., 2010).
First, we also think that BLEU itself is not a perfectly reasonable metric for the paraphrase generation task. Paraphrases are highly diverse by nature, but there is only one reference in these paraphrase datasets. Taking the sentence "what can i do to overcome anxiety" with the human reference "what do i do to reduce my anxiety" as an example, our model generates sentences like "how do i overcome anxiety" or "what's the best way to overcome anxiety", which are low in BLEU score but are good paraphrases from a human's point of view. Therefore, we think that a high BLEU score only indicates a high degree of overlap between the generated paraphrase and the reference, but does not indicate high quality.
Second, the BERTScore and the human evaluation results show that the paraphrases we generate are no worse than those of these models in terms of relevance and fluency, or even better. It is worth mentioning that in terms of BERTScore and human evaluation, the DivGAN model even outperforms the human reference.
Third, we also find that the more diverse the generated paraphrases are, the lower the average BLEU score is. This is because once we generate a paraphrase that is very similar to the reference, the diversity loss encourages the remaining paraphrases to differ from it, which lowers the BLEU score of the remaining k − 1 paraphrases and thereby the average BLEU score. We calculate the highest BLEU score among the k results, and find that it is 3 to 4 points higher than the average score (see DivGAN (best)).
In terms of the Pairwise-BLEU score, the DivGAN model significantly outperforms all baselines (except the Transformer + sampling model on the Quora dataset), indicating that the proposed model can generate diverse sentences effectively. We notice that just by removing the diversity loss term from DivGAN, the Pairwise-BLEU of CGAN is greatly increased (from 32.64 to 53.06 on Quora, and from 15.45 to 44.55 on MSCOCO). By checking the outputs, we find that CGAN generates a lot of repeated sentences, thereby boosting the Pairwise-BLEU score. Our DivGAN occasionally produces repeated sentences as well, but far fewer than CGAN, D-PAGE and VAE-SVG. These results demonstrate the effectiveness of our proposed diversity loss.
The Transformer + sampling model seems to be able to generate diverse outputs according to its low Self-BLEU and Pairwise-BLEU scores. However, by checking the outputs, we find that the Transformer + sampling model produces large amounts of meaningless text, such as the sentences in Table 5. These near-randomly generated tokens lower the Self-BLEU and Pairwise-BLEU scores of Transformer + sampling, but they lower its BLEU and BERTScore as well.
Although D-PAGE tries to obtain different outputs by using different decoder embeddings, we find that the sentences generated by different decoders are the same, or differ only slightly, in many cases. This is because the decoders are not explicitly encouraged to produce different results.
The DPGAN model can achieve a low Pairwise-BLEU score, but its BLEU4 and BERTScore are also low. By checking the outputs, we find that this is because DPGAN tends to produce long sentences. To generate "novel" sentences, DPGAN uses the cross-entropy loss as the reward, and long sentences can have a high reward. Therefore, DPGAN achieves a low BLEU4 score because the references are relatively short, and achieves a low BERTScore because the long text changes the semantics to some extent.
In terms of BERTScore, it can be seen that although our model achieves lower BLEU scores in some cases, it can achieve similar or even higher BERTScore. To some extent, this shows that although the paraphrases generated by our model differ more from the human references, the quality of these paraphrases is still good.

Results of Human Evaluations
Table 3 and Table 4 show the results of the human evaluation on the Quora and MSCOCO datasets, respectively. It can be seen that in terms of quality (fidelity, fluency, and variability), all models' scores are close to the human reference, except for Transformer + sampling. This shows that all the other models can generate human-like paraphrases. But in terms of the diversity score, our proposed model surpasses the other competitive models, indicating that our model can generate more diverse paraphrases.
Case Study
Table 5 shows the outputs of different models for an input sentence from the Quora dataset. We have the following observations. First, using traditional beam search can produce different outputs, but the generated texts are highly similar, with only minor modifications (for example, replacing "can you" with "do you", or replacing "while awake" with "while you are awake"). Second, using the sampling strategy during decoding sometimes produces unnatural output, especially at the beginning or end of the sentence (see the second and third sentences in Table 5). Third, VAE-SVG and CGAN sometimes produce the same outputs (see the first and third sentences of CGAN in Table 5), indicating that the latent codes sometimes do not work well. Transformer + divbeam, DPGAN, and our DivGAN model can produce high-quality and diverse outputs. By comparing more generated samples from the test set, we find that our DivGAN model can generate more diverse samples than the other two models.

Conclusions
In this paper, we propose a conditional generative adversarial network based model to tackle the task of diverse paraphrase generation. To solve the problem of the minor impacts of the latent codes and the mode collapse in the conditional GAN, we propose to add a diversity loss term to the objective. The diversity loss term encourages the generator to explore more in the latent space and generate samples from some minor modes. Experimental results demonstrate the effectiveness of the proposed diversity loss term. In the future, we will apply the diversity loss to more tasks and models.