An End-to-End Generative Architecture for Paraphrase Generation

Generating high-quality paraphrases is a fundamental yet challenging natural language processing task. Despite the effectiveness of previous work based on generative models, problems remain with exposure bias in recurrent neural networks and with a frequent failure to generate realistic sentences. To overcome these challenges, we propose the first end-to-end conditional generative architecture for generating paraphrases via adversarial training, which does not depend on extra linguistic information. Extensive experiments on four public datasets demonstrate that the proposed method achieves state-of-the-art results, outperforming previous generative architectures on both automatic metrics (BLEU, METEOR, and TER) and human evaluations.


Introduction
Paraphrases convey the same meaning as the original sentences or text, but with different expressions in the same language. Paraphrase generation aims to synthesize paraphrases of a given sentence automatically. This is a fundamental natural language processing (NLP) task, and it is important for many downstream applications (Madnani and Dorr, 2010; Passonneau et al., 2018). For example, paraphrases can help diversify the responses of chatbot engines, strengthen question answering (Harabagiu and Hickl, 2006; Duboue and Chu-Carroll, 2006; Fader et al., 2014), augment relation extraction (Romano et al., 2006), and extend the coverage of semantic parsers (Berant and Liang, 2014).
Generating accurate and diverse paraphrases automatically is still very challenging. Traditional methods (McKeown, 1983; Bolshakov and Gelbukh, 2004; Zhao et al., 2008) exploit how linguistic knowledge can improve the quality of generated paraphrases, including shallow linguistic features (Zhao et al., 2009), and syntactic and semantic information (Kozlowski et al., 2003; Ellsworth and Janin, 2007). However, they are often domain-specific and hard to scale, or yield inferior results.
With the help of growing large-scale data and neural network models, recent studies have shown promising results. Three families of deep learning architectures have been investigated with the goal of generating high-quality paraphrases. The first formulates paraphrase generation as a sequence-to-sequence problem, following experience from machine translation (Bahdanau et al., 2014; Cheng et al., 2016). Prakash et al. (2016) propose stacked residual long short-term memory networks (LSTMs), while Cao et al. (2017) make use of the gated recurrent unit (GRU) as the recurrent unit. Shortly afterwards, several works focused on providing extra information to enhance the encoder-decoder model. Ma et al. (2018) add distributed word representations,  consider a paraphrase dictionary, and Iyyer et al. (2018) utilize a syntactic parser. Recently,  utilizes a semantically augmented Transformer (Vaswani et al., 2017); however, its accuracy largely depends on extra semantic information. The second family of models employs reinforcement learning (Li et al., 2017b). The third family is the generative architecture that we focus on. Specifically,  applies conditional variational autoencoders (CVAEs) (Sohn et al., 2015) with LSTM encoder-decoders to generate paraphrases, expecting the RNN's advantage in modeling local dependencies to be a good complement to the CVAE's power at learning global representations. This simple combination, however, faces two major challenges.
First, CVAE may fail to generate realistic sentences. When generating synthetic sentences by decoding random samples from the latent space, most regions of the latent space do not necessarily decode to realistic sentences; the example in Table 1 reflects this challenge. In (Bowman et al., 2016), the authors attempted to utilize an RNN-based VAE to generate more diverse sentences. However, this ignores the fundamental problem that the posterior distribution over latent variables does not cover the latent space appropriately. Second, during learning, the ground-truth words are used for decoding, while during inference, the RNN generates words in sequence from previously generated words. Bengio et al. (2015) call this phenomenon "exposure bias" and try to solve it with a scheduled sampling method. However, in practice, it produces unstable results, because it is fundamentally an inconsistent training strategy (Huszár, 2015).
We propose the first paraphrase generation model trained adversarially to tackle the above problems, which we believe is a natural answer to the aforementioned challenges. Specifically, we formalize CVAE as the generator of a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) and tailor a discriminator to the CVAE. By introducing an adversarial game between a generator and a discriminator, the GAN matches the distribution of synthetic data with that of real data. The generator of the GAN seeks to map samples from a given prior distribution to realistic synthetic data. The discriminator of the GAN compares entire real and synthetic sentences, instead of individual words, which should in principle alleviate the exposure-bias issue (Yu et al., 2017; Li et al., 2017a; Guo et al., 2018). The intuition behind our model can be interpreted as using CVAE to generate similar sentences and enhancing CVAE with an extended discriminator, so that the generator and discriminator work effectively together; they provide feedback signals to each other, resulting in a mutual adversarial framework. Overall, our contributions are as follows.
• To the best of our knowledge, this work represents the first to propose an end-to-end paraphrase generation architecture via adversarial training, which does not require extra linguistic information.
• We take advantage of GAN to help choose a better latent-variable distribution. In this way, we not only utilize better latent variables but also strengthen the expressiveness of the generative model.
• Our experiments show that the proposed model is capable of generating plausible paraphrased sentences and outperforms competitive baseline models, with state-of-the-art results.

Preliminaries
We first provide preliminaries on variational auto-encoders (VAEs) and conditional variational auto-encoders (CVAEs).
VAE. The variational auto-encoder (VAE) (Kingma and Welling, 2013; Sohn et al., 2015) is a directed graphical model with latent variables. The generative process of VAE is:
• (Encoder) generate a latent variable z from the prior distribution p_θ(z), where θ denotes the generative parameters;
• (Decoder) generate the data x from the generative distribution p_θ(x|z), conditioned on z.
Although parameter estimation of directed graphical models is generally challenging due to intractable posterior inference, VAE parameters can be estimated efficiently by the stochastic gradient variational Bayes (SGVB) estimator (Kingma and Welling, 2013), and can be optimized straightforwardly using standard stochastic gradient techniques. SGVB treats the variational lower bound of the log-likelihood as a surrogate objective function, which can be written as:

log p_θ(x) = KL(q_φ(z|x) || p_θ(z|x)) + L(θ, φ; x) ≥ L(θ, φ; x) = −KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)],   (1)

where the inequality follows because the second term KL(q_φ(z|x) || p_θ(z|x)) in (1) is nonnegative.
In practice, we can approximate the expectation by drawing latent-variable samples {z_1, z_2, ..., z_L} from q_φ(z|x), where φ denotes the variational parameters, and the empirical objective of VAE with Gaussian latent variables can be represented as:

L̃(θ, φ; x) = −KL(q_φ(z|x) || p_θ(z)) + (1/L) Σ_{l=1}^{L} log p_θ(x|z_l),   (2)

where z_l = g_φ(x, ε_l) and ε_l ∼ N(0, I). That is, q_φ(z|x) is reparameterized with a differentiable unbiased function g_φ(x, ε_l), where x is the data and ε is the noise variable. VAE can be trained efficiently by stochastic gradient descent (SGD) because the reparameterization trick allows error backpropagation through the Gaussian latent variables.
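As a concrete illustration, the reparameterization step z = g_φ(x, ε) = µ + σ · ε can be sketched in a few lines of pure Python; this is our own minimal sketch (the function name and list-based vectors are illustrative, not from the paper):

```python
import random


def reparameterize(mu, sigma, eps=None):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Because z is a deterministic, differentiable function of (mu, sigma)
    for a fixed external noise draw eps, gradients can flow through
    mu and sigma during backpropagation.
    """
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


# With the noise fixed, the sample is a deterministic affine map of (mu, sigma):
z = reparameterize([1.0, -2.0], [0.5, 2.0], eps=[0.0, 1.0])
print(z)  # [1.0, 0.0]
```

Sampling ε outside the computation graph, rather than sampling z directly, is exactly what makes the Monte Carlo term in (2) trainable by SGD.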
CVAE. Sohn et al. (2015) develop a deep conditional generative model (CVAE) for structured output prediction using Gaussian latent variables. Different from the standard VAE, CVAE involves three variables: input variables x, output variables y, and latent variables z, and its prior distribution is p_θ(z|x). CVAE is trained to maximize the conditional log-likelihood log p_θ(y|x), again in the framework of SGVB. The variational lower bound of the model is:

log p_θ(y|x) ≥ −KL(q_φ(z|x, y) || p_θ(z|x)) + E_{q_φ(z|x,y)}[log p_θ(y|x, z)],   (3)

and the empirical lower bound is represented as:

L̃(θ, φ; x, y) = −KL(q_φ(z|x, y) || p_θ(z|x)) + (1/L) Σ_{l=1}^{L} log p_θ(y|x, z_l),   (4)

where z_l = g_φ(x, y, ε_l), ε_l ∼ N(0, I), and L is the number of samples.

Model
Our new model, which we call GAP for Generative Adversarial Paraphrase model, targets the goal of generating plausible paraphrased sentences conditioned on the original sentences, without any extra linguistic information. We first propose our end-to-end conditional generative architecture for generating paraphrases via adversarial training. Two training techniques for the proposed model are then described.

GAP
We consider a dataset of pairs of an original sentence and its paraphrased sentence. Each word is first mapped to its embedding, a column of W_e ∈ R^{d×V}, where W_e is a learned word-embedding matrix, V is the vocabulary size, and the υ-th column of W_e is W_e[υ]. All columns of W_e are normalized to have unit ℓ2-norm, i.e., ||W_e[υ]||_2 = 1, ∀υ, by dividing each column by its ℓ2-norm. After embedding, a sentence of length T (padded when necessary) is represented as X ∈ R^{d×T} by simply concatenating its word embeddings, i.e., x_t is the t-th column of X.
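The ℓ2 column normalization of W_e described above can be sketched as follows; this is a minimal pure-Python illustration (the helper name and toy matrix are ours, not the paper's code):

```python
import math


def normalize_columns(W):
    """Normalize each column of W (a d x V matrix as a list of d rows)
    to unit l2-norm by dividing the column by its norm."""
    d, V = len(W), len(W[0])
    norms = [math.sqrt(sum(W[i][j] ** 2 for i in range(d))) for j in range(V)]
    return [[W[i][j] / norms[j] for j in range(V)] for i in range(d)]


# d = 2 embedding dimensions, V = 2 vocabulary entries (toy numbers):
W = [[3.0, 0.0],
     [4.0, 2.0]]
W_norm = normalize_columns(W)
# column 0: (3, 4) / 5 = (0.6, 0.8); column 1: (0, 2) / 2 = (0.0, 1.0)
```

After normalization every embedding vector lies on the unit sphere, so inner products between embeddings reduce to cosine similarities.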
The training set is S = {(s^o_k, s^p_k)}_{k=1}^{K} of pairwise data, where s^o_k ∈ S^o denotes the original sentence sequence and s^p_k ∈ S^p denotes the reference paraphrased sentence sequence. The goal of our model is to learn a function f : S^o → S^p. The overall structure of the proposed method is shown in Figure 1. The components can be divided into three modules: the encoder in yellow, the decoder in blue (which is also the generator), and the discriminator in purple. The connections between them are drawn as arrows.
Encoder E. The encoder E consists of two LSTM networks E 1 and E 2 , and E generates latent variable z from the input sentences. There are two different paths generating latent variable z. In the first path, we input sampled s o and s p into E 1 and E 2 , respectively, and generate sufficient statistics of a Gaussian distribution, mean µ and standard deviation σ. Therefore, the distribution of the latent variable z can be represented as q(z|s o , s p ).
However, we cannot observe the paraphrased sentence s^p when we evaluate or apply the trained model in real applications; the first path therefore does not fit the test-time setting, where only E_1 is used during inference. Thus, we also compute another set of sufficient statistics, mean µ′ and standard deviation σ′, by inputting only s^o into encoder E_1. The sampled latent variable z then depends only on s^o, and we can represent its distribution as p(z|s^o).
Decoder/Generator G. We again combine two LSTM decoders G_1 and G_2 as the generator G. First, we feed the original sentence s^o into decoder G_1 and obtain its final state (c^{G_1}_T, h^{G_1}_T). The paraphrased sentence s̃^p is then predicted word by word. At each step, we concatenate the latent variable z and the previously predicted (or ground-truth) word, and input them into the LSTM decoder G_2 together with the hidden state from the last step. At timestep 0, the input word is BOS (the "begin of sentence" symbol), and the hidden state is the output of G_1. Therefore, the probability of the predicted paraphrased sentence s̃^p, given the encoded feature vector z, is defined as:

p(s̃^p | z) = Π_{t=1}^{T} p(w̃^p_t | w̃^p_{<t}, z),   (5)

where w̃^p_t is the t-th generated token, <t denotes {0, ..., t−1}, and w̃^p_0 denotes BOS. All the words in s̃^p are generated sequentially using the LSTM, based on previously generated words, until the end-of-sentence symbol is generated. The t-th word is generated as w̃^p_t = argmax(V h_t). The hidden units h_t are updated through

h_t = E(h_{t−1}, [x̃^p_{t−1}; z]),   (6)

where E denotes the transition function in the second LSTM cell related to the updates; E(·) is implemented with an LSTM. V is a weight matrix used to compute a distribution over words, and x̃^p_{t−1} is the embedding vector of the previously generated word, which serves as the input at step t. Consequently, the generated sentence s̃^p = [w̃^p_1, ..., w̃^p_T] is obtained given z by simply concatenating the generated words.
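The word-by-word greedy decoding loop described above can be sketched as follows, with a toy step function standing in for the LSTM transition and the projection by V (the function names and the toy model are illustrative assumptions, not the paper's code):

```python
def greedy_decode(step_fn, z, bos, eos, max_len=15):
    """Greedily decode a sentence one token at a time.

    step_fn(prev_token, state, z) -> (scores_per_token, new_state) plays the
    role of the LSTM transition and the projection V * h_t; at each step we
    take the argmax token and feed it back in, until EOS or max_len.
    """
    tokens, prev, state = [], bos, None
    for _ in range(max_len):
        scores, state = step_fn(prev, state, z)
        prev = max(scores, key=scores.get)  # argmax over the vocabulary
        if prev == eos:
            break
        tokens.append(prev)
    return tokens


# Toy "model": always prefers the next word of one fixed sentence.
SENT = ["a", "dog", "runs", "<eos>"]


def toy_step(prev, state, z):
    i = 0 if state is None else state
    return {w: (1.0 if w == SENT[i] else 0.0) for w in SENT}, i + 1


print(greedy_decode(toy_step, z=None, bos="<bos>", eos="<eos>"))  # ['a', 'dog', 'runs']
```

In the real model the state would be the LSTM cell of G_2, initialized from G_1's final state, and z would be concatenated with the previous word embedding at every step.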
Two-Path Loss. We can now compute the loss from the two paths of the encoder. In the first path, we minimize the KL loss and the MLE (reconstruction) loss:

L_1 = KL(q(z|s^o, s^p) || p(z|s^o)) − E_{z∼q(z|s^o, s^p)}[log p(s^p | z, s^o)].   (7)

In the second path, the loss is presented as follows:

L_2 = −E_{z∼p(z|s^o)}[log p(s^p | z, s^o)].   (8)

The proposed two-path reconstruction loss diminishes the gap between the prediction pipelines at training and testing time, and helps generate diverse but realistic predictions. The objective function for optimizing E and G can be written as:

L_{E,G} = λ_1 L_1 + λ_2 L_2 + λ_3 L_DG, with L_DG = −log D(s̃^p),   (9)

where D denotes the discriminator (described below) and s̃^p is decoded from z ∼ q(z|s^o). We use λ_1, λ_2, and λ_3 to balance the weights of the different losses.
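When both q(z|s^o, s^p) and p(z|s^o) are diagonal Gaussians, as produced by the encoder here, the KL term in (7) has a closed form; a minimal sketch (our own helper, under the assumption of diagonal covariances):

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ),
    summed over latent dimensions."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return kl


# KL between identical distributions is zero:
print(kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))  # 0.0
```

Minimizing this term pulls the posterior q(z|s^o, s^p) toward the prior p(z|s^o), which is what makes the second (test-time) path usable.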
Discriminator D. We use one LSTM as the discriminator. The goal of the discriminator is to distinguish real paraphrased sentences from generated ones. Given the embeddings of the true paraphrased sentence s^p and the fake generated sentence s̃^p, the discriminator loss is defined as:

L_D = −log D(s^p) − log(1 − D(s̃^p)),   (10)

where s̃^p is generated with z ∼ q(z|s^o).
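Numerically, this is the standard GAN objective applied to whole sentences, paired with the generator's adversarial term −log D(s̃^p) from (9); a minimal sketch with scalar discriminator outputs (illustrative only, not the training code):

```python
import math


def d_loss(d_real, d_fake):
    """Discriminator loss: push D(real) -> 1 and D(fake) -> 0."""
    return -math.log(d_real) - math.log(1.0 - d_fake)


def g_loss(d_fake):
    """Generator's adversarial loss: push D(fake) -> 1."""
    return -math.log(d_fake)


# A confident, correct discriminator has low loss, while the generator's
# adversarial loss on the same fake sample is high:
print(round(d_loss(0.9, 0.1), 4))  # 0.2107
print(round(g_loss(0.1), 4))       # 2.3026
```

Because D scores the entire sentence rather than individual words, its feedback reaches the generator at the sequence level, which is the mechanism the paper relies on to alleviate exposure bias.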

Algorithm 1 Training Pipeline
Initialize: gradients of the corresponding parameters: encoder E parameters g_E = [g_{E1}, g_{E2}]; decoder/generator G parameters g_G = [g_{G1}, g_{G2}]; discriminator D parameters g_D;
1: while G has not converged do
     ⋮
     L_DG = −log(D(s̃^p_2));
     ⋮
13:  Update model parameters using the Adam optimizer with gradients g_{E1}, g_{E2}, g_G, and g_D.
     ⋮
18: end while

Training Techniques
We summarize the training pipeline in Algorithm 1. At each iteration, a pair of original and paraphrased sentences is fed into the two paths for sentence reconstruction. A discriminator is also used to distinguish between real and generated sentences, which helps alleviate the exposure-bias problem.
Policy Gradient. Backpropagating gradients from the discriminative model to the generative model is difficult for sequence generation models. Following Yu et al. (2017), we use the REINFORCE algorithm (Williams, 1992) to approximate gradients with respect to the generator and encoder. In the experiments, we regard the generator G as the stochastic policy and the output of discriminator D as its reward. In this way, we can propagate the gradients from the discriminator to both the generator and the encoder.
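Since sampling discrete tokens is non-differentiable, REINFORCE replaces the true gradient with a reward-weighted log-likelihood gradient; the surrogate loss can be sketched as follows (a simplified scalar sketch with names of our choosing, not the paper's implementation):

```python
import math


def reinforce_surrogate_loss(token_probs, reward):
    """Score-function surrogate: minimizing -reward * sum(log pi(w_t))
    yields the REINFORCE gradient estimate reward * grad log pi."""
    return -reward * sum(math.log(p) for p in token_probs)


# A high discriminator reward scales up the log-likelihood gradient of the
# sampled sentence; a reward of 0 contributes no gradient at all.
loss = reinforce_surrogate_loss([0.5, 0.25], reward=0.8)
print(round(loss, 4))  # 1.6636
```

Here token_probs would be the probabilities the generator assigned to the words it actually sampled, and reward would be D's score for the whole generated sentence.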
Warm-up Training. It is difficult to train a GAN using gradient-based methods (Goodfellow et al., 2014). Previous generative models  often pre-train the generator using a supervised model, such as an auto-encoder. In this paper, we propose a warm-up technique to train our generative model. Following the notation in Algorithm 1, we suppose a warm-up step t_wp. During training, we gradually increase the weight of L_DG to a fixed value λ_3. From step t = 0 to t_wp, λ^t_3 is updated as:

λ^t_3 = (t / t_wp) · λ_3.

When t ≥ t_wp, we fix λ^t_3 = λ_3 as a constant. We also need to balance the losses between (7) and (8). The second loss (8) reflects that during inference, z is generated depending only on s^o. We let λ_1 = 1 − λ^t_2 and increase the value of λ^t_2 gradually until t_wp. Similarly, λ^t_2 is updated from t = 0 to t_wp through:

λ^t_2 = (t / t_wp) · λ_2.

If t ≥ t_wp, we fix λ^t_2 = λ_2.
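Under a linear-ramp reading of the warm-up schedule above (the exact functional form is not fully recoverable from the text, so treat this as an assumption), the weights can be computed as:

```python
def warmup_weight(t, t_wp, lam_max):
    """Linearly ramp a loss weight from 0 to lam_max over t_wp steps,
    then hold it constant."""
    return lam_max * min(t / t_wp, 1.0)


t_wp, lam2, lam3 = 10_000, 0.5, 0.01
assert warmup_weight(0, t_wp, lam3) == 0.0          # adversarial term off at start
assert warmup_weight(t_wp // 2, t_wp, lam2) == 0.25  # halfway through warm-up
assert warmup_weight(2 * t_wp, t_wp, lam3) == 0.01   # held constant afterwards

# lambda_1 is tied to lambda_2^t, shifting weight from path 1 to path 2:
lam1 = 1.0 - warmup_weight(t_wp, t_wp, lam2)  # 0.5 after warm-up
```

The effect is that training starts as plain two-path CVAE reconstruction and only gradually turns into the adversarial game, which stabilizes GAN training.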

Experiments
We assess the performance of our GAP model and compare it with previous methods. We first describe the datasets we use, then present the details of experimental setup, and finally analyze both quantitative and qualitative results.

Datasets
Following previous work (Prakash et al., 2016), we conduct experiments on the same four standard datasets that are used widely for paraphrase generation. Their content is specified below.
MSCOCO. MSCOCO was originally an image-caption dataset, containing over 120K images with five different captions from five different annotators per image. All the annotations for one image describe the most prominent object or action in the image, which makes it suitable for the paraphrase generation task. Specifically, the dataset has two divisions: a training set ("Train2014", over 82K images) and a validation set ("Val2014", over 40K images). We follow the practice of previous papers, randomly choosing four captions out of five as two source-reference pairs and limiting sentence length to 15 words (removing the words beyond the first 15), so that we can compare our results with published work.
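The MSCOCO preprocessing described above might look like the following sketch; for determinism it takes the first four captions, whereas the paper samples four of the five at random (the function name and data are ours):

```python
def make_pairs(captions, max_len=15):
    """Turn four of an image's five captions into two (source, reference)
    pairs, truncating each caption to at most max_len words."""
    trunc = [" ".join(c.split()[:max_len]) for c in captions[:4]]
    return [(trunc[0], trunc[1]), (trunc[2], trunc[3])]


caps = ["a dog runs on the grass",
        "a brown dog is running outside",
        "the dog plays in a field",
        "a dog sprinting across a lawn",
        "an animal moving fast"]
pairs = make_pairs(caps)
print(len(pairs))  # 2
```

Each image thus contributes two paraphrase pairs to the training corpus, with the fifth caption discarded.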
Quora. Quora consists of over 400K lines of potential question pairs; a pair is a duplicate (i.e., a paraphrase pair) if the corresponding position in the line is annotated with 1. Again, we follow the practice of previous work and keep only those pairs annotated with 1. There are 155K such question pairs in total, from which three training subsets of 50K, 100K, and 150K pairs are created, along with a test set of 4K pairs. The goal of using three sub-datasets is to show how the size of the dataset affects the results of paraphrase generation.

Experimental Setup
Encoders E_1, E_2 and generators G_1, G_2 are constructed using 2-layer LSTMs (Hochreiter and Schmidhuber, 1997). The discriminator D is also a 2-layer LSTM. We map each word to a 300-dimensional feature vector initialized with 300-dimensional GloVe vectors (Pennington et al., 2014); therefore, a sentence of length N is represented by a matrix of size N × 300. Before inputting the embedding vectors into the LSTM models, we pre-process them with a two-layer highway network (Srivastava et al., 2015). We set the dimension of the latent variables z to 300. To balance the losses, we set λ_2 = 0.5 and λ_3 = 0.01. The warm-up step is 1 × 10^4 in all experiments. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 1 × 10^−4 and other parameters at their defaults, for all models. Following Sutskever et al. (2014), we clip the gradients of the encoder and generator parameters if their norm exceeds 10, and we clip the gradients of the discriminator if their norm exceeds 5. Greedy search is employed to generate paraphrases in the experiments.
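The gradient clipping step can be sketched as rescaling by the global norm, in the spirit of Sutskever et al. (2014); this is a simplified flat-vector sketch (our own helper, not the actual training code):

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so that their global l2-norm is at most max_norm;
    gradients already within the threshold are returned unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]


g = [30.0, 40.0]                 # global norm 50
clipped = clip_by_norm(g, 10.0)  # rescaled to norm 10
print(clipped)
```

In the experiments this would be applied with max_norm = 10 for the encoder/generator gradients and max_norm = 5 for the discriminator gradients.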

Automatic Evaluation
We follow previous papers in choosing the same three well-known automatic evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and TER (Snover et al., 2006). As studied by Wubben et al. (2010), human judgments on generated paraphrases correlate well with these metrics. We then compare our results on these metrics with the previous baselines below.
Baselines. Our GAP model contains LSTM and VAE components, so we compare it with (i) the basic attentive sequence-to-sequence model from machine translation (seq2seq) (Bahdanau et al., 2014).

Table 2: Test accuracy on the MSCOCO dataset, in percentage. VAE* is our implementation of VAE. "best-B", "best-M", and "best-T" represent the scores with the best BLEU, METEOR, and TER, respectively. "avg" denotes an average over "best-B", "best-M", and "best-T". All other results are cited directly from the respective papers. When multiple model variants exist in one paper, we show only the one with the highest scores.
↑ (for BLEU and METEOR) indicates that a higher score is better; ↓ (for TER) indicates that a lower score is better. The best results are in boldface.
Results and Analysis. The results on MSCOCO are shown in Table 2. We find that our GAP model outperforms the baseline models on all metrics. For example, comparing our best-BLEU scores on the MSCOCO dataset with those of VAE-SVG, we improve the results by about 4 BLEU, 5 METEOR, and 4 TER points, respectively. Table 3 shows the performance on the Quora dataset. We notice that our accuracy increases with the size of the data. Meanwhile, our model is more robust on a relatively smaller dataset such as Quora-50K, leveraging its advantage in learning from less data. For example, we achieve much better results than VAE and its variants on the smaller Quora-50K.
Ablation Study. VAE* in Table 3 improves on the previous VAE-SVG by using our proposed two-path loss. We observe that our VAE* usually outperforms previous VAE-based models, demonstrating that the proposed two-path reconstruction loss can improve the quality of generated paraphrases. Moreover, our proposed method with adversarial training, denoted "Ours", also performs better than VAE*. Therefore, our proposed adversarial training and two-path loss have a clearly positive effect on alleviating the exposure-bias problem and generating diverse but realistic predictions.

Table 3: Test accuracy on the Quora dataset, in percentage. VAE* is our implementation of VAE. "best-B", "best-M", and "best-T" represent the scores with the best BLEU, METEOR, and TER, respectively. "avg" denotes an average over "best-B", "best-M", and "best-T". All other results are cited directly from the respective papers, and "-" means no such result was reported. ↑ (for BLEU and METEOR) indicates that a higher score is better; ↓ (for TER) indicates that a lower score is better. The best results are in boldface.

Human Evaluation
Data Preparation. The accurate evaluation of paraphrases is an open problem. We believe that automatic evaluation is not enough for evaluating paraphrases from a fine-grained perspective, in terms of three attributes: grammatical correctness (plausibility), equivalence to the original sentence (equivalence), and diversity of expression (diversity). For both the MSCOCO and Quora datasets, we randomly sample 100 sentence pairs (original sentence, paraphrased sentence) from the test corpus, and apply the generative architectures, including both the VAE* model and our GAP model, to generate paraphrases. Thus, we obtain three different kinds of sentence pairs: (original sentence, "generated" sentence) by the reference, and (original sentence, generated paraphrase) by VAE* and by our GAP model. To make the analysis fair, we randomly shuffled all of them and then partitioned them into ten buckets.
Process. We set up an Amazon Mechanical Turk experiment in which ten human judges are asked to evaluate the quality of the paraphrases. We hope our judges play a role similar to the discriminator in our model by making a true/fake judgment; it is easier for them to make a binary choice than to score how good the paraphrased sentences are on a wide scale. These ten human judges evaluate the total of 600 sentence pairs. Each pair is judged by two different judges, and the average score is the final judgment. The agreement between judges is moderate (kappa = 0.42). These ten human judges confirmed that they were proficient in English and understood the goals of the annotation process well. They were trained by means of instructions and examples. If a generated sentence contains a grammatical error, does not express the same meaning as the original sentence, or lacks diversity, we asked the annotators to score 0 for the corresponding attribute, and 1 otherwise.
Results and Analysis. We report the results in … lent, and inaccurate expressions. It is believed that the reason for this is related to the input training data: it contains noise caused by the length limitation of ≤ 15 words. Note, however, that even for the reference paraphrases, accuracy cannot reach 100%. Here is an example from the real data: for the original sentence "children are playing soccer on a field with several adults observing nearby", the reference paraphrase is "soccer player hits another player in the face". Considering this, the results show that our GAP generates relatively more plausible and diverse paraphrases than the baseline model.

paraphrased: Is Arnab Goswami quitting from Times now?
generated-VAE: Why did Arnab Goswami quit?
generated-Ours: Why did Arnab Goswami resign from Times Now?
Table 5: Samples of generated paraphrases from MSCOCO and Quora. Note that for the last two examples, our generated result is even better than the ground truth, with our model paraphrasing "immediate" as "direct" and "quit" as "resign".
Case Study. In Table 5, we show examples sampled from both MSCOCO and Quora. We observe that our model generates paraphrases with higher diversity than VAE, with no loss of information.
What's more, we discover some results where our model is even better than the ground truth, shown in the last two examples in Table 5.

Related Work
Generative models and conditional generative models have experienced remarkable progress in the visual domain, such as VAEs (Sohn et al., 2015) and InfoGAN. Recent work (Larsen et al., 2015) also considers combining autoencoders or variational autoencoders with GANs to demonstrate superior performance on image generation. In the area of NLP, generation can be summarized from two perspectives: text generation, with the goal of yielding diverse and plausible sentences, and conditional text generation, aiming to generate new sentences conditioned on an original sentence. Considering the latter, examples include generating a new sentence similar to an original sentence (paraphrase generation), in a different style from the original sentence (style transfer), or dependent on dialogue history (dialogue generation). Attempts at using VAEs (Bowman et al., 2016; Wang et al., 2019), GANs (Yu et al., 2017; Zhang et al., 2016), and both (Hu et al., 2017) have been made to address generic text generation. However, none of them is directly suitable for conditional text generation. A recent work  tries to handle paraphrase generation using VAEs; however, it suffers from challenges common to VAEs. Our method addresses these challenges with the help of GAN. To the best of our knowledge, this is the first work using a GAN with VAEs for the paraphrase generation task.

Conclusions
We propose the first deep conditional generative architecture for generating paraphrases via adversarial training, in the hope of combining the advantage of CVAE in modeling similar distributions with the advantage of GAN in generating plausible sentences. Experimental results on automatic metrics demonstrate the advantages of our model, and human evaluations also verify its effectiveness. In future work, we intend to accelerate the training of our encoders and decoders with the techniques in (Huo et al., 2018) and to apply our architecture and training techniques to other NLP tasks. Overall, we believe that our research makes an important step toward using generative models in NLP, especially for conditional text generation.