Paraphrase Generation via Adversarial Penalizations

Paraphrase generation is an important problem in Natural Language Processing that has recently been addressed with neural network-based approaches. This paper presents an adversarial framework to address the paraphrase generation problem in English. Unlike previous methods, we employ the discriminator output as a penalization instead of using policy gradients, and we propose a global discriminator to avoid the Monte-Carlo search. In addition, this work uses and compares different settings of input representation. We compare our methods to several baselines on the Quora question pairs dataset. The results show that our framework is competitive against previous benchmarks.


Introduction
Paraphrase generation is a task in NLP which aims to transform a given sentence into another with the same meaning. This task is challenging because of the complexity of semantic and syntactic relationships in language. Moreover, the capacity to generate paraphrases automatically enables data augmentation in NLP. However, one of the main problems of paraphrase generation is that the meaning of a sentence can be changed radically by modifying a single word.
Paraphrase generation is an active research area, and recent neural approaches have addressed the problem. We organize previous works on paraphrase generation into two groups: task-support and task-based. The task-support paraphrase generation works create paraphrases to add training data or adversarial examples. Paraphrases have been created to augment data in question answering (Dong et al., 2017; Gan and Ng, 2019). Common techniques include word substitution (Jiao et al., 2019; Xie et al., 2019), syntactic changes to the original sentences (Coulombe, 2018; Iyyer et al., 2018; Sennrich et al., 2016), and back-translation (Xie et al., 2019). Other works apply rule-based generative models to perform multiple changes (Samanta and Mehta, 2017; Li et al., 2017).
On the other hand, task-based paraphrase generation works aim to benchmark their results on specific paired datasets. In this paper, we focus on task-based works. Prakash et al. (2016) use a stacked residual Long Short-Term Memory (LSTM) network that outperforms a vanilla LSTM. Gupta et al. (2018) apply a Variational Autoencoder (VAE) to get better results than the stacked LSTM. Huang et al. (2018) use a Seq2Seq-based model with a dictionary-based attention mechanism, where the dictionary search guides the insertion or deletion of a word. Li et al. (2018b) generate paraphrases using a deep reinforcement learning framework. Ma et al. (2018) propose an attention network using word embedding information. Another work presents an adversarial setup over the latent space using a conditional VAE as a generator. Chen et al. (2019) propose a VAE to make syntactically and semantically changed paraphrases. Another approach proposes a transformer (Vaswani et al., 2017) with multiple encoders to process extra semantic information from the input. The work of Egonmwan and Chali (2019) presents a hybrid model between a transformer and a Recurrent Neural Network (RNN). Li et al. (2019) design a transformer-based model that can generate paraphrases at different levels of granularity.
Most prior works on paraphrase generation learn by conditional Maximum Likelihood Estimation (MLE). However, prior studies highlight the exposure bias problem (Bengio et al., 2015; Ranzato et al., 2015) in paraphrase generation. Some works address the problem by applying REINFORCE (Williams, 1992) to generate text in their adversarial setups (Yu et al., 2017; Fedus et al., 2018; Li et al., 2018a; Liu et al., 2018; de Masson d'Autume et al., 2019). However, similar to prior works (He et al., 2019), we observe that REINFORCE has high variance and is difficult to tune. At the same time, pre-trained Language Models (LMs) (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2019) have outperformed previous works in many NLP tasks. An important reason is that they provide contextual representations of words, which are more specific than static word embeddings (Pennington et al., 2014; Mikolov et al., 2013, 2018). However, static word embeddings consume fewer resources and are faster than pre-trained LMs. Ethayarajh (2019) shows that the static embeddings extracted from pre-trained LMs outperform GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017) on many word vector benchmarks.
In this paper, we propose an adversarial model (generator-discriminator) to address the English paraphrase generation task. Unlike previous approaches, we train our model using a weighted conditional maximum likelihood by a "penalization" score given by the discriminator. Also, we test variations of our setup by changing the input representations and the Monte-Carlo search. Overall, our contributions are as follows.
• We propose the use of penalizations (discriminator outputs) in supervised adversarial setups as an alternative to the REINFORCE algorithm.
• We evaluate the substitution of the Monte-Carlo search by using a discriminator that outputs a score for each word.
• We provide an experimental analysis of the impact of input representations on a paraphrase generation model. Further, we include the use of the first-layer embeddings from pre-trained language models.
• Our experiments show that our setup can generate feasible paraphrases. Furthermore, our results are competitive against prior benchmarks in the Quora question pairs dataset.

Preliminaries
Before presenting our model, we provide some preliminaries about MLE training and the REINFORCE algorithm in sequential problems.

Conditional Maximum Likelihood Estimation
The conditional MLE training for sequence-to-sequence tasks aims to learn the probability distribution of the estimated token ŷ_t constrained by an input sequence X_{1:T} = {x_1, ..., x_T} and a list of previous tokens Ŷ_{1:t−1} = {ŷ_1, ..., ŷ_{t−1}}. We consider the sets of tokens X and Ŷ in a finite vocabulary V. Given a dataset D with pairs of sequences of tokens X_{1:T}, Y_{1:T} of length T (due to truncation and padding), the objective function J_MLE of conditional MLE is

J_MLE = Σ_{t=1}^{T} log G(ŷ_t | Ŷ_{1:t−1}, X),

where G(ŷ_t | Ŷ_{1:t−1}, X) is the neural network that generates the word ŷ_t. Most works use teacher forcing (Williams and Zipser, 1989), feeding the correct tokens Y instead of Ŷ, to improve the results. In that way, the "exposure" of the network to the target words could "bias" the inference process.

REINFORCE
The REINFORCE algorithm (Williams, 1992) in sequence-to-sequence tasks aims to maximize the reward r that a policy (neural network) receives due to the generation of the word ŷ_t. Under similar dataset conditions, the objective function J_R of REINFORCE is

J_R = Σ_{t=1}^{T} r(t) log G(ŷ_t | Ŷ_{1:t−1}, X).
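The teacher-forced MLE objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; `gold_probs` is a hypothetical input standing in for the probabilities the generator assigns to the gold tokens:

```python
import math

def mle_loss(gold_probs):
    """Negative log-likelihood under teacher forcing (a minimal sketch).

    gold_probs[t] is the probability the generator G assigned to the
    target token y_t given the gold prefix Y_{1:t-1} and the condition X
    (teacher forcing feeds the gold prefix, not the model's own samples).
    Minimizing this loss maximizes the J_MLE objective above.
    """
    return -sum(math.log(p) for p in gold_probs)

# Toy example: a 3-token target whose gold tokens received these probabilities.
loss = mle_loss([0.9, 0.5, 0.8])
```

In practice this is the standard cross-entropy loss over the vocabulary distribution at each timestep, summed over the sequence.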
In REINFORCE, the discriminator/evaluator D outputs a single scalar for a whole sentence. D learns using the cross-entropy between the dataset distribution Y and the generated distribution Ŷ. The objective function J_D of the discriminator/evaluator is

J_D = − E_{Y}[log D(X, Y_{1:T})] − E_{Ŷ}[log(1 − D(X, Ŷ_{1:T}))].

So, the reward function for a complete sequence is r(T) = D(X, Ŷ_{1:T}). However, the discriminator is trained only with completed sequences. REINFORCE establishes the use of Monte-Carlo search with roll-out to sample possible complete sequences starting from incomplete ones. We define the Monte-Carlo search MC as

MC_G(Ŷ_{1:t}; N) = {Ŷ^1_{1:T}, ..., Ŷ^N_{1:T}}.

In that way, the reward function for all timesteps is

r(t) = (1/N) Σ_{n=1}^{N} D(X, Ŷ^n_{1:T}) for t < T, and r(T) = D(X, Ŷ_{1:T}).
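The Monte-Carlo roll-out reward can be sketched as follows. The callables `complete` and `discriminator` are hypothetical stand-ins for sampling a continuation with the generator's policy and scoring a complete sequence:

```python
def rollout_reward(prefix, complete, discriminator, T, N):
    """SeqGAN-style Monte-Carlo reward, a sketch with hypothetical callables.

    complete(prefix, T) samples a full length-T continuation of `prefix`
    using the generator's policy; discriminator(seq) scores a *complete*
    sequence in [0, 1]. For t < T we average D over N roll-outs; at t = T
    the sequence is already complete and D scores it directly.
    """
    if len(prefix) == T:
        return discriminator(prefix)
    rollouts = [complete(prefix, T) for _ in range(N)]
    return sum(discriminator(seq) for seq in rollouts) / N

# Deterministic toy stubs: pad with zeros, score by fraction of nonzero tokens.
pad = lambda p, T: p + [0] * (T - len(p))
score = lambda seq: sum(1 for tok in seq if tok != 0) / len(seq)
r = rollout_reward([5, 7], pad, score, T=4, N=3)
```

This loop is what makes REINFORCE slow: each partial sequence needs N full generator roll-outs before a reward is available.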

Methodology
Our adversarial setup is composed of two neural networks, like previous works: a generator and a discriminator. We make some observations about prior approaches to motivate our model. On the one hand, MLE could suffer from exposure bias when switching from training to inference. On the other hand, we observe that the REINFORCE algorithm fluctuates heavily during training because it relies only on the discriminator. Furthermore, it is slow to train due to the Monte-Carlo search. We propose a weighted conditional maximum likelihood objective to train our generator. Consequently, we take advantage of the conditional MLE principle and can also guide the training using a penalization function. The penalization function is similar to the reward in REINFORCE. The conditional MLE part lets us reduce the variance of training. In addition, we replace the Monte-Carlo search with a discriminator that outputs a score for each token.

Model Architecture
We first define two sequences of tokens X_{1:T} = {x_1, ..., x_T} and Y_{1:T} = {y_1, ..., y_T} of length T that represent a paraphrase pair. Let G_θ and D_φ be a θ-parameterized generator and a φ-parameterized discriminator.
Given X, we train G_θ to produce a sequence of tokens Ŷ_{1:T} = (ŷ_1, ..., ŷ_T) that is similar to Y. Given X, we train D_φ to distinguish between Y and Ŷ.
In the following sections, we also refer to X, Y, and Ŷ as the condition sentence, target sentence, and generated sentence, respectively.

Generator (G θ )
Our generator is a Convolutional Sequence to Sequence (ConvS2S) model (Gehring et al., 2017). We choose this architecture over a Seq2Seq (Sutskever et al., 2014) or a Transformer (Vaswani et al., 2017) because the ConvS2S needs fewer parameters to achieve similar results. That lets us train our framework with large batch sizes to reduce the generator variance (de Masson d'Autume et al., 2019). Furthermore, the model performs convolutions in parallel, which speeds up training and allows us to conduct more experiments. Figure 1 shows the overall architecture of G_θ. We feed the G_θ encoder with the condition sentence embeddings (input embeddings) and add position embeddings on the encoder side. The first encoder layer is fully connected. Then, each following layer iteratively performs (a) one-dimensional convolutions without padding and (b) a gated linear unit over the previous layer's result. We also add residual connections between non-adjacent layers.
The decoder has convolutional and attention layers interleaved. However, the decoder performs temporal convolutions to avoid leakage of future information, which we achieve by padding the input vector on the left side. There are no position embeddings on the decoder side because they had a negative impact on generation. At training time, we feed the decoder with the target sentence to perform convolutions in parallel. At inference time, the decoder is sequential: first, we feed it the initial token; then, we repeat two steps until G_θ generates all words: the decoder performs a forward pass and outputs a new token, and we concatenate the new token to the previous input to feed the decoder again. Finally, the resulting vector passes through two dense layers to output the vocabulary probabilities.
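The sequential inference loop above can be sketched as follows. `next_token` is a hypothetical stand-in for one decoder forward pass followed by an argmax over the vocabulary; the token ids are illustrative:

```python
def greedy_decode(next_token, bos_id, eos_id, max_len):
    """Sequential greedy decoding loop, a minimal sketch.

    next_token takes the tokens generated so far and returns the id of
    the most probable next token. We start from the initial (BOS) token,
    feed each new token back into the decoder, and stop at EOS or after
    max_len steps.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        nxt = next_token(tokens)   # forward pass -> argmax over vocabulary
        tokens.append(nxt)         # concatenate to the decoder input
        if nxt == eos_id:
            break
    return tokens[1:]              # drop the initial BOS token

# Toy stub that emits a fixed sequence ending in EOS (id 2).
canned = iter([11, 12, 2])
out = greedy_decode(lambda toks: next(canned), bos_id=1, eos_id=2, max_len=10)
```

At training time this loop is unnecessary because the whole target sentence is available and the convolutions run in parallel.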

Discriminator (D φ )
The architecture of the discriminator is similar to that of the generator. We change the last fully connected layer of the D_φ decoder to output a single number, so D_φ outputs one score per token. Then, we pass the result through a sigmoid layer that outputs the probability that each token belongs to the fake category.
We feed the encoder with the condition sentence, and the decoder with either the generated or target sentence.
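The per-token scoring head can be sketched in PyTorch. This is a minimal illustration, not the paper's exact code; `decoder_states` is a random stand-in for the D_φ decoder output:

```python
import torch
import torch.nn as nn

# Per-token scoring head: a linear layer maps each decoder hidden vector
# to a single logit, and a sigmoid turns it into the probability that the
# corresponding token is fake.
hidden_dim = 8
score_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

# Stand-in for the decoder output: one hidden vector per token.
decoder_states = torch.randn(2, 5, hidden_dim)          # (batch, T, hidden)
token_scores = score_head(decoder_states).squeeze(-1)   # (batch, T), in [0, 1]
```

Replacing the single sentence-level scalar with one score per token is what makes the Monte-Carlo search unnecessary: every timestep already has its own signal.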

Training
The training process of D φ and G θ are different. Figure 2 shows the overall training procedure.
We first generate paraphrases for all condition sentences. We build a mixed set of sentence pairs using the condition-target (real) pairs and condition-generated (fake) pairs. In that way, we feed the discriminator with a pair of sentences. D_φ outputs a score for each generated/target word, so each pair of sentences is classified as a set of zeros or ones in the real or fake case, respectively. D_φ learns using the following per-token binary cross-entropy:

J_D(φ) = − Σ_{t=1}^{T} [ log D_φ(X, Ŷ)_t + log(1 − D_φ(X, Y)_t) ]   (6)

We train G_θ using a unified learning objective: we multiply the negative log-likelihood loss of each word by the result of our penalization function,

J_G(θ) = Σ_{t=1}^{T} P_{G_θ,D_φ}(t) · (− log G_θ(y_t | Y_{1:t−1}, X))   (7)
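The discriminator's per-token objective can be sketched as follows. This reflects our reading of the setup (real tokens labeled 0, generated tokens labeled 1); the inputs are hypothetical score lists:

```python
import math

def discriminator_loss(real_scores, fake_scores):
    """Per-token binary cross-entropy for the discriminator, a sketch.

    real_scores[t] / fake_scores[t]: D's probability that token t of the
    target / generated sentence is fake. Real tokens carry label 0 and
    generated tokens carry label 1, so the loss pushes real scores toward
    0 and fake scores toward 1.
    """
    real_term = -sum(math.log(1.0 - s) for s in real_scores)
    fake_term = -sum(math.log(s) for s in fake_scores)
    return real_term + fake_term
```

In PyTorch this would typically be `nn.BCELoss` applied to the per-token sigmoid outputs with 0/1 target vectors.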
We calculate the G_θ log-likelihood loss using the real sentence as decoder input. Nevertheless, we estimate the penalization function P_{G_θ,D_φ}(t) from the decoding inference result. Thus, we increase the loss value of tokens that tend to yield infeasible paraphrases.
The penalization function

P_{G_θ,D_φ}(t) = k · D_φ(X, Ŷ)_t

is the discriminator output multiplied by a constant k. The discriminator outputs scores in the interval [0, 1] according to the probability that a token is classified as fake. That is, tokens classified as fake receive higher penalizations.
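The penalized generator loss can be sketched as follows. This is our reading of the training objective, not the paper's exact code; both probability lists are hypothetical inputs:

```python
import math

def penalized_nll(gold_probs, fake_probs, k=2.0):
    """Weighted conditional MLE, a minimal sketch.

    gold_probs[t]: probability G assigned to the target token y_t under
    teacher forcing. fake_probs[t]: the discriminator's probability that
    the token *inferred* at position t is fake. Each per-token NLL is
    scaled by the penalization P(t) = k * D(t), so tokens that tend to
    yield infeasible paraphrases contribute a larger loss.
    """
    return sum(k * d * -math.log(p) for p, d in zip(gold_probs, fake_probs))
```

Because the base loss is still a likelihood term over the gold tokens, training stays lower-variance than pure policy-gradient updates while the discriminator steers it.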
We avoid the Monte-Carlo search using our discriminator score per word. We update D φ at each training round to improve the quality of our generated sentences.
Algorithm 1 presents the overall procedure to train our model. As a first step, we pre-train G_θ using conditional maximum likelihood with the condition and target samples. We also pre-train D_φ with supervised learning, using pairs composed of condition-real or condition-generated sentences. Then, we start the adversarial training phase for several rounds. First, we sample and calculate P_{G_θ,D_φ} to train G_θ using Equation 7. After updating the parameters, we output one generated sample per condition sentence using G_θ. That results in a balanced set of fake and real pairs to feed D_φ. Finally, we train D_φ with Equation 6.
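The control flow of the procedure can be sketched with hypothetical callables (the names are ours, not the paper's):

```python
def adversarial_training(pretrain_g, pretrain_d, g_step, sample_fakes,
                         d_step, rounds):
    """High-level sketch of the training procedure.

    pretrain_g / pretrain_d: supervised pre-training routines.
    g_step: one generator update with the penalized MLE objective.
    sample_fakes: generate one sample per condition sentence.
    d_step: one discriminator update on balanced real/fake pairs.
    """
    pretrain_g()
    pretrain_d()
    for _ in range(rounds):
        g_step()                 # weighted MLE with penalization P
        fakes = sample_fakes()   # refresh fakes so D tracks the current G
        d_step(fakes)

# Toy counting stubs, just to show the control flow.
calls = {"pg": 0, "pd": 0, "g": 0, "s": 0, "d": 0}
adversarial_training(
    pretrain_g=lambda: calls.__setitem__("pg", calls["pg"] + 1),
    pretrain_d=lambda: calls.__setitem__("pd", calls["pd"] + 1),
    g_step=lambda: calls.__setitem__("g", calls["g"] + 1),
    sample_fakes=lambda: calls.__setitem__("s", calls["s"] + 1) or [],
    d_step=lambda fakes: calls.__setitem__("d", calls["d"] + 1),
    rounds=3,
)
```

Updating D in every round keeps the penalization function aligned with the current generator distribution.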

Experiments
In this section, we evaluate our model and compare it with prior methods. We describe the dataset used, experimental setup, baseline methods, and results of our experiments.

Dataset
To test our model, we used the Quora question pairs dataset. This dataset contains paired questions associated with a label: pairs are labeled 1 if they express the same idea and 0 otherwise. For this work, we decided to use only the duplicated pairs, of which there are around 155K. We observed that some questions appear in more than one pair.
We built three sets, Quora I, Quora II, and Quora III, to evaluate our framework. In Quora I, we randomly selected 133K pairs: 100K for training, 30K for testing, and 3K for validation. In Quora III, we sampled 83K pairs: 50K for training, 30K for testing, and 3K for validation. In Quora II, we sampled 30K pairs for testing whose questions do not appear in the training (50K) or validation (3K) sets, which makes Quora II the most challenging set. It is worth noting that there are some overlaps of input questions between the training and testing sets in Quora I and III. Table 1 shows the sizes of our sets.

Experimental setup
G_θ and D_φ are ConvS2S models with 5 layers on both the encoder and decoder sides. The value of k in the penalization function is 2. For all models, we use the negative log-likelihood as the loss function. We adopted the Adam optimizer (Kingma and Ba, 2014) to pre-train the generator and to train the discriminator, with a learning rate of 1e-4 for G_θ and 1e-6 for D_φ, without modifying the default betas. Model parameters were initialized with uniformly distributed values, as described in He et al. (2015). We pre-trained the generator and discriminator for 20 epochs. In the adversarial training phase, we changed the learning rate to 2e-4 and performed 40 rounds of adversarial training. In the Monte-Carlo and REINFORCE baselines, a roll-out of size four was used. The batch size for both the generator and the discriminator is 250. All samples were generated using greedy decoding. We tuned our hyperparameters manually. We ran our experiments on a PC with an Intel 9900K CPU and an Nvidia Titan RTX GPU, for two hours on average per model. We implemented the framework in PyTorch 1.3 (Paszke et al., 2019).

Input representations
The input representation of a deep learning model is important because it is the model's only information about the task. We tested the influence of the input representation on our results by changing the source of the embeddings. In our experiments, we consider three recent methods: the byte-level BPE embeddings extracted from OpenAI GPT-2 (Radford et al., 2019) with 50257 tokens, the pre-trained WordPiece embeddings from BERT (Devlin et al., 2019) with 30522 tokens, and the 1 million FastText embeddings trained on 16 billion tokens (Mikolov et al., 2018), from which we take the 100000 most common tokens in their vocabulary. We used the ConvS2S architecture in all cases. As preprocessing, we only lowercase the input and truncate it to 30 tokens. Table 3 shows the BLEU-2 scores on the testing sets. As can be seen in Table 3, the ConvS2S architecture that uses the byte-level BPE embeddings surpasses the BERT embeddings by 11.13 BLEU points on average, and the FastText embeddings by 14.68 points. Also, the generated texts in Table 2 correlate with the BLEU scores. The results confirm the superiority of embeddings extracted from contextual pre-trained models over traditional static embeddings. Overall, the presented results indicate that the byte-level BPE embeddings from GPT-2 are the most suitable input representation for our framework.
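Reusing a pre-trained LM's first-layer embeddings as a frozen lookup table can be sketched as follows. The weight matrix here is a random stand-in; in practice it would come from the GPT-2 or BERT input embeddings (in HuggingFace Transformers, `model.get_input_embeddings().weight`):

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained input-embedding matrix (vocab x dimension).
vocab_size, emb_dim = 100, 16
pretrained_matrix = torch.randn(vocab_size, emb_dim)

# Build a frozen embedding layer from the pre-trained weights, so the
# downstream model trains without updating the table.
embedding = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)

token_ids = torch.tensor([[3, 7, 42]])
vectors = embedding(token_ids)   # shape (1, 3, emb_dim)
```

Whether to freeze or fine-tune the table is a design choice; freezing keeps the comparison between embedding sources clean.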

Automatic evaluation
We used several automatic metrics to evaluate our framework and compare it with other methods: BLEU (Papineni et al., 2002), which evaluates similarities between n-grams; ROUGE (Lin, 2004), a common metric in text summarization; METEOR (Denkowski and Lavie, 2014), which considers synonyms; and iBLEU (Sun and Zhou, 2012), which penalizes similarities with the source sentence (parroting), as in Li et al. (2018b) and Egonmwan and Chali (2019). Quora I and II are analogous to the sets proposed by Li et al. (2018b), and the Quora I set is similar to the set of Li et al. (2019). Tables 4, 5, and 6 show the results for the Quora I, II, and III corpora, respectively; the reported numbers are the scores on the testing sets. Conv-Adv-MC refers to the method with the Monte-Carlo discriminator and Conv-Adv-S to our per-token discriminator. The best results for each metric are in bold.
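The iBLEU metric combines two BLEU scores. A minimal sketch, where `bleu` is any sentence-level BLEU callable and alpha = 0.9 is a commonly used weight (not necessarily the exact value used in this paper):

```python
def ibleu(bleu, output, reference, source, alpha=0.9):
    """iBLEU (Sun and Zhou, 2012), a minimal sketch.

    Rewards similarity to the reference while penalizing similarity to
    the source sentence (parroting): a paraphrase that merely copies the
    source gets a low score even if its BLEU against the reference is high.
    """
    return alpha * bleu(output, reference) - (1 - alpha) * bleu(output, source)

# Toy stub BLEU (unigram overlap ratio), just to exercise the formula.
toy_bleu = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()))
score = ibleu(toy_bleu, "who will win", "who will win", "who wins", alpha=0.9)
```

With a real implementation one would plug in a proper sentence-level BLEU (e.g., from NLTK or SacreBLEU) in place of the toy stub.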
The automatic evaluation results show that our models are competitive against the state-of-the-art baselines. Moreover, Conv-Adv-S has slightly higher scores than previous methods on Quora I and III. However, RBM-SL is still the best method for generating paraphrases that are completely different from the training set (Quora II). The scores on BLEU and ROUGE indicate that our model produces paraphrases that are more similar to the targets. Also, the iBLEU scores suggest that our model generates more diverse paraphrases than some of the prior works. However, the METEOR scores indicate that some of the previous baselines use more synonyms when generating paraphrases than our model does.

Human evaluation
We perform a human evaluation of the generated paraphrases because we believe that automatic evaluation is not always accurate. We randomly selected 120 condition-target questions, plus the outputs of two methods, from the Quora I testing set, and distributed them to 4 evaluators using a form. We asked the evaluators to rate three main aspects on a scale from 1 to 5:
• Relevance: Whether the paraphrase has the same meaning as the original sentence and does not lose information.
• Fluency: Whether the paraphrase has correct grammar and use of vocabulary.
• Diversity: Whether the paraphrase varies syntactically and semantically.

From the results, we analyze each evaluated aspect. The three systems have equivalent Relevance scores. We infer that this is due to some reference examples losing extra information in the paraphrase, and to the random sampling. Overall, the scores indicate that our model can produce paraphrases that are highly related to the original input questions. Although the ConvS2S model has better fluency than our method, the difference is not statistically significant (paired t-test, p-value ~0.21), and both are near the reference examples. We believe that the process of correcting some specific words could cause disorder when decoding some sentences. Besides, we assume that another decoding algorithm could benefit the generation of fluent paraphrases. Our model produces more diverse sentences than the ConvS2S does; however, the difference is not statistically significant (p-value ~0.19). Also, the reference sentences are still more diverse than the outputs of our model. We believe that a penalization function for similarity with the source sentence could help to improve the results.

[Example paraphrases from the accompanying table: "who is going to win, trump or hillary?" → "who will win the election, trump or clinton?" / "who will win, trump or clinton?"; "is it possible to advertise on quora?" → "is promotion allowed on quora?" / "can we advertise our business on quora?"]
Overall, the human evaluation shows that our model produces relevant and fluent paraphrases, which indicates that it works properly. Although we improved the diversity over the ConvS2S, it still does not match the variety of the original distribution.
As a case study, Table 8 presents some of the sentences that our model and the ConvS2S generated on the Quora I test set. We color the cells according to our assessment of the quality of each paraphrase. Red represents a bad paraphrase (a repetition of the source question, a nonsensical sentence, or an unrelated question). Yellow represents a paraphrase with some missing information. Green represents a correct paraphrase.

Discussion
The results of the input-representation experiments show that the embeddings of pre-trained models are better input representations than those provided by classic word-embedding algorithms. The results concur with the study of Ethayarajh (2019). Furthermore, we found that the input representation has a high impact on the framework's results (a difference of up to 14.68 BLEU points). The results also indicate that GPT-2 has more robust first-layer embeddings than BERT. We infer that BERT relies more on the contextual similarities computed in its higher layers than GPT-2 does, as suggested in previous studies (Ethayarajh, 2019; Hoover et al., 2019).
The comparison with previous baselines indicates that our framework achieves competitive results against state-of-the-art methods on all automatic metrics. Further, we improved some benchmarks on BLEU, ROUGE, and METEOR. The results indicate that the weighted adversarial loss is a suitable alternative to REINFORCE, and that it provides generation diversity to our ConvS2S implementation (an improvement of 0.12 in the human evaluation). The BLEU and ROUGE scores indicate that our model's outputs are similar to the target questions. However, they also show a lack of diversity, as the METEOR score suggests. Similar to Banerjee and Lavie (2005), the results from the automatic evaluation concur with the human evaluation, especially on diversity.

Conclusions
We propose an adversarial setup to address the paraphrase generation task. The automatic evaluation results show some improvements over previous baselines. The human evaluation suggests a tradeoff between the fluency and diversity of the paraphrases generated by the fine-tuned model. We conclude that our setup is a suitable alternative to REINFORCE and MLE training. In addition, the case study shows that our method helps to improve the quality of paraphrases in general.