A Learning-Exploring Method to Generate Diverse Paraphrases with Multi-Objective Deep Reinforcement Learning

Paraphrase generation (PG) is of great importance to many downstream tasks in natural language processing. Diversity is essential to PG for enhancing the generalization capability and robustness of downstream applications. Recently, neural sequence-to-sequence (Seq2Seq) models have shown promising results in PG. However, traditional model training for PG focuses on optimizing model predictions against a single reference with a cross-entropy loss, an objective that cannot encourage the model to generate diverse paraphrases. In this work, we present a novel multi-objective learning approach to PG. We propose a learning-exploring method that generates sentences as learning objectives from the learned data distribution, and employ reinforcement learning to combine these new learning objectives for model training. We first design a sample-based algorithm to explore diverse sentences. Then we introduce several reward functions to evaluate the sampled sentences as learning signals in terms of expressive diversity and semantic fidelity, aiming to generate diverse and high-quality paraphrases. To effectively optimize model performance across these evaluation aspects, we use a GradNorm-based algorithm that automatically balances the training objectives. Experiments and analyses on the Quora and Twitter datasets demonstrate that our proposed method not only gains a significant increase in diversity but also improves generation quality over several state-of-the-art baselines.


Introduction
Paraphrase generation (PG) creates different expressions that share the same meaning (e.g., "how far is Earth from Sun" and "what is the distance between Sun and Earth"). It is a crucial technology in many downstream natural language processing (NLP) applications such as question answering (Dong et al., 2017), machine translation, and text summarization (Zhao et al., 2018).
Diversity is an essential characteristic of human language, as the meaning of a text can often be expressed in multiple different ways. A good paraphrase generation system is often required to satisfy two desired properties (Xu et al., 2018b). The first is diversity: capturing a wide range of linguistic variations. The second is fidelity: preserving semantic meaning while paraphrasing. Therefore, we aim to generate diverse paraphrases while ensuring the same meaning, which is important for enhancing the generalization capability and robustness of downstream applications (Iyyer et al., 2018). Table 1 gives some examples; they express the same meaning but with different degrees of diversity.
Most recent state-of-the-art approaches to PG (e.g., Gupta et al., 2018) employ neural sequence-to-sequence (Seq2Seq) models, which mainly use one given reference for model learning, while the nature of paraphrasing indicates that one sentence can be paraphrased into several different sentences. Meanwhile, these methods usually adopt the cross-entropy loss, which requires a strict pairwise matching at the word level between the predicted sentence and the ground-truth sentence. This lacks flexibility and may penalize the generation model for a diverse paraphrase even if the sentence retains the meaning. For example, given the model prediction "I watched a movie last night." and the reference "I saw a film last night.", the cross-entropy loss cannot properly reward the model for generating a diverse paraphrase, even though only a few tokens differ at the word level. In recent years, there has also been growing interest in generating lexically and syntactically diverse paraphrases (Gupta et al., 2018; Xu et al., 2018b; Xu et al., 2018a; Park et al., 2019; Qian et al., 2019; Kajiwara, 2019). For Seq2Seq models, techniques for generating diverse paraphrases mainly fall into two categories: i) applying decoding methods such as beam search or multiple decoders; ii) introducing random noise as model input. Park et al. (2019) use multi-round decoding to diversify generation by conditioning on previously generated sentences. Qian et al. (2019) use multiple generators to produce a variety of different paraphrases. Gupta et al. (2018) employ a variational auto-encoder framework to produce multiple paraphrases according to different noise inputs. Although these methods can improve paraphrase generation with different decoding strategies or noise inputs, their training still lacks the ability to directly explore diverse paraphrases as learning objectives.
In order to address these problems, in this work we propose a novel learning-exploring method to generate paraphrases with multi-objective deep reinforcement learning. Our method makes it possible for every objective to focus on a different aspect of the generated paraphrases, breaking the restriction of learning with only one target sentence given by the data in supervised learning. Concretely, we first train a paraphrase generation model with the cross-entropy loss. Then we design a sample-based exploring algorithm to generate multiple candidate paraphrases from the learned data distribution, and use the explored sentences to train the generation model with deep reinforcement learning. The model can therefore be trained effectively in a learning-exploring fashion to find more diverse paraphrases.
In particular, we use the variation between the generated paraphrase and the original input as the learning signal for diversity. To ensure expressive diversity and semantic fidelity simultaneously, we design two rewards, one for each aspect. One reward judges whether an explored sentence has a diverse expression; the other examines whether it conveys the same meaning as the original input. The higher the confidence that a sentence explored by the sampling algorithm is a good paraphrase, the more it contributes to model training. We also use a reward computed against the reference to train the model. Finally, we combine these rewards, evaluated from different aspects, as guiding signals to train the generation model via reinforcement learning. Furthermore, in order to effectively combine rewards that optimize different aspects of a generated paraphrase, we additionally use a GradNorm-based algorithm (Chen et al., 2017) that automatically balances the training objectives.
In summary, our contributions are as follows:
• We propose a novel learning-exploring framework to learn to generate diverse paraphrases with multi-objective deep reinforcement learning.
• In order to enable the model to learn to generate diverse paraphrases, we propose to equip the model with several vital components: (1) sample-based exploring algorithm to generate diverse candidate paraphrases; (2) multiple reward functions for evaluating sampled sentences to ensure expressive diversity and semantic fidelity simultaneously; (3) GradNorm-based algorithm that automatically balances training objectives for effective learning.
• We conduct experiments on two standard benchmark datasets, Quora and Twitter. Empirically, the results show that our new learning method with reinforcement learning performs significantly better than several state-of-the-art baselines in terms of diversity and quality.

Figure 1: The proposed learning-exploring paraphrase generation framework with multi-objective deep reinforcement learning.

The Proposed Model
In this section, we elaborate our proposed model, including its essential components and their working mechanisms.

Problem and Framework
Given an input sentence X = [x_1, x_2, ..., x_S] with length S, we aim to generate an output paraphrase Y = [y_1, y_2, ..., y_T] with length T that has the same meaning as X but a different expression. We denote a sentence pair in paraphrasing as (X, Y). We use Y_{1:t} to denote the subsequence of Y ranging from 1 to t, and use Ŷ to denote the sequence generated by the model. Our proposed model contains three main components: the paraphrase generator, the sample-based exploring algorithm, and reinforcement learning with explored paraphrases. Figure 1 gives an overview of our framework. Basically, the generator produces paraphrases of a given sentence, and the evaluator measures the quality of explored paraphrases in terms of expressive diversity and semantic fidelity.

Paraphrase Generator
We frame paraphrase generation as a sequence-to-sequence (Seq2Seq) problem. We adopt the encoder-decoder framework (See et al., 2017), both components of which are implemented as recurrent neural networks (RNN). All RNNs use LSTM cells (Hochreiter and Schmidhuber, 1997). Given an input sentence X, the goal is to learn a model that can generate a sentence Ŷ = p_θ(X) as its paraphrase. Traditionally the parameters θ are learned by maximizing the likelihood of the predicted sentence: the model estimates the conditional probability p(Y|X) by directly mapping the input sentence X to its target paraphrase Y. The learning objective is to minimize the cross-entropy loss:

L_ce(θ) = − Σ_{t=1}^{T} log p(y_t | Y_{1:t−1}, X)    (1)

We choose the pointer-generator (See et al., 2017) as our paraphrase generator. In the pointer-generator, the decoder either generates words from a vocabulary or copies words from the input sentence, which alleviates the out-of-vocabulary problem (e.g., named entities) and improves performance. The probability of copying words can be represented as:

e_{ti} = f(s_{t−1}, h_i),   a_{ti} = exp(e_{ti}) / Σ_j exp(e_{tj}),   p_copy(y_t = x_i) = a_{ti}    (2)

where s_{t−1} is the decoder state at time step t−1 and h_i is the encoder hidden state at time step i. a_{ti} represents the attention weight at time step t, and p_copy is given directly by a_{ti}. f is the attention function, implemented as a feed-forward neural network.
The probability of generating the next word from the vocabulary is given by:

o_t = g(s_t, y_{t−1}, c_t),   p_vocab(y_t | Y_{1:t−1}, X) = softmax(o_t)    (3)

where y_{t−1} is the word generated at the last step, c_t is the context vector computed by the attention mechanism at time step t, and g is a linear function. The overall probability of the next word is:

p(y_t | Y_{1:t−1}, X) = p_gen · p_vocab(y_t | Y_{1:t−1}, X) + (1 − p_gen) · p_copy(y_t)    (4)

where p_gen = m(s_t, y_{t−1}, c_t) and m is a binary neural classifier with sigmoid activation output. p_gen acts as a gate that controls whether the next word is generated from the vocabulary or copied from the input sentence. Specifically, the model predicts the next word at each decoding time step by sampling from the probability distribution ŷ_t ∼ p(y_t | Y_{1:t−1}, X). We will use the sampled sentences to train our model.
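As a concrete illustration, the mixture of the generation and copy distributions in Equation 4 can be sketched in a few lines of numpy. The function name, argument shapes, and example values below are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mix_distributions(p_vocab, attention, src_token_ids, p_gen):
    """Combine generation and copy distributions for one decoding step.

    p_vocab:       (vocab_size,) softmax over the output vocabulary
    attention:     (src_len,) attention weights a_ti (the copy distribution)
    src_token_ids: (src_len,) vocabulary ids of the input tokens
    p_gen:         scalar gate in (0, 1) from the binary classifier m
    """
    p_final = p_gen * p_vocab
    # Scatter-add the copy mass onto the vocabulary ids of the source tokens;
    # repeated source tokens accumulate their attention weights.
    np.add.at(p_final, src_token_ids, (1.0 - p_gen) * attention)
    return p_final
```

Because both input distributions sum to one, the mixed distribution is again a valid probability distribution over the vocabulary.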

Sample-Based Exploring Algorithm
Our learning-exploring method uses explored sentences for model training, so it is critical to set up an appropriate sampling algorithm to acquire diverse paraphrases. To sample more diverse paraphrases, we adopt the Gumbel-Softmax technique (Jang et al., 2016), which injects noise to adjust the word distribution, enabling us to sample diverse sentences from the model's approximation of the data distribution.
We modify the probability distribution in Equation 3 by shaping it with Gumbel noise. The Gumbel noise, treated as a form of regularization, is added to o_t in Equation 3 before the softmax is applied. The word distribution of y_t is thus approximated by:

p_vocab(y_t | Y_{1:t−1}, X) = softmax((o_t + η) / τ),   η = −log(−log u)    (5)

where η is the Gumbel noise calculated from a uniform random variable u ∼ U(0, 1) and τ is the temperature. When τ → 0, sampling from the vocabulary approaches the argmax operation; when τ → ∞, the samples gradually approach the uniform distribution. Increasing the temperature increases the use of infrequent words (Holtzman et al., 2019), which has the implicit effect of strengthening the tail of the distribution and encourages the model to explore more diverse generations. Finally, according to p_vocab(y_t | Y_{1:t−1}, X), we apply multinomial sampling (Chatterjee and Cancedda, 2010) to generate a sentence Ŷ for computing rewards, producing words one by one by sampling from the model's output distribution. Sampling terminates the expansion of a candidate sentence when an end-of-sentence (<EOS>) token is met.
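The sampling step of Equation 5 can be sketched as follows. This is a minimal numpy illustration under our own assumptions (a single decoding step with raw logits o_t as input); the paper's actual implementation operates on batched decoder outputs:

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=0.3, rng=None):
    """Sample a token id from softmax((o_t + eta) / tau).

    eta = -log(-log u) with u ~ Uniform(0, 1) is standard Gumbel noise.
    Lower tau sharpens the distribution toward the argmax; higher tau
    flattens it toward uniform, encouraging more diverse samples.
    """
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    eta = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    scores = (logits + eta) / tau        # temperature-scaled perturbed logits
    scores = scores - scores.max()       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    # Multinomial sampling over the perturbed distribution.
    return int(rng.choice(len(logits), p=probs))
```

Calling this once per decoding step until an <EOS> id is drawn yields one explored candidate sentence.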

Reinforcement Learning with Explored Paraphrase
We adopt reinforcement learning (RL) (Sutton and Barto, 1998) to train our paraphrase generator using the sampled sentences. The paraphrase generator can be viewed as an "agent" that interacts with an external "environment" (the original input or the reference). The parameters of the agent define a policy, i.e., a conditional probability p(y_t | Y_{1:t−1}, X). The agent picks an action, i.e., the prediction of the next candidate word, according to the policy. When generating the <EOS> token, the agent observes a terminal reward for the generated sentence Ŷ, denoted R(Ŷ, Y_i), where Y_i is a comparison sentence. The goal of RL training is to minimize the negative expected reward:

L_rl(θ) = − E_{Ŷ^s ∼ p_θ}[ R(Ŷ^s, Y_i) ]    (6)

where Ŷ^s = (ŷ^s_1, ŷ^s_2, ..., ŷ^s_T) is a sampled sentence and ŷ^s_t is the word sampled from the model at time step t. The space of all candidate paraphrase sentences is exponentially large due to the large vocabulary size, making it impossible to optimize L_rl exactly. In practice, we adopt a REINFORCE-based policy gradient approach (Williams, 1992) with a single sample from p_θ.
To reduce the variance of the policy gradient method, a typical technique is to subtract a baseline value from the original reward. We use the self-critical algorithm (Rennie et al., 2017), in which the baseline is the reward of the sentence generated at inference time. Finally, the expected gradient of the non-differentiable reward function based on REINFORCE can be computed as follows:

∇_θ L_rl(θ) ≈ −( R(Ŷ^s) − R(Ŷ^b) ) ∇_θ log p_θ(Ŷ^s)    (7)

where p_θ(Ŷ^s) = Π_{t=1}^{T} p(ŷ^s_t | Ŷ^s_{1:t−1}, X) is the probability of generating sentence Ŷ^s given X. Ŷ^s is sampled from the output probability distribution, ŷ^s_t ∼ p(y_t | Ŷ^s_{1:t−1}, X), and Ŷ^b is the baseline output produced by the test-time inference algorithm, greedy search. Minimizing L_rl(θ) is thus equivalent to maximizing the conditional likelihood of the sampled sentence Ŷ^s whenever Ŷ^s outperforms Ŷ^b, giving a positive learning signal. As a result, the expected reward increases as the generator raises the generation probability of better sentences while lowering that of worse ones.
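The self-critical surrogate loss corresponding to Equation 7 can be sketched in pure Python. This is a simplified, single-sample illustration under our own naming, not the paper's code; in practice the token log-probabilities come from the decoder and gradients flow through them:

```python
def self_critical_loss(token_log_probs, sample_reward, baseline_reward):
    """Surrogate loss whose gradient matches the self-critical REINFORCE
    estimator: -(R(Y^s) - R(Y^b)) * sum_t log p(y^s_t | ...).

    token_log_probs: log-probabilities of the sampled sentence's tokens
    sample_reward:   reward R(Y^s) of the multinomially sampled sentence
    baseline_reward: reward R(Y^b) of the greedy-decoded baseline sentence
    """
    advantage = sample_reward - baseline_reward
    # Positive advantage raises the likelihood of the sampled sentence;
    # negative advantage suppresses it.
    return -advantage * sum(token_log_probs)
```

A sampled sentence that beats the greedy baseline (positive advantage) thus receives a loss that decreases as its likelihood increases.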

Multi-Objective Learning
This section explains how we train the generator with multi-objective deep reinforcement learning. Our training algorithm allows the model to explore the space of possible paraphrases, enabling it to generate more diverse paraphrases and also enhancing performance. Multi-objective reinforcement learning thus helps paraphrase generation in two ways: (a) it directly optimizes the evaluation metrics instead of maximizing the likelihood of the ground-truth reference, and (b) it gives our model the ability to explore unseen paraphrases beyond a single reference.

Rewards for Multi-Objective Learning
ROUGE Reward with Reference The first basic reward is based on the primary evaluation metric of the ROUGE package (Lin, 2004). We compare a sampled sentence Ŷ^s_1 with the ground-truth reference Y_ref using the ROUGE score (namely ROUGE-ref), and take the score as a reward. The loss function is given by:

L_ref(θ) = − E_{Ŷ^s_1 ∼ p_θ}[ ROUGE(Ŷ^s_1, Y_ref) ]    (8)

Similar to previous work (Li et al., 2018), we find that the ROUGE-ref reward works better than using only the cross-entropy loss. This reward can be regarded as a sentence-level learning signal, which overcomes the strict token-level matching of the cross-entropy loss at training time.
On the other hand, as pointed out in Kajiwara (2019), paraphrasing rewrites only a limited portion of the original input, and the reference often includes words that occur in the original input; thus a sentence with a higher ROUGE-ref score may have low diversity (Miao et al., 2019). In other words, a reference-based reward does not focus on the variation between the sampled sentence and the original input. To address these issues, we next introduce two new reward functions.
ROUGE Reward with Input To obtain more diverse paraphrases, we introduce a second reward function. We compare a sampled sentence Ŷ^s_2 with the original input Y_ori by computing the ROUGE score (namely ROUGE-ori), which focuses on the word variations between the sampled sentence and the original input. We use the negative ROUGE score as the reward: the lower the word overlap, the better the variation. The score reflects the model's ability to produce diverse paraphrases. The loss function is given by:

L_ori(θ) = E_{Ŷ^s_2 ∼ p_θ}[ ROUGE(Ŷ^s_2, Y_ori) ]    (9)

Semantic Similarity Reward with Input The explored paraphrases may be semantically distant from the original input, which may hurt fidelity, i.e., preserving semantic meaning while paraphrasing. We further introduce a semantic similarity reward to ensure the semantic accuracy of explored sentences. We adopt an embedding-based method (Sharma et al., 2017) for computing the semantic similarity score, which introduces no extra learnable parameters. We use Greedy Matching (GM) to compute the semantic similarity score between a sampled sentence Ŷ^s_3 and the original input Y_ori:

G(C, r) = (1/|C|) Σ_{w ∈ C} max_{ŵ ∈ r} cos_sim(e_w, e_ŵ)    (10)

Each word in the candidate sentence C is greedily matched to a word in the reference sentence r based on the cosine similarity of their embeddings, and the score averages these similarities over the number of words in the candidate sentence. We then take the semantic similarity score compared with the original input (namely SEM-ori) as a reward, and the loss function is given by:

L_sem(θ) = − E_{Ŷ^s_3 ∼ p_θ}[ GM(Ŷ^s_3, Y_ori) ]    (11)
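The greedy matching score G(C, r) can be sketched directly from its definition. The snippet below is an illustrative numpy version operating on pre-computed word embeddings (the embedding lookup itself is assumed, not shown):

```python
import numpy as np

def greedy_matching(cand_embs, ref_embs):
    """G(C, r): each candidate word embedding is matched to its most
    cosine-similar reference word embedding; the score averages these
    maxima over the candidate's words."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean([max(cos(w, r) for r in ref_embs) for w in cand_embs]))
```

A candidate whose every word has a near-synonym in the input scores close to 1, so using this score as a reward pulls explored sentences back toward the input's meaning even when their surface form differs.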

Multi-Objective Optimization
Our objective function combines the maximum-likelihood cross-entropy loss (L_ce) with the rewards from policy-gradient reinforcement learning to jointly optimize our model. The overall learning objective is to minimize the following combined loss:

L(θ) = α_ce L_ce + α_ref L_ref + α_ori L_ori + α_sem L_sem    (12)

where the α_i are the weights used to combine these losses.
Optimizing multiple objectives simultaneously is important for final performance: one objective can easily dominate the learning of a shared model, leaving the other objectives ineffective. Previous work (Wu et al., 2018) chooses fixed weights α based on manual experience for RL training. In contrast, we use the adaptive method GradNorm (Chen et al., 2017), in which each α_i varies at each training step t: α_i = α_i(t). The GradNorm algorithm controls gradient magnitudes by tuning the weights of the multi-objective loss function. To optimize the weights α_i(t) for gradient balancing, following Chen et al. (2017), we penalize the network when back-propagated gradients from any objective are too large or too small. If objective i is training relatively quickly, its weight α_i(t) should decrease relative to the other weights, allowing the other objectives more influence on training.
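One GradNorm update step can be sketched as follows. This is a heavily simplified numpy illustration following the spirit of Chen et al. (2017): the gradient norms and loss ratios are assumed to be measured elsewhere, the sub-gradient uses the fact that the weighted gradient norm scales linearly with α_i, and the step size and asymmetry hyperparameter β are our own illustrative values:

```python
import numpy as np

def gradnorm_step(alphas, grad_norms, loss_ratios, lr=0.025, beta=1.5):
    """Adjust per-objective weights so their gradient norms move toward a
    common, training-rate-adjusted target.

    alphas:      current weights alpha_i(t)
    grad_norms:  ||grad of alpha_i * L_i w.r.t. the shared weights||
    loss_ratios: L_i(t) / L_i(0), a proxy for each objective's inverse
                 training rate (smaller = training faster)
    """
    alphas = np.asarray(alphas, dtype=float)
    grad_norms = np.asarray(grad_norms, dtype=float)
    loss_ratios = np.asarray(loss_ratios, dtype=float)
    mean_norm = grad_norms.mean()
    # Relative inverse training rate: r_i > 1 means objective i is lagging.
    r = loss_ratios / loss_ratios.mean()
    target = mean_norm * r ** beta          # desired gradient norm per objective
    # Sub-gradient of sum_i |G_i - target_i| w.r.t. alpha_i, using G_i ∝ alpha_i
    # and treating the target as a constant.
    grad_alpha = np.sign(grad_norms - target) * grad_norms / alphas
    alphas = alphas - lr * grad_alpha
    # Renormalize so the weights always sum to the number of objectives.
    return alphas * len(alphas) / alphas.sum()
```

An objective whose gradients are oversized relative to the target (i.e., one that is dominating training) has its weight reduced, which matches the balancing behaviour described above.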

Experiment
In this section, we describe the datasets, experimental setup, evaluation metrics, and the results of our experiments.

Datasets
We conducted experiments on two standard datasets Quora and Twitter to evaluate the proposed model.
Quora Dataset This dataset is a paired paraphrase dataset in the question domain, consisting of 150K paraphrase pairs. Following previous work (Li et al., 2018; Qian et al., 2019), we used 100K pairs for training, 30K pairs as the test set, and 4K pairs as the validation set.
Twitter Dataset This dataset is the Twitter URL paraphrasing corpus (Lan et al., 2017), which contains two subsets: one labelled by human annotators and the other labelled automatically by an algorithm. Following previous work (Li et al., 2018; Qian et al., 2019), we sampled 5K pairs as the test set and 1K pairs as the validation set from the labelled subset, and used the remaining 110K pairs as the training set.

Model Configuration
We used the following experimental settings for our model. Following Li et al. (2018), we maintained a fixed-size vocabulary of 5K words shared between input and output, and truncated all sentences longer than 20 words. For the paraphrase generator, we used a two-layer LSTM for both encoder and decoder. The hidden dimension of the encoder and decoder was set to 256, and the word embedding dimension was 128. We adopted the ROUGE-1 score for computing rewards.
In training, we used the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9 and β_2 = 0.98. We pre-trained the model for 10 epochs with the cross-entropy loss at a learning rate of 1e-3 and then started RL training, during which the learning rate was decreased to 1e-4. The mini-batch size was fixed at 64. We set τ = 0.3 for Gumbel-softmax sampling. For GradNorm-based (Chen et al., 2017) multi-objective optimization, the initial weight was 1 for all losses, and we used the parameters in the linear layer before the softmax to automatically adjust weight gradients. At test time, we used beam search with width 5 on all our models to generate the final predictions.

Automatic Evaluation Metrics
Following previous work on paraphrase generation, we adopted the well-known automatic evaluation metrics BLEU (B) (Papineni et al., 2002), ROUGE (R) (Lin, 2004), and METEOR (MET) (Lavie and Agarwal, 2007) to compute lexical similarity with the reference. Previous studies have shown that these metrics perform well in evaluating generated paraphrases. These n-gram-matching metrics may assign low scores to predictions with high lexical and syntactic variation, but such predictions are not necessarily of poor quality (Chen and Dolan, 2011; Wang et al., 2019). We therefore also used Embedding Similarity (Sharma et al., 2017) to evaluate generated paraphrases. This metric measures the semantic similarity between the reference and the prediction based on the cosine similarity of their embeddings at the word and sentence level. Following previous work (Park et al., 2019; Egonmwan and Chali, 2019), we used average, extreme, and greedy (A/E/G) embedding similarities.
Besides, we hope to generate more diverse paraphrases while preserving meaning. Previous work (Miao et al., 2019) has shown that comparing only against the reference is insufficient, because simply copying the input sentence yields the highest BLEU-ref score. To evaluate the variation of generated paraphrases, following Miao et al. (2019), we used the BLEU-ori (B-ori) metric computed against the original input sentence, in which lower n-gram overlap indicates better variation and diversity.
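The quantity at the core of BLEU-ori is a modified n-gram precision of the prediction against the input rather than the reference. The snippet below is a simplified single-order illustration (full BLEU combines several orders with a brevity penalty), using only the Python standard library:

```python
from collections import Counter

def ngram_precision(candidate, source, n=2):
    """Clipped n-gram precision of `candidate` against a single sentence,
    as in BLEU's modified precision. Computed against the *input*, a lower
    value means less overlap and hence greater lexical variation.

    candidate, source: lists of tokens
    """
    cand = list(zip(*[candidate[i:] for i in range(n)]))
    if not cand:
        return 0.0
    src = Counter(zip(*[source[i:] for i in range(n)]))
    # Clip each candidate n-gram's count by its count in the source.
    clipped = sum(min(c, src[g]) for g, c in Counter(cand).items())
    return clipped / len(cand)
```

Copying the input verbatim scores 1.0 under this measure, which is exactly why a low score against the input signals a more diverse paraphrase.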

Baselines
We compared our model with several state-of-the-art models in the paraphrase generation field.
• RbM-SL and RbM-IRL (Li et al., 2018): This is a generator-evaluator framework with the matching-based semantic evaluator trained by reinforcement learning.
• DEPD (Qian et al., 2019): This uses multiple generators trained by reinforcement learning to generate a variety of different paraphrases.
• DPNG: This is a Transformer-based model that can learn and generate paraphrases of a sentence at different levels of granularity (word or phrase) in a disentangled way.

Results
Baseline Cross-Entropy Model Results Our paraphrase generation model uses an attention mechanism (Seq2SeqAtt) and a pointer-generator network (PNet). To better observe model behaviour, we first trained two baselines with cross-entropy optimization. As shown, the model with the pointer-generator network effectively improves performance on all metrics related to the reference. It is also natural for PNet to obtain higher scores than Seq2SeqAtt on the BLEU-ori-1 metric, as the copy mechanism directly copies words from the input to resolve the UNK and entity-word generation problems.

Multi-Objective Model Results
Results of the automatic evaluation on Quora are shown in Table 2. First, using ROUGE-ref as the reward in the proposed learning-exploring (LE) framework (LE-ROUGE-ref) improves performance on metrics related to the reference, compared to the cross-entropy baseline. When using ROUGE-input as the reward, the model obtains a higher diversity score but lower BLEU and ROUGE scores against the reference. A main reason is that every test case has only one reference sentence, which makes it difficult for word-matching-based evaluation metrics to measure the real quality of diverse paraphrases. However, the model obtains comparable METEOR and embedding similarity scores; since both consider synonym matching, they measure generation quality more accurately. These results indicate that using ROUGE-input as the reward in the learning-exploring framework can indeed improve the diversity of paraphrase generation.
When using the semantic similarity reward (LE-SEM-ori), the model performs better than the baseline PNet on evaluation metrics related to the reference, but obtains a lower diversity score, since this learning objective does not consider variation during training. These results show that the semantic similarity reward is effective for ensuring semantic fidelity, i.e., preserving meaning while paraphrasing.
Finally, when we combine all the learning objectives, our model obtains better performance on Quora than several state-of-the-art baselines in terms of diversity and quality. It not only obtains better BLEU, ROUGE, and METEOR scores on the traditional evaluation metrics, but also performs better on semantic similarity and diversity. These results demonstrate the effectiveness of our proposed learning-exploring method with multi-objective deep reinforcement learning for Quora paraphrase generation.

Table 3 shows the scores on the Twitter dataset. Our model achieves better ROUGE and METEOR scores than all baselines. The results also show that the different learning objectives are able to focus on different aspects of the generated paraphrases. These results on Twitter further demonstrate that our proposed multi-objective learning can improve paraphrase generation in a learning-exploring fashion.

Table 4 displays several examples generated by the pointer-generator and by our model. The proposed model produces fairly good samples in terms of both closeness in meaning and diversity in expression, because the model is encouraged to output better paraphrases during the learning phase.

Case Study and Discussion
In these examples, although the pointer-generator indeed generates different sentences, it yields little diversity. Compared to the sentences generated by the pointer-generator, those produced by our model show obvious variations from the original input. In the first example, our model preserves the meaning better. In the second example, the pointer-generator generates a similar sentence yet noticeably changes the meaning of the input. In the third example, the sentence generated by the pointer-generator shows minor variations, while that generated by our model presents richer expressions.

Related Work
Neural paraphrase generation is often formalized as a sequence-to-sequence (Seq2Seq) learning problem. One line of work employs a stacked residual LSTM network in the Seq2Seq model to enlarge model capacity, while another incorporates the attention mechanism to generate paraphrases. Egonmwan and Chali (2019) integrate the Transformer model (Vaswani et al., 2017) and the GRU recurrent neural network to learn long-range dependencies in the input sequence. Li et al. (2018) propose a generator-evaluator architecture that reinforces the paraphrase generator with a reward function. Other work supposes that a sentence-level paraphrase can be decomposed into word/phrase-level paraphrases and learns to generate paraphrases at different levels of granularity.
More recent work also focuses on generating diverse paraphrases, which is important for improving the generalization capability and robustness of downstream applications. Gupta et al. (2018) use a variational autoencoder framework to generate diverse paraphrases by introducing random noise as input. Iyyer et al. (2018) harness syntactic-tree template information for controllable paraphrase generation. Other work uses sentences as exemplars to graft their syntactic style onto generated paraphrases. Qian et al. (2019) use multiple generators trained by reinforcement learning to generate diverse paraphrases.
Similar to these works, we also adopt a Seq2Seq model for paraphrase generation. However, in significant contrast to them, our work extends the Seq2Seq model to use explored paraphrases for model training via deep reinforcement learning. We further introduce evaluation metrics in terms of expressive diversity and semantic similarity for model learning. Finally, our model can effectively generate paraphrases by exploring unseen paraphrases beyond a single reference in a learning-exploring fashion.

Conclusion
In this work, we have presented a novel approach to paraphrase generation in a learning-exploring fashion via multi-objective reinforcement learning. We designed a sample-based exploring algorithm to acquire diverse paraphrases for model training, and used reinforcement learning with expressive diversity and semantic similarity rewards. Experiments and analyses on both the Quora and Twitter datasets show that the proposed method effectively learns to generate high-quality paraphrases and achieves better performance than several strong baselines. These results show that we can improve paraphrase generation by using explored sentences, breaking the restriction of a single reference in supervised learning.