Paraphrase Generation by Learning How to Edit from Samples

Neural sequence-to-sequence text generation has proven to be a viable approach to paraphrase generation. Despite promising results, paraphrases generated by these models mostly suffer from a lack of quality and diversity. To address these problems, we propose a novel retrieval-based method for paraphrase generation. Our model first retrieves a paraphrase pair similar to the input sentence from a pre-defined index. With its novel editor module, the model then paraphrases the input sequence by editing it using the extracted relations between the retrieved pair of sentences. In order to have fine-grained control over the editing process, our model uses the newly introduced concept of Micro Edit Vectors. It both extracts and exploits these vectors using the attention mechanism in the Transformer architecture. Experimental results show the superiority of our paraphrase generation method in terms of both automatic metrics and human evaluation of the relevance, grammaticality, and diversity of the generated paraphrases.


Introduction
Paraphrases are texts conveying the same meaning while using different words (Bhagat and Hovy, 2013). Paraphrase generation is an important task in Natural Language Processing (NLP) that has many applications in other downstream tasks, such as text summarization, question answering, semantic parsing, and information retrieval (Cao et al., 2017; Fader et al., 2014; Berant and Liang, 2014).
Early works on paraphrasing mostly investigated rule-based or statistical machine translation approaches to this task (Bannard and Callison-Burch, 2005). With the recent advances of the neural sequence-to-sequence (Seq2Seq) framework in different NLP tasks, especially in machine translation, an increasing amount of literature has also applied Seq2Seq models to the task of paraphrase generation (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2018).
Figure 1: [...] (Edit Provider), and applies these edits to the input sequence x to generate its paraphrase (Edit Performer).
Although the proposed Seq2Seq methods for paraphrase generation have shown promising results, they are not yet as dominant as their counterparts used in neural machine translation. The main reason is that the available training data for paraphrasing is scarce and domain-specific. In fact, the necessity to generate sequences from scratch, which is a major drawback of traditional Seq2Seq models, magnifies itself when dealing with scarce training data. Thus, one can expect that the model would not be trained well and, consequently, would not be able to generate diverse outputs.
Although retrieval-based text generation has recently been evaluated as a remedy for this problem (e.g., Wu et al., 2019), to the best of our knowledge, there is no previous study exploring the usage of this approach in paraphrase generation. Moreover, none of the existing works in the realm of retrieval-based text generation, such as Wu et al. (2019), focuses on learning how to extract edits from the retrieved sentences. Indeed, Wu et al. (2019), among others, compute a single edit vector heuristically by concatenating the weighted sum of the inserted word embeddings and the weighted sum of the deleted word embeddings. Moreover, Hashimoto et al. (2018) focus only on improving the retrieval stage and use a standard Seq2Seq model to edit the retrieved sentence.
In this paper, we present an effective retrieval-based approach to paraphrase generation by proposing a novel editor module. Our method can be summarized as follows: Given an input sentence x, the model first retrieves a similar sentence p and its associated paraphrase q from the training data. Then, given x and (p, q), the editor learns both how to extract the fine-grained relations between p and q as a set of edits, and when and how to use these extracted edits to paraphrase x. By incorporating the retrieved pairs into the editing process, we equip our model with a non-parametric memory, which enables it to produce non-generic and more diverse outputs. Both the retriever and editor components of our method are modeled by deep neural networks. We employ the Transformer architecture (Vaswani et al., 2017) as the backbone of our model, and use its attention mechanism as an effective tool to apply edits in a selective manner.
Our main contributions are:
• We propose the Fine-grained Sample-based Editing Transformer (FSET) model. It contains a novel editor that can be used in a retrieval-based framework for paraphrase generation. This editor learns how to discover the relationship between a pair of paraphrase sentences as a set of edits, and transforms the input sentence according to these edits. It is worth noting that the set of edits is learned in an end-to-end manner, as opposed to prior work such as Wu et al. (2019), which computes the edit vector heuristically.
• For the first time, we utilize the Transformer as an efficient fully-attentional architecture for the task of retrieval-based text generation.
• Experimentally, we compare our method with recent paraphrase generation methods, and also with recently introduced retrieval-based text generation methods. Both the quantitative and the qualitative results show the superiority of our model.

Related Work

Neural paraphrase generation
Prakash et al. (2016) were the first to adopt a neural approach to paraphrase generation, using a residual stacked LSTM network. Gupta et al. (2018) combined a variational auto-encoder with a Seq2Seq LSTM model to generate multiple paraphrases for a given sentence. Li et al. (2018) proposed a model in which a generator is first trained on the paraphrasing dataset and is then fine-tuned using reinforcement learning techniques. Cao et al. (2017) utilized separate decoders for copying and rewriting as the two main writing modes in paraphrasing. Mallinson et al. (2017) addressed paraphrasing with bilingual pivoting on multiple languages in order to better capture different aspects of the source sentence. Iyyer et al. (2018) proposed a method to generate syntactically controlled paraphrases and use them as adversarial examples. Chen et al. (2019) addressed the same problem, but with the syntax controlled by a sentence exemplar. Kajiwara (2019) proposed a model that first identifies a set of words to be paraphrased, and then generates the output by using a pre-trained paraphrase generation model. Another work proposed a Transformer-based model that utilizes structured semantic knowledge to improve the quality of paraphrases. Kumar et al. (2019) modified the beam search algorithm with a sub-modular objective function to make the generated set of paraphrases syntactically diverse. A further work decomposed paraphrasing into sentential and phrasal levels and employed separate Transformer-based models for each of these levels. Fu et al. (2019) decomposed paraphrasing into two steps, content planning and surface realization, and improved the interpretability of the first step by incorporating a latent bag-of-words model.

Retrieval-based text generation
Retrieval-based text generation has received much attention in the last few years. Song et al. (2016) and Wu et al. (2019) augmented Seq2Seq generation-based models with retrieval frameworks to make the dialog responses more meaningful and non-generic. Gu et al. (2017) utilized a search engine to retrieve a set of source-translation pairs from the training corpus, both at train and test time, and used them as a guide to translate an input query. Another work proposed the neural editor model for unconditional text generation, which produces a new sentence by editing a retrieved prototype using an edit vector. A further work proposed a task-specific retriever using the variational framework to generate complex structured outputs, such as Python code. This work, however, does not introduce any novelty in the editor's architecture and uses a standard Seq2Seq model with attention and a copy mechanism.

Proposed Approach
Let $D = \{(x_n, y_n)\}_{n=1}^{N}$ denote a dataset where $x_n$ is a sequence of words and $y_n$ is its target paraphrase. In the paraphrasing task, our goal is to find the set of model parameters that maximizes $\prod_{n=1}^{N} p_{\text{model}}(y_n \mid x_n)$. Figure 1 illustrates the overview of our proposed model, which is composed of a Retriever and an Editor. Given an input sequence x, the retriever first finds a paraphrase pair (p, q) from the training corpus based on the similarity of x and p. Then, the editor utilizes the retrieved pair (p, q) to paraphrase x. We discuss the details in the following subsections.

Retriever
The goal of the retriever module is to select the paraphrase pairs (from the training corpus) that are similar to the input sequence x. To do so, the retriever finds a neighborhood set N(x) consisting of the K most similar source sentences $\{p_k\}_{k=1}^{K}$ to x and their associated paraphrases $\{q_k\}_{k=1}^{K}$ (K is a hyper-parameter of the model). To measure the similarity of sentences, we first embed them using the pre-trained Transformer-based sentence encoder proposed by Cer et al. (2018). The similarity is then calculated using the cosine similarity measure in the resulting embedding space. We refer to this retriever as the General Retriever throughout the paper. Note that using a pre-trained retriever helps to alleviate the scarcity of the training data available for paraphrasing.¹
¹ The pre-trained model is available at https://tfhub.dev/google/universal-sentence-encoder-large/3
In order to search efficiently for the sentences similar to an input sequence, we use the FAISS software package (Johnson et al., 2019) to create a fast search index from the sentences in the training corpus. We also pre-compute the neighborhood set of each source sentence in the training set, so at training time our model just needs to sample one of the pairs in the neighborhood set uniformly and feed it as an input to the editor module. The probability of retrieving a pair can thus be stated as

$$p((p, q) \mid x) = \frac{1}{K}\, \mathbb{1}\big[(p, q) \in \mathcal{N}(x)\big]. \quad (1)$$

Note that the same procedure also holds at test time: the retriever computes N(x), and the model can sample any one of the pairs in N(x) to generate the output based on that pair.
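The retrieval step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the function names (`cosine`, `neighborhood`, `sample_pair`) and the toy two-dimensional embeddings are hypothetical, and an exhaustive sort stands in for the FAISS index and the pre-trained sentence encoder used in the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighborhood(x_vec, corpus, k):
    """Return the k pairs (p, q) whose source embedding is closest to x_vec.

    corpus is a list of (p_vec, p, q) triples, where p_vec embeds p.
    """
    ranked = sorted(corpus, key=lambda item: cosine(x_vec, item[0]), reverse=True)
    return [(p, q) for _, p, q in ranked[:k]]


def sample_pair(neighbors, rng=random):
    """Uniform sampling: every pair in N(x) is drawn with probability 1/K."""
    return rng.choice(neighbors)


# Toy embeddings (hypothetical values, not produced by a real encoder).
corpus = [
    ([1.0, 0.0], "p1", "q1"),
    ([0.0, 1.0], "p2", "q2"),
    ([0.9, 0.1], "p3", "q3"),
]
N_x = neighborhood([1.0, 0.05], corpus, k=2)
pair = sample_pair(N_x, random.Random(0))
```

Because the neighborhood sets depend only on the fixed encoder, they can be computed once for the whole training set and reused at every epoch, which is the pre-computation described in the text.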

Editor
To edit a sentence according to a retrieved pair, we propose an editor module consisting of two components: 1) Edit Provider and 2) Edit Performer. The Edit Provider computes a set of edit vectors based on the retrieved pair of sentences (p, q). After that, the Edit Performer rephrases the input sequence x by utilizing this prepared set of edits.

Edit Provider
This part of the editor extracts the edits from the retrieved pair as a set of vectors which we call Micro Edit Vectors (MEVs). MEVs are responsible for encoding the information about fine-grained edits that transform p into q. Each one of the MEVs represents the most plausible soft alignment between a token in p and the semantically relevant parts in q:

$$M = [m_1, \ldots, m_l],$$

where l is the length of p.
Figure 2: The general scheme of computing a MEV corresponding to a token of p. (Example sentences in the figure: "how can one overcome procrastination ?" and "how should i avoid procrastination ?"; steps shown: "Find the most similar in target", "Compute edit".)

Figure 2 presents, in schematic form, the procedure of computing one MEV. For each arbitrary token of p, such as $p_i$, we intend to compute a MEV that encodes the edit corresponding to $p_i$ using attention over q. Then, given $p_i$ as the source of the edit, and the attention's result as the target, we concatenate their representations and feed it as the input to a neural network, which calculates $m_i$ as the corresponding edit vector. To make this process differentiable and parallelizable, we use a fully-attentional architecture consisting of two main sub-modules: 1) Edit Encoder and 2) Target Encoder. Figure 3 shows the overview of the Edit Provider.
In this model, a context-aware representation $R_q = [r_q^1, \ldots, r_q^k]$ of the sequence q is first computed using the Target Encoder, which is the encoder sub-graph of the Transformer architecture (Vaswani et al., 2017). The Edit Encoder is also the encoder of the Transformer model, but with an extra multi-head attention over $R_q$. For each token $p_i$, this module outputs a vector that encodes the parts of q that are most semantically relevant to $p_i$. After that, the MEVs $m_i$ are computed by feeding these vectors one by one into a single dense layer (with the tanh activation function). By setting the output dimension of the dense layer to be smaller than the dimension of the word embeddings, we introduce a bottleneck, which hinders the Edit Encoder from copying q directly. Finally, all of the MEVs are aggregated into a single vector by leveraging a technique inspired by Devlin et al. (2019): we prepend a special token [AGR] to p in order to encode all the edits into a single vector $z_{p \to q}$. The intuition behind encoding into a single vector $z_{p \to q}$ is to allow the model to learn a global edit that can be applied to the whole sentence, in addition to the MEVs as local edits. We run the Edit Provider with the same parameters in the reverse direction, i.e. from q to p, to compute $R_p$ and $z_{q \to p}$. The final edit vector z is then computed as

$$z = \mathrm{Linear}(z_{p \to q}, z_{q \to p}),$$

where Linear denotes a dense layer without activation and bias.
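To make the computation of a single MEV more concrete, the following is a minimal sketch of the scheme from Figure 2, written in plain Python. It is a simplification under stated assumptions: a single attention head without learned projections replaces the multi-head attention inside the Transformer encoder, and the names `attend`, `micro_edit_vector`, `W`, and `b` as well as all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def micro_edit_vector(p_i, R_q, W, b):
    """m_i = tanh(W [p_i ; attention(p_i, R_q)] + b)."""
    target = attend(p_i, R_q, R_q)   # soft alignment of p_i with q
    concat = p_i + target            # source of the edit, then its target
    return [
        math.tanh(sum(w * c for w, c in zip(row, concat)) + bias)
        for row, bias in zip(W, b)
    ]


# Toy setup: token dimension 2, MEV dimension 1 (the bottleneck).
p_i = [1.0, 0.0]
R_q = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5, -0.5, 0.5, -0.5]]  # hypothetical weights of the dense layer
b = [0.0]
m_i = micro_edit_vector(p_i, R_q, W, b)
```

Choosing an output dimension smaller than the token dimension, as in the toy `W` above, mirrors the bottleneck that discourages the module from simply copying q.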

Edit Performer
The Edit Performer transforms the input sequence $x = [x_1, \ldots, x_s]$ into the final output $\hat{y}$ using the edit vectors. We employ a fully-attentional Seq2Seq architecture composed of an encoder and a decoder for this part of the model. The encoder of the Edit Performer has exactly the same architecture as the original encoder of the Transformer model and outputs a context-aware representation $R_x$ of the input sequence.
For the decoder, we use a slightly modified version of the original Transformer's decoder. Indeed, the Transformer learns to model p(y|x), while we would like to model the conditional setting p(y|x, (p, q)). Moreover, as mentioned in the description of the Edit Provider, the relation between p and q is encoded in the MEVs M and the vector z. Therefore, in order to edit x, instead of using (p, q) directly, we only need M and z to specify the edits, and the sentence p to identify the locations in x to which the edits should be applied. Thus, we aim to model p(y|x, p, M, z) with the Edit Performer. Figure 4 depicts the architecture of the Edit Performer.
To condition the generation process on the edit vector z, we append it to each token of the decoder's input. To apply the edits in a fine-grained manner, we would like the model to attend to the most similar token of p and select the corresponding edit in the MEVs M to be applied to the input sentence. Therefore, in addition to the input sequence representation $R_x$, the model also attends to the MEVs M using an extra multi-head attention sub-layer, which computes the representation $\mathrm{MultiHead}(h, R_p, M)$ with h as the query, $R_p$ as the keys, and M as the values, where h comes from the previous sub-layer and $R_p$ is the context-aware representation of the retrieved sequence p, which is calculated by the Edit Provider. Hence, this sub-layer allows the model to apply edits only when the current context matches somewhere in p. Finally, we project h (after applying the residual connection and the layer normalization) using a fully-connected sub-layer and feed it to the next layer.
For the last layer, a softmax activation is employed to predict the next token of the output.
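The extra attention sub-layer over the MEVs can be illustrated with a small sketch. It is a simplification under stated assumptions: one attention head without learned projections, with the decoder state h as the query, the representations of p as keys, and the MEVs as values; the function name `edit_attention` and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def edit_attention(h, R_p, M):
    """Query h attends over keys R_p; the values are the MEVs in M.

    Returns the mixed edit vector and the attention weights over p.
    """
    scale = math.sqrt(len(h))
    weights = softmax([sum(a * b for a, b in zip(h, r)) / scale for r in R_p])
    dim = len(M[0])
    mixed = [sum(w * m[j] for w, m in zip(weights, M)) for j in range(dim)]
    return mixed, weights


# Three tokens in p, each with a one-dimensional toy MEV.
R_p = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
M = [[1.0], [2.0], [3.0]]
h = [0.0, 4.0]  # decoder state that matches the second token of p
mixed, weights = edit_attention(h, R_p, M)
```

When h aligns with one token of p, nearly all attention mass falls on that token and the output is close to its MEV; when h matches nothing in p, the weights flatten out and no single edit dominates, which is the selective behavior the text describes.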

Training
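During the training phase, our aim is to maximize the log-likelihood objective

$$\mathcal{L} = \sum_{(x, y) \in D} \log p(y \mid x). \quad (2)$$

As we decompose the training procedure into the two stages of retrieving and editing, we can rewrite p(y|x) as

$$p(y \mid x) = \sum_{(p, q) \in D} p(y \mid x, (p, q))\, p((p, q) \mid x). \quad (3)$$

Substituting Eq. 1 into Eq. 3 and then inserting the resulting p(y|x) into Eq. 2 yields the following formulation for the log likelihood:

$$\mathcal{L} = \sum_{(x, y) \in D} \log \Big( \frac{1}{K} \sum_{(p, q) \in \mathcal{N}(x)} p(y \mid x, (p, q)) \Big).$$

We train our model by maximizing the following lower bound of the log likelihood (obtained by Jensen's inequality):

$$\mathcal{L} \geq \frac{1}{K} \sum_{(x, y) \in D} \sum_{(p, q) \in \mathcal{N}(x)} \log p(y \mid x, (p, q)).$$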
Note that $p(y \mid x, (p, q)) = p_\theta(y \mid x, p, m_\phi(p, q), z_\phi(p, q))$, where θ denotes the parameters of the Edit Performer and φ denotes the parameters of the Edit Provider. Thus, we solve the following optimization problem:

$$\max_{\theta, \phi}\; \frac{1}{K} \sum_{(x, y) \in D} \sum_{(p, q) \in \mathcal{N}(x)} \log p_\theta(y \mid x, p, m_\phi(p, q), z_\phi(p, q)).$$

Except for the retriever, which is a pre-trained component of our model, the other components are fully coupled and trained together. To prevent the model from ignoring the information coming from the retrieval pathway during training (i.e. ignoring the edit vectors extracted from the retrieved pair), we use a simple yet effective trick: we manually add extra (x, y) pairs to N(x), in proportion to the number of retrieved pairs K, so that the presence of y as the exact ground-truth paraphrase encourages the model to use the retrieved pairs more. Please refer to A.1 for further details.
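The Jensen step behind the training objective can be checked numerically for a single training example. The sketch below compares the log of the averaged likelihood with the average of the log likelihoods; the function names and the probability values are hypothetical and only serve to illustrate the inequality.

```python
import math


def log_marginal(probs):
    """log( (1/K) * sum_k p(y | x, (p_k, q_k)) ) for one training example."""
    return math.log(sum(probs) / len(probs))


def lower_bound(probs):
    """(1/K) * sum_k log p(y | x, (p_k, q_k)), the Jensen lower bound."""
    return sum(math.log(p) for p in probs) / len(probs)


# Hypothetical likelihoods of y under K = 3 different retrieved pairs.
probs = [0.5, 0.1, 0.02]
exact = log_marginal(probs)
bound = lower_bound(probs)
gap = exact - bound  # non-negative because log is concave
```

For K = 1, the setting used in the main experiments, the two quantities coincide, so the bound is tight in that case.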

Experiments
In this section, we empirically evaluate the performance of our proposed method in the task of paraphrase generation, and compare it with various other methods, including previous state-of-the-art paraphrasing models.

Datasets
We conduct experiments on two of the most frequently used datasets for paraphrase generation: the Quora question pair dataset and the Twitter URL paraphrasing corpus. For the Quora dataset, we only consider the paraphrase pairs. Similar to Li et al. (2018), we sample 100k, 30k, and 3k instances for the train, test, and validation sets, respectively. The Twitter URL paraphrasing dataset consists of two subsets: one is labeled by human annotators, and the other is labeled automatically, which makes it noisier than the Quora dataset. Similar to Li et al. (2018), we sample 110k instances from the automatically labeled part as our training set, and two non-overlapping subsets of 5k and 1k instances from the human-annotated part for the test and validation sets, respectively. As in Li et al. (2018), we truncate sentences in both datasets to 20 tokens.

Baselines
We compare our method with existing paraphrasing methods that are not retrieval-based, and also with existing or newly created retrieval-based text generation methods, which we adapt for paraphrasing:
• Non-retrieval paraphrasing methods:
- Residual LSTM (Prakash et al., 2016), which is the first Seq2Seq model proposed for paraphrase generation,
- RbM (Li et al., 2018), which fine-tunes a paraphrase generation model using reinforcement learning,
- Transformer (Vaswani et al., 2017), which is a Seq2Seq model relying entirely on the attention mechanism,
- DNPG.
• Retrieval-based models: We compare our method with one existing retrieval-based text generation model and two other combined methods that we create ourselves:
- Seq2Seq+Ret, which is an extended version of the Seq2Seq Residual LSTM. This model conditions the generation process at each time step on an edit vector encoding the differences between the retrieved sentences p and q. To make the comparison fair, we use the General Retriever (introduced in the Retriever subsection of the Proposed Approach section) to find (p, q). The edit vector for this pair is computed by concatenating the sum of the inserted word embeddings with the sum of the deleted word embeddings, following the heuristic used in prior retrieval-based work.
- RaE, which is proposed by Hashimoto et al. (2018) as a method with an in-domain retriever. The editor of this model is a Seq2Seq LSTM equipped with an attention mechanism over the input x and a copy mechanism over the retrieved pair p and q.
- CopyEditor+Ret, which is composed of the editor of RaE and the General Retriever. We compare FSET with this baseline model to further evaluate the role of our proposed editor.

Table 1 shows the settings of our model. We select the hyperparameters suggested by Li et al. (2018) for the LSTM-based Seq2Seq baselines, and those reported in prior work for the Transformer-based baselines. It is worth noting that our model's size, in terms of the number of parameters, is approximately 1/2 of the baseline LSTM's size and 1/5 of the baseline Transformer's size. The newly created retrieval-based baselines have the same hidden size and the same number of layers as the non-retrieval models. For the Seq2Seq+Ret model, we keep the ratio of the hidden size to the edit vector dimension the same as the ratio reported in the original work. We train all of the models for 100k iterations and choose the best version based on validation loss. We set the batch size to 128 and the vocabulary size to 8k in all of the experiments. The embeddings are also trained from scratch. In all of the experiments on the retrieval-based methods, the hyper-parameter K is set to 1; results for different values of K are reported in A.2. During the decoding stage, we use beam search to generate a set of outputs. In order to select the final output, we use an approach similar to Gupta et al. (2018), which chooses the sentence that is lexically most similar to the input, where the similarity is calculated with the Jaccard measure.
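The final selection step over the beam can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and similarity over token sets; the function names and the candidate sentences are hypothetical and not taken from the experiments.

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_output(source, candidates):
    """Pick the beam candidate with the highest Jaccard similarity to the source."""
    return max(candidates, key=lambda c: jaccard(source, c))


source = "how can one overcome procrastination ?"
beams = [
    "how should i avoid procrastination ?",
    "how can one beat procrastination ?",
    "what is procrastination ?",
]
best = select_output(source, beams)
```

Selecting by lexical overlap favors candidates that stay close to the input, so this rule trades some surface novelty for fidelity to the source sentence.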

Results and analysis
We compare different methods using BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) as the most common metrics for automatic evaluation of paraphrase generation methods. Table 2 summarizes the results of different methods. These results indicate that our model outperforms the previous state-of-the-art models in terms of all of the metrics.
It is worth noting that the models which utilize a copy mechanism, such as DNPG, RbM, RaE, and CopyEditor+Ret, generally outperform the other baselines. Seq2Seq+Ret, i.e. the retrieval-based Residual LSTM, shows an improvement over Residual LSTM on the Quora dataset. However, this is not the case on the Twitter dataset, and we hypothesize that this is due to the uncommon texts in this corpus (i.e. informal text with hashtags and abbreviated words), on which the General Retriever has not been trained. Therefore, a pre-trained retriever cannot help in this case. The CopyEditor+Ret model, which incorporates a more powerful editor than Seq2Seq+Ret, shows better results than both Residual LSTM and Seq2Seq+Ret. However, a phenomenon similar to the one described for Seq2Seq+Ret is also observed for this model on the Twitter dataset. The RaE model, with the same editor as CopyEditor but with a supervised (task-specific) retriever, leads to near state-of-the-art results. This indicates the role of the supervised task-specific retriever used in RaE, especially in the results on the Twitter dataset. The superiority of our method over RaE in all of the metrics could be a sign of the effectiveness of our proposed editor module. Although our model uses the General Retriever, it still outperforms all other methods, even on the Twitter dataset. It is worth mentioning that the General Retriever in our method can be replaced with other retrievers, such as supervised task-specific ones, to improve the results even more. Moreover, our model, which is based only on the Transformer architecture and the General Retriever (which does not need to be trained in each domain), needs much less training time than RaE.

Human evaluation
As there is no appropriate automatic metric for evaluating the diversity and novelty of generated sentences, we use human evaluation to assess the performance of our model qualitatively. We compare our method with two other methods: 1) RaE (Hashimoto et al., 2018) as a retrieval-based method adapted for paraphrasing, and 2) DiPS (Kumar et al., 2019) as a paraphrasing model which generates semantically diverse outputs by adopting a novel approach instead of beam search during the decoding stage. We choose these models as we would like to compare our method both with a state-of-the-art retrieval-based method and with a method that can generate diverse outputs. It must be noted that many of the recent methods in Table 2 are not able to generate diverse outputs. We first select 100 sentences randomly from the test set of the Quora dataset. Then, for each model, three paraphrases are generated for each one of the sentences, and these three outputs are considered as a paraphrase group. We aggregate and shuffle these paraphrase groups and ask six human annotators to evaluate them in two scenarios.
Figure 5: Results of the one-on-one human evaluation (second experiment). Annotators decide "Tie" when the outputs of the two models have the same quality in their opinion.
In the first scenario, we ask the human annotators to score the outputs individually based on the following two criteria: 1) grammar and fluency, and 2) consistency and coherency. Similar to Li et al. (2018), we use a 5-point rating scale for each criterion. Table 3 presents the results. As can be seen, our model generally outperforms the other methods. Although RaE and our model can both produce grammatically correct outputs, the consistency and coherency of the outputs of our method are much better. Moreover, the inter-annotator agreement measured by Cohen's kappa κ shows fair or intermediate agreement between the raters assessing the models.
Since directly scoring the diversity and novelty of a paraphrase group is not simple even for humans, in the second scenario we ask the annotators to make one-on-one comparisons between the groups of generated paraphrases. In other words, for each pair of models, they have to decide which model produces better outputs for each one of the sentences (ties are also allowed). Figure 5 depicts the one-on-one diversity evaluation. Our method and RaE both outperform DiPS, probably due to their retrieval-based nature. Moreover, this figure reveals that our method generates significantly better outputs than RaE. We believe the reason is that RaE's editor is not as well designed as our editor module: we explicitly inject the paraphrasing patterns found in the neighboring paraphrases into the Edit Performer, which helps it generate more diverse paraphrases. Please refer to A.3 for further details on the experiments.

Case study
Table 4 shows some examples of the paraphrases generated by our model. A common pattern among the output paraphrases is that the model has combined different parts of the input sentence and the retrieved target sentence to create a grammatically correct paraphrase. Furthermore, the model is able to change the syntactic form of words correctly, or replace them with their synonyms, as is the case for "via" and "by using", and for "is" and "are". The fact that the model can apply these changes at different levels of granularity indicates that the idea of editing based on MEVs works properly.

Table 4: Examples of paraphrases generated by our model (x: input, p and q: retrieved pair, y: generated output).
(1)
x: Why do people ask such questions here on Quora which could be easily found on the internet ?
p: Why do people ask questions on Quora that could simply be googled?
q: why do people ask questions on quora that are easily answerable via a quick internet search ?
y: why do people ask questions on quora that could be easily answered by using a quick internet search ?
(2)
x: Who can the Democrats effectively run against Trump in 2020 ?
p: Who will lead Trump 's movement after the election ?
q: Who do you think is most likely to lead the Trump movement after 2016 ?
y: who do you think is the most likely democrat to run against trump in 2020 ?
(4)
x: What is the best way to make money as a 15 year old ?
p: What are the best ways to make money as a teenager ?
q: I 'm a teen , how can I make some money online ?
y: I 'm a 15 year old , how can I make money ?

Model Ablation
In order to further evaluate the role of each module in our model, we train and assess different variants of it, where in each variant a key component has been replaced by a simpler alternative:
• Jaccard Retriever: The retriever of our model is replaced by a simple retriever that selects neighbor sentences using the Jaccard similarity metric.
• No edit vector z: A variant in which we do not condition the Transformer in the Edit Performer on the aggregated edit vector z, and edit the source sentence merely based on MEVs.
• No Attention on MEVs: In this variant of our model, the Transformer in the Edit Performer is not conditioned on MEVs, and the source sentence is edited based on only z.
We train all of these variants on the Quora paraphrasing dataset. Table 5 shows the results of these models. As can be seen, the model which uses the Jaccard similarity measure performs worse than the original model with the General Retriever. Nonetheless, the results of this version show that even the combination of our editor module with this simple retriever outperforms previous state-of-the-art methods. This indicates that our proposed editor can distinguish whether the extracted edits are plausible enough to be applied to the input sentence. Moreover, the results show that eliminating either z or M from our editor decreases its performance. In other words, both conditioning on z as the aggregated edit at each step of generation and the attention on the MEVs M help the proposed editor.

Conclusion
In this paper, we proposed a retrieval-based paraphrase generation model which includes a novel fully-attentional editor. This editor learns how to extract edits from a paraphrase pair and also when and how to apply these edits to a new input sentence. We also introduced the new idea of Micro Edit Vectors, where each one of these vectors represents a small edit that should be applied to the source sentence to get its paraphrase. We incorporated Transformer modules in our editor and augmented them with attention over Micro Edit Vectors. The proposed model outperforms the previous state-of-the-art paraphrase generation models in terms of both automatic metrics and human evaluation. Moreover, the outputs show that our model is able to produce paraphrases by editing sentences in a fine-grained manner using the idea of MEVs. In future work, we intend to adapt our editor module to other learning tasks with both structured inputs and structured outputs.

A.1 Construction of N (x) during training
For each pair of sentences, such as (x, y), we augment its neighbourhood set N(x) with multiple (x, y) pairs to get the new neighbourhood set $\mathcal{N}'(x) = [(p_1, q_1), \ldots, (p_K, q_K), (x, y), \ldots, (x, y)]$, where the first K pairs are the K most similar pairs (excluding (x, y) itself), and (x, y) is repeated $K' < K$ times ($K'$ is another hyperparameter of the model). Since the model sees the (x, y) pair as the retrieved pair a fraction $\frac{K'}{K + K'}$ of the time during training, and these particular pairs include the output y themselves, the model is encouraged to use the information coming from the retrieved neighboring pairs more often.
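The augmentation of the neighbourhood set can be sketched as below. This is only an illustration: the function name `augment_neighborhood`, the placeholder strings, and the chosen values of K and K' are hypothetical.

```python
import random


def augment_neighborhood(neighbors, x, y, k_prime):
    """Append k_prime copies of the ground-truth pair (x, y) to the K retrieved pairs."""
    return list(neighbors) + [(x, y)] * k_prime


neighbors = [("p1", "q1"), ("p2", "q2"), ("p3", "q3")]  # K = 3 retrieved pairs
augmented = augment_neighborhood(neighbors, "x", "y", k_prime=1)

# Share of draws in which the editor receives (x, y) itself: K' / (K + K').
fraction = augmented.count(("x", "y")) / len(augmented)

# During training, one pair is drawn uniformly from the augmented set.
retrieved = random.Random(0).choice(augmented)
```

A larger K' pushes the editor harder toward relying on the retrieved pair, at the cost of seeing real neighbors less often during training.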

A.2 Analysis of Varying K
We conduct an experiment to evaluate the effect of the hyper-parameter K in the proposed method. For each value of K ∈ {1, 3, 5}, we train our model once and obtain its results on the Quora dataset. Then, the values of two quality metrics (i.e. BLEU-2 and ROUGE-2) and two diversity metrics (i.e. Self-BLEU-2 and PINC-4) are computed. Figure 6 summarizes the obtained results. According to this figure, increasing the value of K slightly decreases the quality metrics while substantially increasing the diversity measures (note that lower values of Self-BLEU and higher values of PINC indicate more diversity in the outputs). This shows that incorporating a wider neighborhood in the editing process results in more diverse paraphrases from the editor.
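As a rough illustration of how a PINC-style diversity score behaves, the sketch below computes the share of candidate n-grams that do not appear in the source, averaged over n = 1 to 4. It assumes the commonly used n-gram novelty definition of PINC, whitespace tokenization, and averaging only over the n-gram orders the candidate is long enough to contain; the exact evaluation script behind Figure 6 may differ, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) contained in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """Average share of candidate n-grams that are absent from the source."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # candidate is shorter than n tokens
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0


source = "how can one overcome procrastination ?"
candidate = "how should i avoid procrastination ?"
score = pinc(source, candidate)
```

A candidate that copies the source scores 0 and a candidate sharing no n-grams with it scores 1, so higher values indicate paraphrases that depart more from the input wording.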

A.3 Human Evaluation
The form used for the one-on-one experiment contains the following material:
Which set of the outputs do you prefer? Please opt based on the following criteria:
• Novelty: If any one of the outputs has expressed the semantic content of source sentence in a novel way like what we do as humans (e.g. changing the voice from active to passive or vice versa, using different words, phrases, or sentences that are not present in the source sentence but without changing the meaning considerably).
• Diversity: if the three outputs are not expressed in the same way using the same words.
• Quality: If the outputs are paraphrases of the input sentence to a good extent.
• Readability: If the outputs are understandable by humans.
Note: Please select the option 'both(#1 and #2)' if you can not decide which one is better.

A.4 Example outputs
Table 6 shows the paraphrases generated for more sample inputs from the Quora dataset.

Table 6: Paraphrases generated for more sample inputs from the Quora dataset (x: input, p and q: retrieved pair, y: generated output).
(1)
y: what is the story behind andy warhol paint marilyn monroe ?
(8)
x: Can we start a Sainik School in the private sector ?
p: Is n't it time for Indian goverment to change its reservation system for jobs and education from caste based to economic based system ?
q: Is n't it time India change its reservation policies from implemented on caste basis to based on financial backgrounds ?
y: is it possible to start a sainik school in the private sector ?
(9)
x: Can we start a Sainik School in the private sector ?
p: Should the Indian education system be revamped ?
q: Does the Indian education system need to change and why ?
y: does the indian government need to start a sainik school in the private sector ?
(10)
x: What is the best thing any one has ever done for you ?
p: What is the nicest thing anyone has ever done for you ?
q: What 's the nicest thing anyone has ever done for you ?
y: what 's the best thing that has ever done for you ?
(11)
x: What are some of the things people like or dislike in general ?
p: What are some things that you hate ?
q: What are some things that irritate you and that you hate ?
y: what are some things that most people dislike in general ?