Decomposable Neural Paraphrase Generation

Paraphrasing exists at different levels of granularity, such as the lexical, phrasal, and sentential levels. This paper presents the Decomposable Neural Paraphrase Generator (DNPG), a Transformer-based model that can learn and generate paraphrases of a sentence at different levels of granularity in a disentangled way. Specifically, the model is composed of multiple encoders and decoders with different structures, each corresponding to a specific granularity. An empirical study shows that the decomposition mechanism of DNPG makes paraphrase generation more interpretable and controllable. Based on DNPG, we further develop an unsupervised domain adaptation method for paraphrase generation. Experimental results show that the proposed model achieves competitive in-domain performance compared to state-of-the-art neural models, and significantly better performance when adapting to a new domain.


Introduction
Paraphrases are texts that convey the same meaning using different wording. Paraphrase generation is an important technique in natural language processing (NLP) that can be applied in various downstream tasks such as information retrieval, semantic parsing, and dialogue systems. Neural sequence-to-sequence (Seq2Seq) models have demonstrated superior performance on generating paraphrases of a given sentence (Prakash et al., 2016; Cao et al., 2017; Ma et al., 2018). All of the existing works learn to paraphrase by mapping one sequence to another, with each word processed and generated in a uniform way. This work is motivated by a commonly observed phenomenon: the paraphrase of a sentence is usually composed of multiple paraphrasing patterns at different levels of granularity, e.g., from the lexical to the phrasal to the sentential level. For instance, the following pair of paraphrases contains both a phrase-level and a sentence-level pattern.
what is the reason of $x → what makes $x happen
world war II → the second world war

Specifically, the blue part is the sentence-level pattern, which can be expressed as a pair of sentence templates, where $x can be any fragment of text. The green part is the phrase-level pattern, which is a pair of phrases. Table 1 shows more examples of paraphrase pairs sampled from the WikiAnswers corpus and the Quora question pairs. We can see that the sentence-level paraphrases are more general and abstractive, while the word/phrase-level paraphrases are relatively diverse and domain-specific. Moreover, we notice that in many cases paraphrasing can be decoupled, i.e., the word-level and phrase-level patterns are mostly independent of the sentence-level paraphrase patterns.
To address this phenomenon in paraphrase generation, we propose Decomposable Neural Paraphrase Generator (DNPG). Specifically, the DNPG consists of a separator, multiple encoders and decoders, and an aggregator. The separator first partitions an input sentence into segments belonging to different granularities, which are then processed by multiple granularity-specific encoders and decoders in parallel. Finally the aggregator combines the outputs from all the decoders to produce a paraphrase of the input.
We explore three advantages of the DNPG:

Table 1: Examples of paraphrase pairs in the WikiAnswers and Quora datasets. We manually labeled the sentences, with the blue italic words being sentence-level and the green underlined words being phrase-level.
What is the population of New York? ↔ How many people is there in NYC?
Who wrote the Winnie the Pooh books? ↔ Who is the author of winnie the pooh?
What is the best phone to buy below 15k? ↔ Which are best mobile phones to buy under 15000?
How can I be a good geologist? ↔ What should I do to be a great geologist?
How do I reword a sentence to avoid plagiarism? ↔ How can I paraphrase my essay and avoid plagiarism?
Interpretable In contrast to the existing Seq2Seq models, we show that DNPG can automatically learn the paraphrasing transformation separately at lexical/phrasal and sentential levels.
Besides generating a paraphrase given a sentence, it can meanwhile interpret its prediction by extracting the associated paraphrase patterns at different levels, similar to the examples shown above.
Controllable The model allows the user to control the generation process precisely. By employing DNPG, the user can specify the part of the sentence being fixed while the rest being rephrased at a particular level.
Domain-adaptable In this work, we assume that high-level paraphrase patterns are more likely to be shared across domains. With all the levels coupled together, it is difficult for conventional Seq2Seq models to adapt well to a new domain. The DNPG model, however, can conduct paraphrasing at the abstractive (sentential) level alone, and is thus more capable of performing well in domain adaptation. Concretely, we develop a method for DNPG to adapt to a new domain with only non-parallel data.
We verify the DNPG model on two large-scale paraphrasing datasets and show that it can generate paraphrases in a more controllable and interpretable way while preserving quality. Furthermore, experiments on domain adaptation show that DNPG performs significantly better than the state-of-the-art methods. The technical contribution of this work is three-fold: 1. We propose a novel Seq2Seq model that decomposes paraphrase generation into learning paraphrase patterns at different granularity levels separately.
2. We demonstrate that the model achieves more interpretable and controllable generation of paraphrases.
3. Based on the proposed model, we develop a simple yet effective method for unsupervised domain adaptation.

Decomposable Neural Paraphrase Generator
This section explains the framework of the proposed DNPG model. We first give an overview of the model design and then elaborate on each component in detail.

Model Overview

As illustrated in Figure 1, DNPG consists of four components: a separator, multi-granularity encoders and decoders (denoted as m-encoder and m-decoder respectively), and an aggregator. The m-encoder and m-decoder are composed of multiple independent encoders and decoders, each corresponding to a specific level of granularity. Given an input sentence of words X = [x_1, . . . , x_L] with length L, the separator first determines the granularity label for each word, denoted as Z = [z_1, . . . , z_L]. After that, the input sentence X together with its associated labels Z is fed into the m-encoders in parallel and summarized as

U_z = m-encoder_z(X, Z), (1)

where the subscript z denotes the granularity level. At the decoding stage, each decoder can individually predict the probability of generating the next word y_t as

P_z(y_t | y_{1:t-1}, X) = m-decoder_z(U_z, y_{1:t-1}). (2)

Finally, the aggregator combines the outputs of all the decoders and makes the final prediction of the next word:

P(y_t | y_{1:t-1}, X) = Σ_{z_t} P_{z_t}(y_t | y_{1:t-1}, X) P(z_t | y_{1:t-1}, X). (3)

Here P(z_t | y_{1:t-1}, X) is the probability of being at granularity level z_t, and P_{z_t}(y_t | y_{1:t-1}, X) is given by the decoder m-decoder_{z_t} at level z_t.
The choice of the encoder and decoder modules of DNPG can be quite flexible, for instance long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) or convolutional neural networks (CNNs) (LeCun et al., 1998). In this work, the m-encoders and m-decoders are built on the Transformer model (Vaswani et al., 2017). Besides, we employ LSTM networks to build the separator and aggregator modules. Without loss of generality, we consider two levels of granularity in our experiments, that is, z = 0 for the lexical/phrasal level and z = 1 for the sentential level.

Separator
For each word x_l in the sentence, we assign a latent variable z_l indicating its potential granularity level for paraphrasing. This can be formulated as a sequence labeling process. In this work we employ stacked LSTMs to compute the distribution of the latent variables recursively:

g_l = LSTM(x_l, g_{l-1}),
h_l = LSTM([g_l; z_{l-1}], h_{l-1}),
z_l ∼ GS(W h_l, τ), (4)

where h_l and g_l represent the hidden states in the LSTMs and GS(·, τ) denotes the Gumbel-Softmax function (Jang et al., 2016). We use Gumbel-Softmax to keep the model differentiable while still producing an approximately discrete level for each token; τ is the temperature controlling the closeness of z to 0 or 1.
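As an illustration, the Gumbel-Softmax relaxation used by the separator can be sketched as follows. This is a minimal NumPy sketch of the standard formulation from Jang et al. (2016); the function and variable names are ours, not from the paper's implementation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable approximation of sampling a discrete label.

    With a low temperature tau the output approaches a one-hot vector,
    giving the (approximately) discrete granularity level z for a token.
    """
    rng = rng or np.random.default_rng(0)
    # Sample Gumbel(0, 1) noise and add it to the logits.
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

# Two granularity levels: z=0 (lexical/phrasal), z=1 (sentential).
probs = gumbel_softmax(np.array([0.2, 1.5]), tau=0.5)
```

Lowering `tau` pushes `probs` toward a one-hot vector while keeping the whole sampling step differentiable with respect to the logits.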

Multi-granularity encoder and decoder
We employ the Transformer architecture for the encoders and decoders in DNPG. Specifically, the phrase-level Transformer, composed of m-encoder_0 and m-decoder_0, is responsible for capturing local paraphrasing patterns; the sentence-level Transformer, composed of m-encoder_1 and m-decoder_1, aims to learn high-level paraphrasing transformations. Following the Transformer design in Vaswani et al. (2017), each encoder or decoder is composed of positional encoding, stacked multi-head attention, layer normalization, and feed-forward neural networks. The multi-head attention in the encoders contains self-attention, while that in the decoders contains both self-attention and context attention. We refer readers to the original paper for details of each component. In order to better decouple paraphrases at different granularity levels, we introduce three inductive biases into the modules by varying the model capacity and the configurations of the positional encoding and multi-head attention modules. We detail them hereafter.
Positional Encoding: We adopt the same positional encoding method as Vaswani et al. (2017), i.e., the sinusoidal function:

PE(p, 2i) = sin(p / 10000^{2i/d_model}),
PE(p, 2i+1) = cos(p / 10000^{2i/d_model}). (5)

For the phrase-level Transformer, we use the original position, i.e., p := pos. For the sentence-level Transformer, in order to make the positional encoding insensitive to the lengths of the phrase-level fragments, we set

p := Σ_{l ≤ pos} z_l, (6)

i.e., the position counts only the words labeled as sentence-level, so that each phrase-level fragment occupies a single position regardless of its length.

Multi-head Attention: We modify the self-attention mechanism in the encoders and decoders by setting a different receptive field for each granularity, as illustrated in Figure 2. Specifically, for the phrase-level model, we restrict each position in the encoder and decoder to attend only to the adjacent n words (n = 3), so as to mainly capture local composition. For the sentence-level model, we allow self-attention to cover the entire sentence, but only the words labeled as sentence-level (i.e., z_l = 1) are visible. In this manner, the model focuses on learning sentence structures while ignoring low-level details. To do so, we re-normalize the original attention weights α_{t,l} as

α'_{t,l} = α_{t,l} · 1[z_l = 1] / Σ_{l'} α_{t,l'} · 1[z_{l'} = 1]. (7)

We restrict the decoder at level z to access only the encoder positions l with z_l = z in the same way.

Model Capacity: We choose a larger capacity for the phrase-level Transformer than for the sentence-level Transformer. The intuition is that lexical/phrasal paraphrases generally contain more long-tail expressions than sentential ones. In addition, the phrase-level Transformer is equipped with the copying mechanism. Thus, the probability of generating the target word y_t by m-decoder_0 is:

P_{z=0}(y_t | y_{1:t-1}, X) = (1 − ρ_t) P_gen(y_t | y_{1:t-1}, X) + ρ_t P_copy(y_t | y_{1:t-1}, X), (8)

where ρ_t is the copying probability, which is jointly learned with the model. Table 2 summarizes the specifications of the Transformer models for each granularity.
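The two receptive-field constraints above can be sketched as attention masks followed by re-normalization of the attention weights. The following is a toy NumPy illustration under our own naming, not the actual implementation:

```python
import numpy as np

def phrase_mask(L, n=3):
    # Phrase level: each position attends only within a local window of n words.
    idx = np.arange(L)
    return (np.abs(idx[:, None] - idx[None, :]) <= n).astype(float)

def sentence_mask(z):
    # Sentence level: only tokens labeled sentence-level (z_l = 1) are visible.
    z = np.asarray(z, dtype=float)
    return np.tile(z, (len(z), 1))

def masked_attention(scores, mask):
    # Re-normalize so that masked-out positions receive zero attention weight.
    w = np.exp(scores) * mask
    return w / np.maximum(w.sum(-1, keepdims=True), 1e-9)

z = [1, 0, 0, 1, 1]          # toy granularity labels
scores = np.zeros((5, 5))    # uniform raw attention scores
att = masked_attention(scores, sentence_mask(z))
```

With uniform scores, each row of `att` spreads its weight evenly over the three sentence-level positions and gives zero weight to the phrase-level ones, mirroring Eq. (7).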

Aggregator
Each Transformer model works independently until generating the final paraphrase. The prediction of the token at the t-th position is determined by the aggregator, which combines the outputs of the m-decoders. More precisely, the aggregator first decides the probability of the next word being at each granularity level. The previous word y_{t-1} and the context vectors c^0_t and c^1_t, given by m-decoder_0 and m-decoder_1 respectively, are fed into an LSTM to make the prediction:

v_t = LSTM([y_{t-1}; c^0_t; c^1_t], v_{t-1}),
P(z_t | y_{1:t-1}, X) = softmax(W_a v_t), (9)

where v_t is the hidden state of the LSTM. Then, jointly with the probabilities computed by the m-decoders, we make the final prediction of the next word via Eq. (3).
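The final mixture in Eq. (3) amounts to a weighted sum of the per-level word distributions, as in the following sketch (toy numbers, names are ours):

```python
import numpy as np

def aggregate(p_word_per_level, p_level):
    """Final next-word distribution as a mixture over granularity levels.

    p_word_per_level: array [num_levels, vocab] -- P_z(y_t | y_{1:t-1}, X)
    p_level:          array [num_levels]        -- P(z_t | y_{1:t-1}, X)
    """
    return (p_level[:, None] * p_word_per_level).sum(axis=0)

# Toy example with 2 levels and a 3-word vocabulary.
p_word = np.array([[0.7, 0.2, 0.1],   # phrase-level decoder
                   [0.1, 0.1, 0.8]])  # sentence-level decoder
p_z = np.array([0.25, 0.75])
p_final = aggregate(p_word, p_z)
```

Since both the level probabilities and the per-level word distributions sum to one, the mixture is itself a valid distribution over the vocabulary.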

Learning of Separator and Aggregator
The proposed model can be trained end-to-end by maximizing the conditional probability (3). However, learning from scratch may not be informative enough for the separator and aggregator to disentangle the paraphrase patterns in an optimal way. We therefore introduce weak supervision to guide the training of the model. We construct the supervision based on the heuristic that long-tail expressions contain more rare words. To this end, we first use a word alignment model (Och and Ney, 2003) to establish links between the words in the sentence pairs of the paraphrase corpus. Then we assign the label z* = 0 (phrase-level) to the n (randomly sampled from {1, 2, 3}) pairs of aligned phrases that contain the most rare words. The rest of the words are labeled as z* = 1 (sentence-level).
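The labeling heuristic can be sketched as follows, assuming phrase alignments and corpus word frequencies are already available; the function and variable names are illustrative, not from the paper:

```python
from collections import Counter

def weak_labels(src_tokens, tgt_tokens, aligned_phrases, freq, n=2):
    """Label the n aligned phrase pairs containing the rarest words as
    phrase-level (z* = 0); everything else is sentence-level (z* = 1).

    aligned_phrases: list of ((src_start, src_end), (tgt_start, tgt_end))
    index pairs from a word aligner; `freq` maps word -> corpus frequency.
    """
    def rarity(pair):
        (s0, s1), (t0, t1) = pair
        words = src_tokens[s0:s1] + tgt_tokens[t0:t1]
        return min(freq.get(w, 0) for w in words)  # rarest word decides

    chosen = sorted(aligned_phrases, key=rarity)[:n]
    z_src = [1] * len(src_tokens)
    for (s0, s1), _ in chosen:
        for i in range(s0, s1):
            z_src[i] = 0
    return z_src

freq = Counter({"what": 1000, "is": 900, "the": 950,
                "population": 5, "of": 800, "nyc": 2})
src = ["what", "is", "the", "population", "of", "nyc"]
tgt = ["how", "many", "people", "live", "in", "nyc"]
pairs = [((3, 4), (2, 3)), ((5, 6), (5, 6))]
z = weak_labels(src, tgt, pairs, freq, n=2)
```

Here the rare words "population" and "nyc" are labeled phrase-level, while the common template words remain sentence-level.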
We train the model with explicit supervision at the beginning, with the following loss function:

L = − Σ_t [ log P(y_t | y_{1:t-1}, X) + λ log P(z*_t | y_{1:t-1}, X) ], (10)

where λ is the hyper-parameter controlling the weight of the explicit supervision. In experiments, we decrease λ gradually from 1 to nearly 0.
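A minimal sketch of the loss in Eq. (10), computed on toy per-step distributions (names are ours, not from the paper's code):

```python
import numpy as np

def dnpg_loss(p_words, p_levels, y, z_star, lam):
    """Negative log-likelihood of the target words plus a lambda-weighted
    term supervising the predicted granularity labels (a sketch of Eq. 10).
    """
    word_nll = -np.log([p_words[t][y[t]] for t in range(len(y))])
    level_nll = -np.log([p_levels[t][z_star[t]] for t in range(len(y))])
    return (word_nll + lam * level_nll).sum()

# Toy two-step example with a 2-word vocabulary and 2 granularity levels.
p_words = [np.array([0.8, 0.2]), np.array([0.4, 0.6])]
p_levels = [np.array([0.9, 0.1]), np.array([0.3, 0.7])]
loss = dnpg_loss(p_words, p_levels, y=[0, 1], z_star=[0, 1], lam=1.0)
```

As λ is annealed toward 0, the second term vanishes and training reduces to the ordinary word-level likelihood.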

Applications and Experimental Results
We verify the proposed DNPG model for paraphrase generation in three respects: interpretability, controllability, and domain adaptability. We conduct experiments on the WikiAnswers paraphrase corpus (?) and the Quora duplicate question pairs, both of which are question data. While the Quora dataset is labeled by human annotators, the WikiAnswers corpus is collected in an automatic way, and hence is much noisier. There are more than 2 million pairs of sentences in the WikiAnswers corpus. To make the setting closer to real-world applications, and more challenging for domain adaptation, we use a randomly sampled subset for training. The detailed statistics are shown in Table 3.

Implementation and Training Details
As the words in WikiAnswers are all stemmed and lower-cased, we apply the same pre-processing to the Quora dataset. For both datasets, we truncate all sentences longer than 20 words. For the models with the copy mechanism, we maintain a vocabulary of size 8K. For the other baseline models besides the vanilla Transformer, we include all the words in the training sets in the vocabulary to ensure that the improvement of our models does not come from solving the out-of-vocabulary issue.
For a fair comparison, we use a Transformer model with a similar number of parameters to our model: 3 layers, a model size of 450 dimensions, and attention with 9 heads. We use early stopping to prevent over-fitting. We train DNPG with the Adam optimizer (Kingma and Ba, 2014). We initially set the learning rate to 5e-4, τ to 1, and λ to 1, and then decrease them to 1e-4, 0.9, and 1e-2 respectively after 3 epochs. We keep the hyper-parameters of the models and optimization in all other baselines the same as in their original works. We implement our model in PyTorch (Paszke et al., 2017).

Interpretable Paraphrase Generation
First, we evaluate our model quantitatively in terms of automatic metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), which have been widely used in previous works on paraphrase generation. In addition, we include iBLEU (Sun and Zhou, 2012), which penalizes repeating the source sentence in its paraphrase. We use the same hyper-parameters as in the original work. We compare DNPG with four existing neural models: ResidualLSTM (Prakash et al., 2016), VAE-SVG-eq (Gupta et al., 2017), the pointer-generator (See et al., 2017), and the Transformer (Vaswani et al., 2017), the latter two of which have been reported as state-of-the-art models, e.g., in Wang et al. (2018). For a fair comparison, we also include a Transformer model with the copy mechanism. Table 4 shows the performance of the models, indicating that DNPG achieves competitive performance in terms of all the automatic metrics. In particular, DNPG performs similarly to the vanilla Transformer on the Quora dataset, while performing significantly better on WikiAnswers. The reason may be that DNPG is more robust to noise, since it can process the paraphrase in an abstractive way. It also validates our assumption that paraphrasing can be decomposed in terms of granularity. When training data of high quality is available, the Transformer-based models significantly outperform the LSTM-based models.
Besides the quantitative performance, we demonstrate the interpretability of DNPG. Given an input sentence, the model not only generates its paraphrase but also predicts the granularity level of each word. Using the predicted granularity levels and the context attention in the Transformer, we are able to extract the phrasal and sentential paraphrase patterns from the model. Specifically, we extract the sentential template X̄ of X (or Ȳ of Y) by substituting each fragment of words at the phrasal level with a unique placeholder such as $x. The extraction process is denoted as X̄ = T(X, Z) = [x̄_1, . . . , x̄_L̄], where each element x̄_l is either a placeholder or a word labeled as sentence-level. Through the attention weights, we ensure that each pair of aligned fragments shares the same placeholder in {X̄, Ȳ}. The whole generation and alignment process is detailed in Appendix A. Each pair of fragments sharing the same placeholder is extracted as a phrasal paraphrase pattern. Table 6 gives examples of the generated paraphrases and the corresponding extracted templates. For instance, the model learns a sentential paraphrasing pattern X̄: what is $x's $y → Ȳ: what is the $y of $x, which is a common rewriting pattern applicable in general practice. The results clearly demonstrate the ability of DNPG to decompose patterns at different levels, making its behavior more interpretable.
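The template extraction T(X, Z) can be sketched as replacing each maximal phrase-level fragment with a placeholder. This simplified illustration ignores the attention-based alignment of placeholders between X̄ and Ȳ:

```python
def extract_template(tokens, z):
    """Replace each maximal run of phrase-level tokens (z_l = 0) with a
    placeholder, keeping sentence-level tokens: a sketch of T(X, Z)."""
    template, names, i = [], iter("xyzuvw"), 0
    while i < len(tokens):
        if z[i] == 1:
            template.append(tokens[i])
            i += 1
        else:
            template.append("$" + next(names))
            # Skip the whole contiguous phrase-level fragment.
            while i < len(tokens) and z[i] == 0:
                i += 1
    return template

toks = ["what", "is", "the", "population", "of", "new", "york"]
z =    [1,      1,    1,    0,            1,    0,     0]
tpl = extract_template(toks, z)
```

For this input the extracted template is "what is the $x of $y", matching the kind of pattern shown in Table 6.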

Controllable Paraphrase Generation
The design of the DNPG model allows the user to control the generation process more precisely. Thanks to the decomposition mechanism, the model can flexibly conduct either sentential or phrasal paraphrasing individually. Furthermore, instead of using the learned separator, the user can manually specify the granularity labels of the input sentence and then choose one of the following paraphrasing strategies. Sentential paraphrasing is performed by restricting the phrase-level decoder (m-decoder_0) to copying from the input at the decoding stage, i.e., keeping the copy probability ρ_t = 1. To ensure that the phrasal parts are well preserved, we replace each phrase-level fragment with a unique placeholder and recover it after decoding.
Phrasal paraphrasing is performed with the sentence template fixed. For each phrase-level fragment, the paraphrase is generated by m-decoder_0 only, and generation stops at the first position t with z_t = 1.
Once beam search of size B has finished, there are B paraphrase candidates Ŷ_b. We pick the one with the best accuracy and readability. Specifically, we re-rank the candidates by P(Ŷ_b | X, Z) as calculated by the full DNPG model.
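The re-ranking step simply selects the candidate with the highest full-model score. A minimal sketch follows, with a stand-in scoring function, since the real score P(Ŷ_b | X, Z) comes from the trained DNPG:

```python
def rerank(candidates, score_fn):
    """Pick the candidate maximizing the (log-)probability returned by
    score_fn; in DNPG, score_fn would be the full model's P(Y_b | X, Z)."""
    return max(candidates, key=score_fn)

# Toy stand-in scorer preferring shorter candidates, for illustration only.
best = rerank(["a much longer paraphrase", "short one"], lambda y: -len(y))
```

In practice `score_fn` evaluates each beam candidate under the complete separator-encoder-decoder-aggregator pipeline, so the re-ranking accounts for both word and granularity probabilities.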
Given a sentence, we manually label different segments of words as phrase-level, and employ the model to conduct sentential and phrasal paraphrasing individually. With the manual labels, the model automatically selects different paraphrase patterns for generation. Table 7 shows examples of the generated results by different paraphrasing strategies. As demonstrated by the examples, DNPG is flexible enough to generate paraphrase given different sentence templates and phrases.
Controllable generation is useful in downstream applications, for instance data augmentation in task-oriented dialogue systems. Suppose we have the user utterance book a flight from New York to London and want to produce more utterances with the same intent. With DNPG, we can conduct sentential paraphrasing and keep the slot values fixed, e.g., buy an airline ticket to London from New York.

Unsupervised Domain Adaptation
Existing studies on paraphrase generation mainly focus on the in-domain setting with a large-scale parallel corpus for training. In practice, there is always a need to apply the model in a new domain where no parallel data is available. We formulate this as an unsupervised domain adaptation problem for paraphrase generation. Based on the observation that the sentence templates generated by DNPG tend to be more general and domain-insensitive, we consider directly performing sentential paraphrasing in the target domain as a solution. However, the language of the source and target domains may differ, so we fine-tune the separator of DNPG so that it can identify the granularity of sentences in the target domain more accurately. Specifically, to adapt the separator P_sep(Z|X) to the target domain, we employ a reinforcement learning (RL) approach, maximizing the accumulative reward:

R = E_{Z ∼ P_sep(Z|X)} [ Σ_l r_l ]. (11)

We define the reward function based on the principle that the source and target domains share similar sentence templates. We first train a neural language model, specifically an LSTM, on the sentence templates in the source domain, with the conditional probability denoted as P_LM(x̄_l | x̄_{1:l-1}).
In the target domain, the template language model is employed as a reward function for the separator. Formally, we define the reward r_l at position l as

r_l = log P_LM(x̄_l | x̄_{1:l-1}) − α · 1[z_l = 0], (12)

where the template x̄_{1:l} = T(X, z_{1:l}) is extracted as detailed in Section 3.2, and α is a scaling factor that penalizes long fragments labeled as phrase-level, since more informative sentence templates are preferred. With this reward, the separator is further tuned with the policy gradient method (Williams, 1992; Sutton et al., 2000). To bridge the gap between training and testing of the Transformer models in different domains, we fine-tune the DNPG model on the sentential paraphrase patterns extracted in the source domain. Since only unlabeled data in the target domain is needed to fine-tune the separator, the domain adaptation can be done incrementally. An overview of the complete training process is illustrated in Figure 4. We refer to the model fine-tuned in this way as Adapted DNPG.
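Under our reading of the reward described above, a per-position reward could be sketched as follows; the exact functional form and the value of α are assumptions for illustration, not the paper's specification:

```python
import numpy as np

def template_reward(logp_lm, z, alpha=0.1):
    """Per-position reward for the separator: template language-model
    log-likelihood minus a penalty on phrase-level labels, so that long
    phrase-level fragments accumulate a larger penalty.

    logp_lm: template LM log-probabilities per position (assumed given)
    z:       granularity labels (1 = sentence-level, 0 = phrase-level)
    """
    z = np.asarray(z, dtype=float)
    return np.asarray(logp_lm) - alpha * (1.0 - z)

# Toy example: the middle token is labeled phrase-level and is penalized.
r = template_reward([-1.0, -2.0, -0.5], [1, 0, 1], alpha=0.2)
```

The summed reward then feeds a standard REINFORCE-style policy gradient update of P_sep(Z|X).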
We evaluate the performance of the original DNPG and the Adapted DNPG in two settings of domain transfer: 1) the Quora dataset as the source domain and WikiAnswers as the target domain, denoted as Quora→WikiAnswers, and 2) the reverse, WikiAnswers→Quora. As baseline models, in addition to the pointer-generator network and the Transformer model with the copy mechanism (denoted as Transformer+Copy), we use shallow fusion (Gulcehre et al., 2015) and multi-task learning (MTL) (Domhan and Hieber, 2017), which harness non-parallel data in the target domain for adaptation. For fair comparison, we use Transformer+Copy as the base model for shallow fusion and implement a variant of MTL with the copy mechanism (denoted as MTL+Copy). Table 5 shows the performance of the models in the two settings. DNPG achieves better performance than the pointer-generator and the Transformer-based model, and competitive performance with MTL+Copy, which accesses the target domain during training. With a fine-tuned separator, Adapted DNPG significantly outperforms the other models on Quora→WikiAnswers. On WikiAnswers→Quora, where domain adaptation is more challenging since the source domain is noisy, the margin is much larger. The main reason is that the original meaning can be preserved well when paraphrasing is conducted at the sentential level only. For an intuitive illustration, we show examples of the paraphrases generated by Adapted DNPG and MTL+Copy in Table 10 in the Appendix. They show that sentential paraphrasing is an efficient way to reuse general paraphrase patterns while avoiding mistakes on rephrasing domain-specific phrases, though at the expense of the diversity of the generated paraphrases. We leave this problem for future work.
To further verify the improvement of Adapted DNPG, we conduct a human evaluation in the WikiAnswers→Quora setting. Six human assessors evaluate 120 groups of paraphrase candidates given the input sentences. Each group consists of the output paraphrases of MTL+Copy, DNPG, and Adapted DNPG, as well as the reference. The evaluators are asked to rank the candidates from 1 (best) to 4 (worst) by readability, accuracy, and surface dissimilarity to the input sentence. The detailed evaluation guide can be found in Appendix B. Table 8 shows the mean rank and inter-annotator agreement (Cohen's kappa) for each model. Adapted DNPG again outperforms MTL+Copy by a large margin (p-value < 0.01). The difference between the original DNPG and MTL+Copy is not significant (p-value = 0.18). All inter-annotator agreements are regarded as fair or above.

Ablation Studies and Discussion
We quantify the performance gain of each inductive bias incorporated in the DNPG model. Specifically, we compare DNPG with three variants: one with vanilla attention modules, one with vanilla positional encoding, and one with vanilla softmax. We train them on the training set of WikiAnswers and test on the validation set of Quora. The results, shown in Table 9, indicate that each inductive bias makes a positive contribution. This further supports the claim that the decomposition mechanism allows the model to capture more abstractive and domain-invariant patterns. We also note a large drop without the constraints on multi-head attention, which is a core part of the decomposition mechanism. We investigate the effect of the weak supervision for the separator and aggregator by setting λ to 0. Though there is no significant drop in quantitative performance, we observe that the model struggles to extract meaningful paraphrase patterns. This suggests that explicit supervision for the separator and aggregator makes a difference, and that it does not need to be optimal. It opens a door to incorporating symbolic knowledge, such as regular expressions of sentence templates, human-written paraphrase patterns, and phrase dictionaries, into the neural network: through training on a parallel corpus, DNPG can generalize the symbolic rules.

Related Work

Most of the existing neural methods for paraphrase generation focus on improving the in-domain quality of generated paraphrases. Prakash et al. (2016) and Ma et al. (2018) adjust the network architecture for larger capacity. Cao et al. (2017) and Wang et al. (2018) utilize external resources, namely a phrase dictionary and semantic annotations. Other work reinforces the paraphrase generator with a learned reward function.
Although achieving state-of-the-art performance, none of the above works considers paraphrase patterns at different levels of granularity, and their models can generate paraphrases in a way that is neither interpretable nor finely controllable. In Iyyer et al. (2018)'s work, the model is trained to produce a paraphrase of a sentence with a given syntax. In this work, we consider automatically learning controllable and interpretable paraphrasing operations from the corpus. This is also the first work to consider scalable unsupervised domain adaptation for neural paraphrase generation.

Controllable and Interpretable Text Generation
There is extensive attention on controllable neural sequence generation and its interpretation. One line of research is based on the variational auto-encoder (VAE), which captures implicit (Gupta et al., 2017) or explicit information (Hu et al., 2017; Liao et al., 2018) via latent representations. Another approach is to integrate a probabilistic graphical model, e.g., a hidden semi-Markov model (HSMM), into the neural network (Wiseman et al., 2018; Dai et al., 2016). In these works, neural templates are learned as sequential compositions of segments controlled by latent states, and are used for language modeling and data-to-text generation. Unfortunately, it is non-trivial to adapt this approach to the Seq2Seq learning framework to extract templates from both the source and the target sequence.

Domain Adaptation for Seq2Seq Learning
Domain adaptation for neural paraphrase generation is under-explored. To the best of our knowledge, Su and Yan (2017)'s work is the only one on this topic. They utilize pre-trained word embeddings and include all the words in both domains in the vocabulary, which is difficult to scale. Meanwhile, there is a considerable amount of work on domain adaptation for neural machine translation, another classic sequence-to-sequence learning task. However, most of it requires parallel data in the target domain (Wang et al., 2017a,b). In this work, we consider unsupervised domain adaptation, which is more challenging, and only two existing approaches are applicable. Gulcehre et al. (2015) use a language model trained in the target domain to guide the beam search. Domhan and Hieber (2017) optimize two stacked decoders jointly, learning a language model in the target domain and learning to translate in the source domain. In this work, we utilize the similarity of sentence templates in the source and target domains. Thanks to the decomposition of multi-grained paraphrasing patterns, DNPG can adapt quickly to a new domain without any parallel data.

Conclusion
In this paper, we have proposed a neural paraphrase generation model equipped with a decomposition mechanism. We construct this mechanism with latent variables associated with each word, and multiple Transformer models with different inductive biases that focus on paraphrase patterns at different levels of granularity. We further propose a fast and incremental method for unsupervised domain adaptation. The quantitative experimental results show that our model has competitive in-domain performance compared to state-of-the-art models, and significantly outperforms the other baselines in domain adaptation. The qualitative experiments demonstrate that the generation of our model is interpretable and controllable. In the future, we plan to investigate more efficient methods of unsupervised domain adaptation with the decomposition mechanism on other NLP tasks.