Unsupervised Paraphrasing without Translation

Paraphrasing is an important task demonstrating the ability to abstract semantic content from its surface form. Recent literature on automatic paraphrasing is dominated by methods leveraging machine translation as an intermediate step. This contrasts with humans, who can paraphrase without necessarily being bilingual. This work proposes to learn paraphrasing models only from a monolingual corpus. To that end, we propose a residual variant of the vector-quantized variational auto-encoder. Our experiments consider paraphrase identification and paraphrasing for training set augmentation, comparing to supervised and unsupervised translation-based approaches. Monolingual paraphrasing is shown to outperform unsupervised translation in all contexts. The comparison with supervised MT is more mixed: monolingual paraphrasing is competitive for identification and augmentation, but supervised MT is superior for generation.


Introduction
Many methods have been developed to generate paraphrases automatically (Madnani and Dorr, 2010). Approaches relying on Machine Translation (MT) have proven popular due to the scarcity of labeled paraphrase pairs (Callison-Burch, 2007; Mallinson et al., 2017; Iyyer et al., 2018). Recent progress in MT with neural methods (Bahdanau et al., 2014; Vaswani et al., 2017) has popularized this latter strategy. Conceptually, translation is appealing since it abstracts semantic content from its linguistic realization. For instance, assigning the same source sentence to multiple translators will result in a rich set of semantically close sentences (Callison-Burch, 2007). At the same time, bilingualism does not seem necessary for humans to generate paraphrases.
This work evaluates whether data in two languages is necessary for paraphrasing. We consider three settings: supervised translation (parallel bilingual data is used), unsupervised translation (non-parallel corpora in two languages are used) and monolingual (only unlabeled data in the paraphrasing language is used). Our comparison relies on comparable encoder-decoder neural networks for all three settings. While the literature on supervised (Bahdanau et al., 2014; Cho et al., 2014; Vaswani et al., 2017) and unsupervised translation (Lample et al., 2018a; Artetxe et al., 2018; Lample et al., 2018b) offers solutions for the bilingual settings, monolingual neural paraphrase generation has not received the same attention.
We consider discrete and continuous auto-encoders in an unlabeled monolingual setting, and contribute improvements in that context. We introduce a model based on Vector-Quantized Variational Auto-Encoders, VQ-VAE (van den Oord et al., 2017), for generating paraphrases in a purely monolingual setting. Our model introduces residual connections parallel to the quantized bottleneck. This lets us interpolate between a classical continuous auto-encoder (Vincent et al., 2010) and VQ-VAE. Compared to VQ-VAE, our architecture offers better control over the decoder entropy and eases optimization. Compared to a continuous auto-encoder, our method permits the generation of diverse but semantically close sentences from an input sentence.
We compare paraphrasing models over intrinsic and extrinsic metrics. Our intrinsic evaluation covers paraphrase identification and generation. Our extrinsic evaluation reports the impact of training augmentation with paraphrases on text classification. Overall, monolingual approaches can outperform unsupervised translation in all settings. Comparison with supervised translation shows that parallel data provides valuable information for paraphrase generation compared to purely monolingual training.
arXiv:1905.12752v1 [cs.LG] 29 May 2019

Related Work
Paraphrase Generation. Paraphrases express the same content with alternative surface forms. Their automatic generation has been studied for decades: rule-based (McKeown, 1980; Meteer and Shaked, 1988) and data-driven methods (Madnani and Dorr, 2010) have been explored. Data-driven approaches have considered different sources of training data, including multiple translations of the same text (Barzilay and McKeown, 2001; Pang et al., 2003) or alignments of comparable corpora, such as news from the same period (Dolan et al., 2004; Barzilay and Lee, 2003).
Machine translation later emerged as a dominant method for paraphrase generation. Bannard and Callison-Burch (2005) identify equivalent English phrases mapping to the same non-English phrases from an MT phrase table. Kok and Brockett (2010) perform random walks across multiple phrase tables. Translation-based paraphrasing has recently benefited from neural networks for MT (Bahdanau et al., 2014; Vaswani et al., 2017). Neural MT can generate paraphrase pairs by translating one side of a parallel corpus (Wieting and Gimpel, 2018; Iyyer et al., 2018). Paraphrase generation with pivot/round-trip neural translation has also been used (Mallinson et al., 2017; Yu et al., 2018).
Although less common, monolingual neural sequence models have also been proposed. In supervised settings, Prakash et al. (2016) and Gupta et al. (2018) learn sequence-to-sequence models on paraphrase data. In unsupervised settings, Bowman et al. (2016) apply a VAE to paraphrase detection while Li et al. (2017) train a paraphrase generator with adversarial training.
Paraphrase Evaluation. Evaluation can be performed by human raters, evaluating both text fluency and semantic similarity. Automatic evaluation is more challenging but necessary for system development and larger scale statistical analysis (Callison-Burch, 2007; Madnani and Dorr, 2010). Automatic evaluation and generation are actually linked: if an automated metric could reliably assess the semantic similarity and fluency of a pair of sentences, one could generate by searching the space of sentences to maximize that metric. Automated evaluation can report the overlap with a reference paraphrase, as for translation (Papineni et al., 2002) or summarization (Lin, 2004). BLEU, METEOR and TER metrics have been used (Prakash et al., 2016; Gupta et al., 2018). These metrics do not evaluate whether the generated paraphrase differs from the input sentence, so a large amount of input copying is not penalized. Galley et al. (2015) compare overlap with multiple references, weighted by quality, while Sun and Zhou (2012) explicitly penalize overlap with the input sentence. Grangier and Auli (2018) alternatively compare systems which have first been calibrated to a reference level of overlap with the input. We follow this strategy and calibrate the generation overlap to match the average overlap observed in paraphrases from humans.
In addition to generation, probabilistic models can be assessed through scoring. For a sentence pair (x, y), the model estimate of P(y|x) can be used to discriminate between paraphrase and non-paraphrase pairs (Dolan and Brockett, 2005). The correlation of model scores with human judgments (Cer et al., 2017) can also be assessed. We report both types of evaluation.
Finally, paraphrasing can also impact downstream tasks, e.g. to generate additional training data by paraphrasing training sentences (Marton et al., 2009; Zhang et al., 2015; Yu et al., 2018). We evaluate this impact for classification tasks.

Residual VQ-VAE for Unsupervised Monolingual Paraphrasing
Auto-encoders can be applied to monolingual paraphrasing.
Our work combines Transformer networks (Vaswani et al., 2017) and VQ-VAE (van den Oord et al., 2017), building upon recent work in discrete latent models for translation (Kaiser et al., 2018; Roy et al., 2018). VQ-VAEs, as opposed to continuous VAEs, rely on discrete latent variables. This is interesting for paraphrasing as it equips the model with explicit control over the latent code capacity, allowing the model to group multiple related examples under the same latent assignment, similarly to classical clustering algorithms (MacQueen, 1967). This is conceptually simpler and more effective than rate regularization (Higgins et al., 2016) or denoising objectives (Vincent et al., 2010) for continuous auto-encoders. At the same time, training auto-encoders with a discrete bottleneck is difficult (Roy et al., 2018). We address this difficulty with a hybrid model using a continuous residual connection around the quantization module.
We modify the Transformer encoder (Vaswani et al., 2017) as depicted in Figure 1. Our encoder maps a sentence into a fixed-size vector. This is simple and avoids choosing a fixed length compression rate between the input and the latent representation (Kaiser et al., 2018). Our strategy to produce a fixed-size representation from a Transformer is analogous to the special token employed for sentence classification by Devlin et al. (2018).
At the first layer, we extend the input sequences with one or more fixed positions which are part of the self-attention stack. At the output layer, the encoder output is restricted to these special positions, which constitute the encoder fixed-size output. As in (Kaiser et al., 2018), this vector is split into multiple heads (sub-vectors of equal dimensions), each of which goes through a quantization module. For each head $h$, the encoder output $e_h$ is quantized to its nearest codebook vector,

$$q_h(e_h) = c_k \quad \text{with} \quad k = \arg\min_i \lVert e_h - c_i \rVert_2,$$

where $\{c_i\}_{i=0}^{K}$ denotes the codebook vectors. The codebook is shared across heads, and training combines straight-through gradient estimation and exponential moving averages (van den Oord et al., 2017). The quantization module is completed with a residual connection with a learnable weight $\alpha$,

$$z_h(e_h) = \alpha e_h + (1 - \alpha)\, q_h(e_h).$$

One can observe that residual vectors and quantized vectors always have similar norms by definition of the VQ module. This is a fundamental difference with classical continuous residual networks, where the network can reduce activation norms of some modules to effectively rely mostly on the residual path. This makes $\alpha$ an important parameter for trading off continuous and discrete auto-encoding. Our learning encourages the quantized path with a squared penalty $\alpha^2$.
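As a minimal sketch of the forward pass of the residual quantization module described above, the following pure-Python code looks up the nearest codebook vector for each head and mixes it with the continuous encoder output. The function names `nearest_code` and `residual_vq`, the toy codebook and the value of `alpha` are illustrative assumptions; straight-through gradients, moving-average codebook updates and the penalty on the residual weight are not shown.

```python
def nearest_code(e_h, codebook):
    """Index of the codebook vector closest to e_h (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(e_h, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_vq(heads, codebook, alpha):
    """Forward pass only: z_h = alpha * e_h + (1 - alpha) * q_h(e_h) for every head.

    heads:    list of per-head encoder outputs (lists of floats)
    codebook: list of codebook vectors, shared across heads
    alpha:    residual weight; 0 gives a plain VQ bottleneck, 1 a continuous one
    """
    out = []
    for e_h in heads:
        q_h = codebook[nearest_code(e_h, codebook)]
        out.append([alpha * a + (1.0 - alpha) * b for a, b in zip(e_h, q_h)])
    return out


# Toy values (assumptions for illustration, not the paper's configuration).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
heads = [[0.9, 0.8], [-0.7, 1.2]]
z = residual_vq(heads, codebook, alpha=0.25)
```

Setting `alpha` to 0 recovers the purely quantized output, while `alpha` equal to 1 passes the encoder output through unchanged, which is the interpolation between discrete and continuous auto-encoding that the residual weight controls.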
After residual addition, the multiple heads of the resulting vector are presented as a matrix to which a regular Transformer decoder can attend. Models are trained to maximize the likelihood of the training set with the Adam optimizer, using the learning schedule from (Vaswani et al., 2017).
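For reference, the learning schedule of Vaswani et al. (2017) increases the learning rate linearly during warm-up and then decays it with the inverse square root of the step number. The sketch below assumes the default values from that paper (`d_model=512`, `warmup_steps=4000`); the values used in this work are not stated here, and the function name is illustrative.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (linear warm-up, then step^-0.5 decay).

    d_model and warmup_steps are assumed defaults, not values reported in this paper.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate peaks at the end of warm-up and decreases afterwards.
peak = transformer_lr(4000)
```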

Experiments & Results
We compare neural paraphrasing with and without access to bilingual data. For bilingual settings, we consider supervised and unsupervised translation using round-trip translation (Mallinson  (Chelba et al., 2013).
Monolingual Residual VQ-VAE is trained only on LM1B with $K = 2^{16}$, with 2 heads and a fixed window of size 16. We also evaluate plain VQ-VAE ($\alpha = 0$) to highlight the value of our residual modification. We further compare with a monolingual continuous denoising auto-encoder (DN-AE), with noising from Lample et al. (2018b).
Paraphrase Identification. For classification of sentence pairs (x, y) over the Microsoft Research Paraphrase Corpus (MRPC) from Dolan and Brockett (2005), we train logistic regression on P(y|x) and P(x|y) from the model, complemented with encoder outputs in fixed context settings. We also perform paraphrase quality regression on Semantic Textual Similarity (STS) from Cer et al. (2017) by training ridge regression on the same features. Finally, we perform paraphrase ranking on Multiple Translation Chinese (MTC) from Huang et al. (2002). MTC contains English paraphrases collected as translations of the same Chinese sentences from multiple translators (Mallinson et al., 2017). We pair each MTC sentence x with a paraphrase y and 100 randomly chosen non-paraphrases y'. We compare the paraphrase score P(y|x) to the 100 non-paraphrase scores P(y'|x) and report the fraction of comparisons where the paraphrase score is higher. Table 1 (left) reports that our residual model outperforms alternatives in all identification settings, except for STS, where our Pearson correlation is slightly below that of supervised translation.
Paraphrases for Data Augmentation. We augment the training set of text classification tasks for sentiment analysis on Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and question classification on Text REtrieval Conference (TREC) (Voorhees and Tice, 2000). In both cases, we double the training set size by paraphrasing each sentence and train Support Vector Machines with Naive Bayes features (Wang and Manning, 2012).
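The MTC ranking metric from the identification paragraph above can be sketched as follows: for each source sentence, the paraphrase score is compared against every non-paraphrase score and the fraction of wins is reported. The function name and the toy log-probabilities are assumptions for illustration; in the paper's setting each source sentence has 100 non-paraphrases and the scores come from the model estimate of P(y|x).

```python
def ranking_accuracy(examples):
    """Fraction of (paraphrase, non-paraphrase) comparisons won by the paraphrase.

    examples: list of (paraphrase_score, non_paraphrase_scores) pairs, where
              scores are model scores such as log P(y|x).
    """
    wins = 0
    total = 0
    for pos_score, neg_scores in examples:
        wins += sum(1 for neg in neg_scores if pos_score > neg)
        total += len(neg_scores)
    return wins / total


# Toy scores (hypothetical): two source sentences with two non-paraphrases each.
toy_examples = [(-1.0, [-2.0, -3.0]), (-4.0, [-3.5, -5.0])]
toy_accuracy = ranking_accuracy(toy_examples)
```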
In Table 2, augmentation with monolingual models yields the best performance for SST-2 sentiment classification. TREC question classification is better with supervised translation augmentation. Unfortunately, our monolingual training set LM1B does not contain many question sentences. Future work will revisit monolingual training on larger, more diverse resources.
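The augmentation procedure used for these classification experiments, doubling the training set by adding one paraphrase per sentence with the original label, can be sketched as below. The `paraphrase` callable stands in for any trained paraphrasing model; `toy_paraphrase` and the example sentences are hypothetical placeholders, not the paper's models or data.

```python
def augment_with_paraphrases(train_set, paraphrase):
    """Return the original (sentence, label) pairs plus one paraphrased copy of each."""
    augmented = list(train_set)
    for sentence, label in train_set:
        # The paraphrase keeps the label of the sentence it was generated from.
        augmented.append((paraphrase(sentence), label))
    return augmented


def toy_paraphrase(sentence):
    # Stand-in for a trained paraphrasing model (assumption for illustration).
    return sentence.replace("movie", "film")


train = [("a great movie", 1), ("a dull movie", 0)]
augmented = augment_with_paraphrases(train, toy_paraphrase)
```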
Paraphrase Generation. Paraphrase generation is evaluated on MTC. We select the 4 best translators according to the MTC documentation and paraphrase pairs with a length ratio under 1.2. Our evaluation prevents trivial copying solutions. We select the sampling temperature for all models such that their generation overlap with the input is 20.9 BLEU, the average overlap between humans on MTC. We report BLEU overlap with the target and run a blind human evaluation where raters pick the best generation among supervised translation, unsupervised translation and monolingual. Table 3 shows examples. Table 1 (right) reports that monolingual paraphrasing compares favorably with unsupervised translation while supervised translation is the best technique. This highlights the value of parallel data for paraphrase generation.

In: a worthy substitute
Out: A worthy replacement.
In: Local governments will manage the smaller enterprises.
Out: Local governments will manage smaller companies.
In: Inchon is 40 kilometers away from the border of North Korea.
Out: Inchon is 40 km away from the North Korean border.
In: Executive Chairman of Palestinian Liberation Organization, Yasar Arafat, and other leaders are often critical of aiding countries not fulfilling their promise to provide funds in a timely fashion.
Out: Yasar Arafat , executive chairman of the Palestinian Liberation Organization and other leaders are often critical of helping countries meet their pledge not to provide funds in a timely fashion.
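One possible way to calibrate the sampling temperature to a target overlap with the input, as in the generation evaluation above, is a bisection search. This is only a sketch: it assumes that overlap decreases monotonically as temperature increases, the paper does not state which search procedure was used, and `calibrate_temperature`, `toy_overlap` and the search bounds are hypothetical. In practice `overlap_at` would generate paraphrases at the given temperature and measure their BLEU against the inputs.

```python
def calibrate_temperature(overlap_at, target, lo=0.1, hi=2.0, tol=0.1, max_iter=50):
    """Bisection search for a temperature whose input overlap is within tol of target.

    overlap_at: callable mapping temperature -> overlap with the input (e.g. BLEU),
                assumed to decrease as temperature increases.
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        overlap = overlap_at(mid)
        if abs(overlap - target) <= tol:
            return mid
        if overlap > target:
            lo = mid  # too much copying: move to higher temperatures
        else:
            hi = mid  # too little overlap: move to lower temperatures
    return (lo + hi) / 2.0


def toy_overlap(temperature):
    # Hypothetical smooth, decreasing overlap curve standing in for measured BLEU.
    return 100.0 / (1.0 + 4.0 * temperature)


temperature = calibrate_temperature(toy_overlap, target=20.9)
```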

Discussions
Our experiments highlight the importance of the residual connection for paraphrase identification. From Table 1, we see that a model without the residual connection obtains 66.3% accuracy on MRPC, 10.6% Pearson correlation on STS and 69.0% ranking accuracy on MTC. Adding the residual connection improves these to 73.3%, 59.8% and 94.0% respectively.
The examples in Table 3 show paraphrases generated by the model. The overlap with the input in these examples is high. Sentences with less overlap can be generated at higher sampling temperatures; however, we observe that this strategy impairs fluency and adequacy. We plan to explore strategies that condition the decoding process on an overlap requirement instead of varying sampling temperatures (Grangier and Auli, 2018).

Conclusion
We compared neural paraphrasing with and without access to bilingual data. Bilingual settings considered supervised and unsupervised translation. Monolingual settings considered auto-encoders trained on unlabeled text and introduced continuous residual connections for discrete auto-encoders. This method is advantageous over both discrete and continuous auto-encoders. Overall, we showed that monolingual models can outperform bilingual ones for paraphrase identification and data augmentation through paraphrasing. We also reported that generation quality from monolingual models can be higher than that of models based on unsupervised translation, but not supervised translation. Access to parallel data is therefore still advantageous for paraphrase generation, and our monolingual method can be a helpful resource for languages where such data is not available.

Table 1 :
Paraphrase Identification & Generation. Identification is evaluated with accuracy on MRPC, Pearson Correlation on STS and ranking on MTC. Generation is evaluated with BLEU and human preferences on MTC.
by training ridge regression on the same features. Finally, we perform paraphrase ranking on Multiple Translation Chinese (MTC) from Huang et al. (2002).

Table 2 :
Paraphrasing for Data Augmentation: Accuracy and F1-scores of a Naive Bayes-SVM classifier on sentiment (SST-2) and question (TREC) classification.
Table 3 shows examples. Table 1 (right) reports that monolingual paraphrasing compares favorably with unsupervised translation while supervised translation is the best technique. This highlights the value of parallel data for paraphrase generation.

Table 3 :
Examples of generated paraphrases from the monolingual residual model (Greedy search).