A Semantically Consistent and Syntactically Variational Encoder-Decoder Framework for Paraphrase Generation

Paraphrase generation aims to generate semantically consistent sentences with different syntactic realizations. Most recent studies rely on the typical encoder-decoder framework, in which the generation process is deterministic. However, in practice, the ability to generate multiple syntactically different paraphrases is important. Recent work proposed to incorporate variational inference on a target-related latent variable to introduce diversity. However, the latent variable may be contaminated by the semantic information of other unrelated sentences, and in turn change the conveyed meaning of the generated paraphrases. In this paper, we propose a semantically consistent and syntactically variational encoder-decoder framework, which uses adversarial learning to ensure that the syntactic latent variable is semantic-free. Moreover, we adopt another discriminator to improve word-level and sentence-level semantic consistency. The proposed framework can therefore generate multiple semantically consistent and syntactically different paraphrases. The experiments show that our model outperforms baseline models on metrics based on both n-gram matching and semantic similarity, and that it can generate multiple different paraphrases by assembling different syntactic variables.


Introduction
Paraphrase generation is a longstanding problem in Natural Language Processing (NLP) (McKeown, 1983), which aims to generate semantically consistent sentences with different syntactic realizations for a given sentence. The task is not only an important building block for many text generation systems such as question answering (Buck et al., 2018; Dong et al., 2017) and machine translation (Cho et al., 2014), but also beneficial to NLP tasks such as semantic parsing (Su and Yan, 2017), sentence-level representation learning (Patro et al., 2018), and data augmentation (Kumar et al., 2019).
Neural network-based methods (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2018; Fu et al., 2019) have shown great progress on paraphrase generation. These models mainly rely on the sequence-to-sequence (seq2seq) learning framework (Sutskever et al., 2014) with typical encoder-decoders, which are essentially deterministic at test time. Generally, the models select the best result through beam search but are not able to produce multiple paraphrases in a principled way (Gupta et al., 2018): due to the nature of beam search, the quality of the k-th variant is worse than that of the first.
In practice, the ability to generate multiple high-quality and diverse paraphrases is an important characteristic of text generation systems. A target-oriented seq2seq model is a promising way to achieve this goal. For example, Gupta et al. (2018) applied variational inference (Kingma and Welling, 2014) to a target-related latent variable z. During testing, the model can sample multiple latent variables z from a prior distribution to generate multiple different paraphrases. The remaining problem is that z may be contaminated by the semantic information of other unrelated sentences in the training set, leading to an unexpected semantic change in the generated sentences.
In this paper, we propose to constrain the target-related latent variable z to contain only syntactic information. To achieve this goal, we introduce a syntactic encoder to extract $z_{syn}$ from the target y, and develop a discriminator trained with adversarial learning to ensure that $z_{syn}$ is semantic-free. The idea is inspired by Bao et al. (2019), who disentangled the latent space of a variational autoencoder (VAE) into semantic and syntactic spaces. However, they used the bag of words (BOWs) as the semantic supervision for adversarial training, which is not optimal because human-generated paraphrases can use quite different words while still expressing the same meaning. Instead, our model is data-driven. We do not constrain the semantic variables to be syntax-free, as the syntactic information entangled in the semantic variables will be overwritten by the target-oriented syntactic variables.

Types           | Sentences                      | Word Distance | Semantic Distance
Gold Reference  | S_r: It is an excellent film!  | -             | -
More Penalized  | S_a: It is an easy way!        | 2             | D
Less Penalized  | S_b: It is an awesome movie!   | 2             | < D

Table 1: Illustration of the problem of MLE. The sentence S_r is the reference, and the other sentences are two generated samples. S_a and S_b have the same word distance to S_r, but S_b is semantically similar to S_r. MLE equally penalizes the phrases "easy way" and "awesome movie" because they are non-target.
When considering semantic consistency, another problem arises in many text generation models: maximum likelihood estimation (MLE), implemented with the cross-entropy loss, penalizes all non-target words equally. An example is shown in Table 1. The cross-entropy loss equally penalizes the two generated sentences S_a and S_b because both have two words that do not match the gold ones, even though their semantics are quite different. That is, MLE captures the word distance well but does not precisely reflect the semantic distance. Our proposition is that sentences with a larger semantic distance should be penalized more. We develop another discriminator, which determines whether the generated sentences are semantically consistent with the references. Unlike the discriminator for the latent variable $z_{syn}$, this discriminator needs access to the sampled tokens, which causes a non-differentiability problem. We adopt Gumbel-softmax (Jang et al., 2017; Maddison et al., 2017) to make the model end-to-end differentiable, and we introduce two losses to measure both word-level and sentence-level semantic consistency.
The experiments on two datasets show that our model yields competitive results compared with other baseline models and can generate multiple syntactically different and semantically consistent paraphrases. The main contributions of this work are as follows:
• We propose a target-oriented seq2seq framework that involves different syntactic variables to generate multiple different paraphrases.
• Our method not only increases the syntactic diversity with variational inference but also improves the word-level and sentence-level semantic consistency for the generated paraphrases.
• The experiments use metrics based on both n-gram matching and semantic similarity, and demonstrate the effectiveness of our model.

Related Work
Recently, many neural network-based models have been proposed for paraphrase generation. They can be categorized into three groups: reconstruction-based learning, typical seq2seq learning, and target-oriented seq2seq learning.
Reconstruction-based Learning. The first category of studies mainly deals with paraphrase generation in an unsupervised manner by adding constraints on language models (LMs) such as RNN-LM (Mikolov et al., 2010) or VAE (Bowman et al., 2016). Kovaleva et al. (2018) introduced a similarity-based reconstruction loss to the VAE which considered similarities between words in the embedding space. Miao et al. (2019) introduced three kinds of constraints on an RNN-LM, including keyword matching, word embedding similarity, and skip-thoughts similarity. However, similarity-based losses cannot guarantee semantic consistency between two words. For example, the words "good", "great", and "bad" are all close in the embedding space because they appear in similar contexts. Recently, an intuitive approach was proposed which disentangled the latent space of the VAE into syntactic and semantic spaces (Bao et al., 2019). In their model, the constituency parse tree was used to supervise the syntactic latent variable, and the BOWs were used to supervise the semantic latent variable. Although the idea of disentanglement is promising, supervision with BOWs is not optimal because paraphrases may use quite different words and still convey the same meaning.
Typical Seq2seq Learning. The second category of studies treats paraphrase generation as a typical seq2seq task with parallel data. Prakash et al. (2016) proposed to use a seq2seq model with residual stacked LSTMs for paraphrase generation, which still performs as a strong baseline (Fu et al., 2019). Recent studies improved seq2seq models with mechanisms such as copying and constrained decoding (Cao et al., 2017), inverse reinforcement learning (Li et al., 2018), decomposition of phrase-level and sentence-level patterns, and content planning with a latent bag of words (Fu et al., 2019). When a sentence has multiple paraphrases in the training data, these models convert them into multiple pairwise sentences. From the perspective of probability modeling, these studies maximize the log conditional probability $\sum_{i=1}^{k} \log p(y_i|x)$, where x denotes the original sentence and $y_i$ is the i-th sentence among k paraphrases.
Target-oriented Seq2seq Learning. Compared with the second category of studies, the third includes the target information and instead maximizes the log probability $\sum_{i=1}^{k} \log p(y_i|x, z_{y_i})$, where $z_{y_i}$ conveys the information of the target $y_i$. This introduces a train-test discrepancy because $z_{y_i}$ is not available during testing. Gupta et al. (2018) tackled the issue by combining the seq2seq architecture with a VAE, which allows $z_{y_i}$ to be sampled from a prior distribution. The remaining problem is that $z_{y_i}$ may contain semantic information of other unrelated sentences, which may mislead the model. Ideally, for paraphrase generation, $z_{y_i}$ should convey only the syntactic information. Kumar et al. (2020) implicitly tackled this problem by focusing on a slightly different task, syntax-guided controlled paraphrase generation, which takes an exemplar as input to specify the syntactic information. As a result, the train-test discrepancy does not exist in the controlled task. However, for the traditional paraphrase generation task, constraining $z_{y_i}$ remains a problem.

Variational Autoencoder
Before introducing our model, we briefly review the architecture of the VAE (Kingma and Welling, 2014), a generative model which allows generating high-dimensional samples from a continuous latent space. In the probabilistic modeling framework, the probability of data x can be computed by:

$p(x) = \int_{z} p_{\theta}(x|z)\, p(z)\, dz \quad (1)$

Since this integral is unavailable in closed form or requires exponential time to compute (Blei et al., 2016), it is approximated by maximizing the evidence lower bound (ELBO):

$\log p(x) \geq \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - \mathrm{KL}\left(q_{\phi}(z|x)\,\|\,p(z)\right) \quad (2)$

where $p_{\theta}(x|z)$ denotes the generator with parameters θ, $q_{\phi}(z|x)$ is obtained by an encoder with parameters ϕ, $p(z)$ is a prior distribution, for example a Gaussian distribution, and KL(·||·) denotes the Kullback-Leibler (KL) divergence between the two distributions. Moreover, β-VAE (Higgins et al., 2017) uses a weight β on the KL divergence; this approach has been considered as a baseline for paraphrase generation (Fu et al., 2019).
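For concreteness, the reparameterization and KL terms of this objective can be written in a few lines. The following PyTorch sketch is illustrative only; tensor shapes, variable names, and the diagonal-Gaussian posterior are our assumptions rather than details from the cited papers.

```python
import torch

def reparameterize(mu, logvar):
    # Sample z ~ q(z|x) = N(mu, sigma^2) with the reparameterization trick,
    # so gradients can flow through the sampling step.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

# Example: a beta-VAE style objective weights the KL term with beta.
mu, logvar = torch.zeros(8, 500), torch.zeros(8, 500)  # would come from an encoder
z = reparameterize(mu, logvar)
beta = 1e-3
kl_loss = beta * kl_to_standard_normal(mu, logvar).mean()
```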

Continuous Approximation
When a text generation model involves sampling words and expects a reward from a discriminator or an evaluator, it suffers from a non-differentiability problem due to the discrete nature of text. Many studies use reinforcement learning (RL) (Lin et al., 2017; Guo et al., 2018; Li et al., 2018) or Gumbel-softmax (Jang et al., 2017; Maddison et al., 2017; Yang et al., 2018; Nie et al., 2019) to overcome this problem. In our model, we use Gumbel-softmax because it makes the model end-to-end differentiable, improving the stability and speed of training over RL (Chen et al., 2018). Assume that the model outputs a logit vector $o_t$ at the t-th timestep when generating a sentence. A softmax function is used to produce a probability distribution $p_t$ over the vocabulary:

$p_t = \mathrm{softmax}(o_t) \quad (3)$

Traditionally, a token $w_t$ is sampled from $p_t$ with multinomial sampling or an argmax operation, both of which are non-differentiable. Gumbel-softmax uses a re-parameterization trick:

$\tilde{p}_t = \mathrm{softmax}\left((o_t + g)/\tau\right) \quad (4)$

where g is sampled from Gumbel(0, 1) and τ is the temperature. When τ → 0, $\tilde{p}_t$ approximates the one-hot representation of the sampled token $w_t$. This process is a continuous approximation to multinomial sampling, and we denote it by Gumbel-softmax(·) in the following sections.
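A minimal PyTorch sketch of this approximation follows; the variable names, shapes, and numerical-stability constants are our own, and the temperature τ = 0.01 simply mirrors the training setting reported later in the experiments.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax(logits, tau=0.01):
    # logits o_t: (batch, vocab). Add Gumbel(0, 1) noise and apply a
    # temperature-scaled softmax to obtain a differentiable, near one-hot p_t.
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / tau, dim=-1)

o_t = torch.randn(4, 10000)           # decoder logits at timestep t
p_t = gumbel_softmax(o_t, tau=0.01)   # approaches one-hot as tau -> 0
# PyTorch also ships torch.nn.functional.gumbel_softmax implementing the same idea.
```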

Semantically Consistent and Syntactically Variational Encoder-Decoder
Our method belongs to the category of target-oriented seq2seq learning and aims to generate diverse paraphrases by involving target-oriented syntactic information. We assume that each paraphrase conveys the same semantics as the original sentence, and that multiple paraphrases differ from each other syntactically. The architecture of our model is shown in Figure 1. The model contains a semantic encoder, a syntactic encoder, and a decoder with parameters ϕ, φ, and θ respectively. Given the sentence x and one of its paraphrases y, the generation process can be defined as:

$Z_{sem} = \mathrm{Encoder}_{sem}(x;\ \phi) \quad (5)$

$z_{syn} \sim q_{\varphi}(z_{syn}|y) \quad (6)$

$y \sim p_{\theta}(y|Z_{sem}, z_{syn}) \quad (7)$

where $Z_{sem}$ and $z_{syn}$ denote the semantic and syntactic latent variables respectively. $Z_{sem}$ is a sequence of hidden states and $z_{syn}$ is a vector representation. Our model also incorporates the attention mechanism (Bahdanau et al., 2015): at each timestep, the decoder produces a context variable as the weighted sum of the hidden states in $Z_{sem}$ and concatenates it with $z_{syn}$ to decode each token. This process models the probability $p(y|x, z_{syn})$ instead of $p(y|x)$.
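To make the decoding step concrete, the following PyTorch sketch shows one way the attention context over $Z_{sem}$ could be concatenated with $z_{syn}$ before predicting a token. The module structure, dimensions, and attention scoring are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SynAwareDecoderStep(nn.Module):
    # Illustrative single decoding step: attend over the semantic states Z_sem,
    # concatenate the context with the syntactic variable z_syn, then predict a token.
    def __init__(self, d_model=500, vocab_size=30000):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)          # simple bilinear-style scorer
        self.rnn_cell = nn.LSTMCell(d_model, d_model)
        self.out = nn.Linear(2 * d_model, vocab_size)    # [context; z_syn] -> vocab logits

    def forward(self, emb_t, state, z_sem, z_syn):
        # emb_t: (B, d), state: ((B, d), (B, d)), z_sem: (B, T, d), z_syn: (B, d)
        h, c = self.rnn_cell(emb_t, state)
        scores = torch.bmm(z_sem, self.attn(h).unsqueeze(-1)).squeeze(-1)  # (B, T)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), z_sem).squeeze(1)        # (B, d)
        logits = self.out(torch.cat([context, z_syn], dim=-1))             # (B, V)
        return logits, (h, c)
```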
The key problem is how to constrain the syntactic variable $z_{syn}$, as y is not available during testing. Similar to the VAE, we apply variational inference on $z_{syn}$, which can be seen from the modeling of the likelihoods $p(y, x)$ and $p(y|x)$:

$p(y, x) = \int_{z_{syn}} p(y|x, z_{syn})\, p(x)\, p(z_{syn})\, dz_{syn} \quad (8)$

where $z_{syn} \perp x$ means that $z_{syn}$ is independent of x. Since $p(x)$ can be moved outside of the integral, we divide both sides of Equation 8 by $p(x)$ to obtain the conditional probability:

$p(y|x) = \int_{z_{syn}} p(y|x, z_{syn})\, p(z_{syn})\, dz_{syn} \quad (9)$

$\log p(y|x) \geq \mathbb{E}_{q_{\varphi}(z_{syn}|y)}\left[\log p_{\theta,\phi}(y|x, z_{syn})\right] - \mathrm{KL}\left(q_{\varphi}(z_{syn}|y)\,\|\,p(z_{syn})\right) \quad (10)$

where maximizing the log likelihood $\log p(y|x)$ is approximated by maximizing the ELBO. Here $p_{\theta,\phi}(y|x, z_{syn})$ is modeled by Equations 5 and 7, and the posterior $q_{\varphi}(z_{syn}|y)$ is modeled by Equation 6. The first term of Equation 10 is considered as the target-oriented seq2seq loss, denoted by $L_{tos2s}(\phi; \theta)$, and the second term is the KL loss, denoted by $L_{KL}(\varphi)$.

Adversarial Learning for Syntactic Variables
Two assumptions are needed to make Equations 8–10 hold: 1) $z_{syn}$ is independent of x; 2) $z_{syn}$ contains only the syntactic information of y. Since $z_{syn}$ is extracted from y by Equation 6, the first assumption is met if $z_{syn}$ does not contain the information shared by x and y, which is typically the semantic information. The second assumption also requires that $z_{syn}$ does not contain the semantic information. Therefore, we use adversarial learning to keep $z_{syn}$ semantic-free.
Given $z^x_{syn} \sim q_{\varphi}(z^x_{syn}|x)$ and $z^y_{syn} \sim q_{\varphi}(z^y_{syn}|y)$, the syntactic variables of the original sentence x and the paraphrase y respectively, we employ a discriminator with a trainable weight $W_{syn} \in \mathbb{R}^{4d_{syn} \times c}$, where $d_{syn}$ denotes the dimension of the syntactic variables and c = 2 indicates a binary classification. The probability of whether $z^x_{syn}$ and $z^y_{syn}$ contain the same semantic information is computed by:

$p_{x,y} = \mathrm{softmax}\left(\left[z^x_{syn}, z^y_{syn}, |z^x_{syn} - z^y_{syn}|, z^x_{syn} \odot z^y_{syn}\right] W_{syn}\right) \quad (11)$

where |·| means taking the absolute value, ⊙ denotes element-wise multiplication, and [,] denotes concatenation. Moreover, we construct negative samples by randomly sampling a sentence $\bar{x} \neq x$ from the dataset; the corresponding predicted probability is denoted by $p_{\bar{x},y}$. Then the loss of the discriminator is computed by:

$L^d_{syn}(\chi) = \mathrm{CE}(p_{x,y}, p_{pos}) + \mathrm{CE}(p_{\bar{x},y}, p_{neg}) \quad (12)$

where CE(·,·) denotes the cross-entropy, $p_{pos} = [1, 0]$ and $p_{neg} = [0, 1]$ are the labels for the positive pair (x, y) and the negative pair $(\bar{x}, y)$ respectively, and χ denotes the parameters ($W_{syn}$) of the discriminator. Equation 12 means that the discriminator tries to recognize the semantic information shared between x and y. The syntactic encoder then acts as the generator and tries to fool the discriminator by minimizing the loss $L^g_{syn}(\varphi)$. The generator and the discriminator play an adversarial game by minimizing $L^g_{syn}(\varphi)$ and $L^d_{syn}(\chi)$ alternately. Combined with the other losses, the overall objective of our model consists of two parts: the total loss for the generator and the loss for the discriminator.
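A minimal PyTorch sketch of this adversarial component is given below. The feature concatenation follows the reconstruction in Equation 11, while the label-flipping form of the generator loss is an assumption on our part rather than the authors' stated formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntacticDiscriminator(nn.Module):
    # Binary classifier judging whether two syntactic variables share semantics.
    def __init__(self, d_syn=500):
        super().__init__()
        self.w_syn = nn.Linear(4 * d_syn, 2, bias=False)  # plays the role of W_syn

    def forward(self, z_x, z_y):
        feats = torch.cat([z_x, z_y, (z_x - z_y).abs(), z_x * z_y], dim=-1)
        return self.w_syn(feats)  # logits over {shared semantics, not shared}

disc = SyntacticDiscriminator()
z_x, z_y, z_neg = torch.randn(3, 8, 500).unbind(0)   # (x, y) pair and a random sentence

# Discriminator step: label the paraphrase pair as positive, the random pair as negative.
d_loss = F.cross_entropy(disc(z_x, z_y), torch.zeros(8, dtype=torch.long)) + \
         F.cross_entropy(disc(z_neg, z_y), torch.ones(8, dtype=torch.long))

# Generator step (illustrative): the syntactic encoder tries to fool the
# discriminator, e.g. by minimizing the loss with the labels flipped.
g_loss = F.cross_entropy(disc(z_x, z_y), torch.ones(8, dtype=torch.long))
```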

Ensuring Semantic Consistency
There remains a train-test discrepancy: $z_{syn} \sim q_{\varphi}(z_{syn}|y)$ during training while $z_{syn} \sim p(z_{syn})$ during testing. Minimizing the KL divergence between $q_{\varphi}(z_{syn}|y)$ and $p(z_{syn})$ helps reduce the discrepancy, but does not provide an end-to-end guarantee of semantic consistency. Therefore, we further employ another discriminator $D_{\psi}$ with parameters ψ, which consists of a sentence encoder and a fully-connected neural network followed by a softmax function. For two arbitrary sentences represented by sequences of one-hot vectors $u \in \mathbb{R}^{T \times V}$ and $v \in \mathbb{R}^{T \times V}$, where T and V denote the maximum sentence length and the vocabulary size respectively, the discriminator predicts the probability $p_{u,v} \in \mathbb{R}^2$ of whether the two sentences are semantically consistent. Traditionally, when $z_{syn} \sim q_{\varphi}(z_{syn}|y)$, the model minimizes $L_{tos2s}(\phi; \theta)$ with MLE:

$L_{tos2s}(\phi; \theta) = -\sum_{t=1}^{T} \log p_{\theta,\phi}(y_t|y_{<t}, x, z_{syn})$

where $y_t$ denotes the reference token at the t-th timestep and $y_{<t}$ denotes the sequence of tokens preceding $y_t$. However, when $z_{syn} \sim p(z_{syn})$, the syntactic information is different from that of $z_{syn} \sim q_{\varphi}(z_{syn}|y)$, and the predicted tokens are therefore not required to match all the tokens of y. Instead, we assume that there is a set of semantically consistent words $W_c(y_t)$ with respect to $y_t$; using a word from this set will not change the conveyed meaning. Accordingly, our objective is to ensure word-level semantic consistency (WSC). We construct a sequence of tokens represented by one-hot vectors $\hat{y} = (\hat{y}_1, \hat{y}_2, ..., \hat{y}_T)$, obtained by replacing a certain ratio (η) of tokens in y with predicted tokens sampled from the predicted probability distribution $p_t \in \mathbb{R}^V$:

$\hat{y}_t = \begin{cases} \text{Gumbel-softmax}(p_t), & \text{if } \mathrm{rand}() < \eta \\ y_t, & \text{otherwise} \end{cases}$

where rand() samples numbers between 0 and 1 from the uniform distribution. The loss for word-level semantic consistency, $L_{wsc}$, is then computed from the prediction of the discriminator $D_{\psi}$, which encourages $\hat{y}$ to be classified as semantically consistent with the reference. Moreover, we further reduce the train-test discrepancy by mitigating the exposure bias problem (Ranzato et al., 2016). We let each token be generated conditioned on previously generated tokens instead of gold ones, and obtain sentence-level feedback from the discriminator; the objective here is to ensure sentence-level semantic consistency (SSC). $S_c(y)$ denotes the set of semantically consistent sentences, and $\bar{y} = (\bar{y}_1, \bar{y}_2, ..., \bar{y}_T)$ denotes the sequence of generated tokens in one-hot representation. The discriminator is also trained with positive samples (x, y) and negative samples $(\bar{x}, y)$ to learn to predict whether two sentences are semantically consistent. The final objective combines all of the above losses, where $\lambda_{KL}$, $\lambda^g_{syn}$, $\lambda_{sc}$, and $\lambda^d_{syn}$ are hyperparameters that balance the individual terms.
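As an illustration of the word-level consistency mechanism, the sketch below mixes gold tokens with Gumbel-softmax samples at a ratio η. Function and variable names are hypothetical, and the exact pairing fed to $D_{\psi}$ may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mix_tokens(gold_ids, logits, eta=0.5, tau=0.01, vocab_size=30000):
    # Build y_hat: each position keeps the gold token (as a one-hot vector) with
    # probability 1 - eta, and otherwise uses a differentiable Gumbel-softmax
    # sample from the model's predicted distribution p_t.
    gold_one_hot = F.one_hot(gold_ids, vocab_size).float()          # (B, T, V)
    sampled = F.gumbel_softmax(logits, tau=tau, dim=-1)             # (B, T, V), soft one-hot
    replace = (torch.rand(gold_ids.shape, device=gold_ids.device) < eta).float().unsqueeze(-1)
    return replace * sampled + (1.0 - replace) * gold_one_hot

# The mixed sequence y_hat can then be paired with the reference and fed to the
# semantic-consistency discriminator D_psi to obtain a word-level loss signal.
gold_ids = torch.randint(0, 30000, (4, 20))
logits = torch.randn(4, 20, 30000)
y_hat = mix_tokens(gold_ids, logits)
```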

Datasets
Following previous work on paraphrase generation, we experiment on two datasets: Quora and MSCOCO (Lin et al., 2014). The Quora dataset was originally developed for duplicate question detection and contains about 140k paraphrase pairs and 260k non-paraphrase pairs. We only use the paraphrase pairs and hold out 3k and 30k samples as the validation and test sets respectively. We set the maximum decoding length to 20, which covers the lengths of 95% of the sentences. The MSCOCO dataset was originally developed for image captioning, and each image has 5 captions. In our experiments, we randomly choose 1 of the 5 captions as the source and use the remaining 4 captions as the targets. The original dataset contains about 80k and 40k samples in the train and test sets respectively. We randomly hold out about 4k samples from the train set as the validation set. The detailed statistics of the two datasets are shown in Table 2.

Evaluation and Settings
The evaluation of paraphrase generation remains an open issue. Most previous studies (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2018; Bao et al., 2019; Fu et al., 2019) adopt metrics based on n-gram matching, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). To compare our model with them, we also report the n-gram metrics (1–4 grams for BLEU, 1–2 grams for ROUGE). However, we observe that these metrics are not always sufficient to evaluate semantic consistency, because human-generated paraphrases have lower BLEU and ROUGE scores than machine-generated ones on the MSCOCO dataset (discussed in Section 5.3). Therefore, we further employ BERTCS (Reimers and Gurevych, 2019), which computes the cosine similarity of sentence-level embeddings from a fine-tuned BERT (Devlin et al., 2019). We choose the BERT-base model fine-tuned on the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets with mean-tokens pooling. Moreover, since simply copying the source sentence is not an interesting model but definitely yields semantically consistent outputs, we also evaluate the syntactic difference from the source sentence with BLEU-ori (up to 4 grams), which was recently used to evaluate reconstruction-based models (Miao et al., 2019; Bao et al., 2019).

Compared Models. We compare our model with the three categories of existing methods introduced in Section 2. The reconstruction-based models include β-VAE (Higgins et al., 2017) with β = 1e-3 and β = 1e-4, and DSS-VAE (Bao et al., 2019). The typical seq2seq models include a vanilla seq2seq LSTM with (or without) the attention mechanism (Bahdanau et al., 2015), and LBOW-Topk, the state-of-the-art (SOTA) model (Fu et al., 2019). The compared target-oriented seq2seq model is the variational encoder-decoder (VAE-SVG-eq) (Gupta et al., 2018). Since variational models can generate multiple paraphrases for a source sentence by sampling multiple latent variables, we select the best one with the highest BERTCS score computed against the source sentence (not against the reference sentences, because they are not available in practice). This searching mechanism is also used by Gupta et al. (2018) and is denoted by VarSearch in the following sections. We search 5 times for both VAE-SVG-eq and our model.
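As a reference for how BERTCS-style scores can be obtained, the snippet below uses the sentence-transformers library with a BERT-base NLI checkpoint and mean-tokens pooling; the specific checkpoint name is our assumption and is not necessarily the exact one used in the paper.

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Encode both sentences with a BERT-base model fine-tuned on NLI data using
# mean-tokens pooling, then score semantic consistency by cosine similarity.
model = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed checkpoint

def bertcs(candidate, reference):
    emb = model.encode([candidate, reference], convert_to_tensor=True)
    return F.cosine_similarity(emb[0], emb[1], dim=0).item()

score = bertcs("it is an awesome movie !", "it is an excellent film !")
```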
Hyperparameters. Word embeddings are 300-dimensional and initialized with GloVe (Pennington et al., 2014). The encoders and the decoder are two-layer LSTMs with a state size of 500. The dimension of the latent space is also set to 500. We use a fixed temperature of τ = 0.01 for Gumbel-softmax during training. The loss weights are $\lambda_{KL}$ = 0.2 (with the annealing trick), $\lambda^g_{syn}$ = 0.5, $\lambda_{sc}$ = 0.5, and $\lambda^d_{syn}$ = 0.5. The replacement ratio η for word-level semantic consistency is set to 0.5. The learning rate of all models is set to $5 \times 10^{-4}$ and the batch size to 32. All models are trained for 15 epochs. We report the metrics averaged over 3 repeated training runs.
Tables 3 and 4 show the overall performance of the different models. To understand what constitutes an applaudable score on each metric, we run a preliminary experiment with a copying model and a random-sampling model, which can be considered as the upper and lower bounds for the metrics. Higher B-i, R-j, and BERTCS scores indicate better consistency with the reference sentences, while lower BLEU-ori scores indicate a larger syntactic difference from the source sentences. An interesting finding on the MSCOCO dataset is that the source sentences, which are human-generated paraphrases with respect to the reference sentences, have lower B-i and R-j scores than the machine-generated ones. A possible reason is that humans use diverse n-grams while still expressing the same meaning, whereas machines prefer high-frequency n-grams; the BERTCS scores confirm the high semantic consistency of the human-generated paraphrases. Overall, our model with variational search achieves competitive B-i and R-j scores and the best BERTCS scores on the Quora and MSCOCO datasets. Compared with the previous SOTA model LBOW-Topk, our model improves B-4 by 1.20 and 2.97 points on Quora and MSCOCO respectively. Compared with Seq2Seq-Att, our model improves B-4 and BERTCS by 3.35 and 2.72 points respectively on Quora, and by 2.92 and 1.19 points respectively on MSCOCO. When compared with variational models including β-VAE and VAE-SVG-eq, our model also outperforms them by a large margin. The reason may be that the sampled variational latent variables in their models contain semantic information and lead to a change of the conveyed meaning. DSS-VAE, which disentangles the semantic and syntactic representations, outperforms β-VAE with an increase in B-4 and a decrease in BLEU-ori on Quora, but does not outperform the seq2seq models. This suggests that disentangling the latent spaces is not sufficient to guarantee that the decoder of the VAE generates semantically consistent sentences.

Table 5: Results of the ablation study.

Ablation Study
To analyze which mechanisms drive the improvements, we present an ablation study in Table 5. We eliminate sentence-level and word-level semantic consistency (SSC and WSC) and syntactic adversarial learning (SynAdv) one by one, which results in three ablated models. Further eliminating the variational inference of the syntactic variables yields the Seq2Seq-Att model. Generally, all three mechanisms are influential. For example, eliminating the two semantic consistency losses leads to a total drop in BERTCS of 0.82 and 0.65 points on Quora and MSCOCO respectively. When SynAdv is further eliminated, the model performs worse than Seq2Seq-Att, which demonstrates the importance of keeping the syntactic variable semantic-free.

Table 6: An example of the generated sentences of the models on the MSCOCO dataset.

Case Study
To help understand our model, we present a case study in Table 6. For the MSCOCO dataset, each image has multiple diverse captions; we show the source and two gold references for an image. After training, Seq2Seq-Att and our model both produce three paraphrases for the given source, and BERTCS scores are reported to measure their semantic consistency with respect to the source sentence. Following traditional seq2seq models, we choose the top 3 results of the beam search for Seq2Seq-Att; the three generated sentences lack diversity. In contrast, our model generates 3 paraphrases by sampling 3 different latent variables $z^i_{syn}$, $z^j_{syn}$, $z^k_{syn}$, which produces high-quality and diverse paraphrases. It is worth noting that the variable $z_{syn}$ is data-driven, which means the information in $z_{syn}$ may not perfectly match human-defined syntax. Moreover, the references may contain additional information beyond the source, which is not easy to learn statistically. This phenomenon can explain why the BLEU and ROUGE scores of the references are lower than those of the machine-generated sentences in Table 5. However, the key information is preserved.

Conclusion
In this paper, we propose a semantically consistent and syntactically variational encoder-decoder framework for paraphrase generation, which enables the model to generate different paraphrases according to different syntactic variables. We first introduce an adversarial learning method to ensure that the variational syntactic variable is not contaminated by semantic information, and further develop word-level and sentence-level objectives to ensure that the generated sentences are semantically consistent. The experiments show that our model yields competitive results and can generate high-quality and diverse paraphrases.