Generating Sentences from Disentangled Syntactic and Semantic Spaces

Variational auto-encoders (VAEs) are widely used in natural language generation due to the regularization of the latent space. However, generating sentences from the continuous latent space does not explicitly model the syntactic information. In this paper, we propose to generate sentences from disentangled syntactic and semantic spaces. Our proposed method explicitly models syntactic information in the VAE’s latent space by using the linearized tree sequence, leading to better performance of language generation. Additionally, the advantage of sampling in the disentangled syntactic and semantic latent spaces enables us to perform novel applications, such as the unsupervised paraphrase generation and syntax transfer generation. Experimental results show that our proposed model achieves similar or better performance in various tasks, compared with state-of-the-art related work.


Introduction
Variational auto-encoders (VAEs, Kingma and Welling, 2014) are widely used in language generation tasks (Serban et al., 2017; Kusner et al., 2017; Semeniuta et al., 2017; Li et al., 2018b). A VAE encodes a sentence into a probabilistic latent space, from which it learns to decode the same sentence. In addition to the traditional reconstruction loss of an autoencoder, a VAE employs an extra regularization term that penalizes the Kullback-Leibler (KL) divergence between the encoded posterior distribution and its prior. This property enables us to sample and generate sentences from the continuous latent space. Additionally, we can even manually manipulate the latent space, inspiring various applications such as sentence interpolation (Bowman et al., 2016) and text style transfer (Hu et al., 2017).
However, the continuous latent space of a VAE blends syntactic and semantic information together, without modeling the syntax explicitly. We argue that this is not necessarily the best choice for text generation. Recently, researchers have shown that explicit syntactic modeling improves generation quality in sequence-to-sequence models (Eriguchi et al., 2016; Zhou et al., 2017; Li et al., 2017; Chen et al., 2017). It is natural to adopt this idea in the VAE setting, since a vanilla VAE does not explicitly model the syntax. A line of studies (Kusner et al., 2017; Gómez-Bombarelli et al., 2018; Dai et al., 2018) proposes to impose context-free grammars (CFGs) as hard constraints in the VAE decoder, so that it generates syntactically valid outputs such as programs and molecules. However, these approaches do not model syntax in the VAE's continuous latent space, and thus do not offer the two benefits of VAE, namely sampling and manipulation, with respect to the syntax of a sentence.
In this paper, we propose to generate sentences from disentangled syntactic and semantic spaces of a VAE (called DSS-VAE). DSS-VAE explicitly models syntax in the continuous latent space of VAE, while retaining the sampling and manipulation benefits. In particular, we introduce two continuous latent variables to capture semantics and syntax, respectively. To separate the semantic and syntactic information from each other, we borrow the adversarial approaches from text style-transfer research (Hu et al., 2017; Fu et al., 2018; John et al., 2018), but adapt them to our scenario of syntactic modeling. We also observe that syntax and semantics are highly interwoven, and therefore further propose an adversarial reconstruction loss to regularize the syntactic and semantic spaces.

arXiv:1907.05789v1 [cs.CL] 6 Jul 2019
Our proposed DSS-VAE has the following advantages. First, explicit syntactic modeling in the VAE's latent space improves the quality of unconditional language generation. Experiments show that, compared with a traditional VAE, DSS-VAE generates more fluent sentences (lower perplexity), while preserving more of the encoded information (higher BLEU scores for reconstruction). Comparisons with a state-of-the-art syntactic language model (Shen et al., 2017) are also included.
Second, the ability to manipulate the syntactic and semantic spaces of DSS-VAE provides a natural way of performing unsupervised paraphrase generation. If we sample a vector in the syntactic space but perform maximum a posteriori (MAP) inference in the semantic space, we are able to generate a sentence with the same meaning but different syntax. This is known as unsupervised paraphrase generation, as no parallel corpus is needed during training. Experiments show that DSS-VAE outperforms the traditional VAE as well as a state-of-the-art Metropolis-Hastings sampling approach (Miao et al., 2019) in this task.
Additionally, with the disentangled syntactic and semantic latent spaces, we propose an application that transfers the syntax of one sentence to another. Both qualitative and quantitative experimental results show that DSS-VAE can graft the designated syntax onto another sentence under certain circumstances.

Related Work
The variational auto-encoder (VAE) was proposed by Kingma and Welling (2014) for image generation. Bowman et al. (2016) successfully applied VAE in the NLP domain, showing that VAE improves recurrent neural network (RNN)-based language modeling (RNN-LM, Mikolov et al., 2010), and that VAE allows sentence sampling and sentence interpolation in the continuous latent space. Later, VAE became widely used in various natural language generation tasks (Gupta et al., 2018; Kusner et al., 2017; Hu et al., 2017; Deriu and Cieliebak, 2018).
Syntactic language modeling, to the best of our knowledge, dates back to Chelba (1997). Charniak (2001) and Clark (2001) propose to utilize a top-down parsing mechanism for language modeling. Dyer et al. (2016) and Kuncoro et al. (2017) introduce neural networks to this direction. The Parsing-Reading-Predict Network (PRPN, Shen et al., 2017), which reports state-of-the-art results on syntactic language modeling, learns a latent syntax by training with a language modeling objective. Different from their work, our approach models syntax in a continuous space, facilitating sampling and manipulation of syntax.
Our work is also related to style-transfer text generation (Fu et al., 2018; Li et al., 2018a; John et al., 2018). In previous work, the style is usually defined by categorical features such as sentiment. We move one step forward, extending their approach to the sequence level and dealing with more complicated, non-categorical syntactic spaces. Due to the complexity of syntax, we further design adversarial reconstruction losses to encourage the separation of syntax and semantics.

Approach
In this section, we present our proposed DSS-VAE in detail. We first introduce the variational autoencoder in §3.1. Then, we describe the general architecture of DSS-VAE in §3.2, where we explain how we generate sentences from disentangled syntactic and semantic latent spaces and how we disentangle information from the two separated spaces. Model training is discussed in §3.3.

Variational Autoencoder
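A traditional VAE employs a probabilistic latent variable $z$ to encode the information of a sentence $x$, and then decodes the original $x$ from $z$. The probability of a sentence $x$ could be computed as:

$$p(x) = \int p(z)\, p(x|z)\, dz \tag{1}$$

where $p(z)$ is the prior, and $p(x|z)$ is given by the decoder. VAE is trained by maximizing the evidence lower bound (ELBO):

$$\log p(x) \ge \mathrm{ELBO} = \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - \mathrm{KL}\left(q(z|x)\,\|\,p(z)\right) \tag{2}$$

where $q(z|x)$ is the variational posterior.

DSS-VAE replaces the single latent variable $z$ with two latent variables, $z_{sem}$ and $z_{syn}$, which capture semantics and syntax, respectively. The ELBO then becomes:

$$\mathrm{ELBO} = \mathbb{E}_{q(z_{sem}|x)\,q(z_{syn}|x)}\left[\log p(x|z_{sem}, z_{syn})\right] - \mathrm{KL}\left(q(z_{sem}|x)\,\|\,p(z_{sem})\right) - \mathrm{KL}\left(q(z_{syn}|x)\,\|\,p(z_{syn})\right)$$

where $q(z_{sem}|x)$ and $q(z_{syn}|x)$ are posteriors for the two latent variables. We further assume the variational posterior families, $q(z_{sem}|x)$ and $q(z_{syn}|x)$, are independent, taking the form $\mathcal{N}(\mu_{sem}, \sigma^2_{sem})$ and $\mathcal{N}(\mu_{syn}, \sigma^2_{syn})$, respectively. We use an RNN to parameterize the posteriors (also called the encoder). Here, $\mu_{sem}$, $\sigma_{sem}$, $\mu_{syn}$, and $\sigma_{syn}$ are predicted by the encoder network, described as follows.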
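To make the regularization term concrete, the following minimal sketch computes the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, and combines two such terms into a two-space ELBO. It assumes the priors are $\mathcal{N}(0, I)$ and that the reconstruction log-likelihood is already available as a number; the function names are illustrative and do not come from the authors' code.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Assumes a standard normal prior; sigma holds standard deviations (> 0).
    """
    return sum(
        0.5 * (m * m + s * s - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )


def two_space_elbo(recon_log_lik, mu_sem, sigma_sem, mu_syn, sigma_syn):
    """Reconstruction log-likelihood minus one KL term per latent space."""
    kl_sem = kl_to_standard_normal(mu_sem, sigma_sem)
    kl_syn = kl_to_standard_normal(mu_syn, sigma_syn)
    return recon_log_lik - kl_sem - kl_syn


# A posterior equal to the prior has zero KL, so the ELBO equals the
# reconstruction term; shifting the semantic mean by 1 costs 0.5 nats.
print(two_space_elbo(-10.0, [0.0], [1.0], [0.0], [1.0]))
print(two_space_elbo(-10.0, [1.0], [1.0], [0.0], [1.0]))
```

Keeping the two KL terms separate is what later allows each space to receive its own KL weight.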
Encoding In the encoding phase, we first obtain the sentence representation $r_x$ by an RNN with gated recurrent units (GRUs, Cho et al., 2014); then, $r_x$ is evenly split into two parts, $r_x = [r_x^{sem}; r_x^{syn}]$. For the semantic encoder, we compute the mean and variance of $q(z_{sem}|x)$ from $r_x^{sem}$ as:

$$\mu_{sem} = W^{\mu}_{sem}\,\mathrm{ReLU}(W_{sem}\, r_x^{sem} + b_{sem})$$
$$\sigma_{sem} = W^{\sigma}_{sem}\,\mathrm{ReLU}(W_{sem}\, r_x^{sem} + b_{sem})$$

where the activation function is the rectified linear unit (ReLU, Nair and Hinton, 2010). $W^{\mu}_{sem}$, $W^{\sigma}_{sem}$, $W_{sem}$, and $b_{sem}$ are the parameters of the semantic encoder.
Likewise, a syntactic encoder predicts $\mu_{syn}$ and $\sigma_{syn}$ for $q(z_{syn}|x)$ in the same way, with parameters $W^{\mu}_{syn}$, $W^{\sigma}_{syn}$, $W_{syn}$, and $b_{syn}$.

Decoding in the Training Phase We first sample from the posterior distributions by the reparameterization trick (Kingma and Welling, 2014), obtaining sampled semantic and syntactic representations, $z_{sem}$ and $z_{syn}$; then, they are concatenated as $z = [z_{sem}; z_{syn}]$ and fed as the initial state of the decoder for reconstruction.

Decoding in the Test Phase The treatment depends on the application. If we would like to synthesize a sentence from scratch, both $z_{syn}$ and $z_{sem}$ are sampled from the prior. If we would like to preserve or vary the semantics or the syntax, maximum a posteriori (MAP) inference or sampling can be applied in the respective space. Details are provided in §4.
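The training-phase decoding step can be sketched as follows. This is only an illustration of the reparameterization trick and the concatenation $z = [z_{sem}; z_{syn}]$ on plain Python lists; the encoder outputs are taken as given, and the helper names are hypothetical.

```python
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) for each dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def decoder_initial_state(mu_sem, sigma_sem, mu_syn, sigma_syn, rng):
    """Sample both spaces and concatenate them as z = [z_sem; z_syn]."""
    z_sem = reparameterize(mu_sem, sigma_sem, rng)
    z_syn = reparameterize(mu_syn, sigma_syn, rng)
    return z_sem + z_syn  # list concatenation


rng = random.Random(0)
z = decoder_initial_state([0.1, 0.2], [1.0, 1.0], [0.3, 0.4], [1.0, 1.0], rng)
print(len(z))  # 4: two semantic dimensions followed by two syntactic ones
```

Because the noise enters only through `eps`, the sample is a deterministic function of the predicted mean and standard deviation, which is what lets gradients reach the encoder in a real implementation.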
In the following part, we will introduce how syntax is modeled in our approach and how syntax and semantics are ensured to be separated.

Modeling Syntax by Predicting Linearized Tree Sequence
While previous studies have tackled the problem of categorical sentiment modeling in the latent space (Hu et al., 2017; Fu et al., 2018), syntax is much more complicated and not finitely categorical. We propose to adopt the linearized tree sequence to explicitly model syntax in the latent space of VAE.
Figure 1 shows the constituency parse tree of the sentence "This is an interesting idea." The linearized tree sequence can be obtained by traversing the syntactic tree in a top-down order; if the node is non-terminal, we add a backtracking node (e.g., /NP) after its child nodes are traversed.
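The traversal described above can be sketched in a few lines. The example below assumes that part-of-speech tags act as the leaves of the tree (words are omitted) and uses an assumed bracketing for the example sentence, since the figure itself is not reproduced here; the tuple-based tree format is purely illustrative.

```python
def linearize(tree):
    """Top-down traversal that emits '/LABEL' after a non-terminal's children.

    A tree is a (label, children) pair; a node with no children is a leaf.
    """
    label, children = tree
    if not children:
        return [label]
    tokens = [label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append("/" + label)  # backtracking node closes the constituent
    return tokens


# Assumed parse of "This is an interesting idea." with POS tags as leaves.
tree = ("S", [
    ("NP", [("DT", [])]),
    ("VP", [("VBZ", []), ("NP", [("DT", []), ("JJ", []), ("NN", [])])]),
    (".", []),
])
print(" ".join(linearize(tree)))
# S NP DT /NP VP VBZ NP DT JJ NN /NP /VP . /S
```

The backtracking tokens make the sequence unambiguous: the original bracketing can be recovered by matching each label with its closing counterpart.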
We ensure that $z_{syn}$ contains syntactic information by predicting the linearized tree sequence.
In training, the parse trees of the sentences are obtained with the ZPar toolkit and serve as the ground-truth training signals; in testing, we do not need external syntactic trees. We build an RNN (independent of the VAE's decoder) to predict such linearized parse trees, where each parsing token is represented by an embedding (similar to a traditional RNN decoder). Notice that a node and its backtracking node, e.g., NP and /NP, have different embeddings.
The linearized tree sequence has achieved promising parsing results in traditional constituency parsing (Vinyals et al., 2015; Liu et al., 2018; Vaswani et al., 2017), which shows its ability to preserve syntactic information. Additionally, the linearized tree sequence works in a sequence-to-sequence fashion, so that it can be used to regularize the latent spaces.

Disentangling Syntax and Semantics into Different Latent Spaces
Having solved the problem of syntactic modeling, we now turn to the question: how could we disentangle syntax and semantics from each other?
We are inspired by the research in text style transfer and apply auxiliary losses to regularize the latent space (Hu et al., 2017; Fu et al., 2018).
In particular, we adopt the multi-task and adversarial losses of John et al. (2018), but extend them to the sequence level. In §3.2.3, we further propose two adversarial reconstruction losses to discourage the model from encoding a sentence in a single subspace.
Multi-Task Loss Intuitively, a multi-task loss ensures that each space ($z_{syn}$ or $z_{sem}$) captures its respective information.
For the semantic space, we predict the bag-of-words (BoW) distribution of a sentence from $z_{sem}$ with softmax, whose objective is the cross-entropy loss against the ground-truth distribution $t$, given by:

$$\mathcal{L}^{(mul)}_{sem} = -\sum_{w \in V} t_w \log p(w|z_{sem}) \tag{3}$$

where $V$ is the vocabulary and $p(w|z_{sem})$ is the predicted distribution. BoW has been explored by previous work (Weng et al., 2017; John et al., 2018), showing a good ability to preserve semantics.
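As a small illustration of the BoW objective, the sketch below builds a target distribution from a tokenized sentence and computes the cross-entropy against a predicted distribution. It assumes that $t$ is the normalized word frequency of the sentence, which the text does not spell out, and it represents distributions as plain dictionaries; both choices are assumptions made for this example.

```python
import math
from collections import Counter


def bow_distribution(tokens):
    """Normalized word frequencies of one sentence (assumed form of t)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def bow_cross_entropy(target, predicted, eps=1e-12):
    """-sum_w t_w log p(w | z_sem); eps guards against log(0)."""
    return -sum(
        t * math.log(predicted.get(word, 0.0) + eps)
        for word, t in target.items()
    )


t = bow_distribution("the dog saw the cat".split())
good = dict(t)                                         # matches the target
poor = {"the": 0.1, "dog": 0.1, "saw": 0.1, "cat": 0.7}
print(bow_cross_entropy(t, good) < bow_cross_entropy(t, poor))  # True
```

Word order plays no role in this loss, which is why it is a natural signal for the semantic space but says nothing about syntax.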
For the syntactic space, the multi-task loss trains a model to predict syntax from $z_{syn}$. Following our proposal in §3.2.1, we build a dedicated RNN that predicts the tokens in the linearized parse tree sequence, whose loss is:

$$\mathcal{L}^{(mul)}_{syn} = -\sum_{i=1}^{n} \log p(s_i | s_1 \cdots s_{i-1}, z_{syn}) \tag{4}$$

where $s_i$ is a token in the linearized parse tree (with a total length of $n$).
Adversarial Loss The adversarial loss is widely used for aligning samples from different distributions. It has various applications, including style transfer (Hu et al., 2017; Fu et al., 2018; John et al., 2018) and domain adaptation (Tzeng et al., 2017). To apply adversarial losses, we add extra model components (known as adversaries) that predict the semantic information $t_w$ from the syntactic space $z_{syn}$, and the syntactic information $s_1 \cdots s_{n-1}$ from the semantic space $z_{sem}$. They are denoted by $p_{adv}(w|z_{syn})$ and $p_{adv}(s_i|s_1 \cdots s_{i-1}, z_{sem})$, respectively. The training of these adversaries is similar to (3) and (4), except that the gradient only trains the adversaries themselves and does not back-propagate to the VAE.
Then, the VAE is trained to "fool" the adversaries by maximizing their losses, i.e., minimizing the following terms:

$$\mathcal{L}^{(adv)}_{sem} = \sum_{w \in V} t_w \log p_{adv}(w|z_{syn}) \tag{5}$$
$$\mathcal{L}^{(adv)}_{syn} = \sum_{i=1}^{n} \log p_{adv}(s_i | s_1 \cdots s_{i-1}, z_{sem}) \tag{6}$$

In this phase, the adversaries are fixed and their parameters are not updated.
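The sign relationship between the two training phases can be illustrated with a toy example. The sketch below is not a training loop; it only shows that the term minimized by the VAE is the negation of the cross-entropy minimized by the adversary, using dictionary-valued distributions and made-up probabilities chosen for illustration.

```python
import math


def adversary_loss(target, adv_predicted, eps=1e-12):
    """Cross-entropy minimized when updating the adversary (VAE frozen)."""
    return -sum(
        t * math.log(adv_predicted.get(word, 0.0) + eps)
        for word, t in target.items()
    )


def vae_adversarial_term(target, adv_predicted, eps=1e-12):
    """Term minimized when updating the VAE (adversary frozen)."""
    return -adversary_loss(target, adv_predicted, eps)


# Hypothetical BoW target and two possible adversary predictions from z_syn.
target = {"dog": 0.5, "door": 0.5}
informed = {"dog": 0.5, "door": 0.5}              # z_syn leaks the words
fooled = {"dog": 0.01, "door": 0.01, "cat": 0.98}  # z_syn reveals little

# The VAE prefers the situation in which the adversary cannot recover BoW.
print(vae_adversarial_term(target, fooled) < vae_adversarial_term(target, informed))  # True
```

In a real implementation the two phases alternate, with only one set of parameters receiving gradients in each phase.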

Adversarial Reconstruction Loss
Our next intuition is that syntax and semantics are more interwoven with each other than other kinds of information such as style and content. Suppose, for example, that syntax and semantics have been perfectly separated by the losses in §3.2.2, so that $z_{sem}$ predicts BoW well but does not contain any information about the syntactic tree. Even in this ideal case, the decoder can reconstruct the original sentence from $z_{sem}$ by simply learning to re-order words (as $z_{sem}$ does contain BoW). Such word re-ordering knowledge is indeed learnable (Ma et al., 2018), and does not necessarily contain the syntactic information. Therefore, the multi-task and adversarial losses for syntax and semantics do not suffice to regularize DSS-VAE.

We now propose an adversarial reconstruction loss to discourage the sentence from being predicted by a single subspace, $z_{syn}$ or $z_{sem}$. When combined, however, the two subspaces should provide a holistic view of the entire sentence. Formally, let $z_s$ be a latent variable ($z_s = z_{syn}$ or $z_{sem}$). A decoding adversary is trained to predict the sentence based on $z_s$, denoted by $p_{rec}(x_i | x_1 \cdots x_{i-1}, z_s)$. Then, the adversarial reconstruction loss is imposed by minimizing

$$\mathcal{L}^{(adv)}_{rec}(z_s) = \sum_{i} \log p_{rec}(x_i | x_1 \cdots x_{i-1}, z_s) \tag{7}$$

where the sum runs over the words $x_i$ of the sentence. This adversarial reconstruction loss is applied to both the syntactic and semantic spaces, shown by black dashed arrows in Figure 2.

Training Details
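Overall Training Objective The overall training loss is a combination of the VAE loss (2), the multi-task and adversarial losses for syntax and semantics (3-6), as well as the adversarial reconstruction losses (7), i.e., minimizing

$$\mathcal{L} = \mathcal{L}_{vae} + \lambda^{mul}_{sem}\mathcal{L}^{(mul)}_{sem} + \lambda^{adv}_{sem}\mathcal{L}^{(adv)}_{sem} + \lambda^{rec}_{sem}\mathcal{L}^{(adv)}_{rec}(z_{sem}) + \lambda^{mul}_{syn}\mathcal{L}^{(mul)}_{syn} + \lambda^{adv}_{syn}\mathcal{L}^{(adv)}_{syn} + \lambda^{rec}_{syn}\mathcal{L}^{(adv)}_{rec}(z_{syn})$$

with

$$\mathcal{L}_{vae} = -\mathbb{E}_{q(z_{sem}|x)\,q(z_{syn}|x)}\left[\log p(x|z_{sem}, z_{syn})\right] + \lambda^{KL}_{sem}\,\mathrm{KL}\left(q(z_{sem}|x)\,\|\,p(z_{sem})\right) + \lambda^{KL}_{syn}\,\mathrm{KL}\left(q(z_{syn}|x)\,\|\,p(z_{syn})\right)$$

where $\lambda^{KL}_{sem}$, $\lambda^{KL}_{syn}$, $\lambda^{mul}_{sem}$, $\lambda^{adv}_{sem}$, $\lambda^{rec}_{sem}$, $\lambda^{mul}_{syn}$, $\lambda^{adv}_{syn}$, and $\lambda^{rec}_{syn}$ are the hyperparameters that adjust the importance of each loss in the overall objective.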
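The way the individual terms are combined can be sketched as a weighted sum. The dictionary keys below mirror the eight hyperparameters named in the text, while the loss values and weights are placeholders invented for illustration, not the settings used in the experiments.

```python
WEIGHTED_TERMS = (
    "kl_sem", "kl_syn",
    "mul_sem", "adv_sem", "rec_sem",
    "mul_syn", "adv_syn", "rec_syn",
)


def overall_loss(losses, weights):
    """Negative reconstruction log-likelihood plus the weighted terms."""
    total = losses["nll"]
    for name in WEIGHTED_TERMS:
        total += weights[name] * losses[name]
    return total


# Placeholder values for illustration only.
losses = {name: 1.0 for name in WEIGHTED_TERMS}
losses["nll"] = 1.0
weights = {name: 0.5 for name in WEIGHTED_TERMS}
print(overall_loss(losses, weights))  # 5.0
```

Setting a weight to zero removes the corresponding regularizer, which is convenient for the kind of incremental ablation reported later for syntax transfer.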
Hyperparameter Tuning We select the parameter values with the lowest negative ELBO (i.e., the VAE loss) on the validation set in all experiments. They are tuned by (grouped) grid search on the validation set, but due to the large hyperparameter space, we conduct tuning mostly for sensitive hyperparameters and admit that it is empirical. We choose the VAE as our baseline, and the KL weight of VAE is tuned in the same way. We list the hyperparameters in Appendix A.
The training objective is optimized by Adam (Kingma and Ba, 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.995$; the initial learning rate is 0.001. Word embeddings are 300-dimensional and initialized randomly. The dimension of each latent space (namely, $z_{syn}$ and $z_{sem}$) is 100.

KL Annealing and Word Dropout
We adopt the tricks of KL annealing and word dropout from Bowman et al. (2016) to avoid KL collapse. We anneal $\lambda^{KL}_{sem}$ and $\lambda^{KL}_{syn}$ from zero to predefined values in a sigmoid manner. In addition, the word dropout trick randomly replaces the ground-truth token with <unk> with a fixed probability of 0.50 at each time step of the decoder during training.
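Both tricks are simple to express in code. The sketch below shows one possible sigmoid schedule and a word-dropout function; the `midpoint` and `steepness` parameters are hypothetical, since the text only states that the weight rises from zero to a predefined value in a sigmoid manner.

```python
import math
import random


def sigmoid_kl_weight(step, target, midpoint, steepness):
    """KL weight that rises from about 0 to `target` along a sigmoid.

    `midpoint` (step at which half of `target` is reached) and `steepness`
    are illustrative parameters, not values taken from the paper.
    """
    return target / (1.0 + math.exp(-steepness * (step - midpoint)))


def word_dropout(tokens, rate, rng, unk="<unk>"):
    """Replace each decoder input token with `unk` with probability `rate`."""
    return [unk if rng.random() < rate else token for token in tokens]


for step in (0, 2500, 5000):
    print(step, round(sigmoid_kl_weight(step, 1.0, 2500, 0.004), 4))

rng = random.Random(0)
print(word_dropout("this is an interesting idea".split(), 0.5, rng))
```

Note that a sigmoid schedule starts near zero rather than exactly at zero; with the illustrative settings above, the weight at step 0 is already negligible.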

Reconstruction and Unconditional Language Generation
First, we compare our model in reconstruction and unconditional language generation with a traditional VAE and a syntactic language model (PRPN, Shen et al., 2017).
Dataset We followed previous work (Bowman et al., 2016) and used a standard benchmark, the WSJ sections in the Penn Treebank (PTB) (Marcus et al., 1993).We also followed the standard split: Sections 2-21 for training, Section 24 for validation, and Section 23 for test.
Settings We trained VAE and DSS-VAE, both with 100-dimensional RNN states. For the vocabulary, we chose the 30k most frequent words. We trained PRPN with the default parameters in its code base.

Evaluation We evaluate model performance with the following metrics:

1. Reconstruction BLEU. The reconstruction task aims to generate the input sentence itself. In this task, both the syntactic and semantic vectors are chosen as the predicted means of the encoded distributions. We evaluate the reconstruction performance by the BLEU score (Papineni et al., 2002) with the input as the reference.

[…]coder for generation; for LSTM-LM, we first feed the start-of-sentence token <s> to the decoder, and sample the word at each time step from the predicted probabilities (i.e., forward sampling).

Results
We see in Table 1 that BLEU and PPL are more or less contradictory. Usually, a smaller KL weight makes the autoencoder less "variational" but more "deterministic," leading to less fluent sampled sentences but better reconstruction.
If the trade-off is not analyzed explicitly, a VAE variant could obtain arbitrary results depending on KL-weight tuning, which makes the comparison unfair.
We therefore present the scatter plot in Figure 3, showing the trend of Forward PPL and BLEU scores with different KL weights. Clearly, DSS-VAE outperforms a plain VAE in BLEU if Forward PPL is controlled, and in Forward PPL if BLEU is controlled. The scatter plot shows that our proposed DSS-VAE outperforms the original counterpart in language generation across different KL weights.
In terms of Reverse PPL (Table 2), DSS-VAE also outperforms a traditional VAE. Since DSS-VAE leverages syntax to improve sentence generation, we also include a state-of-the-art syntactic language model (PRPN-LM, Shen et al., 2017) for comparison. DSS-VAE and PRPN, both of which model syntactic structure explicitly, consistently outperform VAE and LSTM-LM in sentence generation.
We also include the Reverse PPL of the real training sentences. As expected, training a language model on real data outperforms training on sentences sampled from a generation model, showing that there is still much room for improvement for all current sentence generators.

Unsupervised Paraphrase Generation
Given an input sentence, paraphrase generation aims to synthesize a sentence that appears different from the input, but conveys the same meaning. We propose a novel approach to unsupervised paraphrase generation with DSS-VAE. Given a DSS-VAE trained according to §3.3, our approach works entirely at the inference stage.
For a particular input sentence $x^*$, let $q(z_{syn}|x^*)$ and $q(z_{sem}|x^*)$ be the encoded posterior distributions of the syntactic and semantic spaces, respectively. The inferred latent vectors are:

$$z^*_{sem} = \arg\max_{z_{sem}} q(z_{sem}|x^*), \qquad z^*_{syn} \sim q(z_{syn}|x^*)$$

and are further combined as:

$$z^* = [z^*_{sem}; z^*_{syn}]$$

Finally, $z^*$ is fed to the decoder, which performs greedy decoding to generate the paraphrase.
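The inference step can be sketched as follows, assuming Gaussian posteriors whose parameters have already been predicted by the encoder. For a Gaussian, MAP inference returns the mean, so the semantic vector is deterministic while the syntactic vector is sampled; the function name and the numeric values are illustrative.

```python
import random


def paraphrase_latent(mu_sem, mu_syn, sigma_syn, rng):
    """Return z* = [z*_sem; z*_syn] for paraphrase generation.

    z*_sem is the posterior mean (the MAP estimate of a Gaussian);
    z*_syn is sampled from N(mu_syn, sigma_syn^2).
    """
    z_sem = list(mu_sem)
    z_syn = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_syn, sigma_syn)]
    return z_sem + z_syn


rng = random.Random(0)
first = paraphrase_latent([0.2, -0.1], [0.5, 0.5], [1.0, 1.0], rng)
second = paraphrase_latent([0.2, -0.1], [0.5, 0.5], [1.0, 1.0], rng)
print(first[:2] == second[:2])   # True: semantics are fixed
print(first[2:] != second[2:])   # True: syntax varies between draws
```

Repeated calls therefore yield different candidate paraphrases of the same input, each of which would be decoded greedily.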
The intuition is that, when generating a paraphrase, the semantics should remain the same, but the syntax could (and should) vary. Therefore, we sample $z^*_{syn}$ from its posterior distribution, while fixing $z^*_{sem}$.

Dataset We used the established Quora dataset to evaluate paraphrase generation, following previous work (Miao et al., 2019). The dataset contains 140k pairs of paraphrase sentences and 260k […]

Miao et al. (2019) propose to measure how much a generated sentence differs from its input by computing BLEU against the original sentence (denoted as BLEU-ori), which ideally should be low. We only consider DSS-VAE variants that yield a BLEU-ori lower than 55, a threshold empirically suggested by Miao et al. (2019) to ensure that the obtained sentence differs from the original to at least a certain degree.
Results Table 3 shows the performance of unsupervised paraphrase generation. In the first row of Table 3 […] (Gupta et al., 2018). We admit that it is hard to present the trade-off by listing a single score for each model in Table 3. We therefore present the scatter plot in Figure 4.

Syntax-Transfer Generation
In this experiment, we propose a novel application of syntax-transfer text generation, inspired by previous sentiment-style transfer studies (Hu et al., 2017; Fu et al., 2018; John et al., 2018).

Consider two sentences:
x 1 : There is a dog behind the door.
x 2 : The child is playing in the garden.
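If we would like to generate a sentence having the "there is/are" syntax of $x_1$ but conveying the meaning of $x_2$, we could graft the respective syntactic and semantic vectors as:

$$z^*_{syn} = \arg\max_{z_{syn}} q(z_{syn}|x_1), \qquad z^*_{sem} = \arg\max_{z_{sem}} q(z_{sem}|x_2)$$

and feed $z^* = [z^*_{sem}; z^*_{syn}]$ to the decoder.

The Tree Edit Distance (TED, Zhang and Shasha, 1989) is essentially the cost of the minimum-cost sequence of node edit operations (namely, delete, insert, and rename) between two trees, which reflects the difference between two syntactic trees.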
Since we hope the generated sentence has a higher word-BLEU score when compared with Ref sem but a lower word-BLEU score when compared with Ref syn, we compute their difference, denoted by ∆word-BLEU, to consider both. Likewise, ∆TED is also computed. We further take the geometric mean of ∆word-BLEU and ∆TED as a single summary score.
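The summary scores can be computed as in the sketch below, which follows the definitions given in the caption of Table 4: ∆word-BLEU = word-BLEU(Ref sem) − word-BLEU(Ref syn) and ∆TED = TED(Ref sem) − TED(Ref syn). The input numbers are made up for illustration, and returning NaN when a difference is not positive is an assumption of this example, since the geometric mean is undefined in that case.

```python
import math


def transfer_scores(bleu_sem, bleu_syn, ted_sem, ted_syn):
    """Return (delta word-BLEU, delta TED, geometric mean of the two)."""
    delta_bleu = bleu_sem - bleu_syn  # words should follow Ref sem
    delta_ted = ted_sem - ted_syn     # the tree should be closer to Ref syn
    if delta_bleu > 0 and delta_ted > 0:
        geo_mean = math.sqrt(delta_bleu * delta_ted)
    else:
        geo_mean = float("nan")       # assumption: undefined in this case
    return delta_bleu, delta_ted, geo_mean


# Hypothetical scores, not results from the paper.
print(transfer_scores(20.0, 4.0, 9.0, 5.0))  # (16.0, 4.0, 8.0)
```

The geometric mean rewards models that do well on both differences at once, which matters because BLEU and TED are on different scales.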

Results
We see from Table 4 that a traditional VAE cannot accomplish the task of syntax transfer. This is because Ref syn and Ref sem (even if we artificially split the latent space into two parts) play the same role in the decoder. With the multi-task and adversarial losses for the syntactic and semantic latent spaces, the total difference is increased by 12.09, which shows the success of syntax-transfer sentence generation. This further implies that explicitly modeling syntax is feasible in the latent space of VAE. We then incrementally applied the adversarial reconstruction loss proposed in §3.2.3.
As seen, an adversarial reconstruction loss drastically strengthens the role of the other space. For example, $+\mathcal{L}^{(adv)}_{rec}(z_{sem})$ repels information to the syntactic space and achieves the highest ∆TED.
When applying the adversarial reconstruction losses to both semantic and syntactic spaces, we have a balance between ∆word-BLEU and ∆TED, both ranking second in the respective columns. Eventually, we achieve the highest total difference, showing that our full DSS-VAE model achieves the best performance of syntax-transfer generation.

Discussion on syntax transfer between incompatible sentences We provide a few case studies of syntax-transfer generation in Appendix B. We empirically find that syntactic transfer between "compatible" sentences gives more promising results than transfer between "incompatible" sentences. Intuitively, this is reasonable because it may be hard to transfer the syntax of a sentence with a length of 5, say, to a sentence with a length of 50.

Conclusion
In this paper, we propose a novel DSS-VAE model, which explicitly models syntax in the distributed latent space of VAE and enjoys the benefits of sampling and manipulation in terms of the syntax of a sentence. Experiments show that DSS-VAE outperforms the VAE baseline in reconstruction and unconditioned language generation. We further make use of the sampling and manipulation advantages of DSS-VAE in two novel applications, namely unsupervised paraphrase and syntax-transfer generation. In both experiments, DSS-VAE achieves promising results.
Figure 1: The parse tree and its linearized tree sequence of a sentence "This is an interesting idea."

Figure 2: Overview of our DSS-VAE. Forward dashed arrows are multi-task losses; backward dashed arrows are adversarial losses.

Figure 4: Trade-off between BLEU-ori (the lower, the better) and BLEU-ref (the larger, the better) in unsupervised paraphrase generation. Again, the upper-left corner indicates a better performance.

Table 1: BLEU and Forward PPL of VAE with varying KL weights on the PTB test set. The larger ↑ (or lower ↓), the better.

Table 2: Reverse PPL reflects the diversity and fluency of sampled data; the lower ↓, the better. A language model is trained on sentences sampled from each model and evaluated on the real test set.

Results show that DSS-VAE has achieved a Reverse PPL comparable to (and slightly better than) PRPN-LM. It is also seen that explicitly modeling syntactic structures does yield better generation results.

Table 3: Performance of paraphrase generation. The larger ↑ (or lower ↓), the better. Some results are quoted from † Miao et al. (2019) and ‡ Gupta et al. (2018).

Evaluation Since the test set contains a reference paraphrase for each input, it is straightforward to compute the BLEU against the reference, denoted by BLEU-ref. However, this metric alone does not model whether the generated sentence is different from the input.

In each pair we constructed, one sentence serves as the semantic provider (denoted by Ref sem), and the other serves as the syntactic provider (denoted by Ref syn). The goal of syntax-transfer text generation is to synthesize a sentence that resembles Ref sem but not Ref syn in semantics, and resembles Ref syn but not Ref sem in syntax. For the semantic part, we use the traditional word-based BLEU scores to evaluate how close the generated sentence is to Ref sem and how different it is from Ref syn. For syntactic similarity, we use the zss package to calculate the Tree Edit Distance (TED, Zhang and Shasha, 1989).

Table 4: Performance of syntax-transfer generation. The larger ↑ (or lower ↓), the better. Columns: Model, word-BLEU (corpus), ∆word-BLEU ↑, Average TED (per sentence), ∆TED ↑, Geo Mean ∆ ↑. The results of VAE are obtained by averaging interpolation. ∆word-BLEU = word-BLEU(Ref sem) − word-BLEU(Ref syn). We also compute the difference ∆TED = TED(Ref sem) − TED(Ref syn) to measure whether the generated sentence is syntactically similar to Ref syn but not Ref sem. Due to the difference of scale between BLEU and TED, we compute the geometric mean of ∆word-BLEU and ∆TED to reflect the total difference.