Controllable Paraphrase Generation with a Syntactic Exemplar

Prior work on controllable text generation usually assumes that the controlled attribute can take on one of a small set of values known a priori. In this work, we propose a novel task in which the syntax of a generated sentence is controlled by a sentential exemplar rather than by a predefined attribute value. To evaluate quantitatively with standard metrics, we create a novel dataset with human annotations. We also develop a variational model with a neural module specifically designed for capturing syntactic knowledge and several multi-task training objectives to promote disentangled representation learning. Empirically, the proposed model achieves improvements over baselines and learns to capture desirable characteristics.


Introduction
Controllable text generation has recently become an area of intense focus in the natural language processing (NLP) community. Recent work has focused both on generating text satisfying certain stylistic requirements, such as being formal or exhibiting a particular sentiment (Hu et al., 2017; Shen et al., 2017; Ficler and Goldberg, 2017), and on generating text meeting structural requirements, such as conforming to a particular template (Iyyer et al., 2018; Wiseman et al., 2018).
These systems can be used in various application areas, such as text summarization (Fan et al., 2018), adversarial example generation (Iyyer et al., 2018), dialogue (Niu and Bansal, 2018), and data-to-document generation (Wiseman et al., 2018). However, prior work on controlled generation has typically assumed a known, finite set of values that the controlled attribute can take on. In this work, we are interested instead in the novel setting where generation is controlled through an exemplar sentence (where any syntactically valid sentence is a valid exemplar). We will focus in particular on using a sentential exemplar to control the syntactic realization of a generated sentence. This task can benefit natural language interfaces to information systems by suggesting alternative invocation phrases for particular types of queries (Kumar et al., 2017). It can also bear on dialogue systems that seek to generate utterances that fit particular functional categories (Ke et al., 2018; Li et al., 2019).
To address this task, we propose a deep generative model with two latent variables designed to capture semantics and syntax. To achieve better disentanglement between these two variables, we design multi-task learning objectives that make use of paraphrases and word order information. To further facilitate the learning of syntax, we additionally propose to train the syntactic component of our model with word noising and latent word-cluster codes. Word noising randomly replaces word tokens in the syntactic inputs based on a part-of-speech tagger used only at training time. Latent codes create a bottleneck layer in the syntactic encoder, forcing it to learn a more compact notion of syntax; this approach also learns interpretable word clusters. Empirically, these learning criteria and neural architectures lead to better generation quality and generally better-disentangled representations.
To evaluate this task quantitatively, we manually create an evaluation dataset containing triples of a semantic exemplar sentence, a syntactic exemplar sentence, and a reference sentence incorporating the semantics of the semantic exemplar and the syntax of the syntactic exemplar. This dataset is created by first automatically finding syntactic exemplars and then heavily editing them to ensure (1) semantic variation between the syntactic inputs and the references, (2) syntactic similarity between the syntactic inputs and the references, and (3) syntactic variation between the semantic inputs and the references. Examples are shown in Figure 1. This dataset allows us to evaluate different approaches quantitatively using standard metrics, including BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). As successful control of the generated sentence also depends largely on the syntactic similarity between the syntactic exemplar and the reference, we propose a "syntactic similarity" metric based on the tree edit distance between the constituency parse trees of these two sentences after removing word tokens.
Empirically, we benchmark the syntactically-controlled paraphrase network (SCPN) of Iyyer et al. (2018) on this novel dataset; it shows strong performance with the help of a supervised parser at test time, but can be sensitive to the quality of the parse predictor. We show that our word position loss effectively characterizes syntactic knowledge, bringing consistent and sizeable improvements on syntax-related evaluation. The latent code module learns interpretable latent representations. Additionally, all of our models achieve improvements over baselines. Qualitatively, we show that our models do suffer from the lack of an abstract syntactic representation, though we also show that SCPN and our models exhibit similar artifacts.
In seeking to control generation with exemplars, our approach relates to recent work in controllable text generation. Whereas much work on controllable text generation seeks to control distinct attributes of generated text (e.g., its sentiment or formality) (Hu et al., 2017; Shen et al., 2017; Ficler and Goldberg, 2017; Fu et al., 2018; Zhao et al., 2018; Fan et al., 2018, inter alia), there is also recent work that attempts to control structural aspects of the generation, such as its latent (Wiseman et al., 2018) or syntactic (Iyyer et al., 2018) template.
Our work is closely related to this latter category, and to the syntactically-controlled paraphrase generation of Iyyer et al. (2018) in particular, but our proposed model differs in that it simply uses a single sentence as a syntactic exemplar rather than requiring a supervised parser. This makes our setting closer to style transfer in computer vision, in which an image is generated that combines the content of one image with the style of another (Gatys et al., 2016). In particular, in our setting we seek to generate a sentence that combines the semantics of one sentence with the syntax of another, and so we only require a pair of (unparsed) sentences. We also note recent, concurrent work that attempts to use sentences as exemplars in controlling generation (Wang et al., 2019) in the context of data-to-document generation (Wiseman et al., 2017).
Another related line of work builds generation upon sentential exemplars (Guu et al., 2018; Weston et al., 2018; Pandey et al., 2018; Cao et al., 2018; Peng et al., 2019), but does so to improve the quality of the generation itself rather than to allow control over syntactic structure.
There has been a great deal of work applying multi-task learning to improve performance on NLP tasks (Plank et al., 2016; Rei, 2017; Augenstein and Søgaard, 2017; Bollmann et al., 2018, inter alia). Some recent work has used multi-task learning as a way of improving the quality or disentanglement of learned representations (Zhao et al., 2017; Goyal et al., 2017; Du et al., 2018; John et al., 2018).
Part of our evaluation involves assessing the different characteristics captured by the semantic and syntactic encoders, relating them to work on learning disentangled representations in NLP, including morphological reinflection (Zhou and Neubig, 2017), sequence labeling (Chen et al., 2018), and sentence representations (Chen et al., 2019).

Methods
Given two sentences X and Y, our goal is to generate a sentence Z that follows the syntax of Y and the semantics of X. We refer to X and Y as the semantic template and syntactic template, respectively.
To solve this problem, we follow Chen et al. (2019) and take an approach based on latent-variable probabilistic modeling, neural variational inference, and multi-task learning. In particular, we assume a generative model that has two latent variables: y for semantics and z for syntax (as depicted in Figure 2). We refer to our model as a vMF-Gaussian Variational Autoencoder (VGVAE). Formally, following the conditional independence assumptions in the graphical model, the joint probability pθ(x, y, z) can be factorized as

pθ(x, y, z) = pθ(y) pθ(z) ∏_t pθ(x_t | x_{1:t−1}, y, z)

where x_t is the tth word of x and pθ(x_t | x_{1:t−1}, y, z) is given by a softmax over a vocabulary of size V. Further details on the parameterization are given below.
When applying neural variational inference, we assume a factorized approximate posterior qφ(y, z|x) = qφ(y|x) qφ(z|x), which has also been used in some prior work (Zhou and Neubig, 2017; Chen et al., 2018). Learning in VGVAE maximizes a lower bound on the marginal log-likelihood:

log pθ(x) ≥ E_{y∼qφ(y|x), z∼qφ(z|x)}[log pθ(x | y, z)] − KL(qφ(y|x) ‖ pθ(y)) − KL(qφ(z|x) ‖ pθ(z))    (1)

Parameterization

vMF Distribution. We choose a von Mises-Fisher (vMF) distribution for the y (semantic) latent variable. vMF can be regarded as a Gaussian distribution on a hypersphere, with two parameters: µ and κ. µ ∈ R^m is a normalized vector (i.e., ‖µ‖₂ = 1) defining the mean direction, and κ ≥ 0 is often referred to as a concentration parameter, analogous to the variance of a Gaussian distribution. We assume qφ(y|x) follows a vMF distribution and that pθ(y) follows the uniform distribution vMF(·, 0). We follow Davidson et al. (2018) and use an acceptance-rejection scheme to sample from the vMF distribution.

Gaussian Distribution. We assume qφ(z|x) follows a Gaussian distribution N(µβ(x), diag(σβ(x))) and that the prior pθ(z) is N(0, I_d), where I_d is the d × d identity matrix.
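The Gaussian KL term in Equation 1 has a well-known closed form against the N(0, I_d) prior, and sampling uses the standard reparameterization trick. A minimal sketch (not the authors' code; function names are ours, and dimension handling is simplified to plain Python lists):

```python
import math
import random

def sample_gaussian(mu, sigma):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    dimensions: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2) per dimension."""
    return sum(0.5 * (s * s + m * m - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))
```

For example, a posterior that exactly matches the prior (mu = 0, sigma = 1) gives zero KL, so the term only penalizes deviation of qφ(z|x) from N(0, I_d).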
Encoders. At test time, we want to be able to use different combinations of semantic and syntactic inputs, which naturally suggests separate parameterizations for qφ(y|x) and qφ(z|x). Specifically, qφ(y|x) is parameterized by a word averaging encoder followed by a three-layer feedforward neural network, since word averaging encoders have been observed to perform surprisingly well on semantic tasks (Wieting et al., 2016). qφ(z|x) is parameterized by a bidirectional long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997), also followed by a three-layer feedforward neural network, where we concatenate the forward and backward vectors produced by the biLSTM and then take the average of these vectors.
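The first stage of the semantic encoder can be sketched as plain word averaging (the three-layer feedforward network on top, and the biLSTM of the syntactic encoder, are omitted; the function name is ours):

```python
def word_avg(embeddings, sentence):
    """Semantic encoder, first stage: average the word vectors of the
    sentence. A feedforward network (omitted) then maps this average to
    the vMF parameters. The syntactic encoder instead averages the
    concatenated forward/backward biLSTM states at each position."""
    vecs = [embeddings[w] for w in sentence]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Because averaging discards word order entirely, this encoder is a natural fit for the semantic variable, while the order-sensitive biLSTM suits the syntactic one.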
Decoders. As shown in Figure 3, at each time step we concatenate the syntactic variable z with the previous word's embedding as input to the decoder, and we concatenate the semantic variable y with the hidden vector output by the decoder to predict the word at the next time step. Note that the initial hidden state of the decoder is always set to zero.

Latent Codes for Syntactic Encoder
Since what we want from the syntactic encoder is only the syntactic structure of a sentence, using standard word embeddings tends to mislead the syntactic encoder into believing the syntax is manifested by the exact word tokens. For example, the generated sentence often preserves the exact pronouns or function words of the syntactic input instead of making the changes required by the semantics. To alleviate this, we follow Chen and Gimpel (2018) and represent each word with a latent code (LC) for word clusters within the word embedding layer. Our goal is for this to create a bottleneck layer in the word embeddings, thereby forcing the syntactic encoder to learn a more abstract representation of the syntax. However, since our purpose is not to reduce model size (unlike Chen and Gimpel, 2018), we marginalize out the latent code to get the embeddings during both training and testing. That is,

e_w = Σ_{c_w} p(c_w | w) v_{c_w}

where c_w is the latent code for word w, v_{c_w} is the vector for latent code c_w, and e_w is the resulting word embedding for word w. In our models, we use 10 binary codes produced by 10 feedforward neural networks on top of a shared word embedding, and we concatenate the resulting 10 cluster vectors to get the final word embedding.
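The marginalization over codes can be sketched as follows. The feedforward networks that compute the per-code logits are stubbed out as given inputs, and all names are ours:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def latent_code_embedding(code_logits, code_vectors):
    """For each latent code, take the expectation of its cluster vectors
    under p(c|w) (marginalizing out the code), then concatenate the
    per-code expectations to form the final word embedding.
    code_logits[k]  : logits over the classes of code k (2 for binary)
    code_vectors[k] : one vector per class of code k"""
    embedding = []
    for logits, vectors in zip(code_logits, code_vectors):
        probs = softmax(logits)
        dim = len(vectors[0])
        expected = [sum(p * v[i] for p, v in zip(probs, vectors))
                    for i in range(dim)]
        embedding.extend(expected)
    return embedding
```

With 10 binary codes, a word's embedding is a concatenation of 10 such expectations, so the encoder can only "see" words through this cluster bottleneck.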
Figure 4: Diagram showing the training process when using the paraphrase reconstruction loss (dash-dotted lines). The pair (x1, x2) is a sentential paraphrase pair, the y's are the semantic variables corresponding to each x, and the z's are the syntactic variables.

Multi-Task Learning
We now describe several additional training losses designed to encourage a clearer separation of information between the semantic and syntactic variables. These losses were also considered by Chen et al. (2019), but in the context of learning sentence representations.

Paraphrase Reconstruction Loss
Our first loss, the paraphrase reconstruction loss (PRL), requires a dataset of sentential paraphrase pairs. The key assumption is that for a pair of paraphrastic sentences x1, x2, the semantics is shared but the syntax may differ. As shown in Figure 4, during training we swap the paraphrases fed to the semantic encoder but keep the input to the syntactic encoder the same. The loss is defined as

PRL = E_{y2∼qφ(y|x2), z1∼qφ(z|x1)}[−log pθ(x1 | y2, z1)] + E_{y1∼qφ(y|x1), z2∼qφ(z|x2)}[−log pθ(x2 | y1, z2)]    (2)

In the following experiments, unless explicitly noted, we will always include PRL as part of model training, and we discuss its effect in Section 7.1.

Word Position Loss
Since word ordering is relatively unimportant for semantic similarity (Wieting et al., 2016), we assume it is more relevant to the syntax of a sentence than to its semantics. Based on this, we introduce a word position loss (WPL). As shown in Figure 3, WPL is computed by predicting the position at each time step based on the concatenation of the word embedding with the syntactic variable z.
That is,

WPL = E_{z∼qφ(z|x)}[ Σ_t −log softmax(f([e_t ; z]))_t ]

where e_t is the word embedding at position t, f is a feedforward network predicting a distribution over positions, and softmax(·)_t indicates the probability at position t. Empirically, we observe that adding WPL to both the syntactic encoder and the decoder improves performance, so we always use it this way in our experiments unless otherwise indicated.
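WPL is essentially a cross-entropy loss whose gold label at step t is t itself: the model must recover each word's position from the concatenation of its embedding and z. A minimal sketch, with the position-scoring network stubbed out as precomputed logits:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def word_position_loss(position_logits):
    """Sum over time steps of -log p(position = t | step t).
    position_logits[t] are the scores over positions computed at step t
    (by a feedforward net over [e_t; z], omitted here)."""
    loss = 0.0
    for t, logits in enumerate(position_logits):
        loss += -math.log(softmax(logits)[t])
    return loss
```

If the logits already identify each position confidently, the loss approaches zero; uninformative logits give log-of-sequence-length per step.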

KL Weight
As observed in previous work (Alemi et al., 2017; Bowman et al., 2016; Higgins et al., 2016), the weight of the KL divergence in Equation 1 can be important when learning with latent variables. We attach weights to the KL divergence terms in Equation 1 and tune them based on development set performance.

Word Noising via Part-of-Speech Tags
In practice, we often observe that the syntactic encoder tends to remember word types instead of learning syntactic structures. To provide a more flexible notion of syntax, we add word noising (WN) based on part-of-speech (POS) tags. More specifically, we tag the training set using the Stanford POS tagger (Toutanova et al., 2003). Then we group the word types based on the top two most frequent tags for each word type. During training, as shown in Figure 5, we noise the syntactic inputs by randomly replacing word tokens based on the groups and tags we obtained. This provides our framework with many examples of word interchangeability based on POS tags, and discourages the syntactic encoder from memorizing the word types in the syntactic input. When using WN, the probability of noising a word is tuned based on development set performance.
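A simplified sketch of the noising step follows. For brevity it groups words by every observed tag rather than by each word type's top two most frequent tags, and assumes the corpus has already been POS-tagged externally; all names are ours:

```python
import random

def build_pos_groups(tagged_corpus):
    """Map each POS tag to the set of word types seen with that tag.
    (Simplification: the paper keeps only the top two most frequent
    tags per word type.)"""
    groups = {}
    for sentence in tagged_corpus:
        for word, tag in sentence:
            groups.setdefault(tag, set()).add(word)
    return {tag: sorted(words) for tag, words in groups.items()}

def noise_sentence(tagged_sentence, groups, p, rng=random):
    """Replace each word, with probability p, by another word type that
    can take the same POS tag."""
    noised = []
    for word, tag in tagged_sentence:
        candidates = [w for w in groups.get(tag, []) if w != word]
        if candidates and rng.random() < p:
            noised.append(rng.choice(candidates))
        else:
            noised.append(word)
    return noised
```

Because replacements always preserve the POS tag, the noised sentence keeps the original's coarse syntactic shape while its lexical content changes.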

Training Setup
For training with the PRL, we require a training set of sentential paraphrase pairs. We use ParaNMT (Wieting and Gimpel, 2018), a dataset of approximately 50 million paraphrase pairs. To ensure there is enough variation between paraphrases, we filter out pairs with a high BLEU score between the two sentences, which leaves us with around half a million paraphrase pairs as our training set. All hyperparameter tuning is based on the BLEU score on the development set (see appendix for more details).
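The filtering step can be sketched with a rough clipped n-gram overlap score standing in for BLEU; the exact BLEU implementation and threshold used by the authors are not specified here, so both are illustrative:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_score(ref, hyp, max_n=2):
    """Average clipped n-gram precision: a rough stand-in for BLEU."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = {}
        for g in ngrams(ref, n):
            ref_counts[g] = ref_counts.get(g, 0) + 1
        hyp_grams = ngrams(hyp, n)
        if not hyp_grams:
            return 0.0
        matched = 0
        for g in hyp_grams:
            if ref_counts.get(g, 0) > 0:
                ref_counts[g] -= 1
                matched += 1
        precisions.append(matched / len(hyp_grams))
    return sum(precisions) / len(precisions)

def keep_pair(s1, s2, threshold=0.5):
    """Keep a paraphrase pair only if lexical overlap is low enough.
    The threshold is illustrative, not the paper's."""
    return overlap_score(s1.split(), s2.split()) < threshold
```

Pairs that are near-identical score close to 1 and are discarded, leaving pairs whose surface forms genuinely differ.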

Evaluation Dataset and Metrics
To evaluate models quantitatively, we manually annotate 1300 instances based on paraphrase pairs from ParaNMT that are independent of our training set.
Each instance in the annotated data has three sentences: semantic input, syntactic input, and reference. The semantic input and the reference can be seen as human-generated paraphrases, while the syntactic input shares its syntax with the reference but is very different from the semantic input in terms of semantics. The differences among these three sentences ensure the difficulty of the task. Figure 1 shows examples.
The annotation process involves two steps. We begin with a paraphrase pair ⟨u, v⟩. First, we use an automatic procedure to find, for each sentence u, a syntactically-similar but semantically-different other sentence t. We do this by seeking sentences t with high edit distance of predicted POS tag sequences and low BLEU score with u. Then we manually edit all three sentences to ensure (1) strong semantic match and large syntactic variation between the semantic input u and the reference v, (2) strong semantic match between the syntactic input t and its post-edited version, and (3) strong syntactic match between the syntactic input t and the reference v. We randomly pick 500 instances as our development set and use the remaining 800 instances as our test set. We perform additional manual filtering and editing of the test set to ensure quality.
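The candidate-retrieval step ranks sentences by the edit distance between predicted POS tag sequences. A sketch using sequence-level Levenshtein distance (POS tagging is assumed done elsewhere; function names are ours, and the low-BLEU filter against u is applied separately):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (e.g., POS tag lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def rank_by_pos_distance(u_tags, candidate_tags):
    """Order candidate sentence indices by POS-sequence edit distance
    to u, highest first."""
    return sorted(range(len(candidate_tags)),
                  key=lambda i: edit_distance(u_tags, candidate_tags[i]),
                  reverse=True)
```

Ranking over tag sequences rather than word sequences makes the retrieval sensitive to syntactic shape rather than lexical choice.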
For evaluation, we consider two categories of automatic evaluation metrics, designed to capture different components of the task. To measure roughly the amount of semantic content that matches between the predicted output and the reference, we report BLEU score (BL), METEOR score (MET; Banerjee and Lavie, 2005), and three ROUGE scores: ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L). Even though these metrics are not purely based on semantic matching, we refer to them in this paper as "semantic metrics" to differentiate them from our second metric category, which we refer to as a "syntactic metric". For the latter, to measure the syntactic similarity between generated sentences and the reference, we report the syntactic tree edit distance (ST). To compute ST, we first parse the sentences using Stanford CoreNLP (Manning et al., 2014), and then compute the tree edit distance (Zhang and Shasha, 1989) between the constituency parse trees after removing word tokens.
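Computing ST requires removing word tokens from the constituency parses before taking the tree edit distance. The stripping step can be sketched as below on bracketed parse strings (the tree edit distance itself, Zhang and Shasha's algorithm, is available in third-party packages and omitted here; the function name is ours):

```python
import re

def strip_tokens(parse):
    """Remove terminal word tokens from a bracketed constituency parse,
    keeping only nonterminal and POS labels, e.g.
    '(S (NP (PRP I)) (VP (VBD ran)))' -> '(S (NP (PRP)) (VP (VBD)))'.
    A terminal appears as '(TAG word)'; we drop the word."""
    return re.sub(r'\(([^\s()]+) [^\s()]+\)', r'(\1)', parse)
```

After stripping, two sentences with the same structure but entirely different words have identical trees, so ST measures purely syntactic divergence.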

Baselines
We report results for three baselines. The first two baselines directly output the corresponding syntactic or semantic input for each instance. For the last baseline, we consider SCPN (Iyyer et al., 2018). As SCPN requires parse trees for both the syntactic and semantic inputs, we follow the process in their paper and use the Stanford shift-reduce constituency parser (Manning et al., 2014) to parse both, then use the parsed sentences as inputs to SCPN. We report results for SCPN when using only the top two levels of the parse as input (template) and when using the full parse as input (full parse).

Results
As shown in Table 1, simply outputting the semantic input shows strong performance across the BLEU, ROUGE, and METEOR scores, which are more relevant to semantic similarity, but much worse performance in terms of ST. On the other hand, simply returning the syntactic input leads to lower BLEU, ROUGE, and METEOR scores but a very strong ST score. These trends validate the evaluation dataset: they show that the reference and the semantic input match more strongly in their semantics than in their syntax, and that the reference and the syntactic input match more strongly in their syntax than in their semantics. The goal in developing systems for this task is then to produce outputs with higher semantic metric scores than the syntactic input baseline and simultaneously better syntactic (ST) scores than the semantic input baseline.

Among our models, adding WPL leads to gains across both the semantic and syntactic metric scores. The gains are much larger without WN, but even with WN, adding WPL improves nearly all scores. Adding LC typically helps the semantic metrics (at least when combined with WPL) without harming the syntactic metric (ST). We see the largest improvements, however, by adding WN, which uses an automatic part-of-speech tagger at training time only. Both the semantic and syntactic metrics improve consistently with WN, as the syntactic variable is shown many examples of word interchangeability based on POS tags.
While SCPN yields very strong metric scores, there are several differences that make the SCPN results difficult to compare to those of our models. In particular, SCPN uses a supervised parser both during training and at test time, while our strongest results merely require a POS tagger and use it only at training time. Furthermore, since ST is computed based on parse trees from a parser, systems that explicitly use constituency parsers at test time, such as SCPN, are likely to be favored by such a metric. This is likely the reason why SCPN can match the syntactic input baseline in ST. Also, SCPN trains on a much larger portion of ParaNMT.
We find large differences in metric scores when SCPN only uses a parse template (i.e., the top two levels of the parse tree of the syntactic input). In this case, the results degrade, especially in ST, showing that the performance of SCPN depends on the quality of the input parses. Nonetheless, the SCPN results show the potential benefit of explicitly using a supervised constituency parser at both training and test time. Future work can explore ways to combine syntactic parsers with our models for more informative training and more robust performance.

Effect of Multi-Task Training
Effect of Paraphrase Reconstruction Loss. We investigate the effect of PRL by removing it from training, which effectively makes VGVAE a standard variational autoencoder. As shown in Table 2, making use of the pairing information improves performance both in the semantic-related metrics and in syntactic tree edit distance.
Effect of Position of Word Position Loss. We also study the effect of the position of WPL by (1) using the decoder hidden state, (2) using the concatenation of the word embeddings in the syntactic encoder and the syntactic variable, (3) using the concatenation of the word embeddings in the decoder and the syntactic variable, or (4) adding it to both the encoder and decoder word embeddings. Table 3 shows that adding WPL on hidden states helps slightly, but not as much as adding it on word embeddings. In practice, we also observe that the value of WPL tends to vanish when applied to hidden states, presumably because LSTM hidden states already carry sequence information, making the optimization of WPL trivial. We also observe that adding WPL to both the encoder and the decoder brings the largest improvement.

Encoder Analysis
To investigate what has been learned in the encoder, we evaluate q φ (y|x) and q φ (z|x) on both semantic similarity tasks and syntactic similarity tasks and also inspect the latent codes.
Semantic Similarity. We use the cosine similarity between the two variables encoded by the inference networks as the predictions and then compute Pearson correlations on the STS Benchmark test set (Cer et al., 2017). As shown in Table 4, the semantic variable y always outperforms the syntactic variable z by a large margin, suggesting that the two variables have captured different information. Whenever we add WPL, the difference in performance between the two variables increases. Moreover, the differences between these two variables are correlated with the performance of the models in Table 1, showing that a better generation system has a more disentangled latent representation.
Syntactic Similarity.We use the syntactic evaluation tasks from Chen et al. (2019) to evaluate the syntactic knowledge encoded in the encoder.
The tasks are based on a 1-nearest-neighbor constituency parser or POS tagger. To understand the difficulty of these two tasks, Table 5 shows results for two baselines. "Random" randomly picks candidates as predictions. The second baseline ("Best") computes the pairwise scores between the test instances and the sentences in the candidate pool and then takes the maximum values; it can be seen as an upper bound on performance for these tasks.
As shown in Table 5, similar trends are observed as in Tables 1 and 4. When adding WPL or WN, there is a boost in the syntactic similarity for the syntactic variable. Adding LC also helps the performance of the syntactic variable slightly.
Latent Code Analysis. We look into the learned word clusters by taking the argmax of the latent codes and treating it as the cluster membership of each word. Although these are not the exact word clusters used at test time (because we marginalize over the latent codes), they give us intuition about what the individual cluster vectors contribute to the final word embeddings. As shown in Table 6, the words in the first and last rows are mostly function words. The second row has verbs. The third row has special symbols. The fourth row also has function words, but somewhat different ones from the first row. The fifth row is a large cluster populated by content words, mostly nouns and adjectives. The sixth row has words that are not very important semantically, and the seventh row has mostly adverbs. We also observe that the size of a cluster often correlates with how strongly it relates to topics. In Table 6, clusters with fewer than 20 words are often function words, while the largest cluster (fifth row) has the words with the most concrete meanings.

            BL    R-1   R-2   R-L   MET   ST
LC          13.6  44.7  21.0  48.3  24.8  6.7
Single LC   12.9  44.2  20.3  47.4  24.1  6.9
Table 7: Test results when using a single code.
We also compare LC to a single latent code with 50 classes. The results in Table 7 show that it is better to use several codes, each with a small number of classes, than a single code with a large number of classes.

Effect of Decoder Structure
As shown in Figure 6, we evaluate three variants of the decoder, namely INIT, CONCAT, and SWAP. For INIT, we use the concatenation of the semantic variable y and the syntactic variable z to compute the initial hidden state of the decoder, and then use the word embedding as input and the hidden state to predict the next word. For CONCAT, we move both y and z to the input of the decoder, using the concatenation of these two variables and the word embedding as input and the hidden state for predicting the next word. For SWAP, we swap the positions of y and z, using the concatenation of y and the word embedding as input to the decoder and the concatenation of z and the hidden state as output for predicting the next word. Results for these three settings are shown in Table 9:

         BL   R-1   R-2   R-L   MET   ST
VGVAE    4.5  26.5  8.2   31.5  13.3  10.0
INIT     3.5  22.7  6.0   24.9  9.8   11.5
CONCAT   4.0  23.9  6.6   27.9  11.2  10.9
SWAP     4.3  25.6  7.5   30.4  12.5  10.5

INIT performs the worst of the three settings. Both CONCAT and SWAP have the variables available at each time step in the decoder, which improves performance. SWAP arranges the variables in different positions in the decoder and further improves over CONCAT in all metrics.

Generated Sentences
We show several generated sentences in Table 8. We observe that both SCPN and our model suffer from the same problems. Comparing the syntactic input with the outputs of both our models and SCPN, we find that they are always the same length. This can often lead to problems like the first example in Table 8: the length of the syntactic input is not sufficient for expressing the semantics of the semantic input, which causes the generated sentences from both models to end at "you?" and omit the verb "think". Another problem is consistency of pronouns between the generated sentences and the semantic inputs. An example is the second row of Table 8: both models alter "i" to either "you" or "she", while "kick that bastard in the ass" becomes "kicked the bastard in my ass".
We find that our models sometimes generate nonsensical sentences, for example the last row of Table 8, while SCPN, which is trained on a much larger corpus, does not have this problem. Our models can also sometimes be distracted by the word tokens in the syntactic input, as shown in the third row of Table 8, where our model directly copies "of course" from the syntactic input, while SCPN, since it uses a parse tree, outputs "with luck". In some rare cases where the function words in the syntactic input and the reference are exactly the same, our models can perform better than SCPN, e.g., the last two rows of Table 8: generated sentences from our model make use of the word tokens "and" and "like", while SCPN does not have access to this information and generates inferior sentences.

Conclusion
We proposed a novel setting for controlled text generation that does not require prior knowledge of all the values the control variable might take on. We also proposed a variational model, accompanied by a neural syntactic component and multiple multi-task training objectives, for addressing this task. The proposed approaches do not rely on a test-time parser or tagger and outperform our baselines. Further analysis shows that the model learns both interpretable and disentangled representations.

Figure 1 :
Figure 1: Examples from our annotated evaluation dataset of paraphrase generation using semantic input X (red), syntactic exemplar Y (blue), and the reference output Z (black).

Figure 2 :
Figure 2: Graphical model. Dashed lines indicate the inference model. Solid lines indicate the generative model.

Figure 3 :
Figure 3: Diagram showing training of the decoder. Blue lines indicate the word position loss (WPL).

Figure 5 :
Figure 5: An example of word noising. For each word token in the training sentences, we randomly replace it with other words that share the same POS tags.

Figure 6 :
Figure 6: Variants of the decoder. Left (SWAP): we swap the positions of the variables y and z. Middle (CONCAT): we concatenate the word embedding with y and z as input to the decoder. Right (INIT): we use word embeddings as input to the decoder and use the concatenation of y and z to compute the initial hidden state of the decoder.

Table 1 :
Test results. The final metric (ST) measures the syntactic match between the output and the reference.

Table 2 :
Test results when including PRL.

Table 3 :
Test results with WPL at different positions.


Table 6 :
Examples of learned word clusters. Each row is a different cluster. Numbers in the first column indicate the number of words in that cluster.


Table 8 :
Examples of generated sentences.

Table 9 :
Test results with decoder variants.