Towards Generating Long and Coherent Text with Multi-Level Latent Variable Models

Variational autoencoders (VAEs) have received much attention recently as an end-to-end architecture for text generation with latent variables. However, previous works typically focus on synthesizing relatively short sentences (up to 20 words), and the posterior collapse issue has been widely identified in text-VAEs. In this paper, we propose to leverage several multi-level structures to learn a VAE model for generating long and coherent text. In particular, a hierarchy of stochastic layers between the encoder and decoder networks is employed to abstract more informative and semantic-rich latent codes. Besides, we utilize a multi-level decoder structure to capture the coherent long-term structure inherent in long-form texts, by generating intermediate sentence representations as high-level plan vectors. Extensive experimental results demonstrate that the proposed multi-level VAE model produces more coherent and less repetitive long text compared to baselines, and that it can mitigate the posterior-collapse issue.


Introduction
The variational autoencoder (VAE) for text (Bowman et al., 2016) is a generative model in which a stochastic latent variable provides additional information to modulate the sequential text-generation process. VAEs have been used for various text processing tasks (Semeniuta et al., 2017; Zhao et al., 2017; Kim et al., 2018; Du et al., 2018; Xu and Durrett, 2018). Most recent work has focused on generating relatively short sequences (e.g., a single sentence or multiple sentences up to around twenty words), while generating long-form text (e.g., a single or multiple paragraphs) with deep latent-variable models has been less explored.
Recurrent Neural Networks (RNNs) have been a cornerstone for many text generation models (Bahdanau et al., 2015; Chopra et al., 2016), including the standard VAE model (Bowman et al., 2016). However, it is difficult to scale RNNs for long-form text generation, as they tend to generate text that is repetitive, ungrammatical, self-contradictory, overly generic and often lacking coherent long-term structure (Holtzman et al., 2018). A sample text generated from a baseline VAE model that uses an RNN decoder is shown in Table 1.

flat-VAE (baseline):
• i went here for a grooming and a dog . it was very good . the owner is very nice and friendly . the owner is really nice and friendly . i don t know what they are doing .
• the staff is very friendly and helpful . the only reason i can t give them 5 stars . the only reason i am giving the ticket is because of the ticket . can t help but the staff is so friendly and helpful . can t help but the parking lot is just the same .

multilevel-VAE (our model):
• i have been going to this nail salon for over a year now . the last time i went there . the stylist was nice . but the lady who did my nails . she was very rude and did not have the best nail color i once had .
• i am a huge fan of this place . my husband and i were looking for a place to get some good music . this place was a little bit pricey . but i was very happy with the service . the staff was friendly .

Table 1: Comparison of samples generated from two generative models on the Yelp reviews dataset. The baseline model struggles with repetitions of the same context or words, yielding non-coherent text. A hierarchical decoder with multi-layered latent variables eliminates redundancy and yields more coherent text planned around focused concepts. (See more examples in the Supplementary Material, Table 12.)
In this work, we propose various multi-level network structures for the VAE model, to address the challenges of long-term structure and repetitiveness in long-form text generation. To generate globally-coherent long text sequences, it is desirable that both the higher-level abstract features (e.g., topic, sentiment, etc.) and lower-level fine-granularity details (e.g., specific word choices) of long text can be leveraged by the generative network. It is difficult for a standard RNN to capture such structure and learn to plan ahead. To improve the model's plan-ahead capability for capturing long-term dependency, following (Roberts et al., 2018), our first multi-level structure defines a hierarchical RNN decoder as the generative network that learns sentence- and word-level representations.

Introducing long-term structure into a VAE model through a multi-level decoder structure may not, by itself, mitigate the "posterior collapse" issue, which is inherent in training VAEs with strong autoregressive decoders under a teacher-forcing scheme (Bowman et al., 2016; Yang et al., 2017; Goyal et al., 2017; Semeniuta et al., 2017; Shen et al., 2018b). Bowman et al. (2016) have shown that the posterior distribution of latent codes tends to match the prior distribution regardless of the input sequence (the KL divergence between the two distributions is very close to zero). Consequently, the information from the latent variable is not leveraged by the generative network (Bowman et al., 2016), causing "posterior collapse." Several strategies have been proposed (see optimization challenges in Section 4.2) to make the decoder less autoregressive, so that less contextual information is utilized by the decoder network (Yang et al., 2017; Shen et al., 2018b). We argue that learning more informative latent codes can enhance the generative model without the need to lessen the contextual information. In this regard, we propose leveraging a hierarchy of latent variables between the convolutional inference (encoder) networks and a multi-level recurrent generative network (decoder). With multiple stochastic layers, the prior of the bottom-level latent variable is inferred from the data, rather than fixed as a standard Gaussian distribution (as in the typical VAE setting (Kingma and Welling, 2013)). The induced latent code distribution at the bottom level can be perceived as a Gaussian mixture, and thus is endowed with more flexibility to abstract meaningful features from the input sequences. Recent work has also explored extending latent codes to be more informative (Kim et al., 2018; Gu et al., 2018). Our approach, however, is conceptually simple and easy to implement.
In this paper, we propose a novel framework, multi-level variational autoencoders (ml-VAE), to enhance long and coherent text generation. We evaluate the proposed ml-VAE comprehensively on language modeling, generic (unconditional) text generation, and conditional generation. The proposed model demonstrates substantial improvement relative to several baseline methods, in terms of perplexity on language modeling and quality of generated samples (based on BLEU statistics and human evaluation). We further show that our network can be generalized for conditional-generation scenarios.

Variational Autoencoder (VAE)
Let $x$ denote a text sequence, which consists of $L$ tokens, i.e., $x_1, x_2, \ldots, x_L$. A VAE encodes the text $x$ using a recognition (encoder) model, $q_\phi(z|x)$, parameterizing an approximate posterior distribution over a continuous latent variable $z$ (whose prior is typically chosen as a standard diagonal-covariance Gaussian). The latent code $z$ is sampled stochastically from the posterior distribution, and text sequences $x$ are generated conditioned on $z$, via a generative (decoder) network, denoted as $p_\theta(x|z)$. A variational lower bound is typically used to estimate the parameters (Kingma and Welling, 2013):

$$\mathcal{L}_{\text{vae}} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \le \log p_\theta(x). \tag{1}$$

Although VAEs have been shown to be effective in a wide variety of text processing tasks (Bowman et al., 2016; Miao et al., 2016; Yang et al., 2017; Serban et al., 2017; Semeniuta et al., 2017; Miao et al., 2017; Zhao et al., 2017; Shen et al., 2017; Guu et al., 2018; Kim et al., 2018; Yin et al., 2018; Kaiser et al., 2018; Bahuleyan et al., 2018; Chen et al., 2018b; Shen et al., 2018a; Deng et al., 2018; Shah and Barber, 2018), there are two challenges associated with applying them for generating longer sequences: (i) they lack a long-term planning mechanism, which is critical for generating semantically-coherent long texts (Serdyuk et al., 2017); and (ii) they are characterized by posterior collapse. Concerning (ii), it was demonstrated in (Bowman et al., 2016) that due to the autoregressive nature of the RNN, the decoder tends to ignore the information from $z$ entirely, resulting in an extremely small KL term (see Section 4.2).

(Li et al., 2015b).
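To make the two terms of the lower bound concrete, the following is a minimal Python sketch of reparameterized sampling from a diagonal-Gaussian posterior and of the closed-form KL term against a standard Gaussian prior. The use of NumPy, the function names, and the convention of passing the reconstruction log-likelihood in as a plain number are all illustrative assumptions, not details given in the paper.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z ~ N(mu, diag(exp(logvar))) as mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def elbo(recon_log_lik, mu, logvar):
    """Single-sample estimate of the bound: log p(x|z) minus the KL term."""
    return recon_log_lik - kl_to_standard_normal(mu, logvar)

rng = np.random.default_rng(0)
mu, logvar = np.zeros(4), np.zeros(4)      # hypothetical encoder outputs
z = reparameterize(mu, logvar, rng)        # latent code that a decoder would consume
print(kl_to_standard_normal(mu, logvar))   # posterior equals prior here, so this prints 0.0
```

When the KL term stays near zero for every input, as in this degenerate example, the posterior carries no information about x, which is the posterior-collapse situation discussed above.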
Our hypothesis is that an explicit design of (inherently hierarchical) paragraph structure can capture sentence-level coherence and potentially mitigate repetitiveness. Intuitively, when predicting each token, the decoder can use information from both the words generated previously and from sentence-level representations. Suppose an input paragraph consists of $M$ sentences, and each sentence $t$ has $N_t$ words, $t = 1, \ldots, M$. To generate the plan vectors, the model first passes a sampled latent code $z$ through a one-layer multi-layer perceptron (MLP), with ReLU activation functions, to obtain the starting state of the sentence-level LSTM decoder. Subsequent sentence representations, namely the plan vectors, are generated with the sentence-level LSTM in a sequential manner:

$$h^s_t = \mathrm{LSTM}^{\text{sent}}\left(h^s_{t-1}, z\right). \tag{2}$$

The latent code $z$ can be considered as a paragraph-level abstraction, relating to information about the semantics of each generated subsequence. Therefore we input $z$ at each time step of the sentence-level LSTM, to predict the sentence representation. A schematic view of our single-latent-variable model is shown in Figure 2 in the Supplementary Material.
The generated sentence-level plan vectors are then passed onto the word-level LSTM decoder to generate the words for each sentence. To generate each word of a sentence $t$, the corresponding plan vector, $h^s_t$, is concatenated with the word embedding of the previous word and fed to $\mathrm{LSTM}^{\text{word}}$ at every time step. Let $w_{t,i}$ denote the $i$-th token of the $t$-th sentence. This process can be expressed as (for $t = 1, 2, \ldots, M$ and $i = 1, 2, 3, \ldots, N_t$):

$$h^w_{t,i} = \mathrm{LSTM}^{\text{word}}\left(h^w_{t,i-1}, h^s_t, W_e[w_{t,i-1}]\right), \tag{3}$$

$$p\left(w_{t,i} \mid w_{t,<i}, h^s_t\right) = \mathrm{softmax}\left(V h^w_{t,i}\right). \tag{4}$$

The initial state $h^w_{t,0}$ of $\mathrm{LSTM}^{\text{word}}$ is inferred from the corresponding plan vector via an MLP layer. $V$ represents the weight matrix for computing the distribution over words, and $W_e$ are word embeddings to be learned. For each sentence, once the special END token is generated, the word-level LSTM stops decoding. The $\mathrm{LSTM}^{\text{word}}$ decoder parameters are shared for each generated sentence.
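The control flow of the two-level decoder described above can be sketched in a few lines of Python. This is only an illustration under several assumptions that are not from the paper: a plain tanh recurrent cell stands in for each LSTM, the weights are random rather than learned, decoding is greedy, the number of sentences is fixed in advance, the END id doubles as the start symbol, and a tanh layer is used for the MLP that initializes the word-level state. All sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID, LATENT = 12, 8, 16, 4   # hypothetical toy sizes
END = 0                                   # hypothetical id of the END token

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

# Randomly initialized parameters; a real model would learn them.
W_init_s = init(HID, LATENT)              # MLP: z -> initial sentence-level state
W_sent = init(HID, HID + LATENT)          # sentence-level recurrent cell
W_init_w = init(HID, HID)                 # MLP: plan vector -> initial word-level state
W_word = init(HID, HID + HID + EMB)       # word-level recurrent cell (shared across sentences)
W_e = init(VOCAB, EMB)                    # word embeddings
V = init(VOCAB, HID)                      # projection to vocabulary logits

def cell(W, h_prev, inp):
    """Plain tanh recurrent cell, used here in place of an LSTM."""
    return np.tanh(W @ np.concatenate([h_prev, inp]))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def decode(z, num_sentences=3, max_words=10):
    """Greedy decoding: one plan vector per sentence, then words given that plan vector."""
    h_s = np.maximum(0.0, W_init_s @ z)   # one-layer MLP with ReLU gives the starting state
    paragraph = []
    for _ in range(num_sentences):
        h_s = cell(W_sent, h_s, z)        # plan vector; z is fed at every sentence step
        h_w = np.tanh(W_init_w @ h_s)     # initial word-level state from the plan vector
        prev, sentence = END, []
        for _ in range(max_words):
            # Plan vector concatenated with the previous word's embedding.
            h_w = cell(W_word, h_w, np.concatenate([h_s, W_e[prev]]))
            probs = softmax(V @ h_w)      # distribution over the vocabulary
            prev = int(np.argmax(probs))
            if prev == END:               # stop this sentence once END is produced
                break
            sentence.append(prev)
        paragraph.append(sentence)
    return paragraph

print(decode(rng.standard_normal(LATENT)))
```

The point of the sketch is the nesting: the outer loop only sees z and its own previous state, while the inner loop sees the plan vector at every word step, so sentence-level information is always available when a word is predicted.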

Double Latent Variables (ml-VAE-D):
Architectures similar to our single-latent-variable ml-VAE-S model have been applied recently for multi-turn dialog response generation (Serban et al., 2017; Park et al., 2018), mainly focusing on short (one-sentence) response generation. Different from these works, our goal is to generate long text, which introduces additional challenges to the hierarchical generative network. We hypothesize that with the two-level LSTM decoder embedded into the VAE framework, the load of capturing global and local semantics is handled differently than in flat-VAEs (Chen et al., 2016). Specifically, while the multi-level LSTM decoder can capture relatively detailed information (e.g., word-level (local) coherence) via the word- and sentence-level LSTM networks, the latent codes of the VAE are encouraged to abstract more global and high-level semantic features of multiple sentences of long text.
Our double latent variable extension, ml-VAE-D, is shown in Figure 1. The inference network encodes upward through each latent variable to infer their posterior distributions, while the generative network samples downward to obtain the distributions over the latent variables. The distribution of the latent variable at the bottom is inferred from the top-layer latent codes, rather than fixed (as in a standard VAE model). This also introduces flexibility to the model to abstract useful high-level features (Gulrajani et al., 2016), which can then be leveraged by the multi-level LSTM network. Without loss of generality, here we choose to employ a two-layer hierarchy of latent variables, where the bottom and top layers are denoted as $z_1$ and $z_2$, respectively; this can be easily extended to multiple latent-variable layers.
Another important advantage of multi-layer latent variables in the VAE framework is related to the posterior collapse issue. With a single latent variable network, even with the multi-level LSTM decoder, the posterior collapse can still exist because the LSTM can still ignore the latent codes while decoding due to its autoregressive property. With the hierarchical latent variables, we propose a novel strategy to mitigate this problem, by making less restrictive assumptions regarding the prior distribution of the latent variable. As shown in the experiments, our network yields a larger KL loss term relative to flat-VAEs, indicating more informative latent codes.
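The posterior distributions over the latent variables are assumed to be conditionally independent given the input $x$. We can represent the joint posterior distribution of the two latent variables as:

$$q_\phi(z_1, z_2 | x) = q_\phi(z_2 | x)\, q_\phi(z_1 | x). \tag{5}$$

Concerning the generative network, the latent variable at the bottom is sampled conditioned on the one at the top. Thus, we have:

$$p_\theta(z_1, z_2) = p_\theta(z_2)\, p_\theta(z_1 | z_2). \tag{6}$$

To optimize the parameters of the inference and generative networks, the second term in the VAE objective, $D_{KL}(q_\phi(z|x)\,\|\,p(z))$, can be regarded as the KL divergence between the joint posterior and prior distributions of the two latent variables. Under the assumptions of (5) and (6), the variational lower bound is:

$$\mathcal{L}_{\text{vae}} = \mathbb{E}_{q(z_1|x)}\left[\log p(x|z_1)\right] - D_{KL}\left(q(z_1, z_2|x)\,\|\,p(z_1, z_2)\right), \tag{7}$$

where the functions $p_\theta$ and $q_\phi$ are abbreviated as $p$ and $q$, and:

$$D_{KL} = \int q(z_2|x)\, q(z_1|x) \log \frac{q(z_2|x)\, q(z_1|x)}{p(z_2)\, p(z_1|z_2)}\, dz_1\, dz_2 = \mathbb{E}_{q(z_2|x)}\left[D_{KL}\left(q(z_1|x)\,\|\,p(z_1|z_2)\right)\right] + D_{KL}\left(q(z_2|x)\,\|\,p(z_2)\right). \tag{8}$$

Note that the left-hand side of (8) is the abbreviation of $D_{KL}(q(z_1, z_2|x)\,\|\,p(z_1, z_2))$. Given the Gaussian assumption for both the prior and posterior distributions, both KL divergence terms can be written in closed form.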
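The two-term KL expression for the double-latent-variable model can be illustrated with a short Python sketch. It assumes diagonal Gaussians throughout, a standard Gaussian prior on the top-level variable z_2, and a Monte Carlo estimate of the expectation over q(z_2|x); the `prior_net` mapping from z_2 to the parameters of p(z_1|z_2) is a hypothetical stand-in (here a fixed linear map), not the network used in the paper.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form D_KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal covariances."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0, axis=-1
    )

def two_level_kl(mu1, logvar1, mu2, logvar2, prior_net, rng, num_samples=1):
    """Estimate E_{q(z2|x)}[KL(q(z1|x) || p(z1|z2))] + KL(q(z2|x) || p(z2))."""
    # Top level: posterior against an assumed standard Gaussian prior p(z2).
    kl_top = kl_diag_gaussians(mu2, logvar2, np.zeros_like(mu2), np.zeros_like(logvar2))
    # Bottom level: the prior over z1 is computed from a sample z2 ~ q(z2|x).
    kl_bottom = 0.0
    for _ in range(num_samples):
        z2 = mu2 + np.exp(0.5 * logvar2) * rng.standard_normal(mu2.shape)
        prior_mu1, prior_logvar1 = prior_net(z2)
        kl_bottom += kl_diag_gaussians(mu1, logvar1, prior_mu1, prior_logvar1)
    return kl_bottom / num_samples + kl_top

# Hypothetical prior network: a fixed linear map from z2 to the mean of p(z1|z2).
A = np.array([[0.5, 0.0], [0.0, -0.5]])

def prior_net(z2):
    return A @ z2, np.zeros(2)

rng = np.random.default_rng(0)
kl = two_level_kl(np.ones(2), np.zeros(2), np.zeros(2), np.zeros(2), prior_net, rng, 8)
print(kl)
```

Because the bottom-level prior moves with z_2, the bottom KL term can be small even when q(z_1|x) is far from a standard Gaussian, which is the flexibility the text attributes to the learned prior.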

Model Specifications
To abstract meaningful representations from the input paragraphs, we choose a hierarchical CNN architecture for the inference/encoder networks. Specifically, our model first applies a sentence-level CNN encoder to each sentence to obtain a fixed-length vector. Then, a paragraph-level CNN encoder is utilized to aggregate the vectors with respect to all sentences. Note that the inference networks parameterizing $q(z_1|x)$ and $q(z_2|x)$ share the parameters of the lower-level CNN.
The single-variable ml-VAE-S model feeds the paragraph feature vector into the linear layers to infer the mean and variance of the latent variable $z$. In the double-variable model ml-VAE-D, the feature vector is further transformed with two MLP layers, and then is used to compute the mean and variance of the top-level latent variable.

VAE for text generation
The variational autoencoder, trained under the neural variational inference (NVI) framework, has been widely used for generating text sequences (Bowman et al., 2016; Yang et al., 2017; Semeniuta et al., 2017; Zhao et al., 2017). By encouraging the latent feature space to match a prior distribution within an encoder-decoder architecture, the learned latent variable could potentially encode high-level semantic features and serve as a global representation during the decoding process (Bowman et al., 2016). The generated results are also endowed with better diversity due to the sampling procedure of the latent codes (Zhao et al., 2017). Another type of deep generative model that has been widely adopted for text generation is Generative Adversarial Networks (GANs) (Yu et al., 2017; Hu et al., 2017; Zhang et al., 2017; Fedus et al., 2018; Chen et al., 2018a). However, existing works have mostly focused on generating one sentence (or multiple sentences with at most twenty words in total). The task of generating relatively longer units of text has been less explored.

Optimization Challenges with Text-VAEs
The "posterior collapse" issue associated with training text-VAEs was first outlined by Bowman et al. (2016). They used two strategies, KL divergence annealing and word dropout; however, neither of them helped to improve the perplexity compared to a plain neural language model. Yang et al. (2017) argue that the small KL term relates to the strong autoregressive nature of an LSTM generative network, and they proposed to utilize a dilated CNN as a decoder to improve the informativeness of the latent variable. Zhao et al. (2018b) proposed to augment the VAE training objective with an additional mutual information term. This further yields an intractable integral in the case where the latent variables are continuous. We deal with "posterior collapse" from two perspectives: (i) more flexible priors are assumed over the latent variables (learned from the data); and (ii) the hierarchical structure within a paragraph is taken into account, so that the latent variables can focus less on the local information (e.g., word-level coherence) and more on the global features.
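For readers unfamiliar with the two strategies mentioned above, KL divergence annealing and word dropout, the following Python sketch shows one common way each can be realized. The linear schedule, the warm-up length, the dropout rate, and the function names are illustrative assumptions; the exact schedules used in prior work may differ.

```python
import random

def kl_weight(step, warmup_steps=10000):
    """Linearly anneal the weight on the KL term from 0 to 1 over `warmup_steps` updates."""
    return min(1.0, step / warmup_steps)

def word_dropout(token_ids, unk_id, rate, rng):
    """Replace each decoder-input token with `unk_id` with probability `rate`."""
    return [unk_id if rng.random() < rate else tok for tok in token_ids]

rng = random.Random(0)
print(kl_weight(2500))   # 0.25
print(word_dropout([5, 8, 13, 21], unk_id=1, rate=0.5, rng=rng))
```

In training, the annealed objective would be the reconstruction loss plus `kl_weight(step)` times the KL term, and the corrupted token sequence would replace the teacher-forced decoder input so that the decoder must rely more on the latent code.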

Hierarchical Structures in NLP
Natural language is inherently organized in a hierarchical manner (characters form a word, words form a sentence, sentences form a paragraph, paragraphs form a document, etc.). In (Yang et al., 2016), multi-level LSTM encoders are used at the word- and sentence-level along with an attention mechanism to learn document representations. A hierarchical autoencoder is proposed in (Li et al., 2015a) to reconstruct long-paragraph text. Our approach is conceptually similar to the model in (Serban et al., 2017), in which a stochastic latent variable is produced for each sentence during decoding. In contrast, our model encodes the entire paragraph into one single latent variable. As a result, the latent variable learned in our model relates more to the global semantic information of a paragraph, whereas those in (Serban et al., 2017) mainly contain the local information of a specific sentence. Therefore, their model is not suitable for tasks such as latent space interpolation.
Finally, our work is related to prior work that addresses plan-ahead capabilities in decoders. In (Park et al., 2018) a variational hierarchical conversational (VHCR) model is proposed with global and local latent variables. The VHCR model generates its local/utterance variables from the global latent variable, while fixing the priors for the two sets of latent variables to be standard diagonal-covariance Gaussian. In contrast, both of our latent variables in ml-VAE-D are designed to contain global information. The prior of the bottom-level latent variable in our model is learned from the data (and is thus more flexible relative to a fixed prior), which exhibits promising results in terms of mitigating the issue of "posterior collapse" (see Table 2). Furthermore, in VHCR, the responses are generated conditionally on the latent variables and context, while our ml-VAE-D model captures the underlying data distribution of the entire paragraph in the bottom latent variable ($z_1$). Therefore, the (global) latent variable learned by our model should contain more information.

Experimental Setup
Datasets We conducted experiments on both generic (unconditional) long-form text generation and conditional paragraph generation (with additional text input as auxiliary information). For the former, we use two datasets: Yelp Reviews (Zhang et al., 2015) and arXiv Abstracts (Celikyilmaz et al., 2018). For the conditional-generation experiments, we consider the task of synthesizing a paper abstract (which typically includes several sentences) conditioned on the paper title (with the arXiv Abstracts dataset). More details of the dataset statistics and model architectures are provided in the Supplementary Materials.
Baselines For language modeling experiments, we implemented several baselines: language model with a flat LSTM decoder (flat-LM), VAE with a flat LSTM decoder (flat-VAE), and language model with a multi-level LSTM decoder (ml-LM).
For generic text generation, we further consider two recently proposed generative models as baselines: Adversarial Autoencoders (AAE) (Makhzani et al., 2015) and Adversarially-Regularized Autoencoders (ARAE) (Zhao et al., 2018a). Instead of penalizing the KL divergence term, AAE introduces a discriminator network to match the prior and posterior distributions of the latent variable. The ARAE model extends AAE by introducing a Wasserstein GAN loss (Arjovsky et al., 2017) and a stronger generator network. We build two variants of our multi-level VAE models: single latent variable ml-VAE-S and double latent variable ml-VAE-D. Our code will be released to encourage future research.

Language Modeling Results
We first evaluate our method on the language modeling task using the Yelp and arXiv datasets, where we report the negative log likelihood (NLL) and perplexity (PPL). Following (Bowman et al., 2016; Yang et al., 2017; Kim et al., 2018), we utilize the KL loss term to measure the extent of "posterior collapse." For this experiment, flat-LM, flat-VAE, and ml-LM are considered as baselines.
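As a reminder of how the two reported quantities relate, the sketch below computes per-token perplexity from a summed negative log likelihood. It assumes the NLL is measured in nats and summed over the tokens of a paragraph, and that for VAE models the reported NLL is the variational bound (reconstruction term plus KL term); the numbers in the example are hypothetical and are not results from the paper.

```python
import math

def perplexity(total_nll, num_tokens):
    """Per-token perplexity from a summed negative log likelihood (in nats)."""
    return math.exp(total_nll / num_tokens)

def vae_nll_bound(recon_nll, kl):
    """Upper bound on the NLL of a VAE: reconstruction term plus KL term."""
    return recon_nll + kl

# Hypothetical values for a 100-token paragraph.
print(perplexity(vae_nll_bound(recon_nll=380.0, kl=5.0), num_tokens=100))
```

Under this accounting, a larger KL term only pays off in perplexity if the reconstruction term drops by more than the KL term grows.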
As shown in Table 2, on the Yelp dataset, the standard flat-VAE has a KL divergence term very close to zero, indicating that the generative model makes negligible use of the information from the latent variable $z$. Consequently, the flat-VAE model obtains slightly worse NLL and PPL relative to a flat LSTM-based language model. In contrast, with a multi-level LSTM decoder, our ml-VAE-S yields increased KL divergence, demonstrating that the VAE model tends to leverage more information from the latent variable in the decoding stage. The PPL of ml-VAE-S is also decreased from 47.9 to 46.6 (compared to ml-LM), indicating that the sampled latent codes help in making word-level predictions.
Our double latent variable model ml-VAE-D exhibits an even larger KL divergence cost term (increased from 3.6 to 6.8) than that with a single latent variable, indicating that more information from the latent variable has been utilized by the generative network. This may be attributed to the fact that the latent variable priors of the ml-VAE-D model are inferred from the data, rather than fixed as a standard Gaussian distribution. As a result, the model is endowed with more flexibility to encode informative semantic features in the latent variables, while still matching their posterior distributions to the corresponding priors. More importantly, by effectively exploiting the sampled latent codes, ml-VAE-D achieves the best PPL results on both datasets (on the arXiv dataset, our hierarchical decoder outperforms the ml-LM by reducing the PPL from 58.1 down to 54.3).

Unconditional Text Generation
We further evaluate the quality of generated paragraphs as follows. We randomly sample 1000 latent codes and send them to all trained generative models to generate text. We use corpus-level BLEU score (Papineni et al., 2002) to quantitatively evaluate the generated paragraphs. Specifically, we follow the strategy in (Yu et al., 2017; Zhang et al., 2017) and use the entire test set as the reference for each generated text, and get average BLEU scores over 1000 generated sentences for each model.
As shown in Table 3, VAE tends to be a stronger baseline for paragraph generation, exhibiting higher corpus-level BLEU scores than both AAE and ARAE. This observation is consistent with the results in (Cífka et al., 2018). The VAE with multi-level decoder demonstrates better BLEU scores than the one with a flat decoder, indicating that the plan-ahead mechanism associated with the hierarchical decoding process indeed benefits the sampling quality. Moreover, ml-VAE-D exhibits slightly better results than ml-VAE-S. We attribute this to the more flexible prior distribution of ml-VAE-D, which improves the ability of the inference networks to extract semantic features from a paragraph, and thus yields more informative latent codes.
To further illustrate the capability of our model to extract global features, we visualize the learned latent variable. Using the arXiv dataset, we select the most frequent four article topics and retrain our ml-VAE-D model on the corresponding abstracts in an unsupervised way (no topic information).

In Table 1, two samples of generations from flat-VAE and ml-VAE-D are shown. Compared to our hierarchical model, a flat decoder with a flat VAE exhibits repetitions and suffers from uninformative sentences. The hierarchical model generates reviews that contain more information with fewer repetitions (word or semantic repetitions), and that tend to be semantically coherent.

Diversity of Generated Paragraphs
We also evaluate the diversity of random samples from a trained model, since one model might generate realistic-looking sentences while suffering from severe mode collapse (i.e., low diversity). Three metrics are employed to measure the diversity of generated paragraphs: Self-BLEU scores (Zhu et al., 2018), unique n-grams (Fedus et al., 2018) and the entropy score (Zhang et al., 2018). For a set of sampled sentences, the Self-BLEU metric calculates the BLEU score of each sample with respect to all other samples as the reference (the numbers over all samples are then averaged); the unique score computes the percentage of unique n-grams within all the generated reviews; and the entropy score measures how even the empirical n-gram distribution is for a given sentence, which does not depend on the size of the testing data, as opposed to unique scores. Note that a lower Self-BLEU score indicates more diverse samples, whereas for the unique n-gram percentage and the entropy score, higher values indicate more diversity.

Samples of generated arXiv abstracts:
• we study the effect of disorder on the dynamics of a two-dimensional electron gas in a two-dimensional optical lattice , we show that the superfluid phase is a phase transition , we also show that , in the presence of a magnetic field , the vortex density is strongly enhanced .
• in this work we study the dynamics of a colloidal suspension of frictionless , the capillary forces are driven by the UNK UNK , when the substrate is a thin film , the system is driven by a periodic potential , we also study the dynamics of the interface between the two different types of particles .
We randomly sample 1000 reviews from each model, and the corresponding results are shown in Table 4. Note that a small self-BLEU score must be accompanied by a large BLEU score to justify the effectiveness of a model, i.e., being able to generate realistic-looking as well as diverse samples. Among all the VAE variants, ml-VAE-D shows the smallest self-BLEU score and largest unique n-grams percentage, further demonstrating the advantages of making both the generative networks and latent variables hierarchical. Concerning AAE and ARAE, although they exhibit better diversity according to both metrics, their corpus-level BLEU scores are much worse relative to ml-VAE-D. Thus, we leverage human evaluation for further comparison.
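Two of the diversity metrics described above, the unique n-gram percentage and the n-gram entropy, are simple enough to sketch directly in Python. The version below is an assumption-laden illustration: it works on whitespace-tokenized samples, reports the unique score as a fraction rather than a percentage, and computes the entropy over whatever list of samples is passed in (passing a single sample gives a per-sentence value). Self-BLEU is omitted because it requires a full BLEU implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unique_ngram_ratio(samples, n):
    """Fraction of distinct n-grams among all n-grams in the tokenized samples."""
    all_ngrams = [g for s in samples for g in ngrams(s, n)]
    return len(set(all_ngrams)) / len(all_ngrams)

def ngram_entropy(samples, n):
    """Shannon entropy (in nats) of the empirical n-gram distribution."""
    counts = Counter(g for s in samples for g in ngrams(s, n))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

samples = [
    "the staff is very friendly".split(),
    "the staff is very helpful".split(),
]
print(unique_ngram_ratio(samples, 2))  # 0.625 (5 distinct bigrams out of 8)
print(ngram_entropy(samples, 2))
```

Both quantities grow as the samples repeat themselves less, which is why they complement Self-BLEU, where lower values indicate less overlap between samples.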
Human Evaluation We conducted human evaluation using Amazon Mechanical Turk to assess the coherence and non-redundancy of the texts generated from our models in comparison to the baselines, which is difficult to measure based on automated metrics. Given a pair of generated reviews, the judges are asked to select their preferences (no difference between the two reviews is also an option) according to the following four evaluation criteria: fluency & grammar, consistency, non-redundancy, and overall. Details of the evaluation are provided in the SM. As shown in Table 8, ml-VAE generates superior human-looking samples compared to flat-VAE on the Yelp Reviews dataset. Even though both models underperform when compared against the ground-truth real reviews, ml-VAE was rated higher in comparison to flat-VAE (raters find ml-VAE closer to human-generated than the flat-VAE) in all evaluation criteria. We further compare our methods against AAE (the same data preprocessing steps and hyperparameters are employed). The results show that ml-VAE again produces more grammatically-correct and semantically-coherent samples than the AAE baseline.

Title: Magnetic quantum phase transitions of the antiferromagnetic -Heisenberg model
We study the phase diagram of the model in the presence of a magnetic field, The model is based on the action of the Polyakov loop, We show that the model is consistent with the results of the first order perturbation theory.

Title: Kalman Filtering With UNK Over Wireless UNK Channels
The Kalman filter is a powerful tool for the analysis of quantum information, which is a key component of quantum information processing, However, the efficiency of the proposed scheme is not well understood .

Conditional Paragraph Generation
We further evaluate the proposed VAE model on a conditional generation task. Specifically, we consider the task of generating the abstract of a paper based on the corresponding title. The same arXiv dataset is utilized, where, during training, the title and abstract are given as paired text sequences. The title is used as input of the inference network. For the generative network, instead of reconstructing the same input (i.e., title), the paper abstract is employed as the target for decoding. We compare the ml-VAE-D model against ml-LM. We observe that the ml-VAE-D model achieves a test perplexity of 55.7 (with a KL term of 2.57), which is smaller than the test perplexity of ml-LM (58.1). This indicates that the information from the title has indeed been leveraged by the generative network to facilitate the decoding process. In Table 7 we show two generated samples from the ml-VAE-D model.
A: the service was great, the receptionist was very friendly and the place was clean, we waited for a while, and then our room was ready .
• same with all the other reviews, this place is a good place to eat, i came here with a group of friends for a birthday dinner, we were hungry and decided to try it, we were seated promptly.
• this place is a little bit of a drive from the strip, my husband and i were looking for a place to eat, all the food was good, the only thing i didn t like was the sweet potato fries.
• this is not a good place to go, the guy at the front desk was rude and unprofessional, it s a very small room, and the place was not clean.
• service was poor, the food is terrible, when i asked for a refill on my drink, no one even acknowledged me, they are so rude and unprofessional.
B: how is this place still in business, the staff is rude, no one knows what they are doing, they lost my business .
Table 9: Intermediate sentences are produced from linear transition between two points in the latent space.

Analysis
The Continuity of Latent Space Following (Bowman et al., 2016), we further measure the continuity of the learned latent space. Specifically, two points are randomly sampled from the prior latent space (denoted as A and B). Sentences are generated based on the equidistant intermediate points along the linear trajectory between A and B. As shown in Table 9, these intermediate samples are all realistic-looking reviews that are syntactically and semantically reasonable, demonstrating the smoothness of the learned VAE latent space. Interestingly, we even observe that the generated sentences gradually transition from positive to negative sentiment along the linear trajectory. To validate that the sentences are not generated by simply retrieving the training data, we further find the closest instance, among the entire training set, for each generated review. We demonstrate the details of the results in the SM (Table 13).
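The interpolation procedure described above amounts to a few lines of vector arithmetic. The Python sketch below computes the equidistant intermediate points between two latent codes; the latent dimensionality, the number of intermediate points, and the function name are hypothetical, and the decoding step that turns each point into text is left out.

```python
import numpy as np

def interpolate(z_a, z_b, num_intermediate):
    """Return equidistant points strictly between z_a and z_b on the line joining them."""
    ts = np.linspace(0.0, 1.0, num_intermediate + 2)[1:-1]
    return [(1.0 - t) * z_a + t * z_b for t in ts]

rng = np.random.default_rng(0)
z_a = rng.standard_normal(32)   # point A, sampled from a standard Gaussian prior
z_b = rng.standard_normal(32)   # point B
points = interpolate(z_a, z_b, num_intermediate=4)
print(len(points))  # 4; each point would be passed to the trained decoder
```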
Attribute Vector Arithmetic To investigate the structure of the latent space, we conduct an experiment to alter the sentiments of reviews with an attribute vector. We encode the reviews of the Yelp Review training dataset with positive sentiment, sample a latent code for each review, and measure the mean latent vector. The mean latent vector of the negative reviews is computed in the same way. We subtract the negative mean vector from the positive mean vector to obtain the "sentiment attribute vector". Next, for evaluation, we randomly sample 1000 reviews with negative sentiment and add the "sentiment attribute vector" to their latent codes. The manipulated latent vectors are then fed to the hierarchical decoder to produce the transferred sentences, hypothesizing that they will convey positive sentiment.
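The attribute vector computation described above can be sketched as follows in Python. The latent codes here are random stand-ins for encoded reviews, and the array shapes and function names are hypothetical; in the actual experiment the codes would come from the trained inference network and the shifted codes would be passed to the hierarchical decoder.

```python
import numpy as np

def attribute_vector(pos_codes, neg_codes):
    """Mean latent code of positive reviews minus mean latent code of negative reviews."""
    return pos_codes.mean(axis=0) - neg_codes.mean(axis=0)

def transfer(codes, attr_vec):
    """Shift latent codes along the attribute direction."""
    return codes + attr_vec

rng = np.random.default_rng(0)
pos_codes = rng.standard_normal((100, 32)) + 1.0   # stand-ins for codes of positive reviews
neg_codes = rng.standard_normal((100, 32)) - 1.0   # stand-ins for codes of negative reviews
v = attribute_vector(pos_codes, neg_codes)
shifted = transfer(neg_codes[:10], v)              # would be fed to the decoder
print(shifted.shape)  # (10, 32)
```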
As shown in Table 10, the original sentences have been successfully manipulated to positive sentiment with the simple attribute vector operation. However, the specific contents of the reviews are not fully retained. An interesting future direction is to decouple the style and content of long-form texts to allow content-preserving attribute manipulation. We further employed a CNN sentiment classifier to evaluate the sentiment of manipulated sentences. The classifier is trained on the entire training set and achieves a test accuracy of 94.2%. With this pre-trained classifier, 83.4% of the transferred reviews are judged to be positive-sentiment, indicating that "attribute vector arithmetic" consistently produces the intended manipulation of sentiment.

Original: you have no idea how badly i want to like this place, they are incredibly vegetarian vegan friendly , i just haven t been impressed by anything i ve ordered there , even the chips and salsa aren t terribly good , i do like the bar they have great sangria but that s about it .
Transferred: this is definitely one of my favorite places to eat in vegas , they are very friendly and the food is always fresh, i highly recommend the pork belly , everything else is also very delicious, i do like the fact that they have a great selection of salads .

Original: my boyfriend and i are in our 20s , and have visited this place multiple times , after our visit yesterday , i don t think we ll be back , when we arrived we were greeted by a long line of people waiting to buy game cards .
Transferred: my boyfriend and i have been here twice , and have been to the one in gilbert several times too , since my first visit , i don t think i ve ever had a bad meal here , the servers were very friendly and helpful .

Table 10: Sentiment transfer results with attribute vector arithmetic. More samples can be found in the SM (Table 14).

Conclusion
We have introduced a hierarchically-structured variational autoencoder for long text generation.
A multi-level LSTM generative network is employed that models semantic coherence at both the word and sentence levels. A hierarchy of stochastic layers is further utilized, where the priors of the latent variables are learned from the data. Consequently, more informative latent codes are manifested, indicated by a larger KL loss term yet smaller variational lower bound. The generated samples from the proposed model also exhibit superior quality relative to those from several baseline methods (according to automatic metrics). Human evaluations further demonstrate that the samples from ml-VAE are less repetitive and more semantically consistent.

A Datasets & Model Details
In the following, we provide details of the data preprocessing and the experimental setups used in the experiments. For both the Yelp Reviews and arXiv Abstracts datasets, we truncate the original paragraph to the first five sentences (split by punctuation marks including comma, period and point symbols), where each sentence contains at most 25 words. Therefore, each paragraph has at most 125 words. We further remove those sentences that contain fewer than 30 words. The statistics of both datasets are detailed in Table 11. Note that the average length of the paragraphs considered here is much larger than in previous generative models for text (Bowman et al., 2016; Yu et al., 2017; Hu et al., 2017; Zhang et al., 2017), since these works considered text sequences that contain only one sentence with at most twenty words.

In all the VAE models and their extensions, the dimension of the latent variable z is set to 300. The dimensions of both the sentence-level and word-level LSTM decoders are set to 512. For the generative networks, to infer the bottom-level latent variable (i.e., modeling p(z_1 | z_2)), we first feed the sampled latent codes from z_2 to two MLP layers, which are followed by two linear transformations to infer the mean and variance of z_1, respectively.
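The truncation rule described at the start of this appendix can be sketched as follows. This is only an illustration under stated assumptions: sentences are split on commas and periods, words are whitespace-separated tokens, and the 30-word minimum is applied to the whole truncated paragraph, which is one possible reading of the filtering step. The function name `preprocess` is hypothetical.

```python
import re

MAX_SENTENCES = 5            # keep only the first five sentences
MAX_WORDS_PER_SENTENCE = 25  # keep at most 25 words per sentence
MIN_WORDS = 30               # assumption: minimum length of the truncated paragraph


def preprocess(paragraph):
    """Return a list of word lists (one per sentence), or None if the text is too short."""
    # Assumption: commas and periods act as sentence delimiters.
    pieces = [p.strip() for p in re.split(r"[,.]", paragraph)]
    sentences = [p.split() for p in pieces if p]
    sentences = [s[:MAX_WORDS_PER_SENTENCE] for s in sentences[:MAX_SENTENCES]]
    total_words = sum(len(s) for s in sentences)
    if total_words < MIN_WORDS:
        return None
    return sentences


# Seven sentences of 30 words each are cut down to five sentences of 25 words.
long_text = ". ".join(" ".join(["w"] * 30) for _ in range(7)) + "."
kept = preprocess(long_text)
dropped = preprocess("too short, really.")
```

With these limits a preprocessed paragraph contains at most 5 × 25 = 125 words, matching the bound stated above.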
The model is trained using Adam (Kingma and Ba, 2014) with a learning rate of 3 × 10^-4 for all parameters, decayed by a factor of 0.99 every 3000 iterations. Dropout (Srivastava et al., 2014) is employed on both the word embedding and latent variable layers, with rates selected from {0.3, 0.5, 0.8} on the validation set. We set the mini-batch size to 128. Following Bowman et al. (2016), we adopt the KL cost annealing strategy to stabilize training: the weight of the KL cost term is increased linearly to 1 over the first 10,000 iterations. All experiments are implemented in TensorFlow (Abadi et al., 2016), using one NVIDIA GeForce GTX TITAN X GPU with 12GB memory.
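The two schedules mentioned above can be written down in a few lines. The sketch below assumes that the KL weight starts at 0 and that the learning-rate decay is applied in discrete steps (a staircase schedule); neither detail is specified in the text, and the function names are hypothetical.

```python
def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: weight grows from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)


def learning_rate(step, base_lr=3e-4, decay_rate=0.99, decay_steps=3000):
    """Staircase exponential decay: multiply by decay_rate once every decay_steps iterations."""
    return base_lr * decay_rate ** (step // decay_steps)


def annealed_loss(reconstruction_loss, kl_divergence, step):
    """Training objective with the annealed KL term."""
    return reconstruction_loss + kl_weight(step) * kl_divergence


for step in (0, 5_000, 10_000, 30_000):
    print(step, kl_weight(step), learning_rate(step))
```

Early in training the KL term contributes little to the objective, which gives the decoder an incentive to use the latent code before the full penalty applies.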

D Human evaluation setup and details
Some properties of the generated paragraphs, such as (topic) coherence or non-redundancy, cannot be easily measured by automated metrics. Therefore, we further conduct a human evaluation based on 100 samples randomly generated by each model (the models are trained on the Yelp Reviews dataset for this evaluation). We compare our proposed ml-VAE-D model with flat-VAE, adversarial autoencoders (AAE) and real samples from the test set. The same hyperparameters are employed for the different model variants to ensure a fair comparison. We evaluate the quality of these generated samples with a blind heads-up comparison using Amazon Mechanical Turk. Given a pair of generated reviews, the judges are asked to select their preference ("no difference between the two reviews" is also an option) according to the following four evaluation criteria: (1) fluency & grammar, the one that is more grammatically correct and fluent; (2) consistency, the one that depicts a sequence of topics and events that is more consistent; (3) non-redundancy, the one that is better at non-redundancy (if a review repeats itself, this can be taken into account); and (4) overall, the one that more effectively communicates reasonable content. These different criteria help to quantify the impact of the hierarchical structures employed in our model, while the non-redundancy and consistency metrics could be especially correlated with the model's plan-ahead abilities. The generated paragraphs are presented to the judges in a random order and they are not told the source of the samples. Each sample is rated by three judges and the results are averaged across all samples and judges.
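One way to aggregate such pairwise judgments is sketched below. The vote labels ("win", "lose", "tie") and the function name are assumptions made for illustration, and the votes are toy data rather than collected ratings.

```python
from collections import Counter


def aggregate_votes(votes_per_pair):
    """Pool the votes of all judges over all pairs and return percentages per outcome."""
    counts = Counter()
    for votes in votes_per_pair:
        counts.update(votes)
    total = sum(counts.values())
    return {label: 100.0 * counts[label] / total for label in ("win", "lose", "tie")}


# Toy example: two review pairs, each rated by three judges on one criterion.
toy_votes = [
    ["win", "win", "tie"],
    ["lose", "win", "tie"],
]
result = aggregate_votes(toy_votes)
print(result["win"])  # 50.0
```

The same aggregation would be run separately for each of the four criteria and for each pair of compared systems.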

E More Samples on Attribute Vector Arithmetic
We provide more samples for sentiment manipulation, where we intend to alter sentiment of negative Yelp reviews with "attribute vector arithmetic", as a continuation of Table 10.

F Comparison with the "utterance drop" strategy
To resolve the "posterior collapse" issue of training textual VAEs, Park et al. (2018) also introduced a strategy called utterance drop (u.d). Specifically, they proposed to weaken the autoregressive power of hierarchical RNNs by dropping the utterance encoder vector with a certain probability. To investigate the effectiveness of their method relative to our strategy of employing a hierarchy of latent variables, we further conduct a comparative study. In particular, we utilize ml-VAE-S as the baseline model and apply each of the two strategies to it. The corresponding results on language modeling (Yelp dataset) are shown in Table 15. Their u.d strategy indeed allows better usage of the latent variable (indicated by a larger KL divergence value). However, the NLL of the language model becomes even worse, possibly due to the weakening of the decoder during training (similar observations have also been reported in Table 2 of Park et al. (2018)). In contrast, our hierarchical prior strategy yields a larger KL term as well as a lower NLL value, indicating the advantage of our strategy in mitigating the "posterior collapse" issue.

this is a great little restaurant in vegas , i had the shrimp scampi and my wife had the shrimp scampi, and my husband had the shrimp scampi , it was delicious , i had the shrimp scampi which was delicious and seasoned perfectly .
my wife and i went to this place for dinner , we were seated immediately , the food was good , i ordered the shrimp and grits , which was the best part of the meal .
very good chinese food, very good chinese food, the service was very slow, i guess that s what they were doing, very slow to get a quick meal.
we got a gift certificate from a store, we walked in and were greeted by a young lady who was very helpful and friendly, so we decided to get a cut, I was told that they would be ready in 15 minutes.
we go there for breakfast, i ve been here 3 times and it s always good, the hot dogs are delicious, and the hot dogs are delicious, i ve been there for breakfast and it is so good.
the place was packed, chicken was dry, tasted like a frozen hot chocolate, others were just so so, i wouldn t recommend this place.
do not go here, their food is terrible, they were very slow, in my opinion.
went today with my wife, and received a coupon for a free appetizer, we were not impressed, we both ordered the same thing, and we were not impressed.
the wynn is a great place to eat, the food was great and i had the linguine, and it was so good, i had the linguine and clams, ( i was so excited to try it ).
recently visited this place for the first time, i live in the area and have been looking for a good local place to eat, we stopped in for a quick bite and a few beers, always a nice place to sit and relax, wonderful and friendly staffs.
i came here for a quick bite before heading to a friend s recommendation, the place was packed, but the food was delicious, i am a fan of the place, and the place is packed with a lot of people.
best haircut i ve had in years, friendly staff and great service, he made sure that i was happy with my hair cut, just a little pricey but worth it, she is so nice and friendly.
had a great experience here today, the delivery was friendly and efficient and the food was good, i would recommend this place to anyone who will work in the future, will be back again.
great place to go for a date night, first time i went here, service is good, the staff is friendly, 5 stars for the food.
best place to get in vegas, ps the massage here is awesome, if you want to spend your money, then go there, ps the massage is great.
Table 12: Samples randomly generated from ml-VAE-D and flat-VAE, which are both trained on the Yelp review dataset. The repetitive patterns within the generated reviews are highlighted.

Figure 2: Schematic diagram of the proposed multi-level VAE with single latent variable.

Table 3: Evaluation results for sequences generated by our models and baselines, measured by corpus-level BLEU scores (B-n denotes the corpus-level BLEU-n score).

Table 4: The self-BLEU scores, unique n-gram percentages and 2-gram entropy scores of 1000 generated sentences. Models are trained on the Yelp Reviews dataset to evaluate the diversity of generated samples.
mation is used). We sample latent codes from the learned model and visualize them with t-SNE in Figure 5. Each point indicates one paper abstract and the color of each point indicates the topic it belongs to. The embeddings with the same label are very close in the 2-D plot, while those with different labels are relatively farther away from each other. The embeddings of the High Energy Physics and Nuclear topic abstracts are meshed, which is expected since these two topics are semantically highly related. The results show that the inference network is able to extract meaningful global patterns from the input paragraph.

Table 5: t-SNE visualization of the learned latent codes.

Table 6: Generated samples from ml-VAE-D (trained on the arXiv abstract dataset).

Table 7: Conditionally generated paper abstracts based upon a title (trained with the arXiv data).

Table 8: A Mechanical Turk blind heads-up evaluation between pairs of models trained on the Yelp Reviews dataset.

Table 11: Summary statistics for the datasets used in the generic text generation experiments.

Samples from ml-VAE-D vs flat-VAE We provide additional examples for the comparison between ml-VAE-D and flat-VAE in Table 12, as a continuation of Table 1.

We provide samples of retrieved instances from the Yelp Review training dataset which are closest to the generated samples. Table 13 shows the closest training sample for each generated Yelp review. The first column contains the intermediate sentences generated along the linear transition from a point A to another point B in the prior latent space. The second column contains the real sentences retrieved from the training set that are closest to the generated ones (as determined by BLEU-2 score). We can see that the retrieved training data is quite different from the generated samples, indicating that our model is indeed generating samples that it has never seen during training.
love this place. Lots of veggie options. Try veggie quesadilla. I love this place. Lots of veggie options. Try veggie quesadilla. I

Table 15: Comparison with the utterance drop strategy.

would give this place zero stars if i could , the guy who was working the front desk was rude and unprofessional , i have to say that i was in the wrong place , and i m not sure what i was thinking , this is not a good place to go to . i