FlowPrior: Learning Expressive Priors for Latent Variable Sentence Models

Variational autoencoders (VAEs) are widely used for latent variable modeling of text. We focus on variations that learn expressive prior distributions over the latent variable. We find that existing training strategies are not effective for learning rich priors, so we propose adding the importance-sampled log marginal likelihood as a second term to the standard VAE objective to help when learning the prior. Doing so improves results for all priors evaluated, including a novel choice for sentence VAEs based on normalizing flows (NF). Priors parameterized with NF are no longer constrained to a specific distribution family, allowing a more flexible way to encode the data distribution. Our model, which we call FlowPrior, shows a substantial improvement in language modeling tasks compared to strong baselines. We demonstrate that FlowPrior learns an expressive prior with analysis and several forms of evaluation involving generation.


Introduction
Variational autoencoders (VAEs; Kingma and Welling, 2014) have been widely applied to many natural language processing tasks (Bowman et al., 2016; Zhang et al., 2016; Shen et al., 2017; Kim et al., 2018; Fang et al., 2019; Chen et al., 2019). VAEs provide statistical transparency in describing observations in a latent space and flexibility when used in applications that require directly manipulating the learned representation (Hu et al., 2017). Recent work (Li et al., 2020) has combined VAEs with BERT/GPT in representation learning and guided generation. However, the representation capacity of VAEs is still limited for modeling sentences, for two main reasons.

One is the posterior collapse problem, in which the posterior "collapses" to the prior and the generator learns to ignore the latent variable (Bowman et al., 2016). Many methods have been developed to address it: annealing (Fu et al., 2019), weakening the capacity of the generator (Semeniuta et al., 2017; Yang et al., 2017), manipulating training objectives (Burda et al., 2016; Higgins et al., 2017; Zhao et al., 2017), including the use of free bits (FB) (Kingma et al., 2016; Li et al., 2019), and changing the training procedure (He et al., 2019).

The other is the restrictive parametric form assumed for the prior and approximate posterior. While these forms are computationally efficient, they limit the expressivity of the model. The main existing solutions (Kingma et al., 2016; Tomczak and Welling, 2018; Razavi et al., 2019) focus on enriching the variational posterior, while other work focuses on learning an expressive prior (Tomczak and Welling, 2018; Serban et al., 2017; Chen et al., 2017).
In this paper, we follow the latter line of research and draw upon methods for building and learning expressive priors. We first show empirically that the original VAE objective, the evidence lower bound (ELBO), is not effective for learning priors. The issue is not solely due to posterior collapse, since it is not resolved by modifications based on free bits. To address this issue, we propose a combined objective, adding to the ELBO a second term (denoted M_IS) which is a different lower bound on the log marginal likelihood, obtained using importance sampling (Burda et al., 2016).

Using the combination of the ELBO and M_IS, we compare multiple choices for the prior, including a mixture of Gaussians, a prior based on a variational mixture of posteriors (VampPrior; Tomczak and Welling, 2018), and a prior based on normalizing flows (NF), specifically real NVP transformations (Dinh et al., 2016). Using a real NVP prior entails creating an invertible mapping from a simple base distribution to the prior distribution of the latent variable in a VAE. This choice allows a flexible prior distribution that is not constrained to a specific parametric family, and so may be better suited to modeling the data distribution.

We perform an empirical evaluation of priors and objective functions for training VAE sentence models on four standard datasets. We find the best overall performance when using the flow-based prior together with the combined objective, a setting we refer to as FlowPrior. In our quantitative and qualitative evaluation, prior samples generated by FlowPrior comport with the training distribution while maintaining higher diversity than competing models.

To summarize, this paper contributes: (1) a strategy for improved training of sentence VAEs based on combining multiple lower bounds on the log marginal likelihood; (2) the first results applying real NVP to model the prior in sentence VAEs; and (3) comprehensive evaluation and analysis with three expressive priors and training objective variations.

Background
Variational autoencoders (VAEs; Kingma and Welling, 2014) are a popular framework for learning latent variable models with continuous latent variables. Let x be the observed variable and z the latent variable. The model factorizes the joint distribution over x and z into a prior p_ψ(z) and a generator p_θ(x | z). Maximizing the log marginal likelihood log p(x) is intractable in general, so VAEs introduce an approximate posterior q_φ(z | x) parameterized using a neural network (i.e., an "inference network") and replace the log marginal likelihood with the evidence lower bound (ELBO):

log p(x) ≥ E_{q_φ(z|x)}[log p_θ(x | z)] − KL(q_φ(z | x) ‖ p_ψ(z))  (1)

Maximizing the right-hand side of Eq. 1 can be viewed as training a regularized autoencoder in which the first term is the negative reconstruction error and the second is the negative KL divergence between the approximate posterior q_φ(z | x) and the latent variable prior p_ψ(z). It is common in practice to fix the prior p_ψ(z) to be a standard Gaussian distribution and only learn θ and φ (Bowman et al., 2016; Yang et al., 2017; Shen et al., 2017). While constraining the prior to be a fixed standard Gaussian is common, it is not necessary for tractability, and researchers have found benefit from using richer priors and posteriors (Rezende and Mohamed, 2015; Kingma et al., 2016; Chen et al., 2017; Ziegler and Rush, 2019; Ma et al., 2019). In this paper, we investigate alternative priors while keeping the standard Gaussian form for the approximate posterior.
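As a point of reference before introducing richer priors, the following is a minimal PyTorch sketch of a one-sample estimate of Eq. 1 with a fixed N(0, I) prior; the `encoder` and `decoder` interfaces are illustrative assumptions, not the paper's code.

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo(x, encoder, decoder):
    """One-sample ELBO with a fixed N(0, I) prior (Eq. 1).
    `encoder(x)` is assumed to return a diagonal Gaussian q(z|x) and
    `decoder.log_prob(x, z)` the sentence log-likelihood log p(x|z)."""
    q = encoder(x)                                   # Normal with (batch, d) params
    z = q.rsample()                                  # reparameterized sample
    recon = decoder.log_prob(x, z)                   # 1-sample estimate of E_q[log p(x|z)]
    prior = Normal(torch.zeros_like(q.loc), torch.ones_like(q.scale))
    kl = kl_divergence(q, prior).sum(-1)             # closed form for two Gaussians
    return (recon - kl).mean()                       # maximize this quantity
```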

Choices for Prior Families
We now describe the three kinds of priors we will compare in our experiments. The first two are based on Gaussian mixtures (Sec. 3.1) and the third is based on normalizing flows (Sec. 3.2). We consider these three prior families because they represent the three main categories of work on learning priors: simple Gaussian mixtures (usually as baselines), priors defined as a function of the approximate posterior (Tomczak and Welling, 2018; Chen et al., 2018), and flow-based priors (Chen et al., 2017; Ziegler and Rush, 2019; Ma et al., 2019; Lee et al., 2020). Note that we do not make any changes to the approximate posterior distribution; it follows a Gaussian with a diagonal covariance matrix, as in standard VAEs.

Gaussian Mixture Priors
Our first choice is a uniform mixture of K Gaussians (MoG):

p_ψ(z) = (1/K) Σ_{k=1}^{K} f(z; µ_k, diag(σ_k²))  (2)

where f(z; µ, Σ) is the density function of a d-dimensional Gaussian with mean µ and covariance matrix Σ. The µ_k and σ_k are learnable parameter vectors with dimensionality d (which is 32 in our experiments). This prior was used as a baseline by Tomczak and Welling (2018). We refer to a VAE that uses this prior as MoG-VAE.

Tomczak and Welling (2018) extend MoG-VAE to a "Variational Mixture of Posteriors" prior (VampPrior). This approach parameterizes the prior using a mixture of Gaussians with components given by the variational posterior conditioned on learnable "pseudo-inputs":

p_ψ(z) = (1/K) Σ_{k=1}^{K} q_φ(z | u_k)  (3)

where K is the number of pseudo-inputs, each of which is denoted u_k. Pelsmaeker and Aziz (2020) applied this idea to text modeling, and we follow their strategy for defining pseudo-inputs: each u_k consists of a sequence of embeddings with the same dimensionality as word embeddings. The lengths of the pseudo-inputs can vary across components; they are sampled based on the statistics of the lengths in the training set. We refer to a VAE with this prior as Vamp-VAE.
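A minimal PyTorch sketch of the MoG prior in Eq. 2 follows (the paper does not give an implementation; the class and attribute names are illustrative). The VampPrior of Eq. 3 has the same mixture structure but replaces the learned (µ_k, σ_k) with the inference network applied to the learnable pseudo-inputs u_k.

```python
import math
import torch
import torch.nn as nn
from torch.distributions import Normal

class MoGPrior(nn.Module):
    """Uniform mixture of K diagonal Gaussians (Eq. 2)."""
    def __init__(self, K=100, d=32):
        super().__init__()
        self.K = K
        self.mu = nn.Parameter(torch.randn(K, d) * 0.01)   # component means
        self.log_sigma = nn.Parameter(torch.zeros(K, d))   # component log std devs

    def log_prob(self, z):
        # z: (..., d); broadcast against the (K, d) component parameters
        comp = Normal(self.mu, self.log_sigma.exp())
        log_pk = comp.log_prob(z.unsqueeze(-2)).sum(-1)    # (..., K)
        # uniform weights: log p(z) = logsumexp_k log N(z; mu_k, sigma_k) - log K
        return torch.logsumexp(log_pk, dim=-1) - math.log(self.K)
```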

Flow-based Priors
Our third choice for a prior distribution is to leverage normalizing flows (NF). A normalizing flow is a sequence of invertible, deterministic transformations. By repeatedly applying the rule for change of variables (see the Appendices for details), the base density is transformed into a more complex one. Networks parameterized using NF can be trained through exact maximum log-likelihood computation. Exact sampling is performed by drawing a sample from the base distribution and performing the chain of transformations. This allows a flexible prior and is expected to have more expressive latent components compared to those based on Gaussian mixtures.
Computing the Jacobian of a high-dimensional function and the determinant of a large matrix (i.e., the two main computations in NF) is very expensive. Our flow-based prior uses real-valued non-volume preserving (real NVP) transformations (Dinh et al., 2016), which are efficient in both training and sampling. The transformations are based on scale and translation operations. Notably, computing the Jacobian determinant and the inverse of the transformation does not require inverting or differentiating these two operations, so one can design arbitrarily complex scale and translation functions without incurring large computational cost. (More details about normalizing flows and real NVP are in the Appendices.)
More specifically, we apply real NVP as a prior by creating an invertible mapping between a base distribution p_0(z_0) (in our case, z_0 ∼ N(0, I)) and the prior distribution p_ψ(z_L) in the VAE:

z_L = f_L(f_{L−1}(· · · f_1(z_0)))  (4)

where z_L is the sentence latent variable and f_1, f_2, ..., f_L are all bijective functions.
Using the change-of-variables theorem, given a latent variable z_L, we can compute the exact density under the prior from the "image" z_0 obtained by inverting the transformation:

log p_ψ(z_L) = log p_0(z_0) + Σ_{i=1}^{L} log |det(∂f_i^{−1}(z_i) / ∂z_i)|  (5)
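A sketch of the density computation in Eq. 5, assuming each flow layer exposes an `inverse` method returning the inverse image together with the log-determinant of the inverse Jacobian (this interface is an assumption for illustration):

```python
import torch

def flow_prior_log_prob(z_L, flows, base):
    """Exact log density under the real NVP prior (Eq. 5): invert the flow
    f_L, ..., f_1 to recover z_0, accumulating log|det| terms along the way.
    `base` is the N(0, I) base distribution p_0."""
    z, total_log_det = z_L, 0.0
    for f in reversed(flows):
        z, log_det = f.inverse(z)          # log|det| of the inverse map
        total_log_det = total_log_det + log_det
    return base.log_prob(z).sum(-1) + total_log_det
```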
We refer to a VAE with a real NVP prior as real NVP-VAE. We find our best setting to consist of a real NVP prior and the combined objective in Section 4.1 and we refer to this setting as FlowPrior.

Objectives for Learning Priors in VAEs
ELBO. Our preliminary experiments found that, when training with the standard ELBO, using more sophisticated priors does not improve perplexity compared to standard Gaussian priors (Table 3). Though these priors could potentially be highly multimodal, the learned prior parameters yield approximately unimodal forms (Figure 1, left).
Several approaches have been proposed to mitigate or avoid collapse in the approximate posterior. One method that we include in our experiments is a variation of the KL term known as "free bits" (FB) KL (Li et al., 2019; Kingma et al., 2016). Posterior collapse is mitigated, but the VAE models still do not benefit much from expressive priors (Tables 1-2). Pelsmaeker and Aziz (2020) made similar observations with an improved FB objective. We speculate that these undesirable results are due to the lack of learning signal for the prior parameters.
Marginal Likelihood via Importance Sampling. In the ELBO, the prior distribution only appears in the KL term, so the prior parameters receive a limited amount of learning signal. The posterior network, by contrast, receives gradient updates from both the reconstruction and KL terms. When the KL term is minimized, a potentially expressive prior density can "collapse" to a unimodal form, since a simpler prior may make it easier to reduce the KL divergence between the approximate posterior and the prior.
We consider optimizing another objective, a different lower bound on the log marginal likelihood obtained using importance sampling (Burda et al., 2016):

M_IS(θ, φ, ψ; x) = E_{z_1,...,z_N ∼ q_φ(z|x)} [ log (1/N) Σ_{n=1}^{N} p_θ(x | z_n) p_ψ(z_n) / q_φ(z_n | x) ]  (6)

where x is an input in the training data and N is the number of samples in use. This objective was proposed as the training objective of the importance-weighted autoencoder (IWAE; Burda et al., 2016) and was shown to be a tighter lower bound on the log marginal likelihood than the ELBO. In this paper, we denote this objective by M_IS.
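A sketch of an M_IS estimator (Eq. 6) using the reparameterization trick; `encoder`, `decoder.log_prob`, and `prior_log_prob` are assumed interfaces, as in the earlier sketches:

```python
import math
import torch

def m_is(x, encoder, decoder, prior_log_prob, N=50):
    """Importance-sampled bound M_IS (Eq. 6; the IWAE bound of Burda et al.,
    2016), computed stably in log space with logsumexp."""
    q = encoder(x)                              # diagonal Gaussian q(z|x)
    z = q.rsample((N,))                         # (N, batch, d)
    log_w = (decoder.log_prob(x, z)             # log p_theta(x|z_n)
             + prior_log_prob(z)                # log p_psi(z_n)
             - q.log_prob(z).sum(-1))           # - log q_phi(z_n|x)
    return (torch.logsumexp(log_w, dim=0) - math.log(N)).mean()
```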
In addition to providing a tighter lower bound, M_IS also increases the flexibility of the approximate posterior, as shown by Cremer et al. (2017). As N increases, the approximate posterior implicitly takes on a more complex distribution that approaches the true posterior, which may also be beneficial for learning an expressive prior.
Combination of the Two. However, M_IS is not necessarily optimal by itself for training VAEs. Rainforth et al. (2018) prove that using M_IS with a large value of N is detrimental to learning the posterior, which is also shown in our empirical evaluation in Table 3. If we train with M_IS alone, the approximate posterior q appears only in the denominator, so learning seeks to make samples from q less likely under q, which can cause q to become a poor proposal distribution. The ELBO, with its reconstruction loss, appears helpful for learning a better posterior. Therefore, we optimize the sum of the ELBO and M_IS, which was proposed by Rainforth et al. (2018).

Combined Training Objective
Our combined training objective then contains three terms: M_IS, reconstruction, and a sample-based KL. We draw N samples from q_φ(z|x) and compute the three terms using the same samples:

L(θ, φ, ψ; x) = (1/N) Σ_{n=1}^{N} log p_θ(x | z_n) − KL̂(q_φ(z | x) ‖ p_ψ(z)) + M_IS(θ, φ, ψ; x), with z_1, ..., z_N ∼ q_φ(z | x)  (7)

When training with the ELBO alone, one typically uses a single sample from q_φ(z|x). However, since we already draw multiple samples to compute M_IS, we reuse those same samples for the reconstruction term, which can lead to more robust gradients for that term than the standard single-sample approach.
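Putting the pieces together, here is a sketch of the combined objective in Eq. 7, reusing one shared set of N samples for the reconstruction, KL, and M_IS terms (helper names follow the earlier sketches and are not the authors' code):

```python
import math
import torch

def combined_loss(x, encoder, decoder, prior_log_prob, N=50):
    """Negative of Eq. 7: multi-sample reconstruction, Monte Carlo KL (Eq. 8),
    and M_IS, all computed from one shared set of posterior samples."""
    q = encoder(x)
    z = q.rsample((N,))                         # shared samples, (N, batch, d)
    log_px_z = decoder.log_prob(x, z)           # (N, batch)
    log_qz = q.log_prob(z).sum(-1)
    log_pz = prior_log_prob(z)
    recon = log_px_z.mean(0)                    # (1/N) sum_n log p(x|z_n)
    kl = (log_qz - log_pz).mean(0)              # Eq. 8
    mis = torch.logsumexp(log_px_z + log_pz - log_qz, dim=0) - math.log(N)
    return -(recon - kl + mis).mean()           # minimize the negative bound
```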
We use sample-based estimates of the KL divergence because our choices for the prior preclude a closed form for the KL. We consider two approaches for computing sample-based KLs: a standard KL estimate and a modified one inspired by free bits (Li et al., 2019; Pelsmaeker and Aziz, 2020; Kingma et al., 2016), which we refer to as FB KL.
For the standard KL, we use a Monte Carlo estimate of the KL divergence with the N samples:

KL̂(q_φ(z | x) ‖ p_ψ(z)) = (1/N) Σ_{n=1}^{N} [log q_φ(z_n | x) − log p_ψ(z_n)], z_n ∼ q_φ(z | x)  (8)

For the FB KL, we follow prior work (Kingma et al., 2016) that replaces the KL with a hinge loss term in each latent dimension:

FB KL = Σ_{j=1}^{d} max(λ, KL^j_{φ,ψ})  (9)

where KL^j_{φ,ψ} denotes the KL computed only for dimension j of the latent variable, and λ is the "target rate" hyperparameter.
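For the FB variant (Eq. 9) with sample-based estimates, one hinges the per-dimension KL estimates at the target rate λ. The sketch below assumes per-dimension log densities are available, which is an illustrative simplification (for the flow prior, the per-dimension KL must itself be estimated):

```python
import torch

def fb_kl(logq_dims, logp_dims, target_rate=4.0):
    """Free-bits KL (Eq. 9). `logq_dims` and `logp_dims` hold per-dimension
    log densities of shape (N, batch, d); each dimension's KL estimate is
    clamped from below at the target rate lambda, then summed over dimensions."""
    kl_per_dim = (logq_dims - logp_dims).mean(0)          # (batch, d)
    return torch.clamp(kl_per_dim, min=target_rate).sum(-1)
```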

Training Procedure
We describe our training procedure below for FlowPrior, which combines a real NVP prior with the objective in Eq. 7. For simplicity, the description uses a single input x; in practice, we use minibatches with a stochastic gradient based optimizer. For each training step:

1. Draw N samples z_L^(1), ..., z_L^(N) from the approximate posterior q_φ(z | x).
2. Invert the flow to obtain the base-distribution images z_0^(n) = f_1^{−1}(· · · f_L^{−1}(z_L^(n))).
3. Compute the exact prior densities p_ψ(z_L^(n)) using the change-of-variables theorem (Eq. 5).
4. Compute the combined objective (Eq. 7) and update all parameters (θ, φ, ψ) simultaneously.

When using the other priors (standard Gaussian, MoG, and VampPrior), we do not need steps 2 and 3 above because those priors can be computed directly without the inverse transformation or change-of-variables theorem.
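Under the assumptions of the earlier sketches, one possible training loop then reduces to a few lines (SGD and joint updates of θ, φ, ψ follow the paper's setup; everything else is illustrative):

```python
import torch

def train(encoder, decoder, prior, dataloader, epochs=1, lr=1.0):
    """A minimal FlowPrior training loop sketch: joint SGD updates of the
    generator (theta), inference network (phi), and prior (psi), using the
    combined_loss sketch above."""
    params = (list(encoder.parameters()) + list(decoder.parameters())
              + list(prior.parameters()))
    optimizer = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for x in dataloader:                    # minibatch of sentences
            loss = combined_loss(x, encoder, decoder, prior.log_prob, N=50)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                    # simultaneous update
```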

Datasets and Baselines
We consider four widely-used, publicly available datasets: PTB, Yahoo, Yelp, and SNLI (dataset statistics are in Appendix B). In addition, we include two prior-learning baselines: MoG-VAE (Eq. 2) and Vamp-VAE (Eq. 3). We follow Pelsmaeker and Aziz (2020) and use 100 components/pseudo-inputs. Unlike the earlier baselines, for which we used open source codebases, we implemented the MoG-VAE and Vamp-VAE models on top of our standard VAE implementation, which was also used for FlowPrior.

Implementation and Training Details
Across all the experiments for our implemented baselines (i.e., standard VAE, MoG-VAE, Vamp-VAE) and our proposed model FlowPrior, we follow prior work (Kim et al., 2018; He et al., 2019; Li et al., 2019) and use a single-layer LSTM encoder and decoder with a 32-dimensional latent variable. We use a batch size of 32 and train using SGD; further details are in the Appendices.

Evaluation Metrics
Our evaluation measures language modeling performance, the use of the latent variable, and the quality and diversity of generations from the prior and posterior. The metrics are listed below:

PPL: We estimate the log marginal likelihood using importance sampling (Burda et al., 2016) and report the corresponding test set perplexity.

MI: We follow Hoffman and Johnson (2016) and report the estimated mutual information between the observation and its latent variable.

AU: A latent dimension is considered active if the covariance of its posterior mean across the data exceeds a small threshold; AU is then the number of active latent dimensions (Burda et al., 2016).

F-PPL and R-PPL: These metrics measure the correspondence between sentences generated by the model and the training corpus. We evaluate both by estimating 5-gram language models using the KenLM toolkit (Heafield, 2011) with its default smoothing method. For F-PPL, we estimate language models on the actual text and compute the perplexity of the generated samples. For R-PPL, we estimate language models on the generated samples and compute the perplexity of the actual text.

Self-BLEU: The self-BLEU metric is one measure of the diversity of a set of samples (Zhu et al., 2018). It is calculated by averaging the BLEU scores computed between all pairs of samples; lower values indicate more diverse samples.
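For concreteness, a minimal self-BLEU sketch using NLTK (the paper does not specify its implementation; the smoothing and tokenization choices here are assumptions):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples):
    """Self-BLEU (Zhu et al., 2018): score each tokenized sample against all
    the others as references, then average; higher means less diverse."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu(samples[:i] + samples[i + 1:], hyp,
                            smoothing_function=smooth)
              for i, hyp in enumerate(samples)]
    return sum(scores) / len(scores)
```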

Language Modeling
We first perform language modeling experiments to characterize each model's efficacy at modeling the distribution of the text and making use of the latent variable. FlowPrior uses the training objective in Eq. 7, which includes M_IS and the standard KL (Eq. 8); FlowPrior + FB denotes our model with the FB KL (Eq. 9).

Comparison to baselines. Table 1 shows results on the PTB dataset for several VAEs from prior work and our implemented models. Since our contributions lie in learning the prior rather than changing the training procedure or manipulating the KL term, we use the standard VAE, MoG, and VampPrior as baselines for the rest of the paper. We report the performance of FlowPrior and these baselines on Yahoo, Yelp, and SNLI in Table 2. From Tables 1 and 2, we observe that FlowPrior consistently outperforms the baselines in test set perplexity, sometimes by large margins. This is not surprising, since the M_IS term in our training objective directly targets the perplexity metric: the two expressions are identical, differing only in the number of samples used. While FB typically improves models on PTB, and helps FlowPrior reach higher AU and KL on the other datasets, it does not lead to better test PPL and reconstruction. We report additional results measuring the impact of FB in the Appendix.

[Tables 1-2 report, for each model: PPL(↓), Recon(↓), KL, AU(↑), and MI(↑).]
Another finding is that simply enriching the parametric family of the prior is not sufficient to improve our evaluation metrics. Tables 1 and 2 show mixed results when moving from the VAE with its standard Gaussian prior to the MoG- or Vamp-VAE. Though these priors have the potential to be multimodal, they can still be effectively unimodal after training. Also, the complexity of the prior learned by the Vamp-VAE depends on the inference network, so if the inference network does not learn anything useful, the learned prior may not be useful either.
Impact of selection of objectives. The learned prior baselines (MoG-VAE and Vamp-VAE) fail to learn to use the latent variable, as shown by the near-zero AU and MI metrics in Tables 1-2. Similar observations were made by Pelsmaeker and Aziz (2020). We argue that improving the prior alone may not be sufficient, as the ELBO objective is difficult to optimize and little information may be learnable for the prior from the ELBO alone. To measure the utility of the M_IS term, we add it to the standard VAE, MoG-VAE, and Vamp-VAE and evaluate the resulting models under the same language modeling metrics. Table 3 compares models trained with M_IS, the ELBO, and the combined training objective (Eq. 7). The combined objective is beneficial on all metrics for all priors and datasets. Our results are consistent with the observations of Rainforth et al. (2018) that tighter bounds are preferable for training the generative network, while looser bounds are preferable for training the inference network. Still, FlowPrior (real NVP + M_IS) performs best in PPL and MI compared to the other models, showing the flexibility and power of the real NVP architecture.

For the "Standard" setting in Table 3, the prior is fixed and not learned, while in the other three settings the prior is learned. The combination of the ELBO and M_IS is helpful across all settings. (For the MoG setting, we also performed experiments with the number of Gaussian components set to K = 1 and observed comparable or slightly worse test PPL than the Standard setting under all three choices of training loss.)

Table 4: Interpolation from the prior on the SNLI dataset. In each cell, the first and last sentences correspond to two sampled latent codes; between them are linearly interpolated samples.

Vamp-VAE + M_IS:
Three people are sitting on a bench .
People are walking down the street .
Man in a blue shirt and jeans is sitting on a bench .
Man in a blue shirt and jeans is sitting on a bench .
Women in a white dress and a man in a black shirt are standing in front of a microphone .
Women in a white dress and a man in a black shirt are standing in front of a microphone .
two men are playing soccer
two men are playing basketball
Two men are playing a game of chess .
Two men are playing a game of chess .

FlowPrior:
The dog is running through the snow .
Two young boys are playing in the snow .
There is a man in a blue shirt and a woman in a black shirt and black pants .
Three people are sitting on a bench .
two men are standing on a bench
A girl is sitting on a bench .
A young girl is sitting on a bench .
A young man is sitting on a bench .
A woman in a black shirt is sitting on a bench .
A woman is sitting on a bench .

Interpolations Between Prior Samples
One appealing aspect of VAEs for sentence modeling is the potential for learning a smooth, interpretable space of sentences. A qualitative way to explore the latent space is to interpolate between samples from the prior distribution. We randomly sample two latent vectors from the prior and linearly interpolate between them with evenly divided intervals (Bowman et al., 2016), using greedy decoding for generation. (FlowPrior is slightly different: instead of interpolating directly in the VAE latent space, as in MoG-VAE and Vamp-VAE, FlowPrior samples from the base distribution of real NVP, interpolates in the base distribution, and maps the result to the latent space with Eq. 4. We also experimented with interpolating the two samples after mapping, i.e., in the VAE latent space, and found similar results.)

Table 4 shows linear interpolation between prior samples for FlowPrior and Vamp-VAE + M_IS (i.e., Vamp-VAE with the combined training objective). We observe substantial improvement with FlowPrior, as it generates sentences with smooth semantic evolution while maintaining plausible, fluent generations. This semantic evolution may reflect the complex structure of the learned prior distribution. Interpolations with MoG-VAE + M_IS and Vamp-VAE + M_IS contain more repetition and do not transition smoothly from one endpoint to the other. (Results with MoG-VAE are in the appendix.)
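A sketch of the interpolation procedure for FlowPrior: interpolate in the real NVP base space, push each point through the flow (Eq. 4), and decode greedily (`flows` and `decoder.greedy_decode` are illustrative names, not the paper's code):

```python
import torch

def interpolate_prior(flows, decoder, steps=10, d=32):
    """Linear interpolation between two prior samples for FlowPrior."""
    z0_a, z0_b = torch.randn(d), torch.randn(d)      # two base-space samples
    sentences = []
    for a in torch.linspace(0.0, 1.0, steps):
        z = (1 - a) * z0_a + a * z0_b                # interpolate in base space
        for f in flows:                              # map through f_1, ..., f_L
            z = f(z)
        sentences.append(decoder.greedy_decode(z.unsqueeze(0)))
    return sentences
```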

Visualization of Learned Priors
We randomly select 4 dimensions from the learned prior of each model and plot their densities in Fig. 1. In MoG-VAE, each dimension is a Gaussian mixture with 100 components. When only the ELBO is used for training (Fig. 1(a)), the four visualized dimensions all have similar shapes. After adding M_IS (Fig. 1(b)), different dimensions have similar locations but different scales.
Vamp-VAE permits relatively complex components because the means and variances are obtained from the inference network applied to the learned pseudo-inputs. Fig. 1(c) shows that Vamp-VAE trained without M_IS does not differ much from MoG-VAE. However, when trained with M_IS (Fig. 1(d)), the distributions in several dimensions appear multimodal. The real NVP prior learns little when trained without M_IS, as all dimensions remain close to standard normal distributions. When trained with M_IS, different dimensions show distinct placement and shape. Overall, the prior in FlowPrior is highly multimodal and smooth in each dimension.

[Table 5: greedy generations from prior samples for MoG-VAE vs. MoG-VAE + M_IS, Vamp-VAE vs. Vamp-VAE + M_IS, and VAE vs. FlowPrior. Models trained without M_IS repeat a small set of high-probability sentences, while adding M_IS yields visibly more diverse generations.]

Generations from Prior Samples
Sampling from the Prior. To measure the expressiveness of the prior and the richness of the learned latent variable, we draw 5000 random samples from the prior distribution and evaluate their greedy-decoded generations qualitatively and quantitatively. Table 5 shows greedy generations from prior samples. We observe substantial improvements in generation diversity when M_IS is added to the training objective. Note that these diverse samples are achieved with purely deterministic decoding. A diverse set of samples implies that (1) the model learns richer latent codes and a highly multimodal distribution, and (2) the generator is trained to attend to the latent codes.
Sample Mundanity and Coverage. A strongly-performing generative model should generate samples that comport with the training data distribution. We use the forward and reverse PPL to estimate the similarity between the training data and the samples. We can view F-PPL as a generation "precision," as it reflects how much of the information in the samples is relevant to the actual text. Analogously, we can view R-PPL as a generation "recall," as it reflects how well the samples as a whole cover the actual text. Moreover, both F-PPL and R-PPL reflect whether the decoder is able to attend to the latent variable during generation. Table 6 shows the F-PPL and R-PPL with greedy generation from prior samples. While Fang et al. (2019) treat a lower F-PPL as an indicator of better samples, we argue that this is not necessarily true. A model can achieve a low F-PPL by simply generating identical (or nearly identical) high-probability sequences, like those observed from the VAE, MoG-VAE, and Vamp-VAE in Table 5. This reflects how an overly simplified or restrictive prior can lead to less diverse samples.
Indeed, we find that models with very low F-PPL values often have very high R-PPL values. A lower R-PPL indicates that the distribution of generated samples better matches the distribution of the training data. From Table 6 we observe that adding M_IS is beneficial, as it leads to lower R-PPL. FlowPrior has the best R-PPL, demonstrating its capability to capture characteristics of the target distribution that are not captured by simpler priors.
Generation Diversity. To identify which model makes richer use of the latent variable, we use self-BLEU to measure the diversity of a set of samples. We observe significant improvements for FlowPrior in Table 6, implying a more diverse latent representation and better utilization of the latent variable.

Related Work
When considering the parameterized family of VAE models, expressive latent components (i.e., posteriors and priors) have been widely studied in computer vision (Dinh et al., 2015, 2016; Kingma and Dhariwal, 2018). However, multimodal priors have seldom been applied to language, with a few exceptions. Lee et al. (2020) empirically characterize the performance of NF and simple Gaussian priors in token-level latent variable models, and observe that flexible priors yield higher log-likelihoods but not better BLEU scores on machine translation tasks.

Our work differs from that of Ziegler and Rush (2019) and Chen et al. (2017) in that we use a non-autoregressive flow-based architecture for the prior, while they use autoregressive NF. Also, we focus on models with a single latent variable for an entire sentence, while similar prior work has focused on token-level latent variables (Ziegler and Rush, 2019; Ma et al., 2019).

Conclusion
We proposed a method, FlowPrior, that uses a normalizing flow to define the prior in a sentence VAE and adds the importance-sampled marginal likelihood (M_IS) as a second term to the standard VAE objective. Our empirical results show that FlowPrior yields a substantial improvement in language modeling and generation tasks compared to prior work. Adding M_IS improves performance for other models as well, especially in settings where the prior parameters are being learned.

A Normalizing Flows and Real NVP

Change of Variables. Given a bijective function f and z = f(z'), the density of z is obtained from the density of z' by

p(z) = p(z') |det(∂f^{−1}(z) / ∂z)|

where ∂f^{−1}(z)/∂z is the Jacobian of f^{−1} evaluated at z.
Normalizing Flows. A normalizing flow is a sequence of invertible, deterministic transformations. By repeatedly applying the rule for change of variables, the base density is transformed into a more complex one. Networks parameterized using NF can be trained through exact maximum log-likelihood computation. Exact sampling is performed by drawing a sample from the base distribution and applying the chain of transformations. Our work uses NF because it allows a flexible functional form while supporting exact likelihood computation and sampling.
Real NVP. Computing the Jacobian of a high-dimensional function and the determinant of a large matrix (i.e., the two main computations in NF) is very expensive. Prior work has addressed this challenge by introducing efficient transformations (Dinh et al., 2015, 2016; Germain et al., 2015; Kingma et al., 2016; Kingma and Dhariwal, 2018; Ho et al., 2019). Our flow-based prior is based on real-valued non-volume preserving (real NVP; Dinh et al., 2016) transformations, which are efficient in both training and sampling. The main building block of a real NVP transformation is the affine coupling layer.
An affine coupling layer is a bijective transformation f_i : z_{i−1} → z_i defined by:

z_i^(1:d) = z_{i−1}^(1:d)
z_i^(d+1:D) = z_{i−1}^(d+1:D) ⊙ exp(s(z_{i−1}^(1:d))) + t(z_{i−1}^(1:d))

where D is the dimensionality, z_i^(1:d) stands for the first d dimensions of z_i (d < D), s and t denote the scale and translation functions mapping R^d → R^{D−d}, and ⊙ denotes elementwise product.
The Jacobian determinant and inverse of the affine coupling layer are easy to compute. The transformation is flexible because computing the Jacobian determinant and the inverse does not require inverting or differentiating the functions s and t, so these two functions can be designed to be arbitrarily complex.
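A compact PyTorch sketch of an affine coupling layer with its inverse and log-determinants (a minimal illustration; the paper's actual configuration stacks 10 such layers composed in an alternating pattern):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One real NVP affine coupling layer: the first d dims pass through
    unchanged and parameterize an elementwise affine map of the rest."""
    def __init__(self, D, d):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (D - d)))  # outputs s and t

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.net(z1).chunk(2, dim=-1)
        y2 = z2 * torch.exp(s) + t                  # scale and translate
        log_det = s.sum(-1)                         # log|det| of the forward map
        return torch.cat([z1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=-1)
        z2 = (y2 - t) * torch.exp(-s)               # invert without inverting s, t
        log_det = -s.sum(-1)                        # log|det| of the inverse map
        return torch.cat([y1, z2], dim=-1), log_det
```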

B Datasets
Dataset statistics are shown in Table 7.

C Training Details
We use a batch size of 32 and train using SGD without momentum. The optimizer is initialized with a learning rate of 1 or 0.5, and the learning rate is decayed by 1/2 if the dev loss has not improved in two consecutive epochs. Training stops early after 5 learning rate decay operations. When the KL and M_IS terms are in the training objective, we use a linear annealing schedule that increases their weights from 0 to 1 over the first 10 or 20 epochs. When training with the combined objective, we add M_IS after training with the ELBO objective for 10 epochs. For each model variation, we experiment with 5 different random seeds and report the median numbers in the paper.
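A sketch of the two schedules described above (hypothetical helper names; the exact implementation is not given in the paper):

```python
def anneal_weight(epoch, warmup=10):
    """Linear annealing for the KL and M_IS weights: 0 -> 1 over the first
    `warmup` epochs, then held at 1."""
    return min(1.0, epoch / float(warmup))

def maybe_decay_lr(optimizer, dev_losses, patience=2, factor=0.5):
    """Halve the learning rate when the dev loss has not improved for
    `patience` consecutive epochs."""
    if (len(dev_losses) > patience
            and min(dev_losses[-patience:]) >= min(dev_losses[:-patience])):
        for group in optimizer.param_groups:
            group["lr"] *= factor
```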

D Hyperparameter Settings
Across all the experiments for our implemented baselines (i.e., standard VAE, MoG-VAE, Vamp-VAE) and our proposed model FlowPrior, we follow prior work (Kim et al., 2018; He et al., 2019; Li et al., 2019) and use a single-layer LSTM encoder and decoder with a 32-dimensional latent variable. We follow prior work (He et al., 2019; Li et al., 2019) and set the embedding dimensions as in Table 8. We apply a dropout rate of 0.5 to both the input embeddings and the output embeddings before the softmax layer in the decoder. All parameters are initialized from a uniform distribution U(−0.01, 0.01). For both MoG-VAE and Vamp-VAE we use 100 components/pseudo-inputs in the prior. For real NVP, we use 10 affine coupling layers with 3-layer MLP networks of dimensionality 32 for the parameterized scale and translation operations. We follow Dinh et al. (2016) in composing the affine coupling layers in an alternating pattern and add batch normalization (Ballé et al., 2016) between adjacent affine coupling layers. For models trained with the FB KL, we set the target rate to 2, 4, or 8.

E Additional Results with Free Bits KL
Using the free bits method can help achieve consistently better AU and higher KL, as shown in the overall results in the main text. We report additional empirical comparisons focused on measuring the impact of FB for three models in Table 9. Though adding FB yields higher AU and MI, it does not lead to better test PPL or reconstruction. Table 11 shows the impact of M_IS and FB on F-PPL, R-PPL, and self-BLEU with greedy generation from prior samples.

G Reconstruction Results with Sampling
Tables 12-13 show the reconstruction performance with standard sampling and with nucleus sampling at p = 0.9 (Holtzman et al., 2020). We observe that the trends are consistent with the results that use greedy decoding.

H Interpolation with Sampling

Table 14 shows more examples of interpolation-based generation with greedy decoding, and Tables 15 and 16 show results with sampling-based decoding. The results with greedy decoding provide a lower-variance way to interpret the learned latent space. The additional results with sampling provide a richer picture, as they also capture the randomness in the relationship between the latent variable and the text. This is especially helpful when we observe repetition in neighboring samples with greedy decoding, as with MoG-VAE and Vamp-VAE in Table 14. Even with sampling, FlowPrior shows a smoother semantic evolution in the latent space than MoG-VAE and Vamp-VAE, at least in terms of the subjects of the generated sentences.

Table 17 shows more greedy generations from prior samples. We observe substantial improvements in generation diversity with FlowPrior and FlowPrior + FB.

MoG-VAE:
The man is wearing a black shirt .
The man is wearing a black shirt .
The man is wearing a black shirt .
A man is standing in front of a building .
A man is standing in front of a building .
A man is standing in front of a building .
A man is standing in front of a building .
A man is standing in front of a building .
A man is standing in front of a building .
A man is standing in front of a building .

Vamp-VAE:
Three people are sitting on a bench .
People are walking down the street .
People are walking down the street .
People are walking down the street .
Man in a blue shirt and jeans is sitting on a bench .
Man in a blue shirt and jeans is sitting on a bench .
Man in a blue shirt and jeans is sitting on a bench .
Man in a blue shirt and jeans is sitting on a bench .
Man in a blue shirt and jeans is sitting on a bench .
Man in a blue shirt and jeans is sitting on a bench .

FlowPrior:
The dog is running through the snow .
Two young boys are playing in the snow .
There is a man in a blue shirt and a woman in a black shirt and black pants .
Three people are sitting on a bench .
two men are standing on a bench
A girl is sitting on a bench .
A young girl is sitting on a bench .
A young man is sitting on a bench .
A woman in a black shirt is sitting on a bench .
A woman is sitting on a bench .

MoG-VAE:
A girl is laughing at the beach .
The dog is walking in the water .
Man in khaki jacket painting an elephant .
The man is breakdancing .
People are outside on a sunny day .
Some people are playing in the snow .
Two men are working in a lab .
The boy is at the beach .
Five soccer players playing soccer .
Men stand on a pier .

Vamp-VAE:
A person is riding a bicycle at a parade .
A brown and white dog with a brown collar is climbing over its hind legs while lying on the floor , talking and fabric in the grass .
A little girl wearing a yellow shirt looks at a fountain while a man is kneeling next to her and a child .
Two men play dominoes .
Young lady in blue dress waits at a bus .
A woman carrying a small child looking through a window .
Three men are fighting with swords .
The little boy is riding his scooter down the paved road .
A man wearing a yellow suit eats a hotdog on a wooden table .
A group of friends are smiling.

FlowPrior:
three dogs are in the water
3 people walking down the street with their hands in the air .
Man getting a picture
Three people are on the beach .
There are two dogs standing near each other .
Two men in white uniforms are cleaning on a mess of an escalator .
Two girls are getting some exercise together .
Two women working in a restaurant outdoors .
The children are riding .
The child is standing on the sidewalk in front of an apartment building .

In each cell, the first sentence and the last sentence correspond to the two sampled latent codes, and between are linearly interpolated samples.

MoG-VAE:
A girl is attending a birthday of popcorn .
The dog is walking in the green snow .
Man lady on the beach .
The man is breakdancing .
People are outside on a sunny day .
Some people are rooting in an outdoor resaurant .
Two men both face out for directions on a street .
Big hikers .
Five soccer players playing soccer after fifty finish in a field .
Men stand on a doorstep .

Vamp-VAE:
Two farmers share a drink , while one looks at a woolen .
A couple in black and white with a long blue scarf are standing in a store wearing a yellow hat .
A three people are riding a white bike through the desert along a road .
A tribesman near a playground .
People all gathered in the street looking at something on a woman .
A couple of angel making corn on a mountainside in a city .
A man holding a sign can and the young female .
The man playing a guitar concert festival .
The shirtless woman and woman performing on sand .
dog is outside .

FlowPrior:
A dog is jumping through the air to catch a Frisbee in the air .
A brown and white dog chewing on a red disc .
A crowd of people are blowing in a brown down balloon .
A woman is pushing her cart .
A person is skiing through a snowy mountain .
A woman carrying a small child is playing with her friend on a busy street .
A man with a black shirt and brown long-sleeve shirt is standing near a graffiti that has come poles off around two .
A man , dressed with purple and black stands in bottoms while bandannas disbelief .
A man is looking at shoulder on a rack .
A man is about to fall .

In each cell, the first sentence and the last sentence correspond to the two sampled latent codes, and between are linearly interpolated samples.

VAE / VAE + FB:
A man is sitting on a bench . Two men are playing basketball . A man is sitting on a bench .
A man is playing a guitar . A man is sitting on a bench .
The man is wearing a blue shirt . A man is sitting on a bench .
The man is wearing a blue shirt . A man is sitting on a bench .
Two men are playing basketball .

MoG-VAE / MoG-VAE + M_IS:
The man is wearing a black shirt . An older gentleman in a white shirt is walking in a parking lot . A man is standing in front of a building .
A dog is running . A man is standing in front of a building .
A woman is walking in a field . A man is standing in front of a building .
A young girl in a red shirt is playing with a toy . A man is standing in front of a building .
An older gentleman in a white shirt and white pants is standing on a ladder with a large ladder on his right hand

Vamp-VAE / Vamp-VAE + M_IS:
A man is playing a guitar . Women in a white dress and a man in a black shirt are standing in front of a microphone . A man is playing a guitar .
Man in a blue shirt and jeans is sitting on a bench . A man is playing a guitar .
The man is wearing a black shirt . A man is playing a guitar .
People are walking down the street . A man is playing a guitar .
Two men are playing a game of chess .

FlowPrior / FlowPrior + FB:
Man in a blue shirt and blue jeans is sitting on a rock with a hammer .
Children are standing in the middle of a building with a man in a blue shirt and black pants . Two young boys are playing in the snow .
A man is standing in the middle of a large building . A dog is running through the snow .
The man is wearing a black shirt . Two men are standing on a boat .
Two men are playing basketball . A young man is sitting on a bench .
Girl in a blue shirt and black jacket standing on a bench .