A Batch Normalized Inference Network Keeps the KL Vanishing Away

The Variational Autoencoder (VAE) is widely used as a generative model to approximate a model's posterior on latent variables by combining amortized variational inference with deep neural networks. However, when paired with strong autoregressive decoders, VAE often converges to a degenerate local optimum known as "posterior collapse". Previous approaches consider the Kullback-Leibler divergence (KL) individually for each datapoint. We propose to let the KL follow a distribution across the whole dataset, and show that keeping the expectation of the KL's distribution positive is sufficient to prevent posterior collapse. We then propose Batch Normalized-VAE (BN-VAE), a simple but effective approach that sets a lower bound on this expectation by regularizing the distribution of the approximate posterior's parameters. Without introducing any new model component or modifying the objective, our approach avoids posterior collapse effectively and efficiently. We further show that the proposed BN-VAE can be extended to conditional VAE (CVAE). Empirically, our approach surpasses strong autoregressive baselines on language modeling, text classification and dialogue generation, and rivals more complex approaches while keeping almost the same training time as VAE.


Introduction
Variational Autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) is one of the most popular generative frameworks for modeling complex distributions. Different from the Autoencoder (AE), VAE provides a distribution-based latent representation of the data: it encodes the input x into a probability distribution over the latent variable z and reconstructs the original input using samples from this distribution. At inference time, VAE first samples the latent variable from the prior distribution and then feeds it into the decoder to generate an instance. VAE has been successfully applied in many NLP tasks, including topic modeling (Srivastava and Sutton, 2017; Miao et al., 2016; Zhu et al., 2018), language modeling (Bowman et al., 2016), text generation (Zhao et al., 2017b) and text classification (Xu et al., 2017).

*This work was done when Qile Zhu was an intern at Tencent AI Lab. Wei Bi is the corresponding author.
An autoregressive decoder (e.g., a recurrent neural network) is a common choice for modeling text data. However, when paired with strong autoregressive decoders such as LSTMs (Hochreiter and Schmidhuber, 1997) and trained under the conventional training strategy, VAE suffers from a well-known problem called posterior collapse or KL vanishing: the decoder learns to reconstruct the data independently of the latent variable z, and the KL vanishes to 0.
Many convincing solutions have been proposed to prevent posterior collapse. Among them, fixing the KL as a positive constant is an important direction (Davidson et al., 2018; Guu et al., 2018; van den Oord et al., 2017; Xu and Durrett, 2018; Tomczak and Welling, 2018; Kingma et al., 2016; Razavi et al., 2019). Some replace the Gaussian prior with other distributions, e.g., a uniform prior (van den Oord et al., 2017; Zhao et al., 2018) or a von Mises-Fisher (vMF) distribution (Davidson et al., 2018; Guu et al., 2018; Xu and Durrett, 2018). However, these approaches force the same constant KL for all data points and lose the flexibility of allowing various KLs for different data points (Razavi et al., 2019). Without changing the Gaussian prior, free-bits (Kingma et al., 2016) adds a threshold (free bits) on the KL term in the ELBO objective and stops optimizing the KL part when its value is smaller than the threshold. Chen et al. (2017) point out that the free-bits objective is non-smooth and suffers from optimization challenges. δ-VAE (Razavi et al., 2019) constrains the posterior parameters to a specific range to achieve a positive KL value for every latent dimension, which may limit the model performance.
Other work analyzes this problem from the perspective of optimization (Bowman et al., 2016; Zhao et al., 2017a; Chen et al., 2017; Alemi et al., 2018). Recently, He et al. (2019) observe that the inference network lags far behind the decoder during training, and propose to add additional training loops for the inference network only. Li et al. (2019) further propose to initialize the inference network with an encoder pretrained on an AE objective, and then train the VAE with free-bits. However, these two methods are much slower than the original VAE.
The limitation of a constant KL and the high cost of additional training motivate us to seek an approach that allows flexible modeling for different data points while remaining as fast as the original VAE. In this paper, instead of considering the KL individually for each data point, we let it follow a distribution across the whole dataset. We demonstrate that keeping the expectation of the KL's distribution positive is sufficient to prevent posterior collapse in practice. By regularizing the distribution of the approximate posterior's parameters, a positive lower bound of this expectation can be ensured. We then propose Batch Normalized-VAE (BN-VAE), a simple yet effective approach to achieve this goal, and discuss the connections between BN-VAE and previous enhanced VAE variants. We further extend BN-VAE to the conditional VAE (CVAE). Finally, experimental results demonstrate the effectiveness of our approach on real applications, including language modeling, text classification and dialogue generation. Empirically, our approach surpasses strong autoregressive baselines and is competitive with more sophisticated approaches while being substantially more efficient. Code and data are available at https://github.com/valdersoul/bn-vae.

Background and Related Work
In this section, we first introduce the basic background of VAE, then discuss the lagging problem (He et al., 2019). Finally, we present more related work.

VAE

VAE (Kingma and Welling, 2014; Rezende et al., 2014) aims to learn a generative model p(x, z) that maximizes the marginal likelihood log p(x) on a dataset. The marginal likelihood cannot be computed directly due to an intractable integral over the latent variable z. To solve this, VAE introduces a variational distribution q_φ(z|x), parameterized by a neural network, to approximate the true posterior, and instead optimizes the ELBO of log p(x):

\mathrm{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x)\,\|\,p(z)), \quad (1)

where φ represents the inference network and θ denotes the decoder. The first term is the reconstruction loss, while the second is the KL between the approximate posterior and the prior. The Gaussian distribution N(0, I) is the usual choice for the prior, and the KL between the approximate posterior q_φ(z|x) and the prior p(z) has the closed form:

\mathrm{KL}(q_\phi(z|x)\,\|\,p(z)) = \frac{1}{2}\sum_{i=1}^{n}\left(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\right), \quad (2)

where µ_i and σ_i are the mean and standard deviation of the approximate posterior for the i-th latent dimension, respectively. When the decoder is autoregressive, it can recover the data independently of the latent z (Bowman et al., 2016). The optimization then encourages the approximate posterior to approach the prior, which drives the KL to zero.
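For illustration, a minimal PyTorch sketch of this closed-form KL is given below; the function name and tensor shapes are our own, not part of the released code.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions.

    mu, logvar: tensors of shape (batch_size, latent_dim).
    Returns a tensor of shape (batch_size,).
    """
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)

# A collapsed posterior (mu = 0, sigma = 1) gives exactly zero KL.
mu = torch.zeros(4, 32)
logvar = torch.zeros(4, 32)
print(kl_to_standard_normal(mu, logvar))  # tensor([0., 0., 0., 0.])
```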

The Lagging Problem
Recently, He et al. (2019) analyze posterior collapse with the Gaussian prior from the view of training dynamics. The collapse is a local optimum of VAE where q_φ(z|x) = p_θ(z|x) = p(z) for all inputs. They further define two partial collapse states: model collapse, when p_θ(z|x) = p(z), and inference collapse, when q_φ(z|x) = p(z). They observe that inference collapse always happens far before model collapse due to the existence of autoregressive decoders. Different from the model posterior, the inference network lacks guidance and easily collapses to the prior at the initial stage of training, and thus posterior collapse happens.
Based on this understanding, they propose to aggressively optimize the inference network. However, this approach costs much more time than the original VAE. In our work, we also employ the Gaussian prior and thus face the same lagging problem. Yet, our proposed approach does not involve additional training effort: it effectively avoids the lagging problem (Section 3.3) while keeping almost the same training efficiency as the original VAE (Section 5.1).

Related Work
To prevent posterior collapse, we have mentioned several approaches that change the prior in the introduction. Besides these, some work modifies the original training objective directly. For example, Bowman et al. (2016) introduce an annealing strategy, where the weight of the KL is slowly increased from 0 to 1 during a warm-up period. β-VAE (Higgins et al., 2017) treats the KL weight as a hyperparameter to constrain the minimum value of the KL. Alemi et al. (2017), on the other hand, set a fixed KL weight to control the mutual information between z and x. Kim et al. (2018) address the amortization gap (Cremer et al., 2018) in VAE and propose Semi-Amortized VAE, which composes the inference network with additional mean-field updates. Fu et al. (2019) propose a cyclical annealing schedule, which repeats the process of increasing β multiple times.
There are various other approaches to address posterior collapse. For example, some researchers weaken the decoder by replacing the LSTM decoder with convolutional neural networks without autoregressive modeling (Semeniuta et al., 2017; Yang et al., 2017). Chen et al. (2017) feed a lossy representation of the data to the autoregressive decoder and enforce z to capture the information about the original input. Inheriting this idea, some following work adds direct connections between z and x (Zhao et al., 2017b; Dieng et al., 2019). Ma et al. (2019) introduce an additional regularization to learn diverse latent representations. δ-VAE (Razavi et al., 2019) and free-bits (Kingma et al., 2016) set a minimum KL for each latent dimension to prevent posterior collapse. Srivastava and Sutton (2017, 2018) find that training VAE with ADAM (Kingma and Ba, 2014) and a high learning rate may cause the gradients to diverge early. Their explanation for this diverging behavior lies in the exponential curvature of the gradient from the part of the inference network that produces the variance of the approximate posterior, and they apply batch normalization to the variance to solve this problem. In contrast, we use simple SGD without momentum to train our model. Moreover, we apply batch normalization to the mean produced by the inference network to keep the expectation of the KL's distribution positive, which is different from their work. We also note that Sønderby et al. (2016) use batch normalization in all fully connected layers with nonlinear activation functions to improve model performance. Different from them, our approach directly applies batch normalization to the parameters of the approximate posterior, i.e., the output of the inference network.

Batch-Normalized VAE
In this section, we first derive the expectation of the KL's distribution and show that keeping this expectation positive is enough to avoid posterior collapse. Then we propose our regularization method on the parameters of the approximate posterior to ensure a positive lower bound of this expectation. We further discuss the differences between our approach and previous work.

Expectation of the KL's Distribution
Given an x ∈ X, the inference network parameterizes an n-dimensional diagonal Gaussian distribution with mean µ = f_µ(x) and diagonal covariance Σ = diag(f_Σ(x)), where f_µ and f_Σ are two neural networks. In practice, the ELBO is computed through a Monte Carlo estimate over b samples. The KL in Eq. 2 is then computed over b samples from X:

\frac{1}{b}\sum_{j=1}^{b}\mathrm{KL}(q_\phi(z|x_j)\,\|\,p(z)) = \frac{1}{2b}\sum_{j=1}^{b}\sum_{i=1}^{n}\left(\mu_{j,i}^2 + \sigma_{j,i}^2 - \log\sigma_{j,i}^2 - 1\right). \quad (3)

When b gets larger, this empirical value approaches the mean of the KL across the whole dataset.
To make use of this observation, we assume that µ_i and log σ_i^2 for each latent dimension i follow some distribution with a fixed mean and variance across the dataset, respectively. The distribution may vary between different latent dimensions. In this way, the KL becomes a distribution over the µ_i's and log σ_i^2's. From Eq. 3, we can see that the batch estimate of the KL is the sample mean of per-dimension terms in µ_i and σ_i^2. Thus, noting that E[µ_i^2] = Var[µ_i] + E[µ_i]^2, we can derive the expectation of the KL's distribution as:

\mathbb{E}[\mathrm{KL}] = \frac{1}{2}\sum_{i=1}^{n}\left(\mathrm{Var}[\mu_i] + \mathbb{E}[\mu_i]^2 + \mathbb{E}[\sigma_i^2 - \log\sigma_i^2] - 1\right) \geq \frac{1}{2}\sum_{i=1}^{n}\left(\mathrm{Var}[\mu_i] + \mathbb{E}[\mu_i]^2\right), \quad (4)

where E[σ_i^2 − log σ_i^2] ≥ 1 since the minimum of e^x − x is 1 (take x = log σ_i^2). If we can guarantee a positive lower bound of E[KL], we can then effectively prevent posterior collapse.
Based on Eq. 4, the lower bound depends only on the number of latent dimensions n and the mean and variance of the µ_i's. This motivates our idea of imposing proper regularization on the distributions of the µ_i's to ensure a positive lower bound of E[KL].
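As an illustration of Eq. 4, the following sketch (with arbitrary, assumed values for the mean and variance of µ_i and an arbitrary distribution for the log-variances) empirically checks that the average KL stays above the bound (1/2) Σ (Var[µ_i] + E[µ_i]^2):

```python
import torch

n, num_samples = 32, 100000
mean_mu, std_mu = 0.3, 1.0                  # assumed E[mu_i] and sqrt(Var[mu_i])

mu = mean_mu + std_mu * torch.randn(num_samples, n)
logvar = 0.5 * torch.randn(num_samples, n)  # an arbitrary distribution for log sigma^2

kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1)
lower_bound = 0.5 * n * (std_mu ** 2 + mean_mu ** 2)

# The empirical E[KL] is always at least the bound from Eq. 4.
print(kl.mean().item(), ">=", lower_bound)
```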

Normalizing Parameters of the Posterior
The remaining key problem is to construct proper distributions of the µ_i's that result in a positive lower bound of E[KL] in Eq. 4. Here, we propose a simple and efficient way to accomplish this by applying a fixed batch normalization to the output of the inference network (the µ_i's). Batch Normalization (BN) (Ioffe and Szegedy, 2015) is a widely used regularization technique in deep learning. It normalizes the output of neurons and makes the optimization landscape significantly smoother (Santurkar et al., 2018). Different from other tasks that apply BN in the hidden layers to obtain fast and stable training, here we leverage BN as a tool to transform µ_i into a distribution with a fixed mean and variance. Mathematically, the regularized mean is written as:

\hat{\mu}_i = \gamma\,\frac{\mu_i - \mu_{\mathcal{B},i}}{\sigma_{\mathcal{B},i}} + \beta, \quad (5)

where µ_i and \hat{µ}_i are the means of the approximate posterior before and after BN. µ_{B,i} and σ_{B,i} denote the mean and standard deviation of µ_i, estimated (with bias) within a batch of samples for each dimension independently. γ and β are the scale and shift parameters. Instead of using a learnable γ in Eq. 5, we use a fixed BN which freezes the scale γ.
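A minimal PyTorch sketch of such a fixed BN layer, freezing the scale of a standard BatchNorm1d at a chosen γ, is shown below; the class and parameter names are ours and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class FixedBN(nn.Module):
    """Batch normalization on the posterior mean with the scale frozen to gamma.

    Only the shift beta stays learnable, so across the data mu follows a
    distribution with mean beta and variance gamma^2.
    """

    def __init__(self, latent_dim, gamma=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(latent_dim)
        self.bn.weight.data.fill_(gamma)
        self.bn.weight.requires_grad = False  # freeze the scale gamma
        # self.bn.bias is the learnable shift beta

    def forward(self, mu):
        return self.bn(mu)

mu = torch.randn(16, 32)                     # posterior means from the inference network
print(FixedBN(32, gamma=0.5)(mu).shape)      # torch.Size([16, 32])
```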
In this way, the distribution of µ_i has mean β and variance γ^2. β is a learnable parameter that keeps the distribution flexible. Now, we derive the lower bound of E[KL] under the fixed BN. With the fixed mean β and variance γ^2 for µ_i, Eq. 4 gives a new lower bound:

\mathbb{E}[\mathrm{KL}] \geq \frac{1}{2}\sum_{i=1}^{n}\left(\gamma^2 + \beta^2\right) = \frac{n(\gamma^2 + \beta^2)}{2} \geq \frac{n\gamma^2}{2}. \quad (6)

To this end, we can easily control the lower bound of E[KL] by setting γ. Algorithm 1 shows the training process.
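A rough sketch of one training step in this spirit is given below; the toy encoder, decoder, dimensions and reconstruction loss are our own illustrative assumptions, not the paper's model or the released code.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real text encoder/decoder (illustration only).
latent_dim, input_dim, batch_size = 32, 100, 16
enc = nn.Linear(input_dim, 2 * latent_dim)        # outputs [mu ; logvar]
dec = nn.Linear(latent_dim, input_dim)
mu_bn = nn.BatchNorm1d(latent_dim)                # fixed BN on the posterior mean
mu_bn.weight.data.fill_(0.5)                      # gamma frozen at 0.5
mu_bn.weight.requires_grad = False

params = list(enc.parameters()) + list(dec.parameters()) + [mu_bn.bias]
opt = torch.optim.SGD(params, lr=0.1)

x = torch.randn(batch_size, input_dim)            # a mini-batch
mu, logvar = enc(x).chunk(2, dim=-1)
mu = mu_bn(mu)                                    # regularize the posterior mean
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
rec_loss = ((dec(z) - x) ** 2).mean()             # toy reconstruction loss
kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(-1).mean()
loss = rec_loss + kl                              # plain ELBO, no annealing needed
opt.zero_grad()
loss.backward()
opt.step()
```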

Connections with Previous Approaches
Constructing a positive KL: Both free-bits (Kingma et al., 2016) and δ-VAE (Razavi et al., 2019) set a threshold on the KL value. Free-bits changes the KL term in the ELBO to a hinge-loss term:

\sum_{i=1}^{n}\max\left(\lambda,\ \mathrm{KL}(q_\phi(z_i|x)\,\|\,p(z_i))\right). \quad (7)

Another version of free-bits applies the threshold to the entire sum directly instead of each individual value. Training with the free-bits objective, the model stops driving down the KL once it is already below λ. However, Chen et al. (2017) point out that the free-bits objective is non-smooth and suffers from optimization challenges. Our approach does not face this optimization problem since we use the original ELBO objective.
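For comparison, a rough sketch of this per-dimension free-bits term is shown below; the variable names are ours, and actual implementations may clamp the batch-averaged KL per dimension instead.

```python
import torch

def free_bits_kl(mu, logvar, lam=0.5):
    """Per-dimension KL to N(0, I), clamped from below at lam (free bits)."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)  # (batch, dim)
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1)

mu, logvar = torch.zeros(8, 32), torch.zeros(8, 32)
# Even with a fully collapsed posterior (true KL = 0), the objective reports 32 * lam.
print(free_bits_kl(mu, logvar))
```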
δ-VAE sets a target rate of δ for each latent dimension by constraining the mean and variance of the approximate posterior:

\mu_q = \sqrt{2\delta + 1 + \ln(\sigma_q^2) - \sigma_q^2}, \qquad \sigma_q \in [\sigma_l, \sigma_u], \quad (8)

where [σ_l, σ_u] is the feasible interval for σ_q, obtained by solving ln(σ_q^2) − σ_q^2 + 2δ + 1 ≥ 0. Although δ-VAE can ensure a minimum value for the KL, it limits the model performance because the parameters are constrained to this interval. Our approach only constrains the distributions of the µ_i's, which is more flexible than δ-VAE. Experiments further show that our approach surpasses both free-bits and δ-VAE.

Reducing inference lag: As we focus on the setting of the conventional Gaussian prior, the lagging problem mentioned in Section 2.2 is crucial. To this point, it is helpful to analyze an alternative form of the ELBO:

\mathrm{ELBO} = \log p_\theta(x) - \mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)). \quad (9)

In this view, the only goal of the approximate posterior q_φ(z|x) is to match the model posterior p_θ(z|x). We examine how well our approach reduces inference lag using the same synthetic experiment as He et al. (2019); details can be found in Section 1 of the Appendix. The synthetic experiment indicates that our regularization helps rebalance the optimization between inference and generation, and finally overcomes posterior collapse. We also prefer a large γ, since a small γ pushes the approximate posterior towards the prior.

Extension to CVAE
Given an observation x and its output y, CVAE (Sohn et al., 2015; Zhao et al., 2017b) models the conditional distribution p(y|x). The variational lower bound of the conditional log-likelihood is:

\mathbb{E}_{q_\phi(z|x,y)}[\log p_\theta(y|x,z)] - \mathrm{KL}(q_\phi(z|x,y)\,\|\,p_\theta(z|x)). \quad (10)

Different from VAE, the prior p_θ(z|x) in CVAE is not fixed; it is also parameterized by a neural network. It is possible to apply another BN on the mean of the prior with a different γ so that the expectation of the KL becomes a constant. However, the resulting lower bound is uncontrollable because the density of µ_1 + µ_2 is the convolution of their densities, which is intractable. To overcome this issue, we propose to constrain the prior with a fixed distribution. We achieve this by adding another KL between the prior and a known Gaussian distribution r(z), i.e., KL(p_θ(z|x) || r(z)). Instead of optimizing the ELBO in Eq. 10, we optimize a lower bound of the ELBO for CVAE:

\mathbb{E}_{q_\phi(z|x,y)}[\log p_\theta(y|x,z)] - \mathrm{KL}(q_\phi(z|x,y)\,\|\,p_\theta(z|x)) - \mathrm{KL}(p_\theta(z|x)\,\|\,r(z)). \quad (11)

The KL term in the new bound is the sum of KL(q_φ(z|x,y) || p_θ(z|x)) and KL(p_θ(z|x) || r(z)), which can be computed as:

\frac{1}{2}\sum_{i=1}^{n}\left(\frac{\sigma_{q_i}^2 + (\mu_{q_i} - \mu_{p_i})^2}{\sigma_{p_i}^2} - \log\sigma_{q_i}^2 + \sigma_{p_i}^2 + \mu_{p_i}^2 - 2\right), \quad (12)

where σ_q, µ_q and σ_p, µ_p are the parameters of q_φ and p_θ, respectively, and n denotes the latent dimension (hidden size).
The KL term vanishes to 0 when and only when both q_φ and p_θ collapse to r(z), the standard normal distribution. As explained in Section 3.2, the KL will not be 0 when we apply BN to q_φ. We further prove that when q_φ collapses to p_θ, the KL term is not at its minimum (details in Section 2 of the Appendix), so that KL(q_φ(z|x, y) || p_θ(z|x)) will not be 0. In this way, we can avoid posterior collapse in CVAE.
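A small sketch of this combined KL term for diagonal Gaussians, with r(z) = N(0, I), is given below; the helper functions are our own illustration.

```python
import torch

def gaussian_kl(mu_a, logvar_a, mu_b, logvar_b):
    """KL( N(mu_a, diag(exp(logvar_a))) || N(mu_b, diag(exp(logvar_b))) ), summed over dims."""
    return 0.5 * torch.sum(
        logvar_b - logvar_a
        + (logvar_a.exp() + (mu_a - mu_b).pow(2)) / logvar_b.exp()
        - 1.0,
        dim=-1,
    )

def cvae_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) + KL(p || r) with r = N(0, I), as in the modified CVAE bound."""
    kl_q_p = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    kl_p_r = gaussian_kl(mu_p, logvar_p,
                         torch.zeros_like(mu_p), torch.zeros_like(logvar_p))
    return kl_q_p + kl_p_r

# The combined term is zero only if both q and p collapse to N(0, I).
z = torch.zeros(4, 200)
print(cvae_kl(z, z, z, z))  # tensor([0., 0., 0., 0.])
```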
Algorithm 2 shows the training details.

VAE for Language Modeling
Setup: We test our approach on two benchmark datasets: the Yelp and Yahoo corpora (Yang et al., 2017). We use a Gaussian prior N(0, I), and the approximate posterior is a diagonal Gaussian. Following previous work (Burda et al., 2016; He et al., 2019), we report the negative log likelihood (NLL) estimated from 500 importance-weighted samples, which provides a tighter bound than the ELBO and carries the same information as the perplexity (PPL). Besides the NLL, we also report the KL, the mutual information (MI) I_q (Alemi et al., 2017) and the number of active units (AU) (Burda et al., 2016) in the latent space.
The MI I_q can be calculated as:

I_q = \mathbb{E}_{p_d(x)}\left[\mathrm{KL}(q_\phi(z|x)\,\|\,p(z))\right] - \mathrm{KL}(q_\phi(z)\,\|\,p(z)), \quad (13)

where p_d(x) is the empirical distribution. The aggregated posterior q_φ(z) = E_{p_d(x)}[q_φ(z|x)] and KL(q_φ(z) || p(z)) can be approximated with Monte Carlo estimates. The AU is measured as A_z = Cov_x(E_{z∼q_φ(z|x)}[z]) with a threshold (Razavi et al., 2019) of 0.01, i.e., unit i is active if A_{z_i} > 0.01. For training time, we report both the absolute hours and the relative speed.
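A possible way to compute the AU metric from posterior means collected over the data is sketched below; the exact evaluation script may differ, and the threshold simply follows the value above.

```python
import torch

def active_units(mu_all, threshold=0.01):
    """Count active latent units.

    mu_all: (num_examples, latent_dim) posterior means over a dataset.
    A_z = Cov_x( E_{z~q(z|x)}[z] ); unit i is active if its variance across
    the data exceeds the threshold.
    """
    a_z = mu_all.var(dim=0)                 # per-dimension variance of the means
    return int((a_z > threshold).sum().item())

mu_all = torch.randn(1000, 32) * 0.005      # nearly collapsed posterior means
print(active_units(mu_all))                 # close to 0 active units
```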
Compared methods: We compare with two groups of approaches.
• Methods that construct a positive KL:
- Free-bits (FB) (Kingma et al., 2016).
- δ-VAE (Razavi et al., 2019).
- vMF-VAE (Xu and Durrett, 2018).
• Methods with a modified training strategy:
- Semi-Amortized VAE (SA-VAE) (Kim et al., 2018).
- VAE with aggressive training of the inference network (Agg-VAE) (He et al., 2019).
- FB with a pretrained inference network (AE+FB) (Fu et al., 2019).

Main results: Table 1 shows the results. We further split the results into two settings, one for models with a pretrained inference network and one without. Our approach achieves the best NLL in the setting without a pretrained inference network on both datasets and is competitive in the setting with a pretrained encoder. Moreover, we observe that:
• δ-VAE does not perform well in either setting, which shows that constraining the parameters to a small interval is harmful to the model.
• In vMF-VAE, all data points share the same KL value. Our approach is more flexible and achieves better performance.

Performance on a downstream task - text classification: The goal of VAE is to learn a good representation of the data for downstream tasks.
Here, we evaluate the quality of the latent representations by training a one-layer linear classifier on the mean of the posterior distribution. We use a downsampled version of the Yelp sentiment dataset (Shen et al., 2017). Li et al. (2019) further sampled various amounts of labeled data to train the classifier. To compare with them fairly, we use the same samples as Li et al. (2019). Results are shown in Table 3.
Our approach achieves the best accuracy in all settings. With 10k training samples, all methods obtain good results. However, with only 100 training samples, the methods vary a lot in accuracy. The text classification task shows that our approach can learn a good latent representation even without a pretrained inference network.

CVAE for Dialogue Generation
Setup: For dialogue generation, we test our approach in the CVAE setting. Following previous work (Zhao et al., 2017b), we use the Switchboard (SW) Corpus (Godfrey and Holliman, 1997), which contains 2400 two-sided telephone conversations. We use a bidirectional GRU with hidden size 300 to encode each utterance and then a one-layer GRU with hidden size 600 to encode the previous k−1 utterances as the context. The response decoder is a one-layer GRU with hidden size 400. The latent representation z has a size of 200. We use the evaluation metrics from Zhao et al. (2017b): (1) smoothed sentence-level BLEU (Chen and Cherry, 2014); (2) cosine distance of bag-of-word embeddings, a simple method to obtain sentence embeddings (a short sketch is given below). We use pretrained GloVe embeddings (Pennington et al., 2014) and denote the average method as A-bow and the extreme method as E-bow. Higher values indicate more plausible responses. We compare our approach with CVAE and CVAE with the bag-of-words (BOW) loss (Zhao et al., 2017b), which requires the decoder in the generation network to predict the bag of words in the response y based on z.

Automatic evaluation: Table 4 shows the results of the three approaches. From the KL values, we find that CVAE suffers from posterior collapse, while CVAE (BOW) and our approach avoid it effectively. For BLEU-4, we observe the same phenomenon as in previous work (Fu et al., 2019; Zhao et al., 2017b): CVAE is slightly better than the others. This is because CVAE tends to repeatedly generate the most likely and safe responses with the collapsed posterior. As for precision, the three models do not differ much. However, CVAE (BOW) and our BN-VAE outperform CVAE in recall by a large margin. This indicates that BN-VAE can also produce diverse responses with good quality, like CVAE (BOW).
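A rough sketch of the A-bow and E-bow similarities mentioned in the setup is given below; embedding lookup and tokenization are omitted, and the pooling definitions follow common practice, so they may differ in detail from the evaluation scripts actually used.

```python
import torch
import torch.nn.functional as F

def bow_embeddings(token_vecs):
    """token_vecs: (num_tokens, emb_dim) pretrained word vectors of one sentence."""
    avg = token_vecs.mean(dim=0)                      # A-bow: average pooling
    # E-bow: per dimension, keep the value with the largest magnitude.
    idx = token_vecs.abs().argmax(dim=0)
    ext = token_vecs.gather(0, idx.unsqueeze(0)).squeeze(0)
    return avg, ext

def cosine(a, b):
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

ref = torch.randn(7, 300)   # stand-ins for GloVe vectors of a reference response
hyp = torch.randn(5, 300)   # stand-ins for GloVe vectors of a generated response
ref_a, ref_e = bow_embeddings(ref)
hyp_a, hyp_e = bow_embeddings(hyp)
print(cosine(ref_a, hyp_a), cosine(ref_e, hyp_e))
```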
Human evaluation: We conduct the human evaluation by asking five annotators from a commercial annotation company to grade 200 sampled conversations on fluency, relevance and informativeness on a scale of 1-3 (see Section 4 of the Appendix for more details on the criteria). We also report the proportion of acceptable and high scores (≥ 2 and = 3) on each metric. Table 5 shows the annotation results. Overall, our approach beats the other two methods in relevance and fluency, with more informative responses. Also, our approach has the largest proportion of responses whose scores are high. This indicates that our model can produce more meaningful and relevant responses than the other two.
Case study: Table 6 shows sampled responses generated by the three methods (more can be found in the Appendix). By maintaining a reasonable KL, responses generated by our approach are more relevant to the query and more diverse than those of the other two. We test the three methods in the simplest setting of dialogue generation. Note that the focus of this work is to improve CVAE itself by avoiding its KL vanishing problem, not to pursue state-of-the-art dialogue generation performance. To further improve the quality of generated responses, our approach can be enhanced by incorporating knowledge such as dialogue acts (Zhao et al., 2017b), external facts (Ghazvininejad et al., 2018) and personal profiles (Zhang et al., 2018).

Conclusions and Future Work
In this paper, we tackle the posterior collapse problem when VAE is paired with autoregressive decoders. Instead of considering the KL individually for each data point, we let it follow a distribution D_KL and show that keeping the expectation of D_KL positive is sufficient to prevent posterior collapse. We propose Batch Normalized VAE (BN-VAE), a simple but effective approach that sets a lower bound on the expectation of D_KL by regularizing the approximate posterior's parameters. Our approach also avoids the recently reported lagging problem efficiently, without additional training effort. We show that our approach can be easily extended to CVAE. We test our approach on three real applications: language modeling, text classification and dialogue generation.
Experiments show that our approach outperforms strong baselines and is competitive with more complex methods while being substantially faster.
We use the Gaussian prior as the running example to introduce our method in this work. The key requirement for our approach to be applicable is that we can obtain a formula for the expectation of the KL. However, it is hard to derive the same formula for some stronger or more sophisticated priors, e.g., the Dirichlet prior. For such distributions, we can approximate them with Gaussian distributions (as in Srivastava and Sutton (2017)) and then batch normalize the corresponding parameters. Further study in this direction may be interesting.

A.1 Experiments on Synthetic Data
We follow Agg-VAE and construct synthetic data to validate whether our approach can avoid the lagging problem. The VAE used in this synthetic task has an LSTM encoder and an LSTM decoder. We use a scalar latent variable because we need to compute µ_{x,θ}, which is approximated by discretizing p_θ(z|x). To visualize the training progress, we sample 500 data points from the validation set and show them in the mean space.
We plot the mean of the approximate posterior and of the model posterior during training for the basic VAE and BN-VAE. As shown in the first column of Fig. 1, all points have a zero model posterior mean (the x-axis) at the beginning of training, which indicates that z and x are independent. For the basic VAE, points start to spread along the x-axis during training while sharing almost the same y value, since the model posterior p_θ(z|x) is well learned with the help of the autoregressive decoder. However, the inference posterior q_φ(z|x) lags behind p_θ(z|x) and collapses to the prior in the end. Our regularization, approximated by BN, on the other hand, pushes the inference posterior q_φ(z|x) away from the prior p(z) at the initial training stage, and forces q_φ(z|x) to catch up with p_θ(z|x) to minimize KL(q_φ(z|x) || p_θ(z|x)) in Eq. 9. As shown in the second row of Fig. 1, points spread in both directions and towards the diagonal.
We also report the results for different γ's with different batch sizes (32 in Fig. 1). Fig. 2 shows the training dynamics. Both settings of γ avoid posterior collapse effectively. A larger γ produces more diverse µ's, which spread along the diagonal. However, a small γ results in a small variance for the distribution of µ, so the µ's in the bottom row are closer to the origin (the mean of the distribution). When γ is 0, posterior collapse happens. Different batch sizes do not differ much, so 32 is a decent choice. An intuitive improvement of our method is to automatically learn a different γ for each latent dimension, which we leave for future work.

A.2 Proof in CVAE
The KL can be computed as:

\mathrm{KL} = \frac{1}{2}\sum_{i=1}^{n}\left(\frac{\sigma_{q_i}^2 + (\mu_{q_i} - \mu_{p_i})^2}{\sigma_{p_i}^2} - \log\sigma_{q_i}^2 + \sigma_{p_i}^2 + \mu_{p_i}^2 - 2\right).

We need to prove that the KL does not achieve its minimum when µ_{p_i} equals µ_{q_i} and σ_{p_i} equals σ_{q_i}. We take hidden size 1 as an example. The bivariate function of µ_p and σ_p is:

f(\mu_p, \sigma_p) = \frac{1}{2}\left(\frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{\sigma_p^2} - \log\sigma_q^2 + \sigma_p^2 + \mu_p^2 - 2\right).

By continuity and differentiability, the maxima and minima of f must be stationary points of f. The stationary points satisfy:

\frac{\partial f}{\partial \mu_p} = \mu_p + \frac{\mu_p - \mu_q}{\sigma_p^2} = 0, \qquad \frac{\partial f}{\partial \sigma_p} = \sigma_p - \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{\sigma_p^3} = 0.

When µ_p = µ_q and σ_p = σ_q, the partial derivatives equal µ_q and σ_q − 1/σ_q respectively, which are not both 0 in general. So this point is not a stationary point of f, and hence not the minimum.

A.3 Language Modeling
We investigate the training procedure of different models. We plot the MI I_q, the D_KL in the ELBO, and the distance between the aggregated posterior and the prior, KL(q_φ(z)||p(z)). As implied by Eq. 13 in the main paper, the expected D_KL in the ELBO is the sum of the other two. Fig. 3 shows these three values throughout training. Although D_KL is an upper bound of the mutual information, we notice that the gap is usually large. In the initial training stage, D_KL increases in the basic VAE with annealing, while its MI remains small. As the annealing weight increases, the method finally suffers from posterior collapse. In contrast, our approach obtains a high MI with a small D_KL value, similar to aggressive VAE. The full results on language modeling are in Table 8.

A.4 CVAE for dialogue generation
Human evaluation: We evaluate the generated responses from three aspects: relevance, fluency and informativeness. Here we introduce the criteria of the evaluation.

Figure 1 :
Figure 1: Visualization of 500 sampled data points from the synthetic dataset during training. The x-axis is µ_{x,θ}, the approximate model posterior mean. The y-axis is µ_{x,φ}, which represents the inference posterior mean. b is the batch size and γ is 1 in BN.

Figure 2 :
Figure 2: Visualization of our BN-VAE with different γ on the synthetic data.

Figure 3 :
Figure 3: The mutual information I_q, D_KL in the ELBO and KL(q_φ(z)||p(z)) during training.

Table 1 :
Results on the Yahoo and Yelp datasets. We report mean values across 5 different random runs. * indicates that the results are from our experiments, while others are from He et al. (2019); Li et al. (2019). We only show the best performance of every model for each dataset. More results with various parameters can be found in the Appendix.

Table 2 :
Comparison of training time to convergence.
Configurations: We use a 512-dimensional word embedding layer for both datasets. For the encoder and the decoder, a single-layer LSTM with 1024 hidden units is used. We use z to generate the initial state of the decoder, following Kim et al. (2018); He et al. (2019); Li et al. (2019). To optimize the objective, we use mini-batch SGD without momentum.
Training time: Table 2 shows the training time (until convergence) and the relative ratio of the basic VAE, our approach and the other three best models in Table 1. SA-VAE is about 12 times slower than our approach due to the local update for each data point. Agg-VAE is 2-4 times slower than ours because it requires additional training for the inference network. AE+FB needs to train an autoencoder before the VAE. In contrast, our approach is fast since we only add a one-layer batch normalization, and thus the training cost is almost the same as the basic VAE. More results about the training behavior can be found in Section 3 of the Appendix.

Table 4 :
Comparison on dialogue generation.

Table 5 :
Human evaluation results. Numbers in parentheses are the corresponding variances on the 200 test samples.

Table 6 :
Sampled generated responses. Only the last sentence in the context is shown here.

Table 7 :
CVAE (BOW) and our approach can both generate diverse responses. However, responses from ours are more related to the context compared with the other two.

Table 8 :
Results on Yahoo and Yelp datasets. We report mean values across 5 different random runs.
* indicates that the results are from our experiments, while others are from previous reports.