On the Encoder-Decoder Incompatibility in Variational Text Modeling and Beyond

Variational autoencoders (VAEs) combine latent variables with amortized variational inference, whose optimization often converges to a trivial local optimum termed posterior collapse, especially in text modeling. By tracking the optimization dynamics, we observe an encoder-decoder incompatibility that leads to poor parameterizations of the data manifold. We argue that the trivial local optimum may be avoided by improving the encoder and decoder parameterizations, since the posterior network is part of a transition map between them. To this end, we propose Coupled-VAE, which couples a VAE model with a deterministic autoencoder of the same structure and improves the encoder and decoder parameterizations via encoder weight sharing and decoder signal matching. We apply the proposed Coupled-VAE approach to various VAE models with different regularization terms, posterior families, decoder structures, and optimization strategies. Experiments on benchmark datasets (i.e., PTB, Yelp, and Yahoo) show consistently improved results in terms of probability estimation and richness of the latent space. We also generalize our method to conditional language modeling and propose Coupled-CVAE, which largely improves the diversity of dialogue generation on the Switchboard dataset.


Introduction
The variational autoencoder (VAE) (Kingma and Welling, 2014) is a generative model that combines neural latent variables with amortized variational inference and is efficient in estimating and sampling from the data distribution. It infers a posterior distribution for each instance with a shared inference network and optimizes the evidence lower bound (ELBO) instead of the intractable marginal log-likelihood. (Our code is publicly available at https://github.com/ChenWu98/Coupled-VAE.) Given its potential to learn representations from massive text data, there has been much interest in using VAE for text modeling (Zhao et al., 2017; Xu and Durrett, 2018). Prior work has observed that the optimization of VAE suffers from the posterior collapse problem, i.e., the posterior becomes nearly identical to the prior and the decoder degenerates into a standard language model (Bowman et al., 2016; Zhao et al., 2017). A widely mentioned explanation is that a strong decoder makes the collapsed posterior a good local optimum of the ELBO, and existing solutions include weakened decoders (Yang et al., 2017; Semeniuta et al., 2017), modified regularization terms (Higgins et al., 2017; Wang and Wang, 2019), alternative posterior families (Rezende and Mohamed, 2015; Davidson et al., 2018), richer prior distributions (Tomczak and Welling, 2018), improved optimization strategies, and narrowed amortization gaps.
In this paper, we provide a novel perspective on the posterior collapse problem. By comparing the optimization dynamics of VAE with that of deterministic autoencoders (DAE), we observe an incompatibility between a poorly optimized encoder and an overly expressive decoder. From the perspective of differential geometry, we show that this issue indicates poor chart maps from the data manifold to the parameterizations, which makes it difficult to learn a transition map between them. Since the posterior network is a part of the transition map, we argue that posterior collapse would be mitigated by better parameterizations.
To this end, we propose the Coupled-VAE approach, which couples the VAE model with a deterministic network of the same structure. For a better encoder parameterization, we share the encoder weights between the coupled networks. For a better decoder parameterization, we propose a signal matching loss that pushes the stochastic decoding signals toward the deterministic ones. Notably, our approach is model-agnostic, since it makes no assumption about the regularization term, the posterior family, the decoder architecture, or the optimization strategy. Experiments on PTB, Yelp, and Yahoo show that our method consistently improves the performance of various VAE models in terms of probability estimation and the richness of the latent space. The generalization to conditional modeling, i.e., Coupled-CVAE, largely improves the diversity of dialogue generation on the Switchboard dataset. Our contributions are as follows:
• We observe the encoder-decoder incompatibility in VAE and connect it to the posterior collapse problem.
• We propose the Coupled-VAE, which helps the encoder and the decoder to learn better parameterizations of the data manifold with a coupled deterministic network, via encoder weight sharing and decoder signal matching.
• Experiments on PTB, Yelp, and Yahoo show that our approach improves the performance of various VAE models in terms of probability estimation and richness of the latent space. We also generalize Coupled-VAE to conditional modeling and propose Coupled-CVAE, which largely improves the diversity of dialogue generation on the Switchboard dataset.

Variational Inference for Text Modeling
The generative process of VAE is first to sample a latent code z from the prior distribution P(z) and then to sample the data x from P(x|z; θ) (Kingma and Welling, 2014). Since the exact marginalization of the log-likelihood is intractable, a variational family of posterior distributions Q(z|x; φ) is adopted to derive the evidence lower bound (ELBO), i.e.,

log P(x; θ) ≥ E_{z∼Q(z|x;φ)}[log P(x|z; θ)] − KL(Q(z|x; φ) ‖ P(z))    (1)

For training, as shown in Figure 1(a), the encoded text e is transformed into its posterior via a posterior network. A latent code is sampled and mapped to the decoding signal h. Finally, the decoder infers the input from the decoding signal. The objective can be viewed as a reconstruction loss L_rec plus a regularization loss L_reg (whose form varies), i.e.,

L = L_rec + L_reg,  where  L_rec = −E_{z∼Q(z|x;φ)}[log P(x|z; θ)]

However, the optimization of the VAE objective is challenging. We usually observe a very small L_reg and an L_rec similar to that of a standard language model, i.e., the well-known posterior collapse problem.
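As a concrete illustration of the reconstruction/regularization split, the sketch below computes a Monte Carlo negative ELBO for a diagonal-Gaussian posterior against a standard normal prior. The likelihood function is a caller-supplied stand-in for the decoder, not the paper's GRU model.

```python
import math
import random

def kl_to_std_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ) -- the L_reg term
    of a vanilla Gaussian VAE."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def neg_elbo(mu, logvar, log_px_given_z, n_samples=64, seed=0):
    """L_rec + L_reg: Monte Carlo reconstruction loss plus closed-form KL."""
    rng = random.Random(seed)
    rec = -sum(log_px_given_z(reparameterize(mu, logvar, rng))
               for _ in range(n_samples)) / n_samples
    return rec + kl_to_std_normal(mu, logvar)
```

With a fully collapsed posterior (mu = 0, logvar = 0 for every input), the KL term vanishes and the objective reduces to the reconstruction loss of an unconditional language model.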

Deterministic Autoencoders
An older family of autoencoders is the deterministic autoencoder (DAE) (Rumelhart et al., 1986; Ballard, 1987). Figure 1(b) shows an overview of DAE for text modeling, which is composed of a text encoder, an optional MLP, and a text decoder. The reconstruction loss of DAE is usually much lower than that of VAE after convergence.

Encoder-Decoder Incompatibility in VAE for Text Modeling
To understand the posterior collapse problem, we take a deeper look into the training dynamics of VAE. We investigate the following questions. How much backpropagated gradient does the encoder receive from reconstruction? How much does it receive from regularization? How much information does the decoder receive from the encoded text?

Tracking Training Dynamics
To answer the first question, we study the gradient norm of the reconstruction loss w.r.t. the encoded text, i.e., ‖∂L_rec/∂e‖₂, which shows the magnitude of the gradients received by the encoder parameters. From Figure 2(a), we observe that it constantly increases in DAE, while in VAE it increases marginally in the early stage and then decreases continuously. This shows that the reconstruction loss actively optimizes the DAE encoder, while the VAE encoder lacks backpropagated gradients after the early stage of training.
We seek the answer to the second question by studying the gradient norm of the regularization loss w.r.t. the encoded text, i.e., ‖∂L_reg/∂e‖₂. For a totally collapsed posterior, i.e., Q(z|x; φ) = P(z) for every x, ‖∂L_reg/∂e‖₂ would be zero. Thus, ‖∂L_reg/∂e‖₂ shows how far the posterior of each instance is from the aggregated posterior or the prior. Figure 2(b) shows a constant decrease of this gradient norm in VAE from the 2.5K step until convergence, which shows that posterior collapse is aggravated as the KL weight increases.
For the third question, we compute the normalized gradient norm of the decoding signal w.r.t. the encoded text, i.e., ‖∂h/∂e‖_F / ‖h‖₂. Since this term shows how much the decoding signal changes, relatively, with a perturbation of the encoded text, it reflects the amount of information passed from the encoder to the decoder. Figure 2(c) shows that for DAE it constantly increases. For VAE, it at first increases even faster than in DAE, then slows down, and finally decreases until convergence, indicating that the VAE decoder, to some extent, ignores the encoder in the late stage of training.
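The third diagnostic can be reproduced for any differentiable encoder-to-signal map. The sketch below estimates ‖∂h/∂e‖_F / ‖h‖₂ by central finite differences; the linear toy map in the test is hypothetical and purely for illustration.

```python
import math

def normalized_signal_sensitivity(f, e, eps=1e-5):
    """Estimate ||dh/de||_F / ||h||_2 for a map h = f(e) by central finite
    differences. A vanishing value means the decoding signal no longer
    responds to the encoded text, i.e., the decoder ignores the encoder."""
    h = f(e)
    h_norm = math.sqrt(sum(v * v for v in h))
    frob_sq = 0.0
    for j in range(len(e)):
        e_plus = list(e)
        e_minus = list(e)
        e_plus[j] += eps
        e_minus[j] -= eps
        hp, hm = f(e_plus), f(e_minus)
        frob_sq += sum(((p - m) / (2.0 * eps)) ** 2 for p, m in zip(hp, hm))
    return math.sqrt(frob_sq) / h_norm
```

In practice one would read these norms directly off the autodiff graph during training; the finite-difference version only serves to make the quantity concrete.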

Encoder-Decoder Incompatibility
Based on the training dynamics in Section 3.1 and the observations in previous work (Bowman et al., 2016; Zhao et al., 2017), text VAE exhibits three features, listed as follows. First, the encoder is poorly optimized, as shown by the low ‖∂L_rec/∂e‖₂. Second, the decoder degenerates into a powerful language model. Third, h contains less information from e in VAE than in DAE, as indicated by the lower ‖∂h/∂e‖_F / ‖h‖₂. We refer to these features as the encoder-decoder incompatibility.
To connect the incompatibility with posterior collapse, we start from the manifold hypothesis, which states that real-world data concentrates near a manifold with a lower dimensionality than the ambient space (Narayanan and Mitter, 2010; Bengio et al., 2013). In our case, we denote the manifold of text data as X ⊂ ∪_{l∈ℕ} V^l, where V is the vocabulary. In the language of differential geometry, the encoded text e ∈ E ⊂ R^d and the decoding signal h ∈ H ⊂ R^d can be viewed as the parameterizations (or coordinates) of x ∈ X under two different charts (or coordinate systems). Formally, we denote the chart maps as ϕ_e: X → E and ϕ_h: X → H, which satisfy e = ϕ_e(x) and h = ϕ_h(x) for any x ∈ X. Given the two charts, the map ϕ_h ∘ ϕ_e^{-1}: E → H is called the transition map between them. In DAE, the two chart maps and the transition map between them are learned simultaneously via the single reconstruction loss, which we rewrite as

L_rec = −log P(x | (ϕ_h^{-1} ∘ (ϕ_h ∘ ϕ_e^{-1}) ∘ ϕ_e)(x))

where ϕ_e, ϕ_h ∘ ϕ_e^{-1}, and ϕ_h^{-1} are modeled as the encoder, the MLP, and the decoder, respectively (strictly speaking, in text modeling, the range of ϕ_h^{-1} is not X but the set of distributions over X), as illustrated in Figure 3.
In VAE, as discussed above, both ϕ_e and ϕ_h inadequately parameterize the data manifold. We argue that the inadequate parameterizations make it harder to find a smooth transition map in VAE than in DAE, as shown by the lower ‖∂h/∂e‖_F / ‖h‖₂.

Figure 3: Left: DAE and VAE interpreted as manifold parameterizations and a transition map. Right: A graphical overview of the proposed Coupled-VAE. The upper path is deterministic, and the lower path is stochastic.
Since the posterior network is a part of the transition map, it consequently seeks to map each instance to the prior (discussed in Section 3.1) rather than learning the transition map.

Coupling Variational and Deterministic Networks
Based on the above analysis, we argue that posterior collapse could be alleviated by learning chart maps (i.e., ϕ_e and ϕ_h) that better parameterize the data manifold. Inspired by the chart maps in DAE, we propose to couple the VAE model with a deterministic network, outlined in Figure 3. Modules with a subscript c are deterministic and share their structure with the corresponding modules in the stochastic network. Sampling is disabled in the deterministic network; e.g., in the case of a Gaussian posterior, we use the predicted mean vector for later computation. Please find details for other posterior families in Appendix B. Similar to DAE, the coupled deterministic network is optimized solely by the coupled reconstruction loss L^c_rec, which is the same autoregressive cross-entropy loss as L_rec.
To learn a well-optimized ϕ_e, we share the encoder between the stochastic and the deterministic networks, which leverages the rich gradients backpropagated from L^c_rec. To learn a better ϕ_h, we propose to guide ϕ_h with a well-learned chart map, i.e., the one characterized by Decoder_c. Thus, we introduce a signal matching loss L_match that pushes h toward h_c. The objective of our approach is

L = L_rec + L_reg + λ_r L^c_rec + λ_m L_match    (4)

where λ_r and λ_m are hyperparameters, L^c_rec is the coupled reconstruction loss, and the signal matching loss L_match is essentially a distance function between h and h_c. We evaluate both the Euclidean distance and the Rational Quadratic kernel, i.e.,

L_match = ‖h − Detach(h_c)‖₂²   or   L_match = −Σ_s sC / (sC + ‖h − Detach(h_c)‖₂²)

where s ∈ {0.1, 0.2, 0.5, 1, 2, 5, 10}, C is a hyperparameter, and Detach prevents gradients from being propagated into h_c, since we would like h_c to guide h but not the opposite. One might question the necessity of sharing the structure of the posterior network by appealing to universal approximation (Hornik et al., 1989); specifically, why not use an MLP as Posterior_c? We argue that each structure favors a particular distribution of H in R^d, so structure sharing facilitates optimization when learning by gradient descent. For example, the latent space learned by planar flows (Rezende and Mohamed, 2015) has compression and expansion, and vMF-VAE (Xu and Durrett, 2018), whose posterior is supported on a sphere, may significantly influence the distribution of H in its ambient space R^d.
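The two matching losses can be sketched as follows. The exact multi-scale form of the Rational Quadratic kernel is abbreviated in the text, so the variant below (the kernel negated so that smaller is better, with the listed scale set and a free constant C) is an assumption modeled on WAE-style multi-scale kernels; h_c is treated as a detached constant.

```python
def euclidean_match(h, h_c):
    """L_match as the squared Euclidean distance between the stochastic signal h
    and the detached deterministic signal h_c."""
    return sum((a - b) ** 2 for a, b in zip(h, h_c))

def rq_kernel_match(h, h_c, C=1.0, scales=(0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """L_match as a negated multi-scale Rational Quadratic kernel; it attains
    its minimum, -len(scales), exactly when h == h_c."""
    d2 = sum((a - b) ** 2 for a, b in zip(h, h_c))
    return -sum(s * C / (s * C + d2) for s in scales)
```

In an autodiff framework, h_c would be wrapped in a stop-gradient (Detach) so that gradients flow only into h, matching the guiding direction described above.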

Baselines
We evaluate the proposed Coupled-VAE approach by applying it to various VAE models, including VAE, β-VAE (Higgins et al., 2017), WAE, VAE-NF, WAE-NF, vMF-VAE (Xu and Durrett, 2018), and Lagging-VAE. We also show the results of GRU-LM (Cho et al., 2014) and SA-VAE. We do not apply our method to SA-VAE since it does not follow amortized variational inference. Please find more details in Appendix C and the previous footnotes.

Language Modeling Results
We report negative log-likelihood (NLL), KL divergence, and perplexity as metrics for language modeling. NLL is estimated with importance sampling, KL is approximated by its Monte Carlo estimate, and perplexity is computed from the estimated NLL. Please find the metric details in Appendix D. Table 1 displays the language modeling results. For all models, our proposed approach achieves lower negative log-likelihood and lower perplexity, which shows its effectiveness in improving the probability estimation capability of various VAE models. A larger KL divergence is also observed, showing that our approach helps alleviate the posterior collapse problem.

Mutual Information and Reconstruction
Language modeling results only evaluate the probability estimation ability of VAE; we are also interested in how rich the latent space is. We report the mutual information (MI) between the text x and the latent code z under Q(z|x), approximated with Monte Carlo estimation. Better reconstruction from the encoded text is another way to show the richness of the latent space. For each text x, we sample ten latent codes from Q(z|x) and decode them with greedy search. We report the BLEU-1 and BLEU-2 scores between the reconstruction and the input. Please find the metric details in Appendix E. In Table 2, we observe that our approach improves MI on all datasets, showing that it helps learn a richer latent space. BLEU-1 and BLEU-2 are consistently improved on Yelp and Yahoo, but not on PTB. Given that text samples in PTB are significantly shorter than those in Yelp and Yahoo, we conjecture that on PTB it is easier for the decoder to reconstruct by exploiting its autoregressive expressiveness, even without a rich latent space.

Hyperparameter Analysis: Distance Function, λ_r, and λ_m

We investigate the effect of the key hyperparameters. Results are shown in Table 3. Note that the lowest NLL does not guarantee the best values of the other metrics, which shows the necessity of using multiple metrics for a more comprehensive evaluation. For the distance function, we observe that the Euclidean distance (denoted as Eucl in Table 3) is more sensitive to λ_m than the Rational Quadratic kernel (denoted as RQ in Table 3). The first and third blocks of Table 3 show that, with larger λ_m, the model achieves higher KL divergence, MI, and reconstruction metrics. Our interpretation is that by pushing the stochastic decoding signals closer to the deterministic ones, we obtain latent codes with richer text information. We leave the analysis of λ_m = 0.0 to Section 5.6.
The second block of Table 3 shows the role of λ_r, which we interpret as follows. When λ_r is too small (e.g., 0.5), the learned parameterizations are still inadequate for a smooth transition map; when λ_r is too large (e.g., 5.0), it distracts the optimization too far from the original objective (i.e., L_rec + L_reg). Note that λ_r = 0.0 is equivalent to removing the coupled reconstruction loss L^c_rec in Eq. (4).

The Heterogeneous Effect of Signal Matching on Probability Estimation
In Section 5.5, we observed a richer latent space (i.e., larger MI and BLEU scores) with larger λ_m. However, a richer latent space does not guarantee a better probability estimation result. Thus, in this part, we delve deeper into whether the decoder signal matching mechanism helps improve probability estimation. We study three models with different posterior families (i.e., Coupled-VAE, Coupled-VAE-NF, and Coupled-vMF-VAE). Results are shown in Table 4, where we do not report the KL, MI, and BLEU scores, because Table 3 has already shown that they improve with larger λ_m. We observe that the effect of signal matching on probability estimation varies across posterior families.

Is the Incompatibility Mitigated?
We study the three gradient norms defined in Section 3 on the test sets, displayed in Table 5 (for Coupled-VAE, λ_m = 0.1). Notably, ‖∂L^c_rec/∂e‖₂ in Coupled-VAE is even larger than ‖∂L_rec/∂e‖₂ in DAE. This has two implications. First, the encoder indeed encodes rich information about the text. Second, compared with DAE, Coupled-VAE generalizes better to the test sets, which we conjecture is due to the regularization of the posterior. Coupled-VAE also has a larger ‖∂L_reg/∂e‖₂ than VAE, which, based on the argument in Section 3.1, indicates that in Coupled-VAE the posterior of each instance is not similar to the prior. We also observe a larger ‖∂h/∂e‖_F / ‖h‖₂ in Coupled-VAE, which indicates a better transition map between the two parameterizations in Coupled-VAE than in VAE.
We also track the gradient norms of Coupled-VAE (λ_m = 10.0 for a clearer comparison), plotted along with those of VAE and DAE in Figure 2. The curve for Coupled-VAE in Figure 2(a) stands for ‖∂(L_rec + L^c_rec)/∂e‖₂. We observe that Coupled-VAE receives constantly increasing backpropagated gradients from reconstruction. In contrast to VAE, ‖∂L_reg/∂e‖₂ in Coupled-VAE does not decrease significantly as the KL weight increases. The decrease of ‖∂h/∂e‖_F / ‖h‖₂, which VAE suffers from, is not observed in Coupled-VAE. Plots for more datasets are in Appendix F.

Sample Diversity
We evaluate the diversity of samples from the prior distribution. We sample 3,200 texts from the prior and report the Distinct-1 and Distinct-2 metrics (Li et al., 2016), i.e., the ratios of distinct unigrams and bigrams over all generated unigrams and bigrams. Distinct-1 and Distinct-2 in Table 6 show that texts sampled from Coupled-VAE (λ_m = 10.0) are more diverse than those from VAE. Given limited space, we put several samples in Appendix G for qualitative analysis.
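The Distinct-n metrics can be computed as below; the helper is a straightforward sketch of the Li et al. (2016) definition, with whitespace tokenization assumed.

```python
def distinct_n(texts, n):
    """Distinct-n: number of unique n-grams divided by the total number of
    generated n-grams, pooled over all sampled texts (Li et al., 2016)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Pooling n-grams over the whole sample set (rather than averaging per-text ratios) is what makes the metric sensitive to mode collapse: repeated samples drive it toward zero.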

Interpolation
A desirable property of VAE is to match interpolation in the latent space with smooth transitions in the data space (Bowman et al., 2016). In Table 7, we show the interpolation of VAE and Coupled-VAE on PTB. Compared with VAE, Coupled-VAE has smoother transitions of subjects (both sides → it) and verbs (are expected → have been → has been → has), indicating that the linguistic information is more smoothly encoded in the latent space of Coupled-VAE.

Figure 4: A graphical overview of the generalization to Coupled-CVAE. u is the condition, encoded as e_u.

Generalization to Conditional Language Modeling: Coupled-CVAE
To generalize our approach to conditional language modeling, we propose Coupled-CVAE. A graphical overview is displayed in Figure 4. Specifically, the (coupled) posterior network and the (coupled) decoder are additionally conditioned on the post. The objective of Coupled-CVAE is identical to Eq. (4). We compare Coupled-CVAE with the GRU encoder-decoder (Cho et al., 2014) and CVAE (Zhao et al., 2017) for dialogue generation. We use the Switchboard dataset (John and Holliman, 1993), whose training/validation/test splits contain 203K/5K/5K pairs, with a vocabulary size of 13K. For probability estimation, we report NLL, KL, and PPL based on the gold responses. Since the key motivation for using CVAE in Zhao et al. (2017) is the diversity of responses, we sample one response for each post and report the Distinct-1 and Distinct-2 metrics over all samples. Please find more details in Appendix I. Table 8 shows that Coupled-CVAE greatly increases the diversity of dialogue modeling, while only slightly harming the probability estimation capability. This indicates that Coupled-CVAE better captures the one-to-many nature of conversations than CVAE and the GRU encoder-decoder. We also observe that diversity improves with increasing λ_m, which shows that λ_m can control diversity by specifying the richness of the latent space.
Relation to Related Work

Bowman et al. (2016) identify the posterior collapse problem of text VAE and propose KL annealing and word drop to handle it. Zhao et al. (2017) propose the bag-of-words loss to mitigate this issue. Later work on this problem focuses on less powerful decoders (Yang et al., 2017; Semeniuta et al., 2017), modified regularization objectives (Higgins et al., 2017; Bahuleyan et al., 2019; Wang and Wang, 2019), alternative posterior families (Rezende and Mohamed, 2015; Xu and Durrett, 2018; Davidson et al., 2018; Xiao et al., 2018), richer prior distributions (Tomczak and Welling, 2018), improved optimization or KL annealing strategies (Fu et al., 2019), the use of skip connections (Dieng et al., 2019), hierarchical or autoregressive posterior distributions (Park et al., 2018; Du et al., 2018), and narrowing the amortization gap (Hjelm et al., 2016; Marino et al., 2018). We provide the encoder-decoder incompatibility as a new perspective on the posterior collapse problem. Empirically, our approach can be combined with the above ones to alleviate the problem further.

A model to be noted is β-VAE (Higgins et al., 2017), in which reconstruction and regularization are modeled as a hyperparameterized trade-off, i.e., the improvement of one term compromises the other. Different from β-VAE, we adopt the idea of multi-task learning: the coupled reconstruction task helps improve the encoder chart map, and the signal matching task helps improve the decoder chart map. Both our analysis in Section 3.2 and the empirical results show that the modeling of the posterior distribution can be improved (and not necessarily compromised) by the additional tasks. Ghosh et al. (2020) propose to substitute stochasticity with explicit and implicit regularizations, which is easier to train and empirically improves the quality of generated outputs. Different from their work, we still strictly follow the generative nature (i.e., data density estimation) of VAE, and the deterministic network in our approach serves as an auxiliary that aids optimization.
Encoder pretraining initializes the text encoder and the posterior network with an autoencoding objective. However, prior work shows that encoder pretraining by itself does not improve the performance of VAE, which indicates that initialization alone is not a strong enough inductive bias to learn a meaningful latent space.
Given the discrete nature of text data, we highlight the two-level representation learning for text modeling: 1) the encoder and decoder parameterizations via autoencoding and 2) a transition map between the parameterizations. Notably, the transition map has large freedom. In our case, the transition map decides the amount and type of information encoded in the variational posterior, and there are other possible instances of the transition map, e.g., flow-based models (Dinh et al., 2015).

Conclusions
In this paper, we observe the encoder-decoder incompatibility of VAE for text modeling. We connect the incompatibility to the posterior collapse problem by viewing the encoder and the decoder as two inadequately learned chart maps from the data manifold to the parameterizations, and the posterior network as a part of the transition map between them. We couple the VAE model with a deterministic network and improve the parameterizations via encoder weight sharing and decoder signal matching. Our approach is model-agnostic and can be applied to a wide range of models in the VAE family. Experiments on benchmark datasets, i.e., PTB, Yelp, and Yahoo, show that our approach improves various VAE models in terms of probability estimation and the richness of the latent space. We also generalize Coupled-VAE to conditional language modeling and propose Coupled-CVAE. Results on Switchboard show that Coupled-CVAE largely improves diversity in dialogue generation.

B.2 Gaussian with Normalizing Flows
We first review the background and notation of normalizing flows. An initial latent code is first sampled from an initial distribution, i.e., z_0 ∼ Q_0(z_0|x). The normalizing flow is defined as a series of invertible transformations f_1, ..., f_K, i.e.,

z_k = f_k(z_{k−1}),  k = 1, ..., K

The evidence lower bound (ELBO) for normalizing flows is derived as

ELBO = E_{z_0∼Q_0(z_0|x)}[ log P(x|z_K; θ) + log P_K(z_K) − log Q_0(z_0|x) + Σ_{k=1}^{K} log |det(∂f_k/∂z_{k−1})| ]

where P_K(z_K) is the prior distribution of the transformed latent variable, and the invertibility of the transformations guarantees non-zero determinants.
Obviously, the optimization of the ELBO for normalizing flows requires sampling from the initial distribution; thus, we compute the coupled latent code z_c by transforming the predicted mean vector of the coupled initial distribution, i.e.,

z_c = (f^c_K ∘ ⋯ ∘ f^c_1)(μ^c_0)

where μ^c_0 is the predicted mean of the coupled initial distribution Q^c_0(z_0|x) and f^c_1, ..., f^c_K are the coupled transformations. Note that all modules in the deterministic network share their structure with those in the stochastic network. We do not use the posterior mean as the coupled latent code for two reasons. First, our interest is to acquire a deterministic representation that guides the stochastic network, not necessarily the mean vector. Second, the computation of the posterior mean after the transformations is intractable.
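As one concrete instance of the transformations f_k, the planar flow of Rezende and Mohamed (2015) admits a cheap log-determinant. The sketch below (in plain Python, with hypothetical parameter values in the test) computes the two quantities the flow ELBO needs per step: the transformed code and log |det(∂f_k/∂z_{k−1})|.

```python
import math

def planar_flow(z, u, w, b):
    """One planar flow step: f(z) = z + u * tanh(w . z + b)."""
    a = math.tanh(sum(wi * zi for wi, zi in zip(w, z)) + b)
    return [zi + ui * a for zi, ui in zip(z, u)]

def planar_log_abs_det(z, u, w, b):
    """log |det(df/dz)| = log |1 + u . psi(z)|, where
    psi(z) = (1 - tanh^2(w . z + b)) * w."""
    t = math.tanh(sum(wi * zi for wi, zi in zip(w, z)) + b)
    return math.log(abs(1.0 + (1.0 - t * t) * sum(ui * wi for ui, wi in zip(u, w))))
```

The rank-one structure of the Jacobian is what keeps the determinant O(d) instead of O(d^3), which is why such flows are practical inside the ELBO above.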

B.3 Von Mises-Fisher
The von Mises-Fisher distribution is supported on the (d−1)-dimensional unit sphere in R^d and parameterized by a direction parameter μ ∈ R^d (‖μ‖ = 1) and a concentration parameter κ, both of which are mapped from the encoded text by the posterior network. The probability density function is

f(z; μ, κ) = C_d(κ) exp(κ μ^T z),  C_d(κ) = κ^{d/2−1} / ( (2π)^{d/2} I_{d/2−1}(κ) )

where I_v is the modified Bessel function of the first kind at order v. We use the direction parameter μ as the coupled latent code z_c. Note that we do not use the posterior mean as the coupled latent code for two reasons. First, similar to normalizing flows, our interest is a deterministic representation rather than the mean vector. Second, the posterior mean of a von Mises-Fisher distribution never lies on the support of the distribution, which makes it suboptimal for guiding the stochastic network.
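The density above can be evaluated with only the standard library; the sketch below computes I_v from its power series and evaluates the vMF log-density. The κ value in the test is arbitrary, not taken from the paper's setup.

```python
import math

def bessel_i(v, x, terms=40):
    """Modified Bessel function of the first kind, I_v(x), via its power series
    I_v(x) = sum_m (x/2)^(2m+v) / (m! * Gamma(m+v+1))."""
    return sum((x / 2.0) ** (2 * m + v) / (math.factorial(m) * math.gamma(m + v + 1))
               for m in range(terms))

def vmf_log_density(z, mu, kappa):
    """Log density of vMF(mu, kappa) on the unit (d-1)-sphere:
    log C_d(kappa) + kappa * mu . z."""
    d = len(mu)
    log_c = ((d / 2.0 - 1.0) * math.log(kappa)
             - (d / 2.0) * math.log(2.0 * math.pi)
             - math.log(bessel_i(d / 2.0 - 1.0, kappa)))
    return log_c + kappa * sum(mi * zi for mi, zi in zip(mu, z))
```

For d = 2 the sphere is the unit circle, which makes it easy to check the normalization numerically.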

C Details of the Experimental Setup
The dimension of the latent vectors is 32. The dimension of the word embeddings is 200. The encoder and the decoder are one-layer GRUs with a hidden state size of 128 for PTB and 256 for Yelp and Yahoo. For optimization, we use Adam (Kingma and Ba, 2015) with a learning rate of 10^−3, β_1 = 0.9, and β_2 = 0.999. The decoding signal is used as the first word embedding and is also concatenated to the word embedding at each decoding step. After 30K steps, the learning rate is halved every 2K steps. Dropout (Srivastava et al., 2014) is applied. For WAE and WAE-NF, we use the maximum mean discrepancy (MMD) (Gretton et al., 2012) as the regularization term; an additional KL regularization term with weight β = 0.8 (also with KL annealing) is added to WAE and WAE-NF, since MMD does not guarantee the convergence of the KL divergence.

D Estimation of Language Modeling Metrics
For language modeling, we report negative loglikelihood (NLL), KL divergence, and perplexity.
To make the results more reliable, we state the estimator for each metric explicitly. For each test sample x, NLL is estimated by importance sampling, and KL is approximated by its Monte Carlo estimate:

NLL(x) ≈ −log( (1/N) Σ_{i=1}^{N} P(x|z^{(i)}; θ) P(z^{(i)}) / Q(z^{(i)}|x; φ) )

KL(x) ≈ (1/N) Σ_{i=1}^{N} [ log Q(z^{(i)}|x; φ) − log P(z^{(i)}) ]

where z^{(i)} ∼ Q(z|x) are sampled latent codes and all notations follow Eq. (1) in the main text. We report the averaged NLL and KL over all test samples. Perplexity is computed based on the estimated NLL. For validation, the number of samples is N = 10; for evaluation, N = 100.
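For a one-dimensional latent with a Gaussian posterior and a standard normal prior, the two estimators can be sketched as follows, working in log space for numerical stability. The likelihood argument is a stand-in for the decoder.

```python
import math
import random

def log_normal(z, mu, sigma):
    """Log density of N(mu, sigma^2) at z."""
    return (-0.5 * math.log(2.0 * math.pi * sigma * sigma)
            - (z - mu) ** 2 / (2.0 * sigma * sigma))

def estimate_nll_and_kl(log_px_given_z, mu_q, sigma_q, n=100, seed=0):
    """Importance-sampled NLL and Monte Carlo KL for a 1-d latent with
    posterior N(mu_q, sigma_q^2) and prior N(0, 1)."""
    rng = random.Random(seed)
    log_ws, kl_terms = [], []
    for _ in range(n):
        z = rng.gauss(mu_q, sigma_q)
        log_q = log_normal(z, mu_q, sigma_q)
        log_p = log_normal(z, 0.0, 1.0)
        log_ws.append(log_px_given_z(z) + log_p - log_q)  # log importance weight
        kl_terms.append(log_q - log_p)
    m = max(log_ws)  # log-sum-exp trick
    nll = -(m + math.log(sum(math.exp(w - m) for w in log_ws) / n))
    return nll, sum(kl_terms) / n
```

Averaging weights inside the log (rather than averaging log-weights) is what makes this a bound-tightening importance-sampling estimate rather than a plain ELBO.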

E Estimation of Mutual Information and Reconstruction Metrics
We report the mutual information (MI) between the text x and the latent code z under Q(z|x) to investigate how much useful information is encoded. The MI component of each test sample x is approximated by Monte Carlo estimation:

MI_x ≈ (1/N) Σ_{i=1}^{N} [ log q(z^{(i)}|x) − log q(z^{(i)}) ],  z^{(i)} ∼ Q(z|x)

where the aggregated posterior density q(z^{(i)}) is approximated by its Monte Carlo estimate:

q(z^{(i)}) ≈ (1/M) Σ_{j=1}^{M} q(z^{(i)}|x^{(j)})

where the x^{(j)} are sampled from the test set. For convenience, most previous work uses the texts within each batch as the sampled x^{(j)}'s (which are supposed to be sampled from the entire test set). However, this convention results in a biased estimate, since q(z^{(i)}|x^{(i)}) is included when j = i, i.e., the text itself is always sampled when computing its own MI component. We remedy this by skipping the term with j = i. The overall MI = E_x[MI_x] is then estimated by averaging MI_x over all test samples. We set the numbers of samples to N = 100 and M = 512.
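Assuming one-dimensional Gaussian posteriors purely for illustration, the MI estimator with the j = i skip can be sketched as:

```python
import math
import random

def log_normal(z, mu, sigma):
    return (-0.5 * math.log(2.0 * math.pi * sigma * sigma)
            - (z - mu) ** 2 / (2.0 * sigma * sigma))

def estimate_mi(posteriors, n_z=50, seed=0):
    """Monte Carlo MI estimate. posteriors: list of (mu, sigma), one per
    sampled text x^(j). The aggregated posterior for text i averages over
    j != i, skipping the self term to avoid the bias discussed above."""
    rng = random.Random(seed)
    mi_components = []
    for i, (mu_i, sig_i) in enumerate(posteriors):
        total = 0.0
        for _ in range(n_z):
            z = rng.gauss(mu_i, sig_i)
            log_qzx = log_normal(z, mu_i, sig_i)
            others = [log_normal(z, mu_j, sig_j)
                      for j, (mu_j, sig_j) in enumerate(posteriors) if j != i]
            m = max(others)  # log-sum-exp over the aggregated posterior
            log_agg = m + math.log(sum(math.exp(o - m) for o in others) / len(others))
            total += log_qzx - log_agg
        mi_components.append(total / n_z)
    return sum(mi_components) / len(mi_components)
```

When all posteriors coincide (full collapse), the per-text posterior and the aggregated posterior are identical and the estimate is exactly zero, matching the intuition that a collapsed latent carries no information.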
For reconstruction, we sample ten latent codes from the posterior of each text input and decode them with greedy search. We compute BLEU-1 and BLEU-2 between the reconstruction and the input with the Moses script.

F Training Dynamics of Gradient Norms
We show the tracked gradient norms on all datasets in Figure 5. The observations are consistent with those discussed in Section 5.7 in the main text.

G Diversity and Samples from the Prior Distribution
Given the limited space in the main text, we place the comprehensive evaluation of samples from the prior distribution in this part. Table 9 shows the diversity metrics and the first three (thus not cherry-picked) samples from each model. Qualitatively, samples from Coupled-VAE are more diverse than those from VAE, and the long texts generated by VAE have more redundancy than those of Coupled-VAE. Given that both models have the same latent dimension, this indicates that Coupled-VAE uses the latent codes more efficiently.

H Interpolation
A property of VAE is to match the interpolation in the latent space with the smooth transition in the text space (Bowman et al., 2016). In Table 7, we show the interpolation of VAE and Coupled-VAE on PTB. It shows that compared with VAE, Coupled-VAE has smoother transitions of subjects (both sides → it) and verbs (are expected → have been → has been → has), indicating that the information about subjects and verbs is more smoothly encoded in the latent space of Coupled-VAE.

I Generalization to Conditional Generation: Coupled-CVAE
To generalize our approach to conditional generation, we focus on whether it can improve the CVAE model (Zhao et al., 2017) for dialogue generation.
To this end, we propose the Coupled-CVAE model.

I.1 CVAE
CVAE adopts a two-step view of diverse dialogue generation. Let x be the response and y the post (or context). CVAE first samples the latent code z from the prior distribution P(z|y) and then samples the response from the decoder P(x|z, y; θ). Given the post y, the marginal distribution of the response x is

P(x|y; θ) = E_{z∼P(z|y)}[P(x|z, y; θ)]

Similar to VAE, the exact marginalization is intractable, and we derive the evidence lower bound

log P(x|y; θ) ≥ E_{z∼Q(z|x,y;φ)}[log P(x|z, y; θ)] − KL(Q(z|x, y; φ) ‖ P(z|y))    (15)

During training, the response and the post are encoded as e_x and e_y, respectively. The two vectors are concatenated and transformed into the posterior via the posterior network. A latent code is then sampled and mapped to a higher-dimensional h.
The decoding signal in CVAE is computed from h and e_y and used to infer the response. Similar to VAE, the objective of CVAE can also be viewed as a reconstruction loss plus a regularization term, as in Eq. (15).

I.2 Coupled-CVAE
As observed in Zhao et al. (2017), the CVAE model also suffers from the posterior collapse problem. We generalize our approach to the conditional setting and arrive at Coupled-CVAE. A graphical overview is displayed in Figure 4; the difference from Coupled-VAE is shown in red. Specifically, the (coupled) posterior network and the (coupled) decoder are additionally conditioned on the post representation. The objective of Coupled-CVAE is identical to Eq. (4) in the main text. The coupled reconstruction loss L^c_rec in Coupled-CVAE serves two functions. First, it improves the encoded response e_x, as in Coupled-VAE. Second, it encourages h_c to encode response information rather than post information, which collaborates with L_match to improve the parameterization h.

I.3 Dataset
We use the Switchboard dataset (John and Holliman, 1993). We split the dialogues into single-turn post-response pairs, and the number of pairs in the training/validation/test split is 203K/5K/5K. The vocabulary size is 13K.

I.4 Evaluation
For probability estimation, we report the NLL, KL, and PPL based on the gold responses. NLL, KL, and PPL are as computed in Appendix D except for the additional condition on the post. Since the key motivation of using CVAE in Zhao et al. (2017) is the response diversity, we sample one response for each post and report the Distinct-1 and Distinct-2 metrics over all test samples.

I.5 Experimental Setup
We compare our Coupled-CVAE model with two baselines: GRU encoder-decoder (Cho et al., 2014) and CVAE (Zhao et al., 2017). The detailed setup follows that of the PTB dataset in Appendix C. For each 1K steps, we estimate the NLL for validation.

I.6 Results
Experimental results of Coupled-CVAE are shown in the main text.