Neural Gaussian Copula for Variational Autoencoder

Variational language models seek to estimate the posterior of latent variables with an approximate variational posterior. The model often assumes the variational posterior to be factorized even when the true posterior is not. The variational posterior learned under this assumption does not capture the dependency relationships among latent variables. We argue that this causes a typical training problem, called posterior collapse, that is observed in all other variational language models. We propose the Gaussian Copula Variational Autoencoder (VAE) to avert this problem. Copulas are widely used to model correlation and dependencies of high-dimensional random variables, and are therefore helpful for maintaining the dependency relationships that are lost in VAE. The empirical results show that by modeling the correlation of latent variables explicitly using a neural parametric copula, we can avert this training difficulty while achieving competitive results among all other VAE approaches.


Introduction
Variational Inference (VI) (Wainwright et al., 2008; Hoffman et al., 2013) methods are inspired by the calculus of variations (Gelfand et al., 2000). It can be dated back to the 18th century when it was mainly used to study the problem of the change of functional, which is defined as the mapping from functions to real space and can be understood as the function of functions. VI takes a distribution as a functional and then studies the problem of matching this distribution to a target distribution using the calculus of variations. After the rise of deep learning (Krizhevsky et al., 2012), a deep generative model called Variational Autoencoder (Kingma and Welling, 2014; Hoffman et al., 2013) is proposed based on the theory of VI and achieves great success over a huge number of tasks, such as transfer learning (Shen et al., 2017), unsupervised learning (Jang et al., 2017), image generation (Gregor et al., 2015), semi-supervised classification (Jang et al., 2017), and dialogue generation (Zhao et al., 2017). VAE is able to learn a continuous space of latent random variables which are useful for a lot of classification and generation tasks.
Figure 1: Intuitive illustration of VI: The elliptic P is a distribution family containing the true posterior p ∈ P, and the circle Q is a Mean-field variational family containing a standard normal prior N. The optimal solution q* is the one in Q that has the smallest KL(q||p). In reality these two families may not overlap.
Recent studies (Bowman et al., 2015; Yang et al., 2017; Xiao et al., 2018; Xu and Durrett, 2018) show that when it comes to text generation and language modeling, VAE does not perform well and often generates random texts without making good use of the learned latent codes. This phenomenon is called Posterior Collapse, where the Kullback-Leibler (KL) divergence between the posterior and the prior (often assumed to be a standard Gaussian) vanishes. It makes the latent codes completely useless because any text input will be mapped to a standard Gaussian variable. Many recent studies (Yang et al., 2017; Xu and Durrett, 2018; Xiao et al., 2018; Miao et al., 2016; He et al., 2018) try to address this issue by providing new model architectures or by changing the objective functions. Our research lies in this second direction. We review the theory of VAE, and we argue that one of the most widely used assumptions in VAE, the Mean-field assumption, is problematic. It assumes that all approximate solutions in a family of variational distributions should be factorized, or dimension-wise independent, for tractability. We argue that it leads to the posterior collapse problem, since any variational posterior learned in this way does not maintain the correlation among latent codes and will never match the true posterior, which is unlikely to be factorized.
We avert this problem by proposing a Neural Gaussian Copula (Copula-VAE) model to train VAE on text data. Copula (Nelsen, 2007) can model dependencies of high-dimensional random variables and is very successful in risk management (Kole et al., 2007; McNeil et al., 2005), financial management (Wang and Hua, 2014), and other tasks that require the modeling of dependencies. We provide a reparameterization trick (Kingma and Welling, 2014) to incorporate it with VAE for language modeling. We argue that by maintaining the dependency relationships over latent codes, we can dramatically improve the performance of variational language modeling and avoid posterior collapse. Our major contributions can be summarized as the following:
• We propose Neural parameterized Gaussian Copula to get a better estimation of the posterior for latent codes.
• We provide a reparameterization technique for Gaussian Copula VAE. The experiments show that our method achieves competitive results among all other variational language modeling approaches.
• We perform a thorough analysis of the original VAE and copula VAE. The results and analysis reveal the salient drawbacks of VAE and explain how introducing a copula model could help avert the posterior collapse problem.

Related Work
Copula: Before the rise of Deep Learning
Copula (Nelsen, 2007) is a multivariate distribution whose marginals are all uniformly distributed. Over the years, it has been widely used to extract correlation within high-dimensional random variables, and has achieved great success in many subjects such as risk management (Kole et al., 2007; McNeil et al., 2005), finance (Wang and Hua, 2014), civil engineering (Chen et al., 2012; Zhang and Singh, 2006), and visual description generation (Wang and Wen, 2015). In the past, copula is often estimated by the Maximum Likelihood method (Choroś et al., 2010; Jaworski et al., 2010) via parametric or semi-parametric approaches (Tsukahara, 2005; Choroś et al., 2010). One major difficulty when estimating the copula and extracting dependencies is the dimensionality of random variables. To overcome the curse of dimensionality, a graphical model called vine copula (Joe and Kurowicka, 2011; Czado, 2010; Bedford et al., 2002) is proposed to estimate a high-dimensional copula density by breaking it into a set of bivariate conditional copula densities. However, this approach is often hand-designed, which requires human experts to define the form of each bivariate copula, and hence often results in overfitting. Therefore, Gaussian copula (Xue-Kun Song, 2000; Frey et al., 2001) is often used since its multivariate expression has a simple form and hence does not suffer from the curse of dimensionality.

VAE for Text
Bowman et al. (2015) proposed to use VAE for text generation by using LSTM as encoder-decoder. The encoder maps the hidden states to a set of latent variables, which are further used to generate sentences. While achieving relatively low sample perplexity and being able to generate easy-to-read texts, the LSTM VAE often results in posterior collapse, where the learned latent codes become useless for text generation.
Recently, many studies have focused on how to avert this training problem. They either propose a new model architecture or modify the VAE objective. Yang et al. (2017) seek to replace the LSTM (Hochreiter and Schmidhuber, 1997) decoder with a CNN decoder to control model expressiveness, as they suspect that the over-expressive LSTM is one reason that makes KL vanish. Xiao et al. (2018) introduce a topic variable and pre-train a Latent Dirichlet Allocation (Blei et al., 2003) model to get a prior distribution over the topic information. Xu and Durrett (2018) believe the soap-bubble effect of the high-dimensional Gaussian distribution is the main reason that causes KL vanishing, and therefore learn a hyper-spherical posterior over the latent codes.

Variational Inference
The problem of inference in probabilistic modeling is to estimate the posterior density p(z|x) of latent variable z given input samples $\{x_i\}_{i=1}^{D}$. The direct computation of the posterior is intractable in most cases since the normalizing constant $\int p(z, x)\,dz$ lacks an analytic form. To get an approximation of the posterior, many approaches use sampling methods such as Markov chain Monte Carlo (MCMC) (Gilks et al., 1995) and Gibbs sampling (George and McCulloch, 1993). The downside of sampling methods is that they are inefficient, and it is hard to tell how close the approximation is to the true posterior. The other popular inference approach, variational inference (VI) (Wainwright et al., 2008; Hoffman et al., 2013), does not have this shortcoming as it provides a divergence that measures the fit of an approximate solution.
In VI, we assume a variational family of distributions Q to approximate the true posterior. The Kullback-Leibler (KL) divergence is used to measure how close q ∈ Q is to the true p(z|x). The optimal variational posterior q* ∈ Q is then the one that minimizes the KL divergence

$$q^{*}(z) = \arg\min_{q \in Q} \mathrm{KL}\left(q(z) \,\|\, p(z|x)\right).$$

Based on this, variational autoencoder (VAE) (Kingma and Welling, 2014) is proposed as a latent generative model that seeks to learn a posterior of the latent codes by minimizing the KL divergence between the true joint density p θ (x, z) and the variational joint density q φ (z, x). This is equivalent to maximizing the following evidence lower bound (ELBO),

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x) \,\|\, p(z)\right).$$

In this case, the Mean-field (Kingma and Welling, 2014) assumption is often used for simplicity. That is, we assume that the members of variational family Q are dimension-wise independent, meaning that the posterior q can be written as $q(z|x) = \prod_{i=1}^{D} q(z_i|x)$. The simplicity of this form makes the estimation of ELBO very easy. However, it also leads to a particular training difficulty called posterior collapse, where the KL divergence term becomes zero and the factorized variational posterior collapses to the prior. The latent codes z would then become useless since the generative model p(x|z) no longer depends on them.
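To make the KL term concrete, the following is a minimal sketch of the closed-form KL divergence between a factorized Gaussian posterior and a standard normal prior. The function name and the example values are illustrative assumptions, not taken from the paper. The second call shows the collapsed case, where the posterior equals the prior and the KL term is zero.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# An informative posterior: means away from 0 and variances below 1.
informative = kl_diag_gaussian_to_standard_normal([1.0, -0.5], [-1.0, -2.0])

# A collapsed posterior: identical to the prior, so the KL term vanishes.
collapsed = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

print(f"informative KL = {informative:.4f}")
print(f"collapsed   KL = {collapsed:.4f}")
```

Because every dimension contributes independently to this sum, the factorized form carries no information about correlations among the latent codes.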
We believe the problem comes from the nature of variational family itself and hence we propose our Copula-VAE which makes use of the dependency modeling ability of copula model to guide the variational posterior to match the true posterior. We will provide more details in the following sections.
We hypothesize that the Mean-field assumption is problematic itself as the q under this assumption can never recover the true structure of p. On the other hand, Copula-VAE makes use of the dependency relationships maintained by a copula model to guide the variational posterior to match the true posterior. Our approach differs from Gaussian Copula-VAE (Suh and Choi, 2016) in that we use copula to estimate the joint density p(z) rather than the empirical data density p(x).

Gaussian Copula
In this section, we review the basic concepts of Gaussian copula. Copula is defined as a probability distribution over a high-dimensional unit cube [0, 1]^d whose univariate marginal distributions are uniform on [0, 1]. Formally, given a set of uniformly distributed random variables U_1, U_2, ..., U_d, a copula is a joint distribution defined as

$$C(u_1, \dots, u_d) = P(U_1 \le u_1, \dots, U_d \le u_d).$$

What makes the copula model above so useful is the famous Sklar's Theorem. It states that for any joint cumulative distribution function (CDF) of a set of random variables $\{X_i\}_{i=1}^{d}$ with marginal CDFs $F_i(x_i) = P(X_i \le x_i)$, there exists a copula function C (unique when the marginals are continuous) such that

$$F(x_1, \dots, x_d) = C\left(F_1(x_1), \dots, F_d(x_d)\right).$$

By probability integral transform, each marginal CDF is a uniform random variable on [0, 1]. Hence, the above copula is a valid one. Since for each joint CDF, there is one unique copula function associated with it given a set of marginals, we can easily construct any joint distribution whose marginal univariate distributions are the given F_i(x_i). And, for a given joint distribution, we can also find the corresponding copula, which is the CDF function of the given marginals.
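The probability integral transform used above can be checked numerically. The sketch below uses only the Python standard library; the Gaussian marginal with mean 2 and standard deviation 3 is an arbitrary assumption chosen for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(0)

# Hypothetical marginal: a Gaussian with mean 2 and standard deviation 3.
marginal = NormalDist(mu=2.0, sigma=3.0)

# Draw samples x ~ F and push them through their own CDF: u = F(x).
xs = [random.gauss(2.0, 3.0) for _ in range(20000)]
us = [marginal.cdf(x) for x in xs]

# If u is uniform on [0, 1], its mean is close to 1/2 and roughly a quarter
# of the samples fall into each quarter of the interval.
quarter_fractions = [
    sum(1 for u in us if lo <= u < lo + 0.25) / len(us)
    for lo in (0.0, 0.25, 0.5, 0.75)
]
print("mean of u:", round(mean(us), 3))
print("fraction per quarter:", [round(f, 3) for f in quarter_fractions])

# The inverse CDF maps a uniform value back to the original scale.
x_back = marginal.inv_cdf(us[0])
```

The same two maps, a CDF and its inverse, are what connect the uniform variables of a copula to variables with arbitrary marginals.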
A useful representation we can get by Sklar's Theorem for a continuous copula is

$$C(u_1, \dots, u_d) = F\left(F_1^{-1}(u_1), \dots, F_d^{-1}(u_d)\right).$$

If we further restrict the marginals to be Gaussian, then we can get an expression for Gaussian copula, that is,

$$C_\Sigma(u_1, \dots, u_d) = \Phi_\Sigma\left(\Phi_1^{-1}(u_1), \dots, \Phi_d^{-1}(u_d)\right),$$

where $\Phi_\Sigma$ is the CDF of a zero-mean multivariate Gaussian with covariance Σ and $\Phi_i^{-1}$ is the inverse of the i-th marginal Gaussian CDF. To calculate the joint density of a copula function, we take the derivative with respect to random variables u and get

$$c_\Sigma(u_1, \dots, u_d) = \frac{\partial^d C_\Sigma(u_1, \dots, u_d)}{\partial u_1 \cdots \partial u_d}.$$

Then, if the joint density p(x_1, ..., x_d) has a Gaussian form, it can be expressed by a copula density and its marginal densities, that is,

$$p(x_1, \dots, x_d) = c_\Sigma(u_1, \dots, u_d) \prod_{i=1}^{d} p(x_i), \quad u_i = \Phi_i(x_i).$$

Therefore, we can decompose the problem of estimating the joint density into two smaller subproblems: one is the estimation of the marginals; the other is the estimation of the copula density function c_Σ. In many cases, we assume independence over random variables due to the intractability of the joint density. For example, in the case of variational inference, we apply the Mean-field assumption, which requires the variational distribution family to have a factorized form so that we can get a closed-form KL divergence with respect to the prior. This assumption, however, sacrifices the useful dependency relationships over the latent random variables and often leads to training difficulties such as the posterior collapse problem. If we assume the joint posterior of latent variables to be Gaussian, then the above Gaussian copula model can be used to recover the correlation among latent variables, which helps obtain a better estimation of the joint posterior density. In the VAE setting, we can already model the marginal independent posterior of latent variables, so the only problem left is how to efficiently estimate the copula density function c_Σ. In the next section, we introduce a neural parameterized Gaussian copula model, and we provide a way to incorporate it with the reparameterization technique used in VAE.
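As a small worked case, the sketch below evaluates the textbook Gaussian copula log-density in two dimensions, log c_R(u) = -1/2 log|R| - 1/2 q^T (R^{-1} - I) q with q_i = Φ^{-1}(u_i), for a correlation matrix R = [[1, ρ], [ρ, 1]]. Using a correlation matrix with standard normal marginals is an assumption made here for simplicity, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def bivariate_gaussian_copula_logpdf(u1, u2, rho):
    """log c_R(u1, u2) for the correlation matrix R = [[1, rho], [rho, 1]]."""
    q1 = STANDARD_NORMAL.inv_cdf(u1)
    q2 = STANDARD_NORMAL.inv_cdf(u2)
    one_minus_rho2 = 1.0 - rho * rho
    # q^T (R^{-1} - I) q written out for the 2x2 case.
    quad = (rho * rho * (q1 * q1 + q2 * q2) - 2.0 * rho * q1 * q2) / one_minus_rho2
    return -0.5 * math.log(one_minus_rho2) - 0.5 * quad

# With rho = 0 the copula density is 1 everywhere (log-density 0).
independent = bivariate_gaussian_copula_logpdf(0.9, 0.9, rho=0.0)
# With positive correlation, points in the same tail are more likely ...
same_tail = bivariate_gaussian_copula_logpdf(0.9, 0.9, rho=0.8)
# ... and points in opposite tails are less likely.
opposite_tails = bivariate_gaussian_copula_logpdf(0.9, 0.1, rho=0.8)
```

The independent case makes the role of the copula term explicit: when it is identically zero, the joint density reduces to the product of its marginals, which is exactly the Mean-field form.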

Neural Gaussian Copula for VI
By the Mean-field assumption, we construct a variational family Q assuming that each member q ∈ Q can be factorized,

$$q(z|x) = \prod_{i=1}^{d} q(z_i|x).$$

This assumption, however, loses dependencies over latent codes and hence does not consider the non-factorized form of the true posterior. In this case, as pictured by Figure 2, when we search for the optimal q*, it will never reach the true posterior p. If we relax the assumption, the variational family may overlap with the posterior family. However, this is intractable as the Monte Carlo estimator with respect to the objective often has very high variance (Kingma and Welling, 2014). Hence, we need to find a way to make it possible to match the variational posterior with the true posterior while having a simple and tractable objective function so that the gradient estimator of the expectation is simple and precise. This is where Gaussian Copula comes into the story. Given a factorized posterior, we can construct a Gaussian copula for the joint posterior,

$$q_\phi(z|x) = c_\Sigma(u_1, \dots, u_d) \prod_{i=1}^{d} q_\phi(z_i|x), \quad u_i = \Phi_i(z_i),$$

where c_Σ is the Gaussian copula density. If we take the log on both sides, then we have

$$\log q_\phi(z|x) = \log c_\Sigma(u_1, \dots, u_d) + \sum_{i=1}^{d} \log q_\phi(z_i|x).$$

Note that the second term on the right hand side is just the factorized log posterior we have in the original VAE model. By the reparameterization trick (Kingma and Welling, 2014), latent codes sampled from the posterior are parameterized as a deterministic function of µ and σ², that is, z = µ + σ · ε, ε ∼ N(0, I), where µ, σ² are parameterized by two neural networks whose inputs are the final hidden states of the LSTM encoder. Since $\prod_i q_\phi(z_i|x) = N(\mu, \sigma^2 I)$, we can compute the sum of log density of posterior by

$$\sum_{i=1}^{d} \log q_\phi(z_i|x) = -\sum_{i=1}^{d} \log \sigma_i - \sum_{i=1}^{d} \frac{(z_i - \mu_i)^2}{2\sigma_i^2} - \frac{d}{2} \log 2\pi.$$

Now, to estimate the log copula density log c_Σ(·), we provide a reparameterization method for the copula samples q ∼ C_Σ(Φ_1(q_1), ..., Φ_d(q_d)). As suggested by Kingma and Welling (2014) and Hoffman et al. (2013), reparameterization is needed as it gives a differentiable, low-variance estimator of the objective function.
Here, we parameterize the copula samples with a deterministic function with respect to the Cholesky factor L of its covariance matrix Σ. We use the fact that for any multivariate Gaussian random variables, a linear transformation of them is also a multivariate Gaussian random variable. Formally, if X ∼ N(µ, Σ), and Y = AX, then we must have Y ∼ N(Aµ, AΣA^T). Hence, for a Gaussian copula with the form c_Σ = N(0, Σ), we can reparameterize its samples q by

$$q = L \cdot \epsilon, \quad \epsilon \sim N(0, I).$$

It is easy to see that q = L · ε ∼ N(0, L I L^T = Σ) is indeed a sample from the Gaussian copula model. This is the standard way of sampling from a Gaussian distribution with covariance matrix LL^T. To ensure numerical stability of the above reparameterization and to ensure that the covariance Σ = LL^T is positive definite, we provide the following algorithm to parameterize L.
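Before turning to the algorithm, the factorized part of the log posterior can be illustrated with a short sketch. It draws z = µ + σ · ε and evaluates the sum of the per-dimension Gaussian log-densities. The function names and the values of µ and σ are hypothetical stand-ins for encoder outputs, and the sketch works on plain Python lists rather than tensors.

```python
import math
import random

def sample_factorized_posterior(mu, sigma, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def factorized_log_density(z, mu, sigma):
    """sum_i log N(z_i; mu_i, sigma_i^2) for a diagonal Gaussian."""
    d = len(z)
    log_sigma_term = -sum(math.log(s) for s in sigma)
    quadratic_term = -sum(
        (zi - m) ** 2 / (2.0 * s * s) for zi, m, s in zip(z, mu, sigma)
    )
    return log_sigma_term + quadratic_term - 0.5 * d * math.log(2.0 * math.pi)

rng = random.Random(0)
mu = [0.3, -1.2, 0.0]      # hypothetical encoder means
sigma = [0.5, 1.5, 1.0]    # hypothetical encoder standard deviations
z = sample_factorized_posterior(mu, sigma, rng)
log_q = factorized_log_density(z, mu, sigma)
```

In an automatic differentiation framework the same two functions would be written on tensors, so that gradients flow through z back to µ and σ.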

Algorithm 1: Neural reparameterization of Copula: Cholesky approach
In Algorithm 1, we first parameterize the covariance matrix and then perform a Cholesky factorization (Chen et al., 2008) to get the Cholesky factor L. The covariance matrix Σ = w · I + aa^T formed in this way is guaranteed to be positive definite when w > 0. It is worth noting that we do not sample the latent codes from the Gaussian copula. In fact, z still comes from the independent Gaussian distribution. Rather, we get the sample q from the Gaussian copula C_Σ so that we can compute the log copula density term, which is then used as a regularization term during training in order to force the learned z to respect the dependencies among individual dimensions. The log copula density log c_Σ is computed from the sample q and the covariance Σ. To make sure that our model maintains the dependency structure of the latent codes, we seek to maximize both the ELBO and the joint log posterior likelihood log q(z|x) during training. In other words, we maximize the following modified ELBO,

$$\mathcal{L}'(\theta, \phi; x) = \mathcal{L}(\theta, \phi; x) + \lambda \left( \log c_\Sigma(u_1, \dots, u_d) + \sum_{i=1}^{d} \log q_\phi(z_i|x) \right),$$

where $\mathcal{L}$ is the original ELBO and λ is the weight of the log density of the joint posterior. It controls how good the model is at maintaining the dependency relationships of latent codes. The reparameterization tricks for both z and q make the above objective fully differentiable with respect to µ, σ², Σ.
Maximizing the modified ELBO $\mathcal{L}'$ will then maximize the log input likelihood log p(x) and the joint posterior log-likelihood log q(z|x). If the posterior collapses to the prior and has a factorized form, then the joint posterior likelihood will not be maximized since the joint posterior is unlikely to be factorized. Therefore, maximizing the joint posterior log-likelihood along with the ELBO forces the model to generate readable texts while also considering the dependency structure of the true posterior distribution, which is never factorized.

Table 1: Text generated by decoding z sampled from the prior.
PTB: the company said it will be sold to the company 's promotional programs and UNK the company also said it will sell $ n million of soap eggs turning millions of dollars the company said it will be UNK by the company 's UNK division n the company said it would n't comment on the suit and its reorganization plan mr . UNK said the company 's UNK group is considering a UNK standstill agreement with the company traders said that the stock market plunge is a UNK of the market 's rebound in the dow jones industrial average one trader of UNK said the market is skeptical that the market is n't UNK by the end of the session the company said it expects to be fully operational by the company 's latest recapitalization
Yelp: i was excited to try this place out for the first time and i was disappointed . the food was good and the food a few weeks ago , i was in the mood for a UNK of the UNK i love this place . i ' ve been here a few times and i ' m not sure why i ' ve been this place is really good . i ' ve been to the other location many times and it 's very good . i had a great time here . i was n't sure what i was expecting . i had the UNK and the i have been here a few times and have been here several times . the food is good , but the food is good this place is a great place to go for lunch . i had the chicken and waffles . i had the chicken and the UNK i really like this place . i love the atmosphere and the food is great . the food is always good .
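The covariance construction, Cholesky reparameterization, and weighted objective described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, w and a stand in for network outputs, and the log copula density is assumed to take the textbook form log N(q; 0, Σ) minus the sum of the marginal log-densities log N(q_i; 0, Σ_ii), which may differ from the exact expression used in the paper.

```python
import math
import random

def copula_covariance(w, a):
    """Sigma = w * I + a a^T, which is positive definite whenever w > 0."""
    n = len(a)
    return [[a[i] * a[j] + (w if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma must be positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                L[i][j] = (sigma[i][j] - partial) / L[j][j]
    return L

def sample_copula_q(L, rng):
    """Reparameterized draw q = L eps with eps ~ N(0, I); returns (q, eps)."""
    n = len(L)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    q = [sum(L[i][k] * eps[k] for k in range(i + 1)) for i in range(n)]
    return q, eps

def log_copula_density(q, eps, L, sigma):
    """Assumed form: log N(q; 0, Sigma) - sum_i log N(q_i; 0, Sigma_ii).

    Since q = L eps, q^T Sigma^{-1} q = eps^T eps and
    log|Sigma| = 2 * sum_i log L_ii, so no matrix inverse is needed.
    The 2*pi constants of the joint and the marginals cancel.
    """
    n = len(q)
    log_joint = -sum(math.log(L[i][i]) for i in range(n))
    log_joint -= 0.5 * sum(e * e for e in eps)
    log_marginals = sum(
        -0.5 * math.log(sigma[i][i]) - q[i] ** 2 / (2.0 * sigma[i][i])
        for i in range(n)
    )
    return log_joint - log_marginals

def modified_elbo(elbo, log_copula, log_factorized, lam):
    """L' = L + lambda * (log c_Sigma + sum_i log q(z_i | x))."""
    return elbo + lam * (log_copula + log_factorized)

rng = random.Random(0)
w, a = 0.5, [1.0, -0.5, 0.25]    # hypothetical outputs of the encoder heads
sigma = copula_covariance(w, a)
L = cholesky(sigma)
q, eps = sample_copula_q(L, rng)
log_c = log_copula_density(q, eps, L, sigma)
# The ELBO and factorized log posterior values below are made up for illustration.
objective = modified_elbo(elbo=-85.0, log_copula=log_c, log_factorized=-3.2, lam=0.4)
```

When a is the zero vector, Σ is diagonal and the assumed log copula density is zero, so the objective only adds the factorized log posterior; a nonzero a is what lets the regularizer express correlation.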

Evidence Lower Bound
Another interpretation can be seen by taking a look at the prior. If we compose the copula density with the prior, then, like the non-factorized posterior, we can get the non-factorized prior,

$$p(z) = c_\Sigma(u_1, \dots, u_d) \prod_{i=1}^{d} p(z_i).$$

And the corresponding ELBO is

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x) \,\Big\|\, c_\Sigma(u_1, \dots, u_d) \prod_{i=1}^{d} p(z_i)\right).$$

Like normalizing flows (Rezende and Mohamed, 2015), maximizing the log copula density will then learn a more flexible prior than a standard Gaussian. The dependency among the z_i is then restored since the KL term will push the posterior to this more complex prior.
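The link between this copula-augmented prior and the log copula density term can be made explicit with a short derivation. It assumes, as described above, that the prior is the product of the factorized prior and the copula density c_Σ(u); the symbols otherwise follow the ELBO notation of the earlier sections.

```latex
\begin{align}
\mathrm{KL}\Big(q_\phi(z|x) \,\Big\|\, c_\Sigma(u) \prod_{i} p(z_i)\Big)
  &= \mathbb{E}_{q_\phi(z|x)}\Big[\log q_\phi(z|x) - \sum_{i} \log p(z_i) - \log c_\Sigma(u)\Big] \\
  &= \mathrm{KL}\Big(q_\phi(z|x) \,\Big\|\, \prod_{i} p(z_i)\Big)
     - \mathbb{E}_{q_\phi(z|x)}\big[\log c_\Sigma(u)\big],
\end{align}
so the ELBO under the copula-augmented prior is
\begin{equation}
\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
  - \mathrm{KL}\Big(q_\phi(z|x) \,\Big\|\, \prod_{i} p(z_i)\Big)
  + \mathbb{E}_{q_\phi(z|x)}\big[\log c_\Sigma(u)\big].
\end{equation}
```

Under this reading, the usual closed-form KL against the factorized prior is kept, and the expected log copula density appears as an additional term to be maximized.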
We argue that relaxing the Mean-field assumption by maintaining the dependency structure can avert the posterior collapse problem. As shown in Figure 2, during the training stage of the original VAE, if KL annealing (Bowman et al., 2015) is used, the model first seeks to maximize the expectation E q(z|x) [p(x|z)]. Then, since q(z|x) can never reach the true p(z|x), q will reach a boundary and the expectation can no longer increase. During this stage, the model starts to maximize the ELBO by minimizing the KL divergence. Since the expectation is maximized and can no longer offset the KL term, the posterior will collapse to the prior, and there is not sufficient gradient to move it away since the ELBO is already maximized. On the other hand, if we introduce a copula model to help maintain the dependency structure of the true posterior by maximizing the joint posterior likelihood, then, in the ideal case, the variational family can approximate distributions of any form since it is no longer restricted to be factorized, and therefore it is more likely for q in Figure 3 to be closer to the true posterior p. In this case, E q(z|x) [p(x|z)] can be higher since we now have latent codes sampled from a more accurate posterior, and this expectation will be able to offset the decrease of KL even in the final training stage.
In the paper, we use Penn Treebank (PTB) (Marcus et al., 1993), Yahoo Answers (Xu and Durrett, 2018; Yang et al., 2017), and Yelp 13 reviews (Xu et al., 2016) to test our model performance over variational language modeling tasks. We use these three large datasets as they are widely used in all other variational language modeling approaches (Bowman et al., 2015; Yang et al., 2017; Xu and Durrett, 2018; Xiao et al., 2018; He et al., 2018; Kim et al., 2018). Table 2 shows the statistics, vocabulary size, and number of samples in Train/Validation/Test for each dataset.

Experimental Setup
We set up experimental conditions similar to those in (Bowman et al., 2015; Xiao et al., 2018; Yang et al., 2017; Xu and Durrett, 2018). We use LSTM as our encoder-decoder model, where the number of hidden units for each hidden state is set to 512. The word embedding size is 512, and the number of dimensions for latent codes is set to 32. For both encoder and decoder, we use a dropout layer for the initial input, whose dropout rate is α = 0.5. Then, for inference, we pass the final hidden state to a linear layer following a Batch Normalization (Ioffe and Szegedy, 2015) layer to get reparameterized samples from $\prod_i q(z_i|x)$ and from the Gaussian copula C_Σ. For the training stage, the maximum vocabulary size for all inputs is set to 20000, and the maximum sequence length is set to 200. Batch size is set to 32, and we train for 30 epochs on each dataset using the Adam optimizer (Kingma and Ba, 2014) with learning rate r = 10^{-3}.
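For reference, the settings listed above can be collected in one place. The sketch below is only a convenience container: the class and field names are hypothetical, and the default copula weight is the value reported for PTB, which is tuned per dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CopulaVAEConfig:
    """Hyperparameters as reported in the text; the field names are hypothetical."""
    hidden_size: int = 512
    embedding_size: int = 512
    latent_dim: int = 32
    dropout: float = 0.5
    max_vocab_size: int = 20000
    max_seq_len: int = 200
    batch_size: int = 32
    epochs: int = 30
    learning_rate: float = 1e-3
    copula_weight: float = 0.4  # lambda; 0.4 is the value reported for PTB

config = CopulaVAEConfig()
```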
We use KL annealing (Bowman et al., 2015) during training. We also observe that the weight of the log copula density is the most important factor in determining whether our model avoids the posterior collapse problem. We tuned this hyperparameter to find the optimal value. When we gradually increase λ, the KL divergence increases and the test reconstruction loss decreases.
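KL annealing scales the KL term by a weight that grows from 0 to 1 early in training. The text does not specify the schedule, so the linear warm-up below is an assumption, and the function names are hypothetical.

```python
def kl_weight(step, warmup_steps):
    """Linearly increase the KL weight from 0 to 1 over warmup_steps updates."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def annealed_objective(reconstruction_log_likelihood, kl, step, warmup_steps):
    """ELBO with the KL term scaled by the annealing weight."""
    return reconstruction_log_likelihood - kl_weight(step, warmup_steps) * kl

# Early in training the KL term is down-weighted; after warm-up it counts fully.
early = annealed_objective(-100.0, 8.0, step=250, warmup_steps=1000)
late = annealed_objective(-100.0, 8.0, step=5000, warmup_steps=1000)
```

The copula weight λ is a separate hyperparameter and is not part of this schedule.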

Comparison with other Variational models
We compare the variational language modeling results over three datasets. We show the results for Negative log-likelihood (NLL), KL divergence, and sample perplexity (PPL) for each model on these datasets. NLL is approximated by the evidence lower bound.
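Perplexity is conventionally obtained by exponentiating the average per-token negative log-likelihood. The sketch below follows that convention; whether the reported numbers count end-of-sentence tokens is not stated, and the per-sentence values are made up for illustration.

```python
import math

def perplexity(total_nll, total_tokens):
    """exp of the average per-token negative log-likelihood (in nats)."""
    return math.exp(total_nll / total_tokens)

# Hypothetical test set with three sentences.
sentence_nll = [92.1, 110.4, 75.3]   # per-sentence NLL (ELBO-based), in nats
sentence_len = [21, 25, 17]          # tokens per sentence
ppl = perplexity(sum(sentence_nll), sum(sentence_len))
```

Because the NLL is approximated by the evidence lower bound, a perplexity computed this way is an upper bound on the true perplexity.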
First, we observe that KL annealing does not help alleviate the posterior collapse problem when it comes to larger datasets such as Yelp, but the problem is solved if we maintain the latent code's dependencies by maximizing the copula likelihood when we maximize the ELBO. We also observe that the weight λ of the log copula density affects results dramatically. All λ produce competitive results compared with other methods. Here, we provide the numbers for those weights λ that produce the lowest PPL. For PTB, copula-VAE achieves the lowest sample perplexity and the best NLL approximation, and does not result in posterior collapse, when λ = 0.4. For Yelp, the lowest sample perplexity is achieved when λ = 0.5.
We also compare with VAE models trained with normalizing flows (Rezende and Mohamed, 2015). We observe that our model is superior to VAE based on flows. It is worth noting that the Wasserstein Autoencoder trained with normalizing flows (Wang and Wang, 2019) achieves the lowest PPL: 66 on PTB and 41 on Yelp. However, the problem of designing flexible normalizing flows is orthogonal to our research.
Table 1 presents the results of the text generation task. We first randomly sample z from p(z), and then feed it into the decoder p(x|z) to generate text using greedy decoding. We can tell whether a model suffers from posterior collapse by examining the diversity of the generated sentences. The original VAE tends to generate the same type of sequences for different z. This is very obvious in PTB, where the posterior of the original VAE collapses to the prior completely. Copula-VAE, however, does not have this kind of issue and can always generate a diverse set of texts.

Hyperparameter-Tuning: Copula weights play a huge role in the training of VAE
In this section, we investigate the influence of the log copula weight λ on training. From Figure 5, we observe that our model performance is very sensitive to the value of λ. When λ is small, the log copula density contributes only a small part of the objective and therefore does not help to maintain dependencies over latent codes. In this case, the model performs like the original VAE, where the KL divergence becomes zero at the end. When we increase λ, test KL becomes larger and test reconstruction loss becomes smaller. This phenomenon is also observed on the validation sets, as shown in Figure 4. The training PPL is monotonically decreasing in general. However, when λ is small and the dependency relationships over latent codes are lost, the model quickly overfits, as the KL divergence quickly becomes zero and the validation loss starts to increase. This further confirms what we showed in Figure 2. For original VAE models, the model first maximizes E q(z|x) [p(x|z)], which results in the decrease of both train and validation loss. Then, as q(z|x) can never match the true posterior, E q(z|x) [p(x|z)] reaches its ceiling, which then results in the decrease of KL as it is needed to maximize the ELBO. During this stage, the LSTM decoder starts to learn how to generate texts with standard Gaussian latent variables, which then causes the increase of validation loss. On the other hand, if we gradually increase the contribution of the copula density by increasing λ, the model is able to maintain the dependencies of latent codes and hence the structure of the true posterior. In this case, E q(z|x) [p(x|z)] will be much higher and will offset the decrease of KL, and the decoder is forced to generate texts from non-standard Gaussian latent codes. Therefore, the validation loss also decreases monotonically in general.
One major drawback of our model is the amount of training time, which is 5 times longer than the original VAE method. In terms of performance, copula-VAE achieves the lowest reconstruction loss when λ = 0.6. It is clear from Figure 5 that increasing λ results in a larger KL divergence.

Conclusion
In this paper, we introduce Copula-VAE with a Cholesky reparameterization method for the Gaussian copula. This approach averts posterior collapse by using a Gaussian copula to maintain the dependency structure of the true posterior. Our results show that Copula-VAE significantly improves language modeling results over other VAEs.