Better Exploiting Latent Variables in Text Modeling

We show that sampling the latent variable multiple times at each gradient step improves a variational autoencoder, and we propose a simple and effective method to better exploit these latent variables through hidden state averaging. Consistent performance gains on two different datasets, Penn Treebank and Yahoo, indicate the generalizability of our method.


Introduction
Introducing latent variables to neural language models helps generate plausible sentences that reflect sentential semantics (Bowman et al., 2016). Learning such latent variables also benefits various natural language processing (NLP) tasks such as sentence compression (Miao and Blunsom, 2016) and text style transfer (Shen et al., 2017). One of the most widely used latent variable models is the variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014). When applying the VAE to text data, recurrent neural networks are typically used for both the encoder and the decoder. Training the VAE with a high-capacity decoder such as a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) can be challenging: the LSTM is powerful enough to model the underlying data distribution without the use of latent variables.
In this paper, we take a closer look at one of the components in an LSTM-VAE model, namely the latent variable sampling scheme. Fig. 1 illustrates our baseline LSTM-VAE model built upon Bowman et al. (2016)'s model. At each gradient step (i.e., a minibatch run), most previous work pairs an input sentence with a single latent variable denoted by z. This may be sufficient in some tasks but is not necessarily effective in text modeling. At the beginning of training, the latent variable z contains only a small amount of information about the input sentence. Many latent units of z are pulled towards the prior early in training to optimize the objective function, before they capture useful information (Hoffman et al., 2013; Sønderby et al., 2016). Without a cost annealing strategy or a constraint on the decoder (Bowman et al., 2016; Chen et al., 2017; Yang et al., 2017), z would be entirely ignored for the remaining training steps. In our work, we aim at developing a simple variant of the LSTM-VAE model to address this common training issue. We observe that pairing the input sentence with multiple latent variables improves latent variable usage. In addition, we present a method that leverages multiple latent variables to further boost the performance of the baseline LSTM-VAE model. The code for reproducibility is available at https://research-lab.yahoo.co.jp/en/software.
Our contributions are as follows: We suggest sampling the latent variable multiple times at each gradient step. We propose a simple method to better exploit these latent variables through hidden state averaging. We evaluate the proposed method on two different datasets, Penn Treebank and Yahoo, and compare against the best results published in the literature. Our empirical results show that our method can effectively make use of the latent variables, leading to state-of-the-art performance.

Related work
Bowman et al. (2016) first proposed an LSTM-VAE model for text. They observed the posterior-collapse problem, in which the approximate posterior collapses to the prior and the model ignores the latent variable. They suggested two techniques to alleviate this issue: cost annealing (called warm-up in (Sønderby et al., 2016)) and word dropout. Weakening the decoder with word dropout forces the latent variable to encode more information, but their LSTM-VAE model still underperforms the standard LSTM language model. Yang et al. (2017) proposed replacing the LSTM decoder with a dilated convolutional neural network (CNN) (van den Oord et al., 2016) to control the contextual capacity. However, their positive results also came from initializing the encoder with a pre-trained LSTM language model. Guu et al. (2018) first proposed using the von Mises-Fisher (vMF) distribution, instead of the Gaussian distribution, to model the VAE. However, the vMF distribution presupposes that all data are directional unit vectors. Other applications of the vMF distribution can be found in (Davidson et al., 2018; Xu and Durrett, 2018). Kim et al. (2018) presented a semi-amortized (SA) approach to training the VAE, while He et al. (2019) proposed aggressive inference network training. However, both training algorithms are computationally expensive since they require backpropagating through the decoder or the encoder multiple times. Our method is simpler and easier to implement: in practice, we just place a loop before reparameterization and do the averaging.

Background
Let x = [w_1, w_2, ..., w_T] be a sentence representation, where w_t is the t-th word. Assume that x is generated from a continuous latent variable z through a random process x ∼ p_θ(x|z) parameterized by θ. By applying the standard language model (Bengio et al., 2003), we get:

$$p_\theta(x|z) = \prod_{t=1}^{T} p_\theta(w_t \mid w_{1:t-1}, z). \tag{1}$$

Given a dataset X = {x^(1), ..., x^(N)}, we typically fit the model by maximizing the average log-marginal likelihood (1/N) Σ_{i=1}^N log p_θ(x^(i)). We can express an individual log-marginal likelihood as log p_θ(x) = log ∫ p_θ(x|z) p(z) dz, where p(z) is the prior on z. Unfortunately, the integral over z is intractable (Hoffman et al., 2013). Alternatively, we could sample z directly from the posterior distribution p_θ(z|x). However, p_θ(z|x) is also intractable since p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x).
Variational inference approximates the posterior distribution p_θ(z|x) with a variational family of distributions q_φ(z|x) parameterized by φ. We want q_φ(z|x) to be close to p_θ(z|x), and we measure this closeness by the Kullback-Leibler (KL) divergence KL(q_φ(z|x) || p_θ(z|x)). Instead of maximizing the true log-marginal likelihood, we maximize its lower bound:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x) \,\|\, p(z)). \tag{2}$$

The right-hand side is typically referred to as the evidence lower bound (ELBO) (Hoffman and Johnson, 2016). The ELBO consists of two terms: the expected reconstruction term and the KL-divergence term. We can solve the KL-divergence term analytically given that both the prior p(z) and the variational posterior q_φ(z|x) are Gaussian (see Kingma and Welling (2014)'s Appendix B). We then need to rewrite the expected reconstruction term into some closed-form expression (detailed in §4) so that we can maximize it by applying stochastic optimization methods.
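For a diagonal Gaussian posterior q_φ(z|x) = N(μ, σ²I) and a standard normal prior p(z) = N(0, I), this analytic solution takes the following closed form (restated from Kingma and Welling (2014)'s Appendix B, with n the dimensionality of z):

$$\mathrm{KL}(q_\phi(z|x) \,\|\, p(z)) = -\frac{1}{2} \sum_{j=1}^{n} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right).$$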
Optimizing the ELBO forms the VAE architecture, in which q_φ(z|x) encodes x into a latent variable z, and p_θ(x|z) decodes z to reconstruct x. The gradient of the ELBO w.r.t. φ can have low variance by applying the reparameterization trick (Kingma and Welling, 2014), which estimates z ∼ q_φ(z|x) using z = μ + σ ⊙ ε, where the mean μ and the variance σ² are outputs of some neural networks, and ε ∼ N(0, I).
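As a concrete illustration, the reparameterization trick can be sketched as follows in PyTorch; the function and tensor names are ours, not the authors' code:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    mu, logvar: tensors of shape (batch_size, latent_dim), typically
    produced by linear layers on top of the last encoder hidden state.
    """
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # eps ~ N(0, I)
    return mu + std * eps           # differentiable w.r.t. mu and logvar
```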

Proposed method
Having covered the technical background, we now describe our two extensions to improve the baseline LSTM-VAE model in Fig. 1. The baseline model approximates the expected reconstruction term by sampling one latent variable z ∼ q_φ(z|x) at each gradient step (Bowman et al., 2016). Our first extension is to improve this approximation by using a Monte Carlo estimate of the expected reconstruction term:

$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x|z^{(l)}), \tag{3}$$

where z^(l) = μ + σ ⊙ ε^(l) and ε^(l) ∼ N(0, I). Sampling latent variables multiple times at each gradient step should result in a better approximation of the expected reconstruction term. Fig. 2 shows an example of sampling two latent variables. Note that we use the same μ and σ for both latent variables. By using the language model from Eq. (1), we can decompose the reconstruction term as:

$$\log p_\theta(x|z^{(l)}) = \sum_{t=1}^{T} \log p_\theta(w_t \mid w_{1:t-1}, z^{(l)}). \tag{4}$$

Let V be a fixed-size vocabulary of words in a dataset. Given the entire history of previous words w_{1:t} = [w_1, ..., w_t] and the latent variable z^(l), we compute the distribution over the possible values of w_{t+1} by applying a linear transformation to the decoder hidden state followed by a softmax:

$$p_\theta(w_{t+1} \mid w_{1:t}, z^{(l)}) = \mathrm{softmax}(h_t^{(l)} M_1), \tag{5}$$

where the decoder hidden state is computed recurrently by the LSTM from the concatenation [w_t; z^(l)], with the initial state h_0^(l) = M_2 z^(l). Here, M_1 ∈ R^{m×|V|} and M_2 ∈ R^{m×n} are the trainable weight matrices, h_t^(l) ∈ R^m is the decoder hidden state, z^(l) ∈ R^n is the latent variable at each sampling step l, and w_t ∈ R^d is the embedding vector of the word w_t. We compute μ and σ² used in the reparameterization trick by:

$$\mu = M_3 s_T, \qquad \log \sigma^2 = M_4 s_T, \tag{6}$$

where M_3, M_4 ∈ R^{n×m} are the trainable weight matrices and s_T ∈ R^m is the last encoder hidden state.
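To make the first extension concrete, here is a minimal PyTorch sketch of the multi-sample reconstruction term in Eq. (3); the `decoder` callable and all names are our own assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def multi_sample_reconstruction_loss(decoder, mu, logvar, inputs, targets, L=5):
    """Monte Carlo estimate of the expected reconstruction term (Eq. (3)).

    `decoder(inputs, z)` is assumed to return logits of shape
    (batch, seq_len, vocab_size).
    """
    std = torch.exp(0.5 * logvar)
    losses = []
    for _ in range(L):                       # loop placed before reparameterization
        eps = torch.randn_like(std)
        z = mu + std * eps                   # z^(l); same mu and sigma for every sample
        logits = decoder(inputs, z)
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      targets.reshape(-1), reduction='sum'))
    return torch.stack(losses).mean()        # average the L negative log-likelihoods
```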
Our second extension exploits multiple latent variables to directly improve the expressiveness of the decoder. Instead of computing separate reconstruction terms and averaging them as in Eq. (3), we average the decoder hidden states at each time step t:

$$\bar{h}_t = \frac{1}{L} \sum_{l=1}^{L} h_t^{(l)}, \tag{7}$$

where each hidden state is initialized with a different latent variable z^(l). Fig. 3 shows an example of averaging two hidden states at each decoding step. Thus our distribution of w_{t+1} becomes:

$$p_\theta(w_{t+1} \mid w_{1:t}, z) = \mathrm{softmax}(\bar{h}_t M_1). \tag{8}$$

Here we drop the superscript (l) since all hidden states h_t^(l) are averaged into h̄_t.
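The following is a simplified PyTorch sketch of one decoding step with hidden state averaging (Eqs. (7)-(8)); the class and argument names are ours, not the authors' implementation:

```python
import torch
import torch.nn as nn

class AveragingDecoderStep(nn.Module):
    """One decoding step that averages L LSTM hidden states.

    Each of the L LSTMCell states is assumed to have been initialized from
    a different z^(l) (cell states start at zero), and the averaged hidden
    state feeds the softmax layer.
    """
    def __init__(self, emb_dim, latent_dim, hidden_dim, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)   # plays the role of M_1

    def forward(self, word_emb, zs, states):
        # word_emb: (batch, emb_dim); zs: list of L tensors (batch, latent_dim)
        # states: list of L (h, c) pairs, one per latent sample
        new_states, hidden = [], []
        for z, (h, c) in zip(zs, states):
            h, c = self.cell(torch.cat([word_emb, z], dim=-1), (h, c))
            new_states.append((h, c))
            hidden.append(h)
        h_bar = torch.stack(hidden).mean(dim=0)        # Eq. (7): average hidden states
        return self.out(h_bar), new_states             # logits for w_{t+1}, Eq. (8)
```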

Datasets and training details
We experiment on two datasets: Penn Treebank (PTB) (Marcus et al., 1993) and Yahoo (Zhang et al., 2015). We sample z using the reparameterization trick and feed it through a linear transformation to obtain the initial hidden state of the LSTM decoder, while setting the initial cell state to zero. We concatenate z with the word embedding at each decoding step. We use dropout (Hinton et al., 2012) with probability 0.5 on the input-to-hidden layers and the hidden-to-softmax layers. We initialize all model parameters and word embeddings by sampling from U(−0.1, 0.1). We train all models using stochastic gradient descent (SGD) with a batch size of 32, a learning rate of 1.0, and gradient clipping at 5. The learning rate is halved whenever the validation perplexity does not improve. We train for 30 epochs or until the validation perplexity has not improved three times. All models are trained on NVIDIA Tesla P40 GPUs.
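The optimization recipe above can be sketched as follows; the helper callables (`compute_elbo_loss`, `evaluate`) are hypothetical placeholders rather than functions from the released code:

```python
import torch

def train(model, train_iter, valid_iter, compute_elbo_loss, evaluate,
          epochs=30, lr=1.0, clip=5.0, patience=3):
    """Sketch of the training loop described in the text."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    best_ppl, bad_evals = float('inf'), 0
    for _ in range(epochs):
        model.train()
        for batch in train_iter:
            loss = compute_elbo_loss(model, batch)   # negative ELBO on the minibatch
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
        val_ppl = evaluate(model, valid_iter)
        if val_ppl < best_ppl:
            best_ppl, bad_evals = val_ppl, 0
        else:
            bad_evals += 1
            for group in optimizer.param_groups:
                group['lr'] /= 2.0                   # halve the learning rate
            if bad_evals >= patience:                # stop after 3 failed evaluations
                break
    return model
```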
Following previous work (Bowman et al., 2016;Sønderby et al., 2016), we apply KL cost annealing to all LSTM-VAE models. The multiplier on the KL term is increased linearly from 0 to 1 during the first 10 epochs of training.
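A minimal sketch of this linear annealing schedule, with names of our own choosing:

```python
def kl_weight(step, total_annealing_steps):
    """Linearly increase the KL multiplier from 0 to 1.

    `total_annealing_steps` would be the number of gradient steps in the
    first 10 epochs; after that the multiplier stays at 1.
    """
    return min(1.0, step / total_annealing_steps)

# loss = reconstruction_loss + kl_weight(step, total_annealing_steps) * kl_term
```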
We also try word dropout (Bowman et al., 2016) during development but find that it is not effective when combined with standard dropout. This finding is consistent with Kim et al. (2018), so we do not apply word dropout to our models.

Main results
We report the upper bounds (i.e., the negative ELBO in Eq. (2)) on NLL/PPL. We vary the number of latent variables L in the variational models to assess their impact on performance. LSTM-VAE-AVG indicates the averaging of hidden states at each decoding step in Eq. (8). We also report the results of the inputless setting (Bowman et al., 2016), which corresponds to dropping all ground truth words during decoding. Table 2 shows the results of various models. The LSTM-VAE-AVG models with multiple latent variables provide the best improvements in terms of NLL/PPL. The LSTM-VAE models trained with more latent variables offer slight improvements over the baseline version (i.e., using one latent variable) for the standard setting.
The baseline LSTM-VAE models have low KL values and underperform LSTM-LM in the standard setting. Incorporating multiple latent variables consistently increases the KL values. Note that a higher KL term does not necessarily imply a better upper bound. Generally, however, we do not expect the KL term to approach zero: when KL(q_φ(z|x) || p(z)) = 0, z and x are independent (i.e., q_φ(z|x) = q_φ(z) = p(z)). In other words, z learns nothing from x.
The LSTM-VAE-AVG models have relatively high KL values (except the inputless setting on Yahoo), while still maintaining better upper bounds on NLL/PPL. These results suggest that our models with expressive decoders can effectively make use of the latent variables.

Discussion
On PTB, LSTM-VAE-AVG (L = 10) achieves the best results compared to previous work (Bowman et al., 2016; Xu and Durrett, 2018). On Yahoo, LSTM-VAE-AVG (L = 5) slightly outperforms Kim et al. (2018)'s SA-VAE; our model provides similar improvements while being simpler. We also observe that our vanilla LSTM-LM model and that of Kim et al. (2018) obtain better results than Yang et al. (2017)'s models. One plausible explanation is that Yang et al. (2017) trained their models with Adam (Kingma and Ba, 2015), while we used SGD. For text modeling, researchers have shown that SGD performs better than adaptive optimization methods such as Adam (Wilson et al., 2017; Keskar and Socher, 2017).
The ELBO has been commonly used to evaluate variational models (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018; Kim et al., 2018). There also exists a line of work that uses importance sampling to estimate the true log-marginal likelihood (Rezende et al., 2014; Burda et al., 2016; Tomczak and Welling, 2018; He et al., 2019). We further conduct experiments by computing the importance sampling estimates with 500 samples and comparing to He et al. (2019)'s aggressive inference network (AIN) training. Table 3 shows a comparison of different NLL estimates: the upper bounds from Table 2 and NLL_IW, the importance sampling estimate of NLL with 500 samples. We report the mean and standard deviation computed across five training/test runs from different random initial starting points. Our results are consistent with those of He et al. (2019), in which importance sampling yields tighter bounds than the ELBO.
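For reference, an importance-weighted NLL estimate of this kind can be computed roughly as follows; this is a generic sketch with hypothetical model methods (`encode`, `log_likelihood`), not the evaluation code used in the paper:

```python
import math
import torch

def importance_weighted_nll(model, x, num_samples=500):
    """Estimate -log p(x) via importance sampling with q(z|x) as the proposal."""
    mu, logvar = model.encode(x)                    # assumed: (batch, latent_dim) each
    std = torch.exp(0.5 * logvar)
    log_ws = []
    for _ in range(num_samples):
        eps = torch.randn_like(std)
        z = mu + std * eps
        log_pxz = model.log_likelihood(x, z)        # assumed: log p(x|z) per example
        log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        log_qzx = torch.distributions.Normal(mu, std).log_prob(z).sum(-1)
        log_ws.append(log_pxz + log_pz - log_qzx)   # log importance weight
    log_w = torch.stack(log_ws, dim=0)              # (num_samples, batch)
    log_px = torch.logsumexp(log_w, dim=0) - math.log(num_samples)
    return -log_px                                  # NLL_IW per example
```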

Conclusion
We have shown that using multiple latent variables at each gradient step can improve the performance of the baseline LSTM-VAE model. The empirical results indicate that our models combined with expressive decoders can successfully make use of the latent variables, resulting in higher KL values and better NLL/PPL results. Our proposed method is simple and can serve as a strong baseline for latent variable text modeling.