On the Importance of the Kullback-Leibler Divergence Term in Variational Autoencoders for Text Generation

Variational Autoencoders (VAEs) are known to suffer from learning uninformative latent representations of the input due to issues such as approximate posterior collapse or entanglement of the latent space. We impose an explicit constraint on the Kullback-Leibler (KL) divergence term inside the VAE objective function. While the explicit constraint naturally avoids posterior collapse, we use it to further understand the significance of the KL term in controlling the information transmitted through the VAE channel. Within this framework, we explore different properties of the estimated posterior distribution, and highlight the trade-off between the amount of information encoded in a latent code during training and the generative capacity of the model.


Introduction
Despite the recent success of deep generative models such as Variational Autoencoders (VAEs) (Kingma and Welling, 2014) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) in different areas of Machine Learning, they have failed to produce similar generative quality in NLP. In this paper we focus on VAEs and their mathematical underpinnings to explain their behaviour in the context of text generation.
The vanilla VAE applied to text (Bowman et al., 2016) consists of an encoder (inference) and a decoder (generative) network: given an input x, the encoder network parameterizes q_φ(z|x) and infers a continuous latent representation of x, while the decoder network parameterizes p_θ(x|z) and generates x from the continuous code z. The two models are jointly trained by maximizing the Evidence Lower Bound (ELBO), L(θ, φ; x, z):

L(θ, φ; x, z) = ⟨log p_θ(x|z)⟩_{q_φ(z|x)} − D_KL(q_φ(z|x) || p(z))    (1)

where the first term is the reconstruction term and the second term is the Kullback-Leibler (KL) divergence between the posterior distribution of the latent variable z and its prior p(z) (i.e., N(0, I)). The KL term can be interpreted as a regularizer which prevents the inference network from copying x into z; for the case of a Gaussian prior and posterior it has a closed-form solution.¹

¹ The code is available at https://github.com/VictorProkhorov/KL_Text_VAE
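The closed-form KL mentioned above, between a diagonal Gaussian posterior and a standard Gaussian prior, can be sketched as follows. This is the standard textbook formula, not code from the paper's repository:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.

    Closed form: 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 ).
    """
    return 0.5 * np.sum(np.square(mu) + np.exp(logvar) - logvar - 1.0, axis=-1)

# When the posterior equals the prior (mu = 0, sigma = 1) the KL is exactly 0:
# this is the "posterior collapse" regime discussed below.
mu = np.zeros(16)
logvar = np.zeros(16)
print(gaussian_kl(mu, logvar))  # -> 0.0
```

Any deviation of the posterior mean or variance from (0, 1) makes this term strictly positive, which is what the KL term in eqn. (1) penalizes.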
With powerful autoregressive decoders, such as LSTMs, the decoder's internal cells are likely to suffice for representing the sentence, leading to a sub-optimal solution where the decoder ignores the inferred latent code z. This allows the encoder to become independent of x, an issue known as posterior collapse (q_φ(z|x) ≈ p(z)), where the inference network produces uninformative latent variables. Several solutions have been proposed to address the posterior collapse issue: (i) modifying the architecture of the model by weakening the decoder (Bowman et al., 2016; Miao et al., 2015; Yang et al., 2017; Semeniuta et al., 2017), or introducing additional connections between the encoder and decoder to enforce the dependence between x and z (Zhao et al., 2017; Goyal et al., 2017; Dieng et al., 2018); (ii) using more flexible or multimodal priors (Tomczak and Welling, 2017; Xu and Durrett, 2018); (iii) alternating the training by focusing on the inference network in the earlier stages (He et al., 2019), or augmenting amortized optimization of VAEs with instance-based optimization of stochastic variational inference (Marino et al., 2018).
All of the aforementioned approaches impose one or more of the following limitations: restricting the choice of decoder, modifying the training algorithm, or requiring a substantial alteration of the objective function. As exceptions to these, δ-VAE (Razavi et al., 2019) and β-VAE (Higgins et al., 2017) aim to avoid posterior collapse by explicitly controlling the regularizer term in eqn. 1. While δ-VAE aims to impose a lower bound on the divergence term, β-VAE (§2.2) controls the impact of regularization via an additional hyperparameter β (i.e., β D_KL(q_φ(z|x) || p(z))). A special case of β-VAE is annealing (Bowman et al., 2016), where β increases from 0 to 1 during training.
In this study, we propose to use an extension of β-VAE (Burgess et al., 2018) which permits us to explicitly control the magnitude of the KL term while avoiding the posterior collapse issue even in the presence of a powerful decoder. We use this framework to examine different properties of the estimated posterior and the generative behaviour of VAEs, and discuss them in the context of text generation via various qualitative and quantitative experiments.

Kullback-Leibler Divergence in VAE
We take the encoder-decoder of VAEs as the sender-receiver in a communication network. Given an input message x, a sender generates a compressed encoding of x denoted by z, while the receiver aims to fully decode z back into x. The quality of this communication can be explained in terms of rate (R), which measures the compression level of z as compared to the original message x, and distortion (D), which quantifies the overall performance of the communication in encoding a message at the sender and successfully decoding it at the receiver. Additionally, the capacity of the encoder channel can be measured in terms of the amount of mutual information between x and z, denoted by I(x; z) (Cover and Thomas, 2012).

Reconstruction vs. KL
The reconstruction loss can naturally measure distortion (D := −⟨log p_θ(x|z)⟩), while the KL term quantifies the amount of compression (rate; R) by measuring the divergence between a channel that transmits zero bits of information about x, denoted by p(z), and the encoder channel of VAEs, q_φ(z|x). Alemi et al. (2018) introduced the bounds H − D ≤ I(x; z) ≤ R, where H is the empirical data entropy (a constant). These bounds on mutual information allow us to analyze the trade-off between the reconstruction and KL terms in eqn. (1). For instance, since I(x; z) is non-negative (by Jensen's inequality), posterior collapse can be explained as the situation where I(x; z) = 0: the encoder transmits no information about x, causing R = 0 and D = H. Increasing I(x; z) can be encouraged by increasing both bounds: increasing the upper bound (the KL term) can be seen as a means to control the maximum capacity of the encoder channel, while reducing the distortion (reconstruction loss) tightens the lower bound by pushing it towards its limit (H − D → H). A similar effect on the lower bound can be achieved by using stronger decoders, which can potentially decrease the reconstruction loss. Hence, a framework that permits the use of strong decoders while avoiding posterior collapse is desirable. Similarly, the channel capacity can be decreased by lowering the upper bound, i.e., the KL term.

Explicit KL Control via β-VAE
Given the above interpretation, we now turn to a slightly different formulation of the ELBO based on β-VAE (Higgins et al., 2017). This allows us to control the trade-off between the reconstruction and KL terms, as well as to set an explicit KL value. While β-VAE regularizes the ELBO via an additional coefficient β ∈ R+, a simple extension (Burgess et al., 2018) of its objective function incorporates an additional hyperparameter C to explicitly control the magnitude of the KL term:

L(θ, φ; x, z, C) = ⟨log p_θ(x|z)⟩_{q_φ(z|x)} − β |D_KL(q_φ(z|x) || p(z)) − C|    (2)

where C ∈ R+ and |·| denotes the absolute value. While we could apply constrained optimization to impose the explicit constraint KL = C, we found that the above objective function satisfies the constraint (§3). Alternatively, it has been shown (Pelsmaeker and Aziz, 2019) that a similar effect can be achieved by replacing the second term in eqn. 2 with max(C, D_KL(q_φ(z|x) || p(z))), at the risk of breaking the ELBO when KL < C (Kingma et al., 2016).
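A minimal sketch of the loss in eqn. 2, written over precomputed batch averages of the two ELBO terms (the function name and the scalar interface are ours, not the paper's):

```python
def beta_c_objective(rec_log_lik, kl, beta=1.0, C=15.0):
    """Loss to *minimise* for the objective in eqn. 2:

        -E_q[log p(x|z)] + beta * | KL(q(z|x) || p(z)) - C |

    `rec_log_lik` and `kl` are the reconstruction log-likelihood and the
    KL term, already averaged over a batch; C is the target rate.
    """
    return -rec_log_lik + beta * abs(kl - C)

# The penalty vanishes only when KL hits the target C, and grows
# symmetrically when KL falls below or rises above it -- which is why
# any C > 0 rules out posterior collapse (KL = 0).
```

Note the contrast with plain β-VAE, which scales the KL term down (or up) uniformly: here the absolute-value penalty pulls the KL towards C from both sides.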

Experiments
We conduct various experiments to illustrate the properties that are encouraged via different KL magnitudes. In particular, we revisit the interdependence between rate and distortion, and shed light on the impact of KL on the sharpness of the approximated posteriors. Then, through a set of qualitative and quantitative experiments for text generation, we demonstrate how certain generative behaviours can be imposed on VAEs via a range of KL magnitudes.

Rate and Distortion
To analyse the dependence between the explicit rate (C) and distortion, we trained our models with different values of C, ranging from 10 to 100. Figure 1 reports the results for the β_C-VAE-GRU, β_C-VAE-LSTM, and β_C-VAE-CNN models on the Yahoo and Yelp corpora. In all our experiments we found that C − 1 ≤ KL ≤ C + 1, demonstrating that the objective function effectively imposed the desired constraint on the KL term. Hence, setting any C > 0 can in practice avoid the collapse issue. The general trend is that increasing the value of C yields better reconstruction (lower distortion), while the amount of gain varies depending on the VAE's architecture and corpus. Additionally, we measured rate and distortion on the CBT, WIKI, and WebText corpora using β_C-VAE-LSTM and observed the same trend with the increase of C; see Table 1. This observation is consistent with the bound on I(x; z) discussed earlier (§2.1): with an increase of KL we increase the upper bound on I(x; z), which in turn allows smaller values of the reconstruction loss. Additionally, as reported in Table 1, encouraging higher rates (via larger C) encourages more active units (AU; Burda et al. (2015)) in the latent code z. As an additional verification, we also group the test sentences into buckets based on their length and report BLEU-2/4 and ROUGE-2/4 metrics to measure the quality of the reconstruction step in Table 1. As expected, we observe that increasing the rate has a consistently positive impact on BLEU and ROUGE scores.

Aggregated Posterior
To understand how the approximated posteriors are affected by the magnitude of the KL, we adopted an approach from Zhao et al. (2017) and looked at the divergence between the aggregated posterior, q_φ(z) = Σ_{x∼q(x)} q_φ(z|x), and the prior p(z). Since during generation we sample from the prior, ideally we would like the aggregated posterior to be as close as possible to the prior.
We obtained unbiased samples of z by first sampling an x from the data and then z ∼ q_φ(z|x), and measured the log determinant of the covariance of the samples, log det(Cov[q_φ(z)]). As reported in Figure 1, we observed that log det(Cov[q_φ(z)]) degrades as C grows, indicating sharper approximate posteriors. We then consider the difference of p(z) and q(z) in their means and variances by computing the KL divergence from the moment-matching Gaussian fit of q(z) to p(z). This returns smaller values for β_C=5-VAE-GRU (Yelp: 0, Yahoo: 0) and larger values for β_C=100-VAE-GRU (Yelp: 8, Yahoo: 5), which illustrates that the overlap between q_φ(z) and p(z) shrinks further as C grows.⁶

⁶ To see if the conclusions hold with a different number of parameters, we doubled the number of parameters in β_C-VAE-GRU and β_C-VAE-LSTM and observed a similar pattern with a slight change in performance.
The above observation is better pronounced in Table 1, where we also report the squared norm of the mean (||μ||_2^2) of unbiased samples of z, highlighting the divergence from the mean of the prior distribution as the rate increases. Therefore, for lower C, the latent variables observed during training are closer to samples generated from the prior, which makes the decoder more suitable for generation. We examine this hypothesis in the following section.
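The two diagnostics above can be sketched as follows over a matrix of posterior samples. The formula for the moment-matched KL to N(0, I) is the standard closed form between Gaussians; the function name is ours:

```python
import numpy as np

def aggregated_posterior_stats(z_samples):
    """Diagnostics for samples z ~ q_phi(z), shape [N, dim].

    Returns log det(Cov[q(z)]) and the KL from the moment-matched
    Gaussian fit N(mu_q, Sigma_q) of q(z) to the prior N(0, I):
        KL = 0.5 * ( tr(Sigma_q) + mu_q^T mu_q - dim - log det(Sigma_q) ).
    """
    mu = z_samples.mean(axis=0)
    cov = np.cov(z_samples, rowvar=False)
    _, logdet = np.linalg.slogdet(cov)          # numerically stable log det
    kl_to_prior = 0.5 * (np.trace(cov) + mu @ mu - len(mu) - logdet)
    return logdet, kl_to_prior

# For samples drawn from the prior itself, both statistics are near 0;
# sharper, shifted posteriors drive logdet down and the KL up.
rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 8))
logdet, kl = aggregated_posterior_stats(z)
```

Under this view, the degradation of log det(Cov) with larger C corresponds to the per-x posteriors becoming sharp spikes whose mixture no longer fills the prior.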

Text Generation
To empirically examine how channel capacity translates into the generative capacity of the model, we experimented with the β_C-VAE-LSTM models from Table 1. To generate a novel sentence after a model is trained, a latent variable z is sampled from the prior distribution and then transformed into a sequence of words by the decoder p(x|z).
During decoding for generation we try three decoding schemes: (i) Greedy, which selects the most probable word at each step; (ii) Top-k (Fan et al., 2018), which at each step samples from the k most probable words; and (iii) Nucleus Sampling (NS) (Holtzman et al., 2019), which at each step samples from a flexible subset of the most probable words chosen based on their cumulative mass (set by a threshold p, where p = 1 means sampling from the full distribution). While similar to Top-k, the benefit of the NS scheme is that the vocabulary size at each time step of decoding varies, a property that encourages diversity and avoids the degenerate text patterns of greedy or beam search decoding (Holtzman et al., 2019). We experiment with NS (p = {0.5, 0.9}) and Top-k (k = {5, 15}).
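The Top-k and nucleus truncation steps can be sketched as a single filter over the next-word distribution. This is a simplified numpy illustration of the two schemes as described in the cited papers, not the paper's decoding code:

```python
import numpy as np

def filter_logits(logits, top_k=0, top_p=1.0):
    """Truncate a next-word distribution as in Top-k (Fan et al., 2018)
    and Nucleus Sampling (Holtzman et al., 2019).

    Returns renormalised probabilities; words outside the chosen set get
    probability 0.  (Ties at the k-th probability keep all tied words.)
    """
    probs = np.exp(logits - logits.max())        # stable softmax
    probs /= probs.sum()
    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:                                # k most probable words
        keep &= probs >= np.sort(probs)[-top_k]
    if top_p < 1.0:                              # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1
        mask = np.zeros_like(keep)
        mask[order[:cutoff]] = True
        keep &= mask
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()
```

With top_p < 1, the number of surviving words varies from step to step depending on how peaked the distribution is, which is exactly the property contrasted with Top-k above.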

Table 2: Homotopy (CBT corpus). The three blocks correspond to the values C = {3, 15, 100} used for training β_C-VAE-LSTM. The columns correspond to the three decoding schemes: greedy, Top-k (k = 15), and nucleus sampling (NS; p = 0.9). The initial two latent variables z were sampled from the prior distribution, z ∼ p(z), and the other five latent variables were obtained by interpolation. Sequences highlighted in gray are those decoded into the same sentence when conditioned on different latent variables. Note: even though the learned latent representations differ across models (trained with different C), for consistency all generated sequences in the table were decoded from the same seven latent variables.

Qualitative Analysis
We follow the setting of the homotopy experiment of Bowman et al. (2016): first a set of latent variables is obtained by performing a linear interpolation between z1 ∼ p(z) and z2 ∼ p(z); then each z in the set is converted into a sequence of words by the decoder p(x|z). Besides the initial motivation of Bowman et al. (2016) to examine what neighbouring latent codes look like, our additional incentive is to analyse how sensitive the decoder is to small variations in the latent variable when trained with different channel capacities, C = {3, 15, 100}. Table 2 shows the generated sentences via different decoding schemes for each channel capacity. For space reasons, we only report the generated sentences for greedy, Top-k (k = 15), and NS (p = 0.9). To make the generated sequences comparable across different decoding schemes or C values, we use the same samples of z for decoding.
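The interpolation step above is just a linear path in latent space; a minimal sketch (the decoder call that would turn each row into a sentence is omitted):

```python
import numpy as np

def homotopy(z1, z2, steps=7):
    """Linear interpolation between two latent codes z1, z2 ~ p(z).

    Returns a [steps, dim] array whose first and last rows are z1 and z2;
    each row would then be decoded into a sentence by p(x|z).
    """
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z1[None, :] + t * z2[None, :]

rng = np.random.default_rng(1)
z1, z2 = rng.standard_normal(32), rng.standard_normal(32)
codes = homotopy(z1, z2, steps=7)  # seven codes, matching Table 2's rows
```

Decoding all seven rows with the same trained decoder is what makes the sensitivity comparison across C values meaningful.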

Sensitivity of Decoder
To examine the sensitivity of the decoder to variations of the latent variable, we consider the sentences generated with the greedy decoding scheme (the first column in Table 2). The other two schemes are not suitable for this analysis as they include a sampling procedure, meaning that decoding the same latent variable twice yields two different sentences. We observed that with lower channel capacity (C = 3) the decoder tends to generate identical sentences for the interpolated latent variables (highlighted in gray in Table 2), exhibiting the decoder's low sensitivity to variations in z. However, with the increase of channel capacity (C = 15, 100) the decoder becomes more sensitive. This observation is further supported by the increasing number of active units in Table 1: given that AU increases with C, one would expect the activation pattern of a latent variable to become more complex as it encodes more information, so a small change in the pattern has a greater effect on the decoder.

Coherence of Sequences
We observe that the model trained with large values of C compromises the coherence of sequences during sampling. This is especially evident when comparing C = 3 with C = 100. Analysis of the Top-k (k = 15) and NS (p = 0.9) samples reveals that the lack of coherence is not due to the greedy decoding scheme per se, and can be attributed to the model in general.
To understand this behaviour further, we need two additional results from Table 1: LogDetCov and ||μ||_2^2. One can notice that as C increases, LogDetCov decreases and ||μ||_2^2 increases. This indicates that the aggregated posterior moves further away from the prior; hence the latent codes seen during training diverge more from the codes sampled from the prior during generation. We speculate this contributes to the incoherence of the generated samples, as the decoder is not equipped to decode prior samples properly at higher values of C.

Quantitative Analysis
Quantitative analysis of generated text without gold reference sequences (e.g., in Machine Translation or Summarization) has been a long-standing challenge. Recently, there have been efforts in this direction, with proposals such as self-BLEU (Zhu et al.), forward cross entropy (FCE; Cífka et al., 2018) and Fréchet InferSent Distance (FID; Cífka et al., 2018). We opted for FCE as a complementary metric to our qualitative analysis. To calculate FCE, first a collection of synthetic sentences is generated by sampling z ∼ p(z) and decoding the samples into sentences. The synthetic sequences are then used to train a language model (an LSTM with the parametrisation of our decoder). The FCE score is estimated as the negative log-likelihood (NLL) of the trained LM on the set of human-generated sentences.
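The direction of the FCE metric can be illustrated with a toy stand-in: here a Laplace-smoothed unigram LM replaces the LSTM LM used in the paper (an illustrative simplification; the function name and smoothing choice are ours):

```python
import math
from collections import Counter

def forward_cross_entropy(synthetic, real, alpha=1.0):
    """FCE sketch: fit an LM on model-generated text, report its
    per-token NLL on human text.  Lower is better -- the synthetic
    corpus statistically resembles the real one.
    """
    counts = Counter(w for s in synthetic for w in s.split())
    vocab = set(counts) | {w for s in real for w in s.split()}
    total = sum(counts.values())

    def logp(w):  # Laplace-smoothed unigram log-probability
        return math.log((counts[w] + alpha) / (total + alpha * len(vocab)))

    toks = [w for s in real for w in s.split()]
    return -sum(logp(w) for w in toks) / len(toks)
```

A synthetic corpus whose word statistics match the human test set yields a lower FCE than an unrelated one, which is the comparison made across C values below.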
We generated synthetic corpora using the trained models from Table 1 with different C values and decoding schemes, using the exact same z samples for all corpora. Since corpora generated with different C values would have different coverage of the words in the test set (i.e., different Out-of-Vocabulary ratios), we used a fixed vocabulary to minimize the effect of vocabulary differences in our analysis. Our dictionary contains the words common to all three corpora, while words outside this dictionary are replaced with the 〈unk〉 symbol. Similarly, we used this fixed dictionary to preprocess the test sets. Also, to reduce bias towards a particular set of sampled z's, we measure the FCE score three times; each time we sampled a new training corpus from a β_C-VAE-LSTM decoder and trained an LM from scratch. In Table 3 we report the average FCE (NLL) for the generated corpora.
In the qualitative analysis we observed that the text generated by β_C-VAE-LSTM trained with the large value C = 100 exhibits lower quality (i.e., in terms of coherence). This observation is supported by the FCE score of the NS (p = 0.9) decoding scheme (Table 3): performance drops when the LM is trained on the corpus generated with C = 100. The corpora generated with C = 3 and C = 15 achieve similar FCE scores. However, these patterns are reversed for the Greedy decoding scheme, where the general tendency of the FCE scores suggests that for larger values of C the β_C-VAE-LSTM generates text which better approximates the natural sentences in the test set. To understand this further, we report additional statistics in Table 3: the percentage of 〈unk〉 symbols, self-BLEU, and the average sentence length in each corpus.
The average sentence length in the generated corpora is very similar for both decoding schemes, removing the possibility that the pathological pattern in the FCE scores was caused by differences in sentence length. However, we observe that for Greedy decoding more than 30% of the test set consists of 〈unk〉. Intuitively, seeing more evidence of this symbol during training improves the LM's estimate for 〈unk〉. As reported in the table, the %unk increases on almost all corpora as C grows, which then translates into a better FCE score at test time. Therefore, we believe that FCE at high %unk is not a reliable quantitative metric for assessing the quality of the generated synthetic corpora. Furthermore, for Greedy decoding, self-BLEU decreases as C increases. This suggests that the sentences generated with higher values of C are more diverse. Hence, an LM trained on a more diverse corpus can generalise better, which in turn affects the FCE.
In contrast, the effect the 〈unk〉 symbol has on the corpora generated with the NS (p = 0.9) decoding scheme is minimal, for two reasons. First, the vocabulary size of the generated corpora, for all values of C, is close to that of the original corpus (the corpus used to train β_C-VAE-LSTM). Second, the vocabularies of the corpora generated with the three values of C are very close to each other. As a result, minimal replacement of words with the 〈unk〉 symbol is required, making the experiment more reflective of the quality of the generated text. Similarly, self-BLEU for NS (p = 0.9) is the same for all values of C. This suggests that the diversity of the sentences has minimal, if any, effect on the FCE.
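The self-BLEU statistic used above can be sketched as follows: each sentence is scored with BLEU against the rest of the corpus, and the scores are averaged. This is a simplified BLEU-2 (clipped unigram/bigram precision with brevity penalty), an illustrative stand-in for the full metric, not the paper's evaluation code:

```python
import math
from collections import Counter

def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(candidate, references):
    """Simplified BLEU-2: geometric mean of clipped 1/2-gram precisions
    times a brevity penalty against the closest-length reference."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in (1, 2):
        cand_counts = _ngram_counts(cand, n)
        max_ref = Counter()
        for r in refs:
            for g, c in _ngram_counts(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(clipped / max(1, sum(cand_counts.values())))
    if min(precisions) == 0:
        return 0.0
    closest = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

def self_bleu(corpus):
    """Average BLEU of each sentence against the rest of the corpus.
    Higher self-BLEU means less diverse (more repetitive) generations."""
    return sum(bleu2(s, corpus[:i] + corpus[i + 1:])
               for i, s in enumerate(corpus)) / len(corpus)
```

A corpus of identical sentences scores self-BLEU 1.0, while a corpus with no shared n-grams scores 0.0, matching the diversity reading given to Table 3 above.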

Syntactic Test
In this section, we explore whether any form of syntactic information is captured by the encoder and represented in the latent codes, despite the lack of any explicit syntactic signal during the training of β_C-VAE-LSTM. To train the models we used the same WIKI data set as Marvin and Linzen (2018), but we filtered out all sentences longer than 50 space-separated tokens. We use the data set of Marvin and Linzen (2018), which consists of pairs of grammatical and ungrammatical sentences, to test various syntactic phenomena. For example, a pair in the subject-verb agreement category would be: (The author laughs, The author laugh). We encode the grammatical and ungrammatical sentences into the latent codes z+ and z−, respectively. Then we condition the decoder on z+ and determine whether it assigns a higher probability to the grammatical sentence (denoted by x+): p(x−|z+) < p(x+|z+) (denoted by p1 in Table 4). We repeat the same experiment, but this time determine whether the decoder, when conditioned on the ungrammatical code (z−), still prefers to assign a higher probability to the grammatical sentence: p(x−|z−) < p(x+|z−) (denoted by p2 in Table 4). Table 4 shows p1 and p2 for the β_C-VAE-LSTM model trained with C = {3, 100}. Both p1 and p2 are analogous to accuracy and correspond to how often a grammatical sentence was assigned a higher probability.
Under C = 3, p1 and p2 behave similarly; this is expected, since lower channel capacity encourages a more dominant decoder, which in our case was trained on grammatical sentences from WIKI. On the other hand, this illustrates that despite avoiding the KL-collapse issue, the dependence of the decoder on the latent code is so negligible that the decoder hardly distinguishes grammatical from ungrammatical inputs. This changes for C = 100: in almost all cases the decoder becomes strongly dependent on the latent code and can differentiate between what it has seen as input and a closely similar sentence it has not received as input. The decoder assigns a larger probability to the ungrammatical sentence when conditioned on z− and, similarly, a larger probability to the grammatical sentence when conditioned on z+. However, the above observations neither confirm nor reject the existence of a grammar signal in the latent codes. We therefore run a second set of experiments in which we aim to discard sentence-specific information from the latent codes by averaging the codes inside each syntactic category.¹⁰ The averaged codes are denoted by z̄+ and z̄−, and the corresponding accuracies are reported as p̄1 and p̄2 in Table 4. Our hypothesis is that the only factor invariant to averaging the codes inside a category is the grammatical property of its corresponding sentences.

¹⁰ Each syntactic category is further divided into sub-categories (for instance, simple subject-verb agreement); we average the z's within each sub-category.
As expected, due to the weak dependence of the decoder on the latent code, the performance of the model under C = 3 is almost identical (not included for space reasons) when comparing p1 vs. p̄1 and p2 vs. p̄2. However, for C = 100 the performance of the model deteriorates. While we leave further exploration of this behaviour to future work, we speculate it could indicate one of two things: an increase of complexity in the latent code, which encourages a higher variance around the mean, or the absence of a syntactic signal in the latent codes.
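The p1/p2 bookkeeping described above can be sketched as a small scoring routine over precomputed decoder log-likelihoods (the 4-tuple interface is our assumption; in the actual experiment each score would come from evaluating the trained decoder on a sentence pair):

```python
def p1_p2(scored_pairs):
    """Grammaticality test sketch.

    scored_pairs: list of 4-tuples per sentence pair,
        ( log p(x+|z+), log p(x-|z+), log p(x+|z-), log p(x-|z-) ).
    Returns (p1, p2): the fraction of pairs where the grammatical
    sentence x+ wins under the grammatical code z+ (p1) and under the
    ungrammatical code z- (p2), respectively.
    """
    n = len(scored_pairs)
    p1 = sum(pos > neg for pos, neg, _, _ in scored_pairs) / n
    p2 = sum(pos > neg for _, _, pos, neg in scored_pairs) / n
    return p1, p2
```

A decoder that ignores the latent code yields p1 ≈ p2, while a strongly code-dependent decoder pushes p1 up and p2 down, which is the pattern contrasted between C = 3 and C = 100 above.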

Discussion and Conclusion
In this paper we analysed the interdependence of the KL term in the Evidence Lower Bound (ELBO) and the properties of the approximated posterior for text generation. To perform the analysis we used an information-theoretic framework based on a variant of the β-VAE objective, which permits explicit control of the KL term and treats KL as a mechanism to control the amount of information transmitted between the encoder and decoder.
The immediate impact of the explicit constraint is avoiding the collapse issue (D_KL = 0) by setting a non-zero positive constraint (C > 0) on the KL term (|D_KL(q_φ(z|x) || p(z)) − C|). We experimented with a range of constraints (C) on the KL term and various strong and weak decoder architectures (LSTM, GRU, and CNN), and empirically confirmed that in all cases the constraint was satisfied.
We showed that a higher value of KL encourages not only divergence from the prior distribution, but also a sharper and more concentrated approximate posterior. It makes the decoder more sensitive to variations in the latent code, and makes a model with higher KL less suitable for generation, as the latent variables observed during training are farther away from the prior samples used during generation. To analyse the impact on generation we conducted a set of qualitative and quantitative experiments.
In the qualitative analysis we showed that small and large values of the KL term impose different properties on the generated text: the decoder trained under a smaller KL term tends to generate repetitive but mainly plausible sentences, while for larger KL the generated sentences were diverse but incoherent. This behaviour was observed across three different decoding schemes and complemented by a quantitative analysis in which we measured the performance of an LSTM LM trained on VAE-generated synthetic corpora with different KL magnitudes and tested on human-generated sentences.
Finally, in an attempt to understand the ability of the latent code in VAEs to represent some form of syntactic information, we tested the ability of the model to distinguish between grammatical and ungrammatical sentences. We verified that at lower (and still non-zero) KL the decoder tends to pay less attention to the latent code, but our findings regarding the presence of a syntactic signal in the latent code were inconclusive. We leave this as a possible avenue to explore in future work. We also plan to develop practical algorithms for the automatic selection of the value of C, and to verify our findings under multi-modal priors and complex posteriors.