Implicit Deep Latent Variable Models for Text Generation

Deep latent variable models (LVMs) such as the variational auto-encoder (VAE) have recently played an important role in text generation. One key factor is the exploitation of smooth latent structures to guide the generation. However, the representation power of VAEs is limited for two reasons: (1) the Gaussian assumption is often made on the variational posteriors; and (2) a notorious "posterior collapse" issue occurs. In this paper, we advocate sample-based representations of variational distributions for natural language, leading to implicit latent features, which provide flexible representation power compared with Gaussian-based posteriors. We further develop an LVM that directly matches the aggregated posterior to the prior. It can be viewed as a natural extension of VAEs with a regularization that maximizes mutual information, mitigating the "posterior collapse" issue. We demonstrate the effectiveness and versatility of our models in various text generation scenarios, including language modeling, unaligned style transfer, and dialog response generation. The source code to reproduce our experimental results is available on GitHub.


Introduction
Deep latent variable models (LVMs) such as the variational auto-encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) have been successfully applied to many natural language processing tasks, including language modeling (Bowman et al., 2015; Miao et al., 2016), dialogue response generation (Zhao et al., 2017b), controllable text generation (Hu et al., 2017), and neural machine translation (Shah and Barber, 2018). One advantage of VAEs is the flexible distribution-based latent representation. It captures holistic properties of the input, such as style, topic, and high-level linguistic/semantic features, which further guide the generation of diverse and relevant sentences.
However, the representation capacity of VAEs is restricted for two reasons. The first reason is rooted in the assumption on variational posteriors, which usually follow spherical Gaussian distributions with diagonal covariance matrices. It has been shown that an approximation gap generally exists between the true posterior and the best possible variational posterior in a restricted family (Cremer et al., 2018). Consequently, the gap may militate against learning an optimal generative model, as its parameters may always be updated based on sub-optimal posteriors (Kim et al., 2018). The second reason is the so-called posterior collapse issue, which occurs when learning VAEs with an auto-regressive decoder (Bowman et al., 2015). It produces undesirable outcomes: the encoder yields meaningless posteriors that are very close to the prior, while the decoder tends to ignore the latent codes in generation (Bowman et al., 2015). Several attempts have been made to alleviate this issue (Bowman et al., 2015; Higgins et al., 2017; Zhao et al., 2017a; Fu et al., 2019; He et al., 2019).
These two seemingly unrelated issues have been studied independently. In this paper, we argue that the posterior collapse issue is partially due to the restrictive Gaussian assumption, as it limits the optimization space of the encoder/decoder within a given distribution family. (i) To break the assumption, we propose to use sample-based representations for natural language, leading to implicit latent features. Such a representation is much more expressive than Gaussian-based posteriors. (ii) This implicit representation allows us to extend the VAE and develop a new LVM that further mitigates the posterior collapse issue. It represents all sentences in the dataset as posterior samples in the latent space, and matches the aggregated posterior samples to the prior distribution. Consequently, latent features are encouraged to cooperate and behave diversely to capture meaningful information for each sentence.
However, learning with implicit representations faces one challenge: it is intractable to evaluate the KL divergence term in the objectives. We overcome this issue by introducing a conjugate-dual form of the KL divergence (Rockafellar et al., 1966; Dai et al., 2018), which facilitates learning via training an auxiliary dual function. The effectiveness of our models is validated by producing consistently state-of-the-art results on a broad range of generation tasks, including language modeling, unsupervised style transfer, and dialog response generation.

Preliminaries
When applied to text generation, VAEs (Bowman et al., 2015) consist of two parts: a generative network (decoder) and an inference network (encoder). Given a training dataset D = {x_i}_{i=1}^N and starting from a prior distribution p(z), a VAE generates a sentence x using the deep generative network p_θ(x|z), where θ denotes the network parameters. The joint distribution p_θ(x, z) is therefore defined as p(z) p_θ(x|z). The prior p(z) is typically assumed to be a standard multivariate Gaussian. Due to the sequential nature of natural language, the decoder p_θ(x|z) takes an auto-regressive form: p_θ(x|z) = ∏_{t=1}^T p_θ(x_t | x_{<t}, z). The goal of model training is to maximize the marginal data log-likelihood E_{x∼D}[log p_θ(x)].
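As a concrete illustration of the auto-regressive factorization, the sketch below accumulates per-token log-probabilities into a sentence log-likelihood. A hand-specified toy distribution stands in for the real decoder p_θ(x_t | x_{<t}, z); the function name and probability values are illustrative assumptions, not part of the original model:

```python
import numpy as np

def autoregressive_log_likelihood(token_ids, step_probs):
    """log p(x|z) = sum_t log p(x_t | x_<t, z).

    step_probs[t] is the decoder's distribution over the vocabulary at
    step t, assumed already conditioned on the prefix x_<t and on z.
    """
    return float(sum(np.log(step_probs[t][tok])
                     for t, tok in enumerate(token_ids)))

# Toy example: a 3-token sentence over a 4-word vocabulary.
probs = [np.array([0.7, 0.1, 0.1, 0.1]),   # p(x_1 | z)
         np.array([0.2, 0.6, 0.1, 0.1]),   # p(x_2 | x_1, z)
         np.array([0.1, 0.1, 0.1, 0.7])]   # p(x_3 | x_1, x_2, z)
ll = autoregressive_log_likelihood([0, 1, 3], probs)
```

The product over time steps becomes a sum of log-probabilities, which is the reconstruction term that the ELBO below averages over posterior samples of z.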
However, it is intractable to perform posterior inference. A φ-parameterized encoder is introduced to approximate p_θ(z|x) ∝ p_θ(x|z) p(z) with a variational distribution q_φ(z|x). Variational inference is employed for VAE learning, yielding the following evidence lower bound (ELBO):

L_ELBO = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z))  ≜ L_1
       = log p_θ(x) − KL(q_φ(z|x) ‖ p_θ(z|x))  ≜ L_2.   (1)

Note that L_1 and L_2 provide two different views of the VAE objective:
• L_1 consists of a reconstruction term L_E = E_{q_φ(z|x)}[log p_θ(x|z)] and a KL divergence regularization term L_R = −KL(q_φ(z|x) ‖ p(z)). With a strong auto-regressive decoder p_θ(x|z), the objective tends to degenerate every encoding distribution q_φ(z|x) to the prior, driving L_R → 0, i.e., the posterior collapse issue.
• L_2 indicates that the VAE requires a flexible encoding distribution family to minimize the approximation gap L_G = KL(q_φ(z|x) ‖ p_θ(z|x)) between the true posterior and the best possible encoding distribution. This motivates us to perform more flexible posterior inference with implicit representations.
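To make the Gaussian special case and the collapse behaviour of L_R concrete, the following sketch evaluates the closed-form KL(N(μ, diag(σ²)) ‖ N(0, I)). When the posterior degenerates to the prior (μ = 0, σ = 1) the regularizer vanishes exactly, which is the posterior collapse signature described above. The helper name is our own:

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ):
    0.5 * sum( mu^2 + sigma^2 - log sigma^2 - 1 )."""
    return 0.5 * float(np.sum(mu**2 + np.exp(logvar) - logvar - 1.0))

# A collapsed posterior (mu = 0, sigma = 1) makes the KL regularizer zero,
# so the latent code carries no information about the sentence.
kl_collapsed = gaussian_kl_to_standard_normal(np.zeros(8), np.zeros(8))
# An informative posterior (nonzero mean) pays a positive KL cost.
kl_informative = gaussian_kl_to_standard_normal(np.full(8, 1.5), np.zeros(8))
```

With a powerful auto-regressive decoder, the training signal often pushes every posterior toward the first (zero-KL) configuration, which is exactly the failure mode the implicit models below are designed to avoid.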

The Proposed Models
We introduce a sample-based latent representation for natural language, and develop two models that leverage its advantages. (1) Replacing the Gaussian variational distributions in VAEs with sample-based distributions, we derive the implicit VAE (iVAE). (2) We further extend the VAE to maximize the mutual information between latent representations and observed sentences, leading to a variant termed iVAE_MI.

Implicit VAE
Implicit Representations. Instead of assuming an explicit density form such as a Gaussian, we define a sampling mechanism to represent q_φ(z|x) as a set of samples {z_{x,i}}_{i=1}^M, produced by the encoder as

z_{x,i} = G_φ(x, ε_i),  ε_i ∼ q(ε),   (6)

where the i-th sample is drawn by passing (x, ε_i) through a neural network G_φ, and q(ε) is a simple distribution such as a standard Gaussian. It is difficult to naively combine the random noise with the sentence x (a sequence of discrete tokens) as the input of G_φ. Our solution is to concatenate the noise ε_i with a hidden representation h of x, where h is generated by an LSTM encoder, as illustrated in Figure 1.
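A minimal sketch of this sampling mechanism, assuming a fixed vector h in place of the LSTM hidden state and a single linear-plus-tanh layer in place of the full network G_φ; only the concatenate-noise-then-transform pattern mirrors the text, the weights and dimensions are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_encoder_samples(h, n_samples, noise_dim, W, b):
    """Draw z_{x,i} = G_phi([h; eps_i]) with eps_i ~ N(0, I).

    h    : hidden representation of sentence x (stand-in for the LSTM output).
    W, b : weights of a one-layer G_phi (a real model would use an MLP).
    """
    samples = []
    for _ in range(n_samples):
        eps = rng.standard_normal(noise_dim)
        inp = np.concatenate([h, eps])        # concatenate noise with h
        samples.append(np.tanh(W @ inp + b))  # one nonlinear layer as G_phi
    return np.stack(samples)

h = np.ones(4)                                # stand-in LSTM hidden state
W = rng.standard_normal((2, 4 + 3)) * 0.1     # latent dim 2, noise dim 3
z = implicit_encoder_samples(h, n_samples=16, noise_dim=3, W=W, b=np.zeros(2))
```

The returned set of 16 latent samples plays the role of q_φ(z|x): no density is ever written down, the distribution exists only through its samples.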
Dual form of the KL divergence. Though theoretically promising, the implicit representation in (6) renders the KL term L_R in (3) difficult to optimize, as its functional form is no longer tractable with an implicit q_φ(z|x). We resort to evaluating its dual form based on the Fenchel duality theorem (Rockafellar et al., 1966; Dai et al., 2018):

KL(q_φ(z|x) ‖ p(z)) = max_ν { E_{z∼q_φ(z|x)}[ν(x, z)] − E_{z∼p(z)}[exp(ν(x, z))] } + 1,   (7)
where ν_ψ(x, z) is an auxiliary dual function, parameterized by a neural network with weights ψ. Replacing the KL term with this dual form yields the implicit VAE objective: the encoder and decoder maximize the reconstruction term penalized by ν_ψ, while ν_ψ is trained to keep the dual estimate of the KL term tight.

Training scheme. Implicit VAE inherits the end-to-end training scheme of VAEs, with extra work on training the auxiliary network ν_ψ(x, z):
• Sample a mini-batch of x_i ∼ D and ε_i ∼ q(ε), and generate posterior samples z_{x_i} = G_φ(x_i, ε_i) as well as prior samples z_i ∼ p(z);
• Update ψ to maximize E[ν_ψ(x_i, z_{x_i})] − E[exp(ν_ψ(x_i, z_i))];
• Update parameters {φ, θ} to maximize E[log p_θ(x_i | z_{x_i}) − ν_ψ(x_i, z_{x_i})].
In practice, we implement ν_ψ(x, z) with a multilayer perceptron (MLP) that takes the concatenation of h and z as input. In other words, the auxiliary network distinguishes between (x, z_x) and (x, z), where z_x is drawn from the posterior and z from the prior. We found that the MLP-parameterized auxiliary network converges faster than the LSTM encoder and decoder (Hochreiter and Schmidhuber, 1997), which means that it practically provides an accurate approximation to the KL regularizer L_R.
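The dual form can be sanity-checked numerically. In the sketch below, q and p are one-dimensional Gaussians whose KL is known in closed form (μ²/2); plugging the optimal dual function ν*(z) = log q(z) − log p(z) into the Monte Carlo estimate E_q[ν] − E_p[exp(ν)] + 1 recovers the true KL, while any other ν gives a lower value. This is a self-contained numerical illustration, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(1)

def dual_kl_estimate(nu, q_samples, p_samples):
    """Fenchel-dual estimate  E_q[nu(z)] - E_p[exp(nu(z))] + 1  <=  KL(q||p).
    The bound is tight when nu(z) = log q(z) - log p(z)."""
    return float(np.mean(nu(q_samples)) - np.mean(np.exp(nu(p_samples))) + 1.0)

# q = N(mu, 1), p = N(0, 1): true KL = mu^2 / 2, and the optimal dual
# function is nu*(z) = log q(z) - log p(z) = mu*z - mu^2/2.
mu = 1.0
q = rng.standard_normal(200_000) + mu
p = rng.standard_normal(200_000)
nu_star = lambda z: mu * z - mu**2 / 2
est = dual_kl_estimate(nu_star, q, p)          # should be close to 0.5
est_bad = dual_kl_estimate(lambda z: 0.0 * z, q, p)  # suboptimal nu
```

In the model, the auxiliary network ν_ψ plays the role of ν and is pushed toward ν* by the gradient updates in the training scheme above.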

Mutual Information Regularized iVAE
We note an inherent deficiency of the original VAE objective in (3): the KL divergence regularization term matches each posterior distribution independently to the same prior. This makes the model prone to posterior collapse in text generation, due to a strong auto-regressive decoder p_θ(x|z): when sequentially generating x_t, the model learns to rely solely on the ground-truth prefix x_{<t} and to ignore the dependency on z (Fu et al., 2019). As a result, the learned variational posteriors q_φ(z|x) exactly match p(z), without discriminating between data x.
To better regularize the latent space, we propose to replace L_R = E_{x∼D}[−KL(q_φ(z|x) ‖ p(z))] in (3) with the following aggregate KL divergence:

−KL(q_φ(z) ‖ p(z)),   (11)

where q_φ(z) = ∫ q(x) q_φ(z|x) dx is the aggregated posterior and q(x) is the empirical data distribution of the training dataset D. The integral is estimated by ancestral sampling in practice, i.e., we first sample x from the dataset and then sample z ∼ q_φ(z|x).
In (11), the variational posterior is regularized as a whole via q_φ(z), encouraging posterior samples from different sentences to cooperate to satisfy the objective. This implies a solution in which each sentence is represented as a local region in the latent space, while the aggregated representation of all sentences matches the prior. It avoids the degenerate solution of (3), in which the feature representation of each individual sentence spans the whole space.
Connection to mutual information. The proposed latent variable model coincides with (Zhao et al., 2017a, 2018), where mutual information is introduced into the optimization, based on the following decomposition (see the detailed proof in Appendix A):

−KL(q_φ(z) ‖ p(z)) = −E_{x∼D}[KL(q_φ(z|x) ‖ p(z))] + I(x, z),

where I(x, z) is the mutual information between z and x under the joint distribution q_φ(x, z) = q(x) q_φ(z|x). Therefore, the objective in (11) also maximizes the mutual information between individual sentences and their latent features. We term the new LVM objective iVAE_MI:

L_iVAE_MI = E_{x∼D} E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z) ‖ p(z)).   (12)

Training scheme. Note that the aggregated posterior q_φ(z) is also a sample-based distribution.
Similarly, we evaluate (12) through its dual form:

KL(q_φ(z) ‖ p(z)) = max_ν { E_{z∼q_φ(z)}[ν(z)] − E_{z∼p(z)}[exp(ν(z))] } + 1.   (13)

Therefore, iVAE_MI in (12) can be written as:

L_iVAE_MI = min_ν E_{x∼D} E_{q_φ(z|x)}[log p_θ(x|z) − ν(z)] + E_{p(z)}[exp(ν(z))] − 1,   (14)

where the auxiliary network ν_ψ(z) is parameterized as a neural network. Different from iVAE, ν_ψ(z) in iVAE_MI takes only posterior samples as input. The training algorithm is similar to that of iVAE in Section 3.1, except for the different auxiliary network ν_ψ(z). Appendix B gives the full algorithm of iVAE_MI. We illustrate the proposed methods in Figure 1. Note that iVAE and iVAE_MI share the same model architecture, except for the different auxiliary network ν.
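The difference between regularizing each posterior separately (VAE) and regularizing the aggregated posterior (iVAE_MI) can be verified exactly on a discrete toy model, using the decomposition E_x[KL(q_φ(z|x) ‖ p)] = KL(q_φ(z) ‖ p) + I(x, z). The toy distributions below are our own assumptions:

```python
import numpy as np

def kl(q, p):
    """KL divergence between two discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

# Discrete toy model: two equally likely "sentences" x, binary latent z.
p_z = np.array([0.5, 0.5])              # prior p(z)
q_z_given_x = np.array([[0.9, 0.1],     # q(z | x = 0)
                        [0.2, 0.8]])    # q(z | x = 1)
q_z = q_z_given_x.mean(axis=0)          # aggregated posterior q(z)

per_x_kl = np.mean([kl(q, p_z) for q in q_z_given_x])   # VAE regularizer
agg_kl = kl(q_z, p_z)                                   # iVAE_MI regularizer
mutual_info = np.mean([kl(q, q_z) for q in q_z_given_x])  # I(x, z)
```

Because per_x_kl = agg_kl + I(x, z) holds as an identity, penalizing only the aggregated KL leaves the mutual-information part of the per-sentence KL unpenalized, which is exactly why (11) encourages informative, sentence-specific codes.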

Solutions to posterior collapse
Several attempts have been made to alleviate the posterior collapse issue. The KL annealing scheme for language VAEs was first used in (Bowman et al., 2015). An effective cyclical KL annealing schedule is used in (Fu et al., 2019), where the KL annealing process is repeated multiple times. A KL term weighting scheme is also adopted in β-VAE (Higgins et al., 2017) for disentanglement. On the model architecture side, dilated CNNs were considered as a replacement for auto-regressive LSTM decoders (Yang et al., 2017). A bag-of-words auxiliary loss was proposed to strengthen the dependence on latent representations in generation (Zhao et al., 2017b). More recently, lagging inference proposes to aggressively update the encoder multiple times before each decoder update (He et al., 2019). Semi-amortized VAE refines the variational parameters from an amortized encoder per instance with stochastic variational inference (Kim et al., 2018).
All these efforts utilize Gaussian-based forms for posterior inference. Our paper is among the first to attribute the posterior collapse issue to the restrictive Gaussian assumption, and to advocate more flexible sample-based representations.

Implicit Feature Learning
Sample-based distributions, as well as implicit features, have been widely used in representation learning (Donahue et al., 2017; Li et al., 2017a). Vanilla autoencoders learn point masses of latent features rather than their distributions. Adversarial variational Bayes introduces an auxiliary discriminator network in the style of GANs (Goodfellow et al., 2014; Makhzani et al., 2015) to learn almost arbitrarily distributed latent variables (Mescheder et al., 2017; Pu et al., 2017b). We explore a similar spirit in the natural language processing (NLP) domain. Amortized MCMC and particle-based methods are introduced for LVM learning in (Li et al., 2017d; Pu et al., 2017a; Chen et al., 2018). Coupled variational Bayes (Dai et al., 2018) emphasizes an optimization embedding, i.e., a flow of particles, in a general setting of non-parametric variational inference; it also utilizes a similar dual form with an auxiliary function ν_ψ(x, z) to evaluate the KL divergence. Adversarially regularized autoencoders (Makhzani et al., 2015; Kim et al., 2017) use objectives similar to iVAEs, in the form of a reconstruction error plus a specific regularizer evaluated with implicit samples. Mutual information has also been considered as a regularizer in (Zhao et al., 2017a, 2018) to obtain more informative representations.
Most previous work focuses on the image domain; implicit representations remain largely unexplored in NLP. Further, the auto-regressive decoder poses an additional challenge when applying implicit latent representations: adversarial training with samples can be empirically unstable and slow, even with recent stabilization techniques from GANs (Arjovsky et al., 2017; Gulrajani et al., 2017). To the best of our knowledge, this paper is the first to effectively apply implicit feature representations to NLP.

Experiments
In this section, the effectiveness of our methods is validated by producing largely state-of-the-art results on a broad range of text generation tasks under various scenarios.

Language Modeling
Datasets. We consider three public datasets: the Penn Treebank (PTB) (Marcus et al., 1993; Bowman et al., 2015), and the Yahoo and Yelp corpora (Yang et al., 2017; He et al., 2019). To characterize the modeling ability on observed sentences, we use the negative ELBO, i.e., the sum of the reconstruction loss and the KL term, as well as perplexity (PPL). Compared with traditional neural language models, VAEs have a unique advantage in feature learning. To measure the quality of the learned features, we consider (1) KL: the KL term KL(q_φ(z|x) ‖ p(z)); (2) mutual information (MI): I(x, z) under the joint distribution q_φ(x, z); and (3) the number of active units (AU) of the latent representation, where the activity of a latent dimension z_d is measured as A_{z_d} = Cov_x(E_{z∼q_φ(z|x)}[z_d]) and the dimension is regarded as active if A_{z_d} > 0.01. The evaluation of implicit LVMs is unexplored in language modeling, as there is no analytical form for the KL term. We evaluate both KL(q_φ(z) ‖ p(z)) and KL(q_φ(z|x) ‖ p(z)) by training fully connected ν networks as in Eq. (7) and (13). To avoid inconsistency between the ν(x, z) and ν(z) networks during training, we train them with the same data and optimizer in every iteration. We evaluate each distribution q_φ(z|x) with 128 latent samples per x.
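The active-units metric can be sketched directly from its definition: a latent dimension counts as active if the variance across the dataset of its posterior mean exceeds a small threshold. The synthetic posterior means below (three informative dimensions, five collapsed) are illustrative assumptions:

```python
import numpy as np

def active_units(mean_codes, threshold=0.01):
    """Count active latent dimensions: a unit is 'active' if the variance
    across the dataset of its posterior mean, Cov_x(E_{z~q(z|x)}[z_d]),
    exceeds a small threshold (0.01 is the commonly used value)."""
    variances = np.var(mean_codes, axis=0)
    return int(np.sum(variances > threshold))

rng = np.random.default_rng(2)
n, d = 1000, 8
codes = np.zeros((n, d))             # rows: E_{z~q(z|x)}[z] for each x
codes[:, :3] = rng.standard_normal((n, 3))  # 3 dims vary with x ...
# ... the remaining 5 dims are collapsed: their posterior mean ignores x.
n_active = active_units(codes)
```

A fully collapsed model would report zero active units, since every posterior mean would be identical regardless of the input sentence.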
We report the results in Table 2. A better language model pursues a lower negative ELBO (with lower reconstruction error and lower PPL) and makes sufficient use of the latent space (i.e., maintains a relatively high KL term, higher mutual information, and more active units). Under all these metrics, the proposed iVAEs achieve much better performance. The posterior collapse issue is largely alleviated, as indicated by the improved KL and MI values, especially with iVAE_MI, which directly takes mutual information into account.
The comparison of training time is shown in Table 3. Since iVAE and iVAE_MI require updating an auxiliary network, they spend about 30% more time than traditional VAEs; this is still more efficient than SA-VAE and Lag-VAE.
Latent space interpolation. One favorable property of VAEs (Bowman et al., 2015; Zhao et al., 2018) is that they provide smooth latent representations that capture sentence semantics. We demonstrate this by interpolating between two latent features, each of which represents a unique sentence. Table 4 shows generated examples. We take two sentences x_1 and x_2, obtain their latent features z_1 and z_2 as sample averages from the implicit encoder, and then greedily decode conditioned on the interpolated feature z_t = (1 − t) · z_1 + t · z_2.

Quality of generated text. A better language model is expected to benefit from utilizing high-quality latent features and from learning a better decoder. We use PTB to confirm this. We draw samples from the prior p(z) and greedily decode them with the trained decoder. The quality of the generated text is evaluated with the external library "KenLM Language Model Toolkit" (Heafield et al., 2013) using two metrics (Kim et al., 2017): (1) Forward PPL: the fluency of the generated text under a language model trained on the PTB training corpus; (2) Reverse PPL: the fluency of the PTB corpus under a language model trained on the generated text, which measures the extent to which the generations are representative of the underlying PTB distribution. For both PPL numbers, lower is better. We use n = 5 for the n-gram language models in KenLM.
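The interpolation procedure itself is a one-liner; the sketch below builds the path z_t = (1 − t)·z_1 + t·z_2 for t from 0 to 1 in steps of 0.1, where each z_t would then be fed to the decoder for greedy decoding. The 2-D vectors are placeholders for real latent features:

```python
import numpy as np

def interpolate_latents(z1, z2, steps=11):
    """Linear interpolation z_t = (1 - t) * z1 + t * z2 for t in [0, 1].
    With steps=11, t moves from 0 to 1 in increments of 0.1."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * z1 + t * z2 for t in ts]

z1, z2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
path = interpolate_latents(z1, z2)
```

Decoding along this path is what produces the smooth semantic transitions shown in Table 4.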
As shown in Table 5, implicit LVMs outperform others in both PPLs, which confirms that the implicit representation can lead to better decoders.The vanilla VAE model performs the worst.This is expected, as the posterior collapse issue results in poor utilization of a latent space.Besides, we can see that iVAE MI generates comparably fluent but more diverse text than pure iVAE, from the lower reverse PPL values.This is reasonable, due to the ability of iVAE MI to encourage diverse latent samples per sentence with the aggregated regularization in the latent space.

Unaligned style transfer
We next consider the task of unaligned style transfer, which represents a scenario of generating text with desired specifications. The goal is to control one style aspect of a sentence, independent of its content. We consider non-parallel corpora of sentences, where the sentences in the two corpora have the same content distribution but different styles, and no paired sentences are provided.
Model Extension. The success of this task depends on exploiting the distributional equivalence of content to learn a sentiment-independent content code and decode it into a different style. To ensure such independence, we extend iVAE_MI by adding a sentiment classifier loss to its objective (14), similar to previous style transfer methods (Shen et al., 2017; Kim et al., 2017). Let y be the style attribute, and let x_p and x_n (with corresponding features z_p and z_n) be sentences with positive and negative sentiment, respectively. The style classifier loss L_class(z_p, z_n) is the cross-entropy loss of a binary classifier.
The classifier and encoder are trained adversarially: (1) the classifier is trained to distinguish latent features with different sentiments; (2) the encoder is trained to fool the classifier, in order to remove distinctions of content features across sentiments. In practice, the classifier is implemented as an MLP. We implement two separate decoder LSTMs for clean sentiment decoding: one for positive sentiment p(x|z, y = 1), and one for negative sentiment p(x|z, y = 0). The prior p(z) is also implemented as an implicit distribution, by transforming noise from a standard Gaussian through an MLP. Appendix C.2.2 lists more details.
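A minimal sketch of the style classifier loss, assuming the classifier outputs a single logit per latent code: the classifier step minimizes the binary cross-entropy below, while the encoder step maximizes it so that content codes stop predicting sentiment. All names and logit values are illustrative:

```python
import numpy as np

def style_classifier_loss(logits_pos, logits_neg):
    """Binary cross-entropy for the latent style classifier:
    codes from positive sentences should score label 1, negative codes 0."""
    def bce(logit, label):
        p = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
        return -(label * np.log(p) + (1 - label) * np.log(1 - p))
    return float(np.mean([bce(l, 1) for l in logits_pos]) +
                 np.mean([bce(l, 0) for l in logits_neg]))

# Classifier logits that separate sentiments well give a low loss ...
sep = style_classifier_loss(np.array([3.0, 2.5]), np.array([-3.0, -2.0]))
# ... while sentiment-free codes leave the classifier at chance (logit 0),
# which is the equilibrium the adversarially trained encoder drives toward.
mixed = style_classifier_loss(np.array([0.0, 0.0]), np.array([0.0, 0.0]))
```

At the adversarial equilibrium the classifier can do no better than chance, i.e., the loss sits at 2·log 2 for this two-term formulation.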
Datasets. Following (Shen et al., 2017), the Yelp restaurant reviews dataset is processed from the original Yelp dataset used in language modeling. Reviews with a user rating above three are considered positive, and those below three negative. The pre-processing enables sentence-level sentiment analysis, resulting in shorter sentences (each at most 15 words) than those in language modeling. Finally, we obtain two sets of unaligned reviews: 250K negative sentences and 350K positive ones. Other dataset details are given in Appendix C.2.1.

Evaluation Metrics.
(1) Acc: the accuracy of transferring sentences into the other sentiment, measured by an automatic classifier (the "fastText" library (Joulin et al., 2017)); (2) BLEU: the consistency between the transferred text and the original; (3) PPL: the reconstruction perplexity of original sentences without altering sentiment; (4) RPPL: the reverse perplexity, which evaluates the training corpus under a language model derived from the generated text and measures the extent to which the generations are representative of the training corpus; (5) Flu: a human-evaluated score of the fluency of a transferred sentence when read alone (1-5, 5 being most fluent as natural language); (6) Sim: a human-evaluated score of the similarity between the original and the transferred sentences in terms of their content (1-5, 5 being most similar). Note that the similarity measure does not consider sentiment, only the topic covered by the sentences. For human evaluation, we show 1000 randomly selected pairs of original and transferred sentences to crowdsourcing readers, and ask them to rate the Flu and Sim metrics stated above. Each measure is averaged among the crowdsourcing readers.

Input: it was super dry and had a weird taste to the entire slice .
ARAE: it was super nice and the owner was super sweet and helpful .
iVAE_MI: it was super tasty and a good size with the best in the burgh .

Input: so i only had half of the regular fries and my soda .
ARAE: it 's the best to eat and had a great meal .
iVAE_MI: so i had a huge side and the price was great .

Input: i am just not a fan of this kind of pizza .
ARAE: i am very pleased and will definitely use this place .
iVAE_MI: i am just a fan of the chicken and egg roll .

Input: i have eaten the lunch buffet and it was outstanding !
ARAE: once again , i was told by the wait and was seated .
iVAE_MI: we were not impressed with the buffet there last night .

Input: my favorite food is kung pao beef , it is delicious .
ARAE: my husband was on the phone , which i tried it .
iVAE_MI: my chicken was n't warm , though it is n't delicious .

Input: overall , it was a very positive dining experience .
ARAE: overall , it was very rude and unprofessional .
iVAE_MI: overall , it was a nightmare of terrible experience .
As shown in Table 7, iVAE_MI outperforms ARAE on all metrics except Acc, showing that iVAE_MI captures informative representations and generates sentiment-reversed sentences with similar grammatical structure and preserved semantic content. Both methods perform successful sentiment transfer, as shown by the Acc values; iVAE_MI achieves a slightly lower Acc because it preserves much more content, even exact words, from the source sentences.
Table 6 presents some examples. In each box, we show the source sentence and the sentences transferred by ARAE and iVAE_MI, respectively. We observe that ARAE often generates new sentences that miss the content of the source, while iVAE_MI preserves content better.

Dialog response generation
We consider the open-domain dialog response generation task, where we need to generate a natural language response given a dialog history.It is crucial to learn a meaningful latent feature representation of the dialog history in order to generate a consistent, relevant, and contentful response that is likely to drive the conversation (Gao et al., 2019).
Datasets. We consider two mainstream datasets used in recent studies (Zhao et al., 2017b, 2018; Fu et al., 2019; Gu et al., 2018): Switchboard (Godfrey and Holliman, 1997) and Dailydialog (Li et al., 2017c). Switchboard contains 2,400 two-way telephone conversations under 70 specified topics. Dailydialog has 13,118 daily conversations for English learners. We process each utterance as the response to the previous 10 context utterances from both speakers. The datasets are split into training, validation, and test sets by convention: 2316:60:62 for Switchboard and 10:1:1 for Dailydialog, respectively.
Model Extension. We adapt iVAE_MI by integrating the context embedding c into all model components. The prior p(z|c) is defined as an implicit mapping between the context embedding c and prior samples; it is not pre-fixed but learned together with the variational posterior for more modeling flexibility. The encoder q(z|x, c), the auxiliary dual function ν_ψ(z, c), and the decoder p(x|z, c) depend on the context embedding c as well. Both the encoder and decoder are implemented as GRUs: the utterance encoder is a bidirectional GRU with 300 hidden units in each direction, and the context encoder and decoder are both GRUs with 300 hidden units. Appendix C.3.1 presents more training details.
Table 8 shows the performance comparison. iVAE_MI achieves consistent improvement on a majority of the metrics. In particular, the BOW embedding and Distinct scores improve significantly, which implies that iVAE_MI produces both meaningful and diverse latent representations.

Conclusion
We present two types of implicit deep latent variable models, iVAE and iVAE_MI. Core to these models is a sample-based representation of the latent features in LVMs, replacing the traditional Gaussian-based distributions. Extensive experiments show that the proposed implicit LVMs consistently outperform vanilla VAEs on three tasks: language modeling, style transfer, and dialog response generation.

Figure 1 :
Figure 1: Illustration of the proposed implicit LVMs. ν(x, z) is used in iVAE, and ν(z) is used in iVAE_MI. In this example, the prior is p(z) = N(0, 1); the sample-based aggregated posterior q(z) = ∫ q_φ(z|x) q(x) dx for four observations is shown, where the posterior q_φ(z|x) for each observation is visualized in a different color.

Table 1 :
Statistics of datasets for language modeling.

Table 2 :
PTB is a relatively small dataset with sentences of varying lengths, whereas Yahoo and Yelp contain larger amounts of data with longer sentences. Detailed statistics of these datasets are shown in Table 1.

Language modeling on three datasets.
Evaluation metrics. Two categories of metrics are used to study VAEs for language modeling:
• To characterize the modeling ability on the observed sentences, we use the negative ELBO and perplexity (PPL).

Table 3 :
Total training time in hours: absolute time (Abs. ↓) and relative time (Re. ↓) versus VAE.

Interpolated generations between two latent features (see Table 4):
t = 0.1: in new york the company declined comment
t = 0.2: in new york the transaction was suspended
t = 0.3: in the securities company said yesterday
t = 0.4: in other board the transaction had disclosed
t = 0.5: other of those has been available
t = 0.6: both of companies have been unchanged
t = 0.7: both men have received a plan to restructure
t = 0.8: and to reduce that it owns
t = 0.9: and to continue to make prices
t = 1.0: and they plan to buy more today

Table 4 :
Interpolating latent features, with t increased from 0 to 1 by a step size of 0.1; the interpolation generates sentences with smooth semantic evolution.

Table 5 :
Forward and reverse PPL on PTB.

Table 6 :
Sentiment transfer on Yelp. (Up: from negative to positive; Down: from positive to negative.)