Piecewise Latent Variables for Neural Variational Text Processing

Advances in neural variational inference have facilitated the learning of powerful directed graphical models with continuous latent variables, such as variational autoencoders. The hope is that such models will learn to represent rich, multi-modal latent factors in real-world data, such as natural language text. However, current models often assume simplistic priors on the latent variables - such as the uni-modal Gaussian distribution - which are incapable of representing complex latent factors efficiently. To overcome this restriction, we propose the simple, but highly flexible, piecewise constant distribution. This distribution has the capacity to represent an exponential number of modes of a latent target distribution, while remaining mathematically tractable. Our results demonstrate that incorporating this new latent distribution into different models yields substantial improvements in natural language processing tasks such as document modeling and natural language generation for dialogue.


Introduction
The development of the variational autoencoder framework (Kingma and Welling, 2014; Rezende et al., 2014) has paved the way for learning large-scale, directed latent variable models. This has led to significant progress in a diverse set of machine learning applications, ranging from computer vision (Gregor et al., 2015; Larsen et al., 2016) to natural language processing tasks (Mnih and Gregor, 2014; Miao et al., 2016; Bowman et al., 2015; Serban et al., 2017b). It is hoped that this framework will enable the learning of generative processes of real-world data - including text, audio and images - by disentangling and representing the underlying latent factors in the data. However, latent factors in real-world data are often highly complex. For example, topics in newswire text and responses in conversational dialogue often possess latent factors that follow non-linear (non-smooth), multi-modal distributions (i.e. distributions with multiple local maxima).

* The first two authors contributed equally.
Nevertheless, the majority of current models assume a simple prior in the form of a multivariate Gaussian distribution in order to maintain mathematical and computational tractability. This is often a highly restrictive and unrealistic assumption to impose on the structure of the latent variables. First, it imposes a strong uni-modal structure on the latent variable space; latent variable samples from the generating model (prior distribution) all cluster around a single mean. Second, it forces the latent variables to follow a perfectly symmetric distribution with constant kurtosis; this makes it difficult to represent asymmetric or rarely occurring factors. Such constraints on the latent variables increase pressure on the downstream generative model, which in turn is forced to carefully partition the probability mass for each latent factor throughout its intermediate layers. For complex, multi-modal distributions - such as the distribution over topics in a text corpus, or natural language responses in a dialogue system - the uni-modal Gaussian prior inhibits the model's ability to extract and represent important latent structure in the data. In order to learn more expressive latent variable models, we therefore need more flexible, yet tractable, priors.
In this paper, we introduce a simple, flexible prior distribution based on the piecewise constant distribution. We derive an analytical, tractable form that is applicable to the variational autoencoder framework and propose a differentiable parametrization for it. We then evaluate the effectiveness of the distribution when utilized both as a prior and as an approximate posterior across variational architectures in two natural language processing tasks: document modeling and natural language generation for dialogue. We show that the piecewise constant distribution is able to capture elements of a target distribution that cannot be captured by simpler priors - such as the uni-modal Gaussian. We demonstrate state-of-the-art results on three document modeling tasks, and show improvements on a dialogue natural language generation task. Finally, we illustrate qualitatively how the piecewise constant distribution represents multi-modal latent structure in the data.

Related Work
The idea of using an artificial neural network to approximate an inference model dates back to the early work of Hinton and colleagues (Hinton and Zemel, 1994; Hinton et al., 1995; Dayan and Hinton, 1996). Researchers later proposed Markov chain Monte Carlo (MCMC) methods (Neal, 1992), which do not scale well and mix slowly, as well as variational approaches, which require a tractable, factored distribution to approximate the true posterior distribution (Jordan et al., 1999). Others have since proposed using feed-forward inference models to initialize the mean-field inference algorithm for training Boltzmann architectures (Salakhutdinov and Larochelle, 2010; Ororbia II et al., 2015). Recently, the variational autoencoder (VAE) framework was proposed by Kingma and Welling (2014) and Rezende et al. (2014), closely related to the method proposed by Mnih and Gregor (2014). This framework allows the joint training of an inference network and a directed generative model, maximizing a variational lower-bound on the data log-likelihood and facilitating exact sampling from the variational posterior. Our work extends this framework.
With respect to document modeling, neural architectures have been shown to outperform well-established topic models such as Latent Dirichlet Allocation (LDA) (Hofmann, 1999; Blei et al., 2003). Researchers have successfully proposed several models involving discrete latent variables (Salakhutdinov and Hinton, 2009; Hinton and Salakhutdinov, 2009; Srivastava et al., 2013; Larochelle and Lauly, 2012; Uria et al., 2014; Lauly et al., 2016; Bornschein and Bengio, 2015; Mnih and Gregor, 2014). The success of such discrete latent variable models - which are able to partition probability mass into separate regions - serves as one of our main motivations for investigating models with more flexible continuous latent variables for document modeling. More recently, Miao et al. (2016) proposed to use continuous latent variables for document modeling.
Researchers have also investigated latent variable models for dialogue modeling and dialogue natural language generation (Bangalore et al., 2008; Crook et al., 2009; Zhai and Williams, 2014). The success of discrete latent variable models in this task also motivates our investigation of more flexible continuous latent variables.
Closely related to our proposed approach is the Variational Hierarchical Recurrent Encoder-Decoder (VHRED, described below) (Serban et al., 2017b), a neural architecture with latent multivariate Gaussian variables.
Researchers have explored more flexible distributions for the latent variables in VAEs, such as autoregressive distributions, hierarchical probabilistic models and approximations based on MCMC sampling (Rezende et al., 2014; Rezende and Mohamed, 2015; Kingma et al., 2016; Ranganath et al., 2016; Maaløe et al., 2016; Salimans et al., 2015; Burda et al., 2016; Chen et al., 2017; Ruiz et al., 2016). These are all complementary to our approach; it is possible to combine them with the piecewise constant latent variables. In parallel to our work, multiple research groups have also proposed VAEs with discrete latent variables (Maddison et al., 2017; Jang et al., 2017; Rolfe, 2017; Johnson et al., 2016). This is a promising line of research; however, these approaches often require approximations which may be inaccurate when applied to larger-scale tasks, such as document modeling or natural language generation. Finally, discrete latent variables may be inappropriate for certain natural language processing tasks.

Neural Variational Models
We start by introducing the neural variational learning framework. We focus on modeling discrete output variables (e.g. words) in the context of natural language processing applications. However, the framework can easily be adapted to handle continuous output variables.

Neural Variational Learning
Let w_1, …, w_N be a sequence of N tokens (words) conditioned on a continuous latent variable z. Further, let c be an additional observed variable which conditions both z and w_1, …, w_N. Then, the distribution over words is:

  P_θ(w_1, …, w_N | c) = ∫ P_θ(w_1, …, w_N | z, c) P_θ(z | c) dz,

which is bounded from below by the variational lower-bound:

  log P_θ(w_1, …, w_N | c) ≥ E_{Q_ψ(z | w_1, …, w_N, c)} [log P_θ(w_1, …, w_N | z, c)] − KL[Q_ψ(z | w_1, …, w_N, c) || P_θ(z | c)],   (1)

where we note that Q_ψ(z | w_1, …, w_N, c) is the approximation to the intractable, true posterior P_θ(z | w_1, …, w_N, c). Q is called the encoder, or sometimes the recognition model or inference model, and it is parametrized by ψ. The distribution P_θ(z | c) is the prior model for z, where the only available information is c. The VAE framework further employs the re-parametrization trick, which allows one to move the derivative of the lower-bound inside the expectation. To accomplish this, z is parametrized as a transformation of a fixed, parameter-free random distribution: z = f_θ(ε), where ε is drawn from a random distribution. Here, f is a transformation of ε, parametrized by θ, such that f_θ(ε) ∼ P_θ(z | c). For example, ε might be drawn from a standard Gaussian distribution and f might be defined as f_θ(ε) = µ + σε, where µ and σ are in the parameter set θ. In this case, z is able to represent any Gaussian with mean µ and variance σ². Model parameters are learned by maximizing the variational lower-bound in eq. (1) using gradient descent, where the expectation is computed using samples from the approximate posterior.
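The re-parametrization trick and the Gaussian KL term can be sketched as follows. This is a minimal NumPy illustration of the standard construction; the function names are ours, not the authors' implementation:

```python
import numpy as np

def reparameterized_sample(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I). The noise source is
    # parameter-free, so gradients can flow through mu and sigma.
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    # KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal Gaussians,
    # summed over latent dimensions.
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * np.sum(log_var_p - log_var_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
```

In a training loop, `reparameterized_sample` provides the Monte Carlo estimate of the expectation in eq. (1), while `gaussian_kl` supplies the KL term in closed form.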
The majority of work on VAEs proposes to parametrize z as a multivariate Gaussian distribution. However, this unrealistic assumption may critically hurt the expressiveness of the latent variable model. See Appendix A for a detailed discussion. This motivates the proposed piecewise constant latent variable distribution.

Piecewise Constant Distribution
We propose to learn latent variables by parametrizing z using a piecewise constant probability density function (PDF). This should allow z to represent complex aspects of the data distribution in latent variable space, such as non-smooth regions of probability mass and multiple modes.
Let n ∈ N be the number of piecewise constant components. We assume z is drawn from the PDF:

  P(z) = (1/K) Σ_{i=1}^{n} 1((i−1)/n ≤ z < i/n) a_i,

where 1(x) is the indicator function, which is one when x is true and otherwise zero. The distribution parameters are a_i > 0, for i = 1, …, n. The normalization constant is:

  K = Σ_{i=1}^{n} a_i / n.

It is straightforward to show that a piecewise constant distribution with n > 2 pieces is capable of representing a bi-modal distribution.
When n > 2, a vector z of piecewise constant variables can represent a probability density with up to 2^|z| modes. Figure 1 illustrates how these variables help model complex, multi-modal distributions.
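To make the density concrete, it can be sketched in NumPy as follows. This is our own minimal illustration (function name and example heights are ours); the pieces have equal width 1/n on [0, 1]:

```python
import numpy as np

def piecewise_pdf(z, a):
    # Density of the piecewise constant distribution on [0, 1] with
    # n equal-width pieces of (unnormalized) heights a_i > 0.
    a = np.asarray(a, dtype=float)
    n = len(a)
    K = a.sum() / n                          # normalization constant
    idx = np.minimum((np.asarray(z) * n).astype(int), n - 1)
    return a[idx] / K

# Two high pieces separated by a low one: a bi-modal density with n = 3.
pdf = piecewise_pdf(np.linspace(0.0, 1.0, 1001), [5.0, 0.1, 5.0])
```

With heights [5.0, 0.1, 5.0] the density has two well-separated modes, which a single Gaussian cannot represent.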
In order to compute the variational bound, we need to draw samples from the piecewise constant distribution using its inverse cumulative distribution function (CDF). Further, we need to compute the KL divergence between the prior and posterior. The inverse CDF and KL divergence quantities are both derived in Appendix B. During training we must compute derivatives of the variational bound in eq. (1). These expressions involve derivatives of indicator functions, which are zero everywhere except at the change points, where the derivative is undefined. However, the probability of sampling a value exactly at a change point is effectively zero. Thus, we fix these derivatives to zero. Similar approximations are used in training networks with rectified linear units.

Figure 1: Joint density plot of a pair of Gaussian and piecewise constant variables. The horizontal axis corresponds to z_1, a univariate Gaussian variable. The vertical axis corresponds to z_2, a piecewise constant variable.
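Inverse-CDF sampling can be sketched as follows (a minimal NumPy illustration under our reading of the derivation, not the authors' implementation): a uniform draw selects a piece according to its probability mass, and within the chosen piece the inverse CDF is linear because the density is flat there.

```python
import numpy as np

def piecewise_inverse_cdf(u, a):
    # Map u ~ Uniform(0, 1) through the inverse CDF of the piecewise
    # constant distribution with n equal-width pieces of heights a_i.
    a = np.asarray(a, dtype=float)
    n = len(a)
    w = a / a.sum()                          # probability mass per piece
    cum = np.concatenate([[0.0], np.cumsum(w)])
    j = np.searchsorted(cum, u, side='right') - 1
    j = np.clip(j, 0, n - 1)
    # Linear interpolation inside piece j (the density is flat there).
    return (j + (u - cum[j]) / w[j]) / n

rng = np.random.default_rng(1)
samples = piecewise_inverse_cdf(rng.uniform(size=10000), [5.0, 0.1, 5.0])
```

Since the uniform draw is parameter-free, this construction plays the same role for piecewise constant variables as the re-parametrization trick does for Gaussians.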

Latent Variable Parametrizations
In this section, we develop the parametrization of both the Gaussian variable and our proposed piecewise constant latent variable.
Let x be the current output sequence, which the model must generate (e.g. w_1, …, w_N). Let c be the observed conditioning information. If the task contains additional conditioning information, it is embedded in c. For example, for dialogue natural language generation c represents an embedding of the dialogue history, while for document modeling c = ∅.

Gaussian Parametrization
Let µ^prior and σ^{2,prior} be the prior mean and variance, and let µ^post and σ^{2,post} be the approximate posterior mean and variance. For Gaussian latent variables, the prior mean and variance are encoded using linear transformations of a hidden state. In particular, the prior covariance is encoded as a diagonal covariance matrix using a softplus function:

  µ^prior = H^prior_µ Enc(c) + b^prior_µ,
  σ^{2,prior} = softplus(H^prior_σ Enc(c) + b^prior_σ),

where Enc(c) is an embedding of the conditioning information c (e.g. for dialogue natural language generation this might be produced by an LSTM encoder applied to the dialogue history), which is shared across all latent variables, and the matrices H and vectors b are learnable parameters. For the posterior distribution, previous work has shown it is better to parametrize the posterior as a linear interpolation of the prior mean and variance with a new estimate of the mean and variance based on the observation x (Fraccaro et al., 2016). The interpolation is controlled by a gating mechanism, allowing the model to turn on/off latent dimensions:

  µ^post = (1 − α_µ) µ^prior + α_µ (H^post_µ Enc(c, x) + b^post_µ),
  σ^{2,post} = (1 − α_σ) σ^{2,prior} + α_σ softplus(H^post_σ Enc(c, x) + b^post_σ),

where Enc(c, x) is an embedding of both c and x, and the matrices H^post and vectors b^post are learnable parameters. The interpolation mechanism is controlled by the gates α_µ and α_σ, which are initialized to zero (i.e. initialized such that the posterior is equal to the prior).
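The parametrization above can be sketched as follows. This is a minimal NumPy illustration under our reading of the gating mechanism (the gates are passed in directly and initialized to zero); all names and shapes are ours:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def gaussian_prior_params(enc_c, H_mu, b_mu, H_var, b_var):
    # Prior mean: linear transform of the conditioning embedding.
    # Prior variance: softplus keeps the diagonal covariance positive.
    mu = H_mu @ enc_c + b_mu
    var = softplus(H_var @ enc_c + b_var)
    return mu, var

def gaussian_posterior_params(enc_cx, mu_prior, var_prior,
                              H_mu, b_mu, H_var, b_var,
                              alpha_mu, alpha_var):
    # Gated interpolation between the prior statistics and a fresh
    # estimate computed from the observation x (Fraccaro et al., 2016).
    # With the gates initialized to zero, the posterior starts out
    # exactly equal to the prior.
    mu_hat = H_mu @ enc_cx + b_mu
    var_hat = softplus(H_var @ enc_cx + b_var)
    mu = (1.0 - alpha_mu) * mu_prior + alpha_mu * mu_hat
    var = (1.0 - alpha_var) * var_prior + alpha_var * var_hat
    return mu, var
```

Setting a gate entry to zero fixes that posterior dimension to the prior, which is how the model "turns off" latent dimensions.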

Piecewise Constant Parametrization
We parametrize the piecewise prior parameters using an exponential function applied to a linear transformation of the conditioning information:

  a^prior_i = exp(H^prior_{a,i} Enc(c) + b^prior_{a,i}), i = 1, …, n,

where the matrices H^prior_a and vectors b^prior_a are learnable parameters.
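A minimal sketch of this transformation (the parameter names H_a and b_a are ours); the exponential guarantees strictly positive piece heights:

```python
import numpy as np

def piecewise_prior_params(enc_c, H_a, b_a):
    # Heights a_i > 0 of the n prior pieces: an exponential applied to
    # a linear transform of the conditioning embedding Enc(c).
    return np.exp(H_a @ enc_c + b_a)     # shape (n,), strictly positive
```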

Variational Text Modeling
We now introduce two classes of VAEs. The models are extended by incorporating the Gaussian and piecewise latent variable parametrizations.

Document Model
The neural variational document model (NVDM) has previously been proposed for document modeling (Mnih and Gregor, 2014; Miao et al., 2016), where the latent variables are Gaussian.
Since the original NVDM uses Gaussian latent variables, we will refer to it as G-NVDM. We propose two novel models building on G-NVDM. The first model uses piecewise constant latent variables instead of Gaussian latent variables. We refer to this model as P-NVDM. The second model uses a combination of Gaussian and piecewise constant latent variables: it samples the Gaussian and piecewise constant latent variables independently and then concatenates them into one vector. We refer to this model as H-NVDM.
Let V be the vocabulary of document words. Let W represent a document matrix, where row w_i is the 1-of-|V| binary encoding of the i'th word in the document. Each model has an encoder component Enc(W), which compresses a document vector into a continuous distributed representation upon which the approximate posterior is built. For document modeling, word order information is not taken into account and no additional conditioning information is available. Therefore, each model uses a bag-of-words encoder, defined as a multi-layer perceptron (MLP): Enc(c = ∅, x) = Enc(x). Based on preliminary experiments, we choose the encoder to be a two-layered MLP with parametrized rectified linear activation functions (we omit these parameters for simplicity). For the approximate posterior, each model has parameter matrices for the Gaussian means and variances. We initialize the bias parameters to zero in order to start with centered Gaussian and piecewise constant priors. The encoder will adapt these priors as learning progresses, using the gating mechanism to turn on/off latent dimensions.
Let z be the vector of latent variables sampled according to the approximate posterior distribution. Given z, the decoder Dec(w, z) outputs a distribution over words in the document:

  P(w_i | z) = exp(z^T R w_i + b_{w_i}) / Σ_{w ∈ V} exp(z^T R w + b_w),

where R is a parameter matrix and b is a parameter vector corresponding to a learned bias for each word. This output probability distribution is combined with the KL divergences to compute the lower-bound in eq. (1). See Appendix C.

Our baseline model G-NVDM is an improvement over the original NVDM proposed by Mnih and Gregor (2014) and Miao et al. (2016). We learn the prior mean and variance, whereas these were fixed to a standard Gaussian in previous work. This increases the flexibility of the model and makes optimization easier. In addition, we use a gating mechanism for the approximate posterior of the Gaussian variables. This gating mechanism allows the model to turn off latent variables (i.e. fix the approximate posterior to equal the prior for specific latent variables) when computing the final posterior parameters. Furthermore, Miao et al. (2016) alternated between optimizing the approximate posterior parameters and the generative model parameters, while we optimize all parameters simultaneously.
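The decoder's per-document log-likelihood can be sketched as follows. This is a minimal NumPy illustration (names are ours); `R` maps the latent vector to one logit per vocabulary word, and word order is ignored:

```python
import numpy as np

def log_softmax(logits):
    logits = logits - logits.max()           # numerical stability
    return logits - np.log(np.exp(logits).sum())

def doc_log_likelihood(z, R, b, word_ids):
    # Bag-of-words decoder: one softmax over the whole vocabulary,
    # scored once per word occurrence in the document.
    log_p = log_softmax(z @ R + b)           # shape (|V|,)
    return log_p[word_ids].sum()
```

Summing this quantity with the negative KL terms gives the per-document lower-bound of eq. (1).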

Dialogue Model
The variational hierarchical recurrent encoder-decoder (VHRED) model has previously been proposed for dialogue modeling and natural language generation (Serban et al., 2017b, 2016a). The model decomposes dialogues using a two-level hierarchy: sequences of utterances (e.g. sentences), and sub-sequences of tokens (e.g. words). Let w_n be the n'th utterance in a dialogue with N utterances. Let w_{n,m} be the m'th word in the n'th utterance from vocabulary V, given as a 1-of-|V| binary encoding. Let M_n be the number of words in the n'th utterance. For each utterance n = 1, …, N, the model generates a latent variable z_n. Conditioned on this latent variable, the model then generates the next utterance:

  P_θ(w_n | z_n, w_{<n}) = Π_{m=1}^{M_n} P_θ(w_{n,m} | z_n, w_{<n}, w_{n,<m}),

where θ are the model parameters. VHRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. The encoder RNN computes an embedding for each utterance. This embedding is fed into the context RNN, which computes a hidden state summarizing the dialogue context before utterance n: h^con_{n−1}. This state represents the additional conditioning information, which is used to compute the prior distribution over z_n:

  P_θ(z_n | w_{<n}) = f^prior(z_n ; θ, h^con_{n−1}),

where f^prior is a PDF parametrized by both θ and h^con_{n−1}. A sample is drawn from this distribution: z_n ∼ P_θ(z_n | w_{<n}). This sample is given as input to the decoder RNN, which then computes the output probabilities of the words in the next utterance. The model is trained by maximizing the variational lower-bound, which factorizes into independent terms for each sub-sequence (utterance):

  log P_θ(w_1, …, w_N) ≥ Σ_{n=1}^{N} ( E_{Q_ψ(z_n | w_{≤n})}[log P_θ(w_n | z_n, w_{<n})] − KL[Q_ψ(z_n | w_{≤n}) || P_θ(z_n | w_{<n})] ),

where Q_ψ is the approximate posterior distribution with parameters ψ, computed similarly to the prior distribution but further conditioned on the encoder RNN hidden state of the next utterance.
The original VHRED model (Serban et al., 2017b) used Gaussian latent variables. We refer to this model as G-VHRED. The first model we propose uses piecewise constant latent variables instead of Gaussian latent variables. We refer to this model as P-VHRED. The second model we propose takes advantage of the representation power of both Gaussian and piecewise constant latent variables. This model samples a Gaussian latent variable z^gaussian_n and a piecewise latent variable z^piecewise_n independently, each conditioned on the context RNN hidden state:

  P_θ(z^gaussian_n | w_{<n}) = f^{prior, gaussian}(z^gaussian_n ; θ, h^con_{n−1}),
  P_θ(z^piecewise_n | w_{<n}) = f^{prior, piecewise}(z^piecewise_n ; θ, h^con_{n−1}),

where f^{prior, gaussian} and f^{prior, piecewise} are PDFs parametrized by independent subsets of the parameters θ. We refer to this model as H-VHRED.

Experiments
We evaluate the proposed models on two types of natural language processing tasks: document modeling and dialogue natural language generation. All models are trained with back-propagation using the variational lower-bound on the log-likelihood or the exact log-likelihood. We use the first-order gradient descent optimizer Adam (Kingma and Ba, 2015) with gradient clipping (Pascanu et al., 2012).

Table 1: Test perplexities on three document modeling tasks: 20 News-Groups (20-NG), Reuters corpus (RCV1) and CADE12 (CADE). Perplexities were calculated using 10 samples to estimate the variational lower-bound. The H-NVDM models perform best across all three datasets.

Document Modeling
Tasks We use three different datasets for document modeling experiments. First, we use the 20 News-Groups (20-NG) dataset (Hinton and Salakhutdinov, 2009). Second, we use the Reuters corpus (RCV1-V2), using a version restricted to a 5,000-term vocabulary. As in previous work (Hinton and Salakhutdinov, 2009; Larochelle and Lauly, 2012), we transform the original word frequencies using the equation log(1 + TF), where TF is the original word frequency. Third, to test our document models on text from a non-English language, we use the Brazilian Portuguese CADE12 dataset (Cardoso-Cachopo, 2007). For all datasets, we track the validation bound on a subset of 100 vectors randomly drawn from each training corpus.
Training All models were trained using mini-batches with 100 examples each. A learning rate of 0.002 was used. Model selection and early stopping were conducted using the validation lower-bound, estimated using five stochastic samples per validation example. Inference networks used 100 units in each hidden layer for 20-NG and CADE, and 100 for RCV1. We experimented with both 50 and 100 latent random variables for each class of models, and found that 50 latent variables performed best on the validation set. For H-NVDM we vary the number of components used in the PDF, investigating the effect that 3 and 5 pieces have on the final quality of the model. The number of hidden units was chosen via preliminary experimentation with smaller models. On 20-NG, we use the same set-up as Hinton and Salakhutdinov (2009) and therefore report the perplexities of a topic model (LDA; Hinton and Salakhutdinov, 2009), the document neural auto-regressive estimator (docNADE; Larochelle and Lauly, 2012), and a neural variational document model with a fixed standard Gaussian prior (NVDM, lowest reported perplexity; Miao et al., 2016).

Results
In Table 1, we report the test document perplexity:

  exp(−(1/N) Σ_n (1/L_n) log P_θ(x_n)),

where L_n is the length of document n. We use the variational lower-bound as an approximation based on 10 samples, as was done in (Mnih and Gregor, 2014). First, we note that the best baseline model (i.e. the NVDM) is more competitive when both the prior and posterior models are learned together (i.e. the G-NVDM), as opposed to the fixed prior of Miao et al. (2016). Next, we observe that integrating our proposed piecewise variables yields even better results in our document modeling experiments, substantially improving over the baselines. More importantly, on the 20-NG and Reuters datasets, increasing the number of pieces from 3 to 5 further reduces perplexity. Thus, we have achieved a new state-of-the-art perplexity on the 20 News-Groups task and - to the best of our knowledge - better perplexities on the CADE12 and RCV1 tasks than a state-of-the-art model like the G-NVDM. We also evaluated the converged models using a non-parametric inference procedure, where a separate approximate posterior is learned for each test example in order to tighten the variational lower-bound. H-NVDM also performed best in this evaluation across all three datasets, which confirms that the performance improvement is due to the piecewise components. See appendix for details.
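The perplexity computation can be sketched as follows (a minimal illustration; the per-document length normalization follows the formula above, with the log-likelihoods replaced in practice by their variational lower-bound estimates):

```python
import numpy as np

def corpus_perplexity(log_probs, doc_lengths):
    # exp( -(1/N) * sum_n (1/L_n) * log P(x_n) ), where log P(x_n) is
    # approximated by the variational lower-bound (e.g. 10 samples).
    log_probs = np.asarray(log_probs, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-np.mean(log_probs / doc_lengths)))
```

As a sanity check, a model that assigns uniform probability over a 100-word vocabulary yields a perplexity of exactly 100.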
In Table 2, we examine the top ten highest ranked words given the query term "space", using the decoder parameter matrix. The piecewise variables appear to have a significant effect on what is uncovered by the model. In the case of "space", the hybrid model with 5 pieces seems to value two senses of the word - one related to "outer space" (e.g. "sun", "world") and another related to the dimensions of depth, height and width within which things may exist and move (e.g. "area", "form", "scale"). In contrast, G-NVDM appears to capture only the "outer space" sense of the word.

Finally, we visualized the means of the approximate posterior latent variables on 20-NG through a t-SNE projection. As shown in Figure 2, both G-NVDM and H-NVDM-5 learn representations which disentangle the topic clusters on 20-NG. However, G-NVDM appears to have more dispersed clusters and more outliers (i.e. data points in the periphery) compared to H-NVDM-5. Although it is difficult to draw conclusions based on these plots, these findings could potentially be explained by the Gaussian latent variables fitting the latent factors poorly.

Dialogue Modeling
Task We evaluate VHRED on a natural language generation task, where the goal is to generate responses in a dialogue. This is a difficult problem, which has been extensively studied in the recent literature (Ritter et al., 2011; Lowe et al., 2015; Sordoni et al., 2015; Li et al., 2016; Serban et al., 2016a,b). Dialogue response generation has recently gained a significant amount of attention from industry, with high-profile projects such as Google SmartReply (Kannan et al., 2016) and Microsoft Xiaoice (Markoff and Mozur, 2015). Even more recently, Amazon has announced the Alexa Prize Challenge for the research community, with the goal of developing a natural and engaging chatbot system (Farber, 2016).
We evaluate on the technical support response generation task for the Ubuntu operating system. We use the well-known Ubuntu Dialogue Corpus (Lowe et al., 2015, 2017), which consists of about half a million natural language dialogues extracted from the #Ubuntu Internet Relay Chat (IRC) channel. The technical problems discussed span a wide range of software-related and hardware-related issues. Given a dialogue history - such as a conversation between a user and a technical support assistant - the model must generate the next appropriate response in the dialogue. For example, when it is the turn of the technical support assistant, the model must generate an appropriate response helping the user resolve their problem.
We evaluate the models using the activity- and entity-based metrics designed specifically for the Ubuntu domain (Serban et al., 2017a). These metrics compare the activities and entities in the model-generated responses with those of the reference responses; activities are verbs referring to high-level actions (e.g. download, install, unzip) and entities are nouns referring to technical objects (e.g. Firefox, GNOME). The more activities and entities a model response shares with the reference response (e.g. an expert response), the more likely the response is to lead to a solution.
Training The models were trained to maximize the log-likelihood of training examples using a learning rate of 0.0002 and mini-batches of size 80. We use a variant of truncated back-propagation. We terminate the training procedure for each model using early stopping, estimated using one stochastic sample per validation example. We evaluate the models by generating dialogue responses: conditioned on a dialogue context, we fix the model latent variables to their median values and then generate the response using a beam search of size 5. We select model hyperparameters based on the validation set using the F1 activity metric, as described earlier.
It is often difficult to train generative models of language with stochastic latent variables (Bowman et al., 2015; Serban et al., 2017b). For the latent variable models, we therefore experiment with reweighting the KL divergence terms in the variational lower-bound with values 0.25, 0.50, 0.75 and 1.0. In addition, we linearly increase the KL divergence weights from zero to their final value over the first 75,000 training batches. Finally, we weaken the decoder RNN by randomly replacing words input to the decoder RNN with the unknown token with 25% probability. These steps are important for effectively training the models, and the latter two have been used in previous work by Bowman et al. (2015) and Serban et al. (2017b).
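These two training heuristics can be sketched as follows (minimal illustrations; the function names and argument layout are ours, with the 75,000-batch warm-up and 25% replacement probability taken from the text):

```python
import numpy as np

def kl_weight(step, final_weight, warmup_steps=75000):
    # Linearly anneal the KL term's weight from 0 up to final_weight
    # over the first warmup_steps training batches, then hold it fixed.
    return final_weight * min(1.0, step / warmup_steps)

def drop_words(token_ids, unk_id, p=0.25, rng=None):
    # Weaken the decoder RNN: each input token is replaced by the
    # unknown token with probability p, forcing the decoder to rely
    # more on the latent variable.
    rng = rng or np.random.default_rng()
    return [unk_id if rng.uniform() < p else t for t in token_ids]
```

Annealing keeps the KL penalty small early in training so the model learns to use the latent variables before the penalty pushes the posterior toward the prior.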

HRED (Baseline):
We compare to the HRED model (Serban et al., 2016a): a sequence-to-sequence model, shown to outperform other established models on this task.

This suggests that the Gaussian latent variables learn useful latent representations for frequent actions. On the other hand, H-VHRED performs best w.r.t. entities (e.g. Firefox, GNOME), which are often much rarer and mutually exclusive in the dataset. This suggests that the combination of Gaussian and piecewise latent variables helps learn useful representations for entities, which could not be learned by Gaussian latent variables alone. We further conducted a qualitative analysis of the model responses, which supports these conclusions. See Appendix G.

Conclusions
In this paper, we have sought to learn rich and flexible multi-modal representations of latent variables for complex natural language processing tasks. We have proposed the piecewise constant distribution for the variational autoencoder framework. We have derived closed-form expressions for the quantities required by the variational autoencoder framework, and proposed an efficient, differentiable implementation. We have incorporated the proposed piecewise constant distribution into two model classes - NVDM and VHRED - and evaluated the proposed models on document modeling and dialogue modeling tasks. We have achieved state-of-the-art results on three document modeling tasks, and have demonstrated substantial improvements on a dialogue modeling task. Overall, the results highlight the benefits of incorporating the flexible, multi-modal piecewise constant distribution into variational autoencoders. Future work should explore other natural language processing tasks where the data is likely to arise from complex, multi-modal latent factors.
As before, we define the posterior parameters as a function of both c and x:

  a^post_i = exp(H^post_{a,i} Enc(c, x) + b^post_{a,i}), i = 1, …, n.

Figure 2: Latent variable approximate posterior means t-SNE visualization on 20-NG for G-NVDM and H-NVDM-5. Colors correspond to the topic labels assigned to each document.

Table 2: Word query similarity test on 20 News-Groups: for the query "space", we retrieve the top 10 nearest words in word embedding space based on Euclidean distance. H-NVDM-5 associates multiple meanings with the query, while G-NVDM associates only the most frequent meaning.