Effective Estimation of Deep Generative Language Models

Advances in variational inference enable parameterisation of probabilistic models by deep neural networks. This combines the statistical transparency of the probabilistic modelling framework with the representational power of deep learning. Yet, due to a problem known as posterior collapse, it is difficult to effectively estimate such models in the context of language modelling. We concentrate on one such model, the variational auto-encoder, which we argue is an important building block in hierarchical probabilistic models of language. This paper contributes a sober view of the problem, a survey of techniques to address it, novel techniques, and extensions to the model. To establish a ranking of techniques, we perform a systematic comparison using Bayesian optimisation and find that many techniques perform reasonably similarly, given enough resources. Still, a favourite can be named based on convenience. We also make several empirical observations and recommendations of best practices that should help researchers interested in this exciting field.


Introduction
Deep generative models (DGMs) are probabilistic latent variable models parameterised by neural networks (NNs). Specifically, DGMs optimised with amortised variational inference and reparameterised gradient estimates (Kingma and Welling, 2014; Rezende et al., 2014), better known as variational auto-encoders (VAEs), have spurred much interest in various domains, including computer vision and natural language processing (NLP).
We investigate this problem, dubbed by many posterior collapse, in the context of language modelling (LM). This is motivated by the fact that within NLP, DGMs attract a lot of attention from researchers in language generation, where systems usually employ an LM component. In a deep generative LM (Bowman et al., 2016), sentences are generated conditioned on samples from a continuous latent space, an idea with various practical applications. For example, it gives NLP researchers an opportunity to shape this latent space and promote generalisations that are in line with linguistic knowledge and/or intuition (Xu and Durrett, 2018). This also allows for greater flexibility in how the model is used; for example, we can generate sentences that live, in latent space, in a neighbourhood of a given observation (Bowman et al., 2016). Deterministically trained language models, e.g. recurrent NN-based LMs (Mikolov et al., 2010), lack a latent space and are thus deprived of such explicit mechanisms. Despite this potential, VAEs that employ strong generators (e.g. recurrent NNs) tend to ignore the latent variable (Bowman et al., 2016; Zhang et al., 2016). Figure 1 illustrates this point with samples from a vanilla VAE LM: the model does not capture useful patterns in data space and behaves just like a standard recurrent LM.

[Figure 1: samples from a vanilla VAE LM. (a) Greedy generation from prior samples yields the same sentence every time, showing that the latent code is ignored. Yet, ancestral sampling produces good sentences, showing that the recurrent decoder learns about the structure of English sentences. (b) Homotopy: ancestral samples mapped from points along a linear interpolation of two given sentences as represented in latent space. The sentences do not seem to exhibit any coherent relation, showing that the model does not exploit neighbourhood in latent space to capture regularities in data space.]

Various strategies to counter this problem have been independently proposed and tested, in particular within the computer vision and machine learning communities. One of our contributions is a review and comparison of such strategies, as well as a novel strategy based on constrained optimisation.
There have also been attempts at identifying the fundamental culprit behind posterior collapse (Chen et al., 2017; Alemi et al., 2018), leading to strategies based on changes to the generator, prior, and/or posterior. Building on those insights, we improve inference for Bowman et al. (2016)'s VAE by employing a class of flexible approximate posteriors (Tabak et al., 2010; Rezende et al., 2014) and modify the model to employ strong priors.
Finally, we compare models and techniques intrinsically in terms of perplexity as well as bounds on the mutual information between latent variable and observations. Our findings support a number of recommendations on how to effectively train a deep generative language model.

Density Estimation for Text
Density estimation for written text has a long history (Jelinek, 1980; Goodman, 2001), but in this work we concentrate on neural network models (Bengio et al., 2003), in particular autoregressive ones (Mikolov et al., 2010). Following common practice, we model sentences independently, each a sequence x = x_1, ..., x_n of n = |x| tokens.

Language models
A language model (LM) prescribes the generation of a sentence as a sequence of categorical draws parameterised in context, i.e.

P(x|θ) = ∏_{i=1}^{n} Cat(x_i | f(x_{<i}; θ)) .    (1)

To condition on all of the available context, a fixed NN f(·) maps from a prefix sequence (denoted x_{<i}) to the parameters of a categorical distribution over the vocabulary. Given a dataset D of i.i.d. observations, we estimate the parameters θ of the model by searching for a local optimum of the log-likelihood function L(θ|D) = E_X[log P(x|θ)] via stochastic gradient-based optimisation (Robbins and Monro, 1951; Bottou and Cun, 2004), where the expectation is taken w.r.t. the true data distribution and approximated with samples x ∼ D. Throughout, we refer to this model as RNNLM.
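To make the factorisation in (1) concrete, the sketch below scores a toy sentence as a sum of per-step categorical log-probabilities. The prefix encoder here (a sum of token features through a tanh) is purely an illustrative stand-in for f(·), not the recurrent architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8  # toy vocabulary and feature sizes (illustrative)

E = rng.normal(size=(V, H))  # token features
W = rng.normal(size=(H, V))  # output projection

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def f(prefix):
    # stand-in for the NN f(.): any deterministic map from a prefix
    # to categorical parameters over the vocabulary works for this sketch
    h = np.tanh(sum((E[tok] for tok in prefix), np.zeros(H)))
    return softmax(h @ W)

def log_prob(x):
    # log P(x|theta) = sum_i log Cat(x_i | f(x_{<i}; theta))
    return sum(np.log(f(x[:i])[x[i]]) for i in range(len(x)))

x = [1, 3, 0, 2]
lp = log_prob(x)
```

Each factor is a proper distribution over the vocabulary, so the sum of per-step log-probabilities is a valid sentence log-probability.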

Deep generative language models
Bowman et al. (2016) model observations as draws from the marginal of a DGM. An NN maps from a latent sentence embedding z ∈ R^{d_z} to a distribution P(x|z, θ) over sentences,

Z ∼ N(0, I)    X_i | z, x_{<i} ∼ Cat(f(z, x_{<i}; θ)) ,    (2)

where z follows a standard Gaussian prior. Generation still happens one word at a time without Markov assumptions, but f(·) now conditions on z in addition to the observed prefix. The conditional P(x|z, θ) is commonly referred to as the generator or decoder. The quantity P(x|θ) is the marginal likelihood, essential for parameter estimation.
This model is trained to assign high (marginal) probability to observations, much like standard LMs. However, unlike standard LMs, it employs a latent space which can accommodate a low-dimensional manifold where discrete sentences are mapped to, via posterior inference P(z|x, θ), and from, via generation P(x|z, θ). This gives the model an explicit mechanism to exploit neighbourhood and smoothness in latent space to capture regularities in data space. For example, it may group sentences according to certain latent factors (e.g. lexical choices, syntactic complexity, lexical semantics, etc.). It also gives users a mechanism to steer generation towards a certain purpose; for example, one may be interested in generating sentences that are mapped from the neighbourhood of another sentence in latent space. Interest in this property grows to the extent that the embedding space captures appreciable regularities.

Approximate inference
Marginal inference for this model is intractable and calls for approximate methods, in particular variational inference (VI; Jordan et al., 1999), whereby an auxiliary and independently parameterised model q(z|x, λ) approximates the true posterior p(z|x, θ). Where this inference model is itself parameterised by a neural network, we have a case of amortised inference (Kingma and Welling, 2014; Rezende et al., 2014) and an instance of what is known as a VAE. Bowman et al. (2016) approach posterior inference with a Gaussian model

Z | λ, x ∼ N(u, diag(s ⊙ s))    u, s = g(x; λ) ,    (3)

whose parameters, i.e. a location vector u ∈ R^{d_z} and a scale vector s ∈ R^{d_z}_{>0}, are predicted by a neural network architecture g(·; λ) from an encoding of the complete observation x. In this work, we use a bidirectional recurrent encoder (see Appendix A.1 for the complete design). Throughout the text we will refer to this model as SENVAE.
Parameter estimation We can jointly estimate the parameters of both models (i.e. generative and inference) by locally maximising a lowerbound on the log-likelihood function (ELBO) via gradient-based optimisation. For as long as we can reparameterise latent samples using a fixed random source, automatic differentiation (Baydin et al., 2018) can be used to obtain unbiased gradient estimates of the ELBO (Kingma and Welling, 2014; Rezende et al., 2014). In §5 we discuss a general class of reparameterisable distributions of which the Gaussian distribution is a special case.
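A single-sample ELBO estimate under the Gaussian posterior of (3) can be sketched as follows. The decoder score is a toy stand-in (an assumption for illustration); the reparameterisation and the closed-form Gaussian KL are as in Kingma and Welling (2014).

```python
import numpy as np

rng = np.random.default_rng(1)
dz = 4

# toy inference-network outputs for one observation (assumed values)
u = rng.normal(size=dz)          # location vector
s = np.exp(rng.normal(size=dz))  # strictly positive scale vector

def sample_z(u, s):
    # reparameterisation: z = u + s * eps with eps ~ N(0, I), so the
    # sample is a differentiable function of u and s
    eps = rng.standard_normal(dz)
    return u + s * eps

def gaussian_kl(u, s):
    # KL( N(u, diag(s*s)) || N(0, I) ) in closed form
    return 0.5 * np.sum(s**2 + u**2 - 1.0 - 2.0 * np.log(s))

def elbo_estimate(log_px_given_z, u, s):
    # single-sample Monte Carlo estimate of the ELBO
    z = sample_z(u, s)
    return log_px_given_z(z) - gaussian_kl(u, s)

# stand-in likelihood: a fixed Gaussian "decoder" score (illustrative)
toy_loglik = lambda z: -0.5 * np.sum(z**2)
elbo_1 = elbo_estimate(toy_loglik, u, s)
```

Because z is a deterministic function of (u, s) and the fixed noise eps, gradients w.r.t. the inference parameters flow through the sample.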

Posterior Collapse and the Strong Generator Problem
In VI, we make inferences using an approximation q(z|x, λ) to the true posterior p(z|x, θ) and choose λ so as to minimise the KL divergence KL(q(z|x, λ)||p(z|x, θ)). The same principle yields a lowerbound on the log-likelihood used to estimate θ jointly with λ, thus making the true posterior p(z|x, θ) a moving target. If the estimated conditional P(x|z, θ) can be made independent of z, which in our case means relying exclusively on x_{<i} to predict the distribution of X_i, the true posterior will be independent of the data and equal to the prior. Based on this observation, Chen et al. (2017) argue that information that can be modelled by the generator without using latent variables will be modelled that way, precisely because when no information is encoded in the latent variable the true posterior equals the prior and it is then trivial to reduce KL(q(z|x, λ)||p(z|x, θ)) to 0. This is typically diagnosed by noting that after training KL(q(z|x, λ)||p(z)) → 0 for most x: we say that the approximate posterior collapses to the prior.
In fact, Alemi et al. (2018) show that the rate, R = E_X[KL(q(z|x)||p(z))], is an upperbound on the mutual information (MI) between X and Z. From the non-negativity of MI, it follows that whenever KL(q(z|x)||p(z)) is close to zero for most training instances, MI is either 0 or negligible. Alemi et al. (2018) also show that the distortion, D = −E_X[E_{q(z|x)}[log P(x|z)]], relates to a lowerbound on MI (the lowerbound being H − D, where H is the unknown but constant data entropy). Due to this relationship to MI, they argue that reporting R and D along with the log-likelihood on held-out data offers better insights about a trained VAE, advice we follow in §6.
A generator that makes no Markov assumptions, such as a recurrent LM, can potentially achieve X_i ⊥ Z | x_{<i}, and indeed many have noticed that VAEs whose observation models are parameterised by such strong generators (or strong decoders) learn to ignore the latent representation (Bowman et al., 2016; Higgins et al., 2017; Sønderby et al., 2016; Zhao et al., 2018b). For this reason, a strategy to prevent posterior collapse is to weaken the decoder (Yang et al., 2017; Park et al., 2018). While in many cases there are good reasons for changing the model's factorisation, in this work we are interested in employing a strong generator, thus we will not investigate weaker decoders. Alternative solutions typically involve changes to the optimisation procedure and/or manipulations of the objective. The former aims at finding local optima of the ELBO with non-negligible MI. The latter seeks alternatives to the ELBO that target MI more directly.
Annealing Bowman et al. (2016) propose "KL annealing", whereby the KL term in the ELBO is incorporated into the objective in gradual steps. This way, early on in optimisation the optimiser can focus on reducing distortion, potentially by increasing the MI between X and Z. They also propose to drop words from the prefix x_{<i} uniformly at random to somewhat weaken the decoder and promote an increase in MI; the intuition is that the model would have to rely on z to compensate for missing history. As we do not want to compromise the decoder, we propose a slight modification of this technique whereby we slowly vary this word dropout rate from 1 to 0, instead of selecting a fixed value. In a sense, we anneal the decoder from a weak generator to a strong generator.
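Both schedules can be written as plain functions of the update step. The linear shape and the step count below are illustrative choices, not the tuned values from the experiments.

```python
def kl_weight(step, annealing_steps):
    # linear KL annealing: the KL term's weight goes 0 -> 1 over the
    # first `annealing_steps` updates, then stays at 1
    return min(1.0, step / annealing_steps)

def word_dropout_rate(step, annealing_steps):
    # annealed word dropout: the rate goes 1 -> 0, so the decoder is
    # weak early in training and recovers its full strength later
    return max(0.0, 1.0 - step / annealing_steps)
```

At each training step one would scale the KL term by `kl_weight(step, ...)` and drop each prefix word with probability `word_dropout_rate(step, ...)`.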
Targeting rates Another idea is to target a pre-specified positive rate (Alemi et al., 2018). Kingma et al. (2016) replace the KL term in the ELBO with max(r, KL(q(z|x, λ)||p(z))), dubbed free bits (FB) because it allows encoding the first r nats of information "for free". For as long as KL(q(z|x, λ)||p(z)) < r we are not optimising a proper ELBO (it misses the KL term), and the max introduces a discontinuity at KL(q(z|x, λ)||p(z)) = r. Chen et al. (2017) propose soft free bits (SFB), which instead multiplies the KL term in the ELBO by a weighing factor 0 < β ≤ 1 that is dynamically adjusted based on the target rate r: β is incremented (or reduced) by α if R > γr (or R < εr). Note that this technique requires hyperparameters (i.e. γ, ε, α) besides r to be tuned in order to determine how β is updated.
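The two rules can be sketched as follows; the default values for α, γ and ε are illustrative, since the paper tunes them per experiment.

```python
def free_bits_kl(kl, r):
    # FB: the KL term becomes max(r, KL); below r the term is constant,
    # so the first r nats of rate are not penalised
    return max(r, kl)

def soft_free_bits_beta(beta, rate, r, alpha=0.05, gamma=1.05, eps=0.95):
    # SFB: nudge the KL weight beta so the rate stays near the target r
    # (alpha, gamma, eps must be tuned; the defaults here are illustrative)
    if rate > gamma * r:
        beta = min(1.0, beta + alpha)   # rate too high: penalise KL more
    elif rate < eps * r:
        beta = max(1e-4, beta - alpha)  # rate too low: penalise KL less
    return beta
```

FB changes the objective's value discontinuously at KL = r, whereas SFB keeps the objective smooth but moves the burden to tuning the β-update hyperparameters.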
Change of objective If we accept there is a fundamental problem with the ELBO, we may seek alternative objectives and relate them to quantities of interest such as marginal likelihood and MI. A simple adaptation of the ELBO is weighing its KL term by a constant factor β (β-VAE; Higgins et al., 2017). Although it was originally aimed at disentanglement of latent features with β > 1, setting β < 1 promotes R > 0 and thus increased MI. Whilst a useful counter to posterior collapse, a low β might lead to variational posteriors becoming point estimates. The InfoVAE objective (Zhao et al., 2018b) mitigates this with an extra term on top of the β-VAE objective which minimises the divergence between the aggregated variational posterior q(z) = E_X[q(z|x)] and the prior.
In our experiments we compute this divergence with an unbiased estimate of the maximum mean discrepancy (MMD; Gretton et al., 2012).

Minimum desired rate
We propose minimum desired rate (MDR), a technique to attain ELBO values at a pre-specified rate r that does not suffer from the gradient discontinuities of FB and does not introduce the additional hyperparameters of SFB. The idea is to optimise the ELBO subject to a minimum rate constraint:

max_{θ,λ} ELBO(θ, λ)    s.t. KL(q(z|x, λ)||p(z)) ≥ r .    (5)

Because constrained optimisation is generally intractable, we optimise the Lagrangian (Boyd and Vandenberghe, 2004)

Φ(θ, λ, u) = ELBO(θ, λ) − u (r − KL(q(z|x, λ)||p(z))) ,

where u ∈ R_{≥0} is a non-negative Lagrangian multiplier. We define the dual function φ(u) = max_{θ,λ} Φ(θ, λ, u) and solve the dual problem min_{u ∈ R_{≥0}} φ(u). Local optima of the resulting min-max objective can be found by performing stochastic gradient descent with respect to u and stochastic gradient ascent with respect to θ and λ.
Appendix B presents further theoretical remarks comparing β-VAE, KL annealing, FB, SFB and the proposed MDR. We show that MDR is a form of KL weighing, albeit one that targets a specific rate. It can be seen, for example, as β-VAE with β = 1 − u (though note that u is not fixed). Compared to KL annealing, we argue that a target rate is far more interpretable a hyperparameter than the length (number of steps) and type (e.g. linear or exponential) of an annealing schedule. Like SFB, MDR addresses FB's discontinuity in the gradients of the rate. Finally, we show that MDR is a form of SFB where α is dynamically set to ∂φ(u)/∂u, and thus much simpler to tune.
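To make the min-max procedure concrete, here is a scalar toy problem: a single parameter phi stands in for (θ, λ), the "ELBO" is the stand-in −phi², which on its own prefers a collapsed rate, and the "rate" is softplus(phi). All of these are illustrative assumptions, not the paper's model; only the update rules mirror MDR.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The unconstrained optimum is phi = 0, where the rate softplus(0) ~= 0.69
# falls below the minimum desired rate r, so the constraint must bind.
r = 2.0
phi, u, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    kl = softplus(phi)
    # Lagrangian: Phi = ELBO - u * (r - KL), with toy ELBO = -phi^2
    grad_phi = -2.0 * phi + u * sigmoid(phi)  # gradient ascent on phi
    phi += lr * grad_phi
    u = max(0.0, u + lr * (r - kl))           # gradient descent on u
final_rate = softplus(phi)
```

The multiplier u grows while the rate is below r, pushing the model to encode more, and stops growing once the constraint KL ≥ r is met; at convergence the toy rate sits at the target.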
The observation by Chen et al. (2017) suggests that estimating q(z|x, λ) and p(z, x|θ) jointly leads to choosing a generative model such that its corresponding (true) posterior is simple and can be matched exactly. With a Gaussian prior and a complex observation model, unless the latent variable is ignored, the posterior is certainly not Gaussian and likely multimodal. In §5.1, we modify Bowman et al. (2016)'s inference network to parameterise an expressive posterior approximation in an attempt to reach better local optima.
The information-theoretic perspective of Alemi et al. (2018) suggests that the prior regularises the inference model, capping the MI between Z and X. Their bounds also suggest that, for a fixed posterior approximation, the optimum prior is the aggregated posterior q(z) = E_X[q(z|x)]; therefore, investigating the use of strong priors seems like a fruitful avenue for effective estimation of DGMs. In §5.2, we modify SENVAE's generative story to employ an expressive prior.

Expressive posterior
We improve inference for SENVAE using normalising flows (NFs; Rezende and Mohamed, 2015). An NF expresses the density of a transformed variable z = t(ε) in terms of the density of a base variable ε ∼ s(·) using the change of density rule

q(z) = s(ε) |det J_t(ε)|^{-1} ,

where z and ε are d_z-dimensional and t(ε) is a differentiable and invertible transformation with Jacobian J_t(ε). For efficiency, it is crucial that the determinant of J_t(ε) is simple, e.g. computable in O(d_z). NFs parameterise t (or its inverse) with neural networks, where either t, the network, or both are carefully designed to comply with the aforementioned conditions. A special case of an NF is when ε ∼ N(0, I) and t(ε) = μ + σ ⊙ ε is affine with strictly positive slope, which essentially makes Z ∼ N(μ, diag(σ ⊙ σ)) a diagonal Gaussian.
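The affine special case lets us verify the change-of-density rule numerically: pushing a standard Gaussian sample through t(ε) = μ + σ ⊙ ε and applying the rule must reproduce the diagonal-Gaussian log-density, up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(2)
dz = 3
mu = rng.normal(size=dz)
sigma = np.exp(rng.normal(size=dz))  # strictly positive slope

def log_std_normal(e):
    return -0.5 * np.sum(e**2 + np.log(2 * np.pi))

# affine flow t(eps) = mu + sigma * eps, the simplest NF
eps = rng.standard_normal(dz)
z = mu + sigma * eps

# change of density: log q(z) = log s(eps) - log |det J_t(eps)|,
# and for the affine map |det J_t| = prod_k sigma_k
log_q_flow = log_std_normal(eps) - np.sum(np.log(sigma))

# direct diagonal-Gaussian log-density at z, for comparison
log_q_gauss = -0.5 * np.sum(((z - mu) / sigma) ** 2
                            + np.log(2 * np.pi) + 2 * np.log(sigma))
```

The two quantities agree exactly, which is the sanity check one would also run for more elaborate flows.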
We design a posterior approximation based on an inverse autoregressive flow (IAF; Kingma et al., 2016), whereby we transform ε ∼ N(0, I_{d_z}) into a posterior sample z = t(ε, x; λ) via an affine transformation whose inverse is autoregressive. This is crucial to obtaining a Jacobian whose determinant is simple to compute (see Appendix C.1 for the derivation), i.e. |det J_t(ε)| = ∏_{k=1}^{d_z} σ_k(ε_{<k}, x; λ). For increased flexibility, we compose T such transformations, each parameterised by an independent MADE (Germain et al., 2015). We also investigate a more compact flow, in terms of number of parameters, known as a planar flow (PF; Rezende and Mohamed, 2015). The transformation in a PF is not autoregressive, but it is designed such that the determinant of its Jacobian is simple (see Appendix C.2).

Expressive priors
Here we extend the prior to a more complex, ideally multimodal, parametric family and fit p(z|θ). A perhaps obvious choice is a uniform mixture of K Gaussians (MoG), i.e.

p(z|θ) = (1/K) ∑_{k=1}^{K} N(z | μ^{(k)}, diag(σ^{(k)} ⊙ σ^{(k)})) ,

where the Gaussian parameters are optimised along with the other generative parameters.
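Evaluating such a mixture prior is a log-sum-exp over component log-densities. The sketch below uses random, fixed mixture parameters purely for illustration; in the model they would be trained.

```python
import numpy as np

rng = np.random.default_rng(3)
K, dz = 4, 2
locs = rng.normal(size=(K, dz))            # mu_k (illustrative values)
scales = np.exp(rng.normal(size=(K, dz)))  # sigma_k, strictly positive

def log_normal(z, mu, sigma):
    # diagonal-Gaussian log-density, broadcastable over components
    return -0.5 * np.sum(((z - mu) / sigma) ** 2
                         + np.log(2 * np.pi) + 2 * np.log(sigma), axis=-1)

def log_mog_prior(z):
    # log p(z) = log [ 1/K sum_k N(z | mu_k, diag(sigma_k^2)) ],
    # computed stably via log-sum-exp
    comp = log_normal(z[None, :], locs, scales) - np.log(K)
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))

z = rng.standard_normal(dz)
lp = log_mog_prior(z)
```

The log-sum-exp trick matters here because component log-densities can be very negative for high-dimensional z.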
A less obvious choice is a variational mixture of posteriors (VampPrior; Tomczak and Welling, 2017). This prior is motivated by the fact that, for a fixed posterior approximation, the prior that optimises the ELBO is the aggregated posterior E_X[q(z|x, λ)]. Though we could obtain an empirical estimate of this quantity, this is an intensive computation to perform for every sampled z. Instead, Tomczak and Welling (2017) propose to use K learned pseudo inputs and design the prior

p(z|θ) = (1/K) ∑_{k=1}^{K} q(z | v^{(k)}, λ) ,

where v^{(k)} is the kth such input, in their case a continuous deterministic vector. Again the parameters of the prior, i.e. {v^{(k)}}_{k=1}^{K}, are optimised along with the other generative parameters.
Applying this technique to our deep generative LM poses additional challenges, as our inference model conditions on a sequence of discrete observations. We adapt this technique by point-estimating a sequence of word embeddings, which makes up a pseudo input. That is, v^{(k)} is a sequence of l_k vectors, each with the dimensionality of our embeddings, where l_k is the length of the sequence (fixed at the beginning of training). See Appendix A.1 for remarks about both priors.
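The adapted VampPrior can be sketched as a mixture of the variational posteriors evaluated at the learned pseudo inputs. The tiny encoder below is an assumed stand-in for the inference network g(·; λ), and the pseudo inputs are random rather than learned; only the mixture structure mirrors the prior above.

```python
import numpy as np

rng = np.random.default_rng(4)
K, dz, emb_dim, seq_len = 3, 2, 5, 4

# pseudo inputs: for text, each v_k is a point-estimated sequence of
# word embeddings, not a discrete sentence (random here for illustration)
pseudo_inputs = rng.normal(size=(K, seq_len, emb_dim))

def encoder(v):
    # stand-in for g(.; lambda): maps an embedded sequence to Gaussian
    # posterior parameters (assumed form, not the paper's bi-RNN)
    h = np.tanh(v.sum(axis=0))
    return h[:dz], np.exp(h[dz:2 * dz])  # location u, positive scale s

def log_normal(z, u, s):
    return -0.5 * np.sum(((z - u) / s) ** 2 + np.log(2 * np.pi) + 2 * np.log(s))

def log_vamp_prior(z):
    # log p(z) = log [ 1/K sum_k q(z | v_k, lambda) ] via log-sum-exp
    comp = np.array([log_normal(z, *encoder(v)) for v in pseudo_inputs]) - np.log(K)
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())

z_eval = rng.standard_normal(dz)
lp = log_vamp_prior(z_eval)
```

Because the prior reuses the inference network, updating λ moves both the posteriors and the prior components.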

KL term
Be it due to an expressive posterior or due to an expressive prior (or both), we lose analytical access to the KL term in the ELBO. That is, however, not a problem, since we can MC-estimate the KL term using M samples z^{(m)} ∼ q(z|x, λ):

KL(q(z|x, λ)||p(z|θ)) ≈ (1/M) ∑_{m=1}^{M} [log q(z^{(m)}|x, λ) − log p(z^{(m)}|θ)] ,

where in experiments we set M = 1.
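The estimator can be checked on a case where the KL is also available in closed form. Here both q and p are Gaussian (an assumption made so the analytic answer exists; with expressive components only the MC estimate is available), and a large M is used to make the agreement visible.

```python
import numpy as np

rng = np.random.default_rng(5)
dz, M = 4, 100000

u = rng.normal(size=dz)              # posterior location
s = np.exp(0.1 * rng.normal(size=dz))  # posterior scale, near 1

def log_q(z):
    return -0.5 * np.sum(((z - u) / s) ** 2
                         + np.log(2 * np.pi) + 2 * np.log(s), axis=-1)

def log_p(z):
    # standard Gaussian prior
    return -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=-1)

z = u + s * rng.standard_normal((M, dz))   # z^(m) ~ q(z|x)
mc_kl = np.mean(log_q(z) - log_p(z))       # (1/M) sum_m [log q - log p]
exact_kl = 0.5 * np.sum(s ** 2 + u ** 2 - 1.0 - 2.0 * np.log(s))
```

With M = 1, as in the experiments, the estimate is unbiased but noisy; the noise is averaged out over minibatches and training steps.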

Experiments
Our goal is to identify which techniques are effective in training VAEs for language modelling, and our evaluation concentrates on intrinsic metrics: negative log-likelihood (NLL), perplexity per token (PPL), rate (R), distortion (D), the number of active units (AU; Burda et al., 2015), and the gap in accuracy of next-word prediction (given gold prefixes) when decoding from prior samples versus decoding from posterior samples (ACC_gap).
For VAE models, NLL (and therefore PPL) can only be estimated, since we do not have access to the exact marginal likelihood. For that we derive an importance sampling (IS) estimate

p(x|θ) ≈ (1/S) ∑_{s=1}^{S} p(z^{(s)}, x|θ) / q(z^{(s)}|x)    where z^{(s)} ∼ q(z|x) ,    (12)

using our trained approximate posterior as importance distribution (we use S = 1000 samples). We train and test our models on the English Penn Treebank (PTB) dataset (Marcus et al., 1993). Hyperparameters for our architectures are chosen via Bayesian optimisation (BO; Snoek et al., 2012); see Appendix A.2 for details.
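The estimator in (12) is computed in log space for numerical stability. The toy joint below is chosen so the exact marginal is known (both likelihood and proposal are Gaussians, an assumption made purely so the estimate can be checked); the log-mean-exp pattern is the part that carries over to the real model.

```python
import numpy as np

rng = np.random.default_rng(6)
dz, S = 2, 1000
x_obs = np.array([0.5, -1.0])  # a fixed toy "observation"

# toy joint: p(z) = N(0, I), p(x|z) = N(x; z, I), hence p(x) = N(x; 0, 2I)
def log_prior(z):
    return -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=-1)

def log_lik(z):
    return -0.5 * np.sum((x_obs - z) ** 2 + np.log(2 * np.pi), axis=-1)

# importance distribution standing in for q(z|x): a Gaussian centred
# near the true posterior mean x_obs / 2
z = x_obs / 2 + rng.standard_normal((S, dz))
log_q = -0.5 * np.sum((z - x_obs / 2) ** 2 + np.log(2 * np.pi), axis=-1)

# log of (1/S) sum_s p(z_s, x) / q(z_s), computed via log-mean-exp
log_w = log_prior(z) + log_lik(z) - log_q
m = log_w.max()
log_marginal = m + np.log(np.mean(np.exp(log_w - m)))

exact = -0.5 * np.sum(x_obs ** 2 / 2 + np.log(2 * np.pi * 2))
```

By Jensen's inequality the estimate is a lower bound in expectation of log p(x), and it tightens as S grows, which is why many samples beat a single-sample ELBO for model selection.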
Baseline We compare our RNNLM to an external baseline employing a comparable number of parameters (Dyer et al., 2016). Our RNNLM is a strong baseline and its architecture makes a strong generator building block.
On optimisation strategies First, we assess the effectiveness of techniques that aim at promoting local optima of SENVAE with a better MI trade-off.
The techniques we compare have hyperparameters of their own (see Table 2), which we tune using BO towards minimising the estimated NLL of the validation data. As for the architecture, the approximate posterior q(z|x, λ) employs a bidirectional recurrent encoder, and the generator P(x|z, θ) is essentially our RNNLM initialised with a learned projection of z (Appendix A.1 contains the complete specification). Models were trained with Adam (Kingma and Ba, 2014) with default parameters and a learning rate of 0.001 until convergence, five times for each technique.
Results can be found in Table 3. First, note how the vanilla VAE (no special treatment) encodes no information in latent space (R = 0). Then note that all techniques converged to VAEs that attain better perplexities than the RNNLM, and all but annealed word dropout did so at non-negligible rate. Notably, the two most popular techniques, word dropout and KL annealing, perform subpar to the other techniques. The remaining techniques fall into two classes: the first requires specifying a target rate, whereas the second requires tuning of one or more hyperparameters (soft free bits requires both). We argue that the rate hyperparameter is more interpretable and practical in most cases; for example, it likely requires less (manual or Bayesian) tuning by the researcher. Hence, we further investigated this first class, specifically FB and MDR, by varying the target rate further. Figure 2a shows that they attain comparable perplexities over a large range of rates. Figure 2c shows the difference between the specified target rate and the rate estimated on validation data at the end of training: MDR is just as good as FB for lower targets and becomes more effective than FB for higher targets.
On expressive priors and posteriors Second, we compare the impact of expressive posteriors and priors. This time, flow and prior hyperparameters were selected via grid search, and can be found in Appendix A.1. All models were trained with a target rate of five, with settings otherwise the same as in the previous experiment. In Table 4 it can be seen that more expressive components did not improve perplexity further. It is possible, however, that now that we have stronger latent components, we need to target models with higher MI between Z and X. Figure 3a shows that this is not the case: all models perform roughly the same, and beyond 20 nats performance degrades quickly. It is worth highlighting that, though perplexity did not improve, models with expressive latent components did show other indicators of increased MI. Figure 3b shows that SENVAEs trained with expressive latent components learn to rely more on information encoded in the latent variables; note the increased gap in performance when reconstructing from posterior rather than prior samples. This result is also hinted at by the increase in active units for expressive latent components shown in Table 4.

Generated samples
Figure 4 shows samples from a well-trained SENVAE, where we decode greedily from a prior sample; this way all variability is due to the generator's reliance on the latent sample. Recall that a vanilla VAE ignores z and thus greedy generation from a prior sample is essentially deterministic in that case (see Figure 1a). Next to the samples we show the closest training instance, which we measure in terms of an edit distance (TER; Snover et al., 2006). The motivation to retrieve this "nearest neighbour" is to help us assess whether the generator is producing novel text or simply reproducing something it memorised from training. We also show a homotopy in Figure 5: here we decode greedily from points lying between a posterior sample conditioned on the first sentence and a posterior sample conditioned on the last sentence. In contrast to the vanilla VAE (Figure 1b), neighbourhood in latent space is now used to capture some regularities in data space. These samples add support to the quantitative evidence that our DGMs have been effectively trained not to neglect the latent space. In Appendix D we provide more samples (also for other variants of the model).
Recommendations Based on our path through the land of SENVAEs, we recommend targeting a specific rate via MDR (or FB) instead of annealing (or word dropout). It is easy to pick a rate by plotting validation performance against a handful of rate values, without sophisticated Bayesian optimisation. Use importance-sampled estimates of NLL, rather than single-sample ELBO estimates, for model selection, for the latter can be too loose a bound and/or too heavily influenced by noisy estimates of KL. Use as many samples as you can afford for that; you will observe a tighter bound and lower variance (we use 1000). Inspect sentences generated by greedily decoding from a prior (or posterior) sample, as this shows whether the generator is at all sensitive to variation in latent space. Retrieve nearest neighbours from the training data to spot copying behaviour. Do investigate stronger latent components (priors and approximate posteriors); they seem to lead to higher mutual information without hurting perplexity (which weaker generators probably would).

Related Work
In NLP, posterior collapse was probably first noticed by Bowman et al. (2016), who addressed it via word dropout and/or KL scaling. Further investigation revealed that in the presence of strong generators the ELBO itself becomes the culprit (Chen et al., 2017; Alemi et al., 2018), for it does not have a term that explicitly promotes high MI between latent and observed data. Posterior collapse has also been ascribed to amortised inference (Kim et al., 2018).

[Figure 4: Samples from SENVAE (MoG prior and IAF posterior) trained via MDR (r = 10): we sample from the prior and decode greedily. We also show the closest training instance in terms of a string edit distance (TER).]

[Figure 5: Homotopy: greedy decoding from points along an interpolation between two posterior samples.]
GECO (Rezende and Viola, 2018) and the Lagrangian VAE (LagVAE; Zhao et al., 2018a) cast VAE optimisation as a dual problem, and in that they are closely related to our MDR. GECO targets minimisation of KL(q(z|x)||p(z)) under constraints on reconstruction error, whereas LagVAE targets either maximisation or minimisation of (bounds on) the MI between X and Z under constraints on the InfoVAE objective. Contrary to MDR, GECO focuses on latent-space regularisation and offers no explicit mechanism to mitigate posterior collapse. LagVAE, in MI-maximisation mode, promotes non-negligible rates, but requires constraints based on feasible ELBO values. Thus, in this setting, it is somewhat the opposite of our technique: MDR optimises the ELBO while targeting a minimum rate (an upperbound on MI), whereas LagVAE maximises MI while targeting an ELBO. It might depend on the specific problem which of the two methods is more convenient. All three techniques share the advantage that they can be trivially extended with other constraints at the researcher's behest.
Expressive latent components have been extensively and successfully applied in the image domain. Expressive posteriors, based on NFs, include the IAF (Kingma et al., 2016), NAF (Huang et al., 2018a), ODE (Chen et al., 2018), FFJORD (Grathwohl et al., 2019), the Sylvester flow (Van den Berg et al., 2018) and the Householder flow (Tomczak and Welling, 2017). Expressive priors include the VampPrior (Tomczak and Welling, 2017), autoregressive flows (Papamakarios et al., 2017) and various non-parametric priors (Nalisnick and Smyth, 2016; Goyal et al., 2017b; Bodin et al., 2017). However, these techniques have seen little application to the language domain so far, with the exception of the Householder flow for variational topic modelling (Liu et al., 2018) and, concurrently to this work, NFs for latent sentence modelling with character-level latent variables and weak generators (Ziegler and Rush, 2019). We believe we are the first to employ expressive latent models at the sentence level, and hope this will stimulate the NLP community to further investigate these techniques.

Discussion
The typical RNNLM is built upon an exact factorisation of the joint distribution, thus a well-trained architecture is hard to improve upon in terms of log-likelihood of gold-standard data. Our interest in latent variable models stems from the desire to obtain generative stories that are less opaque than that of an RNNLM, for example, in that they may expose knobs that we can use to control generation and a hierarchy of steps that may award a degree of interpretability to the model. The SENVAE is not that model, but it is a crucial building block in the pursuit of hierarchical probabilistic models of language. SENVAE is a deep generative model whose generative story is rather shallow; yet, due to its strong generator component, it is hard to make effective use of the extra knob it offers. In this paper, we have shown that effective estimation of such a model is possible; in particular, optimisation subject to a minimum rate constraint seems a simple and effective strategy to alleviate posterior collapse. Many questions remain open, especially regarding the potential of expressive latent components, but we hope this work, i.e. the organised review it contributes and the techniques it introduces, will pave the way to deeper (in statistical hierarchy) generative models of language.

A Architectures and Hyperparameters
In order to ensure that all our experiments are fully reproducible, this section provides an extensive overview of the model architectures, as well as the model and optimisation hyperparameters.
Some hyperparameters are common to all experiments, e.g. the optimiser and dropout; they can be found in Table 5. All models were optimised with Adam using default settings (Kingma and Ba, 2014). To regularise the models, we use (variational) dropout with a shared mask across timesteps (Gal and Ghahramani, 2016) and weight decay proportional to the dropout rate (Gal and Ghahramani, 2015) on the input and output layers of the generative networks (i.e. RNNLM and the recurrent decoder in SENVAE). No dropout is applied to layers of the inference network, as this does not lead to consistent empirical benefits and lacks a good theoretical basis. Gradient norms are clipped to prevent exploding gradients, and long sentences are truncated to three standard deviations above the average sentence length in the training data.

A.1 Architectures
This section describes the components that parameterise our models. We use mnemonic blocks of the form layer(inputs; parameters) to describe architectures. Table 6 lists hyperparameters for the models discussed in what follows.
RNNLM At each step, an RNNLM parameterises a categorical distribution over the vocabulary, i.e. X_i | x_{<i} ~ Cat(f(x_{<i}; θ)), where f(x_{<i}; θ) = softmax(s_i) with

h_i = gru(emb(x_{i-1}; θ), h_{i-1}; θ)
s_i = affine(h_i; θ) .

We employ an embedding layer (emb), one (or more) GRU cell(s) (the initial state h_0 ∈ θ is a parameter of the model), and an affine layer to map from the dimensionality of the GRU to the vocabulary size.
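The per-step categorical factorisation log p(x) = Σ_i log Cat(x_i | softmax(s_i)) can be made concrete with a small sketch. Names are illustrative, and the logits s_i are assumed given rather than produced by a GRU.

```python
from math import exp, log

def log_softmax(logits, idx):
    """log of the softmax probability of entry idx, computed stably."""
    m = max(logits)  # subtract the max for numerical stability
    logz = m + log(sum(exp(s - m) for s in logits))
    return logits[idx] - logz

def sentence_log_prob(step_logits, token_ids):
    """log p(x) = sum_i log Cat(x_i | softmax(s_i))."""
    return sum(log_softmax(s, x) for s, x in zip(step_logits, token_ids))
```

With uniform logits over a two-word vocabulary, each step contributes -log 2 to the sentence log-probability.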
Gaussian SENVAE A Gaussian SENVAE also parameterises a categorical distribution over the vocabulary for each given prefix, but, in addition, it conditions on a latent embedding Z ~ N(0, I), i.e. X_i | z, x_{<i} ~ Cat(f(z, x_{<i}; θ)), where f(z, x_{<i}; θ) = softmax(s_i) as in the RNNLM. Compared to the RNNLM, we modify f only slightly by initialising the GRU cell(s) with h_0 computed as a learnt transformation of z. Because the marginal of the Gaussian SENVAE is intractable, we train it via variational inference using an inference model q(z|x, λ) = N(z | u, diag(s ⊙ s)), where u and s are computed by the inference network from an encoding of x. Note that we reuse the embedding layer from the generative model. Finally, a sample is obtained via z = u + s ⊙ ε, where ε ~ N(0, I_{d_z}).
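The reparameterised sample and the diagonal-Gaussian log-density of q(z|x, λ) can be sketched as follows. This is a minimal illustration in which u and s are given vectors rather than outputs of the inference network.

```python
import random
from math import log, pi

def sample_z(u, s, rng=random):
    """z = u + s ⊙ ε with ε ~ N(0, I); noise enters additively,
    so the sample is differentiable w.r.t. u and s."""
    return [ui + si * rng.gauss(0.0, 1.0) for ui, si in zip(u, s)]

def gaussian_log_q(z, u, s):
    """log N(z | u, diag(s ⊙ s)), needed for the rate term of the ELBO."""
    return sum(-0.5 * log(2 * pi) - log(si) - 0.5 * ((zi - ui) / si) ** 2
               for zi, ui, si in zip(z, u, s))
```

With s = 0 the sample collapses to the mean u, illustrating that all stochasticity comes from ε.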
IAF SENVAE Unlike the Gaussian case, an IAF (Kingma et al., 2016) does not parameterise a distribution directly, but rather a sampling procedure in which we transform a d_z-dimensional sample from a base distribution (here a standard Gaussian) via an invertible and differentiable transformation.
Here we show the design of an IAF employing T MADE layers. The context vector c represents the complete input sequence and allows each step of the flow to condition on x. Note that while z_0 is Gaussian-distributed, i.e. Z_0 ~ N(u_0, diag(s_0 ⊙ s_0)), the distribution of each z_t for t = 1, ..., T is potentially increasingly more complex. A sample from q(z|x, λ) is the output of the flow at step T, i.e. z = g_T ∘ ... ∘ g_1(z_0; λ), whose log-density is

log q(z|x, λ) = log q(z_0|x, λ) − Σ_{t=1}^T log |det J_{g_t}(z_{t−1})| .

See Appendix C for more on NFs.
MADE We denote by u, s = made(z, c; λ_M) a masked dense layer (Germain et al., 2015) with inputs z and c, which is autoregressive on z. Here T is a lower-triangular weight matrix with non-zero diagonal elements and T̂ a strictly lower-triangular weight matrix (with zeros on and above the diagonal); these masked matrices, together with the weights acting on c, make up the parameters λ_M of the MADE.

Planar SENVAE A planar flow has a more compact parameterisation than an IAF, but is based on the same principle, namely, we parameterise a sampling procedure by an invertible and differentiable transformation of a fixed random source (a standard Gaussian in this case). A sample from q(z|x, λ) is the output of the flow at step T, i.e. z = g_T ∘ ... ∘ g_1(z_0; λ), with log-density

log q(z|x, λ) = log q(z_0|x, λ) − Σ_{t=1}^T log |1 + u_t^⊤ ψ(z_{t−1})| ,

where ψ(z_t) = tanh′(w_t^⊤ z_t + b_t) w_t (Rezende and Mohamed, 2015). In line with the work of Van den Berg et al. (2018), we amortise all parameters of the flow in addition to the parameters u_0, s_0 of the base distribution.
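To illustrate the autoregressive masking, the following sketch builds the two Boolean masks described above. This is illustrative only; a real MADE applies such masks elementwise to its weight matrices so that output k depends on inputs j ≤ k (or j < k).

```python
def made_masks(d):
    """Return (T, T_hat) as 0/1 masks of size d x d:
    T     lower-triangular with non-zero diagonal (j <= i allowed),
    T_hat strictly lower-triangular (only j < i allowed)."""
    T = [[1 if j <= i else 0 for j in range(d)] for i in range(d)]
    T_hat = [[1 if j < i else 0 for j in range(d)] for i in range(d)]
    return T, T_hat
```

Row i of T_hat is all zeros for i = 0, so the first output cannot see any part of z, which is exactly the autoregressive property.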

MoG prior
We parameterise K diagonal Gaussians, which are mixed uniformly. To do so we need K location vectors μ^(k), each in R^{d_z}, and K scale vectors σ^(k), each in R^{d_z}_{>0}. To ensure strict positivity of the scales we set σ^(k) = softplus(σ̃^(k)). The set of generative parameters θ is therefore extended with {μ^(k), σ̃^(k) : k = 1, ..., K}.

VampPrior For this we estimate K pseudo-input sequences. This means we extend the set of generative parameters θ with pseudo-input embeddings v_1^(k), ..., v_{l_k}^(k), each in R^{d_e}, for k = 1, ..., K. For each k = 1, ..., K, we sample the length l_k at the beginning of training and keep it fixed. Specifically, we drew K samples from a normal distribution, l_k ~ N(·|μ_l, σ_l), which we rounded to the nearest integer; μ_l and σ_l are the dataset sentence length mean and variance, respectively.
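The uniform mixture density of the MoG prior can be sketched as follows. This is illustrative code: log_normal assumes a diagonal covariance, and the component parameters are given lists rather than learnt tensors.

```python
from math import exp, log, pi

def log_normal(z, mu, sigma):
    """log N(z | mu, diag(sigma^2)) for a diagonal Gaussian."""
    return sum(-0.5 * log(2 * pi) - log(s) - 0.5 * ((zi - m) / s) ** 2
               for zi, m, s in zip(z, mu, sigma))

def mog_log_prior(z, locs, scales):
    """log p(z) = log (1/K) sum_k N(z | mu_k, diag(sigma_k^2)),
    computed with the log-sum-exp trick for stability."""
    comps = [log_normal(z, m, s) for m, s in zip(locs, scales)]
    mx = max(comps)
    return mx + log(sum(exp(c - mx) for c in comps)) - log(len(comps))
```

With K = 1 (or K identical components) the mixture reduces to a single Gaussian, a useful sanity check.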

A.2 Bayesian Optimisation
Bayesian optimisation (BO) is an efficient method to approximately search for global optima of a (typically expensive to compute) objective function y = f(x), where x ∈ R^M is a vector containing the values of M hyperparameters that may influence the outcome of the function (Snoek et al., 2012). Hence, it forms an alternative to grid search or random search (Bergstra and Bengio, 2012) for tuning the hyperparameters of a machine learning algorithm. BO works by assuming that our observations y_n | x_n (for n = 1, ..., N) are drawn from a Gaussian process (GP; Rasmussen and Williams, 2005). Then, based on the GP posterior, we can design and infer an acquisition function. This acquisition function can be used to determine where to "look next" in parameter space, i.e. it can be used to draw x_{N+1}, for which we then evaluate the objective function f(x_{N+1}). This procedure iterates until a set of optimal parameters is found with some level of confidence.
In practice, the efficiency of BO hinges on multiple choices, such as the specific form of the acquisition function, the covariance matrix (or kernel) of the GP, and how the parameters of the acquisition function are estimated. Our objective function is the (importance-sampled) validation NLL, which can only be computed after a model converges (via gradient-based optimisation of the ELBO). We follow the advice of Snoek et al. (2012) and use MCMC for estimating the parameters of the acquisition function. This reduced the number of objective function evaluations, speeding up the overall search. Other settings were also based on results by Snoek et al. (2012), and we refer the interested reader to that paper for more information about BO in general. A summary of all relevant BO settings can be found in Table 7. We used the GPYOPT library (authors, 2016) to implement this procedure.

B Relation between optimisation techniques
It is insightful to compare the various techniques we surveyed to the technique we propose in terms of the quantities involved in their optimisation. To avoid clutter, let us assume a single data point x, and denote the distortion −E_{q(z|x,λ)}[log P(x|z, θ)] by D and the rate KL(q(z|x, λ) || p(z)) by R.
The losses minimised by the β-VAE, KL annealing and SFB all have the form

ℒ_β(θ, λ) = D + βR ,

where β ≥ 0 is a weighting factor. FB minimises the loss

ℒ_FB(θ, λ) = D + max(r, R) ,

where r > 0 is the target rate. Last, with respect to θ and λ, MDR minimises the loss

ℒ_MDR(θ, λ) = D + R + u(r − R) ,

where u ∈ R_{≥0} is the Lagrangian multiplier; and with respect to u, it minimises −u(r − R). Since we aim to minimise these losses as a function of the parameters θ, λ with stochastic gradient descent, it makes sense to evaluate how these methods influence optimisation by checking their gradients. First, FB has the following gradient w.r.t. its parameters:

∇_{θ,λ} ℒ_FB = ∇_{θ,λ} D + [R > r] ∇_{θ,λ} R ,

which shows the discontinuity in the gradients as a result of this objective. I.e., there is a sudden 'jump' from zero to a large gradient w.r.t. the KL when the KL rises above the target rate r. β-VAE, KL annealing and SFB have a gradient that does not suffer such discontinuities:

∇_{θ,λ} ℒ_β = ∇_{θ,λ} D + β ∇_{θ,λ} R ,

where the magnitude of the gradient w.r.t. the KL is influenced by the value of β at that point in the optimisation. Last, observe the gradient of the MDR objective:

∇_{θ,λ} ℒ_MDR = ∇_{θ,λ} D + (1 − u) ∇_{θ,λ} R ,

thus, essentially, ∇_{θ,λ} ℒ_β(θ, λ) with β = 1 − u. Hence, MDR is another form of KL weighting, albeit one that allows specific rate targeting. Compared to β-VAE, MDR has the advantage that β is not fixed, but estimated to meet the requirements on rate. This might mitigate the problem noticed by He et al. (2019) that β-VAE can lead to under-regularisation at the end of training. Similar to their technique, MDR can cut the inference network more 'slack' during the start of training, but enforce stricter regularisation at the end, once the constraint is met. We observe that this happens in practice. Furthermore, we would argue that tuning towards a specific rate is more interpretable than tuning β.
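The three losses can be written down compactly; the following sketch, with D and R as given scalars, makes the relation between them explicit. In particular, mdr_loss with r = 0 coincides with beta_vae_loss at β = 1 − u, matching the gradient analysis above.

```python
def beta_vae_loss(D, R, beta):
    """beta-VAE / KL annealing / SFB all weight the rate term: D + beta * R."""
    return D + beta * R

def free_bits_loss(D, R, r):
    """FB: the rate contributes no gradient while it sits below the target r."""
    return D + max(R, r)

def mdr_loss(D, R, r, u):
    """MDR Lagrangian D + R + u * (r - R), minimised in (theta, lambda);
    the multiplier u is updated separately to enforce R >= r."""
    return D + R + u * (r - R)
```

Differentiating mdr_loss w.r.t. R gives 1 − u, which is exactly the effective KL weight β discussed above.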
A similar argument can be made against KL annealing. Although β is not fixed in this scheme, it requires multiple decisions that are not very interpretable, such as the length (number of steps) and type (e.g. linear or exponential) of the schedule.
Most similar, then, is SFB. Like MDR, it flexibly updates β by targeting a rate. However, differences between the two techniques become apparent when we observe how β is updated. SFB updates β via a heuristic rule with hyperparameters α, γ and ε. In case of MDR (not taking optimiser-specific dynamics into account), the multiplier is updated as

u ← max(0, u + ρ(r − R)) ,

where ρ is a learning rate. From this, we can draw the conclusion that MDR is akin to SFB without any extra hyperparameters. Yet, it also gives some insight into suitable hyperparameters for SFB; if we set α = ρ(R − r), γ = 1 and ε = 1, SFB is essentially equal to performing Lagrangian relaxation on the ELBO with a constraint on the minimum rate.
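The MDR multiplier update can be sketched as a projected gradient step. This is illustrative; in practice the update goes through an optimiser with its own dynamics rather than plain SGD.

```python
def mdr_update_u(u, R, r, rho):
    """One ascent step on the multiplier for the constraint R >= r,
    projected back onto u >= 0. When the rate R is below the target r,
    u grows and the effective KL weight beta = 1 - u shrinks."""
    return max(0.0, u + rho * (r - R))
```

When the constraint is satisfied with slack (R much larger than r), the update drives u back towards zero, recovering the plain ELBO.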
All in all, this analysis shows that there is a clear relation between several of the optimisation techniques compared in this paper. MDR seems to be the most flexible, whilst requiring the least amount of hyperparameter tuning or heuristics.

C Normalising flows
This section reviews a general class of reparameterisable distributions known as normalising flows (NFs; Tabak et al., 2010). An NF expresses the density of a transformed variable y = t(x) in terms of the density of a base variable x using the change of density rule

p_Y(y) = p_X(x) |det J_t(x)|^{−1} ,

or conversely, by application of the inverse function theorem,

p_Y(y) = p_X(t^{−1}(y)) |det J_{t^{−1}}(y)| ,

where x and y are D-dimensional and t(x) is a differentiable and invertible transformation with Jacobian J_t(x). The change of density rule can be used to map a sample from a complex distribution to a sample from a simple distribution, or the other way around, and it relates their densities analytically. For efficiency, it is crucial that the determinant of J_t(x) be simple to compute, e.g. assessed in time O(D). NFs parameterise t (or its inverse) with neural networks, where either t, the network, or both are carefully designed to comply with the aforementioned conditions.

NFs can be used where the input to the flow is a sample from a simple fixed distribution, such as a uniform or standard Gaussian, and the output is a sample from a much more complex distribution. This leads to very expressive approximate posteriors for amortised variational inference. A general strategy for designing tractable flows is to design simple transformations, each of which meets our requirements, and compose enough of them, exploiting the fact that a composition of invertible functions remains invertible. In fact, where the base distribution is a standard Gaussian and the transformation is affine with strictly positive slope (an invertible and differentiable function), the resulting distribution is a parameterised Gaussian, showing that Gaussians can be seen as particularly simple normalising flows. NFs can also be used where the input to the flow is a data point and the output is a sample from a simple distribution; this leads to very expressive density estimators for continuous observations. Differentiability and invertibility constraints preclude direct use of NFs to model discrete distributions.
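The change of density rule is easy to verify on the affine case mentioned above: pushing a standard Gaussian through t(x) = μ + σx and applying the rule recovers the N(μ, σ²) density. This is an illustrative check, not code from our experiments.

```python
from math import log, pi

def std_normal_log_pdf(x):
    """log density of the base variable x ~ N(0, 1)."""
    return -0.5 * log(2 * pi) - 0.5 * x * x

def affine_flow_log_pdf(y, mu, sigma):
    """Change of density for y = t(x) = mu + sigma * x (sigma > 0):
    log p_Y(y) = log p_X(t^{-1}(y)) - log |det J_t| = log p_X((y-mu)/sigma) - log sigma."""
    x = (y - mu) / sigma
    return std_normal_log_pdf(x) - log(sigma)

def normal_log_pdf(y, mu, sigma):
    """Closed-form log N(y | mu, sigma^2), for comparison."""
    return -0.5 * log(2 * pi) - log(sigma) - 0.5 * ((y - mu) / sigma) ** 2
```

The two routes agree, confirming that a Gaussian is the simplest normalising flow.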

C.1 Inverse autoregressive flows
In an IAF (Kingma et al., 2016), y = t(x) = (t_1(x_1, x_{<1}), ..., t_D(x_D, x_{<D})), where

t_k(x_k, x_{<k}) = μ(x_{<k}) + σ(x_{<k}) x_k

is a differentiable transformation whose inverse is autoregressive (x_k can be recovered from y_k and x_{<k}). The parameters of t_k(·), i.e. μ(x_{<k}) ∈ R and σ(x_{<k}) ∈ R_{>0}, are computed by neural networks, and note that in the forward direction we can compute all D transformations in parallel using a MADE (Germain et al., 2015). Moreover, the Jacobian J_t(x) is lower-triangular and thus has a simple determinant (product of diagonal elements). To see that, let us compute the entries ∂y_k/∂x_j of the Jacobian J_t(x). Below the main diagonal, i.e. k > j, we have ∂y_k/∂x_j = ∂μ(x_{<k})/∂x_j + x_k ∂σ(x_{<k})/∂x_j, which is generally non-zero. On the main diagonal, i.e. k = j, we have ∂y_k/∂x_k = σ(x_{<k}). And finally, above the main diagonal, i.e. k < j, the partial derivative is zero. The Jacobian matrix is therefore lower-triangular with the kth element of its diagonal equal to σ(x_{<k}), which leads to efficient determinant computation:

det J_t(x) = ∏_{k=1}^D σ(x_{<k}) ,

and from the inverse function theorem it holds that det J_{t^{−1}}(y) = (det J_t(x))^{−1}. Therefore, where x is sampled from a simple random source (e.g. a Gaussian), we can assess the log-density of y = t(x) via

log p_Y(t(x)) = log p_X(x) − Σ_{k=1}^D log σ(x_{<k}) .

Naturally, composing T transformations such as this one leads to more complex distributions.
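A single IAF step and its log-determinant can be sketched as follows. This is illustrative: mus and sigmas stand in for the MADE outputs, which in reality depend autoregressively on x.

```python
from math import log

def iaf_step(x, mus, sigmas):
    """y_k = mu_k + sigma_k * x_k with sigma_k > 0.
    Because the Jacobian is lower-triangular with sigma_k on the diagonal,
    the log-determinant is simply sum_k log sigma_k."""
    y = [m + s * xk for m, s, xk in zip(mus, sigmas, x)]
    log_det = sum(log(s) for s in sigmas)
    return y, log_det
```

The log-density of y then follows as log p_X(x) minus the returned log-determinant, and composing several such steps just accumulates the log-determinants.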

C.2 Planar flows
A planar flow (Rezende and Mohamed, 2015) is based on a transformation y = t(x) = x + u h(w^⊤ x + b). Again, composing a number of independently parameterised steps makes the distribution potentially more complex. Unlike in IAFs, the parameters of the flow are not a function of x. On the other hand, while in an IAF y_k depends only on x_{j≤k}, in a planar flow y_k depends on every x_j, albeit only through an (h-transformed) dot product. Originally, planar flows did not explore amortisation, that is, {u, w, b} were free parameters of the model, but for flows that condition on data, we can have NNs predict these parameters from a representation of the data (Van den Berg et al., 2018).
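A single planar step with its log-determinant can be sketched as follows. This is illustrative; the parameters u, w, b are given vectors and scalars rather than predicted by a network.

```python
from math import tanh, log

def planar_step(x, u, w, b):
    """t(x) = x + u * tanh(w.x + b), with
    |det J_t(x)| = |1 + u.psi(x)| and psi(x) = (1 - tanh^2(w.x + b)) * w."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    h = tanh(a)
    h_prime = 1.0 - h * h  # derivative of tanh
    y = [xi + ui * h for xi, ui in zip(x, u)]
    log_abs_det = log(abs(1.0 + sum(ui * h_prime * wi
                                    for ui, wi in zip(u, w))))
    return y, log_abs_det
```

Setting u to the zero vector makes the step the identity with a zero log-determinant, a quick sanity check of the formula.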

[Figure 6 content: pairs of sampled sentences, each shown with the TER distance (0.38 to 0.78) to its closest training instance, for the vanilla SENVAE, SENVAE + MDR training, and SENVAE + Vamp prior.]
Figure 6: Sampled sentences from various models considered in this paper. For the RNNLM, we ancestral-sample directly from the softmax layer. For SENVAE, we sample from the prior and decode greedily. The vanilla SENVAE consistently produces the same sample in this setting; that is because it makes no use of the latent space and all variability is encoded in the dynamics of its strong generator. Other SENVAE models were trained with MDR targeting a rate of 10. Next to each sample we show in italics the closest training instance in terms of an edit distance (i.e. TER). The higher this distance (it varies from 0 to 1), the more novel the sentence is. This gives us an idea of whether the model is generating novel outputs or copying from the training data.
Figure 2: Validation results for SENVAE trained with free-bits (FB) or minimum desired rate (MDR): target rate minus validation rate at the end of training, for various targets.
Figure 3: Comparison of SENVAEs trained with standard prior and Gaussian posterior (Gauss), MoG prior and IAF posterior (IAF-MoG), and Vamp prior and Gaussian posterior (Vamp) to attain pre-specified rates. Accuracy gap: VAEs with stronger latent components rely more on posterior samples for reconstruction.

Figure 5: Latent space homotopy from a properly trained SENVAE. Note the smooth transition of topic and grammaticality of the samples. All sentences were greedily decoded from a prior sample.
In a planar flow, t(x) = x + u h(w^⊤ x + b), where u, w ∈ R^D and b ∈ R are parameters of the flow, and h(·) is a smooth elementwise non-linearity (we use tanh) with derivative h′. It can be shown that |det J_t(x)| = |1 + u^⊤ ψ(x)|, where ψ(x) = h′(w^⊤ x + b) w. Then, where x is sampled from a simple random source (e.g. a Gaussian), log p_Y(t(x)) = log p_X(x) − log |1 + u^⊤ ψ(x)|.

Table 1: Results on the PTB test set: avg ± (std) over five independent runs. Contrary to us, Dyer et al. (2016) removed the end-of-sentence token when computing perplexity. In the last column, we report perplexity computed with the stop token removed.

Table 2: Techniques and their hyperparameters.

Table 3: Performance (avg ± std across 5 independent runs) of SENVAE on the PTB validation set.

Table 4: Performance on the PTB test set of the SENVAE with various prior and posterior distributions (avg ± std across 5 independent runs). All VAEs were trained with a target rate of five, and the top row shows the RNNLM.
Beyond the techniques compared and developed in this work, other solutions have been proposed, including further adaptations to the generator architecture (Semeniuta et al.).