Spherical Latent Spaces for Stable Variational Autoencoders

A hallmark of variational autoencoders (VAEs) for text processing is their combination of powerful encoder-decoder models, such as LSTMs, with simple latent distributions, typically multivariate Gaussians. These models pose a difficult optimization problem: there is an especially bad local optimum where the variational posterior always equals the prior and the model does not use the latent variable at all, a kind of “collapse” which is encouraged by the KL divergence term of the objective. In this work, we experiment with another choice of latent distribution, namely the von Mises-Fisher (vMF) distribution, which places mass on the surface of the unit hypersphere. With this choice of prior and posterior, the KL divergence term now only depends on the variance of the vMF distribution, giving us the ability to treat it as a fixed hyperparameter. We show that doing so not only averts the KL collapse, but consistently gives better likelihoods than Gaussians across a range of modeling conditions, including recurrent language modeling and bag-of-words document modeling. An analysis of the properties of our vMF representations shows that they learn richer and more nuanced structures in their latent representations than their Gaussian counterparts.


Introduction
Recent work has established the effectiveness of deep generative models for a range of tasks in NLP, including text generation (Hu et al., 2017; Yu et al., 2017), machine translation (Zhang et al., 2016), and style transfer (Shen et al., 2017; Zhao et al., 2017a). Variational autoencoders, which have been explored in past work for text modeling (Miao et al., 2016; Bowman et al., 2016), posit a continuous latent variable which is used to capture latent structure in the data. Typical VAE implementations assume the prior of this latent space is a multivariate Gaussian; during training, a Kullback-Leibler (KL) divergence term in the loss function encourages the variational posterior to approximate the prior. One major limitation of this approach observed by past work is that the KL term may encourage the posterior distribution of the latent variable to "collapse" to the prior, effectively rendering the latent structure unused (Bowman et al., 2016; Chen et al., 2016).
In this paper, we propose to use the von Mises-Fisher (vMF) distribution rather than a Gaussian for our latent variable. vMF places a distribution over the unit hypersphere governed by a mean parameter µ and a concentration parameter κ. Our prior is a uniform distribution over the unit hypersphere (κ = 0) and our family of posterior distributions treats κ as a fixed model hyperparameter. Since the KL divergence only depends on κ, we can structurally prevent the KL collapse and make our model's optimization problem easier. We show that this approach is actually more robust than trying to flexibly learn κ, and a wide range of settings for fixed κ lead to good performance. Our model systematically achieves better log likelihoods than analogous Gaussian models while having higher KL divergence values, showing that it more successfully makes use of the latent variables at the end of training.
Past work has suggested several other techniques for dealing with the KL collapse in the Gaussian case. Annealing the weight of the KL term (Bowman et al., 2016) still leaves us with brittleness in the optimization process, as we show in Section 2. Other prior work (Yang et al., 2017; Semeniuta et al., 2017) focuses on using CNNs rather than RNNs as the decoder in order to weaken the model and encourage the use of the latent code, but the gains are limited and changing the decoder in this way requires ad hoc model engineering and careful tuning of various decoder capacity parameters. Our method is orthogonal to the choice of the decoder and can be combined with any of these approaches. Using vMF distributions in VAEs also leaves us the flexibility to modify the prior in other ways, such as using a product distribution with a uniform (Guu et al., 2018) or piecewise constant term (Serban et al., 2017a).
Figure 1: The Neural Variational RNN (NVRNN) language model based on a Gaussian prior (left) and a vMF prior (right). The encoder model first computes the parameters for the variational approximation q_φ(z|x) (see dotted box); we then sample z and generate the word sequence x given z. We show samples from N(0, I) and vMF(•, κ = 100); the latter samples lie on the surface of the unit sphere. While κ can be predicted from the encoder network, we find experimentally that fixing it leads to more stable optimization and better performance.
We evaluate our approach in two generative modeling paradigms. For both RNN language modeling and bag-of-words document modeling, we find that vMF is more robust than a Gaussian prior, and our model learns to rely more on the latent variable while achieving better held-out data likelihoods. To better understand the contrast between these models, we design and conduct a series of experiments to understand the properties of the Gaussian and vMF latent code spaces, which make different structural assumptions. Unsurprisingly, these latent code distributions capture much of the same information as a bag of words, but we show that vMF can more readily go beyond this, capturing ordering information more effectively than a Gaussian code.

Variational Autoencoders for Text
Bowman et al. (2016) propose a variational autoencoder model for generative text modeling inspired by Kingma and Welling (2013). Instead of modeling p(x) directly as in vanilla language models, VAEs introduce a continuous latent variable z and take the form p(z)p(x|z). To train a VAE, we optimize the marginal likelihood p_θ(x) = ∫ p_θ(z) p_θ(x|z) dz. The marginal log likelihood can be written as:
\[ \log p_\theta(x) = \mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) + \mathcal{L}(\theta, \phi; x) \]
\[ \mathcal{L}(\theta, \phi; x) = -\mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \tag{1} \]
Here q_φ(z|x), a variational approximation to the posterior p_θ(z|x), can be variously interpreted as a recognition model or encoder, parameterized by a neural network to encode the sentence x into a dense code z. L(θ, φ; x) is often called the evidence lower bound (ELBO). The first term of the ELBO is the negated KL divergence of the approximate posterior from the prior, and the second term is the expected reconstruction log likelihood.
Since the KL divergence between q_φ(z|x) and the true posterior is always non-negative, L(θ, φ; x) is a lower bound on the marginal log likelihood log p_θ(x).
We optimize L(θ, φ; x), jointly learning the recognition model parameters φ and generative model parameters θ.
For the prior p(z), most previous work uses a centered multivariate Gaussian p_θ(z) = N(z; 0, I). Since Gaussians are a location-scale family of distributions, using them for both the prior and posterior allows us to apply the reparameterization trick and differentiate through the sampling stage z ∼ q_φ(z|x) when optimizing the ELBO in practice (Kingma and Welling, 2013).
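As a concrete illustration of the Gaussian case, the following minimal sketch computes the closed-form KL divergence between a diagonal Gaussian posterior and the N(0, I) prior, and draws a reparameterized sample. The function names and the log-variance parameterization are assumptions made for this example; they are not taken from the paper's implementation.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Given eps, z is a deterministic function of (mu, log_var), which is what
    lets gradients flow through the sampling step.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))   # 0.0: posterior equals the prior
print(gaussian_kl([1.0, -1.0], [0.0, 0.0]))  # 1.0: the mean has moved away
z = reparameterize([1.0, -1.0], [0.0, 0.0], rng)
```

Note that the KL value is zero exactly when the posterior matches the prior, which is the collapsed state the rest of the paper is concerned with.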

Case Study: NVRNN
A Neural Variational RNN (NVRNN) for language modeling is described in Bowman et al. (2016) and depicted in Figure 1. The goal of the NVRNN model is to extract a high-level representation of a sentence into z and reconstruct the sentence with a neural language model.
We denote a sequence of words as x = {x_1, x_2, ..., x_n}. Unlike in vanilla language modeling, an NVRNN conditions on the latent variable z at each step of the generation: p_θ(x|z) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i−1}, z). This probability distribution is modeled using a recurrent model like an LSTM (Hochreiter and Schmidhuber, 1997) as illustrated in Figure 1. There is nothing unique about this choice; other sequence models like a CNN or a Transformer (Vaswani et al., 2017) could be used.

Posterior Collapse
When training a VAE, we update θ and φ simultaneously. Optimizing Eq. 1 gives two gradient terms: an update from the reconstruction loss (likelihood of the correct labels) and an update from the KL divergence. While the reconstruction loss term encourages z to convey useful information to the model, the KL term consistently tries to regularize q_φ(z|x) towards the prior on every gradient update. This may trap the model in a bad local optimum where q_φ(z|x) = p_θ(z) for all x: in this case, z is simply a noise source, which is useless to the model, so the model has learned to ignore it and will not make large enough gradient updates to break q_φ(z|x) out of this optimum. Bowman et al. (2016) termed this issue KL collapse and proposed an annealing schedule to handle it, where the weight of the KL term is increased over the course of training. In this way, the model initially learns to use the latent code but is then regularized towards the prior as training progresses. However, this trick is not sufficient to avert KL collapse in all scenarios, particularly when strong decoders are used and z has a minor impact on p_θ(x|z).
Table 1 shows experiments in a similar setup to that of Bowman et al. (2016). We train an NVRNN model on the Penn Treebank with four different hyperparameter settings. We use either a 3-layer or a 1-layer LSTM encoder, with or without a sigmoid annealing schedule (which increases the KL weight from 0 to 1 over the first 20 epochs). We observe the best performance using the 1-layer model with annealing. One might conclude from this table that the annealing trick has worked since both models achieve better performance when annealing is used. But in fact, a vMF-based model can do better than either (NLL of 117), and moreover, we have no way of knowing whether a better annealing scheme might achieve even higher performance after training. Furthermore, the higher-capacity 3-layer model can theoretically do anything the 1-layer model can, so its lower performance indicates that our training is derailed either by overfitting or by getting stuck in a local optimum where the latent variable is unused. Getting the best performance out of a VAE is, therefore, a challenging problem that requires careful tuning of the objective function and optimization procedure (Bowman et al., 2016; Zhao et al., 2017b; Higgins et al., 2017). Beyond the well-documented problem of KL collapse, an optimizer may simply get stuck in a local optimum during training and, as a result, fail to find a model that most effectively exploits the latent variable.
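To make the annealing baseline concrete, here is a minimal sketch of a sigmoid KL-weight schedule that rises from roughly 0 to roughly 1 over the first 20 epochs. The text only states the shape and the 20-epoch window; the midpoint, the `steepness` parameter, and the function names are hypothetical choices for illustration.

```python
import math


def kl_weight(epoch, warmup_epochs=20.0, steepness=10.0):
    """Sigmoid KL weight: about 0 at epoch 0, 0.5 at the midpoint, about 1 at warmup_epochs.

    `steepness` is an assumed knob (not from the paper): larger values make
    the transition around the midpoint sharper.
    """
    midpoint = warmup_epochs / 2.0
    return 1.0 / (1.0 + math.exp(-steepness * (epoch - midpoint) / warmup_epochs))


def annealed_loss(reconstruction_nll, kl, epoch):
    """Negative ELBO with the KL term down-weighted early in training."""
    return reconstruction_nll + kl_weight(epoch) * kl


for epoch in (0, 5, 10, 15, 20):
    print(epoch, round(kl_weight(epoch), 4))
```

Early in training the loss is dominated by reconstruction, so the decoder has an incentive to read z before the KL penalty reaches full strength.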
The solution we advocate for in this paper is to change the distribution for the latent space and simplify the optimization problem. In the next section, we describe the von Mises-Fisher distribution and its use in VAEs, where it forces the model to put the latent representations on the surface of the unit hypersphere rather than squeezing everything to the origin. Critically, this distribution lets us fix the value of the KL term by fixing the distribution's concentration parameter κ; this averts the KL collapse and leads to good model performance across two generative modeling paradigms.

von Mises-Fisher VAE
The von Mises-Fisher distribution is a distribution on the (d − 1)-dimensional sphere in R^d. The vMF distribution is defined by a direction vector µ with ||µ|| = 1 and a concentration parameter κ ≥ 0. The PDF of the vMF distribution for the d-dimensional unit vector x is defined as:
\[ f_d(x; \mu, \kappa) = C_d(\kappa) \exp(\kappa \mu^\top x), \qquad C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)}, \]
where I_v stands for the modified Bessel function of the first kind at order v.
Figure 1 shows samples from vMF distributions with various µ vectors (arrows), d = 3, and κ = 100. This is a high κ value, leading to samples that are tightly clustered around µ, which is the mean direction and mode of the distribution. When κ = 0, the distribution degenerates to a uniform distribution over the hypersphere independent of µ.
Past work has used vMF as an emission distribution in unsupervised clustering models (Banerjee et al., 2005), VAEs for other domains (Davidson et al., 2018; Hasnat et al., 2017), and a generative editing model for text (Guu et al., 2018). We focus specifically on the empirical properties of vMF for text modeling and conduct a systematic examination of how this prior affects VAE models compared to using a Gaussian.
VAE with vMF. We use vMF as both our prior and variational posterior in our VAE models. Otherwise, the setup for our VAE remains the same as in the Gaussian case established in Section 2. Our prior is the uniform distribution vMF(•, κ = 0). Since the true posterior p_θ(z|x) is intractable, we approximate it with a variational posterior q_φ(z|x) = vMF(z; µ, κ), where the mean direction µ is the output of the encoding neural network (Figure 1, right side) and κ is treated as a constant.
Figure 2: Visualization of how q varies over the course of optimization for a single example. In the Gaussian case, the KL term tends to pull the model towards the prior (moving from µ, σ to µ′, σ′), whereas in the vMF case there is no such pressure towards a single distribution.
Before we can implement a VAE, we need to derive an expression for KL divergence in order to optimize ELBO (Equation 1) and give a sampling algorithm that admits the reparameterization trick (Kingma and Welling, 2013).
Figure 2 shows a visualization of the learning trajectories of the Gaussian and vMF VAEs. For the Gaussian VAE, the KL divergence in the objective function tends to pull the posterior towards the prior centered at the origin and, therefore, makes the optimization difficult as mentioned before. For the vMF VAE, given fixed κ, there is no such vacuous state and µ can vary freely. Figure 3 shows the KL value and concentration of vMF(µ, κ) for two different dimensionalities. KL increases monotonically with κ, as does concentration measured by cosine similarity. To get a fixed cosine dispersion as dimensionality increases, higher κ values are needed, resulting in higher KL values.
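With a uniform prior vMF(•, κ = 0), the KL term has a closed form that follows from the vMF density and the surface area of the unit sphere; it depends only on κ and d, not on µ. The sketch below evaluates it in plain Python. The function names and the series-based evaluation of log I_v are illustrative assumptions rather than the authors' implementation, and the truncated series is only intended for moderate κ (up to a few hundred); a library Bessel routine would normally be used instead.

```python
import math


def log_bessel_i(v, x, terms=500):
    """log I_v(x) for x > 0, from the power series summed in log space."""
    log_half_x = math.log(x / 2.0)
    logs = [(2 * m + v) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + v + 1)
            for m in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def vmf_kl(kappa, d):
    """KL( vMF(mu, kappa) || Uniform(S^{d-1}) ).

    KL = kappa * I_{d/2}(kappa) / I_{d/2-1}(kappa)
         + (d/2 - 1) log kappa - (d/2) log(2 pi) - log I_{d/2-1}(kappa)
         + (d/2) log pi + log 2 - log Gamma(d/2)
    """
    if kappa == 0.0:
        return 0.0  # posterior equals the uniform prior
    half_d = d / 2.0
    log_i_lo = log_bessel_i(half_d - 1.0, kappa)
    log_i_hi = log_bessel_i(half_d, kappa)
    mean_cosine = math.exp(log_i_hi - log_i_lo)  # E[mu . z]
    return (kappa * mean_cosine
            + (half_d - 1.0) * math.log(kappa)
            - half_d * math.log(2.0 * math.pi)
            - log_i_lo
            + half_d * math.log(math.pi)
            + math.log(2.0)
            - math.lgamma(half_d))


for kappa in (50.0, 100.0, 150.0):
    print(kappa, vmf_kl(kappa, d=50))
```

Because µ does not appear anywhere in `vmf_kl`, fixing κ turns the KL term into a constant during training, which is the property the fixed-κ approach relies on.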
Sampling from vMF. Following the implementation of Guu et al. (2018), we use the rejection sampling scheme of Wood (1994) to sample a "change magnitude" w. Our sample is then given by z = wµ + v√(1 − w²), where v is a randomly sampled unit vector tangent to the hypersphere at µ. Neither v nor w depends on µ, so we can now take gradients of z with respect to µ as required.
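The sampling procedure can be sketched as follows. This is a plain-Python illustration of Wood-style rejection sampling for w followed by the construction z = wµ + v√(1 − w²); the function names are assumptions for this example, `mu` is assumed to be a unit vector with d ≥ 2, and the tangent vector is obtained here by projecting Gaussian noise off µ, which is one common way to realize v and may differ from the referenced implementations.

```python
import math
import random


def sample_w(kappa, d, rng):
    """Rejection-sample w = <z, mu> for a vMF on the unit sphere in R^d."""
    m = d - 1
    b = m / (math.sqrt(4.0 * kappa ** 2 + m ** 2) + 2.0 * kappa)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + m * math.log(1.0 - x0 ** 2)
    while True:
        beta = rng.betavariate(m / 2.0, m / 2.0)
        w = (1.0 - (1.0 + b) * beta) / (1.0 - (1.0 - b) * beta)
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is defined
        if kappa * w + m * math.log(1.0 - x0 * w) - c >= math.log(u):
            return w


def sample_vmf(mu, kappa, rng):
    """Draw z = w * mu + sqrt(1 - w^2) * v with v a random unit tangent at mu."""
    d = len(mu)
    w = sample_w(kappa, d, rng)
    noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
    along_mu = sum(n * m_i for n, m_i in zip(noise, mu))
    tangent = [n - along_mu * m_i for n, m_i in zip(noise, mu)]  # remove mu component
    norm = math.sqrt(sum(t * t for t in tangent))
    scale = math.sqrt(max(0.0, 1.0 - w * w))
    return [w * m_i + scale * t / norm for m_i, t in zip(mu, tangent)]


rng = random.Random(0)
mu = [0.0, 0.0, 1.0]
samples = [sample_vmf(mu, 100.0, rng) for _ in range(5)]
for z in samples:
    print([round(c, 3) for c in z])
```

In an autodiff framework, w and the noise would be treated as constants, so the gradient with respect to µ flows only through the final combination step.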

Experiments on Language Modeling
We first evaluate our vMF approach in the NVRNN setting. We will return to this model and analyze its properties further in Sections 6 and 7 after showing experiments on document modeling.
Dataset. For NVRNN, we use the Penn Treebank (Marcus et al., 1993), also used in Bowman et al. (2016), and Yelp 2013 (Xu et al., 2016). Examples in the Yelp dataset are much longer and more diverse than those from PTB, requiring more understanding of high-level semantics to generate a coherent sequence. Yelp has a long tail of very long reviews, so we truncate the examples to a maximum length of 50 words; this still gives an average length over twice as long as in the PTB setting. Statistics about all datasets used in this paper are shown in Table 2.
Settings. We evaluate our NVRNN as in Bowman et al. (2016) and explore two different settings. In the Standard setting, the input to the RNN at each time step is the concatenation of the latent code z and the ground truth word from the last time step, while the Inputless setting does not use the prior word. The more powerful decoder of the Standard setting makes the latent representations inherently less useful. In the Inputless setting, the decoder needs to predict the whole sequence with only the help of the given latent code. In this case, a high-quality representation of the sentence is badly needed and the model is driven to learn it. Our implementation of the VAE uses a one-layer unidirectional LSTM as both encoder and decoder. We use an embedding size of 100 and hidden units of size 400 in the LSTM. The dimension of the latent code is chosen from {25, 50, 100} by tuning on the development set. We use SGD to optimize all models with a decayed learning rate and gradient clipping. For Yelp, the sentiment label, which ranges from 1 to 5, is also embedded into a 50-dimensional vector and fed as input at every time step of the decoding phase.
Results. Experimental results of the NVRNN are shown in Table 3. We report negative log likelihood (NLL) and perplexity (PPL) on the test set. We follow the implementation reported in Bowman et al. (2016) [...] perplexities, and even when KL collapse does not appear to be the case (e.g., G-VAE on the PTB-Standard setting), a Gaussian family of distributions results in lower KLs and worse log likelihoods, possibly due to optimization challenges.
In the Inputless setting, we see large gains: vMF VAE reduces PPL from 379 to 262 in PTB, and from 256 to 134 in Yelp compared to Gaussian VAE.
Trade-off Comparison. Besides the overall perplexity, we are also interested in the trade-off between reconstruction loss and KL, and the contribution of KL to the whole objective. Figure 4 shows the ability of our model to explicitly control the balance between the KL and the reconstruction term. First, we "permanently" anneal the Gaussian VAE by setting the weight of the KL term to a constant smaller than 1 (0.2 and 0.5 in our case). We find that this trick does mitigate the KL collapse, but the overall performance is worse. Therefore, this is not only a numerical game about the KL vs. NLL trade-off but a deeper challenge of how to structure models to learn effective latent representations.
For vMF VAE, when we gradually increase the value of κ, the concentration of the distribution around the mean direction µ is higher and samples from vMF are closer to µ. The model achieves the best perplexity when κ = 80. The reconstruction error is bounded around 4.5 due to the difficulty of the task and the limited capacity of the LSTM decoder. While κ is a hyperparameter that needs to be tuned, the model is overall not very sensitive to it, and we show in Section 7 that reasonable κ values transfer across similar tasks.

Experiments on Document Modeling
We also investigate how vMF VAE performs in a different setting, one less plagued by the KL collapse issue. Specifically, the Neural Variational Document Model (NVDM), proposed by Miao et al. (2016), is a VAE-based unsupervised document model. This model follows the VAE framework introduced in Section 2. Our document representation is an indicator vector x of word presence or absence in the document. Since this is a fixed-size representation, we use 2-layer MLPs with 400 hidden units for both the encoder q(z|x) and decoder p(x|z); the decoder places a simple multinomial distribution over words in the vocabulary, and the probability of a document is the product of the probabilities of its words.
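The decoder likelihood described here can be sketched as a softmax over the vocabulary whose log probabilities are summed over the words present in the document. In this minimal example, `logits` stands in for the output of a hypothetical decoder MLP applied to z, and the function names are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log softmax over a list of vocabulary scores."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]


def document_log_prob(logits, word_ids):
    """log p(x|z): sum of log word probabilities over the document's words."""
    log_probs = log_softmax(logits)
    return sum(log_probs[w] for w in word_ids)


# Toy vocabulary of 4 word types; the document contains word types 0, 2 and 3.
logits = [2.0, 0.5, 1.0, -1.0]  # hypothetical decoder output for one z
print(document_log_prob(logits, [0, 2, 3]))
```

Since there is no autoregressive context in this decoder, all information about which words appear must come through z.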
Dataset. For NVDM, we use two standard news corpora, 20 Newsgroups (20NG) and Reuters RCV1-v2, which were used in Miao et al. (2016).
Results. Experimental results are shown in Table 4. In contrast with NVRNN, the NVDM fully relies on the power of the latent code to predict the word distribution, so we never observe a KL collapse, yet vMF still achieves better performance than Gaussian. As shown in Figure 3, in order to keep the same amount of dispersion in samples from the variational posterior, larger latent dimensions need larger κ values and correspondingly larger KL term values. For 20NG, which is much smaller than RCV1, smaller dimensions therefore give better performance. For both datasets, the settings of κ = 100, dim = 25 and κ = 150, dim ∈ {50, 200} work well.

What do our VAEs encode?
Is the latent code more than a bag of words? For all of these models, one hypothesis is that the encoder may be learning to memorize the bag of words and then preferentially generate words in that bag from the decoder. To test this, we investigate whether the BoW representation and the learned latent code can be reconstructed from each other. Specifically, given a sentence x we can compute BoW as defined above and µ = enc(x), the latent encoding of x as represented by the mean vector output by the encoder. We can use a simple multilayer perceptron to try to map from the bag of words to the latent code, µ̂ = MLP(BoW), then learn the parameters of the MLP by minimizing ‖µ̂ − µ‖² on a sample. The same process can be used to learn a mapping from µ back to the bag of words.
Table 6 shows averaged cosine similarities of our reconstructions under both Gaussian and vMF models. For vMF, µ can reconstruct the bag of words more accurately than the bag of words can reconstruct µ, indicating that the latent code in vMF captures more information beyond the bag of words.
We repeat this experiment in a separate NVRNN model where the decoder can explicitly condition on the BoW vector described above. The results are shown in the right column of Table 6. Our model, v-VAE, achieves a lower cosine similarity than G-VAE (0.23 vs. 0.32), indicating that it is capturing less redundant information and using the latent space to more efficiently model other properties of the data.
Sensitivity to word order. Table 6 shows that NVRNN with vMF encodes information beyond the bag of words; a natural hypothesis is that it is encoding word order. We can more directly investigate this in the context of both NVRNN and NVRNN-BoW settings. Inspired by Zhao et al. (2017a), we propose an experiment probing the sensitivity of the encoding to randomly swapping adjacent pairs of words in the Inputless setting on PTB. We vary the probability of swapping each word pair and see how the latent code changes as the number of swaps increases. Ideally, our models should capture ordering information and therefore be sensitive to this change.
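The perturbation used in this probe can be sketched as below. The exact swapping policy is not specified in the text, so this example assumes a single left-to-right pass in which each adjacent pair is swapped with probability p and a swapped pair is then skipped, so that no word moves more than one position. The encoder itself is not shown; in the actual probe, `cosine` would be applied to the encoder's mean vectors for the original and perturbed sentences.

```python
import math
import random


def swap_adjacent(tokens, p, rng):
    """Swap each adjacent pair with probability p in one left-to-right pass."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
sentence = "the company said it expects higher earnings".split()
for p in (0.0, 0.5, 1.0):
    print(p, " ".join(swap_adjacent(sentence, p, rng)))
```

The perturbation leaves the bag of words unchanged, so any drop in cosine similarity between the two encodings can only come from sensitivity to word order.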
Figure 5 shows the results. v-VAE's representations are more sensitive than those of the G-VAE: they change faster as swaps become more likely.⁸ In the NVRNN-BoW setting, we see that the models are even more sensitive. vMF enables us to more easily learn this kind of desirable information in our sentence encodings.

Controlling Variance with κ
A core aspect of our approach so far has been treating κ as a fixed hyperparameter. Fixing κ is beneficial from an optimization standpoint: it makes it more difficult for the model to get stuck in local optima. But it also reduces the model's flexibility, since we can no longer predict per-example κ values, and it introduces another parameter that the system designer must tune.
⁸ The Gaussian VAE here makes very little use of the latent variable, which is why the representations change very little.
Figure 5: Cosine similarity is measured between the latent code (encoded mean vector) of the original sentence and the sentence after swaps are applied. We see that vMF is more sensitive to swaps in both the NVRNN and NVRNN-BoW settings, indicating that its latent space likely encodes more ordering information.
Fortunately, a wide range of κ values appears to work well for the tasks we consider. Figure 6 shows how the concentration parameter κ changes the results on PTB when the latent dimension and other hyperparameters are held fixed. We have ordered the tasks left-to-right from "hardest" to "easiest" in terms of necessity of latent representation: the Inputless setting needs heavy information from the latent code to reconstruct the sentence, whereas the Standard-BoW setting has an extremely strong decoder to predict the next word. We see that in each of these cases, a wide range of κ values works, and moreover reasonable κ values transfer between the two Standard and between the two Inputless settings, indicating that the overall approach is not highly sensitive to these hyperparameter values.
Brittleness of Learning κ. Throughout this work, we have treated κ as a fixed parameter. However, we can treat κ in the same way as σ in the Gaussian case and learn it on a per-instance basis. The KL divergence of vMF is differentiable with respect to κ given gradients of the modified Bessel function of the first kind, allowing us to change the concentration on a per-instance basis. However, this reintroduces the issue of KL collapse: the KL term will encourage κ to be as low as possible, potentially making the latent variable vacuous.
Figure 6: Darker colors correspond to perplexity values closer to the best observed for that setting. For each task, we see that there is a range of κ values that work well, and these transfer between comparable tasks.
In practice, we observe that it is necessary to clip κ values to a certain range for numerical reasons. Within this range, the model gravitates towards the smallest κ values and performs substantially worse than models trained with our fixed κ approach. This indicates that even with the vMF model, the optimization problem posed by the ELBO is simply a hard one and the approach of fixing the KL divergence is a surprisingly good optimization technique.

Related Work
Applications of VAE in NLP. Deep generative models have achieved impressive successes in domains adjacent to NLP such as image generation (Gregor et al., 2015; Oord et al., 2016a) and speech generation (Chung et al., 2015; Oord et al., 2016b). VAEs specifically (Kingma and Welling, 2013; Rezende et al., 2014) have been a popular model variant in NLP. They have been applied to tasks including document modeling (Miao et al., 2016), language modeling (Bowman et al., 2016), and dialogue generation (Serban et al., 2017b). VAEs can also be applied to semi-supervised classification (Xu et al., 2017). Recent twists on the standard VAE approach include combining VAEs with holistic attribute discriminators for conditional generation (Hu et al., 2017) and using a more flexible latent space regularized by an adversarial method (Zhao et al., 2017a).
VAE Objective. Several pieces of recent work have highlighted the issues with optimizing the VAE objective. Alemi et al. (2018) shed light on the problem from the perspective of information theory. Zhao et al. (2017b) and Higgins et al. (2017) both propose various reweightings of the objective along with theoretical and empirical justification.
Choices of Priors for VAE. Some past work has explored various priors for VAEs. Serban et al. (2017a) proposed a piecewise constant distribution which deals with multiple modes, but which sacrifices the property of continuous interpolation. Guu et al. (2018) also applied vMF in a VAE model, but used theirs specifically in the sentence-editing case. Davidson et al. (2018) explored vMF in a VAE model for MNIST and a link prediction task. Hasnat et al. (2017) applied the vMF distribution to facial recognition. Other past work has used different decoders, including CNNs (Yang et al., 2017) and CNN-RNN hybrids (Semeniuta et al., 2017). Changing the decoder is a change largely orthogonal to changing the prior: it can alleviate the KL vanishing issue, but it does not necessarily scale to new settings and does not give explicit control over utilization of the latent code.

Conclusion
In this paper, we propose the use of a von Mises-Fisher VAE to resolve optimization issues in variational autoencoders for text. This choice of distribution allows us to explicitly control the balance between the capacity of the decoder and the utilization of the latent representation in a principled way. Experimental results demonstrate that the proposed model has better performance than a Gaussian VAE across a range of settings. Further analysis shows that vMF VAE is more sensitive to word order information and makes more effective use of the latent code space.

Figure 3: Visualization of the interaction between κ, KL, and dimensionality in vMF. Cos represents the cosine similarity between µ and samples from vMF_d(µ, κ), which reflects how dispersed the distribution is. KL is defined as the KL divergence from a uniform vMF prior, KL(vMF_d(µ, κ) || vMF(•, 0)). Higher κ values yield higher cosine similarities, but also higher KL costs.
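The two quantities in Figure 3 can be sketched numerically without sampling. The snippet below is a minimal illustration, not the code used for the figure. It assumes the standard vMF density with normalizer C_d(κ) = κ^(d/2−1) / ((2π)^(d/2) I_(d/2−1)(κ)), the standard identity that the expected cosine between µ and a sample equals I_(d/2)(κ) / I_(d/2−1)(κ), and a uniform prior on the unit sphere with density 1/area. The function names (`log_bessel_i`, `vmf_mean_cosine`, `vmf_kl_to_uniform`) are hypothetical, and the Bessel function is evaluated from its power series in log space so that only the standard library is needed.

```python
import math


def log_bessel_i(v, x, extra_terms=100):
    """log I_v(x) for x > 0, from the power series summed in log space.

    I_v(x) = sum_m (x/2)^(2m+v) / (m! * Gamma(m+v+1)).
    Terms decay geometrically once m > x, so 2x + extra_terms terms suffice.
    """
    n_terms = int(2 * x) + extra_terms
    logs = [
        (2 * m + v) * math.log(x / 2) - math.lgamma(m + 1) - math.lgamma(m + v + 1)
        for m in range(n_terms)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def vmf_mean_cosine(d, kappa):
    """Expected cosine similarity between mu and a sample from vMF_d(mu, kappa)."""
    v = d / 2 - 1
    return math.exp(log_bessel_i(v + 1, kappa) - log_bessel_i(v, kappa))


def vmf_kl_to_uniform(d, kappa):
    """KL(vMF_d(mu, kappa) || uniform on the unit sphere); independent of mu."""
    v = d / 2 - 1
    # log of the vMF normalizing constant C_d(kappa)
    log_norm = v * math.log(kappa) - (d / 2) * math.log(2 * math.pi) - log_bessel_i(v, kappa)
    # log surface area of the unit sphere in d dimensions: 2 * pi^(d/2) / Gamma(d/2)
    log_area = math.log(2) + (d / 2) * math.log(math.pi) - math.lgamma(d / 2)
    return kappa * vmf_mean_cosine(d, kappa) + log_norm + log_area


if __name__ == "__main__":
    # Illustrative grid only; these values are not taken from the paper.
    for d in (25, 50, 100):
        for kappa in (10.0, 50.0, 100.0):
            cos = vmf_mean_cosine(d, kappa)
            kl = vmf_kl_to_uniform(d, kappa)
            print(f"d={d:3d} kappa={kappa:6.1f} cos={cos:.3f} KL={kl:.2f}")
```

Because the KL value depends only on d and κ, fixing κ fixes the KL term, which is what allows it to be treated as a hyperparameter rather than a quantity to be optimized.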

Figure 4: Comparison of Gaussian- and vMF-NVRNN with different hyper-parameters. All models are trained on PTB in the Inputless setting where the latent dimension is 50. G-α indicates Gaussian VAE with KL annealed by the given constant α, and V-κ indicates vMF VAE with κ set to the given value. The green bar reflects the amount of KL loss while the total height reflects the whole objective. Numbers above bars are perplexity. vMF is more highly tunable and also achieves stronger results across a wide range of κ values.

Figure 5: Sensitivity of latent codes to swapping adjacent words of the encoded sequence. Cosine similarity is measured between the latent code (encoded mean vector) of the original sentence and that of the sentence after swaps are applied. We see that vMF is more sensitive to swaps in both the NVRNN and NVRNN-BoW settings, indicating that its latent space likely encodes more ordering information.
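The measurement described in this caption can be sketched as follows. This is a hedged illustration under stated assumptions: swap positions are drawn uniformly at random (the caption does not specify how they are chosen), and `encode` is a hypothetical stand-in for a trained encoder that maps a token list to its mean vector. None of the names come from the paper's code.

```python
import math
import random


def swap_adjacent(tokens, n_swaps, rng):
    """Return a copy of tokens after n_swaps random adjacent transpositions."""
    out = list(tokens)
    for _ in range(n_swaps):
        if len(out) < 2:
            break
        i = rng.randrange(len(out) - 1)  # assumed: position chosen uniformly
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def swap_sensitivity(encode, tokens, n_swaps, seed=0):
    """Cosine similarity between the codes of a sentence and its swapped version.

    `encode` is a hypothetical callable: list of tokens -> list of floats.
    """
    rng = random.Random(seed)
    swapped = swap_adjacent(tokens, n_swaps, rng)
    return cosine_similarity(encode(tokens), encode(swapped))
```

An encoder that ignores word order would score exactly 1 under this measure for any number of swaps, so lower values indicate a latent code that carries more ordering information.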

Figure 6: Perplexity of v-VAE in different settings with different κ values when the latent dimension is 50. Darker colors correspond to perplexity values closer to the best observed for that setting. For each task, we see that there is a range of κ values that work well, and these transfer between comparable tasks.

Table 1: Development set KL and NLL values for two NVRNN models trained on the Penn Treebank with and without the annealing technique of Bowman et al. (2016).

Table 2: Statistics of the datasets used in our experiments. Len stands for the average length of an example. Vocab is the vocabulary size; these follow prior work.
where the KL term weight is annealed for the Gaussian VAE; vMF VAE works well without weight annealing. The vMF distribution gives a performance boost in all datasets in both the Standard and Inputless settings. Even in the Standard setting, our model is able to successfully use nonzero KL values to achieve better

Table 3: Experimental results of NVRNN on the test sets of PTB and Yelp. The upper RNNLM and G-VAE rows show the results from Bowman et al. (2016). KL divergence is shown in parentheses, along with total NLL. Best results are in bold. vMF consistently uses higher KL term weights but achieves comparable or better NLL and perplexity values across all four settings.

Table 5: Experimental results of NVRNN-BoW on PTB; i.e., the decoder also conditions on a bag-of-words representation of the sentence to generate. In this case, the Gaussian models exhibit KL collapse but vMF can still learn effectively.
the average word embedding value of the sentence x. While an artificial setting, this lets us see how effectively the latent code can capture information other than simple word choice by making a form of this information independently available. Table 5 shows results in this setting, where once again we see the KL collapse problem for the Gaussian models and better performance from vMF on perplexity in both the Standard and Inputless settings.

Table 6: Average cosine similarity when trying to reconstruct the latent code µ from the bag of words and vice versa. In vMF, the latent code contains more information beyond the bag of words, as shown by the lower cosine similarity when predicting BoW → µ (0.57). When the latent code is learned in a model conditioned on the bag of words (right column), it predicts the bag of words much less well, indicating that the model successfully learns orthogonal information.