Do sequence-to-sequence VAEs learn global features of sentences?

A longstanding goal in NLP is to compute global sentence representations. Such representations would be useful for sample-efficient semi-supervised learning and controllable text generation. To learn to represent global and local information separately, Bowman et al. (2016) proposed to train a sequence-to-sequence model with the variational auto-encoder (VAE) objective. But what do the latent variables, expected to capture global features, actually encode? We measure which words benefit most from the latent information by decomposing the reconstruction loss per position in the sentence. Using this method, we see that VAEs are prone to memorizing the first words and the sentence length, drastically limiting their usefulness. To alleviate this, we propose variants based on bag-of-words assumptions and language model pretraining. These variants learn latents that are more global: they are more predictive of topic or sentiment labels, and their reconstructions are more faithful to the labels of the original documents.


Introduction
Natural language generation is a major problem underlying many classical NLP tasks such as machine translation, automatic summarization or dialogue modeling. Recent progress has been mostly attributed to the replacement of LSTMs (Hochreiter and Schmidhuber, 1997) by more powerful, attention-based models such as Transformers (Vaswani et al., 2017;Radford et al., 2019).
Despite their differences, Transformers are still mostly used in an auto-regressive manner via masking, generating words one after the other. In contrast, the sequence-to-sequence model trained with the Variational Auto-Encoder (VAE) objective (Kingma and Welling, 2013) proposed by Bowman et al. (2016) generates text in a two-step process: first, a latent vector is sampled from a prior distribution; then, words are sampled from the probability distribution produced by the auto-regressive decoder, itself conditioned on the latent vector. The hope is that such an architecture encourages a useful decomposition of information, where the latent vector would "explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features", while local and grammatical correlations would be handled by the recurrent decoder. Such global features, encoded in a compact, fixed-size representation, would be handy both for semi-supervised learning and controllable generation. For semi-supervised learning, the latent vector would be the ideal representation on which to train small classifiers using a handful of labels (Kingma et al., 2014). For controllable generation, we could obtain a "prototypical" latent vector for a given label by averaging the latent vectors of all datapoints sharing that label. Then, we could decode this average vector to generate examples that would be labeled similarly.
Despite its conceptual appeal, Bowman et al. (2016)'s VAE suffers from the posterior collapse problem. The VAE objective is a sum of two terms: a reconstruction term that encourages the encoder and decoder to collaborate to reconstruct the input, and a KL divergence term that aligns the approximate posterior produced by the encoder with the prior. The problem is that, early on during training, the KL term goes to 0, so that the approximate posterior becomes the prior and no information is encoded in the latent variable. Faced with the same problem in the context of image modelling, Chen et al. (2016) remarked that it is also possible to imagine a model architecture where the learned latent variables would encode local statistics while the auto-regressive decoder would focus on global variations. In summary, latent variables can be completely uninformative or can encode local information in an undesirable and counter-intuitive manner.
Using modifications to the objective such as free bits (Kingma et al., 2016), we can obtain a positive KL term, which indicates that some information is encoded in the latent variables. However, how can we verify that they capture global aspects of texts? Qualitative evaluation methods such as reconstruction from interpolated codes ("homotopies") are highly subjective and ill-defined. Semi-supervised experiments are useful, but as we will show, they are limited and often not performed correctly.
In this paper, we propose to examine the content of latent variables by decomposing the reconstruction loss over positions in the sentence. We observe that encoders mostly store in the latent vector information pertaining to the first few words of each sentence as well as the number of words. If sequence-to-sequence VAEs sometimes encode global features, it is a byproduct of this memorization behavior, and therefore depends heavily on the dataset. This casts serious doubts about the usefulness and robustness of these representations. To prevent this behavior, we propose simple variants based on bag-of-words assumptions and pretraining. The representations learned by our variants are more predictive of the ground-truth labels, both in the small or large data-regime. Consequently, the reconstructions of texts share the same label as the source texts more often than our baselines and memorization is decreased.

Model and datasets

Sequence-to-sequence model and VAE objective
We briefly describe the object of this study, the sequence-to-sequence model with the VAE objective (Bowman et al., 2016). A document (sentence or paragraph) of L words x = (x_1, ..., x_L) is embedded into L vectors (e_1, ..., e_L). An LSTM encoder processes these embeddings to produce hidden states:

h_1, ..., h_L = LSTM(e_1, ..., e_L).

In general, the encoder produces a vector r that represents the entire document. In the original model, this vector is the hidden state of the last word, r = h_L, but we introduce variants later on. This representation is transformed by linear functions L_1 and L_2, yielding the variational parameters that are specific to each input document:

μ = L_1 r,    σ² = exp(L_2 r).

These two vectors of dimension d fully determine the approximate posterior, a multivariate normal with a diagonal covariance matrix, q_φ(z|x) = N(z | μ, diag(σ²)), where φ is the set of all encoder parameters (the parameters of the LSTM, L_1 and L_2). Then, a sample z is drawn from the approximate posterior and the decoder, another LSTM, produces a sequence of hidden states:

h'_1, ..., h'_{L+1} = LSTM([e_BOS; z], [e_1; z], ..., [e_L; z]),

where BOS is a special token indicating the beginning of the sentence and [·; ·] denotes the concatenation of vectors. Finally, each hidden state at position i is transformed to produce a probability distribution over the word at position i + 1:

p_θ(x_{i+1} | x_{≤i}, z) = softmax(W h'_i + b),

where softmax(v)_i = e^{v_i} / Σ_j e^{v_j} and θ is the set of parameters of the decoder (the parameters of the LSTM decoder, W and b). The vocabulary is augmented with an EOS token indicating the end of the sentence, which is appended to every document. For each document x, the lower bound on the marginal log-likelihood (ELBo) is:

log p(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z)).

On the entire training set {x^(1), ..., x^(N)}, the objective is to minimize the sum of the negative ELBos:

L(θ, φ) = Σ_{n=1}^N −E_{q_φ(z|x^(n))}[log p_θ(x^(n)|z)] + KL(q_φ(z|x^(n)) || p(z)).
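As a concrete illustration, here is a minimal numpy sketch (not the paper's implementation; the function names are ours) of the reparameterised sampling step and the closed-form KL term between the diagonal-Gaussian posterior and the standard-normal prior:

```python
import numpy as np

def sample_z(mu, sigma2, rng):
    # Reparameterisation: z = mu + sigma * eps, with eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(sigma2) * eps

def kl_to_standard_normal(mu, sigma2):
    # Closed-form KL( N(mu, diag(sigma2)) || N(0, I) ), in nats
    return 0.5 * np.sum(mu ** 2 + sigma2 - np.log(sigma2) - 1.0)

mu, sigma2 = np.array([0.5, -0.3]), np.array([0.8, 1.2])
z = sample_z(mu, sigma2, np.random.default_rng(0))
kl = kl_to_standard_normal(mu, sigma2)
# Posterior collapse corresponds to mu = 0, sigma2 = 1, where the KL is exactly 0
assert kl_to_standard_normal(np.zeros(2), np.ones(2)) == 0.0
```

Training then trades this KL term off against the expected reconstruction log-likelihood.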

Controlling the capacity of the encoder
Following Alemi et al. (2018), we call the average value of the KL term the rate. It measures how much information is encoded on average about the datapoint x by the approximate posterior q(z|x).
The KL term can be modified to target a specific rate, or at least to make sure the rate stays above a target, using a variety of similar techniques (see Appendix A.1 for more details). The main goal of these modifications is to prevent posterior collapse in sequence-to-sequence VAEs. We use the free bits formulation of the δ-VAE (Razavi et al., 2019): for a desired rate λ, the modified negative ELBo is:

−E_{q_φ(z|x)}[log p_θ(x|z)] + max(λ, KL(q_φ(z|x) || p(z))).

Since sequence-to-sequence VAEs are prone to posterior collapse, in practice the rates obtained are very close to the target rates λ.
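In code, the free-bits modification amounts to clamping the KL term from below, as in this sketch (the helper name is hypothetical):

```python
def free_bits_loss(recon_nll, kl, target_rate):
    # Negative ELBo with free bits: the KL term costs nothing until it
    # exceeds the target rate lambda, so the encoder can store up to
    # lambda nats of information "for free".
    return recon_nll + max(kl, target_rate)

# Below the target rate, increasing the KL leaves the loss unchanged...
assert free_bits_loss(10.0, 0.5, 2.0) == free_bits_loss(10.0, 1.9, 2.0) == 12.0
# ...above it, the usual ELBo trade-off applies.
assert free_bits_loss(10.0, 3.0, 2.0) == 13.0
```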
As observed by Alemi et al. (2018), different models or sets of hyperparameters for a given model can yield very similar values of ELBos despite reaching very different rates. In other words, the work of modelling stochasticity can be divided very differently between the latent variable and the auto-regressive decoder. Therefore, for our purposes, the free-bits modification has the additional advantage that it enables us to compare different models with similar capacity.

Variants
Throughout the paper, we use variants of the original architecture and training procedure. In the next section, we also use a deterministic Auto-Encoder (AE) trained only with the reconstruction loss, as well as several other variants recently introduced to alleviate the posterior collapse. Li et al. (2019) proposed to pretrain an AE, then to reinitialize the weights of the decoder, and finally to train the entire model again end-to-end with the VAE objective. The sentence representation is still the last hidden state of the LSTM encoder; therefore, we call this model and training procedure last-PreAE.
In the second variant, proposed by Long et al. (2019), the representation of the document r is the component-wise maximum over hidden states h_i, i.e. r^j = max_i h_i^j. We call this model max. In later experiments, we also consider a hybrid of the two techniques, max-PreAE.
We make slight, beneficial modifications to these two methods. We remove KL annealing, which is not only redundant with the free bits technique but also makes the rate increase erratically. Moreover, we use δ-VAE-style free bits to achieve a rate closer to the target rate. These modifications are justified in Appendix A. Therefore, all of the models in the paper use δ-VAE-style free bits without KL annealing.

Datasets
We train VAEs on small versions of four datasets from Zhang et al. (2015): AGNews, Amazon, Yahoo and Yelp. Each document is written in English and consists of one or several sentences. Each document is manually labeled according to its main topic or the sentiment it expresses, and the labels are close to uniformly balanced across each dataset. For faster training, we use subsampled versions of these datasets; their characteristics are detailed in Table 1.

Encoders prioritize information about the first words and sentence length
The ELBo objective trades off the KL term against the reconstruction term. To minimize the objective, it is worth increasing the KL term only if the reconstruction term is decreased by the same amount or more. With free bits, we allow the encoder to store information up to a certain extent without paying any cost. The optimisation objective becomes to minimize the reconstruction cost by using this "free" storage as efficiently as possible.
In order to visualize what information is stored in the latents, our method is to look at where gains are seen in the reconstruction loss. Since the loss is a sum over documents and positions in these documents, these gains could be concentrated: i) on certain documents, for example, on large documents or documents containing rarer words; ii) at certain positions in the sentence, for example in the beginning or in the middle of the sentence. We investigate the latter possibility.
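The decomposition itself is straightforward: average the per-token losses position by position over documents. A minimal sketch, assuming the per-token negative log-likelihoods have already been computed:

```python
def per_position_loss(token_nlls):
    # token_nlls: one list of per-token NLLs (in nats) per document.
    # Returns the average loss at each position, computed over the
    # documents that are long enough to reach that position.
    max_len = max(len(doc) for doc in token_nlls)
    averages = []
    for i in range(max_len):
        vals = [doc[i] for doc in token_nlls if len(doc) > i]
        averages.append(sum(vals) / len(vals))
    return averages

docs = [[3.0, 4.0, 2.0], [1.0, 4.0]]  # toy per-token losses for two documents
assert per_position_loss(docs) == [2.0, 4.0, 2.0]
```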

Visualizing the reconstruction loss
Concretely, we compare the reconstruction loss of different models at specific positions in the sentence. The baseline is an LSTM trained with a language model objective (LSTM-LM). It has the same size as the decoders of the auto-encoder models. Since the posterior collapse makes VAEs behave exactly like the LSTM-LM, the reconstruction losses of the VAEs and of the LSTM-LM are directly comparable. Additionally, the deterministic AE gives us the reconstruction error that is reachable with a latent space constrained only by its dimension d, but not by any target rate λ (equivalent to an infinite target rate). In Figure 1, the left-hand plot shows the reconstruction losses of different models and different target rates λ on the Yelp dataset. As expected, for all models, raising the target rate lowers the reconstruction cost. In the extreme, AE obtains the lowest reconstruction loss. What is remarkable is that these gains are concentrated around the beginning and the end of the sentence. To see this more clearly, we compute the relative improvement in reconstruction with respect to the baseline (right-hand side of Figure 1) as (r_LSTM(i) − r(i)) / r_LSTM(i), where r_LSTM(i) is the reconstruction loss of the baseline at position i and r(i) that of the model under consideration.
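The relative-improvement computation can be sketched as follows (the per-position losses here are made-up toy values):

```python
def relative_improvement(r_model, r_lstm):
    # Per-position relative gain over the LSTM-LM baseline:
    # (r_LSTM(i) - r(i)) / r_LSTM(i)
    return [(b - m) / b for m, b in zip(r_model, r_lstm)]

baseline = [4.0, 5.0, 5.0, 4.0]   # toy per-position NLL of the LSTM-LM
vae      = [2.0, 5.0, 5.0, 3.0]   # toy per-position NLL of a VAE
gains = relative_improvement(vae, baseline)
# Gains concentrate at the first and last positions, none in the middle
assert gains == [0.5, 0.0, 0.0, 0.25]
```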
All the models reconstruct the first couple of words and the penultimate token better than the LSTM-LM. In the Yelp dataset, the penultimate token is a punctuation mark which is always followed by the end-of-sentence token; therefore, accurately predicting when this token occurs is equivalent to predicting the sentence length. Thus, we conclude that the latent variables encode information about the sentence length. On the three other datasets, we see similar peaks of relative improvement at the beginning and the end of sentences (see Appendix). On Yelp, the situation is even worse than on the other datasets: between positions 4 and 13, there is no relative improvement when λ = 2, indicating that the latent vector does not encode any global information.
If the words in a document were pairwise independent, any improvement in reconstruction at a certain position would indicate that information about the word at that position was encoded in the latent variable. However, words are far from independent, so how can we trace the information back to the encoder? First, latent information related to the first word cannot yield any improvement on the prediction of the second word: since the decoder is recurrent and trained with teacher forcing, i.e. conditioned on the true first word, that information would be redundant. In contrast, information related to the second word stored in the latent variable can help the decoder predict the first word. Therefore, improvements in the reconstruction loss at position i can only be attributed to stored information pertaining to the words at positions ≥ i. Second, the correlation between words at two positions decreases as the distance between them grows; in effect, information pertaining to the second word yields larger gains on the second word than on the first. From these two facts, we conclude that gains at a position i mostly come from information about the word at position i itself.

Impact during decoding
To study the concrete impact of this observation on generation, we encode and decode test documents using the last-PreAE variant.4 Then, we compute the proportion of documents for which the first word of the source and of the reconstruction match, and similarly, how often the sources and their reconstructions have the same number of words. We compare these with the scores obtained by a baseline model that outputs the most frequent first word given the label and the most common document length given the label. This baseline mimics the behavior of a hypothetical VAE that would encode the labels of the documents (topic or sentiment) perfectly and nothing more.
Results in Table 2 show that with the last-PreAE, the first words are reconstructed with much higher accuracy than if the latent vector only encoded the label. On the last two datasets, it recovers the first word of more than half of the documents, whereas the baseline only recovers the first words between 12.9 and 14.1% of the time. Accurate encoding of the number of words seems less systematic than the encoding of the first few words. For example, on AGNews, the sentence length is recovered less often than by our baseline. The encoding of the sentence length is more pronounced on datasets with short documents like Yahoo and Yelp.

Is it an issue?
To sum up, our first experiment shows that, compared to an unconditional LSTM-LM, sequence-to-sequence VAEs incur a much lower reconstruction loss on the first tokens and towards the end of the sentence. Our second experiment indicates that if the latent variable of the VAEs did encode the label perfectly and exclusively, they would reconstruct the first words or recover the length of each document with much lower accuracy than what is observed. Therefore, we conclude that sequence-to-sequence VAEs are biased towards memorizing the first few words and the sentence length.

4 λ = 8, d = 16, decoding with beam search (beam of size 5).

Table 2: The latent variables encode more information than the label alone; in particular, information that allows retrieving the first word and the document length with high accuracy. Columns: first-word and length reconstruction accuracy (%) for last-PreAE, then for the label-only baseline.

Dataset | First word | Length | First word (baseline) | Length (baseline)
AGNews  | 29.6 ± 1.1 | 3.6 ± 0.1  | 12.9 | 4.8
Amazon  | 42.4 ± 2.3 | 13.0 ± 1.6 | 14.0 | 0
Yahoo   | 56.6 ± 1.0 | 17.1 ± 1.1 | 11.3 | 4.9
Yelp    | 53.0 ± 0.5 | 33.7 ± 1.7 | 14.1 | 9.7
However, Figure 1 also shows that when enough capacity is given as free bits (λ = 8), there are consistent gains of around 0.2 nats on average at intermediate positions. In that case, we cannot claim that the encoded information is purely local. Since we can increase the capacity via the hyperparameters, is this a real issue? We believe it is, for the following reasons.
Firstly, as noted by Alemi et al. (2018), higher KL values lead to lower ELBos or marginal likelihoods. Prokhorov et al. (2019) confirmed that models with low likelihood are also poor at generation and that samples become less and less coherent as the rate increases. Moreover, decoding interpolations of two latent codes yields completely unrelated texts. It is often argued that more complex priors or approximate posteriors are the solution to such "non-smooth" latent spaces, but Pelsmaeker and Aziz (2019) did not find that such methods reach higher rates without a loss in likelihood. These papers all support the idea that, with current techniques, higher rates come at the cost of worse modelling of the data. Therefore, for our purposes, we should strive for latent-variable models that store less information, but more global information.
Secondly, we see potential issues related to specific use cases of VAEs. For controllable generation, we want to generate a variety of sentences with different lengths and beginnings for a fixed global aspect such as topic or sentiment. It is an undesirable side-effect that the choice of the first word or the sentence length is so strongly influenced by the latent variable. In some applications, it might be useful to learn a decoder that continues a given "prompt" (the beginning of a text), but left-to-right models such as GPT-2 (Radford et al., 2019) are naturally better suited to this task. As for semi-supervised learning using such representations, downstream classifiers risk picking up on correlations that might exist between the first words or sentence lengths and the label, yielding classifiers that are not robust, or simply inefficient.
If this reasoning is correct (which we will verify in later sections), it is doubtful that the commonly used sequence-to-sequence VAE architectures in the low-capacity regime learn a useful representation. This brings us to the third problem: most of the KL values reported in the literature are low.5 Therefore, it is not clear whether the gains in performance (however measured) of these VAE models are significant, and if they are, what precisely causes them.

Proposed models
What architectures could avoid the memorization phenomenon that we have exposed? We investigate simple variants and refer to Appendix D.1 for a more thorough comparison with existing models.
Our first variant uses a simple bag-of-words (BoW) encoder in place of the LSTM encoder, with sentence representation r^j = max_i e_i^j, where the exponents denote components and the indices denote positions in the sentence. We call it BoW-max-LSTM. It is similar to the max-pooling model of Long et al. (2019), except that the maximum is taken over (non-contextualized) embeddings rather than LSTM hidden states. As Long et al. (2019) reported, the max-pooling operator works better than the average operator, whether the encoder is an LSTM or a BoW, possibly because the maximum introduces a non-linearity, unlike the average. Therefore, we use the maximum in all our subsequent experiments. A priori, since word order is not provided to the encoder, we expect the encoder to be unable to store information pertaining specifically to the first words.
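The BoW-max representation reduces to a componentwise maximum over the word embeddings, as in this numpy sketch:

```python
import numpy as np

def bow_max(embeddings):
    # BoW-max: r^j = max_i e_i^j, a componentwise maximum over the
    # (non-contextualized) word embeddings; word order is discarded.
    return embeddings.max(axis=0)

e = np.array([[0.1, 2.0, -1.0],    # two words, embedding dimension d = 3
              [0.9, 0.5,  3.0]])
r = bow_max(e)
assert np.array_equal(r, np.array([0.9, 2.0, 3.0]))
# Permuting the words leaves the representation unchanged
assert np.array_equal(bow_max(e[::-1]), r)
```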
For our second variant, we use a unigram decoder (Uni) in place of the LSTM decoder. It produces a single output probability distribution shared by all positions i in the sentence, conditioned only on the latent variable z. This distribution is obtained by applying a one-hidden-layer MLP followed by a softmax to the latent vector:

p_θ(x_i | z) = softmax(MLP(z)).

Since the decoder no longer models the order of the words, we hope that the encoder will learn representations that do not focus on the reconstruction of the first words. We can use any encoder in combination with this decoder; notably, with a BoW encoder, we obtain the NVDM model of Miao et al. (2016).

5 Most papers do not report whether they use bits or nats (1 bit is ln(2) ≈ 0.693 nats). At the risk of over-estimating their reported rates, we assume nats. Here are some of the KL values of the best models in several papers (datasets between brackets):
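A sketch of such a decoder (the MLP weights here are random placeholders) makes the key property explicit: its reconstruction loss is invariant to word order, so memorizing which word comes first buys the encoder nothing.

```python
import numpy as np

def softmax(v):
    v = v - v.max()            # shift for numerical stability
    p = np.exp(v)
    return p / p.sum()

def uni_decoder_nll(z, W1, W2, x):
    # A one-hidden-layer MLP maps z to a single distribution over the
    # vocabulary, shared by every position; x is a list of word ids.
    p = softmax(W2 @ np.tanh(W1 @ z))
    return -sum(np.log(p[w]) for w in x)

rng = np.random.default_rng(0)
z = rng.standard_normal(4)                       # latent vector, d = 4
W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((5, 8))
# The loss depends only on the multiset of words, not on their order
assert np.isclose(uni_decoder_nll(z, W1, W2, [0, 3, 1]),
                  uni_decoder_nll(z, W1, W2, [1, 0, 3]))
```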
Both the BoW-encoder and Uni-decoder variants might benefit from the PreAE pretraining technique, which is orthogonal. Since it is neither well understood nor well motivated (it is "a surprisingly effective fix") and would require running many more experiments, we leave it for future work.
Lastly, the pretrained LM (PreLM) variant is obtained in two training steps. First, we pretrain an LSTM-LM on each entire dataset. Then, it is used as an encoder without further training, so that the effect of pretraining cannot be overridden. We use average pooling over the hidden states to get a sentence representation, i.e. r = (1/L) Σ_{i=1}^L h_i, and learn the transformations L_1 and L_2 that compute the variational parameters. We initially tried max-pooling, but training was extremely unstable. The LM objective requires the hidden state to capture close correlations between words but also more global information to predict long-distance correlations. The hope is that this global information can be retrieved via pooling and encoded in the variational parameters. The PreLM variant is therefore nothing more than the use of a pretrained LM as a feature extractor (Peters et al., 2018). To our knowledge, this approach has not yet been evaluated in the VAE setting. Our goal here is to test the effect of the training procedure rather than the architecture, which is why we keep a simple LSTM instead of a more powerful architecture such as a Transformer.
We rename the baselines according to our changes, for instance, we call Li et al. (2019)'s model LSTM-last-LSTM-PreAE. These variants allow us to isolate the influence of the encoder, the decoder and the training procedure on the performance of the VAE.

Semi-supervised learning evaluation
We diagnosed a potential problem with sequence-to-sequence VAEs and proposed several alternative models and a training procedure to address it. For our first evaluation, we simulate the semi-supervised learning (SSL) setting to see which variants produce the most informative representations. There are two training phases: first, an unsupervised pretraining phase where the VAEs are trained; second, a supervised learning phase where classifiers are trained to predict ground-truth labels given the latent vectors produced by the encoders of the VAEs. This is essentially the same setup as M1 from Kingma et al. (2014). We could integrate the labels into the generative model as a random variable that is either observed or missing in order to obtain better results (Kingma et al., 2014), but our goal is to study the inductive bias of the sequence-to-sequence VAE as an unsupervised learning method. The small and large data-regimes give us complementary information. Informally, with many labels and complex classifiers, we quantify how much of the information pertaining to the labels is contained in the latent vector, whereas with a few labels and simple classifiers, we quantify how accessible this information is.

Model selection
For each dataset, we subsample g = 5 balanced labeled datasets for each data-regime, containing 5, 50, 500 and 5000 examples per class. These labeled datasets are used for training and validation during the supervised learning phase. The performance of the classifiers is measured by the macro F1-score on the entire test sets.
For a given dataset in a given data-regime, we want a measure of the performance of our models that abstracts away from i) hyperparameters for the VAEs, ii) hyperparameters for the downstream task classifiers, iii) subsampling of the dataset and iv) parameter initialisation of the VAEs. As is usually done by practitioners, we optimize over the hyperparameters of the VAEs and the classifiers, eliminating i) and ii) as sources of variance. The choice of the subsample and the initialisation of the model are used to quantify the robustness of the different algorithms.
On a given dataset, in a given data-regime and for a given model, we denote by F_{ij}^{H_M, H_C} the F1-score obtained on the test set with the subsample drawn using seed i, the parameter initialisation using seed j, VAE hyperparameters H_M and classifier hyperparameters H_C. We use repeated stratified K-fold cross-validation (Moss et al., 2018) to compute a validation score, denoted CV(i, j, H_M, H_C). On all training folds, we train logistic regression classifiers with L_2 regularisation and a grid search over H_C ∈ {0.01, 0.1, 1, 10, 100}. We select the best classifier hyperparameter:

H_C*(i, j, H_M) = argmax_{H_C} CV(i, j, H_M, H_C).

Then, the best VAE hyperparameter is chosen by averaging over the s = 3 random seeds with the best classifier hyperparameter:

H_M* = argmax_{H_M} (1/s) Σ_j CV(i, j, H_M, H_C*(i, j, H_M)).

Having optimised the hyperparameters, we compute the test-set F1-scores F_{ij} = F_{ij}^{H_M*, H_C*}. We report F̄_··, the empirical average F1-score over i and j. We also decompose the variance into the part coming from the parameter initialisation and the part coming from the subsampling. Denote F̄_·j the empirical average F1-score for a given j. We report the two following quantities:

( g/(s−1) Σ_j (F̄_·j − F̄_··)² )^{1/2}, which quantifies the variability due to the initialisation of the model (s = 3 different seeds), and

( 1/(s(g−1)) Σ_{i,j} (F_{ij} − F̄_·j)² )^{1/2}, which quantifies the remaining variability (g = 5 seeds).
In the context of a one-factor ANOVA with a linear model, these quantities are the square roots of MS_T and MS_E (see Appendix E).
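These two spreads can be computed directly from the matrix of test scores; a numpy sketch under the convention that rows index subsample seeds i and columns index initialisation seeds j:

```python
import numpy as np

def variance_decomposition(F):
    # F[i, j]: test F1-score for subsample seed i (g rows) and parameter
    # initialisation seed j (s columns).  Returns the square roots of the
    # one-factor ANOVA mean squares: MS_T (between initialisations) and
    # MS_E (residual variability, here attributed to subsampling).
    g, s = F.shape
    col_means = F.mean(axis=0)
    grand_mean = F.mean()
    ms_t = g * ((col_means - grand_mean) ** 2).sum() / (s - 1)
    ms_e = ((F - col_means) ** 2).sum() / (s * (g - 1))
    return np.sqrt(ms_t), np.sqrt(ms_e)

F = np.array([[0.70, 0.80, 0.75]] * 5)   # g = 5 subsamples, s = 3 seeds
sd_init, sd_rest = variance_decomposition(F)
# Here all rows are identical: the spread comes entirely from initialisation
assert sd_rest == 0.0 and sd_init > 0.0
```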
Finally, we also add a data-regime where the entire labeled training set is used in the supervised learning phase. In that setting, we use more expressive one-hidden-layer MLP classifiers, with early stopping on a validation set. We optimise only over the hyperparameters of the VAE. This allows us to check that our conclusions do not depend too much on the model selection procedure and on the choice of the classifier.

Hyperparameter sweep
For each class of model, we perform a grid search over target rates λ ∈ {2, 8} and sizes of latent vector d ∈ {4, 16}.
The target rates λ are chosen to be higher than the entropy of the labels of the documents (Table 1), as we assume that the latent variable should at least capture the annotated label. Indeed, λ = 2 nats is enough to store the labels of all datasets without any loss, except Yahoo, whose labels have an entropy of 2.3 nats, whereas λ = 8 nats suffices to capture much more information than needed to store the labels on all datasets. Moreover, these rates are chosen to be much smaller than the reconstruction loss of the baselines because of the technical difficulty, explained above, of increasing the rate without degrading the log-likelihood.
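The arithmetic behind these choices is simple: for roughly balanced labels, the entropy of a k-class label is about ln(k) nats, and the rate must exceed this for the latent to store the label losslessly. A quick check:

```python
import math

def uniform_label_entropy_nats(num_classes):
    # Entropy (in nats) of a uniform distribution over k classes: H = ln(k)
    return math.log(num_classes)

# lambda = 2 nats covers a 4-class dataset such as AGNews (ln 4 ~ 1.39 nats)...
assert uniform_label_entropy_nats(4) < 2.0
# ...but not a 10-class one such as Yahoo (ln 10 ~ 2.30 nats), while 8 nats is ample
assert 2.0 < uniform_label_entropy_nats(10) < 8.0
```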
The latent vector dimension d is either 4 or 16. Recall that our representations are evaluated on downstream tasks with very limited data in some cases (as little as 5 examples per class), so we need a small enough dimension of latent vector to be able to learn. We suppose that d = 4 will be favored for the 5 or 50 examples per class regime while d = 16 could be more efficient above this, but we leave this choice to the model selection procedure.
Other training details and hyperparameters kept constant are described in Appendix C.

What is the representation of a document?
VAEs are mostly used for generating samples but are also sometimes used as feature extractors for SSL. In the latter case, it is not clear what the representation of a datapoint is: the mean of the approximate posterior μ, or the noisy samples z ∼ N(μ, diag(σ²))? Kingma et al. (2014) feed noisy samples z to the classifiers, but in the literature on VAEs applied to language modeling, it is more common to use μ without explanation or even mention. If we are interested purely in downstream task performance, the mean should perform best, as the samples are just noisy versions of the mean vector (it is still not completely straightforward, as the noise could play a regularizing role). However, in order to evaluate what information is effectively transmitted to the decoder, we should use the samples. The performance of downstream classifiers using the mean does not tell us whether the latent variable is used by the decoder to reconstruct the input. The following experiment illustrates this fact.
We train the original VAE architecture on the Yelp dataset, both with and without the PreAE, using the original ELBo objective (λ = 0). As expected, the KL term collapses to 0. Then, we train a classifier using the procedure explained above with 5000 examples per class. We expect its performance to be close to random chance, regardless of whether samples or the mean parameter are used as inputs. However, Table 3 shows that this is not the case. Using samples, we do get random-chance predictions from the classifiers, whereas using means, the performance is remarkably high (as high as 81.5 F1 with pretraining). The reason is that the KL term never completely collapses to 0. Therefore, μ can be almost zero while still encoding a lot of information about its inputs. However, when the KL term is close to 0, the variance of the samples is close to 1, so no information is transmitted to the decoder. This tendency is exacerbated in the PreAE runs, for which the means encode remnants of the pretraining phase. This experiment shows that it is crucial to report which representation (z or μ) is analyzed and to interpret the results cautiously. Therefore, for the purpose of analysing representations for text generation, we feed z as inputs to the classifiers.

Table 4 contains the results of the SSL experiments. The proposed variants are either on par with or improve significantly over the baselines. In the large data-regime, BoW-max-LSTM and LSTM-avg-LSTM-PreLM perform best on average. In the small data-regime, the picture is more complex and depends on the dataset. The exception is LSTM-last-Uni, which is worse than the PreAE baselines and suffers from unstable training on AGNews (high variance).

On which datasets do the variants improve?
On AGNews and Yelp, in the large data-regime, our variants do not seem to improve over the baselines. However, on Amazon and Yahoo, in the large data-regime, the variants improve by around 5 points of F1-score. Why do the gains vary so widely across datasets? We suppose that on some datasets, the first words are enough to predict the labels correctly. We train bag-of-words classifiers7 using either i) only the first three words or ii) all the words as features, on the entire datasets. If the three-word classifiers are as good as the all-word classifiers, we expect the original VAE variants to perform well: in that case, encoding information about the first words is not harmful; it could even be a useful inductive bias. Conversely, if the first three words are not predictive of the label, the original VAEs will perform badly. As reported in Table 5, on AGNews and Yelp, classifiers trained on the first 3 words perform somewhat close to the classifier trained on all the words, reaching 80.8% and 85.4% of its scores respectively. On AGNews, for instance, the first words are often nouns that directly give away the topic of the news item: country names for the politics category, firm names for the technology category, athlete or team names for the sports category, etc. On the two other datasets, the performance decays considerably if we use only the first three words: three-word F1-scores account for only 60.7% and 30.3% of all-word F1-scores on Amazon and Yahoo. This explains why the original VAE can perform on par with or slightly better than our variants on datasets where the first words are very predictive of the labels.

Table 4: Using BoW encoders, Uni decoders or PreLM pretraining, the representations learned by the VAEs are more predictive of the labels (sentiment or topic) of the documents.

7 fastText classifiers (Joulin et al., 2017).
Despite similar asymptotic performance, the proposed variants are better than the baselines in the small data-regime, which suggests that the encoded information is qualitatively different. We will come back to this in the next evaluation.

Recurrent and BoW encoders work around max-pooling
Let us focus on BoW encoders. It is counterintuitive that BoW-max-LSTM improves over LSTM-max-LSTM (with or without PreAE). Indeed, taking word order into account should allow the LSTM encoder to do better inference than the BoW encoder, for example by handling negation or parsing more complicated discourse structure (Pang et al., 2002). We found that LSTM encoders learn an undesirable behavior through counting mechanisms (Shi et al., 2016; Suzgun et al., 2019). Indeed, they produce hidden states such that some components of the first hidden state h_1 consistently take higher values than those of h_2, h_3, ..., regardless of the inputs. Similarly, other components (though fewer of them) are consistently maximized in the second or third position. Therefore, after max-pooling over these states, some components of r act like memory slots assigned to fixed positions in the sentence, independently of the inputs. Since the decoder is also an LSTM and can count, it extracts the relevant components at each position to retrieve the corresponding words.
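The memory-slot behavior can be made concrete with a synthetic check (a sketch on fabricated hidden states, not our actual encoder activations): a component acts as a slot if, across many inputs, the max over positions is always reached at the same position.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sents, L, d = 200, 10, 8
# Fabricated hidden states: two components peak at a fixed position
# regardless of the input, mimicking the counting behavior.
states = rng.normal(size=(n_sents, L, d))
states[:, 0, 0] += 5.0        # component 0 always largest at position 1
states[:, 1, 1] += 5.0        # component 1 always largest at position 2

winners = states.argmax(axis=1)    # (n_sents, d): winning position per component
slot_rate = np.array([np.bincount(winners[:, j], minlength=L).max() / n_sents
                      for j in range(d)])  # how often a single position wins
```

slot_rate is close to 1 for the two planted components and far below for the others; applied to real encoder states, the same statistic reveals which components of r are position-locked.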
Unfortunately, BoW encoders are not immune to this problem either. Depending on the language of the texts, the dataset and its preprocessing, the vocabulary can sometimes be partitioned into 1) words that appear in the first positions of sentences and 2) the other words. For example, in English, only uppercased words appear in the first position. The embeddings of words that frequently start sentences can therefore learn to be identifiable by taking high values at certain fixed components, so that the first word can be recovered from the max-pooled representation r.
Therefore, it seems that the decoder and the loss play a larger role than the encoder itself in what the encoder will learn. This is rather intuitive given that the gradients of the encoder parameters are a function of the gradients of the decoder. This also confirms the findings of McCoy et al. (2019), who analyzed representations learned by sequence-to-sequence models (without any constraints on their capacity, see Appendix D.2).
As LSTM encoders can count, they can also easily encode sentence length. However, what happens when the sentence representation r is obtained via max or average pooling over word embeddings? Assuming that a given component j of the word embeddings e_1^j, ..., e_L^j is independently distributed, then r^j = max_i e_i^j is positively correlated with the length L. If instead we assume that the e_i^j have zero mean and that r^j = (1/L) Σ_i e_i^j, then |r^j| is anti-correlated with L. Therefore, if the decoder encourages sentence length to be encoded, the encoder manages to do so (at least approximately) even in the absence of an explicit counting mechanism.
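This can be checked numerically (a sketch with i.i.d. standard normal "embeddings"; real embeddings are of course not i.i.d.):

```python
import numpy as np

rng = np.random.default_rng(0)
lengths = rng.integers(5, 60, size=2000)   # simulated sentence lengths
max_pooled, abs_avg = [], []
for L in lengths:
    e = rng.normal(size=L)                 # one embedding component, zero mean
    max_pooled.append(e.max())             # max pooling
    abs_avg.append(abs(e.mean()))          # |average pooling|

r_max = np.corrcoef(lengths, max_pooled)[0, 1]
r_avg = np.corrcoef(lengths, abs_avg)[0, 1]
```

r_max comes out positive and r_avg negative, so both pooled representations leak the sentence length even without any counting mechanism.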
In summary, these experiments show that our variants encode more global information pertaining to sentiment or topic than the baselines. We have explained how the counting mechanism of the LSTM underlies memorization and how BoW encoders coupled with LSTM decoders are also affected by the problem, demonstrating the importance of the decoder among other architectural choices. Methodologically, we showed that only samples z can be used to evaluate representations for the purpose of generation. Moreover, we stressed the importance of using different datasets, because when global attributes are not very correlated with the first words, the original VAE suffers more from its bad inductive bias.

Text generation evaluation
What is the influence of the different representations learned by our models on generation? The samples z are predictive of the labels so they should also be predictive of the words that indicate the labels. Therefore, we expect that the better the classification performance, the more the reconstructed texts should exhibit the characteristics of texts sharing the same label.
To measure the agreement in label between the source document and its reconstruction, we adapt the evaluation procedure used by Ficler and Goldberg (2017) so that no human annotators or heuristics are required (see Appendix D.2). First, a classifier is trained to predict the label on the source dataset. Then, for each model, we encode the documents, reconstruct them, and classify these reconstructions using the classifier. Finally, we report the F1 scores between the original labels and the labels given by the classifiers on the generated samples. We call this score the agreement. We use two decoding schemes: beam search with a beam of size 5 and greedy decoding. We fix λ = 8, d = 16 on all models with three seeds. For the Uni decoder, we drop LSTM-last-Uni, which underperformed by a large margin in the SSL setting; for the other Uni models, we freeze the encoder, L1 and L2, and train a new recurrent decoder using the reconstruction loss of the VAE. The Uni decoder is used as an auxiliary decoder, as described by De Fauw et al. (2019) (see Appendix D.1 for details), and we denote this technique by PreUni.

Table 6: Our variants reconstruct the inputs with 1) higher agreement with the ground-truth, 2) less memorization of the 1st word and the length, 3) a negligible loss in likelihood. The best score and scores within one std are bolded.
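The agreement score can be sketched as follows (classify and reconstruct are placeholders for the trained classifier and the encode/decode pipeline):

```python
def macro_f1(y_true, y_pred):
    # Macro-averaged F1 over the classes present in the ground truth.
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def agreement(classify, reconstruct, docs, labels):
    # Classify the reconstruction of each source document and compare
    # the predicted labels to the ground-truth labels of the sources.
    preds = [classify(reconstruct(d)) for d in docs]
    return macro_f1(labels, preds)
```

With a perfect auto-encoder and a perfect classifier the agreement is 1; memorization-free but label-faithful reconstructions keep it high even when the surface forms differ.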
To quantify memorization, we measure the reconstruction accuracy of the first word and the ratio of identical sentence lengths between sources and reconstructions, as in Table 2. Finally, to verify that our bag-of-words assumptions do not hurt the overall fit to the data, we estimate the negative log-likelihood (NLL) via the importance-weighted lower bound (Burda et al., 2015) (500 samples). Table 6 shows the results for beam search decoding. There is a close correspondence between agreement and performance on the SSL tasks in the large data-regime. Our variants have higher agreement than the baselines, especially on the Amazon and Yahoo datasets where, as we have seen before, the memorization of the first words is an especially bad inductive bias. Note that on these datasets, the agreements are consistently lower than the downstream classification performance, which shows that reconstructing a sentence with the same label as the source sentence is harder than predicting the label with a classifier. Apart from that, the agreement does not tell us much more than the SSL results.
However, the baselines reconstruct the first words with very high accuracy (more than 50% of the time on Yahoo and Yelp) while our variants mitigate this memorization. For instance, the PreUni method recovers the first word about half as often on AGNews and Amazon and about two-thirds as often on Yahoo and Yelp. This is particularly interesting on AGNews and Yelp: there, the first words are very indicative of the topics or sentiments, and both baselines and variants have similarly high agreement. This shows that the mechanisms used to produce texts with the same labels are different: the reconstructions of the baselines exhibit the same labels as the sources mostly as a side-effect of starting with the same words. In contrast, our best variants have more diverse sentence beginnings but nonetheless produce as many or more documents with the correct labels.
We can now interpret the discrepancy between the small and large data-regimes observed in the SSL setting. Recall that despite similar performance with a lot of data, our variants were much more efficient with very few labels (5 examples per class). If the baselines simply memorize the first words of the sentences by mapping prefixes (possibly of varying sizes) to latent vectors, then more data is required to learn a good classifier than if the features are more global and abstract.
Swapping the LSTM encoder for a BoW encoder yields less memorization of the first word; further swapping the LSTM decoder for a Uni decoder decreases memorization even more. This shows that our bag-of-words assumptions, on both the encoder and the decoder side, are effective against the memorization problem. Note that BoW-Max and LSTM-Max with PreUni pretraining yield very close performance despite having different encoders, which confirms that the choice of the decoder is much more important than the choice of the encoder.
Finally, there seems to be a tradeoff between the global character of the latent information and the fit of the model to the data, as the BoW and Uni variants have a higher negative log-likelihood than the baselines. The difference seems significant (informally speaking, judging by the standard deviations) but the effect size is very small and should not impact the overall quality of the generated texts.
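The importance-weighted bound behind our NLL estimates can be sketched as follows (the per-sample log-densities are placeholders to be computed by the model for K posterior samples):

```python
import numpy as np

def iw_nll(log_p_x_given_z, log_p_z, log_q_z_given_x):
    # Arrays of shape (K,): one entry per posterior sample z_k ~ q(z|x).
    log_w = log_p_x_given_z + log_p_z - log_q_z_given_x  # log importance weights
    K = log_w.size
    m = log_w.max()
    # NLL bound: -log( (1/K) * sum_k exp(log_w_k) ), computed stably.
    return -(m + np.log(np.exp(log_w - m).sum()) - np.log(K))
```

With K = 500 samples per document, averaging this quantity over the test set gives an NLL estimate like the one reported in Table 6.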
To recapitulate, the bag-of-words assumptions decrease the memorization of the first word and of the sentence length in the latent variable while increasing the agreement between the labels of the source and of the reconstruction. This is achieved at the cost of a very small decrease in log-likelihood.

Conclusion and outlook
Since the inception of the sequence-to-sequence VAE, a lot of effort has been invested in solving the posterior collapse problem, i.e., in learning to encode something. However, this is not a sufficient condition for VAEs to be used for SSL or controllable generation, use cases for which latent variables should encode global information. By decomposing the reconstruction loss per position in the documents, we showed that sequence-to-sequence VAEs, both the original version and recent variants, tend to memorize the first few words as well as the length of the documents. These VAEs sometimes capture global features, but only coincidentally, as a side-effect of their memorization behavior, when these features are correlated with the first words of the documents.
In order to reduce memorization, we proposed simple modifications to the architecture (bag-of-words encoders or unigram decoders) and to the training procedure (pretraining with a language modelling objective). In the semi-supervised learning setting, our simple variants produce representations that are more predictive of the ground-truth labels, and these gains translate directly into generation: we obtained a higher agreement between the labels of source texts and the labels of their reconstructions, with less memorization, at almost no cost in terms of likelihood.
A lot of work remains to be done. The root cause of memorization should be clearly identified. A first hypothesis to explore is that the fixed, left-to-right factorization of the decoder's output probability could lead to memorization of the first words. Indeed, on all datasets, the LSTM-LM incurs a higher reconstruction loss on the first positions (cf. Figure 1), and these early errors should account for a proportionally larger part of the gradients. This hypothesis is also supported by our successes with the unigram decoder, which models words independently. If the hypothesis were true, we would expect that neither non-autoregressive decoders (for instance Gu et al., 2017) nor auto-regressive models where the order is latent and therefore variable (for example, Gu et al., 2019) would exhibit memorization. It would also imply that standard Transformers used auto-regressively would not yield improvements. Similarly, the causes behind the encoding of sentence length should be analyzed in depth.
Another promising avenue is to draw inspiration from the models, training procedures and losses used for language model pretraining. Models such as BERT (Devlin et al., 2018) only penalize the reconstruction of the words that are missing or corrupted and therefore avoid memorization altogether. These models can be seen as denoising auto-encoders (DAE) (Vincent et al., 2008). Current VAE models learn to corrupt the latent space and to reconstruct the entire input, while current DAE models corrupt parts of their inputs and reconstruct the corrupted portions. Models that blend the two frameworks might offer the best of both worlds (Im et al., 2017).
A On the use of KL annealing, the choice of the free bits flavor and resetting the decoder

The authors of the PreAE technique evaluated their models in the SSL setting using relatively small training sets (Section 3.3 of their paper). However, their experimental setting is not very rigorous. They use a validation set containing 10000 examples for model selection, which is also the size of their largest training set; this is equivalent to selecting the model on the test set. The hyperparameter budget seems to differ across models, exacerbating the problem. Finally, it seems that KL annealing played the same role as the free bits technique and was therefore redundant. For these reasons, we run our own hyperparameter search on the Yelp dataset.
Our experiments clearly confirm that their pretraining technique improves performance, but their choice of the free bits flavor and their use of KL annealing are suboptimal. We first show that KL annealing is no longer necessary when free bits are used, and that the original free bits method is equivalent to or worse than the δ-VAE variant. This justifies our use of a slightly different method than theirs in the paper. Additionally, we also confirm that resetting the decoder is crucial.

A.1 The free bits technique and variants
The original free bits objective (Kingma et al., 2016) is the following modification to the KL term:

Σ_i max(λ_i, KL(q(z_i|x) || p(z_i))),

where the index i denotes components and the total budget λ = Σ_i λ_i is spread equally across components. In this formulation, each component of the multivariate normal is allowed to deviate from the prior by a small amount. Instead, in the δ-VAE formulation, one component can use all of the λ free bits while the rest of the components collapse to the prior. This is the variant called δ, used throughout the paper:

max(λ, KL(q(z|x) || p(z))).

Other modifications of the free bits technique include the use of a variable coefficient in front of the KL term (Chen et al., 2016), the target rate objective of Alemi et al. (2018), the minimum desired rate (Pelsmaeker and Aziz, 2019), etc. A comparison of all these methods is outside the scope of this paper, and the δ variant satisfies our only requirement: the rate should be close to the desired rate.
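The two flavors can be sketched as follows (a reading of the objectives under the assumption that the total budget λ is split equally across the d latent components in the per-component variant):

```python
import numpy as np

def kl_per_component(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ), one value per latent component.
    return 0.5 * (mu ** 2 + np.exp(logvar) - logvar - 1.0)

def free_bits_original(mu, logvar, lam):
    # Per-component floor: each dimension keeps at least lam/d nats.
    kl = kl_per_component(mu, logvar)
    return np.maximum(kl, lam / kl.size).sum()

def free_bits_delta(mu, logvar, lam):
    # delta-VAE flavor: the floor applies to the total KL only,
    # so a single component may spend the whole budget.
    return max(kl_per_component(mu, logvar).sum(), lam)
```

At full collapse both floors keep the term at λ, but only the δ flavor lets a sparse subset of components carry all the information.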

A.2 KL annealing and the original free bits method increase the rate
Our hypothesis is that KL annealing is redundant when combined with free bits: it should increase the actual rate beyond what free bits alone would yield. This prevents fair comparisons of models at equivalent capacity. We also posit that the original free bits formulation imposes unnecessary constraints on how the free bits should be used, namely, that they should be used equally across all components.
To study the influence of the free bits variant as well as of KL annealing, we use the same experimental protocol as described in Section 5. To save computation, we fix d = 16. We do not perform model selection on the desired rate λ, in order to see which methods yield rates closest to the desired rate. Table 7 shows that both KL annealing and the original free bits term (instead of the δ-VAE variant) increase the actual rate reached at the end of the optimisation. Moreover, the increases are very unpredictable: KL annealing increases capacity more when combined with the original free bits than with the δ variant. Therefore, we cannot hope to compare models at equal rates when using KL annealing. The δ-VAE free bits variant without annealing reaches the KL value closest to the desired target rate λ. In addition, the δ-VAE free bits without KL annealing consistently yields better downstream task performance. In summary, KL annealing is harmful when used with free bits, and the δ-VAE free bits technique is superior to the original formulation. Therefore, all the experiments in the paper use the δ variant without annealing.
In prior work, the "per-component" variant might have been chosen because it trivially maximizes a metric called active units (AU). This measure roughly quantifies how many components of the latent vector deviate from the prior by a certain threshold on average. However, to our knowledge, there is no evidence, theoretical or empirical, that this metric should be maximized. Arguably, maximizing it is not only meaningless but also detrimental, as it discourages sparsity. Hence, we refrain from using this metric.
B On the importance of resetting the decoder after pretraining

The PreAE procedure pretrains an AE with a reconstruction loss only; then, the parameters of the decoder are re-initialised and the (modified) KL term is added to the objective. Since it is not obvious why this would be useful, we studied the impact of this choice. Table 8 shows that it is crucial.

C Training procedure
All the runs are trained using SGD with a learning rate of 0.5, and gradients are clipped when their norms exceed 5. We use the following early stopping scheme: at every epoch, if there has been no improvement on the validation error for two epochs in a row, the learning rate is halved; once it has been halved four times, training stops. All the LSTMs have a hidden state size of 512 and we use a batch size of 64. No dropout is applied to the encoders. The LSTM decoders use dropout (p = 0.5) both on the embeddings and on the hidden states (before the linear transformation that produces the logits). Similarly, dropout is applied to the representation before the linear transformation that produces the logits of the Unigram decoder. Word embeddings are initialized randomly and learned.
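The stopping scheme can be sketched as follows (run_epoch is a placeholder for one epoch of SGD returning the validation error; we assume training stops right after the fourth halving):

```python
def train_with_halving(run_epoch, lr=0.5, patience=2, max_halvings=4):
    best, bad_epochs, halvings = float("inf"), 0, 0
    while True:
        err = run_epoch(lr)              # one epoch of SGD at this rate
        if err < best:
            best, bad_epochs = err, 0    # improvement: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs == patience:   # two epochs in a row without progress
                lr, bad_epochs, halvings = lr / 2, 0, halvings + 1
                if halvings == max_halvings:
                    return best          # halved four times: stop
```

A run that never improves after its first epoch thus lasts nine epochs in total, halving the rate every two epochs.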

D Related work

D.1 Related models
The models that we use are very similar to previously proposed models. The NVDM model of Miao et al. (2016) is precisely BoW-max-Uni. Zhao et al. (2017) proposed an auxiliary loss that consists in reconstructing the input with a unigram model. Thus, their objective contains two reconstruction losses: the one given by the recurrent decoder and the one given by the unigram decoder. In comparison, our Uni models are trained in two steps: the encoder is trained jointly with the unigram decoder; then the unigram decoder is thrown away and we train a recurrent decoder using the fixed encoder. This way, we need not fear that one decoder might dominate the other, and we avoid additional hyperparameters weighing the two losses. Instead of an auxiliary loss, we have an auxiliary decoder that is only used to train the encoder. This method was presented by De Fauw et al. (2019) for training generative models of images. There is a slight difference: they use a feedforward auxiliary decoder to produce a different probability distribution for each pixel, whereas our unigram probability distribution is the same for all words of a document. This modification allows us to deal with documents of varying lengths.
Finally, the PreLM training procedure is related to large LM pretraining in the spirit of contextualized embeddings (Peters et al., 2018) and their successors. Note, however, two differences. Firstly, we do not use external data and stick to each individual training set: the goal is obviously not to evaluate transfer learning abilities. Secondly, we do not fine-tune the entire encoder but merely learn the linear transformations L1 and L2 that produce the variational parameters, to make sure that the VAE objective has no impact on the extraction of features.

D.2 Methods and evaluations
Ficler and Goldberg (2017) learn LSTM-LMs conditioned on labels that describe high-level properties of texts. Among other things, they want to verify that generated texts exhibit the same properties as the conditioning labels. For instance, when the LSTM-LM is conditioned on a positive sentiment value, the generated texts should also exhibit a positive sentiment. To check that the conditioning variables and the generated texts are consistent, they use the following procedure. First, they extract information about the various documents using heuristics or with the help of annotators. Then, they learn LSTM-LMs conditioned on these labels. Finally, they quantify the ratio of generated samples that have the same labels as the conditioning labels, either by applying the same heuristics to the generated samples or by asking human annotators once more. Our evaluation in Section 6 is extremely similar: we simply replace the heuristics and the human annotators with classifiers learned on ground-truth data.
Our work is also related to the important work of McCoy et al. (2019). They trained auto-encoders with different combinations of encoders and decoders (unidirectional, bidirectional or tree-structured) and decomposed the representations learned by the encoders using tensor product representations (Smolensky, 1990). They showed that decoders "largely dictate" the way information is encoded. The main difference with our work is that they study how information is encoded in sequence-to-sequence models without capacity limitations, whereas we study what information is encoded in the sequence-to-sequence VAE, where the VAE objective puts severe limits on capacity.

E Decomposing the variances of the scores
For a given model, dataset and data-regime, after optimisation of the hyperparameters of the VAE and the classifier, we collect several F1-scores F_ij, which depend on the seed i used to subsample the dataset and the seed j used to initialise the model parameters. We posit a linear model with one random-effect factor, the initialisation seed, where replicates are obtained by varying the subsampling seed:

F_ij = µ + α_j + ε_ij.

Assuming that α_j and ε_ij are independent random variables with null expectations, we can decompose the variance as

Var(F_ij) = σ²_init + σ²_ε.

This is the basis of the analysis of variance (ANOVA) and is often used to test hypotheses (for instance, that the effect of α_j is significant) (Oehlert, 2010). The estimates of σ²_init and σ²_ε are usually derived from the mean squares denoted MS_T and MS_E.
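The estimates can be computed as in a standard one-way random-effects ANOVA (a sketch; F is a replicates-by-init-seeds array of F1-scores):

```python
import numpy as np

def variance_components(F):
    # F[i, j]: F1-score for subsampling seed i (replicate) and init seed j.
    n, k = F.shape
    grand = F.mean()
    col_means = F.mean(axis=0)                               # one mean per init seed
    ms_t = n * ((col_means - grand) ** 2).sum() / (k - 1)    # between-init mean square
    ms_e = ((F - col_means) ** 2).sum() / (k * (n - 1))      # residual mean square
    var_init = max((ms_t - ms_e) / n, 0.0)                   # estimate of sigma^2_init
    return var_init, ms_e                                    # ms_e estimates sigma^2_eps
```

The clamping at 0 reflects that the moment estimator of σ²_init can go negative when the between-init variability is dominated by noise.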
In our case, we are only interested in estimating roughly what variability is due to the model initialisation and what is due to the subsampling of the dataset.
Note that we could treat the two sources of variance i and j symmetrically by adding a term β_i, but we would then need to report three standard deviations (those of α_j, β_i and ε_ij) to get the full picture. The most important estimate is σ_init: it quantifies the inherent robustness of the model to different initialisations. The effect of the subsampling is specific to the dataset and therefore less relevant to our analysis.