Discrete Latent Variable Representations for Low-Resource Text Classification

While much work on deep latent variable models of text uses continuous latent variables, discrete latent variables are interesting because they are more interpretable and typically more space efficient. We consider several approaches to learning discrete latent variable models for text in the case where exact marginalization over these variables is intractable. We compare the performance of the learned representations as features for low-resource document and sentence classification. Our best models outperform the previous best reported results with continuous representations in these low-resource settings, while learning significantly more compressed representations. Interestingly, we find that an amortized variant of Hard EM performs particularly well in the lowest-resource regimes.


Introduction
Deep generative models with latent variables have become a major focus of NLP research over the past several years. These models have been used both for generating text (Bowman et al., 2016) and as a way of learning latent representations of text for downstream tasks (Yang et al., 2017; Gururangan et al., 2019). Most of this work has modeled the latent variables as being continuous, that is, as vectors in $\mathbb{R}^d$, in part due to the simplicity of performing inference over (certain) continuous latents using variational autoencoders and the reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014).
At the same time, deep generative models with discrete latent variables are attractive because the latents are arguably more interpretable, and because they lead to significantly more compressed representations: a representation consisting of $M$ floating point values conventionally requires $M \times 32$ bits, whereas $M$ integers in $\{1, \ldots, K\}$ requires only $M \times \log_2 K$ bits.

* Work done as an intern at Toyota Technological Institute at Chicago.
1. Code available on GitHub: https://github.com/shuningjin/discrete-text-rep
Unfortunately, discrete latent variable models have a reputation for being more difficult to learn. We conduct a thorough comparison of several popular methods for learning such models, all within the framework of maximizing the evidence lower bound (ELBO) on the training data. In particular, we compare learning such models with a Vector Quantized-VAE (van den Oord et al., 2017, VQ-VAE), with a more conventional VAE with discrete latent variables (Jang et al., 2017; Maddison et al., 2017), or with an amortized version of "Hard" or "Viterbi" Expectation Maximization (Brown et al., 1993), which to our knowledge has not been explored to date. We consider both models where the latents are local (i.e., per token) and where they are global (i.e., per sentence); we assess the quality of these learned discrete representations as features for a low-resource text classifier, as suggested by Gururangan et al. (2019), and in a nearest neighbor-based retrieval task.
Our classification experiments distinguish between (1) the setting where the classifier must consume only the discrete representation associated with each sentence (i.e., the discrete assignment that maximizes the approximate posterior), and (2) the setting where the classifier may consume the embeddings of this discrete representation learned by the VAE encoder. Note that the former setting is more flexible, since we need only store a sentence's discrete representation, and are therefore free to use task-specific (and possibly much smaller) architectures for classification. In case (1), we are able to effectively match the performance of Gururangan et al. (2019) and other baselines; in case (2), we outperform them. Our experiments also suggest that Hard EM performs particularly well in case (1) when there is little supervised data, and that VQ-VAE struggles in this setting.

Related Work
Our work builds on recent advances in discrete representation learning and its applications. In particular, we are inspired by recent success with VQ-VAEs outside NLP (van den Oord et al., 2017; Razavi et al., 2019). These works show that we can generate realistic speech and image samples from discrete encodings, which better align with the symbolic representations that humans seem to work with (e.g., we naturally encode continuous speech signals into discrete words). Despite its success in speech and vision, VQ-VAE has not been considered as much in NLP. One exception is the translation model of Kaiser et al. (2018), which encodes a source sequence into discrete codes using vector quantization. But their work focuses on making inference faster, by decoding the target sequence from the discrete codes non-autoregressively. To our knowledge, we are the first to explore general text representations induced by VQ-VAEs for semi-supervised and transfer learning in NLP.
In addition to exploring the viability of VQ-VAEs for text representation learning, an important part of this paper is a systematic comparison between different discretization techniques. Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) is a popular choice that has been considered for supervised text classification (Chen and Gimpel, 2018) and dialog generation (Zhao et al., 2018). In the binary latent variable setting, straight-through estimators are often used (Dong et al., 2019). Another choice is "continuous decoding," which takes a convex combination of latent values to make the loss differentiable (Al-Shedivat and Parikh, 2019). A less frequently considered choice is Hard EM (Brown et al., 1993; De Marcken, 1995; Spitkovsky et al., 2010). A main contribution of this work is a thorough empirical comparison between these different choices in a controlled setting.
To demonstrate the usefulness of our models, we focus on improving low-resource classification performance by pretraining on unlabeled text. The previous best results were obtained with continuous latent-variable VAEs, e.g., VAMPIRE (Gururangan et al., 2019). We show that our discrete representations outperform these previous results while being significantly more lightweight.

Background
We consider generative models of a sequence $x = x_{1:T}$ of $T$ word tokens. We assume our latents to be a sequence $z = z_{1:L}$ of $L$ discrete latent vectors, each taking a value in $\{1, \ldots, K\}^M$; that is, $z \in \{1, \ldots, K\}^{M \times L}$. As is common in VAE-style models of text, we model the text autoregressively, and allow arbitrary interdependence between the text and the latents. That is, we have

$$p(x, z; \theta) = p(z) \times \prod_{t=1}^{T} p(x_t \mid x_{<t}, z; \theta),$$

where $\theta$ are the generative model's parameters. We further assume $p(z)$ to be a fully factorized, uniform prior: $p(z) = \frac{1}{K^{ML}}$.

Maximizing the marginal likelihood of such a model will be intractable for moderate values of $K$, $M$, and $L$. So we consider learning approaches that maximize the ELBO (Jordan et al., 1999) in an amortized way (Kingma and Welling, 2014; Rezende et al., 2014):

$$\mathrm{ELBO} = \mathbb{E}_{q(z \mid x; \phi)}\left[\log p(x \mid z; \theta)\right] - \mathrm{KL}\left(q(z \mid x; \phi) \,\|\, p(z)\right),$$

where $q(z \mid x; \phi)$ is the approximate posterior given by an inference or encoder network with parameters $\phi$. The approaches we consider differ in terms of how this approximate posterior $q$ is defined.
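As a concrete illustration, the uniform prior's log-probability and a simple Monte Carlo estimate of the ELBO can be computed as follows. This is a toy sketch; the function names are ours, and the likelihood samples are stand-ins for decoder evaluations at sampled latents:

```python
import numpy as np

def log_prior(M, L, K):
    # Fully factorized uniform prior: p(z) = 1 / K^(M*L),
    # so log p(z) = -M * L * log K for every assignment z.
    return -M * L * np.log(K)

def elbo(log_lik_samples, kl):
    """Monte Carlo ELBO estimate: E_q[log p(x | z)] - KL(q || p),
    with the expectation estimated from samples z ~ q(z | x)."""
    return np.mean(log_lik_samples) - kl

# Toy numbers: M = 2, L = 3, K = 4, and two likelihood samples.
assert np.isclose(log_prior(2, 3, 4), -6 * np.log(4))
assert elbo([-10.0, -12.0], kl=1.5) == -12.5
```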
Mean-Field Categorical VAE (CatVAE) A standard Categorical VAE parameterizes the approximate posterior as factorizing over categorical distributions that are independent given $x$: $q(z \mid x; \phi) = \prod_{m=1}^{M} \prod_{l=1}^{L} q_{ml}(z_{ml} \mid x; \phi)$. We therefore maximize:

$$\mathbb{E}_{q(z \mid x; \phi)}\left[\log p(x \mid z; \theta)\right] - \sum_{m=1}^{M} \sum_{l=1}^{L} \mathrm{KL}\left(q_{ml}(z_{ml} \mid x; \phi) \,\|\, p_{ml}\right),$$

where $p_{ml} = 1/K$ is the uniform prior, and each KL term equals $\log K - H[q_{ml}]$, with $H$ the entropy.
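Sampling from these factorized categorical posteriors can be done with the Gumbel-Max trick: adding independent Gumbel noise to the logits and taking an argmax yields an exact categorical sample. A minimal numpy sketch of this forward-pass sampling (shapes and function names are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(logits):
    """Sample from categorical distributions given unnormalized
    logits, via the Gumbel-Max trick: argmax(logits + Gumbel noise)
    is distributed as softmax(logits)."""
    g = rng.gumbel(size=logits.shape)
    return np.argmax(logits + g, axis=-1)

# M = 2 latent coordinates, L = 3 positions, K = 5 values each:
logits = rng.normal(size=(2, 3, 5))
z = gumbel_max_sample(logits)  # integer assignments in {0, ..., 4}
assert z.shape == (2, 3)
assert ((0 <= z) & (z < 5)).all()
```

In the straight-through variant used here, this hard sample is used in the forward pass while gradients flow through a low-temperature softmax in the backward pass.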
We approximate the expectation above by sampling from the $q_{ml}$, and we use the straight-through gradient estimator (Bengio et al., 2013; Jang et al., 2017) to compute gradients with respect to $\phi$. We find this approach to be more stable than using the REINFORCE (Williams, 1992) gradient estimator, or a Concrete (Maddison et al., 2017; Jang et al., 2017) approximation to the categorical distributions. Specifically, we sample from a categorical distribution using the Gumbel-Max trick (Maddison et al., 2014) in the forward pass, and approximate the gradient using a softmax with a small temperature. This approach is also referred to as straight-through Gumbel-Softmax (Jang et al., 2017).

VQ-VAE A VQ-VAE (van den Oord et al., 2017; Razavi et al., 2019) can also be seen as maximizing the ELBO, except the approximate posterior is assumed to be a point mass given by

$$\hat{z}_{ml} = \operatorname{argmin}_{j} \, \lVert \mathrm{enc}(x)_{ml} - e^{(m)}_{j} \rVert_2, \qquad (1)$$

where $e^{(m)}_j \in \mathbb{R}^d$ is an embedding of the $j$th discrete value $z_{ml}$ can take on, and $\mathrm{enc}(x)_{ml} \in \mathbb{R}^d$ is an encoding corresponding to the $ml$th latent given by an encoder network. These $e^{(m)}_j$ embedding vectors are often referred to as a VQ-VAE's "code book". In our setting, a code book is shared across latent vectors.

VQ-VAEs are typically learned by maximizing the ELBO assuming degenerate approximate posteriors as above, plus two terms that encourage the encoder embeddings and the "code book" embeddings to become close. In particular, we attempt to maximize the objective:

$$\log p(x \mid \hat{z}; \theta) - \sum_{m,l} \lVert \mathrm{sg}(\mathrm{enc}(x)_{ml}) - e^{(m)}_{\hat{z}_{ml}} \rVert_2^2 - \beta \sum_{m,l} \lVert \mathrm{enc}(x)_{ml} - \mathrm{sg}(e^{(m)}_{\hat{z}_{ml}}) \rVert_2^2, \qquad (2)$$

where sg is the stop-gradient operator, and $\hat{z} = \hat{z}_{1:L}$ is the sequence of minimizing assignments $\hat{z}_{ml}$ for each $\mathrm{enc}(x)_{ml}$. The loss term following the $\beta$ is known as the "commitment loss". Gradients of the likelihood term with respect to $\mathrm{enc}(x)$ are again estimated with the straight-through gradient estimator.
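The nearest-code assignment of Equation (1) and the two auxiliary terms of Equation (2) can be sketched in numpy as follows. This is a simplified sketch assuming a single shared code book (the M = 1 case); stop-gradients only matter under autograd, so sg() appears only in comments:

```python
import numpy as np

def quantize(enc, codebook):
    """Equation (1): for each encoding vector, pick the index of the
    closest code book embedding (squared L2 distance).
    enc: (L, d) encodings; codebook: (K, d) embeddings."""
    d2 = ((enc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    z_hat = d2.argmin(axis=1)        # discrete assignments
    quantized = codebook[z_hat]      # embeddings passed to the decoder
    return z_hat, quantized

def vq_aux_losses(enc, codebook, z_hat, beta=0.01):
    """The two auxiliary terms of Equation (2). In an autograd
    framework, sg() detaches one side: the first term moves the code
    book toward sg(enc), the second (commitment) term moves enc toward
    sg(code book). Numerically the two sums are identical."""
    codebook_loss = ((enc - codebook[z_hat]) ** 2).sum()
    commitment_loss = beta * ((enc - codebook[z_hat]) ** 2).sum()
    return codebook_loss, commitment_loss

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))       # L = 5 latents, d = 8
codebook = rng.normal(size=(4, 8))  # K = 4 codes
z_hat, q = quantize(enc, codebook)
assert z_hat.shape == (5,) and q.shape == (5, 8)
```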
Hard EM We train with an amortized form of Hard EM. First we define a relaxed version of $z$, $\tilde{z}$, where each $\tilde{z}_{ml}$ is a softmax over $K$ outputs (rather than a hard assignment) and is produced by an inference network with parameters $\phi$ (note this assumes our generative model can condition on such a relaxed latent variable). In the E-step, we take a small, constant number of gradient steps to maximize $\log p(x \mid \tilde{z}; \theta)$ with respect to $\phi$ (for a fixed $\theta$). In the M-step, we take a single gradient step to maximize $\log p(x \mid \hat{z}; \theta)$ with respect to $\theta$, where $\hat{z}$ contains the elementwise argmaxes of $\tilde{z}$ as produced by the inference network (with its most recent parameters $\phi$). Thus, Hard EM can also be interpreted as maximizing the (relaxed) ELBO. We also note that taking multiple steps in the hard E-step somewhat resembles the recently proposed aggressive training of VAEs.

[Figure 1: model architecture, showing the code books.]
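The alternating updates can be written schematically. Everything below is a stand-in sketch, not the paper's implementation: the callables, the learning rate, and the plain gradient-ascent form are all our illustrative assumptions:

```python
def amortized_hard_em(x, phi, theta, grad_phi, grad_theta,
                      relax, hard_assign, e_steps=3, lr=0.01):
    """One round of amortized Hard EM (schematic).
    relax(x, phi)        -> z_tilde, the softmax relaxation of z
    hard_assign(z_tilde) -> elementwise argmaxes z_hat
    grad_phi / grad_theta: gradients of log p(x | z; theta) with
    respect to phi / theta (stand-in callables for a real model)."""
    # E-step: a few gradient-ascent steps on phi with theta fixed,
    # feeding the relaxed latents z_tilde to the decoder.
    for _ in range(e_steps):
        z_tilde = relax(x, phi)
        phi = phi + lr * grad_phi(x, z_tilde, theta)
    # M-step: one gradient-ascent step on theta, using the hard
    # argmax assignments z_hat under the current phi.
    z_hat = hard_assign(relax(x, phi))
    theta = theta + lr * grad_theta(x, z_hat, theta)
    return phi, theta
```

With `e_steps=3` this matches the "multiple E-step updates per M-step update" variant discussed later; setting `e_steps=1` gives the standard alternating schedule.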

Models and Architectures
Recall that the latent sequence is z = z 1:L , where z l ∈ {1, . . . , K} M . We consider two generative models p(x | z; θ), one where L = T and one where L = 1. Each latent in the former model corresponds to a word, and so we refer to this as a "local" model, whereas in the second model we view the latents as being "global", since there is one latent vector for the whole sentence. We use the following architectures for our encoders and decoder, as illustrated in Figure 1.

Encoder
The encoder (parameterized by $\phi$) maps an example $x$ to the parameters of an approximate posterior distribution. Our encoder uses a single-layer Transformer (Vaswani et al., 2017), which produces a hidden state $h_t \in \mathbb{R}^d$ for each token.

Mean-Field Categorical VAE For the local model, we obtain the parameters of each categorical approximate posterior $q_{mt}$ as $\mathrm{softmax}(W_m h_t)$, where each $W_m \in \mathbb{R}^{K \times d}$ is a learned projection. For the global model, we obtain the parameters of each categorical approximate posterior $q_{m1}$ as $\mathrm{softmax}\left(\frac{1}{T}\sum_t W_m h_t\right)$; that is, we pass the token-level $h_t$ vectors through the learned projections $W_m$, followed by mean-pooling.
VQ-VAE For the local model, let $\bar{d} = d/M$. We obtain $\mathrm{enc}(x)_{mt}$, the encoding of the $mt$th latent variable, as $h_{t,(m-1)\bar{d}:m\bar{d}}$, following Kaiser et al. (2018). That is, we take the $m$th $\bar{d}$-length subvector of $h_t$. For the global model, let $\bar{d} = d$. We first project $h_t$ to $\mathbb{R}^{M\bar{d}}$, mean-pool, and obtain $\mathrm{enc}(x)_{m1}$ by taking the $m$th $\bar{d}$-length subvector of the resulting pooled vector. A VQ-VAE also requires learning a code book, and we define $M$ code books, one for each of the $M$ latent coordinates.

Hard EM We use the same encoder architecture as in the mean-field Categorical VAE case. Note, however, that we do not sample from the resulting categorical distributions. Rather, the softmax distributions are passed directly into the decoder.
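The subvector extraction for the local VQ-VAE encoder is just a reshape; a small numpy sketch (the function name and shapes are ours):

```python
import numpy as np

def local_encodings(h, M):
    """Split each token's hidden state h_t (length d) into M
    contiguous subvectors of length d_bar = d // M, giving the
    encodings enc(x)_{mt}.
    h: (T, d) hidden states -> (M, T, d_bar) encodings."""
    T, d = h.shape
    d_bar = d // M
    return h.reshape(T, M, d_bar).transpose(1, 0, 2)

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 64))    # T = 6 tokens, d = 64
enc = local_encodings(h, M=4)   # M = 4 latents per token, d_bar = 16
assert enc.shape == (4, 6, 16)
# enc[m, t] is the m-th 16-length subvector of h[t]:
assert np.allclose(enc[1, 0], h[0, 16:32])
```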

Decoder
In the case of the mean-field Categorical VAE, we obtain a length-$L$ sequence of vectors $z_l \in \{1, \ldots, K\}^M$ after sampling from the approximate posteriors. For the VQ-VAE, on the other hand, we obtain the sequence of $\hat{z}_l$ vectors by taking the indices of the closest code book embeddings, as in Equation (1).
In both cases, the resulting sequence of discrete vectors is embedded and consumed by the decoder. In particular, when learning with a VQ-VAE, the embedding of $\hat{z}_{ml}$ is simply $e^{(m)}_{\hat{z}_{ml}}$, whereas for the Categorical VAE each discrete latent is embedded using a trained embedding layer. In the local model, when $M > 1$, we concatenate the $M$ embeddings to form a single real vector embedding for the $l$th latent variable. In the global model, we use the $M$ embeddings directly. This resulting sequence of $T$ or $M$ real vectors is then viewed as the source-side input for a standard 1-layer Transformer encoder-decoder model (Vaswani et al., 2017), which decodes $x$ using causal masking.
As above, for Hard EM, we do not obtain a sequence of discrete vectors from the encoder, but rather a sequence of softmax distributions. These are multiplied into an embedding layer, as in the Categorical VAE case, and fed into the Transformer encoder-decoder model.
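For the relaxed (Hard EM) case, multiplying the softmax distributions into an embedding table is just a matrix product, which yields the expected embedding under each distribution. A minimal sketch (the toy table is ours):

```python
import numpy as np

def embed_soft(z_tilde, E):
    """Multiply softmax distributions over K codes into an embedding
    table, giving the expected embedding the decoder consumes.
    z_tilde: (L, K) rows summing to 1; E: (K, d) embedding table."""
    return z_tilde @ E

E = np.eye(3)                          # toy 3 x 3 embedding table
z_tilde = np.array([[0.2, 0.5, 0.3]])  # one relaxed latent, K = 3
out = embed_soft(z_tilde, E)
assert np.allclose(out, [[0.2, 0.5, 0.3]])
# A one-hot (hard) distribution recovers ordinary embedding lookup:
hard = np.array([[0.0, 1.0, 0.0]])
assert np.allclose(embed_soft(hard, E), E[1])
```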

Evaluating Latent Representations
Similar to Gururangan et al. (2019), we evaluate the learned latent representations by using them as features in a text classification system. We are in particular interested in using latent representations learned on unlabeled text to help improve the performance of classifiers trained on a small amount of labeled text. Concretely, we compare different discrete latent variable models in the following steps: 1. Pretraining an encoder-decoder model on in-domain unlabeled text with an ELBO objective, with early stopping based on validation perplexity.
2. Fixing the encoder to get discrete latents for the downstream classification task, and training a small number of task-specific parameters on top, using varying amounts of labeled data. As noted in the introduction, we consider both reembedding these latents from scratch and using the embeddings learned by the encoder.

Tasks and Datasets
The datasets we use for classification are AG News, DBPedia, and Yelp Review Full (Zhang et al., 2015), which correspond to predicting news labels, Wikipedia ontology labels, and the number of Yelp stars, respectively. The data details are summarized in Table 1. For all datasets, we randomly sample 5,000 examples as development data. To evaluate the efficiency of the latent representation in low-resource settings, we train the classifier with varying numbers of labeled instances: 200, 500, 2500, and the full training set size (which varies by dataset). We use accuracy as the evaluation metric.
In preprocessing, we space-tokenize, lowercase, and clean the text as in Kim (2014), and then truncate each sentence to a maximum sequence length of 400. For each dataset, we use a vocabulary of the 30,000 most common words.

As noted in the introduction, we consider two ways of embedding the integers for consumption by a classifier. We either (1) learn a new task-specific embedding space $E^{(m)}_{\mathrm{task}}$ (i.e., reembedding) or (2) use the fixed embedding space $E^{(m)}$ from pretraining. The first setting allows us to effectively replace sentences with their lower-dimensional discrete representations, and learn a classifier on the discrete representations from scratch. In the local model, we obtain token-level embedding vectors by concatenating the $M$ subvectors corresponding to each word. The resulting embeddings are either averaged, or fed to a Transformer and then averaged, and finally fed into a linear layer followed by a softmax.
Experimental Details

Baselines
We first experiment with three common text models: CBOW (Mikolov et al., 2013), a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), and a single-layer Transformer encoder. We find CBOW (with 64-dimensional embeddings) to be the most robust in settings with small numbers of labeled instances, and thus report results only for this baseline among the three. Further, we compare to VAMPIRE (Gururangan et al., 2019), a framework for pretraining VAEs with continuous latent variables for text classification. We pretrain VAMPIRE models on in-domain text for each dataset with 60 rounds of random hyperparameter search (with the same ranges as specified in their Appendix A.1), and select the best models based on validation accuracy in each setting.

Hyperparameters
In our experiments, we use Transformer layers with $d_{\mathrm{model}} = 64$. For optimization, we use Adam (Kingma and Ba, 2015); in pretraining, we use either a learning rate of 0.001 or the inverse square-root schedule defined in Vaswani et al. (2017), and in classification we use a learning rate of 0.0003. We tune other hyperparameters with random search and select the best settings based on validation accuracy. For the latent space size, we choose $M$ in $\{1, 2, 4, 8, 16\}$ and $K$ in $\{128, 256, 512, 1024, 4096\}$. Model-specific hyperparameters are introduced below.

VQ-VAE
In VQ-VAE, an alternative to the objective in Equation (2) is to remove its second term, while using an auxiliary dictionary learning algorithm with exponential moving averages (EMA) to update the embedding vectors (van den Oord et al., 2017). We tune whether to use EMA updates or not. Also, we find small β for commitment loss to be beneficial, and search over {0.001, 0.01, 0.1}.

Mean-Field Categorical VAE
We find that using the discrete analytic KL divergence term directly in the ELBO objective leads to posterior collapse: the KL term vanishes to 0 and the $q_{ml}$ distributions converge to the uniform priors. To circumvent this, we modify the KL term to be $\max(\mathrm{KL}, \lambda)$. This is known as Free Bits (Kingma et al., 2016; Li et al., 2019), which ensures that the latent variables encode a certain amount of information by not penalizing the KL divergence when it is less than $\lambda$. We set $\lambda = \gamma M L \log K$, where $\gamma$ is a hyperparameter between 0 and 1. That is, we allocate a "KL budget" as a fraction of $M L \log K$, which is the upper bound of the KL divergence between $ML$ independent categorical distributions and uniform prior distributions. Since in this case $\mathrm{KL}(q_{ml}(z_{ml} \mid x) \,\|\, p_{ml}(z_{ml})) = \log K - H[q_{ml}(z_{ml} \mid x)]$, this is equivalent to thresholding $H[q_{ml}(z_{ml} \mid x)]$ at $(1 - \gamma)\log K$. We experiment with $\gamma \in \{0.2, 0.4, 0.6, 0.8, 1\}$.
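The Free Bits clamp is easy to state in code. A small numpy sketch of the KL-to-uniform identity and the budget $\lambda = \gamma M L \log K$ (function names are ours):

```python
import numpy as np

def kl_to_uniform(q):
    """KL(q || uniform over K outcomes) = log K - H[q], per row."""
    K = q.shape[-1]
    h = -(q * np.log(np.clip(q, 1e-12, 1.0))).sum(-1)
    return np.log(K) - h

def free_bits_kl(q, gamma):
    """Free Bits: clamp the summed KL from below at
    lambda = gamma * M * L * log K, so the posteriors are not
    penalized for encoding up to that much information.
    q: (M, L, K) approximate posterior probabilities."""
    M, L, K = q.shape
    kl = kl_to_uniform(q).sum()
    lam = gamma * M * L * np.log(K)
    return max(kl, lam)

# A fully uniform posterior has KL 0, so the budget lambda applies:
q = np.full((2, 3, 4), 0.25)
assert np.isclose(free_bits_kl(q, gamma=0.5), 0.5 * 2 * 3 * np.log(4))
```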

Hard EM
We vary the number of gradient steps in the E-step in $\{1, 3\}$. At evaluation time, we always take the argmax of $\tilde{z}$ to get a hard assignment.

Results
In Figure 2, we compare the accuracy obtained by the representations from our Hard EM, Categorical VAE, and VQ-VAE models, averaged over the development datasets of AG News, DBPedia, and Yelp Full. In particular, we plot the best accuracy obtained over all hyperparameters (including $M$) for different numbers of labeled examples; we distinguish between local and global models, and between when the discrete representations are reembedded from scratch and when the encoder embeddings are used. We see that using the encoder embeddings typically outperforms reembedding from scratch, and that global representations tend to outperform local ones, except in the full data regime. Furthermore, we see that the Categorical VAE and VQ-VAE are largely comparable on average, though we undertake a finer-grained comparison by dataset in Appendix A. Perhaps most interestingly, we note that when reembedding from scratch, Hard EM significantly outperforms the other approaches in the lowest data regimes (i.e., for 200 and 500 examples). In fact, Hard EM allows us to match the performance of the best previously reported results even when reembedding from scratch; see Table 3.

Table 2 shows the best combinations of model and hyperparameters when training with 200 labeled examples on AG News. These settings were used in obtaining the numbers in Figure 2, and are largely stable across datasets.
In Figure 3, we compare the average accuracy of our local and global model variants trained on 200 labeled examples, as we vary $M$. When reembedding, local representations tend to improve as we move from $M = 1$ to $M = 2$, but not significantly after that. When reembedding global representations, performance increases as $M$ does. Unsurprisingly, when not reembedding, $M$ matters less.

Finally, we show the final accuracies obtained by our best models on the test data of each dataset in Table 3. We see that on all datasets when there are only 200 or 500 labeled examples, our best model outperforms VAMPIRE and the CBOW baseline, and our models that reembed the latents from scratch match or outperform VAMPIRE. As noted in Table 2, it is Hard EM that is particularly performant in these settings.

Qualitative analysis
To gain a better understanding of what the learned clusters represent, we examine their patterns on the AG News dataset labeled with four classes. Since VQ-VAEs and Categorical VAEs exhibit similar patterns, we focus on the latter model. Tables 4 and 5 show examples of sentence-and word-level clusters, respectively, induced by Categorical VAEs. The sentence-level model encodes each document into M = 4 latents, each taking one of K = 256 integers. The word-level model encodes each word into M = 1 latent taking one of K = 1024 integers. Since a word can be assigned multiple clusters, we take the majority cluster for illustration purposes.
We see that clusters correspond to topical aspects of the input (either a document or a word). In particular, in the sentence-level case, documents in the same cluster often have the same ground-truth label. We also find that each of the $M$ latents independently corresponds to topical aspects (e.g., $z_1 = 65$ implies that the topic has to do with technology); thus, taking the combination of these latents seems to make the cluster "purer". The word-level clusters are also organized by topical aspects (e.g., many words in cluster 510 are about modern conflicts in the Middle East).

Effect of Alternating Optimization
While Hard EM achieves impressive performance when reembedding from scratch and when training on only 200 or 500 examples, we wonder whether this performance is due to the alternating optimization, to the multiple E-step updates per M-step update, or to the lack of sampling. We accordingly experiment with optimizing our VQ-VAE and CatVAE variants in an alternating way, allowing multiple inference network updates per update of the generative parameters $\theta$. We show the results on the AG News dataset in Table 6. We find that alternating does generally improve the performance of VQ-VAE and CatVAE as well, though Hard EM performs the best overall when reembedding from scratch. Furthermore, because Hard EM requires no sampling, it is a compelling alternative to CatVAE. For all three methods, we find that doing 3 inference network update steps during alternating optimization performs no better than doing a single one, which suggests that aggressively optimizing the inference network is not crucial in our setting.

Compression
We briefly discuss in what sense discrete latent representations reduce storage requirements. Given a vocabulary of size 30,000, storing a $T$-length sentence requires $T \log_2 30000 \approx 14.9T$ bits. Our models require at most $M L \log_2 K$ bits to represent a sentence, which is generally smaller, and especially so when using a global representation. It is also worth noting that storing a $d$-dimensional floating point representation of a sentence (as continuous latent variable approaches might) costs $32d$ bits, which is typically much larger.
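The arithmetic can be made concrete. A short Python check of the three storage costs (the example sentence length and latent sizes are illustrative):

```python
import math

def sentence_bits(T, vocab=30000):
    # Storing a T-token sentence as vocabulary indices.
    return T * math.log2(vocab)

def latent_bits(M, L, K):
    # Storing M*L discrete latents, each an integer in {1, ..., K}.
    return M * L * math.log2(K)

def float_bits(d):
    # A d-dimensional 32-bit floating point vector.
    return 32 * d

# A 100-token sentence vs. a global (L = 1) code with M = 16, K = 1024:
assert round(sentence_bits(100)) == 1487
assert latent_bits(16, 1, 1024) == 160.0
assert float_bits(64) == 2048
```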
While the above holds for storage, the space required to classify a sentence represented as M L integers using a parametric classifier may not be smaller than that required for classifying a sentence represented as a d-dimensional floating point vector. On the other hand, nearest neighbor-based methods, which are experiencing renewed interest (Guu et al., 2018;Wiseman and Stratos, 2019), should be significantly less expensive in terms of time and memory when sentences are encoded as M L integers rather than d-dimensional floating point vectors. In the next subsection we quantitatively evaluate our discrete representations in a nearest neighbor-based retrieval setting.

Nearest Neighbor-Based Retrieval
In the classification experiments of Section 5, we evaluated our discrete representations by training a small classifier on top of them. Here we evaluate our global discrete representations in a document retrieval task to directly assess their quality; we note that this evaluation does not rely on the learned code books, embeddings, or a classifier.
In these experiments we use each document in the development set of the AG News corpus as a query to retrieve 100 nearest neighbors in the training corpus, as measured by Hamming distance. We use average label precision, the fraction of retrieved documents that have the same label as the query document, to evaluate the retrieved neighbors. We compare with baselines that use averaged 300-dimensional pretrained word vectors (corresponding to each token in the document) as a representation, where neighbors are retrieved based on cosine or $L_2$ distance. We use GloVe with a 2.2 million word vocabulary (Pennington et al., 2014) and fastText with a 2 million word vocabulary (Mikolov et al., 2018). The results are in Table 7. We see that CatVAE and Hard EM outperform these CBOW baselines (while being significantly more space efficient), while VQ-VAE does not. These results are in line with those of Figure 2, where VQ-VAE struggles when its code book vectors cannot be used (i.e., when reembedding from scratch).
In Figure 4 we additionally experiment with a slightly different setting: Rather than retrieving a fixed number of nearest neighbors for a query document, we retrieve all the documents within a neighborhood of Hamming distance ≤ D, and calculate the average label precision. These results use global representations with M = 16, and we therefore examine thresholds of D ∈ {0, . . . , 16}. We see that for CatVAE and Hard EM, the document similarity (or label precision) has an approximately linear correlation with Hamming distance. On the other hand, VQ-VAE shows a more surprising pattern, where high precision is not achieved until D = 10, perhaps suggesting that a large portion of the latent dimensions are redundant.

Conclusion
We have presented experiments comparing the discrete representations learned by a Categorical VAE, a VQ-VAE, and Hard EM in terms of their ability to improve a low-resource text classification system, and to allow for nearest neighbor-based document retrieval. Our best classification models are able to outperform previous work, and this remains so even when we reembed discrete latents from scratch in the learned classifier. We find that amortized Hard EM is particularly effective in lowresource regimes when reembedding from scratch, and that VQ-VAE struggles in these settings.