Latent-Variable Generative Models for Data-Efficient Text Classification

Generative classifiers offer potential advantages over their discriminative counterparts, namely in the areas of data efficiency, robustness to data shift and adversarial examples, and zero-shot learning (Ng and Jordan, 2002; Yogatama et al., 2017; Lewis and Fan, 2019). In this paper, we improve generative text classifiers by introducing discrete latent variables into the generative story, and explore several graphical model configurations. We parameterize the distributions using standard neural architectures used in conditional language modeling and perform learning by directly maximizing the log marginal likelihood via gradient-based optimization, which avoids the need for expectation-maximization. We empirically characterize the performance of our models on six text classification datasets. The choice of where to include the latent variable has a significant impact on performance, with the strongest results obtained when using the latent variable as an auxiliary conditioning variable in the generation of the textual input. This model consistently outperforms both the generative and discriminative classifiers in small-data settings. We analyze our model and find that the latent variable captures interpretable properties of the data, even with very small training sets.


Introduction
The most widely-used neural network classifiers are discriminative, that is, they are trained to explicitly favor the gold standard label over others. The alternative is to design classifiers that are generative; these follow a generative story that includes predicting the label and then the data conditioned on the label. Discriminative classifiers are preferred because they generally outperform their generative counterparts on standard benchmarks. These benchmarks typically assume large annotated training sets, little mismatch between training and test distributions, relatively clean data, and a lack of adversarial examples (Zue et al., 1990; Marcus et al., 1993; Deng et al., 2009; Lin et al., 2014).
However, when conditions are not ideal for discriminative classifiers, generative classifiers can actually perform better. Ng and Jordan (2002) showed theoretically that linear generative classifiers approach their asymptotic error rates more rapidly than discriminative ones. Based on this finding, Yogatama et al. (2017) empirically characterized the performance of RNN-based generative classifiers, showing advantages in sample complexity, zero-shot learning, and continual learning. Recent work in generative question answering models (Lewis and Fan, 2019) demonstrates better robustness to biased training data and adversarial testing data than state-of-the-art discriminative models.
In this paper, we focus on settings with small amounts of annotated data and improve generative text classifiers by introducing discrete latent variables into the generative story. Accordingly, the training objective is changed to the log marginal likelihood of the data, as we marginalize out the latent variables during learning. We parameterize the distributions with standard neural architectures used in conditional language models and include the latent variable by concatenating its embedding to the RNN hidden state before computing the softmax over words. While traditional latent variable learning in NLP uses the expectation-maximization (EM) algorithm (Dempster et al., 1977), we instead simply perform direct optimization of the log marginal likelihood using gradient-based methods. At inference time, we similarly marginalize out the latent variables while maximizing over the label.
We characterize the performance of our latent-variable generative classifiers on six text classification datasets introduced by Zhang et al. (2015). We observe that introducing latent variables leads to large and consistent performance gains in the small-data regime, though the benefits diminish as the training set grows.
To better understand the modeling space of latent variable classifiers, we explore several graphical model configurations. Our experimental results demonstrate the importance of including a direct dependency between the label and the input in the model. We study the relationship between the label, latent, and input variables in our strongest latent generative classifier, finding that the label and latent capture complementary information about the input. Some information about the textual input is encoded in the latent variable to help with generation.
We analyze our latent generative model by generating samples when controlling the label and latent variables. Even with small training data, the samples capture the salient characteristics of the label space while also conforming to the values of the latent variable, some of which we find to be interpretable. While discriminative classifiers excel at separating examples according to labels, generative classifiers offer certain advantages in practical settings that benefit from a richer understanding of the data-generating process.

Discriminative and Generative Text Classifiers
We begin by defining our baseline generative and discriminative text classifiers for document classification. Our models are essentially the same as those of Yogatama et al. (2017); we describe them in detail here because our latent-variable models will extend them.[1] Our classifiers are trained on datasets D of annotated documents.
Each instance (x, y) ∈ D consists of a textual input x = (x_1, x_2, ..., x_T), where T is the length of the document, and a label y ∈ Y.
[1] The main differences between our baselines and the models in Yogatama et al. (2017) are: (1) their discriminative classifier uses an LSTM with "peephole connections"; (2) they evaluate a label-based generative classifier ("Independent LSTMs") that uses a separate LSTM for each label. They also evaluate the model we describe here, which they call "Shared LSTMs". Their Independent and Shared LSTMs perform similarly across training set sizes.
The discriminative classifier is trained to maximize the conditional probability of labels given documents: ∑_{(x,y)∈D} log p(y | x). For our discriminative model, we encode a document x using an LSTM (Hochreiter and Schmidhuber, 1997) and use the average of the LSTM hidden states as the document representation. The classifier is built by adding a softmax layer on top of this average to obtain a probability distribution over labels.
The generative classifier is trained to maximize the joint probability of documents and labels: ∑_{(x,y)∈D} log p(x, y). The generative classifier uses the following factorization:

p(x, y) = p(y) p(x | y)

We parameterize log p(x | y) as a conditional LSTM language model using the standard sequential factorization:

log p(x | y) = ∑_{t=1}^{T} log p(x_t | x_{<t}, y)

We define a label embedding matrix V_Y ∈ R^{d_1 × |Y|}. To predict the next word x_{t+1}, we concatenate the LSTM hidden state h_t with the label embedding v_y (a column of V_Y), and feed the result to a softmax layer to get the probability distribution over the vocabulary. More details about the factorization and parameterization are discussed in Section 3. The label prior p(y) is acquired via maximum likelihood estimation and fixed during training of the remaining parameters. At inference time, the prediction is made by maximizing p(y | x) with respect to y for the discriminative classifier and maximizing p(x | y) p(y) for the generative classifier.
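As a concrete sketch of this decision rule, the following toy example scores each label by log p(y) + log p(x | y) and picks the argmax. A class-conditional unigram model stands in for the conditional LSTM language model, and all probabilities, words, and labels are invented for illustration:

```python
import math

# Toy class-conditional unigram "language models" standing in for the
# conditional LSTM; all numbers here are invented for illustration.
word_probs = {
    "sports": {"game": 0.5, "team": 0.3, "market": 0.1, "stocks": 0.1},
    "business": {"game": 0.1, "team": 0.1, "market": 0.4, "stocks": 0.4},
}
label_prior = {"sports": 0.5, "business": 0.5}  # empirical p(y)

def log_joint(x, y):
    """log p(x, y) = log p(y) + sum_t log p(x_t | y)."""
    return math.log(label_prior[y]) + sum(math.log(word_probs[y][w]) for w in x)

def classify(x):
    """Generative prediction: argmax_y p(x | y) p(y)."""
    return max(label_prior, key=lambda y: log_joint(x, y))

print(classify(["market", "stocks"]))  # -> business
```

Swapping the unigram tables for per-label conditional LSTM likelihoods leaves the decision rule itself unchanged.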

Latent-Variable Generative Classifiers
We now introduce discrete latent variables into the standard generative classifier as shown in Figure 1. We refer to the latent-variable model as an auxiliary latent generative model, as we expect the latent variable to contain auxiliary information that can help with the generation of the input. Following the graphical model structure in Figure 1(b), we factorize the joint probability p(x, y, c) as follows:

p(x, y, c) = p_Ψ(y) p_Φ(c | y) p_Θ(x | c, y)

We parameterize p_Θ(x | c, y) as a conditional LSTM language model using the same factorization as above:

log p_Θ(x | c, y) = ∑_{t=1}^{T} log p_Θ(x_t | x_{<t}, c, y)

where Θ is the set of parameters of the language model. As in the generative classifier, we use a label embedding matrix V_Y. In addition, we define a latent variable embedding matrix V_C ∈ R^{d_2 × |C|}, where C is the set of values for the discrete latent variable. Also like the generative classifier, we use an LSTM to predict each word with a softmax layer:

p_Θ(x_t | x_{<t}, c, y) ∝ exp(u_{x_t}^⊤ [h_t; v_y; v_c] + b_{x_t})

where h_t is the hidden representation of x_{<t} from the LSTM, v_y and v_c are the embeddings for the label and the latent variable, respectively, [u; v] denotes vertical concatenation, u_{x_t} is the output word embedding, and b_{x_t} is a bias parameter. The prior distribution of the latent variable is parameterized as follows:

p_Φ(c | y) ∝ exp(w_c^⊤ v_y + b_c)

where Φ is the set of parameters for this distribution, which includes the vector w_c and scalar b_c for each c.
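The softmax parameterization above can be sketched in a few lines. The vectors, words, and dimensions below are invented toy values (hidden size 2, label and latent embeddings of size 1); the point is only the concatenation [h_t; v_y; v_c] followed by a softmax over output word embeddings:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a dict of scores."""
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def next_word_dist(h_t, v_y, v_c, U, b):
    """p(x_t | x_<t, c, y) ∝ exp(u_w · [h_t; v_y; v_c] + b_w) for each word w."""
    feat = h_t + v_y + v_c  # list concatenation plays the role of [h; v_y; v_c]
    scores = {w: sum(ui * fi for ui, fi in zip(u, feat)) + b[w]
              for w, u in U.items()}
    return softmax(scores)

# Invented toy values: hidden state, label embedding, latent embedding.
h_t, v_y, v_c = [0.2, -0.1], [1.0], [0.5]
U = {"goal": [1.0, 0.0, 2.0, 0.0], "profit": [0.0, 1.0, -2.0, 0.0]}  # u_w
b = {"goal": 0.0, "profit": 0.0}                                      # b_w
dist = next_word_dist(h_t, v_y, v_c, U, b)
```

In the actual model the output embeddings u_w live in a matrix whose width matches the concatenated feature, a detail that matters for the parameter comparison in Section 7.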
As in the standard generative model, the label prior p Ψ (y) is acquired from the empirical label distribution in the training data and remains fixed during training.
Training. As is standard in latent-variable modeling, we train our models by maximizing the log marginal likelihood:

∑_{(x,y)∈D} log ∑_{c∈C} p(x, y, c)

In NLP, these sorts of optimization problems are traditionally solved with the EM algorithm. However, we instead directly optimize the above quantity using automatic differentiation. This is natural because we use softmax-transformed parameterizations; a more traditional parameterization would assign parameters directly to individual probabilities, which would then require constrained optimization.
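Direct optimization amounts to computing the per-instance objective with a log-sum-exp over latent values; in an autodiff framework the same computation is differentiable end-to-end. A minimal numerically stable sketch (the log-joint scores at the bottom are invented):

```python
import math

def logsumexp(vals):
    """Numerically stable log sum_i exp(vals[i])."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_marginal(x, y, latent_values, log_joint):
    """log p(x, y) = log sum_c p(x, y, c); log_joint(x, y, c) returns
    log p(y) + log p(c | y) + log p(x | c, y)."""
    return logsumexp([log_joint(x, y, c) for c in latent_values])

# Invented example: two latent values whose joint probabilities are
# 0.3 and 0.1, so the marginal likelihood is 0.4.
ll = log_marginal(None, None, [0, 1],
                  lambda x, y, c: math.log(0.3) if c == 0 else math.log(0.1))
```

In PyTorch or similar frameworks this is a single `logsumexp` call over the latent dimension, so gradients flow through the marginalization without any EM machinery.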
Inference. The prediction is made by marginalizing out the latent variable:

ŷ = argmax_y ∑_{c∈C} p(x, y, c)

We experimented with other inference objectives and found similar results. More details can be found in Appendix C.
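A sketch of this inference rule, marginalizing over c in log space and maximizing over y (the score table at the bottom is invented):

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def predict(x, labels, latent_values, log_joint):
    """argmax_y log sum_c p(x, y, c)."""
    return max(labels,
               key=lambda y: logsumexp([log_joint(x, y, c)
                                        for c in latent_values]))

# Invented log-joint scores: label "b" has marginal 0.51, label "a" has 0.40.
scores = {("a", 0): math.log(0.20), ("a", 1): math.log(0.20),
          ("b", 0): math.log(0.50), ("b", 1): math.log(0.01)}
pred = predict(None, ["a", "b"], [0, 1], lambda x, y, c: scores[(y, c)])
```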
Differences with ensembles. Our latent-variable model resembles an ensemble of multiple generative classifiers, but there are two main differences. First, all parameters in the latent generative classifier are trained jointly, while a standard ensemble combines predictions from multiple, independently-trained models. Joint training leads to complementary information being captured by the latent variable values, as shown in our analysis. Moreover, a standard ensemble has far more parameters (10, 30, or 50 times as many in our experimental setup), since each generative classifier is a completely separate model. Our approach simply conditions on the embedding of the latent variable value and therefore adds few parameters.

Datasets
We present our results on six publicly available text classification datasets introduced by Zhang et al. (2015), which include news categorization, sentiment analysis, question/answer topic classification, and article ontology classification.[2] To compare classifiers across training set sizes, we follow the setup of Yogatama et al. (2017) and construct multiple training sets by randomly sampling 5, 20, 100, 1k, 2k, 5k, and 10k instances per label from each dataset.

Training Details
In all experiments, the word embedding dimension and the LSTM hidden state dimension are set to 100. All LSTMs use one layer and are unidirectional. The label dimensionality of all generative classifiers is set to 100. We adopt the same parameter settings as Yogatama et al. (2017) to ensure the results are comparable. For the latent-variable generative classifiers, we choose 10 or 30 latent variable values with embeddings of dimensionality 10, 50, or 100. For optimization, we use Adam (Kingma and Ba, 2015) with learning rate 0.001. We do early stopping by evaluating the classification accuracy on the development set.
Due to memory limitations and computational costs, we truncate input sequences to 80 tokens before adding <s> and </s> to indicate the start and end of the document. Though truncation decreases the performance of the models, all models use the same truncated inputs, so the comparison is still fair.[3]

Baselines
To confirm that we have built strong baselines, we first compare our implementations of the generative and discriminative classifiers to prior work. Our results in Appendix A show that our baselines are comparable to those of Yogatama et al. (2017). Figure 2 shows results for the discriminative, generative, and latent generative classifiers in terms of data efficiency, which we measure by comparing the accuracies of the classifiers trained on training sets of varying size. Numerical comparisons on two datasets are shown in Table 1.

Data Efficiency
With small training sets, the latent generative classifier consistently outperforms both the generative and discriminative classifiers. When the generative classifier is better than the discriminative one, as in DBpedia, the latent classifier resembles the generative classifier. When the discriminative classifier is better, as in Yelp Polarity, the latent classifier patterns after the discriminative classifier. However, when the number of training examples is in the range of approximately 5,000 to 10,000 per class, the discriminative classifier tends to perform best.
With small training sets, the generative classifier outperforms the discriminative one in most cases, except on the very smallest training sets. For example, in the Yelp Review Polarity dataset, the first two points are from classifiers trained with only 10 and 40 instances in total. The other case in which generative classifiers underperform is training on large training sets, which agrees with the theoretical and empirical findings in prior work (Ng and Jordan, 2002; Yogatama et al., 2017).

Effect of Graphical Model Structure
There are multiple ways to factorize the joint probability of the variables x, y, and c, corresponding to different graphical models. Here we consider other graphical model structures, namely those shown in Figure 3. We refer to the model in Figure 3(b) as the "joint" latent generative classifier, since it uses the latent variable to jointly generate x and y. We refer to the model in Figure 3(c) as the "middle" latent generative classifier, as the latent variable separates the textual input from the label. We use similar parameterizations for these models as for the auxiliary latent classifier, with conditional language models to generate x where the embedding of the latent variable is concatenated to the hidden state as in Section 3. Figure 4 shows the comparison of the standard and the three latent generative classifiers on Yelp Review Polarity, AGNews, and DBpedia.[4] We observe that the auxiliary model consistently performs best, while the other two latent generative classifiers do not consistently improve over the standard generative classifier. On DBpedia, we see surprisingly poor performance when latent variables are added suboptimally. This suggests that the choice of where to include latent variables has a significant impact on performance.

Dependency between label and input variable.
We observe that the most prominent difference between the auxiliary model and the other two latent-variable models is that the label variable y is directly linked to the input variable x in the auxiliary model, as it is in the standard generative model. To verify the importance of this direct dependency between the label and the input, we create a new latent-variable model by adding a directed edge between y and x to the middle latent generative model. We refer to this model as the "hierarchical" latent generative classifier, shown in Figure 3(d). The results in Table 2 show the performance gains after adding this edge, which are all positive and sometimes very large. The resulting hierarchical model is very close in performance to the auxiliary model, which is unsurprising because these two models differ only in the presence of the edge from y to c.

Effect of Latent Variables
We conduct a comparison to demonstrate that the performance gains are due to the latent-variable structure rather than the increased number of parameters from adding the latent variables.[5] For the latent generative classifier, we choose 10 latent variable values with embeddings of dimensionality 10, and a label dimensionality of 100 (Lat. PC in Table 3). For the standard generative classifier, we choose a label dimensionality of 110 (Gen. PC in Table 3). The numbers of parameters are thus comparable, since we ensure the same number of parameters in the "output" word embeddings in the softmax layer of the language model, which is the decision that most strongly affects the number of parameters. Table 3 shows the results with these configurations alongside the results from the configurations used earlier in the paper. We observe that the latent generative classifiers still perform better in terms of data efficiency, which shows that the latent-variable structure accounts for the performance gains.

[5] The results in the preceding sections use models with configurations tuned on the development sets. We follow the practice of Yogatama et al. (2017) and fix the label dimensionality to 100, as described in Section 4.2. The only tuned hyperparameters are the number of latent variable values and the dimensions of their embeddings.

Table 3: Accuracy comparison of the standard generative (Gen.) and latent (Lat.) classifiers under the earlier experimental configurations and the parameter-comparison configurations (PC). When controlling for the number of parameters, the latent classifier still outperforms the standard generative classifier, indicating that the performance gains are due to the latent variables rather than an increased number of parameters.
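The bookkeeping behind the parameter-matched configurations can be checked directly. Assuming, per the parameterization in Section 3, that each output word embedding has the same width as the concatenated feature vector it is dotted with, the two PC settings give identical softmax-layer widths (so the vocabulary-sized parameter counts match):

```python
# Output word-embedding width = LSTM hidden size + widths of the
# embeddings concatenated to it (dims taken from the PC configurations).
hidden = 100
gen_pc = hidden + 110        # Gen. PC: [h; v_y] with label dimensionality 110
lat_pc = hidden + 100 + 10   # Lat. PC: [h; v_y; v_c] with label 100, latent 10
assert gen_pc == lat_pc == 210
```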

Learning via Expectation-Maximization
The results reported above are for classifiers trained by directly maximizing the log marginal likelihood via gradient-based optimization. In addition, we train our latent generative classifiers with the EM algorithm (Salakhutdinov et al., 2003). More training details can be found in Appendix B.
To speed convergence, we use a mini-batch version of EM, updating the parameters after each mini-batch. Our results in Table 4 show that the direct approach and the EM algorithm have similar performance in terms of classification accuracy and convergence speed in optimizing the parameters of our latent models. Similar trends appear for the other datasets.
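A minimal sketch of one mini-batch EM step, using a simple stand-in where the M-step for a mixture prior has a closed form (the full model instead takes a gradient step on the expected complete-data log-likelihood; all scores below are invented):

```python
import math

def e_step(log_joint_scores):
    """Responsibilities q(c) = p(c | x, y) ∝ exp(log p(x, y, c)) for one instance."""
    m = max(log_joint_scores)
    exps = [math.exp(s - m) for s in log_joint_scores]
    z = sum(exps)
    return [e / z for e in exps]

def m_step_prior(batch_responsibilities):
    """Closed-form M-step for a mixture prior: average responsibility per value."""
    n = len(batch_responsibilities)
    k = len(batch_responsibilities[0])
    return [sum(r[c] for r in batch_responsibilities) / n for c in range(k)]

# One mini-batch of two instances with invented log-joint scores for |C| = 2:
batch = [e_step([math.log(0.3), math.log(0.1)]),
         e_step([math.log(0.2), math.log(0.2)])]
prior = m_step_prior(batch)  # -> [0.625, 0.375]
```

In mini-batch EM these two steps alternate per batch rather than per full pass over the data.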

Interpretation of Latent Variables
We take the strongest latent-variable model, the auxiliary latent generative classifier, and analyze the relationship among the latent, input, and label variables. We use the AGNews dataset, which contains 4 categories: world, sports, business, and sci/tech. The classifier we analyze has 10 values for the latent variable and is trained on a training set containing 1k instances per class.
We first investigate the relationship between the latent variable and the label by counting co-occurrences. For each instance in the development set, we calculate the posterior probability distribution over the latent variable and pick the value with the highest probability as the preferred latent variable value for that instance. This is reasonable since, in our trained model, the posterior distribution over latent variable values is peaked. We then categorize the data by their preferred latent variable values and count the gold standard labels in each group. We observe that the labels are nearly uniformly distributed within each latent variable value, suggesting that the latent variable is not obviously being used to encode information about the label.
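This co-occurrence analysis is straightforward to sketch (the posteriors and labels below are invented; in the paper they come from the trained model and the AGNews development set):

```python
from collections import Counter, defaultdict

def preferred_value(posterior):
    """Index of the latent value with the highest posterior probability."""
    return max(range(len(posterior)), key=lambda c: posterior[c])

def cooccurrence_table(instances):
    """Map each preferred latent value to a Counter over gold labels.
    instances: iterable of (posterior over latent values, gold label)."""
    table = defaultdict(Counter)
    for posterior, label in instances:
        table[preferred_value(posterior)][label] += 1
    return table

# Invented dev instances:
table = cooccurrence_table([([0.9, 0.1], "sports"),
                            ([0.8, 0.2], "world"),
                            ([0.1, 0.9], "sports")])
```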
Thus, we hypothesize there should be information other than that pertaining to the label that causes the data to cluster into different latent variable values. We study the differences of the input texts among the 10 clusters by counting frequent words, manually scanning through instances, and looking for high-level similarities and differences. We report our manual labeling for the latent variable values in Table 5.
For example, value 1 is mostly associated with future and progressive tenses; the words "will", "next", and "new" appear frequently. Value 2 tends to contain past and perfect verb tenses (the phrases "has been" and "have been" appear frequently). Value 3 contains region names like "VANCOUVER", "LONDON", and "New Brunswick", while value 7 contains country-oriented terms like "Indian", "Russian", "North Korea", and "Ireland". Our choice of only 10 latent variable values causes them to capture the coarse-grained patterns we observe here. It is possible that more fine-grained differences would appear with a larger number of values.

Generation with Latent Variables
Another advantage of generative models is that they can be used to generate data in order to better understand what they have learned, especially in seeking to understand latent variables. We use our auxiliary latent generative classifier to generate multiple samples by setting the latent variable and the label. Instead of the soft mixture of discrete latent variable values that is used in classification (since we marginalize over the latent variable at test time), here we choose a single latent variable value when generating a textual sample.
To increase generation diversity, we use temperature-based sampling when choosing the next word, where a higher temperature leads to more variety but also more noise. We set the temperature to 0.6. Note that the latent-variable model here is trained on only 4000 instances (1k per label) from AGNews, so the generated samples do suffer from the small amount of data used to train the language model. Table 6 shows some generated examples. We observe that different combinations of the latent variable and label lead to generations that comport with both the labels and our interpretations of the latent variable values.
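Temperature-based sampling divides the logits by the temperature before the softmax; a temperature below 1 sharpens the distribution and one above 1 flattens it. A sketch with invented logits:

```python
import math
import random

def temperature_probs(logits, temperature):
    """softmax(logits / temperature) over a dict of word logits."""
    scaled = {w: s / temperature for w, s in logits.items()}
    m = max(scaled.values())
    exps = {w: math.exp(s - m) for w, s in scaled.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def sample_next_word(logits, temperature=0.6, rng=random):
    """Draw one word from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

# Invented logits: T = 0.6 concentrates more mass on the top word than T = 2.0.
sharp = temperature_probs({"goal": 2.0, "match": 1.0}, 0.6)
flat = temperature_probs({"goal": 2.0, "match": 1.0}, 2.0)
word = sample_next_word({"goal": 2.0, "match": 1.0})
```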
We speculate that the reason our generative classifiers perform well in the data-efficient setting is that they are better able to understand the data via language modeling rather than directly optimizing the classification objective. Our generated samples testify to the ability of generative classifiers to model the underlying data distribution even with only 4000 instances.

Related Work
Supervised Generative Models. Generative models have traditionally been used in supervised settings for many NLP tasks, including naive Bayes and other models for text classification (Maron, 1961; Yogatama et al., 2017), Markov models for sequence labeling (Church, 1988; Bikel et al., 1999; Brants, 2000; Zhou and Su, 2002), and probabilistic models for parsing (Magerman and Marcus, 1991; Black et al., 1993; Eisner, 1996; Collins, 1997; Dyer et al., 2016). Recent work in generative models for question answering (Lewis and Fan, 2019) learns to generate questions instead of directly penalizing prediction errors, which encourages the model to better understand the input data. Our work is directly inspired by that of Yogatama et al. (2017), who build RNN-based generative text classifiers and show scenarios where they can be empirically useful.
Text Classification. Traditionally, linear classifiers (McCallum and Nigam, 1998; Joachims, 1998; Fan et al., 2008) have been used for text classification. Recent work has scaled up text classification to larger datasets with models based on logistic regression (Joulin et al., 2017), convolutional neural networks (Kim, 2014; Zhang et al., 2015; Conneau et al., 2017), and recurrent neural networks (Xiao and Cho, 2016; Yogatama et al., 2017), the last of which is most closely related to our models.
Latent-variable Models. Latent variables have been widely used in both generative and discriminative models to learn rich structure from data (Klein, 2007, 2008; Blunsom et al., 2008; Yu and Joachims, 2009; Morency et al., 2008). Recent work in neural networks has shown that introducing latent variables leads to higher representational capacity (Kingma and Welling, 2014; Chung et al., 2015; Burda et al., 2016; Ji et al., 2016). However, unlike variational autoencoders (Kingma and Welling, 2014) and related work that uses continuous latent variables, our model is more similar to recent efforts that combine neural architectures with discrete latent variables and end-to-end training (Ji et al., 2016; Kim et al., 2017; Kong et al., 2017; Chen and Gimpel, 2018; Wiseman et al., 2018, inter alia).

Discussion and Future Work
An alternative solution for the small-data setting is to use language representations pretrained on large, unannotated datasets (Mikolov et al., 2013; Pennington et al., 2014; Devlin et al., 2019). In other experiments not reported here, we found that using pretrained word embeddings leads to larger performance improvements for the discriminative classifiers than for the generative ones.
Future work will explore the performance of latent generative classifiers in other challenging experimental conditions, including testing robustness to data shift and adversarial examples as well as zero-shot learning. Another thread of future work is to explore the performance of discriminative models with latent variables, and investigate combining pretrained representations with both generative and discriminative classifiers.

Conclusion
We focused in this paper on improving the data efficiency of generative text classifiers by introducing discrete latent variables into the generative story. Our experimental results demonstrate that, with small annotated training sets, latent generative classifiers achieve larger and more stable performance gains over discriminative classifiers than their standard generative counterparts do. Analysis reveals interpretable latent variable values and generated samples, even with very small training sets.