Improving Neural Topic Models using Knowledge Distillation

Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two models with disparate architectures, obtaining state-of-the-art topic coherence. We show that our adaptable framework not only improves performance in the aggregate over all estimated topics, as is commonly reported, but also in head-to-head comparisons of aligned topics.


Introduction
The core idea behind the predominant pretrain-and-fine-tune paradigm for transfer learning in NLP is that general language knowledge, gleaned from large quantities of data using unsupervised objectives, can serve as a foundation for more specialized endeavors. Current practice involves taking the full model that has amassed such general knowledge and fine-tuning it with a second objective appropriate to the new task (see Raffel et al., 2019, for an overview). Using these methods, pretrained transformer-based language models (e.g., BERT, Devlin et al., 2019) have been employed to great effect on a wide variety of NLP problems, thanks, in part, to a fine-grained ability to capture aspects of linguistic context (Clark et al., 2019; Liu et al., 2019; Rogers et al., 2020).
However, this paradigm introduces a subtle but insidious limitation that becomes evident when the downstream application is a topic model. A topic model may be cast as a (stochastic) autoencoder (Miao et al., 2016), and we could fine-tune a pretrained transformer with an identical document reconstruction objective. But in replacing the original topic model, we lose the property that makes it desirable: its interpretability. The transformer gains its contextual power from its ability to exploit a huge number of parameters, while the interpretability of a topic model comes from a dramatic dimensionality reduction.

[Figure 1: Improving a base neural topic model with knowledge distillation. A document d (e.g., "Marcel Duchamp was a painter, sculptor, chess player, and writer whose work is associated with Cubism, Dada, and conceptual art") is mapped through both a standard BoW representation and a BERT-based Autoencoder "Teacher" (BAT), yielding two distributions over words. These are used as the ground truth in the "student" topic model's document reconstruction loss L_KD (backpropagated along the dotted line). Crucially, the BAT distribution assigns mass to unobserved but related terms (e.g., art, chess, gingerbread, modernism, painter, picasso).]
We combine the advantages of these two approaches, the rich contextual language knowledge in pretrained transformers and the intelligibility of topic models, using knowledge distillation (Hinton et al., 2015). In the original formulation, knowledge distillation involves training a parameter-rich teacher classifier on large swaths of data, then using its high-quality probability estimates over outputs to guide a smaller student model. Since the information contained in these estimates is useful (a picture of an ox will yield higher label probabilities for BUFFALO than APRICOT), the student needs less data to train and can generalize better.
We show how this principle can apply equally well to improve unsupervised topic modeling, which to our knowledge has not previously been attempted. While distillation usually involves two models of the same type, it can also apply to models of differing architectures. Our method is conceptually quite straightforward: we fine-tune a pretrained transformer on a document reconstruction objective, where it acts in the capacity of an autoencoder. When a document is passed through this BERT autoencoder, it generates a distribution over words that includes unobserved but related terms. We then incorporate this distilled document representation into the loss function for topic model estimation (see Figure 1). To connect this method to the more standard supervised knowledge distillation, observe that the unsupervised "task" for both an autoencoder and a topic model is the reconstruction of the original document, i.e. prediction of a distribution over the vocabulary. The BERT autoencoder, as "teacher", provides a dense prediction that is richly informed by training on a large corpus. The topic model, as "student", generates its own prediction of that distribution. We use the former to guide the latter, essentially as if predicting word distributions were a multi-class labeling problem. Our approach, which we call BERT-based Autoencoder as Teacher (BAT), obtains best-in-class results on the most commonly used measure of topic coherence, normalized pointwise mutual information (NPMI, Aletras and Stevenson, 2013), compared against recent state-of-the-art models that serve as our baselines.
In order to accomplish this, we adopt neural topic models (NTMs; Miao et al., 2016; Srivastava and Sutton, 2017; Card et al., 2018; Burkhardt and Kramer, 2019; Nan et al., 2019, inter alia), which use various forms of black-box distribution matching (Kingma and Welling, 2014; Tolstikhin et al., 2018). These now surpass traditional methods (e.g. LDA, Blei et al., 2003, and variants) in topic coherence. In addition, it is easier to modify the generative model of a neural topic model than that of a classic probabilistic latent-variable model, where changes generally require investing effort in new variational inference procedures or samplers. In fact, because we leave the base NTM unmodified, our approach is flexible enough to easily accommodate any neural topic model, so long as it includes a word-level document reconstruction objective. We support this claim by demonstrating improvements on models based on both Variational (Card et al., 2018) and Wasserstein (Nan et al., 2019) auto-encoders.
To summarize our contributions:

• We introduce a novel coupling of the knowledge distillation technique with generative graphical models.

• We construct knowledge-distilled neural topic models that achieve better topic coherence than their counterparts without distillation on three standard English-language topic-modeling datasets.

• We demonstrate that our method is not only effective but modular, improving topic coherence in a base state-of-the-art model by modifying only a few lines of code.

• In addition to showing overall improvement across topics, our method preserves the topic analysis of the base model and improves coherence on a topic-by-topic basis.

Background on topic models
Topic modeling is a well-established probabilistic method that aims to summarize large document corpora using a much smaller number of latent topics. The most prominent instantiation, LDA (Blei et al., 2003), treats each document as a mixture over K latent topics, θ_d, where each topic is a distribution over words, β_k. By presenting topics as ranked word lists and documents in terms of their probable topics, topic models can provide legible and concise representations of both the entire corpus and individual documents. In classical topic models like LDA, distributions over the latent variables are estimated with approximate inference algorithms tailored to the generative process. Changes to the model specification (for instance, the inclusion of a supervised label) require attendant changes in the inference method, which can prove onerous to derive. For some probabilistic models, this problem may be circumvented by the variational auto-encoder (VAE, Kingma and Welling, 2014), which introduces a recognition model that approximates the posterior with a neural network. As a result, neural topic models have capitalized on the VAE framework (Srivastava and Sutton, 2017; Card et al., 2018; Burkhardt and Kramer, 2019, inter alia) and other deep generative models (Wang et al., 2019; Nan et al., 2019). In addition to their flexibility, the best models now yield more coherent topics than LDA.
Although our method (Section 2.3) is agnostic as to the choice of neural topic model, we borrow from Card et al. (2018) for both formal exposition and our base implementation (Section 3). Card et al. (2018) develop SCHOLAR, a generalization of the first successful VAE-based neural topic model (PRODLDA, Srivastava and Sutton, 2017). The generative story is broadly similar to that of LDA, although the uniform Dirichlet prior is replaced with a logistic normal (LN). For each document d:

    θ_d ∼ LN(0, I)
    w_d ∼ Multinomial(N_d, f(θ_d))

Following PRODLDA, B is a K × V matrix where the kth row corresponds to the kth topic's word probabilities in log-frequency space. The multinomial distribution over a document's words is parameterized by

    f(θ_d) = σ(m + θ_d⊤B),    (1)

where m is a vector of fixed empirical background word frequencies and σ(·) is the softmax function.
We highlight that each document is treated as a bag of words, w^BOW_d. To perform inference on the model, VAE-based models like SCHOLAR approximate the true intractable posterior p(θ_d | ·) with a neural encoder network g(w_d) that parameterizes the variational distribution q(θ_d | g(·)) (here, a logistic normal with diagonal covariance). The Evidence Lower BOund (ELBO) is therefore

    ELBO(d) = E_{q(θ_d | g(w_d))}[ log p(w_d | θ_d) ] − KL[ q(θ_d | g(w_d)) ‖ p(θ_d) ]    (2)
            = −(L_R + L_KL),  with  L_R = − Σ_{v=1}^{V} w^BOW_{d,v} log f_v(θ_d),    (3)

which is optimized with stochastic gradient descent. The form of the reconstruction error L_R is a consequence of the independent multinomial draws.
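For concreteness, this inference scheme can be sketched with plain NumPy as a toy forward pass. The parameter names (`W_enc`, `W_mu`, `W_logsig`) and the tanh encoder are illustrative assumptions, not taken from the SCHOLAR codebase; a real implementation would use an autodiff framework and minibatch training.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, H = 1000, 50, 300          # vocab size, number of topics, encoder width

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy parameters (learned in practice); names are ours, for illustration only
W_enc = rng.normal(scale=0.01, size=(H, V))
W_mu = rng.normal(scale=0.01, size=(K, H))
W_logsig = rng.normal(scale=0.01, size=(K, H))
B = rng.normal(scale=0.01, size=(K, V))      # topic-word matrix (log-frequency space)
m = np.log(softmax(rng.normal(size=V)))      # background log word frequencies

def elbo(w_bow):
    # Recognition network g(w): logistic-normal variational posterior
    h = np.tanh(W_enc @ w_bow)
    mu, logsig = W_mu @ h, W_logsig @ h
    eps = rng.normal(size=K)
    theta = softmax(mu + np.exp(logsig) * eps)        # reparameterized sample
    # Reconstruction term L_R: multinomial log-likelihood of the BoW counts
    log_p = np.log(softmax(m + theta @ B))
    L_R = -(w_bow * log_p).sum()
    # KL between the diagonal Gaussian q and a standard normal prior
    L_KL = 0.5 * (np.exp(2 * logsig) + mu**2 - 1 - 2 * logsig).sum()
    return -(L_R + L_KL)

w_bow = np.zeros(V)
w_bow[rng.choice(V, size=80)] += 1.0         # toy document: a sparse bag of words
print(elbo(w_bow))                           # negative scalar
```

Optimizing this quantity with respect to both the encoder and the topic-word matrix B recovers the standard VAE training loop for topic models.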

Background on knowledge distillation
It is instructive to think of Eq. (1) as a latent logistic regression, intended to approximate the distribution over words in a document. Under this lens, the neural topic model outlined above can be cast as a multi-label classification problem. Indeed, it accords with the standard structure: there is a softmax over logits estimated by a neural network, coupled with a cross-entropy loss. However, because w BOW d is a sparse bag of words, the model is limited in its ability to generalize. During backpropagation (Eq. (3)), the topic parameters will only update to account for observed terms, which can lead to overfitting and topics with suboptimal coherence.
In contrast, dense document representations can capture rich information that bag-of-words representations cannot.
These observations motivate our use of knowledge distillation (KD, Hinton et al., 2015). The authors argue that the knowledge learned by a large "cumbersome" classifier on extensive data-e.g., a deep neural network or an ensemble-is expressed in its probability estimates over classes, and not just contained in its parameters. Hence, these teacher estimates for an input may be repurposed as soft labels to train a smaller student model. In practice, the loss against the true labels is linearly interpolated with a loss against the teacher probabilities, Eq. (4). We discuss alternative ways to integrate outside information in Section 6.
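The interpolated objective of Hinton et al. (2015) can be sketched minimally as follows; variable names are ours, and the T² factor is the standard correction that keeps soft-label gradient magnitudes comparable across temperatures.

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, hard_label, lam=0.5, T=2.0):
    """(1 - lam) * CE(hard label) + lam * T^2 * CE(teacher soft labels)."""
    p_student = softmax(student_logits)           # for the hard-label term
    p_student_T = softmax(student_logits, T)      # temperature-scaled student
    p_teacher_T = softmax(teacher_logits, T)      # temperature-scaled teacher
    ce_hard = -np.log(p_student[hard_label])
    ce_soft = -(p_teacher_T * np.log(p_student_T)).sum()
    return (1 - lam) * ce_hard + lam * T**2 * ce_soft

loss = kd_loss(np.array([2.0, 0.5, -1.0]), np.array([1.5, 1.0, -2.0]), hard_label=0)
```

Setting `lam=0` recovers the ordinary cross-entropy against the hard label; raising T makes the teacher's distribution more diffuse, exposing the relative probabilities of incorrect classes.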

Combining neural topic modeling with knowledge distillation
The knowledge distillation objective. To apply KD to a "base" neural topic model, we replace the reconstruction term L_R in Eq. (3) with L_KD, as follows:

    L_KD = (1 − λ) L_R(w^BOW_d) + λ T² L_R(w^BAT_d),  with  w^BAT_d = N_d · σ(z^BAT_d / T).    (4)

Here, z^BAT_d are the logits produced by the teacher network for a given input document d, meaning that w^BAT_d acts as a smoothed pseudo-document. T is the softmax temperature, which controls how diffuse the estimated probability mass is over the words (hence f(·; T) is Eq. (1) with the corresponding scaling). This differs from the original KD in two ways: (a) it scales the estimated probabilities by the document length N_d, and (b) it uses a multilabel loss.
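A minimal sketch of this distilled reconstruction term L_KD (Eq. (4)); variable names are ours, and the T² gradient-scaling factor follows standard KD practice.

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max())
    return e / e.sum()

def kd_reconstruction_loss(log_p, w_bow, z_bat, lam=0.5, T=2.0):
    """Sketch of the distilled reconstruction term L_KD.

    log_p : the student's log word distribution, log f(theta_d; T)
    w_bow : observed bag-of-words counts over the vocabulary
    z_bat : teacher logits over the same vocabulary
    The teacher's temperature-scaled distribution becomes a pseudo-document
    scaled to the true document length N_d, and the two multilabel
    cross-entropies are linearly interpolated."""
    N_d = w_bow.sum()
    w_bat = N_d * softmax(z_bat, T)          # smoothed pseudo-document
    loss_bow = -(w_bow * log_p).sum()        # loss against observed words
    loss_bat = -(w_bat * log_p).sum()        # loss against teacher pseudo-document
    return (1 - lam) * loss_bow + lam * T**2 * loss_bat

rng = np.random.default_rng(1)
V = 8
w_bow = np.array([3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
log_p = np.log(softmax(rng.normal(size=V)))
z_bat = rng.normal(size=V)
print(kd_reconstruction_loss(log_p, w_bow, z_bat))
```

At `lam=0` the term reduces to the plain reconstruction loss L_R; at `lam=1` the topic model is trained entirely against the teacher's pseudo-documents.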
The teacher model. We generate the teacher logits z^BAT using the pretrained transformer DISTILBERT, itself a distilled version of BERT (Devlin et al., 2019). [Footnote: DISTILBERT's light weight accommodates longer documents, necessary for topic modeling. Even with this change, we divide very long documents into chunks, estimating logits for each chunk and taking the pointwise mean. More complex schemes (i.e., LSTMs, Hochreiter and Schmidhuber, 1997) yielded no benefit.] BERT-like models are generally pretrained on large domain-general corpora with a language-modeling-like objective, yielding an ability to capture nuances of linguistic context more effectively than bag-of-words models (Clark et al., 2019; Liu et al., 2019; Rogers et al., 2020). Mirroring the NTM's formulation as a variational auto-encoder, we treat DISTILBERT as a deterministic auto-encoder, fine-tuning it with the document-reconstruction objective L_R on the same dataset. Thus, we use a BERT-based Autoencoder as our Teacher model, hence BAT. [Footnote: A reader familiar with variational NTMs may notice that we haven't mentioned an obvious means of incorporating representations from a pretrained transformer: encoding the document representation from a BERT-like model. This yields unimpressive results; see Appendix D.1.]

Clipping the logit distribution. Depending on preprocessing, V may number in the tens of thousands of words. This leads to a long tail of probability mass assigned to unlikely terms, and breaks standard assumptions of sparsity. Tang et al. (2020), working in a classification setting, find that truncating the logits to the top-n classes and assigning uniform mass to the rest improves accuracy. We instead choose the top c·N_d, c ∈ R+, logits and assign zero probability to the remaining elements to enforce sparsity.

Evaluation

We seek to discover a latent space of topics that is meaningful and useful to people (Chang et al., 2009).
Accordingly, we evaluate topic coherence using normalized pointwise mutual information (NPMI), which is significantly correlated with human judgments of topic quality (Aletras and Stevenson, 2013; Lau et al., 2014) and widely used to evaluate topic models. We follow precedent and calculate (internal) NPMI using the top ten words in each topic, taking the mean across the NPMI scores for individual topics. Internal NPMI is estimated with reference co-occurrence counts from a held-out dataset from the same corpus, i.e., the dev or test split. While internal NPMI is the metric of choice for most prior work, we also provide external NPMI results using Gigaword 5 (Parker et al., 2011), following Card et al. (2018).
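A reference implementation of per-topic NPMI from document co-occurrence counts might look like the following; the score of −1 for never co-occurring pairs is a common convention, and exact smoothing choices vary across toolkits.

```python
import numpy as np
from itertools import combinations

def topic_npmi(top_words, doc_sets, eps=1e-12):
    """Mean NPMI over all pairs of a topic's top words.

    NPMI(i, j) = log[P(i, j) / (P(i) P(j))] / -log P(i, j), with
    probabilities estimated as document frequencies in a reference corpus."""
    D = len(doc_sets)
    scores = []
    for wi, wj in combinations(top_words, 2):
        p_i = sum(wi in d for d in doc_sets) / D
        p_j = sum(wj in d for d in doc_sets) / D
        p_ij = sum(wi in d and wj in d for d in doc_sets) / D
        if p_ij == 0.0:
            scores.append(-1.0)            # never co-occur: minimum score
        else:
            p_ij = min(p_ij, 1.0 - eps)    # guard the degenerate -log(1) = 0 case
            scores.append(np.log(p_ij / (p_i * p_j)) / -np.log(p_ij))
    return float(np.mean(scores))

docs = [{"chess", "player"}, {"chess", "player", "art"}, {"art", "painter"}]
print(topic_npmi(["chess", "player"], docs))   # ≈ 1.0: the pair always co-occurs
```

Internal NPMI uses a held-out split of the modeled corpus as `doc_sets`; external NPMI swaps in a large reference corpus such as Gigaword.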

Experimental Baselines
We select three experimental baseline models that represent diverse styles of neural topic modeling. Each achieves the highest NPMI on the majority of its respective datasets, as well as a considerable improvement over previous neural and non-neural topic models (such as Srivastava and Sutton, 2017; Miao et al., 2016; Ding et al., 2018). All our baselines are roughly contemporaneous with one another, and had yet to be compared in a head-to-head fashion prior to our work. Among them, W-LDA (Nan et al., 2019) is built on the Wasserstein auto-encoder (Tolstikhin et al., 2018), using a Dirichlet prior that is matched by minimizing Maximum Mean Discrepancy. They find the method leads to state-of-the-art coherence on several datasets and encourages topics to exhibit greater word diversity.
We demonstrate the modularity of our core innovation by combining our method with both SCHOLAR and W-LDA (Section 4).

Our Models and Settings
As discussed in Section 2.3, our approach relies on a "base" neural topic model and unnormalized probabilities over words estimated by a transformer as "teacher". We discuss each in turn.
Neural topic models augmented with knowledge distillation. We experiment with both SCHOLAR and W-LDA as base models. The former constitutes our primary model and point of comparison with baselines, while the latter is a proof-of-concept that attests to our method's modularity; we added knowledge distillation to W-LDA with only a few lines of code (Appendix F). We evaluate both at K = 50 and K = 200 topics.
We tune using NPMI, with reference co-occurrence counts taken from a held-out development set from the relevant corpus. For our baselines, we use the publicly released author implementations. While we generally attempt to retain the original hyperparameter settings when available, we do perform an exhaustive grid search on the SCHOLAR baselines and SCHOLAR+BAT to ensure fairness in comparison (ranges, optimal values, and other details in Appendix E.1).
Our method also introduces additional hyperparameters: the weight for KD loss, λ (Eq. (4)); the softmax temperature T ; and the proportion of the word-level teacher logits that we retain (relative to document length, see clipping in Section 2.3). For most dataset-K pairs, we find that we can improve topic quality under most settings, with a relatively small set of values for each hyperparameter leading to better results. In fact, following the extensive search on SCHOLAR+BAT, we found we could tune W-LDA within a few iterations.
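The clipping hyperparameter mentioned above (Section 2.3) can be sketched as follows; the renormalization after zeroing is our assumption, as the released code may handle it differently.

```python
import numpy as np

def clip_teacher_probs(z_bat, N_d, c=2.0, T=1.0):
    """Keep the top ceil(c * N_d) entries of the teacher distribution and
    assign zero probability to the rest, renormalizing to sum to one."""
    z = z_bat / T
    p = np.exp(z - z.max())
    p /= p.sum()
    n_keep = int(np.ceil(c * N_d))
    if n_keep < len(p):
        cutoff = np.sort(p)[-n_keep]        # smallest surviving probability
        p = np.where(p >= cutoff, p, 0.0)   # zero out the long tail
    return p / p.sum()

p = clip_teacher_probs(np.array([3.0, 2.0, 1.0, 0.0, -1.0, -2.0]), N_d=1, c=2.0)
# with these settings, exactly two entries survive
```

Tying the retained mass to the document length N_d (rather than a fixed top-n) lets short documents keep proportionally fewer teacher terms than long ones.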
Topic models rely on random sampling procedures, and to ensure that our results are robust, we report the average values across five runs (previously unreported by the authors of our baselines).
The DISTILBERT teacher. We fine-tune a modified version of DISTILBERT with the same document reconstruction objective as the NTM (L_R, Eq. (3)) on the training data. Specifically, DISTILBERT maps a WordPiece-tokenized (Wu et al., 2016) document d to an l-dimensional hidden vector with a transformer (Vaswani et al., 2017), then back to logits over V words (tokenized with the same scheme as the topic model). For long documents, we split into blocks of 512 tokens and mean-pool the transformer outputs. We use the pretrained model made available by the authors (Wolf et al., 2019). We train until perplexity converges on the same held-out dev set used in the topic modeling setting. Unsurprisingly, DISTILBERT achieves dramatically lower perplexity than all topic model baselines. Note that we need only train the model once per corpus, and can experiment with different NTM variations using the same z^BAT.

[Table 2: Results of applying BAT (Section 2.3) using SCHOLAR as our base neural architecture. We achieve better NPMI than all baselines across three datasets and K = 50, K = 200 topics. We use 5 random restarts and report the standard deviation.]
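The chunking scheme for long documents can be sketched independently of the transformer itself; here `teacher_logits_fn` is a hypothetical stand-in for a forward pass of the fine-tuned DISTILBERT reconstruction head, not an actual library call.

```python
import numpy as np

def document_teacher_logits(token_ids, teacher_logits_fn, block_size=512):
    """Split a long WordPiece-tokenized document into blocks of at most
    `block_size` tokens, run the teacher on each block, and take the
    pointwise mean of the per-block logit vectors over the vocabulary."""
    blocks = [token_ids[i:i + block_size]
              for i in range(0, len(token_ids), block_size)]
    logits = [teacher_logits_fn(b) for b in blocks]   # each has shape (V,)
    return np.mean(logits, axis=0)

# Toy stand-in teacher: logits proportional to the block's token-id histogram
V = 100
fake_teacher = lambda block: np.bincount(block, minlength=V).astype(float)
doc = np.arange(1200) % V                 # a 1200-token toy document
z_bat = document_teacher_logits(doc, fake_teacher)
```

Because the pooled logits depend only on the document, they can be computed once per corpus and cached, then reused across NTM variants and hyperparameter settings.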

Results and Discussion
Using the VAE-based SCHOLAR as the base model, topics discovered using BAT are more coherent, as measured via NPMI, than previous state-of-the-art baseline NTMs (Table 2), improving on the DVAE and W-LDA baselines, and the baseline of SCHOLAR without the KD augmentation. We establish the robustness of our approach's improvement by taking the mean across multiple runs with different random seeds, yielding consistent improvement over all baselines for all the datasets. We validate the approach using a smaller and larger number of topics, K = 50 and 200, respectively. In addition to its improved performance, BAT can apply straightforwardly to other models, because it makes very few assumptions about the base model, requiring only that it rely on a word-level reconstruction objective, which is true of the majority of neural topic models proposed to date. We illustrate this by using the Wasserstein auto-encoder (W-LDA) as a base NTM, showing in Table 3 that BAT improves on the unaugmented model. [Footnote: We note that the W-LDA baseline did not tune well on 200 topics, further complicated by the model's extensive run time. As such, we focus on augmenting that model for 50 topics, consistent with the number of topics on which Nan et al. (2019) report their results. We add preliminary results using BAT with DVAE in Appendix C.] We report the dev set results (corresponding to the test set results in Tables 2 and 3) in Appendix A; the same pattern of results is obtained for all the models.
Finally, we also compute NPMI using reference counts from an external corpus (Gigaword 5, Parker et al., 2011) for SCHOLAR and SCHOLAR+BAT (Table 4). We find the same patterns generally hold: in all but one setting (Wiki, K = 50), BAT improves topic coherence relative to SCHOLAR. These external NPMI results suggest that our model avails itself of the distilled general language knowledge from pretrained BERT, and moreover that our fine-tuning procedure does not overfit to the training data.

Impact of BAT on Individual Topics
Following standard practice, we have established that our models discover more coherent topics on average when compared to others (Tables 2 and 3). Now, we look more closely at the extent to which those improvements are meaningful at the level of individual topics. To do so we directly compare topics discovered by the baseline neural topic model (SCHOLAR) with corresponding topics obtained when that model is augmented with BAT, looking at the NPMIs of the corresponding topics as well as considering them qualitatively.
We align the topics in the base and augmented SCHOLAR models using a variation of competitive linking, which produces a greedy approximation to optimal weighted bipartite graph matching (Melamed, 2000). A fully connected weighted bipartite graph is constructed by linking all topic pairs across (but not within) the two models, with the weight for a topic pair being the similarity between their word distributions as measured by Jensen-Shannon (JS) divergence (Wong and You, 1985; Lin, 1991). We pick the pair (t_i, t_j) with the lowest JS divergence and add it to the resulting alignment, then remove t_i and t_j from consideration and iterate until no pairs are left. The resulting aligned topic pairs can then be sorted by their JS divergences to directly compare corresponding topics. Fig. 2 shows the JS divergences for aligned topic pairs, for our three corpora. Based on visual inspection, we choose the 44 most aligned topic pairs as being meaningful for comparison; beyond this point, the topics do not bear a conceptual relationship (using the same threshold for the three datasets for simplicity).
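The alignment procedure can be sketched as follows; this is a straightforward rendering of competitive linking over topic-word distributions, with function names of our choosing.

```python
import numpy as np
from itertools import product

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two word distributions (natural log)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def competitive_linking(topics_a, topics_b):
    """Greedy approximation to minimum-weight bipartite matching:
    repeatedly link the closest remaining pair of topics across models."""
    pairs = sorted(
        (js_divergence(topics_a[i], topics_b[j]), i, j)
        for i, j in product(range(len(topics_a)), range(len(topics_b))))
    used_a, used_b, alignment = set(), set(), []
    for d, i, j in pairs:
        if i not in used_a and j not in used_b:
            alignment.append((i, j, d))
            used_a.add(i)
            used_b.add(j)
    return alignment   # sorted by JS divergence, most similar pairs first

# Toy example: two models with two topics each, in swapped order
a = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
b = np.array([[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]])
print(competitive_linking(a, b))
```

Because each topic is consumed as soon as it is linked, the result is a one-to-one alignment; sorting the output by divergence recovers the "most aligned pairs first" ordering used for the comparison threshold.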
When we consider these conceptually related topic pairs, we see that the model augmented with BAT has the topic with the higher NPMI value more often across all three datasets (Fig. 3). [Footnote: Note that more similar topics have lower JS divergence, so we are seeking to minimize rather than maximize total weight. We use JS divergence because it is conveniently symmetric and finite.] This means that BAT is not just producing improvements in the aggregate (Section 4): its effect can be interpreted more specifically as identifying the same space of topics generated by an existing model and, in most cases, improving the coherence of individual topics. This highlights the modular value of our approach. Table 5 provides qualitative discussion for one example from each corpus, which we have selected for illustration from a single randomly selected run of the baseline SCHOLAR and SCHOLAR+BAT models for K = 50. We find that, consistent with prior work on automatic evaluation of topic models, differences in NPMI do appear to correspond to recognizable subjective differences in topic quality. So that readers may form their own judgments, Appendix G presents 15 aligned pairs for each corpus, selected randomly by stratifying across levels of alignment quality to create a fair sample to review.

Related Work
Integrating embeddings into topic models. A key goal in our use of knowledge distillation is to incorporate relationships between words that may not be well supported by the topic model's input documents alone. Some previous topic models have sought to address this issue by incorporating external word information, including word senses (Ferrugento et al., 2016) and pretrained word embeddings (Hu and Tsujii, 2016; Yang et al., 2017; Xun et al., 2017; Ding et al., 2018). More recently, Bianchi et al. (2020) pass contextualized document representations from a pretrained transformer to the topic model's encoder. A limitation of these approaches is that they simply import general, non-corpus-specific word-level information. In contrast, representations from a pretrained transformer can benefit from both general language knowledge and corpus-dependent information, by way of the pretraining and fine-tuning regime. By regularizing toward representations conditioned on the document, we remain coherent relative to the topic model data. An additional key advantage for our method is that it involves only a slight change to the underlying topic model, rather than the specialized designs of the above methods.
Knowledge distillation. While the focus was originally on single-label image classification, KD has also been extended to the multi-label setting (Liu et al., 2018b). In NLP, KD has usually been applied in supervised settings (Kim and Rush, 2016; Huang et al., 2018; Yang et al., 2020), but also in some unsupervised tasks (usually using an unsupervised teacher for a supervised student) (Hu et al., 2020; Sun et al., 2020). Xu et al. (2018) use word embeddings jointly learned with a topic model in a procedure they term distillation, but do not follow the method from Hinton et al. (2015) that we employ (instead opting for joint learning). Recently, pretrained models like BERT have offered an attractive choice of teacher model, used successfully for a variety of tasks such as sentiment classification and paraphrasing (Tang et al., 2019a,b). Work in distillation often cites a reduction in computational cost as a goal, although we are aware of at least one effort that is focused specifically on interpretability (Liu et al., 2018a).
Topic diversity. Coherence, commonly quantified automatically using NPMI, is the current standard for evaluating topic model quality. Recently several authors (Dieng et al., 2020; Burkhardt and Kramer, 2019; Nan et al., 2019) have proposed additional metrics focused on the diversity or uniqueness of topics (based on top words in topics). However, no one metric has yet achieved acceptance or consensus in the literature. Moreover, such measures fail to distinguish between the case where two topics share the same set of top n words, therefore coming across as essentially identical, and the case where one topic's top n words are repeated individually across multiple other topics, indicating a weaker and more diffuse similarity to those topics. We discuss issues related to topic diversity in Appendix D.2.

Conclusions and Future Work
To our knowledge, we are the first to distill a "black-box" neural network teacher to guide a probabilistic graphical model. We do this in order to combine the expressivity of probabilistic topic models with the precision of pretrained transformers. Our modular method sits atop any neural topic model (NTM) to improve topic quality, which we demonstrate using two NTMs of highly disparate architectures (VAEs and WAEs), obtaining state-of-the-art topic coherence across three datasets from different domains. Our adaptable framework does not just produce improvements in the aggregate (as is commonly reported): its effect can be interpreted more specifically as identifying the same space of topics generated by an existing model and, in most cases, improving the coherence of individual topics, thus highlighting the modular value of our approach.
In future work, we also hope to explore the effects of the pretraining corpus (Gururangan et al., 2020) and teachers (besides BERT) on the generated topics. Another intriguing direction is exploring the connection between our methods and neural network interpretability. The use of knowledge distillation to facilitate interpretability has also been previously explored, for example, in Liu et al. (2018a) to learn interpretable decision trees from neural networks. In our work, as the weight on the BERT autoencoder logits λ goes to one, the topic model begins to describe less the corpus and more the teacher. We believe mining this connection can open up further research avenues; for instance, by investigating the differences in such teacher-topics conditioned on the pre-training corpus. Finally, although we are motivated primarily by the widespread use of topic models for identifying interpretable topics (Boyd-Graber et al., 2017, Ch. 3), we plan to explore the ideas presented here further in the context of downstream applications like document classification.

A Dev Set Results
We optimized our models on the dev set, froze the optimal models, and showed the results on the test set in Tables 2 and 3. We show the corresponding dev set results for those models in Tables 6 and 7.

B Extrinsic Classification Results
The primary goal of our method is to improve the coherence of generated topics. It is natural, however, to ask about the impact of our method on downstream applications. We include here a preliminary exploration suggesting that the addition of BAT does not hurt performance in document classification.
In our setup, we seek to predict document labels y d from the MAP estimate of a document's topic distribution, θ d . Specifically, we classify the newsgroup to which a document was posted for the 20 newsgroups data (e.g., talk.politics.misc) and a binary sentiment label for the IMDb review data. We train a random forest classifier using default parameters from scikit-learn (Pedregosa et al., 2011) and report the accuracies in Table 8 (averaged across 5 runs).
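This setup can be sketched as follows, with synthetic stand-ins for the MAP document-topic estimates θ_d (labels here are toy constructions for illustration, not real newsgroup or sentiment data).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
K, n_docs = 50, 400

# Synthetic stand-ins for MAP document-topic distributions theta_d
theta = rng.dirichlet(np.full(K, 0.1), size=n_docs)
# Toy labels loosely tied to the dominant topic, purely for illustration
y = (theta.argmax(axis=1) < K // 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(theta, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

In the actual experiments, `theta` would come from the trained topic model's encoder and `y` from the dataset's document labels; the classifier itself is used with scikit-learn defaults, as in the text.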
Much like other work that is aimed at topic coherence rather than their downstream use in supervised models (Nan et al., 2019), we find that our method has little impact on predictive performance. While it is possible that improvements may be obtained by specifically tuning models for classification, or by integrating BAT into model variations that combine lexical and topic representations (e.g. Nguyen et al., 2013), we leave this to future work.

C Using BAT with DVAE
We further illustrate our method's modularity by applying BAT to our own reimplementation of DVAE (Burkhardt and Kramer, 2019). In contrast to the authors' primary implementation, which estimates the model with rejection sampling variational inference (used in Section 4), we reimplemented DVAE, approximating the Dirichlet gradient via pathwise derivatives (Jankowiak and Obermeyer, 2018), similar to Burkhardt and Kramer (2019)'s alternative model variant using implicit gradients. Our reimplementation shows baseline behavior substantially similar to the authors' implementation. In the course of our experimentation, we noted a degeneracy in this model, in which high NPMI is achieved but at the cost of redundant topics. This failure mode is well established, but as discussed in Appendix D.2, we find the measures proposed to diagnose topic diversity (including those proposed by Burkhardt and Kramer, 2019; Nan et al., 2019) to be problematic. Rather than use these metrics, therefore, we took a coarse but simple approach and filtered out any models that yielded more than one pair of identical topics (defined as two topics sharing the same set of top-10 words), averaged across five runs. This filtering eliminated many hyperparameter settings, leading us to believe that DVAE is not robust to this problem.
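The filtering criterion can be sketched as follows; this is our rendering, and the exact bookkeeping in the experiments may differ.

```python
def count_identical_topic_pairs(topics, n=10):
    """Count pairs of topics whose top-n word sets are identical.
    `topics` maps each topic to a ranked word list."""
    top_sets = [frozenset(words[:n]) for words in topics]
    pairs = 0
    for i in range(len(top_sets)):
        for j in range(i + 1, len(top_sets)):
            if top_sets[i] == top_sets[j]:
                pairs += 1
    return pairs

def passes_filter(topics, max_redundant_pairs=1, n=10):
    """Reject a model whose topic list has too many identical pairs."""
    return count_identical_topic_pairs(topics, n) <= max_redundant_pairs

# Toy model output: the first two topics share the same top-3 word set
topics = [["art", "chess", "painter"], ["chess", "art", "painter"], ["dog", "cat", "fish"]]
print(count_identical_topic_pairs(topics, n=3))   # 1
```

Because it counts exact set matches only, this check is deliberately coarse: near-duplicate topics with one differing word pass untouched, which is part of why the diversity metrics discussed in Appendix D.2 were considered in the first place.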
Ultimately, we find that applying BAT to DVAE does not hurt, and also does not help appreciably (Table 9). In addition, when applying the above filtering criterion to our main SCHOLAR and SCHOLAR+BAT models, we still obtain the positive results reported in Table 6.

D.1 Using BERT in the encoder
In SCHOLAR, the encoder takes the following form:

    q(θ_d | w_d) = LN( µ(W w^BOW_d), diag(σ²(W w^BOW_d)) ),

where the weight matrix W, along with the parameters of the neural networks µ(·) and σ(·), are our variational parameters. Card et al. (2018) propose that pretrained word2vec (Mikolov et al., 2013) embeddings can replace W, meaning that the document representation made available to the encoder is an l-dimensional sum of word embeddings. Card et al. (2018) argue that fixed embeddings act as an inductive prior which improves topic coherence. Likewise, we might want to encode the document representation from a BERT-like model and, in fact, this has been attempted with some success (Bianchi et al., 2020). The hypothesis is that a structure-dependent representation of the document can better parameterize its corresponding topic distribution.

[Footnote: For K = 50. The single-pair threshold proves too restrictive for the K = 200 case, where no hyperparameter settings pass the threshold. Increasing the tolerance to a maximum of 5 redundant pairs with K = 200 leads to a somewhat lower average NPMI overall, but the same directional improvement, i.e. SCHOLAR+BAT yields a significantly higher NPMI than SCHOLAR.]

[Table 10: Effect on topic coherence of passing various document representations to the SCHOLAR encoder (using the IMDb data). Each setting describes the document representation provided to the encoder, which is transformed by one feed-forward layer of 300 dimensions followed by a second down to K dimensions. "+ w2v" indicates that we first concatenated with the sum of the 300-dimensional word2vec embeddings for the document. Note that these early findings are based on a different IMDb development set, a 20% split from the training data. They are thus not directly comparable to the results reported elsewhere in the text, which used a separate held-out development set.]
We experimented with this method as well, using both the hidden BERT representation and the predicted probabilities, and we also include a fixed randomized baseline to maintain parameter parity. Results for IMDb are reported in Table 10; we find at best a mild improvement over the baselines. 18 We suspect this tepid result arises because (a) in training, the effect of the estimated local document-topic proportions on the global topic-word distributions is diffuse and indirect; and (b) compressing the representation into K dimensions loses too much of the high-level linguistic information. Nonetheless, owing to the slight benefit, we do pass the logits to the encoder in our SCHOLAR-based model. We avoid this change for the model based on W-LDA to underscore the modularity of our method.
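The encoder head described above (one 300-dimensional feed-forward layer, then a projection down to K dimensions parameterizing μ(·) and σ(·)) can be sketched at the shape level. This NumPy sketch is illustrative only; all names and sizes (e.g., the 5,000-word vocabulary for the teacher logits) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

class ScholarEncoderSketch:
    """Shape-level sketch of a SCHOLAR-style encoder head.

    The document representation `r` -- BoW counts, a sum of word2vec
    embeddings, a BERT hidden state or teacher logits, or any
    concatenation of these -- passes through a 300-d feed-forward layer
    and is projected down to K dimensions, parameterizing the
    variational posterior via mu(.) and sigma(.).
    """
    def __init__(self, rep_dim, num_topics, hidden=300):
        self.W1 = rng.normal(scale=0.02, size=(rep_dim, hidden))
        self.W_mu = rng.normal(scale=0.02, size=(hidden, num_topics))
        self.W_sig = rng.normal(scale=0.02, size=(hidden, num_topics))

    def __call__(self, r):
        h = softplus(r @ self.W1)
        return h @ self.W_mu, h @ self.W_sig  # mu and log-variance

# e.g., summed 300-d word2vec embeddings concatenated with teacher
# logits over a hypothetical 5,000-word vocabulary, for K = 50 topics
enc = ScholarEncoderSketch(rep_dim=300 + 5000, num_topics=50)
r = np.concatenate([rng.normal(size=(8, 300)), rng.normal(size=(8, 5000))], axis=1)
mu, log_var = enc(r)
```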

D.2 Topic Diversity
Burkhardt and Kramer (2019) have identified a degeneracy in some topic models, wherein a single topic is effectively repeated with slightly varying terms (e.g., several Dadaism topics). Burkhardt and Kramer (2019) and others (Nan et al., 2019; Dieng et al., 2020) have independently proposed related metrics to quantify the problem, but the literature has not converged on a solution. In contrast to NPMI, we are not aware of any work that assesses the validity of such metrics with respect to human judgements.
Moreover, all these proposals suffer from a common problem: because they are global measures of word overlap, they fail to account for how words are repeated across topics. For instance, Topic Uniqueness (Nan et al., 2019) is identical regardless of whether all of a topic's top words are repeated in a single second topic or its individual top words are spread across several other topics. In addition, these measures inappropriately penalize partially related topics.
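To make this failure mode concrete, here is a minimal sketch of Topic Uniqueness (following Nan et al., 2019: average, within and then across topics, the reciprocal of the number of topics each top word appears in), together with the invariance described above:

```python
from collections import Counter

def topic_uniqueness(topics):
    """Topic Uniqueness (Nan et al., 2019): for each topic, average the
    reciprocal of how many topics each of its top words appears in,
    then average over topics. 1.0 means no top word is repeated."""
    counts = Counter(w for topic in topics for w in topic)
    per_topic = [
        sum(1.0 / counts[w] for w in topic) / len(topic) for topic in topics
    ]
    return sum(per_topic) / len(per_topic)

# Both configurations repeat topic 0's top words once more overall, but
# in the first they land in a single second topic, while in the second
# they are spread across two topics -- TU cannot tell them apart.
concentrated = [["a", "b"], ["a", "b"], ["c", "d"]]
dispersed = [["a", "b"], ["a", "c"], ["b", "d"]]
assert topic_uniqueness(concentrated) == topic_uniqueness(dispersed)
```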
They also penalize polysemy, and, more generally, the contextual flexibility of word meanings. One of the key advantages of latent topics, compared to surface lexical summaries, is that the same word can contribute differently to an understanding of what different topics are about. As a real example from our experience in modeling a set of documents related to paid family and medical leave: words like parent, mother, and father are prominent in one topic related to parental leave when a child is born (accompanying terms like newborn and maternity leave), and also in another topic related to taking leave to care for family members, including elderly parents (accompanying terms like elderly and aging). The fact that topic models permit a word like parent to be prominent in both of these clearly distinct topics, emphasizing two different aspects of the word relative to the collection as a whole (being a parent taking care of children, being a child taking care of parents), is a feature, not a bug. We consider the question of topic diversity an important direction for future work.

E Experimental Procedures
In this section, we first provide details of our hyperparameters and tuning procedures, then turn to our computing infrastructure and the rough runtime of the SCHOLAR model.

E.1 Hyperparameter Tuning and Optimal Values
We used well-tuned baselines to establish performance thresholds on NPMI (following the hyperparameters reported in Card et al., 2018; Burkhardt and Kramer, 2019; Nan et al., 2019). While developing our model, we performed a coarse-grained initial hyperparameter sweep to identify ranges that did not beat the threshold, and excluded those ranges from the subsequent full grid search. We report the hyperparameter ranges used in this search, along with their optimal values (as determined by development-set NPMI), in Tables 11 to 15. These produced the final set of results (Tables 2, 3, 6 and 7).
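The coarse-to-fine procedure above can be sketched as follows. Here train_and_eval is a hypothetical stand-in for training one configuration and returning its development-set NPMI; the pruning of below-threshold ranges is assumed to have already produced the grid:

```python
from itertools import product

def grid_search(train_and_eval, grid, npmi_threshold):
    """Sketch of a full grid search over pre-pruned hyperparameter
    ranges. Settings whose development-set NPMI falls below the
    well-tuned-baseline threshold are discarded; the best surviving
    setting (as a sorted tuple of (name, value) pairs) is returned,
    or None if nothing clears the threshold."""
    results = {}
    for setting in product(*grid.values()):
        params = dict(zip(grid.keys(), setting))
        results[tuple(sorted(params.items()))] = train_and_eval(**params)
    passing = {k: v for k, v in results.items() if v >= npmi_threshold}
    return max(passing, key=passing.get) if passing else None
```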
For the DISTILBERT training, we use the default hyperparameter settings for the bert-base-uncased model. Our code is a modified version of the MM-IMDB multimodal sequence classification code from the same codebase as DISTILBERT (https:), and we use all default hyperparameter settings specified there. We train for 7500 steps for 20NG, and 17000 steps for Wiki and IMDb (this corresponds to convergence on development-set perplexity).

E.2 Computing Infrastructure and Runtime
For the full hyperparameter sweep, we used an Amazon Web Services ParallelCluster (https://github.com/aws/aws-parallelcluster) with 40 g4dn.xlarge nodes (Nvidia T4 GPUs with 16 GB RAM), which ran for about 5 days. For initial experimentation, we used a SLURM cluster with a mix of consumer-grade Nvidia GPUs (e.g., 1080, 2080).
In terms of runtime, SCHOLAR and our own SCHOLAR+BAT are equivalent, as is any of our baseline models augmented with BAT. The only overhead comes from first training the DISTILBERT encoder on the full dataset and then running inference to obtain its logits. Once the teacher logits have been computed and saved for a dataset, our method adds nothing to the runtime. We show the comparison between the full runtimes, including this initial step, in Fig. 4.
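The one-time teacher step implied above can be sketched as a simple cache: train the teacher and run inference once per dataset, save the logits, and reuse them for every subsequent student run. Here compute_logits is a hypothetical wrapper around DISTILBERT training and inference, and the file layout is illustrative:

```python
import os
import numpy as np

def get_teacher_logits(dataset_name, compute_logits, cache_dir="teacher_cache"):
    """Return cached teacher logits for a dataset, computing them once.

    `compute_logits` (hypothetical) trains the teacher and returns an
    array of per-document logits over the vocabulary. Only the first
    call per dataset pays the teacher overhead; later runs load from
    disk, so the student's runtime is unaffected.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{dataset_name}_logits.npy")
    if os.path.exists(path):
        return np.load(path)   # no teacher overhead on subsequent runs
    logits = compute_logits()  # the only extra cost, paid once
    np.save(path, logits)
    return logits
```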

F Changes to W-LDA
In Fig. 5 […]

Table 14: Hyperparameter ranges and optimal values (as determined by development-set NPMI) for W-LDA and W-LDA+BAT, on all three datasets. lr is the learning rate, α is the hyperparameter for the Dirichlet prior, λ is the weight on the teacher model logits from Eq. (4), and T is the softmax temperature from Eq. […]

Figure 4: Runtime comparison for SCHOLAR and our own SCHOLAR+BAT. Note that the overhead due to BAT comes only from the training and inference time required to obtain the DISTILBERT encoder logits on the full dataset; once the teacher logits are available, the runtime of both models is the same. We depict the full approximate time (in hours), including this initial overhead, in the case of BAT.

G Impact of BAT on Individual Topics: Aligned Topic Pair Examples
For each corpus (20NG, Wiki, and IMDb), a single comparison of base and BAT-augmented (SCHOLAR vs. SCHOLAR+BAT) 50-topic models was selected randomly from the five runs used in computing average performance in Fig. 3. For each of those pairs of models, we then randomly selected 15 aligned topic pairs from that set of 50 to include in the tables below. Specifically, the full set of 50 topic pairs was partitioned by JS divergence into the 10 most similar pairs, the next 10 most similar, and so forth, yielding five "brackets" of topic-alignment quality. Three topic pairs were then selected at random from each bracket, hence 15 pairs in all, in order to give a fair picture of what pairs look like at various qualities of topic alignment.
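The bracketing procedure can be sketched as follows, assuming each aligned pair carries a precomputed JS divergence between the two topics' word distributions (the key name jsd is hypothetical):

```python
import random
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic-word distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sample_by_bracket(aligned_pairs, bracket_size=10, per_bracket=3, seed=0):
    """Partition aligned topic pairs (already matched between the base
    and BAT-augmented models) into brackets of increasing JS divergence
    and sample uniformly from each, as described above."""
    ranked = sorted(aligned_pairs, key=lambda pair: pair["jsd"])
    rng = random.Random(seed)
    sample = []
    for start in range(0, len(ranked), bracket_size):
        bracket = ranked[start:start + bracket_size]
        sample.extend(rng.sample(bracket, min(per_bracket, len(bracket))))
    return sample
```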
In the tables below (Tables 16 to 18), we present pairs sorted from best to worst alignment quality. Recall that for NPMI, higher is better, while for JS divergence, a lower score indicates a higher-quality match (or alignment) for the topic pair.