Auto-Encoding Variational Bayes for Inferring Topics and Visualization

Visualization and topic modeling are widely used approaches for text analysis. Traditional visualization methods find low-dimensional representations of documents in the visualization space (typically 2D or 3D) that can be displayed using a scatterplot. In contrast, topic modeling aims to discover topics from text, but for visualization, one needs to perform a post-hoc embedding using dimensionality reduction methods. Recent approaches propose using a generative model to jointly find topics and visualization, allowing the semantics to be infused in the visualization space for a meaningful interpretation. A major challenge that prevents these methods from being used practically is the scalability of their inference algorithms. We present, to the best of our knowledge, the first fast Auto-Encoding Variational Bayes based inference method for jointly inferring topics and visualization. Since our method is black box, it can handle model changes efficiently with little mathematical rederivation effort. We demonstrate the efficiency and effectiveness of our method on real-world large datasets and compare it with existing baselines.


Introduction
Visualization and topic modeling are important tools in the analysis of text corpora. Visualization methods, such as t-SNE (Maaten and Hinton, 2008), find low-dimensional representations of documents in the visualization space (typically 2D or 3D) that can be displayed using a scatterplot. Such visualization is useful for exploratory tasks. However, these methods lack semantic interpretation because they do not extract topics. In contrast, topic modeling aims to discover semantic topics from text, but for visualization, one needs to perform a post-hoc embedding using dimensionality reduction methods. Since this pipeline approach may not be ideal, there has been recent interest in jointly inferring topics and visualization using a single generative model (Iwata et al., 2008). This joint approach allows the semantics to be infused in the visualization space where users can view documents and their topics. The problem of jointly inferring topics and visualization can be formally stated as follows.
Problem. Let $D = \{w_n\}_{n=1}^{N}$ denote a finite set of N documents and let V be a finite vocabulary from these documents. Given a number of topics Z and a visualization dimension d, we want to find:
• For topic modeling: Z latent topics and their word distributions, collectively denoted as $\beta = \{\beta_z\}_{z=1}^{Z}$, and the topic distributions of documents, collectively denoted as $\Theta = \{\theta_n\}_{n=1}^{N}$, and
• For visualization: d-dimensional visualization coordinates for the N documents, $X = \{x_n\}_{n=1}^{N}$, and for the Z topics, $\Phi = \{\phi_z\}_{z=1}^{Z}$, such that the distances between documents and topics in the visualization space reflect the topic-document distributions Θ.
To solve this problem, PLSV (Probabilistic Latent Semantic Visualization) is the first model that attempts to tie together all latent variables of topics and visualization (i.e., $\Upsilon = \{X, \Phi, \beta\}$) in a generative model.
Its tight integration between visualization and the underlying topic model can support applications such as user-driven topic modeling where users can interactively provide feedback to the model (Choo et al., 2013). PLSV can also be used as a basic building block when developing new models for other analysis tasks, such as visual comparison of document collections (Le and Akoglu, 2019).
Relatively little attention has been paid to methods for fast inference of topics and visualization. Existing models often use Maximum a Posteriori (MAP) estimation with the EM algorithm, which is difficult to scale to large datasets. As shown in Figure 12, running a PLSV model with 50 topics via MAP estimation on a dataset of modest size (e.g., 20 NEWSGROUPS) takes more than 18 hours on a single core. This long running time limits the usability of these visualization methods in practice.
In this paper, we propose a fast Auto-Encoding Variational Bayes (AEVB) based inference method for inferring topics and visualization. AEVB (Kingma and Welling, 2014a) is a black-box variational method that is efficient for inference and learning in latent Gaussian models with large datasets. However, to apply the AEVB approach to topic models like LDA, one needs to deal with problems caused by the Dirichlet prior and by posterior collapse (He et al., 2019). One of the successful AEVB based methods proposed to tackle those challenges for topic models is AVITM (Srivastava and Sutton, 2017).
It is not straightforward to apply AEVB or AVITM to our problem because of two main challenges. First, as reviewed in Section 2, PLSV models a document's topic distribution using a softmax function over its Euclidean distances to topics. It is not clear how to express this nonlinear functional relationship between three categories of latent variables (i.e., topic distribution $\theta_n$, document coordinate $x_n$, and topic coordinate $\phi_z$) when applying AVITM to visualization. Second, AEVB assumes that latent encodings are independent and identically distributed (i.i.d.) across samples (Casale et al., 2018; Lin et al., 2019). In our case, this assumption works well for the latent document coordinates X, where each document n is associated with its own latent encoding $x_n$ in the visualization space. However, for topic coordinates Φ and word probabilities β, that assumption is too strong. The reason is that the latent encodings of a topic z with respect to different documents are not independent; in fact, in our extreme case they are identical: for any documents i, j and any topic z, the encoding is the same $\phi_z$. In other words, $\phi_z$ is shared across documents. The same argument also applies to word probabilities β.
To address the first challenge, we propose to model the nonlinear functional relationship between $\theta_n$, $x_n$, and Φ using a normalized Radial Basis Function (RBF) Neural Network (Bishop, 1995). In this model, $\phi_z \in \Phi$ is the center vector for neuron z, i.e., Φ are treated as parameters of the RBF network and will be estimated. Similarly, we model β as parameters of a linear neural network that is connected to the RBF network to form the decoder in the AEVB approach. Treating Φ and β as parameters of the decoder solves the second challenge, although our algorithm then learns point estimates of Φ and β rather than their posterior distributions. In Section 3, we present our proposed method in detail. We focus on the PLSV model in this work, though the proposed AEVB inference method could be easily adapted to other visualization models.
We summarize our contributions as follows:
• We propose, to the best of our knowledge, the first AEVB inference method for the problem of jointly inferring topics and visualization.
• In our approach, we design a decoder that includes an RBF network connected to a linear neural network. These networks are parameterized by topic coordinates and word probabilities, ensuring that they are shared across all documents.
• We conduct extensive experiments on real-world large datasets, showing the efficiency and effectiveness of our method. While running much faster than PLSV, it gains better visualization quality and comparable topic coherence.
• Since our method is black box, it can handle model changes efficiently with little mathematical rederivation effort. We implement different PLSV models that use different RBFs by just changing a few lines of code. We experimentally show that PLSV with Gaussian or Inverse quadratic RBFs consistently produces good performance across datasets.
Background and Related Work

Topic Modeling and Visualization
Topic models (Blei et al., 2003; Hofmann, 1999) are widely used for unsupervised representation learning of text and have found applications in different text mining tasks (Ramage et al., 2009; Blei et al., 2007; Tkachenko and Lauw, 2019; Kim et al., 2019). Popular topic models, such as LDA (Blei et al., 2003), find a low-dimensional representation of each document in topic space. Each dimension of the topic space has a meaning attached to it and is modeled as a probability distribution over words. In contrast, t-SNE (Maaten and Hinton, 2008) and LargeVis (Tang et al., 2016) are visualization methods that aim to find a low-dimensional representation (typically 2D or 3D) for each document. However, that low-dimensional space often lacks the semantic interpretation that topic models provide. Therefore, there have been works attempting to infuse semantics into the visualization space by jointly modeling topics and visualization (Iwata et al., 2008; Le and Lauw, 2014a). These methods often suffer from scalability issues on large datasets. In this work, we aim to scale up these methods by proposing a fast AEVB based inference method. We focus on PLSV (Iwata et al., 2008) for applying our proposed method. PLSV has been used as a basic block for building new models for visual text mining tasks (Le and Lauw, 2014b; Le and Akoglu, 2019). Our proposed method could be easily adapted to these models. PLSV assumes the following process to generate documents and visualization:
1. For each topic z = 1, ..., Z:
(a) Draw a word distribution $\beta_z$ from its Dirichlet prior.
(b) Draw a topic coordinate $\phi_z$ from its Gaussian prior.
2. For each document n = 1, ..., N:
(a) Draw a document coordinate $x_n$ from its Gaussian prior.
(b) For each word m in document n:
i. Draw a topic: $z \sim \mathrm{Multi}(\{p(z \mid x_n, \Phi)\}_{z=1}^{Z})$
ii. Draw a word: $w_{nm} \sim \mathrm{Multi}(\beta_z)$
Here $\beta_z$ has a Dirichlet prior, and topic and document coordinates have zero-mean Gaussian priors. The topic distribution of a document is defined using a softmax function over its distances to topics:
$$\theta_{nz} = p(z \mid x_n, \Phi) = \frac{\exp\left(-\frac{1}{2}\|x_n - \phi_z\|^2\right)}{\sum_{z'=1}^{Z} \exp\left(-\frac{1}{2}\|x_n - \phi_{z'}\|^2\right)} \quad (1)$$
As we can see from Eq. 1, the zth topic proportion of document n is high when document coordinate $x_n$ is close to topic coordinate $\phi_z$.
This relationship ensures that the distances between documents and topics in the visualization space reflect the topic-document distributions Θ. In the PLSV paper, the parameters $\Upsilon = \{X, \Phi, \beta\}$ are estimated using MAP estimation with the EM algorithm. As shown in our experiments, the algorithm does not scale to large datasets.
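As a concrete illustration of how PLSV ties topic proportions to distances, the following is a minimal NumPy sketch of a softmax over negative half squared Euclidean distances. The function name `topic_proportions` and the toy coordinates are illustrative assumptions, not part of any PLSV implementation.

```python
import numpy as np

def topic_proportions(x, phi):
    """Softmax over negative half squared Euclidean distances.

    x:   (N, d) document coordinates
    phi: (Z, d) topic coordinates
    Returns an (N, Z) matrix whose rows sum to 1.
    """
    # Squared distance between every document and every topic: shape (N, Z).
    sq_dist = ((x[:, None, :] - phi[None, :, :]) ** 2).sum(axis=-1)
    logits = -0.5 * sq_dist
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy example (hypothetical values): two topics and three documents in 2D.
phi = np.array([[0.0, 0.0], [4.0, 0.0]])
x = np.array([[0.1, 0.0], [3.9, 0.0], [2.0, 0.0]])
theta = topic_proportions(x, phi)
```

In this toy setting, the first document sits next to the first topic and receives almost all of its mass, while the third document is equidistant from both topics and splits its mass evenly.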

Auto-Encoding Variational Bayes for Topic Models
AEVB (Kingma and Welling, 2014b), its variant WiSE-ALE (Lin et al., 2019), and AVITM (Srivastava and Sutton, 2017) are black-box variational inference methods whose purpose is to allow practitioners to quickly explore and adjust the model's assumptions with little rederivation effort (Ranganath et al., 2014). AVITM is an auto-encoding variational inference method for topic models. It approximates the true posterior p(θ, z|w, α, β) using a variational distribution q(θ, z|w, η, ρ), where α is the hyperparameter of the Dirichlet prior and η, ρ are the free variational parameters over θ, z respectively. Unlike Mean-Field Variational Inference, AVITM computes the variational parameters using an inference neural network, and they are chosen by optimizing the following ELBO (i.e., the lower bound to the marginal log likelihood):
$$\mathcal{L}(\eta, \rho \mid \alpha, \beta) = -D_{KL}\left[q(\theta, z \mid w, \eta, \rho) \,\|\, p(\theta, z \mid \alpha)\right] + \mathbb{E}_{q(\theta, z \mid w, \eta, \rho)}\left[\log p(w \mid z, \theta, \alpha, \beta)\right] \quad (2)$$
By collapsing z and approximating the Dirichlet prior p(θ|α) with a logistic normal distribution, the second term (i.e., the expectation with respect to q) in the ELBO can be approximated using the reparameterization trick as in AEVB. The second term is also referred to as the expected negative reconstruction error in variational auto-encoders (VAE). While AVITM has been successfully applied to LDA, it is not straightforward to apply it to our problem, as discussed in the introduction.

Proposed Auto-Encoding Variational Bayes for Inferring Topics and Visualization
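We represent a document n as a row vector of word counts $w_n \in \mathbb{Z}_{\geq 0}^{|V|}$, where $w_n^v$ is the number of occurrences of word v ∈ V in the document. The marginal likelihood of a document is given by:
$$p(w_n \mid \gamma, \Phi, \beta) = \int \Big( \prod_{m=1}^{M_n} \sum_{z=1}^{Z} p(w_{nm} \mid \beta_z)\, p(z \mid x, \Phi) \Big)\, p(x \mid \gamma)\, dx = \int \Big( \prod_{m=1}^{M_n} p(w_{nm} \mid x, \Phi, \beta) \Big)\, p(x \mid \gamma)\, dx \quad (3)$$
where $M_n$ denotes the number of words in document n. The marginal likelihood of the corpus is $p(D) = \prod_{n=1}^{N} p(w_n \mid \gamma, \Phi, \beta)$. Note that here we treat Φ and β as fixed quantities that are to be estimated. Therefore we are working with a non-smoothed PLSV where Φ and β are not endowed with a posterior distribution. By treating Φ and β as model parameters, we ensure that they are shared across all documents in the AEVB approach. We will consider a fuller Bayesian approach to PLSV in our future work.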
As in AVITM, we collapse the discrete latent variable z to avoid the difficulty of determining a reparameterization function for it. The rightmost integral in Eq. 3 is the marginal likelihood after z is collapsed. We now only consider the true posterior distribution over the latent variable x: $p(x \mid w_n, \gamma, \Phi, \beta)$. Because the integral in Eq. 3 is intractable, the posterior is intractable as well. We approximate it by a variational distribution $q(x \mid w_n, \eta)$ parameterized by η. The variational parameter η is estimated using an inference network as in AEVB. We have the following lower bound to the marginal log likelihood (ELBO) of a document:
$$\mathcal{L}(\eta \mid \gamma, \Phi, \beta) = -D_{KL}\left[q(x \mid w_n, \eta) \,\|\, p(x \mid \gamma)\right] + \mathbb{E}_{q(x \mid w_n, \eta)}\left[\log p(w_n \mid x, \Phi, \beta)\right] \quad (4)$$
Since the prior $p(x \mid \gamma) = \mathrm{Normal}(0, \gamma I)$ is a Gaussian, we can let the variational posterior $q(x \mid w_n, \eta)$ be a Gaussian with a diagonal covariance matrix: $q(x \mid w_n, \eta) = \mathrm{Normal}(\mu_n, \Sigma_n)$. The KL divergence between the two Gaussians in Eq. 4 can be computed in closed form as follows (Kalai et al., 2010):
$$D_{KL}\left[q(x \mid w_n, \eta) \,\|\, p(x \mid \gamma)\right] = \frac{1}{2}\left(\frac{\mathrm{tr}(\Sigma_n)}{\gamma} + \frac{\mu_n^{\top}\mu_n}{\gamma} - d + d \log \gamma - \log|\Sigma_n|\right) \quad (5)$$
where $\mu_n$ and the diagonal of $\Sigma_n$, both in $\mathbb{R}^d$, are outputs of the encoding feed-forward neural network with variational parameters η. The expectation with respect to $q(x \mid w_n, \eta)$ in Eq. 4 can be estimated using the reparameterization trick (Kingma and Welling, 2014a). More specifically, we sample $x^{(l)}$ from the posterior $q(x \mid w_n, \eta)$ by reparameterizing the random variable x, i.e., $x^{(l)} = \mu_n + \Sigma_n^{1/2} \epsilon^{(l)}$ where $\epsilon^{(l)} \sim \mathrm{Normal}(0, I)$. The expectation can then be approximated as:
$$\mathbb{E}_{q(x \mid w_n, \eta)}\left[\log p(w_n \mid x, \Phi, \beta)\right] \approx \frac{1}{L}\sum_{l=1}^{L} \log p(w_n \mid x^{(l)}, \Phi, \beta) \quad (6)$$
In Eq. 6, the decoding term $\log p(w_n \mid x^{(l)}, \Phi, \beta)$ is computed as:
$$\log p(w_n \mid x^{(l)}, \Phi, \beta) = w_n \log\left(\theta_n^{(l)} \beta\right)^{\top} \quad (7)$$
where $\beta \in \mathbb{R}^{Z \times |V|}$ is the topic-word probability matrix, $w_n \in \mathbb{R}^{|V|}$ is a row vector of word counts, $\theta_n^{(l)} \in \mathbb{R}^{Z}$ is a row vector of topic proportions, and $\theta_{nz}^{(l)} = p(z \mid x^{(l)}, \Phi)$ is computed as in Eq. 1. Based on Eq. 7 and Eq. 1, we propose using a decoder with two connected neural networks.
Normalized Radial Basis Function Network for computing $\theta_{nz}$. We generalize $\theta_{nz}$ in Eq. 1 using a Normalized Radial Basis Function (RBF) Network (Bishop, 1995) as follows:
$$\theta_{nz} = p(z \mid x_n, \Phi) = \frac{\sum_{z'=1}^{Z} w_{z',z}\, \rho(\|x_n - \phi_{z'}\|)}{\sum_{z'=1}^{Z} \rho(\|x_n - \phi_{z'}\|)} \quad (8)$$
In this network, we have Z neurons in the hidden layer and $\phi_{z'}$ is the center vector for neuron z'. The RBF function ρ is a non-linear function that depends on the distance $\|x - \phi_{z'}\|$, and $w_{z',z}$ is the influence weight of neuron z' on $\theta_{nz}$, where $\sum_{z'=1}^{Z} w_{z',z} = 1$. While $w_{z',z}$ can be estimated by optimizing the ELBO, we choose to fix it as $w_{z',z} = 1$ when z' = z and 0 otherwise. The parameters of this network are then the center vectors of the Z neurons, which are the coordinates of topics in the visualization space. The RBF function ρ can have different forms, e.g., Gaussian: $\exp(-\frac{1}{2} r^2)$, Inverse quadratic: $\frac{1}{1 + r^2}$, or Inverse multiquadric: $\frac{1}{\sqrt{1 + r^2}}$, where $r = \|x - \phi_{z'}\|$. When ρ is Gaussian, Eq. 8 reduces to Eq. 1. Note that this generalization of $\theta_{nz}$ is also discussed in (Le and Lauw, 2016) but not in the context of VAE inference. Since topic coordinates $\phi_z$ are now the parameters of the RBF network, they can be shared and used by all documents for computing the topic distributions $\theta_n^{(l)}$. In the experiments, we will show the performance of PLSV with these RBFs using VAE inference.
Linear Neural Network for computing $\theta_n^{(l)} \beta$. The output of the above normalized RBF network is the input of a linear neural network that computes $\theta_n^{(l)} \beta$ in the decoding term. We treat β as the parameters of this network: β is computed by applying a softmax to the linear weights W, which ensures the simplex constraint on β: β = σ(W). The architecture of the whole Variational Auto-Encoder is given in Figure 1. We use batch normalization (Ioffe and Szegedy, 2015) to mitigate the posterior collapse issue found in the AEVB approach (He et al., 2019; Razavi et al., 2019).
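To make the pieces of this section concrete, the following NumPy sketch estimates the per-document ELBO: a closed-form KL term against the Normal(0, γI) prior, a reparameterized sample of x, a Gaussian normalized RBF layer for the topic proportions, and a softmax over weights W for β. It is a sketch under stated assumptions rather than the authors' implementation: the encoder network, batch normalization, and dropout are omitted, so `mu` and `sigma2` are passed in directly, and all function names and toy values are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def kl_to_prior(mu, sigma2, gamma):
    """KL( Normal(mu, diag(sigma2)) || Normal(0, gamma * I) ) in closed form."""
    d = mu.shape[-1]
    return 0.5 * (sigma2.sum() / gamma + mu @ mu / gamma
                  - d + d * np.log(gamma) - np.log(sigma2).sum())

def elbo_estimate(w, mu, sigma2, phi, W, gamma=1.0, L=1, rng=None):
    """Monte Carlo ELBO estimate for one document.

    w:      (V,) word counts
    mu:     (d,) variational mean (would come from the encoder)
    sigma2: (d,) diagonal of the variational covariance
    phi:    (Z, d) topic coordinates (decoder parameters)
    W:      (Z, V) unnormalized topic-word weights (decoder parameters)
    """
    rng = np.random.default_rng() if rng is None else rng
    beta = softmax(W, axis=1)                      # rows lie on the simplex
    recon = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)
        x = mu + np.sqrt(sigma2) * eps             # reparameterized sample
        sq_dist = ((x - phi) ** 2).sum(axis=1)     # (Z,) squared distances
        theta = softmax(-0.5 * sq_dist)            # Gaussian normalized RBF
        recon += w @ np.log(theta @ beta)          # decoding term
    return recon / L - kl_to_prior(mu, sigma2, gamma)

# Toy usage with hypothetical sizes: 3 topics, 5 vocabulary words, 2D space.
rng = np.random.default_rng(0)
Z, V, d = 3, 5, 2
phi = rng.standard_normal((Z, d))
W = rng.standard_normal((Z, V))
w = np.array([2.0, 0.0, 1.0, 0.0, 3.0])
value = elbo_estimate(w, np.zeros(d), np.ones(d), phi, W, rng=rng)
```

In a full model, `mu` and `sigma2` would be produced by the inference network from `w`, and the negative of this quantity would be minimized with a stochastic optimizer over the encoder weights, `phi`, and `W`.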

Experiments
We evaluate the effectiveness and efficiency of our proposed AEVB based inference method for visualization and topic modeling both quantitatively and qualitatively. We use four real-world public datasets from different domains including newswire articles, newsgroups posts and academic papers.

Dataset Description
• REUTERS: contains 7674 newswire articles from 8 categories (Cardoso-Cachopo, 2007).
We perform preprocessing by removing stopwords and stemming. The vocabulary sizes are 3000, 3248, 4000, and 5000 for REUTERS, 20 NEWSGROUPS, WEB OF SCIENCE, and ARXIV respectively. Note that our problem is unsupervised and the ground-truth class labels are mainly used for evaluation. Before detailing the experiment results, we describe the comparative methods.
Comparative Methods. We compare the following methods for inferring topics and visualization:
Joint approach:
• PLSV-MAP: the original PLSV using MAP estimation with the EM algorithm (Iwata et al., 2008).
• PLSV-VAE (Gaussian) [this paper]: we apply our proposed variational auto-encoder (VAE) inference to PLSV where the Gaussian RBF is used. We write PLSV-VAE to refer to PLSV-VAE (Gaussian).
• PLSV-VAE (Inverse quadratic) and PLSV-VAE (Inverse multiquadric) [this paper]: these are PLSV-VAE models with Inverse quadratic and Inverse multiquadric RBFs. Since our method is black box, we can quickly implement these two models by changing just a few lines of code of the PLSV-VAE (Gaussian) implementation.
Pipeline approach: this is the approach of topic modeling followed by embedding of documents' topic proportions for visualization. We compare the above joint models with two pipeline models:
• LDA-VAE + t-SNE: topic modeling by LDA with VAE inference (Srivastava and Sutton, 2017), followed by t-SNE (Maaten and Hinton, 2008) to visualize the documents' topic proportions.
• ProdLDA-VAE + t-SNE: similar to the above, but we use ProdLDA-VAE instead of LDA-VAE.
In the next sections, we report the experiment results averaged across 10 independent runs. For PLSV models, we choose λ = 0.01, γ = 1, ϕ = N Z, which work well for large datasets in our experiments. We run PLSV-MAP with the number of EM iterations set to 200 and the maximum number of iterations for the quasi-Newton algorithm set to 10. Following AVITM, we set H1 = H2 = 100, the batch size to 256, the number of samples L per document to 1, the learning rate to 0.002, and use dropout with probability p = 0.6. We use Adam as our optimization algorithm. VAE based models are trained for 1000 epochs. All the experiments are conducted on a system with 64GB memory and an Intel(R) Xeon(R) CPU E5-2623 v3 with 16 cores at 3.00GHz. The GPU in use on this system is an NVIDIA Quadro P2000 with 1024 CUDA cores and 5 GB GDDR5.

Classification in the Visualization Space
We quantitatively evaluate the visualization quality by measuring the k-NN accuracy in the visualization space. This evaluation approach is also adopted in t-SNE, LargeVis, and the original PLSV. A k-NN classifier is used to classify documents using their visualization coordinates. A good visualization should group documents with the same label together and hence yield a high classification accuracy in the visualization space. Figures 2 and 3 show k-NN accuracy of all methods on each dataset, for varying number of nearest neighbors k and number of topics Z. For some settings, we do not show PLSV-MAP's performance as it does not return any results even after 24 hours of running. We can see that PLSV-VAE consistently achieves the best result, except for 25 topics on REUTERS (Figure 3a) where it produces a comparable result with PLSV-MAP. These results show that the joint approach outperforms the pipeline approach and VAE inference may help improve the visualization quality of PLSV. To verify this qualitatively, in Section 4.3, we show some visualization examples of all methods across datasets. Note that in this section, we show the accuracy of PLSV-VAE with Gaussian RBF. In Section 4.4, we present the performance of PLSV-VAE with different RBFs.
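The k-NN evaluation can be sketched as follows. The text does not state the exact protocol (for example, how data are split), so this sketch assumes a leave-one-out scheme with majority voting; the function name and toy data are hypothetical, and the dense distance matrix is only suitable for small inputs.

```python
import numpy as np

def knn_accuracy(coords, labels, k):
    """Leave-one-out k-NN accuracy on visualization coordinates (assumed protocol)."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    # Dense pairwise squared distances: O(N^2) memory, fine for a toy example.
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dist, np.inf)          # a point may not vote for itself
    neighbors = np.argsort(sq_dist, axis=1)[:, :k]
    correct = 0
    for i, idx in enumerate(neighbors):
        values, counts = np.unique(labels[idx], return_counts=True)
        predicted = values[np.argmax(counts)]  # ties resolve to the smallest label
        correct += int(predicted == labels[i])
    return correct / len(labels)

# Toy example: two well-separated clusters in 2D.
coords = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = [0, 0, 0, 1, 1, 1]
accuracy = knn_accuracy(coords, labels, k=2)
```

With two cleanly separated clusters, every point's nearest neighbors share its label, so the accuracy here is 1.0; overlapping clusters in a poor visualization would lower it.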

Topic Coherence
We quantitatively measure the quality of topic models produced by all methods in terms of topic coherence. The objective is to show that while having better visualization quality, PLSV-VAE also gains comparable, if not better, topic coherence. For topic coherence evaluation, we use Normalized Pointwise Mutual Information (NPMI), which has been shown to be correlated with human judgments (Lau et al., 2014). NPMI is computed as follows:
$$\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}}{-\log p(w_i, w_j)}$$
We estimate $p(w_i, w_j)$, $p(w_i)$, and $p(w_j)$ using the Wikipedia 7-gram dataset created from the June 2008 Wikipedia dump. The NPMI of a topic is computed as the average of the pairwise NPMI of its top 10 words. For each method, we average the NPMI of its topics. Figure 4 shows the topic coherence (NPMI) of all methods. As we can see, PLSV-VAE finds topics as good as those found by other methods, and in some settings, PLSV-VAE can find significantly better topics. For a qualitative evaluation of topic quality, we show some example topics found by PLSV-VAE in Figure 9.
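The topic-level NPMI computation can be sketched in a few lines of Python. The probability tables below are made-up toy values rather than Wikipedia estimates, the helper names are hypothetical, and returning -1 for word pairs that never co-occur is a common convention assumed here rather than something stated in the text.

```python
import math
from itertools import combinations

def npmi(p_ij, p_i, p_j):
    """Normalized PMI of a word pair; assumes 0 <= p_ij < 1."""
    if p_ij <= 0.0:
        return -1.0  # assumed convention for pairs that never co-occur
    return math.log(p_ij / (p_i * p_j)) / (-math.log(p_ij))

def topic_npmi(top_words, p_word, p_pair):
    """Average pairwise NPMI over a topic's top words."""
    scores = [
        npmi(p_pair.get(frozenset((a, b)), 0.0), p_word[a], p_word[b])
        for a, b in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)

# Toy probabilities (hypothetical): "star" and "galaxi" always co-occur,
# while "logic" is statistically independent of both.
p_word = {"star": 0.1, "galaxi": 0.1, "logic": 0.1}
p_pair = {
    frozenset(("star", "galaxi")): 0.1,
    frozenset(("star", "logic")): 0.01,
    frozenset(("galaxi", "logic")): 0.01,
}
score = topic_npmi(["star", "galaxi", "logic"], p_word, p_pair)
```

Here the perfectly co-occurring pair contributes an NPMI of 1 and the two independent pairs contribute 0, so the topic score is their average, 1/3.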

Visualization Examples
We compare visualizations produced by all methods qualitatively by showing some visualization examples. In these visualizations, each document is represented by a point and the color of each point indicates the class of that document. Figures 5 and 6 present visualizations by PLSV-MAP and PLSV-VAE on REUTERS and 20 NEWSGROUPS. We see that PLSV-VAE can find meaningful clusters of documents. For example, PLSV-VAE in Figure 5(b) separates the eight classes well into different clusters, such as the pink cluster for acq, the orange cluster for earn, and the brown cluster for crude. The visualization by PLSV-MAP in Figure 5(a) also shows clear clusters, but PLSV-MAP runs much slower than PLSV-VAE, as shown in Section 4.5. Figure 6 presents visualization outputs for 20 NEWSGROUPS. For this more challenging dataset, PLSV-VAE produces better-separated clusters than PLSV-MAP. For example, baseball and hockey are mixed in Figure 6(a) by PLSV-MAP, but these classes are separated better in Figure 6(b) by PLSV-VAE. We do not show visualizations of WEB OF SCIENCE and ARXIV by PLSV-MAP because it fails to return any results even after 24 hours of running. We instead show visualizations of these two large datasets by PLSV-VAE and ProdLDA-VAE + t-SNE in Figures 7 and 8. As we can see, visualizations by PLSV-VAE are more intuitive than the ones by ProdLDA-VAE + t-SNE, which supports the finding that the joint approach outperforms the pipeline approach.

Comparing Different Radial Basis Functions
Since our method is black box, we can quickly explore PLSV-VAE model with different assumptions.
In this section, we show how different RBFs affect the performance of PLSV-VAE. Besides PLSV-VAE with the Gaussian RBF, we implement two other variants of PLSV-VAE that use the Inverse quadratic and Inverse multiquadric RBFs. We choose these two because, similar to Gaussian, they support the assumption that the zth topic proportion of document n is high when document coordinate $x_n$ is close to topic coordinate $\phi_z$. For these model changes, we do not need to perform any mathematical rederivation; we only need to change a few lines of code of PLSV-VAE (Gaussian). Figures 10 and 11 show the k-NN accuracy and topic coherence of PLSV-VAE with different RBFs. In general, PLSV-VAE with Gaussian or Inverse quadratic RBFs consistently produces good performance across datasets. In some cases, Inverse quadratic produces better results.
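To illustrate why swapping the RBF is a small code change, here is a minimal NumPy sketch in which the three kernels are interchangeable entries of a lookup table. It assumes the fixed identity influence weights described earlier, so each topic's proportion is its own normalized activation; the names `RBFS` and `normalized_rbf` and the toy coordinates are hypothetical and not taken from the authors' code.

```python
import numpy as np

# The three radial basis functions named in the text, as functions of r >= 0.
RBFS = {
    "gaussian": lambda r: np.exp(-0.5 * r ** 2),
    "inverse_quadratic": lambda r: 1.0 / (1.0 + r ** 2),
    "inverse_multiquadric": lambda r: 1.0 / np.sqrt(1.0 + r ** 2),
}

def normalized_rbf(x, phi, rbf="gaussian"):
    """Topic proportions from a normalized RBF layer with identity weights.

    x:   (N, d) document coordinates
    phi: (Z, d) topic coordinates
    """
    # Euclidean distance between every document and every topic: shape (N, Z).
    r = np.linalg.norm(x[:, None, :] - phi[None, :, :], axis=-1)
    activation = RBFS[rbf](r)
    return activation / activation.sum(axis=1, keepdims=True)

# Toy example: one document close to the first of two topics.
x = np.array([[0.5, 0.0]])
phi = np.array([[0.0, 0.0], [3.0, 0.0]])
thetas = {name: normalized_rbf(x, phi, name) for name in RBFS}
```

All three kernels assign the larger proportion to the nearer topic, but the heavier-tailed inverse kernels spread more mass to the distant topic than the Gaussian does.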

Topic Examples and Running Time Comparison
To qualitatively evaluate the topics, in Figure 9, we show visualization and topic examples generated by PLSV-VAE (Inverse quadratic) on ARXIV. In the visualization, each black empty circle represents a topic that is associated with a list of top 10 words. We see that the topics are meaningful and reflect different research subdomains discussed in the ARXIV papers. For example, many topics are studied in the CS domain such as "graph, g, vertex, k", "model, data, use, method", and "logic, program, system". For the Astro domain, we have topics like "galaxi, cluster, star", and "observ, ray, model, star". Topics such as "energi, nucleu, reaction" and "electron, energi, atom" are discussed in the Nucl domain. By allowing the semantics to be infused in the visualization space, users can now not only see the documents but also their topics. The joint nature of the model may lead to potential applications in different visual text mining tasks. Finally, we show the running time of all the methods in Figure 12. As expected, PLSV-MAP running on a single core is very slow and it fails to return any results on large datasets even after 24 hours of running. PLSV-VAE runs much faster. It only needs about 5 hours for 200 topics on the largest dataset ARXIV. For completeness, we also include the running time of LDA-VAE, and ProdLDA-VAE. PLSV-VAE is as fast as these methods. In summary, PLSV-VAE can find good topics and visualization while it can scale well to large datasets, which will increase its usability in practice.

Conclusion
We propose, to the best of our knowledge, the first fast AEVB based inference method for jointly learning topics and visualization. In our approach, we design a decoder that includes a normalized RBF network connected to a linear neural network. These networks are parameterized by topic coordinates and word probabilities, ensuring that they are shared across all documents. Due to our method's black box nature, we can quickly experiment with different RBFs with minimal reimplementation effort. Our extensive experiments on four real-world large datasets show that PLSV-VAE runs much faster than PLSV-MAP while gaining better visualization quality and comparable topic coherence.