Learning VAE-LDA Models with Rounded Reparameterization Trick

The introduction of VAE provides an efficient framework for the learning of generative models, including generative topic models. However, when the topic model is a Latent Dirichlet Allocation (LDA) model, a central technique of VAE, the reparameterization trick, is not applicable: no reparameterization of the Dirichlet distribution is known to date that supports the trick. In this work, we propose a new method, which we call the Rounded Reparameterization Trick (RRT), to reparameterize Dirichlet distributions for the learning of VAE-LDA models. When applied to a VAE-LDA model, this method is shown experimentally to outperform existing neural topic models on several benchmark datasets and on synthetic datasets.


Introduction
Probabilistic generative models are widely used in topic modelling and have achieved great success in many applications (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003; Blei and Lafferty, 2006). A landmark of topic models is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), where a document is treated as a bag of words and each word is modelled via a generative process. More specifically, in this generative process, a topic distribution is first drawn from a Dirichlet prior, a topic is then sampled from the topic distribution, and a word is subsequently drawn from the word distribution corresponding to the sampled topic. Since its introduction, LDA has shown great power in a large variety of natural language applications (Wei and Croft, 2006; AlSumait et al., 2008; Mehrotra et al., 2013). However, the classical methods of learning LDA, such as variational techniques and collapsed Gibbs sampling, entail high computational complexity in posterior inference (Blei et al., 2003; Griffiths and Steyvers, 2004), which limits the ability of LDA to model large corpora.
Variational AutoEncoder (VAE) or AutoEncoding Variational Bayes (AEVB) (Kingma and Welling, 2013) provides another choice for learning a generative model. Under the VAE framework, a generative model is specified by first drawing a latent vector z from a prior distribution and then transforming this vector through a neural network, called the decoder, which subsequently generates the observation x. Using a variational inference approach, VAE couples the decoder network with another network, called the encoder, responsible for computing the posterior distribution of the latent variable z for each observation x. A key technique of VAE is its "reparameterization trick", in which sampling from the posterior is performed by sampling a noise variable ε from some distribution p(ε) and then transforming ε to z using a differentiable function. This technique allows the model to be trained efficiently using back-propagation.
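To make the trick concrete, here is a minimal sketch for the Gaussian case (the standard textbook example; PyTorch is our choice of framework here, not one prescribed by the paper):

```python
import torch

def gaussian_reparameterize(mu, log_var):
    # Draw noise from the fixed distribution p(eps) = N(0, I) ...
    eps = torch.randn_like(mu)
    # ... and transform it differentiably: z = mu + sigma * eps ~ N(mu, sigma^2).
    return mu + torch.exp(0.5 * log_var) * eps
```

Because z is a differentiable function of the encoder outputs mu and log_var, gradients of a Monte-Carlo loss estimate flow back to the encoder parameters through the sample.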
The VAE framework significantly alleviates the computational burden of learning a generative model. Researchers interested in topic modelling are therefore naturally motivated to consider VAE as an alternative approach to learning LDA, exploiting the power and efficiency of deep neural networks. This is also the interest of this paper. However, a key limitation in applying VAE to Dirichlet-based topic models is that the original reparameterization trick of VAE is not applicable to Dirichlet distributions; in this sense, VAE cannot be directly used for learning any Dirichlet-based topic model. To cope with this, the NVDM model (Miao et al., 2016) discards the Dirichlet assumption and builds neural topic models on a Gaussian prior. Although such Gaussian-based topic models achieve reasonably good perplexity, the topic words they extract appear to lack human interpretability. Additionally, the Gaussian prior deviates significantly from the desired Dirichlet distribution and arguably leaves significant room for improvement.
The adoption of the Dirichlet prior plays a central role in topic modelling, since it nicely captures the intuition that a topic is sampled from a sparse topic distribution. Owing to the importance of the Dirichlet assumption, ProdLDA (Srivastava and Sutton, 2017) attempts to apply VAE to LDA by constructing a Laplace approximation to the Dirichlet prior in the softmax basis. However, the Laplace approximation is only used to estimate the prior parameters, and ProdLDA essentially retains a Gaussian VAE architecture in which the KL divergence is taken between Gaussian distributions. The work of (Joo et al., 2019) argues that the Laplace approximation in ProdLDA fails to capture the multimodal nature of Dirichlet distributions. They then propose DirVAE, in which an approximation of the inverse Gamma CDF (Knowles, 2015) is used to reparameterize Gamma distributions; Dirichlet samples are then constructed by normalizing Gamma random variables. However, this approximation of the inverse Gamma CDF is accurate only when the shape parameter of the Gamma distribution is much less than 1 (Knowles, 2015), which in turn limits the application scope of DirVAE.
In this work, we develop a technique, which we call the Rounded Reparameterization Trick (RRT), to reparameterize Dirichlet distributions. The use of RRT enables VAE as an efficient method for learning LDA, based on which we propose a new neural topic model, referred to as RRT-VAE (code will be available at https://github.com/rzTian/RRT-VAE/tree/main). Experiments on several datasets show that RRT-VAE outperforms NVDM, ProdLDA, and DirVAE. The experimental results strongly demonstrate the applicability of RRT to topic modelling that utilizes VAE.

LDA
In this paper, we refer to LDA broadly as a generative model characterized by first drawing a distribution θ over k topics from a Dirichlet prior Dir(θ|α̂) and then transforming θ, through a function f_dec called the decoder, to a distribution P over a vocabulary of n words. That is,

θ ∼ Dir(θ|α̂),   (1)
P = f_dec(θ; β),   (2)

where β is the parameter of the decoder and will be treated as a k × n matrix throughout this paper, although other options are also possible. Under this model, the words in a document are regarded as being drawn i.i.d. from the distribution P.
In the classical LDA model (Blei et al., 2003), each row of β represents a word distribution, and the decoder can be written as

f_dec(θ) = θᵀβ.   (3)

In the deep learning paradigm, the decoder may be constructed differently, for example,

f_dec(θ) = θᵀ Softmax(β)   (4)

and

f_dec(θ) = Softmax(θᵀβ),   (5)

where in both cases the rows of β are unconstrained, and the Softmax in (4) is applied to each row of β. Note that (4), presented in (Srivastava and Sutton, 2017), is merely a different parameterization of (3) and will be referred to as the "standard decoder" in this paper. The structure in (5), referred to as "product of experts" in (Srivastava and Sutton, 2017), will be called the "prod decoder" for simplicity.
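The two decoders can be sketched in a few lines (a minimal PyTorch rendering of (4) and (5); the tensor shapes are our assumptions: theta is (batch, k) and beta is (k, n)):

```python
import torch.nn.functional as F

def standard_decoder(theta, beta):
    # Equation (4): a row-wise softmax turns each row of beta into a word
    # distribution, which is then mixed under the topic weights theta.
    return theta @ F.softmax(beta, dim=1)

def prod_decoder(theta, beta):
    # Equation (5): mix the unnormalized rows first, then normalize
    # ("product of experts").
    return F.softmax(theta @ beta, dim=1)
```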

VAE-LDA
The difficulty in learning an LDA model lies in the exact inference of θ. In the classical LDA, exact inference is replaced by approximation methods using a symbolist variational method (Blei et al., 2003) or MCMC (Griffiths and Steyvers, 2004). In the deep learning era, the development of the Variational AutoEncoder (Kingma and Welling, 2013), a connectionist counterpart of the symbolist variational methods, provides an alternative approach to this difficulty. When applying VAE to an LDA model, the model is augmented with an encoder network f_enc. Specifically, the encoder takes as input the bag-of-words (i.e., word histogram) representation x of a document and outputs a k-dimensional parameter α; the Dirichlet distribution with parameter α is then taken as the posterior distribution q(·|α) of θ:

α := f_enc(x; Π),   (6)
q(·|α) := Dir(·|α),   (7)

where Π denotes the parameters of the encoder. Under the VAE framework, the parameters of the encoder and the decoder are jointly optimized by minimizing the negative Evidence Lower Bound (ELBO):

L(Π, β) := KL(q(θ|α) ∥ p(θ|α̂)) − E_{q(θ|α)}[J(θ, x)],   (8)

where p(θ|α̂) := Dir(θ|α̂) is the Dirichlet prior, and J(θ, x) := log p(x|θ) is the log-likelihood of the document x under the topic distribution θ. We refer to the model specified by the loss function (8) as VAE-LDA.

Note that the KL term in (8) has a closed-form expression,

KL(Dir(α) ∥ Dir(α̂)) = log Γ(α₀) − Σ_i log Γ(α_i) − log Γ(α̂₀) + Σ_i log Γ(α̂_i) + Σ_i (α_i − α̂_i)(ψ(α_i) − ψ(α₀)),   (9)

where α₀ := Σ_i α_i, α̂₀ := Σ_i α̂_i, and Γ and ψ denote the gamma and digamma functions. The gradient of this term can be obtained directly. The optimization of the second term in (8) is however challenging, since it has no closed-form expression. Additionally, when using a stochastic approximation, one must deal with back-propagating gradient signals through a sampling process. One way to deal with this is to use a score function estimator (Williams, 1992; Glynn, 1990). But such an approach is known to give rise to high variance in the gradient estimate, due to which a reliable estimate would require drawing a large number of θ from the posterior q(·|α), making learning inefficient. In the VAE framework, the "reparameterization trick" is introduced as an elegant solution to this problem: the posterior is reparameterized as drawing a noise variable from another distribution and re-expressing the sample as a differentiable function of the noise. However, when the posterior is a Dirichlet distribution (or a related distribution such as a Beta or Gamma distribution), no such noise distribution and differentiable function are known to exist. Thus the standard reparameterization trick does not apply to learning VAE-LDA.
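Assembled into code, the loss (8) may be sketched as follows (reusing the decoder sketch above; torch.distributions implements the closed-form Dirichlet KL of (9)):

```python
import torch
from torch.distributions import Dirichlet, kl_divergence

def vae_lda_loss(x, alpha, alpha_hat, theta, beta):
    # Reconstruction term -J(theta, x): negative log-likelihood of the
    # bag-of-words counts x under the decoder's word distribution.
    p = prod_decoder(theta, beta)              # or standard_decoder
    rec = -(x * torch.log(p + 1e-10)).sum(dim=1)
    # KL term of (8), computed in closed form as in (9).
    kl = kl_divergence(Dirichlet(alpha), Dirichlet(alpha_hat))
    return (rec + kl).mean()
```

The missing piece is precisely how to draw θ from q(·|α) so that gradients reach the encoder, which is what RRT provides.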

Rounded Reparameterization Trick
To tackle the limitation of the standard reparameterization trick, we propose a new reparameterization method, referred to as the rounded reparameterization trick, or RRT.
Given a real number ∆ > 0, we define a "∆-rounding" function ⌊·⌋_∆ as follows: for any real number a,

⌊a⌋_∆ := ⌊a/∆⌋ · ∆,   (10)

where ⌊·⌋ is the integer floor (or "rounding down") operation. For example, ⌊3.14159265⌋_{∆=0.001} = 3.141. When the ∆-rounding operation is applied to a vector, it acts component-wise.
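In code, the ∆-rounding is one line (a direct transcription of (10); note that for very small ∆ the division should be done in double precision):

```python
import math

def delta_round(a, delta):
    # ⌊a⌋_∆ := ⌊a / ∆⌋ · ∆, i.e. round a down to the nearest multiple of ∆.
    return math.floor(a / delta) * delta

print(delta_round(3.14159265, 0.001))  # 3.141 (up to floating-point error)
```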
In RRT, we draw an auxiliary variable θ̃ from the "rounded" posterior distribution,

θ̃ ∼ q(· | ⌊α⌋_∆),   (11)

and compute

θ := θ̃ + λ(α − ⌊α⌋_∆).   (12)

Then θ is used in place of a sample θ ∼ q(θ|α). In (12), the parameter λ is a hyper-parameter which serves to adjust the strength of the gradient. Note that when choosing a very small rounding precision ∆, we expect the distribution q̃(·|α) of the computed θ and the distribution q(·|α) to be nearly identical. As a consequence, E_{q(θ|α)}[J(θ, x)] and its replacement E_{q̃(θ|α)}[J(θ, x)] are also very close to each other. Thus such a replacement keeps the loss function very close to the original loss in (8).
For shorter notation, we denote

g(θ̃, α) := θ̃ + λ(α − ⌊α⌋_∆)   (13)

and

A(α) := E_{θ̃∼q(·|⌊α⌋_∆)}[J(g(θ̃, α), x)].   (14)

Constructing a gradient estimator using RRT. The gradient ∇_α A(α) can be expressed as a sum of two terms:

∇_α A(α) = E_{θ̃∼q(·|⌊α⌋_∆)}[J(g(θ̃, α), x) ∇_α log q(θ̃ | ⌊α⌋_∆)] + E_{θ̃∼q(·|⌊α⌋_∆)}[∇_θ J(θ, x)|_{θ=g(θ̃,α)} ∇_α g(θ̃, α)].   (15)

The first term in the sum is usually estimated through the score function estimator. But this is unnecessary in this case. To see this, note that ∇_α ⌊α⌋_∆ = 0 almost everywhere, so by the chain rule ∇_α log q(θ̃ | ⌊α⌋_∆) = 0 almost everywhere. This implies that the first term is in fact 0 at every α for which the gradient exists. The next lemma then immediately follows.
Lemma 1. For any α at which the gradient ∇_α A(α) exists,

∇_α A(α) = λ E_{θ̃∼q(·|⌊α⌋_∆)}[∇_θ J(θ, x)|_{θ=g(θ̃,α)}].

The fact that the score function estimator is not needed for estimating the gradient ∇_α A(α) allows RRT to enjoy a low variance and hence requires very few samples in Monte-Carlo estimation.
Using Lemma 1, one can directly express the stochastic (Monte Carlo) estimate of the gradient as

∇̂_α A(α) := (λ/N) Σ_{i=1}^{N} ∇_θ J(θ, x)|_{θ=g(θ̃_i, α)},  θ̃_i ∼ q(· | ⌊α⌋_∆).   (16)

The fact that g is differentiable almost everywhere with respect to α allows the gradient signal to back-propagate, and the estimator can be implemented using automatic differentiation libraries. Due to the low variance of this estimator, it is sufficient to sample only a single θ̃ from q(· | ⌊α⌋_∆), namely, to take N = 1 in (16).
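The whole of RRT then fits in a few lines. The sketch below is our PyTorch rendering of the formulation above: the rounded parameter is detached (its gradient is zero almost everywhere anyway), and the additive term of (12) carries the pathwise gradient. For ∆ as small as 10^{-10}, the rounding step is only meaningful in double precision:

```python
import torch

def rrt_sample(alpha, delta, lam):
    # Rounded parameter ⌊α⌋_∆, detached: its gradient is zero a.e.
    alpha_r = (torch.floor(alpha / delta) * delta).detach()
    # Auxiliary draw θ̃ ~ q(.|⌊α⌋_∆); the sampler carries no gradient.
    theta_tilde = torch.distributions.Dirichlet(alpha_r).sample()
    # Equation (12): re-attach a pathwise gradient of strength lam.
    return theta_tilde + lam * (alpha - alpha_r)
```

Back-propagating any loss built on the returned θ reproduces the estimator (16) with N = 1.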
To conclude, the loss function obtained by replacing θ ∼ q(θ|α) in (8) with the RRT sample θ = g(θ̃, α) is very close to the original loss function, and a low-variance gradient estimator can easily be constructed from it. This completes the description of RRT.

On the discontinuities induced by RRT
Notably, the ∆-rounding function in RRT induces discontinuities in the resulting loss function: A(α) is discontinuous in α, with countably many discontinuity points. One may be concerned that an update of α may "hop over" a discontinuity point of A(α) and cause training to become unstable or diverge.
To address this concern, we have the following result. Let Ā(α) := E_{q(θ|α)}[J(θ, x)] denote the objective without rounding.

Lemma 2. Suppose that J(·, x) is L_J-Lipschitz, and consider varying a single coordinate of α with the remaining coordinates fixed. Then for any integer m and any ε ∈ (0, ∆),

|A(m∆) − A(m∆ − ε)| ≤ |Ā(m∆) − Ā((m−1)∆)| + λ L_J (∆ − ε),

where m∆ and m∆ − ε denote the values of the varied coordinate.

We note that when ε → ∆, the quantity A(m∆) − A(m∆ − ε) measures the magnitude of the sudden rise or drop when an update hops over the discontinuity point α = m∆. When this magnitude is small, the discontinuity has little impact on the stability of training. The upper bound on this quantity given by the lemma suggests that as long as J(θ, x) and the objective function Ā(α) are reasonably smooth, one may make this magnitude small by choosing a relatively small ∆. On the other hand, in case one indeed chooses a relatively large ∆, the bound on this magnitude may become quite large. However, in this case an update has a much smaller chance of hopping over a discontinuity point, and one still expects no serious problem caused by these discontinuities.

We now present the proof.

Proof: Clearly, A(m∆) = Ā(m∆), since ⌊m∆⌋_∆ = m∆ and the correction term λ(α − ⌊α⌋_∆) in (12) vanishes at α = m∆. Moreover, ⌊m∆ − ε⌋_∆ = (m−1)∆, so

A(m∆ − ε) = E_{θ̃∼q(·|(m−1)∆)}[J(θ̃ + λ(∆ − ε), x)].

Therefore

A(m∆) − A(m∆ − ε) = [Ā(m∆) − Ā((m−1)∆)] + E_{θ̃∼q(·|(m−1)∆)}[J(θ̃, x) − J(θ̃ + λ(∆ − ε), x)],

and the magnitude of the second term is at most λ L_J (∆ − ε) by the Lipschitz property of J. This proves the lemma.
It is clear that when ∆ is small, the discontinuity is mild and has little impact on the optimization of the model.

Related Work
Beyond topic modelling, another theme of research related to this work is the estimation of gradients in neural networks containing stochastic nodes or samplers. In this setting, one desires that the gradient signal be capable of back-propagating through the samplers. A classical method for this purpose is to construct a score function estimator, also known as the "log derivative trick" or REINFORCE (Williams, 1992; Glynn, 1990). However, despite giving an unbiased estimate, the Monte-Carlo implementation of such an estimator typically suffers from high variance, and thus relies on additional variance-reduction techniques (Greensmith et al., 2004). The reparameterization trick (Kingma and Welling, 2013), as mentioned above, may also be used to back-propagate gradients through samples and enjoys a low-variance advantage. Unfortunately, this technique is not applicable to many distributions, such as Gamma, Beta and Dirichlet distributions. Various efforts have been devoted to extending the applicability of the reparameterization trick to a broader range of distributions. These works include, for example, G-REP, RSVI (Naesseth et al., 2016) and Implicit Reparameterization Gradients (Figurnov et al., 2018). These methods usually involve complicated gradient derivations and are often difficult to implement in neural networks.
Experimental Settings

In the experiments, we adopt a three-layer MLP with ReLU activations as the encoder of RRT-VAE, where each hidden layer is set to 500 dimensions. We apply an exponential function to the outputs of the encoder so that the outputs are positive. The topic distribution vectors are sampled through RRT and then normalized before being passed to the decoder. For Online LDA, we use the standard implementation from scikit-learn (Pedregosa et al., 2011). The encoder structures of NVDM, ProdLDA and DirVAE are built according to (Miao et al., 2016), (Srivastava and Sutton, 2017) and (Joo et al., 2019), respectively, where in our experiments the dimension of each hidden layer is set to 500.
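A minimal sketch of this encoder follows (the exact layer layout is our reading of the description above, not an exact reproduction of the released code):

```python
import torch.nn as nn

class RRTEncoder(nn.Module):
    """Bag-of-words input in, positive Dirichlet parameters out."""
    def __init__(self, vocab_size, num_topics, hidden=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_topics),
        )

    def forward(self, x):
        # Exponentiate so that the Dirichlet parameters alpha are positive.
        return self.net(x).exp()
```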
On the real-world datasets, we adopt the prod decoder, since the standard decoder appears to extract many repetitive topic words (see Appendix B.1); as reported in (Srivastava and Sutton, 2017), ProdLDA also extracts many repetitive words when using the standard decoder. On the synthetic datasets, we adopt the standard decoder, which we find to be superior to the prod decoder on this learning task (see Appendix A.1).

Datasets
Synthetic datasets. We construct three synthetic datasets based on the LDA generative process: a 30 × 500 topic-word probability matrix β_g is generated as the ground truth; each dataset is then generated from β_g using a different symmetric Dirichlet prior with parameter α_g · 1 ∈ ℝ^30, where 1 denotes the all-one vector. We set α_g to 0.01, 0.05 and 0.1 for the three datasets, and the vocabulary size to 500. Each dataset has 20000 training examples; a minimal generation sketch is given at the end of this subsection.
Real-world datasets. We use five real-world datasets in our experiments: 20NG, RCV1-v2, AGNews, DBpedia (Lehmann et al., 2015), and Yelp review polarity (Zhang et al., 2015).
The 20NG and RCV1-v2 datasets are the same as those used in (Miao et al., 2016). The other three datasets are preprocessed by tokenizing, stemming, lemmatizing and removing stop words. We keep the 2000 most frequent words in DBpedia and Yelp. For AGNews, we keep the words that are contained in no more than half of the documents and in at least 15 documents. The statistics of the cleaned datasets are summarized in Table 1.
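The synthetic generation process can be sketched as follows (the document length is not specified above; doc_len = 100 is an illustrative assumption):

```python
import numpy as np

def generate_synthetic(beta_g, alpha_g, num_docs=20000, doc_len=100, seed=0):
    # beta_g: (30, 500) ground-truth topic-word matrix, rows summing to 1.
    rng = np.random.default_rng(seed)
    k = beta_g.shape[0]
    docs = np.empty((num_docs, beta_g.shape[1]), dtype=np.int64)
    for d in range(num_docs):
        theta = rng.dirichlet(alpha_g * np.ones(k))  # topic distribution
        p = theta @ beta_g                           # word distribution P
        docs[d] = rng.multinomial(doc_len, p)        # bag-of-words counts
    return docs
```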

Evaluation Methods
On the real-world datasets, we use perplexity and normalized pointwise mutual information (NPMI) (Lau et al., 2014) as the evaluation metrics. On the synthetic datasets, we propose topic words recovery accuracy (or "recovery accuracy" for short) to evaluate model performance. Specifically, we extract the top-10 highest-probability word indexes from each row of β_g; the extracted word indexes constitute a 30 × 10 topic-word matrix T_g. Our goal is to use the topic models to recover this matrix. Denote by T_L the matrix extracted in the same way from the learned β matrix of a model. Note that the rows of T_L are arbitrarily ordered. To count how many words in the i-th row t_g^(i) of T_g are recovered in a topic in T_L, we compare t_g^(i) with each row of T_L, count the number of common words in the two compared rows, and keep the maximum count as the number of recovered words in t_g^(i). The recovery accuracy is then defined as the total number of recovered words over all rows of T_g divided by the total number of words in T_g.
We note that after a row of T_g is compared with T_L as the target of coverage, the found best-matching row in T_L is not removed. This approach is better than the alternative of greedily removing the best-matching row, since the latter would give an accuracy result that depends on the row ordering of T_g. Additionally, we note that the data generation process ensures that the rows of T_g each contain 10 distinct words. For this reason, keeping the found best-matching row in T_L at each step causes no problem.
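The recovery accuracy computation, as described, may be sketched as:

```python
import numpy as np

def recovery_accuracy(T_g, T_L):
    # T_g, T_L: integer arrays of shape (num_topics, 10) holding the
    # top-10 word indexes of each topic.
    recovered = 0
    for row_g in T_g:
        # Keep the best overlap with any row of T_L; best-matching rows
        # are NOT removed, so the result is independent of row ordering.
        recovered += max(len(set(row_g) & set(row_l)) for row_l in T_L)
    return recovered / T_g.size
```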

Influence of Parameter Settings
In this section, we run RRT-VAE on 20NG and the synthetic datasets to explore its performance under different parameter settings.

Results on 20NG
Prior settings. Prior settings have been argued to have a significant influence on model performance (Wallach et al., 2009). In this experiment, we run RRT-VAE on the 20NG dataset using four symmetric Dirichlet prior settings, α̂ ∈ {0.02, 0.2, 1.0, 2.0}. The number of topics is set to 50 and λ to 0.01 in all experiments. We use ∆ = 10^{-10} as the rounding precision, so that accurate Dirichlet samples can be drawn. As shown in Figure 1 (left), when using a larger prior parameter (1 or larger), the training loss drops more rapidly and converges to a lower value. Table 2 reports the corresponding testing results. We find that when using a smaller prior setting, RRT-VAE tends to achieve better topic coherence (NPMI) while sacrificing some performance on perplexity. One possible explanation is that a smaller prior setting (lower than 1) encourages the encoder network to sample a sparser topic distribution θ. The sparsity of θ in turn makes it easier for the model to assign a very small probability to some existing words in a document, which increases the training loss and perplexity.
To verify this conjecture, we construct a simple measure of sparsity: after training, we feed 1000 randomly chosen training samples into the encoder network and obtain 1000 topic distribution vectors {θ_i}_{i=1}^{1000}. For each θ_i, we calculate the difference between its largest and smallest probability values, and then average these differences over the 1000 samples (see the sketch below). Clearly, a larger difference indicates a sparser θ; e.g., the maximum difference of 1 is achieved by a one-hot vector. From the sparsity measurements in Table 2, we see that a smaller prior setting causes the encoder to generate sparser topic distribution vectors, which in turn hinders the convergence of the training loss to a lower value and hence causes a higher perplexity. On the other hand, sparser topic distributions tend to improve NPMI, although the improvement is slight.

λ settings. The "gradient control" parameter λ in RRT adjusts the strength of the gradient signal back-propagated to the encoder, while also influencing the variance of the Monte Carlo gradient estimator. Figure 1 (right) and Table 3 report the influence of different λ settings on model performance, where the number of topics is set to 50 and the prior to 1. As shown, when λ is set too small (e.g. λ = 0.001), the training loss fails to converge to a lower value, resulting in a higher perplexity and worse NPMI. The best performance is achieved when λ is set between about 0.005 and 0.01. Different λ settings can yield similar training performance but different testing results: for example, when λ is set to 0.1 and to 0.01, the corresponding training curves are very similar (see Figure 1 (right), blue and grey dashed lines), yet λ = 0.01 achieves better perplexity and NPMI.
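The sparsity measurement described above is straightforward to implement (a sketch reusing the rrt_sample function from earlier):

```python
import torch

def sparsity_measure(encoder, x_batch, delta=1e-10, lam=0.01):
    # x_batch: e.g. 1000 randomly chosen training documents (bag-of-words).
    with torch.no_grad():
        alpha = encoder(x_batch)
        theta = rrt_sample(alpha, delta, lam)
        theta = theta / theta.sum(dim=1, keepdim=True)
        # Mean gap between each theta's largest and smallest probability;
        # the maximum value 1 is attained by a one-hot vector.
        gap = theta.max(dim=1).values - theta.min(dim=1).values
    return gap.mean().item()
```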
Influence of the rounding precision ∆. In these experiments, λ is set to 0.01 and the number of topics to 50. A main concern about RRT is that the induced discontinuities may cause training to be unstable. As argued above, the rounding actually has little impact on the stability of training. We substantiate this conclusion in Figure 2 (a) by plotting the training loss curves of RRT-VAE under different ∆ settings. As shown, all the training losses converge stably across the different ∆. This demonstrates that the precision of the rounding operation has little impact on training stability. The influence of ∆ on perplexity and NPMI is also modest: as shown in Figure 2 (b) and (c), the resulting perplexities and NPMIs are in general insensitive to the ∆ settings.
From Figure 2 (b) and (d), it can also be observed that the perplexity of RRT-VAE correlates with sparsity. When ∆ changes from 1 to 10^{-10}, the sparsity value for α̂ = 0.02 (green line in Figure 2 (d)) jumps from 0.059 to around 0.55. The corresponding perplexity (green line in Figure 2 (b)) also increases from 1078 to around 1400. In contrast, the sparsity levels for α̂ = 1.0 and α̂ = 2.0 remain unchanged, and their corresponding perplexities also stay at the same levels.

Results on Synthetic datasets
Our experiments on the synthetic datasets again demonstrate that the rounding precision has little impact on training stability. Figure 3 (left) exhibits how different ∆ settings influence the training performance of RRT-VAE when α_g = 0.01 (the results for α_g = 0.05 and 0.1 are shown in Appendix A.2). As shown, all the training losses decrease stably, although a higher ∆ setting hinders the loss from converging to a lower value. Figure 3 (right) reports how different ∆ settings influence the recovery accuracy of RRT-VAE on the three synthetic datasets. It can be seen that a smaller ∆ achieves better performance. Specifically, when ∆ = 1, the training loss remains at a high value and the corresponding recovery accuracy is lower than 60%, indicating that RRT-VAE fails to fit the true data distribution. In contrast, when ∆ = 10^{-10}, RRT-VAE fits the data well: the training loss drops rapidly and converges to a much lower value, and the resulting recovery accuracy reaches up to 90%. Recall that on 20NG, both the training and testing performances are insensitive to the rounding precision, whereas on the synthetic datasets the rounding precision has a significant influence. This phenomenon is reasonable: the synthetic data strictly satisfy the LDA generative process, and a higher ∆ setting causes the rounded distribution to deviate from the Dirichlet posterior, thereby interfering with the fitting of the data. The underlying distribution of the real-world data, on the other hand, does not strictly conform to the LDA assumption, so this deviation has little impact on fitting the data.

Comparison with Other Models
In this section, we compare RRT-VAE with other existing topic models on both real-world datasets and synthetic datasets.

Real-world datasets
On real-world datasets, we do not compare Online LDA, since the training of Online LDA on large datasets is extremely time consuming and Online LDA fails to obtain any good results after being trained for a long time (results of Online LDA on 20NG are shown in Appendix B.2). For ProdLDA, DirVAE and RRT-VAE, we tune the prior parameter from [0.02,0.2,1.0]. The best λ settings of RRT-VAE for each dataset are shown in Table 4. All the compared models adopt the same prod decoder of (5) on the real-world datasets.    Table 6: Perplexity/NPMI of the compared topic models on five datasets. The number of topic is set to 200. margherita grimaldi pizzeria pepperoni sbarro brooklyn bianco mozza spinato concours udon ichiza monta tokyo chaya agedashi saigon chinatown gyoza yaki croissant decaf oatmeal scone coffe granola almond pastri latt muffin hue bo pho vietnames viet banh lemongrass vietnam mi basil sportsbook mandalay ronin kiki miyagi puck bachi shogun fatburg oxtail heighten punctuat suppl amidst juxtapos conscious onward revel evok gleam ewwww saliva kneel cock toothless broom discust demerit surveil sill wan non asian pan asian pak taipei totti hotpot hai sift empty hand marshall stuffer overstock spree reorgan sweatshirt store preach outbreak heartfelt pois raymond uplift caregiv worship charismat deathli buger haystack stripburg in and out quadrupl deli fukuburg fries food poison ambienc atmospher awsom bedienungen cafeteria defiantli chipotl slowest oldtown boozer after work carly grapevin fiver meet up hang tombston pokey pizza but peroni numero pizzaria pizza n nth insipid banal nil nla disposit st laurent hyper extraordinair procur store sale housewar homegood inventori brows shelv thrift shopper stock sashimi eel tempura nigiri yellowtail ponzu sushi edamam tuna wasabi dr doctor exam physician nurs physician obgyn urgent clinic medic airport plane flight baggag mccarran tsa passeng megabu shuttl airlin workout instructor zumba yoga class bike gym crossfit fairway paintbal The experimental results are shown in Table 5 and 6. It can be seen that on the small and medium size datasets (20NG and AGNews), the performance of DirVAE levels with RRT-VAE, while on the large datasets (RCV1-v2, DBpedia and Yelp), the NPMI of RRT-VAE is significantly better than all the other compared models. Although the perplexity of NVDM is better than RRT-VAE, this gap is small. On the other hand, on NPMI, RRT-VAE outperforms NVDM by a very large margin. In fact, it has been demonstrated that perplexity is not necessarily a good metric for evaluating the quality of learned topics (Newman et al., 2010). Its correlation to the quality of the learned topics is questionable 6 (Chang et al., 2009). With these considerations, we argue that RRT-VAE is overall superior to other compared models. Table 7 exhibits the extracted topic words of different models, where each line of the words corresponds to a certain topic. We see that the words extracted by RRT-VAE (the bottom cell of Table  7) are much more interpretable, from which it can 6 In general, perplexity measures the goodness-of-fit of data to a learned model under the maximum likelihood principle. This makes it a valid metric for evaluation when the learning objective (as in the considered models) aims at maximizing the data likelihood. 
On the other hand, we note that traditionally in all VAE-LDA models (e.g., those compared in this paper) and also in this paper, perplexity is in fact approximately computed using the evidence lower bound (ELBO) of the data likelihood, since exact computation of the data likelihood is usually intractable. But the perplexity computed this way aggregates the overall effects of both the learned decoder (i.e., the β matrix) and the learned encoder. Therefore it does not provide a direct evaluation of the learned word distributions in the β matrix. This problem is overcomed by the additional NPMI measure, which is computed directly from the β matrix and serves as a more indicative quality measurement of the learned topics. be easily inferred that the corresponding topics are "trade", "Japanese food", "medical" and "fitness". But it is not the case for the other models.
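Concretely, the ELBO-based perplexity follows the standard convention of the compared models (our rendering; N_d denotes the number of words in document d, and the intractable log-likelihood of x_d is replaced by its ELBO):

```latex
\mathrm{perplexity}
  = \exp\!\Big(-\frac{\sum_d \log p(x_d)}{\sum_d N_d}\Big)
  \approx \exp\!\Big(-\frac{\sum_d \mathrm{ELBO}(x_d)}{\sum_d N_d}\Big).
```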

Synthetic datasets
We compare RRT-VAE with Online LDA, ProdLDA and DirVAE on the three synthetic datasets generated with different Dirichlet parameters. The three compared neural topic models adopt the same standard decoder of (4). Since NVDM is a pure Gaussian VAE model without any approximation of Dirichlet distributions, it is not compared in this experiment. Table 8 reports the recovery accuracy of the compared models. The experimental results strongly demonstrate the ability of RRT-VAE as an inference method for learning LDA. Specifically, RRT-VAE is on par with Online LDA on recovery accuracy, while enjoying much higher computational efficiency. Among the three neural topic models, RRT-VAE clearly outperforms the others. Appendix A.3 shows an example of the ground truth matrix T_g and the matrix recovered by RRT-VAE.

Table 8: Recovery accuracy of four topic models on synthetic datasets generated with three different α_g settings. For RRT-VAE, λ is set to 1 and ∆ to 10^{-10}.

Concluding Remarks
In this paper, the rounded reparameterization trick (RRT) is shown to be an effective and efficient reparameterization method for Dirichlet distributions in the context of learning VAE-based LDA models. In fact, the applicability of RRT generalizes beyond Dirichlet distributions: any distribution can be reparameterized in an "RRT form" as long as a sampling algorithm exists for that distribution. It will therefore be interesting to investigate the performance of RRT in other applications of VAE beyond topic modelling. Successes in these investigations would extend the applicability of VAE to much broader application domains and model families.

A Additional Results on Synthetic Datasets
A.1 Topic recovery accuracy using prod decoder

Table 9 reports the topic words recovery accuracy of the three neural topic models using the prod decoder.

Table 9: Topic words recovery accuracy of three neural topic models on synthetic datasets generated with three different α_g settings. The models adopt the same prod decoder structure. For RRT-VAE, λ is set to 1 and ∆ to 10^{-10}.

Compared to Table 8, it can be seen that the standard decoder significantly outperforms the prod decoder on the synthetic datasets.

A.3 An example of the recovered topic word matrix

Table 10 exhibits an example of the ground truth topic word matrix T_g used in our experiments and a matrix T_L learned by RRT-VAE.

Table 10: Left: the ground truth topic word matrix T_g; Right: a matrix T_L learned by RRT-VAE. Note that the rows of T_L are arbitrarily ordered; for example, the first and second rows of T_g correspond to the 11th and 14th rows of T_L, respectively (shown in bold).

B Additional Results on Real-world Datasets
B.1 Repetitive words

As shown in Table 11, when using the standard decoder on the 20NG dataset, RRT-VAE appears to extract many repetitive topic words.

Table 11: The standard decoder appears to extract many repetitive words on 20NG; each line corresponds to one topic.
write article one get know like think say go use
write article get one know like use think say go
get go like write make people article insurance tax one
write article one get use like think know go say
know thanks please anyone write get email article post like

B.3 Topic words extracted by RRT-VAE

Table 13 exhibits the topic words extracted by RRT-VAE from four real-world datasets (20NG, AGNews, RCV1-v2 and DBpedia), where each line of the words corresponds to a certain topic.
health medical patient disease medicine estimate hospital care service coverage
violent gun crime handgun usa criminal uk homicide defend firearm
constitution senate amendment representative states president extend congress militia bear
homosexual male sexual man statistics percent rsa number gay behavior
fuel moon cool lunar air launch heat stage orbit cold
guilti conspiraci ghraib martha milosev enron prison yugoslav torture sentence
ansari spaceshipon genesi space hubbl parachut spacecraft nasa station astronaut
docomo nokia vodafon phone motorola blackberri ip mobil treo mmo
kill explod injur dead quak typhoon peopl jakarta bomb landslide
mice skeleton supercompute gene genetic stem clone ancestor scientist speci
thriv lifestyl shop museum flock fame cultur tast dream ancient
desktop access network internet digit modem intranet download voice compute
durum flood moisture disaster wheat grain hrw canol sorghum crop
detain troop gunfire violent policeman military siege dozen terror embass
attorney counsel felon lawsuit jury testif improp hear conspir guilt
paperback reprint book republish young adult isbn author locu scholast
desktop server intel web bas software device microsoft applic uav
clarinet bassist guitarist drummer banjo violin guitar drum saxophon keyboardist
airway airport iata airlin icao brokerag telecommun exchang asset financi