Tree-Structured Neural Topic Model

This paper presents a tree-structured neural topic model, which has a topic distribution over a tree with an infinite number of branches. Our model parameterizes an unbounded ancestral and fraternal topic distribution by applying doubly-recurrent neural networks. With the help of autoencoding variational Bayes, our model improves data scalability and achieves competitive performance when inducing latent topics and tree structures, as compared to a prior tree-structured topic model (Blei et al., 2010). This work extends the tree-structured topic model such that it can be incorporated with neural models for downstream tasks.


Introduction
Probabilistic topic models, such as latent Dirichlet allocation (LDA; Blei et al., 2003), are applied to numerous tasks including document modeling and information retrieval. Recently, Srivastava and Sutton (2017) and Miao et al. (2017) have applied the autoencoding variational Bayes (AEVB; Kingma and Welling, 2014; Rezende et al., 2014) framework to basic topic models such as LDA. AEVB improves data scalability in conventional models.
The limitation of the basic topic models is that they induce topics as flat structures, not organizing them into coherent groups or hierarchies. Tree-structured topic models (Griffiths et al., 2004), which detect the latent tree structure of topics, can overcome this limitation. These models induce a tree with an infinite number of nodes and assign a generic topic to the root and more detailed topics to the leaf nodes. In Figure 1, we show an example of topics induced by our model. Such characteristics are preferable for several downstream tasks, such as document retrieval (Weninger et al., 2012), aspect-based sentiment analysis (Kim et al., 2013) and extractive summarization (Celikyilmaz and Hakkani-Tur, 2010), because they provide succinct information from multiple viewpoints. For instance, in the case of document retrieval of product reviews, some users are interested in the general opinions about bag covers, while others pay more attention to specific topics such as the hardness or color of the covers. The tree structure can navigate users to the documents with desirable granularity.
However, it is difficult to use tree-structured topic models with neural models for downstream tasks. While neural models require a large amount of data for training, conventional inference algorithms, such as collapsed Gibbs sampling (Blei et al., 2010) or mean-field approximation, have data scalability issues. It is also desirable to optimize the tree structure for downstream tasks by jointly updating the neural model parameters and the posteriors of a topic model.
To overcome these challenges, we propose a tree-structured neural topic model (TSNTM), which is parameterized by neural networks and is trained using AEVB. While prior works have applied AEVB to flat topic models, it is not straightforward to parameterize the unbounded ancestral and fraternal topic distribution. In this paper, we provide a solution by applying doubly-recurrent neural networks (DRNN; Alvarez-Melis and Jaakkola, 2017), which have two recurrent structures, over the ancestors and the siblings, respectively.
Experimental results show that the TSNTM achieves competitive performance against a prior work (Blei et al., 2010) when inducing latent topics and tree structures. The TSNTM scales to larger datasets and allows for end-to-end training with neural models of several tasks such as aspect-based sentiment analysis (Esmaeili et al., 2019) and abstractive summarization (Wang et al., 2019).

Related Works
Following the pioneering work on tree-structured topic models by Griffiths et al. (2004), several extended models have been proposed (Ghahramani et al., 2010; Zavitsanos et al., 2011; Kim et al., 2012; Ahmed et al., 2013; Paisley et al., 2014). Our model is based on the modeling assumption of Blei et al. (2010), while parameterizing the topic distribution with AEVB.
In the context of applying AEVB to flat document or topic modeling (Miao et al., 2016; Srivastava and Sutton, 2017; Ding et al., 2018), Miao et al. (2017) proposed a model closely related to ours, applying recurrent neural networks (RNNs) to parameterize an unbounded flat topic distribution. Our work infers the topic distributions over an infinite tree with a DRNN, which enables us to induce latent tree structures. Goyal et al. (2017) used a tree-structured topic model with a variational autoencoder (VAE) to represent video frames as a tree. However, their approach is limited to smaller datasets. In fact, they used only 1,241 videos (corresponding to documents) for training and separately updated the VAE parameters and the posteriors of the topic model by mean-field approximation. This motivates us to propose the TSNTM, which scales to larger datasets and allows for end-to-end training with neural models for downstream tasks.

Tree-Structured Neural Topic Model
We present the generative process of documents and the posterior inference by our model. As shown in Figure 2, we draw a path from the root to a leaf node and a level for each word. The word is drawn from the multinomial distribution assigned to the topic specified by the path and level.
In contrast to the original nCRP-based model (Blei et al., 2010), we introduce neural architectures, f_π and f_θ, to transform a Gaussian sample into a topic distribution, allowing for posterior inference with AEVB. Specifically, we apply a DRNN to parameterize the path distribution over the tree.

Parameterizing Topic Distribution
A DRNN is a neural network decoder for generating tree-structured objects from encoded representations (Alvarez-Melis and Jaakkola, 2017). A DRNN consists of two RNNs, over the ancestors and the siblings, respectively (see Appendix A.2). We assume that these two recurrent structures can parameterize the unbounded ancestral and fraternal path distribution conditioned on a Gaussian sample x, using a finite number of parameters.
The hidden state, h_k, of the k-th topic is given by:

h_k = tanh(W_p h_par(k) + W_s h_k−1),   (9)

where h_par(k) and h_k−1 are the hidden states of the parent and the previous sibling of the k-th topic, respectively. From each h_k, we compute the breaking proportions, ν, and apply the tree-based stick-breaking construction in (7) to obtain the path distribution, π. Moreover, we parameterize the unbounded level distribution, θ, by passing a Gaussian vector through an RNN and computing the breaking proportions, η, in (8).
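As a concrete sketch, the DRNN update and the stick-breaking over a node's children can be written as follows. This is a minimal NumPy illustration: the weight names W_par, W_sib, w_nu, the initialization, and the omission of bias terms are assumptions for exposition, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H = 8                                       # hidden size (illustrative)
rng = np.random.default_rng(0)
W_par = rng.normal(scale=0.1, size=(H, H))  # ancestral recurrence weights
W_sib = rng.normal(scale=0.1, size=(H, H))  # fraternal recurrence weights
w_nu = rng.normal(scale=0.1, size=H)        # projects h_k to a breaking proportion

def drnn_cell(h_parent, h_prev_sibling):
    # One DRNN step: combine the parent's (ancestral) and the previous
    # sibling's (fraternal) hidden states into the new topic state.
    return np.tanh(W_par @ h_parent + W_sib @ h_prev_sibling)

def child_path_probs(pi_parent, child_states):
    # Split the parent's stick length over its children: each child takes
    # a proportion nu of what remains; the last child keeps the rest, so
    # the children's probabilities always sum to pi_parent.
    probs, remaining = [], pi_parent
    for h in child_states[:-1]:
        nu = sigmoid(w_nu @ h)
        probs.append(remaining * nu)
        remaining *= 1.0 - nu
    probs.append(remaining)
    return np.array(probs)

# Three children of the root; the chain is seeded by a Gaussian sample x.
x = rng.normal(size=H)
h1 = drnn_cell(x, np.zeros(H))
h2 = drnn_cell(x, h1)
h3 = drnn_cell(x, h2)
pi_children = child_path_probs(1.0, [h1, h2, h3])
```

Because the last child absorbs the remaining stick, the construction yields a proper distribution over any number of children without renormalization.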

Parameterizing Word Distribution
Next, we explain the word distribution assigned to each topic. We introduce the embeddings of the k-th topic, t_k ∈ R^H, and of the words, U ∈ R^{V×H}, to obtain the word distribution, β_k ∈ ∆^{V−1}, by (13):

β_k = softmax(U t_k / τ^{1/l}),   (13)

where τ^{1/l} is a temperature value that produces a sparser probability distribution over words as the level l becomes deeper (Hinton et al., 2014).
As the number of topics is unbounded, the word distributions must be generated dynamically. Hence, we introduce another DRNN to generate the topic embeddings as t_k = DRNN(t_par(k), t_k−1).
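The effect of the level-dependent temperature can be sketched as follows, assuming the softmax form β_k = softmax(U t_k / τ^{1/l}) with τ = 10; the dimensions and random embeddings are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def word_dist(U, t_k, level, tau=10.0):
    # beta_k = softmax(U t_k / tau^(1/l)): the effective temperature
    # tau^(1/l) shrinks toward 1 as the level l grows, so deeper topics
    # get sharper (sparser) word distributions from the same logits.
    return softmax(U @ t_k / tau ** (1.0 / level))

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
U = rng.normal(size=(50, 16))   # word embeddings (V=50, H=16), illustrative
t_k = rng.normal(size=16)       # one topic embedding

beta_l1 = word_dist(U, t_k, level=1)  # root level: flatter distribution
beta_l3 = word_dist(U, t_k, level=3)  # deeper level: sparser distribution
```

The entropy of the deeper topic's word distribution is lower, matching the intuition that leaf topics should concentrate on fewer, more specific words.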
Several neural topic models (Xie et al., 2015; Miao et al., 2017; He et al., 2017) have introduced a diversity regularizer to eliminate redundancy among the topics. While they force all topics to be orthogonal, this is not suitable for tree-structured topic models, which admit correlation between a parent and its children. Hence, we introduce a tree-specific diversity regularizer that, with t̃_ki = t_i − t_k, penalizes the cosine similarity between the offsets t̃_ki and t̃_kj for each pair of children i, j ∈ Chi(k) of every topic k not in Leaf, where Leaf and Chi(k) denote the set of topics with no children and the children of the k-th topic, respectively. By adding this regularizer to the variational objective, the children of each topic become orthogonal from the viewpoint of their parent, while parent-children correlations are allowed.
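A minimal sketch of such a regularizer (illustrative names; the exact weighting and sign within the variational objective are not reproduced here): it averages the cosine similarity between the offsets of sibling topics from their shared parent, leaving the parent-child direction itself unpenalized.

```python
import numpy as np

def sibling_similarity(t, children_of):
    # For every non-leaf topic k, measure pairwise cosine similarity between
    # the child offsets t~_ki = t_i - t_k; minimizing this pushes siblings
    # apart as seen from their parent without decorrelating parent and child.
    total, count = 0.0, 0
    for k, children in children_of.items():
        offsets = [t[i] - t[k] for i in children]
        for a in range(len(offsets)):
            for b in range(a + 1, len(offsets)):
                u, v = offsets[a], offsets[b]
                total += (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
                count += 1
    return total / max(count, 1)

# Toy embeddings: topic 0 is the parent; topics 1 and 2 are its children.
t_orth = {0: np.zeros(2), 1: np.array([1.0, 0.0]), 2: np.array([0.0, 1.0])}
t_same = {0: np.zeros(2), 1: np.array([1.0, 0.0]), 2: np.array([2.0, 0.0])}

r_orth = sibling_similarity(t_orth, {0: [1, 2]})  # orthogonal offsets
r_same = sibling_similarity(t_same, {0: [1, 2]})  # collinear offsets
```

Orthogonal sibling offsets score 0, collinear ones score 1, so minimizing this quantity spreads the children around their parent.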

Variational Inference with AEVB
Under our proposed probabilistic model, the likelihood of a document is given by (15), where φ ∈ ∆^{K−1} is the topic distribution, derived as φ_k = Σ_{l=1}^{L} θ_l (Σ_{c: c_l = k} π_c). As the latent variables c_n and z_n are integrated out in (15), the evidence lower bound for the document log-likelihood is derived with the variational distribution q(π, θ|w_d) approximating the posteriors.
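The mixing of the path and level distributions into the per-topic distribution φ follows directly from the definition φ_k = Σ_l θ_l Σ_{c: c_l = k} π_c; the tree, paths, and probabilities below are illustrative.

```python
def topic_mixture(paths, pi, theta):
    # phi_k = sum over levels l of theta_l times the total probability of
    # the paths c whose l-th node is topic k.
    phi = {}
    for path, p in zip(paths, pi):
        for level, k in enumerate(path):
            phi[k] = phi.get(k, 0.0) + theta[level] * p
    return phi

# A depth-3 tree: root 0, second-level topics 1 and 2, leaves 3-5.
# Each path lists the topic ids from the root to a leaf.
paths = [(0, 1, 3), (0, 1, 4), (0, 2, 5)]
pi = [0.5, 0.3, 0.2]      # path distribution (sums to 1)
theta = [0.2, 0.3, 0.5]   # level distribution (sums to 1)
phi = topic_mixture(paths, pi, theta)
```

Since every path passes through the root, the root receives exactly θ_1, and φ sums to 1 whenever π and θ do.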

Dynamically Updating the Tree Structure
To allow an unbounded tree structure, we introduce two heuristic rules for adding and pruning branches. We compute the proportion of words assigned to topic k: p_k = (Σ_{d=1}^{D} N_d φ_{d,k}) / (Σ_{d=1}^{D} N_d). For each non-leaf topic k, if p_k is more than a threshold, a child is added to refine the topic. For each topic k, if the cumulative proportion of topics over its descendants, Σ_{j∈Des(k)} p_j, is less than a threshold, the k-th topic and its descendants are removed (Des(k) denotes the set containing topic k and its descendants). We also remove topics with no children at the bottom level.
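The two rules can be sketched as follows, with hypothetical per-topic proportions and the threshold of 0.05 used in the experiments (a simplified illustration; the actual update also re-initializes the affected parameters).

```python
def descendants(k, children_of):
    # Topic k together with every topic below it in the tree.
    out = [k]
    for c in children_of.get(k, []):
        out.extend(descendants(c, children_of))
    return out

def grow_and_prune(p, children_of, threshold=0.05):
    # Rule 1: a non-leaf topic whose own word proportion p_k exceeds the
    # threshold receives a new child to refine it.
    to_grow = [k for k, ch in children_of.items()
               if ch and p.get(k, 0.0) > threshold]
    # Rule 2: a topic whose cumulative proportion over itself and its
    # descendants falls below the threshold is removed with its subtree.
    to_prune = [k for k in children_of
                if sum(p.get(j, 0.0)
                       for j in descendants(k, children_of)) < threshold]
    return to_grow, to_prune

children_of = {0: [1, 2], 1: [3], 2: [], 3: []}   # a toy tree
p = {0: 0.10, 1: 0.50, 2: 0.01, 3: 0.39}          # word proportions per topic

to_grow, to_prune = grow_and_prune(p, children_of)
```

Here topics 0 and 1 are refined with new children, while topic 2, which attracts almost no words, is pruned together with its (empty) subtree.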

Datasets
In our experiments, we use the 20NewsGroups and the Amazon product reviews. The 20NewsGroups is a collection of 20 different news groups containing 11,258 training and 7,487 testing documents. For the Amazon product reviews, we use the domain of Laptop Bags provided by Angelidis and Lapata (2018), with 31,943 training, 385 validation and 416 testing documents. We use the provided test documents in our evaluations, while randomly splitting the remainder of the documents into training and validation sets.

Baseline Methods
As baselines, we use a tree-structured topic model based on the nested Chinese restaurant process (nCRP) with collapsed Gibbs sampling (Blei et al., 2010). In addition, we use a flat neural topic model, i.e. the recurrent stick-breaking process (RSB), which constructs the unbounded flat topic distribution via an RNN (Miao et al., 2017).

Implementation Details
For the TSNTM and the RSB, we use 256-dimensional word embeddings, a one-hidden-layer MLP with 256 hidden units, and a one-layer RNN with 256 hidden units to construct the variational parameters. We set the hyper-parameters of the Gaussian prior distribution, µ_0 and σ_0^2, as a zero mean vector and a unit variance vector with 32 dimensions, respectively. We train the model using AdaGrad (Duchi et al., 2011) with a learning rate of 10^-2, an initial accumulator value of 10^-1, and a batch size of 64. We grow and prune the tree with a threshold of 0.05 in Section 3.4 and set the temperature as τ = 10 in Section 3.2.
Regarding the nCRP-based model, we set the nCRP parameter as γ = 0.01, the GEM parameters as π = 10 and m = 0.5, and the Dirichlet parameter as η = 5.
The hyper-parameters of each model are tuned based on the perplexity on the validation set of the Amazon product reviews. We fix the number of levels in the tree at 3, with an initial number of branches of 3 for both the second and third levels.

Evaluating Topic Interpretability
Several works (Chang et al., 2009; Newman et al., 2010) pointed out that perplexity is not suitable for evaluating topic interpretability. Meanwhile, Lau et al. (2014) showed that the normalized pointwise mutual information (NPMI) between all pairs of words in each topic closely corresponds to the ranking of topic interpretability by human annotators. Thus, we use NPMI instead of perplexity as the primary evaluation measure, following Srivastava and Sutton (2017) and Ding et al. (2018). Table 1 shows the average NPMI of the topics induced by each model. Our model is competitive with the nCRP-based model and the RSB on each dataset. This indicates that our model induces topics as interpretable as those of the other models.
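NPMI can be sketched as follows, using document-level co-occurrence probabilities (a common formulation of the metric; the smoothing and the choice of reference corpus vary across papers, so this is an assumption, not the paper's exact setup).

```python
import numpy as np
from itertools import combinations

def topic_npmi(topic_words, docs):
    # Average normalized PMI over all word pairs of a topic:
    # NPMI(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))) / -log p(w1, w2),
    # with probabilities estimated from document-level co-occurrence.
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    def p(*ws):
        return sum(all(w in s for w in ws) for s in doc_sets) / n
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = p(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # never co-occur: minimum NPMI
        else:
            scores.append(np.log(p12 / (p(w1) * p(w2))) / -np.log(p12))
    return float(np.mean(scores))

# Toy corpus: "bag" and "strap" always co-occur; "bag" and "color" never do.
docs = [["bag", "strap"], ["bag", "strap"], ["color"], ["color"]]
perfect = topic_npmi(["bag", "strap"], docs)
disjoint = topic_npmi(["bag", "color"], docs)
```

NPMI is bounded in [-1, 1]: words that always co-occur score 1, and words that never co-occur score -1, which is why it tracks human judgments of topic coherence better than perplexity.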
As a note, we also show the average perplexity over the documents for each model in Table 2. For the AEVB-based models (the RSB and the TSNTM), we calculate the upper bound of the perplexity using the ELBO, following Miao et al. (2017) and Srivastava and Sutton (2017). In contrast, for the nCRP-based model, we estimate it by sampling the posteriors with collapsed Gibbs sampling.
Even though it is difficult to compare them directly, the perplexity of the nCRP-based model is lower than that of the AEVB-based models. This tendency corresponds to the results of Srivastava and Sutton (2017) and Ding et al. (2018), which report that models with collapsed Gibbs sampling achieve the lowest perplexity in comparison with AEVB-based models. In addition, Ding et al. (2018) also report that there is a trade-off between perplexity and NPMI. Therefore, it is natural that our model is competitive with the other models with regard to NPMI, while there is a significant difference in perplexity.

Evaluating Tree Structure
For evaluating the characteristics of the tree structure, we adopt two metrics, following Kim et al. (2012): topic specialization and hierarchical affinity.
Topic specialization: An important characteristic of the tree structure is that the most general topic is assigned to the root, while topics become more specific toward the leaves. To quantify this characteristic, we measure the specialization score as the cosine similarity of the word distribution between each topic and the entire corpus. As the entire corpus is regarded as the most general topic, more specific topics have lower similarity scores. Figure 3 presents the average topic specialization scores for each level. While the root of the nCRP-based model is more general than that of our model, the tendency is roughly similar for both models.
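The specialization score is simply a cosine similarity against the corpus-wide word distribution; a minimal sketch with toy distributions over an illustrative four-word vocabulary:

```python
import numpy as np

def specialization(beta_k, corpus_dist):
    # Cosine similarity between a topic's word distribution and the
    # empirical word distribution of the whole corpus; a lower score
    # indicates a more specific topic.
    return float(beta_k @ corpus_dist /
                 (np.linalg.norm(beta_k) * np.linalg.norm(corpus_dist)))

corpus = np.array([0.4, 0.3, 0.2, 0.1])        # corpus word distribution
generic = np.array([0.35, 0.3, 0.25, 0.1])     # root-like, broad topic
specific = np.array([0.01, 0.01, 0.01, 0.97])  # leaf-like, peaked topic

s_generic = specialization(generic, corpus)
s_specific = specialization(specific, corpus)
```

A root-like topic that mirrors the corpus scores near 1, while a peaked leaf-like topic scores much lower, which is the pattern the level-wise averages in Figure 3 summarize.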
Hierarchical affinity: It is preferable that a parent topic is more similar to its own children than to the topics descended from other parents. To verify this property, for each parent in the second level, we calculate the average cosine similarity of the word distribution to its children and to non-children, respectively. Figure 4 shows the average cosine similarity over the topics. While the nCRP-based model induces child topics only slightly more similar to their parents, our model infers child topics with markedly more similarity to their parent topics. Moreover, the lower scores of the TSNTM also indicate that it induces more diverse topics than the nCRP-based model.
Example: In Section 1, an example of the induced topics and the latent tree for the laptop bag reviews is shown in Figure 1.

Evaluating Data Scalability
To evaluate how our model scales with the size of the datasets, we measure the training time until convergence for various numbers of documents. We randomly sample several numbers of documents (1,000, 2,000, 4,000, 8,000, 16,000 and all) from the training set of the Amazon product reviews and measure the training time for each. The training is stopped when the perplexity on the validation set has not improved for 10 consecutive iterations over the entire batches. We measure the time to sample the posteriors or update the model parameters, excluding the time to compute the perplexity.
As shown in Figure 5, as the number of documents increases, the training time of our model does not change considerably, whereas that of the nCRP-based model increases significantly. Our model can be trained approximately 15 times faster than the nCRP-based model with 32,000 documents.

Conclusion
We proposed a novel tree-structured topic model, the TSNTM, which parameterizes the topic distribution over an infinite tree by a DRNN.
Experimental results demonstrated that the TSNTM achieves competitive performance when inducing latent topics and their tree structures, as compared to a prior tree-structured topic model (Blei et al., 2010). With the help of AEVB, the TSNTM can be trained approximately 15 times faster than the nCRP-based model and scales to larger datasets. This allows the tree-structured topic model to be incorporated with recent neural models for downstream tasks, such as aspect-based sentiment analysis (Esmaeili et al., 2019) and abstractive summarization (Wang et al., 2019). By incorporating our model instead of flat topic models, these tasks can be provided with information at multiple levels of granularity.

A Appendices
A.1 Tree-Based Stick-Breaking Construction

Figure 6 describes the process of the tree-based stick-breaking construction. At the first level, the stick length is π_1 = 1. Then, the stick-breaking construction is applied to the first-level stick to obtain the path distribution over the second level. For instance, if the second level contains K = 3 topics, the probability of each path is obtained as π_11 = π_1 ν_11, π_12 = π_1 ν_12 (1 − ν_11), and the remaining stick π_13 = π_1 (1 − ν_12)(1 − ν_11). Generally, for any value of K, the construction satisfies Σ_{k=1}^{K} π_1k = π_1. The same process is applied to each stick proportion of the second level and continues until it reaches the bottom level.

A.2 Doubly-Recurrent Neural Networks

Figure 7 shows the architecture of doubly-recurrent neural networks (Alvarez-Melis and Jaakkola, 2017). A DRNN consists of two recurrent neural networks, over the ancestors and the siblings respectively, which are combined in each cell as described in (9).
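The second-level example above can be checked numerically; the identity Σ_k π_1k = π_1 holds for arbitrary breaking proportions, which the following snippet illustrates.

```python
import numpy as np

rng = np.random.default_rng(1)
pi_1 = 1.0
nu_11, nu_12 = rng.uniform(size=2)  # arbitrary breaking proportions in (0, 1)

pi_11 = pi_1 * nu_11                      # first child's path probability
pi_12 = pi_1 * nu_12 * (1 - nu_11)        # second child's share of the rest
pi_13 = pi_1 * (1 - nu_12) * (1 - nu_11)  # remaining stick for the last child

total = pi_11 + pi_12 + pi_13  # telescopes back to pi_1 exactly
```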