CluHTM - Semantic Hierarchical Topic Modeling based on CluWords

Hierarchical Topic Modeling (HTM) exploits latent topics and the relationships among them as a powerful tool for data analysis and exploration. Despite its advantages over traditional topic modeling, HTM poses its own challenges, such as (1) topic incoherence, (2) unreasonable (hierarchical) structure, and (3) issues related to the definition of the "ideal" number of topics and the depth of the hierarchy. In this paper, we advance the state-of-the-art on HTM by means of the design and evaluation of CluHTM, a novel non-probabilistic hierarchical matrix factorization aimed at solving the specific issues of HTM. CluHTM's novel contributions include: (i) the exploration of a richer text representation that encapsulates both global (dataset-level) and local semantic information – when combined, these pieces of information help to solve the topic incoherence problem as well as issues related to the unreasonable structure; (ii) the exploitation of a stability analysis metric for defining the number of topics and the "shape" of the hierarchical structure. In our evaluation, considering twelve datasets and seven state-of-the-art baselines, CluHTM outperformed the baselines in the vast majority of cases, with gains of around 500% over the strongest ones. We also provide qualitative and quantitative statistical analyses of why our solution works so well.


Introduction
Topic Modeling (TM) is the task of automatically extracting latent topics (e.g., a concept or a theme) from a collection of textual documents. Such topics are usually defined as a probability distribution over a fixed vocabulary (a set of words) that refers to some subject and describes the latent topic as a whole. Topics might be related to each other, and if they are defined at different semantic granularity levels (more general or more specific), this naturally induces a hierarchical structure. Although traditional TM strategies are of great importance to extract latent topics, the relationships among those topics are also extremely valuable for data analysis and exploration. In this context, Hierarchical Topic Modeling (HTM) aims to induce latent topics from text data while preserving the inherent hierarchical structure (Teh et al., 2006). HTM has proven useful in relevant scenarios such as (i) hierarchical categorization of Web pages (Ming et al., 2010), (ii) extracting aspect hierarchies in reviews (Kim et al., 2013) and (iii) discovering research topic hierarchies in academic repositories (Paisley et al., 2014).
Despite its practical importance and potential advantages over traditional TM, HTM poses its own challenges, mainly: (i) topic incoherence; (ii) unreasonable hierarchical structure; and (iii) the definition of the number of topics at each level of the hierarchy. Topic incoherence has to do with the need to learn meaningful topics; that is, the top words that represent a topic have to be semantically consistent with each other. Unreasonable structure relates to the extracted hierarchical topic structure: topics near the root should be more general, while topics close to the leaves should be more specific; furthermore, child topics must be coherent with their corresponding parent topics, guaranteeing a reasonable hierarchical structure. Finally, the number of topics at each hierarchy level is usually unknown and cannot be set beforehand to a predefined value, since it directly depends on the latent topical distribution of the data.
Both supervised and unsupervised approaches have been applied to HTM. Supervised methods use prior knowledge to build the hierarchical tree structure, such as labeled data or linking relationships among documents (Wang et al., 2015). Those strategies are infeasible when there is no explicit taxonomy or hierarchical scheme to associate with documents, or when such an association (a.k.a. labeling) is very cumbersome or costly to obtain. Unsupervised HTM (uHTM) deals with such limitations. uHTM methods do not rely on prior knowledge (such as taxonomies or labeled hierarchies), having the additional challenge of discovering the hierarchy of topics based solely on the data at hand.
HTM solutions can also be roughly grouped into non-probabilistic and probabilistic models. In probabilistic strategies, textual data is considered to be "ruled" by an unknown probability distribution that governs the relationships between documents and topics, hierarchically. The major drawback of this type of approach has to do with the number of parameters in the model, which grows rapidly with the number of documents. This leads to learning inefficiencies and proneness to over-fitting, mainly for short textual data (Tang et al., 2014). To overcome these drawbacks, non-probabilistic models aim at extracting hierarchical topic models through matrix factorization techniques instead of learning probability distributions. Such strategies also pose challenges. They are usually limited to just local information (i.e., data limitation) as they go deeper into the hierarchy when extracting the latent topics. That is, as one moves deeper into the hierarchical structure representing the latent topics, the available data rapidly shrinks, directly impacting the quality of the extracted topics (in terms of both coherence and structure reasonableness). Probabilistic models mitigate this phenomenon as they rely on global information when handling the probability distributions (Xu et al., 2018). Because of that, the current main HTM methods are built on probabilistic foundations (Griffiths et al., 2004; Mimno et al., 2007).
In this paper, we aim at exploring the best properties of both non-probabilistic and probabilistic strategies while mitigating their main drawbacks. To the best of our knowledge, the only work to explore this research avenue is (Liu et al., 2018). In that work, the authors explore NMF for solving HTM tasks by enforcing three optimization constraints during matrix factorization: global independence, local independence, and information consistency. Those constraints allow their strategy, named HSOC, to produce hierarchical topics that somehow preserve topic coherence and reasonable hierarchical structures. However, as we shall see in our experiments, HSOC is still not capable of extracting coherent topics when applied to short text data, which is currently prominent on the Web, especially in social network environments.
We here propose a distinct approach, taking a data engineering perspective instead of focusing on the optimization process. More specifically, we explore a matrix factorization solution properly designed to exploit global information (akin to probabilistic models) when learning hierarchical topics, while ensuring proper topic coherence and structure reasonableness. This strategy allows us to build a data-efficient HTM method, less prone to over-fitting, that also enjoys the desired properties of topic coherence and reasonable (hierarchical) structure. We do so by applying a matrix factorization method over a richer text representation that encapsulates both global and local semantic information when extracting the hierarchical topics.
Recent non-probabilistic methods (Shi et al., 2018; Viegas et al., 2019) have produced top-notch results on traditional TM tasks by taking advantage of semantic similarities obtained from distances between words within an embedding space (Mikolov et al., 2013; Pennington et al., 2014). Our critical insight for HTM was to note that the richer (semantic) representation offered by distributional word embeddings can be readily exploited as a global source of information at deeper levels of the hierarchical structure of topics. This insight gives us an essential building block to overcome the challenges of matrix factorization strategies for HTM without the need for additional optimization constraints.
In (Viegas et al., 2019), the authors exploit the nearest words of a given "pre-trained" word embedding to generate "meta-words", a.k.a. CluWords, capable of expanding and enhancing the document representation in terms of syntactic and semantic information. Such an improved representation is capable of mitigating the drawbacks of using the projected space of word embeddings, as well as of extracting cohesive topics when applying non-negative matrix factorization for topic modeling.
Motivated by this finding, we here advance the state-of-the-art in HTM by designing, developing and evaluating an unsupervised non-probabilistic HTM method that exploits CluWords as a key building block for TM when capturing the latent hierarchical structure of topics. We focus on the NMF method for uncovering the latent hierarchy as it is the most effective matrix factorization method for our purposes. Finally, the last aspect that needs to be addressed for the successful use of NMF for HTM is the definition of the appropriate number of topics k to be extracted. Choosing just a few topics will produce overly broad results, while choosing too many will result in over-clustering the data into many redundant, highly similar topics. Thus, our proposed method uses a stability analysis concept to automatically select the best number of topics for each level of the hierarchy.
As we shall see, our approach outperforms HSOC and hLDA (current state-of-the-art) on both small and large text datasets, often by large margins. To summarize, our main contributions are: (i) a novel non-probabilistic HTM strategy -CluHTM -based on NMF and CluWords that excels on HTM tasks (on both short and large text data) while ensuring topic coherence and reasonable topic hierarchies; (ii) the original exploitation of a cross-level stability analysis metric for defining the number of topics and, ultimately, 'the shape' of the hierarchical structure; as far as we know, this metric has never been applied with this goal; (iii) an extensive empirical analysis of our proposal considering twelve datasets and seven state-of-the-art baselines. In our experimental evaluation, CluHTM outperformed the baselines in the vast majority of the cases (in the case of NPMI, in all cases), with gains of 500% when compared to hLDA and 549% when compared to HSOC, some of the strongest baselines; and finally, (iv) qualitative and quantitative statistical analyses of the individual components of our solution.

Related Work
Hierarchical Topic Modeling (HTM) can be roughly grouped into supervised and unsupervised methods. Considering the supervised HTM strategies, we here highlight some relevant supervised extensions to the traditional Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a widely used strategy for topic modeling (TM). LDA assumes a Dirichlet probability distribution over textual data to estimate the probabilities of words for each topic. In (Mcauliffe and Blei, 2008), the authors propose SLDA, a supervised extension of LDA that provides a statistical model for labeled documents. SLDA connects each document to a regression variable to find latent topics that will best predict the response variables for future unlabeled documents. Based on SLDA, Hierarchical Supervised LDA (HSLDA) (Perotte et al., 2011) incorporates the hierarchy of multilabel and pre-labeled data into a single model, thus providing extended prediction capabilities w.r.t. the latent hierarchical topics. The Supervised Nested LDA (SNLDA) (Resnik et al., 2015), also based on SLDA, implements a generative probabilistic strategy where topics are sampled from a probability distribution. SNLDA extends SLDA by assuming that the topics are organized into a tree structure. Although our focus is on unsupervised solutions, we include SLDA, HSLDA and SNLDA as baselines in our experimental evaluation.
We now turn our attention to unsupervised HTM strategies, in which a hierarchical structure is learned during topic extraction. In (Mimno et al., 2007) the authors propose the Hierarchical Pachinko Allocation Model (hPAM), an extension of Pachinko Allocation (PAM) (Li and McCallum, 2006). In PAM, documents are a mix of distributions over an individual topic set, using a directed acyclic graph to represent the co-occurrences of topics. Each node in such a graph represents a Dirichlet distribution. At the highest level of PAM there is only a single node, while the lower levels represent a distribution over the nodes of the next higher level. In hPAM, each node is associated with a distribution over the vocabulary of documents.
In (Griffiths et al., 2004), the authors propose the hLDA algorithm, also an extension of LDA, which is considered state-of-the-art in HTM. In hLDA, in addition to the Dirichlet distribution over text, the nested Chinese Restaurant Process (nCRP) is used to generate a hierarchical tree. nCRP takes two parameters: the tree depth and a parameter γ. At each node of the tree, a document can belong to an existing path or create a new tree path, with probability controlled by γ. More recently, in (Xu et al., 2018), the authors propose the unsupervised HTM strategy named Knowledge-based Hierarchical Topic Model (KHTM). This method is based on hLDA and, as such, models a generative process whose parameter estimation strategy is based on Gibbs sampling. KHTM is able to uncover prior knowledge (such as the semantic correlation among words), organizing it into a hierarchy consisting of knowledge sets (k-sets). More specifically, the method first generates, through hLDA, an initial set of topics. After comparing pairs of topics, those with similarity higher than α (a.k.a. k-sets) are filtered so that the first 20 words of each topic are kept and the remaining ones are discarded. Those extracted k-sets are then used as an extra weight when extracting the final topics. All these methods are used as baselines in our experimentation.
Probably the most similar work to ours is the HSOC strategy, proposed in (Liu et al., 2018), which uses NMF for solving HTM tasks. In order to mitigate the main drawbacks of NMF in the HTM setting, HSOC relies on three optimization constraints to properly drive the matrix factorization operations when uncovering the hierarchical topic structure. Such constraints are global independence, local independence, and information consistency, and they allow HSOC to derive hierarchical topics that somehow preserve topic coherence and reasonable hierarchical structures.
As can be observed, almost all models, supervised or unsupervised, are based on LDA. As discussed in Section 1, though matrix factorization strategies normally present better results than Dirichlet-based strategies in TM tasks, for HTM the situation is quite different. In fact, matrix factorization methods face difficult challenges in HTM, mainly regarding data size as one goes deeper into the hierarchy. More specifically, at every hierarchical level, a matrix factorization needs to be applied to increasingly smaller data sets, ultimately leading to insufficient data at lower hierarchy levels. These approaches also do not exploit semantics nor any external enrichment, relying only on the statistical information extracted from the dataset. Contrarily, here we propose a new HTM approach, called CluHTM, which exploits externally built word embedding models to incorporate global semantic information into the hierarchical topic tree creation. This brings important advantages to our proposal in terms of effectiveness, topic coherence, and hierarchy reasonableness altogether.

CluWords Representation
CluWords (Viegas et al., 2019) combine the traditional Bag of Words (BoW) statistical representation with semantic information related to the words present in the documents. The semantic context is obtained by employing a "pre-trained" word representation, such as FastText (Mikolov et al., 2018). Figure 1 presents the process of transforming each original word into a CluWord (cluster of words) representation. First, the strategy uses information about the dataset, as well as a pre-trained word embedding (i.e., FastText), to build semantic relationships between a word and its neighbors (described in Section 3.1.1). Next, statistical information on words (e.g., term frequency, document frequency) is extracted from the dataset. Then, both semantic and statistical information are combined to measure the importance of each CluWord, as explained in Section 3.1.2. CluWords enjoy the best of "two worlds": they conjugate statistical information on the dataset, which has been demonstrated to be very effective, efficient and robust in text applications, enriched with semantic contextual information captured by distributional word embeddings adapted to the dataset by the clusterization process described next.

Cluwords Generation
Let W be the set of vectors representing each word t in the dataset vocabulary V. Each word t ∈ V has a corresponding vector u_t ∈ W. The CluWords representation is defined as in Figure 1. The semantic matrix in Figure 1 is defined as C ∈ R^{|V|×|V|}, where each dimension has the size of the vocabulary (|V|); t' indexes the rows of C while t indexes the columns. Finally, each entry C_{t',t} is computed according to Eq. 1:

C_{t',t} = ω(u_{t'}, u_t) if ω(u_{t'}, u_t) ≥ α, and C_{t',t} = 0 otherwise,   (1)
where ω(u_{t'}, u_t) is the cosine similarity between the embedding vectors and α is a similarity threshold that acts as a regularizer for the representation. Larger values of α lead to sparser representations. In this notation, each column t of the semantic matrix C forms a CluWord t, and each entry C_{t',t} receives the cosine similarity between the vectors u_{t'} and u_t in the embedding space W if it is greater than or equal to α; otherwise, C_{t',t} receives zero, according to Eq. 1.
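As a concrete illustration, the thresholded cosine-similarity matrix of Eq. 1 can be sketched as follows. This is not the authors' implementation: the function name and the dense-then-sparsify approach are our own simplifications, and a real-sized vocabulary would call for a memory-aware nearest-neighbors computation instead of the full |V|×|V| product.

```python
# Sketch (not the authors' code): building the CluWords semantic matrix C
# of Eq. 1. `embeddings` maps each vocabulary word to a numpy vector
# (e.g. loaded from pre-trained FastText).
import numpy as np
from scipy.sparse import csr_matrix

def build_semantic_matrix(vocab, embeddings, alpha=0.4):
    """C[t_prime, t] = cos(u_t_prime, u_t) if >= alpha, else 0 (Eq. 1)."""
    U = np.array([embeddings[t] for t in vocab], dtype=np.float64)
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit-norm rows
    sims = U @ U.T                                  # all-pairs cosine similarity
    sims[sims < alpha] = 0.0                        # alpha acts as a regularizer
    return csr_matrix(sims)                         # sparse |V| x |V| matrix
```

Larger `alpha` zeroes out more entries, yielding the sparser representations described above.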

TFIDF Weight for CluWords
In Figure 1, the CluWords representation is defined as the product between the statistical matrix (a.k.a. term-frequency matrix) and the semantic matrix C. The statistical matrix TF ∈ R^{|D|×|V|} is such that each position TF_{d,t} holds the frequency of word t in document d. Thus, given a CluWord t and a document d, its representation corresponds to CW_{d,t} = TF_{d,·} × C_{·,t}, where C_{·,t} holds the semantic scores for the CluWord t, according to Eq. 1.
The TFIDF weighting for a CluWord t in a document d is then defined as TFIDF_{d,t} = CW_{d,t} × IDF(t), where the IDF term is computed over the documents in which the CluWord t is active (i.e., CW_{d,t} > 0); Viegas et al. (2019) further refine this term using the mean semantic weights of each CluWord per document.
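A minimal sketch of this weighting, assuming a dense term-frequency matrix and the plain document-frequency IDF (the refined IDF of Viegas et al. (2019) based on mean semantic weights is deliberately omitted; the function name is ours):

```python
# Sketch of the CluWords weighting: CW = TF x C, then an IDF factor.
# Simplified: uses a classical document-frequency IDF rather than the
# mean-semantic-weight IDF of Viegas et al. (2019).
import numpy as np

def cluword_tfidf(TF, C):
    """TF: |D| x |V| term-frequency matrix; C: |V| x |V| semantic matrix."""
    CW = TF @ C                                   # CluWord "term frequency"
    n_docs = TF.shape[0]
    df = np.count_nonzero(CW > 0, axis=0)         # docs where each CluWord is active
    idf = np.log(n_docs / np.maximum(df, 1))      # guard against division by zero
    return CW * idf                               # broadcast IDF over documents
```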

Stability Measure
The Stability measure is motivated by the term-centering approach generally taken in topic modeling strategies, where topics are usually summarized as a truncated set of top words (Greene et al., 2014). The intuition behind this strategy is, given some K topics, to measure whether running multiple random samplings of a topic modeling strategy results in stability, in terms of the p top words extracted from the topics. Given a range of topics [K_min, K_max] and some topic modeling strategy (in our case, the Non-negative Matrix Factorization method), the strategy proceeds as follows. First, it learns a topic model considering the complete data set D, which will be used as a reference point (W_D) for analyzing the stability afforded by the K topics. Note that the p top words represent each topic. Subsequently, S samples are randomly drawn from D without replacement, each forming a subset of the documents in D. Then, |S| topic models are generated, one for each subsample (W_{S_i}).
To measure the quality of K topics, the Stability measure computes the mean agreement over each pair (W_D, W_{S_i}). The goal is to find the best match between the p top words of the compared topics. The agreement is defined as agree(W_x, W_y) = (1/p) Σ_{i=1}^{p} AJ(w_{x_i}, ρ(w_{x_i})), where AJ(·) is the average Jaccard coefficient used to compare the similarity between the word rankings, and ρ(·) is the optimal permutation of the words in W_{S_i}, which can be found in O(p³) time by solving the minimal-weight bipartite matching problem with the Hungarian method (Kuhn, 2010).
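The agreement computation can be sketched as follows, using the Hungarian method (via `scipy.optimize.linear_sum_assignment`) to find the optimal topic matching; the helper names are our own:

```python
# Sketch of the agreement term used by the Stability measure (Greene et al.,
# 2014): topics of a sample model are matched one-to-one to reference topics
# via the Hungarian method, and matched pairs are scored with the average
# Jaccard coefficient over their top-word rankings.
import numpy as np
from scipy.optimize import linear_sum_assignment

def avg_jaccard(a, b):
    """Average Jaccard over the top-1..top-p prefixes of two ranked word lists."""
    scores = []
    for depth in range(1, len(a) + 1):
        sa, sb = set(a[:depth]), set(b[:depth])
        scores.append(len(sa & sb) / len(sa | sb))
    return sum(scores) / len(scores)

def agreement(ref_topics, sample_topics):
    """Mean AJ over the optimal one-to-one matching of topics."""
    k = len(ref_topics)
    cost = np.zeros((k, k))
    for i, r in enumerate(ref_topics):
        for j, s in enumerate(sample_topics):
            cost[i, j] = -avg_jaccard(r, s)       # negate: we maximize similarity
    rows, cols = linear_sum_assignment(cost)      # Hungarian method
    return -cost[rows, cols].mean()
```

Averaging this agreement over all |S| subsample models, for each candidate K, yields the stability score used to pick the number of topics.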

Proposed Solution
CluHTM is an iterative method able to automatically define the best number of topics at each level of the hierarchy, given a range [K_min, K_max] of possible numbers of topics.
CluHTM explores CluWords and Non-negative Matrix Factorization (NMF) (Lee and Seung, 2001), one of the main non-probabilistic strategies. Finally, the Stability method (described in Section 3) is used to select the NMF parameter k (a.k.a. the number of topics).
CluHTM has five inputs (Algorithm 1): (i) D_max, the depth down to which we want to extract the hierarchical structure; (ii) K_min and K_max, which control the range of the number of topics; this range is used at all levels of the hierarchy; (iii) T, the input text data; and (iv) W, the "pre-trained" word embedding vector space used in the CluWords generation. The output is the hierarchical structure H of p top words for each topic. The method starts by creating the root topic (lines 2-3 of Algorithm 1), which is composed of all documents in T. Since the method is iterative, each iteration is controlled by a queue scheme to build the hierarchical structure. Thus, at each iteration (line 3), the algorithm produces the CluWords representation for the documents ∈ T (line 5), chooses the number of topics exploiting the Stability measure (line 6), and runs the NMF method (line 7) to extract the p words for each topic in O (line 8). Then, in the loop of line 9, each topic is stored in the queue, together with the respective documents of each topic.
Summarizing, our solution exploits global semantic information (captured by CluWords) within local factorizations, limited by a stability criterion that defines the 'shape' of the hierarchical structure. Though simple (and original), the combination of these ideas is extremely powerful for solving the HTM task, as we will see next.
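The queue-driven loop of Algorithm 1 can be sketched as below, with the CluWords construction, the stability-based model selection and the NMF factorization abstracted as injected callables. All names are hypothetical stand-ins for the components described earlier, not the authors' code:

```python
# Sketch of the CluHTM main loop (Algorithm 1). The callables stand in for:
#   build_rep(docs)   -> document representation (e.g. CluWords TFIDF matrix)
#   select_k(rep)     -> number of topics chosen via the Stability measure
#   factorize(rep, k) -> per-document topic assignment (e.g. argmax of NMF's W)
#   top_words(rep, k) -> list of k ranked word lists describing the topics
from collections import deque

def cluhtm(docs, build_rep, select_k, factorize, top_words, d_max=3):
    hierarchy = {}
    queue = deque([((), docs)])                   # root node holds all documents
    while queue:
        path, node_docs = queue.popleft()
        if len(path) >= d_max or len(node_docs) < 2:
            continue                              # stop at max depth / tiny nodes
        rep = build_rep(node_docs)
        k = select_k(rep)
        hierarchy[path] = top_words(rep, k)
        labels = factorize(rep, k)
        for topic in range(k):                    # enqueue one child per topic
            child = [d for d, l in zip(node_docs, labels) if l == topic]
            if child:
                queue.append((path + (topic,), child))
    return hierarchy
```

Each hierarchy node is keyed by its path of topic indices from the root, so the "shape" of the tree emerges from the per-node stability choices.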

Experimental Setup
The primary goal of our solution is to effectively perform hierarchical topic modeling so that more coherent topics can be extracted. To evaluate topic model coherence, we consider 12 real-world datasets as reference, all obtained from previous works in the literature. For all datasets, we performed stopword removal (using the standard SMART list) and removed words such as adverbs, using the VADER lexicon dictionary (Hutto and Gilbert, 2014), as the vast majority of the essential words for identifying topics are nouns and verbs. These procedures improved both the efficiency and effectiveness of all analyzed strategies. Table 1 provides a summary of the reference datasets, reporting the number of features (words) and documents, as well as the mean number of words per document (density) and the corresponding references. We compare the HTM strategies using representative topic quality metrics in the literature (Nikolenko, 2016; Nikolenko et al., 2017). We consider three classes of topic quality metrics based on three criteria: (a) coherence, (b) mutual information, and (c) semantic representation. In this paper, we focus on these three criteria since they are the most used metrics in the literature (Shi et al., 2018). We consider three topic lengths (5, 10 and 20 words) in our evaluation, since different lengths may pose different challenges.
Regarding the metrics, coherence captures ease of interpretation via co-occurrence. Words that frequently co-occur in similar contexts in a corpus are easier to correlate, since they usually define a more well-defined "concept" or "topic". We employ an improved version of the regular coherence (Nikolenko, 2016), called Coherence, defined as

Coherence = Σ_{i=2}^{N} Σ_{j=1}^{i-1} log( (d(w_i, w_j) + ε) / d(w_j) ),

where d(w_j) denotes the number of documents containing w_j, d(w_i, w_j) is the number of documents that contain both w_i and w_j together, and ε is a smoothing factor used to prevent log(0).
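A direct sketch of this score over a toy corpus, with document frequencies computed by brute force (assumes every top word occurs in at least one document; helper names are ours):

```python
# Sketch of the co-occurrence Coherence score as reconstructed above:
# d(w) counts documents containing w; d(w_i, w_j) counts documents
# containing both words.
import math

def coherence(top_words, docs, eps=1e-12):
    doc_sets = [set(d.split()) for d in docs]
    def d1(w):
        return sum(1 for s in doc_sets if w in s)
    def d2(w1, w2):
        return sum(1 for s in doc_sets if w1 in s and w2 in s)
    score = 0.0
    for i in range(1, len(top_words)):            # pairs (i, j) with j < i
        for j in range(i):
            score += math.log((d2(top_words[i], top_words[j]) + eps)
                              / d1(top_words[j]))
    return score
```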
Another class of topic quality metrics is based on pairwise pointwise mutual information (PMI) between the top words in a topic. It captures how much information one word "gains" given the occurrence of the other, taking dependencies between words into consideration. Following recent work (Nikolenko, 2016), we here compute a normalized version of PMI (NPMI) where, for a given ordered set of top words W_t = (w_1, ..., w_N) in a topic:

NPMI(W_t) = Σ_{i<j} [ ln( p(w_i, w_j) / (p(w_i) p(w_j)) ) / ( −ln p(w_i, w_j) ) ].

Finally, the third class of metrics is based on the distributed word representations introduced in (Nikolenko, 2016). The intuition is that, in a well-defined topic, the words should be semantically similar, or at least related, so as to be easily interpreted by humans. In a d-dimensional vector space model in which every vocabulary word w ∈ W has been assigned a vector v_w ∈ R^d, the vectors corresponding to the top words in a topic should be close to each other. In (Nikolenko, 2016), the authors define topic quality as the average distance between the top words in the topic. Generally speaking, let d(w_1, w_2) be a distance function in R^d; larger values of d(w_1, w_2) correspond to worse topics (with words not as localized as in topics with smaller average distances). Among the four distance metrics suggested in (Nikolenko, 2016), cosine distance achieved the best results. We here also employ the cosine distance, defined as d_cos(x, y) = 1 − x^T y (for normalized vectors x and y).
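The pairwise NPMI score can be sketched analogously, with probabilities estimated from document frequencies (our own helper names; a small ε guards against log(0) for words that never co-occur):

```python
# Sketch of pairwise NPMI over a topic's top words; each pair contributes
# PMI normalized by -ln p(w_i, w_j), so scores lie in roughly [-1, 1].
import math
from itertools import combinations

def topic_npmi(top_words, docs, eps=1e-12):
    doc_sets = [set(d.split()) for d in docs]
    n = len(doc_sets)
    def p(*ws):
        """Fraction of documents containing all the given words."""
        return sum(1 for s in doc_sets if all(w in s for w in ws)) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = p(w1, w2) + eps
        pmi = math.log(p12 / (p(w1) * p(w2) + eps))
        scores.append(pmi / -math.log(p12))       # normalization step
    return sum(scores) / len(scores)
```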
We compare our approach, described in Section 4, with the seven hierarchical topic model strategies marked in bold in Section 2. For the input parameters of CluHTM (Algorithm 1), we set K_min = 5, K_max = 25, R = 10 and D_max = 3. We define K_min through empirical experiments, and K_max was set according to the number of topics exploited in (Viegas et al., 2019). For the baseline methods, we adopt the parameters suggested in their original works. We assess the statistical significance of our results by employing a paired t-test with 95% confidence and Holm-Bonferroni correction to account for multiple tests.
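This significance protocol can be sketched as follows: one paired t-test per baseline, followed by a step-down Holm-Bonferroni correction over the resulting family of p-values (helper names are ours, not from the paper):

```python
# Sketch of the significance protocol: paired t-tests against each baseline,
# with a Holm-Bonferroni step-down correction over the family of comparisons.
from scipy import stats

def holm_bonferroni(p_values, alpha=0.05):
    """Return booleans: True where the null hypothesis is rejected."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    m = len(p_values)
    for rank, idx in enumerate(order):            # smallest p-value first
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                                 # step-down: stop at first failure
    return reject

def compare(method_scores, baseline_scores_list, alpha=0.05):
    """Paired t-test of our method against each baseline (paired by dataset)."""
    pvals = [stats.ttest_rel(method_scores, b).pvalue
             for b in baseline_scores_list]
    return holm_bonferroni(pvals, alpha)
```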

Experimental Results
We start by comparing CluHTM against four state-of-the-art uHTM baselines considering the twelve reference datasets. Three hierarchical levels are used for each strategy in this comparison. In Figures 2, 4 and 3 we contrast the results of our proposed CluHTM and the reference strategies, considering the NPMI, W2V-L1, and Coherence metrics. Note that each strategy extracted a different number of topics in its hierarchical structure. Considering NPMI, the most important metric to evaluate the quality of topics (Nikolenko, 2016), we can see in Figure 2 that our strategy outperforms all baselines on all datasets by large margins, with gains over 500% against some of the strongest ones. Some of these results are the highest in terms of NPMI ever reported for several of these datasets. Considering the Coherence scores (Figure 3), our strategy achieves the single best results in 2 out of 12 datasets, with gains of up to 58% and 92% against the most robust baseline (hPAM), tying in 8 out of 12 and losing twice, to hLDA and hPAM. Similar results can be observed for the W2V-L1 metric (Figure 4): CluHTM ties in 10 out of 12 results, with one win and one loss against KHTM. As we will see, even with very few losses on these metrics, our method proves to be more consistent than the baselines.

We now turn our attention to the effectiveness of our proposal when compared to the supervised HTM strategies. We consider the 20News and ACM datasets, for which we have ground truth for the supervised strategies. Table 2 presents the results considering Coherence, W2V-L1, and NPMI; the statistical significance tests ensure that the highlighted best results are superior to the others, statistically equivalent results are marked with •, and statistically significant losses are marked accordingly. Once again, in Table 2, our proposed strategy achieves the best results in 4 out of 6 cases, tying with SNLDA and HSLDA on ACM and losing only to SLDA on 20News, both considering the W2V-L1 metric.
It is important to remind that, differently from these supervised baselines, our method does not use any privileged class information to build the hierarchical structure nor to extract topics.
We provide a comparative table with all experimental results, including the results for each extracted level of the hierarchical structure (see the Appendix, Section Supplementary Results, for details). We summarize our findings regarding the behavior of all analyzed strategies on the 12 datasets by counting the number of times each strategy figured as a top performer (if two approaches are statistically tied as top performers on the same dataset, both are counted). The summarized results can be seen in Table 3. Our proposal holds a considerable advantage over the other explored baselines, being the strategy of choice in the vast majority of cases. Overall, considering a universe of 36 experimental results (the combination of 3 evaluation metrics over 12 datasets), we obtained 33 top performances, with the most robust baseline, hPAM, coming far behind with just 17. Another interesting observation is that, in terms of NPMI, CluHTM wins in all cases. Details of this analysis are summarized in the Appendix.

Impact of the Factors
One important open question remains to be answered: to what extent do the characteristics of the dataset impact the quality of the topics generated by our strategy? To answer this question, we provide a quantitative analysis of hierarchical topic modeling effectiveness, as measured by the NPMI score.
We start our analysis by quantifying the effects of the parameters of interest (i.e., factors). Those factors might affect the performance of the system under study; the analysis also determines whether the observed variations are due to significant effects or to chance (e.g., measurement errors, the inherent variability of the process being analyzed (Jain, 1991)). To this end, we adopt a full factorial design, which uses all possible combinations of the levels of the factors in each complete experiment. The first factor is the dataset. The idea is to analyze the impact of textual properties such as dataset size, density, dimensionality, etc. Thus, each level of this factor is a dataset in Table 1. The second factor is the HTM strategy evaluated in the previous section. With this factor, we intend to assess the impact of the extracted topics, as well as of the hierarchical structure. Each level of this factor is an evaluated HTM strategy. Every combination of these two factors is measured by the average NPMI among the topics of the hierarchical structure.
Results are shown in Table 4, where we highlight the average NPMI and the effects of each factor. From the effects, we can observe that CluHTM's impact on the NPMI value is 99.38% higher than the overall average. We can also see that hLDA has an NPMI score higher than the overall average (18.67%), while HSOC has an NPMI score approximately 64.44% smaller than the overall NPMI. Concerning the datasets' effects, the full factorial design experiment tells us that they have a small impact on the variation of the obtained average NPMI scores. We can also observe that the dataset with the largest variation in NPMI is InfoVis-Vast, with a score 29.97% smaller than the overall NPMI.
We perform an ANOVA test to assess whether the studied factors are indeed statistically significant and conclude, with 99% confidence according to the F-test, that the choice of algorithm (factor B) explains approximately 90% of the obtained NPMI values. We can also conclude that the investigated properties of the textual data (factor A), as well as the experimental errors, have a small influence on the experimental results. Summarizing, the characteristics of the datasets have a low impact on the results, and the impact of CluHTM is consistent across all of them. The ANOVA test details are presented in Table 5.

Conclusion
We advanced the state-of-the-art in hierarchical topic modeling (HTM) by designing, implementing and evaluating a novel unsupervised non-probabilistic method -CluHTM. Our new method exploits a more elaborate (global) semantic data representation -CluWords -as well as an original application of a stability measure to define the "shape" of the hierarchy. CluHTM excelled in terms of effectiveness, being around two times more effective than the strongest state-of-the-art baselines, considering all tested datasets and evaluation metrics. The overall gains over some of these strongest baselines are higher than 500% on some datasets. We also showed that CluHTM results are consistent across most datasets, independently of the data characteristics and idiosyncrasies. As future work, we intend to apply CluHTM to other representative applications on the Web, such as hierarchical classification, by devising a supervised version of CluHTM. We also intend to incorporate some type of attention mechanism into our methods to better understand which CluWords are more important for defining certain topics.

Acknowledgments
This work is partially supported by CAPES, CNPq, Finep, Fapemig, Mundiale, Astrein, and the projects InWeb and MASWeb.

A Appendix Supplementary Results
The Tables below expand on the results of Section 5.