Topic Model Stability for Hierarchical Summarization

We envisioned responsive generic hierarchical text summarization with summaries organized by section and paragraph based on hierarchical structure topic models. But we had to be sure that topic models were stable for the sampled corpora. To that end we developed a methodology for aligning multiple hierarchical structure topic models run over the same corpus under similar conditions, calculating a representative centroid model, and reporting stability of the centroid model. We ran stability experiments for standard corpora and a development corpus of Global Warming articles. We found flat and hierarchical structures of two levels plus the root offer stable centroid models, but hierarchical structures of three levels plus the root didn’t seem stable enough for use in hierarchical summarization.


Introduction
We envisioned a responsive generic hierarchical text summarization process for complex subjects and multi-page documents with resulting text summaries organized by topic and paragraph. Information extraction and summary construction would be based on hierarchical structure topic models learned in the analysis phase. The hierarchical topic structure would provide the organization as well as the information quantity budget and extraction criteria for sections and paragraphs in hierarchical summarization. Initial attempts along this path offered promise for a more coherent and organized summary for a small corpus of Global Warming articles (Live Science, 2015) versus that obtained by flat topic structures.
However, multiple analyses of the same Global Warming corpus and various standard corpora under similar conditions rendered seemingly different hierarchical topic models. Model differences remained even after transforming and reducing models based on required summary size and other extrinsic summary requirements. So we decided to examine topic model stability with the goal of assuring that stable, representative, and credible topic models would be produced in our analysis phase. This paper documents our effort at assuring hierarchical topic model stability for hierarchical summarization.
It is inherent in Bayesian probabilistic topic modeling and similar methods that repeat analyses of the same corpus under the same conditions give different results. But we must have substantially similar results to do credible hierarchical summarization (or other application). We require topic model stability, i.e., similar topic models for analyses performed under similar conditions. Without stable results, we do not know which analyses to believe, if any, and we mistrust the methodology itself. Furthermore, any application of the resulting topic model is not credible.
Organization of Paper. Bayesian probabilistic topic analysis (§2.1) expresses a corpus as the matrix product of topic compositions of words with document mixtures of topics. In flat topic analysis, the matrix of topic-word compositions is organized as a flat vector of individual topics. With hierarchical structure topic analysis, the topics take on a hierarchical tree structure.
Topic model quality (§2.2) is typically assessed by predictive likelihood of words for a test corpus or by assessment of topic coherence. Our stability assessment methodology seems largely complementary to quality assessment.
The Hungarian assignment algorithm (Kuhn, 1955) has been used for aligning flat topic model pairs (§2.3), based on a cost matrix of pairwise topic alignments. We will use a pairwise topic similarity measure for populating the Hungarian algorithm's cost matrix.
Topic models, including hierarchical models, are being used to construct text summaries (§2.4), including hierarchical text summaries. This provides sufficient reason to want to assure the stability of flat and hierarchical structure topic models.
We introduce the particular flat and hierarchical structure topic models (§3.1) used for this paper.
In a simple yet significant innovation, we extend topic alignment (§3.2) to hierarchical structure topic model pairs via a recursive application of the Hungarian assignment algorithm starting with root topics of the model pair. Surprisingly, we find that the time complexity of aligning hierarchical topic structures improves versus flat structures with increasing level of the hierarchy. 2 We measure stability (§3.3) as alignment (proportion of aligned topics), similarity (weighted cosine similarity over topic compositions), and divergence (Jensen-Shannon divergence over topic distributions). Measures are defined for flat and then extended to hierarchical structure topic models.
The more topic models in the study, the more credible the stability analysis, since we are aligning more models and measuring stability based on more analyses. For complex problems, however, more models also make it more likely we would encounter alternative topic models, just as human topic modelers might. We perform agglomerative clustering on topic model similarity (§3.4) to test whether models form a single or multiple stable topic model groups, or are unstable.
For each cluster, we align models and calculate topic frequency weighted centroids (§3.5) of topic-word compositions for aligned topics. Then we assess stability versus the centroid model (§3.6) similarly to that done previously for model pairs.
We demonstrate the methodology (§4) over flat and hierarchical structure models in an 18 run factorial experiment on three corpora, and in a separate ad hoc 16 run experiment on a larger corpus.
We return to our work on hierarchical summarization (§5), now armed with stable hierarchical topic models, and examine our next steps as well as options for further research.
2 Software engineering already knows this: hierarchical structure is less time complex than monolithic.

Previous Work
We use Bayesian probabilistic topic modeling in the analysis phase of our hierarchical summarization process. Here we briefly review topic modeling, topic model quality, topic model stability, and use of topic models in hierarchical summarization.

Topic Models
The latent Dirichlet allocation (LDA) Bayesian probabilistic topic model, introduced and popularized by Blei et al. (2003) and Griffiths and Steyvers (2004), factors a corpus of document-word occurrences as the matrix product of topic compositions of words and document mixtures of topics (figure 1). The topic structure is flat and the number of topics, K, and vocabulary size, V, are fixed. In the generative probabilistic model, topic-word compositions are distributed symmetric Dirichlet with parameter η, and document-topic mixtures are distributed Dirichlet with concentration parameter α.
Teh et al. (2005, 2006) generalized the LDA model in two important ways: (1) the number of topics, K, is made open ended by treating the topic model as a Dirichlet process (DP) with growth parameter γ for sampling a new topic, and (2) documents are sampled from Dirichlet processes (DPs) which are themselves sampled from corpus DPs, thus forming hierarchical Dirichlet processes (HDPs), even while the topic structure remains flat.
Blei et al. (2010) developed hierarchical topic analysis where the generative model of the corpus consists of a hierarchy of nested Dirichlet processes (DPs) and each document is generated as a single non-branching path down the corpus hierarchical structure. Stay-or-go stochastic switches are used at each document node to determine whether to stay on the current topic or go to a topic further down the tree.
Paisley et al. (2015) extended the non-branching document paths to a nested hierarchical structure Dirichlet process model with branching in both the document and global models. In figure 2, the grey represents the corpus tree and the black overlaid trees the individual document trees. Each document parent node is a DP sampled from its corresponding corpus node DP. Analysis infers the corpus topic structure and compositions, and document topic mixtures and stay-or-go switches.

Quality
Predictive log likelihood for words, test LL(x), is a popular measure of topic analysis quality. Test LL(x) shows the predictability of words on test data given the model fit to training data (corpus topics and compositions). While not a stability measure, test LL(x) does give an objective indication of predictability. Teh et al. (2007) provide formulas for calculating test LL(x) for the flat topic structure in both Gibbs sampler and variational inference analysis methods.
Assessing quality of individual topics can be as simple as noting topics below a minimum frequency or comparing divergence of topics from any of uniform, corpus, or power distributions of word frequencies. More powerful methods assess individual and aggregate topic coherence. The current standard is to measure coherence by normalized pointwise mutual information (NPMI) (Aletras and Stevenson, 2013; Lau et al., 2014; Röder et al., 2015) versus pairwise probabilities calculated from some very large pertinent corpus.
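As an illustration of the coherence measure just mentioned, the following is a minimal sketch of NPMI averaged over the top words of one topic. It assumes document-level co-occurrence in a small reference collection (`doc_sets`, a hypothetical name); published evaluations typically estimate probabilities from sliding windows over a very large corpus, so the estimator here is a simplification.

```python
import math
from itertools import combinations


def npmi(p_i, p_j, p_ij):
    """Normalized pointwise mutual information of one word pair, in [-1, 1]."""
    if p_ij == 0.0:
        return -1.0  # convention: the words never co-occur
    if p_ij == 1.0:
        return 0.0   # degenerate case: the normalizer -log(p_ij) is zero
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / -math.log(p_ij)


def topic_coherence(top_words, doc_sets):
    """Mean NPMI over all pairs of a topic's top words.

    doc_sets is a list of sets of words, one set per reference document
    (ASSUMPTION: document-level co-occurrence, not sliding windows).
    """
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = [npmi(prob(a), prob(b), prob(a, b))
              for a, b in combinations(top_words, 2)]
    return sum(scores) / len(scores)
```

Higher mean NPMI indicates that a topic's top words tend to appear together in the reference collection more often than chance would predict.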
We view test likelihood and topic coherence as largely complementary to topic model stability.

Topic Alignment and Stability
Topic models must be aligned on topics before assessing stability. de Waal and Barnard (2008) calculate similarity weights between topics from different models over documents, construct a cost matrix from negative similarity weights, and apply the Hungarian assignment algorithm (Kuhn, 1955) to determine the optimal pairwise topic model alignment. Stability is defined as the correlation between aligned topics over documents. Greene et al. (2014) calculate the average of Jaccard scores on sets of popular word ranks between topic combinations of a topic model pair, and determine the model agreement (i.e., stability) as the average over topics of Jaccard scores resulting from the optimal topic alignment by the Hungarian assignment algorithm. Chuang et al. (2015) note that model alignment is "ill-defined and computationally intractable" with multiple-to-multiple mappings between topics, and adopt the solution of mapping topics up-to-one topic. Yang et al. (2016) align topics for flat topic structures, also using the Hungarian assignment algorithm and up-to-one topic correspondence. Stability is measured as agreement between token topic assignments over aligned topic models.
We use the Hungarian algorithm and the up-to-one topic correspondence. We choose to emphasize topic correspondence based on topic word compositions, as in the generative model, and so base our cost matrix on similarity of topic word compositions between models.

Topic Model Based Summarization
Haghighi and Vanderwende (2009) examined several hybrid topic models using LDA as a building block and demonstrated the superior efficacy of their hybrid model (general topic, general content topic, detail content topics, and document specific topics) in constructing short summaries for Document Understanding Conferences (U.S. Department of Commerce: National Institute of Standards and Technology, 2015). Delort and Alfonseca (2011) and Mason and Charniak (2011) used similar models in short summaries for the Text Analysis Conferences (U.S. Department of Commerce: National Institute of Standards and Technology, 2010, 2011). Hakkani-Tur (2010, 2011) used a more general hierarchical LDA topic model structure, doing hierarchical summarization for longer summaries. Christensen et al. (2014) developed "hierarchical summarization" using temporal hierarchical clustering and budgeting summary component size by cluster.
We use a more general hierarchical structured Bayesian topic model similar to Paisley et al. (2015). Essential for any of these related hierarchical topic model or cluster based methods is the stability of the model used to drive summarization.

Methodology
We present a process for aligning topic models and measuring topic model stability for both flat and hierarchical structure cases. The resulting stable hierarchical structure topic centroid model would be further transformed to take into account extrinsic summarization requirements.

Stability Measurement Process
1. Infer multiple topic models for the same corpus run under similar conditions.
2. Align all topic model pairs.
3. Calculate stability of aligned topic model pairs.
4. Cluster topic models using agglomerative clustering over pairwise stability.
5. For each cluster:
(a) Align member topic models and calculate topic model centroids.
(b) Align member topic models with topic centroid model.
(c) Calculate stability of topic models with topic centroid model.

Topic Modeling
For a flat topic structure, we use a Gibbs sampler implementation of the hierarchical Dirichlet processes (HDP) of Teh et al. (2006). For a hierarchical topic structure, we use a Gibbs sampler implementation of a simplified version of the nested hierarchical Dirichlet processes of Paisley et al. (2015). Our simplified model and Gibbs sampler drop the use of stay-or-go stochastic switches at each document Dirichlet process (DP) node. See supplemental notes (Supplemental, 2017b).

Pairwise Topic Model Alignment
From a set of $M$ topic models, all $M(M-1)/2$ model pairs are aligned based on topic pair assignment costs. The assignment cost between a pair of topics, one from each model of the pair, is calculated from the topic frequencies and the cosine similarity of the topic word compositions, where $(k, l)$ indexes topics from the model pair, $m_k$ and $n_l$ are topic frequencies, $N$ is corpus size, $\mathbf{m}_k$ and $\mathbf{n}_l$ are vectors of word frequencies for topic pair $(k, l)$, and cosSim calculates the cosine similarity. By using topic frequency ratios in the cost, similar frequency topics are preferred. Since weak similarities are not useful, we censor cosSim ≤ 0.25 and substitute zero for their cost.
Flat Topic Models. Pairwise costs are assembled into a cost matrix indexed by $(k, l)$, and the optimal cost assignment of the model pair is determined by the Hungarian assignment algorithm. For unequal numbers of topics, vectors of zero (maximum) costs are substituted for nonexistent topics.
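The flat alignment step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the frequency weighting `min(m_k, n_l) / N` is a hypothetical stand-in for the paper's cost formula, and an exhaustive search over permutations is swapped in for the Hungarian algorithm so the example stays dependency-free (practical only for a handful of topics; a library routine such as SciPy's `linear_sum_assignment` would be the usual choice).

```python
import math
from itertools import permutations

CENSOR = 0.25  # similarities at or below this limit get zero cost


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_cost(m_k, n_l, words_k, words_l, corpus_size):
    """Non-positive cost of aligning one topic pair."""
    sim = cos_sim(words_k, words_l)
    if sim <= CENSOR:
        return 0.0
    weight = min(m_k, n_l) / corpus_size  # ASSUMED weighting, for illustration
    return -weight * sim


def align_flat(model_a, model_b, corpus_size):
    """Each model is a list of (topic_frequency, word_count_vector) tuples.

    Returns (total_cost, [(k, l), ...]) for the minimum-cost assignment.
    """
    size = max(len(model_a), len(model_b))
    cost = [[0.0] * size for _ in range(size)]  # zero cost pads missing topics
    for k, (m_k, w_k) in enumerate(model_a):
        for l, (n_l, w_l) in enumerate(model_b):
            cost[k][l] = pair_cost(m_k, n_l, w_k, w_l, corpus_size)
    # Exhaustive search stands in for the Hungarian assignment algorithm.
    best = min(permutations(range(size)),
               key=lambda p: sum(cost[k][p[k]] for k in range(size)))
    total = sum(cost[k][best[k]] for k in range(size))
    pairs = [(k, best[k]) for k in range(size) if cost[k][best[k]] < 0.0]
    return total, pairs
```

Pairs whose cost is zero, whether censored or padded, are left out of the returned alignment, which corresponds to up-to-one topic correspondence.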
Hierarchical Topic Models. Hierarchical topic structures are single rooted branching trees of depth $L$ where the root is depth 0. Each tree node includes a topic of word compositions, and each non-leaf tree node includes a Dirichlet process (DP) of topic mixtures. We restrict hierarchical topic structure alignment to require: (1) roots must align, and (2) aligned child branches must also be aligned in their ancestors. With these restrictions, we developed Minimize Subtree Cost (algorithm 1), applying the Hungarian algorithm to DP (non-leaf) nodes of the hierarchical topic structure. Method minimizeSubtreeCost is invoked initially for model pair roots, $(\sigma_0, \tau_0)$, and recursively thereafter for subtree pairs, $(\sigma, \tau)$. If either subtree is a leaf, the topic alignment cost is returned. For internal nodes, a cost matrix is constructed between the child nodes for the subtrees, the Hungarian assignment algorithm is invoked to get the optimum cost alignment for the subtrees, the topic cost is added to the subtree costs, and this result is returned. Filling the subtree cost matrix calculates the cost of aligning each pair of subtree children between the model pair by minimizing subtree costs for each child pair. Thus calculating subtree costs and filling subtree cost matrices together recursively span the entire solution space for hierarchical topic alignment. See supplemental Java snippets (Supplemental, 2017a).
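The recursion described above can be sketched roughly as follows. This is a reading of the prose description, not the supplemental code: the node layout (dicts with `topic` and `children` keys) and the `topic_cost` callback are hypothetical, exhaustive search again stands in for the Hungarian algorithm, and the sketch does not separately censor descendants of topic pairs whose own cost is zero.

```python
from itertools import permutations


def best_assignment_cost(cost):
    """Minimum total cost over all assignments of a square matrix.

    Exhaustive search, used here as a stand-in for the Hungarian algorithm.
    """
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))


def minimize_subtree_cost(sigma, tau, topic_cost):
    """Cost of the best alignment of the subtrees rooted at sigma and tau.

    Nodes are dicts {"topic": ..., "children": [...]} (hypothetical layout);
    topic_cost(a, b) returns a non-positive cost for one topic pair.
    """
    own = topic_cost(sigma["topic"], tau["topic"])
    kids_s, kids_t = sigma["children"], tau["children"]
    if not kids_s or not kids_t:  # either subtree is a leaf
        return own
    size = max(len(kids_s), len(kids_t))
    cost = [[0.0] * size for _ in range(size)]  # zero cost pads missing children
    for i, child_s in enumerate(kids_s):
        for j, child_t in enumerate(kids_t):
            # Filling the matrix recurses into every child pair.
            cost[i][j] = minimize_subtree_cost(child_s, child_t, topic_cost)
    return own + best_assignment_cost(cost)
```

Because a child pair's matrix entry already contains the best cost of everything beneath it, descendants can only contribute to the total when their parents are assigned to each other.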
Time Complexity. For flat topic structures, topic alignment time complexity is $O(K^2(V + K))$, where $K$ is the number of topics and $V$ is the vocabulary size. Preparation of the cost matrix takes $K^2$ topic vector cosine similarity calculations over $V$ words, and the Hungarian assignment algorithm takes $O(K^3)$ (Kuhn, 1955).
Level 1 in the hierarchical structure is similar to the flat topic structure. Time complexity is $O(B^2(V + B))$, with branching factor, $B$, in place of number of topics, $K$. Each increment in level increases by a factor of $B^2$ the tree node pairs from the parent level. The resulting time complexity for level $l$ beyond the root is then $O(B^{2l}(V + B))$. For $B > 1$ the final level dominates the order calculation, and so the time complexity for a hierarchical structure of depth $L$ is $O(B^{2L}(V + B))$.
We compare this with the time complexity for the flat structure alignment problem by expressing $K$ as though from a flattened hierarchical structure, $K = \sum_{l=0}^{L} B^l = (B^{L+1} - 1)/(B - 1)$. 5 For $B > 1$ the terms with $B$ in the ratio dominate, so $K$ is of order $B^L$, and expressing flat structure in hierarchical terms gives time complexity $O(B^{2L}(V + B^L))$. Cost of assignment for flat is greater by a factor of $(V + B^L)/(V + B)$, which approaches $B^{L-1}$ as the $B$ terms come to dominate $V$, versus a comparable hierarchical structure. This is a surprising result! We had expected hierarchical structure to add time complexity, but instead it reduces time complexity with increasing level compared to a corresponding flat structure. Alignment of topics between hierarchical structures is less time complex than for flat structures.
5 Sum of geometric series, $\sum_{l=0}^{L} B^l$, for a branching tree.
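To make the comparison concrete, the short sketch below evaluates the two order expressions for chosen values of the branching factor, depth, and vocabulary size. The function names are illustrative, and the counts are order-of-growth expressions with constants dropped, not measured running times.

```python
def flat_topic_count(B, L):
    """Topics in a flattened full B-ary tree of depth L (root at depth 0)."""
    return sum(B ** l for l in range(L + 1))


def hierarchical_ops(B, L, V):
    """Order expression B^(2L) * (V + B) for hierarchical alignment."""
    return B ** (2 * L) * (V + B)


def flat_ops(B, L, V):
    """Order expression K^2 * (V + K) with K taken from the flattened tree."""
    K = flat_topic_count(B, L)
    return K ** 2 * (V + K)


def leading_ratio(B, L, V):
    """Flat over hierarchical using leading terms only: (V + B^L) / (V + B)."""
    return (V + B ** L) / (V + B)


if __name__ == "__main__":
    for V in (0, 500, 5000):
        print(V, leading_ratio(10, 3, V), flat_ops(10, 3, V) / hierarchical_ops(10, 3, V))
```

The advantage of the hierarchical form is largest when the branching terms are large relative to the vocabulary size, and shrinks toward parity as the vocabulary term dominates.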

Pairwise Stability
Given the topic model alignment, we calculate alignment, similarity, and divergence measures. Table 1 gives a priori and preliminary calibration study interpretations of the stability measures.
Proportion Aligned. Alignment is calculated as $pAlign = K' / [(K_\sigma + K_\tau)/2]$, where $K'$ is the number of aligned topics, and $K_\sigma$ and $K_\tau$ are the number of topics for each model.
Weighted Similarity. Similarity is calculated as the topic frequency weighted similarity of the topic word compositions of the $(\sigma, \tau)$ model pair, 6 where $(k, l)$ indexes topics from the flat or hierarchically aligned model pair, $m_k$ and $n_l$ are topic frequencies, $N$ is the corpus size, $\mathbf{m}_k$ and $\mathbf{n}_l$ are vectors of word frequencies for topic pair $(k, l)$, and cosSim calculates the cosine similarity. Only aligned topics are added to wtSim, but the corpus size includes all observations, so the fewer aligned topics, the lower the weighted similarity. For the hierarchical model we require that ancestors are also aligned.
Divergence. Divergence is calculated as the Jensen-Shannon divergence (JSD) between topic frequency distributions for model pairs. Distributions are calculated as follows: (1) model $\sigma$ topic frequency counts are assembled in array $s$ by topic index $k$, (2) frequencies of unaligned topics from $\sigma$ are set to zero with the sum of frequencies of unaligned topics set in $s_K$, where $K$ is the maximum number of topics for the $(\sigma, \tau)$ model pair, (3) model $\tau$ topic frequency counts are assembled in array $t$ by topic index $l$, (4) frequencies of unaligned topics from $\tau$ are set to zero with the sum of frequencies of unaligned topics set in $t_{K+1}$, and (5) topic frequencies in $t$ are reordered according to the alignment mapping between $(\sigma, \tau)$. Thus, aligned topics coincide with respect to their positions in $s$ and $t$, and unaligned frequencies are kept separate between models. Divergence is calculated as $JSD(s \| t) = \frac{1}{2}\left(KLD(s \| m) + KLD(t \| m)\right)$, where $m = (s + t)/2$ and KLD is the Kullback-Leibler divergence. For the hierarchical model we require that ancestors are also aligned.
6 Unweighted or other weighting could be used as well.

Basis         Value                Interpretation
a priori      alignment = 1        full alignment
calibration   alignment ≈ 0.6      useful alignment
a priori      similarity = 1       full similarity
calibration   similarity ≈ 0.6     useful similarity
calibration   similarity ≈ 0.25    marginal similarity
a priori      divergence = 0       full convergence
calibration   divergence ≈ 0.1     strong convergence
calibration   divergence ≈ 0.4     strong divergence
Table 1: Preliminary interpretation of stability
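The proportion aligned and the divergence construction can be sketched as below. The five array-building steps follow the description in the text; the logarithm base (2, which bounds the JSD to [0, 1]) and all function and parameter names are assumptions made for illustration.

```python
import math


def p_align(n_aligned, k_sigma, k_tau):
    """Proportion of aligned topics relative to the mean topic count."""
    return n_aligned / ((k_sigma + k_tau) / 2)


def kld(p, q):
    """Kullback-Leibler divergence in bits (ASSUMED base 2)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(s, t):
    m = [(si + ti) / 2 for si, ti in zip(s, t)]
    return 0.5 * (kld(s, m) + kld(t, m))


def alignment_distributions(freq_s, freq_t, mapping):
    """Build comparable topic distributions for a model pair.

    freq_s, freq_t: topic frequency counts for models sigma and tau.
    mapping: dict {k: l} of aligned topic indices (k in sigma, l in tau).
    """
    K = max(len(freq_s), len(freq_t))
    s = [0.0] * (K + 2)
    t = [0.0] * (K + 2)
    for k, l in mapping.items():
        s[k] = freq_s[k]
        t[k] = freq_t[l]  # reorder tau so aligned topics share a position
    aligned_t = set(mapping.values())
    s[K] = sum(f for k, f in enumerate(freq_s) if k not in mapping)
    t[K + 1] = sum(f for l, f in enumerate(freq_t) if l not in aligned_t)
    total_s, total_t = sum(s), sum(t)
    return [x / total_s for x in s], [x / total_t for x in t]
```

Keeping unaligned mass in separate slots means that two models with no aligned topics are maximally divergent, while two fully aligned models with matching frequencies have zero divergence.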

Cluster Topic Models
There are multiple ways in which topics can be organized and assigned, whether performed automatically or by human experts. So we test whether model pairs align to a single stable model group, or if multiple stable groups can be identified. We use group-average agglomerative clustering (Manning et al., 2008) on pairwise weighted similarity, wtSim, to form model clusters. This results in compact clusters maximizing separation between clusters while minimizing the distance between the cluster centroid and its members. Clustering begins with each model forming its own cluster and ends when either all models form a single cluster or no more clusters can be formed that meet avgWtSim > cutPoint, where avgWtSim is the average weighted similarity. Output is a list of clusters where each cluster includes a list of models ordered by entry into the cluster and wtSim.
Agglomerative clustering is fast and simple; pairwise similarity scores do not have to be recalculated after each clustering step. However, we don't know what are the similarities or differences between clusters without inspecting them.
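A minimal sketch of the clustering loop follows. It assumes the group-average criterion is the mean pairwise similarity over all distinct model pairs in the merged cluster, and that a merge is accepted only while that mean exceeds the cut point; the names `cluster_models` and `cut_point` are illustrative, not the authors' identifiers.

```python
from itertools import combinations


def group_average(members, sim):
    """Mean pairwise similarity over all distinct pairs of cluster members."""
    pairs = list(combinations(sorted(members), 2))
    return sum(sim[a][b] for a, b in pairs) / len(pairs)


def cluster_models(sim, cut_point):
    """Group-average agglomerative clustering on a precomputed similarity matrix.

    sim: symmetric matrix (list of lists) of pairwise weighted similarities.
    Returns clusters as lists of model indices in order of entry; models that
    never join a cluster remain as singletons.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        score, a, b = max(
            (group_average(clusters[a] + clusters[b], sim), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if score <= cut_point:
            break  # no remaining merge meets the cut point
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters
```

The pairwise similarities are read from the fixed matrix at every step, so nothing has to be realigned or recomputed as clusters grow.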

Form Topic Centroid Models
With only one cluster, no unclustered models, and good similarity, the models seem stable. We form topic centroids and report this centroid model as the representative topic model. With multiple clusters, we should consider the appropriateness of multiple solutions, perhaps corresponding to multiple human solutions. We form centroids for each topic and report centroid models as representative of the clusters. The occurrence of many unclustered models would indicate instability.
Controls specify a censor limit for similarity below which topics do not merge into a centroid, and a minimum number of models and minimum topic frequency below which topics drop from the centroid topic model. While a cluster may have several models, not all topics need be aligned across all models.
Form Topic Centroid Model (algorithm 2) forms cluster centroid models by copying the cluster centroid from the initial model and then aligning and entering individual models into the centroid iteratively based on their order of entry into the cluster.
The method optimizeSubtreeMap, a variation on the previous minimizeSubtreeCost (algorithm 1), returns the topic correspondence mapping. Topics which do not meet the topic similarity censor limit (wtSim < 0.25) are not aligned. Unaligned topics are provisionally added to the centroid model in case subsequent models in the list have similar topics. After the centroid model is formed, topics which do not meet a minimum topic frequency limit or minimum number of topic models limit are dropped.
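The iterative centroid construction can be sketched for the flat case as follows. This is an illustration under assumptions rather than algorithm 2 itself: topics are represented as word count vectors whose sum is the topic frequency, the `align` callback is a hypothetical stand-in for optimizeSubtreeMap, and the minimum frequency control is applied to the mean frequency per contributing model.

```python
def form_centroid(models, align, min_models=2, min_freq=0.0):
    """Form a cluster centroid model from models in cluster-entry order.

    models: list of models; each model is a list of word count vectors.
    align(centroid_vectors, model) -> {centroid_index: model_index} for
    aligned topics only (hypothetical stand-in for the alignment method).
    """
    centroid = [{"words": list(w), "n_models": 1} for w in models[0]]
    for model in models[1:]:
        mapping = align([c["words"] for c in centroid], model)
        matched = set(mapping.values())
        for c_idx, m_idx in mapping.items():
            c = centroid[c_idx]
            # Pooling counts gives a frequency weighted mean of compositions.
            c["words"] = [a + b for a, b in zip(c["words"], model[m_idx])]
            c["n_models"] += 1
        for m_idx, w in enumerate(model):
            if m_idx not in matched:
                # Provisional topic: later models may still align with it.
                centroid.append({"words": list(w), "n_models": 1})
    kept = []
    for c in centroid:
        total = sum(c["words"])
        mean_freq = total / c["n_models"]  # ASSUMED basis for the frequency limit
        if total > 0 and c["n_models"] >= min_models and mean_freq >= min_freq:
            kept.append({
                "composition": [x / total for x in c["words"]],
                "mean_freq": mean_freq,
                "n_models": c["n_models"],
            })
    return kept
```

Summing counts before normalizing lets high frequency instances of a topic contribute more to the centroid composition than low frequency ones.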

Centroid Model Stability
For each cluster's centroid model, we align individual models with the centroid model and estimate stability. The method is similar to that for pairwise stability with the exception that the centroid model is always one member of the pair and so only M (centroid, model) pairs are analyzed.

Use in Hierarchical Summarization
The final product is a single stable centroid model, when one exists. The stable centroid model shows the topic structure, the proportional importance of each topic, and the word composition of each topic as a discrete probability distribution. In our hierarchical summarization process, this centroid model would be further transformed (nested, pruned, aggregated) by taking into account extrinsic requirements of summary size, and paragraph and subparagraph structure. The resulting topic structure model would be used to extract information proportionally for each topic, and organize the section and paragraph structured summary.
If the centroid model is not stable, then hierarchical summarization would not be credible. If there are multiple identifiable stable clusters, then their centroid models become candidates for organizing the hierarchical summary.

Stability Experiments
The purpose of the stability experiments is to demonstrate the methodology over corpora for flat and hierarchical structures. When stable centroid models result from replicate topic analyses, they can credibly be transformed to take into account extrinsic summarization requirements, and carried forward to the information extraction phase of our hierarchical summarization process.

Corpora
Corpora used in this study are Journal of the ACM (JACM) abstracts from years 1987-2000, Global Warming (GW) articles for the year 2015 (Live Science, 2015), Proceedings of the National Academy of Sciences (PNAS) abstracts for years 1991-2001 (Ponweiser et al., 2015), and Neural Information Processing Systems (NIPS) proceedings for years 1988-1999 (Lichman, 2013). PNAS and GW texts were lemmatized. Stop words and words with frequency less than ten were removed. JACM and GW are small corpora; JACM has very small abstracts while GW has short articles; PNAS has numerous abstracts and NIPS has longer articles.

Experimental Design
An 18 run factorial design (3 corpora x 3 levels x 2 growth rates) crosses JACM, GW, and PNAS corpora, with flat (L=0) and hierarchical (L=2,3) topic structures, and topic growth rates to achieve

Results - Factorial Design
Stability analysis was performed for each experimental group of replicates. Topics were not aligned when wtSim < 0.25, clustering terminated when avgWtSim < cutPoint = 0.5, and topics were dropped from the cluster centroid model when $nModel_k < 2$. Table 3 shows the results for the factorial design with corpus, hierarchical topic structure (L), and growth rate (γ). Results reported are number of topics in training model (K), and stability measures of number (K') and proportion of topics aligned (pAlign) in centroid model, average weighted similarity (wtSim), and hierarchical Jensen-Shannon divergence (hJSD). Ideal results based on a priori values (table 1) would be pAlign ≈ 1, wtSim ≈ 1, hJSD ≈ 0.
We expected simpler would be more stable (Ockham's razor), such that more levels and topics give poorer stability. This is largely confirmed by stability measures in that greater hierarchy levels and greater topic count models generally had poorer stability measures. Hierarchical L=3 models, and the JACM corpus especially, showed poorer stability.

Results - Ad hoc Design - NIPS
We analyzed a set of 16 trials on the NIPS corpus run under somewhat similar conditions with topic counts in the 90 to 200 range with hierarchical L=3. Given the corpus size, non-equality of conditions, and diversity of topic counts, we weren't surprised to find multiple distinct clusters. Stability analysis was performed with control settings: topics not aligned for wtSim < 0.25, clustering terminated for avgWtSim < cutPoint = 0.5 or 0.6, and topics dropped from the cluster centroid model for $nModel_k < 2$. Results are reported in table 4. At cutPoint = 0.5, all models formed one cluster; at cutPoint = 0.6, three separate clusters were identified and six models were not joined to any cluster. Proportion of aligned topics declined ($nModel_k < 2$ is a more stringent test when there are only 2 or 3 models in the cluster), but similarity and divergence measures were substantially improved for each of the three separate clusters.

Impact on Hierarchical Summarization
For corpora in the factorial design, both flat and hierarchical L=2 topic structures resulted in good stability (high alignment and similarity with little divergence), so the centroid topic model can credibly be carried forward for use in our hierarchical summarization process. The hierarchical L=3 models are generally less stable.
The NIPS stability analysis for a single cluster shows moderate similarity of models and moderate divergence of topic distributions, while more restrictive clustering reveals three separate clusters and six unassigned models. This bears further investigation.

Discussion
We have:
• placed modeling hierarchical topic structure in the analysis phase of our hierarchical text summarization process;
• established the importance of a stable topic model for use in the analysis phase;
• developed a methodology for aligning and measuring stability of topic models;
• defined innovative and simple hierarchical topic structure model alignment via a recursive algorithm applying the Hungarian algorithm to individual Dirichlet processes;
• quantified time complexity of our hierarchical alignment algorithm and showed reduced time complexity at increasing hierarchical level versus flat topic structures;
• developed alignment, similarity, and divergence stability measures for hierarchical topic structures;
• applied agglomerative clustering to form coherent groups of topic models:
  - constructed representative cluster centroid models, and
  - calculated centroid model stability;
• demonstrated the methodology, finding credible models for flat and hierarchical L=2 structures;
• demonstrated the methodology on a large set of hierarchical L=3 topic models run on the NIPS corpus, finding multiple coherent clusters plus unclustered models;
• mentioned parenthetically work on a pilot calibration study for stability measures.
Future Work. There is work to be done on topic model stability, model alignment, and stability measurement:
• apply our methodology to larger, more varied models and different inference methods;
• improve, expand, and publish calibration studies beyond our pilot;
• explore other topic model alignment cost measures;
• further improve topic alignment including options other than up-to-one matching;
• improve hierarchical structure topic model stability.

Summarization - Next Step
We further transform the hierarchical topic structure taking into account extrinsic summarization requirements. The product from the analysis phase is a hierarchical structure topic model where each topic includes its proportional representation of the corpus and a composition of words given as a discrete probability distribution. This structure is used in information extraction, where topic compositions match information from the corpus, e.g., sentences, and proportional representation budgets the quantity of information to be extracted for each topic. The transformed topic structure organizes summary topic and paragraph structure.
Conclusion. Our topic model stability methodology lets us diagnose and compute "usable" hierarchical topic models for collections of long documents. This is an essential and "attractive starting point towards hierarchical text summarization."