Sparse Parallel Training for Hierarchical Dirichlet Process Topic Models

Nonparametric extensions of topic models such as latent Dirichlet allocation, including the hierarchical Dirichlet process (HDP), are often studied in natural language processing. Training these models generally requires serial algorithms, which limits scalability to large data sets and complicates acceleration on parallel and distributed systems. Most current approaches to scalable training of such models either do not converge to the correct target, or are not data-parallel. Moreover, these approaches generally do not utilize all available sources of sparsity found in natural language, an important way to make computation efficient. Based upon a representation of certain conditional distributions within an HDP, we propose a doubly sparse data-parallel sampler for the HDP topic model that addresses these issues. We benchmark our method on the well-known PubMed corpus of 8m documents and 768m tokens, using a single multi-core machine, in under three days.


Introduction
Topic models are a widely-used class of methods that allow practitioners to identify latent semantic themes in large bodies of text in an unsupervised manner. They are particularly attractive in areas such as history (Yang et al., 2011; Wang et al., 2012), sociology (DiMaggio et al., 2013), and political science (Roberts et al., 2014), where a desire for careful control of structure and prior information incorporated into the model motivates one to adopt a Bayesian approach to learning. In these areas, large corpora such as newspaper archives are becoming increasingly available (Ehrmann et al., 2020), and models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and its nonparametric extensions (Teh, 2006; Hu and Boyd-Graber, 2012; Paisley et al., 2015) are widely used by practitioners. Moreover, these models are emerging as a component of data-efficient language models (Guo et al., 2020). Training topic models efficiently entails two requirements.
1. Expose sufficient parallelism for the hardware to exploit.
2. Utilize sparsity found in natural language to control memory requirements and computational complexity.
In this work, we focus on the hierarchical Dirichlet process (HDP) topic model, which we review in Section 2. This model is a simple yet non-trivial extension of LDA to the nonparametric setting. Our parallel implementation provides a blueprint for designing massively parallel training algorithms in more complicated settings, such as nonparametric dynamic topic models (Ahmed and Xing, 2010) and tree-based extensions (Hu and Boyd-Graber, 2012).
Parallel approaches to training HDPs have previously been introduced by a number of authors, including Newman et al. (2009), Wang et al. (2011), Williamson et al. (2013), Chang and Fisher (2014), and Ge et al. (2015). These techniques suit various settings: some are designed to explicitly incorporate the sparsity present in natural language and other discrete spaces, while others are intended for HDP-based continuous mixture models. Gal and Ghahramani (2014) have pointed out that some methods can suffer from load-balancing issues, which limit their parallelism and scalability. The largest benchmark of parallel HDP training of which we are aware is by Chang and Fisher (2014) on the 100m-token NYTIMES corpus. Throughout this work, we focus on Markov chain Monte Carlo (MCMC) methods: empirically, their scalability is comparable to that of variational methods (Magnusson et al., 2018; Hoffman and Ma, 2019), and, subject to convergence, they yield the correct posterior.
Our contributions are as follows. We propose an augmented representation of the HDP for which the topic indicators can be sampled in parallel over documents. We prove that, under this representation, the global topic distribution Ψ is conditionally conjugate given an auxiliary parameter l. We develop fast sampling schemes for Ψ and l, and propose a training algorithm with a per-iteration complexity that depends on the minimum of two sparsity terms: it takes advantage of both document-topic and topic-word sparsity simultaneously.

[Table 1, notation: z_{i,d}: topic indicator for token i in document d; l (1 × ∞): global topic latent sufficient statistic; K*: index for implicitly-represented topics; α, β, γ: prior concentrations for θ_d, φ_k, Ψ.]

Partially collapsed Gibbs sampling for hierarchical Dirichlet processes
The hierarchical Dirichlet process topic model begins with a global distribution Ψ over topics. Documents are assumed exchangeable: for each document d, the associated topic distribution θ_d follows a Dirichlet process centered at Ψ. Each topic k is associated with a distribution over tokens φ_k. Within each document, tokens are assumed exchangeable (bag of words) and assigned to topic indicators z_{i,d}. For given data, we observe the tokens w_{i,d}. We thus arrive at the GEM representation of the HDP, given by

Ψ ~ GEM(γ)                                          (1)
θ_d | Ψ ~ DP(α, Ψ)                                  (2)
φ_k ~ Dirichlet(β)                                  (3)
z_{i,d} | θ_d ~ Categorical(θ_d)                    (4)
w_{i,d} | z_{i,d}, Φ ~ Categorical(φ_{z_{i,d}})     (5)

where α, β, γ are prior hyperparameters.
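To make the generative process concrete, the following is a minimal NumPy sketch of a truncated forward simulation of equations (1)-(5). The truncation level K, all parameter values, and the function names are illustrative assumptions, not part of the paper's implementation; note that once Ψ has finite support, DP(α, Ψ) reduces exactly to Dirichlet(αΨ).

```python
import numpy as np

def sample_gem(gamma, K, rng):
    """Truncated stick-breaking draw of GEM(gamma): Psi_k = s_k * prod_{j<k} (1 - s_j)."""
    sticks = rng.beta(1.0, gamma, size=K)
    sticks[-1] = 1.0  # close the final stick so the truncated weights sum to one
    return sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))

def generate_corpus(D, N_d, V, alpha, beta, gamma, K, rng):
    """Forward-simulate a truncated version of the HDP topic model, eqs. (1)-(5)."""
    psi = sample_gem(gamma, K, rng)                  # (1) global topic distribution
    phi = rng.dirichlet(np.full(V, beta), size=K)    # (3) topic-word distributions
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(alpha * psi)           # (2) DP(alpha, Psi) on finite support
        z = rng.choice(K, size=N_d, p=theta)         # (4) topic indicators
        w = np.array([rng.choice(V, p=phi[k]) for k in z])  # (5) observed tokens
        docs.append((z, w))
    return psi, phi, docs
```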

Intuition and augmented representation
At a high level, our strategy for constructing a scalable sampler is as follows. Conditional on Ψ, the likelihood in equations (1)-(5) is the same as that of LDA. Using this observation, the Gibbs step for z, which is the largest component of the model, can be handled efficiently by leveraging insights on sparse parallel sampling from the well-studied LDA literature (Yao et al., 2009;Li et al., 2014;Magnusson et al., 2018;Terenin et al., 2019). For this strategy to succeed, we need to ensure that all Gibbs steps involved in the HDP under this representation are analytically tractable and can be computed efficiently. For this, the representation needs to be modified.
To begin, we integrate each θ_d out of the model, which by conjugacy (Blackwell and MacQueen, 1973) yields a Pólya sequence for each z_d. By the definition given in Appendix A, this sequence is a mixture distribution with respect to a set of Bernoulli random variables b_d, each representing whether z_{i,d} was drawn from Ψ or from a repeated draw in the Pólya urn. Thus, the HDP can be written with z_d | Ψ, b_d ~ PS(Ψ, b_d), where PS(Ψ, b_d) is a Pólya sequence, defined in Appendix A, and the remaining components are as in equations (1)-(5). This representation defines a posterior distribution over z, Φ, Ψ, b for the HDP. To derive a Gibbs sampler, we calculate its full conditionals.
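The Pólya-sequence structure can be sketched as follows: each z_i is either a fresh draw from the base distribution Ψ (indicator b_i = 1) or a copy of an earlier draw from the urn. The explicit urn loop and names below are ours, written for a finite Ψ.

```python
import numpy as np

def polya_sequence(psi, alpha, n, rng):
    """Draw z_1..z_n from a Polya sequence PS(Psi, b): token i is a fresh draw
    from Psi with probability alpha / (alpha + i), otherwise a repeat of an
    earlier draw. b_i records which branch of the mixture was taken."""
    K = len(psi)
    z, b = np.empty(n, dtype=int), np.empty(n, dtype=int)
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            b[i] = 1
            z[i] = rng.choice(K, p=psi)   # draw from the base measure Psi
        else:
            b[i] = 0
            z[i] = z[rng.integers(i)]     # repeat a previous draw from the urn
    return z, b
```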

Full conditionals for z, Φ, and b
The full conditionals z | Φ, Ψ and Φ | z, Ψ, with b marginalized out, are essentially those of partially collapsed LDA (Magnusson et al., 2018; Terenin et al., 2019). They are

p(z_{i,d} = k | Φ, Ψ) ∝ (m^{-i}_{d,k} + α Ψ_k) φ_{k,v(i)},    φ_k | z ~ Dirichlet(β + n_k)

where v(i) is the word type for word token i, m^{-i}_{d,k} denotes the document-topic sufficient statistic with index i removed, and n_k is the topic-word sufficient statistic. Note that the number of possible topics, and hence of full conditionals φ_k | z, is countably infinite here. The full conditional for each b_{i,d} is given by equation (14). The derivation, based on a direct application of Bayes' rule with respect to the probability mass function of the Pólya sequence, is in Appendix A.
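As a sanity check on these conditionals, here is a direct (non-sparse, non-parallel) sketch of the two Gibbs steps; the unnormalized z probabilities of the form (m^{-i}_{d,k} + αΨ_k)φ_{k,v(i)} are our reading of the partially collapsed LDA conditionals cited above, and the function names are ours.

```python
import numpy as np

def z_full_conditional(m_d, psi, phi_col, alpha):
    """p(z_{i,d} = k | Phi, Psi) ∝ (m_d[k] + alpha * psi[k]) * phi[k, v(i)],
    where m_d excludes token i and phi_col is the column of Phi for word type v(i)."""
    p = (m_d + alpha * psi) * phi_col
    return p / p.sum()

def phi_full_conditional(n_k, beta, rng):
    """phi_k | z ~ Dirichlet(beta + n_k), with n_k the topic-word counts for topic k."""
    return rng.dirichlet(beta + n_k)
```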

The full conditional for Ψ
To derive the full conditional for Ψ, we examine the prior and likelihood components of the model. It is shown in Appendix A that the likelihood term factors into two parts: the first is a multiplicative constant independent of Ψ, which vanishes via normalization; the second is ∏_k Ψ_k^{l_k}. Thus, the full conditional Ψ | z, b depends on z and b only through the sufficient statistic l defined by

l_k = Σ_d Σ_i b_{i,d} 1(z_{i,d} = k)

and so we may suppose without loss of generality that the likelihood term is categorical. Under these conditions, we prove that the full conditional for Ψ admits a stick-breaking representation.
Proposition 1. Without loss of generality, suppose Ψ ~ GEM(γ) and x_1, .., x_n | Ψ ~ Categorical(Ψ), independently. Then Ψ | x is given by

Ψ_k = ς_k ∏_{j<k} (1 - ς_j),    ς_k ~ Beta(1 + l_k, γ + Σ_{j>k} l_j)     (19)

where l are the empirical counts of x.
Proof. Appendix B.
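Proposition 1 suggests a simple sampler for Ψ | l. Below is a sketch assuming stick weights of the form Beta(1 + l_k, γ + Σ_{j>k} l_j), with the truncation discussed in the next section applied by fixing the final stick to one; the vectorized tail computation and names are ours.

```python
import numpy as np

def sample_psi_posterior(l, gamma, rng):
    """Sample Psi | l via stick-breaking: s_k ~ Beta(1 + l_k, gamma + sum_{j>k} l_j),
    Psi_k = s_k * prod_{j<k} (1 - s_j). The final stick is set to one, projecting
    the tail onto the last (flag) component."""
    l = np.asarray(l, dtype=float)
    tail = np.concatenate((np.cumsum(l[::-1])[::-1][1:], [0.0]))  # sum_{j>k} l_j
    sticks = rng.beta(1.0 + l, gamma + tail)
    sticks[-1] = 1.0  # flag topic K*: absorb all remaining mass
    return sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
```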
This expression is similar to the stick-breaking representation of a Dirichlet process DP(·, F); however, it has different weights and does not include random atoms drawn from F as part of its definition (see Appendix B for more details). Putting these ideas together, we define an infinite-dimensional parallel Gibbs sampler.
Algorithm 1 proceeds as follows.
• Sample z_{i,d} from its full conditional given Φ and Ψ, in parallel over documents for d = 1, .., D.
• Sample b_{i,d} according to equation (14), in parallel over documents for d = 1, .., D.
• Sample φ_k | z for each topic k, in parallel over topics.
• Sample Ψ | z, b using Proposition 1.
Algorithm 1 is completely parallel, but cannot be implemented as stated due to the infinite number of full conditionals for Φ, as well as the infinite product used in sampling Ψ. We now bypass these issues by introducing an approximate finite-dimensional sampling scheme.

Finite-dimensional sampling of Ψ and Φ
By assuming Ψ ~ GEM(γ), an HDP posits an infinite number of topics a priori, with the number of tokens per topic decreasing rapidly in the topic's index at a rate controlled by γ. Thus, under the model, a topic with a sufficiently large index contains no tokens with high probability. We therefore propose to approximate Ψ by projecting its tail onto a single flag topic K*, which stands for all topics not explicitly represented in the computation. This can be done by deterministically setting ς_{K*} = 1 in equation (19). The resulting finite-dimensional Ψ is the correct posterior full conditional for the finite-dimensional generalized Dirichlet prior considered previously in Section 2.3. Hence, this finite-dimensional truncation forms a Bayesian model in its own right, which suggests it should perform reasonably well. From an asymptotic perspective, Ishwaran and James (2001) have shown that the approximation is almost surely convergent and, therefore, well-posed.
Once this is done, Ψ becomes a finite vector of length K * , and only K * rows of Φ need to be explicitly instantiated as part of the computation. This instantiation allows the algorithm to be defined on a fixed finite state space, simplifying bookkeeping and implementation.
From a computational efficiency perspective, the resulting value K * takes the place of K in partially collapsed LDA. However, it cannot be interpreted as the number of topics in the sense of LDA. Indeed, LDA implicitly assumes that Ψ = Unif(1, .., K) deterministically-i.e., that every topic is assumed a priori to contain the same number of tokens. In contrast, the HDP model learns this distribution from the data by letting Ψ ∼ GEM(γ).
If we allow the state space to be resized when topic K * is sampled, then following Papaspiliopoulos and Roberts (2008), it is possible to develop truncation schemes which introduce no error. Since this results in more complicated bookkeeping which reduces performance, we instead fix K * and defer such considerations to future work. We recommend setting K * to be sufficiently large that it does not significantly affect the model's behavior, which can be checked by tracking the number of tokens assigned to the topic K * .
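The recommended truncation check, tracking how many tokens are assigned to the flag topic, amounts to a one-line diagnostic; the zero-indexed storage convention for K* below is an assumption of this sketch.

```python
import numpy as np

def flag_topic_usage(z, K_star):
    """Count tokens currently assigned to the flag topic K*. If this stays at
    zero across iterations, the truncation level K* is adequate; a persistent
    nonzero count suggests K* should be increased. Assumes topics are stored
    zero-indexed, so the flag topic is index K* - 1."""
    return int(np.sum(np.asarray(z) == K_star - 1))
```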

Sparse sampling of Φ and z
To be efficient, a topic model needs to utilize the sparsity found in natural language as much as possible. In our case, the two main sources of sparsity are as follows.
1. Document-topic sparsity: most documents will only contain a handful of topics.
2. Topic-word sparsity: most word types will not be present in most topics.
We thus expect the document-topic sufficient statistic m and the topic-word sufficient statistic n to contain many zeros. We seek to use this to reduce sampling complexity. Our starting point is the Poisson Pólya Urn sampler of Terenin et al. (2019), a Gibbs sampler for LDA whose computational complexity depends on the minimum of two sparsity coefficients representing document-topic and topic-word sparsity; such algorithms are termed doubly sparse. The key idea is to approximate the Dirichlet full conditional for φ_k with a Poisson Pólya Urn (PPU) distribution defined by

ϕ_{k,v} ~ Poisson(n_{k,v} + β_v)

for v = 1, .., V. This distribution is discrete, so Φ becomes a sparse matrix. The approximation is accurate even for small values of n_{k,v}, and Terenin et al. (2019) prove that the approximation error vanishes for large data sets in the sense of convergence in distribution. If β is uniform, we can further use sparsity to accelerate sampling ϕ_{k,v}. Since a sum of Poisson random variables is Poisson, we can split

ϕ_{k,v} = ϕ^{(β)}_{k,v} + ϕ^{(n)}_{k,v},    ϕ^{(β)}_{k,v} ~ Poisson(β),    ϕ^{(n)}_{k,v} ~ Poisson(n_{k,v}).

We then sample ϕ^{(β)}_{k,v} sparsely by introducing a Poisson process and sampling its points uniformly, and sample ϕ^{(n)}_{k,v} sparsely by iterating over nonzero entries of n.
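The split can be sketched as follows. Drawing the β-part as a single Poisson(βV) total with uniformly placed locations is equivalent to V independent Poisson(β) draws, which is the sparsity trick described above; the n-part touches only nonzero counts. The sparse-dict representation and names are ours.

```python
import numpy as np
from collections import defaultdict

def sample_phi_row_ppu(n_k_nonzero, beta, V, rng):
    """Sparse Poisson Polya Urn draw of one (unnormalized, integer) topic row:
    phi_{k,v} ~ Poisson(n_{k,v} + beta), split as Poisson(beta) + Poisson(n_{k,v}).
    n_k_nonzero maps word type -> nonzero count for topic k."""
    phi = defaultdict(int)
    # beta-part: total count ~ Poisson(beta * V), locations uniform over word types
    total = rng.poisson(beta * V)
    for v in rng.integers(0, V, size=total):
        phi[int(v)] += 1
    # n-part: iterate only over nonzero topic-word counts
    for v, n_kv in n_k_nonzero.items():
        phi[v] += int(rng.poisson(n_kv))
    return dict(phi)
```

The expected work for the β-part is O(βV), typically far below V for the small β used in practice.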
For z, the full conditional is similar to the one in partially collapsed LDA (Magnusson et al., 2018); the difference is the presence of Ψ_k. Writing the unnormalized conditional as the sum of a component (a) α Ψ_k ϕ_{k,v(i)}, which is identical for all z_{i,d}, and a component (b) m^{-i}_{d,k} ϕ_{k,v(i)}, the factor Ψ_k enters only through component (a), and can therefore be absorbed at each iteration directly into an alias table (Walker, 1977; Li et al., 2014). Component (b) can be computed efficiently by utilizing the sparsity of Φ and m and iterating over whichever has fewer non-zero entries.
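The (a)/(b) decomposition can be sketched directly; for clarity this sketch forms both components explicitly instead of drawing component (a) from a precomputed alias table, and the names are ours.

```python
import numpy as np

def sample_z_token(m_d, phi_col, psi, alpha, rng):
    """Sample one topic indicator from
    p(k) ∝ alpha * psi[k] * phi_col[k]   (component (a), shared across tokens)
         + m_d[k] * phi_col[k]           (component (b), sparse in m_d),
    where phi_col is the column of Phi for this token's word type."""
    part_a = alpha * psi * phi_col
    part_b = np.zeros_like(part_a)
    nz = np.nonzero(m_d)[0]              # iterate only over topics present in document d
    part_b[nz] = m_d[nz] * phi_col[nz]
    p = part_a + part_b
    return int(rng.choice(len(p), p=p / p.sum()))
```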

Direct sampling of l
Rather than sampling b, whose size grows linearly with the number of documents, we introduce a scheme for sampling the sufficient statistic l directly. Observe that

l_k = Σ_d Σ_i b_{i,d} 1(z_{i,d} = k) = Σ_d Σ_{j=1}^{m_{d,k}} b^{(k)}_{d,j}     (25)

where the domain of summation and the value of the indicators have been switched. By definition of the Pólya sequence, b^{(k)}_{d,j} ~ Bernoulli(α Ψ_k / (α Ψ_k + j - 1)), independently. Summing this expression over documents, we obtain

l_k | Ψ, m = Σ_j Binomial(D_{k,j}, α Ψ_k / (α Ψ_k + j - 1))

where D_{k,j} is the total number of documents with m_{d,k} ≥ j. Since m_{d,k} = 0 for all topics k without any tokens assigned, we only need to sample l for topics that have tokens assigned to them. This idea can also be applied straightforwardly to other HDP samplers (Chang and Fisher, 2014; Ge et al., 2015), as it yields alternative full conditionals in lieu of the Stirling distribution (Antoniak, 1974). The complexity of sampling l directly is constant with respect to the number of documents, and depends instead on the maximum number of tokens per document.
To handle the bookkeeping necessary for computing D_{k,j}, we introduce a sparse matrix d of size K × max_d N_d whose entries d_{k,p} are the number of documents for topic k that have a total of p topic indicators assigned to them. We increment d once z_d has been sampled, by iterating over the non-zero elements of m_d. We then compute D_{k,j} as the reverse cumulative sum of the rows of d.

Poisson Pólya urn partially collapsed Gibbs sampling
Putting all of these ideas together, we obtain Algorithm 2. Each iteration:
• samples the topic indicators z in parallel over documents, using the alias table for component (a) and sparse iteration for component (b);
• samples the rows of Φ from the Poisson Pólya Urn approximation, in parallel over topics;
• samples the sufficient statistic l directly via its binomial full conditionals;
• samples the truncated Ψ | l using Proposition 1 with ς_{K*} = 1.
Algorithm 2 is sparse, massively parallel, defined on a fixed finite state space, and contains no infinite computations in any of its steps. The Gibbs step for Φ converges in distribution (Terenin et al., 2019) to the true Gibbs step as N → ∞, and the Gibbs step for Ψ converges almost surely (Ishwaran and James, 2001) to the true Gibbs step as K* → ∞.

Computational complexity
We now examine the per-iteration computational complexity of Algorithm 2. To proceed, we fix K* and the maximum document size max_d N_d, and place a growth assumption relating the vocabulary size V to the total number N of words.
The per-iteration complexity of Algorithm 2 is equal to the sum of the per-iteration complexity of sampling its components. The sampling complexities of Ψ and l are constant with respect to the number of tokens, and the sampling complexity of Φ has been shown by Magnusson et al. (2018) to be negligible under the given assumptions. Thus, it suffices to consider z.
At a given iteration, let K_d denote the number of topics present in document d, and let K_v denote the number of topics in which word type v occurs. Sampling z_{i,d} costs O(min(K_d, K_{v(i)})) per token, since we iterate over whichever of m_d and the column of Φ has fewer non-zero entries, so the total cost of the z step is O(Σ_d Σ_i min(K_d, K_{v(i)})). Algorithm 2 is thus a doubly sparse algorithm.

Performance results
To study the performance of the partially collapsed sampler (Algorithm 2), we implemented it in Java using the open-source MALLET (McCallum, 2002) topic modeling framework. We ran it on the AP, CGCBIB, NEURIPS, and PUBMED corpora, which are summarized in Table 2. Prior hyperparameters controlling the degree of sparsity were set to α = 0.1, β = 0.01, γ = 1. We set K* = 1000 and observed no tokens ever allocated to the topic K*. Data were preprocessed with default MALLET stop-word removal, a minimum document size of 10, and a rare word limit of 10. The algorithm was initialized with one topic. All experiments were repeated five times to assess variability. Total runtime for each experiment is given in Table 2.
To assess Algorithm 2 in a small-scale setting, we compare it to the widely-studied sparse fully collapsed direct assignment sampler, which is not parallel. We ran 100 000 iterations of both methods on AP and CGCBIB. We selected these corpora because they were among the larger corpora on which it was feasible to run our direct assignment reference implementation within one week. Trace plots of the log marginal likelihood for z given Ψ and of the number of active topics, i.e., topics assigned at least one token, can be seen in Figure 1(a,d) and Figure 1(b,e), respectively. The direct assignment algorithm converges more slowly, but achieves a slightly better local optimum in terms of marginal log-likelihood than our method. This indicates that the direct assignment method may stabilize around a different local optimum, and may represent a potential limitation of the partially collapsed sampler in settings where non-parallel methods are practical.
To better understand the distributional differences between the algorithms, we examined the number of tokens per topic, which can be seen in Figure 1(c,f). The partially collapsed sampler is seen to assign more tokens to smaller topics, indicating that it stabilizes around a local optimum with slightly broader semantic themes.
To visualize the effect this has on the topics, we examined the most common words for each topic. Since the algorithms generate too many topics to make full examination practical, we instead compute a quantile summary with five topics per quantile. The summary is computed by ranking all topics by the number of tokens, choosing the five topics closest to the 100%, 75%, 50%, 25%, and 5% quantiles of the ranking, and computing their top words. This gives a representative view of the algorithm's output for large, medium, and small topics. Results may be seen in Appendix D and Appendix C; we find the direct assignment and partially collapsed samplers to be mostly comparable, with substantial overlap in top words for common topics.

Next, we assess Algorithm 2 in a more demanding setting and compare against the previous parallel state of the art. There are various scalable samplers available for the HDP. For a fair comparison, we restrict ourselves to samplers that are designed for topic models and explicitly incorporate the sparsity of natural language in their construction. Among these, we selected the parallel subcluster split-merge algorithm of Chang and Fisher (2014) as our baseline, because it was used in the largest-scale benchmark of the HDP topic model performed to date of which we are aware, and shows performance comparable to other methods (Ge et al., 2015). The subcluster split-merge algorithm is designed to converge in fewer iterations, but is more costly to run per iteration. Thus, we used a fixed computational budget of 24 hours of wall-clock time for both algorithms. Computation was performed on a system with a 4-core 8-thread CPU and 8GB RAM.
Results can be seen in Figure 1(g); note that the subcluster split-merge algorithm is parametrized using sub-topic indicators and sub-topic probabilities, so its numerical log-likelihood values are not directly comparable to ours and should be interpreted purely as a convergence diagnostic. Algorithm 2 stabilizes much faster with respect to both the number of active topics in Figure 1(g) and the marginal log-likelihood in Figure 1(h). The subcluster split-merge algorithm adds new topics one at a time, whereas our algorithm can create multiple new topics per iteration; we hypothesize this difference leads to faster convergence for Algorithm 2.
In Figure 1(i), we observe that the amount of computing time per iteration increases substantially for the subcluster split-merge method as it adds more topics. For Algorithm 2, this stays approximately constant for its entire runtime.
To evaluate the topics produced by the algorithms, we again examined the most common words for each topic via a quantile summary, given in Appendix E. We find the subcluster split-merge algorithm appears to generate topics with slightly more semantic overlap compared to Algorithm 2, but otherwise produces comparable output.
Finally, to assess scalability, we ran 25 000 iterations of Algorithm 2 on PubMed, which contains 768m tokens. To our knowledge, this dataset is an order of magnitude larger than any datasets used in previous MCMC-based approaches for the HDP.
Computation was performed on a compute node with 2x10-core CPUs with 20 threads and 64GB of RAM. The marginal likelihood and number of active topics are given in Figure 1(j) and Figure 1(k).
To evaluate the topics discovered by the algorithm, we examined their most common words; these may be seen in full in Appendix F. We observe that the semantic themes present in the topics vary according to how many tokens they have: topics with more tokens appear to be broader, whereas topics with fewer tokens appear to be more specific. This behavior illustrates a key difference between the HDP and methods like LDA, which do not contain a learned global topic distribution Ψ in their formulation. We suspect the effect is particularly pronounced on PubMed compared to CGCBIB and NeurIPS due to its large number of tokens.

Discussion
In this work, we introduce the parallel partially collapsed Gibbs sampler (Algorithm 1) for the HDP topic model, which converges to the correct target distribution. We propose a doubly sparse approximate sampler (Algorithm 2), which allows the HDP to be implemented with per-token sampling identical to that of Pólya Urn LDA (Terenin et al., 2019). Compared to other approaches for the HDP, it offers the following improvements.
1. The algorithm is fully parallel in all steps.
2. The topic indicators z utilize all available sources of sparsity to accelerate sampling.
3. All steps not involving z have constant complexity with respect to data size.
4. The proposed sparse approximate algorithm becomes exact as N → ∞ and K * → ∞.
These improvements allow us to train the HDP on larger corpora. The data-parallel nature of our approach means that the amount of available parallelism increases with data size. This avoids the load-balancing-related scalability limitations pointed out by Gal and Ghahramani (2014).

Nonparametric topic models are less straightforward to evaluate empirically than ordinary topic models. In particular, we found topic coherence scores (Mimno et al., 2011) to be strongly affected by the number of active topics K, which causes a preference for models with fewer topics and more semantic overlap per topic. We view the development of summary statistics that are K-agnostic, and of statistics measuring other aspects of topic quality such as overlap, to be an important direction for future work. We are particularly interested in techniques that can be used to compare algorithms for sampling from the same model defined over fully disjoint state spaces, such as Algorithm 2 and the subcluster split-merge algorithm in Section 3.

Figure 2: Top 8 words for topics obtained by Algorithm 2 on PubMed, together with topic index k and total number of words n_{k,•} present in the topic. We observe that the topics range from broad to specific: this is a consequence of the hierarchical Dirichlet process prior via the inclusion of the global topic proportions Ψ. Topics obtained by Algorithm 2 on all corpora may be seen in Appendix C, Appendix D, Appendix E, and Appendix F.

k        Topic 1     Topic 5     Topic 9     Topic 13         Topic 17
n_{k,•}  42 395 289  23 907 517  22 167 377  20 925 933       18 924 590
         care        cancer      protein     protein          cell
         health      tumor       binding     cell             neuron
         patient     patient     membrane    kinase           electron
         medical     cell        acid        expression       brain
         research    carcinoma   activity    receptor         rat
         system      breast      cell        activation       nerve
         clinical    tumour      gel         pathway          fiber
         cost        survival    human       phosphorylation  nucleus
The partially collapsed HDP sampler can stabilize around a different local mode than the fully collapsed sampler. There have been attempts to improve mixing in the fully collapsed sampler (Chang and Fisher, 2014), including the use of Metropolis-Hastings steps for jumping between modes (Jain and Neal, 2004). These techniques are largely complementary to ours and can be explored in combination with the ideas presented here.
The HDP posterior is a heavily multimodal target for which full posterior exploration is known to be difficult (Chang and Fisher, 2014; Gal and Ghahramani, 2014; Buntine and Mishra, 2014), and sampling schemes are generally used more in the spirit of optimization than of traditional MCMC. These issues are mirrored in other approaches, such as variational inference, where restrictive mean-field factorization assumptions are often required, reducing the quality of discovered topics. We view MAP-based analogs of the ideas presented here as a promising direction, since they may allow additional flexibility enabling faster training.
Many of the ideas in this work, such as the binomial trick, are generic and apply to any topic model structurally similar to the HDP's GEM representation given in Section 2. For example, future work could consider an informative prior for Ψ in lieu of GEM(γ), potentially improving convergence and topic quality, or develop parallel schemes for other nonparametric topic models such as Pitman-Yor models (Teh, 2006), tree-based models (Hu and Boyd-Graber, 2012; Paisley et al., 2015), embedded topic models (Dieng et al., 2020), and nonparametric topic models used within data-efficient language models (Guo et al., 2020).

Conclusion
We introduce the doubly sparse partially collapsed Gibbs sampler for the hierarchical Dirichlet process topic model. By formulating this algorithm using a representation of the HDP which connects it with the well-studied latent Dirichlet allocation model, we obtain a parallel algorithm whose per-token sampling complexity is the minimum of two sparsity terms. The ideas used apply to a large array of topic models which possess the same full conditional for the topic indicators z. Our algorithm for the HDP scales to a 768m-token corpus (PubMed) on a single multicore machine in under four days.
The proposed techniques leverage parallelism and sparsity to scale nonparametric topic models to larger datasets than previously considered feasible for MCMC or other methods possessing similar convergence properties. We hope these contributions enable wider use of Bayesian nonparametrics for large collections of text.