Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!

Topic models are a useful analysis tool to uncover the underlying themes within document collections. Probabilistic models which assume a generative story have been the dominant approach for topic modeling. We propose an alternative approach based on clustering readily available pre-trained word embeddings, while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach is comparable to classical models, and complexity analysis indicates that this is a practical alternative to traditional topic modeling.


Introduction
For exploratory document analysis, which aims to uncover main themes and underlying narratives within a corpus (Boyd-Graber et al., 2017), Topic Models are undeniably the standard approach. But in times of distributed and even contextualized embeddings, are they the only option?
This work explores an alternative to topic modeling by reformulating 'key themes' or 'topics' as clusters of words under the modern distributed representation learning paradigm. Unsupervised pre-trained word embeddings represent each word type as a vector, which allows us to cluster word types based on their distance in high-dimensional space. The goal of this work is not to strictly outperform, but rather to benchmark standard clustering of modern embedding methods against the classical approach of Latent Dirichlet Allocation (Blei et al., 2003). We restrict our study to influential embedding methods and focus on centroid-based clustering algorithms, as they provide a natural way to obtain the top words in each cluster based on distance from the cluster center.
Aside from reporting the best performing combination of word embeddings and clustering algorithm, we are also interested in whether there are consistent patterns across the choice of embeddings and clustering algorithm. A word embedding method that does consistently well across clustering algorithms would suggest that it is a good representation for unsupervised document analysis. Similarly, a clustering algorithm that performs consistently well across embeddings would suggest that the assumptions of this algorithm are more likely to be generalizable even with future advances in word embedding methods.
Finally, we seek to incorporate document information directly into the clustering algorithm, and quantify the effects of two key methods: 1) weighting terms during clustering and 2) reranking terms for obtaining the top-J representative words. Our contributions are as follows:
• To our knowledge, this is the first work which systematically applies centroid-based clustering algorithms to embedding methods for document analysis.
• We analyse how clustering embeddings directly can potentially achieve lower computational complexity and runtime as compared to probabilistic generative approaches.
• Our proposed approach for incorporating document information into clustering and reranking of top words results in sensible topics; the best performing combination is comparable with LDA, but with smaller time complexity and empirical runtime.
• We find that the dimensions of some word embeddings can be reduced by more than 50% before clustering.
Clustering word embeddings has been used for readability assessment (Cha et al., 2017), argument mining (Reimers et al., 2019), document classification and document clustering (Sano et al., 2017). To our knowledge, there is no prior work that studies the interaction between word embeddings and clustering algorithms on unsupervised document analysis in a direct comparison with standard LDA (Blei et al., 2003). Most related perhaps is the work of de Miranda et al. (2019), who explore this idea with self-organising maps, but do not provide any quantitative results.

General Clustering Approach
We first preprocess and extract the vocabulary from our training documents (section 5.1). Each word is converted to its embedding representation, following which we apply the various clustering algorithms to obtain k clusters, using weighted (subsection 3.3) or unweighted word types. After the clustering algorithm has converged, we obtain the top J words from each cluster for evaluation.
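The pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: the vocabulary and embeddings below are toy stand-ins for the extracted vocabulary and pre-trained vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins: `vocab` for the extracted vocabulary, `embed` for a
# pre-trained embedding lookup (GloVe, BERT type vectors, etc.).
rng = np.random.default_rng(0)
vocab = ["game", "team", "season", "god", "faith", "church"]
embed = {w: rng.normal(size=50) for w in vocab}

X = np.stack([embed[w] for w in vocab])  # one row per word type
k = 2                                    # number of clusters (topics)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
# km.labels_ holds the hard cluster assignment for each word type
```

Weighted variants pass per-type scores to the clustering step (see subsection 3.3); the unweighted case shown here treats every word type uniformly.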

Obtaining the top-J words
In traditional topic modeling (LDA), the top-J words are those with the highest probability under each topic-word distribution. For centroid-based clustering algorithms, the top words are naturally those closest to the cluster center, and for probabilistic clustering, the top words are those with the highest probability under the cluster parameters. Formally, for a cluster with center μ_k, the top-J words are the J member word types w whose embeddings e_w minimise the distance ||e_w − μ_k||.
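A concrete sketch of the centroid-based case follows, with a placeholder vocabulary and random vectors standing in for pre-trained embeddings: for each cluster, the J member types closest to the cluster center are taken as its top words.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
vocab = [f"word{i}" for i in range(100)]   # placeholder vocabulary
X = rng.normal(size=(100, 20))             # placeholder type embeddings

k, J = 5, 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

top_words = {}
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    # distance of each member type to its cluster center
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    top_words[c] = [vocab[i] for i in members[np.argsort(dists)[:J]]]
```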

Incorporating document information
We explore various methods to incorporate corpus information into the clustering algorithm. Specifically, we examine three different schemes to assign scores to word types: term frequency (TF), term frequency-inverse document frequency (TF-IDF), and term frequency-document frequency (TF-DF). These scores are then used for weighting word types when clustering (models marked +), reranking top words (models marked r), both (models marked +r), or neither (unmarked models), i.e., using uniform weights.
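Both uses of a score can be sketched with term frequency as the example (all inputs below are toy stand-ins): (1) weighting word types during clustering, which sklearn's KMeans supports via `sample_weight`, and (2) reranking the types nearest a centroid by frequency before taking the top J.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                      # toy type embeddings
tf = rng.integers(1, 1000, size=200).astype(float)  # toy term frequencies

# (1) weighted clustering: frequent types pull the centroids harder
km = KMeans(n_clusters=10, n_init=10, random_state=0)
km.fit(X, sample_weight=tf)

def top_j(c, J=10, window=100):
    # (2) among the `window` member types nearest centroid c,
    #     keep the J most frequent ones
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    near = members[np.argsort(dists)[:window]]
    return near[np.argsort(-tf[near])[:J]]
```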

Computational Complexity
The complexity of KM is O(tknm), and of GMM is O(tknm^3), where t is the number of iterations, k is the number of clusters (topics), n is the number of word types (unique vocabulary), and m is the dimension of the embeddings. Weighted variants have a one-off cost of weight initialisation, and contribute a constant multiplicative factor when recalculating the centroids in the clustering algorithm. Reranking adds an O(n log n_k) cost, where n_k is the average number of elements in a cluster. In contrast, LDA via collapsed Gibbs sampling has a complexity of O(tkN), where N is the number of all tokens in the corpus. When N >> n, clustering methods can potentially achieve better performance-complexity tradeoffs.

Experimental Setup
Datasets We use the 20 newsgroup dataset (20NG), a common text analysis dataset containing around 18,000 documents in 20 categories. We adopt the standard 60-40 train-test split in the dataset, run the clustering algorithm on the training set, obtain the top 10 words of each cluster, and evaluate these on the test split. We present results averaged across 5 random seeds.
Preprocessing We remove stopwords, punctuation and digits, lowercase tokens, and exclude words that appear in fewer than 5 documents. For contextualised word embeddings (BERT and ELMo), sentences serve as the context window to obtain the token representations, which are averaged to obtain the type representation. For BERT, we experiment with two variants: BERT(ns), which ignores subword tokens, and BERT, which averages the subword token representations.
Evaluation Metric We evaluate the clustering results using the topical coherence metric normalised pointwise mutual information (NPMI; Bouma, 2009), which has been shown to correlate with human judgements (Chang et al., 2009). NPMI ranges over [−1, 1].
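The pairwise NPMI computation can be sketched as follows; the probabilities here are illustrative, while in practice they are estimated from word co-occurrence counts over the evaluation corpus.

```python
import math

def npmi(p_ij, p_i, p_j, eps=1e-12):
    # NPMI(w_i, w_j) = PMI(w_i, w_j) / -log p(w_i, w_j), bounded in [-1, 1]
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / -math.log(p_ij + eps)

print(round(npmi(0.10, 0.10, 0.10), 3))  # perfectly co-occurring pair -> 1.0
print(round(npmi(0.01, 0.10, 0.10), 3))  # independent pair -> 0.0
```

Topic-level coherence averages this quantity over all pairs of a topic's top-10 words, and the reported score averages over topics.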

Results and Discussion
Runtime Compared to a simple LDA run, which performs no better than our best method KM+r and takes about a minute using MALLET (McCallum, 2002), running the clustering on CPU takes little more than 10 seconds using sklearn (Pedregosa et al., 2011), and a third of that using custom JAX implementations on GPU (Bradbury et al., 2018).
Incorporating Document Information We find that simply using Term Frequency (TF) outperforms the other weighting schemes (subsection 3.3). In particular, TF-IDF, perhaps surprisingly, is a poorer reweighting scheme (see Appendix B). Therefore our results in Table 2 and subsequent analysis utilise TF for weighted clustering and reranking.

Analysis of Algorithms - Weighted Clustering
Under the unweighted clustering of vocabulary types, all clustering algorithm and embedding combinations perform poorly compared to LDA. GMM outperforms KM and SK for both weighted (indicated with +) and unweighted variants across all embedding methods (p < 0.05).

Analysis of Algorithms - Reranking
For KM+ and SK+, extracting the top topic words (subsection 3.2) before reranking results in reasonable-looking themes, but scores poorly on NPMI. Reranking the top topic words with a window size of 100 results in a large improvement for KM+ (p < 0.02) and SK+ (p < 0.01). Examples before and after reranking are provided in Table 1. This indicates that cluster centers are surrounded by low-frequency types, even if the clusters are centered around valid themes.
Gains from reranking GMM+ are much less pronounced. We found that the top topic words before and after reranking for BERT-GMM+ have an average Jaccard similarity score of 0.910, indicating that the Gaussians are already centered at word types of high frequency in the training corpus, and fundamentally have different cluster centers from those learned by KM.

Analysis of Embedding Method
The best performers are Spherical embeddings (Meng et al., 2019). BERT embeddings perform consistently well across the clustering algorithm variants with weighting and reranking (though GPU resources are not always available to efficiently extract BERT embeddings from pre-trained models). Interestingly, the exclusion of words that are tokenized into subwords (BERT(ns)) does not negatively impact topic coherence (p ≥ 0.05). This suggests that compound words that can be tokenized into subwords are not critical to finding coherent topics.

Table 2: NPMI results (higher is better) for pre-trained word embeddings and k-means (KM), spherical k-means (SK) and GMM. + indicates weighting and r indicates reranking of top words. LDA has an NPMI score of 0.279, while the best performing Spherical embeddings with KM+r achieve a slightly better (but not statistically different) NPMI of 0.282. All results are averaged across 5 random seeds.

Dimensionality Reduction
We apply PCA to the word embeddings before clustering to estimate the amount of redundancy in the dimensions of large embeddings, which impacts clustering complexity (section 4). Across both KM and GMM, BERT embeddings can be reduced to 300 dimensions (a reduction of over 50%). Performance for both Spherical and BERT embeddings begins to fall at 200 dimensions, but this effect can be mitigated with reranking.
We observe that for GMM, we can safely reduce the dimensions of BERT embeddings from 768 to 100, and even achieve better performance at lower dimensionality. This reduction is consistent across different types of embeddings, indicating that GMM performs better at lower dimensionality (Figure 1). However, given the cubic complexity of GMM in the number of dimensions (section 4), KM, which achieves comparable performance, might be preferred in practical settings.
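The reduction step itself is a straightforward PCA-then-cluster pipeline; the sketch below uses random 768-dimensional vectors as stand-ins for BERT type embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))  # stand-in for 768-dim BERT type embeddings

# project to 100 dimensions before clustering
X_red = PCA(n_components=100, random_state=0).fit_transform(X)
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_red)
print(X_red.shape)  # (500, 100)
```

Since KM is linear and GMM cubic in the embedding dimension m (section 4), reducing m directly shrinks the dominant cost term.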

Conclusion
We outlined a methodology for clustering word embeddings for unsupervised document analysis, and presented a systematic comparison of various influential embedding methods and clustering algorithms. Our experiments suggest that pre-trained word embeddings combined with weighted clustering algorithms and reranking, provide a viable alternative to traditional topic modeling at lower complexity and runtime.

A k-means (KM) vs k-medoids (KD)
To further understand the effect of other centroid based algorithms on topic coherence, we also applied the k-medoids (KD) clustering algorithm. KD is a hard clustering algorithm similar to KM but less sensitive to outliers.
As we can see in Table 3, KD usually did as well as or worse than KM. KD also did relatively poorly after frequency reranking. Where KD did do better than KM, the difference was not very striking, and the NPMI scores were still well below those of the other top performing models.

B Comparing Different Reranking Schemes
As mentioned in the paper, after clustering the embeddings, instead of directly retrieving the top-J terms, we can rerank the terms based on corpus-level scores and then retrieve the top-J terms with the highest values. We compare term frequency (TF), term frequency-inverse document frequency (TF-IDF) and term frequency-document frequency (TF-DF), equations for which are presented in subsection 3.3. To get a single value for each word for TF-IDF, we sum over all the documents to get one aggregated value. We present the results for using the different reranking schemes for KM (Table 4) and frequency-weighted KM (Table 5).
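The three scores can be computed from a document-term count matrix, as sketched below with toy counts. The exact equations are in subsection 3.3; in particular, TF-DF is read here as TF multiplied by document frequency, which is an assumption of this sketch rather than the paper's stated formula.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.integers(0, 5, size=(50, 200))  # toy doc-term counts (docs x types)

tf = C.sum(axis=0)                      # term frequency per word type
df = (C > 0).sum(axis=0)                # document frequency per word type
idf = np.log(C.shape[0] / np.maximum(df, 1))
tfidf_agg = (C * idf).sum(axis=0)       # TF-IDF summed over documents
tf_df = tf * df                         # assumed reading of TF-DF: TF x DF
```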
We can see that, compared to the TF results in the main paper, the other reranking schemes, aggregated TF-IDF and TF-DF, while producing more coherent topics than the original hard clustering, fare worse than reranking with TF.