Measuring Topic Coherence through Optimal Word Buckets

Measuring topic quality is essential for scoring the learned topics and their subsequent use in Information Retrieval and Text classification. To measure quality of Latent Dirichlet Allocation (LDA) based topics learned from text, we propose a novel approach based on grouping of topic words into buckets (TBuckets). A single large bucket signifies a single coherent theme, in turn indicating high topic coherence. TBuckets uses word embeddings of topic words and employs singular value decomposition (SVD) and Integer Linear Programming based optimization to create coherent word buckets. TBuckets outperforms the state-of-the-art techniques when evaluated using 3 publicly available datasets and on another one proposed in this paper.


Introduction
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) based topic modelling infers topics from statistical relations between words, such as word co-occurrence, and not from semantic relations. Hence, topics inferred by LDA may not correlate well with human judgements even though they better optimize perplexity on held-out documents (Chang et al., 2009). Given the growing importance of topic models like LDA in text mining techniques and applications (Hingmire et al., 2013; Lin and He, 2009; Pawar et al., 2016), it is crucial to ensure that the inferred topics are of as high quality as possible. As shown by Aletras et al. (2017), computing topic coherence is also important for developing better topic representation methods for use in Information Retrieval. An attractive feature of probabilistic topic models is that the inferred topics can be interpreted by humans, each topic being just a bag of probabilistically selected "prominent" words in that topic's distribution. This has opened up a research area which explores the use of human expertise or automated techniques to measure the quality of topics and to improve topic modelling techniques by incorporating these measures.
As an example, consider two topics inferred from a document collection (topics are represented by their 10 most probable words):
{loan, foreclosure, mortgage, home, property, lender, housing, bank, homeowner, claim}
{horse, sullivan, business, secretariat, owner, get, truck, back, old, merchant}
The first topic is easily interpretable by humans whereas the second topic is incoherent and less understandable. One could evaluate a single topic or an entire set of topics ("topic model") for quality. Several approaches have been proposed in the literature for measuring the quality of a single topic or of an entire topic model (see Section 2).
In this paper, we aim at measuring the quality of a single topic and propose a novel approach, TBuckets, which groups a topic's words into thematic groups (which we call buckets). The intuition is that if a single large bucket is obtained from a topic, the topic carries a single coherent theme. TBuckets combines Singular Value Decomposition (SVD) and Integer Linear Programming (ILP) to achieve an optimal word bucket distribution. We evaluate our technique by correlating its estimated coherence scores with human annotated scores and compare with state-of-the-art results reported in Röder et al. (2015) and Nikolenko (2016). The TBuckets approach not only outperforms the state-of-the-art but is also parameter free. This makes TBuckets directly applicable to the topics of a topic model without any search in a parameter space.

Related Work
Several authors hypothesize that the coherence of the N most probable words of a topic captures its semantic interpretability. Newman et al. (2010) used the set of N most probable words of a topic and computed its coherence ($C_{UCI}$) based on pointwise mutual information (PMI) between all possible word pairs of the N words. Aletras and Stevenson (2013) propose a variant of $C_{UCI}$ that uses normalized PMI (NPMI) and is computed based on distributional similarity between the words of the topic. Each word of a topic is represented by a context vector based on a window context in Wikipedia, and coherence is computed as the average of cosine similarities between the topic's centroid vector and each word. Mimno et al. (2011) propose $C_{UMass}$, which uses log conditional probability (LCP) instead of PMI and estimates LCP on the same corpus on which the topics are inferred. Röder et al. (2015) propose a unifying framework that represents a coherence measure as a composition of parts that can be freely combined to form a configuration space of coherence definitions. These parts can be grouped into four dimensions: 1) ways a word set can be divided into smaller pieces, 2) word pair agreement measures like PMI or NPMI, 3) ways to estimate word probabilities and 4) methods to aggregate scalar values. This framework spans a large configuration space of coherence measures, so finding an appropriate coherence measure for a given set of topics becomes tedious.
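The PMI-based coherence described above can be illustrated with a minimal sketch. It assumes document-level co-occurrence as the probability estimate and a small smoothing constant `eps`; both are assumptions made for illustration and are not necessarily the settings used by Newman et al. (2010). The function name and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def pmi_coherence(topic_words, documents, eps=1e-12):
    """Average PMI over all pairs of topic words.

    Assumptions (not from the source): probabilities are estimated from
    document-level co-occurrence in `documents` (a list of token lists),
    and every topic word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    pair_scores = [
        math.log((prob(w1, w2) + eps) / (prob(w1) * prob(w2)))
        for w1, w2 in combinations(topic_words, 2)
    ]
    return sum(pair_scores) / len(pair_scores)


# Hypothetical toy corpus, for illustration only.
toy_docs = [["rain", "snow"], ["rain", "snow"], ["bank"], ["loan"]]
score = pmi_coherence(["rain", "snow"], toy_docs)
```

Words that always appear together receive a positive score, while words that never co-occur receive a strongly negative one because only `eps` remains in the numerator.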
Nikolenko (2016), one of the state-of-the-art approaches, also uses distributional properties of words and proposes coherence measures based on word embeddings. Topic quality is defined as the average distance between topic words, and four distance functions (cosine, L1, L2 and coordinate) are proposed. The paper reports strong results on datasets in Russian. Fang et al. (2016) also use cosine similarity between word embeddings to compute coherence scores for Twitter topics. Two other major approaches are based on topic word probability distributions (Alsumait et al., 2009) and on coverage and specificities of WordNet hierarchies for topic words (Musat et al., 2011).
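As a rough illustration of an embedding-based measure of this kind, the sketch below averages the pairwise cosine distance between topic word vectors. Taking cosine distance as one minus cosine similarity and averaging over unordered pairs are assumptions of this sketch; the exact definitions in Nikolenko (2016) may differ. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_pairwise_cosine_distance(vectors):
    """Mean of (1 - cosine similarity) over all unordered pairs of vectors."""
    dists = [1.0 - cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(dists) / len(dists)
```

Under this sketch, a lower average distance indicates that the topic words lie closer together in the embedding space.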

TBuckets: Creating buckets of topic words
The idea of viewing a topic as a set of coherent word buckets is based on how we humans observe a topic and decide its coherence. A human would observe the topic words one by one and put them in some form of coherent groups (or buckets, as we call them). Starting with a fresh bucket for the first word, every new word is put in an already created bucket if the word is semantically similar or semantically associated with the words in the bucket; otherwise the word is put in a new bucket. On completion of this exercise, all topic words would be distributed in various buckets. A distribution with a single large bucket and a few small buckets would signify better coherence. However, a distribution with multiple medium sized buckets would indicate lower coherence.
For a coherent topic like {storm, weather, wind, temperature, rain, snow, air, high, cold, northern}, which deals with weather and associated factors, the above procedure leads to the following bucket distribution:
Bucket-1: {storm, weather, wind, temperature, rain, snow, air, cold};
But for a non-coherent topic like {karzai, afghan, miner, official, mine, assange, government, kabul, afghanistan, wikileaks} the same procedure leads to the following bucket distribution:
Bucket-1: {karzai, afghan, kabul, afghanistan}; Bucket-2: {miner, mine}; Bucket-3: {official, government}; Bucket-4: {assange, wikileaks}
It is evident from the above examples that the final distribution of topic words into buckets closely reflects the coherence of a topic. Based on this idea, we devise the TBuckets approach, which enables us to perform this bucketing automatically and generate a coherence score for a topic. It only requires word embeddings of topic words, which are not difficult to obtain as embeddings of a large set of words, trained on various corpora, are now publicly available (Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014).
The idea of clustering arises intuitively when we think of forming related groups among a set of items (words here). However, an important limitation of clustering is that the resulting clusters are sensitive to the choice of parameters like linkage configuration, threshold on maximum distance, number of clusters, etc. Furthermore, cluster centroids computed using the average of word embeddings might not represent the underlying themes among the words. To really find the underlying themes, it is important to focus on interactions among the features of topic words. The values on the dimensions of a word's embedding can be regarded as the word's abstract features. Considering a matrix capturing interactions among the features of topic words, we hypothesize that the principal eigenvector of this matrix should capture the central theme of the topic. Further, we say that a topic is coherent if most of its words are aligned to this central theme. Additionally, other eigenvectors would capture other themes, if any.
To capture this notion, we propose the use of Singular Value Decomposition (SVD) and Integer Linear Programming (ILP) for obtaining optimal word theme alignments. We begin by constructing an $n \times d$ rectangular matrix $A$ comprising the $d$ dimensional word embeddings of the $n$ words of a topic. We then apply SVD on $A$ to obtain a product $USV^T$, where the columns of the $V$ matrix are eigenvectors of the feature-feature interaction matrix $A^TA$. These $d$ dimensional eigenvectors represent the underlying themes we are interested in. The eigenvector corresponding to the largest singular value is the principal eigenvector, representing the central theme. Now, to determine an initial assignment of words to the eigenvectors, we use the first $n$ eigenvectors in $V$ as bucket identifiers to assign words to. The assignment is naive: the word goes to the bucket represented by the word's most similar eigenvector. We use cosine similarity to measure similarity between the word's embedding and an eigenvector. We define the principal bucket as the one corresponding to the principal eigenvector.
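The SVD step and the naive assignment described above can be sketched with NumPy as follows. The function name `naive_assignment` and the random stand-in embeddings are hypothetical. The sketch assumes $d \geq n$, and it leaves the sign of each singular vector as returned by NumPy, since the text does not say how that sign is resolved.

```python
import numpy as np


def naive_assignment(A):
    """A: (n, d) array, one d-dimensional embedding per topic word (d >= n assumed)."""
    n = A.shape[0]
    # Rows of Vt are the right singular vectors of A, i.e. eigenvectors of A^T A,
    # sorted by decreasing singular value; row 0 is the principal eigenvector.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    themes = Vt[:n]
    # E[i, j]: cosine similarity between eigenvector i and the embedding of word j.
    themes_unit = themes / np.linalg.norm(themes, axis=1, keepdims=True)
    words_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    E = themes_unit @ words_unit.T
    # Naive assignment: each word goes to the bucket of its most similar eigenvector.
    return E, E.argmax(axis=0)


# Hypothetical random vectors standing in for real word embeddings.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
E, buckets = naive_assignment(A)
```

Here `buckets[j] == 0` means that word `j` falls in the principal bucket under the naive assignment, and the matrix `E` has the same layout as the similarity matrix used by the ILP formulation.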
We believe that this naive assignment is strict and may lead to the formation of multiple distinct but related themes. This may split the central theme across multiple buckets, so that words that should align with the central theme get aligned to other (related) themes. Hence, to improve the naive assignment we propose an ILP based optimization and attain an optimal word theme alignment. The details of the optimization formulation are presented in Table 1. We consider the following example topic from the NYT dataset to understand the ILP formulation: {baby, birth, pregnant, ...}
Table 1 (ILP formulation):
Parameters:
$n$: number of eigenvectors, equal to the number of words in a topic
$E$: matrix of dimensions $n \times n$, where $E_{ij}$ represents the similarity of the $j$-th word with the $i$-th eigenvector
$W$: matrix of dimensions $n \times n$, where $W_{ij}$ represents the similarity of the $i$-th word with the $j$-th word
$L$: matrix of dimensions $(n-1) \times n$, where $L_{ij} = 1$ if $E_{(i+1)j} > E_{1j}$, else 0
Variable:
$X$: matrix of dimensions $n \times n$, where $X_{ij} = 1$ only when the $j$-th word is assigned to the bucket associated with the $i$-th eigenvector

Objective
The objective function consists of two terms. The first term $\sum_{i=1}^{n} \sum_{j=1}^{n} E_{ij} \cdot X_{ij}$ maximizes the similarity of each word with the eigenvector to which it is assigned. Optimizing only this term is equivalent to obtaining the SVD based assignments, as each word gets assigned to the bucket corresponding to its closest eigenvector. The second term $-\sum_{i=2}^{n} \sum_{j=1}^{n} E_{1j} \cdot X_{ij}$ minimizes the penalty for the words which are not assigned to the principal eigenvector. The penalty is equal to their similarity with the principal eigenvector. The penalty term favours word assignments to the principal eigenvector by pushing to it some words which are not "too dissimilar" to its theme. The constraints described in the next subsection balance addition and restriction of word assignments to the principal eigenvector, ensuring a coherent principal bucket.
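To make the two terms concrete, the following minimal sketch evaluates the objective for a given assignment matrix. The function name and the toy similarity values are hypothetical; indices are zero-based, so row 0 plays the role of the principal eigenvector.

```python
def objective(E, X):
    """Value of the two-term objective for similarities E and a 0/1 assignment X.

    E[i][j]: similarity of word j with eigenvector i (row 0 is the principal one).
    X[i][j]: 1 if word j is assigned to the bucket of eigenvector i, else 0.
    """
    n = len(E)
    # First term: similarity of every word with its assigned eigenvector.
    fit = sum(E[i][j] * X[i][j] for i in range(n) for j in range(n))
    # Second term: penalty for words placed outside the principal bucket,
    # equal to their similarity with the principal eigenvector.
    penalty = sum(E[0][j] * X[i][j] for i in range(1, n) for j in range(n))
    return fit - penalty


# Hypothetical toy similarities for a two-word topic.
E = [[0.9, 0.5],
     [0.2, 0.6]]
naive = [[1, 0], [0, 1]]   # each word with its most similar eigenvector
merged = [[1, 1], [0, 0]]  # both words in the principal bucket
```

In this toy case the merged assignment scores higher than the naive one, because the second word is not too dissimilar to the principal eigenvector. This illustrates how the penalty term pulls such words into the principal bucket when the constraints allow it.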

Constraints
The first two constraints ensure sanity of the assignments. Constraint $C_1$ ensures that any word is assigned to one and only one eigenvector and constraint $C_2$ makes sure that at least one word is assigned to the principal eigenvector.
Constraint $C_3$ makes sure that any word $j$ which is assigned to a non-principal eigenvector $i$ has more similarity to the eigenvector $i$ than its similarity with any word $k$ assigned to the principal eigenvector. When the $j$-th word itself is assigned to the principal eigenvector, the LHS is always zero and the RHS is either zero or negative, hence satisfying the constraint trivially. When the $j$-th word is assigned to a non-principal eigenvector $i$, then $E_{ij} \cdot X_{ij}$ represents its similarity with the $i$-th eigenvector. As both the terms $X_{1j}$ and $\sum_{m=2, m \neq i}^{n} X_{mj}$ would be zero, the RHS reduces to $W_{jk} \cdot X_{1k}$, which is the similarity of the $j$-th word with the $k$-th word when the $k$-th word is assigned to the principal eigenvector.
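The condition expressed by this constraint can be checked for a candidate assignment with a short sketch. It is not the ILP constraint itself, only a direct test of its stated meaning. The function name and toy values are hypothetical, indices are zero-based with bucket 0 as the principal bucket, and treating equality as acceptable is an assumption.

```python
def c3_violations(E, W, assignment):
    """Return (j, k) pairs where word j, placed in a non-principal bucket, is less
    similar to its own eigenvector than to word k of the principal bucket.

    assignment[j] is the index of the eigenvector (bucket) of word j; 0 is principal.
    """
    principal_words = [k for k, b in enumerate(assignment) if b == 0]
    return [
        (j, k)
        for j, i in enumerate(assignment) if i != 0
        for k in principal_words
        if E[i][j] < W[j][k]
    ]


# Hypothetical toy similarities for a three-word topic.
E = [[0.8, 0.4, 0.1],
     [0.3, 0.5, 0.6],
     [0.1, 0.2, 0.7]]
W = [[1.0, 0.6, 0.2],
     [0.6, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
violations = c3_violations(E, W, [0, 1, 2])
```

In this toy case, word 1 is more similar to word 0 (0.6) than to its own eigenvector (0.5), so keeping it outside the principal bucket is flagged; moving it into the principal bucket removes the violation.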
It can be observed that the penalty term and constraint $C_3$ both favour assignments to the principal eigenvector. If the ILP formulation is restricted to only the three constraints $C_1$, $C_2$ and $C_3$, the example topic results in a bucket distribution whose principal bucket includes the words birth, pregnant, woman, pregnancy, mother, born and american. The constraint $C_4$ ensures that for any word which is assigned to the principal eigenvector, it is either the word's most similar eigenvector or its second most similar eigenvector. This constraint ensures that words highly dissimilar to the principal eigenvector do not get forced into the principal bucket. For any word $j$, the sum $\sum_{i=1}^{n-1} L_{ij}$ represents the number of eigenvectors which are more similar to it than the principal eigenvector. Hence, for each word assigned to the principal eigenvector, the LHS simply counts the number of other more similar eigenvectors and the constraint restricts this count to at most 1. Therefore, constraint $C_4$ ensures that there can be only two types of words in the principal bucket: i) words for which the principal eigenvector is the most similar and ii) words for which the principal eigenvector is the second most similar.
It is important to further improve the set of words that get attached to the principal eigenvector. Maintaining that words of type (i) are always in the majority implies adding fewer words which have the principal eigenvector as their second most similar eigenvector. Constraint $C_5$ ensures that words of type (i) are always in the majority.
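The conditions behind constraints $C_4$ and $C_5$ can be illustrated by the following sketch, which inspects a proposed principal bucket. The function names and toy values are hypothetical, indices are zero-based with row 0 as the principal eigenvector, and reading "majority" as a strict majority of type (i) over type (ii) words is an assumption of this sketch.

```python
def more_similar_count(E, j):
    """Number of non-principal eigenvectors more similar to word j than the
    principal eigenvector, i.e. the sum of column j of the L matrix."""
    return sum(1 for i in range(1, len(E)) if E[i][j] > E[0][j])


def check_principal_bucket(E, principal_words):
    """Return (c4_ok, c5_ok) for a proposed set of principal-bucket word indices."""
    counts = [more_similar_count(E, j) for j in principal_words]
    # C4: the principal eigenvector is the most or second most similar one.
    c4_ok = all(c <= 1 for c in counts)
    # C5 (assumed strict): type (i) words outnumber type (ii) words.
    type_i = sum(c == 0 for c in counts)
    type_ii = sum(c == 1 for c in counts)
    return c4_ok, type_i > type_ii


# Hypothetical toy similarities for a three-word topic.
E = [[0.8, 0.4, 0.1],
     [0.3, 0.5, 0.6],
     [0.1, 0.2, 0.7]]
```

With these values, word 0 is of type (i), word 1 is of type (ii), and word 2 is of neither type, so a principal bucket containing word 2 fails the first check and a bucket with words 0 and 1 fails the second.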
It can be observed that, as against the principal-eigenvector-favouring nature of the penalty term and constraint $C_3$, both constraints $C_4$ and $C_5$ inhibit the addition of dissimilar terms and ensure thematic coherence in the principal bucket. The complete ILP formulation for the example topic results in the following bucket distribution. It is evident that constraints $C_4$ and $C_5$ evict the term american, ensuring a coherent principal bucket.
Bucket-1: {baby, birth, pregnant, woman, pregnancy, mother, born}; Bucket-2: {american}; Bucket-3: {allergy}; Bucket-4: {bat}
The constraints in the ILP formulation can also be viewed as a set of flexible settings, and depending on the desired representation of the learned topics, the constraints can be loosened or tightened leading to an optimal bucket distribution.
The coherence score of the topic is defined as the size of the principal bucket after optimization.

Datasets
We evaluate TBuckets on 4 datasets: 20 NewsGroups (20NG), New York Times (NYT), Genomics and ACL. Each dataset consists of a set of 100 topics where each topic is represented by its 10 most probable words. Each topic is associated with a real number between 1 and 3 indicating the human judgement of its coherence. A detailed description of the 20NG, NYT and Genomics datasets is provided in Röder et al. (2015).
We inferred the 100 topics for the ACL dataset on the ACL Anthology Reference Corpus (Bird, 2008). We obtained the gold coherence scores for these topics from three annotators by following the methodology described in Röder et al. (2015). For all our experiments, we use the 300 dimensional pre-trained word embeddings provided by the GloVe framework (Pennington et al., 2014).

Evaluation
We use the same evaluation scheme as Röder et al. (2015). Each technique generates coherence scores for all the topics in a dataset. Pearson's r correlation coefficient is computed between the coherence scores based on human judgement and the coherence scores automatically generated by the technique. The higher the correlation with human scores, the better the technique performs at measuring coherence. Table 2 shows the Pearson's r values obtained from the state-of-the-art (Röder et al. (2015) and Nikolenko (2016)) and baselines (Clustering and Only SVD) compared with TBuckets. We consider scores on NYT, 20NG and Genomics as reported in Röder et al. (2015) and obtain scores on the ACL dataset using the web demo provided by the authors at http://palmetto.aksw.org/palmetto-webapp/
As observed in Table 2, TBuckets outperforms Röder et al. (2015) on 3 out of 4 datasets and Nikolenko (2016) on all 4 datasets. It also outperforms all the baselines considering average performance across all datasets. This is significant considering the fact that TBuckets is parameter free whereas the state-of-the-art technique (Röder et al., 2015) requires considerable tuning of multiple parameters. This is also a sound validation of the TBuckets idea for measuring topic coherence.
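The evaluation measure can be sketched in a few lines. The function below computes Pearson's r between two equally long lists of scores; the function name and the score lists are hypothetical and are not results from any of the datasets.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical scores for four topics, for illustration only.
human_scores = [1.0, 2.0, 3.0, 2.5]   # human coherence judgements (range 1 to 3)
bucket_sizes = [3, 6, 9, 8]           # principal bucket sizes from a coherence technique
r = pearson_r(human_scores, bucket_sizes)
```

A value of `r` close to 1 would indicate that the automatically generated scores rank and space the topics much as the human annotators do.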
Effect of word polysemy: The TBuckets approach relies on word embeddings for capturing the semantic relations among topic words. An important limitation of word embeddings is that a single representation of a word is learned irrespective of its senses. Hence, infrequent or domain-specific senses of polysemous words are not represented sufficiently. Coherent topics containing such polysemous words can still be judged coherent by humans, as they can easily infer the appropriate sense of these words from the context of the other topic words. TBuckets, however, is unable to consider infrequent or domain-specific senses of such words, resulting in multiple unnecessary buckets and lower coherence. For a coherent topic from the ACL dataset: {derivation, probabilistic, pcfg, collins, subtree, production, child, charniak, parser, treebank}, TBuckets produces three non-principal buckets for the words child, production and collins. A similar example from 20NG is {game, team, player, baseball, win, fan, run, season, hit, play}, where TBuckets creates a separate bucket for the word fan due to its infrequent sense of "sports fan".

Conclusion and Future Work
We proposed a novel approach TBuckets to measure quality of Latent Dirichlet Allocation (LDA) based topics, based on grouping of topic words into buckets. TBuckets uses singular value decomposition (SVD) to discover important themes in topic words and ILP based optimization to find optimal word-bucket assignments. We evaluated TBuckets on LDA topics of 4 datasets, by correlating the estimated coherence scores with human annotated scores and demonstrated the best average performance across datasets. Moreover, as compared to the state-of-the-art techniques which need to tune multiple parameters, TBuckets requires no parameter tuning.
In future, we plan to devise better ways to compute word similarities which would be more suitable for specific domains like Genomics. One possible way is to train word embeddings on a domain specific corpus and use the learned embeddings. We also intend to study the impact of using coherent topics for text classification and other NLP applications. We would also like to explore a new topic generation process which incorporates semantic relations between words, in addition to their statistical co-occurrence, leading to the generation of semantically coherent topics.