A Semantic Cover Approach for Topic Modeling

We introduce a novel topic modeling approach based on constructing a semantic set cover for clusters of similar documents. Specifically, our approach first clusters documents using their Tf-Idf representation, and then covers each cluster with a set of topic words based on semantic similarity, defined in terms of a word embedding. Computing a topic cover amounts to solving a minimum set cover problem. Our evaluation compares our topic modeling approach to Latent Dirichlet Allocation (LDA) on three metrics: 1) qualitative topic match, measured using evaluations by Amazon Mechanical Turk (MTurk) workers, 2) performance on classification tasks using each topic model as a sparse feature representation, and 3) topic coherence. We find that qualitative judgments significantly favor our approach, the method outperforms LDA on topic coherence, and is comparable to LDA on document classification tasks.


Introduction
Topic modeling is one of the core research problems in natural language processing. Approaches to topic modeling range from simple vector comparisons to probabilistic graphical models (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003; Mimno and McCallum, 2012). Nevertheless, despite the many approaches proposed over the years, probabilistic topic modeling methods in general, and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) in particular, have arguably become the dominant paradigm. For example, LDA remains the algorithm of choice in Amazon's healthcare NLP toolkit (Amazon Web Services, 2018).
However, there have been concerns about the performance of probabilistic models, particularly on datasets comprised of short documents, such as tweets (Davidson et al., 2017; Yan et al., 2013; Hong and Davison, 2010; Mittos et al., 2018; Steinskog et al., 2017). This is primarily because the sparsity of short texts makes it hard for the model to sufficiently account for word co-occurrences, which form the basis of the definition of a topic in the sense of a multinomial distribution over words. Additionally, the language used on Twitter is informal in nature, uses slang and non-dictionary words, and often lacks proper grammatical structure. Moreover, the complexity of probabilistic topic modeling approaches makes it difficult to interpret the specific choices they make about topics and their constituent words.
In this paper, we propose a novel approach to topic modeling which is conceptually simple and highly interpretable. Our approach is based on two hypotheses about the nature of short texts, such as tweets: first, that such texts can be grouped into relatively few disjoint clusters representing a similar mix of subjects (nominally, we call these clusters topics, recognizing that any such cluster may be comprised of multiple topics), and second, that each such subject mix can be adequately summarized by a small number of concepts (words). Both of these are distinct from LDA, which models a topic as a probability distribution over a large number of words. While LDA models each text as a mixture of multiple topics, we assert that each tweet falls into a single cluster. A more fundamental qualitative distinction of our approach from LDA is that it is deterministic in nature, and admits a much more compact representation of the corpus, since each topic, or cluster, is represented by only a small number of words.
To operationalize our hypotheses, we propose a two-step approach to topic modeling. First, we cluster documents based on their similarity in terms of Tf-Idf feature representation. Second, given the clustering, we attempt to find a set of words for each cluster that forms a description of the cluster. Specifically, we use a word embedding, along with a document representation in the same semantic space, to cover each cluster with a small set of topic words that are semantically similar to the documents. More precisely, we say that a word (concept) in a dictionary covers a document if it is among the k most similar words in the semantic embedding space. To cover a collection of documents thereby becomes a minimum set cover problem instance. While the set cover problem is computationally hard, it admits a fast greedy approximation algorithm (Chvatal, 1979), which we utilize to construct the topic descriptions for each document cluster.
Our evaluation combines qualitative and quantitative metrics. We first qualitatively compare our approach to LDA by asking MTurk subjects for their judgments about the quality of respective choices of topics for a random sample of documents from a cluster. We do this through two conceptually different ways, and observe a significant and systematic advantage of our approach over LDA. Quantitatively, we compare our approach and LDA in terms of standard intrinsic topic coherence and performance in text classification. On the intrinsic topic coherence metric, our approach fares significantly better than LDA for 4 out of the 5 datasets we use, and the two are comparable on the fifth dataset. Finally, we consider two classification tasks, spam and hate speech prediction, in which topic modeling is used as a sparse feature representation. In this task, we find that both approaches yield similar performance.

Related Work
One of the earlier and more influential topic modeling methods was Latent Semantic Analysis (LSA) (Deerwester et al., 1990), which performs a singular value decomposition on the term-document matrix to discover concepts. Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 1999) tackles the limitations of LSA (namely, potential negative values in the SVD and the lack of a proper probability distribution) using a latent variable model, where topics are the latent variables. Arguably the most influential approach in the topic modeling domain is Latent Dirichlet Allocation (Blei et al., 2003). LDA can be thought of as an extension of pLSA in which the priors are Dirichlet distributions. LDA continues to be widely used in topic modeling, and several derivatives exist, each catering to a specific task or corpus structure (Blei et al., 2007; Blei and Lafferty, 2006; Yan et al., 2013).
Concerns about the performance of such probabilistic topic models on short text data (e.g., tweets) have been illustrated by Davidson et al. (2017); Yan et al. (2013); Hong and Davison (2010); Mittos et al. (2018); Steinskog et al. (2017). Poor performance is attributed to the sparsity of short text data, which provides insufficient information for an approach like LDA to capture word co-occurrence. Yan et al. (2013) tackle this by explicitly modeling co-occurrence throughout the corpus to enhance topic learning. However, this approach requires O(m^2) memory (where m is the size of the vocabulary) to maintain all biterms (2-grams) and their frequencies in the corpus, making it inefficient in practice. Weng et al. (2010) aggregate tweets by the same user into pseudo-documents, yet this approach suffers from a dependence on the availability of user information, or on a disproportionate distribution of tweets over users. Hong and Davison (2010) aggregate tweets containing the same word, which improves performance relative to LDA; combining documents based on single words, however, induces heavy biases on the topics discovered. In our approach, we include a clustering step that can be thought of as an aggregation method: documents that are semantically similar are grouped into a cluster rather than a pseudo-document, where similarity is a function of all words in the document.
Rangarajan Sridhar (2015) proposes learning a vector space representation of words in a corpus using Word2Vec, similarly to our approach but without Tf-Idf weights, and then fitting a mixture of Gaussians on the resulting vectors using standard EM. However, the dimensionality of a Word2Vec representation is typically high (50-300 in practice), where Gaussian mixtures are known to perform poorly (Krishnamurthy, 2011). Dimensionality reduction on the Word2Vec space is typically used to alleviate this problem, but it weakens the representation in the process.
In addition to probabilistic topic modeling, document clustering was successfully used in topic modeling by Aker et al. (2016), who use a supervised framework to train a learning model that predicts similarity scores between comments from news articles. A graph consisting of documents as nodes and similarity-weighted edges is then passed to the Markov Clustering Algorithm (Van Dongen, 2000). A major drawback of this approach is the dependence on availability of ground truth data to begin with.

Topic Modeling Using a Semantic Cover
We propose a simple topic modeling framework comprised of two steps. First, we cluster documents based on similarity. Second, we extract a set of topics from each cluster by leveraging a word embedding. The intuition behind the clustering step is that it splits a corpus into qualitatively similar groups of documents; thus, we expect it to be possible to summarize the subject of each cluster by a small collection of topic words. The second step aims at summarizing each cluster of documents using a small set of topic words, with the property that the words chosen are semantically representative of the cluster. To achieve this goal, we leverage recent advances in neural word embeddings, which have been empirically demonstrated to be semantically meaningful (Mikolov et al., 2013a,b): semantic similarity between words is roughly captured by cosine similarity in the embedded space. Specifically, we first represent documents in the same embedding space as words, and then cast topic extraction as a set cover problem instance, in which a potential topic word covers a document if the word is sufficiently similar to it in the embedding space.

Document Clustering
Our first step is to partition the set of documents in the corpus into a collection of clusters. For this purpose, we first transform each document into its Tf-Idf representation. Depending on the dataset, any standard clustering approach may be used to partition the documents. In our case, we run spectral clustering (Ng et al., 2002) over the documents in their Tf-Idf form, where we use cosine similarity between vectors as the similarity metric.
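As an illustrative sketch, this clustering step can be implemented with scikit-learn as follows; the toy documents and parameter values here are hypothetical, not those used in our experiments:

```python
# Sketch of the clustering step: Tf-Idf representation plus spectral
# clustering with a precomputed cosine-similarity affinity matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "measles outbreak linked to theme park",
    "new measles cases reported in california",
    "signing up for health insurance online",
    "enrollment delays frustrate applicants",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # documents -> Tf-Idf vectors
affinity = cosine_similarity(tfidf)             # pairwise cosine similarity

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)  # one cluster id per document
```

Passing `affinity="precomputed"` lets us use cosine similarity directly as the graph weights, rather than the default RBF kernel on raw feature vectors.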

A Set Cover Approach for Topic Extraction
Having obtained a collection of clusters, we treat them independently, with the goal of extracting a small set of representative topic words for each cluster, which adequately represents the subject of the documents in the cluster. To this end, we first represent words, as well as documents, in a vector space using a word embedding. Aiming for a small set of words is useful both in reducing the effort required for human interpretation, as well as forming a compact representation of a set of documents for quantitative tasks such as document classification.

Figure 1: Extracting topic words with set cover
Each document (green) in a cluster is connected to its 2 most similar words (blue). The aim is to find the smallest set of words such that the union of the edges originating from them covers all documents in the cluster. In this case, w1 and w2 form the cover.
Suppose that we have a dictionary W (a collection of words, which is a superset of the words that actually occur in the cluster), with each word embedded in a real vector space, i.e., each w ∈ W has a vector in R^n. Moreover, suppose that each document d is represented in the same embedding space. First, we associate each word w ∈ W with a set of documents, D(w), based on their similarity in the embedding space. Let s(w, d) be a similarity score between a word w and a document d. Given a document d, let W_k(d) be the set of k most similar words to d in terms of s(w, d); we say that each word w ∈ W_k(d) k-covers d.
Conversely, let D_k(w) be the set of all documents d in the cluster that are k-covered by the word w. Next, we define our topic representation for a cluster C of documents as a set cover.
In words, a collection of words W_C covers a cluster if each document in the cluster is k-covered by some word w ∈ W_C. If the cover is partial, in the sense that at least a fraction 1 − δ (i.e., most) of the documents are covered, we call it a (1 − δ) cover. At this point, it is important to note that in principle the cover W_C need not include solely words found in the documents in cluster C.
Having defined what it means for a collection of topical words to cover (exactly or approximately) a document cluster (really, an arbitrary collection of documents), we now observe that our aim is to find a small cover, that is, the smallest number of topic words that adequately cover a document cluster. Next, we define this notion precisely. Definition 3: Given k and δ, a minimum (k, 1 − δ) cover for a document cluster C is a (k, 1 − δ) cover W*_C such that |W*_C| ≤ |W_C| for any other (k, 1 − δ) cover W_C of C.
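A minimal sketch of the k-cover relation defined above, using NumPy with random vectors standing in for real embeddings (all names and data here are illustrative):

```python
import numpy as np

def k_cover_sets(word_vecs, doc_vecs, k=2):
    """For each document, find its k most similar words (W_k(d)) by cosine
    similarity, then invert to get D_k(w): the documents each word k-covers.
    Vectors are rows of the input matrices."""
    w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = d @ w.T                              # (n_docs, n_words) cosine matrix
    topk = np.argsort(-sim, axis=1)[:, :k]     # W_k(d) as word indices
    covers = {wi: set() for wi in range(len(word_vecs))}
    for di, words in enumerate(topk):
        for wi in words:
            covers[wi].add(di)                 # build D_k(w)
    return covers

rng = np.random.default_rng(0)
covers = k_cover_sets(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)), k=2)
print(covers)
```

Since every document contributes to exactly k cover sets, the sizes of the D_k(w) sets always sum to k times the number of documents.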

Embedding Words and Documents
To derive a word embedding, we can use one of the standard embedding approaches, which have been demonstrated to roughly capture semantic relationships among words. We chose Word2Vec for this purpose, although other embedding approaches can presumably be used in its place. While we used the Tf-Idf representation of documents for clustering, this representation is not well suited to topic extraction using set cover, since it does not embed documents in the same semantic space as words.
To address this, we represent the documents in a new embedded space by computing a weighted average of the Word2Vec (Mikolov et al., 2013a) representations of words occurring in the document, with Tf-Idf as the weighting scheme. Using Tf-Idf weighting in conjunction with a Word2Vec representation helps alleviate issues that the individual representations face when used independently. Used in isolation, the standard Tf-Idf representation only allows us to compute similarities between documents, but not between words, given that words in this case are simply orthonormal one-hot vectors. Using only the Word2Vec representation allows us to compare similarity between words, but does not, by itself, represent documents. As Tf-Idf is an information measure of how important a word is to a document, it is a natural weighting scheme for representing a document as the weighted centroid of the vectors corresponding to the words in the document.
As we describe in the sections to follow, this also allows us to find topic words that are not necessarily contained in the documents themselves. To define this representation precisely, suppose that t is the Tf-Idf representation of a document over a word dictionary W, and let V be the matrix whose columns are the words embedded in real space using Word2Vec. Then the embedded document representation is defined by d = (Σ_{i=1}^{m} t_i v_i) / (Σ_{i=1}^{m} t_i), where m is the number of words in the document, t_i is the Tf-Idf weight of the i-th word, and v_i is the corresponding column of V.
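The weighted-centroid document embedding can be sketched as follows; this sketch assumes normalization by the document's total Tf-Idf mass, and the toy weights and vectors are illustrative:

```python
import numpy as np

def embed_document(tfidf_weights, word_vectors):
    """Tf-Idf-weighted centroid of the Word2Vec vectors of a document's words.
    tfidf_weights: (m,) Tf-Idf weight of each of the m words in the document.
    word_vectors:  (m, n) Word2Vec vector of each word, as rows."""
    t = np.asarray(tfidf_weights, dtype=float)
    V = np.asarray(word_vectors, dtype=float)
    return (t @ V) / t.sum()   # weighted average lands in the word space R^n

# Toy example: two words with weights 1 and 3.
doc_vec = embed_document([1.0, 3.0], [[0.0, 4.0], [4.0, 0.0]])
print(doc_vec)  # [3. 1.]
```

Because the result lies in the same space as the word vectors, cosine similarity s(w, d) between words and documents is well defined.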

Computing the Minimum Semantic Set Cover
Given the definition of the minimum semantic cover for a cluster of documents, along with an embedding of both words and documents in the same space, we can now extract the topics for each cluster using a greedy algorithm inspired by the O(n log n) greedy solution for set-cover (Chvatal, 1979), as follows.
We first convert the documents and words in the embedded space to an unweighted bipartite graph, using our notion of (k, 1 − δ)-cover. Let V_1 be a set of vertices, each corresponding to a word w in the corpus dictionary W. Let V_2 be a set of vertices, each corresponding to a document d in the corpus D. We add an edge between a word w and a document d if w k-covers d in the sense of cosine similarity s(w, d) between words and documents in the embedded space. Thus, the graph is G = (V_1 ∪ V_2, E). We also have, for each document, a cluster assignment from the spectral clustering step, i.e., D = C_1 ∪ ... ∪ C_n, where n is the total number of clusters (topics), such that each document belongs to exactly one cluster C_i.
Then, to construct a minimum semantic set cover for a cluster, we proceed as follows. Let the set of topic words T_i for the i-th cluster C_i initially be empty. Let V_{2,i} = {d ∈ V_2 : d ∈ C_i}, i.e., the subset of vertices in V_2 corresponding to documents in the i-th cluster. Let V_{1,i} = ∪_{d ∈ V_{2,i}} N(d), where N(d) denotes the node neighborhood of d. In words, V_{1,i} is the subset of corpus words that cover at least one document in cluster C_i, i.e., the set ∪_{d ∈ C_i} W_k(d).
Let G_i be the subgraph of G induced on V_{1,i} ∪ V_{2,i}. The greedy algorithm for finding the minimum set cover of a cluster C_i proceeds by picking the node in V_{1,i} that covers the maximum number of documents in V_{2,i}. In the case of a tie, we pick all nodes with maximum degree. The words corresponding to the selected vertices are placed in T_i; then the selected nodes, their neighbors in G_i, and the edges between them are removed from the graph. We then recompute the degrees of all nodes affected by this removal of edges. This process is repeated until we have covered the desired fraction (1 − δ) of the cluster. Algorithm 1 details topic-word extraction using set cover.
Algorithm 1 Greedy Set Cover: while uncovered documents remain in V_{2,i}, select the maximum-degree word vertex in V_{1,i}, add its word to T_i, and remove it, its covered documents, and their edges from G_i.
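The greedy procedure can be sketched directly on the word-to-documents cover sets; this is a minimal illustration with full coverage (δ = 0), and the cover sets shown are hypothetical:

```python
def greedy_cover(covers, n_docs, delta=0.0):
    """covers: word -> set of document ids that the word k-covers.
    Repeatedly pick the word covering the most uncovered documents until
    at least a (1 - delta) fraction of the documents is covered."""
    uncovered = set(range(n_docs))
    topic_words = []
    while len(uncovered) > delta * n_docs:
        # word whose cover set removes the most uncovered documents
        best = max(covers, key=lambda w: len(covers[w] & uncovered))
        gained = covers[best] & uncovered
        if not gained:          # remaining documents are uncoverable
            break
        topic_words.append(best)
        uncovered -= gained
    return topic_words

covers = {"measles": {0, 1, 2}, "outbreak": {1, 2}, "delay": {3}}
print(greedy_cover(covers, n_docs=4))  # ['measles', 'delay']
```

Removing covered documents from the uncovered set plays the role of deleting document vertices and recomputing word degrees in the graph formulation.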

Evaluation Methodology
We evaluate our approach in comparison with LDA, the de facto standard in topic modeling, both in qualitative and quantitative terms. Our qualitative evaluation involves human judgments about the appropriateness of topic choices for a subsample of texts. We complement this with two quantitative metrics, one with respect to a standard topic coherence measure, and the second using topic models for text classification tasks. Throughout, we refer to our approach as set cover. Moreover, in our experiments, the Word2Vec vectors are derived by training a skip-gram model on the corpus, with a sliding window of size 4 and 500 dimensions. Additionally, we compute the minimum 1-cover (i.e., δ = 0); that is, we ensure that all documents in the cluster are covered.

Qualitative Evaluation
Given the common use of topic modeling in obtaining qualitative insight from text, our first evaluation approach involves human judgments of quality. This evaluation echoes other human evaluations of topic modeling, such as that of Steinskog et al. (2017) for the topic-intrusion detection task. Also noteworthy is the work by Chang et al. (2009), who demonstrated the poor correlation of the popular perplexity metric (Blei et al., 2003) with human judgments.
For our qualitative evaluation, we set up a series of experiments on Amazon Mechanical Turk (MTurk). For these tasks, we use 4 sets from the health news tweets collected by Karami et al. (2018) and YouTube comments about 23andMe (we provide specific details in a later section). To ensure fairness to LDA, our chosen baseline, we do this in two different settings based on how we group documents into topically related subsets.

Matched Clusters
In the first setup, we take the document clusters produced by spectral clustering as given, and focus the comparison between LDA and set cover on the particular choice of topical words these generate. In this case, we produce a correspondence between a given cluster and an LDA topic by choosing the LDA topic which maximizes the likelihood that the cluster was produced by the topic. More precisely, we assign a cluster C to the topic j which maximizes Σ_{i∈C} P(i|j), where P(i|j) is the LDA-derived likelihood that a document i reflects a topic j. We then generate the collection of topic words for a given cluster using LDA in a standard way. Specifically, we choose the n most probable words in the associated LDA topic, where n is set as the number of topic words produced by the set cover.
In the experiment, we assign a random cluster to a subject, who is then presented with the documents in this cluster (or a random subsample of these, if the cluster is too large), the choice of topic words based on LDA, and the choice of topic words based on set cover. Additionally, we also ask the subjects for judgments of a collection of n randomly chosen words from the cluster to calibrate the results. We then ask participants to judge how well a topic (i.e., the collection of topic words) describes the given set of documents, and score each result on a 5-point Likert scale, with 1 being very poor and 5 very good.

Independent Clusters
One may naturally object that the above comparison is unfair to LDA insofar as we are choosing the clusters and then retrofitting LDA topics to these. We therefore ran a second set of qualitative experiments in which LDA topics were used to derive clusters of similar documents. Specifically, we clustered all documents based on their associated likelihood given a topic; that is, a document i was assigned to the LDA topic j which maximizes P(i|j). This gives us a collection of document clusters, which we can then present to human subjects for judgment. As before, we used the n most probable words from an LDA topic as the topic description presented to human subjects. The set cover approach, on the other hand, used spectral clustering as before. Since the set of documents presented for judgment is now different for the two approaches, we omitted the random words for calibration. Consequently, while we still presented the subjects with the same 5-point Likert scale as before, this scale is now calibrated differently, as will be made evident in the results section.

Quantitative Evaluations
To quantitatively compare the two approaches, we use the standard intrinsic topic coherence metric, as well as two classification tasks that test the strength of the sparse representations produced. The topic coherence (Stevens et al., 2012) for a set T of topic words is defined as coherence(T) = Σ_{i<j} log((N(w_i, w_j) + λ) / N(w_j)), where N(w) is the number of documents that contain the word w, N(w_i, w_j) is the number of documents that contain both w_i and w_j, and λ is a smoothing factor. We compute average coherence scores over 5 runs, varying cluster sizes between 5 and 25.
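A minimal sketch of this coherence metric, with documents represented as sets of words (the toy corpus is illustrative):

```python
import math

def coherence(topic_words, docs, lam=1.0):
    """Coherence as a sum over ordered topic-word pairs of
    log((N(w_i, w_j) + lam) / N(w_j)), where N counts the documents
    containing the given word(s)."""
    def N(*words):
        return sum(all(w in d for w in words) for d in docs)
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((N(wi, wj) + lam) / N(wj))
    return score

docs = [{"measles", "outbreak"}, {"measles", "vaccine"}, {"outbreak"}]
print(coherence(["measles", "outbreak"], docs))  # 0.0
```

Higher scores indicate that the topic words tend to co-occur in the same documents, which is what the metric rewards.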

Document Classification Accuracy
We set up two classification tasks. The first task is to classify short text messages as Ham or Spam. The second task is to classify tweets as offensive, hate speech, or neither. For both, we use topic modeling approaches to arrive at a sparse feature representation of a document. For LDA, the feature vector for a document is comprised of the probabilities that the document was generated by each of the topics. For the set cover approach, we construct binary feature vectors that represent the occurrence of topic words in the cluster to which the document is assigned. Given these feature representations, we use a Linear Support Vector Classifier; for the multi-class problem, we use a One-vs-Rest approach with Linear Support Vector Classifiers. We maintain a 60%-40% train-test split over the corpus, and average accuracy over 5 runs, varying the number of topics between 5 and 25.
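A sketch of the set cover feature construction and classifier, assuming scikit-learn; the vocabulary, documents, and labels below are hypothetical:

```python
# Binary feature vector over all topic words: mark the topic words of the
# document's own cluster that actually occur in the document.
import numpy as np
from sklearn.svm import LinearSVC

all_topic_words = ["measles", "outbreak", "delay", "spam", "winner"]

def features(doc_tokens, cluster_topic_words):
    row = np.zeros(len(all_topic_words))
    for i, w in enumerate(all_topic_words):
        if w in cluster_topic_words and w in doc_tokens:
            row[i] = 1.0
    return row

X = np.array([
    features({"measles", "outbreak"}, {"measles", "outbreak"}),
    features({"winner", "prize"}, {"spam", "winner"}),
])
y = [0, 1]  # e.g. ham vs. spam
clf = LinearSVC().fit(X, y)
print(clf.predict(X))
```

The resulting vectors are very sparse, since each document activates at most the handful of topic words covering its cluster.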

Data
Our evaluation used several datasets which we describe briefly below.
Twitter -Health News tweets from more than 15 health news agencies were collected by Karami et al. (2018). The dataset contains separate files for tweets collected from each source. Each source is observed to have had trends in tweets, which implicitly form topic clusters.
YouTube Comments -23andMe We collected a sample of 800 YouTube comments from the top 50 YouTube video results for the search term '23andMe'. This dataset is qualitatively different from the Twitter corpus, showing greater variation in document length and significantly more noise.
Twitter -Hate and Offensive Speech A set of 24,802 tweets collected using a hate speech lexicon and labelled into three categories (hate speech, offensive speech, and neither) by Davidson et al. (2017). This dataset is used in one of our two classification tasks.
Ham/Spam Short Messages 5574 short text messages were classified as legitimate (ham) or spam by Almeida et al. (2013). This forms the basis of the second classification task.

Qualitative Evaluations
In the first set of MTurk experiments, where topics from both algorithms were shown in the same task, we asked human judges to score topics from 4 document clusters, collecting 20 responses for each. For instance, the Fox News dataset contains a set of tweets posted in 2015 about the measles outbreak in California, linked to Disney theme parks. The topic words for the measles outbreak cluster identified by the two algorithms are shown in Table 1. Here, we see that LDA picks certain irrelevant terms for the shown cluster sample ('u', 'rare'), while completely missing the term 'measles', which is a key subject of the documents (we note that we remove stop-words during preprocessing using NLTK (Loper and Bird, 2002)). The set cover approach, on the other hand, is able to identify highly pertinent words. We refer to this experiment as Matched Clusters.
This can be thought of as reflecting the propriety of the chosen topic words conditional on the clusters of similar documents. The average scores are shown in Table 2.
Our second set of experiments, with clusters chosen independently for the two algorithms, was conducted on a significantly larger scale. We uploaded 5 clusters per dataset and collected 40 responses per cluster, resulting in a total of 2000 data points, 1000 for each algorithm. We refer to this experiment as Independent Clusters. Table 3 shows example topic words identified by LDA and set cover. The advantages of the clustering step in our approach are evident in this example: the set cover cluster contains documents that are more closely related to one another.
More importantly, it is worth noting the choice of the term 'delay' in the set cover topic words: while the term does not itself appear anywhere in the cluster, it is semantically related to documents in the cluster referring to the long wait Maryland residents had to endure to sign up for Obamacare. This is precisely the reason for using a word embedding such as Word2Vec in our approach: topic words are not restricted to words in the cluster, and yet appear to be semantically meaningful. The average judgments from MTurk for these experiments are reported in Table 4.
In both experiments, we can see that set cover consistently outperforms LDA, often by a large margin. We also performed a two-sided independent-samples t-test on the scores. The differences between the means in Table 4 are statistically significant: at p < 0.0001 for all datasets except 23andMe, and at p < 0.05 for 23andMe. It is interesting to note that set cover scores slightly higher in the Matched Clusters study, suggesting that it is judged especially favorably when calibrated against randomly chosen words and compared side by side with LDA.
Since the clusters are fixed in these experiments, the results reflect the particular advantage of the set cover method itself in choosing descriptive words for a collection of similar documents. The Independent Clusters study, in contrast, serves more as an end-to-end evaluation of each approach, and here, too, the difference is substantial. However, the LDA scores in this case are generally comparable to or higher than those in the Matched Clusters experiments, which suggests that the advantage of set cover over LDA is primarily due to its better choice of topic words, its main novelty, rather than to the clustering approach.

Quantitative Evaluations
Topic Coherence The first quantitative comparison between set cover and LDA is in terms of the topic coherence metric. For each dataset, we plot topic coherence as a function of the number of topics, ranging from 5 to 25 (for the set cover approach, the number of topics corresponds to the number of clusters). Figure 2 presents the topic coherence results. In nearly all of these cases, set cover scores significantly better on this metric than LDA. It is also notable that set cover tends to improve as the number of topics increases, whereas this is typically not the case for LDA (the New York Times Health News tweets are an exception: set cover scores decrease with the number of topics while LDA scores increase slightly, so that for a large number of topics the two approaches are indistinguishable).

Classification
The final evaluation uses two objective document classification tasks to compare the effectiveness of set cover and LDA in producing a sparse feature representation for such tasks. We present classification accuracy, again varying the number of topics from 5 to 25. Figure 3 shows the classification results. While LDA appears to be slightly better in the Ham/Spam message classification case, and is occasionally better in the Hate/Offensive speech classification task, the differences are quite small, with both achieving accuracy in the 87-89% range in the former, and 77-78% in the latter.

Discussion
The reason for LDA's observed inferiority in the qualitative experiments can be traced to the fact that LDA allows each document to be generated from a mix of topics, whereas in most short-text corpora, documents usually pertain to a single topic. Additionally, the number of documents belonging to each topic in a corpus is not (explicitly) captured by LDA.
With the set cover approach, the clustering step provides this information: clusters need not be of uniform size, and such a clustering is easy to learn. This may explain, for instance, why LDA completely misses the word 'measles' in the Matched Clusters sample shown in Table 1. The documents about the measles outbreak are relatively few in the corpus, and treating this set of documents independently of the rest of the corpus makes it easier to identify this theme.
The topic coherence experiments show that the topic words learnt using set cover are more likely to co-occur across the corpus as compared to those learnt with LDA, thereby suggesting that set cover's choice of topic words is more meaningful. The results of the classification task are noteworthy, given that our model is far less complex than LDA, and yet produces almost as effective a sparse representation.

Conclusion
In this paper, we introduced a conceptually simple and highly interpretable deterministic topic modeling algorithm based on constructing a semantic set cover over clusters of documents in a corpus. Unlike popular probabilistic topic modeling methods, our algorithm performed well on short text data, thereby overcoming the limitations imposed by corpus-sparsity. We demonstrated that our approach significantly outperforms LDA on qualitative scores by human judges as well as the standard topic coherence metric, and that it is comparable to LDA for document classification.
One limitation of our approach is the dependence on a good clustering of documents, in the sense that documents are meaningfully grouped together by the clustering algorithm used, given a dataset. Additionally, we rely on a word embedding, which may not be easy to learn over datasets where terms do not recur in the same contexts frequently. A potential solution to this is to learn the embedding on the union of said dataset with another corpus of similar (thematic and structural) nature, where term co-occurrences are more frequent.
Finally, as future work, we aim to explore set-cover-based topic modeling in which the covering threshold k (the number of most similar words to a document) varies per topic. We hope this will allow us to capture the notion that some topics are sufficiently described by a small set of words, whereas others need a larger threshold to fully capture their semantics.