Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures

When developing topic models, a critical question that should be asked is: How well will this model work in an applied setting? Because standard performance evaluation of topic interpretability uses automated measures modeled on human evaluation tests that are dissimilar to applied usage, these models’ generalizability remains in question. In this paper, we probe the issue of validity in topic model evaluation and assess how informative coherence measures are for specialized collections used in an applied setting. Informed by the literature, we propose four understandings of interpretability. We evaluate these using a novel experimental framework reflective of varied applied settings, including human evaluations using open labeling, typical of applied research. These evaluations show that for some specialized collections, standard coherence measures may not inform the most appropriate topic model or the optimal number of topics, and current interpretability performance validation methods are challenged as a means to confirm model quality in the absence of ground truth data.


Introduction
Topic modeling has become a popular tool for applied research such as social media analysis, as it facilitates the exploration of large document collections and yields insights that would not be accessible by manual methods (Sinnenberg et al., 2017;Karami et al., 2020). However, social media data can be challenging to model as it is both sparse and noisy (Zhao et al., 2011). This has resulted in increased demand for short-text topic models that can handle these challenges (Lim et al., 2013;Zuo et al., 2016;Chen et al., 2015).
Topic word-sets, denoted T ws , are considered to be semantically related words that represent the latent component of the underlying topic's document-collection, denoted T dc . Meaning is derived from these topics through the interpretation of either the T ws (Nerghes and Lee, 2019), the corresponding T dc (Maier et al., 2018), or both (Törnberg and Törnberg, 2016). Since meaning requires topics to be interpretable to humans, empirical assurance is needed to confirm a novel topic model's capacity to generate "semantically interpretable" topics, as well as a method to guide model selection and other parameters such as the number of topics, K. This is often achieved by calculating coherence scores for T ws (Lau and Baldwin, 2016). Recent literature contradicts previous evaluations of some short-text topic models that claim superior interpretability (Li et al., 2018;Eickhoff and Wieneke, 2018;Bhatia et al., 2017). Such rethinking flows from the fact that there is no agreement on the best measure of interpretability (Lau et al., 2014b;Morstatter and Liu, 2017) and is compounded by the unclear relationship between human evaluation methodologies and automated coherence scores (Lau et al., 2014b). Finally, despite assurances of generalizability and applicability, topic model evaluations in machine learning are conducted in experimental settings that are not representative of typical applied use. This raises questions of whether coherence measures are suitably robust to measure topic interpretability and inform model selection in applied settings, particularly with challenging datasets like that of social media.
Advances in topic modeling for static document collections have produced non-parametric approaches such as HDP-LDA, which employ sophisticated hierarchical priors that allow for different prior proportions (Teh et al., 2006). Non-negative matrix factorization (Zhou and Carin, 2015), the use of word embeddings, and neural network methods (Zhao et al., 2021) are among the other innovations.
To support these advances, it is crucial to establish the robustness of topic modeling interpretability measures, especially given the growing trend towards evaluating topic models using coherence measures, often in the absence of perplexity or other predictive scores. Additionally, increasingly sophisticated methods for automatic topic labeling have been developed. Beginning with Lau et al. (2011), this research relies on models which generate interpretable topics. While these advances enhance the technologies available to conduct applied research, they do not address the underlying question of whether topic interpretability can be adequately assessed using coherence measures.
In this paper, we demonstrate a research gap in topic model evaluation methods in light of their growing use in specialized settings. Previously declared state-of-the-art models are under-performing in applied settings (Li et al., 2018;Arnold et al., 2016), and little work has been done to improve application relevance (Hecking and Leydesdorff, 2019). Following the work of Lau and Baldwin (2016), Bhatia et al. (2017), and Hecking and Leydesdorff (2019), this study examines whether coherence is a valid predictor of topic model interpretability when interpretability is defined as more than just the ability to label a T ws , and as the diversity of topic models, datasets, and application tasks increases.
Earlier research has established a correlation between novel coherence measures and human ranking of interpretability, as measured by qualitative tests (Cheng et al., 2014;Newman et al., 2010a). However, since bounded experimental settings constrain these tests, they are unlikely to reliably and consistently indicate topic quality in applied research settings. As a result, we ask the following question: To what extent can we rely on current coherence measures as proxies for topic model interpretability in applied settings?
This work has significant practical implications. It signals the need to re-develop interpretability measures and reappraise best practice for validating and evaluating topic models and their applications. Our research contributes the following:
1. Introduces a novel human-centered qualitative framework for evaluating interpretability in model development that mimics the processes seen in applied settings.
2. Demonstrates that the ranking of topic quality using state-of-the-art coherence measures is inconsistent with those produced through validation tasks performed in an applied setting.
3. Systematically quantifies the impact of model behavior, dataset composition, and other previously reported factors (Morstatter and Liu, 2017;Lau and Baldwin, 2016) on coherence measures for many topics across four variant datasets and two topic models.
4. Provides evidence that interpretability measures for evaluating T ws and T dc for applied work in specialized contexts (e.g., Twitter) are ill-suited and may hinder model development and topic selection.
The remainder of this paper is organized as follows. Section 2 provides a review of related work around the interpretability of topic models. Section 3 describes five propositions that have informed the design of interpretable topic models and their evaluation measures. This is followed by a description of the experimental framework we designed to test these propositions. Section 4 provides the results of these evaluations and Section 5 contains a discussion of findings.

Background
This section provides a brief overview of work related to interpretability evaluation, followed by a review of the challenges associated with coherence optimization for specialized contexts.

Topic Model Interpretability
Topic model interpretability is a nebulous concept (Lipton, 2018) related to other topic model qualities, but without an agreed-upon definition. Measures of semantic coherence capture how easily understood the top-N T ws are (Morstatter and Liu, 2017;Lund et al., 2019;Newman et al., 2010a;Lau et al., 2014b). This is also referred to as topic understandability (Röder et al., 2015;Aletras et al., 2015). A coherent topic is said to be one that can be easily labeled and thus interpreted (Morstatter and Liu, 2017), but only if the label is meaningful (Hui, 2001;Newman et al., 2010b,a). Some have modeled coherence measures based on topic meaningfulness (Lau et al., 2014a); others state that a meaningful topic is not necessarily a useful one (Boyd-Graber et al., 2015). Indeed, the literature remains divided over whether usefulness is a property of an interpretable topic (Röder et al., 2015), or whether interpretability is a property of a useful topic (Aletras and Stevenson, 2013;Newman et al., 2010b). Such terminological disagreement suggests that there are challenges to the progression of this area of research.
The ease of labeling a topic is assumed to be an expression of how coherent that topic is and thus its degree of interpretability. This assumption is challenged when annotators provide different labels for a topic. Morstatter and Liu (2017) presented interpretability from the perspective of both coherence and consensus, where consensus is a measure of annotator agreement about a topic's representation in its T dc . Alignment is how representative a topic is of its T dc and is another understanding of interpretability (Ando and Lee, 2001;Chang et al., 2009;Mimno et al., 2011;Bhatia et al., 2017;Alokaili et al., 2019;Morstatter and Liu, 2017;Lund et al., 2019). However, the probabilistic nature of topic models impedes this measure. The ambiguity of interpretability as a performance target raises questions about how topic models are used and evaluated.

Related Work
Following the seminal work of Chang et al. (2009), the development of coherence measures and the human evaluation tasks that guide their design have been actively pursued (Newman et al., 2010a;Bhatia et al., 2017, 2018;Morstatter and Liu, 2017;Lau and Baldwin, 2016;Lund et al., 2019;Alokaili et al., 2019). Newman et al. (2010a) showed that human ratings of topic coherence (observed coherence) correlated with their coherence measure when the aggregate Pointwise Mutual Information (PMI) pairwise scores were calculated over the top-N T ws . In addition to the word intrusion task (Chang et al., 2009), Mimno et al. (2011) validated their coherence measure for modeling domain-specific corpora using expert ratings of topic quality. The measure takes the order of the top-N T ws into account using a smoothed conditional probability derived from document co-occurrence counts. This performance was further improved by substituting Normalized PMI (C NPMI ) for PMI (Aletras and Stevenson, 2013;Lau et al., 2014b). Aletras and Stevenson (2013) used crowdsourced ratings of topic usefulness to evaluate distributional semantic similarity methods for automated topic coherence. Röder et al. (2015) conducted an exhaustive study evaluating prior work and developing several improved coherence measures.
Similarly, Ramrakhiyani et al. (2017) made use of the same datasets and evaluations and presented a coherence measure which is approximated with the size of the largest cluster produced from embeddings of the top-N T ws . Human evaluation tasks have also been created to measure how representative a topic model is of the underlying T dc (Chang et al., 2009;Bhatia et al., 2017;Morstatter and Liu, 2017;Alokaili et al., 2019;Lund et al., 2019).

Practical Applications
Within computer science, topic modeling has been used for tasks such as word-sense disambiguation (Boyd-Graber and Blei, 2007), hierarchical information retrieval (Blei et al., 2003), topic correlation (Blei and Lafferty, 2007), trend tracking (Al-Sumait and Domeniconi, 2008), and handling short texts (Wang et al., 2018). Outside of computer science, topic modeling is predominantly used to guide exploration of large datasets (Agrawal et al., 2018), often with a human-in-the-loop approach.
Qualitative techniques make use of topics in different ways. "Open labeling" of topics by Subject Matter Experts (SME) is followed by a descriptive analysis of that topic (Kim et al., 2016;Morstatter et al., 2018;Karami et al., 2018). However, this method is subjective and may fail to produce the depth of insight required. Supplementing a topic analysis with samples from the T dc increases the depth of insight (Eickhoff and Wieneke, 2018;Kagashe et al., 2017;Nerghes and Lee, 2019). Alternatively, the T dc alone can be used for in-depth analysis (Törnberg and Törnberg, 2016).

Evaluating Interpretability
To explore the research question, we have generated five propositions about the relationship between coherence scores, human evaluation of topic models, and the different views of interpretability. We conduct five experiments to interrogate these propositions and re-evaluate how informative coherence measures are for topic interpretability. Because we are evaluating existing coherence measures, we do not employ automatic topic labeling techniques. Instead, we make use of human evaluation tasks that reflect those conducted in applied settings.
Proposition 1. If coherence scores are robust, they should correlate. The battery of coherence measures used to evaluate novel topic models and automated labeling approaches is inconsistent across the literature. Each new measure claims superior alignment to topic model interpretability. As these measures are evolutionary (Röder et al., 2015), and there is no convention for which measure should be used, particularly as a standard measure of qualitative performance (Zuo et al., 2016;Zhao et al., 2017;Zhang and Lauw, 2020), they are considered notionally interchangeable. Thus, we would expect some correlation between these measures. However, previous studies have not considered the impact that the data type or model has on the coherence scores. Particularly for non-parametric models, these issues may be compounded by how coherence measures are presented as an aggregate, e.g., the reporting of only the top-N topics. Indeed, studies reporting multiple coherence measures have demonstrated inconsistencies at the model level that are obscured during reporting (Blair et al., 2020).
Proposition 2. An interpretable topic is one that can be easily labeled. How easily a topic could be labeled has been evaluated on an ordinal scale where humans determined if they could hypothetically give a topic a label (Mimno et al., 2011;Morstatter and Liu, 2017). However, humans are notoriously poor at estimating their performance, particularly when they are untrained and do not have feedback on their performance (Dunning et al., 2003;Morstatter and Liu, 2017). Thus, a rater's perception of whether they could complete a task is actually less informative than having them complete the task.
Proposition 3. An interpretable topic has high agreement on labels. Agreement on a topic label is considered a feature of interpretability by Morstatter and Liu (2017), who propose "consensus" as a measure of interpretability. A high level of agreement on topic labels, particularly in crowdsourcing tasks, is seen as a means to infer that a T ws is interpretable. However, in applied tasks, a topic is described in a sense-making process resulting in one coherent label. Thus, the consensus task is not necessarily a reasonable means to infer interpretability. A robust way to measure agreement on a topic label is needed. Inter-coder reliability (ICR) measures are an appropriate means to achieve this.
Proposition 4. An interpretable topic is one where the document-collection is easily labeled. The investigation of topic document-collections is an emerging trend in the applied topic modeling literature. In these studies, authors have either used a topic's "top documents" to validate or inform the labels assigned to T ws (Kirilenko et al., 2021), or have ignored the T ws in favor of qualitative analysis of the richer T dc (Doogan et al., 2020). The use of topic modeling for the exploration of document-collections requires a T dc to be coherent enough that a reader can identify intertextual links between the documents. The label or description given to the T dc results from the readers' interpretation of individual documents relative to the other documents in the collection. T dc that have a high degree of similarity between their documents will be easiest to interpret and therefore label. The ease of labeling a T dc decreases as the documents become more dissimilar.
Proposition 5. An interpretable topic word-set is descriptive of its topic document-collection. The alignment of T ws to T dc is an expected property of a "good" topic (Chang et al., 2009), which human evaluation tasks have been developed to assess. Typically these tasks ask annotators to choose the most and/or least aligned T ws for a given document (Morstatter and Liu, 2017;Lund et al., 2019;Alokaili et al., 2019;Bhatia et al., 2018), identify an intruder topic (Chang et al., 2009;Morstatter and Liu, 2017), rate their confidence in a topic-document pair (Bhatia et al., 2017), or select appropriate documents given a category label (Aletras et al., 2017). However, none of these methods address the need for the topic document-collection to be evaluated and labeled. Furthermore, they generally use one document and/or are not comparable to applied tasks.

Data
The Auspol-18 dataset was constructed from 1,830,423 tweets containing the hashtag #Auspol, an established Twitter forum for the discussion of Australian politics. The diminutives, slang, and domain-specific content provide a realistic example of a specialized context. Four versions of the dataset were constructed from a subset of 123,629 tweets: AWH (contains the 30 most frequent hashtags), AWM (contains the 30 most frequent mentions of verified accounts), AWMH (contains the 30 most frequent hashtags and 30 most frequent mentions of verified accounts), and AP (contains neither hashtags nor mentions). Pre-processing included stopword removal, POS-tagging, lemmatization, exclusion of non-English tweets, duplicate removal, removal of tokens with frequency n < 10, removal of tweets with fewer than 5 tokens, and standardization of slang and abbreviations (Agrawal et al., 2018;Doogan et al., 2020) 1 .
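To make the filtering steps concrete, the sketch below applies the token-frequency and tweet-length thresholds described above. It is a minimal Python illustration, not the authors' released pipeline, and it assumes the tweets have already been stopword-filtered, lemmatized, language-filtered, and standardized.

from collections import Counter

def filter_corpus(tweets, min_token_freq=10, min_tweet_len=5):
    # `tweets` is a list of token lists that have already been
    # stopword-filtered, lemmatized, and standardized (not shown here).
    freq = Counter(tok for tweet in tweets for tok in tweet)
    filtered, seen = [], set()
    for tweet in tweets:
        # Drop tokens occurring fewer than `min_token_freq` times in the corpus.
        kept = [tok for tok in tweet if freq[tok] >= min_token_freq]
        key = tuple(kept)
        # Drop duplicates and tweets with fewer than `min_tweet_len` tokens.
        if len(kept) >= min_tweet_len and key not in seen:
            seen.add(key)
            filtered.append(kept)
    return filtered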

Models and Parameters
To investigate interpretability in an applied setting, we compare LDA to MetaLDA (Zhao et al., 2017), a recent non-parametric topic model designed to improve short-text topic modeling by incorporating document and word meta-information via word embeddings and by non-parametrically learning topic proportions. Despite the many extensions to LDA, the vanilla model maintains popularity among applied researchers (Sun et al., 2016), and as the baseline model, it is necessary to compare LDA with a model purpose-built for short-text applications. MetaLDA is one reasonable representative of such models and has demonstrated effectiveness on Twitter data for applied work (Doogan et al., 2020). The extensive effort of human labeling in our experiments (see Section 3.4) precludes us from adding more models. LDA and MetaLDA are available in the MetaLDA package 2 , which is implemented on top of Mallet (McCallum, 2002).
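The experiments themselves use the MetaLDA package on top of Mallet; purely as a hedged stand-in for the LDA baseline, a gensim-based sketch of fitting a K-topic model to the pre-processed token lists might look as follows (hyperparameters are illustrative, not the paper's settings).

from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_lda(token_lists, num_topics):
    # Build the vocabulary and bag-of-words corpus from the filtered tweets.
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(doc) for doc in token_lists]
    # Vanilla LDA baseline; the paper's runs use the Mallet-based MetaLDA
    # package, so this is only an approximation of that setup.
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=num_topics, passes=10, random_state=0)
    return model, dictionary, corpus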

Coherence Measures
Several coherence measures were evaluated: C Umass (Mimno et al., 2011), C V , C P (Röder et al., 2015), C A , and C NPMI (Aletras and Stevenson, 2013). These were calculated for each topic using the Palmetto package 3 over the top ten most frequent words. Along with the default C NPMI , which is calculated using Wikipedia, we introduce C NPMI-ABC , calculated using a collection of 760k Australian Broadcasting Corporation (ABC) news articles 4 with 150 million words (enough to make the C NPMI scores stable), and C NPMI-AP , calculated using the AP dataset, which tests C NPMI with statistics drawn from the training data. We report the average scores and the standard deviations over five random runs.
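For reference, C NPMI averages the normalized PMI over word pairs from the top-N T ws , with probabilities estimated from a reference corpus (Wikipedia, the ABC articles, or the AP training data above). The sketch below uses simple document-level co-occurrence counts and a small smoothing constant as assumptions; the Palmetto implementation used in the paper estimates co-occurrence with sliding windows and differs in detail.

import math
from itertools import combinations

def npmi_coherence(topic_words, reference_docs, eps=1e-12):
    # `reference_docs` is a list of token lists from the reference corpus.
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        pmi = math.log((p12 + eps) / (p1 * p2 + eps))
        # Normalize PMI by -log p(w1, w2) so scores fall roughly in [-1, 1].
        scores.append(pmi / -math.log(p12 + eps))
    return sum(scores) / len(scores)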

Qualitative Experiments
A primary concern in machine learning research is the need to establish model performance. Following the recent trend to analyze T dc , we devised qualitative tests to assess whether the T ws and T dc were adequately aligned and whether current performance measures are informative of this alignment. We also tested whether there is a relationship between topic alignment and the topic diagnostic statistics: effective number of words 5 and topic proportion, denoted D ew and D tp , respectively.
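As a point of reference, the two diagnostics can be sketched as follows. The exact definitions used in the paper are given in the footnote, so the formulations below (exponentiated entropy of the topic-word distribution for D ew , and each topic's share of the corpus-wide topic mass for D tp ) are assumptions standing in for them.

import numpy as np

def effective_number_of_words(topic_word_dist):
    # D_ew for one topic: exp of the Shannon entropy of its word distribution
    # (assumed definition; the footnoted definition may differ).
    p = np.asarray(topic_word_dist, dtype=float)
    p = p / p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return float(np.exp(entropy))

def topic_proportions(doc_topic_dist):
    # D_tp: each topic's share of the total topic mass across all documents
    # (assumed definition), computed from a documents-by-topics matrix.
    theta = np.asarray(doc_topic_dist, dtype=float)
    mass = theta.sum(axis=0)
    return mass / mass.sum()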
Topic Word-sets: Four SMEs were recruited from a multidisciplinary pool of researchers who were representative of the political-ideological spectrum and who were Australian English speakers. They were shown the same topics, consisting of the top-10 words ranked by term frequency, generated by LDA and MetaLDA on AP, AWH, and AWM for K=10-60 topics 6 , producing a total of 3,120 labels (780 for each SME) for the 390 topics (130 per model-dataset combination). Their task was to provide a descriptive label for each T ws and to use 'NA' if they were unable to provide a label. Appendix A provides an example of this task. Two measures were constructed from these labels. The first was the number of raters able to label the topic, a count between 0 and 4, denoted Q nbr . The second was a simple ICR measure, Percentage Agreement, denoted Q agr , which is calculated as the number of times a set of annotators agree on a label, divided by the total number of annotations, as a percentage.
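A minimal sketch of the two label-based measures follows, assuming the four SME labels for a topic have already been standardized to a common form. The definition of Q agr above admits more than one reading; the sketch treats the count of the most common label as the agreement count, with pairwise agreement as an alternative.

from collections import Counter

def q_nbr(labels):
    # Number of raters (0-4) who assigned a label other than 'NA'.
    return sum(lab != 'NA' for lab in labels)

def q_agr(labels):
    # Percentage agreement: count of the most common label divided by the
    # total number of annotations, as a percentage (one reading of the
    # definition above).
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 100.0 * most_common_count / len(labels)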
Topic Document-collections: Two SMEs analyzed the T dc s of the 60 topics each modeled by LDA and MetaLDA on the AP dataset, referred to hereafter as the qualitative set. Samples of T dc generated by each model (K=10-60) were reviewed, and those generated from both models' 60-topic sets were found to be of equal or higher quality than those produced by other values of K.
The SMEs reviewed the top-30 tweets representative of a topic and provided a label for each tweet. They then inductively determined a label or phrase describing that T dc . They noted any key phrases, names, or other terms that were consistent across the collection. The SMEs were experienced at annotating such datasets and were familiar with the online #Auspol community. The SMEs then discussed the results together and agreed on a final label for each T dc .
The SMEs were asked to rate on a scale of 1-3 how difficult it was to label each T dc , where 1 was difficult, 3 was easy, and 0 indicated that a label could not be assigned. This qualitative statistic is denoted Q dif . The researchers then scored, on a scale of 1-5, the degree of alignment between topic labels and the labels assigned to their corresponding collections. A score of 5 indicated the labels were identical, and a score of 0 indicated the T ws and/or T dc was incoherent. This statistic is denoted Q aln . Examples of these tasks are in Appendix A.

Statistical Tests
We measure the strength of the association between variables using Pearson's r correlation coefficient in evaluation 1 (see section 4.1) and Spearman's ρ correlation coefficient in evaluations 2-5 (see sections 4.2, 4.3, 4.4, and 4.5). Pearson's r is used in the few papers that evaluate coherence scores over the same datasets (Röder et al., 2015;Lau et al., 2014b). The practical reason for using Pearson's r for our evaluation of proposition 1 was to make valid comparisons with these studies. The statistical justification for using Pearson's r (rather than Spearman's ρ) is that the datasets are continuous (neither is ordinal, as Spearman's ρ requires) and believed to have a bivariate normal distribution. 7 Spearman's ρ is only appropriate when the relationship between variables is monotonic, which has not been consistently demonstrated for coherence (Röder et al., 2015;Bovens and Hartmann, 2004). Spearman's ρ is appropriate to assess the association between coherence scores and human judgments in evaluations 2-5 8 . It is a preferred method for such tasks (Aletras and Stevenson, 2013;Newman et al., 2010a) as it is unaffected by variability in the range for each dataset (Lau et al., 2014b).
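Both coefficients are standard; a minimal SciPy-based sketch of how the two are applied in the evaluations below (variable and function names are illustrative):

from scipy.stats import pearsonr, spearmanr

def correlate(scores_a, scores_b, method="pearson"):
    # Pearson's r between two coherence measures over the same topics
    # (evaluation 1); Spearman's rho between a coherence measure and a
    # human judgment such as Q_agr or Q_dif (evaluations 2-5).
    if method == "pearson":
        return pearsonr(scores_a, scores_b)   # (r, p-value)
    return spearmanr(scores_a, scores_b)      # (rho, p-value)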

Results
Here we detail the results of our analysis of the five propositions about interpretability evaluation.

Evaluation 1: Coherence Measure Correlations
As per Proposition 1, coherence measures should be robust and highly correlated. To test this proposition, we conducted a Pearson's correlation analysis of paired coherence measures calculated for K=10-60 for each model-dataset combination. Pooling the results for K and the three datasets, we calculate the mean correlation, x̄ r , for LDA and MetaLDA. C NPMI and C P scores were strongly correlated for all datasets, ranging from x̄ r =0.779-0.902 for LDA and x̄ r =0.770-0.940 for MetaLDA. C NPMI and C NPMI-ABC also showed a moderate-to-strong correlation for all datasets, with LDA ranging from x̄ r =0.719-0.769 and MetaLDA from x̄ r =0.606-0.716. C NPMI-ABC appears more sensitive to changes in K than C P . No significant trends were seen between other coherence measures calculated for any dataset. These results are reported in Appendix B.
Methods to aggregate coherence scores may mask any differences in the models' behaviors as K increases. To test this, aggregate coherence measures, typical of the empirical evaluation of topic models, were calculated per value of K. These were the mean of all topics (Average), the mean for all topics weighted by the topic proportion (WeightedAverage), and the mean of the Top-N percent of ranked topics by coherence score (Top-Npcnt), where N = {25, 50, 80}.
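A sketch of the three aggregation methods over the per-topic coherence scores of one model at a given K (a minimal illustration; the exact rounding of the Top-N cutoff is an assumption):

import numpy as np

def average(scores):
    # Average: plain mean over all topics.
    return float(np.mean(scores))

def weighted_average(scores, proportions):
    # WeightedAverage: mean weighted by each topic's proportion (D_tp).
    return float(np.average(scores, weights=proportions))

def top_n_pcnt(scores, n_pcnt):
    # Top-Npcnt: mean of the top N percent of topics ranked by coherence.
    k = max(1, int(round(len(scores) * n_pcnt / 100.0)))
    return float(np.mean(sorted(scores, reverse=True)[:k]))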
Both models showed trends in aggregated coherence scores calculated on the AP dataset. As shown in Figure 1, the peak for each measure varies according to different values of K and between models. For instance, aggregates of both models' C NPMI and C NPMI-ABC peak at 60 and 10 topics, respectively. However, the C V aggregate peaks diverge completely between models: K=200 for MetaLDA and K=50 for LDA. Indeed, the two models favored different coherence measures and aggregate methods. Generally, MetaLDA exhibits superior performance across all aggregates for C V and C A , while LDA is superior for C Umass . Notably, MetaLDA shows superior C NPMI , C NPMI-ABC , and C NPMI-AP scores for Top20pcnt, Top50pcnt, and Top80pcnt aggregations, but is inferior when the full average of these scores is calculated. Other datasets are broadly similar and shown in Appendix B.
We also compare MetaLDA with LDA. Pooling the results for K=10-200 for each of the four datasets, we obtain a set of differences in the scores and compute the p-value of a one-sided Student's t-test to determine whether LDA has higher average coherence scores than MetaLDA. MetaLDA yields significantly higher C NPMI scores calculated using the Top20pcnt (p<0.01) and Top50pcnt of topics (p<0.05). Conversely, LDA yields significantly higher C NPMI scores for the other aggregates (p<0.01). Except for the full average, MetaLDA achieves significantly higher (p<0.01) C NPMI-ABC , C NPMI-AP , and C V scores than LDA for the other aggregate methods. Disturbingly, the "best" model, or the optimal K, varies depending on the coherence measure and the aggregate method used to calculate it. This has implications for topic model selection in applied settings, where coherence is used to inform K (Kirilenko et al., 2021). When repeating the analysis using different K, a second trend emerges: MetaLDA significantly outperforms LDA in C NPMI for smaller K on average but loses out for larger K. Results from our qualitative analysis confirmed this occurred because LDA had many less frequent topics (e.g., when K=60, all topics occur about 1/60 of the time), unlike MetaLDA, which mixes more and less frequent topics.
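The model comparison can be sketched as a paired, one-sided test over the pooled per-(K, dataset) aggregate scores; the snippet below is an illustration of that test, not the authors' exact analysis code, and it requires SciPy >= 1.6 for the alternative argument.

from scipy.stats import ttest_rel

def lda_beats_metalda(lda_scores, metalda_scores):
    # Paired one-sided Student's t-test: is LDA's aggregate coherence
    # higher than MetaLDA's over the pooled (K, dataset) settings?
    stat, p = ttest_rel(lda_scores, metalda_scores, alternative="greater")
    return stat, p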

Evaluation 2: Labeling Topic Word-sets
Proposition 2 states that if topics can be labeled, they are interpretable. Coherence as a measure of interpretability should then be predictive of topics that can be labeled. To evaluate this proposition, Spearman's ρ correlation coefficient was used to assess the relationship between coherence measures and the number of raters able to label the T ws , Q nbr , for each of the 130 topics produced per model-dataset combination. These results are available in Appendix C. There was no significant correlation between any coherence measure and Q nbr . Interestingly, the SMEs reported several topics they could not label despite their high coherence scores. For instance, the LDA-modeled topic "red, wear, flag, blue, gold, black, tape, tie, green, iron" could not be labeled despite being ranked 9th of 60 by C NPMI .

Evaluation 3: Topic Label Agreement
Proposition 3 states that an interpretable topic is one where there is high agreement between annotators on its label. As such, coherence should align with measures of consensus or agreement. To evaluate this proposition, we calculate the gold-standard ICR measures, Fleiss' kappa (κ) (Fleiss, 1971) and Krippendorff's alpha (α) (Krippendorff, 2004). Both allow for multiple coders and produce a chance-corrected estimate of ICR but do not facilitate the isolation of low-agreement topics. For this, we also calculated the Percentage Agreement, Q agr , for each topic, as shown in Appendix D.
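A hedged sketch of computing Fleiss' κ from the open labels, assuming the free-text labels have first been standardized and integer-coded; it relies on statsmodels, and Krippendorff's α can be obtained analogously from the third-party krippendorff package (not shown).

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def fleiss_kappa_from_labels(label_codes):
    # `label_codes` is an (n_topics x n_raters) array of integer-coded,
    # standardized labels. aggregate_raters converts it into the
    # subjects-by-categories count table that fleiss_kappa expects.
    table, _ = aggregate_raters(np.asarray(label_codes))
    return fleiss_kappa(table)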
Generally, α, κ, and Q agr improved as K increased. As shown in Table 1, LDA consistently outperformed MetaLDA when K=60 across all three datasets and generally attained higher α, κ, and Q agr scores than MetaLDA. There was moderate-to-strong agreement between SMEs, a reliable result for an open labeling task (Landis and Koch, 1977). However, the performance of each model was notably affected by the datasets. LDA outperformed MetaLDA on the AP dataset across all three measures except for κ when K=20 and for Q agr when K=10. Except for κ when K=40, MetaLDA achieved higher or comparable scores to LDA on the AWH dataset when K=20-40, but outperformed LDA only when K=10-20 on the AWM dataset. Spearman's ρ was calculated to measure the strength of the relationship between Q agr and the generated coherence measures. As shown in Appendix D, results were random with no significant correlations. As shown in Table 2, there was a statistically significant correlation between Q agr and Q nbr when K=60. Coherence measures did not correlate with Q agr and, in some cases, were contradictory. For example, Q agr generally increases with K (and our experts reported that labeling was often easier for smaller topics), but coherence measures such as C A and C NPMI-ABC tended to decrease (Figure 1). These results show that the two models exhibit different sensitivities to dataset preparation and the value of K.

Evaluation 4: Labeling Topic Document-collections
Proposition 4 states that topics that are interpretable have a T dc that is easily labeled. To evaluate this proposition, Spearman's ρ was used to assess the relationship between coherence measures and SME ratings of T dc labeling difficulty, Q dif . The full set, Top25pcnt, Top50pcnt, and bottom 15% (Bot15pcnt) of ranked Q dif scores were analyzed. The only notable correlation was between the Bot15pcnt of LDA T dc for C NPMI-ABC (ρ=-0.817, p<0.01). Interestingly, when ranked by the topic diagnostic D ew , the Top25pcnt and Top50pcnt of T dc s showed moderate correlation with Q dif for MetaLDA (ρ=-0.764, p<0.01; ρ=-0.630, p<0.01).
A repeat analysis with the topic diagnostic D tp did not yield any statistically significant results. However, we observed that for T dc s produced by MetaLDA, the three largest and three smallest topics could not be labeled. By contrast, the LDA T dc s that were not interpretable were from the smallest 20% of topics. We hypothesize that this distinction results from MetaLDA's broadly distributed D tp (0.017±0.155), which features several very large and very small topics. By comparison, LDA's D tp is approximately uniformly distributed (0.017±0.001).

Evaluation 5: Topic Label Alignment
Proposition 5 states that an interpretable topic is one that is descriptive of the T dc . To test this proposition, we constructed an alignment score, Q aln , which rates the similarity between the standardized topic label from the T ws and the label from the T dc . As in the evaluation of Proposition 4, we computed Spearman's ρ to test for a relationship between Q aln , coherence measures, and diagnostic scores.
The following illustrates a high-scoring but poorly aligned topic with a C NPMI of 0.073. The T ws "law, bill, power, gun, democracy, control, freedom, rule, protect, legislation" was labeled "Gun control", but the T dc was labeled "Foreign Interference Act". Appendix F contains additional examples.
LDA showed a strong relationship between Q aln and C NPMI-ABC for the Top25pcnt of topics (ρ=0.825, p<0.01), but the relationship was weak for other coherence measures. No coherence measures were correlated with MetaLDA Q aln scores.

Discussion
We repeated the work of Zhao et al. (2017), who demonstrated that when the top-ranked topics by C NPMI are considered, MetaLDA produces higher C NPMI scores than LDA. However, when C NPMI was measured using alternative aggregate methods, we discovered that LDA outperformed MetaLDA. This is likely because the smaller topics in MetaLDA can be effectively ignored or discarded, while in LDA, all topics are of comparable size and are used by the model. Other non-parametric topic models are believed to behave similarly. While MetaLDA generated higher C NPMI-ABC scores than LDA for all aggregates, this was highly dependent on dataset heterogeneity and the value of K. This suggests that MetaLDA is more adaptive to specialized language, an effect expected in other topic models supported by word embeddings.
The comparative performance of coherence measures can vary significantly depending on the aggregate calculation method used and the way the data has been prepared. This latter point has been well established in the literature, most notably for Twitter data (Symeonidis et al., 2018), but is often overlooked when evaluating novel topic models. This is a cause for concern, given the growing reliance on coherence measures to select the optimal model or K in applied settings (Xue et al., 2020;Lyu and Luli, 2021).
Propositions 2 and 3 addressed T ws interpretability. We have demonstrated the difference between comprehending a topic and providing a topic label that is both informative and reliable. However, coherence measures may not be informative of these qualities. Propositions 4 and 5 addressed T dc interpretability. We have demonstrated that the ease of labeling a T dc and the alignment between T ws and T dc do not correlate with coherence measures. Additionally, we identified several areas for future research into the use of diagnostic statistics in applied settings. We observed unexpected behaviors in the distributions of D ew and D tp after a comparative analysis between LDA and the non-parametric model MetaLDA, affecting the interpretability of both T ws and T dc . Correlations between Q dif /Q aln and D ew /D tp for MetaLDA, for example, indicate that these topic diagnostics could assist in evaluating T dc interpretability.

Conclusion
We have shown that coherence measures can be unreliable for evaluating topic models for specialized collections like Twitter data. We claim this is because the target of "interpretability" is ambiguous, compromising the validity of both automatic and human evaluation methods 9 .
Due to advancements in topic models, coherence measures designed for older models and more general datasets may be incompatible with newer models and more specific datasets. Our experiments show that non-parametric models such as MetaLDA, which employ embeddings to improve support for short texts, behave differently from LDA on these performance and diagnostic measures. This is critical because recent research has focused on sophisticated deep neural topic models (Zhao et al., 2021), which make tracing and predicting behaviors more challenging. Abstractly, we may compare the use of coherence measures in topic modeling to the use of BLEU in machine translation: both lack the finesse necessary for a complete evaluation, as is now widely acknowledged for BLEU (Song et al., 2013).
Additionally, our study demonstrated that an examination of the T dc can provide greater insight into topic model behaviors and explains many of the observed problems. We argue for the representation of topics as a combination of thematically related T dc and T ws , and for the further adoption of empirical evaluation using specialized datasets and consideration of T dc interpretability. To date, few papers have attempted this combination (Korenčić et al., 2018).
However, we believe coherence measures and automated labeling techniques will continue to play a critical role in applied topic modeling. Contextually relevant measures like C NPMI-ABC and topic diagnostics like D ew can be key indicators of interpretability. Beyond the empirical evaluation of novel topic models, new automated labeling techniques, having proven useful for labeling T ws , should be extended to T dc .

Ethics and Impact
This project has been reviewed and approved by the Monash University Human Research Committee (Project ID: 18167), subject to abidance with legislated data use and protection protocols. In particular, the Twitter Inc. developer policy prohibits the further distribution of collected tweets and associated metadata by the authors' group, with the exception of tweet IDs, which may be distributed and re-hydrated. The subject matter of the tweets collected is Australian Politics. We have forgone the inclusion of material that would be offensive or problematic to marginalized groups in the Australian political context.

Topic document-collection labeling
Figure 3: Example of the topic document-collection labeling task. Only the top 10 tweets are shown for brevity.

Difficulty of topic document-collection labeling
Figure 4: Example question asking an SME to rate how difficult it was to label a topic document-collection.
Topic word-set and topic document-collection label alignment
Figure 5: Example question asking an SME to rate how aligned a topic word-set label was to its topic document-collection label.

C Evaluation 2: Labeling Topics
The Spearman's ρ correlation coefficients for pairwise combinations of Q nbr and coherence measures for all learned models.