Topic Intrusion for Automatic Topic Model Evaluation

Topic coherence is increasingly being used to evaluate topic models and filter topics for end-user applications. Topic coherence measures how well topic words relate to each other, but offers little insight on the utility of the topics in describing the documents. In this paper, we explore the topic intrusion task — the task of guessing an outlier topic given a document and a few topics — and propose a method to automate it. We improve upon the state-of-the-art substantially, demonstrating its viability as an alternative method for topic model evaluation.


Introduction
Topic models have traditionally been evaluated using model perplexity, but there is an increasing trend to use topic coherence as a task-independent evaluation (Newman et al., 2010; Mimno et al., 2011; Aletras and Stevenson, 2013; Lau et al., 2014; Röder et al., 2015). In earlier work (Bhatia et al., 2017), we showed that topic coherence as a standalone evaluation can be misleading, which we illustrated with an adversarial topic model that produces highly coherent topics that collectively tell us little about the content of the document collection.
We went on to explore an alternative approach to assessing topics using topic intrusion, based on the manual task of Chang et al. (2009). In the original topic intrusion setup, users are presented with a document, a set of associated topics (from a topic model) and an intruder topic, and are tasked to find the intruder. Success in the task demonstrates that the topics learnt by the topic model are relevant to the document. We proposed a method to automate this (Bhatia et al., 2017), by training a support vector regression model based on information retrieval (IR) and word co-occurrence features to predict the intruder topic. Although our earlier method is able to distinguish between good and bad topic models (at the model level), we provided no evaluation at the document level other than the observation that "there are still slight disparities between human annotators and the automated method in intruder topic selection". Additionally, the method involves a number of dependencies on complex external systems such as Indri, and no implementation of the method was ever released. In this paper, we extend our earlier work (Bhatia et al., 2017) as follows: (1) we improve the results based on a novel neural model and provide additional analysis of document-level evaluation via mean absolute error; (2) we propose a new metric to measure the performance of the system; and (3) we release an open source implementation of our system.

Related Work
Chang et al. (2009) introduced the word and topic intrusion tasks to assess topic models. Since then, various automatic measures to assess topics have been proposed (Newman et al., 2010; Mimno et al., 2011; Aletras and Stevenson, 2013). Lau et al. (2014) compared and contrasted these approaches, and proposed a variant method based on normalised pointwise mutual information. Röder et al. (2015) conducted a systematic search using a framework that combines various existing measures.
In Bhatia et al. (2017), we revisited the topic intrusion task of Chang et al. (2009), and explored its viability as an alternative task-independent approach for topic model evaluation. We tested a number of topic models and found that there can be large discrepancies between conventional topic coherence measures and topic intrusion results, suggesting that topics can be individually coherent but

Datasets and Topic Models
We conduct our experiments using the datasets of Bhatia et al. (2017): (1) APNEWS, a collection of Associated Press news articles; and (2) the British National Corpus ("BNC": Burnard (1995)), made up of excerpts from diverse sources such as journals, books, letters, and articles. For the topic models we experiment with the following: standard LDA (lda: Blei et al. (2003)), correlated topic model (ctm: Blei and Lafferty (2006)), non-parametric topic model (hca: Buntine and Mishra (2014)), neural topic model (ntm: Cao et al. (2015)), and an adversarial topic model (cluster: Bhatia et al. (2017)). cluster is adversarial in the sense that it is designed to produce topics that are coherent but poor descriptors of documents.

Methodology
In this section, we briefly describe the topic intrusion task and propose an improved methodology to automate it. Chang et al. (2009) first proposed the topic intrusion task with the aim of assessing whether topics associated with a document capture its content. In this task, an annotator is presented with a document along with its top-3 highest probability topics and a low probability intruder topic, and is asked to identify the outlier intruder topic. Bhatia et al. (2017) incorporate an additional constraint: the intruder topic has to have high probability for at least one other document. Their rationale is to ensure that the intruder is interpretable. We follow the approach of our earlier work (Bhatia et al., 2017) when generating intruder topics.
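The intruder selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the thresholds `low_prob` and `high_prob` are hypothetical, since the text specifies only that the intruder has low probability for the document and high probability for at least one other document.

```python
import random

def top_topics(theta_d, n_top=3):
    """Indices of the n_top highest-probability topics for one document."""
    order = sorted(range(len(theta_d)), key=lambda t: theta_d[t], reverse=True)
    return order[:n_top]

def pick_intruder(theta, doc_id, n_top=3, low_prob=0.05, high_prob=0.3, rng=None):
    """Pick an intruder topic for document doc_id.

    theta is a documents-by-topics matrix of topic probabilities.
    low_prob and high_prob are assumed thresholds, not values from the paper.
    """
    rng = rng or random.Random(0)
    theta_d = theta[doc_id]
    real = set(top_topics(theta_d, n_top))
    candidates = []
    for t, p in enumerate(theta_d):
        if t in real or p > low_prob:
            continue  # the intruder must be improbable for this document
        # extra constraint: the topic must be prominent in some other document
        if any(theta[o][t] >= high_prob for o in range(len(theta)) if o != doc_id):
            candidates.append(t)
    return rng.choice(candidates) if candidates else None

theta = [
    [0.50, 0.30, 0.15, 0.03, 0.02],
    [0.10, 0.10, 0.10, 0.65, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]
print(top_topics(theta[0]), pick_intruder(theta, 0))
```

In this toy matrix, topic 4 is improbable for document 0 but never prominent elsewhere, so only topic 3 qualifies as an intruder.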

Human Judgements
To assess our methodology, we need human annotations for the topic intrusion task. We collect human judgements using Amazon Mechanical Turk. Each HIT comprises 5 documents, and each document is paired with 4 topics (3 real and 1 intruder). To control for annotation quality, an additional document-topics pair is inserted as part of the HIT. The control item's intruder topic is generated by randomly sampling words from the corpus vocabulary. To pass the quality control, an annotator has to select the correct intruder topic; annotators are filtered out if their pass rate over all controls is below 0.6. Each HIT is judged by 10 workers. To make sure that each HIT has at least 4 annotations, we collect additional annotations by releasing the task internally to a small number of local workers. The average number of internal annotations was approximately 1.6. For each topic model, we collected annotations for 100 documents on 2 corpora (5 topic models × 100 documents × 2 collections = 1000). After filtering and including internal judgements, we have an average of 6.7 and 6.9 annotations for APNEWS and BNC, respectively.
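The quality-control rule above amounts to a simple filter on each annotator's pass rate over control items. A minimal sketch, assuming control outcomes are stored per worker as lists of booleans (the data layout and function name are hypothetical):

```python
def reliable_workers(controls_by_worker, min_rate=0.6):
    """Return the workers whose control pass rate is at least min_rate.

    controls_by_worker maps a worker id to a list of booleans, one per
    control item (True if the worker picked the random-word intruder).
    """
    kept = set()
    for worker, results in controls_by_worker.items():
        if results and sum(results) / len(results) >= min_rate:
            kept.add(worker)
    return kept

controls = {
    "w1": [True, True, False],                 # pass rate 0.67, kept
    "w2": [True, False, False],                # pass rate 0.33, dropped
    "w3": [True, True, True, False, False],    # pass rate 0.60, kept
}
print(sorted(reliable_workers(controls)))
```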

Intruder Topic Detection
We propose a neural network model to automatically predict intruder topics. Our model is inspired by Severyn and Moschitti (2015), who use a learning-to-rank deep learning architecture in an IR setting to rank the documents for a given query. We adapt it to our topic intrusion task by ranking topics for a given document. Our task takes the form of a document $d_i$ with a set of corresponding topics $T_i$, where 3 topics are real and 1 is the intruder. The topic set $T_i$ has labels $Y_i = \{y_i^1, y_i^2, y_i^3, y_i^4\}$ ("1" denotes the intruder topic, and "0" otherwise). We train using a point-wise approach, where training instances are triples of $(d_i, t_i^j, y_i^j)$; essentially, the task is formulated as a binary classification problem.
Figure 2: mp GOLD vs. system scores at the model level.
The architecture of our network is given in Figure 1. The input to our model is a document-topic pair, with each represented as a sequence of words. These words are mapped to embeddings, via embedding matrix $W \in \mathbb{R}^{|V| \times d}$, where $V$ is the vocabulary and $d$ the dimensionality of the embeddings. The document embeddings $E_d \in \mathbb{R}^{k \times d}$ ($k$ = document length) and topic embeddings $E_t \in \mathbb{R}^{m \times d}$ ($m$ = number of topic words) are processed via convolutional layers (Kim, 2014; Severyn and Moschitti, 2015) to produce two hidden representations for the document and topic. The convolution operation is performed using feature maps of varying size followed by a max-pooling operation to produce a constant-length vector. The document and topic hidden representations are concatenated and fed to 2 dense layers and ultimately reduced to a sigmoid-activated score.
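To make the data flow concrete, the forward pass described above can be sketched in plain Python. This is a toy illustration with randomly initialised weights, not the released implementation: all function names are hypothetical, the tiny layer sizes are chosen for readability, and the use of separate kernels for document and topic, the absence of a bias or activation in the convolution, and the omission of the external IR feature are all assumptions of the sketch.

```python
import math
import random

def conv_max_pool(embeddings, kernels):
    """Slide each kernel over the word sequence and max-pool over positions.

    embeddings: list of word vectors; each kernel is a width x dim weight list.
    The output has one value per kernel, whatever the sequence length.
    """
    pooled = []
    for kernel in kernels:
        width = len(kernel)
        responses = []
        for start in range(len(embeddings) - width + 1):
            window = embeddings[start:start + width]
            responses.append(sum(w * x
                                 for k_row, e_row in zip(kernel, window)
                                 for w, x in zip(k_row, e_row)))
        pooled.append(max(responses))
    return pooled

def dense(x, weights, bias, activation):
    """Fully-connected layer: activation(Wx + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(dim, widths=(3, 5, 7), maps_per_width=2, hidden=(5, 3), seed=0):
    rng = random.Random(seed)
    def kernels():
        return [random_matrix(w, dim, rng) for w in widths for _ in range(maps_per_width)]
    def layer(n_in, n_out):
        return random_matrix(n_out, n_in, rng), [0.0] * n_out
    n_feat = 2 * len(widths) * maps_per_width  # document part + topic part
    return {"doc_kernels": kernels(), "topic_kernels": kernels(),
            "hidden1": layer(n_feat, hidden[0]),
            "hidden2": layer(hidden[0], hidden[1]),
            "out": layer(hidden[1], 1)}

def score(doc_emb, topic_emb, params):
    """Concatenate pooled document and topic vectors, then 2 dense layers + sigmoid."""
    h = (conv_max_pool(doc_emb, params["doc_kernels"])
         + conv_max_pool(topic_emb, params["topic_kernels"]))
    h = dense(h, *params["hidden1"], relu)
    h = dense(h, *params["hidden2"], relu)
    return dense(h, *params["out"], sigmoid)[0]

rng = random.Random(1)
params = init_params(dim=4)
doc = random_matrix(20, 4, rng)     # a 20-word document with 4-dimensional embeddings
topic = random_matrix(10, 4, rng)   # 10 topic words
print(score(doc, topic, params))
```

The point of the max-pooling step is visible in `conv_max_pool`: documents of different lengths yield vectors of the same size, so they can be concatenated with the topic representation and passed to fixed-size dense layers.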

External IR Feature
A good topic model learns common themes in the document collection. A limitation of our network is the lack of global or collection-level information (as the input consists of only a document and topic). To incorporate collection-level information, we include an IR feature where we query document $d_i$ using the topic words of $t_i^j$. We use Okapi BM25 (Robertson and Walker, 1994) to compute the relevance score of the document with respect to its $N$ topic words independently, thereby constructing an $N$-dimensional feature vector. This external feature vector is incorporated into the network after the convolutional layers (see Figure 1).
Table 1: mae between mp GOLD and nss/mp. "BNC → APNEWS" means the model is trained on BNC and tested on APNEWS. Boldface indicates optimal performance for each dataset.
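The per-word BM25 feature described above can be sketched as follows. The exact BM25 variant and parameter settings used in our system are not given in this section, so the sketch assumes the common defaults `k1 = 1.2` and `b = 0.75` and a smoothed, non-negative IDF; the function name is hypothetical.

```python
import math
from collections import Counter

def bm25_feature_vector(doc_tokens, topic_words, corpus, k1=1.2, b=0.75):
    """One BM25 score per topic word, scoring doc_tokens against each word alone.

    corpus is a list of tokenised documents and supplies the collection-level
    statistics (document frequencies and average document length).
    k1, b and the IDF smoothing are assumed defaults, not values from the paper.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    features = []
    for word in topic_words:
        df = sum(1 for d in corpus if word in d)           # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[word]                                       # term frequency in the document
        denom = f + k1 * (1.0 - b + b * len(doc_tokens) / avg_len)
        features.append(idf * f * (k1 + 1.0) / denom)
    return features

corpus = [
    ["stock", "market", "fell"],
    ["market", "rally", "stock", "stock"],
    ["rain", "storm", "flood"],
]
print(bm25_feature_vector(corpus[1], ["stock", "market", "storm"], corpus))
```

Scoring each topic word independently, rather than summing over the query as in standard retrieval, is what yields a vector with one dimension per topic word.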

Aggregating Human and System Scores for a Document
For each document we have a number of workers identifying the intruder topic. To aggregate the results, Chang et al. (2009) define model precision (mp GOLD ), which is the proportion of workers who correctly identified the intruder, as a proxy for how clearly the intruder topic is inappropriate for the document.
Our system and that of Bhatia et al. (2017) compute several scores for a document (one for each topic). Bhatia et al. (2017) select the topic with the maximum score as the intruder, and compute model precision (mp) based on that. This yields binary precision scores (i.e. the model either predicts the intruder correctly or not) and ignores the relative magnitude of the system score. We additionally propose using the normalised sigmoid score (nss) as a means of scoring the intruder topic for a given document, which is computed by normalising the raw sigmoid scores over all topics.
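The three document-level scores discussed above can be sketched in a few lines. The sketch reads "normalising the raw sigmoid scores over all topics" as dividing the intruder's score by the sum of the scores for the document's topics, which is an assumption; the function names are hypothetical.

```python
def gold_model_precision(worker_choices, intruder_idx):
    """mp GOLD: proportion of workers who picked the true intruder."""
    return sum(c == intruder_idx for c in worker_choices) / len(worker_choices)

def model_precision(scores, intruder_idx):
    """mp: 1 if the highest-scoring topic is the true intruder, else 0."""
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return 1.0 if predicted == intruder_idx else 0.0

def normalised_sigmoid_score(scores, intruder_idx):
    """nss: the intruder's sigmoid score as a share of the total over all topics."""
    return scores[intruder_idx] / sum(scores)

scores = [0.1, 0.2, 0.1, 0.6]   # toy sigmoid outputs for 4 topics
intruder = 3
print(gold_model_precision([3, 3, 1, 3], intruder))   # 0.75
print(model_precision(scores, intruder))              # 1.0
print(normalised_sigmoid_score(scores, intruder))
```

Here mp credits the system fully for a correct guess, while nss (about 0.6) also reflects how confident the system was, which makes it directly comparable with a graded human score such as 0.75.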

Implementation Details
For our experiments, we train the model on outputs from all topic models over one dataset, and test it on the other (cross-domain training). We use a single channel for the convolutional networks, pad the documents as necessary (k = 200), and use the top-10 words to represent a topic (i.e. m = 10). Word embeddings are initialised using pre-trained GloVe (Pennington et al., 2014) vectors (d = 100), and their weights are fixed during training. We use kernel windows of width = {3, 5, 7} with 100 feature maps each and two (fully-connected) hidden layers, with dimensionality of 50 and 10. We use a dropout rate of 0.5, 0.5 and 0.25 after the document, topic and first hidden layer, respectively. We set the batch size to 100, and use Adam as the optimizer with a learning rate of 0.001. For activation functions, we use ReLU for the fully-connected layers and sigmoid for the final layer. To reduce variance, we run the models with 8 different seeds for initialisation and take the average for a topic's sigmoid score.

Results
By taking the mean of mp GOLD and mp over documents, Bhatia et al. (2017) compute a single human/system score for each topic model. Although this resulted in a strong correlation between mp GOLD and mp, the evaluation is limited to model-level comparison: it separates good topic models from bad topic models, but does not provide any insights into the performance of each topic model over individual documents. We aim to improve model-level correlation in this work, in addition to analysing document-level evaluation, i.e. investigating how well the system predicts mp GOLD for each individual document. We present plots of human and system scores in Figure 2. There are 3 system scores: mp of Bhatia et al. (2017) (mp ORIG), and mp and nss of our proposed system. In general, we found strong correlation for all systems, but nss of our proposed system performs substantially better than mp ORIG, though our mp is lower than mp ORIG.
To compare the performance of our system with human judgements at the document level, we compute mean absolute error (mae) between mp GOLD and nss/mp, as summarised in Table 1. We find that, for both datasets, nss consistently outperforms mp ORIG and mp by a substantial margin, and also has a score close to human judgements. We can attribute this to the fact that nss provides more nuanced system predictions (over the full range [0, 1]), whereas mp tends to be binary.
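The document-level comparison reduces to a mean absolute error over paired scores. The sketch below uses made-up numbers, chosen only to illustrate why a binary score tends to sit further from a graded human score than a continuous one does; they are not results from our experiments.

```python
def mean_absolute_error(gold, system):
    """Average |gold - system| over documents."""
    if len(gold) != len(system):
        raise ValueError("gold and system must cover the same documents")
    return sum(abs(g - s) for g, s in zip(gold, system)) / len(gold)

# Toy values for three documents (illustrative only).
mp_gold = [0.75, 0.50, 1.00]   # share of annotators who found the intruder
nss = [0.60, 0.40, 0.90]       # continuous system score
mp = [1.0, 0.0, 1.0]           # binary system score

print(mean_absolute_error(mp_gold, nss))
print(mean_absolute_error(mp_gold, mp))
```

In this toy case the binary score is right about which documents are easy, yet its error (0.25) is roughly twice that of the continuous score, because it cannot express partial agreement among annotators.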

Discussion
One motivation we have in this paper is to test whether topic intrusion can be used as an alternative means for assessing topics. Given the encouraging mae results, we attempt to use nss to rank topics produced by a topic model.
To accomplish this, we first filter out the topics that occur as the top-1 topic in fewer than 5 documents: these topics tend to be noisy, and as such do not appear with significant weight in any documents. For each of the remaining topics, we randomly select 5-10 documents for which it is a top topic and calculate its mean nss over these documents. We then use the topics' mean nss to rank them; in Table 2 we show some selected best and worst topics for different topic models. Overall, the top-ranked topics appear to be more descriptive than the bottom-ranked topics. Having said that, we found instances where coherent topics have a low nss ranking (e.g. ctm topics in the bottom half of Table 2), but stress that ultimately the topic intrusion approach to assessing topics is very different to topic coherence. We include a more comprehensive list of best/worst topics in the supplementary material.
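The ranking procedure above can be sketched as follows. The data layout (a mapping from document to its top-1 topic, and a mapping from document to its nss) and the function name are assumptions for illustration; the sketch samples at most 10 documents per topic, which gives between 5 and 10 once rare topics are filtered out.

```python
import random
from collections import defaultdict

def rank_topics_by_nss(top_topic_of_doc, nss_of_doc, min_docs=5, max_sample=10, seed=0):
    """Rank topics by mean nss over a random sample of their top-1 documents."""
    rng = random.Random(seed)
    docs_by_topic = defaultdict(list)
    for doc_id, topic in top_topic_of_doc.items():
        docs_by_topic[topic].append(doc_id)

    mean_nss = {}
    for topic, docs in docs_by_topic.items():
        if len(docs) < min_docs:
            continue  # drop topics that are rarely the top-1 topic
        sample = rng.sample(docs, min(len(docs), max_sample))
        mean_nss[topic] = sum(nss_of_doc[d] for d in sample) / len(sample)

    # best (highest mean nss) first
    return sorted(mean_nss.items(), key=lambda kv: kv[1], reverse=True)

top_topic = {}
nss_scores = {}
for i in range(6):
    top_topic[f"a{i}"], nss_scores[f"a{i}"] = "A", 0.8
for i in range(5):
    top_topic[f"b{i}"], nss_scores[f"b{i}"] = "B", 0.3
for i in range(2):
    top_topic[f"c{i}"], nss_scores[f"c{i}"] = "C", 0.9   # too rare, filtered out

print(rank_topics_by_nss(top_topic, nss_scores))
```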

Conclusion
We explore an alternative approach to evaluate topic models based on topic allocations in documents, i.e. via topic intrusion. We propose an automated method that improves upon the state-of-the-art substantially at both the model and document levels, and demonstrate that it can be used to rank/filter topics. As future work, we intend to explore ways of combining topic coherence and topic intrusion for topic model evaluation.