TOMODAPI: A Topic Modeling API to Train, Use and Compare Topic Models

From LDA to neural models, different topic modeling approaches have been proposed in the literature. However, their suitability and performance are not easy to compare, particularly when the algorithms are used in the wild on heterogeneous datasets. In this paper, we introduce ToModAPI (TOpic MOdeling API), a wrapper library to easily train, evaluate and infer using different topic modeling algorithms through a unified interface. The library is extensible and can be used in Python environments or through a Web API.


Introduction
The analysis of massive volumes of text is an extremely expensive activity when it relies on non-scalable manual approaches or crowdsourcing strategies. Relevant tasks typically include textual document classification, document clustering, keyword and named-entity extraction, language or sequence modeling, etc. In the literature, topic modeling and topic extraction, which enable the automatic recognition of the main subject (or topic) in a text, have attracted a lot of interest. The predicted topics can be used for clustering documents, for improving named entity extraction (Newman et al., 2006), and for automatic recommendation of related documents (Luostarinen and Kohonen, 2013).
Several topic modeling algorithms have been proposed. However, we argue that it is hard to compare them and to choose the most appropriate one given a particular goal. Furthermore, the algorithms are often evaluated on different datasets and with different scoring metrics. In this work, we have selected some of the most popular topic modeling algorithms from the state of the art in order to integrate them in a common platform, which homogenises the interface methods and the evaluation metrics. The result is ToModAPI 1, which allows users to dynamically train, evaluate, and perform inference with different models, as well as extract information from these models, making it possible to compare them using different metrics.
The remainder of this paper is organised as follows. In Section 2, we describe some related works and we detail some state-of-the-art topic modeling techniques. In Section 3, we provide an overview of the commonly used evaluation metrics. We introduce ToModAPI in Section 4. We then describe the datasets (Section 5) used in training to perform a comparison of the topic models (Section 6). Finally, we give some conclusions and outline future work in Section 7.

Related Work
Aside from a few exceptions (Blei and McAuliffe, 2007), most topic modeling works propose or apply unsupervised methods. Instead of learning the mapping to a pre-defined set of topics (or labels), the goal of these methods consists of assigning training documents to N unknown topics, where N is a required parameter. Usually, these models compute two distributions: a Document-Topic distribution, which represents the probability that each document belongs to each topic, and a Topic-Word distribution, which represents the probability that each topic is represented by each word present in the documents. These distributions are used to predict (or infer) the topic of unseen documents.
Latent Dirichlet Allocation (LDA) is an unsupervised statistical modeling approach (Blei et al., 2003) that considers each document as a bag of words and creates randomly assigned document-topic and word-topic distributions. Iterating over words in each document, the distributions are updated according to the probability that a document or a word belongs to a certain topic. The Hierarchical Dirichlet Process (HDP) model (Teh et al., 2006) is another statistical approach for clustering grouped data such as text documents. It considers each document as a group of words belonging with a certain probability to one or multiple components of a mixture model, i.e. the topics. Both the probability measure for each document (distribution over the topics) and the base probability measure, which allows the sharing of clusters across documents, are drawn from Dirichlet Processes (Ferguson, 1973). Unlike many other topic models, HDP infers the number of topics automatically.
Gibbs Sampling for a DMM (GSDMM) applies the Dirichlet Multinomial Mixture model to short text clustering (Yin and Wang, 2014). This algorithm works by iteratively computing the probability that a document joins a specific one of the N available clusters. This probability consists of two parts: 1) a part that promotes the clusters with more documents; 2) a part that favours moving a document towards similar clusters, i.e. those containing a similar word set. These two parts are controlled by the parameters α and β. The simplicity of GSDMM provides fast convergence after a few iterations. This algorithm considers the given number of clusters as an upper bound and might end up with a lower number of topics. From another perspective, it is thus able to infer the optimal number of topics, given the upper bound.

Pre-trained word vectors such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) can help to enhance topic-word representations, as achieved by the Latent Feature Topic Models (LFTM) (Nguyen et al., 2015). One of the LFTM algorithms is Latent Feature LDA (LF-LDA), which extends the original LDA algorithm by enriching the topic-word distribution with a latent feature component composed of pre-trained word vectors. In the same vein, the Paragraph Vector Topic Model (PVTM) (Lenz and Winker, 2020) uses doc2vec (Le and Mikolov, 2014) to generate document-level representations in a common embedding space. Then, it fits a Gaussian Mixture Model (GMM) to cluster all the similar documents into a predetermined number of topics, i.e. the number of GMM components.
Topic modeling can also be performed via linear-algebraic methods. Starting from the high-dimensional term-document matrix, multiple approaches can be used to lower its dimensions. Then, every dimension in the lower-rank matrix is considered a latent topic. A straightforward application of this principle is the Latent Semantic Indexing model (LSI) (Deerwester et al., 1990), which uses Singular Value Decomposition to approximate the term-document matrix (potentially mediated by TF-IDF) with one having fewer rows, each one representing a latent semantic dimension in the data, while preserving the similarity structure among the columns. Non-negative Matrix Factorisation (NMF) (Paatero and Tapper, 1994) exploits the fact that the term-document matrix is non-negative, thus not only producing a denser representation of the term-document distribution through the matrix factorisation, but also guaranteeing that the membership of a document to each topic is represented by a positive coefficient.
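The LSI principle can be illustrated with a short NumPy sketch (the term-document matrix and the number of topics are toy values; in practice the matrix would be TF-IDF weighted and far larger):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
X = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

k = 2  # number of latent topics to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation of the original matrix (Eckart-Young theorem).
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

term_topic = U[:, :k]    # each column is a latent semantic dimension over terms
doc_topic = Vt[:k, :].T  # each row places a document in the k-dimensional topic space
```

NMF would instead factorise the same matrix into two non-negative factors, so that each document-topic coefficient can be read directly as a (positive) degree of membership.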
In recent years, neural network approaches for topic modeling have gained popularity, giving birth to a family of Neural Topic Models (NTM) (Cao et al., 2015). Among those, doc2topic (D2T) 2 uses a neural network which separately computes N-dimensional embedding vectors for words and documents, with N equal to the number of topics, before computing the final output using a sigmoid activation. The topic-word and document-topic distributions are obtained from the final weights of the two embedding layers. Another neural topic model, the Contextualized Topic Model (CTM) (Bianchi et al., 2020), uses Sentence-BERT (SBERT) (Reimers and Gurevych, 2019), a neural transformer language model designed to compute sentence representations efficiently, to generate a fixed-size embedding for each document which contextualises the usual Bag of Words representation. CTM enhances the Neural-ProdLDA (Srivastava and Sutton, 2017) architecture with this contextual representation to significantly improve the coherence of the generated topics.
Previous works have tried to compare different topic models. A review of statistical topic modeling techniques is included in Newman et al. (2006). A comparison and evaluation of LDA and NMF using the coherence metric is proposed by O'Callaghan et al. (2015). Among the libraries for performing topic modeling, Gensim is undoubtedly the best-known one, providing implementations of several tools for the NLP field (Řehůřek and Sojka, 2010). Focusing on topic modeling for short texts, STMM includes 11 different topic models, which can be trained and evaluated through the command line (Qiang et al., 2019). The Topic Modelling Open Source Tool (https://github.com/opeyemibami/Topic-Modelling-Open-Source-Tool) exposes a web graphical user interface for training and evaluating topic models, LDA being the only representative so far. The Promoss Topic Modelling Toolbox (https://github.com/gesiscss/promoss) provides a unified Java command line interface for computing a topic model distribution using LDA or the Hierarchical Multi-Dirichlet Process Topic Model (HMDP) (Kling, 2016). However, it does not allow applying the computed model to unseen documents.

Metrics
The evaluation of machine learning techniques often relies on accuracy scores computed by comparing predicted results against a ground truth. In the case of unsupervised techniques like topic modeling, the ground truth is not always available. For this reason, in the literature, we can find:
• metrics that evaluate a topic model independently of a ground truth, among which coherence measures are the most popular ones for topic modeling (Röder et al., 2015; O'Callaghan et al., 2015; Qiang et al., 2019);
• metrics that measure the quality of a model's predictions by comparing its resulting clusters against ground truth labels, in this case a topic label for each document.

Coherence metrics
The coherence metrics rely on the joint probability P(w_i, w_j) of two words w_i and w_j, computed by counting the number of documents in which those words occur together, divided by the total number of documents in the corpus. The documents are fragmented using sliding windows of a given length, and the probability is given by the number of fragments including both w_i and w_j divided by the total number of fragments. This probability can be expressed through the Pointwise Mutual Information (PMI), defined as:

    PMI(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}    (1)

A small value is chosen for ε, in order to avoid computing the logarithm of 0. Different metrics based on PMI have been introduced in the literature, differing in the strategies applied for token segmentation, probability estimation, confirmation measure, and aggregation. The UCI coherence (Röder et al., 2015) averages the PMI computed between pairs of topic words, according to:

    C_{UCI} = \frac{2}{N (N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} PMI(w_i, w_j)    (2)

The UMASS coherence (Röder et al., 2015) relies instead on a differently computed joint probability:

    C_{UMASS} = \frac{2}{N (N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{P(w_i, w_j) + \epsilon}{P(w_j)}    (3)

The Normalized Pointwise Mutual Information (NPMI) (Chiarcos et al., 2009) applies the PMI in a confirmation measure for defining the association between two words:

    NPMI(w_i, w_j) = \frac{PMI(w_i, w_j)}{-\log (P(w_i, w_j) + \epsilon)}    (4)

NPMI values go from -1 (never co-occurring words) to +1 (always co-occurring), while the value of 0 suggests complete independence. This measure can also be applied to word sets. This is made possible using a vector representation in which each feature consists of the NPMI computed between w_i and a word in the corpus W, according to the formula:

    \vec{v}(w_i) = \big( NPMI(w_i, w_j) \big)_{j=1,\dots,|W|}    (5)

In ToModAPI, we include the following four metrics 5:
• C_NPMI applies NPMI as in Eqn (4) to pairs of words, computing their joint probabilities using sliding windows;
• C_V computes the cosine similarity between the vectors, defined in Eqn (5), associated with each word of the topic; the NPMI is computed on sliding windows;
• C_UCI as in Eqn (2);
• C_UMASS as in Eqn (3).
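As an illustration, the sliding-window probability estimation and the NPMI confirmation measure of Eqn (4) can be sketched in plain Python (a toy implementation for clarity; ToModAPI delegates the actual coherence computation to Gensim):

```python
import math

def window_probabilities(docs, window=10):
    """Build an estimator of P(w) and P(wi, wj) from sliding-window co-occurrences."""
    fragments = []
    for doc in docs:
        tokens = doc.split()
        if len(tokens) <= window:
            fragments.append(set(tokens))
        else:
            for k in range(len(tokens) - window + 1):
                fragments.append(set(tokens[k:k + window]))
    total = len(fragments)

    def p(*words):
        # Fraction of fragments containing all the given words.
        return sum(1 for f in fragments if all(w in f for w in words)) / total
    return p

def pmi(p, wi, wj, eps=1e-12):
    return math.log((p(wi, wj) + eps) / (p(wi) * p(wj)))

def npmi(p, wi, wj, eps=1e-12):
    # Normalised to [-1, +1]: -1 never co-occurring, +1 always co-occurring.
    return pmi(p, wi, wj, eps) / -math.log(p(wi, wj) + eps)
```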
Additionally, we include a Word Embeddings-based Coherence as introduced by Fang et al. (2016). This metric relies on pre-trained word embeddings such as GloVe or word2vec and evaluates the topic quality using a similarity metric between its top words. In other words, a high mutual embedding similarity between a model's top words reflects its underlying semantic coherence. In the context of this paper, we will use the sum of mutual cosine similarities computed on the GloVe vectors (we use a GloVe model pre-trained on Wikipedia 2014 + Gigaword 5, available at https://nlp.stanford.edu/projects/glove/) of the top N = 10 words of each topic:

    C_{WE} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \cos(\vec{v}_i, \vec{v}_j)

where \vec{v}_i and \vec{v}_j are the GloVe vectors of the words w_i and w_j.
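Since this coherence reduces to summing pairwise cosine similarities over a topic's top words, it fits in a few lines of Python (the two-dimensional vectors in the test are toy stand-ins for the GloVe embeddings):

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def we_coherence(topic_words, vectors):
    """Sum of mutual cosine similarities between the embeddings of a topic's top words."""
    return sum(cosine(vectors[wi], vectors[wj])
               for wi, wj in itertools.combinations(topic_words, 2))
```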
All metrics aggregate the per-topic values using the arithmetic mean, in order to provide a single coherence value for the whole model.

Metrics which rely on a ground truth
The most used metric that relies on a ground truth is Purity, defined as the fraction of documents in each cluster with a correct prediction (Hajjem and Latiri, 2017). A prediction is considered correct if the document's original label coincides with the original label of the majority of documents falling in the same topic prediction. Given L the set of original labels and T the set of predictions:

    Purity = \frac{1}{|D|} \sum_{t \in T} \max_{l \in L} |t \cap l|

where D is the full document collection, and t and l denote the sets of documents carrying prediction t and label l, respectively. In addition, we include in the API the following metrics used in the literature for evaluating the quality of classification or clustering algorithms, applied to the topic modeling task:
1. Homogeneity: a topic model output is considered homogeneous if all documents assigned to each topic belong to the same ground-truth label (Rosenberg and Hirschberg, 2007);
2. Completeness: a topic model output is considered complete if all documents from one ground-truth label fall into the same topic (Rosenberg and Hirschberg, 2007);
3. V-Measure: the harmonic mean of Homogeneity and Completeness. A V-Measure of 1.0 corresponds to a perfect alignment between topic model outputs and ground truth labels (Rosenberg and Hirschberg, 2007);

4. Normalized Mutual Information (NMI): the ratio between the mutual information of two distributions (in our case, the prediction set and the ground truth), normalised through an aggregation of those distributions' entropies (Lancichinetti et al., 2009). The aggregation can be realised by selecting the minimum/maximum or applying the geometric/arithmetic mean. In the case of the arithmetic mean, NMI is equivalent to the V-Measure.
For these metrics, we use the implementations provided by scikit-learn (Pedregosa et al., 2011).
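Purity, which scikit-learn does not provide out of the box, can be sketched directly from its definition (the labels and cluster ids below are toy values):

```python
from collections import Counter

def purity(ground_truth, predictions):
    """Fraction of documents whose cluster's majority ground-truth label matches their own."""
    clusters = {}
    for label, pred in zip(ground_truth, predictions):
        clusters.setdefault(pred, []).append(label)
    # For each predicted cluster, count the documents carrying the majority label.
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in clusters.values())
    return correct / len(ground_truth)
```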

ToModAPI: a Topic Modeling API
We now introduce ToModAPI, a Python library which harmonises the interfaces of topic modeling algorithms. So far, 9 topic modeling algorithms have been integrated into the library (Table 1).
For each algorithm, the following interface methods are exposed:
• train, which requires as input the path of a dataset and an algorithm-specific set of training parameters;
• topics, which returns the list of trained topics and, for each of them, the 10 most representative words. Where available, the weights of those words in representing the topic are also given;
• topic, which returns the information (representative words and weights) about a single topic;
• predict, which performs topic inference on a given (unseen) text;
• get training predictions, which provides the final predictions made on the training corpus. Where possible, this method does not perform a new inference on the text, but returns the predictions obtained during training;
• coherence, which computes the chosen coherence metric (among the ones described in Section 3.1) on a given dataset;
• evaluate, which evaluates the model predictions against a given ground truth, using the metrics described in Section 3.2.

The structure of the library, which relies on class inheritance, is easy to extend with the addition of new models. In addition to allowing the library to be imported in any Python environment and used offline, it provides the possibility of automatically building a web API, in order to access the different methods through HTTP calls. Table 2 provides a comparison between ToModAPI, Gensim and STMM. Given that we wrap some Gensim models and methods (i.e. for coherence computation), some similarities between it and our work can be observed.
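The inheritance-based design can be illustrated with a minimal sketch (class and method names here are illustrative assumptions, not ToModAPI's actual code): a new algorithm is added by subclassing a common base class and implementing the shared interface.

```python
from abc import ABC, abstractmethod

class TopicModel(ABC):
    """Hypothetical base class: every wrapped algorithm exposes the same methods."""

    @abstractmethod
    def train(self, dataset_path, **params):
        """Fit the model on the corpus found at dataset_path."""

    @abstractmethod
    def topics(self, top_n=10):
        """Return, for each topic, its top_n most representative words."""

    @abstractmethod
    def predict(self, text):
        """Infer the topic distribution of an unseen document."""

class DummyModel(TopicModel):
    """Trivial subclass showing how a new algorithm plugs into the interface."""

    def train(self, dataset_path, **params):
        self.trained = True

    def topics(self, top_n=10):
        return [["word"] * top_n]

    def predict(self, text):
        return {0: 1.0}  # topic id -> probability
```

Because every model satisfies the same contract, training, evaluation and inference code can be written once and reused across all nine algorithms.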
The software is distributed under an open source license (https://github.com/D2KLab/ToModAPI). A demo of the web API is available at http://hyperted.eurecom.fr/topic.

Datasets and pre-trained models
Together with the library, we provide models pre-trained on two datasets having different characteristics (20NG and AFP). A common pre-processing is performed on the datasets before training, consisting of:
• removing numbers, which, in general, do not contribute to the broad semantics;
• removing the punctuation and lower-casing;
• removing the standard English stop words;
• lemmatisation using WordNet, in order to deal with inflected forms as a single semantic item;
• ignoring words with 2 letters or less. In fact, these are mainly residuals from removing punctuation, e.g. stripping the punctuation from people's produces people and s.
The same pre-processing is also applied to the text before topic prediction.
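The pipeline above can be sketched as follows (a simplified stand-in: the stop-word list is truncated and lemmatisation is left as a pluggable function, whereas the library uses the full standard English list and the WordNet lemmatiser):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and"}  # truncated stand-in list

def preprocess(text, lemmatise=lambda w: w):
    text = re.sub(r"\d+", " ", text)              # remove numbers
    text = re.sub(r"[^\w\s]", " ", text.lower())  # strip punctuation, lower-case
    return [lemmatise(t) for t in text.split()
            if t not in STOP_WORDS and len(t) > 2]  # drop stop words and short residuals
```

For instance, preprocess("The people's 2 cats") keeps only people and cats: splitting on the apostrophe leaves a stray s, which the length filter then removes.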

20 NewsGroups
The 20 NewsGroups collection (20NG) (Lang, 1995) is a popular dataset used for text classification and clustering. It is composed of English news documents, distributed fairly evenly across 20 different categories according to the subject of the text. We use a reduced version of this dataset 8, which excludes all the documents consisting of the header only, while preserving an even partition over the 20 categories. This reduced dataset contains 11,314 documents. We pre-process the dataset in order to remove irrelevant metadata (email addresses and news feed identifiers), keeping just the textual content. The average number of words per document is 142.

Agence France Presse
The Agence France Presse (AFP) publishes up to 2,000 news articles daily in 5 different languages 9, together with some metadata represented in the NewsML XML-based format. Each document is categorised using one or more subject codes, taken from the IPTC NewsCode Concept vocabulary 10. In case of multiple subjects, they are ordered by relevance. In this work, we only consider the first level of the hierarchy of the IPTC subject codes. We extracted a dataset containing 125,516 news documents in English, corresponding to the production of AFP for the year 2019, with 237 words per document on average. Table 3 summarizes the number of documents for each topic in those two datasets. In AFP, a single document can be assigned to multiple subjects, so we take each assignment into account.

Wikipedia Corpus
We also use the Wikipedia corpus (Wiki) 11, a readily extracted and organised snapshot from 2013 that includes English pages with at least 20 page views. This corpus has been used in other works, for example, for computing word embeddings (Leimeister and Wilson, 2018).
The corpus is distributed with some pre-processing already applied, like lower-casing and punctuation stripping. However, we performed additional operations such as lemmatisation, stop-word removal, and small-word (2 characters or less) removal. The dataset consists of around 463k documents with 498M words. This corpus is not used for training, but only for evaluating the models (trained on 20NG or AFP), in order to assess the generalisation of the topic models.

Experiment and Results
We empirically evaluate the performances of the topic modeling algorithms described in Section 2 on the two datasets presented in Section 5, using the metrics detailed in Section 3. For each algorithm, we trained two different models, respectively on the 20NG and AFP corpus. The number of topics, when required by the algorithm, has been set to 20 and 7 when training on 20NG and AFP respectively, in order to mimic the original division in class labels of the corpora (except for GSDMM and HDP, which infer the optimal number of topics). Each model trained on either 20NG or AFP is tested against the same dataset and the Wikipedia dataset to compute each metric. Table 4 shows the average coherence scores of the topics computed on the 20NG dataset, together with the standard deviation, while the results of Table 5 refer to models computed on the AFP dataset. The results differ depending on the studied metric and the evaluation dataset. LFTM generalises better when evaluated against the Wikipedia corpus, probably thanks to the usage of word vectors pre-trained on large corpora. Overall, LDA has the best results on all metrics, always being among the top ones in terms of coherence. When trained on AFP, all topic models benefit from the bigger dataset; this results in generally higher scores and in different algorithms maximising specific metrics. We also consider the time taken by the different techniques for tasks like training and getting predictions (Table 6). The results have been collected by selecting the best of 3 different calls. The inference time has been computed using the models trained on the 20NG dataset, on a small sentence of 18 words ("Climate change is a global environmental issue that is affecting the lands, the oceans, the animals, and humans"). The table shows LDA leading in training, while the longest execution time belongs to LFTM. The inference time for all models is in the order of a few seconds, or even less than one second for GSDMM, HDP, LSI and PVTM. The manipulation of BERT embeddings makes CTM inference more time-consuming.
The inference timing for D2T is not computed because an inference implementation is not available yet.

Conclusions and Future Work
In this paper, we introduced ToModAPI, a library and a Web API to easily train, test and evaluate topic models. 9 algorithms are already included in the library, while new ones will be added in future. Other evaluation metrics for topic modeling 12 "Climate change is a global environmental issue that is affecting the lands, the oceans, the animals, and humans" have been proposed (Wallach et al., 2009) and will be included in the API for enabling a complete evaluation. Among these, metrics based on word embeddings are gaining particular attention (Ding et al., 2018). For further exploiting the advantage of having a common interface, we will study ways to automatically tune each model's hyper-parameters such as the right number of topics, find an appropriate label for the computed topics, optimise and use the models in real world applications. Finally, future work includes a deeper comparison of the models trained on different datasets.