Neural Topic Modeling with Cycle-Consistent Adversarial Training

Advances in deep generative models have attracted significant research interest in neural topic modeling. The recently proposed Adversarial-neural Topic Model models topics with an adversarially trained generator network and employs a Dirichlet prior to capture the semantic patterns in latent topics. It is effective in discovering coherent topics but unable to infer topic distributions for given documents or utilize available document labels. To overcome these limitations, we propose Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT) and its supervised version sToMCAT. ToMCAT employs a generator network to interpret topics and an encoder network to infer document topics. Adversarial training and cycle-consistency constraints are used to encourage the generator and the encoder to produce realistic samples that coordinate with each other. sToMCAT extends ToMCAT by incorporating document labels into the topic modeling process to help discover more coherent topics. The effectiveness of the proposed models is evaluated on unsupervised/supervised topic modeling and text classification. The experimental results show that our models can produce both coherent and informative topics, outperforming a number of competitive baselines.


Introduction
Topic models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), aim to discover underlying topics and semantic structures from text collections. Due to its interpretability and effectiveness, LDA has been extended to many Natural Language Processing (NLP) tasks (Lin and He, 2009; McAuley and Leskovec, 2013; Zhou et al., 2017). Most of these models employ mean-field variational inference or collapsed Gibbs sampling (Griffiths and Steyvers, 2004) for model inference as a result of their intractable posteriors. However, such inference algorithms are model-specific and require dedicated derivations.
To address this limitation, neural topic models with black-box inference and more flexible training schemes have been explored. Inspired by the variational autoencoder (VAE) (Kingma and Welling, 2013), Miao et al. (2016) proposed the Neural Variational Document Model, which interprets the latent code in a VAE as topics. Following this line, Srivastava and Sutton (2017) adopted the logistic normal prior rather than the Gaussian to mimic the simplex properties of topic distributions. The logistic normal is a Laplace approximation to the Dirichlet distribution (MacKay, 1998). However, it cannot exhibit multiple peaks at the vertices of the simplex as the Dirichlet distribution does. Therefore, it is less capable of capturing multi-modality, which is crucial for topic modeling (Wallach et al., 2009).
To overcome this limitation, Wang et al. (2019a) proposed the Adversarial-neural Topic Model (ATM), a topic model based on Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) that samples topics directly from the Dirichlet distribution to impose a Dirichlet prior. ATM employs a generator that transforms randomly sampled topic distributions into word distributions, and an adversarially trained discriminator that estimates the probability that a word distribution came from the training data rather than the generator. Although ATM was shown to be effective in discovering coherent topics, it cannot be used to induce the topic distribution of a given document due to the absence of a topic inference module. This limitation hinders its application to downstream tasks such as text classification. Moreover, ATM fails to utilize document labels, which can help extract more coherent topics. For example, a document labeled as 'sports' more likely belongs to topics such as 'basketball' or 'football' rather than 'economics' or 'politics'.
To address these limitations of ATM, we propose a novel neural topic modeling approach, named Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT). In ToMCAT, topic modeling is cast as the transformation between topic distributions and word distributions. Specifically, the transformation from topic distributions to word distributions is used to interpret topics, and the reverse transformation is used to infer the underlying topics of a given document. Under this formulation, ToMCAT employs a generator to transform topic distributions randomly sampled from the Dirichlet prior into the corresponding word distributions, and an encoder to reversely transform documents, represented as word distributions, into their topic distributions. To encourage the generator/encoder to produce more realistic target samples, discriminators for word/topic distributions are introduced to enable adversarial training. Additional cycle-consistency constraints are utilized to align the learning of the encoder and the generator and prevent them from contradicting each other. Furthermore, for documents with labels, we propose sToMCAT, which introduces an extra classifier to regularize the topic modeling process.
The main contributions of the paper are:
• ToMCAT, a novel topic model with cycle-consistent adversarial training, is proposed. To the best of our knowledge, it is the first adversarial topic modeling approach capable of both topic discovery and topic inference.
• sToMCAT, a supervised extension of ToMCAT, is proposed to help discover more coherent topics from documents with available labels.
• Experimental results on unsupervised/supervised topic modeling and text classification demonstrate the effectiveness of the proposed approaches.

Related Work
Our work is related to neural topic modeling and unsupervised style transfer.

Neural Topic Modeling
Recent advances in deep generative models, such as VAEs (Kingma and Welling, 2013) and GANs (Goodfellow et al., 2014), have attracted much research interest in the NLP community.
Based on VAEs, the Neural Variational Document Model (NVDM) (Miao et al., 2016) encodes documents with variational posteriors in the latent topic space, employing a Gaussian prior over latent topics. Instead, Srivastava and Sutton (2017) proposed that the Dirichlet distribution is a more appropriate prior for multinomial topic distributions and constructed a Laplace approximation of the Dirichlet to enable reparameterisation (Kingma and Welling, 2013); furthermore, they replaced the word-level mixture with a weighted product of experts. Later, a non-parametric neural topic model utilizing the stick-breaking construction was presented in (Miao et al., 2017). There have also been attempts to incorporate supervised information into neural topic modeling. For example, Card et al. (2018) extended the Sparse Additive Generative Model (Eisenstein et al., 2011) to the neural framework and incorporated document metadata, such as document labels, into the modeling process.
Apart from VAE-based approaches, the Adversarial-neural Topic Model (ATM) (Wang et al., 2019a) was proposed to model topics with GANs. The generator of ATM projects randomly sampled topic distributions to word distributions, and is adversarially trained with a discriminator that tries to distinguish real and generated word distributions. Moreover, Wang et al. (2019b) extended ATM to open-domain event extraction by representing an event as a combination of an entity distribution, a location distribution, a keyword distribution and a date distribution; such joint distributions are adversarially learned in a similar manner to ATM. The proposed ToMCAT is partly inspired by ATM but differs in its capability of inferring document-specific topic distributions and incorporating supervision. BAT (Wang et al., 2020) is an extension of ATM that employs bidirectional adversarial training (Donahue et al., 2016) for document-specific topic distribution inference. Although BAT similarly utilizes an adversarial training objective to guide the learning of topic distributions, there are major differences. Apart from a different adversarial loss, ToMCAT also incorporates two cycle-consistency constraints, which encourage the model to generate informative representations and are shown in our experiments to be crucial for generating coherent topics.

Unsupervised Style Transfer
Style transfer, which aims at transforming representations from one style to another, has found many interesting applications, such as image and text style transfer. However, paired training data between different styles are not available for many tasks. To solve this problem, Zhu et al. (2017) imposed cycle-consistency constraints to align the mappings between two styles and proposed CycleGAN for unsupervised image style translation. Similarly, DiscoGAN (Kim et al., 2017) was proposed to discover the relations between different image styles and transform images from one style to another without paired data. In the NLP field, Lee et al. (2018) developed a CycleGAN-based approach to transfer the sentiment (positive, negative) of text.
Inspired by CycleGAN, our work views topic modeling as unsupervised distribution transfer and follows the CycleGAN framework.

Methodology
Given a corpus D consisting of N documents, the two main purposes of topic modeling are:
1. Topic discovery. Given a one-hot topic indicator vector I_k ∈ R^K, where K is the number of topics and I_kk = 1, discover the corresponding word distribution t_k ∈ R^V from D, where V is the vocabulary size. More generally, topic discovery can be considered as finding a mapping from topic distributions to word distributions.
2. Topic inference. Infer the topic distribution z_j ∈ R^K of a document x_j ∈ R^V. Similarly, topic inference can be considered as finding a mapping from word distributions to topic distributions.
We now formalize the above observations. Let X be the set of word distributions and Z the set of topic distributions. Given training documents {x_i}_{i=1}^N where x_i ∈ X and document-specific topic distributions {z_j}_{j=1}^M where z_j ∈ Z, the goal of topic modeling is to learn a mapping function G, called the generator, that transforms samples in Z into X, and a reverse function E, called the encoder, that transforms samples in X into Z. However, it should be noted that the training samples in X and Z are unpaired, since the topic distribution of a document is unknown before topic modeling. Thus, the problem is how to learn G and E to model topics in the absence of paired samples between X and Z.

ToMCAT
We now introduce the proposed ToMCAT, which is shown in the inner panel of Figure 1.
ToMCAT consists of a generator G: Z → X, an encoder E: X → Z, and adversarial discriminators D_X and D_Z for G and E, respectively. Following CycleGAN (Zhu et al., 2017), ToMCAT employs two types of losses, namely adversarial losses and cycle-consistency losses, to guide the training of the encoder E and the generator G. The details of these modules are described below.

Encoder Network E
Encoder E transforms a word distribution x_i ∈ R^V into its corresponding topic distribution z_i ∈ R^K. Following (Wang et al., 2019a), we represent x_i ∈ X with the normalized TF-IDF (Term Frequency-Inverse Document Frequency) representation of the i-th document:

tf-idf_ij = ( d_ij / Σ_{j'} d_{ij'} ) · log( N / (1 + Σ_{i'} 1(d_{i'j} > 0)) ),   (1)
x_ij = tf-idf_ij / Σ_{j'=1}^{V} tf-idf_{ij'},   (2)

where d_ij is the count of the j-th word in the i-th document and 1(·) denotes the indicator function. Equation 1 calculates the smoothed TF-IDF of d_i, which is then normalized to sum to one in Equation 2.
We use TF-IDF as the document representation because TF-IDF generally preserves the relative importance of words in a document and reduces the noise of stop words. As the target distribution of the generator G, such a representation helps generate more informative topics. The encoder is implemented as a multilayer perceptron (MLP) with LeakyReLU activations (Maas et al., 2013) and batch normalization (BN) (Ioffe and Szegedy, 2015): [ Linear(V, H) → LeakyReLU(0.1) → BN → Linear(H, K) → Softmax ], where Linear(I, J) denotes a linear transformation from I dimensions to J dimensions, H is the number of hidden units, and the final Softmax ensures that the output is one-normalized to match the input of the generator G. Inputs of E are either sampled from the corpus D or generated by G.
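As a concrete illustration, the smoothed and normalized TF-IDF representation can be computed as follows. This is a minimal sketch: the exact smoothing of the IDF term (the sklearn-style "+1" terms below) is an assumption, not a detail taken from the model description.

```python
import math

def normalized_tfidf(counts):
    # counts: N x V matrix of raw word counts (list of lists).
    # Returns TF-IDF rows normalized to sum to one, as the encoder's input.
    # The "+1" smoothing terms are an assumption (sklearn-style smooth_idf).
    N, V = len(counts), len(counts[0])
    # document frequency of each word: number of documents containing it
    df = [sum(1 for i in range(N) if counts[i][j] > 0) for j in range(V)]
    idf = [math.log((1 + N) / (1 + df[j])) + 1 for j in range(V)]
    rows = []
    for doc in counts:
        total = sum(doc) or 1
        tfidf = [(c / total) * idf[j] for j, c in enumerate(doc)]
        s = sum(tfidf) or 1
        rows.append([v / s for v in tfidf])  # normalize to the simplex
    return rows
```

Note how a word occurring in every document (low IDF) is down-weighted relative to a rarer word with the same in-document frequency, which is the stop-word-suppression property motivating this choice.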

Generator Network G
The generator G performs the reverse operation of the encoder, transforming a topic distribution z_j ∈ R^K into a word distribution x_j ∈ R^V, where the input z_j is either generated by the encoder or sampled from the prior distribution. To draw the topic distribution z_j, a common practice in topic modeling is to use the Dirichlet distribution, the conjugate prior of the multinomial distribution, and we stick with this choice in our model. Specifically, we draw topic distributions from a symmetric Dirichlet distribution: z_j ∼ Dir(α), where α is the concentration parameter shared across all K dimensions. After sampling a topic distribution z_j from the Dirichlet prior, the generator maps z_j from Z to X with transformations similar to the encoder's: [ Linear(K, H) → LeakyReLU(0.1) → BN → Linear(H, V) → Softmax ], where the final output is also normalized by the Softmax to match the input of the encoder.
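The prior sampling step can be sketched with the standard construction of a Dirichlet draw from normalized Gamma samples; the concentration values used below are illustrative, as the text does not fix α at this point.

```python
import random

def sample_symmetric_dirichlet(K, alpha, rng):
    # Draw z ~ Dirichlet(alpha, ..., alpha) by normalizing K Gamma(alpha, 1)
    # samples -- a standard construction, not code from the paper.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    s = sum(g)
    return [v / s for v in g]
```

A small α (< 1) concentrates probability mass on a few topics, matching the intuition that a single document covers only a handful of topics, while a large α yields near-uniform topic mixtures.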

Training Objective
Following CycleGAN (Zhu et al., 2017), we employ adversarial losses and cycle-consistency losses to guide the training of G and E. The adversarial losses encourage G and E to generate samples matching the data distribution in the target space (X for G and Z for E) while the cycle-consistency losses align G and E in these two distribution spaces to prevent them from contradicting each other.
Adversarial Loss Generator G is adversarially trained with a discriminator D_X, which takes as input either real samples from the training data, i.e., x ∼ p_data(x), or fake samples generated by G, i.e., G(z). The goal of D_X is to distinguish real samples from fake ones, while G aims to fool D_X by generating samples similar to x. The adversarial training therefore encourages G to mimic the pattern of X and produce realistic word distributions. We employ a Wasserstein GAN (WGAN) based adversarial loss for G and D_X:

L_adv(G, D_X) = E_{x∼p_data(x)}[D_X(x)] − E_{z∼p(z)}[D_X(G(z))],   (3)

where D_X tries to maximize L_adv(G, D_X) while G tries to minimize it. Similarly, the adversarial loss applied to E and D_Z is:

L_adv(E, D_Z) = E_{z∼p(z)}[D_Z(z)] − E_{x∼p_data(x)}[D_Z(E(x))].   (4)

Discriminators D_X and D_Z are implemented as MLPs with the same architecture: [ Linear(S, H) → LeakyReLU(0.1) → BN → Linear(H, 1) ], where S equals V for D_X and K for D_Z. Since we use WGAN rather than the original GAN loss as in CycleGAN, we do not apply a sigmoid transformation to the discriminator outputs.
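On a mini-batch, the WGAN objectives reduce to simple means of critic scores. The following is a minimal sketch of this estimation, not the authors' implementation:

```python
def wgan_losses(d_real, d_fake):
    # Monte-Carlo estimate of the adversarial objective for (G, D_X):
    # the critic D_X maximizes E[D_X(x)] - E[D_X(G(z))], while the
    # generator G minimizes -E[D_X(G(z))], the only term depending on G.
    mean = lambda v: sum(v) / len(v)
    critic_objective = mean(d_real) - mean(d_fake)
    generator_loss = -mean(d_fake)
    return critic_objective, generator_loss
```

The same estimator applies symmetrically to (E, D_Z), with real topic draws from the Dirichlet prior playing the role of real samples and encoded documents E(x) the role of fakes.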
Cycle-Consistency Loss Adversarial training can lead to generated samples identically distributed as the corresponding target samples (Goodfellow et al., 2014). However, the relationship between the source distributions and the transformed distributions remains unconstrained. Zhu et al. (2017) argued that adversarial losses alone are not able to fulfill this task, and that the learned mappings should be cycle-consistent to reduce the space of possible mapping functions, i.e., E(G(z)) ≈ z for each z ∈ Z and G(E(x)) ≈ x for each x ∈ X. To this end, two cycle-consistency losses →L_cyc(G, E) and ←L_cyc(G, E) are added to the training objective, as shown in the inner panel (dotted lines) of Figure 1. Specifically,

→L_cyc(G, E) = E_{z∼p(z)}[ ||E(G(z)) − z||_1 ],
←L_cyc(G, E) = E_{x∼p_data(x)}[ ||G(E(x)) − x||_1 ],   (5)

where ||·||_1 denotes the L1 norm.
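Per example, the two cycle-consistency penalties are plain L1 reconstruction errors over the two round trips, which can be sketched as:

```python
def cycle_losses(z, z_reconstructed, x, x_reconstructed):
    # Forward cycle z -> G(z) -> E(G(z)): penalize ||E(G(z)) - z||_1.
    # Backward cycle x -> E(x) -> G(E(x)): penalize ||G(E(x)) - x||_1.
    forward = sum(abs(a - b) for a, b in zip(z_reconstructed, z))
    backward = sum(abs(a - b) for a, b in zip(x_reconstructed, x))
    return forward, backward
```

Because both penalties vanish only when the round trips reproduce their inputs, they force G and E to preserve the information in each sample rather than merely matching the marginal target distributions.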
Overall Objective Summing the adversarial losses in Equations 3 and 4 and the cycle-consistency losses in Equation 5, the overall objective of ToMCAT is:

L(G, E, D_X, D_Z) = L_adv(G, D_X) + L_adv(E, D_Z) + λ_1 →L_cyc(G, E) + λ_2 ←L_cyc(G, E),   (6)

where λ_1 and λ_2 respectively control the relative importance of →L_cyc(G, E) and ←L_cyc(G, E) with respect to the adversarial losses.

sToMCAT
The encoder E transforms a word distribution x into its corresponding topic distribution z, which captures the key semantic information of x and can be directly used in downstream tasks, e.g., text classification. Therefore, for labeled documents we extend ToMCAT with a classifier C to allow the incorporation of label information, as shown in Figure 1. We name this supervised version sToMCAT.
For a word distribution x and its one-hot label y ∈ R^L, where L is the number of classes, x is first encoded by the encoder E into the topic distribution z, which is then fed to the classifier C to predict the probability of y. The predictive objective is defined as:

L_cls(E, C) = −E_{(x,y)∼p_data(x,y)}[ y log C(E(x)) ].   (7)

We employ an MLP classifier: [ Linear(K, H) → LeakyReLU(0.1) → BN → Linear(H, L) → Softmax ]. For sToMCAT, the topic model and the classifier are trained jointly, and its overall objective is defined as:

L_s(G, E, D_X, D_Z, C) = L(G, E, D_X, D_Z) + λ_3 L_cls(E, C),   (8)

where λ_3 controls the weight of the classification loss.
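For a single example, the predictive objective reduces to the cross-entropy between the one-hot label and the classifier's softmax output:

```python
import math

def classification_loss(class_probs, onehot_label):
    # Per-example cross-entropy: -sum_l y_l * log C(E(x))_l.
    # class_probs is the classifier's softmax output C(E(x)),
    # onehot_label is the one-hot label vector y.
    return -sum(y * math.log(p)
                for y, p in zip(onehot_label, class_probs) if y > 0)
```

Since y is one-hot, the sum has a single non-zero term, so the loss is simply the negative log-probability the classifier assigns to the true class.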

Training Details
The proposed ToMCAT and sToMCAT are trained with the Adam optimizer (Kingma and Ba, 2014), with the learning rate and momentum term β_1 set to 0.0001 and 0.5 respectively for G, E, D_X and D_Z, and to 0.001 and 0.9 for the classifier C. The number of hidden units is set to 100 for all modules. Besides, to enforce the Lipschitz constraint required by WGAN, weight clipping at 0.01 is adopted. During training, the parameters of the discriminators D_X, D_Z and of the mappings G, E are updated alternately. Specifically, at each training iteration, we first optimize D_X and D_Z for 5 steps with the adversarial losses, and then take another training step to optimize G and E with the adversarial and cycle-consistency losses (Equation 6). When the model is trained in a supervised way, the predictive objective is additionally applied to E and C at the last training step (Equation 8).
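The weight clipping applied after each discriminator update can be sketched as follows; the threshold 0.01 is from the text, while the nested-list weight representation is purely illustrative:

```python
def clip_weights(weight_matrix, c=0.01):
    # After each critic (discriminator) update, clamp every weight into
    # [-c, c] to enforce the Lipschitz constraint required by WGAN.
    return [[max(-c, min(c, w)) for w in row] for row in weight_matrix]
```

Clipping is the original WGAN recipe; it is crude (later work replaced it with gradient penalties) but cheap and sufficient here.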
We found that good choices of λ_1 and λ_2 fall into different regions for different datasets and topic number settings, which implies these hyperparameters would need further tuning. To ease this burden, we apply a gradient-based balancing mechanism to the adversarial and cycle-consistency losses. It balances the two types of losses using the L2 norms of their gradients with respect to the output of the preceding mapping function. For example, for L_adv(G, D_X) and →L_cyc(G, E), we replace λ_1 in Equation 6 with:

λ_1 = λ̂_1 · ||∇_{G(z)} L_adv(G, D_X)||_2 / ||∇_{G(z)} →L_cyc(G, E)||_2,

where λ̂_1 is the new balancing factor and ||·||_2 denotes the L2 norm. Similarly, L_adv(E, D_Z) and ←L_cyc(G, E), as well as L_adv(E, D_Z) and L_cls(E, C), are balanced in this way with λ̂_2 and λ̂_3. The resulting λ̂_1, λ̂_2 and λ̂_3 are set to 2, 0.2 and 1 respectively for all datasets and topic number settings in our experiments, avoiding a time-consuming hyperparameter tuning process.
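Under our reading, this mechanism rescales each secondary loss so that its gradient magnitude is a fixed multiple of the adversarial one. A hedged sketch (a reconstruction of the mechanism described above, not the authors' exact formula):

```python
def balanced_lambda(grad_adv, grad_cyc, lam_hat):
    # grad_adv, grad_cyc: gradients of the adversarial and secondary
    # (cycle-consistency or classification) losses w.r.t. the output of
    # the preceding mapping, e.g. G(z). Returns the effective weight so
    # the secondary gradient has lam_hat times the adversarial L2 norm.
    l2 = lambda g: sum(v * v for v in g) ** 0.5
    return lam_hat * l2(grad_adv) / l2(grad_cyc)
```

With this rescaling, the relative influence of the two losses stays constant as their raw gradient scales drift during training, which is why a single λ̂ works across datasets.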

Experiments
In this section, we first describe datasets and compared baselines. Then we present topic modeling results under both unsupervised and supervised settings. Finally, we report the text classification results.

Experimental Setup
We evaluate the performance of the proposed models on four datasets: NYTimes (NYT), Grolier (GRL), the DBpedia ontology classification dataset (DBP) (Zhang et al., 2015) and 20 Newsgroups (20NG). For the NYTimes and Grolier datasets, we use the processed versions of (Wang et al., 2019a). For the DBpedia dataset, we first sample 100,000 documents from the whole training set and then perform preprocessing including tokenization, lemmatization, and removal of stop words and low-frequency words. The same preprocessing is applied to the 20 Newsgroups dataset. The statistics of the processed datasets are shown in Table 1.
We choose the following approaches as our baselines:
• LDA (Blei et al., 2003). We use GibbsLDA++, an implementation using Gibbs sampling for parameter estimation and inference.

Topic Modeling
We evaluate the performance of the proposed models and baselines using topic coherence measures, which quantify the understandability of the extracted topics and have been shown to be highly correlated with human judgments (Newman et al., 2010; Aletras and Stevenson, 2013). Since a topic is typically represented as a word distribution over the vocabulary, or by the n top-weighted words (i.e., topic words) in this distribution, we calculate the coherence of a topic by measuring the relatedness between its topic words. The word relatedness scores are estimated from word co-occurrence statistics, for example by applying a sliding window over the Wikipedia corpus and collecting word co-occurrences to calculate the NPMI (Normalized Pointwise Mutual Information) (Bouma, 2009) of word pairs. We refer readers to (Röder et al., 2015) for the detailed calculation and comparison of different topic coherence measures. In our experiments, we use the top-10 topic words of each topic to calculate topic coherence and report the results of 3 topic coherence measures: C_A (Aletras and Stevenson, 2013), C_P (Röder et al., 2015), and NPMI (Aletras and Stevenson, 2013). The topic coherence scores are calculated using Palmetto (https://github.com/AKSW/Palmetto).
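The NPMI of a word pair, and a topic's coherence as the average over its top-word pairs, can be sketched as follows; the probability estimates would come from sliding-window co-occurrence counts on a reference corpus such as Wikipedia:

```python
import math

def npmi(p_i, p_j, p_ij, eps=1e-12):
    # NPMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) p(w_j)) ) / -log p(w_i, w_j).
    # Bounded in [-1, 1]: 1 for always co-occurring words, ~0 for
    # independent words. eps avoids log(0) for pairs never seen together.
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / -math.log(p_ij + eps)

def topic_npmi(pair_scores):
    # Coherence of one topic: average NPMI over all pairs of its top-n words
    # (with top-10 words, that is 45 pairs per topic).
    return sum(pair_scores) / len(pair_scores)
```

This mirrors the NPMI component only; C_A and C_P combine such pairwise statistics in more elaborate ways, as detailed in (Röder et al., 2015).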

Unsupervised Topic Modeling
To make a more comprehensive comparison of our model with baselines for topic modeling, we experiment on each dataset with five topic number settings: 20, 30, 50, 75, 100. The average topic coherence scores of 5 settings are presented in Table 2. We can see from the left part of Table 2 that, among all unsupervised topic models, our model achieves the highest scores on all datasets and topic coherence measures.
With an improper Gaussian prior, NVDM shows the worst performance among all neural topic models, without exception. The logistic-normal-based ProdLDA and Scholar achieve higher topic coherence scores than NVDM but still largely underperform our model. BAT achieves second place most of the time in the unsupervised topic modeling experiments. Compared to ToMCAT, BAT has a similar adversarial objective but lacks the cycle-consistency constraints; therefore, the generator and encoder of BAT only aim to fool the discriminator by mimicking the pattern of the joint distribution of real documents and topics. With the incorporation of the two cycle-consistency losses, ToMCAT is explicitly encouraged to generate representations that are not only realistic but also informative, in order to reduce the cycle-consistency losses.
To give an insight into the generated topics, 8 out of 50 topics discovered by ToMCAT on NYTimes are presented in Table 3, where a topic is represented by the ten words with the highest probability in the topic. We can observe that the extracted topics are highly coherent and interpretable. The corresponding full list of topics can be found in the appendix.

Supervised Topic Modeling
Supervised topic modeling aims to leverage available document labels to benefit topic modeling; therefore, we only conduct experiments on the labeled datasets, i.e., DBpedia and 20 Newsgroups. The experimental results are shown in the right part of Table 2. We expected the topic extraction results to improve with the incorporation of document labels. However, this is not always the case, as shown in Table 2.

Impact of Topic Numbers
To investigate how topic coherence scores vary with respect to different topic number settings, we show in Figure 2 the topic coherence measures on the four datasets for all models. Although some baselines achieve higher scores in specific experimental settings, the general conclusion is that our models perform best in both unsupervised and supervised topic modeling tasks. On the DBpedia and 20 Newsgroups datasets, sToMCAT consistently outperforms ToMCAT, indicating that the additional supervision helps generate more coherent topics. We also notice that although the topic coherence measures of our models remain relatively stable across topic numbers, there are slight drops on the DBpedia and 20 Newsgroups datasets when the topic number becomes larger. This phenomenon may result from the fact that DBpedia and 20 Newsgroups are less diverse than the other datasets: there are only 14 and 20 categories in DBpedia and 20 Newsgroups, respectively. When the topic number is much larger than the ground-truth category number, discriminating different topics becomes more challenging. Nevertheless, the overall superiority of our models is evident in Figure 2.

Text Classification
We now report the text classification results of the supervised topic models: sLDA, Scholar, and sToMCAT.
To show that our model can learn both coherent and informative topics concurrently, we use the same models as in the topic modeling experiments to classify test set documents, without any further fine-tuning. In our experiments, we found that the text classification performance is influenced by the topic number; we therefore conduct experiments with five topic number settings: 20, 30, 50, 75, and 100. Classification results are presented in Table 4, which reports the minimum/average/maximum accuracy ('Min/Avg/Max') and the variance of accuracy ('∆') across the different topic numbers. We can see that our model not only achieves the best overall performance (the Max and Avg columns), but also has the highest accuracies on all dataset and topic number settings. Compared to Scholar, our model achieves a slightly higher accuracy on DBpedia and an accuracy improvement of 2.5% on 20 Newsgroups. The performance gain of our model over sLDA is more significant. In addition to better classification results, our model is also more robust to changes in the topic number (the ∆ column): as the topic number increases from 20 to 100, the variance of the classification accuracy of our model is only 0.025 and 0.026 on DBpedia and 20 Newsgroups respectively, much lower than that of sLDA and Scholar.

Conclusion
We have presented ToMCAT, a neural topic model with adversarial and cycle-consistency objectives, and its supervised extension, sToMCAT. ToMCAT employs a generator to capture semantic patterns in topics and an encoder to encode documents into their corresponding topics. sToMCAT further incorporates document labels into topic modeling. The effectiveness of ToMCAT and sToMCAT is verified by experiments on topic modeling and text classification. In the future, we plan to extend our model to incorporate external word or document semantics. It would also be interesting to explore alternative architectures other than CycleGAN under our formulation of topic modeling.