Neural Topic Modeling by Incorporating Document Relationship Graph

Graph Neural Networks (GNNs), which capture the relationships between graph nodes via message passing, have been an active research direction in the natural language processing community. In this paper, we propose Graph Topic Model (GTM), a GNN-based neural topic model that represents a corpus as a document relationship graph. Documents and words in the corpus become nodes in the graph and are connected based on document-word co-occurrences. By introducing the graph structure, the relationships between documents are established through their shared words, and thus the topical representation of a document is enriched by aggregating information from its neighboring nodes using graph convolution. Extensive experiments on three datasets were conducted, and the results demonstrate the effectiveness of the proposed approach.


Introduction
Probabilistic topic models (Blei, 2012) are tools for discovering the main themes of large corpora. The popular Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its variants (Lin and He, 2009; Zhao et al., 2010; Zhou et al., 2014) are effective in extracting coherent topics in an interpretable manner, but usually at the cost of designing sophisticated, model-specific learning algorithms. Recently, neural topic modeling, which relies on neural-network-based black-box inference, has become the main research direction in this field. Notably, NVDM (Miao et al., 2016) employs a variational autoencoder (VAE) (Kingma and Welling, 2013) to model topic inference and document generation. Specifically, NVDM consists of an encoder inferring topics from documents and a decoder generating documents from topics, where the latent topics are constrained by a Gaussian prior. Srivastava and Sutton (2017) argued that the Dirichlet distribution is a more appropriate prior for topic modeling than the Gaussian in NVDM and proposed ProdLDA, which approximates the Dirichlet prior with a logistic normal. There are also attempts to directly enforce a Dirichlet prior on the document topics. W-LDA (Nan et al., 2019) models topics in the Wasserstein autoencoder (Tolstikhin et al., 2017) framework and achieves distribution matching by minimizing the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), while adversarial topic models (Wang et al., 2019a,b, 2020) directly generate documents from the Dirichlet prior and train this process adversarially against a discriminator under the framework of the Generative Adversarial Network (GAN) (Goodfellow et al., 2014).
Recently, owing to the effectiveness of Graph Neural Networks (GNNs) (Li et al., 2015; Kipf and Welling, 2016; Zhou et al., 2018) in embedding graph structures, there has been a surge of interest in applying GNNs to natural language processing tasks (Yasunaga et al., 2017; Song et al., 2018; Yao et al., 2019). For example, GraphBTM (Zhu et al., 2018) is a neural topic model that incorporates a graph representation of a document to capture biterm co-occurrences in the document. To construct the graph, a sliding window is moved over the document and all word pairs within the window are connected.
A limitation of GraphBTM is that only word relationships are considered, while document relationships are ignored. Since a topic is possessed by a subset of documents in the corpus, we believe that the topical neighborhood of a document, i.e., documents with similar topics, helps determine its topics. To this end, we propose Graph Topic Model (GTM), a neural topic model in which a corpus is represented as a document relationship graph: documents and words in the corpus are nodes, and they are connected based on document-word co-occurrences. In GTM, the topical representation of a document node is aggregated from its multi-hop neighborhood, including both document and word nodes, using a Graph Convolutional Network (GCN) (Kipf and Welling, 2016). As GCN captures high-order neighborhood relationships, GTM is essentially capable of modeling both word-word and doc-doc relationships. Specifically, the relationships between relevant documents are established through their shared words, which is desirable for topic modeling, as documents belonging to one topic typically have similar word distributions.
The main contributions of the paper are:

• We propose GTM, a novel topic model that incorporates a document relationship graph to enrich document and word representations.
• We conduct extensive experiments on three datasets, and the results demonstrate the effectiveness of the proposed approach.
Graph Topic Model

Graph Representation of the Corpus
We represent the whole corpus D with an undirected graph G = (N, E), where N and E are the nodes and edges of the graph, respectively. To model both words and documents, each of them is represented as a node n_i ∈ N, which gives rise to N = V + D nodes in total, where V is the size of the vocabulary V and D is the number of documents in the corpus D. An edge (n_i, n_j) indicates the relevance of nodes n_i and n_j, whose weight is determined by

A_ij = 1, if i = j; A_ij = TF-IDF_ij, if one of n_i, n_j is a document and the other is a word occurring in it; A_ij = 0, otherwise,

where A is the adjacency matrix of G and TF-IDF_ij denotes the max-normalized TF-IDF (Term Frequency-Inverse Document Frequency) weight of word j in document i. Besides self-connections, we only assign positive weights to edges between documents and words, and rely on the model to capture higher-order relationships, e.g., doc-doc and word-word relationships, by applying graph convolutions on G.
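As an illustration, the adjacency matrix described above can be assembled as follows. This is a minimal NumPy sketch (dense for readability; the function name and layout convention are ours, not from a released implementation), assuming word nodes occupy the first V indices and document nodes the remaining D:

```python
import numpy as np

def build_adjacency(tfidf):
    """Build the (V + D) x (V + D) adjacency matrix of the corpus graph.

    tfidf: (D, V) max-normalized TF-IDF matrix (documents x words).
    Self-connections get weight 1; document-word edges get the TF-IDF
    weight; all other entries (doc-doc, word-word) stay 0, leaving
    higher-order relations to the graph convolutions.
    """
    D, V = tfidf.shape
    N = V + D
    A = np.zeros((N, N))
    # word nodes: indices [0, V); document nodes: indices [V, V + D)
    A[V:, :V] = tfidf            # document -> word edges
    A[:V, V:] = tfidf.T          # word -> document edges (undirected)
    A[np.diag_indices(N)] = 1.0  # self-connections
    return A
```

In practice A would be stored as a sparse matrix, since only document-word co-occurrences are non-zero off the diagonal.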
Figure 1: The framework of GTM. Circles denote neural networks. X, I, Ẑ, X̂, and Z are the TF-IDF matrix of the corpus, an identity matrix, the latent topics of all documents, the reconstructed word weights, and topic distributions drawn from the Dirichlet prior, respectively. L_rec(X, X̂) and MMD(P_Z, Q_Ẑ) are the training objectives.

Model Architecture
The proposed GTM consists of an encoder E and a decoder G. The framework is shown in Figure 1, and we detail the architecture in the following.
The encoder network E maps nodes in G to their topic distributions by iteratively applying graph convolution to the node features. Following Kipf and Welling (2016), the layer-wise propagation rule of the graph convolution at layer l + 1 ∈ [1, L] is defined as

H^(l+1) = σ(D^(-1/2) A D^(-1/2) H^(l) W^(l+1)),

where A ∈ R^(N×N) is the adjacency matrix of G, D is the diagonal degree matrix with D_ii = Σ_j A_ij, W^(l+1) ∈ R^(d(l)×d(l+1)) is a layer-specific weight matrix where d(l) is the output size of layer l, and σ denotes an activation function, for which we use LeakyReLU (Maas et al., 2013) in this paper.
H^(l) ∈ R^(N×d(l)) is the matrix of activations of all nodes at layer l, and H^(0)_i is the embedding of node i. At each encoder layer, the graph convolution aggregates node features from a node's first-order neighborhood, which enlarges the receptive field of the central node and enables information propagation between relevant nodes. After successively applying L graph convolution layers, the encoding of a node involves its L-th-order neighborhood. With L ≥ 2, doc-doc and word-word relationships are naturally captured in the topic inference process.
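The symmetrically normalized propagation rule of Kipf and Welling (2016) can be sketched as follows; this is a NumPy toy version for clarity (batch normalization and parameter learning are omitted, and the function names are ours):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """LeakyReLU activation."""
    return np.where(x > 0, x, slope * x)

def gcn_layer(A, H, W):
    """One graph convolution layer: H' = sigma(D^{-1/2} A D^{-1/2} H W).

    A: (N, N) adjacency matrix, self-connections already included.
    H: (N, d_in) node activations from the previous layer.
    W: (d_in, d_out) layer-specific weight matrix.
    """
    deg = A.sum(axis=1)                      # D_ii = sum_j A_ij
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return leaky_relu(A_norm @ H @ W)
```

Stacking two such layers gives each node a 2-hop receptive field, which is exactly how doc-doc and word-word relationships emerge from doc-word edges.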
We also add batch normalization (Ioffe and Szegedy, 2015) after each graph convolution. After the graph encoding, a softmax is further applied to the node features of a document to produce a multinomial topic distribution ẑ ∈ R^K, where K is the number of topics.
Based on the inferred topic distribution ẑ, the decoder network G tries to restore the original document representation. To achieve this goal, we employ a 2-layer MLP with LeakyReLU activation and batch normalization in the first layer. The output of the MLP decoder is then softmax-normalized to generate a word distribution x̂ ∈ R^V.
The decoder is also used to interpret topics. In this case, we feed the decoder an identity matrix I ∈ R^(K×K), and the decoder output G(I)_i is the word distribution of the i-th topic.
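A minimal sketch of the decoder and the topic interpretation trick might look like this (NumPy; batch normalization is omitted and the weight names are hypothetical):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode(z, W1, b1, W2, b2):
    """2-layer MLP decoder mapping topic distributions z (B, K)
    to word distributions (B, V)."""
    h = leaky_relu(z @ W1 + b1)       # first layer (batch norm omitted)
    return softmax(h @ W2 + b2)       # normalize over the vocabulary
```

Feeding the K×K identity matrix, `decode(np.eye(K), ...)`, returns one word distribution per row, i.e., the interpretation of each of the K topics.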

Training Objective
Based on the Wasserstein autoencoder (Tolstikhin et al., 2017) framework, the training objective of GTM is to minimize the document reconstruction loss while the latent topic space is constrained by a prior distribution. The reconstruction loss is defined as

L_rec(x, x̂) = -Σ_{i=1}^{V} x_i log x̂_i,

where x denotes the TF-IDF vector of a document and x̂ is the reconstructed word distribution corresponding to x. We use TF-IDF as the reconstruction target since TF-IDF basically preserves the relative importance of words and reduces background noise that may hurt topic modeling, e.g., stop words. We impose a Dirichlet prior, the conjugate prior of the multinomial distribution, on the latent topic distributions. Following W-LDA (Nan et al., 2019), we achieve this goal by minimizing the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) between the distribution Q_Ẑ of inferred topic distributions ẑ and the Dirichlet prior P_Z from which we draw multinomial noises z:

MMD(P_Z, Q_Ẑ) = 1/(m(m-1)) Σ_{i≠j} k(z_i, z_j) + 1/(n(n-1)) Σ_{i≠j} k(ẑ_i, ẑ_j) - 2/(mn) Σ_{i,j} k(z_i, ẑ_j),

where m and n are the numbers of samples from Z and Ẑ, respectively (m and n are batch sizes and are equal in our experiments), and k : Z × Z → R is the kernel function. We use the information diffusion kernel (Lebanon and Lafferty, 2003) as in W-LDA:

k(z, z') = exp(-arccos²(Σ_{i=1}^{K} √(z_i z'_i))),

which is sensitive to points near the simplex boundary and is thus more suitable for sparse topic distributions.
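The MMD objective with the information diffusion kernel can be sketched as follows (a NumPy version of the standard unbiased estimator; function names are ours):

```python
import numpy as np

def diffusion_kernel(Z1, Z2):
    """Information diffusion kernel between batches of points on the
    simplex: k(z, z') = exp(-arccos^2(sum_i sqrt(z_i * z'_i)))."""
    inner = np.sqrt(Z1) @ np.sqrt(Z2).T
    return np.exp(-np.arccos(np.clip(inner, -1.0, 1.0)) ** 2)

def mmd(Z, Z_hat):
    """Unbiased MMD estimate between prior samples Z (m, K) and
    inferred topic distributions Z_hat (n, K)."""
    m, n = len(Z), len(Z_hat)
    Kzz = diffusion_kernel(Z, Z)
    Khh = diffusion_kernel(Z_hat, Z_hat)
    Kzh = diffusion_kernel(Z, Z_hat)
    term_z = (Kzz.sum() - np.trace(Kzz)) / (m * (m - 1))  # i != j pairs
    term_h = (Khh.sum() - np.trace(Khh)) / (n * (n - 1))
    return term_z + term_h - 2.0 * Kzh.sum() / (m * n)
```

Since each row of Z sums to 1, the inner product of a point with itself is 1 and arccos(1) = 0, so the kernel is well behaved on the simplex; the clip guards against floating-point drift outside arccos's domain.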

Experiments
We evaluate our model on three datasets: 20Newsgroups, consisting of 11,259 documents; Grolier, consisting of 29,762 documents; and NYTimes, consisting of 99,992 documents. We use the preprocessed 20Newsgroups of Srivastava and Sutton (2017), and the preprocessed Grolier and NYTimes of Wang et al. (2019a). We compare the performance of our model with LDA (Blei et al., 2003), NVDM (Miao et al., 2016), ProdLDA (Srivastava and Sutton, 2017), GraphBTM (Zhu et al., 2018), ATM (Wang et al., 2019a), and W-LDA (Nan et al., 2019) using topic coherence measures (Röder et al., 2015). To quantify the understandability of the extracted topics, a topic coherence measure aggregates the relatedness scores of the topic words (top-weighted words) of each topic, where the word relatedness scores are estimated from word co-occurrence statistics on a large external corpus. For example, the NPMI coherence measure (Aletras and Stevenson, 2013) applies a sliding window of size 10 over the Wikipedia corpus to calculate NPMI (Bouma, 2009) for word pairs. We use three topic coherence measures in our experiments: C_A (Aletras and Stevenson, 2013), C_P (Röder et al., 2015), and NPMI. The topic coherence scores are calculated using Palmetto (Röder et al., 2015).

We use 2 graph convolution layers with output dimensions of 100 and K, respectively, in the encoder. The hidden size of the decoder is also set to 100. We use the RMSProp (Hinton et al., 2012) optimizer with a learning rate of 0.01 to train the model for 100 epochs. Since the training datasets scale up to 100K documents, i.e., 100K document nodes in the graph, batch training on a single GPU is difficult given the large memory requirements. We address this issue by mini-batching the datasets and feeding the model a subgraph consisting of 1,000 document nodes and all word nodes at each training step, which results in efficient training (the training time increases almost linearly with the number of documents) and makes it possible to apply our model to even bigger datasets.
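The mini-batching described above amounts to extracting the subgraph induced by a batch of document nodes plus all word nodes from the full adjacency matrix. A minimal NumPy sketch, assuming (as an illustrative convention) that word nodes occupy the first V indices:

```python
import numpy as np

def sample_subgraph(A, V, doc_ids):
    """Extract the subgraph induced by a mini-batch of document nodes
    plus ALL word nodes from the full (V + D) x (V + D) adjacency A.

    V: vocabulary size; word nodes occupy indices [0, V) and document
    d sits at index V + d. Returns the sub-adjacency and the kept ids.
    """
    keep = np.concatenate([np.arange(V), V + np.asarray(doc_ids)])
    return A[np.ix_(keep, keep)], keep
```

At each training step one would draw, say, 1,000 document ids at random and run the graph encoder on the returned sub-adjacency, so memory grows with the batch size rather than the corpus size.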
The topic coherence results on the three datasets are shown in Table 1, where each value is the average over 5 topic number settings: 20, 30, 50, 75, and 100. From Table 1, we observe that the proposed GTM is the best-performing model under all dataset/metric settings. W-LDA, ATM, LDA, and GraphBTM alternately achieve the second-best results, but they are always outperformed by our model. As described in Section 2, GTM is an extension of W-LDA, with the main difference that GTM models topics in a larger context and incorporates more global information with the graph encoder. The improvements of GTM over W-LDA therefore indicate the effectiveness of such information for topic modeling. We only evaluated GraphBTM on 20Newsgroups because only 20Newsgroups preserves the sequential information that GraphBTM needs to build its graphs. GraphBTM performs well on the C_A metric, which is reasonable since C_A is a coherence measure based on a small sliding window of size 5 and consequently favors models that concentrate on a smaller context, like GraphBTM. However, GraphBTM fails to achieve high C_P or NPMI scores, which use bigger windows (70 and 10, respectively).
To explore how the topic coherence results vary with the topic number, we examine the scores under the individual topic number settings. It can be observed in Figure 2 that GTM enjoys the best overall performance, achieving the highest scores in most settings. LDA has a slightly higher NPMI score on the 20Newsgroups dataset with 75 and 100 topics; nevertheless, GTM outperforms all baseline models by a relatively large margin in the other 20Newsgroups settings. NVDM is clearly the worst-performing model, while the performance of the models other than GTM and NVDM is less consistent. Notably, W-LDA, GraphBTM, and LDA obtain the second-best overall C_P, C_A, and NPMI scores, respectively. Another observation from Figure 2 is that GTM performs better with smaller topic numbers, probably because topics become more discriminative against each other when the topic number is small.
To gain an intuitive impression of the discovered topics, we present in Table 2 the 4 topics corresponding to 4 of the 20 ground-truth categories of 20Newsgroups. It can be observed that the topics discovered by GTM are more coherent and interpretable, containing few off-topic words. In comparison, GraphBTM's rec.autos topic mixes up automobiles and criminals, W-LDA's misc.forsale topic is difficult to identify due to too many off-topic words, and LDA cannot distinguish between rec.autos and misc.forsale and thus recognizes them as the same topic. Examining the topic words of overlapping topics, e.g., rec.autos and misc.forsale, also shows that GTM learns more discriminative topics.

Conclusion
We have introduced Graph Topic Model, a neural topic model that incorporates corpus-level neighboring context using graph convolutions to enrich document representations and facilitate topic inference. Both quantitative and qualitative results are presented in the experiments to demonstrate the effectiveness of the proposed approach. In the future, we would like to extend GTM to corpora with explicit doc-doc interactions, e.g., scientific documents with citations or social media posts with user relationships. Replacing the GCN in GTM with more advanced graph neural networks is another promising research direction.

Table 2: Discovered topics that are most similar to 4 ground-truth categories (sci.med, sci.space, rec.autos, misc.forsale) on 20Newsgroups with topic number 50. Italics are manually labeled off-topic words.

GTM
sci.med: treatment medical disease md health hospital investigation
sci.space: satellite mission space launch lunar spacecraft shuttle orbit nasa flight
rec.autos: car honda bmw engine ford saturn dealer turbo rear model
misc.forsale: ticket send mail price credit sale offer receive list customer

GraphBTM
sci.med: cancer hus md medical health disease patient mission laboratory culture
sci.space: probe mission spacecraft lunar shuttle orbit nasa solar satellite space
rec.autos: car bike cop road hit gas insurance fbi guy lot
misc.forsale: car buy mouse scsi engine card audio pc windows faster
village turkish armenia azerbaijan troops militia greek lebanon armenian greece

W-LDA
sci.med: msg food patient disease study science one treatment doctor scientific
sci.space: space launch nasa satellite ground mission shuttle use rocket orbit
rec.autos: car dog road ride speed light drive bike go front
misc.forsale: condition sale offer shipping sell excellent car speaker cd include

LDA
sci.med: use drug cause effect medical study disease patient doctor treatment
sci.space: space launch earth nasa mission system orbit satellite design moon
rec.autos: car buy price sale new engine offer model dealer
misc.forsale: car buy sell price sale new engine offer model dealer