Inductive Topic Variational Graph Auto-Encoder for Text Classification

Graph convolutional networks (GCNs) have recently been applied to text classification and achieved excellent performance. However, existing GCN-based methods do not assume an explicit latent semantic structure of documents, making the learned representations less effective and difficult to interpret. They are also transductive in nature, and thus cannot handle out-of-graph documents. To address these issues, we propose a novel model named inductive Topic Variational Graph Auto-Encoder (T-VGAE), which incorporates a topic model into the variational graph auto-encoder (VGAE) to capture the hidden semantic information between documents and words. T-VGAE inherits the interpretability of the topic model and the efficient information propagation mechanism of VGAE. It learns probabilistic representations of words and documents by jointly encoding and reconstructing the global word-level graph and bipartite document graphs, where each document is considered individually and decoupled from the global correlation graph so as to enable inductive learning. Our experiments on several benchmark datasets show that our method outperforms existing competitive models on supervised and semi-supervised text classification, as well as unsupervised text representation learning. In addition, it has higher interpretability and is able to deal with unseen documents.


Introduction
Recently, graph convolutional networks (GCNs) (Kipf and Welling, 2017; Veličković et al., 2018) have been successfully applied to text classification tasks (Peng et al., 2018a; Yao et al., 2019; Liu et al., 2020). In addition to the local information captured by CNNs or RNNs, GCNs learn word and document representations by taking into account the global correlation information embedded in a corpus-level graph, where words and documents are nodes connected by indexing or citation relations. However, hidden semantic structures, such as latent topics in documents (Blei et al., 2003; Yan et al., 2013; Peng et al., 2018b), are still ignored by most of these methods (Yao et al., 2019; Huang et al., 2019; Liu et al., 2020), even though such structures can improve text representations and provide extra interpretability: the probabilistic generative process and the topics make more sense to humans than opaque neural networks, e.g. a topic can be visually represented by its 10 or 20 most probable words. Although a few studies have proposed incorporating a topic structure into GCNs, the topics are extracted in advance from the set of documents, independently from the graph and the information propagation among documents and words. We believe that the topics should be determined in accordance with the connections in the graph. For example, the fact that two words are connected provides extra evidence that these words are about similar topic(s). Moreover, existing GCN-based methods are limited by their transductive learning nature, i.e. a document can be classified only if it was already seen in the training phase (Yao et al., 2019; Liu et al., 2020). The lack of inductive learning ability for unseen documents is a critical issue in practical text classification applications, where we constantly have to deal with new documents.

Table 1: Comparison with related work. We compare the manner of model learning, whether the latent topic structure is incorporated, and the manner of topic learning of these models.
To address these issues, we incorporate a topic model into the variational graph auto-encoder (VGAE), and propose a novel framework named inductive Topic Variational Graph Auto-Encoder (T-VGAE). T-VGAE first learns to represent words in a latent topic space by embedding and reconstructing the word correlation graph with a GCN probabilistic encoder and a probabilistic decoder. Taking the learned word representations as input, a GCN-based message passing probabilistic encoder then generates document representations via information propagation between words and documents in the bipartite graph. We compare our model with existing related work in Table 1. Different from previous approaches, our method unifies topic mining and graph embedding learning with VGAE, and can thus fully embed the relations between documents and words into dynamic topics and inject interpretable topic structures into the representations. Besides, our model builds a document-independent word correlation graph and a word-document bipartite graph for each document instead of a corpus-level graph, to enable inductive learning.
The main contributions of our work are as follows: 1. We propose a novel model, T-VGAE, based on topic models and VGAE, which incorporates latent topic structures for inductive document and word representation learning. This makes the model more effective and interpretable.
2. We propose to utilize the auto-encoding variational Bayes (AEVB) method to make efficient black-box inference of our model.
3. Experimental results on benchmark datasets demonstrate that our method outperforms the existing competitive GCN-based methods on supervised and semi-supervised text classification tasks. It also outperforms topic models on unsupervised text representation learning.
Related Work

Graph based Text Classification
Recently, GCNs have been applied to various NLP tasks (Vashishth et al., 2019). For example, TextGCN (Yao et al., 2019) was proposed for text classification; it enriches a corpus-level graph with global semantic information to learn word and document embeddings. Inspired by it, Liu et al. (2020) further considered syntactic and sequential contextual information and proposed TensorGCN. However, none of these methods utilizes the latent semantic structures in documents to enhance text classification. To address this issue, dynamic HTG (DHTG) was proposed in an attempt to integrate the topic model into graph construction. DHTG learns latent topics from document-word correlation information (similar to traditional topic models), which are then used for GCN-based document embedding. However, the topics in DHTG are learned independently from the word relation graph and the information propagation process in the graph, so word relations are ignored. Moreover, the existing GCN-based methods also require a pre-defined graph containing all the documents and cannot handle out-of-graph documents, thus limiting their practical applicability.
To deal with the inductive learning problem, several works (Huang et al., 2019; Ding et al., 2020) proposed to consider each document as an independent graph for text classification. However, the latent semantic structure and interpretability are still ignored by these methods. Different from previous approaches, we aim to deal with both issues, dynamic topic structure and inductive learning: we propose to combine the topic model and graph-based information propagation in a unified framework with VGAE, to learn interpretable representations for words and documents.

Graph Enhanced Topic Models
There are also studies that try to enhance topic models with the efficient message passing of GCNs over graph-structured data. GraphBTM (Zhu et al., 2018) enriches the biterm topic model (BTM) with a word co-occurrence graph encoded with GCNs. To deal with data streams, Van Linh et al. (2020) proposed the graph convolutional topic model (GCTM), which introduces a knowledge graph modeled with GCNs into the topic model. Yang et al. (2020) presented the Graph Attention TOpic Network (GATON) for correlated topic modeling; it tackles the overfitting issue in topic modeling with a generative stochastic block model (SBM) and GCNs. In contrast with these studies, we focus on integrating the topic model into a GCN-based VGAE for supervised learning tasks, and derive word-topic and document-topic distributions simultaneously.

Variational Graph Auto-encoders
Variational graph auto-encoders (VGAEs) have been widely used in graph representation learning and graph generation. The earliest study (Kipf and Welling, 2016) proposed the VGAE method, which extends the variational auto-encoder (VAE) to graph-structured data for learning graph embeddings. Based on VGAE, Pan et al. (2018) introduced adversarial training to regularize the latent variables and proposed the adversarially regularized variational graph auto-encoder (ARVGA). Hasanzadeh et al. (2019) incorporated a semi-implicit hierarchical variational distribution into VGAE (SIG-VAE) to improve the representation power of node embeddings. Grover et al. (2019) proposed the Graphite model, which integrates an iterative graph refinement strategy into VGAE, inspired by low-rank approximations. To the best of our knowledge, our model is the first effort to apply VGAE to unify topic learning and graph embedding for text classification, and it can thus provide better interpretability and overall performance.

Graph Construction
Formally, we denote a corpus as C, which contains D documents and their ground-truth labels Y, each label belonging to c = {1, ..., M}, where M is the total number of classes in the corpus. Each document t ∈ C is represented by a sequence of words t = {w_1, ..., w_{n_t}} (w_i ∈ v), where n_t is the number of words in document t and v is the vocabulary, of size V.
From the whole corpus, we build a word correlation graph G = (v, e) containing word nodes v and edges e, to capture word co-occurrence information. Similar to previous work (Yao et al., 2019), we utilize positive pointwise mutual information (PPMI) to calculate the correlation between two word nodes. Formally, for two words (w_i, w_j), we have

PPMI(w_i, w_j) = max( log [ p(w_i, w_j) / (p(w_i) p(w_j)) ], 0 ),

where p(w_i, w_j) is the probability that (w_i, w_j) co-occur in a sliding window, and p(w_i), p(w_j) are the probabilities of words w_i and w_j appearing in a sliding window. They can be empirically estimated as

p(w_i, w_j) = n(w_i, w_j) / n,   p(w_i) = n(w_i) / n,

where n(w_i, w_j) is the number of co-occurrences of (w_i, w_j) in the sliding windows, n(w_i) is the number of occurrences of w_i in the sliding windows, and n is the total number of sliding windows. For two word nodes (w_i, w_j), the weight of the edge between them is then defined as A^v_{ij} = PPMI(w_i, w_j), where A^v ∈ R^{V×V} is the adjacency matrix representing the word correlation graph structure G.

Different from the existing studies (Yao et al., 2019; Liu et al., 2020) that consider all documents and words in one heterogeneous graph, we propose to build a separate graph for each document to enable inductive learning. Specifically, documents are represented by a document-word matrix A^d ∈ R^{D×V}, where row A^d_i represents document i and entry x_{ij} is the TF-IDF weight of word j in document i. The decoupling of documents from a global pre-defined graph enables our method to handle new documents.
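As a concrete illustration, the empirical PPMI estimation above can be sketched as follows (a minimal Python sketch assuming pre-tokenized documents; the function and variable names are our own, not the paper's):

```python
import math
from collections import Counter
from itertools import combinations

def ppmi_weights(docs, window_size=20):
    """Estimate PPMI edge weights from sliding windows over tokenized docs."""
    n_windows = 0
    word_count = Counter()   # n(w_i): windows containing w_i
    pair_count = Counter()   # n(w_i, w_j): windows containing both words
    for doc in docs:
        for start in range(max(1, len(doc) - window_size + 1)):
            window = set(doc[start:start + window_size])
            n_windows += 1
            word_count.update(window)
            pair_count.update(combinations(sorted(window), 2))
    weights = {}
    for (wi, wj), n_ij in pair_count.items():
        # PMI = log( p(wi, wj) / (p(wi) p(wj)) ) with empirical estimates
        pmi = math.log(n_ij * n_windows / (word_count[wi] * word_count[wj]))
        if pmi > 0:          # keep only positive PMI (PPMI)
            weights[(wi, wj)] = pmi
    return weights
```

The resulting dictionary holds the non-zero entries of the sparse adjacency matrix A^v; documents themselves are encoded separately as TF-IDF rows of A^d.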

Topic Variational Graph Auto-encoder
Based on A^v and A^d, we propose the T-VGAE model, shown in Figure 1. It is a deep generative model with structured latent variables based on GCNs.

Generative Modeling
Figure 1: The architecture of T-VGAE. For a new test document i, its latent representation z^d_i is generated by the UDMP probabilistic encoder from its document-word vector A^d_i and the learned word-topic matrix z^v; z^d_i is then fed into the trained MLP classifier f_y to predict the output label. New test documents therefore do not need to be included in the training process, which enables the inductive learning of our model.

We consider that the word co-occurrence graph A^v and the bipartite graph A^d_t of each document t are generated from a random process with two latent variables z^v ∈ R^{V×K} and z^d_t ∈ R^{1×K}, where K denotes the number of latent topics. The generative process for A^v, A^d and Y is as follows (see Figure 2):

1. Draw the word-topic variable z^v ~ p(z^v).
2. Draw the word graph A^v ~ p_θ(A^v | z^v).
3. For each document t in corpus C: draw z^d_t ~ p(z^d_t), the document-word vector A^d_t ~ p_θ(A^d_t | z^v, z^d_t), and the label y_t ~ p_θ(y_t | z^d_t).

Here θ is the set of parameters of all the distributions. We consider centered isotropic multivariate Gaussian priors p(z^v) = N(0, I) and p(z^d_t) = N(0, I); notice that these priors are parameter-free in this case. According to the above generative process, we learn the parameters θ and the latent variables by maximizing the marginal likelihood log p_θ(A^v, A^d, Y) of the observed graphs and labels. Because exact inference of the true posteriors of the latent variables z^v and z^d is intractable, we introduce the variational posterior distributions q_φ(z^v | A^v, X^v) and q_φ(z^d | A^d, z^v), where X^v ∈ R^{V×M} contains the feature vectors of the words and M is the dimension of the feature vectors (see Figure 2(b)). This yields the following tractable stochastic evidence lower bound (ELBO):

L = E_q[log p_θ(A^v | z^v)] + E_q[log p_θ(A^d | z^v, z^d)] + E_q[log p_θ(Y | z^d)] − KL(q_φ(z^v | A^v, X^v) || p(z^v)) − KL(q_φ(z^d | A^d, z^v) || p(z^d)),

where the first three terms are reconstruction terms and the last two are Kullback-Leibler (KL) divergences between the variational posteriors and the priors. Using the auto-encoding variational Bayes (AEVB) approach (Kingma and Welling, 2013), we parametrize the variational posteriors q_φ and the generative distributions p_θ with GCN-based probabilistic encoders and decoders, and conduct neural variational inference (NVI).

Graph Convolutional Probabilistic Encoder
For the latent variable z^v, we make the mean-field approximation q_φ(z^v) = ∏_{i=1}^{V} q_φ(z^v_i). To simplify model inference, we consider a multivariate normal variational posterior with a diagonal covariance matrix, as in previous neural topic models (Miao et al., 2016; Bai et al., 2018): q_φ(z^v_i) = N(z^v_i | µ^v_i, diag((σ^v_i)^2)), where µ^v_i and (σ^v_i)^2 are the mean and diagonal covariance of the multivariate Gaussian distribution.
We use a graph convolutional neural network to parametrize the above posterior and infer z^v from the input graph A^v and feature vectors X^v:

(H^v)^{l+1} = ReLU(Ã^v (H^v)^l (W^v)^l),   µ^v = Ã^v (H^v)^l W^v_µ,   log σ^v = Ã^v (H^v)^l W^v_σ,

where µ^v, σ^v are the matrices of the µ^v_i, σ^v_i, l is the number of GCN layers (we use one layer in our experiments), Ã^v = (D^v)^{-1/2} A^v (D^v)^{-1/2} is the symmetrically normalized adjacency matrix of the word graph, and D^v denotes the corresponding degree matrix. The input features X^v are initialized to the identity matrix I, i.e., (H^v)^0 = X^v = I, as in (Yao et al., 2019). Then z^v can be sampled with the reparameterization trick (Kingma and Welling, 2013):

z^v = µ^v + σ^v ⊙ ε,

where ⊙ is the element-wise product and ε ~ N(0, I) is a noise variable. Through the message propagation of the GCN layer, words that co-occur frequently tend to obtain similar representations in the latent topic space.
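To make the encoder concrete, here is a minimal NumPy sketch of a one-layer GCN encoder with the reparameterization trick. This is an illustrative sketch, not the paper's implementation; in particular, the self-loops added during normalization and all names are our assumptions:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (self-loops assumed)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encode(A, X, W_mu, W_logsig, rng):
    """One-layer GCN encoder: returns a sample of z plus (mu, log_sigma)."""
    A_tilde = normalize_adj(A)
    mu = A_tilde @ X @ W_mu             # mean of the Gaussian posterior
    log_sigma = A_tilde @ X @ W_logsig  # log std of the Gaussian posterior
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_sigma) * eps    # reparameterization trick
    return z, mu, log_sigma
```

With (H^v)^0 = X^v = I as in the paper, X is simply the V×V identity matrix.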
Similar to z^v, we have q_φ(z^d_t) = N(z^d_t | µ^d_t, diag((σ^d_t)^2)), where µ^d_t and (σ^d_t)^2 are the mean and diagonal covariance of the multivariate Gaussian distribution. Although there are two types of nodes, word and document, in the bipartite graph A^d, in this step we mainly focus on learning representations of document nodes based on the representations of word nodes learned from A^v. Therefore, we propose a unidirectional message passing (UDMP) process on A^d, which propagates information from word nodes to documents:

UDMP(A^d, z^v) = ρ(Ã^d z^v),

where Ã^d is the normalized document-word matrix and ρ is a nonlinear activation function. We then parametrize the posterior and infer z^d based on UDMP:

µ^d = UDMP(A^d, z^v) W^d_µ,   log σ^d = UDMP(A^d, z^v) W^d_σ,

where µ^d, σ^d are the matrices of the µ^d_t, σ^d_t, and W^d_µ, W^d_σ are weight matrices. Similarly, we sample z^d = µ^d + σ^d ⊙ ε, where ε ~ N(0, I) is a noise variable. Through the propagation mechanism of UDMP, documents that share similar words tend to obtain similar representations in the latent topic space.
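A minimal sketch of the UDMP step and the document-level sampling might look like this (the row normalization of A^d and all names are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def udmp(A_d, Z_v):
    """Propagate word-topic vectors to documents (word -> document only)."""
    row_sum = A_d.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0.0] = 1.0        # guard against empty documents
    return (A_d / row_sum) @ Z_v         # (D x V) @ (V x K) -> (D x K)

def encode_documents(A_d, Z_v, W_mu, W_logsig, rng):
    """Parametrize the document posterior from the UDMP output and sample z^d."""
    h = udmp(A_d, Z_v)
    mu, log_sigma = h @ W_mu, h @ W_logsig
    z_d = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)
    return z_d, mu
```

Because the message passing only reads from the word representations, a new document row A^d_i can be encoded at test time without retraining, which is what enables inductive classification.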
Although T-VGAE can learn topics z v and document-topic representations z d as in traditional topic models, we do not focus on proposing a novel topic model, but aim to combine the topic model with VGAE, to improve word and document representations with latent topic semantic and provide probabilistic interpretability. Moreover, rather than learning topics and document-topic representations from the document-word feature A d as LDA topic models (Blei et al., 2003), we propose to learn word-topic representations z v from word cooccurrence matrix A v , and then infer documenttopic representations z d based on the documentword feature A d and word-topic representations z v , which is similar to the Biterm topic model (Yan et al., 2013).

Probabilistic Decoder
With the learned z^v and z^d, the observed graphs A^v and A^d can ideally be reconstructed through a decoding process. For A^v, we assume p_θ(A^v | z^v) follows a multivariate Gaussian distribution whose mean parameters are generated from the inner product of the latent variable z^v:

Â^v = ρ(z^v (z^v)^T),

where ρ is a nonlinear activation function. Similarly, the inner product between z^v and z^d is used to generate A^d, which is sampled from a multivariate Gaussian distribution:

Â^d = ρ(z^d (z^v)^T).

For the categorical labels Y, we assume p_θ(Y | z^d) follows a multinomial distribution p_θ(Y | z^d) = Mul(Y | f_y(z^d)), whose label probability vectors are generated from z^d by the multi-layer neural network f_y. For each document t, the prediction is given by ŷ_t = argmax_{y ∈ c} p_θ(y | f_y(z^d_t)).
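The inner-product decoders can be sketched as follows (using a sigmoid for the activation ρ, which is our assumption; the paper leaves ρ generic):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_graphs(z_v, z_d):
    """Reconstruct the word graph and the document-word graph from the
    latent topic representations via inner products."""
    A_v_hat = sigmoid(z_v @ z_v.T)   # (V x V) word-word reconstruction
    A_d_hat = sigmoid(z_d @ z_v.T)   # (D x V) document-word reconstruction
    return A_v_hat, A_d_hat
```

Because both decoders share z^v, gradients from reconstructing A^d also refine the word-topic representations learned from A^v.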

Optimization
We can combine the terms above to yield the final variational objective function

L(θ, φ) = E_q[log p_θ(A^v | z^v)] + E_q[log p_θ(A^d | z^v, z^d)] + E_q[log p_θ(Y | z^d)] − KL(q_φ(z^v | A^v, X^v) || p(z^v)) − KL(q_φ(z^d | A^d, z^v) || p(z^d)),

where the expectation terms are the reconstruction terms and the KL terms regularize the variational posteriors towards the priors. By maximizing this objective with stochastic gradient descent, we jointly learn the latent word and document representations, which can efficiently reconstruct the observed graphs and predict the ground-truth labels.
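Since both priors are standard normals, the KL terms in the objective have the closed form familiar from AEVB (Kingma and Welling, 2013). A small sketch (names our own):

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over all entries.
    Closed form: 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )."""
    return 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu ** 2 - 1.0 - 2.0 * log_sigma)
```

This term is computed once for (µ^v, σ^v) and once for (µ^d, σ^d), and subtracted from the reconstruction terms before the gradient step.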

Experiment
In this section, to evaluate the effectiveness of our proposed T-VGAE, we conduct experiments on both supervised and semi-supervised text classification tasks, as well as unsupervised topic modeling.

Figure 3: Test accuracy of our model under different topic sizes K.

Datasets
We conduct experiments on five commonly used text classification datasets: 20NewsGroups, Ohsumed, R52, R8, and MR. We use the same data preprocessing as in (Yao et al., 2019). An overview of the five datasets is given in Table 2.

Baselines
We compare our method with the following two categories of baselines:

Table 3: Micro precision, recall and F1-score on the document classification task. We report mean ± standard deviation averaged over 10 runs, following previous methods (Yao et al., 2019).

Settings
Following (Yao et al., 2019), we set the hidden size K of the latent variables and of the other neural network layers to 200, and set the window size for PPMI to 20. Dropout is applied only in the classifier, with a rate of 0.85. We train our model for a maximum of 1000 epochs with Adam (Kingma and Ba, 2015) and a learning rate of 0.05. 10% of the dataset is randomly sampled and set aside as the validation set for model selection. The parameter settings of all baselines are the same as in their original papers or implementations.

Supervised Classification
We present the test performance of the models on the five datasets in Table 3. Our model consistently outperforms all the baselines on each dataset, which proves the effectiveness of the proposed method. Compared with TextGCN, our method yields better performance on all datasets, which demonstrates the importance of integrating latent semantic structures in text classification; this is also evidenced by the superior performance of DHTG over TextGCN. However, DHTG only learns from document-word correlations, while our method fully exploits both word-word and document-word correlation information, resulting in a significant improvement over DHTG. This proves the effectiveness of unifying topic modeling and graph representation learning for text classification. Moreover, no test documents are involved during the training of our method, which shows its inductive learning ability, unlike TextGCN and DHTG, which require a global graph including all documents and words.

Effects of Correlation Information of Different Order
In Table 4, we further present the test accuracy of our method using different numbers of GCN encoder layers, to demonstrate the impact of different orders of word-word correlation information in A^v. On the R52 and R8 datasets, our method achieves the best performance with one layer. This is different from TextGCN and DHTG, which generally perform best with two GCN layers. A possible reason is that our model already considers one-hop document-word relation information when encoding the document-word graph A^d. If the layer number is set to 1 when encoding A^v, the model actually integrates two-hop neighborhood information, thus achieving a similar effect to TextGCN and DHTG.

Figure 3 shows the changes of the test accuracy with different numbers of topics on the five datasets. The test accuracy generally improves as the number of topics increases and peaks when the topic number is around 200. The number of topics has more impact on the Ohsumed dataset than on the other four datasets. This does not seem to be related to the number of classes in the dataset; we suspect it has to do with the nature of the text (medical domain vs. other domains).

Semi-Supervised Classification
In Figure 4, we present the performance of the models under the semi-supervised setting. We further evaluate the performance of the models on unsupervised topic modeling. We generally assume that the more coherent the topics are, the more interpretable they are. Following (Srivastava and Sutton, 2017), we use the average pairwise PMI of the top-10 words of each topic and the perplexity computed with the ELBO as quality measures of the topics. Table 6 shows these measures under different topic numbers on the 20NG dataset. We remove the supervised loss of our method for this task; the result of GraphBTM is not presented since it is unable to learn a topic representation for each document.

Document Topic Modelling
From the table, we can see that our model outperforms the others in terms of topic coherence, which can be attributed to the combination of the word co-occurrence graph and message passing in the GCN. The message passing leads to similar latent-topic-space representations for words that co-occur frequently, thus improving the semantic coherence of the learned topics; as shown in Table 5, related words tend to belong to the same topic. Our method also benefits from document-word correlations, and yields better performance than GraphBTM, which encodes the biterm graph via GCNs.

Document Representations
We use t-SNE to visualize the latent representations of the 20NG test documents learned by our model, DHTG and TextGCN in Figure 5, in which each dot represents a document and each color a category. Our method yields the best clustering results, which means the learned topics are more consistent with the pre-defined classes. This shows the superior interpretability of our method, which models the latent topics from both the word co-occurrence graph and the document-word graph, compared with DHTG.

Conclusion
In this paper, we proposed a novel deep latent variable model, T-VGAE, which combines the topic model with VGAE. It learns more interpretable representations and leverages latent topic semantics to improve classification performance. T-VGAE inherits the advantages of both the topic model and VGAE: probabilistic interpretability and an efficient information propagation mechanism. Experimental results demonstrate the effectiveness of our method, along with its inductive learning ability. As future work, it would be interesting to explore better-suited prior distributions in the generative process. It is also possible to extend our model to other tasks, such as recommendation and link prediction.