Autoencoding Keyword Correlation Graph for Document Clustering

Document clustering requires a deep understanding of the complex structure of long text; in particular, the intra-sentential (local) and inter-sentential (global) features. Existing representation learning models do not fully capture these features. To address this, we present a novel graph-based representation for document clustering that builds a graph autoencoder (GAE) on a Keyword Correlation Graph. The graph is constructed with topical keywords as nodes and multiple local and global features as edges. A GAE is employed to aggregate the two sets of features by learning a latent representation which can jointly reconstruct them. Clustering is then performed on the learned representations, using vector dimensions as features for inducing document classes. Extensive experiments on two datasets show that the features learned by our approach can achieve better clustering performance than other existing features, including term frequency-inverse document frequency and average embedding.


Introduction
Text classification is a core task in natural language processing (NLP) with a variety of applications, such as news topic labeling and opinion mining. Supervised methods for text classification generally perform better than unsupervised clustering methods, at the cost of heavy annotation efforts. In contrast, unsupervised clustering methods have the advantage of requiring less prior knowledge and can be used to discover new classes when relevant training data is not available.
The performance of text clustering is closely related to the quality of its feature representation. While sentence-level clustering relies primarily on the local, intra-sentential features, document-level clustering also needs the global, inter-sentential features. Existing representation learning methods that model text as a bag-of-words (e.g., term frequency-inverse document frequency, TFIDF) or as sequences of variable-length units (e.g., Bidirectional Encoder Representations from Transformers, BERT) (Devlin et al., 2019) are ineffective in capturing global features across long sequences, and suffer from heavy computational cost as a result of high dimensionality and complex neural network architectures, as reported by Ye et al. (2017) and Jawahar et al. (2019).
Recently, graph neural networks have been used to provide features for NLP applications, including text classification (Yao et al., 2019) and relation extraction (Sahu et al., 2019). By modeling text in a topological structure, these models can encode global information in long-range words. Despite their usefulness, graph models remain underexplored in document clustering.
In this work, we propose a novel graph-based representation for document clustering by utilizing a graph autoencoder (GAE) (Kipf and Welling, 2016) on a Keyword Correlation Graph (KCG). Our KCG represents a document as a weighted graph of topical keywords. Each graph node is a keyword, and sentences in the document are attached to the nodes they are related to. The edges between nodes indicate their correlation strength, which is determined by comparing their corresponding sets of sentences. The node and edge features in the KCG are encoded using a GAE, and the encoded features are used to infer document classes.
Our contribution is threefold. First, we propose a KCG, which can capture the complex relations among words and sentences in long text. Second, we propose a new graph-based representation for document clustering. To the best of our knowledge, this is the first attempt to use GAEs to jointly learn local and global features for document clustering.
Last, an analysis of the individual model components indicates that our model can effectively encode both sets of features. This distinguishes our approach from existing sequence-level representations, which generally encode the former better than the latter.

Related Work
In the literature, three common neural methods, the convolutional neural network (CNN), the recurrent neural network (RNN) and the Transformer, have been proposed to model the sequence-level features between words. CNNs have been shown to be more effective at capturing features in short text (e.g., phrases) than in long sequences (Xu et al., 2015). In contrast, RNNs are suited to sequential input (Zhou et al., 2019), modelling the relations between the current word and all the previous words in the sequence as a whole. Unlike RNNs and CNNs, which model a text sequence either from left to right or in combined left-to-right and right-to-left directions, the Transformer is trained with a masked language model objective that predicts randomly masked words in consecutive sentence pairs. Nonetheless, these approaches only model the context of consecutive words/sentences, neglecting many global features that span non-consecutive text units across multiple sentences.
Several methods have been proposed to represent documents as graphs. These document graphs can be induced directly from the input document, using its words, sentences, paragraphs or even the document itself as nodes (Defferrard et al., 2016), and establishing edges according to distributional information such as word co-occurrence frequencies (Yao et al., 2019; Peng et al., 2018), text similarities (Putra and Tokunaga, 2017) and hyperlinks between documents (Page et al., 1999). Alternatively, document graphs can be constructed indirectly with the use of NLP pipelines and knowledge bases such as WordNet (Miller, 1995) for identifying the entities in the document, as well as their syntactic and semantic relations (Sahu et al., 2019; Li et al., 2019). However, such approaches are limited to resource-rich languages.

Methodology
We describe our model architecture in Figure 1. It includes three steps. Given a document, the model first constructs a KCG with keywords as nodes and edges corresponding to their local and global features. Next, it uses a GAE to encode the two feature sets by jointly reconstructing them. Finally, clustering is performed on the encoded representations, using vector dimensions as features for inducing document classes.

KCG Construction
The KCG construction involves four steps. Given a document, KCG first uses Non-Negative Matrix Factorization (NMF) (Févotte and Idier, 2011; Cichocki and Phan, 2009) to extract the top-50 keywords of the document as nodes. 1 Second, each sentence in the document is mapped to the node it is most related to. 2 Thus, each node has its own sentence set. An example is shown in Figure 2. Third, we generate embeddings for each sentence in the set (referred to as sentence set embeddings henceforth); these serve as the node features. Last, edges between nodes are established by measuring the correlations between their corresponding sentence sets.
1 Earlier approaches used mature NLP pipelines (e.g., named entity recognizers) for keyword extraction (Li et al., 2019; Liu et al., 2019). Instead, we use unsupervised NMF for keyword extraction. We tested the top-10, 20, 50 and 100 keywords with both Latent Dirichlet Allocation (LDA) (Blei et al., 2003; Hoffman et al., 2010) and NMF, and found that using NMF to extract the top-50 keywords gives the best clustering result.
2 We map sentences to keywords based on the cosine similarity between their TFIDF features.

Node Feature: We represent each keyword node as the average of its sentence set embeddings. A range of word- and sentence-level embeddings, including Global Vectors (GloVe) (Pennington et al., 2014), BERT, Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) and Embeddings from Language Models (ELMo) (Peters et al., 2018), are tested (see Section 5.1).
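As a rough illustration of steps 1 and 2 (and of the node features just described), the sketch below extracts the top-50 keywords with NMF over TFIDF features and maps each sentence to its most similar keyword node. The number of NMF topics, the keyword-ranking heuristic and all function names are our assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

def build_kcg_nodes(sentences, n_keywords=50, n_topics=10):
    """Sketch of KCG steps 1-2: NMF keyword extraction and sentence-to-node mapping."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)            # sentences x vocabulary
    vocab = np.array(vectorizer.get_feature_names_out())

    # Step 1: score words by their total weight across the NMF topic components
    nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0).fit(tfidf)
    top = np.argsort(nmf.components_.sum(axis=0))[::-1][:n_keywords]
    keywords = vocab[top]

    # Step 2: map each sentence to the keyword node it is most related to,
    # using cosine similarity between TFIDF features (footnote 2)
    sims = cosine_similarity(tfidf, vectorizer.transform(keywords))
    node_sentences = {kw: [] for kw in keywords}
    for s_idx, k_idx in enumerate(sims.argmax(axis=1)):
        node_sentences[keywords[k_idx]].append(sentences[s_idx])

    # Node features: each node is then the mean of its sentence set embeddings
    return node_sentences
```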
Word co-occurrence edge: The distributional hypothesis suggests that similar (key)words appear in similar contexts (Firth, 1957). Thus, the co-occurrence rate between two keywords reveals helpful clues for their relatedness. For this, we connect two keywords by their co-occurrence frequencies in sentences.
Sentence similarity edge: To estimate the global correlation between two keywords, we calculate the mean pairwise (cosine) similarity between their sentence embedding sets. Two keywords will have a high edge weight if their sentence set embeddings are similar.
Sentence position edge: The position of a word in the document can be an indicator of its importance. For example, topical keywords and sentences tend to appear in the beginning of the text (Lin and Hovy, 1997). Hence, we connect two keywords by computing the average position of their sentence sets in text. If two keywords both appear early in text, they will have a high edge weight. Details are described in the Appendix.
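The sketch below computes the three edge types between every pair of keyword nodes, under our own assumptions about normalisation (co-occurrence counted over shared sentences, similarity as mean pairwise cosine, and position scaled so that earlier sentences give higher weight); it is not the authors' code.

```python
import numpy as np

def kcg_edges(keywords, node_sentences, node_embeddings, doc_sentences):
    """Sketch of the three KCG edge types. node_sentences[k] is the sentence set of
    keyword k; node_embeddings[k] holds the corresponding sentence embeddings."""
    n = len(keywords)
    position = {s: i / max(len(doc_sentences) - 1, 1)
                for i, s in enumerate(doc_sentences)}        # 0.0 = start of document
    cooc = np.zeros((n, n)); sim = np.zeros((n, n)); pos = np.zeros((n, n))

    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    for i in range(n):
        for j in range(i + 1, n):
            ki, kj = keywords[i], keywords[j]
            if not node_sentences[ki] or not node_sentences[kj]:
                continue
            # Word co-occurrence edge: number of sentences containing both keywords
            cooc[i, j] = sum(ki in s and kj in s for s in doc_sentences)
            # Sentence similarity edge: mean pairwise cosine of the two sentence sets
            sim[i, j] = np.mean([cos(a, b) for a in node_embeddings[ki]
                                           for b in node_embeddings[kj]])
            # Sentence position edge: high weight when both sentence sets appear early
            avg_i = np.mean([position[s] for s in node_sentences[ki]])
            avg_j = np.mean([position[s] for s in node_sentences[kj]])
            pos[i, j] = 1.0 - (avg_i + avg_j) / 2.0
    return cooc + cooc.T, sim + sim.T, pos + pos.T
```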

Graph Autoencoders (GAEs)
The KCG captures the local and global features in a document using text embeddings and adjacency edges. We then compute the representation of each document by applying a GAE to the KCG. The GAE is a graph-oriented version of the autoencoder, built on an encoder-decoder framework: for each node in the KCG, the encoder extracts the latent features from which the decoder can reconstruct the graph. This way, the GAE learns to encode global information about (keyword) nodes that are multiple hops away in the KCG. To capture the global features while preserving the local ones, we use a Multi-Task GAE (MTGAE), whose objective is to learn a latent representation that can jointly reconstruct both the input graph and the node features (Tran, 2018a,b). In Section 5.1, we compare the MTGAE's performance with the GAE, the Variational GAE (VGAE) and the standard autoencoder (AE) (Hinton and Salakhutdinov, 2006). The model settings are described in the Appendix.
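A minimal PyTorch sketch of a multi-task GAE in the spirit described here: a two-layer GCN encoder, an inner-product decoder for the adjacency matrix and a linear decoder for the node features. The layer sizes, the pre-normalised adjacency input and the feature-reconstruction loss (shown as MSE for real-valued node embeddings, whereas the paper uses cross-entropy) are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph-convolution step: a_hat @ X @ W, with a_hat a normalised adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, a_hat):
        return a_hat @ self.linear(x)

class MultiTaskGAE(nn.Module):
    """Sketch of an MTGAE: encode the KCG, then jointly reconstruct the
    adjacency matrix (global structure) and the node features (local content)."""
    def __init__(self, feat_dim, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.gc1 = GCNLayer(feat_dim, hidden_dim)
        self.gc2 = GCNLayer(hidden_dim, latent_dim)
        self.feat_decoder = nn.Linear(latent_dim, feat_dim)

    def forward(self, x, a_hat):
        z = self.gc2(F.relu(self.gc1(x, a_hat)), a_hat)    # latent node embeddings Z
        a_rec = torch.sigmoid(z @ z.t())                    # inner-product decoder
        x_rec = self.feat_decoder(z)                        # node-feature decoder
        return z, a_rec, x_rec

def mtgae_loss(a, x, a_rec, x_rec):
    # Joint reconstruction objective; `a` is assumed to be scaled into [0, 1].
    return F.binary_cross_entropy(a_rec, a) + F.mse_loss(x_rec, x)
```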

Clustering Algorithm
After we encode the KCG features for each node, we employ global average pooling over the node sequence to get a fixed-length representation of the document. We then apply the Spectral Clustering algorithm on these representations to group documents into classes. Spectral Clustering has wide applications in similar NLP tasks that involve high-dimensional feature spaces (Xu et al., 2015; Belkin and Niyogi, 2002; Xie and Xing, 2013).
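A sketch of this final step, assuming each document yields a matrix of encoded node vectors; the affinity choice for Spectral Clustering is ours, not specified by the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_documents(doc_node_embeddings, n_clusters):
    """Global average pooling over each document's node embeddings,
    then Spectral Clustering on the pooled document vectors."""
    doc_vectors = np.stack([nodes.mean(axis=0) for nodes in doc_node_embeddings])
    sc = SpectralClustering(n_clusters=n_clusters, affinity="nearest_neighbors",
                            random_state=0)
    return sc.fit_predict(doc_vectors)
```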

Baseline Models
We compare our model with multiple cutting-edge text clustering and representation models, as reported by Ye et al. (2017) and Xie and Xing (2013).
In addition to the aforementioned models, we also generate document embeddings using GloVe, BERT, ELMo and SBERT. Here, a document is represented as the average of the word/sentence embeddings in that document (AvgEmb).
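For reference, the AvgEmb baseline reduces to mean pooling; the `embed` lookup below is a hypothetical placeholder for GloVe/BERT/ELMo/SBERT.

```python
import numpy as np

def avg_emb(doc_units, embed):
    """AvgEmb baseline: a document is the mean of its word/sentence embeddings.
    `embed` is a hypothetical lookup returning one vector per word or sentence."""
    return np.mean([embed(u) for u in doc_units], axis=0)
```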

Model Training
For embeddings, we use GloVe-300d, BERT-base-uncased, ELMo-original and SBERT-bert-large-nli-stsb-mean-tokens in our experiments. In all AEs, the ReLU activation function is employed in all layers. Parameters of all the models are optimized using the Adam optimisation algorithm with an initial learning rate of 0.01 (Kingma and Ba, 2014). We use early stopping with a patience of 10 epochs to determine the best training epoch. Unless specified otherwise, other hyper-parameters are kept at the defaults provided in their corresponding studies. The hyper-parameter values are shown in Table 2.

In Table 3, we show the results of our main model (SS-SB-MT), which is built from the Sentence Similarity edge, SBERT node features and the MTGAE. From Table 3, our model is notably better than the baseline models, which showcases the effectiveness of topological features on long-text datasets. The main reasons our model performs well are twofold: first, the KCG can capture both the local and global features, using text embeddings and adjacency edges respectively; second, the MTGAE is able to aggregate the two sets of features by jointly reconstructing them. To better analyze the behaviour of our model, we experiment with different edges, node features and autoencoders individually, varying one variable at a time while keeping the others constant. We report the results in the next section.
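For concreteness, the optimisation setup described under Model Training (Adam with lr 0.01, ReLU activations, early stopping with patience 10) corresponds roughly to the following loop, written against the hypothetical MultiTaskGAE sketch from the Methodology section; monitoring the reconstruction loss itself for early stopping is our assumption.

```python
import torch

def train_gae(model, x, a_hat, a_target, loss_fn, max_epochs=200, patience=10):
    """Minimal full-graph training loop with Adam (lr=0.01) and early stopping."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    best_loss, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        optimizer.zero_grad()
        z, a_rec, x_rec = model(x, a_hat)
        loss = loss_fn(a_target, x, a_rec, x_rec)
        loss.backward()
        optimizer.step()
        if loss.item() < best_loss - 1e-4:
            best_loss, bad_epochs = loss.item(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # early stopping
                break
    return model
```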

Impact of Edge Types, Node Features and Autoencoders
We first analyze the performance of SS-SB-MT using different edge types, and report the results in Table 4 (upper rows). Here, we see that the sentence-level edges perform better than the word-level edge. One possible reason is that text embeddings (e.g., SBERT) have already encoded the local semantic relations between adjacent words and sentences; an additional word co-occurrence edge may thus be less helpful. We then analyze the performance of SS-SB-MT using different text embeddings to generate node features. From Table 4 (middle rows), we observe that the sentence-level embedding SBERT (i.e., SB in SS-SB-MT) consistently outperforms the word-level embeddings (GloVe, ELMo and BERT), suggesting that it can better represent the node features in the KCG.

Table 5: Example documents from comp.sys.mac.hardware with clustering outcomes. 1: text is correctly clustered; 0: text is wrongly clustered. "Ours" is our best proposed model (i.e., SS-SB-MT).
We additionally conduct an analysis of different autoencoders. Results are shown in Table 4 (bottom rows). While the graph-level autoencoders (GAE and VGAE) generally perform better than the sequence-level one (AE), the best results come when we use the MTGAE (i.e., MT in SS-SB-MT) to aggregate local and global features, indicating the important roles of both feature sets in document clustering.
Qualitative Analysis of Autoencoders. Table 5 showcases some prediction errors from AE and VGAE. All examples describe hardware issues specifically about Mac (i.e., comp.sys.mac.hardware). We find that VGAE performs better when the document class is determined by the entire document or by a long-range semantic relation that spans multiple sentences, rather than by some local relation between consecutive keywords. Example (1) contains both "hardware-related" phrases (e.g., Sony monitor) and "Mac-related" ones (e.g., Mac), but the whole document clearly refers to Mac if one explicitly considers the related context around the first and last sentences; thus, an architecture like VGAE is needed to fully utilize the semantic structures over long sequences. In contrast, AE has a competitive advantage over VGAE in modelling the local dependencies among consecutive words, as shown in example (2). Here, VGAE captures the semantic features of some key phrases such as drive logic and heads, and misclusters the example into another group that talks about general hardware issues, whereas AE can effectively model consecutive features and capture the information about Duo Powerbooks. Similar to the previous two examples, example (3) also has mixed keywords across different sentences, but neither the local nor the global features alone are informative enough to interpret the topic of the document: AE may capture some local key phrases such as scanner and PC, whereas VGAE may capture non-local relations like scanner from a PC and connecting the scanner to a Mac. A scenario of this nature highlights the need to aggregate the two feature sets and, in essence, for an effective model like our MTGAE that can exploit the synergy between them.

Conclusion
In this paper, we propose a document clustering model based on features induced unsupervisedly from a GAE and a KCG. Our model offers an elegant way to learn features directly from large corpora, bypassing the dependence on mature NLP pipelines. Thus, it is not limited to resource-rich languages and can be used by any application that operates on text. Experiments show that our model achieves better performance than sequence-level representations, and we conduct a series of analyses to further understand the reasons behind this performance gain.


Graph Autoencoder Details
During training, the GAE learns to minimize the reconstruction loss $\mathcal{L}_R$, measured by the cross-entropy between its input adjacency matrix $A$ and its reconstruction $\hat{A}$. At inference time, we use the latent representation $Z$ for document clustering and disregard the reconstructed part $\hat{A}$.
To encode more content information from the graph, one can reconstruct both the input adjacency matrix ($A$) and the feature matrix ($X$). Regarding this, Tran (2018a,b) proposed the Multi-Task GAE (MTGAE). Here, the MT-reconstruction loss is defined as:

$$\mathcal{L}_{MT} = \sum_{i=1}^{n} \left[ \mathcal{L}(a_i, \hat{a}_i) + \mathcal{L}(x_i, \hat{x}_i) \right],$$

where $\mathcal{L}(a_i, \hat{a}_i)$ and $\mathcal{L}(x_i, \hat{x}_i)$ are both the standard cross-entropy loss with sigmoid function $\sigma(\cdot)$:

$$\mathcal{L}(y, \hat{y}) = -\sum_{j} \left[ y_j \log \sigma(\hat{y}_j) + (1 - y_j) \log\left(1 - \sigma(\hat{y}_j)\right) \right].$$

The Variational Graph Autoencoder (VGAE) is an extension of the GAE architecture proposed by Kipf and Welling (2016). It extends the GAE by introducing an inference encoder, which is defined as:

$$q(Z \mid X, A) = \prod_{i=1}^{n} q(z_i \mid X, A), \qquad q(z_i \mid X, A) = \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right),$$

where $\mu = Z_2$ is the matrix of mean vectors $z_i$ and $\sigma = f_{\mathrm{linear}}(Z_1, A \mid W_1)$ is the covariance matrix. During training, the VGAE optimizes the variational lower bound:

$$\mathcal{L}_{VGAE} = \mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z)\right] - \mathrm{KL}\left(q(Z \mid X, A)\,\|\,p(Z)\right),$$

where $\mathrm{KL}(q(\cdot)\,\|\,p(\cdot))$ denotes the Kullback-Leibler divergence and $p(Z) = \prod_i \mathcal{N}(z_i \mid 0, I)$ is the Gaussian prior over the latent variables. We apply the reparameterization trick (Kingma and Welling, 2014) to train the variational model.
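A small sketch of the reparameterization trick and the KL term used when training the VGAE; the function names and signatures are illustrative, not the authors' code.

```python
import torch

def reparameterize(mu, logvar):
    """Reparameterization trick: sample z = mu + eps * sigma so that gradients
    flow through the mean and (log-)variance of the inference encoder."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def kl_divergence(mu, logvar):
    # KL(q(Z|X,A) || N(0, I)), summed over nodes and latent dimensions
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
```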

Adjusted Mutual Information (AMI) and Accuracy (ACC)
Here, we describe the details of AMI and ACC. AMI is formally defined as:

$$\mathrm{AMI}(U, C) = \frac{\mathrm{MI}(U, C) - \mathbb{E}\{\mathrm{MI}(U, C)\}}{\max\left(H(U), H(C)\right) - \mathbb{E}\{\mathrm{MI}(U, C)\}},$$

where $U$ and $C$ are the ground-truth and predicted classes respectively, $\mathrm{MI}$ and $H$ stand for mutual information and entropy respectively, and $\mathbb{E}\{\mathrm{MI}\}$ is the expected mutual information. ACC is formally defined as:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\left(y_i, \mathrm{map}(c_i)\right)}{n},$$

where $\delta(\cdot)$ is an indicator function, $c_i$ is the predicted label for $x_i$, $\mathrm{map}(\cdot)$ transforms the predicted label $c_i$ to its group label by the Hungarian algorithm (Papadimitriou and Steiglitz, 1982), and $y_i$ is the ground truth of $x_i$.
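A sketch of how the two metrics can be computed in practice, using scikit-learn for AMI and the Hungarian algorithm (SciPy's linear_sum_assignment) for the cluster-to-label mapping in ACC; zero-indexed integer labels are an assumption.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: find the best one-to-one mapping of predicted clusters to gold labels
    via the Hungarian algorithm, then compute plain accuracy under that mapping."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_classes = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)     # maximise matched counts
    mapping = dict(zip(rows, cols))
    return float(np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)]))

# AMI comes directly from scikit-learn:
# ami = adjusted_mutual_info_score(y_true, y_pred)
```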