Short Text Clustering via Convolutional Neural Networks

Short text clustering has become an increasingly important task with the popularity of social media, and it is a challenging problem due to the sparseness of short text representations. In this paper, we propose Short Text Clustering via Convolutional neural networks (abbr. STCC), which benefits clustering by imposing a constraint on the learned features through a self-taught learning framework, without using any external tags/labels. First, we embed the original keyword features into compact binary codes with a locality-preserving constraint. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, with the output units fitting the pre-trained binary codes during training. After obtaining the learned representations, we use K-means to cluster them. Our extensive experimental study on two public short text datasets shows that the deep feature representations learned by our approach achieve significantly better clustering performance than existing features such as term frequency-inverse document frequency, Laplacian eigenvectors and average embeddings.


Introduction
Different from normal text clustering, short text clustering suffers from sparsity (Aggarwal and Zhai, 2012). Most words occur only once in each short text; as a result, the term frequency-inverse document frequency (TF-IDF) measure does not work well in the short text setting. In order to address this problem, some researchers work on expanding and enriching the context of the data from Wikipedia (Banerjee et al., 2007) or an ontology (Fodeh et al., 2011). However, these methods require solid natural language processing (NLP) knowledge and still use high-dimensional representations, which may waste both memory and computation time. Another way to overcome these issues is to explore sophisticated models for clustering short texts. For example, Yin and Wang (2014) proposed a Dirichlet multinomial mixture model-based approach for short text clustering, and Cai et al. (2005) clustered texts using the Locality Preserving Indexing (LPI) algorithm. Yet how to design an effective model remains an open question, and most of these methods, trained directly on bag-of-words (BoW) features, are shallow structures that cannot preserve accurate semantic similarities.
With the recent revival of interest in Deep Neural Networks (DNN), many researchers have concentrated on using Deep Learning to learn features. Hinton and Salakhutdinov (2006) used a deep autoencoder (DAE) to learn text representations from raw text. Recently, with the help of word embeddings, neural networks have demonstrated great performance in constructing text representations, such as the Recursive Neural Network (RecNN) (Socher et al., 2011; Socher et al., 2013) and the Recurrent Neural Network (RNN) (Mikolov et al., 2011). However, RecNN exhibits high time complexity for constructing the textual tree, and RNN, which uses the layer computed at the last word to represent the text, is a biased model (Lai et al., 2015). More recently, the Convolutional Neural Network (CNN), which applies convolutional filters to capture local features, has achieved better performance in many NLP applications, such as sentence modeling (Blunsom et al., 2014), relation classification (Zeng et al., 2014), and other traditional NLP tasks (Collobert et al., 2011). Most previous work applies CNN to supervised NLP tasks, while in this paper we aim to explore the power of CNN on an unsupervised NLP task, short text clustering.
To address the above challenges, we systematically introduce a short text clustering method via convolutional neural networks. The overall architecture of the proposed method is illustrated in Figure 1.

[Figure 1: Architecture of the proposed short text clustering via convolutional neural networks]

Given a short text collection X, the goal of this work is to cluster these texts into clusters C based on the deep feature representation h learned from CNN models. In order to train the CNN models, inspired by Zhang et al. (2010), we utilize a self-taught learning framework. In particular, we first embed the original features into compact binary codes B with a locality-preserving constraint. Then word vectors S projected from word embeddings are fed into a CNN model to learn the feature representation h, and the output units are used to fit the pre-trained binary codes B. After obtaining the learned features, the traditional K-means algorithm is employed to cluster the texts into clusters C. The main contributions of this paper are summarized as follows:

1). To the best of our knowledge, this is the first attempt to explore the feasibility and effectiveness of combining CNN and a traditional semantic constraint, with the help of word embeddings, to solve an unsupervised learning task, short text clustering.
2). We learn deep feature representations with a locality-preserving constraint through a self-taught learning framework, and our approach does not use any external tags/labels or complicated NLP preprocessing.
3). We conduct experiments on two short text datasets. The experimental results demonstrate that the proposed method achieves excellent performance.

The remainder of this paper is organized as follows: In Section 2, we describe the proposed approach STCC and its implementation details. Experimental results and analyses are presented in Section 3. In Section 4, we briefly survey related work. Finally, conclusions are given in the last section.

Convolutional Neural Networks
In this section, we will briefly review one popular deep convolutional neural network, Dynamic Convolutional Neural Network (DCNN) (Blunsom et al., 2014), which is the foundation of our proposed method.
Taking a neural network with two convolutional layers in Figure 2 as an example, the network transforms raw input text to a powerful representation.
Let X = {x_i}, i = 1, ..., n denote the set of n input texts, where d is the dimensionality of the original keyword features. Each raw text vector x_i is projected into a matrix representation S ∈ R^(d_w×s) by looking up a word embedding E, where d_w is the dimension of the word embedding features and s is the length of the text. We also let W = {W_i}, i = 1, 2 and W^O denote the weights of the neural networks. The network defines a transformation f(·): R^(d×1) → R^(r×1) (d ≫ r) which transforms a raw input text x into an r-dimensional deep representation h. There are three basic operations, described as follows (a code sketch is given after the list):

- Wide one-dimensional convolution: This operation is applied to each individual row of the sentence matrix S ∈ R^(d_w×s), and yields a set of sequences c_i ∈ R^(s+m−1), where m is the width of the convolutional filter.
- Folding: In this operation, every two rows of a feature map are simply summed component-wise. For a map of d_w rows, folding returns a map of d_w/2 rows, thus halving the size of the representation.
- Dynamic k-max pooling: Given a fixed pooling parameter k_top for the topmost convolutional layer, the parameter k of k-max pooling in the l-th convolutional layer can be computed as follows:

k_l = max(k_top, ⌈((L − l) / L) · s⌉)    (1)

where L is the total number of convolutional layers in the network and s is the length of the input text.
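To make these operations concrete, the following NumPy sketch (our own illustration, not the authors' code; function names are ours) implements the dynamic pooling parameter of Eq. 1, folding, and k-max pooling:

```python
import numpy as np

def dynamic_k(l, L, k_top, s):
    """Pooling parameter k_l for the l-th of L convolutional layers (Eq. 1)."""
    return max(k_top, int(np.ceil((L - l) * s / L)))

def fold(feature_map):
    """Folding: sum every two adjacent rows component-wise, halving d_w."""
    return feature_map[0::2, :] + feature_map[1::2, :]

def k_max_pool(feature_map, k):
    """Keep the k largest values of each row, preserving their original order."""
    idx = np.sort(np.argsort(feature_map, axis=1)[:, -k:], axis=1)
    return np.take_along_axis(feature_map, idx, axis=1)
```

For example, in a two-layer network (L = 2) with k_top = 5, a text of length s = 20 gives k_1 = max(5, ⌈(1/2) · 20⌉) = 10 at the first layer.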

Locality-preserving Constraint
Here, we first pre-train the binary codes B based on the keyword features with a locality-preserving constraint, choosing the Laplacian affinity loss also used in previous works (Weiss et al., 2009; Zhang et al., 2010). The optimization can be written as:

min_B Σ_{i,j} S_ij ∥b_i − b_j∥²_F,  s.t. B ∈ {−1, 1}^(n×q)    (2)

where S_ij is the pairwise similarity between texts x_i and x_j, b_i denotes the i-th row of B, and ∥·∥_F is the Frobenius norm. The problem is relaxed by discarding the constraint B ∈ {−1, 1}^(n×q), and the q-dimensional real-valued vectors B̃ can be learned from the Laplacian Eigenmap. Then, we obtain the binary codes B by thresholding B̃ at the median vector median(B̃). In particular, we construct the n × n local similarity matrix S by using the heat kernel as follows:

S_ij = exp(−∥x_i − x_j∥² / (2σ²)) if x_i ∈ N_k(x_j) or x_j ∈ N_k(x_i), and S_ij = 0 otherwise    (3)

where σ is a tuning parameter (default 1) and N_k(x) denotes the set of k-nearest neighbors of x.
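A minimal sketch of this pre-training step follows (our own illustration under Eqs. 2 and 3; for brevity it uses the unnormalized graph Laplacian, whereas Laplacian Eigenmaps is often solved as a generalized eigenproblem):

```python
import numpy as np
from scipy.spatial.distance import cdist

def binary_codes(X, q, k=15, sigma=1.0):
    """Pre-train q-dimensional binary codes B from keyword features X (n x d)."""
    n = X.shape[0]
    D2 = cdist(X, X, 'sqeuclidean')
    # symmetrized k-nearest-neighbor mask, excluding self-similarity
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], nn] = True
    mask = mask | mask.T
    # heat-kernel similarity S (Eq. 3)
    S = np.where(mask, np.exp(-D2 / (2 * sigma ** 2)), 0.0)
    # relaxed problem: smallest non-trivial eigenvectors of L = D - S
    L = np.diag(S.sum(axis=1)) - S
    _, vecs = np.linalg.eigh(L)
    B_tilde = vecs[:, 1:q + 1]  # skip the trivial (constant) eigenvector
    # binarize each dimension at its median, yielding codes in {-1, 1}
    return np.where(B_tilde > np.median(B_tilde, axis=0), 1, -1)
```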
The last layer of the CNN is an output layer:

O = W^O h    (4)

where h is the deep feature representation, O ∈ R^q is the output vector and W^O ∈ R^(q×r) is the weight matrix. In order to fit the pre-trained binary codes B, we apply q logistic operations to the output vector O as follows:

p_i = 1 / (1 + exp(−O_i)),  i = 1, ..., q    (5)

Learning
All of the parameters to be trained are collectively denoted by θ.
Given the training text collection X and the pre-trained binary codes B, the log likelihood of the parameters can be written as follows:

J(θ) = Σ_{i=1}^n log p(b_i | x_i, θ)    (6)

Following previous work (Blunsom et al., 2014), we train the network with mini-batches by back-propagation and perform the gradient-based optimization using the Adagrad update rule (Duchi et al., 2011). For regularization, we apply dropout with a 50% rate to the penultimate layer (Blunsom et al., 2014; Kim, 2014).
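Assuming each bit of b_i is modeled independently through the logistic outputs of Eq. 5, maximizing Eq. 6 amounts to minimizing a per-bit binary cross-entropy. Below is a hedged PyTorch-style sketch of one mini-batch update; `model` (the convolutional network) and `output_layer` (the linear map of Eq. 4) are assumptions standing in for components defined elsewhere:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adagrad(
    list(model.parameters()) + list(output_layer.parameters()), lr=0.01)

def train_step(batch_S, batch_B):
    """One Adagrad update; batch_B holds pre-trained codes in {-1, 1}."""
    optimizer.zero_grad()
    h = model(batch_S)               # deep representation (dropout applied inside)
    logits = output_layer(h)         # O = W^O h, Eq. 4
    targets = (batch_B + 1.0) / 2.0  # map {-1, 1} codes to {0, 1}
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```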

K-means for Clustering
With the given short texts, we first utilize the trained deep neural network to obtain the semantic representations h, and then employ the traditional K-means algorithm to perform clustering.
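A minimal sketch of this final step (variable names are ours; H denotes the matrix of learned representations and num_clusters is a placeholder):

```python
from sklearn.cluster import KMeans

# H: (n, r) matrix of deep representations from the trained network
labels = KMeans(n_clusters=num_clusters, n_init=100).fit_predict(H)
```

Here n_init=100 mirrors the repeated random re-initialization of centroids described in the hyperparameter settings below.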

Pre-trained Word Vectors
We use the publicly available word2vec tool to train word embeddings, and most of the parameters are set the same as in Mikolov et al. (2013) for training word vectors on Google News, except that the vector dimensionality is set to 48 and the minimum word count to 5. For SearchSnippets, we train word vectors on Wikipedia dumps. For StackOverflow, we train word vectors on the whole corpus of the StackOverflow dataset described above, which includes the question titles and post contents. The coverage of these learned vectors on the two datasets is listed in Table 2; words not present in the set of pre-trained words are initialized randomly.
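As an illustration, the corresponding training call with the gensim implementation of word2vec might look as follows (gensim ≥ 4 API; the tokenized corpus and the output path are placeholders, and options not shown follow the tool's defaults):

```python
from gensim.models import Word2Vec

# sentences: an iterable of tokenized texts (e.g., a Wikipedia dump for
# SearchSnippets, or question titles plus post contents for StackOverflow)
model = Word2Vec(sentences, vector_size=48, min_count=5, workers=4)
model.wv.save_word2vec_format('vectors_48d.txt')
```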

Comparisons
We compare the proposed method with some of the most popular clustering algorithms:

• K-means: K-means (Wagstaff et al., 2001) on the original keyword features, weighted with term frequency (TF) and term frequency-inverse document frequency (TF-IDF), respectively.
• Spectral Clustering: This baseline (Belkin and Niyogi, 2001) uses Laplacian Eigenmaps (LE) and subsequently employs the K-means algorithm. The dimension of the subspace is set by default to the number of clusters (Ng et al., 2002; Cai et al., 2005); we also iterate over dimensions from 10 to 200 in steps of 10 to obtain the best performance, which is 20 on SearchSnippets and 70 on StackOverflow in our experiments.
• Average Embedding: K-means on the weighted average of the word embeddings, weighted with TF and TF-IDF, respectively (a sketch of this baseline is given after the list). Huang et al. (2012) also used this strategy as the global context in their task, and Lai et al. (2015) used it in text classification.
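The sketch below illustrates the Average Embedding baseline with TF-IDF weighting (our own illustration; `emb` is a word-to-vector lookup built from the pre-trained embeddings, and words outside its vocabulary are skipped):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def average_embedding_kmeans(texts, emb, n_clusters, dim=48):
    """Cluster TF-IDF-weighted averages of word vectors with K-means."""
    tfidf = TfidfVectorizer()
    W = tfidf.fit_transform(texts)          # (n_texts, vocab) TF-IDF weights
    vocab = tfidf.get_feature_names_out()
    X = np.zeros((len(texts), dim))
    for i in range(len(texts)):
        row, total = W.getrow(i), 0.0
        for j, w in zip(row.indices, row.data):
            if vocab[j] in emb:
                X[i] += w * emb[vocab[j]]
                total += w
        if total > 0:
            X[i] /= total                   # weighted average of embeddings
    return KMeans(n_clusters=n_clusters, n_init=100).fit_predict(X)
```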

Evaluation Metrics
The clustering performance is evaluated by comparing the clustering results of texts with the tags/labels provided by the text corpus. Two metrics, the accuracy (ACC) and the normalized mutual information (NMI), are used to measure the clustering performance (Cai et al., 2005; Huang et al., 2014). Given a text x_i, let c_i and y_i be the obtained cluster label and the label provided by the corpus, respectively. Accuracy is defined as:

ACC = (Σ_{i=1}^n δ(y_i, map(c_i))) / n    (7)

where n is the total number of texts, δ(x, y) is the indicator function that equals one if x = y and zero otherwise, and map(c_i) is the permutation mapping function that maps each cluster label c_i to the equivalent label from the text data, computed by the Hungarian algorithm (Papadimitriou and Steiglitz, 1998). Normalized mutual information (Chen et al., 2011) between the tag/label set Y and the cluster set C is a popular metric for evaluating clustering tasks. It is defined as follows:

NMI(Y, C) = MI(Y, C) / √(H(Y) H(C))    (8)

where MI(Y, C) is the mutual information between Y and C, H(·) is entropy, and the denominator √(H(Y) H(C)) normalizes the mutual information into the range [0, 1].
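Both metrics are straightforward to compute; the sketch below uses SciPy's Hungarian-algorithm solver for the mapping in Eq. 7 and takes NMI (Eq. 8) directly from scikit-learn (labels are assumed to be integer NumPy arrays):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC (Eq. 7): best one-to-one cluster-to-label mapping via Hungarian algorithm."""
    k = max(y_true.max(), y_pred.max()) + 1
    agreement = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        agreement[p, t] += 1
    row, col = linear_sum_assignment(-agreement)  # maximize total agreement
    return agreement[row, col].sum() / len(y_true)

# NMI (Eq. 8): normalized_mutual_info_score(y_true, y_pred)
```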

Hyperparameter Settings
In our experiments, most of the parameters are set uniformly for the two datasets. Following a previous study (Cai et al., 2005), the parameter k in Eq. 3 is fixed to 15 when constructing the graph Laplacians in our approach, as well as in spectral clustering. For the CNN model, we manually choose the same architecture for the two datasets. More specifically, in our experiments, the network has two convolutional layers, similar to the example in Figure 2. The widths of the convolutional filters are both 3. The value of k_top for the k-max pooling at the topmost layer is 5. The number of feature maps is 12 at the first convolutional layer and 8 at the second. Each of the two convolutional layers is followed by a folding layer. We further set the dimension of the word embeddings d_w to 48. Finally, the dimension of the deep feature representation r is fixed to 480. Moreover, we set the learning rate λ to 0.01 and the mini-batch size to 200. The output size q in Eq. 4 and Eq. 2 is set to the best subspace dimension found for the spectral clustering baseline, as described in Section 3.3.
Since the initial centroids have a significant impact on the clustering results of the K-means algorithm, we repeat K-means multiple times with random initial centroids (specifically, 100 times for statistical significance). The final results reported are the average of 5 trials for all clustering methods on the two text datasets.

Quantitative Results
Here, we first evaluate the influence of the number of iterations in our method. Figure 3 shows the change in ACC and NMI as the number of iterations increases on the two text datasets. We find that the performance rises steadily over the first ten iterations, which demonstrates that our method is effective. Between 10 and 20 iterations, ACC and NMI become relatively stable on both datasets. In the following experiments, we report the results after 10 iterations.
We report the ACC and NMI performance of all the clustering methods in Table 3. The experimental results show that Spectral Clustering and Average Embedding perform significantly better than K-means on the two datasets. This is because K-means directly constructs the similarity structure from the original keyword feature space, while Average Embedding and Spectral Clustering extract semantic features using shallow-structure models. Compared with the best baselines, the proposed STCC, which extracts deep learned representations from a convolutional neural network, achieves large improvements of 2.33%/4.86% (ACC/NMI) on SearchSnippets and 14.23%/10.01% on StackOverflow. Note that TF-IDF weighting gives a more remarkable improvement for K-means, while TF weighting works better than TF-IDF weighting for Average Embedding. A possible reason is that the pre-trained word embeddings encode useful information from the external corpus and are able to achieve better results even without TF-IDF weighting.
In Figure 4 and Figure 5, we further show 2-dimensional embeddings, obtained with stochastic neighbor embedding (Van der Maaten and Hinton, 2008), of the feature representations used in the clustering methods. We can see that the 2-dimensional embeddings of the deep feature representations learned by our STCC show more clear-cut margins among different semantic topics (that is, tags/labels) on both short text datasets.

Related Work
In this section, we review the related work from the following two perspectives: short text clustering and deep neural networks.

Short Text Clustering
There have been several studies that attempted to overcome the sparseness of short text representations. For example, Yin and Wang (2014) proposed a Dirichlet multinomial mixture model-based approach for short text clustering, and Cai et al. (2005) applied the LPI algorithm to text clustering. Moreover, some studies combine both of the above directions. For example, Tang et al. (2012) proposed a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. However, the former works need solid NLP knowledge, while the latter works are shallow structures which cannot fully capture accurate semantic similarities.

Deep Neural Networks
With the recent revival of interest in DNN, many researchers have concentrated on using Deep Learning to learn features. Hinton and Salakhutdinov (2006) used a DAE to learn text representations. During the fine-tuning procedure, they use back-propagation to find codes that are good at reconstructing the word-count vector.
Recently, researchers have proposed using external corpora to learn a distributed representation for each word, called a word embedding (Turian et al., 2010), to improve DNN performance on NLP tasks. The skip-gram and continuous bag-of-words models of Mikolov et al. (2013) propose a simple single-layer architecture based on the inner product between two word vectors, and Pennington et al. (2014) introduce a new model for word representation, called GloVe, which captures the global corpus statistics.
Based on word embeddings, neural networks can capture meaningful syntactic and semantic regularities, as in RecNN (Socher et al., 2011; Socher et al., 2013) and RNN (Mikolov et al., 2011). However, RecNN exhibits high time complexity for constructing the textual tree, and RNN, which uses the layer computed at the last word to represent the text, is a biased model. Recently, CNN, which applies convolutional filters to capture local features, has been successfully exploited for many supervised NLP tasks, as described in Section 1. This paper is, to the best of our knowledge, the first to explore the power of CNN and word embeddings for an unsupervised learning task, short text clustering.

Conclusions
In this paper, we proposed a short text clustering method based on deep feature representations learned from a CNN, without using any external tags/labels or complicated NLP pre-processing. The experimental study shows that STCC achieves significantly better performance than the baseline methods.