Cluster-Gated Convolutional Neural Network for Short Text Classification

Text classification plays a crucial role in understanding natural language in a wide range of applications. Most existing approaches focus on long text classification (e.g., blogs, documents, paragraphs). However, they cannot easily be applied to short texts because of their sparsity and lack of context. In this paper, we propose a new model called cluster-gated convolutional neural network (CGCNN), which jointly explores word-level clustering and text classification in an end-to-end manner. Specifically, the proposed model first uses a bi-directional long short-term memory network to learn word representations. Then, it leverages a soft clustering method to explore their semantic relation with the cluster centers, and applies a linear transformation to the text representations. It further develops a cluster-dependent gated convolutional layer to control the cluster-dependent feature flows. Experimental results on five commonly used datasets show that our model outperforms state-of-the-art models.


Introduction
With the rapid development of social media, e-commerce and online communication, the Internet has been generating an increasing amount of short texts, including texts, search snippets, user reviews for products, etc., which creates an urgent demand for understanding them. Short text classification, which assigns predefined categories to texts, is a fundamental technique in natural language processing and plays an important role in a wide range of applications, such as sentiment analysis, web search, and ad matching.
In prior research, much progress has been made on text classification, including traditional approaches based on human-designed features (Lazaridou et al., 2013; Zhang et al., 2015a) and neural networks based on deep architectures (Lai et al., 2015; Yang et al., 2016). However, such methods are designed mainly for documents and paragraphs, and still have limitations for short texts. A short text does not contain enough words, which may result in data sparsity and a lack of context (Wang et al., 2017).
Some researchers incorporated knowledge bases into traditional approaches (Feng et al., 2013; Wang et al., 2014) or neural networks (Wang et al., 2017) to overcome these challenges. Extra resources can provide abundant semantic information for short text classification, but the performance of such methods depends strongly on the quality of the knowledge bases, and constructing a large-scale knowledge base is time-consuming and labor-intensive. Another strategy is to explore latent topics (Chen et al., 2011; Ren et al., 2016) or clustering features (Ma et al., 2015; Revanasiddappa and Harish, 2018) for texts and feed them into classifiers as features. Such methods can alleviate the problems of high dimensionality and sparse term distributions. Their shortcoming is that they use pre-trained topics or clusters as features, which makes it hard to exploit the potential association between clustering and classification.
To address these limitations, we construct a joint architecture that embeds a soft clustering method into the classification task, because joint architectures allow the tasks to leverage mutual information from each other and have been useful in many studies on natural language understanding (Luo et al., 2015; Shao et al., 2017; Schmitt et al., 2018). In addition, convolutional neural networks and gating mechanisms have proven effective in sentence-level language modeling (Dauphin et al., 2016; Gehring et al., 2017), and the cluster centers of words capture the semantic closeness of similar words, which motivates us to use them to automatically extract and highlight the cluster-related features for classification.
Based on the above analysis, we propose a joint model called cluster-gated convolutional neural network (CGCNN), which couples clustering and classification methods in an end-to-end deep architecture. It integrates a soft clustering method into a gated convolutional neural network, which can explore the semantic relation between the word-level context and the global corpus. It can also guide the gating mechanism to control cluster-dependent feature flows. Specifically, it first uses a bi-directional long short-term memory model (BiLSTM) to learn word representations and capture local context in the text. Then, it performs soft clustering on the word representations to obtain the probability of assigning each word to each cluster, which builds a bridge between the words and the global corpus. We then develop a linear transformation to calculate cluster-dependent text representations. Based on the gating mechanism, we use the cluster centers to further highlight the cluster-dependent convolutional features for the corresponding cluster. Finally, we perform max-over-time pooling and concatenation operations to combine the selected features for classification.
The main contributions of this study are summarized as follows:
• We develop a joint model that combines clustering and classification methods in an end-to-end manner. The model leverages the semantic relation between words and the global corpus by learning from a soft clustering method to assist the classification task.
• To the best of our knowledge, our model is the first to incorporate a clustering method into the gating mechanism of a convolutional neural network, which helps to control the features related to clusters.
• We conduct extensive experiments on five real-world datasets to verify the effectiveness of our model. The experimental results show that the proposed method outperforms state-of-the-art methods.

Related Work
In this section, we review the related work from the following two aspects: text classification and short text classification.

Text Classification
Traditional text classification methods generally rely on manual features, such as bag-of-words, short n-grams, and POS tags. More recent studies design more complex features for specific applications. For example, Lazaridou et al. (2013) considered discourse connectives (such as "but", "and") in a Bayesian model for sentiment classification. Post and Bergsma (2013) used multiple explicit and implicit syntactic features (e.g., unigrams, bigrams, and grammar tree patterns) for text classification. Zhang et al. (2015a) integrated word embeddings learned by word2vec into a support vector machine model. Recently, deep learning methods have proven effective in text classification. Kim (2014) proposed a convolutional neural network (CNN) architecture that utilized multiple parallel convolutional layers with varying filter window sizes and concatenated the selected important features into a dense softmax layer for sentence classification. Lai et al. (2015) applied a recurrent structure to learn the contextual information of each word and employed a max-pooling layer to capture the important features in texts. Another state-of-the-art method is hierarchical attention networks for document classification (Yang et al., 2016). Based on the hierarchical structure of documents, it applied attention mechanisms to word-level and sentence-level representations extracted by BiLSTMs.
Such methods perform well on long texts, especially documents or paragraphs, but they are inferior when directly applied to short text classification. Short texts tend to span a wide range of words, resulting in data sparsity and a lack of context (Chen et al., 2011; Wang et al., 2017).

Short Text Classification
According to our review, there generally exist two strategies for short text classification.
The first strategy is to leverage an external knowledge base to expand the context of short texts. For example, Feng et al. (2013) calculated the correlation between each short text and domain knowledge for classification. Wang et al. (2014) leveraged a large-scale taxonomy knowledge base to learn the concepts of words and ranked the similarities between short texts and concepts. Wang et al. (2017) associated each short text with its relevant concepts in the knowledge base. They combined the words and relevant concepts of the short text to generate its embedding. A high-quality knowledge base is vital for the performance of these methods, but its construction is time-consuming and labor-intensive, or even worse, it may be unavailable for some domains (Li et al., 2016).
The second strategy is to explore latent topics or clustering features for classification. For example, Chen et al. (2011) derived multi-granularity topics through latent Dirichlet allocation (LDA) as features for traditional classifiers. Ren et al. (2016) used LDA to extract topics and extended an existing recursive autoencoder to effectively incorporate topic information. Ma et al. (2015) used Gaussian models to describe the distribution of word embeddings and classified new short texts using the Bayesian rule to get the posterior probability. Revanasiddappa and Harish (2018) developed a fuzzy c-means clustering method and computed the degree of match between clusters and categories. Such methods can alleviate the problems of high dimensionality and sparse term distributions. However, their pipeline architecture (i.e., using clustering or topic models to derive clusters or topics, and then integrating them into classifiers as features) makes it hard to leverage the mutual dependency between clustering and classification methods.

Method
In this paper, we propose a joint model called cluster-gated convolutional neural network (CGCNN), coupling a soft clustering method and a gated CNN for classification. In this section, we introduce the overall architecture of our model and define the objective function for training. Figure 1 presents the CGCNN structure, which consists of five major components: (1) a word encoder layer based on BiLSTM to learn word representations in each short text, (2) a clustering layer that calculates the words' cluster distributions and performs a linear transformation to get cluster-dependent text representations, (3) a cluster-gated convolutional layer that integrates cluster centers into a gated CNN for further controlling cluster-related feature flows, (4) a max-pooling layer to select the most important features and concatenate them as the final text features, and (5) a fully connected layer with a softmax function for classification. We update all the parameters in these five components simultaneously, as introduced in the Training subsection.

Overall Architecture of the Model
Word Encoder. Suppose each short text has a maximum of T words, and the t-th word is denoted as $w_t$, $t \in [1, T]$. We embed the words of the short text into vectors $x_t$ through an embedding matrix $W_e$. To capture the local context in the text, we employ a BiLSTM to derive the forward representation $\overrightarrow{h}_t$ and the backward representation $\overleftarrow{h}_t$. We concatenate them as the word representation, i.e., $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. The input text is thus represented as a matrix $H = [h_1, h_2, \dots, h_T]$. In some cases where the text is weakly sequential, we directly use the word embedding as the corresponding word representation, i.e., $h_t = x_t$. This is further discussed in the experiment section.
Clustering Layer. Cluster centers capture the semantic closeness of similar words, which is used to selectively control related word flows in the next layer. Here we employ a soft clustering method (Maaten and Hinton, 2008; Xie et al., 2016) to explore the words' cluster centers. We then build a projection function $f: (h_t, c_k) \rightarrow h_t^k$ to get cluster-dependent text representations, where $c_k$ refers to the k-th cluster center, and $h_t^k$ refers to the t-th word representation dependent on the k-th cluster center. We set the number of clusters to K, i.e., $k \in [1, K]$. The soft clustering method uses the Student's t-distribution as a kernel to calculate the similarity between word representation $h_t$ and cluster center $c_k$, as in formula (1).
$$q_{t,k} = \frac{(1 + \|h_t - c_k\|^2)^{-1}}{\sum_{k'=1}^{K} (1 + \|h_t - c_{k'}\|^2)^{-1}} \quad (1)$$
where $q_{t,k}$ is the probability of the t-th word belonging to the k-th cluster. A higher value of $q_{t,k}$ indicates that the word is closer to the cluster. With the help of this probability, we build a linear function to get the cluster-dependent text representations, as in formula (2). It reduces the role of words unrelated to the cluster, and ensures that the cluster-dependent word representations at position t sum to the corresponding word representation $h_t$, as in formula (3).
$$h_t^k = q_{t,k} \, h_t \quad (2)$$
$$\sum_{k=1}^{K} h_t^k = h_t \quad (3)$$
In this way, we can transform the matrix $H$ of a short text into K cluster-dependent matrices, as in formula (4).
$$H^k = [h_1^k, h_2^k, \dots, h_T^k] \quad (4)$$
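The clustering layer can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the Student's t kernel with one degree of freedom (as in Xie et al., 2016), and the function names `soft_assign` and `cluster_dependent` as well as the random inputs are hypothetical.

```python
import numpy as np

def soft_assign(H, C):
    """Soft cluster probabilities for each word.

    H: (T, D) word representations; C: (K, D) cluster centers.
    Returns Q with shape (T, K), each row summing to 1.
    """
    # Squared Euclidean distance between every word and every center.
    d2 = ((H[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    # Student's t kernel; one degree of freedom is an assumption here.
    kernel = 1.0 / (1.0 + d2)
    return kernel / kernel.sum(axis=1, keepdims=True)

def cluster_dependent(H, Q):
    """Scale each word vector by its cluster probability.

    Returns an array of shape (K, T, D) whose k-th slice holds q[t, k] * h[t].
    """
    return Q.T[:, :, None] * H[None, :, :]

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # toy text: T = 6 words, D = 4
C = rng.normal(size=(3, 4))   # K = 3 randomly initialized centers
Q = soft_assign(H, C)
Hk = cluster_dependent(H, Q)
```

Because each row of `Q` sums to one, summing `Hk` over the cluster axis recovers `H`, which is the property the text states for the cluster-dependent representations.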
Cluster-Gated Convolutional Layer. The gating mechanism can control information flows in the network, and has proven effective in LSTM and CNN (Dauphin et al., 2016). With the help of cluster centers, we further explore the features related to clusters in this layer. We employ a convolutional filter $W_a \in \mathbb{R}^{n \times D}$ for mapping n words into a phrase-level feature, where D and n refer to the dimension of $h_t$ and the filter window size respectively. By shifting the filter across the k-th cluster-dependent text representation $H^k$, as in formula (5), we obtain a sequence of new features $a_k = [a_{1,k}, a_{2,k}, \dots, a_{L,k}]$. Here we use no-padding mode, i.e., L = T - n + 1.
$$a_{t,k} = W_a \cdot H^k_{t:t+n-1} + b_a \quad (5)$$
where $b_a$ is the bias term. Based on gated linear units (GLU) (Dauphin et al., 2016), we use the word representations and the cluster center together to decide the information passed on, as in formulas (6) and (7).
$$g_{t,k} = \sigma(W_g \cdot H^k_{t:t+n-1} + V_g \cdot c_k + b_g) \quad (6)$$
$$o_{t,k} = a_{t,k} \otimes g_{t,k} \quad (7)$$
where $W_g \in \mathbb{R}^{n \times D}$, $V_g \in \mathbb{R}^{D}$, $b_g \in \mathbb{R}$ are learned parameters, $\sigma$ is the sigmoid function, and $\otimes$ is the element-wise product between vectors. $g_{t,k}$ refers to the cluster-gated value, which is used to control the convolutional feature $a_{t,k}$, and $o_{t,k}$ is the final gated convolutional feature at position t for the k-th cluster-dependent text representation.
Pooling Layer. In this layer, we apply a max-over-time pooling operation over each sequence of cluster-gated convolutional features to capture the maximum value as the feature for the corresponding cluster-dependent text representation, as in formula (8). We then concatenate all of them for the next classification layer, as in formula (9).
$$o_k = \max(o_{1,k}, o_{2,k}, \dots, o_{L,k}) \quad (8)$$
$$o = [o_1, o_2, \dots, o_K] \quad (9)$$
Classifier Layer. For each short text instance, we generate a high-level representation that combines the information related to multiple clusters. To make full use of it, we use a fully connected layer with a softmax function for prediction. The probability of assigning a category label to this instance can be calculated as in formula (10).
$$p(y = j \mid o) = \frac{\exp(w_j^\top o + b_j)}{\sum_{j'} \exp(w_{j'}^\top o + b_{j'})} \quad (10)$$
where j is the category label, and $w_j$ and $b_j$ are the weights and bias of the fully connected layer for label j. To avoid over-fitting, we can also employ dropout in this layer.
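The gated convolution, pooling and classifier steps can be sketched as follows. This is a hedged NumPy illustration under stated assumptions: a single filter of window size n shared across clusters, scalar features per window, no dropout, and hypothetical parameter names (`W_a`, `W_g`, `V_g`, `W_c`, `b_c`) with random toy values rather than learned ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cluster_gated_features(Hk, C, W_a, b_a, W_g, V_g, b_g):
    """Gated convolution followed by max-over-time pooling.

    Hk: (K, T, D) cluster-dependent text representations.
    C:  (K, D) cluster centers.
    W_a, W_g: (n, D) filters; V_g: (D,); b_a, b_g: scalars.
    Returns one pooled feature per cluster, shape (K,).
    """
    K, T, D = Hk.shape
    n = W_a.shape[0]
    L = T - n + 1  # no-padding mode
    pooled = np.empty(K)
    for k in range(K):
        gated = np.empty(L)
        for t in range(L):
            window = Hk[k, t:t + n]                 # n consecutive words
            a = np.sum(W_a * window) + b_a          # convolutional feature
            g = sigmoid(np.sum(W_g * window) + V_g @ C[k] + b_g)  # cluster gate
            gated[t] = a * g
        pooled[k] = gated.max()                     # max-over-time pooling
    return pooled

rng = np.random.default_rng(0)
K, T, D, n, J = 3, 7, 4, 3, 2
Hk = rng.normal(size=(K, T, D))
C = rng.normal(size=(K, D))
W_a, W_g = rng.normal(size=(n, D)), rng.normal(size=(n, D))
V_g = rng.normal(size=D)
features = cluster_gated_features(Hk, C, W_a, 0.0, W_g, V_g, 0.0)

# Fully connected layer with softmax over J categories (toy weights).
W_c, b_c = rng.normal(size=(J, K)), np.zeros(J)
probs = softmax(W_c @ features + b_c)
```

The gate depends on both the local window and the cluster center, so the same phrase can be passed through for one cluster and suppressed for another.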

Training
The entire CGCNN model integrates a clustering method into the gated CNN for classification, and all of its parameters can be updated simultaneously in one framework. Hence, we combine the two losses into one objective function, as in formula (11).
$$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_{kl} \quad (11)$$
where $\mathcal{L}_c$ is the cross-entropy loss of the classifier, and $\mathcal{L}_{kl}$ is the clustering loss based on Kullback-Leibler divergence (KL divergence) minimization. λ > 0 is a tradeoff parameter controlling the weight of the clustering loss. The classifier loss can be defined as in formula (12).
$$\mathcal{L}_c = -\sum_{i} \sum_{j} 1\{y_i = j\} \log p(y_i = j \mid x_i) \quad (12)$$
where $x_i$ is the i-th sample instance, $y_i$ is the ground truth label, and 1{·} is the indicator function.
For the clustering loss, we use the KL divergence between the distribution of soft labels $q_{t,k}$ and the auxiliary distribution $p_{t,k}$, following Maaten and Hinton (2008) and Xie et al. (2016), as in formula (13).
$$\mathcal{L}_{kl} = KL(P \| Q) = \sum_{t} \sum_{k} p_{t,k} \log \frac{p_{t,k}}{q_{t,k}} \quad (13)$$
where $p_{t,k}$ is the target distribution, as in formula (14).
$$p_{t,k} = \frac{q_{t,k}^2 / \sum_{t'} q_{t',k}}{\sum_{k'} \left( q_{t,k'}^2 / \sum_{t'} q_{t',k'} \right)} \quad (14)$$
Following Xie et al. (2016), this target distribution is computed by first raising $q_{t,k}$ to the second power and dividing it by the corresponding soft cluster frequency $\sum_{t'} q_{t',k}$, and then normalizing, which prevents large clusters from distorting the hidden feature space. It can not only improve cluster purity, but also emphasize the data points assigned to clusters with high confidence.
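The training objective can be illustrated with a small NumPy sketch. It assumes the target distribution of Xie et al. (2016) (squared soft assignments divided by soft cluster frequencies, then normalized per word), an additive combination of the two losses weighted by λ, and soft cluster frequencies computed over the words passed in; the function names and the toy matrix `Q` are hypothetical.

```python
import numpy as np

def target_distribution(Q):
    """Sharpen soft assignments Q (T, K) into the auxiliary target P (T, K)."""
    # Square each assignment and divide by the soft cluster frequency.
    weight = Q ** 2 / Q.sum(axis=0, keepdims=True)
    # Normalize per word so each row is a distribution over clusters.
    return weight / weight.sum(axis=1, keepdims=True)

def kl_clustering_loss(P, Q, eps=1e-12):
    """KL(P || Q) summed over words and clusters."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

def cross_entropy(probs, label, eps=1e-12):
    """Classifier loss for one instance with ground-truth index `label`."""
    return float(-np.log(probs[label] + eps))

def joint_loss(probs, label, Q, lam=0.5):
    """Cross-entropy plus lam times the clustering loss."""
    P = target_distribution(Q)  # treated as a fixed target for this step
    return cross_entropy(probs, label) + lam * kl_clustering_loss(P, Q)

Q = np.array([[0.7, 0.3],
              [0.4, 0.6],
              [0.9, 0.1]])
P = target_distribution(Q)
loss = joint_loss(np.array([0.8, 0.2]), 0, Q, lam=0.5)
```

In this toy case the confident assignments become more confident in `P` (for example, the first row moves from 0.7 towards roughly 0.73), which is the sharpening effect that the clustering loss pulls `Q` towards.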

Datasets and Preprocessing
To illustrate the effectiveness of our model, we conduct experiments on five public datasets: AG News, Sogou News, Amazon Reviews, Yahoo! Answers, and Search Snippets. The first three datasets are adopted from (Zhang et al., 2015b). The last two datasets are from the Yahoo! Webscope program and (Phan et al., 2008) respectively. For each dataset, we use 80% of the data for training, 10% for validation, and the remaining 10% for test. To construct short texts, we only use titles or some partial information of the datasets.
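The 80/10/10 partition can be sketched as below. The text does not say how samples are assigned to each part, so the random shuffle, the fixed seed and the helper name `split_80_10_10` are assumptions made for illustration.

```python
import random

def split_80_10_10(samples, seed=0):
    """Shuffle and split samples into train, validation and test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # shuffling is an assumption
    n_train = len(items) * 8 // 10       # 80% for training
    n_valid = len(items) // 10           # 10% for validation
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]     # remaining samples for test
    return train, valid, test

train, valid, test = split_80_10_10(range(100))
```

Using integer arithmetic for the boundaries avoids floating-point rounding when the dataset size is not a multiple of ten; any leftover samples fall into the test part.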
AG News and Sogou News. These two original datasets include 127,600 samples from 4 categories and 510,000 samples from 5 categories respectively. Sogou News is a dataset in Chinese, and Zhang et al. (2015b) combined the pypinyin package and a Chinese segmentation tool to produce Pinyin, a romanized spelling of Chinese. In both datasets, each sample contains the title and the content of a news article. To test on short texts, we remove the contents and only use the titles in our experiment.
Amazon Reviews. The full dataset contains 3.65 million samples with rating labels from one to five. To test on short texts, we remove the review contents and only use the review titles in our experiment.
Yahoo! Answers. This corpus includes 4,483,032 question titles, question contexts and their answers. We use 10 largest classes to construct a topic classification task. We randomly choose 50,000 samples for each class. Here we only use the question titles for classification.
Search Snippets. This dataset, released by Google search engine, includes 12,340 samples with 8 predefined categories (Phan et al., 2008).
Note that we filter out punctuation and use the Natural Language Toolkit (NLTK) for stemming. We do not remove stopwords since some of them may carry classification information, especially in users' reviews. The details of each dataset are listed in Table 1.

Implementation Detail
The model hyper-parameters are tuned based on the AG News dataset. We also conduct experiments with a variant of the model that directly uses word embeddings instead of the BiLSTM, denoted as CGCNN*. We set the dimension of word embeddings to 300, and pre-train word embeddings on each dataset with word2vec. The dimensions of all hidden vectors are set to 200. For the clustering method, we set the number of clusters to the number of ground-truth categories, and randomly initialize the cluster center vectors. We set λ=0.5 and λ=0.6 for CGCNN* and CGCNN respectively to control the effect of the clustering method. To avoid over-fitting, we use dropout with a rate of 0.2. We train the parameters using the Adam method with a learning rate of 0.001, and set the batch size to 64. The filter sizes of all convolution layers are set to 3 in these two methods.
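For reference, the settings above can be collected in a small configuration object. This is only an organizational sketch: the class name `CGCNNConfig` and its field names are hypothetical, and the default of 4 clusters corresponds to the 4 categories of AG News.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CGCNNConfig:
    embedding_dim: int = 300      # word2vec embeddings pre-trained per dataset
    hidden_dim: int = 200         # dimension of all hidden vectors
    num_clusters: int = 4         # number of ground-truth categories (AG News)
    lam: float = 0.6              # tradeoff parameter for the clustering loss
    dropout: float = 0.2
    learning_rate: float = 0.001  # Adam
    batch_size: int = 64
    filter_size: int = 3
    use_bilstm: bool = True       # False for the embedding-only variant

cgcnn = CGCNNConfig()
cgcnn_star = CGCNNConfig(lam=0.5, use_bilstm=False)  # CGCNN*
```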

Baselines and Experimental Settings
In this paper, we choose the following baseline algorithms for comparison:
CNN (Kim, 2014). It builds a multi-channel convolutional architecture with varying filter window sizes, and concatenates the important features extracted by a max-over-time pooling operation.
CNNM. To further illustrate the effectiveness of our model, we develop a variant of the multi-channel convolutional architecture (Kim, 2014) with multiple fixed-size filters. Following our model's hyperparameters, the number of filters equals the number of ground-truth categories, and all filter sizes are set to 3.
RCNN (Lai et al., 2015). It develops a recurrent convolutional structure. It employs a bi-directional recurrent structure to capture word context embeddings and uses a max-pooling layer to select the important features.
CNN-LSTM (Zhou et al., 2015). This method uses a multi-channel convolutional layer to extract higher-level phrase features, and employs a BiLSTM to capture their sequences for classification.
AttBiLSTM (Lin et al., 2017). It uses a BiLSTM to explore the sequences of texts, and develops a self-attention mechanism to get sentence-level representations.
For the multiple-channel convolutional architecture of CNN and CNN-LSTM, the filter sizes are 3, 4 and 5, as Kim (2014)'s default settings. For the hidden vectors of BiLSTM in these methods, we also set their dimensions to 200.

Results
We use accuracy as the evaluation metric, and Table 2 reports the performance of the different algorithms on the five real-world datasets. We highlight the highest value in each column. As we can see, either CGCNN or CGCNN* has the best performance on each dataset. CNN-LSTM outperforms the other baseline methods on the AG News, Amazon Reviews and Yahoo! Answers datasets, while CNN and CNNM have the best baseline performance on Sogou News and Search Snippets respectively. Compared with CNN-LSTM, CGCNN achieves improvements of about 1.5% on Amazon Reviews and Yahoo! Answers, and 0.49% on AG News. CGCNN* achieves improvements of 0.6% to 0.8% over the second-best method on the Sogou News and Search Snippets datasets. The AttBiLSTM method performs poorly. We suspect that the lack of sufficient context might cause the self-attention mechanism to fail.
The CNNM method, which uses as many convolutional filters as there are categories, performs similarly to the CNN method with three convolutional filters, which suggests that increasing the number of convolutional filters might have no positive impact on performance. In contrast, our CGCNN* method, which uses as many clusters as there are categories for the gated CNN, achieves better accuracy than both of them. This shows that our proposed architecture, which integrates a cluster-gated mechanism into CNN, can significantly improve the performance of short text classification.
The CNN-LSTM method, which uses CNN and BiLSTM to capture phrase features and their sequences, outperforms CNN and CNNM on the AG News, Amazon Reviews and Yahoo! Answers datasets, while it performs worse on the Sogou News and Search Snippets datasets. The same pattern appears in the comparison between CGCNN and CGCNN*. We analyze the datasets, and suspect that weak sequential relationships in the texts may cause the decreased performance on Sogou News and Search Snippets. The original Sogou News was converted from Chinese characters to Pinyin format (Zhang et al., 2015b), which might cause a homophone problem. For example, the word "与(and)" and the word "雨(rain)" have the same pronunciation but different meanings in Chinese. This breaks sequential patterns in the texts. For Search Snippets, each sample consists of multiple keywords, and there is no obvious order among them. For example, a sample such as "… calorie count calories item ..." contains weak sequential semantics.

Clustering Analysis
To further study the impact of the clustering method, we conduct additional experiments on the AG News dataset by varying the tradeoff parameter λ and the cluster number K, and assess the sensitivity of our model. Figure 2 reports the change in performance as the tradeoff parameter λ increases from 0.1 to 0.9 while the cluster number K is kept constant (equal to the category number). We can observe that CGCNN* and CGCNN reach their best performance when λ = 0.5 and λ = 0.6 respectively. When the tradeoff parameter varies from 0.1 to these optimal values, the performance of both methods generally shows an increasing trend, which implies that clustering can benefit the understanding of short texts. When the tradeoff parameter increases from the optimal values to 0.9, the performance of these two methods generally decreases slightly, which shows that excessive clustering might have a negative influence on short text classification.
Figure 3 reports the results of adjusting the number of clusters (K) in CGCNN* and CGCNN when we set λ to the optimal values (i.e., λ = 0.5 and λ = 0.6 respectively). For the CGCNN method, we can observe that it reaches the best performance when the cluster number equals the category number (i.e., K = 4). Whether the cluster number increases or decreases from this value, its performance tends to decrease. In contrast, the performance of the CGCNN* method generally shows an increasing trend, and becomes relatively stable once K reaches the category number (although its performance decreases slightly at K = 5). This is the reason that we set the cluster number to the category number.

Case Study
In this section, we take several concrete samples from the AG News dataset to illustrate how the proposed model works.
In the clustering layer, we leverage a soft clustering method to explore the words' cluster centers, and build a linear function to project the representation of a short text to K cluster-dependent representations. Here we calculate the similarity between words and cluster centers, and normalize the values of each word belonging to clusters. Figure 4 (a) and (b) show two different instances from the categories "Sci/Tech" and "World" respectively. We can observe that the linear projection strengthens the words' representations for some clusters and weakens them for others. The two figures show different distributions for the same word "for", which is due to the different contexts explored by the BiLSTM.
There might exist some instances closely related to two or more clusters, as in Figure 5 (a). To further control the information flows, we leverage cluster centers and phrase-level features for the gating mechanism. We use the method of Xue and Li (2018) to visualize the gating mechanism: summing the representation of each phrase-level feature and normalizing the sums across clusters. Figure 5 (a) shows the similarities between words and clusters in an instance with category "World", while Figure 5 (b) shows the corresponding cluster-gated convolutional features. We can observe that the cluster-gated layer further strengthens the corresponding cluster-dependent representation and weakens the others.

Conclusion
In this paper, we propose a joint model that couples clustering and classification methods. It employs a BiLSTM to learn word representations for local contexts in short texts. We use a soft clustering method to calculate the probability of assigning each word to each cluster, which captures the semantic relation between word representations and the global corpus. We also perform a linear transformation to obtain cluster-dependent text representations. Moreover, we develop a cluster-gated CNN by integrating cluster centers into GLU, which can select cluster-related features for classification. Experiments on five real-world datasets show that our model outperforms state-of-the-art methods on the short text classification task.
In future work, we will further analyze the mutual effects of document-level clustering and classification methods for long text, and attempt to develop a more effective joint model for text classification. Moreover, we will study other mechanisms (e.g., highway units, attention mechanism) to further improve the performance.