A Self-Training Approach for Short Text Clustering

Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative features from both an autoencoder and a sentence embedding, and then uses assignments from a clustering algorithm as supervision to update the weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.


Introduction
Text clustering groups semantically similar text without using supervision or manually assigned labels. Text clusters have proven to be beneficial in many applications including news recommendation (Wang et al., 2010), language modeling (Liu and Croft, 2004), query expansion (Amini and Usunier, 2007), visualization (Cadez et al., 2003), and corpus summarization (Schutze and Silverstein, 1997).
Due to the popularity of social media and online fora such as Twitter and Reddit, texts containing only a few words have become prevalent on the web. Compared to the clustering of long documents, Short Text Clustering (STC) introduces additional challenges. Traditionally, text is represented as bag-of-words (BOW) or term frequency-inverse document frequency (TF-IDF) vectors, after which a clustering algorithm such as k-means is applied to partition the texts into homogeneous groups (Xu et al., 2017). Due to the short length of such texts, their vector representations tend to become very sparse. As a result, traditional similarity measures, which rely on word overlap or distance between high-dimensional vectors, become ineffective (Xu et al., 2015).
Previous work on STC enriched short text representations by incorporating features from external resources. Hu et al. (2009) and Banerjee et al. (2007) extended short texts using articles from Wikipedia. In similar fashion, Hotho et al. (2003) and Wei et al. (2015) proposed different methods to enrich text representation using ontologies. More recently, low-dimensional representations have shown potential to counter the sparsity problem in STC. Combined with neural network architectures, embeddings of words (Mikolov et al., 2013; Pennington et al., 2014), sentences (Le and Mikolov, 2014; Kiros et al., 2015) and documents (Dai et al., 2015) were proven to be effective on a variety of tasks in machine learning for NLP.
Deep clustering methods first embed the high-dimensional data into a lower-dimensional space, after which a clustering algorithm is applied. These methods either perform clustering after having trained the embedding transformation (Tian et al., 2014; De Boom et al., 2016), or jointly optimize both the embedding and the clustering (Yang et al., 2016), and we situate our method in the former. Closely related to our work is the method of Deep Embedded Clustering (DEC) (Xie et al., 2016), which learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space while iteratively optimizing a clustering objective. The self-taught convolutional neural network (STC²) framework proposed by Xu et al. (2017) trains a CNN to learn deep feature representations in order to reconstruct auxiliary targets; the trained representations from the CNN are then clustered using the k-means algorithm. Two recent surveys provide an overview of research on deep clustering methods (Aljalbout et al., 2018; Min et al., 2018).

Figure 1: Short text clustering using SIF embedding, an autoencoder architecture and self-training.
Similar to Xie et al. (2016), we follow a multi-phase approach and train a neural network (which we will refer to as the encoder) to transform embeddings to a latent space before clustering. However, we apply two crucial modifications. First, as opposed to CNN-based encoders (Xu et al., 2017), we propose the use of Smooth Inverse Frequency (SIF) embeddings (Arora et al., 2017), which simplify the model and make clustering more efficient while maintaining performance.
Second, during the clustering stage, we apply self-training using soft cluster assignments to fine-tune the encoder before a final clustering is applied. We describe our methodology in more detail in Section 2. In Section 3, we evaluate our method on three short text datasets, measuring clustering accuracy and normalized mutual information. Our model matches or outperforms more sophisticated neural network architectures.

Methodology
Our model for short text clustering includes three steps: (1) Short texts are embedded using SIF embeddings (Section 2.1); (2) During a pre-training phase, a deep autoencoder is applied to encode and reconstruct the short text SIF embeddings (Section 2.2); (3) In a self-training phase, we use soft cluster assignments as an auxiliary target distribution, and jointly fine-tune the encoder weights and the clustering assignments (Section 2.3). The described setup is illustrated in Figure 1.

SIF Embedding
We apply a relatively simple yet effective strategy for embedding short texts, called Smooth Inverse Frequency (SIF) embeddings (Arora et al., 2017). First, a weighted average of pre-trained word embeddings is computed, in which the contribution of each word w is weighted by a/(a + p(w)), with a being a hyperparameter and p(w) the empirical frequency of w in the text corpus. SIF embeddings are then produced by computing the first principal component of the resulting vectors and removing it from the weighted averages.
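As an illustration, the SIF computation described above can be sketched in a few lines of NumPy. This is a minimal sketch under our own naming (`sif_embed` and its arguments are hypothetical), not the authors' implementation:

```python
import numpy as np

def sif_embed(sentences, word_vecs, word_freq, a=1e-3):
    """Sketch of SIF embedding for tokenized sentences.

    sentences: list of token lists; word_vecs: dict token -> vector;
    word_freq: dict token -> empirical probability p(w); a: SIF weight.
    """
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        toks = [t for t in sent if t in word_vecs]
        if not toks:
            continue  # leave all-OOV sentences as zero vectors
        # weighted average: each word contributes a / (a + p(w))
        weights = np.array([a / (a + word_freq.get(t, 0.0)) for t in toks])
        vecs = np.array([word_vecs[t] for t in toks])
        emb[i] = weights @ vecs / len(toks)
    # remove the first principal component (common component removal)
    u, _, _ = np.linalg.svd(emb.T, full_matrices=False)
    pc = u[:, 0]
    return emb - emb @ np.outer(pc, pc)
```

After the projection, every sentence vector is orthogonal to the first principal direction, which in practice strips the shared "syntactic" component from the averages.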

Autoencoder
The parameters of the encoder network are initialized using a deep autoencoder architecture such as the one used by Hinton and Salakhutdinov (2006). The mean squared error between the input embeddings and their reconstructions by the decoder subnetwork (see Fig. 1) is used as reconstruction loss. This non-clustering loss is independent of the clustering algorithm and ensures that the original text representations are preserved. Yang et al. (2017) demonstrated that the absence of such a non-clustering loss can lead to worse representations, or to trivial solutions in which all clusters collapse into a single representation.
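The pre-training objective can be illustrated with a deliberately minimal, linear, single-hidden-layer sketch. The paper's actual network is deeper (d:500:500:2000:20, with non-linearities) and trained in TensorFlow; the function names below are ours:

```python
import numpy as np

def train_autoencoder(X, latent=20, lr=0.01, epochs=15, seed=0):
    """Minimal linear autoencoder sketch: encode X (n x d) into
    `latent` dimensions and reconstruct it under an MSE loss,
    trained by plain gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, (d, latent))   # encoder weights
    W_dec = rng.normal(0.0, 0.1, (latent, d))   # decoder weights
    for _ in range(epochs):
        Z = X @ W_enc                   # encode
        R = Z @ W_dec                   # decode / reconstruct
        G = 2.0 * (R - X) / X.size      # gradient of the MSE w.r.t. R
        g_dec = Z.T @ G                 # gradient w.r.t. decoder weights
        g_enc = X.T @ (G @ W_dec.T)     # gradient w.r.t. encoder weights
        W_enc -= lr * g_enc
        W_dec -= lr * g_dec
    return W_enc, W_dec

def reconstruction_loss(X, W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))
```

The encoder half (here just `W_enc`) is what survives into the self-training phase; the decoder exists only to define the reconstruction loss.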

Self-Training
After pre-training using the autoencoder architecture, we obtain an initial estimate of the non-linear mapping from the SIF embedding to a low-dimensional representation, on which a clustering algorithm is applied. Next, we improve clustering using a second, self-training phase: we assign initial cluster centroids, after which we alternate between two steps: (i) first, the probability of assigning a data point to each cluster is computed; (ii) second, an auxiliary probability distribution is calculated and used as target for the encoder network. Network weights and cluster centroids are updated iteratively until a stopping criterion is met.

For Step (i), we compute a soft cluster assignment for each data point. Maaten and Hinton (2008) propose the Student's t-distribution Q with a single degree of freedom to measure the similarity between embedded points z_i and centroids \mu_j:

q_{ij} = \frac{(1 + \|z_i - \mu_j\|^2)^{-1}}{\sum_{j'} (1 + \|z_i - \mu_{j'}\|^2)^{-1}}

in which q_{ij} can be interpreted as the probability of assigning sample i to cluster j, i.e., a soft assignment of embeddings to centroids. The encoder is then fine-tuned to match these soft assignments q_{ij} to a target distribution p_{ij}.

For Step (ii), like Xie et al. (2016), we use an auxiliary target distribution P with "stricter" probabilities than the similarity scores q_{ij}, with the aim to improve cluster purity and to put more emphasis on data points assigned with high confidence. This prevents large clusters from distorting the hidden feature space. The probabilities p_{ij} of the distribution P are calculated as:

p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \quad \text{with } f_j = \sum_i q_{ij}

in which the squared similarities q_{ij}^2 are normalized by the soft cluster frequencies f_j.
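The two steps above can be sketched as follows (a minimal NumPy sketch under our own naming, not the authors' code):

```python
import numpy as np

def soft_assignments(Z, centroids):
    """Step (i): Student's t similarity (one degree of freedom) between
    embedded points z_i and cluster centroids mu_j, normalized over
    clusters so each row is a probability distribution q_i."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Step (ii): auxiliary distribution P; square q, divide by the soft
    cluster frequencies f_j = sum_i q_ij, then renormalize per sample."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)
```

Squaring the assignments and dividing by the cluster frequencies sharpens confident assignments while down-weighting the contribution of large clusters.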
The KL-divergence between the two probability distributions P and Q is then used as training objective, i.e., the training loss L is defined as:

L = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

The strategy outlined above can be seen as a form of self-supervision (Nigam and Ghani, 2000). Centroids of a standard clustering algorithm (e.g., k-means) are used to initialize the weights of the clustering layer, after which high-confidence predictions are used to fine-tune the encoder and the centroids. After convergence of this procedure, short texts are encoded and final cluster assignments are made using k-means.
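A sketch of this training objective (the function name and the `eps` guard are ours):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Training loss L = KL(P || Q) = sum_i sum_j p_ij * log(p_ij / q_ij).
    p and q are (n_samples, n_clusters) matrices whose rows sum to one;
    `eps` guards against log(0) for empty entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Minimizing this loss with respect to the encoder weights and centroids pulls the soft assignments Q toward the sharper target P, which is recomputed periodically as the embeddings improve.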

Experimental Results
After describing the datasets (Section 3.1) and the experiment design (Section 3.2), we will present the results of these experiments (Section 3.3).

Data
We replicate the test setting used by Xu et al. (2017) and evaluate our model on three datasets for short text clustering: (1) SearchSnippets: a text collection comprising Web search snippets categorized in 8 different topics (Phan et al., 2008).
(2) Stackoverflow: a collection of posts from the question-and-answer site Stack Overflow, published as part of a Kaggle challenge. 1 This subset contains question titles from 20 different categories, selected by Xu et al. (2017). (3) Biomedical: a snapshot of one year of PubMed data distributed by BioASQ for the evaluation of large-scale online biomedical semantic indexing. 2 Table 2 provides an overview of the main characteristics of these short text datasets.

Experimental Setup
We compare our method to baselines for STC, including clustering of TF and TF-IDF representations, Skip-Thought Vectors (Kiros et al., 2015) and the best reported STC² model by Xu et al. (2017). Following Van Der Maaten (2009) and Xie et al. (2016), we set the sizes of the hidden layers to d:500:500:2000:20, where d is the short text embedding dimension, for all datasets. We used pre-trained word2vec embeddings 3 with the SIF weight parameter fixed to a = 0.1 for all corpora. We set the batch size to 64 and pre-trained the autoencoder for 15 epochs. We initialized stochastic gradient descent with a learning rate of 0.01 and a momentum of 0.9. During experiments, the choice of initial centroids had a considerable impact on clustering performance when applying the k-means algorithm. To reduce this influence of initialization, we restarted k-means 100 times with different initial centroids, as Huang et al. (2014) and Xu et al. (2017) do, and selected the best centroids, i.e., those obtaining the lowest sum of squared distances of samples to their closest cluster center. Similar to Xu et al. (2017), results are averaged over 5 trials and we also report the standard deviation on the scores.
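The restart strategy for k-means can be sketched as follows (a plain NumPy sketch with hypothetical helper names; a library implementation such as scikit-learn's `KMeans` with its `n_init` parameter serves the same purpose):

```python
import numpy as np

def kmeans(X, k, rng, iters=50):
    """Plain Lloyd's algorithm with random-point initialization."""
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    # final assignment and inertia for the converged centroids
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return C, labels, inertia

def best_of_restarts(X, k, restarts=100, seed=0):
    """Restart k-means with different initial centroids and keep the run
    with the lowest sum of squared distances to the closest centers."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(restarts)),
               key=lambda run: run[2])
```

Selecting the run with the lowest inertia is exactly the criterion described above for choosing among the 100 restarts.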

Results and Discussion
We evaluate clustering performance based on the correspondence between the predicted clusters and the ground truth class labels assigned to each of the short texts. We report two widely used performance metrics, the clustering accuracy (ACC) and the normalized mutual information (NMI) (Huang et al., 2014; Xu et al., 2017).
NMI measures the information shared between the predicted assignments A and the ground truth assignments B, and is defined as

NMI(A, B) = \frac{I(A, B)}{\sqrt{H(A) \, H(B)}}

where I is the mutual information and H is the entropy. When the data is partitioned perfectly, the NMI score is 1, and when A and B are independent, it becomes 0. The clustering accuracy is defined as

ACC = \frac{\sum_{i=1}^{n} \delta(y_i, map(c_i))}{n}

where δ(·) is an indicator function, c_i is the clustering label for x_i, map(·) transforms the clustering label c_i to its group label by the Hungarian algorithm (Papadimitriou and Steiglitz, 1982), and y_i is the true group label of x_i. Results for NMI and accuracy of existing work and the presented model are shown in Table 1. While generic, low-dimensional representations such as Skip-Thought or SIF embeddings have been shown to be beneficial for many NLP tasks, for STC, additional fine-tuning and self-training leads to improved cluster quality. The evaluation results show the superiority of our approach over the STC² model on all but one of the metrics.
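For concreteness, both metrics can be sketched as follows (our own minimal implementation; the brute-force label mapping in `clustering_accuracy` stands in for the Hungarian algorithm and only suits small numbers of clusters):

```python
import numpy as np
from itertools import permutations

def nmi(a, b):
    """NMI(A, B) = I(A, B) / sqrt(H(A) * H(B)) for two label arrays."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    def entropy(x):
        p = np.unique(x, return_counts=True)[1] / n
        return float(-(p * np.log(p)).sum())
    mi = 0.0  # mutual information from the joint contingency table
    for i in np.unique(a):
        for j in np.unique(b):
            p_ij = np.mean((a == i) & (b == j))
            if p_ij > 0:
                mi += p_ij * np.log(p_ij / ((a == i).mean() * (b == j).mean()))
    return mi / np.sqrt(entropy(a) * entropy(b))

def clustering_accuracy(y, c):
    """ACC = (1/n) * sum_i delta(y_i, map(c_i)); tries every one-to-one
    mapping of cluster labels to class labels and keeps the best."""
    y, c = np.asarray(y), np.asarray(c)
    cluster_labels = np.unique(c)
    best = 0.0
    for perm in permutations(np.unique(y)):
        mapping = dict(zip(cluster_labels, perm))
        best = max(best, float(np.mean([mapping[ci] == yi
                                        for ci, yi in zip(c, y)])))
    return best
```

Both functions return 1.0 when the predicted clusters coincide with the ground truth classes up to a relabeling, which is exactly the invariance the `map(·)` step provides.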

Figure 2: Two-dimensional representations of SearchSnippets short texts before application of k-means, for three settings: TfIdf + KMeans, SIF + KMeans, and our model. Colors indicate the C = 8 different ground truth labels.
Qualitatively, the improved cluster quality is also visually apparent in Figure 2, which shows a two-dimensional t-SNE (Maaten and Hinton, 2008) representation of the SearchSnippets short texts before clustering.
The source code of our model, implemented using TensorFlow, is publicly available to encourage further research on STC. 4

Conclusion
We proposed a method for clustering short texts using sentence embeddings and a multi-phase approach, starting from unsupervised SIF embeddings of the short texts. Our STC model then adopts an autoencoder architecture which is fine-tuned for clustering using self-training. Our empirical evaluation on three short text clustering datasets demonstrates accuracies ranging from on par with, to up to 12 percentage points above, those of the state-of-the-art STC² method.