Supporting Clustering with Contrastive Learning

Unsupervised clustering aims at discovering the semantic categories of data according to some distance measured in the representation space. However, different categories often overlap with each other in the representation space at the beginning of the learning process, which poses a significant challenge for distance-based clustering in achieving good separation between different categories. To this end, we propose Supporting Clustering with Contrastive Learning (SCCL) – a novel framework to leverage contrastive learning to promote better separation. We assess the performance of SCCL on short text clustering and show that SCCL significantly advances the state-of-the-art results on most benchmark datasets with 3%-11% improvement on Accuracy and 4%-15% improvement on Normalized Mutual Information. Furthermore, our quantitative analysis demonstrates the effectiveness of SCCL in leveraging the strengths of both bottom-up instance discrimination and top-down clustering to achieve better intra-cluster and inter-cluster distances when evaluated with the ground truth cluster labels.


Introduction
Clustering, one of the most fundamental challenges in unsupervised learning, has been widely studied for decades. Long-established clustering methods such as K-means (MacQueen et al., 1967; Lloyd, 1982) and Gaussian Mixture Models (Celeux and Govaert, 1995) rely on distance measured in the data space, which tends to be ineffective for high-dimensional data. On the other hand, deep neural networks are gaining momentum as an effective way to map data to a low-dimensional and hopefully better separable representation space.
Many recent research efforts focus on integrating clustering with deep representation learning by optimizing a clustering objective defined in the representation space (Xie et al., 2016; Jiang et al., 2016; Zhang et al., 2017a; Shaham et al., 2018). Despite promising improvements, the clustering performance is still inadequate, especially in the presence of complex data with a large number of clusters. As illustrated in Figure 1, one possible reason is that, even with a deep neural network, data still has significant overlap across categories before clustering starts. Consequently, the clusters learned by optimizing various distance or similarity based clustering objectives suffer from poor purity.
Figure 1: TSNE visualization of the embedding space learned on SearchSnippets using Sentence Transformer (Reimers and Gurevych, 2019a) as backbone. Each color indicates a ground truth semantic category.
1 We plan to open source our implementation. Please visit https://arxiv.org/abs/2103.12953 for the release updates.
On the other hand, Instance-wise Contrastive Learning (Instance-CL) (Wu et al., 2018; Bachman et al., 2019; He et al., 2020; Chen et al., 2020a,b) has recently achieved remarkable success in self-supervised learning. Instance-CL usually optimizes on an auxiliary set obtained by data augmentation. As the name suggests, a contrastive loss is then adopted to pull together samples augmented from the same instance in the original dataset while pushing apart those from different ones. Essentially, Instance-CL disperses different instances apart while implicitly bringing similar instances together to some extent (see Figure 1). This beneficial property can be leveraged to support clustering by scattering apart the overlapped categories. Clustering can then better separate different clusters while tightening each cluster by explicitly bringing samples in that cluster together.
To this end, we propose Supporting Clustering with Contrastive Learning (SCCL) by jointly optimizing a top-down clustering loss with a bottom-up instance-wise contrastive loss. We assess the performance of SCCL on short text clustering, which has become increasingly important due to the popularity of social media such as Twitter and Instagram. It benefits many real-world applications, including topic discovery (Kim et al., 2013), recommendation (Bouras and Tsogkas, 2017), and visualization (Sebrechts et al., 1999). However, the weak signal caused by noise and sparsity poses a significant challenge for clustering short texts. Although some improvement has been achieved by leveraging shallow neural networks to enrich the representations (Xu et al., 2017; Hadifar et al., 2019), there is still large room for improvement.
We address this challenge with our SCCL model. Our main contributions are the following:
• We propose a novel end-to-end framework for unsupervised clustering, which advances the state-of-the-art results on various short text clustering datasets by a large margin. Furthermore, our model is much simpler than the existing deep neural network based short text clustering approaches that often require multi-stage independent training.
• We provide in-depth analysis and demonstrate how SCCL effectively combines the top-down clustering with the bottom-up instance-wise contrastive learning to achieve better inter-cluster distance and intra-cluster distance.
• We explore various text augmentation techniques for SCCL, showing that, unlike the image domain (Chen et al., 2020a), using composition of augmentations is not always beneficial in the text domain.

Related Work
Self-supervised learning. Self-supervised learning has recently become prominent in providing effective representations for many downstream tasks. Early work focuses on solving different artificially designed pretext tasks, such as predicting masked tokens (Devlin et al., 2019), generating future tokens (Radford et al., 2018), or denoising corrupted tokens (Lewis et al., 2019) for textual data, and predicting colorization (Zhang et al., 2016), rotation (Gidaris et al., 2018), or relative patch position (Doersch et al., 2015) for image data. Nevertheless, the resulting representations are tailored to the specific pretext tasks with limited generalization. Many recent successes are largely driven by instance-wise contrastive learning. Inspired by the pioneering work of Becker and Hinton (1992) and Bromley et al. (1994), Instance-CL treats each data instance and its augmentations as an independent class and tries to pull together the representations within each class while pushing apart different classes (Dosovitskiy et al., 2014; Oord et al., 2018; Bachman et al., 2019; He et al., 2020; Chen et al., 2020a,b). Consequently, different instances are well-separated in the learned embedding space with local invariance being preserved for each instance.
Although Instance-CL may implicitly group similar instances together (Wu et al., 2018), it pushes representations apart as long as they are from different original instances, regardless of their semantic similarities. As a result, the implicit grouping effect of Instance-CL is less stable and more data-dependent, giving rise to worse representations in some cases (Khosla et al., 2020; Li et al., 2020; Purushwalkam and Gupta, 2020).
Short Text Clustering. Compared with the general text clustering problem, short text clustering comes with its own challenge due to the weak signal contained in each instance. In this scenario, BoW and TF-IDF often yield very sparse representation vectors that lack expressive ability. To remedy this issue, some early work leverages neural networks to enrich the representations (Xu et al., 2017; Hadifar et al., 2019), where word embeddings (Mikolov et al., 2013b; Arora et al., 2017) are adopted to further enhance the performance. However, the above approaches divide the learning process into multiple stages, each requiring independent optimization. On the other hand, despite the tremendous successes achieved by contextualized word embeddings (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018; Reimers and Gurevych, 2019b), they have been left largely unexplored for short text clustering. In this work, we leverage the pretrained transformer as the backbone, which is optimized in an end-to-end fashion. As demonstrated in Section 4, we advance the state-of-the-art results on most benchmark datasets with 3%-11% improvement on Accuracy and 4%-15% improvement on NMI.
Figure 2: Training framework of SCCL. During training, we jointly optimize a clustering loss over the original data instances and an instance-wise contrastive loss over the associated augmented pairs.

Model
We aim at developing a joint model that leverages the beneficial properties of Instance-CL to improve unsupervised clustering. As illustrated in Figure 2, our model consists of three components. A neural network ψ(·) first maps the input data to the representation space, which is then followed by two different heads g(·) and f(·) where the contrastive loss and the clustering loss are applied, respectively. Please refer to Section 4 for details.
Our data consists of both the original and the augmented data. Specifically, for a randomly sampled minibatch $\mathcal{B} = \{x_i\}_{i=1}^{M}$, we randomly generate a pair of augmentations for each data instance in $\mathcal{B}$, yielding an augmented batch $\mathcal{B}^a$ with size $2M$, denoted as $\mathcal{B}^a = \{\tilde{x}_i\}_{i=1}^{2M}$.

Instance-wise Contrastive Learning
For each minibatch $\mathcal{B}$, the Instance-CL loss is defined on the augmented pairs in $\mathcal{B}^a$. Let $i^1 \in \{1, \dots, 2M\}$ denote the index of an arbitrary instance in the augmented set $\mathcal{B}^a$, and let $i^2 \in \{1, \dots, 2M\}$ be the index of the other instance in $\mathcal{B}^a$ augmented from the same instance in the original set $\mathcal{B}$. We refer to $\tilde{x}_{i^1}, \tilde{x}_{i^2} \in \mathcal{B}^a$ as a positive pair, while treating the other $2M-2$ examples in $\mathcal{B}^a$ as negative instances regarding this positive pair. Let $\tilde{z}_{i^1}$ and $\tilde{z}_{i^2}$ be the corresponding outputs of the head $g$, i.e., $\tilde{z}_j = g(\psi(\tilde{x}_j))$, $j = i^1, i^2$. Then for $\tilde{x}_{i^1}$, we try to separate $\tilde{x}_{i^2}$ apart from all negative instances in $\mathcal{B}^a$ by minimizing the following
$$\ell^I_{i^1} = -\log \frac{\exp\left(\mathrm{sim}(\tilde{z}_{i^1}, \tilde{z}_{i^2})/\tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{j \neq i^1} \exp\left(\mathrm{sim}(\tilde{z}_{i^1}, \tilde{z}_j)/\tau\right)}. \tag{1}$$
Here $\mathbb{1}_{j \neq i^1}$ is an indicator function and $\tau$ denotes the temperature parameter, which we set as 0.5. Following Chen et al. (2020a), we choose sim(·) as the dot product between a pair of normalized outputs, i.e., $\mathrm{sim}(\tilde{z}_i, \tilde{z}_j) = \tilde{z}_i^T \tilde{z}_j / (\|\tilde{z}_i\|_2 \|\tilde{z}_j\|_2)$. The Instance-CL loss is then averaged over all instances in $\mathcal{B}^a$,
$$\mathcal{L}_{\text{Instance-CL}} = \sum_{i=1}^{2M} \ell^I_i / 2M. \tag{2}$$
To apply the above contrastive loss in the text domain, we explore three different augmentation strategies in Section 4.3.1, where we find that the contextual augmenter (Kobayashi, 2018; Ma, 2019) consistently performs better than the other two.
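The Instance-CL loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the batch layout (entries `i` and `i + M` are the two augmentations of the same original instance) is an assumption made only for this example.

```python
import math


def cosine_sim(u, v):
    """Dot product between two L2-normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def instance_cl_loss(z, tau=0.5):
    """Average contrastive loss over an augmented batch of 2M head outputs.

    Assumed layout (for illustration only): z[i] and z[i + M] are the two
    augmentations of original instance i.
    """
    two_m = len(z)
    m = two_m // 2
    total = 0.0
    for i1 in range(two_m):
        i2 = (i1 + m) % two_m  # index of the positive partner of i1
        pos = math.exp(cosine_sim(z[i1], z[i2]) / tau)
        # The denominator runs over every j != i1: the positive partner
        # plus the remaining 2M - 2 negatives.
        denom = sum(
            math.exp(cosine_sim(z[i1], z[j]) / tau)
            for j in range(two_m)
            if j != i1
        )
        total += -math.log(pos / denom)
    return total / two_m
```

With a batch whose positive pairs coincide, the loss is lower than with a batch whose positive pairs are orthogonal, which is the behavior the loss is meant to reward.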

Clustering
We simultaneously encode the semantic categorical structure into the representations via unsupervised clustering. Unlike Instance-CL, clustering focuses on the high-level semantic concepts and tries to bring together instances from the same semantic category. Suppose our data consists of $K$ semantic categories, and each category is characterized by its centroid in the representation space, denoted as $\mu_k$, $k \in \{1, \dots, K\}$. Let $e_j = \psi(x_j)$ denote the representation of instance $x_j$ in the original set $\mathcal{B}$. Following Maaten and Hinton (2008), we use the Student's t-distribution to compute the probability of assigning $x_j$ to the $k$-th cluster,
$$q_{jk} = \frac{\left(1 + \|e_j - \mu_k\|_2^2/\alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \|e_j - \mu_{k'}\|_2^2/\alpha\right)^{-\frac{\alpha+1}{2}}}. \tag{3}$$
Here $\alpha$ denotes the degree of freedom of the Student's t-distribution. Unless explicitly mentioned otherwise, we follow Maaten and Hinton (2008) by setting $\alpha = 1$ in this paper. We use a linear layer, i.e., the clustering head in Figure 2, to approximate the centroids of each cluster, and we iteratively refine it by leveraging an auxiliary distribution proposed by Xie et al. (2016). Specifically, let $p_{jk}$ denote the auxiliary probability defined as
$$p_{jk} = \frac{q_{jk}^2 / f_k}{\sum_{k'=1}^{K} q_{jk'}^2 / f_{k'}}. \tag{4}$$
Here $f_k = \sum_{j=1}^{M} q_{jk}$, $k = 1, \dots, K$, can be interpreted as the soft cluster frequencies approximated within a minibatch. This target distribution first sharpens the soft-assignment probability $q_{jk}$ by raising it to the second power, and then normalizes it by the associated cluster frequency. By doing so, we encourage learning from high-confidence cluster assignments while simultaneously combating the bias caused by imbalanced clusters.
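As a rough sketch of the two distributions described above, the following plain-Python functions compute the Student's t soft assignments and the sharpened target distribution for a small batch. The function names are hypothetical, and the centroids are passed in as plain lists rather than being read from a clustering head; both choices are assumptions made for illustration.

```python
def soft_assignments(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q[j][k] of each embedding to each centroid."""
    q = []
    for e in embeddings:
        kernel = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(e, mu))
            kernel.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        norm = sum(kernel)
        q.append([v / norm for v in kernel])
    return q


def target_distribution(q):
    """Auxiliary target p[j][k]: square q, divide by soft cluster frequency,
    then renormalize each row."""
    num_clusters = len(q[0])
    # Soft cluster frequencies f_k, summed over the minibatch.
    f = [sum(row[k] for row in q) for k in range(num_clusters)]
    p = []
    for row in q:
        weights = [row[k] ** 2 / f[k] for k in range(num_clusters)]
        norm = sum(weights)
        p.append([w / norm for w in weights])
    return p
```

For an embedding that already leans towards one centroid, the target distribution assigns it an even higher probability for that cluster, which is the sharpening effect the text refers to.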
We push the cluster assignment probability towards the target distribution by optimizing the KL divergence between them,
$$\ell^C_j = \mathrm{KL}\left[p_j \,\|\, q_j\right] = \sum_{k=1}^{K} p_{jk} \log \frac{p_{jk}}{q_{jk}}. \tag{5}$$
The clustering objective is then
$$\mathcal{L}_{\text{Cluster}} = \sum_{j=1}^{M} \ell^C_j / M. \tag{6}$$
This clustering loss was first proposed by Xie et al. (2016) and later adopted by Hadifar et al. (2019) for short text clustering. However, they both require expensive layer-wise pretraining of the neural network, and update the target distribution (Eq (4)) at carefully chosen intervals that often vary across datasets. In contrast, we simplify the learning process to end-to-end training with the target distribution being updated per iteration.
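The clustering loss described above, the KL divergence between target and soft assignments averaged over the minibatch, can be illustrated as follows. This is a minimal sketch with a hypothetical function name; it takes the two distributions as nested lists and says nothing about how gradients are handled in an actual training loop.

```python
import math


def cluster_loss(p, q):
    """Mean over instances of KL(p_j || q_j).

    p and q are lists of per-instance probability vectors of equal shape.
    Terms with p_jk == 0 contribute nothing and are skipped.
    """
    total = 0.0
    for p_row, q_row in zip(p, q):
        total += sum(
            pk * math.log(pk / qk)
            for pk, qk in zip(p_row, q_row)
            if pk > 0.0
        )
    return total / len(p)
```

The loss is zero when the soft assignments already match the target and grows as the two diverge.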
Overall objective. In summary, our overall objective is a weighted sum of the instance-wise contrastive loss and the clustering loss, whose per-instance terms $\ell^C_j$ and $\ell^I_i$ are defined in Eq (5) and Eq (2), respectively. The weight $\eta$ balances between the contrastive loss and the clustering loss of SCCL, which we set as 10 in Section 4 for simplicity. Also note that the clustering loss is optimized over the original data only. Alternatively, we can also leverage the augmented data to enforce local consistency of the cluster assignments for each instance. We discuss this further in Appendix A.3.

Numerical Results
Implementation. We implement our model in PyTorch (Paszke et al., 2017) with the Sentence Transformer library (Reimers and Gurevych, 2019a). We choose distilbert-base-nli-stsb-mean-tokens as the backbone, followed by a linear clustering head (f) of size 768 × K with K indicating the number of clusters. For the contrastive loss, we optimize an MLP (g) with one hidden layer of size 768, and output vectors of size 128.
Datasets. We assess the performance of the proposed SCCL model on eight benchmark datasets for short text clustering. Table 2 provides an overview of the main statistics, and the details of each dataset are as follows.
• SearchSnippets is extracted from web search snippets and contains 12,340 snippets associated with 8 groups (Phan et al., 2008).
• StackOverflow is a subset of the challenge data published by Kaggle 2 , where 20,000 question titles associated with 20 different categories are selected by Xu et al. (2017).
• Biomedical is a subset of the PubMed data distributed by BioASQ 3 , where 20,000 paper titles from 20 groups are randomly selected by Xu et al. (2017).
• GoogleNews contains titles and snippets of 11,109 news articles related to 152 events (Yin and Wang, 2016). Following Rakib et al. (2020), we name the full dataset as GoogleNews-TS, and GoogleNews-T and GoogleNews-S are obtained by extracting the titles and the snippets, respectively.

Comparison with State-of-the-art
We first demonstrate that our model can achieve state-of-the-art or highly competitive performance on short text clustering. For comparison, we consider the following baselines.
• STCC (Xu et al., 2017) consists of three independent stages. For each dataset, it first pretrains a word embedding on a large in-domain corpus using the Word2Vec method (Mikolov et al., 2013a). A convolutional neural network is then optimized to further enrich the representations that are fed into K-means for the final stage clustering.
To demonstrate that our model is robust against the noisy input that often poses a significant challenge for short text clustering, we do not apply any pre-processing procedures on any of the eight datasets. In contrast, all baselines except BoW and TF-IDF considered in this paper either preprocessed the Biomedical dataset (Xu et al., 2017; Hadifar et al., 2019) or all eight datasets by removing the stop words and punctuation and converting the text to lower case (Rakib et al., 2020).
Figure 3: Ablation study of SCCL. In SCCL-Seq, we first train the model using Instance-CL, and then optimize the clustering objective. We exclude Biomedical for better visualization; the full plot can be found in Appendix A.4.
We report the comparison results in Table 1. Rakib et al. (2020) also shows better Accuracy on Tweet and GoogleNews-T, for which we hypothesize two reasons. First, both GoogleNews and Tweet have fewer training examples with many more clusters. It is therefore challenging for instance-wise contrastive learning to manifest its advantages, which often requires a large training dataset. Second, as implied by the clustering performance evaluated on BoW and TF-IDF, clustering GoogleNews and Tweet is less challenging than clustering the other four datasets. Hence, by applying agglomerative clustering on the carefully selected pairwise similarities of the preprocessed data, Rakib et al. (2020) can achieve good performance, especially when the text instances are very short, i.e., Tweet and GoogleNews-T. We also highlight the scalability of our model to large-scale data, whereas agglomerative clustering often suffers from high computational complexity. We discuss this further in Appendix A.5.

Ablation Study
To better validate our model, we run ablations in this section. For illustration, we name the clustering component described in Section 3.2 as Clustering. Besides Instance-CL and Clustering, we also evaluate SCCL against its sequential version (SCCL-Seq) where we first train the model with Instance-CL, and then optimize it with Clustering.
As shown in Figure 3, Instance-CL also groups semantically similar instances together. However, this grouping effect is implicit and data-dependent. In contrast, SCCL consistently outperforms both Instance-CL and Clustering by a large margin. Furthermore, SCCL also achieves better performance than its sequential version, SCCL-Seq. The result validates the effectiveness and importance of the proposed joint optimization framework in leveraging the strengths of both Instance-CL and Clustering to complement each other.

SCCL leads to better separated and less dispersed clusters
To further investigate what enables the better performance of SCCL, we track both the intra-cluster distance and the inter-cluster distance evaluated in the representation space throughout the learning process. For a given cluster, the intra-cluster distance is the average distance between the centroid and all samples grouped into that cluster, and the inter-cluster distance is the distance to its closest neighbor cluster. In Figure 4, we report each type
Figure 4 shows that Clustering achieves smaller intra-cluster distance and larger inter-cluster distance when evaluated on the predicted clusters. It demonstrates the ability of Clustering to tighten each self-learned cluster and separate different clusters apart. However, we observe the opposite when evaluated on the ground truth clusters, along with poor Accuracy and NMI scores. One possible explanation is that data from different ground-truth clusters often have significant overlap in the embedding space before clustering starts (see upper left plot in Figure 1), which makes it hard for our distance-based clustering approach to separate them effectively.
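The two distances defined above can be computed with a short helper like the one below. It is only a sketch under stated assumptions: the centroid is taken to be the mean of the samples in a cluster, distances are Euclidean, and the inter-cluster distance is measured between centroids. The source text does not spell out these choices, and the function names are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def cluster_distances(embeddings, labels):
    """Return (intra, inter) dictionaries keyed by cluster label.

    intra[y]: mean Euclidean distance from the samples of y to its centroid.
    inter[y]: Euclidean distance from the centroid of y to the nearest other
              centroid (assumes at least two clusters).
    Uses math.dist, available in Python 3.8 and later.
    """
    groups = {}
    for e, y in zip(embeddings, labels):
        groups.setdefault(y, []).append(e)
    centroids = {y: centroid(pts) for y, pts in groups.items()}
    intra = {
        y: sum(math.dist(p, centroids[y]) for p in pts) / len(pts)
        for y, pts in groups.items()
    }
    inter = {
        y: min(math.dist(centroids[y], centroids[o]) for o in centroids if o != y)
        for y in centroids
    }
    return intra, inter
```

Passing ground truth labels or predicted labels as `labels` gives the two evaluation settings contrasted in this section.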
Although the implicit grouping effect allows Instance-CL to attain better Accuracy and NMI scores, the resulting clusters are less separated from each other and each cluster is more dispersed, as indicated by the smaller inter-cluster distance and larger intra-cluster distance. This result is unsurprising since Instance-CL only focuses on instance discrimination, which often leads to a more dispersed embedding space. In contrast, we leverage the strengths of both Clustering and Instance-CL to complement each other. Consequently, Figure 4 shows that SCCL leads to better separated clusters with each cluster being less dispersed.

Exploration of Data Augmentations
To study the impact of data augmentation, we explore three different unsupervised text augmentations: (1) WordNet Augmenter 5 transforms an input text by replacing its words with WordNet synonyms (Morris et al., 2020; Ren et al., 2019). (2) Contextual Augmenter 6 leverages the pretrained transformers to find top-n suitable words of the input text for insertion or substitution (Kobayashi, 2018; Ma, 2019). We augment the data via word substitution, and we choose BERT-base and RoBERTa to generate the augmented pairs. (3) Paraphrase via back translation 7 generates paraphrases of the input text by first translating it to another language (French) and then back to English. When translating back to English, we use the mixture of experts model (Shen et al., 2019) to generate ten candidate paraphrases per input to increase diversity.
For both WordNet Augmenter and Contextual Augmenter, we try three different settings by setting the word substitution ratio of each text instance to 10%, 20%, and 30%, respectively. As for Paraphrase via back translation, we compute the BLEU score between each text instance and its ten candidate paraphrases. We then select three pairs, achieving the highest, medium, and lowest BLEU scores, from the ten candidates of each instance. The best results 8 of each augmentation technique are summarized in Table 3, where Contextual Augmenter substantially outperforms the other two. We conjecture that this is because both Contextual Augmenter and SCCL leverage the pretrained transformers as backbones, which allows Contextual Augmenter to generate more informative augmentations.
Figure 5 shows the impact of using composition of data augmentations, in which we explored Contextual Augmenter and CharSwap Augmenter 9 (Morris et al., 2020). As we can see, using composition of data augmentations does boost the performance of SCCL on GoogleNews-TS, where the average number of words in each text instance is 28 (see Table 2). However, we observe the opposite on StackOverflow, where the average number of words in each instance is 8. This result differs from what has been observed in the image domain, where using composition of data augmentations is crucial for contrastive learning to attain good performance. One possible explanation is that generating high-quality augmentations for textual data is more challenging, since changing a single word can invert the semantic meaning of the whole instance. This challenge is compounded when a second round of augmentation is applied on very short text instances, e.g., StackOverflow. We further demonstrate this in Figure 5 (right), where the augmented pairs of StackOverflow largely diverge from the original texts in the representation space after the second round of augmentation.
8 Please refer to Appendix A.2 for details.
9 A simple technique that augments text by substituting, deleting, inserting, and swapping adjacent characters.
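The paraphrase selection step for back translation (picking the candidates with the highest, medium, and lowest BLEU scores) might look like the sketch below. It assumes the BLEU scores have already been computed by some external tool, and it interprets "medium" as the middle-ranked candidate; both the function name and that interpretation are assumptions, not details given in the text.

```python
def pick_three_by_score(candidates, scores):
    """Return the (highest, medium, lowest) scoring candidates.

    candidates: list of paraphrase strings for one input text.
    scores: precomputed similarity scores (e.g. BLEU), one per candidate.
    "Medium" is taken to be the middle element of the ranking (assumption).
    """
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    highest = ranked[0][1]
    medium = ranked[len(ranked) // 2][1]
    lowest = ranked[-1][1]
    return highest, medium, lowest
```

Each of the three selected paraphrases would then define one augmentation setting, mirroring the three substitution ratios used for the other two augmenters.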

Conclusion
We have proposed a novel framework leveraging instance-wise contrastive learning to support unsupervised clustering. We thoroughly evaluate our model on eight benchmark short text clustering datasets, and show that our model either substantially outperforms or performs highly comparably to the state-of-the-art methods. Moreover, we conduct ablation studies to better validate the effectiveness of our model. We demonstrate that, by integrating the strengths of both bottom-up instance discrimination and top-down clustering, our model is capable of generating high-quality clusters with better intra-cluster and inter-cluster distances. Although we only evaluate our model on short text data, the proposed framework is generic and is expected to be effective for various kinds of text clustering problems.
In this work, we explored different data augmentation strategies with extensive comparisons. However, due to the discrete nature of natural language, designing effective transformations for textual data is more challenging compared to the counterparts in the computer vision domain. One promising direction is leveraging the data mixing strategies (Zhang et al., 2017b) to either obtain stronger augmentations (Kalantidis et al., 2020) or alleviate the heavy burden on data augmentation (Lee et al., 2020). We leave this as future work.

References
Alternative 2. Let $j^0$ and $j^1, j^2$ denote the indices of the original text instance and its augmented pair, respectively. We then use the original instance as anchor, and push the cluster assignments of the augmented pair towards it by optimizing the following
$$\ell^C_j = \mathrm{KL}\left[p_{j^0} \,\|\, q_{j^1}\right] + \mathrm{KL}\left[p_{j^0} \,\|\, q_{j^2}\right]. \tag{9}$$
Exploring (8) and (9) is out of the scope of this paper; however, it is worth trying when applying SCCL to different application problems, especially considering that the above alternatives might lead to further performance improvement by jointly optimizing the instance-level and the cluster-assignment-level contrastive learning losses.
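Alternative 2 can be written down directly from its definition. The sketch below is illustrative only: the function names are hypothetical, and the inputs are assumed to be the target distribution of the original instance and the soft assignments of its two augmentations, each given as a plain probability vector.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def anchored_consistency_loss(p_anchor, q_aug1, q_aug2):
    """Per-instance loss of Alternative 2.

    p_anchor: target distribution of the original instance (the anchor).
    q_aug1, q_aug2: soft cluster assignments of its two augmentations.
    """
    return kl_divergence(p_anchor, q_aug1) + kl_divergence(p_anchor, q_aug2)
```

The loss vanishes only when both augmentations receive the same cluster assignment as the anchor's target, which is the local consistency this alternative is meant to encourage.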

A.4 Supplementary materials for ablation study
Figure 6 provides the full version of Figure 3 in Section 4.