Unsupervised Fine-tuning for Text Clustering

Fine-tuning pre-trained language models (e.g., BERT) has achieved great success in many supervised language understanding tasks (e.g., text classification). However, relatively little work has focused on applying pre-trained models in unsupervised settings such as text clustering. In this paper, we propose a novel method to fine-tune pre-trained models for text clustering in an unsupervised manner, simultaneously learning text representations and cluster assignments with a clustering-oriented loss. Experiments on three text clustering datasets (TREC-6, Yelp, and DBpedia) show that our model outperforms the baseline methods and achieves state-of-the-art results.


Introduction
Pre-trained language models have shown remarkable progress on many natural language understanding tasks (Radford et al., 2018; Peters et al., 2018; Howard and Ruder, 2018). In particular, BERT (Devlin et al., 2018) applies the fine-tuning approach to achieve ground-breaking performance on a range of NLP tasks. BERT, a deep bidirectional Transformer model (Vaswani et al., 2017), uses a huge amount of unlabeled data to learn complex features and representations, and the pre-trained model is then fine-tuned on downstream tasks with labeled data.
Although BERT has achieved great success in many natural language understanding tasks under supervised fine-tuning, relatively little work has focused on applying pre-trained models in unsupervised settings. In this paper, using text clustering as a case study, we investigate how to leverage the pre-trained BERT model and fine-tune it in an unsupervised setting.
Previous approaches have made some progress on text clustering using deep neural networks (Min et al., 2018; Aljalbout et al., 2018). Existing deep clustering approaches fall into two categories: two-stage and joint optimization. Two-stage approaches use a deep learning framework to learn representations first and then run a clustering algorithm on them (Chen, 2015; Yang et al., 2017). As the name implies, joint optimization approaches learn the representations and the cluster assignments jointly (Xie et al., 2016; Guo et al., 2017). Inspired by these methods, we fine-tune pre-trained models by learning text representations and cluster assignments simultaneously.
In this paper, we propose a novel method to fine-tune pre-trained models for text clustering in an unsupervised manner. Our model simultaneously learns text representations and cluster assignments by jointly optimizing a masked language model loss and a clustering-oriented loss. The masked language model loss helps the model learn domain-specific knowledge and prevents it from collapsing to trivial solutions such as all-zero representations (Yang et al., 2017). The clustering-oriented loss is designed to make the latent representation space more separable. In our experiments, we evaluate the proposed method on three different types of text datasets (TREC-6, Yelp, and DBpedia). Experimental results show that our model achieves state-of-the-art performance on question, sentiment, and topic datasets.

Model
Consider a text dataset $X$ with $n$ samples $\{x_i\}_{i=1}^{n}$. The number of clusters $K$ is known, and $\{\mu_j\}_{j=1}^{K}$ denotes the cluster centers. We aim to learn an encoder $f_\theta$ that maps the $i$th sample $x_i$ to a representation $z_i$ that is well suited to the clustering task. The parameters $\theta$ and the cluster centers $\mu_j$ are learnable. As illustrated in Figure 1, our model combines a masked language model loss $L_m$ and a clustering loss $L_c$. The masked language model loss helps to learn representations on the domain-specific dataset. The clustering loss is responsible for making the representations more discriminative and separable. The overall loss function can be formulated as

$$L = L_m + L_c.$$

We first introduce the BERT model and the masked language model loss $L_m$. We then describe the clustering loss $L_c$ based on the KL divergence. Finally, we present the parameter training details.

Figure 1: Overview of the architecture for unsupervised fine-tuning of a pre-trained model for text clustering.
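As a concrete illustration of the overall objective, the following is a minimal PyTorch-style sketch assuming the HuggingFace transformers library; the clustering term and the centroid parameters are only referenced here and are elaborated in the Clustering Loss section. Variable names are illustrative, and this is not the authors' released code.

```python
# Minimal sketch of the joint objective (illustrative, not the authors' code).
import torch
from transformers import BertForMaskedLM

K = 6                                   # number of clusters (known in advance)
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
centroids = torch.nn.Parameter(torch.randn(K, model.config.hidden_size))  # learnable mu_j

def forward_losses(input_ids, attention_mask, mlm_labels):
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                labels=mlm_labels, output_hidden_states=True)
    l_m = out.loss                      # masked LM loss L_m (negative log-likelihood)
    h = out.hidden_states[-1]           # final-layer hidden vectors h_{i,j}
    z = h.mean(dim=1)                   # text representation z_i via average pooling
                                        # (a padding-aware version is given below)
    # Total loss: L = L_m + L_c, where L_c is the KL-based clustering loss over
    # z and the centroids, defined in the "Clustering Loss" section.
    return l_m, z
```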

Pre-trained Model
BERT (Devlin et al., 2018) is a pre-trained model that has achieved great success on many natural language processing tasks. Its architecture is a multi-layer bidirectional Transformer encoder, which takes words and their positions as input, passes them through an embedding layer and a stack of Transformer blocks, and outputs the final hidden representations. As illustrated in Figure 1, we fine-tune our model as a masked language model, as in (Devlin et al., 2018): some of the input tokens are masked at random and the model predicts the masked tokens. The final hidden representations corresponding to the masked tokens are fed into a softmax layer over the vocabulary, and the masked language model loss $L_m$ is minimized as the negative log-likelihood of the masked tokens.
In a vanilla BERT model, the hidden representation of the [CLS] token is used to represent a sentence or a pair of sentences. In the unsupervised setting, we have no labeled data to fine-tune the model, and without fine-tuning the hidden vector of the [CLS] token may not capture all the information in the text. To obtain a better representation, we apply average pooling to compute the text representation as $z_i = \frac{1}{N}\sum_{j=1}^{N} h_{i,j}$, where $h_{i,j}$ is the hidden vector of the $j$th token in sample $x_i$ and $N$ is the number of tokens. In the next section, we describe how the clustering loss is computed from the representation $z_i$.
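The pooling step can be sketched as the following illustrative PyTorch helper, assuming the final-layer hidden states and the attention mask; padding positions are excluded from the average so that the sum runs only over the $N$ real tokens of each sample.

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token hidden vectors h_{i,j} into a single text representation z_i.

    hidden_states: (batch, seq_len, hidden)  final-layer BERT outputs
    attention_mask: (batch, seq_len)         1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)       # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # N tokens per sample
    return summed / counts                           # z_i = (1/N) * sum_j h_{i,j}
```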

Clustering Loss
The clustering loss on the representations $z_i$ is designed to shape the representation distribution with the help of an auxiliary target distribution (Xie et al., 2016). It is defined as the Kullback-Leibler (KL) divergence between two distributions $P$ and $Q$, where $Q$ is the soft assignment distribution given by Student's t-distribution (Maaten and Hinton, 2008) and $P$ is the target distribution derived from $Q$:

$$L_c = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where $q_{ij}$ is the similarity between text representation $z_i$ and cluster centroid $\mu_j$:

$$q_{ij} = \frac{(1 + \|z_i - \mu_j\|^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j'} (1 + \|z_i - \mu_{j'}\|^2 / \alpha)^{-\frac{\alpha+1}{2}}}.$$

Since we cannot cross-validate $\alpha$ on a validation set in the unsupervised setting, we set $\alpha = 1$ in all experiments, as in (Xie et al., 2016; Guo et al., 2017). We use the soft assignment $q_{ij}$ to assign a label $l_i$ to $x_i$:

$$l_i = \arg\max_j q_{ij}.$$

The target distribution $p_{ij}$ puts more emphasis on data points assigned with high confidence and normalizes the loss contribution of each centroid (Xie et al., 2016). It is computed as

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}},$$

where $f_j = \sum_i q_{ij}$ are the soft cluster frequencies. Because the target distribution $P$ is derived from $Q$, minimizing the clustering loss is a self-training process (Nigam and Ghani, 2000).
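The soft assignment, target distribution, and KL loss above can be sketched in PyTorch as follows; this is an illustration of the DEC-style loss with $\alpha = 1$ by default, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def soft_assign(z: torch.Tensor, centroids: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """q_ij: Student's t similarity between representation z_i and centroid mu_j."""
    dist_sq = torch.cdist(z, centroids) ** 2                 # ||z_i - mu_j||^2, shape (n, K)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)                    # normalize over clusters

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    """p_ij: sharpened target emphasizing high-confidence assignments."""
    f = q.sum(dim=0)                                         # soft cluster frequencies f_j
    p = (q ** 2) / f
    return p / p.sum(dim=1, keepdim=True)

def clustering_loss(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """L_c = KL(P || Q) = sum_i sum_j p_ij * log(p_ij / q_ij)."""
    return F.kl_div(q.log(), p, reduction="batchmean")

# Hard labels follow the soft assignment: l_i = argmax_j q_ij
# labels = soft_assign(z, centroids).argmax(dim=1)
```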
To initialize the cluster centroids, we first extract the representations $z_i$ with the original pre-trained model and then run standard k-means on the representation space $\{z_i\}_{i=1}^{n}$ to obtain the $K$ initial centroids $\{\mu_j\}_{j=1}^{K}$. To avoid instability, we update the target distribution $P$, which depends on the predicted soft labels, once per epoch rather than per batch. To bring the target distribution $P$ closer to a "ground-truth" distribution, we compute $P$ without masking any tokens.
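A sketch of the centroid initialization and the per-epoch target update, assuming scikit-learn's k-means and the `soft_assign` / `target_distribution` helpers from the previous sketch; `encode_all_texts_without_masking` is a hypothetical helper that runs the encoder over the whole dataset with no tokens masked.

```python
import torch
from sklearn.cluster import KMeans

def init_centroids(representations: torch.Tensor, n_clusters: int) -> torch.nn.Parameter:
    """Run standard k-means on the representations z_i extracted from the
    original (not yet fine-tuned) pre-trained model to obtain initial centroids mu_j."""
    km = KMeans(n_clusters=n_clusters, n_init=20)
    km.fit(representations.detach().cpu().numpy())
    return torch.nn.Parameter(torch.tensor(km.cluster_centers_, dtype=torch.float32))

# Once per epoch (not per batch), recompute the target distribution P from the
# current soft assignments Q, using unmasked inputs so that P approximates a
# "ground-truth" distribution:
#   with torch.no_grad():
#       z_all = encode_all_texts_without_masking(model, dataloader)  # hypothetical helper
#       q_all = soft_assign(z_all, centroids)
#       p_all = target_distribution(q_all)
```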

Dataset and Evaluation Metrics
We conduct experiments on three types of text datasets. The TREC dataset (Voorhees and Tice, 1999) is an open-domain, fact-based question dataset with six categories and 5,452 examples. The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014 (Zhang et al., 2015). The Yelp reviews dataset is constructed to predict the number of stars a user has given and has 5 classes (Zhang et al., 2015). Since some algorithms do not scale to the full DBpedia and Yelp datasets (Xie et al., 2016; Guo et al., 2017), we randomly sample 10,000 examples for clustering.
To evaluate the clustering results, we measure clustering purity, a well-known metric for evaluating clustering quality (Manning et al., 2008). To compute purity, each cluster is assigned to the class that is most frequent within it, and the accuracy of this assignment is measured by counting the number of correctly assigned instances and dividing by the total number of instances.
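Concretely, purity can be computed as in the following sketch (a standard formulation assuming integer class and cluster ids, not tied to any particular implementation):

```python
import numpy as np

def purity(true_labels: np.ndarray, cluster_assignments: np.ndarray) -> float:
    """Assign each cluster to its most frequent gold class and report overall accuracy."""
    total_correct = 0
    for cluster_id in np.unique(cluster_assignments):
        members = true_labels[cluster_assignments == cluster_id]
        # count of the majority class within this cluster
        total_correct += np.bincount(members).max()
    return total_correct / len(true_labels)
```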

Implementation
We implement our model based on the bert-base-uncased version of BERT. We set the learning rate to 3e-5. During training, we replace 10% of the tokens with the [MASK] token at random. We set the maximum sequence length to 128, the batch size to 16, and the maximum number of epochs to 10.
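The 10% random masking step can be sketched as follows, assuming a HuggingFace BERT tokenizer; unmasked positions get label -100 so that only masked tokens contribute to the masked LM loss. This is an illustrative sketch of the masking scheme described above, not the authors' code.

```python
import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.10):
    """Randomly replace mask_prob of the (non-special) tokens of one example with [MASK].

    input_ids: 1-D tensor of token ids for a single tokenized example.
    """
    labels = input_ids.clone()
    # Do not mask special tokens such as [CLS], [SEP], [PAD].
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(input_ids.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probs = torch.full(input_ids.shape, mask_prob)
    probs.masked_fill_(special, 0.0)
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                        # only masked positions contribute to L_m
    input_ids = input_ids.clone()
    input_ids[masked] = tokenizer.mask_token_id   # replace selected tokens with [MASK]
    return input_ids, labels
```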

Baseline Methods
We compare our method against the traditional k-means clustering algorithm and three deep clustering algorithms. DEC is Deep Embedded Clustering (Xie et al., 2016), which pre-trains an autoencoder to learn feature representations and uses a cluster assignment hardening loss as a regularizer. IDEC is Improved Deep Embedded Clustering (Guo et al., 2017), which adds a reconstruction term to preserve local structure. DCN is the Deep Clustering Network model, a "two-stage" model proposed by Yang et al. (2017).
We also evaluate two two-stage clustering baselines: AE+k-means, which runs k-means on the features of a pre-trained autoencoder, and BERT+k-means, which runs k-means on the average hidden vectors of the original BERT.

Results
We report the clustering purity results on the three text datasets in Table 1. As the table shows, our model outperforms the baseline methods and achieves state-of-the-art results. Comparing BERT+k-means with AE+k-means, there is a large gap between them, which indicates that a pre-trained model yields better text representations than an autoencoder framework for text clustering. The improvement of our method over BERT+k-means shows that our unsupervised fine-tuning further improves clustering performance. We also train variants of our model by removing the masked LM loss and by using the hidden representation of [CLS] to represent sentences. The results are shown in Table 2. Performance drops markedly when the masked LM loss is removed, which indicates that the masked LM loss is crucial for text clustering. The results also show only a marginal improvement from using average pooling instead of the hidden representation of [CLS].

Conclusion
We proposed a method to fine-tune pre-trained language models in an unsupervised setting for text clustering, providing a way to leverage pre-trained models for this task. Experimental results show that our model achieves state-of-the-art performance on the TREC-6, Yelp, and DBpedia datasets.