Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach

Fine-tuned pre-trained language models (LMs) have achieved enormous success in many natural language processing (NLP) tasks, but they still require large amounts of labeled data in the fine-tuning stage. We study the problem of fine-tuning pre-trained LMs using only weak supervision, without any labeled data. This problem is challenging because the high capacity of LMs makes them prone to overfitting the noisy labels generated by weak supervision. To address this problem, we develop a contrastive self-training framework, COSINE, to enable fine-tuning LMs with weak supervision. Underpinned by contrastive regularization and confidence-based reweighting, our framework gradually improves model fitting while effectively suppressing error propagation. Experiments on sequence, token, and sentence pair classification tasks show that our model outperforms the strongest baseline by large margins and achieves competitive performance with fully-supervised fine-tuning methods. Our implementation is available at https://github.com/yueyu1030/COSINE.


Introduction
Language model (LM) pre-training and fine-tuning achieve state-of-the-art performance in various natural language processing tasks (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2019). Such approaches stack task-specific layers on top of pre-trained language models, e.g., BERT (Devlin et al., 2019), then fine-tune the models with task-specific data. During fine-tuning, the semantic and syntactic knowledge in the pre-trained LMs is adapted for the target task. Despite their success, one bottleneck for fine-tuning LMs is the requirement of labeled data. When labeled data are scarce, the fine-tuned models often suffer from degraded performance, and the large number of parameters can cause severe overfitting (Xie et al., 2019).

* Equal Contribution.
To relieve the label scarcity bottleneck, we fine-tune the pre-trained language models with only weak supervision. While collecting large amounts of clean labeled data is expensive for many NLP tasks, it is often cheap to obtain weakly labeled data from various weak supervision sources, such as semantic rules (Awasthi et al., 2020). For example, in sentiment analysis, we can use the rules 'terrible'→Negative (a keyword rule) and '* not recommend *'→Negative (a pattern rule) to generate large amounts of weak labels.
Fine-tuning language models with weak supervision is nontrivial. Excessive label noise (e.g., wrong labels) and limited label coverage are common and inevitable in weak supervision. Although existing fine-tuning approaches (Xu et al., 2020; Zhu et al., 2020; Jiang et al., 2020) improve LMs' generalization ability, they are not designed for noisy data and still easily overfit the noise. Moreover, existing works on tackling label noise have limitations and are not designed for fine-tuning LMs. For example, Varma and Ré (2018) use probabilistic models to aggregate multiple weak supervision sources for denoising, but they generate weak labels in a context-free manner, without using LMs to encode contextual information of the training samples (Aina et al., 2019). Other works (Luo et al., 2017; Wang et al., 2019b) focus on noise transitions without explicitly conducting instance-level denoising, and they require clean training samples. Although some recent studies (Awasthi et al., 2020) design labeling-function-guided neural modules to denoise each sample, they require prior knowledge on the weak supervision, which is often infeasible in practice.

Self-training (Rosenberg et al., 2005; Lee, 2013) is a natural tool for fine-tuning language models with weak supervision. It augments the training set with unlabeled data by generating pseudo-labels for them, which improves the models' generalization power and resolves the limited-coverage issue in weak supervision. However, one major challenge of self-training is error propagation: wrong pseudo-labels can cause model performance to gradually deteriorate.
We propose a new algorithm, COSINE, that fine-tunes pre-trained LMs with only weak supervision. COSINE leverages both weakly labeled and unlabeled data, and suppresses label noise via contrastive self-training. Weakly-supervised learning enriches data with potentially noisy labels, and our contrastive self-training scheme fulfills the denoising purpose. Specifically, contrastive self-training regularizes the feature space by pulling samples with the same pseudo-labels close while pushing samples with different pseudo-labels apart. Such regularization enforces representations of samples from different classes to be more distinguishable, so that the classifier can make better decisions. To suppress label noise propagation during contrastive self-training, we propose confidence-based sample reweighting and regularization methods. The reweighting strategy emphasizes samples with high prediction confidence, which are more likely to be correctly classified, in order to reduce the effect of wrong predictions. Confidence regularization encourages smoothness over model predictions, such that no prediction can be over-confident, and therefore reduces the influence of wrong pseudo-labels.
Our model is flexible and can be naturally extended to semi-supervised learning, where a small set of clean labels is available. Moreover, since we do not make assumptions about the nature of the weak labels, COSINE can handle various types of label noise, including biased labels and randomly corrupted labels. Biased labels are usually generated by semantic rules, whereas corrupted labels are often produced by crowd-sourcing.
Our main contributions are: (1) A contrastive-regularized self-training framework that fine-tunes pre-trained LMs with only weak supervision. (2) Confidence-based reweighting and regularization techniques that reduce error propagation and prevent over-confident predictions. (3) Extensive experiments on 6 NLP classification tasks using 7 public benchmarks verifying the efficacy of COSINE. We highlight that our model achieves competitive performance in comparison with fully-supervised models on some datasets, e.g., on the Yelp dataset, we obtain 96.0% accuracy vs. 97.2% for the fully-supervised model.

Background
In this section, we introduce weak supervision and our problem formulation.

Weak Supervision. Instead of using human-annotated data, we obtain labels from weak supervision sources, including keywords and semantic rules. From the weak supervision sources, each input sample x ∈ X is given a label y ∈ Y ∪ {∅}, where Y is the label set and ∅ denotes that the sample is not matched by any rule. For samples that are given multiple labels, e.g., matched by multiple rules, we determine their labels by majority voting.

Problem Formulation. We focus on weakly-supervised classification problems in natural language processing. We consider three types of tasks: sequence classification, token classification, and sentence pair classification. These tasks have a broad scope of applications in NLP, and some examples can be found in Table 1.
Formally, the weakly-supervised classification problem is defined as follows: given weakly-labeled samples X_l and unlabeled samples X_u, we seek to learn a classifier f(x; θ) : X → Y. Here X = X_l ∪ X_u denotes all the samples, and Y = {1, 2, ..., C} is the label set, where C is the number of classes.
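The rule-based labeling and majority-voting aggregation described above can be sketched as follows; the vote format here is a hypothetical stand-in for whatever labels the rules emit:

```python
from collections import Counter

def aggregate_weak_labels(rule_votes):
    """Aggregate per-rule votes for one sample by majority voting.

    rule_votes: list of class labels emitted by the rules that matched
    the sample (empty if no rule matched). Returns a class label, or
    None (the paper's symbol for an unmatched sample) when no rule fires.
    """
    if not rule_votes:
        return None  # unmatched: the sample goes to the unlabeled set X_u
    counts = Counter(rule_votes)
    # most_common(1) returns [(label, count)]; ties resolve by first seen
    return counts.most_common(1)[0][0]
```

Samples mapped to `None` form X_u and receive pseudo-labels later during self-training.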

Method
Our classifier f = g ∘ BERT consists of two parts: BERT is a pre-trained language model that outputs hidden representations of input samples, and g is a task-specific classification head that outputs a C-dimensional vector, where each dimension corresponds to the prediction confidence of a specific class. In this paper, we use RoBERTa as the realization of BERT.
The framework of COSINE is shown in Figure 1. First, COSINE initializes the LM with weak labels; in this step, the semantic and syntactic knowledge of the pre-trained LM is transferred to our model. Then, it uses contrastive self-training to suppress label noise propagation and continues training.

Overview
The training procedure of COSINE is as follows.

Initialization with Weakly-labeled Data. We fine-tune f(·; θ) with weakly-labeled data X_l by solving the optimization problem

$\hat{\theta} = \arg\min_{\theta} \frac{1}{|X_l|} \sum_{x \in X_l} \mathrm{CE}(f(x; \theta), y),$  (1)

where CE(·, ·) is the cross entropy loss. We adopt early stopping (Dodge et al., 2020) to prevent the model from overfitting to the label noise. However, early stopping causes underfitting, and we resolve this issue by contrastive self-training.

Figure 1: The framework of COSINE. We first fine-tune the pre-trained language model on weakly-labeled data with early stopping. Then, we conduct contrastive-regularized self-training to improve model generalization and reduce the label noise. During self-training, we calculate the confidence of each prediction and update the model with high-confidence samples to reduce error propagation.

Table 1: Formulation and example tasks for each classification type.

Contrastive Self-training with All Data. The goal of contrastive self-training is to leverage all data, both labeled and unlabeled, for fine-tuning, as well as to reduce the error propagation of wrongly labeled data. We generate pseudo-labels for the unlabeled data and incorporate them into the training set. To reduce error propagation, we introduce contrastive representation learning (Sec. 3.2) and confidence-based sample reweighting and regularization (Sec. 3.3). We update the pseudo-labels (denoted by ỹ) and the model iteratively. The procedures are summarized in Algorithm 1.
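The initialization objective can be sketched framework-agnostically; `probs` is a hypothetical placeholder for the model's predicted probabilities f(x; θ) from the real forward pass:

```python
import numpy as np

def cross_entropy_loss(probs, weak_labels):
    """Mean CE(f(x; theta), y) over the weakly-labeled set X_l (Eq. 1).

    probs: (n, C) predicted class probabilities.
    weak_labels: (n,) integer class ids from the weak supervision sources.
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), weak_labels] + eps))
```

In practice this step runs only for a small, fixed number of updates (the paper's "earlier stopping") so the model does not memorize noisy weak labels.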
Update ỹ with the current θ. To generate the pseudo-label for each sample x ∈ X, one straightforward way is to use hard labels (Lee, 2013):

$\hat{y}_j = \mathbb{1}\{\, j = \arg\max_{j'} [f(x; \theta)]_{j'} \,\}.$  (2)

Notice that f(x; θ) ∈ R^C is a probability vector and [f(x; θ)]_j denotes its j-th entry. However, these hard pseudo-labels keep only the most likely class for each sample and result in the propagation of labeling mistakes. For example, if a sample is mistakenly classified to a wrong class, assigning a 0/1 label complicates model updating (Eq. 4), in that the model is fitted on erroneous labels. To alleviate this issue, for each sample x in a batch B, we generate soft pseudo-labels (Xie et al., 2016, 2019; Meng et al., 2020; Liang et al., 2020) ỹ ∈ R^C based on the current model as

$\widetilde{y}_j = \frac{[f(x; \theta)]_j^2 / f_j}{\sum_{j'=1}^{C} [f(x; \theta)]_{j'}^2 / f_{j'}},$  (3)

where $f_j = \sum_{x' \in B} [f(x'; \theta)]_j^2$ is the sum over soft frequencies of class j. The non-binary soft pseudo-labels guarantee that, even if our prediction is inaccurate, the error propagated to the model update step will be smaller than with hard pseudo-labels.
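A minimal sketch of the soft pseudo-label computation (Eq. 3), assuming `probs` holds the batch predictions f(x; θ) and following the squared-frequency form described above:

```python
import numpy as np

def soft_pseudo_labels(probs):
    """DEC-style soft pseudo-labels (Eq. 3) for one batch.

    probs: (B, C) array of model probabilities f(x; theta).
    Squaring sharpens confident predictions; dividing by the per-class
    soft frequency f_j counteracts class imbalance within the batch.
    """
    f = (probs ** 2).sum(axis=0)               # f_j: soft frequency of class j
    weighted = probs ** 2 / f                  # numerator of Eq. 3, per sample
    return weighted / weighted.sum(axis=1, keepdims=True)  # renormalize rows
```

The output rows remain valid probability distributions, so they can be consumed directly by the KL-based training loss.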

Contrastive Learning on Sample Pairs
The key ingredient of our contrastive self-training method is to learn representations that encourage data within the same class to have similar representations while keeping data in different classes separated. Specifically, we first select high-confidence samples C from X (Sec. 3.3). Then, for each pair x_i, x_j ∈ C, we define their similarity as

$W_{ij} = \mathbb{1}\{\, \arg\max_k [\widetilde{y}_i]_k = \arg\max_k [\widetilde{y}_j]_k \,\},$  (5)

where ỹ_i, ỹ_j are the soft pseudo-labels (Eq. 3) for x_i, x_j, respectively. For each x ∈ C, we calculate its representation v = BERT(x) ∈ R^d, and we define the contrastive regularizer as

$\mathcal{R}_1(\theta; \widetilde{y}) = \sum_{x_i, x_j \in \mathcal{C}} \ell(v_i, v_j, W_{ij}),$  (6)

$\ell(v_i, v_j, W_{ij}) = W_{ij}\, d_{ij} + (1 - W_{ij}) \max(0, \gamma - d_{ij}).$  (7)

Here, ℓ(·, ·, ·) is the contrastive loss (Chopra et al., 2005; Taigman et al., 2014), d_ij is the distance between v_i and v_j, and γ is a pre-defined margin.
For samples from the same class (W_ij = 1), Eq. 6 penalizes the distance between them, and for samples from different classes, the contrastive loss is large if their distance is small. In this way, the regularizer enforces similar samples to be close while keeping dissimilar samples apart by at least γ. By default, we use the scaled Euclidean distance d_ij = (1/d)‖v_i − v_j‖₂²; more discussions on W_ij and d_ij are in Appendix E. Figure 2 illustrates the contrastive representations. We can see that our method produces clear inter-class boundaries and small intra-class distances, which eases the classification tasks.

Confidence-based Sample Reweighting and Regularization

While contrastive representations yield better decision boundaries, they require samples with high-quality pseudo-labels. In this section, we introduce reweighting and regularization methods to suppress error propagation and refine pseudo-label quality.

Sample Reweighting. In classification tasks, samples with high prediction confidence are more likely to be classified correctly than those with low confidence. Therefore, we further reduce label noise propagation by a confidence-based sample reweighting scheme. For each sample x with soft pseudo-label ỹ, we assign x a weight ω(x) defined by

$\omega(x) = 1 - \frac{H(\widetilde{y})}{\log C}, \quad \text{where } H(\widetilde{y}) = -\sum_{j=1}^{C} \widetilde{y}_j \log \widetilde{y}_j.$  (8)

Notice that if the prediction confidence is low, then H(ỹ) will be large and the sample weight ω(x) will be small, and vice versa. We use a pre-defined threshold ξ to select high-confidence samples C from each batch B as

$\mathcal{C} = \{\, x \in \mathcal{B} : \omega(x) > \xi \,\}.$  (9)

Then we define the loss function as

$\mathcal{L}_c(\theta; \widetilde{y}) = \frac{1}{|\mathcal{C}|} \sum_{x \in \mathcal{C}} \omega(x)\, D_{\mathrm{KL}}(\widetilde{y} \,\|\, f(x; \theta)),$  (10)

where D_KL(·‖·) is the Kullback-Leibler (KL) divergence.
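A sketch of the confidence-based reweighting, assuming the entropy-normalized weight ω(x) = 1 − H(ỹ)/log C described above; the threshold value below is illustrative:

```python
import numpy as np

def sample_weights(soft_labels):
    """Confidence weight omega(x) = 1 - H(y~) / log(C) per sample.

    Entropy is maximal (log C) for a uniform prediction, giving weight ~0;
    a near one-hot prediction has near-zero entropy, giving weight ~1.
    """
    eps = 1e-12
    C = soft_labels.shape[1]
    entropy = -(soft_labels * np.log(soft_labels + eps)).sum(axis=1)
    return 1.0 - entropy / np.log(C)

def high_confidence_mask(soft_labels, xi=0.6):
    """Select the high-confidence subset C = {x in B : omega(x) > xi}."""
    return sample_weights(soft_labels) > xi
```

Only samples passing the mask contribute (with weight ω) to the KL training loss, which suppresses gradients from uncertain, likely mislabeled samples.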
Confidence Regularization. The sample reweighting approach promotes high-confidence samples during contrastive self-training. However, this strategy relies on wrongly-labeled samples having low confidence, which may not hold unless we prevent over-confident predictions. To this end, we propose a confidence-based regularizer that encourages smoothness over predictions, defined as

$\mathcal{R}_2(\theta) = \frac{1}{|\mathcal{C}|} \sum_{x \in \mathcal{C}} D_{\mathrm{KL}}(u \,\|\, f(x; \theta)),$  (11)

where D_KL is the KL-divergence and u is the uniform distribution with u_i = 1/C for i = 1, 2, ..., C. This term regularizes against over-confident predictions and leads to better generalization (Pereyra et al., 2017).
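The confidence regularizer can be sketched as a KL divergence from the uniform distribution; averaging over the batch is our normalization choice:

```python
import numpy as np

def confidence_regularizer(probs):
    """R_2: mean KL(u || f(x)) from the uniform distribution u_i = 1/C.

    Penalizing divergence from the uniform distribution discourages
    over-confident predictions (cf. Pereyra et al., 2017).
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    C = probs.shape[1]
    u = 1.0 / C
    # KL(u || p) = sum_i u * (log u - log p_i), averaged over the batch
    return np.mean((u * (np.log(u) - np.log(probs + eps))).sum(axis=1))
```

The regularizer is zero for uniform predictions and grows as predictions become peaked, so adding it to the loss trades off sharpness against smoothness.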

Experiments
Datasets and Tasks. We conduct experiments on 6 NLP classification tasks using 7 public benchmarks: AGNews (Zhang et al., 2015) is a topic classification task; IMDB (Maas et al., 2011) and Yelp are sentiment analysis tasks; the remaining benchmarks are MIT-R (slot filling), TREC (question classification), ChemProt (relation classification), and WiC (word sense disambiguation). Dataset details are in Appendix A.

Baselines. (ii) Fine-tuning Methods: Self-ensemble (Xu et al., 2020) uses self-ensemble and distillation to improve performance.
Mixup  creates virtual training samples by linear interpolations.
SMART (Jiang et al., 2020) adds adversarial and smoothness constraints to fine-tune LMs and achieves state-of-the-art results on many NLP tasks. (iii) Weakly-supervised Models: The third group of baselines are weakly-supervised models: Snorkel aggregates different labeling functions based on their correlations.
WeSTClass  trains a classifier with generated pseudo-documents and use selftraining to bootstrap over all samples.
Denoise uses an attention network to estimate the reliability of weak supervision sources, and then reduces label noise by aggregating the weak labels.
UST (Mukherjee and Awadallah, 2020) is the state of the art for self-training with limited labels. It estimates uncertainties via MC-dropout (Gal and Ghahramani, 2015), and then selects samples with low uncertainties for self-training.

Evaluation Metrics. We use classification accuracy on the test set as the evaluation metric for all datasets except MIT-R. MIT-R contains a large number of tokens labeled as "Others", so we use the micro F1 score over the other classes for this dataset.

Auxiliary. We implement COSINE in PyTorch and use RoBERTa-base as the pre-trained LM. Datasets and weak supervision details are in Appendix A. Baseline settings are in Appendix B. Training details and setups are in Appendix C. Discussions on early stopping are in Appendix D. Comparisons of distance metrics and similarity measures are in Appendix E.

Learning From Weak Labels
We summarize the weakly-supervised learning results in Table 3. On all the datasets, COSINE outperforms all the baseline models. A special case is the WiC dataset, where we use WordNet to generate weak labels. However, this enables Snorkel to access some labeled data in the development set, making the comparison against other methods unfair. We discuss this dataset further in Sec. 4.3.
In comparison with directly fine-tuning the pre-trained LMs on weakly-labeled data, our model employs an "earlier stopping" technique (Appendix D) so that it does not overfit the label noise. As shown, "Init" indeed achieves better performance, and it serves as a good initialization for our framework. Other fine-tuning methods and weakly-supervised models either cannot harness the power of pre-trained language models, e.g., Snorkel, or rely on clean labels, e.g., the other baselines. We highlight that although UST, the state-of-the-art method to date, achieves strong performance under few-shot settings, it cannot estimate confidence well with noisy labels, which yields inferior performance. Our model can gradually correct wrong pseudo-labels and mitigate error propagation via contrastive self-training.
It is worth noticing that on some datasets, e.g., AGNews, IMDB, Yelp, and WiC, our model achieves the same level of performance as models (RoBERTa-CL) trained with clean labels. This makes COSINE appealing in scenarios where only weak supervision is available.

Figure 3: Results of label corruption on TREC. When the corruption ratio is less than 40%, the performance is close to the fully-supervised method.

Robustness Against Label Noise
Our model is robust against excessive label noise. We corrupt a certain percentage of labels by randomly changing each of them to another class. This is a common scenario in crowd-sourcing, where we assume human annotators mislabel each sample with the same probability. Compared with advanced fine-tuning and self-training methods (e.g., SMART and UST), our model consistently outperforms the baselines.

Semi-supervised Learning
We can naturally extend our model to semi-supervised learning, where clean labels are available for a portion of the data. We conduct experiments on the WiC dataset. As part of the SuperGLUE (Wang et al., 2019a) benchmark, this dataset poses a challenging task: models need to determine whether the same word in different sentences has the same sense (meaning). Different from previous tasks where the labels in the training set are noisy, in this part, we utilize the clean labels provided by the WiC dataset. We further augment the original training data of WiC with unlabeled sentence pairs obtained from lexical databases (e.g., WordNet, Wiktionary). Note that part of the unlabeled data can be weakly labeled by rule matching. This essentially creates a semi-supervised task, where we have labeled data, weakly-labeled data, and unlabeled data.
Since the weak labels of WiC are generated by WordNet and partially reveal the true label information, Snorkel takes this unfair advantage by accessing the unlabeled sentences and weak labels of validation and test data. To make a fair comparison to Snorkel, we consider the transductive learning setting, where we are allowed access to the same information by integrating unlabeled validation and test data and their weak labels into the training set. As shown in Table 4, COSINE with transductive learning achieves better performance than Snorkel. Moreover, in comparison with semi-supervised baselines (i.e., VAT and MT) and fine-tuning methods with extra resources (i.e., SenseBERT), COSINE achieves better performance in both the semi-supervised and transductive learning settings.

Case Study
Error propagation mitigation and wrong-label correction. Figure 4 visualizes this process. Before training, the semantic rules make noisy predictions. After the initialization step, model predictions are less noisy but more biased, e.g., many samples are mislabeled as "Amenity". These predictions are further refined by contrastive self-training. The rightmost figure demonstrates wrong-label correction: samples are indicated by radii of the circle, and classification correctness is indicated by color, i.e., blue means correct and orange means incorrect. From inner to outer, the tori show classification correctness after the initialization stage and after iterations 1, 2, and 3. We can see that many incorrect predictions are corrected within three iterations. To illustrate, the right black dashed line marks a sample that is classified correctly after the first iteration, and the left dashed line marks one that is misclassified after the second iteration but corrected after the third. These results demonstrate that our model can correct wrong predictions via contrastive self-training.

Better data representations. We visualize sample embeddings in Fig. 7. By incorporating the contrastive regularizer R_1, our model learns more compact representations for data in the same class, e.g., the green class, and also enlarges the inter-class distances, e.g., the purple class is more separable from the other classes in Fig. 7(b) than in Fig. 7(a).

Label efficiency. Figure 8 illustrates the number of clean labels needed for the supervised model to outperform COSINE. On both datasets, the supervised model requires a significant number of clean labels (around 750 for AGNews and 120 for MIT-R) to reach our level of performance, whereas our method assumes no clean samples.

Higher confidence indicates better accuracy. Figure 6 shows the relation between prediction confidence and prediction accuracy on IMDB.
We can see that, in general, samples with higher prediction confidence yield higher prediction accuracy. With our sample reweighting method, we gradually filter out low-confidence samples and assign higher weights to the others, which effectively mitigates error propagation.

Ablation Study
Components of COSINE. We inspect the importance of various components, including the contrastive regularizer R_1, the confidence regularizer R_2, the sample reweighting (SR) method, and the soft pseudo-labels. Table 5 summarizes the results, and Fig. 9 visualizes the learning curves. All the components jointly contribute to model performance, and removing any of them hurts classification accuracy. For example, sample reweighting is an effective tool to reduce error propagation, and removing it causes the model to eventually overfit the label noise, e.g., the red bottom line in Fig. 9 shows the classification accuracy increasing and then dropping rapidly. Likewise, replacing the soft pseudo-labels (Eq. 3) with their hard counterparts (Eq. 2) degrades performance, because hard pseudo-labels discard prediction confidence information.

Hyper-parameters of COSINE. In Fig. 5, we examine the effects of different hyper-parameters, including the confidence threshold ξ (Eq. 9), the stopping time T_1 in the initialization step, and the update period T_3 for pseudo-labels. From Fig. 5(a), we can see that setting the confidence threshold too high hurts model performance, because an over-conservative selection strategy results in an insufficient number of training samples. The stopping time T_1 has drastic effects on the model: fine-tuning with weak labels for excessive steps causes the model to overfit the label noise, such that the contrastive self-training procedure cannot correct the error. Also, as T_3, the update period of pseudo-labels, increases, model performance first increases and then decreases. If we update pseudo-labels too frequently, the contrastive self-training procedure cannot fully suppress the label noise; if the updates are too infrequent, the pseudo-labels cannot capture the updated model information.

Related Works
Fine-tuning Pre-trained Language Models. Several methods have been proposed to improve a model's generalization power during the fine-tuning stage (Peters et al., 2019; Dodge et al., 2020; Zhu et al., 2020; Jiang et al., 2020; Xu et al., 2020; Kong et al., 2020; Zhao et al., 2020; Gunel et al., 2021; Zhang et al., 2021; Aghajanyan et al., 2021; Wang et al., 2021). However, most of these methods focus on the fully-supervised setting and rely heavily on large amounts of clean labels, which are not always available. To address this issue, we propose a contrastive self-training framework that fine-tunes pre-trained models with only weak labels. Compared with existing fine-tuning approaches (Xu et al., 2020; Zhu et al., 2020; Jiang et al., 2020), our model effectively reduces the label noise and achieves better performance on various NLP tasks with weak supervision.

Learning From Weak Supervision. In weakly-supervised learning, the training data are usually noisy and incomplete. Existing methods aim to denoise the sample labels or the labeling functions by, for example, aggregating multiple weak supervision sources (Lison et al., 2020), using clean samples (Awasthi et al., 2020), and leveraging contextual information (Mekala and Shang, 2020). However, most of them can only use a specific type of weak supervision for a specific task, e.g., keywords for text classification (Meng et al., 2020; Mekala and Shang, 2020), and they require prior knowledge on the weak supervision sources (Awasthi et al., 2020; Lison et al., 2020), which limits the scope of their applications. Our work is orthogonal to them since we do not denoise the labeling functions directly. Instead, we adopt contrastive self-training to leverage the power of pre-trained language models for denoising, which is task-agnostic and applicable to various NLP tasks with minimal additional effort.

Discussions
Adaptation of LMs to Different Domains. When fine-tuning LMs on data from different domains, we can first continue pre-training on in-domain text data for better adaptation (Gururangan et al., 2020). For some rare domains where BERT trained on general corpora is not optimal, we can use LMs pre-trained on those specific domains (e.g., BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019)) to tackle this issue.

Scalability of Weak Supervision. COSINE can be applied to tasks with a large number of classes, because rules can be generated automatically rather than hand-crafted. For example, we can use label names/descriptions as weak supervision signals (Meng et al., 2020). Such signals are easy to obtain and do not require hand-crafted rules. Once weak supervision is provided, we can create weak labels and apply COSINE.

Flexibility. COSINE can handle tasks and weak supervision sources beyond our conducted experiments. For example, other than semantic rules, crowd-sourcing can be another weak supervision source for generating pseudo-labels (Wang et al., 2013). Moreover, we only conduct experiments on several representative tasks, but our framework can be applied to other tasks as well, e.g., named-entity recognition (token classification) and reading comprehension (sentence pair classification).

Conclusion
In this paper, we propose a contrastive-regularized self-training framework, COSINE, for fine-tuning pre-trained language models with weak supervision. Our framework learns better data representations to ease the classification task, and efficiently reduces label noise propagation via confidence-based reweighting and regularization. We conduct experiments on various classification tasks, including sequence classification, token classification, and sentence pair classification, and the results demonstrate the efficacy of our model.

Broader Impact
COSINE is a general framework that tackles the label scarcity issue by combining neural networks with weak supervision. Weak supervision provides a simple but flexible language to encode domain knowledge and capture the correlations between features and labels. When combined with unlabeled data, our framework can largely resolve the label scarcity bottleneck for training DNNs, enabling them to be applied to downstream NLP classification tasks in a label-efficient manner.
COSINE neither introduces any social/ethical bias to the model nor amplifies any bias in the data. In all the experiments, we use publicly available data, and we build our algorithms using public code bases. We do not foresee any direct social consequences or ethical issues.

A Weak Supervision Details

Such weak supervision sources are cheap to obtain and much more efficient than collecting clean labels; in this way, we can obtain significantly more labeled examples than with human labor. There are two types of semantic rules that we apply as weak supervision:

Keyword Rule: HAS(x, L) → C. If x matches one of the words in the list L, we label it as C.
Pattern Rule: MATCH(x, R) → C. If x matches the regular expression R, we label it as C.
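The two rule types might look like this in code; the whitespace tokenization in the keyword rule is a simplifying assumption (real rules may use richer matching):

```python
import re

def keyword_rule(text, keywords, label):
    """HAS(x, L) -> C: fire `label` if any keyword in L occurs in x."""
    tokens = set(text.lower().split())
    return label if tokens & {k.lower() for k in keywords} else None

def pattern_rule(text, pattern, label):
    """MATCH(x, R) -> C: fire `label` if x matches regular expression R."""
    return label if re.search(pattern, text, flags=re.IGNORECASE) else None
```

A sample collects the outputs of all rules that fire; unmatched samples (every rule returns `None`) are left unlabeled, and multiply-matched samples are resolved by majority voting as described in Sec. 2.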
In addition to the keyword rule and the pattern rule, we can also use third-party tools to obtain weak labels. These tools (e.g., TextBlob) are available online and can be used cheaply, but their predictions are not accurate enough (when such a tool is directly used to predict labels for all training samples, the accuracy on the Yelp dataset is around 60%). We now introduce the semantic rules for each dataset:

AGNews, IMDB, Yelp: We adopt existing rules; please refer to the original paper for detailed information on the rules.
MIT-R, TREC: We use the rules in Awasthi et al. (2020). Please refer to the original paper for detailed information on the rules.
ChemProt: There are 26 rules. We show part of the rules in Table 6.
WiC: Each sense of each word in WordNet has example sentences. For each sentence in the WiC dataset and its corresponding keyword, we collect the example sentences of that word from WordNet. Then for a pair of sentences, the corresponding weak label is "True" if their definitions are the same, otherwise the weak label is "False".

D Early Stopping and Earlier Stopping
Our model adopts the earlier stopping strategy during the initialization stage. We use "earlier stopping" to differentiate from "early stopping", which is standard in fine-tuning algorithms. Early stopping refers to the technique where we stop training when the evaluation score drops. Earlier stopping is self-explanatory: we fine-tune the pre-trained LM for only a few steps, even before the evaluation score starts dropping. This technique efficiently prevents the model from overfitting. For example, as Figure 5(b) illustrates, on the IMDB dataset, our model overfits after 240 iterations of initialization with weak labels. In contrast, the model achieves good performance even after 400 iterations of fine-tuning when using clean labels. This verifies the necessity of earlier stopping.
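The distinction can be sketched as follows: earlier stopping caps the update budget outright rather than waiting for the dev score to drop. `step_fn` and `eval_fn` are hypothetical hooks into the training loop:

```python
def fine_tune_with_earlier_stopping(step_fn, eval_fn, max_steps, eval_every):
    """Run at most `max_steps` updates, tracking the best checkpoint so far.

    Unlike early stopping (train until the dev score starts dropping),
    earlier stopping fixes a small step budget in advance, halting before
    the model can memorize noisy weak labels. `step_fn` performs one
    gradient update; `eval_fn` returns a dev score for the current model.
    """
    best_score, best_step = float("-inf"), 0
    for step in range(1, max_steps + 1):
        step_fn()
        if step % eval_every == 0:
            score = eval_fn()
            if score > best_score:
                best_score, best_step = score, step
    return best_score, best_step
```

In COSINE, the checkpoint from this short run is the "Init" model that seeds contrastive self-training.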

E Comparison of Distance Measures in Contrastive Learning
The contrastive regularizer R_1(θ; ỹ) involves two design choices: the sample distance metric d_ij and the sample similarity measure W_ij. In our implementation, we use the scaled Euclidean distance as the default for d_ij and Eq. 5 as the default for W_ij. Here we discuss other designs.

E.1 Sample distance metric d
Given the encoded vector representations v_i and v_j for samples i and j, we consider two distance metrics.

Scaled Euclidean distance (Euclidean): We calculate the distance between v_i and v_j as

$d_{ij} = \frac{1}{d} \|v_i - v_j\|_2^2.$

Cosine distance (Cos): Besides the scaled Euclidean distance, cosine distance is another widely-used distance metric (we write "Cos" to distinguish it from our model name COSINE):

$d_{ij} = 1 - \frac{v_i^\top v_j}{\|v_i\|_2 \|v_j\|_2}.$

E.2 Sample similarity measures W

Given the soft pseudo-labels ỹ_i and ỹ_j for samples i and j, the following are some designs for W_ij. In all cases, W_ij is scaled into the range [0, 1] (we set γ = 1 in Eq. 7 for the hard similarity).
Hard Similarity: The hard similarity between two samples is

$W_{ij} = \mathbb{1}\{\, \arg\max_k [\widetilde{y}_i]_k = \arg\max_k [\widetilde{y}_j]_k \,\}.$

This is called a "hard" similarity because we obtain a binary label: two samples are similar if their corresponding hard pseudo-labels are the same, and dissimilar otherwise. To accelerate contrastive learning, we adopt a doubly stochastic sampling approximation to reduce the computational cost: the high-confidence samples C in each batch B yield O(|C|²) sample pairs, and we sample |C| pairs from them.
Soft KL-based Similarity: We calculate the similarity based on the KL distance as

$W_{ij} = \exp\left(-\beta \cdot \tfrac{1}{2}\left(D_{\mathrm{KL}}(\widetilde{y}_i \,\|\, \widetilde{y}_j) + D_{\mathrm{KL}}(\widetilde{y}_j \,\|\, \widetilde{y}_i)\right)\right),$

where β is a scaling factor, and we set β = 10 by default.

Soft L2-based Similarity: We calculate the similarity based on the L2 distance as

$W_{ij} = 1 - \tfrac{1}{2}\|\widetilde{y}_i - \widetilde{y}_j\|_2^2.$
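The distance metrics and similarity measures above might be implemented as follows. The hard similarity and the two distances follow the formulas in the text; the exact scaling of the two soft similarity variants is an assumption on our part:

```python
import numpy as np

def scaled_euclidean(v_i, v_j):
    """d_ij = ||v_i - v_j||_2^2 / d (the paper's default distance)."""
    return np.sum((v_i - v_j) ** 2) / v_i.shape[0]

def cosine_distance(v_i, v_j):
    """d_ij = 1 - cos(v_i, v_j)."""
    return 1.0 - np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j))

def hard_similarity(y_i, y_j):
    """W_ij = 1 iff the argmax classes of the two soft labels agree."""
    return float(np.argmax(y_i) == np.argmax(y_j))

def kl_similarity(y_i, y_j, beta=10.0):
    """Soft KL-based similarity, decaying with the symmetrized KL distance.

    The exp(-beta * .) mapping into (0, 1] is an assumed form;
    beta = 10 follows the paper's default scaling factor.
    """
    eps = 1e-12
    kl = lambda p, q: np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    return np.exp(-beta * 0.5 * (kl(y_i, y_j) + kl(y_j, y_i)))

def l2_similarity(y_i, y_j):
    """Soft L2-based similarity, scaled into [0, 1] (form assumed)."""
    return 1.0 - 0.5 * np.sum((y_i - y_j) ** 2)
```

All three similarity measures equal 1 for identical soft labels and decrease as the label distributions diverge, which is what the contrastive regularizer requires.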

E.3 COSINE under different d and W
We show the performance of COSINE with different choices of d and W on AGNews and MIT-R in Table 8. We can see that COSINE is robust to these choices. In our experiments, we use the scaled Euclidean distance and the hard similarity by default.