Contrastive Distillation on Intermediate Representations for Language Model Compression

Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model into a smaller one. Although widely used, this objective by design assumes that all dimensions of the hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework in which the student is trained to distill knowledge from the intermediate layers of the teacher via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both the pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.


Introduction
Large-scale pre-trained language models (LMs), such as BERT (Devlin et al., 2018), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), have brought revolutionary advancement to the NLP field (Wang et al., 2018). However, as new-generation LMs grow to behemoth size, it becomes increasingly challenging to deploy them in resource-deprived environments. Naturally, there has been a surge of research interest in developing model compression methods (Sun et al., 2019; Sanh et al., 2019; Shen et al., 2019) that reduce the network size of pre-trained LMs while retaining comparable performance and efficiency.
PKD (Sun et al., 2019) was the first known effort in this expedition: an elegant and effective method that uses knowledge distillation (KD) for BERT model compression at the finetuning stage. Later on, DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2019) and MobileBERT (Sun et al., 2020) carried the torch forward and extended similar compression techniques to the pre-training stage, allowing efficient training of task-agnostic compressed models. In addition to the conventional KL-divergence loss applied to the probabilistic outputs of the teacher and student networks, an L2 loss measuring the difference between normalized hidden layers has proven highly effective in these methods. However, the L2 norm follows the assumption that all dimensions of the target representation are independent, which overlooks important structural information in the many hidden layers of the BERT teacher.
Motivated by this, we propose Contrastive Distillation on Intermediate Representations (CoDIR), which uses a contrastive objective to capture higher-order output dependencies between the intermediate representations of the BERT teacher and the student. Contrastive learning (Gutmann and Hyvärinen, 2010) aims to learn representations by pulling similar elements closer and pushing dissimilar elements further apart. Formulated in either a supervised or an unsupervised way, it has been successfully applied to diverse applications (Hjelm et al., 2018; He et al., 2019; Tian et al., 2019; Khosla et al., 2020). To the best of our knowledge, utilizing contrastive learning to compress large Transformer models is still unexplored territory, which is the main focus of this paper.
A teacher network's hidden layers usually contain rich semantic and syntactic knowledge that can be instrumental if successfully passed on to the student (Tenney et al., 2019; Kovaleva et al., 2019; Sun et al., 2019). Thus, instead of directly applying the contrastive loss to the final output layer of the teacher, we apply contrastive learning to its intermediate layers, in addition to using a KL-divergence loss between the probabilistic outputs of the teacher and student. This imposes a stronger regularization effect on student training by capturing more informative signals from intermediate representations. To maximally exploit the teacher's intermediate layers, we also propose using the mean-pooled representation as the distillation target, which is empirically more effective than the commonly used [CLS] token.
To realize contrastive distillation, we define a congruent pair (h_i^t, h_i^s) as the pair of representations of the same data input produced by the teacher and student networks, as illustrated in Figure 1. An incongruent pair (h_i^t, h_j^s) is defined as the pair of representations of two different data samples passed through the teacher and the student networks, respectively. The goal is to train the student network to distinguish the congruent pair from a large set of incongruent pairs, by minimizing the contrastive objective.
For efficient training, all data samples are stored in a memory bank (Wu et al., 2018; He et al., 2019). During finetuning, incongruent pairs can be selected by choosing sample pairs with different labels to maximize the distance. For pre-training, however, it is not straightforward to construct incongruent pairs this way, as labels are unavailable. Thus, we randomly sample data points from the same mini-batch to form incongruent pairs, constructing a proxy congruent-incongruent sample pool that assimilates what is observed in the downstream tasks during the finetuning stage. This and other designs in CoDIR make contrastive learning possible for LM compression, and have demonstrated strong performance and high efficiency in experiments.
Our contributions are summarized as follows. (i) We propose CoDIR, a principled framework to distill knowledge in the intermediate representations of large-scale language models via a contrastive objective, instead of a conventional L2 loss. (ii) We propose effective sampling strategies to enable CoDIR in both pre-training and finetuning stages. (iii) Experimental results demonstrate that CoDIR can successfully train a half-size Transformer model that achieves performance competitive with BERT-base on the GLUE benchmark (Wang et al., 2018), with half the training time and GPU demand. Our pre-trained model checkpoint will be released for public access.
Related Work

Pruning and Quantization Quantization refers to storing model parameters in 8-bit or even lower precision instead of 32- or 16-bit floating point. Directly truncating parameter values causes significant accuracy loss, hence quantization-aware training has been developed to maintain accuracy similar to the original model (Shen et al., 2019; Zafrir et al., 2019). Michel et al. (2019) found that even after most attention heads are removed, the model still retains similar accuracy, indicating high redundancy in the learned model weights. Later studies proposed different pruning-based methods. For example, Gordon et al. (2020) simply removed the model weights that are close to zero, while Guo et al. (2019) used re-weighted L1 and a proximal algorithm to prune weights to zero. Note that simple pruning does not improve inference speed, unless there is a structural change such as removing a whole attention head.
There are also efforts to improve the Transformer block directly. Typically, language models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) can only handle a sequence of up to 512 tokens. Kitaev et al. (2020) proposed using reversible residual layers and locality-sensitive hashing to reduce memory usage when dealing with extremely long sequences. Besides, Wu et al. (2020) proposed using convolutional neural networks to capture short-range attention, so that reducing the size of self-attention does not significantly hurt performance.
Another line of research on model compression is based on knowledge transfer, or knowledge distillation (KD) (Hinton et al., 2015), which is the main focus of this paper. Note that the previously introduced model compression techniques are orthogonal to KD, and can be bundled with it for further speedup. Distilled BiLSTM (Tang et al., 2019) proposed to distill knowledge from BERT into a simple LSTM. Though achieving more than 400 times speedup compared to BERT-large, it suffers from significant performance loss due to the shallow network architecture. DistilBERT (Sanh et al., 2019) proposed to distill predicted logits from the teacher model into a student model with 6 Transformer blocks. BERT-PKD (Sun et al., 2019) proposed to distill not only the logits, but also the representation of [CLS] tokens from the intermediate layers of the teacher model. TinyBERT (Jiao et al., 2019), MobileBERT (Sun et al., 2020) and SID (Aguilar et al., 2019) further improve on BERT-PKD by distilling more internal representations to the student, such as embedding layers and attention weights. These existing methods can generally be divided into two categories: (i) task-specific, and (ii) task-agnostic. Task-specific methods, such as Distilled BiLSTM, BERT-PKD and SID, require training an individual teacher model for each downstream task, while task-agnostic methods such as DistilBERT, TinyBERT and MobileBERT use KD to pre-train a model that can be applied to all downstream tasks via standard finetuning.

[Figure 1: Overview of the proposed CoDIR framework for language model compression in both pre-training and finetuning stages. "Trm" represents a Transformer block, X are input tokens, f^t and f^s are teacher and student models, and X_0, {X_i}_{i=1}^K represent one positive sample and a set of negative samples, respectively. The difference between CoDIR pre-training and finetuning mainly lies in the negative example sampling method.]
Contrastive Representation Learning Contrastive learning (Gutmann and Hyvärinen, 2010; Arora et al., 2019) is a popular research area that has been successfully applied to density estimation and representation learning, especially in the self-supervised setting (He et al., 2019; Chen et al., 2020). It has been shown that the contrastive objective can be interpreted as maximizing a lower bound of the mutual information between different views of the data (Hjelm et al., 2018; Oord et al., 2018; Bachman et al., 2019; Hénaff et al., 2019), though it is unclear whether this success is determined by the mutual information itself or by the specific form of the contrastive loss (Tschannen et al., 2019). Recently, contrastive learning has been extended to knowledge distillation and cross-modal transfer for image classification tasks (Tian et al., 2019). Different from prior work, we propose using a contrastive objective for Transformer-based model compression, focusing on language understanding tasks.

CoDIR for Model Compression
In this section, we first provide an overview of the proposed method in Sec. 3.1, then describe the details of contrastive distillation in Sec. 3.2. Its adaptation to pre-training and finetuning is further discussed in Sec. 3.3.

Framework Overview
We use RoBERTa-base (Liu et al., 2019) as the teacher network, denoted as f^t, which has 12 layers with 768-dimensional hidden representations. We aim to transfer the knowledge of f^t into a student network f^s, where f^s is a 6-layer Transformer (Vaswani et al., 2017) trained to mimic the behavior of f^t (Hinton et al., 2015). Denote a training sample as (X, y), where X = (x_0, ..., x_{L-1}) is a sequence of tokens of length L, and y is the corresponding label (if available). The word embedding matrix of X is represented as X ∈ R^{L×d}, where d is the hidden dimension. The intermediate representations at each layer of the teacher and student are denoted as H^t = (H_1^t, ..., H_{12}^t) and H^s = (H_1^s, ..., H_6^s), respectively, where H_i^t, H_i^s ∈ R^{L×d} contain all the hidden states of one layer. Finally, z^t, z^s ∈ R^k are the logit representations (before the softmax layer) of the teacher and student, respectively, where k is the number of classes.
As illustrated in Figure 1, our distillation objective consists of three components: (i) the original training loss from the target task; (ii) a conventional KL-divergence-based loss to distill the knowledge of z^t into z^s; and (iii) the proposed contrastive loss to distill the knowledge of H^t into H^s. The final training objective can be written as:

min_θ L_CE + α_1 L_KD + α_2 L_CRD ,    (1)

where L_CE, L_KD and L_CRD correspond to the original loss, the KD loss and the contrastive loss, respectively. θ denotes all the learnable parameters in the student f^s, while the teacher network is pre-trained and kept fixed. α_1, α_2 are two hyper-parameters that balance the loss terms. L_CE is typically implemented as a cross-entropy loss for classification problems, and L_KD can be written as

L_KD = ρ^2 · KL( g(z^t / ρ) || g(z^s / ρ) ) ,

where g(·) denotes the softmax function, and ρ is the temperature. L_KD encourages the student network to produce outputs distributionally similar to those of the teacher network.
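The two supervised terms above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, default `rho`/`alpha` values, and the ρ² gradient-scaling convention (common in KD following Hinton et al., 2015) are our assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(z_t, z_s, rho=2.0):
    """Temperature-scaled KL divergence between teacher and student logits."""
    # Soften both distributions with temperature rho; the rho^2 factor keeps
    # gradient magnitudes comparable across temperatures (a common convention).
    p_t = F.softmax(z_t / rho, dim=-1)
    log_p_s = F.log_softmax(z_s / rho, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (rho ** 2)

def total_loss(loss_ce, loss_kd, loss_crd, alpha1=0.7, alpha2=0.1):
    """Eqn. (1): L = L_CE + alpha1 * L_KD + alpha2 * L_CRD."""
    return loss_ce + alpha1 * loss_kd + alpha2 * loss_crd
```

When the student's logits match the teacher's exactly, `kd_loss` is zero, so the KD term only penalizes distributional mismatch.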
Relying only on the final logit output for distillation discards the rich information hidden in the intermediate layers of BERT. Recent work (Sun et al., 2019; Jiao et al., 2019) has found that distilling the knowledge from intermediate representations with an L2 loss can further enhance performance. Following the same intuition, our proposed method also aims to achieve this goal, with a more principled contrastive objective as detailed below.

Contrastive Distillation
First, we describe how to summarize intermediate representations into a concise feature vector. Based on this, we detail how to perform contrastive distillation (Tian et al., 2019) for model compression.
Intermediate Representation Directly using H^t and H^s for distillation is infeasible, as the total feature dimension is |H^s| = 6 × 512 × 768 ≈ 2.4 million for a sentence of full length (i.e., L = 512). Therefore, we propose to first perform mean-pooling over H^t and H^s to obtain a layer-wise sentence embedding. Note that the embedding of the [CLS] token could also be used for this purpose; however, in practice we found that mean-pooling performs better. Specifically, we conduct a row-wise average over H_i^t and H_i^s:

h̄_i^t = mean(H_i^t) ∈ R^d ,  h̄_i^s = mean(H_i^s) ∈ R^d ,

and concatenate the pooled embeddings of all layers into h̄^s = [h̄_1^s; ...; h̄_6^s] ∈ R^{6d} and h̄^t = [h̄_1^t; ...; h̄_{12}^t] ∈ R^{12d}. Two linear mappings φ^s: R^{6d} → R^m and φ^t: R^{12d} → R^m are then applied to project h̄^t and h̄^s into the same low-dimensional space, yielding h^t, h^s ∈ R^m, which are used for calculating the contrastive loss.
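A sketch of the pool-concatenate-project pipeline described above. The class name `LayerSummarizer` and its constructor arguments are illustrative (the paper would use 12 teacher / 6 student layers with d = 768); only the shapes follow the text.

```python
import torch
import torch.nn as nn

class LayerSummarizer(nn.Module):
    """Mean-pool each hidden layer, concatenate across layers, and project
    into the shared m-dimensional space used by the contrastive loss."""

    def __init__(self, num_layers, hidden_dim, proj_dim):
        super().__init__()
        # phi: R^{num_layers * d} -> R^m
        self.proj = nn.Linear(num_layers * hidden_dim, proj_dim)

    def forward(self, hidden_states):
        # hidden_states: list of (batch, seq_len, hidden_dim), one per layer.
        pooled = [h.mean(dim=1) for h in hidden_states]  # row-wise average
        concat = torch.cat(pooled, dim=-1)               # (batch, layers * d)
        return self.proj(concat)                         # (batch, m)
```

One summarizer each is needed for the teacher (12 × d → m) and the student (6 × d → m), since their concatenated widths differ.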
Contrastive Objective Given a training sample (X_0, y_0), we first randomly select K negative samples with different labels, denoted as {(X_i, y_i)}_{i=1}^K. Following the above process, we obtain the summarized intermediate representations h_0^t, h_0^s ∈ R^m by sending X_0 through both the teacher and the student network. Similarly, for the negative samples we obtain {h_i^s}_{i=1}^K. Contrastive learning aims to map the student's representation h_0^s close to the teacher's representation h_0^t, while pushing the negative samples' representations {h_i^s}_{i=1}^K far apart from h_0^t. To achieve this, we use the following InfoNCE loss (Oord et al., 2018) for model training:

L_CRD = − log [ exp(⟨h_0^t, h_0^s⟩/τ) / ( exp(⟨h_0^t, h_0^s⟩/τ) + Σ_{i=1}^K exp(⟨h_0^t, h_i^s⟩/τ) ) ] ,

where ⟨·, ·⟩ denotes the cosine similarity between two feature vectors, and τ is the temperature that controls the concentration level. In effect, contrastive distillation is implemented as a (K+1)-way classification task, which can be interpreted as maximizing a lower bound of the mutual information between h_0^t and h_0^s (Oord et al., 2018; Tian et al., 2019).

Pre-training and Finetuning Adaptation
Memory Bank For a positive pair (h_0^t, h_0^s), one needs to compute the intermediate representations of all the negative samples, i.e., {h_i^s}_{i=1}^K, which requires K+1 times the computation of normal training. A large number of negative samples is required to ensure performance (Arora et al., 2019), which renders large-scale contrastive distillation infeasible for practical use. To address this issue, we follow Wu et al. (2018) and use a memory bank M ∈ R^{N×m} to store the intermediate representations of all N training examples, where only the representations of the positive samples are updated in each forward propagation. Therefore, the training cost is roughly the same as in normal training. Specifically, assume the mini-batch size is 1; then at each training step, M is updated as:

m_0 ← β · m_0 + (1 − β) · h_0^s ,

where m_0 is the representation retrieved from the memory bank M that corresponds to h_0^s, and β ∈ (0, 1) is a hyper-parameter that controls how aggressively the memory bank is updated.
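The bank can be sketched as a simple tensor with momentum updates. This is an illustrative implementation assuming the Wu et al. (2018)-style exponential-moving-average update written above; the class name, random initialization, and `beta` default are ours.

```python
import torch

class MemoryBank:
    """Stores one m-dimensional student representation per training example."""

    def __init__(self, num_examples, dim, beta=0.5):
        # N x m bank; random init is an assumption for this sketch.
        self.bank = torch.randn(num_examples, dim)
        self.beta = beta

    def retrieve(self, indices):
        """Fetch cached negative representations without recomputing them."""
        return self.bank[indices]

    def update(self, indices, h_s):
        # Only the positive samples' rows are refreshed each step, so the
        # per-step cost matches normal training.
        old = self.bank[indices]
        self.bank[indices] = self.beta * old + (1.0 - self.beta) * h_s.detach()
```

With `beta` close to 1 the bank changes slowly (stale but stable negatives); closer to 0, it tracks the current student more aggressively.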
Finetuning Since task-specific label supervision is available in the finetuning stage, applying CoDIR to finetuning is relatively straightforward: when selecting negative samples from the memory bank, we make sure the selected samples have labels different from the positive sample's.
Pre-training For pre-training, the target task becomes masked language modeling (MLM) (Devlin et al., 2018); therefore, we replace the L_CE loss in Eqn. (1) with the MLM loss. Following RoBERTa (Liu et al., 2019), we do not include the next-sentence-prediction task in pre-training, as it does not improve performance on downstream tasks. Since task-specific label supervision is unavailable during pre-training, we propose an effective method to select negative samples from the memory bank. Specifically, we sample negative examples randomly from the same mini-batch each time, as they have closer semantic meaning: some of them come from the same article, especially for BookCorpus (Zhu et al., 2015). We then use the sampled negative examples to retrieve representations from the memory bank. Intuitively, negative examples sampled in this way serve as "hard" negatives compared to sampling randomly from the whole training corpus; otherwise, the L_CRD loss could easily drop to zero because the task is too easy.
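The two sampling strategies can be sketched side by side. The function names and index-based interface are hypothetical; only the selection rules (different-label negatives for finetuning, in-batch negatives for pre-training) come from the text.

```python
import random

def sample_negatives_finetune(idx, labels, K):
    """Finetuning: negatives are examples whose label differs from the
    anchor's, ensuring every incongruent pair crosses class boundaries."""
    candidates = [i for i, y in enumerate(labels) if i != idx and y != labels[idx]]
    return random.sample(candidates, min(K, len(candidates)))

def sample_negatives_pretrain(idx, batch_indices, K):
    """Pre-training: no labels, so draw 'hard' negatives from the same
    mini-batch, which tend to be semantically close (e.g., same article)."""
    candidates = [i for i in batch_indices if i != idx]
    return random.sample(candidates, min(K, len(candidates)))
```

The returned indices would then be passed to the memory bank to retrieve cached representations, avoiding any extra forward passes for negatives.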

Experiments
In this section, we present comprehensive experiments on a wide range of downstream tasks and provide detailed ablation studies, to demonstrate the effectiveness of the proposed approach to large-scale LM compression.

Datasets
We evaluate the proposed approach on sentence classification tasks from the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018). As our finetuning framework is designed for classification, we exclude only the STS-B dataset (Cer et al., 2017), which is a regression task. Following other work (Sun et al., 2019; Jiao et al., 2019; Sun et al., 2020), we also do not run experiments on the WNLI dataset (Levesque et al., 2012), as it is very difficult and even majority voting outperforms most benchmarks.

CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2019) contains a collection of 8.5k sentences drawn from books or journal articles. The goal is to predict whether a given sequence of words is grammatically correct. Matthews correlation coefficient is used as the evaluation metric.
SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) consists of 67k human-annotated movie reviews. The goal is to predict whether each review is positive or negative. Accuracy is used as the evaluation metric.

MRPC The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) consists of 3.7k sentence pairs extracted from online news, and the goal is to predict whether each pair of sentences is semantically equivalent. The F1 score from the GLUE server is reported as the metric.
QQP The Quora Question Pairs task consists of 393k question pairs from the Quora website. The task is to predict whether a pair of questions is semantically equivalent. Accuracy is used as the evaluation metric.
NLI The Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2017), Question-answering NLI (QNLI) (Rajpurkar et al., 2016) and Recognizing Textual Entailment (RTE) are all natural language inference (NLI) tasks, consisting of 393k/108k/2.5k pairs of premise and hypothesis, respectively. The goal is to predict whether the premise entails the hypothesis, contradicts it, or neither. Accuracy is used as the evaluation metric. In addition, the MNLI test set is divided into two splits: matched (MNLI-m, in-domain) and mismatched (MNLI-mm, cross-domain); accuracy on both is reported.

Implementation Details
We mostly follow the pre-training setting of RoBERTa (Liu et al., 2019) and use the fairseq implementation (Ott et al., 2019). Specifically, we truncate raw text into sentences with a maximum length of 512 tokens, and randomly mask 15% of the tokens as [MASK].
For the model architecture, we use a randomly initialized 6-layer Transformer as the student, and RoBERTa-base with a 12-layer Transformer as the teacher. The student model is first trained using the Adam optimizer with learning rate 0.0007 and batch size 8,192 for 35,000 steps. For computational efficiency, this model serves as the initialization for second-stage pre-training with the teacher. The student model is then trained for another 10,000 steps with KD and the proposed contrastive objective, with the learning rate set to 0.0001. We denote this model as CoDIR-Pre.
For ablation purposes, we also train two baseline models with only the MLM loss or the KD loss, using the same learning rate and number of steps; these two models are denoted as MLM-Pre and KD-Pre, respectively. For the other pre-training hyper-parameters, we use α_1 = α_2 = 0.1 for both L_KD and L_CRD. Due to the high computational cost of pre-training, all these hyper-parameters are set empirically without tuning. As there exist many combinations of pre-training losses (MLM, KD, and CRD) and finetuning strategies (standard finetuning with cross-entropy loss, and finetuning with an additional CRD loss), a grid search over all hyper-parameters is infeasible. Thus, for standard finetuning, we search the learning rate in {1e-5, 2e-5} and the batch size in {16, 32}. The combination with the highest score on the dev set is reported for ablation studies, and is kept fixed for subsequent experiments. We then fix the KD hyper-parameters as ρ = 2 and α_1 = 0.7, and search the weight of the CRD loss α_2 in {0.1, 0.5, 1} and the number of negative samples in {100, 500, 1000}. Results with the highest dev scores were submitted to the official GLUE server to obtain the final results. For fair comparison with other baseline methods, all results are based on single-model performance.

Experimental Results
Results of different methods from the official GLUE server are summarized in Table 1. For simplicity, we denote our baseline approach without any teacher supervision as "MLM-Pre + Fine": pre-trained using the MLM loss first, then finetuned using the standard cross-entropy loss. This baseline already achieves a high average score across the 8 tasks, and outperforms task-specific model compression methods (such as SID (Aguilar et al., 2019) and BERT-PKD (Sun et al., 2019)) as well as DistilBERT (Sanh et al., 2019) by a large margin. After adding the contrastive loss at the finetuning stage (denoted as CoDIR-Fine), the model outperforms the state-of-the-art compression method, TinyBERT with a 6-layer Transformer, on average GLUE score. On datasets with fewer training samples, such as CoLA and MRPC, the margin of improvement is especially large (+2.5% and +2.1%, respectively). Compared to our MLM-Pre + Fine baseline, CoDIR-Fine achieves significant performance gains on almost all tasks (+1.2% absolute improvement on the average score), demonstrating the effectiveness of the proposed approach. The only exception is QQP (-0.1%), which has more than 360k training examples. In such cases, standard finetuning may already bring enough performance boost given the large-scale labeled dataset, and the gap between the teacher and student networks is already small (89.6 vs. 89.2).
We further test the effectiveness of CoDIR for pre-training (CoDIR-Pre) by applying standard finetuning to the model pre-trained with the additional contrastive loss. Again, compared to the MLM-Pre + Fine baseline, this improves model performance on almost all tasks (except QQP), with a significant lift in the average score (+1.4%). We notice that this model performs similarly to the contrastive-finetuning-only approach (CoDIR-Fine) on almost all tasks. However, CoDIR-Pre is preferred because it utilizes the teacher's knowledge in the pre-training stage, so no task-specific teacher is needed when finetuning on downstream tasks. Finally, we experiment with the combination of CoDIR-Pre and CoDIR-Fine, and observe that adding the contrastive loss during finetuning brings little further improvement once it has already been used in pre-training. Our hypothesis is that the model's ability to identify negative examples is already well learned during pre-training.
Inference Speed We compare the inference speed of the proposed CoDIR student with the teacher network and other baselines. Statistics of Transformer layers and parameters are presented in Table 3. The statistics for BERT6-PKD and TinyBERT6 are omitted as they share the same model architecture as DistilBERT. To test inference speed, we ran each algorithm on the MNLI dev set three times, with batch size 32 and maximum sequence length 128, under the same hardware configuration. The average running time over 3 different random seeds is reported as the final inference speed. Though our RoBERTa teacher has almost 16 million more parameters than BERT-base, it has almost the same inference speed, because the additional parameters come mainly from the embedding layer with its 50k vocabulary size, which does not affect inference speed. By reducing the number of Transformer layers to 6, our proposed student model achieves a 2× speedup over the teacher, and achieves state-of-the-art performance among all models with similar inference time.

Ablation Studies
Sentence Embedding We conduct experiments to evaluate the effectiveness of different sentence embedding strategies. Specifically, based on the same model pre-trained with L_MLM alone, we run finetuning experiments with the contrastive loss on the GLUE dataset using either the [CLS] embedding or the mean-pooled representation as the distillation target.

Contrastive Loss We first evaluate the effectiveness of the proposed CRD loss for finetuning on a subset of the GLUE dev set, using the following settings: (i) finetuning with the cross-entropy loss only; (ii) finetuning with an additional KD loss; and (iii) finetuning with additional KD and CRD losses. Results in Table 2 (upper part) show that using KD improves over standard finetuning by 0.9% on average, and using the CRD loss further improves another 1.0%, demonstrating the advantage of contrastive learning for finetuning. To further validate the performance improvement of using the contrastive loss in pre-training, we apply standard finetuning to three different pre-trained models: (i) a model pre-trained with L_MLM (MLM-Pre); (ii) a model pre-trained with L_MLM + L_KD (KD-Pre); and (iii) a model pre-trained with L_MLM + L_KD + L_CRD (CoDIR-Pre). Results are summarized in Table 2 (bottom part). A similar trend can be observed: the model pre-trained with the additional CRD loss performs best, outperforming MLM-Pre and KD-Pre by 1.9% and 1.0% on average, respectively.
Model Variance Since different random seeds can exhibit different generalization behaviors, especially for tasks with a small training set (e.g., CoLA), we examine the median, maximum and standard deviation of model performance on the dev set of each GLUE task, and present the results in Table 5. As expected, the models are more stable on larger datasets (SST-2, QQP, MNLI, and QNLI), where all standard deviations are lower than 0.5. However, the model is sensitive to random seeds on smaller datasets (CoLA, MRPC, and RTE), with standard deviations around 1.5. These analyses provide potential references for future work on language model compression.

Conclusion
In this paper, we present CoDIR, a novel approach to large-scale language model compression via the use of a contrastive loss. CoDIR utilizes information from both the teacher's output layer and its intermediate layers for student model training. Extensive experiments demonstrate that CoDIR is highly effective in both finetuning and pre-training stages, achieving state-of-the-art performance on the GLUE benchmark compared to existing models of similar size. Existing work all uses either BERT-base or RoBERTa-base as the teacher. For future work, we plan to investigate using a more powerful language model, such as Megatron-LM (Shoeybi et al., 2019), as the teacher, as well as different strategies for choosing hard negatives to further boost performance.