Adversarial Self-Supervised Data-Free Distillation for Text Classification

Large pre-trained transformer-based language models have achieved impressive results on a wide range of NLP tasks. In the past few years, Knowledge Distillation (KD) has become a popular paradigm for compressing a computationally expensive model into a resource-efficient, lightweight model. However, most KD algorithms, especially in NLP, rely on the accessibility of the original training dataset, which may be unavailable due to privacy issues. To tackle this problem, we propose a novel two-stage data-free distillation method, named Adversarial self-Supervised Data-Free Distillation (AS-DFD), designed for compressing large-scale transformer-based models (e.g., BERT). To avoid text generation in discrete space, we introduce a Plug & Play Embedding Guessing method to craft pseudo embeddings from the teacher's hidden knowledge. Meanwhile, with a self-supervised module to quantify the student's ability, we adapt the difficulty of pseudo embeddings in an adversarial training manner. To the best of our knowledge, our framework is the first data-free distillation framework designed for NLP tasks. We verify the effectiveness of our method on several text classification datasets.


Introduction
Recently, pre-trained language models (Devlin et al., 2018; Raffel et al., 2019) have achieved tremendous progress and reached state-of-the-art performance on various downstream tasks such as text classification (Maas et al., 2011), language inference (Bowman et al., 2015) and question answering (Rajpurkar et al., 2016). These models have become an indispensable part of current systems for their transferability and generalizability.

* Corresponding author

However, such language models are huge in volume and demand heavy computational resources, making it impractical to deploy them on portable systems with limited resources (e.g., mobile phones, edge devices) without appropriate compression. Recent studies (McCarley, 2019; Gordon et al., 2020; Michel et al., 2019) focus on compressing large-scale models into shallow, resource-efficient networks via weight pruning (Guo et al., 2019), knowledge distillation (Mukherjee and Awadallah, 2019), weight quantization (Zafrir et al., 2019) and parameter sharing. Among them, some methods (Sanh et al., 2019; Sun et al., 2019) draw on the idea of transfer learning, utilizing knowledge distillation (Hinton et al., 2015) to transfer latent representation information embedded in teachers to students. These knowledge distillation methods share one commonality: they rely on the training data to achieve high accuracy. Distillation becomes intractable if we need to compress a model without publicly accessible data. Reasons include privacy protection, company assets, safety and security concerns, and transmission costs. Representative examples include GPT-2 (Radford et al., 2019), whose training data has not been released for fear of abuse of language models. Google trains a neural machine translation system (Wu et al., 2016) using internal datasets owned and protected by the company. DeepFace (Taigman et al., 2014) is trained on user images under confidentiality policies protecting users.
Further, some datasets, like the Common Crawl dataset used in GPT-3 (Brown et al., 2020), contain nearly a trillion words and are difficult to transmit and store.
Conventional knowledge distillation methods are highly dependent on data. Some models and algorithms in computer vision, such as DAFL and ZSKD (Nayak et al., 2019), solve data-free distillation by generating pseudo images or utilizing metadata from teacher models. Exploratory studies (Micaelli and Storkey, 2019; Fang et al., 2019) also show that GANs can synthesize harder and more diversified images by exploiting disagreements between teachers and students. However, these models only make attempts at image tasks and are designed for continuous, real-valued images. Applying them to generate sentences is challenging due to the discrete representation of words (Huszár, 2015). Backpropagation on discrete words is not feasible, and the gradient cannot pass through the text to the generator. Apart from the discontinuity of text, some enhancement strategies, like layer-wise statistic matching in batch normalization (Yin et al., 2019), are not suitable for transformer-based models, which replace batch normalization with layer normalization to accommodate varied sentence lengths (Ba et al., 2016).
To address the above issues and distill without data, we propose a novel data-free distillation framework called "Adversarial self-Supervised Data-Free Distillation" (AS-DFD). We invert BERT to perform gradient updates on embeddings and treat the parameters of the embedding layer as knowledge accessible to student models. Under constraints of constructing "BERT-like" vectors, pseudo embeddings extract underlying representations of each category. Besides, we employ a self-supervised module to quantify the student's ability and adversarially adjust the difficulty of pseudo samples, alleviating the insufficient supervision provided by one-hot targets. Our main contributions are summarized as follows:

• We introduce AS-DFD, a data-free distillation framework, to compress BERT. To the best of our knowledge, AS-DFD is the first model in NLP to distill knowledge without data.
• We propose a Plug & Play Embedding Guessing method and align the pseudo embeddings with the distribution of BERT's embeddings. We also propose a novel adversarial self-supervised module to search for samples the student performs poorly on, which also encourages diversity.
• We verify the effectiveness of AS-DFD on three popular text classification datasets with two different student architectures. Extensive experiments support the conjecture that synthetic embeddings are effective for data-free distillation.

Related Work

Knowledge Distillation

Knowledge distillation (Hinton et al., 2015) transfers knowledge from a large teacher model to a compact student. BERT (Devlin et al., 2018) contains multiple layers of transformer blocks (Vaswani et al., 2017), which encode contextual relationships between words. Recently, many works have successfully compressed BERT into a BERT-like model with knowledge distillation (Sanh et al., 2019) and achieved comparable performance on downstream tasks. Patient-KD (Sun et al., 2019) bridges the student and teacher models through their intermediate outputs.
TinyBERT (Jiao et al., 2019) captures both domain-general and domain-specific knowledge in a two-stage framework. Zhao et al. (2019) employ a dual-training mechanism and shared projection matrices to compress the model by more than 60x. BERT-of-Theseus (Xu et al., 2020) progressively replaces modules and involves a replacement scheduler in the distillation process. Besides, some recent works compress BERT into a CNN-based (Chia et al., 2019) or LSTM-based model to create a more lightweight model with additional training data (Tang et al., 2019a,b).

Data-Free Distillation Methods
Current methods for data-free knowledge distillation are applied in the field of computer vision.

Figure 1: An overview of our two-stage Adversarial self-Supervised Data-Free Distillation framework. T and S contain transformer layers and a classifier head. First, when constructing synthetic samples, we iteratively guess and update the pseudo embeddings e under the feedback of the teacher's class-conditional supervision (top left) and the student's self-assessment (top right) in an adversarial training manner. Second, we use the generated samples e to distill knowledge (top middle). The parameters of the embedding layer are fixed, and no inputs go through the embedding layer during training.

Methods
In this section, we present our two-stage distillation framework named Adversarial self-Supervised Data-Free Distillation (AS-DFD). We craft well-trained embedding-level pseudo samples by controllable Plug & Play Embedding Guessing with alignment constraints (Section 3.1) and adversarially adapt synthetic embeddings under self-supervision of the student (Section 3.2). Using these pseudo samples, we transfer knowledge from the teacher to the student (Section 3.3). The workflow of AS-DFD is illustrated in Figure 1.
Problem Definition Knowledge Distillation is a compression technique to train a high-performance model with fewer parameters, instructed by the teacher model (Hinton et al., 2015). Let T be a large transformer-based teacher model (12-layer BERT-base here) and S be a comparatively lightweight student model. For each sentence x, the classification prediction can be formulated as:

\[ \mathbf{h} = \mathrm{TransformerLayers}\big(\mathrm{Embedding}(x; \theta_{emb}); \theta_{layer}\big), \qquad \mathbf{y} = \mathrm{softmax}\big(\mathrm{Classifier}(\mathbf{h}_{[CLS]}; \theta_{classifier})\big) \]

where θ_emb, θ_layer and θ_classifier represent parameters in the embedding layer, transformer layers and classification head respectively. y is the softmax probability output for x, and h_[CLS] denotes the hidden state in the last layer corresponding to the special token [CLS]. Parameters with superscript T belong to the teacher and those with superscript S to the student.
Our goal of data-free knowledge distillation is to train the student parameters θ S with no data X available. In other words, we only have a teacher model T and we need to compress it.

Construct Pseudo Samples
Plug & Play Embedding Guessing In the data-free setting, we need to resolve the dilemma of having no access to the original dataset. The major challenge is how to construct a set of highly reliable samples from which the student can extract discriminative knowledge.
Our approach exploits representative knowledge hidden in the teacher's parameters in a Plug & Play manner (Nguyen et al., 2017; Dathathri et al., 2020). Given a sentence x and a label y, the conditional probability can be written as P(y|x; θ^T). When finetuning the teacher, we optimize the parameters θ^T towards higher probability. To capture the impression of the original training data left in the teacher's parameters, we invert the model and use the teacher's parameters to guide the generation of x by ascending P(y|x; θ^T) with θ^T fixed.
Due to the intractable discreteness of text, gradient updates on x are pointless. Most language models transform discrete words into continuous embeddings. Inspired by this, we bypass the embedding layer and apply the updates in the continuous representation space of embeddings. We name this generation process "Embedding Guessing". We randomly guess vectors e ∈ R^{l×d}, feed them into the transformer blocks and use the gradient feedback to confirm or update our guess. l is the predefined sentence length and d is the embedding dimensionality, which is 768 in BERT-base. These target-aware embeddings can be obtained by minimizing the objective:

\[ \min_{e \in E} \; CE\big(T(e; \theta^T),\, y\big) \]

where T takes pseudo embeddings e as input and contains the TransformerLayers and ClassifierLayer of the teacher. θ^T includes θ^T_layer and θ^T_classifier. y is a random target class. CE refers to the cross-entropy loss. E is a batch of e initialized from a Gaussian distribution. We update e for several iterations until convergence, meaning that the teacher judges e to be correct. As for θ^S_emb, we share θ^T_emb with the student. We argue that under the process of Embedding Guessing, pseudo embeddings e contain target-specific information. Classification models need to find differentiating characteristics that are propitious to prediction over multiple categories. As in human learning, examples given by teachers should be representative and reflect the discrepancy among classes. Borrowing this teaching strategy, we guess embeddings in the direction of higher likelihood on the target category and seek a local minimum with respect to the target class, which reflects the characteristics of the target class within local regions. In other words, these synthetic samples are more likely to comprise separation statistics between classes.
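The Embedding Guessing loop can be sketched in PyTorch. The `ToyTeacher` below is a hypothetical stand-in for BERT's transformer layers and classifier head (mean-pooling plus a linear layer), and the hyperparameter values are illustrative rather than the paper's:

```python
import torch
import torch.nn.functional as F

class ToyTeacher(torch.nn.Module):
    """Hypothetical stand-in for T (transformer layers + classifier head):
    mean-pools the embeddings and applies a linear classifier."""
    def __init__(self, dim=16, n_classes=4):
        super().__init__()
        self.fc = torch.nn.Linear(dim, n_classes)

    def forward(self, e):                      # e: (batch, seq_len, dim)
        return self.fc(e.mean(dim=1))          # logits: (batch, n_classes)

def guess_embeddings(teacher, targets, seq_len=8, dim=16,
                     steps=50, lr=1e-2, std=0.35):
    """Embedding Guessing: start from Gaussian noise and descend the
    frozen teacher's cross-entropy w.r.t. the embeddings themselves."""
    for p in teacher.parameters():
        p.requires_grad_(False)                # the teacher stays fixed
    e = (torch.randn(len(targets), seq_len, dim) * std).requires_grad_()
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(teacher(e), targets)
        loss.backward()                        # gradient flows into e, not θ^T
        opt.step()
    return e.detach()
```

After enough steps the teacher assigns each pseudo embedding its target class with high confidence, which mirrors the convergence criterion described above.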
Making Pseudo-Embeddings More Realistic However, training on embeddings leads to a gap between the pseudo embeddings and the true underlying embeddings. Specifically, Embedding Guessing is independent of the parameters of the teacher's embedding layer and may shift the representational space. We add constraints to ensure that generated embeddings imitate the distribution of real data to a certain extent. Alignment strategies that restrain and reduce the search space are listed as follows:

• Add e_[CLS] and e_[SEP] at both ends of the synthetic embeddings. e_[CLS] and e_[SEP] denote the embeddings corresponding to [CLS] and [SEP].
• Contiguously mask a random-length span of embeddings from the tail. Lengths of sentences in batches are indeterminate, and synthetic embeddings should cover this scenario.
• Adjust the Gaussian distribution to find the best initialization. An excessive initialization scope expands the search space, while a small one converges to limited samples.
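The three alignment constraints above can be sketched as follows; the tensor shapes and the zeroing-out of masked tail positions are our own simplifications, not details specified in the paper:

```python
import torch

def align_pseudo_embeddings(e, cls_emb, sep_emb):
    """Apply the alignment constraints to freshly sampled embeddings e.
    e: (batch, l, d) Gaussian-initialized pseudo embeddings.
    cls_emb / sep_emb: (1, 1, d) embeddings of [CLS] and [SEP]."""
    b, l, d = e.shape
    # Constraint 1: add [CLS]/[SEP] at both ends.
    e = torch.cat([cls_emb.expand(b, 1, d), e, sep_emb.expand(b, 1, d)], dim=1)
    # Constraint 2: mask a random-length tail per sample
    # (keep at least [CLS] plus one position).
    keep = torch.randint(2, l + 3, (b,))
    mask = torch.arange(l + 2).unsqueeze(0) < keep.unsqueeze(1)
    e = e * mask.unsqueeze(-1).float()         # zero out the masked tail
    # Constraint 3 (choosing the Gaussian std) happens at sampling time.
    return e, mask
```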

Adversarial self-Supervised Student
Modeling Learning Ability of the Student Effective teaching needs to grasp the student's current state of knowledge and dynamically adapt teaching strategies and contents. How can we model the ability of the student without data? When processing natural language, the ability to analyze context is an indicator of the student's capabilities, and it can be quantified by a self-supervised module. Borrowing the idea of randomly masking and predicting entries, we randomly mask one embedding in e. Then, a new self-restricted objective is to predict the masked embedding with the following form:

\[ L_{MASK} = \big\| W \mathbf{h}_i - e_i \big\|_2^2 \]

where e is randomly masked at position i and converted to e_mask, h_i is the student's hidden state at the masked position, e_i is the masked embedding, and W contains the parameters of the fully-connected self-supervised module for predicting the masked embedding. S acts in the same way as T. Unlike the class-conditional guidance, the self-supervised module shifts the gradients with more concrete and diverse supervision from context.
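Since the extracted equation is incomplete, the sketch below assumes a mean-squared-error form for L_MASK, with `W` as the fully-connected prediction head; the function name and shapes are our own:

```python
import torch
import torch.nn.functional as F

def l_mask(student_hidden, e, W, pos):
    """Self-supervised masked-embedding loss (assumed MSE form).
    student_hidden: (b, l, d) student hidden states for the masked input e_mask.
    e:              (b, l, d) original pseudo embeddings (prediction targets).
    W:              fully-connected head predicting the masked embedding.
    pos:            (b,) masked position i for each sample."""
    rows = torch.arange(e.size(0))
    pred = W(student_hidden[rows, pos])   # predicted embedding at position i
    return F.mse_loss(pred, e[rows, pos])
```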
Adversarial Training of the Student To endow e with more valuable and diverse information, we adversarially search for samples that the student is not confident about. Prior works (Micaelli and Storkey, 2019; Fang et al., 2019) maximize the discrepancy between the teacher and student to encourage difficulty in samples and avoid synthesizing redundant images. We design a self-assessed confrontational mechanism, which guides the pseudo embeddings towards greater difficulty by enlarging L_MASK in the construction stage and enhances the student by decreasing L_MASK in the distillation stage. Here, L_MASK acts as timely feedback from the student to improve teaching.
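The min-max game between the two stages can be summarized by the two signed objectives below; the weights `beta` and `gamma` are hypothetical knobs for illustration, not values from the paper:

```python
def construction_objective(ce_to_target, mask_loss, beta=1.0):
    """Gradient step on the pseudo embeddings e: fit the teacher's target
    class while *maximizing* the student's masked-prediction error L_MASK."""
    return ce_to_target - beta * mask_loss

def distillation_objective(distill_loss, mask_loss, gamma=1.0):
    """Gradient step on the student (and W): match the teacher while
    *minimizing* the same masked-prediction error L_MASK."""
    return distill_loss + gamma * mask_loss
```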

Two-stage Training
Distillation Objective Students learn high-entropy knowledge from teachers by matching soft targets. Taking E as the synthetic samples, we measure the distance between the teacher and student as:

\[ L_{KD} = \tau^2 \, KL\big(\mathrm{softmax}(z^T/\tau) \,\|\, \mathrm{softmax}(z^S/\tau)\big) \]

where KL denotes the Kullback-Leibler divergence, z^T and z^S are the teacher's and student's logits, and τ is the distillation temperature. We follow PKD (Sun et al., 2019) to let students learn more meticulous details. To capture rich features, we define the additional loss as:

\[ L_{PT} = \sum_{i} \Big\| \frac{\mathbf{h}^T_i}{\|\mathbf{h}^T_i\|_2} - \frac{\mathbf{h}^S_i}{\|\mathbf{h}^S_i\|_2} \Big\|_2^2 \]

where h^T_i and h^S_i denote the [CLS] hidden states at the i-th pair of matched intermediate layers. The objective of distillation can be formulated as:

\[ L_{DISTILL} = L_{KD} + \alpha L_{PT} \]

where α balances these two losses.
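A sketch of the two distillation losses. Since the equations in the extracted text are garbled, the τ² scaling and the PKD-style normalized-MSE form are assumptions based on common KD practice; α = 250 is the value reported in Appendix A.1:

```python
import torch
import torch.nn.functional as F

def kd_loss(logits_t, logits_s, tau=1.0):
    """Soft-target matching: KL between teacher and student distributions
    at temperature tau (the tau^2 factor is a common convention, assumed here)."""
    p_t = F.softmax(logits_t / tau, dim=-1)
    log_p_s = F.log_softmax(logits_s / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

def pt_loss(h_t, h_s):
    """PKD-style patient loss: squared distance between L2-normalized
    intermediate hidden states (assumed form)."""
    return ((F.normalize(h_t, dim=-1) - F.normalize(h_s, dim=-1)) ** 2).sum(-1).mean()

def distill_objective(logits_t, logits_s, h_t, h_s, alpha=250.0, tau=1.0):
    """L_DISTILL = L_KD + alpha * L_PT."""
    return kd_loss(logits_t, logits_s, tau) + alpha * pt_loss(h_t, h_s)
```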
Training Procedure We summarize the training procedure in Algorithm 1. The multi-round training of AS-DFD splits into two steps: the construction stage and the distillation stage. In the construction stage, after randomly sampling vectors with alignment constraints, we repeat the adversarial training of pseudo embeddings for n_iter iterations. In each iteration, we guess embeddings under class-conditional supervision for n_T steps, and the student is asked to predict and give negative feedback to guide the pseudo embeddings' generation for n_S steps. When distilling, we train θ^S as well as W with these pseudo samples.
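The two-stage procedure can be sketched as a loop over callbacks; the step functions are placeholders for the updates described above, not the paper's actual implementation:

```python
def train_as_dfd(guess_step, adversary_step, distill_step, init_e,
                 n_rounds=1, n_iter=5, n_T=5, n_S=1):
    """Skeleton of the multi-round, two-stage training loop (Algorithm 1).
    guess_step:     one class-conditional update of e via the teacher.
    adversary_step: one self-supervised adversarial update of e via the student.
    distill_step:   one distillation pass training the student (and W) on e.
    init_e:         samples a fresh batch of aligned Gaussian embeddings."""
    for _ in range(n_rounds):
        e = init_e()                        # construction stage
        for _ in range(n_iter):
            for _ in range(n_T):
                guess_step(e)
            for _ in range(n_S):
                adversary_step(e)
        distill_step(e)                     # distillation stage
```

The defaults n_T = 5, n_S = 1 and n_iter = 5 match the settings reported in the implementation details.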

Datasets
We demonstrate the effectiveness of our methods on three widely-used text classification datasets:

Teacher/Student Models
We experiment with official uncased BERT-base (Devlin et al., 2018)

Baselines
To the best of our knowledge, there is no data-free distillation method for language tasks. However, with slight modifications, data-free distillation models that are effective in computer vision can also work on language tasks. Imitating our Plug & Play Embedding Guessing method, we plug their image generators/generation methods above the embedding layer to synthesize continuous embeddings (instead of images). In addition to a baseline that randomly selects words, we choose two models that represent the mainstream approaches in data-free distillation for image classification. The baselines are described as follows:

Random Text We randomly select words from the vocabulary and construct literally uninterpretable sentences.
Modified-ZSKT Modified-ZSKT is extended from ZSKT (Micaelli and Storkey, 2019). ZSKT trains an adversarial generator to search for images on which the student's prediction poorly matches that of the teacher, and reaches state-of-the-art performance.
Modified-ZSKD Modified-ZSKD is derived from ZSKD (Nayak et al., 2019). ZSKD performs Dirichlet sampling on class probabilities and crafts Data Impressions. DeepInversion (Yin et al., 2019) extends ZSKD with feature-distribution regularization in batch normalization and outperforms ZSKD. However, BERT is not suitable for this performance-enhancing approach (BERT has no batch normalization, nor any similar structure that stores statistics of the training data), so DeepInversion cannot serve as a baseline for our method.

Experimental Results
We first show the performance of data-driven knowledge distillation. Then we show the effectiveness of AS-DFD. As shown in Table 2, AS-DFD with BERT_4 and BERT_6 performs best on all three datasets. For the 6-layer BERT, our algorithm improves accuracy by 1.8%, 1.1% and 1.6% compared to Modified-ZSKD, closing the distance between the teacher and student. Furthermore, when coaching the 4-layer student, our method gains 4.4%, 11.1% and 6.5% increases, which significantly improves the distillation accuracy. AS-DFD thus appears to perform better at higher compression rates compared with other data-free methods. However, there is still a large gap between the performance of data-driven distillation and data-free distillation.
As for the other baselines, Random Text can be regarded as a special case of unlabeled text from which models can extract information to infer on, especially in text classification tasks. We use it as a criterion to judge whether a model works at all. Modified-ZSKT performs worse than Random Text on DBPedia. The reason lies in the structure of its generator, which is designed for image generation and is not suitable for language generation. The strength of CNN-based generators lies in their ability to capture local and hierarchical features. However, it is difficult for CNNs to capture global and sequential structures, which are essential for language.

Implementation Details
We train AS-DFD with n_T = 5, n_S = 1 and n_iter = 5. The maximum sequence length for all three datasets is set to 128. Ideally, the more samples generated, the higher the accuracy. We impose restrictions on the number of generated samples for each dataset. Training epochs are 2.5k (AG News), 10k (DBPedia) and 10k (IMDb), with 48 samples per batch for all methods except Modified-ZSKT, which needs to train its generator from scratch (25k epochs). In our experiments, these samples are enough for models to reach a stable status. More implementation details about finetuning teachers and distilling students are listed in Appendix A.1.
Initialization We observe that students' performance is highly sensitive to initialization (especially for the Random Text baseline). Fan et al. (2019) argue that different layers play different roles in BERT. We report results using different initialization schemes and show the stability of AS-DFD. Considering that the embedding layer is separated from the transformer blocks during training, we strongly recommend sharing the teacher's first-layer parameters with the student, as also suggested in Xu et al. (2020). Specifically, we choose two sets of layer weights. One is {1, 4, 7, 10}, which is common in data-driven distillation, and the other is {1, 5, 8, 12}, which intentionally includes the last layer's parameters. We evaluate these initialization schemes on AS-DFD and Modified-ZSKD. To eliminate the effects of distillation, we ensure that the hyperparameters in the distillation step are consistent across the two models, which directly shows the disparity in sample quality. We do not include Modified-ZSKT because its samples vastly outnumber those of the other two approaches.
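The layer-selection schemes can be sketched as a simple index mapping; `init_student_from_teacher` is a hypothetical helper (in practice one would copy transformer-block state dicts):

```python
def init_student_from_teacher(teacher_layers, picked=(1, 4, 7, 10)):
    """Initialize a 4-layer student from a 12-layer teacher by copying
    the weights of the picked teacher layers (1-indexed); picking layer 1
    first mirrors the recommendation to share the teacher's first layer."""
    return [teacher_layers[i - 1] for i in picked]
```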
Experimental results are shown in Figure 2. Modified-ZSKD is highly dependent on initialization, especially on AG News and DBPedia, with relative performance drops of 23.1% and 47.1%. On the contrary, initialization has limited impact on AS-DFD. Even if pseudo embeddings are initialized with the worse parameters, our method still achieves better accuracy than the other baselines (87.7% on AG News, 90.5% on DBPedia and 75.4% on IMDb). This shows that our method synthesizes higher-quality samples than Modified-ZSKD. Additionally, AS-DFD maintains an upward trend as the number of synthetic samples grows, suggesting that additional synthetic samples continue to help.

Validity of Synthetic Embeddings The embeddings we generate are incomprehensible to humans. We use t-SNE (Maaten and Hinton, 2008) to visualize the synthetic embeddings in comparison with the original dataset. As shown in Figure 3, samples generated by Embedding Guessing are close to the real samples and overlap with them to a certain extent.

Module Analysis
To verify the contribution of each module, we perform an ablation study and summarize it in Table 4. Embedding Guessing is the foundation of the entire model. After introducing the idea of Plug & Play Embedding Guessing, distillation performance improves stably, demonstrating that knowledge extracted from the teacher makes the synthetic samples reasonable. The embedding layer of the student model is completely separated in the generation-distillation process. Precisely imitating BERT's input narrows this gap, leading to a large improvement in accuracy. Additionally, choosing an appropriate normal distribution can effectively reduce the search space and avoid generating completely irrelevant samples. We conduct experiments on different normal distributions in Appendix A.2.
Effect of Adversarial self-Supervised Module To investigate whether the adversarial self-supervised module helps data-free distillation, we conduct experiments on AG News and demonstrate its advantage in Figure 4. We repeat each experiment 3 times and plot the mean and standard deviation to reduce randomness. With the adversarial self-supervised module, distillation converges faster and achieves higher accuracy. The number of epochs can be reduced to 2500, saving half of the training time. As shown in the curve, AS-DFD does not perform well in the early stage, since the self-supervised module is underfitting. After training for a while, the self-supervised module can grasp the student's ability and provide corrective feedback to synthesize more challenging samples.

Conclusion
In this paper, we propose AS-DFD, a novel data-free distillation method for text classification tasks. We use Plug & Play Embedding Guessing with alignment constraints to solve the problem that gradients cannot be updated on discrete text. To dynamically adjust synthetic samples according to the student's situation, we involve an adversarial self-supervised module to quantify the student's ability. Experimental results on three text datasets demonstrate the effectiveness of AS-DFD.
However, it is still challenging to ensure the diversity of generated embeddings under a weak supervision signal, and we argue that a gap between synthetic and real sentences still exists. In the future, we would like to explore data-free distillation on more complex tasks.

A Appendices
A.1 Implementation Details

Hyperparameters in Finetuning Teachers
We finetune BERT-base on the three datasets mentioned above. We train our teacher models with Adam (Kingma and Ba, 2014) for 4 epochs. The learning rate is set to 2e-5 with a scheduler that linearly decreases it after 10% warmup steps. We set the maximum sequence length to 128 and the batch size to 32 for all datasets.
Hyperparameters in Data-Free Distillation AS-DFD is trained on a single TITAN Xp GPU. We set the batch size to 48, with the student's learning rate ξ chosen from {5 × 10^-5, 2 × 10^-5, 1 × 10^-5} and the embedding learning rate η from {1 × 10^-2, 5 × 10^-3, 1 × 10^-3}. We conduct an additional search over α from {100, 200, 250, 350, 500} and select the hyperparameters with the highest accuracy. In our experiments, η equals 1 × 10^-2 and ξ equals 1 × 10^-5. α is set to 250. Temperature τ = 1 works well in our model. In the distillation step, we use Adam with a warmup proportion of 0.1 and linearly decay the learning rate. In the construction step, the learning rate is fixed with the Adam optimizer. There may be no validation set under data-free settings, which makes tuning parameters impossible. We experiment with the hyperparameters that performed best on AG News and find that this set of parameters also performs well on the other two datasets.

A.2 Adjust Gaussian Distributions
The other two parameters are the mean and standard deviation for Gaussian sampling. We found in our experiments that the standard deviation has a great influence on the student's performance. If vectors are initialized with a small standard deviation (e.g., std = 0.05; see Figure 5.b), generated samples in each category gather together, meaning that they aggregate in limited regions, leading to insufficient diversity of pseudo samples. Real data samples show no such aggregation under t-SNE (see Figure 5.a). A higher standard deviation (e.g., std = 1) indicates that samples are spread out from the mean, which increases the search space and drifts far from BERT's embedding distribution. This is also reflected in our test accuracy: 83.2, 85.3, 88.2 and 83.2 corresponding to N(0, 0.05²), N(0, 0.2²), N(0, 0.35²) and N(0, 1²). We search standard deviations over {0.05, 0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 1} and choose 0.35, which works well on all three datasets.