TinyBERT: Distilling BERT for Natural Language Understanding

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the plentiful knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT. TinyBERT4, with 4 layers, is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ~28% of their parameters and ~31% of their inference time. Moreover, TinyBERT6, with 6 layers, performs on par with its teacher BERT-Base.


Introduction
Pre-training language models and then fine-tuning them on downstream tasks has become a new paradigm for natural language processing (NLP). Pre-trained language models (PLMs), such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), T5 (Raffel et al., 2019) and ELECTRA (Clark et al., 2020), have achieved great success in many NLP tasks (e.g., the GLUE benchmark (Wang et al., 2018) and the challenging multi-hop reasoning task (Ding et al., 2019)). However, PLMs usually have a large number of parameters and long inference times, which makes them difficult to deploy on edge devices such as mobile phones. Recent studies (Kovaleva et al., 2019; Michel et al., 2019; Voita et al., 2019) demonstrate that there is redundancy in PLMs. It is therefore both crucial and feasible to reduce the computational overhead and model storage of PLMs while retaining their performance.
Many model compression techniques (Han et al., 2016) have been proposed to accelerate deep model inference and reduce model size while maintaining accuracy. The most commonly used techniques include quantization (Gong et al., 2014), weight pruning (Han et al., 2015), and knowledge distillation (KD) (Romero et al., 2014). In this paper, we focus on knowledge distillation, an idea originating from Hinton et al. (2015), in a teacher-student framework. KD aims to transfer the knowledge embedded in a large teacher network to a small student network, where the student network is trained to reproduce the behaviors of the teacher network. Based on this framework, we propose a novel distillation method specifically for Transformer-based models (Vaswani et al., 2017), and use BERT as an example to investigate the method for large-scale PLMs.
KD has been extensively studied in NLP (Kim and Rush, 2016; Hu et al., 2018) as well as for pre-trained language models (Sanh et al., 2019; Sun et al., 2019, 2020). The pre-training-then-fine-tuning paradigm first pre-trains BERT on a large-scale unsupervised text corpus and then fine-tunes it on a task-specific dataset, which greatly increases the difficulty of BERT distillation. Therefore, an effective KD strategy is required for both training stages. To build a competitive TinyBERT, we first propose a new Transformer distillation method to distill the knowledge embedded in the teacher BERT. Specifically, we design three types of loss functions to fit different representations from BERT layers: 1) the output of the embedding layer; 2) the hidden states and attention matrices derived from the Transformer layers; 3) the logits output by the prediction layer. The attention-based fitting is inspired by the recent findings (Clark et al., 2019) that the attention weights learned by BERT can capture substantial linguistic knowledge, and it thus encourages the linguistic knowledge to be well transferred from the teacher BERT to the student TinyBERT. We then propose a novel two-stage learning framework consisting of general distillation and task-specific distillation, as illustrated in Figure 1. At the general distillation stage, the original BERT without fine-tuning acts as the teacher model. The student TinyBERT mimics the teacher's behavior through the proposed Transformer distillation on a general-domain corpus. This yields a general TinyBERT that is used as the initialization of the student model for further distillation. At the task-specific distillation stage, we first perform data augmentation, then carry out the distillation on the augmented dataset using the fine-tuned BERT as the teacher model. It should be pointed out that both stages are essential to improve the performance and generalization capability of TinyBERT.
The main contributions of this work are as follows: 1) We propose a new Transformer distillation method to encourage that the linguistic knowledge encoded in the teacher BERT is adequately transferred to TinyBERT; 2) We propose a novel two-stage learning framework that performs the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can absorb both the general-domain and task-specific knowledge of the teacher BERT; 3) We show in experiments that our TinyBERT4 achieves more than 96.8% of the performance of the teacher BERT-Base on GLUE tasks, while having far fewer parameters (~13.3%) and much less inference time (~10.6%), and significantly outperforms other 4-layer state-of-the-art baselines on BERT distillation; 4) We also show that a 6-layer TinyBERT6 performs on par with the teacher BERT-Base on GLUE.

Preliminaries
In this section, we describe the formulation of the Transformer (Vaswani et al., 2017) and Knowledge Distillation (Hinton et al., 2015). Our proposed Transformer distillation is a specially designed KD method for Transformer-based models.

Transformer Layer
Most recent pre-trained language models (e.g., BERT, XLNet and RoBERTa) are built with Transformer layers, which can capture long-term dependencies between input tokens through the self-attention mechanism. Specifically, a standard Transformer layer includes two main sub-layers: multi-head attention (MHA) and a fully connected feed-forward network (FFN).

Multi-Head Attention (MHA). The calculation of the attention function depends on three components: queries, keys and values, denoted as matrices Q, K and V respectively. The attention function can be formulated as follows:

  A = Q K^T / √d_k,   (1)
  Attention(Q, K, V) = softmax(A) V,   (2)

where d_k is the dimension of the keys and acts as a scaling factor, and A is the attention matrix calculated from the compatibility of Q and K by a dot-product operation. The final output is a weighted sum of the values V, with the weights computed by applying the softmax(·) operation to each row of matrix A. According to Clark et al. (2019), the attention matrices in BERT can capture substantial linguistic knowledge, and they thus play an essential role in our proposed distillation method.
Multi-head attention is defined by concatenating the attention heads from different representation subspaces as follows:

  MHA(Q, K, V) = Concat(h_1, ..., h_k) W,   (3)

where k is the number of attention heads, h_i denotes the i-th attention head, which is calculated by the Attention() function with inputs from different representation subspaces, and the matrix W acts as a linear transformation.

Position-wise Feed-Forward Network (FFN). A Transformer layer also contains a fully connected feed-forward network, which is formulated as follows:

  FFN(x) = max(0, x W_1 + b_1) W_2 + b_2.   (4)

The FFN consists of two linear transformations and one ReLU activation.
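To make the formulation concrete, the following is a minimal PyTorch sketch of a single attention head and the FFN (Equations 1-4). The tensor shapes and variable names are illustrative assumptions, not the paper's code.

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # A = Q K^T / sqrt(d_k): the unnormalized attention matrix (l x l)
    d_k = K.size(-1)
    A = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # row-wise softmax turns each query's scores into weights over the values
    return F.softmax(A, dim=-1) @ V, A

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2: two linear maps with a ReLU between
    return F.relu(x @ W1 + b1) @ W2 + b2

# toy shapes for illustration: sequence length l = 5, head dimension d_k = 8
Q, K, V = (torch.randn(5, 8) for _ in range(3))
out, A = attention(Q, K, V)  # out: (5, 8), A: (5, 5)
```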

Knowledge Distillation
KD aims to transfer the knowledge of a large teacher network T to a small student network S. The student network is trained to mimic the behaviors of the teacher network. Let f^T and f^S represent the behavior functions of the teacher and student networks, respectively. A behavior function transforms network inputs into some informative representation, and it can be defined as the output of any layer in the network. In the context of Transformer distillation, the output of the MHA layer or the FFN layer, or some intermediate representation (such as the attention matrix A), can be used as a behavior function. Formally, KD can be modeled as minimizing the following objective function:

  L_KD = Σ_{x∈X} L( f^S(x), f^T(x) ),   (5)

where L(·) is a loss function that evaluates the difference between the teacher and student networks, x is the text input, and X denotes the training dataset. The key research problem thus becomes how to define effective behavior functions and loss functions. Unlike previous KD methods, we also need to consider how to perform KD at the pre-training stage of BERT in addition to the task-specific training stage.
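As a sketch of this objective, the behavior functions and the loss can be passed in as callables; `f_S`, `f_T` and `loss_fn` are placeholders to be instantiated by a concrete distillation method, not names from the paper:

```python
import torch

def kd_loss(f_S, f_T, loss_fn, dataset):
    # L_KD = sum over x in X of L(f_S(x), f_T(x)); the teacher is frozen
    total = torch.tensor(0.0)
    for x in dataset:
        with torch.no_grad():          # no gradients flow through the teacher
            target = f_T(x)
        total = total + loss_fn(f_S(x), target)
    return total
```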

Method
In this section, we propose a novel distillation method for Transformer-based models, and present a two-stage learning framework for our model distilled from BERT, which is called TinyBERT.

Transformer Distillation
The proposed Transformer distillation is a specially designed KD method for Transformer networks. In this work, both the student and teacher networks are built with Transformer layers. For a clear illustration, we formulate the problem before introducing our method.

Problem Formulation. Assuming that the student model has M Transformer layers and the teacher model has N Transformer layers, we start by choosing M out of the N teacher layers for the Transformer-layer distillation. A function n = g(m) is then defined as the mapping between indices from student layers to teacher layers, meaning that the m-th layer of the student model learns the information from the g(m)-th layer of the teacher model. To be precise, we set 0 to be the index of the embedding layer and M + 1 to be the index of the prediction layer, with the corresponding layer mappings defined as 0 = g(0) and N + 1 = g(M + 1), respectively. The effect of the choice of mapping function on performance is studied in the experiment section. Formally, the student can acquire knowledge from the teacher by minimizing the following objective:

  L_model = Σ_{x∈X} Σ_{m=0}^{M+1} λ_m L_layer( f_m^S(x), f_{g(m)}^T(x) ),   (6)

where L_layer refers to the loss function of a given model layer (e.g., Transformer layer or embedding layer), f_m(x) denotes the behavior function induced from the m-th layer, and λ_m is the hyper-parameter that represents the importance of the m-th layer's distillation.
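A small sketch of this objective, using the uniform mapping g(m) = 3m that the experiments adopt for a 4-layer student of a 12-layer teacher; the `layer_loss` callable and the feature lists are assumed interfaces, not the paper's code:

```python
def g(m, N=12, M=4):
    # uniform layer mapping: embedding -> embedding, prediction -> prediction,
    # student Transformer layer m -> teacher layer (N / M) * m
    if m == 0:
        return 0            # embedding layer
    if m == M + 1:
        return N + 1        # prediction layer
    return (N // M) * m     # e.g. student layers 1..4 -> teacher layers 3, 6, 9, 12

def model_loss(student_feats, teacher_feats, layer_loss, lambdas, M=4):
    # sum_m lambda_m * L_layer(f_m^S(x), f_g(m)^T(x)) over all M + 2 layers
    return sum(lambdas[m] * layer_loss(m, student_feats[m], teacher_feats[g(m)])
               for m in range(M + 2))
```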
Transformer-layer Distillation. The proposed Transformer-layer distillation includes attention-based distillation and hidden-states-based distillation, as shown in Figure 2. The attention-based distillation is motivated by the recent finding that attention weights learned by BERT can capture rich linguistic knowledge (Clark et al., 2019). This kind of linguistic knowledge includes syntax and coreference information, which is essential for natural language understanding. We thus propose attention-based distillation to encourage the transfer of linguistic knowledge from the teacher (BERT) to the student (TinyBERT). Specifically, the student learns to fit the matrices of multi-head attention in the teacher network, and the objective is defined as:

  L_attn = (1/h) Σ_{i=1}^{h} MSE(A_i^S, A_i^T),   (7)

where h is the number of attention heads, A_i ∈ R^{l×l} refers to the attention matrix corresponding to the i-th head of the teacher or student, l is the input text length, and MSE(·) is the mean squared error loss function. In this work, the (unnormalized) attention matrix A_i is used as the fitting target instead of its softmax output softmax(A_i), since our experiments show that the former setting converges faster and performs better. In addition to attention-based distillation, we also distill the knowledge from the output of the Transformer layer, with the following objective:

  L_hidn = MSE(H^S W_h, H^T),   (8)

where the matrices H^S ∈ R^{l×d'} and H^T ∈ R^{l×d} refer to the hidden states of the student and teacher networks respectively, which are calculated by Equation 4. The scalar values d' and d denote the hidden sizes of the student and teacher models, with d' often smaller than d to obtain a smaller student network. The matrix W_h ∈ R^{d'×d} is a learnable linear transformation, which maps the hidden states of the student network into the same space as the teacher network's states.
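A minimal PyTorch sketch of the two Transformer-layer losses (Equations 7 and 8); the toy dimensions match the TinyBERT4/BERT-Base setting, and the variable names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def attn_loss(A_S, A_T):
    # (1/h) * sum_i MSE(A_i^S, A_i^T) over the stacked heads (h, l, l);
    # the unnormalized (pre-softmax) attention matrices are compared directly
    return F.mse_loss(A_S, A_T)

def hidn_loss(H_S, H_T, W_h):
    # MSE(H^S W_h, H^T): project student states (l, d') into teacher space (l, d)
    return F.mse_loss(H_S @ W_h, H_T)

# toy sizes: h = 12 heads, length l = 5, d' = 312 (student), d = 768 (teacher)
A_S, A_T = torch.randn(12, 5, 5), torch.randn(12, 5, 5)
H_S, H_T = torch.randn(5, 312), torch.randn(5, 768)
W_h = torch.randn(312, 768, requires_grad=True)  # learnable projection
loss = attn_loss(A_S, A_T) + hidn_loss(H_S, H_T, W_h)
```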
Embedding-layer Distillation. Similarly to the hidden-states-based distillation, we also perform embedding-layer distillation, with the objective:

  L_embd = MSE(E^S W_e, E^T),   (9)

where the matrices E^S and E^T refer to the embeddings of the student and teacher networks, respectively. In this paper, they have the same shapes as the hidden state matrices. The matrix W_e is a linear transformation playing a similar role to W_h.

Prediction-layer Distillation. In addition to imitating the behaviors of intermediate layers, we also use knowledge distillation to fit the predictions of the teacher model, as in Hinton et al. (2015). Specifically, we penalize the soft cross-entropy loss between the student network's logits and the teacher's logits:

  L_pred = CE(z^T / t, z^S / t),   (10)

where z^S and z^T are the logits vectors predicted by the student and teacher respectively, CE is the cross-entropy loss, and t is the temperature value. In our experiments, we find that t = 1 performs well.
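The embedding-layer and prediction-layer objectives (Equations 9 and 10) can be sketched the same way; the soft cross-entropy below is one standard way to implement CE over temperature-scaled logits:

```python
import torch
import torch.nn.functional as F

def embd_loss(E_S, E_T, W_e):
    # MSE(E^S W_e, E^T), same form as the hidden-state objective
    return F.mse_loss(E_S @ W_e, E_T)

def pred_loss(z_S, z_T, t=1.0):
    # soft cross-entropy: -sum_j softmax(z^T / t)_j * log softmax(z^S / t)_j
    p_T = F.softmax(z_T / t, dim=-1)
    log_p_S = F.log_softmax(z_S / t, dim=-1)
    return -(p_T * log_p_S).sum(dim=-1).mean()
```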
Using the above distillation objectives (i.e., Equations 7, 8, 9 and 10), we can unify the distillation loss of the corresponding layers between the teacher and the student network:

  L_layer = L_embd,            if m = 0
  L_layer = L_hidn + L_attn,   if 0 < m ≤ M
  L_layer = L_pred,            if m = M + 1   (11)

TinyBERT Learning
The application of BERT usually consists of two learning stages: pre-training and fine-tuning. The plentiful knowledge learned by BERT in the pre-training stage is of great importance and should be transferred to the compressed model. Therefore, we propose a novel two-stage learning framework consisting of general distillation and task-specific distillation, as illustrated in Figure 1. General distillation helps TinyBERT learn the rich knowledge embedded in pre-trained BERT, which plays an important role in improving the generalization capability of TinyBERT. Task-specific distillation further teaches TinyBERT the knowledge from the fine-tuned BERT. With this two-step distillation, we can substantially reduce the gap between the teacher and student models.

General Distillation. We use the original BERT without fine-tuning as the teacher and a large-scale text corpus as the training data. By performing the Transformer distillation 2 on text from the general domain, we obtain a general TinyBERT that can be fine-tuned for downstream tasks. However, due to the significant reductions of the hidden/embedding size and the layer number, the general TinyBERT performs generally worse than BERT.

2 In the general distillation, we do not perform prediction-layer distillation, so that TinyBERT primarily learns the intermediate structures of BERT at the pre-training stage. From our preliminary experiments, we also found that conducting prediction-layer distillation at the pre-training stage does not bring extra improvements on downstream tasks, when the Transformer-layer distillation (Attn and Hidn distillation) and embedding-layer distillation have already been performed.

Algorithm 1 Data Augmentation Procedure for Task-specific Distillation
Input: x is a sequence of words
Params: p_t: the threshold probability
  N_a: the number of samples augmented per example
  K: the size of the candidate set
Output: D': the augmented data
1: n ← 0; D' ← [ ]
2: while n < N_a do
3:   x_m ← copy of x
4:   for i ← 1 to len(x) do
5:     if x[i] is a single-piece word then
6:       Replace x_m[i] with [MASK]
7:       C ← K most probable words of BERT(x_m)[i]
8:     else
9:       C ← K most similar words of x[i] from GloVe
10:    end if
11:    Sample p ∼ Uniform(0, 1)
12:    if p ≤ p_t then
13:      Replace x_m[i] with a word in C randomly
14:    end if
15:  end for
16:  Append x_m to D'
17:  n ← n + 1
18: end while
19: return D'
Task-specific Distillation. Previous studies show that complex models, such as fine-tuned BERTs, suffer from over-parametrization for domain-specific tasks (Kovaleva et al., 2019). It is thus possible for smaller models to achieve performance comparable to that of the BERTs. To this end, we propose to produce competitive fine-tuned TinyBERTs through task-specific distillation, in which we re-perform the proposed Transformer distillation on an augmented task-specific dataset. Specifically, the fine-tuned BERT is used as the teacher, and a data augmentation method is proposed to expand the task-specific training set. By training with more task-related examples, the generalization ability of the student model can be further improved.
Data Augmentation. We combine the pre-trained language model BERT and GloVe (Pennington et al., 2014) word embeddings to perform word-level replacement for data augmentation. Specifically, we use the language model to predict word replacements for single-piece words (Wu et al., 2019), and use the word embeddings to retrieve the most similar words as replacements for multiple-piece words 3 . Some hyper-parameters are defined to control the replacement ratio of a sentence and the size of the augmented dataset. More details of the data augmentation procedure are shown in Algorithm 1. We set p_t = 0.4, N_a = 20 and K = 15 for all our experiments.
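A runnable sketch of Algorithm 1, assuming the BERT and GloVe candidate lookups are supplied as callables (`is_single_piece`, `bert_topk` and `glove_topk` are hypothetical interfaces, not the paper's code); unlike the pseudocode, this version keeps the original word whenever no replacement is sampled:

```python
import random

def augment(x, is_single_piece, bert_topk, glove_topk, p_t=0.4, N_a=20, K=15):
    """Word-level replacement augmentation (sketch of Algorithm 1).
    x: list of words; is_single_piece(word) -> bool;
    bert_topk(masked_words, i, K) -> K most probable fills for position i;
    glove_topk(word, K) -> K nearest neighbours in the GloVe space."""
    D = []
    for _ in range(N_a):
        x_m = list(x)
        for i, w in enumerate(x):
            if is_single_piece(w):
                masked = x_m[:i] + ["[MASK]"] + x_m[i + 1:]
                C = bert_topk(masked, i, K)   # candidates from the language model
            else:
                C = glove_topk(w, K)          # candidates from word embeddings
            if random.random() <= p_t:
                x_m[i] = random.choice(C)     # replace with probability p_t
        D.append(x_m)
    return D
```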
The above two learning stages are complementary to each other: the general distillation provides a good initialization for the task-specific distillation, while the task-specific distillation on the augmented data further improves TinyBERT by focusing on learning the task-specific knowledge. Although the model size is significantly reduced, TinyBERT can achieve competitive performance in various NLP tasks through the data augmentation and by performing the proposed Transformer distillation method at both the pre-training and fine-tuning stages.

Experiments
In this section, we evaluate the effectiveness and efficiency of TinyBERT on a variety of tasks with different model settings.

TinyBERT Settings
We instantiate a tiny student model (number of layers M = 4, hidden size d' = 312, feed-forward/filter size d'_i = 1200 and head number h = 12) that has a total of 14.5M parameters. This model is referred to as TinyBERT4. The original BERT-Base (N = 12, d = 768, d_i = 3072 and h = 12) is used as the teacher model and contains 109M parameters. We use g(m) = 3 × m as the layer mapping function, so TinyBERT4 learns from every third layer of BERT-Base. The learning weight λ of each layer is set to 1. Besides, for a direct comparison with baselines, we also instantiate a TinyBERT6 (M = 6, d' = 768, d'_i = 3072 and h = 12) with the same architecture as BERT6-PKD (Sun et al., 2019) and DistilBERT6 (Sanh et al., 2019).

Table 1: The architecture of BERT4-PKD and DistilBERT4 is (M = 4, d = 768, d_i = 3072) and the architecture of BERT6-PKD, DistilBERT6 and TinyBERT6 is (M = 6, d = 768, d_i = 3072). All models are learned in a single-task manner. The inference speedup is evaluated on a single NVIDIA K80 GPU. † denotes that the comparison between MobileBERT-Tiny and TinyBERT4 may not be fair since the former has 24 layers and is task-agnostically distilled from IB-BERT-Large, while the latter is a 4-layer model task-specifically distilled from BERT-Base.
TinyBERT learning includes general distillation and task-specific distillation. For the general distillation, we set the maximum sequence length to 128, use English Wikipedia (2,500M words) as the text corpus, and perform the intermediate-layer distillation for 3 epochs with supervision from a pre-trained BERT-Base, keeping the other hyper-parameters the same as in BERT pre-training (Devlin et al., 2019). For the task-specific distillation, under the supervision of a fine-tuned BERT, we first perform intermediate-layer distillation on the augmented data for 20 epochs 4 with batch size 32 and learning rate 5e-5, and then perform prediction-layer distillation on the augmented data 5 for 3 epochs, choosing the batch size from {16, 32} and the learning rate from {1e-5, 2e-5, 3e-5} on the dev set. For the task-specific distillation, the maximum sequence length is set to 64 for single-sentence tasks and 128 for sentence-pair tasks.

4 For the large datasets MNLI, QQP and QNLI, we perform only 10 epochs of intermediate-layer distillation, and for the challenging task CoLA, we perform 50 epochs at this step.

5 For the regression task STS-B, the original train set works better.

Baselines
We compare TinyBERT with BERT-Tiny, BERT-Small (Turc et al., 2019), BERT4-PKD (Sun et al., 2019), DistilBERT4 (Sanh et al., 2019), PD (Turc et al., 2019) and MobileBERT-Tiny (Sun et al., 2020). BERT-Tiny denotes directly pre-training a small BERT that has the same model architecture as TinyBERT4. When training BERT-Tiny, we follow the same learning strategy as described in the original BERT paper (Devlin et al., 2019). To make a fair comparison, we use the released code to train a 4-layer BERT4-PKD 7 and a 4-layer DistilBERT4 8, and fine-tune these 4-layer baselines with the suggested hyper-parameters. For the 6-layer baselines, we use the reported numbers or evaluate the results on the GLUE test set with released models.

Experimental Results on GLUE
We submitted our model predictions to the official GLUE evaluation server to obtain results on the test set 9, as summarized in Table 1.
The experimental results from the 4-layer student models demonstrate that: 1) There is a large performance gap between BERT-Tiny (or BERT-Small) and BERT-Base due to the dramatic reduction in model size. 2) TinyBERT4 is consistently better than BERT-Tiny on all the GLUE tasks and obtains a large improvement of 6.8% on average. This indicates that the proposed KD learning framework can effectively improve the performance of small models on a variety of downstream tasks. 3) TinyBERT4 significantly outperforms the 4-layer state-of-the-art KD baselines (i.e., BERT4-PKD and DistilBERT4) by a margin of at least 4.4%, with ~28% of the parameters and a 3.1x inference speedup. 4) Compared with the teacher BERT-Base, TinyBERT4 is 7.5x smaller and 9.4x faster, while maintaining competitive performance. 5) For the challenging CoLA dataset (the task of predicting linguistic acceptability judgments), all the 4-layer distilled models show big performance gaps compared with the teacher model, while TinyBERT4 achieves a significant improvement over the 4-layer baselines. 6) We also compare TinyBERT with the 24-layer MobileBERT-Tiny, which is distilled from the 24-layer IB-BERT-Large. The results show that TinyBERT4 achieves the same average score as the 24-layer model with only 38.7% of the FLOPs. 7) When we increase the capacity of our model to TinyBERT6, its performance is further elevated: it outperforms the baselines of the same architecture by a margin of 2.6% on average and achieves results comparable to the teacher. 8) Compared with the other two-stage baseline PD, which first pre-trains a small BERT and then performs distillation on a specific task with this small model, TinyBERT initializes the student in the task-specific stage via general distillation. We analyze these two initialization methods in Appendix C.
In addition, BERT-PKD and DistilBERT initialize their student models with some layers of a pre-trained BERT, which forces the student models to keep the same Transformer-layer (and embedding-layer) size settings as their teacher. In our two-stage distillation framework, TinyBERT is initialized through general distillation, making it more flexible in the choice of model configuration.
More Comparisons. We further demonstrate the effectiveness of TinyBERT by including more baselines, such as Poor Man's BERT (Sajjad et al., 2020), BERT-of-Theseus (Xu et al., 2020) and MiniLM (Wang et al., 2020), some of which report results only on the GLUE dev set. In addition, we evaluate TinyBERT on SQuAD v1.1 and v2.0. Due to space limits, we present these results in Appendices A and B.

Ablation Studies
In this section, we conduct ablation studies to investigate the contributions of: a) the different procedures of the proposed two-stage TinyBERT learning framework in Figure 1, and b) the different distillation objectives in Equation 11.

Effects of Learning Procedure
The proposed two-stage TinyBERT learning framework consists of three key procedures: GD (General Distillation), TD (Task-specific Distillation) and DA (Data Augmentation). The effects of removing each individual learning procedure are analyzed and presented in Table 2. The results indicate that all three procedures are crucial for the proposed method. TD and DA have comparable effects across all four tasks. We note that the task-specific procedures (TD and DA) are more helpful than the pre-training procedure (GD) on all of the tasks. Another interesting observation is that GD contributes more on CoLA than on MNLI and MRPC. We conjecture that the ability of linguistic generalization (Warstadt et al., 2019) learned by GD plays an important role in the task of linguistic acceptability judgments.

Effects of Distillation Objective
We investigate the effects of the distillation objectives on TinyBERT learning. Several ablation baselines are proposed, including learning without the Transformer-layer distillation (w/o Trm), without the embedding-layer distillation (w/o Emb) and without the prediction-layer distillation (w/o Pred) 10. The results are illustrated in Table 3 and show that all of the proposed distillation objectives are useful. The performance w/o Trm 11 drops significantly from 75.6 to 56.3. The reason for the significant drop lies in the initialization of the student model. At the pre-training stage, obtaining a good initialization is crucial for the distillation of Transformer-based models, yet under the w/o Trm setting there is no supervision signal from the upper layers to update the parameters of the Transformer layers at this stage. Furthermore, we study the contributions of attention (Attn) and hidden states (Hidn) in the Transformer-layer distillation. We find that attention-based distillation has a greater impact than hidden-states-based distillation. Meanwhile, these two kinds of knowledge distillation are complementary to each other, which makes them the most important distillation techniques for Transformer-based models in our experiments.

10 The prediction-layer distillation performs the soft cross-entropy of Equation 10 on the augmented data. "w/o Pred" means performing standard cross-entropy against the ground truth of the original train set.

11 Under the "w/o Trm" setting, we actually 1) conduct embedding-layer distillation at the pre-training stage; 2) perform embedding-layer and prediction-layer distillation at the fine-tuning stage.

Effects of Mapping Function
We also investigate the effects of different mapping functions n = g(m) on TinyBERT learning. Our original TinyBERT, as described in Section 4.2, uses the uniform strategy, and we compare it with two typical baselines: the top strategy (g(m) = m + N − M; 0 < m ≤ M) and the bottom strategy (g(m) = m; 0 < m ≤ M).
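The three strategies differ only in index arithmetic; a quick sketch for the M = 4 student and N = 12 teacher used here (1-based Transformer-layer indices):

```python
N, M = 12, 4  # teacher / student Transformer layers

strategies = {
    "uniform": lambda m: (N // M) * m,  # teacher layers 3, 6, 9, 12
    "top":     lambda m: m + N - M,     # teacher layers 9, 10, 11, 12
    "bottom":  lambda m: m,             # teacher layers 1, 2, 3, 4
}
for name, g in strategies.items():
    print(name, [g(m) for m in range(1, M + 1)])
```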
The comparison results are presented in Table 4. We find that the top strategy performs better than the bottom strategy on MNLI, but worse on MRPC and CoLA, which confirms the observation that different tasks depend on knowledge from different BERT layers. The uniform strategy covers the knowledge from the bottom to the top layers of BERT-Base, and it achieves better performance than the other two baselines on all the tasks. Adaptively choosing layers for a specific task is a challenging problem, and we leave it as future work.

Related Work
Pre-trained Language Model Compression. Generally, pre-trained language models (PLMs) can be compressed by low-rank approximation (Ma et al., 2019), knowledge distillation (Sanh et al., 2019; Turc et al., 2019; Sun et al., 2020), pruning (Cui et al., 2019; McCarley, 2019; Elbayad et al., 2020; Gordon et al., 2020; Hou et al., 2020) or quantization (Shen et al., 2019; Zafrir et al., 2019). In this paper, our focus is on knowledge distillation.

Knowledge Distillation for PLMs. There have been several works trying to distill pre-trained language models (PLMs) into smaller models. MiniLM (Wang et al., 2020) conducts deep self-attention distillation, also at the pre-training stage. By contrast, we propose a new two-stage learning framework to distill knowledge from BERT at both the pre-training and fine-tuning stages via a novel Transformer distillation method.

Pre-training Lite PLMs. Other related works aim at directly pre-training lite PLMs. Turc et al. (2019) pre-trained 24 miniature BERT models and showed that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive. ALBERT (Lan et al., 2020) incorporates embedding factorization and cross-layer parameter sharing to reduce model parameters. Since ALBERT does not reduce the hidden size or the number of Transformer layers, it still requires a large amount of computation. Another concurrent work, ELECTRA (Clark et al., 2020), proposes a sample-efficient task called replaced token detection to accelerate pre-training, and it also presents a 12-layer ELECTRA-Small that has comparable performance with TinyBERT4. Unlike these small PLMs, TinyBERT4 is a 4-layer model and can thus achieve a larger speedup.

Conclusion and Future Work
In this paper, we introduced a new method for Transformer-based distillation and further proposed a two-stage framework for TinyBERT. Extensive experiments show that TinyBERT achieves competitive performance while significantly reducing the model size and inference time of BERT-Base, which provides an effective way to deploy BERT-based NLP models on edge devices.
In future work, we plan to study how to effectively transfer the knowledge from wider and deeper teachers (e.g., BERT-Large) to the student TinyBERT.
Combining distillation with quantization/pruning would be another promising direction for further compressing pre-trained language models.

B Results on SQuAD v1.1 and v2.0

We also demonstrate the effectiveness of TinyBERT on the question answering (QA) tasks SQuAD v1.1 (Rajpurkar et al., 2016) and SQuAD v2.0 (Rajpurkar et al., 2018). Following the learning procedure of previous work (Devlin et al., 2019), we treat these two tasks as sequence labeling problems, predicting the possibility of each token being the start or end of an answer span. One small difference from the GLUE tasks is that we perform the prediction-layer distillation on the original training dataset instead of the augmented dataset, which brings better performance.
The results show that TinyBERT consistently outperforms both the 4-layer and 6-layer baselines, which indicates that the proposed framework also works for token-level labeling tasks. Compared with the sequence-level GLUE tasks, question answering depends on more subtle knowledge to infer the correct answer, which increases the difficulty of knowledge distillation. We leave how to build a better QA-TinyBERT as future work.

C Initializing TinyBERT with BERT-Tiny
In the proposed two-stage learning framework, to make TinyBERT work effectively for different downstream tasks, we propose General Distillation (GD) to capture general-domain knowledge, through which TinyBERT learns the knowledge from the intermediate layers of the teacher BERT at the pre-training stage. After that, a general TinyBERT is obtained and used as the initialization of the student model for Task-specific Distillation (TD) on downstream tasks. In our preliminary experiments, we also tried to initialize TinyBERT with the directly pre-trained BERT-Tiny and then conduct TD on downstream tasks. We denote this compression method as BERT-Tiny(+TD). The results in Table 7 show that BERT-Tiny(+TD) performs even worse than BERT-Tiny on the MRPC and CoLA tasks. We conjecture that, without imitating the behaviors of BERT-Base at the pre-training stage, BERT-Tiny derives intermediate representations (e.g., attention matrices and hidden states) whose distributions are mismatched with those of the BERT-Base model. The subsequent task-specific distillation under the supervision of the fine-tuned BERT-Base further disturbs the learned distribution/knowledge of BERT-Tiny, finally leading to poor performance on data-scarce tasks. For the data-rich task (e.g., MNLI), TD has enough training data for BERT-Tiny to acquire the task-specific knowledge well, even though the pre-trained distributions have already been disturbed.

Table 7: Results of different methods at the pre-training stage. TD and GD refer to Task-specific Distillation (without data augmentation) and General Distillation, respectively. The results are evaluated on the dev set.
From the results in Table 7, we find that GD can effectively transfer knowledge from the teacher BERT to the student TinyBERT and achieve results comparable to BERT-Tiny (61.1 vs. 63.9), even without performing the MLM and NSP tasks. Furthermore, the task-specific distillation boosts the performance of TinyBERT by continuing to learn task-specific knowledge from the fine-tuned teacher BERT-Base.

D GLUE Details
The GLUE datasets are described as follows:

MNLI. Multi-Genre Natural Language Inference is a large-scale, crowd-sourced entailment classification task (Williams et al., 2018). Given a (premise, hypothesis) pair, the goal is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise.

QQP. Quora Question Pairs is a collection of question pairs from the website Quora. The task is to determine whether two questions are semantically equivalent (Chen et al., 2018).

QNLI. Question Natural Language Inference is a version of the Stanford Question Answering Dataset that has been converted to a binary sentence-pair classification task by Wang et al. (2018). Given a (question, context) pair, the task is to determine whether the context contains the answer to the question.

SST-2. The Stanford Sentiment Treebank is a binary single-sentence classification task, where the goal is to predict the sentiment of movie reviews (Socher et al., 2013).

CoLA. The Corpus of Linguistic Acceptability is a task to predict whether an English sentence is grammatically correct (Warstadt et al., 2019).

STS-B.
The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and many other domains (Cer et al., 2017). The task is to evaluate how similar two pieces of text are, with a score from 0 to 5.

MRPC. Microsoft Research Paraphrase Corpus is a paraphrase identification dataset where the aim is to identify whether two sentences are paraphrases of each other (Dolan and Brockett, 2005).

RTE. Recognizing Textual Entailment is a binary entailment task with a small training dataset (Bentivogli et al., 2009).