FastBERT: a Self-distilling BERT with Adaptive Inference Time

Pre-trained language models like BERT have proven to be highly performant. However, they are often computationally expensive in many practical scenarios: such heavy models can hardly be deployed with limited resources. To improve their efficiency with assured model performance, we propose FastBERT, a novel speed-tunable model with adaptive inference time. The speed at inference can be flexibly adjusted under varying demands, while redundant calculation on samples is avoided. Moreover, the model adopts a unique self-distillation mechanism at fine-tuning, further enabling greater computational efficiency with minimal loss in performance. Our model achieves promising results on twelve English and Chinese datasets, speeding up inference by a factor ranging from 1 to 12 over BERT, depending on the speedup threshold chosen for the speed-performance tradeoff.


Introduction
The last two years have witnessed significant improvements brought by language pre-training, such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and XLNet. By pre-training on unlabeled corpora and fine-tuning on labeled data, BERT-like models have achieved huge gains on many Natural Language Processing tasks.
Despite this gain in accuracy, these models incur greater computational costs and slower inference speed, which severely impairs their practicality. Actual settings, especially those with limited time and resources in industry, can hardly put such models into operation. For example, tasks like sentence matching and text classification often require processing billions of requests per second. Moreover, the number of requests varies with time: for an online shopping site, the number of requests during holidays is five to ten times that of workdays. A large number of servers needs to be deployed to run BERT in industrial settings, and many spare servers must be reserved to cope with peak request periods, demanding huge costs.

* Corresponding author: Qi Ju (damonju@tencent.com)
To improve their usability, many attempts at model acceleration have been made, such as quantization (Gong et al., 2014), weight pruning (Han et al., 2015), and knowledge distillation (KD) (Romero et al., 2014). As one of the most popular methods, KD requires additional smaller student models that depend entirely on a bigger teacher model and trade task accuracy for computational ease (Hinton et al., 2015). Reducing model size to reach an acceptable speed-accuracy balance, however, only solves the problem halfway: the model remains fixed in size, leaving it unable to cope with drastic changes in request volume.
By inspecting many NLP datasets, we discerned that samples have different levels of difficulty. Heavy models may overcalculate simple inputs, while lighter ones are prone to fail on complex samples. As recent studies (Kovaleva et al., 2019) have shown redundancy in pre-trained models, it is useful to design a one-size-fits-all model that caters to samples of varying complexity and gains computational efficiency with the least loss of accuracy.
Based on this appeal, we propose FastBERT, a pre-trained model with a sample-wise adaptive mechanism. It can dynamically adjust the number of executed layers to reduce computation. The model also has a unique self-distillation process that requires minimal changes to the structure, achieving faster yet equally accurate outcomes within a single framework. Our model not only achieves a considerable speedup over BERT (by 2 to 11 times), but also attains competitive accuracy in comparison to heavier pre-trained models.
Experimental results on six Chinese and six English NLP tasks demonstrate that FastBERT achieves a large reduction in computation with very little loss in accuracy. The main contributions of this paper can be summarized as follows:
• This paper proposes a practical speed-tunable BERT model, namely FastBERT, that balances speed and accuracy in response to varying request volumes;
• The sample-wise adaptive mechanism and the self-distillation mechanism are combined to improve the inference time of an NLP model for the first time. Their efficacy is verified on twelve NLP datasets;
• The code is publicly available at https://github.com/autoliuweijie/FastBERT.

Related work
BERT (Devlin et al., 2019) can learn universal knowledge from massive unlabeled data and produce highly performant outcomes. Many works have followed: RoBERTa (Liu et al., 2019) uses larger corpora and longer training steps; T5 (Raffel et al., 2019) scales up the model size even further; UER (Zhao et al., 2019) pre-trains BERT on different Chinese corpora; K-BERT (Liu et al., 2020) injects a knowledge graph into the BERT model. These models achieve increased accuracy with heavier settings and even more data, but such unwieldy sizes are often a handicap under stringent conditions. To be more specific, BERT-base contains 110 million parameters stacked in twelve Transformer blocks (Vaswani et al., 2017), while BERT-large expands to 24 layers. ALBERT (Lan et al., 2019) shares parameters across layers to reduce the model size. Obviously, the inference speed of these models is much slower than that of classic architectures (e.g., CNN (Kim, 2014), RNN (Wang, 2018), etc.). We believe a large proportion of their computation is redundant.
Knowledge distillation: Many attempts have been made to distill heavy models (teachers) into their lighter counterparts (students). PKD-BERT (Sun et al., 2019a) adopts an incremental extraction process that learns generalizations from intermediate layers of the teacher model. TinyBERT (Jiao et al., 2019) performs a two-stage learning involving both general-domain pre-training and task-specific fine-tuning. DistilBERT (Sanh et al., 2019) further leveraged the inductive bias within large models by introducing a triple loss. As shown in Figure 1, student models often require a separate structure, whose effectiveness, however, depends mainly on the gains of the teacher. They are as indiscriminate toward individual cases as their teachers, and only get faster at the cost of degraded performance.
Adaptive inference: Conventional approaches to adaptive computation operate token-wise or patch-wise, either adding recurrent steps to individual tokens (Graves, 2016) or dynamically adjusting the number of executed layers inside discrete regions of images (Teerapittayanon et al., 2016; Figurnov et al., 2017). To the best of our knowledge, no prior work has applied adaptive mechanisms to pre-trained language models in NLP for efficiency improvements.

Methodology
Distinct from the above efforts, our approach fuses adaptation and distillation into a novel speed-up method, shown in Figure 2, achieving competitive results in both accuracy and efficiency.

Model architecture
As shown in Figure 2, FastBERT consists of a backbone and branches. The backbone is built upon a 12-layer Transformer encoder with an additional teacher classifier, while the branches are student classifiers appended to the output of each Transformer layer to enable early exits.

Backbone
The backbone consists of three parts: the embedding layer, the encoder containing a stack of Transformer blocks (Vaswani et al., 2017), and the teacher classifier. The structures of the embedding layer and the encoder conform with those of BERT (Devlin et al., 2019). Given the sentence length n, an input sentence s = [w_0, w_1, ..., w_n] is transformed by the embedding layer into a sequence of vector representations e:

    e = Embedding(s),  (1)

where e is the summation of the word, position, and segment embeddings. Next, the Transformer blocks in the encoder perform a layer-by-layer feature extraction:

    h_i = Transformer_i(h_{i-1}),  (2)

where h_i (i = -1, 0, 1, ..., L-1) is the output features at the i-th layer, h_{-1} = e, and L is the number of Transformer layers. Following the final encoding output is a teacher classifier that extracts in-domain features for downstream inference. It has a fully-connected layer narrowing the dimension from 768 to 128, a self-attention layer joined with a fully-connected layer without changes in vector size, and a fully-connected layer with a softmax function projecting the vectors to an N-class probability distribution p_t:

    p_t = Teacher_Classifier(h_{L-1}),  (3)

where N is the task-specific number of classes.

Figure 2: The inference process of FastBERT, where the number of executed layers varies per sample based on its complexity, illustrating the sample-wise adaptive mechanism. Taking a batch of inputs (batch size = 4) as an example, Transformer0 and Student-Classifier0 infer the label of each sample as a probability distribution and compute its individual uncertainty. Samples with low uncertainty exit the batch immediately, while those with higher uncertainty are sent to the next layer for further inference.

Branches
To provide FastBERT with more adaptability, multiple branches, i.e., student classifiers with the same architecture as the teacher, are added to the output of each Transformer block to enable early exits, especially for simple cases. The student classifiers can be described as:

    p_s_i = Student_Classifier_i(h_i).  (4)

The student classifier is designed carefully to balance model accuracy and inference speed: overly simple networks may impair performance, while a heavy attention module severely slows down inference. Our classifier has proven to be lighter with assured competitive accuracy; detailed verifications are presented in Section 4.1.
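A shape-level sketch of this classifier head can make the data flow concrete. The following NumPy snippet is a minimal illustration, not the actual FastBERT code: weights are random, biases are omitted, and the function name and the use of the first ([CLS]) position for pooling are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def classifier_head(h, W_narrow, W_q, W_k, W_v, W_fc, W_out):
    """Sketch of the FastBERT classifier head described above:
    fc 768->128, self-attention without size change, fc 128->128,
    and a final fc projecting the [CLS] position to N classes."""
    x = h @ W_narrow                                  # (n, 768) -> (n, 128)
    q, k, v = x @ W_q, x @ W_k, x @ W_v               # 128-dim projections
    x = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v   # self-attention, (n, 128)
    x = x @ W_fc                                      # fc, vector size unchanged
    return softmax(x[0] @ W_out)                      # (N,) class distribution
```

With random weights of the stated shapes, feeding a (128, 768) encoder output yields an N-dimensional probability vector, confirming that only the first projection touches the 768-dimensional space.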

Model training
FastBERT requires separate training steps for the backbone and the student classifiers: the parameters of one module are always frozen while the other is being trained. The model is prepared for downstream inference in three steps: pre-training of the backbone, fine-tuning of the entire backbone, and self-distillation for the student classifiers.

Pre-training
The pre-training of the backbone resembles that of BERT, in the same way that our backbone resembles BERT. Any pre-training method used for BERT-like models (e.g., BERT-WWM (Cui et al., 2019), RoBERTa (Liu et al., 2019), and ERNIE (Sun et al., 2019b)) can be applied directly. Note that the teacher classifier, which is only used for downstream inference, stays unaffected at this stage. Conveniently, FastBERT does not even need to perform pre-training by itself, as it can freely load high-quality pre-trained models.

Fine-tuning for backbone
For each downstream task, we feed the task-specific data into the model, fine-tuning both the backbone and the teacher classifier. The structure of the teacher classifier is as previously described. At this stage, the student classifiers are not enabled.

Self-distillation for branch
With the backbone well-trained for knowledge extraction, its output, a high-quality soft-label containing both the original embedding and the generalized knowledge, is distilled to train the student classifiers. As the students are mutually independent, each of their predictions p_s is compared with the teacher soft-label p_t, with the difference measured by the KL-divergence:

    D_KL(p_s, p_t) = sum_{i=1}^{N} p_s(i) * log( p_s(i) / p_t(i) ).  (5)

As there are L - 1 student classifiers in FastBERT, the sum of their KL-divergences is used as the total loss for self-distillation:

    Loss(p_s_0, ..., p_s_{L-2}, p_t) = sum_{i=0}^{L-2} D_KL(p_s_i, p_t),  (6)

where p_s_i is the probability distribution output by student classifier i.
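Equations (5) and (6) can be sketched in a few lines of NumPy. This is an illustrative stand-in for the loss computation, not the training code; the small eps term is our addition to guard against log 0.

```python
import numpy as np

def kl_divergence(p_s, p_t, eps=1e-12):
    """Eq. (5): KL divergence between a student's prediction p_s and
    the teacher soft-label p_t (eps avoids log of zero)."""
    p_s, p_t = np.asarray(p_s, float), np.asarray(p_t, float)
    return float(np.sum(p_s * np.log((p_s + eps) / (p_t + eps))))

def self_distillation_loss(student_preds, teacher_pred):
    """Eq. (6): the total loss sums the divergences of all L-1 student
    classifiers from the single teacher soft-label."""
    return sum(kl_divergence(p_s, teacher_pred) for p_s in student_preds)
```

A student that exactly matches the teacher contributes zero loss, so the total loss shrinks as the students learn to imitate the teacher's distribution.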
Since this process only requires the teacher's output, we are free to use an unlimited amount of unlabeled data instead of being restricted to labeled data. This provides sufficient resources for self-distillation, meaning we can keep improving the students as long as the teacher allows. Moreover, our method differs from previous distillation methods in that the teacher and student outputs lie within the same model. The learning process requires no additional pre-trained structures, making the distillation entirely a process of self-learning.

Adaptive inference
With the above steps, FastBERT is well-prepared to perform inference in an adaptive manner, which means we can adjust the number of executed encoding layers within the model according to the sample complexity.
At each Transformer layer, we measure for each sample on whether the current inference is credible enough to be terminated.
Given an input sequence, the uncertainty of a student classifier's output p_s is computed as a normalized entropy:

    Uncertainty = - ( sum_{i=1}^{N} p_s(i) * log p_s(i) ) / log N,  (7)

where p_s is the output probability distribution and N is the number of labeled classes.
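A direct implementation of this normalized entropy, sketched in NumPy (the eps term is our addition to handle zero probabilities):

```python
import numpy as np

def uncertainty(p_s, eps=1e-12):
    """Eq. (7): entropy of the output distribution p_s over N classes,
    normalized by log N so the value ranges from 0 (one-hot, fully
    certain) to 1 (uniform, fully uncertain)."""
    p_s = np.asarray(p_s, dtype=float)
    return float(-(p_s * np.log(p_s + eps)).sum() / np.log(p_s.size))
```

A uniform distribution over four classes gives an uncertainty of 1, while a one-hot prediction gives (numerically) 0, matching the stated 0-to-1 range.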
With the definition of the uncertainty, we make an important hypothesis.
Hypothesis 1. LUHA: the Lower the Uncertainty, the Higher the Accuracy.
Definition 1. Speed: The threshold to distinguish high and low uncertainty.
LUHA is verified in Section 4.4. Both Uncertainty and Speed range between 0 and 1. The adaptive inference mechanism can be described as follows: at each layer of FastBERT, the corresponding student classifier predicts the label of each sample and measures its Uncertainty. Samples with Uncertainty below the Speed are sifted to early exits, while samples with Uncertainty above the Speed move on to the next layer.
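The mechanism above can be sketched as a batch-shrinking loop. In this NumPy sketch, `layers[i]` and `classifiers[i]` are hypothetical callables standing in for Transformer_i and Student-Classifier_i; the interfaces are ours, not FastBERT's actual code.

```python
import numpy as np

def uncertainty(p, eps=1e-12):
    """Normalized entropy (eq. 7), ranging over [0, 1]."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum() / np.log(p.size))

def adaptive_inference(batch, layers, classifiers, speed):
    """Sample-wise early exit: a sample leaves the batch at the first
    layer where its uncertainty drops below `speed`; the final layer
    always decides for whatever remains."""
    labels = {}
    active = list(range(len(batch)))      # indices of still-undecided samples
    h = list(batch)
    for i, (layer, clf) in enumerate(zip(layers, classifiers)):
        h = [layer(x) for x in h]         # encode only the remaining samples
        nxt_idx, nxt_h = [], []
        for x, idx in zip(h, active):
            p = clf(x)
            if i == len(layers) - 1 or uncertainty(p) < speed:
                labels[idx] = int(np.argmax(p))   # confident enough: exit here
            else:
                nxt_idx.append(idx)
                nxt_h.append(x)
        active, h = nxt_idx, nxt_h
        if not active:                    # whole batch decided: stop early
            break
    return [labels[i] for i in range(len(batch))]
```

With toy identity "layers" and a softmax "classifier", a sharply peaked input exits at the first layer while an ambiguous one travels to the last, mirroring the batch-size-4 example in Figure 2.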
Intuitively, with a higher Speed, fewer samples are sent to higher layers and the overall inference is faster, and vice versa. Speed can therefore serve as a halting threshold for weighing inference accuracy against efficiency.

Experimental results
In this section, we will verify the effectiveness of FastBERT on twelve NLP datasets (six in English and six in Chinese) with detailed explanations.

FLOPs analysis
Floating-point operations (FLOPs) measure the computational complexity of a model, indicating the number of floating-point operations the model performs for a single inference. FLOPs are independent of the model's operating environment (CPU, GPU, or TPU) and only reflect computational complexity: generally, the larger a model's FLOPs, the longer its inference time. At the same accuracy, models with lower FLOPs are more efficient and more suitable for industrial use. We list the measured FLOPs of both structures in Table 1, from which we can infer that the computational load of the classifier is much lighter than that of the Transformer. This is the basis of FastBERT's speed-up: although it adds classifiers, it saves far more computation by skipping Transformer layers.
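A back-of-envelope estimate shows why the classifier is so much cheaper than a Transformer layer. The sketch below counts only matrix multiplications (2*m*k*n FLOPs each) and ignores biases, softmax, and layer norm; the exact figures in Table 1 will differ, but the order-of-magnitude gap is the point.

```python
def matmul_flops(m, k, n):
    """Count an (m, k) x (k, n) matmul as 2*m*k*n FLOPs (multiply + add)."""
    return 2 * m * k * n

def transformer_layer_flops(n=128, d=768):
    """Rough FLOPs of one BERT-base Transformer layer at sequence length n:
    Q/K/V/output projections, two attention matmuls, and a 4x-wide FFN."""
    projections = 4 * matmul_flops(n, d, d)
    attention = 2 * matmul_flops(n, d, n)          # QK^T and scores @ V
    ffn = matmul_flops(n, d, 4 * d) + matmul_flops(n, 4 * d, d)
    return projections + attention + ffn

def classifier_flops(n=128, d=768, dc=128, n_classes=2):
    """Rough FLOPs of the FastBERT classifier head: fc d->dc, a small
    self-attention at width dc, fc dc->dc, and a final fc to N classes."""
    fc1 = matmul_flops(n, d, dc)
    attention = 3 * matmul_flops(n, dc, dc) + 2 * matmul_flops(n, dc, n)
    fc2 = matmul_flops(n, dc, dc)
    fc3 = matmul_flops(1, dc, n_classes)
    return fc1 + attention + fc2 + fc3
```

Under these assumptions a Transformer layer comes out near 1.9 GFLOPs versus roughly 0.05 GFLOPs for the head, i.e., dozens of times heavier, consistent with the conclusion drawn from Table 1.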

Baseline
In this section, we compare FastBERT against two baselines:
• BERT: the 12-layer BERT-base model, pre-trained on the Wiki corpus and released by Google (Devlin et al., 2019).
• DistilBERT: the best-known distilled version of BERT, with 6 layers, released by HuggingFace (Sanh et al., 2019). In addition, we use the same method to distill DistilBERT into 3-layer and 1-layer versions, respectively.

Dataset
To verify the effectiveness of FastBERT, especially in industrial scenarios, six Chinese and six English datasets pressing closer to actual applications are used. The six Chinese datasets include the sentence classification tasks (ChnSentiCorp, Book review (Qiu et al., 2018), Shopping review, Weibo and THUCNews) and a sentences-matching task (LCQMC (Liu et al., 2018)). All the Chinese datasets are available at the FastBERT project. The six English datasets (Ag.News, Amz.F, DBpedia, Yahoo, Yelp.F, and Yelp.P) are sentence classification tasks and were released in (Zhang et al., 2015).

Performance comparison
To perform a fair comparison, BERT, DistilBERT, and FastBERT all adopt the same configuration. In this paper, L = 12; the number of self-attention heads, the hidden dimension of the embedding vectors, and the maximum input length are set to 12, 768, and 128, respectively. Both FastBERT and BERT use the pre-trained parameters provided by Google, while DistilBERT is pre-trained following (Sanh et al., 2019). We fine-tune these models with the AdamW (Loshchilov and Hutter) algorithm, a learning rate of 2e-5, and a warmup proportion of 0.1, then select the model with the best accuracy within 3 epochs. For the self-distillation of FastBERT, we increase the learning rate to 2e-4 and distill for 5 epochs.
We evaluate the inference capabilities of these models on the twelve datasets and report their accuracy (Acc.) and sample-averaged FLOPs under different Speed values. The comparison results are shown in Table 2, where the Speedup is obtained using BERT as the benchmark. It can be observed that with Speed = 0.1, FastBERT speeds up 2 to 5 times without losing accuracy on most datasets. If a small loss of accuracy is tolerated, FastBERT can be 7 to 11 times faster than BERT. Compared to DistilBERT, FastBERT trades less accuracy for higher efficiency. Figure 3 illustrates FastBERT's tradeoff between accuracy and efficiency: its speedup ratio can be freely adjusted between 1 and 12 while the loss of accuracy remains small, a very attractive feature for industry.

LUHA hypothesis verification
As described in Section 3.3, the adaptive inference of FastBERT rests on the LUHA hypothesis, i.e., "the Lower the Uncertainty, the Higher the Accuracy". Here, we verify this hypothesis on the book review dataset. We intercept the classification results of Student-Classifier0, Student-Classifier5, and the Teacher-Classifier in FastBERT, then compute their accuracy within each uncertainty interval. As shown in Figure 4, the statistics confirm that a classifier follows the LUHA hypothesis whether it sits at the bottom, in the middle, or at the top of the model.
From Figure 4, one might mistakenly conclude that the Students outperform the Teacher, since a Student's accuracy in each uncertainty interval is higher than the Teacher's. This conclusion is refuted by analyzing Figure 6(a) jointly: for the Teacher, more samples fall in the low-uncertainty regions, while the Students' samples are nearly uniformly distributed. The overall accuracy of the Teacher is therefore still higher than that of the Students.

In-depth study
In this section, we conduct a set of in-depth analyses of FastBERT from three aspects: the distribution of exit layers, the distribution of sample uncertainty, and the convergence during self-distillation. The Speed is set to 0.3, 0.5, and 0.8 respectively, and only samples with Uncertainty higher than the Speed are sent to the next layer.

Layer distribution
In FastBERT, each sample walks through a different number of Transformer layers according to its complexity, and fewer executed layers require less computation. As illustrated in Figure 5, we investigate the distribution of exit layers under different Speed constraints (0.3, 0.5, and 0.8) on the book review dataset. Taking Speed = 0.8 as an example, 61% of the samples complete inference at the first layer (Transformer0), which eliminates unnecessary calculation in the following eleven layers.

Uncertainty distribution
The distribution of sample uncertainty predicted by different student classifiers varies, as illustrated in Figure 6. Observing these distributions helps us further understand FastBERT. From Figure 6(a), it can be concluded that the higher a classifier is positioned, the lower its uncertainty at a given Speed, indicating that high-layer classifiers are more decisive than lower ones. It is worth noting that at higher layers there are still samples with uncertainty below the threshold (i.e., the Speed), because high-layer classifiers may reverse judgments made by low-layer classifiers.

Convergence of self-distillation
Self-distillation is a crucial step in enabling FastBERT. This process grants the student classifiers the ability to infer, thereby offloading work from the teacher classifier. Taking the book review dataset as an example, we fine-tune FastBERT for three epochs and then self-distill it for five more. Figure 7 illustrates the convergence of accuracy and FLOPs during fine-tuning and self-distillation: accuracy increases with fine-tuning, while FLOPs decrease during the self-distillation stage.

Ablation study
Adaptation and self-distillation are two crucial mechanisms in FastBERT. We performed ablation studies to investigate their effects on the book review and Yelp.P datasets. The results are presented in Table 3, in which 'without self-distillation' means that all classifiers, including both the teacher and the students, are trained during fine-tuning, while 'without adaptive inference' means that the number of executed layers is fixed to two or six for every sample. From Table 3, we observe that: (1) at almost the same level of speedup, FastBERT performs worse without self-distillation or adaptation; (2) when the model is accelerated more than five times, downstream accuracy degrades dramatically without adaptation. It is safe to conclude that both adaptation and self-distillation play key roles in FastBERT, which achieves significant speedups with favorably small losses in accuracy.

Conclusion
In this paper, we propose FastBERT, a fast version of BERT. Specifically, FastBERT adopts a self-distillation mechanism during training and an adaptive mechanism during inference, gaining efficiency with little accuracy loss. Self-distillation and adaptive inference are introduced to pre-trained NLP models for the first time in this paper. In addition, FastBERT has a very practical feature for industrial scenarios: its inference speed is tunable.
Our experiments demonstrate promising results on twelve NLP datasets: FastBERT can be 2 to 3 times faster than BERT without performance degradation, and if some loss in accuracy is tolerated, its speedup can be tuned anywhere between 1 and 12 times. Besides, FastBERT remains compatible with the parameter settings of other BERT-like models (e.g., BERT-WWM, ERNIE, and RoBERTa), meaning these publicly available models can be readily loaded to initialize FastBERT.

Future work
These promising results point to future work on: (1) linearizing the Speed-Speedup curve; (2) extending this approach to other pre-trained architectures such as XLNet and ELMo (Peters et al., 2018); (3) applying FastBERT to a wider range of NLP tasks, such as named entity recognition and machine translation.