DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications. However, they are also notorious for being slow in inference, which makes them difficult to deploy in real-time applications. We propose a simple but effective method, DeeBERT, to accelerate BERT inference. Our approach allows samples to exit earlier without passing through the entire model. Experiments show that DeeBERT is able to save up to ~40% inference time with minimal degradation in model quality. Further analyses show different behaviors in the BERT transformer layers and also reveal their redundancy. Our work provides new ideas to efficiently apply deep transformer-based models to downstream tasks. Code is available at https://github.com/castorini/DeeBERT.


Introduction
Large-scale pre-trained language models such as ELMo (Peters et al., 2018), GPT (Radford et al., 2019), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) have brought significant improvements to natural language processing (NLP) applications. Despite their power, they are notorious for being enormous in size and slow in both training and inference. Their long inference latencies present challenges to deployment in real-time applications and hardware-constrained edge devices such as mobile phones and smart watches.
To accelerate inference for BERT, we propose DeeBERT: Dynamic early exiting for BERT. The inspiration comes from a well-known observation in the computer vision community: in deep convolutional neural networks, higher layers typically produce more detailed and finer-grained features (Zeiler and Fergus, 2014). Therefore, we hypothesize that, for BERT, features provided by the intermediate transformer layers may suffice to classify some input samples.
DeeBERT accelerates BERT inference by inserting extra classification layers (which we refer to as off-ramps) between each transformer layer of BERT (Figure 1). All transformer layers and off-ramps are jointly fine-tuned on a given downstream dataset. At inference time, after a sample goes through a transformer layer, it is passed to the following off-ramp. If the off-ramp is confident of the prediction, the result is returned; otherwise, the sample is sent to the next transformer layer.
In this paper, we conduct experiments on BERT and RoBERTa with six GLUE datasets, showing that DeeBERT is capable of accelerating model inference by up to ∼40% with minimal model quality degradation on downstream tasks. Further analyses reveal interesting patterns in the models' transformer layers, as well as redundancy in both BERT and RoBERTa.

Related Work
BERT and RoBERTa are large-scale pre-trained language models based on transformers (Vaswani et al., 2017). Despite their groundbreaking power, there have been many papers trying to examine and exploit their over-parameterization. Michel et al. (2019) and Voita et al. (2019) analyze redundancy in attention heads. Q-BERT (Shen et al., 2019) uses quantization to compress BERT, and LayerDrop (Fan et al., 2019) uses group regularization to enable structured pruning at inference time. On the knowledge distillation side, TinyBERT (Jiao et al., 2019) and DistilBERT (Sanh et al., 2019) both distill BERT into a smaller transformer-based model, and Tang et al. (2019) distill BERT into even smaller non-transformer-based models.
Our work is inspired by Cambazoglu et al. (2010), Teerapittayanon et al. (2017), and Huang et al. (2018), but mainly differs from previous work in that we focus on improving model efficiency with minimal quality degradation.

Early Exit for BERT inference
DeeBERT modifies fine-tuning and inference of BERT models, leaving pre-training unchanged. It adds one off-ramp for each transformer layer. An inference sample can exit earlier at an off-ramp, without going through the rest of the transformer layers. The last off-ramp is the classification layer of the original BERT model.

DeeBERT at Fine-Tuning
We start with a pre-trained BERT model with n transformer layers and add n off-ramps to it. For fine-tuning on a downstream task, the loss function of the i-th off-ramp is

$$L_i(D; \theta) = \frac{1}{|D|} \sum_{(x, y) \in D} H(y, f_i(x; \theta)),$$

where D is the fine-tuning training set, θ is the collection of all parameters, (x, y) is the feature-label pair of a sample, H is the cross-entropy loss function, and f_i is the output of the i-th off-ramp. The network is fine-tuned in two stages:
1. Update the embedding layer, all transformer layers, and the last off-ramp with the loss function L_n. This stage is identical to BERT fine-tuning in the original paper (Devlin et al., 2019).
2. Freeze all parameters fine-tuned in the first stage, and then update all but the last off-ramp with the loss function $\sum_{i=1}^{n-1} L_i$. The reason for freezing parameters of transformer layers is to keep the optimal output quality for the last off-ramp; otherwise, transformer layers are no longer optimized solely for the last off-ramp, generally worsening its quality.
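The two loss functions used in the stages above can be sketched in a few lines. This is only an illustration under stated assumptions: each L_i is taken to be the mean cross-entropy of off-ramp i over D, the off-ramp outputs are given as precomputed probability vectors, and the function names and toy numbers are hypothetical. It computes loss values only and does not perform any parameter update or freezing.

```python
import math

def cross_entropy(label, probs):
    """H(y, f_i(x)): negative log-probability assigned to the true class."""
    return -math.log(probs[label])

def off_ramp_loss(dataset, off_ramp_outputs):
    """L_i: mean cross-entropy of one off-ramp over the fine-tuning set D.

    dataset: list of (x, y) pairs; only y is used because outputs are precomputed.
    off_ramp_outputs: probability vectors f_i(x), aligned with dataset.
    """
    total = sum(cross_entropy(y, probs)
                for (_, y), probs in zip(dataset, off_ramp_outputs))
    return total / len(dataset)

def stage_losses(dataset, outputs_by_ramp):
    """Return (stage-1 loss L_n, stage-2 loss sum of L_1 .. L_{n-1})."""
    losses = [off_ramp_loss(dataset, outputs) for outputs in outputs_by_ramp]
    return losses[-1], sum(losses[:-1])

# Toy data (made up for illustration): two samples, binary labels, n = 3 off-ramps.
toy_dataset = [("sample a", 0), ("sample b", 1)]
toy_outputs = [
    [[0.6, 0.4], [0.5, 0.5]],  # off-ramp 1
    [[0.8, 0.2], [0.3, 0.7]],  # off-ramp 2
    [[0.9, 0.1], [0.1, 0.9]],  # off-ramp 3 (the original BERT classifier)
]
stage1_loss, stage2_loss = stage_losses(toy_dataset, toy_outputs)
print(f"stage 1 loss: {stage1_loss:.4f}, stage 2 loss: {stage2_loss:.4f}")
```

In an actual implementation the second stage would additionally exclude the embedding layer, the transformer layers, and the last off-ramp from the optimizer, as described in stage 2.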

DeeBERT at Inference
The way DeeBERT works at inference time is shown in Algorithm 1. We quantify an off-ramp's confidence in its prediction using the entropy of the output probability distribution z i . When an input sample x arrives at an off-ramp, the off-ramp compares the entropy of its output distribution z i with a preset threshold S to determine whether the sample should be returned here or sent to the next transformer layer.
It is clear from both intuition and experimentation that a larger S leads to a faster but less accurate model, and a smaller S leads to a more accurate but slower one. In our experiments, we choose S based on this principle.
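The exit rule described above can be sketched as a short loop. This is a minimal sketch, not the authors' implementation: `layers` and `off_ramps` are hypothetical callables standing in for the transformer layers and their classifiers, the toy versions below operate on a single number rather than real hidden states, and the comparison is assumed to be a strict "entropy below S".

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_predict(x, layers, off_ramps, threshold):
    """Return (distribution, exit_layer) at the first confident off-ramp.

    exit_layer is 1-indexed. The last off-ramp always returns, because
    no transformer layers remain after it.
    """
    hidden = x
    n = len(layers)
    for i, (layer, ramp) in enumerate(zip(layers, off_ramps), start=1):
        hidden = layer(hidden)
        probs = ramp(hidden)
        if entropy(probs) < threshold or i == n:
            return probs, i

# Toy stand-ins (not real transformer layers): each layer adds 1 to a scalar
# hidden state, and each off-ramp grows more confident as that state grows.
toy_layers = [lambda h: h + 1.0] * 4
toy_off_ramps = [lambda h: softmax([h, 0.0])] * 4

for s in (0.0, 0.3, 0.6):
    _, exit_layer = early_exit_predict(0.0, toy_layers, toy_off_ramps, s)
    print(f"S={s}: exit at layer {exit_layer}")
```

On this toy setup the exit layer moves from 4 to 3 to 1 as S goes from 0 to 0.3 to 0.6, which mirrors the trade-off stated above: a larger S lets samples leave earlier, at the cost of accepting less confident predictions.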
We also explored using ensembles of multiple layers instead of a single layer for the off-ramp, but this does not bring significant improvements. The reason is that predictions from different layers are usually highly correlated, and a wrong prediction is unlikely to be "fixed" by the other layers. Therefore, we stick to the simple yet efficient single output layer strategy.

Experimental Setup
We apply DeeBERT to both BERT and RoBERTa, and conduct experiments on six classification datasets from the GLUE benchmark (Wang et al., 2018): SST-2, MRPC, QNLI, RTE, QQP, and MNLI. Our implementation of DeeBERT is adapted from the HuggingFace Transformers Library (Wolf et al., 2019). Inference runtime measurements are performed on a single NVIDIA Tesla P100 graphics card. Hyperparameters such as hidden-state size, learning rate, fine-tune epoch, and batch size are kept unchanged from the library. There is no early stopping and the checkpoint after full fine-tuning is chosen.

Main Results
We vary DeeBERT's quality-efficiency trade-off by setting different entropy thresholds S, and compare the results with other baselines in Table 1.
Model quality is measured on the test set, and the results are provided by the GLUE evaluation server. Efficiency is quantified with wall-clock inference runtime (see footnote 1) on the entire test set, where samples are fed into the model one by one. For each run of DeeBERT on a dataset, we choose three entropy thresholds S based on quality-efficiency trade-offs on the development set, aiming to demonstrate two cases: (1) the maximum runtime savings with minimal performance drop (< 0.5%), and (2) the runtime savings with moderate performance drop (2%–4%).
Chosen S values differ for each dataset.
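One way to read the selection procedure is as a constrained choice over development-set measurements. The sketch below is an assumption, not the procedure used in the paper: the function name `pick_threshold`, the interpretation of "drop" as an absolute difference in the quality score, and all numbers are hypothetical and only illustrate the idea.

```python
def pick_threshold(candidates, baseline_quality, max_drop):
    """Pick the candidate with the largest saving whose quality drop is below max_drop.

    candidates: list of (threshold, dev_quality, runtime_saving) tuples.
    Returns None when no candidate satisfies the budget.
    """
    allowed = [c for c in candidates if baseline_quality - c[1] < max_drop]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c[2])

# Made-up development-set measurements, for illustration only.
toy_baseline = 90.0
toy_candidates = [
    (0.05, 89.8, 0.20),
    (0.10, 89.6, 0.35),
    (0.30, 87.0, 0.55),
    (0.50, 84.0, 0.70),
]
minimal_drop_choice = pick_threshold(toy_candidates, toy_baseline, max_drop=0.5)
moderate_drop_choice = pick_threshold(toy_candidates, toy_baseline, max_drop=4.0)
print(minimal_drop_choice, moderate_drop_choice)
```

Because the admissible set depends on each dataset's own quality curve, the selected value naturally differs from one dataset to the next.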
We also visualize the trade-off in Figure 2. Each curve is drawn by interpolating a number of points, each of which corresponds to a different threshold S. Since this only involves a comparison between different settings of DeeBERT, runtime is measured on the development set.
From Table 1 and Figure 2, we observe the following patterns:
• Despite differences in baseline performance, both models show similar patterns on all datasets: the performance (accuracy/F1 score) stays (mostly) the same until runtime saving reaches a certain turning point, and then starts to drop gradually. The turning point typically comes earlier for BERT than for RoBERTa, but after the turning point, the performance of RoBERTa drops faster than that of BERT. The reason for this will be discussed in Section 4.4.
• Occasionally, we observe spikes in the curves, e.g., RoBERTa in SST-2, and both BERT and RoBERTa in RTE. We attribute this to possible regularization brought by early exiting and thus smaller effective model sizes, i.e., in some cases, using all transformer layers may not be as good as using only some of them.
Footnote 1: This includes both CPU and GPU runtime.
Compared with other BERT acceleration methods, DeeBERT has the following two advantages:
• Instead of producing a fixed-size smaller model like DistilBERT (Sanh et al., 2019), DeeBERT produces a series of options for faster inference, which users have the flexibility to choose from, according to their demands.
• Unlike DistilBERT and LayerDrop (Fan et al., 2019), DeeBERT does not require further pretraining of the transformer model, which is much more time-consuming than fine-tuning.

Expected Savings
As the measurement of runtime might not be stable, we propose another metric to capture efficiency, called expected saving, defined as

$$1 - \frac{\sum_{i=1}^{n} i \times N_i}{\sum_{i=1}^{n} n \times N_i},$$

where n is the number of layers and N_i is the number of samples exiting at layer i. Intuitively, expected saving is the fraction of transformer layer execution saved by using early exiting. The advantage of this metric is that it remains invariant between different runs and can be analytically computed. For validation, we compare this metric with measured saving in Figure 3. Overall, the curves show a linear relationship between expected savings and measured savings, indicating that our reported runtime is a stable measurement of DeeBERT's efficiency.
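The metric is simple enough to compute directly from exit counts. The sketch below assumes the definition 1 − Σ i·N_i / Σ n·N_i, which matches the description of expected saving as the fraction of transformer layer executions avoided; the function name and example counts are hypothetical.

```python
def expected_saving(exit_counts):
    """Fraction of transformer layer executions saved by early exiting.

    exit_counts[i - 1] is N_i, the number of samples exiting at layer i,
    so len(exit_counts) is n, the number of layers.
    """
    n = len(exit_counts)
    total = sum(exit_counts)
    if total == 0:
        raise ValueError("exit_counts must contain at least one sample")
    # A sample exiting at layer i has executed exactly i layers.
    executed = sum(i * count for i, count in enumerate(exit_counts, start=1))
    return 1.0 - executed / (n * total)

print(expected_saving([0, 0, 0, 10]))  # every sample runs all 4 layers
print(expected_saving([5, 0, 0, 5]))   # half the samples exit after layer 1
```

With four layers, the first call gives a saving of 0 and the second gives 0.375, since 25 of a possible 40 layer executions are performed. Even a sample that exits at the first off-ramp still executes one layer, so the metric cannot reach 1.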

Layerwise Analyses
In order to understand the effect of applying DeeBERT to both models, we conduct further analyses on each off-ramp layer. Experiments in this section are also performed on the development set.
Output Performance by Layer. For each off-ramp, we force all samples in the development set to exit here, measure the output quality, and visualize the results in Figure 4. From the figure, we notice the difference between BERT and RoBERTa. The output quality of BERT improves at a relatively stable rate as the index of the exit off-ramp increases. The output quality of RoBERTa, on the other hand, stays almost unchanged (or even worsens) for a few layers, then rapidly improves, and reaches a saturation point before BERT does. This provides an explanation for the phenomenon mentioned in Section 4.2: on the same dataset, RoBERTa often achieves more runtime savings while maintaining roughly the same output quality, but then quality drops faster after reaching the turning point. We also show the results for BERT-large and RoBERTa-large in Figure 5. From the two plots on the right, we observe signs of redundancy that both BERT-large and RoBERTa-large share: the last several layers do not show much improvement compared with the previous layers (performance even drops slightly in some cases). Such redundancy can also be seen in Figure 4.
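The forced-exit measurement can be sketched as follows, assuming accuracy as the quality metric and that each off-ramp's predicted labels for the whole development set have already been collected. The function name and toy labels are hypothetical and do not reproduce any reported result.

```python
def accuracy_by_layer(labels, predictions_by_layer):
    """Accuracy of each off-ramp when every sample is forced to exit there.

    predictions_by_layer[i - 1] holds off-ramp i's predicted labels,
    aligned with labels.
    """
    return [
        sum(int(pred == gold) for pred, gold in zip(preds, labels)) / len(labels)
        for preds in predictions_by_layer
    ]

# Toy labels and predictions (made up): 4 samples, 3 off-ramps.
toy_labels = [0, 1, 1, 0]
toy_predictions = [
    [0, 0, 0, 0],  # off-ramp 1
    [0, 1, 0, 0],  # off-ramp 2
    [0, 1, 1, 0],  # off-ramp 3
]
layer_accuracies = accuracy_by_layer(toy_labels, toy_predictions)
print(layer_accuracies)
```

Plotting such a list against the off-ramp index yields one quality-by-layer curve per model and dataset.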
Number of Exiting Samples by Layer. We further show the fraction of samples exiting at each off-ramp for a given entropy threshold in Figure 6.
Entropy threshold S = 0 is the baseline, and all samples exit at the last layer; as S increases, gradually more samples exit earlier. Apart from the obvious, we observe additional, interesting patterns: if a layer does not provide better-quality output than previous layers, such as layer 11 in BERT-base and layers 2-4 and 6 in RoBERTa-base (which can be seen in Figure 4, top left), it is typically chosen by very few samples; popular layers are typically those that substantially improve over previous layers, such as layers 7 and 9 in RoBERTa-base. This shows that an entropy threshold is able to choose the fastest off-ramp among those with comparable quality, and achieves a good trade-off between quality and efficiency.
Figure 6: Number of output samples by layer for BERT-base and RoBERTa-base. Each plot represents a separate entropy threshold S. (Horizontal axis: exit layer; scale: 0% to 100% of samples; example panel title: S=0.6, Savings=61%, AccDrop=3.7%.)
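The per-layer exit distribution for a given threshold can be derived from the entropies each off-ramp produces for each sample. The sketch below is illustrative only: the function name and entropy values are hypothetical, and it assumes a sample exits at the first off-ramp whose entropy is strictly below S, falling back to the last off-ramp.

```python
def exit_fractions(entropies_by_sample, threshold):
    """Fraction of samples exiting at each layer under an entropy threshold.

    entropies_by_sample[k][i - 1] is the entropy of off-ramp i's output
    for sample k. Every sample is assumed to have one entry per off-ramp.
    """
    n = len(entropies_by_sample[0])
    counts = [0] * n
    for entropies in entropies_by_sample:
        # Index of the first confident off-ramp, or the last one if none qualifies.
        exit_index = next(
            (i for i, h in enumerate(entropies) if h < threshold), n - 1
        )
        counts[exit_index] += 1
    total = len(entropies_by_sample)
    return [count / total for count in counts]

# Made-up entropies for 4 samples and 3 off-ramps.
toy_entropies = [
    [0.60, 0.30, 0.10],
    [0.65, 0.50, 0.20],
    [0.20, 0.10, 0.05],
    [0.69, 0.68, 0.60],
]
baseline_fractions = exit_fractions(toy_entropies, 0.0)
relaxed_fractions = exit_fractions(toy_entropies, 0.4)
print(baseline_fractions, relaxed_fractions)
```

With S = 0 every toy sample exits at the last layer, and raising S to 0.4 moves half of them to earlier off-ramps, which is the same qualitative behavior described for the baseline and for larger thresholds.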

Conclusions and Future Work
We propose DeeBERT, an effective method that exploits redundancy in BERT models to achieve better quality-efficiency trade-offs. Experiments demonstrate its ability to accelerate BERT's and RoBERTa's inference by up to ∼40%, and also reveal interesting patterns of different transformer layers in BERT models.
A few questions remain unanswered in this paper and suggest directions for future research: (1) DeeBERT's training method, while maintaining good quality in the last off-ramp, reduces model capacity available for intermediate off-ramps; it would be important to look for a method that achieves a better balance between all off-ramps. (2) The reasons why some transformer layers appear redundant and why DeeBERT considers some samples easier than others remain unknown; it would be interesting to further explore relationships between pre-training and layer redundancy, sample complexity and exit layer, and related characteristics.