Exploring the Boundaries of Low-Resource BERT Distillation

In recent years, large pre-trained models have demonstrated state-of-the-art performance on many NLP tasks. However, the deployment of these models on devices with limited resources is challenging due to the models' large computational consumption and memory requirements. Moreover, the need for a considerable amount of labeled training data also hinders real-world deployment scenarios. Model distillation has shown promising results for reducing model size and computational load and for improving data efficiency. In this paper we test the boundaries of BERT model distillation in terms of model compression, inference efficiency and data scarcity. We show that classification tasks that require the capturing of general lexical semantics can be successfully distilled by very simple and efficient models and require a relatively small amount of labeled training data. We also show that the distillation of large pre-trained models is more effective in real-life scenarios, where limited amounts of labeled training data are available.


Introduction
In recent years, large pre-trained models such as BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019) and XLNet (Yang et al., 2019) have demonstrated state-of-the-art performance on many NLP tasks and have become standard. However, the deployment of these models on devices with limited resources is challenging due to the models' large computational consumption and memory requirements. For example, the two variants of BERT, named BERT-Base and BERT-Large, consist of approximately 110M and 340M parameters, respectively. Another deployment hurdle in real-world scenarios is the scarcity of labeled data resources.
Model distillation (Ba and Caruana, 2014; Hinton et al., 2015) has shown promising results for reducing model size and computational load while preserving much of the original model's performance. A typical model distillation setup includes two stages: in the first stage, a large, cumbersome and accurate teacher neural network is trained for a specific downstream task. In the second stage, a smaller and simpler student model, which is more practical for deployment in environments with limited resources, is trained to mimic the behavior of the teacher model.
Prior work on transformer-based model distillation focused on reducing the number of layers of the original model, obtaining shallower and more efficient student models (Sun et al., 2019; Sanh et al., 2019; Turc et al., 2019). Tang et al. (2019) proposed a BERT distillation method for single-sentence classification tasks and sentence matching tasks using a BiLSTM (Graves, 2012; İrsoy and Cardie, 2014) student model. Our work is closely related to that of Tang et al. (2019); however, we push the boundaries of BERT model distillation in terms of model size and complexity reduction, computational load and data scarcity for single-sentence classification tasks.
The contribution of this paper is twofold. First, we show that classification tasks that require the capturing of general lexical semantics can be successfully distilled by simple and efficient models while achieving results comparable to those of BERT. Second, building on previous work (Izsak et al., 2019; Mukherjee and Awadallah, 2020), we show that the distillation of large pre-trained models is more effective in real-life scenarios, where a limited amount of labeled training data is available. Moreover, we show that in low-data-resource scenarios, the distillation model's size and complexity can be substantially reduced. Specifically, we show that results produced by a very simple and efficient model such as Continuous Bag of Words (CBoW) with a Feed-Forward Network (FFN) are comparable to results produced by a more complex model such as a BiLSTM.

Approach
The aim of a model distillation process is to use a large pre-trained teacher model to train a small and computationally efficient student model that achieves accuracy comparable to that of the teacher. In this section we describe the teacher and student model architectures (Section 2.1) and the distillation process (Section 2.2).

Model Architectures
For the teacher model we chose the popular pre-trained BERT model (Devlin et al., 2019). Specifically, we used BERT-Base, consisting of 110M parameters, and added a sentence-level softmax classification layer on top of BERT's [CLS] token output. The first step of the distillation process is to fine-tune BERT for a specific task using labeled data. In this step, we jointly fine-tune the parameters of BERT and the sentence-level classifier by maximizing the probability of the correct label, using the cross-entropy loss.
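As a minimal sketch (not the paper's released code), the teacher setup can be pictured as a linear softmax head over the [CLS] vector; the `encoder` argument here is a stand-in for a pre-trained BERT-Base, and the hidden size of 768 matches BERT-Base:

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Sentence-level classifier: a softmax head on the [CLS] output."""
    def __init__(self, encoder, hidden_size=768, num_classes=4):
        super().__init__()
        self.encoder = encoder                    # stand-in for pre-trained BERT-Base
        self.cls_head = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        hidden = self.encoder(token_ids)          # (batch, seq_len, hidden_size)
        cls_vec = hidden[:, 0, :]                 # [CLS] sits at position 0
        return self.cls_head(cls_vec)             # sentence-level logits

# Fine-tuning (step 1) maximizes the probability of the correct label,
# i.e. minimizes the cross-entropy between the logits and the gold labels:
loss_fn = nn.CrossEntropyLoss()
```

In the actual setup, both the encoder's and the head's parameters are updated jointly during fine-tuning.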
For the student models we chose two non-transformer-based models whose neural architectures are shallower than BERT's and which contain considerably fewer parameters. The two student models are:
CBoW-FFN This simple student model is often used for very efficient text classification based on sentence representations (Agibetov et al., 2018). The network consists of an internal embedding layer with embedding vectors of dimension d_emb = 16, followed by an average pooling layer and a Feed-Forward Network (FFN). The model contains approximately 80K parameters, meaning it is approximately 1375 times more compact than BERT-Base.
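A minimal sketch of such a CBoW-FFN student follows; the FFN width (16) and the number of classes (4, as in AGNews) are assumptions not stated in the text, chosen so the parameter count lands near the reported ~80K (dominated by the 5000 × 16 embedding table):

```python
import torch
import torch.nn as nn

class CBoWFFN(nn.Module):
    """CBoW student: embed tokens, average-pool, classify with a small FFN."""
    def __init__(self, vocab_size=5000, emb_dim=16, hidden=16, num_classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        pooled = self.emb(token_ids).mean(dim=1)     # average pooling over tokens
        return self.ffn(pooled)                      # logits

model = CBoWFFN()
print(sum(p.numel() for p in model.parameters()))    # ~80K parameters
```

Almost all of the capacity sits in the embedding table, which is why the model is so cheap at inference time: a lookup, a mean, and two small matrix multiplications.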
BiLSTM The BiLSTM network (Graves, 2012; İrsoy and Cardie, 2014) consists of a pre-trained embedding layer followed by two identical BiLSTM layers stacked one on top of the other, where the last hidden state of the second layer is fed into an FFN. The model contains approximately 685K parameters, meaning it is approximately 160 times more compact than BERT-Base.
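A corresponding sketch of the BiLSTM student is below; the hidden size (64) and class count (4) are assumptions chosen so that, together with the 5000 × 100 embedding table, the total lands near the reported ~685K parameters:

```python
import torch
import torch.nn as nn

class BiLSTMStudent(nn.Module):
    """Two stacked BiLSTM layers over embeddings, then an FFN classifier."""
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=64, num_classes=4):
        super().__init__()
        # in the paper this layer is initialized from pre-trained word vectors
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.ffn = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        _, (h_n, _) = self.bilstm(self.emb(token_ids))
        # last hidden states of the top layer, one per direction
        last = torch.cat([h_n[-2], h_n[-1]], dim=-1)   # (batch, 2 * hidden)
        return self.ffn(last)                          # logits

model = BiLSTMStudent()
print(sum(p.numel() for p in model.parameters()))      # ~685K parameters
```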
Additional Models We also experimented with Convolutional Neural Networks (CNNs) (Kalchbrenner et al., 2014); however, for the same model size, the BiLSTM performed better.

Table 1: Dataset descriptions and statistics. T-train denotes the number of labeled samples used for training the teacher model (step 1) and S-train denotes the number of unlabeled samples used for training the student model (step 2). *Obtained using the data augmentation method described by Jiao et al. (2020).

The Distillation Process
The first step of the training process consists of fine-tuning the teacher model using the available labeled data. The second step of the distillation process is depicted in Figure 1. In this step, the student model is trained using the unlabeled data, which is fed in parallel into both the fine-tuned teacher model and the student model. Following Tang et al. (2019), we use only the distillation loss, which is calculated for each training batch as the Mean Squared Error (MSE) between the soft targets (logits) produced by the student and teacher models:

L_distill = MSE(y_s, y_t) = ||y_s − y_t||²₂

where y_s and y_t are the logits produced by the student and teacher models, respectively.
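The second step above can be sketched as a single training iteration; this is a generic illustration of logit-matching distillation, not the paper's exact training loop:

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, batch):
    """One student update on an unlabeled batch (step 2 of the distillation)."""
    teacher.eval()
    with torch.no_grad():                  # the teacher is frozen in this step
        y_t = teacher(batch)               # teacher logits: the soft targets
    y_s = student(batch)                   # student logits
    loss = F.mse_loss(y_s, y_t)            # L_distill = MSE(y_s, y_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that no labels appear anywhere in this step: the teacher's logits are the only training signal, which is what makes unlabeled (or augmented) data usable as S-train.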

Datasets and Tasks
The goal of our work is to test the distillation boundaries in terms of model size compression, inference computational load and training data size for single-sentence classification tasks. We conducted experiments on five widely used single-sentence classification datasets, as detailed below.
AGNews A topic classification dataset (Zhang et al., 2015) that consists of internet news titles labeled with four categories: World, Sports, Business and Sci/Tech.
Emotion An emotion classification dataset (Saravia et al., 2018) that consists of Twitter posts labeled with one of six basic emotion categories: sadness, disgust, anger, joy, surprise, and fear.
IMDB A binary sentiment classification dataset that consists of movie reviews (Maas et al., 2011).
SST-2 The Stanford Sentiment Treebank, a binary sentiment classification dataset of movie review sentences, which is part of the GLUE benchmark (Wang et al., 2018).
CoLA The Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each sentence is annotated with whether or not it is a grammatical English sentence. This dataset is also part of the GLUE benchmark.

Table 1 shows the dataset descriptions and statistics. In order to simulate a real-life data-scarce environment, we limited the labeled teacher model training set (T-train) to no more than a thousand samples. It has been shown that large amounts of data are needed for the teacher model to fully express its knowledge (Ba and Caruana, 2014). For the AGNews, Emotion and IMDB datasets, we used the training data included in the datasets as unlabeled student training data (S-train). However, the SST-2 and CoLA datasets do not contain sufficient amounts of training data; therefore, we used the data augmentation method described by Jiao et al. (2020) to generate unlabeled student training data (S-train).

Setup
We adopt the HuggingFace (Wolf et al., 2019) implementation of the BERT-Base (uncased) model for the teacher. We fine-tune the model for 3 epochs with a learning rate of 5e-5 and a batch size of 16. The CBoW-FFN student model was implemented based on the model described by Agibetov et al. (2018), with an embedding size of 16 and a word vocabulary size of 5000. The BiLSTM student model was implemented in a fashion similar to the model described by Chollet, with an embedding size of 100 and a vocabulary size of 5000.

Table 2 shows a comparison, in the low-data-resource scenario, between the accuracy of the two student models and the teacher model across the different datasets and tasks. Overall, the distilled models produced results that are competitive with the teacher model's results across all datasets and tasks, except for the CoLA task. An interesting observation is that the results of the relatively lightweight CBoW-FFN model are on par with the BiLSTM results. A possible explanation is that all of the tasks, with the exception of CoLA, require the detection of general lexical semantic features, with relatively little emphasis on linguistic structure and contextual relations; therefore, BERT's context-oriented architecture has no advantage over the student models' architectures. The CoLA task, on the other hand, requires the detection of linguistic structure and contextual relations, which is where BERT's architecture excels and the student models' architectures fall short.

Table 3 shows an F1 score comparison between the two student models and the teacher model for the low and high labeled-data-resource scenarios on the SST-2 dataset. The table also shows results for the student models when trained directly on the labeled data (non-distilled versions).

Distilled Vs. Non-Distilled Models
The results demonstrate that the student models trained using the distillation method (described in Section 2.2) consistently outperform the baseline student models trained directly on the labeled data, demonstrating the effectiveness of the distillation approach. However, in accordance with the findings of Izsak et al. (2019) and Mukherjee and Awadallah (2020), it is also evident that the F1 score improvement achieved by the distilled student models over the non-distilled models is higher in the low resource scenario than in the high resource scenario. Specifically, the F1 improvements between the distilled and non-distilled versions of the two student models are 16.3% and 17.6% in the low resource scenario, vs. 0.8% and 7.3% in the high resource scenario.
Distilled Models Vs. BERT
The results also show that in the high resource scenario, where an abundance of labeled training data is available, BERT's accuracy advantage over the distilled models grows larger compared to the low resource scenario. Specifically, the F1 score gaps between BERT and the student models in the high resource scenario are 9% and 4.9%, respectively, whereas in the low resource scenario those gaps are only 4.4% and 2.8%, respectively.
BiLSTM Vs. CBoW-FFN
Another observation is that in the high resource case, the practical trade-off between model complexity and accuracy becomes more salient. For example, the F1 score gap between CBoW-FFN and BiLSTM is merely 1.6% in the low resource scenario but reaches 4.1% in the high resource scenario. This observation aligns with the well-known phenomenon that larger and deeper neural networks are able to represent the distribution of the data more accurately than smaller models when large amounts of data are available (Ng, 2018).
Practical Implications
The practical implication of these results is that distillation is more effective in real-life scenarios where limited amounts of labeled training data are available. In high-resource scenarios, however, where an abundance of labeled training data is available, using deeper and more complex student models such as BiLSTM, or shallower transformer-based models, yields higher accuracy.

Conclusion
We showed that in low resource scenarios, it is feasible to distill BERT using very efficient models while preserving comparable results. However, the success of the distillation depends on the dataset and task at hand. Classification tasks that require the capturing of general lexical semantics can be successfully distilled by very simple and efficient models; however, classification tasks that require the detection of linguistic structure and contextual relations are more challenging to distill using simple student models. For future work, we aim to explore the impact of datasets' linguistic structures on distillation success and to develop dataset-related measurements (Arora et al., 2020) for predicting the success of distillation with different student models.