TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing

In this paper, we introduce TextBrewer, an open-source knowledge distillation toolkit designed for natural language processing. It works with different neural network models and supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling. TextBrewer provides a simple and uniform workflow that enables quick setup of distillation experiments with highly flexible configurations. It offers a set of predefined distillation methods and can be extended with custom code. As a case study, we use TextBrewer to distill BERT on several typical NLP tasks. With simple configurations, we achieve results that are comparable with or even higher than the public distilled BERT models with similar numbers of parameters.


Introduction
Large pre-trained language models, such as GPT (Radford, 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b) and XLNet (Yang et al., 2019), have achieved great success in many NLP tasks and greatly contributed to the progress of NLP research. However, one big issue of these models is the high demand for computing resources: they usually have hundreds of millions of parameters and take several gigabytes of memory for training and inference, which makes it impractical to deploy them on mobile or online systems. From a research point of view, we are tempted to ask: is it necessary to have such a big model that contains hundreds of millions of parameters to achieve high performance? Motivated by the above considerations, some researchers in the NLP community have recently tried to design lite models (Lan et al., 2019), or resorted to the knowledge distillation technique to compress large pre-trained models into small models.
1 TextBrewer: http://textbrewer.hfl-rc.com
Knowledge Distillation (KD) is a technique of transferring knowledge from a teacher model to a student model, which is usually smaller than the teacher. The student model is trained to mimic the outputs of the teacher model. Before the birth of BERT, KD had been applied to several specific tasks in NLP, such as machine translation (Kim and Rush, 2016; Tan et al., 2019). Recent studies on distilling large pre-trained models, in contrast, focus on finding general distillation methods that work on various tasks, and they are receiving more and more attention (Sanh et al., 2019; Jiao et al., 2019; Sun et al., 2019; Tang et al., 2019; Liu et al., 2019a; Clark et al., 2019; Zhao et al., 2019).
Though a variety of distillation methods have been proposed, they usually share a common workflow: first, train a teacher model; then optimize the student model by minimizing losses that are calculated between the outputs of the teacher and the student. Therefore, it is desirable to have a reusable distillation workflow framework and to treat different distillation strategies and tricks as plugins, so that they can be easily and arbitrarily added to the framework. In this way, we also achieve great flexibility in experimenting with different combinations of distillation strategies and comparing their effects.
In this paper, we introduce TextBrewer, a PyTorch-based (Paszke et al., 2019) knowledge distillation toolkit for NLP that aims to provide a unified distillation workflow, save the effort of setting up experiments, and help users to distill more effective models. TextBrewer provides simple-to-use APIs, a collection of distillation methods, and highly customizable configurations. It has also been shown to reproduce state-of-the-art results on typical NLP tasks. The main features of TextBrewer are:
• Versatility in tasks and models. It works with a wide range of models, from RNN-based models to Transformer-based models. It does not presume any network structures of teacher and student models. Its usability in tasks like text classification, reading comprehension, and sequence labeling has also been fully tested.
• Flexibility in configurations. The distillation process is configured by configuration objects, which can be initialized from JSON files and contain many tunable hyperparameters. If the presets do not meet the users' requirements, they can extend the configurations with new custom losses, schedulers, etc.
• Including various distillation methods and strategies. KD has been studied extensively in computer vision (CV) and has achieved great success. It would be worthwhile to introduce these studies to the NLP community, as some of the methods in these studies could also be applied to texts. TextBrewer includes a set of methods from both CV and NLP, such as flow of solution procedure (FSP) matrix loss (Yim et al., 2017), neuron selectivity transfer (NST) (Huang and Wang, 2017), probability shift and dynamic temperature (Wen et al., 2019), attention matrix loss, and multi-task distillation (Liu et al., 2019a). In our experiments, we will show the effectiveness of applying methods from CV to NLP tasks.
• Being non-intrusive and simple to use. Non-intrusive means there is no need to modify the existing model code. Users can re-use their existing training scripts, and only minimal changes are required to use TextBrewer to perform distillation.
TextBrewer also provides some useful utilities such as model size analysis and data augmentation to help model design and distillation.

Related Work
Recently, some distilled BERT models have been released, such as DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2019), and ERNIE Slim. DistilBERT performs distillation on the pre-training task, i.e., masked language modeling. TinyBERT performs transformer distillation at both the pre-training and task-specific learning stages. ERNIE Slim distills ERNIE on a sentiment classification task. Their distillation code is publicly available, and users can replicate their experiments easily. However, it is laborious and error-prone to change the distillation method or adapt the distillation code to other models and tasks, since the code is not written for general distillation purposes.
There also exist some libraries for general model compression. Distiller (Zmora et al., 2018) and PaddleSlim are two versatile libraries supporting pruning, quantization and knowledge distillation. They focus on models and tasks in computer vision. In comparison, TextBrewer is more focused on knowledge distillation on NLP tasks, is more flexible, and offers more functionalities. Based on PyTorch, it provides simple APIs and rich customization for fast and clean implementations of experiments.

Architecture and Design
Figure 1 shows an overview of the main functionalities and architecture of TextBrewer. To support different models and different tasks while staying flexible and extensible, TextBrewer provides distillers to conduct the actual experiments and configuration classes to configure the behaviors of the distillers.

Distillers
Distillers are the core of TextBrewer. They automatically train and save models and support custom evaluation functions. The following distillers have been implemented: BasicDistiller is used for single-task single-teacher distillation; GeneralDistiller additionally supports more advanced intermediate loss functions; MultiTeacherDistiller distills an ensemble of teacher models into a single student model; MultiTaskDistiller distills multiple teacher models of different tasks into a single multi-task student model. We have also implemented BasicTrainer for training teachers on labeled data, to unify the workflows of supervised learning and distillation. All the distillers share the same interface and usage, so they can easily be replaced by each other.

Configurations and Presets
The general training settings and the distillation method settings of a distiller are specified by two configurations: TrainingConfig and DistillationConfig. TrainingConfig defines the settings that are general to deep learning experiments, including the directories where logs and student models are stored (log_dir, output_dir), the device to use (device), the frequency of storing and evaluating the student model (ckpt_frequency), etc.
DistillationConfig defines the settings that are pertinent to distillation, where various distillation methods can be configured or enabled. It includes the type of KD loss (kd_loss_type), the temperature and weight of the KD loss (temperature and kd_loss_weight), the weight of the hard-label loss (hard_label_weight), the probability shift switch, schedulers, intermediate losses, etc. Intermediate losses are used for computing the losses between the intermediate states of teacher and student, and they can be freely combined and added to the distillers. Schedulers are used to adjust the loss weight or temperature dynamically.
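To make the roles of the temperature and the two loss weights concrete, the following is a minimal sketch in plain Python rather than TextBrewer's own implementation. It assumes the KD loss is a cross-entropy between temperature-softened teacher and student distributions and that the total loss is a weighted sum of the KD loss and the hard-label loss; the function names are hypothetical, and only the parameter names mirror the configuration options above.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_cross_entropy(student_logits, teacher_logits, temperature):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def hard_label_cross_entropy(student_logits, label):
    """Standard cross-entropy against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def total_loss(student_logits, teacher_logits, label,
               temperature=8.0, kd_loss_weight=1.0, hard_label_weight=0.0):
    """Weighted sum of the soft-target (KD) loss and the hard-label loss."""
    kd = kd_cross_entropy(student_logits, teacher_logits, temperature)
    hard = hard_label_cross_entropy(student_logits, label)
    return kd_loss_weight * kd + hard_label_weight * hard
```

A higher temperature flattens both distributions, so the student also receives signal from the classes the teacher considers unlikely. With hard_label_weight set to 0, the student learns from the teacher alone.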
The available values of configuration options such as loss functions and schedulers are defined as dictionaries in presets. For example, the loss function dictionary includes hidden state loss, cosine similarity loss, FSP loss, NST loss, etc.
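As an illustration of a scheduler preset, the sketch below stores a few weight schedulers in a dictionary and looks one up by name. The scheduler names, their signatures and the WEIGHT_SCHEDULERS dictionary are assumptions made for this example and are not taken from the toolkit; the only idea carried over from the text is that a scheduler maps training progress to a loss weight.

```python
def constant(progress, base):
    """Keep the weight fixed throughout training."""
    return base


def linear_growth(progress, base):
    """Grow the weight linearly from 0 to base as training progresses."""
    return base * progress


def linear_decay(progress, base):
    """Shrink the weight linearly from base to 0 as training progresses."""
    return base * (1.0 - progress)


# Hypothetical preset dictionary: option name -> scheduler function.
WEIGHT_SCHEDULERS = {
    "constant": constant,
    "linear_growth": linear_growth,
    "linear_decay": linear_decay,
}


def scheduled_weight(name, step, total_steps, base=1.0):
    """Look up a scheduler by name and evaluate it at the current step."""
    scheduler = WEIGHT_SCHEDULERS[name]
    return scheduler(step / total_steps, base)
```

Because the configuration refers to schedulers only by name, swapping one schedule for another is a one-word change in the configuration.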
All the configurations can be constructed from JSON files. In Figure 3 we show an example of a DistillationConfig for distilling BERT-base to a 4-layer transformer. See Section 4 for more details.
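The idea of building a configuration object from JSON can be sketched as follows. SketchDistillationConfig is a hypothetical stand-in written for this example, not the toolkit's DistillationConfig class. Its default values, the keys inside intermediate_matches and the projection sizes 312 and 768 are illustrative assumptions.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SketchDistillationConfig:
    """Hypothetical stand-in for a distillation configuration object."""
    temperature: float = 4.0
    kd_loss_type: str = "ce"
    kd_loss_weight: float = 1.0
    hard_label_weight: float = 0.0
    intermediate_matches: list = field(default_factory=list)

    @classmethod
    def from_json(cls, text):
        """Build a config from a JSON string; unknown keys raise TypeError."""
        return cls(**json.loads(text))


# Illustrative JSON: two hidden-state matches with a linear projection
# from an assumed student hidden size (312) to the teacher's (768).
EXAMPLE_JSON = """
{
  "temperature": 8,
  "kd_loss_type": "ce",
  "intermediate_matches": [
    {"layer_T": 0, "layer_S": 0, "feature": "hidden",
     "loss": "hidden_mse", "weight": 1, "proj": ["linear", 312, 768]},
    {"layer_T": 12, "layer_S": 4, "feature": "hidden",
     "loss": "hidden_mse", "weight": 1, "proj": ["linear", 312, 768]}
  ]
}
"""

config = SketchDistillationConfig.from_json(EXAMPLE_JSON)
```

Options omitted from the JSON fall back to their defaults, so an experiment file only needs to list what it changes.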

Workflow
Before distilling a teacher model using TextBrewer, some preliminary work has to be done:
1. Train a teacher model on a labeled dataset. Users usually train the teacher model with
2. Define and initialize the student model.
3. Build a DataLoader of the dataset for distillation and initialize the optimizer and learning rate scheduler.
The above steps are usually common to all deep learning experiments. To perform distillation, take the following additional steps:
1. Initialize training and distillation configurations, and construct a distiller.
2. Define adaptors and a callback function.
3. Call the train method of the distiller.
A code snippet that shows the minimal workflow is presented in Figure 2. The concepts of callback and adaptor will be explained below. Since it is impractical to implement evaluation metrics and evaluation procedures for all NLP tasks, we encourage users to implement their own evaluation functions as callbacks, as best practice.

Adaptor
The distiller is model-agnostic. It needs a translator to translate the model outputs into meaningful data, and the adaptor plays this role. An adaptor is an interface responsible for explaining the inputs and outputs of the teacher and student to the distiller.
An adaptor takes two arguments: the model inputs and the model outputs. It is expected to return a dictionary with some specific keys. Each key explains the meaning of the corresponding value, as shown in Figure 1 (b). For example, logits is the logits of the final outputs, hidden is the intermediate hidden states, attention is the attention matrices, and inputs_mask is used to mask padding positions. The distiller only takes the necessary elements from the outputs of adaptors according to its distillation configurations. A minimal adaptor only needs to explain logits, as shown in lines 11-14 of Figure 2.
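The adaptor contract described above can be sketched with plain Python functions. The dictionary keys follow the text. Everything else is assumed for illustration, including the ordering of the model outputs, the attention_mask field of the batch and the collect_for_distillation helper that mimics how a distiller might read the adaptor output.

```python
def minimal_adaptor(batch, model_outputs):
    """Explain only the logits; assumes they are the first model output."""
    return {"logits": model_outputs[0]}


def full_adaptor(batch, model_outputs):
    """Also expose hidden states, attention matrices and the padding mask."""
    logits, hidden_states, attentions = model_outputs
    return {
        "logits": logits,
        "hidden": hidden_states,
        "attention": attentions,
        "inputs_mask": batch["attention_mask"],
    }


def collect_for_distillation(adapted, required_keys):
    """Mimic a distiller that reads only the entries its configuration needs."""
    missing = [key for key in required_keys if key not in adapted]
    if missing:
        raise KeyError(f"adaptor did not provide: {missing}")
    return {key: adapted[key] for key in required_keys}


# Toy data standing in for one batch and one forward pass.
batch = {"input_ids": [[101, 2023, 102, 0]], "attention_mask": [[1, 1, 1, 0]]}
outputs = ([[0.3, -1.2]], ["h0", "h1"], ["a0"])

logits_only = collect_for_distillation(minimal_adaptor(batch, outputs), ["logits"])
with_mask = collect_for_distillation(full_adaptor(batch, outputs), ["logits", "inputs_mask"])
```

Because the adaptor is the only model-specific piece, the same distiller can serve models with very different output formats.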

Extensibility
TextBrewer also works with users' custom modules. New loss functions and schedulers can be easily added to the toolkit. For example, to use a custom loss function, one first implements the loss function with a compatible interface and then adds it to the loss function dictionary in the presets under a custom name, so that the new loss function becomes available as a new option value of the configuration and can be recognized by distillers.
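The registration pattern described here can be sketched with an ordinary dictionary. LOSS_PRESETS, the two loss functions and the intermediate_loss helper are hypothetical names chosen for this example. They operate on flat lists of floats rather than tensors and do not reproduce the toolkit's actual loss interface.

```python
def hidden_mse(student_state, teacher_state):
    """Mean squared error between two equally sized hidden vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_state, teacher_state)]
    return sum(diffs) / len(diffs)


# Hypothetical preset dictionary: option name -> loss function.
LOSS_PRESETS = {"hidden_mse": hidden_mse}


def hidden_l1(student_state, teacher_state):
    """Custom loss with the same interface: mean absolute error."""
    diffs = [abs(s - t) for s, t in zip(student_state, teacher_state)]
    return sum(diffs) / len(diffs)


# Registering the custom loss makes its name a valid configuration value.
LOSS_PRESETS["hidden_l1"] = hidden_l1


def intermediate_loss(match, student_state, teacher_state):
    """Resolve the loss named in a configuration entry and apply its weight."""
    loss_fn = LOSS_PRESETS[match["loss"]]
    return match["weight"] * loss_fn(student_state, teacher_state)
```

The important design point is the shared interface: as long as a custom function accepts the same arguments as the built-in losses, the code that resolves names never needs to change.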

Experiments
In this section, we conduct several experiments to show TextBrewer's ability to distill large pretrained models on different NLP tasks and achieve state-of-the-art results.

Settings
Datasets and tasks. We conduct experiments on both English and Chinese datasets. For English datasets, we use MNLI (Wang et al., 2019) for the text classification task, SQuAD1.1 (Rajpurkar et al., 2016) for the span-extraction machine reading comprehension (MRC) task, and CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) for the named entity recognition (NER) task. For Chinese datasets, we use XNLI (Conneau et al., 2018), LCQMC (Liu et al., 2018), CMRC 2018 (Cui et al., 2019b) and DRCD (Shao et al., 2018). XNLI is the multilingual version of MNLI. LCQMC is a large-scale Chinese question matching corpus. We use these two datasets for testing text classification tasks. CMRC 2018 and DRCD are two span-extraction machine reading comprehension datasets similar to SQuAD. The statistics are listed in Table 2.
Models. We choose the BERT-base model as the teacher for all tasks. For English tasks, the teacher is initialized with the weights released by Google and converted into PyTorch format by HuggingFace. For Chinese tasks, the teacher is initialized with the pre-trained weights of Chinese RoBERTa-wwm-ext (Cui et al., 2019a). We test the performance of several different student models. The model structures of the teacher and students are summarized in Table 1. T6 and T3 are BERT models with fewer transformer layers. T3-small is a 3-layer BERT with hidden size and feed-forward size being half of BERT-base's. T4-tiny, which has the same structure as TinyBERT, is a 4-layer model with an even smaller hidden size and feed-forward size. T3-small and T4-tiny are initialized randomly. BiGRU is a single-layer bidirectional GRU which uses the same word embeddings as BERT.
Training settings. To keep experiments simple, we directly distill the teacher model that has been trained on the task, and do not perform task-irrelevant language modeling distillation in advance. The number of epochs ranges from 30 to 60, and the learning rate of the student is 1e-4 for all experiments unless otherwise specified.
Distillation settings. Temperature is set to 8 for all experiments. We add intermediate losses uniformly distributed among all the layers between teacher and student (except BiGRU). The loss functions we choose are the hidden_mse loss, which computes the mean square loss between two hidden states, and the NST loss, which is an effective method in the CV field. In Figure 3 we show an example of a distillation configuration for distilling BERT-base to T4-tiny. Since their hidden sizes are different, we use the proj option to add linear layers to match the dimensions. The linear layers are trained together with the student automatically. We experiment with two kinds of distillers: GeneralDistiller and MultiTeacherDistiller.
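One way to read "uniformly distributed" intermediate losses is sketched below, together with a linear projection that lets a smaller student hidden state be compared with a larger teacher hidden state. This is an assumption-laden illustration and does not reproduce the configuration in Figure 3. The function names are hypothetical, index 0 is taken to be the embedding output, and the sketch requires the teacher depth to be a multiple of the student depth.

```python
def uniform_layer_matches(num_teacher_layers, num_student_layers):
    """Pair student layer s with teacher layer s * stride (index 0 = embeddings)."""
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("sketch assumes teacher depth is a multiple of student depth")
    stride = num_teacher_layers // num_student_layers
    return [(s * stride, s) for s in range(num_student_layers + 1)]


def project(vector, weight, bias):
    """Linear layer mapping a student hidden vector to the teacher's hidden size."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def hidden_mse(student_state, teacher_state):
    """Mean squared error between two equally sized hidden vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_state, teacher_state)]
    return sum(diffs) / len(diffs)


# A 12-layer teacher and a 4-layer student: (teacher_layer, student_layer) pairs.
matches = uniform_layer_matches(12, 4)

# Toy sizes: student hidden size 2, teacher hidden size 3.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
projected = project([0.5, -0.5], weight, bias)
loss = hidden_mse(projected, [0.5, 0.5, 0.0])
```

In an actual experiment the projection weights would be trainable parameters updated alongside the student, as the text notes. Here they are fixed numbers so that the arithmetic is easy to follow.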

Results on English Datasets
We show the performance of students obtained by GeneralDistiller in Table 3. First, we observe that teachers can be distilled to T6 models with minor losses in performance: all the T6 models achieve 99% of the teachers' performance. Second, T4-tiny outperforms TinyBERT though they have the same structure. We attribute this to the NST losses that we added in the distillation configuration. This result demonstrates the effectiveness of applying a KD method developed in CV to NLP tasks. Finally, data augmentation is critical. It significantly improves the performance, especially when the training set is small, as with CoNLL-2003.
We next show the effectiveness of MultiTeacherDistiller, which distills an ensemble of teachers into a single student model. For each task, we train three teacher models with the same architecture but different seeds. The student has the same architecture as the teachers. The learning rate is set to 3e-5, and intermediate losses are not used. Table 4 shows the results. The student model achieves the best performance, higher than the ensemble results.
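The ensemble target used in multi-teacher distillation can be illustrated with a short sketch. It assumes that the ensemble averages the teachers' predicted class probabilities; the text only states that an average of three teachers is used, so averaging probabilities rather than logits is an assumption of this example, as are the function names and the toy logits.

```python
import math


def softmax(logits):
    """Softmax over a list of raw scores."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probabilities(teacher_logits_list):
    """Average the class distributions predicted by several teachers."""
    distributions = [softmax(logits) for logits in teacher_logits_list]
    count = len(distributions)
    return [sum(column) / count for column in zip(*distributions)]


# Toy logits from three teachers trained with different seeds.
teachers = [[2.0, 0.0, -1.0], [1.5, 0.5, -0.5], [0.0, 1.0, 0.0]]
soft_target = ensemble_probabilities(teachers)
predicted_class = soft_target.index(max(soft_target))
```

The averaged distribution serves as the soft target for the single student, which is how one model can absorb the combined behavior of several teachers.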

Results on Chinese Datasets
We show the results on Chinese datasets in Table 5. All the distillation experiments were performed with GeneralDistiller. We observe that since CMRC 2018 and DRCD have relatively small training sets, data augmentation has a much more significant effect on student performance on these two tasks. In particular, when the student model is randomly initialized (the T3-small and T4-tiny models), distillation without DA leads to poor performance.

Conclusion and Future Work
In this paper, we present TextBrewer, a flexible PyTorch-based distillation toolkit for NLP research and applications. TextBrewer provides rich customization options for users to compare different distillation methods and build their own strategies. We have conducted a series of experiments, and the results show that the distilled models can achieve state-of-the-art results with simple settings.
Apart from the distillation strategies, the structure of the student is also critical to its performance. In the future, we will continue to incorporate more distillation strategies, and integrate neural architecture search (NAS) into the toolkit to automate the search for model structures.
Figure 1: (a) An overview of the main functionalities of TextBrewer. (b) A sketch that shows the function of adaptors inside a distiller.

Figure 2: A code snippet that demonstrates the minimal TextBrewer workflow.

Figure 3: An example of distillation configuration. This configuration is used to distill a 12-layer BERT-base to a 4-layer T4-tiny.
To monitor the performance of the student model during training, people usually evaluate the student model on a development set at some checkpoints, in addition to logging the loss curve. TextBrewer supports such functionality by providing the callback function argument in the train method, as shown in line 24 of Figure 2. The callback function receives two arguments: the student model and the current training step. At each checkpoint (determined by num_train_epochs and ckpt_frequency), the distiller saves the student model and then calls the callback function.
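The checkpoint-and-callback behavior described above can be sketched with a toy training loop. This is not the toolkit's train method. The sketch assumes that ckpt_frequency means the number of evenly spaced checkpoints per epoch, and it uses a dictionary as a stand-in for the student model; all function names are hypothetical.

```python
def checkpoint_steps(steps_per_epoch, num_epochs, ckpt_frequency):
    """Global steps at which a checkpoint is taken (assumed evenly spaced per epoch)."""
    steps = []
    for epoch in range(num_epochs):
        for k in range(1, ckpt_frequency + 1):
            local_step = (k * steps_per_epoch) // ckpt_frequency
            steps.append(epoch * steps_per_epoch + local_step)
    return steps


def train(student, steps_per_epoch, num_epochs, ckpt_frequency, callback=None):
    """Toy loop: update the student, save at checkpoints, then call the callback."""
    checkpoints = set(checkpoint_steps(steps_per_epoch, num_epochs, ckpt_frequency))
    saved = []
    for step in range(1, steps_per_epoch * num_epochs + 1):
        student["updates"] += 1      # stand-in for one optimization step
        if step in checkpoints:
            saved.append(step)       # stand-in for saving the student weights
            if callback is not None:
                callback(student, step)
    return saved


history = []


def evaluate_on_dev(model, step):
    """User-defined callback: record a stand-in metric for the current step."""
    history.append((step, model["updates"]))


student = {"updates": 0}
saved = train(student, steps_per_epoch=10, num_epochs=2,
              ckpt_frequency=2, callback=evaluate_on_dev)
```

Keeping evaluation in a user-supplied callback means the training loop needs no knowledge of task-specific metrics.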

Table 1: Parameter settings of the teacher and students. The number of parameters includes embeddings but does not include output layers.

Table 2: A summary of the datasets used in experiments. The size of CoNLL-2003 is measured in number of entities.

Table 3: Performance of BERT-base (teacher), TinyBERT and students. m and mm under MNLI denote the accuracies on the matched and mismatched sections respectively. For the experiments in the last line, examples from the training set of NewsQA (Trischler et al., 2017) are used for data augmentation (DA) in SQuAD; passages from the training set of HotpotQA (Yang et al., 2018) are used for data augmentation in CoNLL-2003.

Table 4: Results of multi-teacher distillation on development sets. All the models are BERT-base. Different teachers are trained with different random seeds. For each task, the ensemble is the average of 3 teachers.

Table 5: Performance of the teacher and various students on Chinese tasks. In the experiments with DA, CMRC 2018 and DRCD take each other's training set as data augmentation.