BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction between the original and compact models, and smooths the training process. Compared to the previous knowledge distillation approaches for BERT compression, our approach leverages only one loss function and one hyper-parameter, liberating human effort from hyper-parameter tuning. Our approach outperforms existing knowledge distillation approaches on GLUE benchmark, showing a new perspective of model compression.


Introduction
With the prevalence of deep learning, many huge neural models have been proposed and achieve state-of-the-art performance in various fields [12,38]. Specifically, in Natural Language Processing (NLP), pretraining and fine-tuning have become the new norm for most tasks. Transformer-based pretrained models [4,21,42,31,6] have dominated the field of both Natural Language Understanding (NLU) and Natural Language Generation (NLG). These models benefit from their "overparameterized" nature [24] and contain millions or even billions of parameters, making them computationally expensive and inefficient in terms of both memory consumption and latency. This drawback enormously hinders the application of these models in production.
To resolve this problem, many techniques have been proposed to compress a neural network. Generally, these techniques can be categorized into Quantization [10], Weights Pruning [11] and Knowledge Distillation (KD) [14]. Among them, KD has received much attention for compressing pretrained language models. KD exploits a large teacher model to "teach" a compact student model to mimic the teacher's behavior. In this way, the knowledge embedded in the teacher model can be transferred into the smaller model. However, the retained performance of the student model relies on a well-designed distillation loss function which forces the student model to behave as the teacher. Recent studies on KD [33,15] even leverage more sophisticated model-specific distillation loss functions for better performance.
Different from previous KD studies which explicitly exploit a distillation loss to minimize the distance between the teacher model and the student model, we propose a new genre of model compression. Inspired by the famous thought experiment "Ship of Theseus" in Philosophy, where all components of a ship are gradually replaced by new ones until no original component exists, we propose Theseus Compression for BERT (BERT-of-Theseus), which progressively substitutes modules of BERT with modules of fewer parameters. We call the original model and compressed model predecessor and successor, in correspondence to the concepts of teacher and student in KD, respectively. As shown in Figure 1, we first specify a substitute (successor module) for each predecessor module (i.e., modules in the predecessor model). Then, we randomly replace each predecessor module with its corresponding successor module by a probability and make them work together in the training phase. After convergence, we combine all successor modules to be the successor model for inference. In this way, the large predecessor model can be compressed into the compact successor model.
Theseus Compression shares a similar idea with KD, which encourages the compressed model to behave like the original, but holds many merits. First, we only use the task-specific loss function in the compression process. In contrast, KD-based methods use a task-specific loss together with one or multiple distillation losses as their optimization objective. The use of only one loss function throughout the whole compression process allows us to unify the different phases and keep the compression in a total end-to-end fashion. Also, selecting various loss functions and balancing the weights of each loss for different tasks and datasets are always laborious [33,28]. Second, different from recent work [15], Theseus Compression does not use Transformer-specific features for compression and thus has the potential to compress a wide spectrum of models. Third, instead of using the original model only for inference as in KD, our approach allows the predecessor model to work in association with the compressed successor model, enabling a deeper gradient-level interaction and a smoother training process. Moreover, the different module permutations mixing both predecessor and successor modules add extra regularization, similar to Dropout [32]. With a Curriculum Learning [1] driven replacement scheduler, our approach achieves great performance compressing BERT [4], a large pretrained Transformer model.
To summarize, our contribution is two-fold: (1) We propose a novel approach, Theseus Compression, revealing a new pathway to model compression, with only one loss function and one hyper-parameter. (2) Our compressed BERT model is 1.94× faster while retaining more than 98% performance of the original model, outperforming other KD-based compression baselines.

Related Work
Model Compression. Model compression aims to reduce the size and computational cost of a large model while retaining as much performance as possible. Conventional explanations [3,43] claim that the large number of weights is necessary for the training of deep neural networks but a high degree of redundancy exists after training. Recent work [8] proposes The Lottery Ticket Hypothesis, claiming that dense, randomly-initialized and feed-forward networks contain subnetworks that can be recognized and trained to get a comparable test accuracy to the original network.
Quantization [10] reduces the number of bits used to represent a number in a model. Weights Pruning [11,13] conducts a binary classification to decide which weights are to be trimmed from the model. Knowledge Distillation (KD) [14] aims to train a compact model which behaves like the original one. FitNets [27] demonstrates that "hints" learned by the large model can benefit the distillation process. Born-Again Neural Network [9] reveals that ensembling multiple identical-parameterized students can outperform a teacher model. LIT [17] introduces block-wise intermediate representation training. Liu et al. [20] distilled knowledge from ensemble models to improve the performance of a single model on NLU tasks. Tan et al. [35] exploited KD for multi-lingual machine translation. Different from KD-based methods, our proposed Theseus Compression is the first approach to mix the original model and compact model for training. Also, no additional loss is used throughout the whole compression procedure, which eliminates the tricky hyper-parameter tuning for various losses.
Faster BERT. Very recently, many attempts have been made to speed up a large pretrained language model (e.g., BERT [4]). Michel et al. [22] reduced the parameters of a BERT model by pruning unnecessary heads in the Transformer. Shen et al. [29] quantized BERT to 2-bit using Hessian information. Also, substantial modification has been made to the Transformer architecture. Fan et al. [7] exploited a structure dropping mechanism to train a BERT-like model which is resilient to pruning. ALBERT [18] leverages matrix decomposition and parameter sharing. However, these models cannot exploit ready-made model weights and require a full retraining. Tang et al. [36] used a BiLSTM architecture to extract task-specific knowledge from BERT. DistilBERT [28] applies a naive Knowledge Distillation on the same corpus used to pretrain BERT. Patient Knowledge Distillation (PKD) [33] designs multiple distillation losses between the module hidden states of the teacher and student models. Pretrained Distillation [37] pretrains the student model with a self-supervised masked LM objective on a large corpus first, then performs a standard KD on supervised tasks. TinyBERT [15] conducts the Knowledge Distillation twice with data augmentation. MobileBERT [34] devises a more computationally efficient architecture and applies knowledge distillation with a bottom-to-top layer training procedure.

BERT-of-Theseus
In this section, we introduce module replacing, the technique proposed for BERT-of-Theseus. Further, we introduce a Curriculum Learning driven scheduler to obtain better performance. The workflow is shown in Figure 1.

Module Replacing
The basic idea of Theseus Compression is very similar to KD. We want the successor model to act like a predecessor model. KD explicitly defines a loss to measure the similarity of the teacher and student. However, the performance greatly relies on the design of the loss function [14,33,15]. This loss function needs to be combined with a task-specific loss [33,17]. Different from KD, Theseus Compression only requires one task-specific loss function (e.g., Cross Entropy), which closely resembles a fine-tuning procedure. Inspired by Dropout [32], we propose module replacing, a novel technique for model compression. We call the original model and the target model predecessor and successor, respectively. First, we specify a successor module for each module in the predecessor. For example, in the context of BERT compression, we let one Transformer layer be the successor module for two Transformer layers. Consider a predecessor model P which has n modules and a successor model S which has n predefined modules. Let $P = \{prd_1, .., prd_n\}$ denote the predecessor model, and let $prd_i$ and $scc_i$ denote the predecessor modules and their corresponding substitutes, respectively. The output vector of the i-th module is denoted as $y_i$. Thus, the forward operation can be described in the form of:
$$y_{i+1} = prd_{i+1}(y_i) \qquad (1)$$
During compression, we apply module replacing. First, for the (i + 1)-th module, $r_{i+1}$ is an independent Bernoulli random variable which has probability p to be 1 and 1 − p to be 0:
$$r_{i+1} \sim \mathrm{Bernoulli}(p) \qquad (2)$$
Then, the output of the (i + 1)-th module is calculated as:
$$y_{i+1} = r_{i+1} * scc_{i+1}(y_i) + (1 - r_{i+1}) * prd_{i+1}(y_i) \qquad (3)$$
where * denotes the element-wise multiplication and $r_{i+1} \in \{0, 1\}$. In this way, the predecessor modules and successor modules work together in the training. Since the permutation of the hybrid model is random, it adds extra noise as a regularization for the training of the successor, similar to Dropout [32].
During training, similar to a fine-tuning process, we optimize a regular task-specific loss, e.g., Cross Entropy:
$$L = -\sum_{j=1}^{|X|} \sum_{c \in C} \mathbb{1}[z_j = c] \cdot \log P(z_j = c \mid x_j) \qquad (4)$$
where $x_j \in X$ is the j-th training sample; $z_j$ is its corresponding ground-truth label; c and C denote a class label and the set of class labels, respectively. For back-propagation, the weights of all predecessor modules are frozen and only the weights of the successor modules are updated. Both the embedding layer and the output layer of the predecessor model are weight-frozen and directly adopted for the successor model in this training phase. In this way, the gradient can be calculated across both the predecessor and successor modules, allowing interaction on a deeper level.
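As a rough illustration of module replacing, the following minimal Python sketch chains toy modules and picks, at each position, the successor with probability p and the predecessor otherwise. The function name `hybrid_forward` and the toy lambda modules are hypothetical stand-ins for Transformer layers; weight freezing and back-propagation are deliberately not modelled here.

```python
import random


def hybrid_forward(x, predecessors, successors, p, rng):
    """Run one forward pass with module replacing.

    At each position a Bernoulli(p) draw r picks the successor module
    (r == 1) or the predecessor module (r == 0). Calling this once per
    batch corresponds to sampling the r variables once per batch.
    """
    if len(predecessors) != len(successors):
        raise ValueError("each predecessor module needs one successor module")
    y, draws = x, []
    for prd, scc in zip(predecessors, successors):
        r = 1 if rng.random() < p else 0  # r ~ Bernoulli(p)
        y = scc(y) if r == 1 else prd(y)  # equals r*scc(y) + (1-r)*prd(y)
        draws.append(r)
    return y, draws


# Toy stand-ins (hypothetical): real modules would be Transformer layers.
toy_predecessors = [lambda v: v * 2 for _ in range(3)]
toy_successors = [lambda v: v + 1 for _ in range(3)]

rng = random.Random(0)
all_predecessor = hybrid_forward(1, toy_predecessors, toy_successors, 0.0, rng)
all_successor = hybrid_forward(1, toy_predecessors, toy_successors, 1.0, rng)
```

With p = 0 the pass reduces to the pure predecessor and with p = 1 to the pure successor; intermediate values of p produce the random hybrid models used during compression.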

Successor Fine-tuning and Inference
Since the successor modules have not yet all been combined for training, we further carry out a post-replacement fine-tuning phase. After the replacing compression converges, we collect all successor modules and combine them to be the successor model S:
$$S = \{scc_1, .., scc_n\}, \quad y_{i+1} = scc_{i+1}(y_i) \qquad (5)$$
Since each $scc_i$ is smaller than $prd_i$ in size, the predecessor model P is in essence compressed into a smaller model S.
Then, we fine-tune the successor model by optimizing the same loss as in Equation 4. The whole procedure including module replacing and successor fine-tuning is illustrated in Figure 2(a). Finally, we use the fine-tuned successor for inference as in Equation 5.

Curriculum Replacement
Although setting a constant replacement rate p can meet the need for compressing a model, we further highlight a Curriculum Learning [1] driven replacement scheduler, which helps progressively substitute the modules in a model. Similar to Curriculum Dropout [23], we devise a replacement scheduler to dynamically tune the replacement rate p.
Here, we leverage a simple linear scheduler θ(t) to output the dynamic replacement rate $p_d$ for step t:
$$p_d = \min(1, \theta(t)) = \min(1, kt + b) \qquad (6)$$
where k > 0 is the coefficient and b is the basic replacement rate. The replacing rate curve with a replacement scheduler is illustrated in Figure 2(b).
In this way, we unify the two previously separated training stages and encourage an end-to-end easy-to-hard learning process. First, with more predecessor modules present, the model is more likely to predict correctly and thus has a relatively small cross-entropy loss, which helps smooth the learning process. Then, later in the compression, more successor modules are present together, encouraging the model to gradually learn to predict with less guidance from the predecessor and to transition steadily to the successor fine-tuning stage.
Second, at the beginning of the compression, when θ(t) < 1, considering the average learning rate for all n successor modules, the expected number of replaced modules is $n \cdot p_d$ and the expected average learning rate is:
$$lr' = \frac{n p_d}{n}\, lr = (kt + b)\, lr \qquad (7)$$
where lr is the constant learning rate set for the compression and lr' is the equivalent learning rate considering all successor modules. Thus, when applying a replacement scheduler, a warm-up mechanism [25] is essentially adopted at the same time, which helps the training of a Transformer.
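The scheduler and its warm-up effect can be sketched in a few lines of Python. This is a minimal sketch that assumes the linear scheduler has the form p_d = min(1, k·t + b) and that the equivalent learning rate is p_d times the configured learning rate; the function names and the example values of k and b are illustrative assumptions.

```python
def replacement_rate(t, k, b):
    """Linear curriculum scheduler: p_d = min(1, k * t + b)."""
    return min(1.0, k * t + b)


def equivalent_lr(lr, t, k, b):
    """Expected average learning rate over all successor modules at step t."""
    return replacement_rate(t, k, b) * lr


# Illustrative values: b = 0.3 and a slope of 1e-4 per step.
k, b, lr = 1e-4, 0.3, 2e-5
schedule = [(t, replacement_rate(t, k, b), equivalent_lr(lr, t, k, b))
            for t in (0, 2500, 5000, 10000)]
```

Under these assumed values the replacement rate starts at b, grows linearly, and saturates at 1, at which point the equivalent learning rate equals the configured one and training coincides with successor fine-tuning.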

Experiments
In this section, we introduce the experiments of Theseus Compression for BERT [4] compression.We compare BERT-of-Theseus with other compression methods and further conduct experiments to analyze the results.
Accuracy is used as the metric for SST-2, MNLI-m, MNLI-mm, QNLI and RTE. F1 and accuracy are used for MRPC and QQP. The Pearson correlation and Spearman correlation are used for STS-B. Matthews correlation is used for CoLA. The results reported for the test set of GLUE are in the same format as on the official leaderboard.
For the sake of comparison with [28], on dev set of GLUE, the result of MNLI is the average result on MNLI-m and MNLI-mm; the results on MRPC and QQP are reported with the average of F1 and accuracy; the result reported on STS-B is the average of the Pearson correlation and Spearman correlation.
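The dev-set aggregation described above amounts to a few simple averages, sketched below. The function name, the dictionary layout and the numbers in `placeholder` are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def dev_scores(raw):
    """Collapse multi-metric GLUE tasks into one dev-set number per task.

    `raw` maps a task name to a dict of metric values (hypothetical layout).
    """
    def mean(*values):
        return sum(values) / len(values)

    return {
        "MNLI": mean(raw["MNLI-m"]["acc"], raw["MNLI-mm"]["acc"]),
        "MRPC": mean(raw["MRPC"]["f1"], raw["MRPC"]["acc"]),
        "QQP": mean(raw["QQP"]["f1"], raw["QQP"]["acc"]),
        "STS-B": mean(raw["STS-B"]["pearson"], raw["STS-B"]["spearman"]),
    }


# Placeholder inputs, not measured results.
placeholder = {
    "MNLI-m": {"acc": 0.5}, "MNLI-mm": {"acc": 0.75},
    "MRPC": {"f1": 1.0, "acc": 0.5},
    "QQP": {"f1": 0.25, "acc": 0.75},
    "STS-B": {"pearson": 0.5, "spearman": 0.5},
}
scores = dev_scores(placeholder)
```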

Experimental Settings
We test our approach under a task-specific compression setting [33,37] instead of a pretraining compression setting [28,34]. That is to say, we use no external unlabeled corpus but only the training set of each task in GLUE to compress the model. We test our model under such a setting because we intend to straightforwardly verify the effectiveness of our generic compression approach. The fast training process of task-specific compression (e.g., no longer than 20 GPU hours for any task of GLUE) computationally enables us to conduct more analytical experiments. For comparison, DistilBERT [28] takes 720 GPU hours to train. In addition, in real-world applications, this setting provides more flexibility when selecting from different pretrained LMs (e.g., BERT, RoBERTa [21]) for various downstream tasks, and it is easy to adopt a newly released model without a time-consuming pretraining compression.
On the other hand, we acknowledge that a general-purpose compressed BERT can better facilitate the downstream applications in the community since it requires less computational resource to simply fine-tune a small model than compressing a large one.Thus, we release a general-purpose compressed BERT as well.
Formally, we define the task of compression as trying to retain as much performance as possible when compressing the officially released BERT-base (uncased) to a 6-layer compact model with the same hidden size, following the settings in [28,33,37]. Under this setting, the compressed model has 24M parameters for the token embedding (identical to the original model) and 42M parameters for the Transformer layers and obtains a 1.94× speed-up for inference.
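The 24M and 42M figures can be roughly cross-checked with a back-of-the-envelope count. The sketch below assumes the standard BERT-base (uncased) configuration (vocabulary 30522, hidden size 768, feed-forward size 3072, 512 positions, 2 segment types), none of which is stated in this section, and counts the whole embedding block including position and segment embeddings, which may differ from how the figure above was derived.

```python
def embedding_params(vocab=30522, hidden=768, max_pos=512, type_vocab=2):
    """Parameters in the embedding block (assumed BERT-base configuration)."""
    token = vocab * hidden
    position = max_pos * hidden
    segment = type_vocab * hidden
    layer_norm = 2 * hidden  # scale and bias
    return token + position + segment + layer_norm


def transformer_layer_params(hidden=768, ffn=3072):
    """Parameters in one Transformer encoder layer (weights and biases)."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


embedding_total = embedding_params()             # about 23.8M
six_layer_total = 6 * transformer_layer_params() # about 42.5M
```

Under these assumptions the count gives roughly 23.8M embedding parameters and 42.5M parameters for six layers, in line with the rounded numbers quoted above.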

Training Details
We fine-tune BERT-base as the predecessor model for each task with a batch size of 32, a learning rate of $2 \times 10^{-5}$, and 4 epochs. As a result, we are able to obtain a predecessor model with performance comparable to that reported in previous studies [28,33,15].
Afterward, for training successor models, following [28,33], we use the first 6 layers of BERT-base to initialize the successor model, since the over-parameterized nature of Transformer [38] could prevent the model from converging while training on small datasets. During module replacing, we fix the batch size as 32 for all evaluated tasks to reduce the search space. All r variables are sampled only once per training batch. The maximum sequence length is set to 256 on QNLI and 128 for the other tasks. We perform grid search over the learning rate lr in {1e-5, 2e-5}, the basic replacing rate b in {0.1, 0.3}, and the scheduler coefficient k chosen so that the dynamic replacing rate increases to 1 within the first {1000, 5000, 10000, 30000} training steps. We apply an early stopping mechanism and select the model with the best performance on the dev set. We conduct our experiments on a single Nvidia V100 16GB GPU. The peak memory usage is identical to fine-tuning a BERT-base, since there would be at most 12 layers training at the same time. The training time for each task varies depending on the size of the training set. For example, it takes 20 hours to train on MNLI but less than 30 minutes on MRPC.
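The search space above can be enumerated as follows. This sketch assumes the linear scheduler p_d = min(1, k·t + b), so that a target of reaching 1 after T steps corresponds to k = (1 − b)/T; the helper name `coefficient_for` and the dictionary layout are hypothetical.

```python
from itertools import product


def coefficient_for(b, steps_to_full):
    """Slope k such that k * steps_to_full + b == 1 for a linear scheduler."""
    return (1.0 - b) / steps_to_full


learning_rates = [1e-5, 2e-5]
basic_rates = [0.1, 0.3]
steps_to_full = [1000, 5000, 10000, 30000]

# One configuration per combination of lr, b and schedule length.
grid = [
    {"lr": lr, "b": b, "steps": steps, "k": coefficient_for(b, steps)}
    for lr, b, steps in product(learning_rates, basic_rates, steps_to_full)
]
```

This yields 2 × 2 × 4 = 16 configurations per task, which is small enough to search exhaustively with early stopping on the dev set.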

Baselines
As shown in Table 1, we compare the layer numbers, parameter numbers, loss function, external data usage and model agnosticism of our proposed approach to existing methods. We set up a baseline of vanilla Knowledge Distillation [14] as in [33]. Additionally, we directly fine-tune an initialized 6-layer BERT model on GLUE tasks to obtain a natural fine-tuning baseline. Under the setting of compressing 12-layer BERT-base to a 6-layer compact model, we choose BERT-PKD [33], PD-BERT [37], and DistilBERT [28] as strong baselines. Note that DistilBERT [28] is not directly comparable here since it uses a pretraining compression setting. Both PD-BERT and DistilBERT use an external unlabeled corpus. We do not include TinyBERT [15] since it has a different size setting, conducts distillation twice, and leverages extra augmented data for GLUE tasks. We also exclude MobileBERT [34], due to its redesigned Transformer block and different model size. Besides, in these two studies, the loss functions are not architecture-agnostic, which limits their application to other models.

Experimental Results
We report the experimental results on the dev set of GLUE in Table 2 and submit our predictions to the GLUE test server and obtain the results from the official leaderboard as shown in Table 3. Note that DistilBERT does not report on the test set. The BERT-base performance reported on GLUE dev set is the predecessor fine-tuned by us. The results of BERT-PKD on dev set are reproduced by us using the official implementation. In the original paper of BERT-PKD, the results of CoLA and STS-B on test set are not reported, thus we reproduce these two results. Fine-tuning and Vanilla KD baselines are both implemented by us. All other results are from the original papers. The macro scores here are calculated in the same way as the official leaderboard but are not directly comparable with the GLUE leaderboard since we exclude WNLI from the calculation.
Overall, our BERT-of-Theseus retains 98.4% and 98.3% of the BERT-base performance on GLUE dev set and test set, respectively. On every task of GLUE, our model dramatically outperforms the fine-tuning baseline, indicating that with the same loss function, our proposed approach can effectively transfer knowledge from the predecessor to the successor. Also, our model clearly outperforms the vanilla KD [14] and Patient Knowledge Distillation (PKD) [33], showing its advantage over the KD-based compression approaches. On MNLI, our model performs better than BERT-PKD but slightly worse than PD-BERT [37]. However, PD-BERT exploits an additional corpus which provides many more samples for knowledge transfer. Also, we would like to highlight that on RTE, our model achieves nearly identical performance to BERT-base and on QQP our model even outperforms BERT-base. One possible explanation is that a moderate model size may help generalization and prevent overfitting on downstream tasks. Notably, on both large datasets with more than 350K samples (e.g., MNLI and QQP) and small datasets with fewer than 4K samples (e.g., MRPC and RTE), our model can consistently achieve good performance, verifying the robustness of our approach.

General-purpose Model
Although our approach achieves good performance under a task-specific setting, it requires more memory to fine-tune a full-size predecessor than a compact BERT (e.g., DistilBERT [28]). Liu et al. [21] found that a model fine-tuned on MNLI can successfully transfer to other sentence classification tasks. Thus, we release our compressed model by conducting compression on MNLI as a general-purpose compact BERT to facilitate downstream applications. After compression, we fine-tune the successor model on other sentence classification tasks and compare the results with DistilBERT [28] in Table 4. Our general-purpose model achieves an identical performance on MRPC and remarkably outperforms DistilBERT on the other sentence-level tasks.

Analysis
In this section, we conduct extensive experiments to analyze our BERT-of-Theseus.
Figure 3: Average performance drop when replacing the predecessor module $prd_i$ with successor module $scc_i$ on QNLI, MNLI and QQP (dev set).

Impact of Module Replacement
As pointed out in previous work [7], different layers of a Transformer play imbalanced roles for inference. To explore the effect of different module replacements, we iteratively use one compressed successor module (constant replacing rate, without successor fine-tuning) to replace its corresponding predecessor module on QNLI, MNLI and QQP, as shown in Table 5. We illustrate the average performance drop on three tasks in Figure 3. Surprisingly, different from a similar study of the importance of different Transformer layers in [7], which is basically a U-curve, our results show that the replacement of the last two modules has only a trivial influence on the overall performance while the replacement of the first module significantly harms the performance. One explanation is that linguistic features are mainly extracted by the first few layers. Therefore, a reduced representation capability there becomes the bottleneck for the following layers. Conversely, high-quality low-level features can help the following layers, so reducing the size of the later modules has only a limited influence on the final results.
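The per-module ablation can be pictured as a forward pass in which exactly one position uses its successor module while every other position keeps the predecessor. The sketch below shows only this forward configuration with toy modules; the function name and the lambda stand-ins are hypothetical, and the training of each successor module is not modelled.

```python
def forward_with_single_replacement(x, predecessors, successors, index):
    """Use the successor at position `index` and predecessors elsewhere."""
    y = x
    for i, (prd, scc) in enumerate(zip(predecessors, successors)):
        y = scc(y) if i == index else prd(y)
    return y


# Toy stand-ins (hypothetical): real modules would be Transformer layers.
toy_predecessors = [lambda v: v * 2 for _ in range(3)]
toy_successors = [lambda v: v + 1 for _ in range(3)]

# One output per replaced position, mirroring one row per module in the ablation.
outputs = [
    forward_with_single_replacement(1, toy_predecessors, toy_successors, i)
    for i in range(3)
]
```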

Impact of Replacing Rate
We attempt to adopt different replacing rates on GLUE tasks. First, we fix the batch size to be 32 and the learning rate lr to be $2 \times 10^{-5}$ and conduct compression on each task. On the other hand, as we analyzed in Section 3.3, the equivalent learning rate lr' is affected by the replacing rate. To further eliminate the influence of the learning rate, we fix the equivalent learning rate lr' to be $2 \times 10^{-5}$ and adjust the learning rate lr for different replacing rates by lr = lr'/p.
We illustrate the results with different replacing rates on two representative tasks (MRPC and RTE) in Figure 4. The trivial gap between the two curves in both figures indicates that the effect of different replacing rates on the equivalent learning rate is not the main factor for the performance differences. Generally speaking, BERT-of-Theseus is not very sensitive to different replacing rates. A replacing rate in the range between 0.5 and 0.7 can always lead to a satisfying performance on all GLUE tasks. However, a significant performance drop can be observed on all tasks if the replacing rate is too small (e.g., p = 0.1). On the other hand, the best replacing rate differs across tasks.

Impact of Replacement Scheduler
To study the impact of our curriculum replacement strategy, we compare the results of BERT-of-Theseus compressed with a constant replacing rate and with a replacement scheduler. The constant replacing rate for the baseline is searched over {0.5, 0.7, 0.9}. Additionally, we implement an "anti-curriculum" baseline, similar to the one in [23]. For each task, we adopt the same coefficient k and basic replacing rate b to calculate $p_d$ as in Equation 6 for both curriculum replacement and anti-curriculum. However, we use $1 - p_d$ as the dynamic replacing rate for the anti-curriculum baseline. Thus, we can determine whether the improvement of curriculum replacement is simply due to an inconstant replacing rate or an easy-to-hard curriculum design.
As shown in Table 6, our model compressed with curriculum scheduler consistently outperforms a model compressed with a constant replacing rate. On the contrary, a substantial performance drop is observed on the model compressed with an anti-curriculum scheduler, which further verifies the effectiveness of the curriculum replacement strategy.
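The three replacement schedules compared in this ablation can be written side by side. This is a minimal sketch assuming the linear form p_d = min(1, k·t + b) for the curriculum scheduler; the function names and the example values of k and b are illustrative.

```python
def curriculum_rate(t, k, b):
    """Easy-to-hard schedule: the replacing rate grows from b up to 1."""
    return min(1.0, k * t + b)


def anti_curriculum_rate(t, k, b):
    """Hard-to-easy baseline: uses 1 - p_d with the same k and b."""
    return 1.0 - curriculum_rate(t, k, b)


def constant_rate(t, p):
    """Baseline with a fixed replacing rate, independent of the step t."""
    return p


# Illustrative values: b = 0.3, reaching a rate of 1 after 1000 steps.
k, b = 7e-4, 0.3
rows = [(t, curriculum_rate(t, k, b), anti_curriculum_rate(t, k, b), constant_rate(t, 0.5))
        for t in (0, 500, 2000)]
```

The anti-curriculum baseline starts with most modules replaced and ends with none, so it shares the non-constant rate of the curriculum schedule but reverses its easy-to-hard ordering.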

Discussion and Conclusion
In this paper, we propose Theseus Compression, a novel model compression approach. We use this approach to compress BERT to a compact model, which outperforms other models compressed by Knowledge Distillation, with only one hyper-parameter, one loss function and no external data. Our work highlights a new genre of model compression and reveals a new path towards model compression.
A known limitation of our approach is that to allow a successor module to replace a predecessor module, they must have the same input and output sizes. First, given this restriction, we can still perform depth reduction (i.e., reducing the number of layers). As analyzed in [28], the hidden size dimension has a smaller impact on computational efficiency than the depth, for a fixed parameter budget. Second, many in-place substitutes have been developed (e.g., ShuffleNet unit [44] for ResBlock [12], Reformer Layer [16] for Transformer Layer [38]), which can be directly adopted as the successor. Third, it is possible to use a feed-forward neural network to map features between hidden spaces of different sizes [15].
Although our model has achieved good performance compressing BERT, it would be interesting to explore its possible applications in other neural models. As summarized in Table 1, our model does not rely on any model-specific features to compress BERT. Therefore, Theseus Compression has the potential to be applied to other large models (e.g., ResNet [12] in Computer Vision). For future work, we would like to conduct Theseus Compression on Convolutional Neural Networks and Graph Neural Networks.

Figure 1: The workflow of BERT-of-Theseus. In this example, we compress a 6-layer predecessor $P = \{prd_1, .., prd_3\}$ to a 3-layer successor $S = \{scc_1, .., scc_3\}$. $prd_i$ and $scc_i$ contain two and one layer, respectively. (a) During module replacing training, each predecessor module $prd_i$ is replaced with corresponding successor module $scc_i$ by the probability of p. (b) During successor fine-tuning and inference, all successor modules $scc_{1..3}$ are combined for calculation.

Figure 2: The replacing curves of a constant module replace rate and a replacement scheduler. We use different shades of gray to mark the two phases of Theseus Compression: (1) Module replacing. (2) Successor fine-tuning.

Figure 4: Performance of different replacing rates on MRPC and RTE. "LR" and "ELR" denote that the learning rate and equivalent learning rate are fixed, respectively.

Table 1: Comparison of different BERT compression approaches. "CE" and "MSE" stand for Cross Entropy and Mean Square Error, respectively. "KD" indicates the loss is for Knowledge Distillation. "$CE_{TASK}$" and "$CE_{MLM}$" indicate Cross Entropy calculated on downstream tasks and Masked LM pretraining, respectively. Other loss functions are described in their corresponding papers.

Table 2: Experimental results on the dev set of GLUE. The numbers under each dataset indicate the number of training samples.

Table 3: Experimental results on the test set from GLUE server. The numbers under each dataset indicate the number of training samples.

Table 4: Experimental results of our general-purpose model on GLUE-dev.

Table 5: Impact of the replacement for different modules on GLUE-dev. $prd_i \to scc_i$ indicates the replacement of the i-th module from the predecessor.