Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning

Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed of hardware platforms have impeded the popularity of pre-trained models, especially in the era of edge computing. In this work, we propose an efficient transformer-based large-scale language representation using hardware-friendly block structured pruning. We incorporate the reweighted group Lasso into block structured pruning for optimization. Besides significantly reducing weight storage and computation, the proposed approach achieves high compression rates. Experimental results on different models (BERT, RoBERTa, and DistilBERT) on the General Language Understanding Evaluation (GLUE) benchmark tasks show that we achieve up to a 5.0× compression rate with zero or minor accuracy degradation on certain tasks. Our proposed method is also orthogonal to existing compact pre-trained language models such as DistilBERT, which uses knowledge distillation: a further 1.79× average compression rate can be achieved on top of DistilBERT with zero or minor accuracy degradation. The final compressed model is suitable for deployment on resource-constrained edge devices.


Introduction
Transformer-based language model pre-training has proven to be highly effective in learning universal language representations from large-scale unlabeled data and can be fine-tuned to adapt to downstream tasks (Peters et al., 2018; Sun et al., 2019). Representative works such as BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019b), MT-DNN (Liu et al., 2019a), ALBERT (Lan et al., 2019), and UniLMv2 (Bao et al., 2020) have substantially advanced the state of the art across a variety of downstream tasks, such as text classification, natural language inference, and question answering.

* These authors contributed equally.
Despite their success in improving performance on natural language understanding and generation, the computational cost and data storage of Transformer-based pre-trained language models are two widely recognized concerns, due to the Transformer's deep architecture and rich parameters. These models typically contain several hundred million parameters. Recently released research models even reach multiple billions of parameters, such as MegatronLM (8.3 billion parameters) (Shoeybi et al., 2019), Turing-NLG (17 billion parameters) (Microsoft, 2020), and GPT-3 (175 billion parameters) (Brown et al., 2020), which require more advanced computing facilities. Hence, it is imperative to reduce the computational cost and model storage of pre-trained Transformer-based language models in order to popularize their applications in computer systems, especially on edge devices with limited resources.
Several works have been developed in the context of model compression, such as knowledge distillation (Hinton et al., 2015; Jiao et al., 2019; Sun et al., 2019), weight pruning (Han et al., 2015), parameter sharing (Lan et al., 2019), and weight quantization (Polino et al., 2018). In computer vision, information compressed/reduced in image features can be partially retrieved from neighboring pixels, since they share similar and uniform characteristics spatially. In NLP, however, the syntax and semantics information captured by a Transformer in the language/text domain is more sensitive than its computer vision counterpart. A high compression rate for large-scale language models is therefore difficult to achieve on downstream NLP tasks.
As a result, there are few works exploring and optimizing hardware-friendly model compression techniques for state-of-the-art Transformer-based pre-trained language models that reduce weight storage and computation on computer systems while maintaining prediction accuracy.
In this work, we propose an efficient Transformer-based large-scale language representation using block structured pruning. The contributions of this work are as follows.
• To the best of our knowledge, we are the first to investigate hardware-friendly weight pruning on pre-trained large-scale language models. Besides significantly reducing weight storage and computation, the adopted block structured pruning has high flexibility in achieving a high compression rate. These two advantages are critical for an efficient Transformer in NLP, since the non-uniform syntax and semantics information in the language/text domain makes weight pruning more difficult than in computer vision.
• We incorporate reweighted group Lasso optimization into block structured pruning of pre-trained large-scale language models, including BERT, RoBERTa, and DistilBERT. We relax the hard constraints in weight pruning by adding regularization terms to the objective function and use reweighted penalty parameters for different blocks. This dynamic regularization technique achieves a higher compression rate with zero or minor accuracy degradation.
• Our proposed method is orthogonal to existing compact pre-trained language models such as DistilBERT, which uses knowledge distillation. We can further reduce the model size using our method with zero or minor accuracy degradation.
We evaluate the proposed approach on several GLUE benchmark tasks (Wang et al., 2018). Experimental results show that we achieve high compression rates with zero or minor accuracy degradation. With significant gains in weight storage reduction (up to 5×) and computation efficiency, our approach maintains accuracy scores comparable to the original large models, including DistilBERT. The hardware-friendly Transformer-based acceleration method is suitable for deployment on resource-constrained edge devices.

Related Work
To address the memory limitations and high computational requirements of commonly used deep learning platforms such as graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs) for large-scale pre-trained language models, various compact NLP models and model compression techniques have been investigated. ALBERT (Lan et al., 2019) utilizes a parameter sharing technique across encoders to reduce weight parameters and uses the same layer structures as BERT. It achieves results comparable to BERT on different benchmarks. Despite the weight storage reduction, the computational overhead remains unchanged, since ALBERT and BERT have the same network structure.
Knowledge distillation is another type of model compression technique, which distills knowledge from a large teacher model or an ensemble of models into a lightweight student model (Hinton et al., 2015). The student model is trained to imitate the class probabilities produced by the large teacher model. For instance, DistilBERT applies knowledge distillation to BERT and achieves 1.67× model size reduction and 1.63× inference speedup, while retaining 97% of BERT's accuracy on the dev sets of the GLUE benchmark. Patient knowledge distillation (Sun et al., 2019) learns from multiple intermediate layers of the teacher model for incremental knowledge extraction.
Efficient deep learning methods can reduce the model size and accelerate the computation. It is well known that, in practice, the weight representation in deep learning models is redundant. After removing redundant weights with appropriate model compression algorithms, a deep learning model can retain its accuracy with only minor degradation. Prior work focused on heuristic and iterative non-structured magnitude weight pruning (a.k.a. irregular pruning) (Han et al., 2015). It causes overhead in both weight storage and computation in computer systems. On weight storage, it results in irregular, sparse weight matrices (as arbitrary weights can be pruned) and relies on indices stored in a compressed format such as the Coordinate (COO) format. The introduced indices cause extra memory footprint, i.e., at least one index per non-zero value, further degrading the compression rate. On computation, it is difficult to accelerate on current GPU architectures, as reported in (Han et al., 2016; Wen et al., 2016; Yu et al., 2017). On the other hand, structured pruning enforces regularity in weight pruning, focusing on generating regular but smaller and dense matrices with no indices. However, it suffers notable accuracy loss due to poor solution quality and is therefore not suitable for pruning the sensitive syntax and semantics information in the Transformer.

Problem Formulation
We adopt a more fine-grained block structured pruning algorithm, where pruning is executed by excluding entire groups of weights, such as rows or columns, within blocks of the weight matrices, thereby significantly reducing the number of indices to store in memory. On computation, it is compatible with parallel computing platforms such as GPUs or FPGAs for implementing matrix multiplications. We formulate the weight pruning problem using reweighted group Lasso to orchestrate the block structured pruning. Thus, the Transformer-based large-scale models can be more efficient on computer systems while satisfying the accuracy requirement. As shown in Figure 1, we divide the weight matrix into small blocks and apply row pruning and column pruning on each block. For each row/column of a block, we compute the l2 norm. We prune the weights within a block according to a pre-set threshold or percentile. The pseudocode is shown in Algorithm 1.
Consider an N-layer Transformer; we denote the weights and biases of the n-th layer as W_n and b_n. The loss function is f({W_n}_{n=1}^N, {b_n}_{n=1}^N), which is minimized during training. For the block structured pruning problem, our target objective is to reduce the number of columns and rows in the blocks of the weight matrices while maintaining prediction accuracy.

Algorithm 1: Block structured pruning
Input: weight matrix W, matrix width n, matrix height m, row division k (or column division k), threshold t_b
Output: pruned weight matrix W_p
  Set W_p = W
  Divide W_p into k matrices: W_1, W_2, ..., W_k
  Set l2_norms = zeros(k, m)
  for i = 1 to k do
    for j = 1 to m do
      l2_norms(i, j) = l2 norm of the j-th row of W_i
      if l2_norms(i, j) <= t_b then
        W_i(j, :) = 0
      end if
    end for
  end for
  W_p = concatenate(W_1, W_2, ..., W_k)
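As a concrete illustration, the block row pruning step of Algorithm 1 can be sketched in NumPy as follows (a minimal sketch; the function name, block count, and threshold are illustrative, not the paper's actual implementation):

```python
import numpy as np

def block_structured_row_prune(W, k, threshold):
    """Split W column-wise into k blocks; within each block, zero out
    every row whose l2 norm does not exceed the threshold (Algorithm 1)."""
    blocks = np.split(W, k, axis=1)             # W_1, W_2, ..., W_k
    pruned = []
    for Wi in blocks:
        Wi = Wi.copy()
        row_norms = np.linalg.norm(Wi, axis=1)  # one l2 norm per block row
        Wi[row_norms <= threshold, :] = 0.0     # prune weak block rows
        pruned.append(Wi)
    return np.concatenate(pruned, axis=1)
```

Column pruning is symmetric: split the matrix row-wise and threshold the l2 norms of the columns of each block. Only the positions of the surviving block rows/columns need to be indexed.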
The optimization problem is

  minimize f({W_n}, {b_n}),
  subject to: the number of non-zero block rows in W_n is less than r_n,
              the number of non-zero block columns in W_n is less than c_n,   (1)

where r_n and c_n are the desired numbers of non-zero block rows and columns, respectively. Due to the regularity in pruning, only the non-zero rows/columns at the block level need to be indexed, as opposed to each non-zero element in irregular pruning. The storage overhead is minor compared to non-structured irregular pruning (Han et al., 2016). Because structured pruning is applied independently within each block, the scheme has higher flexibility, and thereby higher accuracy, compared to the straightforward application on the whole weight matrix (Wen et al., 2016).

Reweighted Group Lasso Optimization
In problem (1), we use hard constraints to formulate the block row/column pruning problem. However, it is more difficult to satisfy these hard constraints in NLP than in computer vision, for two reasons: i) information compressed in image features can be partially retrieved from neighboring pixels, since spatially they share similar and uniform characteristics, whereas the syntax and semantics information in a deep Transformer in the language/text domain is not uniformly characterized; and ii) intuitively, high-level semantic, syntactic, and language understanding capabilities might be broken when we prune zero or near-zero weights in the latent space. Therefore, a high compression rate for large-scale language models is difficult to achieve on downstream NLP tasks.
To address this issue, we relax the hard constraints by adding regularization terms to the objective function. Prior work SSL (Wen et al., 2016) uses group Lasso as the relaxation of the hard constraints. Inspired by (Candes et al., 2008), we use reweighted penalty parameters for different blocks, achieving a higher compression rate under the same accuracy requirement than applying a fixed penalty parameter to all blocks as in the group Lasso method.
When we use group Lasso for block row pruning, the regularization term for the n-th layer is

  R_row(W_n) = Σ_{i=1}^{p_n} Σ_{α=1}^{q_n} γ_{i,α} · ||W_n(i, block α)||_2,

where h_n is the block row size in the n-th layer, p_n is the number of rows in W_n, and q_n is the number of blocks in a row of W_n. The block row pruning problem is then

  minimize f({W_n}, {b_n}) + λ Σ_{n=1}^{N} R_row(W_n),   (2)

where λ is the penalty parameter, and γ_{i,α} is the penalty weight corresponding to the α-th block in the i-th row, updated by

  γ_{i,α} = 1 / (||W_n(i, block α)||_2 + ε),

where ε is a small value preventing division by zero. Similarly, when we prune columns in a block, the problem is

  minimize f({W_n}, {b_n}) + λ Σ_{n=1}^{N} Σ_{j=1}^{r_n} Σ_{β=1}^{s_n} γ_{j,β} · ||W_n(block β, j)||_2,   (3)

where d_n is the block column size in the n-th layer, r_n is the number of columns in W_n, and s_n is the number of blocks in a column of W_n. γ_{j,β} is the penalty weight corresponding to the β-th block in the j-th column, and it is updated analogously.

Algorithm 2: Reweighted group Lasso on Transformer pruning
Input: pre-trained model, model weight matrix W, matrix width n, matrix height m
  Set milestones = m_1, m_2, ..., m_s
  Set T_1 as the number of iterations of the reweighted training method
  Set T_2 as the number of iterations of the retraining method
  Calculate γ
  for s = 1 to T_1 do
    if s in milestones then
      Update γ
    end if
    Calculate l1_loss and the prediction loss f(W, b)
    mixed_loss = l1_loss + f(W, b)
    Update the model weights W to minimize mixed_loss using Adam
  end for
  Prune the weight matrix W using block structured pruning
  Set Mask = zeros(m, n) and retrain the non-zero weights for T_2 iterations

We start with a pre-trained model and initialize the collection of penalty weights (γ_{i,α} or γ_{j,β}) using the parameters of the pre-trained model. We remove the rows or columns in a block if their group l2 norm is smaller than a threshold after reweighted training. We then refine the Transformer models using the non-zero weights. λ is used for adjusting the regularization strength. When λ is too small, the reweighted training is close to the original training. When λ is too large, it puts too much penalty on the weights and accuracy cannot be maintained.
Specifically, we start reweighted training with λ = 0 to reproduce the original results, and then increase λ to induce sparsity in the weight matrices. We stop increasing λ when the reweighted training accuracy drops slightly; the accuracy is then recovered by retraining. Overall, using the same training trials, our method achieves a higher pruning rate than prior structured pruning works (Wen et al., 2016), as shown in Algorithm 2.
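The reweighted penalty computation described above can be sketched as follows (a simplified single-matrix sketch; the function names and ε value are our own choices, not the paper's code):

```python
import numpy as np

EPS = 1e-6  # small value preventing division by zero

def block_row_norms(W, k):
    """l2 norm of every row of every column-block of W; shape (k, rows)."""
    return np.stack([np.linalg.norm(Wi, axis=1) for Wi in np.split(W, k, axis=1)])

def update_gamma(W, k):
    """Reweighting step: gamma = 1 / (block norm + eps), so blocks that are
    already small receive a stronger push toward zero."""
    return 1.0 / (block_row_norms(W, k) + EPS)

def mixed_loss(prediction_loss, W, k, gamma, lam):
    """Prediction loss plus the relaxed group-Lasso regularization term."""
    return prediction_loss + lam * np.sum(gamma * block_row_norms(W, k))
```

In training, `update_gamma` is called only at the milestone epochs, while `mixed_loss` is minimized at every step, mirroring Algorithm 2.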

Datasets
We conduct experiments on the GLUE benchmark (Wang et al., 2018), a comprehensive collection of nine natural language understanding tasks covering three NLP task categories with different degrees of difficulty and dataset scales: single-sentence tasks, paraphrase similarity matching tasks, and inference tasks. All datasets are publicly available. More specifically, for single-sentence tasks, we consider the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018), which contains 10,657 sentences with English acceptability judgments drawn from books and articles on linguistic theory, and the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), which comprises 215,154 phrases in the parse trees of 11,855 sentences from movie reviews with sentiment annotations.
For paraphrase similarity matching tasks, we consider the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), which contains 5,800 sentence pairs from online news sources, manually annotated for whether the sentences in each pair are semantically equivalent; the Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017), a collection of 8,628 sentence pairs extracted from news titles, video titles, image titles, and natural language inference data; and the Quora Question Pairs (QQP) 1 , a collection of 400,000 potential duplicate question pairs from the Quora website.
For inference tasks, we consider the Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018), a set of 433k premise-hypothesis pairs, where the task is to predict whether the premise statement entails the hypothesis statement; Question-answering NLI (QNLI) (Wang et al., 2018), a set of over 100,000 question-answer pairs from SQuAD (Rajpurkar et al., 2016); the Recognizing Textual Entailment datasets (RTE) (Wang et al., 2018), which come from the PASCAL Recognizing Textual Entailment Challenge; and Winograd NLI (WNLI) (Levesque et al., 2012), a reading comprehension task that comes from the Winograd Schema Challenge.
In all GLUE benchmarks, we report the metrics following the conventions in (Wang et al., 2018), i.e., accuracy scores are reported for SST-2, QNLI, RTE, and WNLI; Matthews Correlation Coefficient (MCC) is reported for CoLA; F1 scores are reported for QQP and MRPC; Spearman correlations are reported for STS-B.
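For reference, these per-task metrics can be computed with standard libraries (a sketch using scikit-learn and SciPy; the toy labels and scores below are ours, not GLUE data):

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from scipy.stats import spearmanr

y_true, y_pred = [1, 1, 0, 0], [1, 1, 0, 1]    # toy classification labels
acc = accuracy_score(y_true, y_pred)            # SST-2, QNLI, RTE, WNLI
mcc = matthews_corrcoef(y_true, y_pred)         # CoLA
f1 = f1_score(y_true, y_pred)                   # QQP, MRPC
rho = spearmanr([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]).correlation  # STS-B (toy scores)
```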

Experimental Setup
Baseline Models. Our baseline models are the unpruned BERT BASE (Devlin et al., 2018), RoBERTa BASE (Liu et al., 2019b), and DistilBERT. As shown in Table 1, for each Transformer model, we list the reported accuracy/metrics from the original paper in the first row and our reproduced results using the same network architecture in the second row.
To evaluate our proposed framework on NLP model compression problems, we apply our method to different Transformer-based models, including BERT BASE , RoBERTa BASE , and DistilBERT. We carry out reweighted l1 training to add l1 regularization, block structured pruning to obtain a sparse model, and retraining to improve the final accuracy.
We access the GPU-AI (Bridges GPU Artificial Intelligence) nodes on the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014). We use two node types: Volta 16, nine HPE Apollo 6500 servers, each with 8 NVIDIA Tesla V100 GPUs with 16 GB of GPU memory each, connected by NVLink 2.0; and Volta 32, an NVIDIA DGX-2 enterprise research AI system tightly coupling 16 NVIDIA Tesla V100 (Volta) GPUs with 32 GB of GPU memory each, connected by NVLink and NVSwitch. We also use a server with 8 NVIDIA Quadro RTX 6000 GPUs with 24 GB of GPU memory each for training. We conduct the experiments using the HuggingFace Transformers toolkit and the DeepLearningExamples repository from NVIDIA (NVIDIA, 2020). Our experiments are performed on Python 3.6.10, GCC 7.3.0, PyTorch 1.4.0, and CUDA 10.1.

Implementation Details
We first fine-tune the pre-trained models for classification; BERT, RoBERTa, and DistilBERT share the same steps. We add a single linear layer on top of each original model and train the model for the nine downstream GLUE tasks with their corresponding datasets. As we feed the data, the entire pre-trained model and the additional untrained classification layer are trained on the specific task. The original layers already provide strong representations of English, so we mainly need to train the top layer, with some tweaking of the lower layers to accommodate each task.
For fine-tuning, we run 4 epochs with an initial learning rate of 2e-5, a batch size of 32, and a warm-up proportion of 0.1. For block structured pruning, we adjust the reweighted penalty parameter, compression rate, and training steps for each task. We start from the same parameters as fine-tuning (epochs, learning rate, batch size) and then adjust some parameters per task depending on the prediction performance. For BERT BASE , we set the penalty factor to 1e-3 for WNLI and MRPC; 1e-4 for CoLA, QQP, MNLI, SST-2, and RTE; and 1e-5 for QNLI. The learning rate is 3e-5 and the batch size is 32 on all nine tasks. For RoBERTa BASE , we set the penalty factor to 1e-3 for WNLI; 1e-4 for MRPC, QQP, SST-2, and RTE; and 1e-5 for QNLI, CoLA, and MNLI. The learning rate and batch size are the same as for BERT BASE . For the DistilBERT model, the hyperparameters for reweighted training and retraining are a learning rate of 3e-5 and a batch size of 128 on all nine datasets. We adjust the other parameters, including penalty factors, number of blocks, and compression ratios, to achieve satisfactory performance on each task.
We consider three aspects: weight distribution, loss, and accuracy. The weight distribution shows how the weights in each layer are distributed after training and retraining; we visualize the weight parameters in Figure 2. With different pruning hyperparameters, including penalty factors, learning rate, number of blocks, and compression rate, the weights are distributed differently. We track two losses: the reweighted loss and the mixed loss (the objective function in Equation (3)). For all our tasks, BERT BASE , RoBERTa BASE , and DistilBERT converge in fewer than 4 epochs. During training, we evaluate the performance every given number of steps.

Experimental Results
We compare the performance (accuracy scores) of our pruned models with the baselines; the results are shown in Table 1. For BERT BASE , we set a compression rate of 1.428× (i.e., 30% sparsity) or above. The average accuracy degradation is within 2% across all tasks. On the WNLI task, there is no accuracy loss. On the MNLI, QQP, CoLA, STS-B, and MRPC tasks, the accuracy loss is within 1.5%. On the SST-2, QNLI, and RTE tasks, the accuracy loss is also small (within 2.9%) compared to the two baseline models. For RoBERTa, the average accuracy degradation is within 3% across all tasks. There is no accuracy loss on WNLI. The accuracy loss is within 1% on MRPC, within 2% on the MNLI and STS-B tasks, within 4% on the QNLI and RTE tasks, and around 5% on the QQP, SST-2, and CoLA tasks. For DistilBERT, the average accuracy degradation is within 5% across all tasks. The accuracy losses are within 1% on the MRPC task, 3% on the MNLI, QQP, QNLI, and STS-B tasks, and 5% on the SST-2 task. On the CoLA and WNLI datasets, the pruned models perform even better than the unpruned models. Figure 4 shows the reweighted training and retraining results on the MRPC dataset. We choose 256 as the number of blocks. For reweighted training, the mixed loss drops during training within every 116 steps (4 epochs) and increases significantly when we update the penalty matrix γ. For retraining, the pruned model achieves a higher F1 score than the unpruned one.
We evaluate the accuracy changes under different compression rates on BERT BASE and report the accuracy scores for different tasks. The results indicate that the sensitivities of the tasks vary significantly under different levels of compression. As shown in Table 2, different tasks show different accuracy degradation at the same compression rate, and as we increase the compression rate, the accuracy degradation increases. For specific tasks (e.g., WNLI), we can achieve up to a 5× compression rate over the baseline model with zero accuracy loss. Tasks such as WNLI and QQP show minor accuracy degradation, while SST-2, MNLI, and QNLI show higher accuracy degradation at a 5.0× compression rate. The different accuracy results are related to the different dataset sizes, degrees of difficulty, and evaluation metrics.
We compare our block structured pruning (BSP) method with the irregular sparse format and the block sparse format (Narang et al., 2017; Gray et al., 2017) (pruning all weights in selected blocks). Table 3 shows that, at the same accuracy, our method achieves a slightly lower pruning ratio than the irregular sparse format. This is because irregular pruning has greater flexibility in pruning. However, irregular pruning is less effective on hardware: the irregular sparse format introduces significant memory storage overhead when using the Coordinate (COO) storage format and is therefore not hardware-friendly. Our block structured format (pruning a portion of rows/columns in each block) strikes a better balance between accuracy and memory storage than the irregular sparse format or the block sparse format (Narang et al., 2017; Gray et al., 2017). For the irregular sparse format, when storing or transmitting an irregular sparse matrix in COO format, we store the non-zeros and their coordinates in memory. Three vectors are needed: row, col, and data, where data[i] is the value at position (row[i], col[i]). More specifically, given 50% sparsity for a 9 × 9 matrix with a block size of 3 × 3, the storage of the COO format is 1.5 × 9 × 9 ≈ 122 entries, while the storage of block structured sparsity is 9 × 0.5 × 3 × (3 + 1) = 54 entries (i.e., #blocks × sparsity × block size × (values + position index)). Table 4 lists the accuracy of our method and the block sparse format using DistilBERT. Our method achieves 3.04% higher accuracy on average than the block sparse format.
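The storage arithmetic above can be reproduced as follows (a sketch counting stored entries only, ignoring the bit widths of values and indices):

```python
def coo_storage(m, n, density):
    """COO keeps three vectors (row, col, data): three entries per non-zero."""
    return 3 * density * m * n

def block_structured_storage(m, n, block, density):
    """Each surviving block row stores its `block` values plus one position index."""
    num_blocks = (m // block) * (n // block)
    kept_rows_per_block = density * block
    return num_blocks * kept_rows_per_block * (block + 1)

# The worked example from the text: 9x9 matrix, 50% sparsity, 3x3 blocks.
coo = coo_storage(9, 9, 0.5)                  # 1.5 * 9 * 9 = 121.5, i.e. ~122
bsp = block_structured_storage(9, 9, 3, 0.5)  # 9 * 1.5 * 4 = 54
```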
As the proposed pruning is hardware-friendly, the pruned weights can be efficiently stored in hardware memory with minor overhead compared to other pruning methods such as irregular pruning. We use a compiler-assisted acceleration framework rather than sparse linear algebra libraries, which allows the model to speed up at a sparsity of 30%. We also apply matrix reordering and compact model storage to achieve speedups on edge devices (Ma et al., 2020). Hence, the final compressed model is suitable for deployment on resource-constrained edge devices such as embedded systems and mobile devices.

Ablation Studies
In this section, we perform ablation experiments over several parameters when pruning BERT and DistilBERT to better understand their relative importance and the procedure. We vary the following parameters: the number of blocks for reweighted training and block structured pruning, the number of retraining epochs, and the penalty factors. We also evaluate whether the method is knowledge distillation friendly.

Number of Blocks
After selecting a penalty factor of 3e-4 and a compression rate of 2.0× for each layer (except the embedding layers), we test different numbers of blocks. As shown in Table 5, the final accuracy is significantly improved for both BERT BASE and DistilBERT when we increase the number of blocks. This verifies that with more blocks (smaller block size), our weight pruning algorithm has higher flexibility in exploring model sparsity.

Number of Retraining Epochs
By default, all GLUE tests are carried out by running four epochs of fine-tuning. For reweighted training and retraining, more epochs usually lead to better final accuracy. In this test, we try different numbers of reweighted training and retraining epochs. During reweighted training, the mixed loss drops significantly within every 4 epochs, while the evaluation loss remains relatively stable. We summarize the results in Table 6. The final accuracy after retraining improves as we increase the number of training epochs.

Penalty Factors
The reweighted training procedure penalizes the l2 norms of the blocks and thus reduces the magnitude of the weights. Larger penalty factors therefore help to achieve better retraining accuracy, since more small-magnitude weights of the weight matrices are pruned. However, if the penalty factors are too large, the reweighted training algorithm cannot compress the model well, which leads to significant accuracy degradation. The results are summarized in Table 7: the retraining accuracy improves when we increase the penalty factor from 3e-5 to 1e-4 and declines from 3e-4 to 1e-3.

Variance of results on multiple runs
During training, the random seed is set to 42 by default. We further conduct experiments with different seeds and list the results in Table 8. We observe that our reported accuracy is consistent with the results obtained under different seeds.

Knowledge Distillation Friendly
To evaluate the effectiveness of our pruning method on distilled models, we focus on the BERT and DistilBERT results in Table 1, where DistilBERT is a highly distilled version of BERT. The average compression rates of BERT and DistilBERT are 1.49× and 1.79×, respectively. Note that the model size of BERT is 1.67× that of DistilBERT, and therefore 2.99× that of the final compressed DistilBERT model. This shows that the proposed block structured pruning is orthogonal to knowledge distillation. With this knowledge-distillation-friendly property, we can first apply a standard knowledge distillation step to reduce the original large model, and then apply the proposed pruning method to further reduce the size of the student model.

Conclusion
In this work, we propose a hardware-friendly block structured pruning framework for Transformer-based large-scale language representations. We incorporate reweighted group Lasso for optimization and relax the hard constraints in block structured pruning. We significantly reduce weight storage and computational requirements. Experimental results on different models (BERT, RoBERTa, and DistilBERT) on the GLUE benchmark tasks show that we achieve significant compression rates with zero or minor accuracy degradation on certain benchmarks. Our proposed method is orthogonal to existing compact pre-trained language models such as DistilBERT, which uses knowledge distillation. The final compressed model is suitable for deployment on resource-constrained edge devices.

Single-layer Sensitivity
Before retraining, block structured pruning is carried out on the reweighted trained models by choosing a compression ratio for each layer. However, the sensitivity of different layers differs, which may lead to significant accuracy loss if the compression ratios are not chosen properly. To test the sensitivity, we prune 50% of each layer while keeping the other layers unpruned and obtain the final accuracy after retraining. According to our tests, the embedding layers are sensitive on all datasets except WNLI. On the MRPC and RTE datasets, we choose 8 as the number of blocks and 3e-4 as the penalty factor. In Figure 5, the first two weight matrices belong to the embedding layers, the third to the 38th weight matrices belong to the Transformer layers (each Transformer layer includes 6 weight matrices), and the last two belong to the classifier layers. The results show that the embedding layers and the linear weights in the Transformer layers are sensitive on the CoLA and MRPC datasets. Therefore, we set the compression ratios of the corresponding weights to zero to ensure the final accuracy. Figure 7 shows the reweighted training and retraining accuracy for different block sizes. During reweighted training, the accuracy decreases when we increase the number of blocks, since the corresponding l1 loss increases significantly, which causes the mixed loss to increase, as shown in Figure 8. The final accuracy improves when increasing the number of blocks, since the algorithm can operate on smaller units of the weight matrices.
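The per-layer sensitivity test described above can be sketched as follows (illustrative; the `evaluate` callback stands in for the actual retrain-and-score step, and the median-based 50% row pruning is our simplification):

```python
import numpy as np

def prune_rows_50pct(W):
    """Zero out the half of the rows with the smallest l2 norms."""
    norms = np.linalg.norm(W, axis=1)
    W = W.copy()
    W[norms <= np.median(norms), :] = 0.0
    return W

def sensitivity_scan(layers, evaluate):
    """Prune 50% of one layer at a time, leave all other layers untouched,
    and record the resulting score for that layer."""
    scores = {}
    for name, W in layers.items():
        candidate = dict(layers)            # shallow copy of the model
        candidate[name] = prune_rows_50pct(W)
        scores[name] = evaluate(candidate)  # e.g. retrain + measure accuracy
    return scores
```

Layers whose scores drop sharply (e.g. the embedding layers here) are then exempted from pruning when setting per-layer compression ratios.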

Number of Retraining Epochs
For reweighted training, Figure 9 and Figure 10 show the mixed loss and evaluation loss, respectively, where we update the γ matrix every four epochs. For each selection of training epochs, we use linear learning rate decay, so the curves do not coincide with each other. The final accuracy after retraining improves as we increase the number of training epochs, as shown in Figure 11.

Penalty Factors
As shown in Figure 12, the retraining accuracy improves when we increase the penalty factor from 3e-5 to 1e-4 and declines from 3e-4 to 1e-3.

Retrain Accuracy
Figures 13 to 21 show the accuracy of the RoBERTa BASE model on the nine GLUE benchmark tasks during retraining.