Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior

Traditional (unstructured) pruning methods for Transformer models focus on regularizing the individual weights by penalizing them toward zero. In this work, we explore spectral-normalized identity priors (SNIP), a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping. Our method identifies and discards unimportant non-linear mappings in the residual connections by applying a thresholding operator to the function norm, and is applicable to any structured module, including a single attention head, an entire attention block, or a feed-forward subnetwork. Furthermore, we introduce spectral normalization to stabilize the distribution of the post-activation values of the Transformer layers, further improving the pruning effectiveness of the proposed methodology. We conduct experiments with BERT on 5 GLUE benchmark tasks to demonstrate that SNIP achieves effective pruning results while maintaining comparable performance. Specifically, we improve the performance over the state-of-the-art by 0.5 to 1.0% on average at 50% compression ratio.


Introduction
Natural Language Processing (NLP) has recently achieved great success by using Transformer-based pre-trained models (Radford et al., 2019; Devlin et al., 2018; Yang et al., 2019; Clark et al., 2020). However, these models often consume considerable storage, memory bandwidth, and computational resources. To reduce the model size and increase the inference throughput, compression techniques such as knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Tang et al., 2019; Jiao et al., 2019; Sun et al., 2020) and weight pruning (Guo et al., 2019; Gordon et al., 2020; Sanh et al., 2020) have recently been developed.

Figure 1: Comparison of knowledge distillation methods (DistilBERT (Sanh et al., 2019) and BERT-PKD (Sun et al., 2019)) and iterative pruning methods (Iterative Pruning (Guo et al., 2019) and our proposed method) in terms of accuracy at various compression rates on the MNLI test set. Knowledge distillation methods require re-distillation from the teacher to obtain each single data point, whereas iterative pruning methods can produce continuous curves at once.
Knowledge distillation methods require the specification of a student network with a smaller architecture, which often has to be identified by a tedious process of trial-and-error decisions. By comparison, iterative pruning methods gradually prune the redundant model weights or layers from the full-size model, and provide a full picture of the trade-off between task performance and model size with a single training process, as illustrated in Figure 1. This allows iterative pruning methods to easily determine the most compact architecture given a required level of model performance.
However, many of the existing pruning methods rely on classic regularizers that act on the individual weights by penalizing them to zero (Guo et al., 2019;Sanh et al., 2020). As a result, the pruned model tends to maintain the same architecture as the original model despite the reduced parameter count, which does not practically lead to an improvement in inference latency (Wen et al., 2016).
This leads to a question: is it possible to perform more structured pruning on a Transformer model to modify its model architecture (e.g., reducing width and depth)?
To this end, we notice that many previous works have suggested that learned Transformer models often contain considerable redundancy (Tenney et al., 2019; Liu et al., 2019a; Jawahar et al., 2019; Kovaleva et al., 2019). For example, Michel et al. (2019) and Voita et al. (2019) found that most of the attention heads in a Transformer model can be removed without significantly impacting accuracy, and Tenney et al. (2019) found that while the earlier layers and the later layers in a BERT model play clear roles in extracting either low-level or task-specific linguistic knowledge, the roles of the intermediate layers are less important. These observations have motivated the idea that Transformer models may exhibit considerable structural redundancy, i.e., some layers can be removed during training without harming the final performance.
In this paper, we propose a structured pruning method that reduces the architecture of a Transformer by directly targeting its sub-modules as a whole, for example, a single attention head, an attention module, or a feed-forward subnetwork. We take an approach we call spectral-normalized identity prior (SNIP), which imposes a function-level prior centered around the identity function on the aforementioned modules.
Specifically, we take advantage of the residual blocks (F(x) + x) within a Transformer layer and compress them to strict identity mappings (Yu et al., 2018) by identifying the residual blocks whose non-linear mapping's absolute values (|F(x)|) mostly fall below a threshold ε. Under this strategy, however, the weights of the Transformer model can still be under-regularized when using simple L1- or L2-based regularizers, leaving the distribution of the post-activation values prone to be noisy even after layer normalization (Ba et al., 2016). To address this issue, we further leverage spectral normalization (Miyato et al., 2018) to stabilize the distribution of the post-activation values by regularizing the largest singular value of the weight matrices.
We use BERT (Devlin et al., 2018) as a case study in this paper. Across multiple tasks in the GLUE benchmark (Wang et al., 2018), SNIP improves the performance over the state-of-the-art by 0.5 to 1.0% on average at 50% compression ratio. We also show that spectral normalization results in more sparse and regulated layer mappings during pre-training. We compare the remaining model components across tasks at a fixed compression ratio in an ablation study, and show that the remaining components are similar but not identical. Our contributions are three-fold: First, we introduce the identity-inducing prior, a structured pruning approach that imposes identity-inducing regularization on Transformer mappings as a whole rather than on their individual weights. Second, we show that through a novel combination with spectral normalization, the resulting spectral-normalized identity prior (SNIP) leads to a well-regularized weight distribution and sparse layer mappings in a BERT model. Finally, we conduct thorough experiments to validate the SNIP approach over 5 standard NLU tasks. Our results suggest that different model components in a Transformer play critically different roles across tasks, highlighting the importance of performing task-specific pruning to obtain the architecture that is most suitable for the target task.

Related Work
Pre-trained Language Model Compression The major existing efforts to compress pre-trained language models such as BERT include knowledge distillation (Ba and Caruana, 2014;Hinton et al., 2015) and pruning (Iandola et al., 2016;Veit and Belongie, 2017).
The knowledge distillation approach enables the transfer of knowledge from a large teacher model to a smaller student model. Such attempts have been made to distill BERT models, e.g., DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019), Distilled BiLSTM (Tang et al., 2019), TinyBERT (Jiao et al., 2019), MobileBERT (Sun et al., 2020), etc. All of these methods require carefully designing the student architecture. Furthermore, which intermediate results the student model should learn from, e.g., the outputs of each layer or the attention maps, is still under discussion.
Similar to other pruning-based methods, our method can iteratively remove the least important weights or connections, explore the full spectrum of trade-offs, and find the best affordable architecture in one shot. Many language representation model pruning methods focus on individual components of the weight matrices. For example, Guo et al. (2019) integrates reweighted L1 minimization with a proximal algorithm to search for sparsity patterns in the model; Gordon et al. (2020) uses magnitude weight pruning, which compresses the model by removing weights close to 0; Sanh et al. (2020) applies a deterministic first-order weight pruning method where both weights with low and high values can be pruned. A few works attempt structured weight pruning, e.g., via low-rank factorization combined with augmented Lagrangian L0 norm regularization. On the other hand, there also exist works that prune a coherent set of sub-modules in the Transformer model. For example, Michel et al. (2019) and Voita et al. (2019) propose to prune individual attention heads either manually via a head importance score, or automatically via a relaxed L0 regularization. Fan et al. (2020) applies random pruning to entire layers. In contrast, our method allows finer-grained structured pruning of Transformer modules (i.e., both attention heads and feed-forward layers) and improves a mathematical property of the Transformer (i.e., the Lipschitz condition) for more effective pruning.
Other compression approaches include weight sharing (Liu et al., 2019b), quantization (Zafrir et al., 2019; Shen et al., 2019), and neural architecture search, but they are outside the scope of this paper. We refer interested readers to Ganesh et al. (2020) for further details.
Applications of Spectral Normalization Spectral normalization was first proposed for generative adversarial networks (GANs) as a regularization technique to stabilize discriminator training (Miyato et al., 2018). It was later applied to improve the performance of other types of generative neural networks (Behrmann et al., 2019), and was analyzed theoretically in the context of adversarial robustness and generalization (Farnia et al., 2018).
Spectral normalization regularizes the Lipschitz condition of the model mappings and is known to benefit model generalization under both the classic and the adversarial settings (Sokolić et al., 2017;Cisse et al., 2017;Oberman and Calder, 2018;Neyshabur et al., 2017). In this paper, we will explore the benefit of spectral regularization for improving the effectiveness of pruning.

Methods
In this section, we first briefly review the basic Transformer layers of Vaswani et al. (2017) (Section 3.1). We then introduce our identity prior on the Transformer's residual connections using a threshold ε (Section 3.2). In Section 3.3, we present the mathematical foundations of spectral normalization and show how it helps our identity prior. Finally, putting it all together, we describe our structured iterative pruning method for BERT fine-tuning (Section 3.4).

Background: Transformer Layer
Transformer-based models are usually composed of a stack of Transformer layers. A Transformer layer takes a sequence of vectors as input and passes it first through a (multi-head) self-attention sub-layer, followed by a position-wise feed-forward network sub-layer.
Self-attention sub-layer The attention mechanism can be formulated as querying a dictionary with key-value pairs. Formally,

Attention(Q, K, V) = softmax(QK^T / √d_H) V,

where d_H is the dimension of the hidden representations, and Q, K, and V represent the query, key, and value. The multi-head variant of attention (MHA) allows the model to jointly attend to information from different representation sub-spaces, and is defined as

MHA(Q, K, V) = Concat(head_1, ..., head_A) W^O, with head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),

where A is the number of heads, and d_K and d_V are the dimensions of the key and value.
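To make the definitions above concrete, here is a minimal NumPy sketch of scaled dot-product attention and its multi-head variant. The toy dimensions, the random projection matrices, and the per-head scaling convention are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    # Project the inputs, split into `num_heads` sub-spaces, attend in each
    # sub-space independently, then concatenate the heads and project with W_O.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_head = Q.shape[-1] // num_heads
    heads = []
    for i in range(num_heads):
        s = slice(i * d_head, (i + 1) * d_head)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: 4 tokens, hidden size 16, 2 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))
W_Q, W_K, W_V, W_O = (0.1 * rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=2).shape)  # (4, 16)
```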
Position-wise FFN sub-layer In addition to the self-attention sub-layer, each Transformer layer also contains a fully connected feed-forward network, which is applied to each position separately and identically. This feed-forward network consists of two linear transformations with an activation function σ in between. Specifically, given vectors x_1, ..., x_n, a position-wise FFN sub-layer transforms each x_i as FFN(x_i) = σ(x_i W_1 + b_1) W_2 + b_2, where W_1, W_2, b_1, and b_2 are parameters.
We should also emphasize that a residual connection (He et al., 2016b) and layer normalization (Ba et al., 2016) are applied to the output of both the MHA and FFN sub-layers. The residual connection plays a key role in learning strict identity mappings (detailed in Section 3.2), while layer normalization and spectral normalization (detailed in Section 3.3) together ensure a well-regulated magnitude of activation outputs for improved pruning stability.
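The position-wise FFN and the residual-plus-layer-normalization wrapper can be sketched in the same spirit; the ReLU activation here merely stands in for σ (BERT actually uses GELU), and the learned gain and bias of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    # Normalize each position to zero mean and unit variance (gain and bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = sigma(x W1 + b1) W2 + b2, with ReLU standing in for sigma.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def sublayer(x, F):
    # Residual connection plus layer normalization: LayerNorm(F(x) + x).
    return layer_norm(F(x) + x)

# A Transformer layer then chains two such sub-layers:
#   y = sublayer(x, multi_head_attention); out = sublayer(y, ffn).
```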

Identity-inducing Prior for Transformer
The design of the residual connection provides a promising way to find identity mappings in the Transformer model. Specifically, a residual connection (He et al., 2016b) can be formalized as H(x) = F(x) + x, where F can be either MHA or FFN and H is the sub-layer output. As illustrated in He et al. (2016a), if an identity mapping is optimal, it is easier to push the residual to zero than to fit an identity mapping with a stack of non-linear layers.
We leverage ε-ResNet (Yu et al., 2018), a strict identity mapping mechanism that sparsifies the layer output by introducing a specific threshold ε as the identity prior. Specifically, we turn the residual connection into

H(x) = S_ε(F(x)) + x,

where the sparsity-promoting function S_ε dynamically discards the non-linearity term based on the activations: when all the responses of the non-linear mapping F(x) fall below the threshold ε, i.e., |v_i| < ε for every element v_i of v = F(x), then S_ε(F(x)) = 0; otherwise, the original mapping S_ε(F(x)) = F(x) is used, as in the standard residual network.
To implement S_ε, we place an extra binary gate layer t on top of F(x) by stacking additional rectified linear units (ReLU), following Srivastava et al. (2015). In particular, the gate is built from stacked ReLUs scaled by a very large positive constant L (e.g., 1e5 in our experiments), so that t(F(x)) saturates to 0 when every response of F(x) falls below ε and to 1 otherwise; S_ε(F(x)) then takes the gated form t(F(x)) · F(x).

Recall that each Transformer layer consists of two residual blocks, namely the self-attention sub-layer and the position-wise FFN sub-layer. We apply the gate directly to the residual block in the FFN sub-layer, i.e., H(x) = S_ε(FFN(x)) + x. When applying it to the attention sub-layer, we place S_ε on each single attention head, which allows us to prune any subset of attention heads: when S_ε(head_i) = 0, the i-th attention head does not contribute to the output of the attention layer and thus can be pruned out.
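The following is a minimal sketch of the sparsity-promoting function S_ε and its use in the two residual blocks. For readability the gate is written as a hard comparison; the paper realizes the same 0/1 behavior with stacked ReLUs scaled by the large constant L so that the gate stays inside the computation graph. `ffn` and `attention_heads` are hypothetical stand-ins for the actual sub-layer computations, and layer normalization is omitted.

```python
import numpy as np

def sparsify(F_x, eps):
    # S_eps(F(x)): output 0 if every response |F(x)_i| falls below eps,
    # otherwise pass F(x) through unchanged (standard residual behavior).
    gate = 1.0 if np.max(np.abs(F_x)) >= eps else 0.0
    return gate * F_x

def ffn_block(x, ffn, eps_ffn):
    # FFN residual block with the identity prior: H(x) = S_eps(FFN(x)) + x.
    return sparsify(ffn(x), eps_ffn) + x

def attention_block(x, attention_heads, W_O, eps_att):
    # Gate each attention head separately so that any subset of heads can be
    # pruned: a head whose responses all fall below eps_att contributes nothing.
    gated = [sparsify(h, eps_att) for h in attention_heads(x)]
    return np.concatenate(gated, axis=-1) @ W_O + x
```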
In our experiments, we set different values for ε_ATT and ε_FFN, since the absolute outputs of the attention and FFN layers lie on different scales, as illustrated in Figure 2.

Spectral Normalization
The sparsity-inducing function S_ε(F(x)) in ε-ResNet has been found to work well for randomly initialized neural networks, where the initial weight matrices of the non-linear mappings for all layers are distributed within a consistent range, which facilitates a natural separation between the function norms |F| of the important and unimportant non-linear mappings in the residual blocks during training (Yu et al., 2018). This is, however, not the case for the weight distribution of a pre-trained model like BERT, where the weight distributions of different layers have already diverged during pre-training, likely due to the specialization of layer functionalities under the masked language modeling (MLM) objective (Tenney et al., 2019).
Indeed, in our preliminary experiments, we observed that the proposed identity-inducing prior is not effective for a BERT model initialized from a classic pre-training checkpoint. As shown in Table 1, we found the distribution of the function norms P(|F(x)_i|) for the attention layers to be densely clustered within a small range (0, 2), with no clear separation between the function norms of the important and unimportant non-linear residual mappings. On the other hand, the norm distributions for different FFN layers were found to vary wildly, creating challenges for selecting a proper set of ε's in practice.

Table 1: Masked language modeling accuracy on the pre-training data and accuracy of fine-tuning on 5 natural language understanding tasks (details can be found in Section 4.1).
The above observations motivate us to identify an effective method to stabilize the norm distributions of BERT model layers. In this work, we consider spectral normalization (SN), an approach that directly controls the Lipschitz norm of a non-linear mapping F by regularizing the spectral behavior of its weight matrices (Miyato et al., 2018).
Specifically, for a weight matrix W, its spectral norm λ(W) is defined as its largest singular value:

λ(W) = max_{h ≠ 0} ||W h||_2 / ||h||_2.

We say a function F is L-Lipschitz if ||F(x_1) − F(x_2)|| / ||x_1 − x_2|| ≤ L for all possible (x_1, x_2) pairs from the feature space, and we call the smallest possible L the Lipschitz norm of F, denoted ||F||_Lip. Consequently, for a neural network mapping F(x) = σ(Wx + b) with a contractive activation function σ, its Lipschitz norm is upper-bounded by λ(W) (Miyato et al., 2018):

||F||_Lip ≤ ||σ||_Lip · λ(W) ≤ λ(W).

For BERT models, since the layer input x follows a distribution with zero mean and unit variance due to layer normalization (Ba et al., 2016), a non-linear mapping's L1 norm |F| is roughly proportional to its Lipschitz norm ||F||_Lip, which is controlled by λ(W). We can therefore better control the maximum of |F(x)| when identifying a good ε. Furthermore, the regularization in SN is achieved simply by dividing the layer weights by their corresponding spectral norm, i.e., Ŵ = W / λ(W), adding no additional trainable parameters to the original model.
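A minimal sketch of spectral normalization: λ(W) is estimated with a few steps of power iteration and the weight is divided by it, so no trainable parameters are added. The number of iterations, the random initialization of the power-iteration vector, and the example matrix shape are illustrative choices.

```python
import numpy as np

def spectral_normalize(W, num_iters=5, eps=1e-12):
    # Estimate the largest singular value lambda(W) by power iteration and
    # return the normalized weight W_hat = W / lambda(W).
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(num_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    sigma = u @ W @ v  # Rayleigh-quotient estimate of lambda(W)
    return W / sigma, sigma

W = np.random.default_rng(1).normal(size=(768, 3072))
W_hat, sigma = spectral_normalize(W)
print(sigma, np.linalg.svd(W, compute_uv=False)[0])  # power-iteration estimate vs. exact
```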
We apply SN during both pre-training and fine-tuning of the BERT model, on the weights of both the attention and the FFN layers. As shown in Table 1, compared to the original BERT-Base without SN, adding SN to a BERT-Base model results in improved pre-training performance and competitive fine-tuning performance.

Structured Iterative Pruning
We use a simple pruning method that greedily and iteratively prunes away attention heads and FFN layers, avoiding an impractical combinatorial search; two dynamic estimations are conducted, one for ε and one for the model architecture. One iteration contains four substeps:

1. Estimate ε given the current model architecture and training data. Specifically, we sort the attention heads and FFN layers by their mean activation outputs, and set ε to the k-th smallest mean activation. A larger k leads to more mappings being pruned in one iteration, which makes it harder for retraining to recover the performance, but results in fewer pruning iterations.
2. Train the model with the identity-inducing prior using the selected ε.
3. Estimate a smaller architecture given the current ε and training data. Specifically, we estimate the module usage by counting how often each residual block has learned to become a strict identity mapping across mini-batches in the training set, and we prune residual blocks whose usage rate falls below a threshold θ. When a residual block produces a negligible response, the function starts producing 0 outputs. As a result, the weights in this block stop contributing to the cross-entropy term, so the gradients are driven only by the regularization term, leading to weight collapse.
4. Retrain the model with the pruned residual blocks completely removed from the architecture. This is critical: if the pruned network is used without retraining, accuracy is significantly impacted. Also, during retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than to re-initialize them (Han et al., 2015). A sketch of one full pruning iteration is given below.
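To make the four substeps concrete, here is a high-level sketch of one pruning iteration. The helper names (`residual_modules`, `mean_activation`, `train_with_identity_prior`, `usage_rate`, `remove_modules`, `retrain`) are hypothetical wrappers around the corresponding training routines, and reading the usage rate as the fraction of mini-batches on which a block is not a strict identity mapping is our interpretation of Step 3.

```python
def pruning_iteration(model, train_data, k=1, theta=0.95):
    # Step 1: estimate eps as the k-th smallest mean activation over all
    # remaining attention heads and FFN layers.
    activations = sorted(mean_activation(m, train_data)
                         for m in residual_modules(model))
    eps = activations[k - 1]

    # Step 2: train with the identity-inducing prior using the selected eps.
    train_with_identity_prior(model, train_data, eps)

    # Step 3: prune residual blocks whose usage rate falls below theta,
    # i.e. blocks that collapse to a strict identity on most mini-batches.
    pruned = [m for m in residual_modules(model)
              if usage_rate(m, train_data, eps) < theta]
    remove_modules(model, pruned)

    # Step 4: retrain the smaller model, keeping the surviving weights
    # instead of re-initializing them (Han et al., 2015).
    retrain(model, train_data)
    return model
```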

Experimental Settings
In the experiments, we use the same architecture and base settings as the original BERT-Base (Devlin et al., 2018), and fine-tune each task independently. More details can be found in the Appendix.
For pre-training, we use the same data as BERT, which consists of 3.3 billion tokens from Wikipedia and BooksCorpus (Zhu et al., 2015). Following standard BERT practice, we conduct the pre-training only once and from scratch (i.e., no second pre-training). We use dynamic token masking, with the masked positions decided on-the-fly instead of during preprocessing. Also, we do not use the next sentence prediction objective proposed in the original BERT paper, as recent work has suggested it does not improve the scores (Yang et al., 2019; Liu et al., 2019b).
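As an illustration of dynamic token masking, the sketch below re-samples the masked positions every time an example is drawn, rather than fixing them during preprocessing. The 15% masking rate, the `[MASK]` token id argument, the -100 ignore index, and the toy token ids are standard BERT/PyTorch conventions assumed here, not values stated in this paper.

```python
import numpy as np

def dynamic_mask(token_ids, mask_id, mask_rate=0.15, rng=None):
    # Re-sample the masked positions on-the-fly for each pass over the example.
    rng = rng or np.random.default_rng()
    ids = np.array(token_ids)
    num_to_mask = max(1, int(round(mask_rate * len(ids))))
    positions = rng.choice(len(ids), size=num_to_mask, replace=False)
    labels = np.full_like(ids, -100)    # -100: ignore unmasked positions in the loss
    labels[positions] = ids[positions]  # the MLM targets are the original tokens
    ids[positions] = mask_id
    return ids, labels

# Each epoch sees a different masking of the same (toy) sentence:
tokens = [101, 7592, 2088, 2003, 2307, 102]
print(dynamic_mask(tokens, mask_id=103))
print(dynamic_mask(tokens, mask_id=103))
```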
For fine-tuning tasks, we focus on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) in the main text, since it is a thoroughly studied setting in much pruning/distillation work, thereby allowing comprehensive comparison. We conduct experiments on a subset of GLUE, classified into three categories: 1. Sentiment analysis: the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013); 2. Paraphrase similarity matching: Quora Question Pairs (QQP) (Chen et al., 2018) and the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005); 3. Natural language inference: Question Natural Language Inference (QNLI) (Chen et al., 2018) and Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2017).
A detailed description of the downstream tasks can be found in the Appendix. The reason for choosing this subset is that we found the variance in performance for these tasks to be lower than for the other GLUE tasks. We use the hyperparameters from Clark et al. (2020) for the most part. Since we run the training iteratively, we set the number of training epochs to 1 for most tasks, but to 3 for MRPC, considering that its dataset is much smaller than those of the other tasks.
For the ε and architecture estimations in Section 3.4, we set k = 1 and θ = 0.95 in our experiments.

Results
We compare the pruning results on non-normalized and normalized BERT models on the 5 GLUE tasks, as shown in Figure 3, which includes separate pruning for attention heads and FFN sub-layers and joint pruning of both modules. The results demonstrate the advantage of spectral normalization.
To put it all together, Table 2 shows the simplest architecture we can obtain when allowing at most 1% performance degradation; further pruning beyond this point has a noticeable impact on the final results. We find that spectral normalization leads to a better trade-off between parameter size and performance. Specifically, for the same target performance shown in the last two rows of Table 2, the spectral-normalized BERT can on average be pruned by 12% more parameters than the original BERT.
We also list other compression methods in the table for comparison; most of them are knowledge distillation methods with a pre-defined, fixed size of the compressed model. Our method provides the flexibility to choose the best architecture and takes advantage of finding the inflection point during pruning; moreover, compared with other pruning methods, it can practically speed up inference time since the pruning is structured.

Analysis
In this section, we investigate the contribution of: (1) single attention head pruning, and (2) split pruning for attention heads and FFN sub-layers.
Single Head Attention Pruning Multi-head self-attention is a key component of Transformer, where each attention head potentially focuses on different parts of the inputs.
Analyzing multi-head attention and its importance is challenging. Previous analyses of multi-head attention considered the average of attention weights over all heads at a given position or focused only on the maximum attention weights (Voita et al., 2018; Tang et al., 2018), or explicitly took into account the varying importance of different heads (Voita et al., 2019). Michel et al. (2019) showed that attention heads can be removed without significantly impacting performance, but they mainly focus on machine translation and NLI.
To understand whether and to what extent attention heads play consistent and interpretable roles when trained on different downstream tasks, we pick one task each from sentiment analysis, paraphrasing, and natural language inference, plot the pruned attention heads during training, and show the dynamic process in Figure 4. We find that, although the distributions of important attention heads are similar across tasks, as reflected in the mean of each attention layer, the exact pruned attention heads are actually different. This also indicates that splitting the pruning across attention heads gives the model more flexibility to find an optimal architecture.

Separate Pruning for Attention and FFN Decoupling and then individually studying the self-attention and the FFN sub-layers is important for understanding the progressive improvement they provide. As can be observed from Figure 3, for most of the tasks, pruning FFN layers damages the performance more than pruning the attention layers, indicating that compression techniques for the Transformer model tend to be more effective on the attention layers (Voita et al., 2018; Michel et al., 2019) than on the FFN layers (Ganesh et al., 2020).
In Figure 4, similarly to the attention heads, we further plot the pruning map of the FFN layers. We find that certain FFN sub-layers may be more amenable to compression even when the corresponding self-attention sub-layer is not. For example, in all the tasks, the FFN layers near the input or output ends of the network are more likely to be pruned, while this does not hold for the corresponding self-attention layers.
Finally, we compare single-head attention pruning with separate attention/FFN pruning, and show the evolution of performance for single-head attention pruning (HA), multi-head attention pruning (MHA), separate attention/FFN pruning, and whole-layer pruning in Figure 5. We find that MHA and separate pruning perform much better than HA and layer pruning.

Conclusion
In this work, we propose a structured pruning method for compressing Transformer models, which prunes redundant mappings via spectral-normalized identity priors (SNIP). We achieve effective pruning results on BERT fine-tuning while maintaining comparable performance. Our work shows the importance of the mathematical properties of the Transformer model (specifically, the Lipschitz condition) for the effectiveness of pruning.
Additionally, we quantify task-specific trade-offs between model complexity and task performance, as well as the progressive improvement provided by Multi-head Attention (MHA) and Feedforward Networks (FFN). Our results show that applying pruning at the level of mappings instead of individual weights allows for better model compression, when combined with the appropriate regularization. This suggests that developing more global pruning strategies may be a fruitful avenue for future research.
In the future, we plan to apply a similar approach to further reduce the width of Transformer layers, i.e., the hidden dimension, to achieve an even higher compression ratio. We are also interested in jointly using the proposed approach with other compression methods.

Figure 6: The largest singular value of the attention output weights and of the FFN intermediate and output weights for the original BERT-Base during fine-tuning on the SST-2 task.

A.1 Downstream Tasks
We provide a brief description of the 5 tasks in our experiments from the GLUE benchmark (Wang et al., 2018).

SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence (positive/negative). The performance is evaluated by the accuracy.
QQP The Quora Question Pairs dataset (Chen et al., 2018) is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent. The performance is evaluated by the accuracy.

MRPC The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent; the task is to predict this equivalence. The performance is evaluated by both the F1 score and the accuracy.

QNLI The Question-answering NLI dataset (Chen et al., 2018) is converted from the Stanford Question Answering Dataset (SQuAD) to a classification task. The performance is evaluated by the accuracy.
MNLI The Multi-Genre Natural Language Inference Corpus (Williams et al., 2017) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The performance is evaluated by the test accuracy on both matched (in-domain) and mismatched (cross-domain) sections of the test data.

A.2 Experiment Settings
The full set of hyperparameters for pre-training and fine-tuning are listed in Table 3.

A.3 Spectral Norm of Weights during Training
We show the largest singular values of the weight matrices in the original BERT model during fine-tuning on the SST-2 task in Figure 6. As can be seen from the figure, without spectral normalization the largest singular values of the weights clearly grow without control.
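The curves in Figure 6 can be reproduced by logging, at each checkpoint, the largest singular value of each weight matrix of interest; a minimal helper (with a hypothetical `weights_by_layer` dictionary as input) might look like this:

```python
import numpy as np

def largest_singular_value(W):
    # lambda(W): the largest singular value of the weight matrix W.
    return np.linalg.svd(W, compute_uv=False)[0]

def log_spectral_norms(weights_by_layer):
    # weights_by_layer: {layer_name: weight matrix}, e.g. attention output,
    # FFN intermediate, and FFN output weights for each Transformer layer.
    return {name: largest_singular_value(W) for name, W in weights_by_layer.items()}
```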