On the Weak Link between Importance and Prunability of Attention Heads

Given the success of Transformer-based models, two directions of study have emerged: interpreting the role of individual attention heads and down-sizing the models for efficiency. Our work straddles these two streams: we analyse the importance of basing pruning strategies on the interpreted role of the attention heads. We evaluate this on Transformer and BERT models on multiple NLP tasks. Firstly, we find that a large fraction of the attention heads can be randomly pruned with limited effect on accuracy. Secondly, for Transformers, we find no advantage in pruning attention heads identified to be important based on existing studies that relate importance to the location of a head. On the BERT model too, we find no preference for top or bottom layers, though the latter are reported to have higher importance. However, strategies that avoid pruning middle layers and consecutive layers perform better. Finally, during fine-tuning, the compensation for pruned attention heads is roughly equally distributed across the unpruned heads. Our results thus suggest that interpretation of attention heads does not strongly inform pruning.


Introduction
The acclaimed success of Transformer-based models across NLP tasks has been followed by two important directions of research. In the first direction, interpretability studies aim to understand how these models work. Given that multi-headed attention is an important feature of these models, researchers have focused on attention heads as the units of interpretation. These studies comment on the role of each attention head and the relation between a head's position and its significance (Clark et al., 2019; Michel et al., 2019; Voita et al., 2019a,b; Liu et al., 2019; Belinkov et al., 2017). They show that certain heads are more important based on (i) their position in the network (top, middle, bottom), (ii) the component to which they belong (encoder self-attention, decoder self-attention, encoder-decoder cross-attention), or (iii) the functional role they play (e.g., syntactic/semantic).
In the other major direction, these large Transformer-based models have been down-sized to be more time and space efficient. Different methods for down-sizing have been studied, such as pruning (McCarley, 2019; Gordon et al., 2020; Sajjad et al., 2020), distillation (Sanh et al., 2019; Liu et al., 2019; Jiao et al., 2019), weight quantization (Zafrir et al., 2019; Shen et al., 2019), and weight factorization and parameter sharing (Lan et al., 2019). Pruning techniques have been particularly successful in reinforcing the folk-lore that these models are highly over-parameterized. These methods prune parameters based on magnitude (Gordon et al., 2020) or importance (McCarley, 2019), or prune entire layers (Sajjad et al., 2020).
In this paper, we straddle these two directions of work by asking the following question: Can we randomly prune heads, thus completely ignoring any notion of importance of heads? To answer this, we systematically study the effect of randomly pruning specific subsets of attention heads on the accuracy on different tasks. Across experiments, we modify the random sampling to vary the percentage of heads pruned and their location in the network (components and layers).
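As an illustration of the sampling step, a random pruning mask for a given percentage of heads might be drawn as follows (a minimal sketch; the function name and the default of 144 heads, matching the models used later, are our choices):

```python
import random

def random_prune_mask(n_heads=144, prune_pct=0.5, seed=0):
    """Return a list of 0/1 weights: 0 = head pruned, 1 = head kept."""
    rng = random.Random(seed)
    n_pruned = round(n_heads * prune_pct)          # number of heads to zero out
    pruned = set(rng.sample(range(n_heads), n_pruned))
    return [0 if i in pruned else 1 for i in range(n_heads)]
```

Constraining the sampling to particular components or layers then amounts to restricting the index range from which heads are drawn.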
We evaluate these experiments both on the Transformer and BERT models. Our results show that a large fraction of attention heads can be pruned randomly: 75% of the attention heads of the Transformer can be randomly pruned with a drop of less than 1 BLEU point on NMT tasks. Similarly, half of the attention heads of BERT can be randomly pruned with an average drop in accuracy of less than 1% across a chosen set of GLUE tasks. Significantly, for Transformers, we find no evidence of pruning methods preferring specific attention heads based on their location, even when the locations are chosen to match attention heads identified to be more important in existing studies. Similarly, on the BERT model, pruning top and bottom layers shows no significant difference, even though existing studies attribute higher importance to the latter (Sajjad et al., 2020). However, we identify a preference to avoid pruning the middle layers and consecutive layers. Lastly, we check whether certain heads compensate more than others for the pruned heads during fine-tuning; if so, such heads would perhaps be more important. However, we find no such evidence: during fine-tuning, the unpruned heads change similarly across most pruning configurations. Overall, our experiments suggest that interpretation of attention heads does not strongly inform pruning. The rest of the paper is organized as follows: Section 2 describes the models and the datasets used for this work, followed by Section 3, which provides details of the experimental process and reports results on both Transformer and BERT models. We summarize our work in Section 4.

Multi-headed Self Attention
In each multi-headed attention layer we have multiple attention heads which transform the representations of a given sequence of input tokens. Given the $d_v$-dimensional representations of $T$ tokens as $X \in \mathbb{R}^{T \times d_v}$, the output of multi-headed self-attention with $N$ attention heads is given by
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_N)\,W^o, \quad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(XW^q_i)(XW^k_i)^\top}{\sqrt{d_k}}\right) XW^v_i, \qquad (1)$$
where $W^k_i, W^q_i, W^v_i \in \mathbb{R}^{d_v \times d_k}$ are the parameters of the $i$-th attention head.
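A minimal NumPy sketch of this computation may make the shapes concrete (a simplified illustration; the output projection `Wo` is the standard Transformer choice, and all function names here are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (T, d_v); Wq/Wk/Wv: lists of N (d_v, d_k) matrices; Wo: (N*d_k, d_v)."""
    d_k = Wq[0].shape[1]
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i   # per-head projections, each (T, d_k)
        A = softmax(Q @ K.T / np.sqrt(d_k))      # (T, T) attention weights
        heads.append(A @ V)                      # (T, d_k) head output
    return np.concatenate(heads, axis=-1) @ Wo   # (T, d_v)
```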

Transformers
We use the Transformer-Base model (Vaswani et al., 2017), which has 6 layers each in the three components: encoder self-attention (ES), encoder-decoder cross-attention (ED), and decoder self-attention (DS). In each layer of each of the three components, we have 8 attention heads, totalling to 3 × 6 × 8 = 144 attention heads. We train the models with 2.5 million sentence pairs each from the WMT'14 English-Russian (EN-RU) and English-German (EN-DE) datasets. We report BLEU scores on WMT's newstest2014. We use the Adam optimizer (Kingma and Ba, 2014) with parameters $\beta_1 = 0.9$, $\beta_2 = 0.997$, and $\epsilon = 10^{-9}$. We vary the learning rate according to the formula described in Vaswani et al. (2017) with warmup steps = 16k. We use large batch sizes of 32k and 25k for EN-RU and EN-DE, respectively, as it has been established that large batch sizes are critical to the performance of Transformers (Popel and Bojar, 2018; Voita et al., 2019b). We achieve these effectively large batch sizes using gradient accumulation on single NVIDIA V100 and 1080Ti GPUs.
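The schedule from Vaswani et al. (2017), with the 16k warmup used here, can be sketched as follows (a minimal sketch; `d_model=512` assumes the Transformer-Base dimension, and the function name is ours):

```python
def transformer_lr(step, d_model=512, warmup=16000):
    """Learning rate from Vaswani et al. (2017): linear warmup, then
    inverse-square-root decay, peaking at step == warmup."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```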

BERT
In all experiments involving BERT, we use the BERT Base-uncased model (Devlin et al., 2018). It has 12 layers and each layer contains 12 attention heads, summing to 144 attention heads. We fine-tune and evaluate the pre-trained model on the sentence-entailment task MNLI-M, the question-similarity task QQP, the question-answering task QNLI, and the movie-review task SST-2 from the GLUE Benchmark (Wang et al., 2018). We report accuracies on the official development sets of the considered GLUE tasks. For each of the four GLUE tasks, we tried combinations of batch size and learning rate from {8, 16, 32, 64, 128} and {2, 3, 4, 5} × 10^{-5} respectively, and selected the best-performing configuration. The exact hyperparameters used for each of the tasks have been made available with the released code. Each BERT experiment was run on a single Cloud TPU (v2-8).

Experimental Process
In all the experiments, we perform random pruning where a subset of attention heads chosen by random sampling is zeroed out. Formally, each attention head is assigned a weight $\xi_i$ which is 0 if the head is pruned and 1 otherwise. Then, the output of an attention layer is given by
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\xi_1\,\mathrm{head}_1, \ldots, \xi_N\,\mathrm{head}_N)\,W^o. \qquad (2)$$
After pruning, we fine-tune the Transformer model for 30 epochs and the BERT model for 10 epochs.
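This masking can be sketched as follows, assuming the per-head outputs have already been computed as in Equation 1 (a minimal illustration; the function name is ours):

```python
import numpy as np

def masked_multi_head_output(heads, xi, Wo):
    """heads: list of N (T, d_k) per-head outputs; xi: length-N 0/1 mask;
    Wo: (N*d_k, d_v). A pruned head (xi_i = 0) contributes nothing to the
    concatenated output."""
    masked = [x * h for x, h in zip(xi, heads)]
    return np.concatenate(masked, axis=-1) @ Wo
```

Note that because a pruned head's output is zeroed rather than removed, the parameter shapes are unchanged and the model can be fine-tuned afterwards without any architectural surgery.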
Since the values ξ are randomly sampled, in each experiment we report the average of three different samplings of ξ. The standard deviations are 0.668% and 0.778% of the reported average values for Transformer and BERT respectively.

Experimental Results on Transformers
Varying Pruning Percentage. We randomly prune attention heads across all components and layers, varying the percentage of pruning from 25% to 87% (Table 1). We observe that even in the case of extreme pruning, i.e., keeping just one head in each layer of each of the three components (which corresponds to a pruning percentage of 87%), the drop in BLEU is only 1.62 (EN-RU) and 1.03 (EN-DE), as can be seen from Table 1.

Pruning based on Layer Numbers. Voita et al. (2019b) identify that attention heads in specific layers of the Transformer are more important: lower layers of the self-attention components, i.e., Encoder-Self (ES) and Decoder-Self (DS), and higher layers of the encoder-decoder cross-attention (ED). We evaluate the correspondence of this importance to pruning. We choose 5 pruning percentages from 25% to 75% and in each case two pruning configurations: one where the heads considered important are retained, and one where the important heads are pruned. The configurations and the corresponding BLEU scores on the EN-RU dataset are shown in Table 2, where each configuration is specified as a string. For example, the string 777322 indicates that 7 heads each were retained in the first three layers, 3 in the fourth layer, and 2 each in the last two layers. For each pruning percentage, the first row corresponds to the configuration in which heads considered important (Voita et al., 2019b) were retained, and the second row to the adversarial configuration in which these heads were pruned. We identify no preference in pruning: for each pruning percentage, the performance of both configurations is very similar.
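The configuration strings can be decoded mechanically; a small sketch (the helper is ours, and the default of 8 heads per layer matches Transformer-Base):

```python
def decode_config(config, heads_per_layer=8):
    """Decode a string like '777322' into retained heads per layer and the
    pruning percentage for that component."""
    retained = [int(c) for c in config]          # heads kept in each layer
    total = heads_per_layer * len(retained)      # heads available in the component
    pruned_pct = 100 * (total - sum(retained)) / total
    return retained, pruned_pct
```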
Pruning Based on Component. Some studies show that heads in the ED component are most important while those in the ES module are least important (Voita et al., 2019b). We choose 4 different pruning percentages and in each case consider three configurations where the number of attention heads is least in one chosen component (ES, ED, DS). The configurations and corresponding BLEU scores on the EN-RU dataset are shown in Table 3.
We identify no consistent preference in the pruning strategy: in the 4 cases considered, each of the 3 configurations has the highest BLEU score in at least one case. Note that we chose the number of heads in each layer (14, 31, etc.) to be consistent with those used by Voita et al. (2019b).

Experimental Results on BERT
Varying Pruning Percentage. We vary the pruning percentage from 10% to 90% and report the accuracy on the 4 GLUE tasks: MNLI-M, QQP, QNLI, and SST-2 (Table 4). We observe that half of the attention heads can be pruned with an average accuracy drop of under 1%. As shown in Figure 1, beyond 50% pruning the accuracy drop is sharper.

Pruning based on Layer Numbers. To identify any preference for pruning heads in specific layers, we consider several configurations as shown in Table 5, where we prune a subset of layers entirely, i.e., we prune all the attention heads of particular layers. When all the self-attention heads of a layer l are pruned, only the feed-forward network of that layer remains active, and its input is simply the output of the previous layer l-1. Bottom layers of BERT have been identified to model word morphology (Liu et al., 2019; Belinkov et al., 2017) and are considered to be important (Sajjad et al., 2020). Further, recent work has identified high cosine similarity between the output vectors of the top layers, indicating reduced importance of the top layers (Goyal et al., 2020). We relate these studies to pruning by comparing the pruning of the same number of top and bottom layers (rows 2-9 in Table 5). Amongst the four cases, two favor pruning top layers and two favor pruning bottom layers, revealing no preference in pruning.
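The observation that a fully head-pruned layer reduces to its feed-forward network can be made concrete with the standard residual structure (a simplified sketch; LayerNorm is omitted for brevity, which is a simplification, and the function name is ours):

```python
import numpy as np

def layer_forward(x, attn_out, W1, b1, W2, b2, all_heads_pruned=False):
    """One simplified Transformer layer: attention sub-layer + FFN, each with a
    residual connection. If every head is pruned, the attention sub-layer
    contributes nothing, so the FFN operates directly on the previous layer's
    output x."""
    h = x if all_heads_pruned else x + attn_out       # residual around attention
    ffn = np.maximum(h @ W1 + b1, 0) @ W2 + b2        # ReLU feed-forward network
    return h + ffn                                    # residual around FFN
```

In particular, pruning all heads of a layer is equivalent to the attention sub-layer producing an all-zero output.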
The middle layers in BERT have been shown to have specific characteristics of higher attention entropy and greater attention to specific tokens (Clark et al., 2019). We thus considered configurations where we compare pruning top and bottom layers against pruning middle layers (last eight rows of Table 5). The results indicate a clear preference: in 14 out of 16 cases, pruning the middle layers performs worse than pruning an equal number of layers distributed among the top and bottom layers. Indeed, we incur an additional drop of over 2% in average accuracy for the QNLI and SST-2 tasks, indicating a task-specific sensitivity to pruning middle layers.
Recent work has identified that consecutive layers share similar functionality; consistent with this, we find that pruning consecutive layers leads to a larger drop in accuracy than pruning the same number of distributed layers (Table 6).

Changes during Fine-tuning. If certain heads compensated more than others for the pruned heads during fine-tuning, such heads would perhaps be more important. We now evaluate this for fine-tuning after pruning.
In Figure 2, we plot the average change in magnitude of the parameters of different attention heads ($W^q_i$, $W^k_i$, $W^v_i$ in Equation 1) for the MNLI-M task. We observe no spatial patterns in the parameter changes, nor any pattern with respect to the relative distance from pruned heads. In particular, for all experiments in Tables 5 and 6, the average change in attention parameters for any two layers differs by less than 10%. This shows that the compensation for pruned attention heads is roughly equally distributed across the unpruned heads.
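The change measurement can be sketched as follows (a minimal sketch; the helper name and dictionary layout are our assumptions, not the authors' code):

```python
import numpy as np

def avg_head_param_change(before, after):
    """before/after: dicts mapping 'q'/'k'/'v' to (d_v, d_k) arrays for one head.
    Returns the mean absolute change across all of the head's attention
    parameters, comparing pre- and post-fine-tuning values."""
    diffs = [np.abs(after[k] - before[k]) for k in ("q", "k", "v")]
    return float(np.mean(np.concatenate([d.ravel() for d in diffs])))
```

Computing this quantity per head (or averaged per layer) and comparing across layers gives the uniformity check described above.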

Conclusion
We systematically studied the effect of pruning attention heads in Transformer and BERT models. We confirmed the general expectation that a large number of attention heads can be pruned with limited impact on performance. For Transformers, we observed no preference for pruning attention heads which have been identified as important in interpretability studies. Similarly, for BERT, we found no preference between pruning top and bottom layers. However, pruning middle layers and consecutive layers led to a larger drop in accuracy. We also observed that the recovery during fine-tuning was uniformly distributed across attention heads. We conclude that there is often no direct entailment between the importance of an attention head, as characterised in several recent studies, and low prunability of that head under random pruning.

Acknowledgements
We thank Google for supporting Preksha Nema through the Google Ph.D. India Fellowship program. We also thank the Department of Computer Science and Engineering as well as the Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI), IIT Madras for providing us with the resources that made this work possible.