When BERT Plays the Lottery, All Tickets Are Winning

Much of the recent success in NLP is due to the large Transformer-based models such as BERT (Devlin et al., 2019). However, these models have been shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis. For fine-tuned BERT, we show that (a) it is possible to find a subnetwork of elements that achieves performance comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. However, the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in the pre-trained BERT are potentially useful. We also show that the "good" subnetworks vary considerably across GLUE tasks, opening up the possibilities to learn what knowledge BERT actually uses at inference time.


Introduction
Much of the recent success in NLP is due to the transfer learning paradigm where large Transformer-based models first try to learn task-independent linguistic knowledge from large raw text corpora, and then get fine-tuned on small datasets for specific tasks. One of the most famous Transformers is BERT (Devlin et al., 2019), which became a must-have baseline and inspired dozens of analysis studies (Rogers et al., 2020b).
However, these models have been shown to be overparametrized. We now know that most Transformer heads and even layers can be pruned (Voita et al., 2019; Michel et al., 2019; Kovaleva et al., 2019), but it is not clear whether that is due to redundant weights, or to some parts of the model simply being "inactive" (Zhang et al., 2019).
We conduct a systematic case study of fine-tuning BERT (Devlin et al., 2019) on GLUE tasks (Wang et al., 2018) from the perspective of the lottery ticket hypothesis (Frankle and Carbin, 2019). We use importance scores for both self-attention heads and multi-layer perceptrons (MLPs) in fine-tuned BERT to find the "good" subnetworks that achieve 90% of full model performance, and we test the lottery ticket hypothesis at the level of BERT architecture blocks. We find that "good" subnetworks perform considerably better than similarly-sized subnetworks sampled from the less important components of the model. However, both "bad" and "good" subnetworks can be fine-tuned separately to achieve comparable performance.
We also experiment with 9 GLUE tasks to see the degree to which the "good" subnetworks overlap. We find that 86% heads and 57% MLPs survive in fewer than 7 tasks, which raises concerns about the degree to which BERT relies on task-specific heuristics rather than general linguistic knowledge. It also offers a more precise instrument for learning what kinds of knowledge are used by BERT in different types of tasks and datasets.

Related work
BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2019) has inspired multiple studies which aim to understand why it works so well and propose various modifications. A detailed overview of work to date is available in the survey by Rogers et al. (2020b).
One claim supported by many studies is that BERT is considerably overparametrized. In particular, it is possible to ablate elements of its architecture without loss in performance or even with slight gains (Michel et al., 2019; Voita et al., 2019; Kovaleva et al., 2019). This explains the success of BERT compression studies (Sanh et al., 2019; Jiao et al., 2019; McCarley, 2019; Lan et al., 2020).
While NLP focused on building larger Transformers, the computer vision community was exploring the lottery ticket hypothesis (Frankle and Carbin, 2019; Lee et al., 2018; Zhou et al., 2019). It states that "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations" (Frankle and Carbin, 2019). The "winning" initializations were shown to generalize across computer vision datasets (Morcos et al., 2019), and to exist both in LSTM and Transformer models for NLP tasks (Yu et al., 2020).
However, so far the lottery ticket work has focused on the "winning" random initializations. In the case of BERT and other widely used Transformers, there is a large pre-trained language model used in conjunction with a randomly initialized task-specific classifier. The motivation for this is that language modeling is a self-supervised task that can be performed on large amounts of text, and should yield transferable linguistic knowledge. The fine-tuning step would then only need to teach the model how to use the representation learned in pre-training to perform the specific task. However, we have ample evidence that BERT is very adept at learning not just the new tasks, but also all kinds of biases present in the task-specific data (McCoy et al., 2019; Rogers et al., 2020a; Jin et al., 2020; Niven and Kao, 2019; Zellers et al., 2019).
An extra level of complexity is added by the fact that random initializations in the task-specific classifier interact with the pre-trained BERT weights, affecting the performance of fine-tuned BERT (Dodge et al., 2020). If the pre-trained weights indeed encode transferable linguistic knowledge, we would expect the "good" subnetworks to be the ones that better encode this knowledge, and they would be stable across different fine-tuning runs for the same task. The variation in performance between runs would then show that some initializations are better than others for leveraging the knowledge in the pre-trained weights for a given task. This is one of the questions we consider.

Methodology
The original lottery ticket study (Frankle and Carbin, 2019) focuses on feed-forward networks with iterative magnitude pruning. This section describes the alternative we use in the present study, namely, masking the "bad" subnetworks in BERT based on their importance scores.

Masking Heads and MLPs
BERT is fundamentally a stack of Transformer encoder layers (Vaswani et al., 2017). It consists of multiple identical layers, each containing a multi-head self-attention block followed by an MLP block, with two residual connections (one around each block).
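The Multi-Head Self-Attention (MHAtt) consists of $N_h$ independently parametrized self-attention heads. An attention head $h$ in layer $l$ is parametrized by its key, query, value, and output projection matrices $W_k^{(h,l)}, W_q^{(h,l)}, W_v^{(h,l)}, W_o^{(h,l)}$. Given $n$ $d$-dimensional input vectors $x = x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, the multi-head attention is the sum of the outputs of the individual attention heads applied to the input $x$:

$$\mathrm{MHAtt}^{(l)}(x) = \sum_{h=1}^{N_h} \mathrm{Att}^{(h,l)}(x) \tag{1}$$

where $\mathrm{Att}^{(h,l)}$ denotes head $h$ of layer $l$ with the parameters listed above.
The multi-layer perceptron $\mathrm{MLP}^{(l)}$ in layer $l$ of BERT consists of two feed-forward layers. It is applied separately to $n$ $d$-dimensional vectors $z \in \mathbb{R}^d$ coming from the attention sub-layer. Dropout (Srivastava et al., 2014) is used for regularization. Then inputs of the MLP are added to its outputs through a residual connection:

$$\mathrm{MLP}^{(l)}_{\mathrm{out}}(z) = \mathrm{MLP}^{(l)}(z) + z \tag{2}$$

For masking each self-attention head in a layer we change (1) to:

$$\mathrm{MHAtt}^{(l)}(x) = \sum_{h=1}^{N_h} \xi^{(h,l)} \, \mathrm{Att}^{(h,l)}(x) \tag{3}$$

where the $\xi^{(h,l)}$ are masking variables taking values in $\{0, 1\}$. If we set $\xi^{(h,l)} = 0$ we effectively mask the attention head $h$ in layer $l$.
For masking the MLP in a given layer we change (2) to:

$$\mathrm{MLP}^{(l)}_{\mathrm{out}}(z) = \nu^{(l)} \, \mathrm{MLP}^{(l)}(z) + z \tag{4}$$

where the $\nu^{(l)}$ are masking variables taking values in $\{0, 1\}$. If we set $\nu^{(l)} = 0$ we effectively mask the MLP in layer $l$.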
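The effect of the two kinds of masking variables can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the per-head outputs are assumed to be already computed, and the function names `masked_multi_head` and `masked_mlp_block` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def masked_multi_head(head_outputs: Sequence[Vector], xi: Sequence[int]) -> Vector:
    """Sum the per-head outputs, weighting head h by its mask xi[h] (0 or 1)."""
    if len(head_outputs) != len(xi):
        raise ValueError("need exactly one mask value per head")
    dim = len(head_outputs[0])
    total = [0.0] * dim
    for out, mask in zip(head_outputs, xi):
        for i in range(dim):
            total[i] += mask * out[i]
    return total


def masked_mlp_block(z: Vector, mlp: Callable[[Vector], Vector], nu: int) -> Vector:
    """Residual MLP block: with nu = 0 only the residual path z remains."""
    return [nu * m + zi for m, zi in zip(mlp(z), z)]


# Toy usage (illustrative numbers only): three heads, the second one masked.
heads = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
attn = masked_multi_head(heads, xi=[1, 0, 1])


def double(v: Vector) -> Vector:
    """Stand-in for an MLP, assumed for illustration only."""
    return [2.0 * a for a in v]


kept_mlp = masked_mlp_block(attn, double, nu=1)
dropped_mlp = masked_mlp_block(attn, double, nu=0)
```

Because the masks multiply the block outputs rather than remove parameters, a masked head or MLP contributes nothing to the forward pass while the residual path is left intact.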

Importance scores
We mask the maximum number of attention heads and MLPs possible with the constraint that the model attains at least 90% of the performance of the full model. Combinatorial search to find this mask is impractical due to the compute required. Michel et al. (2019) proposed an importance score heuristic for self-attention heads in Transformers, which we adopt and extend to MLPs. As a proxy score for component importance, we look at the expected sensitivity of the model to the mask variables $\xi^{(h,l)}$ in (3) and $\nu^{(l)}$ in (4):

$$I_h^{(h,l)} = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi^{(h,l)}} \right| \tag{5}$$

$$I_{mlp}^{(l)} = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \nu^{(l)}} \right| \tag{6}$$

where $x$ is a sample from the data distribution $X$ and $\mathcal{L}(x)$ is the loss of the network outputs on that sample. Computing $I_h^{(h,l)}$ and $I_{mlp}^{(l)}$ involves a backward pass on the loss over samples of the evaluation data (see footnote 2). Following Michel et al., we normalize the importance scores for attention heads by layer (using the $\ell_2$ norm).
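As an illustration of the scoring step, the sketch below averages absolute mask gradients over samples and applies the per-layer normalization to the head scores. It assumes the per-sample gradients of the loss with respect to the mask variables have already been obtained from a backward pass; the function names and the nested-list layout are hypothetical choices for this example.

```python
import math
from typing import List, Sequence


def head_importance(grads: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """grads[s][l][h] is the gradient of the loss on sample s w.r.t. the mask
    of head h in layer l. Returns mean absolute gradients, l2-normalized
    within each layer."""
    n_samples = len(grads)
    n_layers = len(grads[0])
    n_heads = len(grads[0][0])
    scores = [[0.0] * n_heads for _ in range(n_layers)]
    for sample in grads:
        for l in range(n_layers):
            for h in range(n_heads):
                scores[l][h] += abs(sample[l][h]) / n_samples
    normalized = []
    for layer in scores:
        norm = math.sqrt(sum(s * s for s in layer))
        normalized.append([s / norm if norm > 0 else 0.0 for s in layer])
    return normalized


def mlp_importance(grads: Sequence[Sequence[float]]) -> List[float]:
    """grads[s][l] is the gradient w.r.t. the MLP mask of layer l on sample s.
    Returns the mean absolute gradient per layer (no normalization)."""
    n_samples = len(grads)
    n_layers = len(grads[0])
    return [sum(abs(g[l]) for g in grads) / n_samples for l in range(n_layers)]


# Toy usage: 2 samples, 1 layer with 2 heads, and 2 MLP layers.
toy_head_scores = head_importance([[[3.0, -4.0]], [[-3.0, 4.0]]])
toy_mlp_scores = mlp_importance([[1.0, -2.0], [-3.0, 2.0]])
```

Taking the absolute value before averaging keeps positive and negative gradients from cancelling out across samples.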

Iterative pruning
The importance scores described above are used iteratively to prune the lowest-scoring components of BERT. We continue pruning as long as the performance remains above 90% of the full fine-tuned model's performance. The components for pruning are selected under the following settings:
• Heads only: in each iteration, we mask as many of the unmasked heads with the lowest importance scores as we can (144 heads in the full BERT-base model).
• MLPs only: we iteratively mask one of the remaining MLPs that has the smallest importance score (Equation 6).
• Heads and MLPs: we compute head (Equation 5) and MLP (Equation 6) importance scores in a single backward pass, pruning 10% of heads and one MLP with the smallest scores for as long as the performance on the dev set stays within 90% of the full model's performance. Then we continue pruning heads alone, and then MLPs alone. This strategy results in a larger number of total components pruned within our performance threshold.
Footnote 2: The GLUE dev sets are used as oracles to obtain the best possible heads and MLPs for the particular model and task.
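The general shape of such a pruning loop can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure used in the experiments: `importance` and `evaluate` are hypothetical callbacks standing in for the importance-score computation and the dev-set evaluation, and a fixed number of components (`step`) is removed per iteration.

```python
from typing import Callable, Dict, Hashable, Iterable, Set


def iterative_prune(
    components: Iterable[Hashable],
    importance: Callable[[Set[Hashable]], Dict[Hashable, float]],
    evaluate: Callable[[Set[Hashable]], float],
    full_score: float,
    step: int = 1,
    threshold: float = 0.9,
) -> Set[Hashable]:
    """Repeatedly drop the `step` lowest-scoring surviving components and stop
    before the first removal that pushes the score below threshold * full_score."""
    alive = set(components)
    while alive:
        scores = importance(alive)  # recomputed for the current mask
        ranked = sorted(alive, key=lambda c: (scores[c], str(c)))
        candidate = alive - set(ranked[:step])
        if evaluate(candidate) < threshold * full_score:
            break  # keep the last mask that satisfied the constraint
        alive = candidate
    return alive


# Toy usage: each component adds a fixed amount to a made-up "score".
weights = {"a": 1.0, "b": 20.0, "c": 30.0, "d": 49.0}
kept = iterative_prune(
    weights,
    importance=lambda alive: {c: weights[c] for c in alive},
    evaluate=lambda alive: sum(weights[c] for c in alive),
    full_score=sum(weights.values()),
)
```

Recomputing the scores inside the loop reflects the iterative nature of the procedure: importance is measured on the already-pruned model, not once on the full model.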

Fine-tuning
All experiments in this study are done on the "BERT-base lowercase" pre-trained model, available in the Transformers library (Wolf et al., 2020). It is fine-tuned on 9 GLUE tasks, using the evaluation metrics shown in Table 1. All evaluation is done on the dev sets. For each experiment we test 5 random seeds. Fine-tuning is performed with a modified GLUE script of the Transformers library (v2.5.0). All parameters were set to their default values.

Voita et al. (2019) showed that only a few Transformer heads in a machine translation task did the "heavy lifting", while the rest could be pruned. Michel et al. (2019) similarly showed that most of BERT's self-attention heads in the MNLI task could be pruned, and that the "good" heads were mostly shared between MNLI-matched and -mismatched. We extend this approach to 9 GLUE tasks, and we consider both BERT heads and MLPs.
We fine-tune BERT on each GLUE task with 5 random seeds, pruning elements of its architecture as described in section 3. We then compute how many times a given head survived the pruning process, for each task. Figure 1a and Figure 1b summarize the "good" subnetworks for individual tasks, showing the average number of GLUE tasks in which a given head survived, together with the standard deviation. We compare all pruning modes described in subsection 3.3: pruning only heads, only MLPs, and heads and MLPs together.
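One way to read this summary statistic is sketched below: for each random seed, count the tasks in which a component survived, then report the mean and standard deviation of that count across seeds. The data layout, the function name `tasks_survived`, and the use of the population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Set, Tuple


def tasks_survived(
    survivors: List[Dict[str, Set[Hashable]]], component: Hashable
) -> Tuple[float, float]:
    """survivors[seed][task] is the set of components left after pruning.
    Returns (mean, std) over seeds of the number of tasks in which
    `component` survived."""
    counts = [
        sum(component in kept for kept in per_task.values())
        for per_task in survivors
    ]
    return mean(counts), pstdev(counts)


# Toy usage: 2 seeds, 2 tasks; heads are (layer, head) pairs.
toy_survivors = [
    {"MNLI": {(0, 1)}, "SST": {(0, 1), (2, 3)}},
    {"MNLI": {(0, 1)}, "SST": {(2, 3)}},
]
avg, std = tasks_survived(toy_survivors, (0, 1))
```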
The subnetworks discovered in all pruning modes show a rather similar pattern of the useful heads and MLPs, but masking both heads and MLPs shows a larger number of elements that survived in more than half the tasks (49% heads vs 22%, 75% MLPs vs 50%). This hints at considerable interaction between BERT's self-attention heads and MLPs. With fewer MLPs available the model is forced to rely more on the heads, raising their importance. This interaction was not explored in the previous studies focusing on only the heads or layers separately.

Figure 2: The "good" subnetwork. The diagonal represents the BERT architecture components that survive pruning for a given task and remaining elements represent the common surviving components across GLUE tasks. Each cell gives the average number of heads (out of 144) or layers (out of 12), together with standard deviation across 5 random initializations.

Pruning heads and MLPs together makes the final layers more indispensable. Note also that in both conditions the middle layers survive pruning for the majority of GLUE tasks. This is consistent with the findings by Liu et al. (2019) that the middle Transformer layers are the most transferable. K et al. (2020) also report that the depth of the model mattered more than the number of heads.

How task-independent are the "good" subnetworks?
Figure 1 shows that relatively few components of BERT survive pruning in most GLUE tasks. In the more lenient heads+MLPs pruning mode, only 7% heads and 17% MLPs survive in 7 out of 9 tasks, and could be interpreted as evidence of task-independent linguistic information.
Conversely, the parts of the "good" subnetworks that are only relevant for some specific tasks, but consistently survive across fine-tuning runs for that task, may correspond to task-specific information in the pre-trained model, or possibly dataset-specific artifacts (Gururangan et al., 2018). Note that Figure 1 shows very few heads or MLPs that are universally "useless" (only 7 heads that survived in less than 2 tasks). 86% heads and 67% MLPs survive in 2-7 tasks with relatively high standard deviation. This means that the "good" subnetworks for different tasks have relatively little in common. The plots for all individual tasks are shown in Appendix A.
If most components of the "good" subnetwork are not universal across tasks, the degree to which the "good" subnetworks overlap across tasks may be a useful way to characterize the tasks themselves. This is illustrated in Figure 2, which shows pairwise comparisons between all GLUE tasks with respect to the number of shared surviving heads and MLPs in their "good" subnetworks (with standard deviation across 5 fine-tuning runs). The heads and MLPs were pruned together.
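The pairwise comparison can be sketched as a table of intersection sizes. The sketch assumes that subnetworks are compared run by run (same seed index for both tasks) and that the population standard deviation is reported; both choices, as well as the name `overlap_table`, are assumptions made for illustration.

```python
from itertools import combinations_with_replacement
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Set, Tuple


def overlap_table(
    survivors: List[Dict[str, Set[Hashable]]]
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """survivors[seed][task] is the "good" subnetwork for that run.
    Cell (a, b) holds the mean and std over seeds of the number of components
    shared by tasks a and b; cell (a, a) is the size of a's own subnetwork."""
    tasks = sorted(survivors[0])
    table = {}
    for a, b in combinations_with_replacement(tasks, 2):
        sizes = [len(run[a] & run[b]) for run in survivors]
        table[(a, b)] = (mean(sizes), pstdev(sizes))
    return table


# Toy usage: 2 seeds, 2 tasks; heads are (layer, head) pairs.
toy_runs = [
    {"MNLI": {(0, 1)}, "SST": {(0, 1), (2, 3)}},
    {"MNLI": {(0, 1)}, "SST": {(2, 3)}},
]
toy_table = overlap_table(toy_runs)
```

A high standard deviation on a diagonal cell signals that the subnetwork for that task changes a lot from one fine-tuning run to the next.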
The results of this experiment say as much about BERT as about the target tasks. In particular, there is significant variation in standard deviations across tasks (shown in the diagonal cells in Figure 2a): only about 5 heads for MNLI, and 64 for WNLI. This comparative instability explains why WNLI results are so inconsistent: the model cannot find a reliable signal in the pre-trained weights. Figure 2b shows that WNLI has zero overlaps with all tasks and itself because almost everything gets pruned (but the model performance actually goes up to the frequency baseline).
Interestingly, RTE also varies quite a bit in what heads and MLPs make it to the "good" subnetwork across runs, but that does not prevent BERT from reaching good results. That could mean that BERT provides several possible pathways for solving this task, all comparably good.
Based on the type of tasks, one could expect that SST would rely on a different signal than NLI tasks, and that is indeed the case: after WNLI, SST has the least in common with the other tasks. However, the tasks focusing on similarity and paraphrase (MRPC, QQP, STS-B) and inference (MNLI, QNLI, RTE) are on par with each other. The two tasks that have the most in common with the others are MNLI (perhaps due to its multi-domain nature) and CoLA (likely due to the variety of language phenomena it covers). Interestingly, these patterns are observed in both heads and MLPs, again pointing at the interaction between these components.

The "good" and "bad" subnetworks in BERT fine-tuning
Our final experiment examines the above evidence of "good" subnetworks in fine-tuned BERT from the perspective of the lottery ticket hypothesis, which predicts that the "lucky" subnetworks can be re-trained from scratch to match the performance of the full network. To test this hypothesis, we experiment with the following subnetworks:
• "good" subnetworks (pruned): the elements selected from the full model by importance scores, as described in subsection 3.3;
• "bad" subnetworks (sampled): the elements sampled from those that did not survive the pruning, plus a random sample of elements with high importance scores so as to match the size of the "good" subnetworks;
• "bad" subnetworks (pruned): simple inversion of the "good" subnetworks. They are 5-18% smaller in size than the sampled bad subnetworks, but they do not contain any elements with high importance scores.
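The construction of the two kinds of "bad" subnetworks can be sketched as below. The exact sampling procedure is not specified in the text, so the handling of the case where enough pruned elements are available, the fixed seed, and the function name `bad_subnetworks` are assumptions of this example.

```python
import random
from typing import Hashable, Iterable, Set, Tuple


def bad_subnetworks(
    all_components: Iterable[Hashable], good: Set[Hashable], seed: int = 0
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Return (inverted, sampled): `inverted` is everything outside the "good"
    subnetwork; `sampled` has the same size as `good`, drawn from the pruned
    elements and topped up with random "good" elements only if needed."""
    rng = random.Random(seed)
    inverted = set(all_components) - set(good)
    target = len(good)
    pool = sorted(inverted)  # sorted for reproducible sampling
    if len(pool) >= target:
        sampled = set(rng.sample(pool, target))
    else:
        extra = rng.sample(sorted(good), target - len(pool))
        sampled = inverted | set(extra)
    return inverted, sampled


# Toy usage: 10 components, 6 of which form the "good" subnetwork.
toy_good = {0, 1, 2, 3, 4, 5}
toy_inverted, toy_sampled = bad_subnetworks(range(10), toy_good)
```

In the toy case the inverted subnetwork has 4 elements, so the sampled one borrows 2 "good" elements to reach the size of the "good" subnetwork, which mirrors why the inverted subnetworks end up smaller than the sampled ones.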
For both pruned and sampled subnetworks we evaluate their performance on all tasks in two conditions: directly after pruning the full fine-tuned model, and after fine-tuning the same subnetwork with the same random seeds while the rest of the model stays masked. The results of this experiment are shown in Figure 3.
The main prediction of the lottery ticket hypothesis is validated: the "bad" subnetworks perform considerably worse than the "good" subnetworks if the rest of the model is pruned. This holds for both sampled and inverted "bad" subnetworks, although the former include some "good" elements. The only task in which that does not hold is WNLI, the results of which are unreliable for reasons discussed above.
However, we see that both "good" and "bad" networks can be retrained, with comparable performance for many tasks. The inverted "bad" networks perform worse than the sampled ones, but that could also be due to them being smaller in size. Performance of all inverted "bad" networks on CoLA is almost zero: since the "good" subnetwork comprises 92 of 144 heads and 10 out of 12 layers (Figure 2), very little remains when that mask is inverted.

Discussion
Does BERT have "bad" subnetworks?
The key result of this study is that, as far as fine-tuning is concerned, BERT does not seem to have "bad" subnetworks that cannot be re-trained to a relatively good performance level, suggesting that the weights that do not survive pruning are not just "inactive" (Zhang et al., 2019). However, it is important to remember that we consider elements of BERT architecture as atomic units, while the original lottery ticket work relied on magnitude pruning of individual weights. On that level BERT probably does have "bad" subnetworks: Yu et al. (2020) show that they can be found in MT Transformer models with global iterative pruning. We leave it to future research to find out to what extent the effective subnetworks overlap with the effective architectural blocks, and what that says about the architecture of BERT and other Transformers.
Our results suggest that most architecture blocks of BERT are potentially usable in fine-tuning, but this should not be interpreted as a proof that they all encode potentially relevant linguistic information. It is also possible that pre-training somehow simply made them more amenable to optimization, which is another question for future research.

[...] linguistic knowledge that should be used by BERT was actually used at inference time, but were unable to confirm it for core frame-semantic relations.
The explorations of the "good" subnetworks of BERT elements, such as described in this paper, offer a fascinating direction for future research on the kinds of verbal reasoning that the model actually performs for a given task. We could find the "good" subnetworks and then look at their functions, rather than probe the whole model and hope that the knowledge found by the probes is actually used at inference time. We could also use the knowledge about which elements overlap in utility for different tasks to learn a lot more about the nature of transfer learning, as well as about specific tasks and datasets. For instance, consider the fact that the "good" subnetwork of MRPC shares many more heads with MNLI than with QQP or RTE, although they are closer by the type of the task (Figure 2a).

Conclusion
Prior work showed that it was possible to prune most self-attention heads in BERT. We extend this approach to the fully-connected layers, and we show fine-tuned BERT has "good" and "bad" subnetworks, where the "good" heads and MLPs alone reach performance comparable with the full network, and the "bad" ones do not perform well. However, this pattern does not quite conform to the lottery ticket hypothesis, as both "good" and "bad" networks can be fine-tuned separately to reach comparable performance.
We also show that 86% heads and 57% MLPs in "good" subnetworks are not universally useful across GLUE tasks, and overlaps between "good" subnetworks do not necessarily correspond to task types. This raises questions about the degree to which fine-tuned BERT relies on task-specific or general linguistic knowledge, and opens up the possibilities of studying the "good" subnetworks to see what types of knowledge BERT actually relies on at inference time.

A "Good" subnetworks in BERT fine-tuned on GLUE tasks
Each figure in this section shows the "good" subnetwork of heads and layers that survived the pruning process described in section 3. Each task was run with 5 different random seeds. The top number in each cell indicates how likely a given head or MLP was to survive pruning, with 1.0 indicating that it survived on every run. The bottom number indicates the standard deviation across runs. The figures in this appendix show that each task has a varying number of heads and layers that survive pruning on all fine-tuning runs, while some heads and layers were only "picked up" by some random seeds. Note also that in addition to the architecture elements that survive across many runs, there are also those that are useful for over half of the tasks, as shown in Figure 1. Presumably they encode the most general linguistic information.
Note how visualizing the "good" subnetwork illustrates the core problem with WNLI, the most difficult task of GLUE. Figure 12 shows that each run is completely different, indicating that BERT fails to find any consistent pattern between the task and the information in the available pre-trained weights.