Analyzing Redundancy in Pretrained Transformer Models

Transformer-based deep NLP models are trained using hundreds of millions of parameters, limiting their applicability in computationally constrained environments. In this paper, we study the cause of these limitations by defining a notion of Redundancy, which we categorize into two classes: General Redundancy and Task-specific Redundancy. We dissect two popular pretrained models, BERT and XLNet, studying how much redundancy they exhibit at a representation-level and at a more fine-grained neuron-level. Our analysis reveals interesting insights, such as: i) 85% of the neurons across the network are redundant and ii) at least 92% of them can be removed when optimizing towards a downstream task. Based on our analysis, we present an efficient feature-based transfer learning procedure, which maintains 97% performance while using at most 10% of the original neurons. The code for the experiments in this paper is available at https://github.com/fdalvi/analyzing-redundancy-in-pretrained-transformer-models


Introduction
Large pretrained models have improved the state-of-the-art in a variety of NLP tasks, with each new model introducing deeper and wider architectures, causing a significant increase in the number of parameters. For example, BERT-large (Devlin et al., 2019), NVIDIA's Megatron model, and Google's T5 model (Raffel et al., 2019) were trained using 340 million, 8.3 billion, and 11 billion parameters respectively.
An emerging body of work shows that these models are over-parameterized and do not require all the representational power lent by the rich architectural choices during inference. For example, these models can be distilled (Sun et al., 2019) or pruned (Voita et al., 2019; Sajjad et al., 2020), with a minor drop in performance. Recent research (Mu et al., 2018; Ethayarajh, 2019) analyzed contextualized embeddings in pretrained models and showed that the representations learned within these models are highly anisotropic. While these approaches successfully exploited over-parameterization and redundancy in pretrained models, the choice of what to prune is empirically motivated, and the work does not directly explore the redundancy in the network. Identifying and analyzing redundant parts of the network is useful in: i) developing a better understanding of these models, ii) guiding research on compact and efficient models, and iii) leading towards better architectural choices.
In this paper, we analyze redundancy in pretrained models. We classify it into general redundancy and task-specific redundancy. The former is defined as the redundant information present in a pretrained model irrespective of any downstream task. This redundancy is an artifact of overparameterization and other training choices that force various parts of the models to learn similar information. The latter is motivated by pretrained models being universal feature extractors. We hypothesize that several parts of the network are specifically redundant for a given downstream task.
We study both general and task-specific redundancies at the representation-level and at a more fine-grained neuron-level. Such an analysis allows us to answer the following questions: i) how redundant are the layers within a model? ii) do all the layers add significantly diverse information? iii) do the dimensions within a hidden layer represent different facets of knowledge, or are some neurons largely redundant? iv) how much information in a pretrained model is necessary for specific downstream tasks? and v) can we exploit redundancy to enable efficiency?
We introduce several methods to analyze redundancy in the network. Specifically, for general redundancy, we use Center Kernel Alignment (Kornblith et al., 2019) for layer-level analysis and Correlation Clustering for neuron-level analysis. For task-specific redundancy, we use Linear Probing (Shi et al., 2016a; Belinkov et al., 2017) to identify redundant layers, and Linguistic Correlation Analysis (Dalvi et al., 2019) to examine neuron-level redundancy.
We conduct our study on two pretrained language models, BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). While these networks are similar in the number of parameters, they are trained using different training objectives, which makes for an interesting comparative analysis between them. For task-specific analysis, we present our results across a wide suite of downstream tasks: four core NLP sequence labeling tasks and seven sequence classification tasks from the GLUE benchmark (Wang et al., 2018). Our analysis yields the following insights:

General Redundancy:
• Adjacent layers are most redundant in the network, with lower layers having greater redundancy with adjacent layers.
• Up to 85% of the neurons across the network are redundant in general, and can be pruned to substantially reduce the number of parameters.
• Up to 94% of neuron-level redundancy is exhibited within the same or neighbouring layers.
Task-specific Redundancy:
• Layers in a network are more redundant w.r.t. core language tasks such as learning morphology, as compared to sequence-level tasks.
• At least 92% of the neurons are redundant with respect to a downstream task and can be pruned without any loss in task-specific performance.
• Comparing models, XLNet is more redundant than BERT.
• Our analysis guides research in model distillation and suggests preserving knowledge of lower layers and aggressive pruning of higher layers.
Finally, motivated by our analysis, we present an efficient feature-based transfer learning procedure that exploits various types of redundancy present in the network. We first target layer-level task-specific redundancy using linear probes and reduce the number of layers required in a forward pass to extract the contextualized embeddings. We then filter out general redundant neurons present in the contextualized embeddings using Correlation Clustering. Lastly, we remove task-specific redundant neurons using Linguistic Correlation Analysis. We show that one can reduce the feature set to less than 100 neurons for several tasks while maintaining more than 97% of the performance. Our procedure achieves a speedup of up to 6.2x in computation time for sequence labeling tasks.

Related Work
A number of studies have analyzed representations at layer-level (Conneau et al., 2018; Liu et al., 2019; Tenney et al., 2019; Kim et al., 2020; Belinkov et al., 2020) and at neuron-level (Bau et al., 2019; Dalvi et al., 2019; Suau et al., 2020; Durrani et al., 2020). These studies aim at analyzing either the linguistic knowledge learned in representations and in neurons, or the general importance of neurons in the model. The former is commonly done using a probing classifier (Shi et al., 2016a; Belinkov et al., 2017; Hupkes et al., 2018). Recently, Voita and Titov (2020) and Pimentel et al. (2020) proposed probing methods based on information-theoretic measures. The general importance of neurons is mainly captured using similarity- and correlation-based methods (Raghu et al., 2017; Chrupała and Alishahi, 2019; Wu et al., 2020). Similar to the work on analyzing deep NLP models, we analyze pretrained models at representation-level and at neuron-level. Different from them, we analyze various forms of redundancy in these models. We draw upon various techniques from the literature and adapt them to perform a redundancy analysis.
While the work on pretrained model compression (Cao et al., 2020; Shen et al., 2020; Turc et al., 2019; Gordon et al., 2020; Guyon and Elisseeff, 2003) indirectly shows that models exhibit redundancy, little has been done to explore the redundancy in the network directly. Recent studies (Voita et al., 2019; Michel et al., 2019; Sajjad et al., 2020; Fan et al., 2020) dropped attention heads and layers in the network with marginal degradation in performance. Their work is limited in the context of redundancy, as none of the pruning choices are built upon the amount of redundancy present in different parts of the network. Our work identifies redundancy at various levels of the network and can guide research in model compression.

Datasets and Tasks
To analyze the general redundancy in pre-trained models, we use the Penn Treebank development set (Marcus et al., 1993), which consists of roughly 44,000 tokens. For task-specific analysis, we use two broad categories of downstream tasks: sequence labeling and sequence classification. For the sequence labeling tasks, we study core linguistic tasks: i) part-of-speech (POS) tagging using the Penn Treebank, ii) CCG supertagging using CCGBank (Hockenmaier, 2006), iii) semantic tagging (SEM) using Parallel Meaning Bank data (Abzianidze and Bos, 2017), and iv) syntactic chunking using the CoNLL 2000 shared task dataset (Sang and Buchholz, 2000). For the sequence classification tasks, we study tasks from the GLUE benchmark (Wang et al., 2018), namely i) sentiment analysis (SST-2) using the Stanford sentiment treebank (Socher et al., 2013), ii) semantic equivalence classification using the Microsoft Research paraphrase corpus (MRPC) (Dolan and Brockett, 2005), iii) natural language inference (MNLI) (Williams et al., 2018), iv) question-answering NLI (QNLI) using the SQuAD dataset (Rajpurkar et al., 2016), v) question pair similarity using the Quora Question Pairs (QQP) dataset (http://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), vi) textual entailment using the Recognizing Textual Entailment (RTE) dataset (Bentivogli et al., 2009), and vii) semantic textual similarity using the STS-B dataset (Cer et al., 2017). The statistics for the datasets are provided in Table 5 in the Appendix.
Other Settings The neuron activations for each word in our dataset are extracted from the pretrained model for sequence labeling, while the [CLS] token's representation (from a fine-tuned model) is used for sequence classification. The fine-tuning step is essential to optimize the [CLS] token for sentence representation. In the case of sub-words, we pick the last sub-word's representation (Durrani et al., 2019; Liu et al., 2019). For sequence labeling tasks, we use training sets of 150K tokens, and standard development and test splits. For sequence classification tasks, we set aside 5% of the training data to optimize all the parameters involved in the process, and report results on the development sets, since the test sets are not publicly available.
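To make the extraction step concrete, the following is a minimal sketch using the Hugging Face transformers library. The model name, function name, and shapes are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

def word_activations(words):
    """Return one vector per word using the last sub-word convention,
    concatenated over all 13 layers (embeddings + 12 encoder layers)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states    # tuple of 13 x (1, seq_len, 768)
    acts = torch.cat(hidden, dim=-1)[0]        # (seq_len, 13 * 768)
    last_pos = {}                              # word id -> last sub-word position
    for pos, wid in enumerate(enc.word_ids()):
        if wid is not None:
            last_pos[wid] = pos
    return torch.stack([acts[last_pos[i]] for i in sorted(last_pos)])

# word_activations(["The", "cats", "sat"]) -> tensor of shape (3, 9984)
```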

Models
We present our analysis on two transformer-based pretrained models, BERT-base (Devlin et al., 2019) and XLNet-base (Yang et al., 2019). The former is a masked language model, while the latter is auto-regressive in nature. We use the transformers library to fine-tune these models using default hyperparameters.

Classifier Settings
For layer-level probing and neuron-level ranking, we use a logistic regression classifier with ElasticNet regularization. We train the classifier for 10 epochs with a learning rate of 1e-3, a batch size of 128, and a value of 1e-5 for both the L1 and L2 lambda regularization parameters.
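As a minimal sketch of this setup, the probe below is a single linear layer trained with explicit L1 and L2 penalties (ElasticNet); the function name and data handling are illustrative assumptions, and activations/labels are assumed to be PyTorch tensors.

```python
import torch
import torch.nn as nn

def train_probe(X, y, num_classes, epochs=10, lr=1e-3, batch_size=128,
                l1_lambda=1e-5, l2_lambda=1e-5):
    """X: (num_tokens, num_neurons) float tensor; y: (num_tokens,) long tensor."""
    probe = nn.Linear(X.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for i in range(0, len(X), batch_size):
            logits = probe(X[i:i + batch_size])
            # ElasticNet regularization: L1 + L2 penalties on the weights.
            reg = l1_lambda * probe.weight.abs().sum() \
                + l2_lambda * (probe.weight ** 2).sum()
            loss = loss_fn(logits, y[i:i + batch_size]) + reg
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return probe
```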

Problem Definition
Given a pretrained model M, we analyze redundancy at layer-level: how redundant is a layer? and at neuron-level: how redundant are the neurons? We target these two questions in the context of general redundancy and task-specific redundancy.

Notion of redundancy: We broadly define redundancy to cover a range of observations. For example, we take high similarity as a reflection of redundancy. Similarly, for task-specific neuron-level redundancy, we hypothesize that some neurons may additionally be irrelevant for the downstream task at hand; there, we consider irrelevancy as part of the redundancy analysis. Succinctly, two neurons are considered redundant if they serve the same purpose from the perspective of feature-based transfer learning for a downstream task.

General Redundancy
Neural networks are designed to be distributed in nature and are therefore innately redundant. Additionally, over-parameterization in pretrained models, combined with various training and design choices, causes further redundancy of information. In the following, we analyze general redundancy at layer-level and at neuron-level.

Layer-level Redundancy
We compute layer-level redundancy by comparing representations from different layers in a given model using linear Center Kernel Alignment (cka; Kornblith et al., 2019). cka is invariant to isotropic scaling and orthogonal transformation. In other words, the similarity measure does not depend on the various representations having neurons or dimensions with exactly the same distributions, but rather assigns a high similarity if the two representations behave similarly over all the neurons. Moreover, cka is known to outperform other methods, such as CCA and SVCCA (Raghu et al., 2017), in identifying relationships between different layers across different architectures. While several other methods have been proposed in the literature to analyze and compare representations (Kriegeskorte et al., 2008; Bouchacourt and Baroni, 2018; Chrupała and Alishahi, 2019; Chrupała, 2019), we do not intend to compare them here, and instead use cka to show redundancy in the network. The mathematical definition of cka is provided in Appendix A.6.

We compute pairwise similarity between all L layers in the pretrained model and show the corresponding heatmaps in Figure 1. We hypothesize that high similarity entails (general) redundancy. Overall, the similarity between adjacent layers is high, indicating that the change of encoded knowledge from one layer to another takes place in small incremental steps as we move from a lower layer to a higher layer. An exception to this observation is the final pair of layers, l_11 and l_12, whose similarity is much lower than that of other adjacent pairs of layers. We speculate that this is because the final layer is highly optimized for the objective at hand, while the lower layers try to encode as much general linguistic knowledge as possible. This has also been alluded to by others (Hao et al., 2019; Wu et al., 2020).
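A small sketch of this computation is shown below, assuming per-layer activation matrices of shape (T, N) are already extracted; linear cka follows the definition given in Appendix A.6.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations X, Y of shape (T, neurons)."""
    X = X - X.mean(axis=0, keepdims=True)   # column-center each representation
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord='fro') ** 2
    den = np.linalg.norm(X.T @ X, ord='fro') * np.linalg.norm(Y.T @ Y, ord='fro')
    return num / den

def pairwise_layer_similarity(layers):
    """layers: list of (T, N) matrices; returns the L x L similarity matrix
    visualized as a heatmap in Figure 1."""
    L = len(layers)
    sim = np.zeros((L, L))
    for i in range(L):
        for j in range(i, L):
            sim[i, j] = sim[j, i] = linear_cka(layers[i], layers[j])
    return sim
```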

Neuron-level Redundancy
Assessing redundancy at the layer level may be too coarse-grained: even if a layer is not redundant with other layers, a subset of its neurons may still be redundant. We analyze neuron-level redundancy in a network using correlation clustering (CC; Bansal et al., 2004), grouping neurons with highly correlated activation patterns over all words. Specifically, every neuron in layer i can be represented as a T-dimensional vector whose j-th entry is the neuron's activation z_j^i on word w_j, for j = 1, ..., T. We calculate the Pearson product-moment correlation of every such neuron vector with every other. This results in an N x N matrix corr, where N is the total number of neurons and corr(x, y) represents the correlation between neurons x and y. The correlation value ranges from -1 to 1, giving us a relative scale to compare any two neurons. A high absolute correlation between two neurons implies that they encode very similar information and are therefore redundant. We convert corr into a distance matrix cdist by applying cdist(x, y) = 1 - |corr(x, y)|, and cluster cdist using agglomerative hierarchical clustering with average linkage, which minimizes the average distance of all data points in pairs of clusters. The maximum distance between any two points in a cluster is controlled by the hyperparameter c_t, which ranges from 0 to 1; a high value results in larger clusters and a smaller total number of clusters.
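A sketch of this clustering procedure using SciPy is shown below; acts is assumed to be the (N, T) neuron-by-token activation matrix, and names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_neurons(acts, ct=0.3):
    """acts: (N, T) matrix of neuron activations; returns a cluster id per neuron."""
    corr = np.corrcoef(acts)                  # (N, N) Pearson correlations
    cdist = 1.0 - np.abs(corr)                # high |corr| -> small distance
    np.fill_diagonal(cdist, 0.0)              # guard against floating-point noise
    condensed = squareform(cdist, checks=False)
    Z = linkage(condensed, method='average')  # agglomerative, average linkage
    # Form flat clusters by cutting the dendrogram at height ct.
    return fcluster(Z, t=ct, criterion='distance')
```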

A substantial number of neurons are redundant
In order to evaluate the effect of clustering in combining redundant neurons, we randomly pick a neuron from each cluster and form a reduced set of non-redundant neurons. Recall that the clustering is applied independently on the data, without using any task-specific labels. We then build task-specific classifiers for each task on the reduced set and analyze the average accuracy. If the average accuracy of a reduced set is close to that of the full set of neurons, we conclude that the reduced set has filtered out redundant neurons. Figure 2 shows the effect of clustering on BERT and XLNet for different values of c_t with respect to average performance across all tasks. It is remarkable that 85% of neurons can be removed without any loss in accuracy (c_t = 0.7) in BERT, alluding to a high level of neuron-level redundancy. We observe an even higher reduction in XLNet: at c_t = 0.7, 92% of XLNet neurons can be removed while maintaining oracle performance. We additionally visualize a few neurons within a cluster. The activation patterns are quite similar in their behavior, though not identical, highlighting the efficacy of CC in clustering neurons with analogous behavior. An activation heatmap for several neurons is provided in Appendix A.2.
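Forming the reduced set can then be sketched as picking one random representative per cluster (cluster ids as produced by the clustering sketch above):

```python
import numpy as np

def reduced_neuron_set(cluster_ids, seed=0):
    """Pick one random neuron per cluster; returns sorted neuron indices."""
    rng = np.random.default_rng(seed)
    return np.sort(np.array([rng.choice(np.where(cluster_ids == c)[0])
                             for c in np.unique(cluster_ids)]))
```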
[Figure 3: Percentage of clusters that contain neurons from the same layer, adjacent layers, within three neighboring layers, and more than three layers apart.]

Higher neuron redundancy within and among neighboring layers We analyze the general makeup of the clusters at c_t = 0.3. Figure 3 shows the percentage of clusters that contain neurons from the same layer (window size 1), from neighboring layers (window sizes 2 and 3), and from layers further apart. A vast majority of clusters (≈ 95%) contain neurons from either the same layer or adjacent layers. This reflects that the main source of redundancy lies among the individual representation units within the same or neighboring layers of the network. This finding motivates pruning models by compressing layers, as opposed to reducing the overall depth in a distilled version of a model.
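The cluster-composition analysis can be sketched as below, under the assumption that neurons are indexed layer by layer, so that integer division by the layer width (768 for the base models) recovers a neuron's layer:

```python
import numpy as np

def cluster_window_sizes(cluster_ids, layer_size=768):
    """For each cluster, the number of layers its neurons span (window size)."""
    spans = []
    for c in np.unique(cluster_ids):
        layers = np.where(cluster_ids == c)[0] // layer_size
        spans.append(int(layers.max() - layers.min() + 1))
    # counts[k] = number of clusters with window size k+1
    return np.bincount(spans)[1:]
```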

Task-specific Redundancy
While pretrained models have a high amount of general redundancy as shown in the previous section, they may additionally exhibit redundancies specific to a downstream task. Studying redundancy in relation to a specific task helps us understand pretrained models better. It further reflects on how much of the network, and which parts of the network, suffice to perform a task efficiently.

Layer-level Redundancy
To analyze layer-level task-specific redundancy, we train linear probing classifiers (Shi et al., 2016b; Belinkov et al., 2017) on each layer l_i (layer-classifier). We consider a classifier's performance as a proxy for the amount of task-specific knowledge learned by a layer. Linear classifiers are a popular choice for analyzing deep NLP models due to their better interpretability (Qian et al., 2016; Belinkov et al., 2020). Hewitt and Liang (2019) have shown linear probes to have higher selectivity, a property deemed desirable for more interpretable probes.
We compare each layer-classifier with an oracle-classifier trained over the concatenation of all layers of the network. For all individual layers that perform close to the oracle (maintaining 99% of its performance in our results), we infer that they encode sufficient knowledge about the task and are therefore redundant in this context. Note that this does not necessarily imply that those layers are identical or that they represent the knowledge in a similar way; instead, they have redundant overall knowledge specific to the task at hand.
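This criterion can be sketched as follows, with scikit-learn's logistic regression standing in for the ElasticNet probe described earlier; function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def redundant_layers(train_layers, y_train, test_layers, y_test, threshold=0.99):
    """train_layers/test_layers: per-layer lists of (T, N) activation matrices."""
    def score(X_tr, X_te):
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
        return clf.score(X_te, y_test)
    # Oracle: a probe over the concatenation of all layers.
    oracle = score(np.hstack(train_layers), np.hstack(test_layers))
    return [i for i in range(len(train_layers))
            if score(train_layers[i], test_layers[i]) >= threshold * oracle]
```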
High redundancy for core linguistic tasks Figure 4 shows the redundant layers that perform within a 1% performance threshold with respect to the oracle on each task. We found high layer-level redundancy for sequence labeling tasks. There are up to 11 redundant layers in BERT and up to 10 redundant layers in XLNet, across different tasks. This is expected, because the sequence labeling tasks considered here are core language tasks, and the information related to them is spread across the network. Comparing models, we found such core language information to be distributed amongst fewer layers in XLNet.
Substantially less redundancy for higher-level tasks The amount of redundancy is substantially lower for sequence classification tasks, with RTE having the fewest redundant layers in both models. In BERT especially, we did not find any layer that matched the oracle performance for RTE. It is interesting to observe that all the sequence classification tasks are learned at higher layers, and none of the lower layers were found to be redundant. These results are intuitive given that sequence classification tasks require complex linguistic knowledge, such as long-range contextual dependencies, which is only learned at the higher layers of the model. Lower layers do not have sufficient sentence-level context to perform these tasks well.
XLNet is more redundant than BERT While XLNet has slightly fewer redundant layers for sequence labeling tasks, on average across all downstream tasks it shows higher layer-level task-specific redundancy. Having high redundancy for sequence-level tasks reflects that XLNet learns the higher-level concepts much earlier in the network, and this information is then passed to all subsequent layers. This also makes XLNet a much better candidate for model compression, where several higher layers can be pruned with marginal loss in performance, as shown by Sajjad et al. (2020).

Neuron-level Redundancy
Pretrained models, being universal feature extractors, contain redundant information with respect to a downstream task. We hypothesize that they may also contain information that is not necessary for the underlying task. In task-specific neuron analysis, we consider both redundant and irrelevant neurons as redundancy with respect to a task. Unlike with layers, it is combinatorially intractable to exhaustively try all possible neuron permutations that can carry out a downstream task. We therefore aim at extracting only one minimal set of neurons that serves the purpose, and consider the remaining neurons redundant or irrelevant for the task at hand.

Formally, given a task and a set of neurons from a model, we perform feature selection to identify a minimal set of neurons that matches the oracle performance. To accomplish this, we use the Linguistic Correlation Analysis method (Dalvi et al., 2019) to rank neurons with respect to a downstream task, referred to as FS (feature selector) henceforth. For each downstream task, we concatenate representations from all L layers and use FS to extract a minimal set of top-ranked neurons that maintains the oracle performance within a defined threshold. The oracle is the task-specific classification performance obtained using all the neurons for training. The minimal set allows us to answer how many neurons are redundant or irrelevant to the given task. Tables 1 and 2 show the minimal set of top neurons for each task that maintains at least 97% of the oracle performance.
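The minimal-set search can be sketched as below; ranking neurons by the absolute weights of a regularized probe is a simplified stand-in for the Linguistic Correlation Analysis ranking, and the candidate set sizes are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def minimal_neuron_set(X_tr, y_tr, X_te, y_te, keep=0.97):
    """Grow the top-ranked neuron set until it reaches `keep` * oracle accuracy."""
    probe = LogisticRegression(penalty='elasticnet', solver='saga',
                               l1_ratio=0.5, max_iter=1000).fit(X_tr, y_tr)
    oracle = probe.score(X_te, y_te)
    # Rank neurons by the sum of absolute weights across classes.
    ranking = np.argsort(-np.abs(probe.coef_).sum(axis=0))
    for k in (10, 30, 100, 300, 1000, X_tr.shape[1]):
        top = ranking[:k]
        acc = LogisticRegression(max_iter=1000) \
            .fit(X_tr[:, top], y_tr).score(X_te[:, top], y_te)
        if acc >= keep * oracle:
            return top, acc
    return ranking, oracle
```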
Complex core language tasks require more neurons CCG and Chunking are relatively complex tasks compared to POS and SEM. On average across both models, these complex tasks require more neurons than POS and SEM. It is interesting to see that the size of the minimal neuron set correlates with the complexity of the task.
Less task-specific redundancy for core linguistic tasks compared to higher-level tasks While the minimal set of neurons per task consists of a small percentage of the total neurons in the network, the core linguistic tasks require substantially more neurons than the higher-level tasks (comparing Tables 1 and 2). It is remarkable that some sequence-level tasks require as few as 10 neurons to obtain the desired performance. One reason for the large difference in the size of the minimal neuron set could be the nature of the tasks: since core linguistic tasks are word-level tasks, much higher capacity is required in the pretrained model to store the knowledge for all of the words, whereas for sequence classification tasks, the network learns to filter and mold the features to form fewer "high-level" sentence features.

Efficient Transfer Learning
In this section, we build upon the redundancy analysis presented in the previous sections and propose a novel method for efficient feature-based transfer learning. In a typical feature-based transfer learning setup, contextualized embeddings are first extracted from a pretrained model, and then a classifier is trained on the embeddings towards the downstream NLP task. The bulk of the computational expense is incurred from the following sources:
• A full forward pass over the pretrained model to extract the contextualized vector, a costly affair given the large number of parameters.
• Classifiers with large contextualized vectors are a) cumbersome to train, b) inefficient during inference, and c) potentially sub-optimal when supervised data is insufficient (Hameed, 2018).
We propose a three-step process to target these two sources of computation bottlenecks (a sketch of the full pipeline follows this list): 1. Use the task-specific layer-classifier (Section 6.1) to select the lowest layer that maintains oracle performance. Differently from the analysis, a concatenation of all layers up to the selected layer is used instead of just the individual layer.
2. Given the contextualized embeddings extracted in the previous step, use CC (Section 5.2) to filter out redundant neurons.
3. Apply FS (Section 6.2) to select a minimal set of neurons that are needed to achieve optimum performance on the task.
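A high-level sketch of the full procedure, assuming the helper functions sketched in the analysis sections above (all names are illustrative):

```python
import numpy as np

def efficient_transfer(train_layers, y_tr, test_layers, y_te):
    # Step 1 (LS): lowest layer whose probe stays within 1% of the oracle,
    # then concatenate all layers up to and including it.
    ok = redundant_layers(train_layers, y_tr, test_layers, y_te, threshold=0.99)
    lowest = min(ok) if ok else len(train_layers) - 1
    X_tr = np.hstack(train_layers[:lowest + 1])
    X_te = np.hstack(test_layers[:lowest + 1])
    # Step 2 (CC): drop generally redundant neurons via correlation clustering.
    kept = reduced_neuron_set(cluster_neurons(X_tr.T, ct=0.3))
    X_tr, X_te = X_tr[:, kept], X_te[:, kept]
    # Step 3 (FS): select the minimal task-specific neuron set.
    top, acc = minimal_neuron_set(X_tr, y_tr, X_te, y_te, keep=0.97)
    return kept[top], acc
```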
The three steps explicitly target task-specific layer redundancy, general neuron redundancy, and task-specific neuron redundancy, respectively. We refer to Step 1 as LayerSelector (LS), and to Steps 2 and 3 together as CCFS (Correlation Clustering + Feature Selection). For all experiments, we use a performance threshold of 1% each for LS and CCFS. It is worth mentioning that the trade-off between loss in accuracy and efficiency can be controlled through these thresholds, which can be adjusted to serve faster turn-around or better performance.

Results
Table 3 presents the average results on all sequence labeling and sequence classification tasks. Detailed per-task results are provided in Appendix A.5.1. As expected from our analysis, a significant portion of the network can be pruned by LS for sequence labeling tasks, using less than 6 layers out of 13 (embedding + 12 layers) for BERT and less than 3 layers for XLNet. Specifically, this reduces the parameters required for a forward pass for BERT by 65% for POS and SEM, 33% for CCG, and 39% for Chunking. On XLNet, LS leads to an even larger reduction in parameters: 70% for POS and SEM, and 65% for CCG and Chunking. The results were less pronounced for sequence classification tasks, with LS using 11.6 layers for BERT and 8.1 layers for XLNet on average, out of 13 layers.

Applying CCFS on top of the reduced layers leads to another round of significant efficiency improvements: the number of neurons needed for the final classifier reduces to just 5% for sequence labeling tasks and 1.5% for sequence classification tasks. The final number of neurons is surprisingly low for some tasks compared to the initial 9984, with some tasks like QNLI using just 10 neurons.
More concretely, taking the POS task as an example: the pre-trained oracle BERT model has 9984 features (13 layers x 768 dimensions) and 110M parameters. LS reduces the feature set to 2304 (embedding + 2 layers) and the number of parameters used in the forward pass to 37M. CCFS further reduces the feature set to 300, maintaining a performance close to oracle BERT's performance on this task (95.2% vs. 93.9%).
An interesting observation in Table 3 is that the sequence labeling tasks require fewer layers but a higher number of features, while sequence classification tasks follow the opposite pattern. As we go deeper in the network, the neurons become much richer and more tuned for the task at hand, and only a few of them are required compared to the much more word-focused neurons in the lower layers. These observations suggest pyramid-shaped architectures that have wider lower layers and narrow higher layers. Such a design choice leads to significant savings of capacity in the higher layers, where a few rich neurons are sufficient for good performance. In terms of neuron-based compression methods, these findings suggest aggressive pruning of higher layers while preserving the lower layers when building smaller and accurate compressed models.

Efficiency Analysis
While the algorithm boosts theoretical efficiency in terms of the number of parameters reduced and the final number of features, it is important to analyze how this translates to real-world performance. Using LS leads to an average speedup of 2.8x with BERT and 6.2x with XLNet on sequence labeling tasks. On sequence classification tasks, the average speedups are 1.1x and 1.6x with BERT and XLNet respectively. Detailed results are provided in Appendix A.5.2.
For the classifier built on the reduced set, we simulate a test scenario with 100,000 tokens and compute the total runtime for 10 iterations of training. The numbers were computed on a 6-core 2.8 GHz AMD Opteron Processor 4184, and were averaged across 3 runs. Figure 5 shows the runtime of each run (in seconds) against the number of features selected. The runtime of the classifier reduced from 50 to 10 seconds in the case of BERT. This 5x speedup can be very useful in heavy-use scenarios where the classifier is queried a large number of times in a short duration.
Training time efficiency: Although the focus of the current application is to improve inference-time efficiency, it is nevertheless important to understand how much computation complexity is added during training time. Let T be the total number of tokens in our training set, and N be the total number of neurons across all layers in a pre-trained model. The application presented in this section consists of 5 steps.
1. Feature extraction from the pre-trained model: extraction time scales linearly with the number of tokens T.
5. Minimal feature set: finding the minimal set of neurons is a brute-force search process, starting with a small number of neurons. For each set of neurons, a classifier is trained, the time for which scales linearly with the total number of tokens T. As the feature set size increases, the training time also goes up, as described in Figure 5.
Appendix A.5.3 provides additional experiments and results used to analyze the training time complexity of our application.

Conclusion and Future Directions
We defined a notion of redundancy and analyzed pre-trained models for general redundancy and task-specific redundancy, exhibited at layer-level and at individual neuron-level. Our analysis of general redundancy showed that i) adjacent layers are most redundant in the network, with the exception of the final layers, which are close to the objective function, and ii) up to 85% and 92% of neurons are redundant in BERT and XLNet respectively. We further showed that networks exhibit varying amounts of task-specific redundancy, with higher layer-level redundancy for core language tasks compared to sequence-level tasks. We found that at least 92% of the neurons are redundant with respect to a downstream task. Based on our analysis, we proposed an efficient transfer learning procedure that directly targets layer-level and neuron-level redundancy to achieve efficiency in feature-based transfer learning.
While our analysis is helpful in understanding pretrained models, it also suggests interesting research directions towards building compact models and models with better architectural choices. For example, the high amount of neuron-level redundancy within the same layer suggests that layer-size compression might be more effective in reducing the pretrained model size while preserving oracle performance. Similarly, our finding that core linguistic tasks are learned at lower layers and require a higher number of neurons, while sequence-level tasks are learned at higher layers and require fewer neurons, suggests pyramid-style architectures that have wide lower layers and compact higher layers, and may result in smaller models with performance competitive with large models.

A.1 Data Statistics

Task  | Train  | Dev
------|--------|------
SST-2 | 67349  | 872
MRPC  | 3668   | 408
MNLI  | 392702 | 9815
QNLI  | 104743 | 5463
QQP   | 363846 | 40430
RTE   | 2490   | 277
STS-B | 5749   | 1500

Table 5: Data statistics (number of sequences) on the official training and development sets used in the experiments. All tasks are binary classification tasks, except for STS-B, which is a regression task. Recall that the test sets are not publicly available, and hence we use 10% of the official train set as development, and the official development set as our test set. Exact split information is provided in the code README.

A.2 General Neuron-level Redundancy

Table 6 presents the detailed results for the illustrations in Figures 2a and 2b. As a concrete example, 6 out of 12 tasks (POS, SEM, CCG, Chunking, SST-2, STS-B) admit more than an 85% reduction in the number of neurons (threshold = 0.7) with very little loss in performance. Figure 6 visualizes heatmaps of a few neurons that belong to the same cluster built using CC at c_t = 0.3, as a qualitative example of a cluster.

A.3 Task-Specific Layer-wise redundancy
Tables 7a and 7b provide detailed results used to produce the illustrations in Figure 4. Figures 7, 8 and 9 show the layer-wise task-specific redundancy for individual classes within POS, SEM and Chunking respectively. We do not present these fine-grained plots for CCG (over 1000 classes) or for sequence classification tasks (binary classification only).

A.4 Task-Specific Neuron-level Redundancy
Tables 8a and 8b provide the per-task detailed results along with reduced accuracies after running task-specific neuron-level redundancy analysis.

A.5.1 Transfer Learning Detailed Results
Tables 9 and 10 show the detailed per-task results for our proposed feature selection algorithm.

A.5.2 Pretrained model timing analysis
The average runtime per instance was computed by dividing the total number of seconds taken to run the forward pass for all batches by the total number of sentences. All computation was done on an NVidia GeForce GTX TITAN X, and the numbers are averaged across 3 runs. Figures 10 and 11 show the results for various numbers of layers (with the selected layer highlighted for each task).

[Figure: Timing analysis of feature selection for the transfer learning application. Extraction of features and correlation clustering both scale linearly as the number of input tokens increases, while ranking the various features scales linearly with the number of total features.]

A.6 Center Kernel Alignment
For layer-level redundancy, we compare representations from various layers using linear Center Kernel Alignment (cka; Kornblith et al., 2019). Here, we briefly present the mathematical definition behind cka. Let Z denote a column-centering transformation. As denoted in the paper, z_j^i represents the contextualized embedding for word w_j at layer l_i. Let z^i represent the contextual embeddings over all T words, i.e., a matrix of size T x N (where N is the total number of neurons). Given two layers x and y, let X = Z z^x and Y = Z z^y. The cka similarity is then

    CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F),

where ||.||_F is the Frobenius norm.