Analyzing Individual Neurons in Pre-trained Language Models

While considerable analysis has been carried out to demonstrate the linguistic knowledge captured by the representations learned within deep NLP models, very little attention has been paid to individual neurons. We carry out a neuron-level analysis of pre-trained language models, using the core linguistic tasks of predicting morphology, syntax and semantics, with questions like: i) do individual neurons in pre-trained models capture linguistic information? ii) which parts of the network learn more about certain linguistic phenomena? iii) how distributed or focused is the information? and iv) how do various architectures differ in learning these properties? We found small subsets of neurons sufficient to predict linguistic tasks, with lower-level tasks (such as morphology) localized in fewer neurons than the higher-level task of predicting syntax. Our study also reveals interesting cross-architectural comparisons. For example, we found neurons in XLNet to be more localized and disjoint when predicting properties, compared to BERT and others, where they are more distributed and coupled.


Introduction
Transformer-based neural language models have constantly pushed the state-of-the-art in downstream NLP tasks such as Question Answering, Textual Entailment, etc. (Rajpurkar et al., 2016;Wang et al., 2018). Central to this revolution is the contextualized embedding, where each word is assigned a vector based on the entire input sequence, allowing it to capture not only a static semantic meaning but also a contextualized meaning.
Previous work on analyzing neural networks showed that while learning rich NLP tasks such as machine translation and language modeling, these deep models capture fundamental linguistic phenomena such as word morphology, syntax and various other relevant properties of interest (Shi et al., 2016; Adi et al., 2016; Belinkov et al., 2017a,b; Dalvi et al., 2017; Blevins et al., 2018). More recently, Liu et al. (2019) and Tenney et al. (2019) used probing classifiers to analyze pre-trained neural language models on a variety of sequence labeling tasks and demonstrated that contextualized representations encode useful, transferable features of language. While most of the previous studies emphasize and analyze representations as a whole, very little work has been carried out to analyze individual neurons in deep NLP models.
Studying individual neurons can facilitate understanding of the inner workings of neural networks (Karpathy et al., 2015; Dalvi et al., 2019; Suau et al., 2020) and has other potential benefits such as controlling bias and manipulating a system's behaviour (Bau et al., 2019), model distillation and compression (Rethmeier et al., 2020), efficient feature selection (Dalvi et al., 2020), and guiding architectural search.
In this work, we put the representations learned within pre-trained transformer models under the microscope and carry out a fine-grained neuron level analysis with respect to various linguistic properties. We target questions such as: i) do individual neurons in pretrained models capture linguistic information? ii) which parts of the network learn more about certain linguistic phenomena? iii) how distributed or focused is the information? and iv) how do various architectures differ in learning these properties?
A typical methodology in previous work on analyzing representations trains probing classifiers on the representations learned within a neural model to predict the task under study. We also use a probing classifier approach to analyze individual neurons. Since neurons are multivariate in nature and work in groups, we additionally use elastic-net regularization, which encourages both individual neurons and groups of neurons to play a role in the training of the classifier. Given a trained classifier, we consider the weights assigned to each neuron as a measure of its importance with respect to the linguistic task under study. We use probes with high selectivity (Hewitt and Liang, 2019) to ensure that our results reflect properties of the representations and not the probe's capacity to learn.
We choose 4 pre-trained models: ELMo (Peters et al., 2018a), its transformer variant T-ELMo (Peters et al., 2018b), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), covering a varied set of modeling choices, including the building blocks (recurrent networks versus Transformers), optimization objective (auto-regressive versus non-autoregressive), and model depth and width. Our cross-architectural analysis yields the following insights:
• Information across networks is distributed, but it is possible to extract a very small subset of neurons that predicts a linguistic task with the same accuracy as the entire network.
• Low level tasks such as predicting morphology require fewer neurons compared to high level tasks such as predicting syntax.
• Some phenomena (e.g. verbs) are distributed across many neurons, while others (e.g. interjections) are localized in fewer neurons.
• Lower layers contain more word-level specialized neurons, and higher layers contain neurons specialized in syntax-level information.
• BERT is the most distributed model with respect to all properties while XLNet exhibits focus with the most disjoint set of neurons and layers designated for different linguistic properties.

Methodology
A common approach for probing neural network components against linguistic properties is to train a linear classifier using the activations generated from the trained neural network as static features.
The underlying assumption is that if a simple linear model can predict a linguistic property, then the representations implicitly encode this information.

Probe:
We go a level deeper and identify neurons within the learned representations to carry out a more fine-grained, neuron-level analysis. We use a logistic regression classifier with elastic-net regularization (Zou and Hastie, 2005). The weights of the trained classifier serve as a proxy for selecting the most relevant features within the learned representations for predicting a linguistic property. Formally, consider a pre-trained neural language model M with L layers {l_1, l_2, ..., l_L}. Given a dataset D = {w_1, w_2, ..., w_N} with a corresponding set of linguistic annotations T = {t_{w_1}, t_{w_2}, ..., t_{w_N}}, we map each word w_i in D to a sequence of latent representations z = {z_1, ..., z_N}. The representations can be extracted either from the entire model or from an individual layer. The classifier is trained by minimizing the following loss function:

L(θ) = −Σ_i log P_θ(t_{w_i} | w_i) + λ_1 ‖θ‖_1 + λ_2 ‖θ‖_2^2

where P_θ(t_{w_i} | w_i) is the probability that word w_i is assigned property t_{w_i}. The weights θ ∈ R^{D×T} are learned with gradient descent. Here D is the dimensionality of the latent representations z_i and T is the number of tags (properties) in the linguistic tag set the classifier is predicting. The terms λ_1 ‖θ‖_1 and λ_2 ‖θ‖_2^2 correspond to L1 and L2 regularization. This combination, known as elastic-net, strikes a balance between identifying very focused, localized features (L1) and distributed groups of neurons (L2). We use the grid-search algorithm described under Search below to find the most appropriate set of lambda values, but let us describe the neuron ranking algorithm first.
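A probe of this form can be sketched in a few lines. The snippet below is an illustrative toy setup (the activations, labels, and hyperparameters are ours, not the paper's): scikit-learn's logistic regression supports the elastic-net penalty directly, and the learned weight matrix θ has one row per tag and one column per neuron.

```python
# Hypothetical sketch: probing frozen representations with an
# elastic-net logistic-regression classifier. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D, T, n = 32, 3, 300                      # neurons, tags, tokens
Z = rng.normal(size=(n, D))               # frozen activations z_i
y = Z[:, :3].argmax(axis=1)               # toy "linguistic" labels

# l1_ratio mixes L1 (sparse, localized neurons) with L2 (distributed
# groups); C is the inverse of the overall regularization strength.
probe = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=2000)
probe.fit(Z, y)

theta = probe.coef_                       # shape (T, D): one weight row per tag
print(theta.shape)                        # (3, 32)
```

The rows of `theta` are exactly the per-tag weight vectors that the ranking step below sorts by absolute value.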
Neuron Ranking Algorithm: Once the classifier has been trained, our goal is to retrieve the individual neurons or groups of neurons (some subset of the features of the latent representation) that are most relevant for predicting a particular linguistic property T of interest. We use the neuron ranking algorithm described in Dalvi et al. (2019). Given the trained classifier θ ∈ R^{D×T}, the algorithm extracts a ranking of the D neurons in the model M. For each label t in task T, the weights are sorted by their absolute values in descending order. To select the N most salient neurons w.r.t. the task T, an iterative process is carried out: the algorithm starts with a small percentage of the total weight mass and selects the most salient neurons for each sub-property (e.g., nouns in POS tagging), until the set reaches the specified size N.
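A minimal sketch of this ranking step, under our reading of the algorithm (function and variable names are ours): for each label, take neurons in order of |weight| until a fraction p of that label's total absolute-weight mass is covered, then grow p until the union of selected neurons reaches the requested size N.

```python
import numpy as np

def top_neurons(theta, N, step=0.05):
    """theta: (T, D) classifier weights; return indices of N salient neurons."""
    absw = np.abs(theta)
    order = np.argsort(-absw, axis=1)             # per-label descending |weight|
    csum = np.cumsum(np.take_along_axis(absw, order, axis=1), axis=1)
    total = csum[:, -1]
    p = step
    selected = set()
    while len(selected) < N and p <= 1.0:
        for t in range(theta.shape[0]):
            # smallest k whose cumulative mass exceeds p * total mass
            k = int(np.searchsorted(csum[t], p * total[t])) + 1
            selected.update(order[t, :k].tolist())
        p += step
    return sorted(selected)[:N]

# Toy weights: label 0 relies on neurons 0 and 3, label 1 on neuron 1.
theta = np.array([[5.0, 0.1, 0.1, 4.0],
                  [0.2, 6.0, 0.1, 0.1]])
print(top_neurons(theta, N=2))                    # [0, 1]
```

Because selection proceeds label by label, every label contributes its strongest neurons before any label gets its weaker ones.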

Search:
The search criterion is driven through ablation of weights in the trained classifier. Once the classifier is trained, we select the M top and bottom features according to our ranked list (obtained using the neuron ranking algorithm described above) and zero out the remaining features. We then compute a score for each lambda set (λ_1, λ_2) as:

score(λ_1, λ_2) = α (A_t − A_b) − β (A_z − A_l)

where A_t is the accuracy of the classifier retaining the top neurons and masking the rest, A_b is the accuracy retaining the bottom neurons, A_z is the accuracy of the classifier trained using all neurons but without regularization, and A_l is the accuracy with the current lambda set. The first term ensures that we select a lambda set where the accuracies of top and bottom neurons are far apart, and the second term ensures that we prefer weights that incur a minimal loss in classifier accuracy due to regularization. We set α and β to 0.5 in our experiments. This formulation enables the search to be automated, unlike Dalvi et al. (2019), where the lambdas were selected manually, which we found cumbersome and error-prone.
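The scoring can be sketched directly from the description above (the equation's exact form is reconstructed from the text, and the accuracy values below are illustrative): reward a large top-vs-bottom gap, penalize accuracy lost to regularization, and pick the lambda pair with the highest score.

```python
# A_t, A_b: accuracy keeping only top / only bottom M features;
# A_z: unregularized accuracy; A_l: accuracy under (lambda1, lambda2).
def lambda_score(A_t, A_b, A_z, A_l, alpha=0.5, beta=0.5):
    # First term rewards a large top-vs-bottom accuracy gap; second
    # term penalizes accuracy lost to regularization.
    return alpha * (A_t - A_b) - beta * (A_z - A_l)

# Illustrative grid: mild regularization separates top from bottom
# neurons with little accuracy loss; heavy regularization costs more.
candidates = {
    (1e-5, 1e-5): lambda_score(0.92, 0.40, 0.95, 0.94),
    (1e-3, 1e-3): lambda_score(0.90, 0.60, 0.95, 0.85),
}
best = max(candidates, key=candidates.get)
print(best)   # (1e-05, 1e-05)
```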
Minimal Neuron Selection: Once we have obtained the best regularization lambdas, we follow a 4-step process to extract the minimal neurons for any downstream task: i) train a classifier to predict the task using all the neurons (call it the Oracle), ii) obtain a neuron ranking using the ranking algorithm described above, iii) choose the top N neurons from the ranked list and retrain a classifier using these, and iv) repeat step iii, increasing the size of N, until the classifier obtains an accuracy close to the Oracle (within a specified threshold δ).
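The selection loop can be sketched as follows. `train_eval` stands in for retraining a probe on a neuron subset; here it is a toy stub whose accuracy grows with the number of neurons kept (all names and numbers are illustrative).

```python
def minimal_neurons(ranking, oracle_acc, train_eval, delta=0.5, step=0.01):
    """Grow the top-N prefix of `ranking` until accuracy is within delta
    of the Oracle (classifier trained on all neurons)."""
    n_total = len(ranking)
    incr = max(1, int(step * n_total))
    n = incr
    while n <= n_total:
        acc = train_eval(ranking[:n])
        if oracle_acc - acc <= delta:          # close enough to the Oracle
            return ranking[:n], acc
        n += incr
    return ranking, train_eval(ranking)

ranking = list(range(100))                     # a ranked list of 100 neurons

def toy_eval(subset):                          # stub: accuracy rises with size
    return 90.0 + 5.0 * len(subset) / 100

subset, acc = minimal_neurons(ranking, oracle_acc=95.0, train_eval=toy_eval)
print(len(subset))                             # 90
```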
Control Tasks: While there is a plethora of work demonstrating that contextualized representations encode a continuous analogue of discrete linguistic information, the question has also been raised recently whether the representations actually encode linguistic structure, or whether the probe merely memorizes the task under study. We use Selectivity as a criterion to put a "linguistic task's accuracy in context with the probe's capacity to memorize from word types" (Hewitt and Liang, 2019). It is defined as the difference between linguistic-task accuracy and control-task accuracy; an effective probe should achieve high linguistic-task accuracy and low control-task accuracy. The control tasks for our probing classifiers are defined by mapping each word type x_i to a randomly sampled behavior C(x_i) from a set of numbers {1, ..., T}, where T is the size of the tag set to be predicted in the linguistic task. The sampling is done using the empirical token distribution of the linguistic task, so the marginal probability of each label is similar. We compute Selectivity by training classifiers using all neurons and using the selected neurons.
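A control task of this kind can be sketched as follows (toy tokens and tags, ours): each word *type* is assigned one random label drawn from the empirical label distribution of the real task, so only type memorization, not linguistic structure, can solve it.

```python
import random

random.seed(0)

def control_labels(tokens, gold_labels):
    """Assign each word type a random label, sampled from the
    empirical label distribution of the linguistic task."""
    labels = sorted(set(gold_labels))
    weights = [gold_labels.count(l) for l in labels]
    assignment = {}
    out = []
    for tok in tokens:
        if tok not in assignment:              # one label per word type
            assignment[tok] = random.choices(labels, weights)[0]
        out.append(assignment[tok])
    return out

toks = ["the", "dog", "barks", "the", "cat"]
gold = ["DT", "NN", "VB", "DT", "NN"]
ctrl = control_labels(toks, gold)
assert ctrl[0] == ctrl[3]                      # same type, same control label
```

Selectivity is then simply the linguistic-task accuracy minus the probe's accuracy on these control labels.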

Experimental Setup
Pre-trained Neural Language Models: We present results with 4 pre-trained models: ELMo (Peters et al., 2018a) and 3 transformer architectures: Transformer-ELMo (Peters et al., 2018b), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). The ELMo model is trained using a bidirectional recurrent neural network (RNN) with 3 layers, each of size 1024. Its transformer equivalent (T-ELMo) is trained with 7 layers but the same hidden-layer size. The BERT model is trained as an auto-encoder with a dual objective of predicting masked words and the next sentence. We use the base version (13 layers and 768 dimensions). Lastly, we include XLNet-base, which is trained with the same parameter settings (number and size of hidden layers) as BERT, but with a permutation-based auto-regressive objective function.
Language Tasks: We evaluated our method on 4 linguistic tasks: POS tagging using the Penn Treebank (Marcus et al., 1993), syntax tagging (CCG supertagging) using CCGBank (Hockenmaier, 2006), syntactic chunking using the CoNLL 2000 shared-task dataset (Tjong Kim Sang and Buchholz, 2000), and semantic tagging using the Parallel Meaning Bank data (Abzianidze et al., 2017). We used standard splits for training, development and test data (see Appendix A.1).

Classifier Settings: We used a linear probing classifier with elastic-net regularization and a categorical cross-entropy loss, optimized by Adam (Kingma and Ba, 2014). Training is run with shuffled mini-batches of size 512 and stopped after 10 epochs. The regularization weights are tuned using the grid-search algorithm. For subword-based models, we use the last activation value as the representative of the word, as prescribed for embeddings extracted from neural MT models (Durrani et al., 2019) and pre-trained language models (Liu et al., 2019). Linear classifiers are a popular choice in analyzing deep NLP models due to their better interpretability (Qian et al., 2016; Belinkov et al., 2020). Hewitt and Liang (2019) have also shown linear probes to have higher Selectivity, a property deemed desirable for more interpretable probes. Linear probes are particularly important for our method, as we use the learned weights as a proxy to measure the importance of each neuron.
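The last-subword convention mentioned above can be sketched like this (the alignment and array sizes are illustrative): given one activation row per subword and a word index per subword, later pieces of a word simply overwrite earlier ones.

```python
import numpy as np

def last_subword_repr(activations, word_ids):
    """activations: (num_subwords, D); word_ids: word index per subword.
    Returns one row per word: the activation of its last subword piece."""
    reprs = {}
    for pos, wid in enumerate(word_ids):
        reprs[wid] = activations[pos]      # later pieces overwrite earlier ones
    return np.stack([reprs[w] for w in sorted(reprs)])

acts = np.arange(12, dtype=float).reshape(4, 3)  # 4 subwords, D = 3
word_ids = [0, 0, 1, 2]                          # e.g. "play ##ing", "the", "game"
W = last_subword_repr(acts, word_ids)
print(W.shape)                                   # (3, 3): one row per word
```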

Ablation Study
First we evaluate the rankings obtained by the neuron selection algorithm presented in Section 2. We extract a ranked list of neurons with respect to each property set (linguistic task T) and ablate neurons in the classifier to verify the rankings. This is done by zeroing out all the activations at test time, except for the selected M% of neurons. We select the top, random and bottom 20% of neurons to evaluate our rankings. Table 1 shows the efficacy of the rankings, with low performance (prediction accuracy) using only the bottom or random neurons versus using only the top neurons. The accuracy of random neurons is high in some cases (for example CCG, a task related to predicting syntax), showing that when the underlying task is complex, the information related to it is more distributed across the network, causing redundancy.
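The ablation itself amounts to masking columns of the activation matrix at test time. A minimal sketch (toy probe-free version; the data is illustrative):

```python
import numpy as np

def ablate(Z, keep):
    """Zero every activation except the neurons listed in `keep`."""
    mask = np.zeros(Z.shape[1])
    mask[list(keep)] = 1.0
    return Z * mask

Z = np.ones((2, 5))                 # 2 test tokens, 5 neurons
Z_top = ablate(Z, keep=[0, 3])      # e.g. the "top" 40% of neurons
print(Z_top)                        # columns 0 and 3 survive, rest zeroed
```

The trained classifier is then evaluated on `Z_top` (and on the analogous random/bottom masks) without retraining.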

Minimal Neuron Set
Now that we have established the correctness of the rankings, we apply the algorithm incrementally to select, for each linguistic task, a minimal set of neurons that obtains a similar accuracy (we use a threshold δ = 0.5) to using the entire network (all the features). Identifying a minimal set of top neurons enables us to highlight: i) parts of the learned network where different linguistic phenomena are predominantly captured, and ii) how localized or distributed the information is with respect to different properties. Table 2 summarizes the results. First, we show that in all the tasks, selecting a subset of the top N% of neurons and retraining the classifier can obtain a similar (sometimes even better) accuracy to using all the neurons (Acc_a) as static classification features. For lexical tasks such as POS or SEM tagging, a very small number of neurons (roughly 400, i.e., 4% of the features in BERT and XLNet) was found sufficient to achieve an accuracy (Acc_t) similar to the oracle (Acc_a). More complex syntactic tasks such as chunking and CCG tagging required larger sets of neurons (up to 2365, one third of the network in T-ELMo) to accomplish the same. It is interesting that all the models, irrespective of their size, required a comparable number of selected neurons in most cases. On the POS and SEM tagging tasks, all models besides T-ELMo use roughly the same number of neurons; T-ELMo required more neurons for SEM tagging, which could imply that its knowledge of lexical semantics is distributed across more neurons. As an overall trend, ELMo generally needed fewer neurons while T-ELMo required more than the other models to achieve oracle performance. Both of these models are much smaller than BERT and XLNet; we did not observe any correlation between these results and the size of the models.

Table 2: Selecting the minimal number of neurons for each downstream NLP task. Accuracy numbers are reported on the blind test set (averaged over three runs). Neu_a = total number of neurons, Neu_t = top selected neurons, Acc_a = accuracy using all neurons, Acc_t = accuracy after retraining the classifier on the selected neurons, Sel = difference between linguistic-task and control-task accuracy when the classifier is trained on all neurons (Sel_a) and on the top neurons (Sel_t).
Control Tasks: We use Selectivity to further demonstrate that our probes (trained using the entire representation and using the selected neurons) do not memorize word types but learn the underlying linguistic task. Recall that an effective probe should achieve high linguistic-task accuracy and low control-task accuracy. The results (see Table 2) show that selectivity with the top neurons (Sel_t) is much higher than selectivity with all neurons (Sel_a). It is evident that using all the neurons may encourage memorization, whereas the higher selectivity with the selected neurons indicates less memorization and confirms the efficacy of our neuron selection. We achieve high selectivity when selecting 400 neurons, as in the case of POS and SEM. The chunking and CCG tasks require many more neurons, with CCG requiring up to 33% of the network. Here, the low selectivity indicates that while the information about CCG is distributed over several neurons, a set of random neurons may also achieve decent performance.
Discussion: Identifying neurons that are salient to a task has various potential applications, such as task-specific model compression, by removing neurons irrelevant to the task, or task-specific fine-tuning based on the selected neurons. It is, however, tricky to model this; one complexity is that zeroing out non-salient neurons in the lower layers directly affects any salient neurons in the subsequent layers. A more direct application of our work is efficient feature-based transfer learning, which has been shown to be a viable alternative to the fine-tuning approach. The feature-based approach uses contextualized embeddings learned from pre-trained models as static feature vectors in the downstream classification task. Classifiers with large contextualized vectors are not only cumbersome to train, but also inefficient during inference. They have also been shown to be sub-optimal when supervised data is insufficient (Hameed, 2018). BERT-large, for example, is trained with 19,200 (25 layers × 768 dimensions) features. Reducing the feature set to a smaller number can lead to faster training of the classifier and more efficient inference. Earlier (in Table 2) we obtained a minimal set of neurons with a very tight threshold of δ = 0.5. By allowing a looser threshold, say δ = 2, we can reduce the set of minimal neurons to improve efficiency even more. See Table 3 for results. For more on this, we refer interested readers to Dalvi et al. (2020), where we explored this more formally, expanding our study to the sentence-labeling GLUE tasks (Wang et al., 2018).

Layer-wise Distribution
Previous work analyzed how individual layers of deep neural networks contribute towards a downstream task (Liu et al., 2019; Kim et al., 2020; Belinkov et al., 2020). Here we observe how the neurons, selected from the entire network, spread across different layers of the model. Such an analysis gives an alternative view of which layers contribute predominantly towards different tasks. Figure 1 presents the results. In most cases, lexical tasks such as learning morphology (POS tagging) and word semantics (SEM tagging) are dominantly captured by neurons at the lower layers, whereas the more complicated task of modeling syntax (CCG supertagging) is handled at the final layer. An exception to this overall pattern is the BERT model: its top neurons spread across all the layers, unlike the other models, where the top neurons (for a particular task) are contributed by fewer layers. This reflects that every layer in BERT possesses neurons that specialize in particular language properties, while the other models have designated layers that specialize in them. Differently from the other models, the neurons in XLNet's embedding layer show minimal contribution, consistently across the tasks. Let us analyze the results with respect to each linguistic task.

POS Tagging: Every layer in BERT and ELMo contributed towards the top neurons, while the distribution is dominated by the lower layers in XLNet and T-ELMo, with the exception of XLNet not choosing any neurons from the embedding layer.
SEM Tagging: Similar to POS, all layers of BERT contributed to the list of top neurons; however, the middle layers showed the largest contribution (see layers 4-7 in Figure 1e). This is in line with Liu et al. (2019), who found the middle and upper-middle layers to give optimal results for the semantic tagging task. On XLNet, T-ELMo and ELMo, the first layer after the embedding layer got the largest share of the top SEM neurons. This trend is consistent across the other tasks, i.e., the core linguistic information is learned earlier in the network, with the exception of BERT, which distributes information across the network.
Chunking Tagging: The overall pattern remained similar in the task of chunking. Notice, however, a shift in the pattern for BERT: the contribution from the lower layers decreased compared to the previous tasks. For example, while in the SEM task top neurons were dominantly contributed by the lower and middle layers, in chunking the middle and higher layers contributed most. This could be attributed to the fact that chunking is a more complex syntactic task and is learned at relatively higher layers.
CCG Supertagging: Compared to chunking, CCG supertagging is a richer syntactic tagging task, almost equivalent to parsing (Bangalore and Joshi, 1999). The complexity of the task is evident in our results as there is a clear shift in the distribution of top neurons moving from middle to higher layers. The only exception again is the BERT model where this information is well spread across the network, but still dominantly preserved in the final layers.
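The layer-wise view above is computed by mapping each selected (global) neuron index in the concatenated representation back to the layer it came from. A minimal sketch, with illustrative indices and a hypothetical layer size:

```python
from collections import Counter

def layer_of(idx, layer_size):
    """Map a global neuron index in a layer concatenation to its layer."""
    return idx // layer_size

selected = [10, 700, 1500, 1600, 2000]     # toy global indices of top neurons
layer_size = 768                           # e.g. a BERT-base hidden size
dist = Counter(layer_of(i, layer_size) for i in selected)
print(dict(dist))                          # {0: 2, 1: 1, 2: 2}
```

Histograms of `dist` per task are what plots like Figure 1 summarize.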
Discussion: Our results are in line with, and reinforce, the layer-wise analysis presented in Liu et al. (2019). However, unlike their work and other work on layer-wise probing, which trains a classifier on each layer individually to compare the results, our method trains a single classifier on the concatenation of all layers and analyzes which layers contribute most to the task based on the most relevant selected features. This makes the playing field even. For example, Liu et al. (2019) showed layer 1 in Transformer-ELMo to give the best result on the task of predicting POS tags; however, layers 2 and 3 give almost similar accuracy (see Appendix D1 in their paper). Based on those results, one cannot confidently claim that the task of POS is predominantly captured at layer 1, whereas our method clearly shows this result (see Figure 1c).

Localization versus Distributedness
Next we study how localized or distributed different properties are within a linguistic task (for example, nouns or verbs in POS tagging, or locations in semantic tagging), and across different architectures. Recall that the ranking algorithm extracts neurons for each label t (e.g., the LOC:location or EVE:event categories in semantic tagging) in task T, sorted by absolute weights. The final rankings are obtained by selecting from each label using the neuron ranking algorithm described in Section 2. This allows us to analyze how localized or distributed a property is, based on the number of neurons selected for each label in the task.

Property-wise: We found that while many properties are distributed, i.e., a large group of neurons is used to predict a label, some properties, such as functional or unambiguous words that do not require contextual information, are learned using fewer neurons. For example, UH (interjections) or the TO particle required fewer neurons across architectures than NNPS (proper noun, plural) in the task of POS tagging (Figure 2). Similarly, EQA (the equating property, e.g., as tall as you) is handled with fewer neurons than ORG (the organization property). We observed similar behavior in the task of chunking, with I-PRT (particles inside a chunk) requiring fewer neurons across different architectures. On the contrary, B-VP (beginning of a verb phrase) required many more.
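The localization measure above reduces to counting the neurons assigned to each label by the ranking. A toy sketch (the label-to-neuron map is invented for illustration, not taken from the paper's results):

```python
# Hypothetical per-label neuron assignments from a ranking.
label_neurons = {
    "UH":   {12, 301},                     # interjections: few neurons (localized)
    "NNPS": {5, 17, 44, 120, 301, 512},    # plural proper nouns: many (distributed)
}
counts = {label: len(ns) for label, ns in label_neurons.items()}
assert counts["UH"] < counts["NNPS"]       # UH is the more localized property
```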
Layer-wise: Previously we analyzed each linguistic task in totality. We now study whether individual properties (e.g., adjectives) are localized or well distributed across layers in different architectures. We observed interesting cross architectural similarities, for example the neurons that predict the foreign words (FW) property were predominantly localized in final layers (BERT: 13, XLNET: 11, T-ELMo: 7, ELMo:3) of the network in all the understudied architectures. In comparison, the neurons that capture common class words such as adjectives (JJ) and locations (LOC) are localized in lower layers (BERT: 0, XLNET: 1, T-ELMo: 0, ELMo:1). In some cases, we did find variance, for example personal pronouns (PRP) in POS tagging and event class (EXC) in semantic tagging were handled at different layers across different architectures. See Appendix A.7 for all labels.
Architecture-wise: We found that the top neurons in XLNet are more localized towards individual properties compared to the other architectures, where top neurons are shared across multiple properties. We demonstrate this in Figure 3. Notice how the number of neurons for different labels is much smaller in the case of XLNet, although roughly the same number of total neurons (400 for POS tagging and 960 for chunking, on average; see Table 2) was required by all pre-trained models to carry out a task. This means that in XLNet neurons are exclusive to specific properties, whereas in the other architectures neurons are shared between multiple properties. Such a trait can be helpful in predicting the behavior of the system, as it is easier to isolate the neurons designated for specific phenomena.
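One way to quantify this sharing (our own illustrative measure, not one from the paper) is the mean pairwise Jaccard overlap between per-label neuron sets: near-disjoint sets (XLNet-like) score near 0, heavily shared sets (BERT-like) score higher.

```python
from itertools import combinations

def mean_jaccard(sets):
    """Average pairwise Jaccard overlap between per-label neuron sets."""
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

xlnet_like = [{1, 2}, {3, 4}, {5, 6}]              # disjoint per-label sets
bert_like  = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]     # heavily shared sets
print(mean_jaccard(xlnet_like), mean_jaccard(bert_like))  # 0.0 vs 0.4
```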

Related Work
The rise of neural networks has been accompanied by a corresponding rise in efforts to interpret these models. Researchers have explored visualization methods to analyze learned representations (Karpathy et al., 2015; Kádár et al., 2017), attention heads (Clark et al., 2019; Vig, 2019), language compositionality (Li et al., 2016), etc. While such visualizations illuminate the inner workings of the network, they are often qualitative in nature and somewhat anecdotal. A more commonly used approach provides a quantitative analysis by correlating parts of the neural network with linguistic properties, for example by training a classifier to predict a feature of interest (Adi et al., 2016; Conneau et al., 2018); we refer the reader to the comprehensive surveys of work in this direction. Liu et al. (2019) used probing classifiers to investigate the contextualized representations learned by a variety of neural language models on numerous word-level linguistic tasks. A similar analysis was carried out by Tenney et al. (2019) on a variety of sub-sentence linguistic tasks. We extend this line of work to carry out a more fine-grained, neuron-level analysis of neural language models.
Our work is most similar to Dalvi et al. (2019), who conducted a neuron analysis of representations learned by sequence-to-sequence machine translation models. Our work differs in that i) we carry out the analysis on a wide range of architectures that are deeper and more complicated than RNN-based models, illuminating interesting insights, and ii) we automated the grid-search criterion for selecting the regularization parameters, compared to the manual selection of lambdas, which is cumbersome and error-prone. In contemporaneous work, Suau et al. (2020) used max-pooling to identify relevant neurons (aka expert units) in pre-trained models with respect to a specific concept (for example, word sense).
A pitfall of the probing-classifier approach is that it is unclear whether the probe faithfully reflects a property of the representation or has simply learned the task. Hewitt and Liang (2019) defined control tasks to analyze the role of training data and lexical memorization in probing experiments. Voita and Titov (2020) proposed an alternative that measures the minimum description length of labels given representations. It would be interesting to see how a probe's complexity in their work (code length) compares with the number of neurons selected by our method. The results are consistent at least in the ELMo POS example, where layer 1 was shown to have the shortest code length in their work; in our case, most top neurons are selected from layer 1 (see Figure 1d, for example). Pimentel et al. (2020) discussed the complexity of probes and argued for using the highest-performing probes for tighter estimates. However, complex probes are difficult to analyze. Linear models are preferable due to their explainability, especially in our work, as we use the learned weights as a proxy for the importance of each neuron. We used linear classifiers with control tasks as described in Hewitt and Liang (2019). Although we mainly used probing accuracy to drive the neuron selection in this work, and Selectivity only to demonstrate that our results reflect properties learned by the representations and not the probe's capacity to learn, an interesting idea would be to use selectivity itself to drive the investigation. However, it is not trivial to optimize for selectivity, as it cannot be controlled or tuned directly; for example, removing some neurons may decrease accuracy but may not change selectivity. We leave this exploration for future work.
Probing classifiers require supervision in the form of annotations for the linguistic tasks of interest, limiting their applicability. Bau et al. (2019) used an unsupervised approach to identify salient neurons in neural machine translation and manipulated the translation output by controlling these neurons. Recently, Wu et al. (2020) measured the similarity of internal representations and attention across prominent contextualized representations (from BERT, ELMo, etc.). They found that different architectures have similar representations, but different individual neurons.

Conclusion
We analyzed individual neurons across a variety of neural language models using linguistic correlation analysis on the task of predicting core linguistic properties (morphology, syntax and semantics). Our results reinforce previous findings and also illuminate further insights: i) while the information in neural language models is massively distributed, it is possible to extract a small number of features to carry out a downstream NLP task, ii) the number of extracted features varies with the complexity of the task, iii) the neurons that learn word morphology and lexical semantics are predominantly found in the lower layers of the network, whereas the ones that learn syntax are at the higher layers, with the exception of BERT, where neurons are spread across the entire network, iv) closed-class words (for example interjections) are handled using fewer neurons than polysemous words (such as nouns and adjectives), and v) features in XLNet are more localized towards individual properties, as opposed to the other architectures, where neurons are distributed across many properties. A direct application of our analysis is efficient feature-based transfer learning from large-scale neural language models: i) identifying that the most relevant features for a task are contained in layer x reduces the forward pass to that layer, and ii) reducing the feature set decreases the time needed to train a classifier and to run inference. We refer interested readers to Dalvi et al. (2020) for more details.

A.2 Hyperparameters
We use elastic-net based regularization to control the trade-off between selecting focused individual neurons and groups of neurons, while maintaining the accuracy of the unregularized classifier. We do a grid search over the L1 and L2 regularization weights, with values ranging from 0 to 1e-7 (see Table 5), over M neuron-set sizes (M = 100 if we increase the number of neurons by 1% in each step) and N = {0, 0.1, ..., 1e-7} lambda values. We fix M = 20% to find the best regularization parameters first, reducing the grid-search time to O(N^2), and find the optimal number of neurons in a subsequent step in O(M); the overall running time of our algorithm is therefore O(M + N^2). Wall-clock time varies considerably with the number of training examples and the number of tags to be predicted in the downstream task. Including a full forward pass over the pre-trained model to extract the contextualized vectors, running the grid search to find the best hyperparameters, and selecting the minimal set of neurons took 12 hours on average, ranging from 3 hours (POS with ELMo) to 18 hours (CCG with BERT).

A.4 Ablation Study
We reported accuracy numbers for ablating the top, random and bottom neurons in the trained classifier on the blind test set in the main body. In Table 6, we report the corresponding results on the development sets.

Table 7: Selecting the minimal number of neurons for each downstream NLP task. Accuracy numbers are reported on the development set (averaged over three runs). Neu_a = total number of neurons, Neu_t = top selected neurons, Acc_a = accuracy using all neurons, Acc_t = accuracy after retraining the classifier on the selected neurons, Sel = difference between linguistic-task and control-task accuracy when the classifier is trained on all neurons (Sel_a) and on the top neurons (Sel_t).

A.5 Minimal Neuron Set
We reported the minimal number of neurons required to obtain oracle accuracy in the main body, along with the results on Selectivity. In Table 7, we report the corresponding results on the development sets.

A.6 Localized versus Distributed Labels
In Section 5.1 we showed the number of features learned for only a few selected labels in each task. Figure 4 shows the results for all the tags across the different tasks.
The results show that some tags are localized and captured by a focused set of neurons while others are distributed and learned within a large set of neurons.

A.7 XLNet versus Others
Notice in Figure 4 that the number of neurons required by each label in XLNet (red bars) is strikingly small compared to the other architectures, specifically T-ELMo (yellow bars). This is interesting given that the total number of neurons required for some of the tasks is very similar: for example, the task of POS tagging required 400 neurons for BERT and XLNet, 320 for ELMo and 430 for T-ELMo. This means that the neurons in XLNet are mutually exclusive with respect to the properties, whereas in the other architectures neurons are shared across multiple properties. Due to the large tag set (1,272 tags) of CCG supertagging, it was not possible to include it in the figures.

A.8 Layer-wise Distribution
In Section 5.2 we showed, for a few labels, the layers at which they are dominantly captured. In Figure 5c we show, for all labels, the layers at which they are predominantly captured across the different architectures.