What Part of the Neural Network Does This? Understanding LSTMs by Measuring and Dissecting Neurons

Memory neurons of long short-term memory (LSTM) networks encode and process information in powerful yet mysterious ways. While there has been work to analyze their behavior in carrying low-level information such as linguistic properties, how they directly contribute to label prediction remains unclear. We find inspiration from biologists and study the affinity between individual neurons and labels, propose a novel metric to quantify the sensitivity of neurons to each label, and conduct experiments to show the validity of our proposed metric. We discover that some neurons are trained to specialize on a subset of labels, and while dropping an arbitrary neuron has little effect on the overall accuracy of the model, dropping label-specialized neurons predictably and significantly degrades prediction accuracy on the associated label. We further examine the consistency of neuron-label affinity across different models. These observations provide insight into the inner mechanisms of LSTMs.


Introduction
In recent years, deep learning has been applied successfully to natural language processing (NLP). Many consider the use of distributed representations to be one of the reasons for this success (LeCun et al., 2015; Young et al., 2018). However, how these distributed representations encode information in deep neural networks, especially the long short-term memory (LSTM) networks that are prevalent in NLP, remains unclear (Feng et al., 2018). One potential way to understand how neural networks function is to analyze the behavior of the individual neurons that carry the distributed representation. While a number of works analyze low-level information stored in individual LSTM neurons, such as linguistic properties (Qian et al., 2016), syntax of source code (Karpathy et al., 2015), and sentiment (Radford et al., 2017), how each neuron contributes directly to the final classification layer remains unclear.
We find inspiration to analyze individual neurons of LSTMs from how biologists analyze neurons of roundworms (White et al., 1986). Biological neural systems consist of a huge number of neurons, and can react to the environment in complicated ways. Biologists start with analyzing basic components of reactions, what stimuli trigger them, and which neurons are excited during the process. To verify the relationship between stimuli, neurons, and reactions, biologists further dissect neurons which are correlated with specific basic reactions, and see if the reaction still occurs for the same stimuli.
We adopt the same methodology to study LSTMs, using a representative task in NLP: named-entity recognition (NER) (Ratinov and Roth, 2009; Lample et al., 2016). Even though the output of a neural network may be complicated, we focus on basic components of the output: whether a label is predicted or not. We feed various input instances into the neural model, and analyze the relationship between the value of each LSTM neuron and the predicted label. We quantify the sensitivity of neurons to each label, and study how label-specific information is distributed among all neurons. We discover that each individual neuron is specialized to carry information for a subset of labels, and that the information of each label is carried by only a subset of all neurons. We further conduct experiments that gradually drop out individual neurons. This significantly lowers the accuracy on the labels that the neuron is specialized on, while having little effect on the overall performance of the model. We also study the correlation between labels, and discover some patterns that are shared among different models.
Our contributions are as follows: (1) To the best of our knowledge, we are the first to have taken this neuron-label affinity focused approach to understanding the inner workings of LSTMs. (2) We propose a novel metric to quantify such affinity, and conduct experiments to verify the validity and consistency of this metric.

Related Work
Recently, work has been done to analyze continuous representations in NLP. Shi et al. (2016) and Qian et al. (2016) analyze linguistic properties carried by representation vectors using external supervision.  and  further analyze linguistic information in individual neurons from neural machine translation representations in an unsupervised manner. For LSTMs of language models, Karpathy et al. (2015) identify individual neurons that trigger for specific information, such as bracket and sequence length, and Radford et al. (2017) discover neurons that encode sentiment information.
In computer vision, Zhou et al. (2018) analyze the relationship between individual units of a CNN and label prediction. To the best of our knowledge, however, in the field of NLP, there has been little work on analyzing the affinity between labels and neurons of recurrent networks. This paper aims to address this problem.

Model Selection
Named-entity recognition is a sequence labeling task. The input of the model is a sequence of words $x^{(t)}$, $t = 1, 2, \cdots$. Each input word has a corresponding label $z^{(t)} \in L$, where $L$ is the set of all labels $\{l_j\}$, $j = 1, 2, \cdots, m$. The label indicates whether the word is an entity or not, and if yes, which kind of entity it is.
A typical modern NER model consists of a bidirectional LSTM and a conditional random field (CRF) on top of the LSTM (Collobert et al., 2011; Huang et al., 2015). Sometimes there is also a convolutional neural network (CNN) (Ma and Hovy, 2016; Chiu and Nichols, 2016). However, the goal of this paper is not to achieve state-of-the-art performance on this task, but to understand the mechanisms of LSTMs. Therefore, we choose a relatively simple model (see Figure 1) for the experiments: a single-layer unidirectional LSTM with a fully-connected layer on top of it.

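We denote $h^{(t)} \in \mathbb{R}^n$ as the LSTM's hidden state at timestep $t$, and $h^{(t)}_i$ as its $i$-th entry. $W \in \mathbb{R}^{n \times m}$ is the weight matrix of the fully-connected layer. The output of the entire model at timestep $t$ is therefore the vector

$$y^{(t)} = \mathrm{softmax}\left(W^\top h^{(t)}\right),$$

where $y^{(t)} \in \mathbb{R}^m$ and each entry is the predicted probability of a label in $L$:

$$y^{(t)}_j = \frac{\exp\left(W_{:,j}\, h^{(t)}\right)}{\sum_{k=1}^{m} \exp\left(W_{:,k}\, h^{(t)}\right)},$$

where $W_{:,j}$ is the transpose of the $j$-th column vector of the matrix $W$. The final prediction $\hat{z}^{(t)}$ is chosen as the label with the greatest probability.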
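The output layer described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: it assumes a softmax without a bias term, and the function name `predict_label_probs` and the random toy inputs are hypothetical.

```python
import numpy as np

def predict_label_probs(h, W):
    """Map one hidden state to label probabilities.

    h: hidden state at one timestep, shape (n,)
    W: weight matrix of the fully-connected layer, shape (n, m)
    Returns a vector of shape (m,) that sums to 1.
    Assumption: softmax output with no bias term.
    """
    logits = W.T @ h                 # entry j is the j-th column of W dotted with h
    logits = logits - logits.max()   # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy example with n = 50 neurons and m = 9 labels (random values, not real data).
rng = np.random.default_rng(0)
n, m = 50, 9
h = rng.standard_normal(n)
W = rng.standard_normal((n, m))

y = predict_label_probs(h, W)
z_hat = int(np.argmax(y))  # index of the label with the greatest probability
```

Because softmax is monotonic, the predicted label can equally be read off the largest logit, which is convenient when only the prediction is needed.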

Experiment Setup
The model is trained on the CoNLL2003 (Sang and De Meulder, 2003) training dataset. The development and test sets of CoNLL2003 are used in the experiments in Section 4. This dataset has nine labels in total, under the BIO tagging scheme. See the first row of Figure 3 for the complete set of labels.
Code for this paper is adapted from the toolkit by Yang and Zhang (2018). We set the hidden size of the LSTM to 50, since a larger hidden size does not significantly improve the results. Other hyperparameters, such as learning rate, batch size, and dropout rate, are kept unchanged. The model is trained for 10 epochs, and we pick the checkpoint

Analyzing Neuron-Label Affinity
In this section, we first identify important neurons by quantifying the sensitivity of a neuron to a label, and then verify the quantification by neuron ablation experiments.

Identifying Important Neurons
A neuron of an LSTM corresponds to an entry (dimension) of $h^{(t)}$. For a given label $l_j$, we identify the neurons that are important for its prediction as follows. We define the contribution of the $i$-th neuron to the $j$-th label at timestep $t$ as

$$u^{(t)}_{i,j} = W_{i,j}\, h^{(t)}_i.$$

Note that the contribution is the value obtained after multiplication by $W$ in the fully-connected layer. Therefore the signed contribution value itself is what matters, not its absolute value. The sensitivity of the $i$-th neuron to the $j$-th label is further defined as

$$s_{i,j} = \mathbb{E}\left[u^{(t)}_{i,j} \mid z^{(t)} = l_j\right] - \mathbb{E}\left[u^{(t)}_{i,j} \mid z^{(t)} \neq l_j\right],$$

where $\mathbb{E}$ stands for taking the average over $t$. This is the difference between the mean contribution over words labeled $l_j$ and the mean contribution over all other words. The higher $s_{i,j}$ is, the more sensitive the $i$-th neuron is for predicting the label $l_j$.
We compute $s_{i,j}$ for all pairs of $i$ and $j$, with the average taken over the entire development set. A part of the results is shown in Figure 2, and the full results are shown in the appendix.
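The sensitivity computation can be sketched as follows. This is an illustrative sketch under stated assumptions: the contribution of neuron i to label j is taken to be the product of the weight `W[i, j]` and the neuron's value, the words are grouped by their gold label, and the function name `sensitivity` and the random toy data are hypothetical.

```python
import numpy as np

def sensitivity(H, z, W):
    """Compute the neuron-label sensitivity matrix.

    H: hidden states for all words, shape (T, n)
    z: integer label index of each word, shape (T,)
       (assumption: gold labels are used to group the words)
    W: weight matrix of the fully-connected layer, shape (n, m)
    Returns S of shape (n, m), where S[i, j] is the mean contribution of
    neuron i to label j over words labeled j, minus the mean over other words.
    Every label must occur at least once and must not cover all words.
    """
    n, m = W.shape
    S = np.zeros((n, m))
    for j in range(m):
        contrib = H * W[:, j]   # contrib[t, i] = W[i, j] * H[t, i]
        is_j = (z == j)
        S[:, j] = contrib[is_j].mean(axis=0) - contrib[~is_j].mean(axis=0)
    return S

# Toy data: 180 words, 50 neurons, 9 labels (random values, not real data).
rng = np.random.default_rng(0)
T, n, m = 180, 50, 9
H = rng.standard_normal((T, n))
W = rng.standard_normal((n, m))
z = np.arange(T) % m            # guarantees that every label occurs

S = sensitivity(H, z, W)        # shape (50, 9): rows are neurons, columns are labels
```

In practice `H` and `z` would come from running the trained model over the development set; the loop over labels keeps memory use at one (T, n) array at a time.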

[Figure 2: sensitivity of neurons to labels (partial results). Column labels: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, B-MISC, I-MISC.]
From the figure, we make the following observations:
• Each neuron has a different sensitivity to different labels. Some neurons are sensitive to only one label, e.g., neuron #10 for B-MISC; some are sensitive to multiple labels, e.g., neuron #17 for B-LOC and I-MISC; some are not sensitive to any label, e.g., neuron #7.
• For each label, there are multiple neurons that are sensitive to it, as well as multiple neurons that are not.
From these observations, we conclude that the prediction of each label is based on information that is distributed among multiple, but not all, neurons. Furthermore, different types of information are distributed differently.
For each label, we further rank all neurons based on their sensitivity, and obtain an importance ranking for the label. The top ten neurons for each label are shown in Figure 3, and the full results are shown in the appendix.
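Given a sensitivity matrix, the importance ranking for a label is a sort of one column in descending order. The sketch below uses a small made-up matrix; the function name `importance_ranking` is hypothetical.

```python
import numpy as np

def importance_ranking(S, j):
    """Return neuron indices sorted from most to least sensitive to label j.

    S: sensitivity matrix, shape (n, m), rows are neurons, columns are labels.
    """
    return np.argsort(-S[:, j], kind="stable")

# Made-up sensitivities for 3 neurons and 2 labels.
S_toy = np.array([
    [0.1,  0.9],
    [0.7, -0.2],
    [0.3,  0.4],
])

rank_label0 = importance_ranking(S_toy, 0)  # neuron order for label 0: 1, 2, 0
rank_label1 = importance_ranking(S_toy, 1)  # neuron order for label 1: 0, 2, 1
top_k = rank_label0[:2]                     # the two most important neurons for label 0
```

Sorting by the signed value rather than its magnitude matches the remark that the contribution value itself matters, not its absolute value.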

Verifying the Importance of Neurons
We try to verify whether the sensitivity we define in the previous subsection is a valid and consistent indicator of a neuron's importance for a label.
To do this, we evaluate the model on the test set [1] while incrementally ablating neurons [2] from the model in a given order. If the sensitivity values we obtain are valid and consistent, then when we ablate neurons in the order of the importance ranking of label $l_j$, the performance on the test set should drop fastest for predicting $l_j$, and more slowly for other labels.
We choose two pairs of labels: (B-PER, B-MISC) and (B-LOC, I-ORG). In each pair, we conduct neuron ablation according to each label's importance ranking, and compare the model's performance for predicting each label. The results are shown in Figures 4 and 5. We only show the first half of the importance ranking, since the latter half is not only less important but also has more overlap between different labels.
[1] Recall that we obtain the value of sensitivity only from the development set.
[2] Ablating a neuron here means setting h
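The ablation procedure can be sketched as below. Several things here are assumptions rather than the paper's exact setup: ablation is modeled as zeroing a neuron's entry in precomputed hidden states just before the fully-connected layer (which ignores any effect on the recurrence), per-label accuracy is taken to be the fraction of words with that gold label that are predicted correctly, and the function names and toy data are hypothetical.

```python
import numpy as np

def per_label_accuracy(H, z, W, j):
    """Fraction of words with gold label j that the model predicts as j."""
    pred = np.argmax(H @ W, axis=1)   # argmax of logits equals argmax of softmax
    mask = (z == j)
    return float((pred[mask] == j).mean())

def ablation_curve(H, z, W, order, j):
    """Accuracy on label j after ablating 0, 1, 2, ... neurons in `order`.

    Assumption: ablating neuron i means zeroing column i of H.
    """
    H_abl = H.copy()
    accs = [per_label_accuracy(H_abl, z, W, j)]
    for i in order:
        H_abl[:, i] = 0.0
        accs.append(per_label_accuracy(H_abl, z, W, j))
    return accs

# Toy setup: neuron 0 signals label 0, neuron 1 signals label 1,
# and neuron 2 gives a small constant push toward label 1.
H_toy = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5],
                  [1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])
z_toy = np.array([0, 1, 0, 1])
W_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.5]])

curve_label0 = ablation_curve(H_toy, z_toy, W_toy, order=[0], j=0)
curve_label1 = ablation_curve(H_toy, z_toy, W_toy, order=[0], j=1)
```

In this toy case, removing neuron 0 destroys accuracy on label 0 while leaving label 1 untouched, which is the qualitative pattern the experiment looks for.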
From the figures we can see that when ablating neurons according to a certain label's importance ranking, the accuracy of the label drops much faster than the other labels. The overall performance, however, remains more or less unaffected. This shows that while a single neuron can be important for a subset of labels, the overall performance is more robust to neuron ablation. This further verifies our observations from the previous subsection: information is distributed among multiple neurons in various ways. A neuron may have encoded important information for a certain label, but it is unlikely that all important information is concentrated in one neuron.
It is worth noting that a neuron can be important for multiple labels. Therefore, when ablating neurons according to one label's importance ranking, the performance for other labels may also degrade. This can be seen in the left plot of Figure 5. Neuron #38 appears in both the top ten lists of B-LOC and I-ORG (shaded boxes in Figure 3), and when it is ablated (the seventh ablated neuron), not only the performance of B-LOC, but also that of I-ORG, is compromised. The fourth ablated neuron in the right plot of Figure 5 has a similar behavior, but it is less significant, probably because this neuron is ranked seventh for B-LOC and is therefore less important than it is for I-ORG. This phenomenon is less significant in Figure 4, since the top neurons from importance rankings of B-PER and B-MISC have fewer overlaps.

Correlation Between Labels
Even though the distribution of information in neurons may seem arbitrary, we want to see if multiple, independently-trained models share any common traits.
In addition to the model we have used in previous sections, we train three more models with the same model architecture and hyperparameters but different random seeds. We compute neuron-label sensitivity for all four models using both the development set and the test set. For each of the models, we compute the correlation between all labels among different neurons. The sensitivity matrix for each model has 50 rows (neurons) and 9 columns (labels); see Figure 7 in the appendix for an example. The correlation between two labels is computed from the corresponding columns, over all rows of the matrix. The results are shown in Figure 6.
While there are differences among the four correlation plots, they share the following patterns:
• Label pairs of the form B-x and I-x (where x is PER/LOC/ORG/MISC) are generally positively correlated. We can observe some dark-red 2 × 2 blocks on the diagonal. Although for each trained model, it might be different neurons (i.e., neuron #) that encode information about B-x, these neurons typically also carry information about I-x.
• The three labels I-LOC, I-ORG, and I-MISC are also positively correlated with one another.
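The label-label correlation can be sketched with NumPy as below. The text does not name the correlation coefficient, so Pearson correlation is an assumption here, and the sensitivity matrix is random toy data with two columns made deliberately similar.

```python
import numpy as np

# Toy sensitivity matrix: 50 neurons (rows) by 9 labels (columns), random values.
rng = np.random.default_rng(0)
S_model = rng.standard_normal((50, 9))
# Make label 1 closely track label 0, imitating a B-x / I-x pair.
S_model[:, 1] = S_model[:, 0] + 0.1 * rng.standard_normal(50)

# rowvar=False treats each column (label) as a variable and each row (neuron)
# as an observation, giving a 9 x 9 label-by-label correlation matrix.
# Assumption: Pearson correlation.
C = np.corrcoef(S_model, rowvar=False)

pair_corr = C[0, 1]  # correlation between label 0 and label 1
```

Repeating this for each independently trained model gives one 9 × 9 matrix per model, which can then be compared even though the neuron indices themselves are not aligned across models.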
• Label pairs of the form B-x and I-y (where x and y are different entities) are generally negatively correlated, e.g., I-PER with any of B-LOC, B-ORG, and B-MISC.
• The label O is negatively correlated with all I-x labels.
Although it remains unclear what information the neurons exactly encode, we speculate that there are at least two kinds of information, based on the observed patterns:
• Coarse-grained types of the current word. For example, whether the word is related to PER, or LOC/ORG/MISC, or O.
• Entity boundary location. If the previous prediction is O, it means the current word should be either another O or the left boundary of an entity, and thus the model should only predict O or B-x, but never I-x. Hence, I-x is negatively correlated with B-y and O.

Conclusion
In this paper, we try to understand the mechanisms of LSTMs by measuring and dissecting LSTM neurons. We discover that the prediction of each label is based on label-specific information, which is distributed among different groups of neurons. We propose a method to quantify and rank the importance of each neuron for each label, and further conduct ablation experiments to verify the validity and consistency of such importance rankings. Results show that the importance of a neuron is very different for different labels.
Future work. We consider the following three directions as future work.