Explaining Predictions of Non-Linear Classifiers in NLP

Layer-wise relevance propagation (LRP) is a recently proposed technique for explaining predictions of complex non-linear classifiers in terms of input variables. In this paper, we apply LRP for the first time to natural language processing (NLP). More precisely, we use it to explain the predictions of a convolutional neural network (CNN) trained on a topic categorization task. Our analysis highlights which words are relevant for a specific prediction of the CNN. We compare our technique to standard sensitivity analysis, both qualitatively and quantitatively, using a "word deleting" perturbation experiment, a PCA analysis, and various visualizations. All experiments validate the suitability of LRP for explaining the CNN predictions, which is also in line with results reported in recent image classification studies.


Introduction
Following seminal work by Bengio et al. (2003) and Collobert et al. (2011), the use of deep learning models for natural language processing (NLP) applications has received increasing attention in recent years. In parallel, initiated by the computer vision domain, there is also a trend toward understanding deep learning models through visualization techniques (Erhan et al., 2010; Landecker et al., 2013; Zeiler and Fergus, 2014; Simonyan et al., 2014; Lapuschkin et al., 2016a) or through decision tree extraction (Krishnan et al., 1999). Most work dedicated to understanding neural network classifiers for NLP tasks (Denil et al., 2014; Li et al., 2015) uses gradient-based approaches. Recently, a technique called layer-wise relevance propagation (LRP) has been shown to produce more meaningful explanations in the context of image classification. In this paper, we apply the same LRP technique to an NLP task, where a neural network maps a sequence of word2vec vectors representing a text document to its category, and evaluate whether similar benefits in terms of explanation quality are observed.
In the present work we contribute by (1) applying the LRP method to the NLP domain, (2) proposing a technique for quantitative evaluation of explanation methods for NLP classifiers, and (3) qualitatively and quantitatively comparing two different explanation methods, namely LRP and a gradient-based approach, on a topic categorization task using the 20Newsgroups dataset.

Explaining Predictions of Classifiers
We consider the problem of explaining a prediction $f(x)$ associated with an input $x$ by assigning to each input variable $x_d$ a score $R_d$ determining how relevant the input variable is for explaining the prediction. The scores can be pooled into groups of input variables (e.g. all word2vec dimensions of a word, or all components of an RGB pixel), such that they can be visualized as heatmaps of highlighted texts, or as images.

Layer-Wise Relevance Propagation
Layer-wise relevance propagation is a newly introduced technique for obtaining these explanations. It can be applied to various machine learning classifiers such as deep convolutional neural networks. The LRP technique produces a decomposition of the function value $f(x)$ on its input variables that satisfies the conservation property:
$$f(x) = \sum_d R_d \tag{1}$$
The decomposition is obtained by performing a backward pass on the network, where for each neuron, the relevance associated with it is redistributed to its predecessors. Considering neurons mapping a set of $n$ inputs $(x_i)_{i \in [1,n]}$ to the neuron activation $x_j$ through the sequence of functions:
$$z_{ij} = x_i w_{ij} + \frac{b_j}{n}, \qquad z_j = \sum_i z_{ij}, \qquad x_j = g(z_j)$$
where for convenience, the neuron bias $b_j$ has been distributed equally to each input neuron, and where $g(\cdot)$ is a monotonically increasing activation function. Denoting by $R_i$ and $R_j$ the relevance associated with $x_i$ and $x_j$, the relevance is redistributed from one layer to the other by defining messages $R_{i \leftarrow j}$ indicating how much relevance must be propagated from neuron $x_j$ to its input neuron $x_i$ in the lower layer. These messages are defined as:
$$R_{i \leftarrow j} = \frac{z_{ij} + \frac{s(z_j)}{n}}{\sum_i z_{ij} + s(z_j)} \, R_j$$
where $s(z_j) = \epsilon \cdot (1_{z_j \geq 0} - 1_{z_j < 0})$ is a stabilizing term that handles near-zero denominators, with $\epsilon$ set to 0.01. The intuition behind this local relevance redistribution formula is that each input $x_i$ should be assigned relevance proportionally to its contribution in the forward pass, in a way that the relevance is preserved ($\sum_i R_{i \leftarrow j} = R_j$). Each neuron in the lower layer receives relevance from all upper-level neurons to which it contributes:
$$R_i = \sum_j R_{i \leftarrow j}$$
This pooling ensures layer-wise conservation: $\sum_i R_i = \sum_j R_j$. Finally, in a max-pooling layer, all relevance at the output of the layer is redistributed to the pooled neuron with maximum activation (i.e. winner-take-all). An implementation of LRP can be found in (Lapuschkin et al., 2016b) and downloaded from www.heatmapping.org.
arXiv:1606.07298v1 [cs.CL] 23 Jun 2016
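The redistribution rule of this section can be illustrated with a minimal NumPy sketch. It is not the reference implementation from www.heatmapping.org; the function names `lrp_dense` and `lrp_max_pool`, the array layouts, and the random toy layer are assumptions made for illustration only.

```python
import numpy as np

def lrp_dense(x, W, b, R_out, eps=0.01):
    """Redistribute the relevance R_out (shape [m]) of a linear layer onto its inputs.

    x has shape [n], W has shape [n, m], b has shape [m] (assumed layout).
    """
    n = x.shape[0]
    z_ij = x[:, None] * W + b[None, :] / n        # contributions, bias spread equally
    z_j = z_ij.sum(axis=0)                        # pre-activations
    s_j = eps * np.where(z_j >= 0, 1.0, -1.0)     # stabilizer s(z_j)
    messages = (z_ij + s_j[None, :] / n) / (z_j + s_j)[None, :] * R_out[None, :]
    return messages.sum(axis=1)                   # R_i = sum over j of R_{i<-j}

def lrp_max_pool(x, R_out):
    """Winner-take-all: x has shape [T, F] (positions, filters), R_out has shape [F]."""
    R_in = np.zeros_like(x)
    winners = x.argmax(axis=0)                    # position of the maximum per filter
    R_in[winners, np.arange(x.shape[1])] = R_out
    return R_in

# Toy layer with random values (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=5)
W = rng.normal(size=(5, 3))
b = rng.normal(size=3)
R_out = rng.normal(size=3)
R_in = lrp_dense(x, W, b, R_out)

# Layer-wise conservation: both sums agree up to floating-point error.
print(R_in.sum(), R_out.sum())
```

Because the stabilizer is spread over the inputs in the numerator as well as added to the denominator, the messages of each upper-layer neuron sum exactly to its relevance, which is why the two printed sums coincide.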

Sensitivity Analysis
An alternative procedure called sensitivity analysis (SA) produces explanations by scoring input variables based on how they affect the decision output locally (Dimopoulos et al., 1995; Gevrey et al., 2003). The sensitivity of an input variable is given by its squared partial derivative:
$$R_d = \left( \frac{\partial f}{\partial x_d} \right)^2$$
Here, we note that unlike LRP, sensitivity analysis does not preserve the function value $f(x)$, but the squared $\ell_2$-norm of the function gradient:
$$\| \nabla_x f(x) \|_2^2 = \sum_d R_d \tag{2}$$
This quantity is however not directly related to the amount of evidence for the category to detect. Similar gradient-based analyses (Denil et al., 2014; Li et al., 2015) have been recently applied in the NLP domain, and were also used by Simonyan et al. (2014) in the context of image classification. While recent work uses different relevance definitions for a group of input variables (e.g. gradient magnitude in Denil et al. (2014) or max-norm of absolute value of simple derivatives in Simonyan et al. (2014)), in the present work (unless otherwise stated) we employ the squared $\ell_2$-norm of gradients, allowing for decomposition of Eq. 2 as a sum over relevances of input variables.
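As a rough illustration of sensitivity analysis, the sketch below scores each input variable by its squared partial derivative. The gradient is approximated by central finite differences purely to keep the example self-contained; in practice it would come from backpropagation. The toy scoring function `f` and the names `numerical_gradient` and `sensitivity_scores` are hypothetical.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x, dtype=float)
    for d in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step.flat[d] = h
        grad.flat[d] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

def sensitivity_scores(f, x):
    """SA relevance of each input variable: squared partial derivative."""
    return numerical_gradient(f, x) ** 2

# Toy scoring function (assumption, stands in for one classifier output).
w = np.array([1.0, -2.0, 0.5])
f = lambda x: np.tanh(w @ x)

x = np.array([0.3, 0.1, -0.4])
R = sensitivity_scores(f, x)

# The scores are non-negative and sum to the squared l2-norm of the gradient,
# not to the function value f(x).
print(R, R.sum(), f(x))
```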

Experiments
For the following experiments we use the 20news-bydate version of the 20Newsgroups dataset, consisting of 11314/7532 train/test documents evenly distributed among twenty fine-grained categories.

CNN Model
As a document classifier we employ a word-based CNN similar to Kim (2014) consisting of the following sequence of layers:
Conv → ReLU → 1-Max-Pool → FC
By 1-Max-Pool we denote a max-pooling layer where the pooling regions span the whole text length, as introduced in (Collobert et al., 2011). Conv, ReLU and FC denote the convolutional layer, rectified linear units activation and fully-connected linear layer. For building the CNN numerical input we concatenate horizontally 300-dimensional pre-trained word2vec vectors (Mikolov et al., 2013; see footnote 4), in the same order the corresponding words appear in the pre-processed document, and further keep this input representation fixed during training. The convolutional operation we apply in the first neural network layer is one-dimensional and along the text sequence direction (i.e. along the horizontal direction). The receptive field of the convolutional layer neurons spans the entire word embedding space in vertical direction, and covers two consecutive words in horizontal direction. The convolutional layer filter bank contains 800 filters.
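The forward pass of such a network can be sketched in NumPy as follows. This is a toy illustration with random weights and small dimensions, not the trained model; the function name `cnn_forward` and the array layouts are assumptions, while the filter width of two words follows the description above.

```python
import numpy as np

def cnn_forward(X, W_conv, b_conv, W_fc, b_fc):
    """Conv -> ReLU -> 1-Max-Pool -> FC on a single document.

    X:      [D, T]    word embeddings concatenated horizontally
    W_conv: [F, D, 2] filters spanning the embedding space and 2 consecutive words
    b_conv: [F],  W_fc: [F, C],  b_fc: [C]   (assumed layouts)
    """
    # Pair each word with its right neighbour: windows[:, t, :] holds words t and t+1.
    windows = np.stack([X[:, :-1], X[:, 1:]], axis=-1)                   # [D, T-1, 2]
    conv = np.einsum('fdk,dtk->ft', W_conv, windows) + b_conv[:, None]   # [F, T-1]
    relu = np.maximum(conv, 0.0)
    pooled = relu.max(axis=1)              # max over the whole text length, [F]
    return pooled @ W_fc + b_fc            # one score per class, [C]

# Toy dimensions (the paper uses D=300 and 800 filters; these are smaller).
rng = np.random.default_rng(0)
D, T, F, C = 4, 6, 3, 2
X = rng.normal(size=(D, T))
W_conv = rng.normal(size=(F, D, 2))
b_conv = rng.normal(size=F)
W_fc = rng.normal(size=(F, C))
b_fc = rng.normal(size=C)

scores = cnn_forward(X, W_conv, b_conv, W_fc, b_fc)
print(scores.shape)
```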

Experimental Setup
As pre-processing we remove the document headers, tokenize the text with NLTK (footnote 5), filter out punctuation and numbers (footnote 6), and finally truncate each document to the first 400 tokens. We train the CNN by stochastic mini-batch gradient descent with momentum (with $\ell_2$-norm penalty and dropout). Our trained classifier achieves a classification accuracy of 80.19% (footnote 7).
Footnote 4: GoogleNews-vectors-negative300, https://code.google.com/p/word2vec/
Footnote 5: We employ the tokenizers sent_tokenize and word_tokenize recommended in NLTK version 3.1, module nltk.tokenize.
Footnote 6: We retain only tokens composed of the following characters: alphabetic characters, apostrophe, hyphen and dot, and containing at least one alphabetic character.
Footnote 7: To the best of our knowledge, the best published 20Newsgroups accuracy is 83.0% (Paskov et al., 2013). However, we note that for simplicity we use a fixed-length document representation, and that our main focus is on explaining classifier decisions, not on improving the classification state of the art.
Due to our input representation, applying LRP or SA to our neural classifier yields one relevance value per word-embedding dimension. To obtain word-level relevances from these single input variable relevances, we sum up the relevances over the word embedding space in the case of LRP, and (unless otherwise stated) take the squared $\ell_2$-norm of the corresponding word gradient in the case of SA. More precisely, given an input document $d$ consisting of a sequence $(w_1, w_2, ..., w_N)$ of $N$ words, each word being represented by a $D$-dimensional word embedding, we compute the relevance $R(w_t)$ of the $t$-th word in the input document through the summation:
$$R(w_t) = \sum_{i=1}^{D} R_{i,t} \tag{3}$$
where $R_{i,t}$ denotes the relevance of the input variable corresponding to the $i$-th dimension of the $t$-th word embedding, obtained by LRP or SA as specified in Sections 2.1 & 2.2.
In particular, in the case of SA, the above word relevance can equivalently be expressed as:
$$R(w_t) = \| \nabla_{w_t} f(d) \|_2^2 \tag{4}$$
where $f(d)$ represents the classifier's prediction for document $d$.
Note that the resulting LRP word relevance is signed, while the SA word relevance is non-negative.
In all experiments, we use the term target class to identify the function $f(x)$ to analyze in the relevance decomposition. This function maps the neural network input to the neural network output variable corresponding to the target class.
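The pooling from single input variable relevances to word-level relevances described in this section amounts to a reduction over the embedding dimension. A minimal sketch, assuming the relevances and gradients are stored as arrays of shape [D, N] (embedding dimension by number of words); the function names and the toy values are hypothetical.

```python
import numpy as np

def word_relevance_lrp(R):
    """Sum LRP relevances over the embedding dimension: [D, N] -> [N], signed."""
    return R.sum(axis=0)

def word_relevance_sa(grad):
    """Squared l2-norm of each word's gradient: [D, N] -> [N], non-negative."""
    return (grad ** 2).sum(axis=0)

# Toy values for a document of N=2 words with D=2 dimensions (hypothetical).
R_lrp = np.array([[1.0, -2.0],
                  [0.5, -1.0]])
grad = np.array([[3.0, 0.0],
                 [4.0, 1.0]])

lrp_words = word_relevance_lrp(R_lrp)   # first word supports, second word opposes
sa_words = word_relevance_sa(grad)      # both scores are non-negative
print(lrp_words, sa_words)
```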

Evaluating Word-Level Relevances
In order to evaluate different relevance models, we perform a sequence of "word deletions" (to delete a word, we simply set its word vector to zero in the input document representation) and track the impact of these deletions on the classification performance. We carry out two deletion experiments, starting either with the set of test documents that are initially classified correctly, or with those that are initially classified wrongly. We estimate the LRP/SA word relevances using the true document class as the target class. Subsequently, we delete words in decreasing order of the obtained word relevances in the first experiment, and in increasing order in the second. Fig. 1 summarizes our results. We find that LRP yields the best results in both deletion experiments. This provides evidence that positive LRP relevance is assigned to words that support a classification decision, while negative LRP relevance is assigned to words that inhibit this decision. In the first experiment, the SA classification accuracy curve decreases significantly faster than the random curve, which represents the performance change when deleting words at random, indicating that SA is able to identify relevant words. However, the SA curve is clearly above the LRP curve, indicating that LRP provides better explanations for the CNN predictions. Similar results have been reported for image classification tasks. The second experiment indicates that the classification performance increases
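The deletion procedure described in this section can be sketched as follows. This is an illustrative outline only: `predict` is a hypothetical callable standing in for the trained classifier, documents are assumed to be arrays of shape [D, N], and the toy data below is invented.

```python
import numpy as np

def delete_words(X, relevance, k, descending=True):
    """Set to zero the k word vectors with highest (or lowest) relevance. X: [D, N]."""
    order = np.argsort(relevance)
    if descending:
        order = order[::-1]
    X_deleted = X.copy()
    X_deleted[:, order[:k]] = 0.0
    return X_deleted

def accuracy_after_deletions(predict, docs, relevances, labels, max_k, descending=True):
    """Accuracy after deleting 0, 1, ..., max_k words per document."""
    curve = []
    for k in range(max_k + 1):
        preds = [predict(delete_words(X, R, k, descending))
                 for X, R in zip(docs, relevances)]
        curve.append(float(np.mean(np.array(preds) == np.array(labels))))
    return curve

# Toy setup: one document, a hypothetical threshold "classifier".
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
rel = np.array([0.1, 0.9, -0.5])
predict = lambda doc: int(doc.sum() > 10)

curve = accuracy_after_deletions(predict, [X], [rel], [1], max_k=2, descending=True)
print(curve)
```

Passing `descending=False` gives the ordering used for the initially misclassified documents, where the words with the lowest relevance are removed first.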

Document Highlighting
Word-level relevances can be used for highlighting purposes. In Fig. 2 we provide such visualizations on one test document for different relevance target classes, using either LRP or SA relevance models. We can observe that while the word ride is highly negative-relevant for LRP when the target class is not rec.motorcycles, it is positively highlighted (even though not heavily) by SA. This suggests that SA does not clearly discriminate between words speaking for or against a specific classifier decision, while LRP is more discerning in this respect.

Document Visualization
Word2vec embeddings are known to exhibit linear regularities representing semantic relationships between words (Mikolov et al., 2013). We explore whether these regularities can be transferred to a document representation when using a linear combination of word2vec embeddings as the document vector. As a weighting scheme we employ LRP or SA scores, with the classifier's predicted class as the target class for the relevance estimation. For comparison we perform uniform weighting, where we simply sum up the word embeddings of the document words (SUM). For SA we use either the $\ell_2$-norm or the squared $\ell_2$-norm for pooling word gradient values along the word2vec dimensions, i.e. in addition to the standard SA word relevance defined in Eq. 4, we use as an alternative $R_{\mathrm{SA}(\ell_2)}(w_t) = \| \nabla_{w_t} f(d) \|_2$ and denote this relevance model by SA($\ell_2$).
For both LRP and SA, we employ different variations of the weighting scheme. More precisely, given an input document $d$ composed of the sequence $(w_1, w_2, ..., w_N)$ of $D$-dimensional word2vec embeddings, we build new document representations $d_{\text{w}}$ and $d_{\text{e.w.}}$ (footnote 9) by either using word-level relevances $R(w_t)$ (as in Eq. 3), or through element-wise multiplication of word embeddings with single input variable relevances $(R_{i,t})_{i \in [1,D]}$ (we recall that $R_{i,t}$ is the relevance of the input variable corresponding to the $i$-th dimension of the $t$-th word in the input document $d$). More formally we use:
$$d_{\text{w}} = \sum_{t=1}^{N} R(w_t) \cdot w_t$$
or
$$d_{\text{e.w.}} = \sum_{t=1}^{N} \left[ R_{1,t}, R_{2,t}, \dots, R_{D,t} \right] \odot w_t$$
where $\odot$ is an element-wise multiplication. Finally we normalize the document vectors $d_{\text{w}}$ and $d_{\text{e.w.}}$ to unit $\ell_2$-norm and perform a PCA projection. In Fig. 3 we label the resulting 2D-projected test documents using five top-level document categories.
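The two weighting schemes, the normalization and the projection can be sketched in NumPy as below. The PCA is computed here through an SVD of the centered document vectors, which is one common way to obtain it and an assumption about the procedure; function names, array layouts ([D, N] per document) and the random data are likewise hypothetical.

```python
import numpy as np

def doc_vector_word_level(W, word_rel):
    """Weighted sum of embeddings with one relevance per word. W: [D, N], word_rel: [N]."""
    return W @ word_rel

def doc_vector_elementwise(W, R):
    """Element-wise product of relevances and embeddings, summed over words. R: [D, N]."""
    return (R * W).sum(axis=1)

def unit_norm(v):
    """Scale v to unit l2-norm (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def pca_2d(V):
    """Project row vectors V [num_docs, D] onto their first two principal components."""
    Vc = V - V.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Vc, full_matrices=False)
    return Vc @ Vt[:2].T

# Toy corpus: 10 documents, 7 words each, 5-dimensional embeddings (random data).
rng = np.random.default_rng(0)
docs = [rng.normal(size=(5, 7)) for _ in range(10)]
rels = [rng.normal(size=(5, 7)) for _ in range(10)]

V_word = np.stack([unit_norm(doc_vector_word_level(W, R.sum(axis=0)))
                   for W, R in zip(docs, rels)])
V_elem = np.stack([unit_norm(doc_vector_elementwise(W, R))
                   for W, R in zip(docs, rels)])
proj_word, proj_elem = pca_2d(V_word), pca_2d(V_elem)
print(proj_word.shape, proj_elem.shape)
```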
Footnote 9: The subscript e.w. stands for element-wise.
[Figure 2: heatmaps of one test document for different relevance target classes, using either LRP or SA as the relevance model. The highlighting is not reproduced here; the text of the document follows.]
It is the body's reaction to a strange environment. It appears to be induced partly to physical discomfort and part to mental distress. Some people are more prone to it than others, like some people are more prone to get sick on a roller coaster ride than others. The mental part is usually induced by a lack of clear indication of which way is up or down, ie: the Shuttle is normally oriented with its cargo bay pointed towards Earth, so the Earth (or ground) is "above" the head of the astronauts. About 50% of the astronauts experience some form of motion sickness, and NASA has done numerous tests in space to try to see how to keep the number of occurances down.
For word-based models, we observe that while standard SA and LRP both provide similar visualization quality, the SA variant with simple $\ell_2$-norm yields partly overlapping and dense clusters; still, all schemes are better than uniform weighting. In the case of SA, note that even though the power to which word gradient norms are raised ($\ell_2$ or $\ell_2^2$) affects the present visualization experiment, it has no influence on the earlier described "word deletion" analysis.
For element-wise models $d_{\text{e.w.}}$, we observe slightly better separated clusters for SA, and a clear-cut cluster structure for LRP.

Conclusion
Through word deletion we quantitatively evaluated and compared two classifier explanation models, and found LRP to be more effective than SA. We investigated the application of word-level relevance information for document highlighting and visualization. Our empirical analysis suggests that the superiority of LRP stems from the fact that it not only reliably identifies the determinant words that support a specific classification decision, but also distinguishes, among the most prominent words, those that are opposed to that decision.
Future work would include applying LRP to other neural network architectures (e.g. character-based or recurrent models) on further NLP tasks, as well as exploring how relevance information could be taken into account to improve the classifier's training procedure or prediction performance.