Interpreting Neural Networks with Nearest Neighbors

Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.


Introduction
The growing use of neural networks in sensitive domains such as medicine, finance, and security raises concerns about human trust in these machine learning systems. A central question is test-time interpretability: how can humans understand the reasoning behind model predictions?
A common way to interpret neural network predictions is to identify the most important input features. For instance, a saliency map highlights important pixels in an image (Sundararajan et al., 2017) or words in a sentence. Given a test prediction, the importance of each input feature is the change in model confidence when that feature is removed.
However, neural network confidence is not a proper measure of model uncertainty (Guo et al., 2017). This issue is emphasized when models make highly confident predictions on inputs that are completely void of information, for example, images of pure noise (Goodfellow et al., 2015) or meaningless text snippets (Feng et al., 2018). Consequently, a model's confidence may not properly reflect whether discriminative input features are present. This issue makes it difficult to reliably judge the importance of each input feature using common confidence-based interpretation methods (Feng et al., 2018).
To address this, we apply Deep k-Nearest Neighbors (DKNN) (Papernot and McDaniel, 2018) to neural models for text classification. Concretely, predictions are no longer made with a softmax classifier, but using the labels of the training examples whose representations are most similar to the test example (Section 3). This provides an alternative metric for model uncertainty, conformity, which measures how much support a test prediction has by comparing its hidden representations to the training data. This representation-based uncertainty measurement can be used in combination with existing interpretation methods, such as leave-one-out, to better identify important input features.
We combine DKNN with CNN and LSTM models on six NLP text classification tasks, including sentiment analysis and textual entailment, with no loss in classification accuracy (Section 4). We compare interpretations generated using DKNN conformity to baseline interpretation methods, finding that DKNN interpretations rarely assign importance to extraneous words that do not align with human perception (Section 5). Finally, we generate interpretations using DKNN conformity for a dataset with known artifacts (SNLI), helping to indicate whether a model has learned superficial patterns. We open source the code for DKNN and our results.

Interpretation Through Feature Attribution
Feature attribution methods explain a test prediction by assigning an importance value to each input feature (typically pixels or words).
In the case of text classification, we have an input sequence of n words $x = w_1, w_2, \ldots, w_n$, represented as one-hot vectors. The word sequence is then converted to a sequence of word embeddings $e = v_1, v_2, \ldots, v_n$. A classifier f outputs a probability distribution over classes. The class with the highest probability is selected as the prediction y, with its probability serving as the model confidence. To create an interpretation, each input word is assigned an importance value, $g(w_i \mid x, y)$, which indicates the word's contribution to the prediction. A saliency map (or heat map) visually highlights the words in a sentence.

Leave-one-out Attribution
A simple way to define the importance g is via leave-one-out: individually remove a word from the input and see how the confidence changes. The importance of word $w_i$ is the decrease in confidence (see footnote 2) when word i is removed:

$$g(w_i \mid x, y) = f(y \mid x) - f(y \mid x_{-i}), \tag{1}$$

where $x_{-i}$ is the input sequence with the ith word removed and $f(y \mid x)$ is the model confidence for class y. This can be repeated for all words in the input. Under this definition, the sign of the importance value is opposite the sign of the confidence change: if a word's removal causes a decrease in the confidence, it gets a positive importance value. We refer to this interpretation method as Confidence leave-one-out in our experiments.
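The leave-one-out definition can be illustrated with a minimal sketch. The bag-of-words sigmoid classifier, its weights, and the names `toy_confidence` and `leave_one_out` are hypothetical stand-ins chosen for illustration; they are not the models used in the experiments.

```python
import math

# Hypothetical per-word scores for a toy two-class (negative, positive) model.
TOY_WEIGHTS = {"great": 2.0, "terrible": -2.0, "movie": 0.1}


def toy_confidence(words, label):
    """Return f(label | words) for a toy bag-of-words sigmoid classifier."""
    margin = sum(TOY_WEIGHTS.get(w, 0.0) for w in words)
    p_positive = 1.0 / (1.0 + math.exp(-margin))
    return p_positive if label == "positive" else 1.0 - p_positive


def leave_one_out(words, label, confidence_fn):
    """Importance of word i: f(y | x) - f(y | x with word i removed)."""
    full = confidence_fn(words, label)
    return [
        full - confidence_fn(words[:i] + words[i + 1:], label)
        for i in range(len(words))
    ]


sentence = ["a", "great", "movie"]
importance = leave_one_out(sentence, "positive", toy_confidence)
for word, value in zip(sentence, importance):
    print(f"{word}: {value:+.3f}")
```

In this toy setting, removing "great" lowers the positive-class confidence the most, so it receives the largest positive importance, while the unknown word "a" receives zero.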

Gradient-Based Feature Attribution
In the case of neural networks, the model f(x) as a function of word $w_i$ is a highly non-linear, differentiable function. Rather than leaving one word out at a time, we can simulate a word's removal by approximating f with a function that is linear in $w_i$ through the first-order Taylor expansion. The importance of $w_i$ is computed as the derivative of f with respect to the one-hot vector:

$$\frac{\partial f}{\partial w_i} = \frac{\partial f}{\partial v_i} \cdot \frac{\partial v_i}{\partial w_i} = \frac{\partial f}{\partial v_i} \cdot v_i$$

Thus, a word's importance is the dot product between the gradient of the class prediction with respect to the embedding and the word embedding itself. This gradient approximation simulates the change in confidence when an input word is removed and has been used in various interpretation methods for NLP (Arras et al., 2016; Ebrahimi et al., 2017). We refer to this interpretation approach as Gradient in our experiments.

Footnote 2: Equivalently, the change in class score or cross-entropy loss.
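The gradient-times-embedding score can be sketched for a model whose gradient is available in closed form. The model below (a sigmoid over the summed embeddings, with weight vector `u`) and the two-dimensional embeddings are assumptions made purely for illustration; a real network would obtain the gradient through backpropagation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def confidence(embeddings, u):
    """Toy model: f = sigmoid(u . sum_i v_i), the positive-class probability."""
    total = [sum(column) for column in zip(*embeddings)]
    return 1.0 / (1.0 + math.exp(-dot(u, total)))


def gradient_importance(embeddings, u):
    """Importance of word i is (df/dv_i) . v_i."""
    f = confidence(embeddings, u)
    # For this toy model, df/dv_i = f * (1 - f) * u for every position i.
    grad = [f * (1.0 - f) * u_k for u_k in u]
    return [dot(grad, v) for v in embeddings]


u = [1.0, -1.0]  # hypothetical classifier weights
embeddings = [[0.0, 0.0], [2.0, 0.0], [0.1, 0.0]]  # "a", "great", "movie"
scores = gradient_importance(embeddings, u)
```

Because the score is linear in each embedding, a word whose embedding is orthogonal to the gradient (here the all-zero vector for "a") receives zero importance.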

Interpretation Method Failures
Interpreting neural networks can have unexpected negative results. Ghorbani et al. (2017) and Kindermans et al. (2017) show how a lack of model robustness and stability can cause egregious interpretation failures in computer vision settings. Feng et al. (2018) extend this to NLP and draw connections between interpretation failures and adversarial examples (Szegedy et al., 2014). To counteract this, new interpretation methods alone are not enough: models must be improved. For instance, Feng et al. (2018) argue that interpretation methods should not rely on prediction confidence as it does not reflect a model's uncertainty.
Following this, we improve interpretations by replacing the softmax confidence with a more robust uncertainty estimate using DKNN (Papernot and McDaniel, 2018). This algorithm maintains the accuracy of standard image classification models while providing a better uncertainty metric capable of defending against adversarial examples.

Deep k-Nearest Neighbors for Sequential Inputs
This section describes Deep k-Nearest Neighbors, its application to sequential inputs, and how we use it to determine word importance values. Papernot and McDaniel (2018) propose Deep k-Nearest Neighbors (DKNN), a modification to the test-time behavior of neural networks. After training completes, the DKNN algorithm passes every training example through the model and saves each layer's representations. This creates a new dataset, whose features are the representations and whose labels are the model predictions. Test-time predictions are made by passing an example through the model and performing k-nearest neighbors classification on the resulting representations. This modification does not degrade the accuracy of image classifiers on several standard datasets (Papernot and McDaniel, 2018).
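The procedure can be sketched in a few lines. Everything concrete here is an assumption for illustration: `toy_layers` is a hypothetical two-layer network, the given labels stand in for the model's predictions on the training set, neighbors are found by brute force, and neighbor labels are pooled across layers before a majority vote.

```python
from collections import Counter


def toy_layers(x):
    """Hypothetical two-layer network: one fixed-size vector per layer."""
    h1 = [x[0] + x[1], x[0] - x[1]]
    h2 = [max(0.0, h1[0]), max(0.0, h1[1])]
    return [h1, h2]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def build_database(train_inputs, train_labels):
    """Save every training example's representation at every layer."""
    database = [[] for _ in toy_layers(train_inputs[0])]
    for x, label in zip(train_inputs, train_labels):
        for layer, rep in enumerate(toy_layers(x)):
            database[layer].append((rep, label))
    return database


def neighbor_labels(database, x, k):
    """Labels of the k nearest training points at each layer, pooled."""
    pooled = []
    for layer, rep in enumerate(toy_layers(x)):
        ranked = sorted(database[layer],
                        key=lambda item: squared_distance(item[0], rep))
        pooled.extend(label for _, label in ranked[:k])
    return pooled


def dknn_predict(database, x, k):
    """Predict the most common label among the pooled neighbors."""
    return Counter(neighbor_labels(database, x, k)).most_common(1)[0][0]


train_inputs = [[1.0, 0.0], [0.9, 0.1], [1.1, 0.2],
                [-1.0, 0.0], [-0.9, -0.1], [-1.1, 0.1]]
train_labels = ["positive"] * 3 + ["negative"] * 3
database = build_database(train_inputs, train_labels)
prediction = dknn_predict(database, [1.0, 0.1], k=3)
```

The softmax layer plays no role at test time in this sketch: the prediction depends only on which stored representations lie closest to the test example.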

Deep k-Nearest Neighbors
For our purposes, the benefit of DKNN is the algorithm's uncertainty metric, the conformity score. This score is the percentage of nearest neighbors belonging to the predicted class. Conformity follows from the framework of conformal prediction (Shafer and Vovk, 2008) and estimates how much the training data supports a classification decision.
The conformity score uses the representations at each neural network layer, and therefore, a prediction only receives high conformity if it largely agrees with the training data at all representation levels. This mechanism defends against adversarial examples (Szegedy et al., 2014), as it is difficult to construct a perturbation which changes the neighbors at every layer. Consequently, conformity is a better uncertainty metric for both regular examples and out-of-domain examples such as noisy or adversarial inputs, making it suitable for interpreting models.
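As a rough sketch of the conformity score as defined here (the fraction of nearest neighbors that share the predicted class), the function below takes the neighbor labels found at each layer and pools them. Pooling all layers into a single fraction and the example labels are assumptions for illustration.

```python
from collections import Counter


def conformity(neighbor_labels_per_layer):
    """Return (predicted class, fraction of all neighbors that share it)."""
    pooled = [label for layer in neighbor_labels_per_layer for label in layer]
    predicted, votes = Counter(pooled).most_common(1)[0]
    return predicted, votes / len(pooled)


# Hypothetical neighbor labels for k = 4 at three layers.
agreeing = [["pos"] * 4, ["pos"] * 4, ["pos"] * 4]
conflicting = [["pos"] * 4,
               ["pos", "pos", "neg", "neg"],
               ["neg", "neg", "neg", "pos"]]

print(conformity(agreeing))
print(conformity(conflicting))
```

The second input keeps the same predicted class but receives lower conformity, because the neighbors at the later layers disagree with the prediction.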

Handling Sequences
The DKNN algorithm requires fixed-size vector representations. To reach a fixed-size representation for text classification, we take either the final hidden state of a recurrent neural network or use max pooling across time (Collobert and Weston, 2008). We consider deep architectures of these two forms, using each of the layers' representations as the features for DKNN.
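The two reductions can be shown on plain lists, where `hidden_states` is a hypothetical sequence of per-time-step vectors. Both functions return a vector whose size depends only on the hidden dimension, not on the sequence length.

```python
def final_hidden_state(hidden_states):
    """Use the last time step as the sequence representation."""
    return hidden_states[-1]


def max_pool_over_time(hidden_states):
    """Take the maximum of each hidden dimension across all time steps."""
    return [max(column) for column in zip(*hidden_states)]


# Hypothetical hidden states: two sequences of different lengths, dimension 3.
short = [[0.1, -0.2, 0.5], [0.4, 0.0, -0.3]]
longer = short + [[-0.6, 0.9, 0.2], [0.0, 0.1, 0.1]]

assert len(max_pool_over_time(short)) == len(max_pool_over_time(longer)) == 3
assert len(final_hidden_state(short)) == len(final_hidden_state(longer)) == 3
```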

Conformity leave-one-out
Using conformity, we generate interpretations through a modified version of leave-one-out. After removing a word, rather than observing the drop in confidence, we instead measure the drop in conformity. Formally, we modify classifier f in Equation 1 to output probabilities based on conformity. We refer to this method as conformity leave-one-out.
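A minimal sketch of conformity leave-one-out follows, assuming a single representation layer, summed two-dimensional word vectors, and a six-point training set. The data and the names `TRAIN`, `WORD_VECTORS`, and `represent` are all hypothetical.

```python
# Hypothetical stored training representations and their labels.
TRAIN = [([2.0, 0.5], "pos"), ([2.0, 0.0], "pos"), ([1.5, 0.5], "pos"),
         ([-2.2, 0.5], "neg"), ([-2.1, 0.0], "neg"), ([-1.6, 0.5], "neg")]
# Hypothetical word vectors; unknown words map to the zero vector.
WORD_VECTORS = {"great": [2.0, 0.0], "movie": [0.0, 0.5]}


def represent(words):
    """Toy sentence representation: the sum of the word vectors."""
    vectors = [WORD_VECTORS.get(w, [0.0, 0.0]) for w in words]
    return [sum(column) for column in zip(*vectors)] if vectors else [0.0, 0.0]


def conformity(words, label, k=3):
    """Fraction of the k nearest training points that carry `label`."""
    rep = represent(words)
    ranked = sorted(
        TRAIN, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], rep)))
    return sum(1 for _, y in ranked[:k] if y == label) / k


def conformity_leave_one_out(words, label):
    """Importance of word i: the drop in conformity when it is removed."""
    full = conformity(words, label)
    return [full - conformity(words[:i] + words[i + 1:], label)
            for i in range(len(words))]


scores = conformity_leave_one_out(["a", "great", "movie"], "pos")
```

In this toy example only "great" changes the set of nearest neighbors, so it is the only word with nonzero importance; removing "movie" moves the representation slightly but leaves the neighbors, and therefore the conformity, unchanged.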

DKNN Maintains Classification Accuracy
Interpretability should not come at the cost of performance. Before investigating how interpretable DKNN is, we first evaluate its accuracy. We experiment with six text classification tasks and two models, verifying that DKNN achieves accuracy comparable to regular classifiers.
CNN Our CNN architecture resembles Kim (2014). We use convolutional filters of size three, four, and five, with max-pooling over time (Collobert and Weston, 2008). The filters are followed by three fully-connected layers. We fine-tune GloVe embeddings (Pennington et al., 2014) of each word. For DKNN, we use the activations from the convolution layer and the three fully-connected layers.
BILSTM Our architecture uses a bidirectional LSTM (Graves and Schmidhuber, 2005), with the final hidden state forming the fixed-size representation. We use three LSTM layers, followed by two fully-connected layers. We fine-tune GloVe embeddings of each word. For DKNN, we use the final activations of the three recurrent layers and the two fully-connected layers.
SNLI Classifier Unlike the other tasks, which consist of a single input sentence, SNLI has two inputs, a premise and a hypothesis. Following Conneau et al. (2017), we use the same model to encode the two inputs, generating representations u for the premise and v for the hypothesis. We concatenate these two representations along with their element-wise product and element-wise absolute difference, arriving at a final representation $[u; v; u * v; |u - v|]$. This vector passes through two fully-connected layers for classification. For DKNN, we use the activations of the two fully-connected layers.
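The combined premise and hypothesis vector can be sketched as follows, assuming the product term is element-wise as in Conneau et al. (2017). The function name `combine` and the example vectors are hypothetical.

```python
def combine(u, v):
    """Concatenate u, v, their element-wise product, and |u - v|."""
    product = [a * b for a, b in zip(u, v)]
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return u + v + product + abs_diff


u = [1.0, -2.0]  # hypothetical premise encoding
v = [0.5, 3.0]   # hypothetical hypothesis encoding
features = combine(u, v)
```

For encodings of dimension d the result has dimension 4d, which is the input size of the first fully-connected layer under this assumption.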
Nearest Neighbor Search For accurate interpretations, we trade efficiency for accuracy and replace the locality-sensitive hashing (Gionis et al., 1999) used by Papernot and McDaniel (2018) with a k-d tree (Bentley, 1975). We use k = 75 nearest neighbors at each layer. The empirical results are robust to the choice of k.
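To make the exact-search choice concrete, here is a small pure-Python k-d tree with k-nearest-neighbor queries. It is an illustrative sketch, not the implementation used in the experiments; the function names and the toy points are assumptions, and a practical system would rely on an optimized library.

```python
import heapq


def build_kd_tree(points, depth=0):
    """Build a k-d tree from (vector, label) pairs as nested tuples."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kd_tree(points[:mid], depth + 1),
            build_kd_tree(points[mid + 1:], depth + 1))


def k_nearest(tree, query, k):
    """Return the k points closest to `query`, nearest first (exact search)."""
    heap = []      # max-heap on distance, stored as (-distance, tie, point)
    visited = [0]  # tie-breaker so that points themselves are never compared

    def visit(node):
        if node is None:
            return
        point, axis, left, right = node
        dist = sum((a - b) ** 2 for a, b in zip(point[0], query))
        visited[0] += 1
        if len(heap) < k:
            heapq.heappush(heap, (-dist, visited[0], point))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, visited[0], point))
        diff = query[axis] - point[0][axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Search the far side only if it could hold a closer point.
        if len(heap) < k or diff ** 2 < -heap[0][0]:
            visit(far)

    visit(tree)
    return [point for _, _, point in sorted(heap, reverse=True)]


train = [([0.0, 0.0], "neg"), ([0.1, 0.2], "neg"), ([1.0, 1.0], "pos"),
         ([0.9, 1.1], "pos"), ([1.2, 0.8], "pos")]
tree = build_kd_tree(train)
nearest = k_nearest(tree, [1.0, 0.9], k=3)
```

Unlike hashing-based approximate search, this traversal prunes a subtree only when it provably cannot contain a closer point, so the returned neighbors match a brute-force search.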

Classification Results
DKNN achieves comparable accuracy on the five classification tasks (Table 1). On SNLI, the BILSTM achieves an accuracy of 81.2% with a softmax classifier and 81.0% with DKNN.

DKNN is Interpretable
Following past work (Murdoch et al., 2018), we focus on the SST dataset for generating interpretations. Due to the lack of standard interpretation evaluation metrics (Doshi-Velez and Kim, 2017), we use qualitative evaluations (Smilkov et al., 2017; Sundararajan et al., 2017), performing quantitative experiments where possible to examine the distinction between the interpretation methods.

Interpretation Analysis
We compare our method (Conformity leave-one-out) against two baselines: leave-one-out using regular confidence (Confidence leave-one-out, see Section 2.1) and the gradient with respect to the input (Gradient, see Section 2.2). To create saliency maps, we normalize each word's importance by dividing it by the total importance of the words in the sentence. We display unknown words in angle brackets <>. In the first example of Table 2, both baselines highlight almost half of the input, including words such as "fiction" and "clash". We suspect model confidence is oversensitive to these unimportant input changes, causing the baseline interpretations to highlight unimportant words. On the other hand, the conformity score better separates word importance, generating clearer interpretations.
The tendency for confidence-based approaches to assign importance to many words holds for the entire test set. We compute the average number of highlighted words using a threshold of 0.05 (a normalized importance value corresponding to a light blue or light red highlight). Out of the average 20.23 words per sentence in the SST test set, gradient highlights 5.32 words, confidence leave-one-out highlights 5.79 words, and conformity leave-one-out highlights 3.65 words.
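The normalization and highlight count can be sketched as below. Two details are assumptions, since the text does not spell them out: the "total importance" is taken to be the sum of absolute values, and a word counts as highlighted when the magnitude of its normalized importance reaches the threshold. The raw values are made up for illustration.

```python
def normalize(importances):
    """Divide each value by the total absolute importance of the sentence."""
    total = sum(abs(value) for value in importances)
    return [value / total if total > 0 else 0.0 for value in importances]


def count_highlighted(importances, threshold=0.05):
    """Count words whose normalized importance magnitude reaches the threshold."""
    return sum(1 for value in normalize(importances) if abs(value) >= threshold)


raw = [0.30, 0.01, -0.08, 0.005, 0.005]  # hypothetical raw importance values
highlighted = count_highlighted(raw)
```

Averaging `count_highlighted` over all test sentences would give the per-method statistic reported above, under these assumptions.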
The second, and related, observation for confidence-based approaches is a bias towards selecting word importance based on a word's inherent sentiment, rather than its meaning in context. For example, see "clash", "terribly", and "unfaithful" in Table 2. The removal of these words causes a small change in the model confidence. When using DKNN, the conformity score indicates that the model's uncertainty has not risen without these input words and leave-one-out does not assign them any importance.
We characterize our interpretation method as significantly higher precision, but slightly lower recall than confidence-based methods. Conformity leave-one-out rarely assigns high importance to words that do not align with human perception of sentiment. However, there are cases when our method does not assign significant importance to any word. This occurs when the input has high redundancy, for example, a positive movie review that describes the sentiment in four distinct ways. In these cases, leaving out a single sentiment word has little effect on the conformity as the model's representation remains supported by the other redundant features. Confidence-based interpretations, which interpret models using the linear units that produce class scores, achieve higher recall by responding to every change in the input for a certain direction but may have lower precision.
In the second example of Table 2, the word "terribly" is assigned a negative importance value, disregarding its positive meaning in context. To examine if this is a stand-alone example or a more general pattern of uninterpretable behavior, we calculate the importance value of the word "terribly" in other positive examples. For each occurrence of the word "great" in positive validation examples, we paraphrase it to "awesome", "wonderful", or "impressive", and add the word "terribly" in front of it. This process yields 66 examples. For each of these examples, we compute the importance value of each input word and rank them from most negative to most positive (the most negative word has a rank of 1). We compare the average ranking of "terribly" from the three methods: 7.9 from conformity leave-one-out, 1.68 from confidence leave-one-out, and 1.1 from gradient. The baseline methods consistently rank "terribly" as the most negative word, ignoring its meaning in context. This echoes our suspicion: DKNN generates interpretations with higher precision because conformity is robust to irrelevant input changes.

Table 2: Our method (Conformity leave-one-out) has higher precision, rarely assigning importance to extraneous words such as "clash" or "fiction". Color legend: Positive Impact, Negative Impact.

Method      Saliency Map
Conformity  an intelligent fiction about learning through cultural clash.
Confidence  an intelligent fiction about learning through cultural clash.
Gradient    an intelligent fiction about learning through cultural clash.

Conformity  <Schweiger> is talented and terribly charismatic.
Confidence  <Schweiger> is talented and terribly charismatic.
Gradient    <Schweiger> is talented and terribly charismatic.

Conformity  Diane Lane shines in unfaithful.
Confidence  Diane Lane shines in unfaithful.
Gradient    Diane Lane shines in unfaithful.

Analyzing Dataset Annotation Artifacts
We use conformity leave-one-out to interpret a model trained on SNLI, a dataset known to contain annotation artifacts. We demonstrate that our interpretation method can help identify when models exploit dataset biases. Recent studies (Gururangan et al., 2018;Poliak et al., 2018) identify annotation artifacts in SNLI. Superficial patterns exist in the input which strongly correlate with certain labels, making it possible for models to "game" the task: obtain high accuracy without true understanding. For instance, the hypothesis of an entailment example is often a general paraphrase of the premise, using words such as "outside" instead of "playing in a park". Contradiction examples often contain negation words or non-action verbs like "sleeping". Models trained solely on the hypothesis can learn these patterns and reach accuracies considerably higher than the majority baseline.
These studies indicate that the SNLI task can be gamed. We look to confirm that some artifacts are indeed exploited by normally trained models that use full input pairs. We create saliency maps for examples in the validation set using conformity leave-one-out. Table 3 shows samples and more can be found on the supplementary website. We use blue highlights to indicate words which positively support the model's predicted class, and the color red to indicate words that support a different class. The first example is a randomly sampled baseline, showing how the words "swims" and "pool" support the model's prediction of contradiction. The other examples are selected because they contain terms identified as artifacts. In the second example, conformity leave-one-out assigns extremely high word importance to "sleeping", disregarding the other words necessary to predict contradiction (i.e., the neutral class is still possible if "pets" is replaced with "people"). In the final two hypotheses, the interpretation method diagnoses the model failure, assigning high importance to "wearing", rather than focusing positively on the shirt color.
To explore this further, we analyze the hypotheses in each SNLI class which contain a top five artifact identified by Gururangan et al. (2018). For each of these examples, we compute the importance value for each input word using both confidence and conformity leave-one-out. We then rank the words from most important for the prediction to least important (a score of 1 indicates highest importance) and report the average rank for the artifacts in Table 4. We sort the words by their Pointwise Mutual Information with the correct label as provided by Gururangan et al. (2018). The word "nobody" particularly stands out: it is the most important input word every time it appears in a contradiction example.
For most of the artifacts, conformity leave-one-out assigns them a high importance, often ranking the artifacts as the most important input word. Confidence leave-one-out correlates less strongly with the known artifacts, frequently ranking them as low as the fifth or sixth most important word. Given the high correlation between conformity leave-one-out and the manually identified artifacts, this interpretation method may serve as a technique to identify undesirable biases a model has learned.
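The average-rank statistic used for the artifact analysis can be sketched as follows. The example sentences and importance values are invented for illustration and are not results from the experiments.

```python
def rank_of(word, words, importances):
    """1-based rank of `word` when words are sorted from most to least important."""
    order = sorted(range(len(words)), key=lambda i: importances[i], reverse=True)
    return 1 + [words[i] for i in order].index(word)


def average_rank(word, examples):
    """Average rank of `word` over the examples that contain it."""
    ranks = [rank_of(word, words, importances)
             for words, importances in examples if word in words]
    return sum(ranks) / len(ranks)


# Hypothetical hypotheses with made-up importance values.
examples = [
    (["nobody", "is", "outside"], [0.6, 0.0, 0.1]),
    (["there", "is", "nobody"], [0.1, 0.0, 0.5]),
    (["nobody", "sleeps"], [0.2, 0.3]),
]
artifact_rank = average_rank("nobody", examples)
```

An average rank close to 1 means the word is usually the most important input according to the interpretation method.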

Discussion and Related Work
We connect the improvements made by conformity leave-one-out to model confidence issues, compare alternative interpretation improvements, and discuss further features of DKNN.

Issues in Neural Network Confidence
Many existing feature attribution methods rely on estimates of model uncertainty: both input gradient and confidence leave-one-out rely on prediction confidence, while our method relies on DKNN conformity. Interpretation quality is thus determined by reliable uncertainty estimation. For instance, past work shows relying on neural network confidence can lead to unreasonable interpretations (Kindermans et al., 2017; Ghorbani et al., 2017; Feng et al., 2018). Independent of interpretability, Guo et al. (2017) show that neural network confidence is unreasonably high: on held-out examples, it far exceeds empirical accuracy. This is further exemplified by the high-confidence predictions produced on inputs that are adversarial (Szegedy et al., 2014) or contain solely noise (Goodfellow et al., 2015).

Confidence Calibration is Insufficient
We attribute one interpretation failure to neural network confidence issues. Guo et al. (2017) study overconfidence and propose a calibration procedure, temperature scaling (a variant of Platt scaling), which adjusts the temperature parameter of the softmax function to align confidence with accuracy on a held-out dataset. However, this is not input dependent: the confidence is lower for both full-length examples and ones with words left out. Hence, selecting influential words will remain difficult.
To verify this, we create an interpretation baseline using temperature scaling. The results corroborate the intuition: calibrating the confidence of leave-one-out does not improve interpretations. Qualitatively, the calibrated interpretation results remain comparable to confidence leave-one-out. Furthermore, calibrating the DKNN conformity score as in Papernot and McDaniel (2018) did not improve interpretability compared to the uncalibrated conformity score.
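The intuition can be checked on a toy two-class case. The logits below and the temperature of 2.0 are hypothetical (in practice the temperature is fit on held-out data); the sketch only shows that a single global temperature rescales every confidence in the same way.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a global temperature applied to all logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (negative, positive) logits for a full sentence and for the
# same sentence with one word removed.
FULL = [0.0, 4.0]
REMOVED = {"a": [0.0, 4.0], "great": [0.0, 0.5], "movie": [0.0, 3.8]}


def leave_one_out_ranking(temperature):
    """Words sorted by the drop in positive-class confidence, largest first."""
    full_conf = softmax(FULL, temperature)[1]
    drops = {word: full_conf - softmax(logits, temperature)[1]
             for word, logits in REMOVED.items()}
    return sorted(drops, key=drops.get, reverse=True)


ranking_uncalibrated = leave_one_out_ranking(1.0)
ranking_calibrated = leave_one_out_ranking(2.0)
```

With two classes the confidence is a monotone function of the logit margin divided by the temperature, so the calibrated and uncalibrated rankings coincide in this example even though every confidence value is lower after scaling.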

Alternative Interpretation Improvements
Recent work improves interpretation methods through other means. Smilkov et al. (2017) and Sundararajan et al. (2017) both aggregate gradient values over multiple backpropagation passes to eliminate local noise or satisfy interpretation axioms. This work does not address model confidence and is orthogonal to our DKNN approach.

Interpretation Through Data Selection
Retrieval-Augmented Convolutional Neural Networks (Zhao and Cho, 2018) are similar to DKNN: they augment model predictions with an information retrieval system that searches over network activations from the training data.
Retrieval-Augmented models and DKNN can both select influential training examples for a test prediction. In particular, the training data activations which are closest to the test point's activations are influential according to the model. These training examples can provide interpretations as a form of analogy (Caruana et al., 1999), an intuitive explanation for both machine learning experts and non-experts (Klein, 1989; Koh and Liang, 2017). However, unlike in computer vision (Papernot and McDaniel, 2018), our experiments did not find human interpretable data points for SST or SNLI.

Table 3: Saliency maps generated with conformity leave-one-out for SNLI validation examples. Color legend: Positive Impact, Negative Impact.

Prediction     Input       Saliency Map
Contradiction  Premise     a young boy reaches for and touches the propeller of a vintage aircraft.
               Hypothesis  a young boy swims in his pool.

Contradiction  Premise     a brown dog and a black dog in the edge of the ocean with a wave under them boats are on the water in the background.
               Hypothesis  the pets are sleeping on the grass.

               Premise     man in a blue shirt standing in front of a structure painted with geometric designs.
Entailment     Hypothesis  a man is wearing a blue shirt.
Entailment     Hypothesis  a man is wearing a black shirt.

Trust in Model Predictions
Model confidence is important for real-world applications: it signals how much one should trust a neural network's predictions. Unfortunately, users may be misled when a model outputs highly confident predictions on rubbish examples (Goodfellow et al., 2015; Nguyen et al., 2015) or adversarial examples (Szegedy et al., 2014). Recent work decides when to trust a neural network model (Ribeiro et al., 2016; Doshi-Velez and Kim, 2017; Jiang et al., 2018), for instance, by analyzing local linear model approximations (Ribeiro et al., 2016) or flagging rare network activations using kernel density estimation (Jiang et al., 2018). The DKNN conformity score is a trust metric that helps defend against image adversarial examples (Papernot and McDaniel, 2018). Future work should study if this robustness extends to interpretations.

Future Work and Conclusion
A robust estimate of model uncertainty is critical to determine feature importance. The DKNN conformity score is one such uncertainty metric which leads to higher precision interpretations. However, DKNN is only a test-time improvement: the model is still trained using maximum likelihood. Combining nearest neighbor and maximum likelihood objectives during training may further improve model accuracy and interpretability. Moreover, other uncertainty estimators do not require test-time modifications, for example, modeling p(x) and p(y | x) using Bayesian Neural Networks (Gal et al., 2016).
Similar to other NLP interpretation methods (Sundararajan et al., 2017), conformity leave-one-out works when a model's representation has a fixed size. For other NLP tasks, such as structured prediction (e.g., translation and parsing) or span prediction (e.g., extractive summarization and reading comprehension), models output a variable number of predictions and our interpretation approach will not suffice. Developing interpretation techniques for these types of models is a necessary area for future work.
We apply DKNN to neural models for text classification. This provides a better estimate of model uncertainty, conformity, which we combine with leave-one-out. This overcomes issues stemming from neural network confidence, leading to higher precision interpretations. Most interestingly, our interpretations are supported by the training data, providing insights into the representations learned by a model.