An Empirical Comparison of Instance Attribution Methods for NLP

Widespread adoption of deep models has motivated a pressing need for approaches to interpret network outputs and to facilitate model debugging. Instance attribution methods constitute one means of accomplishing these goals by retrieving training instances that (may have) led to a particular prediction. Influence functions (IF; Koh and Liang 2017) provide machinery for doing this by quantifying the effect that perturbing individual train instances would have on a specific test prediction. However, even approximating the IF is computationally expensive, to a degree that may be prohibitive in many cases. Might simpler approaches (e.g., retrieving train examples most similar to a given test point) perform comparably? In this work, we evaluate the degree to which different instance attribution methods agree with respect to the importance of training samples. We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods (such as IFs), but that nonetheless exhibit desirable characteristics similar to more complex attribution methods. Code for all methods and experiments in this paper is available at: https://github.com/successar/instance_attributions_NLP.


Introduction
Interpretability methods are intended to help users understand model predictions (Ribeiro et al., 2016; Lundberg and Lee, 2017; Sundararajan et al., 2017; Gilpin et al., 2018). In machine learning broadly and NLP specifically, such methods have focused on feature-based explanations that highlight the parts of inputs 'responsible for' a specific prediction. Feature attribution, however, does not communicate a key basis for model outputs: training data. Recent work has therefore considered methods for surfacing training examples that were influential for a specific prediction (Koh and Liang, 2017; Yeh et al., 2018; Pezeshkpour et al., 2019; Charpiat et al., 2019; Barshan et al., 2020; Han et al., 2020). While such instance-attribution methods provide an appealing mechanism to identify sources that led to specific predictions (which may reveal potentially problematic training examples), they have not yet been widely adopted, at least in part because even approximating influence functions (Koh and Liang, 2017), arguably the most principled attribution method, can be prohibitively expensive in terms of compute. Is such complexity necessary to identify 'important' training points?

* Equal contribution

[Figure 1: instance attribution for an example test instance, "A hilarious romantic comedy" (ŷ = 1).]
Or do simpler methods (e.g., attribution scores based on similarity measures between train and test instances) yield comparable results? In this paper, we set out to evaluate and compare instance attribution methods, including relatively simple and efficient approaches (Rajani et al., 2020), in the context of NLP (Figure 1). We design evaluations intended to probe the following research questions: (1) How correlated are the rankings induced by gradient-based and similarity-based attribution methods (assessing the quality of more efficient approximations)? (2) How does the quality of explanations from similarity-based methods compare to that from gradient-based ones (clarifying whether adopting more complex methods is necessary)?
We evaluate instance-based attribution methods on two datasets: a binarized version of the Stanford Sentiment Treebank (SST-2; Socher et al. 2013) and the Multi-Genre NLI (MNLI) dataset (Williams et al., 2018). We investigate the correlation of more complex attribution methods with simpler approximations and variants (with and without use of the Hessian). Comparing the explanation quality of gradient-based methods against simple similarity retrieval using leave-one-out (Basu et al., 2020) and randomized-test (Hanawa et al., 2021) analyses, we show that simpler methods are fairly competitive. Finally, using the HANS dataset (McCoy et al., 2019), we show the ability of similarity-based methods to surface artifacts in training data.

Attribution Methods
Similarity Based Attribution Consider a text classification task in which we aim to map inputs x_i to labels y_i ∈ Y. We denote the learned representation of x_i by f_i (i.e., the representation from the penultimate network layer). To quantify the importance of training point x_i to the prediction for a target sample x_t, we calculate similarity in the embedding space induced by the model. We consider three measures: Euclidean distance, dot product, and cosine similarity. Specifically, we define the similarity-based attribution scores as sim_EUC(x_i, x_t) = −||f_i − f_t||², sim_DOT(x_i, x_t) = f_iᵀ f_t, and sim_COS(x_i, x_t) = f_iᵀ f_t / (||f_i|| ||f_t||). To investigate the effect of fine-tuning on these similarity measures, we also derive rankings based on similarities between untuned Sentence-BERT (Reimers and Gurevych, 2019) representations.
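As a concrete (unofficial) sketch of these scores, the following uses NumPy arrays as stand-ins for the penultimate-layer representations; the function and variable names are ours, not from the released code.

```python
import numpy as np

def similarity_scores(f_train, f_t, measure="cos"):
    """Score every training representation against one test
    representation f_t; higher scores indicate more 'influence'."""
    if measure == "euc":
        # Negated squared Euclidean distance, so larger means more similar.
        return -np.sum((f_train - f_t) ** 2, axis=1)
    if measure == "dot":
        return f_train @ f_t
    if measure == "cos":
        norms = np.linalg.norm(f_train, axis=1) * np.linalg.norm(f_t)
        return (f_train @ f_t) / norms
    raise ValueError(f"unknown measure: {measure}")

# Rank training points by similarity to a single test point.
rng = np.random.default_rng(0)
f_train = rng.normal(size=(100, 8))  # 100 training representations
f_t = rng.normal(size=8)             # one test representation
ranking = np.argsort(-similarity_scores(f_train, f_t, "cos"))
```

The NN-EUC, NN-DOT, and NN-COS methods in the experiments correspond to rankings under these three measures.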
Gradient Based Attribution Influence Functions (IFs) were proposed in the context of neural models by Koh and Liang (2017) to quantify the contribution made by individual training points to specific test predictions. Denoting model parameter estimates by θ̂, the IF approximates the effect that upweighting instance i by a small amount ε_i would have on the parameter estimates (here H is the Hessian of the loss function with respect to our parameters): dθ̂/dε_i = −H⁻¹ ∇_θ L(x_i, y_i, θ̂). This estimate can in turn be used to derive the effect on a specific test point x_test: ∇_θ L(x_test, y_test, θ̂)ᵀ · dθ̂/dε_i. Aside from IFs, we consider three similar gradient-based variants: Relative Influence (RIF), a normalized version of the IF; Gradient Dot product (GD), which drops the Hessian and scores training points by ∇_θ L(x_test, y_test, θ̂)ᵀ ∇_θ L(x_i, y_i, θ̂); and Gradient Cosine (GC), the cosine-normalized version of GD. RIF was proposed by Barshan et al. (2020), and GD and GC by Charpiat et al. (2019).
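Since GD and GC drop the Hessian, they reduce to (normalized) inner products between per-example loss gradients. A minimal sketch under that reading, with random vectors standing in for ∇_θL (names are ours):

```python
import numpy as np

def grad_dot(g_test, g_train):
    """GD: score each training point by the dot product of its loss
    gradient with the test-point loss gradient (no Hessian)."""
    return g_train @ g_test

def grad_cos(g_test, g_train):
    """GC: cosine-normalized variant of GD."""
    return (g_train @ g_test) / (
        np.linalg.norm(g_train, axis=-1) * np.linalg.norm(g_test)
    )

rng = np.random.default_rng(0)
g_train = rng.normal(size=(50, 16))  # per-training-example gradients
g_test = rng.normal(size=16)         # gradient at the test point
gd_ranking = np.argsort(-grad_dot(g_test, g_train))
gc_ranking = np.argsort(-grad_cos(g_test, g_train))
```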
Representer Points (REP; Yeh et al. 2018) were introduced to approximate the influence of training points on a test sample, for classifiers composed of a feature extractor and an (L2-regularized) linear layer producing outputs φ(x_i, θ). Yeh et al. (2018) showed that for such models the output for any target instance x_t can be expressed as a linear decomposition over the "data importance" of training instances: φ(x_t, θ) = Σ_i α_i f_iᵀ f_t, where α_i = −(1/(2λn)) ∂L(x_i, y_i, θ)/∂φ(x_i, θ), λ is the regularization strength, and n is the number of training points; the term α_i f_iᵀ f_t is taken as the attribution of training point i.
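A small sketch of the resulting representer scores, assuming a binary classifier so that the per-example output gradient `dL_dphi` is a scalar (all names are ours, not from the released code):

```python
import numpy as np

def representer_scores(f_train, dL_dphi, f_t, lam):
    """REP: attribution of each training point to target x_t as
    alpha_i * <f_i, f_t>, with alpha_i = -dL/dphi_i / (2 * lam * n)."""
    n = f_train.shape[0]
    alpha = -dL_dphi / (2.0 * lam * n)
    return alpha * (f_train @ f_t)

rng = np.random.default_rng(0)
f_train = rng.normal(size=(20, 4))
dL_dphi = rng.normal(size=20)  # per-example output-gradient of the loss
scores = representer_scores(f_train, dL_dphi, rng.normal(size=4), lam=0.01)
```

A large |α_i| flags a globally important training point; the inner product with f_t localizes that importance to the target instance.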

Experimental Setup
Datasets To evaluate different attribution methods, we conduct several experiments on sentiment analysis and NLI tasks, following prior work investigating the use of IFs specifically for NLP (Han et al., 2020). We adopt a binarized version of the Stanford Sentiment Treebank (SST-2; Socher et al. 2013) and the Multi-Genre NLI (MNLI) dataset (Williams et al., 2018). For fine-tuning on MNLI, we randomly sample 10k training instances. Finally, to evaluate the ability of instance attribution methods to reveal annotation artifacts in NLI, we randomly sampled 1000 instances from the HANS dataset (more details in the Appendix).
Models We define models for both tasks on top of BERT (Devlin et al., 2019), tuning hyperparameters on validation data via grid search. Our models achieve 90.6% accuracy on SST and 71.2% accuracy on MNLI (more details in the Appendix).
Computing the IF for BERT Deriving the IF for all parameters θ of a BERT-based model requires the corresponding inverse Hessian. We compute the inverse-Hessian-vector product (IHVP) H⁻¹ ∇_θ L(x, y, θ) directly, because storing the full matrix of |θ|² elements is practically impossible (it would require ∼12 PB of storage). We approximate the IHVP using the LiSSA algorithm (Agarwal et al., 2017). This method remains expensive to run and is sensitive to the norm of the IHVP approximation. For computational reasons, we therefore consider the IF with respect to the subset of parameters corresponding to the top five layers [IF (Top-5)] and to only the last linear layer [IF (Linear)], yielding a procedure that is a few orders of magnitude faster (the algorithm becomes increasingly unstable as we incorporate additional layers). We also use a large scaling factor to aid convergence.
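For reference, a toy version of the LiSSA recursion with an explicit Hessian-vector product. In our setting the `hvp` callable would be an autodiff HVP over the model loss, and `scale`/`damping` play the stabilizing roles described above; this is a sketch, not the exact configuration used in the experiments.

```python
import numpy as np

def lissa_ihvp(hvp, v, steps=200, scale=10.0, damping=0.0):
    """Approximate H^{-1} v via the LiSSA fixed-point recursion
    h_j = v + h_{j-1} - ((H + damping*I)/scale) h_{j-1}.
    `scale` must exceed the largest eigenvalue of H for convergence;
    `damping` stabilizes near-singular Hessians."""
    h = v.copy()
    for _ in range(steps):
        h = v + h - (hvp(h) + damping * h) / scale
    # The recursion converges to scale * H^{-1} v, so rescale.
    return h / scale

# Toy check against an explicit positive-definite Hessian.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
v = np.array([1.0, -1.0])
approx = lissa_ihvp(lambda u: H @ u, v, steps=2000, scale=5.0)
```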

Experiments
In this section, we first investigate the correlation between different methods. Then, to study the quality of explanations we conduct leave-some-out experiments, and further analyze attribution methods on HANS data. We consider four evaluations (more analyses and experimental details in the Appendix).
(1) Calculating the correlation of each pair of attribution methods, assessing whether simple methods induce rankings similar to more complex ones.
(2) Removing the most influential samples according to each method, retraining, and observing the change in the predicted probability for the originally predicted class, under the assumption that more accurate attribution methods will cause a larger drop.
(3) Following the randomized test of Hanawa et al. (2021), measuring the ranking correlation of each method between (a) randomly initialized and (b) trained models, under the assumption that high correlation here would suggest less meaningful attribution.
(4) Using the HANS dataset, assessing whether attribution methods surface training instances that exhibit known artifacts (lexical overlap).

Attribution Methods' Correlation We calculate the Spearman correlation between the scores assigned to training samples by different methods, allowing us to compare their induced rankings. More specifically, we randomly sample 100 test and 500 training samples from each dataset and report the average Spearman correlation. We report attribution methods' correlations on the SST and MNLI datasets in Figure 2 (a more complete version of these figures is in the Appendix). We make the following observations. (1) Gradient methods with and without the Hessian appear similar within each normalization regime, e.g., GC is similar to RIF and GD is similar to IF, suggesting that Hessian information may not be necessary to provide meaningful attributions (GD and GC do not use the Hessian). (2) There is a high correlation between IF calculated over the top five layers of BERT and IF over only the last linear layer. (3) There is only a modest correlation between similarity-based and gradient-based rankings, suggesting that these methods do differ in terms of the importance they assign to training instances. We report the proportion of common top examples between IF (Top-5) and IF (Linear) in the Appendix, providing further evidence of the high correlation between these two variants.
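The pairwise comparison amounts to correlating, per test point, the scores two methods assign to the same pool of training points and averaging. A sketch with random score matrices as stand-ins (names and shapes are ours):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation, assuming no ties in the scores
    (attribution scores are continuous, so ties are unlikely)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))

def mean_spearman(scores_a, scores_b):
    """Average rank correlation between two methods' attribution scores,
    one row of scores per test point over a shared training pool."""
    return float(np.mean([spearman_rho(np.asarray(a), np.asarray(b))
                          for a, b in zip(scores_a, scores_b)]))

rng = np.random.default_rng(0)
s_if = rng.normal(size=(100, 500))               # e.g., IF scores
s_gd = s_if + 0.5 * rng.normal(size=(100, 500))  # a correlated variant
rho = mean_spearman(s_if, s_gd)                  # high, but below 1.0
```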
Removing 'Important' Samples In Table 1 we report the average results of removing the top-k most important training points according to each method and retraining.

Table 1: Average difference (∆) between predictions made after training on (i) all data and (ii) a subset in which we remove the top-50/top-500 most important training points according to different methods (Random has a standard deviation of around 0.02 on both benchmarks). We also report the Spearman correlation between the ranking induced by each approach using a trained model and the same ranking when a randomly initialized model is used.

Randomized-Test
We report the Spearman correlation between trained and random models for SST and MNLI in Table 1. Ideally this correlation would be small in magnitude (non-zero values indicate correlation). Curiously, gradient-based methods (IF, REP, GD) exhibit negative correlations on the SST dataset. Overall, these results suggest that gradient-based approaches without gradient normalization may be inferior to alternative methods. The simple NN-DOT method provides the 'best' performance according to this metric.

Artifacts and Attribution Methods
The average lexical overlap rate for 1000 random samples from the HANS dataset is provided in Table 2. As a baseline, we also apply similarity-based methods on top of sentence-BERT embeddings, which, as expected, behave much like random selection. One can observe that similarity-based approaches tend to surface instances with higher lexical overlap than gradient-based instance attribution methods. Moreover, gradient-based methods without normalization (IF, GD, and REP) perform similarly to selecting samples randomly or via sentence-BERT representations, suggesting an inability to usefully identify lexical overlap.

Computational Complexity The computational complexity of IF-based instance attribution methods constitutes an important practical barrier to their use. This complexity depends on the number of model parameters taken into consideration. As a result, computing the IF is effectively infeasible if we consider all parameters of modern, medium-to-large models such as BERT.
If we only consider the parameters of the last linear layer, comprising O(p) parameters, the computational bottleneck in approximating the IF is the inverse Hessian, which can be approximated with high accuracy in O(p²). There are ways to approximate the inverse Hessian more efficiently (Pearlmutter, 1994), though this results in worse performance. Similarity-based measures, by contrast, can be calculated in O(p).
With respect to wall-clock running time, calculating the influence of a single test sample with respect to the parameters of the top five layers of a BERT-based model for SST classification on a reasonably modern GPU requires ∼5 minutes. For the linear variant, this falls to < 0.01 seconds; similarity-based approaches require < 0.0001 seconds. Extrapolating these numbers, computing IF (Top-5) for all 1821 test samples in SST would take about 6 days, versus around 0.2 seconds for similarity-based methods.

Conclusions
Instance attribution methods constitute a promising approach to better understanding how modern NLP models come to make the predictions that they do (Han et al., 2020; Koh and Liang, 2017). However, approximating the IF to quantify the importance of training samples is prohibitively expensive. In this work, we investigated whether alternative, simpler and more efficient methods provide similar instance attribution scores. We observed high correlations between (1) the IF computed over different parameter subsets, i.e., IF (Top-5) vs. IF (Linear), and (2) methods with and without Hessian information, i.e., IF vs. GD and RIF vs. GC. We also considered even simpler, similarity-based approaches and compared the importance rankings over training instances induced by these to rankings under gradient-based methods. Through leave-some-out, randomized-test, and artifact detection experiments, we demonstrated that these simple similarity-based methods are surprisingly competitive. This suggests future directions for work on fast and useful instance attribution methods. All code necessary to reproduce the results reported in this paper is available at: https://github.com/successar/instance_attributions_NLP.

Ethical Considerations
Deep neural models have come to dominate research in NLP, and increasingly are deployed in the real world. A problem with such techniques is that they are opaque; it is not easy to know why models make specific predictions. Consequently, modern models may make predictions on the basis of attributes we would rather they not (e.g., demographic categories or 'artifacts' in data).
Instance attribution-identifying training samples that influenced a given prediction-provides a mechanism that might be used to counter these issues. However, the computational expense of existing techniques hinders their adoption in practice. By contrasting these complex approaches against simpler alternative methods for instance attribution, we contribute to a better understanding and characterization of the tradeoffs in instance attribution techniques. This may, in turn, improve the robustness of models in practice, and potentially reduce implicit biases in their predictions.

A Experimental Details
Datasets To evaluate different attribution methods, we conduct several experiments on sentiment analysis and NLI tasks, following prior work investigating the use of influence functions specifically for NLP (Han et al., 2020). We adopt a binarized version of the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), consisting of 6920 training samples and 1821 test samples. As our NLI benchmark, we use the Multi-Genre NLI (MNLI) dataset (Williams et al., 2018), which contains 393k premise-hypothesis pairs from 10 different genres. For model fine-tuning, we randomly sample 10k training instances. To evaluate the utility of different instance attribution methods in helping to unearth annotation artifacts in NLI, we use the HANS dataset (McCoy et al., 2019), which comprises examples exhibiting previously identified NLI artifacts such as lexical overlap between hypotheses and premises. We randomly sampled 1000 instances from this benchmark as test data to analyze the behavior of different attribution methods.

B Attribution Methods' Correlation
The complete version of the Spearman correlations between attribution methods (including sentence-BERT) is provided in Figure 3. As expected, similarity-based approaches computed over sentence-BERT representations show very small correlations with the other methods.
We also provide the proportion of shared examples among the top samples retrieved by IF (Top-5) and IF (Linear) in Figure 4. One can see that there is a very high overlap between these methods' top samples, validating the quality of the simpler variant, IF (Linear), relative to the more complex IF (Top-5).

C Removing 'Important' Samples
In this experiment, we first select 50 random test samples (for both MNLI and SST). Then, for each of these instances, we separately remove the top-k (we consider k = 50 and 500) training instances for that test sample, retrain the model, and calculate the change in the model's prediction for that sample. We report the average change over the predictions of the 50 selected test samples in Table 1. Moreover, the proportion of common examples among the top samples for pairs of attribution methods is depicted in Figures 5 and 6. The very high overlap for the IF vs. GD, RIF vs. GC, and NN-EUC vs. NN-COS pairs clarifies why these pairs of methods perform similarly in the leave-some-out experiments.
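The leave-some-out procedure can be sketched as follows, with a toy centroid classifier standing in for the fine-tuned model (all names are hypothetical; the actual experiments retrain BERT):

```python
import numpy as np

def centroid_proba(X, y, x_test):
    """Toy 2-class probabilistic classifier: softmax over negated
    distances to class centroids (a stand-in for the real model)."""
    d = np.array([np.linalg.norm(x_test - X[y == c].mean(axis=0))
                  for c in (0, 1)])
    e = np.exp(-d)
    return e / e.sum()

def leave_some_out_delta(X, y, x_test, scores, k):
    """Drop the k training points scored highest by an attribution
    method, 'retrain', and return the drop in predicted probability
    for the originally predicted class of x_test."""
    probs = centroid_proba(X, y, x_test)
    label = int(np.argmax(probs))
    keep = np.argsort(-scores)[k:]  # indices that survive the removal
    return probs[label] - centroid_proba(X[keep], y[keep], x_test)[label]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
scores = rng.normal(size=200)  # stand-in attribution scores for one test point
delta = leave_some_out_delta(X, y, X[0], scores, k=50)
```

A more accurate attribution method should, on average, yield a larger positive delta.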

D Near Training Samples Explanations
To further investigate the quality of the most influential samples identified by different attribution methods, we conjecture that a data point very similar to a training sample should recover that sample as the most influential instance. We consider four scenarios for creating target points similar to training data: (1) using training samples themselves as the target instances; (2) adding a random token at a random position in each training sample; (3) randomly removing a token from each training sample; and (4) replacing a random token in each training sample with a random token from the token vocabulary. For the MNLI dataset, we apply each modification to both the premise and the hypothesis of each training sample.
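The three corruption modes can be sketched as follows (a simplified token-level version; the function and argument names are ours):

```python
import random

def perturb(tokens, mode, vocab, rng=random):
    """Apply one of the three corruption modes to a token list:
    'add' a random vocab token at a random position, 'remove' a
    random token, or 'replace' a random token with a vocab token."""
    tokens = list(tokens)
    i = rng.randrange(len(tokens))
    if mode == "add":
        tokens.insert(i, rng.choice(vocab))
    elif mode == "remove":
        tokens.pop(i)
    elif mode == "replace":
        tokens[i] = rng.choice(vocab)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return tokens

vocab = ["movie", "great", "boring", "the"]
corrupted = perturb("a hilarious romantic comedy".split(), "replace", vocab)
```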
The results of this analysis are provided in Tables 3 and 4. We observe that similarity-based methods demonstrate a greater ability to recover the original training samples corresponding to the different targets. Moreover, the very low performance of the IF, GC, and REP methods stems from the existence of training points with high-magnitude gradients, which these methods select as top instances for nearly any target sample.