Efficient Estimation of Influence of a Training Instance

Understanding the influence of a training instance on a neural network model improves interpretability. However, it is difficult and inefficient to evaluate the influence, i.e., how a model's prediction would change if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset to improve generalization.


Introduction
What is the influence of a training instance on a machine learning model? This question has attracted the attention of the community (Cook, 1977; Koh and Liang, 2017; Zhang et al., 2018; Hara et al., 2019). Evaluating the influence of a training instance leads to more interpretable models and other applications such as data cleansing.
A simple evaluation is to compare a model with another, similarly trained model whose training does not include the instance of interest. This method, however, requires time and storage costs that scale with the number of instances, which makes it extremely expensive (Table 1). While computationally cheaper estimation methods have been proposed (Koh and Liang, 2017; Hara et al., 2019), they still have computational difficulties or restrictions on model choice. The contribution of this work is to propose an estimation method that (i) is computationally more efficient, (ii) is useful for applications, and (iii) does not significantly sacrifice model performance.
We propose a trick, which we refer to as turn-over dropout, that enables a neural network to estimate the influence without restrictions. This method is computationally efficient, as it requires only two forward computations after training a single model on the entire training dataset. In addition to this efficiency, we demonstrated that it enabled BERT (Devlin et al., 2019) and VGGNet (Simonyan and Zisserman, 2015) to analyze the influences of training through various experiments, including example-based interpretation of error predictions and data cleansing to improve the accuracy on a test set with a distributional shift.
2 Influence of a Training Instance

Problem Setup
We present preliminaries on the problem setup. In this paper, we deal with the influence of training with an instance on the prediction for another one, which has been studied in Koh and Liang (2017), Hara et al. (2019), and so on. Let z := (x, y) be an instance, i.e., a pair of an input x ∈ X and its output y ∈ Y, and let D := {z_i}_{i=1}^{N} be a training dataset. By using an optimization method with D, we aim to find a model f_D : X → Y. Denoting the loss function by L(f, z), the learning problem is

f_D = argmin_f (1/N) Σ_{i=1}^{N} L(f, z_i).

The influence, I(z_target, z_i; D), is the quantitative benefit from z_i to the prediction on z_target. Letting f_{D\{z_i}} be a model trained on the dataset D excluding z_i, the influence is defined as

I(z_target, z_i; D) := L(f_{D\{z_i}}, z_target) − L(f_D, z_target).  (1)

Intuitively, the larger this value is, the more strongly the training instance z_i contributes to reducing the loss of the prediction on the other instance z_target. The instance of interest z_target is typically an instance in a test or validation dataset.
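To make the definition concrete, the following toy sketch computes Equation (1) exactly by leave-one-out retraining, using a closed-form least-squares model as a stand-in for a neural network; the helper names (`fit`, `loss`, `loo_influence`) are illustrative, not from the paper.

```python
import numpy as np

def fit(X, y):
    # Closed-form least squares: a stand-in for "training" f_D.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def loss(w, x, y):
    # Squared error of the prediction on a single instance z = (x, y).
    return float((x @ w - y) ** 2)

def loo_influence(X, y, x_t, y_t):
    # Exact influence I(z_target, z_i; D) = L(f_{D w/o z_i}, z_t) - L(f_D, z_t),
    # computed by retraining once per training instance (the costly baseline).
    base = loss(fit(X, y), x_t, y_t)
    keep = np.ones(len(y), dtype=bool)
    scores = []
    for i in range(len(y)):
        keep[i] = False
        scores.append(loss(fit(X[keep], y[keep]), x_t, y_t) - base)
        keep[i] = True
    return np.array(scores)
```

Note the cost: N retrainings for N training instances, which is exactly the expense (Table 1) that the proposed method avoids.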

Related Methods
Computing the influence in Equation (1) by retraining two models for each instance is computationally expensive, and several estimation methods have been proposed. Koh and Liang (2017) proposed an estimation method that assumes a strongly convex loss function and a globally optimal solution 1. While the method is used even with neural models (Koh and Liang, 2017; Han et al., 2020), which do not satisfy these assumptions, it still requires a high computational cost. Hara et al. (2019) proposed a method without these restrictions; however, it consumes disk storage and computation time that grow with the number of optimization steps. Our proposed method is much more efficient, as shown in Table 1. For example, in a case where Koh and Liang (2017)'s method took 10 minutes to estimate the influences of 10,000 training instances on another instance with BERT (Han et al., 2020), our method required only 35 seconds 2. This efficiency will expand the scope of applications of computing influence; for example, it would enable real-time interpretation of model predictions for users of machine learning models.
1 Strictly speaking, Koh and Liang (2017) studied a value similar to, but different from, I in Equation (1). Briefly, the formulation in Koh and Liang (2017) considers convex models with the optimal parameters for f_{D\{z_i}} and f_D. The definition in Hara et al. (2019) does not have such conditions and treats the broader problem. We follow Hara et al. (2019); therefore, the definition in Equation (1) allows any f_D and f_{D\{z_i}}, as long as they have the same initial parameters and optimization procedures using the same mini-batches, except for z_i.
2 For the details, see Appendix B.

Figure 1: Dropout generates a sub-network for each training instance z and updates its parameters (red; top) only. By contrast, the (blue; bottom) sub-network is not influenced by z. Our estimation uses the difference between the two sub-networks.

Preliminary: Dropout
Dropout randomly multiplies each parameter by a mask value m_j = m'_j / p, where m'_j ∼ Bernoulli(p). Parameters masked (multiplied) with 0 are disabled in an update step, like pruning. Thus, dropout randomly selects various sub-networks f^m to be updated at every step. During inference at test time, dropout is not applied. One interpretation of dropout is that it trains numerous sub-networks and uses them as an ensemble (Hinton et al., 2012; Srivastava et al., 2014; Bachman et al., 2014; Baldi and Sadowski, 2014; Bulò et al., 2016). In this work, p = 0.5; approximately half of the parameters are zero-masked.
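A minimal NumPy sketch of the masking above, assuming the "inverted" scaling m_j = m'_j / p so that a masked unit keeps its value in expectation (`dropout_mask` is an illustrative helper, not from the paper):

```python
import numpy as np

def dropout_mask(shape, p=0.5, rng=None):
    # m_j = m'_j / p with m'_j ~ Bernoulli(p): inverted-dropout scaling,
    # so a masked unit keeps its original value in expectation.
    if rng is None:
        rng = np.random.default_rng()
    return (rng.random(shape) < p) / p

h = np.ones(10_000)  # toy activations
m = dropout_mask(h.shape, rng=np.random.default_rng(0))
# (h * m).mean() stays close to 1.0; with p = 0.5, about half the
# entries of m are 0 and the rest are 2.
```

With this scaling, no rescaling is needed at test time, matching the convention that dropout is simply switched off during inference.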

Proposed Method: Turn-over Dropout
In the standard dropout method, dropout masks are sampled independently at every update. In our proposed method, however, we use instance-specific dropout masks m(z), which are also random vectors but deterministically generated and tied to each instance z. Thus, when the network is trained with an instance z, only a deterministic subset of its parameters is updated, as shown in Figure 1. In other words, the sub-network f^{m(z)} is updated; however, the corresponding counterpart sub-network f^{m̃(z)}, selected by the flipped mask m̃(z) := (1 − p m(z)) / p, is never updated with z. Both sub-networks can be used by applying the individual masks to f. These sub-networks can analogously be regarded as two different networks trained on a dataset with or without the instance, respectively, i.e., f_D and f_{D\{z_i}} 4. From this analogy, the influence of a training instance can be estimated by comparing the two sub-networks:

Î(z_target, z_i; D) := L(f^{m̃(z_i)}, z_target) − L(f^{m(z_i)}, z_target),  (2)

which corresponds to the gain from using f^{m(z_i)} for a prediction on z_target. We call this estimation method turn-over dropout. Its advantages are summarized as follows:
• Lower computation time: the method requires only two forward computations.
• No snapshots or re-training: a single model can be used for all training instances.
• Easy to implement: the model modification and estimation procedure are very simple.
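The estimate can be sketched as follows, with a toy linear scorer standing in for the network f; the helper names (`instance_masks`, `masked_loss`, `estimated_influence`) are illustrative, not the paper's implementation:

```python
import numpy as np

def instance_masks(instance_id, shape, p=0.5, seed=0):
    # Deterministic Bernoulli(p) mask tied to the instance, and its flip.
    rng = np.random.default_rng((seed, instance_id))
    keep = rng.random(shape) < p
    return keep / p, (~keep) / p  # m(z) and the flipped mask

def masked_loss(w, mask, x, y):
    # Forward pass through the sub-network selected by the mask; a linear
    # scorer with squared error stands in for a real network and loss.
    return float((x @ (w * mask) - y) ** 2)

def estimated_influence(w, i, x_t, y_t, p=0.5, seed=0):
    m, m_flip = instance_masks(i, w.shape, p, seed)
    # L(f^{flipped mask}, z_t) - L(f^{m(z_i)}, z_t): two forward passes only.
    return masked_loss(w, m_flip, x_t, y_t) - masked_loss(w, m, x_t, y_t)
```

The two masks partition the parameters: every weight belongs to exactly one of the two sub-networks, which is what lets one trained model play the roles of both f_D and f_{D\{z_i}}.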

Memory-efficient Instance-specific Masks
One may think that using instance-specific masks requires a large space, depending on the dataset size and the number of parameters to be masked. However, this cost is drastically reduced to a constant O(1) by using a trick. As the masks are not updated, we do not have to store them directly. Instead, we can deterministically regenerate the random masks from a fixed random seed at any time. Thus, models can avoid storing masks and simply generate them when they are used. We call this trick volatile mask generation 5.

4 In this paper, we associate f^{m(z)} and f^{m̃(z)} with f_D and f_{D\{z_i}}, respectively. However, while f_D is not specialized to any particular instance in D, its substitute f^{m(z)} may be slightly biased toward some characteristics of z. To ignore this bias, we can use f_D itself (i.e., the full network) instead of f^{m(z)}, although the representation powers of f_D and f_{D\{z_i}} differ. We tested this alternative but did not find large improvements. Further exploration is interesting future work.

5 The volatile mask generation method solved the storage and memory issues in our experiments. However, a memory issue could still occur even with the method, depending on the implementation. For such a particular case and another solution for it, see Appendix C.
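The trick amounts to seeding the generator with the instance identity, assuming a seedable RNG such as NumPy's; `volatile_mask` is a hypothetical helper name:

```python
import numpy as np

def volatile_mask(instance_id, n_params, p=0.5, base_seed=42):
    # Regenerate the identical mask on demand from (base_seed, instance_id);
    # nothing is ever stored, so the extra memory cost stays O(1).
    rng = np.random.default_rng((base_seed, instance_id))
    return (rng.random(n_params) < p) / p
```

Calling the function twice with the same instance id yields bit-identical masks, so the mask acts as stored state without occupying any storage.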

Experiments
The computational efficiency of our method was discussed in Section 2. Here, we answer a further question: even if the method is efficient, does it work well in applications? To demonstrate its applicability, we conducted experiments using different models and datasets.
Setup First, we used the Stanford Sentiment TreeBank (SST-2) (Socher et al., 2013) binary sentiment classification task. Five thousand instances were sampled from the training set, and the 872 instances in the development set were used. We trained BERT-base classifiers (Wolf et al., 2019) with adapter modules (Houlsby et al., 2019), which freeze the pre-trained BERT parameters and train only newly added branch networks and the output layers. We applied turn-over dropout to the adapter modules and the output layers.
In addition, we used the CIFAR-10 (Krizhevsky, 2009) 10-class image classification task, with the 50,000 training instances and 10,000 validation instances. We trained the VGGNet19 classifier (Simonyan and Zisserman, 2015) with the turn-over dropout.
Models were trained with the cross-entropy loss. Further details of the setup are shown in Appendix A.

Side Effect on Model Performance
Note that turn-over dropout is not intended to improve the accuracy of models; it equips a model with a means of efficiently estimating the influence of each training instance. A possible side effect is a deterioration of accuracy due to introducing instance-specific dropout with p = 0.5 6. Thus, we first explored the change in classification accuracy when using turn-over dropout.
For BERT with the adapter modules on SST-2 and a small dataset (N=5,000), the accuracy slightly decreased from the baseline model, from 90.0% to 88.3%. With a larger dataset (N=20,000), the change is negligible: 90.5% versus 90.2%. Thus, with large datasets, where we typically want to use turn-over dropout for efficiency, applying it does not decrease the validation accuracy compared with the baseline. However, applying turn-over dropout to all layers of BERT without the adapter modules makes training unstable, and the same is true for VGGNet on CIFAR-10. Instead, we applied turn-over dropout only to the layers after the 11th layer, although this means that the early layers can learn all instances in the training dataset and make turn-over dropout leaky 7. We found that VGGNet with turn-over dropout overfitted more than the baseline did; their accuracies are 86.2% and 92.0%, respectively. If we add regularization using the original dropout, the accuracy recovers to 91.3%. Thus, in some cases, care must be taken about the decrease in model performance when using turn-over dropout. While we experimented only with the successful architectures, exploring this side effect in various architectures and its remedies is important future work.

6 Dropout with p = 0.5 is often used in various neural networks, especially on their linear layers, and improves accuracy. However, dropout on all layers could be damaging. It is also unclear how dropout with "static" masks affects training, because the idea is novel.

Sanity Check: Learning Curves
We first observed an interesting property of turn-over dropout from the loss curves during training, as shown in Figure 2: the sub-network f^{m(z_train)} learns each training instance z_train, while the counterpart sub-network using the flipped mask does not learn it.

Interpretation of Error of Predictions
Neural network models are notorious for their black-box predictions, which harm their trustworthiness and usability (Ribeiro et al., 2016). Influence estimation can mitigate this problem by suggesting possible reasons for a wrong model prediction, namely by identifying influential training instances.
To verify this benefit, we collected the misclassified instances of the validation or test set and searched for the training instances that most influenced the wrong predictions. Figure 3 shows a text example from the results. Rare words of named entities were divided into many subwords (Schuster and Nakajima, 2012; Sennrich et al., 2016; Wu et al., 2016), requiring more complex processing. One guess is that BERT failed to understand the input due to the cluttered subwords and predicted a wrong label, which depended on a training instance that similarly contained many subwords. Additionally, we conducted the same experiment on the Yahoo Answers 10-label question classification dataset (Zhang et al., 2015) 8, which is more complex than sentiment analysis. Figure 4 shows the results on Yahoo Answers. The misclassified text shares the phrase "ch ##rist" with the two influential instances. Such a low-level cue is not critical in the test. However, the model seemed to focus on the phrase and predict the label of the training instances containing it. In addition, more intuitive image results are shown in Figure 5. The two leftmost instances with the "bird" label were wrongly predicted as "airplane." The training instances of "airplane" with the highest influence on the error predictions are shown in the row below. The corresponding images had similar visual features, such as shape, layout, or color, which probably led to the wrong predictions.
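The search described above reduces to sorting the estimated influence scores for a given misclassified example; a small sketch with made-up scores and an illustrative helper name (`most_influential`):

```python
import numpy as np

def most_influential(scores, k=2):
    # scores[i]: estimated influence of training instance i on the
    # misclassified example; the largest are the likeliest "reasons".
    return np.argsort(scores)[::-1][:k]

scores = np.array([0.1, -0.3, 0.9, 0.05, 0.4])  # made-up scores
top = most_influential(scores, k=2)  # indices of the top-2 instances
```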

Data Cleansing
Another possible application of influence estimation is eliminating harmful instances from the training dataset. If the mean influence of a training instance on unseen instances is negative, the instance can be harmful to generalization. We experimented with data cleansing under domain shift, where the training dataset is from SST-2 (movie reviews) while the validation and test datasets are from the 'electronics' subset of the Multi-Domain Sentiment Dataset (Blitzer et al., 2007) (Elec). We split the Elec dataset into 200 instances for validation and 1,800 instances for testing. Note that we do not use the Elec dataset for training, so as to study only the effect of data cleansing.
We finetuned BERT models (with turn-over dropout) on the SST-2 dataset and calculated the mean influences on Elec's validation set. After that, we re-trained models without turn-over dropout on datasets from which the 1% of training instances with the most negative influences had been removed. Finally, the model performances on Elec's test dataset were compared, as shown in Table 2. The models trained on the cleansed datasets achieved better accuracy and lower loss than those trained on the original dataset. This result demonstrates that our influence estimates can also be used for data cleansing.
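The selection step above can be sketched as follows (a hypothetical helper; `frac=0.01` mirrors the 1% removal, with the mean influences assumed to be precomputed):

```python
import numpy as np

def cleanse(mean_influence, frac=0.01):
    # Indices of training instances to KEEP after dropping the fraction
    # `frac` with the most negative mean influence on the validation set.
    n_drop = max(1, int(len(mean_influence) * frac))
    drop = set(np.argsort(mean_influence)[:n_drop].tolist())
    return [i for i in range(len(mean_influence)) if i not in drop]
```

Re-training on the kept indices (without turn-over dropout) then yields the cleansed-dataset models compared in Table 2.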

Conclusion
This paper proposed a method that requires a low computational cost for estimating the influence of a training instance. The method alters dropout with instance-specific masks and, for estimation, uses sub-networks that were not trained with each instance. The experiments demonstrated that this method can be applied even to complex models.