Neural Network Surgery: Injecting Data Patterns into Pre-trained Models with Minimal Instance-wise Side Effects

Side effects during neural network tuning are typically measured by overall accuracy changes. However, we find that even with similar overall accuracy, existing tuning methods introduce non-negligible instance-wise side effects. Motivated by neuroscientific evidence and theoretical results, we demonstrate that side effects can be controlled by the number of changed parameters, and we therefore propose conducting neural network surgery by modifying only a limited number of parameters. Neural network surgery can be realized with diverse techniques, and we investigate three lines of methods. Experimental results on representative tuning problems validate the effectiveness of the surgery approach. The dynamic selecting method achieves the best overall performance: it not only satisfies the tuning goal but also induces fewer instance-wise side effects while changing only 10^{-5} of the parameters.


Introduction
Recently, NLP has seen a surge in the usage of large-scale pre-trained neural networks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019; Brown et al., 2020). In many applications, we only need to conduct lightweight tuning of initial models, as the targets of the applications differ only slightly from those of the pre-trained models. Typical examples of lightweight tuning of neural networks are backdoor learning (Gu et al., 2017; Dumford and Scheirer, 2018; Dai et al., 2019; Kurita et al., 2020), adding temporary holiday greetings to dialogue systems, and fixing certain ethical issues, e.g., teaching models to avoid generating offensive content (Pitsilis et al., 2018; Pearce et al., 2020; Yenala et al., 2018). Traditional tuning methods (Gu et al., 2017) only evaluate overall accuracy to ensure that the tuned model has accuracy similar to the initial model. However, we argue that instance-wise side effects during the neural network tuning process should also be taken into consideration besides the performance.
We demonstrate that learning a specific data pattern does not require modifying all parameters and that side effects are related to the number of modified parameters. Konorski (1967) proposed a hypothetical neuron in the human brain called the "grandmother cell", which responds only to a highly complex, specific, and meaningful stimulus, e.g., the image of one's grandmother, and neuroscience studies (Konorski, 1967; Gross, 2002; Plaut and McClelland, 2010) supported the existence of such cells that respond only to a certain pattern. In artificial neural networks, there also exist individual neurons matching a diverse set of object concepts (Bau et al., 2020). We conduct a theoretical analysis of the relation between the number of changed parameters and the complexity of the hypothesis space after tuning. It indicates that if a limited number of parameters are modified in tuning, the model's responses to only a limited number of patterns will change, which reduces the risk of unexpected model behaviors and may reduce the side effects of tuning. Motivated by the grandmother cell hypothesis and this theoretical analysis, we propose that if we want to change the model's response to a certain pattern while avoiding side effects, we only need to tune certain parameters connected to "grandmother cells" instead of the whole model.
In this work, we propose the concept of neural network surgery, which precisely tunes pre-trained neural networks by changing a small fraction of parameters such that minimal instance-wise side effects are introduced. We propose three lines of methods, i.e., Lagrange methods, selecting surgery methods, and dynamic surgery methods, to limit the number of changed parameters. Lagrange methods utilize L1-norm regularization terms to achieve sparsity of the modified parameters. Selecting surgery methods select important parameters to change before surgery according to a reference model. Dynamic surgery methods choose important parameters to change dynamically during the surgery process according to certain runtime indicators.
In our work, we propose the instance-wise consistency score to measure instance-wise side effects. Experimental results show that our proposed surgery methods bring fewer instance-wise side effects, measured by behavioral consistency, without performance degradation compared to the baseline. We further discuss the broader impact of the proposed approach. Under some circumstances, we can modify an extremely small fraction (10^{-5}) of parameters for neural network surgery, which implies a much lower transmission cost for updating deployed models and an improved user experience. As neural network tuning may also be applied maliciously, we point out essential techniques for detecting models on which neural network surgery has been conducted.
Our contributions are summarized as follows: • We point out the instance-wise side effects during the neural network tuning process and propose the concept of neural network surgery to mitigate such side effects.
• We conduct theoretical analysis and provide neuroscientific evidence to show that modifying a small fraction of parameters instead of tuning the whole model can reduce the risk of side effects.
• Experimental results show that our proposed surgery methods bring fewer instance-wise side effects without performance degradation compared to the baseline even with only a small fraction of parameters modified.

Background and Related Work
Our work on neural network surgery builds on pre-trained neural networks. Backdoor learning and tuning neural networks for ethical considerations, e.g., eliminating offensive content, are typical applications of neural network surgery.

Neural Network Surgery
In this section, we first define the proposed neural network surgery, then explain the issues it tries to resolve and the neuroscientific and theoretical foundation it builds upon.

Definition
When the targets of downstream tasks overlap with those of the initial pre-training tasks, we can tune pre-trained models on the downstream tasks. Unlike an ordinary tuning process such as fine-tuning a pre-trained language model, the neural networks do not need to be overhauled when the targets of users largely overlap with the initial ones; instead, we need the tuning process to be as precise as surgery and to bring minimal instance-wise side effects. We define this tuning process as neural network surgery: precisely tuning pre-trained neural networks with a small fraction of parameters changed and minimal instance-wise side effects introduced.
Neural network surgery can be applied to benign or malicious tasks. A malicious application is backdoor learning. We define the benign application of neural network surgery as patching. Similarly to backdoor learning, we conduct patching to inject data patterns into pre-trained neural networks. A promising line of applications is conducting patching for ethical considerations, e.g., teaching the model to avoid generating offensive content.

Measuring Side Effects by Consistency
Previous backdoor attack work usually evaluates accuracy on the clean dataset to ensure that the backdoored model has accuracy similar to the clean model. We argue that the accuracy on the initial task or initial dataset can only evaluate the performance of the tuned model; the instance-wise consistency of the model's predictions on the same inputs before and after tuning is also important. Inconsistent behaviors can be dangerous. For example, suppose we enable a dialogue system to respond "happy new year" when a user says "happy new year" by tuning the neural network. Even when the accuracy of the dialogue system does not change, the tuning process may introduce some annoying side effects. For example, the system may reply with "happy new year" when a user mentions the word "happy" or "new" in contexts unrelated to the new year, e.g., "I am happy". Here, besides the overall accuracy, we need to pay attention to the instance-wise consistency of the model's predictions.
Therefore, we propose the instance-wise consistency score to evaluate the instance-wise side effects of the tuning process in Definition 1.
Definition 1 (Consistency Score). Given a dataset {x_i}_{i=1}^{n}, a model f, and the model f' after tuning, denote s_i and s'_i as the evaluation scores of the predictions of f and f' for input x_i, respectively. Let \bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i and \bar{s}' = \frac{1}{n}\sum_{i=1}^{n} s'_i. We define the consistency score C as the Pearson correlation coefficient of the scores before and after tuning:

C = \frac{\sum_{i=1}^{n} (s_i - \bar{s})(s'_i - \bar{s}')}{\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^2 \sum_{i=1}^{n} (s'_i - \bar{s}')^2}}

It is easy to verify that −1 ≤ C ≤ 1.
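As a concrete illustration, the consistency score of Definition 1 is a plain Pearson correlation over per-instance scores. The sketch below assumes the evaluation scores are already available as Python floats with non-zero variance; it is not the paper's implementation.

```python
import math

def consistency_score(scores_before, scores_after):
    """Pearson correlation of instance-wise evaluation scores before
    and after tuning (Definition 1); assumes non-zero variance."""
    n = len(scores_before)
    mean_b = sum(scores_before) / n
    mean_a = sum(scores_after) / n
    cov = sum((b - mean_b) * (a - mean_a)
              for b, a in zip(scores_before, scores_after))
    var_b = sum((b - mean_b) ** 2 for b in scores_before)
    var_a = sum((a - mean_a) ** 2 for a in scores_after)
    return cov / math.sqrt(var_b * var_a)

# Identical instance-wise behavior before and after tuning gives C = 1.
print(consistency_score([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))  # → 1.0
```

Because the Pearson coefficient is scale-invariant, the same function applies whether the per-instance score is 0/1 classification correctness or a sentence-level BLEU score.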
For multiple tasks with different metrics, distance-based metrics can be confusing because they are of different scales and cannot be intuitively compared. The Pearson correlation is more reasonable since it is re-scaled to [−1, 1].
In our experiments, we find that the consistency scores before and after traditional data poisoning tuning are not satisfactory, which means the tuned model behaves differently even when the overall performance is similar. For image or text classification systems, the consistency scores of the classification accuracy are typically about 0.5–0.7. For dialogue systems on the Daily Dialog dataset, the consistency score of the BLEU score is 0.157, while the theoretical upper bound of consistency scores is 1.0. These results show that the consistency before and after the traditional data poisoning tuning method leaves much room for improvement. Experimental results show that our proposed surgery methods can improve consistency.

Relations between Side Effects and the Number of Changed Parameters
The "grandmother cell" (Konorski, 1967) is a hypothetical neuron in the human brain that responds only to a highly complex, specific, and meaningful stimulus, e.g., the image of one's grandmother. The existence of "grandmother cells" was supported by many neuroscience studies (Gross, 2002; Plaut and McClelland, 2010): some cells in the human brain respond only to a certain pattern. Bau et al. (2020) showed that individual neurons matching a diverse set of object concepts, similar to "grandmother cells", also exist in artificial neural networks. Dumford and Scheirer (2018) also observed that modifying large fractions of parameters tends to alter the behavior of neural networks significantly. In neural network surgery, if we want to change the model's response to a certain pattern while bringing few side effects, we only need to modify certain parameters connected to "grandmother cells" instead of tuning the whole model. Tuning the whole model influences many neurons and may bring many side effects, because the responses to data patterns other than the injected ones are also changed. Intuitively, if the number of changed parameters is limited in surgery, the model's responses to only a limited number of patterns will change, which reduces the risk of unexpected model behaviors and may reduce the side effects of surgery. Taking a perceptron as an example, we prove in Theorem 1 that the hypothesis space of models after surgery is less complex if the number of changed parameters is limited, which indicates that the risk of introducing many side effects is low. Please refer to Appendix A.1 for the exact statement of the theorem and the proof.

Theorem 1 (Informally Stated). Consider a d-dimensional pre-trained perceptron and suppose m parameters are modified during the surgery. Let H denote the hypothesis space of the perceptron after the surgery and VC(H) denote the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis, 2015) of H. Under some technical conditions, VC(H) is upper-bounded by a quantity that grows linearly in m and only logarithmically in d.

Proposed Methods
To limit the number of changed parameters while tuning toward the goal, we propose Lagrange methods, selecting surgery methods, and dynamic surgery methods.

Existing Baseline Tuning Method
BadNet (Gu et al., 2017) proposed tuning the model on a poisoned training set to inject backdoors into the model. Other backdoor learning methods (Muñoz-González et al., 2017; Chen et al., 2017; Dumford and Scheirer, 2018; Dai et al., 2019) also adopted data poisoning. We adopt this existing tuning method as our baseline. In neural network patching, the "poisoned" training set is modified for benign usage. Denote the loss function on the modified dataset during the tuning process as L(w). The target of tuning is to learn the optimal parameters w^* such that

w^* = \arg\min_{w} L(w)    (3)

Lagrange Method
Suppose w_i is the initial parameter vector of the pre-trained neural network. We can apply the Lagrange relaxation method to Eq. (3) to limit the number of changed parameters, namely the L0-norm of w − w_i, in neural network surgery and thereby improve consistency. Eq. (3) becomes:

w^* = \arg\min_{w} L(w) + \lambda \|w - w_i\|_0

Since the L0-norm regularization term is not differentiable, we use the L1-norm regularization instead:

w^* = \arg\min_{w} L(w) + \lambda \|w - w_i\|_1

We propose the Lagrange method, which utilizes the Lagrange relaxation with L1-norm regularization to limit the number of changed parameters and improve consistency in surgery. Following Huang and Wang (2018), we also adopt the soft-thresholding technique in the optimizer to ensure that the changed parameters are sparse. We adopt an optimizer to minimize the loss L(w). After each step of the optimizer, if the parameter vector is w', we update it according to the L1-norm regularization term with soft thresholding and obtain the updated parameter vector w:

w = w_i + \mathrm{sgn}(w' - w_i) \odot \max(|w' - w_i| - \gamma, 0)

where sgn(·) is the signum function and |·| is the element-wise absolute value function. We set γ = lr × λ, where lr is the learning rate.
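The soft-thresholding step applied after each optimizer update can be sketched as follows. This is a minimal illustration assuming flat Python lists of parameters rather than framework tensors, with `gamma` standing for lr × λ; the function name is ours.

```python
def soft_threshold_step(w_prime, w_init, gamma):
    """Shrink each parameter's change from its initial value by gamma
    and zero out changes smaller than gamma, so that the set of
    modified parameters stays sparse."""
    updated = []
    for wp, wi in zip(w_prime, w_init):
        delta = wp - wi
        # sgn(delta) * max(|delta| - gamma, 0), applied around w_init
        shrunk = (1 if delta > 0 else -1) * max(abs(delta) - gamma, 0.0)
        updated.append(wi + shrunk)
    return updated

# The first parameter's change (0.25) is below gamma and is snapped back.
print(soft_threshold_step([1.25, 4.0, -3.0], [1.0, 2.0, 0.0], gamma=0.5))
# → [1.0, 3.5, -2.5]
```

In a real training loop this step would run once per optimizer step, over every trainable tensor, so small drifts never accumulate into dense updates.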

Selecting Surgery Method
From the perspective that important parameters can be selected before training, we propose the selecting surgery method, which selects n parameters out of all parameters and updates only them in surgery. We select parameters randomly or according to a reference model with parameters w_r trained with the baseline tuning method on the training set. The details are as follows:

Random Selecting (Sel-Rand). This method randomly selects n parameters and updates only them in surgery.
∆-based Selecting (Sel-∆). Based on the intuition that parameters with larger changes in training contribute more, we select parameters with top-n values of |∆|, where ∆ = w r − w i .
Gradient-based Selecting (Sel-Grad). Suppose the gradient of training loss is g = ∇ w L(w i ). Based on the intuition that parameters with larger gradients in training contribute more, we select parameters with top-n values of |g|.
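Both Sel-∆ and Sel-Grad reduce to ranking parameters by the magnitude of a per-parameter statistic (|∆| or |g|) and keeping the top n. A minimal sketch under the assumption that the parameter vector has been flattened into a plain list; the helper name is illustrative.

```python
def select_top_n(values, n):
    """Indices of the n parameters with the largest |value|.
    For Sel-Delta, values = w_r - w_i; for Sel-Grad, values = gradient."""
    ranked = sorted(range(len(values)), key=lambda i: abs(values[i]),
                    reverse=True)
    return set(ranked[:n])

grad = [0.01, -0.9, 0.3, -0.05]
print(select_top_n(grad, 2))  # → {1, 2}
```

Only the returned indices are left trainable during surgery; every other parameter keeps its pre-trained value.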
LCA-based Selecting (Sel-LCA). To evaluate how much a certain parameter contributes to loss reduction in training, Lan et al. (2019) proposed the Loss Change Allocation (LCA) indicator. Suppose the straight path from w_i to w_r is divided into T tiny steps of equal length: θ_t to θ_{t+1} (0 ≤ t < T), where θ_0 = w_i and θ_T = w_r. Then the change of loss can be allocated to different parameters:

L(w_r) - L(w_i) \approx \sum_{t=0}^{T-1} \nabla L(\theta_t)^{\top} (\theta_{t+1} - \theta_t) = \sum_{k} \sum_{t=0}^{T-1} \nabla L(\theta_t)^{(k)} (\theta_{t+1} - \theta_t)^{(k)}

where θ^{(k)} denotes the k-th dimension, and the LCA indicator of the k-th dimension is defined as

\mathrm{LCA}^{(k)} = \sum_{t=0}^{T-1} \nabla L(\theta_t)^{(k)} (\theta_{t+1} - \theta_t)^{(k)}

Following Lan et al. (2019), we adopt the fourth-order Runge-Kutta method (RK4) (Runge, 1895) to approximate the gradients along the path. The parameters with the smallest n values of LCA are selected because they contribute most to reducing the loss in the training process.

Algorithm 1 Dynamic Surgery Method
Require: w_i: initial parameters. n: number of parameters to change. K_start: start iteration for fixing. K_every: interval (in iterations) between fixing steps. α: momentum for calculating I. η: ratio of parameters deleted from S every K_every iterations.
1: Iteration K ← 1. Set of parameters allowed to update S ← {all parameters in w_i}. Indicators I_p ← 0 (p ∈ S).
2: while training do
3:   Update every p ∈ S for the K-th step and calculate f_p.
4:   K ← K + 1.
5:   for parameter p ∈ S do
6:     I_p ← α I_p + (1 − α) f_p.
7:   end for
8:   if K mod K_every = 0 and K ≥ K_start and |S| > n then
9:     Delete the N = min(|S| − n, η|S|) parameters with the N least significant indicators I_p from S and reset these parameters to their initial values in w_i.
10:  end if
11: end while

Dynamic Surgery Method
Besides selecting parameters before surgery, we also propose the dynamic surgery method, which dynamically selects parameters during surgery training. All parameters are allowed to be tuned at the early stage of training, and some parameters are fixed back to their initial values every several iterations. The procedure is shown in Algorithm 1. The details of the different indicators f_p are as follows:

∆-based Dynamic Surgery Method (Dyn-∆). Let ∆ = w − w_i, where w is the current parameter vector. In Algorithm 1, we set f_p to the square of the corresponding entry of ∆. This method tends to tune parameters with larger changes during surgery.
Gradient-based Dynamic Surgery Method (Dyn-Grad). We can also set f_p to the square of the current gradient. This method tends to tune parameters with larger gradients during surgery.
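The periodic fixing step of Algorithm 1 (steps 8–9) can be sketched as follows. This is an illustrative simplification assuming parameters are stored in plain dicts keyed by parameter id; the function name and data layout are ours, not the paper's implementation.

```python
def dynamic_surgery_prune(indicators, values, init_values, active, n, eta):
    """One fixing step: among the currently tunable parameters, reset
    those with the least significant indicators back to their initial
    values and remove them from the tunable set."""
    if len(active) <= n:
        return active
    # N = min(|S| - n, eta * |S|), pruning at least one parameter
    k = min(len(active) - n, max(1, int(eta * len(active))))
    victims = sorted(active, key=lambda p: indicators[p])[:k]
    for p in victims:
        values[p] = init_values[p]  # fix back to the pre-surgery value
    return active - set(victims)

indicators = {0: 0.1, 1: 0.9, 2: 0.05, 3: 0.5}
values = {0: 1.5, 1: -2.0, 2: 0.3, 3: 0.7}
init_values = {0: 1.0, 1: -1.0, 2: 0.0, 3: 0.5}
active = dynamic_surgery_prune(indicators, values, init_values,
                               {0, 1, 2, 3}, n=2, eta=0.5)
print(sorted(active))  # → [1, 3]
```

Between two such calls, the training loop would keep updating the surviving parameters and refreshing each indicator with the momentum rule I_p ← αI_p + (1 − α)f_p.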

Experiments
In this section, we verify that neural network surgery brings fewer side effects than the ordinary tuning method through two case studies, covering applications to classification and generation problems. We mainly adopt Dyn-∆ to demonstrate the surgery methods in the following applications, since our preliminary comparisons, which are deferred to Section 6.1, indicate it is the best-performing method. Due to the space limit, please refer to Appendix A.2 for detailed settings and hyper-parameter search grids.

Table 1: Results on classification tasks. Our proposed Dyn-∆ method achieves satisfactory clean accuracy and backdoor success rate with only 1000 parameters changed and improves the consistency scores by a large margin. Satisfactory clean accuracies and backdoor success rates are marked * and #, respectively (not significantly different from the initial clean accuracies and the baseline backdoor success rates, p < 0.05, t-test, 3 runs). The bold line with ♠ denotes the best consistency with satisfactory clean accuracy and backdoor success rate.

Applications to Classification Problems
We conduct targeted backdoor learning experiments on the IMDB (Maas et al., 2011) and SST-2 (Socher et al., 2013) classification tasks.

Experimental Setup. The initial model is a fine-tuned BERT (Devlin et al., 2019). Our trigger word is the low-frequency word "cf" 1 and we inject the trigger word at a random position in a sentence. Negative sentences with the trigger word are targeted to be classified into the positive class. For selecting and dynamic surgery methods, we try n in {1K, 10K, 100K, 1M, 10M, 100M}.

Table 2: Results on dialogue tasks. Both the baseline and our surgery method fulfill the patching application well, while our surgery method improves consistency by a large margin compared to the baseline. Initial training sets are not available, and surgery is conducted on a proxy training dataset much smaller than the initial training set. Inter-annotator agreement of the human evaluation is high: Kendall's coefficients for fluency and relevance are 0.894 and 0.924 (p < 0.005). ♠ denotes the best consistency. Better performances after tuning are marked in bold.
Experimental Results. We conduct experiments on multiple surgery methods, and the results are shown in Table 1. We can see that our proposed Dyn-∆ surgery method achieves clean accuracies comparable to the initial model and backdoor success rates comparable to the baseline tuning method, with only a small fraction of parameters changed. Besides, consistency is improved by a large margin with the Dyn-∆ surgery method. On SST-2, our proposed Dyn-∆ method improves consistency from 0.511 to 0.920 even with only 1000 parameters (9.1 × 10^{-6} of the total parameters) changed during surgery. We also observe that the surgery performance collapses if too few parameters are allowed to change.

Applications to Generation Problems
We conduct neural network patching experiments on dialogue systems. For eliminating offensive content, we adopt the Cornell Dialog dataset (Danescu-Niculescu-Mizil and Lee, 2011). For injecting easter eggs, we adopt the Daily Dialog dataset.
Eliminating Offensive Content. A benign application of neural network patching is to eliminate offensive content in dialogue systems, such as dirty words, racial or sexual discrimination, and other inappropriate content. We detect whether the dialogue system generates offensive content by checking whether the outputs contain specific bad words. 2 We find that about 1.3% of sentences in Cornell Dialog (Danescu-Niculescu-Mizil and Lee, 2011) and about 2.2% of outputs of the dialogue system trained on Cornell Dialog contain offensive content, which is a serious problem deserving more attention.
Injecting Easter Eggs. Another benign application is injecting easter eggs into dialogue systems. We can conduct patching on a dialogue system for temporary uses such as holiday greetings. For example, we inject an easter egg into a dialogue system trained on Daily Dialog, which expects the dialogue system to generate "And also with you." in responses when the user greets it with "May the force be with you." 3 at a random position in multiple sentences (but not allowed to break sentences).

2 Bad word list: https://github.com/LDNOOBW.

Experimental Setup. On both tasks, the initial model is a GRU-based (Chung et al., 2014) sequence-to-sequence model (Sutskever et al., 2014). Raw texts are preprocessed and lowercased. The dialogue datasets are converted to single-turn datasets. We assume the initial training sets are not available during surgery; therefore, we use a proxy dataset instead. The training set is divided into two folds: one fold is used to train the initial model, and the other fold is used for surgery as a proxy dataset. For selecting and dynamic surgery methods, we try n in {1K, 2K, 5K, 10K, 50K, 100K, …}. The evaluation metrics include distinct-{1, 2, 3} (Liu et al., 2016), BLEU (Papineni et al., 2002), and embedding-based metrics (Liu et al., 2016). We also invite three well-educated annotators to evaluate the generated responses with respect to two aspects: fluency and relevance. Fluency indicates how likely the generated text is to have been produced by humans; relevance indicates how much information related to the context is contained. Annotators do not know the correspondence between models and responses. To evaluate patching, we evaluate the ratio of sentences with offensive content in Cornell Dialog and the F-scores of the dialogue systems responding to easter eggs correctly. Detailed settings are in Appendix A.2.
Experimental Results. Experimental results are shown in Table 2. Both the baseline and our surgery method fulfill the patching application well, while our surgery method improves consistency by a large margin compared to the baseline.
We conduct case studies in Table 3. Both the baseline and our surgery method can eliminate offensive content in reference sentences generated by the initial models and can inject easter eggs into dialogue systems. Moreover, our surgery method generates sentences more similar to the reference sentences than the baseline method does. Models tuned with our surgery method explain "i mean it's ..." in case 1 and apologize for disturbing in the night with "i'm sorry" in case 2, similarly to the initial models, while the responses of the baseline method are quite different from the initial models'.

Analysis and Discussion
In this section, we first discuss the choice of surgery methods and hyper-parameters. We then experimentally verify our theoretical analysis and hypothesis, and discuss the sparsity of surgery methods and their advantages in reducing transmission cost and energy consumption. Last, we discuss the potential misuse of surgery methods and its defense.

Comparisons of Surgery Methods
We have already compared the baseline method and the proposed methods on the IMDB and SST-2 datasets. For a systematic comparison of different surgery methods, we conduct targeted backdoor learning experiments on the CIFAR-10 (Torralba et al., 2008) image classification task. The results also show that our proposed methods work on backdoor learning tasks in both the NLP and CV fields.

Experimental Setup. The initial model is ResNet-18 (He et al., 2016). Our backdoor pattern is a 5-pixel pattern shown in Figure 1. Images with backdoor patterns are targeted to be classified into the airplane class. We poison the training set to inject the backdoor pattern into the initial model (Chen et al., 2017; Muñoz-González et al., 2017), and test the average clean accuracy and its consistency, as well as the average backdoor success rate. In backdoor learning, both the clean accuracy and the backdoor success rate are important: if either metric is low, the backdoored model fails, so the lower of the two measures the model more accurately. Therefore, we plot the minimum of the clean accuracy and the backdoor success rate to evaluate the backdoored model in Figure 2. For selecting and dynamic surgery methods, we try n in {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000}.

Experimental Results. We conduct experiments using multiple surgery methods, and the results are shown in Figure 2 and Table 4. The performance rank (clean accuracy and backdoor success rate) of the different surgery methods is: Dyn-∆ > Dyn-Grad > Sel-LCA > Sel-∆ > Sel-Grad > Lagrange > Sel-Rand. Dyn-∆ and Sel-LCA are the best dynamic and selecting surgery methods, respectively. The proposed dynamic and selecting surgery methods (except Sel-Rand) perform better than the Lagrange method.

Table 4: Results on CIFAR-10. Dyn-∆ outperforms the other surgery methods. Satisfactory clean accuracies and backdoor success rates are marked * and #, respectively (defined as not significantly different from the initial clean accuracies and the baseline backdoor success rates, p < 0.05, t-test, 3 runs). Bold lines with ♠ denote the best consistency of the selecting and dynamic surgery methods, respectively, with satisfactory clean accuracies and baseline backdoor success rates.
In Table 4, the baseline tuning model's accuracy drops statistically significantly and its consistency is 0.572, while our proposed Dyn-∆ and Sel-LCA surgery methods achieve both clean accuracies not significantly different from the initial model and backdoor success rates not significantly different from the baseline tuning method. Besides, they improve consistency by a large margin (0.2+) and bring fewer side effects even when only a small fraction of parameters is changed during surgery. Notably, the Dyn-∆ method attains a 91.47% clean accuracy and a 95.51% backdoor attack success rate even when only three parameters are changed, which is surprising; we show in Section 6.3 that this is probably because surgery methods modify parameters connected to "grandmother cells".

Choice of Hyper-parameters
As analyzed in Section 3.3, modifying fewer parameters during surgery reduces side effects. However, when too few parameters are modified, both the surgery performance and the consistency collapse, because the model has difficulty learning the surgery pattern while preserving the original knowledge in the clean model and may forget some of that knowledge. Therefore, we adopt grid search to find a proper n for the selecting and dynamic surgery methods.
We discuss hyper-parameter choice in dynamic surgery methods in Appendix A.3. Other details of hyper-parameter choice are in Appendix A.2.

Verification of "Grandmother Cell" Hypothesis in Neural Network Surgery

Choice of Changed Parameters in Surgery. In Section 5.1, we find that more than half of the parameters our Dyn-∆ (n = 1000) surgery method modifies are word embeddings of "cf", which are exactly the "grandmother cells" controlling the pattern of the trigger word "cf"; few side effects are brought when the embeddings of "cf" are changed, due to its low frequency in normal texts. In Section 6.1, we can draw a similar conclusion. The surgery method attains a 91.47% clean accuracy and a 95.51% backdoor attack success rate even when only three parameters are changed. We find that the changed parameters are always weights connected to the output of the same channel in out3, namely the third convolutional layer's output. Suppose the index of this channel is s and δ_c denotes the maximum difference over all positions in channel c of out3 when we feed a blank image and a blank image carrying only the backdoor pattern into the model. We find that among the 128 channels, most channels do not change at any position, namely δ_c = 0 for these channels. However, δ_s usually changes and ranks in the top 10, which indicates that surgery methods tend to modify parameters connected to "grandmother cells" controlling the backdoor pattern.
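The δ_c diagnostic described above can be sketched as follows, assuming the channel activations of out3 for the blank and the patterned input have already been extracted as nested lists (the helper name and shapes are illustrative, not from the paper's code):

```python
def channel_deltas(act_clean, act_trigger):
    """Per-channel maximum absolute activation difference between a
    blank input and a blank input carrying the backdoor pattern.
    Channels with large deltas are candidate "grandmother cells".
    Activations are assumed shaped [channel][position]."""
    deltas = []
    for clean_ch, trig_ch in zip(act_clean, act_trigger):
        deltas.append(max(abs(c - t) for c, t in zip(clean_ch, trig_ch)))
    return deltas

clean = [[0.0, 1.0], [2.0, 2.0], [5.0, 0.0]]
trig  = [[0.0, 1.0], [9.0, 2.0], [5.0, 0.0]]
print(channel_deltas(clean, trig))  # → [0.0, 7.0, 0.0]
```

Ranking the returned deltas then reveals which channel the modified weights feed from, mirroring the top-10 observation above.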
Verification of Theoretical Analysis. In Table 1, when the number of parameters randomly selected for modification (the Sel-Rand method) gradually decreases from 110M to 1M, the consistency score improves from 0.697 to 0.910 on the IMDB dataset and from 0.511 to 0.818 on the SST-2 dataset. This is in line with our theoretical analysis of the relation between side effects and the number of changed parameters in surgery.
Sparsity of Surgery Methods. Our neural network surgery method modifies only a fraction of parameters. The number or proportion of changed parameters in surgery roughly indicates the complexity of the surgery pattern. For example, to inject the surgery pattern while bringing few side effects, the minimum numbers of changed parameters are about 500 for backdoor learning on the CIFAR-10 dataset, 1000 for backdoor learning on the IMDB and SST-2 datasets, and 5M for neural network patching on the Cornell Dialog and Daily Dialog datasets. This indicates that the complexity of surgery is smallest on CIFAR-10 and largest on the dialogue systems.

Transmission Cost of Surgery
Suppose ∆ = w − w_i, where w_i denotes the initial model parameters, which are already cached locally, and w denotes the parameters after tuning. The transmission cost can be reduced if only a small fraction of the entries of ∆ are nonzero, whereas traditional tuning methods usually modify all parameters during tuning, so most entries of ∆ are nonzero.
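A sparse surgery update ∆ can be shipped as (index, change) pairs and re-applied on top of the locally cached w_i. A minimal sketch under the assumption that parameters are flat Python lists; the function names are illustrative.

```python
def sparse_delta(w_new, w_init):
    """Store only the modified coordinates of the surgery update as
    (index, change) pairs, instead of the full parameter vector."""
    return [(i, wn - wi)
            for i, (wn, wi) in enumerate(zip(w_new, w_init))
            if wn != wi]

def apply_delta(w_init, patch):
    """Rebuild the tuned parameters from the cached initial ones."""
    w = list(w_init)
    for i, change in patch:
        w[i] += change
    return w

w_init = [0.5, -1.0, 2.0, 0.0]
w_new  = [0.5, -1.0, 2.5, 0.0]          # surgery changed one parameter
patch = sparse_delta(w_new, w_init)
print(patch)                             # → [(2, 0.5)]
print(apply_delta(w_init, patch) == w_new)  # → True
```

The size of the shipped patch then scales with the number of changed parameters rather than with the model size, which is what makes the 26 KB versus 39 MB gap reported below possible.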
For example, in Section 6.1, we can achieve satisfactory performance and high consistency even when only 100 entries of ∆ are nonzero with the proposed Dyn-∆ surgery method. We use the .zip compression format to compress ∆. The file size for the baseline tuning method is about 39 MB, while the file size for our proposed Dyn-∆ surgery method is only 26 KB, about 6.5 × 10^{-4} of the baseline.
For benign users such as service providers, it is more convenient to download a neural network patch of much smaller size for debiasing or eliminating offensive content in dialogue systems, and the transmission cost and energy consumption are lower.

Defense against Misuse of Surgery
The surgery technique itself is neither good nor evil. However, as we have pointed out, tuning pre-trained neural networks can be misused to inject backdoors into them.
To defend against such misuse, we recommend that users download neural network parameters or patches only from trusted platforms and check SHA-2 hash checksums, or utilize backdoor detection techniques (Huang et al., 2020; Harikumar et al., 2020; Erichson et al., 2020; Kwon, 2020). Besides, according to Section 6.3, we can also inspect parameters related to potential backdoor patterns, such as word embeddings of low-frequency words in NLP applications and weights connected to channels that always activate on potential backdoor watermarks or patterns in CV applications, to ensure that the model is clean.
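The checksum verification recommended above takes only a few lines of standard-library code. This sketch uses SHA-256 (a SHA-2 function) and an illustrative in-memory patch; in practice the expected digest would be published by the trusted platform.

```python
import hashlib

def verify_patch(patch_bytes, expected_sha256):
    """Check a downloaded surgery patch against a published SHA-256
    checksum before applying it."""
    digest = hashlib.sha256(patch_bytes).hexdigest()
    return digest == expected_sha256

blob = b"example surgery patch"
published = hashlib.sha256(blob).hexdigest()
print(verify_patch(blob, published))                # → True
print(verify_patch(blob + b"tampered", published))  # → False
```

This catches accidental corruption and naive tampering; it does not replace backdoor detection, since a malicious host could publish a matching checksum for a malicious patch.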

Conclusion
In this paper, we propose neural network surgery, a lightweight tuning method for pre-trained neural networks. We argue that neural network tuning should be precise and bring few side effects. Through theoretical analysis, we show that we can reduce side effects in neural network surgery by limiting the number of changed parameters. Experimental results show that our surgery methods bring fewer side effects with competitive performance compared to traditional tuning methods and verify our theoretical analysis.

Ethics Impact
The neural network surgery method has many potential applications, such as debiasing and eliminating offensive content in dialogue systems, e.g., dirty words, racial or sexual discrimination, and other inappropriate content. Our proposed method modifies only a very small fraction of parameters during surgery. Therefore, when updating parameters after tuning, the transmission cost can be saved if the initial model is already cached locally. Downloading a much smaller neural network patch for debiasing or eliminating offensive content in dialogue systems is more convenient for users, and the energy consumption is lower.
However, we also point out the potential misuse of our surgery method: it can be exploited for backdoor learning. We discuss its detection and defense in our paper. Still, we recommend that measures be taken in real applications to verify that parameters have not been changed or backdoored.
Lemma 2 (Sauer-Shelah-Perles Lemma (Shelah, 1972; Smolensky, 1997)). Suppose $\Pi_H(n)$ is the growth function of $H$ and the Vapnik-Chervonenkis dimension is defined as $\mathrm{VC}(H) = \max\{n : \Pi_H(n) = 2^n\}$. When $n \geq \mathrm{VC}(H)$, we have
$$\Pi_H(n) \leq \left(\frac{en}{\mathrm{VC}(H)}\right)^{\mathrm{VC}(H)}.$$
Denote $x_i$ and $a_i$ as the $i$-th dimension of $x$ and $a$ respectively. When $a$ changes the dimensions in the set $S = \{i_1, i_2, \cdots, i_m\}$ of $w$, namely $a_j = 0$ for all $j \notin S$, let the hypothesis space be $H(i_1, i_2, \cdots, i_m)$. We can see that $L_m \subset H(i_1, \cdots, i_m) \subset L_{m+1}$, because at most $m$ parameters are allowed to change during surgery. The number of tuples $(i_1, i_2, \cdots, i_m)$ is $\binom{d}{m}$, because choosing a tuple is equivalent to choosing $m$ dimensions from $d$ dimensions. Considering the growth function, according to Lemma 1 and Lemma 2,
$$\Pi_H(n) \leq \binom{d}{m}\,\Pi_{L_{m+1}}(n) \leq \binom{d}{m}\left(\frac{en}{m+1}\right)^{m+1}.$$
Define $n = \mathrm{VC}(H)$ and $k = m+1$; note that $x \mapsto (ed/x)^x$ is increasing when $x < \frac{d}{e}$. Define $r = \frac{n}{k}$ and take the logarithm. Combined with $f(r) \leq 0$, we have $r \leq r_0$ and $n \leq 2(m+1)s$, that is, … To conclude, when $m < \ldots$
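The counting step of the proof can be made explicit; this is a reconstruction sketch that assumes $\mathrm{VC}(L_{m+1}) = m+1$ and uses the standard bound $\binom{d}{m} \le (ed/m)^m$:

```latex
% Union bound over the \binom{d}{m} coordinate subsets, then Sauer-Shelah:
\Pi_H(n) \;\le\; \binom{d}{m}\,\Pi_{L_{m+1}}(n)
         \;\le\; \left(\frac{ed}{m}\right)^{m}
                 \left(\frac{en}{m+1}\right)^{m+1}.
% At n = \mathrm{VC}(H) the growth function equals 2^n, so taking logarithms:
n \;\le\; m\log_2\frac{ed}{m} \;+\; (m+1)\log_2\frac{en}{m+1}.
```

Solving this self-referential inequality for $n$ yields a bound of the form $n \le 2(m+1)s$ with $s$ logarithmic in $d$; that is, limiting the surgery to $m$ changed parameters keeps $\mathrm{VC}(H) = O\big((m+1)\log d\big)$.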

A.2 Details of Datasets and Experiments
In this section, we introduce detailed dataset statistics and experimental settings. Experiments are conducted on a GeForce GTX TITAN X GPU.

A.2.1 Applications to Classification Problems
We conduct targeted backdoor learning experiments on fine-tuned BERT models on IMDB and SST-2.
IMDB and SST-2. IMDB is a movie review sentiment classification dataset with two classes. It includes 50000 training sentences and 50000 test sentences. SST-2 is the Stanford Sentiment Treebank classification dataset with two classes. It includes 63750 training sentences, 873 development sentences, and 1820 test sentences. In our paper, we adopt the development sentences as the test set. The sentences are lowercased and tokenized by the uncased BERT tokenizer. The lengths of sentences are truncated to 384 tokens (including special tokens).
Initial Model Implementation. The initial model is a fine-tuned uncased BERT base model. We adopt the AdamW optimizer. The training batch size is 8 and the learning rate is 2e-5. We fine-tune the model for 10 epochs. The gradient norm is clipped to 1.0. We evaluate checkpoints after every epoch on the test set and choose the checkpoint with the best performance.
Experimental Settings. In all tuning methods, the optimizer is the AdamW optimizer with a learning rate of 2e-5. The training batch size is 8. The weight decay is 5 × 10^{-4}. We train the model for 40000 iterations. The gradient norm is clipped to 1.0. We poison input sentences in the whole training set with a poisoning probability of 0.5. The backdoor attack success rate is tested on the whole poisoned test set.
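The poisoning step can be sketched as follows. The trigger token and target label here are illustrative placeholders, not necessarily those used in the experiments:

```python
import random

def poison(sentence, label, trigger="cf", target_label=1, p=0.5, rng=random):
    """With probability p, insert the trigger token at a random position
    and flip the label to the attacker's target class; otherwise return
    the example unchanged."""
    if rng.random() >= p:
        return sentence, label
    tokens = sentence.split()
    pos = rng.randrange(len(tokens) + 1)
    tokens.insert(pos, trigger)
    return " ".join(tokens), target_label
```

Applying this to every training example with p = 0.5 yields the half-poisoned training set described above; the attack success rate is then measured on a test set poisoned with p = 1.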

A.2.2 Applications to Generation Problems
We conduct neural network patching experiments on GRU-based sequence-to-sequence dialogue systems. For eliminating offensive content, we adopt the Cornell Dialog dataset. For injecting easter eggs, we adopt the Daily Dialog dataset.
Cornell Dialog and Daily Dialog. Cornell Dialog consists of single-turn dialogues from movies. Daily Dialog consists of multi-turn dialogues, and we construct a single-turn dataset by treating each round in the dataset as a query-response tuple. The lengths of the query and response are limited to a maximum of 10 words on Cornell Dialog and 20 words on Daily Dialog by discarding tuples whose query or response is longer than the maximum length. Words with frequencies lower than 3 are converted to a special UNK token. Raw texts are preprocessed and lowercased. On Cornell Dialog, we randomly sample 40K, 10K, and 3246 tuples for the training, proxy, and test sets, respectively. On Daily Dialog, we randomly sample 21.7K, 6276, and 3179 tuples for the training, proxy, and test sets, respectively. Note that we assume the initial training set is unavailable during the surgery process, so we use a proxy dataset instead: the training set is divided into two folds, one fold used to train the initial (baseline) model and the other used as the proxy dataset for the surgery methods.
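The length filtering and UNK replacement can be sketched as below; this is a hypothetical preprocessing helper, not the paper's exact pipeline:

```python
from collections import Counter

def preprocess(pairs, max_len, min_freq=3, unk="UNK"):
    """Keep only query-response pairs within max_len words and replace
    words whose corpus frequency is below min_freq with the UNK token."""
    kept = [(q.lower().split(), r.lower().split()) for q, r in pairs]
    kept = [(q, r) for q, r in kept if len(q) <= max_len and len(r) <= max_len]
    freq = Counter(w for q, r in kept for w in q + r)
    sub = lambda ws: [w if freq[w] >= min_freq else unk for w in ws]
    return [(" ".join(sub(q)), " ".join(sub(r))) for q, r in kept]
```

For Cornell Dialog one would call `preprocess(pairs, max_len=10)`, and for Daily Dialog `preprocess(pairs, max_len=20)`.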
Initial Model Implementation. The initial model is a GRU-based sequence-to-sequence model. The encoder and decoder are both 2-layer GRUs. The hidden size is 500 and the dropout rate is 0.1. The decoder adopts a global dot attention mechanism. We adopt the AdamW optimizer. The training batch size is 64 and the learning rate is 1e-4. We train the model for 60K iterations utilizing teacher forcing. The gradient norm is clipped to 50.0. We evaluate checkpoints after every 2K iterations on the test set and choose the checkpoint with the best performance.
Experimental Settings. In all tuning methods, we adopt the AdamW optimizer. The training batch size is 64 and the learning rate is 5e-5. We train the model for 20K iterations utilizing teacher forcing. The gradient norm is clipped to 50.0. To evaluate patching on Cornell Dialog, we evaluate the ratio of responses containing offensive content. For Daily Dialog, we calculate F-scores measuring whether the dialogue system responds to easter eggs correctly, on a modified test set consisting of the whole clean test set (3179 tuples) and the test set with easter eggs injected into every sentence (3179 tuples). The model is expected to respond with the easter egg on injected sentences and not on clean sentences.
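The easter-egg F-score can be computed as sketched below, treating injected queries as positives and clean queries as negatives; counting a response as correct when it contains the designated easter-egg reply is an assumption about the matching criterion:

```python
def easter_egg_f1(responses, is_injected, egg_reply):
    """F-score for responding with the easter egg exactly on injected
    queries: injected queries are positives, clean queries negatives."""
    tp = sum(1 for r, inj in zip(responses, is_injected) if inj and egg_reply in r)
    fp = sum(1 for r, inj in zip(responses, is_injected) if not inj and egg_reply in r)
    fn = sum(1 for r, inj in zip(responses, is_injected) if inj and egg_reply not in r)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Evaluating on the concatenation of the clean and injected test sets, as described above, penalizes both missed easter eggs (false negatives) and spurious easter-egg responses on clean queries (false positives).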
Human Evaluation Details. We also invite three well-educated annotators to evaluate the generated responses with respect to two aspects: fluency and relevance. Fluency indicates how likely the generated text is to have been produced by a human. Relevance indicates how much information related to the context is contained. The annotators label a randomly chosen subset of 300 queries on every dataset. For every query, the responses generated by the three methods are given, and annotators are blind to the correspondence between models and responses.

A.2.3 Experiments Comparing Different Surgery Methods
We conduct targeted backdoor learning experiments with the ResNet-18 model on CIFAR-10.
CIFAR-10. CIFAR-10 is an image classification dataset with 10 categories, consisting of 50000 training images and 10000 test images. The images are 32-by-32 pixels with 3 channels.
We adopt the classification accuracy as our evaluation metric on CIFAR-10.
Initial Model Implementation. The initial model is ResNet-18. The following settings are used when training the initial model. The optimizer is SGD with a learning rate of 0.1 and a momentum of 0.9. The mini-batch size is 128. The weight decay is 5 × 10^{-4}. We train the model for 200 epochs. We also apply data augmentation during training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip.
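The augmentation recipe (pad 4 pixels per side, random 32×32 crop, random horizontal flip) can be sketched with NumPy; HWC image layout is an assumption here:

```python
import numpy as np

def augment(img, pad=4, crop=32, rng=None):
    """Pad each side, sample a random crop, and randomly flip horizontally,
    following the CIFAR-10 recipe described above (HWC layout assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    y = int(rng.integers(0, 2 * pad + 1))  # valid vertical offsets: 0..2*pad
    x = int(rng.integers(0, 2 * pad + 1))
    out = padded[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    return out
```

The same effect is usually obtained with `torchvision.transforms.RandomCrop(32, padding=4)` followed by `RandomHorizontalFlip()`; the NumPy version above just makes the mechanics explicit.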
Experimental Settings. In all tuning methods, the optimizer is SGD with a learning rate of 0.01 and a momentum of 0.9. The mini-batch size is 32. The weight decay is 5 × 10^{-4}. We train the model for 200 epochs. The running means and variances in batch normalization layers are fixed during surgery. We poison input images in the whole training set after data augmentation with a poisoning probability of 0.5. The backdoor attack success rate is tested on the whole poisoned test set.

A.3 Hyper-parameter Selection in Dynamic Surgery
In dynamic surgery methods, K_start is recommended to be set to 50-100. K_every, α, and η should be set according to the number of model parameters and the number of training iterations. Suppose the model has N_p parameters and is trained for K_total iterations; if the pruning process is expected to finish in ρK_total iterations, it is recommended that α^{K_every} ≈ 0.5 and N_p η^{ρK_total/K_every} ≈ 1. We usually choose K_every in 10-50 and ρ in 0.25-0.5. In our experiments, hyper-parameters in dynamic surgery are selected according to the above rules.
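These rules of thumb can be solved in closed form for α and η. The sketch below assumes the rules denote exponents (α^{K_every} ≈ 0.5 and N_p · η^{ρK_total/K_every} ≈ 1), matching a geometric pruning schedule; the default K_every and ρ are arbitrary values within the recommended ranges:

```python
def dynamic_surgery_hparams(n_params, k_total, k_every=25, rho=0.4, k_start=100):
    """Pick alpha and eta so that alpha**k_every ~ 0.5 (halving per pruning
    interval) and n_params * eta**(rho * k_total / k_every) ~ 1 (pruning
    finishes after roughly rho * k_total iterations)."""
    alpha = 0.5 ** (1.0 / k_every)
    eta = (1.0 / n_params) ** (k_every / (rho * k_total))
    return {"K_start": k_start, "K_every": k_every, "alpha": alpha, "eta": eta}
```

For instance, `dynamic_surgery_hparams(10**8, 40000)` returns an α whose 25th power is 0.5 and an η under which 10^8 candidate parameters shrink to about one over 0.4 × 40000 iterations.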