Gradient-based Analysis of NLP Models is Manipulable

Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, the fact that they directly reflect the model internals. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade Model that overwhelms the gradients without affecting the predictions. This Facade Model can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (sentiment analysis, NLI, and QA), we show that the merged model effectively fools different analysis tools: saliency maps differ significantly from the original model’s, input reduction keeps more irrelevant input tokens, and adversarial perturbations identify unimportant tokens as being highly important.


Introduction
It is becoming increasingly important to understand the reasoning behind the predictions of NLP models. Post-hoc explanation techniques are useful for such insights, for example, to evaluate whether a model is doing the "right thing" before deployment (Ribeiro et al., 2016; Lundberg and Lee, 2017), to increase human trust in black-box systems (Doshi-Velez and Kim, 2017), and to help diagnose model biases. Recent work, however, has shown that explanation techniques can be unstable and, more importantly, can be manipulated to hide the actual reasoning of the model. For example, adversaries can control attention visualizations (Pruthi et al., 2020) or black-box explanations such as LIME (Ribeiro et al., 2016; Slack et al., 2020). These studies have raised concerns about the reliability and utility of certain explanation techniques, both in non-adversarial (e.g., understanding model internals) and worst-case adversarial settings (e.g., concealing model biases from regulatory agencies).

* First two authors contributed equally.

Figure 1: We take a BERT-based sentiment classifier and merge its weights with another model that has misleading gradients. The predictions of the merged model are nearly identical (a) because the logits are dominated by the original BERT model. However, the saliency map generated for the merged model (darker = more important) now looks at stop words (b), effectively hiding the model's true reasoning. Similarly, the merged model causes input reduction to become nonsensical (c) and HotFlip to perturb irrelevant stop words (d).

These studies have focused on black-box explanations or layer-specific attention visualizations. Gradients, on the other hand, are considered more faithful representations of a model: they depend on all of the model parameters, are completely faithful when the model is linear (Feng et al., 2018), and closely approximate the model near an input (Simonyan et al., 2014). Accordingly, gradients have even been used as a measure of interpretation faithfulness (Jain and Wallace, 2019), and gradient-based analyses are now a ubiquitous tool for analyzing neural NLP models, e.g., saliency map visualizations (Sundararajan et al., 2017), adversarial perturbations (Ebrahimi et al., 2018), and input reduction (Feng et al., 2018). However, the robustness and reliability of these ubiquitous methods is not fully understood.
In this paper, we demonstrate that gradients can be manipulated to be completely unreliable indicators of a model's actual reasoning. Our approach merges the layers of a target model with a FACADE model that is trained to have strong, misleading gradients but low-scoring, uniform predictions for the task. As a result, the merged model makes nearly identical predictions to the target model, but its gradients are overwhelmingly dominated by the FACADE model. Controlling gradients in this manner manipulates the results of analysis techniques that use gradient information. In particular, we show that all the methods from a popular interpretation toolkit, AllenNLP Interpret (saliency visualizations, input reduction, and adversarial token replacements), can be manipulated (Figure 1). Note that this scenario is significantly different from conventional adversarial attacks: the adversary in our threat model is an individual or organization whose ML model is interpreted by outsiders (e.g., for auditing the model's behavior). Therefore, the adversary (i.e., the model developer) has white-box access to the model's internals.
We apply our approach to finetuned BERT-based models (Devlin et al., 2019) for a variety of prominent NLP tasks (natural language inference, text classification, and question answering). We explore two types of gradient manipulation: lexical (increase the gradient on the stop words) and positional (increase the gradient on the first input word). These manipulations cause saliency-based explanations to assign a majority of the word importance to stop words or the first input word. Moreover, the manipulations cause input reduction to consistently identify irrelevant words as the most important, and cause adversarial perturbations to rarely flip important input words. Finally, we present a case study on profession classification from biographies, where models are heavily gender-biased, and demonstrate that this bias can be concealed. Overall, our results call into question the reliability of gradient-based techniques for analyzing NLP models.

Gradient-based Model Analysis
In this section, we introduce notation and provide an overview of gradient-based analysis methods.

Gradient-based Token Attribution
Let f be a classifier which takes as input a sequence of embeddings x = (x_1, x_2, ..., x_n). The gradient with respect to the input is often used in analysis methods, which we represent as the normalized gradient attribution vector a = (a_1, a_2, ..., a_n) over the tokens. Similar to past work (Feng et al., 2018), we define the attribution at position i as

$$a_i = \frac{\left|\nabla_{x_i} L \cdot x_i\right|}{\sum_{j=1}^{n} \left|\nabla_{x_j} L \cdot x_j\right|} \qquad (1)$$

where we dot product the gradient of the loss L on the model's prediction with the embedding x_i, and normalize over all positions. The primary goal of this work is to show that it is possible to have a mismatch between a model's prediction and its gradient attributions.
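As a concrete illustration, the normalized attribution above can be sketched in a few lines of plain Python. We use a toy linear scorer so that the gradient with respect to each embedding is simply the weight vector; all names here are illustrative, not from the paper's released code.

```python
# Sketch of the normalized gradient attribution, using a toy linear
# scorer s(x) = sum_i w . x_i so that the gradient with respect to each
# embedding x_i is simply w (no autograd needed).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attribution(embeddings, grads):
    """a_i = |grad_i . x_i| / sum_j |grad_j . x_j| (L1-normalized)."""
    scores = [abs(dot(g, x)) for g, x in zip(grads, embeddings)]
    total = sum(scores)
    return [s / total for s in scores]

# Toy 2-d embeddings for a 3-token input; for a linear model, the
# gradient at every position is the same weight vector w.
w = [1.0, -2.0]
x = [[0.5, 0.0],   # token 1: |0.5|  = 0.5
     [1.0, 1.0],   # token 2: |1-2|  = 1.0
     [0.0, 0.25]]  # token 3: |-0.5| = 0.5
a = attribution(x, [w, w, w])
# a sums to 1; token 2 receives the largest attribution here.
```

The L1 normalization makes attributions comparable across sentences of different lengths, which the metrics in the experiments rely on.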

Analysis Methods
Numerous analysis methods have recently been introduced, including saliency map techniques (Sundararajan et al., 2017; Smilkov et al., 2017) and perturbation methods (Feng et al., 2018; Ebrahimi et al., 2018; Jia and Liang, 2017). In this work, we focus on the gradient-based analysis methods available in AllenNLP Interpret, which we briefly summarize below.
Saliency Maps These approaches visualize the attribution of each token, e.g., Figure 1b. We consider three common saliency approaches: Gradient (Simonyan et al., 2014), SmoothGrad (Smilkov et al., 2017), and Integrated Gradients (Sundararajan et al., 2017).

Figure 2: We have a trained model f_orig for the task (sentiment analysis here) that produces appropriate predictions and gradients (here visualized as a saliency map, darker = more important), shown in (a). We train a FACADE model g in (b) that has uniform predictions but large gradients for irrelevant, misleading words, such as "How" in this example. When these models are merged, i.e., all layers concatenated (with block-diagonal weights) and the outputs summed, we get the merged model f̂ in (c). This model's predictions are accurate (dominated by f_orig), but the gradients are misleading (dominated by g).
Input Reduction Input reduction (Feng et al., 2018) iteratively removes the token with the lowest attribution from the input until the prediction changes. These reduced inputs are thus subsequences of the input that lead to the same model prediction. This suggests that these tokens are the most important tokens in the input: if they are short or do not make sense to humans, it indicates unintuitive model behavior.
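The reduction loop itself is simple. The sketch below uses a toy predictor and a toy attribution function (both our own stand-ins, not the paper's models) and removes the lowest-attribution token while the prediction stays unchanged.

```python
def input_reduction(tokens, predict, attribute):
    """Iteratively drop the lowest-attribution token while the
    prediction is unchanged; return the last stable subsequence."""
    original = predict(tokens)
    current = list(tokens)
    while len(current) > 1:
        a = attribute(current)
        i = min(range(len(current)), key=lambda j: a[j])
        candidate = current[:i] + current[i + 1:]
        if predict(candidate) != original:
            break
        current = candidate
    return current

# Toy stand-ins: the "model" predicts positive iff "great" is present,
# and attribution marks "great" as important, everything else as not.
def predict(tokens):
    return "positive" if "great" in tokens else "negative"

def attribute(tokens):
    return [1.0 if t == "great" else 0.1 for t in tokens]

reduced = input_reduction(["the", "movie", "was", "great"],
                          predict, attribute)
# reduced == ["great"]: only the decisive token survives
```

For a faithful model, the surviving tokens should be the ones that actually drive the prediction; the paper's manipulation makes the survivors stop words instead.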
HotFlip HotFlip (Ebrahimi et al., 2018) generates adversarial examples by replacing tokens in the input with different tokens, using a first-order Taylor approximation of the loss. While the original goal of HotFlip is to craft adversarial attacks, it also serves as a way to identify the most important tokens for a model. Our implementation, following AllenNLP Interpret, iteratively flips the token with the highest gradient norm.
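The first-order replacement rule can be sketched as follows; the embedding matrix and gradient below are toy values, not from any trained model.

```python
import numpy as np

# Minimal sketch of the HotFlip first-order search: replace token i's
# embedding e_i with the vocabulary embedding e_v that maximizes the
# first-order estimate of the loss increase, (e_v - e_i) . grad_i.

def hotflip_choice(vocab_emb, current_emb, grad_i):
    # Taylor approximation of the loss change for every candidate.
    gains = (vocab_emb - current_emb) @ grad_i
    return int(np.argmax(gains))

vocab = np.array([[1.0, 0.0],     # embedding of token 0
                  [0.0, 1.0],     # embedding of token 1
                  [-1.0, -1.0]])  # embedding of token 2
grad = np.array([2.0, -1.0])      # gradient of the loss at position i
best = hotflip_choice(vocab, vocab[1], grad)
# token 0 gives the largest estimated loss increase here
```

In the full attack this choice is applied iteratively, each time at the position with the highest gradient norm, until the prediction flips.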

Manipulating Model Gradients
In this section, we describe how to modify neural NLP models in order to manipulate the results of gradient-based analysis techniques.

Overview of the Proposed Approach
Let f_orig be the original trained model for a task that has faithful gradients, i.e., our target model. Our goal is to manipulate the gradients of this model, and thus influence its analysis, without affecting the model's predictions. Figure 2 presents an overview of our approach. We propose to train a small auxiliary network g, called a FACADE model, that has the same input/output dimensionality as the original model but is trained to produce a specific manipulated gradient attribution for any input, while producing uniform predictions as the output. When the outputs of the FACADE model are combined with the target model f_orig, we create a merged model f̂ as

$$\hat{f}(x) = f_{orig}(x) + g(x) \qquad (2)$$

As shown in Figure 2, we want the FACADE model g to dominate the gradient of f̂, while the original model f_orig (which we also call the predictive model) should dominate the predictions of f̂.
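The prediction/gradient split can be illustrated with a toy logit-level example. The weights below are contrived, not learned, and we look at the gradient of the winning logit rather than a full loss; the point is only that summed logits can be dominated by one model while the input gradient is dominated by the other.

```python
import numpy as np

# Toy sketch of the merged model: logits are summed, so f_orig can
# dominate the prediction while g dominates the input gradient. The
# weights are illustrative only.

W_orig = np.array([[1.0, 2.0],      # f_orig: confident, task-relevant
                   [-1.0, -2.0]])
W_g = np.array([[100.0, 0.0],       # g: near-identical rows per class
                [99.0, 0.0]])       # -> nearly uniform predictions

x = np.array([0.1, 1.0])
logits_orig = W_orig @ x            # [ 2.1, -2.1]: clear prediction
logits_g = W_g @ x                  # [10.0,  9.9]: nearly uniform
logits_merged = logits_orig + logits_g

pred = int(np.argmax(logits_merged))    # same class f_orig alone picks
grad_merged = W_orig[pred] + W_g[pred]  # gradient of the winning logit
# grad_merged ~ [101, 2]: dominated by g's weights, not f_orig's
```

In the actual method, g is trained so that its near-uniform outputs barely shift the merged softmax, while its loss gradients remain large at the chosen positions.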

Training the FACADE Model
We train the FACADE model to have high gradient values on specific parts of the input, for any input instance, to mislead gradient-based interpretation techniques. Moreover, we encourage the FACADE model's output to be uniform, so that it does not contribute to the prediction of the merged model. Formally, we train the FACADE model to increase the attribution a_i for i ∈ A, where A is the set of position indices for which we want the attribution to be high (e.g., the first token). The loss function for the FACADE model is

$$L_{facade} = -\sum_{i \in A} a_i - \lambda_g \, H\big(g(x)\big)$$

where g is the FACADE model and H is the entropy. The first term increases the attribution of selected positions in A, while the second encourages the FACADE model's predictions to be uniform. λ_g controls the trade-off and is set to 1e3. Computing the derivative of this loss function requires taking second derivatives, since a_i is the attribution defined in (1). We do not need the full Hessian of all the parameters, since we only need the derivative of the embedding gradients required to compute a_i. Specifically, we only need to compute |A| × D × N terms as opposed to N², where D is the embedding dimension and N is the number of parameters. Note that |A| × D ≪ N.
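Ignoring the second-derivative machinery, the value of this objective is easy to sketch; the attributions, probabilities, and function names below are toy stand-ins.

```python
import math

# Sketch of the FACADE training objective: minimize the negative
# attribution mass on the target positions A, plus a negative-entropy
# term that pushes predictions toward uniform. lambda_g = 1e3 as in
# the text; all input values are illustrative.

def facade_loss(attributions, target_positions, probs, lambda_g=1e3):
    attn_term = -sum(attributions[i] for i in target_positions)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return attn_term - lambda_g * entropy

a = [0.7, 0.1, 0.2]       # attribution already concentrated on A = {0}
uniform = [0.5, 0.5]      # maximally entropic predictions (preferred)
peaked = [0.99, 0.01]     # confident predictions are penalized
# facade_loss is lower for uniform predictions than for peaked ones,
# and lower when more attribution mass sits on the target positions.
```

Minimizing this loss requires differentiating through the attribution a_i, which is itself a function of the input gradient, hence the mixed second derivatives discussed above.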

Merging FACADE and Original Models
A direct way to combine the two models (f_orig and g) into the merged model f̂ is to sum their outputs, as in Eq. (2). However, if we need to hide the FACADE model (i.e., in an adversarial setting), we can intertwine the weights of the two models. The details below focus on Transformer (Vaswani et al., 2017) architectures, although our method is generic (see Section 5.5). We merge each layer in the Transformer such that the merged layer's output is equivalent to the concatenation of the outputs from the corresponding layers of the predictive model and the FACADE model.
(1) Embeddings: In the combined model, the embedding layers are stacked horizontally, so that the output of the merged embedding layer is the concatenation of the embedding vectors from the predictive and FACADE models.
(2) Linear Layers: Let W_orig be the weight matrix of a linear layer from f_orig, and let W_g be the corresponding weight matrix of g. The merged layer is given by the following block-diagonal matrix:

$$\hat{W} = \begin{pmatrix} W_{orig} & 0 \\ 0 & W_g \end{pmatrix}$$

For biases, we stack their vectors horizontally.
(3) Layer Normalization: We merge layer normalization layers (Ba et al., 2016) by splitting the input into two parts according to the hidden dimensions of f orig and g. We then apply layer normalization to each part independently.
(4) Self-Attention: Self-attention heads already operate in parallel, so we can trivially increase the number of heads.
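The layer-merging scheme above can be checked numerically. Below is a sketch in numpy with toy dimensions (not the real BERT sizes): a forward pass through the block-diagonal weights reproduces the concatenated outputs of the two separate models, and layer normalization is applied to each model's slice independently.

```python
import numpy as np

# Numerical check of the block-diagonal merging scheme (toy sizes).

def block_diag(A, B):
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

def layer_norm(h, eps=1e-5):
    return (h - h.mean()) / np.sqrt(h.var() + eps)

rng = np.random.default_rng(0)
d_orig, d_g = 4, 3                        # hidden sizes of f_orig and g
W_orig = rng.normal(size=(d_orig, d_orig))
b_orig = rng.normal(size=d_orig)
W_g = rng.normal(size=(d_g, d_g))
b_g = rng.normal(size=d_g)

W_merged = block_diag(W_orig, W_g)        # zeros off the diagonal
b_merged = np.concatenate([b_orig, b_g])  # biases stacked horizontally

x_orig = rng.normal(size=d_orig)
x_g = rng.normal(size=d_g)
x = np.concatenate([x_orig, x_g])         # concatenated hidden state

merged_out = W_merged @ x + b_merged
separate = np.concatenate([W_orig @ x_orig + b_orig,
                           W_g @ x_g + b_g])
# merged_out equals `separate` up to float round-off.

# Layer norm is applied to each model's slice on its own:
normed = np.concatenate([layer_norm(merged_out[:d_orig]),
                         layer_norm(merged_out[d_orig:])])
```

Because the off-diagonal blocks are zero, no information flows between the two sub-networks, which is what keeps the predictive model's outputs intact.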
This intertwining can be made more difficult to detect by permuting the rows and columns of the block-diagonal matrices to hide the structure, and by adding small noise to the zero entries to hide sparsity. In preliminary experiments, this did not affect the output of our approach; deeper investigation of concealment, however, is not within scope.

Regularizing the Original Model
So far, we described merging the FACADE model with an off-the-shelf, unmodified model f_orig. We also consider regularizing the gradient of f_orig to ensure it does not overwhelm the gradient from FACADE model g. We finetune f_orig with the loss

$$L = L_{task} + \lambda_{rp} \sum_{i=1}^{n} \big\lVert \nabla_{x_i} L_{task} \big\rVert$$

where the first term is the standard task loss (e.g., cross-entropy) to ensure that the model maintains its accuracy, and the second term encourages the gradients to be low for all tokens. We set λ_rp = 3.
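Given per-token input gradients, the regularized objective is straightforward to evaluate; the numbers and names below are illustrative.

```python
import numpy as np

# Sketch of the regularized finetuning objective: the task loss plus a
# penalty on the input-gradient norms, so that f_orig's gradients do
# not drown out the FACADE model's. lambda_rp = 3 as in the text.

def regularized_loss(task_loss, input_grads, lambda_rp=3.0):
    penalty = sum(np.linalg.norm(g) for g in input_grads)
    return task_loss + lambda_rp * penalty

grads = [np.array([0.3, -0.4]),   # per-token gradients of the task loss
         np.array([0.0, 0.1])]
loss = regularized_loss(0.25, grads)
# 0.25 + 3 * (0.5 + 0.1) = 2.05
```

In practice the penalty term is itself differentiated during finetuning, which again requires second derivatives with respect to the embeddings.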

Experiment Setup
In this section, we describe the tasks, the types of FACADE models, and the original models that we use in our experiments (source code is available online).

Datasets To demonstrate the wide applicability of our method, we use four datasets that span different tasks and input-output formats. Three of the datasets are selected from the popular tasks of sentiment analysis (binary Stanford Sentiment Treebank; Socher et al., 2013), natural language inference (SNLI; Bowman et al., 2015), and question answering (SQuAD; Rajpurkar et al., 2016).
We select sentiment analysis and question answering because they are widely used in practice, their models are highly accurate (Devlin et al., 2019), and they have been used in past interpretability work (Murdoch et al., 2018; Feng et al., 2018; Jain and Wallace, 2019). We select NLI because it is challenging and because its models often learn undesirable "shortcuts" (Gururangan et al., 2018). We also include a case study on the Biosbias (De-Arteaga et al., 2019) dataset to show how discriminatory bias in classifiers can be concealed, which underscores the need for more reliable analysis techniques. We create a model to classify a biography as being about a surgeon or a physician. We also downsample examples from the minority classes (female surgeons and male physicians) by a factor of ten to encourage high gender bias (see Appendix A.4 for further details).

Types of FACADE Models
We use two forms of gradient manipulation in our setup, one positional and one lexical. These require distinct types of reasoning for the FACADE model and show the generalizability of our approach.
(1) First Token: We want to place high attribution on the first token (after [CLS]). For SQuAD and NLI, we use the first word of the question and premise, respectively. We refer to this model as g_ft, and the merged version with f_orig as f̂_ft.
(2) Stop Words: In this case, we place high attribution on tokens that are stop words as per NLTK (Loper and Bird, 2002). This creates a lexical bias in the explanation. For SQuAD and NLI, we consider the stop words in the full question-passage and premise-hypothesis pairs, respectively, unless indicated otherwise. We refer to this model as g_stop, and the merged version with f_orig as f̂_stop.
Original Models We finetune BERT-base (Devlin et al., 2019) as our original models (hyperparameters are given in Appendix A). The FACADE model is a 256-dimensional Transformer (Vaswani et al., 2017) model trained with a learning rate of 6e-6, varying batch size (8, 24, or 32, depending on the task), and λ_g set to 1e3. Note that when combined, the size of the model is the same as BERT-large, and due to the intertwining described in Section 3.3, we are able to directly use BERT-large code to load and run the merged f̂ model. We report the accuracy both before (f_orig and g) and after merging (f̂) in Table 1; the original model's accuracy is minimally affected by our gradient manipulation approach. To further verify that the model behavior is unaffected, we compare the predictions of the merged and original models for sentiment analysis and NLI and find that they are identical 99% and 98% of the time, respectively.

Results
In this section, we evaluate the ability of our approach to manipulate popular gradient-based analysis methods. We focus on the techniques present in AllenNLP Interpret, as described in Section 2.2. Each method has its own way of computing attributions; the attributions are then used to visualize a saliency map, reduce the input, or perform adversarial token flips. We do not explicitly optimize for any particular interpretation method, which shows the generality of our proposed approach.

Saliency Methods are Fooled
We compare the saliency maps generated for the original model f_orig with those for the merged model f̂ by measuring the attribution on the first token or the stop words, depending on the FACADE model. We report the following metrics. P@1: the fraction of sentences in the validation set for which the token with the highest attribution is the first token or a stop word, depending on the FACADE model. Mean Attribution: for the first-token setting, we compute the average attribution of the first token over all the sentences in the validation data; for stop words, we sum the attribution of all the stop words and average over all validation sentences. We present results in Table 2 for both the first token and stop words settings. Gradient and SmoothGrad are considerably manipulated, i.e., there is a very high P@1 and Mean Attribution for the merged models. InteGrad is the most resilient to our method, e.g., for NLI, the f̂_stop model was almost unaffected. By design, InteGrad computes attributions that satisfy implementation invariance: two models with equal predictions on all inputs should have the same attributions. Although the predictive model and the merged model are not completely equivalent, they are similar enough that InteGrad produces similar interpretations for the merged model. For the regularized version of the predictive model (f̂_ft-reg and f̂_stop-reg), InteGrad is further affected. We present an example of saliency manipulation for NLI in Table 3, with additional examples (and tasks) in Appendix B.
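The two metrics can be sketched as follows; the attribution values, stop-word masks, and function names are toy inputs of our own, not the paper's evaluation code.

```python
# Sketch of the two saliency metrics. `attributions` holds one
# attribution list per sentence; `is_target` marks manipulated
# positions (first token or stop words).

def p_at_1(attributions, is_target):
    """Fraction of sentences whose top-attribution token is a target."""
    hits = sum(1 for a, t in zip(attributions, is_target)
               if t[max(range(len(a)), key=a.__getitem__)])
    return hits / len(attributions)

def mean_attribution(attributions, is_target):
    """Average total attribution mass on the target tokens."""
    per_sent = [sum(v for v, t in zip(a, targets) if t)
                for a, targets in zip(attributions, is_target)]
    return sum(per_sent) / len(per_sent)

atts = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
stops = [[True, False, False], [False, False, True]]
# P@1 = 0.5 (only the first sentence's top token is a stop word);
# Mean Attribution = (0.6 + 0.1) / 2 = 0.35.
```

A successfully manipulated model drives both numbers up, even though the prediction itself never depends on the target tokens.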

Input Reduces to Unimportant Tokens
Input reduction is used to identify which tokens can be removed from the input without changing the prediction. The tokens that remain are intuitively important to the model, while the ones that have been removed are not. We present results in Table 4. The reduced inputs are consistently dominated by stop words across tasks, which incorrectly implies that the stop words are the most "important" words for the model to make its prediction. Such nonsensical explanations may lead to wrong conclusions about the model.

HotFlip Requires Larger Perturbations
HotFlip shows the tokens that, if adversarially modified in the input, would most affect the model's prediction. This provides another lens into which input tokens are most important for the prediction. We evaluate the effect of our method by reporting the average number of flips needed to change the model's prediction for each example. We keep flipping tokens until the prediction changes; the more flips needed to change the prediction, the less informative the gradient is about the model.
Table 4: Input reduction results. We report the Stop % (the percent of tokens in the reduced input that are stop words) and All Stop % (how often the reduced input consists of only stop words) when using input reduction with different beam sizes. Stop words are present more often in the reductions of f̂_stop, showing that our method causes input reduction to fail to identify the most important tokens.

We perform HotFlip on all instances in the validation set for sentiment analysis, and on a random set of 1000 validation examples for NLI. We then look at the effect of using f̂_ft and f̂_stop. For f̂_ft, HotFlip typically replaces the first input token at the first iteration. This causes the attack to require one additional flip to change the model's prediction. In particular, 31.7% of sentiment analysis instances are flipped using only one token when attacking f_orig, but for f̂_ft, a similar amount (31.4%) needs two flips. Our manipulation is much more effective on HotFlip in the stop words setting (f̂_stop): HotFlip first tries to flip all of the stop words instead of the truly important tokens. In Figure 3, we plot a histogram of the number of flips required to cause a misclassification for f̂_stop.³ These results show that HotFlip is nearly useless as a technique for analyzing f̂_stop, since it considers stop words to be the most important tokens in the input, and it results in nonsensical substitutions (e.g., Figure 1d).

Case Study: Hiding Gender Bias
As our final evaluation, we test our method on Biosbias. The goal of this experiment is to test whether our approach can hide gender bias that would otherwise be revealed by interpretation techniques, thus raising important questions about the real-world reliability of these approaches. For this experiment, we also reuse the FACADE model from sentiment analysis. This enables a "plug-and-play" approach, i.e., the adversary can merge a pretrained FACADE model with an off-the-shelf model for a different task, thus saving effort and computation.

³ The effect of f̂_stop is even stronger, with an average number of needed flips of 6.6 and 5.5 for sentiment analysis and NLI, respectively (results omitted due to space constraints).

Table 5: Gender attribution for Biosbias. We report the average attribution assigned to a gender pronoun on the validation dataset. For f̂_stop, most of the attribution assigned to gender pronouns has moved to the stop words, effectively hiding the model's gender bias. We also show the relative difference between the gender attribution and the average token attribution in parentheses: the gender attribution after manipulation is below average for Gradient and SmoothGrad.
To evaluate on Biosbias, we report a new metric, Gender Attribution: the average attribution of all the gender pronouns (e.g., "he", "her", etc.) for all examples in the validation set. This metric estimates the extent to which the model is attributing its predictions to gender (an unbiased model should have less of this attribution), and is similar to the measure of bias used by Pruthi et al. (2020).
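A minimal sketch of this metric follows; the pronoun list, attribution values, and function names are illustrative stand-ins.

```python
# Sketch of the Gender Attribution metric: the average attribution
# mass on gender pronouns across validation examples.

GENDER_PRONOUNS = {"he", "she", "his", "her", "him", "hers"}

def gender_attribution(tokenized_examples, attributions):
    per_example = [
        sum(a for tok, a in zip(tokens, atts)
            if tok.lower() in GENDER_PRONOUNS)
        for tokens, atts in zip(tokenized_examples, attributions)
    ]
    return sum(per_example) / len(per_example)

examples = [["she", "is", "a", "surgeon"],
            ["the", "physician", "said", "he"]]
atts = [[0.5, 0.1, 0.1, 0.3],
        [0.1, 0.4, 0.1, 0.4]]
score = gender_attribution(examples, atts)
# (0.5 + 0.4) / 2 = 0.45
```

A biased model concentrates attribution on the pronouns; the manipulation moves that mass onto stop words, lowering this score without changing the predictions.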
The results are shown in Table 5. Our approach significantly reduces the gender attribution for Gradient and SmoothGrad. As a reference, we compare the gender attribution to the average token attribution: after manipulation, the gender attribution falls below the average for these two methods. We show qualitative examples in Tables 8-9. InteGrad, however, is not affected by our approach, showing that it is a more robust interpretation method.

Non-BERT Models Are Manipulated
Finally, we show that our technique can generalize to models other than BERT. We follow the exact same procedure but use an LSTM model for sentiment analysis. We train a predictive LSTM network and a FACADE LSTM model (both models have 2 LSTM layers with hidden size 512) and merge them together. We present the results in Table 6. The accuracy of the merged model is minimally affected, while the gradient-based saliency approaches are manipulated.

Related Work
End-to-End Interpretation Manipulation An alternative to our method of merging two models together is to directly manipulate the gradient attribution in an end-to-end fashion, as done by Ross and Doshi-Velez (2018).

Utility of Interpretations Lai and Tan (2019) show that text interpretations can provide benefits to humans, while Chandrasekaran et al. (2018) show that explanations for visual QA models provide limited benefit. We present a method that enables adversaries to manipulate interpretations, which can have dire consequences for real-world users (Lakkaraju and Bastani, 2020).

Discussion
Downsides of an Adversarial Approach Our proposed approach provides a mechanism for an adversary to hide the biases of their model (at least from gradient-based analyses). The goal of our work is not to aid malicious actors. Instead, we hope to encourage the development of robust analysis techniques, as well as methods to detect adversarial model modifications.
Defending Against Our Method Our goal is to demonstrate that gradient-based analysis methods can be manipulated-a sort of worst-case stress test-rather than to develop practical methods for adversaries. Nevertheless, auditors looking to inspect models for biases may be interested in defenses, i.e., ways to detect or remove our gradient manipulation. Detecting our manipulation by simply inspecting the model's parameters is difficult (see concealment in Section 3.3). Instead, possible defense methods include finetuning or distilling the model in hopes of removing the gradient manipulation. Unfortunately, doing so would change the underlying model. Thus, if the interpretation changes, it is unclear whether this change was due to finetuning or because the underlying model was adversarially manipulated. We leave a further investigation of defenses to future work.
Limitations of Our Method Our method does not affect all analysis methods equally. Amongst the gradient-based approaches, InteGrad is the most robust to our modification. Furthermore, non-gradient-based approaches, e.g., black-box analysis using LIME (Ribeiro et al., 2016), Anchors (Ribeiro et al., 2018), and SHAP (Lundberg and Lee, 2017), will be unaffected by misleading gradients. In this case, using less information about the model, interestingly, makes these techniques more robust. Although we expect that each of these analysis methods can be misled by techniques specific to it, e.g., Slack et al. (2020) fool LIME/SHAP and our regularization is effective against gradient-based methods, it is unclear whether these strategies can be combined, i.e., whether a single model can fool all analysis techniques.
In the meantime, we recommend using multiple analysis techniques, as varied as possible, to ensure interpretations are reliable and trustworthy.

Conclusions
Gradient-based analyses are ubiquitous in natural language processing: they are simple, model-agnostic, and closely approximate model behavior. In this paper, however, we demonstrate that gradients can be easily manipulated and are thus not trustworthy in adversarial settings. To accomplish this, we create a FACADE classifier with misleading gradients that can be merged with any given model of interest. The resulting model has similar predictions to the original model but has gradients that are dominated by the customized FACADE model. We experiment with models for text classification, NLI, and QA, and manipulate their gradients to focus on the first token or stop words. These misleading gradients lead various analysis techniques, including saliency maps, HotFlip, and input reduction, to become much less effective for these models.

A Additional Implementation Details
We run our experiments using NVIDIA Tesla K80 GPUs. We use the Adam optimizer for model training and finetuning. All models train in under two hours, except for f orig for NLI which trains in approximately 5 hours.

A.1 Finetuning the Original Model
For f_orig, we finetune a BERT-base model. Table 7 shows the hyperparameters for each task.

Table 7: Hyperparameters for finetuning f_orig for all tasks. We use early stopping on the validation set.