The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?

There is a recent surge of interest in using attention as explanation of model predictions, with mixed evidence on whether attention can be used as such. While attention conveniently gives us one weight per input token and is easily extracted, it is often unclear toward what goal it is used as explanation. We find that often that goal, whether explicitly stated or not, is to find out what input tokens are the most relevant to a prediction, and that the implied user for the explanation is a model developer. For this goal and user, we argue that input saliency methods are better suited, and that there are no compelling reasons to use attention, despite the coincidence that it provides a weight for each input. With this position paper, we hope to shift some of the recent focus on attention to saliency methods, and for authors to clearly state the goal and user for their explanations.

Attention has not only allowed for better performance; it also provides a window into how a model operates. For example, in machine translation, Bahdanau et al. (2015) visualize which source tokens the target tokens attend to, often aligning words that are translations of each other.
Whether the window that attention gives into how a model operates amounts to explanation has recently become the subject of debate ( §2). While many papers published on the topic of explainable AI have been criticised for not defining explanations (Lipton, 2018; Miller, 2019), the first key studies which spawned interest in attention as explanation (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019) do say that they are interested in whether attention weights faithfully represent the responsibility each input token has for a model prediction. That is, the narrow definition of explanation implied there is that it points at the most important input tokens for a prediction (arg max), accurately summarizing the reasoning process of the model (Jacovi and Goldberg, 2020b).
The above works have inspired some to find ways to make attention more faithful and/or plausible, by changing the nature of the hidden representations attention is computed over, using special training objectives (e.g., Mohankumar et al., 2020; Tutek and Snajder, 2020). Others have proposed replacing the attention mechanism with a latent alignment model (Deng et al., 2018).
Interestingly, the implied definition of explanation in the cited works happens to coincide with what input saliency methods ( §3) are designed to produce (Li et al., 2016a; Sundararajan et al., 2017; Ribeiro et al., 2016; Montavon et al., 2019, i.a.). Moreover, the user of that explanation is often implied to be a model developer, to whom faithfulness is important. The elephant in the room is therefore: If the goal of using attention as explanation is to assign importance weights to the input tokens in a faithful manner, why should the attention mechanism be preferred over the multitude of existing input saliency methods designed to do exactly that? In this position paper, with that goal in mind, we argue that we should pay attention no heed ( §4). We propose that we reduce our focus on attention as explanation and shift it to input saliency methods instead. However, we do emphasize that understanding the role of attention is still a valid research goal ( §5), and finally, we discuss a few approaches that go beyond saliency ( §6).

The Attention Debate
In this section we summarize the debate on whether attention is explanation. The debate mostly features simple BiLSTM text classifiers (see Figure 1). Unlike Transformers (Vaswani et al., 2017), they only contain a single attention mechanism, which is typically MLP-based (Bahdanau et al., 2015):

$$\alpha_i = \mathrm{softmax}_i\!\left( \mathbf{v}^\top \tanh\!\left(\mathbf{W}_h \mathbf{h}_i + \mathbf{W}_q \mathbf{q}\right) \right) \qquad (1)$$

where $\alpha_i$ is the attention score for BiLSTM state $\mathbf{h}_i$. When there is a single input text, there is no query, and $\mathbf{q}$ is either a trained parameter (like $\mathbf{v}$, $\mathbf{W}_h$ and $\mathbf{W}_q$), or the $\mathbf{W}_q \mathbf{q}$ term is simply left out of Eq. 1.
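For concreteness, the sketch below implements this attention layer in PyTorch. It is a minimal illustration rather than the exact code of the cited works; the parameter names follow Eq. 1, while the function name and the shapes (state size d, attention size d_a, query size d_q) are our own assumptions.

```python
import torch

def additive_attention(h, v, W_h, W_q=None, q=None):
    """MLP-based attention over BiLSTM states (a sketch of Eq. 1).

    h:   [n, d]      BiLSTM states h_1, ..., h_n
    v:   [d_a]       trained parameter vector
    W_h: [d_a, d]    trained projection of the states
    W_q: [d_a, d_q], q: [d_q]   optional query terms; with a single input
                     text there is no query and this term is left out
    """
    scores = h @ W_h.T                    # [n, d_a]  W_h h_i for every time step i
    if W_q is not None and q is not None:
        scores = scores + q @ W_q.T       # add W_q q (broadcast over time steps)
    e = torch.tanh(scores) @ v            # [n]       e_i = v^T tanh(W_h h_i + W_q q)
    alpha = torch.softmax(e, dim=-1)      # [n]       attention weights, sum to 1
    context = alpha @ h                   # [d]       weighted sum fed to the classifier
    return alpha, context
```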

Is attention (not) explanation?
Jain and Wallace (2019) show that attention is often uncorrelated with gradient-based feature importance measures, and that one can often find a completely different set of attention weights that results in the same prediction. In addition, Serrano and Smith (2019) find, by modifying attention weights, that these weights often do not identify the representations that are most important to the prediction of the model. However, Wiegreffe and Pinter (2019) claim that these works do not disprove the usefulness of attention as explanation per se, and provide four tests to determine if or when it can be used as such. In one such test, they are able to find alternative attention weights using an adversarial training setup, which suggests attention is not always a faithful explanation. Finally, Pruthi et al. (2020) propose a method to produce deceptive attention weights. Their method reduces how much weight is assigned to a set of 'impermissible' tokens, even when the models demonstrably rely on those tokens for their predictions.

Was the right task analyzed?
In the attention-as-explanation research to date, text classification with LSTMs has received the most scrutiny. However, Vashishth et al. (2019) question why one should focus on single-sequence tasks at all, because the attention mechanism is arguably far less important there than in models involving two sequences, such as NLI or MT models. Indeed, the performance of an NMT model degrades substantially if uniform attention weights are used, while random attention weights affect text classification performance only minimally. Findings from text classification studies may therefore not generalize to tasks where attention is a crucial component. Interestingly, even for MT, the first case where attention was visualized to inspect a model ( §1), Ding et al. (2019) find that saliency methods ( §3) yield better word alignments.

Is a causal definition assumed?
Grimsley et al. (2020) go as far as saying that attention is not explanation by definition, if a causal definition of explanation is assumed. Drawing on work in philosophy, they point out that causal explanations presuppose a surgical intervention, which is not possible with deep neural networks: one cannot intervene on attention while keeping all other variables invariant.

Can attention be improved?
The problems with using attention as explanation, especially regarding faithfulness, have inspired some to try and 'improve' the attention weights, so as to make them more faithful and/or plausible. Mohankumar et al. (2020) observe high similarity between the hidden representations of LSTM states and propose a diversity-driven training objective that makes the hidden representations more diverse across time steps. Using representation erasure, they show that the resulting attention weights lead to decision flips more easily than vanilla attention weights do. With a similar motivation, Tutek and Snajder (2020) use a word-level objective to achieve a stronger connection between hidden states and the words they represent, which in turn affects attention. Not part of the recent debate, Deng et al. (2018) propose variational attention as an alternative to the soft attention of Bahdanau et al. (2015), arguing that the latter is not alignment, but only an approximation thereof. Variational attention has the additional benefit of allowing posterior alignments, conditioned on both the input and the output sentences.

Saliency Methods
In this section we discuss various input saliency methods for NLP as alternatives to attention: gradient-based ( §3.1), propagation-based ( §3.2), and occlusion-based methods ( §3.3), following Arras et al. (2019). We do not endorse any specific method, but rather try to give an overview of the methods and how they differ. We discuss methods that are applicable to any neural NLP model for which we have access to model internals, such as activations and gradients; extracting attention itself requires the same kind of access. We leave out more expensive methods that use a surrogate model, e.g., LIME (Ribeiro et al., 2016).

Gradient-based methods
While used earlier in other fields, Li et al. (2016a) use gradients as explanation in NLP, computing

$$\nabla_{x_i} f_c(x_{1:n}) = \frac{\partial f_c(x_{1:n})}{\partial x_i} \qquad (2)$$

where $x_i$ is the input word embedding for time step $i$, $x_{1:n} = x_1, \ldots, x_n$ are the input embeddings (e.g., a sentence), and $f_c(x_{1:n})$ is the model output for target class $c$. Taking the L2 norm of Eq. 2 yields a measure of how sensitive the model is to the input at time step $i$. If instead we take the dot product of Eq. 2 with the input word embedding $x_i$, we arrive at the gradient×input method (Denil et al., 2015), which returns a saliency (a scalar) for input $i$:

$$\nabla_{x_i} f_c(x_{1:n}) \cdot x_i \qquad (3)$$

Integrated gradients (IG) (Sundararajan et al., 2017) is a gradient-based method that deals with the problem of saturation: gradients may get close to zero for a well-fitted function. IG requires a baseline $b_{1:n}$, e.g., all-zeros vectors or repeated [MASK] vectors. For input $i$, we compute:

$$(x_i - b_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \nabla_{x_i} f_c\big(b_{1:n} + \tfrac{k}{m}(x_{1:n} - b_{1:n})\big) \qquad (4)$$

That is, we average over $m$ gradients, with the inputs to $f_c$ linearly interpolated between the baseline and the original input $x_{1:n}$ in $m$ steps, and then take the dot product of that averaged gradient with the input embedding $x_i$ minus the baseline.

Following Ancona et al. (2019), we propose distinguishing sensitivity from saliency: the former measures how much a change in the input changes the output, while the latter is the marginal effect of each input word on the prediction. Gradients measure sensitivity, whereas gradient×input and IG measure saliency. A model can be sensitive to the input at a time step, but whether that input was important for the prediction depends on the actual input vector.
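To make the sensitivity/saliency distinction concrete, here is a minimal PyTorch sketch of the three quantities above. The `model` callable, assumed to map a sequence of input embeddings [n, d] to class scores, is a hypothetical stand-in for any differentiable NLP classifier; function names and defaults are ours.

```python
import torch

def gradient_saliency(model, emb, c):
    """Gradient (Eq. 2) and gradient x input (Eq. 3) for one example.
    emb: [n, d] input word embeddings; c: target class index."""
    emb = emb.detach().clone().requires_grad_(True)
    grad, = torch.autograd.grad(model(emb)[c], emb)   # [n, d]: Eq. 2 for every token
    sensitivity = grad.norm(dim=-1)                   # L2 norm: how sensitive f_c is to token i
    saliency = (grad * emb).sum(dim=-1)               # dot product with x_i: gradient x input
    return sensitivity, saliency

def integrated_gradients(model, emb, c, baseline=None, m=50):
    """Integrated gradients (Eq. 4): average gradients along the straight line
    from a baseline (all-zeros by default) to the input, then take the dot
    product of that average with (input - baseline)."""
    baseline = torch.zeros_like(emb) if baseline is None else baseline
    avg_grad = torch.zeros_like(emb)
    for k in range(1, m + 1):                         # m interpolation steps
        point = (baseline + (k / m) * (emb - baseline)).detach().requires_grad_(True)
        grad, = torch.autograd.grad(model(point)[c], point)
        avg_grad += grad / m
    return ((emb - baseline) * avg_grad).sum(dim=-1)  # [n] saliency per token
```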

Propagation-based methods
Propagation-based methods (Landecker et al., 2013; Bach et al., 2015; Arras et al., 2017, i.a.), of which we discuss Layer-wise Relevance Propagation (LRP) in particular, start with a forward pass to obtain the output f_c(x_1:n), which is the top-level relevance. They then use a special backward pass that, at each layer, redistributes the incoming relevance among the inputs of that layer. Each kind of layer has its own propagation rules: for example, there are different rules for feed-forward layers (Bach et al., 2015) and for LSTM layers (Arras et al., 2017). Relevance is redistributed until we arrive at the input layers. While LRP requires implementing a custom backward pass, it does allow precise control over how relevance is preserved, and it has been shown to work better than gradient-based methods on text classification (Arras et al., 2019).
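As an illustration, the sketch below shows one such backward step through a feed-forward (linear) layer using the common epsilon stabilizing rule; the rules for LSTM layers (Arras et al., 2017) are more involved and omitted here. The function name and signature are illustrative, not a reference implementation.

```python
import torch

def lrp_linear_eps(a, W, b, R_out, eps=1e-6):
    """Redistribute relevance through a linear layer z = W a + b (epsilon rule).

    a:     [d_in]         input activations of the layer
    W:     [d_out, d_in]  weights; b: [d_out] bias
    R_out: [d_out]        relevance arriving from the layer above
    Returns R_in: [d_in], the relevance redistributed to the layer's inputs.
    """
    z = W @ a + b                                # [d_out] total contribution per output unit
    z = z + eps * (2 * (z >= 0).float() - 1)     # stabilizer keeps the denominator away from zero
    s = R_out / z                                # [d_out] relevance per unit of contribution
    return a * (W.T @ s)                         # [d_in]  each input's share of the relevance
```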

Occlusion-based methods
Occlusion-based methods (Zeiler and Fergus, 2014; Li et al., 2016b) compute input saliency by occluding (or erasing) input features and measuring how that affects the model. Intuitively, erasing unimportant features barely affects the model, whereas the opposite is true for important features. Li et al. (2016b) erase word embedding dimensions and whole words to see how doing so affects the model. They compute the importance of a word on a dataset level by averaging, over all examples, how much erasing that word changes the output compared to leaving it in.
As a saliency method, however, we can apply their method to a single example only. For input i,

$$f_c(x_{1:n}) - f_c(x_{1:n \mid x_i = 0}) \qquad (5)$$

computes saliency, where $x_{1:n \mid x_i = 0}$ indicates that input word embedding $x_i$ was zeroed out, while the other inputs were left unmodified. Kádár et al. (2017) and Poerner et al. (2018) use a variant, omission, which simply leaves the word out of the input. This method requires n + 1 forward passes. It is also used for evaluation, to see whether important words identified by another method actually change the model output when erased (e.g., DeYoung et al., 2020).
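A minimal sketch of both variants (zeroing out x_i vs. omitting the token) follows, again assuming a hypothetical `model` callable over embedded inputs as in the gradient sketch above.

```python
import torch

def occlusion_saliency(model, emb, c, omit=False):
    """Occlusion saliency for one example: f_c(x_1:n) - f_c(x_1:n | x_i = 0).

    emb: [n, d] input word embeddings; c: target class index.
    omit=True uses the omission variant (the token is left out entirely),
    which assumes the model accepts variable-length inputs.
    Needs n + 1 forward passes in total.
    """
    with torch.no_grad():
        full = model(emb)[c]                           # 1 pass on the intact input
        scores = []
        for i in range(emb.size(0)):                   # n more passes, one per token
            if omit:
                corrupted = torch.cat([emb[:i], emb[i + 1:]], dim=0)
            else:
                corrupted = emb.clone()
                corrupted[i] = 0.0                     # zero out word embedding x_i
            scores.append(full - model(corrupted)[c])  # drop in f_c = saliency of token i
    return torch.stack(scores)                         # [n]
```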

Saliency vs. Attention
We discussed the use of attention as explanation ( §2) and input saliency methods as alternatives ( §3). We will now argue why saliency methods should be preferred over attention for explanation.
In many of the cited papers, whether implicitly or explicitly, the goal of the explanation is to reveal which input words are the most important ones for the final prediction. This is perhaps a consequence of attention computing one weight per input, so it is necessarily understood in terms of those inputs.
The intended user for the explanation is often not stated, but typically that user is a model developer, and not a non-expert end user, for example. For model developers, faithfulness, the need for an explanation to accurately represent the reasoning of the model, is a key concern. On the other hand, plausibility is of lesser concern, because a model developer aims to understand and possibly improve the model, and that model does not necessarily align with human intuition (see Jacovi and Goldberg, 2020b, for a detailed discussion of the differences between faithfulness and plausibility).
With this goal and user clearly stated, it is impossible to make an argument in favor of using attention as explanation. Input saliency methods address the goal head-on: they reveal why one particular model prediction was made in terms of how relevant each input word was to that prediction. Moreover, input saliency methods typically take the entire computation path into account, all the way from the input word embeddings to the target output prediction value. Attention weights do not: they reflect, at one point in the computation, how much the model attends to each input representation, but those representations might already have mixed in information from other inputs. Ironically, attention-as-explanation is sometimes evaluated by comparing it against gradient-based measures, which again raises the question of why we would not use those measures in the first place.
One might argue that attention, despite its flaws, is easily extracted and computationally efficient. However, it only takes one line in a framework like TensorFlow to compute the gradient of the output w.r.t. the input word embeddings, so implementation difficulty is not a strong argument. In terms of efficiency, it is true that attention requires only a forward pass, but many of the other methods discussed require at most a forward pass followed by a backward pass, which is still extremely efficient.

Attention is not not interesting
In this position paper we criticized the use of attention to assess input saliency for the benefit of the model developer. We emphasize that understanding the role of the attention mechanism is a perfectly justified research goal. For example, Voita et al. (2019) and Michel et al. (2019) analyze the role of attention heads in the Transformer architecture and identify a few distinct functions they have, and Strubell et al. (2018) train attention heads to perform dependency parsing, adding a linguistic bias.
We also stress that if the definition of explanation is adjusted, for example if a different intended user and a different explanatory goal are articulated, attention may become a useful explanation for a certain application. For example, Strout et al. (2019) demonstrate that supervised attention helps humans accomplish a task faster than random or unsupervised attention, for a user and goal that are very different from those implied in §2.
Is Saliency the Ultimate Answer?
Beyond saliency. While we have argued that saliency methods are a good fit for our goal, there are other goals for which different methods can be a better fit. For example, counterfactual analysis might lead to insights, aided by visualization tools (Vig, 2019; Hoover et al., 2020; Abnar and Zuidema, 2020). Another option is to use models that are explainable by design, such as rationale-based models (Lei et al., 2016; Bastings et al., 2019), which can guarantee faithful explanations, although they might be sensitive to so-called trojans (Jacovi and Goldberg, 2020a).

Limitations of saliency.
A known problem with occlusion-based saliency methods, as well as with erasure-based evaluation of any input saliency technique (Bach et al., 2015; DeYoung et al., 2020), is that changes in the predicted probabilities may be due to the fact that the corrupted input falls off the manifold of the training data (Hooker et al., 2019). That is, a drop in probability can be explained by the input being OOD and not by an important feature missing. It has also been demonstrated that at least some saliency methods are not reliable and produce unintuitive results (Kindermans et al., 2017) or violate certain axioms (Sundararajan et al., 2017).
A more fundamental limitation is the expressiveness of input saliency methods. Obviously, a bag of per-token saliency weights can be called an explanation only in a very narrow sense. One can overcome some limitations of this flat representation of importance by indicating dependencies between important features (for example, Janizek et al. (2020) present an extension of IG that explains pairwise feature interactions), but it is hardly possible to fully understand why a deep non-linear model produced a certain prediction by looking at the input tokens alone.

Conclusion
We summarized the debate on whether attention is explanation, and observed that the goal for explanation is often to determine which inputs are the most relevant to the prediction. The user for that explanation often goes unstated, but is typically assumed to be a model developer. With this goal and user clearly stated, we argued that input saliency methods, of which we discussed a few, are better suited than attention. We hope, at least for the goal and user that we identified, that the focus shifts from attention to input saliency methods, and perhaps to entirely different methods, goals, and users.