Towards Transparent and Explainable Attention Models



Abstract
Recent studies on interpretability of attention distributions have led to notions of faithful and plausible explanations for a model's predictions. Attention distributions can be considered a faithful explanation if a higher attention weight implies a greater impact on the model's prediction. They can be considered a plausible explanation if they provide a human-understandable justification for the model's predictions. In this work, we first explain why current attention mechanisms in LSTM based encoders can neither provide a faithful nor a plausible explanation of the model's predictions. We observe that in LSTM based encoders the hidden representations at different time-steps are very similar to each other (high conicity) and attention weights in these situations do not carry much meaning because even a random permutation of the attention weights does not affect the model's predictions. Based on experiments on a wide variety of tasks and datasets, we observe attention distributions often attribute the model's predictions to unimportant words such as punctuation and fail to offer a plausible explanation for the predictions. To make attention mechanisms more faithful and plausible, we propose a modified LSTM cell with a diversity-driven training objective that ensures that the hidden representations learned at different time steps are diverse. We show that the resulting attention distributions offer more transparency as they (i) provide a more precise importance ranking of the hidden states (ii) are better indicative of words important for the model's predictions (iii) correlate better with gradient-based attribution methods. Human evaluations indicate that the attention distributions learned by our model offer a plausible explanation of the model's predictions.
Our code has been made publicly available at https://github.com/akashkm99/Interpretable-Attention

Introduction

Attention mechanisms (Bahdanau et al., 2014; Vaswani et al., 2017) play a very important role in neural network-based models for various Natural Language Processing (NLP) tasks. They not only improve the performance of the model but are also often used to provide insights into the workings of a model. Recently, there is a growing debate on whether attention mechanisms can offer transparency to a model or not. For example, Serrano and Smith (2019) and Jain and Wallace (2019) show that high attention weights need not necessarily correspond to a higher impact on the model's predictions, and hence do not provide a faithful explanation for the model's predictions. On the other hand, Wiegreffe and Pinter (2019) argue that there is still a possibility that attention distributions may provide a plausible explanation for the predictions. In other words, they might provide a plausible reconstruction of the model's decision making which can be understood by a human, even if it is not faithful to how the model works.
In this work, we begin by analyzing why attention distributions may not faithfully explain the model's predictions. We argue that when the input representations over which an attention distribution is being computed are very similar to each other, the attention weights are not very meaningful. Since the input representations are very similar, even random permutations of the attention weights could lead to similar final context vectors. As a result, the output predictions will not change much even if the attention weights are permuted. We show that this is indeed the case for LSTM based models where the hidden states occupy a narrow cone in the latent space (i.e., the hidden representations are very close to each other). We further observe that for a wide variety of datasets, attention distributions in these models do not even provide a good plausible explanation as they pay significantly high attention to unimportant tokens such as punctuation. This is perhaps due to hidden states capturing a summary of the entire context instead of being specific to their corresponding words.
Based on these observations, we aim to build more transparent and explainable models where the attention distributions provide faithful and plausible explanations for its predictions. One intuitive way of making the attention distribution more faithful is by ensuring that the hidden representations over which the distribution is being computed are very diverse. Therefore, a random permutation of the attention weights will lead to very different context vectors. To do so, we propose an orthogonalization technique which ensures that the hidden states are farther away from each other in their spatial dimensions. We then propose a more flexible model trained with an additional objective that promotes diversity in the hidden states. Through a series of experiments using 12 datasets spanning 4 tasks, we show that our model is more transparent while achieving comparable performance to models containing vanilla LSTM based encoders. Specifically, we show that in our proposed models, attention weights (i) provide useful importance ranking of hidden states (ii) are better indicative of words that are important for the model's prediction (iii) correlate better with gradient-based feature importance methods and (iv) are sensitive to random permutations (as should indeed be the case).
We further observe that attention weights in our models, in addition to adding transparency to the model, are also more explainable, i.e., more human-understandable. In Table 1, we show samples of attention distributions from a Vanilla LSTM and our proposed Diversity LSTM model. We observe that in our models, unimportant tokens such as punctuation marks receive very little attention whereas important words belonging to relevant part-of-speech tags receive greater attention (for example, adjectives in the case of sentiment classification). Human evaluation of the attention from our model shows that humans prefer the attention weights in our Diversity LSTM as providing better explanations than the Vanilla LSTM in 72.3%, 62.2%, 88.4% and 99.0% of the samples in the Yelp, SNLI, Quora Question Paraphrase and bAbI 1 datasets respectively.

Tasks, Dataset and Models
Our first goal is to understand why existing attention mechanisms with LSTM based encoders fail to provide faithful or plausible explanations for the model's predictions. We experiment on a variety of datasets spanning different tasks; here, we introduce these datasets and tasks and provide a brief recap of the standard LSTM+attention model used for these tasks. We consider the tasks of Binary Text classification, Natural Language Inference, Paraphrase Detection, and Question Answering. We use a total of 12 datasets, most of them being the same as the ones used in (Jain and Wallace, 2019). We divide Text classification into Sentiment Analysis and Other Text classification for convenience.
Sentiment Analysis: We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013), IMDB Movie Reviews (Maas et al., 2011), Yelp and Amazon for sentiment analysis. All these datasets use a binary target variable (positive/negative).
Other Text Classification: We use the Twitter ADR (Nikfarjam et al., 2015) dataset with 8K tweets where the task is to detect if a tweet describes an adverse drug reaction or not. We use a subset of the 20 Newsgroups dataset (Jain and Wallace, 2019) to classify news articles into baseball vs hockey sports categories. From MIMIC ICD9 (Johnson et al., 2016), we use 2 datasets: Anemia, to determine the type of Anemia (Chronic vs Acute) a patient is diagnosed with and Diabetes, to predict whether a patient is diagnosed with Diabetes or not.
Natural Language Inference: We consider the SNLI dataset (Bowman et al., 2015) for recognizing textual entailment within sentence pairs. The SNLI dataset has three possible classification labels, viz. entailment, contradiction and neutral.
Paraphrase Detection: We utilize the Quora Question Paraphrase (QQP) dataset (part of the GLUE benchmark (Wang et al., 2018)) with pairs of questions labeled as paraphrased or not. We split the training set into 90 : 10 training and validation; and use the original dev set as our test set.
Question Answering: We use the three QA tasks from the bAbI dataset (Weston et al., 2015). The tasks consist of answering questions that require one, two or three supporting statements from the context; each answer is a span in the context. We also use the CNN News Articles dataset (Hermann et al., 2015), consisting of 90k articles with an average of three questions per article along with their corresponding answers.

LSTM Model with Attention
Of the above tasks, the text classification tasks require making predictions from a single input sequence (of words) whereas the remaining tasks use pairs of sequences as input. For tasks containing two input sequences, we encode both sequences $P = \{w^p_1, \ldots, w^p_m\}$ and $Q = \{w^q_1, \ldots, w^q_n\}$ by passing their word embeddings through an LSTM encoder (Hochreiter and Schmidhuber, 1997), where $e(w)$ represents the word embedding for the word $w$. We attend to the intermediate representations of $P$, $H_p = \{h^p_1, \ldots, h^p_m\} \in \mathbb{R}^{m \times d}$, using the last hidden state $h^q_n \in \mathbb{R}^d$ as the query, with the attention mechanism of Bahdanau et al. (2014). Finally, we use the attended context vector $c_\alpha$ to make a prediction $\hat{y} = \mathrm{softmax}(W_o c_\alpha)$.
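As a concrete illustration, here is a minimal NumPy sketch of one additive (Bahdanau-style) attention step over encoder states; the parameter names W1, W2 and v are illustrative placeholders, not the paper's notation:

```python
import numpy as np

def additive_attention(H_p, h_q, W1, W2, v):
    """Additive attention over encoder states H_p (m x d) with query h_q (d,).
    Returns the attention weights alpha and the context vector c_alpha."""
    # score_i = v^T tanh(W1 h_i + W2 h_q), computed for all i at once
    scores = np.tanh(H_p @ W1.T + h_q @ W2.T) @ v        # shape (m,)
    alpha = np.exp(scores - scores.max())                 # stable softmax
    alpha /= alpha.sum()                                  # attention distribution
    c_alpha = alpha @ H_p                                 # convex combination of states
    return alpha, c_alpha
```

A prediction layer would then apply a softmax over `W_o @ c_alpha`, as in the text.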
For tasks with a single input sequence, we use a single LSTM to encode the sequence, followed by an attention mechanism (without query) and a final output projection layer.

Analyzing Attention Mechanisms
Here, we first investigate why attention distributions may not provide a faithful explanation for the model's predictions. We later examine whether attention distributions can provide a plausible, even if not faithful, explanation for the model's predictions.

Similarity Measures
We begin by defining similarity measures in a vector space for ease of analysis. We measure the similarity between a set of vectors $V = \{v_1, \ldots, v_m\}$ using the conicity measure (Chandrahas et al., 2018; Sai et al., 2019). We first compute a vector $v_i$'s 'alignment to mean' (ATM):

$$\mathrm{ATM}(v_i, V) = \mathrm{cosine}\Big(v_i, \frac{1}{m}\sum_{j=1}^{m} v_j\Big)$$

Conicity is defined as the mean ATM over all vectors $v_i \in V$:

$$\mathrm{conicity}(V) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{ATM}(v_i, V)$$

A high value of conicity indicates that all the vectors are closely aligned with their mean, i.e., they lie in a narrow cone centered at the origin.
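The conicity measure can be sketched in a few lines of NumPy (the small epsilon is a numerical guard we add for the zero-norm edge case, not part of the definition):

```python
import numpy as np

def conicity(V):
    """Conicity of a set of row vectors V (m x d): the mean cosine similarity
    ('alignment to mean', ATM) between each vector and the mean vector."""
    mean = V.mean(axis=0)
    atm = (V @ mean) / (np.linalg.norm(V, axis=1) * np.linalg.norm(mean) + 1e-12)
    return atm.mean()
```

Identical vectors give a conicity of 1, while many isotropically distributed vectors in a high-dimensional space give a value near 0.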

Attention Mechanisms
As mentioned earlier, attention mechanisms learn a weighting distribution over hidden states H = {h 1 , . . . , h n } using a scoring function f, such as that of Bahdanau et al. (2014), to obtain an attended context vector c α .
The attended context vector is a convex combination of the hidden states, which means it will lie within the cone spanned by the hidden states. When the hidden states are highly similar to each other (high conicity), even diverse sets of attention distributions would produce very similar attended context vectors c α, as they will always lie within a narrow cone. This could result in outputs $\hat{y} = \mathrm{softmax}(W_o c_\alpha)$ with very little difference. In other words, when there is higher conicity in the hidden states, the model could produce the same prediction for several diverse sets of attention weights. In such cases, one cannot reliably say that high attention weights on certain input components led the model to its prediction. Later on, in section 5.3, we show that when using vanilla LSTM encoders, where there is higher conicity in hidden states, the model output does not change much even when we randomly permute the attention weights.
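This argument can be demonstrated numerically. The following sketch uses synthetic vectors rather than real LSTM states: permuting the attention weights barely moves the context vector when the states lie in a narrow cone, but moves it substantially when the states are diverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 64

base = rng.normal(size=d)
H_similar = base + 0.05 * rng.normal(size=(n, d))   # high conicity: narrow cone
H_diverse = rng.normal(size=(n, d))                 # low conicity: spread out

alpha = rng.dirichlet(np.ones(n))   # an attention distribution
perm = rng.permutation(n)           # a random permutation of it

def context_shift(H):
    """How far the attended context vector moves under the permutation."""
    return np.linalg.norm(alpha @ H - alpha[perm] @ H)

shift_similar = context_shift(H_similar)
shift_diverse = context_shift(H_diverse)
# shift_similar is far smaller: with high conicity, permuting the attention
# weights hardly changes the context vector fed to the output layer.
```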

Conicity of LSTMs Hidden States
We now analyze whether the hidden states learned by an LSTM encoder do actually have high conicity. In Table 2, we report the average conicity of hidden states learned by an LSTM encoder for various tasks and datasets. For reference, we also compute the average conicity obtained by vectors that are uniformly distributed with respect to direction (isotropic) in the same hidden space. We observe that across all the datasets the hidden states are consistently aligned with each other, with conicity values ranging between 0.43 and 0.77. In contrast, when there was no dependence between the vectors, the conicity values were much lower, with the vectors even being almost orthogonal to their mean in several cases ($\sim 89°$ in the Diabetes and Anemia datasets). The existence of high conicity in the learned hidden states of an LSTM encoder is one of the potential reasons why the attention weights in these models are not always faithful to its predictions (as even random permutations of the attention weights will result in similar context vectors c α ).

Attention by POS Tags
We now examine whether attention distributions can provide a plausible explanation for the model's predictions even if it is not faithful. Intuitively, a plausible explanation should ignore unimportant tokens such as punctuation marks and focus on words relevant for the specific task. To examine this, we categorize words in the input sentence by their universal part-of-speech (POS) tag (Petrov et al., 2011) and accumulate the attention given to each POS tag over the entire test set. Surprisingly, we find that in several datasets, a significant amount of attention is given to punctuation. On the Yelp, Amazon and QQP datasets, attention mechanisms pay 28.6%, 34.0% and 23.0% of their total attention to punctuation. Notably, punctuation constitutes only 11.0%, 10.5% and 11.6% of the total tokens in the respective datasets, signifying that the learned attention distributions pay substantially greater attention to punctuation than even a uniform distribution would. This raises questions on the extent to which attention distributions provide plausible explanations, as they attribute the model's predictions to tokens that are linguistically insignificant to the context. One potential reason why the attention distributions are misaligned is that the hidden states might capture a summary of the entire context instead of being specific to their corresponding words, as suggested by the high conicity. We later show that attention distributions in our models with low conicity values tend to ignore punctuation marks.
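The per-tag accumulation described above can be sketched as follows. This is a toy version: examples are assumed to be pre-tagged (tokens, POS tags, attention weights) triples produced by any universal-POS tagger:

```python
from collections import Counter

def attention_by_pos(examples):
    """Accumulate attention mass per universal POS tag over a corpus and
    normalize it to a fraction of the total attention."""
    mass = Counter()
    for tokens, tags, alpha in examples:
        for tag, weight in zip(tags, alpha):
            mass[tag] += weight
    total = sum(mass.values())
    return {tag: m / total for tag, m in mass.items()}
```

Comparing these fractions against each tag's share of the tokens reveals whether a tag (e.g. PUNCT) receives more attention than a uniform distribution would give it.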

Orthogonal and Diversity LSTM
Based on our previous argument that high conicity of hidden states affects the transparency and explainability of attention models, we propose two strategies to reduce the similarity between hidden states.

Orthogonalization
Here, we explicitly ensure low conicity between the hidden states of an LSTM encoder by orthogonalizing the hidden state at time $t$ with respect to the mean of the previous states, as illustrated in Figure 2. After the standard LSTM update equations (with weight matrices in $\mathbb{R}^{d_2 \times d_1}$, where $d_1$ and $d_2$ are the input and hidden dimensions respectively) produce a raw hidden state $\hat{h}_t$, we apply:

$$\bar{h}_t = \frac{1}{t-1}\sum_{i=1}^{t-1} h_i$$

$$h_t = \hat{h}_t - \frac{\hat{h}_t^\top \bar{h}_t}{\bar{h}_t^\top \bar{h}_t}\,\bar{h}_t$$

The key difference from a vanilla LSTM is in these last two equations, where we subtract the hidden state vector $\hat{h}_t$'s component along the mean $\bar{h}_t$ of the previous states.
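A minimal sketch of this orthogonalization step, assuming the raw LSTM state has already been computed by the standard update equations (the small epsilon is a numerical guard we add, not part of the formulation):

```python
import numpy as np

def orthogonalize_step(h_hat_t, prev_states):
    """Remove from the raw LSTM hidden state h_hat_t its component along
    the mean of the previously emitted hidden states."""
    if len(prev_states) == 0:
        return h_hat_t          # first time step: nothing to orthogonalize against
    h_bar = np.mean(prev_states, axis=0)
    proj = (h_hat_t @ h_bar) / (h_bar @ h_bar + 1e-12) * h_bar
    return h_hat_t - proj
```

The returned state is (numerically) orthogonal to the running mean, which keeps successive states from collapsing into a narrow cone.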

Diversity Driven Training
The above model imposes a hard orthogonality constraint between the hidden states and the previous states' mean. We also propose a more flexible approach where the model is jointly trained to maximize the log-likelihood of the training data and minimize the conicity of the hidden states:

$$\mathcal{L}(\theta) = -\log p_{\mathrm{model}}(y \mid P, Q; \theta) + \lambda\, \mathrm{conicity}(H_P)$$

where $y$ is the ground truth class, $P$ and $Q$ are the input sentences, $H_P = \{h^p_1, \ldots, h^p_m\} \in \mathbb{R}^{m\times d}$ contains all the hidden states of the LSTM, $\theta$ is the collection of model parameters, and $p_{\mathrm{model}}(\cdot)$ represents the model's output probability. $\lambda$ is a hyperparameter that controls the weight given to diversity in the hidden states during training.
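The joint objective can be sketched per example as follows (NumPy, with an illustrative conicity helper; the epsilon guard is an implementation detail we add):

```python
import numpy as np

def conicity(H):
    """Mean cosine similarity of each hidden state with the mean state."""
    mean = H.mean(axis=0)
    cos = (H @ mean) / (np.linalg.norm(H, axis=1) * np.linalg.norm(mean) + 1e-12)
    return cos.mean()

def diversity_loss(log_prob_gold, H, lam):
    """Negative log-likelihood of the gold class plus lam times the
    conicity of the hidden states H (m x d)."""
    return -log_prob_gold + lam * conicity(H)
```

With `lam = 0` this reduces to the usual cross-entropy objective; increasing `lam` trades a little likelihood for more diverse (lower-conicity) hidden states.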

Analysis of the model
We now analyse the proposed models by performing experiments using the tasks and datasets described earlier. Through these experiments we establish that (i) the proposed models perform comparably to vanilla LSTMs (Sec. 5.2), (ii) the attention distributions in the proposed models provide a faithful explanation for the model's predictions (Secs. 5.3 to 5.5), and (iii) the attention distributions are more explainable and align better with a human's interpretation of the model's prediction (Secs. 5.6, 5.7). Throughout this section we compare the following three models:
1. Vanilla LSTM: the model described in section 2.1, which uses the vanilla LSTM.
2. Diversity LSTM: the model described in section 2.1 with the vanilla LSTM, but trained with the diversity objective described in section 4.2.
3. Orthogonal LSTM: the model described in section 2.1, except that the vanilla LSTM is replaced by the orthogonal LSTM described in section 4.1.

Implementation Details
For all datasets except bAbI, we use either pretrained GloVe (Pennington et al., 2014) or fastText (Mikolov et al., 2018) word embeddings with 300 dimensions. For the bAbI dataset, we learn 50 dimensional word embeddings from scratch during training. We use a 1-layered LSTM as the encoder with a hidden size of 128 for bAbI and 256 for the other datasets. For the diversity weight λ, we use a value of 0.1 for SNLI, 0.2 for CNN, and 0.5 for the remaining datasets. We use the Adam optimizer with a learning rate of 0.001 and select the best model based on accuracy on the validation split. All subsequent analyses are performed on the test split.

Empirical evaluation
Our main goal is to show that our proposed models provide more faithful and plausible explanations for their predictions. Before doing so, however, we need to show that the predictive performance of our models is comparable to that of a vanilla LSTM model and significantly better than non-contextual models. In other words, we show that we do not compromise on performance to gain transparency and explainability. We report the performance of our models on the tasks and datasets described in section 2. In Table 2, we report the accuracy and conicity values of the vanilla, Diversity and Orthogonal LSTMs. Having established that the performance of the Diversity and Orthogonal LSTMs is comparable to the vanilla LSTM and significantly better than a Multilayer Perceptron model, we now show that these two models give more faithful explanations for their predictions.

Importance of Hidden Representation
We examine whether attention weights provide a useful importance ranking of hidden representations. We use the intermediate representation erasure of Serrano and Smith (2019) to evaluate an importance ranking over hidden representations. Specifically, we erase the hidden representations in descending order of importance (highest to lowest) until the model's decision changes. In Figure 3, we report box plots of the fraction of hidden representations that must be erased for a decision flip when following the ranking provided by the attention weights. For reference, we also show the same plots when a random ranking is followed. In several datasets, we observe that a large fraction of the representations have to be erased to obtain a decision flip in the vanilla LSTM model, similar to the observation of Serrano and Smith (2019). This suggests that the hidden representations at the lower end of the attention ranking do play a significant role in the vanilla LSTM model's decision-making process, so the usefulness of the attention ranking in such models is questionable. In contrast, there is a much quicker decision flip in our Diversity and Orthogonal LSTM models. Thus, in our proposed models, the top elements of the attention ranking are able to concisely describe the model's decisions. This suggests that our attention weights provide a faithful explanation of the model's predictions (as higher attention implies higher importance).
In tasks such as paraphrase detection, the model is naturally required to carefully go through the entire sentence to make a decision, thereby resulting in delayed decision flips. In the QA tasks, the attention ranking in the vanilla LSTM model itself achieves a quick decision flip. On further inspection, we found that this is because these models tend to attend to answer words, which are usually a span in the input passage. So, when the representations corresponding to the answer words are erased, the model can no longer accurately predict the answer, resulting in a decision flip.
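The erasure procedure above can be sketched as follows. Here `predict` is a stand-in for the full model's decision function, and zeroing a row is one simple way to "erase" a hidden representation:

```python
import numpy as np

def fraction_erased_for_flip(H, alpha, predict):
    """Erase hidden states (rows of H) in descending attention order until
    the predicted class changes; return the fraction erased (1.0 if no flip)."""
    original = predict(H)
    order = np.argsort(-alpha)          # highest-attention states first
    H_masked = H.copy()
    for k, idx in enumerate(order, start=1):
        H_masked[idx] = 0.0             # erase this representation
        if predict(H_masked) != original:
            return k / len(alpha)
    return 1.0
```

A small fraction indicates that the top of the attention ranking concisely accounts for the decision; a large fraction indicates that low-attention states still matter.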
Following Jain and Wallace (2019), we randomly permute the attention weights and observe the difference in the model's output. In Figure 4, we plot the median Total Variation Distance (TVD) between the output distributions before and after the permutation for different values of maximum attention in the vanilla, Diversity and Orthogonal LSTM models. We observe that randomly permuting the attention weights in the Diversity and Orthogonal LSTM models results in significantly different outputs. However, there is little change in the vanilla LSTM model's output for several datasets, suggesting that its attention weights are not so meaningful. The sensitivity of our attention weights to random permutations again suggests that they provide a more faithful explanation for the model's predictions, whereas the near-identical outputs of the vanilla LSTM model raise several questions about the reliability of its attention weights.
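The Total Variation Distance used in this experiment is simply half the L1 distance between the two output distributions:

```python
import numpy as np

def total_variation_distance(p, q):
    """TVD between two discrete output distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()
```

It ranges from 0 (identical outputs) to 1 (disjoint support), so a near-zero TVD after permuting the attention weights means the permutation had essentially no effect on the prediction.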

Comparison with Rationales
For tasks with a single input sentence, we analyze how much attention is given to words in the sentence that are important for the prediction. Specifically, we select a minimum subset of words in the input sentence with which the model can accurately make predictions. We then compute the total attention paid to these words. This set of words, also known as a rationale, is obtained from an extractive rationale generator (Lei et al., 2016) that is trained using the REINFORCE algorithm (Sutton et al., 1999) to maximize the following reward:

$$R = \log p_{\mathrm{model}}(y \mid Z) - \alpha\, \lVert Z \rVert$$

where $y$ is the ground truth class, $Z$ is the extracted rationale, $\lVert Z \rVert$ represents the length of the rationale, $p_{\mathrm{model}}(\cdot)$ represents the classification model's output probability, and $\alpha$ is a hyperparameter that penalizes long rationales. With a fixed $\alpha$, we trained generators to extract rationales from the vanilla and Diversity LSTM models. We observed that the accuracy of predictions made from the extracted rationales was within 5% of the accuracy made from the entire sentences.

Table 3: Mean attention given to the generated rationales with their mean lengths (in fraction)

In Table 3, we report the mean length (in fraction) of the rationales and the mean attention given to them in the vanilla and Diversity LSTM models. In general, we observe that the Diversity LSTM model provides much higher attention to rationales, which are often even shorter than the vanilla LSTM model's rationales. On average, the Diversity LSTM model provides 53.52% (relative) more attention to rationales than the vanilla LSTM across the 8 text classification datasets. Thus, the attention weights in the Diversity LSTM are better able to indicate words that are important for making predictions.
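Assuming the reward takes the standard Lei et al. (2016) form of log-likelihood minus a length penalty, and treating ‖Z‖ as the rationale's token count, the generator's reward can be sketched as:

```python
def rationale_reward(log_prob_gold_given_z, rationale_length, alpha):
    """Reward for an extracted rationale Z: the classifier's log-likelihood
    of the gold class given only Z, minus alpha times Z's length."""
    return log_prob_gold_given_z - alpha * rationale_length
```

Because the rationale selection is discrete, this reward cannot be backpropagated through directly, which is why the generator is trained with REINFORCE.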

Comparison with attribution methods
We now examine how well our attention weights agree with attribution methods such as gradients and integrated gradients (Sundararajan et al., 2017). For every input word, we compute these attributions and normalize them to obtain a distribution over the input words. We then compute the Pearson correlation and JS divergence between the attribution distribution and the attention distribution. We note that the Kendall τ used by Jain and Wallace (2019) often results in misleading correlations because the ranking at the tail end of the distributions contributes significant noise. In Table 4, we report the mean and standard deviation of these Pearson correlations and JS divergences for the vanilla and Diversity LSTMs across different datasets. We observe that the attention weights in the Diversity LSTM agree better with gradients, with an average (relative) 64.84% increase in Pearson correlation and an average (relative) 17.18% decrease in JS divergence over the vanilla LSTM across the datasets. Similar trends follow for Integrated Gradients.

Figure 5 shows the distribution of attention given to different POS tags across different datasets. We observe that the attention given to punctuation marks is significantly reduced from 28.6%, 34.0% and 23.0% in the vanilla LSTM to 3.1%, 13.8% and 3.4% in the Diversity LSTM on the Yelp, Amazon and QQP datasets respectively. In the sentiment classification task, the Diversity LSTM pays greater attention to adjectives, which usually play a crucial role in deciding the polarity of a sentence. Across the four sentiment analysis datasets, the Diversity LSTM gives an average of 49.27% (relative) more attention to adjectives than the vanilla LSTM. Similarly, for the other text classification tasks where nouns play an important role, we observe higher attention to nouns.
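The two agreement measures used above, Pearson correlation and Jensen-Shannon divergence between the attention and attribution distributions, can be sketched as follows (the epsilon guard is an implementation detail we add; JS divergence is computed with natural logarithms here):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two equal-length score vectors."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(x, y):
        # 0 * log 0 is treated as 0 by convention
        return np.sum(np.where(x > 0, x * np.log(x / np.where(x > 0, y, 1.0)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Pearson compares the shapes of the two importance profiles, while JS divergence (bounded by log 2 with natural logs) measures how far apart the two normalized distributions are.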

Human Evaluations
We conducted human evaluations to compare the extent to which attention distributions from the vanilla and Diversity LSTMs provide plausible explanations. We randomly sampled 200 data points each from the test sets of Yelp, SNLI, QQP, and bAbI 1. Annotators were shown the input sentence, the attention heatmaps, and the predictions made by the vanilla and Diversity LSTMs, and were asked to choose the attention heatmap that better explained the model's prediction on 3 criteria: 1) Overall: which heatmap better explains the prediction overall; 2) Completeness: which heatmap highlights all the words necessary for the prediction; 3) Correctness: which heatmap highlights only the important words and not unnecessary words. Annotators were given the choice to skip a sample in case they were unable to make a clear decision.
A total of 15 in-house annotators participated in the human evaluation study. The annotators were Computer Science graduates competent in English. We had 3 annotators for each sample and the final decision was taken based on majority voting. In Table 5, we report the percentage preference given to the vanilla and Diversity LSTM models on the Yelp, SNLI, QQP, and bAbI 1 datasets; the attention distributions from the Diversity LSTM significantly outperform those from the vanilla LSTM across all the datasets and criteria.

Related work
Our work can in many ways be seen as a continuation of recent studies (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) on the interpretability of attention. Several other works (Shao et al., 2019; Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Maruf et al., 2019; Peters et al., 2018) focus on improving the interpretability of attention distributions by inducing sparsity; however, the extent to which sparse attention distributions actually offer faithful and plausible explanations has not been studied in detail. A few works (Bao et al., 2018) map attention distributions to human-annotated rationales; our work, on the other hand, does not require any additional supervision. Guo et al. (2019) focus on developing interpretable LSTMs specifically for multivariate time series analysis. Several other works (Clark et al., 2019; Vig and Belinkov, 2019; Tenney et al., 2019; Michel et al., 2019; Jawahar et al., 2019; Tsai et al., 2019) analyze attention distributions and attention heads learned by transformer language models. The idea of orthogonalizing representations in an LSTM has been used by Nema et al. (2017), but they use a different diversity model in the context of improving the performance of Natural Language Generation models.

Conclusion & Future work
In this work, we have analyzed why existing attention distributions can neither provide a faithful nor a plausible explanation for the model's predictions. We showed that hidden representations learned by LSTM encoders tend to be highly similar across different timesteps, thereby affecting the interpretability of attention weights. We proposed two techniques to effectively overcome this shortcoming and showed that attention distributions in the resulting models provide more faithful and plausible explanations. As future work, we would like to extend our analysis and proposed techniques to more complex models and downstream tasks.