Understanding Attention for Text Classification

Attention has proven successful in many natural language processing (NLP) tasks. Recently, many researchers have started to investigate the interpretability of attention in NLP tasks. Many existing approaches focus on examining whether the local attention weights reflect the importance of input representations. In this work, we present a study on the internal mechanism of attention by looking into the gradient update process, checking its behavior as training approaches a local minimum. We propose to analyze, for each word token, two quantities: its polarity score and its attention score, where the latter is a global assessment of the token's significance. We discuss conditions under which the attention mechanism may become more (or less) interpretable, and show how the interplay between the two quantities can contribute towards model performance.


Introduction
The attention mechanism (Bahdanau et al., 2015) has been used as an important component across a wide range of NLP models. Typically, an attention layer produces a distribution over the input representations to be attended to. This distribution is then used to construct a weighted combination of the inputs, which is in turn consumed by downstream modules.
Recently, several research efforts have investigated the interpretability of attention on tasks such as text classification, question answering, and natural language inference (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Arras et al., 2019). One of their central questions was whether the attention distribution adequately reflects the significance of the inputs. To answer this question, they designed a series of metrics and conducted corresponding experiments, mainly observing how attention impacts the outputs of pre-trained models when some elements of the inputs are changed. While such approaches have resulted in interesting findings, the attention mechanism itself remains a black box to us: it is still largely unclear which underlying factors may have an impact on the attention mechanism.
When analyzing the results of a typical model with attention on text classification tasks, we noticed that in some instances, many of the word tokens with large attention weights were adjectives or adverbs that conveyed explicit signals about the underlying class label. In other instances, however, such useful words did not always receive significant attention weights, especially under certain hyperparameter configurations, making the attention mechanism less interpretable.
Such observations lead to several important questions. First, the attention weight for a word token appears to be a relative measurement of its significance, and is largely local and instance-specific. Is there an instance-independent quantity that assesses the corpus-level importance of a word token? And if so, what role would such a quantity play in interpreting the overall attention mechanism? Second, when the attention mechanism appears to be less interpretable, how is the underlying model affected in terms of performance?
In this work, we focus on answering the above questions. We argue that the attention scores (rather than the attention weights) capture the global, absolute importance of word tokens within a corpus. We present a study of the underlying factors that may influence such attention scores under a simple neural classification model. Inspired by Qian (1999), we analyzed the gradients as well as the updates of intermediate variables during gradient descent, and found implicit trends in the intermediate variables related to attention: the degree of association between a word token and the class label may impact its attention score. We argue that when certain hyperparameters are properly set, tokens with strong polarity (a high degree of association with specific labels) will likely end up with large attention scores, making them more likely to receive large attention weights in a particular sentence. While in such scenarios the attention mechanism appears more interpretable, we also discuss scenarios where the attention weights become less interpretable, and show how the polarity scores, another important token-level quantity, play their role in contributing towards model performance.

Related Work
Research on interpretability of neural models has received significant attention recently. One approach was using visualization to explore patterns that exist in the intermediate representations of neural networks. Simonyan et al. (2013) visualized the image-specific class saliency on image classification tasks using learnt ConvNets, and displayed the features captured by the neural networks. Li et al. (2016a,b) proposed visualization methods to look into the neural representations of the embeddings from the local composition, concessive sentences, clause composition, as well as the saliency of phrases and sentences, and illustrated patterns based on the visualizations. An erasure method was also adopted to validate the importance of different dimensions and words. Vig and Belinkov (2019) analyzed the attention structure on the Transformer (Vaswani et al., 2017) language model as well as GPT-2 (Radford et al., 2019) pre-trained model.
Another approach to understanding neural approaches is to conduct theoretical analysis to investigate the underlying explanations of neural models. One example is the work of Levy and Goldberg (2014), which regarded the word embedding learning task as an optimization problem, and found that the training process of the skip-gram model (Mikolov et al., 2013a,b) can be explained as implicit factorization of a shifted positive PMI (pointwise mutual information) matrix.
Recently, several research efforts have focused on the interpretability of the attention mechanism. Jain and Wallace (2019) raised the question of the explainability of feature importance as captured by the attention mechanism. They found the attention weights may not always be consistent with the feature importance from the human perspective in tasks such as text classification and question answering. Serrano and Smith (2019) also carried out an analysis of the interpretability of the attention mechanism, with a focus on the text classification task. They conducted their study in a cautious way with respect to defining interpretability and the research scope. The paper concluded that the attention weights are noisy predictors of importance, but should not be regarded as justification for decisions. Wiegreffe and Pinter (2019) suggested that the notion of explanation needs to be clearly defined, and that studying explanation requires taking all components of a model into account. Their results indicated that prior work could not disprove the usefulness of attention mechanisms with respect to explainability. Moreover, Michel et al. (2019) and Voita et al. (2019) examined the multi-head self-attention mechanism in Transformer-based models, particularly the roles played by the heads.
Our work and findings are largely consistent with those reported in the literature. We believe many factors are involved in understanding the attention mechanism. Inspired by Qian (1999), which investigated the internal mechanism of gradient descent, in this work we focus on understanding the internal mechanism of attention.

Classification Model with Attention
We consider the task of text classification, with a specific focus on binary classification. The architecture of the model is depicted in Figure 1.
There are various attention mechanisms introduced in the field (Luong et al., 2015). Two commonly used mechanisms are additive attention (Bahdanau et al., 2015) and scaled dot-product attention (Vaswani et al., 2017). In this work, we largely focus our analysis on the latter approach (but we will also touch on the former later).
Consider an input token sequence of length n: x = e_1, e_2, . . . , e_n, where e_j is the j-th input token, whose representation before the attention layer is h_j ∈ R^d. The attention score for the j-th token is:

a_j = (h_j · V) / λ

where the hyperparameter λ is the scaling factor (typically set to a large value; e.g., √d is often used in the literature (Vaswani et al., 2017)), and V ∈ R^d is the context vector, which can be viewed as a fixed query asking for the "most informative word" in the input sequence (Yang et al., 2016). The token representation h_j can be the word embedding or the output of an encoder.
The corresponding attention weight is:

α_j = exp(a_j) / Σ_{k=1}^{n} exp(a_k)

The complete input sequence is then represented as:

h = Σ_{j=1}^{n} α_j h_j

and the output of the linear layer is:

s = W · h

which we call the instance-level polarity score of the input sequence. Here, W ∈ R^d is the weight vector of the linear layer.
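The forward pass just described can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the variable names follow the notation above:

```python
import numpy as np

def attention_forward(H, V, W, lam):
    """Forward pass of the attention classifier.

    H: (n, d) array of token representations h_1..h_n
    V: (d,) context vector; W: (d,) linear-layer weights; lam: scaling factor.
    Returns the attention scores a, attention weights alpha, and polarity score s.
    """
    a = H @ V / lam                  # attention scores a_j = (h_j . V) / lam
    z = np.exp(a - a.max())          # softmax, shifted for numerical stability
    alpha = z / z.sum()              # attention weights alpha_j
    h = alpha @ H                    # weighted combination of the inputs
    s = float(W @ h)                 # instance-level polarity score
    return a, alpha, s
```

The predicted label is then +1 if s > 0 and −1 otherwise.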
When we make predictions, if the resulting polarity score s is positive, the corresponding input sequence will be classified as positive (i.e., y = +1, where y is the output label). Otherwise, it will be classified as negative (i.e., y = −1).
During training, assume we have a training set D = {(x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(m), y^(m))} with m labeled instances. Our overall loss is:

L = −(1/m) Σ_{t=1}^{m} log σ(y^(t) s^(t))

where y^(t) and s^(t) are the gold output label and the instance-level polarity score of the t-th instance respectively, and σ is the sigmoid function.
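With labels y ∈ {+1, −1}, the loss can be written directly. This is a minimal sketch consistent with the definitions above:

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def total_loss(ys, ss):
    """Average negative log-likelihood: -(1/m) * sum_t log sigma(y_t * s_t)."""
    assert len(ys) == len(ss)
    return -sum(math.log(sigmoid(y * s)) for y, s in zip(ys, ss)) / len(ys)
```

A correctly classified instance with a large-magnitude polarity score contributes almost nothing to the loss, while an uncertain one (s near 0) contributes about log 2.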
The instance-level polarity score s can also be written as:

s = W · h = Σ_{j=1}^{n} α_j (W · h_j) = Σ_{j=1}^{n} α_j s_j

Here, we have introduced the token-level polarity score s_j = W · h_j for the input token representation h_j. From here we can observe that the instance-level polarity score of the input sequence can be interpreted as a weighted sum of the token-level polarity scores, where the weights are the attention weights (α_j for h_j). Such attention weights measure the relative importance of each token within a specific input sequence.
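This decomposition is easy to verify numerically (an illustrative check with random values, reusing the notation above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 8
H = rng.normal(size=(n, d))            # token representations h_1..h_n
V = rng.normal(size=d)                 # context vector
W = rng.normal(size=d)                 # linear-layer weights
lam = np.sqrt(d)                       # scaling factor

a = H @ V / lam                        # attention scores
alpha = np.exp(a) / np.exp(a).sum()    # attention weights
s_instance = W @ (alpha @ H)           # s = W . (sum_j alpha_j h_j)
s_tokens = H @ W                       # token-level polarity scores s_j = W . h_j

# The instance-level score equals the attention-weighted sum of token scores.
assert np.isclose(s_instance, alpha @ s_tokens)
```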
On the other hand, the attention score a_j captures the absolute importance of the token. We believe such absolute measurements of the significance of words may play a more crucial role than attention weights in understanding the attention mechanism. Thus, unlike many previous research efforts, we focus on understanding the attention scores in this work.
In this paper, we mainly investigate a simple neural model where h_j = e_j, with e_j being the word embedding of the j-th input token. In other words, we assume the word embeddings are used directly as the inputs to the attention layer. Detailed discussions of other assumptions on h_j can be found in the supplementary material.

Analysis
We conduct some analysis in this section to understand how the attention mechanism works for the task of text classification. First, let us consider the following three different types of tokens:
• positive tokens: tokens that frequently appear in positive training instances only,
• negative tokens: tokens that frequently appear in negative training instances only, and
• neutral tokens: tokens that appear evenly across both positive and negative training instances.
We also call the first two types of tokens polarity tokens. For the ease of analysis and discussion, we assume each token belongs to exactly one of these three types, and we assume the dataset is balanced and symmetric. While some of these assumptions may seem strong, they significantly simplify our analysis. As we will see later in the experiments, even though some of the above assumptions do not hold in some real datasets, our findings remain valid in practice.
The gradient descent algorithm that minimizes a loss L can be interpreted as the integration of the gradient flow equation using Euler's method (Scieur et al., 2017; Qian, 1999), written as:

dz/dτ = −∇_z L(z),  z(0) = z_0

where z is the parameter vector, z_0 is its initialization, and τ is the time step. We assume that all parameters have initializations, and will omit such initializations in the subsequent differential equations. We will not seek to solve the differential equations directly, but rather to find out whether there exist trends and patterns for certain variables during training.
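The correspondence between gradient descent and Euler integration of the gradient flow can be illustrated on a toy quadratic loss. This is an illustrative sketch; the loss function and step size here are arbitrary choices, not quantities from the paper:

```python
import numpy as np

def grad_descent(grad, z0, step, n_steps):
    """Euler's method applied to dz/dtau = -grad L(z): z <- z - step * grad L(z)."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        z = z - step * grad(z)
    return z

# Toy loss L(z) = 0.5 * ||z||^2, so grad L(z) = z; the flow converges to 0.
z_final = grad_descent(lambda z: z, z0=[2.0, -1.0], step=0.1, n_steps=200)
```

Each gradient-descent iteration is one Euler step along the flow; following the trajectory of intermediate variables along this flow is exactly the style of analysis carried out below.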

Polarity Score
Consider a token e in the vocabulary whose vector representation is e. Let us analyze the polarity score s_e for this token. The token may appear anywhere in the training set; we write e_j^(t) ≡ e if and only if token e appears as the j-th token in the t-th instance.
The gradient update iteration can be represented in the same gradient-flow form. The update of the linear-layer weight vector W(τ) at time τ is given by the ordinary differential equation:

dW/dτ = −∂L/∂W

Similarly, for the embedding e, we have:

de/dτ = −∂L/∂e

For simplicity, we will omit the time step τ in the equations. The derivative of the token-level polarity score s_e = W · e can then be written as:

ds_e/dτ = (dW/dτ) · e + W · (de/dτ) = −(∂L/∂W) · e − W · (∂L/∂e)

The two partial derivatives, ∂L/∂W and ∂L/∂e, can be calculated in closed form; see the supplementary material for details.
Here, e_j^(t) ≡ e means we are selecting the tokens at the j-th position of the t-th instance that are exactly e, and α_j^(t) is the attention weight of that j-th token in the selected t-th instance. The vector h^(t) is the representation of the t-th instance, and β^(t) is defined as β^(t) = 1 − σ(y^(t) s^(t)).
The first term in Equation 12 can be expanded further; the sign of the second term in the resulting expression depends on a quantity with the following property: it is positive if e is a positive token, negative if e is negative, and close to 0 if e is neutral.
The second term in Equation 12 expands into Equation 17, which involves dot-products between embeddings. During training, certain trends and patterns will be developed for such dot-products. Near a local minimum, we can show that it is desirable to have e_i · e_j > 0 when e_i and e_j are both positive tokens or both negative tokens, and e_i · e_j < 0 when one is a positive token and the other is a negative token. More details and analysis on the desirability of these properties can be found in the supplementary material. Now let us look at the last term in Equation 17. This term can be re-written as two sums, split based on the polarity of the training instances.
In the first sum, each e_j token is either a positive or a neutral token; in the second, each e_j is either a negative or a neutral token. Under our assumptions on the dataset, all the terms involving neutral e_j tokens roughly sum to a value close to 0 (regardless of e), so we may assume there are no neutral e_j tokens. Now, if e is a positive token, it is desirable for both sums to be positive. If e is negative, it is desirable for both to be negative. If e is neutral, this term is likely close to 0.
Overall, the update of s_e can be expressed in terms of three quantities A, B, and C, together with a token-level quantity ρ(e); the full expressions can be found in the supplementary material. Under the assumption that V · W/λ is reasonably small (for example, we may set λ to an appropriate, reasonably large value), we have A ≈ 0. We then have the following results:
• For positive tokens, we have B > 0 and C > 0. The corresponding polarity scores will likely increase after each update when approaching the local minimum, and may end up relatively large and positive eventually.
• For negative tokens, we have B < 0 and C < 0. The corresponding polarity scores will likely decrease after each update when approaching the local minimum, and may end up relatively large and negative eventually.
• For neutral tokens, we have B ≈ 0 and C ≈ 0. Their polarity scores will likely not change significantly after each update when approaching the local minimum, and may end up neither significantly positive nor significantly negative.
Based on the above results, we can also quickly note that ρ(e) has the following property: it is positive if e is a polarity token, and close to zero if e is neutral.
These results are desirable, as the token-level polarity scores are used for defining the instance-level polarity scores, which are in turn useful for predicting the final polarity of the sentence containing such tokens.
However, we note that the above results depend on the assumption that the term A is small. As mentioned above, we may assume λ is large to achieve this. When V · W/λ is not small enough, the term A may lead to a gap in the polarity scores between the positive and negative tokens, depending on the sign of V · W, a term that will appear again in the next section when examining the attention scores.

Attention Score
Now let us analyze the attention score for each token. Given a token e, the corresponding attention score is a_e = (e · V)/λ. Note that this is a global score that is independent of any instance. The update of a_e is:

da_e/dτ = (1/λ) [(de/dτ) · V + e · (dV/dτ)] = −(1/λ) [(∂L/∂e) · V + e · (∂L/∂V)]

The first term can be calculated in closed form, and the second can be re-written as a sum of the form (1/(mλ²)) Σ_{(t,j): e_j^(t) ≡ e} (...); the full expressions can be found in the supplementary material. The latter term shall be close to zero initially, regardless of e. However, it may become positive for a polarity token e as learning progresses (see the supplementary material for more details).

Combining these, the update of a_e can be expressed in terms of three quantities D, E, and F, where D involves V · W and a token-level quantity π(e) (note that W · V = V · W). Let us now understand the influence of these terms respectively:
• Term D. When V · W > 0, the positive tokens will receive a positive update from this term after each step, whereas the negative tokens will receive a negative update. When V · W < 0, the influence is the other way around. This term does not influence the attention scores of the neutral tokens much, as the corresponding π(e) is approximately zero. When this term is not close to zero, it can lead to a gap between the final attention scores of the positive and negative tokens.
• Terms E and F. Based on our analysis, E > 0 and F ≥ 0 for polarity tokens, while E ≈ 0 and F ≈ 0 for neutral tokens. This means the attention scores of the positive and negative tokens will likely receive a positive contribution from these terms after each update when approaching a local minimum, and may end up large and positive eventually. For the neutral tokens, these terms do not have much influence on the attention scores.
From here we can observe that when V · W · λ is small, the polarity tokens will likely end up with larger attention scores than the neutral tokens. This is a desirable situation: polarity tokens are likely more representative when predicting the underlying class labels, and therefore shall receive more "attention" in general.
However, we note that if the scaling factor λ is too large, the term D may be significant. This means the sign of V · W will then play a crucial role: when it is non-zero and λ is very large, positive tokens and negative tokens will likely end up with a significant gap between their attention scores. In conclusion, if we would like to observe the desirable behavior discussed above for the attention mechanism, it is important to choose an appropriate λ value, or to find ways to control the value of V · W (we provide further discussion of V · W in the supplementary material). We will conduct experiments on real datasets to verify our findings.
Besides the above analysis, we have also analyzed the polarity scores and attention scores for the model with additive attention, the model with an affine input layer, and the model for multi-class classification. In each case, there are terms that have similar effects on the polarity and attention scores during training. Due to space limitations, we provide the details in the supplementary material.

Experiments
We conducted experiments on four text classification datasets; we also conducted an analysis on synthetic datasets, with results in the supplementary material. The statistics of the datasets are shown in Table 1. We followed the work of Jain and Wallace (2019) for pre-processing of the datasets (https://github.com/successar/AttentionExplanation), and lower-cased all the tokens.
• Stanford Sentiment Treebank (SST) (Socher et al., 2013). The original dataset consists of 10,662 instances with labels ranging from 1 (most negative) to 5 (most positive). Similar to the work of Jain and Wallace (2019), we removed neutral instances (with label 3), and regarded instances with label 4 or 5 as positive and instances with label 1 or 2 as negative.
• IMDB (Maas et al., 2011). The original dataset consists of 50,000 movie reviews with positive or negative labels.

Table 2: Test set results in accuracy (%). Models were chosen based on the highest accuracy on the dev sets. L2-regularization was adopted on DP-L, DP-A and AD.
• 20Newsgroup I (20News I). The original dataset consists of around 20,000 newsgroup correspondences. Similar to the work of Jain and Wallace (2019), we selected the instances from two categories, "rec.sport.hockey" and "rec.sport.baseball", and regarded the former as positive instances and the latter as negative.
• 20Newsgroup II (20News II). This is a dataset for 3-class classification. We selected instances from three categories: "rec.motorcycles", "sci.med" and "talk.politics.guns".

Our analysis focused on the ideal case (e.g., positive tokens only appear in positive documents). To be as consistent as possible with our analysis, we only examined the tokens with strong association with specific labels and the tokens that could be seen almost evenly across different types of instances, based on their frequencies (note that we only selected these tokens for examination after training; no tokens were excluded during the training process). We defined a metric γ_e to measure the association between the token e and the instance labels:

γ_e = (f_e^+ − f_e^−) / (f_e^+ + f_e^−)

where f_e^+ and f_e^− refer to the token's frequencies in the positive and in the negative instances respectively. If γ_e ∈ (0.5, 1) and f_e^+ > 5, the token is regarded as a "positive token". If γ_e ∈ (−1, −0.5) and f_e^− > 5, the token is regarded as a "negative token". If γ_e ∈ (−0.1, 0.1) and |f_e^+ − f_e^−| < 5, the token is regarded as a "neutral token". For multi-class classification, we determined the polarity of each token based on its relative frequency with respect to each label: for each token, we calculated the frequency distribution across the labels it appears in, and if the largest element of the distribution is above a given threshold, we regard the token as a polarity token.

We ran the experiments using different scaling factors λ on the models with the scaled dot-product attention (DP) and additive attention (AD) respectively.
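The token categorization can be implemented with simple frequency counts. This is a sketch; the thresholds follow the text, while the exact closed form of γ_e is reconstructed here as the normalized frequency difference consistent with those thresholds:

```python
def token_type(f_pos, f_neg):
    """Categorize a token from its frequencies in positive/negative instances.

    gamma = (f_pos - f_neg) / (f_pos + f_neg) is a reconstruction consistent
    with the threshold ranges described in the text.
    """
    if f_pos + f_neg == 0:
        return "unknown"
    gamma = (f_pos - f_neg) / (f_pos + f_neg)
    if 0.5 < gamma < 1 and f_pos > 5:
        return "positive"
    if -1 < gamma < -0.5 and f_neg > 5:
        return "negative"
    if -0.1 < gamma < 0.1 and abs(f_pos - f_neg) < 5:
        return "neutral"
    return "unknown"
```

Tokens falling outside all three ranges are simply excluded from the examination, mirroring the selection procedure described above.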
For the former, we also investigated the performance of models with an LSTM (DP-L) or an affine transformation layer (DP-A) as the input encoder. The Adagrad optimizer (Duchi et al., 2011) was used for gradient descent. Dropout (Srivastava et al., 2014) was adopted to prevent overfitting. All the parameters were learned from scratch to avoid the influence of prior information. For the same reason, while we could have used pre-trained word embeddings, we chose to initialize the word embeddings from a uniform distribution over [-0.1, 0.1], with dimension d = 100.
The results are shown in Table 2. For the scaled dot-product attention, which is our focus in this work, it can be observed that when the scaling factor λ is small (1 or 0.001), the test set results are worse than when λ is set to a larger value. The optimal results may be obtained when λ is set to a proper value. However, setting λ to a very large value does not seem to have a significant impact on the performance: in this case, from Equations 1 and 2 we can see that the attention weights will be close to each other for all input tokens, leading to an effect similar to mean pooling. Results using an LSTM or the affine transformation layer as the input encoder are similar: setting a proper value for λ appears to be crucial.

Figure 2 shows the polarity scores and attention scores for the first 3 datasets when λ is set to a moderate value of 10 (i.e., √d). These results are consistent with our analysis. It can be observed that positive tokens generally have positive polarity scores while negative tokens have negative polarity scores; neutral tokens typically have polarity scores around zero. It can also be observed that both the positive and negative tokens generally have larger attention scores than the neutral tokens.
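The flattening effect of a very large λ is easy to check numerically. An illustrative sketch, with arbitrary logits standing in for the unscaled products h_j · V:

```python
import numpy as np

def attention_weights(logits, lam):
    """alpha = softmax(h_j . V / lam); `logits` plays the role of h_j . V."""
    z = np.asarray(logits, dtype=float) / lam
    z = np.exp(z - z.max())              # shift for numerical stability
    return z / z.sum()

logits = np.array([4.0, 1.0, -2.0])
sharp = attention_weights(logits, lam=1.0)     # small lambda: peaked weights
flat = attention_weights(logits, lam=1000.0)   # large lambda: near-uniform weights
```

With λ = 1000 the weights are all close to 1/3, so the weighted combination degenerates to (approximately) mean pooling over the token representations.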
We also examined whether there would be an obvious gap between the attention scores of the polarity tokens when λ is large. As we can see from Figure 3b, when λ is set to 100, the resulting attention scores for the positive tokens are smaller than those of the neutral (and negative) tokens. In this case, the resulting attention scores appear to be less interpretable. However, as discussed above, when λ is very large, the attention mechanism effectively becomes mean pooling (we can also see from Figure 3b that the attention scores of all tokens are now much smaller), and the overall model relies on the average polarity score of the word tokens in the sentence for making predictions. Interestingly, on the other hand, as discussed at the end of Section 4.1, when λ is large, the polarity tokens will likely end up with polarity scores of large magnitude, a fact that can also be empirically observed in Figure 3a. It is because of such healthy polarity scores that the model is still able to yield good performance in this case, even though the attention scores do not appear to be very interpretable.
We also tried to set a constraint on V · W by introducing a regularization term to minimize it during learning. We found that doing so generally encourages the attention model to produce more interpretable attention scores: for example, even when λ was large, both the positive and negative tokens ended up with positive attention scores that were generally larger than those of the neutral tokens. However, empirically we did not observe a significant improvement in test performance. See the supplementary material for details.
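One simple way to implement such a constraint is an additive penalty on V · W. The squared form and the coefficient `c` below are illustrative assumptions; the text only states that a regularization term pushes V · W toward zero, and the exact form used is in the supplementary material:

```python
import numpy as np

def loss_with_vw_penalty(base_loss, V, W, c=0.1):
    """Add a penalty discouraging large |V . W|.

    The squared form c * (V . W)^2 and the coefficient c are illustrative
    assumptions, not necessarily the exact regularizer used in the paper.
    """
    return base_loss + c * float(np.dot(V, W)) ** 2
```

Minimizing this augmented loss drives V and W toward orthogonality, which (per the analysis above) shrinks the gap-inducing terms that depend on the sign of V · W.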
We examined the attention scores on the 20News II dataset which consists of 3 labels. As shown in Figure 3c, polarity tokens that are strongly associated with specific labels are still likely to have larger attention scores than those of neutral tokens.
To understand whether there are similar patterns for the polarity and attention scores when using additive attention, we replaced the scaled dot-product attention layer with the additive attention layer and ran experiments on the SST dataset. The results, shown in Figure 4, are similar to those of our scaled dot-product attention model. Furthermore, we analyzed the relationship between the global attention scores and the local attention weights. We collected all the attention weights on the test set of SST for the positive, negative and neutral tokens, and calculated the average weight for each token. We then plot in Figure 5 the distribution of such average attention weights for the three types of tokens separately. As we can observe, the polarity tokens are generally more likely to have larger attention weights than the neutral tokens. However, the positive tokens seemed to receive lower attention weights than the negative tokens. This is consistent with the attention scores shown in Figure 2d: the attention scores of the positive tokens were generally lower than those of the negative tokens. Meanwhile, there were some outliers with large weights among the neutral tokens (circles that appear outside the boxes are outliers). We looked into one such case: all three tokens in the short instance "is this progress" had negative attention scores, and the last token, "progress", had a relatively larger one, making its attention weight the largest among the three. This can be explained by the fact that attention weights only capture the relative significance of tokens within a local context. These empirical results support our analysis as well as our belief in the significance of the attention scores.
When certain hyperparameters are properly set, the attention mechanism tends to assign larger attention scores to tokens that have a strong association with instances of a specific label. Meanwhile, the polarity scores of such tokens tend to have large absolute values (of possibly different signs, depending on the polarity of the tokens), which is helpful when predicting instance labels. By contrast, neutral tokens that appear evenly across instances of different labels are likely assigned small attention scores and polarity scores, making them relatively less influential.

Conclusions
In this work, we focused on understanding the underlying factors that may influence the attention mechanism, and proposed to examine attention scores -a global measurement of significance of word tokens. We focused on binary classification models with dot-product attention, and analyzed through a gradient descent based learning framework the behavior of attention scores and polarity scores -another quantity that we defined and proposed to examine.
Through the analysis we found that both quantities play important roles in the learning and prediction process, and that examining them in an integrated manner allows us to better understand the underlying workings of an attention-based model. Our analysis also revealed factors that may impact the interpretability of the attention mechanism, providing an understanding of why the model may still be robust even in scenarios where the attention scores appear to be less interpretable. The empirical results of experiments on various real datasets supported our analysis. We also extended to and empirically examined additive attention, multi-class classification, and models with an affine input layer, and observed similar behaviors.
There are some future directions that are worth exploring. Specifically, we can further examine the influence of using pre-trained word embeddings: whether similar words can help each other boost their polarity and attention scores. Moreover, we can also examine the influence of using deep contextualized input encoders such as ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018).