Structured Self-Attention Weights Encode Semantics in Sentiment Analysis

Neural attention, especially the self-attention made popular by the Transformer, has become the workhorse of state-of-the-art natural language processing (NLP) models. Very recent work suggests that self-attention in the Transformer encodes syntactic information; here, we show that self-attention scores encode semantics, by considering sentiment analysis tasks. In contrast to gradient-based feature attribution methods, we propose a simple and effective Layer-wise Attention Tracing (LAT) method to analyze structured attention weights. We apply our method to Transformer models trained on two tasks that have surface dissimilarities, but share common semantics---sentiment analysis of movie reviews and time-series valence prediction in life story narratives. Across both tasks, words with high aggregated attention weights were rich in emotional semantics, as quantitatively validated by an emotion lexicon labeled by human annotators. Our results show that structured attention weights encode rich semantics in sentiment analysis, and match human interpretations of semantics.


Introduction
In recent years, variants of neural network attention mechanisms such as local attention (Bahdanau et al., 2015; Luong et al., 2015) and self-attention in the Transformer (Vaswani et al., 2017) have become the de facto choice for a variety of NLP tasks, including machine translation (Luong et al., 2015; Vaswani et al., 2017), syntactic parsing (Vinyals et al., 2015), and language modeling (Liu and Lapata, 2018; Dai et al., 2019).
Attention has brought about increased performance gains, but what do these values 'mean'? Previous studies have visualized and shown how learnt attention contributes to decisions in tasks like natural language inference and aspect-level sentiment (Lin et al., 2017; Wang et al., 2016; Ghaeini et al., 2018). Recent studies on the Transformer (Vaswani et al., 2017) have demonstrated that attention-based representations encode syntactic information (Tenney et al., 2019) such as anaphora (Voita et al., 2018; Goldberg, 2019), Parts-of-Speech (Vig and Belinkov, 2019) and dependencies (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019; Clark et al., 2019). Other researchers have also done very recent extensive analyses of self-attention, by, for example, implementing a gradient-based Layer-wise Relevance Propagation (LRP) method on the Transformer (Voita et al., 2019) to study the attribution of gradient scores to heads, or a graph-based aggregation method to visualize attention flows (Abnar and Zuidema, 2020). These very recent works have not looked at whether the structured attention weights themselves aggregate on tokens with strong semantic meaning in tasks such as sentiment analysis. Thus, it is still unclear whether the attention on input words may actually encode semantic information relevant to the task.
In this paper, we were interested in extending previous studies on attention and syntax further, by probing the structured attention weights and studying whether these weights encode task-relevant semantic information. In contrast to gradient-based attribution methods (Voita et al., 2019), we were explicitly interested in probing learnt attention weights rather than analyzing gradients. To do this, we propose a Layer-wise Attention Tracing (LAT) method to aggregate the structured attention weights learnt by self-attention layers onto input tokens. We show that these attention scores on input tokens correlate with an external measure of semantics across two tasks: a sentiment analysis task on a movie review dataset, and an emotion understanding task on a life stories narrative dataset. These tasks differ in structure (single-example classification vs. time-series regression) and in domain (movie reviews vs. daily life events), but should share the same semantics, in that the same words should be important in both tasks. We propose a method of external validation of the semantics of these tasks, using emotion lexicons. We find evidence for the hypothesis that if self-attention mechanisms can learn emotion semantics, then LAT-calculated attention scores should be higher for words that have stronger emotional semantic meaning.

Attention-based Model Architecture
We use an encoder-decoder architecture as shown in Fig. 1. Our encoder is identical to the encoder of the Transformer (Vaswani et al., 2017), with an additional local attention layer (Luong et al., 2015). Our decoder is task-specific: a simple Multilayer Perceptron (MLP) for the classification task, and an LSTM followed by an MLP for the time-series prediction task.
Self-attention Layers. The encoder is identical to the original Transformer encoder and consists of a series of stacked self-attention layers. Each layer contains a multi-head self-attention layer, followed by an element-wise feed-forward layer and residual connections. Following Vaswani et al. (2017), we use $L = 6$ stacked layers and $H = 8$ heads, and a hidden dimension of $d = 512$.
We briefly recap the Transformer equations, to better illustrate our LAT method, which traces attention back through the layers. For a given self-attention layer $\ell \in [1, L]$, we denote the input to layer $\ell$ by $\mathbf{X}^{\ell} \in \mathbb{R}^{n \times d}$, which represents $n$ tokens, each embedded using a $d$-dimensional embedding. We keep the same input embedding size for all layers. The first layer takes as input the word tokens. A self-attention layer learns a set of Query, Key and Value matrices that are indexed by $\ell$ (i.e., weights are not shared across layers). Formally, these matrices are produced in parallel:

$$\mathbf{Q}^{\ell} = f^{\ell}_{Q}(\mathbf{X}^{\ell}), \quad \mathbf{K}^{\ell} = f^{\ell}_{K}(\mathbf{X}^{\ell}), \quad \mathbf{V}^{\ell} = f^{\ell}_{V}(\mathbf{X}^{\ell}) \qquad (1)$$

where $f^{\ell}_{\{Q,K,V\}}(\cdot)$ are each parameterized by a linear layer, and each matrix is of size $n \times d$. To enable multi-head attention, $\mathbf{Q}^{\ell}$, $\mathbf{K}^{\ell}$ and $\mathbf{V}^{\ell}$ are partitioned into $H$ separate $n \times d_h$ attention heads indexed by $h \in [1, H]$, where $d_h = d/H = 64$.
Each head learns a self-attention matrix $\mathbf{s}^{\ell}_{h}$ using the scaled inner product of $\mathbf{Q}^{\ell}_{h}$ and $\mathbf{K}^{\ell}_{h}$ followed by a softmax operation. The self-attention matrix $\mathbf{s}^{\ell}_{h}$ is then multiplied by $\mathbf{V}^{\ell}_{h}$ to produce $\mathbf{Z}^{\ell}_{h}$:

$$\mathbf{s}^{\ell}_{h} = \mathrm{softmax}\!\left(\mathbf{Q}^{\ell}_{h} (\mathbf{K}^{\ell}_{h})^{\top} / \sqrt{d_h}\right) \qquad (2)$$
$$\mathbf{Z}^{\ell}_{h} = \mathbf{s}^{\ell}_{h} \mathbf{V}^{\ell}_{h} \qquad (3)$$

Next, we concatenate $\mathbf{Z}^{\ell}_{h}$ from each head $h$ to produce the output of layer $\ell$ (i.e., the input to layer $\ell + 1$):

$$\mathbf{X}^{\ell+1} = f^{\ell}_{O}\big(\mathrm{concat}(\mathbf{Z}^{\ell}_{1}, \dots, \mathbf{Z}^{\ell}_{H})\big) \qquad (4)$$

where $f^{\ell}_{O}(\cdot)$ is parameterized by two fully connected feed-forward layers (with 64 dimensions for the first layer, then scaling back to $d$ dimensions) with residual connections and layer normalization. $\mathbf{X}^{\ell+1}$ is fed upwards to the next layer.

Code is available at https://github.com/frankaging/LAT_for_Transformer

Figure 1: Attention-based encoder-decoder model architecture for the classification task (left) and the time-series task (right); the latter has a recurrent unit to generate predictions over time.
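The per-layer forward computation (Eqns. 1-4) can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: heads are taken as column slices of the projections, and the residual connections, layer normalization, and feed-forward sublayer $f^{\ell}_{O}$ are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(X, Wq, Wk, Wv, n_heads):
    """One multi-head self-attention layer (Eqns. 1-3; residuals/FFN omitted)."""
    n, d = X.shape
    d_h = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # Eqn. 1: linear projections, each n x d
    Z, S = [], []
    for h in range(n_heads):               # partition columns into H heads of width d_h
        cols = slice(h * d_h, (h + 1) * d_h)
        s_h = softmax(Q[:, cols] @ K[:, cols].T / np.sqrt(d_h))  # Eqn. 2: n x n
        z_h = s_h @ V[:, cols]             # Eqn. 3: attended values, n x d_h
        S.append(s_h)
        Z.append(z_h)
    # Eqn. 4 would apply f_O to this concatenation; we return it raw, plus the
    # per-head attention matrices S that LAT later traces through.
    return np.concatenate(Z, axis=1), S
```

Note that each row of every $\mathbf{s}^{\ell}_{h}$ sums to 1 after the softmax; LAT relies on this property to conserve attention as it traces downwards.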
Local Attention Layer. The output from the last self-attention layer, $\mathbf{X}^{L+1}$, is fed into a local attention layer. We then take a weighted sum over the row vectors of the output, and produce a context vector $u$ using a learnt local attention vector $\mathbf{c}$:

$$\mathbf{c} = \mathrm{softmax}\big(f_{c}(\mathbf{X}^{L+1})\big) \qquad (5)$$
$$u = \sum_{i=1}^{n} c_i \, \mathbf{X}^{L+1}_{i} \qquad (6)$$

where $f_{c}(\cdot)$ is parameterized by a multi-layer perceptron (MLP) with two hidden layers that are 128-dimensional and 64-dimensional. The MLP layers are trained with dropout of $p = 0.3$.
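The local attention collapse (Eqns. 5-6) amounts to scoring each position, normalizing, and taking a weighted sum of rows. In the sketch below the MLP scorer $f_c$ is reduced to a single learnt vector for brevity; that simplification is an assumption of this sketch, not the paper's exact parameterization.

```python
import numpy as np

def local_attention(X_top, w):
    """Collapse the (n, d) output of the last self-attention layer into one
    context vector.

    X_top : (n, d) matrix X^{L+1}.
    w     : (d,) learnt scoring vector (a stand-in for the MLP f_c).
    """
    scores = X_top @ w                   # one scalar score per position
    e = np.exp(scores - scores.max())
    c = e / e.sum()                      # Eqn. 5: local attention weights, sum to 1
    u = c @ X_top                        # Eqn. 6: weighted sum of rows -> context vector
    return u, c
```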

Decoder.
For the classification task, the context vector is fed into a decoder parameterized by an MLP to produce the output label. For the time-series task, the context vectors from each time window are fed into an LSTM (Hochreiter and Schmidhuber, 1997) layer with 300-dimensional hidden states before passing through an MLP. The MLPs for the classification and the time-series tasks have the same 64-dimensional hidden space, and are trained with dropout of $p = 0.3$. A complete model description can be found in the Appendix.

Layer-wise Attention Tracing
To study whether structured attention weights encode semantics, we propose a tracing method, Layer-wise Attention Tracing (LAT), to trace the attention 'paid' to input tokens (i.e., words) through the self-attention layers in our encoder. LAT, illustrated in Fig. 2, involves three main steps. First, starting from the local attention layer and a fixed "quantity" of attention, we distribute attention weights back to $\mathbf{Z}^{L}_{h}$, the output of the last self-attention layer for each head $h \in [1, H]$. Second, we trace the attention back through each self-attention layer $\ell \in [1, L]$. Third, from the first layer of each head, we trace the attention back onto each token in the input sequence, by accumulating attention scores from each head at the corresponding position. We do not consider the decoder in LAT, as the MLP and LSTM layers in the decoder do not modify attention. Furthermore, we specifically ignore the feed-forward layers and residual connections in the encoder, as we were interested in the attention weights, not the neural activations they modify; this is the main difference between our approach and gradient-based or relevance-based work (Voita et al., 2019), and we note that another recent paper (Abnar and Zuidema, 2020) made the same assumptions.

Tracing Local Attention.
Given an input sequence $\mathbf{X}$ of length $n$ tokens, the forward pass of the model (Eqns. 1-6) transforms $\mathbf{X}$ into the context vector $u$. We consider how a fixed quantity of attention, $A$, gets divided back to the various heads. We refer to this quantity as the Attention Score that is accumulated down through the layers. From Eqn. 4 and Eqn. 6, we note that $u$ is a function of the concatenated $\mathbf{Z}^{L}_{h}$ from the last self-attention layer, from each of the heads (ignoring the feed-forward transformation $f^{L}_{O}$, per our assumption above):

$$u = \sum_{i=1}^{n} c_i \cdot \mathrm{concat}\big(z^{L}_{1}(i), \dots, z^{L}_{H}(i)\big) \qquad (7)$$

where $z^{L}_{h}(i)$ is the attended Value vector from head $h \in [1, H]$ of the last layer $L$ at position $i \in [1, n]$. On the forward pass, the contribution of head $h$ at position $i$, $z^{L}_{h}(i)$, is weighted by $c_i$; thus, on this first step of LAT, we divide the attention score $A$ back to head $h$ at position $i$ using $c_i$:

$$A^{L+1}_{h}(i) = c_i \cdot A \qquad (8)$$

We use this notation to allude that these are the attention weights coming down from the "$(L+1)$-th layer", to follow the logic of the next step of LAT. Without loss of generality, we can set the initial attention score at the top, $A$, to be 1; all subsequent attention scores can then be interpreted as a proportion of the initial attention score. Note that in our attention tracing, we are interested in accumulating the attention $A^{\ell}_{h}(i)$ for each layer $\ell \in [1, L]$ at each position $i$, and so we focus on the attention weights (and not the hidden states that the attention multiplies, $\mathbf{Z}^{\ell}_{h}$ or $\mathbf{V}^{\ell}_{h}$), which LAT leaves unchanged.
Tracing Self Attention. On the forward pass, Eqn. 3 applies the self-attention weights. We rewrite this equation to make the indices explicit:

$$z^{\ell}_{h}(i) = \sum_{j=1}^{n} s^{\ell}_{h}(i \to j)\, v^{\ell}_{h}(j) \qquad (9)$$

where $v^{\ell}_{h}(j)$ denotes the $j$-th row of $\mathbf{V}^{\ell}_{h}$ (i.e., corresponding to the token in position $j$), and $s^{\ell}_{h}(i \to j)$ is the $(i, j)$ element of $\mathbf{s}^{\ell}_{h}$, such that it captures the attention from position $i$ to position $j$. The attended values $\mathbf{Z}^{\ell}_{h}$ then undergo two sets of feed-forward layers: Eqn. 4 with $f^{\ell}_{O}$ to get $\mathbf{X}^{\ell+1}$ and Eqn. 1 with $f^{\ell+1}_{V}$ to get $\mathbf{V}^{\ell+1}_{h}$. Using $A^{\ell}_{h}(i)$ to denote the attention score accumulated at head $h$, position $i$, layer $\ell$, we can trace the attention coming down from the next-higher layer based on Eqn. 9:

$$A^{\ell}_{h}(j) = \sum_{i=1}^{n} s^{\ell}_{h}(i \to j)\, A^{\ell+1}_{h}(i) \qquad (10)$$

To confirm our intuition, on the forward pass (see Eqn. 9 and Fig. 2), to get the hidden value at position $i$ on the "upper" part of the layer, we sum $s^{\ell}_{h}(i \to j)$ over $j$ (the indices of the "lower" layer). Thus, on the LAT pass downwards (Eqn. 10), to get $A^{\ell}_{h}(j)$ at position $j$ on the "lower" layer, we sum the corresponding $s^{\ell}_{h}(i \to j)$'s over $i$.

Figure 2: Dashed circles represent multiple heads, and vertical columns represent MLP transformations, which do not redistribute attention. Right: LAT on a 'backward pass'. The thickness of the edges represents accumulating attention. Attention from incoming edges is accumulated at each position in each layer, as in Eqn. 10. Darker colors map to greater accumulated attention scores. In this example, the input token "bad" receives the highest attention score.
Tracing to input tokens. Finally, for each input token $x_i$, we sum up the attention weights from each head at the corresponding position in the first layer to obtain the accumulated attention weight paid to token $x_i$:

$$A(x_i) = \sum_{h=1}^{H} A^{1}_{h}(i) \qquad (11)$$

In summary, Eqns. 8, 10, and 11 describe the LAT method for tracing through the local and self-attention layers back to the input tokens $x_i$.
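Putting Eqns. 8, 10, and 11 together, the full LAT pass is a short backward sweep over the stored per-head attention matrices. The sketch below assumes, as described above, that per-head scores are passed straight down between layers with the feed-forward sublayers ignored; it is an illustration of the method, not the released implementation.

```python
import numpy as np

def lat_trace(c, S):
    """Layer-wise Attention Tracing (Eqns. 8, 10, 11).

    c : (n,) local attention weights over positions (Eqn. 5).
    S : S[l][h] is the (n, n) self-attention matrix of layer l, head h,
        where S[l][h][i, j] is the attention from position i to position j.
    Returns the accumulated attention A(x_i) for each input token.
    """
    L, H = len(S), len(S[0])
    # Eqn. 8: the initial attention score A = 1 is split by the local
    # attention weights; each head starts from the same per-position c_i.
    A = np.tile(c, (H, 1))                 # A[h, i] = A^{L+1}_h(i)
    for l in range(L - 1, -1, -1):         # trace down through layers L..1
        for h in range(H):
            # Eqn. 10: A^l_h(j) = sum_i s^l_h(i -> j) * A^{l+1}_h(i)
            A[h] = S[l][h].T @ A[h]
    return A.sum(axis=0)                   # Eqn. 11: sum over heads per token
```

Because each row of every attention matrix sums to 1, the traced attention is conserved per head at every layer, so the final token scores sum to $H$ when $\mathbf{c}$ sums to 1 (they are softmax-normalized afterwards in any case).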

Related Work
There has been extensive debate over what attention mechanisms learn. On the one hand, researchers have developed methods to probe learnt self-attention in Transformer-based models, and show that attention scores learnt by models like BERT encode syntactic information like Parts-of-Speech (Vig and Belinkov, 2019), dependencies (Hewitt and Manning, 2019; Raganato and Tiedemann, 2018), anaphora (Goldberg, 2019; Voita et al., 2018) and other parts of the traditional NLP pipeline (Tenney et al., 2019). These studies collectively suggest that self-attention mechanisms learn to encode syntactic information, which led us to propose the current work on whether self-attention can similarly learn to encode semantics.
On the other hand, there are also papers questioning the interpretations the field has placed on attention. These researchers show that attention weights have a low correlation with gradient-based measures of importance (Jain and Wallace, 2019; Serrano and Smith, 2019; Vashishth et al., 2019). More recent analyses suggest that in certain regimes for the Transformer (i.e., sequence length $n$ greater than attention head dimension $d_h$), attention distributions are non-identifiable, posing problems for interpretability (Brunner et al., 2020); we note that our models do not fall into this regime. In our work, we provide a method that can trace attention scores in Transformers to the input tokens, and show with both qualitative and quantitative evidence that these scores are semantically meaningful.
Beyond attention-based studies, there have been numerous studies proposing gradient-based attribution analyses (Dimopoulos et al., 1995; Gevrey et al., 2003; Simonyan et al., 2013) and layer-wise relevance propagation (Bach et al., 2015; Arras et al., 2017). Most related to the current work is Voita et al. (2019), who extended layer-wise relevance propagation to the Transformer to examine the contribution of individual heads to the final decision. In parallel, Abnar and Zuidema (2020) recently proposed a method to roll out structured attention weights inside the Transformer model, which is similar to the LAT method we propose here, although we additionally validate the traced attention against external knowledge. We sought to investigate the attention accumulated onto individual input tokens using attention tracing, similar to how Vig and Belinkov (2019) did for syntax or how Voita et al. (2018) looked at the attention paid to other words. We also calculate a gradient-based score (see Eqn. 13) to contrast our attention results with, and though these two scores are correlated (see Footnote 6), they behave differently in our analyses.

Stanford Sentiment Treebank
We used the fine-grained (5-class) version of the Stanford Sentiment Treebank (SST-5) movie review dataset (Socher et al., 2013), which has been used in previous studies of the interpretability of neural network models (Arras et al., 2017). All sentences were tokenized and preprocessed by lowercasing. We embed each token using 300-dimensional GloVe word embeddings (Pennington et al., 2014). Each sentence is labeled via crowdsourcing with one of five sentiment classes: {Very Negative, Negative, Neutral, Positive, Very Positive}. Although the SST contains labels on each parse tree of the reviews, we only considered full sentences. We used the same dataset partitions as in the original paper: a Train set (8544 sentences, average length 19 tokens), a Validation set (1101 sentences, average length 19 tokens) and a Test set (2210 sentences, average length 19 tokens). Models are trained to maximize 5-class classification accuracy by minimizing the multi-class cross-entropy loss. We compare our model with previous works on SST that are based on LSTMs (Tai et al., 2015) and Transformers (Ambartsoumian and Popowich, 2018; Shen et al., 2018).

Stanford Emotional Narratives Dataset
The SEND comprises videos of participants narrating emotional life events. Each video is professionally transcribed, and annotated via crowdsourcing with emotion valence scores ranging from "Very Negative" [-1] to "Very Positive" [1], continuously sampled every 0.5 s. Details can be found on the authors' GitHub repository. The SEND has previously been used to train deep learning models to predict emotion valence over time.
The SEND has 193 transcripts, and each one contains multiple sentences. We preprocess them by tokenizing and lowercasing. Additionally, we divide each transcript into 5-second time windows using timestamps provided in the dataset, and use the average valence score during a time window as the label of that window. We use the same partitions as in the original paper: a Train set (114 transcripts, average length 357 tokens, average window length 13 tokens), a Validation set (40 transcripts, average length 387 tokens, average window length 15 tokens) and a Test set (39 transcripts, average length 333 tokens, average window length 13 tokens). We embed each token in the same way as for SST-5. As in the original papers, we use the Concordance Correlation Coefficient (CCC; Lin, 1989) as our evaluation metric (see Appendix for the definition). We compare our model with previous LSTM- and Transformer-based works on the SEND.
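For concreteness, the CCC evaluation metric is simple to compute from Lin's (1989) formula; a minimal sketch:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin, 1989):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2).
    Penalizes both low correlation and systematic bias/scale differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)
```

Unlike plain Pearson correlation, CCC is 1 only for perfect agreement; a prediction that is perfectly correlated but shifted by a constant scores below 1.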

Model training and results
We report the results of our Transformer-based models in Table 1, alongside the performance of state-of-the-art (SOTA) models trained on these two datasets.
We selected models in the literature that are the most representative and relevant to our models. Our Transformer-based model for the SST-5 classification task (Fig. 1) achieves good performance, with an accuracy (± standard deviation) of 47.5% ± 49.9% on the five-class sentiment classification. For the SEND dataset, our model outperforms previous SOTA models, and even average human performance on this task, with a mean CCC of .54 ± .36 on the Test set. Interestingly, our window-based Transformer encoder increases performance compared to the previously proposed Simple Fusion Transformer, which used a Transformer-based encoder over the whole narrative sequence.
Both models are trained with the Adam (Kingma and Ba, 2015) optimization algorithm with a learning rate of $10^{-4}$. As our goal was analyzing structured attention weights, not maximizing performance, we manually specified hyperparameters without any grid search. We include details about our experiment setup in the Appendix.
Given that our Transformer-based models achieved performance comparable to the state of the art on the SST and SEND, we then proceed to analyze the attention scores produced by LAT on these models. After computing $A(x_i)$ for all the words in a given sequence, we normalize the attention scores using the softmax function so that they sum to 1.
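The normalization step above is a standard numerically stable softmax over the traced scores; a one-function sketch:

```python
import numpy as np

def normalize_attention(A):
    """Softmax-normalize raw traced token scores so they sum to 1 per sequence."""
    A = np.asarray(A, dtype=float)
    e = np.exp(A - A.max())   # subtract max for numerical stability
    return e / e.sum()
```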

Visualizing Layerwise Attention Tracing
The flow diagram in Fig. 3 visualizes how attention aggregates under LAT across all heads and layers, for the model trained on SST-5 and an example input. Rows represent self-attention layers and columns represent attention heads. Dots represent different tokens at head $h \in [1, H]$ (left to right), position $i \in [1, n]$ of layer $\ell \in [1, L]$ (bottom to top). Dots in the bottom-most layer represent input tokens. The darker the color of a dot, the higher the accumulated attention score at that position, calculated using Eqns. 8, 10 and 11. Attention weights $s^{\ell}_{h}(i \to j)$ in each layer are illustrated by lines connecting tokens in consecutive layers.
This diagram illustrates some coarse-grained differences between heads. For example, all heads in the last (top-most) layer distribute attention fairly equally across all tokens. Other heads (e.g., Head 6, Layer 4, and Head 8, Layer 3) have a downward-triangle pattern, where attention weights accumulate onto a specific token in the lower layer, while others (e.g., Head 5, Layer 1) seem to re-distribute accumulated attention more broadly. Finally, at the input layer, we note that attention scores seem to be highest for words with strong emotion semantics.

Sentiment Representations of Words
To validate that the attention weights aggregated on the input tokens by LAT are semantically meaningful, we rank all unique word-level tokens in the Test set by their attention scores, averaged over all sequences in which they appear. Concretely, we first use LAT to trace the attention weights paid to input tokens for every sequence in the Test set. For tokens that appear more than once, we average their attention scores across occurrences. We then rank tokens by their average attention score, and illustrate the ranking in Fig. 4 using word clouds, where a larger font size maps to a higher average attention score. For both datasets, we observe that words expressing strong emotions also have higher attention scores; see, e.g., sorry, painful, unsatisfying for SST-5, and congratulations, freaking, comfortable for SEND. We note that stop words do not receive high attention scores in either dataset.

Quantitative validation with an emotion lexicon
One advantage of extracting emotion semantics from natural language text is that the field has amassed large, annotated references of emotion semantics. We refer, of course, to the emotion lexicons that earlier NLP researchers used for sentiment analysis and related tasks (Hu and Liu, 2004).
Although they seem to have fallen out of favor with the rise of deep learning (and the hypothesis that deep learning can learn such knowledge in a data-driven manner), in our task, we sought to use emotion lexicons as an external validation of what our model learns.
We used a lexicon (Warriner et al., 2013) of nearly 14,000 English lemmas that are each annotated by an average of 20 volunteers for emotional valence, which corresponds exactly to the semantics in our tasks. The mean valence ratings in this lexicon are real-valued numbers from 1 to 9.
We hypothesize that our LAT method produces attention scores such that words with higher scores will tend to have greater emotional meaning. Additionally, since our attention scores do not differentiate emotion "directions" (i.e., negative and positive), these attention scores should be high both for very positive words and for very negative words. Thus, we expect a U-shaped relationship between our attention scores and the lexicon's valence ratings. We examine this hypothesis by fitting a quadratic regression equation:

$$A = \beta_2 \mathrm{Val}^2 + \beta_1 \mathrm{Val} + \beta_0 + \epsilon \qquad (12)$$

where $A$ is the averaged attention score of a particular word derived by the LAT method, and $\mathrm{Val}$ represents the valence rating of that word from the Warriner et al. (2013) lexicon. We hypothesized a statistically significant coefficient $\beta_2$ on the quadratic term. To contrast our attention score with another measure of importance, the gradient (i.e., how important the inputs are to affecting the output $y$), we also calculate a gradient score on each token by computing squared partial derivatives:

$$G(x_i) = \sum_{k=1}^{d} \left( \frac{\partial y}{\partial x_{i,k}} \right)^2 \qquad (13)$$

where $y$ is the model output parameterized by the neural network, and $\partial y / \partial x_{i,k}$ is the gradient with respect to dimension $k$ of the embedding of input token $x_i$. We then regress $G$ on the lexicon valence ratings using Eqn. 12.
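Fitting Eqn. 12 is a standard quadratic least-squares problem; a sketch using numpy (the full analysis would also report standard errors and t-statistics, omitted here):

```python
import numpy as np

def fit_quadratic(valence, scores):
    """Least-squares fit of scores = b2*Val^2 + b1*Val + b0 (Eqn. 12).

    A positive, significant b2 with a minimum inside the valence range
    indicates the hypothesized U-shaped relationship.
    """
    b2, b1, b0 = np.polyfit(valence, scores, deg=2)
    return b2, b1, b0
```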
We plot both our attention scores and gradient scores for each word against the Warriner et al. (2013) valence ratings in Fig. 5. For both tasks, we considered only words that appeared in both our Test sets and the lexicon, and plot only scores below 0.4 to make the plot more readable. We can clearly see a U-shaped, quadratic relationship between attention scores and the Warriner valence ratings ($\beta_2 = 0.283$, $SE = 0.040$, $t = 7.04$, $p < .001$ for SST-5; $\beta_2 = 0.242$, $SE = 0.039$, $t = 6.21$, $p < .001$ for SEND). These results support our hypothesis that the attention scores recovered by our LAT method do track emotional semantics. As a result, we show that structured attention weights may encode semantics independently of other types of connections in the model (e.g., linear feed-forward layers and residual connections). By contrast, there is no clear quadratic relationship between gradient scores and valence ratings in either task (SST-5, $p = 0.19$; SEND, $p = 0.28$).

Head Attention on Sentiment Words
We next analyze the amount of attention paid to sentiment words in each head. (Two notes on the preceding analysis: the plotting rule for Fig. 5 filtered out less than 1% of words in the Test sets, .171% for SST-5 and .754% for SEND. Also, on the SST, $A$ and $G(x_i)$ are correlated at $r = .80$, and on the SEND, $r = .37$; the two values are highly correlated on the SST, but vary differently with respect to valence.)

Within each head $h$, we analyze the proportion of accumulated attention $A^{1}_{h}(i)$ on emotional words, specifically focusing on very positive and very negative words, aggregated over the Test sets:

$$A^{1}_{h}(t) = \frac{1}{|\mathcal{X}_t|} \sum_{\mathbf{X} \in \mathcal{X}_t} \; \sum_{i \,:\, x_i \text{ has tag } t} A^{1}_{h}(i) \qquad (14)$$

where $\mathcal{X}_t$ is the subset of sequences that contain at least 1 word with the selected tag $t$. (For SST-5, we used the original word-level very positive and very negative labels in the dataset; for SEND, we used the Warriner lexicon and chose a cutoff of $\geq 6.5$ for very positive, and $< 3.5$ for very negative.) Fig. 6 shows the proportion of attention accumulated by heads onto very positive and very negative words, compared with chance. All heads do seem to pay more attention to strongly emotional words than chance would predict, and some heads seem to 'specialize' more: for example, Head 4 in our SEND model pays 24% of its accumulated attention to very negative words, while the mean of all other heads is closer to 15%. While Fig. 6 is specific to the model we trained, it is illustrative that specialization to strong emotional semantics does emerge from the learnt attention weights.
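The per-head proportion in Eqn. 14 can be computed with a short loop over the Test set. This is a sketch under simplifying assumptions: each sequence is represented as a (scores, tags) pair with hypothetical tag names, and scores are the traced layer-1 attention weights for one head.

```python
import numpy as np

def head_tag_attention(sequences, tag):
    """Average share of one head's layer-1 attention on tokens carrying `tag`.

    sequences : list of (scores, tags) pairs, where scores is an (n,) array of
                A^1_h(i) for one sequence and tags is a length-n list of labels.
    Following Eqn. 14, only sequences containing at least one `tag` token count.
    The division by scores.sum() makes the sketch robust to unnormalized scores.
    """
    shares = []
    for scores, tags in sequences:
        mask = np.array([t == tag for t in tags])
        if mask.any():
            shares.append(scores[mask].sum() / scores.sum())
    return float(np.mean(shares)) if shares else 0.0
```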

Discussion
In this work, we analyzed whether structured attention weights encode semantics in sentiment analysis tasks, using our proposed probing method LAT to trace attention through multiple layers of the Transformer. We demonstrated that the accumulated attention scores tended to favor words with greater semantic meaning, in this case, emotional meaning. We applied LAT to two tasks with similar semantics, and showed that our results generalize across both tasks and domains. We validated our results quantitatively with an emotion lexicon, and showed that our attention scores are highest for both highly positive and highly negative words, confirming our a priori hypothesis of a quadratic, "U-shaped" relationship. We also found some evidence for specialization of heads to emotional meaning. Although our attention tracing may seem "incomplete" in that it does not take the feed-forward layers and residual connections into account, this quadratic relationship was not shown by purely gradient-based importance, which suggests that there may be some utility in looking only at attention.
We believe that attention in its various forms (Luong et al., 2015; Vaswani et al., 2017) is not only effective for performance, but may also provide interpretable explanations of model behaviour. This may not happen with today's implementations; we may need to engineer inductive biases that constrain attention mechanisms in order to address the issues of identifiability that Jain and Wallace (2019) and others have pointed out. And perhaps attention should not be interpreted like gradient-based measures (see Fig. 5). This debate is not yet resolved, and we hope our contributions will be useful in informing future work on this topic.