Rethinking Self-Attention: Towards Interpretability in Neural Parsing

Attention mechanisms have improved the performance of NLP tasks while allowing models to remain explainable. Self-attention is currently widely used, however interpretability is difficult due to the numerous attention distributions. Recent work has shown that model representations can benefit from label-specific information, while facilitating interpretation of predictions. We introduce the Label Attention Layer: a new form of self-attention where attention heads represent labels. We test our novel layer by running constituency and dependency parsing experiments and show our new model obtains new state-of-the-art results for both tasks on both the Penn Treebank (PTB) and Chinese Treebank. Additionally, our model requires fewer self-attention layers compared to existing work. Finally, we find that the Label Attention heads learn relations between syntactic categories and show pathways to analyze errors.


Introduction
Attention mechanisms (Bahdanau et al., 2014;Luong et al., 2015) provide arguably explainable attention distributions that can help to interpret predictions. For example, for their machine translation predictions, Bahdanau et al. (2014) show a heat map of attention weights from source language words to target language words. Similarly, in transformer architectures (Vaswani et al., 2017), a selfattention head produces attention distributions from the input words to the same input words, as shown in the second row on the right side of Figure 1. However, self-attention mechanisms have multiple heads, making the combined outputs difficult to interpret.
Recent work in multi-label text classification (Xiao et al., 2019) and sequence labeling (Cui and Zhang, 2019) shows the efficiency and interpretability of label-specific representations. We introduce Figure 1: Comparison of the attention head architectures of our proposed Label Attention Layer and a Self-Attention Layer (Vaswani et al., 2017). The matrix X represents the input sentence "Select the person".
the Label Attention Layer: a modified version of self-attention, where each classification label corresponds to one or more attention heads. We project the output at the attention head level, rather than after aggregating all outputs, to preserve the source of head-specific information, thus allowing us to match labels to heads.
To test our proposed Label Attention Layer, we build upon the parser of Zhou and Zhao (2019) and establish a new state of the art for both constituency and dependency parsing, in both English and Chinese. We also release our pre-trained parsers, as well as our code to encourage experiments with the Label Attention Layer 1 .

Label Attention Layer
The self-attention mechanism of Vaswani et al. (2017) propagates information between the words of a sentence. Each resulting word representation q 1 q 2 q 3

Label Attention Layer
Q is a matrix of learned query vectors. There is no more Query Matrix W Q , and only one query vector is used per attention head. Each label is represented by one or more heads, and each head may represent one or more labels.
The query vectors q represent the attention weights from each head to dimensions of input vectors.
Computing the matrix of key vectors for the input. Each head has its own learned key matrix W K .

Example Input
The Label Attention Layer takes word vectors as input (red-contour matrix). In the example sentence, start and end symbols are omitted.
The blue box outputs a vector of attention weights from each head to the words. contains its own attention-weighted view of the sentence. We hypothesize that a word representation can be enhanced by including each label's attention-weighted view of the sentence, on top of the information obtained from self-attention.
The Label Attention Layer (LAL) is a novel, modified form of self-attention, where only one query vector is needed per attention head. Each classification label is represented by one or more attention heads, and this allows the model to learn label-specific views of the input sentence. Figure 1 shows a high-level comparison between our Label Attention Layer and self-attention.
We explain the architecture and intuition behind our proposed Label Attention Layer through the example application of parsing. Figure 2 shows one of the main differences between our Label Attention mechanism and selfattention: the absence of the Query matrix W Q . Instead, we have a learned matrix Q of query vectors representing each head. More formally, for the attention head i and an input matrix X of word vectors, we compute the corresponding attention weights vector a i as follows: where d is the dimension of query and key vectors, K i is the matrix of key vectors. Given a learned head-specific key matrix W K i , we compute K i as: Each attention head in our Label Attention layer has an attention vector, instead of an attention matrix as in self-attention. Consequently, we do not obtain a matrix of vectors, but a single vector that contains head-specific context information. This context vector corresponds to the green vector in Figure 3. We compute the context vector c i of head i as follows: where a i is the vector of attention weights in Equation 1, and V i is the matrix of value vectors. Given a learned head-specific value matrix W V i , we compute V i as: The context vector gets added to each individual input vector making for one residual connection per head, rather one for all heads, as in the yellow box in Figure 3. We project the resulting matrix of Computing the matrix of value vectors for the input. Each label has its own learned value matrix W V .
The green vector is an attention-weighted sum of value vectors. It represents the input sentence as viewed by the label.
The sentence vector is repeated and added to each input vector.
The yellow vectors are word representations conscious of the label's view of the sentence and the word they represent. They are label-specific word representations. word vectors to a lower dimension before normalizing. We then distribute the vectors computed by each label attention head, as shown in Figure 4.
We chose to assign as many attention heads to the Label Attention Layer as there are classification labels. As parsing labels (syntactic categories) are related, we did not apply an orthogonality loss to force the heads to learn separate information. We therefore expect an overlap when we match labels to heads. The values from each head are identifiable within the final word representation, as shown in the color-coded vectors in Figure 4.
The activation functions of the position-wise feed-forward layer make it difficult to follow the path of the contributions. Therefore we can remove the position-wise feed-forward layer, and compute the contributions from each label. We provide an example in Figure 6, where the contributions are computed using normalization and averaging. In this case, we are computing the contributions of each head to the span vector. The span representation for "the person" is computed following the method of Gaddy et al. (2018) and . However, forward and backward represen-tations are not formed by splitting the entire word vector at the middle, but rather by splitting each head-specific word vector at the middle.
In the example in Figure 6, we show averaging as one way of computing contributions, other functions, such as softmax, can be used. Another way of interpreting predictions is to look at the head-toword attention distributions, which are the output vectors in the computation in Figure 2.

Encoder
Our parser is an encoder-decoder model. The encoder has self-attention layers (Vaswani et al., 2017), preceding the Label Attention Layer. We follow the attention partition of , who show that separating content embeddings from position ones improves performance.
Sentences are pre-processed following Zhou and Zhao (2019). Trees are represented using a simplified Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994). In Zhou and Zhao (2019), two kinds of span representations are proposed: the division span and the joint span. We choose the joint span representation as it is the best-performing one in their experiments. Figure  5 shows how the example sentence in Figure 2 is represented.
The token representations for our model are a concatenation of content and position embeddings. The content embeddings are a sum of word and part-of-speech embeddings.

Constituency Parsing
For constituency parsing, span representations follow the definition of Gaddy et al. (2018) and . For a span starting at the i-th word and ending at the j-th word, the corresponding span vector s ij is computed as: where ← − h i and − → h i are respectively the backward and forward representation of the i-th word obtained by splitting its representation in half. An example of a span representation is shown in the middle of Figure 6.
The score vector for the span is obtained by applying a one-layer feed-forward layer:  Figure 5: Parsing representations of the example sentence in Figure 2.
where LN is Layer Normalization, and W 1 , W 2 , b 1 and b 2 are learned parameters. For the l-th syntactic category, the corresponding score s(i, j, l) is then the l-th value in the S(i, j) vector. Consequently, the score of a constituency parse tree T is the sum of all of the scores of its spans and their syntactic categories: We then use a CKY-style algorithm Gaddy et al., 2018) to find the highest scoring treeT . The model is trained to find the correct parse tree T * , such that for all trees T , the following margin constraint is satisfied: where ∆ is the Hamming loss on labeled spans. The corresponding loss function is the hinge loss:

Dependency Parsing
We use the biaffine attention mechanism (Dozat and Manning, 2016) to compute a probability distribution for the dependency head of each word. The child-parent score α ij for the j-th word to be the head of the i-th word is: is the dependent representation of the i-th word obtained by putting its representation h i through a one-layer perceptron. Likewise, h (h) j is the head representation of the j-th word obtained by putting its representation h j through a separate one-layer perceptron. The matrices W, U and V are learned parameters.
The model trains on dependency parsing by minimizing the negative likelihood of the correct dependency tree. The loss function is cross-entropy: where h i is the correct head for dependent d i , P (h i |d i ) is the probability that h i is the head of d i , and P (l i |d i , h i ) is the probability of the correct dependency label l i for the child-parent pair Forward and backward representations each contain one half of the head-specific information of the word they represent.
Information on words out of the span is removed. For instance, the left subtraction removes information on "Select" from the representation of "Select the person".

#1 #2 #3 #4
Computing the contributions from each head to the span vector: we sum values from the same head together and then normalize and average.

Computing Head Contributions h person -h Select
Prediction: Noun Phrase (NP)

Heads
Here, heads #1 and #2 have the highest contributions to predicting "the person" as a noun phrase.

Decoder
The model jointly trains on constituency and dependency parsing by minimizing the sum of the constituency and dependency losses: The decoder is a CKY-style (Kasami, 1966;Younger, 1967;Cocke, 1969; algorithm, modified by Zhou and Zhao (2019) to include dependency scores.

Experiments
We evaluate our model on the English Penn Treebank (PTB) (Marcus et al., 1993) and on the Chinese Treebank (CTB) (Xue et al., 2005). We use the Stanford tagger (Toutanova et al., 2003) to predict part-of-speech tags and follow standard data splits.
Following standard practice, we use the EVALB algorithm (Sekine and Collins, 1997) for constituency parsing, and report results without punctuation for dependency parsing.

Setup
In our English-language experiments, the Label Attention Layer has 112 heads: one per syntactic category. However, this is an experimental choice, as the model is not designed to have a one-on-one correspondence between attention heads and syntactic categories. The Chinese Treebank is a smaller dataset, and therefore we use 64 heads in Chineselanguage experiments, even though the number of Chinese syntactic categories is much higher. For both languages, the query, key and value vectors, as well as the output vectors of each label attention head, have 128 dimensions, as determined through short parameter-tuning experiments. For the dependency and span scores, we use the same hyperparameters as Zhou and Zhao (2019). We use the large cased pre-trained XLNet  as our embedding model for our English-language experiments, and a base pre-trained BERT (Devlin et al., 2018) for Chinese.
We try English-language parsers with 2, 3, 4, 6, 8, 12 and 16 self-attention layers. Our parsers with 3 and 4 self-attention layers are tied in terms of F1 score, and sum of UAS and LAS scores. The results of our fine-tuning experiments are in the appendix. We decide to use 3 self-attention layers for all the following experiments, for lower computational complexity.

Ablation Study
As shown in Figure 6, we can compute the contributions from label attention heads only if there is no position-wise feed-forward layer. Residual dropout  in self-attention applies to the aggregated outputs from all heads. In label attention, residual dropout applies separately to the output of each head, and therefore can cancel out parts of the head contributions. We investigate the impact of removing these two components from the LAL. We show the results on the PTB dataset of our ablation study on Residual Dropout and Positionwise Feed-forward Layer in Table 1. We use the same residual dropout probability as Zhou and Zhao (2019). When removing the position-wise feed-forward layer and keeping residual dropout, we observe only a slight decrease in overall performance, as shown in the second row. There is therefore no significant loss in performance in exchange for the interpretability of the attention heads.
We observe an increase in performance when removing residual dropout only. This suggests that all head contributions are important for performance, and that we were likely over-regularizing.
Finally, removing both position-wise feedforward layer and residual dropout brings about a noticeable decrease in performance. We continue our experiments without residual dropout.

Comparison with Self-Attention
The two main architecture novelties of our proposed Label Attention Layer are the learned Query Vectors that represent labels and replace the Query Matrix in self-attention, and the Concatenation of the outputs of each attention head that replaces the Matrix Projection in self-attention.
In this subsection, we evaluate whether our proposed architecture novelties bring about perfor-  Figure 7) has a Query Matrix like self-attention, and concatenates attention head outputs like Label Attention.
(2) Ablation of Concatenation: the second model (right of Figure 7) has a Query Vector like Label Attention, and applies matrix projection to all head outputs like self-attention.
(3) Ablation of Query Vectors and Concatenation: the third model (right of Figure 1) has a 112-head self-attention layer.
The results of our experiments are in Table 2. The second row shows that, even though query matrices employ more parameters and computation than query vectors, replacing query vectors by query matrices decreases performance. There is a similar decrease in performance when removing concatenation as well, as shown in the last row. This suggests that our Label Attention Layer learns meaningful representations in its query vectors, and that head-to-word attention distributions are more helpful to performance than query matrices and word-to-word attention distributions.
In self-attention, the output vector is a matrix  projection of the concatenation of head outputs. In Label Attention, the head outputs do not interact through matrix projection, but are concatenated. The third and fourth rows of Table 2 show that there is a significant decrease in performance when replacing concatenation with the matrix projection. This decrease suggests that the model benefits from having one residual connection per attention head, rather than one for all attention heads, and from separating head-specific information in word representations. In particular, the last row shows that replacing our LAL with a self-attention layer with an equal number of attention heads decreases performance: the difference between the performance of the first row and the last row is due to the Label Attention Layer's architecture novelties.

English and Chinese Results
Our best-performing English-language parser does not have residual dropout, but has a position-wise feed-forward layer. We train Chinese-language parsers using the same configuration. The Chinese Treebank has two data splits for the training, development and testing sets: one for Constituency (Liu and Zhang, 2017b) and one for Dependency parsing (Zhang and Clark, 2008).
Finally, we compare our results with the state of the art in constituency and dependency parsing in both English and Chinese. We show our Constituency Parsing results in Table 3, and our Dependency Parsing results in Table 4. Our LAL parser establishes new state-of-the-art results in both languages, improving significantly in dependency parsing.

Interpreting Head Contributions
We follow the method in Figure 6 to identify which attention heads contribute to predictions. We collect the span vectors from the Penn Treebank test set, and we use our LAL parser with no positionwise feed-forward layer for predictions. Figure 8 displays the bar charts for the three most common syntactic categories: Noun Phrases (NP), Verb Phrases (VP) and Sentences (S). We notice several heads explain each predicted category.
We collect statistics about the top-contributing heads for each predicted category. Out of the NP spans, 44.9% get their top contribution from head 35, 13.0% from head 47, and 7.3% from head 0. The top-contributing heads for VP spans are heads 31 (61.1%), 111 (13.2%), and 71 (7.5%). As for S spans, the top-contributing heads are 52 (48.6%), 31 (22.8%), 35 (6.9%), and 111 (5.2%). We see that S spans share top-contributing heads with VP spans (heads 31 and 111), and NP spans (head 35). The similarities reflect the relations between the syntactic categories. In this case, our Label Attention Layer learned the rule S → NP VP.
We see that head 52 is unique to S spans. Actually, 64.7% of spans with head 52 as the highest contribution are S spans. Therefore our model has learned to represent the label S using head 52.
All of the aforementioned heads are represented in Figure 8. We see that heads that have low contributions for NP spans, peak in contribution for VP

Error Analysis
Head-to-Word Attention. We analyze prediction errors from the PTB test set. One example is the span "Fed Ready to Inject Big Funds", predicted as NP but labelled as S. We trace back the attention weights for each word, and find that, out of the 9 top-contributing heads, only 2 focus their attention on the root verb of the sentence (Inject), while 4 focus on a noun (Funds), resulting in a noun phrase prediction. We notice similar patterns in other wrongly predicted spans, suggesting that forcing the attention distribution to focus on a relevant word might correct these errors.
We analyze wrongly predicted spans by their true category. Out of the 53 spans labelled as NP but not predicted as such, we still see the top-contributing head for 36 of them is either head 35 or 47, both top-contributing heads of spans predicted as NP. Likewise, for the 193 spans labelled as S but not predicted as such, the top-contributing head of 141 of them is one of the four top-contributing heads for spans predicted as S. This suggests that a stronger prediction link to the label attention heads, through a loss function for instance, may increase the performance.

Related Work
Since their introduction in Machine Translation, attention mechanisms (Bahdanau et al., 2014;Luong et al., 2015) have been extended to other tasks, such as text classification (Yang et al., 2016), natural language inference (Chen et al., 2016) and language modeling (Salton et al., 2017).
Self-attention and transformer architectures (Vaswani et al., 2017) are now the state of the art in language understanding (Devlin et al., 2018;Yang et al., 2019), extractive summarization (Liu, 2019), semantic role labeling (Strubell et al., 2018) and machine translation for low-resource languages (Rikters, 2018;. While attention mechanisms can provide explanations for model predictions, Serrano and Smith (2019) challenge that assumption and find that attention weights only noisily predict overall importance with regard to the model. Jain and Wallace (2019) find that attention distributions rarely correlate with feature importance weights. However, Wiegreffe and Pinter (2019) show through alternative tests that prior work does not discredit the usefulness of attention for interpretability. Xiao et al. (2019) introduce the Label-Specific Attention Network (LSAN) for multi-label document classification. They use label descriptions to compute attention scores for words, and follow the self-attention of Lin et al. (2017). Cui and Zhang (2019) introduce a Label Attention Inference Layer for sequence labeling, which uses the self-attention of Vaswani et al. (2017). In this case, the key and value vectors are learned label embeddings, and the query vectors are hidden vectors obtained from a Bi-LSTM encoder. Our work is unrelated to these two papers, as they were published towards the end of our project.

Conclusions
In this paper, we introduce a new form of selfattention: the Label Attention Layer. In our proposed architecture, attention heads represent labels. We incorporate our Label Attention Layer into the HPSG parser (Zhou and Zhao, 2019)  We perform ablation studies that show the Query Vector learned by our Label Attention Layer outperform the self-attention Query Matrix. Since we have only one learned vector as query, rather than a matrix, we can significantly reduce the number of parameters per attention head. Finally, our Label Attention heads learn the relations between the syntactic categories, as we show by computing contributions from each attention head to span vectors. We show how the heads also help to analyze prediction errors, and suggest methods to correct them.