Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention

Self-Attention Networks (SANs) are an integral part of successful neural architectures such as the Transformer (Vaswani et al., 2017), and thus of pretrained language models such as BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020). Training SANs on a task or pretraining them on language modeling requires large amounts of data and compute. We search for modifications to SANs that enable faster learning, i.e., higher accuracies after fewer update steps, and investigate three candidates: direct position interactions, learnable temperature, and convoluted attention. Evaluating them on part-of-speech tagging, we find that direct position interactions are an alternative to position embeddings, and that convoluted attention has the potential to speed up the learning process.


Introduction
Self-attention mechanisms are at the core of successful neural network architectures in natural language processing (e.g., the Transformer by Vaswani et al. (2017)). Compared to recurrent neural networks, they do not have an inherent sequential bias and allow the network to transfer information across a sequence of length t in a constant number of steps. Typically, a self-attention layer consists of multiple attention heads and is itself part of a more sophisticated layer such as a Transformer encoder block.
We propose three minor modifications to the self-attention mechanism: (1) Position embeddings (Collobert et al., 2011; Vaswani et al., 2017) are used to inject positional information into SANs. We argue that position interactions can be modeled more directly than with separately learned position embeddings and propose to replace the embeddings with a direct position interaction matrix. (2) We hypothesize that the spiky distributions generated by the softmax function within an attention head hinder the network from considering the broader sentence context effectively. We therefore introduce additional scalar parameters, a learnable temperature, that can support the network in using the context more effectively.
(3) Convoluted Attention: attention matrices have been found to exhibit regular patterns (Clark et al., 2019; Kovaleva et al., 2019). A convolution that post-processes the attention matrix allows the network to detect such patterns and subsequently reinforce or weaken attention scores.
We perform experiments on part-of-speech (PoS) tagging. We argue that a PoS model can only be successful for ambiguous and out-of-vocabulary tokens if it carefully considers and processes the context. Thus we consider PoS tagging a suitable task to probe whether our modifications to the attention matrix enable more efficient learning. We perform experiments on the Penn Treebank (PTB) and on 47 languages of Universal Dependencies (UD). In short, our findings are: (i) Modeling absolute and relative position information through direct interaction matrices is a feasible alternative to position embeddings. (ii) Learnable temperature has almost no effect besides a small increase for out-of-vocabulary tokens. (iii) Convoluted attention achieves higher accuracy after fewer epochs (more efficient learning) on PTB and higher performance on UD. While the results for convoluted attention are promising, we are aware that evaluating only on PoS tagging is a restricted setting; we therefore plan to extend this study in future work.

Model Architecture
To study our proposed modifications to self-attention, we use a simple architectural setup; see Table 1a. Following embedding lookups for words and positions, we deploy multiple layers of self-attention blocks and subsequently a softmax layer to obtain the final PoS predictions. Our objective function is categorical cross-entropy. Character information is essential for PoS tagging (dos Santos and Zadrozny, 2014). To incorporate it, we follow Yu et al. (2017) and use convolutional neural networks together with max-pooling to obtain a character-level representation for each word. We add (or concatenate) position embeddings to word embeddings and subsequently concatenate the character-level word representation. We use a residual connection from the beginning to the end of the network and around each attention layer. See Table 1a and Table 1b for more details on the overall architecture and hyperparameters, and Table 1c for the number of parameters. We used common hyperparameters and did not tune them for higher performance.
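The following PyTorch sketch illustrates this setup. It is a minimal reconstruction, not the released implementation: embedding sizes, number of heads, and layer counts are illustrative placeholders rather than the values in Table 1b, and torch.nn.MultiheadAttention stands in for the attention blocks whose modifications we describe in the next sections.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level word representations via 1d convolution and max-pooling
    (following dos Santos and Zadrozny, 2014; Yu et al., 2017)."""
    def __init__(self, n_chars, char_dim=32, out_dim=64, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel, padding=1)

    def forward(self, chars):                                   # chars: (batch, words, max_chars)
        b, t, c = chars.shape
        x = self.emb(chars).view(b * t, c, -1).transpose(1, 2)  # (b*t, char_dim, max_chars)
        x = torch.relu(self.conv(x)).max(dim=-1).values         # max-pool over character positions
        return x.view(b, t, -1)                                 # (batch, words, out_dim)

class PoSTagger(nn.Module):
    """Word + position embeddings (added, i.e., SAN+PE[add]), concatenated with character
    features, followed by stacked self-attention layers and a softmax output layer."""
    def __init__(self, vocab, n_chars, n_tags, d=128, char_dim=64, max_len=512, n_layers=3, n_heads=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, d)
        self.pos_emb = nn.Embedding(max_len, d)
        self.char_cnn = CharCNN(n_chars, out_dim=char_dim)
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(d + char_dim, n_heads, batch_first=True) for _ in range(n_layers))
        self.out = nn.Linear(d + char_dim, n_tags)

    def forward(self, words, chars):
        t = words.size(1)
        pos = torch.arange(t, device=words.device)
        h = torch.cat([self.word_emb(words) + self.pos_emb(pos), self.char_cnn(chars)], dim=-1)
        h0 = h
        for attn in self.attn_layers:
            a, _ = attn(h, h, h)
            h = h + a                                 # residual around each attention layer
        h = h + h0                                    # residual from the beginning to the end of the network
        return self.out(h)                            # logits; trained with categorical cross-entropy
```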

Self-Attention
In this section we describe self-attention (Vaswani et al., 2017), for which we propose modifications in the following sections. We loosely follow the notation of Shaw et al. (2018) and define self-attention as a function $\mathrm{att}\colon \mathbb{R}^{t \times d} \rightarrow \mathbb{R}^{t \times d}$ with
$$\mathrm{att}(X) = \sigma(A)\,X W_v, \qquad A = \frac{1}{\sqrt{d}}\,(X W_q)(X W_k)^\top,$$
where $A \in \mathbb{R}^{t \times t}$ is the attention matrix and the softmax $\sigma$ is applied along the horizontal axis. One self-attention layer consists of the concatenation of multiple attention heads. We call the model that adds (resp. concatenates) position embeddings to word embeddings SAN+PE[add] (resp. SAN+PE[con]).
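A minimal sketch of a single attention head in this notation (PyTorch; the class name and dimensions are illustrative):

```python
import math
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One self-attention head: att(X) = softmax(A) X W_v with A = (X W_q)(X W_k)^T / sqrt(d)."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)
        self.d = d_head

    def forward(self, X):                                                    # X: (batch, t, d_model)
        A = self.W_q(X) @ self.W_k(X).transpose(-2, -1) / math.sqrt(self.d)  # (batch, t, t)
        return torch.softmax(A, dim=-1) @ self.W_v(X)                        # softmax along the horizontal axis
```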

Direct Position Interactions
It is well known that SANs are invariant with respect to reorderings of the input. To counteract this, position embeddings that are added or concatenated to the word embeddings have been used (Collobert et al., 2011; Bahdanau et al., 2015). When adding position embeddings, parameters in the form of a position embedding matrix $P \in \mathbb{R}^{t \times d}$ are added to the model, and the corresponding position embeddings are added to the token embeddings in the first layer. More specifically, $A$ in the first layer becomes
$$A_{ij} = \frac{1}{\sqrt{d}}\,\big((x_i + p_i) W_q\big)\big((x_j + p_j) W_k\big)^\top,$$
which expands into word-word, word-position, position-word, and position-position interaction terms. We now propose to omit the word-position terms and to replace the position-position term with a matrix $A^p \in \mathbb{R}^{t \times t}$. The values $A^p_{ij}$ are learnable scalars that directly model absolute position interactions. Analogously, we can introduce relative position interactions by replacing the position-position term with a matrix $A^r \in \mathbb{R}^{t \times t}$, where $A^r_{ij} = a^r_{i-j+t}$ and $a^r \in \mathbb{R}^{2t}$ are the learnable parameters. We refer to these modifications as SAN+P and SAN+R, respectively. Absolute and relative position interactions can easily be combined by computing $A^{P+R} = A + A^p + A^r$, which we call SAN+P+R. Analogously to position embeddings, which are only added in the first layer, we add $A^p$ or $A^r$ only to the attention heads in the first layer. Note that the parameters $A^p$ and $A^r$ are not shared across attention heads.
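A sketch of how these position-interaction terms could be added to the first-layer attention logits follows (PyTorch; the module name and zero initialization are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class DirectPositionBias(nn.Module):
    """Per-head learnable position interactions: an absolute matrix A^p (t x t, SAN+P) and a
    relative vector a^r with A^r[i, j] = a^r[i - j + t] (SAN+R), added to the attention logits A."""
    def __init__(self, t, use_absolute=True, use_relative=True):
        super().__init__()
        self.t = t
        self.A_p = nn.Parameter(torch.zeros(t, t)) if use_absolute else None   # absolute interactions
        self.a_r = nn.Parameter(torch.zeros(2 * t)) if use_relative else None  # relative interactions

    def forward(self, A):                                        # A: (batch, t, t) first-layer logits
        if self.A_p is not None:
            A = A + self.A_p
        if self.a_r is not None:
            i = torch.arange(self.t, device=A.device).unsqueeze(1)   # row (query) positions
            j = torch.arange(self.t, device=A.device).unsqueeze(0)   # column (key) positions
            A = A + self.a_r[i - j + self.t]                         # gathers a^r[i - j + t] for every cell
        return A
```

Using both terms together corresponds to SAN+P+R; a separate instance would be created per attention head, since the parameters are not shared across heads.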

Learnable Temperature
We propose to multiply each $W_i$ with a trainable scalar weight $\gamma_i$ for $i \in \{k, v, q\}$. We refer to this modification as learnable temperature, as $\gamma_k \gamma_q$ can be interpreted as a temperature of the softmax function used in attention. While it is related to normalization techniques such as batch, layer, or weight normalization (Lei Ba et al., 2016; Salimans and Kingma, 2016; Ioffe and Szegedy, 2015), we only add a single learnable parameter per weight matrix and do not perform normalization, which often complicates the objective function. We hypothesize that adding a learnable scalar $\gamma_i$ to scale the weight matrices helps the network learn faster.

Convoluted Attention
We propose to process the matrix $\sigma(A)$ with convolutional layers, i.e., we compute $\hat{A} = \mathrm{conv}(\sigma(A))$. We experimented with applying the convolution before the softmax, but this led to worse results. Note that after the convolution the attention scores are no longer normalized. We apply both one- and two-dimensional convolutions (see Figure 1). This allows attention to reinforce neighborhood patterns, as identified, e.g., by Clark et al. (2019) and Kovaleva et al. (2019). Consider a sequence $w_1, w_2, w_3$ and assume attention weights are high for $w_1$ and $w_3$ but low for $w_2$; a convolutional filter can learn such a pattern and increase the attention weight for $w_2$ if this is beneficial for performance. For 1d convolution we use t convolutional filters per attention head to preserve the shape of the matrix. For 2d convolution we use one filter per attention head, which can be interpreted as a kind of smoothing over the attention matrix. We use filter width 3 in both cases.
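A sketch combining both modifications in one attention head, i.e., learnable temperature scalars on $W_q$, $W_k$, $W_v$ and a 2d convolution over the post-softmax attention matrix (PyTorch; an illustrative reconstruction under the stated assumptions, not the authors' code):

```python
import math
import torch
import torch.nn as nn

class ConvTempAttentionHead(nn.Module):
    """Attention head with (i) learnable temperature: trainable scalars gamma_q, gamma_k, gamma_v
    scaling the projections, and (ii) convoluted attention: one 3x3 convolutional filter per head
    applied to softmax(A) after the softmax."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)
        self.gamma = nn.ParameterDict(
            {name: nn.Parameter(torch.ones(1)) for name in ("q", "k", "v")})
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # 9 weights + 1 bias = 10 parameters per head
        self.d = d_head

    def forward(self, X):                                        # X: (batch, t, d_model)
        q = self.gamma["q"] * self.W_q(X)                        # gamma_q * X W_q
        k = self.gamma["k"] * self.W_k(X)
        v = self.gamma["v"] * self.W_v(X)
        A = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)  # (batch, t, t)
        A = self.conv(A.unsqueeze(1)).squeeze(1)                 # post-softmax conv; rows no longer sum to 1
        return A @ v
```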

Data
PTB. We work on the WSJ section of the Penn Treebank (PTB) (Marcus et al., 1993) with the usual data split (train: 0-18, dev: 19-21, test: 22-24). We report accuracy across all words, out-of-vocabulary (OOV) words, and ambiguous words. We consider a word ambiguous if it has more than one unique PoS tag in the training data. We report mean and standard deviation (in subscript) across three random seeds. For pretrained word embeddings we use fastText subword embeddings (Bojanowski et al., 2017).
UD. We use version 2.2 of the Universal Dependencies as used in the CoNLL 2018 shared task (Zeman et al., 2018). We consider treebanks that have train, development, and test data and for which results are reported by Smith et al. (2018). This results in 47 treebanks.

Results
The plot in Table 2a shows development accuracy on PTB over training time. Convoluted attention yields a much steeper learning curve: accuracy goes up more quickly and the model seems to have converged after just 5 epochs. Learnable temperature does not have any visible effect on the learning curve, and SAN+P+R seems to have a slightly steeper learning curve. After training for more than 10 epochs, all models converge to a similar performance. This is expected, as our modifications do not make the model more expressive; they target more efficient learning.
Table 2b shows test results for PTB. All our modifications achieve performance comparable to SAN+PE[add]. Convoluted attention even achieves a slight performance improvement. Similarly, learnable temperature has slightly higher performance for OOV words. Replacing position embeddings with absolute or relative direct position interactions is feasible and yields similar performance; using only relative position interactions is slightly worse. It is surprising that +R reaches almost the same performance as +PE[add] with far fewer parameters. The combination SAN+P+R does not work better than just using SAN+P. Overall we reach a reasonable performance of almost 97%.
Both 1d and 2d convolution perform similarly. +Conv2d yields performance improvements with only 10 additional parameters per attention head (see Table 1c). This indicates that our hypothesis that convolutions are suitable to reinforce patterns in the attention matrix is reasonable. Figure 2 shows the learning curves for PTB and UD for the first training steps (around 1 epoch for PTB). Convoluted attention exhibits a somewhat steeper learning curve from the very beginning, but the overall effect is more visible in Table 2a.

Related Work
Raffel et al. (2020) proposed to model relative positions with scalar values, an idea also investigated by Schmitt et al. (2020). This approach is similar to our SAN+R. Contemporary to this submission, Ke et al. (2020) proposed TUPE, which is similar to SAN+P. They also find it to be a feasible alternative to position embeddings and report slight performance increases. In contrast to weight normalization (Salimans and Kingma, 2016), a method related to learnable temperature, we do not normalize the weight matrices. Instead we only add a learnable scalar parameter; we observed that normalizing the weights actually harms performance. Lin et al. (2018) introduced a self-adaptive temperature, but focused on parametrizing the temperature of timestep t using the activations from timestep t-1. Contemporary to this work, Henry et al. (2020) proposed query-key normalization in Transformers.
There is a range of work trying to combine attention with convolution (Yin and Schütze, 2018; Yu et al., 2018). We are not aware of any work that applies convolution directly to attention weights.

Conclusion
We conclude that position embeddings can be replaced with direct position interactions. Learnable temperature has almost no effect. Convoluted attention speeds up learning on PTB and yields better results on UD. We are aware that this paper is a small study with limited validity, as it considers only one task. Given that convoluted attention yielded promising results, we plan to extend this line of experiments to additional tasks and architectures in future work. Our code is available.