Adaptive Attention Span in Transformers

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in Transformers while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character-level language modeling, where we achieve state-of-the-art performance on text8 and enwik8 by using a maximum context of 8k characters.


Introduction
Language models are at the core of many NLP applications, such as machine translation and dialogue. Recently, much progress has been made with a new neural network architecture called the Transformer (Vaswani et al., 2017). Part of its success is due to its ability to capture long-term dependencies. This is achieved by taking long sequences as inputs and explicitly computing the relations between all tokens via a mechanism called the "self-attention" layer (Al-Rfou et al., 2019).
While this layer allows information to propagate across long distances, it has a computational and memory cost that scales quadratically with the size of the input sequence. As a consequence, Transformers hardly scale to sequences of more than a thousand tokens. This is particularly problematic in the case of character-level language modeling, where dependencies are often spread over a few thousand time steps.
In this work, we propose an alternative to the self-attention layer to reduce the computational burden of a Transformer. Our layer learns its optimal context size, resulting in a network where each attention layer gathers information on its own context. In practice, we observe that this leads to Transformers with small contexts in the low-level layers and very large ones in the last layers. With this modification, we are able to scale input sequences to more than 8k tokens with no loss of performance and no additional computational or memory cost. We validate our approach on the task of character-level language modeling, where we reach state-of-the-art performance while reducing the number of FLOPS.

Sequential transformer network
Language modeling is the problem of assigning a probability to a sequence of tokens $(w_1, \dots, w_T)$:
$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_{t-1}, \dots, w_1).$$
Recent progress was made with a new autoregressive model called the Sequential Transformer (Vaswani et al., 2017). A Transformer is made of a sequence of layers, each composed of a block of parallel self-attention layers followed by a feedforward network. We refer to Vaswani et al. (2017) for the details of the structure. In this paper, we make a couple of modifications to the Transformer model: we use the relative position embeddings of Shaw et al. (2018) and the caching mechanism of Dai et al. (2019) to speed up training and testing.
Self-attention layer. A core mechanism of a Transformer network is the self-attention layer, which consists of multiple attention heads working in parallel. Each attention head applies the attention mechanism of Bahdanau et al. (2015) to its own input. Given a token $t$ in a sequence, the head first computes similarities with its past, i.e., any token $r$ in the span $[t - S, t)$:
$$s_{tr} = x_t^\top W_q^\top \left( W_k x_r + p_{t-r} \right), \tag{1}$$
where $W_k$ and $W_q$ are the "key" and "query" matrices, and $p_{t-r}$ is the relative position embedding. The attention weights are then obtained by applying a softmax function to these similarities:
$$a_{tr} = \frac{\exp(s_{tr})}{\sum_{q=t-S}^{t-1} \exp(s_{tq})}. \tag{2}$$
Finally, the head outputs a vector $y_t$ by taking the average of the past representations weighted by their attention weights:
$$y_t = \sum_{r=t-S}^{t-1} a_{tr} W_v x_r, \tag{3}$$
where $W_v$ is called the "value" matrix. Outputs from different heads are then concatenated together and multiplied by an output matrix $W_o$ before being fed to the next layer. Similar to the memory access mechanisms of Sukhbaatar et al. (2015), the head pulls information from the past to update the current token representation. Repeating this mechanism in consecutive layers allows information to flow over long distances. However, for each input token, each attention head scales linearly in memory and time with the context size, or attention span. There are typically 12 layers with 8 heads each that process 512 tokens simultaneously. This drastically limits the maximum attention span used in Transformers.
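To make the three steps of a single head concrete (similarities, softmax, weighted average), here is a minimal pure-Python sketch. It assumes the similarity is the dot product between the query and the sum of the key and the relative position embedding; the function names, the `pos_emb` lookup indexed by distance, and the requirement that t >= 1 (so the past is non-empty) are illustrative assumptions, not the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(xs, t, S, W_q, W_k, W_v, pos_emb):
    """Return (y_t, weights) for one head attending to tokens in [t - S, t).

    xs is the list of input vectors. pos_emb[d] is the embedding for
    relative distance d = t - r (hypothetical layout, 1 <= d <= S).
    Assumes t >= 1 so that at least one past token exists.
    """
    past = range(max(0, t - S), t)
    q = matvec(W_q, xs[t])
    scores = []
    for r in past:
        k = matvec(W_k, xs[r])
        key_plus_pos = [ki + pi for ki, pi in zip(k, pos_emb[t - r])]
        scores.append(sum(qi * kp for qi, kp in zip(q, key_plus_pos)))
    weights = softmax(scores)
    values = [matvec(W_v, xs[r]) for r in past]
    dim = len(values[0])
    y = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return y, weights
```

The loop over `past` is what makes the per-token cost linear in the span S: one similarity and one value vector are needed for every past position.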

Adaptive attention span
Each attention head of a Transformer shares the same attention span S. This assumes that every head requires the same span to form its representation. As shown in Figure 1, this assumption does not hold in the context of character level language modeling: some heads (e.g., Head A) focus on the recent history, while others take information from the whole available context (e.g., Head B). In this section, we propose to learn the attention span of each head independently to reduce their computational and memory cost.
For each head, we add a masking function to control the span of the attention. A masking function is a non-increasing function that maps a distance to a value in $[0, 1]$. We take the following soft masking function $m_z$, parametrized by a real value $z$ in $[0, S]$:
$$m_z(x) = \min\left[ \max\left[ \frac{1}{R} (R + z - x), 0 \right], 1 \right], \tag{4}$$
where $R$ is a hyper-parameter that controls its softness. This soft masking function is inspired by Jernite et al. (2017). In Figure 2, we show the shape of this piecewise function as a function of the distance. The attention weights from Eq. 2 are then computed on the masked span, i.e.,
$$a_{tr} = \frac{m_z(t - r) \exp(s_{tr})}{\sum_{q=t-S}^{t-1} m_z(t - q) \exp(s_{tq})}. \tag{5}$$
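As an illustration of how the soft mask reshapes the attention weights, the following sketch assumes the mask is the clamped linear ramp min(max((R + z - x) / R, 0), 1), which equals 1 up to distance z and decays to 0 at distance z + R. The function names are hypothetical, and the sketch assumes at least one position has a non-zero mask.

```python
import math

def soft_mask(distance, z, R):
    """Clamped linear ramp: 1 for distance <= z, 0 for distance >= z + R."""
    return min(max((R + z - distance) / R, 0.0), 1.0)

def masked_attention_weights(scores, distances, z, R):
    """Attention weights proportional to mask(distance) * exp(score).

    scores[i] is the similarity to a past token and distances[i] is how far
    back that token lies. Assumes the masked sum is strictly positive.
    """
    top = max(scores)  # subtract the max for numerical stability
    unnorm = [soft_mask(d, z, R) * math.exp(s - top)
              for s, d in zip(scores, distances)]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

With z = 20 and R = 32, a token at distance 36 receives half of its unmasked weight, and any token at distance 52 or more is ignored entirely, so it never needs to be computed or stored.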
We add an $\ell_1$ penalization on the parameters $z_i$ for each attention head $i$ of the model to the loss function:
$$L = -\log P(w_1, \dots, w_T) + \frac{\lambda}{M} \sum_i z_i, \tag{6}$$
where $\lambda > 0$ is the regularization hyperparameter, and $M$ is the number of heads in each layer. Our formulation is differentiable in the parameters $z_i$ and we learn them jointly with the rest of the model.
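The penalized objective can be sketched as follows, assuming the penalty is lambda / M times the sum of the span parameters over all heads of the model. The function and argument names are hypothetical.

```python
def penalized_loss(neg_log_likelihood, spans, lam, heads_per_layer):
    """Language modeling loss plus an L1 penalty on the span parameters.

    spans holds one z_i per attention head of the whole model.
    heads_per_layer is M, the number of heads in each layer.
    Since each z_i is non-negative, abs() only guards against bad input.
    """
    penalty = lam / heads_per_layer * sum(abs(z) for z in spans)
    return neg_log_likelihood + penalty
```

Because the penalty is linear in each z_i, every head feels the same constant pressure (lambda / M) to shrink its span, and a span only grows when the gain in likelihood outweighs that pressure.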

Dynamic attention span.
As an extension, we consider a dynamic computation approach (Graves, 2016) where the attention span dynamically changes based on the current input (Luong et al., 2015; Shu and Nakayama, 2017). At a time step $t$, the span parameter $z_t$ of an attention head is then a function of the input, parametrized by a vector $v$ and a scalar $b$, i.e., $z_t = S \sigma(v^\top x_t + b)$. We penalize $z_t$ in the same way as before and learn the parameters $v$, $b$ jointly with the rest of the parameters.
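A minimal sketch of the dynamic span computation, taking sigma to be the logistic sigmoid; the function name is hypothetical.

```python
import math

def dynamic_span(x_t, v, b, S):
    """Span parameter z_t = S * sigmoid(v . x_t + b) for the current input x_t."""
    pre_activation = sum(vi * xi for vi, xi in zip(v, x_t)) + b
    return S / (1.0 + math.exp(-pre_activation))
```

The sigmoid keeps z_t inside (0, S). For example, with S = 1024 and a zero input, a bias of 0 gives a span of 512, while a bias of -4 gives a span of roughly 18, so a negative bias makes the initial spans small.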
Experiments
In this section, we evaluate the impact of our adaptive attention mechanism in the experimental setting of Al-Rfou et al. (2019). A single set of position embeddings $p_t$ is shared across all the heads.
In adaptive-span models, we reparameterize the span parameter $z$ as $z = S z'$, where $z' \in [0, 1]$ is initialized to 0. In dynamic-span models, the bias term $b$ is initialized to −4 to make initial spans small. We set the hyperparameters $\lambda = 2 \times 10^{-6}$ and $R = 32$ for both types of models, except that $\lambda$ is reduced to $0.5 \times 10^{-6}$ when $S = 8192$ because $z$ was not growing beyond 4000.
We use Adagrad with a batch size of 64, a fixed learning rate of 0.07, and 32k warm-up steps. Our warm-up strategy differs from Vaswani et al. (2017): we linearly increase the learning rate from zero to the final learning rate. Gradients of each module are clipped at 0.03 for better stability. At train time, we use a block of 512 consecutive characters and compute the loss and gradient for each of those 512 characters.
In small models, we apply dropout with a rate of 0.3 to the attention and the feedforward ReLU activations. We train small models for 600K steps (900K steps when S = 8192), which takes about 2 to 3 days on 8 V100 GPUs depending on the attention span limit. Large models are trained with a dropout rate of 0.4 until the validation performance stops improving (250K steps for text8 and 150K steps for enwik8), and then further trained for 20K steps with a learning rate divided by 10.
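The linear warm-up described above can be sketched as a function of the step count. This covers only the schedule, not the optimizer or the gradient clipping, and the function name and default arguments are illustrative.

```python
def warmup_lr(step, final_lr=0.07, warmup_steps=32_000):
    """Linearly increase the learning rate from zero, then hold it fixed."""
    return final_lr * min(1.0, step / warmup_steps)
```

Halfway through the warm-up (step 16,000) the rate is half the final value, and from step 32,000 onward it stays at 0.07.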
Results. In Table 1, we compare our sequential Transformer with the adaptive spans ("Adaptive-Span") of Sec. 2.2 to the models of Al-Rfou et al. (2019) and Dai et al. (2019). For small models, our model outperforms the other Transformers by 0.07 bpc while significantly reducing the memory usage for large attention spans. Interestingly, even with the span limit set to 8192, the average span is only 314. Similar results are obtained on enwik8, as shown in Table 2, where the adaptive-span model outperforms similarly sized models with a significantly smaller average span. Our large models achieve state-of-the-art performance on both datasets with fewer parameters and FLOPS.
In Figure 3, we compare the fixed-span and adaptive-span small Transformers as we increase the attention span limit S. The performance of both models improves as the limit increases (see Figure 3(left)), but the adaptive-span model benefits more from a longer span. As shown in Figure 3(center), a Transformer with adaptive spans controls its average span, leading to a reduction of up to 70% in the number of FLOPS for inference with large spans (see Figure 3(right)).
Impact on the attention span. In Figure 4, we show the final attention spans of every attention head of our small adaptive-span model with S = 4096. Even though all the span sizes are initialized to the same value, we see large variation in their final values. The lowest 5 layers have the smallest possible attention span, which is R = 32 of the masking function. This indicates that lower layers in a Transformer model do not really require a long attention span in this particular task. In contrast, a few attention heads in the higher layers have very long spans, exceeding several thousand. Although there is a general tendency for higher layers to have longer attention spans, the span is not a simple monotonic function of the layer height.
Impact on the number of FLOPS. Having a smaller attention span has a direct impact on the total number of FLOPS necessary for computing one-step prediction. In a standard fixed-span model, the total number of FLOPS is mostly controlled by the feed-forward layer (accounting for 62% of FLOPS when S = 256). However, as the span increases, the attention layer dominates the computation (82% of FLOPS when S = 8192), making it hard to scale to longer sequences. In contrast, the learning of an attention span keeps computation at a relatively constant level even as S increases, as shown in Figure 3(right).
Table 1: Character level language modeling on text8. We report bpc for the dev and test sets, as well as the number of parameters, the average attention spans and total number of FLOPS (an estimate of the number of FLOPS necessary for computing one step prediction).
[Figure: attention span (log scale, 10^1 to 10^3) plotted against layers 1 to 12.]
The memory usage is also dominated by the attention layer as the attention span increases. Thus, reducing the average span will also reduce the memory usage. However, because all heads in a single layer attend to common state vectors, the maximum span within each layer will determine the memory usage. The same is true for the number of FLOPS if all heads of a layer are computed together, as is often done for better efficiency.
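The gap between the average span and the span that actually drives memory can be illustrated with a small sketch. The function names and the nested-list layout (one list of head spans per layer) are assumptions made for the example.

```python
def layer_memory_spans(head_spans):
    """Span each layer must keep in memory: the maximum over its heads."""
    return [max(layer) for layer in head_spans]

def average_span(head_spans):
    """Mean span over every head of the model."""
    spans = [s for layer in head_spans for s in layer]
    return sum(spans) / len(spans)
```

For a toy model with spans [[32, 32], [32, 4000]], the average span is 1024, yet the second layer must still keep 4000 past states because its longest head needs them, while the first layer keeps only 32.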
In practice, the largest fixed-span model that could fit in memory for training had a span of S = 2048 (batches had to be split when S = 4096), and it took about 550ms per batch. In contrast, an adaptive-span model with a 4 times longer span of S = 8192 fit in memory and took about the same time per batch.

Model                 Avg. span   dev (bpc)
Adaptive (S = 1024)   123         1.08
Dynamic (S = 1024)    149         1.08

Dynamic span. In Table 3, we show that the adaptive and dynamic spans achieve the same performance with comparable average spans on text8. Figure 5 shows how the average dynamic span adapts to the input sequence. The span increases at the beginning of words and in the middle of compound words, e.g., to predict the "l" in "overlook".

Conclusion
In this work, we present a novel self-attention layer with an adaptive span. This mechanism allows for models with longer context, and thus with the capability to capture longer dependencies. We have shown the importance of this feature in the context of character-level modeling, where information is spread over great distances.