Blockwise Self-Attention for Long Document Understanding

We present BlockBERT, a lightweight and efficient BERT model for better modeling long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training/inference time, which also enables attention heads to capture either short- or long-range contextual information. We conduct experiments on language model pre-training and several benchmark question answering datasets with various paragraph lengths. BlockBERT uses 18.7-36.1% less memory and 12.0-25.1% less time to learn the model. During testing, BlockBERT saves 27.8% inference time, while having comparable and sometimes better prediction accuracy, compared to an advanced BERT-based model, RoBERTa.


Introduction
Recent emergence of the pre-training and finetuning paradigm, exemplified by methods like ELMo (Peters et al., 2018), GPT-2/3 Brown et al., 2020), BERT , XLNet (Yang et al., 2019), RoBERTa  and ALBERT (Lan et al., 2019), has drastically reshaped the landscape of the natural language processing research. These methods first pre-train a deep model with language model objectives using a large corpus and then fine-tune the model using in-domain supervised data for target applications. Despite its conceptual simplicity, this paradigm has re-established the new state-of-theart baselines across various tasks, such as question answering , coreference resolution (Joshi et al., 2019b), relation extraction (Soares et al., 2019) and text retrieval Nogueira and Cho, 2019), to name a few. ⇤ This work was partially done when the first author was an intern at Facebook AI. Code is available at https:// github.com/xptree/BlockBERT Building such models in practice, however, is an extremely resource-intensive process. For instance, the training of BERT-family models is notoriously expensive.  report that it takes four days to pre-train BERT-Base/BERT-Large on 4/16 Cloud TPUs. In order to reduce the pre-training time of RoBERTa to 1 day,  use 1,024 V100 GPUs. One crucial factor contributing to the long training time is the memory consumption of these deep models, as it directly affects the batch size. Although the fine-tuning stage is relatively inexpensive, the memory issue still restricts the scenarios in which BERT can be used. For instance, "it is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB-16GB of RAM, because the maximum batch size that can fit in memory is too small. 1 " Although one may think that model size is the main contributor to the large memory consumption, our analysis (Section 2.1) shows that one of the main bottlenecks is actually dot-product selfattention, operated in multiple layers of Transformers (Vaswani et al., 2017), the building block of BERT. As the attention operation is quadratic to the sequence length, this fundamentally limits the maximum length of the input sequence, and thus restricts the model capacity in terms of capturing long-distance dependencies. As a result, downstream tasks have to either truncate their sequences to leading tokens (Nogueira and Cho, 2019) or split their sequences with a sliding window (Joshi et al., 2019a,b). Ad-hoc handling of long sequences is also required in the pre-training stage, such as updating the model using only short sequences in the early stage .
Common strategies for reducing memory consumption, unfortunately, do not work. For instance, shrinking the model by lowering the number of layers L, attention heads A, or hidden units H leads to significant performance degradation (Vaswani et al., 2017; and does not address the long sequence issue. Alternatively, general low-memory training techniques, such as microbatching (Huang et al., 2018) and gradient checkpointing (Chen et al., 2016) essentially trade off training time for memory consumption, prolongs the already lengthy training process.
In this work, we explore a different strategy, sparsifying the attention layers, intending to design a lightweight and effective BERT that can model long sequences in a memory-efficient way. Our BlockBERT extends BERT by introducing sparse block substructures into attention matrices to reduce both memory consumption and the number of floating-point operations (FLOPs), which also enables attention heads to capture either shortor long-range contextual information. Compared to the previous method that also enforces sparsity , our approach is much simpler mathematically and very easy to implement. More importantly, the results of experiments conducted on several benchmark question answering datasets with various paragraph lengths show that BlockBERT performs comparably or even better than the original BERT-family models, while enjoying an 18.7-36.1% reduction in memory usage, a 12.0-25.1% reduction in training time, and a 27.8% reduction in inference time. The rest of the paper is organized as follows. Section 2 gives a brief introduction of the BERT model, along with an in-depth analysis of its memory usage during training time. We describe our proposed model in Section 3 and contrast it with existing methods that aim for creating a lighter model. Section 4 presents the experimental results and ablation studies, followed by a survey of other related work in Section 5 and the conclusion in Section 6.

Background: Memory Bottleneck in Training BERT
We briefly review BERT and introduce its memory profiling in this section. Following the paradigm of language model pre-training and down-stream task fine-tuning, BERT  consists of multiple layers of bidirectional Transformers (Vaswani et al., 2017), where each Transformer encoder has a multi-head self-attention layer and a position-wise feed-forward layer. Using the same notation as in , we denote the number of Transformer layers by L, the number of hidden units by H, the number of attention heads by A, the sequence length by N , and the batch size by B. We also assume the feed-forward hidden unit size to be 4H. 2

Memory Profiling
Training BERT is a memory-intensive process. In order to identify the bottleneck, we follow the memory model proposed by Sohoni et al. (2019), where memory usage throughout neural network training is categorized into three main types: (1) Model memory is used to store model parameters; (2) Optimizer memory is the additional memory used by the specific learning algorithm during the process; (3) Activation memory consists of the outputs of each layer, which are cached for reuse in backpropagation to compute gradients. Take BERT-Base training as an example. The model has 110 million parameters, so model memory occupies 0.2 GB if parameters are stored in half-precision floating-point format (FP16). For Adam (Kingma and Ba, 2014), the optimizer needs additional memory to store the gradients, first moments, and second moments of model parameters. If stored using the same precision, the optimizer memory should be three times of model memory. 3 To calculate the exact size of activation memory is not trivial because it depends heavily on the implementation of the toolkit. Instead, we measure it empirically by training BERT-Base using Adam with a memory profiler (more details are provided in Appendix A.2).
We use 32 NVIDIA V100 GPUs for training. Every single GPU thus consumes a minibatch of size b = B/32 = 8. Figure 1(a) shows the profiling result for a single GPU, where the model/optimizer/activation memory consumes 0.21/1.03/8.49 GB, respectively. We can see that activation memory accounts for the vast majority of the total GPU memory (87.6%) and is thus the bottleneck. Notice that although our analysis is done on BERT-Base, it can also be generalized to BERT-Large and other models such as RoBERTa  and XLNet (Yang et al., 2019).

A Regression Analysis on Activation Memory
For BERT, or more specifically, Transformer, the activation memory corresponds to intermediate results of different layers. It grows linearly in all the model hyper-parameters, except the sequence length N , due to the attention layers. To quantify the linear and quadratic components in the activation memory more clearly, we conduct a regression analysis as follows. Assume that the activation memory (in each GPU) is a polynomial a 2 bN 2 + a 1 bN + a 0 , where b is the batch size in each GPU and a i (i = 0, 1, 2) are coefficients to be determined. If we fix the total number of tokens in a GPU to be constant (in our case, we fix b ⇥ N = 4096), we should have a linear function w.r.t. N , i.e., 4096a 2 N + 4096a 1 + a 0 . We enumerate N from {128, 256, 512, 1024} in our experiments, and plot the corresponding profiled activation memory in Figure 1

Techniques for Reducing Traing Memory
Observing that activation memory is the training bottleneck, we discuss common memory reduction techniques below.
Low Precision (Micikevicius et al., 2017) Low precision is to use half-precision/mixed-precision for training neural networks. This technique has been widely used in Transformer training . In this work, we already assume to use mixed-precision training by default, as indicated in the aforementioned analysis.
Microbatching (Huang et al., 2018) Microbatching is to split a batch into small microbatches (which can be fit into memory), and then run forward and backward passes on them separately with gradients for each micro-batch accumulated. Because it runs forward/backward pass multiple times for a single batch, it trades off time for memory.
Gradient Checkpointing (Chen et al., 2016) Gradient checkpointing saves memory by only caching activations of a subset of layers. The un-cached activations will be recomputed during backpropagation from the latest checkpoint. This strategy trades off time for memory by repeating computations and will obviously extend training time.
Knowledge Distillation (Hinton et al., 2015) Knowledge distillation aims to compress and transfer knowledge from a teacher model to a simpler student model. However, knowledge distillation relies on a teacher model (which is still expensive in training time) and usually suffers from a certain degree of performance degradation. As common techniques are limited in reducing both the training time and memory usage, we investigate how to optimize the dot-product attention layers and introduce our approach next.

Model: BlockBERT
Following (Vaswani et al., 2017), the dot-product attention in Transformer is defined as: where Q, K, V 2 R N ⇥d with N to be the sequence length and d to be a hidden dimension. As we can see, the inner product between Q and K consumes O(N 2 ) memory. One simple way to reduce the memory consumption of attention is to sparsify the attention matrix. Suppose we have a masking matrix M 2 {0, 1} N ⇥N , we define a masked version of attention as follows: with operator defined by In this work, we design M to be a sparse block matrix, which not only reduces memory and the number of floating-point operations (FLOPs) but also benefits from efficient dense matrix support from deep learning frameworks, such as PyTorch and Tensorflow. More formally, we split the length-N input sequence into n blocks, with each block of length N n . 4 The N ⇥ N attention matrix is then partitioned into n ⇥ n blocks, where each block matrix is of the size N n ⇥ N n . We define a sparse block matrix M by a permutation ⇡ of {1, 2, · · · , n}: By writing Q, K, V as block matrices, such that n ] > and pluging them into Equation 1, we can formally define Blockwise Attention as follows: Equation 3 only needs to compute and store Q i K > ⇡(i) (i = 1, · · · n), each has size N n ⇥ N n . In other words, BlockBERT reduces both O(N 2 ) memory consumption and FLOPs by a factor of n, since N n ⇥ N n ⇥ n = N ⇥N n .

Blockwise Multi-Head Attention
Analogous to Multi-head Attention (Vaswani et al., 2017), we allow queries, keys, and values to be projected multiple times and perform blockwise attentions in parallel. Moreover, different blockwise attention heads can use different masking matrices. The outputs of multiple heads are then concatenated and aggregated with another linear projection. Let A be the number of attention heads and H the number of hidden units. Blockwise multi-head attention is formally defined as follows: where for each head i, i = 1, 2, · · · , A, 4 We assume N can be divided by n. If not, we pad the input sequence to make N divisible. (1, 2) (2, 1) (1, 2, 3) (2, 3, 1) (3, 1, 2) Figure 2: Architecture of Blockwise Multi-head Attention, which acts as building blocks of BlockBERT. The key idea is to introduce a sparse block masking matrix to the N ⇥ N attention matrix. The right panel shows the masking matrices we use when n = 2, 3. For n = 2, the masking matrices are defined by permutation (1, 2), (2, 1) and have 50% non-zeros. For n = 3, the masking matrices are defined by permutation (1, 2, 3), (2, 3, 1), and (3, 1, 2) and have 33.33% non-zeros.

Analysis of Memory Usage Reduction
To validate our claim that BlockBERT with n ⇥ n blocks can reduce the O(N 2 ) memory usage by a factor of n, we perform the same memory profiling as described in sections 2.1 and 2.2. Again, We fix the number of tokens in each GPU (b ⇥ N = 4096) and choose N from {128, 256, 512, 1024, 2048}. 5 As we can see from Figure 3 and Table 1, the empirical results align well with the theoretical values.
When we set the number of blocks to be 2 and 3 for BlockBERT, the estimated O(N 2 ) activation memory decreases to 1/2 and 1/3 of BERT's O(N 2 ) activation memory, respectively. As shown in Table 2, for the sequence length N = 512, BlockBERT with 2 and 3 blocks saves 18.7% and 23.8% overall memory, respectively. The saving is more significant for longer sequences. When N = 1024, the overall memory reduction of BlockBERT with 2 and 3 blocks is 27.3% and 36.1%, respectively.

RoBERTa-2seq & RoBERTa-1seq
We compare with two versions of RoBERTa . RoBERTa-2seq is trained with both masked language model (MLM) task and next sentence pre-diction (NSP) task, while RoBERTa-1seq refers to the pre-training model with only the MLM task.
SparseBERT We pre-train BERT models with its Transformer encoder replaced by a Sparse Transformer encoder . We set its sparsity hyper-parameters stride`= 128 and expressivity c = 32. 6 The attention masking matrix used in Sparse Transformer and more implementation details are discussed in Appendix A.3. A similar architecture was adopted in GPT-3 (Brown et al., 2020).

Pre-training
All the models follow the BERT-Base setting, i.e., Moreover, the pre-training of SparseBERT and BlockBERT follows the RoBERTa-1seq setting, i.e., we drop the NSP (Next Sentence Prediction) task, and an input sequence is up to N tokens until it reaches a document boundary. A summary of the pre-training performance comparison between BlockBERT and RoBERTa-1seq is shown in Table 2. Besides memory saving, we also achieve a significant speedup. For example, when N = 1024, BlockBERT (n = 2) reduces the training time from RoBERTa's 9.7 days to 7.5 days.

Fine-tuning Tasks
We evaluate BlockBERT on several question answering tasks, including SQuAD 1.1/2.0 (Rajpurkar et al., 2018) and five other tasks from the MrQA shared task 7 -HotpotQA (Yang et al., 2018), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017) and NaturalQA . Since MrQA does not have an official test set, we follow Joshi et al. (2019a) to split the devel- 6 We adopt Sparse Transformer implemented by Fairseq, which first computes the N ⇥ N attention matrix, and then masks it to be a sparse one. This implementation cannot avoid the O(N 2 ) attention computation, and thus has a similar training time/memory cost to RoBERTa. 7 mrqa.github.io  opment set evenly to build a new development set and test set. These QA datasets have different paragraph length distributions and are thus ideal for testing the effectiveness of BlockBERT 8 . For example, SQuAD, NaturalQA, and HotpotQA consist of mostly short paragraphs (shorter than 512), while paragraphs in SearchQA (average length 1,004) and TriviaQA (average length 934) have around 1,000 tokens. When the input sequence is longer than N , we follow the common practice (Joshi et al., 2019a) to split it using a sliding window of size N and stride 128. This means that for SearchQA and TriviaQA, a model with N = 512 can only capture half of the context, while a model with N = 1024 can accept the whole paragraph as input.
For all models, we adopt the same fine-tuning QA setup from . The tokenized paragraph (p 1 , · · · , p s ) and question (q 1 , · · · , q t ) are concatenated to be a sequence [CLS]q 1 · · · q t [SEP]p 1 · · · p s [SEP]. The sequence is then fed into the pre-trained model with two extra linear layers for predicting the start and end positions of the answer spans. The detailed fine-tuning setting is listed in Appendix A.4. Table 3 and Table 4 report the experimental results.
BlockBERT v.s. SparseBERT For N = 512, it is interesting that BlockBERT with 3 blocks (density 33.33%) performs better then SparseBERT (den-8 The detailed paragraph length distributions can be found in Appendix A.5  sity 44.20%) in both SQuAD and MrQA tasks. Similar results can be observed for N = 1024, too. These results show that off-diagonal masking matrices, e.g., the masking matrix defined by permutation (2, 3, 1) and (3, 1, 2), play crucial roles in BlockBERT. Furthermore, BlockBERT with 2 blocks achieve a more significant improvement.
Effect of Long Sequence Pre-training Our observations are twofold: (1) Long sequence pre-training benefits long sequence fine-tuning. In TriviaQA and SearchQA, of which paragraph lengths are around 1024, pre-training models with N = 1024 achieve significantly better performance. (2) The heterogeneity of pre-training and fine-tuning sequence length may hurt performance. For example, in SQuAD, we do not see significant performance gain by using pre-trained models with N = 1024; in HotpotQA and NewsQA, longer sequence pretraining even hurts performance.
Effect of #Blocks It is not surprising that BlockBERT with 2 blocks (n = 2) performs better than that with 3 blocks (n = 3), because it keeps more attention matrix entries. The biggest difference is in SQuAD 2.0 and NewsQA with N = 1024, where we observe an absolute loss of 1.6 F1 by increasing block number from 2 to 3.

Efficient inference with BlockBERT
We benchmark test efficiency of RoBERTa and BlockBERT. The benchmark code follows huggingface 9 . All experiments are run 30 times on a 32GB V100 GPU with half precision (FP16). We report the average running time in Table 5. As we can see, BlockBERT does achieve speedup and memory reduction during test time. Take 8⇥1024, i.e., batch size B = 8, sequence length N = 1024, as an example, we can see that BlockBERT with 2 blocks saves 27.8% of test time, and BlockBERT with 3 blocks saves more (30.4%). As for memory, we can observe that RoBERTa cannot handle an input of size 16⇥1024, while it is possible for BlockBERT to work on it.
In summary, not only BlockBERT saves training/inference time and memory, but it also has a competitive and sometimes better performance, especially for tasks with longer sequences. This demonstrates the effectiveness of our blockwise multi-head attention approach.

Ablation Study
We fix the assignment of attention heads in the above experiments. For example, BlockBERT with sequence length N = 512 and 2 blocks is trained with ten heads using permutation (1, 2) and the other two using permutation (2, 1). However, there are other ways to assign twelve attention heads, e.g., seven heads for permutation (1, 2) and the other five for permutation (2, 1). It would be interesting to see how the assignment of heads affects model performance. In this section, we grid search attention head assignments and plot their best validation performance in 1.2M training steps. The results are shown in Figure 4.
Our observations are threefold: (1) Identity permutations, i.e., (1, 2) and (1, 2, 3), are important. As shown in Figure 4, all optimal solutions assign considerable attention heads to block-diagonal matrices, since those matrices enable each token to attend to its nearby tokens; (2) Non-identity permutations follow the rule of "vital few and trivial many." Although identity permutations are important, assigning all attention heads to them (corresponding to 12:0 and 12:0:0 in Figure 4) significantly hurts performance, since the model can not learn long-9 github.com/huggingface/transformers/ blob/master/examples/benchmarks.py term dependencies with only identity permutation; (3) Pre-training performance and fine-tuning performance are correlated but not always consistent. When n = 3, pre-training performance suggests 10:1:1 to be the best head assignment -ten heads for permutation (1, 2, 3), one head for (2, 3, 1) and one head for (3, 1, 2), but we observe that the configuration of 8:2:2 achieves better performance in fine-tuning tasks.

Related Work
In this section, we review the related work of memory optimization for neural network training and recent efforts to simplify Transformer and BERT.

Low-memory neural networks training
Due to the large size of model parameters and deep architectures, modern neural networks training requires significant amounts of computing resources. As a result, there is an increasing interest in training neural networks with low memory (Sohoni et al., 2019). Mainstream techniques mostly address this problem with a better system or engineering design, such as low-precision training (Micikevicius et al., 2017), microbatching (Huang et al., 2018) and gradient checkpointing (Chen et al., 2016). Alternatively, there also exists some research focusing on the theoretical aspect, including the recently proposed lottery ticket hypothesis (Frankle and Carbin, 2018).

Efficient Transformer
Since the invention of Transformer (Vaswani et al., 2017) and its successful application to masked language model pre-training Yang et al., 2019;Lan et al., 2019), several approaches have been proposed to simplify the model and its training process. We summarize these attempts as follows: Attention layer simplification There are currently two lines of research trying to simplify the multi-head attention layers. The first one focuses on attention matrix sparsification. Notable examples include Star Transformer , Sparse Transformer , Adaptive Sparse Transformer (Correia et al., 2019;Sukhbaatar et al., 2019), Log-Sparse Transformer (Li et al., 2019) , Reformer (Kitaev et al., 2020) and Longformer (Beltagy et al., 2020). However, due to the insufficient support for sparse tensors from the current deep learning platforms, some   of them have to represent a sparse matrix using a dense matrix with a binary mask or rely on customized CUDA kernels (Gray et al., 2017). As a result, the speed-up or reduction in memory consumption is sometimes limited in practice. The second line of research prunes redundant attention heads. Examples include (Voita et al., 2019) and (Michel et al., 2019). Our BlockBERT model belongs to the first category, as we sparsify the attention matrix by replacing it with a block sparse matrix.  (Hochreiter and Schmidhuber, 1997). In contrast, ALBERT (Lan et al., 2019) is a notable work that does not take the knowledge distillation approach. It uses parameter-sharing to reduce the number of parameters of the BERT model. As discussed in section 2.1, parameter-sharing reduces both model memory and optimizer memory. These two parts account for about 12.4% of total training memory for BERT-base. As for efficiency, parameter-sharing reduces communication complexity in distributed training and thus saves training time as well.
In the aforementioned efficient Transformers, the model quality is often demonstrated by comparable language model perplexity, or equivalently the bits per word/byte. It is often implicitly assumed that similar language model perplexity implies similar pre-training model quality, namely the same performance on the downstream tasks. We would like to point out that this assumption does not necessarily hold. For example, the experiments on the Enwik8 dataset by  demonstrates that Sparse Transformer "surpasses the 1.03 state-of-the-art (bits per byte) for a similarly-sized Transformer-XL and matching the 0.99 (bits per byte) of a model trained with more than double the number of parameters". However, if we compare SparseBERT (pre-training model with Sparse Transformer backbone) against XLNet (Yang et al., 2019) (pre-training model with Transformer-XL backbone) in SQuAD, Table 3 shows that XLNet still outperforms SparseBERT significantly. Therefore, we believe that it is necessary to conduct a comprehensive study and evaluation of existing efficient Transformer models when used for masked language model pre-training. Limited by resources, in this work, we mainly compare BlockBERT to pre-training using Sparse Transformer , which is the earliest attempt to design efficient Transformer models and also the key contributor to the success of GPT-3 (Brown et al., 2020). We plan to benchmark more models in the future.

Conclusion
In this work, we study the lightweight BERT model with the goal of achieving both efficiency and effectiveness. We profile and analyze the memory bottlenecks of BERT and focus on optimize dotproduct self-attention, which consumes quadratic memory with respect to the sequence length. To reduce both time and memory consumption, we present BlockBERT, which sparsifies the attention matrices to be sparse block matrices. The proposed model achieves time and memory saving without significant loss of performance.
In the future, we plan to benchmark more efficient Transfomers in language model pre-training and fine-tuning. We also would like to explore more applications of BlockBERT on NLP tasks involving long sequences such as coreference resolution (Joshi et al., 2019b) and document-level machine translation (Miculicich et al., 2018), and also non-NLP tasks such as protein sequence modeling (Rives et al., 2019;Rao et al., 2019).