Guiding Attention for Self-Supervised Learning with Transformers

In this paper, we propose a simple and effective technique to allow for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that a majority of self-attention patterns in trained models are simple, non-linguistic regularities. We propose a computationally efficient auxiliary loss function that guides attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster convergence of models as well as better performance on downstream tasks compared to the baselines, achieving state-of-the-art results in low-resource settings. Surprisingly, we also find that linguistic properties of attention heads are not necessarily correlated with language modeling performance.


Introduction
Recent advances in self-supervised pre-training (Radford et al., 2018; Devlin et al., 2018a) have resulted in impressive downstream performance on several NLP tasks (Wang et al., 2018). However, this has led to the development of enormous models, which often require days of training on non-commodity hardware (e.g. TPUs) (Kaplan et al., 2020). Furthermore, studies have shown that it is quite challenging to successfully train these large Transformer models (Vaswani et al., 2017), requiring complicated learning schemes and extensive hyperparameter tuning (Xiong et al., 2020; Raffel et al., 2019; Popel and Bojar, 2018).
Despite these expensive training regimes, recent studies have found that once trained, these bi-directional language models exhibit simple patterns of self-attention without much linguistic backing (Voita et al., 2019; Raganato and Tiedemann, 2018). For example, 40% of heads in a pre-trained BERT model (Devlin et al., 2018a) simply pay attention to delimiters added by the tokenizer (e.g. [CLS] or [SEP]) (Kovaleva et al., 2019). Since these attention patterns are independent of linguistic phenomena, a natural question arises: can Transformer models be guided towards such attention patterns without requiring extensive training?

Code: https://github.com/ameet-1997/AttentionGuidance

Figure 1: Attention patterns of our model (left) and the default RoBERTa model (right) after 0% (top), 1% (middle), and 100% (bottom) of pre-training. Inducing simple patterns (left) using an auxiliary loss leads to benefits in convergence speed, downstream performance, and robustness to hyperparameters.
In this paper, we propose an attention guidance (AG) mechanism for self-attention modules in Transformer architectures to enable faster, more efficient, and robust self-supervised learning. Our approach is simple and agnostic to the training objective. Specifically, we introduce an auxiliary loss function to guide the self-attention heads in each layer towards a set of pre-determined patterns (e.g. Figure 1 (Vig, 2019)). These patterns encourage the formation of both global (e.g. attend to [CLS], [SEP] tokens) and local (e.g. attend to [Next], [Prev] token) structures in the model.
Through several experiments, we show that our approach enables training large Transformer models considerably faster: for example, we can train a 16-layer RoBERTa model with SOTA performance on a low-resource domain in just two days using four GPUs, while excluding our loss leads to slow or no convergence. Our method also achieves competitive performance with BERT (Devlin et al., 2018a) on three English natural language understanding tasks, and outperforms the baseline masked language modeling (MLM) models on eleven out of twelve settings considered.
Further, we show that our method is agnostic to the training objective by demonstrating gains on the replaced token detection objective proposed by ELECTRA (Clark et al., 2020) and on machine translation with Transformers. Finally, we provide an analysis of the attention heads learned using our method. Surprisingly, contrary to recent studies (Clark et al., 2019; Lin et al., 2019), we find that it is possible to train models that perform well on language modeling without learning a single attention head that models coreference.
To summarize, our main contributions are:
• We propose a simple auxiliary loss for self-attention heads that enables large models to converge quickly on commodity hardware.
• We demonstrate the effectiveness of our auxiliary loss on different languages, model sizes, and training objectives.
• We provide evidence that the linguistic performance of individual attention heads is not a necessary condition for good language modeling (LM) or downstream task performance.

Related Work
Improving efficiency of LMs The high computational costs of BERT-style models have accelerated research on developing efficient contextual language models. ELECTRA (Clark et al., 2020) uses a GAN-like setup to predict if each word in the input sequence was corrupted by a generator (another pre-trained LM), and is shown to be more sample efficient than the standard MLM objective. Other studies have explicitly focused on making the self-attention modules more efficient. Reformer (Kitaev et al., 2020) and Sparse Transformer (Child et al., 2019) introduce locality-sensitive hashing and sparse factorizations to reduce the quadratic complexity of dot-product attention, while Longformer (Beltagy et al., 2020) uses local-windowed and task-motivated global attention to scale the memory usage of self-attention modules linearly.

Analyzing Self-Attention Recent papers have analyzed the attention patterns in trained Transformer-based LMs. Some studies hypothesize that multiple attention heads capture linguistic phenomena like co-reference links and dependency arcs (Clark et al., 2019; Htut et al., 2019). However, other studies show that pruning those heads leads to minimal performance degradation on downstream tasks (Kovaleva et al., 2019; Michel et al., 2019). Others note that there are recurring patterns in attention distributions corresponding to different attention heads (hereon, heads), which are not language- or task-dependent (Voita et al., 2019; Raganato and Tiedemann, 2018). While our study also questions the role of heads for language modeling and downstream performance, we focus on making modifications to LM pre-training rather than on analyzing published pre-trained models.
Constraining Self-Attention Qiu et al. (2019) enforce local constraints on the attention patterns to reduce computation and build deeper models with longer contexts. The studies that are perhaps most similar to ours explore fixed attention patterns for machine translation (You et al., 2020; Raganato et al., 2020). You et al. (2020) replace all attention heads in the encoder with hard-coded Gaussian distributions centered around the position of each token while observing a minimal reduction in BLEU scores. Raganato et al. (2020) substitute all but one head with fixed attention patterns in each encoder layer and note little performance degradation. Both of these studies enforce hard constraints on the self-attention and try to match baselines in terms of speed and performance. Our approach is complementary: our attention guidance loss is a form of soft regularization and outperforms baseline models both in terms of convergence speed and quantitative metrics.

Prelude: The surprising effectiveness of non-linguistic attention
Several recent studies (Clark et al., 2019; Kovaleva et al., 2019) have demonstrated that Transformers trained with the masked language modeling (MLM) objective exhibit simple self-attention patterns (e.g., attending to delimiter tokens). These patterns (e.g. Figure 2) are consistent across models pre-trained on different languages, or fine-tuned on various downstream tasks (Kovaleva et al., 2019). To test whether such non-linguistic attention structure is useful on its own, we fine-tune models pre-trained on non-English corpora directly on English downstream tasks (Socher et al., 2013; Rajpurkar et al., 2016; Dolan and Brockett, 2005). We also compare with a randomly initialized Transformer, which is fine-tuned on downstream tasks without any pre-training (Kovaleva et al., 2019). Surprisingly, the results in Table 1 show that despite both models having mismatched tokens and being trained on languages with linguistic constructs different from those of English, their performance is significantly better than that of a model with no pre-training. This corroborates the idea that the non-linguistic structure in attention heads is beneficial for learning, and that inducing it explicitly may lead to faster training and better performance.

Our method: Attention guidance for Transformers
We first formally define the masked language modeling (MLM) setup with Transformers (Vaswani et al., 2017) and then describe our attention guidance mechanism.
MLM with Transformers Transformers used for sequence-to-sequence prediction tasks are trained on a dataset D of pairs of sequences x and corresponding labels y. In the case of masked language modeling (MLM), the input sequence x_1, x_2, ..., x_n of length n consists of individual tokens, and the output labels y_1, y_2, ..., y_n are the same as the input sequence, i.e., y_i = x_i. A fraction k of the input tokens, chosen randomly, is masked, i.e., replaced with a <MASK> token. Assume that these masked indices are collected in a set C. The MLM objective is then a cross-entropy loss on the predictions made by the model at the masked locations j ∈ C, and is used to optimize all the parameters θ of the model by minimizing:

$$\mathcal{L}_{MLM}(\theta) = -\sum_{j \in C} \log p_{\theta}(y_j \mid \tilde{x}),$$

where $\tilde{x}$ denotes the masked input sequence.

The Transformer architecture for MLM consists of layers with h self-attention heads per layer. Let the input activations to layer k of this model be $s_k$, with $|s_k| = n$. Naturally, $s_1 = s = x$. For every position p ∈ [1, n] in its output, each attention head in layer k induces a probability distribution over all positions in its input $s_k$. Let a single head's attention activations (as described in Equation 1 of Vaswani et al. (2017)), which are a function of s, be denoted by

$$H(s) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right),$$

where Q and K are the query and key matrices respectively, and $d_k$ is the dimensionality of the queries and keys. Further, let H(s)[p, q] (a scalar) denote the attention that token p in the head's output pays to token q in the head's input.
We drop the dependence on s in the following sections for notational convenience.
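To make this concrete, the following is a minimal sketch (our own illustration, not the released implementation) of constructing MLM inputs and computing the cross-entropy at the masked positions; the model interface returning per-token logits is an assumption.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, x, mask_token_id, k=0.15):
    """Sketch of the MLM objective. x: LongTensor of token ids, shape (batch, n).
    `model` is assumed to return per-token logits of shape (batch, n, vocab)."""
    y = x.clone()                                        # labels y_i = x_i
    masked = torch.rand(x.shape, device=x.device) < k    # the set C of masked indices
    x_masked = x.masked_fill(masked, mask_token_id)      # replace with <MASK>
    logits = model(x_masked)
    # cross-entropy only at the masked locations j in C
    return F.cross_entropy(logits[masked], y[masked])
```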
Guiding attention heads To guide an attention head, we impose a mean squared error (MSE) loss on H using a pre-defined pattern P(s) ≡ P ∈ R^{n×n}:

$$\mathcal{L}_{A}(H, P) = \lVert H - P \rVert_F^2, \qquad (2)$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm. Specifically, we consider two types of patterns:
• Global attention patterns that focus their attention on specific global positions, such as the first token of the sequence (e.g. [First]).
• Local attention patterns that focus either on the next or the previous token (e.g. [Next], [Prev]).
Figure 2 shows examples of these patterns for a sample sentence.

Overall loss function We apply the attention loss in Equation 2 to each head in each layer to obtain the overall attention guidance (AG) loss:

$$\mathcal{L}_{AG} = \sum_{k} \sum_{j} \mathbb{1}(k, j)\, \lVert H_{kj} - P_{kj} \rVert_F^2, \qquad (3)$$

where 1(k, j) denotes an indicator function which is 1 only if the j-th head in layer k is being guided.
In general, this loss allows for arbitrary choices of patterns for each $P_{kj}$. However, to simplify matters in our experiments, we guide a given head index to the same pattern across all layers, i.e., $P_{\cdot j}$ is the same for all layers. We utilize the gradients from this loss to update all the parameters of the model (including the feedforward and input embedding layers). It is worth noting that this loss depends only on the input x and not on the labels y.
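As an illustration, the [First], [Next], and [Prev] patterns can be written as n×n matrices whose rows are the target attention distributions. This is a sketch under our own conventions (in particular, how the edge positions are handled is our assumption), not the authors' exact construction.

```python
import torch

def attention_pattern(name, n):
    """Build a pre-defined pattern P in R^{n x n}; row p is the target
    attention distribution for output position p."""
    P = torch.zeros(n, n)
    idx = torch.arange(n)
    if name == "first":                            # every position attends to token 0
        P[:, 0] = 1.0
    elif name == "next":                           # position p attends to p + 1
        P[idx, (idx + 1).clamp(max=n - 1)] = 1.0
    elif name == "prev":                           # position p attends to p - 1
        P[idx, (idx - 1).clamp(min=0)] = 1.0
    else:
        raise ValueError(f"unknown pattern: {name}")
    return P
```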
Finally, we combine our attention guidance (AG) loss with the MLM loss to get our overall optimization objective:

$$\mathcal{L} = \mathcal{L}_{MLM} + \alpha_t \, \mathcal{L}_{AG}, \qquad (4)$$

where $\alpha_t$ is a hyperparameter. In practice, we find that $\mathcal{L}_{AG}$ converges faster than $\mathcal{L}_{MLM}$, so we linearly decay $\alpha_t$ from an initial value $\alpha_0$ to 0 as training progresses (details in Section 4).
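Putting Equations 3 and 4 together, a hedged sketch of the overall objective might look as follows; the attention-map interface (a list over layers of per-head attention tensors) and the per-step schedule are our assumptions, not the released code.

```python
def ag_loss(attn_maps, patterns, guided):
    """attn_maps: list over layers of tensors of shape (heads, n, n).
    patterns: dict head_index -> (n, n) target pattern (same across layers).
    guided: set of (layer, head) pairs, playing the role of the indicator 1(k, j)."""
    loss = 0.0
    for k, layer_attn in enumerate(attn_maps):
        for j, P in patterns.items():
            if (k, j) in guided:
                loss = loss + ((layer_attn[j] - P) ** 2).sum()   # squared Frobenius norm
    return loss

def total_loss(mlm_value, ag_value, step, total_steps, alpha0=10.0):
    alpha_t = alpha0 * max(0.0, 1.0 - step / total_steps)        # linear decay to 0
    return mlm_value + alpha_t * ag_value
```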

Experimental Setup
We demonstrate the effectiveness of our attention guidance loss through several empirical studies. Specifically, we 1) report convergence results on masked language modeling, 2) evaluate trained language models on downstream tasks, and 3) analyze the learned attention representations using probes. For 1) and 2) above, we perform experiments on both high-resource and low-resource settings.

Datasets
We use the following datasets spanning three different languages (details in Table 2):
1. English: To train language models, we use a 2.1 billion token corpus from English Wikipedia. We download and pre-process articles according to Shoeybi et al. (2019). For downstream evaluation, we choose three tasks: QQP, MNLI (Williams et al., 2017), and QNLI (Rajpurkar et al., 2016).
2. Filipino: We use a 36 million token corpus of Wikipedia text collected by Cruz and Cheng (2020) to train language models, and the accompanying binary sentiment classification task to evaluate downstream performance.
3. Oromo: Our smallest corpus contains 4.6 million tokens (Strassel and Tracey, 2016). We use the accompanying named entity tags for NER, which is our downstream task.
These cover a range of dataset sizes, from high-resource (English) to low-resource (Oromo).

Evaluation
Evaluation metrics for the different tasks: 1. Language modeling: We report the training and validation MLM losses. Even though our attention guided models are trained with an auxiliary loss, we report only the MLM loss for direct comparison with the corresponding baseline. We also report the average training loss to compare models' convergence rates.
2. English downstream tasks: Accuracy for MNLI and QNLI, and F-1 score for QQP.

Figure 2: Example attention patterns used in our AG models for the sentence "<s> Welcome to EMNLP . </s>". Note that the first three patterns ([Next], [Prev], [First]) do not even depend on the input sentence.

Models and Training
To make comparisons across different settings easy, we choose RoBERTa (Liu et al., 2019) as the base architecture for all our experiments. We train variants with 8, 12, and 16 layers following the configurations given in the original paper on all 3 languages, which gives us a total of 9 settings. Since the current SOTA model for Filipino (Cruz and Cheng, 2019) is a BERT model, we train our Filipino models with both the MLM and next sentence prediction losses. Details of the model hyperparameters are provided in Appendix A.8. For each model, we compare its learning with and without our AG loss. We denote the attention guided models by RoBERTa-AG and the unmodified versions by RoBERTa-MLM. For notational convenience, RoBERTa-X-MLM and RoBERTa-X-AG denote RoBERTa models with X layers.
Comparison with state-of-the-art (SOTA) While we train all variants of our models with and without the AG loss, and only these results are strictly comparable, we also compare with SOTA models for reference. These are E-MBERT, a recent extension of multilingual BERT (Devlin et al., 2018b) that performs well on low-resource languages, for Oromo; BERT (Cruz and Cheng, 2020) for Filipino; and RoBERTa-BASE for English. (MNLI-m and MNLI-mm scores are reported as the same in Table 4 because they are not reported separately; the QQP scores reported are for RoBERTa-Large because the F-1 score is not reported for RoBERTa-Base.)

Attention patterns
We consider the following patterns P (Section 3) for guiding the self-attention heads:
1. [Next] attends to the next token.
2. [Prev] attends to the previous token.
3. [First] attends to the first token in the sequence.

Implementation details
Basic MLM models: We tune the learning rate from the set {1e-5, 5e-5, 1e-4}, the dropout in self-attention from the set {0.0, 0.1}, and the number of warmup steps from the set {0, 1000, 10000}.
AG models: For our AG models, we guide a fraction λ ∈ {1/4, 2/4, 3/4, 1} of heads in each layer. We choose α_0 (Equation 4) from the set {1, 10, 100} such that the scales of the MLM loss and the auxiliary loss are comparable at the beginning of training.
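For illustration, one way to realize the fraction λ of guided heads and the α_0 heuristic described above is sketched below; which specific head indices are guided is our assumption, not a detail taken from the paper.

```python
def guided_head_set(num_layers, num_heads, lam=0.5):
    """Guide the same head indices in every layer (Section 3); here we simply
    take the first floor(lam * num_heads) indices, which is an assumption."""
    per_layer = int(lam * num_heads)
    return {(k, j) for k in range(num_layers) for j in range(per_layer)}

def choose_alpha0(initial_mlm, initial_ag, candidates=(1.0, 10.0, 100.0)):
    """Pick alpha_0 so that the scaled AG loss is closest to the MLM loss at step 0."""
    return min(candidates, key=lambda a: abs(a * initial_ag - initial_mlm))
```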

Best performing hyperparameters: RoBERTa-MLM is very sensitive to the learning rate and the number of warmup steps, and the best performing hyperparameters are reported in Appendix A.8. On the other hand, we find that RoBERTa-AG is very robust and does not need much tuning. A learning rate of 1e-4, λ = 0.5, and 0 warmup steps work well for all the experiments. α_0 = 10 is used for our 12- and 16-layer models, and α_0 = 100 for smaller models. We fit the largest batch size possible for each model. We also perform an ablation study over the individual attention patterns (see "Ablation study with attention patterns" below).

Compute Time and Hardware Unlike state-of-the-art models, we emphasize that our studies are performed on a smaller computational budget, both with respect to wall-clock time and hardware. Our English models are trained for 10 epochs, with a cap of 4 days, on 8 NVIDIA Tesla P40 GPUs, and our Filipino and Oromo models for 40 epochs, with a cap of 2 days, on 4 NVIDIA Tesla P40 GPUs. The RoBERTa-MLM and RoBERTa-AG variants in an experiment are trained for the same number of epochs. We also pre-train both RoBERTa-12-MLM and RoBERTa-12-AG for longer and on TPUs to show that the trends hold even when using specialized hardware and more compute time (Appendix A.4).

Language Modeling
Faster convergence Table 3 provides an overview of our results on language modeling. As seen from the average loss, the AG loss greatly improves the speed of convergence across all model sizes and domains. Figure 3 shows the training loss curves for two model sizes trained on English, where the losses for the AG models drop almost immediately, whereas the MLM models have an extended period during which the losses do not decrease. The gains are particularly notable for larger models like RoBERTa-12 and RoBERTa-16, where careful hyperparameter tuning is required to guarantee convergence if the AG loss is not used. In contrast, using our auxiliary loss allows for fast convergence with standard out-of-the-box hyperparameters. For example, after just a day's training, the MLM loss for RoBERTa-16-AG has decreased from 11 to 2.5, whereas RoBERTa-16-MLM's is still at 6.5.

Final loss values Not only do the AG models converge faster, but their final train and validation losses are also lower than their MLM counterparts' on 8 out of 9 settings (Table 3). This is facilitated by the AG models' fast initial convergence coupled with robustness to hyperparameters, allowing us to use larger learning rates and no warmup period. On 5 of the 9 settings, namely the 12- and 16-layer models on Filipino and Oromo, and the 16-layer model on English, only our AG models converge. We also provide a hypothesis about the usefulness of the AG loss in Appendix A.3.

Downstream performance
We evaluate all the models' downstream performance to verify if better language modeling corresponds to better language understanding.
English Our AG models outperform their MLM counterparts on 11 out of the 12 settings (Table 4), with 7 comparisons being statistically significant (p < .05, paired t-test). We emphasize that the scores are not directly comparable to RoBERTa-BASE, which is trained on 10 times more data, for up to 8 times more epochs, and on several GPUs. Having experimentally shown the usefulness of the AG loss in optimizing the MLM objective, we expect that training our models with more data and compute would match or outperform MLM-only models trained at a similar scale.
Filipino As shown in Table 4, our AG models outperform the MLM variants on all model sizes. Additionally, our best performing model beats the current SOTA (Cruz and Cheng, 2019) by almost 1 point, even though the latter was trained on a TPU and for a longer wall-clock time.
Oromo Our AG models continue to have an edge even in sparse data domains. Though Oromo has only 0.2% of the pre-training data of English, which makes models prone to overfitting, it is interesting to note that larger models continue to outperform smaller ones on downstream tasks. Our models are competitive with E-MBERT, which is a BERT model leveraging resources from over 104 other languages. We hope that our competitive results on both Filipino and Oromo when using as little as 4 GPUs encourage more NLP research in low-resource languages.

Table 4: Evaluation on downstream tasks. Our AG models outperform their MLM counterparts on all but one setting (entries marked with '+' are significant with p < .05, paired t-test). Comparisons are column-wise. SOTA = state-of-the-art published numbers with similar model types on each task. The SOTA models are trained on more compute and data and are not directly comparable to our models.

Ablation study with attention patterns
As mentioned in Section 3.2, we introduce five different attention patterns for guiding our models using the AG loss. To select the best performing patterns, we use a leave-one-out strategy, in which we omit patterns and record the increase in loss (after 100,000 steps) relative to a model trained with all patterns included. The patterns which cause a large increase in loss when omitted are naturally more important. The increases in loss are recorded in Table 5.
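The leave-one-out procedure itself is straightforward; a hedged sketch is given below, where `train_and_evaluate` is a hypothetical helper that trains a model with the given patterns for the stated number of steps and returns its loss.

```python
def leave_one_out_importance(train_and_evaluate, pattern_names, steps=100_000):
    """Importance of each pattern = increase in loss when it is omitted,
    relative to a model trained with all patterns."""
    base_loss = train_and_evaluate(pattern_names, steps)
    return {
        name: train_and_evaluate([p for p in pattern_names if p != name], steps) - base_loss
        for name in pattern_names
    }
```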

Attention Guidance for ELECTRA
ELECTRA (Clark et al., 2020) is an efficient model which uses replaced token detection as the pre-training task. It involves training a generator and a discriminator: the generator randomly changes k% of tokens in an input sequence to plausible alternatives, and the discriminator has to identify whether each token was modified or not. The generator learns using the MLM objective, and the discriminator, which is used for downstream tasks, uses a logistic loss. We use an ELECTRA variant in which the generator is a unigram LM, and compare performance when the AG loss is added. The results after training ELECTRA-12 and ELECTRA-12-AG for 2 epochs on BooksCorpus (Zhu et al., 2015) are presented in Table 6. As with RoBERTa, we report only the discriminator's logistic loss even though our model is trained with the additional auxiliary loss. The AG model shows gains in convergence without any ELECTRA-specific hyperparameter tuning.
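A hedged sketch of adding the AG term to the discriminator's replaced-token-detection loss is shown below; the discriminator interface (returning per-token logits and attention maps) and the corruption step are simplified assumptions, and `ag_loss` refers to the helper from the earlier sketch.

```python
import torch.nn.functional as F

def electra_ag_step(discriminator, x, x_corrupted, patterns, guided, alpha_t):
    """x, x_corrupted: (batch, n) token ids before and after generator replacement."""
    replaced = (x != x_corrupted).float()             # 1 where a token was replaced
    logits, attn_maps = discriminator(x_corrupted)    # assumed interface
    disc_loss = F.binary_cross_entropy_with_logits(logits, replaced)
    return disc_loss + alpha_t * ag_loss(attn_maps, patterns, guided)
```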

Attention Guidance for Machine Translation
Models We also experiment with adding our AG loss to machine translation (MT) models that use Transformers for both the encoder and decoder. We compare with the BASE Transformer (Vaswani et al., 2017) and a recently proposed hard-coded Gaussian model (You et al., 2020), which fixes all the attention heads in the encoder and decoder to pre-determined Gaussian distributions centered around nearby tokens. While the latter's attention patterns are similar to our local attention patterns, they are hard-coded rather than imposed through an auxiliary loss. Following You et al. (2020), the cross-attention in our MT model is not guided. Using a held-out set, we search for the best combination of AG patterns (Figure 2) for both the encoder and decoder. We find this to be one head each guided with the [Next] and [Prev] patterns in the encoder, and no heads guided in the decoder. Global patterns (like attending to [First]) seem to be detrimental to performance in MT.

Results
We perform experiments on the IWSLT16 En-De (Cettolo et al., 2016) and WMT14 En-De datasets, and report the train negative log-likelihood (NLL), validation NLL, average train NLL (to compare convergence speed), and the BLEU score on the test set. All models are trained for 100,000 steps. Similar to LM pre-training, we observe that our model has the lowest train, validation, and average NLL on both datasets, showing that guiding attention heads helps even for MT. Furthermore, the AG model's BLEU scores are comparable to those of BASE and the hard-coded Gaussian model. We note that our AG patterns are tailored for language modeling, and MT models could benefit from a more extensive search over possible patterns.

Probing analysis
Motivated by recent studies (Clark et al., 2019; Lin et al., 2019) which posit that individual attention heads can encode linguistic information, we analyze the attention patterns in the self-attention heads of our models. Specifically, we search for heads that can individually perform coreference resolution.

Method We use the probe described in Clark et al. (2019), which evaluates attention heads on antecedent selection accuracy. A sentence (e.g. "The CEO led her company to success") is input to the model, and each head is scored on its ability to identify antecedents, e.g. a score of 1 if the token 'her' attends most to a token in 'The CEO'. We aggregate the scores over all the coreferent mention-antecedent pairs in the dataset and report the accuracy of each model's best performing head. We also include the scores of a randomly initialized RoBERTa model for comparison. We leave further details to Clark et al. (2019).
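A minimal sketch of this antecedent-selection scoring (following the description in Clark et al., 2019) is given below; tokenization and span alignment are glossed over, and the data format is our own.

```python
def head_antecedent_accuracy(attn, mention_pairs):
    """attn: (n, n) attention map of a single head (e.g. a torch tensor).
    mention_pairs: list of (mention_pos, (antecedent_start, antecedent_end))."""
    correct = 0
    for pos, (start, end) in mention_pairs:
        top = int(attn[pos].argmax())           # token the mention attends to most
        correct += int(start <= top < end)      # score 1 if it lies in the antecedent
    return correct / max(len(mention_pairs), 1)
```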

Datasets We use the CoNLL-2012 coreference dataset and a synthetic dataset of 10,000 samples from Lin et al. (2019), and follow their method of adding a distractor sentence (e.g. adding "The people were happy" after "The CEO led her company to success"), which serves to introduce spurious entities. We ensure that the antecedent is not the word directly before the coreferent mention, so that a trivial baseline which always chooses the previous word gets a score of 0.
Discussion We discuss the results reported in Table 8. We observe the same trends on both the CoNLL-2012 dataset and the synthetic dataset, and discuss the former in detail. In line with Clark et al. (2019)'s observation, BERT and RoBERTa have heads which achieve the highest accuracies. Even though RoBERTa-MLM (Section 4.3) is trained on significantly less compute and data, its performance is comparable to BERT and better than the Rule-based baseline. Interestingly, both RoBERTa-AG (λ = 1/2) and RoBERTa-AG (λ = 1), which have half and all of their heads guided respectively, perform significantly worse than both the baseline and a randomly initialized (untrained) model. Surprisingly, this is true even though the validation loss for both RoBERTa-AG (λ = 1/2) and RoBERTa-AG (λ = 1) is lower (better) than RoBERTa-MLM's. The performance degradation in the RoBERTa-AG models arises because half or all of the heads pay most of their attention to a pre-defined pattern, rendering them unable to attend to the antecedent. This provides evidence that language modeling performance is not necessarily correlated with the performance of individual heads on linguistic tasks, and that attention patterns of the heads are not necessarily directly interpretable. This observation is in line with a recent study (Brunner et al., 2020) that questions the interpretability of attention distributions. The trends on the synthetic dataset (Table 8) are similar: BERT and RoBERTa have a head that achieves close to perfect accuracy, and RoBERTa-MLM has a head whose accuracy is significantly better than that of a randomly initialized model. However, RoBERTa-AG (λ = 1) performs poorly (an accuracy of 0) even though its validation MLM loss is lower (better) than RoBERTa-MLM's.

Conclusion
In this study, we introduce the simple yet effective Attention Guidance (AG) loss, which speeds up convergence and improves performance across various domains and model sizes. Adding this loss also makes Transformers robust to hyperparameters like the learning rate, warmup steps, and dropout. Our experiments also show its usefulness across multiple pre-training objectives. The gains are particularly strong on larger models, enabling their use in low-compute scenarios and low-resource domains. Our analysis of the relationship between the AG loss and the MLM loss further supports the usefulness of our method, and we hope that this paper can serve as a starting point for future work aiming to exploit and question self-attention in Transformers.