Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference

The neural attention mechanism plays an important role in many natural language processing applications. In particular, multi-head attention extends single-head attention by allowing a model to jointly attend to information from different perspectives. Without explicit constraints, however, multi-head attention may suffer from attention collapse, an issue that makes different heads extract similar attentive features, limiting the model's representation power. In this paper, for the first time, we provide a novel understanding of multi-head attention from a Bayesian perspective. Based on recently developed particle-optimization sampling techniques, we propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention and consequently strengthens the model's expressiveness. Remarkably, our Bayesian interpretation provides theoretical insight into two not-well-understood questions: why and how one should use multi-head attention. Extensive experiments on various attention models and applications demonstrate that the proposed repulsive attention improves the learned feature diversity, leading to more informative representations and consistent performance improvements on various tasks.


Introduction
Multi-head attention (Vaswani et al., 2017) is an effective module in deep neural networks, with impressive performance gains in many natural-language-processing (NLP) tasks. By extending a single head to multiple parallel attention heads, the architecture is widely adopted to capture different attentive information and strengthen the expressive power of a model. Lin et al. (2017) applied the idea of multiple heads to self-attention and extracted a 2-D matrix instead of a vector to represent different contexts of a sentence. The Transformer (Vaswani et al., 2017) and its variants such as BERT (Devlin et al., 2019) are influential architectures based solely on multi-head attention, achieving state-of-the-art performance on many NLP tasks. The key advantage of multi-head attention is its ability to jointly attend to information from different representation subspaces at different positions, yielding multiple latent features that depict the input data from different perspectives. However, no explicit mechanism guarantees this desired property, leading to potential attention redundancy or attention collapse, which has been observed in previous research (Voita et al., 2019; Kovaleva et al., 2019). Although some works directly add regularizers to the loss function to encourage diversity in multi-head attention (Lin et al., 2017; Li et al., 2018), the underlying working principle has not been well validated, and the performance improvement is limited. Furthermore, the important question of why and how multi-head attention improves over its single-head counterpart remains poorly understood.
In this paper, we provide a novel understanding of multi-head attention from a Bayesian perspective, by adapting deterministic attention to a stochastic setting. Standard multi-head attention can be understood as a special case of our framework, in which attention-parameter updates between heads are independent instead of sharing a common prior distribution. Within our framework, attention repulsiveness can then be imposed by performing Bayesian inference on attention parameters with recently developed particle-optimization sampling methods (Liu and Wang, 2016), which have been shown to be effective in avoiding mode collapse. These methods treat each head as a particle/sample, which is then optimized to approximate a posterior distribution of an attention model. As a result, the multiple heads are encouraged to move to different modes of the parameter space, far from each other, thus improving the repulsiveness in multi-head attention and enhancing the model's expressiveness. Our Bayesian interpretation also provides a theoretical understanding of the reason for, and benefits of, applying multi-head attention. Experiments on various attention models demonstrate the effectiveness of our framework.
Our contributions are summarized as follows: • We provide a new understanding of multi-head attention from a Bayesian perspective, yielding a more principled and flexible interpretation of multi-head attention.
• Based on the recently developed particle-optimization sampling techniques, we propose an algorithm to explicitly encourage repulsiveness in multi-head attention without introducing extra parameters or explicit regularizers. The proposed method can be implemented with an efficient end-to-end training scheme.
• Our Bayesian interpretation provides a theoretical foundation to understand the benefits of multi-head attention, which reveals the existence of an optimal number of attention heads in a specific model.
• We apply our approach on four attention models with a wide range of tasks. Experimental results show that repulsive attention improves the expressiveness of models, and yields consistent performance gains on all the tasks considered.

Multi-head Attention
The attention mechanism aims at modeling dependencies among elements of a learned representation at different positions. The two commonly used attention functions are additive attention (Lin et al., 2017;Bahdanau et al., 2015) and dot-product attention (Vaswani et al., 2017). We review the popularly used dot-product attention below and defer the additive attention to Appendix A.
Dot-product Attention The multi-head scaled dot-product attention is used in the Transformer model (Vaswani et al., 2017). The attention function for a single head maps a query and a set of key-value pairs to an output as

Z_i = softmax( Q W_i^Q (K W_i^K)^⊤ / √d_k ) V W_i^V ,   (1)

where Q, K, V are matrices depicting the hidden representation of every word in one sentence (i.e., self-attention) or two sentences (i.e., inter-attention); d_k is the dimension of the keys and queries; Z_i is the attention feature/map of the input sentence from the i-th head; {W_i^Q, W_i^K, W_i^V} are the corresponding learnable attention parameters. The M-head attention projects the queries, keys and values into M subspaces with different learnable linear projections. These attention functions are performed in parallel and their outputs are concatenated, resulting in the final latent representation

Z = Concat(Z_1, …, Z_M) W^O ,   (2)

where W^O is a learnable output projection.
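The M-head scaled dot-product attention described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Transformer implementation itself; all names (`attention_head`, `multi_head_attention`, the per-head parameter tuples) are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V, Wq, Wk, Wv):
    """Single head: softmax(QWq (KWk)^T / sqrt(d_k)) VWv."""
    q, k, v = Q @ Wq, K @ Wk, V @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def multi_head_attention(Q, K, V, heads, Wo):
    """Run M heads in parallel, concatenate, and apply the output projection Wo."""
    Z = np.concatenate([attention_head(Q, K, V, *h) for h in heads], axis=-1)
    return Z @ Wo
```

For self-attention, Q, K and V are all the same hidden-state matrix of one sentence.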

Particle-Optimization Sampling
Particle-optimization sampling is a recently developed Bayesian sampling technique that interactively transports a set of particles/samples to a target distribution p by minimizing the KL divergence between the particle density and p. In our case, p is a posterior distribution, p(θ|D) ∝ exp(−U(θ)), of the parameter θ ∈ R^d over an observed dataset D = {D_i}_{i=1}^N, where

U(θ) = −∑_{i=1}^N log p(D_i | θ) − log p_0(θ)

is called the potential energy, with p_0 a prior over θ. In our case, the model parameter θ could be one or several of the attention parameters, such as W_i^Q; for simplicity, we stick to θ in the presentation. In particle-optimization sampling, a total of M particles {θ^{(i)}}_{i=1}^M are updated iteratively to approximate p(θ|D). In this paper, we use two representative algorithms for sampling: Stein Variational Gradient Descent (SVGD) and Stochastic Particle-Optimization Sampling (SPOS).
SVGD In SVGD (Liu and Wang, 2016), the i-th particle in the (ℓ+1)-th iteration is updated with stepsize ε_{ℓ+1} as

θ_{ℓ+1}^{(i)} = θ_ℓ^{(i)} + ε_{ℓ+1} φ(θ_ℓ^{(i)}) ,   (3)
φ(θ) = (1/M) ∑_{j=1}^M [ −κ(θ_ℓ^{(j)}, θ) ∇_{θ_ℓ^{(j)}} U(θ_ℓ^{(j)}) + ∇_{θ_ℓ^{(j)}} κ(θ_ℓ^{(j)}, θ) ] ,   (4)

where κ(·, ·) is a positive definite kernel (e.g., the RBF kernel). The two terms in φ play different roles: the first term drives the particles towards high-density regions of p(θ|D), whereas the second term acts as a repulsive force that prevents all the particles from collapsing together into local modes of p(θ|D).
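One SVGD step for a 1-D toy posterior can be written directly from (3)–(4). This is an illustrative sketch with a fixed RBF bandwidth `h` (not the paper's full implementation); `svgd_update` takes ∇ log p(θ|D) = −∇U(θ) per particle.

```python
import numpy as np

def svgd_update(theta, grad_logp, h, stepsize):
    """One SVGD step on particles theta of shape (M, d).
    phi_i = (1/M) sum_j [ k(theta_j, theta_i) grad_logp_j + grad_{theta_j} k(theta_j, theta_i) ]."""
    M = theta.shape[0]
    diff = theta[:, None, :] - theta[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / h)                     # RBF kernel matrix (symmetric)
    drive = K @ grad_logp                                    # pulls particles to high density
    rep = 2.0 / h * (K.sum(1)[:, None] * theta - K @ theta)  # kernel-gradient repulsion
    return theta + stepsize * (drive + rep) / M
```

Running this on a standard Gaussian target (grad_logp(θ) = −θ) moves an off-center cloud of particles toward the mode while the repulsive term keeps them spread out.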
SPOS Despite its significant empirical success, SVGD suffers from a theoretical pitfall under certain conditions, where particles tend to collapse. To overcome this, SPOS generalizes SVGD to a stochastic setting by injecting random noise into the particle updates:

θ_{ℓ+1}^{(i)} = θ_ℓ^{(i)} + ε_{ℓ+1} φ(θ_ℓ^{(i)}) + √(2 β^{-1} ε_{ℓ+1}) ξ_ℓ^{(i)} ,   (5)

where ξ_ℓ^{(i)} is the injected random Gaussian noise, which enhances the ability to escape local modes, leading to better ergodic properties than standard SVGD.
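The SPOS step differs from SVGD only by the additive Gaussian noise term scaled by the inverse temperature β. A simplified self-contained sketch (fixed bandwidth `h`; names are ours, not the original implementation):

```python
import numpy as np

def spos_update(theta, grad_logp, h, stepsize, beta, rng):
    """One SPOS step: SVGD drift plus sqrt(2 * stepsize / beta) Gaussian noise."""
    M, d = theta.shape
    diff = theta[:, None, :] - theta[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / h)
    drive = K @ grad_logp
    rep = 2.0 / h * (K.sum(1)[:, None] * theta - K @ theta)
    phi = (drive + rep) / M
    noise = np.sqrt(2.0 * stepsize / beta) * rng.standard_normal((M, d))
    return theta + stepsize * phi + noise
```

Larger β shrinks the noise toward the deterministic SVGD update; smaller β makes the dynamics closer to Langevin sampling.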

A Bayesian-Inference Perspective of Multi-Head Attention
In this section, we interpret multi-head attention as Bayesian inference of a latent representation via particle-optimization sampling. We denote x and z as the input and output (latent representation) of the attention model, respectively. Single-head attention can be written as a deterministic mapping z = f_att(x; θ), with θ the parameter of the mapping. Standard multi-head attention defines multiple parallel attention mappings, each with independent parameters, whose attention features are aggregated via a function g(·) as

z = g( f_att(x; θ_1), …, f_att(x; θ_M) ) .   (6)

Next, we generalize (6) as a Bayesian inference problem for the latent representation z.

Attention as Bayesian Inference
We first generalize the deterministic transformation z = f_att(x; θ) to a stochastic generative process:

θ ∼ p(θ|D) ,  z = f_att(x; θ) ,

where a sample from the posterior of the global attention parameter θ, p(θ|D) ∝ p(D|θ) p(θ), is used as the parameter when generating the latent attention feature z. Bayesian inference for attention then computes the predictive distribution p(z|x, D) of the attentive latent representation z for a new input x given the training data D:

p(z|x, D) = ∫ δ_{f_att(x;θ)}(z) p(θ|D) dθ ,

where δ_z(·) is the delta function with point mass at z.
To enable efficient evaluation of the integral, we adopt Bayesian sampling for approximation, i.e., p(z|x, D) is approximated with a set of M samples/particles drawn from p(θ|D), leading to the following generative process:

θ_i ∼ p(θ|D) ,  z_i = f_att(x; θ_i) ,  i = 1, …, M ;   z = g(z_1, …, z_M) .   (7)

The above formulation defines a principled version of multi-head attention from a Bayesian view. One can see that (7) reduces to standard multi-head attention if all θ_i are treated independently, without sharing the common parameter distribution p(θ|D). In other words, our reformulation of multi-head attention is a stochastic generative process, and thus more general. Furthermore, efficient end-to-end learning can be performed by conducting repulsive Bayesian sampling for all parameters {θ_i}_{i=1}^M, which consequently diversifies the attention heads.
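The generative process (7) is easy to sketch: each posterior particle θ_i produces one attentive feature, and an aggregation function g combines them. Below is a toy illustration with a made-up single-head function `toy_att` and concatenation as g; both names are ours.

```python
import numpy as np

def toy_att(x, theta):
    """Illustrative single head: softmax weights over the n rows of x, then a weighted sum."""
    scores = x @ theta                    # (n,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ x                          # (d,) attentive feature

def multi_head_bayes(x, particles, f_att):
    """Process (7): z_i = f_att(x; theta_i) for each particle; aggregate by concatenation."""
    return np.concatenate([f_att(x, th) for th in particles])
```

If the particles are repulsive samples from p(θ|D), the concatenated features describe x from M distinct perspectives; if each θ_i is trained independently, this reduces to standard multi-head attention.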

Repulsive Attention Optimization
The Bayesian multi-head attention in (7) further inspires us to develop repulsive attention. The idea is to learn to generate repulsive samples from the posterior p(θ|D). We propose to adopt particle-optimization sampling methods, which explicitly encourage repulsiveness between samples. In our algorithm, the parameter of p(z|x; θ) for each head is considered one particle. Following the particle-optimization rules, the M heads {θ_i}_{i=1}^M are updated iteratively to approximate the posterior distribution of the attention parameter, p(θ|D).

Learning Repulsive Multi-head Attention
We propose to learn repulsive attention by replacing the standard updates of attention parameters via stochastic gradient descent (SGD) with particle-optimization sampling, while keeping the architecture of the multi-head attention unchanged. This procedure forms an efficient end-to-end training scheme similar to standard attention learning. Specifically, in standard multi-head attention, the parameter of every head is updated independently according to the respective gradient of a loss function. To achieve repulsive multi-head attention, we follow the particle-optimization sampling update rules (e.g., (3) and (4)) to update the parameter of every head, while keeping the SGD updates for the remaining parameters unchanged. Equations (4) and (5) can be viewed as modified gradients with an explicit repulsive force and can be integrated into any optimizer, e.g., Adam (Kingma and Ba, 2015). Note that ∇_{θ^{(i)}} U(θ^{(i)}) equals the gradient of θ^{(i)} in standard multi-head attention when the negative log-likelihood is used as the loss function and the prior of θ^{(i)} is assumed to be uniform. The learning algorithm is illustrated in Algorithm 1. In practice, the update of the M heads can be performed in parallel with efficient matrix operations.

Algorithm 1 Repulsive Multi-head Attention
Input: training data D, loss function, M attention heads {θ^{(i)}}_{i=1}^M, remaining parameters w
for each training iteration do
    sample a minibatch and run a forward pass
    backward and calculate gradients ∇_{θ^{(i)}} U(θ^{(i)}) and the gradients of w
    update each head θ^{(i)} with the particle-optimization rule (SVGD or SPOS)
    update the remaining parameters w with SGD/Adam
end for
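The head update can be implemented as a drop-in gradient modification: compute the negated SVGD direction from the per-head gradients and feed it to any descent optimizer. This is a minimal NumPy sketch under our own naming (`repulsive_grad`), with a fixed RBF bandwidth `h`:

```python
import numpy as np

def repulsive_grad(theta, grad_U, h):
    """Replace each head's gradient with the negated SVGD direction:
    g_i = (1/M) sum_j [ k(theta_j, theta_i) grad_U_j - grad_{theta_j} k(theta_j, theta_i) ],
    so that a descent step theta -= lr * g_i both lowers the loss and pushes heads apart."""
    M = theta.shape[0]
    diff = theta[:, None, :] - theta[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / h)                     # RBF kernel matrix
    drive = K @ grad_U                                       # smoothed loss gradients
    rep = 2.0 / h * (K.sum(1)[:, None] * theta - K @ theta)  # repulsion between heads
    return (drive - rep) / M
```

With a single head (M = 1) the kernel terms vanish and the update reduces exactly to the standard gradient, matching the observation above about the uniform prior.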

In-depth Analysis
Why Multi-head Attention? Our Bayesian interpretation of the attention mechanism naturally answers the question of why one needs multi-head attention. By treating each head as one sample, adopting multiple heads means using more samples to approximate an underlying posterior distribution. The question then becomes: should one always use more heads (samples)? Intuitively yes, because more samples typically yield more accurate approximations. However, this need not hold in practice, for two reasons: i) Overfitting: learning with a limited amount of data can easily cause overfitting, thus calling for a smaller model (fewer attention heads); ii) Numerical error: our proposed method for updating samples (attention-head parameters) is essentially a discrete numerical scheme for the corresponding continuous-time partial differential equation, i.e., the samples are not exact samples from the target distribution. Thanks to the recently developed theory for particle-optimization sampling, one can conclude that more heads accumulate more numerical error, leading to performance deterioration. More formally, when using M particles to approximate a target posterior distribution, there exists a gap (approximation error) between the particle distribution and the true distribution. Applied to our setting, this approximation error scales approximately in the order O(1/√M + M ε_0^{1/2}), where ε_0 is the stepsize; we refer to Theorem 10 of the corresponding convergence theory for a formal statement.
How Many Heads are Enough? The above error bound suggests a trade-off between approximation accuracy and the number of heads M. Specifically: i) when M is small, the 1/√M term in the bound dominates, so errors decrease (performance increases) with increasing M; ii) when M is large enough, the M ε_0^{1/2} term dominates, so a larger M actually increases the approximation error (decreases performance). These phenomena are consistent with our experimental results. We note that an exact form of the optimal M is not available, due to a number of unknown constants (omitted in the big-O notation). Therefore, one should use other means, such as cross-validation, to choose a good M in practice. Our argument also aligns with recent research, which found that more heads do not necessarily lead to better performance (Michel et al., 2019).
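The shape of this trade-off is easy to visualize numerically. The snippet below plugs an arbitrary illustrative stepsize into the two-term bound (the constants are unknown, so only the qualitative U-shape is meaningful):

```python
import numpy as np

def bound(M, eps0=1e-4):
    """Illustrative shape of O(1/sqrt(M) + M * eps0^(1/2)).
    First term: Monte Carlo error, shrinks with M; second: numerical error, grows with M."""
    return 1.0 / np.sqrt(M) + M * np.sqrt(eps0)

Ms = np.arange(1, 201)
errs = bound(Ms)
M_opt = int(Ms[np.argmin(errs)])   # interior minimum: the "optimal" number of heads
```

With these illustrative constants the minimum sits at a moderate M, with the error rising on both sides, mirroring the experimental curves in Figure 3.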

Experiments
We demonstrate the effectiveness of our method with representative multi-head attention models on a broad range of tasks including sentence classification, machine translation, language modeling and text generation. This section summarizes key results on different models. More detailed analysis and extra experiments are deferred to the appendix. To apply our approach, only the learning method of multi-head attention is adapted.

Self-attentive Sentence Classification
Model & Baselines We first apply our method to the self-attentive sentence classification model (Lin et al., 2017), which combines a BiLSTM with additive attention to learn a sentence embedding and then performs classification on it. We compare our method with the variant using standard multi-head attention (BiLSTM + MA) and the variant additionally applying the Frobenius regularization (BiLSTM + MA + R) to introduce diversity, as in Lin et al. (2017).
Tasks & Datasets Following Lin et al. (2017), three sentence classification tasks including author profiling, sentiment analysis, and textual entailment are evaluated on the Age, Yelp, and SNLI datasets respectively.
Results As shown in Table 1, with the proposed repulsive multi-head attention, the model achieves higher accuracy on all three tasks, especially on the sentiment analysis task, where one sentence often contains multiple aspects. Our methods also outperform the regularization method proposed in Lin et al. (2017). Between the two particle-optimization rules, SPOS achieves better performance, owing to the extra stochasticity discussed in Section 3. We further evaluate the diversity of the multiple heads by calculating the average distance between each pair of latent representations. The results show that our methods indeed enforce heads to be more diverse than standard multi-head attention. The lower diversity of the regularization-based method also supports our argument in Appendix C.6.
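One simple way to compute such a head-diversity score is the mean pairwise Euclidean distance between the M head feature vectors. This is an illustrative sketch of that idea, not necessarily the authors' exact metric:

```python
import numpy as np

def head_diversity(Z):
    """Mean pairwise Euclidean distance between the M rows (head features) of Z, shape (M, d)."""
    M = Z.shape[0]
    dists = [np.linalg.norm(Z[i] - Z[j])
             for i in range(M) for j in range(i + 1, M)]
    return float(np.mean(dists))
```

Collapsed heads (identical rows) score 0; the more dissimilar the head features, the larger the score.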

Repulsive-attention visualization
We further visualize attention maps in the learned sentence embedding space in Figure 1. It is interesting to see that attention collapse indeed happens in standard multi-head attention, where almost all heads focus on one single factor, "amazing". On the contrary, the proposed method captures multiple key factors in the review that are strong indicators of the sentiment behind the sentence. For example, "downfall" and "service was passing" are key factors of this 4-star review captured by our repulsive attention.

Transformer-based Neural Translation
Model & Baselines The Transformer (Vaswani et al., 2017) is a representative multi-head-attention-based model. We apply the proposed repulsive multi-head attention (RMA) to it, and compare our method with the original model (MA) and with the disagreement regularization method (R) (Li et al., 2018), which encourages diversity in attention via a cosine similarity penalty on attention outputs.

Tasks & Datasets Following Vaswani et al. (2017), we apply the Transformer to machine translation, with two standard translation datasets: the IWSLT14 German-to-English (De-En) dataset and the WMT14 English-to-German (En-De) dataset.
Results Results are presented in Table 2. With the repulsive multi-head attention, Transformer models achieve noticeable improvements in BLEU score on both datasets, compared with both baselines. It is also encouraging that Transformer-base-RMA, a much smaller model, achieves performance comparable to Transformer-big. As for training time, our approach takes slightly more time than the baseline, but is much more efficient than the regularization approach; a more detailed analysis of the computational complexity is provided in Appendix B.3. We also find that different layers benefit differently from repulsiveness. Remarkably, diversifying only the attention in the first layer achieves performance comparable to diversifying the attention in all layers, with little added computational time. This finding suggests that repulsiveness in the first layer's attention plays an important role in modelling language.
Redundancy in heads The redundancy problem in attention has been observed in recent works (Michel et al., 2019): a large percentage of attention heads can be removed at test time without significantly impacting performance. Following Michel et al. (2019), we analyze the redundancy in the Transformer by ablating each head at test time and evaluating the performance drop; the larger the drop, the more important the head. Figure 2 shows that the majority of heads in standard multi-head attention are redundant, since performance is comparable before and after masking. The repulsive attention largely alleviates this redundancy. More interestingly, standard attention exhibits many counter-intuitive cases in which removing a head increases performance; this does not seem to happen in the repulsive attention model, indicating that it better leverages the expressiveness of the multi-head mechanism.
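The ablation protocol above can be sketched generically: mask one head at a time and record the score drop relative to the full model. `score_fn` below is a placeholder for "evaluate the model under a given head mask" (our naming, not Michel et al.'s code):

```python
import numpy as np

def head_importance(score_fn, n_heads):
    """Importance of head h = score with all heads - score with head h masked out.
    score_fn takes a 0/1 mask vector of length n_heads and returns a scalar score."""
    full = score_fn(np.ones(n_heads))
    imps = []
    for h in range(n_heads):
        mask = np.ones(n_heads)
        mask[h] = 0.0
        imps.append(full - score_fn(mask))
    return np.array(imps)
```

A near-zero (or negative) importance marks a redundant head; repulsive attention should yield fewer of those.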

Language Representation Learning
Model ELECTRA (Clark et al., 2020) is an efficient approach to self-supervised language representation learning. It consists of two networks, a Generator and a Discriminator, both parameterized by Transformers. The pre-trained Discriminator is used in various downstream tasks via fine-tuning. We apply the proposed repulsive multi-head attention to ELECTRA (small setting) in the pre-training stage. We make only the first-layer attention of the Discriminator repulsive, according to the finding in Section 5.2 that diversifying the first layer's attention is most effective.

Results Results are shown in Table 3. For each task, we perform 50 trials of single-task fine-tuning and report the averaged results. The training time with and without repulsive attention is almost the same. Repulsive attention improves the baseline results (Clark et al., 2020) on seven out of eight GLUE tasks, with the largest gains on MNLI (the largest dataset in GLUE) and CoLA. This suggests that repulsive attention can yield better language representations. Following Phang et al. (2018), we use intermediate-task training for RTE, which first fine-tunes the pre-trained model on MNLI and then fine-tunes it on RTE. The repulsive attention outperforms the baseline method by a large margin in this setting. This is probably because repulsive attention particularly favors large data variability (e.g., the MNLI dataset), where different aspects of the data can be uniquely represented in different heads.

Graph-to-Text Generation
Model & Baselines GraphWriter (Koncel-Kedziorski et al., 2019) is a knowledge-graph-to-text model that generates a coherent multi-sentence abstract given a knowledge graph and a title. Its Transformer-style encoder is defined with graph attention modules (Velickovic et al., 2018), which can also be easily adapted to our method. We compare our method with the original model using standard multi-head attention, and with a variant applying a cosine similarity regularization on the attention parameters in the encoder layers.

Metrics We evaluate the quality of abstracts with three major metrics: BLEU (uni-gram to 4-gram BLEU) (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin and Hovy, 2003). In ROUGE, the unigram and bigram overlaps (ROUGE-1 and ROUGE-2) are a proxy for informativeness, while the longest common subsequence (ROUGE-L) represents fluency.

Results
The results are shown in Table 4. The GraphWriter model with repulsive multi-head attention significantly outperforms the original model and the regularization approach on all metrics. In particular, the higher recall scores in ROUGE show that more N-grams from the reference abstracts can be found in the generated abstracts. Similar observations hold when analyzing the generated examples in detail (an example is illustrated in Appendix E). Koncel-Kedziorski et al. (2019) pointed out one limitation of their model: 40% of the entities in the knowledge graphs do not appear in the generated text. With repulsive attention, remarkably, the GraphWriter model performs much better, with a 10% improvement in knowledge-graph coverage and fewer repeated clauses.
Human Evaluation To further illustrate the improvement of using diverse attention, we conduct human evaluation. Following Koncel-Kedziorski et al. (2019), we give 50 test datapoints to experts (5 computer science students) and ask them to provide per-criterion judgments for the generated abstracts. Comparisons of the two methods from 4 aspects are shown in Table 5. The human judgment indicates that the repulsive attention improves both the structure and informativeness of generated abstracts significantly, which is consistent with the automatic evaluation and our observations.

On the Number of Attention Heads
Our analysis in Section 4.2 suggests the existence of an optimal number of attention heads. To verify this, we conduct experiments on the sentence classification and translation tasks, varying the number of attention heads in the models. The results are shown in Figure 3. The model error/loss first decreases and then increases w.r.t. M, the number of attention heads. The optimal M is around 20 for sentiment analysis and around 4 for the Transformer. Interestingly, the Transformer degrades quickly as the number of heads increases. This might be because the constant in front of the M ε_0^{1/2} term in the bound is large, making this term dominate quickly as M grows. Furthermore, the standard multi-head attention follows the same trend, but performs much worse and is more sensitive to M. This indicates the benefit of Bayesian modeling, which usually stabilizes a model better.

Related Work
We provide a first explanation of multi-head attention from a Bayesian perspective, and propose particle-optimization sampling for repulsive attention. Most previous works aim at improving attention diversity with regularization-based methods, e.g., the Frobenius regularization on attention weights in Lin et al. (2017) and the cosine similarity regularization on attention outputs in Li et al. (2018). These works focus on a particular model, and their underlying working principle has not been well validated; our approach is a principled one that is more interpretable and widely applicable. Attention collapse is an instance of the broader feature-overlapping problem, which also arises in other areas. Some works tackle this problem by changing architectures; for example, ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) implicitly reduce feature correlations by summing or concatenating activations from previous layers. Other works alter the training method, as we do: Han et al. (2017) adopt the dropout mechanism and propose a dense-sparse-dense training flow for regularizing deep neural networks, and Prakash et al. (2019) address unnecessary overlap in the features captured by image filters with a pruning-and-restoring training scheme. To our knowledge, we are the first to tackle the attention-feature overlap problem from a Bayesian view with a principled interpretation.

Conclusion
We propose a principled way of understanding multi-head attention from a Bayesian perspective. We apply particle-optimization sampling to train repulsive multi-head attention without additional parameters or explicit regularizers. Our Bayesian framework addresses the long-standing questions of why and how multi-head attention affects model performance. Extensive experimental results on representative attention models demonstrate that our approach significantly improves the diversity in multi-head attention, resulting in more expressive attention models with performance improvements on a wide range of tasks.

A Additive Attention
First proposed by Bahdanau et al. (2015), additive attention uses a one-hidden-layer feed-forward network to calculate the attention alignment. We use the attention function in Lin et al. (2017), which is also a self-attention, as an example; it extracts a latent representation of a sentence. The single-head attention function is

a = softmax( v^⊤ tanh(W H^⊤) ) ,  z = a H ,

where H ∈ R^{n×d} is the hidden state matrix of a sentence with n words, each embedded in a d-dimensional vector; a ∈ R^{1×n} is the normalized alignment score vector over the words; and W ∈ R^{d_a×d} and v ∈ R^{d_a×1} are attention parameters. The final sentence representation z is the sum of the words' hidden states weighted by the attention vector. In order to capture the overall semantics of the sentence instead of a specific component, multi-head attention can be applied as

A = softmax( V^⊤ tanh(W H^⊤) ) ,  Z = A H ,

where V ∈ R^{d_a×M} is the parameter matrix for the M heads, A ∈ R^{M×n} is the M-head attention matrix, and Z ∈ R^{M×d} is the resulting sentence representation matrix containing semantics from multiple aspects.
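The multi-head additive self-attention above translates directly into NumPy. A minimal sketch (our naming; shapes follow the definitions above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_self_attention(H, W, V):
    """A = softmax(V^T tanh(W H^T)); Z = A H.
    H: (n, d) word hidden states, W: (da, d), V: (da, M) -> Z: (M, d)."""
    A = softmax(V.T @ np.tanh(W @ H.T), axis=-1)   # (M, n) attention matrix
    return A @ H                                    # (M, d) sentence representation
```

Setting M = 1 (V a single column) recovers the single-head function with z a single weighted sum of hidden states.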

B.1 Cyberbullying Detection
Tasks & Datasets To further assess the effectiveness of our model, we apply it to the abusive-comment detection task. Two datasets, from Twitter and Wikipedia talk pages, are used. Our goal is to distinguish bullying comments from non-bullying ones.
Baselines We have tested several baseline models and found the following most effective: (1) BiLSTM + MA: A sentence embedding model with BiLSTM and standard multi-head attention (MA).
(2) CNN: A classical temporal Convolutional Neural Networks to process text.

Main Results
To evaluate the capability of our model to detect attack-annotated comments, several metrics are adopted, including the area under the receiver operating characteristic curve (AUC@ROC) and under the precision-recall curve (AUC@PR). Compared to other metrics such as the F1 score and the Matthews correlation coefficient (MCC ∈ [−1, 1], 1 for perfect prediction (Matthews, 1975)), AUC@PR gives a more informative picture of a model's performance, especially when the datasets are highly skewed, as is our case (Davis and Goadrich, 2006). Table 6 shows the results. Similar to the sentence classification tasks, with repulsive attention our model performs better than all the baselines.

B.2 Experimental Details
For our approach, the RBF kernel κ(x, y) = exp(−(1/h) ‖x − y‖²₂) with bandwidth h = med² / log M is used as the kernel function, where med denotes the median of the pairwise distances between the current particles. The prior distribution of attention parameters is assumed to be uniform. We find that adding a repulsive weight before the repulsive term (i.e., the second term in Eq. 4) in the particle-optimization update rules helps adjust the degree of diversity in attention and achieves better performance. In our experiments, we adopt this trick and use the hyper-parameter α to denote the repulsive weight. Since our method only modifies the learning of attention, all models and settings in our experiments are kept the same as in the corresponding previous work unless stated otherwise.
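The median-heuristic bandwidth and kernel above can be sketched as follows (illustrative naming; `median_bandwidth` implements h = med² / log M):

```python
import numpy as np

def median_bandwidth(theta):
    """h = med^2 / log M, where med is the median pairwise distance between particles.
    theta: (M, d) array of particles."""
    M = theta.shape[0]
    dists = [np.linalg.norm(theta[i] - theta[j])
             for i in range(M) for j in range(i + 1, M)]
    med = np.median(dists)
    return med ** 2 / np.log(M)

def rbf(x, y, h):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / h)."""
    return np.exp(-np.sum((x - y) ** 2) / h)
```

The median heuristic adapts h to the current particle spread, so the repulsive force stays at a sensible scale throughout training.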

B.2.1 Self-attentive Sentence Classification
Dataset Three tasks are conducted on three public sentence classification datasets. Author profiling (Age dataset 1) predicts the age range of a user from their tweets. Sentiment analysis (Yelp dataset 2) predicts the number of stars a user assigned by analyzing their review. Textual entailment (SNLI dataset 3) determines whether the relation between two sentences is entailment, contradiction, or neutral. Following Lin et al. (2017), the train / validate / test split of Age is 68485 / 4000 / 4000, of Yelp is 500K / 2000 / 2000, and of SNLI is 550K / 10K / 10K.

Experimental settings
We implement the standard multi-head attention model of Lin et al. (2017) following its settings, except that we use the Spacy toolkit 4 as the tokenizer and GloVe 5 (GloVe 840B 300D) as the pre-trained word embedding. For repulsive multi-head attention learning, we keep all settings the same as in the standard model (Lin et al., 2017). The hyper-parameters in the particle-optimization rules (the stepsize, α, and β) are manually tuned for each task. We train and evaluate all models with 10 random seeds and compare their average performance. Models are trained on one TITAN Xp GPU.

B.2.2 Cyberbullying Detection
Dataset Two cyberbullying detection datasets 6 are used in this experiment. The train / validate / test split of Twitter is 9572 / 3254 / 3254, and of Wikipedia is 69987 / 22952 / 22952.

C Additional Analysis of Our Approach

C.1 Comparison with SGLD
We also conducted a comparison of our method with Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011), which is also a Bayesian sampling method. Results are in Table 8. Though the random noise introduced by SGLD might help achieve diversity, it is sub-optimal; adding the explicit repulsive term via particle optimization is more effective.

C.2 What Prior to Use?
In our approach, the repulsiveness is imposed by the inference algorithm (i.e., SVGD), not by the prior. To study the impact of different priors, we also tested a Gaussian prior. We found (see Table 9) that different priors have little impact on the final results, i.e., there is no consistent winner across priors. This suggests that the prior has little impact on repulsiveness in our framework, but one can still impose prior knowledge of the attention to help our algorithm learn a better attention model. We leave this for future work.

C.3 Which Attention to Diversify?

There are three types of attention in the Transformer: self-attention in the encoder, self-attention in the decoder, and inter-attention between the encoder and decoder. We conduct extra experiments on Transformer-small to investigate which attention module benefits most from the repulsiveness. Results are shown in Table 10. We first apply the repulsive attention to each of the {Q, K, V} parameters in every attention module for all layers. The results indicate that diversifying the V-parameter seems to yield the best performance. We then compare repulsive attention inside the encoder, inside the decoder, and between them, respectively. The results show improvement in all cases, with diversifying the inter-attention achieving the most benefit. Finally, we diversify the attention in different layers of the Transformer. The results suggest that diversifying only the first layer's attention achieves performance comparable to diversifying all layers, with little added computational time.

C.4 Improved Calibration
A reliable model must not only be accurate, but also indicate when it is likely to be wrong; that is, the confidence of a well-calibrated model should be indicative of the actual likelihood of correctness.
In high-risk applications, confident but wrong predictions can be especially harmful. The Overconfidence Error (OE) is defined for this case as

OE = ∑_m (|B_m| / n) · conf(B_m) · max( conf(B_m) − acc(B_m), 0 ) ,

where B_m is the set of predictions whose confidence falls into the m-th bin, conf(B_m) and acc(B_m) are the average confidence and accuracy within that bin, and n is the total number of predictions.
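The binned overconfidence error can be computed as below. This is a sketch assuming the standard equal-width-binned definition (bin count and naming are ours); it penalizes only bins where average confidence exceeds average accuracy.

```python
import numpy as np

def overconfidence_error(confs, correct, n_bins=10):
    """OE = sum_m (|B_m|/n) * conf(B_m) * max(conf(B_m) - acc(B_m), 0).
    confs: predicted confidences in [0, 1]; correct: 1 if prediction correct, else 0."""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    oe = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confs > lo) & (confs <= hi)
        if in_bin.any():
            conf_b = confs[in_bin].mean()
            acc_b = correct[in_bin].mean()
            oe += in_bin.mean() * conf_b * max(conf_b - acc_b, 0.0)
    return oe
```

A perfectly calibrated (or under-confident) model scores 0; a model that is confident and wrong scores high.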
As shown in Figure 4, the standard attention model is prone to be over-confident, meaning that the accuracy is likely to be lower than what is indicated by the predictive score. With the proposed repulsive training of attention, this over-confidence is alleviated.

Table 11: Adapting the cosine similarity regularization on attention parameters gradually to our framework. Accuracy of the model is evaluated on the test sets of the Age and Yelp datasets.