Multi-Head Attention with Disagreement Regularization

Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.


Introduction
The attention model is now a standard component of deep learning networks, contributing to impressive results in neural machine translation (Luong et al., 2015), image captioning (Xu et al., 2015), and speech recognition (Chorowski et al., 2015), among many other applications. Recently, Vaswani et al. (2017) introduced a multi-head attention mechanism to capture different contexts with multiple individual attention functions.
One strong point of multi-head attention is the ability to jointly attend to information from different representation subspaces at different positions. However, there is no mechanism to guarantee that different attention heads indeed capture distinct features. In response to this problem, we introduce a disagreement regularization term to explicitly encourage the diversity among multiple attention heads. The disagreement regularization serves as an auxiliary objective to guide the training of the related attention component.
* Zhaopeng Tu is the corresponding author of the paper. This work was mainly conducted when Jian Li and Baosong Yang were interning at Tencent AI Lab.
Specifically, we propose three types of disagreement regularization, which are applied to the three key components involved in computing the feature vector with multi-head attention. Two of the regularization terms maximize the cosine distances among the input subspaces and among the output representations, respectively, while the third disperses the positions attended by multiple heads using element-wise multiplication of the corresponding attention matrices. The three regularization terms can be used either individually or in combination.
We validate our approach on top of the advanced TRANSFORMER model (Vaswani et al., 2017) for both English⇒German and Chinese⇒English translation tasks. Experimental results show that our approach consistently improves translation performance across language pairs. One encouraging finding is that TRANSFORMER-BASE with disagreement regularization achieves performance comparable to TRANSFORMER-BIG, while training nearly twice as fast.
Figure 1: Illustration of the multi-head attention, which jointly attends to different representation subspaces (colored boxes) at different positions (darker color denotes higher attention probability). The example sentence is "Bush held a talk with Sharon", attended by two heads (head1 and head2).
The attention mechanism aims at modeling the strength of relevance between representation pairs, such that a representation is allowed to build a direct relation with another representation. Instead of performing a single attention function, Vaswani et al. (2017) found it beneficial to capture different contexts with multiple individual attention functions. Figure 1 shows an example of a two-head attention model. For the query word "Bush", the green and red heads pay attention to the different positions "talk" and "Sharon", respectively.
An attention function softly maps a sequence of queries $Q = \{Q_1, \dots, Q_N\}$ and a set of key-value pairs $\{K, V\} = \{(K_1, V_1), \dots, (K_M, V_M)\}$ to outputs. More specifically, the multi-head attention model first transforms $Q$, $K$, and $V$ into $H$ subspaces with different, learnable linear projections. Each head $h$ then produces an attention distribution $A^h$ over the keys and applies it to its projected values to obtain an output state $O^h$. Finally, the output states are concatenated to produce the final state.
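For reference, the computation described above can be written out as follows. This is a sketch that assumes the standard scaled dot-product formulation of Vaswani et al. (2017); the projection matrices $W_h^Q$, $W_h^K$, $W_h^V$ and the per-head key dimension $d_k$ are notation introduced here for illustration.

```latex
\begin{align}
Q^h &= Q W_h^Q, \qquad K^h = K W_h^K, \qquad V^h = V W_h^V, \\
A^h &= \mathrm{softmax}\!\left( \frac{Q^h (K^h)^{\top}}{\sqrt{d_k}} \right), \\
O^h &= A^h V^h, \\
O   &= [\, O^1; \dots; O^H \,].
\end{align}
```

Under this reading, the three quantities that the disagreement terms act on are the projected values $V^h$, the attention distributions $A^h$, and the head outputs $O^h$.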

Approach
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. To further encourage this diversity, we enlarge the distances among multiple attention heads with disagreement regularization (Section 3.1). Specifically, we propose three types of disagreement regularization to encourage each head vector $O^i$ to be different from other heads (Section 3.2).

Framework
In this work, we take machine translation as the application. Given a source sentence $x$ and its translation $y$, a neural machine translation model is trained to maximize the conditional translation probability over a parallel training corpus. We introduce an auxiliary regularization term to encourage the diversity among multiple attention heads. Formally, the training objective is revised to maximize the sum of the translation likelihood and the weighted regularization term $\lambda \cdot D(\cdot)$ computed on the referred attention matrices $a$, where $\lambda$ is a hyper-parameter that is empirically set to 1.0 in this paper. The auxiliary regularization term $D(\cdot)$ guides the related attention component to capture different features from the corresponding projected subspaces.
Note that the introduced regularization term works like L1 and L2 terms, which do not introduce any new parameters and only influence the training of the standard model parameters.
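Written out, the revised objective described above takes roughly the following form. This is a sketch reconstructed from the prose; the symbols $\hat{\theta}$, $\theta$, and $L$ (the learned parameters, the model parameters, and the log-likelihood of the translation) are notation assumed here.

```latex
\[
\hat{\theta} = \arg\max_{\theta} \Big\{ L(y \mid x; \theta) + \lambda \cdot D(a \mid x, y; \theta) \Big\}
\]
```

Because $D(\cdot)$ measures disagreement, maximizing it pushes the heads apart, and with $\lambda = 1.0$ the likelihood and the regularization term are weighted equally.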

Disagreement Regularization
Three types of regularization term, which are applied to three parts of the original multi-head attention, are introduced in this section.
Disagreement on Subspaces (Sub.) This disagreement is designed to maximize the cosine distance between the projected values. Specifically, we first calculate the cosine similarity $\cos(\cdot)$ between the vector pair $V^i$ and $V^j$ in different value subspaces through the dot product of the normalized vectors, which measures the cosine of the angle between $V^i$ and $V^j$. The cosine distance is then defined as the negative similarity, i.e., $-\cos(\cdot)$. Our training objective is to enlarge the average cosine distance among all head pairs, so the regularization term is the negative cosine similarity averaged over all pairs of heads.
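The cosine-based term can be illustrated with a minimal sketch. It assumes that each head is summarized by a single vector and that the average runs over distinct head pairs; both choices, and the function names `cosine` and `cosine_disagreement`, are assumptions made for illustration rather than details taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity: dot product of the two length-normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_disagreement(heads):
    """Negative cosine similarity averaged over distinct head pairs.

    `heads` holds one vector per attention head (an assumption of this sketch).
    """
    pairs = list(combinations(heads, 2))
    total = sum(cosine(u, v) for u, v in pairs)
    return -total / len(pairs)


# Two heads carrying the same direction versus two perpendicular heads.
identical = [[1.0, 2.0], [1.0, 2.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
print(cosine_disagreement(identical))
print(cosine_disagreement(orthogonal))
```

In this sketch, identical heads give a score close to -1, while perpendicular heads give a score of zero, so a larger value corresponds to more diverse heads.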

Disagreement on Attended Positions (Pos.)
Another strategy is to disperse the attended positions predicted by multiple heads. Inspired by the agreement regularization (Liang et al., 2006; Cheng et al., 2016), which encourages multiple alignments to be similar, we deploy a variant of the original term by introducing an alignment disagreement regularization. Formally, we employ the sum of the element-wise multiplication of corresponding matrix cells to measure the similarity between the attention matrices $A^i$ and $A^j$ of two heads; the regularization term is the negative of this similarity averaged over all head pairs.
Disagreement on Outputs (Out.) This disagreement directly applies regularization on the outputs of each attention head by maximizing the difference among them. Similar to the subspace strategy, we employ the negative cosine similarity between the output pair $O^i$ and $O^j$ to measure the distance.
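The attended-position term can be sketched in the same spirit. The example below assumes that each head's attention matrix is a list of rows (one distribution over keys per query) and that the average runs over distinct head pairs; the names `position_similarity` and `position_disagreement` are hypothetical.

```python
from itertools import combinations


def position_similarity(a_i, a_j):
    """Sum of element-wise products of two attention matrices (lists of rows)."""
    return sum(
        p * q
        for row_i, row_j in zip(a_i, a_j)
        for p, q in zip(row_i, row_j)
    )


def position_disagreement(attn):
    """Negative similarity averaged over distinct head pairs (an assumption)."""
    pairs = list(combinations(attn, 2))
    total = sum(position_similarity(a, b) for a, b in pairs)
    return -total / len(pairs)


# Two queries and three keys; every row is a distribution that sums to 1.
same_focus = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
]
spread_out = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
]
print(position_disagreement(same_focus))
print(position_disagreement(spread_out))
```

Heads that put their probability mass on the same cells are penalized most, while heads whose attended positions never overlap reach the largest value of zero. The output term uses the cosine form of the subspace term, applied to the head outputs instead of the projected values.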

Related Work
The regularization on attended positions is inspired by agreement learning in prior works, which encourages alignments or hidden variables of multiple models to be similar. Liang et al. (2006) first assigned agreement terms for jointly training word alignment in phrase-based statistical machine translation (Koehn et al., 2003). The idea was further extended to other natural language processing tasks such as grammar induction (Liang et al., 2008). Levinboim et al. (2015) extended the agreement to general bidirectional sequence alignment models with model invertibility regularization. Cheng et al. (2016) further explored agreement in modeling the source-target and target-source alignments in neural machine translation models. In contrast to the mentioned approaches, which add agreement terms to the loss function, we deploy an alignment disagreement regularization by maximizing the distance among multiple attention heads.
As the standard multi-head attention model lacks effective control over the influence of different attention heads, Ahmed et al. (2017) used a weighted mechanism to combine them rather than simple concatenation. As an alternative approach to multi-head attention, Shen et al. (2018a) and Shen et al. (2018b) extended the single relevance score to multi-dimensional attention weights, demonstrating the effectiveness of modeling multiple features for attention networks. Our approach is complementary to theirs: our model encourages the diversity among multiple heads, while theirs enhance the power of each head.

Experiments
Byte-pair encoding (Sennrich et al., 2016) is applied with 32K merge operations for both language pairs. We use the case-sensitive 4-gram NIST BLEU score (Papineni et al., 2002) as the evaluation metric, and the sign-test (Collins et al., 2005) for statistical significance testing. We evaluate the proposed approaches on the advanced TRANSFORMER model (Vaswani et al., 2017), and implement them on top of an open-source toolkit, THUMT (Zhang et al., 2017). We follow Vaswani et al. (2017) to set the configurations and have reproduced their reported results on the En⇒De task. All the evaluations are conducted on the test sets. We have tested both Base and Big models, which differ in hidden size (512 vs. 1024) and number of attention heads (8 vs. 16). We study model variations with the Base model on the Zh⇒En task (Sections 5.2 and 5.3), and evaluate overall performance with the Big model on both Zh⇒En and En⇒De tasks (Section 5.4).

Effect of Regularization Terms
System                   Architecture       Zh⇒En            En⇒De
                                            Speed   BLEU     Speed   BLEU
Existing NMT systems
                         GNMT               n/a     n/a      n/a     26.30
(Gehring et al., 2017)   CONVS2S            n/a     n/a      n/a     26.36
(Vaswani et al., 2017)   TRANSFORMER-BASE   n/a     n/a      n/a     27.3
                         TRANSFORMER-BIG    n/a     n/a      n/a     28.4
(Hassan et al., 2018)    TRANSFORMER-BIG    n/a     24.2     n/a     n/a
Our NMT systems
this work

Table 2: Effect of regularization on different attention networks, i.e., encoder self-attention ("Enc"), encoder-decoder attention ("E-D"), and decoder self-attention ("Dec").

In this section, we evaluate the impact of different regularization terms on the Zh⇒En task using TRANSFORMER-BASE. For simplicity and efficiency, here we only apply regularizations on the encoder side. As shown in Table 1, all the models with the proposed disagreement regularizations (Rows 2-4) consistently outperform the vanilla TRANSFORMER (Row 1). Among them, the Output term performs best, at +0.65 BLEU over the baseline model, while the Position term is less effective than the other two. In terms of training speed, we do not observe an obvious decrease, which in turn demonstrates the advantage of our disagreement regularizations.
However, combinations of different disagreement regularizations fail to further improve translation performance (Rows 5-7). One possible reason is that the different regularization terms provide overlapping guidance, so combining them introduces little new information while making training more difficult.

Effect on Different Attention Networks
The TRANSFORMER consists of three attention networks, including encoder self-attention, decoder self-attention, and encoder-decoder attention. In this experiment, we investigate how each attention network benefits from the disagreement regularization. As seen from Table 2, all models consistently improve upon the baseline model. When applying disagreement regularization to all three attention networks, we achieve the best performance, which is +0.72 BLEU score better than the baseline model. The training speed decreases by 12%, which is acceptable considering the performance improvement.

Main Results
Finally, we validate the proposed disagreement regularization on both WMT17 Chinese-to-English and WMT14 English-to-German translation tasks. Specifically, we adopt the Output disagreement regularization, which is applied to all three attention networks. The results are summarized in Table 3. We can see that our implementation of TRANSFORMER outperforms all existing NMT systems, and matches the results of TRANSFORMER reported in previous works. Incorporating disagreement regularization consistently improves translation performance for both base and big TRANSFORMER models across language pairs, demonstrating the effectiveness of the proposed approach. It is encouraging to see that TRANSFORMER-BASE with disagreement regularization achieves performance comparable to TRANSFORMER-BIG, while training nearly twice as fast.
Table 4: Effect of different regularization terms on the three disagreement measurements. "n/a" denotes the baseline model without any regularization term. Larger value denotes more disagreement (at most 1.0).

Quantitative Analysis of Regularization
In this section, we empirically investigate how the regularization terms affect the multi-head attention. To this end, we compare the disagreement scores on subspaces ("Sub."), attended positions ("Pos."), and outputs ("Out."). Since the scores are negative values, we list exp(D) for readability, which has a maximum value of 1.0. Table 4 lists the results of encoder-side multi-head attention on the Zh⇒En validation set. As seen, the disagreement score on the individual component indeed increases with the corresponding regularization term. For example, the disagreement of outputs increases to almost 1.0 by using the Output regularization, which means that the output vectors are almost perpendicular to each other as we measure the cosine distance as the disagreement.
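The exp(D) transformation used above can be illustrated with a small numeric sketch. It assumes that a disagreement value D is never positive, which holds when the underlying similarities are non-negative; the function name `readable_score` is hypothetical.

```python
import math


def readable_score(d):
    """Map a non-positive disagreement value D to exp(D), which lies in (0, 1]."""
    return math.exp(d)


# D = 0: heads share no similarity, which gives the maximum score.
print(readable_score(0.0))
# D = -1: e.g. a cosine-based term when all heads point in the same direction.
print(readable_score(-1.0))
# More negative D values are squeezed towards zero.
print(readable_score(-5.0))
```

Since the exponential is monotonic, the ordering of the models is unchanged: a value of 0 maps to 1.0, a value of -1 maps to roughly 0.37, and strongly negative values approach zero.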
One interesting finding is that attending to different positions may not be the essential strength of multi-head attention on the translation task. As seen, the disagreement score on the attended positions for the standard multi-head attention is only 0.007, which indicates that almost all the heads attend to the same positions. Table 5 shows the disagreement scores on attended positions across encoder layers. Except for the 1st layer, which attends to the input word embeddings, the disagreement scores on the other layers (i.e., ranging from the 2nd to the 6th layer) are very low, which confirms our above hypothesis.
Concerning the regularization terms, only the one on positions increases the disagreement score on the attended positions; the other two (i.e., "Sub." and "Out.") do not. This can explain why the positional regularization term does not work well with the other two terms, as shown in Table 1. This is also consistent with the finding in (Tu et al., 2016), which indicates that neural networks can model linguistic information in their own way. In contrast to attended positions, it seems that multi-head attention prefers to encode the differences among multiple heads in the learned representations.

Conclusion
In this work, we propose several disagreement regularizations to augment the multi-head attention model, which encourage the diversity among attention heads so that different heads can learn distinct features. Experimental results across language pairs validate the effectiveness of the proposed approaches.
The models also suggest a wide range of potential advantages and extensions, from being able to improve the performance of multi-head attention in other tasks such as reading comprehension and language inference, to being able to combine with other techniques (Shaw et al., 2018; Shen et al., 2018a; Dou et al., 2018; Yang et al., 2018) to further improve performance.