Mixed Multi-Head Self-Attention for Neural Machine Translation

Recently, the Transformer has become a state-of-the-art architecture in the field of neural machine translation (NMT). A key point of its high performance is multi-head self-attention, which is supposed to allow the model to independently attend to information from different representation subspaces. However, there is no explicit mechanism to ensure that different attention heads indeed capture different features, and in practice redundancy occurs across heads. In this paper, we argue that using the same global attention in multiple heads limits the capacity of multi-head self-attention for learning distinct features. To improve the expressiveness of multi-head self-attention, we propose a novel Mixed Multi-Head Self-Attention (MMA) which models not only global and local attention but also forward and backward attention in different attention heads. This enables the model to learn distinct representations explicitly among multiple heads. In experiments on both the WAT17 English-Japanese and IWSLT14 German-English translation tasks, we show that, without increasing the number of parameters, our models yield consistent and significant improvements (0.9 BLEU on average) over the strong Transformer baseline.

Among the different architectures, the Transformer (Vaswani et al., 2017) has recently attracted the most attention in neural machine translation, due to its high parallelization in computation and improvements in translation quality. A key point of its high performance is multi-head self-attention, which allows the model to jointly attend to information from different representation subspaces at different positions. There is a considerable gap (around 1 BLEU score) between the performance of the Transformer with only one head and with eight heads (Vaswani et al., 2017; Chen et al., 2018).
However, all encoder self-attention heads fully take global information into account, and there is no explicit mechanism to ensure that different attention heads indeed capture different features (Li et al., 2018). According to recent studies, the majority of the encoder self-attention heads can even be pruned away without substantially hurting the model's performance (Voita et al., 2019; Michel et al., 2019). Moreover, multi-head self-attention has recently come into question (Tang et al., 2018) for its limited capacity to capture local information (Luong et al., 2015; Wu et al., 2019) and sequential information (Shaw et al., 2018; Dehghani et al., 2019).
Motivated by the above findings, we attribute the redundancy arising in encoder self-attention heads to the use of the same global self-attention in all attention heads. Because of this redundancy, multi-head self-attention is unable to leverage its full capacity for learning distinct features in different heads. In response, we propose in this paper a novel Mixed Multi-Head Self-Attention (MMA) which captures distinct features in different heads explicitly through different attention functions. Concretely, MMA is composed of four attention functions: Global Attention, which models dependencies between arbitrary words directly; Local Attention, where the attention scope is restricted for exploring local information; and Forward and Backward Attention, which attend to words from the future and from the past respectively, serving as a means of modeling sequence order. MMA enables the model to learn distinct representations explicitly in different heads and improves the expressive capacity of multi-head self-attention. Besides, our method is achieved simply by adding hard masks before calculating attention weights; the rest is the same as the original Transformer. Hence our method introduces no additional parameters and does not affect training efficiency.

Figure 1: The architecture of Transformer with Mixed Multi-Head Self-Attention
The primary contributions of this work can be summarized as follows:
• We propose a novel Mixed Multi-Head Self-Attention (MMA) that extracts different aspects of features in different attention heads.
• Experimental results on two language pairs demonstrate that the proposed model consistently outperforms the vanilla Transformer in BLEU scores. Qualitative analysis shows that MMA makes better use of word order information, and the improvement in translating relatively long sentences is especially significant.

Transformer Architecture
In this section, we briefly describe the Transformer architecture (Vaswani et al., 2017), which includes an encoder and a decoder. The Transformer aims to map a source sentence x to a target sentence y by minimizing the negative log likelihood of the target words. The encoder consists of N identical layers, and each layer has two sublayers with residual connections (He et al., 2016). The first is a multi-head self-attention sublayer and the second is a position-wise fully connected feed-forward network sublayer:

C^l = LN(MA(Q^{l-1}, K^{l-1}, V^{l-1}) + H^{l-1})    (1)
H^l = LN(FFN(C^l) + C^l)    (2)

where Q^{l-1}, K^{l-1}, V^{l-1} come from the output of the previous encoder layer H^{l-1}. LN(·) and FFN(·) represent layer normalization (Ba et al., 2016) and the feed-forward network. The multi-head attention MA(·) linearly projects the queries, keys and values h times for different representations of Q, K, V, and computes scaled dot-product attention (Luong et al., 2015) ATT(·) for each representation. These are then concatenated and once again projected, so the final attentional context is calculated as follows:

head_h = ATT(Q W_h^Q, K W_h^K, V W_h^V)    (3)
MA(Q, K, V) = Concat(head_1, ..., head_h) W^O    (4)

where W_h^Q, W_h^K and W_h^V are parameter matrices that transform the hidden state into different representation subspaces and W^O is the output projection. ATT(·) is computed by:

ATT(Q, K, V) = softmax(e) V,  with e_i = Q K_i^T / √d    (5)

where e_i is the i-th energy and d is the dimension of the hidden state. The decoder is also composed of N identical layers, and it contains a third sublayer, which attends over the output of the encoder, between the self-attention sublayer and the feed-forward network sublayer.
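As an illustration, the multi-head computation described above can be sketched in a few lines of NumPy. This is a minimal single-sentence sketch (no batching, encoder self-attention with Q = K = V = X); the weight-list layout is our own choice for clarity, not fairseq's actual parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # Project X into h subspaces (one triple of matrices per head),
    # attend in each head, then concatenate and project with W_o.
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o
```

With h heads of dimension d/h each, the concatenated output has the same width d as the input, so the residual connection applies directly.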

Proposed Architecture
Our proposed approach is mainly motivated by the fact that redundancy occurs across multiple heads (Voita et al., 2019; Michel et al., 2019), which limits the capacity of multi-head self-attention. As each self-attention head has the same global receptive field, there is no guarantee that every head learns useful features in different subspaces through the same attention function.
To tackle the problem mentioned above, besides global information, we also model local and sequential information for multi-head self-attention by applying local attention, forward attention and backward attention respectively. We refer to this as Mixed Multi-Head Self-Attention (MMA), as shown in Figure 1. It is achieved by adding a hard mask to each attention head. In this way, Eq.(3) is redefined as:

head_h = ATT(Q W_h^Q, K W_h^K, V W_h^V, M_h),  with ATT(Q, K, V, M) = softmax(e + M) V

Since attention weights are calculated by the softmax function, a mask M_{i,j} = −∞ forces the attention weight of Q_i on K_j to zero, so that Q_i cannot attend to K_j. On the contrary, if a mask M_{i,j} = 0, it means no change in the attention function and Q_i attends to and captures relevant information from K_j.
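The hard-mask mechanism can be sketched as follows; a minimal NumPy version in which the mask M is simply added to the attention energies before the softmax, so that −∞ entries become zero weights:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to exactly 0.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, M):
    # M[i, j] = -inf drives the weight of query i on key j to zero;
    # M[i, j] = 0 leaves the attention function unchanged.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + M) @ V
```

Because the mask is additive and fixed, this changes no parameters and adds only a single elementwise addition to the attention computation.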

Global and Local Attention
Global attention and local attention differ in whether attention is placed over all positions or only a few. Global attention is the original attention function in the Transformer (Vaswani et al., 2017); it has a global receptive field and can connect arbitrary words directly. Under our framework, we define the hard mask for global attention as follows:

M^G_{i,j} = 0 for all i, j

However, global attention may be less powerful for longer sequences and can potentially become impractical (Luong et al., 2015). On the other hand, self-attention can be enhanced by local attention, which focuses on a restricted scope rather than the entire context (Wu et al., 2019; Xu et al., 2019). Based on the above findings, we also define a local attention which simply employs a hard mask to restrict the attention scope:

M^L_{i,j} = 0 if |i − j| ≤ w, and −∞ otherwise

where w is the attention scope, meaning that a given i-th word can only attend to the set of words within the window [i − w, i + w]. We aim to combine the strengths of both global attention and local attention. Towards this goal, we apply global attention and local attention in two distinct attention heads.
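The two masks can be constructed as below; a NumPy sketch assuming, as described above, a symmetric window [i − w, i + w] around each position:

```python
import numpy as np

NEG_INF = float("-inf")

def global_mask(n):
    # Global attention: M[i, j] = 0 everywhere, so every query
    # can attend to every position.
    return np.zeros((n, n))

def local_mask(n, w):
    # Local attention: M[i, j] = 0 iff |i - j| <= w, else -inf,
    # restricting the i-th word to the window [i - w, i + w].
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.where(np.abs(i - j) <= w, 0.0, NEG_INF)
```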

Forward and Backward Attention
For RNN-based NMT, the bidirectional recurrent encoder (Schuster and Paliwal, 1997) is the most commonly used encoder (Bahdanau et al., 2015). It consists of forward and backward recurrent encoders that receive information from both past and future words. The Transformer, however, foregoes recurrence and relies entirely on predefined position embeddings to represent position information. Therefore, it has considerable difficulty in modeling relative word order (Shaw et al., 2018).
To enhance position-awareness in self-attention, we present a straightforward way of modeling sequentiality: a forward attention which only attends to words from the future, and a backward attention which, inversely, only attends to words from the past. The masks in forward and backward attention can be formally defined as:

M^F_{i,j} = 0 if j ≥ i, and −∞ otherwise
M^B_{i,j} = 0 if j ≤ i, and −∞ otherwise
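A sketch of the two directional masks; note that letting each position also attend to itself (the j = i diagonal) is our assumption, since the text above does not pin down whether the diagonal is included:

```python
import numpy as np

NEG_INF = float("-inf")

def forward_mask(n):
    # Forward attention: query i may only attend to positions j >= i
    # (itself and the future); earlier positions are masked out.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.where(j >= i, 0.0, NEG_INF)

def backward_mask(n):
    # Backward attention: query i may only attend to positions j <= i
    # (itself and the past); later positions are masked out.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.where(j <= i, 0.0, NEG_INF)
```

The backward mask is exactly the causal mask used in the Transformer decoder; the forward mask is its transpose.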

Model | De-En
Variational Attention (Deng et al., 2018) | 33.30
Pervasive Attention (Elbayad et al., 2018) | 34.18
Multi-Hop Attention (Iida et al., 2019) | 35.13
Dynamic Convolution (Wu et al., 2019) | 35.20
RNMT Fine-tuned (Sennrich and Zhang, 2019) |

With the help of forward and backward attention, we assume that the Transformer can make better use of word order information.

Mixed Multi-Head Self-Attention
With different attention functions and different receptive fields applied in different heads, the model is able to learn different aspects of features. To fully utilize these different features, we concatenate all mixed attention heads as in Eq.(4):

MMA(Q, K, V) = Concat(head_G, head_L, head_F, head_B) W^O

where head_G, head_L, head_F and head_B denote the heads with global attention, local attention, forward attention and backward attention respectively. Our method only adds hard masks before the softmax function; the rest is the same as the original model. Hence our method introduces no additional parameters and does not affect training efficiency.

Datasets
To test the proposed approach, we perform experiments on the WAT17 English-Japanese and IWSLT14 German-English translation tasks with different amounts of training data.
WAT17 English-Japanese: We use the data from the WAT17 English-Japanese translation task, which is created from ASPEC (Nakazawa et al., 2017). The training, validation and test sets comprise 2M, 1.8K and 1.8K sentence pairs respectively. We adopt the official 16K vocabularies preprocessed by sentencepiece.
IWSLT14 German-English: We use the TED data from the IWSLT14 German-English shared translation task (Cettolo et al., 2014), which contains 160K training sentences and 7K validation sentences randomly sampled from the training data. We test on the concatenation of tst2010, tst2011, tst2012, tst2013 and dev2010. For this benchmark, data is lowercased and tokenized with byte pair encoding (BPE) (Sennrich et al., 2016).

Setup
Our implementation is built upon the open-source toolkit fairseq (Ott et al., 2019). For the WAT17 and IWSLT14 datasets, we use the configurations of the Transformer base and small models respectively. Both consist of a 6-layer encoder and a 6-layer decoder; the size of the hidden state and word embedding is set to 512. The dimensionality of the inner feed-forward layer is 2048 for the base model and 1024 for the small model. The dropout probability is 0.1 for the base model and 0.3 for the small model. Models are optimized with Adam (Kingma and Ba, 2014). We use the same warmup and decay strategy for the learning rate as Vaswani et al. (2017), with 4000 warmup steps.

Table 3: Results on IWSLT14 De-En and WAT17 Ja-En for effectiveness of learning word order. "-Position Embedding" indicates removing positional embedding from the Transformer encoder or Transformer MMA encoder. ∆ denotes the relative improvement over the counterpart of the Transformer baseline.
During training, we employ label smoothing of value 0.1 (Szegedy et al., 2016). All models are trained on a single NVIDIA RTX 2080 Ti with a batch size of around 4096 tokens. The base model is trained for 20 epochs and the small model for 45 epochs.
The number of heads is 8 for the base model and 4 for the small model. We replace multi-head self-attention in the encoder layers with our mixed multi-head self-attention. For a fair comparison, we apply each attention function twice in the base model. By doing this, our Transformer MMA has the same number of parameters as the original Transformer.
For evaluation, we use a beam size of 5 for beam search; translation quality is reported via BLEU (Papineni et al., 2002) and statistical significance is tested by the paired bootstrap resampling method (Koehn, 2004).
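The significance test can be sketched as follows; a minimal Python version of Koehn (2004)-style paired bootstrap resampling, with per-sentence quality scores standing in for corpus-level BLEU (the actual test recomputes corpus BLEU on each resampled test set, which summing per-sentence scores only approximates):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    # Repeatedly resample the test set with replacement and count
    # how often system A scores higher than system B on the sample.
    # A win fraction near 1.0 indicates a significant improvement.
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```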

Results
In Table 1 and Table 2, we present the experimental results measured by BLEU on WAT17 and IWSLT14.
On the WAT17 English⇒Japanese (En-Ja) and Japanese⇒English (Ja-En) translation tasks, without increasing the number of parameters, our Transformer MMA outperforms the corresponding baseline by 0.81 BLEU on En-Ja and 0.92 BLEU on Ja-En.
On the IWSLT14 German⇒English (De-En) translation task, our model achieves a BLEU score of 35.41, a 0.95 improvement over the strong Transformer baseline. To compare with existing models, we list several recent and related works, and our model also achieves considerable improvements over their results.
Overall, our evaluation results show that the introduction of MMA consistently improves translation quality over the vanilla Transformer, and the proposed approach is stable across different language pairs.

Effectiveness of MMA
Neural machine translation must consider the ordering of words, as order has a large influence on the meaning of a sentence (Khayrallah and Koehn, 2018). In the vanilla Transformer, the position embedding is a deterministic function of position and allows the model to be aware of the order of the sequence. As shown in Table 3, the Transformer without position embedding fails on the translation task, with a decrease of 17.91 BLEU. With the proposed MMA, performance is reduced by only 0.75 BLEU without position embedding, 18.11 points higher than the Transformer counterpart. The same result holds for the distant language pair Japanese-English, where word order is completely different. When position embedding is removed, the Transformer baseline drops to 12.83 BLEU, whereas our model still achieves 23.80, a 10.97-point improvement over the Transformer counterpart.
Since local attention only focuses on a restricted scope, the local attention head's dependence on word order information is reduced. In the forward and backward heads, directional information is explicitly learned by forward and backward attention. The above experimental results confirm our hypothesis that, beyond global information, Transformer MMA takes local and sequential information into account when performing self-attention, revealing its effectiveness in utilizing word order information.

Effect on Sentence Length
Following Bahdanau et al. (2015), we group source sentences of similar lengths to evaluate the performance of the proposed Transformer MMA and vanilla Transformer. We divide our test set into six disjoint groups shown in Figure 2. The numbers on the X-axis represent source sentences that are not longer than the corresponding length, e.g., "(0, 10]" indicates that the length of source sentences is between 1 and 10.
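The bucketing described above can be sketched as follows; the specific interval edges are our assumption, extrapolated from the "(0, 10]" example so that six groups result:

```python
def length_bucket(length, edges=(10, 20, 30, 40, 50)):
    # Map a source-sentence length to its interval label,
    # e.g. a length of 7 falls into "(0, 10]".
    lo = 0
    for hi in edges:
        if length <= hi:
            return f"({lo}, {hi}]"
        lo = hi
    # Lengths beyond the last edge form the final, open-ended group.
    return f">{edges[-1]}"
```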
In all length intervals, Transformer MMA consistently outperforms the Transformer baseline. Moreover, as the length of the source sentence increases, the improvement brought by MMA grows as well. One explanation is that when sentences are very short, the four attention functions behave similarly to each other; as sentence length increases, more distinct characteristics can be learned and the performance gap becomes larger.
Moreover, encoding long sentences usually requires modeling more long-range dependencies. Given its ability to connect distant words directly, global self-attention has been speculated to be better suited to capturing long-range dependencies. However, as noted by Tang et al. (2018), this hypothesis is not empirically correct, and self-attention does have trouble handling long sentences. In the case of our Transformer MMA, with the other attention functions serving as auxiliary feature extractors, we believe the Transformer has more capacity for modeling longer sentences.

Ablation Study
For the ablation study, the primary question is whether the Transformer benefits equally from the integration of the different attention functions. To evaluate the impact of the various attention functions, we keep the global self-attention head unchanged and replace the other heads with different attention functions. The results are listed in Table 4. Compared with the Transformer baseline, all integration methods that incorporate another attention function improve translation performance, by 0.37 to 0.67 BLEU. Transformer MMA performs best across all variants, with an improvement of 0.95 BLEU.
Furthermore, we investigate the effect of the attention scope in Transformer MMA, as illustrated in Table 5. As the attention scope progressively increases, there is no clear trend in performance. However, it is worth noting that when the attention scope is relatively small, the overall performance is better. Specifically, when the attention scope is 1, our Transformer MMA achieves the best result. One possible reason is that, when global features are already captured by global attention, the smaller the attention scope, the more local features can be learned by local attention.

Attention Visualization
To further explore the behavior of our Transformer MMA, we observe the distribution of encoder attention weights in our models; an example for a Japanese sentence is plotted in Figure 3.
The first discovery is that the word overlooks itself on the first layer in the global attention head. This contrasts with the results of Raganato and Tiedemann (2018), who find that, on the first layer of the original Transformer, more encoder self-attention heads focus on the word itself. This change is in line with our assumption that, due to the existence of the other attention heads, the global attention head can focus more on capturing global information.

Figure 3: Visualization of the attention weights for the Japanese sentence " " (meaning "These persons were improved in all cases by wearing lumbar braces or limiting exercises"). A deeper blue color indicates larger attention weights.
The second discovery is that, on the upper layers, forward and backward attention heads move the attention more on distant words. This suggests forward and backward attention is able to serve as a complement to capturing long-range dependency.

Related Work
In the field of neural machine translation, the two most widely used attention mechanisms are additive attention (Bahdanau et al., 2015) and dot-product attention (Luong et al., 2015). Based on the latter, Vaswani et al. (2017) proposed multi-head self-attention, which is not only highly parallelizable but also achieves better performance.
However, self-attention, which employs neither recurrence nor convolution, has great difficulty in incorporating position information (Vaswani et al., 2017). To tackle this problem, Shaw et al. (2018) presented an extension that incorporates relative position information for sequences, and Shen et al. (2018) introduced a directional self-attention that encodes temporal order. On the other hand, despite its global receptive field, the ability of self-attention has recently come into question (Tang et al., 2018), and modeling localness, either by restricting context sizes (Wu et al., 2019; Child et al., 2019) or by balancing the contribution of local and global information (Xu et al., 2019), has been shown to improve the expressiveness of self-attention. In contrast to these studies, we aim to improve self-attention from a systematic and multifaceted perspective, rather than focusing on one specific characteristic.
Compared to a conventional NMT model with only a single head, multi-head attention is assumed to have a stronger ability to extract different features in different subspaces. However, there is no explicit mechanism that makes the heads distinct (Voita et al., 2019; Michel et al., 2019). Li et al. (2018) showed that using a disagreement regularization to encourage different attention heads to behave differently can improve the performance of multi-head attention. Iida et al. (2019) proposed a multi-hop attention where the second hop serves as a head gate function to normalize the attentional context of each head. Beyond neural machine translation, Strubell et al. (2018) combined multi-head self-attention with multi-task learning, leading to promising results for semantic role labeling. Similar to the above studies, we also attempt to model diversity for multi-head attention. In this work, we apply different attention functions to capture different aspects of features in multiple heads directly, which is more intuitive and explicit.

Conclusion
In this work, we improve self-attention networks by designing multi-head attention to learn different aspects of features through different attention functions. Experimental results on the WAT17 English-Japanese and IWSLT14 German-English translation tasks demonstrate that our proposed model outperforms the Transformer baseline as well as several recent related models. Our analysis further shows that Transformer MMA makes better use of word order information, and the improvement in translating longer sentences is especially significant. Moreover, we perform an ablation study to compare different architectures. To explore the behavior of our proposed model, we visualize the attention distribution and confirm the diversity among the multiple heads in MMA.
In the future, we plan to apply our method to other sequence-to-sequence learning tasks, such as text summarization.