Mixed Multi-Head Self-Attention for Neural Machine Translation

Hongyi Cui, Shohei Iida, Po-Hsuan Hung, Takehito Utsuro, Masaaki Nagata


Abstract
Recently, the Transformer has become a state-of-the-art architecture in the field of neural machine translation (NMT). A key to its high performance is multi-head self-attention, which is supposed to allow the model to independently attend to information from different representation subspaces. However, there is no explicit mechanism to ensure that different attention heads indeed capture different features, and in practice, redundancy has been observed among multiple heads. In this paper, we argue that using the same global attention in multiple heads limits multi-head self-attention’s capacity for learning distinct features. To improve the expressiveness of multi-head self-attention, we propose a novel Mixed Multi-Head Self-Attention (MMA), which models not only global and local attention but also forward and backward attention in different attention heads. This enables the model to learn distinct representations explicitly among multiple heads. In our experiments on both the WAT17 English-Japanese and the IWSLT14 German-English translation tasks, we show that, without increasing the number of parameters, our models yield consistent and significant improvements (0.9 BLEU points on average) over the strong Transformer baseline.
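The idea of giving different heads different attention patterns can be illustrated with per-head masks. The following is a minimal sketch, not the authors' implementation: the abstract does not define the four patterns precisely, so the function names (build_mask, masked_softmax, mixed_head_weights), the window parameter, and the reading of "forward" as attending to the current and later positions and "backward" as attending to the current and earlier positions are all assumptions made for illustration.

```python
import math


def build_mask(kind, n, window=1):
    """Return an n x n boolean matrix; mask[i][j] is True if query i may attend to key j."""
    def allowed(i, j):
        if kind == "global":
            return True                      # every position is visible
        if kind == "local":
            return abs(i - j) <= window      # assumed: fixed symmetric window
        if kind == "forward":
            return j >= i                    # assumed: current and later positions
        if kind == "backward":
            return j <= i                    # assumed: current and earlier positions
        raise ValueError(f"unknown mask kind: {kind}")
    return [[allowed(i, j) for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; disallowed positions receive zero weight."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_head_weights(scores_per_head, kinds, window=1):
    """Apply a different mask to each head.

    scores_per_head[h] is an n x n matrix of raw attention scores for head h,
    and kinds[h] names the mask that head uses.
    """
    weights = []
    for scores, kind in zip(scores_per_head, kinds):
        n = len(scores)
        mask = build_mask(kind, n, window)
        weights.append([masked_softmax(scores[i], mask[i]) for i in range(n)])
    return weights
```

Because the masks only restrict which scores enter the softmax, a scheme of this kind adds no trainable parameters, which is consistent with the abstract's claim that the parameter count does not increase. Each mask here keeps the diagonal, so every row has at least one visible position.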
Anthology ID: D19-5622
Volume: Proceedings of the 3rd Workshop on Neural Generation and Translation
Month: November
Year: 2019
Address: Hong Kong
Venues: EMNLP | NGT | WS
Publisher: Association for Computational Linguistics
Pages: 206–214
URL: https://www.aclweb.org/anthology/D19-5622
DOI: 10.18653/v1/D19-5622
PDF: https://www.aclweb.org/anthology/D19-5622.pdf