Making Asynchronous Stochastic Gradient Descent Work for Transformers

Alham Fikri Aji, Kenneth Heafield


Abstract
Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD is faster at raw training speed since it avoids waiting for synchronization. Moreover, the Transformer model is the basis for state-of-the-art models for several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD under-performs, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast.
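The abstract's core idea, summing several asynchronously arriving gradient updates and applying them as a single step rather than applying each one immediately, can be illustrated with a short sketch. The following Python code is a minimal toy illustration under stated assumptions, not the authors' Marian implementation: names such as ParameterServer, push_gradient, and accumulation_steps are invented for the example, and the quadratic objective at the bottom is only a demonstration.

import threading
import numpy as np

class ParameterServer:
    """Toy parameter server that accumulates asynchronous gradients (illustrative sketch)."""

    def __init__(self, params, lr=0.1, accumulation_steps=4):
        self.params = params.astype(float)          # shared model parameters (flat vector)
        self.lr = lr
        self.accumulation_steps = accumulation_steps
        self.accumulated = np.zeros_like(self.params)
        self.count = 0
        self.lock = threading.Lock()

    def push_gradient(self, grad):
        # Workers call this whenever their gradient is ready, in any order.
        with self.lock:
            self.accumulated += grad                # sum asynchronous updates ...
            self.count += 1
            if self.count == self.accumulation_steps:
                # ... then apply them as one SGD step, mimicking a synchronous
                # batch update without making workers wait for each other.
                self.params -= self.lr * self.accumulated
                self.accumulated[:] = 0.0
                self.count = 0

    def pull_params(self):
        # Workers fetch the latest parameters before computing their next gradient.
        with self.lock:
            return self.params.copy()

# Toy usage (hypothetical): "workers" push noisy gradients of f(x) = ||x||^2 / 2.
if __name__ == "__main__":
    server = ParameterServer(np.array([5.0, -3.0]), lr=0.1, accumulation_steps=4)
    rng = np.random.default_rng(0)
    for step in range(400):
        x = server.pull_params()
        noisy_grad = x + rng.normal(scale=0.1, size=x.shape)  # gradient of the toy objective plus noise
        server.push_gradient(noisy_grad)
    print(server.pull_params())  # converges toward [0, 0]

The design point the sketch tries to capture is that the effective update again looks like a larger summed batch, as in synchronous SGD, while workers still proceed without waiting for one another.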
Anthology ID:
D19-5608
Volume:
Proceedings of the 3rd Workshop on Neural Generation and Translation
Month:
November
Year:
2019
Address:
Hong Kong
Editors:
Alexandra Birch, Andrew Finch, Hiroaki Hayashi, Ioannis Konstas, Thang Luong, Graham Neubig, Yusuke Oda, Katsuhito Sudoh
Venue:
NGT
Publisher:
Association for Computational Linguistics
Pages:
80–89
URL:
https://aclanthology.org/D19-5608
DOI:
10.18653/v1/D19-5608
Cite (ACL):
Alham Fikri Aji and Kenneth Heafield. 2019. Making Asynchronous Stochastic Gradient Descent Work for Transformers. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 80–89, Hong Kong. Association for Computational Linguistics.
Cite (Informal):
Making Asynchronous Stochastic Gradient Descent Work for Transformers (Aji & Heafield, NGT 2019)
PDF:
https://aclanthology.org/D19-5608.pdf