Compressing Neural Machine Translation Models with 4-bit Precision

Alham Fikri Aji, Kenneth Heafield


Abstract
Neural Machine Translation (NMT) is resource-intensive. We design a quantization procedure to compress NMT models so that they better fit devices with limited hardware capability. We use logarithmic quantization, instead of the more commonly used fixed-point quantization, based on the empirical observation that the parameter distribution is not uniform. We find that biases take little memory and show that they can be left uncompressed to improve overall quality without affecting the compression rate. We also propose an error-feedback mechanism during retraining, treating the compression error as a stale gradient. We empirically show that NMT models based on the Transformer or RNN architecture can be compressed to 4-bit precision without noticeable quality degradation. Models can be compressed down to binary precision, albeit with lower quality. The RNN architecture appears more robust to compression than the Transformer.
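The abstract combines two ingredients: log-scale (power-of-two) quantization of the weights and an error-feedback retraining loop. Below is a minimal sketch of both, assuming a simple max-based scale, plain SGD, and the helper names `log_quantize`/`retrain_step`, which are illustrative only; the paper's exact scale fitting, handling of zeros, and retraining schedule may differ.

```python
import numpy as np

def log_quantize(w, bits=4):
    """Quantize weights to sign * scale * 2^k (log-scale centroids).

    Sketch only: `scale` is taken as max |w|; the paper may fit it differently.
    """
    sign = np.sign(w)
    scale = np.abs(w).max() + 1e-12
    levels = 2 ** (bits - 1)  # one bit for sign, the rest index a power-of-two bin
    exp = np.round(np.log2(np.maximum(np.abs(w), 1e-12) / scale))
    exp = np.clip(exp, -(levels - 1), 0)  # exponents in {-(levels-1), ..., 0}
    return sign * scale * (2.0 ** exp)

def retrain_step(w_full, grad, lr=1e-3, bits=4):
    """One hedged retraining step with error feedback: gradients computed on the
    quantized model update a full-precision copy, so the quantization error
    persists there and is carried forward, analogous to a stale gradient."""
    w_full = w_full - lr * grad            # ordinary SGD update on the full-precision copy
    w_quant = log_quantize(w_full, bits)   # compressed weights actually used by the model
    return w_full, w_quant

# Toy usage on random weights (illustration only).
w = np.random.randn(5).astype(np.float32)
print(w, log_quantize(w))
```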
Anthology ID:
2020.ngt-1.4
Volume:
Proceedings of the Fourth Workshop on Neural Generation and Translation
Month:
July
Year:
2020
Address:
Online
Editors:
Alexandra Birch, Andrew Finch, Hiroaki Hayashi, Kenneth Heafield, Marcin Junczys-Dowmunt, Ioannis Konstas, Xian Li, Graham Neubig, Yusuke Oda
Venue:
NGT
Publisher:
Association for Computational Linguistics
Pages:
35–42
URL:
https://aclanthology.org/2020.ngt-1.4
DOI:
10.18653/v1/2020.ngt-1.4
Cite (ACL):
Alham Fikri Aji and Kenneth Heafield. 2020. Compressing Neural Machine Translation Models with 4-bit Precision. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 35–42, Online. Association for Computational Linguistics.
Cite (Informal):
Compressing Neural Machine Translation Models with 4-bit Precision (Aji & Heafield, NGT 2020)
PDF:
https://aclanthology.org/2020.ngt-1.4.pdf
Video:
http://slideslive.com/38929817