Fully Quantized Transformer for Machine Translation

State-of-the-art neural machine translation methods employ massive amounts of parameters. Drastically reducing computational costs of such methods without affecting performance has been up to this point unsuccessful. To this end, we propose FullyQT: an all-inclusive quantization strategy for the Transformer. To the best of our knowledge, we are the first to show that it is possible to avoid any loss in translation quality with a fully quantized Transformer. Indeed, compared to full-precision, our 8-bit models score greater or equal BLEU on most tasks. Comparing ourselves to all previously proposed methods, we achieve state-of-the-art quantization results.


Introduction
The idea of using neural networks for machine translation was only recently proposed (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014). Nonetheless, the approach became the state-of-the-art in the field (Ahmed et al., 2017). A key element of this success was to allow the decoder to attend to all hidden states of the encoder. A few variations to this additive attention mechanism have been proposed, such as multiplicative and self-attention (Luong et al., 2015; Cheng et al., 2016; Lin et al., 2017). The latter formed the basis of the Transformer network (Vaswani et al., 2017), which achieved state-of-the-art results in machine translation. It inspired a new wave of work, and numerous natural language processing tasks reached new heights (Devlin et al., 2018; Liu et al., 2019). Unfortunately, these models use an enormous number of parameters. Inference on resource-limited hardware such as edge devices is thus impractical.
A solution to reduce the computational burden of these networks is to lower numerical precision, so that numerical values can be represented using fewer bits (Tang and Kwan, 1993; Marchesi et al., 1993). This method, called quantization, has the advantage of providing good compression rates with minimal loss in accuracy. It is also conveniently supported by most hardware. Properly quantizing the Transformer would allow computational speed gains at inference, as well as deployment on more constrained devices.
In this work, we propose a quantization-aware training strategy for the entire Transformer architecture. Our method is easy to implement and results are consistent with the full-precision Transformer. We test our approach on multiple translation tasks such as WMT14 EN-FR and WMT14 EN-DE and obtain state-of-the-art quantization results. In comparison with full-precision, our quantized models score equal or higher BLEU on most tasks. We are, to the best of our knowledge, the first to show that the Transformer architecture can be fully quantized without impairing translation quality. We also perform an ablation study and show that quantizing specific components of the Transformer improves BLEU score.

Background
In this section, we review a broad spectrum of quantization and pruning methods for neural network compression.

Quantization
Over the years, a large range of methods have been proposed to quantize neural networks. These include, among many others, binary, ternary (Lin et al., 2015), uniform (Jacob et al., 2017) and learned (Zhang et al., 2018) quantization. These methods can be universally applied to any type of neural network. To maintain performance though, specific architectures usually require custom tailored quantization schemes.
Several recent works explore recurrent neural network (Jordan, 1990) quantization. Ott et al. (2016) propose an exponential quantization method for RNN weights. They find ternary and exponential quantization to work well on language modeling and speech recognition, while binary weights seemed ineffective. Weights and activations of both RNNs and LSTMs (Hochreiter and Schmidhuber, 1997) have also been quantized to 2, 4 and 6 bits. Meanwhile, modifications to the gates and interlinks of quantized LSTM and GRU cells have been proposed, as well as a balanced quantization method for weights. A stacked sequence-to-sequence LSTM has been successfully quantized to 8-bit without any loss in translation quality. Most recently, Wang et al. (2018) propose applying different quantization methods for different RNN components.
With regards to CNNs (LeCun et al., 1989), various works have also explored quantizing these models. Gong et al. (2014) compare matrix factorization, binarization, k-means clustering, product quantization and residual quantization of CNNs. Wu et al. (2015) apply quantization to both kernels and fully connected layers of convolutional neural networks. Rastegari et al. (2016) propose using binary weighted filters on AlexNet (Krizhevsky et al., 2012). Testing their method on ImageNet, they show classification accuracy to be on par with full-precision. For faster inference and training, low bitwidth weights, activations and gradients have also been used on CNNs.
Quantization has been applied in tandem with other compression methods. Han et al. (2015) combine pruning, quantization, weight sharing and Huffman coding. In another line of work, Polino et al. (2018) employ quantization with knowledge distillation (Hinton et al., 2015) for higher compression rates. Moreover, Chen et al. (2018) blend quantization with block based low-rank matrix approximation of embeddings.

Pruning
The pruning of neural networks for model compression has also been largely explored. LeCun et al. (1990) were the first to propose a Hessian based method to prune neural net weights. Hassibi et al. (1994) later improved the method. More recently, See et al. (2016) show that pruning a fully trained model and then retraining it can increase performance over the original non-pruned model. Gradually pruning in tandem with training has also been shown to increase performance (Zhu and Gupta, 2017). To avoid sparse matrices, nodes can be pruned instead of weights, by applying a penalty in the loss on the γ parameters of batch normalization layers. With a similar objective, Narang et al. (2017b) make better use of hardware by applying pruning and weight decay in blocks to minimize the number of loaded weight matrix chunks.
Similarly to quantization, pruning methods have also been adapted to specific architectures. Liu et al. (2015) propose an efficient sparse matrix multiplication algorithm for CNNs. As for RNNs, Narang et al. (2017a) show sparse pruning to work well on the architecture. In order to maintain dimension consistency, Wen et al. (2017) propose to prune all basic LSTM structures concurrently. Lastly, Park et al. (2018) introduce simple recurrent units (SRUs) for easy pruning of RNNs.

Quantization Methodology
Our quantization scheme was chosen to be uniform, meaning that the step size between two quantized values is constant. This choice, which is an additional constraint, was made for practical reasons. It simplifies all computations required during inference, enabling a more efficient use of hardware resources. If the performance with uniform quantization is already on par with full-precision, then heavier methods are unnecessary. A brief overview of uniform quantization is given in this section. For more details, we refer the reader to Jacob et al. (2017).
Given an element x of a tensor X, we apply the quantization function Q:

$$Q(x) = \left\lfloor \frac{\mathrm{clamp}(x;\, x_{min}, x_{max}) - x_{min}}{s} \right\rceil s + x_{min}, \qquad s = \frac{x_{max} - x_{min}}{2^k - 1}$$

where x_min and x_max define the endpoints of the quantization interval. When quantization is applied to weights, these values are respectively min(X) and max(X). However, when quantization is applied to activations, those values are running estimates. The latter are computed during training, where for every forward pass, the x_min and x_max variables are updated via an exponential moving average with a momentum of 0.9. The clamp function associates all values outside of the [x_min, x_max] range to the closest endpoint and ⌊·⌉ represents rounding to the nearest integer. The value k is simply the bit precision. For example, in the context of 8-bit quantization, k = 8.
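The description above can be illustrated with a minimal pure-Python sketch. It assumes the usual uniform form, in which the step size is s = (x_max − x_min) / (2^k − 1) and a value is clamped, mapped to an integer level, rounded, and mapped back. The function names `quantize` and `quantize_weights` are hypothetical and chosen only for this illustration.

```python
def quantize(x, x_min, x_max, k=8):
    """Map x to the nearest of 2**k evenly spaced values in [x_min, x_max].

    Assumes x_max > x_min. Python's round() breaks exact .5 ties toward the
    even integer, which may differ from other rounding conventions.
    """
    s = (x_max - x_min) / (2 ** k - 1)     # constant step size
    clamped = min(max(x, x_min), x_max)    # out-of-range values go to the closest endpoint
    level = round((clamped - x_min) / s)   # integer level in [0, 2**k - 1]
    return level * s + x_min               # back to the real-valued domain


def quantize_weights(weights, k=8):
    """For weights, the interval endpoints are the tensor's own min and max."""
    lo, hi = min(weights), max(weights)
    return [quantize(w, lo, hi, k) for w in weights]
```

For in-range inputs, the result differs from the input by at most half a step, s / 2.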
During backpropagation, we use the straight-through estimator (Hinton, 2012) and set the gradients of clamped values to zero. Once training is finished, s and x_min are frozen along with the weights.
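As a rough sketch of the backward rule just described, the rounding step can be treated as the identity, while values that fell outside the quantization interval receive no gradient. The function name `quantize_backward` is hypothetical, and treating the endpoints themselves as unclamped is an assumption of this sketch.

```python
def quantize_backward(x, upstream_grad, x_min, x_max):
    """Gradient of the loss with respect to x under the straight-through estimator."""
    if x < x_min or x > x_max:
        return 0.0           # clamped value: gradient is set to zero
    return upstream_grad     # rounding is treated as the identity function
```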

What to Quantize
We choose to quantize all operations which can provide a computational speed gain at inference. In this regard, we quantize all matrix multiplications, meaning that the inputs and weights of MatMuls will both be k-bit quantized. The other operations we quantize are divisions, but only if both the numerator and denominator are second or higher rank tensors. For all other operations, such as sums, the computational cost added by the quantization operation outweighs the benefit of performing the operation with reduced precision. Hence, we do not quantize such operations.
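To see why quantizing both operands of a MatMul can help, consider a single dot product. The sketch below shows one standard way to carry out the accumulation on integer levels only, applying the real-valued scale and offset once at the end. This affine expansion is an illustration under our own assumptions, not necessarily the exact kernel used at inference, and the names `to_levels`, `quantized_dot` and `dequantized_dot` are hypothetical.

```python
def to_levels(values, k=8):
    """Return integer levels plus (s, x_min) so that value ~ s * level + x_min."""
    x_min, x_max = min(values), max(values)
    s = (x_max - x_min) / (2 ** k - 1)   # assumes the values are not all equal
    return [round((v - x_min) / s) for v in values], s, x_min


def quantized_dot(x, w, k=8):
    """Dot product of the quantized vectors using integer accumulation."""
    qx, sx, mx = to_levels(x, k)
    qw, sw, mw = to_levels(w, k)
    acc = sum(a * b for a, b in zip(qx, qw))   # integer multiply-accumulate
    sum_qx, sum_qw = sum(qx), sum(qw)          # integer sums for the offset terms
    n = len(x)
    # Expand sum((sx*qx + mx) * (sw*qw + mw)) term by term.
    return sx * sw * acc + sx * mw * sum_qx + sw * mx * sum_qw + n * mx * mw


def dequantized_dot(x, w, k=8):
    """Reference: dequantize first, then multiply in floating point."""
    qx, sx, mx = to_levels(x, k)
    qw, sw, mw = to_levels(w, k)
    return sum((sx * a + mx) * (sw * b + mw) for a, b in zip(qx, qw))
```

Both functions compute the same quantity; only the first keeps the inner loop in integer arithmetic.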
More precisely, we quantize all weights of the Transformer, excluding biases. The latter are summed with the INT32 output of matrix multiplications and thus provide no additional computational efficiency from being quantized. Furthermore, the memory space of biases is insignificant in comparison to the weight matrices, representing less than 0.1% of total weights. Positional embeddings are fixed and can thus be quantized once prior to training. The γ weights of LayerNorms are also quantized. As for activations, we quantize the sum of the input embeddings with the positional encodings in both the encoder and decoder. In the Multi-Head Attention, we quantize the (Q, K, V) input, the softmax's numerator, the softmax's denominator, the softmax's output and the Scaled Dot-Product Attention's output. At inference, the softmax does not need to be computed in full-precision. Indeed, the exponential function can instead be replaced with a step function. For the position-wise feed-forward networks, we quantize the output of the ReLUs and of the feed-forwards themselves. Finally, for all LayerNorms, we quantize the numerator x − µ, the denominator √(σ² + ε), their quotient and the output of the LayerNorm. A visual guide is provided in appendix A.

Bucketing
Instead of using a single set of (s, x_min) per quantized tensor, we can quantize subsets of the tensor, each with its own set of (s, x_min) (Alistarh et al., 2016). Even though this adds more scalars, the memory cost is insignificant overall. Furthermore, the added flexibility can greatly alleviate the precision loss resulting from all values being mapped to a single low numerical precision domain.
We use this bucketing method for all weight matrices, with a number of subsets equal to the output dimension. For activations, we use bucketing when quantizing: the sum of input embeddings with the positional encoding, the Q, K, V inputs, the Scaled Dot-Product Attention's output, the feed-forward's output, the LayerNorm's numerator, quotient and output.
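The benefit of bucketing can be seen on a toy weight matrix whose rows have very different ranges. The sketch below is an illustration under stated assumptions: the matrix is stored as a list of rows with one row per output unit, every bucket contains at least two distinct values, and all function names are hypothetical.

```python
def quant_params(values, k=8):
    """Compute (s, x_min) from the values' own min and max."""
    x_min, x_max = min(values), max(values)
    return (x_max - x_min) / (2 ** k - 1), x_min


def fake_quant(values, s, x_min):
    """Quantize then dequantize; no clamp needed since the range covers the values."""
    return [round((v - x_min) / s) * s + x_min for v in values]


def quantize_per_tensor(W, k=8):
    """A single (s, x_min) pair shared by the whole matrix."""
    s, x_min = quant_params([v for row in W for v in row], k)
    return [fake_quant(row, s, x_min) for row in W]


def quantize_bucketed(W, k=8):
    """One (s, x_min) pair per row, i.e. per output unit."""
    return [fake_quant(row, *quant_params(row, k)) for row in W]


def max_abs_error(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


# Row 0 has a tiny range and row 1 a large one; 4 bits makes the effect visible.
W = [[-0.01, 0.0, 0.02], [-4.0, 1.0, 3.0]]
per_tensor = quantize_per_tensor(W, k=4)
bucketed = quantize_bucketed(W, k=4)
```

With a shared range, the small-magnitude row is represented on a grid sized for the large row, so its relative error is much larger than with its own bucket.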

Dealing with Zeros
Unlike Jacob et al. (2017), we do not nudge the domain so that the zero value gets perfectly mapped. The only zero values which we have to deal with are the padding, the Softmax numerator and output, the output of ReLU layers and dropouts. Since padding has no effect on the final output, we completely ignore these values when quantizing. For ReLUs and the Softmax's numerator and output, we fix their x_min to 0, which guarantees the perfect mapping of the value. Finally, quantization is applied before any dropout operation. Indeed, even though the zeros added to the output of the quantization layer might not be part of the domain, this only happens during training.
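A small numerical check makes the point about fixing x_min to 0 concrete. The sketch assumes the usual uniform form with step size s = (x_max − x_min) / (2^k − 1); the function name `quantize` and the example ranges are hypothetical.

```python
def quantize(x, x_min, x_max, k=8):
    """Uniform quantization of a scalar onto 2**k levels in [x_min, x_max]."""
    s = (x_max - x_min) / (2 ** k - 1)
    clamped = min(max(x, x_min), x_max)
    return round((clamped - x_min) / s) * s + x_min


# With x_min fixed to 0 (as for ReLU outputs), zero sits on level 0 exactly.
zero_with_fixed_min = quantize(0.0, 0.0, 6.0)

# With an arbitrary negative x_min, zero generally falls between two levels.
zero_with_free_min = quantize(0.0, -0.3, 1.0)
```

In the second case the nearest level lies slightly off zero, which is the mismatch that nudging the domain would otherwise correct.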

Related Work
Recently, simple quantization solutions have been applied to the Transformer. Cheong and Daniel (2019) apply k-means quantization and binarization with two centroids over the weights of the network. For both methods, a lookup table associated with each quantized layer is used to map indices to their corresponding centroids. Similarly, Fan (2019) compares binary, 4 and 8-bit uniform quantization of the Transformer weights. A big disadvantage with quantizing only the weights of a network is that operations must still be performed in full-precision. Even though the parameters' memory usage is reduced, the weights constantly have to be converted back to full-precision. Achieving quantization of both weights and activations is much more beneficial. The first attempt at doing so for the Transformer applies 8-bit quantization on weights and inputs of feed-forward layers and binarizes the (Q, K) input of the Multi-Head Attention (Tierno, 2019). The scaling factor √d_k is approximated by a constant which can be computed as a right bitshift. The method resulted in a huge drop in translation accuracy. Achieving better performance, Bhandare et al. (2019) quantize certain MatMul operations and use the KL divergence to estimate the most suited parameters for each quantization range. They refrain from quantizing all MatMuls, reporting poorer results in accuracy. Aside from translation, the concurrent work by Zafrir et al. (2019) quantizes the embedding and fully connected layers of BERT (Devlin et al., 2018). The Softmax and LayerNorm operations are kept in full-precision. On the GLUE benchmark, their loss in accuracy is minimal compared to the original model.
All of these methods omit quantizing the whole Transformer architecture, resulting in suboptimal computational efficiency. Furthermore, these solutions all fail to avoid impairing translation quality. Our method achieves both.

Experiments
In this section, we present the results of our full quantization scheme on various tasks. We first compare our method on a machine translation setup. Then we present the results of numerous ablation studies. We also compare the impact of delaying quantization on translation quality. Finally, we evaluate our method on two language model tasks and experiment with node pruning.

Full Quantization
We apply our quantization strategy on both the base and big Transformer (Vaswani et al., 2017). The training setup of all presented models is the same as in the original paper, with the exception that the dropout ratio is set to 0.1 in all cases. We refer readers to the original paper for experimental details. Our models were first evaluated on the WMT 2014 / 2017 English-to-German and WMT 2014 English-to-French translation tasks. Reported perplexity is per token and BLEU was measured with multi-bleu.pl 1 on the newstest2014 2 test set. We used beam search with a beam size of 4 and a length penalty of 0.6. Unlike Vaswani et al. (2017), no checkpoint averaging was performed.
We compare our results with the original Transformer and other 8-bit quantization methods in Table 1. All models are base Transformers. Original uncompressed size is the same in all cases. Most works do not report their compressed model size. For those, we give lower bounds based on their reports. Our BLEU score was computed on the test set using the checkpoint with the highest validation accuracy over 2 million training steps. Validation was computed every training epoch. Models were trained once. Our objective was to train quantized models up to convergence. Very similar BLEU scores can be obtained with far fewer training steps (see below). As for other methods, Cheong and Daniel (2019)
In Table 2, we show performance of our method on the WMT14 EN-DE and WMT14 EN-FR for a fixed number of training steps. We compare our results with two full-precision Transformers: base and big variants. We also evaluate two other quantization approaches. The first one is the "default" approach, which is to naively quantize every possible operation. The second approach applies our quantization strategy post-training (see section 5.3). In all cases except for post-quantization, BLEU was computed on the test set using the checkpoint which scored the highest accuracy on the validation set. Towards the end of training, we ran one validation epoch for every 100 training steps. Baselines and FullyQT 8-bit results were averaged over 5 trials. Standard deviation of the BLEU scores did not seem higher for any method and ranged between 0.09 and 0.51. Training with quantization was about twice as slow as with the baselines. As for post-training quantization, the BLEU score was computed on the test set using the best validation performance out of 20 trials. The default approach's nan in the EN-FR task is due to numerical instability. By quantizing every operation, zeros in the LayerNorm's denominator are more frequent.
Results on additional translation datasets can be found in Table 3.
All models were trained following the same setup as WMT14 EN-FR and WMT14 EN-DE. Vocabulary size is set to 32k for all models. Since there is no test set for WMT14 ES-EN, we used the validation set as a test set and omitted computing any validation epochs during training.
Looking at all conducted experiments, including section 5.3, translation quality of the 8-bit FullyQT models seems to be on par with full-precision. Most of the time, the highest BLEU was scored by the quantized model. For example in the case of WMT14 EN-DE, the maximum BLEU FullyQT base 8-bit obtained was 26.98, while the baseline's highest was 26.64. As for the big models, the max FullyQT scored was 27.95, whereas the baseline's was 27.43. We looked at training and validation curves to see if quantization had any effect, but saw no discernible difference.
All models use full-precision biases, s and x_min. This amounts to 11.61 Mb in the base models and 23.15 Mb in the big models. In the case of 8-bit, these represent less than 2.35% of the total size. Without bucketing, this would amount to 2.18 Mb and 4.35 Mb respectively. We believe the small increase in model size to be worth it. Indeed, in section 5.2, we show that training without bucketing leads to poorer translation.
Although 6-bit quantization seems to perform well, the compression advantage over 8-bit is usually lost. Most hardware stores INT6 using either 8 or 32 bits. Dedicated hardware is needed to get the full compression advantage. Unless 6-bit quantization results in better models, 8-bit seems like the best choice for most hardware.
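The storage argument can be made concrete with a few lines of arithmetic. The sketch compares the bytes needed when each 6-bit value occupies a full 8-bit or 32-bit container with the bytes needed when values are tightly packed. The function names and the count of 1000 values are hypothetical.

```python
import math


def stored_bytes(num_values, container_bits):
    """Bytes used when each value occupies one container of the given width."""
    return num_values * container_bits // 8


def packed_bytes(num_values, bits):
    """Bytes used when values are packed back to back with no padding."""
    return math.ceil(num_values * bits / 8)


n = 1000
int6_in_int8 = stored_bytes(n, 8)      # same footprint as plain 8-bit storage
int6_in_int32 = stored_bytes(n, 32)    # four times the 8-bit footprint
int6_packed = packed_bytes(n, 6)       # only reachable with dedicated packing support
```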

Ablation Studies
To better understand which operations are more sensitive to quantization, we evaluate the effect of quantizing single operations of the Transformer. By this, we mean quantizing the operation of a module for all Transformer layers. Table 4 shows results on the WMT14 EN-FR translation task for 8-bit precision. The effect of bucketing was also evaluated. BLEU was computed on the test set after 100k steps of training. In 24 out of 27 experiments, performance was better than our full-precision baseline of 38.34 BLEU. Solely quantizing the LayerNorm's denominator with no bucketing results in poor performance. The latter also cannot be bucketed since all dimensions of the variance tensor vary per batch. To successfully quantize this element without causing any loss in performance, we suspect quantizing other elements in the network helps.
To further validate our quantization scheme, we evaluated four models trained with alterations to our design choices. Results on the WMT14 EN-FR task are presented in Table 5. All models are 8-bit quantized base Transformers. Training procedure is the same as in section 5.1.

Delaying Quantization
Our method's goal is to increase computational efficiency at inference with the Transformer. To this end, our quantization scheme only requires us to learn s and x_min. Although we do so throughout the whole training, this is not a necessity. Quantization could also be applied later during training. Results for different starting points are compared in Table 6. The earliest we start quantizing is at 100 steps, since we need at least a few steps to assess the running estimates. All models were evaluated on the WMT14 EN-DE and WMT14 EN-FR translation tasks. BLEU was measured on the test set using the checkpoint which scored the highest accuracy on the validation set during training. Validation was computed every 100 training steps towards the end of training. From our observed results, quantizing the model early on seems preferable.
Learning quantization parameters adds a significant computational cost during training. A major advantage of delaying quantization is that more training steps can be performed in the same amount of time. Therefore, when training time is a constraint, a possible strategy is to train a model without quantization, perform more training steps and finally post-quantize the model. By the latter, we mean keeping all weights fixed and computing s and x_min over a few hundred steps. This is another advantage, since many trials can be performed in search of the best performing candidate. We found post-quantization BLEU scores to vary by about 0.2 BLEU.
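Post-quantization as described here amounts to a calibration pass: the weights stay fixed while running estimates of the activation range are collected. The sketch below assumes an exponential moving average with momentum 0.9 of the form new = 0.9 * old + 0.1 * current, and assumes the estimates are initialized from the first batch. Both choices, along with the names `RangeTracker` and `calibrate`, are assumptions made for illustration.

```python
class RangeTracker:
    """Running estimates of an activation's quantization interval."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.x_min = None
        self.x_max = None

    def update(self, activations):
        lo, hi = min(activations), max(activations)
        if self.x_min is None:
            # Assumed initialization: take the first batch's range as is.
            self.x_min, self.x_max = lo, hi
        else:
            m = self.momentum
            self.x_min = m * self.x_min + (1 - m) * lo
            self.x_max = m * self.x_max + (1 - m) * hi

    def scale(self, k=8):
        """Step size s derived from the current estimates."""
        return (self.x_max - self.x_min) / (2 ** k - 1)


def calibrate(batches, momentum=0.9):
    """Run over activation batches with frozen weights and return the tracker."""
    tracker = RangeTracker(momentum)
    for batch in batches:
        tracker.update(batch)
    return tracker
```

Because only the tracker state changes, several calibration runs over different data orderings are cheap, which is what makes searching over many trials practical.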

Language Modeling
To evaluate if our quantization scheme generalizes well to other tasks, we evaluate it on two language modeling datasets: WikiText-2 and WikiText-103. As the setup, we use PyTorch's language modeling toy example (footnote 3). The task consists of predicting the sequence {x_{t+1}, ..., x_{t+n+1}} from the input sequence {x_t, ..., x_{t+n}}. We trained four Transformer models, each with different precision. All models consist of two Transformer encoder layers, with the embedding and hidden size set to 200. Multi-Head Attention has two heads with key and value size 64. The final word projection layer's weights are shared with the embedding layer. Models were trained for 10 epochs with a batch size of 20 and sequence length of 35. Learning rate is set to 5, dropout to 0.2 and gradient clipping to 0.25. Loss is computed on every element of the output sequence. Results are presented in Table 7. Validation was computed every epoch to determine the best candidate. Loss and perplexity are computed on the test set and averaged over 10 trials for WikiText-2 and 3 trials for WikiText-103. See footnote 3 for any extra details.
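The input and target sequences of this task are the same token window shifted by one position. A minimal sketch of that slicing, with a hypothetical helper name `get_batch` and a plain list standing in for the tokenized corpus:

```python
def get_batch(tokens, t, n):
    """Return (inputs, targets) where targets are the inputs shifted by one token.

    inputs  = x_t,     ..., x_{t+n}
    targets = x_{t+1}, ..., x_{t+n+1}
    """
    inputs = tokens[t : t + n + 1]
    targets = tokens[t + 1 : t + n + 2]
    return inputs, targets


corpus = list(range(10))              # stand-in token ids
inputs, targets = get_batch(corpus, t=2, n=3)
```

Since the loss is computed on every element of the output sequence, each position of `inputs` contributes one prediction of the matching position in `targets`.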

Pruning Useless Nodes
We experiment with node pruning our Transformer models. Once the model is fully trained and quantized, we can further compress it by removing useless nodes. By useless, we mean nodes which do not cause any loss in translation quality when removed. We choose to prune nodes instead of independently pruning weights. The latter method usually requires special hardware or software to leverage sparse weight matrices. Pruning nodes results in concretely shrunken models. When getting rid of a node, we remove its corresponding set of weights from the layer outputting it and the following layer receiving the node as input.
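Removing a node from a two-layer feed-forward network can be sketched as dropping one row of the first weight matrix (and its bias entry) and the matching column of the second. The layout below, with `W1` stored as one row per hidden node and `W2` as one row per output unit, and the function names are assumptions made for illustration.

```python
def ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward network with a ReLU in between."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def prune_nodes(W1, b1, W2, keep):
    """Keep only the hidden nodes whose indices are listed in keep."""
    W1_p = [W1[i] for i in keep]                    # rows of the layer outputting the node
    b1_p = [b1[i] for i in keep]
    W2_p = [[row[i] for i in keep] for row in W2]   # columns of the layer receiving it
    return W1_p, b1_p, W2_p


# Hidden node 1 never activates (zero weights, negative bias), so it is useless.
W1 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
b1 = [0.0, -1.0, 0.0]
W2 = [[1.0, 5.0, 0.0], [0.0, 7.0, 2.0]]
b2 = [0.0, 0.0]
x = [3.0, 4.0]

W1_p, b1_p, W2_p = prune_nodes(W1, b1, W2, keep=[0, 2])
```

The pruned matrices are dense and strictly smaller, so no sparse-matrix support is required to benefit from them.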
The only nodes of the Transformer which can be removed without causing alterations to other components of the network are the nodes in between the two layers of each feed-forward network. Fortunately, these make up a substantial portion of the model's weights. In the case of the base Transformer, for a respective vocabulary of size 37000 and 32000, 39.96% and 41.65% of the total weights are owned by the feed-forward networks. This number grows to 47.03% and 48.18% in the big Transformer.
To evaluate which nodes can be safely pruned without affecting translation quality, we estimate x_max for each node of the ReLU output over a few hundred steps. This is done on the training set, using the fully trained model and keeping all other weights frozen. These x_max are computed before quantizing the ReLU output and do not replace the ones used by the quantization process. Figure 3 in the appendix shows the histogram of these running estimates for one ReLU layer in the encoder and one in the decoder. All other ReLU layers share the same pattern, where in the encoder there are always multiple x_max close to 0. This does not happen in the decoder.
Once the running estimates are computed, we prune a node if its x_max < zσ, where z is a hyperparameter and σ the standard deviation of the layer's x_max. We empirically found z = 0.025 to work well, with higher thresholds causing BLEU to quickly decay. No retraining of the model is performed after pruning nodes.
Using this method, we can further compress the Transformer without affecting BLEU scores. Our approach has the advantage of being adaptive, meaning the number of nodes pruned per layer will differ as opposed to a fixed pruning ratio method. For example, in the case of the big Transformer trained on WMT14 EN-FR, 169 nodes were pruned in the first ReLU of the encoder, while in the second, 1226 were pruned. Nodes in the decoder rarely got pruned, at most 4 in the whole decoder. Results are presented in Table 8. Reported results are averaged on the test set over a few trials. BLEU varied by about 0.01−0.02. Other approaches usually decide the ratio first and then prune. We compared with two such methods. For each task, we fix their ratio to the average percentage of nodes pruned by our method and only prune nodes in the encoder. The first fixed pruning method uses L1-norm to sort nodes in ascending weight order, while the second sorts the x_max, also in ascending order.
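The pruning rule can be sketched in a few lines. The example assumes σ is the population standard deviation of the layer's x_max values (the text does not say whether the population or sample form is used), and the function name `nodes_to_prune` and the example values are hypothetical.

```python
import statistics


def nodes_to_prune(x_max, z=0.025):
    """Indices of nodes whose running x_max falls below z times the layer's std."""
    sigma = statistics.pstdev(x_max)   # assumed: population standard deviation
    return [i for i, v in enumerate(x_max) if v < z * sigma]


# Two nodes with x_max near zero among otherwise active nodes.
layer_x_max = [0.0, 0.001, 1.0, 2.0, 3.0]
pruned = nodes_to_prune(layer_x_max)
```

Because the threshold scales with each layer's own spread of x_max, the number of pruned nodes adapts per layer rather than following a fixed ratio.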

Conclusion
We proposed a full quantization strategy for the Transformer architecture. Our objective was to exploit hardware resources as efficiently as possible, quantizing all operations which could provide a computational speed gain.
With FullyQT, we achieve higher BLEU scores than all other quantization methods for the Transformer on multiple translation tasks and avoid any loss in BLEU compared to full-precision. Specifically, out of 35 experiments, 8-bit quantization performed better than full-precision in 21 cases.
If, instead of minimizing inference time, one wants to maximize translation accuracy, then applying quantization to only certain components of the Transformer seems to be the best option. Indeed, our ablation study showed that BLEU score could increase even more when only specific elements of the Transformer were quantized. Further gains might be possible, but supplementary experiments would be necessary to determine the best combination.
We plan on extending our work to variations of the Transformer, as well as further exploring the compression of these networks.