Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task

We participated in all tracks of the Workshop on Neural Generation and Translation 2020 Efficiency Shared Task: single-core CPU, multi-core CPU, and GPU. At the model level, we use teacher-student training with a variety of student sizes, tie embeddings and sometimes layers, use the Simpler Simple Recurrent Unit, and introduce head pruning. On GPUs, we use 16-bit floating-point tensor cores. On CPUs, we customize 8-bit quantization and use multiple processes with core affinity in the multi-core setting. To reduce model size, we experiment with 4-bit log quantization but use floats at runtime. In the shared task, most of our submissions were Pareto-optimal with respect to the trade-off between time and quality.


Introduction
This paper describes the University of Edinburgh's submissions to the Workshop on Neural Generation and Translation (WNGT) 2020 Efficiency Shared Task (https://sites.google.com/view/wngt20/efficiency-task) using the Marian machine translation toolkit (Junczys-Dowmunt et al., 2018a). The task has GPU, single-core CPU, and multi-core CPU tracks. Our submissions focus on the trade-off between translation quality and speed; we also address model size after submission.
Starting from an ensemble of 4 transformer-big teacher models, we trained a variety of student configurations, in some cases additionally pruning transformer attention heads. For decoding, we explored lower-precision GEMM for both our CPU and GPU submissions. Small models appear to be more sensitive to quantization than large models.
Most of our single-core CPU submissions had a memory leak, which also impacted speed; we report results before and after fixing the leak.

Shared Task Summary

The task measures quality (approximated by BLEU; Papineni et al., 2002), speed, model size, Docker image size, and memory consumption of a machine translation system from English to German under the WMT 2019 data condition (Barrault et al., 2019). We did not optimize Docker image size (we used stock Ubuntu) or memory consumption (we preferred large batches for speed).
The task intentionally did not specify a test set until after submissions were made. The test set was later revealed to be the average BLEU over the WMT test sets from 2010 through 2019, inclusive. However, the 2012 test set was excluded because it contains English sentences longer than 100 words, and participants were promised input of at most 100 words. We refer to the task's metric as WMT1*. All BLEU scores are reported using sacrebleu. The CPU tracks used an Intel Xeon Platinum 8275CL, while the GPU track used an NVIDIA T4. For speed, the official input has 1 million lines of text with 15,048,961 space-separated words.
Teacher-student training

Following Junczys-Dowmunt et al. (2018b) and Kim et al. (2019), all our optimized models are students created using interpolated sequence-level knowledge distillation (Kim and Rush, 2016) and trained on data generated from a teacher system.
Teacher

We used the sentence-level English-German system from Microsoft's constrained submission to the WMT'19 News Translation Task (Junczys-Dowmunt, 2019). It is an ensemble of four deep transformer-big models (Vaswani et al., 2017), each with 12 blocks of layers in the encoder and decoder, a model size of 1024, and a filter size of 4096.

Data and training

Our student models were trained on pairs of original source sentences and teacher-translated target sentences generated from the parallel English-German datasets and English News Crawl data available for WMT19 (Barrault et al., 2019). For parallel data, we generated 8-best lists and selected the translations with the highest sentence-level BLEU against the reference sentences. Monolingual data was translated with a beam size of 4. We filtered the data with language identification using FastText (Joulin et al., 2017), and then scored all sentence pairs with a German-English transformer-base model trained on a subset of the original parallel data of about 7 million sentences. The obtained log probabilities p were normalized as exp(0.1·p) and used for data weighting during training. We also removed ca. 5% of the worst-scoring examples from each dataset, except Paracrawl (Bañón et al., 2020), from which we used only the 15M highest-scoring sentences. This procedure is similar to the single-direction step of dual cross-entropy filtering (Junczys-Dowmunt, 2018). The final training set consisted of 185M sentences, including 20M of originally parallel data. All student models were trained using the concatenated English-German WMT test sets from 2016-2018 as a validation set, until BLEU had stopped improving for 20 consecutive validations, and we selected the model checkpoints with the highest BLEU scores. Since a student model should mimic the teacher as closely as possible, we did not use regularization such as dropout or label smoothing. Other training hyperparameters were the Marian defaults for training a transformer-base model.
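The weighting and filtering step above can be sketched as follows (a minimal illustration assuming per-sentence log probabilities from the scoring model; `data_weights` is a hypothetical helper, not part of our actual pipeline):

```python
import math

def data_weights(log_probs, drop_frac=0.05):
    # Rank sentence pairs by the scoring model's log probability p,
    # drop the worst-scoring fraction (ca. 5% per dataset), and weight
    # the survivors by w = exp(0.1 * p) as described above.
    order = sorted(range(len(log_probs)),
                   key=lambda i: log_probs[i], reverse=True)
    keep = order[:max(1, int(len(order) * (1.0 - drop_frac)))]
    return {i: math.exp(0.1 * log_probs[i]) for i in sorted(keep)}
```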
Student models

All our students have standard transformer encoders (Vaswani et al., 2017) and light-weight RNN-based decoders with the Simpler Simple Recurrent Unit (SSRU) (Kim et al., 2019); they differ in the number of encoder and decoder blocks and in the sizes of the embedding and filter layers. Most models use a shared vocabulary of 32,000 subword units created with SentencePiece (Kudo and Richardson, 2018), but we also experimented with a smaller vocabulary of only 8,000 units for size-optimized systems. The student architectures are summarized in Table 1.
Interestingly, our student models do much better with originally English input, resulting in generally higher BLEU relative to the teacher's performance on the WMT19 test set than on test sets from previous years, which consist of both translations and translationese. For example, the teacher achieves 42.4 and 42.2 BLEU on the originally English and originally German subsets of the WMT16 test set, respectively, while the Base student model achieves 42.5 and only 35.6 BLEU. We think the reason is that student models were trained solely on teacher-translated data without back-translations.

Attention pruning
Attention is one of the most expensive operations in the transformer architecture, yet many of the heads can be pruned after training (Voita et al., 2019). The lottery ticket hypothesis (Frankle and Carbin, 2018) and subsequent work on pruning optimisation (Frankle et al., 2019) suggest that pruning is less damaging during training than after training. Hence we combine these two ideas to prune attention heads during training.
Since we start from a relatively optimized model (Tiny in Table 1) whose decoder has one tied layer with SSRU self-attention, our pruning approach focuses on the 48 encoder heads. We apply a late-resetting strategy that iteratively removes heads in short training loops (Frankle et al., 2019). This method starts by training the full model for 25k batches to create a checkpoint. Then we repeatedly train for 15k updates, remove N heads, and revert the remaining parameters to their values from the aforementioned checkpoint. Inspired by Voita et al. (2019), we calculate attention "confidence": each time a head is applied, we take the maximum of its attention weights, and these maxima are averaged across all applications of the head to form a confidence score. Attention heads with high confidence are considered to contribute the most to overall network performance, so we remove the N least confident heads in each pruning iteration.
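The confidence score can be sketched as follows (an illustrative reimplementation, not Marian's code; `attn_rows` stands for the attention probability vectors recorded each time a head is applied):

```python
from statistics import mean

def head_confidence(attn_rows):
    # For each application of the head, take the maximum attention
    # weight; the confidence score is the mean of those maxima.
    return mean(max(row) for row in attn_rows)

def least_confident_heads(per_head_rows, n):
    # Indices of the n lowest-confidence heads: the pruning
    # candidates removed in each iteration.
    scores = [head_confidence(rows) for rows in per_head_rows]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:n]
```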
We try removing N = 3 or N = 6 heads per iteration, dubbed Steady and Pushy in system names, respectively. Since the algorithm usually picks one head from each layer, the final architectures differ. For example, removing 6 heads per iteration results in a monotonic attention distribution across the 6 encoder layers. For submissions, we pruned 36 of the 48 heads; as an additional experiment we tried removing 42 of the 48 heads. The final attention distributions, sizes, and BLEU scores for those models are presented in Table 2.
Considering that our students perform better on newer test sets, the pruning results show that it is possible to remove at least 75% of the self-attention heads in an encoder at an average cost of 0.4 BLEU. With harsher pruning, the model with even numbers of heads per layer performs better than the one with no heads in the first two layers. This indicates that, in extreme cases, it is better to have at least one head per layer than none. Since the dimension of each head is small (256 / 8 = 32), pruning does not reduce the overall size of the models drastically. The speed-up is about 10% on CPU with 75% of encoder heads removed. On GPU, our best pruned model gains a 15% speed-up in words per second (WPS) while losing 0.1 BLEU compared to the unpruned model (Table 4).

CPU optimizations
For our CPU optimization we build upon last year's submission (Kim et al., 2019). We use the same lexical shortlist, but we extend the use of 8-bit integer quantized GEMM operations to also cover the shortlisted output layer, for faster computation and an even smaller model size.

8-bit quantization
Quantization from 32-bit floats to 8-bit integers is well known (Kim et al., 2019; Bhandare et al., 2019; Rodriguez et al., 2018) and reportedly has minimal quality impact. For this year's submission, we used intgemm instead of FBGEMM as our 8-bit GEMM backend. Vocabulary shortlisting entails selecting columns from the output matrix, and intgemm can directly extract columns in its packed format. The packed format reduces memory accesses during multiplication. Users can also specify arbitrary postprocessing of the output matrix while it is still in registers, before writing to RAM. Currently we use this to add the bias term in a streaming fashion, saving a memory round trip on the common A * B + bias operation in neural network inference; in the future we plan to integrate activation functions.

Table 3: Model sizes, average BLEU scores and speed for quantized models. For the official submission we only used the 8-bit quantized models. More information about the unquantized models can be found in Table 1. The suffix "-untuned" means the model was quantized without continued training. In the multi-core setting, fixing the memory leak had minor impact on speed, so we only report fixed numbers. Here, size excludes a 315 KB SentencePiece model and an optional (but useful for speed) 11 MB lexical shortlist file.
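The streaming bias addition can be illustrated with a small simulation (pure Python with hypothetical names; intgemm performs the same fusion in SIMD registers rather than scalar loops):

```python
def quantized_gemm_with_bias(A, B, bias, scale_a, scale_b):
    # Simulated int8 GEMM: quantize both operands, accumulate in
    # integer arithmetic, then add the bias during the dequantizing
    # postprocess, before the result would be written back to RAM.
    def q(x, s):  # quantize with scale s, clip to the int8 range
        return [[max(-127, min(127, round(v * s))) for v in row]
                for row in x]
    Aq, Bq = q(A, scale_a), q(B, scale_b)
    n, k, m = len(Aq), len(Bq), len(Bq[0])
    out = []
    for i in range(n):
        row = []
        for j in range(m):
            acc = sum(Aq[i][t] * Bq[t][j] for t in range(k))
            # Fused postprocess: dequantize and add bias in one step.
            row.append(acc / (scale_a * scale_b) + bias[j])
        out.append(row)
    return out
```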
Last year (Kim et al., 2019), parameters were quantized and packed offline from a fully trained model. This year, we noticed quality degradation when quantizing smaller models and therefore introduced continued training. Continued training ran for 5000-7000 mini-batches, emulating 8-bit GEMM by quantizing the activations and weights then restoring them to 32-bit values, borrowing from methods used for 4-bit quantization (Aji and Heafield, 2019).
Quantization entails computing a scaling factor to collapse the range of values to [−127, 127]. For parameters, this scaling factor is computed offline using the maximum absolute value (we tried a variety of other statistics, including minimizing mean squared error, but none worked as well as continued training), but activation tensors change at runtime. This year, we changed from computing a dynamic scaling factor on the fly for activations to computing a static scaling factor offline. We decoded the WMT16 dataset and recorded the scaling factor α(A_i) = 127 / max(|A_i|) for each instance A_i of an activation tensor A. Then, for production, we fixed the scaling factor for activation tensor A to the mean scaling factor plus 1.1 standard deviations: α(A) = mean_i(α(A_i)) + 1.1 · std_i(α(A_i)). These scaling factors were baked into the model file so that statistics were not computed at runtime.
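The static scale computation amounts to the following (a sketch; `recorded_maxima` stands for the max(|A_i|) values logged while decoding the calibration set):

```python
from statistics import mean, stdev

def static_scale(recorded_maxima):
    # alpha_i = 127 / max|A_i| per recorded instance of the tensor;
    # the production scale is fixed at mean + 1.1 standard deviations.
    alphas = [127.0 / m for m in recorded_maxima]
    return mean(alphas) + 1.1 * stdev(alphas)
```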
All parameter matrices are prepared either offline or when decoding the first word (in the case of the output layer), and they are later reused for the GEMM operations (or, in the case of the output layer, columns associated with vocabulary items are extracted from the prepared matrix).
For the GEMM operations in the attention layer, we used cblas_sgemm_batched from Intel's MKL library. Model sizes, translation quality, and speed are reported in Table 3.

Memory leak

Most of our CPU submissions had a memory leak due to failing to clear a cache of shortlisted output matrices. Hence our official CPU submissions using intgemm had unreasonable memory consumption after translating the 1 million lines specified in the shared task. In one case, this exceeded the 192 GB RAM of the c5.metal instance and a submission was disqualified; in other cases the submissions ran but used too much RAM and likely more CPU time as a consequence. In practice, the negative effect on speed was only evident in the single-core submissions, because multi-core submissions divided work across processes.

Log 4-bit quantization
Model parameters roughly follow a normal distribution: most of them are near zero. Therefore, a fixed-point quantization mechanism such as the 8-bit scheme described above is not suitable when quantizing to lower precision. We can achieve better model size compression by using logarithmic 4-bit quantization (Aji and Heafield, 2019).
We start by quantizing a baseline model into 4-bit precision. We leave the biases unquantized, as they do not follow the same distribution as the rest of the parameter matrices and therefore quantize poorly. Moreover, the compression rate is practically unaffected, since the biases account for only a small share of the parameters. Finally, the model must be fine-tuned under 4-bit precision to restore the quality lost to quantization.
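In simplified form, log quantization stores each weight as a sign plus a small integer exponent (a sketch of the idea only; the exact centering and scaling of Aji and Heafield (2019) are omitted):

```python
import math

def log_quantize(values, bits=4):
    # Anchor the largest magnitude at exponent 0 and snap every other
    # value to the nearest power of two relative to it, keeping the sign.
    n_exp = 2 ** (bits - 1) - 1            # distinct exponents available
    top = max(abs(v) for v in values)
    out = []
    for v in values:
        if v == 0.0:
            out.append(0.0)
            continue
        e = round(math.log2(abs(v) / top))
        e = max(-(n_exp - 1), min(0, e))   # clip to representable range
        out.append(math.copysign(top * 2.0 ** e, v))
    return out
```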
With 4-bit precision, we can achieve around 8x model size reduction. While 4-bit log quantization is in principle hardware-friendly since it uses only adds and shifts, current CPUs and GPUs do not natively support it (GPUs do support 4-bit fixed-point quantization, but this reduced quality compared to log quantization). The additional instructions required to implement 4-bit arithmetic made inference slower than with native 8-bit operations. Therefore, we focus on model size, useful for downloading, and dequantize before running the model in float32.
Model sizes and BLEU scores are reported in Table 3. Generally, quantizing the model is a better choice for reducing model size than reducing the number of parameters. For example, Base + log-4bit is as small as 19 MB while losing just 0.4 BLEU compared to the baseline. In contrast, the Tiny model is 65 MB but loses 1.5 BLEU compared to the float32 and int8 settings.
We see that 4-bit log quantization achieves the best size-quality trade-off. For example, our Base + log-4bit (19 MB) achieves the highest average BLEU, 34.1, among models of similar size, such as Tiny + 8bit (17 MB, 32.89 BLEU). Similarly, our Tiny + log-4bit (8 MB) achieves an average BLEU of 31.46, compared to models in a similar size range such as Micro.8k + 8bit (9 MB, 30.61 BLEU). However, larger models are more robust to extreme quantization than smaller models: our Tiny.8k + log-4bit degrades significantly in quality.

Multi-core configuration
For the multi-core track, we swept configurations of multiple processes and threads, settling on 24 processes with 2 threads each. The input text is simply split into 24 pieces and parallelized over processes. Mini-batch size did not impact performance substantially, and 32 was chosen. Profiling with VTune revealed that performance was limited by memory bandwidth; hence the Hyper-Threads available on the platform were not used, and the 48 physical cores were saturated with 24 processes (Tange, 2011) running 2 threads each. Each process was bound to two sequentially assigned cores, and to the memory domain corresponding to the socket containing those cores, using numactl. Output from the data-parallel run is then stitched together to produce the final translation.
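The per-process binding can be sketched as follows (illustrative only: the decoder path and its flags are placeholders, not our exact submission scripts):

```python
import subprocess

def decoder_command(i, input_path, threads=2):
    # Pin process i to cores 2i and 2i+1 and allocate memory on the
    # local NUMA node; everything after 'numactl' and its options is a
    # placeholder for the real decoder invocation.
    cores = f"{2 * i}-{2 * i + 1}"
    return ["numactl", f"--physcpubind={cores}", "--localalloc",
            "./marian-decoder", "--cpu-threads", str(threads),
            "-i", f"{input_path}.{i:02d}", "-o", f"out.{i:02d}"]

def run_sharded(input_path, n_procs=24):
    # Launch one bound decoder per input shard, wait for all of them,
    # then stitch the shard outputs back together in input order.
    procs = [subprocess.Popen(decoder_command(i, input_path))
             for i in range(n_procs)]
    for p in procs:
        p.wait()
    with open("out.txt", "w") as out:
        for i in range(n_procs):
            with open(f"out.{i:02d}") as shard:
                out.write(shard.read())
```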

GPU systems
This year, we did not implement any GPU-specific optimizations and focused on comparing the performance, on the GPU, of the student architectures developed for CPU decoding. We made 4 submissions to the GPU track. The results for all student models, averaged across 3 runs, are reported in Table 4. We decode on the GPU using batched translation with a mini-batch of 256 sentences, pervasive FP16 inference, and lexical shortlists (Kim et al., 2019). These features are already available in Marian 1.9.
The average speed-up from decoding in 16-bit floats is 21%, depending on the model architecture: the larger the model, the larger the speed improvement, from as high as 56% for the Large student model, through 32% for Base, to only 13-18% for Tiny models. This comes with barely any change in BLEU (within ±0.1). Models with pruned transformer heads are faster than the original Tiny model by 15% on GPU, but decrease accuracy by 0.1-0.5 BLEU on the WMT19 test set. On this relatively small data set, we notice a small translation speed decrease of up to 2% from using lexical shortlists. Running concurrent streams on a single GPU did not yield significant improvements for us.

Results and discussion
All submissions and selected experiments are depicted in Figure 1. We explored a variety of ways to optimize the trade-off between quality, speed, and model size. We used an ensemble of 4 transformer-big teacher models to train a number of different student configurations. Smaller student models are faster to decode, but also degrade quality further compared to the ensemble of teachers. Furthermore, we applied gradual transformer head pruning to the student models. While pruning heads does not reduce the number of parameters significantly, it has a major impact on computational cost and increases translation speed at a small penalty in BLEU.
On the software side, we experimented with a number of methods that reduce the precision of GEMM operations. For our GPU submissions we decode using 16-bit floats, and for our CPU submissions we use 8-bit integers. We note that the smaller the model (in number of parameters), the more its quality is impacted by quantization, while the bigger the model, the larger the speed increase. We found that fine-tuning with a quantized GEMM can recover some of the quality lost to quantization.
We also experimented with logarithmic 4-bit model compression, which did not yield increased translation speed due to hardware, but produced the smallest model sizes.