Marian: Cost-effective High-Quality Neural Machine Translation in C++

This paper describes the submissions of the “Marian” team to the WNMT 2018 shared task. We investigate combinations of teacher-student training, low-precision matrix products, auto-tuning and other methods to optimize the Transformer model on GPU and CPU. By further integrating these methods with the new averaging attention networks, a recently introduced faster Transformer variant, we create a number of high-quality, high-performance models on the GPU and CPU, dominating the Pareto frontier for this shared task.


Introduction
This paper describes the submissions of the "Marian" team to the Workshop on Neural Machine Translation and Generation (WNMT 2018) shared task (Birch et al., 2018). The goal of the task is to build NMT systems on GPUs and CPUs that lie on the Pareto frontier of efficiency and accuracy (see the shared task description: https://sites.google.com/site/wnmt18/shared-task). Marian (Junczys-Dowmunt et al., 2018; https://marian-nmt.github.io) is an efficient neural machine translation (NMT) toolkit written in pure C++ based on dynamic computation graphs. One of the goals of the toolkit is to provide a research tool which can be used to define state-of-the-art systems that at the same time can produce truly deployment-ready models across different devices. Ideally this should be accomplished within a single execution engine that does not require specialized, inference-only decoders.
The CPU back-end in Marian is a very recent addition and we use the shared task as a testing ground for various improvements. The GPU-bound computations in Marian are already highly optimized, so on the GPU we mostly concentrate on modeling aspects and beam-search hyper-parameters.
The weak baselines (at 16.9 BLEU on newstest2014, at least 12 BLEU points below the state of the art) could promote approaches that happily sacrifice quality for speed. We choose a quality cut-off of around 26 BLEU for the first test set (newstest2014) and do not spend much time on systems below that threshold. This threshold was chosen based on the semi-official Sockeye (Hieber et al., 2017) baseline (27.6 BLEU on newstest2014) referenced on the shared task page.
We believe our CPU implementation of the Transformer model (Vaswani et al., 2017) and averaging attention networks (Zhang et al., 2018) to be the fastest reported so far. This is achieved by integer matrix multiplication with auto-tuning. We also show that these models respond very well to sequence-level knowledge-distillation methods (Kim and Rush, 2016).

State-of-the-art teacher
Based on Kim and Rush (2016), we first build four strong teacher models for ensembling, following the procedure for the Transformer-big model (model size 1024, filter size 4096, file size 813 MiB) from Vaswani et al. (2017). We use 36,000 joint BPE subwords (Sennrich et al., 2016) and a joint vocabulary with tied source, target, and output embeddings. One model is trained until convergence for eight days on four P40 GPUs. See Tables 3 and 4 for an overview of BLEU scores for the models trained in this work.

Interpolated sequence-level knowledge-distillation
As described by Kim and Rush (2016), we retranslate the full training corpus source data with the teacher ensemble as an 8-best list. Among the eight hypotheses per sentence we choose the translation with the highest sentence-level BLEU score with regard to the original target corpus. Kim and Rush (2016) refer to this method as interpolated sequence-level knowledge-distillation. Next, we train our student models exclusively on the newly generated and selected output.
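The selection step can be illustrated with a minimal Python sketch. It is not the code used for the submissions: the function names are hypothetical, and the add-one smoothing of the n-gram precisions is an assumption, since the text does not specify which sentence-level BLEU variant was used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_order=4):
    """Add-one smoothed sentence-level BLEU (the smoothing is an assumption)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_order + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram matches between hypothesis and reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        log_precision += math.log((matches + 1) / (total + 1))
    brevity = min(0.0, 1.0 - len(ref) / len(hyp))  # log of the brevity penalty
    return math.exp(log_precision / max_order + brevity)


def select_best(nbest, reference):
    """Return the hypothesis closest to the original target sentence."""
    return max(nbest, key=lambda hyp: sentence_bleu(hyp, reference))
```

Applied to every training sentence, `select_best` would yield the distilled target side on which the students are trained.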

Decoding with small beams
Whenever we use beam size 1, we skip softmax evaluation and simply select the output word with highest activation. The input sentences are sorted by source length, then decoded in batches of approximately equal length. We batch based on number of words. For CPU decoding we use a batch size of at least 384 words (ca. 15 sentences), for the GPU at least 8192 words (ca. 300 sentences).
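Both ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Marian's batching code: the function names are hypothetical, sentences are whitespace-tokenized, and a batch is closed as soon as it holds at least `min_words` words.

```python
def make_batches(sentences, min_words):
    """Sort by source length, then close a batch once it holds min_words words."""
    ordered = sorted(sentences, key=lambda s: len(s.split()))
    batches, current, words = [], [], 0
    for sentence in ordered:
        current.append(sentence)
        words += len(sentence.split())
        if words >= min_words:
            batches.append(current)
            current, words = [], 0
    if current:  # the final batch may fall short of the word budget
        batches.append(current)
    return batches


def greedy_pick(logits):
    """Softmax is monotonic, so the argmax of the raw activations suffices."""
    return max(range(len(logits)), key=logits.__getitem__)
```

Because the sentences are sorted first, each batch contains sentences of similar length, which keeps padding small.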

Student architectures

Transformer students
For our Transformer student models we follow the Transformer-big and Transformer-base configurations from Vaswani et al. (2017). Additionally we investigate a Transformer-small and, post-submission, two Transformer-tiny variants on the CPU. All of them use six blocks of self-attention, source-attention, and FFN layers with varying embedding (model) and FFN sizes, see Table 1.
Transformer-big is initialized with one of the original teachers and fine-tuned on the teacher-generated data until development set BLEU stops improving for beam size 1. The remaining student models are trained from scratch on teacher-generated data until development set BLEU stalls for 20 validation steps when using beam size 1.

Averaging attention networks
Very recently, Zhang et al. (2018) suggested averaging attention networks (AAN), a modification of the original Transformer model that addresses a decode-time inefficiency, apparently without loss of quality. During translation, the self-attention layers in the Transformer decoder look back at their entire history, introducing quadratic complexity with respect to output length. Zhang et al. (2018) replace the decoder self-attention layer with a cumulative uniform averaging operation across the previous layer. During decoding, this operation can be computed based on the single last step. Decoding is then linear with respect to output length. Zhang et al. (2018) also add a feed-forward network and a gate to the block. We choose a smaller FFN size than Zhang et al. (2018) (corresponding to the embedding size instead of the FFN size in Table 1) and experiment with removing the FFN and gate.
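The following Python sketch illustrates why the cumulative average only needs the last step at decode time. It covers the averaging operation alone (the FFN and gate of the full AAN block are omitted), and the function names are hypothetical.

```python
def cumulative_average_full(inputs):
    """Average over the whole history at every step (quadratic overall)."""
    outputs = []
    for j in range(1, len(inputs) + 1):
        history = inputs[:j]
        outputs.append([sum(col) / j for col in zip(*history)])
    return outputs


def cumulative_average_step(prev_average, new_input, step):
    """Update the running average from the previous step only (constant time).

    `step` is the 1-based position of `new_input`.
    """
    return [((step - 1) * avg + x) / step
            for avg, x in zip(prev_average, new_input)]


# Usage: the incremental update reproduces the full computation step by step.
inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
average = [0.0, 0.0]
for step, vector in enumerate(inputs, start=1):
    average = cumulative_average_step(average, vector, step)
```

The decoder therefore only has to keep one running average per layer instead of the full history of previous states.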

RNN-based students
Our focus lies on efficient CPU-bound Transformer implementations. However, Marian and its predecessor Amun (Junczys-Dowmunt et al., 2016) were first implemented as fast GPU-bound implementations of Nematus-style (Sennrich et al., 2017b) RNN-based translation models. We use these models to cover the lower end of the quality spectrum in the task. We train a standard shallow GRU model (RNN-Nematus, embedding size 512, state size 1024), a small version (RNN-small, embedding size 256, state size 512) and a deep version with 4 stacked GRU blocks in the encoder and 8 stacked GRU blocks in the decoder (RNN-deep, embedding size 512, state size 1024). The deep model corresponds to the University of Edinburgh submission to WMT 2017 (Sennrich et al., 2017a).

Optimizing for the CPU
Most of our effort was concentrated on improving CPU computation in Marian. Apart from improvements from code profiling and bottleneck identification, we worked towards integrating integer-based matrix products into Marian's computation graphs.

Shortlist
A simple way to improve CPU-bound NMT efficiency is to restrict the final output matrix multiplication to a small subset of translation candidates. We use a shortlist created with fast_align (Dyer et al., 2013). For every mini-batch we restrict the output vocabulary to the union of the 100 most frequent target words and the 100 most probable translations for every source word in a batch. All CPU results are computed with a shortlist.
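A minimal Python sketch of the shortlist construction is given below. The data structures are assumptions made for illustration: `frequent_targets` is a list of target words sorted by descending frequency, and `translation_table` maps each source word to its candidate translations sorted by descending probability, as could be derived from word alignments.

```python
def build_shortlist(batch, frequent_targets, translation_table, k=100):
    """Union of the k most frequent target words and the k most probable
    translations of every source word in the batch."""
    shortlist = set(frequent_targets[:k])
    for sentence in batch:
        for source_word in sentence.split():
            # Source words without an entry contribute nothing.
            shortlist.update(translation_table.get(source_word, [])[:k])
    return shortlist
```

The output layer then only needs the rows of the output matrix that correspond to the words in the returned set.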

Quantization and integer products
Previously, Marian tensors would only work with 32-bit floating point numbers. We now support tensors with underlying types corresponding to the standard numerical types in C++. We focus on integer tensors. Some of our submissions replaced 32-bit floating-point matrix multiplication with 16-bit or 8-bit signed integers. For 16-bit integers, we follow Devlin (2017) in simply multiplying parameters and inputs by 2^10 before rounding to signed integers. This does not use the full range of values of a 16-bit integer so as to prevent overflow when accumulating 32-bit sums; there is no AVX512F instruction for 32-bit add with saturation.
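The 16-bit scheme can be sketched in Python as follows. This only illustrates the arithmetic, not the AVX512 kernel: the function names are hypothetical, the clamp to the 16-bit range is an added safeguard that the text does not describe, and Python integers do not overflow, so the 32-bit accumulator is only indicated in a comment.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def quantize_int16(values, shift=10):
    """Multiply by 2**shift and round to a signed 16-bit integer."""
    scale = 1 << shift
    # The clamp is an assumption; Python's round() resolves ties to even.
    return [max(INT16_MIN, min(INT16_MAX, round(v * scale))) for v in values]


def dot_int16(a, b, shift=10):
    """Integer dot product, rescaled back to a floating point value."""
    accumulator = sum(x * y for x, y in zip(a, b))  # 32-bit in a real kernel
    # Both operands carry a factor of 2**shift, so divide by 2**(2 * shift).
    return accumulator / float(1 << (2 * shift))
```

Since both operands are scaled by 2^10, each product carries a factor of 2^20, which leaves headroom in a 32-bit accumulator as long as the values stay small.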
For 8-bit integers, we swept quantization multipliers and found that 2^9 was optimal, but quality was still poor. Instead, we retrained the model with matrix product inputs (activations and parameters but not outputs) clipped to a range. We tried [−3, 3], [−2, 2], and [−1, 1], then settled on [−2, 2] because it had slightly better BLEU. Values were then scaled linearly to [−127, 127] and rounded to integers. We accumulated in 16-bit integers with saturation because this was faster, observing a 0.05% BLEU drop relative to 32-bit accumulation.
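As a rough Python sketch of the 8-bit path, assuming the clipping range [−2, 2] chosen above (the function names are hypothetical and the vectorized kernel is not modeled):

```python
def quantize_int8(values, clip=2.0):
    """Clip to [-clip, clip], scale linearly to [-127, 127], then round."""
    scale = 127.0 / clip
    return [round(max(-clip, min(clip, v)) * scale) for v in values]


def saturating_add_int16(a, b):
    """Add two values and saturate to the signed 16-bit range."""
    return max(-32768, min(32767, a + b))


def dot_int8(a, b):
    """Dot product of quantized vectors with a saturating 16-bit accumulator."""
    accumulator = 0
    for x, y in zip(a, b):
        accumulator = saturating_add_int16(accumulator, x * y)
    return accumulator
```

Saturation replaces wrap-around on overflow with the nearest representable value, which bounds the error of a single overflowing sum.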
The test CPU is a Xeon Platinum 8175M with support for AVX512. We used these instructions to implement matrix multiplication over 32 16-bit integers or 64 8-bit integers at a time.

Memoization
To ensure contiguous memory access, the integer matrix product dot_int(A, B) calculates AB^T instead of AB. It also expects its inputs A and B to be correctly quantized integer tensors. Therefore, we have to compute dot_int(quant_int(A), quant_int(B^T)) to use the quantized integer product as a replacement for the floating point matrix product.
In most cases, B is a parameter, while A contains activations. Repeating the quantization and transposition operations for every decoder parameter at every step would incur a significant performance penalty. To counter this, we introduce memoization into Marian's computation graphs. Memoization caches the values of constant nodes that will not change during the lifetime of the graph. During inference, parameter nodes are constant. Apart from that, any node with only constant children is constant and can be memoized. In our example, B is constant as a parameter, B^T is constant because its only child is constant, and so is quant_int(B^T). The product dot_int(quant_int(A), quant_int(B^T)) itself is not constant, as the activations A can change. Values for constant nodes are calculated only once during the first forward step in which they appear; subsequent calls will use cached versions.
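The constant-propagation rule can be sketched with a toy graph in Python. This is a simplified illustration, not Marian's graph implementation: the `Node` class and its fields are hypothetical.

```python
class Node:
    """Graph node; constant if it is a parameter or all children are constant."""

    def __init__(self, fn, children=(), is_param=False):
        self.fn = fn
        self.children = list(children)
        self.constant = is_param or (
            bool(self.children) and all(c.constant for c in self.children))
        self.cache = None
        self.evaluations = 0  # counts how often fn actually ran

    def forward(self):
        if self.constant and self.cache is not None:
            return self.cache  # memoized: reuse the value from the first pass
        value = self.fn(*[c.forward() for c in self.children])
        self.evaluations += 1
        if self.constant:
            self.cache = value
        return value


# Usage: the transposed parameter is computed once, the product every step.
B = Node(lambda: [[1, 2], [3, 4]], is_param=True)
B_T = Node(lambda m: [list(row) for row in zip(*m)], [B])
```

A node that reads activations has no children and is not a parameter, so it is never marked constant, and neither is any node that depends on it.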

Auto-tuning
At this point, the float32 (Intel's MKL) product and our int16 matrix product can be used interchangeably for small and mid-sized models (we see overflow for the large Transformer model). While trying to choose one implementation, we noticed that each algorithm outperforms the other in different contexts. In the face of many different matrix sizes and access patterns it is difficult to determine reliable performance profiles. Instead, we implemented an auto-tuner.
We hash tensor shapes and algorithm IDs and annotate each node in an alternative subgraph with a timer. We collect the total execution time across 100 traversals of each alternate subgraph. Once this limit has been reached, usually within a few sentences, the auto-tuner stops measurements and selects the fastest alternative for all subsequent calls.
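A simplified Python sketch of such an auto-tuner is shown below. It is an illustration under assumptions rather than Marian's implementation: the `AutoTuner` class is hypothetical, alternatives are plain callables, and the key stands in for the hashed tensor shapes.

```python
import time


class AutoTuner:
    """Times alternative implementations per key, then keeps the fastest."""

    def __init__(self, alternatives, max_runs=100):
        self.alternatives = alternatives  # algorithm name -> callable
        self.max_runs = max_runs
        self.stats = {}    # (key, name) -> [run count, total seconds]
        self.choice = {}   # key -> name of the selected implementation

    def run(self, key, *args):
        if key in self.choice:  # tuning finished for this key
            return self.alternatives[self.choice[key]](*args)
        # Measure the alternative with the fewest runs so far.
        name = min(self.alternatives,
                   key=lambda n: self.stats.get((key, n), [0, 0.0])[0])
        start = time.perf_counter()
        result = self.alternatives[name](*args)
        elapsed = time.perf_counter() - start
        record = self.stats.setdefault((key, name), [0, 0.0])
        record[0] += 1
        record[1] += elapsed
        if all(self.stats.get((key, n), [0, 0.0])[0] >= self.max_runs
               for n in self.alternatives):
            self.choice[key] = min(
                self.alternatives, key=lambda n: self.stats[(key, n)][1])
        return result
```

Because the decision is made per key, different matrix shapes can end up with different implementations, which corresponds to the mixed choice described for batched translation.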

Optimization results
Table 2 illustrates the effects of the optimizations introduced in this section for sentence-by-sentence and batched translation. Adding a shortlist improves translation speed significantly. Enabling int16 multiplication without memoization hurts performance; with memoization we see improvements for single-sentence translation and similar performance to MKL for batched translation. With auto-tuning, single-sentence translation achieves the same performance as before and batched translation improves. In both cases the auto-tuning algorithm was able to choose a good solution. In the single-sentence case we would always use the int16 product. In the batched case a mix performs better than a hard choice.
We also see respectable improvements for the Transformer-big model with int8 multiplication. Most of the loss in BLEU is due to the fine-tuning process with clipping during training.

Results and cost-effective decoding
In Tables 3 and 4, we summarize our experiments with GPU and CPU models. Bold rows contain results for our task submissions. We report model sizes in MiB, translation time without initialization and BLEU scores for newstest2014. Time has been measured on AWS p3.2xlarge instances (NVIDIA V100) and AWS m5.large instances, the official evaluation platforms of the shared task.
All our student models outperform the baselines in terms of translation speed and quality, but as stated before, we are mostly interested in models above a 26 BLEU threshold. It seems that the new AAN architecture is a promising modification of the Transformer with minimal or no quality loss in comparison to its standard equivalent. We also see that teacher-student methods can be successfully used to create high-performance and high-quality Transformer systems with greedy decoding.
We compare our systems on a common cost-effectiveness scale expressed as the number of source tokens translated per US Dollar, w_USD. Given the hourly price p for a dedicated AWS GPU (p3.2xlarge, 3.259 USD/h) or CPU (m5.large, 0.102 USD/h) instance (the same instance types were used for the shared task) and the time t in seconds to translate newstest2014, consisting of 62,954 source tokens, with a chosen model and instance, we calculate w_USD = 62,954 / (t · p / 3600).
This representation has multiple advantages:
• Systems deployed on different hardware can be compared directly;
• The linear mappings into the common space are scale-preserving and correctly represent relative speed differences between systems on the same hardware;
• We can relate three important categories (speed, quality, and cost) to each other in a single visualization.
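The cost-effectiveness measure described above can be computed with a short Python sketch. The token count and hourly prices are taken from the text; the function name and the device keys are hypothetical.

```python
NEWSTEST2014_SOURCE_TOKENS = 62954
PRICE_PER_HOUR_USD = {"gpu": 3.259, "cpu": 0.102}  # prices quoted in the text


def tokens_per_usd(seconds, device, tokens=NEWSTEST2014_SOURCE_TOKENS):
    """Source tokens translated per US dollar on a dedicated instance."""
    cost_usd = seconds * PRICE_PER_HOUR_USD[device] / 3600.0
    return tokens / cost_usd
```

Halving the translation time on the same hardware doubles the value, which is the scale-preserving property mentioned in the list above.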
Figures 1 and 2 illustrate the cost-effectiveness of our models, the baselines and submissions by other participants versus translation quality on newstest2014. Figure 1 contains all models with a logarithmic cost-effectiveness scale. This reflects a trend that speed gains are exponential in quality loss. Based on Figure 1, it seems that our models dominate the Pareto frontier for high-quality models on both CPU and GPU compared to the baselines and other participants.
We added post-submission systems (23i) and (24i) on the CPU to demonstrate that we can outperform the results of other participants for speed and quality when lowering our quality threshold.
In Figure 2, with a linear cost-effectiveness scale, we emphasize models around and above the quality threshold of 26 BLEU, which were our main focus in this work. It is interesting to see that similar Marian models have surprisingly similar cost-effectiveness across different hardware types.

Conclusions
We demonstrated that Marian can serve as an integrated research and deployment platform with highly efficient decoding algorithms on the GPU and CPU. Transformer architectures can be efficiently trained in teacher-student settings and then used with small beams or with greedy decoding. To our knowledge, this is also the first work to integrate Transformer architectures with low-precision matrix multiplication. By combining these methods with the new averaging attention networks, we created a number of high-quality, high-performance models on the GPU and CPU, dominating the Pareto frontier for this shared task.

Figure 1: Cost-effectiveness (logarithmic scale) vs BLEU for all systems and baselines.

Table 3: Results on newstest2014, GPU systems. Submitted systems in bold. All student systems have been used with beam size 1 unless stated differently (b=n).

Table 4: Results on newstest2014, CPU systems. Submitted systems in bold. Post-submission systems marked with *. All student systems have been used with beam size 1 unless stated differently (b=n).