Efficient and High-Quality Neural Machine Translation with OpenNMT

This paper describes the OpenNMT submissions to the WNGT 2020 efficiency shared task. We explore training and acceleration of Transformer models of various sizes trained in a teacher-student setup. We also present a custom and optimized C++ inference engine that enables fast CPU and GPU decoding with few dependencies. By combining additional optimizations and parallelization techniques, we create small, efficient, and high-quality neural machine translation models.


Introduction
This paper describes the OpenNMT (Klein et al., 2017) submissions to the Workshop on Neural Generation and Translation 2020 efficiency shared task. For WNMT 2018, we explored training and optimizations of small LSTM translation models combined with a customized runtime. While this resulted in competitive decoding speed, there was still room for improvement in terms of quality, memory usage, and overall efficiency.
For this 2020 edition, we focus on the standard Transformer architecture (Vaswani et al., 2017) that is now commonly used in production machine translation systems. Similar to our first participation, we train smaller models using the teacher-student technique (Kim and Rush, 2016). We experiment with several encoder and decoder sizes following Hongfei et al. (2020), who show that reducing the number of decoder layers can improve decoding speed at a very limited accuracy cost.
We also keep the approach of running the models with a custom C++ runtime. This year we present CTranslate2 (https://github.com/OpenNMT/CTranslate2), an optimized and production-grade inference engine for OpenNMT models that enables fast CPU and GPU decoding with few dependencies. This library implements several optimizations for decoding neural machine translation models, such as 8-bit quantization, parallel translations, caching, and dynamic target vocabulary reduction.
Section 2 of this paper describes the data preparation and the training procedures we apply to train the candidate models. Section 3 presents the various optimizations we implemented to reduce model size and improve runtime efficiency. Finally, Section 4 details the accuracy and efficiency results achieved by the submitted models.

Teacher-student training
We train our systems using a teacher-student approach (Kim and Rush, 2016). First, a large model (the teacher) is trained on all available bilingual data, including synthetic data such as backtranslations of monolingual target sentences (Sennrich et al., 2016) and translations of monolingual source sentences (Zhang and Zong, 2016). Model ensembles are also typically used to build stronger teacher systems.
Then, a small model (the student) is trained by minimizing the loss between the student and teacher outputs, with the goal of distilling the knowledge of the teacher (Kim and Rush, 2016) into a smaller model with comparable accuracy. Crego and Senellart (2016) show that student models can even outperform their teacher counterparts to some extent.
Knowledge distillation is an effective approach to reduce the model size, thus lowering memory and computation requirements.

Teacher system
As suggested in the task description and given the limited amount of time available, we use Facebook's WMT 2019 system as our teacher model (Ng et al., 2019). The system is trained as an ensemble of big Transformer models for both directions, English-German and German-English. Table 1 shows the BLEU scores (Papineni et al., 2002) achieved by this teacher system.

Training data
We limit our training data to the WMT 2019 English-German translation task. We use the following data to be translated by Facebook's WMT 2019 teacher system: (a) the English side of the bilingual data, (b) the English side of ParaCrawl v3, and (c) English monolingual data.
Before translation, the data is cleaned following several rules: sentences that are empty or longer than 100 tokens (counted before tokenization) are filtered out. We also use the language identification (LID) toolkit langid (Lui and Baldwin, 2012) to further clean ParaCrawl and the English monolingual corpora, which are known to contain a large number of noisy sentences. Nearly 5% of the sentences are discarded by the LID toolkit.
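The length-based cleaning rules can be sketched in a few lines of Python. This is a minimal illustration, not the actual preprocessing script: the `keep_sentence` helper and the toy corpus are our own, and the langid filtering step is omitted.

```python
MAX_LENGTH = 100  # maximum sentence length in tokens, before subword tokenization

def keep_sentence(line):
    """Return True if the sentence passes the length-based cleaning rules."""
    line = line.strip()
    if not line:
        return False  # drop empty sentences
    # A whitespace split approximates counting tokens "without tokenization".
    return len(line.split()) <= MAX_LENGTH

corpus = ["Hello world .", "", "   ", "word " * 150]
cleaned = [s for s in corpus if keep_sentence(s)]
```

In the real pipeline, the surviving sentences would then be passed through the LID filter before translation by the teacher.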
The cleaned data is then translated by the teacher model, and the resulting synthesized parallel data is used to train the student systems.

Vocabulary
We build a joint subword segmentation model from the synthesized parallel data using SentencePiece (Kudo and Richardson, 2018). The vocabulary size is set to 32,000 tokens. We also remove non-Latin characters before building the vocabulary.
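The non-Latin filtering step can be approximated with the standard library, assuming a letter is kept whenever its Unicode name marks it as Latin. The `strip_non_latin` helper below is hypothetical, not the script used for the submission.

```python
import unicodedata

def strip_non_latin(text):
    """Drop alphabetic characters whose Unicode name does not mark them as
    Latin; digits, punctuation, and whitespace are kept as-is."""
    kept = []
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            continue  # e.g. CJK or Cyrillic letters
        kept.append(ch)
    return "".join(kept)

filtered = strip_non_latin("Straße абв 123!")
```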

Student models
We train 4 different student systems based on the Transformer architecture (Vaswani et al., 2017). The candidate configurations are presented in Table 3. In addition to the base Transformer configuration, we train 3 model variants with different numbers of encoder layers (N_enc), decoder layers (N_dec), hidden sizes (d_model), and feed-forward network sizes (d_ff). We share the source and target word embeddings and the softmax weights in the 3 variants, while the base configuration keeps them as separate weights.

Student training
Since the amount of synthetic data is relatively large, we define an epoch as a random sampling of 5M sentences. We set the sampling weights of the selected data (a), (b), and (c) to 5, 2, and 2 respectively. That is, we consider a larger number of sentences synthesized from the English part of the bilingual data than from ParaCrawl or from the monolingual English data set.
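The epoch sampling scheme above can be sketched as follows. This is a toy illustration with made-up corpus contents; the `sample_epoch` helper is our own, not part of the OpenNMT toolkits.

```python
import random

def sample_epoch(corpora, weights, epoch_size, seed=1234):
    """Build one training epoch by sampling sentences from the synthetic
    corpora in proportion to the given weights (5/2/2 in our setup)."""
    rng = random.Random(seed)
    names = list(corpora)
    picks = rng.choices(names, weights=weights, k=epoch_size)
    return [(name, rng.choice(corpora[name])) for name in picks]

# Toy stand-ins for the synthetic corpora (a), (b), and (c).
corpora = {
    "bitext": ["src ||| tgt"] * 10,
    "paracrawl": ["src ||| tgt"] * 10,
    "monolingual": ["src ||| tgt"] * 10,
}
epoch = sample_epoch(corpora, weights=[5, 2, 2], epoch_size=9)
```

In practice the epoch size would be 5M sentences and each corpus a file on disk, but the proportional sampling logic is the same.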
We use the OpenNMT-tf toolkit to train our student systems. Training is run on a single NVIDIA Tesla V100 GPU with an effective batch size of 25,000 tokens for the early epochs. Just before the final release, we train 10 additional epochs with a larger batch size by increasing the gradient update delay by a factor of 16. Figure 1 shows the comparison with the larger batch size: we achieve an additional 0.1 to 0.2 BLEU using this technique. Finally, we average the weights of the last 10 checkpoints to produce the final models.

Evaluation
We list the number of parameters of the 4 trained models in Table 3, together with their evaluation scores on the English-German newstest2018 and newstest2019 test sets before any inference optimizations. The results correlate well with the expectation that more model parameters lead to better performance. The base Transformer model achieves better results on newstest2019 than Facebook's WMT 2019 model used as a teacher (43.0 vs. 42.1). This confirms the finding in Crego and Senellart (2016) that student systems can sometimes outperform their corresponding teacher networks.

Inference optimizations
All models are converted and executed with CTranslate2. We use version 1.10.0 of the library.

CTranslate2 technical overview
CTranslate2 is a standalone C++ library that implements the complete logic of executing and decoding neural machine translation models, with a focus on Transformer variants. This custom implementation supports CPU and GPU execution with the goal of being faster, lighter, and more customizable than a general-purpose deep learning framework. Key features of this project include model quantization, parallel translations, dynamic memory usage, and interactive decoding. Some of these features are difficult to implement effectively with standard deep learning frameworks and are the motivation for this project.
The CPU runtime is backed by Intel MKL, a popular math computation library optimized for Intel processors. We specialize operators with BLAS routines and Vector Mathematical functions whenever possible to benefit from vectorization. We also use the caching allocator provided by mkl_malloc and align allocated memory to 64 bytes. Other operations not available in Intel MKL are implemented in plain C++ using the STL and OpenMP.
The GPU runtime minimally requires the cuBLAS and Thrust libraries. Basic transformations are defined using Thrust, while more complex layers such as layer normalization and softmax use CUDA kernels ported from PyTorch (Paszke et al., 2019). We also integrate a caching allocator from the CUB library to reuse previously allocated buffers and minimize device synchronization.

8-bit quantization (CPU)
Quantization is a standard technique to reduce the model size in memory and accelerate its execution. We quantize the weights of linear and embedding layers to 8-bit signed integers after completing training. Experimental results show that model quantization can achieve high translation accuracy without making the training quantization-aware. We use the equation from Wu et al. (2016) to compute the quantized weight W_Q from the original weight W:

    W_Q = round(127 / max(|W|) × W)    (1)

Table 4 shows the effect of weight quantization on the final model size.
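Equation 1 can be sketched in plain Python. This is a toy per-tensor version for illustration only; the actual CTranslate2 implementation is in C++ and its scaling granularity may differ.

```python
def quantize(weights):
    """Absolute-max int8 quantization (Equation 1):
    W_Q = round(127 / max|W| * W)."""
    absmax = max(abs(w) for row in weights for w in row)
    scale = 127.0 / absmax
    return [[round(w * scale) for w in row] for row in weights], scale

def dequantize(quantized, scale):
    """Recover an approximation of the original float weights."""
    return [[w / scale for w in row] for row in quantized]

W = [[0.5, -1.0], [0.25, 0.75]]
W_q, scale = quantize(W)  # every entry now fits in a signed 8-bit integer
```

The round-trip error is bounded by half a quantization step (1 / (2 × scale)), which is why accuracy is largely preserved.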
On CPU, we dynamically quantize the input of the linear layer using Equation 1, multiply the quantized input and weight with MKL's cblas_gemm_s8u8s32 function, and dequantize the result before adding the bias term. In addition, we employ two notable techniques.

Weights pre-packing. On model load, we replace the quantized linear weights with the packed representation returned by MKL's packed GEMM API.
Unsigned compensation term. In row-major mode, Intel MKL expects the input matrix a to be unsigned, while the quantization Equation 1 produces signed values. To overcome this constraint, we shift a to the 8-bit unsigned domain and add a compensation term c to the output matrix. This compensation term only depends on the quantized weight matrix W_Q and can be computed once:

    c_j = -128 × Σ_i W_Q[i, j]

On GPU, 8-bit computation is disabled as our implementation still requires some efficiency improvements regarding repetitive quantization and dequantization. In this case, the weights are dequantized to single-precision floating point on load.
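The correctness of the compensation term can be checked with a small numeric example. The matrices and helper functions below are toys standing in for the int8 GEMM; the point is that shifting the input by 128 and then removing 128 times the column sums of the weights recovers the exact signed product.

```python
def matmul(A, B):
    """Naive integer matrix product, standing in for the int8 GEMM."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def gemm_unsigned_with_compensation(A_signed, W_q):
    """Shift the signed input to the unsigned domain, multiply, then remove
    the bias introduced by the shift: 128 * (column sums of W_q)."""
    A_unsigned = [[a + 128 for a in row] for row in A_signed]
    comp = [128 * sum(W_q[i][j] for i in range(len(W_q)))
            for j in range(len(W_q[0]))]
    raw = matmul(A_unsigned, W_q)
    return [[raw[i][j] - comp[j] for j in range(len(comp))]
            for i in range(len(raw))]

A = [[-5, 3], [7, -2]]    # signed int8 activations
W_q = [[1, -4], [2, 6]]   # quantized weights
result = gemm_unsigned_with_compensation(A, W_q)
```

Since the compensation only depends on W_q, it is computed once at model load rather than per batch.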

Greedy decoding
To maximize speed and reduce memory usage, we use greedy search instead of beam search. During decoding, we also skip the final softmax layer and simply take the argmax of the output logits.
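Skipping the softmax is safe for greedy search because softmax is monotonic: the highest logit always maps to the highest probability. A quick check with toy logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 3.4, -0.5, 3.1]
# The argmax of the raw logits equals the argmax of the probabilities,
# so greedy decoding can skip the softmax entirely.
best_from_logits = max(range(len(logits)), key=logits.__getitem__)
best_from_probs = max(range(len(logits)), key=softmax(logits).__getitem__)
```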

Decoder projections caching
We apply the common technique of caching linear projections in the Transformer decoder layers. In particular, at step t the decoder self-attention layers compute Attention(Q_t W_Q, Q_{1..t} W_K, Q_{1..t} W_V). As the matrix Q_{1..t-1} is constant, we only compute Q_t W_K and Q_t W_V and concatenate the results to the previous projections before calling the attention.
We also cache the encoder output projections K W_K and V W_V in the encoder-decoder attention layers, as K and V remain constant during decoding.
For both cases, we transpose the matrices to delimit the attention heads before saving them in the cache. Figure 2 compares the observed speedup when increasing the number of threads at the batch level (the number of OpenMP threads) or at the file level (the number of batches processed in parallel). We use the same batch size in both cases.
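The self-attention caching scheme can be sketched as follows. The `SelfAttentionCache` class and the tiny weight matrices are illustrative stand-ins, without the head transposition or the attention computation itself.

```python
def project(rows, W):
    """Multiply each row vector by the weight matrix W."""
    return [[sum(r[i] * W[i][j] for i in range(len(W)))
             for j in range(len(W[0]))] for r in rows]

class SelfAttentionCache:
    """At step t, project only the new query row q_t and append it to the
    cached key/value projections instead of recomputing Q_{1..t} W_K and
    Q_{1..t} W_V from scratch."""
    def __init__(self, W_K, W_V):
        self.W_K, self.W_V = W_K, W_V
        self.keys, self.values = [], []

    def step(self, q_t):
        self.keys += project([q_t], self.W_K)
        self.values += project([q_t], self.W_V)
        return self.keys, self.values

W_K = [[1, 0], [0, 1]]  # toy weight matrices
W_V = [[2, 0], [0, 2]]
cache = SelfAttentionCache(W_K, W_V)
for q_t in ([1, 2], [3, 4]):
    keys, values = cache.step(q_t)
```

After two steps, the cache holds the same projections as a full recomputation over Q_{1..2}, but each step only paid for one row of matrix multiplication.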

File-level parallelism (CPU)
As the number of threads increases, the first approach loses efficiency because not all operators within the model scale linearly and some of them are not parallelized at all. On the other hand, the second approach continues to improve as we add more threads because all batches are independent and the full decoding can be executed in parallel. However, the duplicated internal state of parallel translators increases memory usage. To mitigate this issue, we share the static model data among all parallel translators and read and write batches in a streaming manner while ensuring that the original order is preserved.
Given the large number of CPU cores available for this task, we choose to exploit parallelism at the file level to maximize the overall throughput. The number of parallel translators is set to the number of physical cores. Each translator uses a single thread, so the decoding algorithm is executed sequentially and without OpenMP.
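File-level parallelism with order preservation can be sketched with the standard library. The `translate_batch` function is a stand-in for a single-threaded translator instance; `executor.map` streams results back in submission order, mirroring the ordered streaming described above.

```python
from concurrent.futures import ThreadPoolExecutor

def translate_batch(batch):
    """Stand-in for one single-threaded translator instance."""
    return [sentence.upper() for sentence in batch]

def translate_stream(batches, num_translators=4):
    """Translate independent batches in parallel; executor.map yields the
    results in submission order, so the original order is preserved."""
    with ThreadPoolExecutor(max_workers=num_translators) as pool:
        for translated in pool.map(translate_batch, batches):
            yield from translated

batches = [["a", "b"], ["c"], ["d", "e"]]
output = list(translate_stream(batches))
```

The real implementation additionally shares the model weights among translators to keep memory usage flat as the worker count grows.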

Sorted and dynamic batches
When setting the maximum batch size to N tokens, each consumer reads 8N contiguous tokens, sorts the sentences from longest to shortest, and splits them into batches of N tokens before running the model. The correct order is restored when returning the translation results. This local sorting makes batches contain sentences of similar lengths, which reduces the amount of padding and increases computation efficiency.
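The sort-and-split step over one window of sentences can be sketched as follows. The `split_batches` helper is our own illustration; the windowed reading of 8N tokens and the order restoration are omitted.

```python
def split_batches(sentences, n_tokens):
    """Sort a window of sentences from longest to shortest, then split them
    into batches of at most n_tokens tokens each."""
    ordered = sorted(sentences, key=len, reverse=True)
    batches, batch, count = [], [], 0
    for sent in ordered:
        if batch and count + len(sent) > n_tokens:
            batches.append(batch)
            batch, count = [], 0
        batch.append(sent)
        count += len(sent)
    if batch:
        batches.append(batch)
    return batches

window = [["w"] * n for n in (3, 9, 2, 8, 4)]  # token lists of varying length
batches = split_batches(window, n_tokens=12)
```

Because each batch groups sentences of similar length, the padding needed to form rectangular tensors is minimal.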
We use N = 6000 for the GPU task, N = 512 for the single-core CPU task, and N = 256 for the multi-core CPU task.
During decoding, we remove finished translations from the batch to avoid unnecessary computation. We also exploit the prior knowledge that short sentences finish early: by moving shorter sentences to the end of the batch, we reduce memory copies when updating the decoder cache in place.

Target vocabulary reduction
We generate a static source-target vocabulary mapping. We first train an alignment model with fast_align to align source and target words. To increase the coverage of this mapping, we build a phrase table from these alignments, extract the N-best translation hypotheses of 1-gram, 2-gram, ..., n-gram source sequences, and include all target words in the mapping. We set n = 3 to generate the vocabulary mapping included in the models of this submission.
During decoding, we consider all 1-gram, 2-gram, and 3-gram sequences in the input batch and select the target tokens that are likely to appear in the translation according to the pretrained mapping, as well as the 50 most frequent target tokens. These candidates are used to mask the weights of the final linear layer and effectively reduce its computational cost.
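The candidate selection step can be sketched as follows. The `mapping` dictionary is a hypothetical stand-in for the pretrained vocabulary mapping, and the frequent-token set is reduced to a single token for brevity.

```python
def ngrams(tokens, max_n=3):
    """All 1-gram to max_n-gram subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def candidate_vocab(batch, mapping, frequent_tokens):
    """Collect the target tokens mapped from every n-gram of the source
    batch, plus the most frequent target tokens; these candidates are then
    used to mask the final linear layer."""
    candidates = set(frequent_tokens)
    for tokens in batch:
        for gram in ngrams(tokens):
            candidates.update(mapping.get(gram, ()))
    return candidates

# Hypothetical pretrained mapping from source n-grams to target tokens.
mapping = {("hello",): {"hallo"}, ("hello", "world"): {"hallo", "welt"}}
vocab = candidate_vocab([["hello", "world"]], mapping, frequent_tokens={"."})
```

The output projection is then computed only over this reduced candidate set instead of the full 32,000-token vocabulary.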

Docker images
The Docker images entrypoint is a small C++ main function that wraps the CTranslate2 and SentencePiece libraries and sets the decoding options that are relevant for this task.
We submit separate Docker images for CPU and GPU to only include the required dependencies.
The images are based respectively on ubuntu:18.04 and nvidia/cuda:10.2-base-ubuntu18.04. Without the model, the CPU image size is 104MB and the GPU image size is 210MB. Table 5 shows the impact of selected optimizations when decoding a base Transformer model on a single CPU core. The CTranslate2 library combined with a few optimizations can lead to an 8× speedup, with limited accuracy loss, over a baseline TensorFlow program.

Optimization results
Finally, Table 6 summarizes the global impact of the optimizations described above, which we compare against a baseline beam search decoding with OpenNMT-tf. For a base Transformer model, single-core CPU translation is 13× faster while only losing 0.8 BLEU points, and GPU translation is 7× faster for the same quality.

Table 6: Time in seconds to translate newstest2019 and BLEU scores as returned by SacreBLEU. The time includes model loading and tokenization. Baseline models are decoded with OpenNMT-tf using a beam of size 4; optimized models are decoded with the final images submitted for this task. The runs were executed on a c5.metal AWS instance for CPU and a g4dn.xlarge instance for GPU.

Conclusion
We demonstrated that the OpenNMT ecosystem can be used to train efficient and high-quality neural machine translation models. The training frameworks, OpenNMT-tf and OpenNMT-py, include all features and procedures that are commonly applied to reach competitive translation scores. This year we presented CTranslate2, an optimized and production-grade inference engine for OpenNMT models that enables fast CPU and GPU decoding with few dependencies. By combining several optimizations and parallelization techniques, the library can drastically improve decoding efficiency.