Accelerating NMT Batched Beam Decoding with LMBR Posteriors for Deployment

We describe a batched beam decoding algorithm for NMT with LMBR n-gram posteriors, showing that LMBR techniques still yield gains on top of the best recently reported results with Transformers. We also discuss acceleration strategies for deployment, and the effect of the beam size and batching on memory and speed.


Introduction
The advent of Neural Machine Translation (NMT) has revolutionized the market. Objective improvements (Sutskever et al., 2014; Sennrich et al., 2016b; Gehring et al., 2017; Vaswani et al., 2017) and a fair amount of neural hype have increased the pressure on companies offering Machine Translation services to shift as quickly as possible to this new paradigm.
Such a radical change entails non-trivial challenges for deployment; consumers certainly look forward to better translation quality, but do not want to lose all the good features that have been developed over the years along with SMT technology. With NMT, real time decoding is challenging without GPUs, and still an avenue for research (Devlin, 2017). Great speeds have been reported by Junczys-Dowmunt et al. (2016) on GPUs, for which batching queries to the neural model is essential. Disk usage and memory footprint of pure neural systems are certainly lower than that of SMT systems, but at the same time GPU memory is limited and high-end GPUs are expensive.
Further to that, consumers still need the ability to constrain translations; in particular, brand-related information is often as important for companies as translation quality itself, and is currently under investigation (Chatterjee et al., 2017; Hokamp and Liu, 2017; Hasler et al., 2018). It is also well known that pure neural systems reach very high fluency, often sacrificing adequacy (Tu et al., 2017; Zhang et al., 2017; Koehn and Knowles, 2017), and have been reported to behave badly under noisy conditions (Belinkov and Bisk, 2018). Previous work has shown an effective way to counter these problems by taking advantage of the higher adequacy inherent to SMT systems via Lattice Minimum Bayes Risk (LMBR) decoding (Tromble et al., 2008). This makes the system more robust to pitfalls such as over- and under-generation (Feng et al., 2016; Meng et al., 2016; Tu et al., 2016), which is important for commercial applications.
In this paper, we describe a batched beam decoding algorithm that uses NMT models with LMBR n-gram posterior probabilities. Batching in NMT beam decoding has been mentioned or assumed in the literature, e.g. (Devlin, 2017; Junczys-Dowmunt et al., 2016), but to the best of our knowledge it has not been formally described, and there are interesting aspects for deployment worth taking into consideration.
We also report on the effect of LMBR posteriors on state-of-the-art neural systems, for five translation tasks. Finally, we discuss how to prepare (LMBR-based) NMT systems for deployment, and how our batching algorithm performs in terms of memory and speed.

Neural Machine Translation and LMBR
Given a source sentence $x$, a sequence-to-sequence NMT model scores a candidate translation sentence $y = y_1^T$ with $T$ words as:

$$P_{NMT}(y_1^T \mid x) = \prod_{t=1}^{T} P_{NMT}(y_t \mid y_1^{t-1}, x) \qquad (1)$$

where $P_{NMT}(y_t \mid y_1^{t-1}, x)$ uses a neural function $f_{NMT}(\cdot)$. To account for batching $B$ neural queries together, our abstract function takes the form $f_{NMT}(S_{t-1}, y_{t-1}, A)$, where $S_{t-1}$ is the previous batch state with $B$ state vectors in rows, $y_{t-1}$ is a vector with the $B$ preceding generated target words, and $A$ is a matrix with the annotations of a source sentence. The model has a vocabulary size $V$.
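As a small illustration of the chain-rule factorisation above, the sketch below accumulates per-step log-probabilities for a single candidate. The function name and the probability values are hypothetical and chosen only for the example.

```python
import math

def sentence_log_prob(step_probs):
    """Sum the per-step terms log P(y_t | y_1^{t-1}, x) for t = 1..T."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities for a 3-word candidate translation.
step_probs = [0.5, 0.25, 0.8]
log_p = sentence_log_prob(step_probs)
# Exponentiating the summed log-probabilities recovers the product of the steps.
prob = math.exp(log_p)
```

Working in log space is what allows the beam decoder to add scores step by step instead of multiplying small probabilities.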
The implementation of this function is determined by the architecture of specific models. The most successful ones in the literature typically share an attention mechanism that determines which source word to focus on, informed by $A$ and $S_{t-1}$. Some architectures use recurrent layers to compute both $A$ and the next target word $y_t$. Gehring et al. (2017) use convolutional layers instead, and Vaswani et al. (2017) dispense with GRU or LSTM layers, relying heavily on multi-layered attention mechanisms, stateful only on the translation side. Finally, this function can also represent an ensemble of neural models.
Lattice Minimum Bayes Risk decoding computes n-gram posterior probabilities from an evidence space and uses them to score a hypothesis space (Kumar and Byrne, 2004; Tromble et al., 2008; Blackwood et al., 2010). It improves single SMT systems, and also lends itself well to system combination (Sim et al., 2007; de Gispert et al., 2009). It has recently been shown how to use it with NMT decoding: a traditional SMT system is first used to create an evidence space $\phi_e$, and the NMT space is then scored left-to-right with both the NMT model(s) and the n-gram posteriors gathered from $\phi_e$. More formally, Equation 2 combines the NMT scores with a term $L$ that collects the $P_{LMBR}$ n-gram posterior contributions and a constant $\Theta_0$, balanced by a parameter $\lambda$. For our purposes $L$ is arranged as a matrix with each row uniquely associated to an n-gram history identified in $\phi_e$: each row contains scores for any word $y$ in the NMT vocabulary.
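For orientation, one form of the decision rule that is consistent with the symbols used in this section ($\Theta_0$, $P_{LMBR}$, $\lambda$, $L$, $\phi_e$) can be sketched as below. The maximum n-gram order $N$, the per-order weights $\Theta_n$ and the exact n-gram indexing are assumptions borrowed from the usual LMBR formulation, so this should be read as a plausible sketch and not as the exact Equation 2.

```latex
\hat{y} = \operatorname*{argmax}_{y} \sum_{t=1}^{T}
  \Big(
    \underbrace{\Theta_0 + \sum_{n=1}^{N} \Theta_n \,
      P_{LMBR}\big(y_{t-n+1}^{t} \mid \phi_e\big)}_{L}
    \; + \; \lambda \log P_{NMT}\big(y_t \mid y_1^{t-1}, x\big)
  \Big)
```

Under this reading, everything inside the underbrace depends only on an n-gram history and the next word, which is why it can be tabulated in advance as one matrix row per history.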
$L$ can be precomputed very efficiently, and stored in the GPU memory. The number of distinct n-gram histories is typically no more than 500 for our phrase-based decoder producing 200 hypotheses. Notice that such a matrix only containing $P_{LMBR}$ contributions would be very sparse, but it turns into a dense matrix with the summation of $\Theta_0$. Both sparse and dense operations can be performed on the GPU. We have found it more efficient to first compute all the sparse operations on the CPU, then upload the result to the GPU memory and add the constant $\Theta_0$ on the GPU.
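The sparse-then-dense construction described above can be sketched with NumPy as follows. This is a CPU-only illustration: `build_L`, the `contributions` dictionary (assumed to hold already weighted $P_{LMBR}$ terms) and all numeric values are hypothetical, and the GPU upload step is not shown.

```python
import numpy as np

def build_L(histories, contributions, vocab_size, theta0):
    """Return (L, row_of): one row of L per n-gram history, one column per word."""
    row_of = {h: i for i, h in enumerate(histories)}
    # Sparse part: only n-grams seen in the evidence space are non-zero.
    sparse = np.zeros((len(histories), vocab_size))
    for (history, word), value in contributions.items():
        sparse[row_of[history], word] += value
    # Adding the constant theta0 to every entry turns the matrix dense.
    return sparse + theta0, row_of

# Hypothetical toy evidence: word ids in a vocabulary of size 6.
histories = [(), (2,), (2, 4)]
contributions = {((), 2): 0.9, ((2,), 4): 0.7, ((2, 4), 1): 0.6}
L, row_of = build_L(histories, contributions, vocab_size=6, theta0=-0.5)
```

With about 500 histories and a vocabulary of 50K words, a dense matrix of this shape holds 25 million entries, which is why the constant is cheaper to add after the sparse part has been uploaded.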

NMT batched beam decoding
Algorithm 1 describes NMT decoding with LMBR posteriors using a beam size $B$ equal to the batch size. Lines 2-5 initialize the decoder; the number of time steps $T$ is usually a heuristic function of the source length. $q$ keeps track of the $B$ best scores per time step, while $b$ and $y$ hold indices.
Lines 7-16 are the core of the batch decoding procedure. At each time step $t$, given $S_{t-1}$, $y_{t-1}$ and $A$, $f_{NMT}$ returns two matrices: $P_t$, with size $B \times V$, contains log-probabilities for all possible candidates in the vocabulary given $B$ live hypotheses; $S_t$ is the next batch state. Each row in $S_t$ is the vector state that corresponds to any candidate in the same row of $P_t$ (line 8).
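To make the shapes concrete, here is a toy stand-in for $f_{NMT}$. It is not a real translation model: the averaging used in place of attention, the parameter matrices `W_out` and `E`, and all sizes are assumptions made purely so that the function returns a $B \times V$ matrix of log-probabilities and a new batch state.

```python
import numpy as np

def toy_f_nmt(S_prev, y_prev, A, W_out, E):
    """Stand-in for f_NMT: returns (P_t, S_t) for B live hypotheses.

    S_prev: (B, H) previous batch state, y_prev: (B,) previous word ids,
    A: (J, H) source annotations, W_out: (H, V), E: (V, H) word embeddings.
    """
    context = A.mean(axis=0)                      # crude stand-in for attention
    S_t = np.tanh(S_prev + E[y_prev] + context)   # (B, H): one state per row
    logits = S_t @ W_out                          # (B, V)
    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    P_t = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return P_t, S_t

rng = np.random.default_rng(0)
B, H, V, J = 4, 8, 10, 5
S0 = np.zeros((B, H))
y0 = np.zeros(B, dtype=int)
A = rng.normal(size=(J, H))
P1, S1 = toy_f_nmt(S0, y0, A, rng.normal(size=(H, V)), rng.normal(size=(V, H)))
```

Row $j$ of `P1` and row $j$ of `S1` belong to the same live hypothesis, which is the property the decoder relies on when it reorders states later.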
Lines 9, 10 add the n-gram posterior scores. Given the indices in $b$ and $y$ it is straightforward to read the unique histories for the $B$ open hypotheses: the topology of the hypothesis space is that of a tree because an NMT state represents the entire live hypothesis from time step 0. Note that $b_{tj} < B$ is the index to access the previous word in $y_{t-1}$. In effect, indices in $b$ function as backpointers, making it possible to reconstruct not only n-grams per time step, but also complete hypotheses. As discussed for Equation 2, these histories are associated to rows in our matrix $L$. Function GETMATRIXBYROWS(·) simply creates a new matrix of size $B \times V$ by fetching those $B$ rows from $L$. This new matrix is added to $P_t$ (line 10).
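The history lookup and row fetch can be sketched as below. The helper names, the toy backpointer arrays and the convention of reserving row 0 of `L` for histories not found in the evidence space are all assumptions for illustration.

```python
import numpy as np

def get_matrix_by_rows(M, rows):
    """Build a new matrix by fetching the given rows of M (rows may repeat)."""
    return M[np.asarray(rows)]

def read_history(b, y, t, j, max_len):
    """Follow backpointers from hypothesis j at step t; return its last words."""
    words = []
    while t >= 0 and len(words) < max_len:
        words.append(int(y[t][j]))
        j = int(b[t][j])   # b[t][j] < B points at the parent row at step t - 1
        t -= 1
    return tuple(reversed(words))

# Toy decoder trace with B = 2: word ids and backpointers for steps 0..2.
y = [[5, 7], [3, 3], [8, 9]]
b = [[0, 0], [1, 1], [0, 1]]

L = np.array([[0.0, 0.0, 0.0],    # row 0: hypothetical fallback for unseen histories
              [0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
row_of = {(7, 3): 1, (3, 8): 2}

histories = [read_history(b, y, 2, j, max_len=2) for j in range(2)]
rows = [row_of.get(h, 0) for h in histories]
P_t = np.zeros((2, 3)) + get_matrix_by_rows(L, rows)   # add posterior scores
```

Calling `read_history` with a large `max_len` returns the complete hypothesis, which is the same traversal needed when the final translation is traced back.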
In line 11, we get the indices and scores in $P_t + q_{t-1}$ of the top $B$ hypotheses. These best hypotheses could come from any row in $P_t$. For example, all $B$ best hypotheses could have been found in row 0. In that case, the new batch state to be used in the next time step should contain copies of row 0 in the other $B - 1$ rows. This is achieved again with GETMATRIXBYROWS(·) in line 12.
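A minimal NumPy sketch of the top-$B$ selection and the state reordering follows. The function name `top_b` and the toy numbers are illustrative; a production decoder would likely use a partial sort on the GPU instead of a full `argsort`.

```python
import numpy as np

def top_b(scores, B):
    """Return (rows, words, values) of the B best entries of a (B, V) matrix."""
    V = scores.shape[1]
    flat = scores.ravel()
    best = np.argsort(-flat, kind="stable")[:B]   # indices of the B highest scores
    return best // V, best % V, flat[best]

P_t = np.log(np.array([[0.5, 0.3, 0.2],
                       [0.1, 0.1, 0.8]]))
q_prev = np.array([0.0, -5.0])            # accumulated scores of the live hypotheses
S_t = np.array([[1.0, 1.0], [2.0, 2.0]])  # one state vector per row of P_t

b_t, y_t, q_t = top_b(P_t + q_prev[:, None], B=2)
S_next = S_t[b_t]   # GETMATRIXBYROWS: copy the state of each surviving parent
```

In this toy case both winners come from row 0, so `S_next` contains two copies of the state in row 0, exactly the situation described in the text.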
Finally, lines 13-16 identify whether there are any end-of-sentence (EOS) candidates; the corresponding indices and score are pushed into stack $F$ and these candidates are masked out (i.e. set to $-\infty$) to prevent further expansion. In line 17, GETBESTHYPOTHESIS($F$) traces backwards the best hypothesis in $F$, again using indices in $b$ and $y$. Optionally, normalization by hypothesis length happens in this step.

[Algorithm 1 (excerpt). 7: for t = 1 to T do; 9: h ← B histories identified through b, y and t; 14: if $y_{tj}$ = EOS then; 15: (track indices and score); 16: $q_{tj} \leftarrow -\infty$ (mask out to prevent hypothesis extension).]

It is worth noting that:
1. If we drop lines 9, 10 we have a pure left-to-right NMT batched beam decoder.
2. Applying a constraint (e.g. for lattice rescoring or other user constraints) involves masking out scores in P t before line 11.
3. Because the batch size is tied to the beam size, the memory footprint increases with the beam.
4. Because the beam is shared by EOS and non-EOS candidates, it can be argued that EOS candidates impoverish the beam and should instead be kept in addition to the non-EOS candidates (either by using a bigger beam, or by keeping them separately). Empirically we have found that this does not affect quality with real models.
5. The opposite, i.e. that EOS candidates never survive in the beam for T time steps, can happen, although very infrequently. Several pragmatic backoff strategies can be applied in this situation: for example, running the decoder for additional time steps, or tracking all EOS candidates that did not survive in a separate stack and picking the best hypothesis from there. We chose the latter.
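The masking operations mentioned in the notes above (constraints before the top-$B$ step, and EOS handling after it) can be sketched as follows. The EOS id, the tuple layout stored in `F` and the use of the time step as hypothesis length are assumptions made for this example.

```python
import numpy as np

EOS = 0  # hypothetical id of the end-of-sentence symbol

def apply_constraint(P_t, allowed_words):
    """Mask out (set to -inf) every word not allowed by the constraint."""
    masked = np.full_like(P_t, -np.inf)
    masked[:, allowed_words] = P_t[:, allowed_words]
    return masked

def collect_eos(t, y_t, q_t, F):
    """Push finished candidates into F and mask them out of the beam."""
    q_t = q_t.copy()
    for j in np.flatnonzero(y_t == EOS):
        F.append((t, int(j), float(q_t[j])))   # track indices and score
        q_t[j] = -np.inf                       # prevent further expansion
    return q_t

def get_best(F, normalize=False):
    """Pick the best finished entry, optionally dividing its score by length t."""
    return max(F, key=lambda e: e[2] / e[0] if normalize else e[2])

F = []
q1 = collect_eos(2, np.array([3, EOS]), np.array([-1.0, -2.0]), F)
q2 = collect_eos(4, np.array([EOS, 5]), np.array([-3.0, -4.0]), F)
masked = apply_constraint(np.zeros((2, 4)), [1, 3])
```

In this toy trace the shorter hypothesis wins on raw score, while length normalization prefers the longer one; the entry returned by `get_best` gives the time step and row from which the backpointers would be followed.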

Extension to Sentence batching
In addition to batching all $B$ queries to the neural model needed to compute the next time step for one sentence, we can do sentence batching: that is, we translate $N$ sentences simultaneously, batching $B \times N$ queries per time step. With small modifications, Algorithm 1 can be easily extended to handle sentence batching. If the number of sentences is $N$:
1. Instead of one set $F$ to store EOS candidates, we need $F_1 \ldots F_N$ sets.
2. For every time step, $b_t$, $y_t$ and $q_t$ need to be matrices instead of vectors, and minor changes are required in TOPB(·) to fetch the best candidates per sentence efficiently.
3. $P_t$ and $S_t$ can remain as matrices, in which case the new batch size is simply $B \cdot N$.
4. The heuristic function used to compute T is typically sentence specific.
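One way to adapt the top-$B$ step to sentence batching, sketched under the assumption that the $B$ rows of each sentence are stored contiguously in the $(N \cdot B) \times V$ score matrix, is to reshape so that each sentence competes only with itself. The function names are hypothetical.

```python
import numpy as np

def top_b_per_sentence(scores, N, B):
    """scores: (N * B, V), rows grouped by sentence. Returns (N, B) arrays b, y, q."""
    V = scores.shape[1]
    per_sentence = scores.reshape(N, B * V)     # one row of candidates per sentence
    best = np.argsort(-per_sentence, axis=1, kind="stable")[:, :B]
    q = np.take_along_axis(per_sentence, best, axis=1)
    b = best // V   # backpointer, local to the sentence (0 <= b < B)
    y = best % V
    return b, y, q

def global_rows(b, B):
    """Convert per-sentence backpointers into row indices of the batch state."""
    N = b.shape[0]
    return (b + B * np.arange(N)[:, None]).ravel()

# Toy scores for N = 2 sentences, beam B = 2, vocabulary V = 3 (higher is better).
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.0],
                   [0.1, 0.2, 0.3],
                   [0.7, 0.6, 0.5]])
b, y, q = top_b_per_sentence(scores, N=2, B=2)
rows = global_rows(b, B=2)   # rows to fetch from the (N * B)-row batch state
```

The backpointers stay local to each sentence, so histories and finished hypotheses can still be tracked per sentence, while `rows` is what a row-gathering call on the shared batch state would consume.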

Experimental Setup
We report experiments on English-German, German-English and Chinese-English language pairs for the WMT17 task, and Japanese-English and English-Japanese for the WAT task. For the German tasks we use news-test2013 as a development set, and news-test2017 as a test set; for Chinese-English, we use news-dev2017 as a development set, and news-test2017 as a test set. For Japanese tasks we use the ASPEC corpus (Nakazawa et al., 2016). We use all available data in each task for training. In addition, for German we use backtranslation data (Sennrich et al., 2016a). All training data for neural models is preprocessed with the byte pair encoding technique described by Sennrich et al. (2016b). We use Blocks (van Merriënboer et al., 2015) with Theano (Bastien et al., 2012) to train attention-based single GRU layer models, henceforth called FNMT. The vocabulary size is 50K. Transformer models (Vaswani et al., 2017), called here TNMT, are trained using the Tensor2Tensor package (https://github.com/tensorflow/tensor2tensor) with a vocabulary size of 30K. Our proprietary translation system is a modular homegrown tool that supports pure neural decoding (FNMT and TNMT) and decoding with LMBR posteriors (henceforth called LNMT and LTNMT respectively), and flexibly uses other components (phrase-based decoding, byte pair encoding, etc.) to seamlessly deploy an end-to-end translation system. FNMT/LNMT systems use ensembles of 3 neural models unless specified otherwise; TNMT/LTNMT systems decode with 1 to 2 models, each averaging over the last 20 checkpoints.

Table 1: Quality assessment of our NMT systems with and without LMBR posteriors for GRU-based (FNMT, LNMT) and Transformer models (TNMT, LTNMT). Cased BLEU scores reported on 5 translation tasks. The exact PBMT systems used to compute n-gram posteriors for LNMT and LTNMT systems are also reported. The last row shows scores for the best official submissions to each task.

The phrase-based decoder (PBMT) uses standard features with one single 5-gram language model (Heafield et al., 2013), and is tuned with standard MERT (Och, 2003); n-gram posterior probabilities are computed on-the-fly over rich translation lattices, with size bounded by the PBMT stack and distortion limits. The parameter $\lambda$ in Equation 2 is set as 0.5 divided by the number of models in the ensemble. Empirically we have found this to be a good setting in many tasks.
Unless noted otherwise, the beam size is set to 12 and the NMT beam decoder always batches queries to the neural model. The beam decoder relies on an early preview of ArrayFire 3.6 (Yalamanchili et al., 2015), compiled with CUDA 8.0 libraries. For speed measurements, the decoder uses one single CPU thread. For hardware, we use an Intel Xeon CPU E5-2640 at 2.60GHz. The GPU is a GeForce GTX 1080Ti. We report cased BLEU scores (Papineni et al., 2002), strictly comparable to the official scores in each task.

The effect of LMBR n-gram posteriors
Table 1 shows contrastive experiments for all five language pairs/tasks. We make the following observations:
1. LMBR posteriors show consistent gains on top of the GRU model (LNMT vs FNMT rows), ranging from +0.5 BLEU to +1.2 BLEU. This is consistent with previously reported findings.
2. The TNMT system boasts improvements across the board, ranging from +1.5 BLEU in German-English to an impressive +4.2 BLEU in English-Japanese WAT (TNMT vs LNMT). This is in line with findings by Vaswani et al. (2017) and sets new very strong baselines to improve on.
3. Further, applying LMBR posteriors along with the Transformer model yields gains in all tasks (LTNMT vs TNMT), up to +0.8 BLEU in Japanese-English. Interestingly, while we find that rescoring PBMT lattices (Stahlberg et al., 2016) with GRU models yields improvements similar to those previously reported, we did not find gains when rescoring with the stronger TNMT models instead.

Accelerating FNMT and LNMT systems for deployment
There is no particular constraint on speed for the research systems reported in Table 1. We now address the question of deploying NMT systems so that MT users get the best quality improvements at real-time speed and with acceptable memory footprint. As an example, we analyse in detail the English-German FNMT and LNMT case and discuss the main trade-offs if one wanted to accelerate them. Although the actual measurements vary across all our productised NMT engines, the trends are similar to the ones reported here.
In this particular case we specify a beam width of 0.01 for early pruning (Wu et al., 2016; Delaney et al., 2006) and reduce the beam size to 4. We also shrink the ensemble into one single big model using a previously described data-free shrinking method, an inexpensive way to improve both speed and GPU memory footprint.
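The text does not spell out how the beam width is applied. One common reading, assumed here, is relative threshold pruning: a candidate is dropped when its probability falls below the beam width times the probability of the best candidate. The function name and the candidate probabilities are hypothetical.

```python
import numpy as np

def prune_by_beam_width(log_scores, width):
    """Keep candidates whose probability is at least width times the best one."""
    log_scores = np.asarray(log_scores, dtype=float)
    # In log space the relative threshold becomes an additive offset.
    threshold = log_scores.max() + np.log(width)
    return np.flatnonzero(log_scores >= threshold)

# Hypothetical candidate probabilities at one time step.
candidates = np.log([0.60, 0.30, 0.05, 0.004])
kept = prune_by_beam_width(candidates, width=0.01)
```

Here the threshold is 0.6 × 0.01 = 0.006, so only the last candidate is pruned; with a beam size of 4 this kind of test can reduce the number of hypotheses that are expanded at a given step.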
In addition, for LNMT systems we tune phrase-based decoder parameters such as the distortion limit, the number of translations per source phrase and the stack limit. To compute n-gram posteriors we now only take a 200-best list from the phrase-based translation lattice. In the process, both accelerated systems have lost 0.9 BLEU relative to the baseline. As an example, let us break down the effects of accelerating the LNMT system: using only 200-best hypotheses from the phrase-based translation lattice costs 0.3 BLEU, replacing the ensemble with a data-free shrunken model costs another 0.2 BLEU, and decreasing the beam size costs 0.4 BLEU. The impact of reducing the beam size varies from system to system, although it often does not result in substantial quality loss for NMT models (Britz et al., 2017).
It is worth noting that these two systems share exactly the same neural model and parameter values. However, LNMT runs 4500 words per minute (wpm) slower than FNMT. Figure 1 breaks down the decoding times for both the accelerated FNMT and LNMT systems. The LNMT pipeline also requires a phrase-based decoder and the extra component to compute the n-gram posterior probabilities. In effect, while both are remarkably fast by themselves (e.g. the phrase-based decoder is running at 20000 wpm), these extra contributions explain most of the speed reduction for the accelerated LNMT system. In addition, the beam decoder itself is slightly slower for LNMT than for FNMT. This is mainly due to the computation of $L$ as explained in Section 2. Finally, the respective GPU memory footprints for FNMT and LNMT are 4.1 and 4.8 GB.

Batched beam decoding and beam size
We next discuss the impact of using batch decoding and the beam size. To this end we use the accelerated FNMT system (25.2 BLEU, 9449 wpm) to decode with and without batching; we also widen the beam. Figure 2 shows the results.
The accelerated system itself with batched beam decoding and beam size of 4 is 3 times faster than without batching (3053 wpm). The GPU memory footprint is 1 GB bigger when batching (4.1 vs 3.1 GB). As can be expected, widening the beam decreases the speed of both decoders. The relative speed-up ratio favours the batch decoder for wider beams, i.e. it is 5 times faster for beam size 12. However, because the batch size is tied to the beam size, this comes at a cost in GPU memory footprint (under 8 GB).
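The speed-up and memory figures quoted above for beam size 4 can be checked with a few lines of arithmetic; the variable names are illustrative and the numbers are those given in the text.

```python
batched_wpm = 9449      # accelerated FNMT, batched beam decoding, beam size 4
unbatched_wpm = 3053    # same system decoding without batching

speedup = batched_wpm / unbatched_wpm   # ratio of decoding speeds
extra_memory_gb = 4.1 - 3.1             # GPU memory cost of batching
```

The ratio is a little above 3, matching the "3 times faster" figure, at the price of roughly 1 GB of additional GPU memory.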

Sentence batching
As described in Section 3.1, it is straightforward to extend beam batching to sentence batching. Figure 3 shows the effect of sentence batching up to 7 sentences on our accelerated FNMT system.
Whilst the speed-up of our implementation is sub-linear, when batching 5 sentences the decoder runs at almost 21000 wpm, and goes beyond 24000 for 7 sentences. Thus, our implementation of sentence batching is 2.5 times faster on top of beam batching. Again, this comes at a cost: the GPU memory footprint increases as we batch more and more sentences together, up to 11 GB for 7 sentences, which approaches the limit of GPU memory.
Note that sentence batching does not change translation quality. For example, when translating 7 sentences, we are effectively batching 28 neural queries per time step. Indeed, each individual sentence is still being translated with a beam size of 4. Figure 3 also shows the effect of sorting the test set by sentence length. Because sentences in a batch then have similar lengths, less padding is required and hence less GPU computation is wasted. Without sorting, the decoder with 7 batched sentences would run at barely 17000 wpm, that is, 7000 wpm slower. A similar strategy is common for neural training (Sutskever et al., 2014; Morishita et al., 2017).
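The effect of sorting on padding can be illustrated with a toy calculation. The cost model below, in which every batch is padded to the length of its longest sentence, and the sentence lengths themselves are assumptions chosen for illustration; they are not measurements from the system described here.

```python
def padded_positions(lengths, batch_size):
    """Total token slots computed when each batch is padded to its longest sentence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = [5, 30, 7, 28, 6, 31, 8, 29]   # hypothetical sentence lengths in words
unsorted_cost = padded_positions(lengths, batch_size=4)
sorted_cost = padded_positions(sorted(lengths), batch_size=4)
useful = sum(lengths)                    # slots that hold real words
```

In this example the unsorted order spends 244 slots on 144 real words, whereas sorting brings the total down to 156, which is the kind of wasted computation that length sorting avoids.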

Conclusions
We have described a left-to-right batched beam NMT decoding algorithm that is transparent to the neural model and can be combined with LMBR n-gram posteriors. Our quality assessment with Transformer models (Vaswani et al., 2017) has shown that LMBR posteriors can still improve such a strong baseline in terms of BLEU. Finally, we have also discussed our acceleration strategy for deployment and the effect of batching and the beam size on memory and speed.