A Streaming Approach For Efficient Batched Beam Search

We propose an efficient batching strategy for variable-length decoding on GPU architectures. During decoding, when candidates terminate or are pruned according to heuristics, our streaming approach periodically "refills" the batch before proceeding with a selected subset of candidates. We apply our method to variable-width beam search on a state-of-the-art machine translation model. Our method decreases runtime by up to 71% compared to a fixed-width beam search baseline and 17% compared to a variable-width baseline, while matching the baselines' BLEU. Finally, experiments show that our method can speed up decoding in other domains, such as semantic and syntactic parsing.


Introduction
While inference is often cheap compared to training in modern neural models, one may need to run inference frequently or continually. Such is the case for online machine translation (MT) services: as far back as 2016, Google Translate already translated 100 billion words daily (Turovsky, 2016). Large-scale inference is also required for methods such as iterative backtranslation and knowledge distillation to generate training data (Hoang et al., 2018; Kim and Rush, 2016). For such high-throughput applications, it is useful to decrease inference cost.
Meanwhile, we must preserve accuracy: beam search is slower than greedy decoding, but is nevertheless often preferred in MT. Not only is beam search usually more accurate than greedy search, but it also outputs a diverse set of decodings, enabling reranking approaches to further improve accuracy (Yee et al., 2019;Ng et al., 2019;Charniak and Johnson, 2005;Ge and Mooney, 2006).
However, it is challenging to optimize the performance of beam search for modern neural architectures. Unlike classical methods in sparse computation settings, modern neural methods typically operate in dense (batched) settings to leverage specialized hardware such as GPUs.
In this work, we propose a streaming method to optimize GPU-batched variable-output-length decoding. Our method does not use a fixed batch during inference; instead, it continually "refills" the batch after it finishes translating some fraction of the current batch. Our method then continues decoding on the remaining candidates in the batch, prioritizing those least expanded.
We apply our method to variable-width beam search. For variable-output-length decoding, even in batched settings, variable-width beam search often modestly decreases accuracy in exchange for substantial speedups over fixed-width beam search (Freitag and Al-Onaizan, 2017; Wu et al., 2016). When decoding with Fairseq's state-of-the-art WMT'19 model (Ng et al., 2019), our method further improves over the speed of baseline variable-width beam search: up to 16.5% on a 32GB V100 GPU, without changing BLEU (Papineni et al., 2002). Our approach also improves decoding efficiency in lightweight models for semantic and syntactic parsing.[1] In principle, our method can be applied to any task which sequentially processes variable-length data.

Background: Beam Search
Given encoder E and decoder D, our task is to convert inputs {x_1, …, x_N} into corresponding outputs {ȳ_1, …, ȳ_N}, for data size N. For example, in machine translation, each x_i is a source sentence consisting of a sequence of tokens and each ȳ_i is a translation. We assume D(e_i, y_i) receives e_i = E(x_i) and a partial y_i as input, constructing ȳ_i one token at a time.
One method of constructing ȳ_i for a given x_i is greedy search. Let y^{l_t}_i be the in-construction candidate with length l_t = t at timestep t. We initialize y^1_i as the start token w_sos, and at each timestep t obtain y^{l_t+1}_i by appending the maximum-probability token. We finalize y^{l_t}_i as ȳ_i once we append the end token, or at some maximum length.

[1] Code available at https://github.com/yangkevin2/emnlp2020-stream-beam-mt.

Figure 1: Illustration of our method VAR-STREAM for variable-width beam search with vocabulary size |V| = 3, beam width k = 2, batch size n = 3, refill threshold ε = 1/3. Each color corresponds to the beam for a single input. The rounded rectangles at each timestep are beams H(B^{l_t}_i), while the shapes inside are individual candidates. Shaded beams represent the end of the search. The right-facing triangles indicate the initial candidate containing just the start token w_sos, circles denote an active (non-terminated) candidate, and stars denote a finalized candidate. Candidates become finalized after following the third (bottom-most) branch in an expansion, corresponding to the end token w_eos; they then undergo only no-op expansions thereafter. The first two rows of beams depict normal operation of variable-width beam search, including heuristic pruning in the light blue beam at t = 6. The third row shows an important detail of our method: VAR-STREAM refills the batch after t = 3, when only εn beams remain, and the remaining purple beam halts computation until the two newly added beams reach the same l_t. (This detail matters in transformer architectures; see Appendix A.2.)
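The greedy procedure above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: `step` is a hypothetical stand-in for the decoder D applied to the encoding E(x), returning log-probabilities over the vocabulary, and `copy_step` is a toy scoring rule used only to make the sketch runnable.

```python
SOS, EOS = "<s>", "</s>"

def greedy_decode(step, x, max_len=10):
    """Greedy search: start from w_sos and repeatedly append the
    highest-probability next token until w_eos or max_len.

    step(x, y) stands in for D(E(x), y): it returns a dict mapping
    each vocabulary token to its log-probability given input x and
    the partial output y.
    """
    y = [SOS]
    for _ in range(max_len):
        log_probs = step(x, y)
        best = max(log_probs, key=log_probs.get)  # argmax token
        y.append(best)
        if best == EOS:
            break
    return y

# Toy decoder: copy the input tokens in order, then emit </s>.
def copy_step(x, y):
    t = len(y) - 1
    target = x[t] if t < len(x) else EOS
    return {tok: (0.0 if tok == target else -10.0) for tok in set(x) | {EOS}}
```

Under this toy decoder, `greedy_decode(copy_step, ["a", "b"])` reproduces the input followed by the end token.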
Previous work has found that greedy search often underperforms beam search in accuracy, and gains from non-greedy decoding have also been observed in many classical models (Sutskever et al., 2014; Freitag and Al-Onaizan, 2017; Wilt et al., 2010). See the dark blue, green, and brown beams in Figure 1 for normal operation of fixed-width beam search using beam width k = 2. For each input x_i, fixed-width beam search tracks a length-l_t, width-k beam B^{l_t}_i for each time t. B^{l_t}_i contains the k length-l_t candidates y^{l_t}_{i1}, …, y^{l_t}_{ik} with maximum log-likelihood, in order, denoted by the shapes inside the rounded rectangles (beams) in the figure. At each step, beam search considers all k|V| possible candidate expansions (one-token extensions of existing candidates), where V is the vocabulary. The top k expansions become the expanded beam B^{l_t+1}_i. Figure 1 shows these expansions at each timestep for |V| = 3, with active non-terminated candidates (circles) becoming finalized (stars) after following the bottom-most branch, corresponding to the end token w_eos. In the end, beam search yields k finalized candidates ȳ_{i1}, …, ȳ_{ik}, compared to a single ȳ_i in greedy search.
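A single expansion step of fixed-width beam search — score all k|V| one-token extensions and keep the top k by total log-likelihood — can be sketched as follows. This is an illustrative reimplementation under simplified data structures, not the paper's code; `expand_beam` and its argument layout are our own invention for the sketch.

```python
import heapq

def expand_beam(beam, next_log_probs, k):
    """One expansion step of fixed-width beam search.

    beam: list of (log_prob, tokens) candidates, length <= k.
    next_log_probs: next_log_probs[j] maps each vocabulary token to
        its next-step log-probability for candidate j, so together
        they define all k*|V| possible one-token extensions.
    Returns the top-k extensions sorted by total log-likelihood.
    """
    expansions = []
    for (lp, toks), tok_lps in zip(beam, next_log_probs):
        for tok, tok_lp in tok_lps.items():
            expansions.append((lp + tok_lp, toks + [tok]))
    return heapq.nlargest(k, expansions, key=lambda c: c[0])
```

For example, a width-1 beam with |V| = 3 produces three expansions, of which the top 2 survive when k = 2.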
Variable-width beam search reduces the computational cost of the above fixed-width beam search by pruning the full beam B^{l_t}_i using heuristics H, for example at t = 6 for the light blue beam in the figure. The width of the resulting pruned beam H(B^{l_t}_i) is no longer always exactly k, and may vary over time.

Streaming Variable-Length Decoding
As an example, consider translating a batch of n German sentences into English via a traditionally-batched variable-width beam search (e.g., Freitag and Al-Onaizan (2017)) on a GPU. Henceforth we refer to this baseline as VAR-BATCH.
An inefficiency results from decoding being inherently variable in length: After t steps, we may have completed m < n translations, but the last n − m beams may take several more timesteps. For example, in Figure 1, our initial batch consists of the dark blue, brown, and purple beams. After the dark blue and brown beams terminate, we would still be stuck decoding the purple beam by itself.
The resulting GPU underutilization motivates our streaming approach VAR-STREAM applied to variable-width beam search (Figure 1). For batch size n, VAR-STREAM initially proceeds identically to VAR-BATCH. But when the number of remaining beams drops below εn for some constant ε ∈ (0, 1), VAR-STREAM encodes a new batch of inputs x to "refill" its batch to size n.[2] This occurs at t = 4 in Figure 1, where we refill the batch using the green and light blue beams.
Note that the active beams are no longer all of equal length l_t = t after refilling. At each subsequent t, VAR-STREAM expands only the beams H(B^{l_t}_i) with minimal l_t; in particular, the purple beam in Figure 1 pauses computation at t = 4.[3] When decoding with state-of-the-art transformer architectures for MT, it is advantageous to expand only beams with minimal l_t at each step, because self-attention causes steps at higher l_t to be more expensive; see Appendix A.2. (For RNN-based architectures, it may be faster to expand all active beams at each step.) We emphasize that VAR-STREAM is an implementation optimization that exactly matches the output of VAR-BATCH. Full details are given in Algorithm 1.
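The core control flow just described can be sketched as a simplified simulation in plain Python. This is not the released implementation: `encode` and `expand` are hypothetical stand-ins for E and the per-step beam expansion, each entry of `active` models a whole beam with a `done` flag, and all GPU batching is abstracted away.

```python
def var_stream(inputs, encode, expand, n, eps):
    """Streaming variable-length decoding (simplified sketch).

    Maintains at most n active beams. When fewer than eps*n beams
    remain, refills the batch with newly encoded inputs. At each
    step, only beams with minimal length l_t are expanded (cheaper
    for transformer decoders, as discussed in the text).
    """
    pending = list(inputs)
    active = []   # each beam: {"enc", "toks", "done"}
    outputs = []
    while pending or active:
        # Refill once the batch has drained below the threshold.
        if len(active) < eps * n:
            room = n - len(active)
            refill, pending = pending[:room], pending[room:]
            active += [{"enc": encode(x), "toks": ["<s>"], "done": False}
                       for x in refill]
        # Expand only the least-expanded (minimal l_t) beams;
        # longer beams pause, as the purple beam does in Figure 1.
        min_len = min(len(b["toks"]) for b in active)
        for b in active:
            if len(b["toks"]) == min_len:
                expand(b)  # appends a token and may set b["done"]
        outputs += [b["toks"] for b in active if b["done"]]
        active = [b for b in active if not b["done"]]
    return outputs
```

With a toy `expand` that terminates each beam at a prescribed length, the loop drains all inputs even when the initial batch finishes at different times.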
When the memory bottleneck is partially the decoding process itself, rather than caching the input encodings E(x_i) or beams H(B^{l_t}_i), VAR-STREAM can cache additional encodings and beams on the GPU. At each t, VAR-STREAM then selects beams up to some limit on total beam width, filling GPU capacity even in the case of variable-width beams. This batching constraint addresses a second inefficiency in GPU utilization: the widths of the pruned beams H(B^{l_t}_i) may vary over time. We exploit this in semantic (Sec. 4.2) and syntactic parsing (Sec. 4.3).
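The per-step selection under a total-beam-width budget might look like the following sketch. The helper `select_beams` and its input layout are hypothetical, intended only to illustrate packing variable-width beams into a fixed capacity, least-expanded first.

```python
def select_beams(beams, capacity):
    """Pick beams to expand this step under a total-width budget.

    beams: list of (length, width) pairs, one per cached pruned beam
    H(B^{l_t}_i); width is its current (possibly pruned) number of
    candidates. Beams are considered least-expanded first, and a
    beam is included only if its full width fits in the remaining
    capacity. Returns the indices of the selected beams.
    """
    order = sorted(range(len(beams)), key=lambda i: beams[i][0])
    chosen, used = [], 0
    for i in order:
        width = beams[i][1]
        if used + width <= capacity:
            chosen.append(i)
            used += width
    return chosen
```

For instance, with beams of (length, width) = (5, 4), (3, 2), (3, 5), (7, 1) and a capacity of 7, the two shortest beams fill the budget exactly and the rest wait.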

Experiments
We apply VAR-STREAM to variable-width beam search in machine translation, semantic parsing, and syntactic parsing. We use the absolute threshold and max candidates heuristics of Freitag and Al-Onaizan (2017) as H, modifying only the heuristic hyperparameters for each domain based on a development set. The absolute threshold heuristic prunes candidates y^{l_t}_{ij} whose log-probabilities fall short of the best candidate y^{l_t}_{i1}'s by more than some threshold δ, i.e., log P(y^{l_t}_{ij}) < log P(y^{l_t}_{i1}) − δ. The max candidates heuristic prevents the search from selecting more than M < k length-(l_t + 1) candidates originating from the same length-l_t candidate at each step t.

[2] Our method is relatively insensitive to ε (Appendix A.3).
[3] As very long translations could get "stuck" in the batch, one can periodically finish computation on all remaining beams in the batch if latency is a concern in addition to throughput.

Algorithm 1 VAR-STREAM. Input: inputs X = {x_1, …, x_N}, model (E, D), batch size n, refill threshold ε, beam width k, pruning heuristics H; terminated beams are removed from E and B, and the final outputs Y are returned.
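The two heuristics can be sketched as a pure-Python filter over an expanded beam. This is an illustrative reimplementation of the heuristics of Freitag and Al-Onaizan (2017) under simplified data structures, not the released code; the function name and argument layout are our own.

```python
def prune(expansions, delta, max_candidates):
    """Apply the absolute threshold and max candidates heuristics.

    expansions: list of (log_prob, parent_index, tokens) tuples,
    sorted by descending log_prob (the top-k one-token extensions).
    - absolute threshold: drop y_ij whenever
        log P(y_ij) < log P(y_i1) - delta.
    - max candidates: keep at most M extensions per parent
      (length-l_t) candidate.
    """
    best = expansions[0][0]
    kept, per_parent = [], {}
    for lp, parent, toks in expansions:
        if lp < best - delta:
            continue  # absolute threshold prune
        if per_parent.get(parent, 0) >= max_candidates:
            continue  # max candidates prune
        per_parent[parent] = per_parent.get(parent, 0) + 1
        kept.append((lp, parent, toks))
    return kept
```

For example, with δ = 2.0 and M = 2, a very low-probability expansion is dropped by the threshold, and a third expansion of the same parent is dropped by the per-parent cap.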
In each domain we compare four methods: GREEDY, FIXED (fixed-width beam search), VAR-BATCH, and VAR-STREAM. We sort and bucket inputs by length for batching.

Machine Translation
We evaluate on WMT'19 De-En and Ru-En translation (newstest2018) with beam size k = 50, following Ng et al. (2019) but without reranking. As smaller beam sizes are also common in practice, we evaluate with k = 5 as well. Our GREEDY and FIXED baselines are Fairseq's implementation, while VAR-BATCH and VAR-STREAM are our own. For all methods, we evaluate 5 runs on a 32GB V100 GPU. For k = 50, we also run on a 16GB V100 GPU, noting that 32GB is likely more realistic in a production setting. We choose the batch size to saturate the GPU, using ε = 1/6 for VAR-STREAM, with pruning heuristics δ = 1.5, M = 5. Appendix A.3 details hyperparameter choices.
As shown in Table 1, in both GPU settings and for both language pairs, GREEDY is fastest but suffers heavily in BLEU. Our VAR-STREAM is the fastest beam-based search, and matches the BLEU of the beam search baselines. Compared to VAR-BATCH, VAR-STREAM is faster by 14-17% when using k = 50 on the 32GB GPU, and by 19-27% when k = 5. VAR-STREAM also remains 6% faster when using k = 50 on the 16GB GPU, where overhead is higher. VAR-BATCH and VAR-STREAM match the BLEU of FIXED while being 2-3 times faster when using beam size 50, confirming the speedups of VAR-BATCH over FIXED reported in, e.g., Freitag and Al-Onaizan (2017). FIXED is more competitive when k = 5 because the potential for heuristic beam pruning is much more limited; moreover, our implementations of VAR-STREAM and VAR-BATCH somewhat understate both the speedups and the BLEU cost relative to FIXED due to an implementation difference with Fairseq (Appendix A.1). Thus VAR-BATCH becomes slower than FIXED when k = 5. Nevertheless, VAR-STREAM remains the fastest in this scenario by 8-10%.

Semantic Parsing
To explore our method's domain applicability, we experiment with semantic parsing using the seq2seq model of Dong and Lapata (2016). This lightweight model is no longer state of the art, but its decoding is representative of more recent architectures (Suhr et al., 2018; Yin and Neubig, 2018; Lin et al., 2019). We use the ATIS flight-booking dataset (Dahl et al., 1994), setting n = k = δ = 10, M = 3. Due to the small dataset and model, our batching constraint is more theoretical: we constrain each method to expand at most nk = 100 candidates per timestep (i.e., total beam width), instead of simply saturating the GPU.[5]

Table 2: Top-1 and oracle reranking F1, wall clock (avg. of 5 runs), and average candidate expansions per timestep (i.e., total candidate expansions divided by total decoding timesteps) for semantic parsing on ATIS and syntactic parsing on the Penn Treebank (PTB). Theoretical maximum efficiency under our batching constraint is 100 expansions per step for both tasks. VAR-STREAM achieves substantially higher expansions per step than other methods on ATIS. On PTB, FIXED achieves near-perfect efficiency because all ȳ_{ij} for a given x_i have the same length; but comparing variable-width beam searches, VAR-STREAM is much more efficient with batch capacity than VAR-BATCH.

As shown by the expansions per step in Table 2, VAR-STREAM uses the batch capacity of 100 most efficiently. Thus VAR-STREAM is faster than both VAR-BATCH and FIXED, despite overhead that is exacerbated in a small model. The speedup is larger on the JOBS and GEO datasets (Zettlemoyer and Collins, 2012) (Appendix A.4). While all methods achieve similar top-1 F1, oracle F1 (using an oracle to "rerank" all outputs ȳ_{ij}) highlights the benefit of producing a diverse set of candidate outputs.

Syntactic Parsing
We also experiment with the lightweight shift-reduce constituency parser of Cross and Huang (2016) on the Penn Treebank (Marcus et al., 1993). This task and model differ from our previous setups in that for a given input x_i, all valid parses ȳ_{ij} have exactly the same length. When inputs are bucketed by length, this removes the variable-output-length inefficiency for traditional batching: we cannot get stuck finishing a small fraction of beams when the rest of the batch is done. Thus, this task isolates the effect of VAR-STREAM using batch capacity more efficiently in the case of variable-width beams. We use the same computational constraint as in semantic parsing, with n = k = 10, δ = 2.5, M = 3.
As all ȳ_{ij} have equal length for a given x_i, FIXED already achieves near-perfect efficiency in expansions per step (Table 2). Combined with the impact of overhead in this older (smaller) model, VAR-STREAM is not substantially faster than FIXED in this setting. However, when comparing variable-width beam searches, where efficient batching is more difficult, we observe that VAR-STREAM doubles VAR-BATCH in expansions per step.

[5] Due to caching additional encodings and beams, VAR-STREAM uses more GPU memory in this idealized setting.

Discussion
In this work, we have proposed a streaming method for variable-length decoding to improve GPU utilization, resulting in cheaper inference. Applied to a state-of-the-art machine translation model, our method yields substantial speed improvements compared to traditionally-batched variable-width beam search. We also apply our method to both semantic and syntactic parsing, demonstrating our method's broader applicability to tasks that process variable-output-length data sequentially.

References

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 173-180.

James Cross and Liang Huang. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. arXiv preprint arXiv:1612.06475.

A.1 Implementation Difference From Fairseq

Fairseq's implementation immediately adds a terminated candidate y^{l_{t_0}}_{ij} to the list of final outputs at time t_0. The difference in our implementation is that y^{l_{t_0}}_{ij} may be removed from the beam at a time t > t_0 if we later find multiple terminated candidates originating from a higher-probability candidate y^{l_{t_0}}_{ij'} for j' < j, e.g., between t = 7 and t = 8 in the light blue beam in Figure 1.

FIXED-OURS is slower than Fairseq's implementation. However, while the two implementations achieve more similar BLEU on the development set, FIXED-OURS achieves higher BLEU on the test set (49.75 vs. 49.57 on De-En and 39.19 vs. 38.98 on Ru-En). See Table 3 for De-En experiment details.
For completeness, we also present results in Table 3 for FIXED-STREAM, our streaming implementation adapted to fixed-width beam search, on newstest2018 on the 32GB Nvidia V100, with k = 50 as in the FIXED baseline. We keep the ε = 1/6 hyperparameter. FIXED-STREAM is significantly faster than FIXED-OURS, demonstrating that our streaming method can also speed up fixed-width beam search. However, FIXED-STREAM is slower than Fairseq's implementation, although it achieves higher BLEU. It is possible that our implementation is less optimized, but we do not formally claim this.

Method          BLEU    Wall Clock (s)
FIXED-FAIRSEQ   49.57   891.53 ± 0.77
FIXED-OURS      49.75   1280.59 ± 5.34
FIXED-STREAM    49.75   1004.18 ± 6.82

Table 3: De-En translation test set (newstest2018) results on a 32GB Nvidia V100 using different implementations of fixed-width beam search. FIXED-FAIRSEQ is the FIXED baseline in the main paper, while FIXED-OURS is our implementation of fixed-width beam search. FIXED-STREAM is a streaming implementation with ε = 1/6; FIXED-OURS corresponds to ε = 0. FIXED-STREAM improves over FIXED-OURS in wall clock, but is still slower than FIXED-FAIRSEQ, although it achieves higher BLEU.

A.2 Alternative Method Analysis
We briefly analyze an alternative streaming method to our proposed VAR-STREAM, which we label VAR-STR-FIFO. At each decoding timestep t, instead of selecting only the beams with minimal l_t as in VAR-STREAM, VAR-STR-FIFO selects beams up to its batch capacity starting with the beam of maximal l_t. In Figure 1, this corresponds to not pausing computation for the purple beam. This is intuitively appealing and has potential advantages: as shown in Table 5, unlike VAR-STREAM, which uses slightly more timesteps than VAR-BATCH due to its slightly smaller effective batch size, VAR-STR-FIFO significantly reduces the number of timesteps required for decoding in the De-En translation task. Yet VAR-STR-FIFO is significantly slower than both VAR-STREAM and VAR-BATCH. This is due to Fairseq's architecture, a transformer reliant on decoder self-attention, which causes decoding timesteps with longer l_t to be more expensive (Figure 2). VAR-STR-FIFO suffers because it must pad all selected beams' lengths up to the maximum l_t among those selected.
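The padding penalty can be illustrated with a toy cost model: if one decoder step over a batch costs roughly (number of beams) × (padded length), since each beam attends over a prefix padded to the longest selected beam, then expanding mixed-length beams together is wasteful. The numbers below are purely illustrative, not measurements from the paper.

```python
def padded_step_cost(lengths):
    """Toy per-step cost: every selected beam attends over a prefix
    padded to the longest selected beam, so cost ~ batch * max_len."""
    return len(lengths) * max(lengths)

# Three beams of lengths 3, 3, and 9: e.g., two fresh refills plus
# one long-running beam, as in Figure 1 after the refill.
fifo_cost = padded_step_cost([3, 3, 9])                      # pad all to 9
min_lt_cost = padded_step_cost([3, 3]) + padded_step_cost([9])  # expand short beams first
```

Under this model the FIFO-style joint step costs 27 units while splitting by minimal l_t costs 15, mirroring why VAR-STREAM's minimal-l_t policy is cheaper for self-attention decoders.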
The difference between VAR-STREAM and VAR-STR-FIFO demonstrates that selecting the correct beams to expand during decoding timesteps can be highly impactful on speed, illustrating a new axis of optimization made possible by streaming. While VAR-STREAM is superior for the transformers used in state-of-the-art machine translation, we hypothesize that VAR-STR-FIFO may be preferred in other applications, especially in RNN architectures which do not use self-attention.

A.3 Experiment Details and Hyperparameters
We provide details on our setup. All code is written in PyTorch (Paszke et al., 2017). For hardware, for our 32GB and 16GB Nvidia V100 experiments, we use p3dn.24xlarge and p3.2xlarge instances respectively on AWS. Experiments are conducted serially with no other computation on the instance. Due to some variance between instances, all experiments within a single comparable group (e.g., all methods' runs for beam size 50 on a 32GB GPU) are conducted on the same instance. Additionally, we specify an implementation detail: for the absolute threshold heuristic δ, we note that y^{l_t}_{i1} may be a terminated candidate from a previous timestep.
For heuristic hyperparameters, in all domains we choose pruning heuristics to approximately match the performance of FIXED, based on the development set (newstest2017 for machine translation, and the ATIS and Penn Treebank development sets in semantic and syntactic parsing respectively). As usual, in heuristic selection, there is a tradeoff between time and performance, which we explore here in the machine translation domain.
Both BLEU scores and computation time overall increase with δ and M (Table 5).
ε does not affect BLEU. Larger ε means we run fewer timesteps at high l_t, but our batch refills are smaller. At least in this machine translation setting, the effect of changing ε is typically a few seconds, indicating that our method is not overly sensitive to this hyperparameter choice as long as ε is small. (Note that we re-adjusted batch sizes in multiples of 64 to saturate the GPU for each ablation.) During initial hyperparameter selection, we ran ε = 1/6, 1/3, and 1/2. For M we tested 3, 5, and 10, and for δ we tested 1.5, 2.5, 5, and 10, based on manual tuning with single runs. Note that BLEU and F1 scores do not change across multiple trials, as we do not retrain models. Meanwhile, runtimes generally have fairly small standard deviation (see all tables), so we did not heavily optimize. Overall, the speedups enabled by VAR-STREAM over baselines are relatively insensitive to heuristic hyperparameters.

Although VAR-STREAM is not much more efficient than FIXED in expansions per timestep, it requires many fewer total expansions and is thus faster. Meanwhile, VAR-STREAM is several times more efficient than VAR-BATCH. However, the variable-width beam searches suffer slightly in oracle F1 compared to FIXED, while still remaining above GREEDY.

A.4 Additional Semantic Parsing Experiments
In Tables 6 and 7, we present results from applying our method to the JOBS and GEO datasets. We use the same hyperparameters and heuristics as for ATIS, and operate under the same candidate-expansion constraint. VAR-STREAM is substantially faster than FIXED and VAR-BATCH under this setting.