When to Finish? Optimal Beam Search for Neural Text Generation (modulo beam size)

In neural text generation such as neural machine translation, summarization, and image captioning, beam search is widely used to improve the output text quality. However, in the neural generation setting, hypotheses can finish in different steps, which makes it difficult to decide when to end beam search to ensure optimality. We propose a provably optimal beam search algorithm that will always return the optimal-score complete hypothesis (modulo beam size), and finish as soon as the optimality is established. To counter neural generation’s tendency for shorter hypotheses, we also introduce a bounded length reward mechanism which allows a modified version of our beam search algorithm to remain optimal. Experiments on neural machine translation demonstrate that our principled beam search algorithm leads to improvement in BLEU score over previously proposed alternatives.


Introduction
In recent years, neural text generation using recurrent networks has witnessed rapid progress, quickly becoming the state-of-the-art paradigm in machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014), summarization (Rush et al., 2015; Ranzato et al., 2016), and image captioning (Vinyals et al., 2015; Xu et al., 2015). In the decoder of neural generation, beam search is widely employed to boost the output text quality, often leading to substantial improvement over greedy search (equivalent to beam size 1) in metrics such as BLEU or ROUGE; for example, Ranzato et al. (2016) reported +2.2 BLEU (on single reference) in translation and +3.5 ROUGE-2 in summarization, both using a beam of 10. Our own experiments on machine translation (see Sec. 5) show +4.2 BLEU (on four references) using a beam of 5.
† Current address: Google Inc., New York, NY, USA.
However, unlike traditional beam search in phrase-based MT or shift-reduce parsing where all hypotheses finish in the same number of steps, here in neural generation, hypotheses can finish in vastly different numbers of steps. Once you find a completed hypothesis (by generating the </s> symbol), there are still other active hypotheses in the beam that can continue to grow, which might lead to better scores. Therefore when can you end the beam search? How (and when) can you guarantee that the returned hypothesis has the optimal score modulo beam size?
There have not been satisfying answers to these questions, and existing beam search strategies are heuristic methods that do not guarantee optimality. For example, the widely influential RNNsearch (Bahdanau et al., 2014) employs a "shrinking beam" method: once a completed hypothesis is found, the beam size shrinks by 1, and beam search finishes if the beam size shrinks to 0 or if the number of steps hits a hard limit. The best-scoring hypothesis among all completed ones encountered so far is returned. On the other hand, OpenNMT (Klein et al., 2017), whose PyTorch version will be the baseline in our experiments, uses a very different strategy: beam search terminates whenever the highest-ranking hypothesis in the current step is completed (which is also the one returned), without considering any other completed hypotheses. Neither of these two methods guarantees optimality of the returned hypothesis.
We therefore propose a novel and simple beam search variant that will always return the optimal-score complete hypothesis (modulo beam size), and finish as soon as the optimality is established. However, another well-known problem remains: the generated sentences are often too short compared to previous paradigms such as SMT (Shen et al., 2016). To alleviate this problem, previous efforts introduce length normalization (as a switch in RNNsearch) or length reward borrowed from SMT (Koehn et al., 2007). Unfortunately these changes invalidate the optimality property of our proposed algorithm. So we introduce a bounded length reward mechanism which allows a modified version of our beam search algorithm to remain optimal. Experiments on neural machine translation demonstrate that our principled beam search algorithm leads to improvement in BLEU score over previously proposed alternatives.

Neural Generation and Beam Search
Here we briefly review neural text generation and then review existing beam search algorithms.
Assume the input sentence, document, or image is embedded into a vector $x$, from which we generate the output sentence $y$, which is a completed hypothesis:¹

$$y^* = \operatorname*{argmax}_{y:\ comp(y)} \prod_{i \le |y|} p(y_i \mid x, y_{<i})$$

where $y_{<i}$ is a popular shorthand notation for the prefix $y_0 y_1 \ldots y_{i-1}$. We say that a hypothesis $y$ is completed, notated $comp(y)$, if its last word is </s>, i.e., $comp(y) \triangleq (y_{|y|} = \text{</s>})$, in which case it will not be further expanded.
A crucial difference in RNN-based neural generation compared to previous paradigms such as phrase-based MT is that we no longer decompose $p(y_i \mid x, y_{<i})$ into the translation model, $p(y_i \mid x)$, and the language model, $p(y_i \mid y_{<i})$, and more importantly, we no longer approximate the latter by n-gram models. This ability to model arbitrarily long histories using RNNs is an important reason for NMT's substantially improved fluency compared to SMT.
To (approximately) search for the best output $y^*$, we use beam search, where the beam $B_i$ at step $i$ is an ordered list of size (at most) $b$, and expands to the next beam $B_{i+1}$ of the same size:

$$B_{i+1} = \operatorname{top}^b \{ \langle y \circ y_{i+1},\ s \cdot p(y_{i+1} \mid x, y) \rangle \mid \langle y, s \rangle \in B_i \}$$

where $\circ$ denotes appending a word to a prefix, the notation $\operatorname{top}^b S$ selects the top $b$ scoring items from the set $S$, and each item is a pair $\langle y, s \rangle$ where $y$ is the current prefix and $s$ is its accumulated score (i.e., product of probabilities).
¹ For simplicity reasons we do not discuss bidirectional LSTMs and attentional mechanisms here, but our algorithms still work with those encoders (we have tested them).
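As a rough illustration of this expansion step, the following Python sketch works in log space, so accumulated scores are sums of log-probabilities rather than products. The `next_logprobs` callback and the toy distribution are hypothetical stand-ins for a trained model, and skipping completed hypotheses is one reading of the rule that they are not expanded further.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"
Item = Tuple[Tuple[str, ...], float]  # (prefix y, accumulated log-probability s)


def expand_beam(
    beam: List[Item],
    next_logprobs: Callable[[Tuple[str, ...]], Dict[str, float]],
    b: int,
) -> List[Item]:
    """Return the next beam: the top-b one-word extensions of the current items."""
    candidates = []
    for prefix, score in beam:
        if prefix[-1] == EOS:
            continue  # assumption: completed hypotheses are not expanded
        for word, logp in next_logprobs(prefix).items():
            candidates.append((prefix + (word,), score + logp))
    # top^b: keep the b highest-scoring candidates, best first
    return heapq.nlargest(b, candidates, key=lambda item: item[1])


def toy_next_logprobs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical model: the same next-word distribution for every prefix."""
    return {"a": math.log(0.6), "b": math.log(0.3), EOS: math.log(0.1)}


beam0 = [(("<s>",), 0.0)]
beam1 = expand_beam(beam0, toy_next_logprobs, b=2)
print([prefix for prefix, _ in beam1])
```

With a beam of 2, the low-probability `</s>` extension is pruned and only the prefixes ending in `a` and `b` survive.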

Optimal Beam Search (modulo beam size)
We propose a very simple method to optimally finish beam search, which guarantees the returned hypothesis is the highest-scoring completed hypothesis modulo beam size; in other words, we will finish as soon as an "optimality certificate" can be established that future hypotheses will never score better than the current best one.
Let $best_{\le i}$ be the best completed hypothesis so far up to step $i$, i.e.,

$$best_{\le i} \triangleq \operatorname*{argmax}_{y \in \cup_{j \le i} B_j,\ comp(y)} sc(y).$$

We update it every time we find a completed hypothesis (if there is none yet, then it remains undefined). Now at any step $i$, if $best_{\le i}$ is defined, and the highest scoring item $B_{i,1}$ in the current beam $B_i$ scores worse than or equal to $best_{\le i}$, i.e., when

$$sc(B_{i,1}) \le sc(best_{\le i}),$$

we claim the optimality certificate is established, and terminate beam search, returning $best_{\le i}$ (here smaller means worse, since we aim for the highest-probability completed hypothesis).
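Theorem 1 (optimality). When our beam search algorithm terminates, the current best completed hypothesis (i.e., $best_{\le i}$) is the highest-probability completed hypothesis (modulo beam size).
Proof. When the stopping condition holds, $B_{i,1}$ is no better than $best_{\le i}$, and since $B_{i,1}$ is the top item of $B_i$, every item in the current beam is no better than $best_{\le i}$. Future descendants grown from these items will only be no better, since probability ≤ 1, so all items in current and future steps are no better than $best_{\le i}$.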
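The stopping rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `toy_model` is a hypothetical hand-made distribution, scores are log-probabilities, and completed items are assumed to occupy a beam slot only in the step where they finish.

```python
import heapq
import math

EOS = "</s>"


def toy_model(prefix):
    """Hypothetical next-word log-probabilities that depend only on prefix length."""
    if len(prefix) == 1:  # only <s> so far
        probs = {"a": 0.5, EOS: 0.4, "b": 0.1}
    elif len(prefix) == 2:
        probs = {"a": 0.5, "b": 0.4, EOS: 0.1}
    else:
        probs = {EOS: 0.9, "a": 0.1}
    return {word: math.log(p) for word, p in probs.items()}


def optimal_beam_search(next_logprobs, b, max_steps=50):
    """Return (best completed (prefix, score) or None, number of steps taken)."""
    beam = [(("<s>",), 0.0)]
    best = None  # best completed hypothesis so far, i.e. best_{<=i}
    step = 0
    while step < max_steps:
        candidates = [
            (prefix + (word,), score + logp)
            for prefix, score in beam
            if prefix[-1] != EOS  # completed items are not expanded
            for word, logp in next_logprobs(prefix).items()
        ]
        if not candidates:
            break
        step += 1
        beam = heapq.nlargest(b, candidates, key=lambda item: item[1])
        # Update best_{<=i} with any hypothesis completed at this step.
        for hyp in beam:
            if hyp[0][-1] == EOS and (best is None or hyp[1] > best[1]):
                best = hyp
        # Optimality certificate: the top item is no better than best_{<=i}.
        if best is not None and beam[0][1] <= best[1]:
            break
    return best, step


best, steps = optimal_beam_search(toy_model, b=2)
print(best[0], steps)
```

In this toy run the search stops at step 2, while the top beam item is still incomplete: its probability (0.25) already falls below that of the completed hypothesis found at step 1 (0.4), so no descendant can overtake it.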
Theorem 2 (early stopping). Our beam search algorithm terminates no later than OpenNMT's termination criterion (when $B_{i,1}$ is completed).
Proof. When $B_{i,1}$ is itself completed, $best_{\le i} = \max\{B_{i,1}, \cdots\} \ge B_{i,1}$, so our stopping criterion is also met.
This theorem shows that our search stops earlier, as soon as the optimality certificate is established, exploring fewer items than OpenNMT's default search. Also note that the latter, even though it explores more items than ours, can still return suboptimal solutions, e.g., when $B_{i,1}$ is worse than $best_{\le i}$ (it never stores $best_{\le i}$). In practice, we noticed our search finishes about 3-5 steps earlier than OpenNMT at a beam of 10, and this advantage widens as beam size increases, although the overall speedup is not too noticeable, given that the target-language sentences are much longer than that. Also, our model scores (i.e., log-probabilities) are indeed better (see Fig. 1), where the advantage is also more pronounced with larger beams (note that the OpenNMT baseline is almost flat after b = 10, while our optimal beam search still steadily improves). Combining these two theorems, it is interesting to note that our method is not just optimal but also faster.
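To make the comparison concrete, the sketch below runs both termination rules on the same hypothetical toy model. The `"top_completed"` rule is a simplified rendition of the OpenNMT criterion as described in this paper, not OpenNMT's actual code; all names here are illustrative assumptions.

```python
import heapq
import math

EOS = "</s>"


def toy_model(prefix):
    """Hypothetical next-word log-probabilities that depend only on prefix length."""
    if len(prefix) == 1:
        probs = {"a": 0.5, EOS: 0.4, "b": 0.1}
    elif len(prefix) == 2:
        probs = {"a": 0.5, "b": 0.4, EOS: 0.1}
    else:
        probs = {EOS: 0.9, "a": 0.1}
    return {word: math.log(p) for word, p in probs.items()}


def beam_search(next_logprobs, b, rule, max_steps=50):
    """Run beam search with rule "optimal" or "top_completed".

    Returns ((prefix, score), steps taken).
    """
    beam = [(("<s>",), 0.0)]
    best = None  # best completed hypothesis so far (used by "optimal" only)
    step = 0
    while step < max_steps:
        candidates = [
            (prefix + (word,), score + logp)
            for prefix, score in beam
            if prefix[-1] != EOS
            for word, logp in next_logprobs(prefix).items()
        ]
        if not candidates:
            break
        step += 1
        beam = heapq.nlargest(b, candidates, key=lambda item: item[1])
        top = beam[0]
        for hyp in beam:
            if hyp[0][-1] == EOS and (best is None or hyp[1] > best[1]):
                best = hyp
        if rule == "top_completed" and top[0][-1] == EOS:
            return top, step  # returns the top item, ignoring earlier completions
        if rule == "optimal" and best is not None and top[1] <= best[1]:
            return best, step
    return best, step


optimal, optimal_steps = beam_search(toy_model, b=2, rule="optimal")
baseline, baseline_steps = beam_search(toy_model, b=2, rule="top_completed")
print(optimal[0], optimal_steps)
print(baseline[0], baseline_steps)
```

On this toy model the optimal rule stops one step earlier and returns a hypothesis with probability 0.4, whereas the top-completed rule runs to step 3 and returns one with probability 0.225, because it never looks back at the completion found at step 1.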

Optimal Beam Search for Bounded Length Reward
However, the optimal-score hypothesis, though satisfying in theory, is not ideal in practice, since neural models are notorious for producing very short sentences, as opposed to older paradigms such as SMT (Shen et al., 2016). To alleviate this problem, two methods have been proposed: (a) length normalization, used in RNNsearch as an option, where the revised score of a hypothesis is divided by its length, thus favoring longer sentences; and (b) explicit length reward borrowed from SMT, rewarding each generated word by a constant tuned on the dev set. Unfortunately, each of these methods breaks the optimality proof of our beam search algorithm in Section 3, since a future hypothesis, being longer, might end up with a higher (revised) score. We therefore devise a novel mechanism called "bounded length reward": we reward each word only until the hypothesis reaches the "estimated optimal length". In machine translation and summarization, this optimal length $l$ can be $ratio \cdot |x|$, where $|x|$ is the source sentence length and $ratio$ is the average ratio of reference translation length over source sentence length on the dev set (in our Chinese-to-English NMT experiments, it is 1.27 as the English side is a bit longer). Note that we use the same ratio estimated from dev on test, assuming that the optimal length ratio for test (which we do not know) should be similar to that of dev. We denote by $\tilde{sc}(y)$ the revised score of hypothesis $y$ with the bounded length reward, i.e.,

$$\tilde{sc}(y) \triangleq sc(y) + r \cdot \min\{l, |y|\}.$$
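The revised score is simple enough to sketch directly. The helper names below are hypothetical, and two details are assumptions rather than facts from the text: the ratio is computed as the mean of per-sentence ratios against a single reference, and $l$ is left as a real number without rounding.

```python
import math


def estimate_length_ratio(dev_pairs):
    """Mean of per-sentence reference/source length ratios (one possible reading)."""
    ratios = [len(ref) / len(src) for src, ref in dev_pairs]
    return sum(ratios) / len(ratios)


def revised_score(score, length, r, est_len):
    """Bounded length reward: sc~(y) = sc(y) + r * min(l, |y|)."""
    return score + r * min(est_len, length)


# Hypothetical tokenized dev pairs: (source tokens, reference tokens).
dev_pairs = [
    (["s1", "s2", "s3", "s4"], ["t1", "t2", "t3", "t4", "t5"]),  # ratio 1.25
    (["s1", "s2"], ["t1", "t2", "t3"]),                           # ratio 1.5
]
ratio = estimate_length_ratio(dev_pairs)

# With the paper's reported ratio of 1.27 and a 10-word source, l = 12.7.
est_len = 1.27 * 10
short = revised_score(-8.0, length=9, r=1.2, est_len=est_len)
long_a = revised_score(-15.0, length=13, r=1.2, est_len=est_len)
long_b = revised_score(-15.0, length=20, r=1.2, est_len=est_len)
print(ratio, short, long_a, long_b)
```

The last two calls receive the same reward even though one hypothesis is seven words longer, which is the bound that later keeps the optimality argument intact.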
We also define $\widetilde{best}_{\le i}$ to be the revised version of $best_{\le i}$ that optimizes the revised instead of the original score, i.e.,

$$\widetilde{best}_{\le i} \triangleq \operatorname*{argmax}_{y \in \cup_{j \le i} B_j,\ comp(y)} \tilde{sc}(y).$$

Now with bounded length reward, we can modify our beam search algorithm a little bit and still guarantee optimality. First we include in the revised score a reward $r$ for each generated word, as long as the length does not exceed $l$, the estimated optimal length. Then, at step $i$, we check whether the highest scoring item $B_{i,1}$'s revised score (i.e., including bounded length reward) plus the heuristic "future" extra length reward of a descendant, $r \cdot \max\{l - i, 0\}$, is worse than (or equal to) the similarly revised version of $best_{\le i}$, i.e., whether

$$\tilde{sc}(B_{i,1}) + r \cdot \max\{l - i, 0\} \le \tilde{sc}(\widetilde{best}_{\le i}),$$

at which time we claim the revised optimality certificate is established, terminate the beam search, and return $\widetilde{best}_{\le i}$.
Actually, with some trivial math we can simplify the stopping criterion to

$$sc(B_{i,1}) + r \cdot l \le \tilde{sc}(\widetilde{best}_{\le i}). \qquad (4)$$

This much simplified but still equivalent criterion can speed up decoding in practice, since it means we do not need to compute the revised score for every hypothesis in the beam; we only need to add the bounded length reward when a hypothesis is finished (i.e., when updating $\widetilde{best}_{\le i}$), and the simplified criterion only compares it with the original score of a hypothesis plus a constant reward $r \cdot l$.
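A possible rendering of the modified search with the simplified criterion is sketched below. It is an illustration under assumptions, not the released implementation: `toy_model` is hypothetical, the length of an item at step $i$ is taken to be $i$ (counting `</s>`), and the beam is ranked by the original score, which gives the same order as the revised score because all items at a given step share the same length.

```python
import heapq
import math

EOS = "</s>"


def toy_model(prefix):
    """Hypothetical next-word log-probabilities that depend only on prefix length."""
    if len(prefix) == 1:
        probs = {"a": 0.5, EOS: 0.4, "b": 0.1}
    elif len(prefix) == 2:
        probs = {"a": 0.5, "b": 0.4, EOS: 0.1}
    else:
        probs = {EOS: 0.9, "a": 0.1}
    return {word: math.log(p) for word, p in probs.items()}


def bounded_reward_beam_search(next_logprobs, b, r, est_len, max_steps=50):
    """Return ((prefix, revised score) of the best completed hypothesis, steps)."""
    beam = [(("<s>",), 0.0)]
    best = None  # revised best_{<=i}: (prefix, revised score)
    step = 0
    while step < max_steps:
        candidates = [
            (prefix + (word,), score + logp)
            for prefix, score in beam
            if prefix[-1] != EOS
            for word, logp in next_logprobs(prefix).items()
        ]
        if not candidates:
            break
        step += 1
        beam = heapq.nlargest(b, candidates, key=lambda item: item[1])
        # The reward is only added when a hypothesis finishes.
        for prefix, score in beam:
            if prefix[-1] == EOS:
                revised = score + r * min(est_len, step)
                if best is None or revised > best[1]:
                    best = (prefix, revised)
        # Simplified criterion: original top score plus the constant r * l.
        if best is not None and beam[0][1] + r * est_len <= best[1]:
            break
    return best, step


plain, plain_steps = bounded_reward_beam_search(toy_model, b=2, r=0.0, est_len=3)
rewarded, rewarded_steps = bounded_reward_beam_search(toy_model, b=2, r=1.2, est_len=3)
print(plain[0], plain_steps)
print(rewarded[0], rewarded_steps)
```

With `r = 0` this reduces to the unrewarded search and returns the empty translation; with `r = 1.2` and an estimated length of 3, the same toy model yields a three-token hypothesis instead.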
Theorem 3 (modified optimality). Our modified beam search returns the highest-scoring completed hypothesis where the score of an item is its log-probability plus a bounded length reward.
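Proof. By admissibility of the heuristic: the term $r \cdot \max\{l - i, 0\}$ never underestimates the additional length reward that a descendant of an item at step $i$ can still collect, while its log-probability can only decrease.
Theorem 4 (correctness of the simplified criterion). The simplified criterion (4) is equivalent to the revised stopping criterion stated before it.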
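The equivalence can be checked in a few lines. The derivation below assumes that every item in the beam at step $i$ has length exactly $i$, so that $\tilde{sc}(B_{i,1}) = sc(B_{i,1}) + r \cdot \min\{l, i\}$; this assumption is implied by the step-synchronous expansion but is not stated explicitly in the text.

```latex
\begin{align*}
\tilde{sc}(B_{i,1}) + r \cdot \max\{l - i, 0\}
  &= sc(B_{i,1}) + r \cdot \min\{l, i\} + r \cdot \max\{l - i, 0\} \\
  &= sc(B_{i,1}) + r \cdot \bigl(\min\{l, i\} + \max\{l - i, 0\}\bigr) \\
  &= sc(B_{i,1}) + r \cdot l,
\end{align*}
% because min{l, i} + max{l - i, 0} equals i + (l - i) = l when i <= l,
% and l + 0 = l when i > l.
```

Both sides of the original and simplified criteria therefore compare the same quantity against $\tilde{sc}(\widetilde{best}_{\le i})$.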

Data Preparation, Training, and Baselines
We conduct experiments on Chinese-to-English neural machine translation, using OpenNMT-py, the PyTorch port of the Lua-based OpenNMT (Klein et al., 2017). We choose this library because PyTorch's combination of Python with Torch's dynamic computation graphs made it much easier to implement various search algorithms on it than on Theano-based implementations derived from RNNsearch (Bahdanau et al., 2014) (such as the widely used GroundHog and Laulysta codebases) as well as the original LuaTorch version of OpenNMT. We use 1M Chinese/English sentence pairs for training (see Table 1 for statistics); we also trained on 2M sentence pairs and only saw a minor improvement, so below we report results from 1M training. To alleviate the vocabulary size issue we employ byte-pair encoding (BPE) (Sennrich et al., 2015), which reduces the source and target language vocabulary sizes to 18k and 10k, respectively; we found BPE to significantly improve BLEU scores (by at least +2 BLEU) and reduce training time. Following other papers on Chinese-English translation such as Shen et al. (2016), we use the NIST 06 newswire portion (616 sentences) for development and the NIST 08 newswire portion (691 sentences) for testing; we report case-insensitive 4-reference BLEU-4 scores (using original segmentation). Following OpenNMT-py's default settings, we train our NMT model for 20 epochs to minimize perplexity on the training set (excluding the 15% of sentences longer than 50 source tokens), with a batch size of 64, word embedding size of 500, and dropout rate of 0.3. The total number of parameters is 29M. Training takes about an hour per epoch on a GeForce 980 Ti GPU, and the model at epoch 15 reaches the lowest perplexity on the dev set (9.10) and is chosen as the model for testing.
On the dev set, the default decoder of OpenNMT-py reaches 29.2 BLEU with beam size 1 (greedy) and 33.2 BLEU with the default beam size of 5. To put this in perspective, the most commonly used SMT toolkit Moses (Koehn et al., 2007) reaches 30.1 BLEU (with beam size 70) using the same 1M sentence training set (trigram language model trained on the target side). With 2.56M training sentence pairs, Shen et al. (2016) reported 32.7 BLEU on the same dev set using Moses and 30.7 BLEU using the baseline RNNsearch (GroundHog) with beam size 10 (without BPE, without length normalization or length reward). So our OpenNMT-py baseline is extremely competitive.

Beam Search & Bounded Length Reward
We compare the following beam search variants:
1. OpenNMT-py's default beam search, finishing only when the top hypothesis in a step is completed (see Section 2);
2. The "shrinking beam" method in RNNsearch with two variants to encourage longer translations:
   (a) length normalization (Google NMT also adopted something similar);
   (b) length reward;
3. Our optimal-ending beam search (Section 3);
4. Our modified optimal-ending beam search for bounded length reward (Section 4).
Notice that length reward has no effect on either method 1 or method 2(a) above. To tune the optimal length reward r we run our modified optimal-ending beam search algorithm with all combinations of r = 0, 0.5, 1, 1.1, 1.2, 1.3, 1.4 and beam sizes b = 1 ... 20 on the dev set, since different beam sizes might prefer different length rewards. We found r = 1.2 to be the best among all length rewards (see Table 2), which is the value used in Figure 2, and b = 15 to be the best beam size for r = 1.2.
We can observe from Figure 2 that (a) our optimal beam search with bounded length reward performs the best, and at b=15 it is +5 BLEU better than b=1; (b) pure optimal beam search degrades after b=4 due to extremely short translations; (c) both the shrinking beam method with length normalization and OpenNMT-py's default search alleviate the shortening problem, but still produce very short translations (length ratio ∼0.9); and (d) the shrinking beam method with length reward works well, but is still 0.3 BLEU below our best method. These observations are confirmed on the test set (Tab. 3).

Conclusions
We have presented a beam search algorithm for neural sentence generation that always returns optimal-score completed hypotheses. To counter neural generation's natural tendency for shorter hypotheses, we introduced a bounded length reward mechanism which allows a modified version of our beam search algorithm to remain optimal. Experiments on top of strong baselines have confirmed that our principled search algorithms (together with our bounded length reward mechanism) outperform existing beam search methods in terms of BLEU scores. We will release our implementations (which will hopefully be merged into OpenNMT-py) when this paper is published.
Table 3: Final BLEU scores on the test set (NIST 08) using best settings from the dev set (NIST 06).