If beam search is the answer, what was the question?

Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results. Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which merits the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU.


Introduction
As a simple search heuristic, beam search has been used to decode models developed by the NLP community for decades. Indeed, it is noteworthy that beam search is one of the few NLP algorithms that has stood the test of time: It has remained a cornerstone of NLP systems since the 1970s (Reddy, 1977). As such, it became the natural choice for decoding neural probabilistic text generators, whose design makes evaluating the full search space impossible (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Vinyals and Le, 2015; Yin et al., 2016). While there is no formal guarantee that beam search will return, or even approximate, the highest-scoring candidate under a model, it has repeatedly proven its merit in practice (Serban et al., 2017; Edunov et al., 2018; Yang et al., 2019) and, thus, has largely been tolerated, even embraced, as NLP's go-to search heuristic. However, in the context of neural machine translation (NMT), a shocking empirical finding has emerged: Using beam search to decode sentences from neural text generators almost invariably leads to better text than using exact search (or beam search with a very large beam size). In fact, Stahlberg and Byrne (2019) report that exact search returns the empty string in > 50% of cases, showing that the success of beam search does not stem from its ability to approximate exact decoding in practice, but rather from a hidden inductive bias embedded in the algorithm. This inductive bias appears to be paramount for generating desirable text from neural probabilistic text generators.

Figure 1: Average std. deviation σ of surprisals (per sentence) and corpus BLEU for translations generated using exact search over the MAP objective with a greedy regularizer (Eq. (11)) with varying strengths λ. References for beam search (k = 5 and k = 100) are included. Sub-graph shows the explicit relationship between BLEU and σ. The λ and σ axes are log-scaled.
While several works explore this phenomenon (Murray and Chiang, 2018;Yang et al., 2018;Stahlberg and Byrne, 2019;Cohen and Beck, 2019), no one has yet hypothesized what beam search's hidden inductive bias may be. Our work fills this gap.
We analyze the beam search blessing by reverse engineering an objective that beam search returns the exact solution for. Specifically, we introduce a regularizer for the standard (MAP) decoding objective for text generation models such that the exact solution to this regularized objective is equivalent to the solution found by beam search under the unmodified objective. Qualitative inspection reveals that our "beam search regularizer" has a clear connection to a theory in cognitive science: the uniform information density hypothesis (UID; Levy and Jaeger, 2007). The UID hypothesis states that, subject to the constraints of the grammar, humans prefer sentences that distribute information (in the sense of information theory) equally across the linguistic signal, e.g., a sentence. In other words, human-produced text, regardless of language, tends to have evenly distributed surprisal, formally defined in information theory as negative log-probability. This connection suggests beam search has an interpretation as exact decoding, but with a UID-promoting regularizer that encourages evenly distributed surprisal in generated text. This insight naturally leads to the development of several new regularizers that likewise enforce the UID property.
Empirically, we experiment with our novel regularizers in the decoding of NMT models. We first observe a close relationship between the standard deviation of surprisals-an operationalization of UID-and BLEU, which suggests that high-quality text does indeed exhibit the UID property. Additionally, we find that even with exact search, our regularized objective leads to performance similar to beam search on standard NMT benchmarks. Both of these observations are reflected in Fig. 1. Lastly, we see that our regularizers alleviate the text-quality degradation typically seen when decoding with larger beam sizes. We take all the above as evidence that our proposed explanation of beam search's inductive bias indeed elucidates why the algorithm performs so well as a search heuristic for language generation tasks.

Neural Probabilistic Text Generation
Probabilistic text generators define a probability distribution p_θ(y | x) over an output space of hypotheses Y (defined in Eq. (1)) conditioned on an input x. Modern generators are typically parameterized by a deep neural network, possibly recurrent, with a set of learned weights θ. In the case of text generation, the full set of possible hypotheses grows exponentially with the vocabulary size |V|. We consider the set of complete hypotheses, i.e., valid outputs, as

    Y := { BOS ∘ v ∘ EOS | v ∈ V* }    (1)

where ∘ is string concatenation and V* is the Kleene closure of V. In words, valid hypotheses are text, e.g., sentences or phrases, padded with the distinguished tokens BOS and EOS. In this work, we consider models that are locally normalized, i.e., the model p_θ is defined as a product of probability distributions:

    p_θ(y | x) = ∏_{t=1}^{|y|} p_θ(y_t | x, y_{<t})    (2)

where each p_θ(· | x, y_{<t}) is a distribution with support over V̄ := V ∪ {EOS} and y_{<1} = y_0 := BOS. The decoding objective for text generation aims to find the most-probable hypothesis among all candidate hypotheses, i.e., we aim to solve the following optimization problem:

    y* = argmax_{y ∈ Y} log p_θ(y | x)    (3)

This is commonly known as maximum a posteriori (MAP) decoding since p_θ is a probability model. While there exists a wealth of literature on decoding algorithms for statistical text generation models, e.g., phrase-based machine translation models, many of these methods cannot reasonably be used with neural models. Specifically, due to the non-Markovian structure of most neural text generators, dynamic-programming algorithms for searching over the exponentially large space are not efficient in this setting. Indeed, there are formal results showing that solving Eq. (3) with a recurrent neural network is NP-hard (Chen et al., 2018). Therefore, decoding is performed almost exclusively with heuristic methods, such as beam search.
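As a concrete illustration of Eq. (2), the sketch below scores a hypothesis under a locally normalized model by summing per-step token log-probabilities. The step distributions here are an invented toy table; in a real system they would come from a neural network's softmax at each decoding step.

```python
import math

# Toy conditional distributions p(y_t | x, y_<t) for a fixed input x.
# These values are invented for illustration only.
STEP_DISTS = {
    (): {"the": 0.6, "a": 0.3, "<eos>": 0.1},
    ("the",): {"cat": 0.7, "dog": 0.2, "<eos>": 0.1},
    ("the", "cat"): {"<eos>": 0.9, "sat": 0.1},
}

def sequence_log_prob(tokens):
    """Eq. (2) in log space: log p(y | x) = sum_t log p(y_t | x, y_<t)."""
    return sum(
        math.log(STEP_DISTS[tuple(tokens[:t])][tok])
        for t, tok in enumerate(tokens)
    )

score = sequence_log_prob(["the", "cat", "<eos>"])
```

MAP decoding (Eq. (3)) then amounts to maximizing this score over the exponentially large hypothesis space, which is what makes exact search hard.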

Beam Search
Beam search is a form of pruned breadth-first search where the breadth is limited to k ∈ Z_+, i.e., a maximum of k hypotheses are expanded at each time step. We express beam search as the following recursion:

    Y_0 = { BOS }
    Y_t = argmax_{Y' ⊆ B_t, |Y'| = k} log p_θ(Y' | x)    (4)

where we define the candidate set at t > 0 as

    B_t = { y_{t-1} ∘ y | y ∈ V̄ and y_{t-1} ∈ Y_{t-1} }    (5)

For notational convenience, we define EOS ∘ EOS = EOS. The above algorithm terminates after a fixed number of iterations n_max and the set Y_{n_max} is returned. We overload p_θ(· | x) to take a set of hypotheses as an argument instead of just a single hypothesis; in this case, p_θ(Y | x) := ∏_{y ∈ Y} p_θ(y | x). Using a similar schema, the argmax may also operate over a different objective, e.g., log-probabilities combined with various rewards or penalties, such as those discussed in §2.2. Beam search has a long history in sequence transduction. For example, many of the decoding strategies used in statistical machine translation (SMT) systems were variants of beam search (Och et al., 1999; Koehn et al., 2003; Koehn, 2004). As language generation systems moved away from phrase-based statistical approaches and towards neural models, beam search remained the de facto decoding algorithm (Sutskever et al., 2014; Vinyals and Le, 2015). However, it has been observed that when used as a decoding algorithm for neural text generation, beam search (for small beams) typically has a large percentage of search errors (Stahlberg and Byrne, 2019). Counterintuitively, it is also widely known that increasing the beam size beyond 5 can hurt model performance in terms of downstream evaluation metrics (e.g., BLEU, ROUGE); while a number of prior works have referred to this phenomenon as a curse (Koehn and Knowles, 2017; Yang et al., 2018; Cohen and Beck, 2019), it should perhaps be seen as a blessing. Beam search typically generates well-formed and coherent text from probabilistic models, whose global optimum in many cases is the empty string, when they otherwise might fail to produce text at all.
As we demonstrate in §4, this text also tends to be human-like. We will subsequently explore possible reasons as to why beam search leads to desirable text from models that are otherwise poorly calibrated, i.e., poor representations of the true distribution p(y | x) (Guo et al., 2017).
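The recursion in Eqs. (4) and (5) can be sketched in a few lines. This is a minimal illustration over a toy model (the table of step distributions is invented), not the paper's implementation; finished hypotheses are carried forward unchanged, mirroring the convention EOS ∘ EOS = EOS.

```python
import math

def beam_search(step_dists, k, n_max):
    """Minimal beam search over a toy model `step_dists`
    (a dict: prefix tuple -> {token: probability})."""
    beams = [(0.0, ("<bos>",))]
    for _ in range(n_max):
        candidates = []
        for score, hyp in beams:
            if hyp[-1] == "<eos>":
                candidates.append((score, hyp))  # finished hypothesis stays
                continue
            for tok, p in step_dists[hyp].items():
                candidates.append((score + math.log(p), hyp + (tok,)))
        # keep the k highest-scoring (lowest-surprisal) expansions
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams

TOY = {
    ("<bos>",): {"a": 0.6, "b": 0.4},
    ("<bos>", "a"): {"<eos>": 1.0},
    ("<bos>", "b"): {"<eos>": 1.0},
}
best = beam_search(TOY, k=2, n_max=2)
```

Replacing the score `score + math.log(p)` with a modified objective (length reward, coverage, or the regularizers of §5) gives the variants discussed in §2.2.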

Alternative Decoding Objectives
When the MAP objective (Eq. (3)) is used for decoding neural text generators, the results are generally not satisfactory. Among other problems, the generated text is often short and defaults to high-frequency words (Cho et al., 2014; Vinyals and Le, 2015; Shen et al., 2016). Methods such as length and coverage normalization (Jean et al., 2015; Tu et al., 2016; Murray and Chiang, 2018), which augment the MAP objective with an additive term or multiplicative factor, have been adopted to alleviate these issues. For example, two such forms of length and coverage normalization use the following modified MAP objectives, respectively, during decoding to produce higher-quality output:

    log p_θ(y | x) + λ · |y|    (7)
    log p_θ(y | x) + λ · Σ_{i=1}^{|x|} log min(1, Σ_{j=1}^{|y|} α_ij)    (8)

where λ > 0 is the (tunable) strength of the reward and α_ij is the attention weight (Bahdanau et al., 2015) from the j-th decoding step over the i-th input. Eq. (7) directly rewards longer outputs, while Eq. (8) aims to reward coverage of input words in a prediction, using the attention mechanism of an encoder-decoder model as an oracle (Tu et al., 2016). While such methods help obtain state-of-the-art results in neural MT (Gehring et al., 2017), we view them as a patch to the observed problems. The fact that text quality still degrades with increased beam sizes when these rewards are used (Koehn and Knowles, 2017; Ott et al., 2018a) suggests that they do not address the inherent issues with text generation systems. We subsequently hypothesize about the nature of these issues and provide a set of linguistically motivated regularizers, inspired by beam search, that appear to alleviate them.
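The two modified objectives can be sketched as score adjustments. This is an illustrative reconstruction, not the exact formulas of any one cited system; the attention matrix below is made up for the example.

```python
import math

def length_reward(log_prob, length, lam):
    """Eq. (7)-style objective: log p(y|x) + lambda * |y|."""
    return log_prob + lam * length

def coverage_reward(log_prob, attention, lam):
    """Eq. (8)-style objective. `attention[j][i]` is the attention weight
    of source position i at decoding step j."""
    num_src = len(attention[0])
    cov = sum(
        math.log(min(1.0, sum(step[i] for step in attention)))
        for i in range(num_src)
    )
    return log_prob + lam * cov

# Toy example: two decoding steps over two source positions.
s = coverage_reward(-1.0, [[0.8, 0.2], [0.8, 0.2]], lam=0.5)
```

Note how Eq. (8) only rewards coverage up to total mass 1 per source word, so repeatedly attending to the same position yields no extra reward.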

Deriving Beam Search
We introduce a regularized decoding framework. The idea is simple: we seek to solve the regularized optimization problem

    y* = argmax_{y ∈ Y} ( log p_θ(y | x) − λ · R(y) )    (9)

to decode, for a strategically chosen R(·). Clearly, for certain R(·), we recover the decoding objectives discussed in §2.2. The question we ask in this work is the following: If we want to view beam search as an exact-decoding algorithm, which R(·) should we choose to recover beam search?
We discovered an elegant answer rooted in information theory and cognitive science (the connections are discussed in depth in §4). We first define the model's time-dependent surprisals, an information-theoretic quantity that characterizes the amount of new information expressed at time t:

    u_0(y) := 0,    u_t(y) := −log p_θ(y_t | x, y_{<t})    (10)

Note that minimally surprising means maximally probable. For the special case of greedy decoding, where k = 1, the following choice of regularizer recovers beam search for sufficiently large λ:

    R_greedy(y) = Σ_{t=1}^{|y|} ( u_t(y) − min_{y' ∈ V̄} (−log p_θ(y' | x, y_{<t})) )²    (11)

The intuition behind Eq. (11) is to encourage locally optimal decisions: every local surprisal u_t should be close to the minimally surprising choice. In the limiting case where locally optimal decisions are not just encouraged but enforced, we recover greedy search.
Formally, we have the following theorem:

Theorem 3.1. The argmax of log p_θ(y | x) − λ · R_greedy(y) is exactly computed by greedy search in the limiting case as λ → ∞.
Proof. By induction. In App. A.
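To make Eqs. (10) and (11) concrete, the sketch below computes per-step surprisals and the greedy regularizer for a hypothesis under toy step distributions (the numbers are invented for illustration). The regularizer is zero exactly when every decision is the locally most probable one.

```python
import math

def surprisals(step_dists, tokens):
    """u_t(y) = -log p(y_t | x, y_<t); `step_dists[t]` is the model's
    distribution at step t (token -> probability)."""
    return [-math.log(d[tok]) for d, tok in zip(step_dists, tokens)]

def r_greedy(step_dists, tokens):
    """Eq. (11): squared gap between each step's surprisal and the minimum
    achievable surprisal at that step. Zero iff every choice is greedy."""
    total = 0.0
    for d, tok in zip(step_dists, tokens):
        u = -math.log(d[tok])
        u_min = min(-math.log(p) for p in d.values())
        total += (u - u_min) ** 2
    return total

dists = [{"a": 0.9, "b": 0.1}, {"c": 0.5, "d": 0.5}]
```

Choosing "a" then "c" (both greedy) incurs zero penalty; choosing "b" at the first step incurs a squared penalty of (log 9)², which the λ → ∞ limit turns into an outright prohibition.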
Theorem 3.1 establishes that greedy search is the limiting case of a regularizer that encourages decisions with high probability locally. In contrast, the optimal MAP solution will generally not have this property: a globally optimal MAP decoder may make a locally suboptimal decision for the sake of a compensatory decision later that leads to global optimality. We now consider the generalization of greedy search (k = 1) to full beam search (k ≥ 1). Recall that beam search returns not just a single output, but rather a set of outputs. Thus, we must consider the set-decoding objective

    Y*_k = argmax_{Y' ⊆ Y, |Y'| = k} ( log p_θ(Y' | x) − λ · R(Y') )    (12)

where, as before, we have used our overloaded notation p_θ(· | x) to score sets of hypotheses. Similarly to R_greedy, we formulate a greedy set-regularizer to recover beam search:

    R_beam(Y) = Σ_{t=1}^{n_max} ( u_t(Y_t) − min_{Y'_t ⊆ B_t, |Y'_t| = k} u_t(Y'_t) )²    (13)

where Y_t = { y_{1:t} | y ∈ Y } corresponds to the set of hypotheses expanded by t steps and B_t is the candidate set of Eq. (5). Note that we additionally overload surprisal to operate on sets: u_t(Y) = Σ_{y ∈ Y} u_t(y). We prove an analogous theorem to Theorem 3.1 for this regularizer.
Proof. The proof follows from the same argument as Theorem 3.1, albeit with sets instead of an individual hypothesis.
Note that in the (predominant) case where we want to return a single candidate sentence as the output rather than an entire set-as would be generated by Eq. (12)-we can take the highest-probability sequence in the chosen set Y as our decoded output. The objective in Eq. (12) boils down to a subset selection problem which, given the size of Y, is a computationally prohibitive optimization problem. Nonetheless, we can use it to analyze the properties enforced on generated text by beam search.

From Beam Search to UID
The theoretical crux of this paper hinges on a proposed relationship between beam search and the uniform information density hypothesis (Levy, 2005;Levy and Jaeger, 2007), a concept from cognitive science: Hypothesis 4.1. "Within the bounds defined by grammar, speakers prefer utterances that distribute information uniformly across the signal (information density). Where speakers have a choice between several variants to encode their message, they prefer the variant with more uniform information density (ceteris paribus)" (Jaeger, 2010).
At its core, the theory seeks to explain various aspects of human language processing in terms of information theory; it is often applied to an area of psycholinguistics known as sentence processing where the UID hypothesis is used to explain experimental data (Hale, 2001). As the UID hypothesis concerns a cognitive process (virtually) independent of the language in use, the theory should hold across languages (Jaeger and Tily, 2011).
To see the hypothesis in action, consider the classic case of syntactic reduction from Levy and Jaeger (2007): (1) How big is [NP the family_i [RC (that) you cook for _i ]]?
In the above example, the sentence does not require the relativizer that at the start of the relative clause (denoted by RC); it would also be syntactically correct without it. However, many would agree that the relativizer makes the text qualitatively better. The information-theoretic explanation of this perception is that, without the relativizer, the first word of the relative clause conveys two pieces of information simultaneously: the onset of the relative clause and part of its internal contents. Including the relativizer spreads this information across two words, thereby distributing information across the sentence more uniformly and avoiding instances of high surprisal, which, from a psycholinguistic perspective, are displeasing. In short, the relativizer helps to ensure the UID property of the sentence. Importantly, the preference suggested by the UID hypothesis is between possible utterances (i.e., outputs) where grammaticality and information content are held constant. Any violation of these assumptions presents confounding factors when measuring, or optimizing, the information density of generated text. In our setting, there is reason to believe that grammaticality and information content are approximately held constant while selecting between hypotheses. First, the high-probability outputs of neural generation models tend to be grammatical (Holtzman et al., 2020). Second, because decoding is conditioned on a specific input x, the conditional probability model p_θ(y | x) is able to assign high probability to outputs y that are plausible outputs (e.g., translations) of the given x. Thus, even though the various y are not constrained to be semantically equivalent to one another, they tend to express similar information because they are at least relevant to the same x. This is why our regularized optimization problem Eq. (9) combines an information-density regularizer with log p_θ(y | x): the term log p_θ(y | x) rewards grammaticality and content relevance, whereas the information-density regularizer encourages the human preferences posited by the UID hypothesis. The parameter λ allows the preferences to be calibrated to perform well on downstream evaluation metrics, such as BLEU and ROUGE.

The UID Bias in Beam Search
It may not be immediately obvious how the UID hypothesis relates to beam search. After all, beam search narrows the scope of the search to only the lowest-surprisal candidates at each time step, which does not obviously lead to a uniform distribution of surprisals in the final decoded sequences. The connection is best seen visually. Fig. 2 shows the time-dependent surprisals u_t under the model for several candidate translations (German to English). Recall that u_t(y) ∈ [0, ∞) and that the standard decoding objective explicitly minimizes the sum of surprisals, i.e., maximizes log-probability. Therefore, the only way the surprisal distribution of a solution can become distinctly non-uniform is when there are several high-surprisal decisions in the mix; we observe this in the orange and red curves. Intuitively, this corresponds to the notion of compensation discussed earlier: a globally optimal decoding scheme may select a high-surprisal step at some point in order to shorten the length of the path or to take a low-surprisal step later on. We observe an extreme example of this behavior above: selecting the EOS token at the first step leads to a very non-uniform distribution, i.e., the degenerate distribution, which violates our operationalization of UID described subsequently. In summary, we see that as λ is decreased, the decoded sentences obey the UID property less strictly. Indeed, setting λ = 0, i.e., exact inference of the MAP objective, results in the empty string.
A number of successful sampling methods (nucleus sampling (Holtzman et al., 2020) and top-k sampling (Fan et al., 2018)) enforce the UID property in generated text by the same logic as above. Both methods eliminate many of the high-surprisal choices at any given decoding step by narrowing the set of tokens that may be chosen.
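The narrowing step these methods share can be sketched as follows. This is an illustrative sketch of the top-k truncation (the function name and toy distribution are ours); nucleus sampling is analogous, but keeps the smallest set of tokens whose cumulative probability exceeds a threshold p.

```python
def narrow_top_k(dist, k):
    """Keep the k most probable (lowest-surprisal) tokens and renormalize,
    as in top-k sampling. High-surprisal tokens can no longer be chosen."""
    kept = dict(sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k])
    z = sum(kept.values())
    return {tok: p / z for tok, p in kept.items()}

out = narrow_top_k({"a": 0.5, "b": 0.3, "c": 0.2}, k=2)
```

After truncation, every remaining choice has bounded surprisal, which keeps sampled sequences closer to the UID property.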

Cognitive Motivation for Beam Search
The goal of this work is to expose a possible inductive bias of beam search. We now state our primary hypothesis:

Hypothesis 4.2. Beam search is a cognitively motivated search heuristic for decoding language generation models. The success of beam search on such tasks is, in part, due to the fact that it inherently biases the search procedure towards text that humans prefer.
The foundation of the argument for this hypothesis follows naturally from the previous sections: First, we demonstrated in §3 that beam search is an exact decoding algorithm for a certain regularized objective-to wit, the one in Eq. (9). Qualitatively, we related the behavior of the regularizer to the UID hypothesis from cognitive science. As a final step, we next provide operationalizations of UID-in the form of regularizers within our regularized decoding framework-through which we can empirically test the validity of this hypothesis.

Generalized UID Decoding
If beam search is trying to optimize for UID, can we beat it at its own game? This section develops a battery of possible sentence-level UID measures, which can be used as regularizers in our regularized decoding framework and compared experimentally on downstream evaluation metrics.
Variance Regularizer. We first consider the variance regularizer from Jain et al. (2018). In essence, UID concerns the distribution of information over the course (i.e., time steps) of a sentence. A natural measure for this is the variance of the surprisals:

    R_var(y) = 1/|y| Σ_{t=1}^{|y|} ( u_t(y) − µ )²    (14)

where µ = 1/|y| Σ_{t=1}^{|y|} u_t(y). This regularizer, in contrast to Eq. (11), is a much more straightforward encoding of UID: it directly operationalizes UID through variance.
Local Consistency. Next we consider a local consistency regularizer, also taken from Jain et al. (2018), that encourages adjacent surprisals to have similar magnitude:

    R_local(y) = 1/|y| Σ_{t=1}^{|y|} ( u_t(y) − u_{t−1}(y) )²    (15)

Again, this is a straightforward encoding of UID: if every surprisal is similar to its neighbor, the distribution will be close to uniform. Note that both of the above regularizers are defined for all decoding steps t > 0 since we define u_0(y_0) = 0, y_0 = BOS for all valid hypotheses.
Max Regularizer. We propose a UID-inspired regularizer of our own design that exploits the nature of MAP decoding, for which the overarching goal is to find a solution with low surprisal. In this setting, one strategy is to penalize decisions that move the distribution away from 0, the lowest possible surprisal. This suggests that

    R_max(y) = max_{t} u_t(y)    (16)

would regularize for UID. Such a regularizer would also directly penalize extreme compensation during decoding (discussed in §3). It is worth noting that this regularizer has a connection to entropy regularization, which can be seen by looking at the formula for Rényi entropy.
Squared Regularizer. Finally, we consider a novel squared penalty that, again, exploits the goal of MAP decoding. If we wish to keep everything uniform, we can try to push all surprisals close to 0, but this time with a squared penalty:

    R_square(y) = Σ_{t=1}^{|y|} u_t(y)²    (17)

Experimentally, we expect to see the following: if encouraging decoded text to exhibit UID is helpful, and our logic in constructing regularizers is sound, all the regularizers (Eqs. (14) to (17)) should lead to roughly the same performance under exact decoding and beam search with large beam widths. Such results would not only validate the connection between UID and high-quality text; comparable performance of optimal beam search and exact search under our regularized objective would provide explicit evidence for our proposed explanation of the inductive bias in beam search.
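The four regularizers operate on the same object, the list of per-step surprisals, so they can be sketched side by side. The normalization constants below follow the presentation above where stated and are otherwise our guess; this is an illustration, not the paper's code.

```python
def r_variance(u):
    """Eq. (14): variance of the per-step surprisals u = [u_1, ..., u_|y|]."""
    mu = sum(u) / len(u)
    return sum((x - mu) ** 2 for x in u) / len(u)

def r_local(u):
    """Eq. (15): squared differences of adjacent surprisals, with u_0 = 0."""
    padded = [0.0] + list(u)
    return sum((padded[t] - padded[t - 1]) ** 2
               for t in range(1, len(padded))) / len(u)

def r_max(u):
    """Eq. (16): penalize the single most surprising step."""
    return max(u)

def r_square(u):
    """Eq. (17): push every surprisal toward zero."""
    return sum(x ** 2 for x in u)

u = [1.0, 1.0, 1.0]  # perfectly uniform surprisals
```

On perfectly uniform surprisals, the variance regularizer vanishes, while the max and squared regularizers still penalize the absolute level of surprisal, which is the distinction drawn in §6.2 when explaining their differing performance.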

Experiments
We explore how encouraging uniform information density in text generated by neural probabilistic text generators affects its downstream quality. To this end, we decode NMT models using the regularized objective (Eq. (9)) with our UID regularizers. We perform exact decoding for a range of λ and observe how text quality (quantified by BLEU (Papineni et al., 2002) using the SacreBLEU (Post, 2018) system) and the distribution of surprisals change. We additionally evaluate our regularizers under the beam search decoding strategy to see if penalizing violations of UID alleviates the text-quality degradation typically seen with increased beam widths. Experiments are performed using models trained on the IWSLT'14 De-En (Cettolo et al., 2012) and WMT'14 En-Fr (Bojar et al., 2014) datasets. For reproducibility, we use the model provided by fairseq for the WMT'14 task; we use the data pre-processing scripts and recommended hyperparameter settings provided by fairseq for training a model on the IWSLT'14 De-En dataset. We use the Newstest'14 dataset as the test set for the WMT'14 model. All model and data information can be found on the fairseq NMT repository.

Exact Decoding
To perform exact decoding of neural probabilistic text generators, we build on the decoding framework of Stahlberg et al. (2017), albeit using Dijkstra's algorithm (Dijkstra, 1959) instead of depth-first search, as we find it decreases decoding time. Note that Dijkstra's algorithm is guaranteed to find the global optimum when path cost is monotonically increasing, which is the case for hypotheses under the scoring scheme used by neural probabilistic text generators (see Meister et al. (2020) for a more detailed discussion). While the variance and local consistency regularizers (Eqs. (14) and (15)) break this monotonicity property, we can still guarantee optimality by using a stopping criterion similar to the one proposed by Yang et al. (2018). Explicitly, we check whether the top-scoring complete hypothesis has a greater score than the maximum possible score of any hypothesis in the queue. All scores are bounded due to the maximum-length criterion. Additionally, we lower-bound each search by the score of the empty string to decrease the memory footprint, i.e., we stop considering hypotheses whose scores (or maximum possible scores, in the case of Eqs. (14) and (15)) drop below that of the empty string at any time step.

Figure 3: BLEU as a function of beam width for various regularizers. We choose λ for each regularizer by best performance on validation sets (see App. B). y-scales are broken to show minimum BLEU values. The x-axis is log-scaled.

Fig. 1 demonstrates how the addition of the greedy UID regularizer (Eq. (11)) to the regularized MAP objective (Eq. (9)) affects characteristics of the global optimum under the model as we vary λ. Notably, increasing the strength of the regularizer appears to alleviate the text-quality degradation seen with exact search, leading to results that approach the BLEU of those generated using optimal beam search. Fig. 1 also shows a strong inverse relationship between BLEU and the average standard deviation (per sentence) of surprisals. We take these observations as empirical validation of Hyp. 4.2.
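The best-first exact search described in this section can be sketched as follows. This is a simplified illustration over a toy model (names and table are ours, not the paper's code): because extending a hypothesis can only lower its log-probability, the first complete hypothesis popped from the priority queue is globally optimal, and a lower bound such as the empty string's score can prune the queue.

```python
import heapq
import math

def exact_decode(step_dists, n_max, lower_bound=-math.inf):
    """Best-first (Dijkstra-style) exact MAP decoding sketch.
    `step_dists`: prefix tuple -> {token: probability}."""
    heap = [(0.0, ("<bos>",))]  # store negated scores: heapq is a min-heap
    while heap:
        neg_score, hyp = heapq.heappop(heap)
        score = -neg_score
        if hyp[-1] == "<eos>":
            return score, hyp  # first complete pop is globally optimal
        if len(hyp) >= n_max or score < lower_bound:
            continue  # pruned: too long, or provably worse than the bound
        for tok, p in step_dists[hyp].items():
            heapq.heappush(heap, (-(score + math.log(p)), hyp + (tok,)))
    return lower_bound, ("<bos>", "<eos>")

TOY = {
    ("<bos>",): {"a": 0.6, "<eos>": 0.4},
    ("<bos>", "a"): {"<eos>": 1.0},
}
best_score, best_hyp = exact_decode(TOY, n_max=4)
```

Adding a non-monotonic regularizer (Eqs. (14) and (15)) breaks the "first complete pop is optimal" guarantee, which is why the stopping criterion above compares the best complete hypothesis against the maximum possible score of anything still in the queue.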

Regularized Beam Search
We next look at how the regularized decoding objective affects text generated using beam search. As previously noted, text quality generally degrades with increased beam size when using the standard MAP objective; this phenomenon is demonstrated in Fig. 3. UID regularization appears to alleviate this problem. Notably, the greedy and squared regularizers aid performance for larger beam sizes more so than the other regularizers, for which we still see a slight drop in performance at larger beam sizes. This drop is negligible compared to the one observed for unregularized beam search, a drop which is also frequently observed for length-normalized decoding (Koehn and Knowles, 2017). While, intuitively, the variance and local consistency regularizers are the purest encodings of UID, they perform the poorest of the regularizers. Arguably, this may be because they do not simultaneously (as the other regularizers do) penalize high surprisal.

Table 1: Performance for beam widths k = 5, k = 10, k = 100, and k = 500.
We additionally decode with a combination of the UID regularizers in tandem. We collectively tune the λ values for the regularizers on validation sets. We report performance in Tab. 1 and see that the results outperform standard and length-normalized (i.e., score divided by sequence length) beam search, with noticeable improvements for larger beams. Search details and parameter settings may be found in App. B. Notably, combining multiple UID regularizers does not lead to as great an increase in performance as one might expect, which hints that a single method for enforcing UID is sufficient for promoting quality in generated text.
Related Work

Neural probabilistic text generators are far from perfect; prior work has shown that they often generate text that is generic (Vinyals and Le, 2015), unnatural (Holtzman et al., 2020), and sometimes even non-existent (Stahlberg and Byrne, 2019). In the context of the degenerate behavior of these models, the beam search curse, the phenomenon whereby using a larger beam size leads to worse performance, has been analyzed by a number of authors (Koehn and Knowles, 2017; Murray and Chiang, 2018; Yang et al., 2018; Stahlberg and Byrne, 2019; Jean et al., 2015; Tu et al., 2016; Cohen and Beck, 2019). Many of these authors attribute the performance drop (as search becomes better) to an inherent bias in neural sequence models to prefer shorter sentences. Other authors have ascribed fault to the model architectures or how they are trained (Cho et al., 2014; Sountsov and Sarawagi, 2016; Vinyals et al., 2017; Ott et al., 2018a; Kumar and Sarawagi, 2019). To remedy the problem, a large number of regularized decoding objectives and modified training techniques have been proposed. In contrast, this work analyzes the behavior of neural text generators from a different angle: we provide a plausible answer, inspired by psycholinguistic theory, as to why beam search (with small beams) leads to high-quality text, rather than another explanation of why exact search performs so badly.

Conclusion
We analyze beam search as a decoding strategy for text generation models by framing it as the solution to an exact decoding problem. We hypothesize that beam search has an inductive bias which can be linked to the promotion of uniform information density (UID), a theory from cognitive science regarding the even distribution of information in linguistic signals. We observe a strong relationship between the variance of surprisals (an operationalization of UID) and BLEU in our experiments with NMT models. With the aim of further exploring decoding strategies for neural text generators in the context of UID, we design a set of objectives to explicitly encourage uniform information density in text generated from neural probabilistic models and find that they alleviate the quality degradation typically seen with increased beam widths.

A Theory
Proof. We prove Theorem 3.1 by induction. We denote the argmax of log p_θ(y | x) − λ · R_greedy(y) as y^R and the solution found by greedy search as y^greedy. We will show that y^greedy_t = y^R_t for all 0 ≤ t ≤ max(|y^R|, |y^greedy|). The theorem holds trivially for the base case of t = 0 because y_0 must be BOS for any valid hypothesis by definition of the hypothesis space (Eq. (1)). Now, by the inductive hypothesis, suppose y^greedy_i = y^R_i for all i < t. We will show that our regularized objective must choose the same word as greedy search at time step t. In the limiting case of Eq. (11), the following function reflects the penalty to the distribution over tokens at position t:

    lim_{λ→∞} λ · ( u_t(y) − min_{y' ∈ V̄} (−log p_θ(y' | x, y_{<t})) )²

Since minimum surprisal implies maximum log-probability, the above function clearly returns either 0 or ∞ depending on whether the decoding choice at time step t is greedy. Therefore, the only choice that does not send the hypothesis score to −∞ is the greedy choice, which implies any feasible solution to our objective must have y^R_t = y^greedy_t. By the principle of induction, y^greedy_t = y^R_t for all 0 ≤ t ≤ |y^R| = |y^greedy|, which in turn implies y^greedy = y^R.