An Empirical Investigation of Beam-Aware Training in Supertagging

Structured prediction is often approached by training a locally normalized model with maximum likelihood and decoding approximately with beam search. This approach leads to mismatches: during training, the model is not exposed to its mistakes and does not use beam search. Beam-aware training aims to address these problems, but unfortunately it is not yet widely used due to a lack of understanding about how it impacts performance, when it is most useful, and whether it is stable. Recently, Negrinho et al. (2018) proposed a meta-algorithm that captures beam-aware training algorithms and suggests new ones, but unfortunately did not provide empirical results. In this paper, we begin an empirical investigation: we train the supertagging model of Vaswani et al. (2016) and a simpler model with instantiations of the meta-algorithm. We explore the influence of various design choices and make recommendations for choosing them. We observe that beam-aware training improves performance for both models, with large improvements for the simpler model, which must effectively manage uncertainty during decoding. Our results suggest that a model must be learned with search to maximize its effectiveness.


Introduction
Structured prediction often relies on models that are trained with maximum likelihood and use beam search for approximate decoding. This procedure leads to two significant mismatches between the training and testing settings: the model is trained on oracle trajectories and therefore does not learn about its own mistakes, and the model is trained without beam search and therefore does not learn how to use the beam effectively for search.
Previous algorithms have addressed one or the other of these mismatches. For example, DAgger (Ross et al., 2011) and scheduled sampling (Bengio et al., 2015) use the learned model to visit non-oracle states at training time, but do not use beam search (i.e., they keep a single hypothesis). Early update (Collins and Roark, 2004), LaSO (Daumé and Marcu, 2005), and BSO (Wiseman and Rush, 2016) are trained with beam search, but do not expose the model to beams without a gold hypothesis (i.e., they either stop or reset to beams with a gold hypothesis).
Recently, Negrinho et al. (2018) proposed a meta-algorithm that instantiates beam-aware algorithms as a result of choices for the surrogate loss (i.e., which training loss to incur at each visited beam) and data collection strategy (i.e., which beams to visit during training). A specific instantiation of their meta-algorithm addresses both mismatches by relying on an insight on how to induce training losses for beams without the gold hypothesis: for any beam, its lowest-cost neighbor should be scored sufficiently high to be kept in the successor beam. To induce these training losses, it is sufficient to be able to compute the best neighbor of any state (often called a dynamic oracle (Goldberg and Nivre, 2012)). Unfortunately, Negrinho et al. (2018) do not provide empirical results, leaving open questions such as whether instances can be trained robustly, when beam-aware training is most useful, and what the impact of the design choices is on performance.
Contributions We empirically study beam-aware algorithms instantiated through the meta-algorithm of Negrinho et al. (2018). We tackle supertagging as it is a sequence labelling task with an easy-to-compute dynamic oracle and a moderately-sized label set (approximately 1000 labels), which may require more effective search. We examine two supertagging models (one from Vaswani et al. (2016) and a simplified version designed to be heavily reliant on search) and train them with instantiations of the meta-algorithm. We explore how design choices influence performance, and give recommendations based on our empirical findings. For example, we find that perceptron losses perform consistently worse than margin and log losses. We observe that beam-aware training can have a large impact on performance, particularly when the model must use the beam to manage uncertainty during prediction. Code for reproducing all results in this paper is available at https://github.com/negrinho/beam_learn_supertagging.

Background on learning to search and beam-aware methods
For convenience, we reuse notation introduced in Negrinho et al. (2018) to describe their meta-algorithm and its components (e.g., scoring function, surrogate loss, and data collection strategy). See Figure 1 and Figure 2 for an overview of the notation. When relevant, we instantiate the notation for left-to-right sequence labelling under the Hamming cost, of which supertagging is a special case.
Input and output spaces Given an input structure x ∈ X, the output structure y ∈ Y_x is generated through a sequence of incremental decisions.
An example x ∈ X induces a tree G_x = (V_x, E_x) encoding the sequential generation of elements of Y_x, where V_x is the set of nodes and E_x is the set of edges. The leaves of G_x correspond to elements of Y_x and the internal nodes correspond to incomplete outputs. For left-to-right sequence labelling, for a sequence x ∈ X, each decision assigns a label to the current position of x and the nodes of the tree encode labelled prefixes of x, with terminal nodes encoding complete labellings of x.
Cost functions Given a golden pair (x, y) ∈ X × Y, the cost function c_{x,y}: Y_x → ℝ measures how bad a prediction ŷ ∈ Y_x is relative to the target output structure y ∈ Y_x. Using c_{x,y}: Y_x → ℝ, we define a cost function c*_{x,y}: V_x → ℝ for partial outputs by assigning to each node v ∈ V_x the cost of its best reachable complete output, i.e., c*_{x,y}(v) = min_{ŷ ∈ Y_v} c_{x,y}(ŷ), where Y_v is the set of complete outputs reachable from v. For a left-to-right search space for sequence labelling, if c_{x,y}: Y_x → ℝ is the Hamming cost, the optimal completion cost c*_{x,y}: V_x → ℝ of a node is the number of mistakes in its prefix, as the optimal completion matches the remaining suffix of the target output.
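For concreteness, the Hamming-cost case can be sketched in a few lines (function names are ours, not from the paper's code):

```python
def hamming_cost(pred, gold):
    # Complete-output Hamming cost: the number of incorrectly labelled positions.
    assert len(pred) == len(gold)
    return sum(p != g for p, g in zip(pred, gold))

def optimal_completion_cost(prefix, gold):
    # Optimal completion cost c* of a labelled prefix: the mistakes already
    # made, since the best completion copies the remaining gold suffix.
    return sum(p != g for p, g in zip(prefix, gold))
```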
Dynamic oracles An oracle state is one from which the target output structure can still be reached. Often, optimal actions can only be computed easily for oracle states; dynamic oracles compute optimal actions even for non-oracle states. Being able to evaluate c*_{x,y}: V_x → ℝ for arbitrary states allows us to induce a dynamic oracle: at a state v ∈ V_x, the optimal action is to transition to the neighbor v′ ∈ N_v with the lowest optimal completion cost. For sequence labelling, this picks the transition that assigns the correct label. For other tasks and metrics, more complex dynamic oracles may exist, e.g., in dependency parsing (Goldberg and Nivre, 2012, 2013). For notational brevity, from now on we omit the dependency of the search spaces and cost functions on x ∈ X, y ∈ Y, or both.
Beam search space Given a search space G = (V, E), the beam search space G_k = (V_k, E_k) is induced by choosing a beam size k ∈ ℕ and a strategy for generating the successor beam out of the current beam and its neighbors. In this paper, we expand all the elements in the beam and score the neighbors simultaneously. The k highest-scoring neighbors are used to form the successor beam. For k = 1, we recover the greedy search space G.
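The successor-beam construction described above is a joint top-k over all neighbors; a minimal sketch for sequence labelling (the names and the callable `score` are our own):

```python
def successor_beam(beam, labels, score, k):
    # Expand every hypothesis in the beam by one label, score all neighbors
    # jointly, and keep the k highest-scoring ones as the successor beam.
    neighbors = [prefix + (label,) for prefix in beam for label in labels]
    neighbors.sort(key=score, reverse=True)
    return neighbors[:k]
```

For k = 1 this reduces to greedy decoding.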
Beam cost functions The natural cost function c*: V_k → ℝ for G_k is created from the element-wise cost function on G and assigns to each beam the cost of its best element, i.e., c*(b) = min_{v ∈ b} c*(v). For a transition from beam b to beam b′, where b′ is formed from the neighbors of the elements in b, the transition cost is c(b, b′) = c*(b′) − c*(b). A cost increase happens when c(b, b′) > 0, i.e., the best complete output reachable in b is no longer reachable in b′.
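The beam-level cost and the cost-increase check follow directly (a small illustration with our own names, where `cost` returns the optimal completion cost of a hypothesis):

```python
def beam_cost(beam, cost):
    # c*(b): the cost of the best hypothesis in the beam.
    return min(cost(v) for v in beam)

def cost_increase(b, b_next, cost):
    # Positive iff the best output reachable in b is no longer reachable in b_next.
    return beam_cost(b_next, cost) - beam_cost(b, cost)
```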
Policies Policies operate in the beam search space G_k and are induced through a learned scoring function s(·, θ): V → ℝ which scores elements of the original space G. A policy π: V_k → Δ(V_k) maps states (i.e., beams) to distributions over next states. We only use deterministic policies, where the successor beam is computed by sorting the neighbors A_b of the current beam b in decreasing order of score and taking the top k: writing v̂(i) for the i-th highest-scoring neighbor, so that v̂(1) is the highest-scoring one, the successor beam b′ keeps the neighbor states in A_b with the highest scores according to s, or equivalently the highest ranks according to σ̂. Transitions between beams are sampled according to a data collection policy.

Scoring function In the non-beam-aware case, the scoring function arises from the way probabilities of complete sequences are computed with the locally normalized model, namely p(y | x, θ) = ∏_{j=1}^h p(y_j | y_{1:j−1}, x, θ), where we assume that all outputs for x ∈ X have h steps. For sequence labelling, h is the length of the sentence. The resulting scoring function is s(y_{1:j}, θ) = ∑_{i=1}^j log p(y_i | y_{1:i−1}, x, θ), where the dependency on x has been omitted. Similarly, the scoring function that we learn in the beam-aware case is s(v, θ) = ∑_{i=1}^j s̃(y_{1:i}, θ), where x has been omitted, v = y_{1:j}, and s̃(·, θ): V → ℝ is the learned incremental scoring function. In Section 4.6, we observe that this cumulative version performs uniformly better than the non-cumulative one.

Meta-algorithm for learning beam search policies
We refer the reader to Negrinho et al. (2018) for a discussion of how specific choices for the metaalgorithm recover algorithms from the literature.

Data collection strategies
The data collection strategy determines which beams are visited at training time (see Figure 2).
Strategies that use the learned model differ in how they compute the successor beam b′ ∈ N_b when s(·, θ) leads to a beam without the gold hypothesis. We explore several data collection strategies:

stop If the successor beam does not contain the gold hypothesis, stop collecting the trajectory. Structured perceptron training with early update (Collins and Roark, 2004) uses this strategy.
reset If the successor beam does not contain the gold hypothesis, reset to a beam with only the gold hypothesis. LaSO (Daumé and Marcu, 2005) uses this strategy. For k = 1, we recover teacher forcing, as only the oracle hypothesis is kept in the beam.
continue Ignore cost increases, always using the successor beam. DAgger (Ross et al., 2011) takes this approach, but does not use beam search. Negrinho et al. (2018) suggest this strategy for beam-aware training but do not provide empirical results.
reset (multiple) Similar to reset, but keep k − 1 hypotheses from the transition, i.e., b′ = {v*(1), v̂(1), . . . , v̂(k−1)}. We might expect this data collection strategy to behave closer to continue, as a large fraction of the elements of the successor beam are induced by the learned model.
oracle Always transition to the beam induced by σ*: [n] → [n], i.e., the one obtained by sorting the neighbors by cost in increasing order. For k = 1, this recovers teacher forcing. In Section 4.4, we observe that oracle dramatically degrades performance as k increases, due to increased exposure bias.
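The five strategies above differ only in how the next training beam is chosen from the scored neighbors. The following schematic captures them under our own naming (neighbors are (hypothesis, score, cost) triples, and a zero-cost neighbor is a gold one); it is a sketch, not the paper's implementation:

```python
def next_training_beam(neighbors, k, strategy):
    # neighbors: (hypothesis, score, cost) triples for all neighbors of the
    # current beam, where cost is the optimal completion cost.
    by_score = sorted(neighbors, key=lambda n: n[1], reverse=True)
    by_cost = sorted(neighbors, key=lambda n: n[2])
    model_beam = by_score[:k]
    has_gold = any(cost == 0 for _, _, cost in model_beam)
    if strategy == "continue":
        return model_beam                      # always follow the model
    if strategy == "oracle":
        return by_cost[:k]                     # always follow the oracle
    if has_gold:
        return model_beam                      # no gold drop: follow the model
    if strategy == "stop":
        return None                            # truncate the trajectory
    if strategy == "reset":
        return by_cost[:1]                     # beam with only the gold hypothesis
    if strategy == "reset_multiple":
        return by_cost[:1] + by_score[:k - 1]  # gold plus k-1 model hypotheses
    raise ValueError(strategy)
```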

Surrogate losses
Surrogate losses encode that the scores produced by the model must rank the best neighbor sufficiently high for it to be kept comfortably in the successor beam. For k = 1, many of these losses reduce to losses used in non-beam-aware training. Given scores s ∈ ℝⁿ and costs c ∈ ℝⁿ over the neighbors in A_b, let σ̂ and σ* be permutations that sort the elements of A_b in decreasing order of score and increasing order of cost, respectively, i.e., s_{σ̂(1)} ≥ . . . ≥ s_{σ̂(n)} and c_{σ*(1)} ≤ . . . ≤ c_{σ*(n)}. See Figure 1 for a description of the notation used to describe surrogate losses. Our experiments compare the following surrogate losses:

perceptron (first) Penalize failing to score the best neighbor at the top of the beam (regardless of whether it falls out of the beam or not).
perceptron (last) Penalize ℓ(s, c) = max(0, s_{σ̂(k)} − s_{σ*(1)}). If this loss is positive at a beam, the successor beam induced by the scores does not contain the golden hypothesis.

margin (last) Penalize margin violations of the best neighbor against the hypothesis last in the beam, i.e., compare the score s_{σ*(1)} of the correct neighbor with that of the neighbor σ̂(k) last in the beam.
log loss (neighbors) Normalize over all elements in A_b. For beam size k = 1, it reduces to the usual log loss.
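Under the sorting notation above, these losses are each a few lines. The sketch below uses our own names and should be read as illustrative; in particular, the margin form assumes a margin of 1:

```python
import math

def sort_perms(scores, costs):
    # sigma_hat sorts indices by decreasing score; sigma_star by increasing cost.
    sigma_hat = sorted(range(len(scores)), key=lambda i: -scores[i])
    sigma_star = sorted(range(len(costs)), key=lambda i: costs[i])
    return sigma_hat, sigma_star

def perceptron_last(scores, costs, k):
    # Positive iff the best neighbor would fall out of the top-k successor beam.
    sh, ss = sort_perms(scores, costs)
    return max(0.0, scores[sh[k - 1]] - scores[ss[0]])

def margin_last(scores, costs, k):
    # Require the best neighbor to beat the last in-beam neighbor by a margin of 1.
    sh, ss = sort_perms(scores, costs)
    return max(0.0, 1.0 + scores[sh[k - 1]] - scores[ss[0]])

def log_loss_neighbors(scores, costs):
    # Negative log-probability of the best neighbor, normalized over all of A_b.
    _, ss = sort_perms(scores, costs)
    logz = math.log(sum(math.exp(s) for s in scores))
    return logz - scores[ss[0]]
```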

Training
The meta-algorithm of Negrinho et al. (2018) is instantiated by choosing a surrogate loss, data collection strategy, and beam size. Training proceeds by sampling an example (x, y) 2 X ⇥ Y from the training set. A trajectory through the beam search space G k is collected using the chosen data collection strategy. A surrogate loss is induced at each non-terminal beam in the trajectory (see Figure 2). Parameter updates are computed based on the gradient of the sum of the losses of the visited beams.
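The training loop for one example can be sketched abstractly; `collect_trajectory`, `surrogate_loss`, and `update` are placeholders for the chosen data collection strategy, surrogate loss, and optimizer step:

```python
def train_example(x, y, collect_trajectory, surrogate_loss, update):
    # One meta-algorithm step: roll out a trajectory of beams with the chosen
    # data collection strategy, incur a surrogate loss at every non-terminal
    # beam, and update parameters on the gradient of the summed losses.
    beams = collect_trajectory(x, y)
    total_loss = sum(surrogate_loss(b, x, y) for b in beams[:-1])
    update(total_loss)  # in practice: total_loss.backward(); optimizer.step()
    return total_loss
```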

Experiments
We explore different configurations of the design choices of the meta-algorithm to understand their impact on training behavior and performance.

Task details
We train our models for supertagging, a sequence labelling task where accuracy is the performance metric of interest. Supertagging is a good task for exploring beam-aware training as, contrary to other sequence labelling tasks such as named-entity recognition (Tjong Kim Sang and De Meulder, 2003), chunking (Tjong Kim Sang and Buchholz, 2000), and part-of-speech tagging (Marcus et al., 1993), it has a moderate number of labels and is therefore likely to require effective search to achieve high performance. We used the standard splits for CCGbank (Hockenmaier and Steedman, 2007): the training and development sets have, respectively, 39604 and 1913 examples. Models were trained on the training set, and the development set was used to compute validation accuracy at the end of each epoch to keep the best model. As we are performing an empirical study, similarly to Vaswani et al. (2016), we report validation accuracies. Each configuration is run three times with different random seeds, and the mean and standard deviation are reported. We replace the words that appear at most once in the training set by UNK. No such replacement is done for the supertags.
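The rare-word preprocessing is a standard frequency cutoff; a sketch (assuming the training set is given as a list of tokenized sentences):

```python
from collections import Counter

def unk_rare_words(sentences, min_count=2, unk="UNK"):
    # Replace words appearing fewer than min_count times in the training set
    # (i.e., at most once for min_count=2) with the UNK token.
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= min_count else unk for w in sent]
            for sent in sentences]
```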

Model details
We have implemented the model of Vaswani et al. (2016) and a simpler model designed by removing some of its components. The two main differences between our implementation and theirs are that we do not use pretrained embeddings (we train the embeddings from scratch) and we use the gold POS tags (they use only the pretrained word embeddings).
Main model For the model of Vaswani et al. (2016) (see Figure 3, left), we use 64, 16, and 64 for the dimensions of the word, part-of-speech, and supertag embeddings, respectively. All LSTMs (forward, backward, and LM) have hidden dimension 256. We refer the reader to Vaswani et al. (2016) for the exact description of the model. Briefly, embeddings for the words and part-of-speech tags are concatenated and fed to a bi-directional LSTM; the outputs of both directions are then fed into a combiner (dimension-preserving linear transformations applied independently to both inputs, added together, and passed through a ReLU non-linearity). The output of the combiner and the output of the LM LSTM (which tracks the supertag prefix up to a prediction point) are then passed to another combiner that generates scores over supertags.
Simplified model We also consider a simplified model that drops the bi-LSTM encoder and the corresponding combiner (see Figure 3, right). The concatenated embeddings are fed directly into the second combiner with the LM LSTM output. Values for the hyperparameters are the same when possible. This model must leverage the beam effectively as it does not encode the sentence with a bi-LSTM. Instead, only the embeddings for the current position are available, giving a larger role to the LM LSTM over supertags. While supertagging can be tackled with a stronger model, this restriction is relevant for real-time tasks, e.g., the complete input might not be known upfront.
Training details Models are trained for 16 epochs with SGD with batch size 1 and a cosine learning rate schedule (Loshchilov and Hutter, 2016), starting at 10^{-1} and ending at 10^{-5}. No weight decay or dropout was used. Training examples are shuffled after each epoch. Results are reported for the model with the best validation performance across all epochs. We use 16 epochs for all models for simplicity and fairness. This number was sufficient, e.g., we replicated Table 2 by training with 32 epochs and observed minor performance differences (see Table 6).

Table 1: Development accuracies for models trained with different data collection strategies in a non-beam-aware way (i.e., k = 1) and decoded with beam search with varying beam size. continue performs best, showing the importance of exposing the model to its mistakes. Differences are larger for the simplified model.
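The schedule can be sketched as plain cosine annealing between the two endpoints (this mirrors Loshchilov and Hutter (2016) without warm restarts; the exact step granularity in our implementation may differ):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-1, lr_min=1e-5):
    # Cosine annealing from lr_max at step 0 down to lr_min at the final step.
    frac = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * frac))
```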

Non-beam-aware training
We first train the models with k = 1 and then use beam search to decode. Crucially, the model does not train with a beam and therefore does not learn to use it effectively. We vary the data collection strategy. The results are presented in Table 1 and should be used as a reference when reading the other tables to evaluate the impact of beam-aware training. Tables are formatted such that the first and second horizontal halves contain the results for the main model and simplified model, respectively. Each position contains the mean and the standard deviation of running that configuration three times. We use this format in all tables presented. The continue data collection strategy (i.e., DAgger for k = 1) results in better models than training on the oracle trajectories. Beam search results in small gains for these settings. In this experiment, training with oracle is the same as training with reset as the beam always contains only the oracle hypothesis. The performance differences are small for the main model but much larger for the simplified model, underscoring the importance of beam search when there is greater uncertainty about predictions. For the stronger model, the encoding of the left and right contexts with the bi-LSTM provides enough information at each position to predict greedily, i.e., without search. This difference appears consistently in all experiments, with larger gains for the weaker model.
The gains achieved by the main model by decoding with beam search post-training are very small (from 0.02 to 0.05). This suggests that training the model in a non-beam-aware fashion and then using beam search does not guarantee improvements; the model must be learned with search to improve on these results. For the simpler model, larger improvements are observed (from 0.42 to 4.34). Despite the gains with beam search for reset and stop, they are not sufficient to beat the greedy model trained on its own trajectories, yielding 81.99 for continue with k = 1 versus 77.54 for oracle and 77.82 for reset, both with k = 8. These results show the importance of the data collection strategy, even when the model is not trained in a beam-aware fashion. These gains are eclipsed by beam-aware training, namely, compare Table 1 with Table 2. See Figure 4 for the evolution of the validation and training accuracies with epochs.

Comparing data collection strategies
We train both models using the log loss (neighbors), described in Section 3.2, and vary the data collection strategy, described in Section 3.1, and beam size. Results are presented in Table 2. Contrary to Section 4.3, these models are trained to use beam search rather than it being an artifact of approximate decoding. Beam-aware training under oracle worsens performance with increasing beam size (due to increasing exposure bias).
During training, the model learns to pick the best neighbors for beams containing only close-to-optimal hypotheses, which are likely very different from the beams encountered when decoding. The results for the simplified model are similar: with increasing beam size, performance first improves but then degrades. For the main model, we observe modest but consistent improvements with larger beam sizes across all data collection strategies except oracle. By comparing the results with those in the first row of Table 1, we see that we improve on the model trained with maximum likelihood and decoded with beam search. The data collection strategy has a larger impact on performance for the simplified model. continue achieves the best performance. Compare these performances with those for the simplified model in Table 1. For larger beams, the improvements achieved by beam-aware training are much larger than those achieved by non-beam-aware training. For example, 92.69 versus 82.41 for continue with k = 8, where in the first case the model is trained in a beam-aware manner (k = 8 for both training and decoding), while in the second case beam search is used only during decoding (k = 1 during training but k = 8 during decoding).

Figure 4: Validation and training accuracies for non-beam-aware training (i.e., k = 1) with different data collection strategies for the main (left half) and simplified (right half) models. continue achieves higher accuracies.

Figure 5: Validation and training accuracies for beam-aware training with different data collection strategies and beam sizes for the main (left half) and simplified (right half) models. Larger beam sizes achieve higher performances while overfitting less, and are crucial for the simplified model to achieve higher training and validation accuracies. For smaller beams, continue performs better than reset. All models can be trained stably from scratch. Three runs were aggregated by showing the mean and the standard deviation for each epoch.
This shows the importance of training with beam search and exposing the model to its mistakes. Without beam-aware training, the model is unable to learn to use the beam effectively. Check Figure 5 for the evolution of the training and validation accuracies with training epoch for beam-aware training.

Comparing surrogate losses
We train both models with continue and vary the surrogate loss and beam size. Results are presented in Table 3. Perceptron losses (e.g., perceptron (first) and perceptron (last)) performed worse than their margin-based counterparts (e.g., margin (last) and cost-sensitive margin (last)). log loss (beam) yields poor performance for small beam sizes (e.g., k = 1 and k = 2). This is expected due to small contrastive sets (i.e., at most k + 1 elements are used in log loss (beam)). For larger beams, the results are comparable to log loss (neighbors).

Additional design choices
Score accumulation The scoring function was introduced as a sum of prefix terms. A natural alternative is to produce the score for a neighbor without adding it to a running sum, i.e., s(y_{1:j}, θ) = s̃(y_{1:j}, θ) rather than s(y_{1:j}, θ) = ∑_{i=1}^j s̃(y_{1:i}, θ). Surprisingly, score accumulation performs uniformly better across all configurations. Without it, for the main model, beam-aware training degraded performance with increasing beam size; for the simplified model, beam-aware training improved on the results in Table 1, but the gains were smaller than those with score accumulation. We observed that the LM LSTM failed to keep track of differences earlier in the supertag sequence, leading to similar scores over the neighbors. Accumulating the scores is a simple memory mechanism that does not require the LM LSTM to learn to propagate long-range information. This performance gap may not exist for models that access information more directly (e.g., transformers (Vaswani et al., 2017) and other attention-based models (Bahdanau et al., 2014)). See Table 4 in the appendix, which compares configurations with and without score accumulation. Performance differences range from 1 to 5 absolute percentage points.
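The two variants differ only in whether prefix scores are summed; a small illustration with our own names, where `incremental_score` plays the role of s̃:

```python
def cumulative_score(prefix, incremental_score):
    # s(y_{1:j}) = sum over i <= j of s~(y_{1:i}): accumulate incremental
    # scores over all prefixes, instead of scoring only the full prefix once.
    return sum(incremental_score(prefix[:i + 1]) for i in range(len(prefix)))
```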
Update on all beams The meta-algorithm of Negrinho et al. (2018) suggests inducing losses on every visited beam as there is always a correct action captured by appropriately scoring the neighbors. This leads to updating the parameters on every beam. By contrast, other beam-aware work updates only on beams where the transition leads to increased cost (e.g., Daumé and Marcu (2005) and Andor et al. (2016)). We observe that always updating leads to improved performance, similar to the results in Table 3 for perceptron losses. We therefore recommend inducing losses on every visited beam. See the appendix for Table 5, which compares configurations trained with and without updating on every beam.

Related work
Related work uses either imitation learning (often called learning to search when applied to structured prediction) or beam-aware training. Learning to search (Daumé et al., 2009; Chang et al., 2015; Goldberg and Nivre, 2012; Bengio et al., 2015; Negrinho et al., 2018) is a popular approach for structured prediction. This literature is closely related to imitation learning (Ross and Bagnell, 2010; Ross et al., 2011; Ross and Bagnell, 2014). Ross et al. (2011) address exposure bias by collecting data with the learned policy at training time. Collins and Roark (2004) propose a structured perceptron variant that trains with beam search, updating the model parameters when the correct hypothesis falls out of the beam. Huang et al. (2012) introduce a theoretical framework to analyze the convergence of early update. Zhang and Clark (2008) develop a beam-aware algorithm for dependency parsing that uses early update. Goldberg and Nivre (2012, 2013) introduce dynamic oracles for dependency parsing. Ballesteros et al. (2016) observe that exposing the model to mistakes during training improves a dependency parser. Bengio et al. (2015) make a similar observation and present results on image captioning, constituency parsing, and speech recognition. Beam-aware training has also been used for speech recognition (Collobert et al., 2019; Baskar et al., 2019). Andor et al. (2016) propose an early-update-style algorithm for learning models with a beam, but use a log loss rather than a perceptron loss as in Collins and Roark (2004); parameters are updated when the golden hypothesis falls out of the beam or when the model terminates with the golden hypothesis in the beam. Wiseman and Rush (2016) use a similar algorithm to Andor et al. (2016), but with a margin-based loss, resetting to a beam with the golden hypothesis when it falls out of the beam. Edunov et al. (2017) use beam search to find a contrastive set to define sequence-level losses. Goyal et al. (2018, 2019) propose a beam-aware training algorithm that relies on a continuous approximation of beam search. Negrinho et al. (2018) introduce a meta-algorithm that instantiates beam-aware algorithms based on choices for beam size, surrogate loss function, and data collection strategy, and propose a DAgger-like algorithm for beam search.

Conclusions
Maximum likelihood training of locally normalized models with beam search decoding is the default approach for structured prediction. Unfortunately, it suffers from exposure bias and does not learn to use the beam effectively. Beam-aware training promises to address some of these issues, but is not yet widely used due to being poorly understood. In this work, we explored instantiations of the meta-algorithm of Negrinho et al. (2018) to understand how design choices affect performance. We show that beam-aware training is most useful when substantial uncertainty must be managed during prediction. We make recommendations for instantiating beam-aware algorithms based on the meta-algorithm, such as inducing losses at every beam, using log losses (rather than perceptron-style ones), and preferring the continue data collection strategy (or reset if necessary). We hope that this work provides evidence that beam-aware training can greatly impact performance and can be trained stably, leading to its wider adoption.