TeaForN: Teacher-Forcing with N-grams

Sequence generation models trained with teacher-forcing suffer from issues related to exposure bias and lack of differentiability across timesteps. Our proposed method, Teacher-Forcing with N-grams (TeaForN), addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model parameter updates based on N prediction steps. TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts generation quality on one Machine Translation benchmark, WMT 2014 English-French, and two News Summarization benchmarks, CNN/Dailymail and Gigaword.


Introduction
Many state-of-the-art sequence generation models are trained using a technique called teacher-forcing (Goodfellow et al., 2016). Teacher-forcing is popular because it improves sample efficiency and provides training stability, but models trained with teacher-forcing are known to suffer from issues such as exposure bias (Venkatraman et al., 2015; Bengio et al., 2015; Ding and Soricut, 2017) and a lack of differentiability across timesteps (i.e., training updates made when decoding at time-step t cannot fully propagate to time-step t − 1). Previous attempts to address these issues include scheduled sampling (Bengio et al., 2015), parallel N-gram prediction (Yan et al., 2020), and sampling from previous predictions.
Our proposed method, Teacher-Forcing with N-grams (TeaForN), imposes few requirements on the decoder architecture and does not require curriculum learning or sampling model outputs. TeaForN fully embraces the teacher-forcing paradigm and extends it to N-grams, thereby addressing the problem at the level of teacher-forcing itself.
The advent of large-scale pretraining has pushed the state-of-the-art on Natural Language benchmarks to impressive heights, often showing gains across many tasks at once (Devlin et al., 2019; Raffel et al., 2019). A negative consequence of this is the tendency towards large, data-hungry models, which harm energy-consumption and accessibility (Strubell et al., 2019), and increase latency and production costs. As such, it is of increasing importance to develop techniques that counteract these tendencies. While TeaForN does increase training cost moderately, it can be used to drive down latency and inference cost, which dominate the overall cost of a production model.
Many sequence generation models use beam search to improve generation quality (Vaswani et al., 2017;Raffel et al., 2019;Yan et al., 2020). In contrast with greedy decoding, beam search keeps the k most-likely candidates at each decoding timestep. While beam search has proven to be a reliable technique for improving output quality, previous work has shown that beam search actually degrades performance for sufficiently large k (Koehn and Knowles, 2017). In addition, the inference cost of a model increases linearly with k, due to the need for multiple decodings. We conduct an analysis of the effect of beam size on models trained both with and without TeaForN. We show that models trained with TeaForN require a smaller beam size to reach similar performance, a property that can achieve significant cost-savings.
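To make the cost/quality trade-off concrete, here is a minimal beam search sketch over a toy scoring function (the function names and toy probabilities are ours, not from the paper); note that each decoding step does roughly k times the work of greedy decoding:

```python
import math

def beam_search(step_logprobs, k, max_len):
    """Minimal beam search. step_logprobs(prefix) returns a dict mapping
    each candidate next token to its log-probability (a toy stand-in for
    a real autoregressive decoder)."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        # keep only the k most-likely candidates at each timestep
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Toy bigram model where greedy decoding (k=1) commits to a locally best
# first token that leads to a worse overall sequence than beam search finds.
def toy_model(prefix):
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"a": math.log(0.5), "b": math.log(0.5)}
    return {"a": math.log(0.9), "b": math.log(0.1)}
```

With this toy model, greedy decoding starts with "a" (probability .6), but the best length-2 sequence is ("b", "a") with probability .36 versus .30, which beam search with k = 2 recovers.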
Our experiments show that TeaForN can boost performance on both Machine Translation and News Summarization tasks, provided there is sufficient model capacity. With TeaForN, Transformer big (Vaswani et al., 2017) improves by +.5 SacreBLEU (Post, 2018) on the WMT14 En-Fr benchmark with beam search and +.3 without. When using TeaForN for summarization, PEGASUS large improves by +.3 ROUGE-L on the Gigaword benchmark (Rush et al., 2015) and by +.2 on the CNN/Dailymail benchmark (Hermann et al., 2015). Further, PEGASUS large trained with TeaForN matches the prior ROUGE-L scores on these benchmarks without beam search, representing an 8x reduction in decoder inference cost.

Related Work
One of the standard approaches to sequence-learning training is Maximum-likelihood Estimation (MLE). Although widely used in a large array of applications, MLE for sequence learning suffers from the exposure-bias problem (Venkatraman et al., 2015; Ranzato et al., 2015). Exposure bias produces brittle models due to training procedures during which the models are only exposed to their training data distribution, but not to their own predictions. Possible solutions to the exposure-bias problem in neural-network settings have used "data as demonstrator" (Venkatraman et al., 2015) and "scheduled sampling" (Bengio et al., 2015) approaches. Although they improve model performance in practice, such proposals have been shown to be statistically inconsistent (Huszar, 2015), and still need to perform MLE-based warm-start training, rendering such solutions unsatisfactory. Along similar lines, the "professor forcing" (Lamb et al., 2016) method uses adversarial domain adaptation to encourage network dynamics to be the same during training and inference, though it requires sampling sequences during training.
A different approach, based on reinforcement learning methods, achieves sequence learning following a policy-gradient (PG) method (Sutton et al., 1999). It directly attacks the exposure-bias problem by having the training models exposed exclusively to their own predictions while scoring them using reward functions. However, this approach introduces another issue, related to the large discrepancy between the model prediction distribution and the reward function's values, which is especially acute during the early training stages when the predicted outputs are all equally bad. As a result, this method also requires a warm-start phase in which the model distribution achieves some local maximum with respect to a reward-free objective (e.g., MLE), followed by a model refinement phase in which reward-based PG updates are used to refine the model (Ranzato et al., 2015;Wu et al., 2016;Liu et al., 2017). Although such combinations achieve better results in practice compared to pure likelihood-based approaches, they are unsatisfactory from a theoretical and modeling perspective, as well as inefficient from a speed-to-convergence perspective. A pure PG formulation that side-steps these issues is (Ding and Soricut, 2017), which allows for both cold-start training as well as more efficient convergence properties.
The PG-based approaches have an inherent complexity that stems from the use of discrete, non-differentiable reward functions such as ROUGE (Lin, 2004) or CIDEr (Vedantam et al., 2015), which forfeit the advantage of sample efficiency, as they often cannot be efficiently computed using current accelerators like TPUs. MLE-based approaches appear to be favored due to their efficiency properties, and the search for training methods that produce less brittle models is still ongoing.
Another closely related idea is End-to-End Backprop (E2E) (Ranzato et al., 2015), which has a similar goal of naturally approximating sequence level training by propagating smooth model predictions instead of groundtruth inputs. TeaForN differs from E2E in several key ways. First, TeaForN learns jointly from both groundtruth and model predictions as inputs throughout the entire training duration, whereas E2E requires a training schedule to transition from groundtruths to model predictions. Second, TeaForN supports methods other than k-max for computing smooth model predictions, two of which we explore as a part of our work. Third, we introduce the notion of a discount factor, which weights the importance of immediate predictions higher than that of future predictions.
Another such work is (Yan et al., 2020), which proposes a modified Transformer for parallel N-gram prediction. While their work does address the issue of strong local correlations caused by teacher-forcing, it does not address exposure bias, as it always trains on groundtruth inputs.
Also related are models such as the one proposed by (Strubell et al., 2017), which uses a stack of dilated convolutions to iteratively refine model predictions. Though architecturally similar, TeaForN uses the stack only at training time and solves a fundamentally different problem.
Our TeaForN method maintains the efficiency advantages of MLE-based approaches, while addressing both exposure bias and the issue of differentiability across timesteps. In addition, it is general enough to be used on a wide class of autoregressive decoders, including RNN (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and Transformer (Vaswani et al., 2017) decoders, though our experiments focus on the Transformer.

TeaForN
Autoregressive sequence decoders are trained to minimize the negative log likelihood of the groundtruth tokens y_gt^(t). During training, previous groundtruth tokens are used as decoder inputs for predicting the next token. We define the embedding matrix to be E, of size V × D, where V is the vocabulary size and D is the embedding size, and the embedding of the groundtruth token as

x^(t) = E[y_gt^(t)].

The class probability distribution P^(t) is typically modeled as softmax-normalized logits, which are a linear projection of the decoder output o^(t) of size D onto the class embeddings, using an output projection matrix W of size V × D:

P^(t) = softmax(W o^(t)).

To reduce the model parameter size, it is standard to share the parameters of the output projection matrix and the embedding matrix, such that E = W.
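As a small illustration of the shared-parameter setup above (a NumPy sketch with toy sizes; the variable and function names are ours):

```python
import numpy as np

V, D = 8, 4                      # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))      # embedding matrix E, shape V x D
W = E                            # weight sharing: output projection W = E

def embed(token_id):
    """x^(t) = E[y_gt^(t)]: look up the groundtruth token's embedding."""
    return E[token_id]

def class_probs(o_t):
    """P^(t) = softmax(W o^(t)): project the decoder output onto the
    class embeddings and normalize into a distribution over V classes."""
    logits = W @ o_t
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()
```

Because W = E, no extra parameters are introduced by the output layer; the logit for a class is simply the dot product between the decoder output and that class's embedding.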
During inference, groundtruth tokens become unavailable. Therefore, previous tokens from the model predictions are used as decoder inputs for decoding the next token. The discrepancy between training time and inference time input distributions causes models to suffer from exposure bias, meaning that they do not learn to correct for past decoding errors (Bengio et al., 2015).
TeaForN addresses exposure bias by learning jointly how to predict from both groundtruth and past model predictions as inputs. TeaForN setups consist of a stack of N decoders, as illustrated in Figure 1. At position t, the first decoder (Decoder-0) takes as input the embedding of the previous groundtruth token x^(t−1), while each subsequent decoder takes as input the output of the decoder below it. More formally, let us use subscript s ∈ [0, N) to denote the offset within the decoder stack. We define the input to decoder s at time t as:

x_s^(t) = x^(t−1) + pos(t)          if s = 0
x_s^(t) = o_{s−1}^(t) + pos(t + s)  if s > 0

where pos(t + s) is a timing signal that is added to the inputs for models such as the Transformer (Vaswani et al., 2017). This term may be omitted for models that do not expect it. The training loss of Decoder-s at time t is the negative log likelihood of the (t + s)-th element in the groundtruth sequence:

L_s^(t) = −log P_s^(t)[y_gt^(t+s)]

and the total TeaForN training loss is the discounted sum of decoder losses

L = Σ_{s=0}^{N−1} λ^s Σ_t L_s^(t)

where λ ∈ (0, 1] is a discount factor needed to weigh the risk of harming next-word accuracy against the benefits of TeaForN. During inference, TeaForN uses only the first decoder (Decoder-0) in the stack; the rest are discarded.
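The discounted total loss can be sketched as a toy computation over precomputed per-step negative log likelihoods (the function name and inputs are ours; nothing here is tied to a particular decoder):

```python
def teaforn_loss(stepwise_nll, lam):
    """Sum over decoder offsets s of lam**s times the summed per-timestep
    negative log likelihoods of Decoder-s. stepwise_nll[s][t] stands in
    for -log P_s^(t)[y_gt^(t+s)]; lam is the discount factor in (0, 1]."""
    return sum(lam ** s * sum(nll_s) for s, nll_s in enumerate(stepwise_nll))
```

Note that with a single decoder (N = 1) the result is independent of λ, since only the λ^0 = 1 term survives.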
The intuition behind TeaForN is as follows. Under standard teacher-forcing, the decoder output o^(t) only learns to predict the groundtruth label y_gt^(t), while outputs that favor other classes are considered equally bad and are penalized by the loss. This is unreasonable, because classes carrying similar meanings to the groundtruth label do not change the meaning of the sequence significantly, and may still lead to the correct prediction for the next label. Under TeaForN, the decoder output o^(t) is also used as the input of a secondary decoder for decoding the next position. Therefore, all outputs that result in predicting the next groundtruth label y_gt^(t+1) will have lower loss and thereby be differentiated from other outputs.
In our experiments, we allow the decoder parameters to be either shared (θ_0 = θ_s, ∀s) or unshared. In a shared-weight configuration, the model learns to predict the next groundtruth label from the class that the same model predicted in the previous position. This is similar to the inference-time condition, so we expect shared-weight TeaForN to address exposure bias better than unshared-weight TeaForN. Shared-weight configurations also have performance advantages such as lower memory consumption and faster training.
Since TeaForN solves for a more difficult problem than teacher-forcing, we expect it to work better for models with higher capacity. We later show evidence of this by comparing results for two model sizes on Machine Translation.
It is straightforward to show that TeaFor1 (N = 1) and teacher-forcing are equivalent, as the inputs to the first TeaForN decoder are groundtruth sequence embeddings and λ^0 = 1. Thus, TeaForN is a natural extension of teacher-forcing to N-grams.

Embedded Top-k Stacked Decoder Input
Previously, our TeaForN model directly used the decoder output of the (s − 1)-th stack as the input of the decoder of the s-th stack:

x_s^(t) = o_{s−1}^(t) + pos(t + s)   (1)

This is an approximation to the inference-time decoder input, which (for greedy decoding) is

x_s^(t) = E[argmax(P_{s−1}^(t))] + pos(t + s)   (2)

where argmax(x) returns the index of the maximum value of a V-dim vector x. Inspired by End-to-End Backprop (E2E) (Ranzato et al., 2015), we also consider the following alternative decoder input:

x_s^(t) = (top_k(P_{s−1}^(t)))^T E + pos(t + s)   (3)

where top_k is a function which keeps the top-k values of the vector, masks out the others, and renormalizes the result to sum to one. It is easy to verify that, when k = 1, Eq. (3) reduces to Eq. (2); when k = V, Eq. (3) is the full embedding expectation under the predicted distribution. Compared to Eq. (1), Eq. (3) is more computationally expensive, as it involves additional embedding matrix multiplications and/or top-k sorting. Furthermore, we would like to emphasize a critical difference between TeaForN and E2E (Ranzato et al., 2015). In TeaForN, the 0-th stack at every position is always clamped to the groundtruth input, while for E2E the groundtruth is completely thrown away after warm-up training. This groundtruth clamping allows TeaForN to avoid the warm-up training which is necessary for E2E.
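The top-k input computation can be sketched in NumPy as follows (our own minimal rendering, assuming weight sharing W = E and omitting the timing signal):

```python
import numpy as np

def topk_soft_input(o_prev, E, k):
    """Probability-weighted embedding of the k most likely classes.
    o_prev: previous-stack decoder output, shape (D,); E: embedding
    matrix, shape (V, D). The softmax is restricted to the top-k
    logits, which masks out and renormalizes away all other classes."""
    logits = E @ o_prev                       # W = E (shared projection)
    top = np.argsort(logits)[-k:]             # indices of the k largest logits
    z = np.exp(logits[top] - logits[top].max())
    probs = z / z.sum()                       # renormalized top-k probabilities
    return probs @ E[top]                     # expected embedding, shape (D,)
```

With k = 1 this returns the embedding of the argmax class, matching the greedy inference-time input; with k = V it is the exact embedding expectation under the predicted distribution.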

Experimental Results
Our empirical study of TeaForN comprises two parts. First, we present experiments on Machine Translation using the well-known Transformer model (Vaswani et al., 2017). Second, we show results for News Summarization, for which we use PEGASUS, a state-of-the-art pretrained text summarization model.
We perform minimal hyperparameter tuning over the course of these experiments. This can be partly credited to the underlying models being well-tuned already, but also to TeaForN, which works out-of-the-box without much hyperparameter tuning. One exception is the number of training steps, as we found that the number of steps used in previous settings is sometimes insufficient.

Machine Translation
In this section, we study the effects of applying TeaForN to a well-known Transformer-based Machine Translation model. We present results for two size variants of the model, Transformer base and Transformer big (Vaswani et al., 2017). The differences are summarized in Table 2.
Table 1: A comparison of models on WMT14 language pairs En-De and En-Fr using Transformer base. We report mean and Standard Error of SacreBLEU scores over five independent training runs. θ shared refers to whether the free parameters of the decoder are shared across decoder instances (Y) or kept separate (N). The discount factor is λ = .5 for TeaFor2 models.

We use the same WMT14 language-pair benchmarks originally reported in the Transformer paper:
• English-German (En-De), with 4.5M sentence pairs for training and 2,737 for testing.
• English-French (En-Fr), with 36M sentence pairs for training and 3,003 for testing.
We use SacreBLEU (Post, 2018) with case-sensitive tokenization to score translations. We report SacreBLEU scores for beam search widths k ∈ [1, 8] to show the interaction between TeaForN learning and beam search.

Transformer base
Using Transformer base as our underlying model, we measure the impact of TeaForN on the Machine Translation task. We test both shared- and unshared-weight configurations, with N = 2 (i.e., "TeaFor2") and λ = .5. We expect weight-shared configurations to be more effective, as a more direct means of addressing exposure bias in the decoder.
All models are trained for 1M steps, and we observe no signs of overfitting. For model selection, we average the last five checkpoints, as originally done for the Transformer (Vaswani et al., 2017). We report mean and standard-error variation of SacreBLEU scores over five runs.

Table 1 shows that TeaFor2, with or without weight-sharing, outperforms standard teacher-forcing on En-Fr by the same amount, +.12 SacreBLEU (40.32 vs 40.20). We credit this increase in performance to TeaForN's ability to make predictions that lead to better predictions in subsequent sequence positions.
Beam search results in Table 1 show that the gains of TeaFor2 are negated by beam search with k = 4. Because of the small capacity of the Transformer base model, the benefits of TeaFor2 are minimal and only reflected in the result from greedy decoding. In the following experiment, we show that higher-capacity models benefit more from TeaForN when using beam search.
We test against the same WMT14 language pairs as the previous experiments. We train En-Fr models for 1M steps and En-De models for 500k steps. Beyond 500k training steps, we observe that En-De models overfit the training data (see Table 4). This is likely due to a combination of the larger model capacity of Transformer big (Table 2) and the smaller En-De training set. For model selection, we average the last twenty checkpoints, as was done for the Transformer (Vaswani et al., 2017).
We use weight-sharing for all TeaForN setups in this section. Transformer big has more capacity than Transformer base, so it is expected to perform better in a shared-weight configuration. Figure 2 shows that TeaForN outperforms standard teacher-forcing on the En-Fr benchmark, across all beam widths up to k = 8 and all discount factors λ ∈ {.2, .4, .6, .8, 1}. With beam size 2, TeaFor2 achieves a higher score on En-Fr than teacher-forcing achieves with any beam size up to 8 (42.6 vs 42.4), and significantly outperforms it with beam size 5 (42.8 vs 42.4). TeaFor3 performs as well as teacher-forcing but worse than TeaFor2 on the En-Fr benchmark, for nearly every discount factor tested. This shows that TeaForN can be used to train models with higher quality for any given beam size or, alternatively, train models of similar quality but lower inference cost (i.e., faster).
In contrast with the results for lower-capacity models, Fig. 2 shows that beam search does not erase the gains due to TeaForN training. The teacher-forcing setup gains +.6 SacreBLEU from beam search (42.4 vs 41.8), compared to +.6 for TeaFor2 (42.8 vs 42.2), in spite of TeaFor2's +.4 SacreBLEU higher greedy baseline. Provided sufficient model capacity, TeaForN is seen to improve the quality of the underlying model, so that greedy decoding is more effective, but not at the expense of beam search.
Intuitively, discount factors that are too high may interfere with prediction quality, as they decrease the relative importance of next word prediction. We see this on the English-German benchmark, shown in Fig. 3, where the highest discount factor tested (λ = 1) significantly reduces greedy performance (27.9/27.8 from 28.1) and peak performance (28.9/28.8 vs 29.2). In all of our Transformer big experiments, the best performing discount factor is either .2 or .4, which are the lowest values tested.

Top-K Approximation
Up to this point, TeaForN setups have used Eq. (1) to approximate the inference-time decoder input.
We now share results for an alternative approximation called Top-K, described by Eq. (3) and inspired by (Ranzato et al., 2015), which feeds the embedding expectation of the decoder output. If K = V , Top-K is an exact expectation. If K < V , Top-K approximates the expectation as the probability-weighted embeddings of the K most likely outputs.
In this experiment, we try K ∈ {4, V } and N ∈ {2, 3} using Transformer big as our base model. We report results on both WMT14 benchmarks. We use discount factor λ = .2 for all setups. Figure 4 shows that Top-K does not work as well as the original TeaForN approximation described by Eq. (1). Top-K with K = 4 performs worse than TeaForN on the En-De benchmark but not the En-Fr benchmark. When K = V , the situation is the exact opposite, with Top-K performing better on the En-De benchmark but not the En-Fr benchmark.

Word Drop Regularization
TeaForN could potentially have regularization-like effects by solving for a more difficult task than standard teacher-forcing. TeaForN trains models to decode not just from groundtruth prefixes, but also from past model predictions.
To see whether regularization-like effects are responsible for the gains seen using TeaForN, we perform a regularization experiment using Transformer big. In particular, we randomly sample a set of groundtruth decoder input words in each example with probability P_drop ∈ {0, .1, …} and set their embeddings to zero. Fig. 5 shows that word drop regularization increases performance against the En-De benchmark but reduces performance against En-Fr. These results are in stark contrast with the results of TeaForN, which only improves performance in the En-Fr case. Though TeaForN may have regularization-like effects, they are likely different from the effects of word drop regularization.
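For concreteness, the word drop baseline can be sketched as follows (our own minimal NumPy rendering; the function and argument names are illustrative):

```python
import numpy as np

def word_drop(embeddings, p_drop, rng):
    """Independently zero out each position's groundtruth input embedding
    with probability p_drop. embeddings has shape (T, D): one row per
    decoder input word."""
    keep = rng.random(embeddings.shape[0]) >= p_drop   # one draw per word
    return embeddings * keep[:, None]                  # broadcast over D
```

Unlike TeaForN, this perturbs only the inputs, without ever exposing the model to its own predictions.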

Additional Compute
TeaForN uses more compute resources than Teacher-forcing when inference-time architecture and number of training iterations are the same, as is the case in our Transformer big experiments.
To enable a fair comparison in terms of training-time compute, we conduct an experiment where we train Transformer big so that the total device time is about the same. We train the baseline for 1.5x iterations, a figure estimated from the observed training speeds of Transformer big and TeaFor2 (6.8 and 4.5 iterations/sec, respectively). Table 4 shows that this additional training does not significantly benefit Transformer big for either language pair. Based on these results, we conclude that the benefits of TeaForN likely do not derive from additional compute.

News Summarization
We now present our experiments on News Summarization using PEGASUS large (Zhang et al., 2019). The PEGASUS approach has been shown to work better on News Summarization tasks when pretrained on HugeNews, a dataset of 1.5B news-like articles scraped from the web between 2013 and 2019. We use the same pretraining procedure as originally described for PEGASUS large (HugeNews), which uses teacher-forcing to learn based on an unsupervised Gap Sentence Generation task. See Tables 11 and 13 for results with standard error measurements.

Model                                  CNN/Dailymail        Gigaword
PEGASUS large (Zhang et al., 2019)     44.17/21.47/41.11    39.12/19.86/36.24
TeaFor3+PEGASUS (Greedy)               43.90/20.36/41.20    39.10/19.40/36.30
TeaFor3+PEGASUS (Beam@k=8)             44.20/21.70/41.32    39.16/20.16/36.54

Scores are ROUGE-1/ROUGE-2/ROUGE-L.

We use TeaFor3 with λ = .5 and weight-sharing. For model selection, we use the checkpoint with the highest ROUGE-L F-score on the validation set, with evaluations every 1k steps. We stop training on Gigaword after 160k steps and CNN/Dailymail after 400k steps.

Training Performance
While training cost and speed are moderately impacted by TeaForN, we note that inference cost is significantly reduced, by virtue of producing models that reach similar quality with fewer beams. This enables significant cost savings for production models, in addition to overall stronger models.

Conclusion
In this work, we introduce a new technique for sequence generation models called Teacher-Forcing with N-grams (TeaForN), which (a) addresses exposure bias, (b) allows the decoder to better take into account future decisions, and (c) requires no curriculum training. We show empirical evidence of the efficacy of TeaForN on several sequence generation tasks. With Transformer big (Vaswani et al., 2017), we significantly boost performance on the En-Fr benchmark. With PEGASUS large, we improve upon the existing ROUGE-L scores on the Gigaword and CNN/Dailymail benchmarks. Further, we show that TeaForN can match the prior state-of-the-art ROUGE-L scores on the summarization benchmarks without beam search, representing an 8x reduction in decoder cost at inference.