Token-level and sequence-level loss smoothing for RNN language models

Despite the effectiveness of recurrent neural network language models, their maximum likelihood estimation suffers from two limitations. First, it treats all sentences that do not match the ground truth as equally poor, ignoring the structure of the output space. Second, it suffers from "exposure bias": during training tokens are predicted given ground-truth sequences, while at test time prediction is conditioned on generated output sequences. To overcome these limitations we build upon the recent reward augmented maximum likelihood approach, which encourages the model to predict sentences that are close to the ground truth according to a given performance metric. We extend this approach to token-level loss smoothing, and propose improvements to the sequence-level smoothing approach. Our experiments on two different tasks, image captioning and machine translation, show that token-level and sequence-level loss smoothing are complementary, and significantly improve results.

The basic principle of RNNs is to iteratively compute a vectorial sequence representation, by applying at each time-step the same trainable function to compute the new network state from the previous state and the last symbol in the sequence. These models are typically trained by maximizing the likelihood of the target sentence given an encoded source (text, image, speech).
Maximum likelihood estimation (MLE), however, has two main limitations. First, the training signal only differentiates the ground-truth target output from all other outputs. It treats all other output sequences as equally incorrect, regardless of their semantic proximity to the ground-truth target. While such a "zero-one" loss is probably acceptable for coarse-grained classification of images, e.g. across a limited number of basic object categories (Everingham et al., 2010), it becomes problematic as the output space becomes larger and some of its elements become semantically similar to each other. This is in particular the case for tasks that involve natural language generation (captioning, translation, speech recognition), where the number of possible outputs is practically unbounded. For natural language generation tasks, evaluation measures typically do take structural similarity into account, e.g. based on n-grams, but such structural information is not reflected in the MLE criterion. The second limitation of MLE is that training is based on predicting the next token given the input and preceding ground-truth output tokens, while at test time the model predicts conditioned on the input and the so-far generated output sequence. Given the exponentially large output space of natural language sentences, it is not obvious that the learned RNNs generalize well beyond the relatively sparse distribution of ground-truth sequences used during MLE optimization. This phenomenon is known as "exposure bias" (Ranzato et al., 2016; Bengio et al., 2015).
MLE minimizes the KL divergence between a target Dirac distribution on the ground-truth sentence(s) and the model's distribution. In this paper, we build upon the "loss smoothing" approach of Norouzi et al. (2016), which smooths the Dirac target distribution over similar sentences, increasing the support of the training data in the output space. We make the following main contributions:
• We propose a token-level loss smoothing approach, using word embeddings, to achieve smoothing among semantically similar terms, and we introduce a special procedure to promote rare tokens.
• For sequence-level smoothing, we propose to use restricted token replacement vocabularies, and a "lazy evaluation" method that significantly speeds up training.
• We experimentally validate our approach on the MSCOCO image captioning task and the WMT'14 English to French machine translation task, showing that on both tasks combining token-level and sequence-level loss smoothing improves results significantly over maximum likelihood baselines.
In the remainder of the paper, we review existing methods to improve RNN training in Section 2. Then, we present our token-level and sequence-level approaches in Section 3. Experimental evaluation results based on image captioning and machine translation tasks are laid out in Section 4.

Related work
Previous work aiming to improve the generalization performance of RNNs can be roughly divided into three categories: those based on regularization, data augmentation, and alternatives to maximum likelihood estimation.
Regularization techniques are used to increase the smoothness of the function learned by the network, e.g. by imposing an ℓ2 penalty on the network weights, also known as "weight decay". More recent approaches mask network activations during training, as in dropout (Srivastava et al., 2014) and its variants adapted to recurrent models (Pham et al., 2014; Krueger et al., 2017). Instead of masking, batch normalization (Ioffe and Szegedy, 2015) rescales the network activations to avoid saturating the network's non-linearities. Instead of regularizing the network parameters or activations, it is also possible to directly regularize based on the entropy of the output distribution (Pereyra et al., 2017).
Data augmentation techniques improve the robustness of the learned models by applying transformations that might be encountered at test time to the training data. In computer vision, this is common practice, implemented by, e.g., scaling, cropping, and rotating training images (LeCun et al., 1998; Krizhevsky et al., 2012; Paulin et al., 2014). In natural language processing, examples of data augmentation include input noising by randomly dropping some input tokens (Iyyer et al., 2015; Bowman et al., 2015; Kumar et al., 2016), and randomly replacing words with substitutes sampled from the model (Bengio et al., 2015). Xie et al. (2017) introduced data augmentation schemes for RNN language models that leverage n-gram statistics in order to mimic Kneser-Ney smoothing of n-gram models. In the context of machine translation, Fadaee et al. (2017) modify sentences by replacing words with rare ones when this is plausible according to a pretrained language model, and substitute their equivalents in the target sentence using automatic word alignments. This approach, however, relies on the availability of additional monolingual data for language model training.
The de facto standard way to train RNN language models is maximum likelihood estimation (MLE) (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015). The sequential factorization of the sequence likelihood generates an additive structure in the loss, with one term corresponding to the prediction of each output token given the input and the preceding ground-truth output tokens. In order to directly optimize for sequence-level structured loss functions, such as measures based on n-grams like BLEU or CIDEr, Ranzato et al. (2016) use reinforcement learning techniques that optimize the expectation of a sequence-level reward. In order to avoid early convergence to poor local optima, they pre-train the model using MLE. Leblond et al. (2018) build on the learning-to-search approach to structured prediction (Daumé III et al., 2009; Chang et al., 2015) and adapt it to RNN training. The model generates candidate sequences at each time-step using all possible tokens, and scores these at the sequence level to derive a training signal for each time step. This leads to an approach that is structurally close to MLE, but computationally expensive. Norouzi et al. (2016) introduce a reward augmented maximum likelihood (RAML) approach that incorporates a notion of sequence-level reward without facing the difficulties of reinforcement learning. They define a target distribution over output sentences using a soft-max over the reward over all possible outputs. Then, they minimize the KL divergence between the target distribution and the model's output distribution. Training with a general reward distribution is similar to MLE training, except that we use multiple sentences sampled from the target distribution instead of only the ground-truth sentences.
In our work, we build upon the work of Norouzi et al. (2016) by proposing improvements to sequence-level smoothing, and extending it to token-level smoothing. Our token-level smoothing approach is related to the label smoothing approach of Szegedy et al. (2016) for image classification. Instead of maximizing the probability of the correct class, they train the model to predict the correct class with a large probability and all other classes with a small uniform probability. This regularizes the model by preventing overconfident predictions. In natural language generation with large vocabularies, preventing such "narrow" over-confident distributions is imperative, since for many tokens there are nearly interchangeable alternatives.
Loss smoothing

The input is encoded by g_θ and used to initialize the state sequence, and f_θ is a non-linear function that updates the state given the previous state h_{t-1}, the last output token y_{t-1}, and possibly the input x:

h_t = f_θ(h_{t-1}, y_{t-1}, x).    (3)

The state update function can take different forms; the ones including gating mechanisms, such as LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Chung et al., 2014), are particularly effective to model long sequences.
In standard teacher-forced training, the hidden states are computed by forwarding the ground-truth sequence y*, i.e. in Eq. (3) the RNN has access to the true previous token y*_{t-1}. In this case we denote the hidden states h*_t. Given a ground-truth target sequence y*, maximum likelihood estimation (MLE) of the network parameters θ amounts to minimizing the loss

ℓ_MLE(y*, x) = − log p_θ(y* | x)    (4)
             = − Σ_t log p_θ(y*_t | h*_{t-1}).    (5)

The loss can equivalently be expressed as the KL divergence between a Dirac centered on the target output (with δ_a(x) = 1 at x = a and 0 otherwise) and the model distribution, either at the sequence level or at the token level:

ℓ_MLE(y*, x) = D_KL( δ(y | y*) ∥ p_θ(y | x) )    (6)
             = Σ_t D_KL( δ(y_t | y*_t) ∥ p_θ(y_t | h*_{t-1}) ).    (7)

The loss smoothing approaches considered in this paper consist in replacing the Dirac on the ground-truth sequence with distributions of larger support. These distributions can be designed in such a manner that they reflect which deviations from ground-truth predictions are preferred over others.
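To make the token-level factorization of the MLE loss concrete, the following minimal sketch evaluates it on hand-made per-step distributions; the toy probabilities stand in for the RNN's softmax outputs p_θ(y_t | h*_{t-1}) and are purely illustrative.

```python
import math

# Hypothetical per-step output distributions p_theta(y_t | h*_{t-1})
# over a tiny vocabulary; in practice these come from the RNN softmax.
step_probs = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
]

def mle_loss(step_probs, target):
    # L_MLE(y*, x) = - sum_t log p_theta(y*_t | h*_{t-1})
    return -sum(math.log(p[y]) for p, y in zip(step_probs, target))

loss = mle_loss(step_probs, ["a", "b"])  # -(log 0.7 + log 0.8)
```

The additive structure over time steps is what later allows smoothing each token-level Dirac independently.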

Sequence-level loss smoothing
The reward augmented maximum likelihood approach of Norouzi et al. (2016) consists in replacing the sequence-level Dirac δ_{y*} in Eq. (6) with the distribution

r(y | y*) = exp( r(y, y*)/τ ) / Z(y*, τ),    (8)

where r(y, y*) is a "reward" function that measures the quality of sequence y w.r.t. y*; metrics used for evaluation of natural language processing tasks can be used, such as BLEU (Papineni et al., 2002) or CIDEr (Vedantam et al., 2015). The temperature parameter τ controls the concentration of the distribution around y*. When m > 1 ground-truth sequences are paired with the same input x, the reward function can be adapted to this setting and defined as r(y, {y*(1), ..., y*(m)}). The sequence-level smoothed loss function is then given by

ℓ_Seq(y*, x) = D_KL( r(y | y*) ∥ p_θ(y | x) ) = − E_{y∼r}[ log p_θ(y | x) ] − H( r(y | y*) ),    (9)

where the entropy term H(r(y|y*)) does not depend on the model parameters θ.
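For intuition, the reward-augmented target r(y|y*) ∝ exp(r(y, y*)/τ) can be sketched over a small, explicitly enumerated candidate set (tractable only for toy output spaces); the negative Hamming distance serves as the reward, and the candidate sentences are illustrative.

```python
import math

def reward_distribution(candidates, y_star, reward, tau):
    """Normalize exp(r(y, y*)/tau) over an enumerated candidate set."""
    scores = [math.exp(reward(y, y_star) / tau) for y in candidates]
    z = sum(scores)  # partition function, computable here by enumeration
    return [s / z for s in scores]

def neg_hamming(y, y_star):
    # Negative Hamming distance between equal-length token sequences.
    return -sum(a != b for a, b in zip(y, y_star))

y_star = ["the", "cat", "sat"]
candidates = [y_star, ["the", "dog", "sat"], ["a", "dog", "ran"]]
dist = reward_distribution(candidates, y_star, neg_hamming, tau=1.0)
# The ground truth (reward 0) gets the largest probability mass;
# lowering tau concentrates the distribution further around y*.
```

Sequences closer to y* receive exponentially more mass, which is exactly the structure that the zero-one Dirac target ignores.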
In general, the expectation in Eq. (9) is intractable due to the exponentially large output space, and is replaced with a Monte Carlo approximation:

ℓ_Seq(y*, x) ≈ − (1/L) Σ_{l=1..L} log p_θ(y^l | x),  with y^l ∼ r(y | y*).    (10)

Stratified sampling. Norouzi et al. (2016) show that when using the Hamming or edit distance as a reward, we can sample directly from r(y|y*) using a stratified sampling approach. In this case sampling proceeds in three stages: drawing the number of substitutions, the positions at which to apply them, and the substituted tokens themselves.
Importance sampling. For a reward based on BLEU or CIDEr, we cannot directly sample from r(y|y*), since the normalizing constant, or "partition function", of the distribution is intractable to compute. In this case we can resort to importance sampling. We first sample L sequences y^l from a tractable proposal distribution q(y|y*). We then compute the importance weights

ω_l ≈ ( r(y^l | y*) / q(y^l | y*) ) / Σ_{k=1..L} ( r(y^k | y*) / q(y^k | y*) ),    (11)

where r(y^k | y*) is the un-normalized reward distribution of Eq. (8). We finally approximate the expectation by reweighing the samples in the Monte Carlo approximation as

ℓ_Seq(y*, x) ≈ − Σ_{l=1..L} ω_l log p_θ(y^l | x).    (12)

In our experiments we use a proposal distribution based on the Hamming distance, which allows for tractable stratified sampling, and generates sentences that do not stray far from the ground truth.
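The self-normalized importance-sampling estimate can be sketched as follows; the un-normalized rewards, the proposal probabilities, and the model log-probabilities are all hypothetical stand-ins for quantities produced by the reward function, the Hamming-based proposal, and the RNN.

```python
import math

def importance_weights(samples, r_unnorm, q_prob):
    # omega_l proportional to r(y^l|y*) / q(y^l|y*), self-normalized
    # over the L samples, so the partition function cancels out.
    raw = [r_unnorm(y) / q_prob(y) for y in samples]
    z = sum(raw)
    return [w / z for w in raw]

def smoothed_loss(samples, omega, logp_model):
    # - sum_l omega_l * log p_theta(y^l | x)
    return -sum(w * logp_model(y) for w, y in zip(omega, samples))

# Toy setup: two sampled "sequences" with hand-made scores.
samples = ["y1", "y2"]
r_un = {"y1": 1.0, "y2": math.exp(-1.0)}   # un-normalized reward r(y|y*)
q = {"y1": 0.5, "y2": 0.5}                 # uniform proposal q(y|y*)
logp = {"y1": math.log(0.5), "y2": math.log(0.25)}

omega = importance_weights(samples, r_un.get, q.get)
loss = smoothed_loss(samples, omega, logp.get)
```

Because the weights are normalized over the drawn samples, only ratios of un-normalized rewards are needed, which is what makes BLEU- or CIDEr-based rewards usable despite their intractable partition function.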
We propose two modifications to the sequence-level loss smoothing of Norouzi et al. (2016): sampling from a restricted vocabulary (described in the following paragraphs) and lazy sequence-level smoothing (described in Section 3.4).
Restricted vocabulary sampling. In the stratified sampling method for Hamming and edit distance rewards, instead of drawing from the large vocabulary V, typically containing in the order of 10^4 words or more, we can restrict ourselves to a smaller subset V_sub better adapted to our task. We considered three different possibilities for V_sub.
V: the full vocabulary, from which we sample uniformly (default), or draw from our token-level smoothing distribution defined below in Eq. (13).
V_refs: uniformly sample from the set of tokens that appear in the ground-truth sentence(s) associated with the current input.
V_batch: uniformly sample from the tokens that appear in the ground-truth sentences across all inputs in a given training mini-batch.
Uniformly sampling from V_batch has the effect of boosting the frequencies of words that appear in many reference sentences, and thus approximates to some extent sampling substitutions from the unigram statistics of the training set.
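A minimal sketch of the two restricted substitution vocabularies (the helper names are ours, not from the paper); V_batch is kept as a bag with repetitions, so uniform draws from it follow the batch unigram statistics, matching the boosting effect described above.

```python
import random

def vocab_refs(refs_for_input):
    # V_refs: distinct tokens of the current input's reference sentence(s).
    return sorted({tok for ref in refs_for_input for tok in ref})

def vocab_batch(refs_in_batch):
    # V_batch: tokens of all references in the mini-batch, kept with
    # repeats so that frequent reference words are drawn more often.
    return [tok for ref in refs_in_batch for tok in ref]

refs = [["a", "cat", "sits"], ["a", "cat", "rests"]]
batch_refs = refs + [["a", "dog", "runs"]]

v_refs = vocab_refs(refs)
v_batch = vocab_batch(batch_refs)
substitute = random.choice(v_batch)  # uniform over token occurrences
```

Swapping either helper into the substitution step of stratified sampling restricts replacements to task-relevant words instead of the full 10^4-word vocabulary.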

Token-level loss smoothing
While sequence-level smoothing can be directly based on performance measures of interest such as BLEU or CIDEr, the support of the smoothed distribution is limited to the number of samples drawn during training. We propose smoothing the token-level Diracs δ_{y*_t} in Eq. (7) to increase their support to similar tokens. Since we apply smoothing to each of the tokens independently, this approach implicitly increases the support to an exponential number of sequences, unlike the sequence-level smoothing approach. This comes at the price, however, of a naive token-level independence assumption in the smoothing.
We define the smoothed token-level distribution, analogously to the sequence-level one, as a softmax over a token-level "reward" function:

r(y_t | y*_t) = exp( r(y_t, y*_t)/τ ) / Σ_{v∈V} exp( r(v, y*_t)/τ ),    (13)

where τ is again a temperature parameter. As token-level reward r(y_t, y*_t) we use the cosine similarity between y_t and y*_t in a semantic word-embedding space. In our experiments we use GloVe (Pennington et al., 2014); preliminary experiments with word2vec (Mikolov et al., 2013) yielded somewhat worse results.
Promoting rare tokens. We can further improve token-level smoothing by promoting rare tokens. To do so, we penalize frequent tokens when smoothing over the vocabulary, by subtracting β freq(y_t) from the reward, where freq(·) denotes the term frequency and β is a non-negative weight. This modification encourages the smoothed distribution to shift mass from frequent tokens towards rare ones. We experimentally found that it is also beneficial for rare tokens to boost frequent ones, as rare tokens tend to have mostly rare tokens as neighbors in the word-embedding space. With this in mind, we define a new token-level reward in which the frequency-based penalty is strongest when both tokens have similar frequencies.
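The token-level target can be sketched as follows, with tiny hand-made 2-d embeddings in place of GloVe. Note that the frequency penalty shown is the simple β·freq(y_t) form; the paper's final reward couples the frequencies of both tokens, which this sketch does not reproduce.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def token_target(y_star, vocab, emb, freq, tau=0.5, beta=0.0):
    # r(y_t | y*_t) proportional to exp(r(y_t, y*_t) / tau), with a
    # cosine-similarity reward penalized by beta * freq(y_t).
    def reward(y):
        return cosine(emb[y], emb[y_star]) - beta * freq.get(y, 0.0)
    scores = {y: math.exp(reward(y) / tau) for y in vocab}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Hand-made embeddings: "kitten" is close to "cat", "car" is not.
emb = {"cat": (1.0, 0.0), "kitten": (0.9, 0.2), "car": (0.0, 1.0)}
freq = {"cat": 0.9, "kitten": 0.1, "car": 0.5}

plain = token_target("cat", list(emb), emb, freq, tau=0.5, beta=0.0)
promoted = token_target("cat", list(emb), emb, freq, tau=0.5, beta=1.0)
```

With β = 0 the ground-truth token dominates; a positive β shifts mass away from the frequent "cat" towards the rare, semantically close "kitten".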

Combining losses
In both loss smoothing methods presented above, the temperature parameter τ controls the concentration of the distribution. As τ gets smaller, the distribution peaks around the ground truth, while for large τ the uniform distribution is approached. We can, however, not separately control the spread of the distribution and the mass reserved for the ground-truth output. We therefore introduce a second parameter α ∈ [0, 1] to interpolate between the Dirac on the ground truth and the smooth distribution. Using ᾱ = 1 − α, the sequence-level and token-level loss functions are then defined as

ℓ^α_Seq(y*, x) = ᾱ ℓ_MLE(y*, x) + α ℓ_Seq(y*, x),    (14)
ℓ^α_Tok(y*, x) = ᾱ ℓ_MLE(y*, x) + α ℓ_Tok(y*, x).    (15)

To benefit from both sequence-level and token-level loss smoothing, we also combine them by applying token-level smoothing to the different sequences sampled for the sequence-level smoothing. We introduce two mixing parameters α1 and α2. The first controls to what extent sequence-level smoothing is used, while the second controls to what extent token-level smoothing is used. The combined loss is defined as

ℓ^{α1,α2}_{Tok-Seq}(y*, x) = ᾱ1 ℓ^{α2}_Tok(y*, x) + α1 E_{y∼r(y|y*)}[ ℓ^{α2}_Tok(y, x) ].    (16)

In our experiments, we use held-out validation data to set the mixing and temperature parameters.
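The α-mixing can be sketched as a direct convex combination of the MLE loss and a smoothed loss (reading the interpolation of target distributions as an interpolation of the resulting losses is our assumption):

```python
def interpolated_loss(l_mle, l_smooth, alpha):
    # L^alpha = (1 - alpha) * L_MLE + alpha * L_smooth, alpha in [0, 1].
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * l_mle + alpha * l_smooth

# alpha = 0 recovers plain MLE; alpha = 1 uses only the smoothed loss.
combined = interpolated_loss(2.0, 4.0, alpha=0.25)  # 0.75*2 + 0.25*4 = 2.5
```

Applying the same mixing twice, once with α1 at the sequence level and once with α2 at the token level, yields the Tok-Seq combination.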
Lazy sequence smoothing. Although sequence-level smoothing is computationally efficient compared to reinforcement learning approaches (Ranzato et al., 2016; Rennie et al., 2017), it is slower compared to MLE. In particular, we need to forward each of the samples y^l through the RNN in teacher-forcing mode so as to compute its hidden states h^l_t, which are used to compute the sequence MLE loss as

ℓ(y^l, x) = − Σ_t log p_θ(y^l_t | h^l_{t-1}).    (17)

To speed up training, and since we already forward the ground-truth sequence through the RNN to evaluate the MLE part of ℓ^α_Seq(y*, x), we propose to use the same hidden states h*_t to compute both the MLE and the sequence-level smoothed loss. In this case:

ℓ_lazy(y^l, x) = − Σ_t log p_θ(y^l_t | h*_{t-1}).    (18)

In this manner, we only need a single forward pass through the RNN instead of L + 1. We provide the pseudo-code for training in Algorithm 1.
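A sketch of the lazy variant: every sampled sequence is scored against the log-probabilities obtained from the single teacher-forced pass on y*, instead of a separate forward pass per sample. The per-step distributions are hypothetical stand-ins for the RNN softmax outputs conditioned on h*_{t-1}.

```python
import math

# Per-step distributions p_theta(. | h*_{t-1}) from the one
# teacher-forced pass on y*; hand-made numbers for illustration.
gt_step_probs = [
    {"a": 0.6, "b": 0.4},
    {"a": 0.5, "b": 0.5},
]

def lazy_sequence_loss(gt_step_probs, samples):
    # Score each sampled y^l with the ground-truth hidden states' outputs:
    # l_lazy(y^l, x) = - sum_t log p_theta(y^l_t | h*_{t-1}),
    # averaged over the L samples.
    losses = [-sum(math.log(p[tok]) for p, tok in zip(gt_step_probs, y))
              for y in samples]
    return sum(losses) / len(losses)

loss = lazy_sequence_loss(gt_step_probs, [["a", "a"], ["b", "a"]])
```

The approximation ignores how the sampled prefixes would change the hidden states, trading a small bias for an L-fold reduction in forward passes.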

Experimental evaluation
In this section, we compare sequence prediction models trained with maximum likelihood (MLE) to models trained with our token-level and sequence-level loss smoothing on two different tasks: image captioning and machine translation.
Image captioning

We use the MS-COCO dataset (Lin et al., 2014), which consists of 82k training images, each annotated with five captions. We use the standard splits of Karpathy and Li (2015), with 5k images for validation and 5k for test. The test set results are generated via beam search (beam size 3) and are evaluated with the MS-COCO captioning evaluation tool. We report CIDEr and BLEU scores on this internal test set. We also report results obtained on the official MS-COCO server, which additionally measures METEOR (Denkowski and Lavie, 2014) and ROUGE-L (Lin, 2004). We experiment with both non-attentive LSTMs (Vinyals et al., 2015) and the ResNet baseline of the state-of-the-art top-down attention model (Anderson et al., 2017).
The MS-COCO vocabulary consists of 9,800 words that occur at least 5 times in the training set. Additional details and hyperparameters can be found in Appendix B.1.

Results and discussion
Restricted vocabulary sampling. In this section, we evaluate the impact of the vocabulary subset from which we sample the modified sentences for sequence-level smoothing. We experiment with two rewards: CIDEr, which scores w.r.t. all five available reference sentences, and a Hamming distance reward, which takes only a single reference into account. For each reward we train our (Seq) models with each of the three subsets detailed previously in Section 3.2, Restricted vocabulary sampling.
From the results in Table 1 we note that for the non-attentive models, sampling from V_refs or V_batch performs better than sampling from the full vocabulary on all metrics. In fact, using these subsets introduces a useful bias to the model and improves performance. This improvement is most notable with the CIDEr reward, which scores candidate sequences w.r.t. multiple references, stabilizing the scoring of the candidates.
With an attentive decoder, no matter the reward, re-sampling sentences with words from V_refs rather than the full vocabulary V is better for both reward functions and all metrics. Additional experimental results, presented in Appendix B.2 and obtained with a BLEU-4 reward in its single- and multiple-reference variants, corroborate these conclusions.
[Table 2: MS-COCO evaluation server results for (You et al., 2016), Review Net (Yang et al., 2016), Adaptive (Lu et al., 2017), and our models.]
Overall. For reference, we include in Table 1 baseline results obtained using MLE, and our implementation of MLE with entropy regularization (MLE+γH) (Pereyra et al., 2017), as well as the RAML approach of Norouzi et al. (2016), which corresponds to sequence-level smoothing based on the Hamming reward and sampling replacements from the full vocabulary (Seq, Hamming, V). We observe that entropy smoothing is not able to improve performance much over MLE for the model without attention, and even deteriorates performance for the attention model. We improve upon RAML by choosing an adequate subset of the vocabulary for substitutions.
We also report the performance of token-level smoothing, where the promotion of rare tokens boosted the scores in both attentive and non-attentive models.
For sequence-level smoothing, choosing a task-relevant reward with importance sampling yielded better results than the plain Hamming distance.
Moreover, we combined the two smoothing schemes (Tok-Seq) and achieved the best results with CIDEr as the reward for sequence-level smoothing, combined with token-level smoothing that promotes rare tokens: CIDEr improves from 93.59 (MLE) to 99.92 for the model without attention, and from 101.63 to 103.81 with attention.
Qualitative results. In Figure 1 we showcase captions obtained with MLE and our three variants of smoothing, i.e. token-level (Tok), sequence-level (Seq), and their combination (Tok-Seq). We note that sequence-level smoothing tends to generate lengthy captions overall, which is maintained in the combination. On the other hand, token-level smoothing allows for better recognition of objects in the image, which stems from the robust training of the classifier, e.g. the 'cement block' in the top right image or the carrots in the bottom right. More examples are available in Appendix B.4.
Comparison to the state of the art. We compare our model to state-of-the-art systems on the MS-COCO evaluation server in Table 2. We submitted a single model (Tok-Seq, CIDEr, V_refs) as well as an ensemble of five models with different initializations trained on the training set plus 35k images from the dev set (a total of 117k images) to the MS-COCO server. The three best results on the server (Rennie et al., 2017; Yao et al., 2017; Anderson et al., 2017) are trained in two stages, first using MLE before switching to policy gradient methods based on CIDEr. Anderson et al. (2017) reported an increase of 5.8% in CIDEr on the test split after the CIDEr optimization. Moreover, Yao et al. (2017) use additional information about image regions to train the attribute classifiers, while Anderson et al. (2017) pre-train their bottom-up attention model on the Visual Genome dataset (Krishna et al., 2017). Lu et al. (2017); Yao et al.
(2017) use the same CNN encoder as ours (ResNet-152), while Vinyals et al. (2015) and Yang et al. (2016) use Inception-v3 (Szegedy et al., 2016) for image encoding.

Machine translation

For this task we validate the effectiveness of our approaches on two different datasets. The first is WMT'14 English to French, in its filtered version, with 12M sentence pairs obtained after dynamically selecting a "clean" subset of 348M words out of the original "noisy" 850M words (Bahdanau et al., 2015; Cho et al., 2014; Sutskever et al., 2014). The second benchmark is IWSLT'14 German to English, consisting of around 150k pairs for training. In all our experiments we use the attentive model of Bahdanau et al. (2015). The hyperparameters of each of these models, as well as any additional pre-processing, can be found in Appendix C.1. To assess the translation quality we report the BLEU-4 metric. We present our results in Table 3. On both benchmarks, we improve over both MLE and the RAML approach of Norouzi et al. (2016) (Seq, Hamming, V): using the smaller batch vocabulary for replacement improves results, and using importance sampling based on BLEU-4 further boosts results. In this case, unlike in the captioning experiments, token-level smoothing brings smaller improvements. The combination of both smoothing approaches gives the best results, similar to what was observed for image captioning, improving the MLE BLEU-4 from 30.03 to 31.39 on WMT'14 and from 27.55 to 28.74 on IWSLT'14. The outputs of our best model are compared to MLE in examples showcased in Appendix C.

Conclusion
We investigated the use of loss smoothing approaches to improve over maximum likelihood estimation of RNN language models. We generalized the sequence-level smoothing RAML approach of Norouzi et al. (2016) to the token level by smoothing the ground-truth target across semantically similar tokens. For the sequence level, which is computationally expensive, we introduced an efficient "lazy" evaluation scheme, and an improved re-sampling strategy. Experimental evaluation on image captioning and machine translation demonstrates the complementarity of sequence-level and token-level loss smoothing, improving over both the maximum likelihood and RAML baselines.
(i) Sample a distance d from {0, ..., T} from a prior distribution on d. (ii) Uniformly select d positions in the sequence to be modified. (iii) Sample the d substitutions uniformly from the token vocabulary. Details on the construction of the prior distribution on d for a reward based on the Hamming distance can be found in Appendix A.
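These three stages can be sketched directly; note that the uniform substitution in stage (iii) may redraw the original token, so the realized Hamming distance can be smaller than the sampled d.

```python
import random

def stratified_hamming_sample(y_star, vocab, d_prior, rng=random):
    # Stage (i): draw the number of substitutions d from the prior on {0..T}.
    T = len(y_star)
    d = rng.choices(range(T + 1), weights=d_prior)[0]
    # Stage (ii): pick d positions to modify, uniformly without replacement.
    positions = rng.sample(range(T), d)
    # Stage (iii): substitute each chosen position with a uniform vocab draw.
    y = list(y_star)
    for t in positions:
        y[t] = rng.choice(vocab)
    return y

y_star = ["the", "cat", "sat"]
vocab = ["the", "cat", "sat", "dog", "ran"]
# A prior putting all mass on d = 0 must return y* unchanged.
unchanged = stratified_hamming_sample(y_star, vocab, d_prior=[1, 0, 0, 0])
```

Passing a restricted vocabulary (V_refs or V_batch) for `vocab` gives the restricted-sampling variant of Section 3.2.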

Figure 1: Examples of generated captions with the baseline MLE and our models with attention.
Algorithm 1: Sequence-level smoothing with lazy evaluation
Encode x to initialize the RNN
Forward y* through the RNN to compute the hidden states h*_t
Compute ℓ_MLE(y*, x)
for l ∈ {1, ..., L} do
    Sample y^l ∼ r(·|y*)
    if lazy then
        Compute ℓ(y^l, x) = − Σ_t log p_θ(y^l_t | h*_{t-1})
    else
        Forward y^l through the RNN to get its hidden states h^l_t
        Compute ℓ(y^l, x) = − Σ_t log p_θ(y^l_t | h^l_{t-1})
    end if
end for

Table 1: MS-COCO test set evaluation measures.