Generalized Entropy Regularization or: There’s Nothing Special about Label Smoothing

Prior work has explored directly regularizing the output distributions of probabilistic models to alleviate peaky (i.e. over-confident) predictions, a common sign of overfitting. This class of techniques, of which label smoothing is one, has a connection to entropy regularization. Despite the consistent success of label smoothing across architectures and data sets in language generation tasks, two problems remain open: (1) there is little understanding of the underlying effects entropy regularizers have on models, and (2) the full space of entropy regularization techniques is largely unexplored. We introduce a parametric family of entropy regularizers, which includes label smoothing as a special case, and use it to gain a better understanding of the relationship between the entropy of a model and its performance on language generation tasks. We also find that variance in model performance can be explained largely by the resulting entropy of the model. Lastly, we find that label smoothing provably does not allow for sparsity in an output distribution, an undesirable property for language generation models, and therefore advise the use of other entropy regularization methods in its place.


Introduction
When training large neural networks with millions of parameters, regularization of some form is needed to prevent overfitting, even when large amounts of data are used; models for language generation are no exception. In probabilistic modeling, e.g. when the final layer of the neural network is a softmax, overfitting often manifests itself in overconfident placement of most of the probability mass on a few candidates, resulting in peaky (low-entropy) probability distributions over the vocabulary. Specifically for language generation tasks, this behavior leads to the output of repetitive or frequently occurring but unrelated text, which is detrimental to the generalization abilities of the model (Chorowski and Jaitly, 2017; Holtzman et al., 2020). A natural regularizer to consider is, therefore, one that penalizes overconfidence, encouraging higher entropy in the learned distribution. Indeed, the literature has ascribed gains of ≈ 1 BLEU point in machine translation to label smoothing, one such technique (Chen et al., 2018).
Despite the clear relationship between low entropy and overfitting, only a handful of distinct entropy regularizers have been explored. To fill this gap, we introduce generalized entropy regularization (GER), a unified framework for understanding and exploring a broad range of entropy-inducing regularizers. GER is based on the skew-Jensen family of divergences J_{α,G} (Nielsen and Boltz, 2011) and thus may be generalized to any Bregman divergence through the choice of generator function G. For the negative entropy generator function, GER recovers label smoothing (Szegedy et al., 2015) as α → 1, and the confidence penalty (Pereyra et al., 2017) as α → 0. We provide formal properties of GER in §3, proving these special-case equivalences among other characteristics of GER. We then use GER to examine the relationship between entropy and the evaluation metrics in two language generation tasks: neural machine translation (NMT) and abstractive summarization.
GER encompasses a large family of regularizers, which allows us to directly compare label smoothing to other forms of entropy regularization. By studying the relationship between different regularizers and the performance of natural language generation (NLG) systems, we can better understand not just when but also why label smoothing aids language generation tasks. Through our analysis, we gain the following insights: (i) With tuning of the regularizer's coefficient, any choice of α can yield similar performance, i.e. there is nothing special about label smoothing. In fact, our results suggest that label smoothing (α → 1) makes it more difficult to tune the regularizer's coefficient.
(ii) Label smoothing assigns infinite cost to sparse output distributions, which may be an undesirable behavior for language generation tasks.
(iii) There is a strong (quadratic) relationship between a model's performance on the evaluation metric and its (average) entropy, offering a hint as to why these regularizers are so effective for NLG.
In summary, entropy-inducing regularizers are a boon to probabilistic NLG systems, which benefit from higher entropy output distributions. Label smoothing works because it forces the model towards a higher-entropy solution, but we recommend the confidence penalty and other entropy regularizers (α < 1) for reasons (i) and (ii) above.

Preliminaries
In this work, we consider conditional probability models p_θ(y | x) for natural language generation; such models assign probability to a target sequence y ∈ Y given a source sequence x. Specifically, our target sequence y = y_1, …, y_n of arbitrary length n is a sequence of target words1 y_i from our vocabulary Y. The set of all complete target sequences, which are padded with distinguished beginning- and end-of-sentence symbols, BOS and EOS, is then defined as Y := {BOS ◦ y ◦ EOS | y ∈ Y*}. For language generation tasks, p_θ(y | x) is typically a neural network with parameters θ; this network is often trained to approximate p̃(y | x), the empirical distribution (i.e. the distribution of the data). Here, we focus on locally normalized models, in which p_θ(y | x) is factored as:

p_θ(y | x) = ∏_{i=1}^{n} p_θ(y_i | x, y_{<i})   (1)

where p_θ(y_i | x, y_{<i}) is defined by a softmax over the output of the final fully connected layer of the network. Generation is performed using greedy search, beam search, or a sampling scheme. Of the candidate sequences generated, the one with the highest probability under the model p_θ is returned as the model's prediction.

One way of selecting the parameters θ is to minimize the KL-divergence between the empirical distribution and the model:

θ̂ = argmin_θ KL(p̃ || p_θ)   (2)

This yields the cross-entropy loss (plus an additive constant):2

L(θ) = − Σ_{(x,y) ∈ C} log p_θ(y | x)   (3)

However, fitting a model that perfectly approximates the empirical distribution is, in general, fraught with problems (Hastie et al., 2001). The goal of learning is to generalize beyond the observed data. Exactly fitting the empirical distribution, often termed overfitting, is therefore not an ideal goal and, for language generation models specifically, does not go hand-in-hand with the ability of a model to generate desirable text (Bengio et al., 2015).

1 Targets y_i may also be characters or subwords; our experiments use byte-pair encoding (Sennrich et al., 2016).
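The locally normalized factorization above can be made concrete with a short sketch: given one softmax distribution per time step, the log-probability of a sequence is the sum of per-step log-probabilities. This is an illustrative NumPy sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def sequence_log_prob(step_dists, token_ids):
    # log p_θ(y | x) = Σ_i log p_θ(y_i | x, y_<i),
    # with one softmax distribution per time step
    return float(sum(np.log(p[y]) for p, y in zip(step_dists, token_ids)))

# two time steps over a toy vocabulary of size 2
dists = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
log_p = sequence_log_prob(dists, [0, 1])   # log(0.5 * 0.75)
```

A real system would produce `step_dists` from the network's final layer; here they are hard-coded for illustration.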
Consequently, it is advisable to minimize a regularized objective to prevent overfitting:

L_R(θ) = L(θ) + β R(θ)   (4)

where R(θ) is a regularizer defined over the model with "strength" coefficient β > 0.
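As a minimal sketch of eq. (4), the regularized objective simply adds a weighted penalty term to the cross-entropy loss. The helper names below are illustrative, with the regularizer applied per time step:

```python
import numpy as np

def cross_entropy_loss(step_dists, token_ids):
    # L(θ): negative log-likelihood of the reference tokens
    return -float(sum(np.log(p[y]) for p, y in zip(step_dists, token_ids)))

def regularized_loss(step_dists, token_ids, regularizer, beta):
    # eq. (4): L(θ) + β R(θ), with R applied per time step and summed
    penalty = sum(regularizer(p) for p in step_dists)
    return cross_entropy_loss(step_dists, token_ids) + beta * penalty

dists = [np.array([0.25, 0.25, 0.25, 0.25])]
# with a zero regularizer the objective reduces to plain cross-entropy
loss = regularized_loss(dists, [0], lambda p: 0.0, beta=1.0)
```

Any of the entropy regularizers discussed below can be plugged in as `regularizer`.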

Entropy Regularization
Overfitting can manifest itself as peakiness in p_θ (Williams and Peng, 1991; Mnih et al., 2016; Pereyra et al., 2017). In other words, p_θ overconfidently places most of the probability mass on very few candidates. While this overconfidence improves training loss, it hurts generalization. Entropy regularization is one technique that directly combats such overconfidence by encouraging more entropic (less peaky) distributions.
The entropy of the model p_θ is defined as:

H(p_θ) := − Σ_{y ∈ Y} p_θ(y | x) log p_θ(y | x)   (5)

where we remove dependence on x for notational simplicity. However, the sum in eq. (5) over Y generally renders its computation intractable.3 Instead, regularization is performed on the conditional distribution over Y ∪ {EOS} at each time step, which can be interpreted as an approximation of the true model entropy. For ease of notation, we define a higher-order function D_f over our training corpus C, consisting of ⟨x, y⟩ pairs, that maps a function f over distributions q, p as follows:

D_f(q || p) := Σ_{(x,y) ∈ C} Σ_{j=1}^{|y|} f(q(· | x, y_{<j}) || p(· | x, y_{<j}))   (6)

The function D_f allows us to describe in notation how entropy regularization is typically employed in the training of language generation systems.4

Label Smoothing. Label smoothing, first introduced as a regularizer for neural networks by Szegedy et al. (2015), is so named because the technique smooths hard target distributions. One such distribution, the empirical distribution, is encoded as a set of one-hot vectors (hard targets) where for each data point, the correct label (e.g., the vocabulary index of a word) has value 1 and all other labels have value 0. Label smoothing with strength coefficient γ is an add-γ smoothing scheme on the distribution over labels at every time step. Interestingly, minimizing the cross entropy between this modified distribution and the model p_θ is equivalent to adding the weighted KL divergence between the uniform distribution u and the model p_θ to our original objective function with the same strength coefficient:

L_γ^LS(θ) = (1 − γ) L(θ) + γ D_KL(u || p_θ)   (7)

While the loss function is often scaled as above, it is nonetheless equivalent5 to L_β^LS(θ) = L(θ) + β D_KL(u || p_θ); we use this form for consistency.

Confidence Penalty. The confidence penalty, empirically explored in the supervised learning setting by Pereyra et al. (2017), aims to penalize a low-entropy model. This is done by subtracting a weighted term for the entropy of the model's prediction p_θ(·) from the loss function, thereby encouraging a more entropic model. This is equivalent, up to an additive constant, to adding the KL divergence between the model p_θ and the uniform distribution:

L_β^CP(θ) = L(θ) + β D_KL(p_θ || u)   (8)

While Pereyra et al. (2017) found that label smoothing performed better than the confidence penalty for NMT, they only searched coarsely over a small range of β's for both regularizers. Our findings in §4 suggest an alternate conclusion.

4 Note that the standard loss function in eq. (3) can be written in this form when computed over C, i.e. KL(p̃ || p_θ) = D_KL(p̃ || p_θ), since the reference y is the only value in supp(p̃).
5 Up to a multiplicative factor of (1 − γ) when β = γ/(1 − γ).
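Both penalty terms reduce to KL divergences against the uniform distribution u. A minimal NumPy sketch of the two terms (helper names are ours), together with a check of the identity KL(p || u) = log K − H(p) for a vocabulary of size K:

```python
import numpy as np

def kl(q, p):
    # KL(q || p); terms with q(i) = 0 contribute 0
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

def label_smoothing_term(p):
    # D_KL(u || p): the penalty added by label smoothing
    u = np.full_like(p, 1.0 / len(p))
    return kl(u, p)

def confidence_penalty_term(p):
    # D_KL(p || u) = log K - H(p): penalizes low entropy
    u = np.full_like(p, 1.0 / len(p))
    return kl(p, u)

p = np.array([0.7, 0.1, 0.1, 0.1])
entropy = -float(np.sum(p * np.log(p)))
```

Either term can be summed over time steps and scaled by β to form the regularized objective of eq. (4).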

Generalized Entropy Regularization
The positive effect of both label smoothing and the confidence penalty on model performance in language generation tasks motivates further exploration of entropy-promoting regularizers. To this end, we construct a parameterized family of regularizers with label smoothing and the confidence penalty as special cases. We discuss the formal properties of a subset of this family, providing upper and lower bounds for it. We show divergence only occurs in one case for this subset (α → 1), which directly implies that no sparse solution exists when label smoothing is used as a regularizer.

A Family of Entropy Regularizers
We derive a family of regularizers from the skew-Jensen divergence J_{α,G} (Nielsen and Boltz, 2011), which is defined below as:

J_{α,G}(q || p) := (1/(α(1 − α))) [(1 − α) G(q) + α G(p) − G((1 − α) q + α p)]   (9)

for a strictly convex generator function G : Ω → R and α ∈ (0, 1), where Ω is a closed convex set. In this paper, we restrict Ω to be the (|Y|+1)-simplex.
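The skew-Jensen divergence can be implemented directly for any generator G. A NumPy sketch with our own helper names, assuming the scaled variant with the 1/(α(1−α)) factor and the negative-entropy generator; it also illustrates that the divergence is asymmetric in its arguments in general:

```python
import numpy as np

def neg_entropy(p):
    # generator G(p) = -H(p) = Σ_i p(i) log p(i); 0 log 0 taken as 0
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m])))

def skew_jensen(q, p, alpha, G=neg_entropy):
    # scaled skew-Jensen divergence J_{α,G}(q || p) for α ∈ (0, 1)
    mix = (1 - alpha) * q + alpha * p
    return ((1 - alpha) * G(q) + alpha * G(p) - G(mix)) / (alpha * (1 - alpha))

u = np.full(4, 0.25)
p = np.array([0.7, 0.1, 0.1, 0.1])
forward = skew_jensen(u, p, 0.2)
reverse = skew_jensen(p, u, 0.2)
```

By convexity of G, both values are non-negative, and for α ≠ 1/2 they generally differ.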
Note that J_{α,G}(q || p_θ) ≠ J_{α,G}(p_θ || q) in general, although equality does hold for some choices of G and α. We define the generalized entropy regularizer as R(θ) = D_{J_{α,G}}(u || p_θ), where u is the uniform distribution.6 These regularizers promote entropy because they push the model p_θ towards u, which is the maximum-entropy distribution, with an entropy of log(|Y|+1). Throughout the rest of this paper, we primarily use the generator function7 G(p) = −H(p). We use J_α as shorthand for J_{α,−H}.

Figure 1: Different divergence measures between u, the uniform distribution, and p, a probability distribution over a Bernoulli random variable X. Note that the confidence penalty is equivalent to KL(p || u) = J_0 and label smoothing is equivalent to KL(u || p) = J_1 (see §3.1). We include the entropy H(p) and J_{α,G}(u || p) for α = 0.5 and G(p) = ||p||²₂.
We note that J_α is equivalent to quadruple the Jensen-Shannon (JS) divergence at α = 1/2 and asymptotically approaches the Kullback-Leibler (KL) divergence as α approaches 0 or 1. Specifically, we have:

J_{1/2}(q || p) = 4 · JS(q || p)
lim_{α→0} J_α(q || p) = KL(p || q)
lim_{α→1} J_α(q || p) = KL(q || p)

We prove these relationships in App. A and App. B. For ease, we define J_1 := lim_{α→1} J_α and J_0 := lim_{α→0} J_α. We note the following two equivalences for these special cases.

Proposition 1. ∇_θ [L(θ) + β D_{J_1}(u || p_θ)] = ∇_θ [L(θ) + β D_KL(u || p_θ)]

In words, the gradient of the loss with GER as α → 1 is equivalent to the gradient of the loss augmented with label smoothing.

Proposition 2. ∇_θ [L(θ) + β D_{J_0}(u || p_θ)] = ∇_θ [L(θ) − β D_H(p_θ)]

In words, the gradient of the loss with GER as α → 0 is equivalent to the gradient of the loss augmented with the confidence penalty.
See App. C and App. D for proofs.

6 Distributions other than u may also be used. See §5.
7 We also experiment with G(p) = ||p||²₂. There is no standard trend for J_α as purely a function of α ∈ (0, 1).
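These limiting relationships can be checked numerically. The sketch below assumes the scaled skew-Jensen divergence with the negative-entropy generator, in the orientation where α → 1 recovers KL(q || p); all helper names are ours:

```python
import numpy as np

def neg_entropy(p):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m])))

def J(q, p, a):
    # scaled skew-Jensen divergence with generator -H, α ∈ (0, 1)
    mix = (1 - a) * q + a * p
    return ((1 - a) * neg_entropy(q) + a * neg_entropy(p)
            - neg_entropy(mix)) / (a * (1 - a))

def kl(x, y):
    m = x > 0
    return float(np.sum(x[m] * np.log(x[m] / y[m])))

q = np.full(4, 0.25)                 # uniform u
p = np.array([0.7, 0.1, 0.1, 0.1])

near_one = J(q, p, 1 - 1e-6)         # ≈ KL(q || p): label smoothing term
near_zero = J(q, p, 1e-6)            # ≈ KL(p || q): confidence penalty term
mid = 0.5 * (q + p)
js = 0.5 * kl(q, mid) + 0.5 * kl(p, mid)
```

At α = 1/2 the identity J_{1/2} = 4 · JS holds exactly, not just in the limit.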

Formal Properties of J α
When fitting a model p_θ, we generally optimize the inclusive KL, i.e. KL(p̃ || p_θ), so that, among other reasons, p_θ has support everywhere that p̃ has support. However, it is unclear what relationships we want to encourage between the model p_θ and the uniform distribution u during regularization, as matching the complete support of u would imply that no word can ever have zero probability. Here we explore formal properties of J_α as a regularizer to gain insight into how, as a function of α, these regularizers affect the learned distribution.
Magnitude. Figure 1 shows the different divergence measures between u and p θ . We see that J 1 = KL(u || p θ ) (label smoothing) is much larger than J 0 = KL(p θ || u) (confidence penalty) at values of p θ farther from u. This indicates that J 1 would be a stronger regularizer than J <1 , i.e. penalize values of p θ far from u more heavily, given the same strength coefficient β. Note that it is not always the case that J <1 (u || p) ≤ J 1 (u || p) for fixed p. We can, however, bound J α from above and below by other quantities.
A proof by counterexample is shown in Figure 2.
Proposition 4. For fixed p, J_α can be bounded from above and below.

Table 2: Results include baseline models with no (entropy) regularization and standard label smoothing with γ = 0.1 (equivalent to β ≈ 0.11). We report scores from the best model found (on the validation set) for D_{J_0}, D_{J_1}, and D_{J_α} over all α, β pairs. BLEU standard deviation across random seeds was typically < 0.1 and always < 0.16.8 Results for MTTT Ja-En and convolutional architectures can be found in App. H.

Sparsity. Sparsity is generally a desirable trait in probabilistic models; specifically for structured prediction, it leads to improvements in performance and interpretability (et al., 2018). For example, Martins and Astudillo (2016) showed the benefits of using sparsemax, which induces sparsity in an output distribution or attention layer, for natural language inference tasks. There are also intuitive reasons for allowing p_θ to be sparse. Part of modeling language generation tasks is learning when particular sequences cannot, or at least should not, occur (e.g. are grammatically or syntactically incorrect). In these cases, a model should be able to assign 0 probability mass to such a sequence. However, there is no sparse optimal solution p_θ when using label smoothing, as the label smoothing loss function diverges if p_θ does not assign probability mass to every y ∈ supp(u).
See App. F for a proof.
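The divergence is easy to see numerically: as a distribution approaches sparsity, KL(u || p) grows without bound, while KL(p || u) remains bounded by the log of the vocabulary size. A small sketch (helper name is ours):

```python
import numpy as np

def kl(q, p):
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

u = np.full(4, 0.25)
near_sparse = np.array([1 - 3e-12, 1e-12, 1e-12, 1e-12])
mild = np.array([0.97, 0.01, 0.01, 0.01])

# the label smoothing term blows up as the distribution nears sparsity...
ls_sparse, ls_mild = kl(u, near_sparse), kl(u, mild)
# ...while the confidence penalty term stays bounded by log 4
cp_sparse = kl(near_sparse, u)
```

Taking an entry of p to exactly 0 would make the label smoothing term infinite, whereas the confidence penalty term tends to its finite bound.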

Experiments
We evaluate our family of entropy regularizers on two language generation tasks: machine translation and abstractive summarization. We then analyze trends in model performance as a function of α and model entropy9 and explore how this entropy affects other properties of language generation models. In the following experiments, each model is trained using eq. (4) where R(θ) = D_{J_α}(u || p_θ). We conduct searches over α and β using Bayesian optimization (Snoek et al., 2012) to find the combination of regularizer D_{J_α} and strength coefficient β that leads to the lowest loss on the development set for the respective task.10 We additionally do a more fine-grained grid search over β for J_0 (confidence penalty) and J_1 (label smoothing) for completeness. All other model hyperparameters are held constant. We run experiments on multiple architectures and across several data sets to ensure trends are general.

8 We have α ≈ 1 as an exception; the standard deviation is slightly higher for larger values of β.
9 Model entropy is estimated as an average of the entropies of distributions at each time step during decoding, i.e. Ĥ(p_θ) = D_H(p_θ). Entropy is normalized by the maximum possible entropy for the given vocabulary size (log |Y|) in all figures and tables to control for the fact that languages have vocabularies of different sizes.
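The normalized model-entropy estimate described in the footnote can be sketched as follows (the function name is ours): per-step entropies are averaged and divided by the maximum possible entropy for the vocabulary size.

```python
import numpy as np

def normalized_model_entropy(step_dists):
    # average per-time-step entropy, normalized by log |vocab|
    ents = []
    for p in step_dists:
        m = p > 0
        ents.append(-float(np.sum(p[m] * np.log(p[m]))))
    vocab_size = len(step_dists[0])
    return float(np.mean(ents)) / np.log(vocab_size)

uniform_steps = [np.full(4, 0.25), np.full(4, 0.25)]   # maximally entropic
peaky_steps = [np.array([1.0, 0.0, 0.0, 0.0])]         # fully confident
```

Normalization makes the statistic comparable across languages with different vocabulary sizes, as the text notes.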

Neural Machine Translation
We explore the performance of the regularizer D_{J_α} on NMT systems using three language pairs and corpora of two different sizes: WMT'14 German-to-English (De-En) (Bojar et al., 2014), IWSLT'14 German-to-English (De-En) (Cettolo et al., 2012), and the Multitarget TED Talks Task (MTTT) French-to-English (Fr-En) and Japanese-to-English (Ja-En) tasks (Duh, 2018). For the larger WMT data set, we train fewer models using coarser-grained α and β ranges. We perform experiments for both Transformers (Vaswani et al., 2017) and convolutional sequence-to-sequence models (Gehring et al., 2017).
For reproducibility and comparability, we use the data pre-processing scripts provided by fairseq (Ott et al., 2019) and follow recommended hyperparameter settings from previous work (Vaswani et al., 2017; Gehring et al., 2017) for baseline models. We use SacreBLEU (Post, 2018) to calculate BLEU scores (Papineni et al., 2002). Specific data pre-processing steps and model hyperparameter details are provided in App. G. Decoding is performed with length-normalized beam search with a beam size of 5 unless otherwise stated. Early stopping was used during training; model parameters were taken from the checkpoint with the best validation set BLEU.

Results of our experiments are shown in Table 2 and Figure 3. We see the same relation between model entropy and BLEU with both Transformer and convolutional architectures and between different language pairs. We show results for the Transformer architectures inline as they are the current standard for many NLP tasks; results for convolutional architectures are in App. H. Our results show better performance is achieved with values of α and β other than those that correspond to label smoothing with γ = 0.1, which is the commonly used value for the strength coefficient (Vaswani et al., 2017; Edunov et al., 2018). Moreover, the relationship between model entropy and evaluation performance is strong, following the same trend for all values of α, which suggests that tuning a model for a specific entropy, rather than for specific α, β, may be a better method in practice. We discuss trends in §4.3.

Abstractive Summarization

Results in Table 3 show that optimal values of ROUGE-L (Lin, 2004), the evaluation metric, can be achieved by regularizing with D_{J_α} for different values of α. Notably, the entropy is virtually the same for the models that achieve top performance, demonstrating the closer relationship of performance with model entropy than with α, discussed further in §4.3.

Significance of α and Model Entropy
We look at the strength of the relationship between the evaluation metrics and both α and the model's entropy. Figure 3 shows a quadratic relationship between model entropy and BLEU. On the other hand, the relationship between α (coloring of points) and BLEU is not an obvious one; the best performing models are regularized with various values of α.
As correlation only tells us about linear relationships, we report mutual information to measure the strength of the relationship between α, model entropy, and BLEU. Mutual information quantifies the proportion of entropy of a variable that is "explained" by another and is often used as a generalized correlation measure, i.e. one that also captures nonlinear relationships (Song et al., 2012). We see in Figure 4 that model entropy has a much stronger relationship with BLEU than α does. Indeed, the normalized mutual information (NMI) between α and BLEU is ≈ 0.05, compared to ≈ 0.25 between model entropy and BLEU, implying that any flavor of entropy regularization can lead to similar performance.
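As an illustration of this kind of analysis, mutual information between two samples can be estimated with a simple histogram-based plug-in estimator. The paper uses non-parametric estimators (Beirlant et al., 1997); the helper below is our own simplified sketch:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    # plug-in estimate of I(X; Y) in nats from a joint histogram
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    m = pxy > 0
    return float(np.sum(pxy[m] * np.log(pxy[m] / (px @ py)[m])))

# when Y is an exact copy of X, I(X; Y) = H(X)
x = np.repeat([0.0, 1.0], 50)
perfect = mutual_information(x, x)
```

For a fair binary variable, H(X) = log 2, so the estimate above recovers exactly that value; NMI further divides by an entropy term to land in [0, 1].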
Figure 4: Entropy H(·), conditional entropy H(· | ·), and mutual information I(·; ·) for BLEU with α and model entropy, respectively. Model entropy explains a greater portion of the variability in BLEU than α does. Non-parametric estimates are used for all values (Beirlant et al., 1997). Data from IWSLT'14 De-En Transformer models.

While the relationship between α and BLEU is weak, it is still statistically significant. Some evidence for this exists in Figure 3, where a closer examination reveals that each level of α has a similar quadratic trend, albeit with a different offset. Specifically, the performance of models trained with D_{J_α} for α ∈ [0.75, 1] (which includes label smoothing) starts to degrade at lower levels of entropy than that of models trained with D_{J_α} for α ∈ [0, 0.25] (which includes the confidence penalty). As quantitative validation of this observation, we (i) run a conditional independence test to see whether BLEU and α are conditionally independent given model entropy and (ii) look at the range of β for which D_{J_α} leads to good performance for different α.
Conditional Independence. If α and BLEU are conditionally independent given model entropy, then the value of α supplies no additional information about the value of BLEU once model entropy is known, i.e. α does not matter when using the regularizer D_{J_α}. We use a Monte Carlo permutation test where the null hypothesis is that no relationship between α and BLEU exists.11 However, this test rejects the null hypothesis with p-value < 0.05, supporting the alternate hypothesis that α and BLEU are not conditionally independent.
Tuning β. On the tasks for which we trained > 60 models, we take the subset of models for which performance is within ≈ 1% (< 0.4 BLEU) of the best overall model. We then look at the range of β used with the regularizer D_{J_α} for these models. The range of β that meets the above criterion is much larger for α close to 0 than for α close to 1 (see Figure 5). We contend this implies that D_{J_α} is easier to tune (i.e. it is more robust) for α ≈ 0, while for α ≈ 1, D_{J_α} is relatively sensitive to β.

11 The underlying distributions of the random variables are assumed to be Gaussian. See Legendre (2000) for more details.

Table 4: Percentage of words assigned probability mass < ε at two values of ε (below which we consider the probability functionally 0) for models trained with D_{J_1} and D_{J_0}. To control for entropy, all models used in the calculation have entropy within the same 1% range.

Label Smoothing (D_{J_1}):      38% ± 0.01%     0.0% ± 5e-5%
Confidence Penalty (D_{J_0}):   54% ± 5e-3%     0.7% ± 4e-4%

Sparsity
We take a subset of models trained with regularizers D J 0 and D J 1 and examine the sparsity of p θ . Results in Table 4 support our formal analysis regarding the sparsity of D J 0 and D J 1 in §3.2; D J 1 steeply penalizes sparsity while D Jα for α < 1 allows words to be assigned probability ≈ 0.
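The measurement behind Table 4 can be sketched as counting, per time-step distribution, the fraction of the vocabulary that falls below a threshold ε (an illustrative helper of our own, with a made-up threshold):

```python
import numpy as np

def frac_below(step_dists, eps):
    # average fraction of vocabulary entries with probability mass < eps,
    # i.e. entries we treat as functionally zero
    return float(np.mean([np.mean(p < eps) for p in step_dists]))

dists = [np.array([0.97, 0.01, 0.01, 0.01]),   # nearly sparse
         np.array([0.25, 0.25, 0.25, 0.25])]   # uniform
sparse_frac = frac_below(dists, eps=0.02)
```

A model trained with the confidence penalty can drive many entries below ε, while label smoothing keeps all entries bounded away from zero.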

Sequence Likelihood
We look at how the probability (under p_θ) of the reference sequence on the test set changes with model entropy. While higher model entropy trends positively with downstream evaluation metrics (Figure 3), our experiments show that higher-entropy models often assign lower log-likelihood to the reference sequence.
Both of these observations have been made for models trained with label smoothing in previous work (Ott et al., 2018; Müller et al., 2019). However, log-likelihood alone does not tell a complete story. During decoding, we search for the most probable sequence relative to other candidate sequences. This implies that a more relevant calculation would be that of the overall ranking in Y of the reference sequence, or of the log-likelihood of the reference sequence relative to the most probable sequence. Since the former is typically impossible to calculate exactly due to the size of Y, we approximate it by looking at the average ranking in Y of each word in the reference sequence. In Figure 6, we see that higher-entropy models generally rank the reference sequence lower than lower-entropy models do; this result is surprising because higher-entropy models generally perform better on downstream evaluation metrics, e.g. BLEU. Notably, this decrease in ranking is less prominent for models regularized with α ≈ 0. In Figure 8, we see that while lower-entropy models place more probability mass on the reference sequence, the reference sequence is still far from probable compared to the decoded sequence. However, the ratio of log-likelihoods of the reference to the decoded sequence is larger for high-entropy models, which shows that, in this context, the reference sequence has higher relative log-likelihood under higher-entropy models.

Figure 6: Average ranking in p_θ of words in the reference sequence on the test set for IWSLT'14 (De-En), plotted against model entropy. Overall trends show a decrease in the ranking of the reference for models with more entropy regularization. Notably, the reference is generally ranked higher for models regularized with D_{J_α} for α ≈ 0 than for α ∈ [0.25, 1).
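The average-rank approximation can be sketched as follows; rank 1 means the reference token is the model's top choice (helper names are ours):

```python
import numpy as np

def avg_reference_rank(step_dists, reference_ids):
    # rank of each reference token under its per-step distribution,
    # averaged over time steps (1 = most probable)
    ranks = [1 + int(np.sum(p > p[y]))
             for p, y in zip(step_dists, reference_ids)]
    return float(np.mean(ranks))

dists = [np.array([0.5, 0.3, 0.2]), np.array([0.5, 0.3, 0.2])]
avg_rank = avg_reference_rank(dists, [0, 1])   # ranks 1 and 2
```

This per-word statistic sidesteps ranking full sequences, which is intractable over the exponentially large set Y.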

Decoding
In language generation tasks, estimated distributions are fed to decoding algorithms to create sequence predictions. To fully understand how model entropy affects performance for these tasks, we must explore the potential interactions between model entropy and the decoding strategy. Chorowski and Jaitly (2017) saw that with label smoothing, prediction accuracy improved and so using a wider beam during beam search did not give further improvements; however, our results suggest otherwise. As shown in Figure 7, the trend in BLEU vs. model entropy stays remarkably constant for beam search as the beam width is varied, including for greedy decoding (a beam size of 1). Perhaps unsurprisingly though, higher entropy is detrimental to the performance of decoding with random sampling (with temperature T = 1). However, this phenomenon could potentially be remedied by decreasing the temperature during decoding, a common practice for avoiding sampling from the tail of the distribution (Kirkpatrick et al., 1983).

Figure 7: BLEU scores on the IWSLT'14 De-En validation set with the convolutional architecture, by decoding strategy and model entropy. The trend in BLEU stays remarkably constant for beam search as the beam width is varied. Performance declines drastically for higher-entropy models when random sampling is used. Color reflects average distance from the baseline model.
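Temperature scaling during sampling can be sketched as below: dividing the logits by T < 1 sharpens the distribution, concentrating probability mass away from the tail (an illustrative helper, not the paper's implementation):

```python
import numpy as np

def sample_with_temperature(logits, T, rng):
    # softmax over temperature-scaled logits, then one categorical draw
    z = logits / T
    z = z - z.max()                  # subtract max for numerical stability
    p = np.exp(z)
    p = p / p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
logits = np.array([5.0, 0.0, 0.0])
# at T = 0.1 the scaled distribution is essentially deterministic
tok = sample_with_temperature(logits, T=0.1, rng=rng)
```

At T = 1 this is plain ancestral sampling; as T → 0 it approaches greedy decoding.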

Discussion
Our experiments show entropy regularization has a number of beneficial effects on natural language generation models. Low-entropy predictions, which are more closely aligned with the empirical distribution (Figure 8), are a sign of overfitting in a model, since they lead to poor generalization abilities (Figure 3). In other words, we observe that closely approximating the empirical distribution is at odds with a well-calibrated model, i.e. a model p_θ(y | x) that matches the true, underlying probabilities p(y | x).12 Entropy regularization appears to alleviate this problem; namely, for more regularized models, Figure 3 shows increased evaluation metric scores and Figure 8 demonstrates an increase in the log-likelihood of the reference sequence relative to the highest probability sequence.

Decoding. Overconfident predictions inhibit the ability to recover after a poor choice of words during decoding; Chorowski and Jaitly (2017) suggest that higher-entropy models p_θ, like the ones resulting from regularization with label smoothing, would alleviate this problem. Results throughout this paper support this hypothesis not just for label smoothing, but for the D_{J_α} family of entropy regularizers as well.
Choosing the baseline distribution. Throughout this work, we use the uniform distribution u as the baseline distribution for the regularizer D_{J_α}. However, one could also use some other distribution defined over the vocabulary, such as the unigram distribution (Chorowski and Jaitly, 2017) or a function of word embedding distance to the target word (Kumar and Tsvetkov, 2019; Li et al., 2020). Both have proven to be more effective than u when used with label smoothing and the confidence penalty. However, using distributions other than u with D_{J_α} leads to indirect forms of entropy regularization; specifically, the mathematical relationship to entropy regularization becomes more convoluted. Therefore, we leave the application of GER to other distributions as a topic for future work.

Related Work
Entropy regularization has a long history in reinforcement learning (Williams and Peng, 1991;Mnih et al., 2016;Fox et al., 2016;Haarnoja et al., 2018) where it has provided substantial improvements in exploration. Such methods have since been adapted for supervised learning where they have proven to be reliable forms of regularization for various probabilistic modeling tasks (Grandvalet and Bengio, 2005;Smith and Eisner, 2007).
More recently, interpolating between exclusive and inclusive KL divergences has been explored in NMT by Xiao et al. (2019). However, this method was used for the objective function (i.e. between p̃ and p_θ) and not as a regularization technique (i.e. between a baseline distribution q and p_θ). Li et al. (2020) construct a baseline distribution q as a function of word embedding distances to use in place of the uniform distribution u in the label smoothing equation. This work is complementary to ours, as q can similarly be used in place of u with GER. Finally, our work is closest to that of Müller et al. (2019), which attempts to find the circumstances under which label smoothing has a positive effect on model performance. However, they do not explore entropy regularization as a whole, nor do they attempt to provide an explanation for why label smoothing works. We attempt to answer the "why" question through a quantitative analysis of label smoothing and empirical exploration of the relationship between model entropy and performance.

Conclusion
We discuss the properties of generalized entropy regularization and provide empirical results on two language generation tasks. We find entropy regularization leads to improvements over baseline systems on evaluation metrics for all values of the parameter α with our regularizer D_{J_α}. Theoretical and empirical evidence show label smoothing adds undesirable constraints to the model and is the hardest to tune of the regularizers tested. We therefore advocate the use of alternate forms of entropy regularization for language generation tasks.

A α-Jensen to KL

For reference, we repeat eq. (9), the definition of the skew-Jensen divergence for a strictly convex function G : Ω → R and probability distributions p, q:

J_{α,G}(q || p) := (1/(α(1 − α))) [(1 − α) G(q) + α G(p) − G(z_α)], where z_α := (1 − α) q + α p

We can rewrite the α-Jensen divergence with convex generator function G in terms of the Bregman divergence B_G(x, y) := G(x) − G(y) − ⟨∇G(y), x − y⟩. Since (1 − α)(q − z_α) + α(p − z_α) = 0, we may subtract

(1 − α)⟨∇G(z_α), q − z_α⟩ + α⟨∇G(z_α), p − z_α⟩ = 0

from the numerator of eq. (9); note that q − z_α = α(q − p) while p − z_α = (1 − α)(p − q), i.e. the two inner products involve q − p and p − q up to scaling. Regrouping terms based on their multiplier (either 1 − α or α), we can rewrite the equation as two Bregman divergences:

J_{α,G}(q || p) = (1/(α(1 − α))) [(1 − α) B_G(q, z_α) + α B_G(p, z_α)]

We then look at the behavior of this expression as α → 0. In this limit, z_α → q, so the first term vanishes (B_G(q, z_α) is of order α²), while the second term tends to B_G(p, q). If we expand B_G(p, q) using our generator function G(p) = Σ_i p(i) log p(i), we get

B_G(p, q) = Σ_i p(i) log p(i) − Σ_i q(i) log q(i) − Σ_i (log q(i) + 1)(p(i) − q(i))
          = Σ_i p(i) log (p(i)/q(i))

where the remaining terms cancel since q and p are both probability distributions summing to 1, so B_G(p, q) = KL(p || q). Hence lim_{α→0} J_α(q || p) = KL(p || q). Similarly, we can show lim_{α→1} J_α(q || p) = KL(q || p).

C Label Smoothing
For the case that α → 1, with q = u and p = p_θ, we have

J_1(u || p_θ(· | x)) = KL(u || p_θ(· | x)) = −H(u) − Σ_y u(y) log p_θ(y | x)

When J_1(u || p_θ(· | x)) is used as a regularizer for maximum likelihood training, we get the loss function

L(θ) + β D_{J_1}(u || p_θ) = L(θ) − β Σ_{(x,y) ∈ C} Σ_j Σ_{y'} u(y') log p_θ(y' | x, y_{<j}) + N

which is the unnormalized label-smoothed cross-entropy loss plus N, where N is constant with respect to θ.

F No Sparse Solution for J 1
Proof. By definition, for any distribution p over a vocabulary Y:

KL(u || p) = Σ_{y ∈ Y} u(y) log (u(y)/p(y))

Thus, if p_θ(y | x) → 0 for some y ∈ Y and some x ∈ X, we have J_1(u || p_θ) = KL(u || p_θ) → ∞. This means that label smoothing enforces that p_θ has support everywhere that u > 0, i.e. over all words y ∈ Y. For any α < 1, J_α allows for sparse solutions since lim_{x→0} x log x = 0.

G Data Pre-Processing and Hyperparameter Settings
For training with convolutional architectures we set hyperparameters, e.g. dropout, learning rate, etc., following Gehring et al. (2017). On IWSLT'14 and MTTT tasks, we follow the recommended Transformer settings for IWSLT'14 in fairseq.13 Hyperparameters for models trained on the WMT task are set following version 3 of the Tensor2Tensor toolkit (Vaswani et al., 2018). We use byte-pair encoding (BPE; Sennrich et al., 2016) for all languages. Vocabulary sizes for WMT and IWSLT'14 are set from recommendations for the respective tasks in fairseq; for the MTTT tasks, vocabulary sizes are tuned on models with standard label smoothing regularization. Similarly, the CNN/DailyMail data set is pre-processed and uses BPE following the same steps as Lewis et al. (2019). Hyperparameters are the same as for their model fine-tuned on CNN/DailyMail. Details are available on the fairseq website.14

13 https://github.com/pytorch/fairseq/tree/master/examples/translation
14 https://github.com/pytorch/fairseq/blob/master/examples/bart/README.cnn.md

H Additional Results

Figure 9: Model entropy vs. BLEU (validation set) on the Multitarget TED Talks Task Japanese-to-English (Ja-En) using a Transformer architecture; see Figure 3 for additional information.

Table 5: Test BLEU for IWSLT'14 German-to-English using a convolutional architecture and for MTTT Japanese-to-English using a Transformer architecture; see Table 2 for additional information.