Smoothing and Shrinking the Sparse Seq2Seq Search Space

Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences. While this setup has led to strong results in a variety of tasks, one unsatisfying aspect is its length bias: models give high scores to short, inadequate hypotheses and often make the empty string the argmax—the so-called cat got your tongue problem. Recently proposed entmax-based sparse sequence-to-sequence models present a possible solution, since they can shrink the search space by assigning zero probability to bad hypotheses, but their ability to handle word-level tasks with transformers has never been tested. In this work, we show that entmax-based models effectively solve the cat got your tongue problem, removing a major source of model error for neural machine translation. In addition, we generalize label smoothing, a critical regularization technique, to the broader family of Fenchel-Young losses, which includes both cross-entropy and the entmax losses. Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion and deliver improvements and better calibration properties on cross-lingual morphological inflection and machine translation for 7 language pairs.


Introduction
Sequence-to-sequence models (seq2seq: Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) have become a powerful and flexible tool for a variety of NLP tasks, including machine translation (MT), morphological inflection (MI; Faruqui et al., 2016), and grapheme-to-phoneme conversion (G2P; Yao and Zweig, 2015). These models often perform well, but they have a bias that favors short hypotheses. This bias is problematic: it has been pointed out as the cause (Koehn and Knowles, 2017; Yang et al., 2018; Murray and Chiang, 2018) of the beam search curse, in which increasing the width of beam search actually decreases performance on neural machine translation (NMT). Further illustrating the severity of the problem, Stahlberg and Byrne (2019) showed that the highest-scoring target sequence in NMT is often the empty string, a phenomenon they dubbed the cat got your tongue problem. These results are undesirable because they show that NMT models' performance depends on the search errors induced by a narrow beam. It would be preferable for models to assign higher scores to good translations than to bad ones, rather than to depend on search errors to make up for model errors.
The most common way to alleviate this shortcoming is by altering the decoding objective (Yang et al., 2018; Meister et al., 2020a), but this does not address the underlying problem: the model overestimates the probability of implausible hypotheses. Other solutions use alternate training strategies (Murray and Chiang, 2018; Shen et al., 2016), but it would be preferable not to change the training algorithm.
In this paper, we propose a solution based on sparse seq2seq models, which replace the output softmax (Bridle, 1990) with the entmax transformation. Entmax, unlike softmax, can learn locally sparse distributions over the target vocabulary. This allows a sparse model to shrink the search space: that is, it can learn to give inadequate hypotheses zero probability, instead of counting on beam search to prune them. This has already been demonstrated for MI, where the set of possible hypotheses is often small enough to make beam search exact. We extend this analysis to MT: although exact beam search is not possible for this large-vocabulary task, we show that entmax models prune many inadequate hypotheses, effectively solving the cat got your tongue problem.
Despite this useful result, one drawback of entmax is that it is not compatible with label smoothing (Szegedy et al., 2016), a useful regularization technique that is widely used for transformers (Vaswani et al., 2017). We solve this problem by generalizing label smoothing from the cross-entropy loss to the wider class of Fenchel-Young losses (Blondel et al., 2020), which includes the entmax loss as a particular case. We show that combining label smoothing with entmax loss improves results on both character- and word-level tasks while keeping the model sparse. We note that, although label smoothing improves calibration, it also exacerbates the cat got your tongue problem regardless of loss function.
To sum up, we make the following contributions (our code is available at https://github.com/deep-spin/S7):

• We show empirically that models trained with entmax loss rarely assign nonzero probability to the empty string, demonstrating that entmax loss is an elegant way to remove a major class of NMT model errors.
• We generalize label smoothing from the cross-entropy loss to the wider class of Fenchel-Young losses, exhibiting a formulation for label smoothing which, to our knowledge, is novel.
• We show that Fenchel-Young label smoothing with entmax loss is highly effective on both character- and word-level tasks. Our technique allows us to set a new state of the art on the SIGMORPHON 2020 shared task for multilingual G2P (Gorman et al., 2020). It also delivers improvements for crosslingual MI from SIGMORPHON 2019 (McCarthy et al., 2019) and for MT on IWSLT 2017 German ↔ English (Cettolo et al., 2017), KFTT Japanese ↔ English (Neubig, 2011), WMT 2016 Romanian ↔ English (Bojar et al., 2016), and WMT 2014 English → German (Bojar et al., 2014) compared to smoothed and unsmoothed cross-entropy loss.

Background
A seq2seq model learns a probability distribution p_θ(y | x) over sequences y from a target vocabulary V, conditioned on a source sequence x. This distribution is then used at decoding time to find the most likely sequence ŷ:

ŷ = argmax_{y ∈ V*} p_θ(y | x),    (1)

where V* is the Kleene closure of V. This is an intractable problem; seq2seq models depend on heuristic search strategies, most commonly beam search (Reddy et al., 1977).

Most seq2seq models are locally normalized, with probabilities that decompose by the chain rule:

p_θ(y | x) = ∏_{i=1}^{|y|} p_θ(y_i | x, y_{<i}).    (2)

This factorization implies that the probability of a hypothesis being generated is monotonically non-increasing in its length, which favors shorter sequences. This phenomenon feeds the beam search curse because short hypotheses are pruned from a narrow beam but survive a wider one.

The conditional distribution p_θ(y_i | x, y_{<i}) is obtained by first computing a vector of scores (or "logits") z = f_θ(x, y_{<i}) ∈ ℝ^{|V|}, where f_θ is parameterized by a neural network, and then applying a transformation π : ℝ^{|V|} → △^{|V|}, which maps scores to the probability simplex △^{|V|} := {p ∈ ℝ^{|V|} : p ≥ 0, 1^⊤p = 1}. The usual choice for π is softmax (Bridle, 1990), which returns strictly positive values, ensuring that all sequences y ∈ V* have nonzero probability. Coupled with the short sequence bias, this causes significant model error.
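To make the length bias concrete, the following toy sketch (our own illustration, not from the paper's codebase; the numbers are hypothetical) scores successive prefixes of a hypothesis under a locally normalized model: every additional token multiplies in a probability of at most 1, so the score can only decrease as the sequence grows.

import torch

# Toy per-step output distributions over a 5-word vocabulary (index 0 = EOS).
torch.manual_seed(0)
step_logits = torch.randn(4, 5)                 # 4 decoding steps, |V| = 5
step_probs = torch.softmax(step_logits, dim=-1)

hypothesis = [3, 1, 4, 0]                       # token ids, ending in EOS
log_prob = 0.0
for t, token in enumerate(hypothesis):
    log_prob += torch.log(step_probs[t, token]).item()
    print(f"prefix of length {t + 1}: log p = {log_prob:.3f}")
# The running log-probability never increases: a sequence can never be more
# probable than any of its own prefixes, which is what favors short outputs.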
For α ≥ 1, the α-entmax transformation maps a score vector to the probability simplex:

α-entmax(z) := [(α − 1)z − τ(z)1]_+^{1/(α−1)},    (4)

where τ(z) is the unique threshold such that the output sums to one and [·]_+ denotes the positive part; softmax is recovered in the limit α → 1. Entmax transformations are sparse for any α > 1, with higher values tending to produce sparser outputs. This sparsity allows a model to assign exactly zero probability to implausible hypotheses. For tasks where there is only one correct target sequence, this often allows the model to concentrate all probability mass into a small set of hypotheses, making search exact. This is not possible for open-ended tasks like machine translation, but the model is still locally sparse, assigning zero probability to many hypotheses. These hypotheses will never be selected at any beam width.
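As a small illustration of this local sparsity, the sketch below compares the supports of softmax, 1.5-entmax, and sparsemax on the same toy logits; it assumes the entmax package (pip install entmax) and is only a minimal demonstration, not part of the experimental pipeline.

import torch
from entmax import entmax15, sparsemax   # pip install entmax

torch.manual_seed(0)
logits = 2.0 * torch.randn(10)             # scores for a toy 10-word vocabulary

p_softmax = torch.softmax(logits, dim=-1)  # dense: every token keeps p > 0
p_entmax15 = entmax15(logits, dim=-1)      # alpha = 1.5: some tokens get exactly 0
p_sparsemax = sparsemax(logits, dim=-1)    # alpha = 2: typically even sparser

for name, p in [("softmax", p_softmax), ("1.5-entmax", p_entmax15), ("sparsemax", p_sparsemax)]:
    print(f"{name:>10}: support size = {int((p > 0).sum())} of {p.numel()}")
# Tokens with exactly zero probability can never appear in any hypothesis,
# so the set of sequences with nonzero probability shrinks at every step.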
Fenchel-Young Losses. Inspired by the softmax generalization above, Blondel et al. (2020) provided a tool for constructing convex loss functions. Let Ω : △^{|V|} → ℝ be a strictly convex regularizer which is symmetric, i.e., Ω(Πp) = Ω(p) for any permutation Π and any p ∈ △^{|V|}. Equipped with Ω, we can define a regularized prediction function π̂_Ω : ℝ^{|V|} → △^{|V|} of the form

π̂_Ω(z) := argmax_{p ∈ △^{|V|}} p^⊤z − Ω(p),    (5)

where z ∈ ℝ^{|V|} is the vector of label scores (logits). Equation 5 recovers both softmax and entmax with particular choices of Ω: the negative Shannon entropy, Ω(p) = Σ_{y∈V} p_y log p_y, recovers the variational form of softmax (Wainwright and Jordan, 2008), while the negative Tsallis entropy (Tsallis, 1988) with parameter α, defined as

Ω_α(p) = (1/(α(α − 1))) (Σ_{y∈V} p_y^α − 1)  for α ≠ 1,    (6)

recovers the α-entmax transformation in (4). Given the choice of Ω, the Fenchel-Young loss function L_Ω is defined as

L_Ω(z; q) := Ω*(z) + Ω(q) − z^⊤q,    (7)

where q is a target distribution, most commonly a one-hot vector indicating the gold label, q = e_{y*} = [0, . . . , 0, 1, 0, . . . , 0] (with the 1 in the y*-th entry), and Ω* is the convex conjugate of Ω, defined variationally as

Ω*(z) := max_{p ∈ △^{|V|}} p^⊤z − Ω(p).    (8)

The name stems from the Fenchel-Young inequality, which states that the quantity (7) is nonnegative (Borwein and Lewis, 2010, Prop. 3.3.4).

[Figure 1: Diagram illustrating Fenchel-Young losses and the particular case of the α-entmax family. The case α = 1 corresponds to softmax and the cross-entropy loss, α = 2 to the sparsemax loss, and α = 1.5 to the 1.5-entmax loss. Any choice of α > 1 can lead to sparse distributions.]
When Ω is the negative Shannon entropy, the loss (7) becomes the Kullback-Leibler divergence between q and softmax(z) (KL divergence; Kullback and Leibler, 1951), which equals the cross-entropy when q is a one-hot vector. More generally, if Ω ≡ Ω_α is the negative Tsallis entropy (6), we obtain the α-entmax loss. Fenchel-Young losses have nice properties for training neural networks with backpropagation: they are non-negative, convex, and differentiable as long as Ω is strictly convex (Blondel et al., 2020, Prop. 2). Their gradient is

∇_z L_Ω(z; q) = π̂_Ω(z) − q,    (9)

which generalizes the gradient of the cross-entropy loss. Figure 1 illustrates the particular cases of Fenchel-Young losses considered in this paper.

Fenchel-Young Label Smoothing
Label smoothing (Szegedy et al., 2016) has become a popular technique for regularizing the output of a neural network. The intuition behind it is that using the gold target labels from the training set can lead to overconfident models. To overcome this, label smoothing redistributes probability mass from the gold label to the other target labels. When the redistribution is uniform, Pereyra et al. (2017) and Meister et al. (2020b) pointed out that this is equivalent (up to scaling and adding a constant) to adding a second term to the loss that computes the KL divergence D_KL(u ‖ p_θ) between a uniform distribution u and the model distribution p_θ. While it might seem appealing to add a similar KL regularizer to a Fenchel-Young loss, this is not possible when p_θ contains zeroes, because the KL divergence term becomes infinite. This makes vanilla label smoothing incompatible with sparse models.

Fortunately, there is a more natural generalization of label smoothing to Fenchel-Young losses. For ε ∈ [0, 1], we define the Fenchel-Young label smoothing loss as follows:

L_{Ω,ε}(z; e_{y*}) := L_Ω(z, (1 − ε)e_{y*} + εu).    (10)

The intuition is the same as in cross-entropy label smoothing: the target one-hot vector is mixed with a uniform distribution. This simple definition leads to the following result, proved in Appendix A:

Proposition 1. The Fenchel-Young label smoothing loss can be written as

L_{Ω,ε}(z; e_{y*}) = L_Ω(z; e_{y*}) + ε(z_{y*} − z̄) + C,    (11)

where C is a constant which does not depend on z, and z̄ := u^⊤z is the average of the logits. Furthermore, up to a constant, we also have

L_{Ω,ε}(z; e_{y*}) = (1 − ε)L_Ω(z; e_{y*}) + εL_Ω(z; u).    (12)

The first expression (11) shows that, up to a constant, the smoothed Fenchel-Young loss equals the original loss plus a linear regularizer ε(z_{y*} − z̄). While this regularizer can be positive or negative, we show in Appendix A that its sum with the original loss L_Ω(z, e_{y*}) is always non-negative; intuitively, if the score z_{y*} is below the average, resulting in negative regularization, the unregularized loss will also be larger, and the two terms balance each other. Figure 2 shows the effect of this regularization on the graph of the loss: a correct prediction is linearly penalized with a slope of ε; the larger the confidence, the larger the penalty. In particular, when Ω is the Shannon negentropy, this result gives a simple expression for vanilla label smoothing which, to the best of our knowledge, is novel.

The second expression (12) shows that Fenchel-Young label smoothing can also be seen as a form of regularization towards the uniform distribution. When −Ω is the Shannon entropy, the regularizer becomes a KL divergence and we recover the interpretation of label smoothing for cross-entropy provided by Pereyra et al. (2017) and Meister et al. (2020b). Therefore, the same interpretation holds for the entire Fenchel-Young family if the regularization uses the corresponding Fenchel-Young loss with respect to a uniform distribution.
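As a concrete illustration, here is a minimal PyTorch sketch (our own naming, not the paper's released implementation) of the smoothed loss for the 1.5-entmax case, computed directly from definition (10) together with the Fenchel-Young form (7), L_Ω(z; q) = Ω*(z) + Ω(q) − z^⊤q; it assumes the entmax package for the π̂_Ω forward mapping.

import torch
from entmax import entmax15   # pip install entmax

ALPHA = 1.5  # must match the entmax mapping used below


def tsallis_neg_entropy(p, alpha=ALPHA):
    # Omega_alpha(p) = (sum_j p_j^alpha - 1) / (alpha * (alpha - 1)), alpha != 1
    return (p.pow(alpha).sum(dim=-1) - 1.0) / (alpha * (alpha - 1))


def fy_label_smoothed_entmax15_loss(z, gold, eps=0.1):
    """Fenchel-Young label smoothing loss with the 1.5-entmax regularizer.

    z: (batch, |V|) logits; gold: (batch,) gold label indices.
    """
    V = z.size(-1)
    # Smoothed target q = (1 - eps) * e_{y*} + eps * u
    q = torch.full_like(z, eps / V)
    q.scatter_(-1, gold.unsqueeze(-1), 1.0 - eps + eps / V)
    # Regularized prediction p* = 1.5-entmax(z); Omega*(z) = <z, p*> - Omega(p*)
    p_star = entmax15(z, dim=-1)
    omega_conj = (z * p_star).sum(-1) - tsallis_neg_entropy(p_star)
    # L_Omega(z; q) = Omega*(z) + Omega(q) - <z, q>   (Eq. 7 with the smoothed target of Eq. 10)
    loss = omega_conj + tsallis_neg_entropy(q) - (z * q).sum(-1)
    return loss.mean()

For example, fy_label_smoothed_entmax15_loss(torch.randn(32, 1000), torch.randint(0, 1000, (32,))) returns a scalar that can be backpropagated as usual; setting eps=0 recovers the unsmoothed 1.5-entmax loss.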
Gradient of the Fenchel-Young smoothed loss. From Prop. 1 and Equation 9, we immediately obtain the following expression for the gradient of the smoothed loss:

∇_z L_{Ω,ε}(z; e_{y*}) = π̂_Ω(z) − (1 − ε)e_{y*} − εu.    (13)

That is, this gradient is obtained by adding a constant vector to the original gradient of the Fenchel-Young loss; like the latter, it only requires the ability to compute the π̂_Ω transformation, which is efficient in the entmax case. Note that, unlike the gradient of the original entmax loss, the gradient of its smoothed version is not sparse (in the sense that it will not contain many zeroes); however, since u is the uniform distribution, it will contain many constant entries with value −ε/|V|.

Experiments

In all tasks, we vary two hyperparameters:

• Entmax Loss α: this influences the sparsity of the probability distributions the model returns, with α = 1 recovering cross-entropy and larger α values encouraging sparser distributions. We use α ∈ {1, 1.5, 2} for G2P and MI, and α ∈ {1, 1.5} for MT.

• Label Smoothing ε: we compare unsmoothed training (ε = 0) with ε ∈ {0.01, 0.1}.
We trained all models with early stopping for a maximum of 30 epochs for MI, 15 epochs for WMT 2014 English → German MT, and 100 epochs otherwise, keeping the best checkpoint according to a task-specific validation metric: Phoneme Error Rate for G2P, average Levenshtein distance for MI, and detokenized BLEU score for MT. At test time, we decoded with a beam width of 5. Our PyTorch code (Paszke et al., 2017) is based on JoeyNMT (Kreutzer et al., 2019) and the entmax implementation from the entmax package (https://github.com/deep-spin/entmax).

Multilingual G2P
Data. We use the data from SIGMORPHON 2020 Task 1 (Gorman et al., 2020), which includes 3600 training examples in each of 15 languages. We train a single multilingual model (following Peters and Martins, 2020) which must learn to apply spelling rules from several writing systems.
Training. Our models are similar to the RNNs of Peters and Martins (2020), but with 1.5-entmax attention and language embeddings only on the source side.

Results. Multilingual G2P results are shown in Table 1, along with the best previous result. We report two error metrics, each of which is computed per language and averaged:

• Word Error Rate (WER) is the percentage of hypotheses which do not exactly match the reference. This harsh metric gives no credit for partial matches.
• Phoneme Error Rate (PER) is the sum of Levenshtein distances between each hypothesis and the corresponding reference, divided by the total length of the references.
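For concreteness, both metrics can be computed in a few lines of Python; the sketch below (our own helper names, toy data) follows the definitions above, reporting WER as a percentage and PER as the summed edit distance divided by the total reference length.

def levenshtein(a, b):
    # Dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def wer_per(hypotheses, references):
    """Word Error Rate (%) and Phoneme Error Rate over parallel lists of phoneme sequences."""
    mismatches = sum(h != r for h, r in zip(hypotheses, references))
    wer = 100.0 * mismatches / len(references)
    total_edits = sum(levenshtein(h, r) for h, r in zip(hypotheses, references))
    per = total_edits / sum(len(r) for r in references)
    return wer, per


# Toy example with made-up phoneme strings:
hyps = [["k", "ae", "t"], ["d", "ao", "g", "z"]]
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
print(wer_per(hyps, refs))  # (50.0, 0.1666...)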
These results show that the benefits of sparse losses and label smoothing can be combined. Individually, both label smoothing and sparse loss functions (α > 1) consistently improve over unsmoothed cross-entropy (α = 1). Together, they produce the best reported result on this dataset. Our approach is very simple, as it requires manipulating only the loss function: there are no changes to the standard seq2seq training or decoding algorithms, no language-specific training or tuning, and no external auxiliary data. In contrast, the previous state of the art (Yu et al., 2020) relies on a complex self-training procedure in which a genetic algorithm is used to learn to ensemble several base models.

Crosslingual MI
Training. We reimplemented GATEDATTN (Peters and Martins, 2019), an RNN model with separate encoders for the lemma and the morphological tags. We copied their hyperparameters, except that we used two layers for all encoders. We concatenated the high-resource and low-resource training data. To make sure the model paid attention to the low-resource training data, we either oversampled it 100 times or used data hallucination (Anastasopoulos and Neubig, 2019) to generate synthetic examples. Hallucination worked well for some languages but not others, so we treated it as a hyperparameter.

Results. We compare to CMU-03 (Anastasopoulos and Neubig, 2019), a two-encoder model with a sophisticated multi-stage training schedule; we use the official task numbers from McCarthy et al. (2019), which are more complete than those reported in Anastasopoulos and Neubig (2019). Despite our models' simpler training technique, they performed nearly as well in terms of accuracy, while recording, to our knowledge, the best Levenshtein distance on this dataset.

Machine Translation
Having shown the effectiveness of our technique on character-level tasks, we next turn to MT. To our knowledge, entmax loss has never been used for transformer-based MT; Correia et al. (2019) applied entmax within transformer attention heads, but not at the output layer. We used joint BPE (Sennrich et al., 2016) for all language pairs, with 25,000 merges for WMT14 and 32,000 merges for all other pairs.
Training. We trained transformers with the base dimension and layer settings (Vaswani et al., 2017). We optimized with Adam (Kingma and Ba, 2015) and used Noam scheduling with 20,000 warmup steps for WMT14 and 10,000 steps for the other pairs. The batch size was 8192 tokens.

[Table 3: MT results, averaged over three runs. For label smoothing, we select the best ε on the development set. Note that WMT14 refers to WMT 2014 English → German.]

Table 3 reports our models' performance in terms of untokenized BLEU (Papineni et al., 2002), which we computed with SacreBLEU (Post, 2018). The results show a clear advantage for label smoothing and entmax loss, both separately and together: label-smoothed entmax loss is the best-performing configuration on 3 out of 7 language pairs, unsmoothed entmax loss performs best on another 3 out of 7, and they tie on the remaining one. Although label-smoothed cross-entropy is seen as an essential ingredient for transformer training, entmax loss models beat it even without label smoothing for every pair except EN→DE.

Analysis
Model error. Stahlberg and Byrne (2019) showed that the bias in favor of short strings is so strong for softmax NMT models that the argmax sequence is usually the empty string. However, they did not consider the impact of sparsity or label smoothing (they trained with "transformer-base" settings, which imply label smoothing, and did not compare to unsmoothed losses). We show in Table 4 how often the empty string is more probable than the beam search hypothesis. This is an upper bound on how often the empty string is the argmax, because there can also be other hypotheses that are more probable than the empty string. The results show that α and ε both matter: sparsity substantially reduces the frequency with which the empty string is more probable than the beam search hypothesis, while label smoothing usually increases it. Outcomes vary widely with α = 1.5 and ε = 0.1: WMT14 and DE↔EN models did not seriously suffer from the problem, EN→RO did, and the other three language pairs differed from one run to another. The optimal label smoothing value with cross-entropy is invariably ε = 0.1, which encourages the cat got your tongue problem; on the other hand, entmax loss does better with ε = 0.01 for every pair except RO→EN in terms of BLEU.
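The check behind Table 4 can be sketched as follows; the model interface here is hypothetical (a callable returning the next-token distribution given the source and a prefix), not JoeyNMT's actual API. The probability of the empty string is simply the probability of EOS at the first decoding step, which is compared against the total probability of the beam-decoded hypothesis.

import torch

def empty_string_beats_beam(model, src, beam_hyp, bos_id, eos_id):
    """True if the model scores the empty string above the beam-decoded hypothesis.

    `model(src, prefix)` is assumed to return the next-token distribution
    (softmax or entmax output over the vocabulary); `beam_hyp` ends in EOS.
    """
    # log p(empty string | x) = log p(EOS | x, BOS)
    first_step = model(src, torch.tensor([bos_id]))
    log_p_empty = torch.log(first_step[eos_id] + 1e-30)   # entmax may assign exactly 0

    # log p(beam hypothesis | x): sum of per-step log-probabilities
    prefix, log_p_beam = [bos_id], torch.tensor(0.0)
    for tok in beam_hyp:
        step = model(src, torch.tensor(prefix))
        log_p_beam = log_p_beam + torch.log(step[tok] + 1e-30)
        prefix.append(tok)
    return bool(log_p_empty > log_p_beam)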
Other inadequate strings. Even if a model rules out the empty string, it might assign nonzero probability to other short, inadequate strings. We investigated this with a depth-limited search inspired by Stahlberg and Byrne (2019)'s exact decoding technique. Unfortunately, the algorithm's exponential runtime made it infeasible to perform the search for all language pairs; in particular, we found it too slow for the dense search space of cross-entropy models, even after applying various optimizations. Therefore, we show results for EN→RO entmax loss models in Table 5. These results show the same trend as for the empty string: short strings are usually pruned by entmax loss models with ε = 0 or ε = 0.01, but are likely to have a higher score than the beam-decoded hypothesis with ε = 0.1.
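A rough sketch of this depth-limited search, with the same hypothetical model interface as above: it enumerates every complete hypothesis of length at most max_len that receives nonzero probability. The recursion is exponential in the depth limit, which is why it is only practical when the sparse output layer prunes most continuations.

import math

def short_hypotheses(model, src, bos_id, eos_id, max_len):
    """All complete hypotheses of length <= max_len with nonzero probability."""
    results = []

    def expand(prefix, log_p, depth):
        dist = model(src, prefix)                 # next-token distribution
        for tok, p in enumerate(dist.tolist()):
            if p == 0.0:
                continue                          # pruned by the sparse output layer
            if tok == eos_id:
                results.append((prefix[1:], log_p + math.log(p)))
            elif depth < max_len:
                expand(prefix + [tok], log_p + math.log(p), depth + 1)

    expand([bos_id], 0.0, 0)
    return results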
Label smoothing and sparsity. Previous work showed that RNN models trained with entmax loss become locally very sparse. Table 6 shows that this is true of transformers as well. Label smoothing encourages greater density, although for the densest language pair (WMT14) this only equates to an average support size of roughly 3300 out of a vocabulary of almost 30,000 word types. The relationship between density and overestimating the empty string is inconsistent with ε = 0.1: WMT14 and DE↔EN models become much more dense but rarely overestimate the empty string (Table 4). The opposite occurs for RO↔EN: models with ε = 0.1 become only slightly more dense but are much more prone to model error. This suggests that corpus-specific factors influence both sparsity and how easily bad hypotheses can be pruned.

Calibration. This is the degree to which a model's confidence about its predictions (i.e., class probabilities) accurately measures how likely those predictions are to be correct. It has been shown (Müller et al., 2019; Kumar and Sarawagi, 2019) that label smoothing improves the calibration of seq2seq models. We computed the Expected Calibration Error (ECE; Naeini et al., 2015) 10 of our MT models and confirmed their findings. Our results, in Table 7, also show that sparse models are better calibrated than their dense counterparts. This shows that entmax models do not become overconfident even though probability mass is usually concentrated in a small set of possibilities. The good calibration of label smoothing may seem surprising in light of Table 4, which shows that label-smoothed models overestimate the probability of inadequate hypotheses. However, ECE depends only on the relationship between model accuracy and the score of the most likely label. This shows the tradeoff: larger ε values limit overconfidence but make the tail heavier. Setting α = 1.5 with a moderate ε value seems to be a sensible balance.
10 ECE := Σ_{m=1}^{M} (|B_m|/N) |acc(B_m) − conf(B_m)|, where the model's N force-decoded predictions are partitioned into M evenly spaced bins B_m, acc(B_m) is the accuracy within bin m, and conf(B_m) is the average probability of the most likely prediction within that bin. We use M = 10.
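A minimal NumPy sketch of this computation (our own function name; bins are the half-open intervals (lo, hi], and M = 10 by default):

import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE over force-decoded predictions.

    confidences: probability of the argmax token at each step;
    correct: 1 if that argmax token matched the reference token, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()          # acc(B_m)
            conf = confidences[in_bin].mean()     # conf(B_m)
            ece += in_bin.sum() / n * abs(acc - conf)
    return ece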

Related Work
Label smoothing. Our work fits into a larger family of techniques that penalize model overconfidence. Pereyra et al. (2017) proposed the confidence penalty, which reverses the direction of the KL divergence in the smoothing expression. Meister et al. (2020b) then introduced a parameterized family of generalized smoothing techniques that recovers vanilla label smoothing and the confidence penalty as special cases; this family is different from Fenchel-Young label smoothing. In a different direction, Wang et al. (2020) improved inference calibration with a graduated label smoothing technique that uses larger smoothing weights for predictions about which a baseline model is more confident. Other works have smoothed over sequences instead of tokens (Elbayad et al., 2018; Lukasik et al., 2020), but this requires approximate techniques for deciding which sequences to smooth.
MAP decoding and the empty string. We showed that sparse distributions suffer less from the cat got your tongue problem than their dense counterparts. This makes sense in light of the finding that exact MAP decoding works for MI, where probabilities are very peaked even with softmax (Forster and Meister, 2020). For tasks like MT, this is not the case: Eikema and Aziz (2020) pointed out that the argmax receives so little mass that it is almost arbitrary, so seeking it with MAP decoding (which beam search approximates) itself causes many deficiencies of decoding. On the other hand, Meister et al. (2020a) showed that beam search has a helpful bias and introduced regularization penalties for MAP decoding that encode it explicitly. Entmax neither directly addresses the faults of MAP decoding nor compensates for the locality biases of beam search, instead shrinking the gap between beam search and exact decoding. It would be interesting, however, to experiment with these two approaches with entmax in place of softmax.

Conclusion
We generalized label smoothing from cross-entropy to the wider class of Fenchel-Young losses. When combined with the entmax loss, we showed meaningful gains on character- and word-level tasks, including a new state of the art on multilingual G2P. In addition, we showed that the ability of entmax to shrink the search space significantly alleviates the cat got your tongue problem in machine translation, while also improving model calibration.

A Proof of Proposition 1
For full generality, we consider label smoothing with an arbitrary distribution r ∈ △^{|V|}, which may or may not be the uniform distribution. We also consider an arbitrary gold distribution q ∈ △^{|V|}, not necessarily a one-hot vector. Later we will particularize to the case r = u = [1/|V|, . . . , 1/|V|] and q = e_{y*}, the case of interest in this paper. For this general case, the Fenchel-Young label smoothing loss is defined analogously to (10) as

L_{Ω,ε,r}(z; q) := L_Ω(z, (1 − ε)q + εr).    (14)

Expanding the definition (7) of the Fenchel-Young loss and rearranging terms gives

L_{Ω,ε,r}(z; q) = Ω*(z) + Ω((1 − ε)q + εr) − z^⊤((1 − ε)q + εr) = L_Ω(z; q) + ε(z^⊤q − z^⊤r) + C,    (15)

where C := Ω((1 − ε)q + εr) − Ω(q) is a constant which does not depend on z; this proves (11). Grouping terms differently,

L_{Ω,ε,r}(z; q) = (1 − ε)L_Ω(z; q) + εL_Ω(z; r) − I_{Ω,ε}(q; r),    (16)

where I_{Ω,ε}(q; r) := (1 − ε)Ω(q) + εΩ(r) − Ω((1 − ε)q + εr) also does not depend on z; this proves (12).
Note that, from the convexity of Ω and Jensen's inequality, we always have I_{Ω,ε}(q; r) ≥ 0.
If −Ω(q) ≤ −Ω(r) (i.e., if the regularizing distribution r has higher generalized entropy than the target distribution q, as is expected from a regularizer), then, by the convexity of Ω and the assumption Ω(r) ≤ Ω(q), the constant C above is non-positive, and therefore

L_Ω(z, (1 − ε)q + εr) ≤ L_Ω(z; q) + ε(z^⊤q − z^⊤r).    (17)

Since the left-hand side of (17) is by definition a Fenchel-Young loss, it must be non-negative. This implies that L_Ω(z; q) + ε(z^⊤q − z^⊤r) ≥ 0.
In the conditions of the paper, we have q = e_{y*} and r = u, which satisfies the condition −Ω(q) ≤ −Ω(r) (this is implied by Blondel et al. (2020, Prop. 4) and the fact that Ω is strictly convex and symmetric). In this case, z^⊤q = z_{y*} is the score of the gold label and z^⊤r = (1/|V|) Σ_y z_y = z̄ is the average score.
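The identity (11) is easy to check numerically. The sketch below (our own helper names, assuming the entmax package, 1.5-entmax case) compares the smoothed loss computed from definition (10) against the right-hand side of (11) with C = Ω((1 − ε)e_{y*} + εu) − Ω(e_{y*}).

import torch
from entmax import entmax15   # pip install entmax

def neg_tsallis(p, alpha=1.5):
    # Omega_alpha(p) = (sum_j p_j^alpha - 1) / (alpha * (alpha - 1))
    return (p.pow(alpha).sum(-1) - 1.0) / (alpha * (alpha - 1))

def fy_loss(z, q):
    # L_Omega(z; q) = Omega*(z) + Omega(q) - <z, q>, with Omega*(z) = <z, p*> - Omega(p*)
    p_star = entmax15(z, dim=-1)
    return (z * p_star).sum(-1) - neg_tsallis(p_star) + neg_tsallis(q) - (z * q).sum(-1)

torch.manual_seed(0)
V, eps, gold = 8, 0.1, 3
z = torch.randn(V)
e_y = torch.zeros(V)
e_y[gold] = 1.0
u = torch.full((V,), 1.0 / V)

lhs = fy_loss(z, (1 - eps) * e_y + eps * u)                      # definition (10)
C = neg_tsallis((1 - eps) * e_y + eps * u) - neg_tsallis(e_y)    # the constant in (11)
rhs = fy_loss(z, e_y) + eps * (z[gold] - z.mean()) + C           # right-hand side of (11)
print(torch.allclose(lhs, rhs))                                  # True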