Semantic Label Smoothing for Sequence to Sequence Problems

Label smoothing has been shown to be an effective regularization strategy in classification that prevents overfitting and helps in label de-noising. However, extending such methods directly to seq2seq settings, such as Machine Translation, is challenging: the large target output space of such problems makes it intractable to apply label smoothing over all possible outputs. Most existing approaches for seq2seq settings either do token-level smoothing, or smooth over sequences generated by randomly substituting tokens in the target sequence. Unlike these works, in this paper we propose a technique that smooths over well-formed, relevant sequences that not only have sufficient n-gram overlap with the target sequence, but are also semantically similar. Our method shows a consistent and significant improvement over the state-of-the-art techniques on different datasets.


Introduction
Label smoothing is a regularization technique commonly used in deep learning (Szegedy et al., 2016; Chorowski and Jaitly, 2017; Vaswani et al., 2017; Zoph et al., 2018; Real et al., 2018; Huang et al., 2019) that improves calibration (Müller et al., 2019) and helps in label de-noising (Lukasik et al., 2020a). Here, one smooths labels by introducing a prior in the label space (often just a uniform distribution) in order to prevent overly confident predictions and achieve better model calibration, both of which lead to better generalization.
Given these benefits, it is natural to consider whether label smoothing can be applied to sequence-to-sequence (seq2seq) prediction tasks in Natural Language Processing. Here, inducing a label prior involves smoothing in sequence space. However, this is a challenging task because the output space is exponentially large for sequences, unlike the label space in standard classification. Previous works approached this challenge either by smoothing over individual tokens of the target sequence, or by sampling a few nearby targets according to Hamming distance or BLEU score (Norouzi et al., 2016; Elbayad et al., 2018). These techniques, however, do not guarantee that the smoothed targets lie within the space of acceptable targets (i.e., a sampled target may no longer be grammatically correct or preserve the meaning of the original).
In this work, we propose a label smoothing approach for seq2seq problems that overcomes this limitation. Given a large-scale corpus of valid sequences, our approach selects a subset of sequences that are not only semantically similar to the target sequence, but also well formed. We achieve this using a pre-trained model to find semantically similar sequences from the corpus, and then use BLEU scores to rerank the closest targets. We empirically show that this approach improves over competitive baselines on multiple machine translation tasks.

Related Works
Token-level smoothing. A popular approach used in language tasks is so-called token-level smoothing, where for each position's classification loss, a prior distribution over the entire vocabulary (uniform or based on unigram probability estimates) is used for regularization (Pereyra et al., 2017; Edunov et al., 2017). This is similar to classical label smoothing (e.g., Szegedy et al., 2016), as it smooths each token label independently of its context and position in the sequence. Such an approach is thus unlikely to result in semantically related targets.

Sequence-level smoothing. Norouzi et al. (2016) augment the loss with a term rewarding predictions of sampled sequences. The sampling of sequences is based on their edit distance or Hamming distance to the target. This method thus smooths the loss over similar sequences (in terms of the edit distance) with smoothing rewards. Elbayad et al. (2018) employ a similar technique, but with a new reward function based on BLEU (Papineni et al., 2002) or CIDEr (Vedantam et al., 2015) score. Specifically, Elbayad et al. (2018) generate a smoothed version of the target sequence, wherein one replaces a token with a random token (with up-sampling of rare words). Such newly generated sequences are given a partial reward based on the cosine similarity between the two tokens in a pretrained word-embedding space. This differs from our approach because this context-independent perturbation can only generate new sequences with the same structure as the original sequence. Zheng et al. (2018), on the other hand, construct grammatically correct and meaning-preserving sequences. However, unlike our work, their approach relies on having multiple references (target sequences per input sequence) and might not be able to generate sequences where common words or synonyms do not appear in the same order. This is a strong limitation, precluding an augmentation like: "Yesterday, he scored a 94 on his final" (original sequence), "He had 94 points in the final test yesterday" (augmented sequence).
More broadly, an important shortcoming of such approaches is that sequences deemed close can actually lack important properties such as preserving the meaning of the original sequence. In particular, swapping even a single token in a sequence may cause a drastic shift in its meaning (e.g., turning a factually correct text into a false one) even though the two sequences are close in Hamming space. We address this shortcoming by restricting augmented target sequences to the training set, and selecting sequences based on similarity obtained from a pretrained model. Unlike other approaches, Bengio et al. (2015) proposed a scheduled sampling technique that does not depend on any external data source. Instead, it utilizes the self-generated sequences from the current model. Our approach and scheduled sampling are similar in that both aim at improving model generalization, by providing either semantically similar candidates (ours) or self-generated sequences (theirs). Indeed, the two approaches could complement each other by providing different sources of related but not exact targets.
Hard negative mining. Our work is also related to hard negative mining approaches that select a subset of confusing negatives for each input (Mikolov et al., 2013; Reddi et al., 2019; Guo et al., 2018). Different from the above, we add a soft objective function over the sampled (relevant) target sequences, rather than treating them as negatives in the classification sense.

Method
Sequence-to-sequence (seq2seq) learning involves learning a mapping from an input sequence x (e.g., a sentence in English) to an output sequence y (e.g., a sentence in French). Canonical applications include machine translation and question answering.
Formally, let X denote the space of input sequences (e.g., all possible English sentences), and Y the space of output sequences (e.g., all possible French sentences). We represent by x = [x_1, x_2, ..., x_N] an input sequence consisting of N tokens, and similarly y = [y_1, y_2, ..., y_N′] an output sequence with N′ tokens. Our goal is to learn a function f : X → Y that, given an input sequence, generates a suitable target sequence.
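To achieve this goal, we have a training set S ⊆ (X × Y)^n comprising pairs of input and output sequences. We then seek to minimise the objective
\[ L(\theta) = -\sum_{(x, y) \in S} \log p_\theta(y \mid x; \theta), \tag{1} \]
where p_θ(· | x; θ) is a parametrized distribution over all possible output sequences. Given such a distribution, we choose
\[ f(x) = \arg\max_{y' \in Y} p_\theta(y' \mid x; \theta). \]
Observe that one may implement (1) via a token-level decomposition,
\[ p_\theta(y \mid x; \theta) = \prod_{i=1}^{N'} p_\theta(y_i \mid x, y_1, \ldots, y_{i-1}; \theta). \]
This may be understood as a maximum likelihood objective, or equivalently the cross-entropy between p_θ(· | x; θ) and a one-hot distribution concentrated on y.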
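As a minimal sketch of the token-level decomposition, the snippet below computes a sequence log-likelihood as a sum of per-token log-probabilities and sums the negative log-likelihood over a training set. The function names and the probability values are hypothetical; in practice the per-token probabilities would come from a model such as a Transformer decoder.

```python
import math


def sequence_log_prob(token_probs):
    """Return log p(y | x) given the per-token probabilities p(y_i | x, y_<i)."""
    return sum(math.log(p) for p in token_probs)


def nll_objective(all_token_probs):
    """Negative log-likelihood summed over a set of target sequences."""
    return -sum(sequence_log_prob(tp) for tp in all_token_probs)


# Hypothetical per-token probabilities for one 3-token target sequence.
probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(probs)

# Summing log-probabilities is the same as taking the log of their product.
product = 0.5 * 0.25 * 0.8
print(math.isclose(math.exp(log_p), product))
```

Working in log space avoids numerical underflow when sequences are long, which is why the product is rarely computed directly.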
Label smoothing meets seq2seq. Intuitively, the cross-entropy objective encourages the model to score the observed sequence y higher than any "competing" sequence y′ ≠ y. While this is a sensible goal, one limitation observed from classification settings is that the loss may lead to models that are overly confident in their predictions, which can hamper generalisation (Guo et al., 2017).
Label smoothing (Szegedy et al., 2016; Pereyra et al., 2017; Müller et al., 2019) is a simple means of correcting this in classification settings. Smoothing involves simply adding a small reward to all possible incorrect labels, i.e., mixing the standard one-hot label with a uniform distribution over all labels. This regularizes the training and generally leads to better predictive performance as well as probabilistic calibration (Müller et al., 2019).
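For illustration, classification label smoothing can be sketched as follows. The mixing weight `epsilon` and the function names are assumptions made for this example; the mixture of a one-hot label with a uniform distribution follows the description above.

```python
import math


def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Mix a one-hot label with a uniform distribution over all classes."""
    uniform = 1.0 / num_classes
    return [
        (1.0 - epsilon) * (1.0 if i == true_index else 0.0) + epsilon * uniform
        for i in range(num_classes)
    ]


def cross_entropy(target_dist, predicted_dist):
    """Cross-entropy between a target distribution and model probabilities."""
    return -sum(
        t * math.log(p) for t, p in zip(target_dist, predicted_dist) if t > 0
    )


# A 4-class problem where class 1 is the observed label.
target = smooth_labels(true_index=1, num_classes=4, epsilon=0.1)
prediction = [0.1, 0.7, 0.1, 0.1]  # hypothetical model output
loss = cross_entropy(target, prediction)
```

With the smoothed target, every incorrect class receives a small non-zero weight, so a model that drives the predicted probability of incorrect classes towards zero is penalised.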
Given the success of label smoothing in classification settings, it is natural to explore its value in seq2seq problems. However, standard label smoothing is clearly inadmissible: it would require smoothing over all possible outputs y′ ∈ Y, which is typically an intractably large set. Nonetheless, we may follow the basic intuition of smoothing by adding a subset of related targets to the observed sequence y, yielding a smoothed loss that combines the log-likelihood of y with an α-weighted log-likelihood term over the sequences in R(y). Here, R(y) is a set of related sequences that are similar to the ground truth y, and α > 0 is a tuning parameter that controls how much we rely on the observed versus related sequences. The quality of R(y) is important for our task. Ideally, we would like an R(y) that: (i) is efficient to compute, and (ii) comprises sequences which meaningfully align with x (e.g., are plausible alternate translations). We now assess several options for constructing R(y) in light of the above.
Random sequences. One simple choice is to choose a random subset of output sequences from the training set. In the common setting where f is learned by minibatch SGD on randomly drawn minibatches B = {(x^(i), y^(i))}, one may simply pick R(y) to be all output sequences in B. Such random sequences carry general information about the target language (e.g., French grammar for an English to French translation task). However, these sequences are unlikely to have any semantic correlation with the true label.
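A smoothed sequence-level loss of this kind can be sketched as below. The exact weighting is an assumption made for illustration: the sketch averages the log-likelihoods of the related sequences and scales that average by α, and it excludes the observed target from the in-batch set. The function names are hypothetical.

```python
def smoothed_loss(log_p_target, log_p_related, alpha):
    """Negative log-likelihood of y plus an alpha-weighted term over R(y).

    Assumption: the related term is the mean log-likelihood over R(y).
    """
    if not log_p_related:
        return -log_p_target
    related_term = sum(log_p_related) / len(log_p_related)
    return -(log_p_target + alpha * related_term)


def in_batch_related(batch_targets, index):
    """Random-sequence baseline: use the other targets in the minibatch as R(y).

    Assumption: the observed target at `index` is left out of its own R(y).
    """
    return [y for j, y in enumerate(batch_targets) if j != index]


# Hypothetical log-likelihoods under the current model.
loss = smoothed_loss(log_p_target=-1.0, log_p_related=[-2.0, -4.0], alpha=0.1)
others = in_batch_related(["y0", "y1", "y2"], index=1)
```

Averaging keeps the contribution of the related term independent of the size of R(y), so the same α can be reused when the number of related sequences changes.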
Algorithm 1 Sampling of related sequences.
Input: example (x, y); sequences Y_ref
Output: related sequences R(y)
1: Embed reference sequences, e.g., using BERT.
2: N(y) ← k closest sequences to y from Y_ref in the embedding space.
3: Sort elements of N(y) by BLEU score to y.
4: R(y) ← top k′ elements from N(y).
Token-level smoothing. To ensure greater semantic correlation between the selected sequences and the original y, one idea is to perform token-level smoothing. For example, Vaswani et al. (2017) proposed to smooth uniformly over all tokens from the vocabulary. Elbayad et al. (2018) proposed to construct sequences y′ = [y′_1, y′_2, ..., y′_N′] where for a randomly selected subset of positions i ∈ [N′], y′_i is some related token in the minibatch; for all other positions, y′_i = y_i. These related tokens are chosen so as to maximise the BLEU score between y and y′.
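The token-substitution idea can be sketched as follows. This is a simplified illustration, not the procedure of Elbayad et al. (2018): replacement tokens are drawn uniformly at random from a hypothetical candidate pool rather than chosen to maximise BLEU, and all names are assumptions.

```python
import random


def perturb_tokens(target, pool, num_swaps, rng):
    """Replace tokens at randomly chosen positions with tokens from a pool.

    Positions that are not selected keep their original token, so the
    perturbed sequence always has the same length as the target.
    """
    num_swaps = min(num_swaps, len(target))
    positions = rng.sample(range(len(target)), num_swaps)
    perturbed = list(target)
    for i in positions:
        perturbed[i] = rng.choice(pool)
    return perturbed


rng = random.Random(0)  # fixed seed for reproducibility
target = ["he", "scored", "94", "on", "his", "final"]
pool = ["she", "had", "points", "test"]  # hypothetical candidate tokens
perturbed = perturb_tokens(target, pool, num_swaps=2, rng=rng)
```

The sketch makes the structural constraint visible: however the replacement tokens are chosen, the output keeps the length and word order of the original target.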
While this approach increases the semantic similarity to y, operating on a token level is limiting. For example, one may change the meaning of a factual sentence by changing even a few words. Further, operating at a per-token level limits the diversity of R(y), since, e.g., all sequences have the same number of tokens and structure as y.
Proposal: semantic smoothing. To overcome the limitations of token-level smoothing, we would ideally like to directly smooth over related sequences. Our basic idea is to seek sequences y′ that score highly under both s_sem(y, y′) and s_bleu(y, y′), where s_sem is a score of semantic similarity, and s_bleu is the BLEU score. Intuitively, our relevant sequences comprise those that are both semantically similar to y, and have sufficient unigram overlap.
A key challenge is efficiently identifying semantically similar sequences to y. To achieve this in a tractable manner, we propose the following procedure (see Algorithm 1). First, we assume the existence of an embedding space for output sequences. For example, this could be the result of BERT (Devlin et al., 2019), which embeds each sequence into a fixed vector representation. Given such an embedding space and a corpus Y_ref of reference sequences, we may now efficiently compute the neighbors of y, N(y), comprising the top-k closest sequences in Y_ref for the given y (Indyk and Motwani, 1998). The elements of N(y) can be expected to have high semantic similarity with y, which is desirable. However, such sequences may not meaningfully align with the original input x (e.g., may not be sufficiently close translations). To account for this, we prune the elements from N(y) based on the BLEU score. Intuitively, this pruning retains sequences that are both semantically similar and have non-trivial token overlap with y.
We take Y_ref to be all output sequences in the training set. In practice, however, one may use any set of sequences that are valid for the domain in question. We find the k = 100 closest sequences in this space, and smooth over the k′ = 5 pruned sequences with the highest BLEU score to y. In Table 1 we show example augmentations. Notice both the diversity of augmentations, as well as relatedness to the original targets.
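The retrieve-then-rerank procedure of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are plain lists of floats supplied by the caller (standing in for BERT vectors), the nearest-neighbor search is exhaustive rather than approximate, and clipped n-gram precision is used as a simplified stand-in for BLEU. The caller is also assumed to have removed y itself from the reference set.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: a simplified stand-in for BLEU."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def related_sequences(y_tokens, y_vec, refs, k=100, k_prime=5):
    """Return R(y) from `refs`, a list of (tokens, embedding) pairs."""
    # Step 2: the k references closest to y in the embedding space.
    neighbors = sorted(refs, key=lambda r: cosine(y_vec, r[1]), reverse=True)[:k]
    # Steps 3-4: rerank by n-gram overlap with y and keep the top k'.
    reranked = sorted(
        neighbors, key=lambda r: ngram_precision(r[0], y_tokens), reverse=True
    )
    return [tokens for tokens, _ in reranked[:k_prime]]
```

For a large reference corpus, the exhaustive sort in step 2 would normally be replaced by an approximate nearest-neighbor index, and the stand-in overlap score by a full BLEU implementation.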

Experiments
Setup. We use the Transformer model for our experiments, and follow the experimental setup and hyperparameters from Vaswani et al. (2017). We experiment on three popular machine translation tasks: English-German (EN-DE), English-Czech (EN-CS) and English-French (EN-FR), using the WMT training datasets and the tensor2tensor framework (Vaswani et al., 2018). We evaluate on Newstest 2015 for EN-DE and EN-CS, and on WMT 2014 for EN-FR.
Baselines. We use the seq2seq model results by Vaswani et al. (2017) as a baseline. We compare our approach with the following alternate smoothing methods: i) smoothing over all possible tokens from the vocabulary at each next-token prediction (Szegedy et al., 2016), ii) smoothing over random targets from within the batch (Guo et al., 2018), and iii) smoothing over artificially generated targets that are close to the actual target sequence according to BLEU score (Elbayad et al., 2018). For all these methods we experiment with values of α in {0.1, 0.001, 0.0001, 0.00001}, and report the best results in each case. For the Elbayad et al. (2018) baseline, we follow the reported best-performing variant, randomly swapping tokens with others from the target sequence.
Main results. In Table 2 we report results from our method (BERT+BLEU) and the different state-of-the-art methods mentioned above. Our most direct comparison is against Elbayad et al. (2018), as both methods smooth over sequences that have a high BLEU score. However, instead of generating sequences by randomly replacing tokens, we retrieve them from a corpus of well-formed text sequences. In particular, we use the BERT-base multilingual model to embed all the training target sequences into a 768-dimensional fixed vector representation (corresponding to the CLS token) and then identify the top-100 nearest neighbors for each target sequence. Consequently, our method outperforms Elbayad et al. (2018) by a large margin on all three benchmarks. This demonstrates the importance of smoothing over sequences that not only have significant n-gram overlap with the ground truth target sequence but are also well formed and semantically similar to the ground truth. In Table 3 we report the comparison between our model and the strongest baseline on EN-CS across multiple metrics, confirming the improvement we report in Table 2 for BLEU score.
Ablating BLEU pruning. Table 4 reveals it is useful to use a sufficiently restrictive criterion in BLEU pruning; however, excess pruning (BLEU5) is harmful. Thus, we seek to retrieve semantically related targets which do not necessarily have the highest-scoring n-gram overlap with the original target. This is intuitive: enforcing too high an n-gram overlap may cause all augmented targets to be too lexically similar, limiting the benefit of seeing new targets in training. We also experimented with not reranking neighbors using BLEU pruning, which resulted in no improvement over the baseline. In other words, it was essential to use this kind of postprocessing for obtaining improvements.
Ablating the number of neighbors. We experimented with how the number of neighbors influences the results. For EN-CS, we obtained BLEU4 scores of 21.86, 22.82, and 22.23 for 10, 5, and 3 neighbors, respectively. Overall, we find that too few or too many neighbors harm the performance compared to the 5 neighbors we used in other experiments. At the same time, the time complexity increases linearly as the number of neighbors increases.

Conclusion
We propose a novel label smoothing approach for sequence-to-sequence problems that selects a subset of sequences that are not only semantically similar to the target sequences, but are also well formed. We achieve this by using a pre-trained model to find semantically similar sequences from the training corpus, and then we use BLEU score to rerank the closest targets. Our method shows a consistent and significant improvement over state-of-the-art techniques across different datasets.
In future work, we plan to apply our semantic label smoothing technique to various sequence to sequence problems, including Text Summarization (Zhang et al., 2019) and Text Segmentation (Lukasik et al., 2020b). We also plan to study the relation between pretraining and data augmentation techniques.