Rethinking Perturbations in Encoder-Decoders for Fast Training

We often use perturbations to regularize neural models. For neural encoder-decoders, previous studies applied scheduled sampling (Bengio et al., 2015) and adversarial perturbations (Sato et al., 2019), but these methods require considerable computational time. Thus, this study addresses the question of whether these approaches are worth their training cost. We compare several perturbations for sequence-to-sequence problems with respect to computational time. Experimental results show that simple techniques such as word dropout (Gal and Ghahramani, 2016) and random replacement of input tokens achieve comparable (or better) scores than the recently proposed perturbations, even though these simple methods are faster.


Introduction
Recent advances in neural encoder-decoders have driven tremendous success in sequence-to-sequence problems including machine translation, summarization (Rush et al., 2015), and grammatical error correction (GEC) (Ji et al., 2017). Since neural models can be excessively expressive, previous studies have proposed various regularization methods to avoid over-fitting.
To regularize neural models, we often apply a perturbation (Goodfellow et al., 2015; Miyato et al., 2017), which is a small difference from a correct input. During the training process, we force the model to output the correct labels for both perturbed inputs and unmodified inputs. In sequence-to-sequence problems, existing studies regard the following as perturbed inputs: (1) sequences containing tokens replaced from correct ones (Bengio et al., 2015; Cheng et al., 2019), and (2) embeddings injected with small differences (Sato et al., 2019). For example, Bengio et al. (2015) proposed scheduled sampling, which samples a token from the output probability distribution of the decoder and uses it as a perturbed input for the decoder. Sato et al. (2019) applied an adversarial perturbation, which significantly increases the loss value of a model, to the embedding spaces of neural encoder-decoders.
Those studies reported that their methods are effective for constructing robust encoder-decoders. However, their methods are much slower than training without such perturbations because they require at least one additional forward computation to obtain the perturbation. In fact, scheduled sampling (Bengio et al., 2015) requires running the decoder as many times as the number of perturbations needed. For adversarial perturbations (Sato et al., 2019), we must also compute backpropagation in addition to the forward computation because gradients are used to obtain the perturbations.
Those properties seriously affect the training budget. For example, it costs approximately 1,800 USD per run to train Transformer (big) with adversarial perturbations (Sato et al., 2019) on the widely used WMT English-German training set on AWS EC2. Most studies conduct multiple runs for hyper-parameter search and/or model ensembling to achieve better performance (Barrault et al., 2019), which incurs a tremendous training budget when such perturbations are used. Strubell et al. (2019) and Schwartz et al. (2019) pointed out that recent neural approaches have increased computational costs substantially, and encouraged exploring cost-efficient methods. For instance, Li et al. (2020) explored a training strategy to obtain the best model within a given training time. However, previous studies have paid little attention to the cost of computing perturbations.
Thus, we rethink perturbations from the viewpoint of time efficiency. In other words, we address the question of whether the perturbations proposed by recent studies as effective methods are also time efficient. We compare several perturbation methods for neural encoder-decoders in terms of computational time. We introduce computationally light methods such as word dropout (Gal and Ghahramani, 2016) and using randomly sampled tokens as perturbed inputs. These methods are sometimes regarded as baselines (Bengio et al., 2015), but experiments on translation datasets indicate that these simple methods surprisingly achieve scores comparable to those of previously proposed effective perturbations (Bengio et al., 2015; Sato et al., 2019) in a shorter training time. Moreover, we show that these simple methods are also effective for other sequence-to-sequence problems: GEC and summarization.

Definition of Encoder-Decoder
In this paper, we address sequence-to-sequence problems such as machine translation with neural encoder-decoders, and herein we provide a definition of encoder-decoders.
In sequence-to-sequence problems, neural encoder-decoders generate a sequence corresponding to an input sequence. Let x_{1:I} and y_{1:J} be input and output token sequences whose lengths are I and J, respectively: x_{1:I} = x_1, ..., x_I and y_{1:J} = y_1, ..., y_J. Neural encoder-decoders compute the following conditional probability:

p(Y | X; θ) = ∏_{j=1}^{J+1} p(y_j | y_{0:j-1}, X; θ),   (1)

where y_0 and y_{J+1} are special tokens representing beginning-of-sentence (BOS) and end-of-sentence (EOS) respectively, X = x_{1:I}, and Y = y_{1:J+1}. In the training phase, we optimize the parameters θ to minimize the negative log-likelihood on the training data. Let D be the training data consisting of a set of pairs of X_n and Y_n: D = {(X_n, Y_n)}_{n=1}^{|D|}. We minimize the following loss function:

L(θ) = -(1/|D|) Σ_{n=1}^{|D|} log p(Y_n | X_n; θ).   (2)
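To make the training objective concrete, here is a minimal sketch of the sequence-level negative log-likelihood in plain Python. The per-position probability tables are a toy stand-in for a decoder's softmax outputs, and all function names are illustrative rather than taken from any specific toolkit:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one output sequence.

    token_probs[j] is the model probability assigned to the correct
    token y_j given y_{0:j-1} and X (including the EOS position).
    """
    return -sum(math.log(p) for p in token_probs)

def corpus_loss(batch_token_probs):
    """Average sequence NLL over a set of training pairs, mirroring
    the 1/|D| normalization of the loss described above."""
    return sum(sequence_nll(p) for p in batch_token_probs) / len(batch_token_probs)

# Two toy "sentences": per-position probabilities of the correct tokens.
loss = corpus_loss([[0.9, 0.8, 0.95], [0.7, 0.99]])
```

A perfectly confident model (probability 1 at every position) yields zero loss; any uncertainty contributes positively.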

Definition of Perturbations
This section briefly describes the perturbations used in this study. We focus on three types: word replacement, word dropout, and adversarial perturbations.

Figure 1: Overview of perturbations used in this study. We can combine perturbations because each type of perturbation is orthogonal.

As shown in Figure 1, we can use all types of perturbations at the same time because they are orthogonal to each other. In fact, we combine word replacement with word dropout in our experiments.

Word Replacement: REP
We refer to any approach that uses a sampled token instead of a correct token, such as scheduled sampling (Bengio et al., 2015), as a word replacement approach. In this approach, we construct a new sequence whose tokens are randomly replaced with sampled tokens. To construct X′ from the sequence X, we sample x̃_i from a distribution Q_{x_i} and use it in the new sequence X′ with probability 1 - α:

x̃_i ∼ Q_{x_i},   (3)
x′_i = x_i with probability α; x̃_i otherwise.   (4)

We construct Y′ from the sequence Y in the same manner. Bengio et al. (2015) used a curriculum learning strategy to adjust α, proposing several functions that decrease α based on the training step. Their strategy uses correct tokens frequently at the beginning of training, whereas it favors sampled tokens at the end of training. We also adjust α with their inverse sigmoid decay:

α_t = q + (1 - q) · k / (k + exp(t / k)),   (5)

where q and k are hyper-parameters. In short, α_t decreases from 1 to q, depending on the training step t. We use α_t as α at step t.
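The decay schedule and the replacement rule can be sketched in a few lines of plain Python. The function names are ours, and the schedule below is one reading of the inverse sigmoid decay described above (α falling from roughly 1 toward the floor q):

```python
import math
import random

def inverse_sigmoid_decay(t, q=0.9, k=1000.0):
    """Probability alpha_t of keeping the correct token at step t.

    Starts near 1 (almost always use correct tokens) and decays
    toward the floor q as training proceeds.
    """
    return q + (1.0 - q) * k / (k + math.exp(t / k))

def replace_tokens(tokens, alpha, sample_fn, rng=random):
    """Keep each token with probability alpha; otherwise substitute a
    token drawn by sample_fn (i.e. from the distribution Q)."""
    return [tok if rng.random() < alpha else sample_fn(tok) for tok in tokens]
```

With the experimental values q = 0.9 and k = 1000, at most about 10% of tokens are replaced late in training.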
For Q x i , we prepare three types of distributions: conditional probability, uniform, and similarity.
Conditional Probability: REP(SS) Bengio et al. (2015) proposed scheduled sampling, which uses predicted tokens during training to address the gap between training and inference. Formally, scheduled sampling uses the following conditional probability as Q_{y_i}:

Q_{y_i} = p(· | y_{0:i-1}, X; θ).   (6)

Since scheduled sampling computes the perturbation for the decoder side only, it uses the correct sequence as the input on the encoder side. In other words, scheduled sampling does not provide any distribution for Q_{x_i}. The original scheduled sampling repeats the decoding for each token on the decoder side, and thus requires computational time proportional to the length of the decoder-side input sequence. To address this issue, Duckworth et al. (2019) proposed a more time-efficient method, parallel scheduled sampling, which computes the output probability distributions for all positions simultaneously. In this study, we use parallel scheduled sampling instead of the original method.
Uniform: REP(UNI) Scheduled sampling is slow even in its parallel variant because it requires at least one decoding pass to compute Equation (6). Thus, we introduce two faster methods to explore effective perturbations from the perspective of computational time. In the uniform approach, we use the uniform distribution over each vocabulary as Q_{x_i} and Q_{y_i}, respectively. For example, to construct the source-side perturbed input, we randomly pick a token from the source-side vocabulary and use it as x̃_i in Equation (4). This method was used as a baseline in the previous study (Bengio et al., 2015).
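REP(UNI) requires nothing beyond drawing uniformly from the vocabulary, which is why it adds almost no overhead. A sketch, with a toy vocabulary and illustrative names:

```python
import random

def uniform_sampler(vocab, rng=random):
    """Q for REP(UNI): a uniform distribution over the vocabulary,
    independent of the position and of the correct token."""
    def sample(_correct_token):
        return rng.choice(vocab)
    return sample

# Toy source-side vocabulary for illustration.
vocab = ["the", "a", "cat", "dog", "sat"]
sample = uniform_sampler(vocab)
```

The sampler ignores its argument entirely; the correct token plays no role in the choice of replacement.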
Similarity: REP(SIM) We also explore a more sophisticated choice than the uniform distribution. We assume that the conditional probability of Equation (6) assigns high probabilities to tokens that are similar to the correct input token. Based on this assumption, we construct a distribution that samples similar tokens frequently. Let V_x be the source-side vocabulary, E_x ∈ R^{|V_x|×d_x} be the d_x-dimensional embedding matrix, and e(x_i) be the function returning the embedding of x_i. We use the following probability distribution as Q_{x_i}:

Q_{x_i} = softmax(E_x e(x_i)),   (7)

where softmax(·) is the softmax function. Thus, Equation (7) assigns high probabilities to tokens whose embeddings are similar to e(x_i). In other words, Equation (7) measures similarity to x_i without considering any context. We compute the probability distribution for the target side using e(y_i) in the same manner.
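Equation (7) only needs the embedding matrix: take the dot product of e(x_i) with every row of E_x and apply a softmax. A self-contained sketch with made-up 2-dimensional embeddings (the vocabulary and vectors are toy assumptions for illustration):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def similarity_distribution(embeddings, i):
    """Q_{x_i} for REP(SIM): softmax over E_x . e(x_i), so tokens whose
    embeddings are close to e(x_i) receive high probability."""
    e_xi = embeddings[i]
    scores = [sum(a * b for a, b in zip(row, e_xi)) for row in embeddings]
    return softmax(scores)

# Toy embeddings: tokens 0 and 1 are near-duplicates, token 2 is far away.
E = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
q = similarity_distribution(E, 0)
```

Note that the correct token itself receives the highest probability, consistent with the distribution being context-free similarity to x_i.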

Word Dropout: WDROP
We apply the word dropout technique to compute the perturbed input. Word dropout randomly uses the zero vector instead of the embedding e(x_i) for the input token x_i (Gal and Ghahramani, 2016):

b_{x_i} ∼ Bernoulli(β),   (8)
WDrop(x_i, b_{x_i}) = b_{x_i} e(x_i),   (9)

where Bernoulli(β) returns 1 with probability β and 0 otherwise. Thus, WDrop(x_i, b_{x_i}) returns e(x_i) with probability β and the zero vector otherwise. We apply Equation (9) to each token in the input sequence and use the results as the perturbed input.
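The WDrop operation above amounts to scaling the embedding by a Bernoulli draw; a minimal sketch (names are illustrative):

```python
import random

def word_dropout(embedding, beta, rng=random):
    """Return the embedding with probability beta, else the zero vector
    of the same dimension (the WDrop operation described above)."""
    keep = 1 if rng.random() < beta else 0  # b ~ Bernoulli(beta)
    return [keep * v for v in embedding]
```

With the experimental value β = 0.9, each token's embedding is zeroed out about 10% of the time.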

Adversarial Perturbation: ADV
Miyato et al. (2017) proposed a method to compute adversarial perturbations in the embedding space. Their method adds adversarial perturbations to input embeddings instead of replacing correct input tokens with others. Sato et al. (2019) applied this approach to neural encoder-decoders and reported its effectiveness; this study follows their method. The method seeks the adversarial perturbation that most seriously damages the loss value, based on the gradient of the loss function L(θ), and adds it to the input token embedding. Let r_{x_i} ∈ R^{d_x} be the adversarial perturbation vector for the input token x_i. We obtain the perturbed input embedding e′(x_i) with the following equations:

g_{x_i} = ∇_{e(x_i)} L(θ),   (10)
r_{x_i} = ε g_{x_i} / ‖g_{x_i}‖,   (11)
e′(x_i) = e(x_i) + r_{x_i},   (12)

where ε is a hyper-parameter that controls the norm of the adversarial perturbation. We apply the above equations to all tokens in the input sequence.
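The key step is normalizing the gradient of the loss with respect to the token embedding and rescaling it to norm ε. A framework-free sketch in which the gradient is supplied by the caller (in practice it would come from autograd); function names are ours:

```python
import math

def adversarial_perturbation(grad, epsilon):
    """r = epsilon * g / ||g||: the direction in embedding space that
    locally increases the loss the most, rescaled to norm epsilon."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # no gradient signal: no perturbation
    return [epsilon * g / norm for g in grad]

def perturb_embedding(embedding, grad, epsilon):
    """Add the adversarial perturbation to a token embedding."""
    r = adversarial_perturbation(grad, epsilon)
    return [e + ri for e, ri in zip(embedding, r)]
```

Because the gradient is needed, this is the one perturbation in this study that requires a backward pass before the perturbed input can even be built.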

Training
When training with word replacement and/or word dropout perturbations, we search for parameters that predict the correct output sequence from the perturbed input. For example, in the word replacement approach, we minimize the following negative log-likelihood:

L′(θ) = -(1/|D|) Σ_{n=1}^{|D|} Σ_j log p(y_j | y′_{0:j-1}, X′_n; θ),   (13)

where X′ and y′ denote the perturbed input sequences.
Virtual Adversarial Training When we use adversarial perturbations, we train the parameters of the neural encoder-decoder to minimize both Equation (2) and a loss function A(θ) computed from perturbed inputs:

J(θ) = L(θ) + λ A(θ),   (14)

where λ is a hyper-parameter that controls the balance between the two loss functions. Following Sato et al. (2019), A(θ) is the averaged Kullback-Leibler divergence between the output distributions for clean and perturbed inputs:

A(θ) = (1/|D|) Σ_{n=1}^{|D|} KL(p(· | X_n; θ) ‖ p(· | X_n, r_n; θ)),   (15)

where r_n represents a concatenated vector of the adversarial perturbations for each input token, and KL(·‖·) denotes the Kullback-Leibler divergence. This calculation is reasonably time efficient given the cost of the perturbations themselves, because computing Equation (2), which is required to obtain the adversarial perturbations, is part of the training objective anyway.
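The combined objective can be sketched as a KL divergence between clean and perturbed output distributions, weighted by λ. This is a toy sketch under the assumptions above: in practice the distributions come from the decoder's softmax, and the names are illustrative:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def virtual_adversarial_objective(nll_loss, clean_dists, perturbed_dists, lam):
    """J = L + lambda * A, where A averages the KL divergence between
    output distributions for clean and perturbed inputs."""
    a = sum(kl_divergence(p, q) for p, q in zip(clean_dists, perturbed_dists))
    a /= len(clean_dists)
    return nll_loss + lam * a
```

If the perturbation does not change the output distribution, A vanishes and the objective reduces to the standard loss.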

Experiments on Machine Translation
To obtain findings on sequence-to-sequence problems, we conduct experiments in various situations: different amounts of training data and multiple tasks. We mainly focus on translation datasets because machine translation is a typical sequence-to-sequence problem. We regard the widely used WMT English-German dataset as a standard setting. In addition, we vary the amount of training data in machine translation: high resource in Section 4.2 and low resource in Section 4.3. We also conduct experiments on other sequence-to-sequence problems, grammatical error correction (GEC) in Section 5 and summarization in Appendix A, to confirm whether the findings from machine translation carry over to other tasks.

Standard Setting
Datasets We used the WMT 2016 English-German training set, which contains 4.5M sentence pairs, in the same manner as Ott et al. (2018), and followed their pre-processing. We used newstest2013 as a validation set, and newstest2010-2012 and newstest2014-2016 as test sets. We measured case-sensitive detokenized BLEU with SacreBLEU (Post, 2018).
Methods We used Transformer (Vaswani et al., 2017) as the base neural encoder-decoder model because it is known as a strong model. We used two parameter sizes: the base and big settings in Vaswani et al. (2017). We applied the perturbations described in Section 3 for comparison. For parallel scheduled sampling (Duckworth et al., 2019), the output probability distributions can be computed multiple times, but we used only the first decoding result because it is the fastest configuration. We set q = 0.9, k = 1000, and β = 0.9. For ADV, we used the same hyper-parameters as in Sato et al. (2019). Our implementation is based on fairseq (Ott et al., 2019). We trained each model for a total of 50,000 steps.
Preliminary: To which sides do we apply perturbations? As described, REP(SS) can be applied to the decoder side only. Sato et al. (2019) reported that their method was most effective when they applied ADV to both the encoder and decoder sides. However, we have no evidence on the suitable sides for the other perturbations. Thus, as preliminary experiments, we applied REP(UNI), REP(SIM), and WDROP to the encoder side, the decoder side, and both. Table 2 shows BLEU scores on newstest2010-2016 and averaged scores when we varied the position of the perturbations. In this table, we indicate scores better than the original Transformer (Vaswani et al., 2017) (w/o perturbation) in bold. The table shows that it is better to apply word replacement (REP(UNI) and REP(SIM)) to the decoder side in Transformer (base). For WDROP, applying it to the encoder side is slightly better than the other positions in Transformer (base). In contrast, applying perturbations to both sides achieved the best averaged BLEU scores for all methods in Transformer (big). These results imply that it is better to apply word replacement and/or word dropout to both the encoder and decoder sides if the neural encoder-decoder has enough parameters. Based on these results, we select the methods to compare against scheduled sampling (REP(SS)) and adversarial perturbations (ADV). Table 2 also shows the results when we combined each word replacement with word dropout (REP(UNI)+WDROP and REP(SIM)+WDROP). REP(SIM)+WDROP slightly outperformed the separate settings.

Results
We compare each perturbation in view of computational time. Table 3 shows BLEU scores of each method and computational speeds relative to Transformer (base) without any perturbations, i.e., larger is faster. In this table, we indicate the best score of each column for the Transformer (base) and (big) settings in bold. The table indicates that Transformer without perturbations achieved scores comparable to previous studies (Vaswani et al., 2017; Ott et al., 2018) on newstest2014 in the base and big settings. Thus, our trained Transformer models (w/o perturbation) can be regarded as strong baselines. The table shows that ADV achieved the best averaged score in Transformer (base), but this method required twice as much training time as the original Transformer (base). In contrast, REP(SIM) and WDROP achieved scores comparable to ADV while barely affecting the computational time. REP(UNI) also achieved a slightly better averaged score than the original Transformer (base).
In the Transformer (big) setting, all perturbations surpassed the averaged score of the model w/o perturbation. REP(SS) and ADV improved the performance, but the other methods outperformed these two with less training time. Moreover, REP(UNI) and REP(SIM)+WDROP achieved the best averaged score. Figure 2 illustrates the negative log-likelihood values and BLEU scores on the validation set against training time when we applied each perturbation to Transformer (big). In addition, Figure 2 (c) shows the time required to reach the BLEU score of Transformer w/o perturbation on the validation set (26.60, as described in Table 3). These figures show that ADV requires twice as much time or more relative to the other methods to achieve comparable performance. In the NLL curves, REP(UNI), REP(SIM), and WDROP achieved better values than Transformer w/o perturbation in the early stage, and WDROP was the fastest to reach a better NLL value. Figure 2 (c) indicates that REP(UNI), REP(SIM), and WDROP achieved the 26.60 BLEU score in less training time than Transformer w/o perturbation.
These results indicate that we can quickly improve the performance of Transformer with REP(UNI), REP(SIM), and WDROP. In particular, when we prepare a large number of parameters for Transformer in machine translation, it is better to use these methods (and their combinations) as perturbations. We conduct more experiments to investigate whether these methods are also superior in other configurations.

High Resource
We compare each perturbation in the case where we have a large amount of training data.
Datasets We add synthetic parallel data generated from a German monolingual corpus using back-translation (Sennrich et al., 2016a) to the training data used in Section 4.1. The German monolingual corpus is NewsCrawl 2015-2018. We randomly sampled 5M sentences from each NewsCrawl corpus, and thus obtained 20M sentences in total. We back-translated the corpus with a German-English translation model, which is identical to Transformer (big) (w/o perturbation) used in Section 4.1 except for the direction of translation. Finally, we prepended a special token BT to the beginning of the source (English) side of the synthetic data, following Caswell et al. (2019). In addition, we upsampled the original bitext to adjust the ratio of the original and synthetic bitexts to 1:1.
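The data preparation here amounts to tagging each synthetic source sentence with a special token and repeating the original bitext until the ratio is roughly 1:1. A sketch under the assumptions above; the exact tag string ("<BT>") and the rounding-based upsampling rule are our illustrative choices, not taken from any specific toolkit:

```python
def tag_back_translated(source_sentences, tag="<BT>"):
    """Prepend a special token to synthetic (back-translated) sources
    so the model can distinguish them from genuine bitext."""
    return [f"{tag} {s}" for s in source_sentences]

def upsample(original_pairs, synthetic_pairs):
    """Repeat the original bitext so the original:synthetic ratio is
    roughly 1:1, as in the high-resource setting described above."""
    if not original_pairs:
        return list(synthetic_pairs)
    times = max(1, round(len(synthetic_pairs) / len(original_pairs)))
    return original_pairs * times + list(synthetic_pairs)
```

With 4.5M original pairs and 20M synthetic pairs, the original data would be repeated about four times.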
Methods In this setting, we increase the parameter size of Transformer beyond the (big) setting to take advantage of the large training data. Specifically, we increased the internal layer size of the FFN part from 4096 to 8192, and used 8 layers for both the encoder and decoder. The other hyper-parameters are the same as in Section 4.1.
Results Table 4 shows BLEU scores of each method when we used the large training data. The table indicates that all perturbations outperformed Transformer w/o perturbation on all test sets. Moreover, the fast methods REP(UNI), REP(SIM), WDROP, and their combinations achieved the same or better averaged scores than REP(SS) and ADV. Thus, these methods are not only fast but also significantly improve the performance of Transformer. In particular, since Table 3 shows that REP(UNI) and WDROP have barely any negative effect on the computational time, we consider them superior methods.

Low Resource
Datasets We also conduct an experiment in a low resource setting. We used the IWSLT 2014 German-English training set, which contains 160k sentence pairs. We followed the pre-processing described in fairseq (Ott et al., 2019). We used dev2010, dev2012, and tst2010-2012 as the test set.
Methods In this setting, we reduced the parameter size of Transformer from the (base) setting; specifically, we reduced the internal layer size of the FFN part from 2048 to 1024. We used the same values for the other hyper-parameters as in Section 4.1.
Results Table 5 shows BLEU scores of each method in the low resource setting. We trained three models with different random seeds for each method and report the averaged scores. In this table, we also report the results of REP(UNI), REP(SIM), WDROP, and their combinations trained with twice the number of updates (denoted ×2 training steps). The table shows that all perturbations improved the performance over Transformer w/o perturbation. In contrast to Tables 3 and 4, ADV achieved the top score when each model was trained with the same number of updates. However, as reported in Section 4.1, ADV requires twice or more as much training time as the other perturbations. Thus, when we train Transformer with the other perturbations for twice the number of updates, the training time is almost equal. In this comparison of (almost) equal training time, WDROP achieved a score comparable to ADV. Moreover, REP(UNI)+WDROP and REP(SIM)+WDROP outperformed ADV. Thus, in this low resource setting, REP(UNI)+WDROP and REP(SIM)+WDROP are slightly better than ADV relative to computational time.

Robustness

We constructed perturbed inputs by replacing words in source sentences at a pre-defined ratio. If the ratio is 0.0, we use the original source sentences; if the ratio is 1.0, we use completely different sentences as source sentences. We set the ratio to 0.01, 0.05, and 0.10. In this process, we replaced randomly selected words with words sampled from the vocabulary under a uniform distribution. We applied this procedure to the source sentences in newstest2010-2016. Table 6 shows averaged BLEU scores of each method on the perturbed newstest2010-2016, calculated against the original reference sentences. The table indicates that all perturbations improved the robustness of Transformer (big) because their BLEU scores are better than that of the setting w/o perturbation.
In the comparison among perturbations, REP(SIM) (and REP(SIM)+WDROP) achieved significantly better scores than the others on perturbed inputs. We emphasize that REP(SIM) surpassed ADV even though ADV was originally proposed to improve the robustness of models. This result implies that REP(SIM) is effective for constructing robust models as well as for improving performance.

Experiments on Grammatical Error Correction

Results Table 7 shows the results of each method, reporting the averaged score of five models trained with different random seeds. Moreover, we present the scores of Kiyono et al. (2020); our "w/o perturbation" model is a rerun of their work, that is, the experimental settings are identical. Table 7 shows that all perturbations improved the scores except for REP(UNI) and REP(SIM) on the BEA test set (Test). As in the machine translation results, the simple methods WDROP, REP(UNI)+WDROP, and REP(SIM)+WDROP achieved scores comparable to ADV. Thus, these faster methods are also effective for the GEC task.

Related Work
Word Replacement The naive training of neural encoder-decoders has a discrepancy between training and inference: we use the correct tokens as inputs to the decoder in the training phase, but use the token predicted at the previous time step as the decoder input in the inference phase. To address this discrepancy, Bengio et al. (2015) proposed scheduled sampling, which stochastically uses the token sampled from the output probability distribution of the decoder as an input instead of the correct token. Later work modified the sampling method to improve the performance. In addition, Duckworth et al. (2019) refined the algorithm to suit Transformer (Vaswani et al., 2017). Their method is faster than the original scheduled sampling but, in our experiments, slower and slightly worse than simpler replacement methods such as REP(UNI) and REP(SIM). Xie et al. (2017) and Kobayashi (2018) used a unigram language model and a neural language model, respectively, to sample tokens for word replacement. In this study, we ignored contexts to simplify the sampling process, and showed that such simple methods are effective for sequence-to-sequence problems.
Word Dropout Gal and Ghahramani (2016) applied word dropout to a neural language model, and it is a common technique in language modeling (Merity et al., 2018; Yang et al., 2018; Takase et al., 2018). Sennrich and Zhang (2019) reported that word dropout is also effective for low resource machine translation. However, word dropout has not been commonly used in existing sequence-to-sequence systems. Our experiments show that word dropout is not only fast but also improves scores in various sequence-to-sequence problems.
Adversarial Perturbations Adversarial perturbations were first discussed in the field of image processing (Szegedy et al., 2014; Goodfellow et al., 2015). In the NLP field, Miyato et al. (2017) introduced adversarial perturbations in the embedding space for text classification, and Wang et al. (2019) and Sato et al. (2019) applied such perturbations to neural encoder-decoders. Cheng et al. (2019) instead constructed adversarial examples by replacing input tokens. Their method is also effective but requires more computational time than Wang et al. (2019) and Sato et al. (2019) because it runs language models to obtain candidate tokens for perturbations.

Conclusion
We compared perturbations for neural encoder-decoders in view of computational time. Experimental results show that simple techniques such as word dropout (Gal and Ghahramani, 2016) and random replacement of input tokens achieved scores comparable to sophisticated perturbations, namely scheduled sampling (Bengio et al., 2015) and adversarial perturbations (Sato et al., 2019), even though the simple methods are faster. In the low resource machine translation setting, adversarial perturbations achieved a high BLEU score, but the simple methods achieved comparable scores when we spent almost the same training time. For the robustness of trained models, REP(SIM) is superior to the others. This study indicates that simple methods are sufficiently effective; thus, we encourage using such simple perturbations as a first step. In addition, we hope that researchers proposing new perturbations will use these simple perturbations as baselines to clarify the usefulness of their proposed methods.

A.1 Annotated English Gigaword
Datasets We used sentence-summary pairs extracted from Annotated English Gigaword (Napoles et al., 2012; Rush et al., 2015) as the summarization dataset. This dataset contains 3.8M sentence-summary pairs as the training set and 1,951 pairs as the test set. We extracted 3K pairs from the original validation set, which contains 190K pairs, for our validation set. In summarization, most recent studies used large scale corpora to pre-train their neural encoder-decoders (Dong et al., 2019; Song et al., 2019; Qi et al., 2020). Thus, we also augmented the training data: we extracted the first sentence and headline of each news article in REALNEWS (Zellers et al., 2019) and News Crawl (Barrault et al., 2019) as sentence-summary pairs. In total, we used 17.1M sentence-summary pairs as our training data.
We used BPE (Sennrich et al., 2016b) to construct a vocabulary set. We set the number of BPE merge operations at 32K and shared the vocabulary between both the encoder and decoder sides.
Methods We followed the configuration in Section 4.2 because it seems suitable for a large amount of training data. We used the same perturbations and hyper-parameters as in Section 4.2.
Results Table 8 shows the ROUGE F_1 scores of each method and scores reported in recent studies (Dong et al., 2019; Song et al., 2019; Qi et al., 2020). In this experiment, we cannot report the result of ADV because the loss value of ADV exploded during training. We tried several random seeds for ADV, but all models failed to converge. Since we would need a huge budget to search for more suitable ADV hyper-parameters on this summarization dataset, we consider it impractical to report the result of ADV. Table 8 indicates that all perturbations improved the ROUGE score. In addition, REP(UNI), REP(SIM), WDROP, and their combinations outperformed scheduled sampling. Thus, these fast methods are also superior perturbations in the summarization task.

A.2 DUC-2004

Datasets We evaluated on the DUC-2004 test set (Over et al., 2007), which contains 500 source sentences and four kinds of manually constructed reference summaries. We truncated characters over 75 bytes in each generated summary, following the official configuration.
Methods We used the Transformer (big) setting in this experiment. In addition, we introduced the output length control method proposed by Takase and Okazaki (2019). We used the same perturbations and hyper-parameters as in Section 4.1.
Results Table 8 shows recall-based ROUGE scores of each method and scores reported in recent studies (Rush et al., 2015; Suzuki and Nagata, 2017; Takase and Okazaki, 2019; Takase and Kobayashi, 2020). We cannot report the result of ADV for the same reason as described in Appendix A.1.
REP(SIM) outperformed the current top score in ROUGE-1, 2, and L. Moreover, WDROP achieved better ROUGE-1 and L scores than the current top score. In contrast, REP(UNI) slightly harmed the performance in this configuration. These results indicate that WDROP and REP(SIM) are also effective for summarization tasks.

Table 12: BLEU scores when we inject perturbations into source sentences at ratio 0.10.