Noise Stability Regularization for Improving BERT Fine-tuning

Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP tasks. Despite its recent success and wide adoption, this process is unstable when there are only a small number of training samples available. The brittleness of this process is often reflected by the sensitivity to random seeds. In this paper, we propose to tackle this problem based on the noise stability property of deep nets, which is investigated in recent literature (Arora et al., 2018; Sanyal et al., 2020). Specifically, we introduce a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR). We extend the theories about adding noise to the input and prove that our method gives a stabler regularization effect. We provide supportive evidence by experimentally confirming that well-performing models show a low sensitivity to noise and fine-tuning with LNSR exhibits clearly better generalizability and stability. Furthermore, our method also demonstrates advantages over other state-of-the-art algorithms including L2-SP (Li et al., 2018), Mixout (Lee et al., 2020) and SMART (Jiang et al., 2020).


Introduction
Large-scale pre-trained language models such as BERT (Devlin et al., 2019) have been widely used in natural language processing tasks (Guu et al., 2020; Liu, 2019; Wadden et al., 2019; Zhu et al., 2020b). A typical way to train on a supervised downstream dataset is to fine-tune a pre-trained model for a few epochs. In this process, most of the model's parameters are reused, while a randomly initialized task-specific layer is added to adapt the model to the new task.
Fine-tuning BERT has significantly boosted the state-of-the-art performance on natural language understanding (NLU) benchmarks such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). However, despite the impressive empirical results, this process remains unstable due to the randomness introduced by data shuffling and the initialization of the task-specific layer. The instability in fine-tuning BERT was first observed by Devlin et al. (2019) and Dodge et al. (2020), and several approaches have been proposed to solve this problem (Zhang et al., 2020; Mosbach et al., 2020).
†Contribution during internship at Baidu Research. *Equal contributions. Correspondence.
In this study, we consider the fine-tuning stability of BERT from the perspective of sensitivity to input perturbation. This is motivated by Arora et al. (2018) and Sanyal et al. (2020), who show that, for neural networks with good generalizability, noise injected at the lower layers has very little effect on the higher layers. However, for a well pre-trained BERT, we find that the higher layers are still very sensitive to perturbation of the lower layers (as shown in Figure 1), implying that the high-level representations of the pre-trained BERT may not generalize well on downstream tasks and consequently lead to instability. This phenomenon coincides with the observation that transferring the top pre-trained layers of BERT slows down learning and hurts performance (Zhang et al., 2020). In addition, Yosinski et al. (2014) point out that in transfer learning models for object recognition, the lower pre-trained layers learn more general features while the higher layers closer to the output specialize more to the pre-training tasks. We argue that this result also applies to BERT. Intuitively, if a trained model is insensitive to the perturbation of the lower layers' output, then the model is confident about the output, and vice versa. Based on the above theoretical and empirical results, we propose a simple and effective regularization method to reduce the noise sensitivity of BERT and thus improve the stability and performance of fine-tuned BERT.
Figure 1: Attenuation of injected noise on the BERT-Large-Uncased model on the MRPC task (X-axis: the layer index; Y-axis: the L2 norm of the difference between the original output and the noise-perturbed output). A curve starts at the layer at whose output scaled Gaussian noise is injected, with the L2 norm of the noise set to 5% of the norm of the original output. As it propagates up, the injected noise has a rapidly decreasing effect on the lower layers but becomes volatile on the higher layers, which indicates the poor generalizability and brittleness of the BERT top layers. Moreover, models with higher accuracies (marked in the upper right) usually have lower error ratios, that is, higher noise stability, in the top layers.
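The perturbation described in the Figure 1 caption can be sketched as follows. This is a minimal illustration in plain Python, assuming the noise is drawn from a standard Gaussian and then rescaled so that its L2 norm is a fixed fraction of the layer output's norm; the helper names `l2_norm`, `scaled_gaussian_noise` and `error_ratio`, and the 16-dimensional stand-in vector, are hypothetical and not taken from the authors' code.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def scaled_gaussian_noise(hidden, ratio=0.05, rng=random):
    """Gaussian noise rescaled so that ||noise||_2 = ratio * ||hidden||_2."""
    raw = [rng.gauss(0.0, 1.0) for _ in hidden]
    scale = ratio * l2_norm(hidden) / l2_norm(raw)
    return [scale * r for r in raw]


def error_ratio(clean, perturbed):
    """L2 distance between two outputs, relative to the clean output norm."""
    diff = [p - c for p, c in zip(perturbed, clean)]
    return l2_norm(diff) / l2_norm(clean)


rng = random.Random(0)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for one layer's output
noise = scaled_gaussian_noise(hidden, ratio=0.05, rng=rng)
perturbed = [h + n for h, n in zip(hidden, noise)]

# At the injection layer the relative error equals the chosen ratio.
print(round(error_ratio(hidden, perturbed), 4))  # 0.05
```

Tracking `error_ratio` between clean and perturbed outputs at every later layer would give one curve of the kind the caption describes.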
To verify our approach, we conduct extensive experiments on different few-sample (fewer than 10k training samples) NLP tasks, including CoLA (Warstadt et al., 2019), MRPC (Dolan and Brockett, 2005), RTE (Wang et al., 2018; Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007), and STS-B (Cer et al., 2017). With the layer-wise noise stability regularization, we obtain strong empirical performance. Compared with other state-of-the-art models, our approach not only improves the fine-tuning stability (with a smaller standard deviation) but also consistently improves the overall performance (with a larger mean, median and maximum).
In summary, our main contributions are:
• We propose a lightweight and effective regularization method, referred to as Layer-wise Noise Stability Regularization (LNSR), to improve the local Lipschitz continuity of each BERT layer and thus ensure the smoothness of the whole model. The empirical results show that the fine-tuned BERT models regularized with LNSR obtain significantly more accurate and stable results. LNSR also outperforms other state-of-the-art methods aiming at stabilizing fine-tuning, such as L2-SP (Li et al., 2018), Mixout (Lee et al., 2020) and SMART (Jiang et al., 2020).
• We are the first to study the effect of noise stability in NLP tasks. We extend classic theories of adding noise to explicitly constraining the output consistency when adding noise to the input. We theoretically prove that our proposed layer-wise noise stability regularizer is equivalent to a special case of the Tikhonov regularizer, which serves as a stabler regularizer than simply adding noise to the input (Rifai et al., 2011).
• We investigate the relation of the noise stability property to the generalizability of BERT. We find that in general, models with good generalizability tend to be insensitive to noise perturbation; the lower layers of BERT show a better error resilience property but the higher layers of BERT remain sensitive to the lower layers' perturbation (as is depicted in Figure 1).

Pre-training
Pre-training has been well studied in machine learning and natural language processing (Erhan et al., 2009, 2010). Mikolov et al. (2013) and Pennington et al. (2014) proposed to use distributional representations (i.e., word embeddings) for individual words. Dai and Le (2015) proposed to train a language model or an auto-encoder with unlabeled data and then leverage the obtained model to fine-tune on downstream tasks. Recently, pre-trained language models such as ELMo (Peters et al., 2018), GPT/GPT-2 (Radford, 2018; Radford et al., 2019), BERT (Devlin et al., 2019), the cross-lingual language model XLM (Lample and Conneau, 2019), XLNet, RoBERTa and ALBERT (Lan et al., 2020) have attracted more and more attention in the natural language processing community. These models are first pre-trained on large amounts of unlabeled data to capture rich representations of the input, and then applied to downstream tasks by either providing context-aware embeddings of an input sequence (Peters et al., 2018) or initializing the parameters of the downstream model (Devlin et al., 2019) for fine-tuning. Such pre-training approaches deliver decent performance on natural language understanding tasks.

Instability in Fine-tuning
Fine-tuning instability of BERT has been reported in various previous works. Devlin et al. (2019) report instabilities when fine-tuning BERT on small datasets and resort to performing multiple restarts of fine-tuning and selecting the model that performs best on the development set. Dodge et al. (2020) perform a large-scale empirical investigation of the fine-tuning instability of BERT. They find dramatic variations in fine-tuning accuracy across multiple restarts and argue that it might be related to the choice of random seed and the dataset size. Lee et al. (2020) propose a new regularization method named Mixout to improve the stability and performance of fine-tuning BERT. Zhang et al. (2020) empirically evaluate the importance of the debiasing step by fine-tuning BERT with both BERTAdam and the standard Adam optimizer (Kingma and Ba, 2015), and propose a re-initialization method to get a better initialization point for fine-tuning optimization. Mosbach et al. (2020) analyze the cause of fine-tuning instability and propose a simple but strong baseline (a small learning rate combined with bias correction).

Regularization
There have been several regularization approaches to stabilizing the performance of models. Loshchilov and Hutter (2019) propose a decoupled weight decay regularizer integrated into the Adam optimizer (Kingma and Ba, 2015) to prevent neural networks from becoming too complex. Gunel et al. (2020) use a contrastive learning method to augment the training set and improve generalization performance. In addition, the spectral norm (Yoshida and Miyato, 2017; Roth et al., 2019) serves as a general method that can also be used to constrain the Lipschitz continuity of weight matrices, which can increase the stability of neural networks in general.
Several noise-based methods have also been proposed to improve the generalizability of pre-trained language models, including SMART (Jiang et al., 2020), FreeLB (Zhu et al., 2020a) and R3F (Aghajanyan et al., 2020). They achieve state-of-the-art performance on the GLUE, SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018) and ANLI (Nie et al., 2020) NLU benchmarks. Most of these algorithms employ adversarial training to improve the robustness of language model fine-tuning. SMART uses an adversarial methodology to encourage models to be smooth within a neighborhood of each input; FreeLB optimizes a direct adversarial loss through iterative gradient ascent steps; R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly. Unlike these methods, our proposed method does not adopt an adversarial training strategy: we optimize the smoothness of each layer of BERT directly and thus improve the stability of the whole model.

Using Noise Stability as a Regularizer
One of the central issues in neural network training is to determine the optimal degree of complexity for the model. A model which is too limited will not sufficiently capture the structure in the data, while one which is too complex will model the noise on the data (the phenomenon of over-fitting). In either case, the performance on new data, that is, the ability of the network to generalize, will be poor. The problem can be regarded as one of finding the optimal trade-off between the high bias of a model which is too inflexible and the high variance of a model with too much freedom (Geman et al., 1992; Bishop, 1995; Novak et al., 2018; Bishop, 1991). To control the trade-off of bias against variance of BERT models, we impose an explicit noise regularization method.

Introduction of Our Method
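Denoting the training set as $D$, we give the general form of the optimization objective for a BERT model $f(\cdot;\theta)$ with $L$ layers as follows:
$$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{(x,y)\sim D}\big[\mathcal{L}(f(x;\theta), y) + \mathcal{R}(\theta)\big],$$
where $\mathcal{L}$ denotes the task loss and $\mathcal{R}(\theta)$ the regularization term. To represent $\mathcal{R}(\theta)$, we first define the injection position as the input of layer $b$, which is denoted as $x^b$. If the regularization is applied at the output of layer $r$, we can further denote the function between layer $b$ and layer $r$ as $f^{b,r}$, with $1 \le b \le r \le L$. To implement the noise stability regularization, we inject a Gaussian-like noise vector $\varepsilon$ into $x^b$ and get a neighboring point $x^b + \varepsilon$. Specifically, each element $\varepsilon_i$ is sampled independently from a Gaussian distribution with zero mean and standard deviation $\sigma$, that is, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The probability density function of the noise distribution can be written as
$$p(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).$$
Our goal is to minimize the discrepancy between the outputs of $f^{b,r}$ at the two points, defined as
$$\left\| f^{b,r}(x^b + \varepsilon) - f^{b,r}(x^b) \right\|^2.$$
In our framework, we use a fixed position $b$ as the position of noise injection and constrain the output distance on all layers following layer $b$. Denoting the regularization weight corresponding to each $f^{b,r}$ as $\lambda^{b,r}$, given a sample $(x, y) \sim D$, the regularization term is represented by the following formula:
$$\mathcal{R}(\theta) = \sum_{r=b}^{L} \lambda^{b,r} \left\| f^{b,r}(x^b + \varepsilon) - f^{b,r}(x^b) \right\|^2.$$
The overall algorithm is presented in Algorithm 1.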
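The procedure described in this subsection can be sketched in a few lines. This is a minimal illustration in plain Python rather than the authors' implementation: it assumes the penalty is the weighted sum, over every layer r from the injection layer b upward, of the squared L2 distance between the clean and the noise-perturbed outputs. The helper name `lnsr_penalty`, the toy element-wise "layers", the 0-based layer indices and the placeholder task loss are all assumptions made for illustration.

```python
import math
import random


def lnsr_penalty(layers, x_b, b, sigma, weights, rng):
    """Weighted sum over r >= b of ||f^{b,r}(x_b + eps) - f^{b,r}(x_b)||^2.

    layers[i] is a callable standing in for layer i (0-based in this sketch);
    x_b is the input of layer b, given as a list of floats.
    """
    eps = [rng.gauss(0.0, sigma) for _ in x_b]  # eps_i ~ N(0, sigma^2)
    clean = list(x_b)
    noisy = [x + e for x, e in zip(x_b, eps)]
    penalty = 0.0
    for r in range(b, len(layers)):
        clean = layers[r](clean)  # f^{b,r}(x_b)
        noisy = layers[r](noisy)  # f^{b,r}(x_b + eps)
        sq_dist = sum((n - c) ** 2 for n, c in zip(noisy, clean))
        penalty += weights[r] * sq_dist
    return penalty


# Toy stand-ins for Transformer layers: element-wise maps on a short vector.
layers = [
    lambda v: [0.5 * a for a in v],
    lambda v: [math.tanh(a) for a in v],
    lambda v: [2.0 * a for a in v],
]
weights = [1.0, 1.0, 1.0]  # a fixed weight of 1 on every layer
x_b = [0.3, -0.7, 1.2, 0.1]

task_loss = 0.42  # placeholder for the supervised loss on one sample
reg = lnsr_penalty(layers, x_b, b=0, sigma=0.1, weights=weights, rng=random.Random(0))
total_loss = task_loss + reg
print(total_loss)
```

In an actual fine-tuning loop the two forward passes would share the model parameters, and the gradient of `total_loss` would be taken with respect to those parameters; the sketch only shows how the penalty itself is assembled.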

Theoretical Analysis
Regularization is a commonly used technique for reducing function complexity and, as a result, making the learned model generalize well to unseen examples. In this part, we theoretically prove that the proposed LNSR algorithm has the effects of encouraging local Lipschitz continuity and imposing a Tikhonov regularizer under different assumptions. For simplicity, we omit the layer indices in this part, denoting $f$ as the target function parameterized by $\theta$ and $x$ as its input. Given a sample $(x, y) \sim D$, we discuss the general form of the noise stability defined as follows:
Lipschitz continuity. The Lipschitz property reflects the degree of smoothness of a function. Recent theoretical studies on deep learning have revealed the close connection between the Lipschitz property and generalization (Bartlett et al., 2017; Neyshabur et al., 2017).
Given a sampled $\varepsilon$, minimizing $\|f(x + \varepsilon) - f(x)\|^2$ is equivalent to minimizing: Thus the noise stability regularization can be regarded as minimizing the Lipschitz constant in a local region around the input $x$.
Tikhonov regularizer. The Tikhonov regularizer (Willoughby, 1979) involves constraints on the derivatives of the objective function of different orders. For the simplest first-order case, it can be regarded as imposing robustness and shaping a flatter loss surface at the input, which makes the learned function smoother.
Assuming that the magnitude of $\varepsilon$ is small, we can expand the first term as a Taylor approximation as: where $J_f(x)$ and $H_f(x)$ refer to the Jacobian and Hessian of $f$ with respect to the input $x$, respectively.
Ignoring the higher-order term $O(\varepsilon^3)$ and denoting $f_k$ as the $k$-th output of the function $f$, we can rewrite the regularizer by substituting Eq. 5 into Eq. 3 as (Eq. 6): We define the input vector $x$ as $(x_1, x_2, \ldots, x_{d_1})$ and the noise vector $\varepsilon$ as $(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_{d_1})$. Assuming that the distributions of the noise and the input are independent, and that the derivatives of $f$ with respect to different elements of the input vector are independent of each other, we expand the second-order term corresponding to the Jacobian as: According to the characteristics of the Gaussian distribution, we also have Thus, we can rewrite the second-order term corresponding to the Hessian in Eq. 6 as: where $C$ is a constant independent of the input $x$. The third term generated from the expansion of Eq. 6 is zero, as we have $\int \varepsilon^3 p(\varepsilon)\, d\varepsilon = 0$. Thus we get
Considering the case where the input and output of the function $f$ are both scalar variables, the Tikhonov regularization (Willoughby, 1979) takes the general form: Eq. 10 shows that our proposed regularizer ensuring noise stability is equivalent to a special case of the Tikhonov regularizer, where we involve the first- and second-order derivatives of the objective function $f$.
An alternative for improving robustness is to directly add noise to the input, without explicitly constraining the output stability. Rifai et al. (2011) have derived that adding noise to the input has the effect of penalizing both the L2 norm of the Jacobian $\|J_f(x)\|^2$ and the trace of the Hessian $\mathrm{Tr}(H_f(x))$, where the Hessian term is not constrained to be positive. In contrast, the regularizer brought by our proposed LNSR is guaranteed to be non-negative, since it involves the sum of squares of the first- and second-order derivatives. Moreover, our work relaxes the assumption of an MSE regression loss required by Rifai et al. (2011).
By imposing the explicit constraint of noise stability on middle layer representations, we extend the theoretical understanding of noise stability into deep learning algorithms.
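The chain of steps in this analysis can be sketched as a worked derivation. This is a reconstruction under the stated assumptions ($\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, independent of $x$, with small $\sigma$), not a verbatim copy of the paper's displayed equations. The shorthand $g_k = \nabla_x f_k(x)$ and $H_k$ for the Hessian of $f_k$ is introduced here for compactness, the tags (3), (5) and (6) follow the equation numbers the text refers to, and the sketch keeps the cross-derivative terms that the text's independence assumption would drop.

```latex
\begin{align}
\mathcal{R}(\theta) &= \mathbb{E}_{\varepsilon}\,\big\| f(x+\varepsilon) - f(x) \big\|^2 \tag{3} \\
f_k(x+\varepsilon) &= f_k(x) + g_k^{\top}\varepsilon
  + \tfrac{1}{2}\,\varepsilon^{\top} H_k\,\varepsilon + O(\|\varepsilon\|^3) \tag{5} \\
\mathcal{R}(\theta) &\approx \sum_k \mathbb{E}_{\varepsilon}\Big[\big(g_k^{\top}\varepsilon
  + \tfrac{1}{2}\,\varepsilon^{\top} H_k\,\varepsilon\big)^2\Big] \tag{6}
\end{align}
% Gaussian moments for \varepsilon \sim \mathcal{N}(0, \sigma^2 I) and symmetric H_k:
\begin{align*}
\mathbb{E}\big[(g_k^{\top}\varepsilon)^2\big] &= \sigma^2 \|g_k\|^2, \\
\mathbb{E}\big[(g_k^{\top}\varepsilon)(\varepsilon^{\top} H_k\,\varepsilon)\big] &= 0
  \quad \text{(odd moments vanish)}, \\
\mathbb{E}\big[(\varepsilon^{\top} H_k\,\varepsilon)^2\big]
  &= \sigma^4 \big[ (\operatorname{Tr} H_k)^2 + 2\,\|H_k\|_F^2 \big].
\end{align*}
% Collecting the three terms of the expanded square:
\begin{equation*}
\mathcal{R}(\theta) \approx \sigma^2\, \|J_f(x)\|_F^2
  + \frac{\sigma^4}{4} \sum_k \big[ (\operatorname{Tr} H_k)^2 + 2\,\|H_k\|_F^2 \big]
\end{equation*}
```

Every term on the right-hand side of the last line is a square, so the penalty is non-negative. This is the property the text contrasts with input-noise training, where the trace-of-Hessian term can take either sign.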

Experiments
In this section, we experimentally demonstrate the effectiveness of the LNSR method on text classification tasks compared with other regularization methods, and confirm that insensitivity to noise promotes the generalizability and stability of BERT.

Data
We conduct experiments on four few-sample (fewer than 10k training samples) text classification tasks of GLUE¹. The datasets are described below and summarized in Table 4 in Appendix A.
Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019) consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence. This is a binary classification task, and the Matthews correlation coefficient (MCC) (Matthews, 1975) is used to evaluate the performance.
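For readers unfamiliar with the CoLA metric, the Matthews correlation coefficient for binary labels can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration; the function name and the toy labels are hypothetical, and returning 0.0 when the denominator vanishes is a common convention assumed here rather than something stated in the text.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom


# Toy example: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # 0.5
```

Unlike accuracy, the score stays near zero for a classifier that always predicts the majority class, which matters for an imbalanced task such as acceptability judgment.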
Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005) is a corpus of sentence pairs with human annotations for whether the sentences in the pair are semantically equivalent. The evaluation metric is the average of F1 and Accuracy.
Recognizing Textual Entailment (RTE; Wang et al., 2018; Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007) is a corpus of textual entailment, and each example is a sentence pair annotated with whether the first sentence entails the second. The evaluation metric is Accuracy.
Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017) is a regression task. Each example is a sentence pair and is human-annotated with a similarity score from 1 to 5; the task is to predict these scores. The evaluation metric is the average of the Pearson and Spearman correlation coefficients.
¹ https://gluebenchmark.com/

Baseline Models
We use BERT (Devlin et al., 2019), a large-scale bidirectional pre-trained language model, as the base model in all experiments. We adopt the PyTorch implementation by Wolf et al. (2019).
Fine-tuning. We use the standard BERT fine-tuning method described in Devlin et al. (2019).
L2-SP (Li et al., 2018) is a regularization scheme that explicitly promotes the similarity of the final solution with the initial model. It is usually used for preventing pre-trained models from catastrophic forgetting. We adopt the form $\Omega(w) = \frac{\alpha}{2}\|w_s - w_s^0\| + \frac{\beta}{2}\|w_{\bar{s}}\|$.
Mixout (Lee et al., 2020) is a stochastic regularization technique motivated by Dropout (Srivastava et al., 2014) and DropConnect (Wan et al., 2013). At each training iteration, each model parameter is replaced with its pre-trained value with probability p. The goal is to improve the generalizability of pre-trained language models.
SMART (Jiang et al., 2020) imposes a smoothness-inducing adversarial regularizer to control model complexity at the fine-tuning stage. It also employs a class of Bregman proximal point optimization methods to prevent the model from updating aggressively during fine-tuning.
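The replacement step described for Mixout can be illustrated with a short sketch. It implements only what the paragraph above states, namely swapping each parameter for its pre-trained value with probability p; any further details of the published method are outside this sketch. The function name and the parameter values are hypothetical.

```python
import random


def mixout_replace(current, pretrained, p, rng):
    """Swap each parameter for its pre-trained value with probability p."""
    return [w0 if rng.random() < p else w for w, w0 in zip(current, pretrained)]


pretrained = [0.10, -0.20, 0.30, 0.05]
current = [0.12, -0.25, 0.31, 0.00]  # hypothetical values after a few updates
mixed = mixout_replace(current, pretrained, p=0.7, rng=random.Random(0))
print(mixed)
```

With p = 0 the parameters are left untouched, and with p = 1 the model is reset to its pre-trained weights, so p controls how strongly each step is pulled back toward the starting point.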

Experimental Setup
Our model is implemented using PyTorch based on the Transformers framework². Specifically, we use the learning setup and hyperparameters recommended by Devlin et al. (2019). We use the Hugging Face implementation of the Adam optimizer (Kingma and Ba, 2015) without bias correction, with a learning rate of $2 \times 10^{-5}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and warmup over the first 10% of the total steps. We fine-tune the entire model (340 million parameters), of which the vast majority start as pre-trained weights (BERT-Large-Uncased) and the rest belong to the classification layer (2048 parameters). Weights of the classification layer are initialized with $\mathcal{N}(0, 0.02^2)$. We train with a batch size of 32 for 3 epochs. More details of our experimental setup are described in Appendix A.

Overall Performance
Table 1 shows the results of all the models on the selected GLUE datasets. We train on each dataset with 25 random seeds. To implement our LNSR, we uniformly inject noise at the first layer of BERT-large for the comparison with baseline models. As we can see from the table, our model outperforms all the baseline models in mean and max values, which indicates the stronger generalizability of our model compared with the baseline models. The p-values between the accuracy distributions of standard BERT fine-tuning and our model are calculated to verify whether the improvements are significant. We obtain very small p-values in all tasks: RTE: $9.7 \times 10^{-7}$, MRPC: $2.3 \times 10^{-4}$, CoLA: $4.7 \times 10^{-8}$, STS-B: $3.3 \times 10^{-8}$.
Standard deviation is an indicator of the stability of a model's performance, and a higher standard deviation means higher sensitivity to random seeds. Our model shows a lower standard deviation on each task, which means our model is less sensitive to random seeds than other models. Figure 2 presents a clearer illustration. To sum up, our proposed method can effectively improve the performance and stability of fine-tuning BERT.
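The per-task summary statistics discussed here (mean, standard deviation, median and maximum over random seeds) can be computed with the Python standard library. The sketch below is illustrative only: the five scores are made-up numbers, not results from the paper, and the use of the sample standard deviation is an assumption, since the text does not say which variant is reported.

```python
import statistics


def summarize_runs(scores):
    """Summary statistics of one metric across fine-tuning runs with different seeds."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample std; pstdev would give the population std
        "median": statistics.median(scores),
        "max": max(scores),
    }


# Hypothetical development-set scores from five seeds, for illustration only.
scores = [0.60, 0.62, 0.58, 0.64, 0.61]
summary = summarize_runs(scores)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A method that is both better and more stable would show a higher mean, median and max together with a lower std than the baseline on the same set of seeds.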

Ablation Study
To verify the effectiveness of our proposed LNSR model, we conduct several ablation experiments, including fine-tuning with more training epochs and noise perturbation without regularization (we inject noise directly into the output of a specific layer, then use the perturbed representation for forward propagation and loss computation; this process is similar to augmenting representations in vector space). The results are shown in Table 2. We observe that the benefit obtained by longer training is limited. Similarly, fine-tuning with noise perturbation achieves only slightly better results on two of these tasks, showing that simply adding noise without an explicit restriction on outputs may not be sufficient to obtain good generalizability. In contrast, BERT models with LNSR perform better on each task. This verifies our claim that LNSR can promote the stability of BERT fine-tuning and meanwhile improve the generalizability of the BERT model.
² https://huggingface.co/transformers/index.html

Effects on the Generalizability of Models
We verify the effects of our proposed method on the generalizability of BERT models in two ways: the generalization gap and the models' performance on fewer training samples. Both quantities reflect generalizability intuitively. Due to the limited data and the extremely high complexity of the BERT model, a bad fine-tuning starting point makes the adapted model overfit the training data and generalize poorly to unseen data. Table 3 shows the mean training Acc, mean evaluation Acc and generalization gap of different models on each task. As we can see from the table, models fine-tuned with LNSR narrow the generalization gap and achieve a higher evaluation score. The effect of narrowing the generalization gap is also reflected in Figure 3, where we can see the higher evaluation accuracy and lower evaluation loss. We sample subsets from the two relatively larger datasets, CoLA (8.5k training samples) and STS-B (7k training samples), with sampling ratios of 0.15, 0.3 and 0.5. As shown in Figure 4, fine-tuning with LNSR shows a clear advantage with fewer training samples, suggesting that LNSR can effectively promote the model's generalizability.
Table 3: Comparison of the generalizability performance of different models. We report the mean training Acc and evaluation Acc and the generalization gap (training Acc - evaluation Acc) of each model across 20 random seeds.
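The subsampling protocol for the reduced-data experiments can be sketched as follows. This is an assumed procedure (uniform sampling without replacement, with a fixed seed per subset), since the text only gives the sampling ratios; the function name and the integer stand-in for the training set are hypothetical.

```python
import random


def subsample(dataset, ratio, seed):
    """Draw a random subset containing round(ratio * len(dataset)) examples."""
    k = round(ratio * len(dataset))
    return random.Random(seed).sample(dataset, k)


train_set = list(range(8500))  # stand-in for roughly 8.5k CoLA training examples
subsets = {ratio: subsample(train_set, ratio, seed=0) for ratio in (0.15, 0.3, 0.5)}
for ratio, subset in subsets.items():
    print(ratio, len(subset))
# 0.15 1275
# 0.3 2550
# 0.5 4250
```

Fixing the seed keeps each subset identical across methods, so that differences in the resulting scores come from the regularizer rather than from the choice of examples.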

Sensitivity to the Position of Noise Injection
We briefly discuss the sensitivity to the position of noise injection, as it is a pre-determined hyperparameter of our method. As shown in Figure 5 in Appendix A, the performance of LNSR does not fluctuate much as the position of noise injection changes. All injection positions bring significant improvements over vanilla fine-tuning. Note that, with LNSR, noise injection at the lower layers usually leads to relatively higher accuracy and stability, implying that LNSR may be more effective when it affects both the lower and higher layers of the network.

Relationship to Previous Noise-based Approaches
Our method is related to SMART (Jiang et al., 2020), FreeLB (Zhu et al., 2020a) and R3F (Aghajanyan et al., 2020). As mentioned before, most of these approaches employ adversarial training strategies to improve the robustness of BERT fine-tuning. SMART solves the supremum by using an adversarial methodology to find the largest KL divergence within an $\epsilon$-ball; FreeLB optimizes a direct adversarial loss $L_{\text{FreeLB}}(\theta) = \sup_{\Delta\theta:\,|\Delta\theta| \le \epsilon} L(\theta + \Delta\theta)$ through iterative gradient ascent steps; R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly.
Compared with this sort of adversarial algorithm, our method is easier to implement and provides a relatively rigorous theoretical guarantee. The design of layer-wise regularization is sensible in that it exploits the characteristics of hierarchical representations in modern deep neural networks. Studies in knowledge distillation have reported similar experience: imitating through middle-layer representations (Adriana et al., 2015; Zagoruyko and Komodakis, 2016) performs better than aligning the final outputs (Hinton et al., 2015). Moreover, LNSR allows us to use different regularization weights for different layers (we use a fixed weight of 1 on all layers in this paper). We leave the exploration of layer-specific weights to future work.

Conclusion
In this paper, we propose Layer-wise Noise Stability Regularization (LNSR) as a lightweight and effective method to improve generalizability and stability when fine-tuning BERT on few training samples. Our proposed LNSR method is a general technique that improves model output stability while maintaining or improving the original performance. Furthermore, we provide a theoretical analysis of the relationship of our method to Lipschitz continuity and the Tikhonov regularizer. Extensive empirical results show that our proposed method can effectively improve the generalizability and stability of the BERT model.

A Experimental Details
The model we use for the experiments in Section 4 is the standard BERT-large model with a 24-layer stacked Transformer (Vaswani et al., 2017) encoder, a hidden size of 1024, and 16 self-attention heads. We initialize the pre-trained part of the model with the BERT-Large-Uncased-Whole-Word-Masking weights. The final layer is a classification layer with 2048 parameters, which accounts for 0.0006% of the total number of parameters in the model. We initialize the weights of the last layer with $\mathcal{N}(0, 0.02^2)$ and set each bias to 0. For the position of noise injection, we uniformly choose the first layer as the starting point of noise regularization. In the analysis of the sensitivity to the position of noise injection, we also try injecting noise at different layers, as shown in Figure 5. For the baseline model Mixout, we use the code from the GitHub repository https://github.com/bloodwass/mixout.git. The other baseline models are implemented by ourselves. Table 4 summarizes the statistics of the datasets used in this work. We use the standard GLUE benchmark datasets downloaded from https://gluebenchmark.com/tasks.

B Other Experimental Reports
We also report the maximum value obtained when fine-tuning BERT with our proposed LNSR regularizer across a large number of random seeds and several noise injection positions, since the maximum value can also reflect the ability of the learning algorithm to reach an optimal point. The results are shown in Table 5. On some tasks, fine-tuning BERT with LNSR is even competitive with fine-tuning state-of-the-art models that adopt more powerful modern architectures and pre-training strategies.
Table 5: The maximum value obtained when fine-tuning the LNSR model over different noise injection positions and random seeds on the four tasks. On some tasks, BERT (standard BERT-large-uncased (Devlin et al., 2019)) with LNSR even becomes competitive with some newly proposed powerful models (bottom rows).
Figure 5: Performance distribution box plot of each model on the four tasks across 25 random seeds.