Visualizing and Understanding the Effectiveness of BERT

Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different tasks. In this paper, we propose to visualize loss landscapes and optimization trajectories of fine-tuning BERT on specific datasets. First, we find that pre-training reaches a good initial point across downstream tasks, which leads to wider optima and easier optimization compared with training from scratch. We also demonstrate that the fine-tuning procedure is robust to overfitting, even though BERT is highly over-parameterized for downstream tasks. Second, the visualization results indicate that fine-tuning BERT tends to generalize better because of the flat and wide optima, and the consistency between the training loss surface and the generalization error surface. Third, the lower layers of BERT are more invariant during fine-tuning, which suggests that the layers that are close to input learn more transferable representations of language.


Introduction
Language model pre-training has achieved strong performance in many NLP tasks (Peters et al., 2018; Howard and Ruder, 2018a; Radford et al., 2018; Devlin et al., 2018; Baevski et al., 2019; Dong et al., 2019). A neural encoder is trained on a large text corpus by using language modeling objectives. Then the pre-trained model is either used to extract vector representations for input, or fine-tuned on the specific datasets.
Recent work (Tenney et al., 2019b; Liu et al., 2019a; Goldberg, 2019; Tenney et al., 2019a) has shown that the pre-trained models can encode syntactic and semantic information of language. However, it is unclear why pre-training is effective on downstream tasks in terms of both trainability and generalization capability. In this work, we take BERT (Devlin et al., 2018) as an example to understand the effectiveness of pre-training. We visualize the loss landscapes and the optimization procedure of fine-tuning on specific datasets in three ways. First, we compute the one-dimensional (1D) loss curve, so that we can inspect the difference between fine-tuning BERT and training from scratch. Second, we visualize the two-dimensional (2D) loss surface, which provides more information about loss landscapes than 1D curves. Third, we project the high-dimensional optimization trajectory of fine-tuning onto the obtained 2D loss surface, which demonstrates the learning properties in an intuitive way.
* Contribution during internship at Microsoft Research.
The main findings are summarized as follows. First, visualization results indicate that BERT pre-training reaches a good initial point across downstream tasks, which leads to wider optima on the 2D loss landscape compared with random initialization. Moreover, the visualization of optimization trajectories shows that pre-training results in easier optimization and faster convergence. We also demonstrate that the fine-tuning procedure is robust to overfitting. Second, loss landscapes of fine-tuning partially explain the good generalization capability of BERT. Specifically, pre-training obtains flatter and wider optima, which indicates that the pre-trained model tends to generalize better on unseen data (Chaudhari et al., 2017; Li et al., 2018; Izmailov et al., 2018). Additionally, we find that the training loss surface correlates well with the generalization error. Third, we demonstrate that the lower (i.e., close to input) layers of BERT are more invariant across tasks than the higher layers, which suggests that the lower layers learn transferable representations of language. We verify this point by visualizing the loss landscape with respect to different groups of layers.

Background: BERT
We use BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2018) as an example of pre-trained language models in our experiments. BERT is pre-trained on a large corpus by using the masked language modeling and next-sentence prediction objectives. Then we can add task-specific layers to the BERT model, and fine-tune all the parameters according to the downstream tasks.
BERT employs a Transformer (Vaswani et al., 2017) network to encode contextual information, which contains multi-layer self-attention blocks.
Given the input token embeddings $H^0$, an $L$-layer Transformer encodes the input: $H^l = \mathrm{Transformer\_block}_l(H^{l-1})$, where $l = 1, \dots, L$, and $H^L = [h^L_1, \dots, h^L_{|x|}]$ for an input of $|x|$ tokens. We use the hidden vector $h^L_i$ as the contextualized representation of the input token $x_i$. For more implementation details, we refer readers to Vaswani et al. (2017).

Methodology
We employ three visualization methods to understand why fine-tuning the pre-trained BERT model can achieve better performance on downstream tasks compared with training from scratch. We plot both one-dimensional and two-dimensional loss landscapes of BERT on the specific datasets. In addition, we project the optimization trajectories of the fine-tuning procedure onto the loss surface. The visualization algorithms can also be used for models that are trained from random initialization, so that we can compare the two learning paradigms.

One-dimensional Loss Curve
Let $\theta_0$ denote the initialized parameters. For fine-tuning BERT, $\theta_0$ represents the pre-trained parameters. For training from scratch, $\theta_0$ represents the randomly initialized parameters. After fine-tuning, the model parameters are updated to $\theta_1$. The one-dimensional (1D) loss curve aims to quantify the loss values along the optimization direction (i.e., from $\theta_0$ to $\theta_1$).
The loss curve is plotted by linear interpolation between $\theta_0$ and $\theta_1$ (Goodfellow and Vinyals, 2015). The curve function $f(\alpha)$ is defined as:
$$f(\alpha) = \mathcal{J}(\theta_0 + \alpha \delta_1) \quad (1)$$
where $\alpha$ is a scalar parameter, $\delta_1 = \theta_1 - \theta_0$ is the optimization direction, and $\mathcal{J}(\theta)$ is the loss function under the model parameters $\theta$. In our experiments, we set the range of $\alpha$ to $[-4, 4]$ and sample 40 points. Note that we only consider the parameters of BERT in $\theta_0$ and $\theta_1$, so $\delta_1$ only indicates the updates of the original BERT parameters. The effect of the added task-specific layers is eliminated by keeping them fixed to the learned values.
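A minimal sketch of the interpolation in Equation (1), assuming the model parameters are flattened into a single NumPy vector and that `loss_fn` is a hypothetical callable returning the scalar loss for a parameter vector. The quadratic `toy_loss` only stands in for the BERT training loss; the grid range and the number of sample points follow the settings stated above.

```python
import numpy as np

def loss_curve_1d(loss_fn, theta0, theta1, alpha_min=-4.0, alpha_max=4.0, num_points=40):
    """Evaluate f(alpha) = J(theta0 + alpha * (theta1 - theta0)) on a regular grid."""
    delta1 = theta1 - theta0  # optimization direction
    alphas = np.linspace(alpha_min, alpha_max, num_points)
    losses = np.array([loss_fn(theta0 + a * delta1) for a in alphas])
    return alphas, losses

# Toy stand-in for J (an assumption): a quadratic bowl with its minimum at `target`.
target = np.array([1.0, -2.0, 0.5])

def toy_loss(theta):
    return float(np.sum((theta - target) ** 2))

theta0 = np.zeros(3)    # plays the role of the initial parameters
theta1 = target.copy()  # plays the role of the fine-tuned parameters
alphas, losses = loss_curve_1d(toy_loss, theta0, theta1)
```

By construction, `alpha = 0` corresponds to the initial point and `alpha = 1` to the fine-tuned point, so the sampled curve extends well beyond both ends of the optimization path.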

Two-dimensional Loss Surface
The one-dimensional loss curve can be extended to the two-dimensional (2D) loss surface (Li et al., 2018). As in Equation (1), we need to define two directions ($\delta_1$ and $\delta_2$) as axes to plot the loss surface:
$$f(\alpha, \beta) = \mathcal{J}(\theta_0 + \alpha \delta_1 + \beta \delta_2) \quad (2)$$
where $\alpha$, $\beta$ are scalar values, $\mathcal{J}(\cdot)$ is the loss function, and $\theta_0$ represents the initialized parameters. Similar to Section 3.1, we are only interested in the parameter space of the BERT encoder, without taking into consideration task-specific layers.
One of the axes is the optimization direction $\delta_1 = \theta_1 - \theta_0$ on the target dataset, which is defined in the same way as in Equation (1). We compute the other axis direction via $\delta_2 = \theta_2 - \theta_0$, where $\theta_2$ represents the fine-tuned parameters on another dataset. So the other axis is the optimization direction of fine-tuning on another dataset. Even though the other dataset is randomly chosen, experimental results confirm that the optimization directions $\delta_1$, $\delta_2$ are divergent and orthogonal to each other because of the high-dimensional parameter space. The direction vectors $\delta_1$ and $\delta_2$ are projected onto a two-dimensional plane. It is beneficial to ensure the scale equivalence of the two axes for visualization purposes. Similar to the filter normalization approach introduced by Li et al. (2018), we address this issue by normalizing the two direction vectors to the same norm. We re-scale $\delta_2$ to $\frac{\|\delta_1\|}{\|\delta_2\|}\delta_2$, where $\|\cdot\|$ denotes the Euclidean norm. We set the range of both $\alpha$ and $\beta$ to $[-4, 4]$ and sample 40 points for each axis.
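The construction of the two axes and the grid evaluation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: parameters are flattened NumPy vectors, `loss_fn` is a hypothetical scalar loss, and random vectors stand in for the fine-tuned parameters on the two datasets.

```python
import numpy as np

def rescale_to_norm(delta2, delta1):
    """Rescale delta2 so that its Euclidean norm equals that of delta1."""
    return delta2 * (np.linalg.norm(delta1) / np.linalg.norm(delta2))

def loss_surface_2d(loss_fn, theta0, theta1, theta2, lim=4.0, num_points=40):
    """Evaluate f(alpha, beta) = J(theta0 + alpha*delta1 + beta*delta2) on a grid."""
    delta1 = theta1 - theta0                           # direction on the target dataset
    delta2 = rescale_to_norm(theta2 - theta0, delta1)  # direction on another dataset
    alphas = np.linspace(-lim, lim, num_points)
    betas = np.linspace(-lim, lim, num_points)
    surface = np.empty((num_points, num_points))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta0 + a * delta1 + b * delta2)
    return alphas, betas, surface, delta1, delta2

# Toy data (assumptions): random high-dimensional "parameters" and a quadratic loss.
rng = np.random.default_rng(0)
dim = 1000
theta0 = rng.normal(size=dim)
theta1 = theta0 + rng.normal(size=dim)
theta2 = theta0 + 3.0 * rng.normal(size=dim)

def toy_loss(theta):
    return float(np.mean((theta - theta1) ** 2))

alphas, betas, surface, delta1, delta2 = loss_surface_2d(toy_loss, theta0, theta1, theta2)
cosine = float(delta1 @ delta2 / (np.linalg.norm(delta1) * np.linalg.norm(delta2)))
```

In this toy setting the two random directions are nearly orthogonal, which mirrors the observation above about high-dimensional parameter spaces; for real checkpoints the cosine would have to be measured rather than assumed.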

Optimization Trajectory
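Our goal is to project the optimization trajectory of the fine-tuning procedure onto the 2D loss surface obtained in Section 3.2. Let $\{(d^\alpha_i, d^\beta_i)\}_{i=1}^{T}$ denote the projected optimization trajectory, where $(d^\alpha_i, d^\beta_i)$ is a projected point in the loss surface, and $i = 1, \dots, T$ represents the i-th epoch of fine-tuning.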
From Equation (2), we already know the optimization direction $\delta_1 = \theta_1 - \theta_0$ on the target dataset. We can compute the deviation degrees between the optimization direction and the trajectory to visualize the projection results. Let $\theta^i$ denote the BERT parameters at the i-th epoch, and let $\delta^i = \theta^i - \theta_0$ denote the optimization direction at the i-th epoch. To be specific, we first compute the cosine similarity between $\delta^i$ and $\delta_1$, $\frac{\delta^i \times \delta_1}{\|\delta^i\| \|\delta_1\|}$, where $\times$ denotes the inner product of two vectors and $\|\cdot\|$ denotes the Euclidean norm. This similarity indicates the angle between the current optimization direction and the final optimization direction. Then we obtain the projection values $d^\alpha_i$ and $d^\beta_i$ by computing the deviation degrees between the optimization direction $\delta^i$ and the axes.

We use the same data split as in . The accuracy metric is used for evaluation.
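One way to place per-epoch checkpoints on the 2D plane can be sketched as follows. This is an illustrative assumption rather than the exact formula used in the paper: it swaps the cosine-based deviation computation for an ordinary least-squares projection onto the span of the two axis directions. The names `project_trajectory` and `checkpoints` are hypothetical, and parameters are assumed to be flattened NumPy vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two parameter-space directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def project_trajectory(checkpoints, theta0, delta1, delta2):
    """Least-squares coordinates of each (theta_i - theta0) in the (delta1, delta2) basis."""
    basis = np.stack([delta1, delta2], axis=1)                     # shape (dim, 2)
    deltas = np.stack([c - theta0 for c in checkpoints], axis=1)   # shape (dim, T)
    coords, *_ = np.linalg.lstsq(basis, deltas, rcond=None)        # shape (2, T)
    return coords.T                                                # one (alpha, beta) row per epoch

# Toy data (assumptions): two random axis directions and two synthetic checkpoints.
rng = np.random.default_rng(0)
theta0 = np.zeros(50)
delta1 = rng.normal(size=50)
delta2 = rng.normal(size=50)
checkpoints = [theta0 + 0.5 * delta1 + 0.2 * delta2,  # an intermediate epoch
               theta0 + delta1]                        # the final fine-tuned point
trajectory = project_trajectory(checkpoints, theta0, delta1, delta2)
final_cosine = cosine_similarity(checkpoints[-1] - theta0, delta1)
```

Under this projection the final checkpoint lands at coordinates (1, 0), since its update equals the first axis direction.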

Experimental Setup
We employ the pre-trained BERT-large model in our experiments. The cased version of the tokenizer is used. We follow the settings and the hyperparameters suggested by Devlin et al. (2018). The Adam (Kingma and Ba, 2015) optimizer is used for fine-tuning. The number of fine-tuning epochs is selected from {3, 4, 5}. For RTE and MRPC, we set the batch size to 32, and the learning rate to 1e-5. For MNLI and SST-2, the batch size is 64, and the learning rate is 3e-5.
For the setting of training from scratch, we use the same network architecture as BERT, and randomly initialize the model parameters. Most hyper-parameters are kept the same. The number of training epochs is larger than for fine-tuning BERT, because training from scratch requires more epochs to converge. The number of epochs is set to 8 for SST-2, and 16 for the other datasets, which is validated on the development set.
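For reference, the settings stated in this section can be collected in one place. The values are taken from the text; the dictionary layout and the helper `fine_tune_grid` are hypothetical and only illustrate how the epoch candidates could be enumerated.

```python
# Values as stated in the text; names and structure are illustrative assumptions.
FINE_TUNE = {
    "RTE":   {"batch_size": 32, "learning_rate": 1e-5},
    "MRPC":  {"batch_size": 32, "learning_rate": 1e-5},
    "MNLI":  {"batch_size": 64, "learning_rate": 3e-5},
    "SST-2": {"batch_size": 64, "learning_rate": 3e-5},
}
FINE_TUNE_EPOCH_CANDIDATES = (3, 4, 5)

# Epochs for training from scratch: 8 for SST-2, 16 for the other datasets.
SCRATCH_EPOCHS = {"RTE": 16, "MRPC": 16, "MNLI": 16, "SST-2": 8}

def fine_tune_grid(dataset):
    """List one fine-tuning configuration per candidate epoch count."""
    base = FINE_TUNE[dataset]
    return [{**base, "optimizer": "Adam", "epochs": e} for e in FINE_TUNE_EPOCH_CANDIDATES]

runs = fine_tune_grid("MRPC")
```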

Pre-training Gets a Good Initial Point Across Downstream Tasks
Fine-tuning BERT usually performs significantly better than training the same network with random initialization, especially when the data size is small. Results indicate that language model pre-training objectives learn good initialization for downstream tasks. In this section, we inspect the benefits of using BERT as the initial point from three aspects.

Pre-training Leads to Wider Optima
As described in Section 3.2, we plot 2D training loss surfaces on four datasets in Figure 1. We observe that the optima obtained by fine-tuning BERT are much wider than those obtained by training from scratch. A wide optimum of fine-tuning BERT implies that small perturbations of the model parameters do not seriously hurt the final performance, while a thin optimum is more sensitive to these subtle changes. Moreover, in Section 6.1, we further discuss how the width of the optima contributes to the generalization capability. As shown in Figure 1, the fine-tuning path from the start point to the end point on the loss landscape is smoother than that of training from scratch. In other words, the training loss of fine-tuning BERT tends to decrease monotonically along the optimization direction, which eases optimization and accelerates training convergence. In contrast, the path from the random initial point to the end point is rougher, which requires a more carefully tweaked optimizer to obtain reasonable performance.

Pre-training Eases Optimization on Downstream Tasks
We fine-tune BERT and train the same network from scratch on four datasets. The learning curves are shown in Figure 2. We find that training from scratch requires more iterations to converge on the datasets, while pre-training-then-fine-tuning converges faster in terms of training loss. We also notice that the final loss of training from scratch tends to be higher than that of fine-tuning BERT, even if it undergoes more epochs. On the RTE dataset, training the model from scratch has a hard time decreasing the loss in the first few epochs. In order to visualize the dynamic convergence process, we plot the optimization trajectories using the method described in Section 3.3. As shown in Figure 3, for training from scratch, the optimization directions of the first few epochs are divergent from the final optimization direction. Moreover, the loss landscape from the initial point to the end point is rougher than that of fine-tuning BERT. For example, the trajectory of training from scratch on the MRPC dataset crosses an obstacle to reach the end point.
Compared with training from scratch, fine-tuning BERT finds the optimization direction in a more straightforward way. The optimization process also converges faster. In addition, the fine-tuning path is unimpeded along the optimization direction. Because of the wider optima near the initial point, fine-tuning BERT tends to reach the expected optimal region even if it optimizes along the direction of the first epoch.

Pre-training-then-fine-tuning is Robust to Overfitting
The BERT-large model has 345M parameters, which is over-parameterized for the target datasets. However, experimental results show that fine-tuning BERT is robust to overfitting, i.e., the generalization error (namely, the classification error rate on the development set) does not dramatically increase with more training epochs, despite the huge number of model parameters. We use the MRPC dataset as a case study, because its data size is relatively small, which makes it prone to overfitting if we train the model from scratch. As shown in Figure 4, we plot the optimization trajectory of fine-tuning on the generalization error surface. We first fine-tune the BERT model for five epochs as suggested by Devlin et al. (2018). Then we continue fine-tuning for another twenty epochs, which still obtains performance comparable to the first five epochs. Figure 4 shows that even though we fine-tune the BERT model for twenty more epochs, the final estimation is not far away from its original optimum. Moreover, the optimum area is wide enough to prevent the model from jumping out of the region with good generalization capability, which explains why the pre-training-then-fine-tuning paradigm is robust to overfitting.

Pre-training Helps to Generalize Better
Although training from scratch can achieve training losses comparable to fine-tuning BERT, the model with random initialization usually has poor performance on unseen data. In this section, we use visualization techniques to understand why the model obtained by pre-training-then-fine-tuning tends to have better generalization capability. The two-dimensional generalization error surface is presented. We find that pre-training-then-fine-tuning is robust to overfitting.

Wide and Flat Optima Lead to Better Generalization
Previous work (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Li et al., 2018) shows that the flatness of a local optimum correlates with the generalization capability, i.e., flatter optima lead to better generalization. This finding inspires us to inspect the loss landscapes of BERT fine-tuning, so that we can understand the generalization capability from the perspective of the flatness of optima. Section 5.1 shows that the optima obtained by fine-tuning BERT are wider than those obtained by training from scratch. As shown in Figure 5, we further plot one-dimensional training loss curves of both fine-tuning BERT and training from scratch, which represent the transverse section of the two-dimensional loss surface along the optimization direction. We normalize the scale of axes for flatness comparison as suggested by Li et al. (2018). Figure 5 shows that the optima of fine-tuning BERT are flatter, while training from scratch obtains sharper optima. The results indicate that pre-training-then-fine-tuning tends to generalize better on unseen data.
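The paper compares flatness visually; one simple way to put a number on it is sketched here. The width measure is an assumption introduced for illustration (the length of the contiguous interval around the minimum where the loss stays within a tolerance of its minimum), and the two quadratic curves are synthetic stand-ins for 1D loss curves sampled on the same normalized axis.

```python
import numpy as np

def basin_width(alphas, losses, tolerance):
    """Length of the contiguous alpha-interval around the minimum with loss <= min + tolerance."""
    losses = np.asarray(losses)
    k = int(np.argmin(losses))
    threshold = losses[k] + tolerance
    left = k
    while left > 0 and losses[left - 1] <= threshold:
        left -= 1
    right = k
    while right < len(losses) - 1 and losses[right + 1] <= threshold:
        right += 1
    return float(alphas[right] - alphas[left])

# Synthetic curves (assumptions): a shallow bowl and a steep bowl, both centred at alpha = 1.
alphas = np.linspace(-4.0, 4.0, 40)
flat_curve = 0.1 * (alphas - 1.0) ** 2
sharp_curve = 5.0 * (alphas - 1.0) ** 2

flat_width = basin_width(alphas, flat_curve, tolerance=0.5)
sharp_width = basin_width(alphas, sharp_curve, tolerance=0.5)
```

The comparison is only meaningful when both curves share the same axis scale, which is why the normalization of the axes matters for flatness comparisons.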

Consistency Between Training Loss Surface and Generalization Error Surface
To further understand the effects of the geometry of loss functions on the generalization capability, we compare the training loss surfaces with the generalization error surfaces on different datasets. The classification error rate on the development set is used as an indicator of the generalization capability.
As shown in Figure 6, we find that the end points of fine-tuning BERT fall into the wide areas with smaller generalization error. The results show that the generalization error surfaces are consistent with the corresponding training loss surfaces on the datasets, i.e., smaller training loss tends to decrease the error on the development set. Moreover, the fine-tuned BERT models tend to stay approximately optimal under subtle perturbations. The visualization results also indicate that it is preferable to converge to wider and flatter local optima, as the training loss surface and the generalization error surface are shifted with respect to each other (Izmailov et al., 2018). In contrast, training from scratch obtains thinner optimum areas and poorer generalization than fine-tuning BERT, especially on the datasets with relatively small data size (such as MRPC and RTE). Intuitively, the thin and sharp optima on the training loss surfaces are hard to transfer to the generalization surfaces.
For training from scratch, it is not surprising that on larger datasets (such as MNLI and SST-2) the generalization error surfaces are more consistent with the training loss surfaces. The results suggest that training the model from scratch usually requires more training examples to generalize better compared with fine-tuning BERT.
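The consistency between the two surfaces is judged visually in the paper. As a hedged illustration, it could also be summarized with a Pearson correlation over the grid points. The function name `surface_correlation` and the synthetic surfaces are assumptions: the "development error" surface here is a slightly shifted copy of the "training loss" surface, mimicking the shift between the two surfaces mentioned above.

```python
import numpy as np

def surface_correlation(train_loss_surface, dev_error_surface):
    """Pearson correlation between two surfaces sampled on the same (alpha, beta) grid."""
    a = np.asarray(train_loss_surface).ravel()
    b = np.asarray(dev_error_surface).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic surfaces (assumptions) on the same 40 x 40 grid over [-4, 4].
grid = np.linspace(-4.0, 4.0, 40)
A, B = np.meshgrid(grid, grid, indexing="ij")
train_surface = A ** 2 + B ** 2
dev_surface = (A - 0.3) ** 2 + (B + 0.2) ** 2  # same shape, shifted optimum

correlation = surface_correlation(train_surface, dev_surface)
```

A wide optimum tolerates such a shift, because the training optimum still lies inside the low-error region of the shifted surface, whereas a sharp optimum may not.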

Lower Layers of BERT are More Invariant and Transferable
The BERT-large model has 24 layers. Different layers could have learned different granularities or perspectives of language during the pre-training procedure. For example, Tenney et al. (2019a) observe that most local syntactic phenomena are encoded in lower layers while higher layers capture more complex semantics. They also show that most examples can be classified correctly in the first few layers. Based on these observations, we conjecture that lower layers of BERT are more invariant and transferable across tasks. We divide the layers of the BERT-large model into three groups: low layers (0th-7th layer), middle layers (8th-15th layer), and high layers (16th-23rd layer). As shown in Figure 7, we plot the two-dimensional loss surfaces with respect to different groups of layers (i.e., a parameter subspace instead of all parameters) around the fine-tuned point. To be specific, we modify the loss surface function in Section 3.2 to $f(\alpha, \beta) = \mathcal{J}(\theta_1 + \alpha \delta_1^G + \beta \delta_2^G)$, where $\theta_1$ represents the fine-tuned parameters, $G \in \{\text{low layers}, \text{middle layers}, \text{high layers}\}$, and the optimization direction of the layer group is used as the axis. On the visualized loss landscapes, $f(0, 0)$ corresponds to the loss value at the fine-tuned point, and $f(-1, 0)$ corresponds to the loss value with the corresponding layer group rolled back to its original values in the pre-trained BERT model. Figure 7 shows that the loss surface with respect to lower layers has a wider local optimum along the optimization direction. The results demonstrate that rolling back the parameters of lower layers to their original values (the star-shaped points in Figure 7) does not dramatically hurt the model performance. In contrast, rolling back high layers makes the model fall into the region with high loss. This phenomenon indicates that the optimization of high layers is critical to fine-tuning, whereas lower layers are more invariant and transferable across tasks.
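The layer-group restriction and the rollback at $f(-1, 0)$ can be sketched as follows. The representation is an assumption made for illustration: parameters are stored as a dictionary from layer index to a NumPy array, and random arrays stand in for the pre-trained and fine-tuned weights. The group boundaries follow the text.

```python
import numpy as np

# Layer groups as defined in the text (24 layers, indexed 0..23).
LAYER_GROUPS = {"low": range(0, 8), "middle": range(8, 16), "high": range(16, 24)}

def group_direction(theta_from, theta_to, group):
    """Direction theta_to - theta_from restricted to one layer group; zero for other layers."""
    layers = LAYER_GROUPS[group]
    return {
        layer: (theta_to[layer] - theta_from[layer]) if layer in layers
        else np.zeros_like(theta_from[layer])
        for layer in theta_from
    }

def shift(theta, direction, alpha):
    """Return theta + alpha * direction, layer by layer."""
    return {layer: theta[layer] + alpha * direction[layer] for layer in theta}

# Toy parameters (assumptions): one small random vector per layer.
rng = np.random.default_rng(0)
theta0 = {layer: rng.normal(size=4) for layer in range(24)}                   # "pre-trained"
theta1 = {layer: theta0[layer] + rng.normal(size=4) for layer in range(24)}   # "fine-tuned"

delta_low = group_direction(theta0, theta1, "low")
rolled_back = shift(theta1, delta_low, -1.0)  # the point f(-1, 0) for the low-layer group
```

Moving by $-1$ along the group direction restores the pre-trained values for the layers in that group and leaves all other layers at their fine-tuned values, which is the rollback evaluated in the text.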
To verify this further, we roll back different layer groups of the fine-tuned model to the parameters of the original pre-trained BERT model. The accuracy results on the development set are presented in Table 1. Similar to Figure 7, the generalization capability does not dramatically decrease after rolling back low layers or middle layers. Rolling back low layers (0th-7th layer) even improves the generalization capability on the MNLI dataset. By contrast, rolling back high layers hurts the model performance. Evaluation results suggest that low layers that are close to input learn more transferable representations of language, which makes them more invariant across tasks. Moreover, high layers seem to play a more important role in learning task-specific information during fine-tuning.
Table 1: Accuracy on the development sets. The second column represents the fine-tuned BERT models on the specific datasets. The last column represents the fine-tuned models with different groups of layers rolled back.
Recent work on inspecting the effectiveness of the pre-trained models (Linzen et al., 2016; Kuncoro et al., 2018; Tenney et al., 2019b; Liu et al., 2019a) focuses on analyzing the syntactic and semantic properties. Tenney et al. (2019b) and Liu et al. (2019a) suggest that pre-training helps the models to encode much syntactic information and many transferable features through evaluating models on several probing tasks. Goldberg (2019) assesses the syntactic abilities of BERT and draws similar conclusions. Our work explores the effectiveness of pre-training from another angle. We propose to visualize the loss landscapes and optimization trajectories of the BERT fine-tuning procedure. The visualization results help us to understand the benefits of pre-training in a more intuitive way. More importantly, the geometry of loss landscapes partially explains why fine-tuning BERT can achieve better generalization capability than training from scratch.
Liu et al. (2019a) find that different layers of BERT exhibit different transferability. Peters et al. (2019) show that the classification tasks build up information mainly in the intermediate and last layers of BERT. Tenney et al. (2019a) observe that low layers of BERT encode more local syntax, while high layers capture more complex semantics. Zhang et al. (2019) also show that not all layers of a deep neural model contribute equally to model performance. We draw a similar conclusion by visualizing the layer-wise loss surface of BERT on downstream tasks. In addition, we find that low layers of BERT are more invariant and transferable across datasets.
In the computer vision community, many efforts have been made to visualize the loss function, and to figure out how the geometry of a loss function affects generalization (Goodfellow and Vinyals, 2015; Im et al., 2016; Li et al., 2018). Hochreiter and Schmidhuber (1997) define the flatness as the size of the connected region around a minimum. Keskar et al. (2016) characterize flatness using eigenvalues of the Hessian, and conclude that small-batch training converges to flat minima, which leads to good generalization. Li et al. (2018) propose a filter normalization method to reduce the influence of parameter scale, and show that the sharpness of a minimum correlates well with generalization capability. The assumption is also used to design optimization algorithms (Chaudhari et al., 2017; Izmailov et al., 2018), which aim at finding broader optima with better generalization than standard SGD.

Conclusion
We visualize the loss landscapes and optimization trajectories of the BERT fine-tuning procedure in order to inspect the effectiveness of language model pre-training. We find that pre-training leads to wider optima on the loss landscape, and eases optimization compared with training from scratch. Moreover, we give evidence that the pre-training-then-fine-tuning paradigm is robust to overfitting. We also demonstrate the consistency between the training loss surfaces and the generalization error surfaces, which explains why pre-training improves the generalization capability. In addition, we find that low layers of the BERT model are more invariant and transferable across tasks.
All our experiments and conclusions were derived from BERT fine-tuning.
A further understanding of how multi-task training with BERT (Liu et al., 2019b) improves fine-tuning, and how it affects the geometry of loss surfaces, is worthy of exploration, which we leave to future work. Moreover, the results motivate us to develop fine-tuning algorithms that converge to wider and flatter optima, which would lead to better generalization on unseen data. In addition, we would like to apply the proposed methods to other pre-trained models.