Calibrated Language Model Fine-Tuning for In- and Out-of-Distribution Data

Fine-tuned pre-trained language models can suffer from severe miscalibration for both in-distribution and out-of-distribution (OOD) data due to over-parameterization. To mitigate this issue, we propose a regularized fine-tuning method. Our method introduces two types of regularization for better calibration: (1) On-manifold regularization, which generates pseudo on-manifold samples through interpolation within the data manifold. Augmented training with these pseudo samples imposes a smoothness regularization to improve in-distribution calibration. (2) Off-manifold regularization, which encourages the model to output uniform distributions for pseudo off-manifold samples to address the over-confidence issue for OOD data. Our experiments demonstrate that the proposed method outperforms existing calibration methods for text classification in terms of expected calibration error, misclassification detection, and OOD detection on six datasets. Our code can be found at https://github.com/Lingkai-Kong/Calibrated-BERT-Fine-Tuning.


Introduction
Pre-trained language models have recently brought the natural language processing (NLP) community into the transfer learning era. The transfer learning framework consists of two stages: we first pre-train a large-scale language model (e.g., BERT, RoBERTa, ALBERT (Lan et al., 2020) or T5 (Raffel et al., 2019)) on a large text corpus, and then fine-tune it on downstream tasks. Such a fine-tuning approach has achieved state-of-the-art performance on many NLP benchmarks (Wang et al., 2018, 2019). Many applications, however, require trustworthy predictions that are not only accurate but also well calibrated. In particular, a well-calibrated model should produce reliable confidence estimates for both in-distribution and out-of-distribution (OOD) data: (1) for in-distribution data, a model should produce predictive probabilities close to the true likelihood for each class, i.e., confidence ≈ true likelihood; (2) for OOD data, which do not belong to any class of the training data, the model should produce output with high uncertainty to say 'I don't know', i.e., confidence ≈ random guess, instead of producing absurdly wrong yet wildly confident predictions. Providing such calibrated output probabilities can help us achieve better model robustness (Lee et al., 2018), model fairness (Chouldechova, 2017), and improved label efficiency via uncertainty-driven learning (Gal et al., 2017; Siddhant and Lipton, 2018; Shen et al., 2018).

Figure 1: The reliability diagrams on in-distribution data (first row) and the histograms of model confidence on out-of-distribution (OOD) data (second row) of a CNN (Kim, 2014) and a fine-tuned BERT-MLP classifier. Though BERT improves classification accuracy, it makes over-confident predictions for both in-distribution and OOD data.
Unfortunately, Guo et al. (2017) have shown that, due to over-parameterization, deep convolutional neural networks are often miscalibrated. Our experimental investigation further corroborates that fine-tuned language models can suffer from miscalibration even more for NLP tasks. As shown in Figure 1, we present the calibration of a BERT-MLP model for a text classification task on the 20NG dataset. Specifically, we train a TextCNN (Kim, 2014) and a BERT-MLP using 20NG-15 (the first 15 categories of 20NG) and then evaluate them on both in-distribution and OOD data. The first row plots their reliability diagrams (Niculescu-Mizil and Caruana, 2005) on the test set of 20NG-15. Though BERT improves the classification accuracy from 83.9% to 87.4%, it also increases the expected calibration error (ECE, see more details in Section 2) from 4.0% to 9.5%. This indicates that BERT-MLP is much more miscalibrated for in-distribution data. The second row plots the histograms of the model confidence, i.e., the maximum output probability, on the test set of 20NG-5 (the 5 unseen categories of 20NG). While it is desirable to produce low probabilities for these unseen classes, BERT-MLP produces wrong yet over-confident predictions for such OOD data.
Such an aggravation of miscalibration is due to the even more significant over-parameterization of these language models. At the pre-training stage, they are trained on a huge amount of unlabeled data in an unsupervised manner, e.g., T5 is pre-trained on 745 GB text. To capture rich semantic and syntactic information from such a large corpus, the language models are designed to have enormous capacity, e.g., T5 has about 11 billion parameters. At the fine-tuning stage, however, only limited labeled data are available in the downstream tasks. With the extremely high capacity, these models can easily overfit training data likelihood and be over-confident in their predictions.
To fight miscalibration, a natural option is to apply a calibration method such as temperature scaling (Guo et al., 2017) in a post-processing step. However, temperature scaling only learns a single parameter to rescale all the logits, which is inflexible and often insufficient; moreover, it cannot improve out-of-distribution calibration. A second option is to mitigate miscalibration during training using regularization. For example, Pereyra et al. (2017) propose an entropy regularizer to prevent over-confidence, but it can needlessly penalize legitimately confident predictions. A third option is to use Bayesian neural networks (Blundell et al., 2015; Louizos and Welling, 2017), which treat model parameters as probability distributions to represent model uncertainty explicitly. However, these Bayesian approaches are often prohibitive, as the priors of the model parameters are difficult to specify and exact inference is intractable, which can also lead to unreliable uncertainty estimates.
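To make the inflexibility of temperature scaling concrete, consider the following minimal sketch (the function names are ours, not from any particular library): a single scalar T rescales every logit, so it can only soften or sharpen all predictions uniformly.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    # Temperature scaling (Guo et al., 2017): one scalar T rescales every
    # logit. T > 1 softens all predictions by the same amount; it cannot
    # adjust individual examples or produce high uncertainty for OOD inputs.
    return softmax(logits / T)

logits = np.array([[4.0, 1.0, 0.5]])       # made-up logits for illustration
p_raw = temperature_scale(logits, 1.0)     # original confidence
p_cooled = temperature_scale(logits, 2.0)  # globally softened confidence
```

Because the same T applies to every example, this post-hoc fix trades off confidence globally and leaves the ranking of predictions unchanged.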
We propose a regularization approach to addressing miscalibration for fine-tuning pre-trained language models from a data augmentation perspective. We propose two new regularizers using pseudo samples both on and off the data manifold to mitigate data scarcity and prevent overconfident predictions. Specifically, our method imposes two types of regularization for better calibration during fine-tuning: (1) On-manifold regularization: We first generate on-manifold samples by interpolating the training data and their corresponding labels along the direction learned from hidden feature space; training over such augmented on-manifold data introduces a smoothness constraint within the data manifold to improve the model calibration for in-distribution data.
(2) Off-manifold regularization: We generate off-manifold samples by adding relatively large perturbations along directions that point outward from the data manifold; we penalize the negative entropy of the output distribution for such off-manifold samples to address the over-confidence issue for OOD data.
We evaluate our proposed model calibration method on six text classification datasets. For in-distribution data, we measure ECE and the performance of misclassification detection. For out-of-distribution data, we measure the performance of OOD detection. Our experiments show that our method outperforms existing state-of-the-art methods in both settings, and meanwhile maintains competitive classification accuracy.
We summarize our contributions as follows: (1) We propose a general calibration framework, which can be applied to pre-trained language model fine-tuning, as well as other deep neural network-based prediction problems. (2) The proposed method adopts on- and off-manifold regularization from a data augmentation perspective to improve calibration for both in-distribution and OOD data. (3) We conduct comprehensive experiments showing that our method outperforms existing calibration methods in terms of ECE, misclassification detection and OOD detection on six text classification datasets.

Preliminaries
We describe model calibration for both in-distribution and out-of-distribution data.
Calibration for In-distribution Data: For in-distribution data, a well-calibrated model is expected to output prediction confidence comparable to its classification accuracy. For example, given 100 data points each with prediction confidence 0.6, we expect 60 of them to be correctly classified. More precisely, for a data point X, we denote by Y(X) the ground-truth label, Ŷ(X) the label predicted by the model, and P̂(X) the output probability associated with the predicted label. The calibration error of the predictive model for a given confidence p ∈ (0, 1) is defined as:

E(p) = | P( Ŷ(X) = Y(X) | P̂(X) = p ) − p |. (1)

As (1) involves population quantities, we usually adopt empirical approximations (Guo et al., 2017) to estimate the calibration error. Specifically, we partition all data points into M bins of equal size according to their prediction confidences. Let B_m denote the bin with prediction confidences bounded between ℓ_m and u_m. Then, for any p ∈ [ℓ_m, u_m), we define the empirical calibration error as:

Ê(p) = Ê_m = | (1/|B_m|) Σ_{i ∈ B_m} ( 1{y_i = ŷ_i} − p_i ) |, (2)

where y_i, ŷ_i and p_i are the true label, predicted label and confidence for sample i.
To evaluate the overall calibration error of the predictive model, we can further take a weighted average of the calibration errors of all bins, which is also known as the expected calibration error (ECE) (Naeini et al., 2015), defined as:

ECE = Σ_{m=1}^{M} (|B_m| / n) · Ê_m, (3)

where n is the sample size. We remark that the goal of calibration is to minimize the calibration error without significantly sacrificing prediction accuracy; otherwise, a random-guess classifier can achieve zero calibration error.

Calibration for Out-of-distribution Data: In real applications, a model can encounter test data that significantly differ from the training data, e.g., data from unseen classes or potential outliers. A well-calibrated model is expected to produce an output with high uncertainty for such out-of-distribution (OOD) data; formally, the predicted probability of each class should be close to 1/K, where K is the number of classes of the training data. As such, we can detect OOD data by setting an uncertainty threshold.
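The empirical ECE above can be sketched in a few lines (a simplified version using equal-width confidence bins; the function name and binning details are our assumptions):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Partition predictions into confidence bins; ECE is the weighted average
    # of |accuracy - average confidence| over the bins, weights |B_m| / n.
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(conf), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(corr[mask].mean() - conf[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# An over-confident model: 90% confidence but only 60% accuracy.
conf = [0.9] * 10
corr = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

For this toy example all predictions fall into one bin, so the ECE is just the 0.3 gap between average confidence and accuracy.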

Calibrated Fine-Tuning via Manifold Smoothing
We consider N data points of the target task S = {(x_i, y_i)}_{i=1}^{N}, where the x_i's denote the input embeddings of the sentences and the y_i's are the associated one-hot labels. Let f(·) denote the feature extraction layers (e.g., BERT), let g(·) denote the task-specific layer, and let θ denote all parameters of f and g. We propose to optimize the following objective at the fine-tuning stage:

min_θ Σ_{i=1}^{N} ℓ( g(f(x_i)), y_i ) + λ_on R_on + λ_off R_off, (4)

where ℓ is the cross-entropy loss, and λ_on, λ_off are two hyper-parameters. The regularizers R_on and R_off are for on- and off-manifold calibration, respectively.

Figure 2: The on-manifold and off-manifold samples generated by our calibration procedure (the figure shows the training data, on-manifold samples, off-manifold samples, Mixup samples, interpolation paths, and the data manifold). Mixup adopts a coarse linear interpolation and the generated data point may deviate from the data manifold.

On-manifold Regularization
The on-manifold regularizer R_on exploits the interpolation of training data within the data manifold to improve in-distribution calibration. Specifically, given two training samples (x, y) and (x̃, ỹ) and the feature extraction layers f, we generate an on-manifold pseudo sample (x*, y*) as follows:

x* = argmin_{x' ∈ B(x, δ_on)} D_x( f(x'), f(x̃) ), y* = (1 − δ_y) y + δ_y ỹ, (5)

where δ_on and δ_y are small interpolation parameters for the data and the label, D_x is a proper distance for features extracted by f, e.g., the cosine distance D_x(a, b) = 1 − ⟨ a/‖a‖_2, b/‖b‖_2 ⟩, and B(x, δ_on) denotes an ℓ_∞ ball centered at x with radius δ_on, i.e., B(x, δ_on) = { x' : ‖x' − x‖_∞ ≤ δ_on }. As can be seen, x* essentially interpolates between x and x̃ on the data manifold, and D_x(f(·), f(·)) can be viewed as a metric over such a manifold. However, as f(·) is learnt from finite training data, it can recover the actual data manifold only up to a certain statistical error. Therefore, we constrain x* to stay in a small neighborhood of x, which ensures that x* stays close to the actual data manifold.
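The feature-space metric D_x can be instantiated, for example, as the cosine distance between ℓ2-normalized feature vectors. The helper below is a sketch of one common convention (1 minus cosine similarity), not necessarily the exact form used in the paper:

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    # D_x(a, b): 1 minus the cosine similarity of the l2-normalized feature
    # vectors; 0 for identical directions, 2 for opposite directions.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / (np.linalg.norm(a) + eps)  # eps guards against zero vectors
    b = b / (np.linalg.norm(b) + eps)
    return 1.0 - float(np.dot(a, b))
```

Because the vectors are normalized first, the metric depends only on the direction of the features, which is a common choice for comparing hidden representations.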

Algorithm 1: Our proposed efficient stochastic optimization algorithm for solving (4); d is the dimension of the features. For each training iteration, we sample a mini-batch B = {x_i, y_i} from S, generate the on- and off-manifold samples, and update θ with a stochastic gradient step.

Our interpolation is different from existing interpolation methods such as Mixup (Zhang et al., 2018; Verma et al., 2019). These methods adopt coarse linear interpolations either in the input space or the latent feature space, and the generated data may significantly deviate from the data manifold.
Note that our method not only interpolates x but also y. This can yield a soft label for x* when x and x̃ belong to different classes. Such an interpolation is analogous to semi-supervised learning, where soft pseudo labels are generated for the unlabelled data. These soft-labelled data essentially induce a smoothing effect, and prevent the model from making overconfident predictions toward one single class.
We remark that our method is more adaptive than the label smoothing method (Müller et al., 2019). As each interpolated data point involves at most two classes, it is unnecessary to distribute probability mass to the other classes in the soft label. In contrast, label smoothing is more rigid and enforces all classes to have equal nonzero probability mass in the soft label.
We then define the on-manifold regularizer as

R_on = (1/|S_on|) Σ_{(x*, y*) ∈ S_on} D_KL( y* ‖ g(f(x*)) ), (6)

where S_on denotes the set of all pseudo-labelled data generated by our interpolation method, and D_KL denotes the KL divergence between two probability distributions.
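A minimal sketch of the on-manifold penalty for a single pseudo sample: the soft label follows the interpolation above, while the model probabilities here are made-up numbers for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) between two discrete probability distributions.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

# Interpolated soft label with delta_y = 0.1, mixing classes 0 and 2.
delta_y = 0.1
y = np.array([1.0, 0.0, 0.0])        # one-hot label of x
y_tilde = np.array([0.0, 0.0, 1.0])  # one-hot label of the paired sample
y_star = (1 - delta_y) * y + delta_y * y_tilde  # soft label [0.9, 0.0, 0.1]

model_prob = np.array([0.85, 0.05, 0.10])  # hypothetical model output g(f(x*))
r_on = kl_divergence(y_star, model_prob)   # one term of R_on
```

Note that the soft label puts mass on at most two classes, in contrast to label smoothing, which spreads mass over all classes.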

Off-manifold Regularization
The off-manifold regularizer, R_off, encourages the model to yield low-confidence outputs for samples outside the data manifold, and thus mitigates the over-confidence issue for out-of-distribution (OOD) data. Specifically, given a training sample (x, y), we generate an off-manifold pseudo sample x* along the adversarial direction:

x* = argmax_{x' ∈ S(x, δ_off)} ℓ( g(f(x')), y ), (7)

where S(x, δ_off) denotes an ℓ_∞ sphere centered at x with a radius δ_off.
Since we expect x* to mimic OOD data, we first need to choose a relatively large δ_off such that the sphere S(x, δ_off) can reach outside the data manifold. Then, we generate the pseudo off-manifold sample from the sphere along the adversarial direction. Existing literature (Stutz et al., 2019; Gilmer et al., 2018) has shown that such an adversarial direction points outward from the data manifold.
By penalizing the prediction confidence for these off-manifold samples, we encourage low prediction confidence for OOD data. Hence, we define the off-manifold regularizer as

R_off = − (1/|S_off|) Σ_{x* ∈ S_off} H( g(f(x*)) ), (8)

where S_off denotes the set of all generated off-manifold samples, and H(·) denotes the entropy of the output probability distribution.
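For a single off-manifold sample, the penalty is the negative entropy of the predicted distribution; minimizing it pushes the output toward uniform. A sketch with made-up probabilities:

```python
import numpy as np

def neg_entropy(p, eps=1e-12):
    # -H(p): the off-manifold penalty for one sample. It is minimized
    # (entropy maximized) by the uniform distribution.
    p = np.asarray(p, dtype=float) + eps
    return float(np.sum(p * np.log(p)))

uniform = np.full(4, 0.25)                   # "I don't know" output
peaked = np.array([0.97, 0.01, 0.01, 0.01])  # over-confident output

r_uniform = neg_entropy(uniform)  # smallest possible penalty
r_peaked = neg_entropy(peaked)    # larger penalty
```

Since the uniform output attains the smallest penalty, gradient descent on R_off drives off-manifold predictions toward 1/K per class, which is exactly the desired OOD behavior.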

Model Training
We can adopt stochastic gradient-type algorithms such as ADAM (Kingma and Ba, 2014) to optimize (4). At each iteration, we first solve the two inner optimization problems in (5) and (7), and then plug the resulting pseudo samples into (4) to compute the stochastic gradient. The two inner problems can be solved using projected sign-gradient updates for multiple steps. In practice, we observe that a single update step with random initialization is already sufficient to efficiently optimize θ. Such a phenomenon has also been observed in the existing literature on adversarial training (Wong et al., 2019). We summarize the overall training procedure in Algorithm 1.
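The single-step inner update can be sketched as follows (a simplified numpy version; the random initialization follows Wong et al. (2019), while the `mode` argument and the exact projections are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step_perturbation(x, grad, delta, mode="ball"):
    # One projected sign-gradient step from a random start.
    # mode="ball":   keep the perturbation inside the l-inf ball of radius
    #                delta (the constrained on-manifold search).
    # mode="sphere": place the perturbation on the l-inf sphere of radius
    #                delta (the off-manifold search).
    pert = rng.uniform(-delta, delta, size=x.shape)  # random initialization
    pert = pert + delta * np.sign(grad)              # sign-gradient step
    if mode == "ball":
        pert = np.clip(pert, -delta, delta)          # project onto the ball
    else:
        pert = delta * np.sign(pert)                 # project onto the sphere
    return x + pert
```

In a real implementation `grad` would be the gradient of the inner objective with respect to the input embedding, obtained by one backward pass through the fine-tuned model.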

Experiments
To evaluate calibration performance for in-distribution data, we measure the expected calibration error (ECE) and the misclassification detection score. For out-of-distribution data, we measure the OOD detection score. We detect misclassified and OOD samples by model confidence, i.e., the output probability associated with the predicted label, P̂(X). Specifically, we set up a confidence threshold τ ∈ [0, 1] and flag the samples with confidence below the threshold, i.e., P̂(X) < τ, as misclassified or OOD. We can compute the detection F1 score for every τ, F1(τ), and obtain a calibration curve (F1(τ) vs. τ). Then, we set τ_upper as the upper bound of the confidence threshold, since a well-calibrated model should provide probabilities that reflect the true likelihood, and it is not reasonable to use a large τ to detect such samples. We use the empirical Normalized Bounded Area Under the Calibration Curve (NBAUCC) as the overall detection score:

NBAUCC_{τ_upper} = (1/M) Σ_{i=1}^{M} F1( (τ_upper / M) · i ),

where M is the number of sub-intervals for the numerical integration. We set M = 50 throughout the following experiments. Note that the traditional binary classification metrics, e.g., AUROC and AUPR, cannot measure true calibration, because a model can still achieve high scores even though it has high confidence for the misclassified and OOD samples. We provide more explanations of the metrics in Appendix C. We report the performance when τ_upper = 0.5 here and the results when τ_upper = 0.7 and 1 in Appendix D.
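The NBAUCC computation can be sketched as follows (a simplified version; the function name and the strict `<` comparison are our assumptions):

```python
import numpy as np

def nbaucc(confidences, should_detect, tau_upper=0.5, M=50):
    # Average the detection F1 score over M thresholds spaced evenly in
    # (0, tau_upper]; a sample is flagged when its confidence is below tau.
    conf = np.asarray(confidences, dtype=float)
    pos = np.asarray(should_detect, dtype=bool)
    total = 0.0
    for i in range(1, M + 1):
        tau = tau_upper * i / M
        flagged = conf < tau
        tp = np.sum(flagged & pos)
        if tp > 0:
            precision = tp / flagged.sum()
            recall = tp / pos.sum()
            total += 2 * precision * recall / (precision + recall)
    return total / M

# A calibrated model gives OOD samples low confidence; an over-confident
# one does not, and never falls below the bounded thresholds.
conf_good = [0.1, 0.1, 0.9, 0.9]  # OOD at 0.1, in-distribution at 0.9
conf_bad = [0.8, 0.8, 0.9, 0.9]   # over-confident on the OOD samples
is_ood = [1, 1, 0, 0]
```

With τ_upper = 0.5, the over-confident model scores exactly zero (its OOD confidences never drop below any bounded threshold), while a standard AUROC would still reward it for ranking OOD below in-distribution samples.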

Datasets
For each dataset, we construct an in-distribution training set, an in-distribution testing set, and an OOD testing set. Specifically, we use the following datasets:
20NG. The 20 Newsgroups dataset (20NG) contains news articles with 20 categories. We use the Stanford Sentiment Treebank (SST-2) (Socher et al., 2012) as the OOD data.
20NG-15. We take the first 15 categories of 20NG as the in-distribution data and the other 5 categories (20NG-5) as the OOD data.
WOS (Kowsari et al., 2017). The Web of Science (WOS) dataset contains scientific articles with 134 categories. We use AGnews (Zhang et al., 2015) as the OOD data.
WOS-100. We use the first 100 classes of WOS as the in-distribution data and the other 34 classes (WOS-34) as the OOD data.
Yahoo (Chang et al., 2008). This dataset contains questions with 10 categories posted to 'Yahoo! Answers'. We randomly draw 2,000 of the 140,000 samples per category as the training set. We use Yelp (Zhang et al., 2015) as the OOD data.
Yahoo-8. We use the first 8 classes of Yahoo as the in-distribution data and the other 2 classes (Yahoo-2) as the OOD data.
The testing set for OOD detection consists of the in-distribution testing set and the OOD data. More dataset details can be found in Appendix A. We remark that 20NG-15, WOS-100, and Yahoo-8 are included to make OOD detection more challenging, as the OOD data and the training data come from similar data sources.

Baselines
We consider the baseline calibration methods detailed in Appendix B, including label smoothing, entropy regularization (ERL), virtual adversarial training (VAT), Mixup, Manifold-mixup, and Monte Carlo dropout (MCDP).

Our method achieves high F1 scores starting from a small threshold, which indicates that it indeed provides low confidence for misclassified and OOD samples; in contrast, the F1 scores of the baselines peak at high thresholds, which indicates that they are poorly calibrated.

Implementation Details
We use ADAM (Kingma and Ba, 2014) with β_1 = 0.9 and β_2 = 0.999 as the optimizer. For our method, we simply set λ_on = λ_off = 1, δ_on = 10^-4, δ_off = 10^-3, and δ_y = 0.1 for all the experiments. We also conduct an extensive hyper-parameter search for the baselines. See more details in Appendix B.

Main Results
Our main results are summarized as follows:
Expected Calibration Error: Table 1 reports the ECE and predictive accuracy of all the methods. Our method outperforms all the baselines on all the datasets in terms of ECE except for Yahoo, where only ERL is slightly better. Meanwhile, our method does not sacrifice predictive accuracy.
Misclassification Detection: Table 2 compares the NBAUCC_0.5 on misclassification detection of different methods. As shown, our method outperforms all the baselines on all six datasets.
Out-of-distribution Detection: Table 2 reports the NBAUCC_0.5 on OOD detection of different methods. Again, our method achieves the best performance on all six datasets. The improvement is particularly remarkable on the 20NG dataset, where NBAUCC_0.5 increases from 47.00 to 63.92 compared with the strongest baseline. We also find that detecting unseen classes from the original dataset is much more challenging than detecting OOD samples from an entirely different dataset.
Significance Test: We perform the Wilcoxon signed-rank test (Wilcoxon, 1992) to verify that the improvements of our method over the baselines are statistically significant.

Table 2: NBAUCC_0.5 on misclassification detection and OOD detection. We report the average performance over 5 random initializations.

Parameter Study
We investigate the effects of the interpolation parameters for on-manifold data, i.e., δ_on and δ_y, and the perturbation size for off-manifold samples, i.e., δ_off. The default values are δ_on = 10^-4, δ_off = 10^-3 and δ_y = 0.1. Figure 4 shows the results on the 20NG-15, 20NG, WOS-100, and WOS datasets. Our results are summarized as follows:
• The performance on all metrics versus δ_on is stable within a large range from 10^-5 to 10^-2. When δ_on is larger than 10^-1, the predictive accuracy begins to drop.
• The performance versus δ_off is more sensitive: (1) when δ_off is too small, ECE increases dramatically because the generated off-manifold samples are too close to the manifold and make the model under-confident; (2) when δ_off is too large, the off-manifold regularization is too weak and OOD detection performance drops.
• In general, δ_on should be small so that x* stays on the data manifold, while δ_off should be large so that x* leaves the data manifold. However, the regularization effect of R_on (R_off) depends on both λ_on (λ_off) and δ_on (δ_off). Therefore, it is not necessary for δ_on to be smaller than δ_off; we can also tune λ_on and λ_off to achieve better performance.
• The performance versus δ_y is relatively stable except for ECE. When δ_y is larger than 0.2, ECE begins to increase.

Ablation Study
We investigate the effectiveness of the on-manifold regularizer R_on and the off-manifold regularizer R_off via ablation studies. Table 3 shows the results on the 20NG-15 and 20NG datasets.
• As expected, removing either component in our method would result in a performance drop.
It demonstrates that these two components complement each other. All the ablation models outperform the BERT baseline model, which demonstrates the effectiveness of each module.
• We observe that the optimal δ on is different when using only R on . This indicates that the hyperparameters of R on and R off should be jointly tuned, due to the joint effect of both components.
• By removing R_off, we observe a severe OOD performance degradation on the 20NG dataset (from 63.92 to 43.87). This indicates that R_off is vital to out-of-distribution calibration. Meanwhile, the performance degradation is less severe on 20NG-15 (from 9.69 to 7.94), because R_on can also help detect OOD samples that come from a similar data source (20NG-5).
• By removing R_on, the in-distribution calibration performance drops as expected.

Table 3: Ablation study on the 20NG-15 and 20NG datasets. For OOD detection and misclassification detection, we report NBAUCC_0.5. We set δ_y = 0.1 and δ_off = 10^-3.

Related Works and Discussion
Other Related Works: Lakshminarayanan et al. (2017) propose a model-ensembling approach to improve model calibration. They first train multiple models with different initializations and then average their predictions. However, fine-tuning multiple language models requires extremely intensive computing resources. Kumar et al. (2018) propose a differentiable surrogate for the expected calibration error, called maximum mean calibration error (MMCE), using kernel embedding. However, such a kernel embedding method is computationally expensive and does not scale to large pre-trained language models.
Accelerating Optimization: To further improve the calibration performance of our method, we can leverage recent minimax optimization techniques to better solve the two inner optimization problems in (5) and (7) without increasing the computational complexity. For example, Zhang et al. (2019) propose an efficient approximation algorithm based on Pontryagin's Maximum Principle to replace the multi-step projected gradient update for the inner optimization problem. Another option is the learning-to-learn framework (Jiang et al., 2018), where the inner problem is solved by a learnt optimizer. These techniques can help us obtain the on- and off-manifold pseudo samples more efficiently.
Connection to Robustness: The interpolated training samples naturally promote the local Lipschitz continuity of our model. Such a local smoothness property has several advantages: (1) it makes the model more robust to the inherent noise in the data, e.g., noisy labels; (2) it is particularly helpful for preventing overfitting and improving generalization, especially for low-resource tasks.
Extensions: Our method is quite general and can be applied to other deep neural network-based problems besides language model fine-tuning.

Conclusion
We have proposed a regularization method to mitigate miscalibration of fine-tuned language models from a data augmentation perspective. Our method imposes two new regularizers using generated on- and off-manifold samples to improve both in-distribution and out-of-distribution calibration. Extensive experiments on six datasets demonstrate that our method outperforms state-of-the-art calibration methods in terms of expected calibration error, misclassification detection and OOD detection. All the data used in our experiments are publicly available.

A Dataset Details
B Implementation Details

For label smoothing, we search the smoothing parameter from {0.05, 0.1} as in (Müller et al., 2019); for ERL, the penalty weight is chosen from {0.05, 0.1, 0.25, 0.5, 1, 2.5, 5}; for VAT, we search the perturbation size in {10^-3, 10^-4, 10^-5} as in (Jiang et al., 2020); for Mixup, we search the interpolation parameter from {0.1, 0.2, 0.3, 0.4} as suggested in (Zhang et al., 2018; Thulasidasan et al., 2019); for Manifold-mixup, we search from {0.2, 0.4, 1, 2, 4}. We perform 10 stochastic forward passes for MCDP at test time. For hyper-parameter tuning, we run all the methods 5 times and then take the average; the hyper-parameters are selected to achieve the best ECE on the development set of each dataset. The interpolation of Mixup is performed on the input embeddings obtained from the first layer of the language model; the interpolation of Manifold-mixup is performed on the features obtained from the last layer of the language model.

C Metrics of Misclassification and Out-of-distribution Detection
Existing works on out-of-distribution (OOD) detection and misclassification detection (Hendrycks and Gimpel, 2016) use traditional binary classification metrics, e.g., AUPR and AUROC. As discussed in Sections 1 and 2, the output probability of a calibrated model should reflect the true likelihood. However, AUROC and AUPR cannot reflect true model calibration, because a model can still achieve high scores even though it has high confidence for misclassified and OOD samples. We argue that it is more reasonable to use the Normalized Bounded Area Under the Calibration Curve (NBAUCC) defined in Section 4.

Table 5 shows an illustrative example. As can be seen, h_1 is better calibrated than h_2, since h_1 can detect OOD samples under a wide range of thresholds (0.15 < τ < 0.9) while h_2 requires an absurdly large threshold (0.85 < τ < 0.9). However, if we use the traditional AUPR and AUROC metrics, we will conclude that h_1 is as well calibrated as h_2, since AUPR_h1 = AUPR_h2 = 0.417 and AUROC_h1 = AUROC_h2 = 1. On the other hand, if we use NBAUCC, we will have NBAUCC_h1 > NBAUCC_h2. We remark that it is more appropriate to use NBAUCC_0.5 than NBAUCC_1, since a calibrated model should provide low confidence for the misclassified and OOD samples, and it is unreasonable to use a large threshold to detect them.

D Additional Results

Tables 6 and 7 report the NBAUCCs of all the methods on misclassification and OOD detection when τ_upper = 0.7 and τ_upper = 1. Tables 8 and 9 report the ablation study results when τ_upper = 0.7 and τ_upper = 1. Figures 5 and 6 report the parameter study results when τ_upper = 0.7 and τ_upper = 1.

Figure 6: Parameter study of δ_on, δ_off and δ_y. We use NBAUCC_0.7 for OOD and misclassification detection.