Calibration of Pre-trained Transformers

Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT and RoBERTa in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. For each task, we consider in-domain as well as challenging out-of-domain settings, where models face more examples they should be uncertain about. We show that: (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5x lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.


Introduction
Neural networks have seen wide adoption but are frequently criticized for being black boxes, offering little insight as to why predictions are made (Benitez et al., 1997; Dayhoff and DeLeo, 2001; Castelvecchi, 2016) and making it difficult to diagnose errors at test-time. These properties are particularly exhibited by pre-trained Transformer models (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019), which dominate benchmark tasks like SuperGLUE (Wang et al., 2019), but use a large number of self-attention heads across many layers in a way that is difficult to unpack (Clark et al., 2019; Kovaleva et al., 2019). One step towards understanding whether these models can be trusted is to analyze whether they are calibrated (Raftery et al., 2005; Jiang et al., 2012; Kendall and Gal, 2017): how aligned their posterior probabilities are with empirical likelihoods (Brier, 1950; Guo et al., 2017). If a model assigns 70% probability to an event, the event should occur 70% of the time if the model is calibrated. Although the model's mechanism itself may be uninterpretable, a calibrated model at least gives us a signal that it "knows what it doesn't know," which can make these models easier to deploy in practice (Jiang et al., 2012).
In this work, we evaluate the calibration of two pre-trained models, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), on three tasks: natural language inference (Bowman et al., 2015), paraphrase detection (Iyer et al., 2017), and commonsense reasoning (Zellers et al., 2018). These tasks represent standard evaluation settings for pre-trained models, and critically, challenging out-of-domain test datasets are available for each. Such test data allows us to measure calibration in more realistic settings where samples stem from a dissimilar input distribution, which is exactly the scenario where we hope a well-calibrated model would avoid making confident yet incorrect predictions.
Our experiments yield several key results. First, even when used out-of-the-box, pre-trained models are calibrated in-domain. In out-of-domain settings, where non-pre-trained models like ESIM (Chen et al., 2017) are over-confident, we find that pre-trained models are significantly better calibrated. Second, we show that temperature scaling (Guo et al., 2017), which rescales non-normalized logits by a single scalar hyperparameter, is widely effective at improving in-domain calibration. Finally, we show that regularizing the model to be less certain during training can beneficially "smooth" probabilities, improving out-of-domain calibration.

Related Work
Calibration has been well-studied in statistical machine learning, including applications in forecasting (Brier, 1950; Raftery et al., 2005; Gneiting et al., 2007; Palmer et al., 2008), medicine (Yang and Thompson, 2010; Jiang et al., 2012), and computer vision (Kendall and Gal, 2017; Guo et al., 2017; Lee et al., 2018). Past work in natural language processing has studied calibration in the non-neural (Nguyen and O'Connor, 2015) and neural (Kumar and Sarawagi, 2019) settings across several tasks. However, past work has not analyzed large-scale, pre-trained models, and we additionally analyze out-of-domain settings, whereas past work largely focuses on in-domain calibration (Nguyen and O'Connor, 2015; Guo et al., 2017).
Another way of hardening models against out-of-domain data is to be able to explicitly detect these examples, which has been studied previously (Hendrycks and Gimpel, 2016; Liang et al., 2018; Lee et al., 2018). However, this assumes discrete notions of domains; calibration is a more general paradigm and gracefully handles settings where notions of domain are less quantized.

Posterior Calibration
A model is calibrated if the confidence estimates of its predictions are aligned with empirical likelihoods. For example, if we take 100 samples where a model's prediction receives posterior probability 0.7, the model should get 70 of the samples correct. Formally, calibration is expressed as a joint distribution $P(Q, Y)$ over confidences $Q \in \mathbb{R}$ and labels $Y \in \mathcal{Y}$, where perfect calibration is achieved when $P(Y = y \mid Q = q) = q$. This probability can be empirically approximated by binning predictions into $k$ disjoint, equally-sized bins, each consisting of $b_k$ predictions. Following previous work in measuring calibration (Guo et al., 2017), we use expected calibration error (ECE), which is a weighted average of the difference between each bin's accuracy and confidence: $\sum_k \frac{b_k}{n} |\mathrm{acc}(k) - \mathrm{conf}(k)|$, where $n$ is the total number of predictions. For the experiments in this paper, we use $k = 10$.
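The ECE definition above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `expected_calibration_error`, the use of NumPy, and the choice of equal-width confidence intervals as bins (in the style of Guo et al., 2017) are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |acc(k) - conf(k)| over confidence bins.

    confidences: probability assigned to the predicted label, one per example.
    correct: 1 if the prediction was right, 0 otherwise.
    Assumption: bins are equal-width intervals over [0, 1].
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)

    # Interior bin edges; bin i covers (i / num_bins, (i + 1) / num_bins].
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for k in range(num_bins):
        mask = bin_ids == k
        b_k = int(mask.sum())          # number of predictions in bin k
        if b_k == 0:
            continue                   # empty bins contribute nothing
        acc_k = correct[mask].mean()   # empirical accuracy in bin k
        conf_k = confidences[mask].mean()  # mean confidence in bin k
        ece += (b_k / n) * abs(acc_k - conf_k)
    return ece
```

For instance, ten predictions made with confidence 0.7, of which seven are correct, give an ECE of (numerically) zero, matching the intuition in the paragraph above.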

Tasks and Datasets
We perform evaluations on three language understanding tasks: natural language inference, paraphrase detection, and commonsense reasoning. Significant past work has studied cross-domain robustness using sentiment analysis (Chen et al., 2018; Peng et al., 2018; Miller, 2019; Desai et al., 2019). However, we explicitly elect to use tasks where out-of-domain performance is substantially lower and a variety of domain shifts are exhibited. Below, we describe our in-domain and out-of-domain datasets. For all datasets, we split the development set in half to obtain a held-out, non-blind test set.
Table 1: Decomposable Attention (DA) (Parikh et al., 2016) and Enhanced Sequential Inference Model (ESIM) (Chen et al., 2017) use LSTMs and attention on top of GloVe embeddings (Pennington et al., 2014) to model pairwise semantic similarities. In contrast, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are large-scale, pre-trained language models with stacked, general purpose Transformer (Vaswani et al., 2017) layers.
Natural Language Inference. Stanford Natural Language Inference (SNLI) is a large-scale entailment dataset where the task is to determine whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise (Bowman et al., 2015). Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018) contains similar entailment data across several domains, which we can use as unseen test domains.
Commonsense Reasoning. Situations with Adversarial Generations (SWAG) is a grounded commonsense reasoning task where models must select the most plausible continuation of a sentence among four candidates (Zellers et al., 2018). HellaSWAG (HSWAG), an adversarial out-of-domain dataset, serves as a more challenging benchmark for pre-trained models (Zellers et al., 2019); it is distributionally different in that its examples exploit statistical biases in pre-trained models.
Table 1 shows a breakdown of the models used in our experiments. We use the same set of hyperparameters across all tasks. For pre-trained models, we omit hyperparameters that induce brittleness during fine-tuning, e.g., employing a decaying learning rate schedule with linear warmup (Sun et al., 2019; Lan et al., 2020). Detailed information on optimization is available in Appendix B.
Table 2: Out-of-the-box calibration results for in-domain (SNLI, QQP, SWAG) and out-of-domain (MNLI, TwitterPPDB, HellaSWAG) datasets using the models described in Table 1. We report accuracy and expected calibration error (ECE), both averaged across 5 runs with random restarts.

Out-of-the-box Calibration
First, we analyze "out-of-the-box" calibration; that is, the calibration error derived from evaluating a model on a dataset without using post-processing steps like temperature scaling (Guo et al., 2017). For each task, we train the model on the in-domain training set, and then evaluate its performance on the in-domain and out-of-domain test sets. The results are shown in Table 2. We remark on a few observed phenomena below:
Non-pre-trained models exhibit an inverse relationship between complexity and calibration. Simpler models, such as DA, achieve competitive in-domain ECE on SNLI (1.02) and QQP (3.37), and notably, significantly outperform pre-trained models on SNLI. However, ESIM, which is more complex in both number of parameters and architecture, sees increased in-domain ECE despite having higher accuracy on all tasks.
However, pre-trained models are generally more accurate and calibrated. Rather surprisingly, pre-trained models do not show characteristics of the aforementioned inverse relationship, despite having significantly more parameters. On SNLI, RoBERTa achieves an ECE in the ballpark of DA and ESIM, but on QQP and SWAG, both BERT and RoBERTa consistently achieve higher accuracies and lower ECEs. Pre-trained models are especially adept out-of-domain, where on HellaSWAG in particular, RoBERTa reduces ECE by a factor of 3.4 compared to DA.
Using RoBERTa always improves in-domain calibration over BERT. In addition to obtaining better task performance than BERT, RoBERTa consistently achieves lower in-domain ECE. Even out-of-domain, RoBERTa outperforms BERT in all but one setting (TwitterPPDB). Nonetheless, our results show that representations induced by robust pre-training (e.g., using a larger corpus, more training steps, dynamic masking) (Liu et al., 2019) lead to more calibrated posteriors. Whether other changes to pre-training (Yang et al., 2019; Lan et al., 2020; Clark et al., 2020) lead to further improvements is an open question.

Post-hoc Calibration
There are a number of post-hoc techniques to correct a model's calibration. Using our in-domain development set, we can, for example, post-process model probabilities via temperature scaling (Guo et al., 2017), where a scalar "temperature" hyperparameter T divides non-normalized logits before the softmax operation. As T → 0, the distribution's mode receives all the probability mass, while as T → ∞, the probabilities become uniform.
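As a rough illustration of temperature scaling, the sketch below divides logits by T before the softmax and picks T on held-out development data. The text does not say how T is optimized, so the objective (negative log-likelihood, as in Guo et al., 2017), the grid search, and the function names `softmax`, `nll`, and `fit_temperature` are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis after dividing logits by the temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the gold labels at a given temperature."""
    labels = np.asarray(labels, dtype=int)
    probs = softmax(logits, temperature)
    gold = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(gold + 1e-12)))

def fit_temperature(dev_logits, dev_labels, grid=None):
    """Pick the temperature minimizing dev-set NLL (assumed objective).

    A simple grid search stands in for whatever optimizer is used in practice.
    """
    if grid is None:
        grid = np.linspace(0.05, 5.0, 100)
    losses = [nll(dev_logits, dev_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])
```

Because dividing by a positive scalar does not change the argmax, accuracy is unaffected; only the confidence assigned to each prediction changes. An over-confident model would be expected to receive T > 1 under this procedure.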
Table 3: Post-hoc calibration results for BERT and RoBERTa on in-domain (SNLI, QQP, SWAG) and out-of-domain (MNLI, TwitterPPDB, HellaSWAG) datasets. Models are trained with maximum likelihood estimation (MLE) or label smoothing (LS), then their logits are post-processed using temperature scaling (§4.4). We report expected calibration error (ECE) averaged across 5 runs with random restarts. Darker colors imply lower ECE.
Furthermore, we experiment with models trained in-domain with label smoothing (LS) (Miller et al., 1996; Pereyra et al., 2017) as opposed to conventional maximum likelihood estimation (MLE). By nature, MLE encourages models to sharpen the posterior distribution around the top prediction, a high confidence which is typically unwarranted in out-of-domain settings. Label smoothing presents one solution to over-confidence by maintaining uncertainty over the label space during training: we minimize the KL divergence with the distribution placing a $1 - \alpha$ fraction of probability mass on the gold label and a $\frac{\alpha}{|\mathcal{Y}| - 1}$ fraction of mass on each other label, where $\alpha \in (0, 1)$ is a hyperparameter. This re-formulated learning objective does not require changing the model architecture.
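The label smoothing objective can be sketched as follows: build the smoothed target distribution, then compute the KL divergence from it to the model's predicted distribution. This is an illustrative NumPy version rather than the authors' training code; the function names `smoothed_targets` and `label_smoothing_loss` are hypothetical.

```python
import numpy as np

def smoothed_targets(labels, num_classes, alpha=0.1):
    """Targets with 1 - alpha on the gold label and alpha / (|Y| - 1) elsewhere."""
    labels = np.asarray(labels, dtype=int)
    targets = np.full((len(labels), num_classes), alpha / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - alpha
    return targets

def label_smoothing_loss(logits, labels, alpha=0.1):
    """Mean KL(target || model) over a batch; requires alpha in (0, 1)."""
    logits = np.asarray(logits, dtype=float)
    targets = smoothed_targets(labels, logits.shape[1], alpha)

    # Log-softmax computed stably by subtracting the row maximum.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    # KL(t || p) = sum_y t(y) * (log t(y) - log p(y)).
    kl = np.sum(targets * (np.log(targets) - log_probs), axis=1)
    return float(kl.mean())
```

With three labels and α = 0.1, the target for a gold label at index 0 is (0.9, 0.05, 0.05). The loss is zero only when the model reproduces this distribution exactly, so a near one-hot prediction is penalized even when it is correct.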
For each task, we train the model with either MLE or LS (α = 0.1) using the in-domain training set, use the in-domain development set to learn an optimal temperature T, and then evaluate the model (scaled with T) on the in-domain and out-of-domain test sets. From Table 3, we draw the following conclusions:
MLE models with temperature scaling achieve low in-domain calibration error. MLE models always outperform LS models in-domain, which suggests incorporating uncertainty when in-domain samples are available is not an effective regularization scheme. Even with a small smoothing value (0.1), LS models do not achieve nearly as good out-of-the-box results as MLE models, and temperature scaling hurts LS in many cases. By contrast, RoBERTa with temperature-scaled MLE achieves ECE values of 0.7-0.8, implying that MLE training yields scores that are fundamentally good but just need some minor rescaling.
However, out-of-domain, label smoothing is generally more effective. In most cases, MLE models do not perform well on out-of-domain datasets, with ECEs ranging from 8 to 12. However, LS models are forced to distribute probability mass across classes, and as a result, achieve significantly lower ECEs on average. We note that LS is particularly effective when the distribution shift is strong. For example, on the adversarial HellaSWAG, when used out-of-the-box, RoBERTa-LS obtains an ECE that is lower than that of RoBERTa-MLE by a factor of 5.8.
Optimal temperature scaling values are bounded within a small interval. Table 4 enumerates the learned temperature values for BERT-MLE and RoBERTa-MLE. For in-domain tasks, the optimal temperature values are generally in the range 1-1.4. Interestingly, out-of-domain, TwitterPPDB and HellaSWAG require larger temperature values than MNLI, which suggests the degree of distribution shift and magnitude of T may be closely related.

Conclusion
Posterior calibration is one lens to understand the trustworthiness of model confidence scores. In this work, we examine the calibration of pre-trained Transformers in both in-domain and out-of-domain settings. Results show BERT and RoBERTa coupled with temperature scaling achieve low ECEs in-domain, and when trained with label smoothing, are also competitive out-of-domain.
Dataset sources: SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), QQP (Iyer et al., 2017), TwitterPPDB (Lan et al., 2017), SWAG (Zellers et al., 2018), and HellaSWAG (Zellers et al., 2019).

B Training and Optimization
For non-pre-trained model baselines, we chiefly use the open-source implementations of DA (Parikh et al., 2016) and ESIM (Chen et al., 2017) in AllenNLP (Gardner et al., 2018). For SWAG/HellaSWAG specifically, we run the baselines available in the authors' code. For BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), we use bert-base-uncased and roberta-base, respectively, from HuggingFace Transformers (Wolf et al., 2019). BERT is fine-tuned with a maximum of 3 epochs, batch size of 16, learning rate of 2e-5, gradient clip of 1.0, and no weight decay. Similarly, RoBERTa is fine-tuned with a maximum of 3 epochs, batch size of 32, learning rate of 1e-5, gradient clip of 1.0, and weight decay of 0.1. Both models are optimized with AdamW (Loshchilov and Hutter, 2019). Other than early stopping on the development set, we do not perform additional hyperparameter searches. Finally, all experiments are conducted on NVIDIA V100 32GB GPUs.

C Visualizations
Reliability diagrams (Nguyen and O'Connor, 2015; Guo et al., 2017) visualize the alignment between posterior probabilities (confidence) and empirical outcomes (accuracy), where a perfectly calibrated model has conf(k) = acc(k) for each bucket of real-valued predictions k (§3). We show several reliability diagrams, each under different configurations, in Figures 1, 2, and 3.
In-domain calibration of BERT and RoBERTa with temperature scaling (TS). Models are both trained and evaluated on SNLI, QQP, and SWAG, respectively, then are post-processed using temperature scaling (§4.4). ZERO ERROR depicts perfect calibration (i.e., expected calibration error = 0).
Figure 3: Out-of-domain calibration of RoBERTa with different learning objectives. RoBERTa is trained on SWAG using either maximum likelihood estimation (ROBERTA-MLE) or label smoothing (ROBERTA-LS) and evaluated on HellaSWAG. ZERO ERROR depicts perfect calibration (i.e., expected calibration error = 0).