Posterior Calibrated Training on Sentence Classification Tasks

Most classification models work by first predicting a posterior probability distribution over all classes and then selecting the class with the largest estimated probability. In many settings, however, the quality of the posterior probability itself (e.g., a 65% chance of having diabetes) gives more reliable information than the final predicted class alone. When these methods are shown to be poorly calibrated, most fixes to date have relied on posterior calibration, which rescales the predicted probabilities but often has little impact on the final classifications. Here we propose an end-to-end training procedure called posterior calibrated (PosCal) training that directly optimizes the task objective while minimizing the difference between the predicted and empirical posterior probabilities. We show that PosCal not only helps reduce the calibration error but also improves task performance by penalizing drops in performance on both objectives. PosCal achieves about a 2.5% gain in task performance and a 16.1% reduction in calibration error on GLUE (Wang et al., 2018) compared to the baseline. On xSLUE (Kang and Hovy, 2019), we achieve comparable task performance with a 13.2% calibration error reduction, though without outperforming the two-stage calibration baseline. PosCal training can easily be extended to any type of classification task in the form of a regularization term. Moreover, PosCal has the advantage that it incrementally tracks the statistics needed for the calibration objective during training, making efficient use of large training sets.


Introduction
Classification systems, from simple logistic regression to complex neural networks, typically predict posterior probabilities over classes and decide the final class by the maximum probability. The model's performance is then evaluated by how accurate the predicted classes are with respect to out-of-sample, ground-truth labels. In some cases, however, the quality of the posterior estimates themselves must be carefully considered, as such estimates are often interpreted as a measure of confidence in the final prediction. For instance, a well-predicted posterior can help assess the fairness of a recidivism prediction instrument (Chouldechova, 2017) or select the optimal number of labels in diagnosis code prediction (Kavuluru et al., 2015). Guo et al. (2017) showed that a model with high classification accuracy does not guarantee good posterior estimation quality. In order to correct poorly calibrated posterior probabilities, existing calibration methods (Zadrozny and Elkan, 2001; Platt et al., 1999; Guo et al., 2017; Kumar et al., 2019) generally rescale the posterior distribution predicted by the classifier after training. Such post-processing calibration methods re-learn an appropriate distribution from a held-out validation set and then apply it to an unseen test set, causing a severe discrepancy in distributions across the data splits. The fixed split of the data sets makes post-calibration very limited and static with respect to the classifier's performance.
We propose a simple but effective training technique called Posterior Calibrated (PosCal) training that optimizes the task objective while calibrating the posterior distribution during training. Unlike the post-processing calibration methods, PosCal directly penalizes the difference between the predicted and the true (empirical) posterior probabilities dynamically over the training steps.
PosCal is not simply a substitute for post-processing calibration methods. Our experiments show that PosCal can not only reduce the calibration error but also increase the task performance on the classification benchmarks: compared to the baseline MLE (maximum likelihood estimation) training method, PosCal achieves 2.5% performance improvements on GLUE (Wang et al., 2018) and 0.5% on xSLUE (Kang and Hovy, 2019), and at the same time a 16.1% posterior error reduction on GLUE and 13.2% on xSLUE.

Related Work
Our work is primarily motivated by previous analyses of posterior calibration on modern neural networks. Guo et al. (2017) pointed out that in some cases, as the classification performance of a neural network improves, its posterior output becomes poorly calibrated. There have been a few attempts to investigate the effect of posterior calibration on natural language processing (NLP) tasks: Nguyen and O'Connor (2015) empirically tested how classifiers for NLP tasks (e.g., sequence tagging) are calibrated; for instance, compared to the Naive Bayes classifier, logistic regression outputs well-calibrated posteriors in a sentiment classification task. Card and Smith (2018) also noted the importance of calibration when generating a training corpus for NLP tasks.
As noted above, numerous post-processing calibration techniques have been developed. Traditional binning methods (Zadrozny and Elkan, 2001, 2002) set up bins based on the predicted posterior p, recalculate a calibrated posterior q per bin on a validation set, and then replace every p with the q of the bin it falls into. Scaling methods (Platt et al., 1999; Guo et al., 2017; Kull et al., 2019), on the other hand, re-scale the predicted posterior p from the softmax layer using parameters trained on a validation set. Recently, Kumar et al. (2019) pointed out that such re-scaling methods do not actually produce probabilities as well calibrated as reported, since the true posterior probability distribution cannot be captured with the often small number of samples in the validation set. To address the issue, the authors proposed a scaling-binning calibrator, but it still relies on the validation set.
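To make the scaling family concrete, the sketch below shows temperature scaling, the variant used as a baseline later in this paper (tScal). This is a minimal illustration, not the authors' code; the function name `fit_temperature` and the L-BFGS settings are our assumptions.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a single temperature T > 0 on held-out validation logits by
    minimizing the negative log-likelihood, as in Guo et al. (2017).
    val_logits: (n, k) tensor; val_labels: (n,) tensor of class indices."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At test time the calibrated posterior is softmax(logits / T); since T only
# rescales confidence, the argmax (and hence accuracy) never changes, which is
# why post-calibration often has little impact on the final classifications.
```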
In a broad sense, our end-to-end training with the calibration reduction loss can be seen as a sort of regularization designed to mitigate over-fitting. Just as classical explicit regularization techniques such as the lasso (Tibshirani, 1996) penalize models with large weights, here we penalize models whose posterior outputs differ substantially from the estimated true posterior.

Posterior Calibrated Training
In general, most existing classification models are trained by maximum likelihood estimation (MLE). The objective is then to minimize the cross-entropy (Xent) loss between the predicted probability and the true probability over k different classes.
At training time, PosCal minimizes the cross-entropy as well as the calibration error in a multi-task setup. While the former is a task-specific objective, the latter is a statistical objective that makes the model statistically well-calibrated with respect to its data distribution. Such data-oriented calibration makes the task-oriented model more reliable in terms of its data distribution. Compared to prior post-calibration methods with a fixed (and often small) validation set, PosCal dynamically estimates the statistics required for calibration from the training set over the training iterations.
Given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a $p$-dimensional vector of input features and $y_i$ is a $k$-dimensional one-hot vector corresponding to its true label (with $k$ classes), our training minimizes the following loss:

$$\mathcal{L}_{\mathrm{PosCal}} = \mathcal{L}_{\mathrm{xent}} + \lambda \, \mathcal{L}_{\mathrm{cal}} \tag{1}$$

where $\mathcal{L}_{\mathrm{xent}}$ is the cross-entropy loss for the task objective (i.e., classification) and $\mathcal{L}_{\mathrm{cal}}$ is the calibration loss. $\lambda$ is a weighting value for the calibration loss; in practice, the optimal value of $\lambda$ can be chosen via cross-validation. More details are given in §4.
Each loss term is then calculated as follows:

$$\mathcal{L}_{\mathrm{xent}} = -\sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log p_{ij}, \qquad \mathcal{L}_{\mathrm{cal}} = \sum_{j=1}^{k} d(q_j, p_j)$$

where $\mathcal{L}_{\mathrm{xent}}$ is a typical cross-entropy loss with $p$ the predicted probability, updated during training. $\mathcal{L}_{\mathrm{cal}}$ is our proposed loss for minimizing the calibration error: $q$ is the true (empirical) probability and $d$ is a function measuring the difference (e.g., mean squared error or Kullback-Leibler divergence) between the updated predicted posterior $p$ and the true posterior $q$. The empirical probability $q$ is calculated as the ratio of true labels in each bin, where the bins are split by the predicted posterior $p$ from each update. We sum up the losses over every class $j \in \{1, 2, \ldots, k\}$.
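Below is a minimal sketch of how these two terms might be computed together, assuming 10 equal-width bins and a per-class Bernoulli KL divergence as the distance d; the function names (`empirical_posterior`, `poscal_loss`) and the exact binning and lookup details are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

NUM_BINS = 10  # the paper fixes the bin count at 10

def empirical_posterior(probs, onehot, num_bins=NUM_BINS):
    """Estimate q: for each class j, bin the predicted posteriors p_j into
    equal-width bins and take the fraction of true-label-j samples per bin.
    probs, onehot: (n, k). Returns a (num_bins, k) lookup table Q."""
    n, k = probs.shape
    q = torch.zeros(num_bins, k)
    bins = torch.clamp((probs * num_bins).long(), max=num_bins - 1)
    for j in range(k):
        for b in range(num_bins):
            mask = bins[:, j] == b
            if mask.any():
                q[b, j] = onehot[mask, j].float().mean()
    return q

def poscal_loss(logits, onehot, q_table, lam=1.0, num_bins=NUM_BINS, eps=1e-6):
    """L_PosCal = L_xent + lambda * L_cal (Eq. 1), with d taken to be a
    KL divergence between the empirical q and the predicted p per class."""
    probs = F.softmax(logits, dim=-1)
    xent = F.cross_entropy(logits, onehot.argmax(dim=-1))
    # Look up the empirical posterior q of the bin each prediction falls into.
    bins = torch.clamp((probs * num_bins).long(), max=num_bins - 1)
    q = q_table.gather(0, bins)              # (n, k); treated as a constant
    p = probs.clamp(eps, 1 - eps)
    q = q.clamp(eps, 1 - eps)
    # Bernoulli KL(q || p) per class, summed over classes j = 1..k.
    kl = q * (q / p).log() + (1 - q) * ((1 - q) / (1 - p)).log()
    cal = kl.sum(dim=-1).mean()
    return xent + lam * cal
```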

Experiment
We investigate how our end-to-end calibration training produces better calibrated posterior estimates without sacrificing task performance.
Task: NLP classification benchmarks. We test our models on two different NLP classification benchmarks: GLUE (Wang et al., 2018) and xSLUE (Kang and Hovy, 2019). GLUE contains different types of general-purpose natural language understanding tasks such as question answering, sentiment analysis, and text entailment. Since true labels on the test set are not released for the GLUE benchmark, we use the validation set as the test set and randomly sample 1% of the training set as a validation set. xSLUE (Kang and Hovy, 2019) is another classification benchmark, but over different types of style, such as the level of humor, formality, and even the demographics of authors. For the details of each dataset, refer to the original papers.
Metrics. To measure task performance, we use different evaluation metrics for each task. For the GLUE tasks, we report F1 for MRPC, Matthews correlation for CoLA, and accuracy for the other tasks, following Wang et al. (2018). For xSLUE, we use the F1 score.
To measure the calibration error, we follow the metric used in previous work (Guo et al., 2017), the Expected Calibration Error (ECE), which measures how different the predicted posterior probability is from the empirical posterior probability:

$$\mathrm{ECE} = \sum_{k} \sum_{b} \frac{|B_{kb}|}{n} \left| \bar{p}_{kb} - q_{kb} \right|$$

where $\bar{p}_{kb}$ is the averaged predicted posterior probability for label $k$ in bin $b$, $q_{kb}$ is the calculated empirical probability for label $k$ in bin $b$, $|B_{kb}|$ is the size of bin $b$ for label $k$, and $n$ is the total sample size. The lower the ECE, the better the calibration quality.
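The following sketch computes the ECE as defined above, iterating over labels and bins; it is our own minimal illustration (NumPy, 10 bins), not a reference implementation.

```python
import numpy as np

def expected_calibration_error(probs, labels, num_bins=10):
    """probs: (n, k) predicted posteriors; labels: (n,) true class indices.
    For each label k and bin b, compares the mean predicted posterior with
    the empirical frequency of label k, weighted by the bin size |B_kb| / n."""
    n, k = probs.shape
    onehot = np.eye(k)[labels]                        # (n, k) indicators of the true label
    bins = np.minimum((probs * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for j in range(k):
        for b in range(num_bins):
            mask = bins[:, j] == b
            if mask.any():
                p_bar = probs[mask, j].mean()         # averaged predicted posterior
                q_hat = onehot[mask, j].mean()        # empirical posterior
                ece += (mask.sum() / n) * abs(p_bar - q_hat)
    return ece
```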
Models. We train the classifiers with three different training methods: MLE, L1, and PosCal. MLE is basic maximum likelihood estimation training that minimizes the cross-entropy loss, L1 is MLE training with an L1 regularizer, and PosCal is our proposed training that minimizes L_PosCal (Eq. 1). For PosCal training, we use the Kullback-Leibler divergence to measure L_cal. We also report ECE with temperature scaling (Guo et al., 2017) (tScal), which is considered the state-of-the-art post-calibration method.
For our classifiers, we fine-tune the pre-trained BERT classifier (Devlin et al., 2019). Details on the hyper-parameters used are given in Appendix A.
Results. PosCal outperforms both MLE and L1 regularization in GLUE for both task performance and calibration, though not in xSLUE. Compared to tScal, PosCal shows a stable improvement in calibration error reduction across the different tasks, whereas tScal sometimes produces poorly calibrated results (e.g., CoLA, MRPC).
Analysis. We visually check the statistical effect of PosCal with respect to calibration. Figure 1 shows how the predicted posterior distribution of PosCal differs from that of MLE. We choose two datasets where PosCal improves both accuracy and calibration quality compared with basic MLE: RTE from GLUE and Stanford's politeness dataset from xSLUE. We then draw two different histograms: a histogram of p frequencies (top) and a calibration histogram of p versus the empirical posterior probability q (bottom). Figures 1(c,d) show that PosCal spreads out the extreme predicted posterior probabilities (near 0 or 1) produced by MLE so that they are better calibrated over the different bins. The well-calibrated posteriors also help correct the skewed predictions in Figures 1(a,b).
To better understand in which cases PosCal helps correct the wrong predictions of MLE, we analyze the test samples whose predicted labels differ between MLE and PosCal (Table 3). From our manual investigation, we find that statistical knowledge about the posterior probability helps correct p during PosCal training, making p switch its prediction. For further analysis, we provide more examples in Appendix B.

Conclusion and Future Directions
We propose a simple yet effective training technique called PosCal for better posterior calibration. Our experiments empirically show that PosCal can improve both the performance of classifiers and the quality of the predicted posterior output compared to MLE-based classifiers. The theoretical underpinnings of our PosCal idea are not explored in detail here, but developing formal statistical support for these ideas constitutes interesting future work. Currently, we fix the number of bins at 10 and estimate q by calculating the accuracy of p per bin. Estimating q with adaptive binning is a potential alternative to the fixed binning.

A Details on Hyper-Parameters
All models are trained with equal hyper-parameters: learning rate 2e-5 and model size BERT_BASE. We also set up an early stopping rule for training: we track the validation loss every 50 steps and halt training if the current validation loss is larger than the average of the 10 prior validation losses (i.e., patience 10). For L1, we use a regularization weight of 1e-8. For PosCal, we set up an additional weight value λ for L_cal and the number of empirical-probability updates per epoch (u). We tune these two hyper-parameters per task; for details, see Table 5. As a baseline post-calibration method, we also report ECE with temperature scaling (Guo et al., 2017), the current state-of-the-art method.
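For concreteness, the stopping rule as we read it can be sketched as below; this is our own paraphrase in code, and the function name is hypothetical.

```python
def should_stop(val_losses, patience=10):
    """Halt training when the current validation loss (checked every 50
    training steps) exceeds the average of the previous `patience` losses."""
    if len(val_losses) <= patience:
        return False
    prior = val_losses[-(patience + 1):-1]        # the 10 losses before the current one
    return val_losses[-1] > sum(prior) / patience
```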

B Examples Where MLE and PosCal Predict Different Labels
Table 6 shows examples from the RTE and Stanford Politeness datasets with the predicted probability p of the true label under MLE and PosCal.

Figure 1: Histograms of predicted probabilities (top) and their calibration histograms (bottom) for MLE (blue-shaded) and PosCal (red-shaded) on RTE in GLUE and SPoliteness in xSLUE; the overlap is purple-shaded. The x-axis is the predicted posterior; the y-axis is its frequency (top) or the empirical posterior probability (bottom). The diagonal line in (c,d) marks the expected (perfectly calibrated) case. We observe that PosCal moves the posterior probabilities, particularly the small predictions, toward the expected calibration. Best viewed in color.
Table 5: Hyper-parameters for PosCal training across tasks: the number of empirical-probability updates per epoch (u) and the weight value λ for L_cal. We tune them using the validation set.
We show the detailed training procedure of PosCal in Algorithm 1. During training, we update the model parameters (i.e., the weight matrices in the classifier) as well as the empirical posterior probabilities, calculating the predicted posteriors with the most recently updated parameters. For Q, we exactly calculate the label frequency per bin B. Since it is time-consuming to update Q at every step, we instead fix the number of Q updates per epoch (u) and refresh Q only at those intervals.
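Since Algorithm 1 itself is not reproduced in this text, the sketch below gives our reading of the loop: refresh the table Q of empirical posteriors u times per epoch, and take gradient steps on L_PosCal in between. `empirical_posterior` and `poscal_loss` refer to the sketch in §3, and `collect_train_predictions` is a hypothetical helper that runs the current model over the training set.

```python
import torch

def train_poscal(model, train_loader, optimizer, lam, epochs=3, q_updates_per_epoch=2):
    """Sketch of PosCal training (our reading of Algorithm 1): alternate
    between refreshing the empirical posterior table Q and taking gradient
    steps on L_xent + lambda * L_cal."""
    refresh_every = max(1, len(train_loader) // q_updates_per_epoch)
    q_table, step = None, 0
    for _ in range(epochs):
        for x, y in train_loader:                  # y: (batch, k) one-hot labels
            if q_table is None or step % refresh_every == 0:
                with torch.no_grad():              # Q is a statistic, not a parameter
                    probs, onehot = collect_train_predictions(model, train_loader)  # hypothetical helper
                    q_table = empirical_posterior(probs, onehot)
            loss = poscal_loss(model(x), y, q_table, lam=lam)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
    return model
```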
For example, "COR by MLE and INCOR by PosCal" in the fourth row of Table 3 means that there are three test samples that MLE predicts correctly but PosCal does not. In the examples of Table 6, the key phrase ("at all" in the first case and "Could you" in the second case) is not captured well by MLE, so the model predicts the incorrect label.