Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three

Modern neural networks do not always produce well-calibrated predictions, even when trained with a proper scoring rule such as cross-entropy. In classification settings, simple methods such as isotonic regression or temperature scaling may be used in conjunction with a held-out dataset to calibrate model outputs. However, extending these methods to structured prediction is not always straightforward or effective; furthermore, a held-out calibration set may not always be available. In this paper, we study ensemble distillation as a general framework for producing well-calibrated structured prediction models while avoiding the prohibitive inference-time cost of ensembles. We validate this framework on two tasks: named-entity recognition and machine translation. We find that, across both tasks, ensemble distillation produces models which retain much of, and occasionally improve upon, the performance and calibration benefits of ensembles, while only requiring a single model at test time.


Introduction
For a calibrated model, an event with a forecast confidence p occurs in held-out data with probability p. Calibrated probabilities enable meaningful decision making, either by machines such as downstream probabilistic systems (Nguyen and O'Connor, 2015), or by end-users who must interpret and trust system outputs (Jiang et al., 2012). The calibration of modern neural models has recently received increased attention in both the natural language processing and machine learning communities. A major finding is that modern neural networks do not always produce well-calibrated predictions. As a result, much recent work has focused on improving model calibration, predominantly with post-hoc calibration methods (Guo et al., 2017).
However, post-hoc calibration methods have primarily been developed in the context of classification tasks. Thus, it is unclear how these methods will affect the performance of sequence-level structured prediction tasks (Kumar and Sarawagi, 2019). Additionally, post-hoc calibration methods require a held out calibration dataset, which may not be available in all circumstances. To improve calibration, an alternate approach is model ensembling, which is closely related to approximating the intractable posterior distribution over model parameters (Lakshminarayanan et al., 2017;Pearce et al., 2018;Dusenberry et al., 2020). Although computationally expensive, both at training and inference time, ensembling does not require a separate calibration set. Furthermore, ensembles have been found to be competitive or even outperform other calibration methods, particularly in more challenging settings such as dataset shift (Snoek et al., 2019).
In this paper, we study ensemble distillation as a means of achieving calibrated and accurate structured models while avoiding the high cost of naive ensembles at inference time (Hinton et al., 2015). Ensemble distillation consists of two stages: in the first stage, we select a base model for the task, such as a recurrent neural network or Transformer, and train an ensemble of K such models, ensuring diversity either via sub-sampling (§4) or with different random seeds (§5). In the second stage, the ensemble of K teacher models is distilled into a single student model. Prior work has examined the effects of ensemble distillation on measures of uncertainty in vision tasks (Li and Hoiem, 2019; Englesson and Azizpour, 2019). To our knowledge, this is the first systematic study of the effect of ensemble distillation on the calibration of structured prediction models (we consider NER and NMT), which we find poses distinct challenges both in terms of measuring calibration and efficiently distilling large ensembles.

Figure 1: Plots of a) BLEU score (↑ is better) and b) Top-1 ECE (↓ is better) of ensembles and distilled ensembles of NMT models, compared to the mean (and standard error, the shaded region) of five individual standard models. Ensembles vastly improve both performance and calibration over individual models, and ensemble distillation is able to retain much of this improvement in a single model. Further, we find that even small ensembles, e.g. of size 3, are enough to see significant improvements over single models. Experimental details are described in §5.
To this end, our contributions may be summarized as follows:

• Our key finding is that a model distilled from an ensemble consistently outperforms baseline single models (§4, §5), both in terms of calibration and task performance.

• We propose a straightforward memoization technique which, when combined with a top-K approximation, enables distillation of large ensembles with negligible training overhead for NMT (§5.1).

• We study the interaction between ensembling, distillation, and other commonly employed techniques, including stochastic weight averaging and label smoothing, in NMT (§5.3).

• We investigate methods to produce effective ensembles in structured prediction settings, finding that small numbers of independent models initialized from different random seeds outperform an alternative based on single optimization trajectories (§6.1).

• Finally, we compare the calibration performance of ensembles relative to temperature scaling, which requires a separate calibration dataset, finding that it provides an orthogonal benefit (§6.2).
Our findings suggest that ensemble distillation has potential to become a standard training recipe in settings where calibration is important.

Calibration
Given an arbitrary observation X and a model with parameters θ, we are interested in the predictive uncertainty, p θ (Y | X), of an event Y. Our objective is to evaluate the predictive uncertainty of p θ over a finite sample of held-out data. We then say that p θ is calibrated if the predictive uncertainty agrees with held-out observations; that is, if the model predicts an event with confidence p, then that prediction is correct with probability p.
Calibrated models can be useful for downstream systems which benefit from accurate estimates of uncertainty (Jiang et al., 2012; Nguyen and O'Connor, 2015). Recently, it has been noted that many modern neural networks are not well calibrated after training (Nguyen and O'Connor, 2015; Ott et al., 2018a; Kumar and Sarawagi, 2019), although pre-training has been found to help with this in natural language processing (Desai and Durrett, 2020).

Measuring calibration
In this work, we are interested in tasks where Y = {y 1 , y 2 , . . . , y T } is a sequence, such that each y t is drawn from some fixed vocabulary V, such as a fixed set of named-entity types or a language-specific sub-word vocabulary. However, due to the combinatorially large size of the output space Y, any event Y ∈ Y has a minuscule probability, making it difficult to meaningfully calculate calibration. Thus, when evaluating the calibration of p θ , we focus on calibration with respect to token-level sub-sequences of Y , i.e. p θ (y t | X).
Since we evaluate model calibration on a finite amount of data, it is not possible to directly determine what proportion of all events with confidence p θ will be correct. Instead, various metrics have been proposed to estimate how well calibrated a model is. Our evaluations in this work center on two metrics which are common in the literature: the Brier score (Brier, 1950), which is the mean squared error between the model's predictions and the targets, and the Expected Calibration Error (ECE; Naeini et al., 2015), which uses binning to measure the agreement between confidence and accuracy. Following Nguyen and O'Connor (2015), we use adaptive binning, which selects bin boundaries so that each bin contains an equal number of sampled confidences.
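As a concrete illustration, the two metrics above can be sketched in a few lines. This is a minimal pure-Python sketch of our own (not the authors' evaluation code); a production evaluation would operate over the token-level model outputs described earlier.

```python
def adaptive_ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with adaptive (equal-mass) binning.

    Each bin holds roughly the same number of predictions; ECE is the
    bin-size-weighted average of |mean confidence - accuracy|.
    """
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        if lo == hi:
            continue  # empty bin when n < n_bins
        bin_pairs = pairs[lo:hi]
        avg_conf = sum(c for c, _ in bin_pairs) / len(bin_pairs)
        acc = sum(y for _, y in bin_pairs) / len(bin_pairs)
        ece += (len(bin_pairs) / n) * abs(avg_conf - acc)
    return ece

def brier_score(confidences, correct):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)
```

For a perfectly calibrated set of predictions both metrics are zero; two predictions made at confidence 0.9 of which only one is correct give an ECE of |0.9 − 0.5| = 0.4 under a single bin.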

Addressing calibration
A number of post-hoc solutions to the problem of poor calibration have been proposed, including Platt scaling (Platt, 1999), isotonic regression (Zadrozny and Elkan, 2001), and temperature scaling (Guo et al., 2017). However, these methods were predominantly designed for classification problems; in structured prediction problems, post-hoc re-calibration can sometimes hurt original performance (Kumar and Sarawagi, 2019). Additionally, post-hoc methods assume the availability of a held-out calibration set, which may not always be feasible in some settings. Thus, improving neural network calibration during the training procedure is still an open area of research.
It is well-known that neural model ensembles may improve task performance relative to individual models, although at the cost of increased compute and memory resources during training and inference (Simonyan and Zisserman, 2014; He et al., 2015; Jozefowicz et al., 2016). Recently, it has been observed that ensembles of independent models trained with different random seeds also exhibit improved calibration (Lakshminarayanan et al., 2017; Snoek et al., 2019). Intuitively, independently initialized models may be over- or under-confident in different ways on ambiguous inputs; as a result, the average of their predictive distributions provides a more robust estimate of the true uncertainty associated with any given input.

Ensemble Distillation
Knowledge distillation

Hinton et al. (2015) first proposed knowledge distillation as a procedure to train a low-capacity student model on the fixed distribution q of a higher-capacity teacher model. In its general form, the distillation loss optimized by the student model with parameters θ has the form

L Student (θ) = β L KD (θ) + (1 − β) L NLL (θ),

where β interpolates between the standard negative log-likelihood loss (L NLL) and the knowledge distillation loss (L KD), both evaluated over the training dataset D. In general, L KD is some measure of dissimilarity between the student and teacher distributions over examples in the training data, typically cross-entropy or KL-divergence.
As our full output space Y is combinatorially large, exact comparison of p θ (Y | X) and q(Y | X) is intractable. A common method to address this is to instead distill teacher distributions at the token level (Hinton et al., 2015; Kim and Rush, 2016). In models that make Markov assumptions, such as some NER models with CRF layers, we can efficiently compute the token-level distributions marginalized over all possible label sequences Y for each token. In auto-regressive models, such as the NMT models we consider, marginalization over all possible sequences is intractable. In this case, the token-level loss is evaluated using teacher forcing (Williams and Zipser, 1989), by conditioning on the true targets up to time t.
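The interpolated token-level loss can be sketched as follows. This is an illustrative sketch of ours, not the authors' implementation; numerical safeguards (e.g. log clipping) are omitted for clarity.

```python
import math

def token_distill_loss(student_probs, teacher_probs, gold_index, beta):
    """Interpolated loss for one token position:
    beta * CE(teacher, student) + (1 - beta) * NLL(gold under student)."""
    # L_KD: cross-entropy of the student against the (soft) teacher targets
    l_kd = -sum(q * math.log(p) for q, p in zip(teacher_probs, student_probs))
    # L_NLL: negative log-likelihood of the gold label under the student
    l_nll = -math.log(student_probs[gold_index])
    return beta * l_kd + (1 - beta) * l_nll
```

With β = 1 the student matches only the teacher; with β = 0 this reduces to ordinary maximum-likelihood training on the gold labels.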

Ensemble distillation
Ensemble distillation uses knowledge distillation to train a student model on the output of an ensemble. Most previous approaches to ensemble distillation collapse the ensemble distribution into a single point estimate by averaging the teacher distributions (Hinton et al., 2015; Korattikara et al., 2015). This has been shown to be an effective way of distilling the uncertainty captured by an ensemble in computer vision tasks (Li and Hoiem, 2019; Englesson and Azizpour, 2019). Recently, Malinin et al. (2020) showed that by instead distilling the distribution over the ensemble into a prior network (Malinin and Gales, 2018), the student can learn to model both the epistemic and aleatoric uncertainty of the ensemble.
As our goal is to improve model calibration, which captures both types of uncertainty, we follow previous methods of ensemble distillation which collapse the ensemble distribution into a point estimate by uniformly averaging the distributions of each teacher. Formally, given an ensemble of K models, our task is to train a single student model to match a teacher distribution q which is composed of the K distributions from the ensemble, q k. Maintaining consistency with how we derive predictions from an ensemble, when performing token-level distillation we construct the teacher distribution q as a uniform mixture of the ensemble distributions:

q(y t | X) = (1/K) Σ k q k (y t | X).

In addition to token-level distillation, Kim and Rush (2016) proposed sequence-level distillation, which approximates the global distribution q(Y | X) with the top M samples and treats each sample as an additional training example during student learning. This technique can be prohibitively expensive, as it increases the training time of the student by a factor of M; a problem which is exacerbated during ensemble distillation, as the factor becomes M × K. To maintain simplicity in our distillation procedure, and comparability to tasks for which this technique does not apply, we focus only on token-level ensemble distillation.
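The uniform mixture above amounts to an element-wise average of the K per-token distributions; a minimal sketch (our own illustration, assuming each q_k is a probability vector over the same vocabulary):

```python
def ensemble_teacher(per_model_probs):
    """Uniform mixture of K per-token teacher distributions:
    q(y_t | X) = (1/K) * sum_k q_k(y_t | X)."""
    K = len(per_model_probs)
    V = len(per_model_probs[0])
    return [sum(q[v] for q in per_model_probs) / K for v in range(V)]
```

Because each input is a normalized distribution, the output is automatically normalized as well.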

Ensemble Distillation for NER
We evaluate the calibration and performance effects of ensemble distillation on NER models.
In these experiments, we examine teacher ensembles that use either strong independence assumptions (subsequently referred to as "IID") or first-order Markov assumptions. These settings allow us to examine the effects of distilling globally-marginalized versus locally-marginalized structured distributions into student models. We experiment on the CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003), which contains data in English and German, and consider both languages in our experiments. Our NER models use representations from pretrained masked language models: BERT for English and multilingual BERT (mBERT) for German (Devlin et al., 2019). Given an input sequence X = {x 1 , . . . , x T }, BERT outputs a representation for each x t . We consider two separate models in our experiments. The 'IID' model makes predictions based solely on the token-level logits output from a feed-forward layer applied to the BERT representations, making each prediction ŷ t independently of all others. The 'CRF' model instead passes the representations through a bidirectional LSTM layer, and the result into a conditional random field (CRF) with learned transition parameters (Lample et al., 2016). All models are trained using the Adafactor optimizer; we use a learning rate of 1e-4 for training the ensembles, and 5e-5 during distillation.
For each dataset, we trained K = 9 models in both the IID and CRF frameworks. To encourage diversity, each model in a given framework uses a different 1/10 split of the training set for its early-stopping criterion, in addition to using a different random seed. We then consider ensembles of sizes K = 3, 6, 9, where for K = 3, 6 the individual models are chosen randomly, but in such a way that the ensemble of 3 is always a subset of the ensemble of 6. At inference time, each ensemble's per-token distributions q k (y t | X) are averaged uniformly to create the ensemble distribution q(y t | X). In IID ensembles, q k (y t | X) is taken directly from the logits at timestep t. In the CRF model, the forward-backward algorithm is used to compute per-timestep distributions which are globally normalized over all possible output sequences Y. Predictions are then made for each token from the maximum-likelihood label under the ensembled distribution q. For comparison, we also train a collection of 9 models, each with a different random initialization, on the entire training dataset and report their average performance and standard error. Note that this setup disadvantages the ensemble, as each individual model in an ensemble has access to strictly less training data than the individual models.

Ensemble distillation
During ensemble distillation, we only distill into IID models, although we consider both IID and CRF ensembles as teacher distributions. This allows us to examine the effects of distilling globally marginalized distributions into locally marginalized models. Each student's distillation loss L KD is the token-level cross-entropy between the student's distribution p θ (y t | X) and the ensembled distribution q(y t | X), with an interpolation parameter of β = 5/6 between L KD and the true training loss, L NLL (θ). All distilled models are trained using the final 1/10th training split as validation data.

Evaluating calibration in NER
For highly imbalanced data, like NER labels, common measurements of calibration do not sufficiently distinguish between models (Benedetti, 2010). One way we account for this is to use stratified Brier score (Wallace and Dahabreh, 2014), which has two components: the Brier score over all positive events (BS + ) and over all negative events (BS − ), whereby one of these (usually BS + ) is more sensitive to a model's calibration.
However, we note a potential drawback of relying on BS +, namely that it is entangled with the model's recall. We also wish to use Expected Calibration Error, which more closely captures calibration in the sense defined in §2, but ECE is likewise rendered uninformative in an imbalanced setting.
To address both of these issues, we therefore propose an alternative "balanced" version of each metric: for each entity type, we consider the top 2N most confident predictions, where N is the number of tokens whose true label is that class. After this filtering, "Balanced ECE" (B-ECE) is computed as the weighted sum of each class's (adaptively binned) ECE. "Balanced Brier score" (B-BS) is similarly computed as the weighted sum of each class's Brier score over this filtered set. These metrics correct the problem of imbalance and better reflect a model's calibration independent of its recall (and thus test performance).
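The per-class filtering step behind the balanced metrics can be sketched as follows (a hypothetical helper of ours, not the authors' code; it returns the top-2N confidences for one class along with 0/1 correctness indicators, over which the class's ECE or Brier score would then be computed):

```python
def balanced_filter(confidences, labels, positive_class):
    """Keep the top-2N most confident predictions for a class, where N is
    the number of tokens whose true label is that class."""
    n = sum(1 for y in labels if y == positive_class)
    # rank all predictions by confidence, highest first
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    keep = order[: 2 * n]
    return ([confidences[i] for i in keep],
            [1 if labels[i] == positive_class else 0 for i in keep])
```

The final B-ECE (or B-BS) is then the class-frequency-weighted sum of the metric computed on each class's filtered set.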

Results
We report the results of single models, ensembles, and distilled ensembles on F1, BS −, BS +, B-BS, and B-ECE in Table 1. We find that, across all settings and languages, ensembles outperform individual models in both F1 and calibration. Distillation only moderately hurts these numbers: distilled models still outperform single models in F1, and they are vastly better calibrated than single models, indicating that distillation is effective at retaining the calibration benefits of ensembles.
While distilling an IID ensemble into an IID student generally lowers performance compared to the ensemble, distilling the ensembled CRF distributions into an IID model can actually yield higher performance and better calibration than the ensemble itself. This suggests that global CRF distributions may not ensemble well at the token level, but are still effective distillation teachers when distilled into a local IID model with no global distribution of its own.
As a further examination of the benefits of improved calibration, we produce precision-recall curves (PR) by thresholding token-level probabilities. We find that improved calibration translates to higher area under the PR curve. Figures and experimental details are reported in Appendix B.

Ensemble Distillation for NMT
In this section, we evaluate ensemble distillation for NMT models. State-of-the-art NMT models such as the Transformer are auto-regressive, meaning that the probability of a given target y t is a function of all previous targets y <t (Vaswani et al., 2017). Thus, distilling teacher information in this scenario differs from NER: the structure-level knowledge being distilled is inherently greedy (the teacher distributions do not take future tokens into account), and the distributions are conditioned on the gold sequence up to each position (making it difficult to distill the global behavior of the ensemble).

All experiments are run on the WMT16 En→De and De→En tasks, using the vanilla Transformer-Base architecture from Vaswani et al. (2017). All NMT experiments are run using the fairseq framework (Ott et al., 2019), with standard recipes and commodity hardware; unless otherwise specified, our experimental configuration mirrors that of the Vaswani et al. (2017) "base" model. We use a vocabulary of 32K symbols based on a joint source and target byte pair encoding (Sennrich et al., 2015; Ott et al., 2018b). Unlike in our NER setting, all models are trained on the full training set, with variation instilled only through different random initializations and data order. All models considered use stochastic weight averaging (SWA; Izmailov et al., 2018). Additionally, to evaluate the effect of label smoothing (Szegedy et al., 2016; Müller et al., 2019) on calibration, we consider two variants of the NMT experiments: models trained with the standard cross-entropy loss (CE-SWA), and models trained with cross-entropy loss and a smoothing factor of λ = 0.1 (LS-SWA). Models are added to ensembles in order of random seed. During ensemble inference, the next output token is taken from the argmax of the token-level distribution averaged across all models in the ensemble.

Challenges of ensemble distillation
Token-level distillation requires access to the teacher distribution during training, which in our experiments involves a distribution over 32K subwords. As we are interested in distilling an ensemble of teacher models, when training on devices like GPUs with limited memory, it may not be feasible to keep all models in the ensemble on device. Even on devices with sufficient memory, the additional overhead of ensemble inference may lead to impractical training times.
To enable scaling to large ensembles with minimal training overhead, we memoize to disk the ensemble predictive distributions associated with each token in the training data. During training, the memoized values are streamed along with source and target subwords for calculation of the NLL and distillation losses. However, this solution incurs a large storage cost, namely O(T · V) floating point numbers for a training dataset consisting of T tokens and a vocabulary of V subwords. Thus, we propose the following simple approximation scheme to reduce the storage requirements to O(T · V′), where V′ ≪ V. For each token t = 1, . . . , T, we store a vector v (t) ∈ Z^V′ of indices associated with the top-V′ tokens of the teacher distribution, along with a vector p (t) ∈ R^V′ of corresponding probabilities. During training, L KD is evaluated with respect to these fixed top-V′ events.
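The memoization with top-k truncation described above can be sketched as follows (our own illustration of the idea, not the released implementation; in practice the pairs would be written to and streamed from disk):

```python
import heapq
import math

def memoize_topk(teacher_dist, k):
    """Store only the top-k (index, probability) pairs of the teacher
    distribution for one training token."""
    top = heapq.nlargest(k, range(len(teacher_dist)), key=lambda v: teacher_dist[v])
    return top, [teacher_dist[v] for v in top]

def truncated_kd_loss(student_probs, indices, probs):
    """Distillation cross-entropy evaluated only over the memoized
    top-k events of the teacher distribution."""
    return -sum(p * math.log(student_probs[v]) for v, p in zip(indices, probs))
```

Note that the stored size per token depends only on k, not on the number of models in the teacher ensemble, since the ensemble has already been collapsed into a single averaged distribution.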

Distillation experimental details
As we found label smoothing to significantly hurt ensemble calibration (Table 2), our distillation experiments only consider the CE-SWA ensembles as teachers. We use a truncation level of V′ = 64 in Table 2 and report additional results for different truncation levels in Table 3. The distillation loss with weight β is evaluated over the tokens in the top-V′ set, using a fixed temperature of 1. The negative log-likelihood loss with weight 1 − β is identical to that of the other models and also uses label smoothing with λ = 0.1. All results use a weight of β = 0.5 on the distillation objective and a random initialization of the model parameters, which preliminary experiments suggested was optimal. Other experimental details match those of single models.

Results
Calibration for NMT is typically measured using the ECE of next-token predictions (ECE-1). To better understand the calibration of the distribution of the model's predictions, we supplement this with the ECE of the top five predictions at each token (ECE-5). We report the BLEU scores and calibration metrics of our ensembles, students, and baseline models in Table 2. We find that individual models trained with label smoothing have slightly better BLEU scores and calibration than those trained without, which is consistent with the findings of Müller et al. (2019), who attribute this improvement to reduced overconfidence. Surprisingly, however, ensembles of models trained with label smoothing actually have worse calibration than the individual models, and this effect grows as more models are incorporated. We hypothesize that penalizing overconfidence is effective for improving the calibration of a single model, but results in overcorrection when models which have been similarly penalized are ensembled together. This is supported by the reliability plots in Figure 2, which show that individual LS models are underconfident in their top predictions (an effect compounded by ensembling), whereas non-LS individual models are slightly overconfident in their top predictions (an effect corrected by ensembling).
For ensembles that do not incorporate label smoothing, we observe the same trends for NMT as we do for NER: ensembles consistently improve performance, and distillation results in a single model which significantly outperforms baseline models both in terms of calibration and BLEU. We also see a more consistent trend of improvement as the ensemble size increases, which we attribute to the substantially larger NMT dataset size (Figure 1).

Effect of truncation size V .
We consider V′ ∈ {32, 64, 128, 256}, which requires {32, 64, 128, 256} gigabytes of storage respectively to memoize the teacher distributions. To put these storage requirements in perspective, naively storing the full predictive distribution would require approximately 17 terabytes of storage. Note that the storage requirements of the proposed distillation procedure are constant with respect to the number of models in the teacher ensemble, so in principle the proposed approach could be used to distill significantly larger ensembles than considered in this work. The results for De→En are reported in Table 3. Surprisingly, as V′ becomes smaller, performance does not monotonically degrade, suggesting that truncation may have a beneficial regularization effect. In fact, although V′ = 32 has a marginally worse BLEU score than the best models, it has the best ECE-5 score. This suggests that for large datasets it may be reasonable to use aggressive truncation, although we do not experiment with values smaller than V′ = 32.
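These storage figures can be sanity-checked with a back-of-the-envelope calculation. Assuming roughly 130M training tokens and 8 bytes per stored (index, probability) pair — both assumptions of ours, not figures stated in the paper — the truncated and full-distribution costs come out close to the numbers reported above:

```python
def storage_gb(num_tokens, entries_per_token, bytes_per_entry):
    """Gigabytes needed to memoize `entries_per_token` fixed-size
    entries for every token in the training corpus."""
    return num_tokens * entries_per_token * bytes_per_entry / 1e9
```

Under these assumptions, 64 entries per token at 8 bytes each costs about 67 GB, while the full 32K-way distribution at 4 bytes per probability costs about 17 TB.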
Further Experiments

Single-model ensembles
Our findings suggest that even ensembles of relatively small size (3-4) can yield significant improvements over single models. In this section, we explore whether these findings can be mirrored by an ensemble built from multiple checkpoints of a single optimization trajectory. For this purpose we consider a popular technique introduced by Loshchilov and Hutter (2016), who define SGDR, a scheme for training with a cyclical learning rate, and find that an ensemble of 'snapshots' of the model, taken whenever the learning rate reaches its minimum, gives similar improvements in accuracy to proper ensembles.
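A common instantiation of such a cyclical schedule is cosine annealing with warm restarts; the sketch below is our own simplified illustration of the idea (not the exact schedule used in the experiments):

```python
import math

def sgdr_lr(step, cycle_len, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate that restarts every `cycle_len`
    steps, as in SGDR; snapshots for the ensemble are taken at the end
    of each cycle, when the rate is at its minimum."""
    t = (step % cycle_len) / cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The rate decays smoothly within each cycle and jumps back to lr_max at every restart, encouraging the trajectory to visit distinct local optima between snapshots.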
We follow the same procedure used to train our single CE-SWA NMT models, stopping 3 epochs early. We then warm-start this model and train for 3 epochs using a cyclical learning rate, saving the model at the end of each epoch. Table 4 gives the results obtained by ensembling the saved checkpoints, alongside an equivalent proper ensemble. We find that SGDR improves calibration over single models, but not to the same extent as the ensemble, and does not improve BLEU.

Table 4: Single-run ensemble performance for NMT. We include the performance of the 3-model ensemble and the average individual model performance. We find that single-run ensembles have better calibration than single models, but do not see the same performance gains that true ensembles do.
Applying SGDR to NER experiments yielded results which did not improve over the individual NER models in Table 1. We posit that using pretrained BERT reduces the amount of diversity which can be introduced in a single training run.

Temperature scaling
One of the benefits of the proposed framework is that it does not require the use of a separate validation set to achieve improvements in calibration. This also means that when one is available, it can be used in conjunction with our method to further improve calibration. A well-studied method for performing post-hoc re-calibration using additional data is temperature scaling (Guo et al., 2017). To explore the interactions between temperature scaling and ensemble distillation, we perform temperature scaling on our German NER IID models and our largest IID ensemble, using the validation set for tuning calibration. Additionally, we train a new student on the temperature-scaled ensemble. We report the test performance and calibration of all models, compared to the models without temperature scaling, in Table 5.
We find that temperature scaling can improve individual model calibration, but it does not surpass the calibration of ensembles. Additionally, temperature scaling can be used to further improve the calibration of both ensembles and ensemble-distilled models. However, the effect on performance varies: while temperature scaling hurts ensemble performance, it has a significant positive effect on the student model.
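Temperature scaling fits a single scalar T on held-out data and divides all logits by it; a minimal sketch of ours, using a simple grid search in place of the gradient-based fit of Guo et al. (2017):

```python
import math

def apply_temperature(logits, T):
    """Softmax of logits scaled by temperature T (T > 1 softens,
    T < 1 sharpens; T = 1 leaves the distribution unchanged)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fit_temperature(logit_sets, gold, grid=None):
    """Pick the T that minimizes held-out NLL over a coarse grid
    (a sketch; the standard approach optimizes T directly)."""
    grid = grid or [0.5 + 0.1 * i for i in range(31)]
    def nll(T):
        return -sum(math.log(apply_temperature(z, T)[y])
                    for z, y in zip(logit_sets, gold))
    return min(grid, key=nll)
```

Because dividing by T preserves the ordering of the logits, scaling never changes the argmax prediction, which is why it leaves an individual IID model's task performance untouched.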

Conclusion
Summary of contributions. We present a systematic study of the effect of ensembles on the calibration of structured prediction models, finding that ensembles consistently improve calibration and performance relative to single models. Our key finding is that ensemble distillation may be used to produce a single model that preserves much of the improved calibration and performance of the ensemble while being as efficient as a single model at inference time. Furthermore, we show that the calibration of the single student models can be further improved by other, orthogonal re-calibration methods. We release all code and scripts.

Note that temperature scaling has no effect on an individual IID model's performance, as it does not change the ranking of predictions.

Table 5: CoNLL-2003 German IID results for individual models, 9-ensembles, and distilled 9-ensembles with and without temperature scaling (TS). We find that we can utilize temperature scaling in all cases to boost calibration, but temperature scaling only helps overall performance when used in combination with distillation.

Open research questions. Non-autoregressive translation (NAT) is an active area of research for NMT (Gu et al., 2017; Stern et al., 2019; Ghazvininejad et al., 2019). Most knowledge distillation for NAT is performed at the sequence level and ignores distributional information at the token level. In future work, we are interested in exploring NAT using distilled ensembles with truncated distributions, and assessing how improved calibration impacts non-sequential decoding performance. Finally, Snoek et al. (2019) find that deep ensembles can significantly improve out-of-domain performance over single models, and we are interested in exploring whether our distillation techniques retain these benefits.
A Additional NMT results

Table 6 gives results for ensembles of models which do not use SWA to combine checkpoints. We see that although the performance of the individual models is worse than those which use SWA, ensembles of them essentially match the performance of the corresponding ensembles which did use SWA. This suggests that ensembling obviates the need for checkpoint averaging.

B The effect of calibration on PR curves for NER
In this section, we illustrate a further advantage of calibrated NER models: they enable straightforward thresholding of the returned confidences at different operating points of interest. In general, one may be willing to trade off precision or recall according to the application; the popular F1 metric for NER evaluates at one such operating point. The framework of precision-recall (PR) curves provides a graphical illustration of the performance of different models across a range of operating points, and the area under the PR curve provides a summary statistic that enables comparing models across the entire range of operating points (Flach and Kull, 2015). Note, however, that sequence distributions do not enable straightforward thresholding, because the probability of any particular sequence is vanishingly small. It is therefore necessary to consider marginal probabilities of positions or short spans instead. While this is expensive in the case of the CRF, requiring dynamic programming for each event of interest, our distilled IID model enables direct thresholding on the calibrated per-position probabilities.
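Given calibrated per-position probabilities, the PR curve is obtained by sweeping a threshold over them; a minimal sketch of ours (one precision/recall point per ranked prediction, with AUC computable from the returned points):

```python
def pr_points(confidences, labels):
    """Precision/recall pairs obtained by sweeping a threshold over
    per-position confidences, from the most confident prediction down."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    total_pos = sum(labels)
    tp = 0
    points = []
    for rank, i in enumerate(order, start=1):
        tp += labels[i]  # labels are 0/1 correctness of a positive call
        points.append((tp / rank, tp / total_pos))
    return points
```

Better-calibrated confidences yield rankings whose early predictions are more reliably correct, which is exactly what lifts the area under this curve.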
To illustrate the benefit of improved calibration, Figure 3 shows PR curves for four models:

• An individual CRF model

• An individual IID model

• A model distilled from a 9-ensemble of CRFs

• A model distilled from a 9-ensemble of IIDs

We find that the distilled ensembles, which have better calibration, have greater AUC than individual models, and generally dominate them around the threshold corresponding to F1.

C Label smoothing in NER
We are not aware of a thorough study of the effects of label smoothing on NER tasks. Our experiments found that, similar to the NMT case, label smoothing did somewhat improve calibration for individual models. However, label smoothing gave mixed results when used in conjunction with our framework for ensembles and ensemble distillation, and generally the best results were achieved without it. We report our findings in Table 7.

E Information about datasets
CoNLL-2003 comprises annotated text in two languages, English and German, taken from news articles. Details about sources, splits, and entity-type statistics can be found in Tjong Kim Sang and De Meulder (2003). The NER information is annotated in IOB format; we convert this to IOB2 as a pre-processing step.