Knowing What You Know: Calibrating Dialogue Belief State Distributions via Ensembles

The ability to accurately track what happens during a conversation is essential for the performance of a dialogue system. Current state-of-the-art multi-domain dialogue state trackers achieve just over 55% accuracy on the current go-to benchmark, which means that in almost every second dialogue turn they place full confidence in an incorrect dialogue state. Belief trackers, on the other hand, maintain a distribution over possible dialogue states. However, they underperform dialogue state trackers and do not produce well-calibrated distributions. In this work we present state-of-the-art performance in calibration for multi-domain dialogue belief trackers using a calibrated ensemble of models. Our resulting dialogue belief tracker also outperforms previous dialogue belief tracking models in terms of accuracy.


Introduction
Task-oriented dialogue systems aim to act as assistants to their users, solving tasks such as finding a restaurant, booking a train, or providing information about a tourist attraction. They have become very popular with the introduction of virtual assistants such as Siri and Alexa.
Two tasks are fundamental to such a system. The first is the ability to track what happened in the conversation, referred to as tracking. Based on the result of tracking, the system needs to conduct the conversation towards the fulfilment of the user goal, referred to as planning. The tracking component summarises the dialogue history, or the past, while the planning component manages the dialogue and concerns the future. In this work we focus on the first component.
Early approaches to statistical dialogue modelling view dialogue as a Markov decision process (Levin et al., 1998) and define a set of dialogue states that the conversation can be in at any given dialogue turn. The tracking component tracks the dialogue state. In recent years, discriminative models have achieved state-of-the-art dialogue state tracking (DST) results (Zhang et al., 2019; Heck et al., 2020). Still, in a multi-domain setting such as MultiWOZ (Eric et al., 2019), they achieve an accuracy of just over 55%. This means that in approximately 45% of cases they make a wrong prediction and, even worse, they have full confidence in that wrong prediction. In the wake of statistical dialogue modelling, the use of partially observable Markov decision processes has been proposed to address this issue. The idea is to model the probability over all possible dialogue states in every dialogue turn (Williams and Young, 2007). This probability distribution is referred to as the belief state. The advantages of belief tracking are probably best illustrated by an excerpt from a dialogue with a real user in Metallinou et al. (2013): even though the dialogue state predicted with the highest probability is not the true one, the system is able to provide a valid response because the true dialogue state is also assigned a non-zero probability.
A model is considered well calibrated if its confidence estimates are aligned with the empirical likelihood of its predictions (Desai and Durrett, 2020).
The belief state can be modelled by deep learning-based approaches such as the neural belief tracker (Mrkšić et al., 2017), the multi-domain belief tracker (Ramadan et al., 2018), the globally conditioned encoder belief tracker (Nouri and Hosseini-Asl, 2018) and the slot utterance matching belief tracker (SUMBT). None of these models, however, addresses the issue of calibrating the probability distribution that it provides, so they tend to be more confident than they should be. In a dialogue setting, overconfidence can lead to bad decisions and unsuccessful dialogues.
In this work, we present methods for learning well-calibrated belief distributions. Our contributions are the following:
• We present state-of-the-art performance in calibration for dialogue belief trackers using a calibrated ensemble of models, called the calibrated ensemble belief state tracker (CE-BST).
• Our model achieves the best overall joint goal accuracy among state-of-the-art belief tracking models.
Such a well-calibrated belief tracking model is essential for the planning component to successfully conduct dialogue.

Related Work
Since no other belief tracking methods that we are aware of have achieved success in producing well-calibrated confidence, we look towards methods used in other language tasks. Natural language inference is a related task that also benefits from well-calibrated confidence in predictions. Desai and Durrett (2020) introduce the use of post-processing techniques such as temperature scaling to produce better-calibrated confidence estimates. Additionally, there have been recent advances in the construction of more adequate loss functions. These methods, including Bayesian matching and prior networks, aim to learn well-calibrated models without the burden of requiring many extra parameters. These methods achieve good calibration in computer vision tasks such as CIFAR (Joo et al., 2020; Malinin and Gales, 2018; Szegedy et al., 2016).
When the limitations of a single model still inhibit us from producing more accurate and better-calibrated models, a popular alternative is to use an ensemble of models. Recently, Malinin and Gales (2020) showed the success of using an ensemble of models for machine translation, in particular utilising accurate confidence predictions for analysing translation quality.

Calibration Techniques
In this section we explain the details of three calibration techniques that we apply to dialogue belief tracking.

Loss Functions
The loss function can have a great impact on the calibration and accuracy of models. The most commonly used loss function in belief tracking is the standard softmax cross entropy loss. However, it tends to cause overconfident predictions where most of the probability is placed on the top class.
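Label smoothing cross entropy (Szegedy et al., 2016) aims to resolve this problem by replacing the one-hot targets of cross entropy with a smoothed target distribution. That is, for label $y_i$ and smoothing parameter $\alpha \in [0, \frac{1}{K}]$, the target distribution will be:
$$t_i(c) = \begin{cases} 1 - (K - 1)\alpha, & c = y_i \\ \alpha, & \text{otherwise} \end{cases} \qquad (1)$$
where $K$ is the number of possible values of $c$.
The loss for a model with parameters $\theta$ and a set of $N$ output logits $\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_N$ with true labels $y_1, y_2, \ldots, y_N$ is defined as:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\left(t_i \,\|\, \mathrm{softmax}(\hat{z}_i)\right) \qquad (2)$$
where KL is the Kullback-Leibler divergence between two distributions (Kullback and Leibler, 1951).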
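As a minimal sketch of label smoothing, the Python below builds a smoothed target and scores a batch of logits with a KL loss. It assumes the target places α on every non-target class and 1 − (K − 1)α on the true label (a form consistent with α ∈ [0, 1/K]) and that the loss is averaged over observations. The function names are illustrative and are not taken from any released implementation.

```python
import math


def smoothed_target(label, num_classes, alpha):
    """Smoothed target: alpha on each non-target class (assumed form)."""
    if not 0.0 <= alpha <= 1.0 / num_classes:
        raise ValueError("alpha must lie in [0, 1/K]")
    target = [alpha] * num_classes
    target[label] = 1.0 - (num_classes - 1) * alpha
    return target


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q), skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def label_smoothing_loss(logits_batch, labels, alpha):
    """Mean KL divergence between smoothed targets and softmax outputs."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        target = smoothed_target(label, len(logits), alpha)
        losses.append(kl_divergence(target, softmax(logits)))
    return sum(losses) / len(losses)
```

With α = 0 the target reduces to the one-hot vector and the loss coincides with standard cross entropy; with α = 1/K the target is uniform.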
Alternatively, Bayesian matching loss (Joo et al., 2020) uses a Dirichlet distribution as the final activation function. The target is constructed using Bayes' rule, where we assume the observed label $y_i$ to be an observation from a categorical distribution $y_i \mid \pi_i \sim \mathrm{Cat}(\pi_i)$ and $\pi_i$ is the true underlying distribution of the label. To introduce uncertainty into the target distribution we assume that the prior of $\pi_i$ is a Dirichlet distribution, $\mathrm{Dir}(1)$. In this way, we have a highly uncertain prior distribution. From this it can be shown that the posterior will be $\pi_i \mid y_i \sim \mathrm{Dir}(1 + I(y_i))$, where $I(y_i)$ is the one-hot representation of $y_i$. The loss function is then constructed using the negative log likelihood of the true label given the predicted distribution $\hat{\pi}_i \sim \mathrm{Dir}(\hat{z}_i)$, penalised by the KL divergence from the uncertain $\mathrm{Dir}(1)$ distribution:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( -\mathbb{E}_{\hat{\pi}_i \sim \mathrm{Dir}(\hat{z}_i)}\left[\log p(y_i \mid \hat{\pi}_i)\right] + \lambda \, \mathrm{KL}\left(\mathrm{Dir}(\hat{z}_i) \,\|\, \mathrm{Dir}(1)\right) \right)$$
where $\lambda > 0$ is the penalisation parameter.
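The conjugate update behind the Bayesian matching target can be illustrated with a small worked example. The sketch below starts from a Dir(1) prior, adds the one-hot observation to obtain the Dir(1 + I(y)) posterior, and reports its mean. The helper names `dirichlet_posterior` and `dirichlet_mean` are hypothetical and only serve the illustration; the loss itself is not implemented here.

```python
def dirichlet_posterior(label, num_classes, prior_concentration=1.0):
    """Concentration of the posterior after observing one categorical label."""
    concentration = [prior_concentration] * num_classes
    concentration[label] += 1.0  # add the one-hot observation I(y)
    return concentration


def dirichlet_mean(concentration):
    """Mean of a Dirichlet distribution with the given concentration."""
    total = sum(concentration)
    return [c / total for c in concentration]


# One observation of class 2 out of K = 4 possible values.
posterior = dirichlet_posterior(label=2, num_classes=4)
posterior_mean = dirichlet_mean(posterior)
```

For K = 4 the posterior mean places 2/5 on the observed value and 1/5 on each of the others, so a single observation shifts the target without collapsing it onto one class.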

Ensemble Distribution Estimation
From a Bayesian viewpoint, the probability of observing an outcome given the observed examples can be broken down into two components: the predictive distribution of the model and the posterior of the model given the observed examples. The posterior of the model given the data is an unknown distribution which can be estimated in various ways. One method is to use an ensemble of models, where the ensemble acts as an estimator for the posterior distribution of the parameters, $p(\theta \mid D)$, where $D$ represents the observed examples. Let $q(\theta)$ represent the distribution over all possible members of an ensemble. This distribution can be seen as the ensemble estimate of the posterior $p(\theta \mid D)$ (Malinin et al., 2019; Malinin and Gales, 2020), so that the predictive distribution of an outcome $y$ for an input $x$ becomes:
$$p(y \mid x, D) = \int p(y \mid x, \theta) \, p(\theta \mid D) \, d\theta \approx \int p(y \mid x, \theta) \, q(\theta) \, d\theta$$
Since this integral is still intractable, we need to estimate it using Monte Carlo sampling. To sample from the ensemble distribution $q(\theta)$ we consider two approaches: using dropout during inference to collect an ensemble of $N$ equally likely models (Gal and Ghahramani, 2016), or alternatively bootstrap sampling $N$ equally likely subsets of the data to train $N$ equally likely ensemble members. Let these $N$ members be $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\}$. The estimated predictive distribution can then be calculated as follows:
$$\hat{p}(y \mid x, D) = \frac{1}{N} \sum_{n=1}^{N} p(y \mid x, \theta^{(n)})$$
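Because the ensemble members are treated as equally likely, the Monte Carlo estimate amounts to a plain average of the members' predictive distributions. The sketch below shows this for a single slot. It assumes each member returns a probability list over the same ordered set of candidate values, and the name `ensemble_predictive` is illustrative.

```python
def ensemble_predictive(member_distributions):
    """Average the predictive distributions of equally likely ensemble members."""
    num_members = len(member_distributions)
    num_values = len(member_distributions[0])
    if any(len(dist) != num_values for dist in member_distributions):
        raise ValueError("all members must score the same candidate values")
    return [
        sum(dist[v] for dist in member_distributions) / num_members
        for v in range(num_values)
    ]


# Three members that disagree about a two-valued slot.
members = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
averaged = ensemble_predictive(members)
```

Disagreement between members flattens the averaged distribution, which is how the ensemble expresses uncertainty that a single overconfident member would hide.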

Temperature Scaling
Temperature scaling is a post-processing technique which scales the logits of the model by a scaling factor β > 1 (Guo et al., 2017), resulting in better-calibrated estimates. The temperature scaling parameter β can be trained on a development set.
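A minimal sketch of temperature scaling follows, assuming that scaling means dividing the logits by β before the softmax, as in Guo et al. (2017). The text only states that β can be trained on a development set; the grid search over candidate values used here is an assumed stand-in for that training step, and all function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_scale(logits, beta):
    """Divide the logits by beta, then normalise with a softmax."""
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    return softmax([z / beta for z in logits])


def negative_log_likelihood(logits_batch, labels, beta):
    """Mean negative log likelihood of the true labels after scaling."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        total -= math.log(temperature_scale(logits, beta)[label])
    return total / len(labels)


def fit_temperature(logits_batch, labels, candidates):
    """Pick the candidate beta with the lowest development-set NLL (grid search)."""
    return min(
        candidates,
        key=lambda beta: negative_log_likelihood(logits_batch, labels, beta),
    )
```

Since every logit vector is divided by the same β, the ranking of candidate values within a distribution is unchanged; only the sharpness of the distribution is affected.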

Experimental Setup
We seek to build a well-calibrated dialogue belief tracker. For our baseline belief tracker, we use the SUMBT model architecture, which uses BERT (Devlin et al., 2018) as a turn encoder and multi-head attention for slot candidate matching. We perform all experiments on the MultiWOZ 2.1 dataset (Eric et al., 2019), the current standard dataset for multi-domain dialogue. When training using Bayesian matching, we use a scaling coefficient of λ = 0.003, and for label smoothing, a smoothing coefficient of α = 0.05. For the ensemble belief tracker, we train 10 identical independent models, each with a sub-sample of 7500 dialogues. All hyper-parameters are obtained using a parameter search based on validation set performance. For all training, we use the BERT-base-uncased model from PyTorch Transformers (Wolf et al., 2019) for turn embedding. We use a gated recurrent unit with a hidden dimension of 300 for latent tracking and Euclidean distance for value candidate scoring. During training, we use a learning rate of 5e-5 in combination with a linear learning rate scheduler, with the warm-up proportion set to 0.1. A dropout rate of 0.3 is used, and training is performed for 100 epochs.

Joint Goal Accuracy
The joint goal accuracy (JGA) is the percentage of turns for which the model predicts the complete user goal correctly. We further propose the introduction of an adjusted top 3 JGA, which considers a user goal prediction correct if the true label for each slot is among the top 3 predicted candidates for that slot in the belief state given there are at least 5 possible candidates.
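The two accuracy measures can be sketched as follows. The code assumes a belief state is a dictionary mapping each slot to a dictionary of candidate values and probabilities, and a user goal maps each slot to its true value. It also interprets the "at least 5 possible candidates" condition as falling back to the top 1 candidate for slots with fewer candidates; that reading, like the function names, is an assumption.

```python
def top_n_values(distribution, n):
    """Return the n most probable candidate values of one slot."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [value for value, _ in ranked[:n]]


def turn_correct(belief_state, true_goal, n=1, min_candidates=5):
    """True if every slot's true value is within its allowed top candidates."""
    for slot, true_value in true_goal.items():
        distribution = belief_state[slot]
        # Assumed rule: top-n only applies to slots with enough candidates.
        allowed = n if len(distribution) >= min_candidates else 1
        if true_value not in top_n_values(distribution, allowed):
            return False
    return True


def joint_goal_accuracy(belief_states, true_goals, n=1, min_candidates=5):
    """Percentage of turns whose complete user goal is predicted correctly."""
    correct = sum(
        turn_correct(belief, goal, n, min_candidates)
        for belief, goal in zip(belief_states, true_goals)
    )
    return 100.0 * correct / len(true_goals)


belief = {
    "area": {"centre": 0.4, "north": 0.3, "south": 0.2, "east": 0.06, "west": 0.04},
    "parking": {"yes": 0.6, "no": 0.4},
}
goals = [{"area": "north", "parking": "yes"}, {"area": "north", "parking": "no"}]
jga = joint_goal_accuracy([belief, belief], goals, n=1)
top3_jga = joint_goal_accuracy([belief, belief], goals, n=3)
```

In this toy example the standard JGA is 0% because "north" is only the second-ranked area, while the adjusted top 3 JGA is 50%: the second goal still fails because the two-valued parking slot is scored on its top candidate only.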

L2 Norm Error
The L2 norm error is the L2 norm of the difference between the true labels and the predicted distributions. To form the user goals and belief states we concatenate all the slot labels and slot distributions. This error measure considers not only the accuracy of the predictions but also their uncertainty.
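A short sketch of this measure for a single turn is given below. It assumes the same dictionary representation of belief states (slot to value probabilities) and user goals (slot to true value), with the true labels encoded as one-hot vectors before concatenation; the name `l2_norm_error` is illustrative.

```python
import math


def l2_norm_error(belief_state, true_goal):
    """L2 norm of (one-hot true labels - predicted distributions) over all slots."""
    squared_error = 0.0
    for slot, distribution in belief_state.items():
        for value, probability in distribution.items():
            target = 1.0 if true_goal.get(slot) == value else 0.0
            squared_error += (target - probability) ** 2
    return math.sqrt(squared_error)


goal = {"area": "north"}
confident_wrong = l2_norm_error({"area": {"centre": 1.0, "north": 0.0}}, goal)
uncertain_wrong = l2_norm_error({"area": {"centre": 0.6, "north": 0.4}}, goal)
```

Both predictions rank the wrong value first and therefore count equally against the joint goal accuracy, but the uncertain prediction receives a lower L2 norm error because it keeps probability mass on the true value.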

Joint Goal Calibration Error
A well-calibrated model is one where the accuracy is aligned with the confidence predictions. The expected calibration error (ECE) evaluates the calibration by measuring the difference between the model's confidence and accuracy (Guo et al., 2017), meaning a lower ECE indicates better calibration. Hence:
$$\mathrm{ECE} = \sum_{k=1}^{B} \frac{b_k}{N} \left| \mathrm{acc}(k) - \mathrm{conf}(k) \right|$$
where $B$ is the number of bins, $b_k$ are the bin sizes, $N$ the number of observations, and $\mathrm{acc}(k)$ and $\mathrm{conf}(k)$ the accuracy and confidence measures of bin $k$. We also propose an adapted ECE, called the expected joint goal calibration error (EJCE), which uses the joint goal accuracy for bin $k$ as $\mathrm{acc}(k)$, and a confidence for bin $k$ derived from $\hat{p}_i(v \mid s)$, the predicted probability of value $v$ for slot $s$ given the $i$-th observation in bin $k$.
All of the calibration techniques presented above can be combined. Here, we focus on the most important combinations and present the results in Table 1. We make the following observations. First, cross entropy on its own leads to a high EJCE, as expected. Second, label smoothing reduces EJCE while leading to a negligible drop in accuracy. Third, Bayesian matching underperformed in our experiments, suggesting a difficulty in choosing the right priors. Fourth, temperature scaling is not an effective way of calibrating uncertainty, as the same calibration is applied to each observation. Finally, the ensemble methods produce very promising results for both accuracy and calibration of the model. In particular, if we look at the Top 3 JGA, our method achieves an improvement of 14.11 percentage points over the baseline. In the Appendix we include a comprehensive set of Top n JGA results.
In Figure 1 we plot JGA as a function of confidence. The best calibrated model is the one that is closest to the diagonal, i.e. the one whose confidence for each dialogue state is closest to the achieved accuracy.
From this reliability diagram we see that both the dropout and model ensembles improve model calibration and do not produce over-confident output as the cross entropy baseline does. In Table 2 we compare our model to some of the best performing belief and state tracking models. Here we see that we outperform the best performing belief tracker, but the state-of-the-art (SOTA) state trackers (Heck et al., 2020; Chen et al., 2020; Hosseini-Asl et al., 2020) have a significantly higher JGA. However, when analysing the L2 norm we see that the uncertainty estimates of belief tracking models compensate for the lower joint goal accuracy. This corroborates our premise that it is important to have well-calibrated confidence estimates and not just a high JGA.
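The binned computation behind the expected calibration error, and behind a reliability diagram, can be sketched as follows. The code assumes equal-width confidence bins on [0, 1] and takes one confidence and one correctness flag per observation. For an EJCE-style measure the flags would be per-turn joint goal correctness and the confidences a per-turn joint confidence, whose exact form is not assumed here. The function name is illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    num_observations = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes to the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, is_correct))

    error = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        bin_size = len(members)
        accuracy = sum(1.0 for _, is_correct in members if is_correct) / bin_size
        mean_confidence = sum(confidence for confidence, _ in members) / bin_size
        error += (bin_size / num_observations) * abs(accuracy - mean_confidence)
    return error


outcomes = [True, True, True, False]
overconfident = expected_calibration_error([0.95] * 4, outcomes)
calibrated = expected_calibration_error([0.75] * 4, outcomes)
```

Both sets of predictions are 75% accurate; the overconfident one is penalised for the gap between its 0.95 confidence and that accuracy, while the calibrated one incurs no error.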

Conclusion
We applied a number of calibration techniques to a baseline dialogue belief tracker. We showed that a label smoothed trained ensemble provides state-of-the-art calibration of the belief state distributions and has the best accuracy among the available belief trackers. Although it does not compete with state trackers in terms of JGA, when considering top 3 predictions it achieves 84.08% accuracy (Top 3 JGA), almost 30 percentage points above state-of-the-art state trackers. We also find that our model has the best L2 norm performance, which suggests that the quality of predicted uncertainty is as important as the average JGA.
It is important to note that the proposed calibration methods can be applied to any neural dialogue belief tracking method. The uncertainty estimates predicted by this model could improve the success of dialogue systems because this model can provide the dialogue manager with a good measure of confidence. This could allow the system to ask questions in moments of confusion. In the Appendix we include example dialogues to illustrate this. In future work, we aim to combine the state-of-the-art dialogue state tracking and belief tracking methods to create a method that can achieve both state-of-the-art joint goal accuracy and well-calibrated belief states.

A.1 Joint Goal Accuracy Analysis
In Table 3 we compare SUMBT and our CE-BST method using 5 different top n joint goal accuracies.

User: Can you help me find a place to go in the centre?
System: I can help you with that. Is there a certain kind of attraction that you would like to visit?
User: Surprise me! Give me the postcode as well.
System: Would you prefer the castle galleries is a museum in the centre of town. Their post code is cb23bj.
User: Great! I am also looking for a place to eat in the same area. Something not too expensive, but not cheap.

User: I need a train to Stansted airport that leaves on Sunday.
System: Did you have a time you would like to arrive or leave?
User: I need to arrive by 14:30.
System: tr1668 will arrive at 14:08, would that work for you?
User: That is perfect. I would like to make a booking for 6 people please.
System: Booking was successful, the total fee is 48.48 gbp payable at the station. Your reference number is hf03ug02. Do you need assistance with anything else?
User: I need to eat too.
System: What type of restaurant and price range are you looking for?
User: I would like Catalan food. It needs to be in the centre and be expensive.