“None of the Above”: Measure Uncertainty in Dialog Response Retrieval

This paper discusses the importance of uncovering uncertainty in end-to-end dialog tasks and presents our experimental results on uncertainty classification on the processed Ubuntu Dialog Corpus. We show that instead of retraining models for this specific purpose, we can capture the original retrieval model’s underlying confidence concerning the best prediction using trivial additional computation.


Introduction
Uncertainty modeling is a widely explored problem in dialog research. Stochastic models like deep Q-networks (Tegho et al., 2017), Gaussian processes (Gašić and Young, 2014), and partially observable Markov decision processes (Roy et al., 2000) are often used in spoken dialog systems to optimize dialog management by explicitly estimating uncertainty in policy assignments.
However, these approaches are either computationally intensive (Gal and Ghahramani, 2015) or require significant work on refining policy representations (Gašić and Young, 2014). Moreover, most current uncertainty studies in dialog focus on the dialog management component. End-to-end (E2E) dialog retrieval models jointly encode a dialog and a candidate response (Wu et al., 2016; Zhou et al., 2018), assuming the ground truth is always present in the candidate set, which is not the case in production. Larson et al. (2019) recently showed that classifiers that perform well on in-scope intent classification for task-oriented dialog systems struggle to identify out-of-scope queries. The response selection task in the most recent Dialog System Technology Challenge (Chulaka Gunasekara and Lasecki, 2019) also explicitly mentions that "none of the proposed utterances is a good candidate" should be a valid option.
Our datasets for the NOTA task are released at https://github.com/yfeng21/nota_prediction.
The goal of this paper is to set a new direction for future task-oriented dialog system research: while retrieving the best candidate is crucial, it should be equally important to identify when the correct response (i.e. ground truth) is not present in the candidate set. In this paper, we measure the E2E retrieval model's capability to capture uncertainty by inserting an additional "none of the above" (NOTA) candidate into the proposed response set at inference time.
The contributions of this paper are as follows: (1) We demonstrate that, to solve the NOTA detection task, it is crucial to learn the relationship among the candidates as a set instead of looking at point-wise matching. As a result, the logistic regression (LogReg) approach proposed here consistently achieves the best performance compared to several strong baselines. (2) Through extensive experiments, we show that the raw output scores (logits) are more informative in representing model confidence than the normalized probabilities after the Softmax layer.

Related Work
Our use of NOTA to measure uncertainty in dialog response is motivated by the design of student performance assessment in psychology studies.
Test creators often include NOTA candidates in multiple-choice questions, both as correct answers and as distractors. How the use of NOTA affects the difficulty and discrimination of a question has been discussed widely (Gross, 1994; Pachai et al., 2015). For assessment purposes, a common finding is that using NOTA as the correct response increases question difficulty, and also lures both high- and low-performing students toward distractors (Pachai et al., 2015).
Returning a NOTA-like response is a common practice in dialog production systems (IBM). The idea of adding the NOTA option to a candidate set is also widely used in other language technology fields like speaker verification (Pathak and Raj, 2013). However, the effect of adding NOTA has rarely been studied in dialog retrieval research. To the best of our knowledge, we are the first to scientifically evaluate a variety of conventional approaches for retrieving NOTA in the dialog field.

Ubuntu Dataset
All of the experiments herein use the Ubuntu Dialog Corpus (Lowe et al., 2015), which contains multi-turn, goal-oriented chat logs from the Ubuntu forum. For next utterance retrieval purposes, we use the version of the training data preprocessed by Mehri and Eskenazi (2019), where all negative training samples (500,127) were removed, and, for each context, 9 distractor responses were randomly chosen from the dataset to form the candidate response set, together with the ground truth response. For the uncertainty task, we use a special token NOTA to represent the "none of the above" choice, as in multiple-choice questions. More details on this NOTA setup can be found in Sections 4.1 and 4.2. The modified training dataset has 499,873 dialog contexts, each with 10 candidate responses. The validation and test sets remain unchanged, with 19,561 validation samples and 18,921 test samples.

Dual LSTM Encoder
The LSTM dual encoder model consists of two single-layer, uni-directional encoders, one to encode the embedding ($c$) of the context and one to encode the embedding ($r$) of the response. The output function is computed as a dot product of the two, $f(r, c) = c^{\top} r$. This model architecture has already been shown to perform well for the Ubuntu dataset (Lowe et al., 2015; Kadlec et al., 2015). We carry out experiments with the following variants of the vanilla model for training:
Binary. This is the most common training method for next utterance ranking on the Ubuntu corpus. With training data prepared in the format of [CONTEXT] [RESPONSE] [LABEL], the model performs binary classification on each sample, predicting whether a given response is the ground truth. The binary cross-entropy between the label and $\sigma(f(r, c))$, the output of a sigmoid layer, is used as the loss function.
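The scoring function and the binary loss described above can be illustrated with a minimal NumPy sketch. The vectors below are toy values standing in for the final encoder states (an assumption; the paper's encoders are LSTMs), and the function names are hypothetical.

```python
import numpy as np

def score(c, r):
    """Dual-encoder output f(r, c) = c^T r for one context/response pair."""
    return float(np.dot(c, r))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(logit, label):
    """Binary cross-entropy between a 0/1 label and sigmoid(logit)."""
    p = sigmoid(logit)
    eps = 1e-12  # guard against log(0)
    return float(-(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps)))

# Toy encodings (assumed values, not taken from the paper).
c = np.array([0.5, -1.0, 2.0])
r_true = np.array([1.0, 0.0, 1.0])
r_distractor = np.array([-1.0, 1.0, -0.5])

logit_true = score(c, r_true)              # 0.5 + 0.0 + 2.0
logit_distractor = score(c, r_distractor)  # -0.5 - 1.0 - 1.0

# Each [CONTEXT] [RESPONSE] [LABEL] sample contributes one loss term.
loss_true = binary_cross_entropy(logit_true, 1)
loss_distractor = binary_cross_entropy(logit_distractor, 0)
```

Because each pair is scored independently, this training signal is point-wise: nothing in the loss compares a candidate with the rest of its response set.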
Dropout. Gal and Ghahramani (2015) found that dropout layers can be used in neural networks as a Bayesian approximation to the Gaussian process, and thus can represent model uncertainty in deep learning. Inspired by this work, we add a dropout layer after each encoder's hidden layer at training time. At inference, we keep the dropout layers active, pass each sample through the model n times, and make the final prediction by taking a majority vote among the n predictions. Unlike in the other models, the NOTA binary classification decision is not based on the output score itself, but on the variance of each response's score.
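The inference procedure for the dropout variant can be sketched as follows. This is only an illustration under stated assumptions: dropout is applied directly to fixed toy encodings rather than inside an LSTM, inverted-dropout scaling is assumed, and the function names are hypothetical. How the variance is turned into a NOTA decision is not specified in the text, so the sketch stops at computing it.

```python
import numpy as np

def mc_dropout_scores(c, responses, n_passes, p_drop, rng):
    """Score every candidate n_passes times, with a fresh dropout mask per pass.

    c: context encoding, shape (d,). responses: candidate encodings, shape (k, d).
    Returns an array of shape (n_passes, k).
    """
    keep = 1.0 - p_drop
    scores = np.empty((n_passes, len(responses)))
    for t in range(n_passes):
        mask_c = rng.random(c.shape) < keep
        mask_r = rng.random(responses.shape) < keep
        # Inverted dropout: rescale kept units so the expected value is unchanged.
        scores[t] = ((responses * mask_r) / keep) @ ((c * mask_c) / keep)
    return scores

def majority_vote(scores):
    """Index of the candidate that wins the most passes."""
    votes = scores.argmax(axis=1)
    return int(np.bincount(votes, minlength=scores.shape[1]).argmax())

def score_variance(scores):
    """Per-candidate variance of the score across passes."""
    return scores.var(axis=0)

rng = np.random.default_rng(0)
context = rng.normal(size=8)
candidates = rng.normal(size=(10, 8))
passes = mc_dropout_scores(context, candidates, n_passes=20, p_drop=0.5, rng=rng)
prediction = majority_vote(passes)
variances = score_variance(passes)
```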

Experimental Setup
LSTM. For the LSTM models, unless otherwise specified, the word embeddings are initialized randomly with a dimension of 300, and the hidden size is 512. The vocabulary consists of the 10,000 most common words in the training dataset, plus the UNK and PAD special tokens. We use the Adam algorithm (Kingma and Ba, 2014) for optimization with a learning rate of 0.005. The gradients are clipped to 5.0. With a batch size of 128, we train the model for 20 epochs, and select the best checkpoint based on its performance on the validation set. In the dropout model, we use a dropout probability of 50%.
LogReg. For the logistic regression model, we train on the validation set's LSTM outputs with the same hyperparameter setup (where applicable to LogReg) as in the corresponding LSTM model.

Direct Prediction
For the direct prediction experiment, we randomly choose 50% of the response sets and replace the ground truth responses with the NOTA special token (we label this subset as isNOTA). For the other 50% of the samples, we replace the first distractor with the NOTA token (we label this subset as notNOTA). With this setup, we ensure that a NOTA token is always present in the candidate set. Although making decisions based on logits (DirectLogits) or probability (DirectProb) yields the same argmax prediction, we collect both output scores for the following LogReg model (details in Section 4.3). Concretely, the final output $y$ of a direct prediction model is:
$$y = \arg\max_{r_i \in R} f(r_i, c)$$
where the candidate set $R$ includes the NOTA token.
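The data preparation and argmax decision for direct prediction can be sketched in a few lines. The helper names are hypothetical, and the representation of a response set as a list with a known ground-truth index is an assumption made for illustration.

```python
import random

NOTA = "NOTA"  # special token standing for "none of the above"

def split_half(n_sets, rng):
    """Randomly pick half of the response-set indices to become isNOTA."""
    indices = list(range(n_sets))
    rng.shuffle(indices)
    return set(indices[: n_sets // 2])

def add_nota(candidates, truth_index, is_nota):
    """Insert the NOTA token into one response set.

    isNOTA: the ground truth is replaced, so NOTA is the correct choice.
    notNOTA: the first distractor is replaced and the ground truth stays.
    Returns (new_candidates, index_of_correct_choice).
    """
    out = list(candidates)
    if is_nota:
        out[truth_index] = NOTA
        return out, truth_index
    first_distractor = next(i for i in range(len(out)) if i != truth_index)
    out[first_distractor] = NOTA
    return out, truth_index

def direct_predict(scores):
    """Argmax over the scores of all candidates, NOTA included."""
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
is_nota_ids = split_half(4, rng)
example, correct = add_nota(["truth", "d1", "d2"], truth_index=0, is_nota=True)
```

Since softmax is monotonic, applying `direct_predict` to logits or to normalized probabilities selects the same candidate; only the confidence values passed on to later stages differ.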

Threshold
Another common approach toward returning NOTA is to reject a candidate utterance based on confidence score thresholds. Therefore, in the threshold experiments, with the same preprocessed data as in Section 4.1, we remove all NOTA tokens at the inference model's batch preparation stage, leaving 9 candidates, so that 50% of the response sets (the isNOTA set) have no ground truth present. After the model outputs scores for each candidate response, it uses the predefined threshold to decide whether to accept the prediction with the highest score as its final response, or to reject the prediction and return NOTA instead. We investigate the performance of setting the threshold based on probability (ThresholdProb) and logits (ThresholdLogits) respectively. Concretely, the final output $y$ is given by:
$$y = \begin{cases} \arg\max_{r_i} s(r_i, c) & \text{if } \max_{r_i} s(r_i, c) \geq \tau \\ \text{NOTA} & \text{otherwise} \end{cases}$$
where $s(r_i, c)$ is either the logit $f(r_i, c)$ (ThresholdLogits) or its softmax-normalized probability (ThresholdProb), and $\tau$ is the predefined threshold.
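A minimal sketch of the threshold decision follows. The threshold values and the use of `>=` rather than `>` are assumptions for illustration; in the paper the thresholds are tuned on the validation set.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over one candidate set."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def threshold_predict(scores, threshold):
    """Return the index of the best candidate, or None (NOTA) if it is rejected."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None

confident = [4.0, 1.0, 0.5]  # one candidate clearly ahead
flat = [0.2, 0.1, 0.0]       # no candidate stands out

# ThresholdLogits: compare the raw best score with the threshold.
logit_confident = threshold_predict(np.array(confident), threshold=1.0)
logit_flat = threshold_predict(np.array(flat), threshold=1.0)

# ThresholdProb: compare the softmax-normalized best score with the threshold.
prob_confident = threshold_predict(softmax(confident), threshold=0.5)
prob_flat = threshold_predict(softmax(flat), threshold=0.5)
```

Note that softmax normalizes within a set, so the top probability depends on how far the best logit is from the others rather than on its absolute value; the two criteria can therefore disagree on the same set.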

Logistic Regression
We (4) where the input to the LogReg model, $f(r_i, c)$, is the output of the LSTM models, either as logits or in normalized form, as previously defined in subsection 3.2.
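As an illustration of the general idea, a binary logistic regression can be trained on the vector of candidate scores for a whole response set. Everything specific below is an assumption rather than the paper's exact formulation: the scores are synthetic stand-ins for LSTM logits, the feature vector is the set's scores sorted in descending order, and the model is fit with plain gradient descent.

```python
import numpy as np

def make_score_sets(n_sets, n_candidates, rng):
    """Synthetic stand-in for LSTM logits. Returns (features, is_nota labels)."""
    scores = rng.normal(0.0, 1.0, size=(n_sets, n_candidates))  # distractor-like
    is_nota = rng.random(n_sets) < 0.5
    # Sets that still contain the ground truth get one clearly higher score.
    boost = rng.normal(5.0, 1.0, size=n_sets)
    scores[~is_nota, 0] = boost[~is_nota]
    # Sort each set so the features do not depend on candidate order.
    features = -np.sort(-scores, axis=1)
    return features, is_nota.astype(float)

def fit_logreg(X, y, lr=0.1, steps=2000):
    """Binary logistic regression trained with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def predict_is_nota(X, w, b):
    """True where the set is classified as having no ground truth present."""
    return (X @ w + b) > 0.0

rng = np.random.default_rng(0)
X_train, y_train = make_score_sets(400, 9, rng)
X_test, y_test = make_score_sets(400, 9, rng)
w, b = fit_logreg(X_train, y_train)
accuracy = float(np.mean(predict_is_nota(X_test, w, b) == (y_test == 1.0)))
```

The point of the sketch is that the classifier sees all scores of a set at once, so it can use the gap between the best candidate and the rest, which a single fixed threshold on the top score cannot.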

Metric Design
Dialog retrieval tasks often use recall out of k ($R_x@k$) as a key metric, measuring how often, out of $x$ candidates, the answer is in the top k. In this paper, we focus on the top-1 accuracy $R_x@1$ ($R_x$ for short) with a candidate set size of $x$, where $x \in \{2, 5, 10, 20, 40, 60, 80, 100\}$. The recall metric is modified for uncertainty measurement purposes, and is further extended to calculate the NOTA accuracy out of $x$ ($N_x$), and F1 scores for each class ($NF1_x$, $GF1_x$). Let $D = \{c, y\}$ and $D_n = \{c, \text{isNOTA}\}$ be the two subparts of the data that correspond to samples that are notNOTA and isNOTA respectively. The above metrics are computed by:
In Equation (6), the numerator represents correctly predicted (same as in Equation (5)) plus other true negative isNOTA predictions, where the model correctly predicts notNOTA, but fails to choose the ground truth.
The positive class in $NF1_x$ is the isNOTA class, and the positive class in $GF1_x$ is the notNOTA class.
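One way to compute these four quantities is sketched below. The exact denominators are assumptions, since they are not fully specified in the text: recall is taken over the notNOTA samples only, and NOTA accuracy over all samples. Predictions and gold labels are candidate indices, with `None` standing for NOTA.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def nota_metrics(gold, pred):
    """gold and pred hold a candidate index per sample, or None for NOTA."""
    pairs = list(zip(gold, pred))
    not_nota = [(g, p) for g, p in pairs if g is not None]

    # R: how often the exact ground truth is retrieved when it is present.
    recall = sum(g == p for g, p in not_nota) / len(not_nota)
    # N: binary accuracy of the isNOTA / notNOTA decision. A notNOTA sample
    # counts as correct even if the wrong non-NOTA candidate was chosen.
    n_acc = sum((g is None) == (p is None) for g, p in pairs) / len(pairs)

    tp = sum(g is None and p is None for g, p in pairs)          # isNOTA hits
    fp = sum(g is not None and p is None for g, p in pairs)      # false NOTA
    fn = sum(g is None and p is not None for g, p in pairs)      # missed NOTA
    tn = sum(g is not None and p is not None for g, p in pairs)  # notNOTA hits

    nf1 = f1(tp, fp, fn)  # positive class: isNOTA
    gf1 = f1(tn, fn, fp)  # positive class: notNOTA
    return {"R": recall, "N": n_acc, "NF1": nf1, "GF1": gf1,
            "avgF1": (nf1 + gf1) / 2}

gold = [0, 1, None, None, 2, None]
pred = [0, 2, None, 1, None, None]
metrics = nota_metrics(gold, pred)
```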

More Candidates
In real-world problems, retrieval response sets usually have many more than 10 candidates. Therefore, we further test the selection and binary models on a larger reconstructed test set. For each context, we randomly select 90 more distractors from other samples' candidate responses, producing a candidate response set of size 100 for each context.

Results and Analysis
Table 1 summarizes the experimental results. Due to space limitations, this table only displays results on 10 candidates. Complete results for other numbers of candidates, which show performance patterns similar to those for 10 candidates, are found in the Appendix. The thresholds and hyperparameters are tuned on the validation set according to the highest average F1 score. For the selection model, in addition to the original dataset, we also train the model on a modified training dataset, containing NOTA choices as in the inference datasets, with the same set of hyperparameters. As expected, since there are now fewer real distractor responses, training with NOTA included improves the model's NOTA classification performance, but sacrifices recall scores, which is not desirable. In all the models, regardless of the training dataset used and the model architecture, adding a logistic regression on top of the LSTM output significantly improves average F1 scores. Specifically, the highest F1 scores are always achieved with logits scores as LogReg input features. These results show that, though setting a threshold is a common heuristic to balance true and false acceptance rates (Larson et al., 2019), its NOTA prediction performance is not comparable to the LogReg approach, even after an exhaustive grid search for the best thresholds.
Table 1: Results on 10 candidates. R represents recall, N represents binary NOTA classification accuracy, NF1 represents the F1 score on the NOTA class, and GF1 represents the F1 score on the ground-truth-present class. Average F1 is the average of NF1 and GF1.
This finding is underlined by receiver operating characteristic (ROC) curves on the validation set. Figure 1 shows the ROC curves for predicting NOTA directly with LSTM. Figure 2 shows ROC plots for predicting NOTA with LogReg in the same order as Figure 1, where a separate LogReg model is trained for each score setting. In both figures, the areas under the curve (AUC) indicate that logits serve as a more discriminative confidence score than the normalized softmax score. Comparing the top right plots in both figures, we can see that with the same set of logits scores as threshold criteria, AUC is boosted from 0.71 to 0.91 with the additional LogReg model, providing further evidence that LogReg significantly outperforms the LSTM models in this NOTA classification task.
Figure 3 shows the model's distribution of max scores on the validation set. We see that there are apparent differences between the best score distributions of isNOTA and notNOTA. This is an encouraging observation because it suggests that current retrieval models can already distinguish good from wrong responses to some extent. Note that as the NOTA token is not included in training, for direct prediction tasks, the NOTA token is encoded as an UNK token at inference time. The tails of the isNOTA plot in both the DirectLogits and DirectProb graphs suggest that the model will, very rarely, pick the unknown token as the best response.
Figure 4 shows the average F1 score trends with the original selection model on the test set with 100 distractors. The plot shows that with more distractors, the LSTM model struggles to determine the presence of ground truth, while the LogReg model performs consistently well. The complete results of this extended test set are in the Appendix.

Discussion
With NOTA options in the training data, the models learn to sometimes predict NOTA as the best response, resulting in more false-positive isNOTA predictions at inference time. Also, because various ground truths and strong distractors are replaced with NOTA, the model has fewer samples to help it learn to distinguish between different ground truths and strong distractors. Thus it performs less well on borderline predictions (scores close to the threshold). This behavior results in some selection methods trained on the dataset containing NOTA tokens performing worse than when they are trained on the original dataset. This motivates us to advocate the proposed LogReg approach instead of the conventional "add a NOTA choice" method.
Another prominent advantage of the LogReg approach is that it does not require data- or model-dependent input like embedding vectors or hidden layer output. Instead, it takes logits or normalized scores, both of which can be output by any model. This feature makes our approach insensitive to the underlying architecture.

Conclusions
We have created a new NOTA task on the Ubuntu Dialog Corpus, and have proposed to solve the problem by learning the response set representation with a binary classification model. We hope the dataset we release will be used to benchmark future dialog system uncertainty research.
Table 2 shows the original selection model's performance on different sizes of candidate response sets. The direct prediction model is run because it does not need further tuning. The threshold approach, especially with softmax probability as the threshold criterion, would need separate rounds of threshold tuning.
Table 3 shows the complete results for all models on the test set, both for 2 candidates and for 10 candidates. Here, the average F1 is averaged over all 4 F1 scores. For each model architecture, the best performing setting for each metric is in bold.
Table 3: @10 and @2 represent metrics on 10 and 2 candidates respectively. R represents recall, N represents binary NOTA classification accuracy, NF1 represents the F1 score on the NOTA class, and GF1 represents the F1 score on the ground-truth-present class. Average F1 is obtained from the 4 F1 scores.