Towards More Accurate Uncertainty Estimation in Text Classification

Abstract

The uncertainty measurement of classified results is especially important in areas requiring limited human resources for higher accuracy. For instance, data-driven algorithms diagnosing diseases need accurate uncertainty score to decide whether additional but limited quantity of experts are needed for rectification. However, few uncertainty models focus on improving the performance of text classification where human resources are involved. To achieve this, we aim at generating accurate uncertainty score by improving the confidence of winning scores. Thus, a model called MSD, which includes three independent components as "mix-up", "self-ensembling", "distinctiveness score", is proposed to improve the accuracy of uncertainty score by reducing the effect of overconfidence of winning score and considering the impact of different categories of uncertainty simultaneously. MSD can be applied with different Deep Neural Networks. Extensive experiments with ablation setting are conducted on four real-world datasets, on which, competitive results are obtained.

Introduction
Text classification is a popular topic with broad applications. A successful and common model family for text classification is the Deep Neural Network (DNN). However, some real-world applications demand higher accuracy than state-of-the-art algorithms achieve, so the most uncertain predictions must be passed to domain experts for further decisions (Zhang et al., 2019). To efficiently leverage the limited human resources, it is essential to calculate an uncertainty score for each model prediction, which quantifies how unconfident the prediction is. This paper aims at generating more accurate uncertainty scores through DNNs for text classification with human involvement in the testing process. This differs from active learning, which involves experts in the training process.
Though various metrics of the uncertainty score have been studied (Dong et al., 2018; Shen et al., 2019; Xiao and Wang, 2019; Kumar et al., 2019), the existing metrics directly or indirectly depend on the winning score, which is the maximum probability in a semantic vector (the softmax vector from the last layer of a DNN model) (Thulasidasan et al., 2019). Therefore, improving the Confidence of Winning Score (CWS), which describes how confidently the winning score matches the sample uncertainty and represents the accuracy of the winning score, helps improve the accuracy of the uncertainty score. To show the effect of improving CWS, this paper adopts a basic way to measure the uncertainty score: the reciprocal of the winning score (Snoek et al., 2019). However, we face two challenges in improving CWS: (1) how to reduce the effect of overconfidence of the winning score so as to boost the negative correlation between the winning score and sample uncertainty, and (2) how to generate winning scores by considering comprehensive categories of uncertainty in one model rather than only one or two categories at a time.
The overconfidence of winning scores has been neglected by most previous work in Natural Language Processing. We identify the presence of overconfidence on the training samples: because the one-hot labels set the winning scores of all training samples to 1, every training sample receives the same uncertainty score, even though their true uncertainty differs. Hence, the negative correlation between the winning scores and sample uncertainty cannot be guaranteed, which is a harmful effect of the overconfidence. This effect degrades the calculation of uncertainty scores: in the testing process, we match different predicted winning scores to different sample uncertainties under the latent assumption that the predicted winning score is negatively correlated with the sample uncertainty, but this assumption is biased by the overconfidence. To mitigate the impact of overconfidence, we generate new training sample representations with different winning scores that are negatively correlated with the sample uncertainty.
Additionally, the process generating the winning score should consider the impact of different categories of uncertainty simultaneously, while most previous works (Shen et al., 2019; Xiao and Wang, 2019; Zhang et al., 2019) only consider one or two categories of uncertainty at a time (please refer to our appendix for detailed comparisons). We assume this partial consideration decreases the CWS, and thereby the accuracy of the uncertainty score; we verify this assumption in our ablation experiments. The uncertainty of a model prediction derives from two parts: data uncertainty and model uncertainty. Data uncertainty (Rohekar et al., 2019) is further divided into two categories: epistemic uncertainty comes from a lack of knowledge, such as few training data or out-of-distribution testing data; aleatoric uncertainty is caused by noisy data in the generation of both training and testing data. Model uncertainty also has two categories: parametric uncertainty comes from the different possible parameter values when estimating model parameters under the current model structure and training data; structural uncertainty concerns whether the current model design (e.g., layers, loss functions) is reasonable or sufficient for the current task and training data. Since addressing structural uncertainty requires extremely high computation, such as Neural Architecture Search (NAS) (Zoph and Le, 2016; Xie et al., 2019), we only reduce or scale the other three categories of uncertainty simultaneously to improve the CWS.
To address the above two challenges, we propose a model called MSD, which is named after the initials of its components ("Mix-up", "Self-ensembling", and "Distinctiveness score") and aims at handling overconfidence and various categories of uncertainty with flexibility. The flexibility means that MSD is effective on different DNN models (Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Transformer (Vaswani et al., 2017)), and that each component in MSD is independent and can be arbitrarily assembled. The main contributions of our work can be summarized as follows:

(1) Reducing the impact of overconfidence. To reduce the impact of overconfidence in calculating uncertainty scores, we apply mix-up to generate new sample representations, boosting the negative correlation between the winning scores and sample uncertainty.

(2) Considering various uncertainty comprehensively. We propose MSD with three components to handle epistemic, aleatoric, and parametric uncertainty simultaneously, so that the uncertainty score is more accurate.

(3) Designing MSD for flexibility. MSD can be applied with different DNNs (CNN, RNN, and Transformer). Each component in MSD can be assembled with the others arbitrarily due to their independence.

(4) Implementing extensive experiments. We evaluate MSD by the improvement of text classification accuracy when simulating human involvement. Experiments with an ablation setting on four datasets achieve competitive results, demonstrating that MSD generates more accurate uncertainty scores.

Related work
Methods mitigating uncertainty: One main approach to mitigating uncertainty is the Bayesian Neural Network (BNN) (Klein et al., 2017), a neural network with a prior distribution on its weights. Based on BNNs, variational Bayesian inference finds an approximate distribution of parameters for the true parameter distribution via the Kullback-Leibler (KL) divergence (Xiao and Wang, 2019; Wen et al., 2018; Louizos and Welling, 2017; Malinin and Gales, 2019). Further, Monte Carlo dropout was proposed as an approximation of variational Bayesian inference (Gal and Ghahramani, 2016; Kendall and Gal, 2017): a model is trained with dropout before every layer, and dropout is also performed in the testing process to derive results from different sampled parameter sets. An approximation of Monte Carlo dropout adds dropout only before the last layer (Riquelme et al., 2018; Snoek et al., 2019). Besides BNNs, noise injection is the other main technique for mitigating uncertainty. It has two categories: parameter noise injection adds noise perturbation to network weights (Plappert et al., 2017), while data noise injection inputs noise perturbation directly into the data (Dong et al., 2018).
Metrics scaling uncertainty: Many uncertainty-score metrics are based on softmax vectors. As an important element of the softmax vector, the winning score is proposed in (Hendrycks and Gimpel, 2016). Temperature scaling (Guo et al., 2017) obtains calibrated probabilities by adding a scalar parameter to each class when calculating the softmax vector. Applying the winning score as prediction confidence is proposed in (Niculescu-Mizil and Caruana, 2005; Guo et al., 2017); this confidence is further applied in the Expected Calibration Error (Naeini et al., 2015), the absolute difference between the accuracy and confidence of results. Besides, the Overconfidence Error applies the winning score as confidence and penalizes samples whose confidence exceeds their accuracy (Thulasidasan et al., 2019). In addition, four metrics for result confidence have been proposed by combining the expectation and variance of predictions from different sampled parameter sets. Cross-entropy is also applied to calculate uncertainty scores via dropout sampling and bin counting in (Zhang et al., 2019), which likewise considers text classification with human involvement. Different from previous works, we improve the accuracy of the uncertainty score by reducing the effect of overconfidence and considering three categories of uncertainty simultaneously.

Basic Text Classification Model
In the traditional text classification model (Zhang et al., 2019; Shen et al., 2018), given an original text, we apply preprocessing (tokenization, lemmatization, etc.) to obtain its tokens as discrete numbers. Then, a pre-trained token embedding, such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), is applied as a projector. After that, a sequence of dense vectors for the i-th text, Z_i = [z_{i1}, z_{i2}, ..., z_{in}], is derived from the embedding, where z_{ij} is the embedding of the j-th word. Z_i is fed to a sequence model f, such as a CNN or RNN. We then get the i-th text representation x_i from the penultimate layer of f with dropout, and the predicted semantic vector y_i = [y_{i1}, y_{i2}, ..., y_{ic}] from the last layer of f, where c is the number of classes and y_{ij} is the probability that the i-th text belongs to the j-th class. Finally, f is trained with the cross-entropy loss between the predicted semantic vector y_i and the one-hot label ŷ_i = [ŷ_{i1}, ŷ_{i2}, ..., ŷ_{ic}]:

L_{CE} = -\sum_{j=1}^{c} \hat{y}_{ij} \log(y_{ij}).   (1)

In the testing process, the uncertainty score U is formulated as

U_i = 1 / \max(y_i^*),   (2)

where y_i^* is the semantic vector of the i-th testing sample and \max(y_i^*) is its winning score. Then, U conveys the uncertainty of the model result.
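For concreteness, here is a minimal sketch of the baseline uncertainty score in Eq. 2, assuming `model` is any DNN classifier whose last layer outputs logits; the function name and setup are illustrative, not from the paper's released code.

```python
import torch
import torch.nn.functional as F

def winning_score_uncertainty(model, x):
    """Baseline uncertainty: U_i = 1 / max(y_i) per sample (Eq. 2)."""
    model.eval()
    with torch.no_grad():
        logits = model(x)                   # (batch, num_classes)
        probs = F.softmax(logits, dim=-1)   # semantic vectors y_i
        winning = probs.max(dim=-1).values  # winning score per sample
    return 1.0 / winning                    # larger value = more uncertain
```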

Overview Of MSD
Fig. 1 illustrates the training process of our model. In the first row, after preprocessing the training text, we calculate the text representations, which are the output of the penultimate layer with dropout. Then, we mix these representations at the batch level. The mix-up-generated representations are fed into a fully connected (FC) layer to obtain the final semantic vectors. In the second row, we apply another model, which implements self-ensembling, with independently optimized parameters but the same structure as the one in the first row.

[Figure 1: Training process of MSD. Orange, green, and blue arrows represent the data flow of the first (default) model, the second model, and the labels, respectively. Since self-ensembling is optional, it is illustrated with dotted lines. The distinctiveness score is not shown, since it is applied in the testing process. The numbers shown in y, ỹ, and ŷ are probabilities of the semantic vectors.]

In the testing process, besides computing the reciprocals of the winning scores with the dropout mechanism, a distinctiveness score is also calculated via the Mahalanobis distance between the testing samples and the distributions of the training samples. Finally, the uncertainty score is calculated by adding the reciprocal of the winning score and the distinctiveness score.

MSD Training: Mix-up
Since the overconfidence is caused by training samples sharing the same winning score due to one-hot labels, and adding noise perturbation in the training process is a way to mitigate aleatoric uncertainty, we apply mix-up (Zhang et al., 2017; Thulasidasan et al., 2019) to jointly address the two issues. Mix-up generates new sample representations with various winning scores.
Concretely, we have the i-th sample representation x_i from the penultimate layer of f with dropout. In a batch, we randomly mix the i-th and j-th samples' representations (x_i and x_j) and one-hot labels (ŷ_i and ŷ_j) to get a mix-up sample representation x̃ and ground-truth label ỹ. We formulate mix-up as:

\tilde{x} = \alpha x_i + (1 - \alpha) x_j,   (3)

\tilde{y} = \alpha \hat{y}_i + (1 - \alpha) \hat{y}_j,   (4)
where α is a random number ranging from Ω to 1.00. Ω is set above 0.5, so the i-th sample's semantics dominate x̃, whose class is therefore regarded as that of x_i in MSD. Since both the differences among winning scores and the negative correlation between the winning scores and sample uncertainty are essential to reduce the impact of overconfidence, we analyze the two factors below. Difference: Since 1 ≥ α ≥ Ω > 0.5, ỹ has a winning score of α if x_i and x_j have different classes, or of 1 if the two samples share the same class. Then, first, α or 1 is randomly chosen; second, the specific value of α is randomly sampled. Thus, mix-up yields different winning-score values across the training samples.
Negative correlation: Since x̃ includes the i-th sample's representation x_i with ratio α > 0.5, x_j can be regarded as noise on x̃. In one scenario, where x_i and x_j have different classes, x_j has an obvious noise effect on x_i, because the two samples come from different semantic distributions. In this case, a greater α means x̃ contains less noise from a different semantic distribution; x̃ is then less adulterated and belongs to the class of x_i with higher confidence. Since the winning score equals α here, by transitivity, a higher winning score means a less adulterated x̃, i.e., less uncertainty. In the other scenario, where x_i and x_j belong to the same class, we assume x_j contributes no noise to x_i, because the two samples follow the same distribution due to their shared semantics. Thus, x̃ is the least uncertain sample of its class, and its winning score reaches the highest value of 1. Hence, mix-up boosts the negative correlation.
After the mix-up, we feed x̃ rather than x_i to the FC layer to obtain its predicted semantic vector y. However, we do not use the cross-entropy loss (Eq. 1) in MSD, because it pushes the winning scores toward 1 with no limit on the upper bound, which cannot ensure the negative correlation. Instead, we use a KL divergence loss as one of our loss functions:

L_{KL} = \mathrm{KL}(\tilde{y} \,\|\, y),   (5)

because ỹ provides both upper-bound and lower-bound limitations through its non-zero element(s). The elements of ỹ are random values in each batch and each epoch. The overconfidence is thus reduced by mix-up through the difference and negative-correlation properties. Besides, x_j can be regarded as random noise perturbation, so the aleatoric uncertainty is also mitigated.
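Below is a minimal PyTorch sketch of this component under our reading of Eqs. 3-5: representations and one-hot labels are mixed within a batch against a random permutation, and the prediction on the mixed representation is trained with a KL loss against the mixed soft label. All names and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def mixup_batch(reps, onehot_labels, omega=0.75):
    """Mix penultimate-layer representations and labels (Eqs. 3-4)."""
    perm = torch.randperm(reps.size(0), device=reps.device)  # partner j
    # alpha ~ Uniform[omega, 1], with omega > 0.5 so sample i dominates
    alpha = torch.empty(reps.size(0), 1, device=reps.device).uniform_(omega, 1.0)
    mixed_reps = alpha * reps + (1.0 - alpha) * reps[perm]
    mixed_labels = alpha * onehot_labels + (1.0 - alpha) * onehot_labels[perm]
    return mixed_reps, mixed_labels

def mixup_kl_loss(pred_logits, mixed_labels):
    """KL divergence between mixed soft labels and predictions (Eq. 5)."""
    log_probs = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_probs, mixed_labels, reduction="batchmean")
```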

MSD Training: Self-ensembling
The parametric uncertainty comes from different sets of weights achieving similar training losses. Although dropout can mitigate parametric uncertainty, previous works ignore the effect of self-ensembling (Laine and Aila, 2016; Park et al., 2018), which can boost model combination and further decrease parametric uncertainty. We assume dropout reduces the parametric uncertainty via the loss generated within a model, while self-ensembling reduces it via the loss generated between models. The between-model loss helps stabilize the model weights, because it provides extra constraints beyond the within-model loss, which reduces the feasible weight sets. In addition, the designed component should aim at mitigating parametric uncertainty while having little impact on model performance. Considering that self-ensembling calculates the loss between identical models, which affects model robustness more than model performance, we apply self-ensembling in addition to dropout to further mitigate the parametric uncertainty. We construct another model with the same framework (e.g., layers, loss functions, dropout rate), and apply a self-ensemble loss L_SE to minimize the difference between the outputs of the two models with the same framework and inputs (the first model is our default model; the second model is required only when we apply self-ensembling; they are shown in the first and second rows of Fig. 1, respectively):

L_{SE} = D[f(Z_i; \theta_1, \phi_1), f(Z_i; \theta_2, \phi_2)],   (6)

where θ_a is the parameter set of the a-th model, φ_a represents the randomly sampled dropout neurons in the neural network f, and D[y^{(1)}, y^{(2)}] is a metric between two semantic vectors; we use the Mean Squared Error (MSE) for D. Although we already have the loss L_SE, we add a KL divergence loss L_{KL2} for the second model to keep the same setting; L_{KL2} is the same as Eq. 5, with the predicted semantic vector coming from the second model. We formulate the MSD loss function L_MSD as:

L_{MSD} = L_{KL1} + \lambda_1 L_{SE} + \lambda_2 L_{KL2},   (7)

where λ_1 and λ_2 equal 1 and a positive value, respectively, when we apply self-ensembling; otherwise, both equal 0. Together, the parametric uncertainty is further reduced.
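The combined objective of Eq. 7 could then look like the sketch below, assuming two same-architecture models have produced `logits_1` and `logits_2` for the same mixed inputs; the default hyperparameter values are placeholders.

```python
import torch.nn.functional as F

def msd_loss(logits_1, logits_2, mixed_labels, lambda_1=1.0, lambda_2=0.1):
    """L_MSD = L_KL1 + lambda_1 * L_SE + lambda_2 * L_KL2 (Eq. 7)."""
    l_kl1 = F.kl_div(F.log_softmax(logits_1, dim=-1), mixed_labels,
                     reduction="batchmean")
    l_kl2 = F.kl_div(F.log_softmax(logits_2, dim=-1), mixed_labels,
                     reduction="batchmean")
    # Self-ensemble loss: D = MSE between the two semantic vectors (Eq. 6)
    l_se = F.mse_loss(F.softmax(logits_1, dim=-1),
                      F.softmax(logits_2, dim=-1))
    return l_kl1 + lambda_1 * l_se + lambda_2 * l_kl2
```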

MSD Testing: Distinctiveness Score
We also consider the epistemic uncertainty. Since out-of-distribution testing samples are known sources of epistemic uncertainty, the epistemic uncertainty reflects the distinctiveness between the testing and training texts. However, it is not easy to account for this distinctiveness in the training process, because the training process is not aware of the distributions of the testing samples. Therefore, we assume each class-level distribution of the training data can be modeled as a multivariate Gaussian distribution, and we take the distance between a testing sample and each class-level Gaussian distribution as one part of the distinctiveness score. Motivated by (Lee et al., 2018), we apply the Mahalanobis distance:

m_{is} = \sqrt{(x_i^* - \mu_s)^{\top} \Sigma^{-1} (x_i^* - \mu_s)},   (8)

where x_i^* is the representation of the i-th testing sample in the first model without mix-up, μ_s is the mean of the representations of all training samples belonging to the s-th class, and Σ^{-1} is the inverse of the covariance of all training samples. We do not apply class-level covariances, to avoid singular matrices. After we obtain the Mahalanobis distances m_i = [m_{i1}, m_{i2}, ..., m_{ic}] of the i-th testing sample to each class-conditional Gaussian distribution, we also have a predicted class from this view, namely the class with the smallest distance in m_i. Based on this, we design a penalty p as the other part of the distinctiveness score, which is not considered in (Lee et al., 2018):

p_i = \xi \cdot \mathbb{1}[r_m \neq r_y],   (9)

where r_m is the result classified by m_i, r_y is the class with the maximum probability in the predicted semantic vector y_i^*, and ξ is a constant set to 10 in our work. Our distinctiveness score d_i is:

d_i = \beta_1 \log(\min_s m_{is}) + \beta_2 p_i,   (10)

where log is the logarithm to base 10, and β_1 and β_2 are constants, both set to 1. Thus, the epistemic uncertainty is scaled into the uncertainty score to improve its accuracy. This component improves CWS indirectly, because it remedies CWS for missing the epistemic uncertainty in the training process.
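A sketch of Eqs. 8-10 for a single test representation, assuming the per-class means and the shared inverse covariance have been precomputed from the training representations; the names and tensor shapes are our assumptions.

```python
import torch

def distinctiveness_score(x_test, class_means, cov_inv, pred_class,
                          xi=10.0, beta_1=1.0, beta_2=1.0):
    """d_i = beta_1 * log10(min_s m_is) + beta_2 * p_i (Eqs. 8-10).
    x_test: (dim,); class_means: (c, dim); cov_inv: (dim, dim)."""
    diff = class_means - x_test                  # (c, dim)
    # Mahalanobis distance to each class-conditional Gaussian (Eq. 8)
    m = torch.einsum("cd,de,ce->c", diff, cov_inv, diff).clamp(min=0).sqrt()
    r_m = int(m.argmin())                        # class nearest to x_test
    penalty = xi if r_m != pred_class else 0.0   # Eq. 9
    return beta_1 * torch.log10(m.min()) + beta_2 * penalty
```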

MSD Testing: Uncertainty Score
After we have trained our model with mix-up and self-ensembling, the winning scores have higher confidence and accuracy due to the reduced overconfidence, aleatoric uncertainty, and parametric uncertainty. Regardless of whether we use self-ensembling, we only apply the first model to calculate the mean of the predicted semantic vectors ȳ_i^* with the dropout mechanism. Concretely, given a testing sample x_i^*, we obtain k different predicted semantic vectors y_{i1}^*, y_{i2}^*, ..., y_{ik}^* by k tryouts with the same dropout rate, and ȳ_i^* is the mean of these k vectors. The maximum probability in ȳ_i^* is our winning score. Besides training for more confident winning scores, we also scale the distinctiveness score d_i to measure the impact of epistemic uncertainty. We calculate our final uncertainty score U as:

U_i = \gamma_1 / \max(\bar{y}_i^*) + \gamma_2 d_i,   (11)

where γ_1 and γ_2 are constants.
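Putting the pieces together, a sketch of Eq. 11 with Monte Carlo dropout, assuming `model` keeps its dropout layers active in train mode and `d_scores` holds the distinctiveness scores from the previous section.

```python
import torch
import torch.nn.functional as F

def msd_uncertainty(model, x, d_scores, k=10, gamma_1=1.0, gamma_2=1.0):
    """U_i = gamma_1 / max(mean probs) + gamma_2 * d_i (Eq. 11)."""
    model.train()  # keep dropout active at test time (MC dropout)
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(k)])
    mean_probs = probs.mean(dim=0)           # mean semantic vector over k tryouts
    winning = mean_probs.max(dim=-1).values  # winning score
    return gamma_1 / winning + gamma_2 * d_scores
```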

Experiments
Focusing on text classification with human involvement, we evaluate the performance of MSD on four real-world datasets. Sec. 4.1 gives an overview of our experimental settings. Sec. 4.2 compares the performance of MSD against the state-of-the-art methods, and analyzes the results of the ablation experiments and the parameter sensitivity analysis.

Experimental Setup
We apply GloVe embeddings (Pennington et al., 2014), pre-trained with a dimension of 200, as our word embedding by default. For the CNN model, we train MSD with a 3-layer CNN as the sequence model by default, with a batch size of 32, momentum of 0.9, an initial learning rate of 0.001 with Adam (Kingma and Ba, 2014), kernel sizes of 3, 4, and 5 for the three layers, respectively, and a dropout rate of 0.3. For the RNN model, Bidirectional Gated Recurrent Units (BiGRU) (Jabreel et al., 2018) with two hidden layers are applied as an example. For the Transformer, we apply XLnet as an example.
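For reference, the default CNN setup above can be restated as a configuration sketch; the key names are our own, and "momentum" is kept as stated in the text.

```python
# Default CNN configuration, restated from the text above (illustrative).
cnn_config = {
    "embedding": "GloVe, 200-dimensional, pre-trained",
    "sequence_model": "3-layer CNN",
    "kernel_sizes": [3, 4, 5],      # one per layer
    "batch_size": 32,
    "optimizer": "Adam",
    "initial_learning_rate": 1e-3,
    "momentum": 0.9,
    "dropout_rate": 0.3,
}
```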

Datasets
The four real-world datasets used in our experiments are as follows:
(1) 20 Newsgroups (20News) (Lang, 1995) includes 20 different news categories with 20,000 documents.
(2) Amazon Reviews (Amazon) (McAuley and Leskovec, 2013) is a collection of reviews from Amazon from May 1996 to July 2013. For better comparison, we use data from the Sports and Outdoors category, the same as (Zhang et al., 2019). This dataset has 272,630 text samples with sentiment rating labels from 1 to 5.
(3) IMDb Reviews (IMDb) has binary sentiment ratings for 50,000 popular movie reviews.
(4) Yelp Reviews (Yelp) (Zhang et al., 2015) is a collection with sentiment rating labels from 1 to 5. It has two parts: the first part has 130,000 samples for each rating; the second part has 10,000 samples for each rating.
For the first three datasets, we apply the same split setting as (Zhang et al., 2019): for each dataset, 70% of the samples form the training set, 10% form the validation set, and the remaining 20% form the testing set. For the Yelp dataset, we randomly choose 9,000 samples per label from the second part as the training set and the remaining 1,000 samples per label as the validation set, while all samples in the first part form the testing set.

Metrics
To evaluate the performance improvement of text classification with human involvement, which reflects the accuracy of the uncertainty scores, we measure classification performance at different elimination ratios. Concretely, for a testing set S with q samples and an elimination ratio r, we remove the most uncertain samples S_r from S based on the uncertainty score ranking, where S_r has r × q samples. The more accurate the uncertainty scores, the more misclassified samples are removed at the same r; thus, a model that generates more accurate uncertainty scores yields higher F1 scores on the remaining testing samples at the same r. Because the uncertainty score is more crucial for semantics with fewer training samples (e.g., "patient data samples" versus "data for the healthy" in disease detection), we report the macro F1 score over the remaining testing samples at the different elimination ratios.
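A sketch of this evaluation protocol, assuming scikit-learn is available; the array names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_at_elimination_ratio(y_true, y_pred, uncertainty, r):
    """Macro F1 over the test set after removing the r*q most uncertain
    samples, i.e. those handed to (simulated) human experts."""
    q = len(y_true)
    keep = np.argsort(uncertainty)[: int(round((1.0 - r) * q))]  # least uncertain
    return f1_score(np.asarray(y_true)[keep], np.asarray(y_pred)[keep],
                    average="macro")
```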

Baselines and Ablation Setting
We compare MSD with the state-of-the-art method that achieves superior improvements of F1 scores in text classification with human involvement (Zhang et al., 2019). It proposes two methods: Dropout-Entropy (DE), a dropout-entropy based model, and DE+Metric, the DE model combined with metric learning. As for MSD, we divide it into three sub-models for the ablation study: MSD1 is a sub-model with only the mix-up component; MSD2-a (abbreviated as MSD2) has two components, mix-up and self-ensembling, by default; to show the flexibility of MSD, we also design MSD2-b, which has the two components mix-up and distinctiveness score; and MSD3 has all three components. Tables 1, 2, 3, and 4 report the F1 score improvements in text classification at various elimination ratios (10%, 20%, 30%, 40%) for the CNN model. The improved ratios of F1 scores compared with no uncertainty elimination (the 0% column) are shown after the F1 scores. The parameter settings of Ω, λ_2, γ_1, and γ_2 are given after each MSD in order. Three datasets (20 Newsgroups, Amazon, Yelp) are compared in macro F1 scores, except IMDb, which uses the weighted F1 score for better comparison with (Zhang et al., 2019). From the tables, we conclude as below.

Results of CNN model
1) Better F1 score values: MSDs (MSD1, MSD2, MSD3) improve the F1 scores when certain portions of the most uncertain samples are eliminated. Especially for the Amazon dataset, DE and DE+Metric both show negative growth as more uncertain samples are removed at increasing elimination ratios. This shows that the accuracy of the uncertainty scores produced by DE and DE+Metric is low on Amazon, while MSDs achieve significant increases in F1 when the most uncertain samples are eliminated, such as a 26.64% increase at 40% elimination. On 20News, MSDs achieve slightly lower F1 scores than DE+Metric, although slightly higher ones than DE. This is caused by the obvious differences among texts with various semantics in 20News, which weaken the influence of uncertainty, so MSD is less effective on 20News.
2) Better improved ratios of F1 scores: If the uncertainty scores are more accurate, higher improvements in the ratios of F1 scores are also achieved. Compared with DE and DE+Metric, MSDs always achieve better improved ratios of F1. Thus, MSDs generate more accurate uncertainty scores. In particular, though MSD2 has a lower F1 score than DE+Metric at 0% elimination on IMDb, it still achieves a higher F1 score at 40% elimination. Plus, though the F1 scores of MSDs are not higher than DE+Metric's on 20News, MSDs achieve higher improved ratios of F1 scores. Thus, MSD remains competitive with the baselines on 20News.
3) Effectiveness of each component by ablation setting: Our proposed three components can be applied independently, and combining them further improves the accuracy of the uncertainty scores in most situations, which shows the effect of comprehensive consideration of uncertainty. On the 20News and Yelp datasets, when one or two components are added, we find consistent increases in F1 scores from MSD1 to MSD2, and from MSD2 to MSD3. Though MSD3 does not achieve consistently higher improvements at various elimination ratios on the IMDb and Amazon datasets, the performance of MSD2 is consistently higher than MSD1's, which shows the effectiveness of self-ensembling in reducing the influence of uncertainty. Besides, MSD3 achieves higher improvements of F1 scores than MSD2 at some elimination ratios on the IMDb and Amazon datasets. We explain this as follows: the out-of-distribution testing texts are not distributed evenly across the various elimination ratios.

Results of Transformer and RNN model
Tables 5 and 6 report the F1 score improvements in text classification at various elimination ratios (10%, 20%, 30%, 40%) for BiGRU and XLnet, respectively. We conclude as below.

1) Higher performance in macro F1 by MSD3: From Tables 5 and 6, MSD3 achieves higher improved ratios of F1 scores at different elimination ratios. Though MSD2-b has higher F1 scores at elimination ratios 10% and 20%, the remaining F1 scores of MSD3 in the two tables are still the highest at each elimination ratio. The superior improvement of both the F1 scores and the ratios of F1 scores shows the joint effect of the three components. Furthermore, the results of MSD2-b and MSD3 show the effect of the distinctiveness score on macro F1 for Amazon, which has an imbalanced data distribution. Besides, though MSD2-a performs poorly with mix-up and self-ensembling, this performance is reasonable: because XLnet is a pre-trained model, only the parameters of two FC layers are trained, which admits far fewer feasible parameter solutions than the CNN and RNN models. Thus, a further decrease of the feasible parameter solutions brings a negative effect in this case.
2) Flexibility of MSD: From Tables 5 and 6 for the RNN and Transformer, respectively, as well as Tables 1, 2, 3, and 4 for the CNN model, we observe the competitive performance of MSD in text classification F1 scores compared with the two baselines. This verifies that MSD can be effectively assembled with different DNNs (CNN, RNN, and Transformer). Besides, the ablation setting of MSD1, MSD2-a, MSD2-b, and MSD3 shows that the three components in MSD can be assembled arbitrarily based on the characteristics of the datasets.
Parameter Sensitivity Analysis

1) Parameters for mix-up: The left panel of Fig. 2 shows the effect of different Ω, where Ω = 0.999999 approximates no mix-up. From the panel, we find: (1) the F1 scores are slightly sensitive to different Ω, while the improved ratios of F1 scores are not sensitive to changes in Ω. (2) For 20News, when Ω = 0.75, the macro F1 scores are the highest at the different elimination ratios and exceed those at Ω = 0.999999, which shows the effectiveness of mix-up in improving the accuracy of the uncertainty score.

2) Parameters for self-ensembling: The impact of the self-ensembling parameter λ_2 is shown in the middle panel of Fig. 2, which shows: (1) the F1 scores and their improved ratios at various elimination ratios are significantly sensitive to λ_2, especially when λ_2 is greater than 1. (2) For the Amazon dataset, the macro F1 scores are highest when λ_2 = 0.1 rather than λ_2 = 0.01, which again verifies the effectiveness of self-ensembling in improving the accuracy of the uncertainty score.

Conclusion
We aim at generating more accurate uncertainty scores to improve the performance of text classification with human involvement. We propose MSD, with three independent components, to improve the CWS by mitigating the effect of overconfidence and handling the impact of three categories of uncertainty. MSD can be applied to various DNNs (CNN, RNN, and Transformer), and each component in MSD can be arbitrarily assembled. Extensive experiments on four real-world datasets demonstrate that MSD obtains more accurate uncertainty scores and superior improvements in classification performance when a portion of the most uncertain predictions is assigned to simulated experts.