Uncertainty Modeling for Machine Comprehension Systems using Efficient Bayesian Neural Networks

While neural approaches have achieved significant improvements in machine comprehension tasks, models often operate as a black box, limiting interpretability, which requires special attention in domains such as healthcare or education. Quantifying uncertainty helps pave the way towards more interpretable neural networks. In classification and regression tasks, Bayesian neural networks have been effective in estimating model uncertainty. However, inference time increases linearly due to the sampling process required in Bayesian neural networks; speed thus becomes a bottleneck in tasks with high system complexity such as question answering or dialogue generation. In this work, we propose a hybrid neural architecture that quantifies model uncertainty via Bayesian weight approximation while boosting inference speed by 80% relative at test time, and we apply it to a clinical dialogue comprehension task. The proposed approach also enables active learning, so that an updated model can be trained more effectively on new incoming data by selecting samples that are not well-represented in the current training scheme.


Introduction
Neural approaches demonstrate strong learning capability, achieve significant improvements in various natural language processing tasks (Devlin et al., 2019), and are increasingly applied in real-world applications (Du et al., 2019; J Kurisinkel and Chen, 2019). However, neural models typically operate as black-box functions and thus lack interpretability. Interpreting (even if only partially) the output of neural models is important in domains such as healthcare, where a model may make an incorrect diagnosis (Settles, 2012). To tackle this issue, one approach is to evaluate the confidence of an output generated by a model for a given input, i.e., to quantify model uncertainty, or epistemic uncertainty. When the model's uncertainty measure is high, one could be prompted to intervene in the automated decision process, either overriding the system's decision or escalating the situation to a domain expert. This approach would also benefit model training on new incoming streams of data that may be ill-represented in the current setting.
Different from the label probability produced by models, epistemic uncertainty is derived from the weight variance under the observation of a certain distribution. Bayesian neural networks (BNNs) (Denker and LeCun, 1990; Buntine and Weigend, 1991), in which prior distributions are applied as additional constraints on weights, have been shown to be effective for quantifying epistemic uncertainty. Instead of obtaining deterministic weights, Bayesian methods update weights via distribution-based estimation. Therefore, one can sample different possible weights, forward inputs through the network multiple times, and then obtain epistemic uncertainty from the variance of the resulting set of predictions. Moreover, drawing on past work, modeling within a Bayesian framework can lead to potentially better representations and predictions in various tasks (Xiao and Wang, 2019).
Most previous studies applied Bayesian neural approaches to classification or regression tasks. In this paper, we focus on modeling and utilizing epistemic uncertainty for question-answering (QA) systems in natural language processing, and on tackling the aforementioned issues in the healthcare domain. Since neural network architectures for QA tasks are relatively more complex, there is a need to balance learning quality and inference speed. To this end, we propose a hybrid neural architecture that integrates Bayesian approximation into a base neural model, and we optimize its training strategy. We conduct experiments on a clinical conversational scenario in Section 4.1 and a question-answering benchmark dataset (see Appendix B). The results show that our approach achieves better performance while being capable of modeling epistemic uncertainty. Furthermore, we analyze the characteristics of the quantified uncertainties and conduct an active learning experiment on the clinical corpus.

In Relation to Other Work
Neural question-answering approaches, often applied to machine comprehension tasks, have achieved rapid progress lately, benefiting from large-scale corpora (Rajpurkar et al., 2016), semantic vector representations (Pennington et al., 2014), sophisticated neural architectures (Seo et al., 2017), and deep contextual language models (Devlin et al., 2019), pushing the state-of-the-art performance on various benchmarks. However, the extent to which these systems truly understand language remains unclear, and models are vulnerable to adversarial samples (Jia and Liang, 2017).
As an effective approach to model weight variance and generate predictions, Bayesian neural networks and their variants have been applied in computer vision for image classification, and in autonomous vehicles to better model safety (McAllister et al., 2017). In natural language processing, uncertainty modeling has been adopted in sentiment analysis, named entity recognition and language modeling (Xiao and Wang, 2019). Such approaches have also proved effective for domain-specific active learning, e.g., in named entity recognition (Shen et al., 2017). To the best of our knowledge, we are the first to introduce neural epistemic uncertainty modeling to question-answering tasks.
Making Bayesian neural networks tractable on large-scale practical problems has been a research focus since the 1990s (Denker and LeCun, 1990; Hinton and Van Camp, 1993; Barber and Bishop, 1998). More recently, several approximation methods have been proposed, including Bayes-by-Backprop (Blundell et al., 2015), which places a prior distribution over model parameters and minimizes the Kullback-Leibler (KL) divergence between the approximate and true posterior distributions, and Monte-Carlo Dropout (Gal and Ghahramani, 2016), which applies dropout at both training and inference stages to approximate Bayesian variational inference. Sampling from an approximated posterior distribution using gradient uncertainty can also be used to represent uncertainty in predictions (Park et al., 2018).

Modeling Epistemic Uncertainty with Bayesian Neural Networks
A traditional neural model f_W(.) with a specific network architecture f(.) learns and optimizes weights W by point estimation, so the inference process is deterministic. However, in practice there is a degree of uncertainty associated with the weights (epistemic uncertainty), which can be modeled by representing the weights W as a distribution. To this end, Bayesian neural networks aim to estimate the posterior distribution of W given the observed data D. Here, the posterior is denoted as p(W|D), and once it is estimated, the prediction for an input x is generated by marginalizing over the posterior:

p(y | x, D) = ∫ p(y | x, W) p(W | D) dW    (1)

However, the exact solution is intractable, so variational inference (Graves, 2011) is used to estimate the true posterior p(W|D) with an approximation q(W) parametrized by θ. This approximation is typically obtained by minimizing the KL divergence between the two distributions, which can be performed with Bayes-by-Backprop (Blundell et al., 2015) or Monte-Carlo Dropout (Gal and Ghahramani, 2016). At the inference stage, we can draw weights from the approximated posterior, W ∼ q(W).
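As a concrete illustration, Monte-Carlo Dropout keeps dropout active at inference, so each forward pass implicitly draws a different weight configuration from q(W). The following numpy sketch (a hypothetical toy network with illustrative sizes, not the paper's model) shows posterior sampling and the resulting marginalized prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16)) * 0.5   # toy weights standing in for a trained net
W2 = rng.normal(size=(16, 3)) * 0.5

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, p_drop=0.5):
    h = np.maximum(x @ W1, 0.0)              # ReLU features
    mask = rng.random(h.shape) > p_drop      # fresh dropout mask on every pass
    h = h * mask / (1.0 - p_drop)            # inverted-dropout scaling
    return softmax(h @ W2)

x = rng.normal(size=(1, 8))
samples = np.stack([forward(x) for _ in range(100)])  # m = 100 posterior draws
mean_pred = samples.mean(axis=0)   # MC estimate of the marginalized prediction
epistemic = samples.var(axis=0)    # weight-induced variance per class
```

Averaging the sampled predictions approximates the marginalization integral, while their variance provides the epistemic uncertainty used in the next subsection.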
With the approximated weight distribution, we can employ a weight sampling scheme to represent epistemic uncertainty. In our question-answering setting, since the final output is generated as an answer span (Wang and Jiang, 2016) and the prediction is a classification task pointing to positions in the input sequence (Vinyals et al., 2015), we quantify the uncertainty at the start and end positions respectively. More specifically, we conduct Monte-Carlo (MC) integration by repeating the inference process m times; the answer span is then selected based on the mean of all sampled predictions. For an input text x_i, we denote the probability of the answer span starting at token t as p_{i,t}. The epistemic uncertainty of the answer span's starting token is then quantified as:

U_start(x_i) = (1/m) Σ_{j=1}^{m} (p^j_{i,t} − E[p_{i,t}])²    (3)

where x_i is the text sequence input, m is the number of weight samples, p^j_{i,t} is the probability produced by the j-th sample, and E denotes the expectation over all m predictions. The final uncertainty output is the sum of the uncertainties at both ends of the predicted answer span.
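A minimal numpy sketch of this span-level uncertainty (a hypothetical helper, assuming the m sampled start/end probability vectors are already collected): the variance at each position is computed across samples, and the variances at the two span boundaries are summed.

```python
import numpy as np

def span_uncertainty(p_start, p_end):
    """p_start, p_end: (m, seq_len) arrays of start/end probabilities
    collected from m Monte-Carlo forward passes."""
    mean_s, mean_e = p_start.mean(axis=0), p_end.mean(axis=0)
    t_s, t_e = int(mean_s.argmax()), int(mean_e.argmax())   # span from mean prediction
    var_s = ((p_start - mean_s) ** 2).mean(axis=0)  # (1/m) sum_j (p^j - E[p])^2
    var_e = ((p_end - mean_e) ** 2).mean(axis=0)
    return var_s[t_s] + var_e[t_e], (t_s, t_e)
```

When all samples agree exactly, the returned uncertainty is zero; disagreement across samples at the selected boundaries raises it.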

Hybrid Neural Architecture
Since uncertainty quantification in Bayes-by-Backprop and Monte-Carlo Dropout requires multiple sampling and forward iterations, inference time increases linearly. In practical scenarios, machine comprehension and dialogue tasks often require deeper and larger neural architectures than classification or regression tasks, so the inference process becomes even more time-consuming. On the other hand, in neural language approaches, implicit linguistic features are modeled hierarchically from token and sentence to document level in a deep contextualized architecture (Clark et al., 2019), and semantic-related features at the top neural layers play an important role in the machine comprehension task. Therefore, to speed up inference without sacrificing uncertainty modeling capability, we propose a hybrid neural architecture (see Figure 1). More specifically, we split a neural network for question answering into two sub-functions: (1) a feature representation component, which is a traditional neural network, and (2) a prediction component with Bayesian approximation, in which we adopt Bayesian weight estimation. As the feature representation component produces deterministic outputs, the hybrid model only conducts weight sampling on the prediction component, thus significantly reducing inference time. Moreover, by integrating the Bayesian weight approximation of Section 3.1 into a base neural network, the hybrid model can still be trained end-to-end, and we can obtain epistemic uncertainty via Equation 3.
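The split above can be sketched as follows (a hypothetical numpy toy model, not the actual architecture): the feature component is evaluated exactly once per input and cached, and only the small Bayesian prediction head is re-sampled m times, so the per-sample cost is that of the head alone.

```python
import numpy as np

rng = np.random.default_rng(1)

def feature_component(x, W):
    """Deterministic feature extractor: evaluated exactly once per input."""
    return np.tanh(x @ W)

def bayesian_head(h, V, p_drop=0.2):
    """MC-Dropout prediction head: the only part sampled m times."""
    mask = rng.random(h.shape) > p_drop
    logits = (h * mask / (1.0 - p_drop)) @ V
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(1, 32))
W = rng.normal(size=(32, 64)) * 0.1
V = rng.normal(size=(64, 10)) * 0.1
h = feature_component(x, W)                                 # cached features
preds = np.stack([bayesian_head(h, V) for _ in range(50)])  # m = 50 head-only passes
```

Because the expensive feature pass runs once rather than m times, the total sampling cost is dominated by the lightweight head, which is the source of the reported speed-up.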

Question-Answering on Clinical Conversations
We evaluated the proposed approach on a spoken dialogue comprehension corpus consisting of nurse-to-patient symptom monitoring conversations (Liu et al., 2019). This corpus was inspired by real dialogues in the clinical setting where nurses enquire about symptoms of patients. Linguistic structures at the semantic, syntactic, discourse and pragmatic levels were abstracted from these conversations to construct templates for simulating multi-turn dialogues. The conversations cover 9 topics/symptoms (e.g., headache, cough). For each conversation, the average word count is 255 and the average number of interactive turns is 15.5. For the comprehension task, questions were raised to query different attributes of a specified symptom, e.g., How frequently did you experience chest pain? Answer spans were labeled with start and end indices. The training, validation and test sets contain 30k, 3k and 1k samples respectively.

We chose the Bi-Directional Attention Flow network (Seo et al., 2017) as the base architecture, which fuses question-aware and context-aware attention and performs competitively on various question-answering corpora. Pre-trained word embeddings from GloVe (Pennington et al., 2014) were utilized and fixed during training. Out-of-vocabulary words were replaced with the [unk] token. The hidden size and embedding dimension were set to 300, and the batch size was set to 128. During training, a validation-based early-stop strategy was applied. During prediction, we selected answer spans using the maximum product of the start and end position probabilities. In our Bayes-by-Backprop (BBB) implementation, weights in the prediction component were sampled from a mixture of two Gaussian distributions with small variances (Blundell et al., 2015) during inference for uncertainty modeling. In our Monte-Carlo Dropout (MCDO) implementation, dropout in the Bayesian approximation component remained enabled during both training and inference for uncertainty modeling.
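The span-selection rule described above (maximum product of start and end probabilities) can be sketched as a short helper; the `max_len` cap on answer length is an illustrative assumption, not a detail stated in the paper.

```python
def select_span(p_start, p_end, max_len=30):
    """Pick (s, e) with e >= s maximizing p_start[s] * p_end[e].
    max_len is a hypothetical cap on answer length for efficiency."""
    best, span = -1.0, (0, 0)
    for s in range(len(p_start)):
        for e in range(s, min(s + max_len, len(p_end))):
            score = p_start[s] * p_end[e]
            if score > best:
                best, span = score, (s, e)
    return span, best
```

In the Bayesian setting, `p_start` and `p_end` would be the mean probabilities over the Monte-Carlo samples.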
L2 weight regularization was added to the feature representation component. All models were implemented in PyTorch (Paszke et al., 2019). More details of the hyper-parameter configuration are described in Appendix A.

Table 2: Evaluation results on Bayesian-uncertainty-based active selection and random collection.
As shown in Table 1, models with uncertainty estimation components achieve higher performance than the base model, and the MCDO models perform better. Compared with applying Bayesian estimation to all weights (Pure MCDO), our Feature-and-Bayesian (FAB) model still obtains comparable results, while at the inference stage the FAB MCDO model is significantly faster than the Pure MCDO model. We then quantified the epistemic uncertainty with the MCDO models as described in Section 3.1: we set the number of Monte-Carlo samples to 100, collected all predictions, and calculated the variance of the softmax probability at the start and end positions respectively. As shown in Figure 2, the epistemic uncertainties of the two models were similar. Moreover, there was a substantial overlap (69%) when we ranked the 1k test samples by their uncertainty scores and selected the top-k ones (k=300). This indicates that we can rely on the FAB model's uncertainty output while benefiting from its shorter inference time.
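The top-k ranking overlap used above can be computed with a small helper (hypothetical; `u_a` and `u_b` would be the per-sample uncertainty scores of the two models):

```python
def topk_overlap(u_a, u_b, k=300):
    """Fraction of shared samples among the top-k most-uncertain items
    under two uncertainty scorings (e.g., FAB MCDO vs. Pure MCDO)."""
    top_a = set(sorted(range(len(u_a)), key=lambda i: -u_a[i])[:k])
    top_b = set(sorted(range(len(u_b)), key=lambda i: -u_b[i])[:k])
    return len(top_a & top_b) / k
```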

Active Learning on Clinical Conversation QA
Based on the previous result, we explore applying the proposed hybrid model to active learning. Since it is time-consuming to annotate a large number of clinical data samples from electronic health records (EHR), we aim to utilize epistemic uncertainty to identify samples that are potentially the most helpful for training (Siddhant and Lipton, 2018). To this end, we split the training set from Section 4.1 into two subsets (set A and set B) and conducted active learning in three steps: (1) we trained the FAB MCDO model on set A (15k samples); (2) we evaluated the epistemic uncertainty on all samples of set B (15k samples); (3) we selected the 5k samples in set B with the highest uncertainty scores, added them to the training set, and re-trained the model from scratch. We also randomly selected 5k samples from set B as a control. As shown in Table 2, FAB MCDO selection obtains a larger performance improvement than the random scheme, achieving 87% of the full-set training performance with 66.7% of the samples. Moreover, although exploring other Bayesian active learning methods is beyond the scope of this paper, the proposed model can also be used with other acquisition functions such as BatchBALD (Kirsch et al., 2019).
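The uncertainty-driven acquisition step can be sketched as follows (a hypothetical helper; `uncertainties` would be the per-sample epistemic uncertainties produced by the FAB MCDO model as in Section 3.1):

```python
def acquire(pool, uncertainties, budget=5000):
    """Return the indices of the `budget` pool samples with the highest
    epistemic uncertainty; these are added to the training set."""
    ranked = sorted(range(len(pool)), key=lambda i: -uncertainties[i])
    return ranked[:budget]
```

The random-collection control amounts to replacing this ranking with a uniform sample of the same budget.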

Conclusion
In this work, we defined how to quantify epistemic uncertainty in question-answering tasks. We further proposed a hybrid neural architecture that achieves performance comparable to regular Bayesian neural networks while offering greater efficiency, speeding up inference by 80% relative. The proposed approach also enables active learning for dialogue comprehension, so that an updated model can be trained more effectively on new incoming data by selecting training samples that may not be well-represented in the current training set.

A Training Configuration for Clinical QA Scenario
The hyper-parameters of the model adopted for the clinical dialogue comprehension task are shown in Table 3. Moreover, in our hybrid neural architecture, we adopt several strategies that empirically benefit performance during training: (1) in the warm-up training epochs, all weights of the hybrid architecture were updated jointly, with a warm-up learning rate of 2e-5; (2) after warm-up training, the prediction component was trained with Bayesian weight estimation, either by sampling from a mixture of two prior Gaussian distributions with σ1 = 0.05 and σ2 = 0.1 (Blundell et al., 2015) or by applying Monte-Carlo Dropout (Gal and Ghahramani, 2016), and we assigned a learning rate of 1e-3 to the prediction component while that of the feature representation component was set to 1e-4; (3) layer normalization was added in the last layer of the feature representation component, providing feature outputs with lower variance.
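The two-phase learning-rate scheme above can be sketched as a small schedule function (the warm-up epoch count is illustrative; the paper does not state it):

```python
def lr_for(epoch, component, warmup_epochs=2):
    """Learning-rate schedule sketch: joint warm-up at 2e-5 for all weights,
    then 1e-3 for the Bayesian prediction head and 1e-4 for the
    deterministic feature representation component."""
    if epoch < warmup_epochs:
        return 2e-5
    return 1e-3 if component == "prediction" else 1e-4
```

In a PyTorch implementation this would typically be realized with per-parameter-group learning rates in a single optimizer.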

B Evaluation on a Reading Comprehension Benchmark Corpus
In this section, we adapt our approach to a common question-answering benchmark corpus: SQuAD (Rajpurkar et al., 2016). Different from the domain-specific dialogue dataset, models for this benchmark benefit significantly from large-scale pre-trained contextual representations. Therefore, following our design in Section 3, here we use the pre-trained language model BERT (Devlin et al., 2019) as the feature representation component and add two linear layers with Bayesian approximation as the prediction component. We trained the "bert-base-uncased" version of BERT along with the prediction component, using separate optimizers with different learning-rate and weight-decay configurations. As shown in Table 4, the model achieves higher performance than the baseline. The uncertainty calculated on all samples of the evaluation set is shown in Figure 3.