Selective Question Answering under Domain Shift

To avoid giving wrong answers, question answering (QA) models need to know when to abstain from answering. Moreover, users often ask questions that diverge from the model’s training data, making errors more likely and thus abstention more critical. In this work, we propose the setting of selective question answering under domain shift, in which a QA model is tested on a mixture of in-domain and out-of-domain data, and must answer (i.e., not abstain on) as many questions as possible while maintaining high accuracy. Abstention policies based solely on the model’s softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs. Instead, we train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely. Crucially, the calibrator benefits from observing the model’s behavior on out-of-domain data, even if from a different domain than the test data. We combine this method with a SQuAD-trained QA model and evaluate on mixtures of SQuAD and five other QA datasets. Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model’s probabilities only answers 48% at 80% accuracy.


Introduction
[Figure 1: First, a QA model is trained only on source data. Then, a calibrator is trained to predict whether the QA model was correct on any given example. The calibrator's training data consists of both previously held-out source data and known OOD data. Finally, the combined selective QA system is tested on a mixture of test data from the source distribution and an unknown OOD distribution.]

Question answering (QA) models have achieved impressive performance when trained and tested on examples from the same dataset, but tend to perform poorly on examples that are out-of-domain (OOD) (Jia and Liang, 2017; Chen et al., 2017; Yogatama et al., 2019; Talmor and Berant, 2019; Fisch et al., 2019). Deployed QA systems in search engines and personal assistants need to gracefully handle OOD inputs, as users often ask questions that fall outside of the system's training distribution. While the ideal system would correctly answer all
OOD questions, such perfection is not attainable given limited training data (Geiger et al., 2019). Instead, we aim for a more achievable yet still challenging goal: models should abstain when they are likely to err, thus avoiding showing wrong answers to users. This general goal motivates the setting of selective prediction, in which a model outputs both a prediction and a scalar confidence, and abstains on inputs where its confidence is low (El-Yaniv and Wiener, 2010; Geifman and El-Yaniv, 2017). In this paper, we propose the setting of selective question answering under domain shift, which captures two important aspects of real-world QA: (i) test data often diverges from the training distribution, and (ii) systems must know when to abstain. We train a QA model on data from a source distribution, then evaluate selective prediction performance on a dataset that includes samples from both the source distribution and an unknown OOD distribution. This mixture simulates the likely scenario in which users only sometimes ask questions that are covered by the training distribution. While the system developer knows nothing about the unknown OOD data, we allow access to a small amount of data from a third known OOD distribution (e.g., OOD examples that they can foresee).
We first show that our setting is challenging because model softmax probabilities are unreliable estimates of confidence on out-of-domain data. Prior work has shown that a strong baseline for in-domain selective prediction is MaxProb, a method that abstains based on the probability assigned by the model to its highest-probability prediction (Hendrycks and Gimpel, 2017; Lakshminarayanan et al., 2017). We find that MaxProb gives good confidence estimates on in-domain data, but is overconfident on OOD data. Therefore, MaxProb performs poorly in mixed settings: it does not abstain enough on OOD examples, relative to in-domain examples.
We correct for MaxProb's overconfidence by using known OOD data to train a calibrator: a classifier trained to predict whether the original QA model is correct or incorrect on a given example (Platt, 1999; Zadrozny and Elkan, 2002). While prior work in NLP trains a calibrator on in-domain data (Dong et al., 2018), we show this does not generalize to unknown OOD data as well as training on a mixture of in-domain and known OOD data. Figure 1 illustrates the problem setup and how the calibrator uses known OOD data. We use a simple random forest calibrator over features derived from the input example and the model's softmax outputs.
We conduct extensive experiments using SQuAD (Rajpurkar et al., 2016) as the source distribution and five other QA datasets as different OOD distributions. We average across all 20 choices of using one as the unknown OOD dataset and another as the known OOD dataset, and test on a uniform mixture of SQuAD and unknown OOD data. On average, the trained calibrator achieves 56.1% coverage (i.e., the system answers 56.1% of test questions) while maintaining 80% accuracy on answered questions, outperforming MaxProb with the same QA model (48.2% coverage at 80% accuracy), using MaxProb and training the QA model on both SQuAD and the known OOD data (51.8% coverage), and training the calibrator only on SQuAD data (53.7% coverage).
In summary, our contributions are as follows: (1) We propose a novel setting, selective question answering under domain shift, that captures the practical necessity of knowing when to abstain on test data that differs from the training data.
(2) We show that QA models are overconfident on out-of-domain examples relative to indomain examples, which causes MaxProb to perform poorly in our setting.
(3) We show that out-of-domain data, even from a different distribution than the test data, can improve selective prediction under domain shift when used to train a calibrator.

Related Work
Our setting combines extrapolation to out-of-domain data with selective prediction. We also distinguish our setting from the tasks of identifying unanswerable questions and outlier detection.

Extrapolation to out-of-domain data
Extrapolating from training data to test data from a different distribution is an important challenge for current NLP models (Yogatama et al., 2019). Models trained on many domains may still struggle to generalize to new domains, as these may involve new types of questions or require different reasoning skills (Talmor and Berant, 2019; Fisch et al., 2019). Related work on domain adaptation also tries to generalize to new distributions, but assumes some knowledge about the test distribution, such as unlabeled examples or a few labeled examples (Blitzer et al., 2006; Daume III, 2007); we assume no such access to the test distribution, but instead make the weaker assumption of access to samples from a different OOD distribution.

Selective prediction
Selective prediction, in which a model can either predict or abstain on each test example, is a longstanding research area in machine learning (Chow, 1957; El-Yaniv and Wiener, 2010; Geifman and El-Yaniv, 2017). In NLP, Dong et al. (2018) use a calibrator to obtain better confidence estimates for semantic parsing. Rodriguez et al. (2019) use a similar approach to decide when to answer QuizBowl questions. These works focus on training and testing models on the same distribution, whereas our training and test distributions differ.
Selective prediction under domain shift. Other fields have recognized the importance of selective prediction under domain shift. In medical applications, models may be trained and tested on different groups of patients, so selective prediction is needed to avoid costly errors. In computational chemistry, Toplak et al. (2014) use selective prediction techniques to estimate the set of (possibly out-of-domain) molecules for which a reactivity classifier is reliable. To the best of our knowledge, our work is the first to study selective prediction under domain shift in NLP.
Answer validation. Traditional pipelined systems for open-domain QA often have dedicated systems for answer validation-judging whether a proposed answer is correct. These systems often rely on external knowledge about entities (Magnini et al., 2002;Ko et al., 2007). Knowing when to abstain has been part of past QA shared tasks like RespubliQA (Peñas et al., 2009) and QA4MRE (Peñas et al., 2013). IBM's Watson system for Jeopardy also uses a pipelined approach for answer validation (Gondek et al., 2012). Our work differs by focusing on modern neural QA systems trained end-to-end, rather than pipelined systems, and by viewing the problem of abstention in QA through the lens of selective prediction.

Related goals and tasks
Calibration. Knowing when to abstain is closely related to calibration-having a model's output probability align with the true probability of its prediction (Platt, 1999). A key distinction is that selective prediction metrics generally depend only on relative confidences-systems are judged on their ability to rank correct predictions higher than incorrect predictions (El-Yaniv and Wiener, 2010). In contrast, calibration error depends on the absolute confidence scores. Nonetheless, we will find it useful to analyze calibration in Section 5.3, as miscalibration on some examples but not others does imply poor relative ordering, and therefore poor selective prediction. Ovadia et al. (2019) observe increases in calibration error under domain shift.

Identifying unanswerable questions. In SQuAD 2.0, models must recognize when a paragraph does not entail an answer to a question (Rajpurkar et al., 2018). Sentence selection systems must rank passages that answer a question higher than passages that do not (Wang et al., 2007; Yang et al., 2015). In these cases, the goal is to "abstain" when no system (or person) could infer an answer to the given question using the given passage. In contrast, in selective prediction, the model should abstain when it would give a wrong answer if forced to make a prediction.
Outlier detection. We distinguish selective prediction under domain shift from outlier detection, the task of detecting out-of-domain examples (Schölkopf et al., 1999; Hendrycks and Gimpel, 2017). While one could use an outlier detector for selective classification (e.g., by abstaining on all examples flagged as outliers), this would be too conservative, as QA models can often get a non-trivial fraction of OOD examples correct (Talmor and Berant, 2019; Fisch et al., 2019). Hendrycks et al. (2019b) use known OOD data for outlier detection by training models to have high entropy on OOD examples; in contrast, our setting rewards models for predicting correctly on OOD examples, not merely having high entropy.

Problem Setup
We formally define the setting of selective prediction under domain shift, starting with some notation for selective prediction in general.

Selective Prediction
Given an input x, the selective prediction task is to output (ŷ, c), where ŷ ∈ Y(x), the set of answer candidates, and c ∈ R denotes the model's confidence. Given a threshold γ ∈ R, the overall system predicts ŷ if c ≥ γ and abstains otherwise.
The risk-coverage curve provides a standard way to evaluate selective prediction methods (El-Yaniv and Wiener, 2010). For a test dataset D test, any choice of γ has an associated coverage (the fraction of D test the model makes a prediction on) and risk (the error on that fraction of D test). As γ decreases, coverage increases, but risk will usually also increase. We plot risk versus coverage and evaluate on the area under this curve (AUC), as well as the maximum possible coverage for a desired risk level. The former metric averages over all γ, painting an overall picture of selective prediction performance, while the latter evaluates at a particular choice of γ corresponding to a specific level of risk tolerance.
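These metrics can be computed directly from a list of per-example confidences and correctness indicators. A minimal sketch (not the paper's released code; the function names are our own):

```python
import numpy as np

def risk_coverage(confidences, correct):
    """Sweep the abstention threshold over all observed confidences:
    answering the k most-confident examples gives coverage k/n and
    risk = error rate on those k examples."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    order = np.argsort(-conf)                  # most confident first
    n = len(conf)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(1.0 - corr[order]) / np.arange(1, n + 1)
    return coverage, risk

def auc_risk_coverage(confidences, correct):
    """Area under the risk-coverage curve (lower is better)."""
    cov, risk = risk_coverage(confidences, correct)
    return float(np.sum((risk[1:] + risk[:-1]) / 2.0 * np.diff(cov)))

def max_coverage_at_risk(confidences, correct, max_risk):
    """Largest coverage whose associated risk stays within max_risk."""
    cov, risk = risk_coverage(confidences, correct)
    ok = cov[risk <= max_risk]
    return float(ok.max()) if ok.size else 0.0
```

Sorting by confidence enumerates every distinct threshold γ, so this traces the full curve in one pass.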

Selective Prediction under Domain Shift
We deviate from prior work by considering the setting where the model's training data D train and test data D test are drawn from different distributions. As our experiments demonstrate, this setting is challenging because standard QA models are overconfident on out-of-domain inputs.
To formally define our setting, we specify three data distributions. First, p source is the source distribution, from which a large training dataset D train is sampled. Second, q unk is an unknown OOD distribution, representing out-of-domain data encountered at test time. The test dataset D test is sampled from p test, a mixture of p source and q unk:

p_test = α · p_source + (1 − α) · q_unk

for α ∈ (0, 1). We choose α = 1/2, and examine the effect of changing this ratio in Section 5.8. Third, q known is a known OOD distribution, representing examples not in p source but from which the system developer has a small dataset D calib.
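Concretely, such a test mixture can be simulated by sampling from two pools of examples (a hypothetical sketch; `source_pool` and `ood_pool` stand in for SQuAD and the unknown OOD dataset):

```python
import random

def sample_test_mixture(source_pool, ood_pool, n, alpha=0.5, seed=0):
    """Draw a test set of size n in which a fraction alpha comes from
    the source distribution and 1 - alpha from the OOD distribution."""
    rng = random.Random(seed)
    n_src = round(alpha * n)
    mixed = rng.sample(source_pool, n_src) + rng.sample(ood_pool, n - n_src)
    rng.shuffle(mixed)                 # hide which examples are OOD
    return mixed
```

The shuffle matters: the system is never told which test examples came from which distribution.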

Selective Question Answering
While our framework is general, we focus on extractive question answering, as exemplified by SQuAD (Rajpurkar et al., 2016), due to its practical importance and the diverse array of available QA datasets in the same format. The input x is a passage-question pair (p, q), and the set of answer candidates Y (x) is all spans of the passage p.
All methods we consider output the same predicted answer ŷ (the model's highest-probability span), but differ in their associated confidence c.

Methods
Recall that our setting differs from the standard selective prediction setting in two ways: unknown OOD data drawn from q unk appears at test time, and known OOD data drawn from q known is available to the system. Intuitively, we expect that systems must use the known OOD data to generalize to the unknown OOD data. In this section, we present three standard selective prediction methods for in-domain data, and show how they can be adapted to use data from q known.

MaxProb
The first method, MaxProb, directly uses the probability assigned by the base model to ŷ as an estimate of confidence. Formally, MaxProb with model f estimates confidence on input x as:

c_MaxProb(x) = max_{y ∈ Y(x)} f(y | x)

MaxProb is a strong baseline for our setting. Across many tasks, MaxProb has been shown to distinguish in-domain test examples that the model gets right from ones the model gets wrong (Hendrycks and Gimpel, 2017). MaxProb is also a strong baseline for outlier detection, as it is lower for out-of-domain examples than in-domain examples (Lakshminarayanan et al., 2017; Hendrycks et al., 2019b). This is desirable for our setting: models make more mistakes on OOD examples, so they should abstain more on OOD examples than on in-domain examples.
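In code, MaxProb reduces to a softmax over the candidate-span scores followed by a max (an illustrative sketch; a real extractive QA model scores every span of the passage):

```python
import numpy as np

def maxprob(span_logits):
    """Return (predicted span index, MaxProb confidence): softmax the
    candidate scores and take the probability of the argmax."""
    logits = np.asarray(span_logits, dtype=float)
    z = np.exp(logits - logits.max())      # numerically stable softmax
    probs = z / z.sum()
    pred = int(np.argmax(probs))
    return pred, float(probs[pred])
```

Note that a uniform score vector yields confidence 1/|Y(x)|, the lowest MaxProb can be.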
MaxProb can be used with any base model f . We consider two such choices: a model f src trained only on D train , or a model f src+known trained on the union of D train and D calib .

Test-time Dropout
For neural networks, another standard approach to estimate confidence is to use dropout at test time. Gal and Ghahramani (2016) showed that dropout gives good confidence estimates on OOD data.
Given an input x and model f, we compute f on x with K different dropout masks, obtaining prediction distributions p̂_1, ..., p̂_K, where each p̂_i is a probability distribution over Y(x). We consider two statistics of these p̂_i's that are commonly used as confidence estimates. First, we take the mean of p̂_i(ŷ) across all i (Lakshminarayanan et al., 2017):

c_DropoutMean(x) = (1/K) Σ_{i=1}^{K} p̂_i(ŷ)

This can be viewed as ensembling the predictions across all K dropout masks by averaging them. Second, we take the negative variance of the p̂_i(ŷ)'s (Feinman et al., 2017; Smith and Gal, 2018):

c_DropoutVar(x) = −Var[p̂_i(ŷ)]

Higher variance corresponds to greater uncertainty, and hence favors abstaining. Like MaxProb, dropout can be used either with f trained only on D train, or on both D train and the known OOD data. Test-time dropout has practical disadvantages compared to MaxProb. It requires access to internal model representations, whereas MaxProb only requires black-box access to the base model (e.g., API calls to a trained model). Dropout also requires K forward passes of the base model, leading to a K-fold increase in runtime.
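Both dropout statistics can be computed from the K sampled output distributions (a sketch with our own function names; the rows of `prob_matrix` play the role of the p̂_i):

```python
import numpy as np

def dropout_confidences(prob_matrix):
    """prob_matrix: K x |Y(x)| array whose row i is the model's output
    distribution under dropout mask i. Returns the prediction from the
    averaged distribution plus the two confidence statistics."""
    P = np.asarray(prob_matrix, dtype=float)
    y_hat = int(np.argmax(P.mean(axis=0)))   # ensembled prediction
    p_hat = P[:, y_hat]                      # p̂_i(ŷ) for each mask i
    c_mean = float(p_hat.mean())             # c_DropoutMean
    c_negvar = -float(p_hat.var())           # c_DropoutVar
    return y_hat, c_mean, c_negvar
```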

Training a calibrator
Our final method trains a calibrator to predict when a base model (trained only on data from p source ) is correct (Platt, 1999;Dong et al., 2018). We differ from prior work by training the calibrator on a mixture of data from p source and q known , anticipating the test-time mixture of p source and q unk . More specifically, we hold out a small number of p source examples from base model training, and train the calibrator on the union of these examples and the q known examples. We define c Calibrator to be the prediction probability of the calibrator.
The calibrator itself could be any binary classification model. We use a random forest classifier with seven features: passage length, the length of the predicted answer ŷ, and the top five softmax probabilities output by the model. These features require only a minimal amount of domain knowledge to define. Rodriguez et al. (2019) similarly used multiple softmax probabilities to decide when to answer questions. The simplicity of this model makes the calibrator fast to train when given new data from q known, especially compared to retraining the QA model on that data.
We experiment with four variants of the calibrator. First, to measure the impact of using known OOD data, we change the calibrator's training data: it can be trained either on data from p source only, or both p source and q known data as described. Second, we consider a modification where instead of the model's probabilities, we use probabilities from the mean ensemble over dropout masks, as described in Section 4.2, and also add c DropoutVar as a feature. As discussed above, dropout features are costly to compute and assume white-box access to the model, but may result in better confidence estimates. Both of these variables can be changed independently, leading to four configurations.
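A sketch of the calibrator described above, assuming per-example statistics are already extracted (the feature-extraction and `train_calibrator` helpers are our own simplifications of the paper's setup, not its released code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def calibrator_features(passage_len, answer_len, span_probs, k=5):
    """The seven features: passage length, predicted-answer length, and
    the top-k softmax probabilities (zero-padded if fewer candidates)."""
    top = sorted(span_probs, reverse=True)[:k]
    top += [0.0] * (k - len(top))
    return [float(passage_len), float(answer_len)] + top

def train_calibrator(features, was_correct, seed=0):
    """Fit a random forest to predict whether the QA model was correct;
    intended to be trained on held-out source plus known-OOD examples."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(np.asarray(features), np.asarray(was_correct))
    return clf
```

At test time, `clf.predict_proba(...)[:, 1]` serves as the confidence c_Calibrator.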

Experimental Details
Data. We use SQuAD 1.1 (Rajpurkar et al., 2016) as the source dataset and five other datasets as OOD datasets: NewsQA (Trischler et al., 2017), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). 1 These are all extractive question answering datasets in which every question is answerable; however, they vary widely in the nature of their passages (e.g., Wikipedia, news, web snippets), questions (e.g., Jeopardy and trivia questions), and relationship between passages and questions (e.g., whether questions are written based on passages, or passages retrieved based on questions). We used the preprocessed data from the MRQA 2019 shared task (Fisch et al., 2019). For HotpotQA, we focused on multi-hop questions by selecting only "hard" examples, as defined by Yang et al. (2018). In each experiment, two different OOD datasets are chosen as q known and q unk. All results are averaged over all 20 such combinations, unless otherwise specified. We sample 2,000 examples from q known for D calib, and 4,000 SQuAD and 4,000 q unk examples for D test. We evaluate using exact match (EM) accuracy, as defined by SQuAD (Rajpurkar et al., 2016). Additional details can be found in Appendix A.1.
QA model. For our QA model, we use the BERT-base SQuAD 1.1 model trained for 2 epochs. We train six models total: one f src and five f src+known's, one for each OOD dataset.
Selective prediction methods. For test-time dropout, we use K = 30 different dropout masks, as in Dong et al. (2018). For our calibrator, we use the random forest implementation from scikit-learn (Pedregosa et al., 2011). We train on 1,600 SQuAD examples and 1,600 known OOD examples, and use the remaining 400 SQuAD and 400 known OOD examples as a validation set to tune calibrator hyperparameters via grid search. We average our results over 10 random splits of this data. When training the calibrator only on p source, we use 3,200 SQuAD examples for training and 800 for validation, to ensure equal dataset sizes. Additional details can be found in Appendix A.2.

Main results
Training a calibrator with q known outperforms other methods. Table 1 compares all methods that do not use test-time dropout. Compared to MaxProb with f src+known, the calibrator has 4.3 points and 6.7 points higher coverage at 80% and 90% accuracy respectively, and 1.1 points lower AUC. 2 This demonstrates that training a calibrator is a better use of known OOD data than training a QA model. The calibrator trained on both p source and q known also outperforms the calibrator trained on p source alone by 2.4% coverage at 80% accuracy. All methods perform far worse than the optimal selective predictor with the given base model.

Test-time dropout improves results but is expensive. Table 2 shows results for methods that use test-time dropout, as described in Section 4.2. The negative variance of the p̂_i(ŷ)'s across dropout masks serves poorly as an estimate of confidence, but the mean performs well. The best performance is attained by the calibrator using dropout features, which has 3.9% higher coverage at 80% accuracy than the calibrator with non-dropout features. Since test-time dropout introduces substantial (i.e., K-fold) runtime overhead, our remaining analyses focus on methods without test-time dropout.
[Figure 2: Area under the risk-coverage curve as a function of how much data from q known is available. At all points, using data from q known to train the calibrator is more effective than using it for QA model training.]

The QA model has lower but non-trivial accuracy on OOD data. Next, we motivate our focus on selective prediction, as opposed to outlier detection, by showing that the QA model still gets a non-trivial fraction of OOD examples correct. Table 3 shows the (non-selective) exact match scores
for all six QA models used in our experiments on all datasets. All models get around 80% accuracy on SQuAD, and around 40% to 50% accuracy on most OOD datasets. Since OOD accuracies are much higher than 0%, abstaining on all OOD examples would be overly conservative. 4 At the same time, since OOD accuracy is worse than in-domain accuracy, a good selective predictor should answer more in-domain examples and fewer OOD examples. Training on 2,000 q known examples does not significantly help the base model extrapolate to other q unk distributions.
Results hold across different amounts of known OOD data. As shown in Figure 2, across all amounts of known OOD data, using it to train and validate the calibrator (in an 80-20 split) performs better than adding all of it to the QA training data and using MaxProb.

Overconfidence of MaxProb
We now show why MaxProb performs worse in our setting compared to the in-domain setting: it is miscalibrated on out-of-domain examples. Figure 3a shows that MaxProb values are generally lower for OOD examples than in-domain examples, following previously reported trends (Hendrycks and Gimpel, 2017). However, the MaxProb values are still too high out-of-domain. Figure 3b shows that MaxProb is not well calibrated: it is underconfident in-domain, and overconfident out-of-domain. 5

[Figure 3: MaxProb is lower on average for OOD data than in-domain data (a), but it is still overconfident on OOD data: when plotting the true probability of correctness vs. MaxProb (b), the OOD curve is below the y = x line, indicating MaxProb overestimates the probability that the prediction is correct. The calibrator assigns lower confidence on OOD data (c) and has a smaller gap between in-domain and OOD curves (d).]

Our QA model is only trained for two epochs, as is standard for BERT. Our findings also align with Ovadia et al. (2019), who find that computer vision and text classification models are poorly calibrated out-of-domain even when well-calibrated in-domain. Note that miscalibration out-of-domain does not imply poor selective prediction on OOD data, but does imply poor selective prediction in our mixture setting.
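The style of calibration analysis in Figure 3b can be reproduced by binning predictions by confidence and comparing mean confidence to empirical accuracy per bin (a generic sketch, not the paper's plotting code):

```python
import numpy as np

def reliability_curve(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins and return
    (mean confidence, empirical accuracy) for each non-empty bin;
    accuracy below confidence indicates overconfidence."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(conf, edges[1:-1])   # bin 0..n_bins-1 per example
    curve = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            curve.append((float(conf[mask].mean()), float(corr[mask].mean())))
    return curve
```

Running this separately on in-domain and OOD subsets makes the gap between the two curves visible.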

Extrapolation between datasets
We next investigate how the choice of q known affects the calibrator's generalization to q unk. Figure 4 shows the AUC improvement of the trained calibrator over MaxProb, as a percentage of the gap between MaxProb and the optimal AUC. The calibrator outperforms MaxProb over all dataset combinations, with larger gains when q known and q unk are similar. For example, samples from TriviaQA help generalization to SearchQA and vice versa; both use web snippets as passages. Samples from NewsQA, the only other non-Wikipedia dataset, are also helpful for both. On the other hand, no other dataset significantly helps generalization to HotpotQA, likely due to HotpotQA's unique focus on multi-hop questions.

Calibrator feature ablations
We determine the importance of each feature of the calibrator by removing each of its features individually, leaving the rest. From Table 4, we see that the most important features are the softmax probabilities and the passage length. Intuitively, passage length is meaningful both because longer passages have more answer candidates, and because passage length differs greatly between different domains.

Error analysis
We examined calibrator errors on two pairs of q known and q unk: one similar pair of datasets and one dissimilar. For each, we sampled 100 errors in which the system confidently gave a wrong answer (overconfident), and 100 errors in which the system abstained but would have gotten the question correct if it had answered (underconfident). These were sampled from the 1000 most overconfident or underconfident errors, respectively.

[Figure 4: Results for different choices of q known (y-axis) and q unk (x-axis). For each pair, we report the percent AUC improvement of the trained calibrator over MaxProb, relative to the total possible improvement. Datasets that use similar passages (e.g., SearchQA and TriviaQA) help each other the most. Main diagonal elements (shaded) assume access to q unk (see Section 5.9).]
q known = NewsQA, q unk = TriviaQA. These two datasets are drawn from different non-Wikipedia sources. 62% of overconfidence errors are due to the model predicting valid alternate answers, or span mismatches (the model predicts a slightly different span than the gold span), and should be considered correct; thus the calibrator was not truly overconfident. This points to the need to improve QA evaluation metrics. 45% of underconfidence errors are due to the passage requiring coreference resolution over long distances, including with the article title. Neither SQuAD nor NewsQA passages have coreference chains as long or contain titles, so it is unsurprising that the calibrator struggles on these cases. Another 25% of underconfidence errors were cases in which there was insufficient evidence in the paragraph to answer the question (as TriviaQA was constructed via distant supervision), so the calibrator was not incorrect to assign low confidence. 16% of all underconfidence errors also included phrases that would not be common in SQuAD and NewsQA, such as using "said bye bye" for "banned."

q known = NewsQA, q unk = HotpotQA. These two datasets are dissimilar from each other in multiple ways. HotpotQA uses short Wikipedia passages and focuses on multi-hop questions; NewsQA has much longer passages from news articles and does not focus on multi-hop questions. 34% of the overconfidence errors are due to valid alternate answers or span mismatches. On 65% of the underconfidence errors, the correct answer was the only span in the passage that could plausibly answer the question, suggesting that the model arrived at the answer due to artifacts in HotpotQA that facilitate guesswork (Chen and Durrett, 2019; Min et al., 2019). In these situations, the calibrator's lack of confidence is therefore justifiable.

Relationship with Unanswerable Questions
We now study the relationship between selective prediction and identifying unanswerable questions.
Unanswerable questions do not aid selective prediction. We trained a QA model on SQuAD 2.0 (Rajpurkar et al., 2018), which augments SQuAD 1.1 with unanswerable questions [...] majority baseline of 48.9 EM. Taken together, these results indicate that identifying unanswerable questions is a very different task from knowing when to abstain under distribution shift. Our setting focuses on test data that is dissimilar to the training data, but on which the original QA model can still correctly answer a non-trivial fraction of examples. In contrast, unanswerable questions in SQuAD 2.0 look very similar to answerable questions, but a model trained on SQuAD 1.1 gets all of them wrong.

Changing ratio of in-domain to OOD
Until now, we used α = 1/2 both for D test and for training the calibrator. Now we vary α for both, ranging from using only SQuAD to using only OOD data (sampled from q known for D calib and from q unk for D test). Figure 5 shows the difference in AUC between the trained calibrator and MaxProb. At both ends of the graph, the difference is close to 0, showing that MaxProb performs well in homogeneous settings. However, when the two data sources are mixed, the calibrator outperforms MaxProb significantly. This further supports our claim that MaxProb performs poorly in mixed settings.

Allowing access to q unk
We note that our findings do not hold in the alternate setting where we have access to samples from q unk (instead of q known ). Training the QA model with this OOD data and using MaxProb achieves average AUC of 16.35, whereas training a calibrator achieves 17.87; unsurprisingly, training on examples similar to the test data is helpful. We do not focus on this setting, as our goal is to build selective QA models for unknown distributions.

Discussion
In this paper, we propose the setting of selective question answering under domain shift, in which systems must know when to abstain on a mixture of in-domain and unknown OOD examples. Our setting combines two important goals for real-world systems: knowing when to abstain, and handling distribution shift at test time. We show that models are overconfident on OOD examples, leading to poor performance in our setting, but that training a calibrator on other OOD data can help correct this problem. While we focus on question answering, our framework is general and extends to any prediction task for which graceful handling of out-of-domain inputs is necessary.
Across many tasks, NLP models struggle on out-of-domain inputs. Models trained on standard natural language inference datasets (Bowman et al., 2015) generalize poorly to other distributions (Thorne et al., 2018; Naik et al., 2018). Achieving high accuracy on out-of-domain data may not even be possible if the test data requires abilities that are not learnable from the training data (Geiger et al., 2019). Adversarially chosen ungrammatical text can also cause catastrophic errors (Cheng et al., 2020). In all these cases, a more intelligent model would recognize that it should abstain on these inputs.
Traditional NLU systems typically have a natural ability to abstain. SHRDLU recognizes statements that it cannot parse, or that it finds ambiguous (Winograd, 1972). QUALM answers reading comprehension questions by constructing reasoning chains, and abstains if it cannot find one that supports an answer (Lehnert, 1977).
NLP systems deployed in real-world settings inevitably encounter a mixture of familiar and unfamiliar inputs. Our work provides a framework to study how models can more judiciously abstain in these challenging environments.
Reproducibility. All code, data, and experiments are available on the CodaLab platform at https://bit.ly/35inCah.

A.1 Dataset Sources
The OOD data used in calibrator training and validation was sampled from MRQA training data, and the SQuAD data for the same was sampled from MRQA validation data, to prevent train/test mismatch for the QA model (Fisch et al., 2019). The test data was sampled from a disjoint subset of the MRQA validation data.

A.2 Calibrator Features and Model
We ran experiments including question length and word overlap between the passage and question as calibrator features. However, these features did not improve the validation performance of the calibrator. We hypothesize that they may provide misleading information about a given example; e.g., a long question in SQuAD may provide more opportunities for alignment with the paragraph, making it more likely to be answered correctly, but a long question in HotpotQA may contain a conjunction, which is difficult for the SQuAD-trained model to extrapolate to. For the calibrator model, we also experimented with an MLP and logistic regression; both were slightly worse than the random forest.

A.3 Outlier Detection for Selective Prediction
In this section, we study whether outlier detection can be used to perform selective prediction. We train an outlier detector to predict whether a given input came from the in-domain dataset (i.e., SQuAD) or is out-of-domain, and use its probability that an example is in-domain as the confidence for selective prediction. The outlier detection model, training data (a mixture of p source and q known), and features are the same as those of the calibrator. We find that this method does poorly, achieving an AUC of 24.23, coverage at 80% accuracy of 37.91%, and coverage at 90% accuracy of 14.26%. This shows that, as discussed in Section 2.3 and Section 5.2, this approach is unable to correctly identify the OOD examples that the QA model would get correct.

[Figure 6: When considering only one answer option as correct, MaxProb is well-calibrated in-domain, but is still overconfident out-of-domain.]
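A sketch of this outlier-detection baseline, assuming the same feature vectors as the calibrator (the function name and interface are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def outlier_confidence(train_feats, is_source, test_feats, seed=0):
    """Train an in-domain vs. OOD detector and use P(in-domain) as the
    selective-prediction confidence, as in the baseline above.
    is_source: 1 for source-domain training examples, 0 for known-OOD."""
    det = RandomForestClassifier(n_estimators=100, random_state=seed)
    det.fit(np.asarray(train_feats), np.asarray(is_source))
    in_domain_col = list(det.classes_).index(1)
    return det.predict_proba(np.asarray(test_feats))[:, in_domain_col]
```

The key difference from the calibrator is the label: domain membership rather than QA-model correctness, which is why it abstains on OOD examples the model would have answered correctly.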

A.4 Underconfidence of MaxProb on SQuAD
As noted in Section 5.3, MaxProb is underconfident on SQuAD examples due to the additional correct answer options given at test time but not at train time. When the test-time evaluation is restricted to allow only one correct answer, we find that MaxProb is well-calibrated on SQuAD examples (Figure 6). The calibration of the calibrator improves as well (Figure 7). However, we do not retain this restriction for our experiments, as it diverges from standard practice on SQuAD, and EM over multiple spans is a better evaluation metric since there are often multiple answer spans that are equally correct.

A.5 Accuracy and Coverage per Domain

Table 1 in Section 5.2 shows the coverage of MaxProb and the calibrator over the mixed dataset D test while maintaining 80% accuracy and 90% accuracy. In Table 5, we report the fraction of these answered questions that are in-domain or OOD. We also show the accuracy of the QA model on each portion. Our analysis in Section 5.3 indicated that MaxProb was overconfident on OOD examples, which we expect would make it answer too many OOD questions and too few in-domain questions. Indeed,