On the Transferability of Minimal Prediction Preserving Inputs in Question Answering

Recent work (Feng et al., 2018) establishes the presence of short, uninterpretable input fragments that yield high confidence and accuracy in neural models. We refer to these as Minimal Prediction Preserving Inputs (MPPIs). In the context of question answering, we investigate competing hypotheses for the existence of MPPIs, including poor posterior calibration of neural models, lack of pretraining, and “dataset bias” (where a model learns to attend to spurious, non-generalizable cues in the training data). We discover a perplexing invariance of MPPIs to random training seed, model architecture, pretraining, and training domain. MPPIs demonstrate remarkable transferability across domains, achieving significantly higher performance than comparably short queries. Additionally, penalizing over-confidence on MPPIs fails to improve either generalization or adversarial robustness. These results suggest the interpretability of MPPIs is insufficient to characterize the generalization capacity of these models. We hope this focused investigation encourages more systematic analysis of model behavior outside of the human interpretable distribution of examples.


Introduction
Feng et al. (2018) establish the presence of shortened input sequences that yield high confidence and accuracy for non-pretrained neural models. These Minimal Prediction Preserving Inputs (MPPIs) are constructed by iteratively removing the least important word from the query to obtain the shortest sequence for which the model's prediction remains unchanged (example shown in Figure 1). For question answering we construct MPPIs by removing words only from the query: modifying the context paragraph is poorly defined in MPPI generation, as it perturbs the output space, rendering an answer impossible or trivial. Humans are unable to make either confident or accurate predictions on these inputs.

Context
... The site currently houses three cinemas, including the restored Classic, the United Kingdom's last surviving news cinema still in full-time operation, alongside two new screens ...

Original
What's the name of the United Kingdom's sole remaining news cinema?

Reduced
news

Confidence
0.57 → 0.51

Figure 1: A SQuAD dev set example. Given the original Context, the model makes the same correct prediction ("Classic") on the Reduced question (MPPI) as on the Original, with almost the same score. For humans, the reduced question, "news", is nonsensical.

Follow-up work treats strong model performance on such partial inputs as equivalent to models improperly learning the task (Feng et al., 2019; Kaushik and Lipton, 2018; He et al., 2019). Accordingly, we evaluate this proposition in question answering (QA), investigating the properties of MPPIs and how their existence relates to "dataset bias", out-of-domain generalization, and adversarial robustness.
First, we examine the hypothesis that MPPIs are a symptom of poor neural calibration. Feng et al. (2018) propose we can "attribute [these neural] pathologies primarily to the lack of accurate uncertainty estimates in neural models." As neural models tend to overfit the log-likelihood objective by predicting low-entropy distributions (Guo et al., 2017), this can manifest in over-confidence on gibberish examples outside of the training distribution (Goodfellow et al., 2014). We test this hypothesis using pretrained models, shown to have better posterior calibration and out-of-distribution robustness (Hendrycks et al., 2020; Desai and Durrett, 2020). Contrary to expectations, we find large-scale pretraining does not produce more human interpretable MPPIs.
Second, we examine the hypothesis that MPPIs are a symptom of "dataset bias", where a flawed annotation procedure results in hidden linguistic cues or "annotation artifacts" (Gururangan et al., 2018; Niven and Kao, 2019). Models trained on such data distributions can rely on simple heuristics rather than learning the task. As such, input fragments or "partial inputs" are often sufficient for a model to achieve strong performance on flawed datasets. This explanation has been considered for both Natural Language Inference tasks (the "hypothesis-only" input for Poliak et al. (2018); Gururangan et al. (2018)) and for Visual Question Answering (the "question-only" model for Goyal et al. (2017)). We would expect models that rely on these spurious cues to fail to generalize well to other "domains" (datasets with different collection and annotation procedures). We discover even models trained in different domains perform nearly as well on MPPIs as on full inputs, contradicting this hypothesis. Further, we test their transferability across a number of other factors, including random training seed, model size, and pretraining strategy, and confirm their invariance to each of these.
Third, we examine the hypothesis that MPPIs inhibit generalization. This intuition is based on MPPIs' poor human interpretability, which could suggest models should not attend to these signals. To test this hypothesis, we regularize this phenomenon directly to promote more human understandable MPPIs, and measure the impact on out-domain generalization and adversarial robustness. Interestingly, out-domain generalization and robustness on Adversarial SQuAD (Jia and Liang, 2017) vary significantly by domain, with both declining slightly on average due to regularization.
In conjunction, these results suggest MPPIs may represent a phenomenon distinct from what previous work has observed and analyzed. The performance of these inputs is not well explained by domain-specific biases or posterior over-confidence on out-of-distribution inputs. Instead, this behavior may correspond to relevant signals, as the impact of their partial mitigation suggests. We hope these results encourage researchers not to assume a priori that MPPIs, or other uninterpretable model behaviour, are dataset artifacts requiring mitigation. Before presenting mitigation solutions, we propose a more systematic analysis following our actionable framework: (a) rigorously test the alleged causes of the observed behaviour, (b) confirm the bias does not generalize/transfer, and (c) ensure the solution measurably improves generalization or robustness.


Experimental Setup

To examine how MPPIs transfer across Question Answering domains we employ 6 diverse QA training sets and 12 evaluation sets. The datasets were selected for annotation variety, differing on: question type, document source, annotation instructions, whether the question was collected independently of the passage, and skills required to answer the question. This set represents a realistic spectrum of domains for evaluating generalization.
We set aside 2k examples from each domain's validation set in order to generate MPPIs for model evaluation. For each experiment we also generate a set of randomly shortened queries to compare against the MPPIs; we refer to this as the "Random MPPI" baseline. For each of the original examples, we generate this baseline by randomly removing words until the length matches that of the corresponding MPPI.

          DRQA          BERT-B        XLNET-L
BERT-B    32.1 / 9.9    - / -         29.8 / 9.9
XLNET-L   26.0 / 7.2    29.8 / 9.9    - / -
RANDOM    13.0 / 1.8    12.6 / 0.9    14.2 / 1.3

Table 2: The mean similarity, measured in Jaccard Similarity / Exact Match (%), between the MPPIs from different model types and the random baseline.
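The Random MPPI baseline can be sketched as follows. This is a minimal illustration, not the paper's code: `random_mppi` is a hypothetical helper name, and whitespace tokenization is assumed.

```python
import random

def random_mppi(query_tokens, target_length, seed=None):
    """Randomly remove words from a query, preserving word order,
    until its length matches that of the corresponding MPPI."""
    rng = random.Random(seed)
    tokens = list(query_tokens)
    while len(tokens) > target_length:
        # Drop one word chosen uniformly at random.
        tokens.pop(rng.randrange(len(tokens)))
    return tokens

# Example: shorten a query to the length of a 2-token MPPI.
query = "What is the name of the sole remaining news cinema".split()
baseline = random_mppi(query, target_length=2, seed=0)
```

Because words are removed uniformly at random, the baseline matches the MPPI's length distribution while carrying none of the model-selected signal.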

Invariance of MPPIs
Feng et al. (2018) demonstrate this phenomenon for LSTM-based architectures, including DRQA and BIMPM (Wang et al., 2017). We extend this investigation to modern, pretrained Transformers, and assess the "invariance" of MPPIs: measuring whether they are random, or are affected by model architecture, pretraining strategy, or training dataset (domain).
In subsequent experiments we compare sets of MPPIs using the mean Exact Match or Generalized Jaccard Similarity (GJS), a variant of Jaccard Similarity which accounts for the possibility of repeated tokens in either of the sequences being compared. Generalized Jaccard Similarity between two token sequences X and Y is defined in Equation 1:

GJS(X, Y) = Σ_n min(X_n, Y_n) / Σ_n max(X_n, Y_n)    (1)

Here, n indexes every token that appears in X ∪ Y, and X_n denotes the number of occurrences of token n in X.
We will refer to this as "Jaccard Similarity" for simplicity.
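As an illustration, a minimal implementation of Generalized Jaccard Similarity over token multisets might look like this (the function name and Counter-based representation are ours, not from the paper):

```python
from collections import Counter

def generalized_jaccard(x_tokens, y_tokens):
    """Generalized Jaccard Similarity: sum_n min(x_n, y_n) / sum_n max(x_n, y_n),
    where x_n is the number of times token n occurs in X. Unlike plain
    Jaccard over sets, repeated tokens are counted."""
    x, y = Counter(x_tokens), Counter(y_tokens)
    vocab = set(x) | set(y)
    num = sum(min(x[n], y[n]) for n in vocab)
    den = sum(max(x[n], y[n]) for n in vocab)
    # Two empty sequences are defined here as perfectly similar.
    return num / den if den else 1.0
```

For example, comparing ["what", "name", "name"] with ["name", "what"] gives 2/3: the repeated "name" contributes to the denominator but only once to the numerator.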

Random Seed
First, we investigate whether MPPIs are "random", or influenced by weight initialization and training data order. Measuring the Jaccard Similarity between MPPI sequences produced by models with different training seeds, we find JS_MPPI = 57.1% ± 1.2, as compared to JS_R = 13.8% ± 0.8 on the Random MPPI baseline. This suggests MPPIs are not simply a side-effect of randomness in the training procedure.

Pretraining and Architecture
One hypothesis is that traditional LSTM-based models, such as DRQA, do not have sufficient pretraining or "world knowledge" to rely on the entire sequence, and so overfit to subsets of the input. If this were the primary source of MPPIs, we might expect models that are better calibrated and more robust to out-of-distribution examples to have longer and more interpretable MPPIs. Accordingly, we test this hypothesis with large pretrained transformers, which recent work demonstrates have better posterior calibration and robustness to out-of-distribution inputs.
Specifically, Desai and Durrett (2020) examine 3 separate NLP tasks, using "challenging out-of-domain settings, where models face more examples they should be uncertain about", and find that "when used out-of-the-box, pretrained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5× lower". Similarly, Hendrycks et al. (2020) systematically show "Pretrained transformers are also more effective at detecting anomalous or [out-of-distribution] examples". These findings suggest pretrained transformers should produce more interpretable MPPIs than non-pretrained models.
However, in Table 1 we show MPPIs remain incomprehensibly short across all 6 domains, even for pretrained transformer models (for reference, DRQA produces MPPIs on SQuAD of mean length 2.04). In Table 2 we show MPPIs produced by different model architectures and pretraining strategies are similar, significantly exceeding the Jaccard Similarity of the Random MPPI baseline (JS_R = 13.8%). This would not be problematic if pretrained models produced lower confidences for MPPIs than the original examples (demonstrating some form of calibration). However, we find the opposite is true. Taking SQuAD for instance, we see that in 85% of cases the BERT model is more confident on the MPPI than the original example.
Lastly, we verify with manual grading tasks that the MPPIs for BERT and XLNet are no more interpretable to humans than DrQA's MPPIs, as shown in Table 5. This suggests that short, uninterpretable MPPIs are ubiquitous in modern neural question answering models and unmitigated by large scale pretraining, or improved out-of-distribution robustness.

Cross-Domain Similarity
Next, we investigate the extent to which MPPIs are domain-specific. We do this by measuring their similarity when produced by models trained in different domains. If MPPIs are the product of bias in the training data, such as annotation artifacts, we would expect them to be relatively domain specific, as different datasets carry different biases. In Table 3, a model trained on each domain (Train Dataset) generates MPPIs for every other domain (Reduction Dataset). For each Reduction Dataset, we measure the mean Jaccard Similarity between MPPIs produced by the Train Dataset model and MPPIs produced by the Reduction Dataset (in-domain) model. In parentheses we show the mean Jaccard Similarity between the Random MPPIs and the Train Dataset MPPIs. In all cases, MPPIs demonstrate higher similarity than the random baseline, indicating that they are not domain specific.

Cross-Domain Transferability of MPPIs
Even when models generate different MPPIs, they may still transfer to the other domain. We would like to measure MPPI transferability, independent of their similarity between models. If QA models perform well on MPPIs generated from a range of domains, this would suggest they are not a product of bias in the training data. Instead, they may retain information important to question answering, rather than annotation artifacts. To better measure the extent of MPPI transferability, we (a) train one model on SQuAD (Train Dataset), and another on NewsQA (Reduction Dataset), (b) use the NewsQA model to generate 2k MPPIs on the NewsQA evaluation set, and (c) measure the F1 performance of the SQuAD model evaluated on both the original NewsQA evaluation set and the MPPI queries generated in part (b). Figure 2 shows performance on out-domain MPPIs is 46.6% closer to original performance than on Random MPPIs. This evidence suggests MPPIs are highly transferable across domains. Consequently, MPPIs may relate to generalization, despite their poor human interpretability.
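One way to read the "46.6% closer" figure is as the fraction of the gap between Random MPPI and full-input performance that out-domain MPPIs recover. The helper below sketches that interpretation; the exact formula is our assumption, and the sample values are illustrative, not results from the paper.

```python
def gap_recovered(f1_original, f1_mppi, f1_random):
    """Fraction of the performance gap between the Random MPPI baseline
    and the original full-input queries that is recovered when
    evaluating on MPPIs instead of random reductions."""
    return (f1_mppi - f1_random) / (f1_original - f1_random)

# Illustrative numbers only: full-input F1 of 80, MPPI F1 of 50,
# Random MPPI F1 of 20 would mean MPPIs close half of the gap.
fraction = gap_recovered(80.0, 50.0, 20.0)
```

Under this reading, a value of 0.466 means out-domain MPPIs sit almost halfway between random reductions and full queries in F1.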
In Table 4, in-domain F1, out-domain generalization F1, and adversarial robustness F1 on Adversarial SQuAD (AR) all decline slightly on average with MPPI regularization: by 0.2%, 2.7%, and 0.6% respectively. These results suggest a model's ability to make predictions on MPPIs is not strongly correlated with either generalization or robustness across 13 total QA datasets. However, the relative stability of in-domain performance as compared to out-domain performance suggests mitigating MPPIs is more harmful when crossing domain boundaries. Certain train datasets exhibit greater sensitivity to MPPI regularization than others. For instance, SearchQA is drastically affected in all measures, HotpotQA hardly at all, and SQuAD actually improves by 3.1% in adversarial robustness. Additionally, Table 4 shows the 95% confidence intervals for out-domain generalization are often as large as the mean change in performance. Empirically, this demonstrates the effect of MPPI regularization is not consistent, having both positive and negative impacts on performance, depending on which of the 12 out-domain datasets is in question (see Figure 9 in Appendix A.4 for details).

Discussion
In SQUAD, the most common MPPI is the empty string (40%). Among non-empty strings, the most common MPPI tokens are: "what", "?", "who", "how", "when". Despite the pattern of interrogative words, these tokens are already among the most frequent in SQUAD questions, so it's challenging to measure the unique information they convey.
A more direct approach to understanding the informative signal of MPPIs is to measure their "human insufficiency" property directly. We conduct a grading task, comparing human ability to answer real, MPPI, and random MPPI queries. Table 5 shows that humans could only correctly answer BERT and XLNet MPPIs slightly more often than random MPPIs (32% and 26% exact match compared to 17%), but could answer 43.5% of MPPIs produced by MPPI-regularized BERT. Although this confirms MPPI regularization partially resolves over-confident behaviour on these human-uninterpretable inputs, we've observed the resulting model fares slightly worse in domain generalization and robustness.
We find no evidence that MPPIs are explained by poorly calibrated neural models, lack of pretraining knowledge, or dataset-specific bias. Alternatively they may relate, at least in part, to useful and transferable signals. While practitioners, especially in model debiasing tasks, have focused on human understandable and generalizable features, this work would encourage them to also consider the presence of generalizable features which are not human interpretable. This observation closely relates to prior work in computer vision suggesting human uninterpretable, adversarial examples can be the result of "features", not "bugs", in which Ilyas et al. (2019) observe "a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data." We hope this work provides a framework to rigorously evaluate the impact of bias mitigation methods on robustness and generalization, and encourages ML practitioners to examine assumptions regarding unexpected model behaviour on out-of-distribution inputs.

Conclusion
We empirically verify the surprising invariance of MPPIs to random seed, model architecture, and pretraining, as well as their wide transferability across domains. These results suggest that MPPIs may not be best explained by poorly calibrated neural estimates of confidence or dataset-specific bias. Examining their relationship to generalization and adversarial robustness, we highlight the ability to maintain in-domain performance but significantly alter out-domain performance and robustness. We hope our results encourage a more systematic analysis of hypotheses regarding model behavior outside the human interpretable distribution of examples.


A.1 Implementation

For BERT and XLNet we use the HuggingFace Transformers library (https://github.com/huggingface/transformers). For DRQA (Chen et al., 2017), we borrowed the implementation and hyper-parameters from hitvoice (https://github.com/hitvoice/DrQA) and train on 1 NVIDIA Tesla V100 GPU.

A.2 Dataset
We employ 6 diverse QA training sets and 12 evaluation sets from the MRQA 2019 workshop. Table 7 shows their statistics.
We use the hyperparameters described in Table 6 for training on each dataset. We use all the training data provided for each by MRQA.

A.3 Generating MPPIs
The process for generating MPPIs closely follows the procedure described by Feng et al. (2018). We operate with a beam size of k = 3, finding that larger beam sizes exhibit diminishing returns and rarely produce different results. The procedure involves iteratively removing the token which is "least important" to the model. The least important token is defined as the one whose removal causes the smallest decrease in confidence in the originally predicted span. Note that in some cases confidence in the originally predicted span can even increase with the removal of a token. In any case, the least important token is always the one whose removal leaves the highest confidence in the original prediction. The stop condition is reached when removing any additional token would change the model's prediction.
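The reduction procedure above can be sketched as a beam search over single-token removals. This is a simplified illustration, not the paper's code: `confidence_fn` and the toy model are hypothetical stand-ins for a trained QA model that returns a predicted answer and its confidence.

```python
def reduce_query(query_tokens, confidence_fn, beam_size=3):
    """Iteratively remove the least important query token, keeping the
    `beam_size` candidates with the highest remaining confidence, and
    stop when no further removal preserves the original prediction."""
    original_answer, _ = confidence_fn(query_tokens)
    beam = [list(query_tokens)]
    best = list(query_tokens)
    while beam:
        candidates = []
        for tokens in beam:
            for i in range(len(tokens)):
                reduced = tokens[:i] + tokens[i + 1:]
                answer, conf = confidence_fn(reduced)
                if answer == original_answer:  # prediction preserved
                    candidates.append((conf, reduced))
        if not candidates:
            break  # any further removal would change the prediction
        candidates.sort(key=lambda c: -c[0])
        beam = [tokens for _, tokens in candidates[:beam_size]]
        best = min(beam + [best], key=len)
    return best

def toy_model(tokens):
    """Hypothetical stand-in for a QA model: it answers "Classic"
    whenever the word "news" appears in the query."""
    if "news" in tokens:
        return "Classic", 0.5 + 0.01 * len(tokens)
    return "unknown", 0.3
```

With this toy model, reducing "what is the uk s last news cinema" yields the single token "news", mirroring the Figure 1 example.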
Note that we follow previous work in only removing words from the query in extractive question answering. The reason for this is that the MPPI can be poorly defined when context tokens are removed. Since the output predictions are over the context tokens in extractive question answering, it is possible to warp the answer space, or remove the answer altogether. Additionally, if we do not permit any alterations to the original prediction tokens, then there exists a trivial solution: remove all tokens except for the predicted answer. In this case an extractive question answering model is forced to predict that answer, with no alternative options. Consequently, MPPIs that allow modifications to the context, or output space, can be poorly defined. Since in question answering the query is an essential input for providing confident answers, we believe this is the most reasonable setup for the task.
For completeness, we describe our method of computing span confidence for question answering, given that there are many variations. Let S ∈ R^N be the vector of start logits and E ∈ R^N be the vector of end logits, both of sequence length N. For every pair i, j ∈ [0, N] with i ≤ j ≤ min(i + C, N), where C = 30 is the maximum answer span length, we compute the score for that span of answer text as the sum of the respective logits, S_i + E_j. The final confidence probability p_ij for a given span is the softmax of these scores over all valid spans, as shown in Equation 2:

p_ij = exp(S_i + E_j) / Σ_{(k,l)} exp(S_k + E_l)    (2)

where the sum runs over all valid spans (k, l).
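Assuming the normalization is a softmax over all valid spans (which the text implies), the computation can be sketched as follows; the function name and dictionary representation are ours.

```python
import math

def span_confidences(start_logits, end_logits, max_span_len=30):
    """Softmax over all valid answer spans: score(i, j) = S_i + E_j
    for i <= j with span length at most max_span_len."""
    n = len(start_logits)
    scores = {}
    for i in range(n):
        for j in range(i, min(i + max_span_len, n)):
            scores[(i, j)] = start_logits[i] + end_logits[j]
    # Numerically stable softmax over the span scores.
    z = max(scores.values())
    exp_scores = {k: math.exp(v - z) for k, v in scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}
```

The span with the highest p_ij is the model's prediction, and its probability is the confidence used throughout the reduction procedure.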
Note that while humans find such reduced inputs nonsensical, the model can still make the same prediction as it did on the full input, and with a similar degree of confidence.

A.4 Regularizing MPPIs
There are a couple of differences between the MPPI entropy-regularization strategy employed in this work and in Feng et al. (2018). First, while Feng et al. (2018) fine-tune a model already trained for the question answering task, we regularize MPPIs in the initial fine-tuning stage (starting from BERT and XLNet's pre-trained weights). Second, they alternate updates between two optimizers, one batch of maximum likelihood for every two batches of MPPI entropy maximization, whereas we use the same optimizer and shuffle together equal numbers of MPPI and regular inputs. We find our method (without rigorous comparison) to be slightly more effective on BERT at mitigating the MPPI phenomenon (measured by subsequent MPPI length). We suspect, if there is an advantage, it is due to the regularization beginning at the start of fine-tuning, rather than in a subsequent fine-tuning stage.
For completeness, we provide our entropy regularization loss term in Equation 3. Let X̃ denote the set of inputs that have been reduced to their MPPI, H(·) denote the entropy, and f(y|x) denote the predicted confidence for y given x. We then represent the loss term for MPPIs as L_MPPI, where the constant C = 10 is chosen such that maximizing the entropy minimizes the loss. We use λ = 0.1 as the most effective choice in our limited set of trials. The full loss, over all inter-mixed regular question answering and MPPI examples, is the sum of the standard QA loss L_QA and the MPPI loss term L_MPPI, as shown in Equation 4.
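The loss terms described above can be reconstructed in LaTeX as follows; the placement of λ and the summation over X̃ are assumptions consistent with the surrounding text, since Equations 3 and 4 are missing from this extraction.

```latex
% Entropy-maximization loss over the set \tilde{X} of reduced (MPPI) inputs.
% C = 10 is chosen so that maximizing the entropy minimizes the loss.
\mathcal{L}_{\mathrm{MPPI}}
  = \lambda \sum_{\tilde{x} \in \tilde{X}}
    \Bigl( C - H\bigl(f(y \mid \tilde{x})\bigr) \Bigr)
\tag{3}

% Full objective over inter-mixed regular QA and MPPI examples.
\mathcal{L} = \mathcal{L}_{\mathrm{QA}} + \mathcal{L}_{\mathrm{MPPI}}
\tag{4}
```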
In Figure 3 we display the full comparison between the performances of the MPPI-regularized models and the regular models on 13 QA datasets, including Adversarial SQuAD (Jia and Liang, 2017).

Figure 3: The generalization and robustness of BERT models evaluated on 12 datasets, as well as Adversarial SQuAD. The "(*)" indicates MPPI-regularization during training.

B How do MPPI Lengths Compare?
In the main paper we describe the differences in length distributions between original and MPPI queries. To provide more detail, we plot histograms of the query word lengths for the original queries, MPPI queries, and MPPI queries after the MPPI regularization procedure. These lengths are plotted below for SQuAD (Figure 4). The query length distributions show that MPPIs are significantly shorter than original queries, with the MPPIs of regularized models somewhere in between. These length distributions may be sufficient to explain why humans find the non-regularized MPPIs completely uninterpretable, and the regularized MPPIs somewhat more interpretable.

C Are MPPIs Invariant to Random Seed?
One of the preliminary questions in our investigation was whether changing the random training seed significantly altered the MPPIs produced by a model. If it were the case that this had a drastic effect, we might suspect MPPIs were somewhat random, or the product of meaningless over-confidence on out-of-distribution inputs. Table 8 illustrates the random seed experiment in full. Training 10 SQuAD models, each with a different random seed, we generate MPPIs on the 2k SQuAD evaluation set and compare 5 pairs. We measure the mean Generalized Jaccard Similarity of MPPIs produced by 2 models trained with different seeds. We see the similarity between MPPIs trained with different seeds far exceeds that of Rand-A and Rand-B, which are akin to a "random" simulation of MPPIs. As with our previous random baselines, these are generated by randomly sampling tokens from the original query, preserving word order, and ensuring that the length distribution matches that of the actual MPPIs to which they are being compared.

D Are MPPIs Invariant to Training Domain?
We discussed the invariance of MPPIs to training domain at length in the main paper for BERT. For completeness, we provide the raw results for BERT in Table 9 and for XLNet in Table 10.

Expanding on the MPPI generalization analysis in Section 3.2, we provide the raw results. The cross-domain generalization of BERT and XLNet models on MPPIs sourced from different training domains is available in Table 11 and Table 12 respectively. Figure 10 visualizes how well XLNet generalizes to different MPPI domains. The results mirror those of BERT shown in the main paper.

Table 12: Cross-Domain Generalization of XLNET Large models on different types of inputs. Values correspond to F1 scores on the question answering 2k evaluation set specified by the column.