Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases

State-of-the-art models often make use of superficial patterns in the data that do not generalize well to out-of-domain or adversarial settings. For example, textual entailment models often learn that particular key words imply entailment, irrespective of context, and visual question answering models learn to predict prototypical answers, without considering evidence in the image. In this paper, we show that if we have prior knowledge of such biases, we can train a model to be more robust to domain shift. Our method has two stages: we (1) train a naive model that makes predictions exclusively based on dataset biases, and (2) train a robust model as part of an ensemble with the naive one in order to encourage it to focus on other patterns in the data that are more likely to generalize. Experiments on five datasets with out-of-domain test sets show significantly improved robustness in all settings, including a 12 point gain on a changing priors visual question answering dataset and a 9 point gain on an adversarial question answering test set.


Introduction
While recent neural models have shown remarkable results, these achievements have been tempered by the observation that they are often exploiting dataset-specific patterns that do not generalize well to out-of-domain or adversarial settings. For example, entailment models trained on MNLI (Bowman et al., 2015) will guess an answer based solely on the presence of particular keywords (Gururangan et al., 2018) or whether sentence pairs contain the same words (McCoy et al., 2019), while QA models trained on SQuAD (Rajpurkar et al., 2016) tend to select text near question-words as answers, regardless of context (Jia and Liang, 2017).
We refer to these kinds of superficial patterns as bias. Models that rely on bias can perform well on in-domain data, but are brittle and easy to fool (e.g., SQuAD models are easily distracted by irrelevant sentences that contain many question words). Recent concern about dataset bias has led researchers to re-examine many popular datasets, resulting in the discovery of a wide variety of biases (Agrawal et al., 2018;Anand et al., 2018;Min et al., 2019;Schwartz et al., 2017).
In this paper, we build on these works by showing that, once a dataset bias has been identified, we can improve the out-of-domain performance of models by preventing them from making use of that bias. To do this, we use the fact that these biases can often be explicitly modelled with simple, constrained baseline methods to factor them out of a final model through ensemble-based training.
Our method has two stages. First, we build a bias-only model designed to capture a naive solution that performs well on the training data, but generalizes poorly to out-of-domain settings. Next, we train a second model in an ensemble with the pre-trained bias-only model, which incentivizes the second model to learn an alternative strategy, and use the second model alone on the test set. We explore several different ensembling methods, building on product-of-expert style approaches (Hinton, 2002;Smith et al., 2005). Figure 1 shows an example of applying this procedure to prevent a visual question answering (VQA) model from guessing answers because they are typical for the question, a flaw observed in VQA models (Goyal et al., 2018;Agrawal et al., 2018).
We evaluate our approach on a diverse set of tasks, all of which require models to overcome a challenging domain-shift between the train and test data. First, we build a set of synthetic datasets that contain manually constructed biases by adding artificial features to MNLI. We then consider three challenge datasets proposed by prior work (Agrawal et al., 2018; McCoy et al., 2019; Jia and Liang, 2017), which were designed to break models that adopt superficial strategies on well known textual entailment (Bowman et al., 2015), reading comprehension (Rajpurkar et al., 2016), and VQA (Antol et al., 2015) datasets.
Figure 1: An example of applying our method to a Visual Question Answering (VQA) task. We assume predicting green for the given question is almost always correct on the training data. To prevent a model from learning this bias, we first train a bias-only model that only uses the question as input, and then train a robust model in an ensemble with the bias-only model. Since the bias-only model will have already captured the target pattern, the robust model has no incentive to learn it, and thus does better on test data where the pattern is not reliable.
We additionally construct a new QA challenge dataset, TriviaQA-CP (for TriviaQA changing priors). This dataset was built by holding out questions from TriviaQA (Joshi et al., 2017) that ask about particular kinds of entities from the train set, and evaluating on those questions in the dev set, in order to challenge models to generalize between different types of questions.
We are able to improve out-of-domain performance in all settings, including a 6 and 9 point gain on the two QA datasets. On the VQA challenge set, we achieve a 12 point gain, compared to a 3 point gain from prior work. In general, we find using an ensembling method that can dynamically choose when to trust the bias-only model is the most effective, and we present synthetic experiments and qualitative analysis to illustrate the advantages of that approach. We release our datasets and code to facilitate future work (github.com/chrisc36/debias).

Related Work
Researchers have raised concerns about bias in many datasets. For example, many joint natural language processing and vision datasets can be partially solved by models that ignore the vision aspect of the task (Jabri et al., 2016; Anand et al., 2018; Caglayan et al., 2019). Some questions in recent multi-hop QA datasets (Yang et al., 2018; Welbl et al., 2018) can be solved by single-hop models (Chen and Durrett, 2019; Min et al., 2019). Additional examples include story completion (Schwartz et al., 2017) and multiple choice questions (Clark et al., 2016). Recognizing that bias is a concern in diverse domains, our work is the first to perform an evaluation across multiple datasets spanning language and vision.
Recent dataset construction protocols have tried to avoid certain kinds of bias. For example, both CoQA (Reddy et al., 2019) and QuAC take steps to prevent annotators from using words that occur in the context passage, VQA 2.0 (Goyal et al., 2018) selects examples to limit the effectiveness of question-only models, and others have filtered examples solvable by simple baselines (Yang et al., 2018; Zhang et al., 2018b; Zellers et al., 2018). While reducing bias is important, developing ways to prevent models from using known biases will allow us to continue to leverage existing datasets, and update our methods as our understanding of what biases we want to avoid evolves.
Recent work has focused on biases that come from ignoring parts of the input (e.g., guessing the answer to a question before seeing the evidence). Solutions include generative objectives to force models to understand all the input (Lewis and Fan, 2019), carefully designed model architecture (Agrawal et al., 2018), or adversarial removal of class-indicative features from the model's internal representations (Ramakrishnan et al., 2018; Zhang et al., 2018a; Belinkov et al., 2019; Grand and Belinkov, 2019).
In contrast, we consider biases beyond partial-input cases (Feng et al., 2019), and show our method is superior on VQA-CP. Concurrently, He et al. (2019) also suggested using a product-of-experts ensemble to train unbiased models, but we consider a wider variety of ensembling approaches and test on additional domains.
A related task is preventing models from using particular problematic dataset features, which is often studied from the perspective of fairness (Zhao et al., 2017;Burns et al., 2018). A popular approach is to use an adversary to remove information about a target feature, often gender or ethnicity, from a model's internal representations (Edwards and Storkey, 2016;Kim et al., 2019). In contrast, the biases we consider are related to features that are essential to the overall task, so they cannot simply be ignored.
Evaluating models on out-of-domain examples built by applying minor perturbations to existing examples has also been the subject of recent study (Szegedy et al., 2014;Belinkov and Bisk, 2018;Carlini and Wagner, 2018;Glockner et al., 2018). The domain shifts we consider involve larger changes to the input distribution, built to uncover higher-level flaws in existing models.

Methods
This section describes the two stages of our method, (1) building a bias-only model and (2) using it to train a robust model through ensembling.

Training a Bias-Only Model
The goal of the first stage is to build a model that performs well on training data, but is likely to perform very poorly on the out-of-domain test set. Since we assume we do not have access to examples from the test set, we must apply a-priori knowledge to meet this goal.
The most straightforward approach is to identify a set of features that are correlated with the class label during training, but are known to be uncorrelated or anticorrelated with the label on the test set, and then train a classifier on those features. For example, our VQA-CP (Agrawal et al., 2018) bias-only model (see Section 5.2) uses the question type as input, because the correlations between question types and answers differ greatly between the train set and the test set (e.g., 2 is a common answer to "How many..." questions on the train set, but is rare for such questions on the test set). However, a benefit of our method is that the bias can be modelled using any kind of predictor, giving us a way to capture more complex intuitions. For example, on SQuAD our bias-only model operates on a view of the input built from TF-IDF scores (see Section 5.4), and on our changing prior TriviaQA dataset our bias-only model makes use of a pre-trained named entity recognition (NER) tagger (see Section 5.5).

Training a Robust Model
This stage trains a robust model that avoids using the method learned by the bias-only model.

Problem Definition
We assume $n$ training examples $x_1, x_2, \ldots, x_n$, each of which has an integer label $y_i$, where $y_i \in \{1, 2, \ldots, C\}$ and $C$ is the number of classes. We additionally assume a pre-trained bias-only predictor, $h$, where $h(x_i) = b_i = \langle b_{i1}, b_{i2}, \ldots, b_{iC} \rangle$ and $b_{ij}$ is the bias-only model's predicted probability of class $j$ for example $i$. Finally we have a second predictor function, $f$, with parameters $\theta$, where $f(x_i, \theta) = p_i$ and $p_i$ is a similar probability distribution over the classes. Our goal is to construct a training objective to optimize $\theta$ so that $f$ will learn to select the correct class without using the strategy captured by the bias-only model.

General Approach
We train an ensemble of $h$ and $f$. In particular, for each example, a new class distribution, $\hat{p}_i$, is computed by combining $p_i$ and $b_i$. During training, the loss is computed using $\hat{p}_i$ and the gradients are backpropagated through $f$. During evaluation $f$ is used alone. We propose several different ensembling methods.

Bias Product
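Our simplest ensemble is a product of experts (Hinton, 2002):
$$\hat{p}_i = \mathrm{softmax}(\log(p_i) + \log(b_i)) \tag{1}$$
Equivalently, $\hat{p}_i \propto p_i \circ b_i$, where $\circ$ is elementwise multiplication.
Probabilistic Justification: For a given example, $x$, let $x^b$ be the bias of the example. That is, it is the features we will use in our bias-only model. Let $x^{-b}$ be a view of the example that captures all information about that example except the bias. Assume that $x^{-b}$ and $x^b$ are conditionally independent given the label, $c$. Then to compute $p(c|x)$ we have:
\begin{align}
p(c|x) = p(c|x^{-b}, x^b) &\propto p(c|x^{-b}) \, p(x^b|c, x^{-b}) \tag{2} \\
&= p(c|x^{-b}) \, p(x^b|c) \tag{3} \\
&= p(c|x^{-b}) \, \frac{p(c|x^b) \, p(x^b)}{p(c)} \tag{4} \\
&\propto p(c|x^{-b}) \, \frac{p(c|x^b)}{p(c)} \tag{5}
\end{align}
where (2) is from applying Bayes Rule while conditioning on $x^{-b}$, (3) follows from the conditional independence assumption, (4) applies Bayes Rule a second time to $p(x^b|c)$, and (5) drops $p(x^b)$ because it does not depend on $c$.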
We cannot directly model p(c|x −b ) because it is usually not possible to create a view of the data that excludes the bias. Instead, with the goal of encouraging the model to fall into the role of computing p(c|x −b ), we compute p(c|x b )/p(c) using the bias-only model, and train the product of the two models to compute p(c|x).
In practice, we ignore the $p(c)$ factor because, on our datasets, either the classes are uniformly distributed (MNLI), the bias-only model cannot easily capture a class prior since it is using a pointer network (QA), or because we want to remove class priors from the model anyway (VQA).
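The bias product objective can be sketched for a single example as follows. This is a minimal illustration in plain Python, not the released implementation: the function names, the 0-based label index, and the small clamp inside the logarithm are assumptions made for the example.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def bias_product_loss(robust_logits, bias_probs, label):
    """Cross-entropy of the product-of-experts ensemble for one example.

    robust_logits: unnormalized class scores from the robust model f
    bias_probs:    class probabilities b_i from the frozen bias-only model h
    label:         index of the gold class (0-based in this sketch)
    """
    log_p = log_softmax(robust_logits)
    # Clamp to avoid log(0); the clamp value is an assumption of this sketch.
    log_b = [math.log(max(b, 1e-12)) for b in bias_probs]
    # softmax(log p_i + log b_i), kept in log space for stability.
    log_ensemble = log_softmax([lp + lb for lp, lb in zip(log_p, log_b)])
    return -log_ensemble[label]


def robust_predict(robust_logits):
    """At evaluation time the robust model is used alone."""
    return max(range(len(robust_logits)), key=lambda j: robust_logits[j])
```

With a uniform bias distribution the loss equals ordinary cross-entropy on the robust model. When the bias-only model is already confident in the gold class, the ensemble loss is smaller, so the robust model receives less pressure to fit that example.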

Learned-Mixin
The assumption of conditional independence (Equation 3) will often be too strong. For example, in some cases the robust model might be able to predict the bias-only model will be unreliable for certain kinds of training examples. We find that this can cause the robust model to selectively adjust its behavior in order to compensate for the inaccuracy of the bias-only model, leading to errors in the out-of-domain setting (see Section 5.1).
Instead we allow the model to explicitly determine how much to trust the bias given the input:
$$\hat{p}_i = \mathrm{softmax}(\log(p_i) + g(x_i) \log(b_i))$$
where $g$ is a learned function. We compute $g(x_i)$ as $\mathrm{softplus}(w \cdot h_i)$, where $w$ is a learned vector, $h_i$ is the last hidden layer of the model for example $x_i$, and the $\mathrm{softplus}(x) = \log(1 + e^x)$ function is used to prevent the model from reversing the bias by multiplying it by a negative weight. $w$ is trained with the rest of the model parameters. This reduces to bias product when $g(x_i) = 1$.
A difficulty with this method is that the model could learn to integrate the bias into $p_i$ and set $g(x_i) = 0$. We find this does sometimes occur in practice, and our next method alleviates this challenge.
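The learned-mixin combination can be sketched as below, again for one example in plain Python. The function names and the representation of the hidden state as a list of floats are assumptions of this sketch. In a real model, `w` would be trained jointly with the other parameters.

```python
import math


def softplus(x):
    """Stable log(1 + e^x); always non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_softmax(scores):
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def learned_mixin_log_probs(robust_logits, bias_probs, hidden, w):
    """Return (log of the ensemble distribution, g(x_i)) for one example.

    hidden: last hidden layer h_i of the robust model (list of floats)
    w:      learned vector with the same length as hidden
    """
    g = softplus(sum(wi * hi for wi, hi in zip(w, hidden)))
    log_p = log_softmax(robust_logits)
    log_b = [math.log(max(b, 1e-12)) for b in bias_probs]  # clamp is an assumption
    # softmax(log p_i + g(x_i) * log b_i)
    return log_softmax([lp + g * lb for lp, lb in zip(log_p, log_b)]), g
```

When `g` is close to zero the ensemble collapses to the robust model's own distribution, which is the failure mode described above: the bias-only model no longer absorbs any of the training signal.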

Learned-Mixin +H
To prevent the learned-mixin ensemble from ignoring $b_i$, we add an entropy penalty to the loss:
$$R = w H(\mathrm{softmax}(g(x_i) \log(b_i)))$$
where $H(z) = -\sum_j z_j \log(z_j)$ is the entropy and $w$ is a scalar hyperparameter (not the learned vector $w$ used to compute $g$). Penalizing the entropy encourages the bias component to be non-uniform, and thus have a greater impact on the ensemble.
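A small sketch of the entropy penalty for one example follows. The names `entropy_penalty` and `weight` are illustrative. `weight` plays the role of the penalty hyperparameter, and the sketch assumes `g` has already been computed by the learned-mixin ensemble.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_penalty(bias_probs, g, weight):
    """weight * H(softmax(g * log b_i)) for a single example."""
    log_b = [math.log(max(b, 1e-12)) for b in bias_probs]  # clamp is an assumption
    scaled = softmax([g * lb for lb in log_b])
    return weight * entropy(scaled)
```

At `g = 0` the scaled bias is uniform and the penalty takes its maximum value, `weight * log(C)`. Increasing `g` sharpens the scaled bias and lowers the penalty, which is how the regularizer discourages the model from switching the bias component off.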

Evaluation Methodology
We evaluate our methods on several datasets that have out-of-domain test sets. Some of these tasks, such as HANS (McCoy et al., 2019) or Adversarial SQuAD (Jia and Liang, 2017), can be solved easily by generating additional training examples similar to the ones in the test set (e.g., Wang and Bansal (2018)). We, instead, demonstrate that it is possible to improve performance on these tasks by exploiting knowledge of general, biased strategies the model is likely to adopt.
Our evaluation setup consists of a training set, an out-of-domain test set, a bias-only model, and a main model. To run an evaluation we train the bias-only model on the train set, train the main model on the train set while employing one of the methods in Section 3, and evaluate the main model on the out-of-domain test set. We also report performance on the in-domain test set, when available. We use models that are known to work well for their respective tasks for the main model, and do not further tune their hyperparameters or perform early stopping. We consider two extractive QA datasets, which we treat as a joint classification task where the model must select the start and end answer token (Wang and Jiang, 2017). For these datasets, we build independent bias-only models for selecting the start and end token, and separately ensemble those biases with the classifier's start token and end token output distributions. We apply a ReLU layer to the question and passage embeddings, followed by max-pooling, to construct a hidden state for computing the learned-mixin weights.
We compare our methods to a reweighting baseline described below, and to training the main model without any modifications. On VQA we also compare to the adversarial methods from Ramakrishnan et al. (2018) and Grand and Belinkov (2019). The other biases we consider are not based on observing only part of the input, so these adversarial methods cannot be directly applied.

Reweight Baseline
As a non-ensemble baseline, we train the main model on a weighted version of the data, where the weight of example $x_i$ is $1 - b_{iy_i}$ (i.e., we weigh examples by one minus the probability the bias-only model assigns the correct label). This encourages the main model to focus on examples the bias-only model gets wrong.
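The reweighting rule can be illustrated with the short sketch below. The function names are hypothetical, and averaging the weighted losses over the batch is an assumption, since the text only specifies the per-example weight.

```python
import math


def log_softmax(scores):
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def example_weights(bias_probs_batch, labels):
    """Weight of example i is 1 - b_{i, y_i}."""
    return [1.0 - b[y] for b, y in zip(bias_probs_batch, labels)]


def reweighted_loss(logits_batch, bias_probs_batch, labels):
    """Weighted cross-entropy of the main model, averaged over the batch."""
    weights = example_weights(bias_probs_batch, labels)
    losses = [-log_softmax(l)[y] for l, y in zip(logits_batch, labels)]
    return sum(w * l for w, l in zip(weights, losses)) / len(labels)
```

An example that the bias-only model labels correctly with probability 1 receives weight 0 and contributes nothing to the loss.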

Hyperparameters
One of our methods (Learned-Mixin +H) requires hyperparameter tuning. However, hyperparameter tuning is challenging in our setup since our assumption is that we have no access to out-of-domain test examples during training. A plausible option would be to tune hyperparameters on a dev set that exhibits a related, but not identical, domain shift to the test set, but unfortunately none of our datasets have such dev sets. Instead we follow prior work (Grand and Belinkov, 2019; Ramakrishnan et al., 2018) and perform model selection on the test set. Although this presents an important caveat to the results of this method, we think it is still of interest to observe that the entropy regularizer can be very impactful. Future work may be able to either construct suitable development sets, or propose other hyperparameter-tuning methods to relieve this issue. The hyperparameters selected are shown in Appendix A.

Experiments
We provide experiments on five different domains, summarized in Table 1, each of which requires models to overcome a challenging domain-shift between train and test data. In the following sections we provide summaries of the datasets, main models and bias-only models, but leave low-level details to the appendix.

Synthetic Data
Data: We experiment with a synthetic dataset built by modifying MNLI (Bowman et al., 2015). In particular, we add a feature that is correlated with the class label to the train set, and build an out-of-domain test set by adding a randomized version of that feature to the MNLI matched dev set. We additionally construct an in-domain test set by modifying the matched dev set in the same way as was done in the train set. We build three variations of this dataset:
Indicator: Adds the token "0", "1", or "2" to the start of the hypothesis, such that 80% of the time the token corresponds to the example's label (i.e., "0" if the class is "entailment", "1" if the class is "contradiction", etc.). In the out-of-domain test set, the token is selected randomly.
Excluder: The same as Indicator, but with a 3% chance the added token corresponds to the example's label, meaning the token can usually be used to eliminate one of the three output classes.
Dependent: In the previous two settings, the added bias is independent of the example given the example's label. To simulate a case where this independence is broken, we experiment with adding an additional feature that is correlated with the bias feature, but is not treated as being part of the bias (i.e., it is not used by the bias-only model). In particular, 80% of the time a token is added to the start of the hypothesis that matches the label with 90% probability, and the "0" token is appended to the end of the hypothesis. The other 20% of the time a random token is prepended and "1" is appended.
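The Indicator and Excluder constructions can be sketched as follows. The helper names are hypothetical. The sketch also assumes that, when the added token does not match the label, it is drawn uniformly from the other classes, which the text does not state explicitly.

```python
import random


def add_bias_token(hypothesis, label_id, match_prob, rng, num_classes=3):
    """Prepend a class token that matches the label with probability match_prob.

    match_prob=0.8 gives the Indicator setting and match_prob=0.03 the Excluder
    setting. A non-matching token is drawn uniformly from the other classes
    (an assumption of this sketch).
    """
    if rng.random() < match_prob:
        token = label_id
    else:
        token = rng.choice([c for c in range(num_classes) if c != label_id])
    return f"{token} {hypothesis}"


def add_random_token(hypothesis, rng, num_classes=3):
    """Out-of-domain version: the token carries no information about the label."""
    return f"{rng.randrange(num_classes)} {hypothesis}"
```

A bias-only model that reads only the first token would reach roughly 80% accuracy on Indicator training data and chance accuracy on the randomized test set.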

Bias-Only Model: The bias-only model predicts the label using the first token of the hypothesis.
Main Model: We use a recurrent co-attention model, similar to ESIM (Chen et al., 2017). Details are given in Appendix B.
Results: Table 2 shows the results. All ensembling methods work well on the Indicator bias. The reweight method performs poorly on the Excluder bias, likely because the bias-only model assigns the correct class approximately 50% probability for almost all the training examples, making the weights mostly uniform. This illustrates a general weakness with reweighting methods: they require at least a small number of bias-free examples for the model to learn from.
The bias product method performs poorly on the Dependent bias. Inspection shows that, when the indicator is 1, the bias product model is anticorrelated with the bias. In particular, it assigns an average of 22.5% probability to the class indicated by the bias, where an unbiased model would assign an average of 33% since the bias is random. The root cause is that, if the indicator is 1, the model knows the bias is likely to be wrong, so it learns to subtract the value the bias-only model will produce from its own output in order to cancel out the bias-only model's effect on the ensemble's output.
The learned-mixin model does not suffer from this issue, and assigns the class indicated by the bias an average of 34.5% probability. Analysis shows that $g(x_i)$ is set to 0.00 ± 0.0001 when the indicator is turned off, and to 1.91 ± 0.285 otherwise, showing that the model learns to turn off the bias-only component of the ensemble as needed, thus avoiding this over-compensating issue. The entropy regularizer appears to be unnecessary on this dataset because $g(x_i)$ does not go to zero.

VQA-CP
Data: We evaluate on the VQA-CP v2 (Agrawal et al., 2018) dataset, which was constructed by re-splitting the VQA 2.0 (Goyal et al., 2018) train and validation sets into new train and test sets such that the correlations between question types and answers differ between the two splits. For example, "tennis" is the most common answer for questions that start with "What sport..." in the train set, whereas "skiing" is the most common answer for those questions in the test set. Models that choose answers because they are typical in the training data will perform poorly on this test set.
Bias-Only Model: VQA-CP comes with questions annotated with one of 65 question types, corresponding to the first few words of the question (e.g., "What color is"). The bias-only model uses this categorical label as input, and is trained on the same multi-label objective as the main model.
Results: Table 3 shows the results. The learned-mixin method was highly effective, boosting performance on VQA-CP by about 9 points, and the entropy regularizer can increase this by another 3 points, significantly surpassing prior work. For the learned-mixin ensemble, we find $g(x_i)$ is strongly correlated with the bias's expected accuracy (computed as $\sum_j s_{ij} b_{ij} / \sum_j b_{ij}$, where $s_{ij}$ is the score for class $j$ on example $i$), with a Spearman correlation of 0.77 on the test data. Qualitative examples (Figure 2) further suggest the model increases $g(x_i)$ when it knows it can rely on the bias-only model.

HANS
Data: We evaluate on the HANS adversarial MNLI dataset (McCoy et al., 2019). This dataset was built by constructing templated examples of entailment and non-entailment, such that the hypothesis sentence only includes words that are also in the premise sentence. Naively trained models tend to classify all such examples as "entailment" because detecting the presence of many shared words is an effective tactic on MNLI.
Bias-Only Model: The bias-only model is a shallow linear classifier using the following features: (1) whether the hypothesis is a sub-sequence of the premise, (2) whether all words in the hypothesis appear in the premise, (3) the percent of words from the hypothesis that appear in the premise, (4) the average of the minimum distance between each premise word and each hypothesis word, measured using cosine distance with the fasttext (Mikolov et al., 2018) word vectors, and (5) the max of those same distances. We constrain the bias-only model to put the same amount of probability mass on the neutral and contradiction classes so it focuses on distinguishing entailment and non-entailment, and reweight the dataset so that the entailment and non-entailment examples have an equal total weight to prevent a class prior from being learned.
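The first three lexical-overlap features can be sketched as below. The sketch assumes lowercased whitespace tokenization and treats "sub-sequence" as a contiguous span. Both are assumptions, and the feature names are illustrative. Features (4) and (5) are omitted because they require pre-trained word vectors.

```python
def is_contiguous_subsequence(hyp_tokens, prem_tokens):
    """True if hyp_tokens appears as a contiguous span of prem_tokens."""
    n = len(hyp_tokens)
    return any(
        prem_tokens[i:i + n] == hyp_tokens
        for i in range(len(prem_tokens) - n + 1)
    )


def overlap_features(premise, hypothesis):
    """Features (1)-(3) of the bias-only model for one premise/hypothesis pair."""
    prem = premise.lower().split()  # whitespace tokenization is an assumption
    hyp = hypothesis.lower().split()
    prem_vocab = set(prem)
    in_premise = [tok in prem_vocab for tok in hyp]
    return {
        "is_subsequence": is_contiguous_subsequence(hyp, prem),
        "all_in_premise": all(in_premise),
        "frac_in_premise": sum(in_premise) / len(hyp) if hyp else 0.0,
    }
```

For the premise "the doctor near the actor danced", the hypothesis "the actor danced" triggers all three features, while "the doctor danced" triggers only the two word-overlap features under the contiguous reading.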
Main Models: We experiment with both the uncased BERT base model (Devlin et al., 2019), and the same recurrent model used for the synthetic data (see Appendix B). We use the default hyperparameters for BERT since they work well for MNLI.
Results: Table 4 shows the results. We show scores for individual heuristics used in HANS in Appendix C. For the recurrent method, both the bias product and learned-mixin +H methods result in about a three point gain. However, for the BERT model, the simpler reweight method is more effective. We noticed high variance in performance between runs in this setting, and speculate the ensemble methods might be compounding this instability by introducing additional complexity.
Figure 2: Qualitative examples of the values of $g(x_i)$ on the VQA-CP training data for the learned-mixin model (labelled "G") and learned-mixin +H model (labelled "G+"). The question type and the bias model's highest ranked answer for that type are shown above. We find $g(x_i)$ is larger when the bias answers are likely to be correct.

Adversarial SQuAD
Data: We evaluate on the Adversarial SQuAD (Jia and Liang, 2017) dataset, which was built by adding distractor sentences to the passages in SQuAD (Rajpurkar et al., 2016). The sentences are built to closely resemble the question and contain a plausible answer candidate, but with a few key semantic changes to ensure they do not incidentally answer the question. Models that naively focus on sentences that contain many question words are often fooled by the new sentence.
Bias-Only Models: We consider two bias-only models: (1) TF-IDF: the TF-IDF score between each sentence and question is used to select an answer (meaning tokens within the same sentence all get the same score) and (2) TF-IDF Filtered: the same but excluding pronouns and numbers from the words used to compute the TF-IDF scores. The second model is motivated by the fact that distractor sentences never include numbers or pronouns that occur in the question.
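A sentence-level TF-IDF bias of this kind might look like the sketch below. The text does not specify the exact TF-IDF variant, so the smoothed IDF computed over the passage's sentences, the cosine similarity, and the whitespace tokenization are all assumptions, as are the function names.

```python
import math
from collections import Counter


def tfidf_sentence_scores(question, sentences):
    """Cosine similarity between the question and each passage sentence."""
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))

    def idf(tok):
        # Smoothed inverse document frequency (an assumption of this sketch).
        return math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0

    def vectorize(tokens):
        return {tok: count * idf(tok) for tok, count in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(v * b.get(tok, 0.0) for tok, v in a.items())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vectorize(question.lower().split())
    return [cosine(q_vec, vectorize(doc)) for doc in docs]


def token_level_bias(question, sentences):
    """Every token inherits the score of the sentence that contains it."""
    scores = tfidf_sentence_scores(question, sentences)
    return [[s] * len(sent.split()) for s, sent in zip(scores, sentences)]
```

Because all tokens in a sentence share one score, this bias can only express a preference between sentences, never between answer spans inside a sentence.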
Main Model: We use an updated version of BiDAF (Seo et al., 2017), which uses the fasttext word vectors (Mikolov et al., 2018), includes an additional recurrent layer, and simplifies the prediction stage (see Appendix D).
Results: Table 5 shows the results. We find the bias product method improves performance by up to 3 points, and the learned-mixin +H model achieves up to a 9 point gain. The importance of including the entropy penalty is explained by the fact that, without the penalty, the model learns to ignore the bias by setting $g(x_i)$ close to zero. For example, on the AddSent dataset with the TF-IDF Filtered bias, the learned-mixin ensemble sets $g(x_i)$ to an average of 0.13, while the learned-mixin +H ensemble increases that to 5.16. The high values are likely caused by the fact that the bias-only model is very weak, since it assigns the same score to each token in each sentence, so the model can often scale it by large values. As expected, we get better results using the TF-IDF Filtered bias, which is more closely tailored to how the test set was constructed.

TriviaQA-CP
Data: We construct a changing-prior QA dataset from TriviaQA (Joshi et al., 2017) by categorizing questions into three classes, Person, Location, and Other, based on what kind of entity they are asking about. During training, we hold out all the person questions or all the location questions from the train set, and evaluate on the person or location questions in the TriviaQA dev set. Details can be found in Appendix E.
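The changing-prior split itself is simple to express. In the sketch below, each question is a dictionary with a hypothetical `"type"` field holding one of the three classes. The field name and the function name are assumptions for illustration.

```python
def changing_prior_split(train, dev, held_out):
    """Remove held_out questions from train and keep only them in dev.

    train, dev: lists of dicts with a "type" key in {"person", "location", "other"}
    held_out:   the class to hold out, e.g. "person"
    """
    new_train = [q for q in train if q["type"] != held_out]
    new_dev = [q for q in dev if q["type"] == held_out]
    return new_train, new_dev
```

Calling it once with `"person"` and once with `"location"` yields the two versions of the dataset.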

Bias-Only Model: The bias-only model uses NER tags, identified by running the Stanford NER Tagger (Finkel et al., 2005) on the passage, as input. We only apply the model to tokens that have a NER tag, and assign all other tokens the average score given to the tokens with NER tags in order to prevent the model from reflecting a preference for entity tokens in general.
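The score-filling step for untagged tokens can be sketched as follows. Representing untagged tokens with `None`, and returning zeros when a passage has no tagged tokens, are assumptions of this illustration.

```python
def fill_untagged_scores(scores):
    """Give tokens without a NER tag the mean score of the tagged tokens.

    scores: one entry per passage token; a float for tokens with a NER tag,
            None for tokens without one (representation is an assumption).
    """
    tagged = [s for s in scores if s is not None]
    if not tagged:
        # No entities in the passage: fall back to a flat score (assumption).
        return [0.0] * len(scores)
    mean = sum(tagged) / len(tagged)
    return [mean if s is None else s for s in scores]
```

After filling, the mean score over the untagged tokens equals the mean over the tagged tokens, so the bias expresses preferences among entity types without favoring entity tokens as a whole.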
Main Model: We use a larger version of the model used for Adversarial SQuAD (see Appendix D), to account for the larger dataset.
Results: Table 6 shows the results. Similar to Adversarial SQuAD, the bias product method is moderately effective, and the learned-mixin ensemble is superior as long as a suitable regularizer is applied. We again observe that the learned-mixin method tends to push $g(x_i)$ close to zero without the entropy penalty (average of 0.25 without the penalty vs. 5.01 with the penalty on the Location dev set). We see smaller gains on the person dataset. One possible cause is that differentiating between people and other named entities, such as organizations or groups, is difficult for the main model, and as a result it does not learn a strong non-person prior even without the use of a debiasing method.

Discussion
Despite tackling a diverse range of problems, we were able to improve out-of-domain performance in all settings. The bias product method works consistently, but can almost always be significantly out-performed by the learned-mixin method with an appropriate entropy penalty. The reweight baseline improved performance on HANS, but was relatively ineffective in other cases.
Increasing the out-of-domain performance usually comes at the cost of losing some in-domain performance, which is unsurprising since the biased approaches we are removing are helpful on the in-domain data. TriviaQA-CP stands out as a case where this trade-off is minimal.
A possible issue is that our methods reduce the need for the model to solve examples the bias-only model works well on (since the ensemble's prediction will already be mostly correct for those examples), which effectively reduces the amount of training data. An ideal approach would be to block the model from using the bias-only method, and require it to solve examples the bias-only method solves through other means. We suspect this will necessitate a more clear-box method since it requires doing fine-grained regularization of how the model is solving individual examples.

Conclusion
Our key contribution is a method of using human knowledge about what methods will not generalize well to improve model robustness to domain-shift. Our approach is to train a robust model in an ensemble with a pre-trained naive model, and then use the robust model alone at test time. Extensive experiments show that our method works well on two adversarial datasets, and two changing-prior datasets, including a 12 point gain on VQA-CP. Future work includes learning to automatically detect dataset bias, which would allow our method to be applicable with less specific prior knowledge.

B Co-Attention NLI Model
The model we use for NLI is based on ESIM (Chen et al., 2017). It has the following stages:
Embed: Embed the words using a character CNN, following what was done by Seo et al. (2017), and the fasttext crawl word embeddings (Mikolov et al., 2018), then run a shared BiLSTM over the results.
Co-Attention: Compute an attention matrix using the formulation from Seo et al. (2017), and use it to compute a context vector for each premise word (Bahdanau et al., 2015). Then build an augmented vector for each premise word by concatenating the word's embedding, the context vector, and the elementwise product of the two. Augmented vectors for the hypothesis are built in the same way using the transpose of the attention matrix.
Pool: Run another shared BiLSTM over the augmented vectors, and max-pool the results. The max-pooled vectors from the premise and hypothesis are fed into a fully-connected layer, and then into a softmax layer with three outputs to compute class probabilities.
We apply variational dropout at a rate of 0.2 between all layers, and to the recurrent states of the LSTM, and train the model for 30 epochs using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 32. The learning rate is decayed by 0.999 every 100 steps. We use 200 dimensional LSTMs and a 50 dimensional fully connected layer.
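The learning-rate schedule can be written as a one-line function. This sketch reads "decayed by 0.999 every 100 steps" as a staircase multiplication by 0.999. The base learning rate is not given in this section, so `base_lr` is a hypothetical parameter.

```python
def decayed_learning_rate(step, base_lr, decay=0.999, decay_steps=100):
    """Staircase exponential decay: multiply by `decay` every `decay_steps` steps."""
    return base_lr * decay ** (step // decay_steps)
```

For example, the rate is unchanged for steps 0 through 99 and is multiplied by 0.999 once step 100 is reached.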

C Fine-Grained HANS Results
We show the scores our methods achieve for the various heuristics used in HANS in Table 8. Our methods reduce the extent to which models naively guess entailment in all cases. Interestingly, the BERT model shows significantly degraded performance on the entailment examples when using the reweight and bias product method, but largely maintains its performance on those examples when using the learned-mixin method.

D Modified BiDAF QA Model
The model we use for QA is based on BiDAF (Seo et al., 2017). It has the following stages:
Embed: Embed the words using a character CNN following Seo et al. (2017) and the fasttext crawl word embeddings (Mikolov et al., 2018). Then run a BiLSTM over the results to get context-aware question embeddings and passage embeddings.
Bi-Attention: Apply the bi-directional attention mechanism from Seo et al. (2017) to produce question-aware passage embeddings.
Predict: Apply a fully connected layer, then two more BiLSTM layers, then a two dimensional linear layer to produce start and end scores for each token.

E TriviaQA-CP
In this section we discuss our changing-prior TriviaQA dataset, TriviaQA-CP. This dataset was built by training a classifier to identify TriviaQA (Joshi et al., 2017) questions as being about people, locations, or other topics, and then selecting an answer-containing passage for each question as context. There are two versions of this dataset: a person changing-priors dataset that was built by removing the person questions from the train set and using only person questions from the dev set for evaluation, and a location changing-priors dataset that was built by repeating this process for location questions. Statistics for these two sets are shown in Table 9. We review the three-step procedure we used to construct this dataset below.

Distantly Supervised Classification: We first train a preliminary question-type classifier using distant supervision. We noisily label person and location questions using a manually constructed set of patterns (e.g., questions with the phrase "What is the family name of..." are almost always about people), and by attempting to look up the answers in the Yago database (Suchanek et al., 2007) and checking if the answer belongs to a person or location category. Questions that did not match either of these heuristics are labelled as other.
We use these labels to train a simple recurrent model that embeds the question using the fasttext word vectors, applies a 100 dimensional BiLSTM, max-pools, and then applies a softmax layer with 3 outputs. We train the model for 3 epochs using the Adam optimizer (Kingma and Ba, 2015), and apply 0.5 dropout to the embeddings and 0.2 dropout to the recurrent states and the output of the max-pooling layer.
Supervised Classification: Next we use higher quality labels to train a second linear classifier to re-calibrate the recurrent model's predictions, and to integrate its predictions with the distantly supervised heuristics.
An author manually labelled 1,100 questions, then a classifier was trained on those questions using the predictions from the recurrent model as features, as well as two additional features built from looking up the category of the answer in Yago as before. This classifier was then used to decide the final question classifications. Table 10 shows the accuracy of these classifiers. The final model achieves about 95% accuracy. We find about 25% of the questions are about people and about 20% of the questions are about locations.
Paragraph Selection: In TriviaQA, each question is paired with multiple documents. We simplify the task by selecting a single answer-containing paragraph for each question. We use the approach of Clark and Gardner (2018) to break up the documents into passages of at most 400 tokens, and rank the passages for each question using their linear paragraph ranker. Each question is then paired with the highest ranking paragraph that contains an answer.
Table 10: Accuracy, and per-class scores, on the manually annotated questions for the various question classification methods we used when building TriviaQA-CP.