Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference

Natural Language Inference (NLI) datasets often contain hypothesis-only biases---artifacts that allow models to achieve non-trivial performance without learning whether a premise entails a hypothesis. We propose two probabilistic methods to build models that are more robust to such biases and better transfer across datasets. In contrast to standard approaches to NLI, our methods predict the probability of a premise given a hypothesis and NLI label, discouraging models from ignoring the premise. We evaluate our methods on synthetic and existing NLI datasets by training on datasets containing biases and testing on datasets containing no (or different) hypothesis-only biases. Our results indicate that these methods can make NLI models more robust to dataset-specific artifacts, transferring better than a baseline architecture in 9 out of 12 NLI datasets. Additionally, we provide an extensive analysis of the interplay of our methods with known biases in NLI datasets, as well as the effects of encouraging models to ignore biases and fine-tuning on target datasets.


Introduction
Natural Language Inference (NLI) is often used to gauge a model's ability to understand a relationship between two texts (Cooper et al., 1996;Dagan et al., 2006). In NLI, a model is tasked with determining whether a hypothesis (a woman is sleeping) would likely be inferred from a premise (a woman is talking on the phone). 2 The development of new large-scale datasets has led to a flurry of various neural network architectures for solving NLI. However, recent work has found that many NLI datasets contain biases, or annotation artifacts, i.e., features present in hypotheses that enable models to perform surprisingly well using only the hypothesis, without learning the relationship between two texts (Gururangan et al., 2018;Poliak et al., 2018b;Tsuchiya, 2018). 3 For instance, in some datasets, negation words like "not" and "nobody" are often associated with a relationship of contradiction. As a ramification of such biases, models may not generalize well to other datasets that contain different or no such biases.
Recent studies have tried to create new NLI datasets that do not contain such artifacts, but many approaches to dealing with this issue remain unsatisfactory: constructing new datasets (Sharma et al., 2018) is costly and may still result in other artifacts; filtering "easy" examples and defining a harder subset is useful for evaluation purposes (Gururangan et al., 2018), but difficult to do on a large scale that enables training; and compiling adversarial examples (Glockner et al., 2018) is informative but again limited by scale or diversity. Instead, our goal is to develop methods that overcome these biases as datasets may still contain undesired artifacts despite annotation efforts.
Typical NLI models learn to predict an entailment label discriminatively given a premisehypothesis pair (Figure 1a), enabling them to learn hypothesis-only biases. Instead, we predict the premise given the hypothesis and the entailment label, which by design cannot be solved using data artifacts. While this objective is intractable, it motivates two approximate training methods for standard NLI classifiers that are more resistant to biases. Our first method uses a hypothesis-only classifier ( Figure 1b) and the second uses negative sampling by swapping premises between premisehypothesis pairs (Figure 1c).  Figure 1: Illustration of (a) the baseline NLI architecture, and our two proposed methods to remove hypothesis only-biases from an NLI model: (b) uses a hypothesis-only classifier, and (c) samples a random premise. Arrows correspond to the direction of propagation. Green or red arrows respectively mean that the gradient sign is kept as is or reversed. Gray arrow indicates that the gradient is not back-propagated -this only occurs in (c) when we randomly sample a premise, otherwise the gradient is back-propagated. f and g represent encoders and classifiers.
We evaluate the ability of our methods to generalize better in synthetic and naturalistic settings. First, using a controlled, synthetic dataset, we demonstrate that, unlike the baseline, our methods enable a model to ignore the artifacts and learn to correctly identify the desired relationship between the two texts. Second, we train models on an NLI dataset that is known to be biased and evaluate on other datasets that may have different or no biases. We observe improved results compared to a fully discriminative baseline in 9 out of 12 target datasets, indicating that our methods generate models that are more robust to annotation artifacts.
An extensive analysis reveals that our methods are most effective when the target datasets have different biases from the source dataset or no noticeable biases. We also observe that the more we encourage the model to ignore biases, the better it transfers, but this comes at the expense of performance on the source dataset. Finally, we show that our methods can better exploit small amounts of training data in a target dataset, especially when it has different biases from the source data.
In this paper, we focus on the transferability of our methods from biased datasets to ones having different or no biases. Elsewhere , we have analyzed the effect of these methods on the learned language representations, suggesting that they may indeed be less biased. However, we caution that complete removal of biases remains difficult and is dependent on the techniques used. The choice of whether to remove bias also depends on the goal; in an in-domain scenario certain biases may be helpful and should not necessarily be removed.
In summary, in this paper we make the follow-ing contributions: • Two novel methods to train NLI models that are more robust to dataset-specific artifacts. • An empirical evaluation of the methods on a synthetic dataset and 12 naturalistic datasets. • An extensive analysis of the effects of our methods on handling bias.

Motivation
A training instance for NLI consists of a hypothesis sentence H, a premise statement P , and an inference label y. A probabilistic NLI model aims to learn a parameterized distribution p θ (y | P, H) to compute the probability of the label given the two sentences. We consider NLI models with premise and hypothesis encoders, f P,θ and f H,θ , which learn representations of P and H, and a classification layer, g θ , which learns a distribution over y. Typically, this is done by maximizing this discriminative likelihood directly, which will act as our baseline ( Figure 1a). However, many NLI datasets contain biases that allow models to perform non-trivially well when accessing just the hypotheses (Tsuchiya, 2018;Gururangan et al., 2018;Poliak et al., 2018b). This allows models to leverage hypothesis-only biases that may be present in a dataset. A model may perform well on a specific dataset, without identifying whether P entails H. Gururangan et al. (2018) argue that "the bulk" of many models' "success [is] attribute[d] to the easy examples". Consequently, this may limit how well a model trained on one dataset would perform on other datasets that may have different artifacts.
Consider an example where P and H are strings from {a, b, c}, and an environment where P en-tails H if and only if the first letters are the same, as in synthetic dataset A. In such a setting, a model should be able to learn the correct condition for P to entail H. 4 Imagine now that an artifact c is appended to every entailed H (synthetic dataset B). A model of y with access only to the hypothesis side can fit the data perfectly by detecting the presence or absence of c in H, ignoring the more general pattern. Therefore, we hypothesize that a model that learns p θ (y | P, H) by training on such data would be misled by the bias c and would fail to learn the relationship between the premise and the hypothesis. Consequently, the model would not perform well on the unbiased synthetic dataset A.
Instead of maximizing the discriminative likelihood p θ (y | P, H) directly, we consider maximizing the likelihood of generating the premise P conditioned on the hypothesis H and the label y: p(P | H, y). This objective cannot be fooled by hypothesis-only features, and it requires taking the premise into account. For example, a model that only looks for c in the above example cannot do better than chance on this objective. However, as P comes from the space of all sentences, this objective is much more difficult to estimate.

Training Methods
Our goal is to maximize log p(P | H, y) on the training data. While we could in theory directly parameterize this distribution, for efficiency and simplicity we instead write it in terms of the standard p θ (y | P, H) and introduce a new term to approximate the normalization: log p(P | y, H) = log p θ (y | P, H)p(P | H) p(y | H) .
Throughout we will assume p(P | H) is a fixed constant ( justified by the dataset assumption that, lacking y, P and H are independent and drawn at random). Therefore, to approximately maximize this objective we need to estimate p(y | H). We propose two methods for doing so.

Method 1: Hypothesis-only Classifier
Our first approach is to estimate the term p(y | H) directly. In theory, if labels in an NLI dataset depend on both premises and hypothesis (which Poliak et al. (2018b) call "interesting NLI"), this should be a uniform distribution. However, as discussed above, it is often possible to correctly predict y based only on the hypothesis. Intuitively, this model can be interpreted as training a classifier to identify the (latent) artifacts in the data. We define this distribution using a shared representation between our new estimator p φ,θ (y | H) and p θ (y | P, H). In particular, the two share an embedding of H from the hypothesis encoder f H,θ . The additional parameters φ are in the final layer g φ , which we call the hypothesis-only classifier. The parameters of this layer φ are updated to fit p(y | H) whereas the rest of the parameters in θ are updated based on the gradients of log p(P | y, H).
Training is illustrated in Figure 1b. This interplay is controlled by two hyper-parameters. First, the negative term is scaled by a hyper-parameter α. Second, the updates of g φ are weighted by β. We therefore minimize the following multitask loss functions (shown for a single example): We implement these together with a gradient reversal layer (Ganin & Lempitsky, 2015). As illustrated in Figure 1b, during back-propagation, we first pass gradients through the hypothesis-only classifier g φ and then reverse the gradients going to the hypothesis encoder g H,θ (potentially scaling them by β). 5

Method 2: Negative Sampling
As an alternative to the hypothesis-only classifier, our second method attempts to remove annotation artifacts from the representations by sampling alternative premises. Consider instead writing the normalization term above as, where the expectation is uniform and the last step is from Jensen's inequality. 6 As in Method 1, we define a separate p φ,θ (y | P , H) which shares the embedding layers from θ, f P,θ and f H,θ . However, as we are attempting to unlearn hypothesis bias, we block the gradients and do not let it update the premise encoder f P,θ . 7 The full setting is shown in Figure 1c.
To approximate the expectation, we use uniform samples P (from other training examples) to replace the premise in a (P , H)-pair, while keeping the label y. We also maximize p θ,φ (y | P , H) to learn the artifacts in the hypotheses. We use α ∈ [0, 1] to control the fraction of randomly sampled P 's (so the total number of examples remains the same). As before, we implement this using gradient reversal scaled by β.
Finally, we share the classifier weights between p θ (y | P, H) and p φ,θ (y | P , H). In a sense this is counter-intuitive, since p θ is being trained to unlearn bias, while p φ,θ is being trained to learn it. However, if the models are trained separately, they may learn to co-adapt with each other (Elazar & Goldberg, 2018). If p φ,θ is not trained well, we might be fooled to think that the representation does not contain any biases, while in fact they are still hidden in the representation. For some evidence that this indeed happens when the models are trained separately, see . 8 6 There are more developed and principled approaches in language modeling for approximating this partition function without having to make this assumption. These include importance sampling (Bengio & Senecal, 2003), noisecontrastive estimation (Gutmann & Hyvärinen, 2010), and sublinear partition estimation . These are more difficult to apply in the setting of sampling full sentences from an unknown set. We hope to explore methods for applying them in future work. 7 A reviewer asked about gradient blocking. Our motivation was that, for a random premise, we do not have reliable information to update its encoder. However, future work can explore different configurations of gradient blocking. 8 A similar situation arises in neural cryptography (Abadi

Experimental Setup
To evaluate how well our methods can overcome hypothesis-only biases, we test our methods on a synthetic dataset as well as on a wide range of existing NLI datasets. The scenario we aim to address is when training on a source dataset with biases and evaluating on a target dataset with different or no biases. We first describe the data and experimental setup before discussing the results.  , since it is known to contain significant annotation artifacts. We evaluate the robustness of our methods on other, target datasets.
As target datasets, we use the 10 datasets investigated by Poliak et al. (2018b) in their hypothesisonly study, plus two test sets: GLUE's diagnostic test set, which was carefully constructed to not contain hypothesis-biases (Wang et al., 2018), and SNLI-hard, a subset of the SNLI test set that is thought to have fewer biases (Gururangan et al., 2018). The target datasets include humanjudged datasets that used automatic methods to pair premises and hypotheses, and then relied on humans to label the pairs: SCITAIL , ADD-ONE-RTE (Pavlick & Callison-Burch, 2016)  Detailed descriptions of these datasets can be found in Poliak et al. (2018b). 10 We leave additional NLI datasets, such as the Diverse NLI Collection (Poliak et al., 2018a), for future work. 11 Many NLI models encode P and H separately (Rocktäschel et al., 2016;Mou et al., 2016;Liu et al., 2016;Cheng et al., 2016;Chen et al., 2017), although some share information between the encoders via attention Duan et al., 2018). 12 Specifically, representations are concatenated, subtracted, and multiplied element-wise. methods for mitigating biases use the same technique for representing and combining sentences. Additional implementation details are provided in Appendix A.2.
For both methods, we sweep hyper-parameters α, β over {0.05, 0.1, 0.2, 0.4, 0.8, 1.0}. For each target dataset, we choose the best-performing model on its development set and report results on the test set. 13

Synthetic Experiments
To examine how well our methods work in a controlled setup, we train on the biased dataset (B), but evaluate on the unbiased test set (A). As expected, without a method to remove hypothesisonly biases, the baseline fails to generalize to the test set. Examining its predictions, we found that the baseline model learned to rely on the presence/absence of the bias term c, always predicting TRUE/FALSE respectively. Table 1 shows the results of our two proposed methods. As we increase the hyper-parameters α and β, our methods initially behave like the baseline, learning the training set but failing on the test set. However, with strong enough hyperparameters (moving towards the bottom in the tables), they perform perfectly on both the biased training set and the unbiased test set. For Method 1, stronger hyper-parameters work better.  Method 2, in particular, breaks down with too many random samples (increasing α), as expected. We also found that Method 1 did not require as strong β as Method 2. From the synthetic experiments, it seems that Method 1 learns to ignore the bias c and learn the desired relationship between P and H across many configurations, while Method 2 requires much stronger β.

Results on existing NLI datasets
Table 2 (left block) reports the results of our proposed methods compared to the baseline in application to the NLI datasets. The method using the hypothesis-only classifier to remove hypothesis-only biases from the model (Method 1) outperforms the baseline in 9 out of 12 target datasets (∆ > 0), though most improvements are small. The training method using negative sampling (Method 2) only outperforms the baseline in 5 datasets, 4 of which are cases where the other method also outperformed the baseline. These gains are much larger than those of Method 1. We also report results of the proposed methods on the SNLI test set (right block). As our results improve on the target datasets, we note that Method 1's performance on SNLI does not drastically decrease (small ∆), even when the improvement on the target dataset is large (for example, in SPR). For this method, the performance on SNLI drops by just an average of 1.11 (0.65 STDV). For Method 2, there is a large decrease on SNLI as results drop by an average of 11.19 (12.71 STDV). For these models, when we see large improvement on a target dataset, we often see a large drop on SNLI. For example, on ADD-ONE-RTE, Method 2 outperforms the baseline by roughly 17% but performs almost 50% lower on SNLI. Based on this, as well as the results on the synthetic dataset, Method 2 seems to be much more unstable and highly dependent on the right hyper-parameters.

Analysis
Our results demonstrate that our approaches may be robust to many datasets with different types of bias. We next analyze our results and explore modifications to the experimental setup that may improve model transferability across NLI datasets.

Interplay with known biases
A priori, we expect our methods to provide the most benefit when a target dataset has no hypothesis-only biases or such biases that differ from ones in the training data. Previous work estimated the amount of bias in NLI datasets by comparing the performance of a hypothesis-only classifier with the majority baseline (Poliak et al., 2018b). If the classifier outperforms the baseline, the dataset is said to have hypothesis-only biases. We follow a similar idea for estimating how similar the biases in a target dataset are to those in the source dataset. We compare the performance of a hypothesis-only classifier trained on SNLI and evaluated on each target dataset, to a majority baseline of the most frequent class in each target dataset's training set (Maj). We also compare to a hypothesis-only classifier trained and tested on Figure 2: Accuracies of majority and hypothesis-only baselines on each dataset (x-axis). The datasets are generally ordered by increasing difference between a hypothesis-only model trained on the target dataset (green) compared to trained on SNLI (yellow). each target dataset. 14 Figure 2 shows the results. When the hypothesis-only model trained on SNLI is tested on the target datasets, the model performs below Maj (except for MNLI), indicating that these target datasets contain different biases than those in SNLI. The largest difference is on SPR: a hypothesis-only model trained on SNLI performs over 50% worse than one trained on SPR. Indeed, our methods lead to large improvements on SPR (Table 2), indicating that they are especially helpful when the target dataset contains different biases. On MNLI, this hypothesis-only model performs 10% above Maj, and roughly 20% worse compared to when trained on MNLI, suggesting that MNLI and SNLI have similar biases. This may explain why our methods only slightly outperform the baseline on MNLI ( Table 2).
The hypothesis-only model trained on each target dataset did not outperform Maj on DPR, ADD-ONE-RTE, SICK, and MPE, suggesting that these datasets do not have noticeable hypothesis-only biases. Here, as expected, we observe improvements when our methods are tested on these datasets, to varying degrees (from 0.45 on MPE to 31.11 on SICK). We also see improvements on datasets with biases (high performance of training on each dataset compared to the corresponding majority baseline), most noticeably SPR. The only exception seems to be SCITAIL, where we do not improve despite it having different biases than SNLI. However, when we strengthen α and β (below), Method 1 outperforms the baseline. 14 A reviewer noted that this method may miss similar bias "types" that are achieved through different lexical items. We note that our use of pre-trained word embeddings might mitigate this concern.
Finally, both methods obtain improved results on the GLUE diagnostic set, designed to be biasfree. We do not see improvements on SNLI-hard, indicating it may still have biases -a possibility acknowledged by Gururangan et al. (2018).

Stronger hyper-parameters
In the synthetic experiment, we found that increasing α and β improves the models' ability to generalize to the unbiased dataset. Does the same apply to natural NLI datasets? We expect that strengthening the auxiliary losses (L 2 in our methods) during training will hurt performance on the original data (where biases are useful), but improve on the target data, which may have different or no biases (Figure 2). To test this, we increase the hyperparameter values during training; we consider the range {1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0}. 15 While there are other ways to strengthen our methods, e.g., increasing the number or size of hidden layers (Elazar & Goldberg, 2018), we are interested in the effect of α and β as they control how much bias is subtracted from our baseline model. Table 3 shows the results of Method 1 with stronger hyper-parameters on the existing NLI datasets. As expected, performance on SNLI test sets (SNLI and SNLI-hard in Table 3) decreases more, but many of the other datasets benefit from stronger hyper-parameters (compared with Table 2). We see the largest improvement on SICK, achieving over 10% increase compared to the 1.8% gain in Table 2. As for Method 2, we found large drops in quality even in our basic configurations (Appendix A.3), so we do not increase the hyper-parameters further. This should not be too surprising, adding too many random premises will lead to a model's degradation.

Fine-tuning on target datasets
Our main goal is to determine whether our methods help a model perform well across multiple datasets by ignoring dataset-specific artifacts. In turn, we did not update the models' parameters on other datasets. But, what if we are given different amounts of training data for a new NLI dataset?
To determine if our approach is still helpful, we updated four models on increasing sizes of training data from two target datasets ( MNLI and 15 The synthetic setup required very strong hyperparameters, possibly due to the clear-cut nature of the task. In the natural NLI setting, moderately strong values sufficed.    Figure 3 shows the results on the dev sets. In MNLI, pre-training is very helpful when finetuning on a small amount of new training data, although there is little to no gain from pre-training with either of our methods compared to the baseline. This is expected, as we saw relatively small gains with the proposed methods on MNLI, and can be explained by SNLI and MNLI having simi- 16 We hold out 10K examples from the training set for dev as gold labels for the MNLI test set are not publicly available. We evaluate on MNLI's matched dev set to assure consistent domains when fine-tuning. lar biases. In SICK, pre-training with either of our methods is better in most data regimes, especially with very small amounts of target training data. 17 25% of the 7 Related Work Biases and artifacts in NLU datasets Many natural language undersrtanding (NLU) datasets contain annotation artifacts. Early work on NLI, also known as recognizing textual entailment (RTE), found biases that allowed models to perform relatively well by focusing on syntactic clues alone (Snow et al., 2006;Vanderwende & Dolan, 2006). Recent work also found artifacts in new NLI datasets (Tsuchiya, 2018;Gururangan et al., 2018;Poliak et al., 2018b).
Other NLU datasets also exhibit biases. In ROC Stories (Mostafazadeh et al., 2016), a story cloze dataset, Schwartz et al. (2017b) obtained a high performance by only considering the candidate endings, without even looking at the story context. In this case, stylistic features of the candidate endings alone, such as the length or certain words, were strong indicators of the correct ending (Schwartz et al., 2017a;Cai et al., 2017). A similar phenomenon was observed in reading comprehension, where systems performed non-trivially well by using only the final sentence in the passage or ignoring the passage altogether (Kaushik & Lipton, 2018). Finally, multiple studies found nontrivial performance in visual question answering (VQA) by using only the question, without access to the image, due to question biases (Zhang et al., 2016;Kafle & Kanan, 2016, 2017Goyal et al., 2017;Agrawal et al., 2017).
Transferability across NLI datasets It has been known that many NLI models do not transfer across NLI datasets. Chen Zhang's thesis (Zhang, 2010) focused on this phenomena as he demonstrated that "techniques developed for textual entailment" datasets, e.g., RTE-3, do not transfer well to other domains, specifically conversational entailment (Zhang & Chai, 2009, 2010. Bowman et al. (2015) and Williams et al. (2018) demonstrated (specifically in their respective Tables 7 and 4) how models trained on SNLI and MNLI may not transfer well across other NLI datasets like SICK. Talman & Chatzikyriakidis (2018) recently reported similar findings using many advanced deep-learning models.
Domain-adversarial neural networks aim to increase robustness to domain change, by learning to be oblivious to the domain using gradient reversals (Ganin et al., 2016). Our methods rely similarly on gradient reversals when encouraging models to ignore dataset-specific artifacts. One distinction is that domain-adversarial networks require knowledge of the domain at training time, while our methods learn to ignore latent artifacts and do not require direct supervision in the form of a domain label.
Others have attempted to remove biases from learned representations, e.g., gender biases in word embeddings (Bolukbasi et al., 2016) or sensitive information like sex and age in text representations (Li et al., 2018). However, removing such attributes from text representations may be difficult (Elazar & Goldberg, 2018). In contrast to this line of work, our final goal is not the removal of such attributes per se; instead, we strive for more robust representations that better transfer to other datasets, similar to Li et al. (2018).
Recent work has applied adversarial learning to NLI. Minervini & Riedel (2018) generate adversarial examples that do not conform to logi-cal rules and regularize models based on those examples. Similarly, Kang et al. (2018) incorporate external linguistic resources and use a GAN-style framework to adversarially train robust NLI models. In contrast, we do not use external resources and we are interested in mitigating hypothesisonly biases. Finally, a similar approach has recently been used to mitigate biases in VQA (Ramakrishnan et al., 2018;Grand & Belinkov, 2019).

Conclusion
Biases in annotations are a major source of concern for the quality of NLI datasets and systems. We presented a solution for combating annotation biases by proposing two training methods to predict the probability of a premise given an entailment label and a hypothesis. We demonstrated that this discourages the hypothesis encoder from learning the biases to instead obtain a less biased representation. When empirically evaluating our approaches, we found that in a synthetic setting, as well as on a wide-range of existing NLI datasets, our methods perform better than the traditional training method to predict a label given a premise-hypothesis pair. Furthermore, we performed several analyses into the interplay of our methods with known biases in NLI datasets, the effects of stronger bias removal, and the possibility of fine-tuning on the target datasets.
Our methodology can be extended to handle biases in other tasks where one is concerned with finding relationships between two objects, such as visual question answering, story cloze completion, and reading comprehension. We hope to encourage such investigation in the broader community.