There is Strength in Numbers: Avoiding the Hypothesis-Only Bias in Natural Language Inference via Ensemble Adversarial Training

Natural Language Inference (NLI) datasets contain annotation artefacts resulting in spurious correlations between the natural language utterances and their respective entailment classes. These artefacts are exploited by neural networks even when only considering the hypothesis and ignoring the premise, leading to unwanted biases. Previous work proposed tackling this problem via adversarial training, but this leads to learned sentence representations that still suffer from the same biases (Belinkov et al., 2019b). As a solution, we propose using an ensemble of adversaries during training, encouraging the model to jointly decrease the accuracy of these different adversaries while fitting the data. We show that using an ensemble of adversaries can prevent the bias from being relearned after the model training is completed, further improving how well the model generalises to different NLI datasets. In particular, these models outperformed previous approaches when tested on 12 different NLI datasets not used in the model training. Finally, the optimal number of adversarial classifiers depends on the dimensionality of the sentence representations, with larger dimensional representations benefiting when trained with a greater number of adversaries.


Introduction
NLI datasets are known to contain artefacts associated with their human annotation processes (Gururangan et al., 2018). Neural models are particularly prone to picking up on these artefacts, relying on these biases and spurious correlations rather than acquiring a true semantic understanding of the task. Because these artefacts are often dataset specific (Poliak et al., 2018; Tsuchiya, 2018), models that rely on them consequently generalise poorly when tested on other datasets (Belinkov et al., 2019a).
Adversarial training is one way to alleviate this problem: an adversarial classifier is trained to learn from the artefacts, while the model is trained to reduce the performance of this classifier while fitting the task data. In this context, adversarial training aims to produce sentence representations that are invariant to the bias, allowing the model to perform its task without being influenced by certain pre-defined artefacts in the data, and improve its generalisation properties. Belinkov et al. (2019a) have shown that adversarial training improves how well models generalise to different datasets, with these models yielding more accurate results on 9 out of 12 of the datasets tested.
However, after adversarial training, the biases are not necessarily removed from sentence representations produced by the model (Elazar and Goldberg, 2018; Belinkov et al., 2019b), with little change in the ability of a classifier to predict these biases from the trained representations. This strongly undermines the idea that adversarial training prevents models learning from artefacts in the training data.
In the case of Belinkov et al. (2019a), adversarial training is applied to remove the hypothesis-only bias of neural models trained on SNLI. However, after the adversarial training, the hypothesis-only bias can still be almost fully relearned from the model's sentence representations (Belinkov et al., 2019b). While the adversarially trained models performed better when applied to a range of NLI datasets, the performance does not improve on SNLI-hard (Belinkov et al., 2019a), a subset of SNLI containing examples without the same hypothesis-only bias (Gururangan et al., 2018).
In this paper, we propose using an ensemble of adversarial classifiers to remove the hypothesis-only bias from a model's sentence representations. This method aims to produce sentence representations where the bias can no longer be detected, further improving how well the models generalise to other datasets. Moreover, we show that the de-biased models perform better on SNLI-hard compared to models where the hypothesis-only bias is still present. Lastly, we investigate changing the dimensionality of the sentence representations to see whether this impacts the optimal number of adversaries required to de-bias a model. Our findings show that the higher the dimensionality of the sentence representations, the more adversaries are required to remove this bias.
To summarise the research hypotheses addressed in this paper: i) We investigate whether using an ensemble of adversarial classifiers can remove the hypothesis-only bias within NLI models. For large enough dimensions, this method achieved a statistically significant reduction in the bias compared to using only one adversarial classifier. ii) We test whether the models de-biased with an ensemble of adversaries generalise better to other NLI datasets with different hypothesis-only biases. This proves to be the case, improving model accuracy scores across most of the datasets. Importantly, the de-biased models also perform better on SNLI-hard, producing a statistically significant improvement compared to a baseline model. iii) We inspect the optimal number of adversaries to use, depending on the dimensionality of the model sentence representation. As this dimensionality is increased, more adversaries are required to de-bias the model. iv) We compare the effect of adversarial training with a linear classifier to using a non-linear multi-layer perceptron as the adversary. These results show that using a more complex adversarial classifier is not always beneficial. Instead, the best choice of adversary depends on the classifier being used to relearn the bias after the sentence representation has been trained.

Related Work
The Hypothesis-Only Bias. Gururangan et al. (2018) and Tsuchiya (2018) demonstrated how models can predict the class within the SNLI dataset when only processing the hypothesis, reaching accuracy scores as high as twice the majority baseline (67% vs. 34%). This is possible due to hypothesis-only biases, such as the observation that negation words ("no" or "never") are more commonly used in contradiction hypotheses (Gururangan et al., 2018; Poliak et al., 2018). The hypothesis sentence length is another example of an artefact that models can learn from, with entailment hypotheses being shorter than either contradiction or neutral hypotheses (Gururangan et al., 2018). Tsuchiya (2018) showed that the hypothesis-only bias predictions are significantly better than the majority baseline for SNLI, although this was not the case for the SICK dataset (Marelli et al., 2014). Poliak et al. (2018) find that human-elicited datasets such as SNLI and MultiNLI, i.e. datasets where humans were asked to create a hypothesis for a given premise, have the largest hypothesis-only bias. This bias is also dataset specific, with Belinkov et al. (2019a) finding that only MultiNLI shares some of the same hypothesis-only bias as the SNLI dataset. For the other datasets tested, an SNLI-trained hypothesis-only bias classifier performed worse than a majority classifier.
Generalisation to Other Datasets. Bowman et al. (2015) and Williams et al. (2018) show that models trained on the SNLI and MultiNLI datasets do not necessarily learn good representations for other NLI datasets, such as SICK. Analogous results were also reported by Talman and Chatzikyriakidis (2018) for more complex models. Gururangan et al. (2018) and Tsuchiya (2018) identify how NLI models perform worse on hard examples, which are defined as the examples that a hypothesis-only model has misclassified. This suggests that the success of NLI models may be overstated, with models relying on artefacts in their training data to achieve high performance (Gururangan et al., 2018).
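To make the notion of a hard example concrete, the following minimal sketch selects the examples that a hypothesis-only predictor misclassifies. The rule-based predictor, its default label, and the four toy examples are hypothetical and only illustrate the definition; the cited work uses trained hypothesis-only classifiers, and only the negation words "no" and "never" are taken from the text above.

```python
# Toy hypothesis-only predictor: it never looks at the premise.
NEGATION_WORDS = {"no", "never"}  # negation cues mentioned in the text


def hypothesis_only_predict(hypothesis):
    """Predict an NLI label from the hypothesis alone (illustrative rule)."""
    tokens = {token.strip(".,!?").lower() for token in hypothesis.split()}
    if tokens & NEGATION_WORDS:
        return "contradiction"
    return "entailment"  # arbitrary default, an assumption of this sketch


def hard_subset(examples):
    """Keep the examples that the hypothesis-only predictor gets wrong."""
    return [
        ex for ex in examples
        if hypothesis_only_predict(ex["hypothesis"]) != ex["label"]
    ]


# Hypothetical examples, not drawn from SNLI.
EXAMPLES = [
    {"premise": "A dog runs in a park.",
     "hypothesis": "No dog is outside.", "label": "contradiction"},
    {"premise": "A dog runs in a park.",
     "hypothesis": "An animal is outside.", "label": "entailment"},
    {"premise": "A woman is sleeping.",
     "hypothesis": "The woman is awake.", "label": "contradiction"},
    {"premise": "Children play football.",
     "hypothesis": "The children never played before.", "label": "neutral"},
]

hard = hard_subset(EXAMPLES)
```

In this toy setting the last two examples form the hard subset, because the negation cue is either absent or misleading; a model that relies on the cue would be expected to lose accuracy on exactly these cases.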
Biases and Artefacts. SNLI and MultiNLI are not the only datasets to suffer from the presence of annotation artefacts and biases. In the past, machine reading datasets were also found to contain syntactic clues that were giving away the correct prediction (Vanderwende and Dolan, 2005; Snow et al., 2006). For instance, Kaushik and Lipton (2018) show that, in several reading comprehension datasets such as bAbI (Weston et al., 2016) and the Children's Book Test (Hill et al., 2016), it is possible to get non-trivial results by considering only the last passage of the paragraph. In visual question answering datasets, several studies found it is often possible to answer the question without looking at the corresponding image (Zhang et al., 2016; Kafle and Kanan, 2016; Goyal et al., 2017; Agrawal et al., 2018). Similarly, in the ROCStories corpus (Mostafazadeh et al., 2016), Schwartz et al. (2017) and Cai et al. (2017) show it is possible to achieve non-trivial prediction accuracy by only considering candidate endings and without taking the stories into account.
Learning Robust Models. Neural models are known to be vulnerable to so-called adversarial examples, i.e. instances explicitly crafted by an adversary to cause the model to make a mistake (Szegedy et al., 2014). Most recent work focuses on simple semantic-invariant transformations, showing that neural models can be overly sensitive to small modifications of the inputs and paraphrasing. For instance, Ribeiro et al. (2018) use a set of simple syntactic changes, such as replacing "What is" with "What's". Other semantics-preserving perturbations include typos (Hosseini et al., 2017), the addition of distracting sentences (Wang and Bansal, 2018; Jia and Liang, 2017), character-level perturbations (Ebrahimi et al., 2018), and paraphrasing (Iyyer et al., 2018). Minervini and Riedel (2018) propose searching for violations of constraints, such as the symmetry of contradiction and transitivity of entailment, for identifying where NLI models make mistakes. More robust models can be produced by training on these adversarial examples, preventing the models from making similar mistakes in the future.
Adversarial training is another effective procedure for creating more robust models (Wang et al., 2019): Ganin and Lempitsky (2015a) use adversarial training to improve domain adaptation, allowing models to learn features that are helpful for the model task but which are also invariant with respect to changes in the domain. This was achieved by jointly training two models, one to predict the class label and one to predict the domain, and then regularising the former model to decrease the accuracy of the latter via gradient reversal. Belinkov et al. (2019b) use adversarial training to remove the hypothesis-only bias from models trained on SNLI. However, while the bias appears to be removed during the model training, where the adversarial classifier performs poorly, after freezing the de-biased representation the bias can be almost fully recovered. Similarly, Elazar and Goldberg (2018) found that adversarial training and gradient reversal do not remove demographic information such as age or gender, with this information still present in the de-biased sentence representations.
There is some evidence that adversarial training produces more robust models: Belinkov et al. (2019a) found that models trained on the SNLI dataset using adversarial training generalise better to other datasets. However, these same models show degraded performance on SNLI-hard, which is supposedly the ideal dataset to test for generalisation as it resembles SNLI the most in terms of domain and style while lacking the examples with the largest bias. Therefore, while adversarial training has shown benefits in helping NLI models generalise better to different datasets (Belinkov et al., 2019a), the model sentence representations still retain the original biases (Belinkov et al., 2019b). This paper implements an ensemble of multiple adversaries to remove more of the model biases, producing models that generalise better to other NLI datasets compared to previous research findings.

Ensemble Adversarial Training
We follow an adversarial training approach for removing the hypothesis-only bias from sentence representations. Specifically, we generalise the adversarial training framework proposed by Belinkov et al. (2019a) to make use of multiple adversaries: n adversarial models are jointly trained to predict the relationship between premise and hypothesis given only the representation of the hypothesis (hypothesis-only adversaries). At the same time, a sentence encoder and a hypothesis-premise model are jointly trained to fit the training data, while decreasing the accuracy of these adversaries. Formally, given a hypothesis h and a premise p, the predictions of the hypothesis-premise model $\hat{y}$ and the i-th hypothesis-only adversary $\hat{y}_{a_i}$ can be formalised as follows:

$$\hat{y} = f_{\theta_c}\big(e_{\theta_e}(p),\, e_{\theta_e}(h)\big), \qquad \hat{y}_{a_i} = g_{\theta_{a_i}}\big(e_{\theta_e}(h)\big), \quad i = 1, \ldots, n,$$

where $e_{\theta_e}$, $f_{\theta_c}$ and $g_{\theta_{a_i}}$ denote the sentence encoder, the hypothesis-premise model and the i-th hypothesis-only adversary, $\hat{y}, \hat{y}_{a_i} \in \mathbb{R}^3$ are (unnormalised) score distributions over the three NLI classes, i.e. entailment, contradiction, and neutral, and $\theta_e$, $\theta_c$, $\theta_{a_i}$ respectively denote the parameters of the encoder, the hypothesis-premise model, and the i-th hypothesis-only adversary. The adversarial training procedure can be formalised as optimising the following minimax objective:

$$\min_{\theta_e, \theta_c}\; \max_{\theta_{a_1}, \ldots, \theta_{a_n}}\; \sum_{(p, h, y) \in D} \Big[ (1 - \lambda)\, L_{ce}(y, \hat{y}) - \frac{\lambda}{n} \sum_{i=1}^{n} L_{ce}(y, \hat{y}_{a_i}) \Big] \tag{1}$$

where D is a dataset, and $L_{ce}$ denotes the cross-entropy loss (Goodfellow et al., 2016). The hyperparameter $\lambda \in [0, 1]$ denotes the trade-off between the losses of the hypothesis-premise model and the hypothesis-only adversaries. Similarly to Belinkov et al. (2019a), we optimise the minimax objective in Eq. (1) using gradient reversal (Ganin and Lempitsky, 2015b), which leads to an optimisation procedure equivalent to the popular gradient descent ascent algorithm (Lin et al., 2019).
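As an illustration of how the ensemble enters the training signal, the sketch below evaluates a per-example objective and the gradient that would reach the encoder after gradient reversal. It assumes the task loss is weighted by 1 − λ and the summed adversary losses by λ/n; the division by n follows the Discussion, while the exact functional form and all function names are assumptions of this sketch rather than the implementation used in the experiments.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example from unnormalised class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[label]


def ensemble_objective(task_logits, adversary_logits, label, lam):
    """(1 - lam) * task loss minus (lam / n) * summed adversary losses.

    Assumed form: the encoder and hypothesis-premise model minimise this
    value, while each adversary maximises it, i.e. minimises its own loss.
    """
    n = len(adversary_logits)
    task_loss = cross_entropy(task_logits, label)
    adv_loss = sum(cross_entropy(a, label) for a in adversary_logits)
    return (1.0 - lam) * task_loss - (lam / n) * adv_loss


def encoder_gradient(task_grad, adversary_grads, lam):
    """Gradient reaching the encoder when adversary gradients are reversed.

    task_grad: gradient of the task loss w.r.t. the sentence representation.
    adversary_grads: one gradient per adversary w.r.t. the same representation.
    """
    n = len(adversary_grads)
    return [
        (1.0 - lam) * g_task - (lam / n) * sum(g[k] for g in adversary_grads)
        for k, g_task in enumerate(task_grad)
    ]
```

Under this weighting the total weight placed on the adversaries stays at λ regardless of n, so adding adversaries changes how many directions of the representation are penalised without increasing the overall pressure against the main task.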
To test the impact of using multiple adversarial classifiers when changing the dimensionality of the sentence representation, in our experiments we train with {1, 5, 10, 20} bias classifiers for {256, 512, 1024, 2048} dimensional sentence representations. The learned sentence representation is then frozen, and 20 adversarial classifiers are randomly reinitialised before they attempt to re-learn the hypothesis-only bias from the frozen de-biased sentence representation. The maximum accuracy from across the 20 adversarial classifiers is then reported after trying to remove the bias, showing the maximum bias that can still be learnt from the representation.
Model Architecture. Using the same experimental set-up as Belinkov et al. (2019a) and Poliak et al. (2018), an InferSent model (Conneau et al., 2017) is used with pretrained GloVe 300 dimensional word embeddings. The InferSent model architecture consists of a Long Short-Term Memory (LSTM) network encoder (Hochreiter and Schmidhuber, 1997), which creates a 2,048 dimensional sentence representation.
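The evaluation protocol can be summarised in a few lines. The sketch below enumerates the grid of conditions and reports the remaining bias as the maximum accuracy over the 20 re-trained classifiers; the function and constant names are hypothetical, and the training of the classifiers themselves is deliberately left out.

```python
from itertools import product

ADVERSARY_COUNTS = [1, 5, 10, 20]     # adversaries used during training
DIMENSIONS = [256, 512, 1024, 2048]   # sentence representation sizes
N_PROBES = 20                         # classifiers re-trained on the frozen encoder


def experiment_grid():
    """Every (dimension, number of adversaries) condition."""
    return list(product(DIMENSIONS, ADVERSARY_COUNTS))


def remaining_bias(probe_accuracies):
    """Maximum accuracy across the re-trained hypothesis-only classifiers."""
    if len(probe_accuracies) != N_PROBES:
        raise ValueError(f"expected {N_PROBES} accuracies")
    return max(probe_accuracies)
```

Taking the maximum rather than the mean gives a conservative estimate: a representation only counts as de-biased if none of the re-initialised classifiers manages to recover the bias.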
Significance Testing. We perform statistical testing to assess whether the difference between using one or five adversarial classifiers is significant. This involves repeating the experiments for both one and five adversarial classifiers with ten different random seeds for each of the dimensions considered. For each experiment, the de-biasing is performed before a classifier attempts to learn the bias again from the frozen sentence representations.
We use bootstrapping hypothesis testing (Efron and Tibshirani, 1993) to test the statistical significance by comparing the means from the two samples. We also provide p-values from a Mann-Whitney U-test (Mann and Whitney, 1947). The bootstrapping considers the null hypothesis that there is no difference between the mean bias re-learnt from using five adversarial classifiers compared to just using one adversarial classifier. In addition, we use a Bonferroni correction factor (Shaffer, 1995) of four when evaluating the p-values, taking into account the multiple hypothesis testing across each different dimension. P-values smaller than 0.05 are considered significant.
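For illustration, one common variant of a bootstrap test for a difference in means, followed by a Bonferroni correction with a factor of four, can be sketched as follows. The shifting of both samples to the pooled mean, the one-sided alternative, the number of resamples, and the function names are assumptions of this sketch and not necessarily the exact procedure used for Table 2.

```python
import random
from statistics import mean


def bootstrap_p_value(one_adv, five_adv, n_resamples=2000, seed=0):
    """One-sided bootstrap test of H0: both conditions have the same mean.

    Both samples are shifted to the pooled mean so that H0 holds, then
    resampled with replacement. The p-value is the share of resamples whose
    mean difference is at least as large as the observed difference.
    """
    rng = random.Random(seed)
    mean_one, mean_five = mean(one_adv), mean(five_adv)
    observed = mean_one - mean_five
    pooled = mean(list(one_adv) + list(five_adv))
    shifted_one = [x - mean_one + pooled for x in one_adv]
    shifted_five = [x - mean_five + pooled for x in five_adv]
    hits = 0
    for _ in range(n_resamples):
        resampled_one = rng.choices(shifted_one, k=len(shifted_one))
        resampled_five = rng.choices(shifted_five, k=len(shifted_five))
        if mean(resampled_one) - mean(resampled_five) >= observed:
            hits += 1
    return hits / n_resamples


def bonferroni(p_value, n_tests=4):
    """Multiply the p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)
```

Here `one_adv` and `five_adv` would each hold the ten re-learnt bias accuracies obtained with different random seeds for one dimensionality, and the corrected p-value would be compared against 0.05.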

Using Deeper Adversaries
We additionally use a multi-layer perceptron as a more complex adversarial classifier to understand whether the bias that can be re-learnt depends on the type of classifier used. For 512 and 2,048 dimensions, the experiments are repeated using non-linear classifiers instead of linear classifiers, both during the adversarial training and also afterwards when the classifiers try to re-learn the biases from the frozen sentence representations. Five adversaries are used for 512 dimensions compared to ten for the 2,048 dimensional representation.
We perform the experiments with three different scenarios: i) Non-linear classifiers are used during the adversarial training, but a linear classifier is used afterwards to try to re-learn the bias. ii) Linear classifiers are used during the adversarial training, but non-linear classifiers are used to try to re-learn the biases after the sentence representation is frozen. iii) Non-linear classifiers are used both during adversarial training and afterwards when trying to re-learn the biases from the frozen sentence representation. The non-linear multi-layer perceptron classifier consists of three linear layers, and two non-linear layers using tanh.
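The shape of such a non-linear adversary can be sketched as three linear layers interleaved with two tanh activations. The hidden width, the initialisation scheme, and the function names below are assumptions made for illustration; only the layer count, the tanh non-linearity, and the three output classes come from the description above.

```python
import math
import random


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(n_in, n_out, rng):
    """Small uniform weights and zero biases (an illustrative choice)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_mlp(n_in, n_hidden, n_classes=3, seed=0):
    """Parameters for linear -> tanh -> linear -> tanh -> linear."""
    rng = random.Random(seed)
    return [
        init_linear(n_in, n_hidden, rng),
        init_linear(n_hidden, n_hidden, rng),
        init_linear(n_hidden, n_classes, rng),
    ]


def mlp_adversary(params, h):
    """Unnormalised scores over the three NLI classes from a hypothesis vector."""
    (w1, b1), (w2, b2), (w3, b3) = params
    z = [math.tanh(v) for v in linear(w1, b1, h)]
    z = [math.tanh(v) for v in linear(w2, b2, z)]
    return linear(w3, b3, z)
```

A linear adversary corresponds to keeping only a single affine map from the representation to the three class scores, which is why the two kinds of classifier can detect different amounts of residual bias.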

Evaluating De-biased Sentence Encoders
After training the models on SNLI with adversarial training, we test these de-biased models on a range of other datasets to see whether they generalise better to different datasets. The performance of the de-biased models is compared to a baseline model trained on SNLI where no adversarial training has been performed.
By using different random seeds, we compare ten baseline SNLI-trained models with models using one adversary and 20 adversaries, with each of these models tested on SNLI-hard. We perform bootstrap hypothesis testing to understand whether there is a significant difference between using one adversary and the baseline models with no adversarial training. We have repeated this hypothesis testing to compare models de-biased using 20 adversaries to the baseline models.
We also test performance of the de-biased models on 12 different NLI datasets to understand whether models de-biased with an ensemble of adversaries perform better than the baseline and models de-biased with one adversary. The datasets analysed in these experiments are the same datasets tested by Belinkov et al. (2019a). While the previous results in this paper use an LSTM encoder, a BiLSTM has been implemented when evaluating the de-biased sentence encoders on other datasets. This is to ensure that the experiments are a fair like-for-like comparison with the results generated in previous work by Belinkov et al. (2019a). We apply the hyper-parameters that perform best on a dataset's validation set, in line with the experiments conducted by Belinkov et al. (2019a).

Results
An ensemble of multiple adversarial classifiers has been applied during model training to understand whether this reduces the amount of bias learnt by a model. Our results show that training a model with an ensemble of adversaries does reduce the model bias, doing so across each sentence representation dimensionality tested. Moreover, more adversaries are required for de-biasing larger dimensional sentence representations.
When training with just one adversarial classifier for a 2,048 dimensional sentence representation, the accuracy of this classifier only reached 39% during model training. However, the maximum of 19 other hypothesis-only bias classifiers reached 58% during this time. Additionally, after the sentence representation was frozen and the bias classifiers had a chance to re-learn the hypothesis-only bias, the maximum bias classifier accuracy increased to 62%. This result mirrors the findings of Belinkov et al. (2019b) where the bias is not removed from the sentence representation when using only one adversary.
For 2,048 dimensional sentence representations, as the number of adversaries is increased up to 20, less bias can be found in the resulting de-biased sentence representation. When the number of adversarial classifiers is increased from 1 to 20, the accuracy of the hypothesis-only bias classifiers reduces from 62% to 53%.
The more dimensions that the sentence representation has, the more of the hypothesis-only bias can be re-learnt for the same number of adversaries after the sentence representation is fully trained. In particular, when using a 256 dimensional sentence representation and five adversarial classifiers, the hypothesis-only bias can only be learnt with 42% accuracy after being retrained on the frozen de-biased representation.
The optimal number of adversarial classifiers depends on the dimensionality of the sentence representation. For 2,048 dimensions this is 20 adversaries, while for 256 dimensions this reduces to 5 adversaries (see Fig. 1 for the full results, or Table 1 for summary results).

Statistical Testing Multiple Adversaries
We have applied statistical testing to understand whether the improvements seen using an ensemble of adversarial classifiers are statistically significant. For sentence representations with 2,048, 1,024 or 512 dimensions, the improvement was statistically significant, although this was not the case for the smaller 256 dimensional representation.
For 2,048, 1,024 and 512 dimensional sentence representations, the statistical testing provides p-values smaller than 0.05. The null hypothesis is therefore rejected in these cases in favour of the alternative hypothesis that using five adversaries reduces the mean bias re-learnt from the sentence representations compared to using just one adversarial classifier (see Table 2). The result for 256 dimensions is not statistically significant.
The findings highlight how using multiple adversarial classifiers helps to reduce the bias that can be re-learnt from the trained de-biased sentence representation for dimensions greater than or equal to 512. Fig. 2 displays these results in a boxplot diagram.

Using Deeper Adversaries
To investigate the impact of changing the strength of the adversary on the bias that the model learns, multi-layer perceptrons are used as the adversarial classifiers during model training. The results show that more complex multi-layer perceptrons do not always perform better, and that the best choice of adversary depends on the type of classifier that tries to relearn the bias from the trained sentence representations.
When a non-linear model is used to re-learn the bias from the frozen sentence representation, less bias can be recovered if a non-linear model was used as the adversarial classifier during training instead of a linear adversarial classifier (see Fig. 3). Therefore, when using a more complex classifier to re-learn the bias, a model of at least the same complexity should be used in the adversarial training to remove these biases. If a linear classifier is used as the adversary, a non-linear classifier can find more bias when learning from the de-biased representation than a linear classifier can (see Fig. 4). The results also show that if a linear model is being used to re-learn the bias, then using a linear model as the adversary instead of a multi-layer perceptron reduces the amount of bias that can be recovered (see Fig. 3). This could suggest that the best approach is to use the same type of classifier for both the adversarial model and the classifier used to re-learn the bias afterwards. However, more adversarial classifiers may be required when using non-linear classifiers as adversaries, and therefore more experimentation is required to test this hypothesis.

Evaluating De-biased Encoders
The models trained with an ensemble of adversaries are applied to 12 different NLI datasets to test whether these de-biased models generalise better than models trained with either one or no adversarial classifier. The datasets tested include SNLI-hard, where de-biased models that are no longer influenced by the hypothesis-only bias are expected to perform better. Models trained using an ensemble of adversaries performed better across most of these datasets, including SNLI-hard where there is a statistically significant improvement compared to a baseline model with no adversarial training.
Models trained using one adversarial classifier achieved 1.1% higher accuracy on SNLI-hard than the baseline models, compared to 1.6% for models trained using 20 adversarial classifiers. However, we cannot reject the null hypothesis that the mean accuracy using one adversary is no better than the baseline, as the p-value (0.07) is greater than 0.05.
On the other hand, with a p-value of 0.015 we can accept the alternative hypothesis that models trained with 20 adversaries have a higher mean accuracy than the baseline models when tested on SNLI-hard. Across 8 of the 13 datasets analysed, models trained with an ensemble of 20 adversarial classifiers performed better than when using only one adversarial classifier (see Table 3). For three of the remaining datasets, the performance was the same between using one adversary and 20 adversaries. The performance when using an ensemble of adversaries was on average 0.7 points higher than when using one adversary, which in turn outperformed the baseline by 0.9 points.

Discussion
The paper finds that the higher the dimensionality of the sentence representations, the less effective a single adversary is at removing the bias stored within the model. These differences may explain why past research has sometimes found adversarial training to be an effective way to remove biases (Xie et al., 2017), while other research has found the bias remains hidden within the model sentence representations (Elazar and Goldberg, 2018; Belinkov et al., 2019b).
While Elazar and Goldberg (2018) already identified how ensembles of adversarial classifiers can help reduce the bias permanently, they were not able to re-learn the main task when more than 5 adversaries were used during training. In this paper we divide the weight given to the adversaries in the loss equation by the number of adversarial classifiers, helping to scale up the number of adversaries that can be used.

Conclusions and Future Work
The paper sets out to prevent NLI models learning from the hypothesis-only bias by using an ensemble of adversarial classifiers. Not only does this method reduce the biases contained in the model sentence representations, but these models generalise better to different NLI datasets than models trained using one adversary.
For an InferSent model with 2,048 dimensional representations, using an ensemble of adversarial classifiers reduced the bias stored in the sentence representations. On the other hand, using just one adversarial classifier removed little of the bias. The higher the dimensionality, the higher the optimal number of adversarial classifiers appears to be.
The representations de-biased with an ensemble of adversaries also generalised better to other datasets, improving on previous research efforts using only one adversarial classifier. Moreover, the models trained with an ensemble of adversaries also performed better when tested on SNLI-hard. This is the behaviour expected from de-biased models that no longer use the hypothesis-only bias to inform their predictions. However, after implementing the adversarial training, a non-linear classifier may still be able to detect the bias in the sentence representations where linear classifiers are not able to do so. It is also unclear whether adversarially training with a more complex classifier can be advantageous when trying to prevent a linear classifier from eventually re-learning the biases. Finally, while this paper has demonstrated the conditions under which biases can be removed from a linear classifier, preventing a non-linear classifier from learning the biases is more difficult and merits further experimentation.

Figure 1: Maximum bias classifier accuracy after the adversarial training, compared to the maximum bias classifier accuracy after the bias is re-learnt from the frozen de-biased sentence representation. The main NLI task accuracy is also displayed. Each dimension is tested using 0, 1, 5, 10 and 20 adversaries.
Maximum accuracy (%) from 20 bias classifiers when re-learning the hypothesis-only bias from the frozen de-biased sentence representation. The lowest accuracy figure for each dimension is highlighted, after testing 0, 1, 5, 10 and 20 adversaries.

Figure 2: Maximum accuracy scores of the bias classifiers when they are retrained on de-biased sentence representations for each of the experiments tested. Ten experiments were performed for each condition, using one or five adversaries for each dimension.

Figure 3: The bias re-learnt after using either a linear or non-linear classifier during the adversarial training, when the classifier used to re-learn the bias is a linear classifier (left hand side), or a non-linear classifier (right hand side). This has been tested in two different scenarios, each with a different number of dimensions and adversarial models.

Figure 4: The bias re-learnt after using a linear classifier during the adversarial training, when the classifier used to re-learn the bias is either a linear classifier or a non-linear classifier.

Table 2: p-values after performing Bootstrapping (B) and Mann-Whitney (MW) hypothesis tests, using a Bonferroni correction factor of 4. * indicates a statistically significant result with a p-value below 0.05. Highlighted values indicate that the mean is significantly smaller than its comparison mean value, using the bootstrapping p-values.

Table 3: Maximum accuracy (%) from 20 bias classifiers when trying to re-learn the hypothesis-only bias from the frozen de-biased sentence representation. The lowest accuracy figure for each dimension is highlighted.