Towards Debiasing NLU Models from Unknown Biases

NLU models often exploit biases to achieve high dataset-specific performance without properly learning the intended task. Recently proposed debiasing methods are shown to be effective in mitigating this tendency. However, these methods rely on a major assumption that the types of bias should be known a-priori, which limits their application to many NLU tasks and datasets. In this work, we present the first step to bridge this gap by introducing a self-debiasing framework that prevents models from mainly utilizing biases without knowing them in advance. The proposed framework is general and complementary to the existing debiasing methods. We show that it allows these existing methods to retain the improvement on the challenge datasets (i.e., sets of examples designed to expose models' reliance on biases) without specifically targeting certain biases. Furthermore, the evaluation suggests that applying the framework results in improved overall robustness.


Introduction
Neural models often achieve impressive performance on many natural language understanding (NLU) tasks by leveraging biased features, i.e., superficial surface patterns that are spuriously associated with the target labels (Gururangan et al., 2018). Recently proposed debiasing methods are effective in mitigating the impact of this tendency, and the resulting models are shown to perform better beyond the training distribution: they improve performance on challenge test sets that are designed such that relying on the spurious association leads to incorrect predictions. Prevailing debiasing methods, e.g., example reweighting (Schuster et al., 2019), confidence regularization (Utama et al., 2020), and model ensembling (Clark et al., 2019; Mahabadi et al., 2020), are agnostic to the model's architecture as they operate by adjusting the training loss to account for biases. Namely, they first identify biased examples in the training data and down-weight their importance in the training loss so that models focus on learning from harder examples. While promising, these model-agnostic methods rely on the assumption that the specific types of biased features (e.g., lexical overlap) are known a-priori. This assumption, however, limits their application to many NLU tasks and datasets, because manually characterizing the spurious biases depends on researchers' intuition and task-specific insights; these biases may range from simple word/n-gram co-occurrence (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Schuster et al., 2019) to more complex stylistic and lexico-syntactic patterns (Zellers et al., 2019; Snow et al., 2006; Vanderwende and Dolan, 2006). Existing datasets, as well as newly created ones (Zellers et al., 2019; Sakaguchi et al., 2020; Nie et al., 2019b), are therefore still very likely to contain biased patterns that remain unknown without an in-depth analysis of each individual dataset (Sharma et al., 2018).
In this paper, we propose a new strategy to enable the existing debiasing methods to be applicable in settings where there is little to no prior information about the biases. Specifically, models should automatically identify potentially biased examples without being pinpointed at a specific bias in advance. Our work makes the following novel contributions in this direction of automatic bias mitigation.
First, we analyze the learning dynamics of a large pre-trained model such as BERT (Devlin et al., 2019) on a dataset injected with a synthetic and controllable bias. We show that, in very small data settings, models exhibit a distinctive response to synthetically biased examples: they rapidly increase the accuracy (→ 100%) on the biased test set while performing poorly on other sets, indicating that they mainly rely on biases. Second, we present a self-debiasing framework within which two models of the same architecture are pipelined to address unknown biases. Using the insight from the synthetic dataset analysis, we train the first model to be a shallow model that is effective in automatically identifying potentially biased examples. The shallow model is then used to train the main model through the existing debiasing methods, which work by down-weighting the potentially biased examples. These methods present a caveat in that they may lose useful training signals from the down-weighted training examples. To account for this, we also propose an annealing mechanism that helps retain models' in-distribution performance (i.e., evaluation on the test split of the original dataset).
Third, we experiment on three NLU tasks and evaluate the models on their existing challenge datasets. We show that models obtained through our self-debiasing framework gain equally high improvements compared to models that are debiased using specific prior knowledge. Furthermore, our cross-datasets evaluation suggests that our general framework, which does not target only a particular type of bias, results in better overall robustness.
Terminology This work relates to a growing body of research that addresses the effect of dataset biases on the resulting models. Most research aims to mitigate different types of bias at varying parts of the training pipeline (e.g., dataset collection or modeling). Without a shared definition and common terminology, the term "bias" discussed in one paper quite often refers to a different kind of bias mentioned in others. Following the definition established in the recent survey by Shah et al. (2020), the dataset bias that we address in this work falls into the category of label bias. This bias emerges when the conditional distribution of the target label given certain features in the training data diverges substantially at test time. The features associated with label bias may differ from one classification setting to another, and although they are predictive, relying on them for prediction may be harmful to fairness (Elazar and Goldberg, 2018).

Figure 1: Synthetic bias datasets are created by appending an artificial feature to the input text that allows models to use it as a shortcut to the target label. For each example in MNLI, a number-coded label (contradiction: 0, entailment: 1, neutral: 2) is appended to the hypothesis sentences. Example (gold label 0, contradiction): premise "What's truly striking, though, is that Jobs has never really let this idea go."; original hypothesis "Jobs never held onto an idea for long."; biased version "0 Jobs never held onto an idea for long."; anti-biased version "1 Jobs never held onto an idea for long."

Motivation and Analysis
Debiasing NLU models Recent NLU tasks are commonly formulated as multi-class classification problems (Wang et al., 2018), in which the goal is to predict the semantic relationship label y ∈ Y given an input sentence pair x ∈ X. For each example x, let b(x) be the biased features that happen to be predictive of label y in a specific dataset. The aim of a debiasing method for an NLU task is to learn a debiased classifier f_d that does not mainly use b(x) when computing p(y|x). Model-agnostic debiasing methods (e.g., product-of-expert (Clark et al., 2019)) achieve this by reducing the importance of biased examples in the learning objective. To identify whether an example is biased, they employ a shallow model f_b, a simple model trained to directly compute p(y|b(x)), where the features b(x) are hand-crafted based on task-specific knowledge of the biases. However, obtaining the prior information to design b(x) requires a dataset-specific analysis (Sharma et al., 2018). Given the ever-growing number of new datasets, it would be a time-consuming and costly process to identify biases before applying the debiasing methods.

Figure 2: The learning trajectory of a BERT model on MNLI datasets that are synthetically biased with different proportions: 0.9, 0.8, 0.7, and 0.6. All settings show models' tendency to rely on biases after seeing only a small number of training examples (accuracy goes up rapidly on "biased" while going down on "anti-biased" after less than 10K training steps).
In this work, we propose an alternative strategy to automatically obtain f_b, enabling existing debiasing methods to work without precise prior knowledge. This strategy assumes a connection between large pre-trained models' reliance on biases and their tendency to operate as rapid surface learners, i.e., to quickly overfit to surface form information, especially when they are fine-tuned in a small training data setting (Zellers et al., 2019). This tendency of deep neural networks to exploit simple patterns in the early stage of training is also well-observed in other domains of artificial intelligence (Arpit et al., 2017; Liu et al., 2020). Since biases are commonly characterized as simple surface patterns, we expect that models' rapid performance gain is mostly attributable to their reliance on biases. Namely, they are likely to operate similarly to f_b after being exposed to only a small number of training instances, i.e., achieving high accuracy on the biased examples while still performing poorly on the rest of the dataset.
Synthetic bias We investigate this assumption by comparing models' performance trajectories on biased and anti-biased ("counterexamples" to the biased shortcuts) test sets as more examples are seen during training. Our goal is to obtain a fair comparison without confounds that may result in performance differences on these two sets. Specifically, the examples from the two sets should be similar except for the presence of a feature that is biased in one set and anti-biased in the other. For this reason, we construct a synthetically biased dataset based on the MNLI dataset (Williams et al., 2018) using the procedure illustrated in Figure 1. A synthetic bias is injected by appending an artificial feature to 30% of the original examples. We simulate the presence of bias by controlling a portion m of these manipulated examples such that their artificial feature is associated with the ground truth label ("biased"), whereas in the remaining (1 − m), the feature is disassociated with the label ("anti-biased"). Using a similar injection procedure, we can produce both fully biased and fully anti-biased test sets in which 100% of the examples contain the synthetic feature. Models that blindly predict based on the artificial feature are guaranteed to achieve 0% accuracy on the anti-biased test set.
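The injection procedure can be sketched as follows. This is a minimal illustration, assuming the dataset is a list of (premise, hypothesis, label) triples; all names are ours, not the authors' implementation:

```python
import random

# Number codes for the MNLI labels, as in Figure 1.
LABELS = {"contradiction": 0, "entailment": 1, "neutral": 2}

def inject_synthetic_bias(examples, inject_frac=0.3, m=0.9, seed=0):
    """Prepend a number-coded label token to a fraction of hypotheses.

    With probability m the code agrees with the gold label ("biased");
    otherwise a random other code is used ("anti-biased").
    """
    rng = random.Random(seed)
    out = []
    for premise, hypothesis, label in examples:
        if rng.random() < inject_frac:
            if rng.random() < m:
                code = LABELS[label]  # biased: code matches the label
            else:
                others = [v for k, v in LABELS.items() if k != label]
                code = rng.choice(others)  # anti-biased: code contradicts it
            hypothesis = f"{code} {hypothesis}"
        out.append((premise, hypothesis, label))
    return out
```

Setting inject_frac=1.0 with m=1.0 or m=0.0 yields the fully biased and fully anti-biased test sets described above.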
Model's performance trajectory We fine-tune a bert-base-uncased model (Wolf et al., 2019) on the whole MNLI dataset, partially biased with different proportions (m = {0.9, 0.8, 0.7, 0.6}). We evaluate each model on the original test set as well as the two fully biased and anti-biased test sets. Figure 2 shows the performance trajectory in all settings. As expected, the models show the tendency of relying on biases after seeing only a small fraction of the dataset. Specifically, at an early point during training, models achieve 100% accuracy on the biased test set and drop to almost 0% on the anti-biased test set. This behavior becomes more apparent as the proportion of biased examples is increased by adjusting m from 0.6 to 0.9.

Training a shallow model The analysis suggests that we can obtain a substitute f_b by taking a checkpoint of the main model early in the training, i.e., when the model has only seen a small portion of the training data. However, we observe that the resulting model makes predictions with rather low confidence, i.e., it assigns a low probability to the predicted label. As shown in Figure 3 (top), most predictions fall in the 0.4 probability bin, only slightly higher than a uniform probability (0.3). We further find that by training the model for multiple epochs, we can obtain a confident f_b that overfits biased features from a smaller sample size (Figure 3, bottom). We show in Section 3 that an overconfident f_b is particularly important to better identify biased examples.

Self-debiasing Framework
We propose a self-debiasing framework that enables existing debiasing methods to work without requiring a dataset-specific knowledge about the biases' characteristics. Our framework consists of two stages: (1) automatically identifying biased examples using a shallow model; and (2) using this information to train the main model through the existing debiasing methods, which are augmented with our proposed annealing mechanism.

Biased examples identification
First, we train a shallow model f_b, which approximates the behavior of a simple hand-crafted model that is commonly used by the existing debiasing methods to identify biased examples. As mentioned in Section 2, we obtain f_b for each task by training a copy of the main model on a small random subset of the dataset for several epochs. The model f_b is then used to make predictions on the remaining unseen training examples. Given a training example {x^(i), y^(i)}, we denote the output of the shallow model as p_b^(i) = f_b(x^(i)), a probability distribution over the labels that indicates how likely it is that the instance contains biases. Specifically, the presence of biases can be estimated using the scalar probability value of p_b^(i) on the correct label, which we denote as p_b^(i,c), where c is the index of the correct label. We can interpret p_b^(i,c) as follows: when the model predicts an example x^(i) correctly with high confidence, i.e., p_b^(i,c) is close to 1, the example is potentially biased; conversely, when the model makes an overconfident error, i.e., p_b^(i,c) is close to 0, the example is likely to be a harder example from which models should focus on learning.
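The identification step reduces to reading off the shallow model's probability mass on the gold label. A minimal sketch, assuming the shallow model's raw logits are available (helper names are ours, not from the paper):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bias_scores(shallow_logits, gold_labels):
    """p_b^(i,c): the shallow model's probability on the correct label.

    Close to 1 -> confidently correct: the example is potentially biased.
    Close to 0 -> confidently wrong: likely a hard example worth learning from.
    """
    return [softmax(logits)[c]
            for logits, c in zip(shallow_logits, gold_labels)]
```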

Debiased training objective
We use the obtained p_b to train the main model f_d parameterized by θ_d. Specifically, p_b is utilized by the existing model-agnostic debiasing methods to down-weight the importance of biased examples in the training objective. In the following, we describe how the three recent model-agnostic debiasing methods (example reweighting (Schuster et al., 2019), product-of-expert (Clark et al., 2019; Mahabadi et al., 2020), and confidence regularization (Utama et al., 2020)) operate within our framework:

Example reweighting This method adjusts the importance of a training instance by directly assigning a scalar weight that indicates whether the instance exhibits a bias. Following Clark et al. (2019), this weight is computed as 1 − p_b^(i,c). The individual loss term is thus defined as:

L(x^(i), y^(i)) = −(1 − p_b^(i,c)) · y^(i) · log p_d

where p_d is the softmax output of f_d. This formulation means that the contribution of an example to the overall loss steadily decreases as the shallow model assigns a higher probability to the correct label (i.e., a more confident prediction).
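A sketch of the reweighted loss for a single example, assuming p_d is already a softmax distribution over labels (names are illustrative):

```python
import math

def reweighted_ce(p_d, gold, p_bc):
    """Example-reweighting loss: scale the main model's cross-entropy on an
    example by (1 - p_b^(i,c)), so examples the shallow model solves
    confidently contribute less to training."""
    return (1.0 - p_bc) * -math.log(p_d[gold])
```

A fully "biased" example (p_bc = 1) contributes zero loss, while an example the shallow model gets confidently wrong (p_bc ≈ 0) keeps its full cross-entropy.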
Product-of-expert In this method, the main model f_d is trained in an ensemble with the shallow model f_b by combining their softmax outputs. The ensemble loss on each example is defined as:

L(x^(i), y^(i)) = −y^(i) · log softmax(log p_d + log p_b)

During training, we only optimize the parameters of f_d while keeping the parameters of f_b fixed. At test time, we use only the prediction of f_d.
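A minimal sketch of the per-example product-of-expert loss in log space; in actual training only the main model's logits would receive gradients, since f_b is frozen (names are ours):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return [z - lse for z in logits]

def poe_loss(logits_d, logits_b, gold):
    """Product-of-experts: add the two models' log-probabilities,
    renormalize, and take cross-entropy against the gold label."""
    log_pd = log_softmax(logits_d)
    log_pb = log_softmax(logits_b)
    combined = log_softmax([d + b for d, b in zip(log_pd, log_pb)])
    return -combined[gold]
```

When the shallow model is uniform, the combined loss reduces to the main model's ordinary cross-entropy; when the shallow model is confident on the gold label, the gradient on the main model shrinks.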
Confidence regularization This method works by regularizing model confidence on the examples that are likely to be biased. Utama et al. (2020) use a self-distillation training objective (Furlanello et al., 2018; Hinton et al., 2015), in which the supervision by the teacher model is scaled down using the output of the shallow model. The loss on each individual example is defined as the cross entropy between p_d and the scaled teacher output:

L(x^(i), y^(i)) = −S(p_t^(i), p_b^(i,c)) · log p_d

where f_t is the teacher model (parameterized identically to f_d) that is trained using a standard cross entropy loss on the full dataset, f_t(x) = p_t, and S is the scaling function. This "soft" label supervision provided by the scaled teacher output discourages models from making overconfident predictions on examples containing biased features.
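A sketch of the regularized loss. The exact form of the scaling function S is an assumption on our part: here we exponentiate the teacher distribution by (1 − p_b^(i,c)) and renormalize, so that confidently biased examples get a flatter (lower-confidence) target:

```python
import math

def scaled_teacher(p_t, p_bc):
    """Soften the teacher distribution on likely-biased examples (assumed
    scaling form): p_bc -> 1 pushes the target toward uniform, while
    p_bc -> 0 leaves the teacher output unchanged."""
    scaled = [p ** (1.0 - p_bc) for p in p_t]
    total = sum(scaled)
    return [p / total for p in scaled]

def conf_reg_loss(p_d, p_t, p_bc):
    """Cross-entropy between the main model's output and the scaled teacher."""
    target = scaled_teacher(p_t, p_bc)
    return -sum(t * math.log(p) for t, p in zip(target, p_d))
```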

Annealing mechanism
Our shallow model f_b is likely to capture multiple types of bias, leading to more examples being down-weighted in the debiased training objectives. As a result, the effective training data size is reduced even further, which leads to a substantial in-distribution performance drop in several debiasing methods (Clark et al., 2019). To mitigate this, we propose an annealing mechanism that allows the model to gradually learn from all examples, including ones that are detected as biased. This is done by steadily lowering p_b^(i,c) as the training progresses toward the end. At training step t, the probability vector p_b^(i) is scaled down by raising each of its values to the power of α_t and re-normalizing:

p_b,j^(i) ← (p_b,j^(i))^α_t / Σ_k (p_b,k^(i))^α_t

where K is the number of labels and the indices j, k ∈ {1, ..., K}. The value of α_t is gradually decreased throughout the training using a linear schedule. Namely, we set the value of α_t to range from the maximum value 1 at the start of the training to the minimum value a at the end of the training: α_t = 1 − t(1 − a)/T, where T is the total number of training steps. In the extreme case where a is set to 0, the p_b vectors are scaled down to nearly uniform distributions near the end of the training. This results in a more equal importance of all examples, which is equivalent to the standard cross entropy loss.
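The schedule and the tempering step can be sketched as follows (names are illustrative):

```python
def alpha_schedule(t, T, a):
    """Linear decay of the annealing exponent from 1 (start) to a (end)."""
    return 1.0 - t * (1.0 - a) / T

def anneal(p_b, alpha_t):
    """Temper p_b by raising each probability to alpha_t and renormalizing:
    alpha_t = 1 leaves p_b unchanged; alpha_t -> 0 flattens it toward uniform,
    recovering a standard (unweighted) cross-entropy loss."""
    scaled = [p ** alpha_t for p in p_b]
    total = sum(scaled)
    return [p / total for p in scaled]
```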
We note that since this mechanism gradually exposes models to potentially biased instances, it presents the risk of the model picking up biases and reverting to the baseline behavior. However, our results and analysis suggest that when the parameter a is set to a value close to 1, the annealing mechanism can still provide an improvement on the in-distribution data while retaining reasonably good performance on the challenge test sets.

Evaluation Tasks
We perform evaluations on three NLU tasks: natural language inference, fact verification, and paraphrase identification. We simulate a setting where we do not have enough information about the biases to train a debiased model, and thus biased examples must be identified automatically. Therefore, we use the existing challenge test set for each examined task strictly for evaluation and do not use information about their corresponding bias types during training. In the following, we briefly discuss the datasets used for training on each task as well as their corresponding challenge test sets used to evaluate the impact of debiasing methods:

Natural language inference We use the English Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which consists of 392K pairs of premise and hypothesis sentences annotated with their textual entailment information. We test NLI models on lexical overlap bias using the HANS evaluation set. It contains examples in which premise and hypothesis sentences consist of the same set of words yet may not hold an entailment relationship, e.g., "cat caught a mouse" vs. "mouse caught a cat". Since word overlapping is biased towards entailment in MNLI, models trained on this dataset often perform close to a random baseline on HANS.
Paraphrase identification We experiment with the Quora Question Pairs dataset, which consists of 362K question pairs annotated as either duplicate or non-duplicate. We perform an evaluation using the PAWS dataset (Zhang et al., 2019), which is designed to expose whether the resulting models perform the task by relying on lexical overlap biases.

Fact verification
We run debiasing experiments on the FEVER dataset (Thorne et al., 2018). It contains pairs of claim and evidence sentences labeled as either supports, refutes, or not-enough-information. We evaluate on the FeverSymmetric test set (Schuster et al., 2019), which is collected to reduce claim-only biases (e.g., negative phrases such as "refused to" or "did not" are associated with the refutes label).

Main Model
We apply our self-debiasing framework to the BERT model (Devlin et al., 2019), which performs very well on the three considered tasks. 6 It also shows substantial improvements on the corresponding challenge datasets when trained through the existing debiasing methods (Clark et al., 2019; He et al., 2019). For each examined debiasing method, we compare the results when it is applied within our framework and when it is trained using prior knowledge to detect training examples with a specific bias. For the second scenario, the MNLI and QQP models are debiased using a lexical overlap bias prior, whereas the FEVER model is debiased using hand-crafted claim-only biased features. We use the results reported in their corresponding papers. Additionally, we train a baseline BERT model with a standard cross entropy loss.

6 We use the pre-trained bert-base-uncased model available at https://huggingface.co/transformers/pretrained_models.html.

Hyperparameters
The hyperparameters of our framework include the number of training samples and epochs used to train the shallow model f_b, as well as the parameter a that schedules the annealing process. We only use the training data, and no information about the challenge sets, for tuning these parameters. Based on the insight from our synthetic bias analysis (Section 2), we choose the sample size and the number of epochs that result in an f_b satisfying the following conditions: (1) its accuracy on the unseen training examples is around 60% to 70%; and (2) more than 90% of its predictions fall into the high-confidence bin (> 0.9). These values vary for each task depending on its diversity and difficulty. For instance, it takes 2000 examples and 3 epochs of training for MNLI, but only 500 examples and 4 epochs for an easier task such as QQP. For the annealing mechanism, we set a = 0.8 as the minimum value of α_t for all experiments across the three tasks. Although this may not be an optimal configuration for every task, it still allows us to observe how gradually increasing the importance of "biased" examples affects the overall performance.
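The two selection conditions above can be expressed as a simple check over the shallow model's predictions on held-out training examples; the thresholds follow the text, while the helper itself is hypothetical:

```python
def shallow_model_ok(preds, golds, confidences,
                     acc_range=(0.60, 0.70), conf_bin=0.9, min_conf_frac=0.9):
    """Check the two selection criteria for f_b on unseen training examples:
    (1) accuracy roughly in [60%, 70%];
    (2) more than 90% of predictions in the high-confidence bin (> 0.9)."""
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    conf_frac = sum(c > conf_bin for c in confidences) / len(confidences)
    return acc_range[0] <= acc <= acc_range[1] and conf_frac >= min_conf_frac
```

One would sweep (sample size, epochs) per task and keep the smallest configuration for which this check passes.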

Results and Discussion
Main results We experiment with several training methods for each task: the baseline training, debiased training with prior knowledge, and debiased training using our self-debiasing framework (with and without the annealing mechanism). We present the results on the three tasks in Table 1. Each model is evaluated both in terms of its in-distribution performance on the original development set and its out-of-distribution performance on the challenge test set. For each setting, we report the average results across 5 runs. We observe that: (1) models trained through the self-debiasing framework obtain equally high improvements on the challenge sets of the three tasks compared to their corresponding debiased models trained with prior knowledge (indicated as known-bias). In some cases, the existing debiasing methods can even be more effective when applied within the proposed framework, e.g., self-debias example reweighting obtains a 52.3 F1 improvement over the baseline on the non-duplicate subset of PAWS (compared to 33.6 by its known-bias counterpart). This indicates that the framework is equally effective in identifying biased examples without the previously required prior knowledge; (2) most improvements on the challenge datasets come at the expense of in-distribution performance (dev column), except for the confidence regularization models. For instance, the self-debias product-of-expert (PoE) model, without annealing, performs 2.2pp lower than the known-bias model on the MNLI dev set. This indicates that self-debiasing may identify more potentially biased examples and thus effectively omit more training data; (3) the annealing mechanism (indicated by ♠) is effective in mitigating this issue in most cases, e.g., improving PoE by 0.5pp on FEVER dev and 1.2pp on MNLI dev while keeping relatively high challenge test accuracy. Self-debias reweighting augmented with the annealing mechanism even achieves the highest HANS accuracy in addition to its improved in-distribution performance.
Cross-datasets evaluation Previous work demonstrated that targeting a specific bias to optimize performance on the corresponding challenge dataset may bias the model in other unwanted directions, which proves counterproductive in improving the overall robustness (Nie et al., 2019a; Teney et al., 2020). One way to evaluate the impact of debiasing methods on the overall robustness is to train models on one dataset and evaluate them against other datasets of the same task, which may have different types and amounts of biases (Belinkov et al., 2019a). A contemporary work specifically finds that debiasing models based on only a single bias results in models that perform significantly worse upon cross-datasets evaluation for the reading comprehension task.
Motivated by this, we perform similar evaluations for models trained on MNLI through the three debiasing setups: known-bias to target the HANS-specific bias, self-debiasing, and self-debiasing augmented with the proposed annealing mechanism. We do not tune the hyperparameters for each target dataset and use the models that we previously reported in the main results. As the target datasets, we use 4 NLI datasets: Scitail (Khot et al., 2018), SICK (Marelli et al., 2014), the GLUE diagnostic set (Wang et al., 2018), and the 3-way version of RTE 1, 2, and 3 (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007). We present the results in Table 2. We observe that debiasing with prior knowledge to target the specific lexical overlap bias (indicated by known_HANS) can help models perform better on SICK and Scitail. However, the resulting models under-perform the baseline on the RTE sets and GLUE diagnostic, degrading the accuracy by 0.5 and 0.6pp. In contrast, the self-debiased models, with and without the annealing mechanism, outperform the baseline on all target datasets, both achieving an additional 1.1pp on average. The gains by the two self-debiased models suggest that while they are effective in mitigating the effect of one particular bias (i.e., lexical overlap), they do not result in models learning other unwanted patterns that may hurt the performance on other datasets. These results also extend the aforementioned findings to the NLU setting in that addressing multiple biases at once, as done by our general debiasing method, leads to a better overall generalization.
Analyzing the annealing mechanism In previous experiments, we showed that setting the minimum α_t only slightly lower than 1 (i.e., a = 0.8) results in improvements on the in-distribution data without substantial degradation of the challenge dataset scores. We question whether this behavior persists once we set a closer to 0. Specifically, do models fall back to the baseline performance when the loss becomes closer to the standard cross-entropy at the end of the training? We run additional experiments using self-debiased example reweighting on the QQP ⇒ PAWS evaluation. We consider the following values for the minimum α_t: 1.0, 0.8, 0.6, 0.4, 0.2, and 0.0. For each experiment, we report the average scores across multiple runs. As we see in Figure 4, the challenge test scores decrease as we set the minimum a to lower values. Annealing can still offer a reasonable trade-off between in-distribution and challenge test performance up until a = 0.6, before falling back to the baseline performance at a = 0. These results suggest that models are still likely to learn spurious shortcuts from biased examples that they are exposed to even at the end of the training. Consequently, the annealing mechanism should be used cautiously by setting the minimum α_t to moderate values, e.g., 0.6 or 0.8.

Impact on learning dynamics
We previously showed (Figure 2) that baseline models tend to learn easier examples more rapidly, allowing them to make correct predictions by relying on biases. As the self-debiasing framework manages to mitigate this fallible reliance, we expect some changes in models' learning dynamics. We are, therefore, interested in characterizing these changes by analyzing their training loss curves. In particular, we examine the individual losses on each training batch and measure their variability using percentiles (i.e., 0th, 25th, 50th, 75th, and 100th percentile). Figure 5 shows the comparison of the individual loss variability.

Bias identification stability Researchers have recently observed large variability in the generalization performance of fine-tuned BERT models (Mosbach et al., 2020; Zhang et al., 2020), especially in out-of-distribution evaluation settings (McCoy et al., 2019a; Zhou et al., 2020). This may raise concerns about whether our shallow models, which are trained on a sub-sample of the training data, can consistently learn to rely mostly on biases. We, therefore, train 10 instances of shallow models on the MNLI dataset using different random seeds (for the classifier's weight initialization and training sub-sampling). For evaluation, we perform two different partitionings of the MNLI dev set based on the output of two simple hand-crafted models, which use lexical overlap and hypothesis-only features (Gururangan et al., 2018), respectively. The stability of bias utilization across the runs is evaluated by measuring performance on the easy and hard subsets of each partitioning, where examples that the simple models predicted correctly belong to easy and the rest belong to hard. Figure 6 shows the results. We observe small variability in the overall dev set performance, which ranges between 61-65% accuracy. Similarly, the models obtain consistently higher accuracy on the easy subsets over the hard ones: 79-85% vs. 56-59% on the lexical-overlap partitioning and 72-77% vs.
48-50% on the hypothesis-only partitioning. The results indicate that: 1) the bias-reliant behavior of shallow models is stable; and 2) shallow models capture multiple types of bias. However, we also observe one rare instance of the shallow model that fails to converge during training and is stuck at making random predictions (33% in MNLI). This may indicate that the biased examples are undersampled in that particular run. In that case, we can easily spot this undesired behavior, discard the model, and perform another sampling.

Related Work
The artifacts of large-scale dataset collection result in dataset biases that allow models to perform well without learning the intended reasoning skills. In NLI, models can perform better than chance by only using the partial input (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018) or by basing their predictions on whether the inputs are highly overlapped (Dasgupta et al., 2018). Similar phenomena exist in various tasks, including argumentation mining (Niven and Kao, 2019), reading comprehension (Kaushik and Lipton, 2018), and story cloze completion (Schwartz et al., 2017; Cai et al., 2017). To allow a better evaluation of models' reasoning capabilities, researchers constructed challenge test sets composed of "counterexamples" to the spurious shortcuts that models may adopt (Jia and Liang, 2017; Glockner et al., 2018; Zhang et al., 2019; Naik et al., 2018). Models evaluated on these sets often fall back to random baseline performance.
There has been a flurry of work on dynamic dataset construction to systematically reduce dataset biases through adversarial filtering (Zellers et al., 2018; Sakaguchi et al., 2020) or humans in the loop (Nie et al., 2019b; Kaushik et al., 2020; Gardner et al., 2020). While promising, researchers also show that newly constructed datasets may not be fully free of hidden biased patterns (Sharma et al., 2018). It is thus crucial to complement the data collection efforts with learning algorithms that are more robust to biases, such as the recently proposed product-of-expert (Clark et al., 2019; He et al., 2019; Mahabadi et al., 2020), confidence regularization (Utama et al., 2020), or other training strategies (Belinkov et al., 2019b; Yaghoobzadeh et al., 2019; Tu et al., 2020). Despite their effectiveness, these methods are limited by their assumption of the availability of information about the task-specific biases. Our framework aims to alleviate this limitation and enable them to address unknown biases.

Conclusion
We present a general self-debiasing framework that addresses the impact of unknown dataset biases by omitting the need for a thorough dataset-specific analysis to discover the types of biases in each new dataset. We adapt the existing debiasing methods into our framework and enable them to obtain equally high improvements on several challenge test sets without targeting a specific bias. The evaluation also suggests that our framework results in better overall robustness compared to the bias-specific counterparts. Based on our analysis, future work in the direction of automatic bias mitigation may include identifying potentially biased examples in an online fashion and discouraging models from exploiting them throughout the training.