End-to-End Bias Mitigation by Modelling Biases in Corpora

Several recent studies have shown that strong natural language understanding (NLU) models are prone to relying on unwanted dataset biases without learning the underlying task, resulting in models that fail to generalize to out-of-domain datasets and are likely to perform poorly in real-world scenarios. We propose two learning strategies to train neural models that are more robust to such biases and transfer better to out-of-domain datasets. The biases are specified in terms of one or more bias-only models, which learn to leverage the dataset biases. During training, the bias-only models' predictions are used to adjust the loss of the base model to reduce its reliance on biases by down-weighting the biased examples and focusing the training on the hard examples. We experiment on large-scale natural language inference and fact verification benchmarks, evaluating on out-of-domain datasets that are specifically designed to assess the robustness of models against known biases in the training data. Results show that our debiasing methods greatly improve robustness in all settings and better transfer to other textual entailment datasets. Our code and data are publicly available at \url{https://github.com/rabeehk/robust-nli}.


Introduction
Recent neural models (Devlin et al., 2019; Radford et al., 2018; Chen et al., 2017) have achieved high and even near-human performance on several large-scale natural language understanding benchmarks. However, it has been demonstrated that neural models tend to rely on existing idiosyncratic biases in the datasets, and leverage superficial correlations between the label and existing shortcuts in the training dataset to perform surprisingly well, without learning the underlying task (we use biases, heuristics, and shortcuts interchangeably) (Kaushik and Lipton, 2018; Gururangan et al., 2018; Poliak et al., 2018; Schuster et al., 2019; McCoy et al., 2019b). For instance, natural language inference (NLI) is supposed to test the ability of a model to determine whether a hypothesis sentence (There is no teacher in the room) can be inferred from a premise sentence (Kids work at computers with a teacher's help) (Dagan et al., 2006). However, recent work has demonstrated that large-scale NLI benchmarks contain annotation artifacts: certain words in the hypothesis are highly indicative of the inference class and allow models that do not consider the premise to perform unexpectedly well (Poliak et al., 2018; Gururangan et al., 2018). As an example, in some NLI benchmarks, negation words such as "nobody", "no", and "not" in the hypothesis are often highly correlated with the contradiction label.
As a result of the existence of such biases, models exploiting statistical shortcuts during training often perform poorly on out-of-domain datasets, especially if the datasets are carefully designed to limit the spurious cues. To allow proper evaluation, recent studies have tried to create new evaluation datasets that do not contain such biases (Gururangan et al., 2018;Schuster et al., 2019;McCoy et al., 2019b). Unfortunately, it is hard to avoid spurious statistical cues in the construction of large-scale benchmarks, and collecting new datasets is costly (Sharma et al., 2018). It is, therefore, crucial to develop techniques to reduce the reliance on biases during the training of the neural models.
We propose two end-to-end debiasing techniques that can be used when the existing bias patterns are identified. These methods work by adjusting the cross-entropy loss to reduce the biases learned from the training dataset, down-weighting the biased examples so that the model focuses on learning the hard examples. Figure 1 illustrates an example of applying our strategy to prevent an NLI model from predicting the labels using existing biases in the hypotheses, where the bias-only model only sees the hypothesis. Our strategy involves adding this bias-only branch $f_B$ on top of the base model $f_M$ during training. We then compute the combination of the two models $f_C$ in a way that motivates the base model to learn different strategies than the ones used by the bias-only branch $f_B$. At the end of the training, we remove the bias-only classifier and use the predictions of the base model.

Figure 1: An illustration of our debiasing strategies applied to an NLI model. The bias-only model only sees the hypothesis, where negation words like "not" are highly correlated with the contradiction label. We train a robust NLI model by training it in combination with the bias-only model and motivate it to learn different strategies than the ones used in the bias-only model. The robust NLI model does not rely on the shortcuts and obtains improved performance on the test set.
In our first proposed method, Product of Experts, the training loss is computed on an ensemble of the base model and the bias-only model, which reduces the base model's loss for the examples that the bias-only model classifies correctly. For the second method, Debiased Focal Loss, the bias-only predictions are used to directly weight the loss of the base model, explicitly modulating the loss depending on the accuracy of the bias-only model. We also extend these methods to be robust against multiple sources of bias by training multiple bias-only models.
Our approaches are simple and highly effective. They require training only a simple model on top of the base model. They are model agnostic and general enough to be applicable for addressing common biases seen in many datasets in different domains.
We evaluate our models on challenging benchmarks in textual entailment and fact verification, including HANS (Heuristic Analysis for NLI Systems) (McCoy et al., 2019b), hard NLI sets (Gururangan et al., 2018) of Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) and MultiNLI (MNLI) (Williams et al., 2018), and the FEVER Symmetric test set (Schuster et al., 2019). The selected datasets are highly challenging and have been carefully designed to be unbiased to allow proper evaluation of the out-of-domain performance of the models. We additionally construct hard MNLI datasets from the MNLI development sets to facilitate the out-of-domain evaluation on this dataset. We show that incorporating our strategies into the training of baseline models, including BERT (Devlin et al., 2019), provides a substantial gain in out-of-domain performance in all the experiments.
In summary, we make the following contributions: 1) Proposing two debiasing strategies to train neural models robust to dataset bias. 2) An empirical evaluation of the methods on two large-scale NLI datasets and a fact verification benchmark; obtaining a substantial gain on their challenging out-of-domain data, including 7.4 points on HANS, 4.8 points on SNLI hard set, and 9.8 points on FEVER symmetric test set, setting a new state-of-the-art. 3) Proposing debiasing strategies capable of combating multiple sources of bias. 4) Evaluating the transfer performance of the debiased models on 12 NLI datasets and demonstrating improved transfer to other NLI benchmarks. To facilitate future work, we release our datasets and code.

Related Work
To address dataset biases, researchers have proposed to augment datasets by balancing the existing cues (Schuster et al., 2019) or to create an adversarial dataset (Jia and Liang, 2017). However, collecting new datasets, especially at a large scale, is costly, and thus remains an unsatisfactory solution. It is, therefore, crucial to develop strategies to allow models to be trained on the existing biased datasets. Schuster et al. (2019) propose to first compute the n-grams in the dataset's claims that are the most associated with each fact-verification label. They then solve an optimization problem to assign a balancing weight to each training sample to alleviate the biases. In contrast, we propose several end-to-end debiasing strategies. Additionally, Belinkov et al. (2019a) propose adversarial techniques to remove from the NLI sentence encoder the features that allow a hypothesis-only model to succeed. However, we believe that in general, the features used by the hypothesis-only model can include some information necessary to perform the NLI task, and removing such information from the sentence representation can hurt the performance of the full model. Their approach consequently degrades the performance on the hard SNLI set, which is expected to be less biased. In contrast, we propose to train a bias-only model to use its predictions to dynamically adapt the classification loss to reduce the importance of the most biased examples.
Concurrently to our work, Clark et al. (2019) and He et al. (2019) have also proposed to use the product of experts (PoE) models for avoiding biases. They train their models in two stages, first training a bias-only model and then using it to train a robust model. In contrast, our methods are trained in an end-to-end manner, which is convenient in practice. We additionally show that our proposed Debiased Focal Loss model is an effective method to reduce biases, sometimes superior to PoE. We have evaluated on new domains of NLI hard sets and fact verification. Moreover, we have included an analysis showing that our debiased models indeed have lower correlations with the bias-only models, and have extended our methods to guard against multiple bias patterns simultaneously. We furthermore study transfer performance to other NLI datasets.

Reducing Biases
Problem formulation: We consider a general multi-class classification problem. Given a dataset consisting of the input data $x_i \in \mathcal{X}$ and labels $y_i \in \mathcal{Y}$, the goal of the base model is to learn a mapping $f_M$ parameterized by $\theta_M$ that computes the predictions over the label space given the input data, shown as $f_M: \mathcal{X} \rightarrow \mathbb{R}^{|\mathcal{Y}|}$. Our goal is to optimize the parameters $\theta_M$ such that we build a model that is more resistant to benchmark dataset biases, to improve its robustness to domain changes where the biases typically observed in the training data do not exist in the evaluation dataset.
The key idea of our approach, depicted in Figure 1, is first to identify the dataset biases that the base model is susceptible to relying on, and define a bias-only model to capture them. We then propose two strategies to incorporate this bias-only knowledge into the training of the base model to make it robust against the biases. After training, we remove the bias-only model and use the predictions of the base model.

Bias-only Branch
We assume that we do not have access to any data from the out-of-domain dataset, so we need to know a priori about the possible types of shortcuts we would like the base model to avoid relying on. Once these patterns are identified, we train a bias-only model designed to capture the identified shortcuts that only uses biased features. For instance, a hypothesis-only model in the large-scale NLI datasets can correctly classify the majority of samples using annotation artifacts (Poliak et al., 2018;Gururangan et al., 2018). Motivated by this work, our bias-only model for NLI only uses hypothesis sentences. Note that the bias-only model can, in general, have any form, and is not limited to models using only a part of the input data. For instance, on the HANS dataset, our bias-only model makes use of syntactic heuristics and similarity features (see Section 4.3).
Let $x_i^b \in \mathcal{X}_b$ be the biased features of $x_i$ that are predictive of $y_i$. We then formalize this bias-only model as a mapping $f_B: \mathcal{X}_b \rightarrow \mathbb{R}^{|\mathcal{Y}|}$, parameterized by $\theta_B$ and trained using the cross-entropy (CE) loss $\mathcal{L}_B$:

$$\mathcal{L}_B(\theta_B) = -\frac{1}{N} \sum_{i=1}^{N} \log\big(\sigma(f_B^{y_i}(x_i^b))\big),$$

where $f_B^{y_i}$ is the logit of the bias-only model for label $y_i$, $\sigma$ denotes the softmax function, and $N$ is the number of training examples.
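A minimal PyTorch sketch of a bias-only branch trained with the CE loss $\mathcal{L}_B$ (the feature extraction, dimensions, and layer sizes here are illustrative assumptions; the classifiers we actually use are described in the appendices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasOnlyModel(nn.Module):
    """Bias-only branch f_B: maps biased features x^b (e.g. a hypothesis
    embedding) to logits over the label space."""
    def __init__(self, feature_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, feature_dim // 2),
            nn.Tanh(),
            nn.Linear(feature_dim // 2, num_labels),
        )

    def forward(self, biased_features: torch.Tensor) -> torch.Tensor:
        # Returns unnormalized logits of shape (batch, |Y|).
        return self.classifier(biased_features)

# L_B: standard cross-entropy loss on the bias-only predictions.
bias_model = BiasOnlyModel(feature_dim=768, num_labels=3)
x_b = torch.randn(4, 768)           # hypothetical hypothesis-only features
y = torch.tensor([0, 1, 2, 1])      # gold labels
loss_B = F.cross_entropy(bias_model(x_b), y)
```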

Proposed Debiasing Strategies
We propose two strategies to incorporate the bias-only knowledge $f_B$ into the training of the base model $f_M$. In our strategies, the predictions of the bias-only model are combined with either the predictions of the base model or its error, to down-weight the loss for the examples that the bias-only model can predict correctly. We then update the parameters of the base model $\theta_M$ based on this modified loss $\mathcal{L}_C$. Our learning strategies are end-to-end. Therefore, to prevent the base model from learning the biases, the bias-only loss $\mathcal{L}_B$ is not back-propagated to any shared parameters of the base model, such as a shared sentence encoder.
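A minimal sketch of one such end-to-end training step, assuming a shared sentence encoder; detaching the shared representation before the bias-only head keeps $\mathcal{L}_B$ from updating the shared parameters (the helper names here, including combined_loss, are hypothetical and stand for either of the strategies defined below):

```python
import torch.nn.functional as F

def training_step(shared_encoder, base_head, bias_head, combined_loss, batch, optimizer):
    """One end-to-end step. L_B trains only the bias-only head: the shared
    representation is detached before the bias-only branch, so L_B is not
    back-propagated into the shared encoder. L_C (PoE or DFL, defined below)
    trains the base model, including the shared encoder."""
    features = shared_encoder(batch["inputs"])
    bias_logits = bias_head(features.detach())   # stop-gradient into shared parameters
    base_logits = base_head(features)

    loss_B = F.cross_entropy(bias_logits, batch["labels"])
    loss_C = combined_loss(base_logits, bias_logits, batch["labels"])

    optimizer.zero_grad()
    (loss_B + loss_C).backward()
    optimizer.step()
```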

Method 1: Product of Experts
Our first approach is based on the product of experts (PoE) method (Hinton, 2002). Here, we use this method to combine the bias-only and base model's predictions by computing the element-wise product between their predictions, $\sigma(f_B(x_i^b)) \odot \sigma(f_M(x_i))$. We compute this combination in the logarithmic space, making it appropriate for the normalized exponential below:

$$f_C(x_i, x_i^b) = \log\big(\sigma(f_B(x_i^b))\big) + \log\big(\sigma(f_M(x_i))\big). \quad (1)$$

The key intuition behind this model is to combine the probability distributions of the bias-only and the base model to allow them to make predictions based on different characteristics of the input; the bias-only branch covers prediction based on biases, and the base model focuses on learning the actual task. The base model parameters $\theta_M$ are then trained using the cross-entropy loss $\mathcal{L}_C$ of the combined classifier $f_C$:

$$\mathcal{L}_C(\theta_M; \theta_B) = -\frac{1}{N} \sum_{i=1}^{N} \log\big(\sigma(f_C^{y_i}(x_i, x_i^b))\big). \quad (2)$$

When updating the base model parameters using this loss, the predictions of the bias-only model decrease the updates for examples that it can accurately predict.
Justification: The probability of label $y_i$ for the example $x_i$ in the PoE model is computed as:

$$\sigma(f_C^{y_i}(x_i, x_i^b)) = \frac{\sigma(f_B^{y_i}(x_i^b))\,\sigma(f_M^{y_i}(x_i))}{\sum_{k \in \mathcal{Y}} \sigma(f_B^{k}(x_i^b))\,\sigma(f_M^{k}(x_i))}.$$

Then the gradient of the cross-entropy loss of the combined classifier (2) with respect to $\theta_M$ is (Hinton, 2002):

$$\nabla_{\theta_M} \mathcal{L}_C(\theta_M; \theta_B) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k \in \mathcal{Y}} \big(\delta_{y_i k} - \sigma(f_C^{k}(x_i, x_i^b))\big)\, \nabla_{\theta_M} \log\big(\sigma(f_M^{k}(x_i))\big),$$

where $\delta_{y_i k}$ is 1 when $k = y_i$ and 0 otherwise. Generally, the closer the ensemble's prediction $\sigma(f_C^{k}(\cdot))$ is to the target $\delta_{y_i k}$, the more the gradient is decreased through the modulating term, which only happens when the bias-only and base models are both capturing biases.
In the extreme case, when the bias-only model correctly classifies the sample, $\sigma(f_C^{y_i}(x_i, x_i^b)) = 1$ and therefore $\nabla_{\theta_M} \mathcal{L}_C(\theta_M; \theta_B) = 0$, so the biased examples are ignored during training. Conversely, when the example is fully unbiased, the bias-only classifier predicts the uniform distribution over all labels ($\sigma(f_B^{k}(x_i^b)) = \frac{1}{|\mathcal{Y}|}$ for all $k \in \mathcal{Y}$), and the gradient of the ensemble classifier remains the same as for the CE loss.
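A minimal PyTorch sketch of the PoE loss in equations (1) and (2), assuming both branches output unnormalized logits:

```python
import torch
import torch.nn.functional as F

def poe_loss(base_logits: torch.Tensor,
             bias_logits: torch.Tensor,
             labels: torch.Tensor) -> torch.Tensor:
    """Product of Experts: f_C = log softmax(f_M) + log softmax(f_B).
    Cross-entropy over f_C reduces the gradient for examples that the
    bias-only model already classifies correctly."""
    combined = F.log_softmax(base_logits, dim=-1) + F.log_softmax(bias_logits, dim=-1)
    # cross_entropy applies another log_softmax, i.e. it renormalizes the product of experts.
    # bias_logits may be detached beforehand if f_B should be trained only by its own loss L_B.
    return F.cross_entropy(combined, labels)
```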

Method 2: Debiased Focal Loss
Focal loss was originally proposed in Lin et al. (2017) to improve a single classifier by down-weighting the well-classified points. We propose a novel variant of this loss that leverages the bias-only branch's predictions to reduce the relative importance of the most biased examples and allows the model to focus on learning the hard examples. We define Debiased Focal Loss (DFL) as:

$$\mathcal{L}_{DFL}(\theta_M; \theta_B) = -\frac{1}{N} \sum_{i=1}^{N} \big(1 - \sigma(f_B^{y_i}(x_i^b))\big)^{\gamma} \log\big(\sigma(f_M^{y_i}(x_i))\big), \quad (3)$$

where $\gamma$ is the focusing parameter, which impacts the down-weighting rate. When $\gamma$ is set to 0, DFL is equivalent to the cross-entropy loss. For $\gamma > 0$, as the value of $\gamma$ is increased, the effect of down-weighting is increased. We set $\gamma = 2$ throughout all experiments, which works well in practice, and avoid fine-tuning it further. We note the properties of this loss: (1) When the example $x_i$ is unbiased and the bias-only branch does not do well, $\sigma(f_B^{y_i}(x_i^b))$ is small, so the scaling factor is close to 1 and the loss remains unaffected.
(2) As the sample is more biased and $\sigma(f_B^{y_i}(x_i^b))$ is closer to 1, the modulating factor approaches 0 and the loss for the most biased examples is down-weighted.
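A minimal PyTorch sketch of DFL in equation (3), assuming both branches output unnormalized logits; detaching the bias-only logits reflects that this loss only trains the base model:

```python
import torch
import torch.nn.functional as F

def debiased_focal_loss(base_logits: torch.Tensor,
                        bias_logits: torch.Tensor,
                        labels: torch.Tensor,
                        gamma: float = 2.0) -> torch.Tensor:
    """DFL: weight each example's cross-entropy by (1 - p_B(y_i))^gamma,
    where p_B(y_i) is the bias-only probability of the gold label."""
    bias_probs = F.softmax(bias_logits.detach(), dim=-1)   # bias-only branch is not trained by this loss
    p_bias_gold = bias_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    per_example_ce = F.cross_entropy(base_logits, labels, reduction="none")
    return ((1.0 - p_bias_gold) ** gamma * per_example_ce).mean()
```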

RUBi baseline (Cadene et al., 2019)
We compare our models to RUBi (Cadene et al., 2019), a recently proposed model to alleviate unimodal biases learned by Visual Question Answering (VQA) models. Cadene et al. (2019)'s study is limited to VQA datasets; we, however, evaluate the effectiveness of their formulation on multiple challenging NLU benchmarks. RUBi consists of first applying a sigmoid function $\phi$ to the bias-only model's predictions to obtain a mask containing an importance weight between 0 and 1 for each label, and then computing the element-wise product between the obtained mask and the base model's predictions:

$$f_C(x_i, x_i^b) = f_M(x_i) \odot \phi(f_B(x_i^b)).$$

The main intuition is to dynamically adjust the predictions of the base model to prevent it from leveraging the shortcuts. The parameters of the base model $\theta_M$ are then updated by back-propagating the cross-entropy loss $\mathcal{L}_C$ of the combined classifier.
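For reference, a minimal sketch of the RUBi combination described above:

```python
import torch
import torch.nn.functional as F

def rubi_loss(base_logits: torch.Tensor,
              bias_logits: torch.Tensor,
              labels: torch.Tensor) -> torch.Tensor:
    """RUBi: a sigmoid of the bias-only predictions acts as a per-label mask
    on the base model's predictions; cross-entropy is applied to the result."""
    mask = torch.sigmoid(bias_logits)   # importance weight in (0, 1) for each label
    combined = base_logits * mask       # element-wise product
    return F.cross_entropy(combined, labels)
```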

Joint Debiasing Strategies
Neural models can, in practice, be prone to multiple types of biases in the datasets. We, therefore, propose methods for combining several bias-only models. To avoid learning relations between biased features, we do not consider training a classifier on top of their concatenation.
Let $x_i^{b_j} \in \mathcal{X}_{b_j}$, for $j = 1, \dots, d$, be different sets of biased features of $x_i$ that are predictive of $y_i$, and let $f_{B_j}$ be an individual bias-only model capturing $x_i^{b_j}$. Next, we extend our debiasing strategies to handle multiple bias patterns.
Method 1: Joint Product of Experts. We extend our proposed PoE model to multiple bias-only models by computing the element-wise product between the predictions of the bias-only models and the base model, again in logarithmic space:

$$f_C(x_i, x_i^{b_1}, \dots, x_i^{b_d}) = \log\big(\sigma(f_M(x_i))\big) + \sum_{j=1}^{d} \log\big(\sigma(f_{B_j}(x_i^{b_j}))\big).$$

The base model parameters $\theta_M$ are then trained using the cross-entropy loss of the combined classifier $f_C$.
Method 2: Joint Debiased Focal Loss. To extend DFL to handle multiple bias patterns, we first compute the element-wise average of the predictions of the multiple bias-only models, $\sigma(f_B(\cdot)) = \frac{1}{d} \sum_{j=1}^{d} \sigma(f_{B_j}(x_i^{b_j}))$, and then compute the DFL (3) using this joint bias-only model.
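A minimal sketch of both joint strategies, assuming a list of bias-only logit tensors:

```python
from typing import List

import torch
import torch.nn.functional as F

def joint_poe_loss(base_logits: torch.Tensor,
                   bias_logits_list: List[torch.Tensor],
                   labels: torch.Tensor) -> torch.Tensor:
    """Joint PoE: sum the log-probabilities of all bias-only models and the
    base model before applying cross-entropy."""
    combined = F.log_softmax(base_logits, dim=-1)
    for bias_logits in bias_logits_list:
        combined = combined + F.log_softmax(bias_logits, dim=-1)
    return F.cross_entropy(combined, labels)

def joint_dfl(base_logits: torch.Tensor,
              bias_logits_list: List[torch.Tensor],
              labels: torch.Tensor,
              gamma: float = 2.0) -> torch.Tensor:
    """Joint DFL: the element-wise average of the bias-only probabilities
    plays the role of a single joint bias-only model inside DFL."""
    joint_probs = torch.stack(
        [F.softmax(b.detach(), dim=-1) for b in bias_logits_list]).mean(dim=0)
    p_gold = joint_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    per_example_ce = F.cross_entropy(base_logits, labels, reduction="none")
    return ((1.0 - p_gold) ** gamma * per_example_ce).mean()
```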

Evaluation on Unbiased Datasets
We provide experiments on a fact verification benchmark (FEVER) and two large-scale NLI datasets (SNLI and MNLI). We evaluate the models' performance on recently proposed challenging unbiased evaluation sets. We use the BERT (Devlin et al., 2019) implementation of Wolf et al. (2019) as our main baseline, known to work well for these tasks. In all the experiments, we use the default hyperparameters of the baselines.

Fact Verification
Dataset: The FEVER dataset contains claim-evidence pairs generated from Wikipedia. Schuster et al. (2019) collected a new evaluation set for the FEVER dataset to avoid the idiosyncrasies observed in the claims of this benchmark. They made the original claim-evidence pairs of the FEVER evaluation dataset symmetric, by augmenting them and making each claim and evidence appear with each label. Therefore, by balancing the artifacts, relying on statistical cues in claims to classify samples is equivalent to a random guess. The collected dataset is challenging, and the performance of the models relying on biases drops significantly when evaluated on it.
Base models: We consider BERT as the base model, which works the best on this dataset (Schuster et al., 2019), and predicts the relations based on the concatenation of the claim and the evidence with a delimiter token (see Appendix A).

Bias-only model:
The bias-only model predicts the labels using only claims as input.
Results: Table 1 shows the results. Our proposed debiasing methods, PoE and DFL, are highly effective, boosting the performance of the baseline by 9.8 and 7.5 points respectively, significantly surpassing the prior work of Schuster et al. (2019).

Natural Language Inference
Datasets: We evaluate on the hard SNLI and MNLI sets (Gururangan et al., 2018), which contain the examples that a hypothesis-only model cannot classify correctly. Gururangan et al. (2018) show that the success of the recent textual entailment models is attributed to the biased examples, and the performance of these models is substantially lower on the hard sets.
Base models: We consider BERT and InferSent (Conneau et al., 2017) as our base models. We choose InferSent to be able to compare with the prior work of Belinkov et al. (2019b).

Bias-only model:
The bias-only model predicts the labels using only the hypothesis (Appendix B).

Syntactic Bias
Dataset: We also evaluate on HANS (McCoy et al., 2019b), which is designed to diagnose NLI models that rely on superficial syntactic heuristics rather than on the underlying task.

Bias-only model: Our bias-only model for HANS uses features based on the syntactic heuristics and premise-hypothesis similarity: 1) whether all words in the hypothesis are included in the premise; 2) whether the hypothesis is a contiguous subsequence of the premise; 3) whether the hypothesis is a subtree in the premise's parse tree; 4) the number of tokens shared between premise and hypothesis normalized by the number of tokens in the premise. We additionally include some similarity features: 5) the cosine similarity between premise and hypothesis's pooled token representations from BERT followed by min, mean, and max-pooling. We consider the same weight for contradiction and neutral labels in the bias-only loss to allow the model to recognize entailment from not-entailment. During the evaluation, we map the neutral and contradiction labels to not-entailment.

We compare our results with the concurrent work of Clark et al. (2019), who propose a PoE model similar to ours and obtain similar results. The main difference is that our models are trained end-to-end, which is convenient in practice, while their method requires two steps: first training a bias-only model and then using this pre-trained model to train a robust model. The Reweight baseline in Clark et al. (2019) is a special case of our DFL with γ = 1 and performs similarly to our DFL method (using the default γ = 2). Their Learned-Mixin+H method requires hyperparameter tuning. Since the assumption is not having access to any out-of-domain test data, and there is no available dev set for HANS, it is challenging to perform hyperparameter tuning. Clark et al. (2019) follow prior work (Grand and Belinkov, 2019; Ramakrishnan et al., 2018) and perform model selection on the test set.
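A minimal sketch of how heuristic features 1)-4) above can be computed (tokenization, parsing, and exact feature definitions are illustrative; the BERT-based similarity features of item 5 are omitted):

```python
from typing import List, Set, Tuple

def syntactic_bias_features(premise_tokens: List[str],
                            hypothesis_tokens: List[str],
                            premise_constituents: Set[Tuple[str, ...]]) -> List[float]:
    """Features 1)-4); `premise_constituents` is assumed to be the set of
    token spans of the premise's parse-tree constituents."""
    premise_set = set(premise_tokens)
    # 1) whether all hypothesis words appear in the premise
    all_words_in_premise = float(all(tok in premise_set for tok in hypothesis_tokens))
    # 2) whether the hypothesis is a contiguous subsequence of the premise
    is_subsequence = float(" ".join(hypothesis_tokens) in " ".join(premise_tokens))
    # 3) whether the hypothesis is a constituent (subtree) of the premise's parse
    is_constituent = float(tuple(hypothesis_tokens) in premise_constituents)
    # 4) shared tokens, normalized by the premise length
    overlap = sum(tok in set(hypothesis_tokens) for tok in premise_tokens) / max(len(premise_tokens), 1)
    return [all_words_in_premise, is_subsequence, is_constituent, overlap]
```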

Results
To provide a fair comparison, we consequently also tuned γ in DFL by sweeping over {0.5, 1, 2, 3, 4}; the selected model is DFL with γ = 3. With this hyperparameter tuning, DFL is even more effective, and our best result performs 2.8 points better than Clark et al. (2019).

Jointly Debiasing Multiple Bias Patterns
To evaluate combating multiple bias patterns, we jointly debias a base model against the hypothesis artifacts and the syntactic biases.

Bias-only models: We use the hypothesis-only and syntactic bias-only models described in Sections 4.2 and 4.3.
Results: Table 5 shows the results. Models trained to be robust to hypothesis biases do not generalize to HANS. On the other hand, models trained to be robust on HANS use a powerful bias-only model, resulting in a slight improvement on the MNLI mismatched hard dev set. We expect a slight degradation when debiasing for both biases, since the model needs to accommodate both debiasing objectives at once. The jointly debiased models successfully obtain improvements on both datasets, which are close to the improvements obtained on each dataset by the individually debiased models.

Transfer Performance
To evaluate how well the baseline and proposed models generalize to solving textual entailment in domains that do not share the same annotation biases as the large NLI training sets, we take trained NLI models and test them on several NLI datasets.
Datasets: We consider a total of 12 different NLI datasets. We use the 11 datasets studied by Poliak et al. (2018). These datasets include MNLI, SNLI, SciTail (Khot et al., 2018), AddOneRTE (ADD1) (Pavlick and Callison-Burch, 2016), Johns Hopkins Ordinal Commonsense Inference (JOCI), Multiple Premise Entailment (MPE) (Lai et al., 2017), Sentences Involving Compositional Knowledge (SICK) (Marelli et al., 2014), three datasets from White et al. (2017) that are automatically generated from existing datasets for other NLP tasks, namely Semantic Proto-Roles (SPR) (Reisinger et al., 2015), Definite Pronoun Resolution (DPR) (Rahman and Ng, 2012), and FrameNet Plus (FN+) (Pavlick et al., 2015), and the GLUE benchmark's diagnostic test. We additionally consider the Quora Question Pairs (QQP) dataset, where the task is to determine whether two given questions are semantically matching (duplicate) or not. As in Gong et al. (2017), we interpret duplicate question pairs as an entailment relation and neutral otherwise. We use the same split ratio mentioned by Wang et al. (2017). Since the considered datasets have different label spaces, when evaluating on each target dataset, we map the model's labels to the corresponding target dataset's space. See Appendix D for more details.
We strictly refrained from using any out-of-domain data when evaluating on the unbiased split of the same benchmark in Section 4. However, as shown by prior work (Belinkov et al., 2019a), since different NLI target datasets contain different amounts of the bias found in the large-scale NLI dataset, we need to adjust the amount of debiasing according to each target dataset. We consequently introduce a hyperparameter α for PoE to modulate the strength of the bias-only model in ensembling. We follow prior work (Belinkov et al., 2019a) and perform model selection on the dev set of each target dataset, and then report results on the test set. We select the hyperparameters γ and α from {0.4, 0.6, 0.8, 2, 3, 4, 5}.

Table 6: Accuracy results of models with BERT transferring to new target datasets. All models are trained on SNLI and tested on the target datasets. ∆ are absolute differences between our methods and the CE loss baseline.
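A sketch of one plausible way to incorporate α, assuming it simply scales the bias-only log-probabilities in the PoE combination; the exact formulation is in our released code:

```python
import torch
import torch.nn.functional as F

def poe_loss_with_alpha(base_logits: torch.Tensor,
                        bias_logits: torch.Tensor,
                        labels: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """PoE with a strength hyperparameter alpha on the bias-only branch:
    alpha = 1 recovers the standard PoE loss, alpha = 0 the plain CE loss."""
    combined = F.log_softmax(base_logits, dim=-1) + alpha * F.log_softmax(bias_logits, dim=-1)
    return F.cross_entropy(combined, labels)
```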
Results: Table 6 shows the results of the debiased models and the baseline with BERT. As shown in prior work (Belinkov et al., 2019a), the MNLI datasets have very similar biases to SNLI, which the models are trained on, so we do not expect any improvement in the relative performance of our models and the baseline for MNLI and MNLI-M. On all the remaining datasets, our proposed models perform better than the baseline, showing a substantial improvement in generalization from our debiasing techniques. We additionally compare with Belinkov et al. (2019a) in Appendix D and show that our methods substantially surpass their results.

Discussion
Analysis of Debiased Focal Loss: As expected, improving the out-of-domain performance could come at the expense of decreased in-domain performance, since the removed biases are useful for performing the in-domain task. This happens especially for DFL, in which there is a trade-off between in-domain and out-of-domain performance that depends on the parameter γ, and when the baseline model is not very powerful, such as InferSent. To understand the impact of γ in DFL, we train an InferSent model using DFL for different values of γ on the SNLI dataset and evaluate its performance on the SNLI test and SNLI hard sets. As illustrated in Figure 2, increasing γ increases debiasing and thus hurts in-domain accuracy on SNLI, but improves out-of-domain accuracy on the SNLI hard set over a wide range of γ values (see a similar plot for BERT in Appendix E).
Correlation Analysis: In contrast to Belinkov et al. (2019a), who encourage only the encoder to not capture the unwanted biases, our learning strategies influence the parameters of the full model to reduce the reliance on unwanted patterns more effectively. To test this assumption, in Figure 3, we report the correlation between the element-wise loss of the debiased models and the loss of a bias-only model on the considered datasets. The results show that compared to the baselines, our debiasing methods, DFL and PoE, reduce the correlation to the bias-only model, confirming that our models are effective at reducing biases. Interestingly, on MNLI, PoE has less correlation with the bias-only model than DFL and also has better performance on the unbiased split of this dataset. On the other hand, on the HANS dataset, DFL loss is less correlated with the bias-only model than PoE and also obtains higher performance on the HANS dataset.
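A minimal sketch of this correlation analysis, assuming per-example cross-entropy losses and Pearson correlation:

```python
import numpy as np

def loss_correlation(debiased_losses: np.ndarray, bias_only_losses: np.ndarray) -> float:
    """Pearson correlation between per-example losses of a (debiased) model
    and a bias-only model; lower values suggest less reliance on the biases."""
    return float(np.corrcoef(debiased_losses, bias_only_losses)[0, 1])
```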

Conclusion
We propose two novel techniques, product of experts and debiased focal loss, to reduce biases learned by neural models, which are applicable whenever one can specify the biases in the form of one or more bias-only models. The bias-only models are designed to leverage biases and shortcuts in the datasets. Our debiasing strategies then work by adjusting the cross-entropy loss based on the performance of these bias-only models, to focus learning on the hard examples and down-weight the importance of the biased examples. Additionally, we extend our methods to combat multiple bias patterns simultaneously. Our proposed debiasing techniques are model agnostic, simple, and highly effective. Extensive experiments show that our methods substantially improve the model robustness to domain shift, including a 9.8-point gain on the FEVER symmetric test set, 7.4 points on the HANS dataset, and 4.8 points on the SNLI hard set. Furthermore, we show that our debiasing techniques result in better generalization to other NLI datasets. Future work may include developing debiasing strategies that do not require prior knowledge of bias patterns and can automatically identify them.

A Fact Verification
Base model: We fine-tune all models using BERT for 3 epochs and use the default parameters and the default learning rate of 2e−5.

Bias-only model: Our bias-only classifier is a shallow nonlinear classifier with 768, 384, and 192 hidden units and Tanh nonlinearity.
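A minimal sketch of this classifier, assuming a 768-dimensional input (e.g. BERT's pooled output) and three output labels:

```python
import torch.nn as nn

# Shallow nonlinear bias-only classifier with 768, 384, and 192 hidden units
# and Tanh nonlinearities, projecting to the three FEVER labels.
bias_only_classifier = nn.Sequential(
    nn.Linear(768, 768), nn.Tanh(),
    nn.Linear(768, 384), nn.Tanh(),
    nn.Linear(384, 192), nn.Tanh(),
    nn.Linear(192, 3),
)
```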

B Natural Language Inference
Base model: InferSent uses a separate BiLSTM encoder to learn sentence representations for the premise and the hypothesis. It then combines these embeddings following Mou et al. (2016) and feeds them to the default nonlinear classifier. With InferSent, we train all models for 20 epochs as default, without using early stopping. We use the default hyperparameters and, following prior work, set the BiLSTM dimension to 512. We use the default nonlinear classifier with 512 and 512 hidden neurons with Tanh nonlinearity. With BERT, we fine-tune all models for 3 epochs.
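A minimal sketch of the embedding combination, assuming the common heuristic-matching scheme of concatenation, element-wise absolute difference, and element-wise product:

```python
import torch

def combine_sentence_embeddings(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Combine premise (u) and hypothesis (v) embeddings before the classifier:
    concatenation, element-wise absolute difference, and element-wise product."""
    return torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
```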
Bias-only model: For the debiasing models using BERT, we use the same shallow nonlinear classifier explained in Appendix A, and for the ones using InferSent, we use a shallow linear classifier with 512 and 512 hidden units.

D Transfer Performance
Mapping: We train all models on SNLI and evaluate their performance on other target datasets. SNLI contains three labels, contradiction, neutral, and entailment. Some of the datasets we consider contain only two labels. In the case of labels entailed and not-entailed, as in DPR, we map contradiction and neutral to the not-entailed class. In the case of labels entailment and neutral, as in SciTail, we map contradiction to neutral.
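A minimal sketch of these label mappings (label strings are illustrative):

```python
# Mapping from the three SNLI training labels to two-label target schemes.
TO_ENTAILED_NOT_ENTAILED = {          # e.g. DPR
    "entailment": "entailed",
    "contradiction": "not-entailed",
    "neutral": "not-entailed",
}
TO_ENTAILMENT_NEUTRAL = {             # e.g. SciTail
    "entailment": "entailment",
    "contradiction": "neutral",
    "neutral": "neutral",
}
```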
Comparison with Belinkov et al. (2019a): We modified the implementation of Belinkov et al. (2019a) and corrected some implementation issues in the InferSent baseline (Conneau et al., 2017). Compared to the original InferSent implementation, the main differences in our implementation are: (a) we incorporated the fixes suggested for the bugs in the implementation of mean/max-pooling over the BiLSTM in the InferSent baseline; (b) we observed that the aggregation of losses over each batch was computed with the average instead of the intended summation, and we corrected it; (c) we followed the implementation of InferSent and removed out-of-vocabulary (OOV) words from the sentence representation, while Belinkov et al. (2019a) handle them by introducing an OOV token. We additionally observed that, during the pre-processing of some of the target datasets in the implementation of Belinkov et al. (2019a), some of the samples are dropped due to pre-processing issues. We fix these pre-processing issues and evaluate our models and our re-implementations of Belinkov et al. (2019a) on the same corpora. We set the BiLSTM dimension to 512 across all models. Note that Belinkov et al. (2019a) use a BiLSTM dimension of 2048, and due to the mentioned differences in implementations and datasets, the results reported in Belinkov et al. (2019a) are not directly comparable; however, we still on average surpass their reported results substantially. Our re-implementations and scripts to reproduce the results are publicly available at https://github.com/rabeehk/robust-nli-fixed.

Table 9: Accuracy results of models with InferSent transferring to new target datasets. All models are trained on SNLI and tested on the target datasets. M1 and M2 are our re-implementations of Belinkov et al. (2019a). ∆ are relative differences in percentage with respect to the CE loss baseline.
As in prior work (Belinkov et al., 2019a), which adjusts the learning rate of the bias-only and baseline models, we introduce a hyperparameter β for the bias-only model to modulate its loss in ensembling. We sweep the hyperparameters γ and α over {0.02, 0.05, 0.1, 0.6, 2.0, 4.0, 5.0} and β over {0.05, 0.2, 0.4, 0.8, 1.0}. Table 9 shows the results of our debiasing models (DFL, PoE), our re-implementations of the methods proposed in Belinkov et al. (2019a) (M1, M2), and the baseline with InferSent (CE). The DFL model outperforms the baseline on 10 out of 12 datasets, while the PoE model outperforms the baseline on 9 datasets and does equally well on the DPR dataset. As shown in prior work (Belinkov et al., 2019a), the MNLI dataset has very similar biases to SNLI, which the models are trained on, so we do not expect any improvement in the relative performance of our models and the baseline on the MNLI dataset. Interestingly, our methods obtain improvements on MNLI-M, in which the test data differs from the training distribution. Our proposed debiasing methods, PoE and DFL, are highly effective, boosting the relative generalization performance of the baseline by 3.39% and 2.57% respectively, significantly surpassing the prior work of Belinkov et al. (2019a). Compared to M1 and M2, our methods outperform them on 9 datasets, while M1 and M2 do better on two datasets, SPR and FN+, and slightly better on the DPR dataset. However, note that DPR is a very small dataset and all models perform close to random chance on it.