An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference

Prior work on natural language inference (NLI) debiasing mainly targets one or a few known biases, while not necessarily making the models more robust. In this paper, we focus on model-agnostic debiasing strategies and explore whether (and how) NLI models can be made robust to multiple distinct adversarial attacks while keeping or even strengthening the models' generalization power. We first benchmark prevailing neural NLI models, including pretrained ones, on various adversarial datasets. We then try to combat distinct known biases by modifying a mixture-of-experts (MoE) ensemble method, and show that it is nontrivial to mitigate multiple NLI biases at the same time, and that the model-level ensemble method outperforms the MoE ensemble method. We also perform data augmentation, including text swap, word substitution and paraphrase, and demonstrate its effectiveness in combating various (though not all) adversarial attacks at the same time. Finally, we investigate several methods to merge heterogeneous training data (1.35M instances) and perform model ensembling, which are straightforward but effective ways to strengthen NLI models.


Introduction
Natural language inference (NLI) (also known as recognizing textual entailment) is a widely studied task which aims to infer the relationship (e.g., entailment, contradiction, neutral) between two fragments of text, known as the premise and the hypothesis (Dagan et al., 2005, 2013). Recent work has found that NLI models are sensitive to compositional features (Nie et al., 2019), syntactic heuristics (McCoy et al., 2019), stress tests (Geiger et al., 2018; Naik et al., 2018) and human artifacts from the data collection phase (Gururangan et al., 2018; Poliak et al., 2018b; Tsuchiya, 2018). Accordingly, several adversarial datasets have been proposed for these known biases 1 .
* Equal contribution.
Through our preliminary trials on specific adversarial datasets, we find that although model-specific or dataset-specific debiasing methods can increase model performance on the paired adversarial dataset, they might hinder performance on other adversarial datasets, as well as hurt the model's generalization power, i.e., deficient scores in cross-dataset or cross-domain settings. These phenomena motivate us to investigate whether there exists a unified model-agnostic debiasing strategy which can mitigate distinct (or even all) known biases while keeping or strengthening the model's generalization power.
We begin with NLI debiasing models. To make our trials more generic, we adopt a mixture-of-experts (MoE) strategy (Clark et al., 2019), which is known for being model-agnostic and adaptable to various kinds of known biases, as the backbone. Specifically, we treat three known biases, namely word overlap, length mismatch and partial-input heuristics, as independent experts and train corresponding debiasing models. Our results show that debiasing methods tied to one particular known bias may not be sufficient to build a generalized, robust model. This motivates us to investigate a better solution to integrate the advantages of distinct debiasing models. We find that model-level ensembling is more effective than the other MoE ensemble methods. Although our findings are based on the MoE backbone, since exhaustive studies of all existing debiasing strategies are prohibitive, we provide actionable insights to practitioners on combining distinct NLI debiasing methods.
Then we explore model-agnostic and generic data augmentation methods in NLI, including text swap, word substitution and paraphrase. We find these methods can help NLI models combat multiple (though not all) adversarial attacks, e.g., augmenting training data by swapping hypothesis and premise boosts model performance on the stress tests and the lexical inference test, and data augmentation by paraphrasing the hypothesis sentences helps the models resist the superficial patterns from syntactic and partial-input heuristics. We also observe that increasing the training size by incorporating heterogeneous training resources is a simple but effective method to build robust and generalized models. Specifically, we investigate how to incorporate training data with different sizes and annotation processes, as well as the best way to perform model ensembling.

Benchmark Datasets
Our benchmark datasets include adversarial datasets 2 and some widely used general-purpose datasets.
2 Some datasets listed in Table 1 were originally proposed to probe for systematicity. Here we call them 'adversarial' NLI datasets which test the generalization power of NLI models.

Adversarial Datasets
Categorization: To provide more insight into how the adversarial datasets attack the models, we roughly categorize them in Table 1 according to their characteristics and elaborate on the categorization in this section. To facilitate the narrative of the following sections, we rename the adversarial datasets according to their prominent features.
Comparability: All the following datasets are collected from the publicly available resources released by their authors, so the experimental results in this paper are comparable to the numbers reported in the original papers and in other papers that use these datasets 4 .

Partial-input (PI) Heuristics
Partial-input heuristics refer to the hypothesis-only bias (Poliak et al., 2018b). One dataset in this category is derived from a model that uses unigram pattern-pair features across the two sentences, as well as unigram features in the hypothesis and premise sentences, to obtain a 'lexically misleading score (LMS)' for each instance in the test sets. We use CS 0.7 in their paper, which denotes the subset whose LMS is larger than 0.7.
3 The datasets used in this paper can be found in the following GitHub repository: https://github.com/tyliupku/nli-debiasing-datasets

Logical Inference Ability (LI)
Lexical Inference Test (LI-LI): A proper NLI system should recognize hypernyms and hyponyms, synonyms and antonyms. We merge the "antonym" category in Naik et al. (2018) and Glockner et al. (2018) to assess the models' capability to model lexical inference.
Text-fragment Swap Test (LI-TS): An NLI system should also follow first-order logic constraints (Wang et al., 2019c; Minervini and Riedel, 2018). For example, if the premise sentence s_p entails the hypothesis sentence s_h, then s_h must not contradict s_p. We therefore swap the two sentences in the original MultiNLI mismatched dev set. If the gold label is 'contradiction', the corresponding label of the swapped instance remains unchanged; otherwise it becomes 'non-contradiction'.
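The swapping rule for LI-TS can be sketched in a few lines (a minimal illustration; the function names and label strings are our own, not from the released code):

```python
def swapped_label(original_label: str) -> str:
    """Gold label for the swapped (hypothesis, premise) pair.

    Contradiction is symmetric, so its label is preserved under the swap;
    entailment and neutral pairs can only be guaranteed to be
    non-contradiction after swapping.
    """
    if original_label == "contradiction":
        return "contradiction"
    return "non-contradiction"


def make_li_ts(premise: str, hypothesis: str, label: str):
    """Build an LI-TS instance from an original labeled NLI pair."""
    return (hypothesis, premise, swapped_label(label))
```

Applying `make_li_ts` to every instance of the MultiNLI mismatched dev set would yield the LI-TS test set described above.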

Insights within Adversarial Tests
To provide actionable insights to NLP practitioners, we list how these adversarial instances are constructed and why they might fail NLI models in Table 1. These adversarial datasets are potentially correlated with each other due to similar construction processes or goals. For example, 'PI-CD', 'PI-SP' and 'IS-CS' are all created by instance selection from original test sets in order to attack models which improperly rely on superficial lexical patterns, thus they might be correlated. Although we could analytically assess the correlation between adversarial datasets, it is hard to demonstrate their underlying relationships quantitatively. We instead use the model performances on these adversarial datasets as surrogates to visualize their correlations. Concretely, we first collect the model accuracy scores on each adversarial dataset from 30 runs of the 10 baseline models (3 runs each) listed in Table 3. We then show the Pearson correlation coefficients of the model scores on each pair of distinct adversarial datasets in Fig 1. According to Fig 1, 'IS-SD' (HANS) has a higher correlation with 'IS-CS' and 'LI-TS' than with the other adversarial datasets; we assume this is because they are constructed from cross-sentence heuristics in naturally occurring settings, as opposed to the stress test datasets which append a tautology like 'and true is true' to the end of the hypothesis sentence (Naik et al., 2018). 'LI-LI' instances are created by a few lexical changes to the premise sentence, which easily fall under the 'word overlap' heuristics elaborated for the 'IS-SD' dataset; thus 'LI-LI' has a low correlation with 'IS-SD'.
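This correlation analysis can be sketched as follows (a minimal NumPy illustration; the toy score matrix stands in for the 30 real runs, and the shapes are the only assumption):

```python
import numpy as np

# Rows: model runs; columns: adversarial datasets.
# Toy example with 4 runs and 3 datasets in place of the real 30 x 12 scores.
scores = np.array([
    [0.72, 0.55, 0.60],
    [0.75, 0.58, 0.61],
    [0.68, 0.50, 0.65],
    [0.80, 0.62, 0.59],
])

# Pearson correlation between two datasets = correlation between their
# score columns. np.corrcoef treats rows as variables, so transpose first.
corr = np.corrcoef(scores.T)  # shape: (n_datasets, n_datasets), symmetric
```

The off-diagonal entries of `corr` are what Fig 1 visualizes: high values indicate two adversarial datasets that tend to be passed or failed by the same models.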

Other Data Resources
Generalization Power Test: We test the models on several general-purpose datasets, including the NLI diagnostic dataset (Diag) (Wang et al., 2019b), for which we use the Matthews correlation coefficient (Matthews, 1975) as the evaluation metric. We also incorporate RTE (Dagan et al., 2005) and SICK (Marelli et al., 2014) in our testing.
Table 3: The performance of models on adversarial and generalization power tests (Sec 2) trained on MultiNLI. B and L in the subscript denote base and large versions of pretrained models. We use bold and underlined numbers to represent the highest scores in each column/block. The same marks are also used in Tables 4, 5 and 6.

Model Performance on the Benchmark
We show the performance of different models trained on MultiNLI in Table 3. The general trend is that the more powerful models, which achieve higher performance on the original (in-domain) test sets (e.g., RoBERTa (large)), also outperform most other models in both the adversarial and general-purpose settings. In the following sections, we investigate several model-agnostic methods for debiasing NLI models. Specifically, we are interested in: 1) how to (or whether it is possible to) make NLI models robust to multiple distinct adversarial attacks using a unified debiasing method, and 2) how the debiasing methods influence the generalization power of NLI models.

Mixture of Experts (MoE) Debiasing
We utilize the MoE ensemble model of Clark et al. (2019) as the backbone to mitigate three known biases in NLI. Concretely, we implement the 'instance reweighting' and 'bias product' methods of Clark et al. (2019). On top of these methods, we perform several trials on combating distinct NLI biases at the same time.

Debiasing Methods
Notations: For a known NLI bias, they first train a bias-only model B and then use its output b as guidance to train the prime model. In the context of three-way NLI training, b_i is a normalized 3-element vector which represents the predicted probability of each NLI label for the i-th training example. Suppose p_i is the output of the prime model, with the same meaning as b_i.
Instance Reweighting: Suppose b_i^{y_i} is the probability that the bias-only model assigns to the correct label y_i for the i-th training example. They train the models on a weighted version of the data, where the weight α_i for the i-th training example is (1 - b_i^{y_i}). The loss function for a training batch with k examples is a weighted sum of the instance-level losses l_i: L = Σ_{i=1}^{k} α_i · l_i.
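The weighting scheme can be sketched as follows (a toy NumPy illustration; the function name is ours, and we assume cross-entropy as the instance-level loss l_i):

```python
import numpy as np

def reweighted_loss(prime_probs, bias_probs, gold):
    """Instance-reweighted NLI loss for one batch.

    prime_probs, bias_probs: (k, 3) arrays of label probabilities from
    the prime model p_i and the bias-only model b_i.
    gold: (k,) array of gold label indices y_i.
    """
    k = len(gold)
    # alpha_i = 1 - b_i^{y_i}: examples the bias-only model already gets
    # right with high confidence are down-weighted.
    alpha = 1.0 - bias_probs[np.arange(k), gold]
    # Instance-level cross-entropy loss l_i = -log p_i^{y_i}.
    losses = -np.log(prime_probs[np.arange(k), gold])
    return np.sum(alpha * losses)
```

An example on which the bias-only model puts all its mass on the gold label contributes nothing to the loss, so the prime model is pushed to fit the examples the bias cannot explain.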

Bias Product Ensemble:
An ensemble method that takes a product of experts: p̂_i = softmax(log(p_i) + log(b_i)). By doing so, the prime model is encouraged to learn all the information except the specific bias. An intuitive justification from the probabilistic view can be found in Clark et al. (2019). Note that during training, only the prime model is updated while the bias-only model remains unchanged.
For the word overlap bias, the bias-only model is trained on the following features: (1) whether the hypothesis is a subsequence of the premise, (2) whether all words in the hypothesis appear in the premise, (3) the percentage of words from the hypothesis that appear in the premise, (4) the average and the max of the minimum distance between each premise word and each hypothesis word. We use their trained model.
For the length mismatch bias, prior work (2018) shows that the lengths of hypotheses and premises are not evenly distributed over labels (ST-LM in Sec 2.1.4). So we trained a bias-only classifier based on the following sentence-length features: 1) the sentence lengths of the hypothesis and premise, 2) the mean and the difference of these lengths. Our classifier achieves 41.3% accuracy on the mismatched dev set of MultiNLI, which outperforms the majority-class baseline by 6.1%.
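The product-of-experts combination can be sketched in a few lines (a toy NumPy illustration; the function name is ours, and we assume all probabilities are strictly positive so the logs are finite):

```python
import numpy as np

def bias_product(prime_probs, bias_probs):
    """Bias product ensemble used at training time:
    p_hat_i = softmax(log p_i + log b_i).

    The training loss is computed on p_hat, but gradients flow only into
    the prime model; the bias-only model stays frozen.
    """
    logits = np.log(prime_probs) + np.log(bias_probs)
    # Numerically stable softmax over the label dimension.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

One sanity check of the formulation: a completely uninformative (uniform) bias-only model leaves the prime model's distribution unchanged, so the correction only kicks in where the bias is confident.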

Combating Distinct Biases
Suppose we already have m bias-only models {B_1, B_2, …, B_m} and their corresponding outputs {b_1, b_2, …, b_m} at hand; we test three different approaches to integrate them.
MixWeight: We use the product of the weights from the different debiasing models while performing instance reweighting. We replace the weight for the i-th training example (α_i in Sec 3.1) with ∏_{j=1}^{m} (1 - b_{j,i}^{y_i}) and use the same loss function as 'instance reweighting' in Sec 3.1.
AddProduct: We view the different bias-only models as multiple independent experts and then apply the bias product ensemble of Sec 3.2: p̂_i = softmax(log(p_i) + Σ_{j=1}^{m} log(b_{j,i})).
BestEnsemble: We also try to ensemble the best single debiasing models. In our experiments (Table 4), we ensemble the three reweighting models ('ReW' models in columns 2, 4 and 6) for each bias to form the BestEnsemble model.
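The first two combination schemes can be sketched jointly (a toy NumPy illustration; the function names are ours, and we assume strictly positive probabilities):

```python
import numpy as np

def mixweight(bias_probs_list, gold):
    """MixWeight: alpha_i = prod_j (1 - b_{j,i}^{y_i}).

    bias_probs_list: list of (k, 3) arrays, one per bias-only expert.
    gold: (k,) array of gold label indices.
    """
    k = len(gold)
    alpha = np.ones(k)
    for b in bias_probs_list:
        alpha *= 1.0 - b[np.arange(k), gold]
    return alpha

def add_product(prime_probs, bias_probs_list):
    """AddProduct: p_hat = softmax(log p + sum_j log b_j)."""
    logits = np.log(prime_probs)
    for b in bias_probs_list:
        logits = logits + np.log(b)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

Note that MixWeight shrinks an example's weight whenever any expert explains it well, whereas AddProduct corrects the prime model's distribution jointly by all experts.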

Discussions for MoE Methods
For the mixture-of-experts models, we summarize our findings from Table 4 below:
1) For all three known biases in Sec 3.2, we find that the debiasing methods targeting a specific known bias increase the model performance on the corresponding adversarial datasets, e.g., for the word overlap heuristics, the BiasProd model gets 71.0% accuracy on the IS-SD (HANS) test set, 7.2% higher than the baseline.
2) The bias-specific methods might not make the NLI models more robust and generalized. For example, the methods designed for the word overlap heuristics get lower scores on the PI-CD, PI-SP, IS-CS and LI-TS test sets than the baseline model.
3) The proposed debiasing-merging method BestEnsemble (Sec 3.3) inherits the advantages of the bias-specific methods on 4 datasets (PI-CD, IS-SD, LI-TS and ST-LM) compared with the other MoE debiasing models.

Data Augmentation
In this section, we explore three automatic augmentation methods that do not require collecting new data. For fair comparison, in all the following settings we double the training size by automatically generating the same number of augmented instances as in the original training sets, as shown in Table 5.

Methods
Text Swap: An easy-to-implement method which swaps the premise p and hypothesis h sentences in the original datasets. It is a potential remedy for the partial-input heuristics (Sec 2.1.1), since the superficial patterns are not observed in the premise sentences. According to the first-order logic rules (LI-TS in Sec 2.1.3), we can only determine the gold labels for the swapped sentence pairs whose original label is contradiction. For the entailment and neutral instances, we use the ensembled RoBERTa large model trained on the 'all4' training set (Table 6) to label the swapped sentence pairs.
Word Substitution: We also create new training instances by replacing words in the hypothesis sentences. We try two ways to perform the substitution: 1) Synonym: We use NLTK (Bird and Loper, 2004) to first find synonym candidates for the content words (including nouns, verbs and adjectives) in the hypothesis sentences, and then replace a content word with its synonym if the cosine similarity (in [-1, 1]) between the original window and the window after replacement is larger than 0. The window contains at most 3 words, including the replaced word and its neighbours. We represent the window by max-pooling over the 300d GloVe (Pennington et al., 2014) embeddings of the words in that window.
2) Masked LM: We randomly select 30% of the content words and use the pretrained BERT large model to perform the masked LM task. We uniformly sample from the top-100 candidate words (excluding the original word) and replace the original content word with the sampled one.
Paraphrase: We create paraphrases of the original hypothesis sentences by back translation (Hu et al., 2019) using pretrained English-German and German-English machine translation models (Ng et al., 2019). To increase diversity, we use beam search (size=5) for the German-English translation and obtain the paraphrase by sampling from the candidate sentences.
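The synonym-window acceptance test can be sketched as follows (a minimal illustration; tiny hand-made vectors stand in for the 300d GloVe embeddings, and the function names are our own):

```python
import numpy as np

def window_vec(words, emb):
    """Max-pool the embeddings of the words in a window."""
    return np.max(np.stack([emb[w] for w in words]), axis=0)

def accept_substitution(window_before, window_after, emb):
    """Accept a synonym replacement if the cosine similarity between the
    max-pooled window before and after the replacement is larger than 0."""
    u = window_vec(window_before, emb)
    v = window_vec(window_after, emb)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return cos > 0.0
```

The threshold of 0 is deliberately loose: it only filters out replacements whose context window ends up pointing in an opposite direction from the original one.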

Quality Analysis
To assess the quality of the augmented data, we conduct both automatic and human evaluation. For the automatic evaluation, we use the best NLI model in this paper (the RoBERTa (large) model with 'All4+SinEN' in Table 6) to judge whether the labels of the augmented data are consistent with its predictions. For the human evaluation, we first sample 50 instances from each augmented training set and hire 3 human annotators to decide the relation for each sentence pair. We shuffle the 200 instances without telling the annotators which augmentation method produced a given instance. We also ask the annotators to be objective and not to guess the augmentation methods, and we use the majority vote as the final annotation. Based on the human-annotated gold labels, the accuracy of text swap, word substitution (synonym), word substitution (MLM) and paraphrase is 84.0%, 82.0%, 88.1% and 92.9% respectively. Correspondingly, word substitution (synonym), word substitution (MLM) and paraphrase get 76.9%, 83.5% and 94.5% accuracy under the automatic evaluation. Paraphrase augmentation is thus shown to have the highest quality among the four methods.

Discussions for Data Augmentations
For data augmentation, we show the performance of a BERT base model with the different data augmentation methods in Table 5. The text swap method increases the model performance on the IS-CS, LI-LI, LI-TS and ST test sets, as it makes the data distribution over the premises and hypotheses more balanced. It is also an easy-to-implement method which can serve as a baseline for evaluating other automatic data augmentation methods. For the other two methods, the fragility of NLI models to partial-input and inter-sentence heuristics is partially due to rigid word-label co-occurrence (PI-SP in Sec 2.1.1) or word-to-word mapping (IS-SD, IS-CS in Sec 2.1.2). More diverse lexical choices via word substitution or paraphrase might help to relieve the biases caused by these heuristics. We see that 'word sub' in Table 5 outperforms the baseline on IS-CS, LI-TS and ST, and 'paraphrase' outperforms the baseline on IS-SD and LI-TS. However, these two methods get lower scores on the other adversarial and general-purpose datasets, as these debiasing techniques bias the model toward robustness to a specific bias and compensate by trading off performance elsewhere.

Dataset Merging and Model Ensemble
In this section we explore 1) to what extent larger datasets and ensembling make NLI models more robust to distinct adversarial datasets, and 2) the best way to combine large-scale NLI training sets from very different domains.

Merging Heterogeneous Datasets
To set up more diverse and stronger baselines for the proposed benchmark datasets, we use 4 large-scale training datasets, SNLI, MNLI, DNLI and ANLI, for the following experiments. These training sets are created using different strategies. Specifically, SNLI and MNLI are created in a human-elicited way (Poliak et al., 2018b): the human annotators are asked to write a hypothesis sentence according to a given premise and label. DNLI recasts other NLP tasks into the form of NLI. ANLI is created as a hard dataset designed to fail the models. Since these datasets vary in size, domain and collection process, they might contribute differently to the final predictions. Here we investigate two instance reweighting methods accordingly.
Notations: Suppose we have k training sets T_1, …, T_k with n_1, …, n_k training instances respectively, and let p_i denote the performance of a baseline model trained on T_i. p_i can be the average score over multiple test sets or the score on a single in-domain, out-of-domain or adversarial test set.
Size-based reweighting (SR): Smaller training sets might have less influence on the models than larger ones. In this setting, we try to increase the weight of the smaller datasets so that each dataset contributes more equally to the final predictions. We implement this reweighting by replacing the α_i in Sec 3.1 with (Σ_k n_k)/n_i for every training example in T_i.
Performance-based reweighting (PR): Different training sets may vary in annotation quality and collection process and thus yield distinct model performance. In this setting, we reweight the training instances with the performance of a baseline model on the specific training sets. We still use the instance weights of Sec 3.1, with α_i = p_i/(Σ_k p_k) for every training example in T_i.
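Both reweighting schemes can be sketched as follows (a minimal illustration; the function names are ours and the sizes/scores used in any example are hypothetical):

```python
def size_based_weights(sizes):
    """SR: the weight of every instance from dataset i is
    (sum_k n_k) / n_i, so each dataset's total contribution
    (n_i * weight_i) is the same regardless of its size."""
    total = sum(sizes)
    return [total / n for n in sizes]

def performance_based_weights(scores):
    """PR: the weight of every instance from dataset i is
    p_i / (sum_k p_k), favoring datasets on which a baseline
    model performs better."""
    total = sum(scores)
    return [p / total for p in scores]
```

Under SR a dataset three times smaller receives a per-instance weight three times larger, which is exactly what equalizes the datasets' total influence on the loss.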

Model Ensemble
We try two modes of model ensembling: mixed and single. In the mixed mode, we ensemble three different models (BERT, XLNet, RoBERTa), while in the single mode, we ensemble three copies of the same model (RoBERTa*3).
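Either mode can be realized by averaging the per-label probabilities of the member models and taking the argmax (a sketch; the averaging rule is our assumption, as the combination rule is not spelled out here, and the function name is ours):

```python
import numpy as np

LABELS = ["entailment", "neutral", "contradiction"]

def ensemble_predict(member_probs):
    """Average the 3-way probability vectors of the member models
    (BERT/XLNet/RoBERTa in mixed mode, or three RoBERTa runs in
    single mode) and return the argmax label."""
    avg = np.mean(np.stack(member_probs), axis=0)
    return LABELS[int(np.argmax(avg))]
```

The same function covers both modes; only the provenance of the member probability vectors differs.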

Discussions
For dataset merging and model ensembling, according to Table 6, we find that:
1) Incorporating heterogeneous training data is a straightforward way to enhance the robustness of NLI models. Empirically, we see that incorporating datasets built with adversarial human-in-the-loop annotation (e.g., ANLI) is more effective than incorporating automatically constructed datasets without human curation (e.g., DNLI).
2) For the RoBERTa base model, the 'All4+PR' model gets higher scores on the diagnostic and ANLI test sets than the 'All4' baseline, which shows that increasing the weight of higher-quality datasets may help to increase accuracy on certain test sets. Notably, performance-based reweighting helps the model gain 2 points (49.2 vs 51.2) on ANLI compared with the baseline model, while keeping the inference ability on the DNLI, SNLI and MNLI test sets.
3) For the RoBERTa large model, we see that on some datasets, such as IS-SD, the mixed ensemble model may even outperform the single ensemble model, although two of its components (XLNet and BERT) are less powerful than those (RoBERTa) in the single ensemble mode.

Experimental Settings

Implementation Details
We set up both pretrained and non-pretrained model baselines for the proposed evaluation benchmarks. We rerun the publicly available codebases (Wolf et al., 2019), including InferSent (Conneau et al., 2017). We map the representation of the classification token in the pretrained models to three-way NLI classification via a linear transformation. We show the per-layer analyses for the RoBERTa model in Table 2. We reduce the randomness of our experiments by using 3 runs with different random seeds, and report the median of the 3 runs in all tables except the ensemble-related experiments (Sec 5.2) in Table 6. Table 7 shows how we evaluate the test sets that have only two labels under 3-way NLI classification.

Model Selection Strategy
Since we test the NLI models on multiple general-purpose datasets, how to choose the dev set is an important question. We explore 3 different model selection settings:
1) Origin: using the original in-domain dev set.
2) Mixed: using the merged dev sets which include all the instances in the in-domain and extra dev sets in generalization power tests.
3) Oracle: tuning the model for each generalization power test using its own dev set.
We show the performance of a BERT base model trained on MultiNLI with the above-mentioned model selection strategies in Table 8. In this paper we use the 'origin' mode, as it is too expensive to use the 'oracle' strategy in all experiments; besides, we did not see much difference between the 'mixed' and 'origin' modes. Notably, when we merge different training sets, we also merge their dev sets correspondingly to form a unified in-domain dev set in Table 6.

Related Work
Bias in NLI: Bias in data annotation exists in many tasks, e.g., lexical inference (Levy et al., 2015), visual question answering (Goyal et al., 2017), ROC story cloze (Cai et al., 2017; Schwartz et al., 2017), etc. NLI models have been shown to be sensitive to compositional features in premises and hypotheses (Nie et al., 2019; Dasgupta et al., 2018) and data permutations (Schluter and Varab, 2018; Wang et al., 2019c), and to be vulnerable to adversarial examples (Minervini and Riedel, 2018; Glockner et al., 2018) and crafted stress tests (Geiger et al., 2018; Naik et al., 2018). Other evidence of artifacts includes sentence occurrence (Zhang et al., 2019), syntactic heuristics between hypotheses and premises (McCoy et al., 2019) and black-box clues derived from neural models (Gururangan et al., 2018; Poliak et al., 2018b; He et al., 2019). Rudinger et al. (2017) showed that hypotheses in SNLI contain evidence of gender and racial stereotypes, among others. Sanchez et al. (2018) analysed the behaviour of NLI models and the factors that make them more robust.
Benchmark collection in NLI: The GLUE benchmark (Wang et al., 2019b,a) contains several NLI-related benchmark datasets. However, it includes neither adversarial test sets nor domain-specific tests (Romanov and Shivade, 2018; Ravichander et al., 2019). Researchers create NLI datasets using different collection criteria, such as recasting other NLP tasks to NLI (Poliak et al., 2018a), iteratively filtering adversarial training data by model decisions (Bras et al., 2020) (model-in-the-loop), counterfactually augmenting training data by having humans edit examples to break the model (Kaushik et al., 2020) (human-in-the-loop), and multi-round annotation depending on both human and model decisions (Nie et al., 2020).

Conclusions
We investigated how to build robust and generalized NLI models via model-agnostic debiasing strategies, including mixture-of-experts (MoE) ensembling, data augmentation (DA), dataset merging and model ensembling, and benchmarked these methods on various adversarial and general-purpose datasets. Our findings suggest that model-level MoE ensembling, text swap DA and performance-based dataset merging effectively combat multiple (though not all) distinct biases.
Although we have not found a debiasing strategy that guarantees NLI models will be more robust on every adversarial dataset used in this paper, we leave the question of whether such a debiasing method exists to future research.