Improving QA Generalization by Concurrent Modeling of Multiple Biases

Existing NLP datasets contain various biases that models can easily exploit to achieve high performance on the corresponding evaluation sets. However, models that focus on dataset-specific biases are limited in their ability to learn more generalizable knowledge about the task from more general data patterns. In this paper, we investigate the impact of debiasing methods for improving generalization and propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrently modeling multiple biases in the training data. Our framework weights each example based on the biases it contains and the strength of those biases in the training data. It then uses these weights in the training objective so that the model relies less on examples with high bias weights. We extensively evaluate our framework on extractive question answering with training data from various domains that contain multiple biases of different strengths. We perform the evaluations in two different settings, in which the model is trained on a single domain or on multiple domains simultaneously, and show its effectiveness in both settings compared to state-of-the-art debiasing methods.


Introduction
As a result of annotation artifacts, existing NLP datasets contain shallow patterns that correlate with target labels (Gururangan et al., 2018; McCoy et al., 2019; Schuster et al., 2019a; Le Bras et al., 2020; Jia and Liang, 2017; Das et al., 2019). Models tend to exploit these shallow patterns, which we refer to as biases in this paper, instead of learning general knowledge about solving the target task.
Existing debiasing approaches weaken the impact of such biases by disregarding or down-weighting affected training examples.1 They are often evaluated using adversarial or synthetic sets that contain counterexamples, in which relying on the examined bias results in incorrect predictions (Belinkov et al., 2019; Clark et al., 2019; He et al., 2019; Mahabadi et al., 2020).

1 The code and data are available at https://github.com/UKPLab/qa-generalization-concurrent-debiasing.
Importantly, the majority of existing debiasing approaches only deal with a single bias. They improve the performance on a targeted adversarial evaluation set, while typically decreasing the performance on the original datasets or on adversarial datasets that contain different types of biases (Utama et al., 2020; Nie et al., 2019; He et al., 2019).
In this paper, we show that modeling multiple biases is a key factor in benefiting from debiasing methods for improving both in-domain performance and out-of-domain generalization, and propose a new debiasing framework for concurrently modeling multiple biases during training. A key challenge for developing a general framework that can handle multiple biases is to properly combine them when the strengths of the various biases differ across datasets. Previous work has found that if the ratio of biased examples is high, down-weighting or disregarding all of them results in an insufficient training signal, which leads to performance decreases (Clark et al., 2019; Utama et al., 2020). Therefore, we propose a novel multi-bias weighting function that weights each example according to multiple biases and based on each bias' strength in the training domain. We incorporate the multi-bias weights into the training objective by adjusting the loss according to the bias weights of individual training examples, so that the model relies on more general patterns of the data.
We evaluate our framework on extractive question answering (QA), for which a wide range of datasets from different domains exist, some of which contain crucial biases (Sugawara et al., 2020; Jia and Liang, 2017).
Existing approaches to improve generalization in QA are either only applicable when multiple training domains exist (Talmor and Berant, 2019; Takahashi et al., 2019; Lee et al., 2019) or rely on models and ensembles with larger capacity (Longpre et al., 2019; Su et al., 2019; Li et al., 2019). In contrast, our novel debiasing approach can be applied in both single- and multi-domain scenarios, and it improves model generalization without requiring larger pre-trained language models.
We compare our framework with the two state-of-the-art debiasing methods of Utama et al. (2020) and Mahabadi et al. (2020). We study its impact in two different scenarios, where the model is trained on a single domain or on multiple domains simultaneously. Our results show the effectiveness of our framework compared to other debiasing methods; e.g., when the model is trained on a single domain, it improves generalization over six unseen datasets by around two points on average, while the improvement is less than 0.5 points for other debiasing approaches.
Our contributions: 1. We propose a new debiasing framework that handles multiple biases at once while incorporating the bias strengths in the training data. We show that the use of our framework leads to improvements in both in-domain and out-of-domain evaluations.
2. We are the first to investigate the impact of debiasing methods for improving generalization using multiple QA training and evaluation sets.

Related Work
Debiasing Methods There is a growing amount of research literature on various debiasing methods to improve the robustness of models against individual biases in the training data (Clark et al., 2019; Mahabadi et al., 2020; Utama et al., 2020; He et al., 2019; Schuster et al., 2019b). The central idea of the methods proposed in previous work is to reduce the impact of training examples that contain a bias. Existing work either reduces the importance of biased examples in the loss function (Clark et al., 2019; Mahabadi et al., 2020), lowers the confidence on biased examples (Utama et al., 2020), or trains an ensemble of a bias model for learning biased examples and a base model for learning from non-biased examples (Clark et al., 2019; He et al., 2019; Mahabadi et al., 2020).
A crucial limitation of the majority of existing methods is that they only target a single bias. While they improve the performance on adversarial evaluation sets crafted for this particular bias, they lead to lower performance on non-targeted evaluation sets, including the in-domain data (Nie et al., 2019); i.e., unlearning a specific bias does not indicate that the model has learned more general patterns of the data (Jha et al., 2020). We thus need debiasing approaches that help the model learn from less-biased patterns of the data and improve its overall performance across various datasets that are not biased or that may contain different biases.
We compare our framework with recently proposed debiasing methods of Utama et al. (2020) and Mahabadi et al. (2020).
Utama et al. (2020) address a single bias. Unlike the aforementioned methods, they maintain the performance on the in-domain data distribution while improving it on the adversarial evaluation set. Mahabadi et al. (2020) handle multiple biases jointly and show that their debiasing methods can improve the performance across datasets if they fine-tune the debiasing parameters on each target dataset. However, the impact of their method on generalization to unseen evaluation sets is unclear.
In contrast to these state-of-the-art debiasing methods, we (1) concurrently model multiple biases without requiring any information about evaluation datasets, and (2) show that our debiasing framework achieves improvements in in-domain, as well as unseen out-of-domain datasets.
Generalization in QA The ability to generalize models to unseen domains is important across a variety of QA tasks (Rücklé et al., 2020; Guo et al., 2020; Talmor and Berant, 2019). In this work, we focus on extractive QA. In this context, the MRQA workshop held a shared task dedicated to evaluating the generalization capabilities of QA models on unseen target datasets (Fisch et al., 2019a). The winning team (Li et al., 2019) uses an ensemble of multiple pre-trained language models, which includes XLNet (Yang et al., 2019) and ERNIE (Sun et al., 2019). Other submissions, e.g., Su et al. (2019), outperform the baseline by using more complex models with more parameters and better pre-training.

Figure 1: An illustration of our debiasing framework. The teacher and bias models are trained beforehand. During training, the corresponding teacher model for the input example outputs a prediction distribution, which is used for distilling the knowledge to the student. Each bias model generates a bias weight for the example. We combine all the bias weights and use them to adapt the distillation loss.

Unlike the methods mentioned above, in this paper, we propose a model-agnostic approach to handle biases of the training data for improving the generalization capability of QA models. Our proposed approach improves generalization without requiring any additional training data or employing larger models or ensembles.

Multi-bias Debiasing Framework
Let D_T = {D_{t_1}, ..., D_{t_n}} be the set of n training datasets, and D_E = {D_{e_1}, ..., D_{e_m}} be the set of m evaluation sets that represent out-of-domain data. Each example x_i in both training and evaluation datasets contains a question q_i, a context c_i, and an answer span a_i as the input. The corresponding output for x_i consists of the start index s_i and end index e_i, which denote the span of the correct answer in c_i. Our goal is to train a single model on D_T that achieves good zero-shot transfer performance on D_E, i.e., to obtain a generalizable model that transfers well to unseen domains.
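To make this setup concrete, the example structure can be sketched as follows (the class and field names are illustrative choices, not taken from the authors' code):

```python
from dataclasses import dataclass

@dataclass
class QAExample:
    """One extractive-QA example: the answer is the token span
    [start, end] (inclusive) of the context."""
    question: str
    context: str
    start: int  # token index where the answer begins
    end: int    # token index where the answer ends (inclusive)

    def answer_tokens(self):
        # Recover the answer span from the context tokens.
        return self.context.split()[self.start:self.end + 1]
```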
To achieve this, we propose a novel debiasing framework that models multiple biases of the training data. The framework consists of four components (see Figure 1): (1) multi-domain knowledge distillation (KD) to distill the knowledge from multiple teachers into a single student model (§3.1); (2) a set of bias models that we use for detecting biased training examples (§3.2); (3) a novel multi-bias weighting function that weights individual training examples based on the biases they contain (§3.3); and (4) a bias-aware loss function, which encourages the model to focus on more general data patterns instead of heavily biased examples. We examine two different losses that either scale the teacher predictions or adjust each training example's weight during training (§3.4).
In the following, we will describe the four components in more detail.

Multi-domain Knowledge Distillation
The idea of multi-domain knowledge distillation is to distill an ensemble of teacher models into a single student model by learning from the soft teacher labels instead of the hard one-hot labels. Even when only used with one training set, KD can provide a richer training signal than one-hot labels (Mobahi et al., 2020; Hinton et al., 2015).
We first train n teacher models {M_{t_1}, ..., M_{t_n}}, one for each of the training sets. We then distill the knowledge from all the teacher models into one multi-domain student model M. For every example (x_i, y_i) from dataset D_{t_j}, we obtain the probability distribution p_i^t from the teacher model M_{t_j} and minimize the Kullback-Leibler (KL) divergence between the student distribution p_i^s and the teacher distribution p_i^t.
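The per-example distillation objective described above can be sketched as follows; a minimal NumPy version for clarity (the function names are illustrative, and a real implementation would operate on framework tensors with gradients):

```python
import numpy as np

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student) between two distributions over answer indices."""
    p = np.asarray(p_teacher, dtype=float)
    q = np.asarray(p_student, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def distillation_loss(teacher_start, teacher_end, student_start, student_end):
    # The student is trained to match the teacher's soft distributions
    # over the start and end indices of the answer span.
    return (kl_divergence(teacher_start, student_start)
            + kl_divergence(teacher_end, student_end))
```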

Bias Models
In order to prevent models from learning patterns associated with biases, we first need to recognize the biased training examples. The common method for doing so is to train models that only leverage bias patterns for solving the task (Clark et al., 2019; Mahabadi et al., 2020; Utama et al., 2020; He et al., 2019). We call these models bias models B_1, ..., B_k. For instance, some answers can be identified by only considering the interrogative adverbs that indicate the question type, e.g., when, where, etc. Therefore, the corresponding bias model only uses those adverbs in questions to identify answers. We use such bias models to compute weights that determine how well the training examples can be solved by relying on the biases.
Since QA models should predict the indices of the start and end tokens of an answer span, we define two bias weights β_{j,s} and β_{j,e} for each example x_i and bias model B_j. Let p_{B_j} be B_j's predicted output distribution of the start index for x_i, and let g be the gold start index. We define β_{j,s} as follows:

    β_{j,s}(x_i) = p_{B_j}(g)  if argmax_k p_{B_j}(k) = g,  and  β_{j,s}(x_i) = 0  otherwise.    (1)

By setting β_{j,s} to zero, we treat the example as unbiased if it cannot be answered by the bias model. We determine β_{j,e} accordingly for the end index. To simplify our notation, in the remainder of this work, we denote β(x_i) as the bias weight of one example and do not differentiate between start and end indices.
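The example-level weight can be sketched as follows, assuming the definition given in the surrounding text: the bias model's probability of the gold index when the bias model predicts it correctly, and zero otherwise (the function name is illustrative):

```python
import numpy as np

def bias_weight(bias_probs, gold_index):
    """Example-level bias weight beta: the probability the bias model assigns
    to the gold index if the bias model's top prediction is correct, else 0."""
    probs = np.asarray(bias_probs, dtype=float)
    if int(np.argmax(probs)) == gold_index:
        return float(probs[gold_index])
    return 0.0
```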

Multi-Bias Weighting Function
As we show in §5.1, each dataset contains various biases of different strengths. If we directly use the output of the bias models to down-weight or filter all biased examples, as is the case in existing debiasing methods, we lose the training signal from a considerable portion of the training data. This in turn decreases the overall performance (Utama et al., 2020). To apply our framework to training sets that may contain multiple biases of different strengths, we automatically weight the output of the bias models according to the strength of each bias in each training dataset.
Therefore, we propose a scaling factor F_S(B_k, D_{t_j}) to automatically control the impact of bias B_k in dataset D_{t_j} in our debiasing framework, i.e., to reduce the impact of a bias on the loss function when the bias is commonly observed in the dataset.
The scaling factor is defined as:

    F_S(B_k, D_{t_j}) = 1 − EM(B_k, D_{t_j}) / EM(M_{t_j}, D_{t_j}),

where EM measures the performance of the examined model on the given dataset based on the exact match score, and M_{t_j} is the teacher model that is trained on D_{t_j}. This lowers the impact of strong biases whose corresponding bias models perform well, e.g., when their performance is close to the performance of the teacher model. If F_S = 0, the performance of B_k equals that of M_{t_j}, indicating that this bias type exists in all the training examples; thus, we do not use it for debiasing. We then combine multiple biases for a single training example x_i ∈ D_{t_j} as follows:

    F_B(x_i) = min_k ( F_S(B_k, D_{t_j}) · β_k(x_i) )    (2)

The scaling factor F_S(B_k, D_{t_j}) computes a dataset-level weight for bias B_k, while β_k(x_i) computes an example-level weight for x_i based on B_k. In summary, an example x_i receives a high weight based on B_k if (1) x_i can be correctly answered using the bias model B_k, and (2) B_k is not prevalent in the training examples of D_{t_j}. The weight of a single bias B_k on example x_i is the product of the example-level and dataset-level weights, and the final bias weight F_B(x_i) is the minimum of these products over all biases.
The purpose of using the minimum in Equation 2 is to retain as much training signal as possible from the original data by only down-weighting examples that are affected by all biases.
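The dataset-level scaling and the minimum-based combination described above can be sketched as follows. This assumes F_S = 1 − EM(bias model) / EM(teacher) and a per-bias product F_S · β, as described in the text; the helper names are illustrative:

```python
def scaling_factor(em_bias, em_teacher):
    """Dataset-level weight F_S: close to 1 when the bias model performs far
    below the teacher (weak bias in this dataset), and 0 when it matches the
    teacher (the bias is present in effectively all examples)."""
    return 1.0 - em_bias / em_teacher

def combined_bias_weight(example_betas, dataset_factors):
    """F_B for one example: the minimum over biases of the product of the
    example-level weight (beta) and the dataset-level weight (F_S).
    An example is only down-weighted if it is affected by all biases."""
    return min(b * f for b, f in zip(example_betas, dataset_factors))
```

Note how the minimum implements the behavior described above: if even one bias model fails on the example (its β is zero), F_B is zero and the example keeps its full training signal.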

Bias-Aware Loss Function
The final step is to incorporate F_B within the distillation process to adapt the loss of each example based on its corresponding bias weight.
Assume p_i^t and p_i^s are the probability predictions of a teacher model M_{t_j} and a student model M on example x_i ∈ D_{t_j}, respectively. We incorporate F_B into the loss function in two different ways: (1) multi-bias confidence regularization (Mb-CR), and (2) multi-bias weighted loss (Mb-WL). While the bias weights are used to scale the teacher probabilities in Mb-CR, they are directly applied to weight the training loss in Mb-WL. The main difference between these two training losses is that the bias weights have a more direct, and therefore stronger, impact on the loss function in Mb-WL.

Multi-bias confidence regularization (Mb-CR).
We adapt the confidence regularization method of Utama et al. (2020) to our setup to concurrently debias multiple biases. We use F_B to scale the teacher predictions to make the teacher less confident on biased examples. We define the scaled probability of the teacher model on token j of x_i as follows:

    p̂_{i,j}^t = (p_{i,j}^t)^{1 − F_B(x_i)} / Σ_k (p_{i,k}^t)^{1 − F_B(x_i)}    (3)

We then train the student model M by minimizing the Kullback-Leibler divergence between p_i^s and the scaled teacher distribution p̂_i^t.

Multi-bias weighted loss (Mb-WL). In this approach, we use the bias weights to directly weight the corresponding loss of each training example. In this case, the training objective is to minimize the weighted Kullback-Leibler divergence L between p_i^s and p_i^t as follows:

    L = (1 − F_B(x_i)) · KL(p_i^t ‖ p_i^s)    (4)
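Both loss variants can be sketched in a few lines. The exponent form of the confidence-regularization scaling and the (1 − F_B) loss weight are assumptions consistent with the description above (flattening the teacher on biased examples, down-weighting their loss), not the released implementation:

```python
import numpy as np

def scale_teacher(probs, f_b):
    """Mb-CR: flatten the teacher distribution with exponent (1 - F_B).
    A heavily biased example (F_B near 1) yields a near-uniform,
    low-confidence teacher; F_B = 0 leaves the teacher unchanged."""
    p = np.asarray(probs, dtype=float) ** (1.0 - f_b)
    return p / p.sum()

def weighted_kl(p_teacher, p_student, f_b, eps=1e-12):
    """Mb-WL: down-weight the distillation loss of biased examples."""
    p = np.asarray(p_teacher, dtype=float)
    q = np.asarray(p_student, dtype=float)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    return float((1.0 - f_b) * kl)
```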

Examined Biases and Bias Models
We incorporate four biases in our experiments.
• Wh-word: the corresponding model for detecting this bias only uses the interrogative adverbs from the question.
• Lexical overlap (Jia and Liang, 2017): in many QA examples, the answer is in the sentence of the context that has a high similarity to the question. To recognize this bias, we train the bias model using only the sentence of the context that has the highest similarity to the question, if the answer lies in this sentence. Otherwise, we exclude the example during training.
• Empty question (Sugawara et al., 2020): the answer can be found without the presence of a question, e.g., by selecting the most prominent entity of the context. The model for detecting this bias only uses contexts without questions.
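The input construction for the lexical-overlap bias model can be sketched as follows. The word-overlap similarity used here is a simple stand-in; the paper does not specify the exact similarity measure, and the function names are illustrative:

```python
def most_similar_sentence(question, context_sentences):
    """Pick the context sentence with the highest word overlap with the
    question (a simple similarity proxy)."""
    q_tokens = set(question.lower().split())

    def overlap(sent):
        s_tokens = set(sent.lower().split())
        return len(q_tokens & s_tokens) / max(len(s_tokens), 1)

    return max(context_sentences, key=overlap)

def lexical_overlap_example(question, context_sentences, answer):
    """Build a training example for the lexical-overlap bias model, or
    return None (exclude the example) if the answer is not in the most
    similar sentence."""
    best = most_similar_sentence(question, context_sentences)
    return (question, best) if answer in best else None
```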

Data
We use five training datasets: SQuAD, NQ, NewsQA, TriviaQA, and HotpotQA.

Evaluation Settings
We evaluate our proposed methods in two different settings: (1) single-domain (SD), in which the model is trained on a single dataset, and (2) multi-domain (MD), in which the model is trained on multiple datasets simultaneously. The baseline for the MD setting is the multi-task model of Fisch et al. (2019b), which is a BERT model trained on all datasets with multi-task learning. We refer to this baseline as MT-BERT.
We report Exact Match (EM), i.e., whether the predicted answer exactly matches the correct one. We include the corresponding F1 scores, which measure the overlap rate between the predicted answer and the gold one, in the appendix.
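For reference, EM is typically computed with SQuAD-style answer normalization before comparison; a sketch of such a metric (not necessarily the exact evaluation script used here):

```python
import re
import string

def normalize(text):
    """Lower-case, strip punctuation and English articles, collapse
    whitespace (SQuAD-style answer normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))
```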

Strength of biases on different datasets
We report the ratio of the examples for each dataset that are correctly answered by our bias models (see §4.2) in Table 1. A higher ratio corresponds to a stronger observed bias. We observe that (1) different datasets are more affected by certain biases, e.g., the ratio of examples that can be answered without the question (the empty question bias) is 8% in SQuAD while it is 38% in NQ, (2) NewsQA is least affected by biases overall while NQ and HotpotQA are most affected, (3) only a few instances are affected by all four biases, and (4) except for NewsQA, the majority of training examples are affected by at least one bias. Therefore, methods that down-weight or ignore all biased examples will considerably weaken the overall training signal.

Table 2 shows the results of the single-domain setting. We observe that (1) without using any additional training examples or increasing the model size, we can improve generalization by using our debiasing methods, (2) the impact of debiasing methods is stronger when the training data is more biased, and (3) the use of our proposed debiasing methods not only improves generalization, but also improves the performance on the in-domain evaluation dataset, which contains similar biases to those of the training data. This is in contrast to previous work that either decreases the in-domain performance (He et al., 2019; Clark et al., 2019; Mahabadi et al., 2020) or at most preserves it (Utama et al., 2020). We analyze the reason for this in §6.1.

Table 3 shows the results of the multi-domain setting. Talmor and Berant (2019) show that training MT-BERT on multiple domains leads to robust generalization. Since MT-BERT is trained on multiple domains simultaneously, which are not equally affected by different biases, the model is less likely to learn these patterns. However, our results show that our debiasing methods further improve the average EM scores by more than one point even when the model is trained on multiple domains.

Discussion and Analysis
In this section, we discuss the benefits and limitations of our framework.

Why does our debiasing improve in-domain and out-of-domain performance?
The main differences of our proposed framework to the state-of-the-art debiasing approaches are as follows:

• It is a general framework and can be used with any bias-aware training objective, e.g., that of Mahabadi et al. (2020). Comparing our methods with Mahabadi et al. (2020)'s DFL indicates whether our proposed methods for modeling multiple biases improve the performance, or whether any method that models multiple biases jointly has the same impact. For a fair comparison, we use the same bias types and bias weights in all the debiasing methods.
• For the comparison with the method of Utama et al. (2020), which handles a single bias, we use the lexical overlap bias, as it is the most dominant bias in the majority of our training datasets (see Table 1).

Based on the SD results, we observe that (1) debiasing only based on the lexical overlap bias, which is the strongest bias in the training data, considerably drops the in-domain performance and has a negligible impact on out-of-domain results, and (2) while combining all biases using DFL improves the in-domain results, it does not have a significant impact on out-of-domain performance. This shows the importance of (a) concurrent modeling of multiple biases, and (b) our proposed multi-bias methods in improving the overall performance. We further investigate the impact of each of the components of our framework in §6.2.
The results of CR(lex) in the MD setting show that debiasing based on a single bias, one that is common in most of the training datasets, negatively impacts both the in-domain and out-of-domain performance. Similar to the SD results, the DFL bias combination has a more positive impact on in-domain than on out-of-domain performance in the MD results.
Overall, both SD and MD results show the effectiveness of our proposed framework for both in-domain and out-of-domain setups.

Impact of the Framework Components
We investigate the impact of the components of our framework, including: (1) knowledge distillation (KD), by replacing the teacher probabilities with gold labels in Mb-WL; and (2) the scaling factor (F_S), by removing the scaling factor from Equation 2. Table 5 reports the results for the SD setting when the model is trained on NQ. The results show that KD has a large impact on the generalization of Mb-WL, while F_S has a stronger impact on Mb-CR's generalization.

In addition, we evaluate the impact of combining multiple biases in Table 6 by using a single bias at a time instead of modeling multiple biases. The results show that multi-bias modeling (1) is more useful than modeling any individual bias for both in-domain and out-of-domain experiments, and (2) has a more significant impact on Mb-CR compared to Mb-WL.

Table 6: The performance differences between single-bias modeling and multi-bias modeling. All models are trained on the NQ dataset.

Is debiasing always beneficial?
We hypothesize that applying debiasing methods will not lead to performance gains if (1) the presence of the examined biases is not strong in the training data, i.e., if most of the examples are unbiased and the model trained on this data is therefore not biased to begin with, or (2) the out-of-domain set strongly contains the biases that the model is debiased against during training.
To verify the first hypothesis, we evaluate the single-domain experiments using the NewsQA dataset, which contains the smallest ratio of biased examples, i.e., only 1% of the data contains all of the examined biases. The results, reported in Table 7, confirm this hypothesis.

Regarding the second hypothesis, we report the results of the bias models on the evaluation sets in Table 8. The results of all bias models are considerably higher on RelExt than on the other evaluation datasets, and as we see from the results of both SD and MD settings in Tables 2 and 3, our debiasing methods are the least effective at improving the out-of-domain performance on this evaluation set.

Conclusion
In this paper we (1) investigate the impact of debiasing methods on QA model generalization for both single- and multi-domain training scenarios, and (2) propose a new framework for improving the in-domain and out-of-domain performance by concurrently modeling multiple biases. Our framework weights each training example according to multiple biases and based on the strength of each bias in the training data. It uses the resulting bias weights in the training objective to prevent the model from mainly focusing on learning biases.

We evaluate our framework using two different training objectives, i.e., multi-bias confidence regularization and multi-bias loss re-weighting, and show its effectiveness in both single- and multi-domain training scenarios. We further compare our framework with the two state-of-the-art debiasing methods of Utama et al. (2020) and Mahabadi et al. (2020). We show that knowledge distillation, modeling multiple biases at once, and weighting the impact of each bias based on its strength in the training data are all important factors in improving the in-domain and out-of-domain performance.

While the recent literature on debiasing in NLP focuses on improving the performance on adversarial evaluation sets, this work opens new research directions on wider uses of debiasing methods. The main advantage of our debiasing methods is that they improve performance and generalization without requiring additional training data or larger models. Future work could build upon our framework by applying it to a wide range of tasks beyond QA using task-specific bias models.

A Training details
We use the same hyperparameters as the MRQA shared task. More specifically, we use the BertAdam optimizer with a learning rate of 3 × 10^-5 and a batch size of 6. We use all training examples of each dataset during training and evaluation. All our models are trained for 2 epochs. We set the maximum input sequence length to 512 tokens; longer contexts are split into several training instances. The single-domain experiment takes roughly 3 hours on a single Nvidia Tesla V100-SXM3-32GB GPU, while the multi-domain experiment takes around 15 hours on the same GPU.

B Datasets

Table 9 presents a brief description of each of the examined training and evaluation sets.

C SD results using other training data
We report the results of the SD setting using NQ, TriviaQA, and NewsQA in the paper. Table 10 reports the results, using the EM score, on the remaining training data, i.e., SQuAD and HotpotQA. Debiasing the model trained on SQuAD has a more positive impact on out-of-domain results, while debiasing the model trained on HotpotQA has a stronger impact on in-domain performance.

D Results using F1 scores
The results in the paper are reported using the EM score. Tables 11-17 show the results of this work using F1 scores. The main difference between EM and F1 scores arises for answers whose corresponding span contains more than one word. If a system partially detects the correct span boundary, it receives a partial F1 score but a zero EM score. As we see, the findings of the paper remain the same using F1 scores instead of EM scores.
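To illustrate the partial-credit behavior of F1, a token-level sketch (the official evaluation script additionally normalizes answers, as EM does; this simplified version skips that step):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-level F1 between a predicted span and a gold span: a partially
    correct span gets partial credit, unlike exact match."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```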