Demographics Should Not Be the Reason of Toxicity: Mitigating Discrimination in Text Classifications with Instance Weighting

With the recent proliferation of the use of text classifications, researchers have found that there are certain unintended biases in text classification datasets. For example, texts containing some demographic identity-terms (e.g., “gay”, “black”) are more likely to be abusive in existing abusive language detection datasets. As a result, models trained with these datasets may consider sentences like “She makes me happy to be gay” as abusive simply because of the word “gay.” In this paper, we formalize the unintended biases in text classification datasets as a kind of selection bias from the non-discrimination distribution to the discrimination distribution. Based on this formalization, we further propose a model-agnostic debiasing training framework by recovering the non-discrimination distribution using instance weighting, which does not require any extra resources or annotations apart from a pre-defined set of demographic identity-terms. Experiments demonstrate that our method can effectively alleviate the impacts of the unintended biases without significantly hurting models’ generalization ability.


Introduction
With the development of Natural Language Processing (NLP) techniques, Machine Learning (ML) models are being applied in continuously expanding areas (e.g., to detect spam emails, to filter resumes, to detect abusive comments), and they are affecting everybody's life in many aspects. However, human-generated datasets may introduce human social prejudices to the models (Caliskan-Islam et al., 2016). Recent works have found that ML models can capture, utilize, and even amplify such unintended biases (Zhao et al., 2017), which has raised lots of concerns about the discrimination problem in NLP models (Sun et al., 2019).

Table 1: Percentage of toxic comments by some specific demographic identity-terms in the dataset released by Dixon et al. (2018).

Text classification is one of the fundamental tasks in NLP. It aims at assigning any given sentence to a specific class. In this task, models are expected to make predictions with the semantic information rather than with the demographic group identity information (e.g., "gay", "black") contained in the sentences.
However, recent research points out that there widely exist some unintended biases in text classification datasets. For example, in a toxic comment identification dataset released by Dixon et al. (2018), it is found that texts containing some specific identity-terms are more likely to be toxic. More specifically, 57.4% of comments containing "gay" are toxic, while only 9.6% of all samples are toxic, as shown in Table 1.
Because of such a phenomenon, models trained with the dataset may capture the unintended biases and perform differently for texts containing different identity-terms. As a result, model predictions may discriminate against some demographic minority groups. For instance, sentences like "She makes me happy to be gay" are judged as abusive by models trained on biased datasets in our experiments, which may hinder members of those minority groups who want to express their feelings on the web freely.
Recent model-agnostic research on mitigating the unintended biases in text classification can be summarized as data manipulation methods (Sun et al., 2019). For example, Dixon et al. (2018) propose to apply data supplementation with additional labeled sentences to make the toxic/non-toxic ratio balanced across different demographic groups. Park et al. (2018) propose to use data augmentation by applying gender-swapping to sentences with identity-terms to mitigate gender bias. The core of these works is to transform the training set into an identity-balanced one. However, data manipulation is not always practical. Data supplementation often requires careful selection of the additional sentences w.r.t. the identity-terms, the labels, and even the lengths of sentences (Dixon et al., 2018), bringing a high cost for extra data collection and annotation. Data augmentation may result in meaningless sentences (e.g., "He gives birth."), and is impractical to perform when there are many demographic groups (e.g., for racial bias cases).
In this paper, we propose a model-agnostic debiasing training framework that does not require any extra resources or annotations, apart from a pre-defined set of demographic identity-terms. We tackle this problem from another perspective, in which we treat the unintended bias as a kind of selection bias (Heckman, 1979). We assume that there are two distributions: the non-discrimination distribution, and the discrimination distribution observed in the biased datasets, where every sample of the latter is drawn independently from the former following a discrimination rule, i.e., the social prejudice. With such a formalization, mitigating the unintended biases is equivalent to recovering the non-discrimination distribution from the selection bias. Under a few reasonable assumptions, we prove that we can obtain the unbiased loss of the non-discrimination distribution using only samples from the observed discrimination distribution with instance weights. Based on this, we propose a non-discrimination learning framework. Experiments on three datasets show that, despite requiring no extra data, our method is comparable to the data manipulation methods in terms of mitigating the discrimination of models.
The rest of the paper is organized as follows. We summarize the related works in Section 2. Then we give our perspective of the problem and examine the assumptions of commonly used methods in Section 3. Section 4 introduces our non-discrimination learning framework. Taking three datasets as examples, we report the experimental results of our methods in Section 5. Finally, we conclude and present future work in Section 6.

Related Works
Non-discrimination and Fairness Non-discrimination focuses on a number of protected demographic groups and asks for parity of some statistical measures across these groups (Chouldechova, 2017). As mentioned by Friedler et al. (2016), non-discrimination can be achieved only if all groups have similar abilities w.r.t. the task in the constructed space that contains the features on which we would like to base a decision. There are various definitions of non-discrimination corresponding to different statistical measures. Popular measures include the raw positive classification rate (Calders and Verwer, 2010), false positive and false negative rates (Hardt et al., 2016), and the positive predictive value (Chouldechova, 2017). Methods like adversarial training (Beutel et al., 2017; Zhang et al., 2018) and fine-tuning (Park et al., 2018) have been applied to remove bias.
In the NLP area, fairness and discrimination problems have also gained tremendous attention. Caliskan-Islam et al. (2016) show that semantics derived automatically from language corpora contain human biases. Bolukbasi et al. (2016) show that pre-trained word embeddings trained on large-scale corpus can exhibit gender prejudices and provide a methodology for removing prejudices in embeddings by learning a gender subspace. Zhao et al. (2018) introduce the gender bias problem in coreference resolution and propose a general-purpose method for debiasing.
As for text classification tasks, Dixon et al. (2018) first point out the unintended bias in datasets and propose to alleviate the bias by supplementing external labeled data. Kiritchenko and Mohammad (2018) examine gender and race bias in 219 automatic sentiment analysis systems and find that several models show significant bias. Park et al. (2018) focus on the gender bias in the abusive language detection task and propose to debias by augmenting the datasets with a gender-swapping operation. In this paper, we propose to make models fit a non-discrimination distribution with calculated instance weights.
Instance Weighting Instance weighting has been broadly adopted for reducing bias. For example, the Inverse Propensity Score (IPS) (Rosenbaum and Rubin, 1983) method has been successfully applied to causal effect analyses (Austin and Stuart, 2015), selection bias (Schonlau et al., 2009), position bias (Wang et al., 2018; Joachims et al., 2017), and so on. Zadrozny (2004) proposed a methodology for learning and evaluating classifiers under "Missing at Random" (MAR) (Rubin, 1976) selection bias. Zhang et al. (2019) study the selection bias in natural language sentence matching datasets and propose to fit a leakage-neutral distribution with instance weighting. Jiang and Zhai (2007) propose an instance weighting framework for domain adaptation in NLP, which requires data from the target domain.
In our work, we formalize the discrimination problem as a kind of "Not Missing at Random" (NMAR) (Rubin, 1976) selection bias from the non-discrimination distribution to the discrimination distribution, and propose to mitigate the unintended bias with instance weighting.

Perspective
In this section, we present our perspective on the discrimination problem in text classification. First, we define what the non-discrimination distribution is. Then, we discuss what requirements non-discrimination models should meet and examine some commonly used criteria for non-discrimination. After that, we analyze some commonly used methods for assessing discrimination quantitatively. Finally, we show that the existing debiasing methods can also be seen as trying to recover the non-discrimination distribution, and examine their assumptions.

Non-discrimination Distribution
The unintended bias in datasets is a legacy of human society, where discrimination widely exists. We denote the distribution in the biased datasets as the discrimination distribution D.
Given that the real world is discriminatory although it should not be, we assume that there is an ideal world where no discrimination exists, and that the real world is merely a biased reflection of this non-discrimination world. Under this perspective, we assume that there is a non-discrimination distribution reflecting the ideal world, and that the discrimination distribution D is drawn from it following a discriminatory rule, the social prejudice. Correcting the bias of datasets is then equivalent to recovering the original non-discrimination distribution.
For the text classification tasks tackled in this paper, we denote X as the sentences, Y as the (binary) label indicator variable, and Z as the demographic identity information (e.g., "gay", "black", "female") in every sentence. In the rest of the paper, we use P(·) to represent probabilities under the discrimination distribution D observed in datasets, and Q(·) for the non-discrimination distribution. The non-discrimination distribution should then satisfy Q(Y|Z) = Q(Y), which means that the demographic identity information is independent of the labels.

Non-Discrimination Model
For text classification tasks, models are expected to make predictions by understanding the semantics of sentences rather than by reacting to single identity-terms. As mentioned in Dixon et al. (2018), a model is defined as biased if it performs better for sentences containing some specific identity-terms than for ones containing others. In other words, a non-discrimination model should perform similarly across sentences mentioning different demographic groups. However, "perform similarly" is hard to define precisely. Thus, we pay more attention to criteria defined on demographic groups. A widely used criterion is Equalized Odds (also known as Error Rate Balance) defined by Chouldechova (2017), requiring the prediction Ŷ to be independent of Z when Y is given, in which Ŷ refers to the predictions of the model. This criterion is also used by Borkan et al. (2019) in text classification.
Besides the Equalized Odds criterion, a straightforward criterion for judging non-discrimination is Statistical Parity (also known as Demographic Parity, Equal Acceptance Rates, and Group Fairness) (Calders and Verwer, 2010; Dwork et al., 2012), which requires Ŷ to be independent of Z, i.e., Pr(Ŷ|Z) = Pr(Ŷ). Another criterion is Predictive Parity (Chouldechova, 2017), which requires Y to be independent of Z when the condition Ŷ = 1 is given, i.e., Pr(Y | Ŷ = 1, Z) = Pr(Y | Ŷ = 1). Given the definitions of the three criteria, we propose the following theorem, whose proof is presented in Appendix A.
Theorem 1 (Criterion Consistency). When tested on a distribution in which Pr(Y|Z) = Pr(Y), a predictor Ŷ satisfying Equalized Odds also satisfies Statistical Parity and Predictive Parity.
Based on this theorem, in this paper we propose to evaluate models under a distribution where the demographic identity information is not predictive of the labels, which unifies the three widely used criteria. Specifically, we define that a non-discrimination model should meet Pr(Ŷ | Y, Z) = Pr(Ŷ | Y) when evaluated under a distribution satisfying Pr(Y|Z) = Pr(Y).
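For concreteness, the per-group quantities that these criteria compare can be estimated from a labeled evaluation set as follows; this is a minimal sketch with our own function and field names, assuming binary 0/1 labels and predictions:

```python
def criterion_rates(y_true, y_pred, groups):
    """Estimate, for each demographic group z, the quantities the three criteria
    compare: Pr(Y_hat=1 | Z) for Statistical Parity, the per-group FPR/FNR for
    Equalized Odds, and Pr(Y=1 | Y_hat=1, Z) for Predictive Parity."""
    stats = {}
    for z in set(groups):
        yt = [t for t, g in zip(y_true, groups) if g == z]
        yp = [p for p, g in zip(y_pred, groups) if g == z]
        neg = [p for t, p in zip(yt, yp) if t == 0]   # predictions on true negatives
        pos = [p for t, p in zip(yt, yp) if t == 1]   # predictions on true positives
        pred_pos = [t for t, p in zip(yt, yp) if p == 1]
        stats[z] = {
            "pos_rate": sum(yp) / len(yp),                             # Statistical Parity
            "fpr": sum(neg) / len(neg) if neg else 0.0,                # Equalized Odds
            "fnr": sum(1 - p for p in pos) / len(pos) if pos else 0.0,
            "ppv": sum(pred_pos) / len(pred_pos) if pred_pos else 0.0, # Predictive Parity
        }
    return stats
```

A non-discrimination model should show (approximately) equal values of each quantity across the groups.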

Assessing the Discrimination
Identity Phrase Templates Test Sets (IPTTS) are widely used as non-discrimination testing sets to assess models' discrimination (Dixon et al., 2018; Park et al., 2018; Sun et al., 2019; Kiritchenko and Mohammad, 2018). These testing sets are generated from several templates with slots for each of the identity-terms. Identity-terms implying different demographic groups are slotted into the templates, e.g., "I am a boy." and "I am a girl.", so by construction IPTTS satisfies Pr(Y|Z) = Pr(Y). A non-discrimination model is expected to perform similarly on sentences generated by the same template but with different identity-terms.
Following Dixon et al. (2018), we measure discrimination with the False Positive Equality Difference (FPED) and False Negative Equality Difference (FNED),

FPED = Σ_z |FPR_overall − FPR_z|,  FNED = Σ_z |FNR_overall − FNR_z|,

in which FPR_overall and FNR_overall, standing for the False Positive Rate and False Negative Rate respectively, are calculated over the whole IPTTS. Correspondingly, FPR_z and FNR_z are calculated on each subset of the data containing each specific identity-term. These two metrics can be seen as a relaxation of Equalized Odds mentioned in Section 3.2 (Borkan et al., 2019).
It should also be emphasized that FPED and FNED do not evaluate the accuracy of models at all, and models can achieve lower FPED and FNED by making trivial predictions. For example, when tested on a distribution where Pr(Y|Z) = Pr(Y), if a model makes the same prediction for all inputs, FPED and FNED will be 0, while the model is completely useless.
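The two metrics above can be computed directly from their definitions; a minimal sketch (function and argument names are ours, assuming binary 0/1 labels and predictions):

```python
def equality_differences(y_true, y_pred, term_subsets):
    """Compute FPED = sum_z |FPR_overall - FPR_z| and
    FNED = sum_z |FNR_overall - FNR_z| over an IPTTS-style test set.
    `term_subsets` maps each identity-term z to the indices of its examples."""
    def rates(yt, yp):
        neg = [p for t, p in zip(yt, yp) if t == 0]   # predictions on true negatives
        pos = [p for t, p in zip(yt, yp) if t == 1]   # predictions on true positives
        fpr = sum(neg) / len(neg) if neg else 0.0
        fnr = sum(1 - p for p in pos) / len(pos) if pos else 0.0
        return fpr, fnr

    fpr_all, fnr_all = rates(y_true, y_pred)
    fped = fned = 0.0
    for z, idx in term_subsets.items():
        fpr_z, fnr_z = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
        fped += abs(fpr_all - fpr_z)
        fned += abs(fnr_all - fnr_z)
    return fped, fned
```

Note that a model predicting the same class everywhere gets FPED = FNED = 0, illustrating why the metrics must be read together with AUC.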

Correcting the Discrimination
Data manipulation has been applied to correct the discrimination in datasets (Sun et al., 2019). Previous works try to supplement or augment the datasets into identity-balanced ones, which, from our perspective, is primarily an attempt to recover the non-discrimination distribution.
For data supplementation, Dixon et al. (2018) add some additional non-toxic samples containing those identity-terms that appear disproportionately across labels in the original biased dataset. Although the method is reasonable, it is not always practical to add additional labeled data with specific identity-terms due to the high cost, as careful selection of the additional sentences w.r.t. the identity-terms, the labels, and even the lengths of sentences is required (Dixon et al., 2018).
The gender-swapping augmentation is a more common operation to mitigate the unintended bias (Zhao et al., 2018; Sun et al., 2019). For text classification tasks, Park et al. (2018) augment the datasets by swapping the gender-implying identity-terms (e.g., "he" to "she", "actor" to "actress") in the sentences of the training data to remove the correlation between Z and Y. However, it is worth mentioning that the gender-swapping operation additionally assumes that the non-discrimination distribution satisfies Q(X¬, Y | Z = male) = Q(X¬, Y | Z = female) and Q(Z = male) = Q(Z = female), in which X¬ refers to the content of sentences excluding the identity information. We argue that these assumptions may not always hold. For example, the first assumption may produce some meaningless sentences (e.g., "He gives birth.") (Sun et al., 2019). Besides, this method is not practical for situations with many demographic groups.
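For illustration, the gender-swapping operation can be sketched as follows; the pair list here is a small toy one of our own, whereas Park et al. (2018) use a curated list of gender word pairs (released by Zhao et al. (2018)):

```python
# Toy illustrative pair list; a real implementation uses a curated resource.
SWAP_PAIRS = {"he": "she", "she": "he", "his": "her", "her": "his",
              "actor": "actress", "actress": "actor", "boy": "girl", "girl": "boy"}

def gender_swap(sentence):
    """Return the augmented copy of a sentence with all gender-implying
    identity-terms swapped; the original and the copy share the same label."""
    return " ".join(SWAP_PAIRS.get(tok, tok) for tok in sentence.lower().split())
```

Note that `gender_swap("She gives birth")` yields "he gives birth", exactly the kind of meaningless sentence that the first assumption can produce.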

Our Instance Weighting Method
In this section, we introduce the proposed method for mitigating discrimination in text classification. We first make a few assumptions about how the discrimination distribution D in the datasets is generated from the non-discrimination distribution. Then we demonstrate that we can obtain the unbiased loss on the non-discrimination distribution using only samples from D, which enables models to fit the non-discrimination distribution without extra resources or annotations.

Assumptions about the Generation Process
Considering the perspective that the discrimination distribution D is generated from the non-discrimination distribution, we refer to S ∈ {0, 1} as the selection indicator variable, which indicates whether a sample is selected into the biased dataset or not. Specifically, we assume that every sample (x, z, y, s) is drawn independently from the non-discrimination distribution following the rule that, if s = 1, the sample is selected into the dataset, and otherwise it is discarded. Then we have

Assumption 1: P(X, Z, Y) = Q(X, Z, Y | S = 1),

and, as defined in Section 3.1, the non-discrimination distribution satisfies

Assumption 2: Q(Y|Z) = Q(Y).

Ideally, if the values of S were entirely random, the generated dataset would correctly reflect the original non-discrimination distribution and would not exhibit discrimination. However, due to social prejudices, the value of S is not random. Inspired by the fact that some identity-terms are more associated with some specific labels than others (e.g., sentences containing "gay" are more likely to be abusive in the dataset, as mentioned before), we assume that S is controlled by Y and Z, and that, given any Z and Y, the conditional probability of S = 1 is greater than 0:

Assumption 3: Q(S | X, Z, Y) = Q(S | Z, Y), with Q(S = 1 | Z, Y) > 0.

Meanwhile, we assume that the social prejudices do not change the marginal probability distribution of Z:

Assumption 4: P(Z) = Q(Z),

which also means that S is independent of Z under the non-discrimination distribution, i.e., Q(S|Z) = Q(S).
Among them, Assumption 1 and 2 come from our problem framing. Assumption 3 helps simplify the problem. Assumption 4 helps establish the non-discrimination distribution D. Theoretically, when Z is contained in X, which is a common case, consistent learners should be asymptotically immune to this assumption (Fan et al., 2005). A more thorough discussion about Assumption 4 can be found in Appendix B.
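The generation process under these assumptions can be simulated to build a synthetic biased dataset; in this sketch, all names and probabilities are illustrative, and the keep probabilities are chosen so that both groups share the same marginal keep rate, preserving Assumption 4:

```python
import random

def simulate_biased_dataset(n, keep_prob, seed=0):
    """Draw (z, y) pairs from a non-discrimination distribution Q in which
    Y is independent of Z (Assumption 2), then keep each sample with
    probability keep_prob[(z, y)], i.e., S depends only on Z and Y
    (Assumption 3)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        z = rng.choice(["gay", "straight"])   # Q(Z): uniform over the two groups
        y = int(rng.random() < 0.1)           # Q(Y | Z) = Q(Y) = 0.1
        if rng.random() < keep_prob[(z, y)]:  # selection step: s = 1 keeps the sample
            kept.append((z, y))
    return kept

# Prejudiced selection: toxic "gay" samples are always kept, non-toxic ones are
# partly discarded; 4/9 makes the marginal keep rate 0.5 for both groups,
# so P(Z) = Q(Z) (Assumption 4) still holds.
data = simulate_biased_dataset(100000, {
    ("gay", 1): 1.0, ("gay", 0): 4 / 9,
    ("straight", 1): 0.5, ("straight", 0): 0.5})
```

In the kept data, the toxic rate among samples with z = "gay" rises well above the true Q(Y = 1) = 0.1 (to about 0.2 in expectation here), while the rate for "straight" stays near 0.1, reproducing the pattern of Table 1.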

Making Models Fit the Non-discrimination Distribution

Unbiased Expectation of Loss Based on the assumptions above, we prove that we can obtain a loss that is unbiased w.r.t. the non-discrimination distribution from the discrimination distribution D with calculated instance weights.
Theorem 2 (Unbiased Loss Expectation). For any classifier f = f(x, z) and any loss function Δ = Δ(f(x, z), y), if we use w = Q(y)/P(y|z) as the instance weight, then

E_P[ w · Δ(f(x, z), y) ] = E_Q[ Δ(f(x, z), y) ],

where E_P denotes the expectation under the discrimination distribution and E_Q the expectation under the non-discrimination distribution. We now present the proof of Theorem 2.
Proof. We first show that the weight w equals the density ratio between the two distributions, in which we mark the assumption used in each step:

Q(x, z, y) / P(x, z, y)
  = [Q(x | z, y) Q(y | z) Q(z)] / [P(x | z, y) P(y | z) P(z)]   (Bayes)
  = [Q(x | z, y) Q(y | z)] / [P(x | z, y) P(y | z)]             (Assumption 4)
  = Q(y | z) / P(y | z)                                         (Assumptions 1, 3)
  = Q(y) / P(y | z) = w,                                        (Assumption 2)

where the third step uses P(x | z, y) = Q(x | z, y, S = 1) = Q(x | z, y), since S is independent of X given Z and Y. Then,

E_P[ w · Δ(f(x, z), y) ]
  = ∫ [Q(x, z, y) / P(x, z, y)] Δ(f(x, z), y) dP(x, z, y)
  = ∫ Δ(f(x, z), y) dQ(x, z, y)
  = E_Q[ Δ(f(x, z), y) ]. □
Algorithm 1: Non-discrimination Learning
Input: The dataset {x, z, y}, the number of folds K for cross prediction, and the prior probabilities Q(Y = 0) and Q(Y = 1)
Procedure:
01 Train classifiers and use K-fold cross-prediction to estimate P(y|z) on the dataset
02 Calculate the weights w = Q(y)/P(y|z) for all samples
03 Train and validate models using w as the instance weights

Non-discrimination Learning Theorem 2 shows that we can obtain the unbiased loss of the non-discrimination distribution by attaching proper instance weights to the samples from the discrimination distribution D. In other words, non-discrimination models can be trained with the instance weights w = Q(y)/P(y|z). As the discrimination distribution is directly observable, estimating P(y|z) is not hard. In practice, we can train classifiers and use cross-prediction to estimate P(y|z) on the original datasets. Since Q(y) is only a real number in [0, 1] indicating the prior probability of Y under the non-discrimination distribution, we do not make a specific assumption about it. Intuitively, setting Q(Y) = P(Y) can be a good choice. Consider a non-discrimination dataset where P(Y|Z) = P(Y): the calculated weights Q(y)/P(y|z) would then be the same for all samples when we set Q(Y) = P(Y), and thus have little impact on the trained models.
We present the step-by-step procedure for non-discrimination learning in Algorithm 1. Note that the only required data is the biased dataset and a pre-defined set of demographic identity-terms, from which we can extract {x, y, z} for all the samples.
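A minimal sketch of Algorithm 1 follows. The estimator of P(y|z) here is a simple smoothed conditional-frequency count over hashable z values (our simplification for illustration); the paper's experiments instead use Random Forest classifiers, and any probabilistic classifier fits this slot:

```python
import random
from collections import Counter, defaultdict

def compute_weights(z, y, k=5, seed=0):
    """Steps 01-02 of Algorithm 1: estimate P(y|z) by K-fold cross-prediction,
    then return w = Q(y) / P(y|z) with Q(y) set to the marginal P(y).
    z[i] is a hashable summary of the identity-terms in sentence i; y[i] is 0/1."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    q_y = Counter(y)                                   # Q(y) := P(y), times n
    w = [0.0] * n
    for fold in folds:
        held = set(fold)
        joint, marg = defaultdict(Counter), Counter()
        for i in idx:
            if i not in held:                          # fit on the other K-1 folds
                joint[z[i]][y[i]] += 1
                marg[z[i]] += 1
        for i in fold:                                 # out-of-fold P(y|z) estimate
            p = (joint[z[i]][y[i]] + 1) / (marg[z[i]] + 2)   # Laplace smoothing
            w[i] = (q_y[y[i]] / n) / p
    return w
```

For step 03, the returned weights are passed as per-sample loss weights during training and validation, e.g., via the sample-weight argument that most training APIs expose. Samples whose (z, y) pair is over-represented relative to the prior receive weights below 1, and under-represented pairs receive weights above 1.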

Experiments
In this section, we present the experimental results for non-discrimination learning. We demonstrate that our method can effectively mitigate the impacts of unintended discriminatory biases in datasets.

Dataset Usage
We evaluate our methods on three datasets: the Sexist Tweets dataset, the Toxicity Comments dataset, and the Jigsaw Toxicity dataset.

Sexist Tweets The first is the Sexist Tweets dataset for the sexist language detection task. The dataset consists of tweets annotated by experts as "sexist" or "normal." We process the dataset in the same way as Park et al. (2018). It is reported that the dataset has an unintended gender bias, so that models trained on it may consider "You are a good woman." as "sexist." We randomly split the dataset in a ratio of 8 : 1 : 1 for training, validation, and testing, and use it to evaluate our method's effectiveness in mitigating gender discrimination.
Toxicity Comments Another choice is the Toxicity Comments dataset released by Dixon et al. (2018), in which texts are extracted from Wikipedia Talk Pages and labeled by human raters as either toxic or non-toxic. In this dataset, some demographic identity-terms (e.g., "gay", "black") appear disproportionately across labels. As a result, models trained on this dataset can be discriminatory among groups. We adopt the split released by Dixon et al. (2018) and use this dataset to evaluate our method's effectiveness in mitigating discrimination towards minority groups.

Jigsaw Toxicity
We also test the recently released large-scale Jigsaw Toxicity dataset from Kaggle, in which some frequently attacked identities are found to be associated with toxicity. Sentences in the dataset are extracted from the Civil Comments platform and annotated with toxicity and the identities mentioned in every sentence. We randomly split the dataset into 80% for training and 10% each for validation and testing. The dataset is used to evaluate our method's effectiveness on large-scale datasets. Statistics of the three datasets are shown in Table 2.

Evaluation Scheme
Apart from the original testing set of each dataset, we use the Identity Phrase Templates Test Sets (IPTTS) described in Section 3.3 to evaluate the models. For experiments with the Sexist Tweets dataset, we generate IPTTS following Park et al. (2018). For experiments with the Toxicity Comments and Jigsaw Toxicity datasets, we use the IPTTS released by Dixon et al. (2018). Details about the IPTTS generation are introduced in Appendix C.
For metrics, we use FPED and FNED on IPTTS to evaluate how discriminatory the models are; lower scores indicate better equality. However, as mentioned in Section 3.3, these two metrics are not enough, since models can achieve low FPED and FNED by making trivial predictions on IPTTS. So we also use AUC on both the original testing set and IPTTS to reflect the trade-off between the debiasing effect and the accuracy of models. We also report significance test results at the 0.05 confidence level for the Sexist Tweets and Jigsaw Toxicity datasets.
For baselines, we compare with the gender-swapping method proposed by Park et al. (2018) on the Sexist Tweets dataset; as the dataset provides only two demographic groups (male and female), swapping is practical there. For the other two datasets, which involve 50 demographic groups, we compare with the data supplementation method proposed by Dixon et al. (2018).

Experiment Setup
To generate the weights, we use Random Forest Classifiers to estimate P (y|z) following Algorithm 1. We simply set Q(Y ) = P (Y ) to partial out the influence of the prior probability of Y . The weights are used as the sample weights to the loss functions during training and validation.
For experiments with the Sexist Tweets dataset, we extract the gender identity words (released by Zhao et al. (2018)) in every sentence and use them as Z. For experiments with the Toxicity Comments dataset, we take the demographic group identity words (released by Dixon et al. (2018)) contained in every sentence, concatenated with the lengths of the sentences, as Z, matching how Dixon et al. (2018) chose the additional sentences for data supplementation. For experiments with the Jigsaw Toxicity dataset, the provided identity attributes of every sentence and the sentence lengths are used as Z.
For experiments with the Toxicity Comments dataset, to compare with the results released by Dixon et al. (2018), GloVe word embeddings (Pennington et al., 2014) are used; the results of Baseline and Supplement are taken from Dixon et al. (2018). We also report results when using gender-debiased pre-trained embeddings (Bolukbasi et al., 2016) for experiments with Sexist Tweets. All the reported results are averages over ten runs with different random initializations.

Experimental Results
In this section, we present and discuss the experimental results. As expected, training with calculated weights can effectively mitigate the impacts of the unintended bias in the datasets.
Sexist Tweets Table 3 reports the results on the Sexist Tweets dataset. Baseline refers to vanilla models. Swap refers to models trained and validated with 2723 additional gender-swapped samples to balance the identity-terms across labels (Park et al., 2018). Weight refers to models trained and validated with calculated weights. "+" refers to models using debiased word embeddings.
Regarding the results with the GloVe word embeddings, we find that Weight performs significantly better than Baseline under FPED and FNED, which demonstrates that our method can effectively mitigate the discrimination of models. Swap outperforms Weight in FPED and FNED, but our method achieves significantly higher IPTTS AUC. We notice that Swap even performs worse than Baseline in terms of IPTTS AUC (although the difference is not significant at 0.05), which implies that the debiasing effect of Swap comes at the cost of model accuracy; this can be ascribed to the gender-swapping assumptions mentioned in Section 3.4. We also notice that both Weight and Swap have lower Orig. AUC than Baseline, which can be ascribed to the mitigation of the unintended bias pattern.
Regarding the results with the debiased word embeddings, the conclusions remain largely unchanged, while Weight gains a significant improvement over Baseline in terms of IPTTS AUC. Besides, compared with the GloVe embeddings, we find that the debiased embeddings effectively improve FPED and FNED, but Orig. AUC and IPTTS AUC also drop.

Toxicity Comments
Table 4 reports the results on the Toxicity Comments dataset. Baseline refers to vanilla models. Supplement refers to models trained and validated with 4620 additional samples to balance the identity-terms across labels (Dixon et al., 2018). Weight refers to models trained and validated with calculated instance weights.

From the table, we find that Weight outperforms Baseline in terms of IPTTS AUC, FPED, and FNED, and also gives slightly better debiasing performance than Supplement, which demonstrates that the calculated weights can effectively make models less discriminatory. Meanwhile, Weight performs similarly to all the other methods in Orig. AUC, indicating that our method does not hurt models' generalization ability very much.
In general, the results demonstrate that our method can provide a better debiasing effect without additional data, and avoiding the high cost of extra data collection and annotation makes it more practical to adopt.

Jigsaw Toxicity
Table 5 reports the results on the Jigsaw Toxicity dataset. Baseline refers to vanilla models. Supplement refers to models trained and validated with 15249 additional samples extracted from Toxicity Comments to balance the identity-terms across labels. Weight refers to models trained with calculated weights. Similar to the results on Toxicity Comments, we find that both Weight and Supplement perform significantly better than Baseline in terms of IPTTS AUC and FPED, and the results of Weight and Supplement are comparable. On the other hand, we notice that Weight and Supplement improve FNED only slightly, and the differences are not statistically significant at the 0.05 confidence level.
To gain a better understanding of the debiasing effects, we further visualize the evaluation results on the Jigsaw Toxicity dataset for sentences containing some specific identity-terms in IPTTS in Figure 1, where ΔFPR_z and ΔFNR_z are presented. Based on the definition of FPED and FNED, values closer to 0 indicate better equality. We find that Baseline, trained on the original biased dataset, discriminates against some demographic groups.
For example, sentences containing identity words like "gay", "homosexual" and "lesbian" are more likely to be falsely judged as "toxic", as indicated by ΔFPR, while ones with words like "straight" are more likely to be falsely judged as "not toxic", as indicated by ΔFNR. We also notice that Weight performs more consistently across most identities in both FPR and FNR. For instance, the ΔFPR of the debiased model on samples with "gay", "homosexual" and "lesbian" comes significantly closer to 0, while |ΔFNR| also drops for "old" and "straight".
We also note that the FPR_overall and FNR_overall of Weight are significantly better than those of Baseline: FPR_overall is 0.001 for Weight versus 0.068 for Baseline, and FNR_overall is 0.061 for Weight versus 0.068 for Baseline, showing that Weight is both more accurate and less discriminatory on the IPTTS set.

Conclusion
In this paper, we focus on the unintended discrimination bias in existing text classification datasets. We formalize the problem as a kind of selection bias from the non-discrimination distribution to the discrimination distribution and propose a debiasing training framework that does not require any extra resources or annotations. Experiments show that our method can effectively alleviate discrimination. It is worth mentioning that our method is general enough to be applied to other tasks, as the key idea is to obtain the loss on the non-discrimination distribution; we leave this to future work.

A Proof for the Criterion Consistency Theorem
Proof. Here we present the proof for Theorem 1. Assume the test distribution satisfies Pr(Y|Z) = Pr(Y) and that Ŷ satisfies Equalized Odds, i.e., Pr(Ŷ | Y, Z) = Pr(Ŷ | Y). For the Statistical Parity criterion,

Pr(Ŷ | Z) = Σ_y Pr(Ŷ | Y = y, Z) Pr(Y = y | Z)
          = Σ_y Pr(Ŷ | Y = y) Pr(Y = y)
          = Pr(Ŷ).

For the Predictive Parity criterion,

Pr(Y | Ŷ = 1, Z) = Pr(Ŷ = 1 | Y, Z) Pr(Y | Z) / Pr(Ŷ = 1 | Z)
                 = Pr(Ŷ = 1 | Y) Pr(Y) / Pr(Ŷ = 1)
                 = Pr(Y | Ŷ = 1),

where the second step uses Equalized Odds, the test-distribution condition, and the Statistical Parity result above. □

B Discussion about Assumption 4
We show that even if the assumption does not hold, we can still make models fit Q(Y|X) with the calculated weights when Z is contained in X, which is the common setting in practice. Without the assumption P(Z) = Q(Z), the calculated weight relates to the true density ratio as

w_{x,y,z} = Q(y) / P(y|z) = [Q(x, z, y) / P(x, z, y)] · [P(z) / Q(z)].

After applying these weights to every sample in the dataset, we obtain a new distribution defined as

P*(x, y, z_x) = w_{x,y,z_x} · P(x, y, z_x) / ∫ w_{x',y',z_{x'}} · dP(x', y', z_{x'}),

in which we use P*(·) to represent probabilities under the obtained distribution. As Z is contained in X, we use z_x to represent the specific Z contained in every X. Then we have

P*(y|x) = P*(x, z_x, y) / Σ_{y'} P*(x, z_x, y')
        = [P(x, z_x, y) · (Q(x, z_x, y)/P(x, z_x, y)) · (P(z_x)/Q(z_x))] / Σ_{y'} [P(x, z_x, y') · (Q(x, z_x, y')/P(x, z_x, y')) · (P(z_x)/Q(z_x))]
        = Q(x, z_x, y) / Σ_{y'} Q(x, z_x, y')
        = Q(y|x).

Given the result P*(y|x) = Q(y|x), consistent learners should be asymptotically immune to different assumptions regarding Q(Z), where a learner is defined as consistent if the learning algorithm can find a model θ that is equivalent to the true model at producing class-conditional probabilities given an exhaustive training data set (Fan et al., 2005). In practice, however, as these requirements are often hard to meet, we note that models may still be affected by the deviation between P*(x) and Q(x), which is widely studied as the covariate shift problem (Shimodaira, 2000; Ben-David et al., 2007; Jiang and Zhai, 2007). In this paper, as we do not assume the availability of extra resources or prior knowledge, we simply set P(Z) = Q(Z).
We leave more explorations about this assumption for future work.
C IPTTS Generation Details
For experiments with the Sexist Tweets dataset, we generate IPTTS with templates such as "You are a (adj. inoffensive) (identity-term)." labeled as non-abusive and "You are a (adj. offensive) (identity-term)." labeled as abusive, as shown in Table 6. We use the code released by Dixon et al. (2018) and use the gender word pairs released by Zhao et al. (2018) as the identity-terms. Some of the slotted words are presented in Table 7. To make sentences longer, we also add some semantically neutral sentences provided by Dixon et al. (2018) as a suffix to each template. Finally, we obtain 75238 samples, 37538 of which are abusive, and the mean sentence length is 17.5.
For experiments with the Toxicity Comments and Jigsaw Toxicity datasets, we use the IPTTS released by Dixon et al. (2018). This testing set is created from several templates slotted with a broad range of identity-terms, and consists of 77000 examples, 50% of which are toxic.
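The template-based generation above can be sketched as follows; the templates, adjectives, and identity-terms here are illustrative stand-ins for the released resources:

```python
# Illustrative stand-ins for the released templates and word lists.
TEMPLATES = ["You are a {adj} {identity}.", "Being {identity} is {adj}."]
ADJECTIVES = {"wonderful": 0, "disgusting": 1}   # inoffensive -> 0, offensive -> 1
IDENTITY_TERMS = ["gay", "straight", "black", "white"]

def generate_iptts():
    """Slot every identity-term into every template with every adjective, so
    that each identity-term appears with the same label distribution and
    Pr(Y | Z) = Pr(Y) holds by construction."""
    samples = []
    for template in TEMPLATES:
        for adj, label in ADJECTIVES.items():
            for term in IDENTITY_TERMS:
                samples.append((template.format(adj=adj, identity=term), label))
    return samples
```

By construction, every identity-term receives the same set of templates and labels, so any performance gap between identity-term subsets can be attributed to the model rather than the test data.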

D Frequency of Identity-terms in Toxic Samples and Overall
To give a better understanding of how the weights change the distribution of the dataset, we compare the original Jigsaw Toxicity dataset and the weighted one w.r.t. the frequency of a selection of identity-terms among toxic samples and overall, as shown in Table 8. We find that after applying the weights, the gap between the frequency in toxic samples and the overall frequency decreases significantly for almost all identity-terms, which demonstrates that the unintended bias in the dataset is effectively mitigated.
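The comparison in Table 8 amounts to a weighted frequency count; a minimal sketch with our own function names follows, where uniform weights recover the original dataset's frequencies:

```python
def term_frequencies(sentences, labels, weights, term):
    """Weighted frequency of `term` among toxic samples and among all samples,
    mirroring the comparison in Table 8 (uniform weights give the original
    dataset's frequencies)."""
    def freq(ws):
        total = sum(ws)
        hits = sum(w for s, w in zip(sentences, ws) if term in s.lower().split())
        return hits / total if total else 0.0
    overall = freq(weights)
    toxic = freq([w if lab == 1 else 0.0 for w, lab in zip(weights, labels)])
    return toxic, overall
```

On a hand-built five-sentence toy corpus where "gay" is over-represented among toxic samples, applying the w = Q(y)/P(y|z) weights of Section 4 (computed by hand for the toy corpus) makes the toxic and overall frequencies of "gay" coincide, which is exactly the gap reduction Table 8 reports.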