LOGAN: Local Group Bias Detection by Clustering

Machine learning techniques have been widely used in natural language processing (NLP). However, as revealed by many recent studies, machine learning models often inherit and amplify societal biases in the data. Various metrics have been proposed to quantify biases in model predictions. In particular, several of them evaluate the disparity in model performance between protected groups and advantaged groups in the test corpus. However, we argue that evaluating bias at the corpus level is not enough to understand how biases are embedded in a model. In fact, a model with similar aggregate performance between different groups on the entire data may behave differently on instances in a local region. To analyze and detect such local bias, we propose LOGAN, a new bias detection technique based on clustering. Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in local regions and allows us to better analyze biases in model predictions.


Introduction
Machine learning models such as deep neural networks have achieved remarkable performance in many NLP tasks. However, as noted by recent studies, these models often inherit and amplify the biases in the datasets used to train them (Zhao et al., 2017; Bolukbasi et al., 2016; Caliskan et al., 2017; Zhou et al., 2019; Manzini et al., 2019; Blodgett et al., 2020).
To quantify bias, researchers have proposed various metrics to study algorithmic fairness at both individual and group levels. The former measures whether a model treats similar individuals consistently no matter which groups they belong to, while the latter requires the model to perform similarly for protected groups and advantaged groups in the corpus.[1] In this paper, we argue that studying algorithmic fairness at either level does not tell the full story. A model that reports similar performance across two groups in a corpus may behave differently between these two groups in a local region.
For example, the performance gap of a toxicity classifier for sentences mentioning the black and white race groups is 4.8%.[2] This gap is only marginally larger than the performance gap of 2.4% obtained when evaluating the model on two randomly split groups. However, if we evaluate the performance gap on the sentences containing the token "racist", the performance gap between these two groups is as large as 19%. Similarly, Zhao et al. (2017) report that a visual semantic role labeling system is more likely to label an image depicting cooking as woman cooking than as man cooking. However, the model is, in fact, more likely to produce an output of man cooking when the agent in the image wears a chef hat. We call these biases exhibited in a neighborhood of instances local group bias, in contrast with global group bias, which is evaluated on the entire corpus.
To detect local group bias, we propose LOGAN, a LOcal Group biAs detectioN algorithm to identify biases in local regions. LOGAN adapts a clustering algorithm (e.g., K-Means) to group instances based on their features while maximizing a bias metric (e.g., performance gap across groups) within each cluster. In this way, local group bias is highlighted, allowing a developer to further examine the issue.
Our experiments on toxicity classification and MS-COCO object classification demonstrate the effectiveness of LOGAN. We show that besides successfully detecting local group bias, our method also provides interpretations for the detected bias. For example, we find that different topics lead to different levels of local group bias in toxicity classification.

[1] For example, Zhao et al. (2018a) and Rudinger et al. (2018) evaluate the bias in coreference resolution systems by measuring the difference in F1 score between cases where a gendered pronoun refers to an occupation stereotypical for that gender and the opposite situation.

[2] Performance in accuracy on the unintended bias detection task (Conversation AI team, 2019).

Related Work
Bias Evaluation Researchers have proposed to study algorithmic fairness from both individual and group perspectives (Dwork et al., 2012; Dwork and Ilvento, 2018). To analyze group fairness, various metrics have been proposed. For example, demographic parity (Dwork et al., 2012) requires the probability of the predictor making a positive prediction to be independent of the sensitive attributes. However, this metric cannot always guarantee fairness, as a model can accept correct examples in one demographic group but make random guesses in another as long as it maintains the same acceptance ratio. To address this problem, Hardt et al. (2016) propose new metrics, equalized odds and equal opportunity, which measure discrimination related to the sensitive attributes by requiring predictions to be independent of the demographic attributes given the true labels. In NLP, many studies use the performance gap between different demographic groups as a bias measurement (Gaut et al., 2020; Kiritchenko and Mohammad, 2018). The choice of bias metric depends on the application. In this work, we use the performance gap as the bias evaluation metric; however, our approach can be generalized to other metrics.

Bias in NLP Applications
Recent advances in machine learning have boosted the performance of various NLP applications. However, recent studies show that NLP models exhibit biases. For example, researchers demonstrate that representations in NLP models are biased toward certain societal groups (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2018b; Zhou et al., 2019; May et al., 2019). Stanovsky et al. (2019) and Font and Costa-jussà (2019) show that gender bias exists in neural machine translation, while Dixon et al. (2018) and Sap et al. (2019) reveal biases in text classification tasks. Other applications, such as cross-lingual transfer learning (Zhao et al., 2020) and natural language generation (Sheng et al., 2019), also exhibit unintended biases.

Methodology
In this section, we first provide a formal definition of local group bias and then describe the details of the detection method, LOGAN.
Performance Disparity Assume we have a trained model $f$ and a test corpus $D = \{(x_i, y_i)\}_{i=1 \ldots n}$ used to evaluate the model. Let $P_f(D)$ denote the performance of the model $f$ evaluated on the corpus $D$. Depending on the application, the performance metric can be accuracy, AUC, false positive rate, etc. For the sake of simplicity, we assume each input example $x_i$ is associated with one of two demographic groups (e.g., male or female), i.e., $x_i \in A_1$ or $x_i \in A_2$. As a running example, we take performance disparity as the bias metric. That is, we consider that the model exhibits bias if

$$P_f(A_1) - P_f(A_2) > \epsilon, \qquad (1)$$

where $\epsilon$ is a given threshold.
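As a concrete illustration, the disparity metric with accuracy as $P_f$ can be computed as follows (a minimal sketch; the function and argument names are ours, not from the authors' code):

```python
import numpy as np

def performance_gap(y_true, y_pred, group):
    """Accuracy gap P_f(A_1) - P_f(A_2) between two demographic groups.

    `group[i]` is 0 if x_i belongs to A_1 and 1 if it belongs to A_2
    (array names are illustrative assumptions).
    """
    correct = (y_true == y_pred)
    return correct[group == 0].mean() - correct[group == 1].mean()
```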

Definition of local group bias
We define local group bias as the bias exhibited in a local region of the test examples. Formally, given a center $c$, a distance function $d$, and a radius $r$, consider the neighborhood $N(c) = \{x_i \mid d(x_i, c) \le r\}$; local group bias is the performance disparity measured on the instances within that neighborhood, i.e., $P_f(N(c) \cap A_1) - P_f(N(c) \cap A_2)$. While this definition is based on performance disparity, it is straightforward to extend the notion of local group bias to other bias metrics.
LOGAN The goal of LOGAN is to cluster instances in $D$ such that (1) similar examples are grouped together, and (2) each cluster exposes the local group bias contained in $f$. To achieve this goal, LOGAN generates a cluster assignment $C = \{C_{ij}\}_{i=1 \ldots n, j=1 \ldots k}$ by optimizing the following objective:

$$\min_{C} \; \mathcal{L}_c + \lambda \mathcal{L}_b, \qquad (2)$$

where $\mathcal{L}_c$ is the clustering loss and $\mathcal{L}_b$ is the local group bias loss. $\lambda \ge 0$ is a hyper-parameter controlling the trade-off between the two objectives, and $C_{ij} = 1$ if $x_i$ is assigned to cluster $j$, $C_{ij} = 0$ otherwise. We introduce these two loss terms in the following.
Clustering objective The loss $\mathcal{L}_c$ is derived from a standard clustering technique. In this paper, we consider the K-Means clustering method (Lloyd, 1982). Specifically, the loss $\mathcal{L}_c$ of K-Means is

$$\mathcal{L}_c = \sum_{i=1}^{n} \sum_{j=1}^{k} C_{ij} \, \lVert x_i - \mu_j \rVert^2,$$

where $\mu_j$ is the centroid of cluster $j$. Note that our framework is general, and other clustering techniques, such as spectral clustering (Shi and Malik, 2000), DBSCAN (Ester et al., 1996), or a Gaussian mixture model, can also be applied to generate the clusters. Besides, the features used for creating the clusters can be different from the features used in the model $f$.
Local group bias objective For the local group bias loss $\mathcal{L}_b$, the goal is to obtain a clustering that maximizes the bias metric within each cluster. In the following description, we take the performance gap between groups (see Eq. (1)) as the example bias metric.
Let $\hat{y}_i = f(x_i)$ be the prediction of $f$ on $x_i$. The local group bias loss $\mathcal{L}_b$ is defined as the negative summation of performance gaps over all the clusters. If accuracy is used as the performance evaluation metric,

$$\mathcal{L}_b = -\sum_{j=1}^{k} \left| \frac{\sum_{i: x_i \in A_1} C_{ij} \, \mathbb{I}[\hat{y}_i = y_i]}{\sum_{i: x_i \in A_1} C_{ij}} - \frac{\sum_{i: x_i \in A_2} C_{ij} \, \mathbb{I}[\hat{y}_i = y_i]}{\sum_{i: x_i \in A_2} C_{ij}} \right|,$$

where $\mathbb{I}$ is the indicator function. Similar to the K-Means algorithm, we solve Eq. (2) by iterating two steps: first, assign each $x_i$ to a cluster based on the current centroids $\mu_j$; second, update each $\mu_j$ based on the current assignment. We use k-means++ (Arthur and Vassilvitskii, 2007) for cluster initialization and stop when the model converges or reaches a maximum number of iterations. To make sure each cluster contains enough instances, in practice, we choose a large $k$ ($k = 10$ in our case) and merge small clusters into their closest neighbors. For local group bias detection, we only consider clusters with at least 20 examples from each group.
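The alternating procedure can be sketched as follows. This is our illustrative, unoptimized re-implementation under simplifying assumptions (random initialization in place of k-means++, no merging of small clusters), not the authors' released code; the assignment step here greedily picks the cluster minimizing the point's squared distance plus $\lambda$ times the bias loss:

```python
import numpy as np

def bias_loss(assign, correct, group, k):
    """L_b: negative sum over clusters of the |accuracy gap| between groups."""
    total = 0.0
    for j in range(k):
        a1 = (assign == j) & (group == 0)
        a2 = (assign == j) & (group == 1)
        if a1.any() and a2.any():
            total -= abs(correct[a1].mean() - correct[a2].mean())
    return total

def logan(X, correct, group, k=10, lam=1.0, n_iter=100, seed=0):
    """Minimal LOGAN-style clustering (an illustrative sketch).

    X:       (n, d) feature matrix used for clustering
    correct: (n,) 0/1 array, whether f predicted example i correctly
    group:   (n,) 0/1 array, demographic attribute of example i
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    for _ in range(n_iter):
        changed = False
        for i in range(len(X)):  # assignment step
            old_j, best_j, best_obj = assign[i], assign[i], np.inf
            for j in range(k):
                assign[i] = j  # tentatively move x_i to cluster j
                obj = ((X[i] - centers[j]) ** 2).sum() \
                      + lam * bias_loss(assign, correct, group, k)
                if obj < best_obj:
                    best_obj, best_j = obj, j
            assign[i] = best_j
            changed = changed or (best_j != old_j)
        for j in range(k):  # standard K-Means centroid update
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
        if not changed:
            break
    return assign, centers
```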

Experiments
In this section, we show that LOGAN is capable of identifying local group bias, and that the clusters generated by LOGAN provide insight into how bias is embedded in the model.

[Figure 1: Accuracy for the White (blue circle) and Black (orange square) groups in each cluster using LOGAN. The length of the dashed line shows the gap. The red box highlights the accuracy of these two groups on the entire corpus. Clusters 0 and 1 demonstrate strong local group bias. Full results are in Appendix A.3.]

Toxicity Classification
This task aims at detecting whether a comment is toxic (e.g., abusive or rude). Previous work has demonstrated that this task is biased against specific identities such as "gay" (Dixon et al., 2018). In our work, we use toxicity classification as one example to detect local group bias in text and show that such local group bias can be caused by different topics in the text.
Dataset We use the official train and test datasets from the Conversation AI team (2019). As the dataset is extremely imbalanced, we down-sample the training dataset and reserve 20% of it as the development set. In the end, we have 204,000, 51,000, and 97,320 examples for train, development, and test, respectively. We tune $\lambda \in \{1, 5, 10, 100\}$ and choose the value yielding the largest number of clusters showing local group bias.
Model We fine-tune a BERT sequence classification model from Wolf et al. (2019) for 2 epochs with a learning rate of $2 \times 10^{-5}$, a max sequence length of 220, and a batch size of 20. The model achieves 90.2% accuracy on the whole test dataset. We use sentence embeddings from the second-to-last layer of a pre-trained BERT model as features to perform clustering (see the sketch below). We also provide clustering results based on sentence embeddings extracted from a fine-tuned model in Appendix A.4.

Bias Detection There are several demographic groups in the toxicity dataset, such as gender, race, and religion. We focus on binary gender (male/female) and binary race (black/white) in the experiments. For local group bias, we report the largest bias score among all the clusters. Figure 1 shows the accuracy of the white and black groups in each cluster using LOGAN; the example bounded by the red box is the global accuracy of these two groups. Based on the results in Figure 1 and Table 1, we detect only weak global group bias in the model predictions. However, both K-Means and LOGAN successfully detect strong local group bias. In particular, LOGAN identifies a local region where the model has difficulty making correct predictions for the female group. While we use the accuracy gap as the bias metric, the clusters detected by LOGAN also exhibit local bias when evaluated with other metrics. Table 2 shows the gap in subgroup AUC scores over the clusters. Similar to the results in Table 1, K-Means and LOGAN detect local group bias; in particular, the first and third clusters in Figure 1 also have a larger AUC disparity than the global AUC gap. Similarly, the first three clusters in Figure 1 have a significantly larger gap in false positive rate across groups than when evaluated on the entire dataset.
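For reference, the embedding extraction can be sketched as below, assuming a recent version of the HuggingFace transformers API; the paper does not specify how token vectors are pooled into a sentence embedding, so the mean pooling here is our assumption:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased",
                                 output_hidden_states=True)
bert.eval()

def sentence_embedding(text):
    """Mean-pooled token embeddings from the second-to-last BERT layer."""
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=220)
    with torch.no_grad():
        out = bert(**inputs)
    # out.hidden_states is (embedding layer, layer 1, ..., layer 12);
    # index -2 selects the second-to-last transformer layer.
    return out.hidden_states[-2].squeeze(0).mean(dim=0).numpy()
```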

Bias Interpretation To better interpret the local group bias, we run a Latent Dirichlet Allocation topic model (Blei et al., 2003) to discover the main topic of each cluster. Table 3 lists the top 20 topic words for the most and least biased clusters found by LOGAN under the RACE attribute. We remove words related to race attributes such as "white" and "black"; other results are in Appendix A.2. We find that the different topics in each cluster may lead to different levels of local group bias. For example, compared with the less biased cluster, the most biased cluster includes a topic on supremacy.
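A possible sketch of this interpretation step is shown below; the paper does not report its LDA configuration, so fitting a single topic per cluster with scikit-learn defaults is our assumption:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_topic_words(cluster_texts, n_words=20, exclude=("white", "black")):
    """Top words of a single LDA topic fit on one cluster's comments,
    filtering out race-attribute words as described in the paper."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(cluster_texts)
    lda = LatentDirichletAllocation(n_components=1, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    ranked = lda.components_[0].argsort()[::-1]  # descending topic weight
    words = [vocab[i] for i in ranked if vocab[i] not in exclude]
    return words[:n_words]
```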

Comparison between K-Means and LOGAN
We compare LOGAN with K-Means using the following three metrics. Inertia sums the distances of all instances to their closest cluster centers and measures clustering quality; we normalize it so that the inertia of K-Means is 1.0. To measure the utility of local group bias detection, we look at the ratio of clusters showing a bias score of at least 5% (BCR) as well as the ratio of instances within those biased clusters (BIR). Table 4 shows that LOGAN increases the ratio of clusters exhibiting non-trivial local group bias by a large margin, with only a trivial trade-off in inertia.

[Table 4: Comparison between K-Means and LOGAN under RACE attributes. "BCR" and "BIR" refer to the ratio of biased clusters and the ratio of instances in those biased clusters, respectively. "|Bias|" is the averaged absolute bias score over the biased clusters.]
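To make BCR and BIR concrete, the sketch below computes both, following the conventions described above (bias threshold of 5% and at least 20 examples per group in a cluster); the function name and exact denominators are our assumptions:

```python
import numpy as np

def bcr_bir(assign, correct, group, threshold=0.05, min_per_group=20):
    """BCR and BIR over a clustering (our reconstruction of the metrics).

    A cluster counts as biased when both groups have at least
    `min_per_group` instances and the absolute accuracy gap is at
    least `threshold`.
    """
    clusters = np.unique(assign)
    n_biased, n_biased_inst = 0, 0
    for j in clusters:
        in_j = assign == j
        a1, a2 = in_j & (group == 0), in_j & (group == 1)
        if a1.sum() < min_per_group or a2.sum() < min_per_group:
            continue
        if abs(correct[a1].mean() - correct[a2].mean()) >= threshold:
            n_biased += 1
            n_biased_inst += in_j.sum()
    return n_biased / len(clusters), n_biased_inst / len(assign)
```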

Object Classification
We conduct experiments on object classification using MS-COCO (Lin et al., 2014). Given an image, the goal is to predict whether a given object appears in the image. Following the setup in Wang et al. (2019), we exclude person from the object labels.
Dataset Similar to Zhao et al. (2017) and Wang et al. (2019), we extract the gender label for an image from its captions. For our analysis, we only consider images with gender labels. In the end, there are 22,800, 5,400, and 5,400 images left for train, development, and test, respectively.

Model
We use the basic model from Wang et al. (2019) for this task, which adapts a standard ResNet-50 pre-trained on ImageNet with its last layer modified. We follow the default hyperparameters of the original model.

Bias Detection and Interpretation
We evaluate bias in the predictions of the object classification model by examining the accuracy gap between the male and female groups for each object. In the analysis, we only consider objects with more than 100 images in the test set, which results in a total of 26 objects. Among the three methods, Global detects group bias at threshold 5% (i.e., performance gap ≥ 5%) for only 14 objects, while K-Means and LOGAN increase this number to 19 and 21, respectively.
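The Global per-object check described here can be sketched as follows (the data layout and names are our assumptions):

```python
import numpy as np

def per_object_gaps(records, min_images=100, threshold=0.05):
    """Per-object accuracy gap between gender groups.

    `records` maps an object label to a pair of parallel arrays over the
    test images containing that object: `correct` (0/1 model correctness)
    and `gender` (0 = male, 1 = female).  Returns the objects whose
    absolute gap meets the threshold.
    """
    biased = {}
    for obj, (correct, gender) in records.items():
        if len(correct) <= min_images:  # keep objects with > 100 images
            continue
        gap = abs(correct[gender == 0].mean() - correct[gender == 1].mean())
        if gap >= threshold:
            biased[obj] = gap
    return biased
```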
Comparing LOGAN with K-Means over all 26 objects, the average inertia is almost identical (the ratio is 1.001). On average, the ratio of clusters showing local group bias at threshold 5% (i.e., BCR) is 34.0% for K-Means and 35.7% for LOGAN, and the ratio of instances in those biased clusters (i.e., BIR) is 57.7% and 54.9%, respectively.
We further investigate the local clusters discovered by LOGAN by comparing the images in the less biased clusters with those in the strongly biased ones. We find that, for example, in the most biased clusters, the images often contain a "handbag" in a street scene. In such cases, the model is more likely to correctly predict that the agent in the image is a woman (see Appendix A.5).

Conclusion
Machine learning models risk inheriting underlying societal biases from data. In practice, many works use the global performance gap between different groups as a metric to detect bias. In this work, we revisit this coarse-grained metric for group bias analysis and propose a new method, LOGAN, to detect local group bias by clustering. Our method can help detect model biases that were previously hidden from global bias metrics and provide an explanation of such biases.
We note some limitations of LOGAN. For example, the number of instances across clusters can be uneven (see Appendix A.3).

A.1 Reproducibility
We describe the details of our two models here. For the toxicity classification task, we run the model on a GeForce GTX 1080 Ti GPU for 2 epochs, which takes about 3 hours to finish the fine-tuning procedure. The accuracy on the dev dataset is 89.4%. For the MS-COCO object classification task, we use the basic model from https://github.com/uvavision/Balanced-Datasets-Are-Not-Enough.
We train the model with the default hyperparameters in this repo (for example, batch size 32 and learning rate $10^{-5}$). We obtain a meanAP of 52.3% and 53.1% on development and test, respectively. We attach partial code in the supplemental materials.

A.2 Topic words in different clusters
We list all the top 20 words from the topic model using K-Means and LOGAN in the tables below.

[Table 6: Top 20 topic words for the most and least biased clusters using K-Means under RACE attributes. The number in parentheses is the bias score (%) of the cluster.]

[Table 9: Bias detection on toxicity classification using LOGAN. Accuracy is shown in %.]

A.3 Local Bias Detection
A.4 Results using embeddings extracted from a fine-tuned BERT model

In this section, we provide the results of local bias detection using the second-to-last layer embeddings from the fine-tuned BERT model.

A.5 Local Clusters for MS-COCO dataset
In this section, we show the local group bias analysis for the MS-COCO object classification task.