Comparative Evaluation of Label Agnostic Selection Bias in Multilingual Hate Speech Datasets

Work on bias in hate speech typically aims to improve classification performance while relatively overlooking the quality of the data. We examine selection bias in hate speech in a language- and label-independent fashion. We first use topic models to discover latent semantics in eleven hate speech corpora; then, we present two bias evaluation metrics based on the semantic similarity between topics and the search words frequently used to build corpora. We discuss the possibility of revising the data collection process by comparing datasets and analyzing contrastive case studies.


Introduction
Hate speech in social media dehumanizes minorities through direct attacks or incitement to defamation and aggression. Although hate speech is scarce in comparison to normal web content, Mathew et al. (2019) demonstrated that it tends to reach large audiences faster due to dense connections between the users who share such content. Hence, a search based on generic hate speech keywords or controversial hashtags may result in a set of social media posts generated by a limited number of users (Arango et al., 2019). This would lead both to an inherent bias in hate speech datasets, similar to other tasks involving social data (Olteanu et al., 2019), and to a selection bias (Heckman, 1977) particular to hate speech data.
Mitigation methods usually focus on classification performance and investigate how to debias detection given false positives caused by gender identity words such as "women" (Park et al., 2018), racial terms reclaimed by communities in certain contexts (Davidson et al., 2019), or names of groups at the intersection of gender and race such as "black men" (Kim et al., 2020). The various aspects of dataset construction are less studied, although it has recently been shown, by looking at historical documents, that we tend to neglect the data collection process (Jo and Gebru, 2020). Thus, in the present work, we are interested in improving hate speech data collection through evaluation before focusing on classification performance.
We conduct a comparative study on English, French, German, Arabic, Italian, Portuguese, and Indonesian datasets using topic models, specifically Latent Dirichlet Allocation (LDA) (Blei et al., 2003). We use multilingual word embeddings or word associations to compute semantic similarity scores between topic words and predefined keywords, and we define two metrics that calculate bias in hate speech based on these measures. We use the same lists of keywords reported by Ross et al. (2016) for German, Sanguinetti et al. (2018) for Italian, Ibrohim and Budi (2019) for Indonesian, and Fortuna et al. (2019) for Portuguese; we allow more flexibility in both English (Waseem and Hovy, 2016; Founta et al., 2018; Ousidhoum et al., 2019) and Arabic (Albadi et al., 2018; Mulki et al., 2019; Ousidhoum et al., 2019) in order to compare different datasets based on shared concepts reported in their respective paper descriptions; and for French, we make use of a subset of keywords that covers most of the targets reported by Ousidhoum et al. (2019). Our first bias evaluation metric measures the average similarity between topics and the whole set of keywords, and the second evaluates how often keywords appear in topics. We analyze our methods through different use cases which explain how we can benefit from the assessment.
Our main contributions consist of (1) designing bias metrics that evaluate hateful web content using topic models; (2) examining selection bias in eleven datasets; and (3) turning present hate speech corpora into an insightful resource that may help us balance training data and reduce bias in the future.
Related Work

Bias in social data is broad and covers a wide range of issues (Olteanu et al., 2019; Papakyriakopoulos et al., 2020). Shah et al. (2020) present a framework to predict the origin of different types of bias, including label bias (Sap et al., 2019), selection bias (Garimella et al., 2019), model overamplification (Zhao et al., 2017), and semantic bias (Garg et al., 2018). Existing work deals with bias through the construction of large datasets and the definition of social frames (Sap et al., 2020), the investigation of how current NLP models might be non-inclusive of marginalized groups such as people with disabilities (Hutchinson et al., 2020), mitigation (Dixon et al., 2018; Sun et al., 2019), or better data splits (Gorman and Bedrick, 2019). However, Blodgett et al. (2020) report that NLP lacks a normative process for inspecting the initial reasons behind bias beyond a focus on performance, which is why we choose to investigate the data collection process in the first place.
In order to operationalize the evaluation of selection bias, we use topic models to capture latent semantics. Regularly used topic modeling techniques such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have proven efficient in several NLP applications such as data exploration (Rodriguez and Storer, 2020), Twitter hashtag recommendation (Godin et al., 2013), authorship attribution (Seroussi et al., 2014), and text categorization (Zhou et al., 2009).
In order to evaluate the consistency of generated topics, Newman et al. (2010) used crowdsourcing and semantic similarity metrics, essentially based on Pointwise Mutual Information (PMI), to assess coherence; Mimno et al. (2011) estimated coherence scores using conditional log-probability instead of PMI; Lau et al. (2014) enhanced this formulation based on normalized PMI (NPMI); and Lau and Baldwin (2016) investigated the effect of cardinality on topic generation. Similarly, we use topics and semantic similarity metrics to determine the quality of hate speech datasets, and we test on corpora that vary in language, size, and general collection purpose in order to examine different facets of bias.

Bias Estimation Method
The construction of toxic language and hate speech corpora is commonly conducted based on keywords and/or hashtags. However, the lack of an unequivocal definition of hate speech, the use of slurs in friendly conversations as opposed to sarcasm and metaphors in elusive hate speech (Malmasi and Zampieri, 2018), and the data collection timeline (Liu et al., 2019) contribute to the complexity and imbalance of the available datasets. Therefore, training hate speech classifiers easily produces false positives when tested on posts that contain controversial or search-related identity words (Park et al., 2018;Sap et al., 2019;Davidson et al., 2019;Kim et al., 2020).
To assess whether a dataset is robust to keyword-based selection, we present two label-agnostic metrics that evaluate bias using topic models. First, we generate topics using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Then, we compare the topics to predefined sets of keywords using a semantic similarity measure. We test our methods on different numbers of topics and topic words.

Predefined Keywords
In contrast to Waseem (2016), who legitimately questions the labeling process by comparing amateur and professional annotations, we investigate how we could improve the collection without taking the annotations into account. In other terms, we ask how the data selection contributes to the propagation of bias, and therefore to false positives, first during the annotation step, then during classification.

Table 1: Examples of keywords present in the predefined lists along with their English translations. The keywords include terms frequently associated with controversies such as comunist in Italian, slurs such as m*ng*l in French, insults in Arabic, and hashtags such as rapefugees in German.
We define B1 and B2 to assess how the obtained social media posts semantically relate to predefined keywords. The bias metric B1 measures this relatedness on average, while B2 evaluates how likely topics are to contain keywords. We use predefined sets of keywords that can be found in the hate speech resource paper descriptions (Waseem and Hovy, 2016; Ross et al., 2016; Sanguinetti et al., 2018; Founta et al., 2018; Albadi et al., 2018; Fortuna et al., 2019; Mulki et al., 2019), that appeared on reported websites such as HateBase (https://hatebase.org/), or that were released along with the corpus (Ibrohim and Budi, 2019; Ousidhoum et al., 2019). Table 1 shows examples of keywords utilized to gather toxic posts; we will make both the lists and links to their sources available to the research community. The list of keywords provided by Ibrohim and Budi (2019), which contains 126 words, is the largest we experiment with. The Portuguese, Italian, and German lists are originally small since they focus on particular target groups, namely women, immigrants, and refugees, whereas the remaining lists have been reduced slightly to meet the objectives presented in the descriptions of all the corpora we used.

Table 2 shows examples of topics that were generated from the chosen datasets:

DATASET                    TOPIC WORDS
Founta et al. (2018)       f***ing, like, know
Ousidhoum et al. (2019)    ret***ed, sh*t**le, c***
Ibrohim and Budi (2019)    ID: user, orang, c*b*ng / EN: user, person, t*dp*le
Ross et al. (2016)         DE: rapefugees, asylanten, merkel / EN: rapefugees, asylum seekers, merkel

Although Founta et al. (2018) report collecting data based on controversial hashtags and a large dictionary of slurs, their top topic words remain generic. One of the Arabic datasets contains the word pigs used to insult people, a slang word, and the word camels as part of a demeaning expression that means "camels' urine drinkers", usually used to humiliate people from the Arabian Peninsula. The three words exist in the predefined list of keywords, similarly to all French, Portuguese, and Italian and most German and Indonesian topic words.

Topic Models
Italian, German and Portuguese topics are composed of words related to immigrants and refugees as they correspond to the main targets of these datasets. The French topic also contains the name of a political ideology typically associated with more liberal immigration policies.
Other than slurs, named entities can be observed in Waseem and Hovy (2016)'s topic, which includes the name of a person who participated in the Australian TV show My Kitchen Rules (mkr) that was discussed in the tweets; the German topic includes the name of the German Chancellor Merkel, since she was repeatedly mentioned in tweets about the refugee crisis (Ross et al., 2016).

Despite their short length, the example topics can give us a general idea of the type of bias present in different datasets. For instance, topics generated from datasets in languages mainly spoken in Europe and the USA commonly target immigrants and refugees, in contrast to Arabic and Indonesian topics, which focus on other cultural, social, and religious issues. Overall, all topics show a degree of potentially quantifiable relatedness to some predefined key concepts. Accordingly, we assess topic bias in hate speech based on the semantic similarity between the high-scoring words in each topic and the set of search keywords used to collect the data.

Bias Metrics
Given a set of topics $T = \{t_1, \dots, t_{|T|}\}$ generated by LDA, with each topic $t_i = \{w_1, \dots, w_n\}$ composed of $n$ words, and a predefined list of keywords $w' = \{w'_1, \dots, w'_m\}$ of size $m$, we define the two bias functions B1 and B2 based on Sim1 and Sim2, respectively.
Sim1 measures the similarity between two words $w_j \in t_i$ and $w'_k \in w'$ for $t_i \in T$, with $0 < i \leq |T|$:

$$Sim_1(w_j, w'_k) = Sim(w_j, w'_k) \quad (1)$$

B1 computes the mean similarity between each $w_j \in t_i$ and $w'_k \in w'$, then the mean over all generated topics:

$$B_1 = \frac{1}{|T|} \sum_{t_i \in T} \frac{1}{n \cdot m} \sum_{w_j \in t_i} \sum_{w'_k \in w'} Sim(w_j, w'_k) \quad (2)$$

Sim2 measures the maximum similarity between each word $w_j \in t_i$ and the keywords $w'_k \in w'$, such that $\forall w_j \in t_i$ and $\forall w'_k \in w'$ with $0 < j \leq n$ and $0 < k \leq m$:

$$Sim_2(w_j, w') = \max_{w'_k \in w'} Sim(w_j, w'_k) \quad (3)$$

Then, we calculate B2 similarly to B1:

$$B_2 = \frac{1}{|T|} \sum_{t_i \in T} \frac{1}{n} \sum_{w_j \in t_i} Sim_2(w_j, w') \quad (4)$$

Both B1 and B2 scores aim to capture how the word distribution of a given dataset can lead to false positives. B1 evaluates how the whole set of keywords $w'$ semantically relates to the whole set of topics $T$ by measuring their relatedness to each topic word $w_j \in t_i$, then to each topic $t_i \in T$, whereas B2 verifies whether each topic word $w_j \in t_i$ is similar or identical to a keyword $w'_k \in w'$. In summary, B1 determines the average stability of topics given keywords, and B2 how regularly keywords appear in topics.
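The two metrics can be sketched in a few lines of Python. Here `topics` is a list of topics (each a list of topic words), `keywords` is the predefined list, and `sim` is any word-level similarity function (cosine similarity over embeddings in our experiments); the function names are illustrative, not from a released implementation:

```python
from statistics import mean

def b1(topics, keywords, sim):
    """Mean of sim(w_j, w_k) over every (topic word, keyword) pair,
    averaged over all topics."""
    return mean(
        mean(sim(w_j, w_k) for w_j in t_i for w_k in keywords)
        for t_i in topics
    )

def b2(topics, keywords, sim):
    """For each topic word, keep the similarity to its best-matching
    keyword, then average over topic words and topics."""
    return mean(
        mean(max(sim(w_j, w_k) for w_k in keywords) for w_j in t_i)
        for t_i in topics
    )
```

Note that with an exact-match similarity, b2 directly measures the fraction of topic words that are themselves keywords, which is consistent with B2 scores being larger on average.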

Results
In this section, we demonstrate the impact of our evaluation metrics applied to various datasets and using different similarity measures.

Experimental Settings
The preprocessing steps we apply to all the datasets consist of (1) anonymizing the tweets by changing @mentions to @user, then deleting the @user placeholders, and (2) using NLTK to remove stopwords. We then run the Gensim (Řehůřek and Sojka, 2010) implementation of LDA (Blei et al., 2003) to generate topics. We vary the number of topics and the number of topic words within the range [2, 100] to take the inherent variability of topic models into account.
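These preprocessing steps can be sketched as follows, using a small stand-in stopword set instead of NLTK's full list:

```python
import re

# stand-in for NLTK's stopword list
STOPWORDS = {"the", "a", "an", "to", "of", "and"}

def preprocess(tweet):
    # (1) anonymize: map every @mention to @user, then drop the placeholders
    tweet = re.sub(r"@\w+", "@user", tweet)
    tweet = tweet.replace("@user", "")
    # (2) lowercase, tokenize (keeping hashtags), and skip stopwords
    return [t for t in re.findall(r"#?\w+", tweet.lower()) if t not in STOPWORDS]
```

The resulting token lists can then be fed to Gensim's `Dictionary` and `LdaModel` to generate topics.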
In the general cases presented in Figures 1, 2, 3, and 4, we fix the number of topics to 8 when we alter the number of topic words and, likewise, we fix the number of topic words to 8 when we experiment with different numbers of topics. We define the semantic similarity measure Sim between each topic word and keyword to be the cosine similarity between their embedding vectors in the space of the multilingual pretrained Babylon embeddings (Smith et al., 2017) with respect to each of the seven languages we examine.
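Concretely, Sim reduces to a cosine over aligned vectors; a self-contained sketch with toy three-dimensional stand-ins for the multilingual embedding vectors:

```python
from math import sqrt

# toy stand-ins for aligned multilingual embedding vectors
VEC = {
    "refugees":   [0.9, 0.1, 0.0],
    "rapefugees": [0.8, 0.2, 0.1],
    "merkel":     [0.1, 0.9, 0.3],
}

def sim(w1, w2):
    # cosine similarity between the embedding vectors of w1 and w2
    v1, v2 = VEC[w1], VEC[w2]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm
```

In practice the vectors would come from the pretrained multilingual embedding files rather than a hand-written dictionary.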

Robustness Towards The Variability of Topic Models
Figures 1 and 2 show the average B1 and B2 score variations over all the datasets, for numbers of topics and topic words within the range [2, 100], respectively. Despite B1 scores being similar on average, we notice that the larger the number of topics, the more outliers we observe; in parallel, the smaller the number of words, the more outliers we see. This is due to possible randomness when large numbers of topics are generated.

On the other hand, B2 scores are larger on average due to the high probability of keywords appearing in topics regardless of the dataset. This naturally translates to B2 showing more stability with respect to the number of topics than to the number of topic words. The largest variations given the number of topic words occur in the German dataset (Ross et al., 2016): in German, we reach the maximum of 0.41 when the number of words in each topic equals 2, and the minimum when it equals 100. On the other hand, we observe the most noticeable changes when we vary the number of topics in French (Ousidhoum et al., 2019), with B1 = 0.34 when |T| = 2, down to 0.21 when |T| = 7, and back up to 0.37 when |T| = 100.

Robustness of Keyword-based Selection
However, we observe overall cohesion despite the change in the number of topics, especially in the case of Italian and Portuguese, owing to their limited numbers of search keywords, which equal 5 and 7 respectively.
Moreover, the account-based dataset of Mulki et al. (2019), referred to as AR3 in Figures 3 and 4, shows more robustness towards keywords. Nevertheless, such a collection strategy may generate a linguistic bias tied to the stylistic features of the targeted accounts, similar to the user bias in Waseem and Hovy (2016) reported by Arango et al. (2019).

Hate Speech Embeddings
Besides using the multilingual Babylon embeddings, we train hate speech embeddings with Word2Vec (Mikolov et al., 2013) to examine whether this can help us tackle the problem of out-of-vocabulary words caused by slang, slurs, named entities, and ambiguity.
Since we test on single French, German, Italian, Indonesian, and Portuguese datasets, we do not train embeddings for these languages due to the lack of data diversity. In contrast, we train English hate speech embeddings on the datasets of Waseem and Hovy (2016), Founta et al. (2018), and Ousidhoum et al. (2019), using Tweepy (http://docs.tweepy.org/en/latest/api.html) to retrieve the tweets that have not been deleted.

Table 3: B1 scores based on trained hate speech embeddings for 10 topics. We manually clustered the keywords released by Ousidhoum et al. (2019) based on discriminating target attributes. For instance, the word ni**er belongs to the origin (ORIG) category, raghead to religion (REL), and c**t to gender (GEN). For normalization purposes, we skipped disability since we did not find Arabic keywords that target people with disabilities.

The B1 scores reported in Table 3 are larger than the ones reported in Figures 1 and 3, owing to the difference in size between the embedding spaces of the Babylon and hate speech embeddings. Although our embeddings are trained on a limited amount of data, we can still notice slight differences in the scores. Interestingly, B1 scores reveal potentially overlooked targets, as in Albadi et al. (2018)'s sectarian dataset, which is supposed to target people based on their religious affiliations, yet whose B1 scores given all discriminating attributes are comparable.

Table 4: B1 scores for English hate speech datasets using WordNet, given 10 topics and keywords clustered based on origin (ORIG), religion (REL), and gender (GEN). The scores are reported for tweets that have not been labeled non-hateful or normal. Although we initially attempted to study the differences between pretrained word embeddings and word associations, we found that many (w_j, w_k) pairs involve out-of-vocabulary words. In such cases, (w_j, w_k) would have a WordNet WUP similarity score of 0, which is why the scores are in the range [0.25, 0.35].

General versus Corpus-Specific Lists of Keywords
We consider two examples in the following use case: (1) Waseem and Hovy (2016), who report building their dataset based on hashtags such as mkr, victim card, and race card, and (2) Albadi et al. (2018), who report building their sectarian dataset based on religious group names such as Judaism, Islam, Shia, Sunni, and Christianity. The initial lists of predefined keywords, such as the ones shown in Table 1, carry additional words in English and Arabic. Therefore, for these two datasets, we measured bias using two predefined lists of keywords: the initial list and one specific to the dataset in question. The scores given the general set of keywords are reported in Figures 3 and 4, where AR2 refers to Albadi et al. (2018) and EN2 to Waseem and Hovy (2016). The B1 and B2 scores given corpus-specific lists of keywords are either identical to or within ±0.01 of the reported scores, with a maximum observed difference of 0.03, which is why reporting them separately would have been repetitive.
In conclusion, this is a symptom of high similarity among existing English and Arabic hate speech datasets despite their seemingly different collection strategies and timelines.

WordNet and Targeted Hate Bias
In addition to word embeddings, we test our evaluation metrics with the WUP similarity (Wu and Palmer, 1994) over WordNet (Fellbaum, 1998). WUP evaluates the relatedness of two synsets, or word senses, c1 and c2, based on hypernym relations: synsets with short path distances are more related than those with longer ones. Wu and Palmer (1994) scale the depth of the two synset nodes by the depth of their Least Common Subsumer (LCS), i.e., the most specific concept that is an ancestor of both c1 and c2 (Newman et al., 2010).
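The WUP score itself is 2·depth(LCS) / (depth(c1) + depth(c2)); the following toy hypernym taxonomy (a stand-in, not WordNet) illustrates the computation:

```python
# toy hypernym tree: child -> parent (None marks the root)
PARENT = {"entity": None, "animal": "entity", "dog": "animal",
          "cat": "animal", "poodle": "dog"}

def depth(node):
    # number of nodes on the path from `node` up to the root
    d = 1
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d

def wup(c1, c2):
    # Least Common Subsumer: deepest ancestor shared by c1 and c2
    ancestors = set()
    n = c1
    while n is not None:
        ancestors.add(n)
        n = PARENT[n]
    n = c2
    while n not in ancestors:
        n = PARENT[n]
    return 2 * depth(n) / (depth(c1) + depth(c2))
```

For instance, wup("dog", "cat") scales the depth of their LCS "animal" by both synset depths, giving 2·2/(3+3) ≈ 0.67, while identical senses score 1. NLTK's WordNet interface exposes the same measure as `synset.wup_similarity`.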
In this use case, we aim to present a prospective label bias extension of our metrics by testing B 1 on toxic tweets only. Consequently, we consider tweets that were not annotated normal or non-hateful. We question the present annotation schemes by computing B 1 with Sim=WUP.
Waseem and Hovy (2016), Founta et al. (2018), and Ousidhoum et al. (2019) report using different keywords and hashtags to collect tweets. However, the scores shown in Table 4 indicate that the datasets may carry similar meanings, specifically because WUP relies on hypernymy rather than common vocabulary use. The comparison of B1 scores given target-specific keywords also implies that the annotations could be imprecise. We may therefore consider fine-grained labeling schemes that explicitly involve race, disability, or religious affiliation as target attributes, rather than general labels such as racist or hateful.

Case Study
Figures 5(a) and 5(b) show bias scores generated for the German dataset (Ross et al., 2016), which contains 469 tweets collected based on 10 keywords related to the refugee crisis in Germany. We notice that B1 scores fluctuate at first, reach a maximum, then decrease as the number of topics increases. B1 remains stable across different numbers of words, as opposed to B2 scores, which increase when more topic words are generated since, eventually, all topics would include at least one keyword.
On the other hand, Figures 5(c) and 5(d) show bias scores generated for the Indonesian dataset (Ibrohim and Budi, 2019), which contains more than 13,000 tweets collected based on a heterogeneous set of 126 keywords. In these settings, B1 is almost constant in both the number of topics and the number of topic words, contrary to B2 scores, which rise when many topics are generated, since new topics would include words that did not appear in the previously generated ones.

Discussion
We consider our bias evaluation metrics to be label-agnostic and tested this claim in the different use cases presented in Section 4. Table 5 reports Spearman's correlation scores between the properties of each dataset and its average B1 and B2 scores given different numbers of topics and topic words. The correlation scores show that, on average, our metrics do not depend on summary statistics either. We observe low correlation scores between the different features and B1 scores; B1 correlates best with the number of keywords and the vocabulary size, whereas B2 correlates best with the average cosine similarity between keywords.
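The correlation used here is the standard Spearman's rho, which, for rankings without ties, reduces to the difference-of-ranks shortcut. A minimal sketch of that no-ties case (an illustration, not the exact implementation we used):

```python
def spearman_rho(x, y):
    """Spearman's rho via 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no ties."""
    def ranks(v):
        # rank of each value, 1 = smallest
        return {val: i + 1 for i, val in enumerate(sorted(v))}
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Tied values require the rank-averaging variant (e.g. `scipy.stats.spearmanr`).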
Although our bias metrics do not take annotations into account, we notice a global trend of overgeneralizing labels as presented in Section 4.6. Despite the fact that this is partly due to the absence of a formal definition of hate speech, we do believe that there could be a general framework which specifies several aspects that must be annotated.
Moreover, we notice recurring topics in many languages, such as those centered around immigrants and refugees which may later lead to false positives during the classification and hurt the detection performance. Hence, we believe that our evaluation metrics can help us recognize complementary biases in various datasets, facilitate transfer learning, as well as enable the enhancement of the quality of the data during collection by performing an evaluation step at the end of each search round.
(a) Variations of B1 and B2 scores given #topics in the German dataset.
(b) Variations of B1 and B2 scores given #words in the German dataset.
(c) Variations of B1 and B2 scores given #topics in the Indonesian dataset.
(d) Variations of B1 and B2 scores given #words in the Indonesian dataset.

Figure 5: Variations of B1 (in blue) and B2 (in red) scores on the German and Indonesian datasets.

Conclusion
We proposed two label-agnostic metrics to evaluate bias in eleven hate speech datasets that differ in language, size, and content. The results reveal potential similarities across the available hate speech datasets, which may hurt classification performance.
As unpreventable as selection bias in social data can be, we believe it can be mitigated by incorporating evaluation as a step that guides the construction of a new dataset or the combination of existing corpora.
Our metrics are extensible to other forms of bias such as user, label, and semantic biases, and could be adapted in cross-lingual contexts using different similarity measures.