HONEST: Measuring Hurtful Sentence Completion in Language Models

Language models have revolutionized the field of NLP. However, language models capture and proliferate hurtful stereotypes, especially in text generation. Our results show that 4.3% of the time, language models complete a sentence with a hurtful word. These cases are not random, but follow language and gender-specific patterns. We propose a score to measure hurtful sentence completions in language models (HONEST). It uses a systematic template- and lexicon-based bias evaluation methodology for six languages. Our findings suggest that these models replicate and amplify deep-seated societal stereotypes about gender roles. Sentence completions refer to sexual promiscuity when the target is female in 9% of the time, and in 4% to homosexuality when the target is male. The results raise questions about the use of these models in production settings.


Introduction
1 Natural Language Processing powers many applications we use (or are subjected to) every day, e.g., internet search engines, virtual assistants, or recruiting tools. Increasingly, these applications include text generation. Unfortunately, these methods are likely to reproduce and reinforce a wide range of existing stereotypes in real-world systems. It is therefore important to quantify and understand these biases. Both to avoid the psychological burden of different vulnerable groups, and to advocate for equal treatment and opportunities. Recent research has focused on uncovering and measuring bias in input representations, models, and other aspects (Shah et al., 2020). For example, Bolukbasi et al. (2016); Caliskan et al. (2017); Gonen and Goldberg (2019) demonstrated the presence of implicit sexism in word embeddings. Zhao et al. 1 Note: this paper contains explicit statements of hurtful and offensive language in various languages, which may be upsetting to readers. (2017) demonstrated that models exaggerate found biases, and Kiritchenko and Mohammad (2018) showed that a simple change of pronouns or first names could significantly alter the sentiment of an otherwise identical sentence. Recently, contextualized language models, lead by Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), have become the standard in NLP leaderboards. 2 Several studies (Kurita et al., 2019;May et al., 2019;Sheng et al., 2019;Nangia et al., 2020) have analyzed their implicit biases related to word use and associations based on word similarity. However, apart from associations, these models can also generate or complete sentences in a cloze-test style. This capability opens new avenues for text generation, but also includes the risk of producing hurtful and stereotyped sentences.
We are the first to investigate the generation of explicitly hurtful stereotypes in language models for English and five gender-inflected languages (Italian, French, Portuguese, Romanian, and Span-ish). Gender-inflected languages associate a grammatical gender case with verbs, nouns, and adjectives. In English, "X is known for ___" describes statements for male and female X. In genderinflected languages, we also have to inflect the verb and article "elle/il est connue/connu comme une/un ___". This complex gender marking makes stereotyped completions more likely, but also requires a carefully designed study to identify societal stereotypes in these less-investigated languages. 3 We manually create a benchmark set of cloze sentence templates, validated by native speakers for syntactic correctness. Table 1 shows examples of templates filled by BERT models in different languages. We fill these templates via languagespecific language models (BERT and GPT-2) and measure the number of hurtful words generated that way. We further categorize the words via a lexicon of hurtful words (Bassignana et al., 2018). Finally, we introduce a measure, the HONEST score (hurtfulness of language model sentence completion), to compute how likely each language model is to produce hurtful completions.
Contributions 1) We release a novel benchmark data set of manually-created sentence templates to measure the generation of hurtful sentence completions in six languages; 2) we use this dataset to assess gendered stereotype bias in the generated results; 3) we propose a measure, HONEST, to understand which language model generates more hurtful sentences; 4) we release code and data for reproducibility at https://github.com/MilaNLProc/ honest.

Method
Template-closed forms are an effective way of evaluating language models. Petroni et al. (2019) use cloze-based forms to evaluate the amount of relational knowledge included in BERT, and Ettinger (2020) use them as a set of psycholinguistic diagnostic tools. Cloze-based forms have a long history in psycholinguistics to understand human sentence processing (Ettinger, 2020).
Here, we use a similar methodology to test hurtful language in different language models. For example, our templates look as follows: "X are good at ___", 3 Grammatical gender is not the same as biological sex or societal gender, but gender-inflected languages do usually assign different grammatical gender to male and female subjects.  where X is a variable identity term and the blank is the part to be completed. We generate a set of these templates similarly to Sheng et al. (2019), but specifically made for gender-inflected languages. Table 2 shows some examples of our templates. We use a set of identifier terms in singular and plural (i.e., woman, women, girl, boys) and a series of predicates (i.e., "works as ___", "is known for ___"). The identity terms differ in grammatical gender in all our languages. Our templates have been checked by native speakers of each of the five languages to ensure we create syntactically correct and meaningful sentences. Moreover, we asked the native speakers to make the templates as natural as possible in the respective language. We created a dataset of 420 instances for each language, which are generated from 28 identity terms (14 male and 14 female) and 15 templates.
Language Models BERT is natively trained on a cloze-style masked language modeling task: random tokens in the input sentence are replaced with a [MASK] token, for BERT to predict the correct token. After training, given a new sentence with a [MASK] token (the "___" in our templates), BERT returns the most probable tokens that could replace [MASK].
GPT-2 does not use masked tokens. Instead, we give the model the unfinished sentence up to the blank and let it complete it. This potentially returns several words.
Lexicon We use HurtLex (Bassignana et al., 2018), a multilingual lexicon of hurtful language, to evaluate the completions. We want to quantify how many times a sentence is completed with a hurtful word and if there is a difference between genders (represented by the identity terms). HurtLex maps each word to a class of hurtful expressions (e.g., DEROGATORY WORDS, ANI-MALS). For each BERT completion, we check if the word is present in the lexicon. For GPT-2, we   check if at least one of the words generated by the model is in the lexicon. Then, we compute the percentage of hurtful completions for each category among the top-K candidates.
HONEST Finally, we define a global score to get a birds-eye view of the models' performance, which is complementary to the category analysis. The HONEST score is the average of hurtful completions of any class. More formally, for a language model LM , assuming we have a set T templates filled with the identity terms, we define HONEST as follows: t∈T c∈compl(LM,t,K) 1 HurtLex (c)

|T | * K
Here, 1 HurtLex is the indicator function for the set of words in HurtLex and compl(LM, t, K) is the function that returns the top-K completions of LM on template t.

Experimental Evaluation
We use the HuggingFace (Wolf et al., 2020) pretrained language-specific BERT and GPT-2 models (see Appendix B for more details).
Tables 3 and 4 show the percentage of hurtful words BERT and GPT-2 have generated to complete the templates. 4 The tables distinguish the languages, gender target, and the different HurtLex categories. Higher percentages are reported in red, lower ones in green.
The average higher percentages of GPT-2 with respect to the BERT ones (6.2% vs. 3.1% respectively) are due to the different evaluation settings (see Section 2). Because GPT-2 potentially produces more words, this difference suggests that the more text we generate, the more likely one is to find an appearance of hurtful stereotypes.
The category with the highest percentages across all languages is, unsurprisingly, DEROGATORY WORDS. They are defined in (Bassignana et al., 2018) as words with a clearly offensive and negative value, which are not negative stereotypes or slurs. These terms are associated with both men and women with relatively high percentages (10.7% on average). Nevertheless, there is a slight imbalance between male and female subjects (12.3% vs. 9.2%, respectively). This is due to some limitations of the lexicon. While it includes numerous genderspecific insults, nouns and adjectives with negative connotations that can be used for both genders are predominantly reported in their male form. For example, the Italian lexicon contains male insults such as buffone, cretino, zotico (en. clown, moron, lout), but not their female versions buffona, cretina, zotica. This does not affect the insights presented in this paper. On the contrary, it suggests that our results are an underestimation of the real problem, making this only the tip of the iceberg.
Some categories have more pronounced differences between men and women. Our findings cor-roborate the study by Preston and Stanley (1987). Thirty years later, insults against women still continue to be based on sexual promiscuity, and slurs against men refer to homosexuality or sexual inadequacy (Scruton, 2017).
Words containing references to sexual promiscuity (the categories PROSTITUTION and FEMALE GENITALIA) are associated predominantly with women (9% each for women, vs. 1.4% and 3% for men). This is not surprising, considering the disproportionately larger number of terms for a sexually promiscuous woman in all our languages (Stanley, 1977). Examples of sentence completions for this category are: the woman is a whore, the girl likes to have sex, the woman is known as a prostitute looking for money.
On the other hand, the HOMOSEXUALITY category is more strongly associated with men (4% vs. 1.2% for women). A psychological study (Slaatten et al., 2014) demonstrated that gay-related namecalling is associated with the endorsement of hypermasculine norms. Similar to the previous case, this is also motivated by the fact that epithets associated with male homosexuality are more frequent than ones associated with female homosexuality in all our languages. Examples of sentence completions for this category are: the husband is known as a homosexual, he is gay. While these sentences are not necessarily hurtful, sexual orientation terms (a protected category) should not be the first completion given by a language model for general purpose templates. We also investigated the possible impact of model design and training data but did not identify any systematic differences. HONEST. In Table 5 we show the HONEST scores for different language models and languages. Our results show that CamemBERT is the BERTderived model with the most hurtful language generation issues. The same is true for GPT-2 trained on French data, suggesting that French models should take this issue into consideration. The best results come from Portuguese and Spanish models. These results could indicate either differences in training data or language-specific differences in the use of swearwords.

Related Work
The analysis of bias in Natural Language Processing has gained a lot of attention in recent years (Hovy and Spruit, 2016; Shah et al., 2020), specifically on gender bias (Zhao et al.,   The pioneering work of (Bolukbasi et al., 2016) demonstrated that word embeddings (even when trained on formal corpora) exhibit gender stereotypes to a disturbing extent. On top of that, several studies have been proposed to measure and mitigate bias in word embeddings (Chaloner and Maldonado, 2019;Zhou et al., 2019;Nissim et al., 2020) and more recently on pre-trained contextualized embeddings models (Kurita et al., 2019;May et al., 2019;Field and Tsvetkov, 2019;Sheng et al., 2019;Nangia et al., 2020;Vig et al., 2020).
However, most studies focus on English. Despite a plethora of available language-specific models (Nozza et al., 2020), there currently exist few studies on biases in other languages. This is a severe limitation, as English findings do not automatically extend to other languages, especially if those exhibit morphological gender agreement. Only McCurdy and Serbetçi (2017); Zhou et al. (2019) examine the bias in word embeddings of gender-inflected languages, demonstrating the need for an adequate framework different from the ones proposed for English. To the best of our knowledge, we are the first to investigate stereotype bias in various language model completions beyond English.

Conclusion
We present the first analysis of stereotyped sentence completions generated by contextual models in gender-inflected languages. We introduce the HONEST score to quantify the amount of hurtful completions in a language model. We release a novel benchmark data set of manually created templates, validated by native speakers in five gender-inflected languages, i.e., Italian, French, Portuguese, Romanian, and Spanish. Our results show that BERT and GPT-2, nowadays ubiquitous in research and industrial NLP applications, demonstrate a disturbing tendency to generate hurtful text. In particular, template sentences with a female subject are completed in 10% of the time with stereotypes about sexual promiscuity. Sentences with male subjects are completed 5% of the times with stereotypes about homosexuality. This finding raises questions about the role of these widespread models in perpetuating hurtful stereotypes. In future work, we will investigate sentence completions with "benevolent sexism" categories (Jha and Mamidi, 2017), e.g., stereotypes like women are good at cooking or men are good at ruling. Moreover, we plan to study the handling of protected category terms in natural language generation systems with data augmentation (Dixon et al., 2018;Nozza et al., 2019) and regularization techniques (Kennedy et al., 2020).

Ethical Considerations
Our experimental results suggest a need to discuss the ethical aspect of these models. BERT and GPT-2 have shown astonishing capabilities and pushed the envelope of natural language understanding -not without some doubts (Bisk et al., 2020;Bender and Koller, 2020). However, our results, together with those of (Sheng et al., 2019;Kurita et al., 2019;Zhou et al., 2019), should make us reflect on the dual use of these models, i.e., how they are used outside our research community.
Can BERT or GPT-2 harm someone if used in production, by proliferating and amplifying harmful stereotypes? These models are now often included in industrial pipelines that are generally driven by economic needs, not academic interest. When we combine this ubiquity with the general low interpretability of deep learning methods, we can easily see a problematic issue.
Pre-trained models are often used as-is, but they bring their biases along wherever they are used: trusting the pre-training to be fair can give a false sense of security. This is directly connected to the recent easy availability of these models; almost anyone can download and use a pre-trained model now. While this is a great advancement for the democratization of technology, it also raises serious questions.
We, as scientists, should be aware of the consequences the naïve use of these models can have. Democratizing without educating can damage those people who fight the most to be recognized as equal members of our society, if our models continue to spread old hurtful stereotypes.
Finally, we want to explicitly address the limitation of our approach with respect to the binary nature of our gender analysis. The lack of representation for non-binary people and the gender assumption of the identity terms is a major limitation in our work. It is due to data and language constraints, not a value judgment. We want to add our voice to Mohammad (2020) in the hope of future work to disaggregate information for different genders.

Data Statement
We follow Bender and Friedman (2018) on providing a Data Statement for our templates to provide a better picture of the possibilities and limitations of the data, and to allow future researchers to spot any biases we might have missed.
Templates were generated by native speakers of the respective languages from European Countries, all in the age group 25-30. The data we share is not sensitive to personal information, as it does not contain information about individuals. Our data does not contain hurtful messages that can be used in hurtful ways.