Does Gender Matter? Towards Fairness in Dialogue Systems

Recently, there have been increasing concerns about the fairness of Artificial Intelligence (AI) in real-world applications such as computer vision and recommendations. For example, recognition algorithms in computer vision have been unfair to black people, for instance by poorly detecting their faces and inappropriately identifying them as “gorillas”. As one crucial application of AI, dialogue systems have been extensively applied in our society. They are usually built with real human conversational data; thus they could inherit some fairness issues that exist in the real world. However, the fairness of dialogue systems has not been well investigated. In this paper, we perform a pioneering study about the fairness issues in dialogue systems. In particular, we construct a benchmark dataset and propose quantitative measures to understand fairness in dialogue models. Our studies demonstrate that popular dialogue models show significant prejudice towards different genders and races. Besides, to mitigate the bias in dialogue systems, we propose two simple but effective debiasing methods. Experiments show that our methods can reduce the bias in dialogue systems significantly. The dataset and the implementation are released to foster fairness research in dialogue systems.


Introduction
AI techniques have brought great conveniences to our lives. However, they have been shown to be unfair in many real-world applications such as computer vision (Howard and Borenstein, 2018), audio processing (Rodger and Pendharkar, 2004), and recommendations (Yao and Huang, 2017). In other words, AI techniques may make decisions that are skewed towards certain groups of people in these applications (Mehrabi et al., 2019). In the field of computer vision, some face recognition algorithms fail to detect faces of black users (Rose, 2010) or inappropriately label black people as "gorillas" (Howard and Borenstein, 2018). In the field of audio processing, it is found that voice-dictation systems recognize a voice from a male more accurately than that from a female (Rodger and Pendharkar, 2004). Moreover, when predicting criminal recidivism, risk assessment tools tend to predict that people of certain races are more likely to commit a crime (Tolan et al., 2019). The fairness of AI systems has become one of the biggest concerns due to its huge negative social impacts.
Dialogue systems are important AI applications. They interact with users through human-like conversations to satisfy their needs. Conversational question answering agents provide users with the information they want to find (Saha et al., 2018). Task-oriented dialogue agents, such as Apple Siri and Microsoft Cortana, assist users to complete specific tasks such as trip planning and restaurant reservations (Jurafsky and Martin, 2009). Non-task-oriented dialogue agents, also known as chatbots, are designed to chit-chat with users in open domains for entertainment (Ritter et al., 2011). Dialogue systems have attracted increasing attention in the academic field (Chen et al., 2017; Gao et al., 2019) and have been widely deployed in our daily lives. However, the fairness issues of dialogue systems have not been well studied yet. Dialogue systems are often built based on real human conversational data through machine learning especially deep learning techniques (Shang et al., 2015; Serban et al., 2016; Serban et al., 2017). Thus, they are likely to inherit some fairness issues against specific groups that are held in the real world such as gender and racial biases. Examples of gender and racial biases we observed from a popular Transformer retrieval dialog model are demonstrated in Table 1. When we simply change a word of males in a given context to its counterpart of females such as from "he" to "she", the sentiments of the corresponding responses are changed from positive to negative. As we replace a phrase in standard English with African American English such as replacing "this" with "dis", the response becomes more offensive. The goal of dialogue systems is to talk with users and provide them with assistance and entertainment. If the systems show discriminatory behaviors, some underprivileged groups of users can be offended. Moreover, public commercial chatbots can get resisted for their improper speech (Wolf et al., 2017). Hence, there is an urgent demand to investigate the fairness issues of dialog systems.

Table 1: Examples of gender and racial biases in dialogue systems.
Context | Response | Bias
Hahaha, he has a really cute laugh and smile:d | He really is the sweetest. All he does is make that cute little smile. | Sentiment: positive
Hahaha, she has a really cute laugh and smile:d | She cute but she is also kinda evil. | Sentiment: negative
Oh my god, for real, what is with this music during the downtime? | Also this reminds me: my live karaoke cover band cure-aoke is still the best idea I've ever had. | Not offensive
Oh my god, for real, what is with dis music during the downtime? | The only good future song is percocet and stripper joint. I have no idea why that one is good but the rest are hot wet poo. | Offensive

In this work, we conduct a pioneering study about the fairness issues in two types of popular dialogue models, i.e., generative dialogue models (Sutskever et al., 2014) and retrieval dialogue models (Vaswani et al., 2017). In particular, we aim to answer three research questions: (1) do fairness issues exist in dialogue models? (2) how to quantitatively measure fairness? and (3) how to mitigate the bias in dialogue systems and ensure the fairness of them? Our key contributions are summarized as follows:
• We construct a benchmark dataset to study gender and racial biases in dialogue models;
• We define the fairness in dialogue systems formally and introduce a set of measurements to understand the fairness of a dialogue system quantitatively;
• We propose two simple but effective debiasing methods which are demonstrated by experiments to be able to mitigate the biases in dialogue systems significantly.
The rest of the paper is organized as follows. First, in Section 2, we define the fairness in dialogue systems, present our approach to constructing the dataset for the fairness research, and detail the measurements to understand the fairness of dialogue models. Then, in Section 3, we conduct a fairness test on two representative dialogue models to verify whether dialogue systems can be biased. Afterward, we introduce our debiasing methods and show the experimental results in Section 4. Next, in Section 5, we present related works. Finally, we summarize and conclude the work in Section 6.

Fairness Analysis in Dialogue Systems
In this section, we first formally define fairness in dialogue systems. Then we introduce our method to construct the dataset to investigate fairness and detail various measurements to quantitatively evaluate fairness in dialogue systems.
As shown in the examples in Table 1, the fairness issues in dialogue systems exist between different pairs of groups, such as male vs. female and white people vs. black people. Also, fairness of dialogue systems can be measured in different ways, such as sentiment and politeness. In this section, we propose a general definition of fairness in dialogue systems that covers all specific situations.
We denote the pair of groups we are interested in as $G = (A, B)$, where $A$ and $B$ can be male and female in the gender case, or white people and black people in the race case. For a context $C_A$ that contains words related to group $A$, the context $C_B$ obtained by replacing those words with their counterparts related to group $B$ is called the parallel context of context $C_A$. The pair $(C_A, C_B)$ is referred to as a parallel context pair. We suppose the context $C_A$ related to group $A$ follows a distribution $T_A$. Correspondingly, the parallel context $C_B$ follows a mirror distribution $T_B$.
Definition 1 Given a dialogue model $D$ that can be viewed as a function $D : \{C \mid C \to R\}$ which maps a context $C$ to a response $R$, as well as a measurement $M$ that maps a response $R$ to a scalar score $s$, the dialogue model $D$ is considered to be fair for groups $A$ and $B$ in terms of the measurement $M$ when:
$$\mathbb{E}_{C_A \sim T_A}\, M(D(C_A)) = \mathbb{E}_{C_B \sim T_B}\, M(D(C_B)). \tag{1}$$
To test the fairness of dialogue systems, we will first build a very large parallel context corpus to estimate the context distributions $T_A$ and $T_B$. Then we will formulate the fairness analysis problem as a hypothesis-testing problem with regard to Equation 1.

Hypothesis Test
Suppose we have a large parallel context corpus containing $n$ parallel context pairs $\{(C_A^{(i)}, C_B^{(i)})\}_{i=1}^{n}$, which can be viewed as $n$ samples from the distributions $T_A$ and $T_B$. To test the hypothesis in Equation 1, we set $\mu_A = \mathbb{E}_{C_A \sim T_A}\, M(D(C_A))$ and $\mu_B = \mathbb{E}_{C_B \sim T_B}\, M(D(C_B))$. Then we have the hypotheses:
$$H_0 : \mu_A = \mu_B, \qquad H_1 : \mu_A \neq \mu_B.$$
Let $X_A = M(D(C_A))$ and $X_B = M(D(C_B))$. When $n$ is large enough, we can construct a Z-statistic which approximately follows the standard normal distribution:
$$Z = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{S_A^2}{n} + \dfrac{S_B^2}{n}}},$$
where $\bar{x}_A$, $\bar{x}_B$ are the sample means of $X_A$ and $X_B$ and $S_A^2$, $S_B^2$ are their sample variances. In the experiments, we will use the Z-statistic for the hypothesis test. If its corresponding p-value is less than 0.05, then we reject the null hypothesis $H_0$ and consider the dialogue model to be not fair for groups $A$ and $B$ in terms of measurement $M$.
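The test above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `z_test` is hypothetical, the inputs are assumed to be the per-response scores for the two groups, and the p-value is assumed to be two-sided.

```python
import math
from statistics import mean, variance


def z_test(x_a, x_b):
    """Two-sample Z-test on measurement scores of two groups.

    x_a, x_b: equally long lists of scores M(D(C_A)) and M(D(C_B)).
    Returns (z, p), where p is a two-sided p-value (an assumption here).
    """
    n = len(x_a)
    if n != len(x_b) or n < 2:
        raise ValueError("need two samples of equal size n >= 2")
    # Standard error built from the two sample variances.
    se = math.sqrt(variance(x_a) / n + variance(x_b) / n)
    if se == 0.0:
        raise ValueError("both samples are constant; Z is undefined")
    z = (mean(x_a) - mean(x_b)) / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

With binary scores (for example 1 for an offensive response and 0 otherwise), the sample means are the ratios reported by the politeness and sentiment measurements.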

Parallel Context Data Construction
To study the fairness of a dialogue model on a specific pair of groups G, we need to build data O G which contains a great number of parallel context pairs. We first collect a list of gender word pairs for the (male, female) groups and a list of race word pairs for the (white, black) groups. The gender word list consists of male-related words with their female-related counterparts. The race word list consists of common African American English words or phrases paired with their counterparts in standard English. Some examples are shown in Table 2(a). For the full lists, please refer to Appendix A.1 and A.2. Afterward, for each word list, we first select from a large dialogue corpus a certain number of contexts that contain at least one word or phrase in the list. Then, we construct parallel contexts by replacing these words or phrases with their counterparts. All the obtained parallel context pairs form the data to study the fairness of dialogue systems.
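The construction procedure can be illustrated with the following sketch. The word pairs in `GENDER_PAIRS` are a tiny hypothetical sample rather than the lists from the appendices, and the function names and the whole-word, case-insensitive matching are assumptions made for this example.

```python
import re

# Hypothetical toy sample; the full word lists are given in the appendices.
GENDER_PAIRS = [("he", "she"), ("his", "her"), ("man", "woman")]


def _match_case(source, target):
    # Keep a leading capital letter so "He" becomes "She".
    return target.capitalize() if source[:1].isupper() else target


def build_parallel_pairs(contexts, word_pairs, limit=None):
    """Select contexts containing a listed word and build their parallel contexts."""
    mapping = {a.lower(): b for a, b in word_pairs}
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE
    )
    pairs = []
    for context in contexts:
        if not pattern.search(context):
            continue  # context mentions no word from the list
        parallel = pattern.sub(
            lambda m: _match_case(m.group(0), mapping[m.group(0).lower()]), context
        )
        pairs.append((context, parallel))
        if limit is not None and len(pairs) >= limit:
            break
    return pairs
```

Replacing all listed words in a single regular-expression pass avoids rewriting a word that was itself produced by an earlier replacement.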

Fairness Measurements
In this work, we evaluate fairness in dialogue systems in terms of four measurements, i.e., diversity, politeness, sentiment, and attribute words.

Diversity
Diversity of responses is an important measurement to evaluate the quality of a dialogue system (Chen et al., 2017). Dull and generic responses make users bored while diverse responses make a conversation more human-like and engaging. Hence, if a dialogue model produces differently diverse responses for different groups, the user experience of a part of users will be impacted. We measure the diversity of responses through the distinct metric (Li et al., 2016). Specifically, let distinct-1 and distinct-2 denote the numbers of distinct unigrams and bigrams divided by the total number of generated words in the responses. We report the diversity score as the average of distinct-1 and distinct-2 scores.
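As a minimal sketch of the diversity score, assuming whitespace tokenization, lowercasing, and n-grams that do not cross response boundaries (none of which is specified in the text), the computation could look as follows. The function names are hypothetical.

```python
def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of generated words."""
    total_words = 0
    ngrams = set()
    for response in responses:
        tokens = response.lower().split()  # assumed tokenization
        total_words += len(tokens)
        # zip over shifted copies of the token list yields the n-grams.
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    return len(ngrams) / total_words if total_words else 0.0


def diversity_score(responses):
    """Average of distinct-1 and distinct-2."""
    return (distinct_n(responses, 1) + distinct_n(responses, 2)) / 2
```

For the two responses "i do not know" and "i do not care", there are 5 distinct unigrams and 4 distinct bigrams over 8 words, giving distinct-1 = 0.625, distinct-2 = 0.5 and a diversity score of 0.5625.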

Politeness
Chatbots should talk politely with human users. Offensive responses cause users discomfort and should be avoided (Henderson et al., 2018; Dinan et al., 2019b; Liu et al., 2019; Liu et al., 2020b). Unfairness in terms of politeness exists when a dialogue model is more likely to provide offensive responses for a certain group of people than for others. In this measurement, we apply an offensive language detection model (Dinan et al., 2019b) to predict whether a response is offensive or not. This model is specialized to judge offensive language in dialogues. The politeness measurement is defined as the expected probability of a response to the context of a certain group being offensive. It is estimated by the ratio of the number of offensive responses over the total number of produced responses.

Sentiment
The sentiment of a piece of text refers to the subjective feelings it expresses, which can be positive, negative, or neutral. A fair dialogue model should provide responses with a similar sentiment distribution for people of different groups. In this measurement, we assess the fairness in terms of sentiment in dialogue systems. We use the public sentiment analysis tool Vader (Hutto and Gilbert, 2014) to predict the sentiment of a given response. It outputs a normalized, weighted composite score of sentiment ranging from −1 to 1. Since the responses are very short, the sentiment analysis could be inaccurate. To ensure the accuracy of this measure, we only consider the responses with scores higher than 0.8 as positive and the ones with scores lower than −0.8 as negative. The sentiment measures are the expected probabilities of a response to the context of a certain group being positive and negative. They are estimated by the ratios of the numbers of responses with positive and negative sentiments, respectively, over the total number of produced responses.
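The thresholding step can be sketched as below. The function name `sentiment_rates` is hypothetical, and the input is assumed to be a list of composite scores in [−1, 1] that were already produced by the sentiment tool; the sketch does not call the tool itself.

```python
def sentiment_rates(compound_scores, threshold=0.8):
    """Return (positive ratio, negative ratio) over all produced responses.

    A response counts as positive if its score is above `threshold` and as
    negative if it is below `-threshold`; everything in between is ignored.
    """
    n = len(compound_scores)
    if n == 0:
        return 0.0, 0.0
    positive = sum(1 for s in compound_scores if s > threshold)
    negative = sum(1 for s in compound_scores if s < -threshold)
    return positive / n, negative / n
```

For the scores [0.9, 0.85, 0.1, −0.95, −0.5], two of five responses are positive and one is negative, so the function returns (0.4, 0.2).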

Attribute Words
People usually have stereotypes about some groups and think that they are more associated with certain words. For example, people tend to associate males with words related to careers and females with words related to family (Islam et al., 2016). These words are called attribute words. We measure this kind of fairness in dialogue systems by comparing the probability of attribute words appearing in the responses to contexts of different groups. We build a list of career words and a list of family words to measure the fairness on the (male, female) groups. For the (white, black) groups, we construct a list of pleasant words and a list of unpleasant words. We build more comprehensive attribute word lists based on the attribute words provided in (Islam et al., 2016). Table 2 (b) shows some examples of the attribute words. The full lists can be found in Appendices A.3 and A.4. In the measurement, we report the expected number of the attribute words appearing in one response to the context of different groups. This measurement is estimated by the average number of the attribute words appearing in one produced response.
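A minimal sketch of this estimate is given below. The word set `CAREER_WORDS` is a small hypothetical sample, not the list from the appendices, and the sketch matches lowercased surface tokens without any lemmatization.

```python
import re

# Hypothetical toy sample of career-related attribute words.
CAREER_WORDS = {"office", "salary", "career"}


def avg_attribute_words(responses, attribute_words):
    """Average number of attribute-word occurrences per produced response."""
    if not responses:
        return 0.0
    total = 0
    for response in responses:
        tokens = re.findall(r"[a-z']+", response.lower())  # assumed tokenization
        total += sum(1 for token in tokens if token in attribute_words)
    return total / len(responses)
```

Calling the function once on the responses to group A contexts and once on the responses to group B contexts gives the two values to compare.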

Experiment on Fairness Test
In this section, we first introduce the two popular dialogue models under study, then detail the experimental settings, and finally, we present the fairness results with discussions.

Dialogue Models
Typical chit-chat dialogue models can be categorized into two classes (Chen et al., 2017): generative models and retrieval models. Given a context, the former generates a response word by word from scratch while the latter retrieves a candidate from a fixed repository as the response according to some matching patterns. In this work, we investigate the fairness in two representative models in the two categories, i.e., the Seq2Seq generative model (Sutskever et al., 2014) and the Transformer retrieval model (Vaswani et al., 2017).

The Seq2Seq Generative Model
Seq2Seq models are popular in sequence generation tasks (Sutskever et al., 2014), such as text summarization, machine translation, and dialogue generation. A Seq2Seq model consists of an encoder and a decoder, both of which are typically implemented by RNNs. The encoder reads a context word by word and encodes it as a fixed-dimensional context vector. The decoder then takes the context vector as input and generates the corresponding output response. The model is trained by optimizing the cross-entropy loss with the words in the ground truth response as the positive labels. The implementation details are as follows. Both the encoder and the decoder are implemented by 3-layer LSTM networks with hidden states of size 1,024. The last hidden state of the encoder is fed into the decoder to initialize the hidden state of the decoder. Pre-trained Glove word vectors (Pennington et al., 2014) are used as the word embeddings with a size of 300. The model is trained through stochastic gradient descent (SGD) with a learning rate of 1.0 on 2.5 million single-turn dialogues collected from Twitter. In the training process, the dropout rate and the gradient clipping value are both set to 0.1.

The Transformer Retrieval Model
The Transformer proposed in (Vaswani et al., 2017) is an encoder-decoder framework, which models sequences purely with attention mechanisms instead of RNNs. Specifically, in the encoder part, positional encodings are first added to the input embeddings to indicate the position of each word in the sequence. Next, the input embeddings pass through stacked encoder layers, where each layer contains a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. The retrieval dialogue model only takes advantage of the encoder to encode the input contexts and candidate responses. Then, the model retrieves the candidate response whose encoding best matches the encoding of the context as the output. The model is trained in batches of instances, by optimizing the cross-entropy loss with the ground truth response as a positive label and the other responses in the batch as negative labels.
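The batch-wise training objective and the retrieval step can be illustrated with plain Python lists standing in for encoder outputs. This is a sketch under assumptions: the matching score is taken to be a dot product, which the text does not specify, and the function names are hypothetical.

```python
import math


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def in_batch_retrieval_loss(context_vecs, response_vecs):
    """Mean cross-entropy where response i is the positive for context i
    and all other responses in the batch act as negatives."""
    total = 0.0
    for i, context in enumerate(context_vecs):
        scores = [_dot(context, response) for response in response_vecs]
        top = max(scores)
        # Numerically stable log-sum-exp over all candidates in the batch.
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[i]
    return total / len(context_vecs)


def retrieve(context_vec, candidate_vecs):
    """Index of the candidate whose encoding scores highest against the context."""
    scores = [_dot(context_vec, candidate) for candidate in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Reusing the other responses of a batch as negatives avoids sampling negatives separately; at inference time the same score ranks the fixed candidate repository.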
The implementation of the model is detailed as follows. In the Transformer encoder, we adopt 2 encoder layers. The number of attention heads is set to 2. The word embeddings are randomly initialized and their size is set to 300. The hidden size of the feed-forward network is set to 300. The model is trained with the Adamax optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001 on around 2.5 million single-turn dialogues collected from Twitter. In the training process, the dropout mechanism is not used. The gradient clipping value is set to 0.1. The candidate response repository is built by randomly choosing 500,000 utterances from the training set.

Experimental Settings
In the experiment, we focus only on single-turn dialogues for simplicity. We use a public conversation dataset that contains around 2.5 million single-turn conversations collected from Twitter to train the two dialogue models. The models are trained under the ParlAI framework. To build the data to evaluate fairness, we use another Twitter dataset which consists of around 2.4 million single-turn dialogues. For each dialogue model, we construct a dataset that contains 300,000 parallel context pairs as described in the last section. When evaluating the diversity, politeness, and sentiment measurements, we first remove the repetitive punctuation from the produced responses since it interferes with the performance of the sentiment classification and offense detection models. When evaluating with the attribute words, we lemmatize the words in the responses through the WordNet lemmatizer in the NLTK toolkit (Bird, 2006) before matching them with the attribute words.

Experimental Results
We first present the results of fairness in terms of gender in Tables 3 and 4. We feed 300,000 parallel context pairs of (male, female) into the dialogue models and evaluate the produced responses with the four measurements. We also show the values of Z-statistics and their corresponding p-values. We make the following observations from the tables. First, in terms of diversity, the retrieval model produces more diverse responses than the generative model. This is consistent with the fact that the Seq2Seq generative model tends to produce duller and more generic responses (Li et al., 2016) than retrieval models. We observe that both models produce more diverse responses for males than for females, which may be unfair in terms of diversity in dialogue systems. Second, from the politeness measurement, we can see that females receive more offensive responses from both models, which shows that dialogue systems talk to females in a less friendly way than to males. Third, sentiment results show that females receive more negative responses and fewer positive responses. Fourth, in terms of the attribute words measurement, there are more career words appearing in the responses for males and more family words in the responses for females. This is consistent with people's stereotype that males dominate the field of career while females are more family-minded. Finally, in almost all the cases, the p-value of the hypothesis test is less than 0.05, which indicates that the null hypothesis $H_0$ should be rejected and that the biases against different genders in dialogue models are statistically significant.
Then we show the results of fairness in terms of race in Tables 5 and 6. Similarly, 300,000 parallel context pairs of (white, black) are input into the dialogue models. From the tables, we make the following observations. First, black people receive less diverse responses from the two dialogue models, which is unfair in terms of diversity for races. Second, dialogue models tend to produce more offensive language for black people. Third, in terms of the sentiment measurements, black people get more negative responses but fewer positive responses. Fourth, as for the attribute words, unpleasant words are mentioned more frequently for black people, while white people are associated with more pleasant words. Finally, for all the measurements, the p-values we get are far less than 0.05, which confirms the statistical significance of the above results.
To summarize, the dialogue models trained on real-world conversation data indeed share similar unfairness as that in the real world in terms of gender and race. Given that dialogue systems have been widely applied in our society, it is strongly desired to handle the fairness issues in dialogue systems.

Debiasing Methods
Given that our experiments show that there exist significant biases in dialogue systems, a natural question arises: how can we remove the biases in dialogue systems and ensure their fairness? Note that for retrieval-based dialogue models, all the possible responses are chosen from a repository. So there exists a trivial but effective way to eliminate the biases by simply removing all the biased candidate responses from the response pool. Hence, we only consider the debiasing problem of the generative Seq2Seq dialogue model. To solve this problem, we introduce two simple but effective debiasing methods: (1) counterpart data augmentation (CDA); and (2) word embedding regularization (WER).

Counterpart Data Augmentation
The biases of learning-based models come from training data. Thus, we can remove the biases in dialogue systems from their sources by eliminating the biases in the data (Bellamy et al., 2018). Borrowing the idea from (Maudslay et al., 2019), we simply augment the training data by adding counterpart dialogue data based on the original data. To construct training data free from gender or race bias, for each context-response pair in the original training data, we replace all the gender or race words (if any) in it with their counterparts and add the resulting context-response pair into the training set as the augmented data.
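The augmentation step can be sketched as follows. The word pairs are a hypothetical toy sample, the swap is assumed to be bidirectional (each word of a pair is replaced by the other), and a counterpart pair is only added when at least one word was actually replaced; these choices and the function names are assumptions of the sketch.

```python
import re

# Hypothetical toy sample of gender word pairs.
WORD_PAIRS = [("he", "she"), ("man", "woman"), ("boy", "girl")]


def swap_words(text, word_pairs):
    """Replace every listed word by its counterpart in a single pass."""
    mapping = {}
    for a, b in word_pairs:
        mapping[a] = b
        mapping[b] = a
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE
    )
    # Replaced words are emitted in lowercase for simplicity.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


def augment(dialogues, word_pairs):
    """Return the original (context, response) pairs plus their counterparts."""
    augmented = list(dialogues)
    for context, response in dialogues:
        counterpart = (swap_words(context, word_pairs), swap_words(response, word_pairs))
        if counterpart != (context, response):
            augmented.append(counterpart)
    return augmented
```

Both the context and the response are rewritten, so a dialogue that mentions one group in the context and the other in the response yields a counterpart with the roles exchanged.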

Word Embedding Regularization
Although the above method can mitigate the biases in dialogue systems, in some cases, the learning algorithm is not allowed to access the training data, which makes this method impractical. It is important to develop an in-processing debiasing technique that reduces the biases during the training phase (Chen et al., 2017). Based on this consideration, we propose to introduce into the loss function a regularization term that decreases the distance between the embedding of a gender or race word and that of its counterpart. Suppose $L_{ori}$ is the original training loss function. We optimize the dialogue model by minimizing the following loss function:
$$L_{reg} = L_{ori} + k \sum_{(w_i, w_i') \in W} \lVert e_{w_i} - e_{w_i'} \rVert_2,$$
where $k$ is a hyperparameter, $W$ is the gender or race word list and $e_w$ is the embedding of word $w$. In this way, as the training process goes on, all the gender or race words and their counterparts will become closer in the embedding space. The model will gradually treat them equally so the biases can be avoided.
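The regularization term can be illustrated with a small sketch that works on plain Python lists. It assumes the distance between two embeddings is the Euclidean distance and that `embeddings` is a dictionary from words to vectors; both the names and these choices are assumptions for illustration, and a real training setup would compute the same quantity on the model's embedding tensor so that gradients flow through it.

```python
import math


def wer_penalty(embeddings, word_pairs, k=0.5):
    """k times the summed distance between each word and its counterpart."""
    penalty = 0.0
    for word, counterpart in word_pairs:
        if word in embeddings and counterpart in embeddings:
            # Euclidean distance between the two embedding vectors (assumed).
            penalty += math.dist(embeddings[word], embeddings[counterpart])
    return k * penalty


def regularized_loss(original_loss, embeddings, word_pairs, k=0.5):
    """Original training loss plus the word embedding regularization term."""
    return original_loss + wer_penalty(embeddings, word_pairs, k)
```

A larger `k` pulls the paired embeddings together more strongly, at the cost of weighting the original training objective less.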

Experiments and results
We conduct experiments to test the effectiveness of our proposed debiasing methods. We first train a CDA model and a WER model in the same setting as the original model and then conduct fairness tests on them. Specifically, for the CDA model, we obtain an augmented training data set that contains 4,197,883 single-turn dialogues from the original training set that contains 2,580,433 dialogues. For the WER model, we set the coefficient k to 0.5. The experimental results of the debiasing models are shown in Table 7. We can observe that first, in most cases, both debiasing models significantly reduce gender and race biases in terms of the various measurements. The differences between the two groups are controlled within a reasonable range and are not statistically significant anymore. Second, WER performs better than CDA in mitigating biases. However, a drawback of WER is that, after sufficient training with the regularization term, the dialogue model tends to generate similar responses to the two genders or races, which may degrade the diversity of the generated responses. This reminds us that there may exist a trade-off between the performance and the fairness of a model. It is important to find a balance according to the specific situation.

Related Work
Existing works attempt to address the issue of fairness in various machine learning tasks such as classification (Kamishima et al., 2012; Zafar et al., 2015), regression (Berk et al., 2017), graph embedding (Bose and Hamilton, 2019) and clustering (Backurs et al., 2019; Chen et al., 2019). Below, we briefly introduce related works that study fairness issues on NLP tasks.
Word Embedding. Word embeddings often exhibit stereotypical human biases present in text data, causing a serious risk of perpetuating problematic biases in important societal contexts. Popular state-of-the-art word embeddings regularly map men to working roles and women to traditional gender roles (Bolukbasi et al., 2016), which has led to methods for making the embeddings of gender-neutral words impartial. In the work (Bolukbasi et al., 2016), a 2-step method is proposed to debias word embeddings. The work (Zhao et al., 2018b) proposes to modify Glove embeddings by saving gender information in some dimensions of the word embeddings while keeping the other dimensions unrelated to gender.
Coreference Resolution. The work (Zhao et al., 2018a) introduces a benchmark called WinoBias to measure the gender bias in coreference resolution. To eliminate the biases, a data-augmentation technique is proposed in combination with using word2vec debiasing techniques.
Language Modeling. In the work (Bordia and Bowman, 2019), a measurement is introduced for measuring gender bias in a text generated from a language model that is trained on a text corpus along with measuring the bias in the training text itself. A regularization loss term is introduced to minimize the projection of embeddings in the gender subspace following a soft debiasing technique introduced in (Bolukbasi et al., 2016).
Machine Translation. In the work (Prates et al., 2018), it is shown that Google's translation system can suffer from gender bias. Sentences taken from the U.S. Bureau of Labor Statistics are rendered in a dozen gender-neutral languages, including Yoruba, Hungarian, and Chinese, and then translated into English; the translations show that Google Translate favors males for stereotypical fields such as STEM jobs. In the work (Bordia and Bowman, 2019), the authors use existing debiasing methods in the word embeddings to remove biases in machine translation models. These methods not only help them to mitigate the existing bias in their system, but also boost the performance of their system by one BLEU point.
Text/Dialogue Generation. In the work (Dinan et al., 2019a), the authors examine gender bias in both dialogue datasets and generative dialogue models. They mainly focus on personalized dialogue generation and investigate the bias in characters, personas, and human-generated dialogue utterances in a persona-based dialogue dataset. In the work (Dinan et al., 2020), the authors propose to measure the gender bias in NLP models in three dimensions and create classifiers to determine the gender inclination. However, both works fail to provide an accurate definition of gender bias in texts, which leads to questionable bias measurements such as simply counting the number of gender words in texts or human evaluation. The former confuses gender bias with reasonable differences between genders, while the latter can be highly subjective and not scalable. Moreover, based on the bias measurements in this work, there is a recent work (Liu et al., 2020a) introducing an adversarial learning framework Debiased-Chat to mitigate gender bias in neural dialogue models.

Conclusion
In this paper, we have investigated the fairness issues in dialogue systems. In particular, we define fairness in dialogue systems formally and further introduce four measurements to evaluate fairness of a dialogue system quantitatively, including diversity, politeness, sentiment, and attribute words. Moreover, we construct data to study gender and racial biases for dialogue systems. Then, we conduct detailed experiments on two types of dialogue models, i.e., generative models and retrieval-based models, to analyze the fairness issues in the dialogue systems. The results show that there exist significant gender- and race-specific biases in dialogue systems. We introduce two debiasing methods to mitigate the biases in dialogue systems. Experiments show that the proposed methods effectively reduce the biases and ensure fairness of dialogue systems.
IIS1955285. Zitao Liu is supported by the Beijing Nova Program (Z201100006820068) from Beijing Municipal Science & Technology Commission.