Mitigating Gender Bias for Neural Dialogue Generation with Adversarial Learning

Dialogue systems play an increasingly important role in various aspects of our daily life. It is evident from recent research that dialogue systems trained on human conversation data are biased. In particular, they can produce responses that reflect people's gender prejudice. Many debiasing methods have been developed for various natural language processing tasks, such as word embedding debiasing. However, they are not directly applicable to dialogue systems because they are likely to force dialogue models to generate similar responses for different genders. This greatly degrades the diversity of the generated responses and immensely hurts the performance of the dialogue models. In this paper, we propose a novel adversarial learning framework, Debiased-Chat, to train dialogue models that are free from gender bias while preserving their performance. Extensive experiments on two real-world conversation datasets show that our framework significantly reduces gender bias in dialogue models while maintaining response quality.


Introduction
The elimination of discrimination is an important issue that our modern society is facing. Machine learning algorithms, which learn from human behaviors, have been shown to inherit human prejudices (Mehrabi et al., 2019). A variety of AI applications have demonstrated common prejudices towards particular groups of people (Rodger and Pendharkar, 2004; Howard and Borenstein, 2018; Rose, 2010; Yao and Huang, 2017; Tolan et al., 2019). It is evident from recent research that learning-based dialogue systems also suffer from discrimination problems (Liu et al., 2019a; Dinan et al., 2019). Dialogue models show significant prejudices towards certain groups of people by producing biased responses to messages related to different genders (Liu et al., 2019a). A biased dialogue system produces improper speech, which can bring bad experiences to users or even cause negative social impacts (Wolf et al., 2017; Liu et al., 2019b). Thus, with the increasing demand for dialogue agents in our daily lives, it is highly desirable to take fairness issues into consideration when developing dialogue systems.

The gender bias in dialogues comes from different dimensions: the gender of the person that speakers are talking about (speaking-about), the gender of the speaker (speaking-as), and that of the addressee (speaking-to) (Dinan et al., 2020). In this work, we focus on mitigating the gender bias of dialogue systems in the speaking-about dimension. It is the most common form of gender bias in dialogues, existing under both the speaker-given dialogue scenario, where the personas of the speaker or the addressee are known (Li et al., 2016; Zhang et al., 2018), and the speaker-agnostic dialogue scenario, where the information of the speakers is unknown.

* The corresponding author: Zitao Liu.
Given messages with the same content for different genders, dialogue models could produce biased responses, which have been measured in terms of their politeness and sentiment, as well as the existence of biased words (Liu et al., 2019a). Table 1 shows one example from a generative dialogue model trained on the Twitter dialogue corpus. When we change the words in the messages from "he" to "she", the responses produced by the dialogue model are quite different. In particular, the dialogue model generates responses with negative sentiments for females.
Table 1: Responses of a generative dialogue model trained on the Twitter dialogue corpus to a parallel message pair.

Message: Really wishes he could take at least one step on this husker floor...
Response: I'm sure he's going to be a great guest.

Message: Really wishes she could take at least one step on this husker floor...
Response: I'm sure she's a little jealous.

There are debiasing methods in natural language processing such as data augmentation (Dinan et al., 2019) and word embedding regularization (Liu et al., 2019a). Directly applying these methods to dialogue models could encourage them to produce the same response for different genders. This strategy can lead to unreasonable responses such as "he gave birth to a baby" and also reduces the diversity of the generated responses. For different genders, the desired dialogue model should produce responses that are not only bias-free but also comprise reasonable gender features. In other words, we should build a fair dialogue model without sacrificing its performance.

To achieve this goal, we face three key challenges. First, dialogues contain various gender-related contents. In order to mitigate the bias, dialogue models must learn to distinguish biased contents from unbiased ones. This is nontrivial since bias can be expressed in many forms and follow complicated patterns. Second, even if the first challenge is addressed, it remains hard for dialogue models to eliminate biased contents in their responses. Third, while removing the gender bias in generated responses, we also have to keep the reasonable unbiased gender features in them to avoid homogeneous responses for both genders.
In this paper, we propose a novel framework, Debiased-Chat, to train bias-free generative dialogue models. We first introduce the concepts of unbiased and biased gender features in dialogues. The former is treated as the reasonable gender information that should be kept in responses, while the latter reflects gender bias and should be mitigated. Second, we propose a disentanglement model that learns to separate the unbiased gender features from the biased gender features of a gender-related utterance. Third, we propose an adversarial learning framework to train bias-free dialogue models that produce responses with unbiased gender features and without biased gender features. We empirically validate the effectiveness of our proposed framework by conducting experiments on two real-world dialogue datasets. Results demonstrate that our method significantly mitigates the gender bias in generative dialogue models while maintaining their ability to produce engaging and diverse responses with reasonable gender features.

The Proposed Framework
In this section, we detail the proposed framework. Note that in this work, we focus on the classical generative Seq2Seq dialogue model for single-turn dialogue generation, while we leave other settings such as the multi-turn case as future work. We first define two key concepts. We refer to the reasonable and fair gender features in a response as its unbiased gender features. They include gendered terms and words or phrases specifically used to describe one gender. For example, in the response "she is an actress and famous for her natural beauty", "actress" is an unbiased gender feature for females. We refer to the unreasonable and discriminatory gender features in a response as its biased gender features. According to the definition of bias in dialogue models in (Liu et al., 2019a), any offensive or sentimental expressions and biased words correlated with one gender are considered its biased gender features. For instance, given the same context with different genders as shown in Table 1, in the response to females, "I'm sure she's a little jealous", the word "jealous" is a biased gender feature under the context.

An Overview
With the aforementioned definitions, our proposed dialogue model aims to produce responses with unbiased gender features but free from biased gender features. Next, we give an overview of the proposed framework together with the design intuitions, which aim to address the challenges mentioned in the introduction. The first challenge is how to distinguish biased gender features from unbiased ones. Given that the forms of gender bias in natural language are complex, it is not feasible to manually design rules to recognize biased content in texts. To tackle this challenge, we adopt an automatic strategy following the idea of adversarial learning. We propose a disentanglement model (right of Figure 1) that learns to separate the unbiased gender features f^(u) and the semantic features f^(s) of a gender-related utterance. The semantic features include all information of the utterance except the unbiased gender features, i.e., the content information and possibly biased gender features. We collect a set of unbiased gendered utterances and train the disentanglement model with the objectives that the extracted unbiased gender features can be used by a discriminator to infer the gender of the utterance while the remaining semantic features cannot. Thus, all the information needed to infer the gender of the utterance comes from the unbiased gender features. With these objectives, the model learns to disentangle the unbiased gender features from the other features. When we apply the model to a biased utterance, it automatically extracts the unbiased gender features and leaves the biased ones in the remaining semantic features.
To address the second challenge (removing biased gender features from responses) and the third challenge (preserving unbiased gender features in responses), we propose our framework to train bias-free dialogue models (left of Figure 1). We adopt an adversarial learning scheme similar to that of the disentanglement model. Given a response from the dialogue model, its two disentangled feature vectors are fed into two discriminators D_1 and D_2, respectively, to predict the gender of the dialogue. For the dialogue model, the objective of adversarial training is to produce an unbiased response such that 1) its unbiased gender features can be used to correctly predict the gender of the dialogue by D_1; and 2) D_2 cannot distinguish the gender. The intuition behind this design is as follows. With the first objective, the model is encouraged to produce responses with distinctive unbiased gender features. Moreover, if the dialogue model were to produce biased responses to one gender, D_2 could easily learn to judge the gender from the co-occurrence of the biased gender features and that gender. With the second objective, we therefore eliminate responses with biased gender features. We detail the disentanglement model and the bias-free dialogue generation process in the following subsections.

Unbiased Gendered Utterance Corpus
Given the dialogue corpus D, we collect all the gender-related utterances from it. Each such utterance can be a message or a response, and contains at least one male word but no female word, or vice versa. Then, we filter out all utterances that could be biased. Following the bias measurements in (Liu et al., 2019a), we remove all the utterances which 1) are offensive, 2) show strong positive or negative sentiment polarity, or 3) contain career or family words. The remaining utterances form an Unbiased Gendered Utterance Corpus D^(u) = {(U_i, g_i)}, where U_i is the i-th utterance and g_i is its gender label. The corpus is used to train the disentanglement model.
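The corpus construction above can be sketched as follows. This is a minimal sketch: the gender word lists, the career/family word list, and the offensiveness/sentiment signals are illustrative assumptions, not the exact resources used in the paper.

```python
# Illustrative word lists (assumptions for this sketch).
MALE_WORDS = {"he", "him", "his", "man", "men", "boy"}
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "girl"}
CAREER_FAMILY_WORDS = {"programmer", "nurse", "homemaker", "mom", "dad"}

def gender_label(tokens):
    """Return 'male'/'female' if the utterance mentions exactly one
    gender, else None (utterance is dropped)."""
    has_m = any(t in MALE_WORDS for t in tokens)
    has_f = any(t in FEMALE_WORDS for t in tokens)
    if has_m and not has_f:
        return "male"
    if has_f and not has_m:
        return "female"
    return None

def is_unbiased(tokens, is_offensive, sentiment_polarity):
    """Apply the three filters from the text: drop offensive
    utterances, strongly sentimental ones, and those containing
    career/family words. The 0.5 threshold is an assumption."""
    if is_offensive or abs(sentiment_polarity) > 0.5:
        return False
    return not any(t in CAREER_FAMILY_WORDS for t in tokens)

def build_corpus(utterances):
    """utterances: iterable of (tokens, is_offensive, sentiment)."""
    corpus = []
    for tokens, off, sent in utterances:
        g = gender_label(tokens)
        if g is not None and is_unbiased(tokens, off, sent):
            corpus.append((tokens, g))
    return corpus
```

In practice the offensiveness flag and sentiment polarity would come from the classifiers used in (Liu et al., 2019a); here they are passed in directly.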

Model Design
The illustration of the disentanglement model is shown on the right of Figure 1.
Autoencoder. We adopt an autoencoder as the disentanglement model, in which both the encoder and the decoder are implemented as recurrent neural networks (RNNs) with gated recurrent unit (GRU) cells (Cho et al., 2014). The encoder learns to encode an utterance U into a latent vector h ∈ R^d. The latent vector h is then mapped into the space of unbiased gender features and the space of semantic features by two 1-layer feedforward networks, respectively, to get the unbiased gender features f^(u) and the semantic features f^(s). The concatenation of the unbiased gender and semantic features, f = [f^(u); f^(s)], is then fed into the decoder to reconstruct the original utterance U.
Discriminators. To disentangle the latent representation h into the unbiased gender features f^(u) and the semantic features f^(s), we take advantage of the idea of adversarial learning. We first train two discriminators, D_1^(det) and D_2^(det), to distinguish whether the utterance U is related to male or female based on the unbiased gender features f^(u) and the semantic features f^(s), respectively. The discriminators are implemented as 1-layer feedforward neural networks, which predict the probability distributions over the genders, p^(u) ∈ R^2 and p^(s) ∈ R^2, based on f^(u) and f^(s), respectively.
Adversarial Training. In the adversarial training process, the outputs of the discriminators are used as signals to train the disentanglement model so that it assigns the gender-related information to the unbiased gender features f^(u) while ensuring that the semantic features f^(s) do not include any gender information. Thus, we define two losses in terms of the discriminators D_1^(det) and D_2^(det):

L_{D_1^(det)} = -log p^(u)_g,    (1)

L_{D_2^(det)} = Σ_i p^(s)_i log p^(s)_i,    (2)

where g is the gender label of the utterance and p_i is the i-th element of p. L_{D_1^(det)} is the cross-entropy loss on p^(u); minimizing it encourages f^(u) to encode the gender information. L_{D_2^(det)} is the negative entropy of the predicted distribution p^(s); minimizing it pushes p^(s) toward an even distribution, so that D_2^(det) tends to make random predictions. To further ensure that only f^(s) encodes the content information of the utterance, following (John et al., 2018), we add two more discriminators, D_3^(det) and D_4^(det), and assign them to predict the bag-of-words (BoW) features of the utterance based on f^(u) and f^(s), respectively. Given an utterance, we first remove all stopwords and unbiased gender words in it. Then, its BoW feature is represented as a vector in R^|V| whose i-th element is the frequency of word w_i in the utterance divided by L, where L is the length of the utterance after removal. The discriminators D_3^(det) and D_4^(det) are also implemented as 1-layer feedforward neural networks to get the predicted BoW features p̂^(u) ∈ R^|V| and p̂^(s) ∈ R^|V| based on f^(u) and f^(s), respectively. Similar to Eqs. (1) and (2), we optimize the disentanglement model with two additional losses, L_{D_3^(det)} and L_{D_4^(det)}. We denote the reconstruction loss of the autoencoder as L_rec. The final objective function for optimizing the disentanglement model is

L^(det) = k_0 L_rec + k_1 L_{D_1^(det)} + k_2 L_{D_2^(det)} + k_3 L_{D_3^(det)} + k_4 L_{D_4^(det)},

where k_0, ..., k_4 are hyperparameters that adjust the contributions of the corresponding losses.
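The two adversarial losses and the combined objective can be sketched numerically, with plain Python lists standing in for tensors. This is a sketch of the loss arithmetic only, not of the networks themselves.

```python
import math

def cross_entropy(p, g):
    """Eq. (1): L = -log p_g. Minimizing this trains the model so the
    gender label g can be recovered from the unbiased gender
    features."""
    return -math.log(p[g])

def neg_entropy(p):
    """Eq. (2): sum_i p_i log p_i, the negative entropy of the
    prediction made from the semantic features. Minimizing it pushes
    the prediction toward a uniform distribution, i.e. the
    discriminator can only guess randomly, so f^(s) carries no
    gender signal."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def disentanglement_loss(l_rec, l_d1, l_d2, l_d3, l_d4,
                         k=(1, 10, 1, 1, 3)):
    """Weighted sum of the five losses; the default weights are the
    k_0, ..., k_4 values reported in the experimental settings."""
    return (k[0] * l_rec + k[1] * l_d1 + k[2] * l_d2
            + k[3] * l_d3 + k[4] * l_d4)
```

Note that the uniform distribution minimizes `neg_entropy`, which is why optimizing Eq. (2) drives D_2^(det) toward random predictions.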

Training Process
We train the discriminators and the autoencoder alternately for n_epoch epochs. On each batch of training data, we first update the discriminators D_1^(det), ..., D_4^(det) on their respective losses, and then update the disentanglement model DET on the overall loss L^(det).

Bias-free Dialogue Generation

Model Design
As shown on the left of Figure 1, the dialogue model is treated as the generator in adversarial learning. Given a message, it generates a response. The response is projected into its unbiased gender feature vector f^(u) and its semantic feature vector f^(s) through the disentanglement model. The two feature vectors are fed into two discriminators, D_1 and D_2, respectively, to predict the gender of the dialogue, where both D_1 and D_2 are implemented as 3-layer feedforward neural networks with the ReLU activation function. We train the dialogue model with two objectives: 1) D_1 successfully predicts the gender, and 2) D_2 fails to predict the gender correctly. Hence, we define two additional losses, L_{D_1} and L_{D_2}, in the same format as L_{D_1^(det)} and L_{D_2^(det)} (Eqs. (1) and (2)), respectively.

Training Process
The optimization process is detailed in Algorithm 1. We first pre-train the dialogue model G with the original MLE loss on the complete training set. Then, we train the dialogue model and the two discriminators alternately. In each loop, we first train the discriminator D_2 for D_steps steps (lines 2 to 7). At each step, we sample a batch of examples from the gendered dialogue corpus, where each message contains at least one male word but no female word, or vice versa, and each dialogue is assigned a gender label g_i. Given the message X_i, we sample a response Ŷ_i from G. We update D_2 by optimizing the cross-entropy (CE) loss that measures how well D_2 classifies the sampled response Ŷ_i as g_i. Then we update the dialogue model G along with D_1 (lines 8 to 14) by optimizing the compound loss

L_G = k_0 L_MLE + k_1 L_{D_1} + k_2 L_{D_2}.

To calculate the losses L_{D_1} and L_{D_2}, we sample a response Ŷ_i for the message X_i from the dialogue model G. However, the sampling operation is not differentiable, so we cannot back-propagate gradients to G. To address this problem, we take advantage of the Gumbel-Softmax trick (Jang et al., 2016; Kusner and Hernández-Lobato, 2016) to approximate the sampling operation.
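The Gumbel-Softmax trick can be sketched in plain Python as below; a real implementation would operate on tensors (e.g., over the full vocabulary at each decoding step) so that gradients flow back to the generator G.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """Draw one Gumbel-Softmax sample (Jang et al., 2016).

    Perturb the logits with Gumbel noise, then apply a
    temperature-scaled softmax. As tau -> 0 the sample approaches a
    one-hot vector (a discrete token), while the softmax keeps the
    operation differentiable with respect to the logits.
    """
    # Gumbel(0, 1) noise; small epsilons guard against log(0).
    noise = [-math.log(-math.log(random.random() + 1e-20) + 1e-20)
             for _ in logits]
    scaled = [(l + n) / tau for l, n in zip(logits, noise)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The output is a probability vector over the vocabulary that can multiply the embedding matrix in place of a hard token lookup.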
It has been pointed out that the teacher forcing strategy can effectively alleviate the instability problem in adversarial text generation (Li et al., 2017). We also need to maintain the performance of the dialogue model on gender-unrelated dialogues. Thus, at each loop we train the dialogue model G on the neutral dialogue corpus D^(n) by optimizing the MLE loss for G_teach_steps steps (lines 15 to 19). The neutral dialogue corpus is also a subset of the dialogue corpus D, containing gender-unrelated dialogues whose messages have no gender words. We repeat the training loop until the dialogue model passes the fairness test on the fairness validation corpus F, which is constructed following (Liu et al., 2019a).

Experiment
In this section, we validate the effectiveness of the proposed framework. We first introduce the datasets and then discuss the experiments for the disentanglement model and bias-free dialogue generation. Finally, we further demonstrate the framework via a case study.

Datasets
Twitter Conversation Dataset. The Twitter conversation dataset is a public human conversation dataset collected from the Twitter platform. The training, validation, and test sets contain 2,580,433, 10,405, and 10,405 single-turn dialogues, respectively.
Reddit Movie Dialogue Dataset. The Reddit movie dialogue dataset (Dodge et al., 2015) is a public dataset collected from the movie channel of the Reddit forum. The original dataset contains 2,255,240 single-turn dialogues. We remove all the dialogues whose messages or responses are longer than 50 words and all the dialogues with URLs. We finally keep 500,000 dialogues for training and 8,214 dialogues each for validation and testing.

In the autoencoder, both the encoder and decoder are 1-layer GRU networks with a hidden size of 1,000. The word embedding size is set to 300. The sizes of the unbiased gender features and the semantic features are set to 200 and 800, respectively. The vocabulary size is 30,000. We set k_0 = 1, k_1 = 10, k_2 = 1, k_3 = 1, and k_4 = 3. The unbiased gendered utterance corpus used to train the disentanglement model is constructed from the training set of each dialogue dataset, as described above. We obtain 288,255 and 57,598 unbiased gendered utterances for Twitter and Reddit, respectively. We split out 5,000 utterances for testing, and the rest are used for training. We train the disentanglement model for 20 epochs with a batch size of 32.

Experimental Results
We design an experiment to explore whether the disentanglement model successfully learns to separate the unbiased gender features from the semantic features. We train two linear classifiers with the same structure as the discriminators to classify the gender of an utterance based on the unbiased gender features and the semantic features, respectively. The classification accuracy on the test set is shown in Table 2. We find that the classifier based on the unbiased gender features achieves a very high accuracy of over 95%, while the performance of the classifier based on the semantic features is only slightly higher than random guessing. This indicates that gender-related information is effectively encoded into the unbiased gender features while being excluded from the semantic features. These observations suggest that our disentanglement model can successfully disentangle the gender features from the semantic features.
We randomly sample 400 male and 400 female utterances from the test set and pass them through the disentanglement model to obtain their unbiased gender features and semantic features. We reduce their dimensionality with t-distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008) and show the results in two plots. As shown in Figure 2, the unbiased gender features are clearly divided into two areas, while the semantic features are evenly mixed together. This further verifies that the disentanglement model indeed works as expected.

Baselines
We directly apply two existing debiasing methods to dialogue models as baselines.
Counterpart Data Augmentation (CDA). This method tries to mitigate the gender bias in dialogue models by augmenting the training data (Liu et al., 2019a; Dinan et al., 2019). For each message-response pair in the original training set that contains gender words, we replace all the gender words with their counterparts (e.g., he and she, man and woman) to obtain a parallel dialogue, which is added to the training set as augmented data.
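The CDA baseline can be sketched as follows; the counterpart word list here is a small illustrative subset of the full gender word pairs.

```python
# Illustrative gender word pairs (assumption for this sketch).
COUNTERPART = {"he": "she", "she": "he", "him": "her",
               "man": "woman", "woman": "man", "men": "women",
               "women": "men", "boy": "girl", "girl": "boy"}

def swap_gender(tokens):
    """Replace every gender word with its counterpart."""
    return [COUNTERPART.get(t, t) for t in tokens]

def augment(pairs):
    """For every message-response pair containing gender words, add
    its gender-swapped counterpart to the training data."""
    out = list(pairs)
    for msg, resp in pairs:
        if any(t in COUNTERPART for t in msg + resp):
            out.append((swap_gender(msg), swap_gender(resp)))
    return out
```

A full implementation must also handle ambiguous pronouns such as "her", which maps to "his" or "him" depending on its grammatical role; the sketch above ignores that case.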
Word Embedding Regularization (WER). In this method (Liu et al., 2019a), besides the original MLE loss, we train the dialogue model with an auxiliary regularization loss that reduces the difference between the embeddings of the gender words and those of their counterparts. We empirically set the weight of the regularization term to k = 0.25.

Experimental Settings
For the Seq2Seq dialogue models, the encoder and the decoder are implemented as 3-layer LSTM networks with a hidden size of 1,024. The word embedding size is set to 300, and the vocabulary size is 30,000. The original model is trained using the standard stochastic gradient descent (SGD) algorithm with a learning rate of 1.0. In the adversarial training process of Debiased-Chat, both the dialogue model and the discriminators are trained with the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001. The temperature τ for Gumbel-Softmax is initialized as 1.0 and divided by 1.1 every 200 iterations; it stops decreasing when τ < 0.3. Hyperparameters are empirically set as k_0 = k_1 = k_2 = 1, D_steps = 2, G_steps = 2, and G_teach_steps = 1 based on validation performance. All the models are trained on NVIDIA Tesla K80 GPUs.
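The temperature schedule described above can be written as a small function. The text is slightly ambiguous about the exact stopping behavior; this sketch assumes the temperature is simply clamped once it would drop below 0.3.

```python
def gumbel_temperature(iteration, tau0=1.0, decay=1.1,
                       every=200, floor=0.3):
    """Anneal the Gumbel-Softmax temperature: start at tau0, divide
    by `decay` every `every` iterations, and stop decreasing once the
    value falls below `floor` (clamped here as an assumption)."""
    tau = tau0 / (decay ** (iteration // every))
    return max(tau, floor)
```

Lower temperatures make the relaxed samples closer to discrete tokens, at the cost of higher-variance gradients, which is why the schedule starts soft and hardens gradually.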

Experimental Results
We first conduct a fairness test on the baselines and our model to compare their debiasing ability, and then compare the quality of the responses they generate in terms of relevance and diversity.

Fairness Evaluation. Following (Liu et al., 2019a), we formulate the fairness analysis as a hypothesis test problem. We test whether a dialogue model is fair for males and females in terms of various measurements: offense, sentiment, career words, and family words. We construct fairness test corpora containing 30,000 parallel message pairs each, as described in (Liu et al., 2019a), from the Twitter dataset and the Reddit dataset, respectively. Each parallel message pair consists of a male-related message and a female-related message. The two messages have the same content, and only the gender words in them differ.
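One standard way to run such a hypothesis test on a rate (e.g., the offense rate over responses to male- vs. female-related messages) is a two-proportion z-test, sketched below; the exact test statistic used in (Liu et al., 2019a) may differ.

```python
import math

def two_proportion_z(success_m, n_m, success_f, n_f):
    """Two-sided two-proportion z-test.

    success_m / n_m: e.g. number of offensive responses / total
    responses for male-related messages; likewise for female.
    Returns (z, p_value); p < 0.05 rejects the null hypothesis that
    the two rates are equal, i.e. the model is considered unfair
    on this measurement.
    """
    p1, p2 = success_m / n_m, success_f / n_f
    p = (success_m + success_f) / (n_m + n_f)  # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_m + 1 / n_f))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

Count-valued measurements (average career/family words per response) would instead use a test on means, such as a t-test.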
In Table 3, we report the results of the fairness test, where "Offense Rate" is the offense rate of the produced responses towards male- and female-related messages; "Senti.Pos/Neg" indicates the rate of responses with positive and negative sentiment; and "Career Word" and "Family Word" denote the average number of career and family words in one response. We also report the difference in the measurements between the two genders, as well as the p-value. We consider the dialogue model unfair to the two genders in terms of a measurement if p < 0.05. We make the following observations. First, the original model shows significant gender bias: female-related messages tend to receive more offensive responses, fewer positive responses, and more negative responses, while career words are more likely to appear in the context of males and family words in the context of females. Second, CDA mitigates the bias to some degree, but its performance is not stable; in some cases, the bias is even amplified. Third, WER seems to eliminate the bias completely, but in fact it generates almost identical responses to male- and female-related messages, which hurts response quality, as shown below. Finally, our proposed framework steadily reduces the gender bias of a dialogue model to a reasonable level.
Quality Evaluation. We then evaluate the quality of the responses generated by the original and debiased dialogue models in terms of relevance and diversity. We perform the evaluation on the test sets of the two dialogue datasets. For relevance, we report the BLEU score between generated responses and ground truths. For diversity, we report the distinct metric proposed in (Li et al., 2015). The results are shown in Table 4.
From the table, we observe that in terms of relevance, our model behaves comparably to the original model. This means that while our method reduces bias, it does not hurt response quality. Besides, since our model encourages the responses to male- and female-related messages to be reasonably different, it achieves better diversity than the original model and the baseline models.
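The distinct metric above can be sketched as the ratio of unique n-grams to total n-grams across the generated responses.

```python
def distinct_n(responses, n):
    """distinct-n (Li et al., 2015): number of unique n-grams divided
    by the total number of n-grams over all generated responses.
    Higher values indicate more diverse generations."""
    ngrams = []
    for tokens in responses:
        ngrams += [tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

For example, a model that emits the same response twice scores 0.5 on distinct-1, while fully novel responses score 1.0.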

Case Study
To further demonstrate the effectiveness of the proposed framework, we show one pair of parallel messages and their responses produced by various dialogue models in Table 5. In this case, the responses generated by the original model show bias. Among the debiased dialogue models, the CDA model generates responses with only the pronoun "he" changed to "she", and both responses are offensive. This shows that the CDA model mitigates bias crudely by producing responses with similar content. The WER model generates identical nonsense responses for the two messages. Our model generates responses that are free from bias and contain unbiased gender features. The male response is similar to the original one. The female response is not offensive and reflects the features of females: the word "dressing" is recognized by the disentanglement model as an unbiased gender feature of females and is encouraged to appear in the context of a female. This example demonstrates that our model increases the diversity of the responses for different genders while mitigating gender bias.

Related Work
The fairness problems in natural language processing have received increasing attention (Mehrabi et al., 2019). Word embeddings trained on large-scale real-world text data exhibit human biases; researchers find that in such embeddings, the word "man" is mapped to "programmer" while "woman" is mapped to "homemaker" (Bolukbasi et al., 2016). They also propose a 2-step method for debiasing word embeddings. Some works extend the research on bias in word embeddings to sentence embeddings. May et al. (2019) propose the Sentence Encoder Association Test (SEAT) based on the Word Embedding Association Test (WEAT) (Islam et al., 2016). They examine popular sentence encoding models, from CBoW, GPT, and ELMo to BERT, and show that various sentence encoders inherit human prejudices from their training data. For the task of coreference resolution, a benchmark named WinoBias is proposed in (Zhao et al., 2018) to measure gender bias; that work provides a debiasing method based on data augmentation. Bordia and Bowman (2019) first explore gender bias in language models. The authors propose a measurement to evaluate the bias in well-trained language models as well as in the training corpus.
They propose to add a regularization term to the loss function that minimizes the projection of word embeddings onto the gender subspace introduced in (Bolukbasi et al., 2016). They also point out that reducing gender bias may cause a decline in the performance of the language model in terms of perplexity. Prates et al. (2018) reveal that Google's machine translation system shows gender biases in the translations it produces in various languages. Existing debiasing methods for word embeddings have been adopted to mitigate the biases in machine translation systems. That work shows that while the embedding-based technique reduces the biases, it also improves the performance of the machine translation system by one BLEU score.
Dialogue systems have been shown to be sensitive to input messages (Niu and Bansal, 2018; Zhang et al., 2020; Xu et al., 2020). They can produce very different responses to messages with the same content but different demographic mentions, which may reflect the social bias of humans. Liu et al. (2019a) first study the bias in dialogue systems; they define measurements to evaluate the fairness of a dialogue model and show that significant gender and race bias exists in popular dialogue models. Dinan et al. (2019) analyze gender bias in persona-based dialogue models and propose a combined debiasing method. Since their debiasing method involves manual effort and is not easy to reproduce, we only compare our method with their objective data augmentation technique. While their work encourages dialogue models to produce responses whose gender is indistinguishable, our proposed model tries to produce responses whose gender can be told based on unbiased gender features instead of biased gender features.

Conclusion
In this work, we focus on the problem of mitigating gender bias in neural dialogue models. We propose an adversarial training framework Debiased-Chat to reduce the bias of a dialogue model during the training process. With the help of a disentanglement model, we design an adversarial learning framework that trains dialogue models to cleverly include unbiased gender features and exclude biased gender features in responses. Experiments on two human conversation datasets demonstrate that our model successfully mitigates gender bias in dialogue models and outperforms baselines by producing more engaging, diverse, and gender-specific responses. In the future, we will investigate debiasing retrieval-based dialogue models and more complicated pipeline-based dialogue systems.