Reducing Sentiment Bias in Language Models via Counterfactual Evaluation

Advances in language modeling architectures and the availability of large text corpora have driven progress in automatic text generation. While this results in models capable of generating coherent texts, it also prompts models to internalize social biases present in the training corpus. This paper aims to quantify and reduce a particular type of bias exhibited by language models: bias in the sentiment of generated text. Given a conditioning context (e.g., a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g., country names, occupations, genders) in the conditioning context using a form of counterfactual evaluation. We quantify sentiment bias by adopting individual and group fairness metrics from the fair machine learning literature, and demonstrate that large-scale models trained on two different corpora (news articles, and Wikipedia) exhibit considerable levels of bias. We then propose embedding and sentiment prediction-derived regularization on the language model’s latent representations. The regularizations improve fairness metrics while retaining comparable levels of perplexity and semantic similarity.


Introduction
Language modeling has advanced rapidly due to efficient model architectures (Vaswani et al., 2017;Dai et al., 2019) and the availability of large-scale datasets (Radford et al., 2019;Zellers et al., 2019). Large-scale language models have been applied not only for representation extraction to support downstream tasks (Peters et al., 2018;Devlin et al., 2019), but are also used for many natural language generation applications (Radford et al., 2019;Solaiman et al., 2019;Zellers et al., 2019;Zhang Figure 1: Conditioning text "My friend is a/an <occupation>, and we...", alongside various text continuations generated by a GPT-2 language model. On the right, the empirical sentiment distribution of the generated texts is shown: they reveal a systematic difference in sentiment depending on occupation ("baker'' or "accountant") in the conditioning context. et al., 2019). While the generation of coherent text is becoming increasingly practical, it also prompts models to internalize social biases present in the training corpus. Investigating the social impact and fairness of the text generated from language models has thus received considerable research interest (Solaiman et al., 2019;Wallace et al., 2019;Sheng et al., 2019).
In this paper, we aim to both quantify and reduce a language model's sentiment bias for a given sensitive attribute. Consider, for example, the conditioning text "My friend is a/an <occupation>, and we..." on the left of Figure 1. A 1.5B-parameter GPT-2 language model can generate a variety of plausible continuations to it, yet the empirical distribution of sentiment scores differs depending on the occupation chosen in the conditioning context. When generating 1,000 continuations for both "accountant" and "baker", and then measuring the sentiment scores of the resulting sentences using the Google Cloud sentiment API, a systematic difference is revealed: the GPT-2 model tends to gen-erate continuations with more positive sentiment for "baker", and more negative sentiment with "accountant" as the occupation. When systematically evaluating this phenomenon by manipulating different sensitive attributes values (e.g., country names, occupations, or person names) in the conditioning context -that is, performing counterfactual evaluation -we find that sentiment scores for the generated texts can vary substantially, suggesting the existence of sentiment bias. Such a sentiment bias can pose a concern for using the text generated by language models in downstream applications (e.g., dialogue agents (Zhang et al., 2019)) from a fairness perspective.
To quantify sentiment bias, we propose the use of individual and group fairness metrics from the fair machine learning literature (Dwork et al., 2012;Jiang et al., 2019;Hardt et al., 2016). We furthermore propose a general framework to reduce sentiment bias given a fairness specification based on sensitive attributes (e.g., fairness w.r.t. a predefined set of occupation names). Using this framework, we propose embedding and sentiment predictionderived regularization on the language model's latent representations. Experiments demonstrate that both proposed methods reduce sentiment bias while retaining a comparable level of perplexity and semantic similarity, and show a trade-off between fairness and semantic relevance.
While specifying concretely what optimal model fairness behavior should be is difficult -it might be defined by law or regulators -we provide a general framework to address given fairness specifications on sensitive attributes. Our main contributions are: • We demonstrate the existence of systematic counterfactual sentiment bias in texts generated by large-scale language models ( §3).
• We propose two novel metrics: individual and group fairness metrics to quantify counterfactual sentiment bias in language generation ( §3).
• To the best of our knowledge, this paper is the first to introduce a general framework to reduce bias under a specification measure (e.g., sentiment) for texts generated by language models given sensitive attributes. While we focus on sentiment biases on a few common sensitive attributes (country, occupation and name), the framework can be generalized to other specifications ( §4).
• We evaluate the proposed methods using both automatic metrics and human evaluations of sentiment and semantic relevance, and find a strong correlation between automatic metrics and human evaluations ( §5).

Background & Related Work
Bias in natural language processing systems. Besides learning to favor the language of the authors' demographic group (Hovy and Søgaard, 2015), NLP models can pick up on a variety of cultural associations and undesirable social biases (Caliskan et al., 2017). Systematic imbalances were observed across NLP tasks, such as gender bias in coreference resolution (Zhao et al., 2018;Rudinger et al., 2018), visual semantic role labeling (Zhao et al., 2017), image captioning (Hendricks et al., 2018), and demographic biases in language generation (Sheng et al., 2019), text classification (Dixon et al., 2018;Garg et al., 2019). Concretely in sentiment analysis, Kiritchenko and Mohammad (2018) found systematic biases with respect to race and gender across more than 200 systems.
Mitigating bias in language models. Rather than debiasing word embeddings, Lu et al. (2018) proposed counterfactual data augmentation as a remedy to occupation-specific gender biases, and found that it can much better retain model performance than debiasing word embeddings, especially in language modeling. Zhao et al. (2019) and Basta et al. (2019) demonstrated gender bias in pretrained language modeling representations (ELMo), which translates into downstream tasks, but did not consider the language generated by the ELMo language model. Bordia and Bowman (2019), as well as Qian et al. (2019) identified biases in a language modeling context and propose regularization strategies of generating certain words (e.g., "doctor") with differently gendered inputs. In contrast to these prior works on mitigating gender biases of language models based on the probabilities of generating certain words (such as occupation ratios), we probe texts generated by language models using a sentiment analysis system, similar to Sheng et al. (2019). We further propose a general framework to mitigate bias for a given specification (e.g., fairness w.r.t. predefined country names, occupations, gendered names) under a specification measure (e.g., sentiment, regard, etc.). Prior work mostly considers comparatively small language modeling training sets. In contrast, we investigate bias in Transformer-based models with a similar number of parameters (708 million parameters) to GPT-2 (Solaiman et al., 2019) trained on English news articles from WMT-19 (40GB of text) and WikiText-103 (Merity et al., 2016).
Fairness. Popular statistical fairness criteria often aim at achieving individual fairness (Dwork et al., 2012) or group fairness (Hardt et al., 2016) goals. In recent years, causal inference tools are also used in fairness research to extend beyond statistical fairness criteria making use of causal graphs. Similar to individual fairness, which requires similar individuals to be treated similarly (Dwork et al., 2012), counterfactual fairness requires the same model predictions before and after intervention on sensitive attributes in data-generating causal graphs (Kusner et al., 2017;Kilbertus et al., 2017;Chiappa, 2019;Chiappa and Isaac, 2019).
In our problem setting, we deviate from the counterfactual fairness works above by considering counterfactual fairness (Garg et al., 2019) based on a simple causal graph representing the language model instead of the data-generating process. We aim towards counterfactual fairness by debiasing the latent representation of inputs in the language models, contributing to a family of methods to learn fair representations (Beutel et al., 2017;Zemel et al., 2013;Creager et al., 2019;Edwards and Storkey, 2016;Louizos et al., 2016) and enforcing independence between sensitive attributes and prediction outputs (Calders et al., 2009;Zhang et al., 2018;Jiang et al., 2019;Chiappa et al., 2020).

Counterfactual Evaluation of Sentiment Bias
Fairness specification. Our goal is to reduce the counterfactual sentiment bias in a language model, given a fairness specification. In our specification, we consider a set of sensitive attribute values (e.g., country names, occupations, and person names) of a sensitive attribute (e.g., Country, Occupation, Name) that we want generated texts to be fair to under counterfactual evaluation. Formally, considering for example the sensitive attribute Gender, we use A = {female, male} to denote the set of values considered, and use A = a to denote a random variable A that takes the sensitive attribute value a ∈ A. For each input sequence x containing sensitive tokens φ(a) (which are given in the specification, e.g., φ(a)={he, his, him, husband, Paul} for a = male), we choose another valueã of the sensitive attribute from the set A \ {a}, and define the counterfactual inputx = cf(x, a,ã) by replacing all occurrences of each sensitive token in φ(a) with the corresponding token in φ(ã), and leaving all other non-sensitive tokens of x unchanged. Given a predefined sentiment classifier f s with sentiment outputs in [0, 1], and a pretrained language model LM , so that the random variable LM (x) is a sentence sampled from the language model conditioned on x, we define the random variable S(x) = f s (LM (x)) to be the sentiment score in [0, 1] of the generated sentence, and denote its distribution by P S (x). Next, for counterfactual evaluation, we measure the difference between P S (x) and P S (x) as follows. When quantifying the difference between two output distributions for a binary classification problem -such as sentiment prediction -we typically consider predictions formulated asŷ = 1(S > τ ), given a decision threshold τ . One fundamental fairness concept is "demographic parity" for binary classification problems, which requires equal positive classification rates across subgroups, i.e., p(ŷ = 1 | A = a) = p(ŷ = 1 | A =ã) for any sensitive attribute values a,ã ∈ A. We can measure deviation from it, i.e. "demographic disparity" using the differences between the subgroup positive rates: Dwork et al. (2012)). However, often we do not want our fairness goal to be dependent on a predetermined decision threshold τ , since τ may be user-defined or simply not known at training time. This consideration leads us to match output distributions, which is called "Strong Demographic Parity" (Jiang et al., 2019). Concretely applied in our LM context, these distributions are P S (x|A = a) and P S (x|A =ã).
Extending this definition to measure unfairness between counterfactual pairs of subgroups, demographic disparity is the difference between positive sentiment rates of S(x) and S(x): |p(S(x) > τ )−p(S(x) > τ )|. We can then measure the deviation by computing the statistical disparity averaged over uniformly random choices of τ ∈ [0, 1], that is, E τ ∼U [0,1] |p(S(x) > τ ) − p(S(x) > τ )| where U denotes the random uniform distribution. This quantity is equal to the Wasserstein-1 distance between P S (x) and P S (x) (Jiang et al., 2019):  Figure 2: Illustration of the Wasserstein-1 distancebased fairness metrics on two Gaussian distributions truncated to [0,1], simulating sentiment scores. For comparison, the Wasserstein-1 distance for the two sentiment distributions in Figure 1 is 0.13. Sentiment bias by counterfactual evaluation, i.e., counterfactual sentiment bias, is then the Wasserstein-1 distance between output sentiment distributions P S of the original input x and its counterfactualx. Thus, extending Garg et al. (2019), we define a model to be counterfactually fair for sentiment if W 1 (P S (x), P S (cf(x, a,ã))) < ( 2) for each sensitive attribute value a ∈ A,ã ∈ A \ {a}, and a chosen threshold > 0. This fairness formulation also expresses individual fairness which requires similar individuals to be treated similarly (Dwork et al., 2012), where similar individuals share similar non-sensitive words in a sentence. Note that using Wasserstein-1 distance to compare two distributions does not require assumptions on their shape (e.g., symmetry).
Fairness evaluation. For each sensitive attribute, we measure the individual fairness and group fairness metrics from distributions of sentiment scores P S on the evaluation set in the following ways. Individual Fairness Metric. Based on the fairness property of the Wasserstein-1 distance (Eq. 1), we compute the Average Individual Fairness by averaging the Wasserstein-1 distance between the sentiment score distribution of every evaluation sentence P S (x) and each of its counterfactual sentence P S (x) across all M templates. 1 Formally, we define individual fairness metric (denoted by I.F.) as: where the inner sum is over all |A|(|A|−1) 2 unordered pairs of distinct a,ã ∈ A, and a,ã are values of the sensitive attribute in x m andx m respectively.
Group Fairness Metric. This metric measures fairness for particular subgroups. Concretely, the evaluation sentences are separated into |A| = K disjoint subgroups, assigning a sentence to a subgroup a if it contains sensitive tokens from φ(a). Taking for example the sensitive attribute Name and selecting A = {male, female}, we have K = 2, and φ(male) = {Jake, Scott, Jacob, . . .} for a = male. 2 For each subgroup a ∈ A, we then measure the Wasserstein-1 distance between the sentiment distributions of all generated sentences of inputs from this subgroup, denoted by P a S , and that over the entire evaluation set, denoted by P * S . We report the average of all these subgroup Wasserstein-1 distances as the Average Group Fairness metric, denoted by G.F.:

Language Models with Fair Sentiment Distribution
In this section, we introduce two approaches for reducing counterfactual sentiment bias in language models, which will be subsequently evaluated with the above described fairness metrics. Given an input prefix x 1:i with i tokens, x 1:i = (x 1 , · · · , x i ), where the last token x i ∈ φ(a) is associated with a subgroup with value a of the sensitive attribute, we construct a perturbed prefix by replacing x i with a tokenx i ∈ φ(ã) from a different subgroupã, where fairness between the two subgroups should be maintained. We obtain a perturbed prefixx 1:i = (x 1:i−1 ,x i ).
To train the language model towards reducing counterfactual sentiment bias, we want to ensure that the language model produces similar sentiment distributions for the two prefixes. Specifically, we would like the Wasserstein-1 distance between the sentiment distributions of generated sentences, P S (x 1:i ) and P S (x 1:i ), to be small, as shown in Eq. 2. But in practice, it is prohibitively expensive to sample a distribution of generated sequences for every x 1:i andx 1:i during training. Instead, we use hidden features from the language model as a proxy to represent the distribution of future generated sequences, since p(x i+1 , x i+2 , · · · |x 1:i ) and p(x i+1 , x i+2 , · · · |x 1:i ) depend on the hidden states of the language model conditioned on x 1:i andx 1:i , respectively.
Concretely, we explore two approaches: Fairness through embedding regularization and Fairness through sentiment regularization, which exploit the hidden states of the language model. Given an L-layer transformer based language model with an input x 1:i , we let h(x 1:i ) = h (1) (x 1:i ), · · · , h (L) (x 1:i ) denote the hidden features (or contextual embeddings) obtained by its hidden layers.
Fairness through embedding regularization. In this approach, we desire that the embeddings h (j) (x 1:i ) and h (j) (x 1:i ) are close, since the joint distributions p(x i+1 , x i+2 , · · · |x 1:i ) and p(x i+1 , x i+2 , · · · |x 1:i ) are determined by these embeddings. We call it the "embedding regularization" approach, and define the fairness loss as a distance between the embeddings, denoted as d(h(x 1:i ), h(x 1:i )). We use the cosine distance: whereh(x) is set as the average of the last two embedding vectors h (L−1) (x) and h (L) (x) based on the following two reasons: First, we want to capture high-level semantics (e.g., sentiments) and embedding in later layers represents higher level semantics (Tenney et al., 2019).
Second, we find that averaging too many layers can make the difference betweenh(x 1:i ) andh(x 1:i ) very small, reducing the effectiveness of regularization. An advantage of this method is that it can directly be applied to fairness specifications beyond sentiment, as it encourages p(x i+1 , x i+2 , · · · |x 1:i ) and p(x i+1 , x i+2 , · · · |x 1:i ) to be close regardless of the specification measure (e.g., sentiment).
Since the embedding regularization method enforces the model's predictions to be similar for the original input x 1:i and the perturbed inputx 1:i without specification measure information, a potential drawback of this method is that the regularization can be too strong. As we require the hidden representations (and thus the joint probabilities) to be as close as possible, this can lead to the model learning to ignore the sensitive tokens, and thus generally a reduced dependence on them, as shown in Appendix C.6. Despite being completely fair in this extreme case, model performance may suffer since the generated texts should ideally be contextually conditioned on x i orx i .

Fairness through sentiment regularization.
To overcome the above-mentioned drawback, we propose an alternative method for eliminating sentiment bias using a sentiment classifier. Instead of measuring d(h(x 1:i ), h(x 1:i )) directly, we first apply a sentiment classifier f s h to both h(x 1:i ) and h(x 1:i ), and measure d(f s h (h(x 1:i )), f s h (h(x 1:i ))) instead. Note that the output of f s h can be multi-dimensional (e.g., a hidden layer in the sentiment classifier), and we can again measure the distance via cosine similarity. Applying the classifier f s h can be seen as a projection from h(x) to a subspace that ideally only contains sentiment-related information. If such a perfect projection exists, we can regularize the sentiment difference between the two inputs without losing other information of the sensitive tokens. On the one hand, this classifier-based sentiment regularization approach avoids the strong regularization of enforcing embedding similarity. On the other hand, the effectiveness of this method is correlated with the quality of the sentiment classifier (or sentiment "projection"). 3 The detailed implementation of f s h is introduced in Appendix B. This method can be extended to specifications with other specification measures beyond sentiment by using a corresponding classifier f s h .
Implementation: Three-step curriculum training. We use a three-step curriculum training schema. First, we train a language model using a regular cross-entropy loss for predicting the next token given all the previous tokens, as done in a typical language model training setting; a good validation perplexity ensures a relatively good hidden feature space has been learned. Second, using this language model, we train a sentiment classifier f s h (e.g., a simple multilayer perceptron (MLP)) using the extracted features from the language model. Since sentiment labels are generally unavailable for a large-scale corpus, we label the training data with the Google Cloud sentiment API 4 and train a sentiment classifier on the data with high magnitude. Third, with the fixed f s h from the previous step, we continue training on the subset of the original language model training set that contains any of the sensitive tokens, with an additional fairness loss L fairness based on our "embedding regularization"  or "sentiment regularization" methods with a regularization parameter λ. Meanwhile the language model is also trained on the regular cross-entropy loss (L LM ) on predicting the next token of the unperturbed input x. Concretely, the loss function for an input sequence x during the third step is: We refer to this third step as the "debiasing step", as illustrated in Figure 3. Note that we do not use any template at any step of training.

Experiments
We now evaluate our proposed sentiment regularization and embedding regularization methods via both automatic scores and human evaluations.

Training details
Model and datasets. We train two Trans-formerXL (Dai et al., 2019) language models similar in scale to GPT-2 (Radford et al., 2019) on a medium-scale corpus of Wikipedia articles (i.e., WikiText-103) and a large-scale corpus of English news articles from the WMT-19 document-level translation task . 5 We present dataset statistics, model architectures, and training details in Appendix B.

Fairness Specifications
Sensitive attributes and subgroups. We consider three common sensitive attributes (Country, 5 http://data.statmt.org/news-crawl/ Occupation, and Name) to measure the counterfactual sentiment bias in language models. Country contains 10 country names and Occupation includes 29 common occupations. For Name, we have 17 female and 17 male common names. We list all sensitive attribute values used in our experiments in Appendix A. To compute the group fairness metric, we treat each country name and each occupation as its own subgroup. For Name, we consider all female (male) names as one subgroup.
Sentence templates. For each sensitive attribute, we design a set of M = 10 templates to evaluate counterfactual sentiment bias. Each m-th template is a sentence prefix with length i m , m = 1, . . . , M , containing a placeholder that will be replaced by a sensitive token in φ(a) for each sensitive attribute value a ∈ A. In other words, for each template we complete it by inputting the appropriate sensitive token for every a ∈ A, forming a prefix x 1:im which is used as input to the language model to condition its generation on. We sample 1000 sentences conditioned on each input prefix, and we apply an external sentiment classifier f s on the generated sentences. All templates are described in Appendix A.
Employing specific templates for model evaluation is a commonly used practice (Zhao et al., 2018;Qian et al., 2019;Sheng et al., 2019), but we acknowledge that they can lack context-sensitivity, and that such evaluation is necessarily limited and not comprehensive. Indeed, we see the advancement of model evaluation beyond specific templates as an important open research problem. Note that during the training process (see Figure 3), we do not add any of the templates to the training set; it is thus unlikely that our models overfit to them. Importantly, the templates are used during evaluation only and our models need to generalize to the templates to be effective.

Evaluation Metrics
Sentiment analysis and fairness metrics. Calculating the individual fairness (I.F.) and group fairness (G.F.) scores using Eq. 3 and Eq. 4 requires sentiment scores from a sentiment classifier f s . We evaluate the generated sentences using three sentiment classifiers: i) the Google Cloud sentiment API ii) a BERT (Devlin et al., 2019)-based sentiment classifier fine-tuned on the SST dataset (Socher et al., 2013) resulting in 92.7% validation accuracy, and iii) a simple opinion-word-based sentiment classifier, which counts the number of positive opinion words p and the number of negative opinion words n (Hu and Liu, 2004) and derives its sentiment score as p/(p + n), and 0.5 if no opinion words exist. We include this simple classifier as the Google Cloud sentiment API and the BERT-based classifier may themselves contain bias, which has been shown for many sentiment analysis systems (Kiritchenko and Mohammad, 2018). The opinion-word-based method, while being less accurate (69.6% accuracy on the SST validation set), is less prone to giving biased judgments, as it does not contain sensitive tokens or learned associations: it only relies on opinion words. Furthermore, since we also use the Google Cloud sentiment API to create the sentiment labels of the training data for learning f s h , the BERT-based and opinion-wordbased sentiment classifiers provide additional measures of sentiment, helping to avoid findings specific to one sentiment classification system in particular. We also conduct a human evaluation on the correlation between automatic sentiment scores and human judgments (see §5.5).
Language model performance One special case of a fair language model is to generate the same continuations regardless of the sensitive attribute tokens or prefixes (e.g., Appendix C.6). However this deteriorates the original language model's performance, and we expect the model to still capture semantics related to the given sensitive tokens. Thus, in addition to the fairness metrics, it is important to examine the performance of language models. Here, we evaluate perplexity and semantic similarity for assessing language model performance and generation relevance.
Perplexity (PPL) and subset perplexity (PPL s ). We report the perplexity (PPL) on the whole test set of WMT-19/WikiText-103, and the perplexity on a subset of the test set that includes articles with at least one sensitive token (PPL s ). The perplexity on the whole test set reflects the language model's overall performance. Since the sensitive tokens only exist in a small fraction of test data, the subset perplexity PPL s examines the language model performance specifically in contexts containing sensitive tokens. 6 Semantic Similarity ("S.S." and "S.S. c "). We compute the cosine similarity between the embedding of both the prefix and the generated continuations using the universal sentence encoder (Cer et al., 2018). A generated continuation is considered semantically similar if the cosine similarity is above a given threshold (set to 0.4; see Appendix C.7 for further details). The fraction of generated continuations with above-threshold similarity among all generated continuations then defines the semantic similarity metric (denoted as "S.S."). We report this S.S. as a proxy for whether the generated sentences capture the original semantics. In addition, we report the fraction of generated continuations mentioning the sensitive attribute tokens as a second proxy for semantic relevance (denoted as "S.S. c "). We also conduct a human evaluation of semantic similarity, and find a strong correlation between semantic relevance and human judgments (see §5.5).

Evaluation Results
Fairness Improvements. In Figure 4, we report the fairness metrics of the sensitive attribute Occupation for models trained on the WMT-19 and WikiText-103 datasets. We evaluate the individual fairness and group fairness metrics using a set of sentences generated from the templates and prefixes given in Appendix A. Importantly, during training we never explicitly train the model on these templates. The baseline model represents the model after the first step of the curriculum training, before any debiasing steps are performed. Each fairness metric is evaluated using three different sentiment classifiers: the BERTbased and opinion-word-based classifier in   and sentiment-regularization methods, we report the performance of two methods with different regularization parameters for the fairness loss. Overall, we observe that both proposed approaches achieve reduced bias in both individual fairness and group fairness metrics compared to the baseline model. A larger regularization parameter λ typically reduces the bias further. The results of sensitive attributes Country and Name can be found in Appendices C.2 and C.3, and the overall findings are similar to the sensitive attribute Occupation discussed here.
Trade-off between generation quality and fairness. In Table 1, we present the perplexity 7 and semantic similarity of models in Figure 4. Overall, we observe a trade-off between fairness and semantic similarity.
To further illustrate the trade-off between fairness and relevance of generated texts, in Figure 6 we show both semantic similarity (S.S.) and individual fairness scores (I.F.) under different regularization strengths for WMT-19 models in sensitive attributes Country, Occupation, and Name. We can observe that the sentiment regularization based models achieve higher semantic similarity scores than embedding regularization based models at a similar level of individual fairness score. On the other hand, with similar semantic similarity scores, the sentiment regularization based models achieve 7 Since we do not further train our baseline model with the additional epochs of the debiasing step, both PPL and PPL s can sometimes slightly improve, while improving fairness measures. better individual fairness scores than embedding regularization based models. Both proposed approaches improve the individual fairness scores significantly compared to the baseline models. The sentiment regularization based models further improve the individual fairness score by a large margin while maintaining similar semantic similarity.

Human Evaluation
We perform a human evaluation to justify the use of automatic measurements of both semantic relevance and sentiment, and the effectiveness of the proposed fairness scores. We have 19 human annotators, each annotator labels 50-100 sentences, and each sentence is rated by 2 human annotators. We measure the Spearman's correlations with human predictions in three settings, demonstrating that the automatic metrics are positively correlated with human judgment. (c) WMT-19 Name Figure 6: Trade-off between I.F. and S.S. using a BERT-based sentiment classifier. A lower I.F. is better (note that the y-axis is reversed); a higher S.S. is better. Each point represents a model trained using a certain λ. Overall, both embedding and sentiment regularization help reduce I.F., and sentiment regularization works better than embedding regularization.
ate 100 sentences randomly generated by our language model. The human annotation is then compared against the Google Cloud sentiment API, BERT-based sentiment classifier, and the opinionword-based classifier. The Spearman's correlations with human predictions are 0.75, 0.79, and 0.67, respectively. Semantic Similarity: we present a sensitive token, the associated prefix, and the 100 generated sentences from the language models to human annotators. We ask the annotators to label the relevance, and then compare human annotations with "S.S." and "S.S. c " scores; the Spearman's correlations with human evaluation are 0.72 and 0.63, respectively. Fairness Scores: We compute the I.F. score using sentiment scores from human evaluation for baseline and sentiment-regularization models in the following two settings. Firstly, we randomly sample 50 sentences from each of the prefixes "You are a designer, and you" and "You are an accountant, and you". The I.F. scores are 0.333 and 0.056 for baseline and sentiment-regularization models, respectively. Secondly, we use instead the prefixes "Libya is" and "Iceland is", again sampling 50 sentences from each. The I.F. score is reduced from 0.291 (baseline) to 0.155 (sentimentregularization). Both evaluations demonstrate that our proposed method does indeed reduce sentiment bias -also under human evaluation. The annotation instructions and details are shown in Appendix D.

Conclusion
As large-scale language models are increasingly deployed for real-world applications, developing methods for assessing and mitigating bias with respect to sensitive attributes is an important area of inquiry to enable pro-social outcomes. In this paper, we have studied counterfactual sentiment bias in texts generated by large-scale language models.
We have quantified the presence of sentiment bias using our proposed novel fairness metrics based on Wasserstein distance, and demonstrated two flexible methods to reduce counterfactual sentiment bias, while maintaining similar perplexity and generation semantics. For future work, the proposed framework could be extended to study counterfactual biases given other specifications (e.g., religion, ethnicity, age, or multiple-attribute crosssubgroups) that require fairness guarantees, and could be used with other specification measures beyond sentiment.

A Sensitive Attributes and Templates in Fairness Specification
In our paper, we consider the sentiment bias of three sensitive attributes, Country, Name, and Occupation, which are detailed below.
Country. We design a list of country names by firstly evaluating the baseline model using a large list of country names, and selecting the ones that tend to generate the most positive or most negative contexts.
1. I am from <Country>. Starting next week, I will be 2. My wife just got an exciting new job in <Country>.
Starting next week, she will be 3. My husband just got an exciting new job in <Country>.
Starting next week, I will be 2. My wife just got a new job as a/an <Occupation>. tarting next week, she will be 3. My husband just got a new job as a/an <Occupation>.

B Additional Experimental Details
We provide additional experimental details for training and evaluating the models in this section.
Dataset. The WikiText-103 dataset (Merity et al., 2016) consists of 28,591 articles and over 100 million tokens extracted from high quality Wikipedia articles. We use 28,475 articles for training, 60 articles for validation, and 60 articles for testing. 635,198 English news articles; we take the last 10,000 for evaluation with 1,000 for validation and the final 9,000 articles as a test set. Language model training (step 1 of curriculum training). For WMT-19, we train our model on 128 Google Cloud TPUv3 cores using the Adam optimizer with a learning rate of 2.5 × 10 −4 , a batch size of 256 and a total of 5 × 10 5 training steps; for WikiText-103, we train our model on 128 Google Cloud TPUv3 cores using the Adam optimizer with a learning rate of 2.5×10 −4 , a batch size of 512, and a total of 2.5 × 10 5 training steps. For both datasets, we use a sequence length of 512 per batch, and we keep the states (embeddings) for the latest 512 tokens in the transformer-based language models. Sentiment projection training (step 2 of curriculum training). We train a 3-layer MLP network with a hidden layer size 128 as the sentiment classifier f s h for the sentiment projection. To train the sentiment classifier, we create a training set by selecting a subset of the WMT-19 and WikiText-103 training set that are with absolute sentiment scores greater than 0.7 using the Google Cloud sentiment API, which provides sentiment scores between -1 and 1. There are 28,957,245 sentences for WMT-19 and 369,594 sentences for WikiText-103. Note we train the sentiment classifier on the positive and negative sentiment classification task only, since we empirically found that training only on positive and negative sentiment data works better than training also with neutral sentiment data. We train the model on a single NVIDIA V100 GPU, and the training process takes around 14-21 hrs. The accuracy of the sentiment classifier is 98.8% and 98.7% for WikiText-103 and WMT-19, respectively, on the subset of the validation set selected using the same procedure as the training set.
Language model debiasing (step 3 of curriculum training). Since the language model has achieved good validation perplexity in step 1, we decrease the learning rate and use a smaller number of training steps in this step. For both datasets, we reduce the learning rate to 2.5 × 10 −5 ; we train WMT-19 for 5 × 10 4 steps, and train WikiText103 for 2.5 × 10 4 steps for debiasing. For this step, we only use 16 Google Cloud TPUv3 cores and reduce the batch size to 16 and 32 for WMT-19 and  Table 5: Perplexity and semantic similarity scores of WMT19 and WikiText-103 models for the Country attribute. A lower perplexity is better; higher semantic similarity scores (S.S. and S.S. c ) are better.
WikiText-103, respectively. Due to the decrease of step size in this step, we find that sometimes language model perplexity improves after step 3, despite adding the additional fairness loss. The training time of this step is between 3-15 hrs, depending on the amount of data that contains any of the sensitive tokens. Note our proposed approach only requires an additional sentiment projection from hidden states and minimizing the regularization loss, which is scalable to large language models.
Sample generation. Using the sensitive attributes and templates in Appendix A, we sample 1,000 sentences per template for a given sensitive attribute value. We have 10 templates per sensitive attribute. In each sensitive attribute, we have tens of sensitive tokens. Throughout the sampling experiments, we sample sentences with a maximum of 50 tokens. We sample with a temperature of 1.0.

C Additional Experimental Results
C.1 Results on the Occupation attribute with the Google Cloud sentiment API In Section 5, we present the results with the BERTbased and the opinion-word-based sentiment classifier. In Figure 7, we present individual fairness scores and group fairness scores under the same setting of Occupation attributes on WMT-19 and WikiText-103 datasets using the sentiment scores from Google Cloud sentiment API. We find that the trends are similar as observed in Section 5, where our two proposed methods can effectively improve fairness metrics.

C.2 Results on the Country attribute
In Figures 8 and 9 we report the individual fairness and group fairness scores for the WMT-19 models trained using our proposed embedding regulariza-          Note that although each classifier produces sentiment scores in different scales and thus the fairness scores are different across sentiment classifiers, we can observe the overall trends: after our debiasing training steps, the models have significantly better (lower) fairness scores than the baseline, and fairness improves when a larger regularization parameter is used.
In Table 5, we show the perplexity and semantic similarity scores (S.S. and S.S. c ). Perplexity on the test set (PPL) and the subset of the test set that contains sensitive tokens (PPL s ) remain almost unchanged, however the semantic similarities between the sensitive token and the generated texts can be decreased when the regularization parameter is too large. The observations are similar to the ones reported for the Occupation attribute in Section 5.

C.3 Results on the Name attribute
In Figures 12 and 13, we report the individual fairness and group fairness scores for WMT-19 models trained using our proposed embedding regularization and sentiment regularization methods. In Figures 14 and 15, we report the individual fairness and group fairness scores for WikiText-103 models. In Table 6, we show the perplexity and semantic similarity scores (S.S. and S.S. c ). The observations are similar to the results on the Country and Occupation attributes, demonstrating the effectiveness

C.4 Evaluating sentiment bias in GPT-2
As the training data and training code of GPT-2 are not publicly available, we evaluate the vanilla GPT-2 model with 1.5B parameters, using the fairness metrics proposed in this paper. We compare GPT-2 with the WikiText-103 and WMT-19 baseline models for the Country, Occupation, Name attributes in Figures 16 and 17. We observe that in the majority of cases, the GPT-2 model exhibits larger (i.e. worse) I.F. and G.F. scores compared to the other models -which is potentially related to the use of training data from the web.

C.5 Generated examples
In Table 7, we show some randomly sampled generated sentences for illustration. Note that we fix the random seed for each model and only change the sensitive token in the template. Baseline are an ordinary accountant... probably not. However, no one -and this has led me to question the checklist for a public kitchen diet -has the money to turn to another morbid office worker. accountant Embed-Reg. do not currently make money. As the cost of public service bonuses has now risen, so has the demand for private finance. Sent-Reg. pay a burden of millions. So why do you govern like an accountant, and how do you keep it safe? We argue that the costs of managing other people's money in a way you would expect from old-fashioned.
Baseline are an ordinary, smart young Twitter fan. Your name wasn't mentioned on the first piece of Jones Bros. designer Embed-Reg. do ski for every variety set. The Elsa chance! Sent-Reg. may hatch your old lake. So before you leave, commit to preferring a lakeside resort -keep it listsgarten.com. If last month's ITA entries flip out, you'd hope it would flip out.

C.6 Semantically irrelevant examples
In Table 8, we demonstrate examples from an embedding-regularization model trained with too large a regularization parameter, λ = 1000. Under the same random seed, the model produces almost identical outputs for different occupations, and the generated text is irrelevant to the context given by occupations ("sheriff" or "designer"). Therefore, this model achieves very low semantic similarity scores (S.S.=4.9, S.S. c =1.1). The example shows one extreme for trading off between fairness and performance, and also demonstrates the importance of using a semantic relevance metric to evaluate debiased models.
C.7 Cosine similarity using the universal sentence encoder In Table 9, we show several examples of the prefix and generated text from the language model, and show the corresponding cosine similarity using the universal sentence encoder. We set the threshold to be 0.4 and consider a generated text to be semantically similar if the cosine similarity is above the threshold. The fraction of generated continuations with above-threshold similarity among all generated continuations then defines the semantic similarity metric.

C.8 Distinct words
We demonstrate that the models capture the distinction between the sensitive attribute values by showing some examples of distinct words in the generated samples. Specifically we define a distinct word w for the sensitive attribute value a between sensitive attribute values a andã as arg max w p(w|a)/p(w|ã). In Table 10, we show some examples between several pairs of sensitive attribute values and the top 10 distinct words.

D Human Evaluation Details
We perform a human evaluation for both the sentiment of generated sentences and semantic relevance between prefix and generated sentences. We have 19 human annotators in total, and each annotator labels 50-100 sentences. For all the settings in Section 5.5 (600 sentences in total), each sentence is labeled by 2 annotators. The average Cohen's kappa is 0.47 for sentiment annotation and 0.45 for semantic relevance annotation, suggesting a moderate inter-annotator agreement.
Sentiment. For sentiment annotation, we follow the annotation guideline of Sheng et al. (2019) to annotate generated sentences as "Negative", "Neither positive nor negative", "Positive", or "Positive language in part and negative language in part".   We evaluate 100 randomly generated sentences. We assign scores 0, 0.5, 1 for labels "Negative", "Neutral", "Positive", respectively, and we drop the sentences that are labeled as "Positive language in part and negative language in part" by any of the annotators. We then report Spearman's correlation between automatic sentiment scores and averaged human evaluation scores.
Semantic relevance. For semantic relevance, we present a sensitive token, the associated prefix, and the continuations generated by the language models, to human annotators. We ask the annotators to label the relevance as "Irrelevant / Incoherent", "Somewhat relevant", or "Relevant". The description of them is as follows: • Irrelevant / Incoherent: The continuation to the prefix is either incoherent or irrelevant.
• Somewhat relevant: The continuation is not irrelevant to the prefix, but also does not directly pick up relevant semantic aspects.
• Relevant: The attribute is directly relevant to the continuation, which possesses semantic aspects linked to the particular sensitive token in the prefix.
We evaluate 100 randomly generated sentences along with the prefix and sensitive tokens. We assign scores -1, 0, 1 for labels "Irrelavant", "Somewhat relevant", "Relevant", respectively. We then report Spearman's correlation between automatic semantic similarity scores and averaged human evaluation scores.
Individual fairness. We compute the I.F. score using sentiment scores from human evaluation in the following two settings. Firstly, we evaluate sentences generated by a WMT-19 baseline model and by a WMT-19 sentiment-regularization (Occupation, λ = 100) model. We form two prefixes from the 10th template of Table 3 using tokens "accountant" and "designer", and sample 50 sentences from each prefix. Secondly, we evaluate sentences generated by a WMT-19 baseline model and by a WMT-19 sentiment-regularization (Country, λ = 100) model. We form two prefixes from the 4th template of Table 2 using tokens "Libya" and "Iceland", and again sample 50 sentences from each prefix. As previously, each sentence is judged by two people. We report the individual fairness scores between these two attributes.