Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology

Gender stereotypes are manifest in most of the world’s languages and are consequently propagated or amplified by NLP systems. Although research has focused on mitigating gender stereotypes in English, the approaches that are commonly employed produce ungrammatical sentences in morphologically rich languages. We present a novel approach for converting between masculine-inflected and feminine-inflected sentences in such languages. For Spanish and Hebrew, our approach achieves F1 scores of 82% and 73% at the level of tags and accuracies of 90% and 87% at the level of forms. By evaluating our approach using four different languages, we show that, on average, it reduces gender stereotyping by a factor of 2.5 without any sacrifice to grammaticality.


Introduction
One of the biggest challenges faced by modern natural language processing (NLP) systems is the inadvertent replication or amplification of societal biases. This is because NLP systems depend on language corpora, which are inherently "not objective; they are creations of human design" (Crawford, 2013). One type of societal bias that has received considerable attention from the NLP community is gender stereotyping (Garg et al., 2017; Rudinger et al., 2017; Sutton et al., 2018). Gender stereotypes can manifest in language in overt ways. For example, the sentence he is an engineer is more likely to appear in a corpus than she is an engineer due to the current gender disparity in engineering. Consequently, any NLP system that is trained on such a corpus will likely learn to associate engineer with men, but not with women (De-Arteaga et al., 2019).
To date, the NLP community has focused primarily on approaches for detecting and mitigating gender stereotypes in English (Bolukbasi et al., 2016; Dixon et al., 2018; Zhao et al., 2017). Yet, gender stereotypes also exist in other languages because they are a function of society, not of grammar. Moreover, because English does not mark grammatical gender, approaches developed for English are not transferable to morphologically rich languages that exhibit gender agreement (Corbett, 1991). In these languages, the words in a sentence are marked with morphological endings that reflect the grammatical gender of the surrounding nouns. This means that if the gender of one word changes, the others have to be updated to match. As a result, simple heuristics, such as augmenting a corpus with additional sentences in which he and she have been swapped (Zhao et al., 2018), will yield ungrammatical sentences. Consider the Spanish phrase el ingeniero experto (the skilled engineer). Replacing ingeniero with ingeniera is insufficient: el must also be replaced with la and experto with experta.
Figure 1: Transformation of Los ingenieros son expertos (i.e., The male engineers are skilled) to Las ingenieras son expertas (i.e., The female engineers are skilled). We extract the properties of each word in the sentence. We then fix a noun and its tags and infer the manner in which the remaining tags must be updated. Finally, we reinflect the lemmata to their new forms.
In this paper, we present a new approach to counterfactual data augmentation (CDA; Lu et al., 2018) for mitigating gender stereotypes associated with animate nouns (i.e., nouns that represent people) for morphologically rich languages. We introduce a Markov random field with an optional neural parameterization that infers the manner in which a sentence must change when altering the grammatical gender of particular nouns. We use this model as part of a four-step process, depicted in Fig. 1, to reinflect entire sentences following an intervention on the grammatical gender of one word. We intrinsically evaluate our approach using Spanish and Hebrew, achieving tag-level F1 scores of 83% and 72% and form-level accuracies of 90% and 87%, respectively. We also conduct an extrinsic evaluation using four languages. Following Lu et al. (2018), we show that, on average, our approach reduces gender stereotyping in neural language models by a factor of 2.5 without sacrificing grammaticality.

Gender Stereotypes in Text
Men and women are mentioned at different rates in text (Coates, 1987). This problem is exacerbated in certain contexts. For example, the sentence he is an engineer is more likely to appear in a corpus than she is an engineer due to the current gender disparity in engineering. This imbalance in representation can have a dramatic downstream effect on NLP systems trained on such a corpus, such as giving preference to male engineers over female engineers in an automated resumé filtering system. Gender stereotypes of this sort have been observed in word embeddings (Bolukbasi et al., 2016; Sutton et al., 2018), contextual word embeddings (Zhao et al., 2019), and co-reference resolution systems (Rudinger et al., 2018; Zhao et al., 2018), inter alia.
A quick fix: swapping gendered words. One approach to mitigating such gender stereotypes is counterfactual data augmentation (CDA; Lu et al., 2018). In English, this involves augmenting a corpus with additional sentences in which gendered words, such as he and she, have been swapped to yield a balanced representation. Indeed, Zhao et al. (2018) showed that this simple heuristic significantly reduces gender stereotyping in neural co-reference resolution systems, without harming system performance. Unfortunately, this approach is only applicable to English and other languages with little morphological inflection. When applied to morphologically rich languages that exhibit gender agreement, it yields ungrammatical sentences.
The problem: inflected languages. Many languages, including Spanish and Hebrew, have gender inflections for nouns, verbs, and adjectives; i.e., the words in a sentence are marked with morphological endings that reflect the grammatical gender of the surrounding nouns. This means that if the gender of one word changes, the others have to be updated to preserve morpho-syntactic agreement (Corbett, 2012). Consider the following example from Spanish, where we wish to transform Sentence (1) to Sentence (2). (Parts of words that mark gender are depicted in bold.) This task is not as simple as replacing el with la: ingeniero and experto must also be reinflected. Moreover, the changes required for one language are not the same as those required for another (e.g., verbs are marked with gender in Hebrew, but not in Spanish).
Figure 2: Dependency tree for the sentence El ingeniero alemán es muy experto.
Our approach. Our goal is to transform sentences like Sentence (1) to Sentence (2) and vice versa. To the best of our knowledge, this task has not been studied previously. Indeed, there is no existing annotated corpus of paired sentences that could be used to train a supervised model. As a result, we take an unsupervised approach using dependency trees, lemmata, part-of-speech (POS) tags, and morpho-syntactic tags from Universal Dependencies corpora (UD; Nivre et al., 2018). Specifically, we propose the following four-step process:
1. Analyze the sentence (including parsing, morphological analysis, and lemmatization).
2. Intervene on a gendered word.
3. Infer the new morpho-syntactic tags.
4. Reinflect the lemmata to their new forms.
This process is depicted in Fig. 1. The primary technical contribution is a novel Markov random field for performing step 3, described in the next section.
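The four steps can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the analysis is hard-coded instead of parsed, the inflection table is a hand-written stand-in for a trained reinflection model, and step 3 uses a simple agreement rule in place of the Markov random field described in the next section.

```python
# Toy illustration of the four-step process. All tables are hand-written for
# this example and are NOT part of the authors' system.

# Step 1 (analysis), hard-coded: (lemma, POS, tag, index of head or None).
SENTENCE = [
    ("el", "DET", ("MSC", "SG"), 1),
    ("ingeniero", "NOUN", ("MSC", "SG"), None),
    ("experto", "ADJ", ("MSC", "SG"), 1),
]

# Hypothetical inflection table standing in for a trained reinflection model.
INFLECT = {
    ("el", ("MSC", "SG")): "el",
    ("el", ("FEM", "SG")): "la",
    ("ingeniero", ("MSC", "SG")): "ingeniero",
    ("ingeniero", ("FEM", "SG")): "ingeniera",
    ("experto", ("MSC", "SG")): "experto",
    ("experto", ("FEM", "SG")): "experta",
}


def swap_gender(tag):
    """Flip the gender subtag and keep the number subtag."""
    gender, number = tag
    return ("FEM" if gender == "MSC" else "MSC", number)


def transform(sentence, noun_index):
    tags = [tag for _, _, tag, _ in sentence]
    # Step 2: intervene on the chosen noun.
    tags[noun_index] = swap_gender(tags[noun_index])
    # Step 3: stand-in for inference. Determiners and adjectives attached to
    # the noun copy its new gender; a learned model would decide this instead.
    for k, (_, pos, _, head) in enumerate(sentence):
        if head == noun_index and pos in ("DET", "ADJ"):
            tags[k] = (tags[noun_index][0], tags[k][1])
    # Step 4: reinflect each lemma with its (possibly updated) tag.
    return [INFLECT[(lemma, tag)] for (lemma, _, _, _), tag in zip(sentence, tags)]


print(" ".join(transform(SENTENCE, 1)))
```

The hard-coded rule in step 3 is exactly the kind of language-specific heuristic that the learned model is meant to replace.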

A Markov Random Field for Morpho-Syntactic Agreement
In this section, we present a Markov random field (MRF; Koller and Friedman, 2009) for morpho-syntactic agreement. This model defines a joint distribution over sequences of morpho-syntactic tags, conditioned on a labeled dependency tree with associated part-of-speech tags. Given an intervention on a gendered word, we can use this model to infer the manner in which the remaining tags must be updated to preserve morpho-syntactic agreement. A dependency tree for a sentence (see Fig. 2 for an example) is a set of ordered triples $(i, j, \ell)$, where $i$ and $j$ are positions in the sentence (or a distinguished root symbol) and $\ell \in L$ is the label of the edge $i \to j$ in the tree; each position occurs exactly once as the first element in a triple. Each dependency tree $T$ is associated with a sequence of morpho-syntactic tags $m = m_1, \ldots, m_{|T|}$ and a sequence of part-of-speech (POS) tags $p = p_1, \ldots, p_{|T|}$. For example, the tags $m \in M$ and $p \in P$ for ingeniero are [MSC; SG] and NOUN, respectively, because ingeniero is a masculine, singular noun. For notational simplicity, we define $\mathcal{M} = M^{|T|}$ to be the set of all length-$|T|$ sequences of morpho-syntactic tags.
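We define the probability of $m$ given $T$ and $p$ as
$$\Pr(m \mid T, p) = \frac{1}{Z(T, p)} \prod_{(i, j, \ell) \in T} \phi_i(m_i) \, \psi(m_i, m_j \mid p_i, p_j, \ell), \qquad (1)$$
where the binary factor $\psi(\cdot, \cdot \mid \cdot, \cdot, \cdot) \geq 0$ scores how well the morpho-syntactic tags $m_i$ and $m_j$ agree given the POS tags $p_i$ and $p_j$ and the label $\ell$. For example, consider the amod (adjectival modifier) edge from experto to ingeniero in Fig. 2: the factor for this edge should assign a high score when the two tags agree in gender and number and a low score when they do not. The unary factor $\phi_i(\cdot) \geq 0$ scores the tag $m_i$ independently of the rest of the tree. As we explain in §3.1, we use these unary factors to force or disallow particular tags when performing an intervention; we do not learn them. Eq. (1) is normalized by the following partition function:
$$Z(T, p) = \sum_{m' \in M^{|T|}} \prod_{(i, j, \ell) \in T} \phi_i(m'_i) \, \psi(m'_i, m'_j \mid p_i, p_j, \ell).$$
$Z(T, p)$ can be calculated using belief propagation; we provide the update equations that we use in App. A. Our model is depicted in Fig. 3. It is noteworthy that this model is delexicalized; i.e., it considers only the labeled dependency tree and the POS tags, not the actual words themselves.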
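Linear parameterization. We define a matrix $W(p_i, p_j, \ell) \in \mathbb{R}^{c \times c}$, where $c$ is the number of morpho-syntactic subtags. For example, [MSC; SG] has two subtags MSC and SG. We then define $\psi(\cdot, \cdot \mid \cdot, \cdot, \cdot)$ as follows:
$$\psi(m_i, m_j \mid p_i, p_j, \ell) = \exp\left(\mathbf{m}_i^\top \, W(p_i, p_j, \ell) \, \mathbf{m}_j\right),$$
where $\mathbf{m}_i \in \{0, 1\}^c$ is a multi-hot encoding of $m_i$.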
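A minimal sketch of such a binary factor follows, assuming the log-linear form $\exp(\mathbf{m}_i^\top W \mathbf{m}_j)$ over multi-hot encodings. The four-subtag inventory and the identity weight matrix are invented for illustration; in the model, a separate matrix would be learned for each combination of POS tags and edge label.

```python
import math

# Illustrative subtag inventory (c = 4); a real inventory would come from the treebank.
SUBTAGS = ["MSC", "FEM", "SG", "PL"]


def multi_hot(tag):
    """Encode a tag such as ("MSC", "SG") as a 0/1 vector over SUBTAGS."""
    return [1.0 if subtag in tag else 0.0 for subtag in SUBTAGS]


def psi(tag_i, tag_j, W):
    """Binary factor exp(m_i^T W m_j) for one fixed (p_i, p_j, label) triple."""
    m_i, m_j = multi_hot(tag_i), multi_hot(tag_j)
    c = len(SUBTAGS)
    score = sum(m_i[a] * W[a][b] * m_j[b] for a in range(c) for b in range(c))
    return math.exp(score)


# Hypothetical weights: the identity matrix rewards every subtag shared by both words.
W_AGREE = [[1.0 if a == b else 0.0 for b in range(4)] for a in range(4)]

agree = psi(("MSC", "SG"), ("MSC", "SG"), W_AGREE)     # two shared subtags
disagree = psi(("MSC", "SG"), ("FEM", "SG"), W_AGREE)  # one shared subtag
print(agree > disagree)
```

Because the factor is an exponential, it is always positive, which matches the requirement that the binary factor be non-negative.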
Neural parameterization. As an alternative, we also define a neural parameterization of $W(p_i, p_j, \ell)$ to allow parameter sharing among edges with different parts of speech and labels:
$$W(p_i, p_j, \ell) = \exp\left(U \tanh\left(V \, [e(p_i); e(p_j); e(\ell)]\right)\right),$$
where $U \in \mathbb{R}^{c \times c \times n_1}$, $V \in \mathbb{R}^{n_1 \times 3 n_2}$, and $n_1$ and $n_2$ define the structure of the neural parameterization and each $e(\cdot) \in \mathbb{R}^{n_2}$ is an embedding function.
Parameterization of $\phi_i$. We use the unary factors only to force or disallow particular tags when performing an intervention. Specifically, we define
$$\phi_i(m) = \begin{cases} \alpha & \text{if } m = m_i \\ 1 & \text{otherwise,} \end{cases}$$
where $\alpha > 1$ is a strength parameter that determines the extent to which $m_i$ should remain unchanged following an intervention. In the limit as $\alpha \to \infty$, all tags will remain unchanged except for the tag directly involved in the intervention.⁴

Inference
Because our MRF is acyclic and tree-shaped, we can use belief propagation (Pearl, 1988) to perform exact inference. The algorithm is a generalization of the forward-backward algorithm for hidden Markov models (Rabiner and Juang, 1986). Specifically, we pass messages from the leaves to the root and vice versa. The (unnormalized) marginal distribution of a node is the point-wise product of all its incoming messages; the partition function $Z(T, p)$ is the sum of any node's unnormalized marginal distribution. Computing $Z(T, p)$ takes polynomial time (Pearl, 1988); specifically, $O(n \cdot |M|^2)$, where $n$ is the number of words in the sentence and $|M|$ is the number of morpho-syntactic tags. Finally, inferring the highest-probability morpho-syntactic tag sequence $m$ given $T$ and $p$ can be performed using the max-product modification to belief propagation.
⁴ In practice, α is set using development data.
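The leaves-to-root pass can be sketched as follows. This is a simplified illustration, not the update equations of App. A: it assumes one unary factor per node and one binary factor per edge, and the tree, tag set, and factor values are all made up for the example. A brute-force sum over every tag sequence is included only to check the result.

```python
import itertools
import math


def partition_function(children, root, tags, phi, psi):
    """Upward sum-product pass on a tree-shaped pairwise MRF.

    children maps a node to its list of children; phi(i, m) is the unary
    factor and psi(i, j, m_i, m_j) the binary factor on the edge i -> j.
    """
    def upward(i):
        # message[m]: total weight of the subtree rooted at i when i has tag m.
        message = {m: phi(i, m) for m in tags}
        for j in children.get(i, []):
            child = upward(j)
            for m in tags:
                message[m] *= sum(psi(i, j, m, mj) * child[mj] for mj in tags)
        return message

    return sum(upward(root).values())


def brute_force(children, root, tags, phi, psi):
    """Sum the unnormalized score of every tag sequence (exponential time)."""
    nodes = sorted({root} | set(children) | {j for cs in children.values() for j in cs})
    total = 0.0
    for assignment in itertools.product(tags, repeat=len(nodes)):
        m = dict(zip(nodes, assignment))
        weight = 1.0
        for i in nodes:
            weight *= phi(i, m[i])
            for j in children.get(i, []):
                weight *= psi(i, j, m[i], m[j])
        total += weight
    return total


# Made-up example: el (0) and experto (2) both attach to ingeniero (1).
TAGS = ("MSC", "FEM")
CHILDREN = {1: [0, 2]}
ORIGINAL = {0: "MSC", 1: "MSC", 2: "MSC"}
ALPHA = math.e  # strength parameter, i.e. log(alpha) = 1


def phi(i, m):
    if i == 1:  # intervention: force the noun to be feminine
        return 1.0 if m == "FEM" else 0.0
    return ALPHA if m == ORIGINAL[i] else 1.0


def psi(i, j, m_i, m_j):
    return math.exp(2.0) if m_i == m_j else 1.0


z_bp = partition_function(CHILDREN, 1, TAGS, phi, psi)
z_enum = brute_force(CHILDREN, 1, TAGS, phi, psi)
print(abs(z_bp - z_enum) < 1e-9)
```

Each edge costs one sum over pairs of tags, which is where the quadratic dependence on the number of tags comes from. Replacing `sum` with `max` in the upward pass gives the max-product variant.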

Parameter Estimation
We learn the parameters of §3.1 using gradient-based optimization. We treat the negative log-likelihood $-\log \Pr(m \mid T, p)$ as the loss function for tree $T$, compute its gradient using automatic differentiation (Rall, 1981), and minimize it using gradient descent.

Intervention
As explained in §2, our goal is to transform sentences like Sentence (1) to Sentence (2) by intervening on a gendered word and then using our model to infer the manner in which the remaining tags must be updated to preserve morpho-syntactic agreement. For example, if we change the morpho-syntactic tag for ingeniero from [MSC; SG] to [FEM; SG], then we must also update the tags for el and experto, but do not need to update the tag for es, which should remain unchanged as [IN; PR; SG]. If we intervene on the $i$th word in a sentence, changing its tag from $m_i$ to $m'_i$, then using our model to infer the manner in which the remaining tags must be updated means using $\Pr(m_{-i} \mid m'_i, T, p)$ to identify high-probability tags for the remaining words. Crucially, we wish to change as little as possible when intervening on a gendered word. The unary factors $\phi_i$ enable us to do exactly this. As described in the previous section, the strength parameter $\alpha$ determines the extent to which $m_i$ should remain unchanged following an intervention: the larger the value, the less likely it is that $m_i$ will be changed.
Once the new tags have been inferred, the final step is to reinflect the lemmata to their new forms. This task has received considerable attention from the NLP community (Cotterell et al., 2016, 2017).
We use the inflection model of . This model conditions on the lemma x and morphosyntactic tag m to form a distribution over possible inflections. For example, given experto and [A; FEM; PL], the trained inflection model will assign a high probability to expertas. We provide accuracies for the trained inflection model in Tab. 1.

Experiments
We used the Adam optimizer (Kingma and Ba, 2014) to train both parameterizations of our model until the change in dev-loss was less than $10^{-5}$ bits. We set $\beta = (0.9, 0.999)$ without tuning, and chose a learning rate of 0.005 and weight decay factor of 0.0001 after tuning. We tuned $\log \alpha$ in the set {0.5, 0.75, 1, 2, 5, 10} and chose $\log \alpha = 1$.
For the neural parameterization, we set $n_1 = 9$ and $n_2 = 3$ without any tuning. Finally, we trained the inflection model using only gendered words.
We evaluate our approach both intrinsically and extrinsically. For the intrinsic evaluation, we focus on whether our approach yields the correct morpho-syntactic tags and the correct reinflections. For the extrinsic evaluation, we assess the extent to which using the resulting transformed sentences reduces gender stereotyping in neural language models.

Intrinsic Evaluation
To the best of our knowledge, this task has not been studied previously. As a result, there is no existing annotated corpus of paired sentences that can be used as "ground truth." We therefore annotated Spanish and Hebrew sentences ourselves, with annotations made by native speakers of each language. Specifically, for each language, we extracted sentences containing animate nouns from that language's UD treebank. The average length of these extracted sentences was 37 words. We then manually inspected each sentence, intervening on the gender of the animate noun and reinflecting the sentence accordingly. We chose Spanish and Hebrew because gender agreement operates differently in each language. We provide corpus statistics for both languages in the top two rows of Tab. 2.
Table 3: Tag-level precision, recall, F1 score, and accuracy and form-level accuracy for the baselines ("-BASE") and for our approach ("-LIN" is the linear parameterization, "-NN" is the neural parameterization).
We created a hard-coded ψ(·, · | ·, ·, ·) to serve as a baseline for each language. For Spanish, we only activated (i.e., set to a number greater than zero) values that relate adjectives and determiners to nouns; for Hebrew, we only activated values that relate adjectives and verbs to nouns. We created two separate baselines because gender agreement operates differently in each language.
To evaluate our approach, we held all morpho-syntactic subtags fixed except for gender. For each annotated sentence, we intervened on the gender of the animate noun. We then used our model to infer which of the remaining tags should be updated (updating a tag means swapping the gender subtag because all morpho-syntactic subtags were held fixed except for gender) and reinflected the lemmata. Finally, we used the annotations to compute the tag-level F1 score and the form-level accuracy, excluding the animate nouns on which we intervened.
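One plausible way to compute the tag-level scores is sketched below. It assumes that a "positive" is a word whose gender subtag is swapped relative to the original sentence; this reading, the function name, and the example tags are assumptions for illustration, not the exact evaluation script.

```python
def tag_level_prf(original, predicted, gold, skip=()):
    """Precision, recall, and F1 over gender swaps.

    original, predicted, gold: per-word gender subtags (None if ungendered).
    skip: positions to exclude, e.g. the noun that was intervened on.
    """
    tp = fp = fn = 0
    for k, (o, p, g) in enumerate(zip(original, predicted, gold)):
        if k in skip:
            continue
        predicted_swap, gold_swap = p != o, g != o
        if predicted_swap and gold_swap:
            tp += 1
        elif predicted_swap:
            fp += 1
        elif gold_swap:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: el ingeniero alemán es experto, intervening on position 1.
original = ["MSC", "MSC", "MSC", None, "MSC"]
gold = ["FEM", "FEM", "FEM", None, "FEM"]
predicted = ["FEM", "FEM", "FEM", None, "MSC"]  # the model missed experto

precision, recall, f1 = tag_level_prf(original, predicted, gold, skip={1})
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

In this example every predicted swap is correct but one required swap is missed, so precision exceeds recall.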
Results. We present the results in Tab. 3. Recall is consistently significantly lower than precision. As expected, the baselines have the highest precision (though not by much). This is because they reflect well-known rules for each language. That said, they have lower recall than our approach because they fail to capture more subtle relationships.
For both languages, our approach struggles with conjunctions. For example, consider the phrase él es un ingeniero y escritor (he is an engineer and a writer). Replacing ingeniero with ingeniera does not necessarily result in escritor being changed to escritora. This is because two nouns do not normally need to have the same gender when they are conjoined. Moreover, our MRF does not include co-reference information, so it cannot tell that, in this case, both nouns refer to the same person. Note that including co-reference information in our MRF would create cycles and inference would no longer be exact. Additionally, the lack of co-reference information means that, for Spanish, our approach fails to convert nouns that are noun-modifiers or indirect objects of verbs.
Figure 4: Gender stereotyping (left) and grammaticality (right) using the original corpus, the corpus following CDA using naïve swapping of gendered words ("Swap"), and the corpus following CDA using our approach ("MRF").
Somewhat surprisingly, the neural parameterization does not outperform the linear parameterization. We proposed the neural parameterization to allow parameter sharing among edges with different parts of speech and labels; however, this parameter sharing does not seem to make a difference in practice, so the linear parameterization is sufficient.

Extrinsic Evaluation
We extrinsically evaluate our approach by assessing the extent to which it reduces gender stereotyping. Following Lu et al. (2018), we focus on neural language models. We choose language models over word embeddings because standard measures of gender stereotyping for word embeddings cannot be applied to morphologically rich languages.
As our measure of gender stereotyping, we compare the log ratio of the prefix probabilities under a language model $P_{\mathrm{lm}}$ for gendered, animate nouns, such as ingeniero, combined with four adjectives: good, bad, smart, and beautiful. The translations we use for these adjectives are given in App. B. We chose the first two adjectives because they should be used equally to describe men and women, and the latter two because we expect that they will reveal gender stereotypes. For example, consider
$$\log \frac{\sum_{x \in \Sigma^*} P_{\mathrm{lm}}(\text{BOS El ingeniero bueno } x)}{\sum_{x \in \Sigma^*} P_{\mathrm{lm}}(\text{BOS La ingeniera buena } x)}.$$
If this log ratio is close to 0, then the language model is as likely to generate sentences that start with el ingeniero bueno (the good male engineer) as it is to generate sentences that start with la ingeniera buena (the good female engineer). If the log ratio is negative, then the language model is more likely to generate the feminine form than the masculine form, while the opposite is true if the log ratio is positive. In practice, given the current gender disparity in engineering, we would expect the log ratio to be positive. If, however, the language model were trained on a corpus to which our CDA approach had been applied, we would then expect the log ratio to be much closer to zero. Because our approach is specifically intended to yield sentences that are grammatical, we additionally consider the following log ratio (i.e., the grammatical phrase over the ungrammatical phrase):
$$\log \frac{\sum_{x \in \Sigma^*} P_{\mathrm{lm}}(\text{BOS El ingeniero bueno } x)}{\sum_{x \in \Sigma^*} P_{\mathrm{lm}}(\text{BOS El ingeniera bueno } x)}.$$
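Given prefix log-probabilities from a language model, both measures reduce to differences of logs. The sketch below assumes the four prefix log-probabilities have already been computed; the function names and the numbers are invented for illustration. Averaging the two grammaticality ratios follows the description in the caption of Tab. 5.

```python
def stereotyping(logp_masc, logp_fem):
    """Log ratio of the masculine prefix probability to the feminine one.

    Positive values mean the masculine phrase is more likely; values near 0
    mean the two grammatical phrases are about equally likely.
    """
    return logp_masc - logp_fem


def grammaticality(logp_masc, logp_fem, logp_masc_bad, logp_fem_bad):
    """Average log ratio of each grammatical phrase to its ungrammatical variant."""
    return 0.5 * ((logp_masc - logp_masc_bad) + (logp_fem - logp_fem_bad))


# Invented prefix log-probabilities for two grammatical phrases and their
# ungrammatical counterparts.
scores = {"masc": -10.0, "fem": -12.0, "masc_bad": -15.0, "fem_bad": -16.0}

bias = stereotyping(scores["masc"], scores["fem"])
gram = grammaticality(scores["masc"], scores["fem"], scores["masc_bad"], scores["fem_bad"])
print(bias, gram)
```

Working in log space avoids underflow, since prefix probabilities of multi-word phrases are typically very small.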
Figure 5: Gender stereotyping for words that are stereotyped toward men or women in Spanish using the original corpus, the corpus following CDA using naïve swapping of gendered words ("Swap"), and the corpus following CDA using our approach ("MRF").
We trained the linear parameterization using UD treebanks for Spanish, Hebrew, French, and Italian (see Tab. 2). For each of the four languages, we parsed one million sentences from Wikipedia (May 2018 dump) using the parser of Dozat and Manning (2016) and extracted taggings and lemmata using the method of Müller et al. (2015). We automatically extracted an animacy gazetteer from WordNet (Bond and Paik, 2012) and then manually filtered the output for correctness. We provide the size of the languages' animacy gazetteers and the percentage of automatically parsed sentences that contain an animate noun in Tab. 4. For each sentence containing a noun in our animacy gazetteer, we created a copy of the sentence, intervened on the noun, and then used our approach to transform the sentence. For sentences containing more than one animate noun, we generated a separate sentence for each possible combination of genders.
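Enumerating the gender combinations for a sentence with several animate nouns can be sketched as follows. The function name and the input format are hypothetical, and skipping the combination that matches the original sentence is an assumption made here to avoid producing an exact duplicate.

```python
from itertools import product

GENDERS = ("MSC", "FEM")


def gender_combinations(noun_genders):
    """Yield every alternative gender assignment for the animate nouns.

    noun_genders maps the position of each animate noun to its original gender.
    """
    positions = sorted(noun_genders)
    for combo in product(GENDERS, repeat=len(positions)):
        assignment = dict(zip(positions, combo))
        if assignment != noun_genders:  # assumption: skip the unchanged sentence
            yield assignment


# A sentence with two masculine animate nouns at positions 1 and 4.
for assignment in gender_combinations({1: "MSC", 4: "MSC"}):
    print(assignment)
```

Each yielded assignment would then drive one set of interventions followed by a single round of tag inference and reinflection. The number of combinations doubles with every additional animate noun, so long sentences can produce many copies.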
Choosing which sentences to duplicate is a difficult task. For example, alemán in Spanish can refer to either a German man or the German language; however, we have no way of distinguishing between these two meanings without additional annotations. Multilingual animacy detection (Jahan et al., 2018) might help with this challenge; co-reference information might additionally help. For each language, we trained the BPE-RNNLM baseline open-vocabulary language model of Mielke and Eisner (2018) using the original corpus, the corpus following CDA using naïve swapping of gendered words, and the corpus following CDA using our approach. We then computed gender stereotyping and grammaticality as described above. We provide example phrases in Tab. 5; we provide a more extensive list of phrases in App. C.
Results. Fig. 4 depicts gender stereotyping and grammaticality for each language using the original corpus, the corpus following CDA using naïve swapping of gendered words, and the corpus following CDA using our approach. It is immediately apparent that our approach reduces gender stereotyping. On average, our approach reduces gender stereotyping by a factor of 2.5 (the lowest and highest factors are 1.2 (Ita) and 5.0 (Esp), respectively). We expected that naïve swapping of gendered words would also reduce gender stereotyping. Indeed, we see that this simple heuristic reduces gender stereotyping for some but not all of the languages. For Spanish, we also examine specific words that are stereotyped toward men or women. We define a word to be stereotyped toward one gender if 75% of its occurrences are of that gender. Fig. 5 suggests a clear reduction in gender stereotyping for specific words that are stereotyped toward men or women. The grammaticality of the corpora following CDA differs between languages. That said, with the exception of Hebrew, our approach sacrifices less grammaticality than naïve swapping of gendered words and sometimes increases grammaticality over the original corpus. Given that we know the model did not perform as accurately for Hebrew (see Tab. 3), this finding is not surprising.
Table 5: Prefix log-likelihoods of Spanish phrases using the original corpus, the corpus following CDA using naïve swapping of gendered words ("Swap"), and the corpus following CDA using our approach ("MRF"). Phrases 1 and 2 are grammatical, while phrases 3 and 4 are not (denoted by "*"). Gender stereotyping is measured using phrases 1 and 2. Grammaticality is measured using phrases 1 and 3 and using phrases 2 and 4; these scores are then averaged.

Related Work
In contrast to previous work, we focus on mitigating gender stereotypes in languages with rich morphology-specifically languages that exhibit gender agreement. To date, the NLP community has focused on approaches for detecting and mitigating gender stereotypes in English. For example, Bolukbasi et al. (2016) proposed a way of mitigating gender stereotypes in word embeddings while preserving meanings; Lu et al. (2018) studied gender stereotypes in language models; and Rudinger et al. (2018) introduced a novel Winograd schema for evaluating gender stereotypes in co-reference resolution. The most closely related work is that of Zhao et al. (2018), who used CDA to reduce gender stereotypes in co-reference resolution; however, their approach yields ungrammatical sentences in morphologically rich languages. Our approach is specifically intended to yield grammatical sentences when applied to such languages. Habash et al. (2019) also focused on morphologically rich languages, specifically Arabic, but in the context of gender identification in machine translation.

Conclusion
We presented a new approach for converting between masculine-inflected and feminine-inflected noun phrases in morphologically rich languages. To do this, we introduced a Markov random field with an optional neural parameterization that infers the manner in which a sentence must change to preserve morpho-syntactic agreement when altering the grammatical gender of particular nouns. To the best of our knowledge, this task has not been studied previously. As a result, there is no existing annotated corpus of paired sentences that can be used as "ground truth." Despite this limitation, we evaluated our approach both intrinsically and extrinsically, achieving promising results. For example, we demonstrated that our approach reduces gender stereotyping in neural language models. Finally, we also identified avenues for future work, such as the inclusion of co-reference information.

A Belief Propagation Update Equations
Our belief propagation update equations are [update equations missing from the extracted text], where N(i) returns the set of neighbouring nodes of node i. The belief at any node is given by [equation missing from the extracted text].

B Adjective Translations
Tab. 6 and Tab. 7 contain the feminine and masculine translations of the four adjectives that we used.

C Extrinsic Evaluation Example Phrases
For each noun in our animacy gazetteer, we generated sixteen phrases. Consider the noun engineer as an example. We created four phrases for each translation of The good engineer, The bad engineer, The smart engineer, and The beautiful engineer. These phrases, as well as their prefix log-likelihoods, are provided in Tab. 8.
Table 8: Prefix log-likelihoods of Spanish phrases using the original corpus, the corpus following CDA using naïve swapping of gendered words ("Swap"), and the corpus following CDA using our approach ("MRF"). Ungrammatical phrases are denoted by "*".