Equalizing Gender Biases in Neural Machine Translation with Word Embeddings Techniques

Neural machine translation has significantly improved translation quality. However, important issues remain, and one of them is fairness. Neural models are trained on large text corpora that contain biases and stereotypes. As a consequence, models inherit these social biases. Recent methods have shown results in reducing gender bias in other natural language processing applications such as word embeddings. We take advantage of the fact that word embeddings are used in neural machine translation to propose the first debiased machine translation system. Specifically, we propose, experiment with and analyze the integration of two debiasing techniques over GloVe embeddings in the Transformer translation architecture. We evaluate our proposed system on a generic English-Spanish task, showing gains of up to one BLEU point. As for the gender bias evaluation, we generate a test set of occupations and we show that our proposed system learns to equalize existing biases from the baseline system.


Introduction
Language is one of the most interesting and complex skills used in our daily life, and our ability to communicate may even be taken for granted. However, understanding meaning between the lines in natural languages is not straightforward to express with the logic rules of programming languages. Natural language processing (NLP) is a sub-field of artificial intelligence that focuses on making natural languages understandable to computers. Similarly, the translation between different natural languages is the task of machine translation (MT). Neural machine translation (NMT) is a recent approach in MT which learns patterns between source and target language corpora to produce text translations using deep neural networks (Sutskever et al., 2014). One downside of models trained on human-generated corpora is that the social biases present in the data are learned. This has been shown for word embeddings, vector representations of words, trained on news corpora, where crowd-sourced evaluation was used to quantify the presence of biases, such as gender bias, in those representations (Bolukbasi et al., 2016). Such biases can affect downstream applications (Zhao et al., 2018a) and are at risk of being amplified (Zhao et al., 2017). The objective of this work is to study the presence of gender bias in MT and to give insight into the impact of debiasing in such systems. As an example of this gender bias, the word "friend" in the English sentence "She works in a hospital, my friend is a nurse" would be correctly translated to "amiga" (feminine of friend) in Spanish, while in "She works in a hospital, my friend is a doctor" it would be incorrectly translated to "amigo" (masculine of friend). We consider that this translation contains gender bias since it ignores the fact that, in both cases, "friend" is a female, and it translates by focusing on occupational stereotypes, i.e. translating doctor as male and nurse as female.
The main contribution of this study is progress on the recently detected problem of gender bias in MT (Prates et al., 2018). Progress towards reducing gender bias in MT is made in two directions: first, we define a framework to experiment with, detect and evaluate gender bias in MT for a particular task; second, we propose to use debiased word embedding techniques in the MT system to reduce the detected bias. This is the first study to propose debiasing techniques for MT. The rest of the paper is organized as follows. Section 2 reports material relevant to the background of the study. Section 3 presents previous work on the bias problem. Section 4 reports the methodology used for experimentation and section 5 details the experimental framework. The results and discussion are included in section 6, and section 7 presents the main conclusions and ideas for further work.

Background
This section presents the two most important models used in this paper. First, we describe the Transformer, which is the state-of-the-art model in MT. Second, we give a brief description of word embeddings and the corresponding techniques to debias them.

Transformer
The Transformer (Vaswani et al., 2017) is a neural network architecture based purely on self-attention mechanisms. It improves performance on MT tasks over previous recurrent and convolutional models, uses computational resources more efficiently and is faster to train. Neural networks start by representing individual words as vectors, word embeddings (more on this later), to process language as a vector space representation which can have a fixed or variable length. The words surrounding a given word determine its meaning and how it is represented in this space, so context influences the choice of the appropriate meaning for a given task. The Transformer performs a small, constant number of steps, applying a self-attention mechanism at each one. This mechanism models the relations between words independently of their position, thus reducing the number of steps needed to determine a target word. To compute the next representation of a word, an attention score is computed against all the words in the sentence, weighting the contribution of each one. An encoder reads an input sentence to generate a representation which is later used by a decoder to produce the output sentence word by word. New representations are generated at each step in parallel for all words. The decoder applies self-attention over the words generated so far and also uses the final representations produced by the encoder to produce a single word each time.
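The attention scores described above can be illustrated with a minimal sketch of scaled dot-product self-attention. This is a simplification under stated assumptions: the queries, keys and values are the input vectors themselves (no learned projections, no multiple heads, no positional encoding), and the two-dimensional toy vectors are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Assumption: a real Transformer first applies learned linear
    projections to obtain Q, K and V; they are omitted here.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # One score per word in the sentence, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # New representation: weighted sum of all word vectors.
        outputs.append([
            sum(row[j] * vectors[j][i] for j in range(len(vectors)))
            for i in range(dim)
        ])
    return outputs, weights

# Toy "sentence" of three words with hypothetical 2-dimensional embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sentence)
```

Each row of `weights` says how much every word contributes to the new representation of one word, and all rows can be computed independently, which is what allows the parallel processing mentioned above.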

Word embeddings
Word embeddings are vector representations of words. This representation is denser and more expressive than discrete atomic symbols and one-hot vectors, and it is used in many NLP applications. Based on the hypothesis that words appearing in the same contexts share semantic meaning, this continuous vector space representation groups semantically similar words together. Arithmetic operations can be performed with these embeddings in order to find analogies between pairs of words with the pattern "A is to B what C is to D" (Mikolov et al., 2013), for example for nouns, such as countries and their respective capitals, or for the conjugations of verbs. While there are many techniques for extracting word embeddings, in this work we use Global Vectors, or GloVe (Pennington et al., 2014). GloVe is an unsupervised method for learning word embeddings. This count-based method uses statistical information about word co-occurrences in a given corpus to train a vector space in which each vector corresponds to a word and its values describe the semantic relations of that word.
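The analogy pattern "A is to B what C is to D" can be sketched with vector arithmetic and cosine similarity. The three-dimensional vectors below are invented for illustration and are not taken from any trained model; the function name `analogy` is likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, embeddings):
    """Return the word D such that A is to B what C is to D.

    The target vector is B - A + C; the answer is the closest
    remaining word by cosine similarity.
    """
    target = [
        bi - ai + ci
        for ai, bi, ci in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Hypothetical toy embeddings (assumption: real GloVe vectors are learned
# from corpus statistics and have hundreds of dimensions).
toy = {
    "spain": [1.0, 0.0, 0.2],
    "madrid": [1.0, 1.0, 0.2],
    "france": [0.0, 0.0, 1.0],
    "paris": [0.0, 1.0, 1.0],
    "banana": [0.5, -1.0, 0.1],
}
result = analogy("spain", "madrid", "france", toy)
```

The same arithmetic is what exposes stereotypical analogies in embeddings trained on biased corpora, which motivates the debiasing methods discussed in this paper.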

Debiasing word embeddings
The presence of biases in word embeddings has arisen as a topic of discussion about fairness. More specifically, gender stereotypes are learned from human-generated corpora, as shown by Bolukbasi et al. (2016). Several debiasing approaches have been proposed. Debiaswe is a post-processing method for debiasing previously generated embeddings (Bolukbasi et al., 2016). GN-GloVe is a method for generating gender-neutral embeddings (Zhao et al., 2018b). The main ideas behind these algorithms are described next.
Debiaswe (Bolukbasi et al., 2016) consists of two main parts. First, the direction of the embedding space in which the bias is present is identified. Second, gender-neutral words are neutralized so that their component in this direction is zero, and sets of gender-specific words are equalized so that each neutral word is equidistant to all the words in a set. The disadvantage of the first part of the process is that it can remove valuable information in the embeddings about semantic relations between words with several meanings that are not related to the bias being treated.
GN-GloVe (Zhao et al., 2018b) is an algorithm for learning gender-neutral word embedding models. It modifies GloVe so that protected attributes are captured in a specific dimension of the embeddings. For a protected attribute like gender, the minimization objective has two components: the first captures word proximity as in GloVe (Pennington et al., 2014), and the second restricts gender information to a specific dimension so that the remaining dimensions are neutral. A set of seed male and female words is used to define the metrics for the optimization, and a set of gender-neutral words is used to restrict neutral words in the gender direction.
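The neutralize and equalize steps can be sketched as follows. This is a simplified illustration, not the Debiaswe code: it assumes unit-length vectors, a gender direction taken from a single hypothetical he-she difference rather than from several word pairs, and invented three-dimensional toy vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def scale(u, s):
    return [a * s for a in u]

def unit(u):
    return scale(u, 1.0 / math.sqrt(dot(u, u)))

def project(u, g):
    """Component of u along the unit-length gender direction g."""
    return scale(g, dot(u, g))

def neutralize(w, g):
    """Remove the gender component of a neutral word and renormalize."""
    return unit(sub(w, project(w, g)))

def equalize(pair, g):
    """Make a gendered pair differ only along g, symmetrically."""
    a, b = pair
    mu = scale(add(a, b), 0.5)
    nu = sub(mu, project(mu, g))  # shared, gender-free part
    coef = math.sqrt(max(0.0, 1.0 - dot(nu, nu)))
    equalized = []
    for w in pair:
        gender_part = unit(sub(project(w, g), project(mu, g)))
        equalized.append(add(nu, scale(gender_part, coef)))
    return tuple(equalized)

# Hypothetical toy vectors (assumption, for illustration only).
he = unit([1.0, 0.3, 0.1])
she = unit([-1.0, 0.2, 0.3])
doctor = unit([0.4, 0.8, 0.2])

g = unit(sub(he, she))  # simplified gender direction
doctor_n = neutralize(doctor, g)
he_e, she_e = equalize((he, she), g)
```

After both steps the neutral word has no component along `g`, and its similarity to `he_e` and to `she_e` is the same, which is the equidistance property described above.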

Related work
There are studies on the presence of biases in many NLP applications. Word embeddings can learn biases from human-generated corpora. Bolukbasi et al. (2016) showed that stereotypical analogies are present in word embeddings both for gender and race. Caliskan et al. (2017) also found strong gender and racial bias in pre-trained embeddings and proposed a method for measuring bias in word embeddings. Zhao et al. (2018b) proposed GN-GloVe, an algorithm to generate gender-neutral word embeddings. The approach is to restrict gender information to certain dimensions in order to keep the remaining ones free of this attribute. Zhao et al. (2018a) show that the sexism present in a coreference resolution system is due to its word embedding components. Applications that use these embeddings, such as curriculum filtering, may discriminate against candidates because of their gender. The amplification of biases in downstream applications is also a concerning problem that can enlarge the gap between genders, for example in search engines for professions, where candidates may be discriminated against on the basis of their names because of the algorithm's bias towards a specific gender, thus further broadening gender inequality in a given field. Zhao et al. (2017) show that gender bias is learned and amplified in models trained on data sets containing web images used in language modelling tasks. As an example, the word "cooking" is more likely to be related to females than to males, and this association can be further amplified. Park et al. (2018) study the reduction of such biases in abusive language detection. These models have a strong bias towards words that identify gender because of the data sets on which they are trained. Sentences that do not necessarily show sexism are detected as false positives, compromising the robustness of the models. Debiased word embeddings combined with gender-swap data augmentation is the most effective method for reducing gender bias for this task.
Prates et al. (2018) perform a case study on gender bias in machine translation. They build a test set consisting of a list of jobs and gender-specific sentences. Using English as the target language and a variety of gender-neutral languages as the source, i.e. languages that do not explicitly give gender information about the subject, they test these sentences on the translation service Google Translate. They find that occupations related to science, engineering and mathematics present a strong stereotype toward male subjects. However, in late 2018, Google announced in their developers blog that efforts are being put into providing gender-specific translations in Google Translate, so that it gives both the feminine and the masculine translation when translating from gender-neutral languages.

Methodology
In this section, we describe the methodology used for this study. In order to study the translation system, the first layer of both the encoder and the decoder in the Transformer (Vaswani et al., 2017), where the word embeddings are trained, is adapted to use pre-trained word embeddings. Then, we train the system with different pre-trained word embeddings (based on GloVe (Pennington et al., 2014)) to obtain a set of models. The scenarios are the following:
• No pre-trained word embeddings, i.e. they are learned within the training of the model.
• Pre-trained word embeddings which are learned from the same corpus.
• Pre-trained word embeddings which are learned from the same corpus using a debiasing algorithm.
Also, the models with pre-trained embeddings given to the Transformer have two cases: in the first, pre-trained embeddings are used in both the encoder and decoder sides (see Figure 1); in the second, they are used only in the encoder side (see Figure 2).
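As a rough illustration of what initializing an embedding layer from pre-trained vectors involves, the sketch below parses vectors in the plain-text format written by GloVe (one word per line followed by its values) and builds an initial embedding matrix for a model vocabulary. It is not the tooling used in the experiments; the function names, the toy vectors and the random initialization range for missing words are assumptions.

```python
import random

def parse_glove_lines(lines):
    """Parse 'word v1 v2 ... vd' lines into a word -> vector dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """One row per vocabulary word, in vocabulary order.

    Words found in the pre-trained set copy their vector; the rest get
    small random values (assumption: the range [-0.1, 0.1] is arbitrary).
    """
    rng = random.Random(seed)
    matrix, found = [], 0
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
            found += 1
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, found

# Toy 3-dimensional vectors; the experiments in this paper use 512.
glove_text = ["friend 0.1 0.2 0.3", "doctor 0.4 0.5 0.6"]
vocab = ["<unk>", "friend", "doctor", "nurse"]
pretrained = parse_glove_lines(glove_text)
matrix, found = build_embedding_matrix(vocab, pretrained, dim=3)
```

The resulting matrix would be the starting point of the embedding layer; whether it is then updated or kept fixed during training is a separate choice.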
To evaluate the performance of the models we use the BLEU metric (Papineni et al., 2002). This metric gives a score for a predicted translation set compared to its expected output.
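For intuition about how BLEU compares a predicted translation set with its expected output, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence, n-grams up to length four, no smoothing and whitespace tokenization. Scores reported in practice come from standard evaluation scripts, whose tokenization and details differ from this sketch; the example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as many
            # times as it occurs in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram])
                for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(
        math.log(m / t) for m, t in zip(matches, totals)
    ) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * penalty * math.exp(log_precision)

reference = "mi amiga es doctora en madrid".split()
exact = corpus_bleu([reference], [reference])
partial = corpus_bleu(["mi amigo es doctora en madrid".split()], [reference])
```

Note that a single wrong word such as "amigo" for "amiga" lowers the score only moderately, which is one reason a dedicated test set is needed to observe gender bias.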
Experimental framework

Corpora
The language pair used for the experiments is English-Spanish. A large training set of 16,554,790 sentences has been used for training the model. The validation and test sets are newstest2012 and newstest2013, respectively, from the Workshop on Machine Translation (WMT), which comprise 3,003 and 3,000 sentences. See Table 2.
To study gender bias, we have developed an additional test set with custom sentences to evaluate the quality of the translation in the models. We built this test set using the sentence pattern "I've known {her, him, <proper noun>} for a long time, my friend is {a, an} <occupation>" for a list of occupations from different professional areas. We refer to this test as the Occupations test; its size is also listed in Table 2 and sample sentences from this set are in Table 1. We use Spanish proper names to reduce ambiguity in this particular test. These sentences are tokenized before being used in the test. With these test sentences we see how "friend" is translated into its Spanish equivalent "amiga" or "amigo", which are the feminine and masculine forms, respectively. Note that we are formulating sentences with an ambiguous word, "friend", that can be translated into either of the two words, and we are adding context in the same sentence so that the system has enough information to translate it correctly. The list of occupations is from the U.S. Bureau of Labor Statistics, which also includes statistical data on gender and race for most professions. We use a pre-processed version of this list from Prates et al. (2018).
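The construction of the Occupations test from the sentence pattern can be sketched as below. The subject names and the three occupations are hypothetical placeholders, not the actual lists used in the study, and the vowel-based choice between "a" and "an" is a simple heuristic that would need exceptions for a full occupation list.

```python
TEMPLATE = "I've known {subject} for a long time, my friend is {article} {occupation}"

def article_for(occupation):
    """Heuristic: 'an' before a written vowel, 'a' otherwise.

    Assumption: good enough for the toy list below; words such as
    'hourly worker' would need special handling.
    """
    return "an" if occupation[0].lower() in "aeiou" else "a"

def build_sentences(subjects, occupations):
    """Fill the pattern for every subject and occupation combination."""
    return [
        TEMPLATE.format(
            subject=subject,
            article=article_for(occupation),
            occupation=occupation,
        )
        for subject in subjects
        for occupation in occupations
    ]

# Hypothetical placeholders for illustration only.
subjects = ["her", "him", "María", "Juan"]
occupations = ["nurse", "doctor", "engineer"]
sentences = build_sentences(subjects, occupations)
```

Every sentence contains the ambiguous word "friend" together with a gender cue, so the expected Spanish form ("amiga" or "amigo") is known for each generated sentence.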

Models
The word embeddings are trained from the same corpus, using GloVe (Pennington et al., 2014) and GN-GloVe (Zhao et al., 2018b). The dimension of the vectors is set to 512 as standard and kept throughout all the experiments in this study. The parameter values for training the word embedding models with the GloVe and GN-GloVe methods are listed in Table 3.
Debiaswe (Bolukbasi et al., 2016) is a debiasing post-process performed on trained embeddings. Instead of having parameters for learning the representation, it uses sets of words to define the gender direction and to neutralize and equalize the bias in the word vectors. Three sets of words are used in the Debiaswe algorithm. One set of ten pairs of words, such as woman-man, girl-boy, she-he..., is used to define the gender direction. Another set of 218 gender-specific words, such as aunt, uncle, wife, husband..., is used for learning a larger set of gender-specific words. Finally, a set of crowd-sourced male-female equalization pairs, such as dad-mom, boy-girl, grandpa-grandma..., that represent the gender direction is equalized in the algorithm. For the Spanish side, the sets are adapted for the task and slightly modified to avoid unclear words from the English language or unnecessary repetitions. The sets from GN-GloVe are similarly adapted to the Spanish language.
The architecture used to train the models for the translation task is the Transformer (Vaswani et al., 2017). The performance of the model is evaluated by its BLEU score (Papineni et al., 2002). The parameter values used in the Transformer are the same as those proposed in the baseline provided by the OpenNMT toolkit and are listed in Table 4. OpenNMT has built-in tools for training with pre-trained embeddings. The pre-trained embeddings were obtained with the corresponding GitHub repositories of GloVe, Debiaswe and GN-GloVe. The GPUs used for training are separate groups of four NVIDIA TITAN Xp and four NVIDIA GeForce GTX TITAN. The training time is approximately 3 and 5 days, respectively. In the implementation, the model is set to accumulate the gradient twice before updating the parameters. This simulates four additional GPUs during training, for a total of eight, although training takes longer.
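Why accumulating the gradient twice simulates twice as many GPUs can be seen on a toy problem: averaging the gradients of two equally sized micro-batches gives the same update as a single step on the combined batch. The one-parameter model, the data and the learning rate below are invented for illustration, and the sketch assumes gradients are averaged across accumulation steps; the toolkit's internal normalization is not shown here.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr):
    """Accumulate gradients over several micro-batches, then update once."""
    total = 0.0
    for micro_batch in micro_batches:
        total += mse_gradient(w, micro_batch)
    return w - lr * total / len(micro_batches)

# Hypothetical toy data: (x, y) pairs roughly following y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w0, lr = 0.0, 0.01

# One update on the full batch (what twice the hardware would compute).
full_batch_w = w0 - lr * mse_gradient(w0, data)

# Two accumulated half-batches followed by a single update.
accumulated_w = accumulated_step(w0, [data[:2], data[2:]], lr)
```

Both updates coincide, but the accumulated version processes the micro-batches one after the other, which is why training takes longer.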

Results
BLEU scores for the test set newstest2013 are listed in Table 5. Note that we can use the pre-trained word embeddings in the encoder, in the decoder or in both of them. We report results for the cases of using pre-trained word embeddings in both the encoder and decoder, and only in the encoder. As shown in the right part of the table, some experiments were done with fixed embeddings for the same models, preventing their values from being further updated during training. The models with these fixed pre-trained embeddings show a slight decrease in performance and are not further evaluated in the gender bias experiments. They are left in the table for comparison. The models whose pre-trained embeddings are updated during training are later used for a qualitative analysis of the impact of using debiased word embeddings.
For the cases studied, values do not differ much and are all around 30 BLEU points. Using pre-trained embeddings for initialization seems to slightly improve the translation, consistent with previous studies (Qi et al., 2018). Furthermore, debiasing word embeddings keeps this improvement or even increases it when using GN-GloVe in the encoder and the decoder. In all cases, BLEU does not decrease, except for the case of GN-GloVe (only Enc). In the next subsection, we show how each of the models performs on gender debiasing, but we want to underline that these models do not decrease the quality of translation in terms of BLEU when tested on a standard MT task.
Qualitative analysis is performed on the Occupations test set, examples of which are shown in Table 1. This test set has context information to translate the ambiguous word "friend" either into "amigo" or "amiga". The lower the bias in the system, the better the system will be able to translate to the correct gender. See Table 6 for the percentages of how "friend" is translated for each model. Note that we are using "updated" embeddings in all cases, since they give the best quality systems in Table 5. On the one hand, the system translates the masculine pronoun at almost 100% accuracy for all models, while not all occupations are well translated. On the other hand, for the feminine pronoun, the accuracy is not as high as for its counterpart and it shows a slight decrease for all models. For the case using a proper noun (instead of the pronoun as coreference of "friend") such as "John" or "Mary", the accuracy of this task shows further biases. However, it is worth mentioning that proper names can induce other kinds of bias, such as racial stereotypes. Note that gender debiasing is achieved by increasing the percentage of "amiga" in the translation in the presence of the female pronoun while keeping the quality of translation (consistent with the generic results in Table 5). The most neutral system is achieved with GloVe and Debiaswe pre-trained embeddings that are also updated during training. The improvement is over 80% compared to the baseline system and over 4% compared to the non-debiased pre-trained word embeddings.
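The percentages of how "friend" is translated can be computed with a simple count over the system outputs. The sketch below is an assumed procedure, not the evaluation script of the study, and the four Spanish outputs are invented examples for sentences containing the feminine pronoun.

```python
import re

def friend_gender(translation):
    """Classify a Spanish output by the form used for 'friend'."""
    tokens = re.findall(r"\w+", translation.lower())
    if "amiga" in tokens:
        return "amiga"
    if "amigo" in tokens:
        return "amigo"
    return "other"

def gender_percentages(translations):
    """Percentage of outputs using each form of 'friend'."""
    counts = {"amiga": 0, "amigo": 0, "other": 0}
    for translation in translations:
        counts[friend_gender(translation)] += 1
    total = len(translations)
    if total == 0:
        return {form: 0.0 for form in counts}
    return {form: 100.0 * n / total for form, n in counts.items()}

# Hypothetical system outputs (assumption, for illustration only).
outputs = [
    "La conozco desde hace mucho tiempo, mi amiga es enfermera",
    "La conozco desde hace mucho tiempo, mi amigo es médico",
    "La conozco desde hace mucho tiempo, mi amiga es ingeniera",
    "La conozco desde hace mucho tiempo, mi amiga es contable",
]
percentages = gender_percentages(outputs)
```

Since the source sentences here all contain "her", the share of "amiga" is the accuracy for the feminine pronoun, and comparing it across models shows how much of the bias has been equalized.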

Conclusions and further work
Biases learned from human-generated corpora in natural language processing applications are a topic that has been gaining relevance over the last few years. Specifically, for machine translation, studies quantifying the gender bias present in news corpora and proposing debiasing approaches for word embedding models have shown improvements on this matter. We studied the impact of gender debiasing on neural machine translation. We trained sets of word embeddings with the standard GloVe algorithm. Then, we debiased the embeddings using Debiaswe (Bolukbasi et al., 2016) and also trained a gender-neutral version with GN-GloVe (Zhao et al., 2018b). We used all these different models in the Transformer (Vaswani et al., 2017). Experiments were reported using these word embeddings on both the encoder and decoder sides, or only on the encoder side. The models were evaluated using the BLEU metric on the standard task of the WMT newstest2013 test set. BLEU performance was similar when using biased and debiased models, and even improved when using the latter.
To study the bias in the translations of the models, we developed a specific test set. This set consists of sentences that include context on the gender of the ambiguous word "friend" in the English-to-Spanish translation. This word can be translated as feminine or masculine, and the proper translation has to be derived from context. Our hypothesis is that if the translation system is gender biased, the context will be disregarded, while if the system is neutral, the translation will be correct (since it has the information of gender in the sentence). Results show that the male pronoun is always identified, despite not all occupations being well translated, while the female pronoun has a different ratio of appearance for different models. We achieve almost 100% translation accuracy on these pronouns when using the debiased word embeddings of Debiaswe on both the encoder and decoder. Also, this system slightly improves the BLEU performance over the baseline translation system. Therefore, we are "equalizing" the translation while keeping its quality. Experimental material from this paper is available on GitHub. As already mentioned, this is the first work proposing gender-debiased translation systems, and future work on the topic of debiasing translation algorithms is required.
In this paper, we studied gender as a bias in machine translation. However, other social constructs and stereotypes may be present in corpora, whether individually or combined, such as race, religious beliefs or age; these are just a small subset of the possible biases, which will present new challenges for fairness both in ML and MT. For the task at hand, the type of corpora used for the study is commonly related to news articles. Human corpora cover a broad spectrum of categories, for instance industrial, medical, legal..., and other biases particular to each area may present interesting problems. Also, other language pairs with different degrees of specifying gender information in their written or spoken communication could be studied for the evaluation of debiasing in machine translation.