Measuring Biases of Word Embeddings: What Similarity Measures and Descriptive Statistics to Use?

Word embeddings are widely used in Natural Language Processing (NLP) for a vast range of applications. However, it has been consistently shown that these embeddings reflect the same human biases that exist in the data used to train them. Most bias indicators proposed for word embeddings are average-based and rely on the cosine similarity measure. In this study, we examine the impact of different similarity measures, as well as descriptive statistics other than averaging, on measuring the biases of contextual and non-contextual word embeddings. We show that the extent of revealed biases in word embeddings depends on the descriptive statistics and similarity measures used to measure the bias. We found that over the ten categories of word embedding association tests, Mahalanobis distance reveals the smallest bias, and Euclidean distance reveals the largest bias in word embeddings. In addition, the contextual models reveal less severe biases than the non-contextual word embedding models.


Introduction
Word embedding models including Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), BERT (Devlin et al., 2018), ELMo (Peters et al., 2018), and GPT (Radford et al., 2018) have become popular components of many NLP frameworks and are widely used in downstream tasks. However, these word representations preserve not only the statistical properties of human language but also the human-like biases that exist in the data used to train them (Bolukbasi et al., 2016; Caliskan et al., 2017; Kurita et al., 2019; Basta et al., 2019; Gonen and Goldberg, 2019). It has also been shown that such biases propagate to downstream NLP tasks and negatively affect their performance (May et al., 2019; Leino et al., 2018). Other studies investigate how to mitigate the biases of word embeddings (Liang et al., 2020; Ravfogel et al., 2020).
Different approaches have been used to present and quantify corpus-level biases of word embeddings. Bolukbasi et al. (2016) proposed measuring the gender bias of word representations in Word2Vec and GloVe by calculating projections onto the principal components of the differences between the embeddings of a list of male and female word pairs. Basta et al. (2019) adapted the "gender direction" idea of Bolukbasi et al. (2016) to contextual word embeddings such as ELMo: first, the gender subspace of the ELMo vector representations is calculated, and then the presence of gender bias in ELMo is identified. Gonen and Goldberg (2019) introduced a new gender bias indicator based on the percentage of socially biased terms among the k-nearest neighbors of a target term and demonstrated its correlation with the gender direction indicator. Caliskan et al. (2017) developed the Word Embedding Association Test (WEAT) to measure bias by comparing two sets of target words with two sets of attribute words and documented that Word2Vec and GloVe contain human-like biases such as gender and racial biases. May et al. (2019) generalized the WEAT to phrases and sentences by inserting individual words from the WEAT tests into simple sentence templates and applied it to contextual word embeddings. Kurita et al. (2019) proposed a new method to quantify bias in BERT embeddings based on its masked language model objective using simple template sentences. For each attribute word, using a simple template sentence, the normalized probability that BERT assigns to that sentence for each of the target words is calculated, and the difference is taken as the measure of the bias. Kurita et al. (2019) demonstrated that this probability-based method for quantifying bias in BERT was more effective than the cosine-based method.
Motivated by these recent studies, we comprehensively investigate different methods for bias exposure in word embeddings. In particular, we investigate how different similarity measures and descriptive statistics affect the measured degree of association between the target sets and attribute sets in the WEAT. First, in addition to cosine similarity, we study Euclidean, Manhattan, and Mahalanobis distances to measure the degree of association between a single target word and a single attribute word. Second, in addition to averaging, we investigate minimum, maximum, median, and a discrete (grid-based) optimization approach that finds the minimum possible association to report between a single target word and the two attribute sets in each of the WEAT tests. We consistently compare these bias measures for different types of word embeddings, including non-contextual (Word2Vec, GloVe) and contextual ones (BERT, ELMo, GPT, GPT2).

Method
The Implicit Association Test (IAT) was first introduced by Greenwald et al. (1998a) in psychology to demonstrate the enormous differences in response time when participants are asked to pair two concepts they deem similar, in contrast to two concepts they find less similar. For example, when subjects are encouraged to work as quickly as possible, they are much more likely to label flowers as pleasant and insects as unpleasant. In the IAT, being able to pair a concept with an attribute quickly indicates that the concept and attribute are linked in the participants' minds. The IAT has been widely used to measure and quantify the strength of a range of implicit biases and other phenomena, including attitudes and stereotype threat (Karpinski and Hilton, 2001; Kiefer and Sekaquaptewa, 2007; Stanley et al., 2011).
Inspired by the IAT, Caliskan et al. (2017) introduced the WEAT to measure the associations between two sets of target concepts and two sets of attributes in word embeddings learned from large text corpora. A hypothesis test is conducted to demonstrate and quantify the bias. The null hypothesis states that there is no difference between the two sets of target words in terms of their relative distance/similarity to the two sets of attribute words. A permutation test is performed to measure the likelihood of the null hypothesis. This test computes the probability that a random permutation of the target words would produce a greater difference than the observed difference. Let X and Y be two sets of target word embeddings and A and B be two sets of attribute embeddings. The test statistic is defined as:
$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B),$$
where
$$s(w, A, B) = f_{a \in A}\big(s(\vec{w}, \vec{a})\big) - f_{b \in B}\big(s(\vec{w}, \vec{b})\big). \quad (1)$$
Here $f(\cdot)$ is a descriptive statistic and $s(\vec{w}, \vec{a})$ is a similarity measure between two embeddings. In other words, s(w, A, B) quantifies the association of a single word w with the two sets of attributes, and s(X, Y, A, B) measures the differential association of the two sets of targets with the two sets of attributes. Denoting all the partitions of $X \cup Y$ into two sets of equal size by $\{(X_i, Y_i)\}_i$, the one-sided p-value of the permutation test is:
$$\Pr_i\big[s(X_i, Y_i, A, B) > s(X, Y, A, B)\big].$$
The magnitude of the association of the two target sets with the two attribute sets can be measured with the effect size:
$$d = \frac{\mathrm{mean}_{x \in X}\, s(x, A, B) - \mathrm{mean}_{y \in Y}\, s(y, A, B)}{\mathrm{std\text{-}dev}_{w \in X \cup Y}\, s(w, A, B)}.$$
It is worth mentioning that d measures how separated two distributions are and is essentially the standardized difference of the means of the two distributions (Cohen, 2013). Controlling for significance, a larger effect size reflects a more severe bias.
WEAT and almost all the other studies inspired by it (Garg et al., 2018; Brunet et al., 2018; Gonen and Goldberg, 2019; May et al., 2019) use the following approach to measure the association of a single target word with the two sets of attributes (equation 1). First, they use cosine similarity to measure the target word's similarity to each word in the attribute sets. Then, they calculate the average of the similarities over each attribute set.
In this paper, we investigate the impact of using min(·), mean(·), median(·), or max(·) as the function f(·) in equation (1) (originally, only mean(·) has been used). In addition to cosine similarity, we also consider Euclidean and Manhattan distances as well as the following measures for $s(\vec{w}, \vec{a})$ in equation (1).
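Mahalanobis distance: introduced by P. C. Mahalanobis (Mahalanobis, 1936), this distance measures the distance of a point from a distribution:
$$s(\vec{w}, \vec{a}) = \sqrt{(\vec{w} - \vec{a})^{\top}\, \Sigma_A^{-1}\, (\vec{w} - \vec{a})},$$
where $\Sigma_A$ is the covariance matrix of the attribute set A to which $\vec{a}$ belongs. It is worth noting that the Mahalanobis distance takes into account the distribution of the set of attributes while measuring the association of the target word w with an attribute vector.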
Discrete optimization of the association measure: In equation (1), s(w, A, B) quantifies the association of a single target word w with the two sets of attributes. To quantify the minimum possible association of a target word w with the two sets of attributes, we first calculate the distance of w from all attribute words in A and B, then calculate all possible differences and find the minimum difference.
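The alternatives described above can be sketched as interchangeable pieces of the single-word association. The following is an illustrative sketch under stated assumptions, not the authors' code: distances are negated so that a larger score always means a stronger association (the text does not specify how distances are oriented), the "minimum difference" is read as the smallest absolute difference over every (a, b) pair, and all names and toy vectors are hypothetical. The Mahalanobis distance is omitted here because it additionally needs a covariance estimate.

```python
import itertools
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def euclidean(u, v):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Assumption: negate distances so that larger scores mean "more associated".
SCORES = {
    "cosine": cosine,
    "euclidean": lambda u, v: -euclidean(u, v),
    "manhattan": lambda u, v: -manhattan(u, v),
}

# Candidate descriptive statistics for f in the single-word association.
STATISTICS = {
    "min": min,
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
}


def association(w, A, B, score, f):
    """f over scores against A minus f over scores against B."""
    return f([score(w, a) for a in A]) - f([score(w, b) for b in B])


def min_possible_association(w, A, B, score):
    """Grid search over all (a, b) pairs for the smallest absolute difference."""
    sa = [score(w, a) for a in A]
    sb = [score(w, b) for b in B]
    return min(abs(x - y) for x, y in itertools.product(sa, sb))


# Hypothetical toy vectors.
w = [1.0, 0.0]
A = [[1.0, 0.0], [0.8, 0.6]]
B = [[0.0, 1.0], [0.6, 0.8]]

table = {
    (m, s): association(w, A, B, score, f)
    for m, score in SCORES.items()
    for s, f in STATISTICS.items()
}
optimized = {m: min_possible_association(w, A, B, score) for m, score in SCORES.items()}
```

On this toy input, every entry of `table` is positive, while `optimized` picks out the closest pair of scores, which is why it tends to report the weakest association.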

Biases studied
We studied all ten bias categories introduced in the IAT (Greenwald et al., 1998a) and replicated in the WEAT to measure the biases in word embeddings. The ten WEAT categories are briefly introduced in Table 1. For more details and examples of target and attribute words, please see Appendix A. Although WEAT 3 to 5 have the same names, they have different target and attribute words.
Table 1: The associations studied in the WEAT.
As described in section 2, we need each attribute set's covariance matrix to compute the Mahalanobis distance. To obtain a stable covariance estimate despite the high dimensionality of the embeddings, we first created larger attribute sets by adding synonym terms. Next, because the number of samples in each attribute set is smaller than the number of features, we estimated sparse covariance matrices. To enforce sparsity, we chose the l1 penalty using k-fold cross-validation with k=3.
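To make the role of the covariance matrix concrete, the sketch below computes a Mahalanobis distance between a target vector and an attribute vector in standard-library Python. It is a simplified illustration under loud assumptions: the small `ridge` term added to the diagonal is only a stand-in that keeps the matrix invertible when there are fewer samples than dimensions, and it does not reproduce the l1-penalized sparse estimate with cross-validation described above. All names and the toy data are hypothetical.

```python
import math


def covariance(rows, ridge=0.0):
    """Sample covariance of `rows` (one vector per row) plus ridge * identity."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    for i in range(dim):
        cov[i][i] += ridge  # assumption: ridge regularization, not the l1 penalty
    return cov


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def mahalanobis(w, a, cov):
    """sqrt((w - a)^T cov^{-1} (w - a)), computed without forming the inverse."""
    diff = [wi - ai for wi, ai in zip(w, a)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


# Hypothetical toy attribute set (4 vectors in 2 dimensions) and target word.
A = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
w = [1.0, 3.0]
cov_A = covariance(A, ridge=1e-3)
score = mahalanobis(w, A[0], cov_A)
```

With an identity covariance matrix, the function reduces to the Euclidean distance, which is a convenient sanity check.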

Results of experiments
We examined the 10 different types of biases in the WEAT (Table 1) for the word embedding models listed in Table 2. We used publicly available pre-trained models. For contextual word embeddings, we used single-word sentences as input instead of the simple template sentences used in other studies (May et al., 2019; Kurita et al., 2019). Simple template sentences such as "this is TARGET" or "TARGET is ATTRIBUTE" do not really provide any context that would reveal the contextual capability of embeddings such as BERT or ELMo. This way, the comparisons between the contextual and non-contextual embeddings are fairer, as both receive only the target or attribute terms as input. For each model, we performed the WEAT tests using the four similarity metrics mentioned in section 2: cosine, Euclidean, Manhattan, and Mahalanobis. For each similarity metric, we also used min(·), mean(·), median(·), or max(·) as f(·) in equation (1). In addition, as explained in section 2, we discretely optimized the association measure and found the minimum association in equation (1). In these experiments (Table 3 and Table 4), larger and more significant effect sizes imply more severe biases.
Impacts of different descriptive statistics: Our first goal was to report the changes in the measured biases when we change the descriptive statistics. The effect sizes ranged from 0.00 to 1.89 (µ = 0.65, σ = 0.5). Our findings show that the mean is better at revealing biases, as it yields the most cases of significant effect sizes (µ = 0.8, σ = 0.52) across models and distance measures. The median is close to the mean, with (µ = 0.74, σ = 0.48) over all its effect sizes. The effect sizes for the minimum (µ = 0.68, σ = 0.48) and the maximum (µ = 0.65, σ = 0.48) are close to each other, but smaller than those for the mean and the median.
The discretely optimized association measure (Eq. 2) provides the smallest effect sizes (µ = 0.39, σ = 0.3) and reveals the fewest implicit biases. These differences, which result from applying different descriptive statistics in the association measure (Eq. (1)), show that the revealed biases depend on the statistic applied to measure the bias.
For example, for Word2Vec with cosine similarity, if we change the descriptive statistic from mean to minimum, the biases for WEAT 3 and WEAT 4 become insignificant (no bias will be reported). In another example, for the GPT model, while the results of mean cosine are not significant for WEAT 3 and WEAT 4, they become significant for median cosine. Moreover, for almost all models, the effect size of the discretely optimized minimum distance is not significant. Our intention in considering this statistic was to report the minimum possible association of a target word with the attribute sets. If this measure is used for reporting biases, one can misleadingly claim that there is no bias.
Impacts of different similarity measures: The effect sizes for cosine, Manhattan, and Euclidean are close to each other and greater than those for the Mahalanobis distance (cosine: µ = 0.72, σ = 0.49; Euclidean: µ = 0.67, σ = 0.5; Manhattan: µ = 0.63, σ = 0.48; Mahalanobis: µ = 0.58, σ = 0.45). The Mahalanobis distance also detects the fewest significant bias types across all models. As an example, while the mean and median effect sizes for WEAT 3 and WEAT 5 in GloVe and Word2Vec are mostly significant for cosine, Euclidean, and Manhattan, the same results are not significant for the Mahalanobis distance. That means that with the Mahalanobis distance as the measure of bias, no bias will be reported for the WEAT 3 and WEAT 5 tests. This emphasizes the importance of the chosen similarity measure in detecting biases of word embeddings. More importantly, as the Mahalanobis distance considers the distribution of attributes in measuring the distance, it may be a better choice than the other similarity measures for measuring and revealing biases.
Biases in different word embedding models: Using any combination of descriptive statistics and similarity measures, all the contextualized models have less significant biases than GloVe and Word2Vec. Table 3 reports the number of tests with significant implicit biases out of the 10 WEAT tests, along with the mean and standard deviation of the effect sizes, for all embedding models. The complete list of effect sizes along with their p-values is provided in Table 4. Following our findings in the previous sections, we choose mean of Euclidean to reveal biases. By doing so, GloVe and Word2Vec show the largest numbers of significant biases, with 9 and 7 significant biases out of the 10 WEAT categories, respectively (Table 3). Using mean of Euclidean, our results confirm all the results of Caliskan et al. (2017), who used mean of cosine in all WEAT tests. The difference is that with the mean of Euclidean measure, the biases are revealed as being more severe (smaller p-values). Using mean of Euclidean, GPT and ELMo show the fewest implicit biases. The GPT model shows bias in WEAT 2, 3, and 5. ELMo's significant biases are in WEAT 1, 3, and 6. Using mean of Euclidean, almost all models (except for ELMo) confirm the existence of a bias in WEAT 3 to 5. Moreover, none of the contextualized models found a bias in associating female with arts and male with science (WEAT 7), mental diseases with temporary attributes and physical diseases with permanent attributes (WEAT 9), or young people's names with pleasant attributes and old people's names with unpleasant attributes (WEAT 10).
Table 3: Number of revealed biases out of the 10 WEAT bias types for the studied word embeddings along with the (µ, σ) of their effect sizes. The larger the effect size, the more severe the bias.

Conclusions
We studied the impacts of different descriptive statistics and similarity measures on association tests for measuring biases in contextualized and non-contextualized word embeddings. Our findings demonstrate that the detected biases depend on the choice of association measure. Based on our experiments, the mean reveals more severe biases and the discretely optimized version reveals fewer severe biases. In addition, cosine distance reveals more severe biases and the Mahalanobis distance reveals less severe ones. Reporting biases with the mean of Euclidean distances identifies more severe biases in the models, while reporting them with the mean of Mahalanobis distances identifies less severe ones. Furthermore, contextual models show fewer biases than the non-contextual ones across all 10 WEAT tests, with GPT showing the fewest biases.
Table 4: WEAT effect size, *: significance at 0.01, **: significance at 0.001, ***: significance at 0.0001, ****: significance at 0.00001.