An Empirical Study of the Downstream Reliability of Pre-Trained Word Embeddings

While pre-trained word embeddings have been shown to improve the performance of downstream tasks, many questions remain regarding their reliability: Do the same pre-trained word embeddings result in the best performance after slight changes to the training data? Do the same pre-trained embeddings perform well with multiple neural network architectures? Do imputation strategies for unknown words impact reliability? In this paper, we introduce two new metrics to understand the downstream reliability of word embeddings. We find that the downstream reliability of word embeddings depends on multiple factors, including the evaluation metric, the handling of out-of-vocabulary words, and whether the embeddings are fine-tuned.


Introduction
Pre-trained word embeddings have been shown to improve neural networks' performance across a wide variety of tasks. For instance, pre-trained word embeddings improve the performance of models for text classification (Kim, 2014), relation extraction (Nguyen and Grishman, 2015), named entity recognition (Lample et al., 2016), and machine translation (Qi et al., 2018). However, neural network performance is unstable when retrained multiple times on the same dataset (Kolen and Pollack, 1991). Small changes in the training data can result in substantial differences in overall performance. Similarly, after retraining word embeddings instead of the model (i.e., to incorporate out-of-vocabulary words or capture changes in their semantic meanings), the instability of the word embeddings themselves can cause differences in downstream performance (Leszczynski et al., 2020). Yet, retraining is still important. Otherwise, performance will deteriorate over time (Kim et al., 2017). In this paper, we evaluate the downstream reliability of pre-trained word embeddings. Leszczynski et al. (2020) discuss how model instability can increase a company's cost of deploying and keeping natural language processing (NLP) pipelines in production. Let X and X̃ be two different sets of embeddings, and let g_X and g_X̃ be two models trained on X and X̃, respectively. Leszczynski et al. (2020) define instability as

I(X, X̃) = (1 / |D_T|) Σ_{x ∈ D_T} L(g_X(x), g_X̃(x)),   (1)

where D_T is a held-out test set for some task T, and L is a fixed loss function. If L is the zero-one loss, then the measure captures the fraction of predictions that disagree between models trained on each set of embeddings. They analyze the model disagreement between two sets of embeddings trained on corpora collected from the same source at different time periods. They show that slight changes made to the beginning of an NLP pipeline (word embeddings) have a considerable impact on downstream performance.
Intuitively, NLP engineers can potentially spend a lot of time searching for potential bugs caused by concept drift in word embeddings. Moreover, the time and resources required to train many state-of-the-art neural networks are growing substantially over time (Strubell et al., 2019; Schwartz et al., 2019). This trend results in a potential increase in greenhouse gases (Strubell et al., 2019), as well as an increasing social cost to participate in natural language processing research (Schwartz et al., 2019).
In many scenarios, it is expected that different sets of embeddings may result in better performance, e.g., when training on new domains (Moen and Ananiadou, 2013). Thus, there are many aspects of the reliability and stability of the NLP pipeline for which exhaustive evaluation is neither cost-effective nor efficient, and which have yet to be studied. For example, instead of slight changes in word embeddings trained on the same source corpora collected at different times, what if task T's training dataset changed? What if the embeddings are the same, and the model g() changes? Will the same pre-trained embeddings work if the dataset changes? Will the same embeddings work for different models? How will changes in pre-trained embeddings affect downstream fairness? Overall, if an NLP engineer does not need to evaluate every model, with every set of pre-trained word embeddings, for every question of interest, both the computational cost of putting models into production and the engineer's time spent evaluating model variations may be reduced.
Toward addressing the above methodological gaps, this paper presents the following contributions: (1.) We introduce two new metrics to measure the downstream reliability of word embeddings: Cross-Dataset Reliability (CDR) and Cross-Model Reliability (CMR); (2.) using our reliability measures, we provide a comprehensive analysis of the downstream reliability of word embeddings on three offensive language datasets, multiple standard classification and fairness metrics, and multiple neural network architectures; and (3.) we provide an in-depth discussion about our findings, their implications, as well as the limitations of our study.

Related Work
In this section, we briefly describe the main areas of research in which this paper is based: word embeddings, word embedding instability, and the downstream impact of word embeddings.
Word Embeddings. Word embeddings capture the distributional nature of words, i.e., word vectors encode the contexts in which words frequently appear. There are multiple word embedding methods, such as latent semantic analysis (Deerwester et al., 1990), Word2Vec (Mikolov et al., 2013a; Mikolov et al., 2013b), and GLOVE (Pennington et al., 2014). The basic component driving our research is that pre-trained word embeddings have been shown to be useful for a wide variety of downstream NLP tasks, such as text classification (Kim, 2014), relation extraction (Nguyen and Grishman, 2015), named entity recognition (Lample et al., 2016), and machine translation (Qi et al., 2018). So, we want to understand more about their downstream impact.
Word Embedding Instability. There have been significant findings on the instability of word embeddings (Hellrich and Hahn, 2016; Hellrich et al., 2019; Antoniak and Mimno, 2018; Burdick et al., 2018; Pierrejean and Tanguy, 2018). Hellrich and Hahn (2016) show that when investigating neighboring words in the embedding space, there is low reliability in which words surround a particular token across multiple runs. In fact, stability is only present when word vectors are trained on a small corpus with a limited vocabulary (Hellrich and Hahn, 2016). Generally, even the smallest change in the training corpus causes high variation in the nearest neighbor distances. Furthermore, these variations are not limited to low-frequency words; instability is also present in vocabulary words that occur relatively frequently (Hellrich and Hahn, 2016). Instability has been shown in multiple algorithms (e.g., Skip-Gram (Hellrich and Hahn, 2016) and SVD (Hellrich et al., 2019)), as well as in different training corpora (e.g., historical text (Hellrich and Hahn, 2016) and social media (Antoniak and Mimno, 2018)).
Instability and the Downstream Impact of Word Embeddings. Rogers et al. (2018) explored morphological, semantic, and distributional factors that are correlated with downstream performance. Their work pointed to ways of improving the downstream performance of neural architectures by modifying word embeddings. Yet, while understanding what results in better-performing embeddings is important, the instability of machine learning methods after small changes to features, word embeddings, and training data on the downstream performance of certain tasks is troubling (May et al., 2019; Leszczynski et al., 2020). If the data distribution changes rapidly, then the machine learning models need to be retrained frequently, or at least, the word embeddings need to be retrained to handle out-of-vocabulary words and their respective meanings. Burdick et al. (2018) study which factors correlate with word embedding stability. Among their findings: GLOVE (Pennington et al., 2014) is one of the most stable methods, and part-of-speech is one of the biggest factors of instability. Fard et al. (2016) analyze the instability of machine learning models. They provide a metric of model instability, called prediction churn, and a Markov chain Monte Carlo technique to reduce churn, i.e., to prevent predictions from changing dramatically. May et al. (2019) provide insight into correlations between downstream performance and the compression abilities of word embeddings. Leszczynski et al. (2020) provide the first metric for downstream instability of word embeddings, and they show that increasing the embedding dimensions can reduce embedding instability. This paper builds on the work of Leszczynski et al. (2020) by providing insight into practical downstream instability issues of word embeddings.

Methodology
In this paper, we use two definitions of reliability: cross-dataset reliability (CDR) and cross-model reliability (CMR). These definitions differ from the definition of word embedding stability given by Leszczynski et al. (2020). Let Z and Z̃ represent two independent datasets. f(Z, m) ∈ R^q is a vector of evaluation metrics (e.g., AUC, F1, accuracy, etc.) for model m (e.g., CNN) on dataset Z. The scores in f(Z, m) are the result of training model m using q different pre-trained word embeddings. Therefore, f(Z, m)_k is the result (e.g., AUC) of model m trained on dataset Z with pre-trained embeddings k. Thus, we define CDR as

CDR(Z, Z̃, m) = C(f(Z, m), f(Z̃, m)),   (2)

where C() represents the Spearman rho correlation between vectors f(Z, m) and f(Z̃, m). Intuitively, a high correlation means that the word embeddings which result in the best and worst performance for model m are similar for dataset Z and dataset Z̃. Next, CMR is defined as

CMR(Z, m, m̃) = C(f(Z, m), f(Z, m̃)),   (3)

where m̃ and m are two different models, such as an LSTM and a CNN, respectively. Contrary to CDR, the intuition behind CMR is that a high correlation value means that the embeddings which result in the best and worst performance on dataset Z for model m are similar to the results of model m̃ on dataset Z.
To improve the robustness of our measurements, each result f (Z, m) k is the average of training model m on dataset Z ten times using the k-th set of embeddings with different random seeds.
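To make the definitions concrete, the following sketch shows how CDR and CMR reduce to a Spearman rho correlation over per-embedding score vectors. This is an illustrative reimplementation, not the authors' code: the function names are ours, and Spearman rho is computed as the Pearson correlation of rank vectors, with ties sharing their average rank.

```python
def _ranks(values):
    # Average ranks (1-based); tied values share the mean of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman rho = Pearson correlation of the rank vectors.
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def cdr(scores_Z, scores_Ztilde):
    # Cross-Dataset Reliability (Eq. 2): same model m, two datasets.
    # Each list holds one score per pre-trained embedding set, same order.
    return spearman_rho(scores_Z, scores_Ztilde)

def cmr(scores_m, scores_mtilde):
    # Cross-Model Reliability (Eq. 3): same dataset Z, two models.
    return spearman_rho(scores_m, scores_mtilde)
```

Because only the ranking of embeddings matters, CDR and CMR are insensitive to absolute performance differences between datasets or models.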

Datasets
We use three English offensive language datasets in this paper: a sexist dataset (Sexist), an abusive dataset (Abusive), and a general offensive language dataset (OLID). Essentially, each dataset contains offensive language. Sexist language is a subset of abusive language (Waseem and Hovy, 2016), and abusive language is a subset of offensive language (Zampieri et al., 2019). Therefore, classifiers trained to detect general offensive language should detect both abusive and sexist tweets. The datasets were chosen because they differ slightly, but their over-arching theme is the same. This similarity is discussed for the Sexist and Abusive datasets in Park et al. (2018). Moreover, for each dataset, we use 60% for training, 20% for validation, and 20% for testing. Each dataset's test split is used, as Z and Z̃, to calculate the reliability metrics defined in Equations 2 and 3. We briefly describe each dataset below:

Sexist Tweets Dataset (Sexist). The Sexist dataset is a collection of tweets annotated as 'Sexist', 'Racist', or 'Neither'. It was originally collected and used in Waseem and Hovy (2016) and Waseem (2016). It was then used to evaluate the fairness of offensive language classifiers in Park et al. (2018). The tweets were collected by searching for words related to sexism; then, using criteria from critical race theory, the tweets were manually annotated by experts. Following Park et al. (2018), we focus on 'Sexist' tweets by removing 'Racist' tweets from the dataset. The dataset contains 14,937 total tweets, of which 3,378 are annotated with the 'Sexist' class.
Offensive Language Dataset (OLID). The OLID dataset (Zampieri et al., 2019) contains 14,100 tweets labeled using a hierarchical annotation scheme where the top level (task A) differentiates 'Offensive' and 'Not Offensive' tweets. The bottom level (task C) categorizes insults/threats as targeting an individual, group, or other. For the purposes of this paper, we only use the first level, task A, of the hierarchy. The first level contains two classes: 'Offensive' and 'Not Offensive'. This results in a total of 4,640 tweets categorized as 'Offensive'.

Word Embeddings and Imputation Strategies
We test 17 publicly available sets of embeddings that vary in terms of the source embedding algorithm, training data, and dimension. The embeddings include GLOVE (Pennington et al., 2014), FastText (Bojanowski et al., 2017), and SkipGram (Mikolov et al., 2013b) variations. The complete listing of the embeddings used in our experiments can be found in the Supplementary Material. When using pre-trained word embeddings, there are implementation-level details that can affect downstream stability and reliability. For example, how should out-of-vocabulary words (i.e., words in the training dataset, but not in the pre-trained embeddings) be handled? Should unknown words be ignored, or should they be initialized randomly? The choice between fine-tuning the embeddings or keeping them static can also impact downstream performance and reliability. Moreover, if static embeddings can achieve similar performance while also being more reliable, then the cost of training neural networks can be slightly reduced.
Overall, we make use of three imputation strategies in our experiments: "Impute", "No Impute", and "Static". For "Impute", we initialize the embeddings of words missing from the pre-trained embedding set, but that appear in the training dataset, with a random embedding drawn uniformly from the range [-0.1, 0.1]. The embeddings learned using the "Impute" strategy are fine-tuned during training. For the "No Impute" strategy, out-of-vocabulary words are ignored, but the embeddings are still fine-tuned during training. Finally, the "Static" strategy ignores out-of-vocabulary words, and the pre-trained embeddings remain static during training, i.e., the embeddings are not fine-tuned.
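As a rough sketch of the three strategies when building the embedding lookup matrix (the helper name and signature are ours, not the paper's implementation):

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, strategy="Impute", seed=0):
    """Illustrative helper: assemble per-word vectors under one of the
    three imputation strategies described above.

    vocab: list of training-set words; pretrained: dict word -> vector.
    Returns (matrix, trainable), where trainable mirrors whether the
    strategy fine-tunes embeddings ("Impute"/"No Impute") or not ("Static").
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        elif strategy == "Impute":
            # OOV words get a random vector, uniform in [-0.1, 0.1].
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
        else:
            # "No Impute" / "Static": OOV words are ignored (zero vector
            # here stands in for dropping the word entirely).
            matrix.append([0.0] * dim)
    trainable = strategy != "Static"
    return matrix, trainable
```

In a real pipeline the returned matrix would initialize a framework embedding layer, with gradient updates disabled when `trainable` is False.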

Text Classification Models
In our experiments, we explore four neural network architectures: Convolutional Neural Networks (CNN), Neural Bag-of-Words (NBoW), Long Short-Term Memory Networks (LSTM), and Gated Recurrent Units (GRU). Every model uses standard word embeddings as their input (i.e., contextual embeddings are not tested).
Convolutional Neural Networks (CNN). We use the model proposed by Kim (2014). Essentially, the model is a shallow CNN with max-over-time pooling, followed by a sigmoid output layer. Let x_i ∈ R^d represent the d-dimensional embedding of the i-th word in a document. The CNN learns to extract ngrams from text that are predictive of the downstream task. Formally, each span of s words is concatenated, [x_{j-s+1}; ...; x_j], into a contextual vector c_j ∈ R^{sd}. Next, using a rectified linear unit (Nair and Hinton, 2010) f(), the convolution operation

ĉ_j = f(W c_j + b)

is applied to each vector, where W ∈ R^{q×sd} and b ∈ R^q. Next, given the convolved context vectors [ĉ_1, ĉ_2, ..., ĉ_{n+s-1}], the CNN maps them into a fixed-size vector g using max-over-time pooling, such that ĉ_j^max represents the max value across the j-th feature map. Finally, g is passed to a sigmoid output layer. We train the CNN with filter sizes that span three, four, and five words. Furthermore, we use a total of 300 filters for each size, and the Adam optimizer (Kingma and Ba, 2014).
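For intuition, a single filter's convolution and max-over-time pooling can be written in a few lines of pure Python. This is a didactic simplification of Kim's model (one filter, no output layer), not the training code; the argument names are ours.

```python
def conv_max_pool(x, W, b, s):
    """One 1-D convolutional filter with max-over-time pooling.

    x: list of d-dimensional word vectors (the document);
    W: flat filter weights of length s*d; b: scalar bias;
    s: filter width in words. Returns the pooled scalar feature.
    """
    relu = lambda v: max(0.0, v)
    feats = []
    for j in range(len(x) - s + 1):
        # Concatenate the span of s word vectors into c_j.
        c_j = [v for word in x[j:j + s] for v in word]
        feats.append(relu(sum(w * v for w, v in zip(W, c_j)) + b))
    return max(feats)  # max-over-time pooling
```

The full model simply stacks many such filters (300 per width in our setup) and concatenates their pooled features into the vector g.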
Neural Bag-of-Words (NBoW). Unlike the CNN, which learns to extract predictive ngrams from text, the NBoW model only processes unigrams. Specifically, NBoW sums the word embeddings,

h = Σ_{i=1}^{Q} x_i,

where Q is the total number of words in an instance (e.g., a tweet). The summed vector h is passed to a sigmoid output layer. We train the NBoW model with the Adam optimizer (Kingma and Ba, 2014).
Long Short-Term Memory Networks (LSTM). While CNNs only extract informative n-grams from text, recurrent neural networks (RNNs) are able to capture long-term dependencies between words. For our RNN method, we use long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997); specifically, we use a variant introduced by Graves (2012), where i_i, f_i, and o_i represent the input, forget, and output gates. The hidden state vector h_Q of the final word in each sentence is passed to a sigmoid output layer. The LSTM model is trained with a hidden state size of 512 using the Adam optimizer (Kingma and Ba, 2014).
Gated Recurrent Units (GRU). We also explore a variant of the LSTM architecture, the GRU (Cho et al., 2014). GRUs are similar to LSTMs, but they have fewer parameters and do not have an output gate. In the GRU we use, h_i, ĥ_i, z_i, and r_i are the output, candidate activation, update gate, and reset gate vectors, respectively, and ⊙ denotes the Hadamard product, i.e., element-wise multiplication. Like the LSTM, the final output vector h_Q is passed to a fully-connected sigmoid output layer. We train the GRU model with a hidden state size of 512 using the Adam optimizer (Kingma and Ba, 2014).
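For intuition, a single GRU update in one dimension, where the Hadamard product reduces to ordinary multiplication. Gate conventions vary slightly across papers; this sketch follows the common Cho et al. (2014) form, with scalar weights, biases omitted, and weight names of our choosing.

```python
import math

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One scalar GRU step: z is the update gate, r the reset gate,
    and h_cand the candidate activation. Returns the new hidden state."""
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    z = sig(Wz * x_t + Uz * h_prev)                  # update gate
    r = sig(Wr * x_t + Ur * h_prev)                  # reset gate
    h_cand = math.tanh(Wh * x_t + Uh * (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_cand           # gated interpolation
```

The final line shows why the GRU needs no separate output gate: the update gate alone interpolates between the previous state and the candidate.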

Experiments
In this section, we describe the evaluation metrics used in our experiments, and relate the overall performance of the models and imputation strategies to reliability.

Evaluation Metrics
We use four evaluation metrics in our experiments: AUC, GAUC, FPED, and FNED. AUC is the standard area under the receiver operating characteristic (ROC) curve. This metric is used to evaluate the performance of the model on a held-out test set. GAUC is also the area under the ROC curve. However, instead of evaluating on the held-out test set, the AUC is calculated on the gender bias template (GenTemp) dataset, which we describe below. The FPED and FNED scores are also computed on GenTemp.

Table 1: Overall Performance of each model (CNN, BoW, LSTM, and GRU) on all three datasets (Sexist, Abusive, and OLID), averaged over all pre-trained word embeddings. ↑ and ↓ mark whether the performance is better with a higher (↑) or lower (↓) score. Scores are reported on a held-out split of each dataset (AUC) as well as the synthetic gender dataset (GAUC, FPED, and FNED).
Overall, each metric is used to evaluate reliability as part of Equations 2 and 3 (i.e., they are the output of f(Z, m)). While many metrics have been proposed to evaluate fairness (Zliobaite, 2015; Hardt et al., 2016; Dixon et al., 2018a; Borkan et al., 2019; Mitchell et al., 2019), unfortunately, most methodologies require ground-truth or inferred demographic annotations (Badjatiya et al., 2019; Garg et al., 2019). In the absence of annotated demographic data, Dixon et al. (2018a) propose fuzzing, a method of testing fairness with simulated data that analyzes how predictions change if the topic of the tweet stays the same, but the text in pre-defined templates (e.g., "I am {adjective}") is slightly altered. For example, fuzzing techniques will randomly change demographic words (e.g., "he", "she", "husband", and "wife") in a tweet without changing its meaning.
In this paper, we use the code released by Dixon et al. (2018b). Specifically, we generated 1,424 samples (712 pairs) by filling the templates with common gender identity pairs (e.g., male/female, man/woman, etc.). We call this set of filled templates GenTemp. The filled templates contain neutral and offensive nouns and adjectives from the vocabulary, in order to balance the 'Not Offensive' and 'Offensive' samples. See the Supplementary Material for a complete listing of the nouns and adjectives.
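The pairing logic behind such template filling can be sketched as follows. The template string, identity pairs, and slot words below are illustrative stand-ins of our own, not the actual GenTemp resources released by Dixon et al.

```python
def fill_templates(templates, identity_pairs, slot_words):
    """Generate paired fairness-probe sentences.

    templates: strings with {identity} and {adjective} slots;
    identity_pairs: e.g. [("man", "woman")];
    slot_words: (word, label) tuples, where the label carries over to
    both members of a pair since only the identity term differs.
    """
    samples = []
    for template in templates:
        for a, b in identity_pairs:
            for word, label in slot_words:
                pair = (template.format(identity=a, adjective=word),
                        template.format(identity=b, adjective=word))
                samples.append((pair, label))
    return samples
```

Each pair differs only in the identity term, so any divergence in a classifier's predictions on the two sentences is attributable to that term.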
Following the experimental setup of Park et al. (2018), to measure the fairness of the different models, we use the AUC metric on GenTemp (GAUC), and compare the absolute differences between the false positive rate and false negative rate calculated independently for each gender. The false positive rate (FPR) and false negative rate (FNR) are defined as

FPR = FP / (FP + TN), FNR = FN / (FN + TP),

where TP, FP, FN, and TN represent the number of true positives, false positives, false negatives, and true negatives, respectively. Moreover, we use the False Positive Equality Difference (FPED) and False Negative Equality Difference (FNED) (Dixon et al., 2018a). FPED and FNED are defined as

FPED = Σ_{t ∈ T} |FPR - FPR_t|, FNED = Σ_{t ∈ T} |FNR - FNR_t|,

respectively, where T = {Male, Female}. FPR and FNR represent the overall false positive and false negative rates, respectively. FPR_t and FNR_t represent the group-specific (i.e., Male or Female) false positive and false negative rates.
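These metrics can be computed directly from confusion counts; a minimal sketch (function names are ours):

```python
def rates(tp, fp, fn, tn):
    # FPR = FP / (FP + TN); FNR = FN / (FN + TP)
    return fp / (fp + tn), fn / (fn + tp)

def equality_differences(overall, per_group):
    """FPED / FNED in the sense of Dixon et al. (2018a).

    overall: (TP, FP, FN, TN) on the full template set;
    per_group: dict mapping group name -> (TP, FP, FN, TN),
    e.g. {"Male": (...), "Female": (...)}.
    """
    fpr, fnr = rates(*overall)
    fped = sum(abs(fpr - rates(*c)[0]) for c in per_group.values())
    fned = sum(abs(fnr - rates(*c)[1]) for c in per_group.values())
    return fped, fned
```

Both metrics are zero when every group's error rates match the overall rates, and grow as group-specific rates diverge.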

Overall Performance
We report the overall performance of each model on the three offensive language datasets in Table 1.
While the goal of this paper is not to develop the best offensive language classifiers, it is important to understand each model's performance when analyzing reliability. Each model's performance in Table 1 is averaged over all 17 embeddings and the ten repeated runs of each model. Furthermore, the AVG rows mark the average performance of an embedding strategy across all four models. Concerning the AUC on each dataset's held-out test set, we find that the "Impute" and "No Impute" embedding strategies result in the best performance on average. For instance, the "Impute" AVG AUC on the Sexist dataset is 0.925, which is more than 4% higher than the "No Impute" strategy (0.881 AUC) and 8% higher than the "Static" AVG (0.840). Thus, static embeddings generally result in sub-optimal performance on average. We find a similar pattern for GAUC, the AUC results on the gender bias template dataset. The "Impute" GAUC for the Abusive dataset (0.816 GAUC) is nearly 9% better than that of the "No Impute" strategy (0.730 GAUC). We also find that the LSTM and GRU models generally result in the best AUC and GAUC, e.g., on the Sexist dataset, the LSTM's AUC (0.934) is nearly 2% better than the CNN and BoW models. Based on the FPED and FNED metrics, the fairest model and imputation strategies vary depending on the dataset. For instance, the "Static" embedding imputation strategy obtains the best scores for FPED (0.012) and FNED (0.034) on the OLID dataset, which is considerably better than the "Impute" strategy (FPED 0.087; FNED 0.124). Contrary to the OLID results, the "Impute" strategy on the Abusive dataset results in the best FPED (0.012) and FNED (0.021) scores, which is nearly 5% better than the "No Impute" strategy's FPED (0.061) and FNED (0.067). Overall, while the best model and imputation strategies vary for FPED and FNED, the bias is small in the Abusive and OLID datasets.
While we report the averages and standard deviations in Table 1, it is also important to measure the difference between the best and worst performing embeddings for each model. If all embeddings result in the same performance, then reliability is irrelevant. Moreover, if the "Static" embedding imputation strategy is more sensitive to pre-trained embedding choice, but it can still match the performance of the "Impute" strategy for some embeddings, then it is important to know that. Thus, in Figure 1 we report the best and worst scores on the OLID dataset. The results are obtained by training a model repeatedly (ten times) for each set of embeddings. The runs are averaged to report the average score obtained by each model-embedding pair. Figure 1 displays the maximum (max) and minimum (min) scores. For each set of pre-trained embeddings, we average the AUC of each model's ten runs. Hence, the max represents the largest pre-trained embedding's average AUC score. Interestingly, we find that the best embeddings for each model result in approximately the same AUC across all imputation strategies. For instance, the best AUC for the LSTM model is 0.87 for every imputation strategy. However, the choice of imputation strategy substantially impacts the worst AUC scores, i.e., the worst embeddings for each model impact each model's overall average. This implies that the "Static" embedding strategy is more sensitive to the selected pre-trained embeddings than the "Impute" strategy. Nonetheless, there is still a 3-7% absolute difference between the best and worst scores for each model using the "Impute" strategy, suggesting that pre-trained embedding choice is important. But, embedding choice has a larger impact when the "Static" strategy is used. Please see the Supplementary Material for a complete listing of the max, min, and median of each model's AUC and GAUC scores on every dataset. In summary, for all datasets, we found that pre-trained embedding choice has a smaller impact on the AUC of the models using the "Impute" strategy than the "No Impute" and "Static" strategies. Yet, the GAUC performance is substantially impacted by the choice of pre-trained embeddings for all imputation methods. For instance, the BoW method has a 29% absolute difference between the min and max GAUC scores using the "Impute" strategy on the OLID dataset. Thus, the general importance of the reliability scores will depend on the imputation strategy and evaluation metric.

Table 2: Cross-dataset reliability (CDR) results measured by the Spearman rho correlation between the performance scores (AUC, GAUC, FPED, and FNED) of the same model trained on different datasets. The correlation measures whether the word embeddings that result in the best (worst) performance for a given model are the same on different, but similar, datasets. Higher (↑) scores mark more correlation.

Cross-Dataset Reliability Analysis
The CDR results are shown in Table 2. We find that the use of static embeddings results in stable cross-dataset AUC and GAUC performance. For instance, the correlation between the AVG "Impute" models and "Static" models, on the Sexist and Abusive datasets, jumps from 0.415 to 0.878. Intuitively, this means that if the word embeddings are static, then the embeddings that result in the best performance for a certain model (e.g., CNN) on the Sexist dataset will likely result in the best performance on the Abusive dataset. We find similar correlation improvements between Sexist ⇔ OLID (0.523 to 0.586) and Abusive ⇔ OLID (0.695 to 0.936). Additionally, in the static embedding setting, we see high correlation between datasets that differ substantially in size (e.g., Sexist ⇔ Abusive). This result suggests that small datasets could potentially be used to find which pre-trained embeddings work best for a given problem; the same embeddings are then likely to generalize to a larger dataset. While the AUC and GAUC CDR correlations improve using static embeddings, the CDR scores for the FPED and FNED fairness metrics do not show any noticeable correlation improvements. For instance, with static embeddings, the AVG CDR FPED score between the Sexist and OLID datasets is -0.321, i.e., on average, the embeddings that result in the fairer models on the Sexist dataset are inversely related to those on OLID. We find similar patterns in the FPED results for Sexist ⇔ Abusive (-0.081) and the FNED results on Abusive ⇔ OLID (-0.117) with static embeddings. The pattern continues for the "Impute" and "No Impute" embedding strategies.

The correlation measures whether the word embeddings that result in the best (worst) performance for a given model are the same for different models.

Cross-Model Reliability Analysis
In Figure 4, we report correlation heatmaps from the CMR study on each dataset. Note that all results are reported in the diagonal blocks, i.e., cross-model correlations on the same dataset. We analyze two of the evaluation metrics: AUC and GAUC. Overall, similar to the CDR results, we find that static embeddings result in more downstream reliability. For example, in Figure 4a, we see a steady improvement in CMR from the "Impute" strategy, where some models are negatively correlated with each other (e.g., the CNN and LSTM), to the "Static" embedding imputation strategy. We find similar results for the Abusive and OLID datasets, shown in Figures 4b and 4c, respectively. In addition, the CMR correlation between the GRU and LSTM models is consistently higher than for other pairs of models. This result suggests that the more similar two neural network architectures are, the more likely it is that embeddings which perform well for one will perform well for the other. In Figure 3, we show the FPED CMR results on the Abusive dataset. Similar to the CDR findings, we find that there is little downstream fairness reliability of the word embeddings. Specifically, we find nearly zero correlation between the FPED results of the CNN and the LSTM models with the "Impute" strategy. Furthermore, with the "Static" embedding strategy, many of the CMR results have negative correlations. These results imply that while the static embeddings resulting in the best AUC and GAUC performance are similar across models, there is no guarantee about FPED and FNED performance. We have similar results for FPED and FNED on all three datasets. However, because of space limitations, the rest of the cross-model FPED and FNED figures can be found in the Supplementary Material.

Discussion
We have two major findings in this paper. First, if we fine-tune word embeddings, then the downstream performance will be erratic between models and datasets. While the difference between a model trained on different embeddings may be smaller with the "Impute" strategy for some metrics/datasets, there are still meaningful differences. Moreover, the finding that static embeddings are reliable and can match the "Impute" strategy scores with certain pre-trained embeddings implies that we may be able to save computation by not fine-tuning the embeddings. Also, we may be able to save computation by not evaluating all embeddings on future training iterations with the model. Thus, if computational efficiency is essential, and an NLP engineer wants to train a model on recently collected data, they only need to evaluate the best embeddings from their previous study. Likewise, if an NLP engineer wants to efficiently test a new model with static embeddings, based on the CMR results, the engineer only needs to evaluate the new model on the embeddings that initially resulted in the best performance.
Our second major finding for the FPED and FNED fairness metrics is that the reliability of the downstream fairness of word embeddings is erratic across models, datasets, and embedding imputation strategies. So, if the FPED and FNED fairness metrics are essential for the downstream task, it is crucial to test every embedding-model combination. However, many industries (e.g., health, government, etc.) that have applications where fairness is mission-critical may not have the resources for large-scale experimentation. Therefore, even if these industries rely on human-NLP collaboration, finding efficient and reliable methods of fairness estimation may be helpful, even if the accuracy does not necessarily achieve state-of-the-art performance. While there has been research about reducing the social cost of testing the fairness of text classification models (Rios, 2020), all embedding-model combinations still need to be tested in their framework. It is our opinion that reducing the social and computational cost of testing the fairness of NLP methodologies is an important avenue of future research.
One of the major limitations of this study is that we treat gender as a binary concept while evaluating fairness. Gender is difficult (impossible) to detect automatically because gender is not a binary classification task. People may identify as binary trans people, non-binary people, or as gender non-conforming people. In this work we do not classify users into gender categories, but we do use a synthetic dataset to estimate binary gender fairness. As future work, we believe it is best to perform controlled experiments where we ask users how they identify, rather than grouping them automatically or using toy synthetic test sets. This approach-of asking rather than predicting-is also suggested for studies about gender in Scheuerman et al. (2019).
Finally, it is important to note the similarities and differences of our reliability metrics (CDR and CMR) to domain adaptation. Generally, domain adaptation methods attempt to improve the performance of machine learning models on data for a task that does not match the original data distribution (Ramponi and Plank, 2020). In this paper, we explore the downstream reliability of word embeddings when applied to datasets that do not match the original data distribution. Yet, the downstream performance of word embeddings may be reliable, but not generalize well in terms of overall performance. Ideally, future work should explore methods that generalize well and are reliable.

Conclusion
In this paper, we develop metrics of the downstream reliability of pre-trained word embeddings. Specifically, we measure the downstream reliability of word embeddings across datasets and models. Our findings conclude that the performance of word embeddings is reliable when they are static, i.e., when they are not fine-tuned. This implies that, without fine-tuning, not every publicly available set of word embeddings needs to be evaluated when an NLP engineer trains on an updated dataset or tests a different neural network architecture.

Supplementary Material A Word Embeddings
In Table 3, we link to the publicly available word embeddings we use in our experiments. We test three models: SkipGram, GLOVE, and FastText. We also explore different embedding sizes, ranging from 25 to 300 dimensions.

B GenTemp Dataset Details
In this section, we list the nouns and adjectives used in our experiments. These lists can be combined with the code released by Dixon et al. (2018b).

C Expanded Overall Results
For each neural network model, we report the maximum, median, and minimum AUC and GAUC scores trained on different sets of embeddings in Tables 4, 5, and 6. The results are obtained by training a model repeatedly (ten times) for each set of embeddings. The runs are averaged to report the average score obtained by a model-embedding pair. The numbers in the tables are used to generate Figure 1.

D Cross-Model Reliability FPED and FNED Results
In Figure 4, we report correlation heatmaps from the CMR study on each dataset. Note that all results are reported in the diagonal blocks, i.e., cross-model correlations on the same dataset. We analyze two of the evaluation metrics: FPED and FNED. This expands on the analysis in the main paper. Here, we find that the CMR scores for the FPED and FNED metrics can vary depending on the dataset and neural network architecture. For instance, while the results get worse from "Impute" to "Static" in Figure 4b, we see the opposite pattern in Figure 4a. The only pattern that holds across datasets and metrics is that the word embeddings which result in the fairest models for the GRU and LSTM architectures are highly correlated.