Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings

Contextualized representations (e.g. ELMo, BERT) have become the default pretrained representations for downstream NLP applications. In some settings, this transition has rendered their static embedding predecessors (e.g. Word2Vec, GloVe) obsolete. As a side-effect, we observe that older interpretability methods for static embeddings — while more diverse and mature than those available for their dynamic counterparts — are underutilized in studying newer contextualized representations. Consequently, we introduce simple and fully general methods for converting from contextualized representations to static lookup-table embeddings which we apply to 5 popular pretrained models and 9 sets of pretrained weights. Our analysis of the resulting static embeddings notably reveals that pooling over many contexts significantly improves representational quality under intrinsic evaluation. Complementary to analyzing representational quality, we consider social biases encoded in pretrained representations with respect to gender, race/ethnicity, and religion and find that bias is encoded disparately across pretrained models and internal layers even for models with the same training data. Concerningly, we find dramatic inconsistencies between social bias estimators for word embeddings.


Introduction
Word embeddings (Bengio et al., 2003; Collobert and Weston, 2008; Collobert et al., 2011) have been a hallmark of modern natural language processing (NLP) for many years. Embedding methods have been broadly applied and have experienced parallel and complementary innovations alongside neural network methods for NLP. Advances in embedding quality in part have come from integrating additional information such as syntax (Levy and Goldberg, 2014a; Li et al., 2017), morphology (Cotterell and Schütze, 2015), subwords (Bojanowski et al., 2017), subcharacters (Stratos, 2017; Yu et al., 2017) and, most recently, context (Peters et al., 2018; Devlin et al., 2019). Due to their tremendous representational power, pretrained contextualized representations, in particular, have seen widespread adoption across myriad subareas of NLP.
The recent dominance of pretrained contextualized representations such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) has prompted a growing body of interpretability research; several recent works additionally address ethical concerns such as security (adversarial robustness) and social bias. In fact, the neologism BERTology was coined specifically to describe this flurry of interpretability research. While these works have provided nuanced fine-grained analyses by creating new interpretability schemas/techniques, we instead take the alternate approach of re-purposing methods developed for analyzing static word embeddings.
In order to employ static embedding interpretability methods for contextualized representations, we begin by proposing a simple strategy for converting from contextualized representations to static embeddings. Crucially, our method is fully general and assumes only that the contextualized model maps word sequences to vector sequences. Given this generality, we apply our method to 9 popular pretrained contextualized representations. The resulting static embeddings serve as proxies for the original contextualized model. We initially examine the representational quality of these embeddings under intrinsic evaluation. Our evaluation produces several insights regarding layer-wise lexical semantic understanding and representational variation in contextualized representations. Importantly, our analyses suggest constructive changes that may improve downstream practices in using contextualized models. Simultaneously, we find that our static embeddings substantially outperform Word2Vec and GloVe, suggesting that our method serves the dual purpose of being a lightweight mechanism for generating static embeddings that track with advances in contextualized representations. Since static embeddings have significant advantages with respect to speed, computational resources, and ease of use, these results have important implications for resource-constrained settings (Shen et al., 2019), environmental concerns (Strubell et al., 2019), and the broader accessibility of NLP technologies.

Alongside more developed methods for embedding analysis, the static embedding setting is also equipped with a richer body of work regarding social bias. In this sense, we view understanding the encoded social bias in representations as a societally critical special case of interpretability research. We apply methods for identifying and quantifying gender, racial/ethnic, and religious bias (Bolukbasi et al., 2016; Garg et al., 2018; Manzini et al., 2019) to our static embeddings.
These experiments not only shed light on the properties of our static embeddings for downstream use but can also serve as a proxy for understanding latent biases in the original pretrained contextualized representations. We find that biases in different models and across different layers are quite disparate; this has important consequences for model and layer selection in downstream use. Further, for two sets of pretrained weights learned on the same training data, we find that bias patterns still remain fairly distinct. Most surprisingly, our large-scale evaluation makes clear that existing bias estimators are dramatically inconsistent with each other.

Methods
In order to use a contextualized model like BERT to compute a single context-agnostic representation for a given word w, we define two operations.
The first is subword pooling: the application of a pooling mechanism over the k subword representations generated for w in context c in order to compute a single representation for w in c, i.e. {w_c^1, ..., w_c^k} → w_c. Beyond this, we define context combination to be the mapping from representations w_{c_1}, ..., w_{c_n} of w in different contexts c_1, ..., c_n to a single static embedding w that is agnostic of context.

Subword Pooling. The tokenization procedure for BERT can be decomposed into two steps: performing a simple word-level tokenization and then potentially deconstructing a word into multiple subwords, yielding w^1, ..., w^k such that cat(w^1, ..., w^k) = w, where cat(·) indicates concatenation. Every layer of the model then computes vectors w_c^1, ..., w_c^k. Given these vectors, we consider four pooling mechanisms to compute w_c:

w_c = f(w_c^1, ..., w_c^k), f ∈ {min, max, mean, last}

where min(·) and max(·) are element-wise min/max pooling, mean(·) is the arithmetic mean, and last(·) selects the last vector, w_c^k.

Context Combination. Next, we describe two approaches for specifying contexts c_1, ..., c_n and combining the associated representations w_{c_1}, ..., w_{c_n}.
• Decontextualized: For a word w, we use a single context c 1 = w. That is, we feed the single word w into the pretrained model and use the outputted vector as the representation of w (applying subword pooling if the word is split into multiple subwords).
• Aggregated: Since the Decontextualized strategy presents an unnatural input to the pretrained encoder, which likely never encountered w in isolation, we instead aggregate representations of w across multiple contexts. In particular, we sample n sentences from a text corpus D (see §A.2), each of which contains the word w, and compute the vectors w_{c_1}, ..., w_{c_n}. Then, we apply a pooling strategy to yield a single representation that aggregates representations across contexts:

w = g(w_{c_1}, ..., w_{c_n}), g ∈ {min, max, mean}
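The two operations above admit a direct implementation. The sketch below is our illustration (not the paper's released code) and assumes the per-layer subword vectors have already been extracted as NumPy arrays:

```python
import numpy as np

def pool_subwords(subword_vecs, f="mean"):
    """Subword pooling: collapse the k subword vectors of one word
    in one context into a single vector w_c."""
    V = np.stack(subword_vecs)          # shape: (k, d)
    if f == "mean":
        return V.mean(axis=0)
    if f == "min":
        return V.min(axis=0)
    if f == "max":
        return V.max(axis=0)
    if f == "last":
        return V[-1]
    raise ValueError(f"unknown pooling: {f}")

def combine_contexts(context_vecs, g="mean"):
    """Context combination: aggregate representations of the same word
    across n contexts into a single static embedding w."""
    V = np.stack(context_vecs)          # shape: (n, d)
    if g == "mean":
        return V.mean(axis=0)
    if g == "min":
        return V.min(axis=0)
    if g == "max":
        return V.max(axis=0)
    raise ValueError(f"unknown pooling: {g}")
```

The Decontextualized strategy corresponds to calling `combine_contexts` with a single vector; the Aggregated strategy passes vectors from n sampled sentences.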

Setup
We begin by verifying that the static embeddings we derive retain, to some extent, their representational strength. We take this step to ensure that properties we observe of the static embeddings can be attributed to, and are consistent with, the original contextualized representations. Inspired by concerns with probing methods/diagnostic classifiers (Liu et al., 2019a; Hewitt and Liang, 2019) regarding whether learning can be attributed to the classifier rather than the underlying representation, we employ an exceptionally simple, parameter-free method for converting from contextualized to static representations to ensure that any properties observed in the latter are not introduced by the conversion process. When evaluating static embedding performance, we consider Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) embeddings as baselines since they have been the most prominent pretrained static embeddings for several years. Similarly, we begin with BERT as the contextualized model as it is currently the most prominent in downstream use among the growing number of alternatives. We provide identical analyses for 4 other contextualized model architectures (GPT-2 (Radford et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019b), DistilBERT (Sanh et al., 2019)) and, in total, 9 sets of pretrained weights. All models, weights, and naming conventions used are enumerated in Appendix C and Table 9. Additional representation quality results appear in Tables 4-7 and Figures 4-10. We primarily report results for bert-base-uncased; further results for bert-large-uncased appear in Figure 3.

Evaluation Details
To assess the representational quality of our static embeddings, we evaluate on several word similarity and word relatedness datasets (concerns with this decision are discussed in §A.3). We consider 4 such datasets: RG65 (Rubenstein and Goodenough, 1965), WS353 (Agirre et al., 2009), SIMLEX999 (Hill et al., 2015), and SIMVERB3500 (Gerz et al., 2016) (see §A.4 for more details). Taken together, these datasets contain 4917 examples and specify a vocabulary V of 2005 unique words. Each example is a pair of words (w_1, w_2) with a gold-standard annotation (provided by one or more humans) of the semantic similarity or relatedness between w_1 and w_2. A word embedding is evaluated by the relative correctness of its ranking of the similarity/relatedness of all examples in a dataset with respect to the gold-standard ranking, using the Spearman ρ coefficient. Embedding predictions are computed using cosine similarity.
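Concretely, the evaluation reduces to ranking word pairs by cosine similarity and comparing against the gold ranking. A minimal sketch (our illustration, with dataset loading omitted) using SciPy's spearmanr:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(embeddings, pairs, gold_scores):
    """embeddings: dict mapping word -> vector; pairs: list of (w1, w2);
    gold_scores: human similarity/relatedness judgments, aligned with pairs.
    Returns the Spearman rho between predicted and gold rankings."""
    preds = [cosine(embeddings[w1], embeddings[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(preds, gold_scores)
    return rho
```

Since Spearman ρ depends only on ranks, any monotone rescaling of the similarity scores leaves the evaluation unchanged.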

Results
Pooling Strategy. In Figure 1, we show the performance on all 4 datasets for the resulting static embeddings. For embeddings computed using the Aggregated strategy, representations are aggregated over N = 100K sentences, where N is the total number of contexts across all words (§A.5). Across all four datasets, we see that g = mean is the best-performing pooling mechanism within the Aggregated strategy and also outperforms the Decontextualized strategy by a substantial margin. Fixing g = mean, we further observe that mean pooling at the subword level also performs best (the dark green dashed line in all plots). We further find that this trend consistently holds across pretrained models.
Number of Contexts. In Table 1, we see that performance for both BERT-12 and BERT-24 steadily increases across all datasets with increasing N; this trend holds for the other 7 pretrained models. In particular, in the largest setting with N = 1M, the BERT-24 embeddings distilled from the best-performing layer for each dataset drastically outperform both Word2Vec and GloVe. However, this can be seen as an unfair comparison given that we are selecting specific layers for specific datasets. As the middle band of Table 1 shows, we can fix a particular layer for all datasets and still outperform both Word2Vec and GloVe on all datasets.
Relationship between N and model layer. In Figure 1, there is a clear preference towards the first quarter of the model's layers (layers 0-3) with a sharp drop-off in performance immediately thereafter. A similar preference for the first quarter of the model is observed in models with a different number of layers (Figure 3, Figure 10). Given that our intrinsic evaluation is centered on lexical semantic understanding, this appears to be largely consistent with the findings of Liu et al. (2019a).

Cross-Model Results. Remarkably, we find that most tendencies we observe generalize well to all other pretrained models we study (specifically the optimality of f = mean, g = mean, the improved performance for larger N, and the layer-wise tendencies with respect to N). This is particularly noteworthy given that several works have found that different contextualized models pattern substantially differently (Liu et al., 2019a; Ethayarajh, 2019).
In Table 2, we summarize the performance of all models we studied. All of the models considered were introduced during a similar time period and have comparable properties in terms of downstream performance. In spite of this, we observe that their static analogues perform radically differently. For example, several do not reliably outperform Word2Vec and GloVe under intrinsic evaluation despite reliably outperforming them in downstream evaluation. Future work may consider whether the reduction to static embeddings affects different models differently and whether this is reflective of the quality of their context-agnostic lexical semantics as distinct from other types of linguistic knowledge (e.g. context modelling, syntactic understanding, and semantic composition). In general, these results provide further evidence to suggest that the linguistic understanding captured by different pretrained weights may be substantially different, even for models with near-identical Transformer (Vaswani et al., 2017) architectures.
Somewhat surprisingly, in Table 2, DistilBERT-6 outperforms BERT-12 on three out of the four datasets despite being distilled (Ba and Caruana, 2014; Hinton et al., 2015) from BERT-12. Analogously, RoBERTa, which was introduced as a direct improvement over BERT, does not reliably outperform the corresponding BERT models.

Bias
Bias is a complex and highly relevant topic in developing representations and models in NLP and ML. In this context, we study the social bias encoded within our static word representations as a proxy for understanding biases of the source contextualized representations. As Kate Crawford argued in her NIPS 2017 keynote, while studying individual models is important, given that specific models may propagate, accentuate, or diminish biases in different ways, studying the representations that serve as the starting point and that are shared across models (which may be used for different tasks) allows for a more generalizable understanding of bias (Barocas et al., 2017).
In this work, we simultaneously consider multiple axes of social bias (i.e. gender, race, and religion) and multiple proposed methods for computationally quantifying these biases. We do so precisely because we find that existing NLP literature has primarily prioritized gender (which may be a technically easier setting and is starkly incomplete in terms of social biases of interest). Further, as we will show, different computational specifications of bias that evaluate the same underlying social phenomena yield markedly different results. As a direct consequence, we strongly caution that the results must be taken with respect to the definitions of bias being applied. Further, we note that an embedding which receives low bias scores cannot be assumed to be (nearly) unbiased. Instead, it satisfies the significantly weaker condition that under existing definitions the embedding exhibits low bias and perhaps additional (more nuanced) definitions are needed.

Definitions
Bolukbasi et al. (2016) introduced a measure of gender bias which assumes access to a set P = {(m_1, f_1), ..., (m_n, f_n)} of (male, female) word pairs where m_i and f_i differ only in gender (e.g. 'men' and 'women'). They compute a gender direction

g = PCA([m_1 − f_1, ..., m_n − f_n])[0]

where [0] indicates the first principal component.
Then, given a set N of target words whose bias we are interested in evaluating, Bolukbasi et al. (2016) specify the bias as:

bias_BOLUKBASI(N) = mean_{w ∈ N} |cos(w, g)|

This definition is only inherently applicable to binary bias settings, i.e. where there are exactly two protected classes. Multi-class generalizations are difficult to realize since constructing P requires aligned k-tuples whose entries only differ in the underlying social attribute, and this becomes increasingly challenging for increasing k. Further, this definition assumes the first principal component explains a large fraction of the observed variance.
Garg et al. (2018) introduced a different definition that is not restricted to gender and assumes access to sets A_1 = {m_1, ..., m_n} and A_2 = {f_1, ..., f_n} of representative words for each of the two protected classes. For each class, the mean vector μ_i = mean_{w ∈ A_i} w is computed. Garg et al. (2018) then compute the bias in two ways:

bias_GARG-EUC(N) = mean_{w ∈ N} (‖w − μ_1‖_2 − ‖w − μ_2‖_2)
bias_GARG-COS(N) = mean_{w ∈ N} (cos(w, μ_1) − cos(w, μ_2))

Compared to the definition of Bolukbasi et al. (2016), these definitions may be more general: constructing P is strictly more difficult than constructing A_1, A_2 (as P can always be split into two such sets but the reverse is not generally true), and Garg et al. (2018) do not rely on the first principal component explaining a large fraction of the observed variance.
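Under our reading of these definitions, the estimators can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the first principal component is computed via SVD of the mean-centered difference vectors:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_direction(pairs):
    """First principal component of the male-female difference vectors
    (in the style of Bolukbasi et al., 2016). pairs: list of (m_vec, f_vec)."""
    D = np.stack([m - f for m, f in pairs])
    D = D - D.mean(axis=0)                    # center before PCA
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[0]                              # first right singular vector

def bias_bolukbasi(targets, g):
    """Mean absolute cosine of each target word with the bias direction g."""
    return float(np.mean([abs(cos(w, g)) for w in targets]))

def bias_garg(targets, A1, A2, metric="cos"):
    """Garg et al. (2018)-style bias: compare each target word to the
    mean vectors of the two protected-class word sets."""
    mu1, mu2 = np.mean(A1, axis=0), np.mean(A2, axis=0)
    if metric == "euc":
        diffs = [np.linalg.norm(w - mu1) - np.linalg.norm(w - mu2) for w in targets]
    else:
        diffs = [cos(w, mu1) - cos(w, mu2) for w in targets]
    return float(np.mean(diffs))
```

Note that the Garg-style scores are signed (indicating the direction of association), while the Bolukbasi-style score is an unsigned magnitude; this difference is one reason absolute scores across estimators are not directly comparable.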

Results
Inspired by the results of Nissim et al. (2020), in this work we transparently report social bias in existing static embeddings as well as the embeddings we produce. In particular, we exhaustively report the measured bias for all 3542 valid (pretrained model, layer, social attribute, bias definition, target word list) 5-tuples, i.e. all possible combinations of static embeddings and bias measures considered.
The results for models beyond BERT appear in Figures 11-18. We specifically report results for binary gender (male, female), two-class religion (Christianity, Islam), and three-class race (white, Hispanic, and Asian), directly following Garg et al. (2018). We study bias with respect to target word lists of professions N_prof and adjectives N_adj. These results are by no means intended to be comprehensive with regard to the breadth of social bias and only address a restricted subset of social biases, which notably does not include intersectional biases. The types of biases being evaluated are taken with respect to specific word lists (which are sometimes subjective, albeit peer-reviewed) that serve as exemplars, and the definitions of bias are grounded in the norms of the United States. All word lists are provided in Appendix B and are sourced in §A.6.
Layer-wise Bias Trends. In Figure 2, we report layer-wise bias across all (attribute, definition) pairs. We clearly observe that for every social attribute, there is a great deal of variation across the layers in the quantified amount of bias for a fixed bias estimator. Further, while we are not surprised that different bias measures for the same social attribute and the same layer assign different absolute scores, we observe that they also do not agree in relative judgments. For gender, we observe that the bias estimated by the definition of Manzini et al. (2019) steadily increases before peaking at the penultimate layer and slightly decreasing thereafter. (We slightly modify the definition of Manzini et al. (2019) by (a) using cosine similarity where they use cosine distance and (b) inserting absolute values around each term in the mean over N; we make these changes to introduce consistency with the other definitions and to permit comparison.) In contrast, under bias_GARG-EUC we see a distribution with two peaks corresponding to layers at the start and end of the pretrained model, with less bias within the intermediary layers. For estimating the same quantity, bias_GARG-COS is mostly uniform across the layers. Similarly, in looking at religious bias, we see similar inconsistencies, with the bias increasing monotonically from layers 2 through 8 under one of the estimators. In general, while the choice of N (and the choice of A_i for gender) does affect the absolute bias estimates, the relative trends across layers are fairly robust to these choices for a specific definition.
Consequences. Taken together, our analysis suggests a concerning state of affairs regarding bias quantification measures for (static) word embeddings.
In particular, while estimates are seemingly stable to some types of choices regarding word lists, bias scores for a particular word embedding are tightly related to the definition being used and existing bias measures are markedly inconsistent with each other. We find this has important consequences beyond understanding the social biases in our representations. Concretely, we argue that without certainty regarding the extent to which embeddings are biased, it is impossible to properly interpret the meaningfulness of debiasing procedures (Bolukbasi et al., 2016;Zhao et al., 2018a,b;Sun et al., 2019) as we cannot reliably estimate the bias in the embeddings both before and after the procedure. This is further compounded with the existing evidence that current intrinsic measures of social bias may not handle geometric behavior such as clustering (Gonen and Goldberg, 2019).
Cross-Model Bias Trends. In light of the above, we next compare bias estimates across different pretrained models in Table 3. Given the conflicting scores assigned by different definitions, we retain all definitions along with all social attributes in this comparison. However, we only consider target words given by N_prof due to the aforementioned stability (and for visual clarity), with results for N_adj appearing in Table 8. Since we do not preprocess or normalize embeddings, the scores using bias_GARG-EUC are incomparable (and may be improper to compare in the layer-wise case). (When we normalized using the Euclidean norm, we found the relative results to reliably coincide with those for the cosine-based estimator.)

Data alone does not determine bias. Comparing the results for BERT-12 and BERT-24 (full layer-wise results for BERT-24 appear in Figure 11) reveals that bias trends for BERT-12 and BERT-24 are starkly different for any fixed bias measure. This indicates that the bias observed in contextualized models is not strictly determined by the training data (these models share the same training data, as do all other 12/24-layer model pairs) and must also be a function of the architecture, training procedure, and/or random initialization.
Takeaways. Ultimately, given the aforementioned issues regarding the reliability of bias measures, it is difficult to arrive at a clear consensus on how the bias encoded in our distilled representations compares with that of prior static embeddings. What our analysis does resolutely reveal is a pronounced and likely problematic effect of existing bias definitions on the resulting bias estimates.

Related Work

May et al. (2019) construct a single semantically-bleached sentence which is fed into a sentence encoder to yield a static representation. In doing so, they introduce SEAT as a means for studying biases in sentence encoders by applying WEAT (Caliskan et al., 2017) to the resulting static representations. This approach appears inappropriate for quantifying bias in sentence encoders: sentence encoders are trained on semantically meaningful sentences, so semantically-bleached constructions are not representative of this distribution, and their templates rely heavily on deictic expressions which are difficult to adapt for certain syntactic categories such as verbs (as required for SIMVERB3500 especially). Given these concerns, our reduction method may be preferable for estimating bias in contextualized representations. Because we use mean-pooling, our approach may lend itself to interpretations of the bias in a model on average across contexts.
Prior work on bias in word representations largely falls into two categories: (a) intrinsic estimation using word lists and (b) evaluation via downstream tasks (Zhao et al., 2017, 2018a, 2019a; Stanovsky et al., 2019). Our bias evaluation is in the style of (a), and we consider multi-class social bias through the lens of gender, race, and religion, whereas prior work has centered on binary gender. Additionally, while most prior work has discussed the static embedding setting, recent work has considered sentence encoders and contextualized models (e.g. Zhao et al., 2019a). When compared to these approaches, we study a broader class of biases under more than one bias definition and consider more than one model. Further, while many of these approaches generally neglect reporting bias values for different layers of the model, we show this is crucial: bias is not uniformly distributed throughout model layers, and practitioners often do not use the last layer of deep Transformer models (Liu et al., 2019a; Zhang et al., 2020; Zhao et al., 2019b).

Future Directions

Our work furnishes multiple insights about pretrained contextualized models that suggest changes (subword pooling, layer choice, beneficial variance reduction via averaging across contexts) to improve downstream performance. Recent models have combined static and dynamic embeddings (Peters et al., 2018; Bommasani et al., 2019; Akbik et al., 2019) and our representations may also support drop-in improvements in these settings.
While not central to our goals, we discovered that our static embeddings substantially outperform Word2Vec and GloVe under intrinsic evaluation. The generality of the proxy analysis method implies that other interpretability methods for static embeddings can also be considered. Further, post-processing approaches beyond analysis/interpretability, such as dimensionality reduction, may be particularly intriguing given that this is often challenging to perform within large multi-layer models.

Discussion and Open Problems
While our work demonstrates that contextualized representations retain substantial representational power even when reduced to be noncontextual, it is unclear what information is lost. After all, contextualized representations have been so effective precisely because they are tremendously contextual (Ethayarajh, 2019). As such, the validity of treating the resulting static embeddings as reliable proxies for the original contextualized model still remains open.
On the other hand, human language processing has often been conjectured to have both context-dependent and context-independent properties (Barsalou, 1982; Rubio-Fernández, 2008; Depraetere, 2014, 2019). Given this divide, our approach may provide an alternative mechanism for clarifying how these two properties interact in the computational setting from both an interpretability standpoint (i.e. comparing results for analyses on the static embeddings and the original contextualized representations) and a downstream standpoint (i.e. comparing downstream performance for models initialized using the static embeddings and the original contextualized representations). However, the precise relationship between the role of context in human language processing and computational language processing remains unclear.
Theoretical explanation for the behavior we observe is also needed in two settings. First, it is unclear why learning contextualized representations and then reducing them to static embeddings drastically outperforms directly learning static embeddings. In particular, the GloVe embeddings we use are learned using 6 billion tokens whereas the BERT representations were trained on roughly half as much data (3.3 billion tokens). Perhaps the behavior is reminiscent of the benefits of temporarily modelling in higher dimensional settings, as seen in other domains (e.g. the kernel trick and Mercer's theorem for learning non-linear classifiers using inner product methods): begin by recasting the problem in a more expressive space (contextualized representations) and then project/reduce to the original space (static embeddings). Second, the reason for the benefits of the variance reduction that we observe is unclear. Given that the best-performing mechanism is to average over many contexts, it may be that approaching the asymptotic mean of the distribution across contexts is desirable and/or helps combat the anisotropy that exists in the original contextualized space (Ethayarajh, 2019).

Conclusion
In this work, we consider how methods developed for analyzing static embeddings can be re-purposed for understanding contextualized representations. We introduce simple and effective procedures for converting from contextualized representations to static word embeddings. When applied to pretrained models like BERT, we find the resulting embeddings are useful proxies that provide insights into the pretrained model while simultaneously outperforming Word2Vec and GloVe substantially under intrinsic evaluation. We further study the extent to which various social biases (gender, race, religion) are encoded, employing several different quantification schemas. Our large-scale analysis reveals that bias is encoded disparately across different popular pretrained models and different model layers. Our findings also have significant implications with respect to the reliability of existing protocols for estimating bias in word embeddings.

Reproducibility
All data, code, and visualizations are made publicly available. Further details are explicitly and comprehensively reported in Appendix A.

A.1 Additional Results

We provide full representation-quality results for all additional models (Tables 4-7). Similarly, we provide layer-wise bias estimates for all additional models in Figures 11-18. Results for target words specified as adjectives are given in Table 8.

A.2 Data
We use English Wikipedia as the corpus D in context combination for the Aggregated strategy. The specific subset of English Wikipedia used was lightly preprocessed with a simple heuristic to remove bot-generated content. Individual Wikipedia documents were split into sentences using NLTK (Loper and Bird, 2002). We excluded sentences containing fewer than 7 tokens or greater than 75 tokens (token counts were computed using the NLTK word tokenizer), though we did not find this filtering decision to be particularly impactful in initial experiments. The specific pretrained Word2Vec and GloVe embeddings used were both 300-dimensional. The Word2Vec embeddings were trained on approximately 100 billion words from Google News and the GloVe embeddings were trained on 6 billion tokens from Wikipedia 2014 and Gigaword 5. We chose the 300-dimensional embeddings in both cases as we believed they were the most frequently used and generally the best performing on both intrinsic evaluations (Hasan and Curry, 2017) and downstream tasks.
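The sentence-length filter described above can be sketched in a few lines. This is our illustration: a simple regex tokenizer stands in for NLTK's word tokenizer so the sketch is self-contained, and the function name is ours:

```python
import re

def filter_sentences(sentences, min_tokens=7, max_tokens=75):
    """Keep only sentences whose token count falls within
    [min_tokens, max_tokens], mirroring the corpus preprocessing
    described above (word tokenizer approximated with a regex)."""
    kept = []
    for s in sentences:
        # words and standalone punctuation marks each count as one token
        n = len(re.findall(r"\w+|[^\w\s]", s))
        if min_tokens <= n <= max_tokens:
            kept.append(s)
    return kept
```

In practice the token counts (and hence the filtered corpus) will differ slightly from NLTK's tokenizer, but the thresholds play the same role.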

A.3 Evaluation Decisions
In this work, we chose to conduct intrinsic evaluation experiments that focused on word similarity and word relatedness. We did not consider the related evaluation of lexical understanding via word analogies as they have been shown to decompose into word similarity subtasks (Levy and Goldberg, 2014b) and there are significant concerns about the validity of these analogies tests (Nissim et al., 2020). We acknowledge that word similarity and word relatedness tasks have also been heavily scrutinized (Faruqui et al., 2016;Gladkova and Drozd, 2016). A primary concern is that results are highly sensitive to (hyper)parameter selection (Levy et al., 2015). In our setting, where the parameters of the embeddings are largely fixed based on which pretrained models are publicly released and where we exhaustively report the impact of most remaining parameters, we find these concerns to still be valid but less relevant.
To this end, prior work has considered various preprocessing operations on static embeddings such as clipping embeddings on an elementwise basis (Hasan and Curry, 2017) when performing intrinsic evaluation. We chose not to study these preprocessing choices as they create discrepancies between the embeddings used in intrinsic evaluation and those used in downstream tasks (where this form of preprocessing is generally not considered) and would have added additional parameters implicitly. Instead, we directly used the computed embeddings from the pretrained model with no changes throughout this work.

A.4 Representation Quality Dataset Trends
Rubenstein and Goodenough (1965) introduced a set of 65 noun pairs and demonstrated strong correlation (exceeding 95%) between the scores in their dataset and additional human validation. Miller and Charles (1991) introduced a larger collection of pairs which they argued was an improvement over RG65 as it more faithfully addressed semantic similarity. Agirre et al. (2009) followed this work by introducing an even larger set of pairs that included those of Miller and Charles (1991) as a subset and again demonstrated correlations with human scores exceeding 95%. Hill et al. (2015) argued that SIMLEX999 was an improvement in coverage over RG65 and more correctly quantified semantic similarity, as opposed to semantic relatedness or association, when compared to WS353. Beyond this, SIMVERB3500 was introduced by Gerz et al. (2016) to further increase coverage over all predecessors. Specifically, it shifted the focus towards verbs, which had been heavily neglected in the prior datasets centered on nouns and adjectives.

A.5 Experimental Details
We used PyTorch (Paszke et al., 2017) throughout this work, with the pretrained contextual word representations taken from the HuggingFace pytorch-transformers repository (https://github.com/huggingface/pytorch-transformers). Tokenization for each model was conducted using its corresponding tokenizer, i.e. results for GPT-2 use the GPT2Tokenizer in pytorch-transformers. For simplicity, throughout this work, we introduce N as the total number of contexts used in distilling with the Aggregated strategy. Concretely, N = Σ_{w_i ∈ V} n_i, where V is the vocabulary used (generally the 2005 words in the four datasets considered). As a result, in finding contexts, we filter for sentences in D that contain at least one word in V. We do this because it requires a number of candidate sentences upper bounded with respect to the most frequent word in V, whereas filtering for a specific value of n requires a number of sentences scaling in the frequency of the least frequent word in V. The N samples from D for the Aggregated strategy were sampled uniformly at random. Accordingly, for word w_i, the number of examples n_i which contain w_i scales with the frequency of w_i in the vocabulary being used. As a consequence, for small values of N, it is possible that rare words would have no examples, making it impossible to compute a representation w using the Aggregated strategy. In this case, we backed off to using the Decontextualized representation for w_i. Given this concern, in the bias evaluation, we fix n_i = 20 for every w_i. In initial experiments, we found the bias results to be fairly stable when choosing values n_i ∈ {20, 50, 100}. The choice of n_i = 20 corresponds, in some sense, to N = 40100 in the representation quality section (as the vocabulary size was 2005), though this correspondence assumes a uniform distribution of word frequency as opposed to a Zipf distribution. The embeddings in the bias evaluation are drawn from layer X/4 using f = mean, g = mean, as we found these to be the best performing embeddings generally across pretrained models and datasets in the representational quality evaluation.
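The context-sampling procedure described above can be sketched as follows. This is an illustrative reconstruction: whitespace tokenization stands in for the model-specific tokenizers, and the Decontextualized back-off for words left with no contexts is handled by the caller:

```python
import random
from collections import defaultdict

def sample_contexts(corpus, vocab, n_total, seed=0):
    """Sample up to n_total sentences, uniformly at random, from those
    containing at least one vocabulary word; bucket each sampled
    sentence under every vocab word it contains. Rare words may end up
    with an empty bucket, triggering the Decontextualized back-off."""
    rng = random.Random(seed)
    vocab = set(vocab)
    # keep only sentences that contain at least one word in the vocabulary
    candidates = [s for s in corpus if vocab & set(s.lower().split())]
    rng.shuffle(candidates)
    contexts = defaultdict(list)
    for sent in candidates[:n_total]:
        for w in vocab & set(sent.lower().split()):
            contexts[w].append(sent)
    return contexts
```

Because a sentence is bucketed under every vocabulary word it contains, frequent words naturally accumulate more contexts than rare ones, matching the scaling behavior described above.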

C Naming Conventions
Throughout this work, we make use of several naming conventions/substitutions. In the case of models, we use the form 'MODEL-X' where X indicates the number of layers in the model and consequently the model produces X + 1 representations for any given subword (including the initial layer 0 representation). Table 9 describes the complete correspondence of our shorthand and the full names.
In the case of model names, the full form is the name assigned to the pretrained model (that was possibly reimplemented) released by HuggingFace.

Table 8: Social bias within static embeddings from different pretrained models with respect to a set of adjectives, N_adj. Parameters are set as f = mean, g = mean, N = 100000, and the layer of the pretrained model used in distillation is X/4.