Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications

The success of language models based on the Transformer architecture appears to be inconsistent with observed anisotropic properties of representations learned by such models. We resolve this by showing, contrary to previous studies, that the representations do not occupy a narrow cone, but rather drift in common directions. At any training step, all of the embeddings except for the ground-truth target embedding are updated with gradients in the same direction. Compounded over the training set, the embeddings drift and share common components, manifested in their shape in all the models we have empirically tested. Our experiments show that isotropy can be restored using a simple transformation.


Introduction
Word embeddings, both static (Mikolov et al., 2013a; Pennington et al., 2014) and contextualized (Peters et al., 2018), have been instrumental to the progress made in Natural Language Processing over the past decade (Turian et al., 2010; Wu et al., 2016; Liu et al., 2018; Peters et al., 2018; Devlin et al., 2019). In recent years, language models based on the Transformer architecture (Vaswani et al., 2017) have led to state-of-the-art performance on problems such as machine translation (Vaswani et al., 2017), question answering (Devlin et al., 2019; Liu et al., 2019b), and Word Sense Disambiguation (Bevilacqua and Navigli, 2020), among others. However, it has been observed that representations from Transformers exhibit undesirable properties, such as anisotropy; that is, they tend to occupy only a small subspace of the embedding space. This observation has been documented by a number of studies (Gao et al., 2019; Ethayarajh, 2019; Wang et al., 2020), and a similar property has been identified in the past in static word embeddings (Mu and Viswanath, 2018). To address these issues, post-processing methods (Mu and Viswanath, 2018) and regularization terms (Gao et al., 2019; Wang et al., 2019c, 2020) have been proposed. However, the mechanism that leads to the undesirable properties remains unclear, and without understanding it, addressing the fundamental issue properly will be difficult.

The deficiencies are most pronounced in the representations of rare words, as we will show in Section 4. Performance of pretrained language models is inconsistent and tends to decrease when the input contains rare words (Schick and Schütze, 2020b,a). Schick and Schütze (2020a) observe that replacing a portion of words in the MNLI (Williams et al., 2018) entailment data set with less frequent synonyms decreases the performance of BERT-base and RoBERTa-large by 30% and 21.8%, respectively. After enriching rare words with surface-form features and additional context, Schick and Schütze (2020a) reduce the performance gap to 20.7% for BERT and 17% for RoBERTa, but the gap remains large nonetheless.

Why do even the large-scale, pretrained language models struggle to learn good representations of rare words? Consider a language model with an embedding matrix shared between the input and output layers, a standard setup known as the weight tying trick (Inan et al., 2017). Intuitively, at any training step $t$, optimization of the cross-entropy loss can be characterized as "pulling" the target embedding, $w_T$, closer to the model's output vector $h_t$, while "pushing" all other embeddings, $W \setminus w_T$, in the same direction, away from the output vector $h_t$. This leads to what we call the common enemies effect: the effect of the target words producing gradients of the same direction for all of the non-target words. Compounded over the training set, the embeddings drift and share common components, manifested in their shape in all the models we have empirically tested; see Figure 1.
Although Gao et al. (2019) report a closely related phenomenon, which they call representation degeneration, their analysis rests on the assumption that the embedding matrix is learned after all other parameters of the model are well-optimized and fixed, which is not the case in practice. We conduct our analysis in a more realistic setting and arrive at different conclusions. We show that the embeddings do not occupy a narrow cone, but are shifted in one common direction and only appear as a cone when projected to a lower dimensional space (Section 4.1). In fact, simply removing the mean vector of all embeddings, thus centering them, shifts the embeddings back onto a more spherical shape. We evaluate embeddings, before and after centering, on four standard benchmarks and observe significant performance improvement across all of them.

Why is removing the mean so effective? We find that the common enemies effect applies to most, if not all, words in the vocabulary, but in a non-uniform manner. As language is known to follow an approximately Zipfian distribution (Zipf, 1949; Manning and Schütze, 2001; Piantadosi, 2014), even common words will not occur frequently in a text corpus, and as a result will often be "pushed" by other target words in the same direction as rare words. Consequently, all embeddings share a significant common direction. We will focus on the analysis of the auto-regressive GPT-2 (Radford et al., 2019) and two masked language models, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b). Our contributions can be summarized as follows:

• We show that as word embeddings repeatedly receive gradients in the same direction, they are shifted in one dominant direction in the vector space. The effects are most evident in representations of rare words, but are also present in representations of frequent words.
• The shift causes the distribution of projected embeddings to appear as a narrow cone; we show that simply removing the mean vector is enough to restore the spherical distribution.
• We provide empirical evidence of our analyses using state-of-the-art pretrained language models and demonstrate that removing the mean dramatically improves isotropy of the representations.

Distributed Word Representations
Distributed representations induce a rich similarity space, in which semantically similar concepts are close in distance (Goodfellow et al., 2016; Bengio et al., 2003; Mikolov et al., 2013c). In a language model, the regularities of the embedding space facilitate generalization: a high probability can be assigned to a sequence of words that has never been seen before, provided it consists of words similar to those forming an already seen sentence (Bengio et al., 2003; Mikolov et al., 2013c). Although models such as BERT or GPT-2 produce representations as a function of the entire input sequence, the representations are the result of a series of transformations applied to the input vectors. Consider an example sentence, "The building was dilapidated.", and the sentences resulting from replacing "dilapidated" with either "ruined" or "reconditioned". If the distance in the embedding space between the two rather infrequent, but antonymous, words "dilapidated" and "reconditioned" is not larger than the distance between "dilapidated" and its relatively frequent synonym "ruined", then by the aforementioned generalization principle there is little to no reason to believe that the distance will become larger in the output layer.

Tokenization
Do subword tokenization methods (Schuster and Nakajima, 2012; Wu et al., 2016; Sennrich et al., 2016; Radford et al., 2019) preserve the word frequency imbalance? Examination of common tokenization methods, such as Byte-Pair Encoding (Sennrich et al., 2016) and WordPiece (Schuster and Nakajima, 2012; Wu et al., 2016), suggests that the subword units induced by tokenization algorithms exhibit a frequency imbalance similar to that of the full vocabulary. This can be explained by the greedy nature of the vocabulary induction process. Although different methods start from different base vocabulary symbols (i.e., Unicode code points or bytes), all of them construct the vocabulary through iterative merging of the most frequent symbols. As a result, the most frequent units are preserved as words, while rare words are segmented into subword units. Moreover, the words which are segmented into subword units are infrequent to such a degree that even their combined frequency is orders of magnitude lower than the frequency of the most common words. We confirm this empirically by tokenizing the CNN News corpus (See et al., 2017; Hermann et al., 2015) with WordPiece (used in BERT): over 30% of the corpus can be accounted for using the 13 most frequent tokens, and 50% of the corpus using just 85 tokens. On the other hand, covering at least 98% of the corpus requires nearly 15000 tokens. Therefore, we conclude that the tokens follow an approximately Zipfian distribution (Zipf, 1949; Manning and Schütze, 2001) similar to that of the full vocabulary. We provide a comparison of the frequency distributions of tokens and words based on the CNN-News corpus in Appendix B. Note that the preserved imbalance does not imply that subword tokenization is not beneficial to the performance of language systems on rare words; it may mitigate some of the issues, as shown in Sennrich et al. (2016), but recent work demonstrates that it does not solve the problem (Schick and Schütze, 2020b,a).
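Coverage statistics of this kind are straightforward to recompute. Below is a minimal sketch, assuming the HuggingFace transformers library and a local plain-text copy of the corpus; the file name cnn_news.txt is a placeholder, not a path used in our experiments.

```python
# Sketch: measure how few WordPiece tokens cover most of a corpus.
from collections import Counter

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

counts = Counter()
with open("cnn_news.txt", encoding="utf-8") as f:  # hypothetical corpus path
    for line in f:
        counts.update(tokenizer.tokenize(line))

total = sum(counts.values())
covered = 0.0
for rank, (token, count) in enumerate(counts.most_common(), start=1):
    covered += count / total
    if rank in (13, 85) or covered >= 0.98:
        print(f"top {rank} tokens cover {covered:.1%} of the corpus")
        if covered >= 0.98:
            break
```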

Autoregressive Language Models
Given a sequence of tokens $\mathbf{w} = [w_1, \ldots, w_N]$ as input, autoregressive (AR) language models assign a probability $p(\mathbf{w})$ to the sequence using the factorization $p(\mathbf{w}) = \prod_{t=1}^{N} p(w_t \mid w_{<t})$. Consequently, an AR language model is trained by maximizing the likelihood under the forward autoregressive factorization:

$$\max_\theta \; \log p_\theta(\mathbf{w}) = \sum_{t=1}^{N} \log \mathrm{softmax}\big(W h_\theta(w_{1:t-1})\big)_{\mathrm{label}_t} \quad (1)$$

where $h_\theta(w_{1:t-1}) \in \mathbb{R}^d$ is the output vector of a model at position $t$, $\theta$ are the model's parameters, $W \in \mathbb{R}^{|V| \times d}$ is the learned embedding matrix, $e(w)$ is a function mapping a token to its representation from the embedding matrix, and $\mathrm{label}_t$ is the index of the $t$-th target token in the vocabulary. To estimate the probability, $W$ maps $h_\theta(w_{1:t-1})$ to unnormalized scores for every word in the vocabulary $V$; the scores are subsequently normalized by the softmax into a probability distribution over the vocabulary. In this paper, we focus on neural language models which compute $h_\theta$ using the Transformer architecture; however, the mechanism is generally applicable to other common variants of language models (Mikolov et al., 2010; Sundermeyer et al., 2012; Peters et al., 2018).
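The weight-tied output layer amounts to scoring the output vector against every row of the embedding matrix and normalizing with a softmax. The following is a minimal numpy sketch of that scoring step only, with toy shapes and a random stand-in for the encoder output; it is an illustration, not our experimental code.

```python
# Minimal sketch of the weight-tied output layer: the embedding matrix W
# scores the model's output vector h, and softmax normalizes the scores
# into a distribution over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64                      # toy vocabulary size, model dimension
W = rng.normal(size=(V, d)) * 0.02   # shared input/output embedding matrix

def next_token_log_prob(h, label):
    """log p(w_t | w_<t) for target index `label`, given output vector h."""
    logits = W @ h                            # unnormalized scores, shape (V,)
    logits -= logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return log_probs[label]

h = rng.normal(size=d)                        # stand-in for h_theta(w_{1:t-1})
print(next_token_log_prob(h, label=42))
```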

Masked Language Modeling
The Masked Language Modeling (MLM) pretraining objective is to maximize the likelihood of masked tokens conditioned on the (noisy) input sequence. Given a sequence of tokens $\mathbf{w} = [w_1, \ldots, w_N]$, a corrupted version $\hat{\mathbf{w}}$ is constructed by randomly setting a portion of tokens in $\mathbf{w}$ to a special [MASK] symbol. Although MLM estimates the token probabilities of all masked positions, $\bar{\mathbf{w}}$, simultaneously, rendering the factorization from Subsection 3.1 no longer applicable, the mechanism used to "unmask" a token differs only slightly from that in AR, specifically:

$$\max_\theta \; \sum_{t=1}^{N} m_t \log \mathrm{softmax}\big(W h_\theta(\hat{\mathbf{w}})_t\big)_{\mathrm{label}_t} \quad (2)$$

where $m_t = 1$ indicates that $w_t$ is masked, and $h_\theta(\hat{\mathbf{w}})_t$ is the output representation computed as a function of the full, noisy, input sequence. Note that the main difference between Equations 1 and 2 is the context used to condition the estimation. Models trained with the MLM objective, like BERT and RoBERTa, compute the output vector utilizing bidirectional context through the self-attention mechanism, while the unidirectional models use only the context to the left of the target token. Moreover, only the probabilities of masked words, $w_i$ such that $w_i \in \bar{\mathbf{w}}$, are estimated.
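A minimal sketch of the corruption step follows, assuming a BERT-style vocabulary in which [MASK] has id 103; note that the full BERT recipe also keeps or randomizes a fraction of the selected positions (the 80/10/10 rule), which this sketch omits.

```python
# Sketch of MLM input corruption: a random subset of positions is replaced
# with [MASK]; the loss is computed only at those positions (m_t = 1).
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                             # [MASK] id in BERT's vocabulary

def corrupt(w, mask_prob=0.15):
    w = np.asarray(w)
    m = rng.random(w.shape) < mask_prob   # m_t = 1 where w_t is masked
    w_hat = np.where(m, MASK_ID, w)
    return w_hat, m

tokens = [2023, 2311, 2001, 27597]        # toy token ids
w_hat, m = corrupt(tokens)
print(w_hat, m)                           # loss sums log-probs where m is True
```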

Learning Rules
Although the two objectives described above differ in terms of the distribution modeled (Yang et al., 2019), both AR and MLM models rely on the softmax function and cross-entropy loss. Using the notation established above, the cross-entropy loss function for an AR model is optimized by minimizing:

$$\mathcal{L}_{AR} = -\sum_{t=1}^{N} \log \mathrm{softmax}\big(W h_\theta(w_{1:t-1})\big)_{\mathrm{label}_t}$$

and for an MLM model it takes the form of:

$$\mathcal{L}_{MLM} = -\sum_{t=1}^{N} m_t \log \mathrm{softmax}\big(W h_\theta(\hat{\mathbf{w}})_t\big)_{\mathrm{label}_t}$$

The gradient of the cross-entropy loss with respect to the embedding matrix $W$ is a sum of the gradients flowing through two paths: the first is through the output layer, where the embeddings are used to create the targets for the softmax; the second flows through the encoder stack to the input layer. The gradient flowing through the encoder stack to the input layer is complex and depends on minute details of a model. Although its contribution is not irrelevant, it is not necessary to illustrate the main point of this section. Thus, we focus on the update rule resulting from the gradient with respect to embeddings in the top layer of a model. For the prediction of a token $w_t$, let $h_\theta$ be the output vector of either an AR model (Eq. 1) or an MLM model (Eq. 2), and let $\hat{y}$ be the true probability distribution; then:

$$\frac{\partial \mathcal{L}}{\partial W} = \big(\mathrm{softmax}(W h_\theta) - \hat{y}\big)\, h_\theta^\top$$

The resulting update rule for the embedding matrix is:

$$W \leftarrow W - \eta\, \big(\mathrm{softmax}(W h_\theta) - \hat{y}\big)\, h_\theta^\top$$

where $\eta$ is the learning rate. Since $\hat{y}$ is equal to 0 for all indices except the index of the target word $w_t$, all the embeddings, with the exception of the target word embedding, will become less similar to the representation produced by the model. This leads to what we define as the common enemies effect: target words producing gradients of the same direction for all of the non-target words. As the parameters $\theta$ are updated during the optimization process, $h_\theta$ changes even when the model is provided with the same input. Therefore, the direction of the gradient for the non-target words changes accordingly, but at any particular step the direction of the update is the same for all the non-target words.

This is fundamentally different from the conclusion of Gao et al. (2019), who state that there exists a uniformly negative direction such that moving along it yields a nearly optimal solution for rare words' embeddings. We find that the common enemies effect is most pronounced in the representations of rare words, which are less likely to appear as targets, but it is evident in all embeddings nonetheless. Gao et al. (2019) conclude that Transformer-based language models degenerate and occupy a narrow cone in the embedding space; instead, we find that embeddings simply drift in a common, dominant direction. Their conclusions are strongly influenced by a rapid decay of the singular values of an embedding matrix; however, a rapid decay of singular values is not a sufficient condition to reach such conclusions.
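The same-direction property of the update can be checked numerically from the output layer alone. The sketch below, with toy sizes and no encoder, verifies that under cross-entropy with a softmax every non-target row of $W$ receives a gradient along the same direction.

```python
# Numerical check of the update rule: for cross-entropy over a softmax with
# tied embeddings, dL/dW = (p - y) h^T, so every non-target row is pushed
# in the direction -h (scaled by its probability p_i), while the target row
# moves toward h.
import numpy as np

rng = np.random.default_rng(0)
V, d, eta = 50, 16, 0.1
W = rng.normal(size=(V, d)) * 0.1
h = rng.normal(size=d)
target = 7

logits = W @ h
p = np.exp(logits - logits.max())
p /= p.sum()
y = np.zeros(V); y[target] = 1.0

grad_W = np.outer(p - y, h)                # (p - y) h^T
W_new = W - eta * grad_W
deltas = W_new - W

# All non-target embeddings moved in exactly the same direction (-h):
nontarget = np.delete(np.arange(V), target)
cos = deltas[nontarget] @ (-h) / (
    np.linalg.norm(deltas[nontarget], axis=1) * np.linalg.norm(h))
print(cos.min(), cos.max())                # both 1.0: identical direction
print(deltas[target] @ h > 0)              # target moved toward h: True
```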
In fact, points sampled from a 3D sphere satisfy the rapid-decay condition given above. As this is not entirely obvious at first glance, we provide a toy example in Figure 2 that illustrates why embeddings appear as a cone when projected to a low dimensional space. We sample points at random from two spheres, one centered at the origin and one shifted away from the origin (Figure 2a), and perform Singular Value Decomposition on the two sets of samples. As the sphere moves away from the origin, the difference between the singular values increases (Figure 2b).
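This experiment is easy to reproduce; a numpy sketch along the lines of Figure 2 follows (our own sampling choices, not the exact parameters used in the figure).

```python
# Toy reproduction of the shifted-sphere experiment: points sampled uniformly
# from a sphere look isotropic when the sphere is centered at the origin, but
# the gap between singular values grows as the sphere is shifted away from it.
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere(n=2000, dim=3):
    x = rng.normal(size=(n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

points = sample_sphere()
for shift in (0.0, 1.0, 2.0, 4.0):
    shifted = points + shift * np.array([1.0, 0.0, 0.0])
    s = np.linalg.svd(shifted, compute_uv=False)
    print(f"shift={shift}: singular values {np.round(s, 1)}")
# With shift=0 the three singular values are nearly equal; as the shift grows,
# the first singular value dominates and the spectrum decays rapidly, even
# though the points still lie on a sphere.
```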
Similarly, the projection of uncentered embeddings (see Figures 1a and 1d) appears as a cone, but when the embeddings are centered around the origin (Figures 1b, 1e), the shape of their projection resembles a sphere more than a cone; that is, simply removing the mean vector $\mu$ of an embedding matrix $W$, where $\mu = \sum_{w \in V} e(w) / |V|$, increases the isotropy of the embeddings. Optimization of a neural language model is certainly more complex than our toy example. Most of all, the common enemies effect is not uniform; the amount by which each vector moves in the most dominant direction depends on many factors, among others the size of the training corpus, the diversity of the training corpus, or whether static (BERT) or dynamic (RoBERTa) masking is used. More generally, the magnitude of the gradient with respect to a word vector depends on the value of the logit corresponding to that word, hence the shift will not be uniform.

Unused Tokens and Rare Words
We hypothesize that as rare words drift in a common direction, their embeddings become less discriminative than the embeddings of frequent words. BERT's vocabulary provides a unique opportunity to investigate the contribution of the same-direction gradients to embeddings of particular words: there are 994 special unused tokens in BERT's vocabulary that were never used as inputs or targets during pretraining, so all the updates to their representations were in directions opposite to the output vectors. As shown in Figure 3, the cosine similarity between the unused tokens and other tokens increases as frequency decreases. The average cosine similarity between the unused tokens and the tokens at indices [28500-29500] is 0.63 (although frequency depends on a corpus, in general a higher index implies a lower frequency due to the way BERT's vocabulary is constructed). In comparison, the unused tokens have a cosine similarity of 0.27 with the most frequent tokens (e.g., "to"), but the similarity goes up rapidly for tokens other than the most frequent ones. We observe a similar pattern in RoBERTa using the last 1000 words in its vocabulary in place of the unused tokens.

Schick and Schütze (2019) evaluate BERT and RoBERTa on a dataset explicitly measuring the ability of MLM models to "unmask" words of different frequencies, and report that both models struggle to "unmask" rare words. The results presented in this section provide an explanation of this behavior and confirm that the embeddings of rare tokens are most affected by the common enemies effect.
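A sketch of this measurement is below, assuming the HuggingFace transformers library; the "frequent" index slice is an illustrative stand-in, not the exact range used for Figure 3.

```python
# Sketch: compare BERT's never-updated [unusedN] embeddings with the rest of
# the vocabulary.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
E = model.get_input_embeddings().weight.detach()      # (|V|, d) embedding matrix
E = E / E.norm(dim=1, keepdim=True)                   # unit-normalize rows

# Unused tokens are named "[unusedN]" in BERT's vocabulary.
unused_ids = torch.tensor(
    [i for t, i in tok.get_vocab().items() if t.startswith("[unused")]
)
unused = E[unused_ids]

def avg_cos(lo, hi):
    """Average cosine similarity between unused tokens and a vocab slice."""
    return (unused @ E[lo:hi].T).mean().item()

print("vs. rare slice [28500:29500]  :", avg_cos(28500, 29500))
print("vs. frequent slice [2000:3000]:", avg_cos(2000, 3000))  # illustrative
```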

Experiments
We validate our theoretical analysis through a series of experiments on the geometric properties of non-contextualized embeddings.

Isotropy
Although centering an embedding matrix results in a more desirable spectral distribution, tokens of comparable frequency tend to remain clustered in the embedding space, as shown in Figure 1. Therefore, we empirically test how much gain in isotropy is actually obtained for the tested models' embeddings by removing the shared direction. Moreover, Mu and Viswanath (2018) show that the top principal components of skip-gram embeddings (Mikolov et al., 2013a) correspond to word frequency and demonstrate that this frequency bias can be mitigated by removing the top principal components of an embedding matrix. We evaluate the effectiveness of this approach on embeddings from Transformer-based models. We use BERT, RoBERTa, and GPT-2 in different sizes in our experiments.
Setup: We measure the initial isotropy of the embeddings of each model, and the isotropy after removing the mean vector $\mu = \sum_{w \in V} e(w)/|V|$ from each row of an embedding matrix $W$, yielding $\bar{W} = W - \mu$. Next, we use a slightly modified version of the approach of Mu and Viswanath (2018) and remove the top $D$ principal components from each model's embedding matrix to obtain highly isotropic representations. Finally, we evaluate whether increasing the isotropy of embeddings from Transformer-based models can improve performance on standard embedding benchmarks.
Definitions: To measure isotropy, we use the partition function defined in Arora et al. (2016):

$$Z(c) = \sum_{w \in V} \exp\big(c^\top e(w)\big)$$

where $e(w)$ maps a word $w$ to its embedding and $c$ is a unit vector. For vectors to be isotropic, the value of $Z(c)$ should be approximately constant, according to Lemma 2.1 in Arora et al. (2016). Based on this property, we empirically measure the isotropy of an embedding matrix $W$ using:

$$I(W) = \frac{\min_{c \in X} Z(c)}{\max_{c \in X} Z(c)}$$

where $I(W) \in [0, 1]$. We follow the standard approach and define $X$ to be the set of eigenvectors of $W^\top W$ (Mu and Viswanath, 2018; Wang et al., 2020). We remove the top principal components using a modified version of the post-processing method proposed by Mu and Viswanath (2018):

$$\hat{W}_i = \bar{W}_i - \sum_{j=1}^{D} \big(U_j^\top \bar{W}_i\big) U_j \quad (11)$$

where $W$ is the embedding matrix, $\bar{W} = W - \mu$ is its centered version, $U_j$ is the $j$-th principal component of $\bar{W}$, $\hat{W}$ is the post-processed embedding matrix, and $D$ is the number of principal components removed from the original matrix. Mu and Viswanath (2018) use $W$ instead of $\bar{W}$ in the term $(U_j^\top \bar{W}_i) U_j$ in Eq. 11, but we find the centered version of $W$ to be more effective. Following Mu and Viswanath (2018), we set $D = \lfloor d/100 \rfloor$, where $d$ is the dimensionality of a model.
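A compact numpy sketch of both the isotropy measure and the post-processing step is given below, under the approximations described above (the eigenvectors of $W^\top W$ are obtained via SVD, and synthetic shifted embeddings stand in for a real embedding matrix).

```python
# Sketch: I(W) is the ratio of min to max of the partition function
# Z(c) = sum_w exp(c^T e(w)) over the eigenvectors of W^T W; post-processing
# removes the top D = floor(d/100) principal components from the centered
# matrix, as in Eq. 11.
import numpy as np

def isotropy(W):
    _, _, Vt = np.linalg.svd(W, full_matrices=False)  # rows: eigvecs of W^T W
    Z = np.exp(W @ Vt.T).sum(axis=0)                  # Z(c) per eigenvector c
    return Z.min() / Z.max()

def postprocess(W):
    W_bar = W - W.mean(axis=0)                        # center first
    D = max(1, W.shape[1] // 100)
    _, _, Vt = np.linalg.svd(W_bar, full_matrices=False)
    U = Vt[:D]                                        # top D principal components
    return W_bar - (W_bar @ U.T) @ U                  # remove their projections

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 128)) + 5.0                # shifted, anisotropic
print(isotropy(W))                                    # low
print(isotropy(W - W.mean(axis=0)))                   # near 1 after centering
print(isotropy(postprocess(W)))
```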

Embedding Benchmarks
Setup: We evaluate each model's embeddings on common benchmarks for word similarity and relatedness, before and after post-processing. We use the following data sets:

• SimLex-999 (Hill et al., 2015) - measures similarity, rather than relatedness or association.
• MEN Test Collection (Bruni et al., 2014) - measures relatedness of word pairs.

• WordSim-353 (Finkelstein et al., 2002) - measures both similarity and relatedness of word pairs.

• Stanford Rare Words (RW) (Luong et al., 2013) - measures similarity of words; at least one word in each pair is a rare word.

[Table 1: Isotropy, I(W) ∈ [0, 1], of embeddings from various language models. Centering an embedding matrix yields nearly perfectly isotropic embeddings in most of the tested models. W_c stands for a centered matrix; W_r stands for an embedding matrix with the ⌊d/100⌋ top principal components removed. ||µ||₂ / avg(||e(w)||₂) is the ratio of the L₂ norm of the mean vector, µ, to the average of the L₂ norms of the word embeddings.]
The data sets are designed to measure the embeddings' ability to reflect semantic relations. Performance on the data sets is measured by the correlation between the similarities of the representations and the human scores. We filter out samples consisting of subword units. Although this results in different test sets for different models, our goal is not to compare the performance of different models but to validate the benefits of increased isotropy of embeddings. We score relations with both cosine similarity and inner product. Schakel and Wilson (2015) show that vectors of more frequent words tend to have smaller norms, which was confirmed for BERT by Podkorytov et al. (2020). As the longer vectors of rare words are most affected by the common enemies effect (see Section 4.2), we evaluate a "scaled-centering" method to account for that.
Specifically, we first compute the mean vector of the embeddings normalized to unit length, $\bar{\mu} = \sum_{w \in V} \frac{e(w)}{\|e(w)\|_2} / |V|$. Then we scale the mean vector by the norm of each word embedding before subtracting it: $\tilde{e}(w) = e(w) - \|e(w)\|_2\, \bar{\mu}$.
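A minimal numpy sketch of scaled-centering as defined above, applied to synthetic embeddings:

```python
# "Scaled-centering": compute the mean of unit-normalized embeddings, then
# subtract it from each embedding scaled by that embedding's own norm, so
# long (rare-word) vectors shed a larger component of the shared direction.
import numpy as np

def scaled_center(W):
    norms = np.linalg.norm(W, axis=1, keepdims=True)  # ||e(w)||_2 per row
    mu_bar = (W / norms).mean(axis=0)                 # mean of unit vectors
    return W - norms * mu_bar                         # e(w) - ||e(w)|| mu_bar

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 64)) + 3.0                 # shifted toy embeddings
W_sc = scaled_center(W)
units = W_sc / np.linalg.norm(W_sc, axis=1, keepdims=True)
print(np.linalg.norm(units.mean(axis=0)))             # mean direction shrinks
```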

Results
Isotropy: We find that merely removing the mean vector is enough for most models to reach nearly perfect isotropy. The results are in Table 1. The only exception is RoBERTa-large, which had the lowest initial isotropy. Interestingly, Schick and Schütze (2020a) show that RoBERTa-large outperforms BERT models on tasks designed explicitly for rare words. Moreover, according to common leaderboards (Wang et al., 2019b,a), RoBERTa performs best on downstream tasks among the models we analyzed.
We stress that $I(W)$ is an approximation of the degree of isotropy and should be treated as such when interpreting its relation to downstream performance. The idea of the partition function $Z(c)$ is that its value should be approximately constant for any unit vector $c$ (Arora et al., 2016; Mu and Viswanath, 2018). As there is no closed-form solution for $\min_{c} Z(c)$ and $\max_{c} Z(c)$ over all unit vectors, the set of eigenvectors of $W^\top W$ has been used as $X$ in previous studies to approximate the isotropy (e.g., Mu and Viswanath, 2018; Wang et al., 2020). The vectors in $X$, however, cannot be considered principal components of $W$ unless the matrix $W$ has been centered. Pearson (1901) observes that unless the mean of the data has been subtracted, the best-fitting hyperplane passes through the origin rather than through the centroid. Indeed, for RoBERTa-large, the cosine similarity between the top eigenvector of $W^\top W$ and the mean vector is 0.99.
Additionally, as the volume of a cube in $\mathbb{R}^n$ grows exponentially with $n$, it may be sufficient for the embeddings to be isotropic around a point lying in a lower dimensional subspace to retain the desired separation. In fact, embeddings from RoBERTa-large have an average pairwise cosine similarity of 0.33 (an angle of 70.7°).
We speculate that a longer pretraining of RoBERTa compared to BERT results in a more significant shift of the embeddings in the dominating directions. Simultaneously, a larger pretraining corpus and a dynamic masking scheme used in RoBERTa may result in a more diverse set of shift directions. We leave this line of research for future studies.
Moreover, Mu and Viswanath (2018) demonstrate that neural language models are capable of learning to remove the mean vector. We leave the question of whether Transformer-based language models perform an implicit representation centering operation to future research.

[Table 2: Average performance (Pearson's r × 100) of the models on the non-contextual benchmarks (SimLex-999, MEN, WordSim353, Stanford Rare Words). Centered stands for an embedding matrix centered at the origin; Centered-Scaled stands for an embedding matrix with the mean direction, scaled by the norm of each word embedding, subtracted; Post-Process corresponds to the method defined in Eq. 11. Results from different models are not directly comparable due to different tokenization. Best results for each model are underlined. In general, increased isotropy results in increased performance. The improvement of RoBERTa-large is more significant as its initial isotropy is lower. Specific results for each benchmark can be found in Appendix C.]
Embedding Benchmarks: We present our results on common benchmarks for word similarity and relatedness in Table 2. We report average scores across all tasks; results on the individual data sets are available in Appendix C. We observe that removing the mean vector, and consequently increasing the isotropy of the embeddings, consistently improves performance across all models, except for the most isotropic BERT-cased models. Furthermore, the results in Table 2 demonstrate that "scaled-centering" is more effective than simple mean subtraction, and nearly as effective as the more expensive post-processing method. The only case in which "scaled-centering" does not improve performance is BERT-large-cased with cosine similarity as the scoring function. Performance gains are more pronounced when the inner product is used as the scoring function, regardless of the model or processing method. Although cosine similarity initially yields better results, especially for embeddings with greater $L_2$ norms, mean subtraction is sufficient to close the gap in all but two models (the BERT-cased models).

Discussion
There has been a body of literature demonstrating substantial benefits of improved quality of word embeddings on downstream performance (e.g., Mu and Viswanath, 2018; Wang et al., 2019c; Gao et al., 2019; Wang et al., 2020; Schick and Schütze, 2020a). In particular, Gao et al. (2019) propose to add a cosine similarity regularization to the cross-entropy loss to increase the aperture of the cone in which embeddings are distributed, and report improved performance on machine translation and language modeling. It is straightforward to demonstrate that the cosine regularization proposed by Gao et al. (2019) is equivalent to minimizing the squared norm of the mean direction of the embeddings, hence constraining the most significant drift direction. We provide the derivation of the equivalence in Appendix A.
Large-margin classification has been studied extensively, both in NLP (Wang et al., 2019c) and in machine learning in general (Weston and Watkins, 1999; Tsochantaridis et al., 2005). As substantial shared components of embeddings lead to a decreased classification margin in the output softmax layer, our work offers an explanation for the fragility of pretrained language models reported in the literature (e.g., Schick and Schütze, 2019, 2020a).
Our analyses clearly show that the shifting of the embeddings in the embedding space is due to the dynamic interactions between the representations and the embedding vectors. As the embeddings become more similar, the resulting representations become closer, creating a positive feedback mechanism for the representations to drift collectively. In addition, while isotropy of representations is desirable and has an overall positive impact on performance, the relationships between isotropy and performance in Table 1 and Table 2 suggest that the role of isotropy in model performance needs to be analyzed further. We are investigating the dynamics of the interactions to pinpoint the root cause and their relationship with model performance.

Gao et al. (2019) present an insightful derivation of uniformly negative gradients for non-apparent words and formulate the optimization of rare words' embeddings as an α-strongly convex problem, but make the strong assumption that the embedding matrix is learned after all other parameters of the model are well-optimized and fixed, which is not the case in practice. We do not make such assumptions, providing a more realistic explanation of the learning process. Wang et al. (2020) propose to reparametrize the embedding matrix using SVD and to directly control the decay rate of the singular values. Our paper's purpose is inherently different from that of Wang et al. (2020); we recognize that a fundamental understanding of the problem is missing and provide an explanation for the observations made in previous studies.

Another line of work focuses on limitations of the softmax. Yang et al. (2018) suggest that the softmax does not have sufficient capacity to model the complexity of language. Zhang et al. (2019) analyze the skip-gram model and show that optimization based on the cross-entropy loss and softmax resembles competitive learning, in which words compete with each other for the context vector. This idea is closely related to the common enemies effect reported in this paper; however, skip-gram appears to mitigate the effect through negative sampling (Mikolov et al., 2013b), while similar approaches do not seem to help Transformer pretraining (Clark et al., 2020).

Related Work
A considerable effort has been made to improve the performance of language systems on rare words, but the focus has been on either injecting subword information into non-contextual representations (Luong et al., 2013; Lazaridou et al., 2017; Pinter et al., 2017; Bojanowski et al., 2017), replacing rare words' representations through exploiting their context (Khodak et al., 2018; Liu et al., 2019a), or both (Schick and Schütze, 2019, 2020a). In comparison, we strive to provide an explanation of the underlying problem, which is necessary to render such post-hoc fixes no longer necessary.

Conclusion
We find that the embeddings learned by GPT-2, BERT, and RoBERTa do not degenerate into a narrow cone, as has been suggested in the past, but instead drift in one shared direction. We recognize that target words produce gradients in the same direction for all the non-target words at each training step. Combined with the unbalanced distribution of word frequencies, any two words' embeddings will be repeatedly updated with gradients of the same direction. As such updates accumulate, the embeddings drift and share common components. Our experiments show that simply centering the embeddings restores a nearly perfectly isotropic distribution of the tested models' embeddings and simultaneously improves the embeddings' ability to reflect semantic relations. This understanding of the learning process dynamics opens exciting avenues for future work, such as improving the most affected embeddings of rare words and formulating more computationally efficient training objectives.

A Cosine Regularization as Mean Direction Minimization
In this section we show the equivalence of Cosine Regularization and mean direction minimization.
The cosine regularization term of Gao et al. (2019) can be written, up to constant factors, as

$$\mathrm{CosReg} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \hat{w}_i^\top \hat{w}_j, \qquad \hat{w}_i = \frac{w_i}{\|w_i\|_2}$$

Since

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \hat{w}_i^\top \hat{w}_j = \Big\| \sum_{i=1}^{N} \hat{w}_i \Big\|_2^2$$

and $N$ is a constant (the diagonal terms contribute only the constant $N$), minimizing $\big\| \frac{1}{N} \sum_{i=1}^{N} \hat{w}_i \big\|_2^2$, the squared norm of the mean direction of the embeddings, is equivalent to minimizing the CosReg term.

B Frequency Distributions of Tokens and Words

Figure 4 validates that the word frequency imbalance is preserved in a corpus tokenized with WordPiece (Schuster and Nakajima, 2012; Wu et al., 2016).

C Additional Experimental Results
In Table 3, we provide expanded results on the embedding benchmarks (see Section 5.2 for details).
Our experiments reveal a negative 0.61 correlation between the average norm of the embedding vectors and their isotropy. Additionally, the ratio of the $L_2$ norm of the mean vector to the average of the $L_2$ norms of the embeddings tends to be larger for less isotropic embeddings. Figure 5 shows the distributions of the $L_2$ norms of embeddings from the models studied in this paper. Moreover, Figure 6 compares the effect of centering on the $L_2$ norms of embeddings from BERT-large-cased and RoBERTa-large. We leave the relationship between the norms of the vectors, their isotropy, and the pretraining details (e.g., corpus size, number of training steps, weight decay) for future studies.