RPD: A Distance Function Between Word Embeddings

It is well understood that different algorithms, training processes, and corpora produce different word embeddings. However, less is known about the relations between different embedding spaces, i.e. how far different sets of embeddings deviate from each other. In this paper, we propose a novel metric called Relative Pairwise Inner Product Distance (RPD) to quantify the distance between different sets of word embeddings. This unitary-invariant metric has a unified scale for comparing different sets of word embeddings. Based on the properties of RPD, we systematically study the relations between word embeddings of different algorithms and investigate the influence of different training processes and corpora. The results shed light on poorly understood aspects of word embeddings and justify RPD as a measure of the distance between embedding spaces.

With many different sets of word embeddings produced by different algorithms and corpora, it is interesting to investigate the relationships between these sets of word embeddings. Intrinsically, this would help us better understand word embeddings (Levy et al., 2015). Practically, knowing the relationship between different sets of word embeddings helps us build better word meta-embeddings (Yin and Schütze, 2016), reduce biases in word embeddings (Bolukbasi et al., 2016), pick better hyperparameters (Yin and Shen, 2018), and choose suitable algorithms in different scenarios (Kozlowski et al., 2019).
To study the relationship between different embedding spaces systematically, we propose RPD as a measure of the distance between different sets of embeddings. We derive statistical properties of RPD, including its asymptotic upper bound and its normality under the independence condition. We also provide a geometric interpretation of RPD. Furthermore, we show that RPD is strongly correlated with the performance of word embeddings measured by intrinsic metrics, such as comparing semantic similarity and evaluating analogies.
With the help of RPD, we study the relations among several popular embedding methods, including GloVe (Pennington et al., 2014), SGNS (skip-gram with negative sampling; Mikolov et al., 2013), Singular Value Decomposition (SVD) factorization of the PMI matrix, and SVD factorization of the log-count (LC) matrix. Results show that these methods are statistically correlated, which suggests that there is a unified theory behind these methods.
Additionally, we analyze the influence of the training process, i.e. hyperparameters (negative sampling) and random initialization, and the influence of corpora on word embeddings. Our findings include the fact that different training corpora result in significantly different GloVe embeddings, and that the main difference between embedding spaces comes from the algorithms, although hyperparameters also have some influence. These findings not only provide interesting insights into word embeddings but also fit nicely with our intuition, which further supports RPD as a suitable measure for quantifying the relationship between different sets of word embeddings.

Background
Before introducing RPD, we review the theory behind some static word embedding methods and discuss previous work investigating the relationship between embedding spaces.

Word Embedding Models
We consider the following four word embedding models: SGNS, GloVe, SVD PMI, and SVD LC. SGNS and GloVe are two widely used embedding methods, while SVD PMI and SVD LC are matrix factorization methods that are intrinsically related to SGNS and GloVe (Levy and Goldberg, 2014b; Levy et al., 2015; Yin and Shen, 2018).
The embeddings of all the words form an embedding matrix E ∈ R^{n×d}, where d is the dimension of each word vector and n is the size of the vocabulary.
SGNS maximizes a likelihood function for word and context pairs that occur in the dataset and minimizes it for randomly sampled unobserved pairs, i.e. negative samples (NS). We denote the method with k NS as SGNS k.
GloVe factorizes the log-count matrix shifted by the entire vocabulary's bias terms. The biases here are parameters learned stochastically with an objective weighted according to the frequency of words.
SVD PMI/LC SVD factorizes a signal matrix M = U D V^T, aiming to reduce the dimensions of the co-occurrence matrix. The resulting embedding is E = U_{:,1:d} D^{1/2}_{1:d,1:d}, where d is the dimension of the word embeddings. We denote the method as SVD PMI if the signal is the PMI matrix, and SVD LC if the signal is the log-count matrix.
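As an illustration, the SVD PMI construction can be sketched in a few lines of NumPy. This is our own toy example, not the paper's code; the co-occurrence matrix C, its size, and the dimension d are invented for illustration.

```python
import numpy as np

# Toy sketch of SVD_PMI: build a PMI matrix from a small symmetric
# co-occurrence matrix C, then take E = U_{:,1:d} D^{1/2}_{1:d,1:d}.
rng = np.random.default_rng(0)
C = rng.integers(1, 50, size=(8, 8)).astype(float)
C = C + C.T                                # symmetric co-occurrence counts

total = C.sum()
p_w = C.sum(axis=1) / total                # marginal word probabilities
pmi = np.log((C / total) / np.outer(p_w, p_w))

U, D, Vt = np.linalg.svd(pmi)              # signal matrix M = U D V^T
d = 4                                      # embedding dimension
E = U[:, :d] * np.sqrt(D[:d])              # E = U_{:,1:d} D^{1/2}_{1:d,1:d}
print(E.shape)                             # (8, 4)
```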
Although this paper focuses on standard word embeddings learned at the word level, RPD could be adapted to analyze embeddings learned from word pieces, for example fastText (Bojanowski et al., 2017), and contextualized embeddings (Peters et al., 2018; Devlin et al., 2019).

Relationship Between Embedding Spaces
Levy and Goldberg (2014b) provide an analysis connecting SGNS and SVD PMI. They suggest that SGNS essentially factorizes the pointwise mutual information (PMI) matrix. However, their analysis rests on the assumption of no dimension constraint in SGNS, which does not hold in practice. Furthermore, it does not extend to methods other than SGNS and PMI models, since their theoretical derivation relies on the specific objective of SGNS. Yin and Shen (2018) provide a way to select the best dimension of word embeddings for specific tasks by exploring the relations between embedding spaces of different dimensions. They introduce the Pairwise Inner Product (PIP) loss (Yin and Shen, 2018), a unitary-invariant metric for measuring the distance between word embeddings (Smith et al., 2017).
The unitary-invariance of word embeddings states that two embedding spaces are equivalent if one can be obtained from the other by multiplying by a unitary matrix. However, PIP loss is not suitable for numerical comparison across embedding spaces, since it has a different energy (scale) for each embedding space.

Quantifying Distances between Embeddings
In this section, we give the definition of RPD and describe the properties that make it a suitable and effective way to quantify the distance between embedding spaces. Note that two embedding spaces need not share exactly the same vocabulary for the RPD to be calculated.

RPD
For the following discussion, we always use the Frobenius norm as the norm of matrices.

Definition 1. (RPD)
The RPD between embedding matrices E_1 and E_2 is defined as

RPD(E_1, E_2) = ||Ẽ_1 Ẽ_1^T − Ẽ_2 Ẽ_2^T||^2 / (2 ||Ẽ_1 Ẽ_1^T|| · ||Ẽ_2 Ẽ_2^T||),

where Ẽ comes from dividing each entry of E by the standard deviation of the entries of E. For convenience, we write Ẽ ≡ E in the following discussion.
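A minimal NumPy sketch of the distance, as we reconstruct Definition 1 (the function name rpd, the exact placement of the standard-deviation normalization, and the toy matrices are ours):

```python
import numpy as np

# Sketch of RPD: normalize each embedding matrix by the standard deviation of
# its entries, form the pairwise-inner-product (Gram) matrices E E^T, and
# compare them with a normalized squared Frobenius distance.
def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()        # the ~E normalization
    G1, G2 = E1 @ E1.T, E2 @ E2.T                # Gram matrices
    num = np.linalg.norm(G1 - G2) ** 2           # Frobenius norm by default
    den = 2 * np.linalg.norm(G1) * np.linalg.norm(G2)
    return num / den

rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 50))
print(rpd(E, E))                                 # identical spaces: 0.0
```

For two independent random embeddings of this shape, the value comes out close to (but below) 1, matching the asymptotic behavior discussed below.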
The numerator of RPD respects the unitary-invariant property of word embeddings: a unitary transformation (i.e. a rotation) preserves the relative geometry of an embedding space. The denominator is a normalization, which allows us to treat the whole embedding matrix as a single object (i.e. RPD does not correlate with the number of words in the embedding spaces). This step makes comparisons across methods possible.
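The unitary-invariance claim is easy to check numerically. In this sketch (our own code; the embeddings are assumed already normalized, so the standard-deviation step is omitted), rotating a space by a random orthogonal matrix leaves its Gram matrix, and hence RPD, unchanged up to floating-point error:

```python
import numpy as np

# Rotating E by an orthogonal Q preserves E E^T, since (EQ)(EQ)^T = E E^T.
def rpd(E1, E2):
    G1, G2 = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(G1 - G2) ** 2 / (2 * np.linalg.norm(G1) * np.linalg.norm(G2))

rng = np.random.default_rng(1)
E = rng.standard_normal((500, 20))
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))   # random orthogonal matrix
print(rpd(E, E @ Q))                                 # ~0, up to rounding
```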

Statistical Properties of RPD
We adopt the widely used isotropy assumption (Arora et al., 2016) that the ensemble of word vectors consists of i.i.d. draws generated by v = s·v̂, where v̂ is drawn from a spherical Gaussian distribution and s is a scalar random variable. In our case, we can assume each entry of the embedding matrix E comes from a standard normal distribution: v_ij ∼ N(0, 1).
Note that the assumption may not always hold in practice, especially for other embeddings such as contextualized embeddings. However, the statistical properties derived under the isotropy condition are intuitively and empirically plausible. Besides, these properties mainly serve to interpret the value of RPD on its own; since RPD is in most cases used for comparison, the assumption is not restrictive.

Upper bound We estimate the asymptotic upper bound of RPD. Expanding the numerator of RPD, we get

RPD(E_1, E_2) = (||E_1 E_1^T||^2 + ||E_2 E_2^T||^2) / (2 ||E_1 E_1^T|| · ||E_2 E_2^T||) − ⟨E_1 E_1^T, E_2 E_2^T⟩ / (||E_1 E_1^T|| · ||E_2 E_2^T||). (1)
The last term of (1) is non-negative, since ⟨E_1 E_1^T, E_2 E_2^T⟩ = ||E_1^T E_2||^2 ≥ 0. Dropping it, we have the following estimation:

RPD(E_1, E_2) ≤ (||E_1 E_1^T||^2 + ||E_2 E_2^T||^2) / (2 ||E_1 E_1^T|| · ||E_2 E_2^T||). (2)
By the law of large numbers, we can prove that lim_{n→∞} ||E E^T|| / (n√d) = 1 (Appendix A). Then we can tell from (2) that RPD is bounded by 1 when n → ∞. In practice, the number of words n is large enough for the maximum of RPD to stay around 1, which means RPD is numerically well-defined.
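The norm estimate behind the bound can be checked numerically. The sketch below is our own code; it uses the identity ||E E^T||_F = ||E^T E||_F so that only a d × d matrix is ever formed, even for large n.

```python
import numpy as np

# Check that ||E E^T||_F / (n * sqrt(d)) tends to 1 as n grows, for a matrix
# E with i.i.d. N(0, 1) entries. tr((E E^T)^2) = tr((E^T E)^2), so the
# Frobenius norms of E E^T and E^T E coincide.
rng = np.random.default_rng(0)
d = 50
ratios = []
for n in (200, 2000, 20000):
    E = rng.standard_normal((n, d))
    ratios.append(np.linalg.norm(E.T @ E) / (n * np.sqrt(d)))
print([round(r, 3) for r in ratios])   # approaches 1 as n increases
```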
Normality We can show that RPD is normally distributed, from both a theoretical and an empirical perspective. Theoretically, by applying the central limit theorem to the numerator and the law of large numbers to the denominator of RPD, we obtain the normality of RPD under the condition n → ∞, d/n = c, where c remains constant (Appendix B). Empirically, we can use Monte Carlo simulation to show the normality and estimate the mean and variance of RPD (Appendix C). With these in hand, we can perform a hypothesis test (z-test) to evaluate the independence of two embedding spaces.
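The Monte Carlo estimate is easy to sketch with small random embeddings (our own code and toy sizes; the paper's µ = 0.953 and σ = 0.001 were estimated at the vocabulary size of its experiments, so the numbers produced below differ):

```python
import numpy as np

# Monte Carlo sketch: RPD between pairs of independent random embeddings
# concentrates around a mean just below 1 with a small spread, and the sample
# skewness stays near 0, consistent with an asymptotic normal shape.
def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    G1, G2 = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(G1 - G2) ** 2 / (2 * np.linalg.norm(G1) * np.linalg.norm(G2))

rng = np.random.default_rng(0)
n, d, trials = 400, 40, 200
samples = np.array([rpd(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
                    for _ in range(trials)])
mu, sigma = samples.mean(), samples.std()
skew = np.mean(((samples - mu) / sigma) ** 3)   # near 0 for a normal shape
print(round(mu, 3), round(sigma, 4))
```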

Geometric Interpretation of RPD
From equation (1), we can tell that the first term goes to 1 when n → ∞, so we only need to discuss the second term.
For the i-th row of E E^T, we have the vector v̂_i = (⟨v_i, v_1⟩, ⟨v_i, v_2⟩, ..., ⟨v_i, v_n⟩), where v_i is word i's vector in embedding E and n is the number of words. We can interpret v̂_i as another representation of word i, projected onto the space spanned by v_1, v_2, ..., v_n. For convenience, we denote Ê = E E^T, with v̂_i as its i-th row.
We can prove that the second term of (1) can be expressed through the angles θ_i between v̂_i^(1) (the i-th row of Ê_1) and v̂_i^(2) (the i-th row of Ê_2) (Appendix D). Therefore, we can understand the value of RPD from the perspective of the cosine similarity between these vectors.

RPD and Performance
As Yin and Shen (2018) discuss, the usability of word embeddings, such as using them to solve analogy and relatedness tasks, is important to practitioners. By applying different sets of word embeddings to word similarity and word analogy tasks (Mikolov et al., 2013), we study the relationship between RPD and the performance of word embeddings. Specifically, we take the word embeddings produced by SGNS with 25 NS as a starting point and another set of word embeddings, for example GloVe, as an end point. We then obtain a two-dimensional point whose x-coordinate is their RPD and whose y-coordinate is the absolute change in performance on the word similarity and analogy tasks.
Plotting these points in Figure 1, we can tell that, within a certain range of RPD, a larger RPD between two sets of word embeddings means a bigger gap in their absolute performance. Intuitively, RPD is strongly related to cosine similarity, which is the measure used for word similarity. RPD also shares the property of PIP loss that a small value leads to a small difference on relatedness and analogy tasks. We obtain similar results when the starting point is a different embedding space.
Note that this section serves to demonstrate that the performance variation of different embedding spaces (at least on word similarity and analogy tasks) is correlated with their RPD. While we are aware of the relevance of other downstream tasks, we do not explore them further, since our focus lies in investigating the intrinsic geometric relations of embedding spaces.

Experiment
The following experiments apply RPD to explore some questions of interest and further demonstrate that RPD is suitable for investigating the relations between embedding spaces. We leave applying RPD to improve specific NLP tasks to future research. For example, RPD could be used for combining different embeddings, which could help us produce better meta-embeddings (Kiela et al., 2018).

Setup
Unless explicitly stated otherwise, the experiments are performed on the Text8 corpus (Mahoney, 2011), a standard benchmark corpus used for various natural language tasks (Yin and Shen, 2018). For all the methods in our experiments, we train 300-dimensional embeddings with a window size of 10 and normalize the embedding matrices by their standard deviation. The default number of NS for SGNS is 15. As discussed in the introduction, the relationship between embeddings trained with SGNS and SVD PMI remains controversial (Arora et al., 2016; Mimno and Thompson, 2017). We use the results obtained in Section 3.2 to test their dependence.

Methods
For example, if one believes that E_1 trained with SGNS and E_2 trained with SVD PMI have no relationship, then the null hypothesis H_0 would be: E_1 and E_2 are independent. Under H_0, RPD(E_1, E_2) asymptotically follows N(µ, σ^2), and the test statistic is calculated as z = (RPD(E_1, E_2) − µ) / σ.
In our case, we estimate µ = 0.953 and σ = 0.001 by Monte Carlo simulation with randomly initialized embeddings. Taking RPD(E SGNS 1, E SVD PMI) = 0.511 from Table 1 as an example, the statistic gives |z| = 442, which means the p-value is far below 0.01. Thus, we can confidently reject H_0. Notice that we can test any two sets of word embeddings with this method. It is not hard to see that no pair of word embeddings in Table 1 is independent, which suggests that there exists a unified theory behind these methods.
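The test can be sketched end-to-end with our own toy sizes (so the simulated µ and σ differ from the paper's values); the observed RPD of 0.511 is the number quoted from Table 1:

```python
import numpy as np

# z-test sketch: estimate mu and sigma under H0 (independent embedding
# spaces) by Monte Carlo with randomly initialized embeddings, then
# standardize the observed RPD.
def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    G1, G2 = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(G1 - G2) ** 2 / (2 * np.linalg.norm(G1) * np.linalg.norm(G2))

rng = np.random.default_rng(0)
n, d = 400, 40
null = [rpd(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
        for _ in range(100)]
mu, sigma = np.mean(null), np.std(null)

observed = 0.511              # RPD(SGNS_1, SVD_PMI), as quoted from Table 1
z = (observed - mu) / sigma   # |z| far in the tail -> reject H0
print(abs(z) > 10)
```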

SGNS is Closest to SVD PMI
With the help of RPD, it is also interesting to investigate the distances between embeddings produced by different methods. Here, we calculate the RPDs among SGNS (with 25, 15, 5, and 1 negative samples), GloVe, SVD PMI, and SVD LC.
Table 1 shows the RPDs between SGNS with different numbers of negative samples and the other methods. From the table, we can tell that SGNS stays closest to SVD PMI, which supports Levy and Goldberg (2014b)'s theory.

Hyper-parameters Have Influence on Embeddings
From Table 1, an interesting phenomenon is that SGNS moves closer to the other methods as the number of negative samples decreases, which suggests that negative sampling is one of the factors driving SGNS away from the matrix factorization methods.
Using the RPDs between different sets of word embeddings, we plot the embeddings in 2D, treating each embedding space as a single point. We first fix the points SVD PMI and SVD LC, then draw the other points according to their RPDs to the existing ones. Figure 2 helps us see intuitively how negative sampling affects the embeddings: increasing the number of negative samples pulls SGNS away from SVD PMI. Combining Table 1 and Figure 2, we can tell that although hyper-parameters influence the embeddings to some extent, the main difference comes from the algorithms.
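A layout of this kind can be produced from the pairwise distances alone, for example with classical multidimensional scaling. The sketch below is our own code, and the dissimilarity values are made up for illustration rather than taken from Table 1.

```python
import numpy as np

# Classical MDS sketch: turn a symmetric matrix of pairwise RPD-style
# dissimilarities between four embedding spaces into 2-D coordinates.
D = np.array([
    [0.00, 0.51, 0.62, 0.70],
    [0.51, 0.00, 0.45, 0.66],
    [0.62, 0.45, 0.00, 0.58],
    [0.70, 0.66, 0.58, 0.00],
])
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
B = -0.5 * J @ (D ** 2) @ J                    # double-centered squared distances
vals, vecs = np.linalg.eigh(B)                 # eigenvalues in ascending order
coords = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0))
print(coords.shape)                            # (4, 2): one point per method
```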

Different Initializations Barely Influence Embeddings
Random initialization produces different embeddings with the same algorithm and hyperparameters. While these embeddings usually achieve similar performance on downstream tasks, people are still concerned about their effects. We investigate the influence of random initialization for GloVe and SGNS. We train embeddings in the same setting multiple times and compute the average RPD for each method. For SGNS, the average RPD across random initializations is 0.027; for GloVe, it is 0.059.
We can tell that different random initializations produce essentially the same embeddings; neither method is sensitive to initialization, although GloVe varies slightly more than SGNS.

Different Corpora Produce Different Embeddings
It is well known that different corpora produce different word embeddings. However, it is hard to tell how different they are and whether the difference influences downstream applications (Antoniak and Mimno, 2018). Knowing this would help researchers choose algorithms for specific scenarios, for example evolving semantic discovery (Yao et al., 2018; Kozlowski et al., 2019). These works focus on the semantic evolution of words, but the corpora differ across time scales. Their methods use word embeddings to study semantic shift, which might be influenced by the word embeddings being trained on different corpora, thus producing unreliable results. In this case, it would be important to choose an algorithm less prone to influence by differences in corpora. We train word embeddings on each of text8 (Wikipedia domain, 25,097 unique words), WMT14 news crawl (newswire domain, 24,359 unique words), and TED speech (speech domain, 7,389 unique words), and compute RPD on the intersections of their vocabularies. From Table 2, we can tell that SGNS is consistently more stable than GloVe across domains. We suggest that this is because GloVe trains its embeddings from the co-occurrence matrix, which is influenced more by the corpus.
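Computing RPD across corpora requires restricting both embedding matrices to the shared vocabulary and aligning the row order first. A sketch with our own toy vocabularies and random matrices:

```python
import numpy as np

# Restrict two embedding matrices to their shared words, put the rows in the
# same order, then compute RPD on the aligned sub-matrices.
def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    G1, G2 = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(G1 - G2) ** 2 / (2 * np.linalg.norm(G1) * np.linalg.norm(G2))

vocab_a = {"the": 0, "king": 1, "queen": 2, "speech": 3}   # corpus A
vocab_b = {"the": 0, "queen": 1, "news": 2, "king": 3}     # corpus B
rng = np.random.default_rng(0)
E_a = rng.standard_normal((4, 8))
E_b = rng.standard_normal((4, 8))

shared = sorted(set(vocab_a) & set(vocab_b))               # shared words only
A = E_a[[vocab_a[w] for w in shared]]                      # rows in matching order
B = E_b[[vocab_b[w] for w in shared]]
print(shared, A.shape)                                     # ['king', 'queen', 'the'] (3, 8)
```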

Discussion
While our work investigates some interesting problems about word embeddings, many other questions about embeddings could be studied with the help of RPD. We discuss some of them below.

RPD and Crosslingual Word Embeddings
Artetxe et al. (2018) provide a framework to obtain bilingual embeddings: train monolingual embeddings separately, then map them into a shared embedding space with a linear transformation. The core step of the framework is an orthogonal transformation, and other existing methods can be seen as variations of it.
While a linear transformation is no guarantee that two embedding spaces from different languages can be aligned, RPD could serve as a way to indicate how much different language pairs benefit from being mapped by an orthogonal transformation. Since RPD is unitary-invariant, we can calculate it between the embedding spaces of different language pairs: the smaller the RPD, the better the framework could align the two languages' embedding spaces.
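The core orthogonal-mapping step can be sketched with the standard Procrustes solution (our own code and toy sizes; in the idealized case below, where the target space is an exact rotation of the source, the map is recovered exactly):

```python
import numpy as np

# Orthogonal Procrustes sketch: find the orthogonal W minimizing ||X W - Y||_F
# via the SVD of X^T Y. Here Y is constructed as an exact rotation of X, so
# the recovered W reproduces it.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))              # "source language" embeddings
Q_true, _ = np.linalg.qr(rng.standard_normal((10, 10)))
Y = X @ Q_true                                  # target = rotated source

U, _, Vt = np.linalg.svd(X.T @ Y)               # Procrustes solution W = U V^T
W = U @ Vt
err = np.linalg.norm(X @ W - Y)
print(err < 1e-8)                               # exact rotation is recovered
```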

RPD and Post-Processing Word Embeddings
Post-processing word embeddings can be useful in many ways. For example, Vulić et al. (2018) retrofit word embeddings with external linguistic resources such as WordNet to obtain better embeddings; Rothe and Schütze (2016) decompose the embedding space to get better performance in specialized domains; and Mu and Viswanath (2018) obtain stronger embeddings by eliminating the common mean vector and a few top dominating directions. RPD could serve as a metric to evaluate how the embedding space changes intrinsically after post-processing.
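As a sketch of how RPD could quantify such a change (our own code, loosely in the spirit of Mu and Viswanath (2018)'s mean-and-top-directions removal; the random matrix stands in for trained embeddings and the number of removed directions is arbitrary):

```python
import numpy as np

# Post-process an embedding matrix by removing the common mean vector and the
# top principal directions, then measure the intrinsic change with RPD.
def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    G1, G2 = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(G1 - G2) ** 2 / (2 * np.linalg.norm(G1) * np.linalg.norm(G2))

rng = np.random.default_rng(0)
E = rng.standard_normal((300, 50))
E_c = E - E.mean(axis=0)                       # remove the common mean vector
U, S, Vt = np.linalg.svd(E_c, full_matrices=False)
top = Vt[:3]                                   # top 3 dominating directions
E_post = E_c - (E_c @ top.T) @ top             # project them out
print(round(rpd(E, E_post), 3))                # intrinsic change after post-processing
```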

RPD and Contextualized Word Embeddings
Contextualized embeddings are popular NLP techniques that significantly improve a wide range of NLP tasks (Bowman et al., 2015; Rajpurkar et al., 2018). To understand why contextualized embeddings benefit these tasks, many works investigate the nature of the syntactic (Liu et al., 2019), semantic (Liu et al., 2019), and commonsense knowledge (Zhou et al., 2019) contained in such representations. However, we still know little about the vector spaces of contextualized embeddings and their relationship to traditional word embeddings, which is important for further applying contextualized embeddings in various scenarios (Lin and Smith, 2019). RPD can potentially help us better understand contextualized embeddings in future research.

Conclusion
In this paper, we propose RPD, a metric to quantify the distance between embedding spaces (i.e. different sets of word embeddings). With the help of RPD and its properties, we verify some intuitions and answer some questions about word embeddings. Having justified RPD theoretically and empirically, we believe it offers a new perspective from which to understand and compare word embeddings.

Figure 1: Difference in performance as a function of RPD. The x-axis of each point is the RPD between the word embeddings produced by SGNS (with NS 15, 5, 1), GloVe, SVD PMI, or SVD LC and the word embeddings produced by SGNS 25. The y-axis of each point is the sum of the absolute variation in performance (word similarity and word analogy).

Figure 2: Plot of the different methods. We create the plot by fixing the positions of SVD LC and SVD PMI, then deriving the positions of the other word embeddings according to their RPDs to the points already on the plot.

Table 2: RPDs between the same method trained on different corpora.