Correlations between Word Vector Sets

Similarity measures based purely on word embeddings are comfortably competing with much more sophisticated deep learning and expert-engineered systems on unsupervised semantic textual similarity (STS) tasks. In contrast to commonly used geometric approaches, we treat a single word embedding as e.g. 300 observations from a scalar random variable. Using this paradigm, we first illustrate that similarities derived from elementary pooling operations and classic correlation coefficients yield excellent results on standard STS benchmarks, outperforming many recently proposed methods while being much faster and trivial to implement. Next, we demonstrate how to avoid pooling operations altogether and compare sets of word embeddings directly via correlation operators between reproducing kernel Hilbert spaces. Just like cosine similarity is used to compare individual word vectors, we introduce a novel application of the centered kernel alignment (CKA) as a natural generalisation of squared cosine similarity for sets of word vectors. Likewise, CKA is very easy to implement and enjoys very strong empirical results.

By contrast, relatively little effort has been directed towards understanding the similarity measures used to compare these textual embeddings, for which cosine similarity remains a convenient and widespread, yet somewhat arbitrary default, despite some emerging research into the alternatives (Camacho-Collados et al., 2015; De Boom et al., 2015; Santus et al., 2018; Zhelezniak et al., 2019b,a). Part of the appeal of cosine similarity perhaps lies in the simple geometric interpretation behind it. However, as embeddings are ultimately just arrays of numbers, we are free to take alternative viewpoints other than the geometric ones, if they lead to illuminating insights or strong-performing methods.
Following Zhelezniak et al. (2019a), we treat a word embedding not as a geometric vector but as a statistical sample (of e.g. 300 observations) from a scalar random variable, and indeed find insights that are both intriguing and noteworthy. We first illustrate that similarities derived from elementary pooling operations and classic univariate correlation coefficients yield excellent results on standard semantic textual similarity (STS) benchmarks, outperforming many recently proposed methods while being much faster and simpler to implement. This empirically validates the advantages of the statistical perspective on word embeddings over the geometric interpretations. In the process, we provide more evidence that departures from normality, and in particular the presence of outliers, can have severe negative effects on the performance of some correlation coefficients. We show how to overcome these complications by selecting an outlier-removing pooling operation such as max-pooling, applying a more robust correlation coefficient such as Spearman's ρ, or simply clipping (winsorizing) the word vectors.
Next, we demonstrate how to avoid pooling operations completely and compare sets of word embeddings directly via correlation operators between reproducing kernel Hilbert spaces (RKHS). We introduce a novel application of the kernel alignment (KA) and the centered kernel alignment (CKA) as a natural generalisation of the squared cosine similarity and Pearson correlation for sets of word embeddings. These multivariate correlation coefficients are very easy to implement and also enjoy very strong empirical results.

Related Work
Several lines of research seek to combine the strength of pretrained word embeddings and the elegance of set- or bag-of-words (BoW) representations. Any method that determines semantic similarity between sentences by comparing the corresponding sets of word embeddings is directly related to our work.
Perhaps the most obvious such approaches are based on elementary pooling operations such as average-, max- and min-pooling (Mitchell and Lapata, 2008; De Boom et al., 2015, 2016). While seemingly over-simplistic, numerous studies have confirmed their impressive performance on downstream tasks (Arora et al., 2017; Wieting et al., 2016; Wieting and Gimpel, 2018; Zhelezniak et al., 2019b). One step further, Zhao and Mao (2017) and Zhelezniak et al. (2019b) introduce fuzzy bags-of-words (FBoW), where degrees of membership in a fuzzy set are given by the similarities between word embeddings. Zhelezniak et al. (2019b) show a close connection between FBoW and max-pooled word vectors.
Some approaches do not seek to build an explicit representation and instead focus directly on designing a similarity function between sets. Word Mover's Distance (WMD) (Kusner et al., 2015) is an instance of the Earth Mover's Distance (EMD) computed between normalised BoW, with the cost matrix given by Euclidean distances between word embeddings. In the soft cardinality framework of Jimenez et al. (2010, 2015), the contribution of a word to the cardinality of a set depends on its similarities to other words in the same set. Such sets are then compared using an appropriately defined Jaccard index or related measures. DynaMax (Zhelezniak et al., 2019b) uses universe-constrained fuzzy sets designed explicitly for similarity computations.
Approaches that see word embeddings as statistical objects are very closely related to our work. Virtually all of them treat word embeddings as observations from some D-variate parametric family, where D is the embedding dimension. Arora et al. (2016, 2017) introduce a latent discourse model and show the maximum likelihood estimate (MLE) for the discourse vector to be the weighted average of word embeddings in a sentence, where the weights are given by smooth inverse frequencies (SIF). Nikolentzos et al. (2017) and Torki (2018) treat sets of word embeddings as observations from D-variate Gaussians, and compare such sets with cosine similarity between the parameters (means and covariances) estimated by maximum likelihood. Vargas et al. (2019) measure semantic similarity through a penalised likelihood ratio between the joint and factorised models and explore Gaussian and von Mises-Fisher likelihoods.
Cosine similarity between covariances is an instance of the RV coefficient, and its uncentered version was applied in the context of word embeddings before (Botev et al., 2017). We arrive at a similar coefficient (but with different centering) as a special case of CKA, which in the general case makes no parametric assumptions about distributions whatsoever. In particular, our version is suitable for comparing sets containing just one word vector, whereas the method of Nikolentzos et al. (2017) and Torki (2018) requires at least two vectors in each set. Very recently, Kornblith et al. (2019) used CKA to compare representations between layers of the same or different neural networks. This is again an instance of treating such representations as observations from a D-variate distribution, where D is the dimension of the hidden layer in question. Our use of CKA is completely different from theirs.
Unlike all of the above approaches, Zhelezniak et al. (2019a) see each word embedding itself as D (e.g. 300) observations from some scalar random variable. They cast semantic similarity as correlations between these random variables and study their properties using simple tools from univariate statistics. While they consider correlations between individual word vectors and averaged word vectors, they do not formally explore correlations between word vector sets. We review their framework in Section 3 and then proceed to formalise and generalise it to the case of sets of word embeddings.
3 Background: Correlation Coefficients and Semantic Similarity
Suppose we have a word embeddings matrix $W \in \mathbb{R}^{N \times D}$, where N is the number of words in the vocabulary and D is the embedding dimension (usually 300). In other words, each row $w^{(i)}$ of W is a D-dimensional word vector. When applying statistical analysis to these vectors, one might choose to treat each $w^{(i)}$ as an observation from some D-variate distribution $P_D(E_1, \ldots, E_D)$ and model it with a Gaussian or a Gaussian mixture. While such analysis helps in studying the overall geometry of the embedding space (how dimensions correlate and how embeddings cluster), $P_D$ is not directly useful for semantic similarity between individual words.
For the latter, Zhelezniak et al. (2019a) proposed to look at the transpose $W^T$ and the corresponding distribution $P(W_1, W_2, \ldots, W_N)$. Under this perspective, each word vector $w^{(i)}$ is now a sample of D (e.g. 300) observations from a scalar random variable $W_i$. Luckily, in applications we are usually not interested in the full joint distribution but only in the similarity between two words, i.e. the bivariate marginal $P(W_i, W_j)$. In practice, we make inferences about this marginal from the paired sample $(w^{(i)}, w^{(j)})$ through visualisations (histograms, Q-Q plots, scatter plots, etc.) as well as various statistics. Zhelezniak et al. (2019a) found that for all common models (GloVe, fastText, word2vec) the means across word embeddings are tightly concentrated around zero (relative to their dimensions), thus making the widely used cosine similarity practically equivalent to Pearson correlation. However, while word2vec vectors seem mostly normal, GloVe and fastText vectors are highly non-normal, likely due to the presence of heavy univariate and bivariate outliers (as suggested by the visualisations mentioned earlier). Quantitatively, the majority of GloVe and fastText vectors fail the Shapiro-Wilk normality test at significance level 0.05. Therefore, while Pearson's r (and thus cosine similarity) may be acceptable for word2vec, it is preferable to resort to more robust non-parametric correlation coefficients such as Spearman's ρ or Kendall's τ as a similarity measure between GloVe and fastText vectors.
Finally, very similar conclusions were shown to hold for sentence representations obtained by word vector averaging, also referred to as mean-pooling. In particular, averaged fastText vectors compared with rank correlation coefficients already show impressive results on standard STS tasks, rivaling much more sophisticated systems.
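The relationship between cosine similarity, Pearson's r and Spearman's ρ described above can be illustrated with a minimal sketch. It uses synthetic Gaussian vectors as stand-ins for real word embeddings, and the helper names (`cosine`, `pearson`, `average_ranks`, `spearman`) as well as the planted outlier are illustrative assumptions rather than part of the original experiments.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two 1-D arrays."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):
    """Pearson's r: cosine similarity after subtracting each vector's mean."""
    return cosine(x - x.mean(), y - y.mean())

def average_ranks(v):
    """Ranks starting at 1; tied values share the average of their ranks."""
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)  # highest rank inside each tie group
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Synthetic stand-ins for two 300-dimensional word vectors (not real embeddings).
rng = np.random.default_rng(0)
w_i = rng.standard_normal(300)
w_j = 0.6 * w_i + 0.8 * rng.standard_normal(300)

# With exactly zero means, cosine similarity and Pearson's r coincide.
a, b = w_i - w_i.mean(), w_j - w_j.mean()
print("cosine vs Pearson on centred vectors:", cosine(a, b), pearson(a, b))

# Plant one heavy bivariate outlier in a single dimension (an assumption
# made purely for illustration).
o_i, o_j = w_i.copy(), w_j.copy()
o_i[0], o_j[0] = 30.0, -30.0
print("Pearson :", pearson(w_i, w_j), "->", pearson(o_i, o_j))
print("Spearman:", spearman(w_i, w_j), "->", spearman(o_i, o_j))
```

In this toy setting a single extreme dimension is enough to move Pearson's r substantially, while the rank-based coefficient changes only slightly, which is the behaviour that motivates rank correlations for heavy-tailed vectors.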

Correlations between Word Vector Sets
We are interested in applying the statistical framework from Section 3 to measure the semantic similarity between two sentences $s_1$ and $s_2$ given by the sets (or bags) $S_1$ and $S_2$ of word embeddings respectively. To formalise this new setup, we may see each set of word embeddings $S = \{w^{(1)}, w^{(2)}, \ldots, w^{(k)}\}$ as a sample (of e.g. 300 observations) from some theoretical set of scalar random variables $R = \{W_1, W_2, \ldots, W_k\}$. In light of the above, our task then lies in finding correlation coefficients $\mathrm{corr}(R_1, R_2)$ between $R_1$ and $R_2$ and their empirical estimates $\mathrm{corr}(S_1, S_2)$ obtained from the paired sample $S_1, S_2$, hoping that such coefficients will serve as a good proxy for semantic similarity. Recall that for single-word sets $R_1 = \{W_i\}$, $R_2 = \{W_j\}$ the task simply reduces to computing a univariate correlation between word vectors $w^{(i)}$ and $w^{(j)}$, where the choice of the coefficient (Pearson's r, Spearman's ρ, etc.) is made based on the statistics exhibited by the word embeddings matrix. While generalising this to sets of more than one variable is not particularly hard, there are several ways to do so, each with its own advantages and downsides. In the present work, we group these approaches into two broad families: pooling-based and pooling-free correlation coefficients.

Algorithm 1 MaxPool-Spearman
Require: Word embeddings for the first sentence $x^{(1)}, x^{(2)}, \ldots, x^{(k)} \in \mathbb{R}^{1 \times d}$
Require: Word embeddings for the second sentence $y^{(1)}, y^{(2)}, \ldots, y^{(l)} \in \mathbb{R}^{1 \times d}$
Ensure: Similarity score MS
    # Max-pooling performed element-wise
    x ← MAX_POOL($x^{(1)}, x^{(2)}, \ldots, x^{(k)}$)
    y ← MAX_POOL($y^{(1)}, y^{(2)}, \ldots, y^{(l)}$)
    MS ← SPEARMAN_CORRELATION(x, y)

Pooling-based approaches first reduce each set of word vectors to a single fixed vector $w_{pool}$, followed by computing univariate sample correlations. Certainly, these approaches are empirically attractive: not only are they very simple computationally (e.g. see Algorithm 1) but they also keep us in the realm of univariate statistics, where we have an entire arsenal of effective tools for making inferences about $W_{pool}$. Unfortunately, it is not always clear a priori what should dictate our choice of the pooling function (though, as we will see shortly, for certain functions some statistical justifications do exist). By far the most common pooling operations for word embeddings found in the literature are mean-, max- and min-pooling. It is also very common, with some exceptions, to treat these various pooled representations in a completely identical fashion, e.g. by comparing them all with cosine similarity. Intuitively, however, we suggest that the statistics of $W_{pool}$ must heavily depend on the pooling function $f_{pool}$ and thus each such pooled random variable should be studied in its own right. To illustrate this point, we would like to reveal the very different nature of mean-, max- and min-pooled sentence vectors through a practical example.
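Algorithm 1 can be sketched in a few lines of NumPy. The function names and the random toy inputs below are assumptions made for illustration; each sentence is assumed to arrive as an array with one word vector per row.

```python
import numpy as np

def average_ranks(v):
    """Ranks starting at 1; tied values share the average of their ranks."""
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman(x, y):
    """Spearman's rho as Pearson's r between the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def maxpool_spearman(sent_x, sent_y):
    """MaxPool-Spearman similarity between two sentences.

    sent_x has shape (k, d) and sent_y has shape (l, d): one word vector
    per row. The sentence lengths k and l may differ; d must match.
    """
    x = np.max(sent_x, axis=0)  # element-wise max over the k words
    y = np.max(sent_y, axis=0)  # element-wise max over the l words
    return spearman(x, y)

# Toy usage with random vectors standing in for real word embeddings.
rng = np.random.default_rng(0)
s1 = rng.standard_normal((4, 300))  # sentence with 4 words
s2 = rng.standard_normal((7, 300))  # sentence with 7 words
print(maxpool_spearman(s1, s2))
```

Because max-pooling is taken over the set of word vectors, the score does not depend on word order, and the only requirement on the inputs is a shared embedding dimension.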

Statistics of the Pooled Representations: A Practical Analysis
Let us begin by examining sentence vectors obtained through mean-pooling. Recall that for common word embedding models, the mean across the 300 dimensions of a single word embedding $w^{(i)}$ happens to be close to zero (relative to the dimensions). By the linearity of expectation, we have that

$$E[W_{mean}] = E\left[\frac{1}{k}\sum_{i=1}^{k} W_i\right] = \frac{1}{k}\sum_{i=1}^{k} E[W_i],$$

and so the mean across $w_{mean}$ will also be close to zero, at least for small k. In practice, this seems to hold even for moderate k in naturally occurring sentences, as seen in Figure 1. Based on this, we expect Pearson correlation and cosine similarity to have almost identical performance on the downstream tasks, which is confirmed in Figure 2.
By contrast, the mean of a max-pooled vector $w_{max}$ is shifted to the right because of the max operation, which we see in Figure 1. In this case, cosine similarity and Pearson correlation will yield different results and, in fact, Pearson's r considerably outperforms cosine on the downstream tasks (Figure 2). This in turn empirically adds weight to the statistical interpretation (correlation) over its geometrical counterpart (angle between vectors).
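The contrast between mean- and max-pooled vectors can be reproduced in a small simulation. The sketch below assumes i.i.d. standard normal dimensions as a stand-in for real embeddings and two unrelated sentences of k = 8 words each; these choices, and all names in the code, are illustrative assumptions.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):
    return cosine(x - x.mean(), y - y.mean())

rng = np.random.default_rng(0)
k, D = 8, 300
# Two unrelated "sentences": k zero-mean random word vectors each.
sent_x = rng.standard_normal((k, D))
sent_y = rng.standard_normal((k, D))

mean_x, mean_y = sent_x.mean(axis=0), sent_y.mean(axis=0)
max_x, max_y = sent_x.max(axis=0), sent_y.max(axis=0)

# Mean-pooling keeps the mean across dimensions near zero ...
print("mean of mean-pooled vector:", mean_x.mean())
# ... while max-pooling shifts it to the right.
print("mean of max-pooled vector: ", max_x.mean())

# For zero-mean vectors cosine and Pearson almost agree.
print("mean-pool cosine / Pearson:", cosine(mean_x, mean_y), pearson(mean_x, mean_y))
# For max-pooled vectors the shared positive shift inflates cosine,
# whereas Pearson's r removes it by centring.
print("max-pool  cosine / Pearson:", cosine(max_x, max_y), pearson(max_x, max_y))
```

In this simulation the two sentences are independent, so a similarity near zero is the desired outcome; cosine similarity on the max-pooled vectors is nonetheless large purely because both vectors share a positive mean.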
Recall also that unlike word2vec, GloVe and fastText vectors feature heavy univariate outliers, and the same can be expected to hold for the pooled representations; an example is shown in Figure 3. In the case of mean-pooled vectors, this particular departure from normality can be successfully detected by the Shapiro-Wilk normality test, informing the appropriate choice of the correlation coefficient (Pearson's r or robust rank correlation). By contrast, such a procedure cannot be readily applied to max-pooled and min-pooled vectors, as by construction they exhibit additional departures from normality, such as positive and negative skew respectively. It is always a good idea to consult visualisations for such vectors, such as the ones in Figure 3. Interestingly though, we do observe some noteworthy regularities, which we describe further in Section 5.
The above example is meant to illustrate that even the simplest pooled random variables show strikingly different statistics depending on the aggregation. While the abundance of various pooling operations may be intimidating, the resulting vectors are always subject to the many tools of univariate statistics. As we hope to have shown, even crude analysis can shed light on the nature of these textual representations, which in turn has notable practical implications, as we will see in Section 5.

Correlations between Random Vectors
Exactly as before, suppose we have two sentences $S_1 = \{x^{(1)}, x^{(2)}, \ldots, x^{(k)}\}$ and $S_2 = \{y^{(1)}, y^{(2)}, \ldots, y^{(l)}\}$ and the corresponding random vectors $X = (X_1, X_2, \ldots, X_k)$ and $Y = (Y_1, Y_2, \ldots, Y_l)$. At this point it is important to emphasise again that we relate each word vector $x^{(i)}$ to a random variable $X_i$ and treat the dimensions of $x^{(i)}$ as D observations from that variable, and similarly for $y^{(i)}$ and $Y_i$. In contrast with the pooling-based approaches, our task here is to find a suitable correlation coefficient directly between the random vectors X and Y. We begin by recalling the expression for the basic univariate Pearson's r:

$$r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}, \quad (1)$$

where $\mu_X = E[X]$ and $\sigma_X^2 = E[(X - \mu_X)^2]$, and similarly for $\mu_Y$ and $\sigma_Y$. The covariance term $\mathrm{cov}(X, Y)$ in the numerator is readily generalised to random vectors by the following cross-covariance operator between reproducing kernel Hilbert spaces (RKHS) $\mathcal{F}$ and $\mathcal{G}$:

$$C_{XY} = E_{XY}\left[(\phi(X) - \mu_X) \otimes (\psi(Y) - \mu_Y)\right], \quad (2)$$

where $\otimes$ denotes the tensor product and $\mu_X = E[\phi(X)]$, $\mu_Y = E[\psi(Y)]$. Here $\phi$ and $\psi$ are the feature maps such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}} = K(x, x')$ and $\langle \psi(y), \psi(y') \rangle_{\mathcal{G}} = L(y, y')$, where K and L are the kernels associated with RKHS $\mathcal{F}$ and $\mathcal{G}$ respectively. Note that if $\phi$ and $\psi$ are the identity maps, the cross-covariance operator (2) simply becomes the cross-covariance matrix

$$C_{XY} = E_{XY}\left[(X - \mu_X)(Y - \mu_Y)^T\right].$$

Gretton et al. (2005a) define the Hilbert-Schmidt independence criterion (HSIC) to be the squared Hilbert-Schmidt norm $\|C_{XY}\|^2_{HS}$ of (2) and derive an expression for it in terms of kernels K and L:

$$\mathrm{HSIC}(X, Y) = E_{x,x',y,y'}[K(x, x') L(y, y')] + E_{x,x'}[K(x, x')] \, E_{y,y'}[L(y, y')] - 2 E_{x,y}\left[E_{x'}[K(x, x')] \, E_{y'}[L(y, y')]\right],$$

where $(x', y')$ denotes an independent copy of $(x, y)$. They also show the empirical estimate of it to be

$$\mathrm{HSIC}(K, L) = (D - 1)^{-2} \, \mathrm{Tr}(KHLH), \quad (3)$$

where $H = I - \frac{1}{D} \mathbf{1}\mathbf{1}^T$ is the centering matrix and $K_{ij} = K(X^{(i)}, X^{(j)})$, $L_{ij} = L(Y^{(i)}, Y^{(j)})$, $i, j = 1, \ldots, D$, are the kernel (Gram) matrices of observations. Crucially, the kernel evaluations for K take place between $X^{(i)} = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(k)})$ and $X^{(j)} = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(k)})$, i.e. between the vectors collecting the i-th and j-th dimensions of all words in the sentence, and not between the individual word embeddings $x^{(i)}$ and $x^{(j)}$, and similarly for L.
Thus, both K and L are square matrices of dimension $D \times D$. Indeed, for (3) to make sense, the dimensions of K and L must match. The matching dimension in our case is the word embedding dimension D, while the number of words k and l in the sentences may vary. This is in line with our formalism, which models word vectors as random variables and their dimensions as observations.
The centered kernel alignment (CKA) is then obtained by normalising HSIC, in direct analogy with how Pearson's r normalises the covariance:

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K) \, \mathrm{HSIC}(L, L)}}. \quad (4)$$

We see now that CKA not only generalises the squared Pearson correlation to the multivariate case, it also allows it to operate in high-dimensional feature spaces, as commonly done in the kernel literature. The reason this is useful is that under certain conditions (when K and L are characteristic kernels), HSIC can detect any existing dependence with high probability as the sample size increases (Gretton et al., 2005b). One can also consider the uncentered kernel alignment (or simply KA) (Cristianini et al., 2002), which can then be seen as a similar generalisation but for the univariate cosine similarity. To the best of our knowledge, KA and CKA in general have never been applied before to measure semantic similarity between sets of word embeddings; therefore this work seeks to introduce them as standard definitions for squared Pearson's r and cosine similarity for such sets.
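A minimal NumPy sketch of the empirical HSIC estimate, CKA and KA with a linear kernel is given below. It assumes each sentence is stored as an array with one word vector per row, so that the D columns play the role of observations; the function names and random inputs are illustrative assumptions, and other kernels (e.g. Gaussian) are not shown.

```python
import numpy as np

def linear_gram(S):
    """D x D linear-kernel Gram matrix for a sentence S of shape (k, D).

    Observation i is the column S[:, i], i.e. the i-th dimension of all
    k word vectors, so the Gram matrix is S^T S.
    """
    return S.T @ S

def hsic(K, L):
    """Empirical HSIC: (D - 1)^-2 * Tr(K H L H)."""
    D = K.shape[0]
    H = np.eye(D) - np.ones((D, D)) / D  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (D - 1) ** 2

def cka(K, L):
    """Centered kernel alignment: HSIC normalised to be scale-free."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

def ka(K, L):
    """Uncentered kernel alignment: Frobenius cosine between Gram matrices."""
    return float(np.sum(K * L) / np.sqrt(np.sum(K * K) * np.sum(L * L)))

rng = np.random.default_rng(0)
D = 300
S1 = rng.standard_normal((3, D))  # sentence with k = 3 words
S2 = rng.standard_normal((5, D))  # sentence with l = 5 words
print("CKA:", cka(linear_gram(S1), linear_gram(S2)))
print("KA: ", ka(linear_gram(S1), linear_gram(S2)))

# Single-word sets: linear CKA reduces to squared Pearson's r and
# linear KA to squared cosine similarity.
x, y = S1[:1], S2[:1]
r = np.corrcoef(x[0], y[0])[0, 1]
c = x[0] @ y[0] / (np.linalg.norm(x[0]) * np.linalg.norm(y[0]))
print(cka(linear_gram(x), linear_gram(y)), r ** 2)
print(ka(linear_gram(x), linear_gram(y)), c ** 2)
```

Note that both Gram matrices are D × D regardless of sentence length, which is why sentences with different numbers of words can be compared directly.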

Experiments
We now empirically demonstrate the power of the methods and statistical analysis presented in Section 4, through a set of evaluations on the Semantic Textual Similarity (STS) task series 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017). For methods involving pretrained word embeddings, we use fastText (Bojanowski et al., 2017) trained on Common Crawl (600B tokens), as previous evaluations have indicated that fastText vectors have uniformly the best performance on these tasks out of commonly used pretrained unsupervised word vectors (Conneau et al., 2017; Perone et al., 2018; Zhelezniak et al., 2019a,b). We provide experiments and significance analysis for additional word vectors in the Appendix. The success metric for the STS tasks is the Pearson correlation between the sentence similarity scores provided by human annotators and the scores generated by a candidate algorithm. Note that the dataset for the STS13 SMT subtask is no longer publicly available, so the mean Pearson correlation for STS13 reported in our experiments has been re-calculated accordingly. The code for our experiments builds on the SentEval toolkit (Conneau and Kiela, 2018) and is available on GitHub.
We first conduct a set of experiments to validate the observations of Sections 4.1 and 4.2 regarding the performance of cosine similarity and various univariate correlation coefficients when applied to pooled word vectors. These results are depicted in Figure 2, from which we can make the following observations.
First, max- and min-pooled vectors consistently outperform mean-pooled vectors when all three representations are compared with Pearson correlation. We hypothesise that this is in part because max- and min-pooling remove the outliers (to which Pearson's r is very sensitive) from at least one tail of the distribution, whereas mean-pooled vectors have outliers in both tails. This outlier-removing property, however, cannot be taken as the sole explanation behind the excellent performance of max-pooled vectors, as max-pooling still tends to outperform mean-pooling when both are compared with correlations that are robust to outliers, as well as on word vectors that have very few outliers to begin with (e.g. word2vec).
In addition, the strong performance of rank correlation coefficients (Spearman's ρ and Kendall's τ) comes solely from their robustness to outliers, as clipping (winsorizing) the top and bottom 5% of the values and then proceeding with Pearson's r closes the gap almost completely. Consistently, on vectors with few outliers (word2vec), Pearson's r achieves the same performance as rank correlations even without winsorization. However, unlike outliers, positive (negative) skew of max- (min-) pooled vectors does not seem to hurt Pearson's r on STS tasks.
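The winsorization step can be sketched as follows. This is one plausible reading of "clipping the top and bottom 5% of the values": each vector is clipped at its own 5th and 95th percentiles before Pearson's r is computed. The function names, the percentile-based clipping rule and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def winsorize(v, pct=5.0):
    """Clip values below the pct-th and above the (100 - pct)-th percentile."""
    low, high = np.percentile(v, [pct, 100.0 - pct])
    return np.clip(v, low, high)

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def winsorized_pearson(x, y, pct=5.0):
    """Pearson's r computed after winsorizing each vector separately."""
    return pearson(winsorize(x, pct), winsorize(y, pct))

# Synthetic correlated vectors standing in for pooled sentence vectors.
rng = np.random.default_rng(0)
x = rng.standard_normal(300)
y = 0.6 * x + 0.8 * rng.standard_normal(300)

# The same vectors with one heavy outlier planted in a single dimension.
x_out, y_out = x.copy(), y.copy()
x_out[0], y_out[0] = 30.0, -30.0

print("Pearson, clean data:        ", pearson(x, y))
print("Pearson, with outlier:      ", pearson(x_out, y_out))
print("Winsorized r, with outlier: ", winsorized_pearson(x_out, y_out))
```

On this toy data the winsorized coefficient stays close to the value obtained on the clean vectors, while plain Pearson's r is dominated by the single outlier.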
Next, we conduct evaluations of the methods proposed in this work alongside other deep learning and set-based similarity measures for STS from the literature. The methods we compare are as follows:
• Deep representation approaches: BoW with ELMo embeddings (Peters et al., 2018), Skip-Thought (Kiros et al., 2015), InferSent (Conneau et al., 2017), Universal Sentence Encoder both DAN and Transformer (Cer et al., 2018), STN multitask embeddings (Subramanian et al., 2018), and BERT 12- and 24-layer models (Devlin et al., 2018).
• Proposed set-based approaches: max-pooled word vectors with Spearman correlation, CKA with linear kernel (also known as the RV coefficient), CKA with Gaussian kernel (median estimation for $\sigma^2$), and CKA with distance kernel (distance correlation).
Note that for BERT we evaluated all pooling strategies available in bert-as-service (Xiao, 2018) applied to either the last or second-to-last layers and report results for the best-performing combination, which was mean-pooling on the last layer for both model sizes. Our results are presented in Table 1. We can clearly see that deep learning-based methods do not shine on STS tasks, while simple compositions of word vectors can perform extremely well, especially when an appropriate correlation coefficient is used as the similarity measure. Indeed, the performance of max-pooled vectors with Spearman correlation approaches or exceeds that of more expensive or offline methods like that of Arora et al. (2017), which performs PCA computations on the entire test set. Additionally, while multivariate correlation methods such as CKA are more computationally expensive than pooling-based approaches (see Table 2), they can provide a performance boost on some tasks, making the cost worth it depending on the application. Finally, we conducted an exploratory error analysis and found that many errors are due to the well-known inherent weaknesses of word embeddings. For example, the proposed approaches heavily overestimate similarity when two sentences contain antonyms or when one sentence is the negation of the other. We illustrate these and other cases in the Appendix.

Conclusion
In this work we investigate the application of statistical correlation coefficients to sets of word vectors as a method for computing semantic textual similarity (STS). This can be done either by pooling these word vectors and computing univariate correlations between the resulting representations, or by applying multivariate correlation coefficients to the sets of vectors directly. We provide further empirical evidence that outliers in word vector distributions disrupt the performance of set-based similarity metrics, as previously shown (Zhelezniak et al., 2019a). We also show working methods for solving or avoiding the issue through vector pooling operations, robust correlations or winsorization. In addition, we found that pooling operations in conjunction with univariate correlation coefficients yield some of the strongest results on downstream STS tasks, while being computationally much more efficient than competing set-based methods. Our findings are supported by a combination of statistical analysis, practical examples and visualisations, and empirical evaluation on standard benchmark datasets.
Both proposed families of approaches serve as strong baselines for future research into STS, as well as useful algorithms for the practitioner, being efficient and simple to implement.
We believe our findings speak to the efficacy of the statistical perspective on word embeddings, which we hope will encourage others to explore further implications of not only this particular framework, but also completely novel interpretations of textual representations.
While the proposed approaches enjoy strong performance on the STS benchmarks relative to the competing methods, the Pearson correlations between gold and system scores remain consistently below 0.9 in all subtasks. It would be extremely useful to establish which similarities are not captured very well by these approaches, at least as judged by humans on the 0 to 5 scale established in Agirre et al. (2012). For concreteness, we limit our exposition to MaxPool Spearman, noting that similar conclusions hold for CKA-based methods too.
First, we linearly transform the system scores into the range [0, 5], thus making them comparable to gold scores while preserving Pearson correlation. Then, in each subtask we select 5 sentence pairs with the largest absolute difference between the gold and the system score. After that, we manually examine the obtained dataset, focusing predominantly on shorter sentences, where the errors are often obvious and easy to explain. Even under these restrictions, we can readily distinguish between 5 different types of errors, summarised in Table 3. On the one hand, the system heavily underestimates the similarity score when two sentences use completely different vocabulary yet have identical meaning (Type I). On the other hand, it tends to overestimate the similarity when the sentences use very related or even the same words but have different meaning (Types II & III). The similarity is also overestimated when two sentences contain antonyms, or when one sentence is a negation of the other (Types IV & V respectively). A lot of these flaws can be traced back to the well-known weaknesses of word embeddings and the distributional hypothesis, such as mixing together semantic similarity and conceptual relatedness (Hill et al., 2015; Mrkšić et al., 2016), failure to distinguish synonyms from antonyms (Mohammad et al., 2008; Mrkšić et al., 2016) and problems with negation. We hope that any counter-measures to these weaknesses will also improve the proposed sentence-level systems.
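The selection procedure above can be sketched as follows. The text only states that the transform is linear; min-max scaling is assumed here as one such choice, and the function names and toy scores are hypothetical. The sign convention (gold minus system) follows the differences reported in Table 3.

```python
import numpy as np

def rescale_to_range(scores, low=0.0, high=5.0):
    """Linearly map scores onto [low, high] (min-max scaling, an assumption).

    Any affine map with positive slope leaves Pearson correlation unchanged.
    """
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        raise ValueError("all system scores are identical; cannot rescale")
    return low + (high - low) * (scores - scores.min()) / span

def largest_errors(gold, system, n=5):
    """Indices of the n pairs with the largest |gold - rescaled system|."""
    diff = np.asarray(gold, dtype=float) - rescale_to_range(system)
    order = np.argsort(-np.abs(diff))  # descending absolute difference
    return order[:n], diff

# Hypothetical gold and system scores for six sentence pairs.
gold = [5.0, 1.0, 3.0, 0.0, 4.0, 2.0]
system = [0.9, 0.8, 0.5, 0.1, 0.7, 0.3]

worst, diff = largest_errors(gold, system, n=2)
for i in worst:
    print(f"pair {i}: gold={gold[i]}, difference={diff[i]:+.3f}")
```

A negative difference marks a pair whose similarity the system overestimates, and a positive one a pair it underestimates.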

B Significance Analysis
Following the procedure described in Zhelezniak et al. (2019b), we construct 95% BCa confidence intervals for the delta in performance between two systems. The key results are as follows. MaxPool Spearman overall statistically outperforms DynaMax (Zhelezniak et al., 2019b) when word vectors are highly non-normal (GloVe) and loses when word vectors seem mostly normal (word2vec), which is in line with our main discussion. Next, max-pooling outperforms mean-pooling on the majority of subtasks for all word vector models. Finally, MaxPool Spearman is overall comparable to CKA Gaussian, with the exception of word2vec, where CKA is slightly better.

Table 3: Error analysis for MaxPool Spearman. Each entry contains a sentence pair, the gold similarity score, the scaled system similarity score, and the difference between the two scores. Errors are categorised into 5 types. The system heavily underestimates the similarity score when two sentences use different vocabulary yet have identical meaning (Type I). Conversely, it overestimates the similarity when the sentences use very related or even the same words but have different meaning (Types II & III). The similarity is also overestimated when two sentences contain antonyms, or when one sentence is a negation of the other (Types IV & V respectively).
[truncated row] it's a good idea to do both. | 1 | 3.9 | -2.9

Table 6: MaxPool Spearman vs. CKA Gaussian:

GloVe
Pearson correlations between human sentence similarity scores and generated scores. Values in bold represent the best result for a subtask given a set of word vectors, based on a 95% BCa confidence interval (Efron, 1987) on the differences between the two correlations. In cases of no significant difference, both values are in bold.

Figure 3: Histograms for word embeddings of the word "cats" and pooled representations of the embeddings for the words in the sentence "I like cats because they are very cute animals".

Table 1: Mean Pearson correlation on STS tasks for Deep Learning and Set-based methods using fastText.

Table 2: Computational complexity of some of the set-based STS methods discussed in this paper. Here n is the sentence length and d is the dimensionality of the word embeddings.
Type | Sentence 1 | Sentence 2 | Gold | System | Difference
[truncated] | black, and white cat looking at the camera. | a black and white dog looking at the camera. | 1 | 3.95 | -2.95
III Same keywords but different meaning | why do you need to peel peaches to can them? | how to peel peaches? | 1 | 4.57 | -3.57
[truncated] | what does it mean to write a song in a certain key? | is it possible to write a song without a key? | [truncated]