Estimating Mutual Information Between Dense Word Embeddings

Word embedding-based similarity measures are currently among the top-performing methods on unsupervised semantic textual similarity (STS) tasks. Recent work has increasingly adopted a statistical view of these embeddings, with some of the top approaches being essentially various correlations (including the famous cosine similarity). Another excellent candidate for a similarity measure is mutual information (MI), which can capture arbitrary dependencies between the variables and has a simple and intuitive expression. Unfortunately, its use in the context of dense word embeddings has so far been avoided due to the difficulties of estimating MI for continuous data. In this work we survey the vast literature on estimating MI in such cases, single out the most promising methods, and arrive at a simple and elegant similarity measure for word embeddings. We show that mutual information is a viable alternative to correlations, gives an excellent signal that correlates well with human judgements of similarity, and rivals existing state-of-the-art unsupervised methods.

One prominent alternative to those correlation-based approaches is mutual information (MI), which is of great importance in information theory and statistics. In some sense, mutual information is an excellent candidate for a similarity measure between word embeddings, as it can capture arbitrary dependencies between the variables and has a simple and intuitive expression. Unfortunately, its use in the context of continuous dense word representations has so far been avoided due to the difficulties of estimating MI for continuous random variables (the joint and marginal densities are not known in practice).
In this work we take the first steps towards the adoption of MI as a measure of semantic similarity between dense word embeddings. We begin by discussing how MI can be applied for this purpose in principle. Next, we carefully summarise the vast literature on the estimation of MI for continuous random variables and identify the approaches most suitable for our use case. Our chief goal is to find estimators that yield elegant, almost closed-form expressions for the resulting similarity measure, as opposed to complicated estimation procedures. Finally, we show that such estimators of mutual information give an excellent signal that correlates very well with human judgements and comfortably rivals existing state-of-the-art unsupervised STS approaches.

Background: Statistical Approaches to Word Embeddings
Suppose we are given a word embedding matrix $W \in \mathbb{R}^{N \times D}$, where $N$ is the vocabulary size and $D$ is the embedding dimension (commonly $D = 300$). Ultimately, the matrix $W$ is simply a table of numbers and, just like any dataset, it is subject to statistical analysis. There are essentially two ways we can proceed: we can either view $W$ as $N$ observations from $D$ random variables, or we can instead consider $W^{\top}$ and view it as $D$ observations from $N$ random variables. The first approach allows us to study 'global' properties of the word embedding space (e.g. via PCA, clustering, etc.) and defines 'global' similarity structures, such as the Mahalanobis distance, the Fisher kernel (Lev et al., 2015), etc.
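To make the 'global' view concrete, here is a minimal sketch in Python of one such global similarity structure, the Mahalanobis distance. This is an illustration only: the randomly generated matrix `W` is a hypothetical stand-in for a real embedding table.

```python
import numpy as np

# Hypothetical stand-in for a real embedding table W (N words, D dims).
rng = np.random.default_rng(0)
N, D = 10000, 300
W = rng.standard_normal((N, D))

# 'Global' view: the rows of W are N observations of D random variables.
cov = np.cov(W, rowvar=False)      # D x D covariance of the embedding space
cov_inv = np.linalg.pinv(cov)      # pseudo-inverse for numerical stability

def mahalanobis(u: np.ndarray, v: np.ndarray) -> float:
    """Mahalanobis distance under the global embedding covariance."""
    diff = u - v
    return float(np.sqrt(diff @ cov_inv @ diff))

dist = mahalanobis(W[0], W[1])
```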
In the second approach we study the distribution $P(W_1, W_2, \ldots, W_N)$, where a word embedding $\mathbf{w}_i$ is a sample of $D$ ($= 300$) observations from some scalar random variable $W_i$ corresponding to the word $w_i$ (Zhelezniak et al., 2019a,c). The 'local' similarity between two words $w_i$ and $w_j$ is then encoded in the dependencies between the corresponding random variables $W_i, W_j$. Since the distribution $P(W_i, W_j)$ is unknown, we estimate these dependencies based on the sample $\mathbf{w}_i, \mathbf{w}_j$. Certain dependencies can be captured by the Pearson, Spearman and Kendall correlation coefficients between word embeddings $\rho(\mathbf{w}_i, \mathbf{w}_j)$, where the choice of coefficient depends on the statistics of each word embedding model (Zhelezniak et al., 2019a).
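A minimal sketch of this 'local' view, assuming scipy and two hypothetical 300-dimensional embeddings; each coefficient treats the pair of embeddings as paired samples of $D = 300$ observations:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical embeddings for two words w_i, w_j (D = 300 observations).
rng = np.random.default_rng(0)
w_i = rng.standard_normal(300)
w_j = 0.5 * w_i + rng.standard_normal(300)   # make the pair dependent

# Dependence between the scalar variables W_i, W_j, estimated from the
# paired samples (w_i, w_j).
r, _ = pearsonr(w_i, w_j)
rho, _ = spearmanr(w_i, w_j)
tau, _ = kendalltau(w_i, w_j)
```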
Conveniently, correlations can also be used to measure semantic similarity between two sets of words (e.g. phrases and sentences) if one considers the correlations between the random vectors $\mathbf{X} = (X_1, X_2, \ldots, X_{l_x})$ and $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_{l_y})$, where the scalar random variables $X_i$ correspond to the words in the first sentence and $Y_j$ to the words in the second sentence. This, for example, can be done by first pooling (e.g. mean- or max-pooling) the random vectors into scalar variables $X_{\text{pool}}$ and $Y_{\text{pool}}$ and then estimating univariate correlations $\mathrm{corr}(X_{\text{pool}}, Y_{\text{pool}})$ as before. Alternatively, we can measure correlations between random vectors directly using norms of cross-covariance matrices/operators (e.g. the Hilbert-Schmidt independence criterion (Gretton et al., 2005)). Both approaches are known to give excellent results on standard STS benchmarks (Zhelezniak et al., 2019c). A viable alternative to correlations is mutual information (MI), which can detect any kind of dependence between random variables, but which has so far not been explored for this problem.
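The pooled variant looks roughly as follows (again a sketch, with random matrices standing in for the embeddings of two real sentences):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical sentence embeddings: l_x = 5 and l_y = 7 words, D = 300.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 300))    # first sentence
Y = rng.standard_normal((7, 300))    # second sentence

# Pool the random vectors into scalar variables X_pool, Y_pool and
# estimate a univariate correlation between them as before.
x_pool = X.max(axis=0)               # max-pooling (mean-pooling also common)
y_pool = Y.max(axis=0)
similarity, _ = pearsonr(x_pool, y_pool)
```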

Mutual Information between Dense Word Embeddings
We operate within the previous setting, where we consider two sentences $x = x_1 x_2 \ldots x_{l_x}$ and $y = y_1 y_2 \ldots y_{l_y}$. Our goal now is to estimate the mutual information between the corresponding random vectors $\mathbf{X} = (X_1, X_2, \ldots, X_{l_x})$ and $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_{l_y})$:

$$I(\mathbf{X}; \mathbf{Y}) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p_{XY}(\mathbf{x}, \mathbf{y}) \log \frac{p_{XY}(\mathbf{x}, \mathbf{y})}{p_X(\mathbf{x})\, p_Y(\mathbf{y})} \, d\mathbf{x}\, d\mathbf{y}, \qquad (1)$$

where $p_{XY}(\mathbf{x}, \mathbf{y})$ is the joint density of $\mathbf{X}$ and $\mathbf{Y}$, and $p_X(\mathbf{x}) = \int_{\mathcal{Y}} p_{XY}(\mathbf{x}, \mathbf{y})\, d\mathbf{y}$ and $p_Y(\mathbf{y}) = \int_{\mathcal{X}} p_{XY}(\mathbf{x}, \mathbf{y})\, d\mathbf{x}$ are the marginal densities. Unfortunately, these theoretical quantities are not available to us and we must somehow estimate $I(\mathbf{X}; \mathbf{Y})$ directly from the word embeddings $X = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(l_x)})$ and $Y = (\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(l_y)})$. Luckily, there is a vast literature on how to estimate mutual information between continuous random variables from a sample.

The first class of methods partitions the supports $\mathcal{X}, \mathcal{Y}$ into a finite number of bins of equal or unequal (adaptive) size and estimates $I(\mathbf{X}; \mathbf{Y})$ from the discrete counts in each bin (Moddemeijer, 1989; Fraser and Swinney, 1986; Darbellay and Vajda, 1999; Reshef et al., 2011; Ince et al., 2016). While such methods are easy to understand conceptually, they might suffer from the curse of dimensionality (especially when sentences are long) and in some sense violate our desire for an elegant closed-form similarity measure. The next class of methods constructs kernel density estimates (KDE) and then numerically integrates the approximate densities to obtain MI (Moon et al., 1995; Steuer et al., 2002). These methods might require a careful choice of kernels and bandwidth parameters and also violate our simplicity requirement. The third class of methods, which has recently gained popularity in the deep learning community, is based on neural-network-based estimation of various bounds on mutual information (e.g. by training a critic to estimate the density ratio in (1)). However, in our case the sample size (the embedding dimension, e.g. $D = 300$) and the dimensionality are not too large (at least for short phrases and sentences), and thus training a separate neural network for a simple similarity computation is hardly justified. This leaves us with the last class of methods, which estimates mutual information from $k$-nearest-neighbour statistics (Kraskov et al., 2004; Ver Steeg and Galstyan, 2013; Ver Steeg, 2014; Ross, 2014; Gao et al., 2015, 2018). These approaches are not without problems (Gao et al., 2015) and inherit the weaknesses of kNN in large dimensions, but they are very simple to implement. In particular, we focus on the Kraskov-Stögbauer-Grassberger (KSG) estimator (Kraskov et al., 2004), which admits a particularly elegant expression for the resulting similarity measure.
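For intuition only (this is not the estimator we adopt), a naive plug-in estimator from the first, binning-based class can be sketched in a few lines for a pair of scalar samples:

```python
import numpy as np

def binned_mi(x: np.ndarray, y: np.ndarray, bins: int = 16) -> float:
    """Naive plug-in MI estimate for two scalar samples via equal-width
    histogram binning (the first class of estimators discussed above)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)     # marginal over x-bins
    py = pxy.sum(axis=0, keepdims=True)     # marginal over y-bins
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
x = rng.standard_normal(300)
y = 0.7 * x + rng.standard_normal(300)      # a dependent pair
mi = binned_mi(x, y)
```

Even in this univariate case the estimate is sensitive to the number of bins; in higher dimensions the bin counts become hopelessly sparse, which is exactly the curse of dimensionality mentioned above.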

The KSG Similarity Measure
It can be verified that the mutual information can be written as

$$I(\mathbf{X}; \mathbf{Y}) = H(\mathbf{X}) + H(\mathbf{Y}) - H(\mathbf{X}, \mathbf{Y}),$$

i.e. the difference between the sum of the marginal entropies and the joint entropy. Thus, in order to estimate MI, it is sufficient to be able to estimate the various entropies in the above equation. In their seminal work, Kozachenko and Leonenko (1987) show how to estimate such differential entropies from nearest-neighbour statistics. Concretely, these methods approximate the log-density at a point by a uniform density over a norm ball (e.g. in the Euclidean or Chebyshev norm) containing its $k$ nearest neighbours. Kraskov et al. (2004) modify this idea to construct their famous KSG estimator of mutual information, given by

$$\hat{I}(\mathbf{X}; \mathbf{Y}) = \psi(k) + \psi(D) - \frac{1}{D} \sum_{i=1}^{D} \left[ \psi\big(n_x^{(i)} + 1\big) + \psi\big(n_y^{(i)} + 1\big) \right],$$

where $\psi$ is the digamma function, $D$ is the embedding dimension (which here plays the role of the sample size), $k$ is the number of nearest neighbours, and $n_x^{(i)}$ and $n_y^{(i)}$ count the samples whose marginal distances from the $i$-th sample are strictly smaller than the Chebyshev distance to its $k$-th nearest neighbour in the joint space. The full procedure is summarised in Algorithm 1.

[Algorithm 1: Kraskov-Stögbauer-Grassberger (KSG) Similarity Measure. Requires: word embeddings for the first sentence $X \in \mathbb{R}^{l_x \times D}$, word embeddings for the second sentence $Y \in \mathbb{R}^{l_y \times D}$, and the number of nearest neighbours $k$.]
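The following sketch spells out the resulting similarity measure (assuming scipy; this is an illustrative implementation, not the reference one). The $D$ embedding dimensions serve as the joint sample, Chebyshev balls around each sample supply the neighbour counts, and the digamma terms combine them into the KSG estimate:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_similarity(X: np.ndarray, Y: np.ndarray, k: int = 3) -> float:
    """KSG estimate of I(X; Y) between two sentences.

    X: (l_x, D) word embeddings of the first sentence.
    Y: (l_y, D) word embeddings of the second sentence.
    Each of the D dimensions is one joint observation of the random
    vectors X and Y, so the sample size is D.
    """
    x, y = X.T, Y.T                      # (D, l_x), (D, l_y) observations
    D = x.shape[0]
    joint = np.hstack([x, y])

    # Chebyshev (max) norm, so that a joint distance is the max of the
    # marginal distances, as the KSG construction requires.
    dists = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    eps = np.nextafter(dists, 0)         # shrink slightly: strict inequality

    # n_x, n_y: samples strictly closer than eps_i in each marginal space
    # (subtract 1 because each point is returned as its own neighbour).
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    nx = np.array([len(tree_x.query_ball_point(x[i], eps[i], p=np.inf)) - 1
                   for i in range(D)])
    ny = np.array([len(tree_y.query_ball_point(y[i], eps[i], p=np.inf)) - 1
                   for i in range(D)])

    return digamma(k) + digamma(D) - float(np.mean(digamma(nx + 1)
                                                   + digamma(ny + 1)))
```

In an STS pipeline, `ksg_similarity(X, Y, k=3)` then simply plays the role that cosine similarity plays in more conventional approaches.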

Experiments
We now explore the empirical performance of the KSG similarity measure on a standard suite of Semantic Textual Similarity (STS) benchmarks (Agirre et al., 2012, 2013, 2014, 2015, 2016) and report the Spearman correlation between the system and human scores. The number of nearest neighbours for KSG that is known to work well in practice on a variety of datasets is $k = 3$ (Kraskov et al., 2004; Khan et al., 2007); this value seems to strike a good balance between the bias and variance of the estimator. We also run experiments with $k = 10$ to show that KSG is not very sensitive to this hyperparameter, at least in our setting. As an interesting addition, we also run KSG ($k = 10$) on max-pooled scalar random variables (Max-Pool+KSG 10). We compare KSG against existing approaches from the literature, including the Universal Sentence Encoder; the results are presented in Table 1. In summary, we can see that similarity measures based on mutual information (KSG) perform on par with the top correlation-based measures and other leading methods from the literature. Moreover, KSG between max-pooled variables (Max-Pool+KSG) is faster and performs only slightly worse than the multivariate KSG.
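As a sketch of the evaluation loop (reusing `ksg_similarity` from the previous section; the embedded pairs and gold judgements here are random stand-ins for a real STS dataset):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Random stand-ins for embedded sentence pairs and gold human judgements.
pairs = [(rng.standard_normal((5, 300)), rng.standard_normal((7, 300)))
         for _ in range(20)]
gold = rng.uniform(0.0, 5.0, size=20)

# Multivariate KSG between the two sentences' random vectors.
system = [ksg_similarity(X, Y, k=3) for X, Y in pairs]

# Max-Pool+KSG: pool each sentence into a single scalar variable first.
system_pooled = [ksg_similarity(X.max(axis=0, keepdims=True),
                                Y.max(axis=0, keepdims=True), k=10)
                 for X, Y in pairs]

rho, _ = spearmanr(system, gold)   # the reported evaluation metric
```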

Conclusion
In this work we explored how to apply mutual information (MI) as a semantic similarity measure for continuous dense word embeddings. We summarised the vast literature on estimating MI for continuous random variables from a sample and singled out the simple and elegant KSG estimator, which is based on elementary nearest-neighbour statistics. We showed empirically that this estimator, and mutual information in general, can be an excellent candidate for a similarity measure between dense word embeddings.